1. Introduction

Web scraping is the process of extracting data from websites. In Python, we commonly use libraries like requests (to fetch web pages) and BeautifulSoup (to parse and extract information from HTML).
2. Installing Required Libraries
Before scraping, install the libraries:
pip install requests beautifulsoup4
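As an optional sanity check, you can confirm that both packages import and print their versions (both libraries expose a __version__ attribute):
python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"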
3. Basic Web Scraping Example
import requests
from bs4 import BeautifulSoup
# Step 1: Fetch a web page
url = "https://quotes.toscrape.com/"
response = requests.get(url)
# Step 2: Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")
# Step 3: Extract quotes and authors
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")
for q, a in zip(quotes, authors):
    print(f"{q.get_text()} - {a.get_text()}")
Output Example:
“The world as we have created it is a process of our thinking.” - Albert Einstein
“It is our choices, Harry, that show what we truly are.” - J.K. Rowling
...
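The example above assumes the request succeeds. In practice it is worth failing fast on HTTP errors before parsing; a minimal variant of Step 1 using requests' built-in raise_for_status() and a timeout:
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/"
response = requests.get(url, timeout=10)  # don't hang forever on a slow server
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
soup = BeautifulSoup(response.text, "html.parser")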
4. Commonly Used BeautifulSoup Methods
| Method | Description |
| --- | --- |
| soup.find(tag) | Finds the first occurrence of a tag |
| soup.find_all(tag) | Finds all occurrences of a tag |
| element.get_text() | Extracts the text content inside an element |
| element['attribute'] | Gets the value of an attribute (e.g., href) |
| soup.select("css_selector") | Finds elements using CSS selectors |
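A short sketch exercising each of these methods on the soup object from Section 3 (the CSS classes follow quotes.toscrape.com's markup):
# find: first matching element
first_quote = soup.find("span", class_="text")
print(first_quote.get_text())

# element['attribute']: read an attribute's value
first_link = soup.find("a")
print(first_link["href"])

# select: CSS selectors, e.g. every tag link inside a quote block
for tag in soup.select("div.quote a.tag"):
    print(tag.get_text())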
5. Extracting Links Example
links = soup.find_all("a")
for link in links:
    href = link.get("href")
    text = link.get_text(strip=True)
    print(f"Text: {text} -> Link: {href}")
6. Scraping with CSS Selectors
# Example: Get all quotes using CSS selectors
quotes = soup.select("span.text")
for q in quotes:
    print(q.text)
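CSS selectors also make it easy to keep related fields together. Instead of zipping two parallel lists as in Section 3, you can select each quote block and read its text and author from within it; a sketch using select_one (the classes again follow quotes.toscrape.com's markup):
for quote_div in soup.select("div.quote"):
    text = quote_div.select_one("span.text").get_text()
    author = quote_div.select_one("small.author").get_text()
    print(f"{text} - {author}")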
7. Handling Pagination (Multiple Pages)
page = 1
while True:
    url = f"https://quotes.toscrape.com/page/{page}/"
    response = requests.get(url)
    if "No quotes found!" in response.text:
        break
    soup = BeautifulSoup(response.text, "html.parser")
    quotes = soup.find_all("span", class_="text")
    for q in quotes:
        print(q.get_text())
    page += 1
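An alternative, sketched below, is to follow the site's own "Next" link instead of guessing page numbers, and to pause between requests (quotes.toscrape.com wraps the link in an li.next element; the one-second delay is an arbitrary polite default):
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/"
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for q in soup.select("span.text"):
        print(q.get_text())
    # Follow the "Next" pagination link if present; stop otherwise
    next_link = soup.select_one("li.next a")
    url = urljoin(url, next_link["href"]) if next_link else None
    time.sleep(1)  # be polite: pause between requests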
8. Best Practices for Web Scraping
- ✅ Always check the website’s robots.txt rules (see the sketch after this list).
- ✅ Avoid overloading servers (use delays with time.sleep).
- ✅ Consider using APIs if available instead of scraping.
- ✅ Be respectful and ethical when scraping.
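The robots.txt check from the first point can be automated with the standard library; a minimal sketch using urllib.robotparser (the user agent "*" stands for any crawler):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# can_fetch() reports whether the given user agent may crawl a URL
print(rp.can_fetch("*", "https://quotes.toscrape.com/page/1/"))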
9. Summary
- requests → Fetches web pages.
- BeautifulSoup → Parses and extracts data from HTML.
- Methods like .find(), .find_all(), .select(), and .get_text() help extract elements.
- Pagination lets you scrape multiple pages in a loop.
- Always follow best practices and respect website policies.