2026-03-09
Journey of Scraping – Part 1
Scraping real estate websites
- Python
- Playwright
- Scraping
- English
- Journey of scraping
This is the first part of a series dedicated to web scrapers. I’ll cover the most common web scraping issues and challenges. Note: I’m still new to Python and web scraping, so this series doubles as my learning log. You can find the full code in this GitHub repository.
Goals of the scraper
- Extract real estate data from three websites
- Save extracted data to Supabase
- Display extracted data in a simple frontend application
- Create a cron job on a VPS to run the scraper each day and send a notification to Telegram
Extracting data
For extracting data I chose three websites. Let’s call them A, B, and C. A is the easy one – no Cloudflare, just go for it. B and C use Cloudflare protection; this is where the fun begins.
Scraping a website without bot protection
I defined the links that I wanted to scrape from website A in a simple array and ran a for loop which extracted and accumulated the data.
```python
for link in WEBSITE_A_LINKS:
    result = get_listing(link)
    all_listings.extend(result)
```
For website A a simple GET request using the requests package was enough. I didn’t encounter any bot protection.
```python
def get_listing(link):
    response = requests.get(link, headers=HEADERS)
    ...
```
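The `HEADERS` constant isn’t shown in the snippet above. A minimal version (the exact values from the original project aren’t shown, so these are illustrative) mainly needs to replace the default `python-requests` user agent:

```python
# Hypothetical HEADERS for get_listing; values are illustrative,
# not the ones from the original repository.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "lt-LT,lt;q=0.9,en;q=0.8",
}
```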
For HTML parsing I used BeautifulSoup:

```python
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find all elements with a specific class
listings = soup.find_all(class_=SELECTORS['row'])
```
After getting all elements I extracted and cleaned the data and that’s it.
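The “extract and clean” step is mostly string cleanup. A sketch of what it might look like for one listing row (the selectors and field names here are made up; the real ones live in the repository):

```python
import re

def clean_price(raw: str):
    """Turn a scraped price string like '125 000 €' into a float."""
    digits = re.sub(r"[^\d]", "", raw)
    return float(digits) if digits else None

def parse_listing(element) -> dict:
    # 'element' is a BeautifulSoup tag for one listing row.
    # The selectors below are hypothetical.
    title = element.find("a").get_text(strip=True)
    price = clean_price(element.find(class_="price").get_text())
    return {"title": title, "price": price}
```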
Scraping a website with bot protection – Cloudflare
Both websites B and C used Cloudflare so if you tried to open them using the requests package you were immediately greeted with a bot-check page. A popular replacement for the requests package in this case is Playwright.
Instead of a simple GET request, for websites B and C you first have to create a function that handles opening and closing the browser with stealth settings to avoid detection.
What are stealth settings? They are browser and context options that make automation look like a regular user session.

- `user_agent` is a string of text that identifies the client. I used `"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"`. For the `requests` package it defaults to something like `python-requests/2.31.0`, which makes it easily distinguishable as a bot.
- `headless` defines whether an actual browser window is launched or actions run in the background. I used `headless=False`, since headless mode makes it easier to detect you as a bot.
- Launch arguments are a bunch of flags that control browser behavior:

```python
args=[
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",  # Prevents memory crashes in small VPS containers
    "--disable-blink-features=AutomationControlled"  # Extra layer of stealth
]
```

- Context-level protection mimics a real user’s browser environment. I used a realistic viewport and locale:

```python
viewport={'width': 1920, 'height': 1080},
locale="lt-LT"
```

- Human-like interaction mimics real user interaction with the browser. I implemented randomized delays between actions and simulated mouse movements.
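The delay and mouse-movement helpers used later (`small_delay`, `human_delay`, `simulate_user`) can be as simple as randomized waits plus a few mouse moves. A sketch, assuming a Playwright `page` object (the ranges are my own guesses, not the project’s values):

```python
import random

def small_delay(page, lo=0.3, hi=1.0):
    # Playwright's wait_for_timeout takes milliseconds.
    page.wait_for_timeout(random.uniform(lo, hi) * 1000)

def human_delay(page, lo=1.5, hi=4.0):
    page.wait_for_timeout(random.uniform(lo, hi) * 1000)

def simulate_user(page, moves=3):
    # Move the mouse to a few random points inside the viewport,
    # with intermediate steps so the movement isn't a teleport.
    for _ in range(moves):
        x = random.randint(0, 1920)
        y = random.randint(0, 1080)
        page.mouse.move(x, y, steps=random.randint(5, 15))
        small_delay(page)
```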
Having these in place you’re ready to scrape websites protected by Cloudflare.
Here are the final Playwright browser settings I used. Because `get_stealth_page` yields the page, it needs the `@contextmanager` decorator to work with a `with` statement:

```python
from contextlib import contextmanager

from playwright.sync_api import sync_playwright
from playwright_stealth import Stealth

@contextmanager
def get_stealth_page():
    with Stealth().use_sync(sync_playwright()) as p:
        # 1. Launch the browser with 'safe' arguments
        browser = p.chromium.launch(
            headless=False,
            args=[
                "--no-sandbox",
                "--disable-setuid-sandbox",
                "--disable-dev-shm-usage",
                "--disable-blink-features=AutomationControlled"
            ],
        )

        # 2. Create a 'Context'
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            viewport={'width': 1920, 'height': 1080},
            locale="lt-LT",
        )

        # 3. Create the page and apply stealth
        page = context.new_page()
        try:
            # This 'yields' the page back to your main code
            yield page
        finally:
            # This ensures the browser closes even if your scraper crashes
            browser.close()
```
In this case, iterating through pages looked like this:

- Get a stealth page
- Iterate through pages with simulated delays and mouse movements
- Extract and parse the data. The parsing flow is exactly the same as for website A

Here’s how it looks:
```python
def get_listings(
    links,
    extract_listings,
    get_next_page,
    wait_until_ready,
    base_url,
):
    results = []
    with get_stealth_page() as page:
        for link in links:
            logger.info("Scraping search link: %s", link)
            for page_url in iterate_pages(page, link, base_url, get_next_page):
                try:
                    wait_until_ready(page)
                    small_delay(page)
                    data = extract_listings(page)
                    if not data:
                        logger.debug("No data extracted from %s", page_url)
                        continue
                    results.extend(data)
                except Exception:
                    logger.exception("Extractor failed at %s", page_url)
    logger.info("Collected %d listings", len(results))
    return results
```
```python
def iterate_pages(page, start_url, base_url, get_next_page):
    url = start_url
    while url:
        logger.debug("Visiting %s", url)
        simulate_user(page)
        human_delay(page)
        page.goto(url, wait_until="domcontentloaded")
        yield url
        try:
            next_url = get_next_page(page, base_url)
        except Exception:
            logger.exception("Paginator failed at %s", url)
            break
        if not next_url:
            break
        url = next_url
```
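`get_next_page` is passed in as a callback because every site paginates differently. For a site that exposes a `rel="next"` link, it could look like this (the selector is an assumption; the real per-site implementations are in the repository):

```python
from urllib.parse import urljoin

def get_next_page(page, base_url):
    """Return the absolute URL of the next results page, or None."""
    # Hypothetical selector; real sites may use different markup.
    link = page.query_selector("a[rel='next']")
    if link is None:
        return None
    href = link.get_attribute("href")
    return urljoin(base_url, href) if href else None
```

Returning `None` is what makes the `while url:` loop in `iterate_pages` terminate on the last page.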
What’s next
This first part was about getting the scraper working locally and dealing with Cloudflare. In part 2 I’ll cover deploying it to a VPS, setting up a cron job, and sending notifications (Telegram) so the scraper can run on its own.