2026-03-09
Journey of Scraping – Part 1
Scraping real estate websites
- Python
- Playwright
- Scraping
- English
- Journey of scraping
This is the first part of a series dedicated to web scrapers. I’ll cover the most common web scraping issues and challenges. Note: I’m still new to Python and web scraping, so this series doubles as my learning log. You can find the full code in this GitHub repository.
Goals of the scraper
- Extract real estate data from three websites
- Save extracted data to Supabase
- Display extracted data in a simple frontend application
- Create a cron job on a VPS to run the scraper each day and send a notification to Telegram
Extracting data
For extracting data I chose three websites. Let’s call them A, B, and C. A is the easy one – no Cloudflare, just go for it. B and C use Cloudflare protection; this is where the fun begins.
Scraping a website without bot protection
I defined the links that I wanted to scrape from website A in a simple array and ran a for loop which extracted and accumulated the data.
```python
for link in WEBSITE_A_LINKS:
    result = get_listing(link)
    all_listings.extend(result)
```
For website A a simple GET request using the requests package was enough. I didn’t encounter any bot protection.
```python
def get_listing(link):
    response = requests.get(link, headers=HEADERS)
    ...
```
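The `HEADERS` constant isn’t shown in the snippet above. A minimal version (the exact values from the original project aren’t shown, so these are illustrative) mainly needs to replace the default `python-requests` user agent:

```python
# Hypothetical HEADERS for get_listing; values are illustrative,
# not the ones from the original repository.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "lt-LT,lt;q=0.9,en;q=0.8",
}
```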
For HTML parsing I used BeautifulSoup:

```python
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find all elements with a specific class
listings = soup.find_all(class_=SELECTORS['row'])
```
After getting all elements I extracted and cleaned the data and that’s it.
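The “extract and clean” step is mostly string cleanup. A sketch of what it might look like for one listing row (the selectors and field names here are made up; the real ones live in the repository):

```python
import re

def clean_price(raw: str):
    """Turn a scraped price string like '125 000 €' into a float."""
    digits = re.sub(r"[^\d]", "", raw)
    return float(digits) if digits else None

def parse_listing(element) -> dict:
    # 'element' is a BeautifulSoup tag for one listing row.
    # The selectors below are hypothetical.
    title = element.find("a").get_text(strip=True)
    price = clean_price(element.find(class_="price").get_text())
    return {"title": title, "price": price}
```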
Scraping a website with bot protection – Cloudflare
Both websites B and C used Cloudflare so if you tried to open them using the requests package you were immediately greeted with a bot-check page. A popular replacement for the requests package in this case is Playwright.
Instead of a simple GET request, for websites B and C you first have to create a function that handles opening and closing the browser with stealth settings to avoid detection.
What are stealth settings? They are browser and context options that make automation look like a regular user session.

- `user_agent` is a string of text that identifies the client. I used `"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"`. For the `requests` package it defaults to something like `python-requests/2.31.0`, which makes it easily distinguishable as a bot.
- `headless` defines whether an actual browser window is launched or actions run in the background. I used `headless=False`, since headless mode makes it easier to detect you as a bot.
- Launch arguments are a bunch of flags that control browser behavior:

```python
args=[
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",  # Prevents memory crashes in small VPS containers
    "--disable-blink-features=AutomationControlled"  # Extra layer of stealth
]
```

- Context-level protection mimics a real user’s browser environment. I used a realistic viewport and locale:

```python
viewport={'width': 1920, 'height': 1080},
locale="lt-LT"
```

- Human-like interaction mimics real user interaction with the browser. I implemented randomized delays between actions and simulated mouse movements.
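The delay and mouse-movement helpers used later (`small_delay`, `human_delay`, `simulate_user`) can be as simple as randomized waits plus a few mouse moves. A sketch, assuming a Playwright `page` object (the ranges are my own guesses, not the project’s values):

```python
import random

def small_delay(page, lo=0.3, hi=1.0):
    # Playwright's wait_for_timeout takes milliseconds.
    page.wait_for_timeout(random.uniform(lo, hi) * 1000)

def human_delay(page, lo=1.5, hi=4.0):
    page.wait_for_timeout(random.uniform(lo, hi) * 1000)

def simulate_user(page, moves=3):
    # Move the mouse to a few random points inside the viewport,
    # with intermediate steps so the movement isn't a teleport.
    for _ in range(moves):
        x = random.randint(0, 1920)
        y = random.randint(0, 1080)
        page.mouse.move(x, y, steps=random.randint(5, 15))
        small_delay(page)
```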
Having these in place you’re ready to scrape websites protected by Cloudflare.
Here are the final Playwright browser settings I used. Because `get_stealth_page` yields the page, it needs the `@contextmanager` decorator to work with a `with` statement:

```python
from contextlib import contextmanager

from playwright.sync_api import sync_playwright
from playwright_stealth import Stealth

@contextmanager
def get_stealth_page():
    with Stealth().use_sync(sync_playwright()) as p:
        # 1. Launch the browser with 'safe' arguments
        browser = p.chromium.launch(
            headless=False,
            args=[
                "--no-sandbox",
                "--disable-setuid-sandbox",
                "--disable-dev-shm-usage",
                "--disable-blink-features=AutomationControlled"
            ],
        )

        # 2. Create a 'Context'
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            viewport={'width': 1920, 'height': 1080},
            locale="lt-LT",
        )

        # 3. Create the page and apply stealth
        page = context.new_page()
        try:
            # This 'yields' the page back to your main code
            yield page
        finally:
            # This ensures the browser closes even if your scraper crashes
            browser.close()
```
In this case, iterating through pages looked like this:

- Get a stealth page
- Iterate through pages with simulated delays and mouse movements
- Extract and parse the data. The parsing flow is exactly the same as for website A

Here’s how it looks:
```python
def get_listings(
    links,
    extract_listings,
    get_next_page,
    wait_until_ready,
    base_url,
):
    results = []
    with get_stealth_page() as page:
        for link in links:
            logger.info("Scraping search link: %s", link)
            for page_url in iterate_pages(page, link, base_url, get_next_page):
                try:
                    wait_until_ready(page)
                    small_delay(page)
                    data = extract_listings(page)
                    if not data:
                        logger.debug("No data extracted from %s", page_url)
                        continue
                    results.extend(data)
                except Exception:
                    logger.exception("Extractor failed at %s", page_url)
    logger.info("Collected %d listings", len(results))
    return results
```
```python
def iterate_pages(page, start_url, base_url, get_next_page):
    url = start_url
    while url:
        logger.debug("Visiting %s", url)
        simulate_user(page)
        human_delay(page)
        page.goto(url, wait_until="domcontentloaded")
        yield url
        try:
            next_url = get_next_page(page, base_url)
        except Exception:
            logger.exception("Paginator failed at %s", url)
            break
        if not next_url:
            break
        url = next_url
```
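`get_next_page` is passed in as a callback because every site paginates differently. For a site that exposes a `rel="next"` link, it could look like this (the selector is an assumption; the real per-site implementations are in the repository):

```python
from urllib.parse import urljoin

def get_next_page(page, base_url):
    """Return the absolute URL of the next results page, or None."""
    # Hypothetical selector; real sites may use different markup.
    link = page.query_selector("a[rel='next']")
    if link is None:
        return None
    href = link.get_attribute("href")
    return urljoin(base_url, href) if href else None
```

Returning `None` is what makes the `while url:` loop in `iterate_pages` terminate on the last page.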
What’s next
This first part was about getting the scraper working locally and dealing with Cloudflare. In part 2 I’ll cover deploying it to a VPS, setting up a cron job, and sending notifications (Telegram) so the scraper can run on its own.