Mastering Dynamic Scraping

Introduction to Dynamic Web Scraping

Web scraping involves extracting data from websites, which is particularly challenging when the content is dynamic, meaning it is rendered or updated by JavaScript after the initial page load and its markup changes frequently. Our experts, Greg, Dario, and Diego, shared practical advice on overcoming these challenges using tools like Puppeteer, Selenium, and Playwright.

Greg: Based in San Francisco, Greg works at Andela and specializes in traditional software engineering and web scraping.
Dario: From Argentina, Dario is with mabl, focusing on low-code QA automation tools and contributing to PuppeteerSharp and the .NET version of Playwright.
Diego: Joining from Valencia, Spain, Diego is a lead in the Selenium project and an open-source lead at Sauce Labs, a cloud testing platform.

Key Discussion Points

1. Keeping Up with HTML Changes

Greg emphasized the importance of choosing reliable selectors. Avoid long chains of hard-coded selectors like div > div > p. Instead, use more stable selectors such as aria-label or text-based selectors. Even better, intercept API requests when possible, as these are less likely to change than the DOM.

Tools and Techniques:

Mozilla Readability: Converts complex HTML into a clean, readable format.
API Interception: Directly access data endpoints to bypass DOM changes.
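
To make the selector advice concrete, here is a minimal sketch that looks up an element by its aria-label instead of by its position in the tree. It uses only Python's standard html.parser so it runs without a browser; the class name AriaLabelText, the helper text_by_aria_label, and the sample markup are assumptions made for illustration, not something shown in the webinar.

```python
from html.parser import HTMLParser


class AriaLabelText(HTMLParser):
    """Collect the text inside every element carrying a given aria-label."""

    # Void elements never get a closing tag, so they must not affect depth.
    VOID = {"br", "img", "input", "hr", "meta", "link"}

    def __init__(self, label: str) -> None:
        super().__init__()
        self.label = label
        self.depth = 0  # greater than 0 while inside a matching element
        self.results: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("aria-label") == self.label:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def text_by_aria_label(html: str, label: str) -> list[str]:
    """Return the stripped text of each element labelled `label`."""
    parser = AriaLabelText(label)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# The same lookup survives a redesign that would break "div > div > p".
before = '<div><div><p aria-label="Price">$10</p></div></div>'
after = '<section><span aria-label="Price"><b>$10</b></span></section>'
print(text_by_aria_label(before, "Price"), text_by_aria_label(after, "Price"))
```

In a real scraper the equivalent lookup would usually be an attribute or accessibility-based selector in Playwright, Puppeteer, or Selenium; the point of the sketch is only that the query does not depend on how deeply the element is nested.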

2. Navigating Client-Side Routing and Single Page Applications (SPAs)

Dario discussed handling SPAs, which often use client-side routing, making data extraction tricky. Key points included:

Avoid Trusting Initial Loads: Use multiple checks to confirm that the page is fully loaded.
Check Context: Ensure you are on the correct step of multi-step processes by verifying titles or other stable elements.
Tools: Use Playwright or Puppeteer’s navigation and waiting functions effectively.
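
The "multiple checks" idea can be sketched as a function that inspects a snapshot of the page and reports every check that failed. This is a simplified illustration under stated assumptions: PageSnapshot is a hypothetical container for values your script would read from the live page (URL, title, document.readyState, and the identifiers of stable elements that are visible), and the check for a path-based route assumes the SPA does not use hash routing.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class PageSnapshot:
    """Values a script would read from the live page (hypothetical shape)."""
    url: str
    title: str
    ready_state: str            # value of document.readyState
    visible_markers: set[str]   # identifiers of stable elements currently shown


def failed_checks(snap: PageSnapshot, expected_path: str,
                  title_keyword: str, required_marker: str) -> list[str]:
    """Return a description of every check that did not pass."""
    failures = []
    if snap.ready_state != "complete":
        failures.append("document not fully loaded")
    if urlparse(snap.url).path != expected_path:
        failures.append("unexpected route")
    if title_keyword.lower() not in snap.title.lower():
        failures.append("unexpected title")
    if required_marker not in snap.visible_markers:
        failures.append("step marker missing")
    return failures


snap = PageSnapshot(
    url="https://example.com/checkout/payment",
    title="Payment - Example Shop",
    ready_state="complete",
    visible_markers={"payment-form"},
)
# An empty list means it is safe to extract data from this step.
print(failed_checks(snap, "/checkout/payment", "payment", "payment-form"))
```

Returning the full list of failures, rather than stopping at the first one, makes it easier to log why a run ended up on the wrong step.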

3. Accessing Asynchronously Loaded Data

Diego recommended using frameworks that handle synchronization automatically, such as Selenide for Java or WebdriverIO for JavaScript. These frameworks simplify waiting for elements to load by providing built-in waiting methods.

Tips:

Event-Driven Predicates: Use waitForFunction to wait for specific states in the DOM.
Request Interception: Capture and work with API responses directly when possible.
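
The predicate-based waiting that waitForFunction offers can be approximated in plain Python as a polling loop: call a predicate until it returns something truthy or a timeout expires. The sketch below is a generic stand-in, not the Puppeteer or Playwright implementation; the name wait_for and its default timings are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(predicate: Callable[[], Optional[T]],
             timeout: float = 5.0, interval: float = 0.1) -> T:
    """Poll `predicate` until it returns a truthy value, then return it.

    Raises TimeoutError if the value is still falsy once `timeout`
    seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Simulated asynchronous load: the rows appear on the third poll.
polls = []


def rows_loaded():
    polls.append(1)
    return ["row-1", "row-2"] if len(polls) >= 3 else None


print(wait_for(rows_loaded, timeout=1.0, interval=0.01))
```

Waiting on a condition that describes the data you need (for example, "the table has at least one row") tends to be more reliable than sleeping for a fixed number of seconds.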

4. Mimicking User Interactions for Lazy Loading

Greg provided techniques for dealing with lazy loading, where content loads as you interact with the page. Key strategies included:

Keyboard Interactions: Send Page Down key presses to scroll the page and trigger loading.
Capture Data Incrementally: Save data in chunks to avoid loss if a script fails.
Avoid User Interactions When Possible: Directly intercept API responses to bypass the need for user actions.
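
Two of these strategies combine naturally: read the paginated data directly and write each batch to disk as soon as it arrives. The sketch below assumes a hypothetical fetch_page callable that returns the items for one page (from an intercepted API response or from one round of scrolling), that every item has a unique "id" field, and that a JSON Lines file is an acceptable output format.

```python
import json
from pathlib import Path
from typing import Callable


def load_seen_ids(path: Path) -> set:
    """Read the ids already saved by a previous (possibly failed) run."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as fh:
        return {json.loads(line)["id"] for line in fh if line.strip()}


def capture_incrementally(fetch_page: Callable[[int], list],
                          path: Path, max_pages: int = 100) -> int:
    """Append each page of items to `path` immediately; return items written."""
    seen = load_seen_ids(path)
    written = 0
    for page_number in range(1, max_pages + 1):
        items = fetch_page(page_number)
        if not items:
            break  # an empty page is treated as the end of the data
        # Open, write, and close per page so a later crash loses nothing.
        with path.open("a", encoding="utf-8") as fh:
            for item in items:
                if item["id"] in seen:
                    continue  # lazy-loaded lists often repeat items
                fh.write(json.dumps(item) + "\n")
                seen.add(item["id"])
                written += 1
    return written
```

Because the saved ids are reloaded at the start, rerunning the script after a failure skips what is already on disk instead of duplicating it.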

5. Extracting Information from Shadow DOM Components

Dario explained how to handle Shadow DOM, which encapsulates parts of the web page, making them harder to scrape. Key points included:

Understanding Open vs. Closed Shadow DOM: Most tools can pierce open Shadow DOM but not closed.
Manual JavaScript Handling: Use the shadowRoot property to access elements within Shadow DOM manually.
Framework Support: Tools like Playwright and Puppeteer handle Shadow DOM effectively.
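
The open versus closed distinction can be illustrated with a toy model. In the browser, element.shadowRoot returns the root for an open shadow tree and null for a closed one; the Python sketch below mimics that rule on a hypothetical Node class and walks the tree recursively, which is roughly what a manual page-side script has to do. It is a model of the idea only and does not touch a real DOM.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """Toy stand-in for a DOM element (hypothetical, for illustration)."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)
    shadow_children: list["Node"] = field(default_factory=list)
    shadow_mode: Optional[str] = None  # "open", "closed", or None

    @property
    def shadow_root(self) -> Optional[list["Node"]]:
        """Mirror element.shadowRoot: only an open root is exposed."""
        return self.shadow_children if self.shadow_mode == "open" else None


def walk(node: Node) -> Iterator[Node]:
    """Yield every reachable node, descending into open shadow roots."""
    yield node
    if node.shadow_root is not None:
        for child in node.shadow_root:
            yield from walk(child)
    for child in node.children:
        yield from walk(child)


page = Node("body", children=[
    Node("price-card", shadow_mode="open",
         shadow_children=[Node("span", text="$10")]),
    Node("secret-card", shadow_mode="closed",
         shadow_children=[Node("span", text="hidden")]),
])
# Only the text inside the open shadow root is reachable.
print([n.text for n in walk(page) if n.text])
```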

6. Capturing Full Page Screenshots

Diego recommended using Firefox for full-page screenshots, as it has a native command for this purpose. Integrating Selenium with the Chrome DevTools Protocol is another effective method.

Tips:

Network Idle: Ensure all elements are loaded before capturing screenshots.
Use Built-in Methods: Built-in options, such as fullPage in Puppeteer's screenshot call, simplify the process.
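
A "network idle" condition is commonly defined as no requests in flight for a short quiet period. The sketch below shows that bookkeeping in isolation, assuming your automation tool can call a function when a request starts and when it finishes. The class name NetworkIdleTracker and the 0.5 second default are assumptions; timestamps are passed in explicitly so the logic can be checked without a browser.

```python
class NetworkIdleTracker:
    """Track in-flight requests and report when the network has gone quiet."""

    def __init__(self, quiet_period: float = 0.5) -> None:
        self.quiet_period = quiet_period
        self.in_flight = 0
        self.last_activity = 0.0

    def request_started(self, now: float) -> None:
        self.in_flight += 1
        self.last_activity = now

    def request_finished(self, now: float) -> None:
        # Guard against finish events that arrive without a matching start.
        self.in_flight = max(0, self.in_flight - 1)
        self.last_activity = now

    def is_idle(self, now: float) -> bool:
        """True when nothing is in flight and the quiet period has elapsed."""
        return (self.in_flight == 0
                and now - self.last_activity >= self.quiet_period)


tracker = NetworkIdleTracker()
tracker.request_started(now=0.0)
tracker.request_finished(now=0.2)
print(tracker.is_idle(now=0.3), tracker.is_idle(now=0.8))
```

Taking the screenshot only once is_idle returns True reduces the chance of capturing placeholders for images that are still loading.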

7. Large Scale Operations

Jacob discussed the challenges of scaling web scraping operations, such as managing fingerprints, session handling, and rotating IPs. He introduced Bright Data’s Scraping Browser, which abstracts these complexities, allowing developers to focus on scripting.

Key Features:

Session Management: Automatically handle sessions to avoid detection.
IP Rotation: Use a variety of IP addresses to simulate different users.
Playground for Testing: Test your scripts in a controlled environment before scaling up.
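
For a sense of what a managed service takes off your hands, here is a minimal sketch of session-aware IP rotation done by hand. It is not Bright Data's API: ProxyRotator and the proxy addresses are hypothetical. Each session keeps the same proxy until it is explicitly rotated, for example after a block.

```python
import itertools


class ProxyRotator:
    """Assign proxies round-robin and keep each session on one proxy."""

    def __init__(self, proxies: list[str]) -> None:
        unique = list(dict.fromkeys(proxies))  # drop duplicates, keep order
        if not unique:
            raise ValueError("at least one proxy is required")
        self._size = len(unique)
        self._cycle = itertools.cycle(unique)
        self._assigned: dict[str, str] = {}

    def proxy_for(self, session_id: str) -> str:
        """Return the session's proxy, assigning the next one if it is new."""
        if session_id not in self._assigned:
            self._assigned[session_id] = next(self._cycle)
        return self._assigned[session_id]

    def rotate(self, session_id: str) -> str:
        """Move the session to a different proxy when more than one exists."""
        current = self._assigned.get(session_id)
        candidate = next(self._cycle)
        if candidate == current and self._size > 1:
            candidate = next(self._cycle)
        self._assigned[session_id] = candidate
        return candidate


# Hypothetical proxy endpoints, used only for illustration.
rotator = ProxyRotator([
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
])
print(rotator.proxy_for("session-1"), rotator.rotate("session-1"))
```

A production setup would also need health checks, fingerprint management, and retry logic, which is the kind of complexity the section above says a hosted scraping browser is meant to hide.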

Interactive Q&A Session

The webinar concluded with a Q&A session where participants asked about various aspects of web scraping. Key topics included:

Intercepting Frontend API Calls: Use browser DevTools to identify and replicate API requests.
Robust Selectors: Avoid using XPath; instead, use more stable and reliable selectors.
Handling Authentication: Cache authentication tokens and handle two-factor authentication manually when necessary.
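
Caching authentication tokens can be sketched as a small wrapper that logs in only when the cached token is missing or about to expire. The sketch assumes a hypothetical fetch_token callable that performs the login (including any manual two-factor step) and returns the token together with its lifetime in seconds; the clock is injectable so the expiry logic can be tested.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Reuse an auth token until shortly before it expires."""

    def __init__(self, fetch_token: Callable[[], Tuple[str, float]],
                 clock: Callable[[], float] = time.time,
                 margin: float = 30.0) -> None:
        self._fetch = fetch_token   # returns (token, lifetime in seconds)
        self._clock = clock
        self._margin = margin       # refresh this many seconds early
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, logging in again only when needed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, lifetime = self._fetch()
            self._expires_at = now + lifetime
        return self._token
```

Refreshing slightly before the stated expiry avoids sending a request with a token that lapses while the request is in flight.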

Conclusion

This webinar provided a wealth of knowledge for developers looking to master dynamic web scraping. By leveraging the insights shared by Greg, Dario, and Diego, you can enhance your scraping techniques, making your scripts more robust and efficient. For those who missed the live session, the recording will be available soon. Stay tuned for more educational content from Bright Data, helping you excel in web scraping and data extraction.

Happy scraping!
