Inside domharvest-playwright: Architecture, Design Patterns, and Advanced Techniques
While many developers use domharvest-playwright for its simple API, there’s a sophisticated architecture underneath that makes it both powerful and flexible. In this deep dive, we’ll explore the design decisions, patterns, and advanced techniques that make domharvest-playwright production-ready.
This article is for developers who want to understand not just how to use the library, but why it works the way it does.
Architectural Overview
domharvest-playwright follows a layered architecture that progressively builds abstractions over Playwright’s browser automation:
┌─────────────────────────────────────┐
│ High-Level API (harvest function)   │ ← Simple convenience layer
├─────────────────────────────────────┤
│ DOMHarvester Class                  │ ← Object-oriented interface
├─────────────────────────────────────┤
│ Resource Management Layer           │ ← Browser lifecycle handling
├─────────────────────────────────────┤
│ Function Serialization & Evaluation │ ← Page context execution
├─────────────────────────────────────┤
│ Playwright Core (Browser Drivers)   │ ← Foundation
└─────────────────────────────────────┘
This layered design follows the principle of progressive disclosure: simple tasks use simple APIs, complex tasks have access to deeper layers.
Design Philosophy
1. Separation of Concerns
The library cleanly separates three distinct responsibilities:
Browser Management
- Launching and closing browsers
- Creating and managing contexts
- Resource cleanup
Navigation and Waiting
- Page navigation
- Element waiting and timeouts
- Error handling
Data Extraction
- Selector evaluation
- Element transformation
- Result aggregation
This separation makes the code testable, maintainable, and easier to reason about.
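One concrete payoff: because extractor functions are plain, self-contained functions (see Pattern 3 below), the extraction layer can be unit-tested without launching a browser at all. A minimal sketch, assuming a hand-rolled element stub rather than any real test utility:
// The extractor under test, identical to what you would pass to harvest()
const extractor = el => ({ title: el.querySelector('h2')?.textContent })

// A minimal stub standing in for a real DOM element
const fakeElement = {
  querySelector: selector =>
    selector === 'h2' ? { textContent: 'Hello' } : null
}

// No browser needed: the extractor is just a function of its input
console.assert(extractor(fakeElement).title === 'Hello')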
2. Dual API Pattern
The library provides two complementary interfaces:
Function-Based API (harvest()):
const data = await harvest(url, selector, extractor, options)
Class-Based API (DOMHarvester):
const harvester = new DOMHarvester(options)
await harvester.init()
const data = await harvester.harvest(url, selector, extractor)
await harvester.close()
This pattern serves different use cases:
- Function API: Optimized for one-time operations, automatic cleanup
- Class API: Optimized for reuse, explicit lifecycle management
Let’s examine how this is implemented under the hood.
Implementation Deep Dive
The harvest() Function: Simplicity Through Abstraction
The harvest() function is a thin wrapper that creates, uses, and destroys a DOMHarvester instance:
export async function harvest(url, selector, extractor, options = {}) {
  const harvester = new DOMHarvester(options)
  try {
    await harvester.init()
    return await harvester.harvest(url, selector, extractor)
  } finally {
    await harvester.close()
  }
}
This implementation demonstrates several patterns:
- Factory Pattern: Creates the harvester instance
- Resource Acquisition Is Initialization (RAII): Guarantees cleanup via try/finally
- Facade Pattern: Hides complexity behind a simple interface
The trade-off is clear: simplicity for one-time use, with the cost of creating a new browser instance each time.
The DOMHarvester Class: Power Through Flexibility
The DOMHarvester class is where the real work happens. Let’s examine its structure:
import { chromium } from 'playwright'

class DOMHarvester {
  constructor(options = {}) {
    this.options = {
      headless: options.headless ?? true,
      timeout: options.timeout ?? 30000
    }
    this.browser = null
    this.context = null
    this.page = null
  }

  async init() {
    // Launch browser with Playwright
    this.browser = await chromium.launch({
      headless: this.options.headless
    })
    // Create browser context (isolated session)
    this.context = await this.browser.newContext()
    // Create a page in that context
    this.page = await this.context.newPage()
  }

  async harvest(url, selector, extractor) {
    // Navigate to URL with timeout
    await this.page.goto(url, {
      timeout: this.options.timeout,
      waitUntil: 'domcontentloaded'
    })
    // Wait for selector to appear
    await this.page.waitForSelector(selector, {
      timeout: this.options.timeout
    })
    // Execute extraction in browser context
    return await this.page.$$eval(selector, (elements, extractorStr) => {
      const extractorFn = eval(`(${extractorStr})`)
      return elements.map(el => extractorFn(el))
    }, extractor.toString())
  }

  async close() {
    if (this.browser) {
      await this.browser.close()
    }
  }
}
This implementation reveals several important patterns:
Pattern 1: Explicit Initialization
The constructor doesn’t do any heavy work—it just sets up state. The actual browser launch happens in init(). This is intentional:
Benefits:
- Constructor can’t fail (no async operations)
- Explicit control over when browser launches
- Easier to test and mock
- Clear lifecycle stages
Trade-offs:
- Requires calling init() before use (see the sketch below)
- More verbose than automatic initialization
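If the explicit init() call is unwelcome, a small subclass can initialize lazily on first use. This is a sketch, not part of the library’s API:
class LazyHarvester extends DOMHarvester {
  async harvest(url, selector, extractor) {
    // Launch the browser on first use instead of requiring an explicit init()
    if (!this.browser) {
      await this.init()
    }
    return super.harvest(url, selector, extractor)
  }
}
The trade-off simply moves: the first harvest() call now hides a slow browser launch inside it.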
Pattern 2: Context Isolation
The class creates a browser context, not just a browser:
this.browser = await chromium.launch(...)
this.context = await this.browser.newContext()
this.page = await this.context.newPage()
Why contexts matter:
- Isolate cookies, storage, and cache
- Enable parallel operations without interference
- Simulate different users or sessions (see the sketch below)
- Essential for production scraping
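For instance, two contexts in one browser behave like two independent users, each with separate cookies and storage. A minimal sketch (the login URL is a placeholder):
import { chromium } from 'playwright'

const browser = await chromium.launch({ headless: true })

// Each context gets isolated cookies, storage, and cache
const userA = await browser.newContext()
const userB = await browser.newContext()

const pageA = await userA.newPage()
const pageB = await userB.newPage()

// Logging in on pageA leaves userB's session untouched
await pageA.goto('https://example.com/login')
await pageB.goto('https://example.com/login')

await browser.close()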
Pattern 3: Function Serialization
The most interesting part is how extractor functions are handled:
return await this.page.$$eval(selector, (elements, extractorStr) => {
  const extractorFn = eval(`(${extractorStr})`)
  return elements.map(el => extractorFn(el))
}, extractor.toString())
This is necessary because the extractor function needs to run in the browser context, not in Node.js. Let’s break down what happens:
- Serialization: extractor.toString() converts the function to a string
- Injection: The string is passed to the browser context
- Reconstruction: eval() recreates the function in the browser
- Execution: The function runs on each matched element
Implications:
- Extractor functions can’t access outer scope variables
- Functions must be self-contained
- Complex logic needs to be embedded in the function
This is a fundamental constraint of browser automation, not a limitation of the library.
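The constraint is easy to trip over in practice. A sketch of the failure mode and the usual workaround, merging outer data after extraction:
const siteName = 'Example'

// Broken: siteName lives in Node.js scope; once the function is
// serialized into the browser, the variable no longer exists there
// const data = await harvest(url, '.item', el => ({
//   source: siteName,
//   title: el.querySelector('h2')?.textContent
// }))

// Works: keep the extractor self-contained, attach outer data afterwards
const data = await harvest(url, '.item', el => ({
  title: el.querySelector('h2')?.textContent
}))
const tagged = data.map(item => ({ ...item, source: siteName }))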
Advanced Techniques
Technique 1: Direct Playwright API Access
The DOMHarvester class exposes its internal Playwright instances:
const harvester = new DOMHarvester()
await harvester.init()
// Access the page directly
await harvester.page.goto('https://example.com')
await harvester.page.waitForSelector('.dynamic-content')
// Use Playwright's full API
const screenshot = await harvester.page.screenshot()
const cookies = await harvester.context.cookies()
const title = await harvester.page.title()
This design choice follows the Principle of Least Surprise: when you need more power, the full Playwright API is right there.
Technique 2: Custom Page Evaluation
The harvestCustom() method provides a different pattern for complex extraction:
async harvestCustom(url, pageFunction) {
  await this.page.goto(url, {
    timeout: this.options.timeout,
    waitUntil: 'domcontentloaded'
  })
  return await this.page.evaluate(pageFunction)
}
Usage:
const data = await harvester.harvestCustom(url, () => {
  // This runs in the browser
  return {
    title: document.title,
    links: Array.from(document.querySelectorAll('a')).map(a => a.href),
    scripts: Array.from(document.scripts).map(s => s.src)
  }
})
This pattern is more powerful than harvest() because:
- You control the entire extraction logic
- You can aggregate data across selectors
- You can implement custom waiting or filtering logic
Technique 3: Handling Dynamic Content
Modern websites often load content dynamically. Here’s how to handle it:
const harvester = new DOMHarvester({ timeout: 60000 })
await harvester.init()
try {
  // Navigate to page
  await harvester.page.goto('https://example.com/infinite-scroll')
  // Wait for initial content
  await harvester.page.waitForSelector('.item', { timeout: 5000 })
  // Scroll to trigger lazy loading
  await harvester.page.evaluate(() => {
    window.scrollTo(0, document.body.scrollHeight)
  })
  // Wait for new content to appear (waitForFunction takes an arg before
  // options, so pass null explicitly)
  await harvester.page.waitForFunction(
    () => document.querySelectorAll('.item').length > 20,
    null,
    { timeout: 10000 }
  )
  // Now extract all items
  const items = await harvester.page.$$eval('.item', items =>
    items.map(item => ({
      title: item.querySelector('h2')?.textContent,
      content: item.querySelector('p')?.textContent
    }))
  )
  console.log(`Extracted ${items.length} items`)
} finally {
  await harvester.close()
}
This demonstrates:
- Custom wait conditions with waitForFunction()
- Programmatic scrolling to trigger lazy loading
- Dynamic content extraction after loading completes
Technique 4: Network Interception
Sometimes the best approach isn’t scraping HTML at all—it’s intercepting API calls:
const harvester = new DOMHarvester()
await harvester.init()
try {
  // Start waiting for the matching API response before navigating,
  // so the response can't slip past before the listener is attached
  const responsePromise = harvester.page.waitForResponse(
    response => response.url().includes('/api/data')
  )
  // Trigger the API call by navigating
  await harvester.page.goto('https://example.com')
  // Parse the captured response body
  const response = await responsePromise
  const apiData = await response.json()
  console.log('Intercepted API data:', apiData)
} finally {
  await harvester.close()
}
This technique:
- Captures structured data before it’s rendered
- Avoids parsing HTML entirely
- Often more reliable than DOM scraping
- Reveals the actual data source
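Playwright’s route() API takes this a step further: matched requests can be blocked, modified, or answered outright, which is useful for testing a scraper without hitting the live endpoint. A minimal sketch, reusing the hypothetical /api/data endpoint from above:
// Answer matching requests ourselves instead of forwarding them
await harvester.page.route('**/api/data', route => {
  route.fulfill({
    status: 200,
    contentType: 'application/json',
    body: JSON.stringify({ items: [{ id: 1, title: 'Mocked item' }] })
  })
})

// Navigations that trigger /api/data now receive the mocked payload
await harvester.page.goto('https://example.com')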
Technique 5: Parallel Scraping with Multiple Contexts
For production scraping, you often need to scrape multiple pages in parallel:
import { chromium } from 'playwright'

async function parallelScrape(urls, concurrency = 5) {
  const browser = await chromium.launch({ headless: true })
  // Process URLs in chunks
  const results = []
  for (let i = 0; i < urls.length; i += concurrency) {
    const chunk = urls.slice(i, i + concurrency)
    // Create a context and page for each URL in the chunk
    const chunkResults = await Promise.all(
      chunk.map(async url => {
        const context = await browser.newContext()
        const page = await context.newPage()
        try {
          await page.goto(url, { timeout: 30000 })
          const data = await page.$$eval('.item', items =>
            items.map(item => ({ text: item.textContent }))
          )
          return { url, data, error: null }
        } catch (error) {
          return { url, data: null, error: error.message }
        } finally {
          await context.close()
        }
      })
    )
    results.push(...chunkResults)
    // Rate limiting between chunks
    await new Promise(resolve => setTimeout(resolve, 1000))
  }
  await browser.close()
  return results
}

// Scrape 50 URLs with 5 concurrent requests
const results = await parallelScrape(urls, 5)
Key concepts:
- Create one browser instance
- Use multiple contexts for isolation
- Process URLs in chunks to limit concurrency
- Handle errors per-URL to avoid failing entire batch
- Add rate limiting between chunks
Performance Considerations
Memory Management
Browser instances consume significant memory. Here’s how to optimize:
1. Reuse Browser Instances
// Bad: Creates new browser for each page
for (const url of urls) {
  const data = await harvest(url, selector, extractor)
}

// Good: Reuses same browser
const harvester = new DOMHarvester()
await harvester.init()
for (const url of urls) {
  await harvester.harvest(url, selector, extractor)
}
await harvester.close()
2. Close Contexts When Done
// Create multiple contexts for parallel work
const contexts = await Promise.all(
  [1, 2, 3].map(() => browser.newContext())
)

// Use them
// ...

// Close them
await Promise.all(contexts.map(ctx => ctx.close()))
3. Set Resource Limits
const context = await browser.newContext({
  bypassCSP: true,          // bypass Content-Security-Policy checks
  javaScriptEnabled: true,  // the default; disable for static pages to save CPU
  ignoreHTTPSErrors: true   // tolerate invalid certificates
})

// Block images and stylesheets to cut bandwidth and memory
await context.route('**/*', route => {
  const type = route.request().resourceType()
  if (type === 'image' || type === 'stylesheet') {
    route.abort()
  } else {
    route.continue()
  }
})
Timeout Strategy
Timeouts are critical for reliability. Here’s a sophisticated approach:
class SmartHarvester extends DOMHarvester {
  async harvest(url, selector, extractor, retries = 3) {
    for (let attempt = 1; attempt <= retries; attempt++) {
      try {
        // Grow the timeout with each attempt
        const timeout = this.options.timeout * attempt
        await this.page.goto(url, {
          timeout,
          waitUntil: 'domcontentloaded'
        })
        await this.page.waitForSelector(selector, { timeout })
        return await this.page.$$eval(selector, (elements, extractorStr) => {
          const extractorFn = eval(`(${extractorStr})`)
          return elements.map(el => extractorFn(el))
        }, extractor.toString())
      } catch (error) {
        if (attempt === retries) throw error
        console.log(`Attempt ${attempt} failed, retrying...`)
        // Back off a little longer before each retry
        await new Promise(resolve => setTimeout(resolve, 1000 * attempt))
      }
    }
  }
}
This implements:
- Retry logic with a backoff delay that grows with each attempt
- Increasing timeouts for each retry
- Configurable retry count
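Usage mirrors the base class; the base timeout here is an illustrative value:
const harvester = new SmartHarvester({ timeout: 10000 })
await harvester.init()
try {
  // Attempt 1 uses a 10s timeout, attempt 2 uses 20s, attempt 3 uses 30s
  const data = await harvester.harvest(url, '.item', el => ({
    title: el.querySelector('h2')?.textContent
  }))
} finally {
  await harvester.close()
}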
Comparing with Alternatives
Let’s compare domharvest-playwright with other scraping approaches:
vs. Raw Playwright
domharvest-playwright:
const data = await harvest(url, '.item', el => ({
  title: el.querySelector('h2')?.textContent
}))
Raw Playwright:
const browser = await chromium.launch()
const context = await browser.newContext()
const page = await context.newPage()
try {
  await page.goto(url)
  await page.waitForSelector('.item')
  const data = await page.$$eval('.item', items =>
    items.map(item => ({
      title: item.querySelector('h2')?.textContent
    }))
  )
} finally {
  await browser.close()
}
Trade-off: domharvest-playwright reduces boilerplate by ~60% for common cases, but you can still access raw Playwright when needed.
vs. Cheerio
Cheerio is faster but can’t handle JavaScript-rendered content:
// Cheerio: Fast but no JS execution
const $ = cheerio.load(html)
const data = $('.item').map((i, el) => ({
  title: $(el).find('h2').text()
})).get()

// domharvest-playwright: Slower but handles SPAs
const data = await harvest(url, '.item', el => ({
  title: el.querySelector('h2')?.textContent
}))
When to use each:
- Cheerio: Static HTML, known structure, speed critical
- domharvest-playwright: Dynamic content, SPAs, JavaScript-heavy sites
vs. Puppeteer
Puppeteer is similar to Playwright, but Playwright has advantages:
Playwright Benefits:
- Multi-browser support (Chromium, Firefox, WebKit)
- Better auto-waiting mechanisms
- More active development
- Better network interception
domharvest-playwright builds on these advantages while adding DOM extraction patterns.
Best Practices for Production
1. Implement Circuit Breakers
Prevent cascading failures:
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failures = 0
    this.threshold = threshold
    this.timeout = timeout
    this.state = 'closed' // closed, open, half-open
    this.nextAttempt = Date.now()
  }

  async execute(fn) {
    if (this.state === 'open') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is open')
      }
      this.state = 'half-open'
    }
    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  onSuccess() {
    this.failures = 0
    this.state = 'closed'
  }

  onFailure() {
    this.failures++
    // A failed probe in half-open state reopens the circuit immediately
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open'
      this.nextAttempt = Date.now() + this.timeout
    }
  }
}

// Usage
const breaker = new CircuitBreaker()
const harvester = new DOMHarvester()
await harvester.init()
for (const url of urls) {
  try {
    await breaker.execute(async () => {
      return await harvester.harvest(url, selector, extractor)
    })
  } catch (error) {
    console.error(`Failed to scrape ${url}: ${error.message}`)
  }
}
await harvester.close()
2. Add Comprehensive Logging
class LoggedHarvester extends DOMHarvester {
  async harvest(url, selector, extractor) {
    const start = Date.now()
    console.log(`[${new Date().toISOString()}] Starting harvest: ${url}`)
    try {
      const result = await super.harvest(url, selector, extractor)
      const duration = Date.now() - start
      console.log(`[${new Date().toISOString()}] Success: ${url} (${duration}ms, ${result.length} items)`)
      return result
    } catch (error) {
      const duration = Date.now() - start
      console.error(`[${new Date().toISOString()}] Failed: ${url} (${duration}ms) - ${error.message}`)
      throw error
    }
  }
}
3. Use Type Safety with TypeScript
import { DOMHarvester } from 'domharvest-playwright'

interface Product {
  title: string
  price: number
  inStock: boolean
}

async function scrapeProducts(url: string): Promise<Product[]> {
  const harvester = new DOMHarvester()
  await harvester.init()
  try {
    return await harvester.harvest<Product>(
      url,
      '.product',
      (el): Product => ({
        title: el.querySelector('h2')?.textContent?.trim() ?? '',
        price: parseFloat(
          el.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '') ?? '0'
        ),
        inStock: el.querySelector('.stock')?.textContent?.includes('In stock') ?? false
      })
    )
  } finally {
    await harvester.close()
  }
}
Conclusion
domharvest-playwright’s architecture demonstrates that simple APIs don’t require sacrificing power or flexibility. The key design decisions—layered architecture, dual API pattern, explicit lifecycle management, and direct Playwright access—create a library that scales from quick scripts to production systems.
Key takeaways:
- Layered architecture provides progressive disclosure of complexity
- Function serialization is necessary for browser context execution
- Explicit initialization gives control over resource lifecycle
- Direct API access enables advanced techniques when needed
- Context isolation is essential for parallel and production scraping
Whether you’re building a simple scraper or a production data pipeline, understanding these patterns will help you use domharvest-playwright effectively and extend it when needed.
Have architectural questions? Discuss on GitHub Issues!