Building domharvest-playwright: A Modern DOM Harvesting Tool
Introduction
Web scraping doesn’t have to be complicated. Yet, every time I started a new data extraction project, I found myself writing the same boilerplate code: launching browsers, navigating pages, waiting for elements, extracting data, and cleaning up resources. While Playwright is an excellent browser automation framework, it’s designed as a general-purpose tool for testing and automation, not specifically for DOM harvesting workflows.
That’s why I built domharvest-playwright: a focused library that wraps Playwright’s power into a simple, intuitive API specifically designed for extracting structured data from websites.
The Challenge
Modern web scraping presents several recurring challenges:
- Boilerplate Overhead: Every scraping script needs browser initialization, navigation, waiting logic, and cleanup code
- Inconsistent Patterns: Different developers solve the same problems in different ways, making code harder to maintain
- Error-Prone Resource Management: Forgetting to close browsers or pages can lead to memory leaks in production
- Complex Extraction Logic: Transforming raw DOM elements into structured data requires repetitive querySelector chains and null checks
While tools like Puppeteer and Playwright solve browser automation, they don’t provide opinionated patterns for the specific use case of DOM harvesting. I wanted a tool that would make the common case simple while still allowing full control when needed.
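To make the boilerplate concrete, here is a rough sketch of a plain-Playwright scraper for a quotes page. The function name `scrapeQuotes` is made up for illustration, and the browser type is passed in as a parameter (in real use it would be `chromium` imported from `playwright`) so the sketch can also be run against a stub. It is not taken from the library's source.

```javascript
// Plain-Playwright scraper, shown only to illustrate the boilerplate.
// `browserType` is assumed to be a Playwright browser type such as `chromium`.
async function scrapeQuotes (browserType, url) {
  // 1. Browser initialization
  const browser = await browserType.launch({ headless: true })
  try {
    // 2. Navigation
    const page = await browser.newPage()
    await page.goto(url)
    // 3. Waiting logic
    await page.waitForSelector('.quote')
    // 4. Extraction: the callback runs inside the page against all matches
    return await page.$$eval('.quote', (els) =>
      els.map((el) => ({
        text: el.querySelector('.text')?.textContent?.trim(),
        author: el.querySelector('.author')?.textContent?.trim()
      }))
    )
  } finally {
    // 5. Cleanup, even if any step above throws
    await browser.close()
  }
}
```

Everything except step 4 is the same from one script to the next, which is the part a harvesting-focused wrapper can absorb.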
Why Playwright?
Choosing the right foundation was critical. After years of working with various browser automation tools, Playwright emerged as the clear choice for several reasons:
- Modern Architecture: Built from the ground up for modern web applications with native support for SPAs, dynamic content, and JavaScript-heavy sites
- Multi-Browser Support: Test and scrape across Chromium, Firefox, and WebKit with a unified API
- Superior Performance: Faster execution and better resource management compared to older tools like Selenium
- Active Development: Backed by Microsoft with regular updates, security patches, and new features
- Auto-Waiting: Intelligent waiting mechanisms that reduce flakiness and eliminate most explicit wait statements
- Network Control: Intercept and modify requests, perfect for bypassing certain anti-scraping measures
Playwright’s battle-tested reliability made it the ideal foundation for a production-ready scraping tool.
Key Features
1. Simple Function-Based API
The most common use case should be the simplest. With domharvest-playwright, extracting data from a website is just one function call:
const quotes = await harvest(
  'https://quotes.toscrape.com/',
  '.quote',
  (el) => ({
    text: el.querySelector('.text')?.textContent?.trim(),
    author: el.querySelector('.author')?.textContent?.trim(),
    tags: Array.from(el.querySelectorAll('.tag')).map(tag => tag.textContent?.trim())
  })
)
No need to manually launch browsers, navigate pages, or clean up resources. The library handles all of that for you.
2. Class-Based Interface for Complex Workflows
When you need more control, the class-based API gives you full access to the browser instance:
const harvester = new DOMHarvester()
await harvester.init()
const data = await harvester.harvest(url, selector, transformFn)
// Perform multiple operations with the same browser instance
const moreData = await harvester.harvestCustom(url, customPageFunction)
await harvester.close()
3. Custom Page Analysis
The harvestCustom() method allows you to inject arbitrary functions into the page context for complex extraction logic:
const result = await harvester.harvestCustom(
  url,
  () => {
    // This function runs in the browser context
    return {
      title: document.title,
      links: Array.from(document.querySelectorAll('a')).map(a => a.href),
      metadata: Array.from(document.querySelectorAll('meta')).map(m => ({
        name: m.getAttribute('name'),
        content: m.getAttribute('content')
      }))
    }
  }
)
4. Configurable Behavior
Every aspect of the harvesting process can be customized:
- Headless/Headed Mode: Debug visually or run in production headless mode
- Timeouts: Configure wait times for slow-loading pages
- Browser Selection: Choose between Chromium, Firefox, or WebKit
- Custom Selectors: Use any valid CSS selector for element targeting
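As an illustration of how settings like these are commonly handled, the sketch below merges caller overrides into a set of defaults and validates them. The option names (`headless`, `timeout`, `browser`), the default values, and the `resolveOptions` helper are all assumptions made for this example; check the library's documentation for its actual option names.

```javascript
// Hypothetical option names and defaults; the library's real ones may differ.
const DEFAULTS = Object.freeze({ headless: true, timeout: 30000, browser: 'chromium' })
const BROWSERS = ['chromium', 'firefox', 'webkit']

// Merge caller overrides into the defaults and reject obviously bad values.
function resolveOptions (overrides = {}) {
  const options = { ...DEFAULTS, ...overrides }
  if (!BROWSERS.includes(options.browser)) {
    throw new Error(`Unknown browser: ${options.browser}`)
  }
  if (!Number.isFinite(options.timeout) || options.timeout < 0) {
    throw new Error(`Invalid timeout: ${options.timeout}`)
  }
  return options
}

// Headed mode for visual debugging; everything else keeps its default.
const debugOptions = resolveOptions({ headless: false })
```

Validating up front means a typo in a browser name fails immediately rather than partway through a scrape.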
Architecture Overview
The library follows a layered architecture designed for simplicity and flexibility:
┌─────────────────────────────────────┐
│  High-Level API (harvest())         │ ← Simple function for common cases
├─────────────────────────────────────┤
│  DOMHarvester Class                 │ ← Object-oriented interface
├─────────────────────────────────────┤
│  Playwright Browser Management      │ ← Browser lifecycle handling
├─────────────────────────────────────┤
│  Page Navigation & Evaluation       │ ← DOM interaction layer
├─────────────────────────────────────┤
│  Playwright Core (Browser Drivers)  │ ← Foundation
└─────────────────────────────────────┘
Key Design Decisions:
- ES Modules: Modern JavaScript with native module support for better tree-shaking and compatibility
- Automatic Cleanup: Resource management handled automatically to prevent memory leaks
- Function Serialization: Transform functions are serialized and executed in the browser context for maximum flexibility
- Minimal Dependencies: Only Playwright as a dependency keeps the package lean and maintainable
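The function-serialization point is easier to see with a small simulation. The sketch below does not use a browser: it mimics the effect by turning a function into source text and rebuilding it in a scope that cannot see the caller's local variables, which is roughly what happens when a function is shipped to a page. The helper names `runSerialized` and `demo` are made up for this example. In Playwright itself, `page.evaluate(fn, arg)` accepts a serializable argument for this purpose; whether the library's own methods forward extra arguments is not something this sketch assumes.

```javascript
// Simulates sending a function to another context: only its source text
// crosses the boundary, so closures over outer variables are lost.
function runSerialized (fn, arg) {
  const source = fn.toString()
  // Rebuild the function from text in a scope that cannot see the caller's locals
  const rebuilt = new Function(`return (${source})`)()
  return rebuilt(arg)
}

function demo () {
  const suffix = '!'
  let closureError = null
  try {
    // Relies on `suffix` from the enclosing scope: fails after serialization
    runSerialized((text) => text + suffix, 'hello')
  } catch (err) {
    closureError = err.name
  }
  // Passing the value explicitly as data works
  const withArg = runSerialized(
    ({ text, suffix }) => text + suffix,
    { text: 'hello', suffix }
  )
  return { closureError, withArg }
}

console.log(demo()) // → { closureError: 'ReferenceError', withArg: 'hello!' }
```

The practical rule is that a transform function should depend only on its parameters and on what exists inside the page.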
Lessons Learned
Building domharvest-playwright taught me valuable lessons about API design and web scraping:
- Simplicity Wins: The most powerful feature is often the simplest API. Users gravitate toward the one-line harvest() function even though the class-based API offers more control. Making the common case trivial is more important than exposing every feature upfront.
- Resource Management is Critical: In production environments, forgotten browser instances can quickly consume all available memory. Automatic cleanup isn’t just a convenience; it’s essential for reliability. Every code path must guarantee resource cleanup, even on errors.
- Context Switching is Tricky: Serializing functions to run in the browser context introduces subtle gotchas. Variables from the outer scope don’t transfer, and debugging is harder. Clear documentation and good error messages are essential to help users understand the execution model.
- Testing Matters for Scraping Tools: Practice websites like quotes.toscrape.com and books.toscrape.com are invaluable for testing without ethical concerns. They let you develop and test scraping logic without worrying about rate limiting, legal issues, or changing production sites.
- Standards Improve Code Quality: Adopting JavaScript Standard Style from day one eliminated bikeshedding about formatting and caught subtle bugs through linting. Consistency makes the codebase easier to maintain and contribute to.
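The resource-management lesson above comes down to one pattern: acquire, use inside `try`, release inside `finally`. Here is a minimal sketch of that pattern as a generic helper. The name `withResource` is hypothetical and this is not the library's internal code; it assumes only that the resource exposes an async `close()` method, as a Playwright browser does.

```javascript
// Runs `work` with a freshly opened resource and always closes it,
// whether `work` resolves or throws.
async function withResource (open, work) {
  const resource = await open()
  try {
    return await work(resource)
  } finally {
    // Runs on success and on error, so the resource never leaks
    await resource.close()
  }
}
```

With Playwright this could be called as `withResource(() => chromium.launch(), async (browser) => { ... })`, so that an exception during navigation or extraction still closes the browser.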
Future Roadmap
The library is production-ready, but there’s always room for improvement. Here’s what’s on the horizon:
- Retry Logic: Automatic retries with exponential backoff for flaky networks or rate-limited sites
- Parallel Harvesting: Built-in support for concurrent scraping of multiple URLs with configurable concurrency limits
- Data Export Formats: Native support for exporting to JSON, CSV, and other common formats
- Request Interception Helpers: Simplified API for blocking images, stylesheets, or other resources to speed up scraping
- Screenshot & PDF Generation: Capture visual content alongside structured data
- Proxy Support: Easy configuration of proxy servers for distributed scraping
- Performance Metrics: Built-in timing and performance tracking for optimization
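Until retry and parallel harvesting are built in, both can be approximated in user code. The sketch below shows one possible shape: `withRetry` retries a failing async task with exponentially growing delays, and `mapWithLimit` processes a list with a cap on concurrent tasks. Both helper names, their parameters, and the default values are assumptions for this example, not a preview of the planned API.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Retries `task` up to `retries` extra times, doubling the delay each time
// (baseDelayMs, 2 * baseDelayMs, 4 * baseDelayMs, ...).
async function withRetry (task, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task()
    } catch (err) {
      if (attempt >= retries) throw err
      await sleep(baseDelayMs * 2 ** attempt)
    }
  }
}

// Maps `worker` over `items` with at most `limit` calls in flight,
// preserving input order in the result array.
async function mapWithLimit (items, limit, worker) {
  const results = new Array(items.length)
  let next = 0
  async function runner () {
    // Each runner pulls the next unclaimed index until the list is exhausted
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }
  const count = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: count }, () => runner()))
  return results
}
```

The two compose naturally: `mapWithLimit(urls, 3, (url) => withRetry(() => harvest(url, selector, transformFn)))` would harvest a list of URLs three at a time, retrying each one independently. Note that launching one browser per call is expensive, so a real implementation would likely share a single browser instance across tasks.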
Contributions and feature requests are welcome on GitHub!
Getting Started
Install via npm:
npm install domharvest-playwright
Check out the documentation for detailed usage examples.
Contributing
The project is open source and welcomes contributions! Visit the GitHub repository to get involved.
Conclusion
Building domharvest-playwright has been a journey in creating developer-friendly abstractions without sacrificing power. By focusing on the most common web scraping workflows and wrapping them in an intuitive API, the library makes it easier to build reliable data extraction pipelines.
The goal wasn’t to replace Playwright but to complement it—providing a higher-level interface for DOM harvesting while still allowing access to Playwright’s full capabilities when needed. Whether you’re building a one-off scraper or a production data pipeline, domharvest-playwright aims to get you started quickly and scale with your needs.
If you’re working with web scraping or data extraction, give it a try and let me know what you think. The project is open source and actively maintained, and I’m always looking for feedback and contributions from the community.
Happy harvesting!
Have questions or suggestions? Reach out on GitHub or Mastodon!