domharvest-playwright 1.3.0: Declarative DSL and Authentication Support
I’m excited to announce the release of domharvest-playwright 1.3.0, the most significant update since the project launched. This release introduces two major features that change how you build web scrapers: a declarative DSL for data extraction and comprehensive authentication support. We’ve also hardened the testing infrastructure with enforced code-coverage thresholds.
Let’s dive into what’s new.
Declarative DSL: Cleaner Data Extraction
The biggest addition in 1.3.0 is a powerful Domain Specific Language (DSL) for declarative data extraction. If you’ve been writing scrapers with domharvest-playwright, you know the pattern: CSS selectors, querySelector calls, optional chaining, and trim() everywhere. It works, but it’s verbose.
The new DSL eliminates that boilerplate entirely.
Before and After
Traditional approach (still supported):
const products = await harvester.harvest(
'https://example.com/products',
'.product',
(el) => ({
name: el.querySelector('h2')?.textContent?.trim(),
price: el.querySelector('.price')?.textContent?.trim(),
image: el.querySelector('img')?.getAttribute('src'),
inStock: el.querySelector('.stock-badge') !== null
})
)
DSL approach (new in 1.3.0):
import { text, attr, exists } from 'domharvest-playwright'
const products = await harvester.harvest(
'https://example.com/products',
'.product',
{
name: text('h2'),
price: text('.price'),
image: attr('img', 'src'),
inStock: exists('.stock-badge')
}
)
The difference is striking. The DSL version is shorter, cleaner, and more declarative. You focus on what you want to extract, not how to extract it.
Complete Helper Function Reference
The DSL provides six core helpers covering the most common extraction patterns:
1. text() — Extract Text Content
import { text } from 'domharvest-playwright'
// Basic usage
text('h1') // Extracts and trims text from <h1>
// With default value
text('.subtitle', { default: 'No subtitle' })
// Without trimming
text('.description', { trim: false })
The text() helper automatically handles null safety and trimming, eliminating the need for ?.textContent?.trim() everywhere.
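For intuition, here’s roughly what a text() call stands in for, written as the manual extractor you’d otherwise hand-roll (a sketch of the semantics, not the library’s actual internals):
// Roughly equivalent to text('.subtitle', { default: 'No subtitle' })
const subtitle = (el) => {
  const node = el.querySelector('.subtitle')
  const value = node?.textContent?.trim()
  return value ?? 'No subtitle' // default applies when nothing matches
}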
2. attr() — Extract Attributes
import { attr } from 'domharvest-playwright'
// Get attribute value
attr('img', 'src') // Get src attribute from <img>
attr('a', 'href') // Get href from <a>
// With default value
attr('img', 'alt', { default: 'No description' })
Perfect for extracting URLs, IDs, data attributes, or any HTML attribute.
3. array() — Process Collections
import { array, attr, text } from 'domharvest-playwright'
// Extract array of text values
array('.tag', text())
// Extract array of links
array('a', attr('href'))
// Extract array of objects
array('.review', {
author: text('.author'),
rating: text('.rating'),
comment: text('.comment')
})
The array() helper simplifies working with collections, replacing verbose Array.from() and .map() chains.
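For a sense of what that saves you, here’s the hand-written equivalent of the array('.review', { ... }) schema above (a manual version for comparison, not library internals):
// Manual equivalent of array('.review', { author, rating, comment })
const reviews = (el) =>
  Array.from(el.querySelectorAll('.review')).map((review) => ({
    author: review.querySelector('.author')?.textContent?.trim(),
    rating: review.querySelector('.rating')?.textContent?.trim(),
    comment: review.querySelector('.comment')?.textContent?.trim()
  }))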
4. exists() — Check Element Presence
import { exists } from 'domharvest-playwright'
// Check if element exists
exists('.premium-badge') // Returns true/false
exists('.out-of-stock') // Boolean presence check
Clean boolean checks without !== null comparisons.
5. html() — Extract HTML Content
import { html } from 'domharvest-playwright'
// Get innerHTML
html('.description')
// With default value
html('.content', { default: '<p>No content</p>' })
Useful when you need the raw HTML structure, not just text.
6. count() — Count Elements
import { count } from 'domharvest-playwright'
// Count matching elements
count('.review') // Number of reviews
count('.star.filled') // Number of filled stars
Simple element counting without manual array length checks.
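For comparison, the manual equivalents of exists(), html(), and count() are the kind of one-liners you no longer need to repeat (hand-written versions, not library internals):
const hasBadge = (el) => el.querySelector('.premium-badge') !== null // exists('.premium-badge')
const description = (el) => el.querySelector('.description')?.innerHTML // html('.description')
const reviewCount = (el) => el.querySelectorAll('.review').length // count('.review')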
Real-World Example: E-commerce Scraper
Here’s a complete example showing how these helpers work together:
import { DOMHarvester, text, attr, array, exists, count } from 'domharvest-playwright'
const harvester = new DOMHarvester({
rateLimit: { requestsPerSecond: 2 }
})
await harvester.init()
try {
const products = await harvester.harvest(
'https://example.com/products',
'.product-card',
{
// Basic text extraction
name: text('h2.product-name'),
price: text('.price'),
description: text('.description', { default: 'No description' }),
// Attribute extraction
productId: attr('[data-product-id]', 'data-product-id'),
image: attr('img.product-image', 'src'),
link: attr('a.product-link', 'href'),
// Arrays
tags: array('.tag', text()),
images: array('.gallery img', attr('src')),
// Booleans
inStock: exists('.in-stock-badge'),
onSale: exists('.sale-badge'),
// Counts
reviewCount: count('.review'),
// Nested objects
reviews: array('.review', {
author: text('.review-author'),
rating: text('.review-rating'),
date: text('.review-date'),
comment: text('.review-comment')
})
}
)
console.log(`Extracted ${products.length} products`)
console.log(products[0])
} finally {
await harvester.close()
}
Pure DSL vs Mixed Mode
The DSL supports two execution modes:
Pure DSL Mode: When your extractor uses only DSL helpers, execution happens entirely in the browser context for optimal performance:
// Pure DSL - optimized browser-side execution
{
name: text('h1'),
price: text('.price'),
tags: array('.tag', text())
}
Mixed Mode: You can combine DSL helpers with custom functions when needed:
// Mixed mode - combines DSL with custom logic
{
name: text('h1'),
price: text('.price'),
discount: (el) => {
const original = parseFloat(el.querySelector('.original-price')?.textContent || '0')
const current = parseFloat(el.querySelector('.current-price')?.textContent || '0')
return Math.round(((original - current) / original) * 100)
}
}
Mixed mode falls back to Node.js-side execution but still provides all the benefits of DSL helpers where used.
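How might a library tell the two modes apart? One plausible approach (an illustration only, not necessarily how domharvest-playwright does it) is for each helper to return a plain descriptor object carrying a marker, so an extractor counts as pure DSL exactly when every value in it, recursively, is such a descriptor:
// Hypothetical check: assumes text(), attr(), etc. tag their
// descriptors with a __dslHelper marker.
function isPureDsl(value) {
  if (typeof value === 'function') return false // custom function => mixed mode
  if (value && value.__dslHelper) return true // helper descriptor
  if (value && typeof value === 'object') {
    return Object.values(value).every(isPureDsl) // recurse into nested schemas
  }
  return false
}
A purely declarative schema can be serialized and evaluated inside a single page.evaluate() call, which is presumably where the browser-side performance win comes from.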
Backward Compatibility
Important: The DSL is completely optional and fully backward compatible. All existing code continues to work unchanged. You can:
- Use function extractors exclusively (existing approach)
- Use DSL extractors exclusively (new approach)
- Mix both approaches in the same project
- Gradually migrate to DSL at your own pace
There are no breaking changes in this release.
Authentication and Session Management
The second major feature in 1.3.0 is comprehensive authentication support. Many real-world scraping scenarios require authentication: accessing user profiles, scraping private repositories, extracting personalized data, or monitoring members-only areas.
Previously, you had to handle authentication manually with Playwright’s API. Now, domharvest-playwright provides built-in helpers for the most common patterns.
Form-Based Login
The login() helper automatically detects and fills common login forms:
import { DOMHarvester, login, text } from 'domharvest-playwright'
const harvester = new DOMHarvester()
await harvester.init()
try {
const page = await harvester.getPage()
// Automatic login with form detection
await login(page, 'https://example.com/login', {
username: process.env.USERNAME,
password: process.env.PASSWORD
})
// Now scrape authenticated content
const data = await harvester.harvest(
'https://example.com/dashboard',
'.data',
{ value: text('.value') }
)
console.log('Authenticated data:', data)
} finally {
await harvester.close()
}
The login() helper:
- Automatically detects common login form patterns
- Fills username and password fields
- Submits the form
- Waits for navigation to complete
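For intuition, a plausible detection heuristic (an illustration, not the package’s actual logic) anchors on the password field, since it’s the most distinctive element of any login form:
// Hypothetical auto-detection: locate the password input, then find the
// username and submit controls within the same form.
async function detectLoginForm(page) {
  const password = page.locator('input[type="password"]').first()
  const form = password.locator('xpath=ancestor::form[1]')
  const username = form
    .locator('input[type="email"], input[type="text"], input[autocomplete="username"]')
    .first()
  const submit = form.locator('button[type="submit"], input[type="submit"]').first()
  return { username, password, submit }
}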
For non-standard forms, you can provide custom selectors:
await login(page, 'https://example.com/signin', {
username: process.env.USERNAME,
password: process.env.PASSWORD
}, {
usernameSelector: '#email',
passwordSelector: '#pwd',
submitSelector: 'button[type="submit"]'
})
Cookie Persistence
Save and restore authentication cookies to skip login on subsequent runs:
import { DOMHarvester, login, saveCookies, loadCookies, text } from 'domharvest-playwright'
const harvester = new DOMHarvester()
await harvester.init()
try {
const page = await harvester.getPage()
const context = page.context()
// Try loading existing cookies
const cookiesExist = await loadCookies(context, './cookies.json')
if (!cookiesExist) {
// First run - perform login
await login(page, 'https://example.com/login', {
username: process.env.USERNAME,
password: process.env.PASSWORD
})
// Save cookies for future runs
await saveCookies(context, './cookies.json')
console.log('Logged in and saved cookies')
} else {
console.log('Loaded existing cookies')
}
// Navigate to authenticated area
await page.goto('https://example.com/dashboard')
// Scrape authenticated content
const data = await harvester.harvest(
'https://example.com/dashboard',
'.data',
{ value: text('.value') }
)
console.log('Data:', data)
} finally {
await harvester.close()
}
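Conceptually, cookie persistence is a thin layer over Playwright’s BrowserContext API plus the filesystem. A minimal sketch of the idea (not the package’s actual source):
import { promises as fs } from 'fs'
// Sketch: serialize every cookie in the context to disk.
async function saveCookiesSketch(context, filePath) {
  const cookies = await context.cookies()
  await fs.writeFile(filePath, JSON.stringify(cookies, null, 2))
}
// Sketch: restore cookies if the file exists; report whether it did.
async function loadCookiesSketch(context, filePath) {
  try {
    const cookies = JSON.parse(await fs.readFile(filePath, 'utf8'))
    await context.addCookies(cookies)
    return true
  } catch {
    return false // no saved cookie file yet
  }
}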
Complete Session Management
For the most robust authentication persistence, use the SessionManager class. Unlike simple cookie persistence, it saves the complete browser state:
- Cookies
- localStorage
- sessionStorage
- Origins
- Permissions
This is essential for sites that store authentication state beyond cookies.
import { DOMHarvester, login, text } from 'domharvest-playwright'
import { SessionManager } from 'domharvest-playwright/auth'
const harvester = new DOMHarvester()
await harvester.init()
try {
const page = await harvester.getPage()
const context = page.context()
const sessionManager = new SessionManager('./sessions')
const sessionId = 'my-account'
// Try loading existing session
const loaded = await sessionManager.loadSession(context, sessionId)
if (!loaded) {
// First run - perform login
await login(page, 'https://example.com/login', {
username: process.env.USERNAME,
password: process.env.PASSWORD
})
// Save complete session state
await sessionManager.saveSession(context, sessionId)
console.log('Logged in and saved session')
} else {
console.log('Loaded existing session')
}
// Navigate to authenticated area
await page.goto('https://example.com/dashboard')
// Scrape authenticated content
const data = await harvester.harvest(
'https://example.com/dashboard',
'.data',
{ value: text('.value') }
)
console.log('Data:', data)
} finally {
await harvester.close()
}
SessionManager API:
const sessionManager = new SessionManager('./sessions')
// Save session
await sessionManager.saveSession(context, 'account-1')
// Load session
const loaded = await sessionManager.loadSession(context, 'account-1')
// Check if session exists
const exists = await sessionManager.hasSession('account-1')
// Delete session
await sessionManager.deleteSession('account-1')
// List all sessions
const sessions = await sessionManager.listSessions()
console.log('Available sessions:', sessions)
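Under the hood, this kind of full-session persistence maps naturally onto Playwright’s storageState, which captures cookies plus per-origin storage. A sketch of the save side under that assumption (not the actual SessionManager source):
import { promises as fs } from 'fs'
import path from 'path'
// Sketch: persist the context's storage state as <dir>/<id>.json.
async function saveSessionSketch(context, dir, id) {
  await fs.mkdir(dir, { recursive: true })
  await context.storageState({ path: path.join(dir, `${id}.json`) })
}
In raw Playwright, the saved state is normally restored when a context is created, via browser.newContext({ storageState }); a loadSession() that targets an existing context presumably re-injects cookies and storage itself.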
Multi-Account Support
The SessionManager makes multi-account scraping straightforward: each account gets its own session ID, so cookies and storage never mix between accounts:
import { DOMHarvester, login, text } from 'domharvest-playwright'
import { SessionManager } from 'domharvest-playwright/auth'
const accounts = [
{ id: 'account-1', username: process.env.USER1, password: process.env.PASS1 },
{ id: 'account-2', username: process.env.USER2, password: process.env.PASS2 }
]
const sessionManager = new SessionManager('./sessions')
for (const account of accounts) {
const harvester = new DOMHarvester()
await harvester.init()
try {
const page = await harvester.getPage()
const context = page.context()
// Load or create session for this account
const loaded = await sessionManager.loadSession(context, account.id)
if (!loaded) {
await login(page, 'https://example.com/login', {
username: account.username,
password: account.password
})
await sessionManager.saveSession(context, account.id)
}
// Scrape with this account
await page.goto('https://example.com/dashboard')
const data = await harvester.harvest(
'https://example.com/dashboard',
'.data',
{ value: text('.value') }
)
console.log(`Data for ${account.id}:`, data)
} finally {
await harvester.close()
}
}
Real-World Example: GitHub Scraper
Here’s a complete example scraping authenticated GitHub data:
import { DOMHarvester, login, text } from 'domharvest-playwright'
import { SessionManager } from 'domharvest-playwright/auth'
async function scrapeGitHubDashboard() {
const harvester = new DOMHarvester({
rateLimit: { requestsPerSecond: 1 }
})
await harvester.init()
try {
const page = await harvester.getPage()
const context = page.context()
const sessionManager = new SessionManager('./sessions')
const loaded = await sessionManager.loadSession(context, 'github')
if (!loaded) {
await login(page, 'https://github.com/login', {
username: process.env.GITHUB_USERNAME,
password: process.env.GITHUB_PASSWORD
}, {
usernameSelector: '#login_field',
passwordSelector: '#password'
})
await sessionManager.saveSession(context, 'github')
}
// Scrape dashboard
const repos = await harvester.harvest(
'https://github.com',
'.repo',
{
name: text('.repo-name'),
description: text('.repo-description'),
language: text('[itemprop="programmingLanguage"]'),
stars: text('.stars'),
updated: text('relative-time')
}
)
console.log(`Found ${repos.length} repositories`)
return repos
} finally {
await harvester.close()
}
}
await scrapeGitHubDashboard()
Security Best Practices
Never hardcode credentials. Always use environment variables:
// Good - environment variables
await login(page, url, {
username: process.env.USERNAME,
password: process.env.PASSWORD
})
// Bad - hardcoded credentials
await login(page, url, {
username: 'myuser@example.com', // Never do this!
password: 'mypassword123' // Never do this!
})
Store sessions outside version control:
# .gitignore
sessions/
cookies.json
*.session
Handle 2FA manually: For sites with two-factor authentication, run the initial login in headed mode so you can complete the challenge yourself; a session saved afterwards can typically be reused headlessly:
const harvester = new DOMHarvester({
headless: false // Visible browser for manual 2FA
})
Rotate sessions periodically:
const sessionManager = new SessionManager('./sessions')
// Delete old sessions
if (await sessionManager.hasSession('account-1')) {
await sessionManager.deleteSession('account-1')
}
// Force fresh login
await login(page, url, credentials)
await sessionManager.saveSession(context, 'account-1')
Enhanced Testing Infrastructure
Version 1.3.0 also brings significant improvements to code quality and testing:
Enforced Coverage Thresholds
We’ve implemented minimum coverage requirements enforced in CI:
- 80% minimum for lines, functions, and statements
- 70% minimum for branch coverage
Any PR that drops below these thresholds will fail CI, ensuring we maintain high code quality standards. The thresholds are configured like so:
{
"coverageThreshold": {
"global": {
"lines": 80,
"functions": 80,
"statements": 80,
"branches": 70
}
}
}
Current Coverage Stats
The project maintains 86%+ statement, function, and line coverage across all modules:
- Statement coverage: 86.62%
- Branch coverage: 71.65%
- Function coverage: 89.65%
- Line coverage: 86.62%
DSL Module Coverage
The new DSL module has 83% test coverage, including comprehensive tests for:
- All helper functions (text, attr, array, exists, html, count)
- Pure DSL mode execution
- Mixed mode execution
- Nested objects and complex schemas
- Error handling and edge cases
- Default values and null safety
Authentication Module Coverage
The authentication features have 95.84% test coverage, including:
- Form-based login with auto-detection
- Custom selector support
- Cookie persistence (save/load)
- SessionManager complete lifecycle
- Multi-account session isolation
- Error handling and validation
Comprehensive Documentation
Along with the code, we’ve added extensive documentation:
- DSL Guide - Complete API reference and practical examples
- Authentication Guide - Real-world authentication patterns
- Testing Guide - Testing best practices and coverage requirements
Migration Guide
Upgrading to 1.3.0 is seamless: there are no breaking changes.
Install the Update
npm install domharvest-playwright@1.3.0
Start Using DSL (Optional)
You can gradually adopt the DSL in new code or refactor existing extractors:
// Old code (still works)
const data = await harvester.harvest(url, selector, (el) => ({
title: el.querySelector('h1')?.textContent?.trim()
}))
// New DSL approach (recommended)
import { text } from 'domharvest-playwright'
const data = await harvester.harvest(url, selector, {
title: text('h1')
})
Add Authentication (Optional)
If you need authentication, import the new helpers:
import { login, saveCookies, loadCookies } from 'domharvest-playwright'
// or
import { SessionManager } from 'domharvest-playwright/auth'
What’s Next
Looking ahead, here are some features we’re considering for future releases:
- DSL helpers for forms - Declarative form filling
- Advanced wait strategies - More DSL helpers for dynamic content
- OAuth support - Built-in OAuth 2.0 flow handling
- Headless 2FA helpers - Programmatic 2FA token handling
- Proxy rotation - Built-in proxy management
- Distributed scraping - Multi-machine coordination
Have ideas? Open an issue on GitHub or reach out on Mastodon.
Conclusion
Version 1.3.0 represents a major step forward for domharvest-playwright. The declarative DSL makes data extraction cleaner and more maintainable, while authentication support unlocks entire categories of scraping use cases that previously required significant custom code.
Combined with enforced test coverage and comprehensive documentation, this release solidifies domharvest-playwright as a production-ready scraping framework.
Key highlights:
- Declarative DSL with 6 core helpers (text, attr, array, exists, html, count)
- Pure DSL mode for optimized browser-side execution
- Backward compatible - all existing code works unchanged
- Authentication support with form login, cookie persistence, and SessionManager
- Multi-account scraping made simple
- 95%+ auth coverage and 83% DSL coverage
- Enforced 80% coverage threshold in CI
- Comprehensive documentation for all new features
Whether you’re building a simple scraper or a complex multi-account data extraction pipeline, 1.3.0 gives you the tools to do it cleanly and reliably.
Upgrade today and start building better scrapers!
npm install domharvest-playwright@1.3.0
Questions or feedback? Open an issue on GitHub or reach out on Mastodon!