Max B. - Web Scraping & Automation

domharvest-playwright 1.3.0: Declarative DSL and Authentication Support

13 Jan 2026

I’m excited to announce the release of domharvest-playwright 1.3.0, the most significant update since the project launched. This release introduces two major features that fundamentally change how you build web scrapers: a declarative DSL for data extraction and comprehensive authentication support. We’ve also strengthened the testing infrastructure with enforced code coverage thresholds.

Let’s dive into what’s new.

Declarative DSL: Cleaner Data Extraction

The biggest addition in 1.3.0 is a powerful Domain Specific Language (DSL) for declarative data extraction. If you’ve been writing scrapers with domharvest-playwright, you know the pattern: CSS selectors, querySelector calls, optional chaining, and trim() everywhere. It works, but it’s verbose.

The new DSL eliminates that boilerplate entirely.

Before and After

Traditional approach (still supported):

const products = await harvester.harvest(
  'https://example.com/products',
  '.product',
  (el) => ({
    name: el.querySelector('h2')?.textContent?.trim(),
    price: el.querySelector('.price')?.textContent?.trim(),
    image: el.querySelector('img')?.getAttribute('src'),
    inStock: el.querySelector('.stock-badge') !== null
  })
)

DSL approach (new in 1.3.0):

import { text, attr, exists } from 'domharvest-playwright'

const products = await harvester.harvest(
  'https://example.com/products',
  '.product',
  {
    name: text('h2'),
    price: text('.price'),
    image: attr('img', 'src'),
    inStock: exists('.stock-badge')
  }
)

The difference is striking. The DSL version is shorter, cleaner, and more declarative. You focus on what you want to extract, not how to extract it.

Complete Helper Function Reference

The DSL provides six core helpers covering the most common extraction patterns:

1. text() — Extract Text Content

import { text } from 'domharvest-playwright'

// Basic usage
text('h1')  // Extracts and trims text from <h1>

// With default value
text('.subtitle', { default: 'No subtitle' })

// Without trimming
text('.description', { trim: false })

The text() helper automatically handles null safety and trimming, eliminating the need for ?.textContent?.trim() everywhere.

2. attr() — Extract Attributes

import { attr } from 'domharvest-playwright'

// Get attribute value
attr('img', 'src')  // Get src attribute from <img>
attr('a', 'href')   // Get href from <a>

// With default value
attr('img', 'alt', { default: 'No description' })

Perfect for extracting URLs, IDs, data attributes, or any HTML attribute.

3. array() — Process Collections

import { array, text, attr } from 'domharvest-playwright'

// Extract array of text values
array('.tag', text())

// Extract array of links
array('a', attr('href'))

// Extract array of objects
array('.review', {
  author: text('.author'),
  rating: text('.rating'),
  comment: text('.comment')
})

The array() helper simplifies working with collections, replacing verbose Array.from() and .map() chains.
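
For comparison, here is roughly the plain-DOM boilerplate that a single array() call replaces (an illustrative sketch, not output from the library):

// Manual equivalent of array('.tag', text()) in a function extractor
(el) => ({
  tags: Array.from(el.querySelectorAll('.tag')).map(
    (tag) => tag.textContent?.trim()
  )
})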

4. exists() — Check Element Presence

import { exists } from 'domharvest-playwright'

// Check if element exists
exists('.premium-badge')  // Returns true/false
exists('.out-of-stock')   // Boolean presence check

Clean boolean checks without !== null comparisons.

5. html() — Extract HTML Content

import { html } from 'domharvest-playwright'

// Get innerHTML
html('.description')

// With default value
html('.content', { default: '<p>No content</p>' })

Useful when you need the raw HTML structure, not just text.

6. count() — Count Elements

import { count } from 'domharvest-playwright'

// Count matching elements
count('.review')      // Number of reviews
count('.star.filled') // Number of filled stars

Simple element counting without manual array length checks.

Real-World Example: E-commerce Scraper

Here’s a complete example showing how these helpers work together:

import { DOMHarvester, text, attr, array, exists, count } from 'domharvest-playwright'

const harvester = new DOMHarvester({
  rateLimit: { requestsPerSecond: 2 }
})

await harvester.init()

try {
  const products = await harvester.harvest(
    'https://example.com/products',
    '.product-card',
    {
      // Basic text extraction
      name: text('h2.product-name'),
      price: text('.price'),
      description: text('.description', { default: 'No description' }),

      // Attribute extraction
      productId: attr('[data-product-id]', 'data-product-id'),
      image: attr('img.product-image', 'src'),
      link: attr('a.product-link', 'href'),

      // Arrays
      tags: array('.tag', text()),
      images: array('.gallery img', attr('src')),

      // Booleans
      inStock: exists('.in-stock-badge'),
      onSale: exists('.sale-badge'),

      // Counts
      reviewCount: count('.review'),

      // Nested objects
      reviews: array('.review', {
        author: text('.review-author'),
        rating: text('.review-rating'),
        date: text('.review-date'),
        comment: text('.review-comment')
      })
    }
  )

  console.log(`Extracted ${products.length} products`)
  console.log(products[0])
} finally {
  await harvester.close()
}

Pure DSL vs Mixed Mode

The DSL supports two execution modes:

Pure DSL Mode: When your extractor uses only DSL helpers, execution happens entirely in the browser context for optimal performance:

// Pure DSL - optimized browser-side execution
{
  name: text('h1'),
  price: text('.price'),
  tags: array('.tag', text())
}

Mixed Mode: You can combine DSL helpers with custom functions when needed:

// Mixed mode - combines DSL with custom logic
{
  name: text('h1'),
  price: text('.price'),
  discount: (el) => {
    const original = parseFloat(el.querySelector('.original-price')?.textContent || '0')
    const current = parseFloat(el.querySelector('.current-price')?.textContent || '0')
    return Math.round(((original - current) / original) * 100)
  }
}

Mixed mode falls back to Node.js-side execution but still provides all the benefits of DSL helpers where used.

Backward Compatibility

Important: The DSL is completely optional and fully backward compatible. All existing code continues to work unchanged. You can:

  • Use function extractors exclusively (existing approach)
  • Use DSL extractors exclusively (new approach)
  • Mix both approaches in the same project
  • Gradually migrate to DSL at your own pace

There are no breaking changes in this release.
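
For instance, a function extractor and a DSL extractor can sit side by side in the same project (a minimal sketch reusing the product-listing example from above):

// Existing function extractor, left untouched
const legacy = await harvester.harvest(
  'https://example.com/products',
  '.product',
  (el) => ({ name: el.querySelector('h2')?.textContent?.trim() })
)

// New DSL extractor added alongside it
const modern = await harvester.harvest(
  'https://example.com/products',
  '.product',
  { name: text('h2') }
)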

Authentication and Session Management

The second major feature in 1.3.0 is comprehensive authentication support. Many real-world scraping scenarios require authentication: accessing user profiles, scraping private repositories, extracting personalized data, or monitoring members-only areas.

Previously, you had to handle authentication manually with Playwright’s API. Now, domharvest-playwright provides built-in helpers for the most common patterns.

Form-Based Login

The login() helper automatically detects and fills common login forms:

import { DOMHarvester, login, text } from 'domharvest-playwright'

const harvester = new DOMHarvester()
await harvester.init()

try {
  const page = await harvester.getPage()

  // Automatic login with form detection
  await login(page, 'https://example.com/login', {
    username: process.env.USERNAME,
    password: process.env.PASSWORD
  })

  // Now scrape authenticated content
  const data = await harvester.harvest(
    'https://example.com/dashboard',
    '.data',
    { value: text('.value') }
  )

  console.log('Authenticated data:', data)
} finally {
  await harvester.close()
}

The login() helper:

  • Automatically detects common login form patterns
  • Fills username and password fields
  • Submits the form
  • Waits for navigation to complete

For non-standard forms, you can provide custom selectors:

await login(page, 'https://example.com/signin', {
  username: process.env.USERNAME,
  password: process.env.PASSWORD
}, {
  usernameSelector: '#email',
  passwordSelector: '#pwd',
  submitSelector: 'button[type="submit"]'
})

Cookie Persistence

Save and restore authentication cookies to skip login on subsequent runs:

import { DOMHarvester, login, text, saveCookies, loadCookies } from 'domharvest-playwright'

const harvester = new DOMHarvester()
await harvester.init()

try {
  const page = await harvester.getPage()
  const context = page.context()

  // Try loading existing cookies
  const cookiesExist = await loadCookies(context, './cookies.json')

  if (!cookiesExist) {
    // First run - perform login
    await login(page, 'https://example.com/login', {
      username: process.env.USERNAME,
      password: process.env.PASSWORD
    })

    // Save cookies for future runs
    await saveCookies(context, './cookies.json')
    console.log('Logged in and saved cookies')
  } else {
    console.log('Loaded existing cookies')
  }

  // Navigate to authenticated area
  await page.goto('https://example.com/dashboard')

  // Scrape authenticated content
  const data = await harvester.harvest(
    'https://example.com/dashboard',
    '.data',
    { value: text('.value') }
  )

  console.log('Data:', data)
} finally {
  await harvester.close()
}

Complete Session Management

For the most robust authentication persistence, use the SessionManager class. Unlike simple cookie persistence, it saves the complete browser state:

  • Cookies
  • localStorage
  • sessionStorage
  • Origins
  • Permissions

This is essential for sites that store authentication state beyond cookies.

import { DOMHarvester, login, text } from 'domharvest-playwright'
import { SessionManager } from 'domharvest-playwright/auth'

const harvester = new DOMHarvester()
await harvester.init()

try {
  const page = await harvester.getPage()
  const context = page.context()

  const sessionManager = new SessionManager('./sessions')
  const sessionId = 'my-account'

  // Try loading existing session
  const loaded = await sessionManager.loadSession(context, sessionId)

  if (!loaded) {
    // First run - perform login
    await login(page, 'https://example.com/login', {
      username: process.env.USERNAME,
      password: process.env.PASSWORD
    })

    // Save complete session state
    await sessionManager.saveSession(context, sessionId)
    console.log('Logged in and saved session')
  } else {
    console.log('Loaded existing session')
  }

  // Navigate to authenticated area
  await page.goto('https://example.com/dashboard')

  // Scrape authenticated content
  const data = await harvester.harvest(
    'https://example.com/dashboard',
    '.data',
    { value: text('.value') }
  )

  console.log('Data:', data)
} finally {
  await harvester.close()
}

SessionManager API:

const sessionManager = new SessionManager('./sessions')

// Save session
await sessionManager.saveSession(context, 'account-1')

// Load session
const loaded = await sessionManager.loadSession(context, 'account-1')

// Check if session exists
const exists = await sessionManager.hasSession('account-1')

// Delete session
await sessionManager.deleteSession('account-1')

// List all sessions
const sessions = await sessionManager.listSessions()
console.log('Available sessions:', sessions)

Multi-Account Support

The SessionManager makes multi-account scraping trivial:

import { DOMHarvester, login, text } from 'domharvest-playwright'
import { SessionManager } from 'domharvest-playwright/auth'

const accounts = [
  { id: 'account-1', username: process.env.USER1, password: process.env.PASS1 },
  { id: 'account-2', username: process.env.USER2, password: process.env.PASS2 }
]

const sessionManager = new SessionManager('./sessions')

for (const account of accounts) {
  const harvester = new DOMHarvester()
  await harvester.init()

  try {
    const page = await harvester.getPage()
    const context = page.context()

    // Load or create session for this account
    const loaded = await sessionManager.loadSession(context, account.id)

    if (!loaded) {
      await login(page, 'https://example.com/login', {
        username: account.username,
        password: account.password
      })
      await sessionManager.saveSession(context, account.id)
    }

    // Scrape with this account
    await page.goto('https://example.com/dashboard')
    const data = await harvester.harvest(
      'https://example.com/dashboard',
      '.data',
      { value: text('.value') }
    )

    console.log(`Data for ${account.id}:`, data)
  } finally {
    await harvester.close()
  }
}

Real-World Example: GitHub Scraper

Here’s a complete example scraping authenticated GitHub data:

import { DOMHarvester, login, text } from 'domharvest-playwright'
import { SessionManager } from 'domharvest-playwright/auth'

async function scrapeGitHubDashboard() {
  const harvester = new DOMHarvester({
    rateLimit: { requestsPerSecond: 1 }
  })

  await harvester.init()

  try {
    const page = await harvester.getPage()
    const context = page.context()

    const sessionManager = new SessionManager('./sessions')
    const loaded = await sessionManager.loadSession(context, 'github')

    if (!loaded) {
      await login(page, 'https://github.com/login', {
        username: process.env.GITHUB_USERNAME,
        password: process.env.GITHUB_PASSWORD
      }, {
        usernameSelector: '#login_field',
        passwordSelector: '#password'
      })

      await sessionManager.saveSession(context, 'github')
    }

    // Scrape dashboard
    const repos = await harvester.harvest(
      'https://github.com',
      '.repo',
      {
        name: text('.repo-name'),
        description: text('.repo-description'),
        language: text('[itemprop="programmingLanguage"]'),
        stars: text('.stars'),
        updated: text('relative-time')
      }
    )

    console.log(`Found ${repos.length} repositories`)
    return repos
  } finally {
    await harvester.close()
  }
}

await scrapeGitHubDashboard()

Security Best Practices

Never hardcode credentials. Always use environment variables:

// Good - environment variables
await login(page, url, {
  username: process.env.USERNAME,
  password: process.env.PASSWORD
})

// Bad - hardcoded credentials
await login(page, url, {
  username: 'myuser@example.com',  // Never do this!
  password: 'mypassword123'         // Never do this!
})
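
If you keep credentials in a local .env file, a loader such as dotenv can populate process.env before the scraper runs. This is an illustrative setup, not part of domharvest-playwright, and the .env file must be gitignored as well:

// Load .env into process.env before anything else runs
import 'dotenv/config'
import { DOMHarvester, login } from 'domharvest-playwright'

// .env (never committed):
// USERNAME=myuser@example.com
// PASSWORD=change-me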

Store sessions outside version control:

# .gitignore
sessions/
cookies.json
*.session

Handle 2FA manually: For sites with two-factor authentication, run in headed mode:

const harvester = new DOMHarvester({
  headless: false  // Visible browser for manual 2FA
})
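
Continuing from the headed harvester above, one workable pattern is to let the operator complete the 2FA challenge in the visible browser, wait for a post-login URL, and then persist the session so later runs can stay headless. This is a sketch that assumes the dashboard URL reliably signals a finished login:

import { login } from 'domharvest-playwright'
import { SessionManager } from 'domharvest-playwright/auth'

await harvester.init()
const page = await harvester.getPage()

await login(page, 'https://example.com/login', {
  username: process.env.USERNAME,
  password: process.env.PASSWORD
})

// Complete the 2FA prompt manually in the visible browser,
// then wait for the authenticated area to load
await page.waitForURL('**/dashboard', { timeout: 120_000 })

// Persist the authenticated state for future headless runs
const sessionManager = new SessionManager('./sessions')
await sessionManager.saveSession(page.context(), 'my-account')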

Rotate sessions periodically:

const sessionManager = new SessionManager('./sessions')

// Delete old sessions
if (await sessionManager.hasSession('account-1')) {
  await sessionManager.deleteSession('account-1')
}

// Force fresh login
await login(page, url, credentials)
await sessionManager.saveSession(context, 'account-1')

Enhanced Testing Infrastructure

Version 1.3.0 also brings significant improvements to code quality and testing:

Enforced Coverage Thresholds

We’ve implemented minimum coverage requirements enforced in CI:

  • 80% minimum for lines, functions, and statements
  • 70% minimum for branch coverage

Any PR that drops below these thresholds will fail CI, ensuring we maintain high code quality standards.
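
The enforced thresholds correspond to a coverage configuration along these lines (shown in Jest-style coverageThreshold syntax; the exact key depends on the test runner):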

{
  "coverageThreshold": {
    "global": {
      "lines": 80,
      "functions": 80,
      "statements": 80,
      "branches": 70
    }
  }
}

Current Coverage Stats

The project maintains 86%+ test coverage across all modules:

  • Statement coverage: 86.62%
  • Branch coverage: 71.65%
  • Function coverage: 89.65%
  • Line coverage: 86.62%

DSL Module Coverage

The new DSL module has 83% test coverage, including comprehensive tests for:

  • All helper functions (text, attr, array, exists, html, count)
  • Pure DSL mode execution
  • Mixed mode execution
  • Nested objects and complex schemas
  • Error handling and edge cases
  • Default values and null safety

Authentication Module Coverage

The authentication features have 95.84% test coverage, including:

  • Form-based login with auto-detection
  • Custom selector support
  • Cookie persistence (save/load)
  • SessionManager complete lifecycle
  • Multi-account session isolation
  • Error handling and validation

Comprehensive Documentation

Along with the code, we’ve added extensive documentation covering the DSL, the authentication helpers, and the migration path outlined below.

Migration Guide

Upgrading to 1.3.0 is seamless—there are no breaking changes.

Install the Update

npm install domharvest-playwright@1.3.0

Start Using DSL (Optional)

You can gradually adopt the DSL in new code or refactor existing extractors:

// Old code (still works)
const data = await harvester.harvest(url, selector, (el) => ({
  title: el.querySelector('h1')?.textContent?.trim()
}))

// New DSL approach (recommended)
import { text } from 'domharvest-playwright'

const data = await harvester.harvest(url, selector, {
  title: text('h1')
})

Add Authentication (Optional)

If you need authentication, import the new helpers:

import { login, saveCookies, loadCookies } from 'domharvest-playwright'
// or
import { SessionManager } from 'domharvest-playwright/auth'

What’s Next

Looking ahead, here are some features we’re considering for future releases:

  • DSL helpers for forms - Declarative form filling
  • Advanced wait strategies - More DSL helpers for dynamic content
  • OAuth support - Built-in OAuth 2.0 flow handling
  • Headless 2FA helpers - Programmatic 2FA token handling
  • Proxy rotation - Built-in proxy management
  • Distributed scraping - Multi-machine coordination

Have ideas? Open an issue on GitHub or reach out on Mastodon.

Conclusion

Version 1.3.0 represents a major step forward for domharvest-playwright. The declarative DSL makes data extraction cleaner and more maintainable, while authentication support unlocks entire categories of scraping use cases that previously required significant custom code.

Combined with enforced test coverage and comprehensive documentation, this release solidifies domharvest-playwright as a production-ready scraping framework.

Key highlights:

  • Declarative DSL with 6 core helpers (text, attr, array, exists, html, count)
  • Pure DSL mode for optimized browser-side execution
  • Backward compatible - all existing code works unchanged
  • Authentication support with form login, cookie persistence, and SessionManager
  • Multi-account scraping made simple
  • 95%+ auth coverage and 83% DSL coverage
  • Enforced 80% coverage threshold in CI
  • Comprehensive documentation for all new features

Whether you’re building a simple scraper or a complex multi-account data extraction pipeline, 1.3.0 gives you the tools to do it cleanly and reliably.

Upgrade today and start building better scrapers!

npm install domharvest-playwright@1.3.0

Questions or feedback? Open an issue on GitHub or reach out on Mastodon!