pitchfast/backlog/tasks/task-8 - Implement-Playwright-website-crawling-and-screenshot-capture.md

id: TASK-8
title: Implement Playwright website crawling and screenshot capture
status: To Do
assignee:
created_date: 2026-06-03 19:13
updated_date: 2026-06-04 14:08
labels: mvp, audit, playwright
dependencies: TASK-7
references: PRD.md
priority: high
ordinal: 8000

Description

Build the website inspection and contact-enrichment layer using Playwright. For each qualified lead, the system should load the company website, inspect the homepage and a small set of relevant subpages, capture desktop and mobile screenshots, extract visible text and contact signals, and store all raw evidence in Convex. Email candidates found during the crawl are fed back into the TASK-7 qualification rules before the lead is left in the Kontakt fehlt ("contact missing") status. Because Google Places does not provide business email fields, website crawl evidence is the primary MVP source for usable business email addresses.
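
As a rough illustration of the "email candidates with source URLs" idea, the following minimal sketch collects addresses from `mailto:` links and from visible text on one page. The `EmailCandidate` shape, the function name, and the regular expressions are assumptions made for this example, not part of the task definition, and the pattern is deliberately conservative rather than a full address parser.

```typescript
// Hypothetical record for one address seen on a crawled page.
interface EmailCandidate {
  email: string;           // lower-cased address exactly as published
  sourceUrl: string;       // page on which the address was seen
  via: "mailto" | "text";  // how the address was found
}

// Conservative patterns (assumption): good enough for a sketch, not RFC-complete.
const EMAIL_IN_TEXT = /[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}/gi;
const EMAIL_EXACT = /^[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}$/i;

function extractEmailCandidates(
  sourceUrl: string,
  visibleText: string,
  hrefs: string[],
): EmailCandidate[] {
  const found = new Map<string, EmailCandidate>();

  // mailto: links are checked first because they are explicit contact signals.
  for (const href of hrefs) {
    if (!/^mailto:/i.test(href)) continue;
    const addresses = href.slice("mailto:".length).replace(/\?.*$/, "").split(",");
    for (const raw of addresses) {
      const email = raw.trim().toLowerCase();
      if (EMAIL_EXACT.test(email) && !found.has(email)) {
        found.set(email, { email, sourceUrl, via: "mailto" });
      }
    }
  }

  // Addresses printed in visible text are added only if not already seen.
  for (const match of visibleText.match(EMAIL_IN_TEXT) ?? []) {
    const email = match.toLowerCase();
    if (!found.has(email)) {
      found.set(email, { email, sourceUrl, via: "text" });
    }
  }

  // Only published addresses are returned; nothing is guessed or synthesized.
  return Array.from(found.values());
}
```

Each candidate keeps the URL it was found on, so a later review step can trace every address back to its published source.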

Acceptance Criteria

  • #1 Playwright captures desktop and mobile screenshots for the homepage and stores them in Convex File Storage
  • #2 Crawler visits a bounded set of relevant subpages when discoverable: Kontakt (contact), Impressum (legal notice), Leistungen/Angebot (services/offering), and Über uns/Team (about us/team)
  • #3 Crawler extracts visible text, page title, meta description, headings, links, phone numbers, email candidates, email source URLs, contact-person context, and CTA/contact-form signals
  • #4 Extracted email candidates are classified through the TASK-7 rules: generic business emails are preferred; named emails are accepted only when explicitly published as business contact addresses; no guessed addresses are generated
  • #5 Leads discovered by Google Places with a website are automatically scheduled for contact enrichment before they are left in Kontakt fehlt; a usable email, once found, updates the lead's contact fields and status while preserving phone and source data
  • #6 Simple technical checks include HTTPS/final URL, missing title/meta description, visible contact path, and obvious broken internal links within the crawl limit
  • #7 Crawler failures produce useful dashboard-visible errors without blocking unrelated leads
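
Criteria #1 and #7 can be sketched together: capture one screenshot per viewport and turn any failure into a result value instead of an exception, so one broken site does not stop other leads. `PageLike` is a hypothetical structural subset of Playwright's `Page` (only `setViewportSize`, `goto`, and `screenshot` are used), which lets the sketch run without a browser; the viewport sizes and timeout are assumed values. Uploading the bytes to Convex File Storage is intentionally left out.

```typescript
// Hypothetical subset of Playwright's Page; a real Page is expected to satisfy it.
interface PageLike {
  setViewportSize(size: { width: number; height: number }): Promise<void>;
  goto(url: string, options?: { timeout?: number; waitUntil?: "load" | "domcontentloaded" }): Promise<unknown>;
  screenshot(options?: { fullPage?: boolean; type?: "png" | "jpeg" }): Promise<Uint8Array>;
}

type ViewportName = "desktop" | "mobile";

// Assumed viewport sizes; resizing alone is not full mobile emulation
// (user agent and touch settings live on the Playwright browser context).
const VIEWPORTS: { name: ViewportName; width: number; height: number }[] = [
  { name: "desktop", width: 1440, height: 900 },
  { name: "mobile", width: 390, height: 844 },
];

interface Shot {
  viewport: ViewportName;
  png: Uint8Array;
}

type CaptureResult =
  | { ok: true; shots: Shot[] }
  | { ok: false; error: string };

async function captureHomepage(
  page: PageLike,
  url: string,
  timeoutMs: number = 15000, // assumed per-navigation budget
): Promise<CaptureResult> {
  try {
    const shots: Shot[] = [];
    for (const vp of VIEWPORTS) {
      await page.setViewportSize({ width: vp.width, height: vp.height });
      // Reload per viewport so responsive layouts render for the new size.
      await page.goto(url, { timeout: timeoutMs, waitUntil: "load" });
      const png = await page.screenshot({ fullPage: true, type: "png" });
      shots.push({ viewport: vp.name, png });
    }
    return { ok: true, shots };
  } catch (err) {
    // Return the failure as data so the caller can record a dashboard-visible
    // error for this lead and continue with the next one.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning a tagged result rather than throwing is one possible design; it keeps per-lead error handling in the caller, where the run event can be written.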

Implementation Plan

  1. Add Playwright runtime setup compatible with local development and Coolify container deployment.
  2. Define crawl limits, viewports, timeout behavior, and allowed same-domain URL rules.
  3. Capture homepage desktop/mobile screenshots and upload them to Convex storage.
  4. Discover and inspect relevant subpages with bounded depth.
  5. Extract visible text, metadata, links, phone numbers, email candidates, contact-person context, CTA/contact-form signals, and source URLs.
  6. Normalize and score email candidates, then call the existing TASK-7 lead review/contact qualification path so usable emails update lead contact fields and unqualified named emails do not.
  7. Add contact-enrichment run state and dashboard-visible run events/errors for leads that still need manual contact research.
  8. Persist extracted raw evidence, technical checks, screenshots, and crawler errors in Convex.
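
Steps 2 and 4 of the plan could look roughly like the sketch below: a small limits object, a same-site URL rule, and a selector that picks at most one page per page type. Every name and number here (`CrawlLimits`, `DEFAULT_LIMITS`, `SUBPAGE_GROUPS`, `selectSubpages`, the `www.` handling) is an assumption for illustration, and keyword matching by substring is intentionally crude.

```typescript
// Illustrative limits (assumed values, not project settings).
interface CrawlLimits {
  maxSubpages: number;         // pages visited in addition to the homepage
  navigationTimeoutMs: number; // per-page navigation budget
}

const DEFAULT_LIMITS: CrawlLimits = { maxSubpages: 4, navigationTimeoutMs: 15000 };

// One keyword group per page type from the acceptance criteria, in priority order.
// "ueber-uns" is an assumed URL spelling of "Über uns".
const SUBPAGE_GROUPS: string[][] = [
  ["kontakt"],
  ["impressum"],
  ["leistungen", "angebot"],
  ["über uns", "ueber-uns", "team"],
];

interface DiscoveredLink {
  href: string; // raw href attribute, possibly relative
  text: string; // visible link text
}

const stripWww = (host: string): string => host.replace(/^www\./, "");

// Resolve a link against the homepage; null unless it is http(s) on the same site.
function toAllowedUrl(href: string, home: URL): URL | null {
  let target: URL;
  try {
    target = new URL(href, home);
  } catch {
    return null; // unparsable href
  }
  if (target.protocol !== "http:" && target.protocol !== "https:") return null;
  if (stripWww(target.hostname) !== stripWww(home.hostname)) return null;
  target.hash = ""; // fragments point at the same document
  return target;
}

function selectSubpages(
  homepageUrl: string,
  links: DiscoveredLink[],
  limits: CrawlLimits = DEFAULT_LIMITS,
): string[] {
  const home = new URL(homepageUrl);
  const picked: string[] = [];
  for (const group of SUBPAGE_GROUPS) {
    if (picked.length >= limits.maxSubpages) break;
    for (const link of links) {
      const target = toAllowedUrl(link.href, home);
      if (!target || target.href === home.href || picked.includes(target.href)) continue;
      const haystack = `${target.pathname} ${link.text}`.toLowerCase();
      if (group.some((keyword) => haystack.includes(keyword))) {
        picked.push(target.href);
        break; // at most one page per group keeps the crawl bounded
      }
    }
  }
  return picked;
}
```

Because the selector only looks at links found on the homepage, the crawl depth in this sketch is fixed at one hop; a deeper crawl would need an explicit depth counter.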

Implementation Notes

Expanded TASK-8 to cover website-based contact enrichment because Google Places does not provide business email fields. This keeps email handling evidence-based and reuses TASK-7 qualification rules instead of guessing addresses.
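
To make "evidence-based instead of guessing" concrete, here is a minimal sketch of how a crawler-side selection step might rank candidates before handing them to the TASK-7 path. The actual TASK-7 rules are not shown in this task, so the generic local-part list, the `publishedAsBusinessContact` flag, and the function names are all assumptions.

```typescript
type EmailKind = "generic" | "named";

interface ScoredCandidate {
  email: string;
  sourceUrl: string;
  // Hypothetical flag set by the crawler when the address is presented as a
  // business contact (for example on a Kontakt or Impressum page).
  publishedAsBusinessContact: boolean;
}

// Assumed list of role-style local parts; the real TASK-7 rules may differ.
const GENERIC_LOCAL_PARTS: string[] = ["info", "kontakt", "office", "mail", "service", "hallo"];

function emailKind(email: string): EmailKind {
  const localPart = email.slice(0, email.indexOf("@")).toLowerCase();
  return GENERIC_LOCAL_PARTS.includes(localPart) ? "generic" : "named";
}

// Generic addresses count as usable; named ones only with explicit publication.
function isUsable(candidate: ScoredCandidate): boolean {
  return emailKind(candidate.email) === "generic" || candidate.publishedAsBusinessContact;
}

// Returns the preferred published address, or null when the lead should stay
// in Kontakt fehlt. Only crawled candidates are considered; none are invented.
function pickBestEmail(candidates: ScoredCandidate[]): ScoredCandidate | null {
  const usable = candidates.filter(isUsable);
  const generic = usable.find((c) => emailKind(c.email) === "generic");
  return generic ?? usable[0] ?? null;
}
```

A `null` result here corresponds to a lead that still needs manual contact research.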