feat: add website enrichment crawler
@@ -1,10 +1,10 @@
---
id: TASK-8
title: Implement Playwright website crawling and screenshot capture
status: To Do
status: In Progress
assignee: []
created_date: '2026-06-03 19:13'
updated_date: '2026-06-04 14:08'
updated_date: '2026-06-04 18:09'
labels:
- mvp
- audit
@@ -25,32 +25,51 @@ Build the website inspection and contact-enrichment layer using Playwright. For
## Acceptance Criteria
<!-- AC:BEGIN -->
- [ ] #1 Playwright captures desktop and mobile screenshots for the homepage and stores them in Convex File Storage
- [ ] #2 Crawler visits a bounded set of relevant subpages: Kontakt, Impressum, Leistungen/Angebot, Über uns/Team when discoverable
- [ ] #3 Crawler extracts visible text, page title, meta description, headings, links, phone numbers, email candidates, email source URLs, contact-person context, and CTA/contact-form signals
- [ ] #4 Extracted email candidates are classified through the TASK-7 rules: generic business emails are preferred; named emails are accepted only when explicitly published as business contact addresses; no guessed addresses are generated
- [ ] #5 Leads discovered by Google Places with a website are automatically scheduled for contact enrichment before they remain in Kontakt fehlt; found usable email updates the lead contact fields and status while preserving phone and source data
- [ ] #6 Simple technical checks include HTTPS/final URL, missing title/meta description, visible contact path, and obvious broken internal links within the crawl limit
- [ ] #7 Crawler failures produce useful dashboard-visible errors without blocking unrelated leads
- [x] #1 Playwright captures desktop and mobile screenshots for the homepage and stores them in Convex File Storage
- [x] #2 Crawler visits a bounded set of relevant subpages: Kontakt, Impressum, Leistungen/Angebot, Über uns/Team when discoverable
- [x] #3 Crawler extracts visible text, page title, meta description, headings, links, phone numbers, email candidates, email source URLs, contact-person context, and CTA/contact-form signals
- [x] #4 Extracted email candidates are classified through the TASK-7 rules: generic business emails are preferred; named emails are accepted only when explicitly published as business contact addresses; no guessed addresses are generated
- [x] #5 Leads discovered by Google Places with a website are automatically scheduled for contact enrichment before they remain in Kontakt fehlt; found usable email updates the lead contact fields and status while preserving phone and source data
- [x] #6 Simple technical checks include HTTPS/final URL, missing title/meta description, visible contact path, and obvious broken internal links within the crawl limit
- [x] #7 Crawler failures produce useful dashboard-visible errors without blocking unrelated leads
<!-- AC:END -->
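
For illustration only: a minimal sketch of how the simple technical checks in criterion #6 could be expressed as a pure function over already-collected page evidence. The `PageEvidence` and `Finding` shapes, the check names, and `runTechnicalChecks` are hypothetical and are not taken from lib/website-crawler.ts.

```typescript
// Hypothetical evidence shape for one crawled page; all field names are assumptions.
interface PageEvidence {
  finalUrl: string; // URL after redirects
  title: string | null;
  metaDescription: string | null;
  // status is null when the link was not fetched within the crawl limit
  internalLinks: { href: string; status: number | null }[];
}

interface Finding {
  check: "https" | "title" | "meta_description" | "contact_path" | "broken_link";
  detail: string;
}

// Path fragments treated as a visible contact path (assumed keyword list).
const CONTACT_PATH = /kontakt|contact|impressum/i;

function runTechnicalChecks(page: PageEvidence): Finding[] {
  const findings: Finding[] = [];
  if (!page.finalUrl.toLowerCase().startsWith("https://")) {
    findings.push({ check: "https", detail: `final URL is not HTTPS: ${page.finalUrl}` });
  }
  if (!page.title || page.title.trim() === "") {
    findings.push({ check: "title", detail: "missing page title" });
  }
  if (!page.metaDescription || page.metaDescription.trim() === "") {
    findings.push({ check: "meta_description", detail: "missing meta description" });
  }
  if (!page.internalLinks.some((link) => CONTACT_PATH.test(link.href))) {
    findings.push({ check: "contact_path", detail: "no visible contact path" });
  }
  for (const link of page.internalLinks) {
    // Only links that were actually fetched can be reported as broken.
    if (link.status !== null && link.status >= 400) {
      findings.push({ check: "broken_link", detail: `${link.href} returned ${link.status}` });
    }
  }
  return findings;
}
```

Keeping the checks free of browser calls is one way to make them testable without launching Chromium; whether the project splits them this way is not stated in this task.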

## Implementation Plan

<!-- SECTION:PLAN:BEGIN -->
1. Add Playwright runtime setup compatible with local development and Coolify container deployment.
2. Define crawl limits, viewports, timeout behavior, and allowed same-domain URL rules.
3. Capture homepage desktop/mobile screenshots and upload them to Convex storage.
4. Discover and inspect relevant subpages with bounded depth.
5. Extract visible text, metadata, links, phone numbers, email candidates, contact-person context, CTA/contact-form signals, and source URLs.
6. Normalize and score email candidates, then call the existing TASK-7 lead review/contact qualification path so usable emails update lead contact fields and unqualified named emails do not.
7. Add contact-enrichment run state and dashboard-visible run events/errors for leads that still need manual contact research.
8. Persist extracted raw evidence, technical checks, screenshots, and crawler errors in Convex.
1. Worker A: add pure crawler/extraction helpers with RED/GREEN tests.
2. Worker B: add Convex schema/run/storage persistence with RED/GREEN tests.
3. Worker C: wire lead-discovery scheduling/contact update flow with RED/GREEN tests.
4. Worker D: add dashboard-visible enrichment state/error UI with RED/GREEN tests where practical.
5. Orchestrator: run spec review, code-quality review, full verification, and update acceptance criteria without marking Done.
<!-- SECTION:PLAN:END -->
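
As a rough sketch of the crawl limits and same-domain rule named in the plan, the helper below picks a bounded list of relevant subpages from the links found on the homepage. Every name and value here (`CrawlLimits`, `DEFAULT_LIMITS`, the viewport sizes, the timeout, the keyword patterns, `selectSubpages`) is an assumption made for illustration and is not the project's actual configuration.

```typescript
// All names and values below are illustrative assumptions.
interface CrawlLimits {
  maxSubpages: number;
  navigationTimeoutMs: number;
  viewports: { desktop: [number, number]; mobile: [number, number] };
}

const DEFAULT_LIMITS: CrawlLimits = {
  maxSubpages: 4,
  navigationTimeoutMs: 15_000,
  viewports: { desktop: [1366, 768], mobile: [390, 844] },
};

// Path keywords for Kontakt, Impressum, Leistungen/Angebot, Über uns/Team.
const RELEVANT_PATHS = [/kontakt|contact/i, /impressum/i, /leistung|angebot/i, /(ue|ü|u)ber-uns|team/i];

function selectSubpages(
  homepageUrl: string,
  hrefs: string[],
  limits: CrawlLimits = DEFAULT_LIMITS,
): string[] {
  const home = new URL(homepageUrl);
  const picked: string[] = [];
  for (const href of hrefs) {
    if (picked.length >= limits.maxSubpages) break; // bounded crawl
    let url: URL;
    try {
      url = new URL(href, home); // resolves relative links against the homepage
    } catch {
      continue; // unparsable href
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue; // skips mailto:, tel:, ...
    if (url.hostname !== home.hostname) continue; // strict same-host rule
    url.hash = ""; // fragments do not identify a different page
    let path = url.pathname;
    try {
      path = decodeURIComponent(path); // so "/%C3%BCber-uns" is matched as "/über-uns"
    } catch {
      // keep the encoded path if it cannot be decoded
    }
    if (!RELEVANT_PATHS.some((pattern) => pattern.test(path))) continue;
    const key = url.toString();
    if (!picked.includes(key)) picked.push(key);
  }
  return picked;
}
```

The strict hostname comparison treats `www.example.de` and `example.de` as different hosts; a real rule set would need to decide whether that is intended.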

## Implementation Notes

<!-- SECTION:NOTES:BEGIN -->
Expanded TASK-8 to cover website-based contact enrichment because Google Places does not provide business email fields. This keeps email handling evidence-based and reuses TASK-7 qualification rules instead of guessing addresses.

Orchestration started on branch codex-task-8-playwright-enrichment. Parallel wave 1 dispatched with gpt-5.3-codex-spark: Worker A owns lib/website-crawler.ts + tests/website-crawler.test.ts; Worker B owns convex/schema.ts + schema tests; Worker C owns Playwright package/runtime docs. All workers instructed to use TDD or config verification and avoid unrelated changes.

Completed wave 1 foundations: Playwright runtime/docs approved; crawler helper spec+quality approved; Convex enrichment schema/run-type parity spec+quality approved. Wave 2 dispatched with gpt-5.3-codex-spark: Worker D owns convex/websiteEnrichment.ts action/persistence; Worker E owns lead-discovery scheduling integration. Orchestrator remains code-review/integration only.

2026-06-04: Worker D started implementing convex/websiteEnrichment.ts with unit/source tests for queue/process/persist enrichment flow and Playwright evidence capture.

2026-06-04: Added TASK-8 source tests for website-enrichment action queue/process/persistence contract and confirmed all assertions pass with existing implementation.

Worker G retry: moved website enrichment scheduling out of persistDiscoveredLeads into processCampaignRun (returns queue items), scoped startCampaignRun active checks to by_type_and_status campaign running, and added source assertions for this sequencing.

Implementation complete pending user confirmation. Built Playwright Chromium website enrichment with bounded crawl, desktop/mobile screenshot storage, raw evidence tables, TASK-7 email qualification reuse, post-discovery scheduling, technical checks, and dashboard-visible run events/errors. Final verification passed: pnpm exec tsc -p tsconfig.json; pnpm test (105/105); pnpm lint (0 errors, existing generated BetterAuth warnings only); pnpm exec convex codegen --dry-run --typecheck enable.

2026-06-04: Updated source tests/README/.env for TASK-8 browser-runtime strategy migration to @sparticuz/chromium-min and TASK8_BROWSER_ASSET_URL deployment expectations.

Resolved Convex Playwright runtime follow-up: local npx playwright install only populates the developer machine cache, not the Convex runtime. Full playwright was replaced with playwright-core + @sparticuz/chromium-min and a required TASK8_BROWSER_ASSET_URL source so Convex no longer relies on /home/sbx_user ms-playwright cache. Verification passed: pnpm exec tsc -p tsconfig.json; pnpm test; pnpm lint (existing generated BetterAuth warnings only); pnpm exec convex codegen --dry-run --typecheck enable.

TASK-21 runtime cache fix applied to TASK-8 crawler action: the stale @sparticuz/chromium-min /tmp cache is invalidated when the browser asset source changes, addressing the repeated "/tmp/chromium cannot execute binary file" errors after x64/arm64 URL changes.
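
One way to picture the TASK-21 invalidation rule is a marker that records which asset URL produced the cached binary. This is a hedged sketch, not the code in convex/websiteEnrichment.ts: `CacheStore`, `MARKER_PATH`, `prepareBrowserCache`, and `createMemoryStore` are hypothetical names, and only the /tmp/chromium path comes from the note above.

```typescript
// Minimal file-system surface so the rule can be exercised without touching /tmp.
interface CacheStore {
  read(path: string): string | null;
  write(path: string, contents: string): void;
  remove(path: string): void;
}

const BINARY_PATH = "/tmp/chromium"; // path named in the error above
const MARKER_PATH = "/tmp/chromium.source"; // hypothetical marker file

// Drops the cached binary whenever the recorded asset URL differs from the current one.
function prepareBrowserCache(store: CacheStore, assetUrl: string): "reused" | "invalidated" {
  if (store.read(MARKER_PATH) === assetUrl) return "reused";
  // Source changed or unknown: remove the binary so it is fetched again for this URL.
  store.remove(BINARY_PATH);
  store.write(MARKER_PATH, assetUrl);
  return "invalidated";
}

// In-memory store used only to demonstrate the behavior.
function createMemoryStore(initial: Record<string, string> = {}): CacheStore {
  const files = new Map<string, string>(Object.entries(initial));
  return {
    read: (path) => files.get(path) ?? null,
    write: (path, contents) => {
      files.set(path, contents);
    },
    remove: (path) => {
      files.delete(path);
    },
  };
}
```

Under this sketch a first run without a marker is also treated as stale, which costs one extra download but never reuses a binary of unknown origin.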

TASK-8 crawler action now explicitly prepares @sparticuz/chromium-min AL2023 shared libraries for Convex to address /tmp/chromium libnspr4.so missing errors before screenshot/crawl launch.

TASK-23 extractor improvement applied: website enrichment now extracts published emails from mailto links with query params, common German obfuscations, HTML entities/spaced separators, and footer/impressum/contact contexts while preserving TASK-7 no-guessing rules.
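
A minimal sketch of the kinds of cases the TASK-23 note lists, assuming a pure helper that receives visible text and link hrefs. The function names, the obfuscation patterns, and the email pattern are illustrative assumptions, not the extractor in lib/website-crawler.ts. It only returns addresses that literally appear in the input, in line with the no-guessing rule.

```typescript
// Simplified email pattern, shared by the scanner and the whole-string check.
const EMAIL = "[a-z0-9._%+-]+@[a-z0-9-]+(?:\\.[a-z0-9-]+)*\\.[a-z]{2,}";

// Decodes numeric entities plus the named entities for "@" and ".".
function decodeEntities(text: string): string {
  return text
    .replace(/&#x([0-9a-f]+);/gi, (_match, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#(\d+);/g, (_match, dec) => String.fromCodePoint(parseInt(dec, 10)))
    .replace(/&commat;/gi, "@")
    .replace(/&period;/gi, ".");
}

// Rewrites bracketed "(at)" / "[punkt]" style obfuscations and a spaced "@".
function deobfuscate(text: string): string {
  return decodeEntities(text)
    .replace(/\s*[\[({]\s*(?:at|ät)\s*[\])}]\s*/gi, "@")
    .replace(/\s*[\[({]\s*(?:dot|punkt)\s*[\])}]\s*/gi, ".")
    .replace(/\s+@\s+/g, "@");
}

// Returns the first address of a mailto: link, ignoring ?subject=... style query params.
function emailFromMailto(href: string): string | null {
  if (!/^mailto:/i.test(href)) return null;
  const raw = href.slice("mailto:".length).split("?")[0] ?? "";
  let decoded = raw;
  try {
    decoded = decodeURIComponent(raw);
  } catch {
    // keep the raw value if it is not valid percent-encoding
  }
  const first = (decoded.split(",")[0] ?? "").trim().toLowerCase();
  return new RegExp(`^${EMAIL}$`, "i").test(first) ? first : null;
}

// Collects published addresses from mailto links first, then from visible text.
function extractPublishedEmails(visibleText: string, hrefs: string[]): string[] {
  const found = new Set<string>();
  for (const href of hrefs) {
    const email = emailFromMailto(decodeEntities(href));
    if (email) found.add(email);
  }
  const matches = deobfuscate(visibleText).match(new RegExp(EMAIL, "gi")) ?? [];
  for (const match of matches) found.add(match.toLowerCase());
  return Array.from(found);
}
```

The bracketed patterns can produce false joins in ordinary prose, which is one reason a real extractor would also look at the surrounding footer, Impressum, or contact context before accepting a candidate.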

TASK-24 Bock Rechtsanwaelte follow-up: mailto candidates on the real Impressum HTML were found but incorrectly marked non-business due to an index mismatch in context detection. Fixed mailto business-context detection and email-label contactPerson suppression; the captured Bock HTML now yields the usable address chemnitz@bock-rechtsanwaelte.de.
<!-- SECTION:NOTES:END -->