feat: add website enrichment crawler

2026-06-04 20:29:23 +02:00
parent ca42c8d5a6
commit 1f6e31c01c
25 changed files with 3539 additions and 56 deletions
--- a/Replace-oversized-Convex-browser-runtime-dependency.md
+++ b/Replace-oversized-Convex-browser-runtime-dependency.md
@@ -0,0 +1,45 @@
+---
+id: TASK-21
+title: Replace oversized Convex browser runtime dependency
+status: In Progress
+assignee: []
+created_date: '2026-06-04 15:30'
+updated_date: '2026-06-04 16:41'
+labels: []
+dependencies: []
+priority: high
+ordinal: 23000
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Reduce Convex function module size by replacing @sparticuz/chromium with a minimal serverless Chromium strategy for websiteEnrichmentAction while keeping screenshot/crawl functionality.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Action no longer imports @sparticuz/chromium
+- [x] #2 Convex external package list reflects the replacement
+- [x] #3 Deployment guidance includes required env var and failure mode for missing browser URL
+<!-- AC:END -->
+
+## Implementation Plan
+
+<!-- SECTION:PLAN:BEGIN -->
+1. Verify existing oversized browser dependency path in Convex action and env strategy
+2. Replace @sparticuz/chromium with chromium-min + runtime executable source env var
+3. Validate by TS/typecheck
+<!-- SECTION:PLAN:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Completed: swapped the dependency to @sparticuz/chromium-min and now read the runtime executableSource from an env var in convex/websiteEnrichmentAction.ts. Updated the external packages in convex.json to chromium-min. Added a configured failure path for a missing Chromium variable.
+
+Final verification passed after switching to @sparticuz/chromium-min with TASK8_BROWSER_ASSET_URL as primary runtime browser asset source. Convex codegen dry-run/typecheck now uploads functions successfully; previous ModulesTooLarge error is resolved.
+
+Follow-up for repeated /tmp/chromium cannot execute binary file: Context7 confirmed chromium-min remote pack usage; local package code reuses existing /tmp/chromium. Added marker-based /tmp cache invalidation keyed by TASK8_BROWSER_ASSET_URL so architecture/source changes remove stale /tmp/chromium and /tmp/chromium-pack before executablePath(). Verification passed: pnpm exec tsc -p tsconfig.json; pnpm test (108/108); pnpm lint (existing generated BetterAuth warnings only); pnpm exec convex codegen --dry-run --typecheck enable.
+
+Follow-up for libnspr4.so runtime error: Context7 and local @sparticuz/chromium-min docs show the remote pack includes al2023.tar.br, but the package only auto-inflates it when AL2023 detection fires. Convex needs those shared libs but does not trigger that detection. Added explicit AL2023 shared-library preparation after executablePath(): inflate CHROMIUM_PACK_PATH/al2023.tar.br and setupLambdaEnvironment(/tmp/al2023/lib) before Playwright launch. Verification passed: pnpm exec tsc -p tsconfig.json; pnpm test (109/109); pnpm lint (existing generated BetterAuth warnings only); pnpm exec convex codegen --dry-run --typecheck enable.
+<!-- SECTION:NOTES:END -->
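
The marker-based cache invalidation described in the TASK-21 notes could look roughly like the sketch below. This is not the code from convex/websiteEnrichmentAction.ts: the function name `invalidateStaleBrowserCache`, the marker file name, and the example URLs are assumptions made for illustration, and the demo runs against a throwaway directory rather than the real /tmp.

```typescript
import { existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical marker file name; the real action may use a different one.
const MARKER_FILE = "chromium-source.marker";

/**
 * Resets cached browser artifacts when the asset URL differs from the one
 * recorded in the marker file, then records the current URL.
 * Returns true when the cache was reset (including the first run, when no
 * marker exists and any leftover artifacts are of unknown origin).
 */
function invalidateStaleBrowserCache(cacheDir: string, assetUrl: string): boolean {
  const markerPath = join(cacheDir, MARKER_FILE);
  const previous = existsSync(markerPath) ? readFileSync(markerPath, "utf8") : null;
  if (previous === assetUrl) return false;
  // Same entries the notes mention: /tmp/chromium and /tmp/chromium-pack.
  for (const name of ["chromium", "chromium-pack"]) {
    rmSync(join(cacheDir, name), { recursive: true, force: true });
  }
  writeFileSync(markerPath, assetUrl, "utf8");
  return true;
}

// Demo against a temporary directory with placeholder URLs.
const demoDir = mkdtempSync(join(tmpdir(), "browser-cache-"));
const cachedBinary = join(demoDir, "chromium");

writeFileSync(cachedBinary, "placeholder for an x64 binary");
const firstRun = invalidateStaleBrowserCache(demoDir, "https://example.com/pack-x64.tar");
const binaryAfterFirstRun = existsSync(cachedBinary);

writeFileSync(cachedBinary, "placeholder for an x64 binary");
const sameUrlRun = invalidateStaleBrowserCache(demoDir, "https://example.com/pack-x64.tar");
const binaryAfterSameUrl = existsSync(cachedBinary);

const changedUrlRun = invalidateStaleBrowserCache(demoDir, "https://example.com/pack-arm64.tar");
const binaryAfterChangedUrl = existsSync(cachedBinary);
```

Keying the marker on the asset URL means an x64 to arm64 switch clears the old binary before `executablePath()` can reuse it, while repeated runs with an unchanged URL keep the cache.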
--- a/Add-source-assertions-for-Convex-AL2023-Chromium-lib-setup.md
+++ b/Add-source-assertions-for-Convex-AL2023-Chromium-lib-setup.md
@@ -0,0 +1,40 @@
+---
+id: TASK-22
+title: Add source assertions for Convex AL2023 Chromium lib setup
+status: In Progress
+assignee: []
+created_date: '2026-06-04 16:37'
+updated_date: '2026-06-04 16:41'
+labels: []
+dependencies: []
+priority: high
+ordinal: 24000
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Add tests that fail until websiteEnrichmentAction explicitly handles AL2023 shared libs for chromium-min packaging in Convex.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Test asserts chromium-min dynamic import exposes inflate/setupLambdaEnvironment or explicit LD_LIBRARY_PATH handling for /tmp/al2023/lib.
+- [x] #2 Assertion checks that runtime setup runs before Playwright launch and after executablePath resolution.
+<!-- AC:END -->
+
+## Implementation Plan
+
+<!-- SECTION:PLAN:BEGIN -->
+1. Add source assertions for AL2023 runtime setup and launch ordering
+2. Run focused website-enrichment action test
+3. Confirm failing output and report
+<!-- SECTION:PLAN:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Added source-only assertion in tests/website-enrichment-action.test.ts for AL2023 lib setup. Targeted run `pnpm tsc -p tsconfig.test.json && node --test .test-output/tests/website-enrichment-action.test.js` currently fails as expected on current action source (missing setup/LD_LIBRARY_PATH/al2023 archive handling).
+
+GREEN follow-up completed: runtime action now exposes chromium-min inflate/setupLambdaEnvironment, prepares /tmp/al2023/lib after executablePath resolution and before Playwright launch, and focused/full verification passes.
+<!-- SECTION:NOTES:END -->
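
A source-only ordering assertion of the kind TASK-22 describes can be approximated with a small helper that checks whether markers occur in sequence. The sketch below is an assumption about the approach, not the contents of tests/website-enrichment-action.test.ts: `appearsInOrder`, `prepareAl2023Libs`, and the sample source string are hypothetical.

```typescript
/** Returns true when every marker occurs in the source, in the given order. */
function appearsInOrder(source: string, markers: string[]): boolean {
  let cursor = 0;
  for (const marker of markers) {
    const index = source.indexOf(marker, cursor);
    if (index === -1) return false;
    // Continue searching after this marker so later markers must follow it.
    cursor = index + marker.length;
  }
  return true;
}

// Hypothetical excerpt standing in for the action source file.
const sampleSource = `
  const executablePath = await chromium.executablePath(assetUrl);
  await prepareAl2023Libs();
  const browser = await playwrightChromium.launch({ executablePath });
`;

// Setup must come after executablePath resolution and before launch.
const ordered = appearsInOrder(sampleSource, [
  "executablePath(",
  "prepareAl2023Libs(",
  ".launch(",
]);

// The reverse order must not be accepted.
const reversed = appearsInOrder(sampleSource, [".launch(", "executablePath("]);
```

String-level checks like this are brittle against refactors, but they fail fast when the setup call is missing or moved after the launch, which matches the RED/GREEN flow in the notes.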
--- a/Improve-website-email-extraction.md
+++ b/Improve-website-email-extraction.md
@@ -0,0 +1,35 @@
+---
+id: TASK-23
+title: Improve website email extraction
+status: In Progress
+assignee: []
+created_date: '2026-06-04 17:28'
+updated_date: '2026-06-04 17:34'
+labels: []
+dependencies: []
+priority: high
+ordinal: 25000
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Fix TASK-8 website enrichment so Playwright crawls contact/imprint/footer email patterns that are visible on crawled pages but currently missed by the extractor.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Extract mailto href emails even with query parameters and labels
+- [x] #2 Extract common obfuscated German website email patterns such as [at], (at), at, and spaced @/dot forms
+- [x] #3 Treat emails found on Kontakt/Impressum pages or footer contact context as business contact candidates without guessing addresses
+- [x] #4 Keep TASK-7 rules intact: no generated emails, named emails require explicit business context
+- [x] #5 Verify with focused RED/GREEN tests and full suite
+<!-- AC:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Updated website-crawler extractor to support mailto query stripping/decoding, HTML entity decoding for email separators, obfuscated [at]/(at)/dot/punkt and spaced @/dot forms, and expanded business-context detection for footer/impressum/contact regions. Limited to lib/website-crawler.ts only.
+
+Implemented via subagents/TDD: added RED tests for mailto query params, obfuscated email forms, footer/impressum usability, no-guessing false-positive guard, and mailto dedupe. Extractor now decodes common HTML entities, strips/decodes mailto query strings, parses [at]/(at)/punkt/dot/spaced forms with guardrails, expands footer/impressum/contact business context, and leaves TASK-7 selection unchanged. Verification passed: pnpm exec tsc -p tsconfig.json; pnpm test (114/114); pnpm lint (existing generated BetterAuth warnings only); pnpm exec convex codegen --dry-run --typecheck enable.
+<!-- SECTION:NOTES:END -->
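
To illustrate the obfuscation handling from TASK-23, here is a minimal sketch that normalizes bracketed `[at]`/`(at)` and `[dot]`/`(punkt)` forms plus a spaced `@`. It is deliberately narrower than what the notes describe for lib/website-crawler.ts (bare "at", HTML entities, and mailto parsing are left out), and the function name `deobfuscateEmail` and the example addresses are assumptions.

```typescript
/**
 * Normalizes a few bracketed obfuscation forms and returns the first email
 * found, lowercased, or null. It only rewrites what is published in the
 * text; it never guesses or generates an address.
 */
function deobfuscateEmail(text: string): string | null {
  const normalized = text
    // "[at]" and "(at)", with optional surrounding whitespace
    .replace(/\s*(?:\[at\]|\(at\))\s*/gi, "@")
    // "[dot]", "(dot)", "[punkt]", "(punkt)"
    .replace(/\s*(?:\[(?:dot|punkt)\]|\((?:dot|punkt)\))\s*/gi, ".")
    // spaced "@", e.g. "info @ beispiel.de"
    .replace(/\s*@\s*/g, "@");
  const match = normalized.match(/[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+/i);
  return match ? match[0].toLowerCase() : null;
}

const bracketed = deobfuscateEmail("info [at] kanzlei-beispiel [dot] de");
const german = deobfuscateEmail("kontakt(at)beispiel(punkt)de");
const spaced = deobfuscateEmail("Info @ Beispiel.de");
// Plain prose containing the word "at" must not produce a candidate.
const prose = deobfuscateEmail("Wir sind at home in Chemnitz");
```

Restricting the sketch to bracketed forms is one way to keep a guardrail against false positives; handling a bare "at" safely would need extra context checks such as the ones the notes mention.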
--- a/Improve-crawler-handling-for-Bock-Rechtsanwaelte-edge-cases.md
+++ b/Improve-crawler-handling-for-Bock-Rechtsanwaelte-edge-cases.md
@@ -0,0 +1,50 @@
+---
+id: TASK-24
+title: Improve crawler handling for Bock Rechtsanwaelte edge cases
+status: In Progress
+assignee: []
+created_date: '2026-06-04 18:04'
+updated_date: '2026-06-04 18:09'
+labels: []
+dependencies: []
+priority: high
+ordinal: 26000
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Investigate the remaining TASK-8 case where bock-rechtsanwaelte.de/impressum contains a visible email but website enrichment misses it, and address the same-domain timeout separately if reproducible.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Reproduce the missing email against the public impressum page or captured HTML
+- [x] #2 Add RED tests for the missed email/link pattern
+- [x] #3 Keep no-guessing email rules intact
+- [ ] #4 Add focused timeout mitigation only if root cause is identified
+- [x] #5 Verify focused tests and full suite
+<!-- AC:END -->
+
+## Implementation Plan
+
+<!-- SECTION:PLAN:BEGIN -->
+1. Inspect existing website crawler tests
+2. Add failing regression tests for Bock Impressum
+3. Keep no-context named-email rejection test unchanged
+4. Run focused crawler test and confirm RED
+<!-- SECTION:PLAN:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Working on adding focused RED tests for Bock Rechtsanwaelte email extraction failure; limiting changes to tests/website-crawler.test.ts
+
+Added 2 RED coverage tests in tests/website-crawler.test.ts. Focused run of .test-output/tests/website-crawler.test.js fails on 2 assertions: the Bock Impressum candidate has business context false because of the expected offset mismatch, and the contactPerson of an email-labeled mailto currently equals the email string.
+
+Running minimal fix for Bock Impressum email context/labeling in lib/website-crawler.ts. Next: implement anchor-indexing fix and email-label guard, then run focused tests.
+
+Minimal scoped fix applied in lib/website-crawler.ts: mailto business-context now evaluates against raw input using anchor indices, and email-like labels matching normalized email do not become contactPerson. Verified via focused command: pnpm exec tsc -p tsconfig.test.json && node --test .test-output/tests/website-crawler.test.js (19/19 passing).
+
+Reproduced Bock Impressum against captured public HTML. Extractor found 5 candidates but all were business=false because mailto anchor offsets from original HTML were checked against normalized HTML; TASK-7 therefore returned null. Added RED tests for Bock-like Impressum mailto context and email-label contactPerson behavior. Fixed mailto path to evaluate business context against original input offsets and suppress contactPerson when anchor label is the email itself. Verified captured real HTML now returns usable chemnitz@bock-rechtsanwaelte.de. Full verification passed: pnpm exec tsc -p tsconfig.json; pnpm test (116/116); pnpm lint (existing generated BetterAuth warnings only); pnpm exec convex codegen --dry-run --typecheck enable. Timeout mitigation not changed yet because timeout root cause is not identified.
+<!-- SECTION:NOTES:END -->
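
The TASK-24 root cause (anchor offsets taken from the original HTML but checked against normalized HTML) and the label guard can be sketched as follows. This is an illustration under assumptions, not the code in lib/website-crawler.ts: the `MailtoCandidate` shape, the context pattern, the window size, and the example.com addresses are all hypothetical.

```typescript
interface MailtoCandidate {
  email: string;
  contactPerson: string | null;
  hasBusinessContext: boolean;
}

// Hypothetical context keywords and window size.
const CONTEXT_PATTERN = /impressum|kontakt/i;
const CONTEXT_WINDOW = 120;

function extractMailtoCandidates(rawHtml: string): MailtoCandidate[] {
  // Captures the address (query string excluded) and the anchor label.
  const anchorPattern = /<a\b[^>]*href="mailto:([^"?]+)[^"]*"[^>]*>([^<]*)<\/a>/gi;
  const candidates: MailtoCandidate[] = [];
  let match: RegExpExecArray | null;
  while ((match = anchorPattern.exec(rawHtml)) !== null) {
    const start = match.index;
    const end = start + match[0].length;
    // The offsets come from rawHtml, so the context is sliced from rawHtml
    // too. Slicing a normalized copy with these offsets would look at the
    // wrong region of the page.
    const before = rawHtml.slice(Math.max(0, start - CONTEXT_WINDOW), start);
    const after = rawHtml.slice(end, end + CONTEXT_WINDOW);
    const email = match[1].trim().toLowerCase();
    const label = match[2].trim();
    candidates.push({
      email,
      // A label that is just the email itself is not a person's name.
      contactPerson: label === "" || label.toLowerCase() === email ? null : label,
      hasBusinessContext: CONTEXT_PATTERN.test(before + " " + after),
    });
  }
  return candidates;
}

const impressumHtml =
  '<h1>Impressum</h1>\n<p>E-Mail: <a href="mailto:office@example.com?subject=Anfrage">office@example.com</a></p>';
const noContextHtml =
  '<div><a href="mailto:max.mustermann@example.com">Max Mustermann</a></div>';

const impressumCandidates = extractMailtoCandidates(impressumHtml);
const noContextCandidates = extractMailtoCandidates(noContextHtml);
```

In this sketch the second candidate keeps `hasBusinessContext: false`, consistent with the rule that named emails need explicit business context before they count as usable.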
--- a/Implement-Playwright-website-crawling-and-screenshot-capture.md
+++ b/Implement-Playwright-website-crawling-and-screenshot-capture.md
@@ -1,10 +1,10 @@
 ---
 id: TASK-8
 title: Implement Playwright website crawling and screenshot capture
-status: To Do
+status: In Progress
 assignee: []
 created_date: '2026-06-03 19:13'
-updated_date: '2026-06-04 14:08'
+updated_date: '2026-06-04 18:09'
 labels:
  - mvp
  - audit
@@ -25,32 +25,51 @@ Build the website inspection and contact-enrichment layer using Playwright. For

 ## Acceptance Criteria
 <!-- AC:BEGIN -->
- [ ] #1 Playwright captures desktop and mobile screenshots for the homepage and stores them in Convex File Storage
- [ ] #2 Crawler visits a bounded set of relevant subpages: Kontakt, Impressum, Leistungen/Angebot, Über uns/Team when discoverable
- [ ] #3 Crawler extracts visible text, page title, meta description, headings, links, phone numbers, email candidates, email source URLs, contact-person context, and CTA/contact-form signals
- [ ] #4 Extracted email candidates are classified through the TASK-7 rules: generic business emails are preferred; named emails are accepted only when explicitly published as business contact addresses; no guessed addresses are generated
- [ ] #5 Leads discovered by Google Places with a website are automatically scheduled for contact enrichment before they remain in Kontakt fehlt; found usable email updates the lead contact fields and status while preserving phone and source data
- [ ] #6 Simple technical checks include HTTPS/final URL, missing title/meta description, visible contact path, and obvious broken internal links within the crawl limit
- [ ] #7 Crawler failures produce useful dashboard-visible errors without blocking unrelated leads
+- [x] #1 Playwright captures desktop and mobile screenshots for the homepage and stores them in Convex File Storage
+- [x] #2 Crawler visits a bounded set of relevant subpages: Kontakt, Impressum, Leistungen/Angebot, Über uns/Team when discoverable
+- [x] #3 Crawler extracts visible text, page title, meta description, headings, links, phone numbers, email candidates, email source URLs, contact-person context, and CTA/contact-form signals
+- [x] #4 Extracted email candidates are classified through the TASK-7 rules: generic business emails are preferred; named emails are accepted only when explicitly published as business contact addresses; no guessed addresses are generated
+- [x] #5 Leads discovered by Google Places with a website are automatically scheduled for contact enrichment before they remain in Kontakt fehlt; found usable email updates the lead contact fields and status while preserving phone and source data
+- [x] #6 Simple technical checks include HTTPS/final URL, missing title/meta description, visible contact path, and obvious broken internal links within the crawl limit
+- [x] #7 Crawler failures produce useful dashboard-visible errors without blocking unrelated leads
 <!-- AC:END -->

-
-
 ## Implementation Plan

 <!-- SECTION:PLAN:BEGIN -->
-1. Add Playwright runtime setup compatible with local development and Coolify container deployment.
-2. Define crawl limits, viewports, timeout behavior, and allowed same-domain URL rules.
-3. Capture homepage desktop/mobile screenshots and upload them to Convex storage.
-4. Discover and inspect relevant subpages with bounded depth.
-5. Extract visible text, metadata, links, phone numbers, email candidates, contact-person context, CTA/contact-form signals, and source URLs.
-6. Normalize and score email candidates, then call the existing TASK-7 lead review/contact qualification path so usable emails update lead contact fields and unqualified named emails do not.
-7. Add contact-enrichment run state and dashboard-visible run events/errors for leads that still need manual contact research.
-8. Persist extracted raw evidence, technical checks, screenshots, and crawler errors in Convex.
+1. Worker A: add pure crawler/extraction helpers with RED/GREEN tests.
+2. Worker B: add Convex schema/run/storage persistence with RED/GREEN tests.
+3. Worker C: wire lead-discovery scheduling/contact update flow with RED/GREEN tests.
+4. Worker D: add dashboard-visible enrichment state/error UI with RED/GREEN tests where practical.
+5. Orchestrator: run spec review, code-quality review, full verification, and update acceptance criteria without marking Done.
 <!-- SECTION:PLAN:END -->

 ## Implementation Notes

 <!-- SECTION:NOTES:BEGIN -->
 Expanded TASK-8 to cover website-based contact enrichment because Google Places does not provide business email fields. This keeps email handling evidence-based and reuses TASK-7 qualification rules instead of guessing addresses.
+
+Orchestration started on branch codex-task-8-playwright-enrichment. Parallel wave 1 dispatched with gpt-5.3-codex-spark: Worker A owns lib/website-crawler.ts + tests/website-crawler.test.ts; Worker B owns convex/schema.ts + schema tests; Worker C owns Playwright package/runtime docs. All workers instructed to use TDD or config verification and avoid unrelated changes.
+
+Completed wave 1 foundations: Playwright runtime/docs approved; crawler helper spec+quality approved; Convex enrichment schema/run-type parity spec+quality approved. Wave 2 dispatched with gpt-5.3-codex-spark: Worker D owns convex/websiteEnrichment.ts action/persistence; Worker E owns lead-discovery scheduling integration. Orchestrator remains code-review/integration only.
+
+2026-06-04: Worker D started implementing convex/websiteEnrichment.ts with unit/source tests for queue/process/persist enrichment flow and Playwright evidence capture.
+
+2026-06-04: Added TASK-8 source tests for website-enrichment action queue/process/persistence contract and confirmed all assertions pass with existing implementation.
+
+Worker G retry: moved website enrichment scheduling out of persistDiscoveredLeads into processCampaignRun (returns queue items), scoped startCampaignRun active checks to by_type_and_status campaign running, and added source assertions for this sequencing.
+
+Implementation complete pending user confirmation. Built Playwright Chromium website enrichment with bounded crawl, desktop/mobile screenshot storage, raw evidence tables, TASK-7 email qualification reuse, post-discovery scheduling, technical checks, and dashboard-visible run events/errors. Final verification passed: pnpm exec tsc -p tsconfig.json; pnpm test (105/105); pnpm lint (0 errors, existing generated BetterAuth warnings only); pnpm exec convex codegen --dry-run --typecheck enable.
+
+2026-06-04: Updated source tests/README/.env for TASK-8 browser-runtime strategy migration to @sparticuz/chromium-min and TASK8_BROWSER_ASSET_URL deployment expectations.
+
+Resolved Convex Playwright runtime follow-up: local npx playwright install only populates the developer machine cache, not Convex runtime. Full playwright was replaced with playwright-core + @sparticuz/chromium-min and a required TASK8_BROWSER_ASSET_URL source so Convex no longer relies on /home/sbx_user ms-playwright cache. Verification passed: pnpm exec tsc -p tsconfig.json; pnpm test; pnpm lint (existing generated BetterAuth warnings only); pnpm exec convex codegen --dry-run --typecheck enable.
+
+TASK-21 runtime cache fix applied to TASK-8 crawler action: stale @sparticuz/chromium-min /tmp cache is invalidated when browser asset source changes, addressing repeated /tmp/chromium cannot execute binary file after x64/arm64 URL changes.
+
+TASK-8 crawler action now explicitly prepares @sparticuz/chromium-min AL2023 shared libraries for Convex to address /tmp/chromium libnspr4.so missing errors before screenshot/crawl launch.
+
+TASK-23 extractor improvement applied: website enrichment now extracts published emails from mailto links with query params, common German obfuscations, HTML entities/spaced separators, and footer/impressum/contact contexts while preserving TASK-7 no-guessing rules.
+
+TASK-24 Bock Rechtsanwaelte follow-up: mailto candidates on real Impressum HTML were found but incorrectly marked non-business due to an index mismatch in context detection. Fixed mailto business-context detection and email-label contactPerson suppression; captured Bock HTML now yields usable chemnitz@bock-rechtsanwaelte.de.
 <!-- SECTION:NOTES:END -->