feat: add website enrichment crawler
35
backlog/tasks/task-23 - Improve-website-email-extraction.md
Normal file
@@ -0,0 +1,35 @@
---
id: TASK-23
title: Improve website email extraction
status: In Progress
assignee: []
created_date: '2026-06-04 17:28'
updated_date: '2026-06-04 17:34'
labels: []
dependencies: []
priority: high
ordinal: 25000
---

## Description

<!-- SECTION:DESCRIPTION:BEGIN -->
Fix the TASK-8 website enrichment so that the Playwright crawl extracts contact, imprint, and footer email patterns that are visible on crawled pages but are currently missed by the extractor.
<!-- SECTION:DESCRIPTION:END -->

## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 Extract mailto href emails even with query parameters and labels
- [x] #2 Extract common obfuscated German website email patterns such as [at], (at), at, and spaced @/dot forms
- [x] #3 Treat emails found on Kontakt/Impressum pages or footer contact context as business contact candidates without guessing addresses
- [x] #4 Keep TASK-7 rules intact: no generated emails, named emails require explicit business context
- [x] #5 Verify with focused RED/GREEN tests and full suite
<!-- AC:END -->
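As a rough illustration of criterion #1, the sketch below reads only the `href` of a mailto link, so the link's visible label never influences the result, and it drops the query string before decoding. The helper name `emailsFromMailtoHref` and the validation regex are assumptions made for this example; they are not taken from lib/website-crawler.ts.

```typescript
// Hypothetical helper for illustration; not the code in lib/website-crawler.ts.

// Conservative shape check, applied after lowercasing the candidate.
const PLAIN_EMAIL_RE = /^[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}$/;

function emailsFromMailtoHref(href: string): string[] {
  const trimmed = href.trim();
  if (!/^mailto:/i.test(trimmed)) return [];

  // Drop the scheme, then everything from the first "?" (subject, body, cc, ...).
  const withoutScheme = trimmed.slice("mailto:".length);
  const queryStart = withoutScheme.indexOf("?");
  const rawRecipients =
    queryStart === -1 ? withoutScheme : withoutScheme.slice(0, queryStart);

  let decoded: string;
  try {
    // Percent-decoding turns "info%40example.de" into "info@example.de".
    decoded = decodeURIComponent(rawRecipients);
  } catch {
    // Malformed escapes: return nothing rather than guess an address.
    return [];
  }

  // A mailto href may list several comma-separated recipients; dedupe them.
  const seen = new Set<string>();
  for (const part of decoded.split(",")) {
    const candidate = part.trim().toLowerCase();
    if (PLAIN_EMAIL_RE.test(candidate)) seen.add(candidate);
  }
  return Array.from(seen);
}
```

Returning an empty list for anything that does not validate keeps this step in line with the no-guessing rule in criterion #4: the helper only reports addresses that are literally present in the link.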

## Implementation Notes

<!-- SECTION:NOTES:BEGIN -->
Updated the website-crawler extractor to support mailto query stripping and decoding, HTML entity decoding for email separators, obfuscated [at]/(at)/dot/punkt and spaced @/dot forms, and expanded business-context detection for footer, Impressum, and contact regions. Changes are limited to lib/website-crawler.ts.

Implemented via subagents with TDD: added RED tests for mailto query parameters, obfuscated email forms, footer/Impressum usability, a no-guessing false-positive guard, and mailto dedupe. The extractor now decodes common HTML entities, strips and decodes mailto query strings, parses [at]/(at)/punkt/dot/spaced forms with guardrails, expands the footer/Impressum/contact business context, and leaves TASK-7 selection unchanged. Verification passed: `pnpm exec tsc -p tsconfig.json`; `pnpm test` (114/114); `pnpm lint` (existing generated BetterAuth warnings only); `pnpm exec convex codegen --dry-run --typecheck enable`.
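The entity decoding and obfuscation handling described above could look roughly like the following. This is a minimal sketch under stated assumptions: the function names, the entity table, and every regex are hypothetical and chosen for illustration, and the guardrails are heuristics rather than the rules actually encoded in lib/website-crawler.ts.

```typescript
// Illustrative sketch only; names and regexes are assumptions,
// not the contents of lib/website-crawler.ts.

// Entities that commonly hide "@" and "." in page source.
const ENTITY_MAP: Record<string, string> = {
  "&#64;": "@", "&#x40;": "@", "&commat;": "@",
  "&#46;": ".", "&#x2e;": ".", "&period;": ".",
  "&nbsp;": " ", "&#160;": " ",
};

const EMAIL_RE = /[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}/gi;

function decodeEmailEntities(text: string): string {
  // Unknown entities are left untouched.
  return text.replace(/&#?[a-z0-9]+;/gi, (m) => ENTITY_MAP[m.toLowerCase()] ?? m);
}

function extractObfuscatedEmails(rawText: string): string[] {
  let text = decodeEmailEntities(rawText);

  // 1. Bracketed markers are unambiguous: "[at]", "(at)", "[dot]", "(punkt)".
  text = text
    .replace(/\s*[\[(]\s*at\s*[\])]\s*/gi, "@")
    .replace(/\s*[\[(]\s*(?:dot|punkt)\s*[\])]\s*/gi, ".");

  // 2. Bare "at" is an ordinary word, so it is only accepted when the domain
  //    also spells out its dot ("info at firma punkt de"). Prose such as
  //    "more info at firma.de" is left alone.
  text = text.replace(
    /\b([a-z0-9._%+-]+)\s+at\s+([a-z0-9-]+(?:\s+(?:dot|punkt)\s+[a-z0-9-]+)+)\b/gi,
    (_m: string, local: string, domain: string) =>
      `${local}@${domain.replace(/\s+(?:dot|punkt)\s+/gi, ".")}`,
  );

  // 3. Spaced form: the "@" must be surrounded by whitespace; dots in the
  //    domain are then either all spaced or all unspaced.
  text = text.replace(
    /[a-z0-9._%+-]+(?:\s+\.\s+[a-z0-9._%+-]+)*\s+@\s+[a-z0-9-]+(?:(?:\s+\.\s+[a-z0-9-]+)+|(?:\.[a-z0-9-]+)+)/gi,
    (m: string) => m.replace(/\s+/g, ""),
  );

  // 4. Collect whatever now looks like a plain address, deduplicated.
  const found = new Set<string>();
  for (const match of text.match(EMAIL_RE) ?? []) {
    found.add(match.toLowerCase());
  }
  return Array.from(found);
}
```

For example, `extractObfuscatedEmails("Kontakt: info [at] firma [dot] de")` yields `["info@firma.de"]`, while `extractObfuscatedEmails("More info at firma.de")` yields an empty list because the bare "at" is not paired with a spelled-out dot. Mixed spacing inside a multi-label domain is deliberately not handled here; a real extractor would need fixtures like the RED tests mentioned above to tune these heuristics.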
<!-- SECTION:NOTES:END -->