---
id: TASK-24
title: Improve crawler handling for Bock Rechtsanwaelte edge cases
status: Done
assignee: []
created_date: '2026-06-04 18:04'
updated_date: '2026-06-10 19:27'
labels: []
dependencies: []
priority: high
ordinal: 26000
---

## Description

Investigate the remaining TASK-8 case where bock-rechtsanwaelte.de/impressum contains a visible email but website enrichment misses it, and address the same-domain timeout separately if reproducible.

## Acceptance Criteria

- [x] #1 Reproduce the missing email against the public impressum page or captured HTML
- [x] #2 Add RED tests for the missed email/link pattern
- [x] #3 Keep no-guessing email rules intact
- [ ] #4 Add focused timeout mitigation only if root cause is identified
- [x] #5 Verify focused tests and full suite

## Implementation Plan

1. Inspect existing website crawler tests
2. Add failing regression tests for Bock Impressum
3. Keep no-context named-email rejection test unchanged
4. Run focused crawler test and confirm RED

## Implementation Notes

Working on adding focused RED tests for the Bock Rechtsanwaelte email extraction failure; limiting changes to tests/website-crawler.test.ts.

Added 2 RED coverage tests in tests/website-crawler.test.ts. Focused run of .test-output/tests/website-crawler.test.js fails on 2 assertions: the Bock Impressum candidate's business context is false (the expected mismatch), and the email-labeled mailto contactPerson currently equals the email string.

Implementing a minimal fix for Bock Impressum email context/labeling in lib/website-crawler.ts. Next: implement the anchor-indexing fix and the email-label guard, then run focused tests.

Minimal scoped fix applied in lib/website-crawler.ts: mailto business context is now evaluated against the raw input using anchor indices, and email-like labels that match the normalized email do not become contactPerson. Verified via focused command: pnpm exec tsc -p tsconfig.test.json && node --test .test-output/tests/website-crawler.test.js (19/19 passing).

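
The offset mismatch and the email-label guard described in these notes can be illustrated with a small sketch. This is not the code in lib/website-crawler.ts: every name here (`findMailtoAnchors`, `normalizeHtml`, `hasBusinessContext`, `contactPersonFromLabel`), the keyword list, the 80-character window, and the sample HTML are assumptions made up for illustration. It only shows why offsets measured in the raw HTML must be checked against the raw HTML, and how a label equal to the email can be kept out of contactPerson.

```typescript
// Hypothetical sketch; none of these names come from lib/website-crawler.ts.
interface MailtoAnchor {
  email: string; // address from the href, lowercased
  label: string; // visible anchor text with inner tags removed
  start: number; // offset of "<a" in the string that was scanned
  end: number;   // offset just past "</a>" in that same string
}

// Collect mailto anchors together with their offsets in `html`.
function findMailtoAnchors(html: string): MailtoAnchor[] {
  const pattern = /<a\b[^>]*href\s*=\s*["']mailto:([^"'?]+)[^>]*>([\s\S]*?)<\/a>/gi;
  const anchors: MailtoAnchor[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    anchors.push({
      email: (match[1] ?? "").trim().toLowerCase(),
      label: (match[2] ?? "").replace(/<[^>]+>/g, "").trim(),
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return anchors;
}

// Assumed normalization step: collapse whitespace runs, which shifts offsets.
function normalizeHtml(html: string): string {
  return html.replace(/\s+/g, " ").trim();
}

// Assumed business-context rule: a contact keyword near the anchor.
// `start` and `end` must be offsets into `text`, not into another string.
function hasBusinessContext(text: string, start: number, end: number, radius = 80): boolean {
  const nearby = text.slice(Math.max(0, start - radius), Math.min(text.length, end + radius));
  return /e-?mail|kontakt|telefon/i.test(nearby);
}

// Label guard: a label that is just the email address is not a person name.
function contactPersonFromLabel(label: string, email: string): string | null {
  const cleaned = label.trim();
  if (cleaned === "" || cleaned.toLowerCase() === email.trim().toLowerCase()) {
    return null;
  }
  return cleaned;
}

// Made-up sample: heavy whitespace before the anchor makes the raw and
// normalized strings diverge in length.
const sampleHtml =
  "<div>" + " ".repeat(300) + "</div>" +
  '<p>E-Mail: <a href="mailto:office@example.com">office@example.com</a></p>';
const sampleAnchor = findMailtoAnchors(sampleHtml)[0];

if (sampleAnchor !== undefined) {
  // Mismatched: raw offsets checked against the normalized string.
  console.log(hasBusinessContext(normalizeHtml(sampleHtml), sampleAnchor.start, sampleAnchor.end)); // false
  // Consistent: raw offsets checked against the raw string.
  console.log(hasBusinessContext(sampleHtml, sampleAnchor.start, sampleAnchor.end)); // true
  console.log(contactPersonFromLabel(sampleAnchor.label, sampleAnchor.email)); // null
}
```

In this sample the raw offsets point past the end of the shorter normalized string, so the context window is empty and the candidate is rejected. Checking against the raw input is one way to keep offsets and text consistent; scanning for anchors in the normalized string instead would be another.
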

Reproduced the Bock Impressum miss against captured public HTML. The extractor found 5 candidates, but all had business=false because mailto anchor offsets taken from the original HTML were checked against the normalized HTML; TASK-7 therefore returned null.

Added RED tests for Bock-like Impressum mailto context and email-label contactPerson behavior. Fixed the mailto path to evaluate business context against original input offsets and to suppress contactPerson when the anchor label is the email itself. Verified that the captured real HTML now returns the usable address chemnitz@bock-rechtsanwaelte.de.

Full verification passed:

- pnpm exec tsc -p tsconfig.json
- pnpm test (116/116)
- pnpm lint (existing generated BetterAuth warnings only)
- pnpm exec convex codegen --dry-run --typecheck enable

Timeout mitigation not changed yet because the timeout root cause is not identified.

## Final Summary

Closed per explicit user request while switching project tracking to pitchfast.