feat: add lead qualification workflow

2026-06-04 16:09:47 +02:00
parent 15d8bfeb66
commit 59824b7336
19 changed files with 2833 additions and 78 deletions


@@ -1,9 +1,10 @@
---
id: TASK-7
title: 'Add lead qualification, deduplication, and blacklist handling'
status: To Do
status: Done
assignee: []
created_date: '2026-06-03 19:13'
updated_date: '2026-06-04 14:09'
labels:
- mvp
- leads
@@ -24,19 +25,57 @@ Implement the rules that turn raw business discoveries into usable lead states.
## Acceptance Criteria
<!-- AC:BEGIN -->
- [ ] #1 Leads with no usable email are placed in Kontakt fehlt while preserving phone and source data
- [ ] #2 Generic business emails are preferred and named emails are accepted only when explicitly found as business contact addresses
- [ ] #3 Hard duplicates are detected by domain, Google Place ID, or email; probable duplicates are flagged by name plus address or phone
- [ ] #4 Manual blacklist entries for domain, email, phone, company name, and Place ID are enforced during discovery and review
- [ ] #5 Priority values Hoch, Mittel, Niedrig, Zurückstellen, and Gesperrt are assigned or editable with clear reasons
- [x] #1 Leads with no usable email are placed in Kontakt fehlt while preserving phone and source data
- [x] #2 Generic business emails are preferred and named emails are accepted only when explicitly found as business contact addresses
- [x] #3 Hard duplicates are detected by domain, Google Place ID, or email; probable duplicates are flagged by name plus address or phone
- [x] #4 Manual blacklist entries for domain, email, phone, company name, and Place ID are enforced during discovery and review
- [x] #5 Priority values Hoch, Mittel, Niedrig, Zurückstellen, and Gesperrt are assigned or editable with clear reasons
<!-- AC:END -->
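Criteria #1 and #2 can be illustrated with a minimal sketch. It assumes a hypothetical `EmailCandidate` shape and an invented list of generic mailbox prefixes; none of these names come from the repository, and the actual rule set may differ.

```typescript
// Hypothetical types and helpers for illustration only.
type EmailKind = "generic" | "named";

interface EmailCandidate {
  address: string;
  // True only when the address was explicitly published as a business contact.
  explicitBusinessContact: boolean;
}

// Assumed prefixes for generic business mailboxes; the real list may differ.
const GENERIC_LOCAL_PARTS = new Set([
  "info", "kontakt", "office", "mail", "hallo", "service", "buero",
]);

// Returns null for strings that do not look like an email address.
function classifyEmail(address: string): EmailKind | null {
  const match = /^([^\s@]+)@([^\s@]+\.[^\s@]+)$/.exec(address.trim().toLowerCase());
  if (!match) return null;
  return GENERIC_LOCAL_PARTS.has(match[1] ?? "") ? "generic" : "named";
}

// Prefers the first generic address; falls back to the first named address
// that was explicitly published as a business contact. A null result means
// the lead has no usable email and would stay in "Kontakt fehlt".
function pickUsableEmail(candidates: EmailCandidate[]): string | null {
  let named: string | null = null;
  for (const candidate of candidates) {
    const normalized = candidate.address.trim().toLowerCase();
    const kind = classifyEmail(normalized);
    if (kind === "generic") return normalized;
    if (kind === "named" && candidate.explicitBusinessContact && named === null) {
      named = normalized;
    }
  }
  return named;
}
```

In this sketch phone and source data are not touched at all, so a lead that stays in Kontakt fehlt keeps them unchanged.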
## Implementation Plan
<!-- SECTION:PLAN:BEGIN -->
1. Add blacklist CRUD in Convex and dashboard UI.
2. Implement email/contact extraction result fields and Kontakt fehlt transitions.
3. Add hard and probable duplicate matching rules.
4. Add priority assignment rules based on website/contact signals.
5. Surface reasons and source data in lead detail and run logs.
Subagent-driven TDD execution plan
Orchestrator responsibilities:
1. Coordinate TASK-7 implementation end to end.
2. Use gpt-5.3-codex-spark subagents for implementation and review slices.
3. Enforce TDD: write failing tests first, verify red, implement minimal production code, verify green, then refactor.
4. Keep Backlog notes current and do not mark Done until user confirms manual testing.
Implementation slices:
1. Rules/backend qualification: add tests and implementation for email usability, generic vs named email handling, hard duplicates by domain/place/email, probable duplicates by company+address or company+phone, blacklist normalization, and priority/status reason derivation.
2. Convex integration: extend schema/types/indexes and lead/blacklist APIs for qualification, editable priority/status/reasons, blacklist CRUD, and discovery/review enforcement.
3. Dashboard UI: replace Leads and Sperrliste placeholders with scan-friendly review tools that expose source data, duplicate/blacklist reasons, and editable priority/status controls.
4. Funnel/model polish: map blocked priority to Gesperrt and keep deferred/review funnel behavior coherent.
5. Verification: run targeted tests during each TDD slice, then pnpm test and pnpm lint at the end.
Acceptance criteria mapping:
- AC1: contact qualification stores leads without usable email as Kontakt fehlt while preserving phone/source metadata.
- AC2: email rules prefer generic business addresses and only allow named emails when explicitly sourced as business contact addresses.
- AC3: duplicate rules distinguish hard duplicates and probable duplicates.
- AC4: blacklist entries for domain/email/phone/company/place ID apply during discovery and review.
- AC5: Hoch, Mittel, Niedrig, Zurückstellen, and Gesperrt are assignable/editable with clear reasons.
<!-- SECTION:PLAN:END -->
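The hard versus probable duplicate distinction from AC3 could look roughly like the sketch below. The `LeadKeys` shape, the helper names, and the normalization choices (lowercasing, digit-only phone comparison) are assumptions for illustration, not the committed implementation.

```typescript
// Hypothetical lead fields used for matching; all are optional.
interface LeadKeys {
  domain?: string;
  placeId?: string;
  email?: string;
  name?: string;
  address?: string;
  phone?: string;
}

type DuplicateResult =
  | { kind: "hard"; reason: "domain" | "placeId" | "email" }
  | { kind: "probable"; reason: "name+address" | "name+phone" }
  | { kind: "none" };

// Lowercase and collapse whitespace so cosmetic differences do not matter.
const norm = (value?: string): string =>
  (value ?? "").trim().toLowerCase().replace(/\s+/g, " ");

// Keep digits only. This is a simplification: it does not reconcile
// "+49 30 ..." with "030 ...".
const digits = (value?: string): string => (value ?? "").replace(/\D/g, "");

// Empty values never count as a match.
const same = (a: string, b: string): boolean => a !== "" && a === b;

function compareLeads(a: LeadKeys, b: LeadKeys): DuplicateResult {
  if (same(norm(a.domain), norm(b.domain))) return { kind: "hard", reason: "domain" };
  if (same((a.placeId ?? "").trim(), (b.placeId ?? "").trim())) {
    return { kind: "hard", reason: "placeId" };
  }
  if (same(norm(a.email), norm(b.email))) return { kind: "hard", reason: "email" };
  if (same(norm(a.name), norm(b.name))) {
    if (same(norm(a.address), norm(b.address))) {
      return { kind: "probable", reason: "name+address" };
    }
    if (same(digits(a.phone), digits(b.phone))) {
      return { kind: "probable", reason: "name+phone" };
    }
  }
  return { kind: "none" };
}
```

Returning the matched key as `reason` is one way to surface the "clear reasons" that the review UI is expected to show.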
## Implementation Notes
<!-- SECTION:NOTES:BEGIN -->
Execution started with subagent-driven TDD orchestration using gpt-5.3-codex-spark as requested.
Task 7: implement Google Places email review rules, blacklist enforcement for existing leads, and fix company normalization in blacklist matching. Start with new TDD tests in lib/lead-discovery-google and the Convex review path.
TASK-7 implemented: added review-based email contact patch in convex/leads.ts, bounded blacklist enforcement on create/update in convex/blacklist.ts, company normalization fix in getBlacklistLookupValues/getBlacklistMatches, and schema support for new lead matching fields/reasons/blocked priority. Tests: pnpm -s test ✅ and pnpm -s tsc ✅.
Progress: implementing code-quality fixes in convex/blacklist.ts, convex/leads.ts, convex/leadDiscovery.ts; running requested test/type/lint commands after changes. Plan: tighten mutation patch typing, bound blacklist propagation, split website signal, and avoid empty normalized writes.
Executed requested TASK-7 backend quality fixes in scoped files and validated with pnpm -s test, pnpm -s tsc, and targeted eslint. Outstanding follow-up: keep an eye on very large blacklist match sets; enforcement currently remains batch-at-a-time by design.
TASK-7 implementation verified by orchestrator. Added lead qualification helpers and Convex integration for usable email handling, hard/probable duplicate detection, blacklist enforcement with scheduled backfill/apply batches, blocked priority/reason fields, and dashboard Leads/Sperrliste review UI. Verified: pnpm -s test (67 pass), pnpm -s tsc (exit 0), pnpm -s lint (0 errors, 2 generated Better Auth warnings). Browser plugin could not open localhost due to ERR_BLOCKED_BY_CLIENT; route HEAD checks redirect to /login as expected for protected dashboard pages.
<!-- SECTION:NOTES:END -->
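The notes above mention company normalization in blacklist matching and avoiding empty normalized writes. A minimal sketch of that idea follows. It uses hypothetical names (`normalizeBlacklistValue`, `findBlacklistMatches`) and an assumed list of German legal-form tokens, and it is not the code of `getBlacklistLookupValues` or `getBlacklistMatches`.

```typescript
// Hypothetical types; the committed schema may look different.
type BlacklistKind = "domain" | "email" | "phone" | "company" | "placeId";

interface BlacklistEntry {
  kind: BlacklistKind;
  value: string;
}

// Assumed legal-form tokens that carry no identifying information.
const LEGAL_FORMS = new Set(["gmbh", "ug", "ag", "kg", "ohg", "gbr", "mbh", "co"]);

// Returns null when nothing meaningful is left, so callers can skip
// writing an empty normalized value.
function normalizeBlacklistValue(kind: BlacklistKind, raw: string): string | null {
  const value = raw.trim();
  let cleaned: string;
  switch (kind) {
    case "domain":
      cleaned = value.toLowerCase()
        .replace(/^https?:\/\//, "")
        .replace(/^www\./, "")
        .split("/")[0] ?? "";
      break;
    case "email":
      cleaned = value.toLowerCase();
      break;
    case "phone":
      cleaned = value.replace(/\D/g, "");
      break;
    case "company":
      cleaned = value.toLowerCase()
        .replace(/[^a-z0-9äöüß]+/g, " ")
        .split(" ")
        .filter((token) => token !== "" && !LEGAL_FORMS.has(token))
        .join(" ");
      break;
    default:
      cleaned = value; // Place IDs are compared as-is.
  }
  return cleaned === "" ? null : cleaned;
}

// Returns every entry whose normalized value equals the lead's value of the same kind.
function findBlacklistMatches(
  lead: Partial<Record<BlacklistKind, string>>,
  entries: BlacklistEntry[],
): BlacklistEntry[] {
  return entries.filter((entry) => {
    const entryValue = normalizeBlacklistValue(entry.kind, entry.value);
    const leadRaw = lead[entry.kind];
    if (entryValue === null || leadRaw === undefined) return false;
    return normalizeBlacklistValue(entry.kind, leadRaw) === entryValue;
  });
}
```

Returning the matched entries rather than a boolean leaves room to record which entry blocked a lead.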
## Final Summary
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
Implemented lead qualification, duplicate handling, blacklist enforcement, blocked priority/reason support, and dashboard review surfaces. Verified acceptance criteria #1-#5 with tests/typecheck/lint; user confirmed TASK-7 is done.
<!-- SECTION:FINAL_SUMMARY:END -->


@@ -4,6 +4,7 @@ title: Implement Playwright website crawling and screenshot capture
status: To Do
assignee: []
created_date: '2026-06-03 19:13'
updated_date: '2026-06-04 14:08'
labels:
- mvp
- audit
@@ -19,24 +20,37 @@ ordinal: 8000
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
Build the website inspection layer using Playwright. For qualified leads, the system should load the company website, inspect the homepage and a small set of relevant subpages, capture desktop/mobile screenshots, extract visible text and contact signals, and store all raw evidence in Convex.
Build the website inspection and contact-enrichment layer using Playwright. For qualified leads, the system should load the company website, inspect the homepage and a small set of relevant subpages, capture desktop/mobile screenshots, extract visible text and contact signals, store all raw evidence in Convex, and feed found email candidates back into the TASK-7 qualification rules before a lead remains in Kontakt fehlt. Google Places does not provide business email fields, so website crawl evidence is the primary MVP source for usable business email addresses.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [ ] #1 Playwright captures desktop and mobile screenshots for the homepage and stores them in Convex File Storage
- [ ] #2 Crawler visits a bounded set of relevant subpages: Kontakt, Impressum, Leistungen/Angebot, Über uns/Team when discoverable
- [ ] #3 Crawler extracts visible text, page title, meta description, headings, links, phone numbers, email candidates, and CTA/contact-form signals
- [ ] #4 Simple technical checks include HTTPS/final URL, missing title/meta description, visible contact path, and obvious broken internal links within the crawl limit
- [ ] #5 Crawler failures produce useful dashboard-visible errors without blocking unrelated leads
- [ ] #3 Crawler extracts visible text, page title, meta description, headings, links, phone numbers, email candidates, email source URLs, contact-person context, and CTA/contact-form signals
- [ ] #4 Extracted email candidates are classified through the TASK-7 rules: generic business emails are preferred; named emails are accepted only when explicitly published as business contact addresses; no guessed addresses are generated
- [ ] #5 Leads discovered by Google Places with a website are automatically scheduled for contact enrichment before they remain in Kontakt fehlt; found usable email updates the lead contact fields and status while preserving phone and source data
- [ ] #6 Simple technical checks include HTTPS/final URL, missing title/meta description, visible contact path, and obvious broken internal links within the crawl limit
- [ ] #7 Crawler failures produce useful dashboard-visible errors without blocking unrelated leads
<!-- AC:END -->
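To illustrate criteria #3 and #4, the sketch below collects only addresses that literally appear in crawled page text and records the page each one was found on. It never constructs guessed addresses. `CrawledPage`, `FoundEmail`, and `extractEmailCandidates` are hypothetical names, and the regular expression is a deliberately simple assumption.

```typescript
// Hypothetical shapes for crawl output and extracted candidates.
interface CrawledPage {
  url: string;
  text: string; // visible text of the page
}

interface FoundEmail {
  address: string;   // lowercased address as found in the text
  sourceUrl: string; // first page on which the address appeared
}

// Simple pattern for addresses in running text; not a full RFC 5322 parser.
const EMAIL_PATTERN = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;

// Returns each distinct address once, together with its first source URL.
function extractEmailCandidates(pages: CrawledPage[]): FoundEmail[] {
  const seen = new Set<string>();
  const found: FoundEmail[] = [];
  for (const page of pages) {
    const matches = page.text.match(EMAIL_PATTERN) ?? [];
    for (const raw of matches) {
      const address = raw.toLowerCase();
      if (seen.has(address)) continue;
      seen.add(address);
      found.push({ address, sourceUrl: page.url });
    }
  }
  return found;
}
```

Keeping `sourceUrl` next to each address is what lets a later classification step decide whether a named address was explicitly published as a business contact, for example on a Kontakt or Impressum page.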
## Implementation Plan
<!-- SECTION:PLAN:BEGIN -->
1. Add Playwright runtime setup compatible with local development and Coolify container deployment.
2. Define crawl limits, viewports, timeout behavior, and allowed same-domain URL rules.
3. Capture homepage desktop/mobile screenshots and upload to Convex storage.
3. Capture homepage desktop/mobile screenshots and upload them to Convex storage.
4. Discover and inspect relevant subpages with bounded depth.
5. Persist extracted text, metadata, contact candidates, technical checks, screenshots, and errors.
5. Extract visible text, metadata, links, phone numbers, email candidates, contact-person context, CTA/contact-form signals, and source URLs.
6. Normalize and score email candidates, then call the existing TASK-7 lead review/contact qualification path so usable emails update lead contact fields and unqualified named emails do not.
7. Add contact-enrichment run state and dashboard-visible run events/errors for leads that still need manual contact research.
8. Persist extracted raw evidence, technical checks, screenshots, and crawler errors in Convex.
<!-- SECTION:PLAN:END -->
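Steps 2 and 4 of the plan (crawl limits, same-domain rules, bounded subpage discovery) can be sketched as a pure link filter that runs before any page is loaded. The keyword list, the default limit of four pages, and the name `selectSubpages` are assumptions made for this example.

```typescript
// Assumed path keywords for Kontakt, Impressum, Leistungen/Angebot, Über uns/Team.
const RELEVANT_KEYWORDS = [
  "kontakt", "impressum", "leistungen", "angebot", "ueber-uns", "über-uns", "team",
];

// Picks at most maxPages same-domain links whose path looks relevant.
function selectSubpages(homepage: string, hrefs: string[], maxPages = 4): string[] {
  const base = new URL(homepage);
  const host = base.hostname.replace(/^www\./, "");
  const seen = new Set<string>();
  const result: string[] = [];

  for (const href of hrefs) {
    let url: URL;
    try {
      url = new URL(href, base); // resolves relative links against the homepage
    } catch {
      continue; // skip links that cannot be parsed
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    if (url.hostname.replace(/^www\./, "") !== host) continue;
    url.hash = ""; // fragments do not identify a different page

    let path: string;
    try {
      path = decodeURIComponent(url.pathname).toLowerCase();
    } catch {
      continue; // malformed percent-encoding
    }
    if (!RELEVANT_KEYWORDS.some((keyword) => path.includes(keyword))) continue;

    const key = url.origin + url.pathname;
    if (seen.has(key)) continue;
    seen.add(key);
    result.push(url.toString());
    if (result.length >= maxPages) break;
  }
  return result;
}
```

Because the filter is a pure function, the crawl bound can be unit-tested without launching a browser. The Playwright part would then only need to visit the returned URLs.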
## Implementation Notes
<!-- SECTION:NOTES:BEGIN -->
Expanded TASK-8 to cover website-based contact enrichment because Google Places does not provide business email fields. This keeps email handling evidence-based and reuses TASK-7 qualification rules instead of guessing addresses.
<!-- SECTION:NOTES:END -->