# Data Quality Framework
How we measure, maintain, and communicate the quality of every attribute in PropPie's canonical store. Quality is not optional — the Honest Broker's credibility is only as good as the data underneath.
## Core principle
Every attribute carries its own quality passport: source, confidence, freshness, lineage. Users and downstream systems never consume a number without knowing how trustworthy it is.
## 1. The five quality dimensions
| Dimension | Definition | Measured as |
|---|---|---|
| Confidence | How likely is the value correct? | 0.0 to 1.0 per attribute instance |
| Freshness | How recent is the underlying observation? | extracted_at timestamp + staleness threshold |
| Completeness | Is the value present and non-null? | % fill rate per attribute across corpus |
| Lineage | Can we trace back to the exact source? | source, source_url, source_doc_id, source_doc_page |
| Consistency | Do overlapping sources agree? | Conflict flag + resolution record |
## 2. Confidence scoring

### 2.1 How confidence is assigned
Every extracted or derived value gets a confidence score at write-time. Three methods:
| Method | When used | Example |
|---|---|---|
| Rule-based | Structured fields from authoritative sources | MahaRERA number from RERA page → 0.99 |
| Model-based | LLM/OCR extraction from unstructured docs | Address from scanned PDF → model confidence |
| Human-verified | Manual QA pass | Title chain reviewed by analyst → 0.95 |
### 2.2 Confidence thresholds
| Tier | Range | Policy |
|---|---|---|
| High | 0.85 – 1.0 | Display to users without qualification |
| Medium | 0.60 – 0.84 | Display with "approximate" label or confidence indicator |
| Low | 0.30 – 0.59 | Display only with explicit caveat; exclude from derived-attribute inputs unless no alternative |
| Unusable | < 0.30 | Do not display; do not use in derivations; flag for re-extraction or manual review |
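As a sketch, the tiers map to code as a simple threshold ladder (the enum and function names are illustrative, not the production API):

```python
from enum import Enum

class ConfidenceTier(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    UNUSABLE = "unusable"

def confidence_tier(score: float) -> ConfidenceTier:
    """Map a 0.0-1.0 confidence score to a display/usage tier per the table above."""
    if score >= 0.85:
        return ConfidenceTier.HIGH
    if score >= 0.60:
        return ConfidenceTier.MEDIUM
    if score >= 0.30:
        return ConfidenceTier.LOW
    return ConfidenceTier.UNUSABLE
```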
### 2.3 Confidence propagation in derived attributes

When a derived attribute (e.g., `risk.zone_risk_index`) depends on multiple inputs, its confidence is the weakest input confidence discounted by the method's own reliability:

`confidence(derived) = min(confidence(inputs)) × formula_confidence`

where `formula_confidence` is the intrinsic reliability of the computation method (e.g., simple arithmetic = 1.0; Monte Carlo simulation = 0.85; LLM synthesis = 0.80).

This is conservative by design. A chain is only as strong as its weakest link.
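A minimal sketch of this rule in Python (the weakest-link form is reconstructed from the note above; the function name is illustrative):

```python
def derived_confidence(input_confidences: list[float],
                       formula_confidence: float) -> float:
    """Weakest-link propagation: a derived value is never more trustworthy
    than its least trustworthy input, further discounted by the intrinsic
    reliability of the computation method."""
    if not input_confidences:
        raise ValueError("a derived attribute needs at least one input")
    return min(input_confidences) * formula_confidence

# e.g. inputs at 0.92 and 0.81 combined by Monte Carlo simulation (0.85):
# min(0.92, 0.81) * 0.85 = 0.6885 -> Medium tier, shown as "approximate"
```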
### 2.4 Confidence display to users
Users never see raw 0-1 numbers. Display as:
| Confidence | User-facing label | Visual |
|---|---|---|
| ≥ 0.85 | (no label — treated as reliable) | Solid text |
| 0.60 – 0.84 | "Approximate" or "Based on limited data" | Slightly muted, info icon |
| 0.30 – 0.59 | "Low confidence — verify independently" | Dashed border, warning icon |
| < 0.30 | Not shown | — |
## 3. Freshness rules

### 3.1 Staleness thresholds per source
| Source | Expected update frequency | Stale after | Action when stale |
|---|---|---|---|
| MahaRERA project pages | Quarterly (Form C) + ad-hoc | 120 days since last scrape | Re-scrape; flag "last updated X months ago" |
| IGR transactions (priority micromarkets) | Daily | 3 days | Alert pipeline team; fall back to weekly |
| IGR transactions (other) | Weekly | 14 days | Alert; acceptable lag |
| Government Resolutions | Daily | 48 hours | Alert; NLP classifier may miss fast-breaking policy |
| MahaBhulekh / 7-12 | On-demand | 180 days since lookup | Re-fetch on next query |
| GIS layers | Quarterly | 6 months | Acceptable; layers change slowly |
| News | Hourly | 6 hours | Alert if feed breaks |
| Social | Hourly | 6 hours | Alert; non-critical |
| Macro (repo rate, CPI) | Weekly | 14 days | Alert |
| Internal (Fractional) | Real-time / daily | 7 days | Alert |
### 3.2 Freshness display

Every user-facing data point shows an "as of [date]" stamp when the underlying observation is older than:
- 7 days for transaction data
- 30 days for project data
- 90 days for structural/spatial data
### 3.3 Freshness monitoring
Pipeline produces a daily freshness report:
For each (source, attribute_category):
- Last successful extraction timestamp
- Records updated in last 24h / 7d / 30d
- % of records past staleness threshold
- Alert: YES/NO
Alerts go to the Slack channel + pipeline dashboard. Any critical-priority attribute going stale triggers an on-call investigation within 4 hours.
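A hedged sketch of how one report row could be computed; the `STALE_AFTER` values mirror a subset of the table in 3.1, and all names are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Illustrative staleness thresholds, mirroring a subset of the table in 3.1
STALE_AFTER = {
    "maharera": timedelta(days=120),
    "igr_priority": timedelta(days=3),
    "igr_other": timedelta(days=14),
    "gr_portal": timedelta(hours=48),
}

def freshness_row(source: str, extracted_ats: list[datetime]) -> dict:
    """One row of the daily freshness report for a (source, attribute_category).
    Expects timezone-aware extraction timestamps."""
    now = datetime.now(timezone.utc)
    threshold = STALE_AFTER[source]
    stale = [t for t in extracted_ats if now - t > threshold]
    stale_pct = 100 * len(stale) / len(extracted_ats) if extracted_ats else 100.0
    return {
        "source": source,
        "last_success": max(extracted_ats, default=None),
        "stale_pct": round(stale_pct, 1),
        "alert": stale_pct > 5.0,  # 8.1 target: < 5% of records stale
    }
```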
## 4. Completeness rules

### 4.1 Fill-rate targets
| Priority | Target fill rate | Policy if below |
|---|---|---|
| `critical` | ≥ 90% across active projects | Pipeline bug or source degradation — investigate immediately |
| `high` | ≥ 70% | Acceptable gap; note in data-quality dashboard |
| `medium` | ≥ 50% | Expected for some attributes (floor plans, parking details) |
| `low` | No target | Nice-to-have |
### 4.2 Null handling

- Explicit null with reason: every missing value stores a `null_reason` enum: `source_missing`, `extraction_failed`, `not_applicable`, `awaiting_enrichment`
- No silent nulls: if a field is missing, the system must record why
- User display: missing critical fields show "Not available — [reason]", not blank space
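As a sketch, the no-silent-nulls rule can be enforced at the record level; the class and field names here are hypothetical, not the store schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class NullReason(Enum):
    SOURCE_MISSING = "source_missing"
    EXTRACTION_FAILED = "extraction_failed"
    NOT_APPLICABLE = "not_applicable"
    AWAITING_ENRICHMENT = "awaiting_enrichment"

@dataclass
class AttributeValue:
    value: Any                            # None is only legal with a null_reason
    null_reason: NullReason | None = None

    def __post_init__(self):
        # No silent nulls: a missing value must say why it is missing
        if self.value is None and self.null_reason is None:
            raise ValueError("missing value stored without a null_reason")
```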
## 5. Lineage and audit trail

### 5.1 Per-attribute lineage record
Every attribute instance carries:
| Field | Type | Example |
|---|---|---|
| `source` | enum | `maharera`, `igr`, `gr_portal`, `mahabhulekh`, `internal`, `enrichment`, `derived` |
| `source_url` | string | `https://maharera.maharashtra.gov.in/project/P52100012345` |
| `source_doc_id` | string | Internal document UUID |
| `source_doc_page` | int/null | Page number in PDF |
| `extracted_at` | datetime | `2026-05-15T14:30:00+05:30` |
| `extraction_method` | enum | `scraper_v3`, `llm_gpt4o`, `ocr_tesseract`, `manual`, `formula` |
| `extraction_model_version` | string | `gpt-4o-2025-12-01` |
| `confidence` | float | 0.92 |
| `confidence_method` | enum | `rule`, `model`, `human` |
| `human_verified` | bool | `false` |
| `human_verified_by` | string/null | `aishvarya` |
| `human_verified_at` | datetime/null | — |
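As an illustrative sketch, the lineage record maps naturally onto an immutable dataclass (field names follow the table; the class itself is an assumption):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)  # lineage is immutable once written
class Lineage:
    source: str                            # "maharera", "igr", "derived", ...
    source_url: str | None
    source_doc_id: str | None
    source_doc_page: int | None
    extracted_at: datetime
    extraction_method: str                 # "scraper_v3", "llm_gpt4o", ...
    extraction_model_version: str | None
    confidence: float
    confidence_method: str                 # "rule", "model", "human"
    human_verified: bool = False
    human_verified_by: str | None = None
    human_verified_at: datetime | None = None
```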
### 5.2 Change history
For attributes that change over time (project cost revisions, completion dates, ownership):
```
change_history: [
  { value: "2027-03-31", source: "Form B v3", extracted_at: "2026-01-14", confidence: 0.95 },
  { value: "2026-12-31", source: "Form B v2", extracted_at: "2025-06-20", confidence: 0.95 },
  { value: "2025-12-31", source: "Form B v1", extracted_at: "2024-03-15", confidence: 0.95 }
]
```
This is critical for delay forensics — the number of revisions IS the signal.
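A small sketch of pulling that signal out of a change history shaped like the block above (the function and its newest-first ordering assumption are illustrative):

```python
from datetime import date

def completion_date_slip(change_history: list[dict]) -> tuple[int, int]:
    """Delay-forensics signal: revision count plus total slip in days between
    the first promised date and the latest one. Assumes newest-first ordering
    and ISO-date values, as in the block above."""
    dates = [date.fromisoformat(entry["value"]) for entry in change_history]
    revisions = len(dates) - 1                # v1 -> v3 = 2 revisions
    slip_days = (dates[0] - dates[-1]).days   # latest promise minus first promise
    return revisions, slip_days

# For the history above: 2 revisions, 455 days of total slip
```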
### 5.3 Audit requirements
For DPDP and general defensibility:
- Immutable extraction log: every extraction event logged with inputs, outputs, model version. Append-only.
- Derived-attribute audit: every derived score stores the input attribute IDs + values at computation time, so a future query can reconstruct "why was this score X on date Y?"
- PII audit: all access to PII-containing attributes (investor data) logged with accessor, timestamp, purpose.
- Retention policy: extraction logs retained for 3 years minimum. PII logs per DPDP retention schedule.
## 6. Conflict resolution

### 6.1 When sources disagree
The source-of-truth hierarchy (from data-sources.md):
| Priority | Source | Wins on |
|---|---|---|
| 1 | IGR Index-II | Price, parties, transaction date |
| 2 | MahaRERA | Project metadata, dates, promoter |
| 3 | MahaBhulekh | Land ownership, area |
| 4 | GRs | Policy, infra timelines |
| 5 | GIS layers | Geospatial |
| 6 | Internal (Fractional) | Realised yields, vacancy |
| 7 | Licensed feeds | Where they aggregate primaries |
| 8 | News | Context only |
| 9 | Social | Sentiment only |
| 10 | Listings | Discovery only |
### 6.2 Conflict resolution process
When a new extraction disagrees with the existing canonical value:
1. If new_source priority > existing_source priority:
→ Replace. Log old value in change_history.
2. If new_source priority == existing_source priority:
→ Use more recent extraction (freshness wins within same tier).
→ Log conflict in conflict_log with both values + sources.
3. If new_source priority < existing_source priority:
→ Do NOT replace. Store as supplementary evidence.
→ Flag if the delta is > 20% on numeric values (investigate).
4. If conflict is > 50% on a numeric attribute from same-tier sources:
→ Flag for human review. Do not auto-resolve.
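A sketch of the four rules as one resolution function; the `Extraction` record and the `priority` map (source name → hierarchy rank, 1 = highest) are illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any

@dataclass
class Extraction:
    source: str
    value: Any
    extracted_at: datetime

def resolve_conflict(existing: Extraction, new: Extraction,
                     priority: dict[str, int]) -> str:
    """Returns the `resolution` enum recorded in the conflict log (6.3)."""
    delta_pct = None
    if isinstance(existing.value, (int, float)) and existing.value:
        delta_pct = abs(new.value - existing.value) / abs(existing.value) * 100

    new_rank, old_rank = priority[new.source], priority[existing.source]
    if new_rank < old_rank:                   # lower rank number = higher priority
        return "replaced"                     # rule 1: old value goes to change_history
    if new_rank == old_rank:
        if delta_pct is not None and delta_pct > 50:
            return "flagged_for_review"       # rule 4: same-tier, large numeric delta
        if new.extracted_at > existing.extracted_at:
            return "replaced"                 # rule 2: freshness wins within a tier
        return "kept_existing"
    # rule 3: a lower-priority source never overwrites; it is stored as
    # supplementary evidence, and deltas > 20% are separately investigated
    return "kept_existing"
```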
### 6.3 Conflict log
Maintain a queryable conflict log:
| Field | Type |
|---|---|
| `attribute_id` | string |
| `entity_id` | string (project/transaction/micromarket) |
| `existing_value` | any |
| `existing_source` | string |
| `conflicting_value` | any |
| `conflicting_source` | string |
| `delta_pct` | float |
| `resolution` | enum: `replaced`, `kept_existing`, `flagged_for_review` |
| `resolved_by` | string |
| `resolved_at` | datetime |
Monthly report: top 20 conflicting attributes by volume. Systemic conflicts indicate source degradation.
## 7. Quality SLAs per attribute priority
| Priority | Confidence floor | Fill rate floor | Max staleness | Conflict resolution SLA |
|---|---|---|---|---|
| `critical` | 0.85 | 90% | Per source table | 4 hours if flagged |
| `high` | 0.70 | 70% | Per source table | 24 hours |
| `medium` | 0.60 | 50% | 2× source table | 72 hours |
| `low` | No floor | No floor | No enforcement | Best-effort |
Violations against critical SLAs trigger an alert + entry in the quality incident log.
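As a sketch, the SLA table can live alongside the pipeline as a plain config map mirroring the values above (key names are assumptions):

```python
# Illustrative SLA config keyed by attribute priority, mirroring the table above
QUALITY_SLA = {
    "critical": {"confidence_floor": 0.85, "fill_rate_floor": 0.90,
                 "staleness_multiplier": 1, "conflict_sla_hours": 4},
    "high":     {"confidence_floor": 0.70, "fill_rate_floor": 0.70,
                 "staleness_multiplier": 1, "conflict_sla_hours": 24},
    "medium":   {"confidence_floor": 0.60, "fill_rate_floor": 0.50,
                 "staleness_multiplier": 2, "conflict_sla_hours": 72},
    "low":      {"confidence_floor": None, "fill_rate_floor": None,
                 "staleness_multiplier": None, "conflict_sla_hours": None},
}
```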
## 8. Quality monitoring and alerting

### 8.1 Daily quality dashboard
Generated by pipeline, consumed by product + data teams:
| Metric | Granularity | Target |
|---|---|---|
| Extraction success rate | Per source, per day | > 95% |
| Confidence distribution | Per attribute category | Median > 0.85 for critical |
| Fill rate | Per attribute | Per SLA table |
| Staleness % | Per source | < 5% records stale |
| Conflict volume | Per attribute | < 2% of records |
| Human review queue depth | Global | < 50 items |
### 8.2 Alerts
| Alert | Trigger | Channel | Response SLA |
|---|---|---|---|
| Source down | Scraper returns errors for > 1 hour | Slack #pipeline-alerts | 2 hours |
| Confidence drop | Attribute category median drops > 10% day-over-day | Slack | 4 hours |
| Fill rate drop | Critical attribute fill rate drops > 5% | Slack | 4 hours |
| Staleness breach | Any critical attribute source past threshold | Slack | 4 hours |
| Conflict spike | > 10% conflict rate on any attribute in a day | Slack | 24 hours |
| Hallucination report | User reports incorrect AI output | Slack #incidents | 1 hour |
### 8.3 Quality incident log
Every alert that required intervention is logged:
| Field | Type |
|---|---|
| `incident_id` | UUID |
| `detected_at` | datetime |
| `alert_type` | enum |
| `affected_attributes` | list |
| `affected_entity_count` | int |
| `root_cause` | text |
| `resolution` | text |
| `resolved_at` | datetime |
| `resolved_by` | string |
| `user_impact` | bool (did any user see incorrect data?) |
Monthly review: patterns, systemic issues, improvement priorities.
## 9. Human-in-the-loop verification

### 9.1 When humans verify
- Always for Title Clarity Score on new projects (before first display)
- Always for Developer Trust Score on first computation
- Always for flagged conflicts (> 50% delta, same-tier sources)
- Sampling (5% random) on all LLM-extracted attributes weekly
- On user report of any incorrect data
### 9.2 Verification workflow
1. Item enters review queue with context (attribute, entity, source docs, extraction, confidence)
2. Reviewer confirms or corrects
3. If corrected:
a. Canonical store updated
b. Old value moved to change_history
c. confidence_method set to 'human'
d. human_verified = true
e. If correction indicates systemic extraction issue → raise pipeline bug
4. If confirmed:
a. human_verified = true
b. confidence boosted by 0.05 (capped at 1.0)
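A sketch of steps 3 and 4 applied to an attribute record shaped as in section 5; the function itself is illustrative:

```python
from datetime import datetime, timezone

def apply_review(attr, reviewer: str, corrected_value=None) -> None:
    """Apply a reviewer's decision to an attribute record carrying
    value / confidence / change_history fields as in section 5."""
    now = datetime.now(timezone.utc)
    if corrected_value is not None and corrected_value != attr.value:
        # Correction: preserve the old value, mark the method as human
        attr.change_history.insert(0, {"value": attr.value, "superseded_at": now})
        attr.value = corrected_value
        attr.confidence_method = "human"
    else:
        # Confirmation: boost confidence by 0.05, capped at 1.0
        attr.confidence = min(1.0, attr.confidence + 0.05)
    attr.human_verified = True
    attr.human_verified_by = reviewer
    attr.human_verified_at = now
```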
### 9.3 Review capacity planning

At scale (~40,000 projects, ~500 new/month):

- Title Clarity: ~500 new projects/month → 500 reviews/month
- Developer Trust: ~100 new developers/year → manageable
- Sampling: 5% weekly sampling of LLM-extracted attributes ≈ 2,500 spot-checks/month
- Conflicts: estimate ~200/month flagged

Total: ~3,200 review items/month. At 5 min/item that is ~270 hours/month. Budget for 1.5 FTE of data-quality analyst time, or distribute across the team with tooling.
## 10. Data quality for AI/LLM outputs
Special rules for derived attributes that involve LLM generation (narratives, summaries, explanations):
### 10.1 Grounding enforcement

Every LLM-generated output must:

1. Reference only attributes present in the canonical store
2. Not assert any fact not traceable to a stored attribute
3. Include inline citations that resolve to real `source_url`s
4. Pass a post-generation grounding check (automated: verify all cited numbers match the canonical store)
### 10.2 Hallucination detection
Pipeline for every LLM output:
1. Generate output with citations
2. Extract all factual claims (regex + NLI model)
3. For each claim:
a. Resolve citation to canonical store
b. Verify numerical value matches (within tolerance)
c. Verify entity reference is correct
4. If any claim fails verification:
a. Flag output, do NOT serve to user
b. Regenerate with stricter prompt
c. If still fails: serve partial output with failed claims removed + "I couldn't verify X" note
5. Log all verification results for eval monitoring
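A sketch of the verification step (3-4); the claim shape, tolerance, and names are assumptions rather than the production contract:

```python
def grounding_check(claims: list[dict], canonical: dict,
                    tolerance: float = 0.01) -> list[dict]:
    """Verify each extracted claim against the canonical store. A claim is
    {"attribute_id": ..., "cited_value": ...}; `canonical` maps attribute_id
    to the stored value. Any non-empty result means: do not serve, regenerate."""
    failed = []
    for claim in claims:
        stored = canonical.get(claim["attribute_id"])
        if stored is None:
            failed.append(claim)      # citation does not resolve to the store
        elif isinstance(stored, (int, float)):
            if abs(claim["cited_value"] - stored) > tolerance * abs(stored):
                failed.append(claim)  # cited number outside tolerance
        elif claim["cited_value"] != stored:
            failed.append(claim)      # non-numeric value does not match
    return failed
```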
### 10.3 Tone compliance

Every LLM output is checked for Honest Broker compliance:

- No "you should" / "I recommend" / "good investment" language (regex + classifier)
- Probability framing present on projections
- "I don't know" present when confidence is low
Violations: block output, regenerate with compliance prompt injection.
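As a minimal sketch, the regex half of that check over the three banned phrasings named above (the classifier backstop is separate):

```python
import re

# First-pass filter over the three phrasings listed above; the classifier
# backstop is out of scope for this sketch
ADVICE_PATTERNS = re.compile(
    r"\b(you should|i recommend|good investment)\b", re.IGNORECASE
)

def tone_violations(output: str) -> list[str]:
    """Return any banned advice phrasings found in an LLM output."""
    return ADVICE_PATTERNS.findall(output)

# tone_violations("Looks like a good investment") -> ["good investment"] -> block
```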
## 11. Data quality culture

### 11.1 Ownership
Every attribute category has a named owner (see data-attributes.md). The owner is accountable for quality SLAs on their attributes.
### 11.2 Weekly quality standup

15-minute weekly review:

- Top 5 quality incidents
- Fill rate and staleness trends
- Human review queue depth
- Any new conflict patterns
### 11.3 Quarterly quality retro
- Full SLA compliance review
- Confidence calibration: are our confidence scores accurate? (Compare human-verified subset against automated scores.)
- Extraction model performance review
- Source health review (any sources degrading?)
- Update SLA thresholds if needed
## 12. Bootstrapping quality (first 90 days)
During the foundation phase, before full automation:
| Week | Focus |
|---|---|
| 1-2 | Establish confidence baseline: manually grade 100 projects across all critical attributes |
| 3-4 | Set up freshness monitoring for MahaRERA + IGR scrapers |
| 5-6 | Implement automated conflict detection on first 1,000 projects |
| 7-8 | Deploy hallucination detection on LLM extraction pipeline |
| 9-10 | First quality dashboard live; first weekly standup |
| 11-12 | First derived-attribute human verification pass (Title Clarity, Developer Trust) |
See also:
- data-attributes.md — attribute catalogue with per-attribute ownership
- derived-attributes-spec.md — math specs including validation methods
- pipeline-spec-for-vishal.md — SLAs consumed by pipeline
- ../../.cursor/skills/proppie-data-sources/SKILL.md — source hierarchy