Data Quality Framework

How we measure, maintain, and communicate the quality of every attribute in PropPie's canonical store. Quality is not optional — the Honest Broker's credibility is only as good as the data underneath.

Core principle

Every attribute carries its own quality passport: source, confidence, freshness, lineage. Users and downstream systems never consume a number without knowing how trustworthy it is.


1. The five quality dimensions

| Dimension | Definition | Measured as |
|---|---|---|
| Confidence | How likely is the value to be correct? | 0.0 to 1.0 per attribute instance |
| Freshness | How recent is the underlying observation? | extracted_at timestamp + staleness threshold |
| Completeness | Is the value present and non-null? | % fill rate per attribute across the corpus |
| Lineage | Can we trace the value back to its exact source? | source, source_url, source_doc_id, source_doc_page |
| Consistency | Do overlapping sources agree? | Conflict flag + resolution record |

2. Confidence scoring

2.1 How confidence is assigned

Every extracted or derived value gets a confidence score at write-time. Three methods:

| Method | When used | Example |
|---|---|---|
| Rule-based | Structured fields from authoritative sources | MahaRERA number from RERA page → 0.99 |
| Model-based | LLM/OCR extraction from unstructured docs | Address from scanned PDF → model confidence |
| Human-verified | Manual QA pass | Title chain reviewed by analyst → 0.95 |

2.2 Confidence thresholds

| Tier | Range | Policy |
|---|---|---|
| High | 0.85 – 1.0 | Display to users without qualification |
| Medium | 0.60 – 0.84 | Display with "approximate" label or confidence indicator |
| Low | 0.30 – 0.59 | Display only with explicit caveat; exclude from derived-attribute inputs unless no alternative |
| Unusable | < 0.30 | Do not display; do not use in derivations; flag for re-extraction or manual review |
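
A minimal sketch of how these tiers might be applied at read time; the function and tier names are illustrative, not the pipeline's actual API.

```python
def confidence_tier(score: float) -> str:
    """Map a 0-1 confidence score to a display/usage tier per the table above."""
    if score >= 0.85:
        return "high"        # display without qualification
    if score >= 0.60:
        return "medium"      # display with "approximate" label
    if score >= 0.30:
        return "low"         # display only with explicit caveat
    return "unusable"        # do not display; flag for re-extraction or review
```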

2.3 Confidence propagation in derived attributes

When a derived attribute (e.g., risk.zone_risk_index) depends on multiple inputs:

derived_confidence = min(input_confidences) * formula_confidence

Where formula_confidence is the intrinsic reliability of the computation method (e.g., simple arithmetic = 1.0; Monte Carlo simulation = 0.85; LLM synthesis = 0.80).

This is conservative by design. A chain is only as strong as its weakest link.
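
A minimal sketch of this propagation rule, using the method reliabilities above; the dictionary keys and function name are illustrative.

```python
# Intrinsic reliability of each computation method (values from section 2.3).
FORMULA_CONFIDENCE = {
    "simple_arithmetic": 1.0,
    "monte_carlo": 0.85,
    "llm_synthesis": 0.80,
}

def derived_confidence(input_confidences: list[float], method: str) -> float:
    """Weakest-link propagation: min of inputs, discounted by the method's reliability."""
    return min(input_confidences) * FORMULA_CONFIDENCE[method]

# Example: a derived index built from three inputs via Monte Carlo simulation.
print(derived_confidence([0.95, 0.88, 0.92], "monte_carlo"))  # 0.88 * 0.85 = 0.748
```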

2.4 Confidence display to users

Users never see raw 0-1 numbers. Display as:

| Confidence | User-facing label | Visual |
|---|---|---|
| ≥ 0.85 | (no label — treated as reliable) | Solid text |
| 0.60 – 0.84 | "Approximate" or "Based on limited data" | Slightly muted, info icon |
| 0.30 – 0.59 | "Low confidence — verify independently" | Dashed border, warning icon |
| < 0.30 | Not shown | |

3. Freshness rules

3.1 Staleness thresholds per source

| Source | Expected update frequency | Stale after | Action when stale |
|---|---|---|---|
| MahaRERA project pages | Quarterly (Form C) + ad-hoc | 120 days since last scrape | Re-scrape; flag "last updated X months ago" |
| IGR transactions (priority micromarkets) | Daily | 3 days | Alert pipeline team; fall back to weekly |
| IGR transactions (other) | Weekly | 14 days | Alert; acceptable lag |
| Government Resolutions | Daily | 48 hours | Alert; NLP classifier may miss fast-breaking policy |
| MahaBhulekh / 7-12 | On-demand | 180 days since lookup | Re-fetch on next query |
| GIS layers | Quarterly | 6 months | Acceptable; layers change slowly |
| News | Hourly | 6 hours | Alert if feed breaks |
| Social | Hourly | 6 hours | Alert; non-critical |
| Macro (repo rate, CPI) | Weekly | 14 days | Alert |
| Internal (Fractional) | Real-time / daily | 7 days | Alert |

3.2 Freshness display

Every user-facing data point shows "as of [date]" when the underlying observation is older than:

  • 7 days for transaction data
  • 30 days for project data
  • 90 days for structural/spatial data
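
A sketch of how this rule might be applied at render time; the category keys and helper name are assumptions, not the product's actual code.

```python
from datetime import datetime, timezone

# Show "as of [date]" once the observation is older than these limits (days).
AS_OF_THRESHOLD_DAYS = {"transaction": 7, "project": 30, "structural_spatial": 90}

def needs_as_of_label(category: str, observed_at: datetime) -> bool:
    """True when the data point must carry an 'as of [date]' qualifier.

    `observed_at` is assumed to be timezone-aware.
    """
    age_days = (datetime.now(timezone.utc) - observed_at).days
    return age_days > AS_OF_THRESHOLD_DAYS[category]
```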

3.3 Freshness monitoring

Pipeline produces a daily freshness report:

For each (source, attribute_category):
  - Last successful extraction timestamp
  - Records updated in last 24h / 7d / 30d
  - % of records past staleness threshold
  - Alert: YES/NO

Alerts go to the Slack channel and the pipeline dashboard. Any critical-priority attribute going stale triggers an on-call investigation within 4 hours.
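
A sketch of how one row of the report could be computed from extraction timestamps; the record layout, field names, and the 5% alert threshold (borrowed from the dashboard target in section 8.1) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def freshness_report(records, staleness_days: int):
    """Summarise freshness for one (source, attribute_category) group.

    `records` is assumed to be a list of dicts with a timezone-aware
    `extracted_at` datetime.
    """
    now = datetime.now(timezone.utc)
    last_success = max(r["extracted_at"] for r in records)
    updated_24h = sum(r["extracted_at"] > now - timedelta(days=1) for r in records)
    stale = sum(r["extracted_at"] < now - timedelta(days=staleness_days) for r in records)
    stale_pct = 100.0 * stale / len(records)
    return {
        "last_successful_extraction": last_success,
        "updated_last_24h": updated_24h,
        "pct_past_staleness_threshold": round(stale_pct, 1),
        "alert": stale_pct > 5.0,  # assumed threshold, from the 8.1 dashboard target
    }
```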


4. Completeness rules

4.1 Fill-rate targets

| Priority | Target fill rate | Policy if below |
|---|---|---|
| critical | ≥ 90% across active projects | Pipeline bug or source degradation — investigate immediately |
| high | ≥ 70% | Acceptable gap; note in data-quality dashboard |
| medium | ≥ 50% | Expected for some attributes (floor plans, parking details) |
| low | No target | Nice-to-have |

4.2 Null handling

  • Explicit null with reason: every missing value stores a null_reason enum: source_missing, extraction_failed, not_applicable, awaiting_enrichment
  • No silent nulls: if a field is missing, the system must record why
  • User display: missing critical fields show "Not available — [reason]" not blank space
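
A sketch of what an explicit null might look like in the store; the enum class and record shape are illustrative, with field names drawn from sections 4.2 and 5.1.

```python
from enum import Enum

class NullReason(str, Enum):
    SOURCE_MISSING = "source_missing"
    EXTRACTION_FAILED = "extraction_failed"
    NOT_APPLICABLE = "not_applicable"
    AWAITING_ENRICHMENT = "awaiting_enrichment"

# A missing value is still written, with the reason attached rather than a silent null.
parking_details = {
    "value": None,
    "null_reason": NullReason.SOURCE_MISSING,
    "source": "maharera",
    "extracted_at": "2026-05-15T14:30:00+05:30",
}
```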

5. Lineage and audit trail

5.1 Per-attribute lineage record

Every attribute instance carries:

| Field | Type | Example |
|---|---|---|
| source | enum | maharera, igr, gr_portal, mahabhulekh, internal, enrichment, derived |
| source_url | string | https://maharera.maharashtra.gov.in/project/P52100012345 |
| source_doc_id | string | Internal document UUID |
| source_doc_page | int/null | Page number in PDF |
| extracted_at | datetime | 2026-05-15T14:30:00+05:30 |
| extraction_method | enum | scraper_v3, llm_gpt4o, ocr_tesseract, manual, formula |
| extraction_model_version | string | gpt-4o-2025-12-01 |
| confidence | float | 0.92 |
| confidence_method | enum | rule, model, human |
| human_verified | bool | false |
| human_verified_by | string/null | aishvarya |
| human_verified_at | datetime/null | |
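
Taken together, one attribute instance might carry a lineage block like the following; the dict representation and placeholder values are illustrative, with field names matching the table above.

```python
lineage = {
    "source": "maharera",
    "source_url": "https://maharera.maharashtra.gov.in/project/P52100012345",
    "source_doc_id": "internal-document-uuid",   # placeholder
    "source_doc_page": 4,
    "extracted_at": "2026-05-15T14:30:00+05:30",
    "extraction_method": "llm_gpt4o",
    "extraction_model_version": "gpt-4o-2025-12-01",
    "confidence": 0.92,
    "confidence_method": "model",
    "human_verified": False,
    "human_verified_by": None,
    "human_verified_at": None,
}
```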

5.2 Change history

For attributes that change over time (project cost revisions, completion dates, ownership):

change_history: [
  { value: "2027-03-31", source: "Form B v3", extracted_at: "2026-01-14", confidence: 0.95 },
  { value: "2026-12-31", source: "Form B v2", extracted_at: "2025-06-20", confidence: 0.95 },
  { value: "2025-12-31", source: "Form B v1", extracted_at: "2024-03-15", confidence: 0.95 }
]

This is critical for delay forensics — the number of revisions IS the signal.

5.3 Audit requirements

For DPDP and general defensibility:

  • Immutable extraction log: every extraction event logged with inputs, outputs, model version. Append-only.
  • Derived-attribute audit: every derived score stores the input attribute IDs + values at computation time, so a future query can reconstruct "why was this score X on date Y?"
  • PII audit: all access to PII-containing attributes (investor data) logged with accessor, timestamp, purpose.
  • Retention policy: extraction logs retained for 3 years minimum. PII logs per DPDP retention schedule.

6. Conflict resolution

6.1 When sources disagree

The source-of-truth hierarchy (from data-sources.md):

| Priority | Source | Wins on |
|---|---|---|
| 1 | IGR Index-II | Price, parties, transaction date |
| 2 | MahaRERA | Project metadata, dates, promoter |
| 3 | MahaBhulekh | Land ownership, area |
| 4 | GRs | Policy, infra timelines |
| 5 | GIS layers | Geospatial |
| 6 | Internal (Fractional) | Realised yields, vacancy |
| 7 | Licensed feeds | Where they aggregate primaries |
| 8 | News | Context only |
| 9 | Social | Sentiment only |
| 10 | Listings | Discovery only |

6.2 Conflict resolution process

When a new extraction disagrees with the existing canonical value:

1. If new_source priority > existing_source priority:
     → Replace. Log old value in change_history.

2. If new_source priority == existing_source priority:
     → Use more recent extraction (freshness wins within same tier).
     → Log conflict in conflict_log with both values + sources.

3. If new_source priority < existing_source priority:
     → Do NOT replace. Store as supplementary evidence.
     → Flag if the delta is > 20% on numeric values (investigate).

4. If the delta is > 50% on a numeric attribute from same-tier sources:
     → Flag for human review. Do not auto-resolve.
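
A compact sketch of these rules; the priority mapping follows section 6.1 (rank 1 = highest), the 20% and 50% delta checks follow the steps above, and everything else (names, record shape) is illustrative.

```python
def resolve_conflict(existing, incoming, priority):
    """Apply the rules in 6.2. Returns (resolution, needs_review).

    `existing` / `incoming` are dicts with numeric `value` and `source`;
    `priority` maps source name -> rank (1 = highest, per section 6.1).
    """
    delta = abs(incoming["value"] - existing["value"]) / max(abs(existing["value"]), 1e-9)
    new_rank, old_rank = priority[incoming["source"]], priority[existing["source"]]

    if new_rank < old_rank:                      # rule 1: higher-priority source wins
        return "replaced", False                 # old value is logged in change_history
    if new_rank == old_rank:                     # rules 2 and 4: same tier
        if delta > 0.50:
            return "flagged_for_review", True    # never auto-resolve large same-tier deltas
        return "replaced", False                 # freshness wins; conflict still logged
    # rule 3: lower-priority source never replaces; kept as supplementary evidence
    return "kept_existing", delta > 0.20         # investigate if delta exceeds 20%
```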

6.3 Conflict log

Maintain a queryable conflict log:

| Field | Type |
|---|---|
| attribute_id | string |
| entity_id | string (project/transaction/micromarket) |
| existing_value | any |
| existing_source | string |
| conflicting_value | any |
| conflicting_source | string |
| delta_pct | float |
| resolution | enum: replaced, kept_existing, flagged_for_review |
| resolved_by | string |
| resolved_at | datetime |

Monthly report: top 20 conflicting attributes by volume. Systemic conflicts indicate source degradation.


7. Quality SLAs per attribute priority

| Priority | Confidence floor | Fill rate floor | Max staleness | Conflict resolution SLA |
|---|---|---|---|---|
| critical | 0.85 | 90% | Per source table | 4 hours if flagged |
| high | 0.70 | 70% | Per source table | 24 hours |
| medium | 0.60 | 50% | 2x source table | 72 hours |
| low | No floor | No floor | No enforcement | Best-effort |

Violations against critical SLAs trigger an alert + entry in the quality incident log.


8. Quality monitoring and alerting

8.1 Daily quality dashboard

Generated by pipeline, consumed by product + data teams:

| Metric | Granularity | Target |
|---|---|---|
| Extraction success rate | Per source, per day | > 95% |
| Confidence distribution | Per attribute category | Median > 0.85 for critical |
| Fill rate | Per attribute | Per SLA table |
| Staleness % | Per source | < 5% of records stale |
| Conflict volume | Per attribute | < 2% of records |
| Human review queue depth | Global | < 50 items |

8.2 Alerts

| Alert | Trigger | Channel | Response SLA |
|---|---|---|---|
| Source down | Scraper returns errors for > 1 hour | Slack #pipeline-alerts | 2 hours |
| Confidence drop | Attribute category median drops > 10% day-over-day | Slack | 4 hours |
| Fill rate drop | Critical attribute fill rate drops > 5% | Slack | 4 hours |
| Staleness breach | Any critical attribute source past threshold | Slack | 4 hours |
| Conflict spike | > 10% conflict rate on any attribute in a day | Slack | 24 hours |
| Hallucination report | User reports incorrect AI output | Slack #incidents | 1 hour |

8.3 Quality incident log

Every alert that required intervention is logged:

| Field | Type |
|---|---|
| incident_id | UUID |
| detected_at | datetime |
| alert_type | enum |
| affected_attributes | list |
| affected_entity_count | int |
| root_cause | text |
| resolution | text |
| resolved_at | datetime |
| resolved_by | string |
| user_impact | bool (did any user see incorrect data?) |

Monthly review: patterns, systemic issues, improvement priorities.


9. Human-in-the-loop verification

9.1 When humans verify

  • Always for Title Clarity Score on new projects (before first display)
  • Always for Developer Trust Score on first computation
  • Always for flagged conflicts (> 50% delta, same-tier sources)
  • Sampling (5% random) on all LLM-extracted attributes weekly
  • On user report of any incorrect data

9.2 Verification workflow

1. Item enters review queue with context (attribute, entity, source docs, extraction, confidence)
2. Reviewer confirms or corrects
3. If corrected:
   a. Canonical store updated
   b. Old value moved to change_history
   c. confidence_method set to 'human'
   d. human_verified = true
   e. If correction indicates systemic extraction issue → raise pipeline bug
4. If confirmed:
   a. human_verified = true
   b. confidence boosted by 0.05 (capped at 1.0)
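
A sketch of folding a reviewer's decision back into the record, following the steps above; the record shape and helper name are illustrative.

```python
def apply_review(record: dict, corrected_value=None) -> dict:
    """Fold a human review back into an attribute record per section 9.2."""
    if corrected_value is not None and corrected_value != record["value"]:
        # Correction: preserve the old value in change_history, then overwrite.
        record.setdefault("change_history", []).insert(0, {
            "value": record["value"],
            "source": record["source"],
            "extracted_at": record["extracted_at"],
            "confidence": record["confidence"],
        })
        record["value"] = corrected_value
        record["confidence_method"] = "human"
    else:
        # Confirmation: small confidence boost, capped at 1.0.
        record["confidence"] = min(1.0, record["confidence"] + 0.05)
    record["human_verified"] = True
    return record
```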

9.3 Review capacity planning

At scale (~40,000 projects, ~500 new/month):

  • Title Clarity: ~500 new projects/month → 500 reviews/month
  • Developer Trust: ~100 new developers/year → manageable
  • Sampling: 5% of LLM extractions at ~10 attrs/project × 500 projects × 5% = ~2,500 spot-checks/month
  • Conflicts: estimate ~200/month flagged

Total: ~3,200 review items/month. At 5 min/item, that is ~270 hours/month. Budget for 1.5 FTE of data-quality analyst time, or distribute across the team with tooling.


10. Data quality for AI/LLM outputs

Special rules for derived attributes that involve LLM generation (narratives, summaries, explanations):

10.1 Grounding enforcement

Every LLM-generated output must:

1. Reference only attributes present in the canonical store
2. Not assert any fact not traceable to a stored attribute
3. Include inline citations that resolve to real source_urls
4. Pass a post-generation grounding check (automated: verify all cited numbers match canonical store)
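
A minimal sketch of the automated check in step 4; the claim representation, store interface, and tolerance are assumptions.

```python
def grounding_check(claims, canonical_store, tolerance=0.01) -> bool:
    """Verify every cited number matches the canonical store within tolerance.

    `claims` is assumed to be a list of (attribute_id, cited_value) pairs extracted
    from the generated text; `canonical_store` maps attribute_id -> stored value.
    """
    for attribute_id, cited_value in claims:
        stored = canonical_store.get(attribute_id)
        if stored is None:                      # claim not traceable to a stored attribute
            return False
        if abs(cited_value - stored) > tolerance * max(abs(stored), 1.0):
            return False                        # cited number drifted from the canonical value
    return True
```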

10.2 Hallucination detection

Pipeline for every LLM output:

1. Generate output with citations
2. Extract all factual claims (regex + NLI model)
3. For each claim:
   a. Resolve citation to canonical store
   b. Verify numerical value matches (within tolerance)
   c. Verify entity reference is correct
4. If any claim fails verification:
   a. Flag output, do NOT serve to user
   b. Regenerate with stricter prompt
   c. If still fails: serve partial output with failed claims removed + "I couldn't verify X" note
5. Log all verification results for eval monitoring
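
A sketch of the serve/regenerate gate in steps 4 and 5, assuming callables that verify a single claim and regenerate with a stricter prompt; all names are illustrative.

```python
def gate_llm_output(draft, verify_claim, regenerate, max_attempts=2):
    """Serve only verified output; regenerate on failure, then degrade to partial output.

    `draft` is assumed to be a dict with a `claims` list; `verify_claim` checks one
    claim against the canonical store; `regenerate` reruns generation with a
    stricter prompt targeting the failed claims.
    """
    for _ in range(max_attempts):
        failed = [c for c in draft["claims"] if not verify_claim(c)]
        if not failed:
            return draft, []            # every claim verified: safe to serve
        draft = regenerate(draft, failed)
    # Still failing after regeneration: the caller serves a partial answer with the
    # failed claims removed and an explicit "I couldn't verify X" note per claim.
    failed = [c for c in draft["claims"] if not verify_claim(c)]
    return draft, failed
```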

10.3 Tone compliance

Every LLM output is checked for Honest Broker compliance:

  • No "you should" / "I recommend" / "good investment" language (regex + classifier)
  • Probability framing present on projections
  • "I don't know" present when confidence is low

Violations: block output, regenerate with compliance prompt injection.
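
A sketch of the regex half of this check; the pattern list is drawn from the phrases above and is illustrative, with a classifier assumed to sit behind it for paraphrases.

```python
import re

# Phrases the Honest Broker never uses; a classifier catches paraphrases these miss.
PROHIBITED = [
    r"\byou should\b",
    r"\bI recommend\b",
    r"\bgood investment\b",
]

def violates_tone(text: str) -> bool:
    """True if the output contains prescriptive or recommending language."""
    return any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in PROHIBITED)
```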


11. Data quality culture

11.1 Ownership

Every attribute category has a named owner (see data-attributes.md). The owner is accountable for quality SLAs on their attributes.

11.2 Weekly quality standup

15-minute weekly review:

  • Top 5 quality incidents
  • Fill rate and staleness trends
  • Human review queue depth
  • Any new conflict patterns

11.3 Quarterly quality retro

  • Full SLA compliance review
  • Confidence calibration: are our confidence scores accurate? (Compare human-verified subset against automated scores.)
  • Extraction model performance review
  • Source health review (any sources degrading?)
  • Update SLA thresholds if needed

12. Bootstrapping quality (first 90 days)

During the foundation phase, before full automation:

| Week | Focus |
|---|---|
| 1-2 | Establish confidence baseline: manually grade 100 projects across all critical attributes |
| 3-4 | Set up freshness monitoring for MahaRERA + IGR scrapers |
| 5-6 | Implement automated conflict detection on first 1,000 projects |
| 7-8 | Deploy hallucination detection on LLM extraction pipeline |
| 9-10 | First quality dashboard live; first weekly standup |
| 11-12 | First derived-attribute human verification pass (Title Clarity, Developer Trust) |

See also:

  • data-attributes.md — attribute catalogue with per-attribute ownership
  • derived-attributes-spec.md — math specs including validation methods
  • pipeline-spec-for-vishal.md — SLAs consumed by pipeline
  • ../../.cursor/skills/proppie-data-sources/SKILL.md — source hierarchy