Data Sources — Detailed Catalogue¶
This is the narrative companion to /.cursor/skills/proppie-data-sources/SKILL.md. The SKILL is the quick-reference; this doc has the longer narrative, the layered architecture diagram, the source-of-truth rules, and the gap analysis.
The five layers of PropPie's data¶
flowchart TD
L0[Layer 0: Raw Capture<br/>Scrapers / APIs / Document OCR / NLP]
L1[Layer 1: Canonical Extraction<br/>Vishal's schema v2 - structured per source]
L2[Layer 2: Entity Resolution + Fusion<br/>Same project across RERA + IGR + News]
L3[Layer 3: Derived Attributes<br/>Scores, indices, velocities, trajectories]
L4[Layer 4: Product-facing API<br/>Asset cards, dashboards, conversational]
L0 --> L1 --> L2 --> L3 --> L4
This document covers L0 and feeds L1. The derivations live in derived-attributes-spec.md. Lineage/confidence/freshness rules live in data-quality-framework.md.
Source-of-truth hierarchy¶
When two sources disagree about the same attribute, resolve in this order:
| Priority | Source | Wins on |
|---|---|---|
| 1 | IGR Index-II (registered deed) | Sale price, transaction date, parties, stamp duty paid |
| 2 | MahaRERA project filings | Project metadata, plans, completion dates, promoter, financials |
| 3 | MahaBhulekh / 7-12 / Property Card | Land ownership, area, mutations |
| 4 | Government Resolutions (GRs) | Policy, infra projects, zoning |
| 5 | GIS layers (MRSAC, OSM, Bhuvan) | Geospatial, proximity, environmental |
| 6 | Internal data (Fractional platform) | Realised yields, vacancy, secondary trades |
| 7 | Licensed feeds (PropEquity, CRE Matrix) | Where they aggregate primary sources |
| 8 | News articles | Event detection, context — not factual claims |
| 9 | Social / forums | Sentiment only, never factual |
| 10 | Listing portals | Discovery only, not pricing truth |
Every attribute carries its source, source_url, extracted_at, and confidence — so resolution is auditable.
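A minimal sketch of how this resolution could look in code. The source keys, the `Observation` type and the `resolve` function are hypothetical names, not part of the schema. The sketch also treats the priority as global and ignores the "Wins on" column, which a real resolver would need to respect per attribute.

```python
from dataclasses import dataclass
from datetime import datetime

# Ranks copied from the table above (lower rank wins). Key names are assumed.
SOURCE_PRIORITY = {
    "igr_index_ii": 1, "maharera": 2, "mahabhulekh": 3, "gr": 4, "gis": 5,
    "internal": 6, "licensed_feed": 7, "news": 8, "social": 9, "listing_portal": 10,
}

@dataclass(frozen=True)
class Observation:
    attribute: str
    value: object
    source: str            # key into SOURCE_PRIORITY
    source_url: str
    extracted_at: datetime
    confidence: float      # assumed to be on a 0.0-1.0 scale

def resolve(observations: list[Observation]) -> Observation:
    """Pick the winner: best source rank first, newest extraction as tie-break."""
    if not observations:
        raise ValueError("no observations to resolve")
    return min(
        observations,
        key=lambda o: (SOURCE_PRIORITY[o.source], -o.extracted_at.timestamp()),
    )
```

Because the winning `Observation` keeps its own `source_url` and `extracted_at`, the losing values can be stored alongside it for audit.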
Source #1 — MahaRERA¶
Access¶
- Portal: https://maharera.maharashtra.gov.in/
- API: None public
- Approach: Scraping with robust monitors
- Polite limits: 2 req/sec sustained, courtesy delays during peak hours
- robots.txt: Generally permissive for project pages
- Legal basis: Public records under RERA Act § 11
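The 2 req/sec limit could be enforced with a small throttle like the one below. This is a sketch only: the `Throttle` class is a hypothetical helper, and the peak-hour courtesy delays are not modelled.

```python
import time

class Throttle:
    """Spaces calls at least 1/rate_per_sec seconds apart."""

    def __init__(self, rate_per_sec: float = 2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous permitted call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

A scraper would call `throttle.wait()` immediately before each request to the portal.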
What we extract (mapped to Vishal's schema v2)¶
Vishal's schema covers this comprehensively. The project_identity, location, land_and_area_metrics, plot_details, parties, legal_and_compliance, rera_form_b, architectural_plans, escrow_bank_account, loan_disclosure, financial_details, project_info, documents objects all derive primarily from MahaRERA.
Document types ingested¶
| Document | What it tells us | Quality | Typical issues |
|---|---|---|---|
| Form A (application) | Project identity, promoter, location | Good | Older filings have scanned PDFs |
| Form B (project details) | Phases, towers, units, timeline, costs | Good | Free-form text in some fields |
| Form C (quarterly progress) | Construction status updates | Medium | Often perfunctory, hard to verify |
| Building Plan Approval | Sanctioned built-up, FSI, layout | Variable | Large PDFs, plan drawings |
| Commencement Certificate (CC) | Construction legally started | Good | |
| Occupancy / Part Occupancy Cert | Habitability | Good | |
| Title Search Report | Chain of title summary | Variable | Lawyer-written, varying detail |
| Title Certificate | Legal opinion on title | Variable | |
| Search Report (SR office) | Encumbrances | Variable | |
| Land Plan, Site Plan | Layout | Variable | Vector vs scanned |
| Layout Plan, Floor Plan | Spatial layout | Variable | |
| Sale Agreement (template) | Buyer terms | Good | Often boilerplate |
| Allotment Letter (template) | Booking terms | Good | |
| Parking Plan | Parking provision | Sometimes missing | |
| Environment Clearance | EC certificate | Sparse | Only large projects |
| Stamp Payment Receipt | Stamp duty paid | Indirect (implied by stamp duty rate) | |
Update cadence¶
- Project page re-scrape: Monthly
- Document delta: Detect file-list changes, re-extract changed documents
- Form C (progress): Quarterly per project
- Complaint records: Weekly
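The document-delta step can be sketched as a comparison of two file listings for the same project. Using a SHA-256 content fingerprint is an assumption here; the portal's own file names or sizes might be enough in practice. `fingerprint` and `document_delta` are hypothetical names.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Content hash used to notice that a file changed under the same name."""
    return hashlib.sha256(content).hexdigest()

def document_delta(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare {filename: fingerprint} maps from two scrapes of one project."""
    prev_names, curr_names = set(previous), set(current)
    return {
        "added": sorted(curr_names - prev_names),
        "removed": sorted(prev_names - curr_names),
        "changed": sorted(n for n in curr_names & prev_names if current[n] != previous[n]),
    }
```

Only the `added` and `changed` files would then need re-extraction.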
Known gotchas¶
- Shell registrations: some projects register but never upload documents. Detect via documents count = 0.
- Name changes: promoter mergers/demergers don't auto-link. Use PAN/CIN as primary key.
- Phase confusion: large projects register phases separately; reconcile via alternate_names and address proximity.
- Withdrawn projects: re-appear later under different names. Maintain a withdrawal-aware history.
- Scanned PDFs: older filings (pre-2020) often scanned; OCR + LLM extraction needed.
- Marathi text: addresses, descriptions, court orders often in Marathi.
- Bulk download is rate-limited: be polite, parallelise carefully.
- Date formats: DD/MM/YYYY, DD-MM-YYYY, "31st March 2026" all appear; normalise on extraction.
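For the date-format gotcha, a minimal normaliser covering just the three formats listed might look like this. The function name is hypothetical, and other formats found in the wild would need adding.

```python
import re
from datetime import date, datetime

# Strips ordinal suffixes that directly follow a digit: "31st" -> "31".
_ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b", re.IGNORECASE)
_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %B %Y")

def normalise_date(raw: str) -> date:
    """Parse DD/MM/YYYY, DD-MM-YYYY or '31st March 2026' into a date."""
    cleaned = _ORDINAL.sub("", raw.strip())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")
```

Raising on unknown formats, rather than guessing, keeps bad dates out of the canonical store and surfaces new variants early.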
Volume estimates¶
- ~40,000 registered projects (residential + commercial) since 2017
- ~5-10 documents per project on average → ~300,000 documents total
- ~500-1,000 new project registrations per month
- ~50,000 progress updates per quarter
Source #2 — IGR Maharashtra (Inspector General of Registration)¶
Access¶
- Public portal: https://igrmaharashtra.gov.in/
- Search interface: District/SR-office → Year → Index-II lookup
- Bulk feeds: Licensed via PropEquity, CRE Matrix, etc. (₹2-25 Lakh/year/state)
- Approach: Hybrid — official portal for sample/spot-check, licensed feed for bulk
- Legal basis: Public records under Registration Act 1908 § 57
Index-II structure (what each deed gives us)¶
| Field | Always present | Notes |
|---|---|---|
| Deed number, SR office, sub-district, registration date | Yes | Primary key |
| Type of deed (Sale, L&L, Gift, Mortgage, etc.) | Yes | |
| Property identifier (CTS / Survey, address) | Yes | Address often Marathi |
| Parties (vendor / vendee / lessor / lessee) | Yes | With addresses, sometimes PAN |
| Consideration (sale price OR rent + tenure) | Yes | The declared amount |
| Market value (per ASR) | Yes | Government's view |
| Stamp duty + registration fee paid | Yes | |
| Document URL / scan | Sometimes | Often paywalled or rate-limited |
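The consideration and market-value fields are the inputs to the ASR Gap % used to flag under-declared prices. The sketch below assumes ASR refers to the Annual Statement of Rates (ready reckoner) value and assumes a sign convention in which a negative gap means the declared price is below the government value. The authoritative definition belongs in derived-attributes-spec.md.

```python
def asr_gap_pct(consideration: float, market_value: float) -> float:
    """Percentage by which the declared consideration differs from the ASR value.

    Negative means the deed declares less than the government's market value.
    """
    if market_value <= 0:
        raise ValueError("market_value must be positive")
    return (consideration - market_value) * 100.0 / market_value
```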
Update cadence¶
- Daily for priority micromarkets (Pune Tier-1, MMR core, Bhiwandi warehousing)
- Weekly for tier-2 micromarkets
- Monthly rollups for trend analytics
Known gotchas¶
- Declared price ≠ actual price for high-value transactions (cash component is illegal but real). ASR Gap % helps detect; we flag, never accuse.
- Marathi property descriptions require good translation pipeline.
- Same property registered multiple times (correction deeds, family transfers). Dedupe by (property_id, parties, date proximity).
- L&L lock-in < 11 months = designed to avoid registration; if registered, often suspicious.
- Stamp duty rates differ by deed type, property type (residential vs commercial), and exemption status.
- Deeds are indexed 24-72 hours after registration, so hedge any sub-day freshness claim.
- Bulk feed costs are non-trivial; budget item.
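The dedupe rule (property_id, parties, date proximity) can be sketched as below. The `Deed` type, the exact-match comparison of parties and the 30-day window are all assumptions for illustration; family transfers that change the party set would need a looser comparison.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Deed:
    deed_no: str
    property_id: str
    parties: frozenset[str]
    registered_on: date

def group_duplicates(deeds: list[Deed], window_days: int = 30) -> list[list[Deed]]:
    """Group deeds with the same property and parties registered close together.

    A deed joins the open group for its key when it falls within window_days
    of that group's most recent deed; otherwise it starts a new group.
    """
    groups: list[list[Deed]] = []
    open_group: dict[tuple, list[Deed]] = {}
    for deed in sorted(deeds, key=lambda d: d.registered_on):
        key = (deed.property_id, deed.parties)
        group = open_group.get(key)
        if group and (deed.registered_on - group[-1].registered_on).days <= window_days:
            group.append(deed)
        else:
            group = [deed]
            open_group[key] = group
            groups.append(group)
    return groups
```

Groups with more than one deed are candidates for review (for example, an original deed plus its correction deed), not automatic deletions.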
Linking IGR → MahaRERA¶
The hardest fusion problem. We use:
- Survey / CTS number overlap
- Geocoded address proximity (< 200m + textual fuzzy match)
- Promoter name match (Class I) + parties match for sales from promoter to buyer
- Project name in property description

Expected match rate: 40-60% for residential; higher for commercial promoter→buyer transactions; lower for the secondary market.
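A sketch of the first two linking signals only (survey/CTS overlap, and proximity plus fuzzy address match). The dict field names, the 0.8 similarity threshold and the use of `difflib` are assumptions; the promoter and project-name signals are left out.

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def text_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_candidate_match(deed: dict, project: dict,
                       max_dist_m: float = 200.0, min_text_sim: float = 0.8) -> bool:
    # Signal 1: any shared survey / CTS number.
    if set(deed["survey_numbers"]) & set(project["survey_numbers"]):
        return True
    # Signal 2: within 200 m AND textually similar addresses.
    dist = haversine_m(deed["lat"], deed["lon"], project["lat"], project["lon"])
    return dist < max_dist_m and text_similarity(deed["address"], project["address"]) >= min_text_sim
```

Treating the result as a candidate rather than a confirmed link leaves room for the remaining signals to raise or lower confidence.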
Source #3 — Government Resolutions (GRs) and Notifications¶
Access¶
- Master portal: https://gr.maharashtra.gov.in/
- Department portals: UDD, Revenue, Housing, Industries, Environment, Town Planning
- Approach: Scraping + RSS where available
- Update cadence: Daily ingest, hourly during volatile periods (budget season, election notifications)
Departments to track¶
| Department | Why it matters |
|---|---|
| Urban Development Department (UDD) | Master plan changes, TP schemes, DCR notifications, FSI revisions, TDR |
| Revenue Department | Land conversion, NA orders, stamp duty changes |
| Housing Department | Affordable housing schemes, slum rehabilitation |
| Public Works Department | Road, bridge, infrastructure projects |
| Industries Department | MIDC plot allotments, SEZ notifications |
| Environment Department | EC clearances, CRZ, forest |
| Town Planning | DP changes, zoning, reserve plot use changes |
NLP pipeline¶
GRs are unstructured. Pipeline:
1. Ingest PDFs from portals (often Marathi)
2. OCR + translation to English (LLM-assisted for nuance)
3. Classify real-estate relevance (binary: relevant/not, then category)
4. Entity extraction: affected districts, talukas, project names, survey numbers, coordinates, effective dates
5. Impact scoring: high/medium/low impact on real-estate markets
6. Store with full text + classification + entities + impact
Known gotchas¶
- ~80% of GRs are irrelevant; classifier critical to keep noise out
- Effective date ≠ publication date — extract both
- Amendments — a single decision may have 4+ amendment GRs; chain them
- Annexures with coordinates / shape data are often what matters most
- Marathi nuance: the same word, e.g. "उपलब्ध" (available), can mean very different things in different bureaucratic constructions; use LLM-assisted translation
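Amendment chaining from the list above can be sketched as follows, assuming entity extraction has already produced, for each GR, the number of the GR it amends. The dict keys `gr_no`, `published` and `amends` are hypothetical field names.

```python
def chain_amendments(grs: list[dict]) -> dict[str, list[str]]:
    """Map each root GR number to its chain, ordered by publication date.

    Each dict needs 'gr_no' and 'published' (ISO date string); 'amends' is the
    gr_no of the GR being amended, or absent for an original decision.
    """
    parent = {g["gr_no"]: g.get("amends") for g in grs}

    def root_of(gr_no: str) -> str:
        seen = set()  # guards against accidental reference cycles
        while parent.get(gr_no) and gr_no not in seen:
            seen.add(gr_no)
            gr_no = parent[gr_no]
        return gr_no

    chains: dict[str, list[str]] = {}
    for g in sorted(grs, key=lambda g: g["published"]):
        chains.setdefault(root_of(g["gr_no"]), []).append(g["gr_no"])
    return chains
```

An amendment of an amendment resolves to the original decision, so a single lookup by root returns the whole history.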
Coverage targets¶
- 100% of GRs from priority departments (UDD, Revenue, Housing) within 24 hours of publication
- 100% archive back to 2017 for trend analysis
Source #4 — MahaBhulekh and Land Records¶
Access¶
- Rural 7/12: https://bhulekh.mahabhumi.gov.in/
- Digital 7/12: https://digitalsatbara.mahabhumi.gov.in/
- Property Card (urban): City-specific portals (Mumbai's is at https://mahabhulekh.maharashtra.gov.in/PropertyCard)
- Approach: Per-property lookup (no bulk available); on-demand caching
What we extract¶
| Attribute | Source | Notes |
|---|---|---|
| Holder name(s), share | 7-12 / PC | |
| Area (hectares + gunthas / sqm) | 7-12 / PC | |
| Land class (Class I / Class II) | 7-12 | Class II = restricted transfer |
| Crops / use | 7-12 | Agricultural |
| Mutations (changes in ownership) | 7-12 / PC | Dates + reasons |
| Encumbrances / charges | 7-12 (Other Rights column) | Often partial |
| NA conversion order ref | 7-12 / Revenue records | If applicable |
Known gotchas¶
- Survey numbers split over time — track mutation chain
- Class II land can't be transferred without permission — always flag
- 7-12 is stale on mutations by 30-90 days; note last_mutation_date
- Urban CTS ≠ rural survey numbers; they are different systems
- Property Cards are less digitised; some cities require physical request
Source #5 — GIS layers¶
Sources by use case¶
| Use case | Primary source | Secondary | License |
|---|---|---|---|
| Flood / inundation zones | MRSAC | BMC stormwater drainage data | MRSAC may require request |
| Forest cover | MRSAC + Forest Dept GIS | ISRO Bhuvan | Public + restricted |
| CRZ (Coastal Reg Zone) | MoEFCC notifications | MRSAC | Public |
| Land use / zoning | City Development Plan shapefiles | Bhuvan | Variable |
| Roads | OpenStreetMap | Google Roads API | OSM open / Google paid |
| Transit (metro, suburban rail) | Maha-Metro, MMRC, MMRDA | OSM | Mostly public |
| MIDC industrial boundaries | MIDC portal | | Public |
| SEZ boundaries | Government notifications | | Public |
| POIs (schools, hospitals, malls) | Google Places API | OSM | Google paid / OSM free |
| Demographic micro-areas | Census 2011 boundaries | MRSAC | Public but stale |
Pipeline¶
- Normalise all coordinates to EPSG:4326 on ingest
- Vector layers stored in PostGIS
- Spatial queries via PostGIS / GeoPandas; complex routing via OSRM
- Heat layers (sentiment, velocity) computed via H3 hex grid for consistent aggregation
Known gotchas¶
- Coordinate systems vary (geographic WGS84 / EPSG:4326, projected UTM zone 43N / EPSG:32643). Normalise on ingest.
- MRSAC layers often need formal request — budget time
- DP shapefiles poor for tier-2 cities
- OSM building footprints incomplete in new developments
- Bhuvan older than ground truth in fast-changing areas
Source #6 — News, analyst reports, social¶
News sources¶
| Source | Type | Notes |
|---|---|---|
| Economic Times Realty | Mainstream, English | Daily |
| Hindustan Times Real Estate | Mainstream, English | Daily |
| Mint Real Estate | Analytical, English | Weekly + breaking |
| Business Standard | Business, English | Daily |
| Moneycontrol Real Estate | Investor angle, English | Daily |
| Loksatta / Maharashtra Times | Local, Marathi | Daily; hyperlocal infra news |
| Knight Frank, JLL, CBRE, Anarock, Cushman, Colliers reports | Analyst, English | Quarterly |
| MagicBricks / Housing / NoBroker blogs | Aspirational, English | Discovery only |
Pipeline¶
- Ingest (RSS + scraping + occasional vendor API)
- Entity linking — tag each article to project / micromarket / developer / authority / GR
- Sentiment — VADER + LLM-based per-entity sentiment
- Topic classification — pricing / launch / delay / regulation / infra / dispute
- Aggregate weekly per entity → input to derived sentiment score
- Never quote as our own claim — always attribute
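The weekly aggregation step could be as simple as a mean per entity per ISO week. This is a sketch under assumptions: the input tuple shape and the [-1, 1] score range are illustrative, and the real derived sentiment score may weight sources differently.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def weekly_entity_sentiment(mentions):
    """Aggregate per-article scores into one value per entity per ISO week.

    mentions: iterable of (entity_id, published_date, score), score in [-1, 1].
    Returns {(entity_id, iso_year, iso_week): mean score}.
    """
    buckets = defaultdict(list)
    for entity_id, published, score in mentions:
        iso = published.isocalendar()  # (ISO year, ISO week, weekday)
        buckets[(entity_id, iso[0], iso[1])].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

Keying on the ISO year as well as the week avoids merging the first week of January with the same week number of another year.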
Social listening¶
| Source | Use |
|---|---|
| Twitter / X | Real-time event, complaint sentiment |
| Reddit (r/india, r/mumbai, r/pune, r/IndianStreetBets) | Investor sentiment, complaints |
| YouTube (real-estate channels) | Comments + transcript sentiment |
| Local forums (Pune Whispers etc.) | Hyperlocal trust signal |
Vendor: Brandwatch / Talkwalker / Mention (₹15-30L/yr) OR self-built scraping + Meilisearch.
Always aggregate to entity-level signal. Never cite individual posts in user-facing content (defamation risk).
Cadence¶
- News: Hourly
- Social: Hourly
- Analyst reports: Quarterly
Source #7 — Internal data (PropPie Fractional)¶
What we have¶
| Data | Sensitivity | Use |
|---|---|---|
| Listed assets and SPV structures | Commercial | Ground truth for asset attributes, AI overlay |
| Investor KYC records | High (PII) | Personalisation (consent-required), never AI training without consent |
| Investment portfolios | Medium | Portfolio-fit recommendations (info-only) |
| Distribution history | Medium | Realised yields per asset — gold standard |
| Vacancy / occupancy data | Medium | Ground truth for cap-rate / yield validation |
| Secondary market trades | Medium | Liquidity signals |
| Asset documents (lease, title, valuation) | Variable | Document Q&A |
Access governance¶
- AI layer gets pseudonymised investor data only (no PAN/Aadhaar in prompts)
- Asset-level documents are commercial data — accessible
- PII vault separate from analytics layer
- Per-investor consent for AI features (separate from KYC consent)
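A sketch of the first rule, assuming pseudonyms are derived with a keyed hash and that prompts get a last-line regex scrub. The function names are hypothetical. The Aadhaar pattern matches any 12-digit number, so it can over-redact; that is the safer failure mode here.

```python
import hashlib
import hmac
import re

# PAN: five letters, four digits, one letter. Aadhaar: 12 digits, optionally grouped in fours.
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")

def pseudonymise(investor_id: str, secret_key: bytes) -> str:
    """Stable pseudonym; the key stays in the PII vault, never in the analytics layer."""
    digest = hmac.new(secret_key, investor_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def redact_identifiers(text: str) -> str:
    """Replace PAN- and Aadhaar-shaped strings before text reaches a prompt."""
    text = PAN_RE.sub("[PAN]", text)
    return AADHAAR_RE.sub("[AADHAAR]", text)
```

A keyed HMAC rather than a plain hash means the pseudonym cannot be reversed by hashing a list of known investor IDs without the key.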
Source #8 — Third-party / licensed feeds (budget)¶
| Feed | What it gives | Approx cost (₹/year) | Necessity |
|---|---|---|---|
| PropEquity | Pan-India transaction + project database | 5-25L | High — speeds up B2B Analytix |
| CRE Matrix | Commercial lease + sale transactions | 3-15L | High for B2B commercial |
| Liases Foras | Residential micromarket + forecasts | 5-20L | Medium |
| CIBIL Commercial | Developer credit ratings | per-pull | Medium for Developer Trust Score |
| MCA21 / Tofler / Probe42 | Company filings, directors, litigation | 1-5L | Medium for promoter diligence |
| CERSAI | Property charges registry | API access | Required for encumbrance |
| News API (Aggregate Intelligence, etc.) | Structured news | 2-10L | Optional — can self-scrape |
| Brandwatch / Talkwalker | Social listening | 15-30L | Optional — can self-build |
| Skymet / IMD | Weather/climate | Free + paid tiers | Medium for climate-risk attrs |
| Google Maps Platform | POIs, distance matrix | Per-call | Medium |
Total budget estimate for "full data stack" pre-revenue: ₹40-80L/year. Cheaper alternative (self-scrape primary, skip licensed feeds where possible): ₹15-25L/year + significant engineering time.
Gaps — Data we don't have but want¶
These are explicit. Call out the gap whenever a product feature depends on them.
| Gap | Workaround today | Future fix |
|---|---|---|
| Demographic micro-data (ward-level income, education) | Census 2011 + proxies | Anonymised app analytics partnerships; targeted surveys |
| Employment density shapefile | SEZ / IT-park boundaries | Telco data partnership (anonymised) |
| Crime statistics by area | NCRB aggregates only | RTI requests for selected hotspots |
| School / hospital quality | Review proxies | Partnership with rating agencies |
| Construction quality / structural audits | Only when complaints | Independent audit partnership |
| Builder financial stress | Partial via MCA + news | Subscription to credit feeds |
| NRI investor sentiment | None | Custom surveys / community partnerships |
| Real-time policy impact (post-GR) | News + manual analysis | Government engagement / advisory |
| Sub-market commute times | Google Distance Matrix (paid) | Local mobility partnerships |
| Property condition (interior, age, refurbishment) | Listings (unreliable) | User-submitted with verification |
Putting it together — the source landscape¶
flowchart LR
subgraph Govt[Government / Public]
RERA[MahaRERA]
IGR[IGR Index-II]
BHU[MahaBhulekh / PC]
GR[GR Portals]
MRSAC[MRSAC GIS]
end
subgraph Lic[Licensed / Paid]
PE[PropEquity]
CRE[CRE Matrix]
CIBIL[CIBIL Commercial]
MCA[MCA / Tofler]
Brand[Brandwatch]
end
subgraph Open[Open / Free]
OSM[OpenStreetMap]
News[News RSS]
Soc[Social Scraping]
Bhuvan[ISRO Bhuvan]
end
subgraph Int[Internal]
Frac[PropPie Fractional DB]
end
Govt --> Pipeline[Vishal's Pipeline]
Lic --> Pipeline
Open --> Pipeline
Int --> Pipeline
Pipeline --> CanonicalStore[Canonical Attribute Store]
CanonicalStore --> Products[Broker / Analytix / Fractional AI overlay]
Source-attribution display rules (user-facing)¶
When showing data to users, use these formats:
| Attribute type | Display format |
|---|---|
| RERA-sourced | "Per MahaRERA Form B v3 (filed 14 Jan 2026)" + link |
| IGR-sourced | "Per registered sale deed, Pune SR-7, 22 Mar 2026" + link |
| Aggregated transactions | "Median of 23 registered transactions, Jan-Mar 2026, ±5% confidence" |
| GR-sourced | "Per GR No. UDD/2025/123, dated 14 Aug 2025" + link |
| News-sourced | "As reported by ET Realty, 18 Feb 2026" + link |
| Derived (Score) | "PropPie-derived score; click to see inputs" (links to expand) |
| Forecast | "Model projection; range ±X% at 80% confidence" with assumptions |
| Internal (Fractional) | "From PropPie Fractional asset records; verified" |
Display the source on every user-facing data point. No exceptions.
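One way to enforce this in code is a single formatter that refuses to render a data point without a known source type. The sketch covers four of the rows above; the `attribution` function and its keyword names are hypothetical, and links are left to the rendering layer.

```python
from datetime import date

def _fmt(d: date) -> str:
    """Format as '14 Jan 2026' (day without leading zero)."""
    return f"{d.day} {d.strftime('%b %Y')}"

def attribution(kind: str, **f) -> str:
    """Build the user-facing source line for one data point."""
    if kind == "rera":
        return f"Per MahaRERA {f['form']} (filed {_fmt(f['filed_on'])})"
    if kind == "igr":
        return f"Per registered sale deed, {f['sr_office']}, {_fmt(f['registered_on'])}"
    if kind == "gr":
        return f"Per GR No. {f['gr_no']}, dated {_fmt(f['dated'])}"
    if kind == "news":
        return f"As reported by {f['outlet']}, {_fmt(f['published_on'])}"
    # Unknown kinds fail loudly so nothing is shown without a source.
    raise ValueError(f"no display format defined for {kind!r}")
```

For example, `attribution("gr", gr_no="UDD/2025/123", dated=date(2025, 8, 14))` yields the GR row's text from the table.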
See also:
- /.cursor/skills/proppie-data-sources/SKILL.md — quick reference
- data-attributes.md — the ~140-attribute spec
- derived-attributes-spec.md — math for derived
- data-quality-framework.md — lineage / confidence / freshness
- pipeline-spec-for-vishal.md — contract for the pipeline