Data Attributes — Canonical Catalogue
This is the canonical attribute spec for PropPie. It is the refined evolution of the original 92-attribute CSV, brought in line with Vishal's extraction schema v2, the SOUL/compliance posture, and the wow-moments roadmap. It covers ~140 attributes across 16 categories.
The machine-readable version is data-attributes.csv. The derivation math for every score / index lives in derived-attributes-spec.md. Lineage and quality rules are in data-quality-framework.md.
How to read this catalogue
Every attribute has:
| Column | Meaning |
| --- | --- |
| ID | Stable identifier (used in pipeline, API, derived references) |
| Category | One of 16 buckets |
| Attribute | Human name |
| Type | raw (extracted directly) / derived (computed from others) / enrichment (from external API/feed) |
| Priority | critical / high / medium / low; drives sequencing |
| Source | Primary upstream source(s) |
| Refresh cadence | How often it's updated |
| Source confidence (target) | Baseline confidence (0-1) we expect on extraction |
| Legal basis | What makes ingestion lawful |
| Used in | Which product surface(s): Broker, Analytix, Fractional |
| Wow moment | The user-facing magic this enables |
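As a minimal sketch, the columns above could be held as a typed record in Python. The class and field names here (`AttributeDef`, `source_confidence_target`) are illustrative assumptions, not the actual headers of data-attributes.csv, and only a subset of the columns is shown.

```python
from dataclasses import dataclass
from enum import Enum


class AttrType(str, Enum):
    RAW = "raw"                # extracted directly
    DERIVED = "derived"        # computed from other attributes
    ENRICHMENT = "enrichment"  # from an external API/feed


class Priority(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class AttributeDef:
    """One catalogue row. Field names are hypothetical, chosen for illustration."""
    id: str
    category: str
    attribute: str
    type: AttrType
    priority: Priority
    source: str
    source_confidence_target: float  # baseline confidence, 0-1

    def __post_init__(self) -> None:
        # Reject targets outside the 0-1 range the catalogue specifies.
        if not 0.0 <= self.source_confidence_target <= 1.0:
            raise ValueError("source_confidence_target must be within 0-1")


row = AttributeDef(
    id="proj.maharera_number",
    category="Project Identity",
    attribute="MahaRERA registration number",
    type=AttrType("raw"),
    priority=Priority("critical"),
    source="MahaRERA",
    source_confidence_target=0.95,  # made-up value, for illustration only
)
```

Using enums for Type and Priority means a typo such as "critcal" fails at load time instead of silently creating a new priority level.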
Categorisation framework
```mermaid
flowchart TD
    Static[Static / Structural<br/>What the property IS] --> Transactional[Transactional<br/>What happens TO it]
    Static --> Legal[Legal & Compliance<br/>What's allowed]
    Static --> Spatial[Spatial / Environmental<br/>WHERE it sits]
    Transactional --> Market[Market Signals<br/>What the market is doing]
    Transactional --> Tenant[Tenant & Lease Intel<br/>Cash flow reality]
    Market --> Derived[Derived / AI<br/>Scores, projections, narratives]
    Tenant --> Derived
    Legal --> Derived
    Spatial --> Derived
    Policy[Policy & Infra<br/>What's coming] --> Derived
    Persona[Investor Persona] --> Derived
    Macro[Macro Overlay] --> Derived
    Operations[Operations / Lineage] -.-> Static
    Operations -.-> Transactional
    Operations -.-> Derived
```
The 16 categories
| # | Category | Approx count | Role |
| --- | --- | --- | --- |
| 01 | Project Identity | 12 | The "who and what" |
| 02 | Location & Geospatial | 12 | The "where" |
| 03 | Land & Area Metrics | 10 | Physical size, FSI |
| 04 | Parties & Developer | 8 | The "by whom" |
| 05 | Legal & Compliance | 11 | Title, encumbrance, litigation |
| 06 | Approvals & Permissions | 8 | Regulatory clearances |
| 07 | Financial & Transaction | 14 | Real prices, real rents |
| 08 | Tenant & Lease Intelligence | 9 | Commercial cash-flow |
| 09 | Unit & Configuration | 7 | Inventory and product mix |
| 10 | Market Signals | 10 | Velocity, momentum, yields |
| 11 | Policy & Infrastructure | 8 | GRs and infra catalysts |
| 12 | Risk Intelligence | 9 | Environmental, zonal, composite |
| 13 | AI / Derived Insights | 12 | Scores, projections, narratives |
| 14 | Investor Persona | 8 | For Broker personalisation |
| 15 | Macro Overlay | 6 | Repo, rates, employment |
| 16 | Fractional-specific | 7 | SPV, escrow, waterfall, secondary |
| Ops | Data Lineage & Audit | (per-attribute, not separate) | Confidence, freshness, source |
Total: ~140 functional attributes, excluding the per-attribute lineage columns, which apply to every attribute.
Category 01 — Project Identity (12)
The "who and what" — the entity resolution backbone.
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| proj.id | Canonical project ID | derived | critical | Internal | UUID; primary key across system |
| proj.maharera_number | MahaRERA registration number | raw | critical | MahaRERA | Anchor for entity resolution |
| proj.name | Project name | raw | critical | MahaRERA | |
| proj.name_aliases | Alternate / marketed names | raw | high | MahaRERA + News | List |
| proj.building_or_wing_name | Sub-name | raw | high | MahaRERA | For multi-tower projects |
| proj.type | Type (Residential, Commercial, Industrial, Mixed-Use) | raw | critical | MahaRERA | Drives all segment filtering |
| proj.phases_count | Number of phases | raw | high | MahaRERA | |
| proj.towers_count | Number of towers | raw | high | MahaRERA | |
| proj.blocks_count | Number of blocks | raw | medium | MahaRERA | |
| proj.promised_completion_dates | All declared/revised completion dates with versions | raw | critical | MahaRERA Form B | Used in dev.delay_history |
| proj.actual_completion_date | Actual delivery date | raw | high | MahaRERA OC/POC | |
| proj.estimated_cost_history | Original + revised cost figures with versions | raw | high | MahaRERA Form A/B | Cost overrun signal |
Category 02 — Location & Geospatial (12)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| loc.full_address | Full address + PIN | raw | critical | MahaRERA | |
| loc.unit_number, loc.street, loc.landmark | Address parts | raw | medium | MahaRERA | |
| loc.locality | Locality / colony | raw | high | MahaRERA + reverse geocoding | |
| loc.taluka, loc.district, loc.state | Admin hierarchy | raw | high | MahaRERA | |
| loc.pin_code | PIN code | raw | high | MahaRERA | |
| loc.lat, loc.lng | Coordinates | raw | critical | MahaRERA + Google Geocoding fallback | Normalised to EPSG:4326 |
| loc.survey_numbers | Survey/CTS/Gat/Hissa/TP numbers | raw | critical | MahaRERA + 7-12 | List |
| loc.boundaries_n_s_e_w | Cardinal boundaries | raw | medium | MahaRERA + 7-12 | |
| loc.micromarket_id | Micromarket bucket | derived | critical | GIS clustering | See derived-spec |
| loc.micromarket_name | Human-readable label | derived | critical | Internal taxonomy | |
| loc.infra_proximity_score | Composite 0-100 | derived | critical | GIS + metro/SEZ/etc. | See derived-spec |
| loc.commute_times | Commute times to key job centres | enrichment | medium | Google Distance Matrix | |
Category 03 — Land & Area Metrics (10)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| area.total_land_sqm | Total land area | raw | critical | MahaRERA | |
| area.land_for_registration_sqm | Registered land area | raw | high | MahaRERA | |
| area.permissible_builtup_sqm | Permissible built-up | raw | critical | MahaRERA / BPA | |
| area.sanctioned_builtup_sqm | Sanctioned built-up | raw | critical | BPA | |
| area.open_space_sqm | Aggregate open space | raw | medium | MahaRERA | |
| area.carpet_sqft | Carpet area (per-unit or total) | raw | critical | Floor plans | RERA-mandated |
| area.built_up_sqft | Built-up area | raw | high | Floor plans | |
| area.super_built_up_sqft | Super built-up | raw | medium | Floor plans | Being phased out |
| area.fsi_consumed | FSI consumed | raw | critical | MahaRERA / BPA | |
| area.permissible_fsi | Permissible FSI | raw | critical | MahaRERA / DCR | |
Category 04 — Parties & Developer (8)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| party.promoter_name | Promoter / developer name | raw | critical | MahaRERA | |
| party.promoter_pan | Promoter PAN | raw | critical | MahaRERA | Primary key for promoter |
| party.promoter_cin | CIN if corporate | raw | high | MahaRERA + MCA | |
| party.promoter_gst | GST | raw | medium | MahaRERA | |
| party.land_owners | Landowner names + share | raw | high | MahaRERA + 7-12 | |
| party.architects | Engaged architects | raw | low | MahaRERA | |
| party.contractors | Engaged contractors | raw | low | MahaRERA | |
| dev.trust_score | Developer Trust Score 0-100 | derived | critical | Multi-factor | See derived-spec; defamation-guarded |
Category 05 — Legal & Compliance (11)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| legal.title_search_report_url | TSR URL | raw | high | MahaRERA | |
| legal.title_certificate_url | TC URL | raw | high | MahaRERA | |
| legal.chain_of_title | Historical owners list | raw | high | TSR + 7-12 | Structured |
| legal.encumbrances | Charges, mortgages, liens | raw | critical | TSR + CERSAI | |
| legal.cersai_charges | CERSAI charge records | raw | high | CERSAI API | |
| legal.litigation_records | Court cases | raw | high | TSR + court portals | |
| legal.complaints_maharera | MahaRERA complaints count + status | raw | high | MahaRERA | |
| legal.land_class | Class I / Class II | raw | critical | 7-12 | Class II = restricted |
| legal.na_order_status | NA conversion done? | raw | high | Revenue records | |
| legal.crz_status | Within CRZ? | enrichment | high | MoEFCC + MRSAC | |
| legal.title_clarity_score | Title Clarity Score 0-100 | derived | critical | Multi-input | See derived-spec |
Category 06 — Approvals & Permissions (8)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| appr.building_plan_approval | BPA reference + date | raw | critical | MahaRERA + municipal | |
| appr.commencement_certificate | CC reference + date | raw | critical | MahaRERA | |
| appr.part_occupancy_certificate | POC ref + scope | raw | high | MahaRERA | |
| appr.occupancy_certificate | OC ref + date | raw | critical | MahaRERA | Habitability anchor |
| appr.environment_clearance | EC ref + date | raw | high | MahaRERA + MoEFCC | |
| appr.forest_clearance | FC if applicable | raw | high | MoEFCC | |
| appr.fire_noc | Fire NOC | raw | medium | Municipal | |
| appr.water_sewerage_noc | Water/sewerage NOC | raw | medium | Municipal | |
Category 07 — Financial & Transaction (14)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| fin.sale_price | Declared sale price (per transaction) | raw | critical | IGR Index-II | |
| fin.sale_date | Transaction date | raw | critical | IGR | |
| fin.sale_parties | Buyer + seller | raw | critical | IGR | |
| fin.stamp_duty_paid | Stamp duty paid | raw | high | IGR | |
| fin.registration_fee_paid | Registration fee | raw | high | IGR | |
| fin.market_value_asr | Market value per ASR | raw | high | IGR | Government view |
| fin.asr_value_per_sqft | ASR per sqft for the address | enrichment | critical | ASR portal | |
| fin.asr_gap_pct | (Sale price - ASR) / ASR * 100 | derived | critical | Sale + ASR | See derived-spec |
| fin.price_per_sqft_carpet | Price ÷ carpet area | derived | critical | Sale + area | |
| fin.estimated_cost_revised | Revised project cost | raw | high | MahaRERA | |
| fin.cost_overrun_pct | (Revised - Original) / Original * 100 | derived | high | Cost history | See derived-spec |
| fin.escrow_account_details | 70%-escrow account | raw | high | MahaRERA | |
| fin.escrow_balance_proxy | Estimated balance (from filings) | derived | medium | Escrow + sale velocity | Approximate |
| fin.loan_disclosure | Project loan disclosure | raw | high | MahaRERA Form B | |
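The three ratio-style derived attributes in this category (fin.asr_gap_pct, fin.price_per_sqft_carpet, fin.cost_overrun_pct) follow directly from the formulas in the table. The sketch below is illustrative only: the function names and the zero-denominator handling are assumptions, it assumes the sale price and ASR value are on the same basis (both totals or both per sqft), and the authoritative math stays in derived-attributes-spec.md.

```python
def asr_gap_pct(sale_price: float, asr_value: float) -> float:
    """(Sale price - ASR) / ASR * 100; both inputs on the same basis."""
    if asr_value <= 0:
        raise ValueError("ASR value must be positive")
    return (sale_price - asr_value) / asr_value * 100


def price_per_sqft_carpet(sale_price: float, carpet_sqft: float) -> float:
    """Declared sale price divided by carpet area in sqft."""
    if carpet_sqft <= 0:
        raise ValueError("carpet area must be positive")
    return sale_price / carpet_sqft


def cost_overrun_pct(original_cost: float, revised_cost: float) -> float:
    """(Revised - Original) / Original * 100."""
    if original_cost <= 0:
        raise ValueError("original cost must be positive")
    return (revised_cost - original_cost) / original_cost * 100


# Illustrative inputs only, not real transactions.
gap = asr_gap_pct(sale_price=12_000_000, asr_value=10_000_000)
ppsf = price_per_sqft_carpet(sale_price=12_000_000, carpet_sqft=800)
overrun = cost_overrun_pct(original_cost=100.0, revised_cost=125.0)
```

Raising on a non-positive denominator, instead of returning zero or infinity, keeps a bad extraction from turning into a plausible-looking score.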
Category 08 — Tenant & Lease Intelligence (9)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| lease.rent_amount_monthly | Monthly rent | raw | critical | IGR L&L | |
| lease.tenure_months | Tenure | raw | critical | IGR L&L | |
| lease.lock_in_months | Lock-in period | raw | high | IGR L&L | |
| lease.security_deposit | Security deposit | raw | high | IGR L&L | |
| lease.escalation_pct | Annual escalation % | raw | high | IGR L&L | |
| lease.tenant_name | Tenant name | raw | high | IGR L&L | |
| lease.tenant_industry | Tenant industry (classified) | derived | high | Tenant name + MCA enrichment | |
| lease.tenant_credit_score | Tenant credit profile | enrichment | high | CIBIL Commercial | |
| lease.tenant_anchor_quality_score | Anchor tenant score 0-100 | derived | critical | Industry + credit + size | See derived-spec |
Category 09 — Unit & Configuration (7)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| unit.total_count | Total units | raw | critical | MahaRERA | |
| unit.booked_count | Units booked | raw | high | MahaRERA quarterly | |
| unit.sold_count | Units sold | raw | high | MahaRERA quarterly | |
| unit.unsold_count | Units unsold | derived | high | Total - sold | |
| unit.type_breakdown | Mix (1BHK/2BHK/Office/Retail) | raw | high | Floor plans | Object/array |
| unit.parking_provision | Covered/Open/Stilt counts | raw | high | MahaRERA + parking plan | |
| unit.amenities | Amenities list | raw | medium | MahaRERA | Pool, gym, club, etc. |
Category 10 — Market Signals (10)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| mkt.transaction_velocity_90d | Registered txn count in 90d (micromarket) | derived | critical | IGR rolling | See derived-spec |
| mkt.transaction_velocity_180d | Same, 180d | derived | critical | IGR rolling | |
| mkt.median_price_per_sqft_90d | Median ₹/sqft in micromarket | derived | critical | IGR aggregates | |
| mkt.price_appreciation_yoy_pct | YoY price change | derived | high | IGR aggregates | |
| mkt.absorption_rate | Sold / Launched per quarter | derived | high | MahaRERA + IGR | |
| mkt.days_on_market_median | Listing-to-sale time | derived | medium | Listing portals + IGR | Approximate |
| mkt.sector_momentum_pct | % change in sector volume vs baseline | derived | critical | IGR sector aggregates | |
| mkt.cap_rate_median | Median cap rate by sector/zone | derived | critical | L&L + sale aggregates | |
| mkt.yield_benchmark | Gross yield benchmark by segment | derived | critical | L&L aggregates | |
| mkt.micromarket_lifecycle_stage | Emerging / Growing / Mature / Cooling | derived | high | Time-series clustering | See derived-spec |
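To make the rolling-window attributes concrete, here is a minimal sketch of mkt.transaction_velocity_90d / _180d and mkt.median_price_per_sqft_90d over a list of registered transactions. The `Txn` record, the function names, and the window convention (start exclusive, as-of date inclusive) are assumptions for illustration; derived-attributes-spec.md remains the reference for the actual definitions.

```python
from datetime import date, timedelta
from statistics import median
from typing import List, NamedTuple, Optional


class Txn(NamedTuple):
    """Hypothetical minimal view of one registered (IGR) transaction."""
    micromarket_id: str
    sale_date: date
    sale_price: float
    carpet_sqft: float


def in_window(txns: List[Txn], micromarket_id: str, as_of: date, days: int) -> List[Txn]:
    # Assumed convention: window is (as_of - days, as_of].
    start = as_of - timedelta(days=days)
    return [
        t for t in txns
        if t.micromarket_id == micromarket_id and start < t.sale_date <= as_of
    ]


def transaction_velocity(txns: List[Txn], micromarket_id: str, as_of: date, days: int = 90) -> int:
    """Count of registered transactions in the micromarket within the window."""
    return len(in_window(txns, micromarket_id, as_of, days))


def median_price_per_sqft(txns: List[Txn], micromarket_id: str, as_of: date,
                          days: int = 90) -> Optional[float]:
    """Median price per carpet sqft in the window, or None if there are no transactions."""
    window = in_window(txns, micromarket_id, as_of, days)
    if not window:
        return None
    return median(t.sale_price / t.carpet_sqft for t in window)


# Illustrative data only.
as_of = date(2025, 6, 30)
txns = [
    Txn("mm-1", as_of - timedelta(days=10), 1_000_000, 500),
    Txn("mm-1", as_of - timedelta(days=80), 1_500_000, 500),
    Txn("mm-1", as_of - timedelta(days=120), 2_000_000, 500),
    Txn("mm-2", as_of - timedelta(days=5), 9_000_000, 900),
]
v90 = transaction_velocity(txns, "mm-1", as_of, days=90)
v180 = transaction_velocity(txns, "mm-1", as_of, days=180)
med90 = median_price_per_sqft(txns, "mm-1", as_of, days=90)
```

Returning `None` for an empty window, instead of zero, keeps "no data" distinguishable from a genuinely low price.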
Category 11 — Policy & Infrastructure (8)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| policy.applicable_grs | List of relevant GRs (last 12 mo) | raw | critical | GR portals | Filtered |
| policy.tailwind_flags | Boolean flags for positive policies | derived | critical | GR NLP classifier | See derived-spec |
| policy.headwind_flags | Boolean flags for restrictive policies | derived | high | GR NLP classifier | |
| infra.upcoming_projects | List of nearby infra projects with timelines | derived | critical | GRs + news + Maha-Metro | Event extraction |
| infra.metro_proximity_status | Operational / Under-construction / Approved / Speculated | derived | high | Maha-Metro filings | |
| infra.sez_industrial_proximity | Distance + status of SEZ/MIDC | derived | high | Government data | |
| infra.airport_proximity_km | Distance to nearest airport | enrichment | medium | Geospatial | |
| infra.school_hospital_proximity | Count of quality schools/hospitals within 5km | derived | medium | Google Places + filters | |
Category 12 — Risk Intelligence (9)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| risk.flood_zone_flag | In flood zone? | enrichment | critical | MRSAC + BMC | |
| risk.forest_zone_flag | In forest area? | enrichment | high | Forest Dept GIS | |
| risk.crz_flag | In CRZ? | enrichment | high | MoEFCC + MRSAC | |
| risk.aqi_annual_avg | Annual avg AQI | enrichment | medium | CPCB data | |
| risk.heat_island_index | Urban heat island index | enrichment | medium | MRSAC | |
| risk.rainfall_trend | 5-yr rainfall trend | enrichment | low | IMD | |
| risk.urban_flooding_history | Past flooding incidents in pincode | enrichment | medium | BMC / news | |
| risk.zone_risk_index | Composite Zone Risk Index 0-10 | derived | critical | Ensemble | See derived-spec |
| risk.title_risk_flag | Composite title risk flag | derived | critical | Title score < threshold | |
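The risk.title_risk_flag row describes a simple threshold rule over legal.title_clarity_score. A minimal sketch follows; the cut-off value `TITLE_RISK_THRESHOLD = 60.0` is a placeholder assumption, since the catalogue does not state the actual threshold.

```python
TITLE_RISK_THRESHOLD = 60.0  # hypothetical cut-off; the real value is not given in this catalogue


def title_risk_flag(title_clarity_score: float,
                    threshold: float = TITLE_RISK_THRESHOLD) -> bool:
    """Return True when the 0-100 Title Clarity Score falls below the threshold."""
    if not 0.0 <= title_clarity_score <= 100.0:
        raise ValueError("title_clarity_score must be within 0-100")
    return title_clarity_score < threshold


flagged = title_risk_flag(45.0)      # below the placeholder threshold
not_flagged = title_risk_flag(60.0)  # strict "<", so a score equal to the threshold is not flagged
```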
Category 13 — AI / Derived Insights (12)
These are the magic. Each one has a math spec in derived-attributes-spec.md.
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| ai.alpha_narrative | "Why this, why now" plain-English summary | derived | critical | LLM over all signals | Cited |
| ai.risk_narrative | "What could go wrong" plain-English | derived | critical | LLM over risk attrs | Cited |
| ai.hidden_costs_breakdown | Full cost breakdown (stamp, registration, GST, society, etc.) | derived | critical | Calculator over attrs | Personalised to user state |
| ai.wealth_trajectory_paths | Monte Carlo wealth paths | derived | critical | Simulation | See derived-spec |
| ai.projected_irr_p50_p20_p80 | IRR projections at 50/20/80 percentiles | derived | critical | Simulation | |
| ai.comparable_set | 5-10 comparable transactions / projects | derived | critical | Similarity + filters | |
| ai.sentiment_score | Aggregated sentiment (news+social) | derived | high | NLP aggregation | |
| ai.developer_track_record_summary | Plain-English developer summary | derived | high | LLM over developer data | Defamation-guarded |
| ai.title_chain_explanation | Plain-English title walkthrough | derived | high | LLM over legal docs | |
| ai.gr_impact_summary | Plain-English GR impact on this asset | derived | high | LLM over relevant GRs | |
| ai.persona_fit_score | Fit between user persona and asset | derived | critical | Cosine similarity | Personalises ranking |
| ai.brutal_honesty_flags | List of red flags + "consider before buying" notes | derived | high | Rule + LLM | Honest broker discipline |
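The table lists cosine similarity as the source for ai.persona_fit_score. A minimal sketch of that computation is below; how a persona and an asset are each encoded as a numeric vector is not described here, so the two example vectors and their three features are purely hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical feature order: [yield focus, growth focus, risk appetite].
persona_vec = [0.8, 0.2, 0.3]
asset_vec = [0.7, 0.3, 0.4]
fit = cosine_similarity(persona_vec, asset_vec)
```

With non-negative features the result lies between 0 and 1; mapping it to a displayed score (for example a 0-100 scale) would be a separate design decision.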
Category 14 — Investor Persona (8)
Only stored with explicit consent. Used to personalise Broker outputs (info-only, not advice).
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| persona.investment_goal | Retirement / Yield / Capital growth / Lifestyle / Tax planning | raw | high | User self-report | |
| persona.horizon_years | Investment horizon | raw | high | User self-report | |
| persona.risk_tolerance | Conservative / Moderate / Aggressive | raw | high | User self-report | |
| persona.tax_bracket | User's effective tax bracket | raw | medium | User self-report | |
| persona.residency_status | Resident / NRI | raw | high | User self-report | Affects FEMA flows |
| persona.budget_range | Investable surplus range | raw | high | User self-report | |
| persona.family_stage | Single / Couple / Young Family / Empty Nester / Retired | raw | medium | User self-report | |
| persona.existing_exposure | Other RE exposure (city/segment) | raw | medium | User self-report | Diversification |
Category 15 — Macro Overlay (6)
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| macro.repo_rate | RBI repo rate | enrichment | high | RBI | Weekly refresh |
| macro.home_loan_rate_median | Median prime home loan rate | enrichment | high | Major banks | |
| macro.it_hiring_index | IT-ITES hiring index (Maharashtra) | enrichment | medium | NASSCOM / job boards | |
| macro.salary_growth_yoy | YoY salary growth (key segments) | enrichment | medium | LinkedIn / industry | |
| macro.gdp_growth_state | Maharashtra GDP growth | enrichment | low | Govt | |
| macro.cpi_inflation | CPI inflation | enrichment | medium | MoSPI | |
Category 16 — Fractional-specific (7)
Augments the existing PropPie Fractional asset records.
| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| frac.spv_id | SPV identifier | raw | critical | Internal | |
| frac.scheme_size_cr | Scheme size | raw | critical | Internal | SM-REIT threshold check |
| frac.investor_count | Current investor count | raw | critical | Internal | SM-REIT threshold check |
| frac.distribution_waterfall | Waterfall structure | raw | high | Internal | |
| frac.secondary_market_active | Active on secondary market? | raw | high | Internal | |
| frac.realised_yield_ttm | Trailing 12-month realised yield | derived | critical | Internal distributions | Ground truth |
| frac.vacancy_pct | Current vacancy | raw | critical | Internal | |
Operational columns (apply to every attribute)
Each attribute row in the canonical store carries these:
| Operational column | Purpose |
| --- | --- |
| source | Where the value came from (RERA / IGR / GR / Internal / etc.) |
| source_url | Specific deep link if available |
| source_doc_id | Internal pointer to source document |
| source_doc_page | Page reference if PDF |
| extracted_at | When ingested |
| confidence | 0-1 |
| confidence_method | How confidence was computed (rule-based / model-based / human-verified) |
| human_verified | Boolean: whether a human has signed off |
| conflict_resolution_applied | If sources disagreed, which rule resolved the conflict |
| last_changed_at | When this value last changed |
| change_history | (optional) prior values with timestamps |
See data-quality-framework.md for the full rules.
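As a rough illustration of how one stored value might carry these operational columns, here is a sketch of a record type. The class name `AttributeValue`, the `update` helper, and the choice to log each prior value with the time it was replaced are assumptions; the store schema itself is still an open decision.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List, Optional, Tuple


@dataclass
class AttributeValue:
    """One attribute value plus its lineage columns (hypothetical shape)."""
    attribute_id: str
    value: Any
    source: str                      # RERA / IGR / GR / Internal / etc.
    extracted_at: datetime
    confidence: float                # 0-1
    confidence_method: str           # rule-based / model-based / human-verified
    human_verified: bool = False
    source_url: Optional[str] = None
    source_doc_id: Optional[str] = None
    source_doc_page: Optional[int] = None
    conflict_resolution_applied: Optional[str] = None
    last_changed_at: Optional[datetime] = None
    change_history: List[Tuple[datetime, Any]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within 0-1")

    def update(self, new_value: Any, at: datetime) -> None:
        """Record a changed value; an identical value leaves the record untouched."""
        if new_value == self.value:
            return
        # Assumed convention: log (time of replacement, prior value).
        self.change_history.append((at, self.value))
        self.value = new_value
        self.last_changed_at = at


t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2025, 4, 1, tzinfo=timezone.utc)
rec = AttributeValue(
    attribute_id="unit.sold_count", value=120, source="RERA",
    extracted_at=t0, confidence=0.9, confidence_method="rule-based",
)
rec.update(120, t1)  # same value: no history entry
rec.update(135, t1)  # changed value: prior value is logged
```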
Owner column (per attribute, for the team)
Each attribute has an owner, who is responsible for its definition, quality, and changes.
| Owner | Categories |
| --- | --- |
| Vishal (CEO) | All pipeline / ingestion / extraction-layer attributes (raw) |
| Aishvarya (COO) | All derived attributes, scoring methodology, AI / persona / Broker-facing |
| TBD (Compliance Officer) | All legal / compliance / lineage / audit attributes |
| Designated dev per category | (Will be filled in per the team-and-roles doc) |
What's intentionally NOT in this catalogue (yet)
- Construction material quality: only visible via complaints; underreported
- NRI repatriation profile: too dependent on individual circumstances
- Listing portal asking prices: used only for trend detection, not as an authority
- Influencer / blog mentions: too noisy
- Property condition / age: interior, refurbishment status; user-submitted with verification later

These will be added once they can be sourced reliably.
Migration path from the original 92-attribute CSV
| Original CSV | Refined catalogue |
| --- | --- |
| 92 attributes | ~140 attributes |
| Manual "Captured?" column | Operational columns (confidence, human_verified, extracted_at) |
| Free-form "Suggested Data Sources" | Structured source enum with hierarchy |
| Implicit derivation | Explicit math spec per derived attribute |
| No persona / macro / fractional | Categories 14, 15, 16 added |
| No lineage / audit | Operational columns mandated |
Next steps
- Vishal to map this catalogue against the extraction schema v2 and identify what's already covered vs. what needs new extraction work
- Define each derived attribute's math in derived-attributes-spec.md (done) and validate with sample data
- Establish source-confidence baselines per attribute by sampling 100 projects and human-grading extractions
- Build the canonical store schema (Vishal's call: SQL? graph? key-value? lakehouse?)
- Wire up freshness monitors per source, with alarms if a feed goes stale
- Document owner-per-attribute in a separate operational sheet (not this doc)
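
The freshness-monitor step could start as small as the sketch below: compare each source's most recent extraction time with a maximum allowed age and report the sources that have gone stale. The `MAX_AGE` table and the function name are assumptions; only the weekly RBI cadence comes from this catalogue (macro.repo_rate), and the other limits are placeholders.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Maximum tolerated age per source. Only the RBI weekly cadence is stated in the
# catalogue; the MahaRERA and IGR limits are placeholder assumptions.
MAX_AGE: Dict[str, timedelta] = {
    "MahaRERA": timedelta(days=7),
    "IGR": timedelta(days=1),
    "RBI": timedelta(days=7),
}


def stale_sources(last_extracted: Dict[str, datetime], now: datetime,
                  max_age: Dict[str, timedelta] = MAX_AGE) -> List[str]:
    """Return monitored sources whose latest extraction is missing or too old."""
    stale = []
    for source, limit in max_age.items():
        seen = last_extracted.get(source)
        if seen is None or now - seen > limit:
            stale.append(source)
    return sorted(stale)


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
last_extracted = {
    "MahaRERA": now - timedelta(days=2),  # within its limit
    "IGR": now - timedelta(days=3),       # older than its limit
    # "RBI" has never been seen, so it counts as stale
}
alarms = stale_sources(last_extracted, now)
```

Treating a never-seen source as stale means a feed that silently fails on first deployment still raises an alarm.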
See also:
- data-attributes.csv — machine-readable version
- derived-attributes-spec.md — math for derived
- data-quality-framework.md — quality rules
- pipeline-spec-for-vishal.md — pipeline contract