Derived Attributes: Math Specification
Every derived attribute that appears in data-attributes.md has a precise math spec here. The contract is: given the same inputs, every implementation produces the same output to within documented precision. This makes derived attributes reproducible, auditable, and explainable.
How to read each spec
For each derived attribute:
- Layman explanation: one-sentence plain English
- Formula: the math
- Inputs: list with attribute IDs
- Validation: how to test correctness
- Failure modes: what makes it unreliable, what to do
- Display: how it shows up to users (info-only stance, citations)
- Refresh cadence: when to recompute
1. loc.micromarket_id and loc.micromarket_name
Layman: Groups properties into meaningful small neighbourhoods so users can compare similar areas.
Formula: Hierarchical clustering on (lat, lng, locality, taluka, pincode), constrained by:
- Max diameter: 3km (urban), 8km (peri-urban), 25km (rural)
- Min cluster size: 30 active projects OR 100 IGR transactions in 12 months
- DP-shapefile boundaries respected (don't merge across municipal boundary)
Inputs: loc.lat, loc.lng, loc.locality, loc.taluka, loc.pin_code; municipal DP shapefiles; active project density; IGR txn density.
Validation:
- Manual review of cluster boundaries by Pune/MMR market specialists (one-time + annual refresh)
- Sanity: every popular branded area (Hinjewadi, Wakad, Kharadi, BKC, Lower Parel, etc.) is its own cluster
Failure modes:
- New micromarkets emerging (e.g., new IT corridor): re-cluster quarterly
- Project on cluster boundary: assign by majority of nearest 5 transactions
- Sparse rural areas: fall back to taluka-level
Display: "Hinjewadi (Pune) — micromarket" with optional info-icon revealing the cluster definition.
Refresh cadence: Quarterly re-cluster; daily new-project assignment.
2. loc.infra_proximity_score (0-100)
Layman: Measures how close the property is to important growth drivers like metro, SEZ, highway, airport.
Formula:
score = clip(100 * (
0.40 * metro_score
+ 0.30 * employment_hub_score
+ 0.20 * highway_airport_score
+ 0.10 * social_infra_score
), 0, 100)
Where each sub-score is in [0, 1]:
metro_score = max(0, 1 - distance_km / 5)
employment_hub_score = max(0, 1 - distance_to_nearest_SEZ_or_MIDC_or_IT_park_km / 8)
highway_airport_score = 0.5 * max(0, 1 - dist_highway_km / 3) + 0.5 * max(0, 1 - dist_airport_km / 25)
social_infra_score = clip(quality_school_count_within_3km / 10, 0, 1)
Inputs:
- loc.lat, loc.lng
- Metro / suburban rail GIS (operational + under-construction)
- SEZ / MIDC / IT-park boundaries
- Highway and airport coordinates
- Google Places API for schools (filtered for board, rating)
Validation:
- Check that known prime areas (BKC, Hinjewadi-Phase-1) score > 75
- Check that remote rural areas score < 20
- Compare with anecdotal "this is a good location" sentiment in news/social (sanity)
Failure modes:
- Distance-based; doesn't capture proximity to undesirable infrastructure (e.g., next to a landfill or sewage plant)
- Future metro lines should be only partially credited: apply 0.5 weight for under-construction, 0.2 for approved-only
Display: "Infrastructure Proximity: 78/100 — strong (metro 0.5km, BKC 4km, Mumbai Airport 14km)". Click expands to per-component breakdown.
Refresh cadence: Quarterly (boundaries change slowly).
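The weighted sum and sub-scores above can be sketched in a few lines. This is a minimal illustration, not the pipeline's implementation: the function and parameter names (`infra_proximity_score`, `metro_km`, and so on) are hypothetical, distances are assumed to be precomputed in kilometres, and the partial credit for under-construction lines is left out.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def infra_proximity_score(
    metro_km: float,
    hub_km: float,
    highway_km: float,
    airport_km: float,
    quality_schools_3km: int,
) -> float:
    """Weighted infrastructure proximity score in [0, 100] (names are illustrative)."""
    metro = max(0.0, 1 - metro_km / 5)
    hub = max(0.0, 1 - hub_km / 8)
    highway_airport = (
        0.5 * max(0.0, 1 - highway_km / 3)
        + 0.5 * max(0.0, 1 - airport_km / 25)
    )
    social = clip(quality_schools_3km / 10, 0.0, 1.0)
    raw = 0.40 * metro + 0.30 * hub + 0.20 * highway_airport + 0.10 * social
    return clip(100 * raw, 0.0, 100.0)
```

Returning the four sub-scores alongside the total would make the per-component breakdown in the display straightforward to render.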
3. fin.asr_gap_pct
Layman: Compares the actual sale price with the government's official ready reckoner (ASR) value to spot bargains and over-valuation.
Formula:
asr_gap_pct = ((sale_price - asr_value) / asr_value) * 100
where asr_value = asr_value_per_sqft * sale_area_sqft
Inputs:
- fin.sale_price (per transaction)
- fin.asr_value_per_sqft (location-resolved)
- area.carpet_sqft (or built-up if carpet missing)
Validation:
- Sample 100 transactions per priority micromarket
- Median asr_gap_pct should be positive (sale > ASR) for residential premium areas (typically 15-50%)
- Median should be near zero for distressed segments
Failure modes:
- ASR may be stale (annual refresh; gap blows up in dynamic markets)
- Cash-component transactions show suspiciously low gap: flag for investigation
- Sale area mismatched to ASR area definition (carpet vs built-up): normalise carefully
Display: "Sale price is 21% above ASR (₹89L → ₹1.08 Cr). ASR is typically 15-30% below market; this is in range." Negative gap flagged: "Sale price is 8% below ASR. Unusual; investigate for cash component or genuine distress."
Refresh cadence: On each new transaction; ASR refresh quarterly.
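As a worked instance of the formula, here is a small sketch. The function name and argument names are illustrative, and it assumes the sale area has already been normalised to the same area definition (carpet vs built-up) that the ASR rate uses.

```python
def asr_gap_pct(sale_price: float, asr_value_per_sqft: float, sale_area_sqft: float) -> float:
    """Percentage by which the sale price sits above (+) or below (-) the ASR value."""
    asr_value = asr_value_per_sqft * sale_area_sqft
    if asr_value <= 0:
        raise ValueError("ASR value must be positive to compute a gap")
    return (sale_price - asr_value) / asr_value * 100


# A 1,000 sqft unit with ASR of Rs 8,900/sqft (Rs 89L) sold for Rs 1.08 Cr.
gap = asr_gap_pct(sale_price=10_800_000, asr_value_per_sqft=8_900, sale_area_sqft=1_000)
```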
4. fin.cost_overrun_pct
Layman: How much more the project cost than originally promised.
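Formula:
cost_overrun_pct = ((latest_estimated_cost - original_estimated_cost) / original_estimated_cost) * 100
where original_estimated_cost is the first version and latest_estimated_cost the most recent version in the cost history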
Inputs: proj.estimated_cost_history (sorted by version)
Validation: 0 if no revisions; positive overruns common; negative would indicate cost-cutting (rare; investigate).
Display: "Original estimated cost ₹85 Cr; latest revision ₹104 Cr — a 22% overrun across 3 revisions since 2022."
Refresh cadence: On each new Form B version.
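A minimal sketch of the overrun computation, assuming the overrun compares the first and latest entries of the version-sorted cost history (this reading is consistent with the display example: ₹85 Cr to ₹104 Cr is about 22%). The function name is hypothetical.

```python
def cost_overrun_pct(estimated_cost_history: list[float]) -> float:
    """Overrun of the latest estimate over the original, in percent.

    The list is assumed to be sorted by version, oldest first.
    A single-entry history (no revisions) yields 0.0.
    """
    if not estimated_cost_history:
        raise ValueError("cost history is empty")
    original = estimated_cost_history[0]
    latest = estimated_cost_history[-1]
    if original <= 0:
        raise ValueError("original estimated cost must be positive")
    return (latest - original) / original * 100
```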
5. mkt.transaction_velocity_90d / mkt.transaction_velocity_180d
Layman: Counts how many properties were bought/sold recently in the area — higher = hotter market.
Formula:
velocity_Nd = count of registered sale deeds in micromarket M
where deed_date in [today - N days, today]
and deed_type = 'Sale' or 'Allotment'
Variant: per-segment (residential vs commercial), per-unit-type breakdowns.
Inputs: fin.sale_date, loc.micromarket_id, deed type.
Validation: Sanity-check against MahaRERA quarterly progress and known launches. Anomalies (sudden spike) often correlate with new project launches or year-end stamp-duty deadlines.
Failure modes:
- IGR indexing latency (24-72h): label the current day's count as "indicative"
- Stamp duty deadline (e.g., Mar 31) inflates volume artificially: note seasonality
Display: "117 registered transactions in Hinjewadi in the last 90 days (up 34% vs prior 90-day window)."
Refresh cadence: Daily.
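The count and the "vs prior window" comparison from the display string can be sketched as follows. The record layout (dicts with `micromarket_id`, `deed_type`, `deed_date`) and both function names are assumptions made for illustration; the prior window is taken to end the day before the current window starts.

```python
from datetime import date, timedelta

COUNTED_DEED_TYPES = {"Sale", "Allotment"}


def transaction_velocity(deeds, micromarket_id, today, window_days=90):
    """Count qualifying deeds dated in [today - window_days, today] for one micromarket."""
    start = today - timedelta(days=window_days)
    return sum(
        1
        for deed in deeds
        if deed["micromarket_id"] == micromarket_id
        and deed["deed_type"] in COUNTED_DEED_TYPES
        and start <= deed["deed_date"] <= today
    )


def pct_change_vs_prior_window(deeds, micromarket_id, today, window_days=90):
    """Percent change of the current window against the window immediately before it."""
    current = transaction_velocity(deeds, micromarket_id, today, window_days)
    prior_end = today - timedelta(days=window_days + 1)
    prior = transaction_velocity(deeds, micromarket_id, prior_end, window_days)
    return None if prior == 0 else (current - prior) / prior * 100
```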
6. mkt.median_price_per_sqft_90d
Layman: Typical price per sqft in the area, computed from real registered sales (not listings).
Formula:
median_price_per_sqft_90d = median(
sale_price / sale_area_sqft for txn in micromarket M, last 90 days
)
Filters:
- Exclude transfers between related parties (where detectable: shared address, same surname for residential, group-company linkages for commercial)
- Exclude clearly outlier sub-ASR transactions (>30% below ASR; likely benami/distress)
- Segment: separately compute for residential vs commercial, by unit type
Inputs: fin.sale_price, area.carpet_sqft, loc.micromarket_id, fin.sale_parties.
Validation: Check against branded developer launches (their per-sqft is published) and listing-portal aggregates (which should be 5-15% higher).
Failure modes:
- Sparse micromarkets (<10 transactions/90d): fall back to 180d or note low confidence
- Mix shift (a few luxury launches skew the median): show distribution, not just median
Display: "Median ₹/sqft in Hinjewadi (last 90d, residential): ₹8,400 (n=87, range ₹6,200-₹12,500)." Always show n and range.
Refresh cadence: Daily.
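A sketch of the filtered median, returning the sample size and range that the display rule requires. It assumes each transaction already carries a `related_party` flag and a precomputed `asr_gap_pct`, and that the list is pre-filtered to one micromarket, segment, and 90-day window; those field names are hypothetical.

```python
from statistics import median


def median_price_per_sqft(txns, min_sample=10):
    """Median price per sqft after the related-party and sub-ASR outlier filters."""
    psf = [
        t["sale_price"] / t["area_sqft"]
        for t in txns
        if not t["related_party"]        # drop related-party transfers
        and t["asr_gap_pct"] >= -30      # drop sales more than 30% below ASR
        and t["area_sqft"] > 0
    ]
    if not psf:
        return None
    return {
        "median": median(psf),
        "n": len(psf),
        "low": min(psf),
        "high": max(psf),
        "low_confidence": len(psf) < min_sample,  # sparse-market marker
    }
```

A caller that sees `low_confidence` set could retry with a 180-day window, as the failure-mode note suggests.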
7. mkt.price_appreciation_yoy_pct
Formula:
price_appreciation_yoy_pct = ((median_psf_now - median_psf_12mo_ago) / median_psf_12mo_ago) * 100
Computed over rolling 90-day windows centred on now vs. 12 months ago.
Validation: Compare with branded reports (Knight Frank, JLL) — directional consistency.
Display: "Hinjewadi residential 2BHK ₹/sqft up 11% YoY (from ₹7,560 to ₹8,400 median)." Include sample sizes.
8. mkt.sector_momentum_pct
Layman: Shows how fast each sector is growing by comparing recent sales volume with the average quarter over the previous year.
Formula:
sector_momentum_pct = ((volume_last_90d - volume_baseline) / volume_baseline) * 100
where volume_baseline = avg quarterly volume over last 4 quarters
Per sector: Residential, Commercial-Office, Commercial-Retail, Warehousing/Industrial, Data Centers.
Display: "Warehousing in MMR North up 47% on volume vs 12-month baseline."
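The momentum formula in code form, as a small sketch. The function name is illustrative, and returning `None` for a zero baseline is an assumption about how an empty history should be handled.

```python
def sector_momentum_pct(volume_last_90d: float, last_4_quarter_volumes: list[float]):
    """Percent change of the latest 90-day volume against the 4-quarter average."""
    if len(last_4_quarter_volumes) != 4:
        raise ValueError("expected exactly four quarterly volumes")
    baseline = sum(last_4_quarter_volumes) / 4
    if baseline == 0:
        return None  # no baseline activity, so momentum is undefined
    return (volume_last_90d - baseline) / baseline * 100
```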
9. mkt.cap_rate_median and mkt.yield_benchmark
Layman: Typical (median) rental yield for each area and sector, for easy comparison.
Formula (commercial, cap-rate):
cap_rate = (annual_rent * (1 - vacancy_assumption) - opex) / sale_price
opex assumption: 8% of gross rent for offices, 12% for retail, 5% for warehousing
vacancy assumption: micromarket-specific (default 8% offices, 12% retail, 6% warehousing)
Median over recent transactions matching same micromarket and sector. Reported with sample size and confidence band.
Formula (residential, gross yield):
gross_yield = annual_rent / sale_price
Per type (1BHK, 2BHK, 3BHK).
Validation: Compare with REIT-reported cap rates (Embassy, Mindspace) — should be within 50-100bps band.
Display: "Hinjewadi Grade A office: median cap rate 8.4% (n=23 L&L + 47 sale pairs, last 12 months, ±50bps confidence)."
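A sketch of the commercial cap-rate formula using the default opex and vacancy assumptions listed above. The dictionary keys and function name are illustrative; opex is applied to gross rent, as the assumption line states.

```python
# Default assumptions taken from the spec text; keys are illustrative labels.
OPEX_SHARE_OF_GROSS_RENT = {"office": 0.08, "retail": 0.12, "warehousing": 0.05}
DEFAULT_VACANCY = {"office": 0.08, "retail": 0.12, "warehousing": 0.06}


def cap_rate(annual_rent: float, sale_price: float, sector: str, vacancy=None) -> float:
    """Net rent after vacancy and opex, divided by sale price (as a fraction)."""
    if sale_price <= 0:
        raise ValueError("sale price must be positive")
    if vacancy is None:
        vacancy = DEFAULT_VACANCY[sector]  # micromarket-specific value preferred
    opex = OPEX_SHARE_OF_GROSS_RENT[sector] * annual_rent
    return (annual_rent * (1 - vacancy) - opex) / sale_price


# Office with Rs 1 Cr gross annual rent on a Rs 10 Cr sale price.
example = cap_rate(annual_rent=10_000_000, sale_price=100_000_000, sector="office")
```

With the office defaults, the example works out to 0.084, i.e. a cap rate of 8.4%.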
10. mkt.micromarket_lifecycle_stage
Layman: Tells if the area is still new/growing, in prime phase, or already mature/slowing.
Formula: K-means clustering (k=4) on 24-month time-series features:
- Mean transaction velocity
- Velocity trend slope
- Price appreciation slope
- New project launch rate
- Sentiment trend
Cluster labels by feature centroids:
- Emerging: low volume, positive slope, high launch rate
- Growing: medium volume, strong slope, high launch rate
- Mature: high volume, flat-to-slow slope, low launch rate
- Cooling: declining volume, negative price slope, withdrawals
Validation: Manual labelling of 30 known-state micromarkets to anchor centroids.
Display: "Hinjewadi is in Growing stage — strong velocity, steady price appreciation, still active new launches."
Refresh cadence: Monthly.
11. policy.tailwind_flags and policy.headwind_flags
Layman: Scans government orders to flag new rules that will boost (or hurt) area value.
Formula:
1. NLP classifier (fine-tuned on labelled GR corpus) outputs:
   - Department, sub-category
   - Affected geography (district / taluka / pincode / coordinates)
   - Effective date
   - Impact direction (positive/negative/neutral)
   - Impact magnitude (low/medium/high)
2. Generate flags per affected micromarket with confidence.
Examples:
- Tailwind: TDR rate cut, FSI increase, metro extension approval, SEZ approval
- Headwind: Stamp duty hike, FSI restriction, environmental restriction
Validation: Human review on first 200 classifications; ongoing precision/recall monitoring.
Display: "3 policy tailwinds affecting Hinjewadi in the last 90 days: TDR rate cut (UDD GR 14 Jan 2026), metro phase-2 timeline confirmed (Maha-Metro 8 Feb 2026), MIDC plot expansion approved (Industries GR 22 Mar 2026)." Each links to GR.
12. dev.trust_score (Developer Trust Score 0-100)
Layman: Gives every developer a trust rating based on delivery track record and complaints.
Formula:
trust_score = 100 - (
35 * normalised_delay_penalty
+ 25 * complaint_density_penalty
+ 15 * withdrawal_penalty
+ 15 * cost_overrun_penalty
+ 10 * litigation_penalty
)
where each penalty is in [0, 1]:
normalised_delay_penalty = clip(median_delivery_delay_months / 24, 0, 1)
complaint_density_penalty = clip(complaints_per_project_per_year / 0.5, 0, 1)
withdrawal_penalty = withdrawn_count / total_projects
cost_overrun_penalty = clip(median_overrun_pct / 50, 0, 1)
litigation_penalty = clip(active_litigation_count / 5, 0, 1)
Only computed for developers with ≥ 3 MahaRERA-registered projects. Below that: show track record but no score.
Validation: Eyeball top-20 known good and bad developers — scores should align with reputation.
Defamation guardrails:
- Never show absolute label ("bad")
- Show percentile vs. comparable developers in same city tier and project count
- Show all inputs with sources
- Provide "request a review" link
Display: "Developer Trust Score: 68/100 (75th percentile vs comparable Pune promoters with 10+ projects). Inputs: median delay 6mo (low penalty), 2 active complaints (moderate penalty), 1 withdrawn project (low penalty), 18% cost overrun (low penalty), 0 active litigation." All inputs clickable.
Refresh cadence: Weekly.
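The penalty-weighted score, including the minimum-project rule, can be sketched as below. Argument names are illustrative; the `None` return for developers with fewer than 3 registered projects mirrors the "track record but no score" rule.

```python
def clip01(x: float) -> float:
    """Clamp x to [0, 1]."""
    return max(0.0, min(1.0, x))


def trust_score(
    median_delay_months: float,
    complaints_per_project_per_year: float,
    withdrawn_count: int,
    total_projects: int,
    median_overrun_pct: float,
    active_litigation_count: int,
):
    """Developer trust score in [0, 100], or None below the 3-project minimum."""
    if total_projects < 3:
        return None
    penalty = (
        35 * clip01(median_delay_months / 24)
        + 25 * clip01(complaints_per_project_per_year / 0.5)
        + 15 * (withdrawn_count / total_projects)
        + 15 * clip01(median_overrun_pct / 50)
        + 10 * clip01(active_litigation_count / 5)
    )
    return 100 - penalty
```

Because the five weights sum to 100 and every penalty is bounded by 1, the result stays within 0-100 without a final clip.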
13. legal.title_clarity_score (0-100)
Layman: One easy-to-understand score that tells how clean and safe the property's legal title is.
Formula:
title_clarity_score = 100 - (
35 * encumbrance_penalty
+ 25 * litigation_penalty
+ 20 * chain_gap_penalty
+ 10 * class_ii_penalty
+ 10 * cersai_charge_penalty
)
encumbrance_penalty = 1 if active mortgage not disclosed; 0.5 if disclosed; 0 if none
litigation_penalty = clip(num_active_litigations / 3, 0, 1)
chain_gap_penalty = clip(max_chain_gap_years / 30, 0, 1)
class_ii_penalty = 1 if Class II without permission; 0.5 if Class II with permission; 0 if Class I
cersai_charge_penalty = 1 if undisclosed charge; 0.5 if disclosed; 0 if none
Validation: Run on 100 sampled projects; compare against lawyer review on a subset.
Display: "Title Clarity: 84/100 (good). One historical mortgage cleared in 2020; chain of title clean since 2015; no active litigation; Class I land."
Refresh cadence: Monthly + on TSR update.
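A sketch of the title-clarity formula with the three-level penalties encoded as lookup tables. The string labels (`"disclosed"`, `"class_ii_with_permission"`, and so on) are hypothetical encodings of the cases listed above, not identifiers from the pipeline.

```python
# Three-level penalties from the spec; the string keys are illustrative encodings.
DISCLOSURE_PENALTY = {"none": 0.0, "disclosed": 0.5, "undisclosed": 1.0}
LAND_CLASS_PENALTY = {
    "class_i": 0.0,
    "class_ii_with_permission": 0.5,
    "class_ii_without_permission": 1.0,
}


def title_clarity_score(
    mortgage: str,
    num_active_litigations: int,
    max_chain_gap_years: float,
    land_class: str,
    cersai_charge: str,
) -> float:
    """Title clarity in [0, 100]; higher means a cleaner title."""
    litigation = max(0.0, min(1.0, num_active_litigations / 3))
    chain_gap = max(0.0, min(1.0, max_chain_gap_years / 30))
    penalty = (
        35 * DISCLOSURE_PENALTY[mortgage]
        + 25 * litigation
        + 20 * chain_gap
        + 10 * LAND_CLASS_PENALTY[land_class]
        + 10 * DISCLOSURE_PENALTY[cersai_charge]
    )
    return 100 - penalty
```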
14. lease.tenant_anchor_quality_score (0-100)
Formula:
anchor_quality = 100 * (
0.40 * tenant_size_score
+ 0.30 * tenant_credit_score_normalised
+ 0.20 * industry_stability_score
+ 0.10 * lease_strength_score
)
tenant_size_score = clip(log10(employee_count) / 4, 0, 1) # 10k = 1.0
tenant_credit_score_normalised = (CIBIL Commercial - 500) / 400 # 500-900 → 0-1
industry_stability_score = {IT: 0.9, Banking: 0.95, FMCG: 0.85, Pharma: 0.85, Coworking: 0.6, ...}
lease_strength_score = clip(lock_in_months / 60, 0, 1)
Display: "Anchor Tenant Quality: 82/100 (TCS, IT services, multi-decade tenure, 36-month lock-in). Strong stability."
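A sketch of the anchor-quality score. Two things here are assumptions rather than part of the spec: the credit sub-score is clipped to [0, 1] so out-of-range inputs cannot push the total outside 0-100, and the industry stability value is passed in directly instead of being looked up, since the industry table above is only partially listed.

```python
import math


def clip01(x: float) -> float:
    """Clamp x to [0, 1]."""
    return max(0.0, min(1.0, x))


def anchor_quality(
    employee_count: int,
    cibil_commercial: float,
    industry_stability: float,
    lock_in_months: float,
) -> float:
    """Anchor tenant quality in [0, 100] (argument names are illustrative)."""
    # log10 scaling: 10,000 employees maps to 1.0; non-positive counts score 0.
    size = clip01(math.log10(employee_count) / 4) if employee_count > 0 else 0.0
    credit = clip01((cibil_commercial - 500) / 400)  # clip is an added safeguard
    lease = clip01(lock_in_months / 60)
    return 100 * (0.40 * size + 0.30 * credit + 0.20 * industry_stability + 0.10 * lease)
```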
15. risk.zone_risk_index (0-10, lower is safer)
Layman: Gives every area one simple risk score by combining legal, market, policy and environmental factors.
Formula:
zri = 10 * (
0.30 * legal_risk_normalised
+ 0.25 * market_risk_normalised
+ 0.25 * policy_risk_normalised
+ 0.20 * environmental_risk_normalised
)
legal_risk_normalised = (100 - avg_title_clarity_score_in_micromarket) / 100
market_risk_normalised = clip(volatility_of_psf_24mo / 20, 0, 1)
policy_risk_normalised = clip(density_of_headwind_flags / threshold, 0, 1)
environmental_risk_normalised = 0.4 * flood + 0.3 * forest + 0.2 * CRZ + 0.1 * AQI bucket
Display: "Zone Risk Index for Hinjewadi: 3.2/10 (low-moderate). Drivers: low legal risk (avg title clarity 87), moderate market volatility, no environmental flags." All drivers clickable.
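A sketch of the index, under stated assumptions: the four environmental inputs are taken to be already scaled to [0, 1], the policy term is clipped to [0, 1], and `headwind_threshold` is a hypothetical calibration constant that the spec does not fix.

```python
def clip01(x: float) -> float:
    """Clamp x to [0, 1]."""
    return max(0.0, min(1.0, x))


def zone_risk_index(
    avg_title_clarity: float,
    psf_volatility_24mo: float,
    headwind_density: float,
    headwind_threshold: float,
    flood: float,
    forest: float,
    crz: float,
    aqi_bucket: float,
) -> float:
    """Zone risk on a 0-10 scale; lower is safer."""
    legal = (100 - avg_title_clarity) / 100
    market = clip01(psf_volatility_24mo / 20)
    policy = clip01(headwind_density / headwind_threshold)
    environmental = 0.4 * flood + 0.3 * forest + 0.2 * crz + 0.1 * aqi_bucket
    return 10 * (0.30 * legal + 0.25 * market + 0.25 * policy + 0.20 * environmental)
```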
16. ai.alpha_narrative and ai.risk_narrative
Formula: LLM prompt with structured input:
INPUTS:
- Project identity
- Top 5 derived scores
- Top 3 micromarket signals
- Top 3 policy tailwinds/headwinds
- Top 3 risk flags
- User persona (if available, with consent)
OUTPUT TEMPLATE (Honest Broker voice):
1. One-line summary
2. Top 3 reasons "for" with citations
3. Top 3 risks / what could go wrong with citations
4. Comparison anchor (vs comparable set)
5. What to verify before deciding (sourced)
CONSTRAINTS:
- No "you should" language
- Every numerical claim has source citation
- Probabilities, not promises
- Include "I don't know" where data is thin
Validation:
- Eval framework with 100 ground-truth projects + expert reviews
- Hallucination rate < 1% (zero tolerance for fabricated citations)
- Tone audit: % of outputs that pass Honest Broker rules (target 100%)
Display: The narrative is the main panel in asset cards and Broker output. Every numerical claim is inline-cited.
17. ai.wealth_trajectory_paths (Monte Carlo simulation)
Layman: Runs thousands of future scenarios to show how much money an investor could make over 5-10 years.
Formula: Monte Carlo with these stochastic inputs:
# Property appreciation (annual)
appreciation_t ~ N(mu_micromarket, sigma_micromarket)
where mu = base_yoy_appreciation
sigma = historical_volatility_micromarket
# Rental growth (annual)
rent_growth_t ~ N(mu_rent, sigma_rent)
where mu_rent = lease.escalation_pct (or sector default)
# Vacancy (annual)
vacancy_t ~ Beta(alpha, beta) calibrated to micromarket history
# Repo rate path (for opportunity cost)
repo_t ~ AR(1) calibrated to RBI history
# Idiosyncratic shocks (delay, oc-extension, policy)
shock_t = Bernoulli(p_shock) * shock_severity
Simulate 10,000 paths over T years.
Output: p20, p50, p80 wealth distributions.
Validation: Sanity — known stable areas have tighter distributions than volatile ones; backtest on past 5-year actuals where possible.
Display: Wealth Path chart with three lines (p20 lower bound, p50 median, p80 upper bound) and shaded band. "Median scenario: ₹1.8 Cr at year 5, p20-p80 range ₹1.3 Cr - ₹2.4 Cr (the central 60% of simulated paths). Assumptions: 11% mean appreciation, 8% vacancy, 5% rent escalation. Stress tests: metro delay scenario, regulatory shock scenario."
Constraints:
- Always show distribution, never just point estimate
- Always show assumptions, with edit toggle
- Never label any path as "expected" without confidence band
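To make the simulation loop concrete, here is a deliberately reduced sketch that models only the appreciation term with a seeded generator. Rental growth, vacancy, the repo-rate path, and idiosyncratic shocks are omitted, and the function name, the index-based percentile pick, and the default seed are illustrative choices.

```python
import random


def simulate_wealth_paths(initial_value, years, mu, sigma, n_paths=10_000, seed=42):
    """Simulate terminal asset values under i.i.d. normal annual appreciation.

    Returns the p20, p50, and p80 terminal values. A fixed seed keeps the
    output reproducible for the same inputs.
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        value = initial_value
        for _ in range(years):
            value *= 1 + rng.gauss(mu, sigma)  # one annual appreciation draw
        finals.append(value)
    finals.sort()

    def percentile(p):
        # Simple order-statistic pick; no interpolation between neighbours.
        return finals[int(p * (n_paths - 1))]

    return {"p20": percentile(0.20), "p50": percentile(0.50), "p80": percentile(0.80)}
```

Storing the seed with the result is what lets a later audit regenerate exactly the same paths.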
18. ai.projected_irr_p50_p20_p80
Derived from Monte Carlo cash-flow simulation (rent + appreciation - costs - tax). Standard IRR computation per path.
Display: "Projected IRR: median 11.4%, p20-p80 band 6.2% - 16.1% (central 60% of simulated paths)."
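The per-path IRR can be computed by root-finding on net present value. The sketch below uses bisection, which is one common choice and not necessarily what the pipeline uses; cash flows are assumed to be annual, with the purchase outflow at index 0.

```python
def npv(rate, cash_flows):
    """Net present value of annual cash flows, with cash_flows[0] at time 0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def irr(cash_flows, lo=-0.99, hi=10.0, tol=1e-9, max_iter=200):
    """Internal rate of return via bisection on NPV over [lo, hi]."""
    f_lo = npv(lo, cash_flows)
    f_hi = npv(hi, cash_flows)
    if f_lo * f_hi > 0:
        raise ValueError("no sign change in [lo, hi]; IRR is not bracketed")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        f_mid = npv(mid, cash_flows)
        if abs(f_mid) < tol or hi - lo < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid              # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # root lies in the upper half
    return (lo + hi) / 2
```

Running `irr` on each simulated path and taking the 20th, 50th, and 80th percentiles of the results gives the three reported figures.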
19. ai.comparable_set
Formula: For a given asset:
Compute embedding(asset) using features:
- Sector, unit type, size bucket
- Micromarket
- Price band
- Developer tier
- Age
Retrieve top-K (default 10) nearest neighbours from corpus of recent transactions where:
- Same micromarket OR adjacent micromarket
- Same sector
- Price within ±30%
- Transaction in last 12 months
Rank by similarity score, dedupe by project.
Display: Side-by-side cards with key fields, link to source deed.
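The filter, rank, and dedupe steps can be sketched as follows, assuming the embedding similarity has already been computed and attached to each candidate. All field names (`similarity`, `months_since_txn`, `allowed_micromarkets`, and so on) are hypothetical.

```python
def comparable_set(asset, candidates, k=10):
    """Return up to k comparable transactions, at most one per project.

    `asset["allowed_micromarkets"]` is assumed to hold the asset's own
    micromarket plus its adjacent ones.
    """
    eligible = [
        c
        for c in candidates
        if c["micromarket_id"] in asset["allowed_micromarkets"]
        and c["sector"] == asset["sector"]
        and abs(c["price"] - asset["price"]) <= 0.30 * asset["price"]
        and c["months_since_txn"] <= 12
    ]
    eligible.sort(key=lambda c: c["similarity"], reverse=True)

    seen_projects = set()
    result = []
    for c in eligible:
        if c["project_id"] in seen_projects:
            continue  # keep only the most similar deed per project
        seen_projects.add(c["project_id"])
        result.append(c)
        if len(result) == k:
            break
    return result
```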
20. ai.sentiment_score
Formula: Per-entity aggregated sentiment over last 30 days.
sentiment_score = weighted average of (sentiment_polarity * source_weight * recency_weight)
across news + social mentions
source_weight: news_outlet_credibility (predefined)
recency_weight: exp(-days_old / 14)
Output: bounded [-1, +1], displayed as bar with label.
Display: "Sentiment for Hinjewadi: +0.23 (mildly positive, n=87 mentions, last 30 days). Top topics: metro progress (+), traffic congestion (-), school openings (+)."
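A sketch of the weighted average. It assumes the average is normalised by the sum of the combined weights, which is what keeps the result inside [-1, +1] when each polarity is in that range; the mention fields and function name are illustrative.

```python
import math


def sentiment_score(mentions, window_days=30, half_scale_days=14):
    """Weighted mean polarity over recent mentions, or None if there are none.

    Each mention is a dict with `polarity` in [-1, 1], `source_weight` > 0,
    and `days_old` >= 0.
    """
    numerator = 0.0
    denominator = 0.0
    for m in mentions:
        if m["days_old"] > window_days:
            continue  # outside the 30-day aggregation window
        weight = m["source_weight"] * math.exp(-m["days_old"] / half_scale_days)
        numerator += m["polarity"] * weight
        denominator += weight
    return None if denominator == 0 else numerator / denominator
```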
21. ai.persona_fit_score (info-only stance)
Layman: Matches the property to the user's investment style and goals.
Formula:
asset_vec = embedding of asset features (yield, growth, risk, ticket size, segment)
persona_vec = embedding of persona features (goal, horizon, risk, budget)
fit_score = cosine_similarity(asset_vec, persona_vec)
Critical compliance note: This score is used to rank what to show, NOT to "recommend." Display label is "Matches your stated preferences" not "Best for you".
Display: "This asset matches your stated preferences (high yield, 5-7 year horizon, ₹50L-₹1Cr budget) more closely than 80% of available options." Clickable to see which features matched.
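For reference, cosine similarity on plain Python lists looks like this. How the asset and persona vectors are produced is outside this sketch; it only assumes both live in the same vector space and have equal length.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)
```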
22. ai.brutal_honesty_flags
Formula: Rule + LLM hybrid generating callouts:
- Title chain gap > 10 years → flag
- Active litigation → flag
- Developer Trust Score below the 40th percentile → flag with percentile framing
- Cost overrun > 30% → flag
- ASR gap > 100% (likely overvalued) OR < -20% (likely cash component) → flag
- Located in flood zone → flag
- Project withdrawn-and-re-registered → flag
- Promoter complaints with similar patterns to current project → flag
Display: "Three things you should know before deciding: (1) ... (2) ... (3) ..." Each with citation.
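The rule half of the hybrid can be sketched as a plain threshold check. The input keys and flag labels are hypothetical; the complaint-pattern rule is left out because it needs the LLM side of the hybrid.

```python
def brutal_honesty_flags(asset: dict) -> list[str]:
    """Apply the threshold rules and return flag labels in rule order."""
    flags = []
    if asset["max_chain_gap_years"] > 10:
        flags.append("title_chain_gap")
    if asset["active_litigation_count"] > 0:
        flags.append("active_litigation")
    if asset["developer_trust_percentile"] < 40:
        flags.append("developer_trust_low_percentile")
    if asset["cost_overrun_pct"] > 30:
        flags.append("cost_overrun")
    if asset["asr_gap_pct"] > 100:
        flags.append("asr_gap_possible_overvaluation")
    elif asset["asr_gap_pct"] < -20:
        flags.append("asr_gap_possible_cash_component")
    if asset["in_flood_zone"]:
        flags.append("flood_zone")
    if asset["withdrawn_and_reregistered"]:
        flags.append("withdrawn_and_reregistered")
    return flags
```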
23. infra.upcoming_projects
Formula: Event extraction from GRs + news; deduplication + timeline scoring.
For each GR / news article tagged as 'infrastructure':
- Extract project_name, type (metro/road/sez/etc.), affected micromarkets, announced_date, expected_completion, status (announced/approved/under_construction/operational)
- Confidence: weighted by source credibility
- Dedupe via fuzzy project name + geography
Display: Timeline visualisation with milestones, status, source links.
Implementation contract for Vishal's pipeline
Every derived attribute must:
- Be reproducible: same inputs → same output (for Monte Carlo, given the same random seed)
- Carry its inputs: pipeline stores the input attribute IDs and their values at compute-time
- Carry its version: model/formula version string, so changes are tracked
- Carry its confidence: confidence value with method
- Be auditable: a query can retrieve "what produced this score on date X"
Pipeline contract details: pipeline-spec-for-vishal.md.
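One possible shape for a stored derived value that satisfies the five requirements is sketched below. The class and field names are hypothetical and are not taken from pipeline-spec-for-vishal.md; the point is only that inputs, version, confidence, and compute time travel with the value.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DerivedValue:
    """A derived attribute value together with its provenance."""
    attribute_id: str            # e.g. "fin.asr_gap_pct"
    value: float
    inputs: dict                 # input attribute IDs mapped to values at compute time
    formula_version: str         # bumped whenever the formula or model changes
    confidence: float
    confidence_method: str
    computed_at: datetime
    seed: Optional[int] = None   # only set for stochastic attributes


record = DerivedValue(
    attribute_id="fin.asr_gap_pct",
    value=21.3,
    inputs={"fin.sale_price": 10_800_000, "fin.asr_value_per_sqft": 8_900, "area.carpet_sqft": 1_000},
    formula_version="1.0.0",
    confidence=0.9,
    confidence_method="input-completeness",
    computed_at=datetime(2026, 3, 31, tzinfo=timezone.utc),
)
```

Keeping one such record per computation, rather than overwriting, is what makes the "what produced this score on date X" query answerable.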
Validation framework
For every derived attribute:
- Unit test with synthetic inputs to verify formula
- Sanity test with 30 known-state samples (manual labels)
- Production monitoring: distribution shifts > 2σ trigger alarms
- Quarterly recalibration of weighted scores against new ground truth
Eval framework details: ../40-ai-architecture/evaluation-framework.md.
See also:
- data-attributes.md — full attribute catalogue
- data-quality-framework.md — confidence and lineage
- pipeline-spec-for-vishal.md — contract for the pipeline