# FarmMatch System Improvements Roadmap

## Overview

This document outlines improvements to enhance location pinpointing accuracy and criteria analysis effectiveness for the FarmMatch property evaluation system.

---

## Recent Improvements (2025-10-12)

### 1. Enhanced Breadcrumb Extraction with Multi-Layer Fallbacks

**Your Request**: "if the breadcrums dont clarify the location analyse the rest of the page in more detail"

**Implemented**: 5-layer location extraction (the primary breadcrumb source plus four fallbacks)

#### Layer 1: Standard Breadcrumb Container (Primary)
```html
<!-- Standard breadcrumb navigation -->
<nav aria-label='Breadcrumb'>
  France > Normandy > Manche > Tribehou
</nav>
```

#### Layer 2: Meta Tags (First Fallback)
```html
<meta property="og:locality" content="Tribehou, Normandy, France" />
```

#### Layer 3: Page Title Parsing (Second Fallback)
```html
<title>Farm for sale in Tribehou, Normandy, France</title>
```
Extracts location from:
- "Property in X, Y, Z"
- "Te koop in X, Y, Z" (Dutch)
- "For sale in X - Y - Z"
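
The title formats listed above could be handled with one regular expression. This is a minimal sketch, not the project's actual parser: `parse_location_from_title` and `TITLE_PREFIX` are hypothetical names, and it assumes the location runs to the end of the title (a trailing site name such as "| Properstar" would need to be stripped first).

```python
import re

# Hypothetical helper: the prefixes mirror the three title formats listed above.
TITLE_PREFIX = re.compile(
    r"\b(?:property in|te koop in|for sale in)\s+(.+)$", re.IGNORECASE
)

def parse_location_from_title(title):
    """Return a 'Country > Region > Town' breadcrumb string, or None."""
    match = TITLE_PREFIX.search(title.strip())
    if not match:
        return None
    # Split on commas or on " - " separators, dropping empty fragments.
    parts = [p.strip() for p in re.split(r"\s+-\s+|,", match.group(1)) if p.strip()]
    # Titles list the most specific place first; breadcrumbs list the country first.
    return " > ".join(reversed(parts)) if parts else None
```

Reversing the parts keeps the result in the same country-first order as the Layer 1 breadcrumb, so later geocoding steps can treat both sources alike.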

#### Layer 4: JSON-LD Structured Data (Third Fallback)
```json
{
  "@type": "RealEstateListing",
  "address": {
    "addressCountry": "France",
    "addressRegion": "Normandy",
    "addressLocality": "Tribehou"
  }
}
```
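
A sketch of how such a JSON-LD block might be turned into a breadcrumb, using only the standard library. `breadcrumb_from_json_ld` is a hypothetical name, and the regex-based script lookup is a simplification; a real scraper would more likely read the script tags through its browser or HTML parser.

```python
import json
import re

JSON_LD_SCRIPT = re.compile(
    r"""<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>""",
    re.IGNORECASE | re.DOTALL,
)

def breadcrumb_from_json_ld(html):
    """Build 'Country > Region > Locality' from the first JSON-LD address found."""
    for raw in JSON_LD_SCRIPT.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks instead of failing the whole page
        # A block may hold a single object or a list of objects.
        for item in data if isinstance(data, list) else [data]:
            address = item.get("address") if isinstance(item, dict) else None
            if not isinstance(address, dict):
                continue
            keys = ("addressCountry", "addressRegion", "addressLocality")
            # Keep only plain, non-empty strings (nested objects are ignored).
            parts = [address[k] for k in keys
                     if isinstance(address.get(k), str) and address[k].strip()]
            if parts:
                return " > ".join(parts)
    return None
```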

#### Layer 5: Page Text Pattern Matching (Fourth Fallback)
Searches page body for:
- "Located in X, Y, Z"
- "Location: X, Y, Z"
- "Locatie: X, Y, Z" (Dutch)
- "Te koop in X, Y, Z"

**Impact**:
- Before: 76 properties (45%) with no location data
- Expected After: ~20-30 properties (12-18%) with no location data
- Improvement: ~27-33 percentage point increase in location extraction success rate (from 55% to an expected 82-88%)

---

## Future Improvements for Location Pinpointing

### Priority 1: High Impact, Low Effort

#### 1.1 Extract GPS Coordinates from Property Pages
**What**: Scrape embedded maps and GPS coordinates directly from property listings

**Why**: Many property sites embed Google Maps or OpenStreetMap with exact coordinates

**How**:
```python
coords = None

# Look for embedded map iframe
map_iframe = await page.query_selector("iframe[src*='maps.google.com']")
if map_iframe:
    src = await map_iframe.get_attribute('src')
    if src:
        # Parse coordinates from URL: ?q=49.2138787,-1.2426305
        coords = extract_coords_from_url(src)

# If the iframe gave nothing, look for map markers in inline JavaScript
if not coords:
    for script in await page.query_selector_all("script"):
        # text_content() returns the raw script source
        text = await script.text_content() or ""
        if 'LatLng' in text or 'lat' in text:
            # Parse: new google.maps.LatLng(49.2138787, -1.2426305)
            coords = extract_coords_from_js(text)
            if coords:
                break
```

**Expected Impact**: +20-30% geocoding accuracy (from ~70% to ~90%+)
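
The two helpers named in the snippet above are not defined in this document. One possible shape for them is sketched below, assuming coordinates appear either as a `lat,lon` pair in a `q`, `ll`, or `center` query parameter or inside a `LatLng(...)` call; other map providers and URL layouts would need their own patterns.

```python
import re
from urllib.parse import parse_qs, urlparse

COORD_PAIR = r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)"

def _valid(lat, lon):
    """Reject number pairs that cannot be geographic coordinates."""
    return -90 <= lat <= 90 and -180 <= lon <= 180

def extract_coords_from_url(src):
    """Read 'lat,lon' from the q, ll or center query parameter of a map URL."""
    query = parse_qs(urlparse(src).query)
    for key in ("q", "ll", "center"):  # Assumed parameter names
        for value in query.get(key, []):
            match = re.fullmatch(COORD_PAIR, value.strip())
            if match:
                lat, lon = float(match.group(1)), float(match.group(2))
                if _valid(lat, lon):
                    return lat, lon
    return None

def extract_coords_from_js(text):
    """Find the first 'LatLng(lat, lon)' call in a script body."""
    match = re.search(r"LatLng\(\s*" + COORD_PAIR + r"\s*\)", text)
    if match:
        lat, lon = float(match.group(1)), float(match.group(2))
        if _valid(lat, lon):
            return lat, lon
    return None
```

Requiring a decimal point in both numbers and range-checking the result keeps unrelated number pairs in scripts from being mistaken for coordinates.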

#### 1.2 Use Multiple Geocoding Services with Fallbacks
**What**: Try multiple geocoding APIs in sequence when Nominatim fails

**Current**: Only Nominatim (OpenStreetMap)

**Proposed**:
1. Nominatim (free; the public instance is limited to 1 request per second)
2. Google Geocoding API (paid, high accuracy)
3. Mapbox Geocoding (paid, good rural coverage)
4. Here Geocoding (paid, excellent European coverage)

**Implementation**:
```python
async def geocode_with_fallbacks(breadcrumb):
    # Try Nominatim first (free)
    result = await nominatim_geocode(breadcrumb)
    if result:
        return result

    # Fallback to Google (best accuracy)
    result = await google_geocode(breadcrumb)
    if result:
        return result

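    # Fallback to Mapbox (good for rural)
    result = await mapbox_geocode(breadcrumb)
    if result:
        return result

    # Last resort: HERE (excellent European coverage)
    result = await here_geocode(breadcrumb)
    if result:
        return result

    return None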
```

**Expected Impact**: +15-20% geocoding success rate
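
The same fallback idea can be written as a loop over providers, which makes it easier to reorder services and to keep going when one of them raises an error. This is a sketch only: `geocode_with_chain` is a hypothetical name, and the two `fake_*` coroutines stand in for real API clients.

```python
import asyncio

async def geocode_with_chain(query, geocoders):
    """Try each (name, coroutine function) pair in order; return the first hit."""
    for name, geocoder in geocoders:
        try:
            result = await geocoder(query)
        except Exception as exc:  # A failing provider should not stop the chain
            print(f"{name} failed: {exc}")
            continue
        if result:
            return name, result
    return None, None

# Stand-in providers for demonstration only
async def fake_nominatim(query):
    return None  # Simulates "not found"

async def fake_google(query):
    return (49.2138787, -1.2426305)

provider, coords = asyncio.run(
    geocode_with_chain(
        "Tribehou, Manche, Normandy, France",
        [("nominatim", fake_nominatim), ("google", fake_google)],
    )
)
```

Returning the provider name alongside the coordinates makes it possible to measure how often each paid fallback is actually needed.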

#### 1.3 Geocode with Decreasing Specificity
**What**: Try full breadcrumb first, then gradually remove detail until success

**Example**:
```python
breadcrumb = "France > Normandy > Manche > Tribehou > 123 Farm Road"
parts = breadcrumb.split(" > ")

# Try the full address first, then drop the most specific part each time:
#   1. France, Normandy, Manche, Tribehou, 123 Farm Road  (street)
#   2. France, Normandy, Manche, Tribehou                 (village)
#   3. France, Normandy, Manche                           (department)
#   4. France, Normandy                                   (region)
coords = None
while len(parts) >= 2 and not coords:
    coords = geocode(", ".join(parts))
    parts = parts[:-1]
```

**Expected Impact**: +10% geocoding success rate for obscure villages

### Priority 2: Medium Impact, Medium Effort

#### 2.1 Extract and Validate Postal Codes
**What**: Scrape postal codes from property pages and use for geocoding

**Why**: Postal codes are highly specific and work well with all geocoding APIs

**Implementation**:
```python
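import re

# Postal code patterns per country
postal_patterns = {
    'france': r'\b\d{5}\b',  # 14550
    'spain': r'\b\d{5}\b',   # 28001
    'portugal': r'\b\d{4}-\d{3}\b',  # 1000-001
    'italy': r'\b\d{5}\b',   # 00100
    'greece': r'\b\d{5}\b'   # 10431
}

# France, Spain, Italy and Greece share the 5-digit format, so trying every
# pattern in turn would always report the first country in the dict. Use the
# country already known from the breadcrumb to pick the pattern instead.
country = breadcrumb.split(' > ')[0].lower()
pattern = postal_patterns.get(country)
match = re.search(pattern, page_text) if pattern else None
if match:
    postal_code = match.group(0)
    # Geocode using: "postal_code, country"
    coords = geocode(f"{postal_code}, {country}")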
```

**Expected Impact**: +10-15% geocoding accuracy, especially for rural properties

#### 2.2 Implement Geocoding Cache with Hierarchical Lookup
**What**: Cache geocoding results at multiple levels (city, region, province)

**Why**: Reduces API calls and speeds up processing

**Implementation**:
```python
# Keys use the same "Country > Region > ..." format as the breadcrumbs,
# otherwise the partial lookups below can never match
geocode_cache = {
    'France > Normandy > Manche > Tribehou': (49.2138787, -1.2426305),
    'France > Normandy > Manche': (49.1167, -1.0833),
    'France > Normandy': (49.1829, -0.3707)
}

def geocode_with_cache(breadcrumb):
    # Try exact match
    if breadcrumb in geocode_cache:
        return geocode_cache[breadcrumb]

    # Try removing last part (street/village)
    parts = breadcrumb.split(' > ')
    while len(parts) > 1:
        parts = parts[:-1]
        partial = ' > '.join(parts)
        if partial in geocode_cache:
            return geocode_cache[partial]  # Return region-level coords

    # Nothing in cache: call the API and remember the answer
    coords = geocode_api(breadcrumb)
    if coords:
        geocode_cache[breadcrumb] = coords
    return coords
```

**Expected Impact**: 10x faster geocoding, reduced API costs

#### 2.3 Add Distance-Based Validation
**What**: Validate geocoded coordinates against expected country/region

**Why**: Catches errors such as a property in Normandy being geocoded to Paris

**Implementation**:
```python
def validate_coords(coords, breadcrumb):
    lat, lon = coords

    # Extract country from breadcrumb
    country = breadcrumb.split(' > ')[0]

    # Check if coordinates are in expected country
    actual_country = reverse_geocode(lat, lon)
    if not actual_country:
        # Reverse geocoding found nothing (e.g. a point at sea)
        return False

    if actual_country.lower() != country.lower():
        print(f"⚠️  Warning: Coordinates in {actual_country} but expected {country}")
        return False

    # Note: a country match alone does not catch errors within the same country
    return True
```

**Expected Impact**: Eliminates ~90% of wrong location errors
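
A country check cannot tell Paris from Normandy, since both are in France. A distance check against the coordinates of the expected region could cover that case. The sketch below uses the haversine formula; `is_within_region` and the 150 km threshold are assumptions for illustration, and a sensible threshold depends on the size of the region.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))  # 6371 km: mean Earth radius

def is_within_region(coords, region_center, max_km=150):
    """True if coords lie within max_km of the region's reference point."""
    return haversine_km(coords, region_center) <= max_km

normandy = (49.1829, -0.3707)        # Region-level coordinates used in section 2.2
tribehou = (49.2138787, -1.2426305)
paris = (48.8566, 2.3522)

print(is_within_region(tribehou, normandy))  # Tribehou is accepted
print(is_within_region(paris, normandy))     # Paris is rejected
```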

### Priority 3: Advanced Improvements

#### 3.1 Machine Learning for Location Extraction
**What**: Train ML model to extract location from unstructured text

**When**: After collecting 500+ correctly geocoded properties

**How**:
1. Collect training data: (page_html, correct_breadcrumb)
2. Train NER (Named Entity Recognition) model to extract:
   - Country names
   - Region/province names
   - City/village names
3. Use model to extract location when standard methods fail

**Expected Impact**: +5-10% on edge cases

#### 3.2 Community-Sourced Location Database
**What**: Build database of property URL → known coordinates from successful geocodes

**Why**: Share geocoding results across runs and users

**Implementation**:
`community_locations.json`:
```json
{
  "https://www.properstar.nl/listing/102754054": {
    "lat": 49.2138787,
    "lon": -1.2426305,
    "breadcrumb": "France > Normandy > Manche > Tribehou",
    "verified": true,
    "verified_by": "user",
    "verified_date": "2025-10-12"
  }
}
```

**Expected Impact**: Perfect accuracy for previously seen properties

---

## Improvements for Criteria Analysis Effectiveness

### Priority 1: High Impact

#### 1.1 Extract Property Size and Building Dimensions
**What**: Scrape land size, building size, bedroom count directly from pages

**Why**: Currently relying on GPT analysis which may miss numeric details

**Implementation**:
```python
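import re

kpis = {}

# Look for size patterns
size_patterns = {
    'land_size': r'(\d+[\d,\.]*)\s*(m2|m²|hectare|ha|acre)',
    'building_size': r'woonoppervlakte[:\s]*(\d+[\d,\.]*)\s*m',
    'bedrooms': r'(\d+)\s*slaapkamer',
    'bathrooms': r'(\d+)\s*badkamer'
}

def parse_number(text):
    # Assumes European formatting as used on Dutch listing pages:
    # '.' groups thousands and ',' is the decimal mark ("1.500" -> 1500.0)
    return float(text.replace('.', '').replace(',', '.'))

for key, pattern in size_patterns.items():
    match = re.search(pattern, page_text, re.IGNORECASE)
    if match:
        # Note: land_size is stored in whichever unit matched (m², ha or acre)
        kpis[key] = parse_number(match.group(1))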
```

**Expected Impact**:
- 90%+ accuracy for numeric KPIs (vs ~60% from GPT)
- More reliable score validation
- Better price/size calculations
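
Land sizes are only comparable once they share a unit. The sketch below converts matched areas to square metres; `extract_area_m2` and `parse_european_number` are hypothetical names, the number format is assumed to be European ("1.500" means 1500, "1,5" means 1.5), and acres are left out for brevity.

```python
import re

def parse_european_number(text):
    """Parse '1.500', '1,5' or '12.345,67' ('.' groups thousands, ',' is decimal)."""
    return float(text.replace(".", "").replace(",", "."))

# A number, optional whitespace, then a unit that is not followed by more letters
# (so "ha" does not match the start of a longer word).
AREA = re.compile(r"(\d+(?:[.,]\d+)*)\s*(m2|m²|hectares?|ha)(?![a-z])", re.IGNORECASE)

def extract_area_m2(text):
    """Return the first area found in text, converted to square metres, or None."""
    match = AREA.search(text)
    if not match:
        return None
    value = parse_european_number(match.group(1))
    unit = match.group(2).lower()
    factor = 10_000.0 if unit.startswith("h") else 1.0  # 1 ha = 10,000 m²
    return value * factor
```

With a single unit in `kpis['land_size']`, later calculations such as price per square metre do not silently mix hectares and square metres.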

#### 1.2 Extract Property Images for Visual Analysis
**What**: Download property photos and analyze for features

**Why**: Images reveal workshop potential, guest accommodation, rental units better than text

**Implementation**:
```python
# Extract all property images
images = await page.query_selector_all("img.property-image")
image_urls = [await img.get_attribute('src') for img in images]

# Analyze images with Vision AI, accumulating results across all photos
# (assigning inside the loop would keep only the last image's result)
kpis.update(has_workshop=False, has_guest_house=False, building_count=0)
for url in image_urls:
    if not url:
        continue
    analysis = analyze_image_with_vision_ai(url)
    # Detect: workshops, guest houses, separate buildings, renovation needs
    kpis['has_workshop'] |= 'workshop' in analysis['labels']
    kpis['has_guest_house'] |= 'guest house' in analysis['labels']
    kpis['building_count'] = max(kpis['building_count'], analysis['building_count'])
```

**Expected Impact**:
- 30-40% better detection of workshops, guest houses, rental potential
- Catches details GPT misses in text descriptions

#### 1.3 Extract Price and Price History
**What**: Scrape current price and price changes over time

**Why**: All 169 properties currently have empty price data

**Implementation**:
```python
import re

# Price patterns for different sites
price_patterns = {
    'properstar': r'€\s*(\d[\d,\.]*)',
    'generic': r'(\d+[\d,\.]*)\s*(?:€|EUR|euro)',
}

# Extract price
for pattern_name, pattern in price_patterns.items():
    match = re.search(pattern, page_text)
    if match:
        # Drop a decimal part such as ",00", then the thousands separators
        price_str = re.sub(r'[,\.]\d{1,2}$', '', match.group(1))
        price = int(re.sub(r'[,\.]', '', price_str))
        kpis['price'] = price
        # Only compute price per m² when a land size was actually extracted
        if kpis.get('land_size'):
            kpis['price_per_sqm'] = price / kpis['land_size']
        break
```

**Expected Impact**:
- Price data for all properties
- Enable price-based filtering and sorting
- Calculate price/sqm for value comparison

#### 1.4 Implement Criteria Weights and Scoring System
**What**: Add configurable weights for different criteria

**Why**: Some users prioritize workshop > guest accommodation, others vice versa

**Implementation**:
```python
# User-configurable weights
criteria_weights = {
    'guest_accommodation': 0.4,  # 40% importance
    'workshop': 0.35,             # 35% importance
    'rental': 0.25                # 25% importance
}

# Calculate weighted score
def calculate_total_score(scores, weights):
    total = 0
    for criterion, score in scores.items():
        weight = weights.get(criterion, 0.33)
        total += score * weight
    return total * 2  # Scale to 0-10

# Example:
scores = {'guest_accommodation': 4, 'workshop': 2, 'rental': 3}
total_score = calculate_total_score(scores, criteria_weights)
# Result: (4*0.4 + 2*0.35 + 3*0.25) * 2 = 6.1/10
```

**Expected Impact**:
- Personalized scoring per user preferences
- More relevant property rankings
- Better match quality
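
User-supplied weights will not always sum to 1, which would push the total outside the 0-10 range. A sketch of one way to guard against that is shown below; `normalize_weights` and `weighted_score` are hypothetical names, and the 0-5 range for individual scores is an assumption inferred from the doubling step in the snippet above.

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {criterion: w / total for criterion, w in weights.items()}

def weighted_score(scores, weights, max_raw=5, scale=10):
    """Weighted average of per-criterion scores, rescaled to 0..scale."""
    norm = normalize_weights(weights)
    # Criteria without a score count as 0 rather than being skipped.
    raw = sum(scores.get(criterion, 0) * w for criterion, w in norm.items())
    return raw / max_raw * scale

scores = {'guest_accommodation': 4, 'workshop': 2, 'rental': 3}
# Weights given as 8 : 7 : 5 behave the same as 0.4 : 0.35 : 0.25
total = weighted_score(scores, {'guest_accommodation': 8, 'workshop': 7, 'rental': 5})
```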

### Priority 2: Medium Impact

#### 2.1 Add Comparative Analysis
**What**: Compare each property to averages in its region

**Why**: 4 bedrooms is great in rural France, average in suburban areas

**Implementation**:
```python
# Calculate regional averages
region_stats = {
    'Normandy': {
        'avg_land_size': 15000,  # m2
        'avg_bedrooms': 3.2,
        'avg_price': 250000
    }
}

# Compare property to region. Values are a percentage of the regional
# average, not statistical percentiles.
def get_percent_of_regional_average(prop, region):
    stats = region_stats.get(region, {})
    ratios = {}

    for key in ['land_size', 'bedrooms', 'price']:
        prop_value = prop.get(key)
        avg_value = stats.get(f'avg_{key}')
        # Skip missing data instead of dividing by zero
        if prop_value is None or not avg_value:
            continue
        ratios[key] = (prop_value / avg_value) * 100

    return ratios

# Display: "This property has 150% of average land size for Normandy"
```

**Expected Impact**: Better understanding of property value relative to market
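
The comparison above expresses each value as a percentage of the regional average. If the individual values for a region are available, a percentile rank is less sensitive to a few very large estates skewing the average. A minimal sketch, with `percentile_rank` as a hypothetical helper:

```python
from bisect import bisect_right

def percentile_rank(value, regional_values):
    """Percentage of regional values that are less than or equal to value."""
    ordered = sorted(regional_values)
    if not ordered:
        return None  # No regional data to compare against
    return 100.0 * bisect_right(ordered, value) / len(ordered)
```

For example, a property with 3 bedrooms in a region whose listings have 1, 2, 3 and 4 bedrooms ranks at the 75th percentile.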

#### 2.2 Extract Renovation Needs and Condition
**What**: Identify properties needing major renovation vs move-in ready

**Why**: Critical for budget planning and timeline estimation

**Implementation**:
```python
renovation_keywords = {
    'major_renovation': ['à rénover', 'to renovate', 'renovation project', 'needs work'],
    'good_condition': ['excellent condition', 'move-in ready', 'pristine', 'renovated'],
    'partial_renovation': ['some work needed', 'cosmetic work', 'updating needed']
}

def assess_condition(description):
    for condition, keywords in renovation_keywords.items():
        if any(kw in description.lower() for kw in keywords):
            return condition
    return 'unknown'
```

**Expected Impact**: Better filtering by buyer readiness level

#### 2.3 Extract Amenities and Features
**What**: Create structured list of property features

**Why**: Enables filtering by specific requirements (pool, garage, barn, etc.)

**Implementation**:
```python
import re

amenities_keywords = {
    'pool': ['pool', 'zwembad', 'piscine'],
    'garage': ['garage', 'carport', 'parking'],
    'barn': ['barn', 'schuur', 'grange'],
    'stable': ['stable', 'stal', 'écurie'],
    'vineyard': ['vineyard', 'wijngaard', 'vignoble'],
    'orchard': ['orchard', 'boomgaard', 'verger'],
    'well': ['well', 'put', 'puits'],
    'solar': ['solar', 'zonnepanelen', 'panneaux solaires']
}

# Extract amenities, matching whole words only so that 'put' does not match
# 'computer' and 'well' does not match 'dwelling'. Plurals and compound
# words are missed by this approach and may need extra keywords.
text = description.lower()
amenities = {}
for amenity, keywords in amenities_keywords.items():
    amenities[amenity] = any(
        re.search(rf'\b{re.escape(kw)}\b', text) for kw in keywords
    )

# Filter: Show only properties with pool AND workshop
filtered = [p for p in properties if p['amenities']['pool'] and p['has_workshop']]
```

**Expected Impact**: Precise filtering by must-have features

### Priority 3: Advanced Features

#### 3.1 Historical Price Tracking
**What**: Track price changes over time by periodically scraping

**Implementation**:
```python
# Contents of price_history.json, loaded into a dict
price_history = {
  "https://www.properstar.nl/listing/102754054": [
    {"date": "2025-09-01", "price": 280000},
    {"date": "2025-10-01", "price": 265000},  # Price dropped
    {"date": "2025-10-12", "price": 265000}
  ]
}

# Detect price drops
def find_price_drops(min_drop_percent=10):
    for url, history in price_history.items():
        if len(history) >= 2:
            original = history[0]['price']
            current = history[-1]['price']
            drop_percent = ((original - current) / original) * 100
            if drop_percent >= min_drop_percent:
                print(f"💰 {url}: {drop_percent:.1f}% price drop!")
```

**Expected Impact**: Identify motivated sellers and negotiation opportunities

#### 3.2 Neighborhood Analysis
**What**: Analyze nearby amenities, services, and infrastructure

**Implementation**:
```python
# Query OpenStreetMap Overpass API for nearby POIs
def analyze_neighborhood(lat, lon, radius_km=5):
    # Find nearby:
    # - Grocery stores
    # - Schools
    # - Hospitals
    # - Restaurants
    # - Train stations

    nearby_amenities = query_overpass_api(lat, lon, radius_km)

    return {
        'grocery_stores': len([a for a in nearby_amenities if a['type'] == 'supermarket']),
        'schools': len([a for a in nearby_amenities if a['type'] == 'school']),
        'restaurants': len([a for a in nearby_amenities if a['type'] == 'restaurant']),
        'isolation_score': calculate_isolation_score(nearby_amenities)
    }
```

**Expected Impact**: Better assessment of location quality and lifestyle fit
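
`query_overpass_api` is not defined in this document. As a starting point, the query text it would send could be built as below. This is a sketch in Overpass QL with commonly used OpenStreetMap tags; `build_overpass_query` is a hypothetical name, and only nodes are queried, so amenities mapped as building outlines would be missed.

```python
def build_overpass_query(lat, lon, radius_km=5):
    """Return an Overpass QL query for amenities within radius_km of a point."""
    radius_m = int(radius_km * 1000)
    around = f"(around:{radius_m},{lat},{lon})"
    tag_filters = [
        '["shop"="supermarket"]',    # Grocery stores
        '["amenity"="school"]',
        '["amenity"="hospital"]',
        '["amenity"="restaurant"]',
        '["railway"="station"]',     # Train stations
    ]
    statements = "\n".join(f"  node{tag}{around};" for tag in tag_filters)
    # Union of all statements, returned as JSON
    return f"[out:json][timeout:25];\n(\n{statements}\n);\nout body;"

query = build_overpass_query(49.2138787, -1.2426305)
```

The resulting string would be sent as the request body to an Overpass API endpoint, and the returned elements grouped by tag to produce the counts in `analyze_neighborhood`.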

#### 3.3 Commute Time Analysis
**What**: Calculate drive times to major cities/airports

**Implementation**:
```python
# Use Google Maps Distance Matrix API
def calculate_commute_times(property_coords):
    major_cities = {
        'Paris': (48.8566, 2.3522),
        'Lyon': (45.7640, 4.8357),
        'Bordeaux': (44.8378, -0.5792)
    }

    commute_times = {}
    for city, city_coords in major_cities.items():
        drive_time = get_drive_time(property_coords, city_coords)
        commute_times[city] = drive_time

    return commute_times

# Filter: Show properties within 2h drive of Paris
filtered = [p for p in properties if p['commute_times']['Paris'] <= 120]
```

**Expected Impact**: Better filtering by accessibility requirements

---

## Implementation Priority Matrix

### Quick Wins (Do First)
1. ✅ Enhanced breadcrumb extraction with fallbacks (DONE)
2. Extract GPS coordinates from property pages
3. Extract property sizes and KPIs from structured data
4. Extract prices from property pages

### High ROI (Do Second)
5. Multiple geocoding services with fallbacks
6. Geocode with decreasing specificity
7. Extract and use postal codes
8. Implement scoring weights system

### Medium ROI (Do Third)
9. Property images analysis with Vision AI
10. Amenities extraction
11. Condition/renovation needs assessment
12. Geocoding cache with hierarchical lookup

### Advanced Features (Do Later)
13. Distance-based coordinate validation
14. Historical price tracking
15. Neighborhood analysis
16. Commute time analysis
17. ML-based location extraction
18. Community-sourced location database

---

## Expected Overall Impact

### Location Pinpointing Accuracy
- **Current**: ~60% of properties with accurate coordinates
- **After Quick Wins**: ~80% with accurate coordinates
- **After High ROI**: ~92% with accurate coordinates
- **After Medium ROI**: ~96% with accurate coordinates

### Criteria Analysis Effectiveness
- **Current**: ~60% accuracy for numeric KPIs (bedrooms, size)
- **After Quick Wins**: ~90% accuracy for numeric KPIs
- **After High ROI**: Personalized scoring, better matches
- **After Medium ROI**: Rich filtering by amenities, condition, features

### User Experience
- **Current**: Manual verification needed for most properties
- **After Improvements**: High confidence in data quality, minimal manual verification

---

## Testing Strategy

### For Each Improvement:

1. **Unit Tests**: Test extraction logic on sample HTML
```python
def test_extract_gps_from_map():
    html = '<iframe src="https://maps.google.com/?q=49.2138787,-1.2426305"></iframe>'
    coords = extract_gps_from_html(html)
    assert coords == (49.2138787, -1.2426305)
```

2. **Integration Tests**: Test on 10-20 real property pages
```python
def test_extract_breadcrumb_fallbacks():
    urls = [...] #  20 test URLs
    for url in urls:
        breadcrumb = extract_breadcrumb(url)
        assert breadcrumb is not None
        assert len(breadcrumb) > 0
```

3. **Accuracy Validation**: Compare against manually verified data
```python
def test_geocoding_accuracy():
    # Load 50 manually verified coordinates
    verified = load_verified_coords()

    errors = []
    for url, expected_coords in verified.items():
        actual_coords = geocode_property(url)
        distance = calculate_distance(expected_coords, actual_coords)
        if distance > 5:  # >5km error
            errors.append((url, distance))

    assert len(errors) < 3  # <6% error rate
```

4. **A/B Testing**: Compare old vs new methods
```python
# Test on 100 properties
old_success_rate = test_old_geocoding(properties)
new_success_rate = test_new_geocoding(properties)

print(f"Improvement: {new_success_rate - old_success_rate:.1f} percentage points")
```

---

## Next Steps

1. ✅ Implement enhanced breadcrumb extraction (DONE)
2. ⏭️  Extract GPS coordinates from embedded maps (NEXT)
3. ⏭️  Extract property sizes and KPIs from pages
4. ⏭️  Extract prices from property pages
5. ⏭️  Run full re-analysis with new extraction methods
6. ⏭️  Validate accuracy improvements
7. ⏭️  Document results and iterate

---

*This roadmap provides a structured path to significantly improve location accuracy and criteria analysis effectiveness. Each improvement builds on the previous ones, creating a robust and reliable property analysis system.*
