# Scraping Improvement Plan

## Problem Statement

The analysis of property https://www.properstar.nl/listing/108223811 is incomplete because too little data about the listing reaches the analysis step.

### Current State:
- **enriched_data.json has:**
  - land_size_m2: 90 m²
  - building_size_m2: None
  - bedrooms: None
  - bathrooms: None
  - description: Empty

- **analyze_from_urls_optimized.py extracts:**
  - Only the meta description tag (often just "Azure WAF" or other generic text)
  - Title tag (generic: "Huis - Huis te koop - Properstar")
  - Location from CSV

- **Result:** GPT analyzes with almost no property details → vague, incomplete analysis

## Root Cause Analysis

### Data Flow:
1. `favorites_scraper.py` → extracts URLs → `extracted_property_urls.csv`
2. `extract_gps_and_kpis.py` → scrapes property pages → `enriched_data.json`
3. `analyze_from_urls_optimized.py` → **RE-SCRAPES pages** (only meta tags) → sends to GPT

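**Problem:** Step 3 ignores the structured data produced by Step 2. Instead it re-scrapes each page and keeps only the title and meta description, which carry almost no property details.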

## Solution: Two-Phase Approach

### Phase 1: Use Existing enriched_data.json (IMMEDIATE FIX)

**Modify analyze_from_urls_optimized.py to:**
1. Load property data from enriched_data.json
2. Build rich description from structured fields:
   ```
   Property: {title}
   Location: {location}
   Land size: {land_size_m2} m²
   Building size: {building_size_m2} m²
   Bedrooms: {bedrooms}
   Bathrooms: {bathrooms}
   Price: {price}
   ```
3. Send this structured text to GPT instead of the meta description
4. No re-scraping needed!
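
Step 1 above could be sketched roughly as follows. This is a minimal sketch, not the existing implementation: it assumes `enriched_data.json` holds a JSON list of records that each carry a `url` key, and the helper names `normalize_url` and `load_property_index` are hypothetical.

```python
import json
from pathlib import Path


def normalize_url(url: str) -> str:
    """Strip whitespace and a trailing slash so CSV and JSON URLs compare equal."""
    return url.strip().rstrip("/")


def load_property_index(path: str = "enriched_data.json") -> dict:
    """Return a dict mapping normalized URL -> property record.

    Assumption: the file holds a JSON list of objects, each with a "url" key.
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    index = {}
    for record in records:
        url = record.get("url")
        if url:  # skip records that cannot be matched back to a listing
            index[normalize_url(url)] = record
    return index
```

Loading the file once and looking records up by URL keeps the analysis loop free of HTTP requests. If the real file is keyed differently (for example by listing ID), only the indexing line needs to change.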

**Benefits:**
- ✅ Immediate improvement
- ✅ Uses existing scraped data
- ✅ No additional HTTP requests
- ✅ Faster analysis

### Phase 2: Improve Initial Scraping (BETTER LONG-TERM)

**Enhance extract_gps_and_kpis.py to:**
1. Extract from semantic HTML structure:
   - `<div class="listing-section-content">` sections
   - `<div class="areas">` for sizes
   - `<div class="description">` for full description
   - Feature lists, amenities

2. Handle JavaScript-rendered pages:
   - Use Selenium/Playwright if needed
   - Or extract from JSON-LD structured data

3. Store rich data in enriched_data.json:
   ```json
   {
     "url": "...",
     "land_size_m2": 2100,
     "building_size_m2": 150,
     "bedrooms": 4,
     "bathrooms": 2,
     "full_description": "Detailed property description...",
     "features": ["swimming pool", "garden", "terrace"],
     "year_built": 1950,
     "condition": "to renovate"
   }
   ```
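
One way to keep the stored records consistent is to describe them with a small schema. The sketch below mirrors the field names from the example record above; the `PropertyRecord` class and `save_records` helper are hypothetical names, and the field types are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PropertyRecord:
    """One entry in enriched_data.json; None means "not found on the page"."""

    url: str
    land_size_m2: Optional[float] = None
    building_size_m2: Optional[float] = None
    bedrooms: Optional[int] = None
    bathrooms: Optional[int] = None
    full_description: Optional[str] = None
    features: List[str] = field(default_factory=list)
    year_built: Optional[int] = None
    condition: Optional[str] = None


def save_records(records, path="enriched_data.json"):
    """Write the records as a JSON list, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump([asdict(r) for r in records], fh, ensure_ascii=False, indent=2)
```

Giving every field an explicit default means a partially scraped page still produces a well-formed record, and downstream code can rely on every key being present.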

**Benefits:**
- ✅ More complete data for all future analyses
- ✅ Better GPT analysis quality
- ✅ Can filter properties by features

## Implementation Priority

### HIGH PRIORITY (Phase 1):
Modify `analyze_from_urls_optimized.py` to use structured data from enriched_data.json instead of re-scraping.

**Files to modify:**
- [analyze_from_urls_optimized.py:245-262](analyze_from_urls_optimized.py#L245-L262)

**Changes:**
```python
# BEFORE (current):
title = soup.title.text.strip() if soup.title else ""
desc_tag = soup.find("meta", {"name": "description"})
description = desc_tag["content"] if desc_tag else ""
full_text = description[:800]

# AFTER (improved):
# Use structured data from enriched_data.json
property_data = property_data_by_url.get(url, {})

def fmt(key, unit="", default="Not specified"):
    # Missing fields are stored as None, and dict.get()'s default only applies
    # when the key is absent, so None and "" are checked explicitly.
    value = property_data.get(key)
    return default if value in (None, "") else f"{value}{unit}"

full_text = f"""
Property: {fmt('title', default='Unknown')}
Location: {fmt('location', default='Unknown')}
Land size: {fmt('land_size_m2', ' m²')}
Building size: {fmt('building_size_m2', ' m²')}
Bedrooms: {fmt('bedrooms')}
Bathrooms: {fmt('bathrooms')}
Price: {fmt('price')}
""".strip()
```

### MEDIUM PRIORITY (Phase 2):
Improve `extract_gps_and_kpis.py` to extract more complete data from property pages.

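**Challenge:** Properstar pages may be JavaScript-rendered or sit behind a web application firewall. The scraper sometimes receives only an "Azure WAF" response, which suggests that the request was blocked or that the content is loaded dynamically.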

**Solutions:**
1. Check if pages have JSON-LD structured data
2. Use Selenium/Playwright for JavaScript rendering
3. Improve HTTP headers to avoid WAF blocking
4. Extract from API if available
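
Solution 1 can be checked without a browser. The sketch below pulls every `<script type="application/ld+json">` block out of a page using only the standard library. Whether Properstar pages actually embed JSON-LD, and which fields it would contain, has not been verified here; `JsonLdParser` and `extract_json_ld` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_json_ld(html: str) -> list:
    """Return all JSON-LD objects found in the page; malformed blocks are skipped."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    items = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # A block may hold a single object or a list of objects.
        items.extend(data if isinstance(data, list) else [data])
    return items
```

If `extract_json_ld(html)` returns an empty list for the HTML the scraper receives, the data is probably injected by JavaScript or the request was blocked, and solutions 2 to 4 become relevant.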

## Success Metrics

**Phase 1:**
- [ ] Property 108223811 analysis uses structured data
- [ ] GPT receives land size, bedrooms, bathrooms
- [ ] Analysis quality improves (more specific details)

**Phase 2:**
- [ ] 80%+ of properties have complete data (bedrooms, bathrooms, sizes)
- [ ] Full descriptions extracted for most properties
- [ ] Features and amenities cataloged
- [ ] No properties with empty/minimal data
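
The 80% target can be measured directly from the records in `enriched_data.json`. The sketch below assumes that "complete" means all four fields in `REQUIRED_FIELDS` are filled in; that definition and the function names are assumptions, and listings without land (such as apartments) may need a looser rule.

```python
REQUIRED_FIELDS = ("land_size_m2", "building_size_m2", "bedrooms", "bathrooms")


def is_complete(record: dict, required=REQUIRED_FIELDS) -> bool:
    """True when every required field is present and neither None nor empty."""
    return all(record.get(key) not in (None, "") for key in required)


def completeness_ratio(records: list) -> float:
    """Fraction of records (0.0 to 1.0) with all required fields filled in."""
    if not records:
        return 0.0
    return sum(is_complete(r) for r in records) / len(records)
```

Running `completeness_ratio` on the loaded records before and after the Phase 2 changes gives a concrete number to compare against the 0.8 threshold.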

## Next Steps

1. **Implement Phase 1** (Quick Win - 30 minutes)
2. Test with property 108223811
3. Re-run analysis to verify improvement
4. Plan Phase 2 implementation (requires research into Properstar page structure)
