Data Quality Dashboard

Overview

The Data Quality Dashboard provides comprehensive monitoring and management of geocoding accuracy and location data integrity. This feature enables campaign administrators to identify and resolve data quality issues, track geocoding provider performance, and ensure reliable map data for canvassing operations.

Key Features:

  • Real-time geocoding quality metrics
  • Provider success rate tracking
  • Low-confidence location detection
  • Duplicate location identification
  • Bulk re-geocoding operations
  • Address validation reporting
  • Interactive quality charts
  • Export quality reports

Use Cases:

  • Monthly data quality audits
  • NAR import validation
  • Geocoding provider evaluation
  • Pre-canvass data verification
  • Address database cleanup
  • Campaign planning accuracy checks

Architecture Highlights:

  • Aggregate statistics via database queries
  • Confidence threshold filtering (0-100 scale)
  • Provider performance comparison
  • Duplicate detection via coordinate matching
  • Manual review workflows
  • Prometheus metrics integration

Architecture

flowchart TB
    subgraph Admin Interface
        Admin[Admin User]
        Dashboard[DataQualityDashboardPage]
        LocationsPage[LocationsPage]
    end

    subgraph API Layer
        StatsAPI["/api/locations/geocode-stats"]
        LocationsAPI["/api/locations"]
        DuplicatesAPI["/api/locations/duplicates"]
        RegeocodeAPI["/api/locations/:id/regeocode"]
        BulkGeocodeAPI["/api/locations/bulk-geocode"]
    end

    subgraph Database
        LocationsDB[(Locations)]
        Indexes[(Indexes)]
    end

    subgraph Geocoding Service
        GeocodingService[GeocodingService]
        Providers[6 Providers]
        Cache[Redis Cache]
    end

    subgraph Monitoring
        Prometheus[Prometheus]
        Metrics[cm_locations_low_confidence_count]
    end

    Admin --> Dashboard
    Admin --> LocationsPage

    Dashboard --> StatsAPI
    Dashboard --> LocationsAPI
    Dashboard --> DuplicatesAPI
    LocationsPage --> RegeocodeAPI
    LocationsPage --> BulkGeocodeAPI

    StatsAPI --> LocationsDB
    LocationsAPI --> LocationsDB
    DuplicatesAPI --> LocationsDB
    RegeocodeAPI --> GeocodingService
    BulkGeocodeAPI --> GeocodingService

    LocationsDB --> Indexes
    GeocodingService --> Providers
    GeocodingService --> Cache

    StatsAPI --> Prometheus
    Prometheus --> Metrics

Data Flow:

  1. Statistics Aggregation:

    • Query all locations with geocoding metadata
    • Calculate aggregate metrics (total, geocoded %, avg confidence)
    • Group by provider for success rate comparison
    • Identify low-confidence locations (< 50)
    • Detect duplicates via coordinate matching
  2. Quality Review:

    • Admin views dashboard statistics
    • Filters low-confidence locations
    • Reviews individual location details
    • Identifies patterns (provider failures, address format issues)
  3. Remediation:

    • Manual address correction
    • Single location re-geocoding
    • Bulk re-geocoding with different provider
    • Duplicate merging or marking
  4. Monitoring:

    • Prometheus metrics track quality trends
    • Alert rules trigger for quality degradation
    • Grafana dashboards visualize provider performance

Database Models

Location Model

model Location {
  id          Int      @id @default(autoincrement())
  address     String
  latitude    Float?
  longitude   Float?
  postalCode  String?
  province    String?

  // Geocoding metadata
  geocodeConfidence Int?        // 0-100 quality score
  geocodeProvider   String?     // Provider used for geocoding
  geocodedAt        DateTime?   // Timestamp of last geocode

  // NAR import fields
  locGuid           String?  @unique
  federalDistrict   String?
  buildingUse       Int?     // 1 = Residential

  addresses   Address[]

  createdAt   DateTime @default(now())
  updatedAt   DateTime @updatedAt

  @@index([geocodeConfidence])
  @@index([geocodeProvider])
  @@index([latitude, longitude])
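  // Partial index on non-null coordinates: a bare SQL predicate is not valid @@index syntax,
  // so create it in a raw SQL migration (WHERE latitude IS NOT NULL AND longitude IS NOT NULL)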
}

Geocode Confidence Scale:

  • 0-20: Very Low (manual review required)
  • 21-40: Low (likely incorrect, re-geocode recommended)
  • 41-60: Medium (acceptable but consider verification)
  • 61-80: Good (likely accurate)
  • 81-100: Excellent (high confidence)
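
The scale above can be expressed as a small lookup. This is a minimal sketch rather than code from the project: the `confidenceBand` name and the choice to return `null` for a missing score are assumptions made for illustration.

```typescript
type ConfidenceBand = 'Very Low' | 'Low' | 'Medium' | 'Good' | 'Excellent';

// Hypothetical helper: map a 0-100 geocode confidence score to the band names
// listed above. A missing score returns null so that "not geocoded" is not
// mistaken for "Very Low".
function confidenceBand(score: number | null | undefined): ConfidenceBand | null {
  if (score == null || isNaN(score)) return null;
  if (score <= 20) return 'Very Low';
  if (score <= 40) return 'Low';
  if (score <= 60) return 'Medium';
  if (score <= 80) return 'Good';
  return 'Excellent';
}
```

Keeping the band boundaries in one place means the dashboard labels and the distribution buckets cannot drift apart.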

Geocode Provider Enum:

enum GeocodeProvider {
  GOOGLE = 'GOOGLE',
  MAPBOX = 'MAPBOX',
  NOMINATIM = 'NOMINATIM',
  PHOTON = 'PHOTON',
  LOCATIONIQ = 'LOCATIONIQ',
  ARCGIS = 'ARCGIS',
  UNKNOWN = 'UNKNOWN'
}

Address Model

model Address {
  id         Int      @id @default(autoincrement())
  locationId Int
  location   Location @relation(fields: [locationId], references: [id], onDelete: Cascade)

  unitNumber   String?
  firstName    String?
  lastName     String?
  supportLevel Int?
  notes        String?

  // Address validation
  isValidated  Boolean  @default(false)
  validatedAt  DateTime?

  createdAt DateTime @default(now())
  updatedAt DateTime @updatedAt

  @@index([locationId])
}

API Endpoints

GET /api/locations/geocode-stats

Fetch aggregate geocoding quality statistics.

Authentication: Required (SUPER_ADMIN, MAP_ADMIN)

Response:

{
  "total": 1500,
  "geocoded": 1450,
  "geocodedPercent": 96.67,
  "avgConfidence": 78.5,
  "providerBreakdown": {
    "GOOGLE": 800,
    "MAPBOX": 350,
    "NOMINATIM": 200,
    "PHOTON": 100,
    "ARCGIS": 0,
    "LOCATIONIQ": 0,
    "UNKNOWN": 50
  },
  "confidenceDistribution": {
    "0-20": 15,
    "21-40": 35,
    "41-60": 150,
    "61-80": 450,
    "81-100": 800
  },
  "lowConfidenceCount": 50,
  "missingCoordinates": 50,
  "duplicatesCount": 12
}

Implementation:

// locations.service.ts
async getGeocodeStats() {
  const locations = await prisma.location.findMany({
    select: {
      latitude: true,
      longitude: true,
      geocodeConfidence: true,
      geocodeProvider: true
    }
  });

  const total = locations.length;
  // Compare against null explicitly: 0 is a valid coordinate value
  const geocoded = locations.filter(l =>
    l.latitude != null && l.longitude != null).length;

  // Confidence metrics only cover locations that have a score; counting a
  // missing score as 0 would lower the average and inflate the low buckets
  const scores = locations
    .map(l => l.geocodeConfidence)
    .filter((c): c is number => c != null);
  const avgConfidence = scores.length > 0
    ? scores.reduce((sum, c) => sum + c, 0) / scores.length
    : 0;

  const providerBreakdown = locations.reduce((acc, l) => {
    const provider = l.geocodeProvider || 'UNKNOWN';
    acc[provider] = (acc[provider] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);

  const confidenceDistribution = {
    '0-20': 0,
    '21-40': 0,
    '41-60': 0,
    '61-80': 0,
    '81-100': 0
  };

  scores.forEach(conf => {
    if (conf <= 20) confidenceDistribution['0-20']++;
    else if (conf <= 40) confidenceDistribution['21-40']++;
    else if (conf <= 60) confidenceDistribution['41-60']++;
    else if (conf <= 80) confidenceDistribution['61-80']++;
    else confidenceDistribution['81-100']++;
  });

  const lowConfidenceCount = scores.filter(c => c < 50).length;

  return {
    total,
    geocoded,
    // Guard against 0/0 when the table is empty
    geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
    avgConfidence,
    providerBreakdown,
    confidenceDistribution,
    lowConfidenceCount,
    missingCoordinates: total - geocoded,
    duplicatesCount: await this.countDuplicates()
  };
}

GET /api/locations?geocodeConfidence=lt:50

Fetch locations filtered by geocode confidence.

Authentication: Required

Query Parameters:

  • geocodeConfidence (filter): lt:X, gt:X, eq:X, null
  • geocodeProvider (filter): Provider name (GOOGLE, MAPBOX, etc.)
  • page (optional): Page number (default: 1)
  • limit (optional): Results per page (default: 50)
  • sortBy (optional): Field to sort by (default: "geocodeConfidence")
  • order (optional): "asc" or "desc" (default: "asc")

Examples:

GET /api/locations?geocodeConfidence=lt:50
GET /api/locations?geocodeConfidence=null
GET /api/locations?geocodeProvider=NOMINATIM&geocodeConfidence=lt:70
GET /api/locations?geocodeConfidence=gt:80&sortBy=address
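
One way the `geocodeConfidence` filter syntax could be parsed on the server is sketched below. The `parseConfidenceFilter` helper is hypothetical and not taken from the codebase; it assumes the result is used as a Prisma-style filter value, where `null` matches rows without a score.

```typescript
// Hypothetical helper: turns "lt:50", "gt:80", "eq:100" or "null" into a
// Prisma-style filter value for the geocodeConfidence column.
type ConfidenceFilter =
  | null
  | { lt: number }
  | { gt: number }
  | { equals: number };

function parseConfidenceFilter(raw: string): ConfidenceFilter {
  if (raw === 'null') return null; // matches rows with no confidence score

  const match = /^(lt|gt|eq):(\d{1,3})$/.exec(raw);
  if (!match) throw new Error(`Invalid geocodeConfidence filter: ${raw}`);

  const value = Number(match[2]);
  if (value > 100) throw new Error(`Confidence out of range (0-100): ${value}`);

  switch (match[1]) {
    case 'lt': return { lt: value };
    case 'gt': return { gt: value };
    default: return { equals: value };
  }
}
```

Rejecting malformed input up front keeps unexpected operators out of the query and gives the caller a clear error instead of an empty result.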

Response:

{
  "data": [
    {
      "id": 1001,
      "address": "123 Main St",
      "latitude": 43.6532,
      "longitude": -79.3832,
      "postalCode": "M5H 2N2",
      "geocodeConfidence": 45,
      "geocodeProvider": "NOMINATIM",
      "geocodedAt": "2025-02-10T10:00:00Z",
      "addresses": [...]
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 50,
    "total": 150,
    "pages": 3
  }
}

GET /api/locations/duplicates

Identify locations with identical coordinates.

Authentication: Required (SUPER_ADMIN, MAP_ADMIN)

Query Parameters:

  • threshold (optional): Distance threshold in meters (default: 1, matches exact duplicates)

Response:

{
  "duplicates": [
    {
      "coordinates": {
        "latitude": 43.6532,
        "longitude": -79.3832
      },
      "count": 3,
      "locations": [
        {
          "id": 1001,
          "address": "123 Main St",
          "postalCode": "M5H 2N2"
        },
        {
          "id": 1002,
          "address": "123 Main Street",
          "postalCode": "M5H 2N2"
        },
        {
          "id": 1003,
          "address": "123 Main St, Unit 1",
          "postalCode": "M5H 2N2"
        }
      ]
    }
  ],
  "total": 12
}

Implementation:

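// locations.service.ts
// Note: this version groups coordinates that are equal after rounding;
// thresholdMeters is accepted but not applied to the grouping below.
async findDuplicates(thresholdMeters: number = 1) {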
  const locations = await prisma.location.findMany({
    where: {
      AND: [
        { latitude: { not: null } },
        { longitude: { not: null } }
      ]
    },
    select: {
      id: true,
      address: true,
      latitude: true,
      longitude: true,
      postalCode: true
    }
  });

  const coordMap = new Map<string, typeof locations>();

  locations.forEach(loc => {
    // Round to 6 decimal places (~0.1m precision)
    const key = `${loc.latitude!.toFixed(6)},${loc.longitude!.toFixed(6)}`;
    if (!coordMap.has(key)) {
      coordMap.set(key, []);
    }
    coordMap.get(key)!.push(loc);
  });

  const duplicates = Array.from(coordMap.entries())
    .filter(([_, locs]) => locs.length > 1)
    .map(([coords, locs]) => {
      const [lat, lng] = coords.split(',').map(Number);
      return {
        coordinates: { latitude: lat, longitude: lng },
        count: locs.length,
        locations: locs
      };
    });

  return {
    duplicates,
    total: duplicates.reduce((sum, dup) => sum + dup.count, 0)
  };
}

POST /api/locations/:id/regeocode

Re-geocode a single location, optionally with a specific provider.

Authentication: Required (SUPER_ADMIN, MAP_ADMIN)

Request Body:

{
  "provider": "GOOGLE",
  "address": "123 Main St, Toronto ON M5H 2N2"
}

Parameters:

  • provider (optional): Specific provider to use (default: fallback chain)
  • address (optional): Override address string (default: use existing)

Response:

{
  "id": 1001,
  "address": "123 Main St",
  "latitude": 43.6532,
  "longitude": -79.3832,
  "geocodeConfidence": 95,
  "geocodeProvider": "GOOGLE",
  "geocodedAt": "2025-02-13T10:30:00Z"
}

POST /api/locations/bulk-geocode

Bulk re-geocode multiple locations.

Authentication: Required (SUPER_ADMIN, MAP_ADMIN)

Request Body:

{
  "locationIds": [1001, 1002, 1003],
  "provider": "GOOGLE",
  "confidenceThreshold": 50
}

Parameters:

  • locationIds (optional): Specific location IDs (default: all with confidence < threshold)
  • provider (optional): Specific provider to use (default: fallback chain)
  • confidenceThreshold (optional): Only re-geocode locations below this confidence (default: 50)

Response:

{
  "jobId": "bulk-geocode-20250213-103000",
  "status": "queued",
  "total": 150,
  "message": "Bulk geocoding job started"
}

Job Progress Endpoint:

GET /api/locations/bulk-geocode/:jobId

Job Status Response:

{
  "jobId": "bulk-geocode-20250213-103000",
  "status": "processing",
  "progress": {
    "total": 150,
    "processed": 75,
    "successful": 70,
    "failed": 5,
    "percent": 50
  },
  "startedAt": "2025-02-13T10:30:00Z",
  "estimatedCompletion": "2025-02-13T10:35:00Z"
}
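
The `percent` and `estimatedCompletion` fields can be derived from the raw counters. The sketch below shows one plausible derivation; the function names are hypothetical and the linear extrapolation is an assumption, not the documented server behavior.

```typescript
interface JobProgress {
  total: number;
  processed: number;
  successful: number;
  failed: number;
}

// Percent complete, rounded to a whole number; 0 when the job has no work.
function progressPercent(p: JobProgress): number {
  return p.total > 0 ? Math.round((p.processed / p.total) * 100) : 0;
}

// Linear estimate of the completion time: assumes the remaining locations take
// as long, on average, as the ones already processed.
// Returns null until at least one location has been processed.
function estimateCompletion(p: JobProgress, startedAt: Date, now: Date): Date | null {
  if (p.processed <= 0 || p.total <= 0) return null;
  const elapsedMs = now.getTime() - startedAt.getTime();
  const msPerLocation = elapsedMs / p.processed;
  return new Date(startedAt.getTime() + msPerLocation * p.total);
}
```

With the counters from the response above (75 of 150 processed) and a current time of 10:32:30Z, the estimate lands on 10:35:00Z, matching the sample payload.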

Configuration

Environment Variables

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| GEOCODE_CONFIDENCE_THRESHOLD | number | 50 | Minimum confidence for acceptable geocoding |
| GEOCODE_PRIMARY_PROVIDER | string | GOOGLE | Primary geocoding provider |
| GEOCODE_FALLBACK_PROVIDERS | string | MAPBOX,NOMINATIM | Comma-separated fallback providers |
| GEOCODE_CACHE_TTL | number | 2592000 | Cache TTL in seconds (30 days) |
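
A loader for these variables might look like the following. It is an illustrative sketch only: `loadGeocodeConfig`, `readNumber`, and the `GeocodeConfig` shape are hypothetical names, and the defaults are copied from the table above.

```typescript
interface GeocodeConfig {
  confidenceThreshold: number;
  primaryProvider: string;
  fallbackProviders: string[];
  cacheTtlSeconds: number;
}

// Parse a numeric variable, falling back to the default when unset or invalid.
function readNumber(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === '') return fallback;
  const parsed = Number(value);
  return isFinite(parsed) ? parsed : fallback;
}

function loadGeocodeConfig(env: Record<string, string | undefined>): GeocodeConfig {
  return {
    confidenceThreshold: readNumber(env['GEOCODE_CONFIDENCE_THRESHOLD'], 50),
    primaryProvider: (env['GEOCODE_PRIMARY_PROVIDER'] ?? 'GOOGLE').trim().toUpperCase(),
    // Split the comma-separated list and drop empty entries
    fallbackProviders: (env['GEOCODE_FALLBACK_PROVIDERS'] ?? 'MAPBOX,NOMINATIM')
      .split(',')
      .map(name => name.trim().toUpperCase())
      .filter(name => name.length > 0),
    cacheTtlSeconds: readNumber(env['GEOCODE_CACHE_TTL'], 2592000),
  };
}
```

In a Node.js service this would typically be called once at startup as `loadGeocodeConfig(process.env)`.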

Quality Thresholds

| Metric | Warning | Critical | Description |
|--------|---------|----------|-------------|
| Geocoded % | < 95% | < 90% | Percentage of locations with coordinates |
| Avg Confidence | < 70 | < 60 | Average geocode confidence score |
| Low Confidence Count | > 50 | > 100 | Locations with confidence < 50 |
| Duplicates | > 20 | > 50 | Locations with identical coordinates |
| Missing Coordinates | > 5% | > 10% | Locations without lat/lng |
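
The thresholds above can be applied to a `geocode-stats` response with a small classifier. This is a sketch under stated assumptions: `evaluateQuality` and its helper names are hypothetical, and the missing-coordinates percentage is taken to be the complement of the geocoded percentage.

```typescript
type Severity = 'ok' | 'warning' | 'critical';

interface QualityStats {
  geocodedPercent: number;
  avgConfidence: number;
  lowConfidenceCount: number;
  duplicatesCount: number;
}

// Classify a metric where lower values are worse (e.g. geocoded %).
const lowerIsWorse = (value: number, warn: number, crit: number): Severity =>
  value < crit ? 'critical' : value < warn ? 'warning' : 'ok';

// Classify a metric where higher values are worse (e.g. duplicate count).
const higherIsWorse = (value: number, warn: number, crit: number): Severity =>
  value > crit ? 'critical' : value > warn ? 'warning' : 'ok';

function evaluateQuality(stats: QualityStats): Record<string, Severity> {
  return {
    geocodedPercent: lowerIsWorse(stats.geocodedPercent, 95, 90),
    avgConfidence: lowerIsWorse(stats.avgConfidence, 70, 60),
    lowConfidenceCount: higherIsWorse(stats.lowConfidenceCount, 50, 100),
    duplicatesCount: higherIsWorse(stats.duplicatesCount, 20, 50),
    // Assumed to be the complement of the geocoded percentage
    missingCoordinatesPercent: higherIsWorse(100 - stats.geocodedPercent, 5, 10),
  };
}
```

The comparisons are strict, so a value sitting exactly on a threshold (for example, 50 low-confidence locations) is still reported as `ok`.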

Prometheus Metrics

Custom Metrics:

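// api/src/utils/metrics.ts
import { Gauge } from 'prom-client';
// locationsService is assumed to be imported from the locations module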

export const geocodingQualityGauge = new Gauge({
  name: 'cm_geocoding_avg_confidence',
  help: 'Average geocoding confidence score (0-100)',
  async collect() {
    const stats = await locationsService.getGeocodeStats();
    this.set(stats.avgConfidence);
  }
});

export const lowConfidenceLocationsGauge = new Gauge({
  name: 'cm_locations_low_confidence_count',
  help: 'Number of locations with geocode confidence < 50',
  async collect() {
    const stats = await locationsService.getGeocodeStats();
    this.set(stats.lowConfidenceCount);
  }
});

export const geocodedPercentGauge = new Gauge({
  name: 'cm_locations_geocoded_percent',
  help: 'Percentage of locations with coordinates',
  async collect() {
    const stats = await locationsService.getGeocodeStats();
    this.set(stats.geocodedPercent);
  }
});

export const duplicateLocationsGauge = new Gauge({
  name: 'cm_locations_duplicates_count',
  help: 'Number of duplicate location entries',
  async collect() {
    const duplicates = await locationsService.findDuplicates();
    this.set(duplicates.total);
  }
});

Alert Rules:

# configs/prometheus/alerts.yml

groups:
  - name: data_quality
    interval: 5m
    rules:
      - alert: LowGeocodingConfidence
        expr: cm_geocoding_avg_confidence < 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Low average geocoding confidence
          description: "Average geocoding confidence is {{ $value }}, below threshold of 60"

      - alert: HighLowConfidenceLocations
        expr: cm_locations_low_confidence_count > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High number of low-confidence locations
          description: "{{ $value }} locations have geocoding confidence < 50"

      - alert: LowGeocodedPercent
        expr: cm_locations_geocoded_percent < 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Low percentage of geocoded locations
          description: "Only {{ $value }}% of locations have coordinates"

      - alert: HighDuplicateLocations
        expr: cm_locations_duplicates_count > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: High number of duplicate locations
          description: "{{ $value }} duplicate location entries detected"

Quality Metrics

Geocoding Confidence

Calculation:

Geocoding confidence is calculated based on multiple factors:

interface GeocodeResult {
  latitude: number;
  longitude: number;
  matchType: 'exact' | 'interpolated' | 'approximate' | 'fallback';
  addressComponents: {
    streetNumber?: string;
    street?: string;
    city?: string;
    postalCode?: string;
    province?: string;
  };
  providerConfidence?: number; // Provider-specific score
}

function calculateConfidence(result: GeocodeResult, inputAddress: string): number {
  let confidence = 0;

  // Match type (0-40 points)
  switch (result.matchType) {
    case 'exact': confidence += 40; break;
    case 'interpolated': confidence += 30; break;
    case 'approximate': confidence += 20; break;
    case 'fallback': confidence += 10; break;
  }

  // Address component completeness (0-30 points)
  const components = result.addressComponents;
  if (components.streetNumber) confidence += 10;
  if (components.street) confidence += 10;
  if (components.postalCode) confidence += 10;

  // Provider-specific confidence (0-30 points)
  if (result.providerConfidence) {
    confidence += (result.providerConfidence / 100) * 30;
  }

  return Math.min(Math.round(confidence), 100);
}

Confidence Levels:

  • 81-100 (Excellent): Exact match with full address components
  • 61-80 (Good): Interpolated match with most components
  • 41-60 (Medium): Approximate match, missing some components
  • 21-40 (Low): Fallback geocoding, significant uncertainty
  • 0-20 (Very Low): Minimal match, likely incorrect

Provider Success Rates

Metrics Tracked:

interface ProviderMetrics {
  provider: GeocodeProvider;
  totalAttempts: number;
  successfulGeocodes: number;
  successRate: number; // 0-100%
  avgConfidence: number; // 0-100
  avgResponseTime: number; // milliseconds
  errorCount: number;
  lastError?: string;
}

Success Rate Calculation:

// groupBy is assumed to be lodash's groupBy
import { groupBy } from 'lodash';

const calculateProviderMetrics = async (): Promise<ProviderMetrics[]> => {
  const locations = await prisma.location.findMany({
    select: {
      geocodeProvider: true,
      geocodeConfidence: true,
      latitude: true,
      longitude: true
    }
  });

  // Rows without a provider are grouped under UNKNOWN
  const providerGroups = groupBy(locations, l => l.geocodeProvider ?? 'UNKNOWN');

  return Object.entries(providerGroups).map(([provider, locs]) => {
    const total = locs.length;
    // Compare against null explicitly: 0 is a valid coordinate value
    const successful = locs.filter(l => l.latitude != null && l.longitude != null).length;
    const avgConf = locs.reduce((sum, l) => sum + (l.geocodeConfidence ?? 0), 0) / total;

    return {
      provider: provider as GeocodeProvider,
      // Counts stored rows only; attempts that never wrote a row are not visible here
      totalAttempts: total,
      successfulGeocodes: successful,
      successRate: (successful / total) * 100,
      avgConfidence: avgConf,
      avgResponseTime: 0, // Would need separate tracking
      errorCount: total - successful
    };
  });
};

Duplicate Detection

Detection Methods:

  1. Exact Coordinate Match:
// Round to 6 decimal places (~0.1m precision)
const isDuplicateExact = (loc1: Location, loc2: Location): boolean => {
  return loc1.latitude!.toFixed(6) === loc2.latitude!.toFixed(6) &&
         loc1.longitude!.toFixed(6) === loc2.longitude!.toFixed(6);
};
  2. Proximity Threshold:
// Haversine distance check
const isDuplicateProximity = (loc1: Location, loc2: Location, thresholdM: number): boolean => {
  const distance = haversineDistance(
    [loc1.latitude!, loc1.longitude!],
    [loc2.latitude!, loc2.longitude!]
  );
  return distance < thresholdM;
};
  3. Address Similarity:
import { distance as levenshteinDistance } from 'fastest-levenshtein';

const isDuplicateAddress = (addr1: string, addr2: string): boolean => {
  const normalized1 = normalizeAddress(addr1);
  const normalized2 = normalizeAddress(addr2);
  const dist = levenshteinDistance(normalized1, normalized2);
  const similarity = 1 - (dist / Math.max(normalized1.length, normalized2.length));
  return similarity > 0.9; // 90% similar
};

const normalizeAddress = (address: string): string => {
  return address
    .toLowerCase()
    .replace(/\bstreet\b/g, 'st')
    .replace(/\bavenue\b/g, 'ave')
    .replace(/\broad\b/g, 'rd')
    .replace(/\bdrive\b/g, 'dr')
    .replace(/[^a-z0-9]/g, '');
};
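
The proximity check above calls `haversineDistance`, which is not defined in this document. A minimal sketch of the standard haversine formula follows; the `[latitude, longitude]` tuple signature is inferred from the call site, and the mean Earth radius of 6,371 km is an assumed constant.

```typescript
const EARTH_RADIUS_M = 6371000; // mean Earth radius in meters

const toRadians = (degrees: number): number => (degrees * Math.PI) / 180;

// Great-circle distance in meters between two [latitude, longitude] pairs.
function haversineDistance(a: [number, number], b: [number, number]): number {
  const [lat1, lng1] = a;
  const [lat2, lng2] = b;
  const dLat = toRadians(lat2 - lat1);
  const dLng = toRadians(lng2 - lng1);

  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLng / 2) ** 2;

  // Clamp to 1 to guard against floating-point overshoot before asin
  return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1, Math.sqrt(h)));
}
```

At the scale of duplicate detection (a few meters), the spherical approximation is more than precise enough; the error from ignoring the Earth's ellipsoidal shape is well under one percent.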

Address Validation

Validation Checks:

interface AddressValidationResult {
  isValid: boolean;
  issues: string[];
  suggestions?: string[];
}

const validateAddress = (address: string): AddressValidationResult => {
  const issues: string[] = [];

  // Check minimum length
  if (address.length < 5) {
    issues.push('Address too short');
  }

  // Check for street number
  if (!/^\d+/.test(address)) {
    issues.push('Missing street number');
  }

  // Check for street name
  if (!/\d+\s+([A-Za-z]+\s*)+/.test(address)) {
    issues.push('Missing street name');
  }

  // Check for postal code (Canadian format)
  if (!/[A-Z]\d[A-Z]\s?\d[A-Z]\d/.test(address)) {
    issues.push('Missing or invalid postal code');
  }

  // Check for unusual characters
  if (/[^A-Za-z0-9\s,.-]/.test(address)) {
    issues.push('Contains unusual characters');
  }

  return {
    isValid: issues.length === 0,
    issues
  };
};

Admin Workflow

Navigate to Data Quality Dashboard

Step 1: Access Dashboard

  1. Log in as SUPER_ADMIN or MAP_ADMIN
  2. Click Map in sidebar
  3. Click Data Quality submenu
  4. Dashboard loads with statistics

Step 2: Review Overall Statistics

Dashboard displays 4 main statistic cards:

┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐
│ Total Locations  │ Geocoded         │ Avg Confidence   │ Low Confidence   │
│ 1,500            │ 1,450 (96.7%)    │ 78.5             │ 50               │
└──────────────────┴──────────────────┴──────────────────┴──────────────────┘

Step 3: Analyze Provider Performance

Provider breakdown table shows:

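| Provider | Count | Success Rate | Avg Confidence |
|----------|-------|--------------|----------------|
| GOOGLE | 800 | 99.2% | 85.3 |
| MAPBOX | 350 | 97.1% | 82.1 |
| NOMINATIM | 200 | 94.5% | 75.8 |
| PHOTON | 100 | 91.0% | 68.2 |
| UNKNOWN | 50 | N/A | 0 |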

Step 4: Review Confidence Distribution

Bar chart displays confidence distribution:

Confidence Distribution (each █ ≈ 40 locations)
  0-20  | ▏ 15
 21-40  | █ 35
 41-60  | ████ 150
 61-80  | ███████████ 450
81-100  | ████████████████████ 800

Identify and Review Low-Confidence Locations

Step 1: Filter Low-Confidence Locations

  1. Click Low Confidence tab on dashboard
  2. Table loads with locations where confidence < 50
  3. Sort by confidence (ascending) to prioritize worst

Step 2: Review Location Details

Click row to open detail drawer:

┌─────────────────────────────────────────┐
│ Location Details                        │
├─────────────────────────────────────────┤
│ Address: 123 Main St                    │
│ Postal Code: M5H 2N2                    │
│ Coordinates: 43.6532, -79.3832          │
│                                         │
│ Geocoding Info:                         │
│   Confidence: 45 (Low)                  │
│   Provider: NOMINATIM                   │
│   Geocoded: Feb 10, 2025 10:00 AM      │
│                                         │
│ Issues:                                 │
│   • Missing street number in response   │
│   • Approximate match only              │
│                                         │
│ [Re-geocode] [Edit Address] [View Map] │
└─────────────────────────────────────────┘

Step 3: Take Action

Options for remediation:

  1. Re-geocode with different provider:

    • Click Re-geocode button
    • Select provider (GOOGLE recommended for low confidence)
    • Click Geocode Now
    • New confidence displayed
  2. Edit address:

    • Click Edit Address
    • Correct typos or formatting issues
    • Save changes
    • Auto-triggers re-geocoding
  3. View on map:

    • Click View Map
    • Verify location accuracy visually
    • Drag marker to correct position if needed

Bulk Re-geocoding

Step 1: Select Locations

  1. In Low Confidence tab, use table checkboxes to select locations
  2. Or click Select All to select all visible
  3. Selected count displays: "50 selected"

Step 2: Choose Provider

  1. Click Bulk Re-geocode button
  2. Modal opens with provider selection:
    ┌─────────────────────────────────────┐
    │ Bulk Re-geocode                     │
    ├─────────────────────────────────────┤
    │ Re-geocode 50 locations             │
    │                                     │
    │ Provider: [GOOGLE ▼]                │
    │                                     │
    │ Options:                            │
    │ ☑ Only if confidence < 50           │
    │ ☑ Cache results                     │
    │ ☐ Overwrite existing coordinates    │
    │                                     │
    │ Estimated time: ~2 minutes          │
    │                                     │
    │ [Cancel] [Start Re-geocoding]       │
    └─────────────────────────────────────┘
    

Step 3: Monitor Progress

  1. Job starts, progress bar appears:

    Re-geocoding in progress... 25/50 (50%)
    [████████████░░░░░░░░░░░░] 50%
    
  2. Real-time updates:

    • Total processed
    • Successful geocodes
    • Failed geocodes
    • Average new confidence

Step 4: Review Results

Job completion summary:

┌─────────────────────────────────────┐
│ Bulk Re-geocode Complete            │
├─────────────────────────────────────┤
│ Processed: 50                       │
│ Successful: 47 (94%)                │
│ Failed: 3 (6%)                      │
│                                     │
│ Quality Improvement:                │
│   Avg Confidence Before: 42.5       │
│   Avg Confidence After: 81.3        │
│   Improvement: +38.8                │
│                                     │
│ [View Failed] [Close]               │
└─────────────────────────────────────┘

Handle Duplicates

Step 1: View Duplicates Tab

  1. Click Duplicates tab on dashboard
  2. Table groups locations by coordinates

Step 2: Review Duplicate Groups

Table displays:

| Coordinates | Count | Addresses | Action |
|-------------|-------|-----------|--------|
| 43.6532, -79.3832 | 3 | 123 Main St, 123 Main Street, 123 Main St Unit 1 | [Review] |
| 43.6540, -79.3825 | 2 | 456 Bay St, 456 Bay Street | [Review] |

Step 3: Resolve Duplicates

Click Review to open resolution modal:

┌─────────────────────────────────────┐
│ Resolve Duplicates                  │
├─────────────────────────────────────┤
│ 3 locations at 43.6532, -79.3832    │
│                                     │
│ ○ Merge into single location        │
│   Primary: 123 Main St              │
│   Merge units from duplicates       │
│                                     │
│ ○ Keep as separate multi-unit       │
│   Mark as validated multi-unit      │
│                                     │
│ ○ Re-geocode individually           │
│   Try to get unique coordinates     │
│                                     │
│ [Cancel] [Resolve]                  │
└─────────────────────────────────────┘

Resolution Options:

  1. Merge: Combine into single Location with multiple Address records
  2. Multi-unit: Mark as legitimate multi-unit building
  3. Re-geocode: Attempt to get unique coordinates for each
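
The Merge option can be planned in memory before anything is written. The sketch below is illustrative only: `planMerge` and the record shapes are hypothetical. Because the Address relation uses `onDelete: Cascade`, the plan moves Address records to the primary Location first; deleting a duplicate Location before that would delete its addresses too.

```typescript
interface AddressRecord {
  id: number;
  locationId: number;
}

interface LocationRecord {
  id: number;
  address: string;
  addresses: AddressRecord[];
}

interface MergePlan {
  primaryId: number;
  // Address ids to re-point at the primary location (step 1)
  addressIdsToMove: number[];
  // Location ids that are empty after the move and can be deleted (step 2)
  locationIdsToDelete: number[];
}

// Hypothetical helper: compute what a merge would change, without touching the database.
function planMerge(primary: LocationRecord, group: LocationRecord[]): MergePlan {
  // The primary may appear in the duplicate group; never schedule it for deletion
  const others = group.filter(loc => loc.id !== primary.id);
  const addressIdsToMove = others.reduce<number[]>(
    (ids, loc) => ids.concat(loc.addresses.map(addr => addr.id)),
    []
  );
  return {
    primaryId: primary.id,
    addressIdsToMove,
    locationIdsToDelete: others.map(loc => loc.id),
  };
}
```

Applying the plan would then be two writes in one transaction: update `locationId` on the moved addresses, then delete the emptied locations.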

Quality Improvement Strategies

Multi-Provider Geocoding

Fallback Chain:

// geocoding.service.ts

const PROVIDER_CHAIN: GeocodeProvider[] = [
  'GOOGLE',    // Primary: Best accuracy, paid
  'MAPBOX',    // Fallback 1: Good accuracy, paid
  'NOMINATIM', // Fallback 2: Free, decent accuracy
  'PHOTON',    // Fallback 3: Free, lower accuracy
  'ARCGIS'     // Fallback 4: Free, basic accuracy
];

async geocode(address: string): Promise<GeocodeResult | null> {
  for (const provider of PROVIDER_CHAIN) {
    try {
      const result = await this.geocodeWithProvider(address, provider);
      if (result && result.confidence >= 50) {
        return result; // Success, confidence acceptable
      }
      // Empty or low-confidence result: fall through to the next provider
    } catch (error) {
      logger.warn(`Geocoding failed with ${provider}:`, error);
      // Try next provider
    }
  }
  // No provider returned a result with confidence >= 50;
  // lower-confidence results are discarded rather than returned
  return null;
}

Benefits:

  • Increases success rate (90% → 96%+)
  • Reduces dependency on single provider
  • Cost optimization (use free providers as fallback)
  • Provider outage resilience

Address Normalization

Pre-Geocoding Normalization:

const normalizeAddressForGeocoding = (address: string): string => {
  let normalized = address;

  // Remove extra whitespace
  normalized = normalized.replace(/\s+/g, ' ').trim();

  // Standardize abbreviations
  const replacements: Record<string, string> = {
    'Street': 'St',
    'Avenue': 'Ave',
    'Road': 'Rd',
    'Drive': 'Dr',
    'Boulevard': 'Blvd',
    'Apartment': 'Apt',
    'Suite': 'Ste'
  };

  Object.entries(replacements).forEach(([long, short]) => {
    const regex = new RegExp(`\\b${long}\\b`, 'gi');
    normalized = normalized.replace(regex, short);
  });

  // Ensure postal code spacing (Canadian format)
  normalized = normalized.replace(/([A-Z]\d[A-Z])(\d[A-Z]\d)/, '$1 $2');

  // Remove periods from abbreviations
  normalized = normalized.replace(/\./g, '');

  return normalized;
};

Improvements:

  • Reduces geocoding errors by 10-15%
  • Increases confidence scores
  • Better cache hit rate

Geocoding Cache

Redis Cache Implementation:

// geocoding.service.ts

private async geocodeWithCache(address: string): Promise<GeocodeResult | null> {
  const cacheKey = `geocode:${normalizeAddress(address)}`;

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) {
    logger.debug('Geocoding cache hit:', address);
    return JSON.parse(cached);
  }

  // Cache miss, geocode
  const result = await this.geocode(address);
  if (result) {
    // Cache for 30 days
    await redis.setex(cacheKey, 2592000, JSON.stringify(result));
  }

  return result;
}

Benefits:

  • Reduces API costs (90% cache hit rate)
  • Faster response times (Redis: <5ms vs API: 200-500ms)
  • Consistent results for same address
  • Provider API rate limit avoidance

Manual Verification

Critical Location Verification:

Manually verify high-priority locations:

  1. Campaign offices: Ensure exact coordinates
  2. Shift start points: Verify accessibility
  3. Event venues: Confirm entrance location
  4. Polling stations: Critical for voter info

Verification Process:

// Mark location as manually verified
await prisma.location.update({
  where: { id: locationId },
  data: {
    geocodeConfidence: 100,
    geocodeProvider: 'MANUAL', // not in the GeocodeProvider enum above; the column is a plain String
    geocodedAt: new Date()
  }
});

Regular Audits

Monthly Quality Audit Checklist:

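  1. Run quality report (the endpoint requires SUPER_ADMIN or MAP_ADMIN credentials; a bearer token in $TOKEN is assumed here):

    curl -H "Authorization: Bearer $TOKEN" http://localhost:4000/api/locations/geocode-stats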
    
  2. Check metrics against thresholds:

    • Geocoded % > 95%
    • Avg confidence > 70
    • Low confidence count < 50
    • Duplicates < 20
  3. Review low-confidence locations:

    • Filter locations with confidence < 50
    • Review top 20 by address
    • Identify patterns (specific streets, providers)
  4. Bulk re-geocode low confidence:

    • Use GOOGLE provider for accuracy
    • Monitor improvement in avg confidence
  5. Resolve duplicates:

    • Review all duplicate groups
    • Merge or mark as multi-unit
    • Update addresses as needed
  6. Export quality report:

    import fs from 'node:fs';

    const date = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
    const report = await generateQualityReport();
    fs.writeFileSync(`quality-report-${date}.json`, JSON.stringify(report, null, 2));
    

Code Examples

DataQualityDashboardPage.tsx

import React, { useEffect, useState } from 'react';
import { Card, Row, Col, Statistic, Table, Tabs, Button, message } from 'antd';
import { WarningOutlined, CheckCircleOutlined } from '@ant-design/icons';
import { api } from '@/lib/api';
import { Bar } from 'react-chartjs-2';

interface GeocodeStats {
  total: number;
  geocoded: number;
  geocodedPercent: number;
  avgConfidence: number;
  providerBreakdown: Record<string, number>;
  confidenceDistribution: Record<string, number>;
  lowConfidenceCount: number;
  missingCoordinates: number;
  duplicatesCount: number;
}

const DataQualityDashboardPage: React.FC = () => {
  const [stats, setStats] = useState<GeocodeStats | null>(null);
  const [lowConfLocations, setLowConfLocations] = useState<any[]>([]);
  const [duplicates, setDuplicates] = useState<any[]>([]);
  const [loading, setLoading] = useState(false);

  useEffect(() => {
    fetchStats();
    fetchLowConfidenceLocations();
    fetchDuplicates();
  }, []);

  const fetchStats = async () => {
    setLoading(true);
    try {
      const { data } = await api.get<GeocodeStats>('/locations/geocode-stats');
      setStats(data);
    } catch (error) {
      message.error('Failed to load statistics');
    } finally {
      setLoading(false);
    }
  };

  const fetchLowConfidenceLocations = async () => {
    try {
      const { data } = await api.get('/locations?geocodeConfidence=lt:50&limit=100');
      setLowConfLocations(data.data);
    } catch (error) {
      message.error('Failed to load low-confidence locations');
    }
  };

  const fetchDuplicates = async () => {
    try {
      const { data } = await api.get('/locations/duplicates');
      setDuplicates(data.duplicates);
    } catch (error) {
      message.error('Failed to load duplicates');
    }
  };

  const handleRegeocodeLocation = async (locationId: number) => {
    try {
      await api.post(`/locations/${locationId}/regeocode`, { provider: 'GOOGLE' });
      message.success('Location re-geocoded successfully');
      fetchStats();
      fetchLowConfidenceLocations();
    } catch (error) {
      message.error('Failed to re-geocode location');
    }
  };

  const confidenceChartData = stats ? {
    labels: Object.keys(stats.confidenceDistribution),
    datasets: [{
      label: 'Locations',
      data: Object.values(stats.confidenceDistribution),
      backgroundColor: [
        '#e74c3c', // 0-20: Red
        '#f39c12', // 21-40: Orange
        '#f1c40f', // 41-60: Yellow
        '#3498db', // 61-80: Blue
        '#27ae60'  // 81-100: Green
      ]
    }]
  } : null;

  const lowConfColumns = [
    { title: 'Address', dataIndex: 'address', key: 'address' },
    { title: 'Confidence', dataIndex: 'geocodeConfidence', key: 'confidence', render: (val: number) => (
      <span style={{ color: val < 30 ? '#e74c3c' : '#f39c12' }}>{val}</span>
    )},
    { title: 'Provider', dataIndex: 'geocodeProvider', key: 'provider' },
    { title: 'Action', key: 'action', render: (_: any, record: any) => (
      <Button size="small" onClick={() => handleRegeocodeLocation(record.id)}>
        Re-geocode
      </Button>
    )}
  ];

  return (
    <div>
      <h1>Data Quality Dashboard</h1>

      {/* Statistics Cards */}
      <Row gutter={16} style={{ marginBottom: 24 }}>
        <Col span={6}>
          <Card>
            <Statistic
              title="Total Locations"
              value={stats?.total || 0}
              prefix={<CheckCircleOutlined />}
            />
          </Card>
        </Col>
        <Col span={6}>
          <Card>
            <Statistic
              title="Geocoded"
              value={stats?.geocoded || 0}
              suffix={`(${stats?.geocodedPercent.toFixed(1) || 0}%)`}
              valueStyle={{ color: (stats?.geocodedPercent || 0) > 95 ? '#27ae60' : '#f39c12' }}
            />
          </Card>
        </Col>
        <Col span={6}>
          <Card>
            <Statistic
              title="Avg Confidence"
              value={stats?.avgConfidence.toFixed(1) || 0}
              valueStyle={{ color: (stats?.avgConfidence || 0) > 70 ? '#27ae60' : '#f39c12' }}
            />
          </Card>
        </Col>
        <Col span={6}>
          <Card>
            <Statistic
              title="Low Confidence"
              value={stats?.lowConfidenceCount || 0}
              prefix={<WarningOutlined />}
              valueStyle={{ color: (stats?.lowConfidenceCount || 0) > 50 ? '#e74c3c' : '#f39c12' }}
            />
          </Card>
        </Col>
      </Row>

      {/* Charts and Tables */}
      <Tabs
        items={[
          {
            key: 'overview',
            label: 'Overview',
            children: (
              <div>
                <Card title="Confidence Distribution" style={{ marginBottom: 24 }}>
                  {confidenceChartData && <Bar data={confidenceChartData} />}
                </Card>
                <Card title="Provider Performance">
                  <Table
                    dataSource={stats ? Object.entries(stats.providerBreakdown).map(([provider, count]) => ({
                      provider,
                      count
                    })) : []}
                    columns={[
                      { title: 'Provider', dataIndex: 'provider', key: 'provider' },
                      { title: 'Count', dataIndex: 'count', key: 'count' }
                    ]}
                    pagination={false}
                  />
                </Card>
              </div>
            )
          },
          {
            key: 'low-confidence',
            label: `Low Confidence (${lowConfLocations.length})`,
            children: (
              <Table
                dataSource={lowConfLocations}
                columns={lowConfColumns}
                rowKey="id"
                loading={loading}
              />
            )
          },
          {
            key: 'duplicates',
            label: `Duplicates (${duplicates.length})`,
            children: (
              <Table
                dataSource={duplicates}
                columns={[
                  { title: 'Coordinates', key: 'coords', render: (_, record: any) =>
                    `${record.coordinates.latitude.toFixed(6)}, ${record.coordinates.longitude.toFixed(6)}`
                  },
                  { title: 'Count', dataIndex: 'count', key: 'count' },
                  { title: 'Addresses', key: 'addresses', render: (_, record: any) =>
                    record.locations.map((l: any) => l.address).join(', ')
                  }
                ]}
                rowKey={(record) => `${record.coordinates.latitude}-${record.coordinates.longitude}`}
              />
            )
          }
        ]}
      />
    </div>
  );
};

export default DataQualityDashboardPage;

Geocode Statistics Service

// locations.service.ts

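import { prisma } from '@/config/database';
// Assumed export name and alias path; regeocode() below depends on this instance.
import { geocodingService } from '@/modules/map/geocoding/geocoding.service';
import type { GeocodeProvider } from '@prisma/client';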

export class LocationsService {
  async getGeocodeStats() {
    const locations = await prisma.location.findMany({
      select: {
        id: true,
        latitude: true,
        longitude: true,
        geocodeConfidence: true,
        geocodeProvider: true
      }
    });

    const total = locations.length;
    // Compare against null explicitly so a coordinate of 0 still counts as geocoded
    const geocoded = locations.filter(l => l.latitude != null && l.longitude != null).length;

    const sumConfidence = locations.reduce((sum, l) => sum + (l.geocodeConfidence || 0), 0);
    const avgConfidence = total > 0 ? sumConfidence / total : 0;

    // Provider breakdown
    const providerBreakdown: Record<string, number> = {};
    locations.forEach(l => {
      const provider = l.geocodeProvider || 'UNKNOWN';
      providerBreakdown[provider] = (providerBreakdown[provider] || 0) + 1;
    });

    // Confidence distribution
    const confidenceDistribution = {
      '0-20': 0,
      '21-40': 0,
      '41-60': 0,
      '61-80': 0,
      '81-100': 0
    };

    locations.forEach(l => {
      const conf = l.geocodeConfidence || 0;
      if (conf <= 20) confidenceDistribution['0-20']++;
      else if (conf <= 40) confidenceDistribution['21-40']++;
      else if (conf <= 60) confidenceDistribution['41-60']++;
      else if (conf <= 80) confidenceDistribution['61-80']++;
      else confidenceDistribution['81-100']++;
    });

    const lowConfidenceCount = locations.filter(l => (l.geocodeConfidence || 0) < 50).length;
    const duplicatesCount = await this.countDuplicates();

    return {
      total,
      geocoded,
      geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
      avgConfidence,
      providerBreakdown,
      confidenceDistribution,
      lowConfidenceCount,
      missingCoordinates: total - geocoded,
      duplicatesCount
    };
  }

  async countDuplicates(): Promise<number> {
    const locations = await prisma.location.findMany({
      where: {
        AND: [
          { latitude: { not: null } },
          { longitude: { not: null } }
        ]
      },
      select: { latitude: true, longitude: true }
    });

    const coordMap = new Map<string, number>();
    locations.forEach(l => {
      const key = `${l.latitude!.toFixed(6)},${l.longitude!.toFixed(6)}`;
      coordMap.set(key, (coordMap.get(key) || 0) + 1);
    });

    return Array.from(coordMap.values()).filter(count => count > 1).reduce((sum, count) => sum + count, 0);
  }

  async regeocode(locationId: number, provider?: GeocodeProvider) {
    const location = await prisma.location.findUnique({
      where: { id: locationId }
    });

    if (!location) {
      throw new Error('Location not found');
    }

    const result = await geocodingService.geocode(location.address, provider);

    if (!result) {
      throw new Error('Geocoding failed');
    }

    return await prisma.location.update({
      where: { id: locationId },
      data: {
        latitude: result.latitude,
        longitude: result.longitude,
        geocodeConfidence: result.confidence,
        geocodeProvider: result.provider,
        geocodedAt: new Date()
      }
    });
  }
}
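
The service above only counts duplicates, while the dashboard's Duplicates tab reads groups with `coordinates`, `count`, and `locations`. A minimal sketch of how such groups could be built is shown below, assuming the same 6-decimal coordinate key as `countDuplicates`. The `GeoLocation` and `DuplicateGroup` types and the `groupDuplicates` name are illustrative, inferred from the fields the component reads, not taken from the actual route handler.

```typescript
interface GeoLocation {
  id: number;
  address: string;
  latitude: number | null;
  longitude: number | null;
}

interface DuplicateGroup {
  coordinates: { latitude: number; longitude: number };
  count: number;
  locations: GeoLocation[];
}

// Groups locations whose coordinates match after rounding to `decimals` places.
function groupDuplicates(locations: GeoLocation[], decimals = 6): DuplicateGroup[] {
  const groups = new Map<string, DuplicateGroup>();

  for (const loc of locations) {
    // Skip locations that have not been geocoded yet.
    if (loc.latitude == null || loc.longitude == null) continue;

    const key = `${loc.latitude.toFixed(decimals)},${loc.longitude.toFixed(decimals)}`;
    const group = groups.get(key);
    if (group) {
      group.count += 1;
      group.locations.push(loc);
    } else {
      groups.set(key, {
        coordinates: { latitude: loc.latitude, longitude: loc.longitude },
        count: 1,
        locations: [loc]
      });
    }
  }

  // Only coordinates shared by two or more locations are duplicates.
  return Array.from(groups.values()).filter(g => g.count > 1);
}
```

Passing a smaller `decimals` value widens the match radius, which is useful when checking whether near-identical points are really the same building.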

Troubleshooting

Problem: Many low-confidence locations

Symptoms:

  • More than 100 locations with confidence < 50
  • Avg confidence < 60
  • Prometheus alert firing

Solutions:

  1. Check provider API keys:
# Test Google Geocoding API
curl "https://maps.googleapis.com/maps/api/geocode/json?address=123+Main+St+Toronto&key=YOUR_KEY"

# Verify the key is present in .env
grep GEOCODE_GOOGLE_API_KEY .env
  2. Try a different primary provider:
# In .env, change primary provider
GEOCODE_PRIMARY_PROVIDER=GOOGLE  # Most accurate
# Or try:
GEOCODE_PRIMARY_PROVIDER=MAPBOX  # Good alternative
  3. Verify address format:
// Bad: Missing city/postal
"123 Main St"

// Good: Full address
"123 Main St, Toronto ON M5H 2N2"
  4. Use postal code for better accuracy:
// Append postal code if available
const fullAddress = location.postalCode
  ? `${location.address}, ${location.postalCode}`
  : location.address;
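  5. Bulk re-geocode with Google:
# Via API (the JSON body needs an explicit Content-Type; curl defaults to form encoding)
curl -X POST http://localhost:4000/api/locations/bulk-geocode \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"provider":"GOOGLE","confidenceThreshold":50}'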
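
The address-format and postal-code steps above can be combined into one helper that appends the postal code only when the address does not already contain one. This is a sketch that assumes Canadian postal codes (as in the `M5H 2N2` example); `AddressRecord`, `hasPostalCode`, and `buildGeocodeQuery` are hypothetical names.

```typescript
// Canadian postal code such as "M5H 2N2"; the separator is optional.
const POSTAL_CODE = /\b[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d\b/;

interface AddressRecord {
  address: string;
  postalCode?: string | null;
}

// True when the address string already carries a postal code.
function hasPostalCode(address: string): boolean {
  return POSTAL_CODE.test(address);
}

// Builds the string sent to the geocoder, appending the postal code if it is missing.
function buildGeocodeQuery(location: AddressRecord): string {
  const address = location.address.trim();
  if (hasPostalCode(address) || !location.postalCode) return address;
  return `${address}, ${location.postalCode.trim()}`;
}
```

Addresses for which `hasPostalCode` is false and no `postalCode` is stored are good candidates for manual review before re-geocoding.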

Problem: Duplicate locations detected

Symptoms:

  • Multiple locations at same coordinates
  • Duplicates tab shows many groups
  • Inflated location counts in cuts

Solutions:

  1. Check if legitimately multi-unit:
-- Find buildings with multiple addresses
SELECT l.id, l.address, COUNT(a.id) as unit_count
FROM "Location" l
JOIN "Address" a ON a."locationId" = l.id
GROUP BY l.id
HAVING COUNT(a.id) > 1;
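  2. Verify geocoding precision:
// Check whether two locations only collide once coordinates are rounded
type Coords = { latitude: number; longitude: number };
const isDuplicateRounding = (loc1: Coords, loc2: Coords) => {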
  // Use 4 decimal places (~11m precision) instead of 6 (~0.1m)
  return loc1.latitude.toFixed(4) === loc2.latitude.toFixed(4) &&
         loc1.longitude.toFixed(4) === loc2.longitude.toFixed(4);
};
  3. Review NAR import process:
// upsert needs a unique constraint on locGuid (the NAR LOC_GUID) to avoid creating duplicates
const location = await prisma.location.upsert({
  where: { locGuid: narRecord.LOC_GUID },
  update: { /* update fields */ },
  create: { /* create fields */ }
});
  4. Merge duplicates:
// Merge duplicates into the primary location in one transaction,
// so a failed delete cannot leave addresses half-moved
const mergeDuplicates = async (primaryId: number, duplicateIds: number[]) => {
  await prisma.$transaction([
    // Move addresses to primary location
    prisma.address.updateMany({
      where: { locationId: { in: duplicateIds } },
      data: { locationId: primaryId }
    }),
    // Delete duplicates
    prisma.location.deleteMany({
      where: { id: { in: duplicateIds } }
    })
  ]);
};
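
Before merging, one location in each duplicate group has to be chosen as the primary. The document does not say how this is decided; the sketch below assumes the location with the highest geocode confidence wins, with the lowest id as a tie-breaker. `Candidate` and `pickPrimary` are hypothetical names.

```typescript
interface Candidate {
  id: number;
  geocodeConfidence: number | null;
}

// Picks the highest-confidence location as primary; ties go to the lowest id.
function pickPrimary(group: Candidate[]): { primaryId: number; duplicateIds: number[] } {
  if (group.length < 2) {
    throw new Error('A duplicate group needs at least two locations');
  }

  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...group].sort(
    (a, b) => (b.geocodeConfidence ?? 0) - (a.geocodeConfidence ?? 0) || a.id - b.id
  );

  const [primary, ...rest] = sorted;
  return { primaryId: primary.id, duplicateIds: rest.map(l => l.id) };
}
```

The result has the same shape as the arguments of `mergeDuplicates(primaryId, duplicateIds)`, so the two can be chained once a group has been reviewed.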

Problem: Geocoding stats slow to load

Symptoms:

  • GET /api/locations/geocode-stats takes > 5 seconds
  • Dashboard timeout errors
  • High database CPU

Solutions:

  1. Add database indexes:
-- Quote camelCase columns: unquoted identifiers are folded to lowercase in PostgreSQL
CREATE INDEX CONCURRENTLY idx_locations_geocode_confidence
  ON "Location"("geocodeConfidence");

CREATE INDEX CONCURRENTLY idx_locations_geocode_provider
  ON "Location"("geocodeProvider");

CREATE INDEX CONCURRENTLY idx_locations_coords
  ON "Location"(latitude, longitude)
  WHERE latitude IS NOT NULL AND longitude IS NOT NULL;
  2. Cache stats in Redis:
// Cache for 5 minutes
const getCachedStats = async () => {
  const cached = await redis.get('geocode:stats');
  if (cached) return JSON.parse(cached);

  const stats = await locationsService.getGeocodeStats();
  await redis.setex('geocode:stats', 300, JSON.stringify(stats));
  return stats;
};
  3. Aggregate in SQL instead of in application code:
// Raw SQL for better performance; returns one row per provider,
// and PostgreSQL COUNT values arrive as BigInt
const stats = await prisma.$queryRaw`
  SELECT
    COUNT(*) as total,
    COUNT(latitude) as geocoded,
    AVG(COALESCE("geocodeConfidence", 0)) as avg_confidence,
    "geocodeProvider",
    COUNT(*) FILTER (WHERE COALESCE("geocodeConfidence", 0) < 50) as low_confidence
  FROM "Location"
  GROUP BY "geocodeProvider"
`;
  4. Materialize stats view:
-- Create materialized view
CREATE MATERIALIZED VIEW geocode_stats_mv AS
SELECT
  COUNT(*) as total,
  COUNT(*) FILTER (WHERE latitude IS NOT NULL AND longitude IS NOT NULL) as geocoded,
  AVG(COALESCE("geocodeConfidence", 0)) as avg_confidence,
  COUNT(*) FILTER (WHERE COALESCE("geocodeConfidence", 0) < 50) as low_confidence
FROM "Location";

-- Run on a schedule (for example hourly via cron) to keep the view fresh
REFRESH MATERIALIZED VIEW geocode_stats_mv;
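
A query grouped by provider returns one row per provider, so the rows still have to be folded into the single totals object that the dashboard expects. The sketch below assumes the column aliases used in the raw SQL above and converts BigInt counts with `Number()`. `ProviderRow` and `combineProviderRows` are hypothetical names, and the overall average is computed as a count-weighted mean of the per-provider averages.

```typescript
// One row of a GROUP BY "geocodeProvider" result; counts may arrive as BigInt.
interface ProviderRow {
  geocodeProvider: string | null;
  total: bigint | number;
  geocoded: bigint | number;
  avg_confidence: number | null;
  low_confidence: bigint | number;
}

function combineProviderRows(rows: ProviderRow[]) {
  let total = 0;
  let geocoded = 0;
  let lowConfidenceCount = 0;
  let weightedConfidence = 0;
  const providerBreakdown: Record<string, number> = {};

  for (const row of rows) {
    const count = Number(row.total);
    total += count;
    geocoded += Number(row.geocoded);
    lowConfidenceCount += Number(row.low_confidence);
    // Weight each provider's average by its row count.
    weightedConfidence += Number(row.avg_confidence ?? 0) * count;
    providerBreakdown[row.geocodeProvider ?? 'UNKNOWN'] = count;
  }

  return {
    total,
    geocoded,
    geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
    avgConfidence: total > 0 ? weightedConfidence / total : 0,
    providerBreakdown,
    lowConfidenceCount,
    missingCoordinates: total - geocoded
  };
}
```

The confidence distribution and duplicate count are not covered by this query and would need their own aggregates.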

Performance Considerations

Database Query Optimization

Indexes:

  • geocodeConfidence (filtering)
  • geocodeProvider (grouping)
  • (latitude, longitude) composite (duplicate detection)
  • Partial index on non-null coordinates

Query Performance:

  • geocode-stats: ~500ms (1500 locations)
  • Low confidence filter: ~100ms (with index)
  • Duplicate detection: ~200ms (coordinate grouping)
  • Bulk re-geocode: ~2-5 min (150 locations, depends on provider)

API Rate Limits

Provider Limits:

  • Google: 50 QPS, $5/1000 requests
  • Mapbox: 100,000/month free, then $0.50/1000
  • Nominatim: 1 QPS (public), no commercial use
  • Photon: No official limit, self-hosted recommended
  • ArcGIS: 100,000/month free

Optimization:

  • Use Redis cache (30-day TTL)
  • Batch geocoding jobs (avoid rate limits)
  • Fall back to free providers for non-critical addresses
  • Monitor usage via provider dashboards
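
One way to batch geocoding jobs under a provider's QPS limit is to send at most N requests, wait a second, then send the next N. The following is a minimal sketch of that idea, not the project's job queue; `chunk`, `geocodeInBatches`, and the injectable `wait` parameter are hypothetical, and `requestsPerSecond` is assumed to be a whole number of at least 1.

```typescript
// Splits items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Runs `geocodeOne` over all items, at most `requestsPerSecond` per batch,
// pausing one second between batches.
async function geocodeInBatches<T, R>(
  items: T[],
  geocodeOne: (item: T) => Promise<R>,
  requestsPerSecond: number,
  wait: (ms: number) => Promise<void> = sleep
): Promise<R[]> {
  const results: R[] = [];
  const batches = chunk(items, requestsPerSecond);

  for (let i = 0; i < batches.length; i++) {
    const batchResults = await Promise.all(batches[i].map(item => geocodeOne(item)));
    results.push(...batchResults);
    // No pause is needed after the final batch.
    if (i < batches.length - 1) await wait(1000);
  }
  return results;
}
```

With the public Nominatim limit of 1 QPS this degrades to one request per second; a failed request rejects the whole run, so production code would likely catch errors per item.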

Caching Strategy

Cache Layers:

  1. Application Cache (Redis):
// 30-day TTL for geocode results
const cacheKey = `geocode:${normalizeAddress(address)}`;
await redis.setex(cacheKey, 2592000, JSON.stringify(result));
  2. Statistics Cache:
// 5-minute TTL for stats
await redis.setex('geocode:stats', 300, JSON.stringify(stats));
  3. Provider Response Cache:
// Cache raw provider responses separately (7-day TTL), keyed by the normalized address
await redis.setex(`provider:${provider}:${normalizeAddress(address)}`, 604800, JSON.stringify(rawResponse));

Cache Hit Rates:

  • Geocoding: 90%+ (repeated addresses)
  • Statistics: 95%+ (frequent dashboard views)
  • Provider responses: 85%+ (re-geocoding attempts)
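
The hit rates above depend on equivalent addresses mapping to the same cache key. The geocode cache snippet calls `normalizeAddress`, whose implementation is not shown in this document; the sketch below is one plausible version, assuming that case, punctuation, and extra whitespace are the only differences worth collapsing. `geocodeCacheKey` is a hypothetical wrapper.

```typescript
// Lowercases, strips common punctuation, and collapses whitespace so that
// "123 Main St., Toronto" and "123 main st toronto" share one cache entry.
function normalizeAddress(address: string): string {
  return address
    .toLowerCase()
    .replace(/[.,#]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Cache key in the same `geocode:` namespace used for geocode results.
function geocodeCacheKey(address: string): string {
  return `geocode:${normalizeAddress(address)}`;
}
```

This deliberately leaves abbreviations such as "St" and "Street" distinct; expanding them would raise the hit rate further but risks merging addresses that are genuinely different.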

Backend Documentation

  • Locations Service: api/src/modules/map/locations/locations.service.ts

    • Geocode stats aggregation
    • Duplicate detection
    • Re-geocoding operations
  • Geocoding Service: api/src/modules/map/geocoding/geocoding.service.ts

    • Multi-provider fallback
    • Confidence calculation
    • Cache integration
  • Bulk Geocoding: api/src/modules/map/locations/bulk-geocode.routes.ts

    • Job queue integration
    • Progress tracking
    • Error handling

Frontend Documentation

  • Data Quality Dashboard: admin/src/pages/DataQualityDashboardPage.tsx

    • Statistics display
    • Charts and tables
    • Bulk actions
  • Locations Page: admin/src/pages/LocationsPage.tsx

    • CSV import/export
    • Inline geocoding
    • Address editing

Database Documentation

  • Location Model: api/prisma/schema.prisma
    • Geocoding metadata fields
    • Indexes for performance
    • Relations to Address

Monitoring Documentation

  • Prometheus Metrics: api/src/utils/metrics.ts

    • Custom geocoding metrics
    • Quality gauges
    • Alert integration
  • Grafana Dashboard: configs/grafana/dashboards/data-quality.json

    • Quality trend charts
    • Provider comparison
    • Alert visualization

External Resources