# Data Quality Dashboard

## Overview

The Data Quality Dashboard provides comprehensive monitoring and management of geocoding accuracy and location data integrity. This feature enables campaign administrators to identify and resolve data quality issues, track geocoding provider performance, and ensure reliable map data for canvassing operations.

**Key Features:**

- Real-time geocoding quality metrics
- Provider success rate tracking
- Low-confidence location detection
- Duplicate location identification
- Bulk re-geocoding operations
- Address validation reporting
- Interactive quality charts
- Export quality reports

**Use Cases:**

- Monthly data quality audits
- NAR import validation
- Geocoding provider evaluation
- Pre-canvass data verification
- Address database cleanup
- Campaign planning accuracy checks

**Architecture Highlights:**

- Aggregate statistics via database queries
- Confidence threshold filtering (0-100 scale)
- Provider performance comparison
- Duplicate detection via coordinate matching
- Manual review workflows
- Prometheus metrics integration

## Architecture
|
|
|
|
```mermaid
flowchart TB
    subgraph Admin Interface
        Admin[Admin User]
        Dashboard[DataQualityDashboardPage]
        LocationsPage[LocationsPage]
    end

    subgraph API Layer
        StatsAPI["/api/locations/geocode-stats"]
        LocationsAPI["/api/locations"]
        DuplicatesAPI["/api/locations/duplicates"]
        RegeocodeAPI["/api/locations/:id/regeocode"]
        BulkGeocodeAPI["/api/locations/bulk-geocode"]
    end

    subgraph Database
        LocationsDB[(Locations)]
        Indexes[(Indexes)]
    end

    subgraph Geocoding Service
        GeocodingService[GeocodingService]
        Providers[6 Providers]
        Cache[Redis Cache]
    end

    subgraph Monitoring
        Prometheus[Prometheus]
        Metrics[cm_locations_low_confidence_count]
    end

    Admin --> Dashboard
    Admin --> LocationsPage

    Dashboard --> StatsAPI
    Dashboard --> LocationsAPI
    Dashboard --> DuplicatesAPI
    LocationsPage --> RegeocodeAPI
    LocationsPage --> BulkGeocodeAPI

    StatsAPI --> LocationsDB
    LocationsAPI --> LocationsDB
    DuplicatesAPI --> LocationsDB
    RegeocodeAPI --> GeocodingService
    BulkGeocodeAPI --> GeocodingService

    LocationsDB --> Indexes
    GeocodingService --> Providers
    GeocodingService --> Cache

    StatsAPI --> Prometheus
    Prometheus --> Metrics
```
|
|
|
|
**Data Flow:**

1. **Statistics Aggregation:**
   - Query all locations with geocoding metadata
   - Calculate aggregate metrics (total, geocoded %, avg confidence)
   - Group by provider for success rate comparison
   - Identify low-confidence locations (< 50)
   - Detect duplicates via coordinate matching

2. **Quality Review:**
   - Admin views dashboard statistics
   - Filters low-confidence locations
   - Reviews individual location details
   - Identifies patterns (provider failures, address format issues)

3. **Remediation:**
   - Manual address correction
   - Single location re-geocoding
   - Bulk re-geocoding with different provider
   - Duplicate merging or marking

4. **Monitoring:**
   - Prometheus metrics track quality trends
   - Alert rules trigger for quality degradation
   - Grafana dashboards visualize provider performance

## Database Models
|
|
|
|
### Location Model
|
|
|
|
```prisma
model Location {
  id                Int       @id @default(autoincrement())
  address           String
  latitude          Float?
  longitude         Float?
  postalCode        String?
  province          String?

  // Geocoding metadata
  geocodeConfidence Int?      // 0-100 quality score
  geocodeProvider   String?   // Provider used for geocoding
  geocodedAt        DateTime? // Timestamp of last geocode

  // NAR import fields
  locGuid           String?   @unique
  federalDistrict   String?
  buildingUse       Int?      // 1 = Residential

  addresses         Address[]

  createdAt         DateTime  @default(now())
  updatedAt         DateTime  @updatedAt

  @@index([geocodeConfidence])
  @@index([geocodeProvider])
  // A partial index (WHERE latitude IS NOT NULL AND longitude IS NOT NULL) is not
  // expressed with a plain @@index; if needed, it would typically be added in a raw SQL migration.
  @@index([latitude, longitude])
}
```
|
|
|
|
**Geocode Confidence Scale:**
|
|
- 0-20: Very Low (manual review required)
|
|
- 21-40: Low (likely incorrect, re-geocode recommended)
|
|
- 41-60: Medium (acceptable but consider verification)
|
|
- 61-80: Good (likely accurate)
|
|
- 81-100: Excellent (high confidence)
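
The scale above can be expressed as a small lookup. This is a minimal sketch; `confidenceLevel` and `ConfidenceLevel` are illustrative names rather than part of the documented codebase, and treating a missing score as 0 is an assumption.

```typescript
type ConfidenceLevel = 'Very Low' | 'Low' | 'Medium' | 'Good' | 'Excellent';

// Maps a 0-100 geocode confidence score to the band names listed above.
// Scores outside 0-100 are clamped; null (never geocoded) is assumed to be 0.
function confidenceLevel(score: number | null): ConfidenceLevel {
  const s = Math.max(0, Math.min(100, score ?? 0));
  if (s <= 20) return 'Very Low';
  if (s <= 40) return 'Low';
  if (s <= 60) return 'Medium';
  if (s <= 80) return 'Good';
  return 'Excellent';
}
```

Note that the dashboard's low-confidence cutoff (< 50) falls inside the Medium band, so a score of 45 is labelled Medium here while still appearing in low-confidence filters.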
|
|
|
|
**Geocode Provider Enum:**
|
|
```typescript
|
|
enum GeocodeProvider {
|
|
GOOGLE = 'GOOGLE',
|
|
MAPBOX = 'MAPBOX',
|
|
NOMINATIM = 'NOMINATIM',
|
|
PHOTON = 'PHOTON',
|
|
LOCATIONIQ = 'LOCATIONIQ',
|
|
ARCGIS = 'ARCGIS',
|
|
UNKNOWN = 'UNKNOWN'
|
|
}
|
|
```
|
|
|
|
### Address Model
|
|
|
|
```prisma
|
|
model Address {
|
|
id Int @id @default(autoincrement())
|
|
locationId Int
|
|
location Location @relation(fields: [locationId], references: [id], onDelete: Cascade)
|
|
|
|
unitNumber String?
|
|
firstName String?
|
|
lastName String?
|
|
supportLevel Int?
|
|
notes String?
|
|
|
|
// Address validation
|
|
isValidated Boolean @default(false)
|
|
validatedAt DateTime?
|
|
|
|
createdAt DateTime @default(now())
|
|
updatedAt DateTime @updatedAt
|
|
|
|
@@index([locationId])
|
|
}
|
|
```
|
|
|
|
## API Endpoints
|
|
|
|
### GET /api/locations/geocode-stats
|
|
|
|
Fetch aggregate geocoding quality statistics.
|
|
|
|
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"total": 1500,
|
|
"geocoded": 1450,
|
|
"geocodedPercent": 96.67,
|
|
"avgConfidence": 78.5,
|
|
"providerBreakdown": {
|
|
"GOOGLE": 800,
|
|
"MAPBOX": 350,
|
|
"NOMINATIM": 200,
|
|
"PHOTON": 100,
|
|
"ARCGIS": 0,
|
|
"LOCATIONIQ": 0,
|
|
"UNKNOWN": 50
|
|
},
|
|
"confidenceDistribution": {
|
|
"0-20": 15,
|
|
"21-40": 35,
|
|
"41-60": 150,
|
|
"61-80": 450,
|
|
"81-100": 800
|
|
},
|
|
"lowConfidenceCount": 50,
|
|
"missingCoordinates": 50,
|
|
"duplicatesCount": 12
|
|
}
|
|
```
|
|
|
|
**Implementation:**
|
|
|
|
```typescript
// locations.service.ts
async getGeocodeStats() {
  const locations = await prisma.location.findMany({
    select: {
      latitude: true,
      longitude: true,
      geocodeConfidence: true,
      geocodeProvider: true
    }
  });

  const total = locations.length;
  // Compare against null explicitly so a coordinate of 0 still counts as geocoded
  const geocoded = locations.filter(l => l.latitude != null && l.longitude != null).length;

  // Only locations that have a score contribute to the average and the distribution
  const scores = locations
    .map(l => l.geocodeConfidence)
    .filter((c): c is number => c != null);
  const avgConfidence = scores.length > 0
    ? scores.reduce((sum, c) => sum + c, 0) / scores.length
    : 0;

  const providerBreakdown = locations.reduce((acc, l) => {
    const provider = l.geocodeProvider || 'UNKNOWN';
    acc[provider] = (acc[provider] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);

  const confidenceDistribution = {
    '0-20': 0,
    '21-40': 0,
    '41-60': 0,
    '61-80': 0,
    '81-100': 0
  };

  scores.forEach(conf => {
    if (conf <= 20) confidenceDistribution['0-20']++;
    else if (conf <= 40) confidenceDistribution['21-40']++;
    else if (conf <= 60) confidenceDistribution['41-60']++;
    else if (conf <= 80) confidenceDistribution['61-80']++;
    else confidenceDistribution['81-100']++;
  });

  // Locations without a score are counted as low confidence
  const lowConfidenceCount = locations.filter(l =>
    (l.geocodeConfidence ?? 0) < 50).length;

  return {
    total,
    geocoded,
    // Guard against division by zero on an empty table
    geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
    avgConfidence,
    providerBreakdown,
    confidenceDistribution,
    lowConfidenceCount,
    missingCoordinates: total - geocoded,
    duplicatesCount: await this.countDuplicates()
  };
}
```
|
|
|
|
### GET /api/locations?geocodeConfidence=lt:50
|
|
|
|
Fetch locations filtered by geocode confidence.
|
|
|
|
**Authentication:** Required
|
|
|
|
**Query Parameters:**
|
|
- `geocodeConfidence` (filter): `lt:X`, `gt:X`, `eq:X`, `null`
|
|
- `geocodeProvider` (filter): Provider name (GOOGLE, MAPBOX, etc.)
|
|
- `page` (optional): Page number (default: 1)
|
|
- `limit` (optional): Results per page (default: 50)
|
|
- `sortBy` (optional): Field to sort by (default: "geocodeConfidence")
|
|
- `order` (optional): "asc" or "desc" (default: "asc")
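
The `geocodeConfidence` filter syntax can be parsed into a structured condition before it reaches the database layer. The sketch below is one possible approach; `parseConfidenceFilter` and the `ConfidenceFilter` shape are hypothetical and only loosely modelled on a Prisma-style integer filter.

```typescript
// Hypothetical filter shape; null means "confidence is not set".
type ConfidenceFilter =
  | { lt: number }
  | { gt: number }
  | { equals: number }
  | null;

// Parses "lt:50", "gt:80", "eq:70" or "null"; returns undefined for anything else.
function parseConfidenceFilter(raw: string): ConfidenceFilter | undefined {
  if (raw === 'null') return null;
  const match = /^(lt|gt|eq):(\d{1,3})$/.exec(raw);
  if (!match) return undefined;
  const value = Number(match[2]);
  if (value > 100) return undefined; // scores are on a 0-100 scale
  switch (match[1]) {
    case 'lt': return { lt: value };
    case 'gt': return { gt: value };
    default: return { equals: value };
  }
}
```

Returning `undefined` for malformed input lets the caller distinguish "no valid filter" (for example, respond with 400) from the explicit `null` filter.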
|
|
|
|
**Examples:**
|
|
|
|
```
|
|
GET /api/locations?geocodeConfidence=lt:50
|
|
GET /api/locations?geocodeConfidence=null
|
|
GET /api/locations?geocodeProvider=NOMINATIM&geocodeConfidence=lt:70
|
|
GET /api/locations?geocodeConfidence=gt:80&sortBy=address
|
|
```
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"data": [
|
|
{
|
|
"id": 1001,
|
|
"address": "123 Main St",
|
|
"latitude": 43.6532,
|
|
"longitude": -79.3832,
|
|
"postalCode": "M5H 2N2",
|
|
"geocodeConfidence": 45,
|
|
"geocodeProvider": "NOMINATIM",
|
|
"geocodedAt": "2025-02-10T10:00:00Z",
|
|
"addresses": [...]
|
|
}
|
|
],
|
|
"pagination": {
|
|
"page": 1,
|
|
"limit": 50,
|
|
"total": 150,
|
|
"pages": 3
|
|
}
|
|
}
|
|
```
|
|
|
|
### GET /api/locations/duplicates
|
|
|
|
Identify locations with identical coordinates.
|
|
|
|
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)
|
|
|
|
**Query Parameters:**
|
|
- `threshold` (optional): Distance threshold in meters (default: 1, matches exact duplicates)
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"duplicates": [
|
|
{
|
|
"coordinates": {
|
|
"latitude": 43.6532,
|
|
"longitude": -79.3832
|
|
},
|
|
"count": 3,
|
|
"locations": [
|
|
{
|
|
"id": 1001,
|
|
"address": "123 Main St",
|
|
"postalCode": "M5H 2N2"
|
|
},
|
|
{
|
|
"id": 1002,
|
|
"address": "123 Main Street",
|
|
"postalCode": "M5H 2N2"
|
|
},
|
|
{
|
|
"id": 1003,
|
|
"address": "123 Main St, Unit 1",
|
|
"postalCode": "M5H 2N2"
|
|
}
|
|
]
|
|
}
|
|
],
|
|
"total": 12
|
|
}
|
|
```
|
|
|
|
**Implementation:**
|
|
|
|
```typescript
|
|
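// locations.service.ts
// Note: thresholdMeters is accepted but not applied in this excerpt;
// grouping is by coordinates rounded to 6 decimal places.
async findDuplicates(thresholdMeters: number = 1) {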
|
|
const locations = await prisma.location.findMany({
|
|
where: {
|
|
AND: [
|
|
{ latitude: { not: null } },
|
|
{ longitude: { not: null } }
|
|
]
|
|
},
|
|
select: {
|
|
id: true,
|
|
address: true,
|
|
latitude: true,
|
|
longitude: true,
|
|
postalCode: true
|
|
}
|
|
});
|
|
|
|
const coordMap = new Map<string, typeof locations>();
|
|
|
|
locations.forEach(loc => {
|
|
// Round to 6 decimal places (~0.1m precision)
|
|
const key = `${loc.latitude!.toFixed(6)},${loc.longitude!.toFixed(6)}`;
|
|
if (!coordMap.has(key)) {
|
|
coordMap.set(key, []);
|
|
}
|
|
coordMap.get(key)!.push(loc);
|
|
});
|
|
|
|
const duplicates = Array.from(coordMap.entries())
|
|
.filter(([_, locs]) => locs.length > 1)
|
|
.map(([coords, locs]) => {
|
|
const [lat, lng] = coords.split(',').map(Number);
|
|
return {
|
|
coordinates: { latitude: lat, longitude: lng },
|
|
count: locs.length,
|
|
locations: locs
|
|
};
|
|
});
|
|
|
|
return {
|
|
duplicates,
|
|
total: duplicates.reduce((sum, dup) => sum + dup.count, 0)
|
|
};
|
|
}
|
|
```
|
|
|
|
### POST /api/locations/:id/regeocode
|
|
|
|
Re-geocode a single location with specified provider.
|
|
|
|
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)
|
|
|
|
**Request Body:**
|
|
```json
|
|
{
|
|
"provider": "GOOGLE",
|
|
"address": "123 Main St, Toronto ON M5H 2N2"
|
|
}
|
|
```
|
|
|
|
**Parameters:**
|
|
- `provider` (optional): Specific provider to use (default: fallback chain)
|
|
- `address` (optional): Override address string (default: use existing)
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"id": 1001,
|
|
"address": "123 Main St",
|
|
"latitude": 43.6532,
|
|
"longitude": -79.3832,
|
|
"geocodeConfidence": 95,
|
|
"geocodeProvider": "GOOGLE",
|
|
"geocodedAt": "2025-02-13T10:30:00Z"
|
|
}
|
|
```
|
|
|
|
### POST /api/locations/bulk-geocode
|
|
|
|
Bulk re-geocode multiple locations.
|
|
|
|
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)
|
|
|
|
**Request Body:**
|
|
```json
|
|
{
|
|
"locationIds": [1001, 1002, 1003],
|
|
"provider": "GOOGLE",
|
|
"confidenceThreshold": 50
|
|
}
|
|
```
|
|
|
|
**Parameters:**
|
|
- `locationIds` (optional): Specific location IDs (default: all with confidence < threshold)
|
|
- `provider` (optional): Specific provider to use (default: fallback chain)
|
|
- `confidenceThreshold` (optional): Only re-geocode locations below this confidence (default: 50)
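
The parameter rules above can be read as a selection step that runs before the job is queued. The following sketch shows one interpretation; `selectBulkTargets`, `LocationRow`, and `BulkGeocodeRequest` are hypothetical names, and it assumes that a missing score counts as 0 and that the threshold still applies when explicit IDs are supplied.

```typescript
interface LocationRow {
  id: number;
  geocodeConfidence: number | null;
}

interface BulkGeocodeRequest {
  locationIds?: number[];
  confidenceThreshold?: number;
}

// Returns the IDs a bulk job would process under the parameter rules above.
function selectBulkTargets(all: LocationRow[], req: BulkGeocodeRequest): number[] {
  const threshold = req.confidenceThreshold ?? 50; // documented default
  const wanted = req.locationIds ? new Set(req.locationIds) : null;
  return all
    .filter(l => wanted === null || wanted.has(l.id))
    // Assumption: a null score is treated as 0 and therefore always selected
    .filter(l => (l.geocodeConfidence ?? 0) < threshold)
    .map(l => l.id);
}
```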
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"jobId": "bulk-geocode-20250213-103000",
|
|
"status": "queued",
|
|
"total": 150,
|
|
"message": "Bulk geocoding job started"
|
|
}
|
|
```
|
|
|
|
**Job Progress Endpoint:**
|
|
```
|
|
GET /api/locations/bulk-geocode/:jobId
|
|
```
|
|
|
|
**Job Status Response:**
|
|
```json
|
|
{
|
|
"jobId": "bulk-geocode-20250213-103000",
|
|
"status": "processing",
|
|
"progress": {
|
|
"total": 150,
|
|
"processed": 75,
|
|
"successful": 70,
|
|
"failed": 5,
|
|
"percent": 50
|
|
},
|
|
"startedAt": "2025-02-13T10:30:00Z",
|
|
"estimatedCompletion": "2025-02-13T10:35:00Z"
|
|
}
|
|
```
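
The `percent` and `estimatedCompletion` fields in the status response can be derived from the raw counters. The sketch below assumes a simple linear extrapolation; `summarizeProgress` is a hypothetical helper, and the actual service may estimate completion differently.

```typescript
interface JobProgress {
  total: number;
  processed: number;
  successful: number;
  failed: number;
}

// Derives the percent field and a naive completion estimate from raw counters.
function summarizeProgress(p: JobProgress, startedAt: Date, now: Date) {
  const percent = p.total > 0 ? Math.round((p.processed / p.total) * 100) : 0;
  const elapsedMs = now.getTime() - startedAt.getTime();
  // Linear extrapolation: average time per processed item times the total count
  const estimatedCompletion = p.processed > 0
    ? new Date(startedAt.getTime() + (elapsedMs / p.processed) * p.total)
    : null;
  return { percent, estimatedCompletion };
}
```

With the example counters above (75 of 150 processed) and 2.5 minutes elapsed since `startedAt`, this yields 50% and a completion estimate five minutes after the start.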
|
|
|
|
## Configuration
|
|
|
|
### Environment Variables
|
|
|
|
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| GEOCODE_CONFIDENCE_THRESHOLD | number | 50 | Minimum confidence for acceptable geocoding |
| GEOCODE_PRIMARY_PROVIDER | string | GOOGLE | Primary geocoding provider |
| GEOCODE_FALLBACK_PROVIDERS | string | MAPBOX,NOMINATIM | Comma-separated fallback providers |
| GEOCODE_CACHE_TTL | number | 2592000 | Cache TTL in seconds (30 days) |
|
|
|
|
### Quality Thresholds
|
|
|
|
| Metric | Warning | Critical | Description |
|--------|---------|----------|-------------|
| Geocoded % | < 95% | < 90% | Percentage of locations with coordinates |
| Avg Confidence | < 70 | < 60 | Average geocode confidence score |
| Low Confidence Count | > 50 | > 100 | Locations with confidence < 50 |
| Duplicates | > 20 | > 50 | Locations with identical coordinates |
| Missing Coordinates | > 5% | > 10% | Locations without lat/lng |
|
|
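
The quality thresholds table can be applied to a stats payload programmatically. This is a minimal sketch using the warning and critical values from that table; `evaluateQuality`, `QualityStats`, and the `Severity` type are illustrative names, not part of the documented API.

```typescript
type Severity = 'ok' | 'warning' | 'critical';

interface QualityStats {
  geocodedPercent: number;
  avgConfidence: number;
  lowConfidenceCount: number;
  duplicatesCount: number;
}

// For metrics where a lower value is worse (percentages, averages)
const below = (v: number, warn: number, crit: number): Severity =>
  v < crit ? 'critical' : v < warn ? 'warning' : 'ok';

// For metrics where a higher value is worse (counts)
const above = (v: number, warn: number, crit: number): Severity =>
  v > crit ? 'critical' : v > warn ? 'warning' : 'ok';

function evaluateQuality(s: QualityStats): Record<keyof QualityStats, Severity> {
  return {
    geocodedPercent: below(s.geocodedPercent, 95, 90),
    avgConfidence: below(s.avgConfidence, 70, 60),
    lowConfidenceCount: above(s.lowConfidenceCount, 50, 100),
    duplicatesCount: above(s.duplicatesCount, 20, 50),
  };
}
```

The comparisons are strict, so a low-confidence count of exactly 50 is still reported as `ok`.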
|
|
### Prometheus Metrics
|
|
|
|
**Custom Metrics:**
|
|
|
|
```typescript
|
|
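// api/src/utils/metrics.ts
import { Gauge } from 'prom-client';
// locationsService is assumed to be imported from the locations module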
|
|
|
|
export const geocodingQualityGauge = new Gauge({
|
|
name: 'cm_geocoding_avg_confidence',
|
|
help: 'Average geocoding confidence score (0-100)',
|
|
async collect() {
|
|
const stats = await locationsService.getGeocodeStats();
|
|
this.set(stats.avgConfidence);
|
|
}
|
|
});
|
|
|
|
export const lowConfidenceLocationsGauge = new Gauge({
|
|
name: 'cm_locations_low_confidence_count',
|
|
help: 'Number of locations with geocode confidence < 50',
|
|
async collect() {
|
|
const stats = await locationsService.getGeocodeStats();
|
|
this.set(stats.lowConfidenceCount);
|
|
}
|
|
});
|
|
|
|
export const geocodedPercentGauge = new Gauge({
|
|
name: 'cm_locations_geocoded_percent',
|
|
help: 'Percentage of locations with coordinates',
|
|
async collect() {
|
|
const stats = await locationsService.getGeocodeStats();
|
|
this.set(stats.geocodedPercent);
|
|
}
|
|
});
|
|
|
|
export const duplicateLocationsGauge = new Gauge({
|
|
name: 'cm_locations_duplicates_count',
|
|
help: 'Number of duplicate location entries',
|
|
async collect() {
|
|
const duplicates = await locationsService.findDuplicates();
|
|
this.set(duplicates.total);
|
|
}
|
|
});
|
|
```
|
|
|
|
**Alert Rules:**
|
|
|
|
```yaml
# configs/prometheus/alerts.yml

groups:
  - name: data_quality
    interval: 5m
    rules:
      - alert: LowGeocodingConfidence
        expr: cm_geocoding_avg_confidence < 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Low average geocoding confidence
          description: "Average geocoding confidence is {{ $value }}, below threshold of 60"

      - alert: HighLowConfidenceLocations
        expr: cm_locations_low_confidence_count > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High number of low-confidence locations
          description: "{{ $value }} locations have geocoding confidence < 50"

      - alert: LowGeocodedPercent
        expr: cm_locations_geocoded_percent < 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Low percentage of geocoded locations
          description: "Only {{ $value }}% of locations have coordinates"

      - alert: HighDuplicateLocations
        expr: cm_locations_duplicates_count > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: High number of duplicate locations
          description: "{{ $value }} duplicate location entries detected"
```
|
|
|
|
## Quality Metrics
|
|
|
|
### Geocoding Confidence
|
|
|
|
**Calculation:**
|
|
|
|
Geocoding confidence is calculated based on multiple factors:
|
|
|
|
```typescript
|
|
interface GeocodeResult {
|
|
latitude: number;
|
|
longitude: number;
|
|
matchType: 'exact' | 'interpolated' | 'approximate' | 'fallback';
|
|
addressComponents: {
|
|
streetNumber?: string;
|
|
street?: string;
|
|
city?: string;
|
|
postalCode?: string;
|
|
province?: string;
|
|
};
|
|
providerConfidence?: number; // Provider-specific score
|
|
}
|
|
|
|
function calculateConfidence(result: GeocodeResult, inputAddress: string): number {
|
|
let confidence = 0;
|
|
|
|
// Match type (0-40 points)
|
|
switch (result.matchType) {
|
|
case 'exact': confidence += 40; break;
|
|
case 'interpolated': confidence += 30; break;
|
|
case 'approximate': confidence += 20; break;
|
|
case 'fallback': confidence += 10; break;
|
|
}
|
|
|
|
// Address component completeness (0-30 points)
|
|
const components = result.addressComponents;
|
|
if (components.streetNumber) confidence += 10;
|
|
if (components.street) confidence += 10;
|
|
if (components.postalCode) confidence += 10;
|
|
|
|
// Provider-specific confidence (0-30 points)
|
|
if (result.providerConfidence) {
|
|
confidence += (result.providerConfidence / 100) * 30;
|
|
}
|
|
|
|
return Math.min(Math.round(confidence), 100);
|
|
}
|
|
```
|
|
|
|
**Confidence Levels:**
|
|
|
|
- **81-100 (Excellent):** Exact match with full address components
|
|
- **61-80 (Good):** Interpolated match with most components
|
|
- **41-60 (Medium):** Approximate match, missing some components
|
|
- **21-40 (Low):** Fallback geocoding, significant uncertainty
|
|
- **0-20 (Very Low):** Minimal match, likely incorrect
|
|
|
|
### Provider Success Rates
|
|
|
|
**Metrics Tracked:**
|
|
|
|
```typescript
|
|
interface ProviderMetrics {
|
|
provider: GeocodeProvider;
|
|
totalAttempts: number;
|
|
successfulGeocodes: number;
|
|
successRate: number; // 0-100%
|
|
avgConfidence: number; // 0-100
|
|
avgResponseTime: number; // milliseconds
|
|
errorCount: number;
|
|
lastError?: string;
|
|
}
|
|
```
|
|
|
|
**Success Rate Calculation:**
|
|
|
|
```typescript
import { groupBy } from 'lodash'; // assumed source of groupBy

const calculateProviderMetrics = async (): Promise<ProviderMetrics[]> => {
  const locations = await prisma.location.findMany({
    select: {
      geocodeProvider: true,
      geocodeConfidence: true,
      latitude: true,
      longitude: true
    }
  });

  // Group rows with no provider under UNKNOWN instead of the string "null"
  const providerGroups = groupBy(locations, l => l.geocodeProvider ?? 'UNKNOWN');

  return Object.entries(providerGroups).map(([provider, locs]) => {
    // Counts are derived from stored rows, so they approximate attempts
    const total = locs.length;
    // Explicit null checks so a coordinate of 0 is not treated as missing
    const successful = locs.filter(l => l.latitude != null && l.longitude != null).length;
    const avgConf = locs.reduce((sum, l) => sum + (l.geocodeConfidence || 0), 0) / total;

    return {
      provider: provider as GeocodeProvider,
      totalAttempts: total,
      successfulGeocodes: successful,
      successRate: (successful / total) * 100,
      avgConfidence: avgConf,
      avgResponseTime: 0, // Would need separate tracking
      errorCount: total - successful
    };
  });
};
```
|
|
|
|
### Duplicate Detection
|
|
|
|
**Detection Methods:**
|
|
|
|
1. **Exact Coordinate Match:**
|
|
```typescript
|
|
// Round to 6 decimal places (~0.1m precision)
|
|
const isDuplicateExact = (loc1: Location, loc2: Location): boolean => {
|
|
return loc1.latitude!.toFixed(6) === loc2.latitude!.toFixed(6) &&
|
|
loc1.longitude!.toFixed(6) === loc2.longitude!.toFixed(6);
|
|
};
|
|
```
|
|
|
|
2. **Proximity Threshold:**
|
|
```typescript
|
|
// Haversine distance check
|
|
const isDuplicateProximity = (loc1: Location, loc2: Location, thresholdM: number): boolean => {
|
|
const distance = haversineDistance(
|
|
[loc1.latitude!, loc1.longitude!],
|
|
[loc2.latitude!, loc2.longitude!]
|
|
);
|
|
return distance < thresholdM;
|
|
};
|
|
```
|
|
|
|
3. **Address Similarity:**
|
|
```typescript
|
|
import { distance as levenshteinDistance } from 'fastest-levenshtein';
|
|
|
|
const isDuplicateAddress = (addr1: string, addr2: string): boolean => {
|
|
const normalized1 = normalizeAddress(addr1);
|
|
const normalized2 = normalizeAddress(addr2);
|
|
const dist = levenshteinDistance(normalized1, normalized2);
|
|
const similarity = 1 - (dist / Math.max(normalized1.length, normalized2.length));
|
|
return similarity > 0.9; // 90% similar
|
|
};
|
|
|
|
const normalizeAddress = (address: string): string => {
|
|
return address
|
|
.toLowerCase()
|
|
.replace(/\bstreet\b/g, 'st')
|
|
.replace(/\bavenue\b/g, 'ave')
|
|
.replace(/\broad\b/g, 'rd')
|
|
.replace(/\bdrive\b/g, 'dr')
|
|
.replace(/[^a-z0-9]/g, '');
|
|
};
|
|
```
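
The proximity check above calls `haversineDistance` without defining it. A minimal sketch of that helper follows, assuming coordinates are passed as `[latitude, longitude]` pairs in degrees and the result is in metres (matching how `thresholdM` is used); the mean Earth radius of 6,371 km is a common approximation.

```typescript
const EARTH_RADIUS_M = 6371000; // mean Earth radius in metres

const toRad = (deg: number): number => (deg * Math.PI) / 180;

// Great-circle distance in metres between two [latitude, longitude] pairs.
function haversineDistance(a: [number, number], b: [number, number]): number {
  const dLat = toRad(b[0] - a[0]);
  const dLng = toRad(b[1] - a[1]);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a[0])) * Math.cos(toRad(b[0])) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}
```

For a sense of scale, the two example coordinate groups used in this document (43.6532, -79.3832 and 43.6540, -79.3825) are roughly 100 m apart, so they would not be merged under a 1 m threshold.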
|
|
|
|
### Address Validation
|
|
|
|
**Validation Checks:**
|
|
|
|
```typescript
|
|
interface AddressValidationResult {
|
|
isValid: boolean;
|
|
issues: string[];
|
|
suggestions?: string[];
|
|
}
|
|
|
|
const validateAddress = (address: string): AddressValidationResult => {
|
|
const issues: string[] = [];
|
|
|
|
// Check minimum length
|
|
if (address.length < 5) {
|
|
issues.push('Address too short');
|
|
}
|
|
|
|
// Check for street number
|
|
if (!/^\d+/.test(address)) {
|
|
issues.push('Missing street number');
|
|
}
|
|
|
|
// Check for street name
|
|
if (!/\d+\s+([A-Za-z]+\s*)+/.test(address)) {
|
|
issues.push('Missing street name');
|
|
}
|
|
|
|
// Check for postal code (Canadian format)
|
|
if (!/[A-Z]\d[A-Z]\s?\d[A-Z]\d/.test(address)) {
|
|
issues.push('Missing or invalid postal code');
|
|
}
|
|
|
|
// Check for unusual characters
|
|
if (/[^A-Za-z0-9\s,.-]/.test(address)) {
|
|
issues.push('Contains unusual characters');
|
|
}
|
|
|
|
return {
|
|
isValid: issues.length === 0,
|
|
issues
|
|
};
|
|
};
|
|
```
|
|
|
|
## Admin Workflow
|
|
|
|
### Navigate to Data Quality Dashboard
|
|
|
|
**Step 1: Access Dashboard**
|
|
|
|
1. Log in as SUPER_ADMIN or MAP_ADMIN
|
|
2. Click **Map** in sidebar
|
|
3. Click **Data Quality** submenu
|
|
4. Dashboard loads with statistics
|
|
|
|
**Step 2: Review Overall Statistics**
|
|
|
|
Dashboard displays 4 main statistic cards:
|
|
|
|
```plaintext
|
|
┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐
|
|
│ Total Locations │ Geocoded │ Avg Confidence │ Low Confidence │
|
|
│ 1,500 │ 1,450 (96.7%) │ 78.5 │ 50 │
|
|
└──────────────────┴──────────────────┴──────────────────┴──────────────────┘
|
|
```
|
|
|
|
**Step 3: Analyze Provider Performance**
|
|
|
|
Provider breakdown table shows:
|
|
|
|
| Provider | Count | Success Rate | Avg Confidence |
|----------|-------|--------------|----------------|
| GOOGLE | 800 | 99.2% | 85.3 |
| MAPBOX | 350 | 97.1% | 82.1 |
| NOMINATIM | 200 | 94.5% | 75.8 |
| PHOTON | 100 | 91.0% | 68.2 |
| UNKNOWN | 50 | N/A | 0 |
|
|
|
|
**Step 4: Review Confidence Distribution**
|
|
|
|
Bar chart displays confidence distribution:
|
|
|
|
```plaintext
|
|
Confidence Distribution
|
|
100 | ┌──────┐
|
|
80 | │ │
|
|
60 | ┌──────┤ │
|
|
40 | ┌──────┤ │ │
|
|
20 | │ │ │ │
|
|
0 └──┴──────┴──────┴──────┴──────┘
|
|
0-20 21-40 41-60 61-80 81-100
|
|
15 35 150 450 800
|
|
```
|
|
|
|
### Identify and Review Low-Confidence Locations
|
|
|
|
**Step 1: Filter Low-Confidence Locations**
|
|
|
|
1. Click **Low Confidence** tab on dashboard
|
|
2. Table loads with locations where confidence < 50
|
|
3. Sort by confidence (ascending) to prioritize worst
|
|
|
|
**Step 2: Review Location Details**
|
|
|
|
Click row to open detail drawer:
|
|
|
|
```plaintext
|
|
┌─────────────────────────────────────────┐
|
|
│ Location Details │
|
|
├─────────────────────────────────────────┤
|
|
│ Address: 123 Main St │
|
|
│ Postal Code: M5H 2N2 │
|
|
│ Coordinates: 43.6532, -79.3832 │
|
|
│ │
|
|
│ Geocoding Info: │
|
|
│ Confidence: 45 (Low) │
|
|
│ Provider: NOMINATIM │
|
|
│ Geocoded: Feb 10, 2025 10:00 AM │
|
|
│ │
|
|
│ Issues: │
|
|
│ • Missing street number in response │
|
|
│ • Approximate match only │
|
|
│ │
|
|
│ [Re-geocode] [Edit Address] [View Map] │
|
|
└─────────────────────────────────────────┘
|
|
```
|
|
|
|
**Step 3: Take Action**
|
|
|
|
Options for remediation:
|
|
|
|
1. **Re-geocode with different provider:**
|
|
- Click **Re-geocode** button
|
|
- Select provider (GOOGLE recommended for low confidence)
|
|
- Click **Geocode Now**
|
|
- New confidence displayed
|
|
|
|
2. **Edit address:**
|
|
- Click **Edit Address**
|
|
- Correct typos or formatting issues
|
|
- Save changes
|
|
- Auto-triggers re-geocoding
|
|
|
|
3. **View on map:**
|
|
- Click **View Map**
|
|
- Verify location accuracy visually
|
|
- Drag marker to correct position if needed
|
|
|
|
### Bulk Re-geocoding
|
|
|
|
**Step 1: Select Locations**
|
|
|
|
1. In Low Confidence tab, use table checkboxes to select locations
|
|
2. Or click **Select All** to select all visible
|
|
3. Selected count displays: "50 selected"
|
|
|
|
**Step 2: Choose Provider**
|
|
|
|
1. Click **Bulk Re-geocode** button
|
|
2. Modal opens with provider selection:
|
|
```plaintext
|
|
┌─────────────────────────────────────┐
|
|
│ Bulk Re-geocode │
|
|
├─────────────────────────────────────┤
|
|
│ Re-geocode 50 locations │
|
|
│ │
|
|
│ Provider: [GOOGLE ▼] │
|
|
│ │
|
|
│ Options: │
|
|
│ ☑ Only if confidence < 50 │
|
|
│ ☑ Cache results │
|
|
│ ☐ Overwrite existing coordinates │
|
|
│ │
|
|
│ Estimated time: ~2 minutes │
|
|
│ │
|
|
│ [Cancel] [Start Re-geocoding] │
|
|
└─────────────────────────────────────┘
|
|
```
|
|
|
|
**Step 3: Monitor Progress**
|
|
|
|
1. Job starts, progress bar appears:
|
|
```plaintext
|
|
Re-geocoding in progress... 25/50 (50%)
|
|
[████████████░░░░░░░░░░░░] 50%
|
|
```
|
|
|
|
2. Real-time updates:
|
|
- Total processed
|
|
- Successful geocodes
|
|
- Failed geocodes
|
|
- Average new confidence
|
|
|
|
**Step 4: Review Results**
|
|
|
|
Job completion summary:
|
|
|
|
```plaintext
|
|
┌─────────────────────────────────────┐
|
|
│ Bulk Re-geocode Complete │
|
|
├─────────────────────────────────────┤
|
|
│ Processed: 50 │
|
|
│ Successful: 47 (94%) │
|
|
│ Failed: 3 (6%) │
|
|
│ │
|
|
│ Quality Improvement: │
|
|
│ Avg Confidence Before: 42.5 │
|
|
│ Avg Confidence After: 81.3 │
|
|
│ Improvement: +38.8 │
|
|
│ │
|
|
│ [View Failed] [Close] │
|
|
└─────────────────────────────────────┘
|
|
```
|
|
|
|
### Handle Duplicates
|
|
|
|
**Step 1: View Duplicates Tab**
|
|
|
|
1. Click **Duplicates** tab on dashboard
|
|
2. Table groups locations by coordinates
|
|
|
|
**Step 2: Review Duplicate Groups**
|
|
|
|
Table displays:
|
|
|
|
| Coordinates | Count | Addresses | Action |
|-------------|-------|-----------|--------|
| 43.6532, -79.3832 | 3 | 123 Main St, 123 Main Street, 123 Main St Unit 1 | [Review] |
| 43.6540, -79.3825 | 2 | 456 Bay St, 456 Bay Street | [Review] |
|
|
|
|
**Step 3: Resolve Duplicates**
|
|
|
|
Click **Review** to open resolution modal:
|
|
|
|
```plaintext
|
|
┌─────────────────────────────────────┐
|
|
│ Resolve Duplicates │
|
|
├─────────────────────────────────────┤
|
|
│ 3 locations at 43.6532, -79.3832 │
|
|
│ │
|
|
│ ○ Merge into single location │
|
|
│ Primary: 123 Main St │
|
|
│ Merge units from duplicates │
|
|
│ │
|
|
│ ○ Keep as separate multi-unit │
|
|
│ Mark as validated multi-unit │
|
|
│ │
|
|
│ ○ Re-geocode individually │
|
|
│ Try to get unique coordinates │
|
|
│ │
|
|
│ [Cancel] [Resolve] │
|
|
└─────────────────────────────────────┘
|
|
```
|
|
|
|
**Resolution Options:**
|
|
|
|
1. **Merge:** Combine into single Location with multiple Address records
|
|
2. **Multi-unit:** Mark as legitimate multi-unit building
|
|
3. **Re-geocode:** Attempt to get unique coordinates for each
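
The Merge option can be thought of as a two-step plan: move every Address onto the surviving Location, then remove the others. The sketch below only builds that plan; `planMerge`, `AddressRow`, and `MergePlan` are hypothetical names, and keeping the lowest ID as primary is an assumption (the modal lets the admin pick).

```typescript
interface AddressRow {
  id: number;
  locationId: number;
}

interface MergePlan {
  primaryId: number;
  reassignAddressIds: number[];
  deleteLocationIds: number[];
}

// Builds a merge plan for a duplicate group without touching the database.
function planMerge(locationIds: number[], addresses: AddressRow[]): MergePlan {
  if (locationIds.length < 2) {
    throw new Error('Need at least two locations to merge');
  }
  const sorted = [...locationIds].sort((a, b) => a - b);
  const primaryId = sorted[0]!; // assumption: lowest ID survives
  const others = sorted.slice(1);
  const otherSet = new Set(others);
  return {
    primaryId,
    reassignAddressIds: addresses.filter(a => otherSet.has(a.locationId)).map(a => a.id),
    deleteLocationIds: others,
  };
}
```

Because the Address relation uses `onDelete: Cascade`, the reassignment has to be applied before the duplicate Locations are deleted; otherwise their Address records would be removed with them.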
|
|
|
|
## Quality Improvement Strategies
|
|
|
|
### Multi-Provider Geocoding
|
|
|
|
**Fallback Chain:**
|
|
|
|
```typescript
// geocoding.service.ts

const PROVIDER_CHAIN: GeocodeProvider[] = [
  GeocodeProvider.GOOGLE,    // Primary: Best accuracy, paid
  GeocodeProvider.MAPBOX,    // Fallback 1: Good accuracy, paid
  GeocodeProvider.NOMINATIM, // Fallback 2: Free, decent accuracy
  GeocodeProvider.PHOTON,    // Fallback 3: Free, lower accuracy
  GeocodeProvider.ARCGIS     // Fallback 4: Free, basic accuracy
];

async geocode(address: string): Promise<GeocodeResult | null> {
  for (const provider of PROVIDER_CHAIN) {
    try {
      const result = await this.geocodeWithProvider(address, provider);
      // GeocodeResult carries no confidence field, so score it with calculateConfidence
      if (result && calculateConfidence(result, address) >= 50) {
        return result; // Success, confidence acceptable
      }
    } catch (error) {
      logger.warn(`Geocoding failed with ${provider}:`, error);
      // Try next provider
    }
  }
  return null; // No provider returned an acceptable result
}
```
|
|
|
|
**Benefits:**
|
|
- Increases success rate (90% → 96%+)
|
|
- Reduces dependency on single provider
|
|
- Cost optimization (use free providers as fallback)
|
|
- Provider outage resilience
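
The chain above returns `null` when no provider reaches the confidence threshold. One possible variant, sketched here as a design alternative rather than the documented behavior, keeps the best low-confidence result so the location still gets coordinates and shows up for review. `pickResult` and `ScoredResult` are hypothetical names.

```typescript
interface ScoredResult {
  provider: string;
  latitude: number;
  longitude: number;
  confidence: number; // 0-100
}

// Walks results in chain order; returns the first at or above the threshold,
// otherwise the highest-scoring one seen, or null if every provider failed.
function pickResult(
  results: Array<ScoredResult | null>,
  threshold = 50
): ScoredResult | null {
  let best: ScoredResult | null = null;
  for (const r of results) {
    if (r === null) continue; // provider failed or returned nothing
    if (r.confidence >= threshold) return r;
    if (best === null || r.confidence > best.confidence) best = r;
  }
  return best;
}
```

In a real chain the providers would be queried lazily, stopping at the first acceptable result; the array form here just keeps the selection logic easy to test.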
|
|
|
|
### Address Normalization
|
|
|
|
**Pre-Geocoding Normalization:**
|
|
|
|
```typescript
|
|
const normalizeAddressForGeocoding = (address: string): string => {
|
|
let normalized = address;
|
|
|
|
// Remove extra whitespace
|
|
normalized = normalized.replace(/\s+/g, ' ').trim();
|
|
|
|
// Standardize abbreviations
|
|
const replacements: Record<string, string> = {
|
|
'Street': 'St',
|
|
'Avenue': 'Ave',
|
|
'Road': 'Rd',
|
|
'Drive': 'Dr',
|
|
'Boulevard': 'Blvd',
|
|
'Apartment': 'Apt',
|
|
'Unit': 'Unit',
|
|
'Suite': 'Ste'
|
|
};
|
|
|
|
Object.entries(replacements).forEach(([long, short]) => {
|
|
const regex = new RegExp(`\\b${long}\\b`, 'gi');
|
|
normalized = normalized.replace(regex, short);
|
|
});
|
|
|
|
// Ensure postal code spacing (Canadian format)
|
|
normalized = normalized.replace(/([A-Z]\d[A-Z])(\d[A-Z]\d)/, '$1 $2');
|
|
|
|
// Remove periods from abbreviations
|
|
normalized = normalized.replace(/\./g, '');
|
|
|
|
return normalized;
|
|
};
|
|
```
|
|
|
|
**Improvements:**
|
|
- Reduces geocoding errors by 10-15%
|
|
- Increases confidence scores
|
|
- Better cache hit rate
|
|
|
|
### Geocoding Cache
|
|
|
|
**Redis Cache Implementation:**
|
|
|
|
```typescript
|
|
// geocoding.service.ts
|
|
|
|
private async geocodeWithCache(address: string): Promise<GeocodeResult | null> {
|
|
const cacheKey = `geocode:${normalizeAddress(address)}`;
|
|
|
|
// Check cache
|
|
const cached = await redis.get(cacheKey);
|
|
if (cached) {
|
|
logger.debug('Geocoding cache hit:', address);
|
|
return JSON.parse(cached);
|
|
}
|
|
|
|
// Cache miss, geocode
|
|
const result = await this.geocode(address);
|
|
if (result) {
|
|
// Cache for 30 days
|
|
await redis.setex(cacheKey, 2592000, JSON.stringify(result));
|
|
}
|
|
|
|
return result;
|
|
}
|
|
```
|
|
|
|
**Benefits:**
|
|
- Reduces API costs (90% cache hit rate)
|
|
- Faster response times (Redis: <5ms vs API: 200-500ms)
|
|
- Consistent results for same address
|
|
- Provider API rate limit avoidance
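
The expiry behavior that Redis provides through `setex` can be illustrated with an in-memory stand-in. This is only a sketch for reasoning about TTL semantics in tests; `TtlCache` is a hypothetical class, not a replacement for the Redis cache, and the injectable clock is an assumption made for testability.

```typescript
// In-memory stand-in for a TTL cache, with an injectable clock for testing.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlSeconds: number, now: () => number = Date.now) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

With the documented TTL of 2592000 seconds, an entry written at time zero is still served one millisecond before the 30-day mark and is treated as a miss from that mark onward.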
|
|
|
|
### Manual Verification
|
|
|
|
**Critical Location Verification:**
|
|
|
|
Manually verify high-priority locations:
|
|
|
|
1. **Campaign offices:** Ensure exact coordinates
|
|
2. **Shift start points:** Verify accessibility
|
|
3. **Event venues:** Confirm entrance location
|
|
4. **Polling stations:** Critical for voter info
|
|
|
|
**Verification Process:**
|
|
|
|
```typescript
|
|
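// Mark location as manually verified.
// 'MANUAL' is not a member of the GeocodeProvider enum shown earlier; geocodeProvider is a plain String column.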
|
|
await prisma.location.update({
|
|
where: { id: locationId },
|
|
data: {
|
|
geocodeConfidence: 100,
|
|
geocodeProvider: 'MANUAL',
|
|
geocodedAt: new Date()
|
|
}
|
|
});
|
|
```

### Regular Audits

**Monthly Quality Audit Checklist:**

1. **Run quality report:**
   ```bash
   curl http://localhost:4000/api/locations/geocode-stats
   ```

2. **Check metrics against thresholds:**
   - Geocoded % > 95%
   - Avg confidence > 70
   - Low confidence count < 50
   - Duplicates < 20

3. **Review low-confidence locations:**
   - Filter locations with confidence < 50
   - Review top 20 by address
   - Identify patterns (specific streets, providers)

4. **Bulk re-geocode low confidence:**
   - Use GOOGLE provider for accuracy
   - Monitor improvement in avg confidence

5. **Resolve duplicates:**
   - Review all duplicate groups
   - Merge or mark as multi-unit
   - Update addresses as needed

6. **Export quality report:**
   ```typescript
   import * as fs from 'node:fs';

   const report = await generateQualityReport();
   // Date stamp for the file name, formatted as YYYY-MM-DD
   const date = new Date().toISOString().slice(0, 10);
   fs.writeFileSync(`quality-report-${date}.json`, JSON.stringify(report, null, 2));
   ```
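
The threshold check in step 2 is easy to automate. The following is a minimal sketch, assuming the stats object has the field names returned by `/api/locations/geocode-stats`; the `auditFailures` function and its messages are hypothetical:

```typescript
interface AuditStats {
  geocodedPercent: number;
  avgConfidence: number;
  lowConfidenceCount: number;
  duplicatesCount: number;
}

// Returns one message per threshold from the checklist that is not met.
function auditFailures(stats: AuditStats): string[] {
  const failures: string[] = [];
  if (!(stats.geocodedPercent > 95)) {
    failures.push(`Geocoded ${stats.geocodedPercent.toFixed(1)}% (target > 95%)`);
  }
  if (!(stats.avgConfidence > 70)) {
    failures.push(`Avg confidence ${stats.avgConfidence.toFixed(1)} (target > 70)`);
  }
  if (!(stats.lowConfidenceCount < 50)) {
    failures.push(`Low-confidence count ${stats.lowConfidenceCount} (target < 50)`);
  }
  if (!(stats.duplicatesCount < 20)) {
    failures.push(`Duplicates ${stats.duplicatesCount} (target < 20)`);
  }
  return failures;
}

// Sample values for illustration only
const sampleAudit = auditFailures({
  geocodedPercent: 96.2,
  avgConfidence: 68.4,
  lowConfidenceCount: 37,
  duplicatesCount: 24
});

console.log(sampleAudit.length); // 2 (average confidence and duplicates miss their targets)
```

An empty array means the audit passed; anything else can be pasted into the monthly report as the list of follow-up items.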

## Code Examples

### DataQualityDashboardPage.tsx

```typescript
import React, { useEffect, useState } from 'react';
import { Card, Row, Col, Statistic, Table, Tabs, Button, message } from 'antd';
import { WarningOutlined, CheckCircleOutlined } from '@ant-design/icons';
import { api } from '@/lib/api';
// Registers the Chart.js scales and elements that <Bar> needs (Chart.js v3+)
import 'chart.js/auto';
import { Bar } from 'react-chartjs-2';

interface GeocodeStats {
  total: number;
  geocoded: number;
  geocodedPercent: number;
  avgConfidence: number;
  providerBreakdown: Record<string, number>;
  confidenceDistribution: Record<string, number>;
  lowConfidenceCount: number;
  missingCoordinates: number;
  duplicatesCount: number;
}

const DataQualityDashboardPage: React.FC = () => {
  const [stats, setStats] = useState<GeocodeStats | null>(null);
  const [lowConfLocations, setLowConfLocations] = useState<any[]>([]);
  const [duplicates, setDuplicates] = useState<any[]>([]);
  const [loading, setLoading] = useState(false);

  // Load all three data sets once on mount
  useEffect(() => {
    fetchStats();
    fetchLowConfidenceLocations();
    fetchDuplicates();
  }, []);

  const fetchStats = async () => {
    setLoading(true);
    try {
      const { data } = await api.get<GeocodeStats>('/locations/geocode-stats');
      setStats(data);
    } catch (error) {
      message.error('Failed to load statistics');
    } finally {
      setLoading(false);
    }
  };

  const fetchLowConfidenceLocations = async () => {
    try {
      const { data } = await api.get('/locations?geocodeConfidence=lt:50&limit=100');
      setLowConfLocations(data.data);
    } catch (error) {
      message.error('Failed to load low-confidence locations');
    }
  };

  const fetchDuplicates = async () => {
    try {
      const { data } = await api.get('/locations/duplicates');
      setDuplicates(data.duplicates);
    } catch (error) {
      message.error('Failed to load duplicates');
    }
  };

  const handleRegeocodeLocation = async (locationId: number) => {
    try {
      await api.post(`/locations/${locationId}/regeocode`, { provider: 'GOOGLE' });
      message.success('Location re-geocoded successfully');
      // Refresh the figures affected by the new result
      fetchStats();
      fetchLowConfidenceLocations();
    } catch (error) {
      message.error('Failed to re-geocode location');
    }
  };

  const confidenceChartData = stats ? {
    labels: Object.keys(stats.confidenceDistribution),
    datasets: [{
      label: 'Locations',
      data: Object.values(stats.confidenceDistribution),
      backgroundColor: [
        '#e74c3c', // 0-20: Red
        '#f39c12', // 21-40: Orange
        '#f1c40f', // 41-60: Yellow
        '#3498db', // 61-80: Blue
        '#27ae60'  // 81-100: Green
      ]
    }]
  } : null;

  const lowConfColumns = [
    { title: 'Address', dataIndex: 'address', key: 'address' },
    { title: 'Confidence', dataIndex: 'geocodeConfidence', key: 'confidence', render: (val: number) => (
      <span style={{ color: val < 30 ? '#e74c3c' : '#f39c12' }}>{val}</span>
    )},
    { title: 'Provider', dataIndex: 'geocodeProvider', key: 'provider' },
    { title: 'Action', key: 'action', render: (_: any, record: any) => (
      <Button size="small" onClick={() => handleRegeocodeLocation(record.id)}>
        Re-geocode
      </Button>
    )}
  ];

  return (
    <div>
      <h1>Data Quality Dashboard</h1>

      {/* Statistics Cards */}
      <Row gutter={16} style={{ marginBottom: 24 }}>
        <Col span={6}>
          <Card>
            <Statistic
              title="Total Locations"
              value={stats?.total || 0}
              prefix={<CheckCircleOutlined />}
            />
          </Card>
        </Col>
        <Col span={6}>
          <Card>
            <Statistic
              title="Geocoded"
              value={stats?.geocoded || 0}
              suffix={`(${stats?.geocodedPercent.toFixed(1) || 0}%)`}
              valueStyle={{ color: (stats?.geocodedPercent || 0) > 95 ? '#27ae60' : '#f39c12' }}
            />
          </Card>
        </Col>
        <Col span={6}>
          <Card>
            <Statistic
              title="Avg Confidence"
              value={stats?.avgConfidence.toFixed(1) || 0}
              valueStyle={{ color: (stats?.avgConfidence || 0) > 70 ? '#27ae60' : '#f39c12' }}
            />
          </Card>
        </Col>
        <Col span={6}>
          <Card>
            <Statistic
              title="Low Confidence"
              value={stats?.lowConfidenceCount || 0}
              prefix={<WarningOutlined />}
              valueStyle={{ color: (stats?.lowConfidenceCount || 0) > 50 ? '#e74c3c' : '#f39c12' }}
            />
          </Card>
        </Col>
      </Row>

      {/* Charts and Tables */}
      <Tabs
        items={[
          {
            key: 'overview',
            label: 'Overview',
            children: (
              <div>
                <Card title="Confidence Distribution" style={{ marginBottom: 24 }}>
                  {confidenceChartData && <Bar data={confidenceChartData} />}
                </Card>
                <Card title="Provider Performance">
                  <Table
                    dataSource={stats ? Object.entries(stats.providerBreakdown).map(([provider, count]) => ({
                      provider,
                      count
                    })) : []}
                    columns={[
                      { title: 'Provider', dataIndex: 'provider', key: 'provider' },
                      { title: 'Count', dataIndex: 'count', key: 'count' }
                    ]}
                    rowKey="provider"
                    pagination={false}
                  />
                </Card>
              </div>
            )
          },
          {
            key: 'low-confidence',
            label: `Low Confidence (${lowConfLocations.length})`,
            children: (
              <Table
                dataSource={lowConfLocations}
                columns={lowConfColumns}
                rowKey="id"
                loading={loading}
              />
            )
          },
          {
            key: 'duplicates',
            label: `Duplicates (${duplicates.length})`,
            children: (
              <Table
                dataSource={duplicates}
                columns={[
                  { title: 'Coordinates', key: 'coords', render: (_: any, record: any) =>
                    `${record.coordinates.latitude.toFixed(6)}, ${record.coordinates.longitude.toFixed(6)}`
                  },
                  { title: 'Count', dataIndex: 'count', key: 'count' },
                  { title: 'Addresses', key: 'addresses', render: (_: any, record: any) =>
                    record.locations.map((l: any) => l.address).join(', ')
                  }
                ]}
                rowKey={(record) => `${record.coordinates.latitude}-${record.coordinates.longitude}`}
              />
            )
          }
        ]}
      />
    </div>
  );
};

export default DataQualityDashboardPage;
```

### Geocode Statistics Service

```typescript
// locations.service.ts

import { prisma } from '@/config/database';
import type { GeocodeProvider } from '@prisma/client';
// Assumed export name and relative path; adjust to the actual geocoding module
import { geocodingService } from '../geocoding/geocoding.service';

export class LocationsService {
  async getGeocodeStats() {
    const locations = await prisma.location.findMany({
      select: {
        id: true,
        latitude: true,
        longitude: true,
        geocodeConfidence: true,
        geocodeProvider: true
      }
    });

    const total = locations.length;
    // Compare against null explicitly so a coordinate of exactly 0 still counts as geocoded
    const geocoded = locations.filter(l => l.latitude != null && l.longitude != null).length;

    // Locations without a confidence score count as 0 in the average
    const sumConfidence = locations.reduce((sum, l) => sum + (l.geocodeConfidence ?? 0), 0);
    const avgConfidence = total > 0 ? sumConfidence / total : 0;

    // Provider breakdown
    const providerBreakdown: Record<string, number> = {};
    locations.forEach(l => {
      const provider = l.geocodeProvider || 'UNKNOWN';
      providerBreakdown[provider] = (providerBreakdown[provider] || 0) + 1;
    });

    // Confidence distribution
    const confidenceDistribution = {
      '0-20': 0,
      '21-40': 0,
      '41-60': 0,
      '61-80': 0,
      '81-100': 0
    };

    locations.forEach(l => {
      const conf = l.geocodeConfidence ?? 0;
      if (conf <= 20) confidenceDistribution['0-20']++;
      else if (conf <= 40) confidenceDistribution['21-40']++;
      else if (conf <= 60) confidenceDistribution['41-60']++;
      else if (conf <= 80) confidenceDistribution['61-80']++;
      else confidenceDistribution['81-100']++;
    });

    // Locations with no confidence score are treated as 0 and therefore counted as low confidence
    const lowConfidenceCount = locations.filter(l => (l.geocodeConfidence ?? 0) < 50).length;
    const duplicatesCount = await this.countDuplicates();

    return {
      total,
      geocoded,
      geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
      avgConfidence,
      providerBreakdown,
      confidenceDistribution,
      lowConfidenceCount,
      missingCoordinates: total - geocoded,
      duplicatesCount
    };
  }

  // Returns the number of locations that share coordinates (to 6 decimal places)
  // with at least one other location, not the number of duplicate groups
  async countDuplicates(): Promise<number> {
    const locations = await prisma.location.findMany({
      where: {
        AND: [
          { latitude: { not: null } },
          { longitude: { not: null } }
        ]
      },
      select: { latitude: true, longitude: true }
    });

    const coordMap = new Map<string, number>();
    locations.forEach(l => {
      const key = `${l.latitude!.toFixed(6)},${l.longitude!.toFixed(6)}`;
      coordMap.set(key, (coordMap.get(key) || 0) + 1);
    });

    return Array.from(coordMap.values()).filter(count => count > 1).reduce((sum, count) => sum + count, 0);
  }

  async regeocode(locationId: number, provider?: GeocodeProvider) {
    const location = await prisma.location.findUnique({
      where: { id: locationId }
    });

    if (!location) {
      throw new Error('Location not found');
    }

    const result = await geocodingService.geocode(location.address, provider);

    if (!result) {
      throw new Error('Geocoding failed');
    }

    return await prisma.location.update({
      where: { id: locationId },
      data: {
        latitude: result.latitude,
        longitude: result.longitude,
        geocodeConfidence: result.confidence,
        geocodeProvider: result.provider,
        geocodedAt: new Date()
      }
    });
  }
}
```
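
`countDuplicates` only returns a total, while the dashboard's Duplicates tab reads groups with `coordinates`, `count`, and `locations` fields. A minimal sketch of how such groups could be built with the same coordinate-rounding key is shown below; `groupDuplicates` and the `GeoPoint` type are hypothetical, and the actual `/api/locations/duplicates` handler may work differently:

```typescript
interface GeoPoint {
  id: number;
  address: string;
  latitude: number | null;
  longitude: number | null;
}

interface DuplicateGroup {
  coordinates: { latitude: number; longitude: number };
  count: number;
  locations: GeoPoint[];
}

// Groups locations whose coordinates match after rounding to `decimals` places.
function groupDuplicates(locations: GeoPoint[], decimals = 6): DuplicateGroup[] {
  const groups = new Map<string, DuplicateGroup>();

  for (const loc of locations) {
    // Skip locations that have not been geocoded
    if (loc.latitude == null || loc.longitude == null) continue;

    const key = `${loc.latitude.toFixed(decimals)},${loc.longitude.toFixed(decimals)}`;
    const group = groups.get(key);
    if (group) {
      group.count++;
      group.locations.push(loc);
    } else {
      groups.set(key, {
        coordinates: { latitude: loc.latitude, longitude: loc.longitude },
        count: 1,
        locations: [loc]
      });
    }
  }

  // Only coordinates shared by two or more locations are duplicates
  return Array.from(groups.values()).filter(g => g.count > 1);
}

// Sample data for illustration only
const duplicateGroups = groupDuplicates([
  { id: 1, address: '123 Main St', latitude: 43.6532, longitude: -79.3832 },
  { id: 2, address: '123 Main Street', latitude: 43.6532, longitude: -79.3832 },
  { id: 3, address: '9 King St', latitude: 43.6489, longitude: -79.3775 },
  { id: 4, address: 'Unknown', latitude: null, longitude: null }
]);

console.log(duplicateGroups.length);   // 1
console.log(duplicateGroups[0].count); // 2
```

Lowering `decimals` widens the match radius, which is the trade-off discussed under the duplicate-location troubleshooting steps.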

## Troubleshooting

### Problem: Many low-confidence locations

**Symptoms:**
- More than 100 locations with confidence < 50
- Average confidence below 60
- Prometheus alert firing

**Solutions:**

1. **Check provider API keys:**
   ```bash
   # Test Google Geocoding API
   curl "https://maps.googleapis.com/maps/api/geocode/json?address=123+Main+St+Toronto&key=YOUR_KEY"

   # Verify the key is set in .env
   grep '^GEOCODE_GOOGLE_API_KEY=' .env
   ```

2. **Try different primary provider:**
   ```env
   # In .env, change primary provider (set exactly one value)
   # Most accurate:
   GEOCODE_PRIMARY_PROVIDER=GOOGLE
   # Good alternative (uncomment to use instead):
   # GEOCODE_PRIMARY_PROVIDER=MAPBOX
   ```

3. **Verify address format:**
   ```typescript
   // Bad: Missing city/postal
   "123 Main St"

   // Good: Full address
   "123 Main St, Toronto ON M5H 2N2"
   ```

4. **Use postal code for better accuracy:**
   ```typescript
   // Append postal code if available
   const fullAddress = location.postalCode
     ? `${location.address}, ${location.postalCode}`
     : location.address;
   ```

5. **Bulk re-geocode with Google:**
   ```bash
   # Via API (the JSON body needs an explicit Content-Type header)
   curl -X POST http://localhost:4000/api/locations/bulk-geocode \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"provider":"GOOGLE","confidenceThreshold":50}'
   ```
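
The address-format check in step 3 can be scripted before a bulk re-geocode so that incomplete addresses are fixed first. This is a rough sketch for Canadian addresses only; `addressIssues` and its heuristics are hypothetical and deliberately simple:

```typescript
// Canadian postal code, with or without the middle space (e.g. M5H 2N2 or M5H2N2)
const POSTAL_CODE = /\b[A-Z]\d[A-Z] ?\d[A-Z]\d\b/i;

// Returns a list of likely problems; an empty list means the address looks complete.
function addressIssues(address: string): string[] {
  const issues: string[] = [];
  const trimmed = address.trim();

  if (!/^\d+/.test(trimmed)) issues.push('missing street number');
  if (!trimmed.includes(',')) issues.push('missing city/province part');
  if (!POSTAL_CODE.test(trimmed)) issues.push('missing postal code');

  return issues;
}

console.log(addressIssues('123 Main St'));
// [ 'missing city/province part', 'missing postal code' ]
console.log(addressIssues('123 Main St, Toronto ON M5H 2N2'));
// []
```

Heuristics like these will flag some valid rural or unit-prefixed addresses, so the output is best treated as a review list rather than a hard filter.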

### Problem: Duplicate locations detected

**Symptoms:**
- Multiple locations at same coordinates
- Duplicates tab shows many groups
- Inflated location counts in cuts

**Solutions:**

1. **Check if legitimately multi-unit:**
   ```sql
   -- Find buildings with multiple addresses
   SELECT l.id, l.address, COUNT(a.id) as unit_count
   FROM "Location" l
   JOIN "Address" a ON a."locationId" = l.id
   GROUP BY l.id
   HAVING COUNT(a.id) > 1;
   ```

2. **Verify geocoding precision:**
   ```typescript
   // Check whether two locations only look distinct because of coordinate precision
   type Coords = { latitude: number; longitude: number };

   const isDuplicateRounding = (loc1: Coords, loc2: Coords): boolean => {
     // Use 4 decimal places (~11m precision) instead of 6 (~0.1m)
     return loc1.latitude.toFixed(4) === loc2.latitude.toFixed(4) &&
            loc1.longitude.toFixed(4) === loc2.longitude.toFixed(4);
   };
   ```

3. **Review NAR import process:**
   ```typescript
   // Ensure LOC_GUID unique constraint
   const location = await prisma.location.upsert({
     where: { locGuid: narRecord.LOC_GUID },
     update: { /* update fields */ },
     create: { /* create fields */ }
   });
   ```

4. **Merge duplicates:**
   ```typescript
   // Merge function
   const mergeDuplicates = async (primaryId: number, duplicateIds: number[]) => {
     // Never delete the primary location, even if it was passed in by mistake
     const ids = duplicateIds.filter(id => id !== primaryId);

     // Run both steps in one transaction so a failure cannot leave addresses
     // pointing at a deleted location
     await prisma.$transaction([
       // Move addresses to primary location
       prisma.address.updateMany({
         where: { locationId: { in: ids } },
         data: { locationId: primaryId }
       }),
       // Delete duplicates
       prisma.location.deleteMany({
         where: { id: { in: ids } }
       })
     ]);
   };
   ```

### Problem: Geocoding stats slow to load

**Symptoms:**
- `GET /api/locations/geocode-stats` takes > 5 seconds
- Dashboard timeout errors
- High database CPU

**Solutions:**

1. **Add database indexes:**
   ```sql
   -- camelCase column names must be double-quoted in PostgreSQL.
   -- CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
   CREATE INDEX CONCURRENTLY idx_locations_geocode_confidence
     ON "Location"("geocodeConfidence");

   CREATE INDEX CONCURRENTLY idx_locations_geocode_provider
     ON "Location"("geocodeProvider");

   CREATE INDEX CONCURRENTLY idx_locations_coords
     ON "Location"(latitude, longitude)
     WHERE latitude IS NOT NULL AND longitude IS NOT NULL;
   ```

2. **Cache stats in Redis:**
   ```typescript
   // Cache for 5 minutes
   const getCachedStats = async () => {
     const cached = await redis.get('geocode:stats');
     if (cached) return JSON.parse(cached);

     const stats = await locationsService.getGeocodeStats();
     await redis.setex('geocode:stats', 300, JSON.stringify(stats));
     return stats;
   };
   ```

3. **Aggregate in SQL:**
   ```typescript
   // Raw SQL for better performance: returns one row per provider
   // instead of loading every location into memory.
   // COUNT(*) values come back as BigInt; convert them before JSON serialization.
   const stats = await prisma.$queryRaw`
     SELECT
       "geocodeProvider",
       COUNT(*) as total,
       COUNT(*) FILTER (WHERE latitude IS NOT NULL AND longitude IS NOT NULL) as geocoded,
       AVG(COALESCE("geocodeConfidence", 0)) as avg_confidence,
       COUNT(*) FILTER (WHERE COALESCE("geocodeConfidence", 0) < 50) as low_confidence
     FROM "Location"
     GROUP BY "geocodeProvider"
   `;
   ```

4. **Materialize stats view:**
   ```sql
   -- Create materialized view
   CREATE MATERIALIZED VIEW geocode_stats_mv AS
   SELECT
     COUNT(*) as total,
     COUNT(*) FILTER (WHERE latitude IS NOT NULL AND longitude IS NOT NULL) as geocoded,
     AVG(COALESCE("geocodeConfidence", 0)) as avg_confidence,
     COUNT(*) FILTER (WHERE COALESCE("geocodeConfidence", 0) < 50) as low_confidence
   FROM "Location";

   -- Run on a schedule (e.g. hourly via cron); the view does not refresh itself
   REFRESH MATERIALIZED VIEW geocode_stats_mv;
   ```
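
Rows returned by a grouped raw query still have to be folded into the totals the dashboard expects, and `bigint` counts cannot be passed to `JSON.stringify` directly. The sketch below shows one way to do both, assuming rows shaped like the per-provider query in step 3; `ProviderRow` and `summarizeProviderRows` are hypothetical names:

```typescript
// One row per provider, as a grouped raw query would return it.
// Counts arrive as bigint; the average may be a number, string, or decimal object.
interface ProviderRow {
  geocodeProvider: string | null;
  total: bigint;
  geocoded: bigint;
  avg_confidence: unknown;
  low_confidence: bigint;
}

function summarizeProviderRows(rows: ProviderRow[]) {
  let total = 0;
  let geocoded = 0;
  let lowConfidenceCount = 0;
  let weightedConfidence = 0;
  const providerBreakdown: Record<string, number> = {};

  for (const row of rows) {
    const count = Number(row.total); // safe while counts stay below 2^53
    total += count;
    geocoded += Number(row.geocoded);
    lowConfidenceCount += Number(row.low_confidence);
    // Weight each provider's average by its row count to recover the overall average
    weightedConfidence += Number(row.avg_confidence ?? 0) * count;
    providerBreakdown[row.geocodeProvider ?? 'UNKNOWN'] = count;
  }

  return {
    total,
    geocoded,
    geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
    avgConfidence: total > 0 ? weightedConfidence / total : 0,
    lowConfidenceCount,
    missingCoordinates: total - geocoded,
    providerBreakdown
  };
}

// Sample rows for illustration only
const providerSummary = summarizeProviderRows([
  { geocodeProvider: 'GOOGLE', total: BigInt(3), geocoded: BigInt(3), avg_confidence: 90, low_confidence: BigInt(0) },
  { geocodeProvider: null, total: BigInt(1), geocoded: BigInt(0), avg_confidence: 0, low_confidence: BigInt(1) }
]);

console.log(providerSummary.avgConfidence); // 67.5
console.log(JSON.stringify(providerSummary.providerBreakdown)); // {"GOOGLE":3,"UNKNOWN":1}
```

The confidence distribution buckets are not covered here; they would need their own `FILTER` clauses in the query.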

## Performance Considerations

### Database Query Optimization

**Indexes:**
- `geocodeConfidence` (filtering)
- `geocodeProvider` (grouping)
- `(latitude, longitude)` composite, partial on non-null coordinates (duplicate detection)

**Query Performance:**
- geocode-stats: ~500ms (1500 locations)
- Low confidence filter: ~100ms (with index)
- Duplicate detection: ~200ms (coordinate grouping)
- Bulk re-geocode: ~2-5 min (150 locations, depends on provider)

### API Rate Limits

**Provider Limits (approximate; confirm against each provider's current pricing and usage policy):**
- Google: 50 QPS, $5/1000 requests
- Mapbox: 100,000/month free, then $0.50/1000
- Nominatim: 1 request/second on the public instance; heavy or bulk use is not permitted
- Photon: No official limit, self-hosted recommended
- ArcGIS: 100,000/month free

**Optimization:**
- Use Redis cache (30-day TTL)
- Batch geocoding jobs (avoid rate limits)
- Fall back to free providers for non-critical locations
- Monitor usage via provider dashboards
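
Batching to stay under a rate limit can be as simple as splitting the work into fixed-size chunks and pausing between them. The following is a minimal sketch under that assumption; `chunk`, `sleep`, and `runInBatches` are hypothetical helpers, and the real bulk geocoding job may use a queue instead:

```typescript
// Splits an array into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Runs `worker` over all items, one batch at a time, pausing between batches.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  pauseMs: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  const batches = chunk(items, batchSize);
  for (let i = 0; i < batches.length; i++) {
    // Items within a batch run concurrently
    results.push(...(await Promise.all(batches[i].map(worker))));
    // Pause before the next batch, but not after the last one
    if (i < batches.length - 1) await sleep(pauseMs);
  }
  return results;
}

console.log(chunk([1, 2, 3, 4, 5], 2)); // [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]
```

For a provider limited to 1 request per second, `runInBatches(addresses, 1, 1000, geocodeOne)` would keep the request rate at or below that limit, where `geocodeOne` stands for whatever function performs a single lookup.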

### Caching Strategy

**Cache Layers:**

1. **Application Cache (Redis):**
   ```typescript
   // 30-day TTL for geocode results
   const cacheKey = `geocode:${normalizeAddress(address)}`;
   await redis.setex(cacheKey, 2592000, JSON.stringify(result));
   ```

2. **Statistics Cache:**
   ```typescript
   // 5-minute TTL for stats
   await redis.setex('geocode:stats', 300, JSON.stringify(stats));
   ```

3. **Provider Response Cache:**
   ```typescript
   // 7-day TTL for raw provider responses, cached separately from parsed results
   await redis.setex(`provider:${provider}:${address}`, 604800, JSON.stringify(rawResponse));
   ```

**Cache Hit Rates:**
- Geocoding: 90%+ (repeated addresses)
- Statistics: 95%+ (frequent dashboard views)
- Provider responses: 85%+ (re-geocoding attempts)

## Related Documentation

### Backend Documentation

- **Locations Service:** `api/src/modules/map/locations/locations.service.ts`
  - Geocode stats aggregation
  - Duplicate detection
  - Re-geocoding operations

- **Geocoding Service:** `api/src/modules/map/geocoding/geocoding.service.ts`
  - Multi-provider fallback
  - Confidence calculation
  - Cache integration

- **Bulk Geocoding:** `api/src/modules/map/locations/bulk-geocode.routes.ts`
  - Job queue integration
  - Progress tracking
  - Error handling

### Frontend Documentation

- **Data Quality Dashboard:** `admin/src/pages/DataQualityDashboardPage.tsx`
  - Statistics display
  - Charts and tables
  - Bulk actions

- **Locations Page:** `admin/src/pages/LocationsPage.tsx`
  - CSV import/export
  - Inline geocoding
  - Address editing

### Database Documentation

- **Location Model:** `api/prisma/schema.prisma`
  - Geocoding metadata fields
  - Indexes for performance
  - Relations to Address

### Monitoring Documentation

- **Prometheus Metrics:** `api/src/utils/metrics.ts`
  - Custom geocoding metrics
  - Quality gauges
  - Alert integration

- **Grafana Dashboard:** `configs/grafana/dashboards/data-quality.json`
  - Quality trend charts
  - Provider comparison
  - Alert visualization

### External Resources

- **Google Geocoding API:** https://developers.google.com/maps/documentation/geocoding
- **Mapbox Geocoding API:** https://docs.mapbox.com/api/search/geocoding
- **Nominatim API:** https://nominatim.org/release-docs/latest/api/Search
- **Photon API:** https://photon.komoot.io