Validate Addresses
Overview
validate() answers a yes/no question: does this address string match a real PSGC record above a confidence threshold? Use it for data cleaning and gatekeeping; for ranked candidate review use search_fuzzy() instead.
Single address
from barangay import validate
v = validate("Tongmageng, Sitangkai, Tawi-Tawi")
print(v.valid, v.matched_name, v.score)  # True Tongmageng 100.0
Batch validation
from barangay import validate_many
for r in validate_many(["Tongmageng, Sitangkai, Tawi-Tawi", "Nonexistent Place"], threshold=80.0):
    print(f"{r.input!r} -> {'valid' if r.valid else 'invalid'}")
Tuning
Threshold
validate() defaults to threshold=95.0, a strict setting suited to clean canonical input. Lower it for real-world data, where abbreviations and dropped levels cost a few points. The same misspelled, partially supplied address changes its verdict as the threshold rises:
from barangay import validate
validate("Tongmagn Sitangkai Tawi", threshold=60.0) # valid=True score=88.46
validate("Tongmagn Sitangkai Tawi", threshold=80.0) # valid=True score=88.46
validate("Tongmagn Sitangkai Tawi", threshold=95.0) # valid=False (88.46 < 95)
Default 95.0 is the safe choice when your input is well-formed; drop to 80.0 when reconciling free-text user input or legacy records.
validate() returns a verdict, not candidates. When you need to see the alternatives (e.g. low-confidence rows flagged for human review), call search_fuzzy() — its score and limit let you inspect the ranked shortlist.
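One way to act on a score is to split it into three bands rather than two, so that middling matches go to a reviewer instead of being silently accepted or dropped. The sketch below is only an illustration: triage() is a hypothetical helper, not part of the package. The 95.0 and 80.0 cut-offs are borrowed from the thresholds discussed above, the "review" band is an assumption, and the sample scores are hard-coded rather than produced by validate().

```python
def triage(score: float, accept_at: float = 95.0, review_at: float = 80.0) -> str:
    """Map a match score to 'accept', 'review', or 'reject'.

    Hypothetical helper: the band names and the inclusive comparisons
    are assumptions, not behaviour documented by the package.
    """
    if review_at > accept_at:
        raise ValueError("review_at must not exceed accept_at")
    if score >= accept_at:
        return "accept"
    if score >= review_at:
        return "review"
    return "reject"


# (input, score) pairs; the first two scores are quoted from the examples
# above, the last one is made up for illustration.
rows = [
    ("Tongmageng, Sitangkai, Tawi-Tawi", 100.0),
    ("Tongmagn Sitangkai Tawi", 88.46),
    ("Nonexistent Place", 12.0),
]

for text, score in rows:
    print(f"{text!r} -> {triage(score)}")
```

In practice the score would come from the object validate() returns, and rows landing in the "review" band are the natural ones to pass to search_fuzzy() for a ranked shortlist.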
Abbreviations
Address strings often use abbreviations the masterlist doesn’t (Brgy. for Barangay, Pob. for Poblacion, Mun./City, roman vs arabic numerals). The package normalizes common forms internally; where it can’t, normalize your side first. A few recurring ones:
| Abbreviation | Expands to |
|---|---|
| Brgy. / Bgy. | Barangay |
| Pob. | Poblacion |
| Mun. | Municipality |
| City | City of … |
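For sources the built-in normalization doesn't cover, a small regex-based expander is one way to normalize on your side. This is a minimal sketch, not part of the package: ABBREVIATIONS and expand_abbreviations() are hypothetical names, and the map only covers the first three rows of the table. The City row is left out because rewriting "X City" as "City of X" needs the place name, and numeral conversion is not handled either.

```python
import re

# Assumed map, taken from the table above; extend it for your own sources.
ABBREVIATIONS = {
    "brgy": "Barangay",
    "bgy": "Barangay",
    "pob": "Poblacion",
    "mun": "Municipality",
}

# Match a known abbreviation as a whole word, plus an optional trailing period.
# The word boundaries keep "Mun" inside "Municipality" from matching.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b\.?",
    flags=re.IGNORECASE,
)


def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations with their long forms (hypothetical helper)."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)


print(expand_abbreviations("Brgy. 291, Pob., City of Manila"))
# Barangay 291, Poblacion, City of Manila
```

The expanded string can then be passed to validate() or validate_many() as usual; whether it clears a given threshold still depends on the masterlist entry.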
Sanitize before validating
sanitize_input() is a stateless normalizer — it lowercases and strips stray punctuation. It’s cheap to run per row and improves match rates on noisy input without changing semantics:
from barangay import sanitize_input, validate_many
raw = ["Brgy. 291, Pob., City of Manila", "Tongmageng, Sitangkai, Tawi-Tawi"]
clean = [sanitize_input(s) for s in raw]
for r in validate_many(clean, threshold=80.0):
    print(r.input, "->", "valid" if r.valid else "invalid")
# brgy. 291, pob., city of manila -> invalid (abbreviations still defeat it)
# tongmageng, sitangkai, tawi-tawi -> valid
sanitize_input() normalizes casing and punctuation; it does not expand abbreviations. Pair it with your own abbreviation map for the noisiest sources.