Validate Addresses

Recipe: validate Philippine addresses with validate() and validate_many(), tune thresholds, and process addresses in batch.
Author

bendlikeabamboo

Overview

validate() answers a yes/no question: does this address string match a real PSGC record above a confidence threshold? Use it for data cleaning and gatekeeping; for ranked candidate review use search_fuzzy() instead.

Single address

from barangay import validate

v = validate("Tongmageng, Sitangkai, Tawi-Tawi")
print(v.valid, v.matched_name, v.score)  # True Tongmageng 100.0

Batch validation

from barangay import validate_many

for r in validate_many(["Tongmageng, Sitangkai, Tawi-Tawi", "Nonexistent Place"], threshold=80.0):
    print(f"{r.input!r} -> {'valid' if r.valid else 'invalid'}")
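
For data cleaning, a batch run usually ends with the rows split into those that passed and those that need attention. The following is a minimal sketch, assuming only that each result exposes the input and valid attributes used above. partition_results is a hypothetical helper, not part of the package, and the SimpleNamespace objects stand in for real results so the snippet runs without the package installed.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def partition_results(results: Iterable) -> Tuple[List[str], List[str]]:
    """Split result objects into (valid_inputs, invalid_inputs).

    Hypothetical helper: relies only on the .input and .valid attributes.
    """
    valid, invalid = [], []
    for r in results:
        (valid if r.valid else invalid).append(r.input)
    return valid, invalid


# Stand-in results; in practice pass the output of validate_many(...).
demo = [
    SimpleNamespace(input="Tongmageng, Sitangkai, Tawi-Tawi", valid=True),
    SimpleNamespace(input="Nonexistent Place", valid=False),
]
ok, rejected = partition_results(demo)
print(ok)
print(rejected)
```

The rejected list is a natural input for a second pass at a lower threshold or for manual review.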

Tuning

Threshold

validate() defaults to threshold=95.0, a strict setting suited to clean, canonical input. Lower it for real-world data, where abbreviations and dropped levels cost a few points. The same misspelled, partially supplied address gets a different verdict depending on the threshold:

from barangay import validate

validate("Tongmagn Sitangkai Tawi", threshold=60.0)   # valid=True   score=88.46
validate("Tongmagn Sitangkai Tawi", threshold=80.0)   # valid=True   score=88.46
validate("Tongmagn Sitangkai Tawi", threshold=95.0)   # valid=False  (88.46 < 95)

The default of 95.0 is the safe choice when your input is well-formed; drop to 80.0 when reconciling free-text user input or legacy records.
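
The three calls above differ only in the cutoff. As a rough illustration of that comparison, here is a sketch that assumes the verdict is simply score >= threshold. Whether a score exactly equal to the threshold passes is an assumption, and verdict is a hypothetical function, not the package's implementation.

```python
def verdict(score: float, threshold: float = 95.0) -> bool:
    """Assumed rule: valid when the match score reaches the threshold."""
    return score >= threshold


score = 88.46  # score reported above for "Tongmagn Sitangkai Tawi"
for threshold in (60.0, 80.0, 95.0):
    print(threshold, verdict(score, threshold))
```

Under this assumed rule the score itself never changes; only the cutoff decides which side of the line a row falls on.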

Important: validate is pass/fail

validate() returns a verdict, not candidates. When you need to see the alternatives (e.g. low-confidence rows flagged for human review), call search_fuzzy() — its score and limit let you inspect the ranked shortlist.

Abbreviations

Address strings often use abbreviations the masterlist doesn’t (Brgy. for Barangay, Pob. for Poblacion, Mun./City, roman vs arabic numerals). The package normalizes common forms internally; where it can’t, normalize your side first. A few recurring ones:

Abbreviation    Expands to
Brgy. / Bgy.    Barangay
Pob.            Poblacion
Mun.            Municipality
City            City of …

Sanitize before validating

sanitize_input() is a stateless normalizer — it lowercases and strips stray punctuation. It’s cheap to run per row and improves match rates on noisy input without changing semantics:

from barangay import sanitize_input, validate_many

raw = ["Brgy. 291, Pob., City of Manila", "Tongmageng, Sitangkai, Tawi-Tawi"]
clean = [sanitize_input(s) for s in raw]
for r in validate_many(clean, threshold=80.0):
    print(r.input, "->", "valid" if r.valid else "invalid")
# brgy. 291, pob., city of manila -> invalid  (abbreviations still defeat it)
# tongmageng, sitangkai, tawi-tawi -> valid

sanitize_input() normalizes casing and punctuation; it does not expand abbreviations. Pair it with your own abbreviation map for the noisiest sources.
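
One way to build such an abbreviation map is a small regex-based expander, sketched below with the standard library only. ABBREVIATIONS and expand_abbreviations are hypothetical names, not part of the package, and the entries cover only the simple one-to-one rows of the table above. The "City" row is left out because it needs reordering rather than plain substitution.

```python
import re

# Hypothetical map built from the table above; extend it for your own data.
ABBREVIATIONS = {
    "brgy": "Barangay",
    "bgy": "Barangay",
    "pob": "Poblacion",
    "mun": "Municipality",
}

# Match a whole abbreviation, case-insensitively, with an optional trailing dot.
_PATTERN = re.compile(
    r"\b(" + "|".join(ABBREVIATIONS) + r")\b\.?",
    flags=re.IGNORECASE,
)


def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations with their long forms."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)


print(expand_abbreviations("Brgy. 291, Pob., City of Manila"))
# Barangay 291, Poblacion, City of Manila
```

Because the pattern is case-insensitive and the trailing dot is optional, the expander can run either before or after sanitize_input(). Whether the expanded string then clears a given threshold depends on the masterlist entry and is not shown here.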

See also