LSH (Locality-Sensitive Hashing)
LSHA family of hashing techniques designed so that similar inputs map to the same hash bucket with high probability — used to find approximate-match candidates in large datasets without comparing every pair.
Locality-Sensitive Hashing addresses the scaling problem in record linkage: comparing every pair of records in two datasets is O(n*m) and quickly becomes intractable above a few hundred thousand rows. LSH hashes records such that records likely to be similar collide in the same bucket, reducing comparison to within-bucket pairs only.
Several LSH families exist for different similarity measures: MinHash for Jaccard similarity over sets (useful for token-overlap matching of free-text fields), SimHash for cosine similarity over vectors (used in near-duplicate document detection), and random-projection LSH for Euclidean distance.
In reconciliation, LSH is typically used as a 'blocking' stage before more expensive scoring. ReconPe's ACRE engine uses LSH as the third blocking layer (after exact key and fuzzy key) to catch matches where reference identifiers have been corrupted — typos, partial truncation, or reformatting — and the only remaining signal is approximate text similarity.