What an identity graph actually is
An identity graph is a probabilistic, time-varying mapping from identifiers (device IDs, hashed contact records, addresses, behavioral fingerprints) to a unified entity (a person, or in some cases a household). It is probabilistic because linkage is not certain: most observed identifiers arrive without a deterministic key. It is time-varying because identifiers change: people move, change phones, change emails, marry, divorce.
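As a rough illustration of "probabilistic" and "time-varying", the sketch below stores each identifier-to-entity link with a confidence and a validity interval, then resolves an identifier as of a given date. The `Link` class, its field names, and the `resolve` helper are hypothetical and chosen for this example only; they are not the production schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Link:
    identifier: str                  # e.g. a hashed email or a device ID
    entity_id: str                   # the person (or household) it maps to
    confidence: float                # linkage confidence in [0, 1]
    valid_from: date                 # first date the link is considered current
    valid_to: Optional[date] = None  # None means the link is still current

def resolve(links, identifier, as_of):
    """Return (entity_id, confidence) for the best link valid on `as_of`, else None."""
    candidates = [
        link for link in links
        if link.identifier == identifier
        and link.valid_from <= as_of
        and (link.valid_to is None or as_of < link.valid_to)
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda link: link.confidence)
    return best.entity_id, best.confidence

# The same device changes hands in mid-2023 (illustrative data).
links = [
    Link("device:abc", "person-1", 0.95, date(2021, 1, 1), date(2023, 6, 1)),
    Link("device:abc", "person-2", 0.80, date(2023, 6, 1)),
]
```

Resolving `device:abc` for a date in 2022 returns `person-1`, while a date in 2024 returns `person-2`, which is the time-varying behavior described above.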
Construction
Edges in the graph are scored by Fellegi-Sunter linkage logic adapted for high-cardinality identifier classes. Strong-edge candidates (matched hashed PII fields, deterministic shared keys) anchor the graph; weak-edge candidates (co-occurrence, shared IP-time patterns, behavioral similarity) are scored and added with explicit confidence. Edges are continuously refreshed; identifiers that have not been observed within their decay window are demoted but not deleted.
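The scoring step can be sketched with the classic Fellegi-Sunter match weight, which sums a log-likelihood ratio per compared field under a conditional-independence assumption. Everything numeric here is an assumption made for illustration: the m/u probabilities, the prior, the 180-day decay window, and the demotion factor are placeholders, not the values used in the graph.

```python
import math

# Hypothetical per-field parameters:
#   m = P(field agrees | records belong to the same entity)
#   u = P(field agrees | records belong to different entities)
FIELD_PARAMS = {
    "hashed_email": (0.95, 0.0001),     # strong-edge candidate
    "hashed_phone": (0.90, 0.0005),     # strong-edge candidate
    "shared_ip_window": (0.60, 0.05),   # weak-edge candidate
}

def match_weight(agreements):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PARAMS[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total

def edge_confidence(agreements, prior=0.001):
    """Convert the match weight to a posterior probability via Bayes' rule on odds."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** match_weight(agreements)
    return posterior_odds / (1 + posterior_odds)

def demote_if_stale(confidence, days_since_seen, decay_window_days=180, factor=0.5):
    """Demote (never delete) an edge whose identifier is outside its decay window."""
    if days_since_seen > decay_window_days:
        return confidence * factor
    return confidence

strong = edge_confidence({"hashed_email": True, "hashed_phone": True})
weak = edge_confidence({"shared_ip_window": True})
```

With these placeholder parameters, two agreeing hashed PII fields yield a confidence close to 1, while a lone IP-time co-occurrence stays low, mirroring the strong-edge and weak-edge distinction.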
Confidence calibration
Every node in the graph carries an aggregate linkage confidence in [0,1]. Calibration is performed against a labeled panel where ground-truth linkage is known. On that panel, v4.7 of the graph reaches 97.4% precision at the 0.9 threshold: when the graph claims linkage with confidence > 0.9, the claim is correct 97.4% of the time across the test cohort. Downstream scoring weights signal contribution by node confidence; a high-signal observation from a low-confidence linkage contributes less than the same observation from a high-confidence linkage.
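The calibration figure is a precision measured above a confidence threshold, which can be checked as in the sketch below. The panel rows are made-up toy data, the function names are hypothetical, and the linear weighting in `weighted_signal` is only one possible way to make low-confidence linkages contribute less; the source does not specify the weighting function.

```python
def precision_above(panel, threshold=0.9):
    """Share of claimed links (confidence > threshold) that are true links.

    `panel` is an iterable of (confidence, is_true_link) pairs.
    Returns None when no link clears the threshold.
    """
    claimed = [is_true for confidence, is_true in panel if confidence > threshold]
    if not claimed:
        return None
    return sum(claimed) / len(claimed)

def weighted_signal(signal_strength, node_confidence):
    """Assumed linear weighting: scale the signal by the node's linkage confidence."""
    return signal_strength * node_confidence

# Toy labeled panel: (graph confidence, ground-truth linkage).
panel = [
    (0.95, True),
    (0.97, True),
    (0.92, False),
    (0.99, True),
    (0.40, False),
    (0.85, True),
]

precision = precision_above(panel)  # 3 of the 4 links above 0.9 are correct
```

On the toy panel the result is 0.75; the same computation over the real labeled panel is what a figure such as 97.4% would summarize.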
Household resolution
For many verticals (real estate, finance, insurance, healthcare), household resolution matters more than individual resolution. Household entities are constructed by clustering individuals on address co-location, name patterns, and shared behavioral indicators. Household confidence is reported separately from individual confidence.
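A minimal sketch of household clustering follows, using union-find over people who share an address key. Two residents are joined when they also share a surname key or appear in a set of behaviorally linked pairs. These rules, the `cluster_households` name, and the input shapes are assumptions for illustration; confidence scoring for the resulting households is omitted.

```python
from collections import defaultdict
from itertools import combinations

def cluster_households(people, shared_behavior=frozenset()):
    """Group person IDs into households.

    people: dict of person_id -> (address_key, surname_key)
    shared_behavior: set of frozenset({id_a, id_b}) pairs with a shared indicator
    """
    parent = {pid: pid for pid in people}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Only people at the same address are ever compared.
    by_address = defaultdict(list)
    for pid, (address, _surname) in people.items():
        by_address[address].append(pid)

    for residents in by_address.values():
        for a, b in combinations(residents, 2):
            same_surname = people[a][1] == people[b][1]
            behaves_together = frozenset((a, b)) in shared_behavior
            if same_surname or behaves_together:
                parent[find(a)] = find(b)

    households = defaultdict(set)
    for pid in people:
        households[find(pid)].add(pid)
    return sorted(sorted(members) for members in households.values())

people = {
    "p1": ("addr-1", "lee"),
    "p2": ("addr-1", "lee"),
    "p3": ("addr-1", "ortiz"),
    "p4": ("addr-2", "lee"),
}
```

In this toy data, `p1` and `p2` form a household on address plus surname, and `p3` joins it only if a shared behavioral indicator links `p3` to one of them.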
Privacy and compliance
All operations are hashed-first. The graph ingests hashed identifiers; raw PII is not stored or transacted on. Consent provenance is tracked per identifier; identifiers without verifiable consent are excluded from outputs. The architecture supports GDPR Article 17 (right to erasure) at the entity level — a verified request removes the entity and all derived signals.
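The hashed-first, consent-gated, erasable design can be sketched as a small in-memory store. This is an illustration under stated assumptions: unsalted SHA-256 over a trimmed, lowercased string is a placeholder for whatever hashing scheme is actually used, and the `GraphStore` class with its methods is hypothetical.

```python
import hashlib

def hash_identifier(raw):
    """Normalize then hash, so only the digest ever enters the graph (assumed scheme)."""
    normalized = raw.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class GraphStore:
    def __init__(self):
        self.entity_of = {}  # hashed identifier -> entity_id
        self.consent = {}    # hashed identifier -> consent record reference, or None
        self.signals = {}    # entity_id -> list of derived signals

    def ingest(self, hashed_id, entity_id, consent_ref=None):
        """Store a hashed identifier along with its consent provenance."""
        self.entity_of[hashed_id] = entity_id
        self.consent[hashed_id] = consent_ref
        self.signals.setdefault(entity_id, [])

    def output_identifiers(self, entity_id):
        """Only identifiers with a consent record are eligible for output."""
        return sorted(
            h for h, e in self.entity_of.items()
            if e == entity_id and self.consent.get(h)
        )

    def erase_entity(self, entity_id):
        """Entity-level erasure: drop its identifiers, consent records, and signals."""
        for h in [h for h, e in self.entity_of.items() if e == entity_id]:
            del self.entity_of[h]
            self.consent.pop(h, None)
        self.signals.pop(entity_id, None)

store = GraphStore()
email_hash = hash_identifier(" Alice@Example.com ")
phone_hash = hash_identifier("+1-555-0100")
store.ingest(email_hash, "e1", consent_ref="consent-001")
store.ingest(phone_hash, "e1")  # no verifiable consent on this identifier
```

Here `output_identifiers("e1")` returns only the email hash, and `erase_entity("e1")` removes every identifier and derived signal tied to the entity.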
Signal half-life — production model
Predictive cohort vs. cold list
Citations
- Fellegi, I., & Sunter, A. A Theory for Record Linkage. JASA, 1969.
- Christen, P. Data Matching: Concepts and Techniques for Record Linkage. Springer, 2012.
- Steorts, R. C., et al. Performance bounds for graphical record linkage. AISTATS, 2017.