Data Engineering

How to Match Personal Data Across Systems Without PII Leaks

· 16 min read

Most cross-channel analytics projects do not fail at the dashboard. They fail at the join. Two systems say they have the same user; the data layer says otherwise, and the analyst spends a Tuesday afternoon deciding whether julia.adams@gmail.com and Julia A. with phone +1 (415) 555-2014 are the same person or two different ones. Multiply that by ten thousand customers a month and you have a data quality crisis dressed up as a reporting problem.

In This Article

  1. The setup, and why simple joins stop working
  2. The naive fix, and why it stops scaling
  3. The cross-identifier pattern
  4. Implementation in the wild
  5. What changed after the rollout
  6. The failure modes worth knowing
  7. When you do, when you don’t

This is the story of how an edtech startup we worked with solved that problem at scale, and the broader playbook that comes out of it. The pattern is universal: any company with more than two customer-facing systems eventually runs into the same wall, and the fix is the same in shape if not in detail.

The setup, and why simple joins stop working

The client was a fast-growing online learning company with three core systems holding student data. The CRM captured the marketing-to-enrollment journey. The learning platform held everything from course progress to certificate issuance. A community chat tool tracked student support conversations. Thirteen other tools touched the data along the way, but those three carried the majority of the analytical weight.

Every student touched all three systems within their first week. Each system collected a slightly different shape of identity: the CRM had the marketing-form fields (full name, country, primary email, sometimes a phone number); the learning platform had whatever the student typed when they logged in for the first time, which was often a different email and sometimes a slightly different name; the chat tool used whatever handle the student picked when they joined the community channel.

The analytics layer was simple: a daily Python job pulled snapshots from each system into BigQuery, and a Tableau workbook turned those tables into dashboards. The first version of the pipeline matched students by email, on the assumption that an email address is the closest thing to a unique identifier in consumer software. That assumption held for the first hundred students and broke as soon as the company grew past a few thousand.

b21-sources
b21-sources
The breakage shape is predictable. Three categories of failure dominate. Human error is the loud one. Students mistype their email when registering, or use one email for marketing and another at login. Names get casing variations, with diacritics, with nickname forms. Phone numbers come in with country codes, without country codes, with dashes, with spaces, with no formatting at all.

The quiet failure mode is multiplicity. A single student often has two real emails (work and personal), two real phone numbers (mobile and the one their parent put on the marketing form), and three real names (legal name on the certificate, nickname in chat, anglicized name on the platform). All of those values are correct. None of them are reliably joinable to each other.

The third failure mode is platform constraint. Some systems force phone numbers into E.164. Others store whatever was typed. Some systems lowercase emails on input; others preserve the casing. Some systems strip diacritics; others don’t. Two records with identical canonical content can have different stored representations because the systems disagreed on what “canonical” means.

The cumulative effect is that month over month, dozens of students appear in two of the three systems but not the third, or appear in all three with subtle field differences that prevent a clean join. At the scale the company hit by month nine, that was eighty to ninety students per month with broken joins, each one requiring an analyst to manually reconcile the record before any cohort or revenue analysis could be trusted.

The naive fix, and why it stops scaling

The first response to a broken join is always the same: write a smarter match. Match on email, then fall back to phone, then fall back to fuzzy name. Lowercase everything, strip dashes from phones, normalize Unicode. Add a manual review queue for the records that fail all three.

This works at a hundred students. It fails at a thousand. It catches fire at ten thousand. The problem is not that the matching script is wrong. The problem is that the matching script is the wrong place to solve this.

b21-accuracy
b21-accuracy
Field-level match accuracy on real data is harsh. Email is the most reliable single key, and it still only matches cleanly about eighty-five percent of the time at the scale the client was operating at. Phone is worse, around sixty percent, because of the formatting variance noted above. Name-only matching is essentially a coin flip past a few thousand records. Composite matching (any-two-of-three) gets you to high nineties, but the false-positive rate also climbs because composite rules will happily merge two genuinely different students who share a name and live in the same city.

The deeper problem is that every analytical query in the warehouse contains an implicit dependency on the matching logic. Change the match rule and every historical dashboard shifts. Add a fallback and last month’s cohort retention number changes by half a percent. The matching logic is doing analytical work, but it lives in a script that nobody owns and that gets edited under deadline pressure to fix whatever join broke that morning.

You also can’t solve this problem with PII in scope. The reason the matching script keeps growing is that the analyst has to keep looking at full names, full emails, and full phone numbers to debug it. Every time you add a contractor or rotate an analyst, you are widening the blast radius of a potential PII leak. At some point your security team notices and your matching script becomes a compliance problem on top of a data quality one.

The cross-identifier pattern

The fix is structural. Pull the matching logic out of the analytical pipeline and into the data layer itself, and represent the result as a single canonical identifier that every downstream query joins on.

In practice, this means three things. First, a dedicated table of canonical entities, in our case a students table with one row per real student and a student_id column that nothing else in the warehouse generates. Second, a set of mapping tables, one per source system, that maps every observed identifier in that system back to the canonical student_id. Third, a single matching service (or, more often, a single dbt model with carefully versioned logic) that maintains those mapping tables on a schedule and exposes the matching decisions for review.

Every analytical query then joins on student_id, not on email or phone. The matching logic still exists, but it lives in one place, runs on a schedule, and is the only thing in the warehouse that touches raw PII. The analytical layer sees only the canonical ID and whatever non-PII attributes the student has (country, course enrollment, signup date, and so on).

This sounds like a small reorganization. It is not. It is the difference between a data warehouse that scales and one that doesn’t.

Implementation in the wild

The implementation we shipped for this client looked like this. A nightly job loaded snapshots of each source system into the warehouse with the original field names preserved. A dbt model called student_identities built the canonical students table, with one row per student and columns for the assigned student_id, the consolidated personal data, country, and a few other low-cardinality attributes the company actually needed for analysis.

For each source system, a _to_student_id mapping table held the per-system identifier (CRM contact ID, platform user ID, community user ID) and the corresponding student_id. The matching logic itself was a sequence of join rules expressed in dbt: email-exact, email-normalized (lowercase, strip aliases), phone-normalized (E.164, strip non-digits), and a final composite rule for the few records that did not match on any single field. Every match decision was tagged with the rule that produced it, which made debugging vastly easier than the previous setup.

Downstream models then read student_id directly. The query at the heart of the analytics layer looked roughly like this:

WITH platform_amo_refunds AS (
  SELECT DISTINCT
    student_id,
    course_id,
    signup_date,
    refund_status
  FROM platform_to_student
  JOIN crm_to_student USING (student_id)
  LEFT JOIN refunds USING (student_id)
),

with_community AS (
  SELECT * EXCEPT (channel_id, channel),
    IFNULL(channel_id, 'Not in community') AS channel_id,
    IFNULL(channel, 'Not in community') AS channel
  FROM platform_amo_refunds
  FULL OUTER JOIN community_to_student USING (student_id)
)

SELECT * FROM with_community;

What is worth noticing about that query is what it does not contain. No LOWER(email). No REGEXP_REPLACE on phone numbers. No nested COALESCE over three different name fields. The joining is trivial because the canonical IDs do the work the matching script used to do. The query reads like a join across three systems, because that is what it is.

In Tableau, records that did not match any system showed up as null student_id. That gave the team a free quality metric: every day they could check how many records came in unmatched, what source they came from, and which match rule should have caught them. That information used to be invisible. Now it was a queryable column.

What changed after the rollout

b21-after
b21-after
The numbers tell the story most cleanly. Before the cross-identifier rollout, the team was dealing with eighty to ninety unmatched records per month and roughly the same number of records that matched incorrectly. The analyst on rotation was spending two to three days a month manually reconciling identities, which is to say, looking at full names and emails and making judgment calls.

After the rollout, the unmatched count dropped to roughly twenty per month, a seventy-five percent reduction. The genuinely broken records (the ones that genuinely required human eyes) became identifiable on a daily basis instead of accumulating into a monthly reconciliation queue. Most importantly, the analytical layer became boring again. Cohort numbers stopped shifting when someone edited the matching script. The matching script stopped existing as a separate concept, because it was now a dbt model with tests and version control.

A side benefit emerged that nobody had explicitly designed for. Because PII now lived only in the matching layer, the team could open up the analytical warehouse to a much broader set of users (product, marketing, finance) without expanding the surface area of PII exposure. The analytics team gained reviewers; the security team gained sleep.

The failure modes worth knowing

A few patterns I have seen go wrong with this approach are worth flagging.

The first is treating matching as a one-time project. The cross-identifier pattern is not a one-shot migration. It is an ongoing data engineering responsibility. New source systems are added, formats drift, edge cases accumulate. The matching layer needs an owner and a maintenance cadence, not a single sprint.

The second is overconfidence in the matching rules. Even a well-tuned ruleset will produce wrong matches for a small percentage of records. Always tag each match with the rule that produced it, and always preserve the raw per-system identifiers in the mapping tables. When a stakeholder asks “why does this student have two enrollments,” you need to be able to answer the question. That requires keeping the receipts.

The third is letting PII leak back into downstream models. The whole point of the pattern is to isolate PII to the matching layer. The first time someone writes a downstream model that re-joins to the raw students table to grab an email “for one quick report,” the boundary starts to dissolve. Lock it down at the warehouse-permissions level if you can; otherwise be aggressive with code review.

When you do, when you don’t

The cross-identifier pattern is the right answer if you have three or more customer-facing systems, more than a few thousand customers, any nontrivial level of PII regulation (GDPR, CCPA, HIPAA, FERPA), or any analytics workload that joins across systems on customer identity. That covers most B2C businesses past Series A.

It is overkill if you have two systems and a few hundred customers. At that scale, a dbt model with a smart join clause and good tests is enough. Build the pattern when you start to feel the pain, not before.

It is also not a substitute for fixing the upstream systems. If your CRM is letting users register with five different formats for the same phone number, that is a process problem at the source, and the matching layer can only do so much. Push back upstream wherever you can; the matching layer is the last line of defense, not the first.

The pattern looks like a small data engineering investment, and it is, in raw hours of build time. The payoff is that the analytical layer stops fighting with itself. Reports become trustworthy on the first read, not the fourth. Analysts stop being part-time identity resolvers. That is the version of analytics that a growing business needs and almost never has by accident.

Keep reading

Enjoyed this article?

Get weekly data strategy insights delivered to your inbox.

Get in Touch

Let's Discuss Your Project

Book a 30-minute discovery call. We'll assess your data maturity and recommend the right approach — no strings attached.

Book a Discovery Call →
Need help with your data strategy? Book a Discovery Call →