Why the \(k\) in \(k\)-anonymity does not reflect any re-identification probability: defeating syntactic privacy

14 minute read

Over the years I’ve repeatedly heard the claim that \(k\)-anonymity means that the re-identification probability is at most \(\frac{1}{k}\). In an important sense this is misleading. The release of a \(k\)-anonymous version of a database can lead to the exposure of the original database, rendering the probability of re-identification, or indeed the probability of any other inference, in \(k\)-anonymous databases irrelevant.

To demonstrate this, we will look at specific examples of \(k\)-anonymous data being used to reveal the underlying data and of how \(k\)-anonymity does not protect against other type of inferences we might consider as violating privacy.

Following the examples, we make an argument that such examples cannot be assumed to be rare and therefore dismissed. It involves a game played between Alice and Bob, in which Bob wins if he can use \(k\)-anonymous information given to him to infer which one of two databases Alice chooses.

K-anonymity and a possible origin of the \(1/k\) claim

Before we get to the fun part, we will quickly refresh what \(k\)-anonymity is, and where the \(1/k\) probability claim might have originated. Just look at the tables and skip to the next section if you are familiar with \(k\)-anonymity.

In the following we view a database as a list of records. A record in turn is a list of values that each correspond to a particular observation or “attribute.” As we take all records to be of equal length, a database can be represented by a simple table in which rows are records and columns contain values for an attribute. An example database \(D_{a}\) with four rows and three columns can be seen below.

An example database \(D_{a}\) with four rows and three columns
age gender diabetes
old male yes
middle male yes
old female no
young female no

Now, consider the additional table \(D_{\text{lookup}}\) below.

SSN lookup table \(D_{\text{lookup}}\) potentially used by \(M\) to re-identify our example database. Note that “not-young, male” fits the two first rows.
age gender SSN
young male 12344
old male 12345
middle male 12346
old female 12347
young female 12348

This table contains an explicit identifier SSN (social security number) associated with each row. If we join this database \(D_{\text{lookup}}\) with the information in database \(D_{a}\), we establish an explicit one-to-one connection between rows in \(D_{a}\) and identifiers SSN in \(D_{\text{lookup}}\). This joint information can be represented as the table \(D_{\text{identified}}\) below.

The result \(D_{\text{identified}}\) of joining databases \(D_{a}\) and \(D_{\text{lookup}}\)
age gender diabetes SSN
old male yes 12345
middle male yes 12346
old female no 12347
young female no 12348

The result \(D_{\text{identified}}\) constitutes a “re-identification” of each row in \(D_{a}\) and consequently the entire database \(D_{a}\).

Now consider a description of age as “not young.” This fits both old and middle age. If we try to look up a row “not-young, male, yes” in the lookup database \(D_{\text{lookup}}\), we see that this fits two of the rows, each with a distinct identifier value. Not being able to resolve this ambiguity, we must conclude that the row “not-young, male, yes” could be associated with either of SSNs 12345 and 12346.

Now, if we transform a data table such that any row becomes the same as \(k-1\) other rows in the same table, then the above approach of matching rows in a lookup table must result in each row matching either none or at least \(k\) rows in any lookup table that can be used to re-identify all rows of the original table.

Sweeney and Samarati thought that the above approach was too destructive with respect to usefulness. They proposed to select a set of attributes or columns that one thinks can be used for matching in other tables as “quasi-identifiers” and only require ambiguity in those. They parameterized this ambiguity by \(k\) and named this property of a data table \(k\)-anonymity.

A 2-anonymous version of \(D_{a}\) for any set of quasi-identifiers can be found below as \(D'\).

Now, suppose we have an identifiable database for which matching with a lookup table identifies each row. If we have a \(k\)-anonymous version of this database, we get that each row in this \(k\)-anonymous database is matched with \(k\) possible identifiers. Choosing one uniformly at random yields a probability of getting the right one that is at most \(1/k\). This is a possible origin of the claim that \(k\)-anonymity yields a re-identification probability that is at most \(1/k\).

Of course, this probability depends on each potential identifier being selected with the same probability. Most of the time, there is information available that dictates that some identifiers are more likely than others. As an example consider that the original data is on a random sample of a particular population where 95% are women. Depending on how the \(k\)-anonymization is done, it could happen that half of the \(k\) identifiers for a particular \(k\)-anonymous record belong to men. It would then be natural to assume that these identifiers are less likely to be the real one than the ones belonging to women.

Also, as we will see in the following, \(k\)-anonymity does not necessarily prevent recovery of the original data, and consequently renders the question of re-identification probability of the \(k\)-anonymized data irrelevant.

Three simple examples of \(k\)-anonymity failures

In the first example, Bob knows that age in the original data occurs as young, middle, and old. He also knows that the algorithm used will generalize as little as possible. On receiving 2-anonymous data \(D'\) shown below, he immediately concludes that age in the two first rows must be middle and old because

  • if they had been the same, for example old, there would have been to need to generalize into not-young, and
  • not-young excludes young as a possibility.
A 2-anonymous database \(D'\). This is the example database \(D_{a}\) above where age has been “generalized” to achieve 2-anonymity.
age gender diabetes
not-young male yes
not-young male yes
not-middle female no
not-middle female no

By a similar argument he concludes that the two last rows must contain old and young. In conclusion he was able to infer what the original data was up to the order of the records, and is able to re-identify all these by using the lookup table \(D_{\text{lookup}}\).

In the second example, Bob knows that the original data is either \(D_{a}\) above or \(D_{b}\) below (which differs from \(D_{a}\) in that the age in the first row was changed from “old” to “young”).

An example database \(D_{b}\) with four rows and three columns
age gender diabetes
young male yes
middle male yes
old female no
young female no

He also knows that the algorithm used to \(k\)-anonymize the database \(D_{b}\) produces \(D''\) which is different from \(D'\).

A 2-anonymous database \(D''\). This is the example database \(D_{b}\) above where age has been “generalized” to achieve 2-anonymity.
age gender diabetes
not-old male yes
not-old male yes
not-middle female no
not-middle female no

On receiving \(D'\) from example one he compares this to \(D''\), sees they are different, and concludes that the original data was \(D_{a}\). Again, he can use \(D_{\text{lookup}}\) to re-identify the original data.

In the third example, Bob knows that his middle aged neighbor is in the data. On receiving \(D'\) he can immediately infer that his neighbor is a diabetic, which he did not know in advance.

The above examples differ in what Bob knows. In the first, he has detailed knowledge of how \(k\)-anonymity was achieved. In the second, Bob has been able to reduce the number of possible original databases to \(D_{a}\) and \(D_{b}\). The only thing he then needed was the \(k\)-anonymizing method and that this method produces different outputs for his two candidates. In the third he knows that his neighbor is in the database and that he is middle aged.

Importantly, the third example illustrates that identity is not the only type of inference that we associate with a violation of privacy.

Generalizing the second example: a game

As we have seen, transformation into a \(k\)-anonymous database does not protect against discovery of the source data. In the first example above, Bob used his detailed knowledge of how \(k\)-anonymity was achieved. In the second, he relied on knowing that the original was one of two candidates and that the method for \(k\)-anonymity produces different output for the two candidates.

The second example is interesting because with regards to privacy, it can be modified to apply to any deterministic algorithm or method that transforms a database into something that is “safe” with respect to loss of privacy. Such an algorithm or method implements a function from databases to databases and is called a deterministic sanitizer, or in the following just sanitizer. More specifically, the second example applies to any useful sanitizer, meaning a sanitizer for which there exists two inputs that produce different outputs, meaning that the sanitizer allows us to distinguish between at least two databases. A sanitizer that is not useful essentially always says “No comment!” regardless of the original data and is therefore useless.

We’ve already discussed that \(k\)-anonymity is a property of a database, dividing databases into two sets, the set of those that are \(k\)-anonymous and the set of those who are not. In this, \(k\)-anonymity acts like a function \(\sigma\) that when given database \(d\) returns 0 if it is \(k\)-anonymous and 1 when it is not. Think of the letter \(\sigma\) as meaning “sensitive” or “\(\sigma\)ensitive.” A function that returns 0 or 1 is often called a predicate, and here \(\sigma\) is a predicate on databases.

Now we generalize and let \(\sigma\) not just mean not \(k\)-anonymous and therefore sensitive, but let \(\sigma\) be any predicate on databases and interpret \(d\) as “sensitive” only if \(\sigma(d) = 1\). We call \(\sigma\) a syntactic definition of privacy1, and if there exists an algorithm that computes \(\sigma\), we call it a checkable definition of privacy. Notice that now we can say that \(k\)-anonymity is a checkable definition of privacy. We can now define santizers with respect to \(\sigma\) to be all algorithms \(f\) such that \(\sigma(f(d)) = 0\) for all \(d\).

If \(\sigma\) is non-constant, i.e., there exists both sensitive and non-sensitive databases, and allows us to define at least one useful sanitizer for \(\sigma\), we call \(\sigma\) useful. A little thought reveals that for any useful definition of privacy there exist at least two non-sensitive databases and at least one sensitive database.

Summarizing the above definitions, we can say that for a useful checkable definition of privacy \(\sigma\), there exists at least one sensitive database and two non-sensitive databases, at least one useful santizer, and we can decide whether a database is sensitive or not. In the following, when we say “definition of privacy” we mean a a useful and checkable definition of privacy.

We can now think of the second example in terms of the following game \(G\) that Alice and Bob play.

  1. Alice decides on a database \(D_{1}\) and a definition of privacy \(\sigma\) and gives \(D_{1}\) and \(\sigma\) to Bob.
  2. Bob generates a database \(D_{2}\) by making a copy of \(D_{1}\) and substituting one record for one he choses, generates a sanitizer \(f\) for \(\sigma\), and gives \(D_{2}\) and \(f\) to Alice.
  3. Alice secretly chooses one of the two databases as \(D_{i}\), computes \(D' = f(D_{i})\), and gives \(D'\) to Bob .
  4. Bob loses if he cannot find out which database Alice chose, otherwise he wins.

Requiring Bob to generate a database of the same size that differs in a single record is not essential to the game, however not requiring it seems unfair to Alice as expecting that two completely different databases should be sanitized to the same value is unreasonable from a utility standpoint. One substitution apart is in general the closest we can make two same-size databases while not specifying record types. Note that the databases in the second example of \(k\)-anonymity failures meet this requirement.

Now, all Bob has to do to win is to make sure that the deterministic \(f\) he gives to Alice in step 2. discerns between \(D_{1}\) and \(D_{2}\), i.e., \(f(D_{1}) \neq f(D_{2})\). Then all he has to do is to pick \(i\) such that \(D' = f(D_{i})\). Since \(\sigma\) is useful he can always provide the needed \(f\) by picking two different non-sensitive databases and let \(f\) output these for the two databases \(D_{1}\) and \(D_{2}\), respectively.

“Letting Bob choose \(f\) is cheating!” Charlie exclaims. It is not. A definition of privacy is a criterion on databases or information and not on sanitizers or algorithms. This means that the promise of privacy must hold regardless of the actual sanitizer used. Furthermore, checking sanitizers against a list of “approved” ones is for various reasons not practical. For example, there is no algorithm that can decide what any given algorithm actually computes2.

Importantly, the game above illustrates a problem with not only \(k\)-anonymity, but with all definitions of privacy that allow deterministic sanitizers. These vulnerable definitions include all definitions of privacy \(\sigma\) that are based solely on properties of information. One of these is the HIPAA “safe harbor” de-identification standard. This “safe harbor” is interesting in that this particular definition is based on deciding what attributes can be shared and which cannot. In this respect it is strongly related to the question of which attributes should be quasi-identifiers for \(k\)-anonymity. This latter question can also be cast in terms of a \(\sigma\), and consequently, relying on a particular subset of attributes in the data being quasi-identifiers is subject to the vulnerability illustrated by the game above.

The game serves primarily as a test of definitions of privacy \(\sigma\) and might seem unrelated to real life. However, it does capture what Bob exploited in the second example above, and shows that \(k\), which chooses among possible \(\sigma\)’s, does not play into what Bob can learn about Alice’s data. Consequently, the claim of \(1/k\) being an upper bound for the probability of re-identification is misleading.

Counting Games

“Ok,” says Charlie. “I now see that you are right in that there are situations where we can bypass \(k\)-anonymity and end up with re-identified data with a probability that has nothing to do with the number \(k\). However, I am still not convinced that this is of practical consequence. I think the assumption of Bob knowing that Alice holds one of a specific pair of databases is not really realistic enough to matter.”

Granted, Bob knowing that Alice holds one of two databases seems like a big ask. However, what if there are a million Bobs, each with their own knowledge of what Alice might be holding?

We now make a mathematical argument that we cannot rule out that there exist “a million Bobs” by counting the number of different games \(G\) there must be where Alice holds a specific sensitive database.

Now consider the game, and let records in a database consist of \(n\) bits, meaning there are \(2^{n}\) different values for a record. This means that a database \(D_{1}\) has at least \(2^{n} - 1\) “neighbors” \(D_{2}\). Let \(\sigma\) divide databases into \(x\) sensitive databases and \(y\) non-sensitive databases. Consequently, there are at least \(x(2^{n} - 1)\) pairs \((D_{1}, D_{2})\) where \(D_{1}\) is sensitive and \(D_{2}\) is a “neighbor” of \(D_{1}\). Now, for each of these pairs, there are \(y(y-1)\) ways it can be discerned by a sanitizer. It follows that there are at least \(x(2^{n} - 1)y(y-1)\) games in total. This number increases in both \(x\) and \(y\). Since \(\sigma\) is useful, picking the smallest possible \(x\) and \(y\) yields \(x=1\) and \(y=2\). Using these numbers, there must be at least \(2(2^{n} - 1) = 2^{n+1} - 2\) games. For \(n = 19\), there are a million games, and for \(n = 266\), which is shorter than a tweet, we have that the number of games exceeds the Eddington number \(10^{80}\), which is an estimate of the number of hydrogen atoms in the universe.

Note that this is a lower bound, and since we chose \(x=1\), the lower bound also holds when we Alice holds a specific sensitive database. For \(k\)-anonymity and a reasonable database size, the numbers \(n\), \(x\), and \(y\) are likely to be much larger. It now seems that a million Bobs is a gross underestimate in practice.

A short postscript on empirical arguments for privacy

“I did some re-identification experiments with my data, and the fraction of records in the \(k\)-anonymous data I was able to re-identify was much smaller than \(1/k\).” says Charlie. “What does that mean?”

One thing this means is that one can at least re-identify a fraction \(z\) of the data. This is a lower bound and in a sense tells us more about Charlie’s re-identification capabilities than it tells us about the risk of re-identification, unless \(z\) is really large of course. Continuing this line of thought: in the context of “the million Bobs,” a statement in support of the protective power of \(k\)-anonymity is arguably the experimenter saying “I’m not a Bob!”

  1. syntactic comes from the ability to recognize the “language of sensitive databases,” i.e., recognize sensitive databases.↩︎

  2. this is due to Rice’s Theorem, and while Rice’s Theorem does not apply in a finite world, it seems unlikely that the computational complexity is tractable for any finite world that realistically captures data privacy.↩︎