Differentially private \(k\)-anonymity makes limited sense


In response to the COVID-19 pandemic, the Norwegian Institute of Health first commissioned and then deployed a population surveillance / contact tracing cellphone app. The initiative stirred up controversy because of its collection and centralized storage of detailed location and proximity data. At the same time, the government appointed a group of experts to review the app. In their recently published report, they suggest that differential privacy be considered as an alternative to \(k\)-anonymity for data privacy. Regarding how this can be done, they state (translated from Norwegian):

“Depending on how the \(k\)-anonymization is done, one can achieve differential privacy by taking a random sample followed by \(k\)-anonymization.”

Their report is meant for the general public and does not contain any details on how such a scheme could be implemented (attempts at providing such details include Li et al.). However, one can assume that the expert group thinks the scheme has merit since they mention it.

In the following, I argue that adding \(k\)-anonymity to differential privacy does not help with privacy but generally hurts utility. Consequently, the combination makes little sense if maximizing both privacy and utility is the goal.

\(k\)-anonymity: a property of data

The allure of \(k\)-anonymity is in part that it is easy to operationalize. It is easy to check whether a dataset is \(k\)-anonymous. It is also, at least superficially, explainable by the metaphor of “hiding in a crowd” of size at least \(k\). On the other hand, \(k\)-anonymity has been criticized for destroying utility. A very simple example of how \(k\)-anonymity affects data is a \(k\)-anonymous dataset containing the gender of \(k\) people. Since all \(k\) values are required to be identical, nothing can be learned about how many of the \(k\) people are women unless all or none of them are, in which case we learn the exact number.
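To make the “easy to check” point concrete, here is a minimal sketch in Python (the records, column names, and choice of quasi-identifiers are hypothetical, not taken from the report or the app): a dataset is \(k\)-anonymous with respect to a set of quasi-identifiers if every combination of their values occurs in at least \(k\) records.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check whether every combination of quasi-identifier values
    occurs in at least k records."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical toy data with gender treated as a quasi-identifier.
records = [
    {"gender": "F", "zip": "0150"},
    {"gender": "F", "zip": "0150"},
    {"gender": "M", "zip": "0150"},
]
print(is_k_anonymous(records, ["gender"], k=2))  # False: only one "M" record
```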

With respect to privacy, a big problem with \(k\)-anonymity is that it is hard to know what the value of \(k\) means for privacy risk, including the risk of “re-identification”. This can only be addressed empirically, which at best yields a lower bound on the risk. A statement “we were only able to re-identify \(x\)% after \(k\)-anonymization” also means “one can re-identify at least \(x\)% after \(k\)-anonymization,” which does not inspire confidence for any value of \(x\). Also, if two \(k\)-anonymous datasets were assigned empirical risk bounds of \(x\)% and \(y\)%, respectively, it is not clear what the combined risk of publishing both is.

Differential privacy: a property of an algorithm

Enter differential privacy. Consider the question “If I contribute my data to the database, what is the increase in the chance of anyone learning anything about me that they did not already know?” We can think of differential privacy in terms of the answer “no larger than a chosen threshold, provided that a particular type of algorithm is used to release information from the database.”

The threshold is expressed in terms of a parameter \(\epsilon\), which represents an upper bound on privacy risk even against extraordinarily strong adversaries. The particular type of algorithm required uses the data and \(\epsilon\) to select a probability distribution according to which the final output is randomly chosen. The privacy guarantee stems from how little the presence or absence of any single record changes this probability distribution.
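Formally (this is the standard definition, not something specific to the report), a randomized algorithm \(M\) is \(\epsilon\)-differentially private if for all pairs of databases \(D\) and \(D'\) differing in a single record, and for all sets of outputs \(S\),

\[
\Pr[M(D) \in S] \le e^{\epsilon} \, \Pr[M(D') \in S].
\]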

For utility, the trick is to find and use a probability distribution that maximally concentrates around a desired output. Often distributions can be chosen that yield very small error with high probability. For example, going back to the example of the genders of \(k\) individuals, we can compute the number of women with an error probability that decreases exponentially in the size of the error.
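One standard way of achieving this for a counting query is the Laplace mechanism (my illustration, not anything prescribed in the report): since adding or removing one record changes a count by at most one, adding Laplace noise with scale \(1/\epsilon\) gives \(\epsilon\)-differential privacy, and the probability of an error of magnitude \(t\) decays proportionally to \(e^{-\epsilon t}\). A minimal sketch:

```python
import random

def laplace_noise(scale):
    """Sample from a Laplace(0, scale) distribution as the
    difference of two exponential random variables."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count(values, predicate, epsilon):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (one record changes the
    count by at most 1), so Laplace noise with scale 1/epsilon
    suffices."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical data: genders of k individuals.
genders = ["F", "M", "F", "F", "M", "F", "M", "F"]
print(dp_count(genders, lambda g: g == "F", epsilon=1.0))
```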

Furthermore, if one releases two outputs, computed with \(\epsilon_{1}\)- and \(\epsilon_{2}\)-differential privacy respectively, then the combination of these two results is computed with \((\epsilon_{1}+\epsilon_{2})\)-differential privacy, regardless of the data that was input to each computation.
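A sketch of why, for the simple case of two independently randomized algorithms \(M_{1}\) and \(M_{2}\) and output sets \(S_{1}\) and \(S_{2}\) (the general statement holds as well): applying the definition above to each algorithm gives

\[
\Pr[M_{1}(D) \in S_{1}, M_{2}(D) \in S_{2}] \le e^{\epsilon_{1}} \Pr[M_{1}(D') \in S_{1}] \cdot e^{\epsilon_{2}} \Pr[M_{2}(D') \in S_{2}] = e^{\epsilon_{1}+\epsilon_{2}} \, \Pr[M_{1}(D') \in S_{1}, M_{2}(D') \in S_{2}].
\]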

These properties have strongly contributed to making differential privacy an emerging privacy standard.

Combining \(k\)-anonymity and differential privacy

So, what is it that makes \(k\)-anonymity and differential privacy exhibit such different properties? The answer is that \(k\)-anonymity is a property of information or data, while differential privacy is a property of an algorithm. This has operational consequences. We cannot check from an output alone that it was produced with differential privacy, nor can we, in general, automatically check that a given implementation of an algorithm achieves differential privacy. Simply put, differential privacy is not “checkable” while \(k\)-anonymity is.

As a consequence of how they are defined, a combination of differential privacy and \(k\)-anonymity must necessarily be the requirement that a differentially private algorithm produces \(k\)-anonymous output. Furthermore, from a privacy-utility balance point of view, the combination is a net loss: the \(k\)-anonymity requirement does not improve the upper bound on privacy risk already given by \(\epsilon\), but it does constrain the possible outputs and consequently the utility. Additionally, the operational benefit of being “checkable” that \(k\)-anonymity enjoys on its own is lost, since the differential privacy part of the combination cannot be verified from the output.

In summary, a combination of differential privacy and \(k\)-anonymity makes little sense if the objective is optimizing the balance of privacy and information utility.