Source: Nature Communications  Published: 2019/7/30 15:06:47
Data anonymization may not be enough to protect personal privacy | Nature Communications

Paper title: Estimating the success of re-identifications in incomplete datasets using generative models

Journal: Nature Communications

Authors: Luc Rocher, Julien M. Hendrickx, Yves-Alexandre de Montjoye




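A paper published in Nature Communications, Estimating the success of re-identifications in incomplete datasets using generative models, presents a method for estimating whether an individual can be re-identified from an incomplete anonymized dataset. The paper argues that current anonymization and data-sharing practices may be insufficient to protect personal privacy or to satisfy data protection laws and regulations such as the EU General Data Protection Regulation (GDPR).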



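Yves-Alexandre de Montjoye of Imperial College London, UK, and colleagues developed a statistical method that accurately estimates the likelihood that an individual can be correctly re-identified from an anonymized dataset. The authors found that knowing only a few attributes, such as ZIP code, date of birth, gender, and number of children, is generally enough to re-identify an individual with high confidence, even when the dataset is incomplete. The more attributes are known, the higher the likelihood of re-identification. For example, 99.98% of the population of Massachusetts could be identified using 15 demographic attributes. The authors therefore conclude that releasing only sampled or incomplete datasets is not enough to protect personal privacy.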
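The effect of adding attributes can be illustrated with a small simulation. The sketch below is not the authors' copula-based model: it assumes a synthetic population with independent, uniformly distributed attributes, and the names `make_population`, `uniqueness_fraction`, and `prob_unique` are hypothetical. It simply counts how many records become unique as more attributes are known, and shows the elementary probability that no one else among n people shares a combination that occurs with probability p.

```python
import random
from collections import Counter


def make_population(n, seed=0):
    """Build n synthetic records (assumption: independent, uniform attributes)."""
    rng = random.Random(seed)
    return [
        (
            rng.randrange(2),    # gender
            rng.randrange(50),   # hypothetical ZIP code bucket
            rng.randrange(365),  # day of birth within the year
            rng.randrange(60),   # year of birth offset
            rng.randrange(5),    # number of children
        )
        for _ in range(n)
    ]


def uniqueness_fraction(records, k):
    """Share of records whose first k attributes match no other record."""
    keys = [record[:k] for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


def prob_unique(p, n):
    """Probability that none of the other n - 1 people share a combination
    of probability p, assuming individuals are drawn independently."""
    return (1.0 - p) ** (n - 1)


population = make_population(20_000)
# Uniqueness when 1, 2, ..., 5 attributes are known to the attacker.
fractions = [uniqueness_fraction(population, k) for k in range(1, 6)]

if __name__ == "__main__":
    for k, fraction in enumerate(fractions, start=1):
        print(f"{k} attribute(s): {fraction:.2%} of records are unique")
```

Because a record that is unique on k attributes remains unique on k + 1, the fraction can only grow as attributes are added. Real demographic attributes are correlated and unevenly distributed, which is why the paper relies on a fitted generative model instead of this uniform assumption.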

Abstract: While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.


