I’m glad that Thilo mentioned Security & Privacy as part of the data science skill set in his recent blog post. In my opinion, the two most interesting questions with respect to security & privacy in data science are the following:
- Data science for security: How can data science be used to make security-relevant statements, e.g. predicting possible large-scale cyber attacks by analysing communication patterns?
- Privacy for data science: How can data that contains personally identifiable information (PII) be anonymized before it is handed to data scientists for analysis, such that the analyst cannot link the data back to individuals? This problem is typically referred to as data anonymization.
This post deals with the second question. I’ll first show why obvious approaches to anonymizing data typically don’t offer true anonymity, and will then introduce two approaches that provide better protection.