In an algorithm-driven world where data is king, one mis-step can lead to a royal mess. Netflix discovered this after it released an anonymised set of subscribers' film ratings in 2006 as part of a public competition. By cross-matching those ratings with reviews posted on another website, data sleuths showed they could identify individual subscribers and what they had been watching. A gay customer sued for breach of privacy in 2009; Netflix settled.
That episode is still cited today by academics seeking ways of sifting useful information from data without outing the individuals who provide it. Where anonymisation failed, synthetic data might yet succeed.
It is, as its name suggests, artificially generated. Most often it is created by funnelling real-world data through a noise-adding algorithm to construct a new data set, one that captures the statistical features of the original without being a giveaway replica. Its usefulness hinges on a principle known as differential privacy: roughly, that any analysis of the data should yield almost the same result whether or not any one individual's records are included. Anybody mining such data can therefore draw the same statistical inferences as they would from the true data, without being able to pin down individual contributions.
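The noise-adding idea can be sketched with the simplest differentially private primitive, the Laplace mechanism. The sketch below is illustrative only, not a description of any company's system; the data set and function names are hypothetical. A counting query has sensitivity 1 (adding or removing one person changes the count by at most 1), so adding Laplace noise of scale 1/ε satisfies ε-differential privacy.

```python
import math
import random

def laplace_sample(scale: float) -> float:
    # Draw one sample from a zero-mean Laplace distribution
    # via inverse-CDF sampling of a uniform variate.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Release a count perturbed by Laplace noise.

    A counting query has sensitivity 1: one individual's presence or
    absence changes the true count by at most 1. Noise of scale
    1/epsilon therefore yields epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_sample(1.0 / epsilon)

# Hypothetical viewing records: did each subscriber watch a given film?
watched = [True, False, True, True, False]
noisy = private_count(watched, lambda w: w, epsilon=0.5)
```

A smaller ε means more noise and stronger privacy; the true count of 3 is blurred, so no single subscriber's record can be teased out of the released figure, yet averages over many such queries remain statistically faithful.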