An Influential AI Dataset Contains Thousands of Suspected Child Sexual Abuse Images

An influential machine learning dataset, which has been used to train numerous popular image-generation applications, includes thousands of images of suspected child sexual abuse, a new academic report reveals.

The report, put together by Stanford University’s Internet Observatory, says that LAION-5B, a massive tranche of visual media, includes a significant number of illegal abuse images.

LAION-5B is maintained by the non-profit organization LAION (short for Large-scale Artificial Intelligence Open Network). It isn’t actually a stored collection of images but a list of links to images that the organization has indexed. The links include metadata for each image, which helps machine learning models find images to draw on for training.

To sift through this expansive dataset, researchers used PhotoDNA, a proprietary content filtering tool developed by Microsoft to help organizations identify and report certain types of prohibited content, including child sexual abuse material (CSAM). In the course of scanning LAION’s dataset, researchers say, PhotoDNA flagged 3,226 instances of suspected child abuse material. By consulting outside organizations, researchers were able to confirm that many of those images were CSAM. While the dataset in question indexes billions of images, the presence of any abuse material in it should be troubling.

On Tuesday, after receiving an embargoed copy of Stanford’s report, LAION took the dataset offline and released a statement to address the controversy. It reads, in part:

LAION has a zero tolerance policy for illegal content. We work with organizations like IWF and others to continually monitor and validate links in the publicly available LAION datasets. Datasets are also validated through intensive filtering tools developed by our community and partner organizations to ensure they are safe and comply with the law.

…In an abundance of caution we have taken LAION 5B offline and are working quickly with the IWF and others to find and remove links that may still point to suspicious, potentially unlawful content on the public web.

LAION-5B has been used to train numerous AI applications, including the popular Stable Diffusion image generation app created by Stability AI. Gizmodo reached out to Stability AI for comment and will update this story if it responds.