AI startups have used images of Brazilian children to train their models without consent, according to a report by Human Rights Watch (HRW).
The report reveals that popular image generators, including Stable Diffusion, were trained on photos of children, in some cases spanning a child’s entire childhood, raising significant privacy concerns.
Brazilian kids’ photos were swept into a billions-strong dataset
HRW’s investigation, led by researcher Hye Jung Han, found that the images were sourced from at least 10 Brazilian states. Many were family photos uploaded to parenting and personal blogs, and they surfaced in a dataset called LAION-5B. This dataset was built from Common Crawl snapshots of the public web and contains nearly 6 billion “image-text pairs”: captions matched with links to pictures posted online since 2008. Although the dataset does not host the photos themselves, the links and captions it pairs them with still pose a significant privacy risk.
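For readers unfamiliar with how such datasets are distributed: LAION-5B ships as metadata, not images. The minimal sketch below shows how a researcher might inspect a handful of link-caption pairs; it assumes the published parquet schema (column names such as "URL" and "TEXT"), and the shard filename is hypothetical.

```python
# Minimal sketch: LAION-5B is distributed as metadata shards (links plus
# captions), not pixels. Column names "URL" and "TEXT" follow the published
# parquet schema but are treated here as assumptions; the filename is
# hypothetical.
import pandas as pd

shard = pd.read_parquet("laion5b-shard-00000.parquet")  # hypothetical shard

# Each row pairs a caption with a link to an image hosted elsewhere on the
# web. A caption that includes a child's name and hometown is what makes the
# linked photo traceable, even though the dataset itself stores no images.
for _, row in shard.head(3).iterrows():
    print(row["URL"], "->", row["TEXT"])
```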
The HRW report emphasizes that using these images without consent facilitates the production of non-consensual imagery bearing children’s likenesses. Although HRW worked with LAION to remove links to the images it identified, concerns remain that the dataset still references children’s photos from around the world, and removing links alone does not solve the underlying problem.
Children’s identities are easily traceable
The HRW report further revealed that many of the Brazilian children could be identified through the names and locations included in the captions paired with their photos. This raises concerns that these children could be targeted by bullies or have their images misused for explicit content. HRW also noted that “all publicly available versions of LAION-5B were taken down,” reducing the immediate risk that Brazilian children’s photos will be misused.
LAION, the German nonprofit that created the dataset, stated that it would ensure all flagged content is removed before the dataset is made available again. The takedown followed a Stanford University report that found links in the dataset pointing to illegal content on the public web, including more than 3,000 suspected instances of child sexual abuse material.
Protecting children’s privacy
At least 85 girls in Brazil have reported harassment by classmates who used AI tools to generate sexually explicit deepfakes from photos taken from their social media profiles. This underscores the urgent need for measures to protect children’s privacy online.
LAION-5B was introduced in 2022 as an open re-creation of the kind of proprietary training data used by OpenAI and was promoted as the largest “freely available image-text dataset.” When HRW contacted LAION about the images, the organization responded that AI models trained on LAION-5B “could not produce kids’ data verbatim,” while acknowledging the privacy and security risks involved.
HRW has called for urgent intervention by Brazilian lawmakers to protect children’s rights from emerging technologies. It recommends enacting new laws that prohibit scraping children’s data into AI training sets, safeguarding their privacy and security.
The situation raises broader concerns about the ethical implications of using publicly available data for AI training. While organizations like LAION are taking steps to address these issues, the need for comprehensive regulations to protect individuals, particularly vulnerable groups like children, remains pressing. The collaboration between advocacy groups and legislative bodies will be crucial in developing frameworks that ensure the ethical use of data in AI development.
HRW’s findings highlight a significant privacy risk and emphasize the need for strict regulations to prevent the misuse of children’s images in AI training. The ongoing efforts to remove flagged content and the call for legislative action aim to protect children’s rights and ensure that AI technologies are developed and used responsibly.