Last week Techdirt wrote about leading Chinese tech companies being hit with GDPR complaints from noyb.eu concerning the transfer of personal data from the EU to China. More recently, much of the world has been obsessed with another Chinese company, DeepSeek, which operates in the fashionable area of AI chatbots. Most of the discussions have been about the impact DeepSeek’s apparently low-cost approach will have on the massive spending plans of existing, mostly US, AI companies. Another issue is to what extent DeepSeek’s model drew on OpenAI’s chatbot for its training. But the privacy concerns raised by noyb.eu about better-known Chinese companies are now becoming an issue for DeepSeek too.
The Italian consumer organization Altroconsumo believes that there were “serious violations of GDPR regulations” (original in Italian, all translations by DeepL) in DeepSeek’s processing of personal data, and it submitted a report to the Italian data protection authority, the Garante della Privacy. The Garante requested information from DeepSeek about “which personal data are collected, the sources used, the purposes pursued, the legal basis of the processing, and whether they are stored on servers located in China.” In addition:
The Authority also asked the companies what kind of information is used to train the artificial intelligence system and, in case personal data are collected through web scraping activities, to clarify how registered and non-registered service users have been or are being informed about the processing of their data.
The companies are required to submit the requested information to the Authority within 20 days.
The Garante has now imposed a block on downloading DeepSeek’s apps from both the Apple and Google app stores in Italy:
The limitation order — imposed to protect Italian users’ data — follows the companies’ communication received today, whose content was deemed entirely unsatisfactory.
Contrary to what was found by the Authority, the companies declared that they do not operate in Italy and that European legislation does not apply to them.
In addition to ordering the limitation on processing, the Authority also opened an investigation.
This is not the first time the Garante has taken this approach. In April 2023, it blocked access to ChatGPT in Italy, before lifting the block a few weeks later after changes were made by OpenAI to address the issues raised. So far, Italy is the only EU country to block DeepSeek, although Ireland’s Data Protection Commission has requested information from the company about its handling of personal data, while in the US the Pentagon has started blocking DeepSeek on parts of its network. DeepSeek’s position has been undermined somewhat by revelations from the cloud security company Wiz, which wrote on its blog:
Wiz Research has identified a publicly accessible ClickHouse database belonging to DeepSeek, which allows full control over database operations, including the ability to access internal data. The exposure includes over a million lines of log streams containing chat history, secret keys, backend details, and other highly sensitive information. The Wiz Research team immediately and responsibly disclosed the issue to DeepSeek, which promptly secured the exposure.
These growing concerns about the flow of personal data to servers in China apply to DeepSeek’s own hosted service. One way to avoid the issue is to create versions of DeepSeek’s service hosted elsewhere, something that DeepSeek’s license allows and that Microsoft has just announced. Whether ordinary users would use them in preference to the “official” version is another matter. For businesses, a better solution would be self-hosting the service, so that sensitive commercial data stays behind the corporate firewall.
But there’s another privacy issue that using other hosts, or self-hosting, does not address. DeepSeek has not revealed what training data was used to create the system, which means data sources containing personal information may have been included. If so, suitably crafted prompts might be able to extract personal data from the current version of DeepSeek. A new project called Open-R1 could help to fix this privacy issue. As TechCrunch reports:
Hugging Face head of research Leandro von Werra and several company engineers have launched Open-R1, a project that seeks to build a duplicate of [DeepSeek’s] R1 and open source all of its components, including the data used to train it.
Another benefit of creating a fully open-source version of DeepSeek’s system is that the censorship built into the current version can be eliminated. According to Ars Technica, there is lots of it, although it is relatively easy to circumvent:
The team at AI engineering and evaluation firm PromptFoo has tried to measure just how far the Chinese government’s control of DeepSeek’s responses goes. The firm created a gauntlet of 1,156 prompts encompassing “sensitive topics in China” (in part with the help of synthetic prompt generation building off of human-written seed prompts). PromptFoo’s list of prompts covers topics including independence movements in Taiwan and Tibet, alleged abuses of China’s Uyghur Muslim population, recent protests over autonomy in Hong Kong, the Tiananmen Square protests of 1989, and many more from a variety of angles.
After running those prompts through DeepSeek R1, PromptFoo found that a full 85 percent were answered with repetitive “canned refusals” that override the internal reasoning of the model with messages strongly promoting the Chinese government’s views.
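The kind of measurement PromptFoo describes can be approximated with a short script: run each sensitive prompt through the model, then count how many responses match canned-refusal boilerplate. The refusal phrases and mocked responses below are illustrative assumptions, not PromptFoo’s actual test harness; a real run would replace the mock list with live API calls to the model.

```python
import re

# Illustrative refusal phrases (assumptions, not PromptFoo's real list).
# Canned refusals tend to be near-identical boilerplate, so simple
# case-insensitive pattern matching catches most of them.
REFUSAL_PATTERNS = [
    re.compile(r"(?i)let'?s talk about something else"),
    re.compile(r"(?i)i (?:cannot|can't) (?:answer|discuss|help with) (?:that|this)"),
    re.compile(r"(?i)beyond my current scope"),
]

def is_canned_refusal(response: str) -> bool:
    """Return True if the response matches a known refusal pattern."""
    return any(p.search(response) for p in REFUSAL_PATTERNS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses that are canned refusals."""
    if not responses:
        return 0.0
    return sum(is_canned_refusal(r) for r in responses) / len(responses)

# Toy run with mocked model output (a real harness would query the model):
mock_responses = [
    "I can't answer that. Let's talk about something else.",
    "The 1989 protests in Beijing began in April...",  # substantive answer
    "This topic is beyond my current scope.",
    "I cannot discuss this topic.",
]
print(f"refusal rate: {refusal_rate(mock_responses):.0%}")  # → 75%
```

On the toy data, three of the four responses trip a refusal pattern, giving a 75 percent rate, in the same spirit as the 85 percent figure PromptFoo reported for its much larger prompt set.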
The privacy issues surrounding the use of AI chatbots are new and complex. Creating a truly open-source system, including full details about the training sets, provides a way forward to address data protection issues that may be lurking in all current systems — and not just those from China.