
Wikimedia Deutschland has launched a new database designed to make Wikipedia’s vast knowledge base more accessible to artificial intelligence models.

The initiative, called the Wikidata Embedding Project, introduces a vector-based semantic search system that spans more than 120 million articles across Wikipedia and its sister sites. The tool also supports the Model Context Protocol (MCP), a new standard that lets AI systems query external data sources directly.
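MCP exchanges are JSON-RPC 2.0 messages, so a client asking a server to run a tool is essentially sending a small structured request. The sketch below shows only that message shape; the tool name "semantic_search" and its arguments are hypothetical placeholders, not the project's published interface.

```python
import json

# Illustrative only: an MCP "tools/call" request is a JSON-RPC 2.0 message.
# The tool name and arguments here are hypothetical, not the Wikidata
# Embedding Project's actual MCP interface.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "semantic_search",                     # hypothetical tool name
        "arguments": {"query": "scientist", "limit": 5},
    },
}

print(json.dumps(request, indent=2))
```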

Developed in collaboration with neural search firm Jina and IBM-owned data provider DataStax, the project aims to give developers structured access to verified knowledge for retrieval-augmented generation (RAG) systems. Until now, Wikidata searches were limited to keywords or the specialised query language SPARQL.
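For a sense of what SPARQL access involves, here is a minimal sketch that queries Wikidata's public endpoint for a few people whose occupation is "scientist" (item Q901, via the occupation property P106); the endpoint and identifiers are the standard public ones, while the surrounding Python is just illustrative scaffolding.

```python
import requests

# A minimal SPARQL query against Wikidata's public query service:
# people whose occupation (P106) is "scientist" (Q901), with English labels.
query = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P106 wd:Q901 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "example-script/0.1"},
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])
```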

“Powerful AI can be open and collaborative, rather than monopolised by large corporations,” said Philippe Saadé, project manager for Wikidata AI.

The new system organises information semantically. For example, a search for “scientist” will return nuclear physicists, Bell Labs alumni, multilingual translations, images, and related concepts such as “researcher” and “scholar.”
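The project's own API is not shown here, but the general technique behind such results is straightforward: embed the query and candidate terms as vectors, then rank candidates by similarity. The sketch below assumes the off-the-shelf sentence-transformers library and a generic model; it is not the Wikidata Embedding Project's implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Rough illustration of semantic ranking: embed a query and candidate terms,
# then order the candidates by cosine similarity to the query.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "scientist"
candidates = ["researcher", "scholar", "nuclear physicist", "banana", "bridge"]

query_vec = model.encode(query, convert_to_tensor=True)
cand_vecs = model.encode(candidates, convert_to_tensor=True)

scores = util.cos_sim(query_vec, cand_vecs)[0].tolist()
for term, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{term}: {score:.3f}")
```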

The database is available on Toolforge, and Wikimedia will hold a developer webinar on October 9. The launch comes amid rising demand for reliable training data in AI, as companies face legal and financial pressure over the use of copyrighted material. In September, Anthropic agreed to a $1.5 billion settlement with authors whose works had been used in its datasets.