Can AI bridge Switzerland’s legal language gap?

A lawyer researching Swiss case law in German can easily miss a decisive ruling written in French or Italian. In a country with several legal languages but one federal legal system, this is more than a technical inconvenience: it can shape which legal information is actually found.

In the Innosuisse Innovation Cheque project “JuRAG”, the Centre for Artificial Intelligence at ZHAW collaborated with Chiron Services to explore how AI can support multilingual legal search, and where language barriers still remain. We thank Chiron Services for the collaboration and domain input throughout the project, and we appreciate the support of Innosuisse, whose funding made this project possible.

Authors: Lars Schmid and Manuela Hürlimann, ZHAW Centre for Artificial Intelligence

The Swiss Federal Supreme Court publishes its decisions in German, French or Italian, depending on the language of the original proceedings. A lawyer in Zurich researching a question of federal law should, in principle, be able to find and consider all relevant rulings, regardless of the language in which they were written.

In practice, this is difficult. Traditional keyword search treats terms such as “Mietvertrag” and “bail à loyer” as unrelated, even though both refer to a rental contract. The same problem appears across many areas of law. When search systems rely too heavily on exact wording, legal research can become fragmented along language lines.

Chiron Services, a Basel-based software company specialised in data and AI solutions, approached us with a prototype that already worked well within a single language and was to be developed into a multilingual tool. Our goal was to test different ways of making legal search more reliable when the query and the relevant court decision are written in different languages. This research project was funded by Innosuisse (Innocheque No. 128.926 INNO-ICT).

Testing multilingual search on 165,000 decisions

We built an evaluation benchmark from 165,556 Swiss Federal Supreme Court decisions published between 2000 and 2025. These decisions were split into around 2.4 million shorter passages.

The benchmark contained 26 legal questions, each translated into German, French, Italian and English. This resulted in 104 query instances in total. The questions were constructed by a lawyer based on source documents. For each question, one or more relevant Swiss Federal Supreme Court decisions were identified as ground truth, including the specific considerations (“Erwägungen”) that contained the relevant legal reasoning. This allowed us to test whether a system could find the same relevant court decision regardless of the language used in the search query.

We evaluated retrieval at passage level because Swiss Federal Supreme Court decisions are structured into “considerations”: the numbered parts of the judgment in which the court explains its legal reasoning. A decision may cover several issues, but only one specific consideration may be relevant to a lawyer’s question. We therefore did not only ask whether the system could find the right decision. We also asked whether it could find the right part of that decision.

We compared several approaches. The baseline was traditional keyword search: the system looks for documents that contain the same words as the query. This is simple and often useful, but it struggles when the same legal idea is expressed in another language or with different wording.
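To make the limitation concrete, here is a minimal sketch of word-overlap matching. It is not the keyword baseline used in the project (whose implementation is not described here); the function names and the example sentences are invented for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_overlap(query: str, passage: str) -> int:
    """Count how many distinct query words also occur in the passage."""
    return len(tokenize(query) & tokenize(passage))


# Invented example sentences; they express the same idea in two languages.
query = "Mietvertrag Kündigung"
german_passage = "Der Mietvertrag wurde gekündigt."
french_passage = "Le bail à loyer a été résilié."

german_overlap = keyword_overlap(query, german_passage)
french_overlap = keyword_overlap(query, french_passage)

print(german_overlap)  # 1: only "mietvertrag" matches exactly
print(french_overlap)  # 0: no shared word, although the topic is the same
```

Note that even within German, "Kündigung" does not match "gekündigt": exact matching is sensitive to wording as well as to language.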

We then tested twelve modern multilingual embedding models. An embedding model turns a piece of text into a mathematical representation of its meaning. The idea is that passages with similar meanings should be “close to each other” (mathematically speaking), even if they use different words or different languages. In legal search, this means that a German query could, in principle, retrieve a relevant French or Italian passage.
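"Close to each other" is commonly measured with cosine similarity. The sketch below uses invented three-dimensional vectors purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the similarity measure used in the project is an assumption here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT the output of any real embedding model.
toy_embeddings = {
    "Mietvertrag": [0.9, 0.1, 0.2],
    "bail à loyer": [0.8, 0.2, 0.3],
    "Strafverfahren": [0.1, 0.9, 0.1],
}

query_vector = toy_embeddings["Mietvertrag"]
sim_rental = cosine_similarity(query_vector, toy_embeddings["bail à loyer"])
sim_criminal = cosine_similarity(query_vector, toy_embeddings["Strafverfahren"])

# In this toy setup the French rental term is closer to the German query
# than the unrelated German term is.
print(sim_rental > sim_criminal)  # True
```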

We also tested a second step called reranking. Here, the system first retrieves a larger set of candidate results and then uses a more specialised model to reorder them, so that the most relevant passages move closer to the top.
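The two-stage idea can be sketched as follows. The scoring functions are stand-ins backed by invented lookup tables; in a real system the first stage would be an embedding search and the second a dedicated reranking model. All names and numbers are hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    passage_ids: list[str],
    first_stage_score: Scorer,
    rerank_score: Scorer,
    k_candidates: int = 3,
    k_final: int = 2,
) -> list[str]:
    """Keep the top candidates of a cheap scorer, then reorder them
    with a second (typically slower, more precise) scorer."""
    candidates = sorted(
        passage_ids, key=lambda p: first_stage_score(query, p), reverse=True
    )[:k_candidates]
    reranked = sorted(
        candidates, key=lambda p: rerank_score(query, p), reverse=True
    )
    return reranked[:k_final]


# Invented scores for four passages (illustration only).
FIRST_STAGE = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.1}
RERANKER = {"p1": 0.2, "p2": 0.6, "p3": 0.95, "p4": 0.99}

top = retrieve_then_rerank(
    "toy query",
    list(FIRST_STAGE),
    first_stage_score=lambda q, p: FIRST_STAGE[p],
    rerank_score=lambda q, p: RERANKER[p],
)
print(top)  # ['p3', 'p2']
```

In this toy example, p4 would score highest with the reranker but never reaches it, because the first stage did not shortlist it. A reranker can only reorder what the first stage retrieves.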

Finally, we tried two additional ideas. One was hybrid search, which combines keyword search with meaning-based search. The other was language deconfounding: a technique that tries to remove the “language signal” from the text representation, so that German, French and Italian passages are compared more by legal content than by language. [1]
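One common way to combine a keyword ranking with an embedding ranking is reciprocal rank fusion (RRF). The article does not state which fusion method the project used, so the following is an assumption chosen for illustration; the passage IDs are invented, and k = 60 is a conventional default rather than a project setting.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each item earns 1 / (k + rank) per list,
    and items are returned by descending total score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda p: scores[p], reverse=True)


# Invented rankings from two hypothetical retrievers.
keyword_ranking = ["p2", "p5", "p1"]
embedding_ranking = ["p1", "p2", "p3"]

fused = reciprocal_rank_fusion([keyword_ranking, embedding_ranking])
print(fused)  # ['p2', 'p1', 'p5', 'p3']
```

Passages that appear in both lists (p2, p1) rise above passages found by only one retriever.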

We also tested whether a large language model could rewrite legal queries into wording that would be easier for the search system to handle.

What worked

The biggest improvement came from choosing the right multilingual embedding model.

Our best configuration found the correct passage among the top ten results 73% of the time. Traditional keyword search reached only 19%. This is a substantial difference: it suggests that, in many cases, meaning-based multilingual search could surface relevant legal material that keyword search would miss.
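A "top ten" figure of this kind can be computed as a hit rate at k. The sketch below assumes that a query counts as a hit when at least one ground-truth passage appears among the first k results; the exact metric definition used in the project is not spelled out here, and the query and passage IDs are invented.

```python
def hit_rate_at_k(
    results: dict[str, list[str]],
    ground_truth: dict[str, set[str]],
    k: int = 10,
) -> float:
    """Share of queries with at least one relevant passage in the top k."""
    if not results:
        return 0.0
    hits = 0
    for query_id, ranked_passages in results.items():
        relevant = ground_truth[query_id]
        if any(p in relevant for p in ranked_passages[:k]):
            hits += 1
    return hits / len(results)


# Invented toy data: two questions, each asked in two languages.
toy_results = {
    "q1_de": ["a", "b"],
    "q1_fr": ["c", "a"],
    "q2_de": ["x", "y"],
    "q2_fr": ["y", "z"],
}
toy_ground_truth = {
    "q1_de": {"a"},
    "q1_fr": {"a"},
    "q2_de": {"x"},
    "q2_fr": {"x"},
}

print(hit_rate_at_k(toy_results, toy_ground_truth, k=1))  # 0.5
print(hit_rate_at_k(toy_results, toy_ground_truth, k=2))  # 0.75
```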

Reranking added a useful second-stage improvement. Hybrid search, by contrast, helped only marginally in our experiments. Combining keyword and meaning-based search sounds attractive, but in this setting the strongest embedding models already captured much of what the hybrid approach could add.

What surprised us

Two findings stood out. First, language deconfounding had a clear visual effect but little practical benefit. After removing the language signal, German and Italian documents were visibly more intermixed in the mathematical search space. For strong multilingual models, however, this barely improved retrieval performance. The practical lesson is straightforward: it is usually better to start with a strong multilingual model than to try to repair a weaker one afterwards.
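The intuition behind removing a language signal can be illustrated with a deliberately simplified stand-in: estimate a "language direction" as the difference between the mean German and mean Italian vectors, then project that direction out of every vector. This is not the exact procedure of the linear concept erasure method referenced in the footnote, and the vectors are invented so that the first coordinate encodes language only.

```python
def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def remove_direction(vector: list[float], direction: list[float]) -> list[float]:
    """Subtract the component of `vector` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    scale = sum(v * d for v, d in zip(vector, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vector, direction)]


# Invented toy embeddings: same content coordinates, opposite language coordinate.
german = [[1.0, 0.2, 0.5], [1.0, 0.8, 0.1]]
italian = [[-1.0, 0.2, 0.5], [-1.0, 0.8, 0.1]]

# Simplified language direction: difference of the two language means.
language_direction = [
    g - i for g, i in zip(mean_vector(german), mean_vector(italian))
]

cleaned_german = [remove_direction(v, language_direction) for v in german]
cleaned_italian = [remove_direction(v, language_direction) for v in italian]

# After the projection, passages with the same content coincide.
print(cleaned_german[0], cleaned_italian[0])
```

In this toy case the language information sits in a single direction, so removing it makes matching passages identical. Real embeddings are less tidy, which is consistent with the modest gains reported above for models that are already strongly multilingual.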

Second, asking a large language model to paraphrase the legal queries made retrieval worse. This was counterintuitive. One possible explanation is that the evaluation questions were written by a professional lawyer. Their wording already used precise legal terminology close to the language of the court decisions. Rewriting the questions may have weakened that legal precision instead of improving it.

Multilingual does not yet mean equally good in every language

The results are promising, but they are not uniform across languages. German and English queries performed better than French and Italian queries.

This matters. A legal search system that works well “on average” may still underperform for some users or some languages. In a multilingual legal system, that imbalance directly affects who can find which legal information.

Part of the imbalance likely comes from the data itself. The decision collection and the evaluation set were strongly skewed toward German-language decisions. Future work should therefore put particular emphasis on better evaluation data for French and Italian, and on methods that reduce performance gaps between languages.

What’s next

The JuRAG project shows that cross-lingual legal retrieval is feasible. AI-based search can help legal professionals find relevant rulings across language boundaries, and it may also make information about the Swiss legal system more accessible to a wider audience.

At the same time, the project also shows where more work is needed. The next steps are to reduce the remaining language gaps, expand the evaluation set, and test the system with practising lawyers in realistic research workflows.

If you work on applied legal AI, multilingual retrieval or related topics, we would be happy to hear from you. Please contact Lars Schmid or Manuela Hürlimann.

  1. Readers interested in the technical background of this approach can find it in a related paper: “The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure”.
