
How to Create a Bilingual Glossary Using Generative AI


Note: This blog post was originally written in Japanese for our Japanese website. We translated it with our machine translation platforms and then post-edited the English. The original Japanese post can be found here.

 

What is a glossary?

A glossary is an essential resource for ensuring that technical terms, proper nouns, and other words and phrases are translated into standardized equivalents in the target language (for the importance of glossaries, see this post). In machine translation, a bilingual glossary can also be used to enforce the specified translations, which makes it a valuable resource for improving translation accuracy.
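As a rough illustration, a bilingual glossary can be pictured as a mapping from source terms to their approved translations. The sketch below does not enforce anything inside a machine translation engine; it only checks, after the fact, whether a translated segment uses the approved equivalents. The function name `find_glossary_violations` and the sample entries are hypothetical, and plain substring matching is a simplifying assumption.

```python
def find_glossary_violations(source, translation, glossary):
    """Return (source_term, approved_translation) pairs missing from the translation."""
    violations = []
    for src_term, tgt_term in glossary.items():
        # Only check terms that actually occur in the source segment.
        if src_term in source and tgt_term.lower() not in translation.lower():
            violations.append((src_term, tgt_term))
    return violations


# Hypothetical Japanese-to-English glossary entries, for illustration only.
glossary = {"用語集": "glossary", "機械翻訳": "machine translation"}
source = "機械翻訳では用語集が重要です。"
translation = "A termbase is important in machine translation."

violations = find_glossary_violations(source, translation, glossary)
print(violations)  # the segment uses "termbase" instead of the approved "glossary"
```

A check like this only flags inconsistencies. Whether and how a glossary is applied during translation depends on the engine being used.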

The basic method for creating a glossary is quite simple. First, extract terms from the documents you are working with, then assign appropriate translations to them. That’s it. However, what if there is an enormous number of documents, and you also need to find the corresponding translations in an equally vast body of previously translated documents? Doing this by hand would require an immense amount of time and effort.

Tool-based term extraction methods were devised to automate this task. With conventional approaches, however, it was extremely difficult to pick out the right terms with high precision from a large pool of candidates, and extraction from parallel texts in particular was far from practical. With the advent of generative artificial intelligence, or generative AI, this process can now be carried out efficiently and accurately.

Conventional challenges

Conventional terminology extraction tools applied morphological analysis and similar techniques to automatically pull frequently occurring terms out of large volumes of documents. While this approach could produce candidate terms from many documents, the results contained a lot of noise (i.e., unnecessary term candidates), and it could not extract bilingual term pairs. As a result, a great deal of time and effort was still needed to sort through the candidates manually.
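To make the noise problem concrete, here is a minimal sketch of frequency-based candidate extraction. It is not the tooling described above: it swaps morphological analysis for simple lowercase word matching on English text, and the function name `frequent_ngrams`, the thresholds, and the sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def frequent_ngrams(texts, max_n=2, min_count=2):
    """Count word n-grams across texts and keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        # Crude tokenization; a real tool would use morphological analysis here.
        tokens = re.findall(r"[a-z]+", text.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {term: count for term, count in counts.items() if count >= min_count}


docs = [
    "The translation memory stores each segment.",
    "Update the translation memory after review.",
]
candidates = frequent_ngrams(docs)
print(sorted(candidates))
```

Even on two short sentences, the useful candidate "translation memory" comes back alongside noise such as "the" and "the translation". Those are the entries that would then have to be filtered out by hand.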

A new approach with generative AI

With today’s much-discussed generative AI, it has become possible to extract terms in a specified domain with high accuracy. Because generative AI can flexibly follow detailed prompts about which terms to extract, it can efficiently retrieve only the necessary terms. It can also extract source terms together with their translations, which makes bilingual terminology extraction possible as well.
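As a hedged sketch of what a detailed extraction prompt might look like in practice, the code below builds a prompt for a given domain and parses a line-per-term reply. The prompt wording, the function names `build_extraction_prompt` and `parse_term_lines`, and the sample reply are assumptions, not the prompts used in this post, and no particular model or API is implied.

```python
def build_extraction_prompt(domain, text, max_terms=20):
    """Assemble an instruction that limits extraction to domain-specific terms."""
    return (
        f"Extract up to {max_terms} technical terms and proper nouns "
        f"from the {domain} text below.\n"
        "Skip general vocabulary. Return one term per line with no numbering.\n\n"
        f"Text:\n{text}"
    )


def parse_term_lines(response):
    """Turn a line-per-term model reply into a de-duplicated list of terms."""
    terms = []
    for line in response.splitlines():
        # Drop bullet characters a model may add despite the instructions.
        term = line.strip().lstrip("-*• ").strip()
        if term and term not in terms:
            terms.append(term)
    return terms


prompt = build_extraction_prompt("automotive", "Tighten the flange bolt with a torque wrench.")
# A made-up reply standing in for real model output.
sample_reply = "- torque wrench\n\n- torque wrench\n* flange bolt\n"
terms = parse_term_lines(sample_reply)
print(terms)
```

Keeping prompt construction and reply parsing in separate functions makes it easier to adjust the extraction criteria without touching the downstream cleanup.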

Key innovation: Fine-tuning and separating bilingual term extraction

However, generative AI is not perfect. Even with finely tuned prompts, we found that noise and missed terms still occurred at a certain rate. We therefore devised further measures to improve accuracy.

First, we treated term extraction from the source text and the alignment of those terms with their translations as two separate tasks. We also fine-tuned an AI model for each task to improve its accuracy. Fine-tuning is the process of training a generative AI on specific data to further improve processing accuracy. For example, by training a model on the patterns of terms contained in particular documents, we were able to reduce unnecessary term candidates and select only those suitable for the purpose. Similarly, for bilingual term extraction, we improved accuracy by training on data in which the translations are correctly aligned.

Terminology extraction process using generative AI
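The separation of the two tasks can be sketched as a small pipeline in which extraction and alignment are independent, swappable steps. This is a minimal sketch under stated assumptions, not the actual implementation: `build_glossary` is a hypothetical name, and the two stub functions stand in for the fine-tuned models by using a fixed term list and lookup table.

```python
def build_glossary(source_text, target_text, extract_terms, align_term):
    """Run extraction first, then align each extracted term with its translation."""
    glossary = {}
    for term in extract_terms(source_text):
        translation = align_term(term, source_text, target_text)
        if translation:  # Skip terms with no confirmed translation.
            glossary[term] = translation
    return glossary


# Stubs standing in for the two fine-tuned models (hypothetical data).
KNOWN_TERMS = ["flange bolt", "torque wrench", "gasket"]
LOOKUP = {
    "flange bolt": "フランジボルト",
    "torque wrench": "トルクレンチ",
    "gasket": "ガスケット",
}


def stub_extract(source_text):
    """Pretend extractor: returns known terms that occur in the source text."""
    return [t for t in KNOWN_TERMS if t in source_text.lower()]


def stub_align(term, source_text, target_text):
    """Pretend aligner: accepts a translation only if it appears in the target text."""
    candidate = LOOKUP.get(term)
    return candidate if candidate and candidate in target_text else None


source = "Tighten the flange bolt with a torque wrench."
target = "トルクレンチでフランジボルトを締めます。"
result = build_glossary(source, target, stub_extract, stub_align)
print(result)
```

Because each step is passed in as a function, either one can be evaluated, retrained, or replaced on its own, which is the practical benefit of handling the two tasks separately.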

 

Actual results and future potential

With fine-tuned generative AI, the noise that occurred with conventional methods has been significantly reduced, so that only carefully selected terms are extracted. Aligned bilingual term pairs can also be extracted smoothly, which increases consistency in the glossary. Going forward, we plan to validate the approach with more sample data to unlock more of the potential of terminology extraction using generative AI.

Terminology extraction using generative AI could be a new solution for many companies struggling with large volumes of documents. Glossary-building efforts that were previously abandoned can become feasible with this technology. Why not try terminology extraction with generative AI to streamline your operations?

Kawamura International’s services

At Kawamura International, we not only offer glossary extraction using generative AI, but also provide a wide variety of services that leverage AI translation, including the machine translation engine Kawamura NMT powered by NICT. We can make a wide range of proposals tailored to your security requirements, objectives, fields, and intended user groups. Please feel free to contact us if you have any questions.
