Note: This blog post was originally written in Japanese for our Japanese website. We translated it with our machine translation platform, Translation Designer, and then post-edited the English output. The original Japanese post can be found here.

Do any of the following situations with translation data sound familiar?

  • We have been translating in-house but have not been able to utilize the accumulated translation assets.
  • We're interested in introducing machine translation to reduce translation costs.
  • We want to fully digitize our past printed data but don't want to spend much time on it.
  • We want to organize our data so that we can search, tabulate, and analyze it.

Visualizing data lets you review your business processes. As a first step, annotate the past data that sits stored and forgotten at your company and convert it into a form you can actually use.

What is annotation?

Annotation is a term used in various fields such as programming, but in the field of artificial intelligence (AI) it refers to labeling data such as text, images, and videos. In other words, it means adding metadata*. Organizing data so that computers can interpret it and link it to other data lets you search, classify, and analyze its content.

*Metadata:
Data that describes other data. Some metadata is generated automatically when the data is created, while some is added by the user depending on the intended use of the data. Example: for a photo, the file name, file creation date, and file creator are metadata.
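As a rough illustration of why metadata makes data searchable, here is a minimal Python sketch. The class name AnnotatedSegment, the search helper, and the label names (language, doc_type, year) are all hypothetical and chosen only for this example; they are not taken from any particular annotation tool.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedSegment:
    """A piece of text plus the metadata (labels) attached to it."""
    text: str
    metadata: dict = field(default_factory=dict)


def search(segments, **criteria):
    """Return the segments whose metadata matches every given key/value."""
    return [
        s for s in segments
        if all(s.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical records; the label names and values are illustrative only.
segments = [
    AnnotatedSegment("Tighten the bolt before use.",
                     {"language": "en", "doc_type": "manual", "year": 2015}),
    AnnotatedSegment("Payment is due within 30 days.",
                     {"language": "en", "doc_type": "contract", "year": 2018}),
    AnnotatedSegment("Replace the filter every six months.",
                     {"language": "en", "doc_type": "manual", "year": 2019}),
]

# Once the text is labeled, filtering by document type is a one-liner.
manual_segments = search(segments, doc_type="manual")
for segment in manual_segments:
    print(segment.metadata["year"], segment.text)
```

Without the labels, the same question ("which sentences come from manuals?") would require someone to read every sentence again.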

Data annotation by translation companies

Simply put, translation companies use data annotation to improve translation quality. To achieve that, it is important that the customer and the translation company first agree on the specifications*. Based on these specifications, the annotation work produces a glossary or a collection of bilingual sentences.

*Specifications:
Turnaround time, budget, quantity, text information, delivery file format, and reference materials (including glossaries, corpora, translation memories, style guides).

Glossaries and collections of bilingual sentences are essential for translating technical and company-specific terms accurately and for keeping translated terms consistent. Using a dedicated tool, we can carry out data annotation according to your budget, preferred turnaround time, quality requirements, and intended use of the data. Through data annotation, you can create a glossary or a collection of bilingual sentences from the data kept at your company. High-quality glossaries and collections of bilingual sentences can also reduce translation costs, as well as post-editing costs when editing machine translation output.

If you already use a computer-assisted translation (CAT) tool and have a translation memory, you can use that data as your collection of bilingual sentences. If you have a bilingual file, for example in Word format, it can also be edited with a tool so that you can use it as your collection of bilingual sentences. It is also possible to create a monolingual glossary from a monolingual corpus*.
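To give a rough idea of how a translation memory can become a collection of bilingual sentences, the sketch below assumes the memory has been exported in TMX, a common XML exchange format for translation memories (the post itself does not name a format). The function name tmx_to_pairs and the sample content are hypothetical, and a real export would normally be read from a file rather than an inline string.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Extract (source, target) sentence pairs from TMX content."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one translation unit per aligned segment
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segs[lang] = "".join(seg.itertext()).strip()
        # Keep only units that contain both languages.
        if src_lang.lower() in segs and tgt_lang.lower() in segs:
            pairs.append((segs[src_lang.lower()], segs[tgt_lang.lower()]))
    return pairs


# Hypothetical sample; the last unit has no English side and is skipped.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="ja" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="ja"><seg>ボルトを締めます。</seg></tuv>
      <tuv xml:lang="en"><seg>Tighten the bolt.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="ja"><seg>フィルターを交換します。</seg></tuv>
      <tuv xml:lang="en"><seg>Replace the filter.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="ja"><seg>電源を切ります。</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(TMX_SAMPLE, "ja", "en")
for source, target in pairs:
    print(source, "->", target)
```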

*Corpus:
A language database: a systematic collection of written and spoken language materials that has been annotated with additional information. A monolingual corpus is a corpus in only one language.
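As a very simple illustration of building a monolingual glossary from a monolingual corpus, the sketch below counts how often each one- to three-word phrase appears and keeps the frequent ones as candidate terms. This is only one naive approach, not a description of any specific tool; the function name candidate_terms, the stopword list, and the thresholds are assumptions made for the example, and a person would still need to review the candidates.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "before", "in", "is", "of", "the", "to"}


def candidate_terms(sentences, max_len=3, min_count=2):
    """Return (phrase, count) candidates that occur at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z][a-z\-]*", sentence.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip phrases that start or end with a function word.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


# Hypothetical monolingual corpus.
corpus = [
    "Check the tightening torque before use.",
    "The tightening torque is listed in the manual.",
    "Record the tightening torque.",
]

candidates = candidate_terms(corpus)
for term, count in candidates:
    print(count, term)
```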

Machine translation and data annotation

Data annotation can also be used to create training data for machine translation engines. A general-purpose machine translation engine may produce mistranslations and omissions, but customizing the engine for your content can improve translation quality significantly.

There are two ways to customize machine translation:

(1) Applying terminology

If you apply a bilingual glossary that pairs each source-language term with its target-language equivalent, the engine will always translate those terms with the entries listed in the glossary. This lets you translate technical terms correctly.
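How a glossary is actually applied depends on the machine translation engine, so the sketch below only illustrates the expected result: whenever a glossary source term appears in the source sentence, the listed target term should appear in the output. The function name find_glossary_violations and the sample glossary are hypothetical, and the check uses simple substring matching.

```python
def find_glossary_violations(source, translation, glossary):
    """Return glossary entries whose source term occurs in the source text
    but whose target term is missing from the translation."""
    translation_lower = translation.lower()
    return [
        (src_term, tgt_term)
        for src_term, tgt_term in glossary.items()
        if src_term in source and tgt_term.lower() not in translation_lower
    ]


# Hypothetical Japanese-to-English glossary and machine translation outputs.
glossary = {"締付トルク": "tightening torque", "取扱説明書": "instruction manual"}
source = "取扱説明書で締付トルクを確認してください。"

good_output = "Check the tightening torque in the instruction manual."
bad_output = "Check the fastening torque in the instruction manual."

print(find_glossary_violations(source, good_output, glossary))
print(find_glossary_violations(source, bad_output, glossary))
```

A check like this can also be run on post-edited text to confirm that terminology stayed consistent.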

(2) Additional training

When a machine translation engine is trained on your collection of bilingual sentences, its output moves closer to the translations in that collection. Given training data (glossaries and collections of bilingual sentences), the engine can learn translation patterns on its own and improve translation quality. However, there are some points to be aware of when creating training data.

Do glossaries and bilingual sentences for human translation differ from those for machine translation?

Human translation allows flexibility in deciding which terms to apply depending on the context. A translator can handle inflected forms and, if a single word has multiple translations, select the one that fits the context.

A glossary applied to machine translation, on the other hand, is used without regard to the meaning or context of the words, so it must be one-to-one. Terms suitable for a machine translation glossary include proper nouns, technical terms, industry terms, and other terms that do not change form and whose translation does not depend on the context. For this reason, you need to create separate glossaries depending on whether you are performing human translation or machine translation.
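One way to derive a machine translation glossary from an existing human translation glossary is to keep only the source terms that have exactly one translation. The sketch below shows that idea; the function name split_one_to_one and the sample entries are hypothetical, and the check covers only the one-to-one condition, not whether a term changes form.

```python
from collections import defaultdict


def split_one_to_one(entries):
    """Split (source, target) entries into a one-to-one glossary and the
    ambiguous source terms that have more than one translation."""
    targets = defaultdict(set)
    for source, target in entries:
        targets[source.strip()].add(target.strip())
    mt_glossary = {s: next(iter(t)) for s, t in targets.items() if len(t) == 1}
    ambiguous = {s: sorted(t) for s, t in targets.items() if len(t) > 1}
    return mt_glossary, ambiguous


# Hypothetical English-to-Japanese entries from a human translation glossary.
entries = [
    ("torque wrench", "トルクレンチ"),
    ("torque wrench", "トルクレンチ"),  # duplicates collapse into one entry
    ("run", "実行する"),              # "run" depends on context,
    ("run", "走る"),                  # so it is left out of the MT glossary
]

mt_glossary, ambiguous = split_one_to_one(entries)
print("For machine translation:", mt_glossary)
print("Needs a human decision:", ambiguous)
```

The ambiguous terms are not lost: they remain useful for human translators, who can choose the right translation from context.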

Converting printed data into electronic data

As a pre-annotation step, handwritten text and forms are scanned and converted into character data using an optical character recognition (OCR) tool. After that, you can extract terms and information from the converted data, annotate them, and create the glossaries and collections of bilingual sentences described above.

With this process, you can move toward fully digitizing your past printed data. Because the data becomes easier to search and analyze, you can also expect improved work efficiency.

Kawamura's data annotation services

With Kawamura International's data annotation services, data can be organized according to your intended purpose and application. We also offer a service that can quickly and automatically create glossaries and collections of bilingual sentences.

Now is the time to consider how to make use of the past data stored at your company. Please feel free to reach out to us with any questions you may have.