LLM vs NMT for translation revisited: the Azure preview
A couple of years ago, I wrote the blog post “Is GPT better at translating than translation engines?”
It looked at the systematic research into translation quality and concluded that, contrary to popular belief, NMT (neural machine translation) performed better than GPT models, the most advanced LLMs (large language models) at the time. Although LLM translation is more fluent and natural sounding, it is less accurate in general; that is to say, the facts contained in the translation are less likely to match the facts contained in the original.
Two years have passed, and many more studies have been done with more recent models. Microsoft has recently brought out, in preview, some LLM-based translation alternatives to NMT, integrated into Azure AI Translator. Have the results changed in two years? Oddly enough, not very much. Over a large test set, NMT is still marginally better. LLMs are now better in many categories, but they are typically more expensive and slower by orders of magnitude.
Why People Believe LLM Translation Is Better
People persistently believe that LLMs are better, probably for two reasons. One is based on variations of the statistical fallacy called survivorship bias. When people translate using NMT and find a particularly clumsy translation, they may reach for an LLM to see if it does any better. Finding a better translation then reinforces their belief that LLMs are better at translation. Survivorship bias says that if you have to pass (or in this case, fail) a test to be included in the sample being studied, then your sample is not representative of the whole population and may give different results. The other reason is that while translators and developers of translation software tend to value accuracy, other people value fluency and natural wording.
Where LLM Translation Performs Better Than NMT
There are certainly some types of text where LLMs do better than NMT. LLMs are better at adapting style and level of formality. For example, there are many languages, such as Japanese and Korean among others, where you would use different words, noun forms, and verb forms depending on the relative rank, age, and degree of familiarity of the person being addressed. In Japanese this is called “keigo” (literally "respectful language"). A translation into Japanese might be technically correct but sound incredibly rude because of the wrong level of formality. Many Indo-European and Indo-Aryan languages have that feature, but mostly confined to pronouns, like the French “vous” and “tu”. Because relative status is not contained in each sentence, NMT is not good at adding that extraneous fact to the translation. It’s also partly a cultural bias on the part of those who design the training error measures: is an incorrect tone considered much of an error when the evaluation criteria are designed by an English speaker? This is similar to the gender problem in translation: when NMT (or any other translator) writes a sentence in certain languages, it has to choose a gender, and it will typically choose the stereotyped one based on the examples it has seen. Unless the stereotype affects you, you might not notice the problem.
Technical Differences Between LLM and NMT Translation Models
Let’s skip quickly over the details of the more complex examples of where LLMs and NMT differ. For example, NMT can look ahead to the entire sentence and plan it in its entirety, while LLMs can only look backward: the process that generates the initial words of a sentence has no idea what the end of the sentence will say. On the other hand, LLMs have a longer context window; things from a previous paragraph can influence a translated sentence, while NMT doesn’t look that far back.
Translating Idioms: A Key Difference Between LLMs and NMT
One difference between LLMs and NMT that is easier to understand is figurative language such as idioms. NMT tends to take things literally, while LLMs do not. NMT is very concerned that every word and every fact in the original sentence also be present in the translation. It’s also trained for forward and backward translation. You know that trick where you let it translate from English to another language, and then translate that sentence back to English so you can laugh at the results? NMT engines include that round trip in their training, and a literal translation of an idiom is more likely to survive the round trip with roughly the same words than a substitution with a different idiom in the other language, which may not revert to the original idiom when translated back to English.
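The round-trip effect described above can be sketched in a few lines. This is a toy illustration: translate stands in for any MT client, and word overlap is a crude stand-in for the similarity measures real training pipelines use. A literal idiom translation scores high on the round trip; an equivalent-idiom substitution scores low even when it is the better translation.

```python
def word_overlap(a: str, b: str) -> float:
    """Fraction of words in a that also appear in b (crude similarity)."""
    wa = set(a.lower().split())
    wb = set(b.lower().split())
    return len(wa & wb) / len(wa) if wa else 0.0

def round_trip_overlap(text: str, translate, src: str, tgt: str) -> float:
    """Translate src -> tgt -> src and measure how much of the original survives.

    translate(text, src, tgt) is a placeholder for any machine-translation call.
    """
    forward = translate(text, src, tgt)
    back = translate(forward, tgt, src)
    return word_overlap(text, back)
```

A translator that renders “a piece of cake” word for word gets a perfect round-trip score, while one that substitutes the equivalent German idiom “ein Kinderspiel” scores near zero, which is exactly the pressure that pushes NMT toward literal idiom translations.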
Azure AI Translator Public Preview of LLM translation
Microsoft has released a public preview of Azure AI Translator with the ability to use LLMs as an alternative or a complement to NMT. The 2025-05-01-preview version of Azure AI Translator includes the ability to use an LLM for translation. This new public preview capability is severely limited. It is found in Foundry tools only, not in the Azure portal or in Azure AI Language Studio, where Azure AI Translator is normally found. It works for text translation only, not document translation. Rather than translating up to 1,000 entries at once, its limit is 50. Its maximum text size is, in theory, 5,000 bytes rather than 50,000, but in practice we find the maximum is much less, about 1,000 bytes.
If we choose the LLM translation option, it allows us to specify either the GPT-4o model from OpenAI or its compact version, GPT-4o-mini, and you must deploy those models in Microsoft Foundry. You can tell the Azure Translator API which model to use, or none, or a hybrid approach with NMT. LLM translation is limited to a subset of the languages that NMT supports. In our testing, we encountered a frustrating number of random errors, so it’s not quite ready for production yet.
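To make the shape of a call concrete, here is a sketch that builds (but does not send) a request for the preview API. Everything marked as an assumption, including the endpoint path and the field names for selecting a deployed model, should be checked against the 2025-05-01-preview reference documentation before use; this is a sketch of the general pattern, not a verified implementation.

```python
import json

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder

def build_request(texts, target_lang, deployment=None):
    """Return (url, headers, body) for a translate call, without sending it.

    deployment=None means plain NMT; a deployment name such as "gpt-4o"
    requests LLM translation. ASSUMPTIONS: the path, the api-version value,
    and the "inputs"/"deploymentName" field names are unverified guesses at
    the preview's request shape.
    """
    url = f"{ENDPOINT}/translator/text:translate?api-version=2025-05-01-preview"
    headers = {
        "Ocp-Apim-Subscription-Key": "<your-key>",  # placeholder
        "Content-Type": "application/json",
    }
    target = {"language": target_lang}
    if deployment:
        target["deploymentName"] = deployment  # assumed field name
    body = {"inputs": [{"text": t} for t in texts], "targets": [target]}
    return url, headers, json.dumps(body)
```

The point of the structure is the choice the preview gives you: leaving the deployment out falls back to classic NMT, while naming a Foundry deployment routes the same text through the LLM.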
Adaptive Custom Translation and Few-Shot Learning in Azure
An extended version of LLM translation, called Adaptive Custom Translation, goes even further. If you recall this post I wrote in 2022, Azure Custom Translator is a way to re-train an Azure Translator engine with your own data, at least 10,000 professionally translated examples, so that it will adopt your vocabulary and your style. It takes weeks of work and days of computer time to train a new model. Adaptive Custom Translation is a lot easier. It involves providing a small set of pre-translated sentences or phrases, as few as 5 sentences but fewer than 10,000. Rather than training the model with them as Custom Translator does, it chooses a few of those sentence pairs on the fly to provide to the LLM as examples of the vocabulary and style that it should use. This is called “few-shot learning”: the model is given a few hints of what a correct answer looks like before it is given a task. The data sets of sentences are generated and indexed in advance, and you can either provide a few reference sentences yourself or let Azure Translator select appropriate sentences automatically. Adaptive Custom Translation supports only a handful of languages.
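Few-shot learning itself is easy to illustrate outside Azure. In this sketch, the retrieved sentence pairs are simply presented to the model as worked examples before the sentence to translate. The system/user/assistant message format below is the common chat-completion convention, not Azure Translator's internal mechanism, which is not public.

```python
def build_few_shot_prompt(examples, source_text, src="en", tgt="de"):
    """Build chat messages that show the model a few reference translations.

    examples: list of (source, reference_translation) pairs, e.g. the handful
    of sentences the service retrieves as relevant to this input.
    """
    messages = [{
        "role": "system",
        "content": (f"Translate from {src} to {tgt}. "
                    "Match the vocabulary and style of the examples."),
    }]
    # Each example pair becomes a fake exchange: the model sees what a
    # correct answer looked like before it gets the real task.
    for src_sent, ref in examples:
        messages.append({"role": "user", "content": src_sent})
        messages.append({"role": "assistant", "content": ref})
    messages.append({"role": "user", "content": source_text})
    return messages
```

Because the examples are chosen per request rather than baked into the model, you can change your terminology by editing a data set instead of retraining for days, which is the whole appeal over classic Custom Translator.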
The public preview also lets you specify the tone and genders to use in translations, whether you are using LLM translation or NMT.
Testing Idiom Translation: NMT vs GPT-4o vs GPT-4o-mini
So how good are the translations? We didn’t test with an extensive data set; we will accept the research that says NMT is slightly better than LLMs on average, depending on whether you value accuracy or fluency.
As mentioned earlier, one type of translation that LLMs do better at is idioms. While NMT tries to stick to the words in the original, LLMs are happy to hallucinate a likely phrase that someone might say in that sentence, like a roughly equivalent idiom, and replace the exact translation with it. If you are translating from English to another language and the English text suddenly talks about beating dead horses or mistaking a task for a piece of cake in the middle of a business discussion, LLMs will probably conclude that the phrase is out of place and replace it with something more likely in the other language. It’s hallucinating, but in a good way.
We tried translating 33 German idioms mostly related to animals or food into English using either NMT, or LLM translation using gpt-4o-mini, or LLM translation using the much bigger gpt-4o model. We did not use Adaptive Custom Translation because it would not be helpful in the case of idioms.
The results are in the table below. Where the translation is marked in green, it is essentially correct, with the German idiom replaced by an equivalent English idiom. The ones in red are wrong, typically a literal translation, and the amber ones convey the general idea but not the corresponding idiom.

Results: How Well Do LLMs Translate Idioms?
In one case the LLM refused to translate an idiom. The German “Da kannst du Gift drauf nehmen” literally means “you can take poison on it”, but the English idiom would be “you can bet your life on it”. Unfortunately, the LLM’s self-harm filter takes a dim view of advising someone to drink poison.
You notice a lot of red in the table. LLM translation is not a panacea. While NMT translated only 6.5 idioms correctly, including three that earned half points, gpt-4o-mini got 9 of them right and gpt-4o got 15.5. That is still less than half, but a lot better than without an LLM. You can also see that gpt-4o-mini alone doesn’t make a huge difference; you need the full gpt-4o to make it worthwhile.
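The tallies above count a green cell as one point and an amber one as half a point, out of 33 idioms. As a quick check of the shares:

```python
TOTAL = 33  # idioms tested

def share(points: float, total: int = TOTAL) -> float:
    """Fraction correct, counting a green cell as 1 point and amber as 0.5."""
    return points / total

# Point totals from the table: NMT 6.5, gpt-4o-mini 9, gpt-4o 15.5
assert share(15.5) < 0.5  # even gpt-4o stays under half
```

That works out to roughly 20% for NMT, 27% for gpt-4o-mini, and 47% for gpt-4o.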
Practical Use: Combining NMT and LLM Translation
So, what can we do with this information? Well, if you’re interested, we have developed a version of PointFire Translator Express that can translate SharePoint Online pages using either NMT or LLMs. If you contact us, we might let you try it before it is released. The LLM translation is significantly slower than the NMT translation and fails more often. Results are not necessarily better, but at least they are different, so a lot of them are actually better. One way to use it is to first translate pages with NMT, then re-translate any pages whose translation you don’t like using LLMs. It will be slower, but oddly it’s about half as expensive as NMT when you use gpt-4o, and even less for gpt-4o-mini.
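The NMT-first workflow suggested above can be sketched as a simple two-pass loop. The translate functions and the retry predicate are placeholders for whatever NMT client, LLM client, and review process you actually use; this is the pattern, not our product's implementation.

```python
def translate_pages(pages, nmt_translate, llm_translate, needs_retry):
    """Translate everything with NMT first; redo rejected pages with the LLM.

    pages: dict of page_id -> source text
    needs_retry: predicate over (page_id, translation), e.g. a human
    review flag on pages whose NMT translation was not liked.
    """
    # First pass: fast, cheap NMT for every page.
    results = {pid: nmt_translate(text) for pid, text in pages.items()}
    # Second pass: the slower LLM only for the pages that were rejected.
    for pid, text in pages.items():
        if needs_retry(pid, results[pid]):
            results[pid] = llm_translate(text)
    return results
```

This keeps the slow, failure-prone LLM path confined to the small fraction of pages where the NMT output was actually unsatisfactory.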



