TL;DR: Bad normalization
OpenAI’s Whisper paper has this to say about text normalization:
A different, language-specific set of transformations would be needed to equivalently normalize non-English text, but due to our lack of linguistic knowledge to build such normalizers for all languages, we resort to the following basic standardization for non-English text…
As a person who also lacks linguistic knowledge, I admire the straightforwardness of just going “Yeah, we have no idea, so we just won’t bother”. Their self-assessment turns out to be pretty accurate, though, because the normalization they ended up with has a glaring issue.
For a general ASR model like Whisper, the rule of thumb is that it’s far better to train on raw data without any normalization: given enough training data, the model learns to predict peripheral language features such as capitalization and punctuation, so no denormalization step is needed in actual use. Evaluation, however, is done on normalized text so that a dataset’s specific idiosyncrasies aren’t reflected in the reported metric. While testing the normalization function OpenAI uses, I plugged in some Malayalam text expecting to get basically the same thing back minus the punctuation:
ഞാൻ കേൾക്കാൻ ആഗ്രഹിക്കുന്ന വാക്കുകൾ കൊണ്ട് ഞാൻ എപ്പോഴും ജനങ്ങളെ ആശ്വസിപ്പിക്കാൻ ശ്രമിക്കുന്നു
Instead, I got this abomination:
ഞ ൻ ക ൾക ക ൻ ആഗ രഹ ക ക ന ന വ ക ക കൾ ക ണ ട ഞ ൻ എപ പ ഴ ജനങ ങള ആശ വസ പ പ ക ക ൻ ശ രമ ക ക ന ന
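If you want to reproduce this, here’s a minimal sketch; I’m assuming the openai-whisper package, which ships this normalizer as whisper.normalizers.BasicTextNormalizer:

# pip install openai-whisper
from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer()  # remove_diacritics defaults to False
text = "ഞാൻ കേൾക്കാൻ ആഗ്രഹിക്കുന്ന വാക്കുകൾ കൊണ്ട് ഞാൻ എപ്പോഴും ജനങ്ങളെ ആശ്വസിപ്പിക്കാൻ ശ്രമിക്കുന്നു"
print(normalizer(text))  # prints the shredded string above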
The culprit is pretty easy to find, right after the aforementioned passage:
Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each character in the NFKC-normalized string starts with M, S, or P
Unicode categories starting with M are marks, and marks are not just optional accents: in scripts like Malayalam they include the vowel signs and viramas that words are built out of, every one of which gets swapped for a space.
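You can check the categories directly with Python’s unicodedata; here is every character of one word from the example sentence, with its Unicode category and name:

import unicodedata

# "Lo" is a plain letter; "Mc" and "Mn" are spacing/nonspacing combining marks,
# i.e. exactly the category-M characters the normalizer throws away.
for ch in "വാക്കുകൾ":  # "words"
    print(ch, unicodedata.category(ch), unicodedata.name(ch))

# വ Lo MALAYALAM LETTER VA
# ാ Mc MALAYALAM VOWEL SIGN AA
# ക Lo MALAYALAM LETTER KA
# ് Mn MALAYALAM SIGN VIRAMA
# ക Lo MALAYALAM LETTER KA
# ു Mn MALAYALAM VOWEL SIGN U
# ക Lo MALAYALAM LETTER KA
# ൾ Lo MALAYALAM LETTER CHILLU LL

Three of the eight characters are marks, so a third of the word turns into spaces.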
A direct look at their code paints an even bleaker picture. By default, the normalizer calls a function remove_symbols that replaces every character whose Unicode category starts with M, S, or P with a space, consistent with the paper:
def remove_symbols(s: str):
    """
    Replace any other markers, symbols, punctuations with a space, keeping diacritics
    """
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )
However, if you set the parameter remove_diacritics to True, you instead get:
def remove_symbols_and_diacritics(s: str, keep=""):
    """
    Replace any other markers, symbols, and punctuations with a space,
    and drop any diacritics (category 'Mn' and some manual mappings)
    """
    return "".join(
        c                              # characters whitelisted via `keep` pass through
        if c in keep
        else ADDITIONAL_DIACRITICS[c]  # a few hand-written one-off replacements
        if c in ADDITIONAL_DIACRITICS
        else ""                        # nonspacing marks (Mn) are dropped outright...
        if unicodedata.category(c) == "Mn"
        else " "                       # ...while other marks, symbols, and punctuation become spaces
        if unicodedata.category(c)[0] in "MSP"
        else c                         # everything else is kept as-is
        for c in unicodedata.normalize("NFKD", s)
    )
Here, they seem to realize that marks do include diacritics, and they drop nonspacing marks instead of replacing them with a space. This is actually marginally better than what remove_symbols does to diacritics, which points to the fundamental blunder being more “we didn’t proofread our code” and less “we didn’t check what Unicode categories meant”.
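To make that concrete, here is what the two functions above do to a single word from the example sentence (outputs shown as comments; ADDITIONAL_DIACRITICS only covers a handful of Latin letters, so it never fires here):

word = "വാക്കുകൾ"  # "words", from the example sentence

print(repr(remove_symbols(word)))
# 'വ ക ക കൾ'  -- every vowel sign and virama becomes a space

print(repr(remove_symbols_and_diacritics(word)))
# 'വ കകകൾ'  -- the nonspacing marks are silently dropped instead,
#              but the spacing vowel sign still turns into a space

Either way the word is gone; the second version just breaks it in marginally fewer places.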
As we saw earlier, OpenAI admits their normalization for anything non-English is barebones. For languages with no clear demarcation between words, it splits sentences into sequences of characters, functionally measuring CER (character error rate) instead of the reported WER (word error rate). That is at least defensible as a quick fix: the numbers measure the wrong thing, but they acknowledge it. Nowhere in the paper, however, do they acknowledge that they do the same thing to a laundry list of languages that do have a clear demarcation between words, namely the space that every other modern written language uses. In the end, what you functionally get after normalization is a Frankenstein of CER and WER for languages unlucky enough to rely on a particular class of diacritics.
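The fix itself is tiny: treat only symbols and punctuation as junk and leave category-M characters alone. Roughly, as a sketch (the name remove_symbols_keep_marks is mine, for illustration):

import re
import unicodedata

def remove_symbols_keep_marks(s: str) -> str:
    """Replace symbols and punctuation (categories S and P) with a space,
    but leave marks (category M) untouched so that vowel signs, viramas,
    and other combining characters survive normalization."""
    cleaned = "".join(
        " " if unicodedata.category(c)[0] in "SP" else c
        for c in unicodedata.normalize("NFKC", s)
    )
    return re.sub(r"\s+", " ", cleaned).strip()

Run the Malayalam sentence through this and it comes back untouched, since there was no punctuation to strip in the first place.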
Since this is the normalization behind their reported results, I reevaluated the medium model on the FLEURS dataset for all potentially affected languages, this time with normalization that doesn’t just fling the diacritics away, to see whether it makes any real difference. Sure enough:
Language | WER % (OpenAI's normalizer) | WER % (mine, keeping diacritics) | Difference (OpenAI - mine) |
---|---|---|---|
Arabic | 20.4 | 27.9 | -7.5 |
Assamese | 102.3 | 161.2 | -58.9 |
Bengali | 100.6 | 134.9 | -34.3 |
Persian | 41.0 | 48.0 | -7.0 |
Gujarati | 104.8 | 117.8 | -13.0 |
Hindi | 26.8 | 58.1 | -31.3 |
Khmer | 98.9 | 287.8 | -188.9 |
Kannada | 77.7 | 109.8 | -32.1 |
Malayalam | 101.1 | 121.2 | -20.1 |
Marathi | 63.2 | 109.8 | -46.6 |
Nepali | 54.4 | 114.4 | -60.0 |
Punjabi | 102.0 | 125.9 | -23.9 |
Pashto | 119.4 | 170.0 | -50.6 |
Sindhi | 147.0 | 129.6 | 17.4 |
Tamil | 23.1 | 54.9 | -31.8 |
Telugu | 82.8 | 122.4 | -39.6 |
Urdu | 28.2 | 41.8 | -13.6 |
Yoruba | 105.1 | 172.7 | -67.6 |
Every single language had a deflated WER, with the exception of Sindhi (I’m not really sure what’s going on there). Granted, Whisper was hopelessly bad at most of these languages to begin with (a WER over 100% is not really a good look), but Tamil, Hindi, and Urdu strike me as three languages where the discrepancy actually matters: a WER around 25% is manageable, 55% much less so.
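If you want to rerun the comparison yourself, the evaluation loop is roughly the sketch below; it assumes openai-whisper, datasets, and jiwer, uses Malayalam as the example FLEURS config, and reuses the remove_symbols_keep_marks sketch from earlier (the details of my actual run differ, but the shape is the same):

import jiwer
import numpy as np
import whisper
from datasets import load_dataset
from whisper.normalizers import BasicTextNormalizer

model = whisper.load_model("medium")
fleurs = load_dataset("google/fleurs", "ml_in", split="test")  # Malayalam, as one example

openai_norm = BasicTextNormalizer()  # the normalizer criticized above

refs, hyps = [], []
for sample in fleurs:
    audio = sample["audio"]["array"].astype(np.float32)  # FLEURS audio is 16 kHz, as Whisper expects
    hyps.append(model.transcribe(audio, language="ml")["text"])
    refs.append(sample["raw_transcription"])

# Same transcripts, two normalizations: the reported-style WER vs. one that keeps the marks
print("OpenAI normalizer:", 100 * jiwer.wer([openai_norm(r) for r in refs],
                                            [openai_norm(h) for h in hyps]))
print("Marks kept:", 100 * jiwer.wer([remove_symbols_keep_marks(r) for r in refs],
                                     [remove_symbols_keep_marks(h) for h in hyps]))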
This becomes even more of an issue when people keep using the same damn normalization function for finetuned model evaluation, like in the Hugging Face Whisper Finetune Sprint, where the leaderboard for these languages is basically curated misinformation.
An easy out would be to use CER: it is more or less consistent across languages and sidesteps issues with scriptio continua writing systems and agglutinative languages. That said, this is a pervasive issue with non-English NLP research in general, where evaluation and its metrics are optimized exclusively for English, but I digress. Next time around, I hope OpenAI uses all the money they saved from underpaying and mentally scarring Kenyan workers to hire a few people who actually know these languages.