Prime Video uses machine learning at scale to prevent spelling errors in content subtitles
We developed a real-time, language-adaptive tool to flag misspellings across 24 languages and process 11,000 subtitle files every month.
Prime Video has an expanding catalogue of content, with more content being added every week. This content could be your favorite Amazon Originals, new stand-up comedy shows, or live sporting events such as NFL Thursday Night Football (TNF). Using subtitles or closed captions is an industry-standard way to provide customers with a localized experience in a cost-effective and accessible manner.
A subtitle’s primary goal is to communicate the spoken dialogue in a scene, and to do this well the subtitle must be grammatically correct, because incorrect subtitles create a poor viewing experience. Supporting high-quality subtitles requires Prime Video to solve technical challenges, including ensuring that the spelling in subtitles is correct. We’re customer-obsessed, so we aim to have no misspelled words in any language across our hundreds of thousands of subtitles. Using natural language processing (NLP), we developed a real-time solution that automatically identifies spelling issues and provides context-aware suggestions.
Our goals for the spellchecker and suggestion system were that it must be real-time, context-aware, and adaptable to new languages. Our system takes in a sentence (the subtitle block), identifies words that are incorrectly spelled, and provides between three and five suggestions that are most relevant in the context of the sentence. A linguistic expert then chooses the correct one.
The four challenges to building our real-time spellchecker
We needed to address multiple technical issues before developing this spellchecker. First, while there is a lot of recent research into spellcheckers, most of the solutions focus on one language. We wanted a solution that scaled across multiple languages while maintaining features such as real-time computation and contextual suggestions. There are existing rule-based, crowd-sourced solutions in which language experts add complex rules to identify grammatical issues, but these were extremely difficult to scale for our use case.
Second, we had to overcome the lack of a suitable evaluation dataset. Misspelling test data for English is readily available, and some limited hand-crafted samples are also available for other languages such as German. However, we didn’t find an appropriately sized, clean dataset covering all the languages that Prime Video offers to customers.
Third, our solution had to work in real time. With recent technical advancements in text-based machine learning (ML) solutions, many researchers proposed neural network and deep learning solutions for English spellcheckers. However, most of these didn’t provide real-time speed at our required level of accuracy. We experimented with deep learning solutions but found that there was always a trade-off between the speed and accuracy of the system.
Finally, we wanted to provide between three and five suggestions for each spelling mistake identified by the system. The suggestions needed to be within a fixed edit-distance window from the misspelled word, which would reduce the search space, and the suggestion list needed to be ordered by relevance in the context. For example, if the input sentence is “This is an appx tree,” where “appx” is the misspelled word, two possible suggestions would be “apple” and “apps”. We wanted to ensure that “apple” appeared before “apps” because it makes the most sense within the context.
Building our real-time spellchecker
To build our solution, we used the following three steps: 1) detect misspellings in real time, 2) generate suggestions for the misspelled words, and 3) rank these suggestions according to the context of the sentence.
First, the input sentence is tokenized into words (tokens) using a regular expression rule built from the list of acceptable characters in that language. We then check whether each word is correctly spelled by using a lookup table. Next, each spelling error is processed to generate a list of candidate suggestions. Finally, these candidates are ranked based on the context.
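The detection step can be illustrated with a minimal sketch. The character pattern, the toy vocabulary, and the function names below are assumptions made for illustration; the article does not give the actual per-language rules or the lookup structure used in production.

```python
import re

# Hypothetical character rule for English: runs of letters, optionally joined
# by apostrophes. A real deployment would supply one pattern per language.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

# Toy vocabulary standing in for the per-language lookup table.
VOCABULARY = {"this", "is", "an", "apple", "apps", "tree"}


def tokenize(sentence: str) -> list[str]:
    """Split a subtitle block into word tokens using the language's pattern."""
    return TOKEN_PATTERN.findall(sentence)


def find_misspellings(sentence: str, vocabulary: set[str]) -> list[tuple[int, str]]:
    """Return (token index, token) pairs for tokens absent from the vocabulary."""
    return [
        (i, token)
        for i, token in enumerate(tokenize(sentence))
        if token.lower() not in vocabulary
    ]


print(find_misspellings("This is an appx tree.", VOCABULARY))  # [(3, 'appx')]
```

Keeping the token index alongside the flagged word makes it straightforward to look up the neighbouring words later, when candidates are ranked by context.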
For each language, we have one preprocessing step where we compute n-gram conditional probabilities and a lookup tree. We need the lookup tree to quickly identify misspelled words and need the n-gram conditional probabilities for ranking candidate suggestions. We used articles from Wikipedia and existing Prime Video subtitles to generate our probabilities and lookup tree.
A significant problem we encountered while building these probability dictionaries was their huge size. For English, there were around 2.2 million unigrams, 50 million bigrams, and 166 million trigrams. If we were to store these as uncompressed Python Counters, they would end up being 44 MB, 1.8 GB, and 6.4 GB respectively. To reduce their size, we compressed them by using a word-level Trie with hashing. This reduced the file sizes by an average of 66% while keeping the n-gram lookup operation bounded in O(1).
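One possible reading of a word-level trie with hashing is sketched below: each trie level is keyed by an integer hash of one word, so an n-gram of up to three words is resolved with at most three dictionary lookups. The class name, the choice of `zlib.crc32` as the hash, and the node layout are assumptions; the article does not describe the actual data structure in detail, and this sketch makes no claim about the reported size reduction.

```python
import zlib


def word_hash(word: str) -> int:
    """Stable 32-bit hash of a word (an assumed choice; collisions are possible)."""
    return zlib.crc32(word.encode("utf-8"))


class HashedNgramTrie:
    """Word-level trie in which each level is keyed by the hash of one word."""

    def __init__(self) -> None:
        # Each node is [count, children]; children maps word hash -> node.
        self.root = [0, {}]

    def add(self, ngram: tuple[str, ...], count: int = 1) -> None:
        """Add `count` to the n-gram; every order (1, 2, 3) is added explicitly."""
        node = self.root
        for word in ngram:
            node = node[1].setdefault(word_hash(word), [0, {}])
        node[0] += count

    def count(self, ngram: tuple[str, ...]) -> int:
        """Return the stored count, or 0 if the n-gram was never added."""
        node = self.root
        for word in ngram:
            node = node[1].get(word_hash(word))
            if node is None:
                return 0
        return node[0]

    def conditional_probability(self, word: str, context: tuple[str, ...]) -> float:
        """Estimate P(word | context) as count(context + word) / count(context)."""
        context_count = self.count(context)
        if context_count == 0:
            return 0.0
        return self.count(context + (word,)) / context_count


trie = HashedNgramTrie()
trie.add(("an",), 4)
trie.add(("an", "apple"), 3)
print(trie.conditional_probability("apple", ("an",)))  # 0.75
```

Because the depth of the trie is capped at the largest n-gram order, the number of dictionary lookups per query does not grow with the size of the corpus.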
To generate candidates within an edit distance of 2, we used the Symmetric Delete algorithm (SDA), which sped up candidate generation multifold. We then computed a context score for every candidate and ranked the candidates in decreasing order of that score, providing the top three to five as suggestions. We defined the context score as the weighted sum of the unigram probability and the bigram and trigram conditional probabilities. For example, for “This is an appx tree” and the suggestion “apple”, we consider P(apple), P(apple | an), P(tree | apple), P(apple | is, an), and P(tree | an, apple). We calculate the score for each suggestion by replacing the misspelled token with that suggestion.
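The following sketch shows how candidate generation and context ranking could fit together. It makes several assumptions: candidates are verified with plain Levenshtein distance, n-gram counts live in an ordinary dictionary keyed by word tuples (with the empty tuple holding the total token count), and the weights and counts are made-up values. None of these details come from the article.

```python
from itertools import combinations


def deletes(word, max_distance=2):
    """Strings obtained by deleting up to max_distance characters from word."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for drop in combinations(range(len(word)), k):
            variants.add("".join(c for i, c in enumerate(word) if i not in drop))
    return variants


def build_delete_index(vocabulary, max_distance=2):
    """Precompute a map from delete variant to the vocabulary words producing it."""
    index = {}
    for word in vocabulary:
        for variant in deletes(word, max_distance):
            index.setdefault(variant, set()).add(word)
    return index


def levenshtein(a, b):
    """Plain Levenshtein distance, used only to verify candidates."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def generate_candidates(word, index, max_distance=2):
    """Vocabulary words sharing a delete variant with word, within max_distance."""
    found = set()
    for variant in deletes(word, max_distance):
        found |= index.get(variant, set())
    return {c for c in found if levenshtein(word, c) <= max_distance}


def probability(counts, word, context=()):
    """Estimate P(word | context); counts[()] holds the total token count."""
    denominator = counts.get(context, 0)
    return counts.get(context + (word,), 0) / denominator if denominator else 0.0


def context_score(tokens, position, suggestion, counts, weights=(0.2, 0.4, 0.4)):
    """Weighted sum of unigram, bigram and trigram terms around the suggestion."""
    t = tokens[:position] + [suggestion] + tokens[position + 1:]
    i = position
    unigram = probability(counts, t[i])                             # P(apple)
    bigram = trigram = 0.0
    if i >= 1:
        bigram += probability(counts, t[i], (t[i - 1],))            # P(apple | an)
    if i + 1 < len(t):
        bigram += probability(counts, t[i + 1], (t[i],))            # P(tree | apple)
    if i >= 2:
        trigram += probability(counts, t[i], (t[i - 2], t[i - 1]))  # P(apple | is, an)
    if i >= 1 and i + 1 < len(t):
        trigram += probability(counts, t[i + 1], (t[i - 1], t[i]))  # P(tree | an, apple)
    w_uni, w_bi, w_tri = weights  # hypothetical weights, not from the article
    return w_uni * unigram + w_bi * bigram + w_tri * trigram


def rank(tokens, position, candidates, counts, top_k=5):
    """Return up to top_k candidates in decreasing order of context score."""
    return sorted(candidates, reverse=True,
                  key=lambda c: context_score(tokens, position, c, counts))[:top_k]


# Toy data: made-up counts, chosen only to illustrate the ranking.
index = build_delete_index({"this", "is", "an", "apple", "apps", "tree"})
toy_counts = {
    (): 100, ("an",): 10, ("is",): 10, ("apple",): 5, ("apps",): 5,
    ("an", "apple"): 4, ("apple", "tree"): 2, ("is", "an"): 5,
    ("is", "an", "apple"): 3, ("an", "apple", "tree"): 2,
}
tokens = ["this", "is", "an", "appx", "tree"]
print(rank(tokens, 3, generate_candidates("appx", index), toy_counts))  # ['apple', 'apps']
```

The delete index only needs to be built once per language. At query time, the misspelled word's own delete variants are looked up in that index, which avoids enumerating insertions and substitutions over the whole alphabet.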
To improve the spellchecker’s performance on entire subtitle files, we built further on the block-level spellchecker. In subtitles, proper nouns like character names are unlikely to be present in that language’s vocabulary and will be flagged as spelling errors. To minimize such false flags, we run a state-of-the-art named-entity recognition (NER) model on each subtitle block and don’t flag words that are marked as entities. Additionally, if we encounter a word occurring more than a certain number of times in a single subtitle file, we assume that the spelling is intentional and don’t flag it.
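The file-level filtering can be sketched as a post-processing pass over the block-level flags. In this sketch the NER step is not implemented: `entity_words` stands in for whatever words an NER model marked as entities, and `max_occurrences` is a hypothetical threshold, since the article does not state the value used.

```python
from collections import Counter


def filter_flags(flagged_blocks, entity_words, max_occurrences=3):
    """Drop flags for named entities and for words repeated across the file.

    flagged_blocks: one list of flagged words per subtitle block.
    entity_words: words an NER model marked as entities (NER itself not shown).
    max_occurrences: hypothetical threshold; words flagged more often than
        this within one file are treated as intentional spellings.
    """
    frequency = Counter(word for block in flagged_blocks for word in block)
    return [
        [
            word
            for word in block
            if word not in entity_words and frequency[word] <= max_occurrences
        ]
        for block in flagged_blocks
    ]


flags = [["Geralt", "teh"], ["gonna"], ["gonna", "Geralt"], ["gonna"], ["gonna"]]
print(filter_flags(flags, entity_words={"Geralt"}))  # [['teh'], [], [], [], []]
```

Counting over flagged words is enough here because a vocabulary lookup flags every occurrence of the same out-of-vocabulary word.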
Delivering accurate content to our customers at scale
Our spellchecker works in 24 languages and outperforms the industry-standard, rule-based systems while maintaining real-time speed. Detecting a misspelled word and providing contextual suggestions takes no more than 50 ms for most of our supported languages. We have more than 98 percent precision for the top five suggestions for most of our languages.
Spellchecking can seem like an age-old problem, but the scale at which Prime Video operates makes developing and efficiently deploying a spellchecker a technical feat. Our system applies spellchecking to over 11,000 subtitle files every month across our supported languages. This ensures that all video subtitles ingested into the ever-growing Prime Video catalogue are free of spelling errors and provides the best possible user experience for our customers.
For more information about our work, see our paper, “A context sensitive real-time spellchecker with language adaptability,” on the Amazon Science website.