How Prime Video mapped 80,000+ titles to the relevant IMDb pages using Apache Lucene
Leveraging Apache Lucene and a heuristic search algorithm to enrich Prime Video’s catalog with IMDb metadata.
Integrating metadata from IMDb into Prime Video greatly enhances a customer’s viewing experience. For example, imagine that you’re looking for something to watch on Prime Video and that while scrolling through the Prime Video catalog, The Marvelous Mrs. Maisel catches your attention. You choose this title and find a page with information and details about the show. The page shows the series’ high rating from IMDb contributors, along with cast and crew information.
Then you recognize the leading actress, Rachel Brosnahan, and by clicking on her profile, you’re directed to a page with a short biography, filmography, and connections to other cast members.
Prime Video licenses this information from IMDb, the world’s most popular and authoritative source for information on movies, TV shows, and celebrities. For more information about IMDb’s data licensing products, see the IMDb Developer website. As a global database, IMDb strives to present relevant information to customers specific to their locality and language. This means storing the title, poster, release date, and certificates as released in a specific part of the world.
At Prime Video, our Web Emerging Data Provider team’s mission is to deliver data for building delightful customer experiences. We recently released the Top 10 feature and managed the service that provides IMDb metadata to Prime Video. However, we faced a significant technical challenge: Prime Video and IMDb have independent catalogs. This means that to enrich a Prime Video title with IMDb metadata, we first needed to map it to the relevant IMDb records.
We wanted to increase the mappings between both catalogs. To achieve this, we first had to investigate and understand the data in both catalogs, extract the relevant metadata from them, and create an algorithm that finds new mappings with high confidence.
Encountering data challenges in two separate catalogs
IMDb has extensive metadata for movies and TV shows, but Prime Video has its own metadata for each title such as actors, release date, or maturity ratings. But to map titles within these two datasets, we needed to focus on the common information that exists in both. This common information includes titles, genres, content type, release date, directors, actors, and a title synopsis. With so much data in common, you might assume that joining both datasets would be easy. But this wasn’t the case.
Our initial approach was to map movies with the same title name in both datasets and potentially use the remainder of the metadata to confirm the match. Although this seems like a logical approach, it didn’t work. Movies are often released under different names in different countries: in IMDb, each movie might be listed under several regional titles, while Prime Video has only one name per title.
There were even some instances when a Prime Video title name was not in IMDb’s list of names at all, because Prime Video names are provided directly by content providers. Moreover, some names contained phrases like “Extended edition” and “Director’s cut,” but not consistently across both datasets. This made it difficult to rely even on a string similarity algorithm.
Additionally, there is no one-to-one mapping between Prime Video and IMDb genres or between the content-types used to describe a title. Release dates can differ between datasets, typically because a title is being released on different dates in different countries.
IMDb and Prime Video usually had significant differences in the cast and crew listed for a movie or TV show. For example, IMDb keeps multiple names per actor (localized versions and known nicknames), which makes it difficult to map individual actors. Different actors might also share the same full name, which further complicates our use case.
Additionally, the synopses for some titles were completely different, while for others one dataset contained just a short version of the other’s synopsis. Finally, both datasets had missing data in many fields. The following image shows these key differences in the IMDb and Prime Video catalog entries for The Journey.
Mapping titles between both datasets was not an easy task. The following diagram shows how we mapped individual features of each dataset with each other.
A Prime Video title’s name should appear among the IMDb names for that title. Both datasets contain sets of content-types and genres, so we compared these with each other. Also, both datasets contain the release date of a title.
We merged the Prime Video sets of actors and directors in one common set and then compared that with IMDb’s set of principal participating people. Finally, we compared the synopsis of Prime Video against the plot in IMDb.
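As a minimal sketch of how the features line up for comparison (the field names and record shapes here are assumptions for illustration, not the teams’ actual schemas):

```python
# Hypothetical, simplified records for The Journey (2016); real schemas differ.
pv_title = {
    "name": "The Journey",
    "content_type": "MOVIE",
    "genres": {"Drama"},
    "release_year": 2016,
    "actors": {"Timothy Spall", "Colm Meaney"},
    "directors": {"Nick Hamm"},
    "synopsis": "Two rival politicians share a car journey...",
}
imdb_title = {
    "names": {"The Journey", "Le Voyage"},  # one name per region
    "content_type": "movie",
    "genres": {"Drama", "History"},
    "release_year": 2016,
    "principals": {"Timothy Spall", "Colm Meaney", "Nick Hamm"},
    "plot": "A fictional account of a shared car journey...",
}

def comparison_pairs(pv, imdb):
    """Pair up each Prime Video feature with its IMDb counterpart."""
    return [
        (pv["name"], imdb["names"]),            # PV name should be among IMDb names
        (pv["content_type"].lower(), imdb["content_type"]),
        (pv["genres"], imdb["genres"]),
        (pv["release_year"], imdb["release_year"]),
        # Merged actors + directors compared against IMDb's principals set
        (pv["actors"] | pv["directors"], imdb["principals"]),
        (pv["synopsis"], imdb["plot"]),
    ]
```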
Building a solution using Apache Lucene
To build a solution for our customers, our team had to first clean and bring both datasets to the same format. Then, we decided to work separately with different regional segments of the Prime Video catalog. This approach helped us work around the differences in title names from country to country and fine-tune our algorithm for each region. We also worked separately on movies and TV series, as this led to better results for both types of titles. After the data was ready, we began working on our algorithm.
Both datasets contained over one million titles, so comparing every Prime Video title with every IMDb entry would have been a very slow process. To solve this problem, we leveraged Apache Lucene, an open-source search-engine software library that provides powerful indexing and search features. These features allowed us to run search queries against the IMDb records dataset in a quick and cost-effective way.
We loaded all IMDb titles into Lucene, defined the different fields, and indexed the data. Then we used Lucene to make a search query for every Prime Video title against the indexed IMDb data. Queries had all the required information separated into fields (for example, title, content-type, year of release, genres, or list of participating people). For each of these queries, we queried Lucene to get the top ten most similar IMDb records.
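Lucene itself is a Java library with a much richer scoring model; purely as a language-agnostic illustration of the index-then-query pattern (not the actual implementation), a toy in-memory index might look like:

```python
from collections import defaultdict

class ToyIndex:
    """Tiny stand-in for a Lucene index: a token-based inverted
    index with top-k retrieval. Real Lucene uses per-field analyzers
    and TF-IDF/BM25 scoring rather than raw token-overlap counts."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of doc ids
        self.docs = {}

    def add(self, doc_id, fields):
        """Index one document: record which docs contain each token."""
        self.docs[doc_id] = fields
        for value in fields.values():
            for token in str(value).lower().split():
                self.postings[token].add(doc_id)

    def top_k(self, query_fields, k=10):
        """Score each document by how many query tokens it shares,
        then return the k best-scoring document ids."""
        scores = defaultdict(int)
        for value in query_fields.values():
            for token in str(value).lower().split():
                for doc_id in self.postings.get(token, ()):
                    scores[doc_id] += 1
        return sorted(scores, key=scores.get, reverse=True)[:k]
```

In this sketch, the IMDb catalog would be loaded once via `add`, and each Prime Video title would become a fielded query passed to `top_k` to retrieve its ten candidate matches.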
Why did we choose the top ten most similar titles and not just the single best match? Well, when we evaluated our algorithm using Prime Video catalog titles that had a known mapping in IMDb, our results showed that Lucene returned the correct title as the single best match in around 85 percent of the cases. However, the correct title was among the top ten returned titles in 99 percent of cases.
One possible reason for this behavior is that Lucene had difficulty finding the correct mapping for movies with sequels that share similar plots, genres, or actors. Yet Lucene helped us scale down from more than one million potential matches to just ten. Now we just had to find a way to pick the correct title from these ten intermediate results.
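The evaluation described above can be sketched as a simple top-1 versus top-ten hit-rate computation over the titles with known mappings (the function and argument names here are hypothetical, not from the original system):

```python
def hit_rates(known_mappings, retrieve_top10):
    """Compute top-1 and top-10 hit rates.

    known_mappings: dict of pv_id -> correct imdb_id (ground truth).
    retrieve_top10: callable pv_id -> ranked list of candidate imdb_ids.
    """
    top1_hits = top10_hits = 0
    for pv_id, imdb_id in known_mappings.items():
        candidates = retrieve_top10(pv_id)
        if candidates and candidates[0] == imdb_id:
            top1_hits += 1          # best match was correct (~85% in the post)
        if imdb_id in candidates:
            top10_hits += 1         # correct title among the ten (~99% in the post)
    n = len(known_mappings)
    return top1_hits / n, top10_hits / n
```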
This brings us to the second part of our algorithm. By experimenting, we were able to build an empirical formula that calculates a similarity score between each of Prime Video’s titles and its ten potential IMDb matches. We could then compare titles as strings after we removed any numbers, dates, or phrases such as “4K version.”
We could also compare the year of a title’s release without expecting it to match perfectly. Comparing 2012 and 2013 would return a better similarity score than comparing 2009 and 2012. The same goes for actor names, genres, and content-types. We defined formulas that calculated the similarity of every field separately and assigned weights to each field’s similarity score to bring them together in a single matching score.
Our algorithm was quite slow, but this didn’t matter because we ran it against only ten candidates per title. At the same time, having a similarity score for every potential mapping had two uses. First, we were able to pick the most similar IMDb title for every Prime Video title. Second, and equally important, we could set a threshold on that similarity score: if an IMDb title looks like the best match for a Prime Video title but their similarity score is not high enough, we don’t propose that mapping. This was a key factor in minimizing the introduction of incorrect mappings.
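A minimal sketch of such a scoring scheme follows. The per-field formulas, weights, threshold, and field names are all illustrative assumptions; the post says only that the real formula was built empirically and tuned per region.

```python
import re
from difflib import SequenceMatcher

# Illustrative noise patterns; the real list would be broader.
EDITION_PHRASES = re.compile(
    r"\b(4k version|director'?s cut|extended edition|\d+)\b", re.IGNORECASE
)

def clean_title(name):
    """Strip numbers, dates, and edition phrases before comparing strings."""
    return re.sub(r"\s+", " ", EDITION_PHRASES.sub("", name)).strip().lower()

def title_sim(a, b):
    return SequenceMatcher(None, clean_title(a), clean_title(b)).ratio()

def year_sim(y1, y2, tolerance=5):
    """Nearby years still score well: |2012-2013| beats |2009-2012|."""
    return max(0.0, 1 - abs(y1 - y2) / tolerance)

def set_sim(a, b):
    """Jaccard overlap for genres, content-types, or people sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical weights and threshold -- the real values were tuned empirically.
WEIGHTS = {"title": 0.4, "year": 0.2, "people": 0.25, "genres": 0.15}
THRESHOLD = 0.8  # below this, the best candidate is not proposed as a mapping

def matching_score(pv, imdb):
    """Combine per-field similarities into one weighted matching score."""
    return (WEIGHTS["title"] * max(title_sim(pv["name"], n) for n in imdb["names"])
            + WEIGHTS["year"] * year_sim(pv["year"], imdb["year"])
            + WEIGHTS["people"] * set_sim(pv["people"], imdb["people"])
            + WEIGHTS["genres"] * set_sim(pv["genres"], imdb["genres"]))
```

A candidate would then be accepted only when it is both the highest-scoring of the ten and above `THRESHOLD`, mirroring the two uses of the score described above.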
Evaluating our algorithm’s success
Our evaluation process had two steps. First, we used the Prime Video catalog titles whose IMDb mappings we already knew. Our algorithm was 98 percent accurate across all regions, surpassing our target of 95 percent. The second part of our evaluation was to send our proposed new mappings of around 80,000 titles to the Prime Video Catalog team. This team manually audited a sample of 8,130 newly proposed mappings and found our algorithm to be 97.9 percent accurate, which was equivalent to the results from our evaluation set.
Before new mappings were incorporated into our catalog, the titles were split into two categories: high-profile titles and normal titles. High-profile titles had to pass a manual inspection before being added to our catalog, while normal titles were added directly. Overall, these approximately 80,000 new mappings increased the number of known mappings by 16.5 percent.
Constantly learning and improving
Working on this problem was a positive experience. First of all, there are the obvious customer-facing benefits that come from enriching another 80,000+ Prime Video titles with IMDb metadata. We also had the opportunity to dive deep into the catalogs of Prime Video and IMDb to get a better understanding of the available data, and we worked with cutting-edge technologies to help map titles with seemingly different descriptions and features.
To quickly get meaningful changes to the experience in front of our customers, we established a self-imposed deadline. This means that we haven’t been able to try out all of our ideas to enrich the Prime Video catalog with IMDb metadata. But that’s OK because we’re observing and listening to what our customers are telling us about the most recent changes. By working backwards from that valuable insight, we’ll define the next series of experiments to perform.