
I am writing a media metadata subsystem that can work out, in considerable detail, what a media file contains. It uses the file name to produce a metadata hint, then uses pluggable services to retrieve potential matches for the content. The result (at least so far) is a tool that can examine a media file and tell you which movie, television episode, etc. it contains with very good accuracy, including all the appropriate metadata (even individual episode descriptions!).

If this sounds a lot like what Boxee does, that’s because Boxee does do this. Unfortunately, Boxee’s analysis is rudimentary: the file name elements (title, year, episode number, etc.) have to be ordered and formatted in a particular way before the client can retrieve listings with decent reliability.

Can’t find a listing? Those Scene Tags Are In The Way.

MediaExpert’s approach is different: it filters the “title string” (initially the file name without its extension, with all non-alphanumeric characters replaced by spaces) until it has something it believes is very likely the title of the content. It does this by recognizing common scene tags and removing them from the title string that the metadata services will search for. The removed tags are stored in the hint information, so the application or other filter plugins can make use of them. It already recognizes a great many tags that are formatting or source related (e.g. 480p, DVDRip), as well as some of the most popular scene release groups. It also treats any words found within square brackets ([, ]) as scene tags. Finally, it looks for a four-digit number starting with either 1 or 2 and, if it finds one, considers it the year the content was released.
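To make the filtering concrete, here is a rough Python sketch of the idea. It is not MediaExpert’s actual code: the `build_hint` function, the `KNOWN_TAGS` list and the dictionary layout are all made up for illustration, and I am assuming bracketed words get pulled out before the punctuation is replaced with spaces, and that the first four-digit number found is the one taken as the year.

```python
import os
import re

# Hypothetical tag list for illustration; the real list is much longer.
KNOWN_TAGS = {"480p", "720p", "1080p", "dvdrip", "xvid", "hdtv", "bluray", "x264"}


def build_hint(file_name):
    """Split a file name into a likely title, a release year and scene tags."""
    stem, _extension = os.path.splitext(file_name)
    tags = []

    # Assumption: bracketed words are collected as tags before punctuation
    # is stripped, since the brackets would otherwise be lost.
    def collect_bracketed(match):
        tags.extend(match.group(1).split())
        return " "

    stem = re.sub(r"\[([^\]]*)\]", collect_bracketed, stem)

    # Replace every run of non-alphanumeric characters with a space.
    words = re.sub(r"[^0-9A-Za-z]+", " ", stem).split()

    title_words = []
    year = None
    for word in words:
        if word.lower() in KNOWN_TAGS:
            tags.append(word)
        elif year is None and re.fullmatch(r"[12]\d{3}", word):
            # Assumption: the first four-digit number starting with 1 or 2
            # is treated as the release year.
            year = int(word)
        else:
            title_words.append(word)

    return {"title": " ".join(title_words), "year": year, "tags": tags}


if __name__ == "__main__":
    print(build_hint("The.Matrix.1999.DVDRip.XviD-[GROUP].avi"))
    # {'title': 'The Matrix', 'year': 1999, 'tags': ['GROUP', 'DVDRip', 'XviD']}
```

A sketch this simple will mistake a title that is itself a year-like number for the release year, which is one reason the year only ends up as a hint rather than a hard fact.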

The system supports pluggable “scraper” filters as well, with one built in: TelevisionScraper, which looks for season/episode information in a variety of formats including sNeN, eN, NxN, and more. Unlike Boxee, it accepts anywhere from 1 to 3 digits for both season and episode numbers. This information is stored within the metadata hint.
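A scraper along these lines could be sketched with a few regular expressions. Again, this is only an illustration in Python: the `scrape_television` function, the `PATTERNS` list and the returned dictionary are my own invention here, and only the three formats named above are covered.

```python
import re

# Hypothetical patterns for the sNeN, NxN and eN formats, tried in that order.
# Each number may be 1 to 3 digits long.
PATTERNS = [
    re.compile(r"\bs(?P<season>\d{1,3})\s*e(?P<episode>\d{1,3})\b", re.IGNORECASE),
    re.compile(r"\b(?P<season>\d{1,3})x(?P<episode>\d{1,3})\b", re.IGNORECASE),
    re.compile(r"\be(?P<episode>\d{1,3})\b", re.IGNORECASE),
]


def scrape_television(title_string):
    """Return season/episode info and the remaining title, or None if absent."""
    for pattern in PATTERNS:
        match = pattern.search(title_string)
        if match is None:
            continue
        groups = match.groupdict()
        season = groups.get("season")  # missing for the eN format
        # Cut the matched marker out of the title string.
        remaining = title_string[:match.start()] + " " + title_string[match.end():]
        return {
            "season": int(season) if season is not None else None,
            "episode": int(groups["episode"]),
            "title": " ".join(remaining.split()),
        }
    return None


if __name__ == "__main__":
    print(scrape_television("Lost s02e05"))
    # {'season': 2, 'episode': 5, 'title': 'Lost'}
```

The word boundaries keep something like 1920x1080 from being read as a season and episode, since neither half fits in three digits.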

Now the metadata services query their respective providers and return the results. The system is then able to narrow down the possibilities by merit, that is, a best-score-wins heuristic based on things like how close the actual year is to the one in the file name, whether the content type (episode, movie, etc.) matches the metadata in the hint (such as season/episode numbers), and how close the expected title is to the real title (using Levenshtein distance).
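For the curious, a best-score-wins heuristic of this kind might look roughly like the Python sketch below. The weights, the `score` and `best_match` functions and the candidate dictionaries are all invented for the example and are not the real scoring rules; I am also assuming a hint without an episode number means a movie. Only the Levenshtein distance itself is the standard algorithm.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def score(hint, candidate):
    """Merit of one candidate; the weights are arbitrary illustration values."""
    total = 0.0

    # Title closeness: 10 points for an exact match, fewer as edits pile up.
    expected = hint["title"].lower()
    actual = candidate["title"].lower()
    longest = max(len(expected), len(actual), 1)
    total += 10.0 * (1 - levenshtein(expected, actual) / longest)

    # Year closeness: 5 points for the same year, one less per year apart.
    if hint.get("year") is not None and candidate.get("year") is not None:
        total += max(0.0, 5.0 - abs(hint["year"] - candidate["year"]))

    # Content type: assume a hint with an episode number means an episode.
    hinted_type = "episode" if hint.get("episode") is not None else "movie"
    if candidate.get("type") == hinted_type:
        total += 5.0

    return total


def best_match(hint, candidates):
    """Return the highest-scoring candidate, or None if there are none."""
    return max(candidates, key=lambda candidate: score(hint, candidate), default=None)
```

Given a hint of “The Matrix” from 1999, a candidate with that exact title and year outscores “The Matrix Reloaded” from 2003 on both the title and the year terms.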

Altogether this makes for a powerhouse of media detection capability, without any compelling need to compulsively rename your media collection (hey, feel free if you want to).

The system will also work with many other media types like music, news, adult content, etc. The media-expert tool will also be capable of exporting the metadata in XML format for caching and distribution.

