Yahoo releases massive Flickr dataset, and a supercomputer steps up to analyze it

In case you missed this like I did (I blame a Structure hangover), Yahoo last week released a huge dataset of Flickr images and videos. Called the Yahoo Flickr Creative Commons 100 Million, it contains 99.3 million photos, 700,000 videos and the associated metadata (title, camera type, description, tags) for each. About 49 million of the photos are geotagged and, Yahoo says, the comments, likes and other social data for each item are available via the Flickr API.
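
For anyone who wants to poke at that social layer directly, here is a minimal sketch of pulling the comments and favorites for a single photo through the public Flickr REST API. The two method names are real Flickr API methods; the API key and photo ID are placeholders you would supply yourself, and the wiring around them is just one way to do it.

```python
# Minimal sketch: fetch comments and favorites ("likes") for one Flickr photo.
# YOUR_API_KEY and the photo_id passed in are placeholders.
import requests

FLICKR_REST = "https://api.flickr.com/services/rest/"

def flickr_call(method, api_key, **params):
    """Call a Flickr API method and return the parsed JSON response."""
    params.update({
        "method": method,
        "api_key": api_key,
        "format": "json",
        "nojsoncallback": 1,
    })
    resp = requests.get(FLICKR_REST, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

def social_data(api_key, photo_id):
    """Fetch the comments and favorites recorded for a single photo."""
    comments = flickr_call("flickr.photos.comments.getList", api_key,
                           photo_id=photo_id)
    favorites = flickr_call("flickr.photos.getFavorites", api_key,
                            photo_id=photo_id)
    return comments, favorites
```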

Needless to say, this is a pretty impressive resource for anyone wanting to analyze images for the sake of learning something or just to train some new computer vision algorithms. We have been covering the rise of new artificial intelligence algorithms and techniques for years, most of which have benefited from access to huge amounts of online images, video and related content from which to derive context. Often, though, researchers or companies not in possession of the content (that is, pretty much everyone but Google, Facebook, Microsoft and Yahoo) have had to scrape or otherwise gather this data manually.

That being said, Google and Yahoo, in particular, have been pretty good about releasing various large datasets, usually textual data useful for training natural-language processing models.

Just a taste of what the dataset looks like. Source: Flickr user David Shamma (https://www.flickr.com/photos/ayman/14444554781)

To test out just one possible use of the new dataset, Yahoo is hosting a contest to find the system that can best identify where a photo or video was taken without relying on geographic coordinates. The training set for the contest includes 5 million photos and 25,000 videos.
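
To make the task concrete, here is a toy baseline of the sort an entrant might start from: index the training photos' tags against their coordinates, then place a test photo at the average location of training photos that share its tags. This is purely illustrative, not anything Yahoo prescribes, and the record format is an assumption.

```python
# Toy tag-based geolocation baseline. Training records are assumed to be
# (tags, lat, lon) tuples; a test photo is placed at the mean location of
# training photos carrying the same tags. Naive on purpose (no handling of
# longitude wrap-around, tag frequency, etc.).
from collections import defaultdict

def build_tag_index(training_records):
    """Map each tag to the coordinates of training photos carrying it."""
    index = defaultdict(list)
    for tags, lat, lon in training_records:
        for tag in tags:
            index[tag].append((lat, lon))
    return index

def predict_location(tags, index):
    """Average the coordinates associated with the photo's tags."""
    coords = [c for tag in tags for c in index.get(tag, [])]
    if not coords:
        return None  # no overlapping tags, so no guess
    lats, lons = zip(*coords)
    return sum(lats) / len(lats), sum(lons) / len(lons)

# Example usage with made-up records
train = [(["eiffel", "paris"], 48.858, 2.294),
         (["paris", "louvre"], 48.861, 2.336)]
index = build_tag_index(train)
print(predict_location(["paris"], index))  # roughly central Paris
```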

Yahoo is also partnering with the International Computer Science Institute (at the University of California, Berkeley) and Lawrence Livermore National Laboratory to process the data on a specialized supercomputer, the Cray Catalyst machine designed for data-intensive computing, and to extract various audio and visual features from it. That expanded dataset, which Yahoo says will be north of 50 terabytes (the original 100-million-item release, being metadata and links rather than the media files themselves, weighs in at only about 12 gigabytes), will be made available on Amazon Web Services later this summer, along with tools for analyzing it.
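
As a rough idea of what "extracting visual features" means in practice, the sketch below computes a simple per-image color histogram with Pillow. Yahoo hasn't detailed the exact features the Catalyst machine will compute, so treat this purely as an illustration of the kind of per-image descriptor that, computed across 100 million items, quickly adds up.

```python
# Illustrative visual feature: a normalized RGB color histogram per image.
# Not the features Yahoo/LLNL will actually compute; just an example of the
# genre. Requires Pillow (pip install Pillow).
from PIL import Image

def color_histogram(path, bins=8):
    """Return a normalized RGB histogram with `bins` buckets per channel."""
    img = Image.open(path).convert("RGB").resize((128, 128))
    counts = [0] * (bins * 3)
    step = 256 // bins
    for r, g, b in img.getdata():
        counts[r // step] += 1
        counts[bins + g // step] += 1
        counts[2 * bins + b // step] += 1
    total = 128 * 128
    return [c / total for c in counts]
```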

Image courtesy of Flickr user David Shamma.
