Dec 26, 2013 6:30 AM

Facebook Says It Knows Where People Are Migrating — But Can You Trust Its Data?

Facebook is a company that knows the personal habits of more than one billion people across the globe. That can be a scary thought — especially when you consider that Facebook is using this data to target ads and may share it with people and businesses you’d rather Zuckerberg and Co. didn’t share it with. But at the same time, this enormous trove of online data can give us a new means of seeing and understanding the world we live in.

Coordinated migration over the world. Image: Facebook

Take the data Facebook recently released describing the migration patterns of people across the globe. In a blog post, the Facebook data science team takes a look at what's called "coordinated migration" -- the movement of large numbers of people from one place to another.

According to information that Facebook collects on its social network, countries such as India, Nigeria, and Turkey are becoming increasingly urban, with many people moving from rural areas into large cities such as Hyderabad and Chennai in India and Lagos in Nigeria. Facebook calls these "destination cities," and some of them -- such as Istanbul, Turkey -- are attracting large numbers of people from across the border. In the U.S., Facebook says, coordinated migrations tend to come from other countries, such as from Cuba to Miami and from Mexico to cities such as Chicago, Houston, Dallas, and Los Angeles.

The trouble is that Facebook is only providing a small glimpse into the data it has collected. In one sense, that is as it should be. We don't want the web giant unloading our private data to the world at large. But it also means that there's no way for outside data scientists to examine and verify the company's findings -- to provide "peer review," in the parlance of academia.

It's a conundrum that will continue to hover over data science for the foreseeable future. There are ways of "anonymizing" data before it's released to others, but as we've seen in the past, anonymized data isn't always anonymous.

This isn't the first time scientists have used Facebook to analyze migration trends. In 2010, a former Apple developer named Pete Warden published a blog post detailing his analysis of a massive amount of data he scraped from public Facebook profiles. Although he initially shared the data with the world at large, he took it offline after a legal threat from Facebook. Once again, no peer review.

Nowadays, Facebook has made a habit of publishing studies based on its own data, including the migration analysis and countless others. But Warden is skeptical of putting too much faith in material released by corporate data scientists. "It's not so much that I think the studies are questionable, more that they only give us a narrow view," Warden, now the CTO of digital travel guide startup Jetpac, tells WIRED.

"I know that, for my own stories, there are often multiple ways of looking at the same data," he says. "But because we can only publish the results, and not the underlying data we built them on top of, only one of those interpretations ever makes it out into the world."

Many of the world's top data scientists are moving inside giant internet companies such as Facebook, but that doesn't solve the whole problem. You still need outsiders, Warden says, to review their work. He believes that outside academics should be pushing companies like Facebook to release more data. "I would love to see more academics making those sorts of pushes for information," he says. "Right now, I see journalists and startup folks figuring out how to use public information and APIs to investigate problems far more often than credentialed scientists."

A map of the U.S., showing the Facebook connections between people in different cities.

Image: Pete Warden

Devin Gaffney, a developer at a tech startup called Little Bird who holds a master's degree in Social Science of the Internet from Oxford University, says that many researchers are already doing this, pointing to social scientists such as Danah Boyd, Helen Nissenbaum, and Duncan Watts. "You've got potentially the largest data set on human interaction ever," he says. "It will be biased towards people who are on the internet, but it's still better than before. Plus, it's less work. You don't have to talk to 10,000 people. You just write some code to do it for you."

But the privacy issue remains. Although many academics are trying to get their hands on social media data, Gaffney says, web companies are becoming more reluctant to share it due to privacy concerns. And even if academics can get their hands on the data, they might not be able to use it -- depending on a web company's terms of service or the ethics policies prescribed by universities.

Some academics will gather data from Twitter, where most information is public rather than private. That means they can use it without permission. But there's a rub. If you collect data from Twitter, the terms of service say you can't redistribute it. And that means no peer review.

Some companies, however, are working to share more data with academics in a responsible way. The popular dating site OkCupid publishes infographics detailing its own findings about love and sex online, but according to co-founder Christian Rudder, the company often shares data with academics as well. "We take steps to anonymize the data, but it's pretty robust," he says. "People are allowed to publish it in an academic context, but we ask that they don't use it in a commercial context."

At the moment, the company only works with about 10 percent of the researchers who request its data. "It's a pain in the butt to pull the data," Rudder says. "But we hope to be better about it."

One thing he doesn't expect to do is offer an application programming interface, or API, that lets anyone pull data from the site on their own. "If we just let anyone grab the data through an open API, there would be more bad science than good," he says. That may be true. But if the data is available to everyone, we also have the power to separate the bad science from the good.

Let's just hope that any open APIs protect your privacy.
