ReadWriteWeb

Youth social networking researcher danah boyd has observed that many people presume the way they use social networks is the way everyone uses them. "I interviewed gay men who thought Friendster was a gay dating site because all they saw were other gay men," she says. "I interviewed teens who believed that everyone on MySpace was Christian because all of the profiles they saw contained biblical quotes. We all live in our own worlds with people who share our values and, with networked media, it's often hard to see beyond that."

Now picture our perspective leaving our own experiences, zooming out and up until we can see how all the different groups are interacting on a worldwide social network. That bird's-eye view could be both beautiful and horrible if the resolution were clear enough. That's the view a ramen-eating ex-Apple engineer named Pete Warden is about to release to the public this week.

This Wednesday, Warden will make Friend, Fan page and name data from hundreds of millions of Facebook users available to the academic research community. It's a move that Facebook has to have seen coming, one that many in the data-centric community have been calling on the company itself to make for years, and an event that's been complicated by Facebook's recent privacy policy changes, which have muddied the waters of right and wrong but rendered even more data available for outside analysis.

If what people call Web 2.0 was all about creating new technologies that made it easy for everyday people to publish their thoughts, social connections and activities, then the next stage of innovation online may be services like recommendations, self and group awareness, and other features made possible by software developers building on top of the huge mass of data that Web 2.0 made public. It's a very exciting future, and Warden is about to fire one of the earliest big shots in that direction.

Nerds in Space: Social Graph Analysis For Solving Large-Group Problems

Warden studied computer vision in college in the U.K., then got into game development. After moving to L.A., he spent six years building graphics drivers for the original PlayStation and the Xbox. Then he started his own independent business, where, thankfully, he open-sourced much of his work (something he's still doing today).

When he found out that starting his own business wasn't going to work with his immigration status, he was very fortunate to have also caught Apple's eye with the software he had been releasing to the public. Apple bought his company in order to bring him on board. The proceeds of that small sale are now sustaining his next project after going independent again.

After spending five years at Apple struggling to navigate the maze of people and connections and types of expertise in order to get the information he needed, Warden decided to go independent and build a company that solved exactly that kind of problem. "I can't think of a better big company to work for, but it was still a big company," he says. "It was hard to find the right people to talk to, whether for particular expertise or for contacts at external companies." And so Warden left Apple to build a company that would use social graph analysis to solve problems like that. He called the company Mailana, a play on "mail analysis" since he was initially focused on email social graph analysis.

We've written here a number of times about Mailana's tool that analyzes the social graph of any Twitter user. Enter the username of someone on Twitter and Mailana will show you which 20 other people the user has exchanged the largest number of reciprocal public @ replies with. Find someone interesting or important? Mailana's Twitter analyzer will tell you who they most regularly interact with. See, for example, The Inner Circles of 10 Geek Rockstars on Twitter.
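
For readers who want to see the shape of that kind of analysis, here is a minimal sketch in Python. It is not Mailana's code; the input format (a list of sender and recipient pairs, one per public @ reply) and the function name `inner_circle` are assumptions made for illustration. It scores each correspondent by the smaller of the two reply counts, so a one-sided conversation scores zero.

```python
from collections import Counter


def inner_circle(user, replies, top_n=20):
    """Rank the accounts that `user` trades public @ replies with.

    `replies` is an iterable of (sender, recipient) pairs, one per public
    @ reply. That input shape is an assumption of this sketch.
    """
    sent = Counter()      # replies from `user` to each other account
    received = Counter()  # replies from each other account to `user`
    for sender, recipient in replies:
        if sender == user and recipient != user:
            sent[recipient] += 1
        elif recipient == user and sender != user:
            received[sender] += 1

    # An exchange is only as reciprocal as its quieter side.
    reciprocal = {
        other: min(sent[other], received[other])
        for other in sent
        if other in received
    }
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(reciprocal.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

Calling `inner_circle("a", replies)` returns a list of `(username, score)` pairs, at most 20 by default, with accounts that never replied back left out entirely.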

Pulling Down the Facebook Social Graph

Now Warden is about to unveil a much larger project in the same vein. For the past six months he's been crawling public profile pages on Facebook. He now has more than 215 million of them indexed and updated about once a month. When he began he was using the Web crawling service 80legs, but over time he had to build his own crawling infrastructure.

When I talked to him this afternoon, he had already begun uploading 100 GB of user data onto his server to make it available for academic research starting on Wednesday. Warden says he's removed identifying profile URLs but kept names, locations, Fan page lists and partial Friends lists. All those fields of data are just waiting to be analyzed and cross referenced. That's one very rich resource.

Yesterday Warden posted some of his own initial observations from the data on his personal blog. Those included:

  • In almost every state in the Southern U.S., God is the number one most popular Fan page among Facebook users. Among people in the L.A., San Francisco and Nevada regions? "God hardly makes an appearance on the fan pages, but sports aren't that popular either," Warden writes. "Michael Jackson is a particular favorite, and San Francisco puts Barack Obama in the top spot." In the Oregon and Idaho region? Starbucks is number one.
  • In the Mormon-influenced areas of Utah and Eastern Idaho, the most popular Fan pages are The Book of Mormon, Glenn Beck and the vampire book Twilight, which was authored by a Mormon.
  • The bulk of Warden's posted analysis yesterday was about location networks. People in the western U.S. tend to have Facebook friends all over the country; people in the southern U.S. tend to mostly be friends with people who have remained in the same area.
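
Observations like these come down to two simple aggregations: counting Fan pages per region, and measuring how many friendships stay inside a region. The Python below is a rough sketch of both, not Warden's actual pipeline. The record shapes (a region paired with a list of Fan pages, friendship pairs, and a user-to-region lookup) and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top_fan_pages(profiles, k=3):
    """Return the k most-fanned pages per region.

    `profiles` is an iterable of (region, fan_pages) records (assumed shape).
    """
    counts = defaultdict(Counter)
    for region, pages in profiles:
        counts[region].update(set(pages))  # count each page once per profile
    return {
        region: [page for page, _ in
                 sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for region, c in counts.items()
    }


def local_friend_share(friendships, region_of):
    """Fraction of each region's friendship ties that stay inside the region.

    `friendships` is an iterable of (user_a, user_b) pairs and `region_of`
    maps each user to a region (both assumed shapes).
    """
    same = Counter()
    total = Counter()
    for a, b in friendships:
        ra, rb = region_of[a], region_of[b]
        total[ra] += 1
        total[rb] += 1
        if ra == rb:
            same[ra] += 2  # both ends of the tie are local
    return {region: same[region] / total[region] for region in total}
```

A region where most friends "have remained in the same area" would show a local share close to 1.0; a region whose residents have friends all over the country would score much lower.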

Taking a Deeper Look

These observations are interesting, but they are only the beginning of what's possible. Name, location, friends and interests are great data points to analyze. Warden has written a program that will estimate gender as well, based on names. All these data points can be cross-referenced with outside data, too. Members of Facebook's own staff did this kind of analysis when they compared user last names to U.S. Census data, which allowed them to estimate changes in Facebook's racial composition over time based on the likelihood of people with particular last names to report a particular racial background.
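
The surname technique is easy to sketch: look up each last name in a table of probabilities and average the results over everyone who could be matched. The Python below is an illustration under stated assumptions, not the method Facebook's staff or Warden actually used. The table format (uppercase surname mapped to a dictionary of group probabilities) and the function name are hypothetical, and no real Census figures are included.

```python
def expected_composition(surnames, surname_probs):
    """Estimate a population's composition from last names alone.

    `surname_probs` maps an uppercase surname to {group: probability}.
    The table format is an assumption; a real analysis would load it
    from published Census surname data.
    """
    totals = {}
    matched = 0
    for name in surnames:
        probs = surname_probs.get(name.strip().upper())
        if probs is None:
            continue  # names missing from the table are skipped, not guessed
        matched += 1
        for group, p in probs.items():
            totals[group] = totals.get(group, 0.0) + p
    if matched == 0:
        return {}
    # Average the per-person probabilities over everyone we could match.
    return {group: total / matched for group, total in totals.items()}
```

Run once per monthly snapshot, the output would trace change over time. The estimate is only meaningful in aggregate; it says nothing reliable about any individual.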

"I'm mostly thinking 'What do I try first?'," Warden says. "There's so many interesting ways to slice the data - especially as I'm starting to get changes over time. I'm also trying to map out political networks in aggregate; how polarized the fans of particular politicians are - so how likely a Sarah Palin fan is to have any friends who are fans of Obama, and how that varies with location too. One of my favorite results is that Texans are more likely to be fans of the Dallas Cowboys than God."
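
One way to put a number on that kind of polarization is to ask what share of one page's fans have at least one friend who is a fan of the other page. The Python below is a sketch of that single metric, assuming the data is available as two lookups: page to set of fans, and user to set of friends. Those shapes and the name `cross_fan_rate` are illustrative assumptions, and Warden may well define the measure differently.

```python
def cross_fan_rate(fans_of, friends_of, page_a, page_b):
    """Share of page_a's fans with at least one friend who fans page_b.

    `fans_of` maps a page name to a set of user ids; `friends_of` maps a
    user id to a set of friend ids. Both shapes are assumptions.
    Returns None when page_a has no fans in the data.
    """
    a_fans = fans_of.get(page_a, set())
    b_fans = fans_of.get(page_b, set())
    if not a_fans:
        return None
    with_cross_friend = sum(
        1 for user in a_fans
        if friends_of.get(user, set()) & b_fans  # any overlap counts
    )
    return with_cross_friend / len(a_fans)
```

Filtering `fans_of` by location before calling the function would show how the rate varies from region to region. Because the released friend lists are partial, any such rate should be read as a lower bound.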

Warden says he hasn't talked to anyone from Facebook since he started crawling the site, but he did get an email from someone on the security team asking him to take down instructions he'd posted that exposed a security hole that made harvesting people's email addresses easy. So the company is paying attention. "I'd love to see them put me out of business by putting decent data out there," Warden says. He says his Amazon Web Services bill was over $5,000 last month.

Why is he indexing all this content and why is he going to hand it over to the academic world later this week? "I am fascinated by how we can build tools to understand our world and connect people based on all the data we're just littering the Internet with," Warden says.

"Nobody thinks about how much valuable information they're generating just by friending people and fanning pages. It's like we're constantly voting in a hundred different ways every day. And I'm a starry-eyed believer that we'll be able to change the world for the better using that neglected information. It's like an x-ray for the whole country - we can see all sorts of hidden details of who we're friends with, where we live, what we like."

For a great example of the kind of social impact that data analysis can make, Warden points to some of the fascinating ways that GIS data is illuminating the intersection of race and public services. Data has shed light on social injustices for decades, and measurable information about the interactions of hundreds of millions of people every day on Facebook offers opportunities to discover both good and bad news about the contemporary human condition.

Warden says he's not yet been able to interest any investors in his ideas for businesses based on this data, so his girlfriend Liz Baumann, a former insurance actuary, stepped in to help and is now running much of the crawling. He says he's now focused on "working on ways of presenting all this information in a form that answers questions for people willing to pay." His first experiment along those lines is the very interesting FanPageAnalytics.com.

What does Pete Warden hope for from this week's public release of all this Facebook data? "Hopefully I'll get to see a bunch of interesting [academic research] papers come out of it, worst case. And I'd like to be the guy people turn to when they need stuff like this."

Warden is already well respected among a fringe group of bleeding-edge geeks. We hope his work on social graph analysis will end up affecting far more people than will ever know his name.



Comments


  1. "Nobody thinks about how much valuable information they're generating just by friending people and fanning pages."

    False. I think about it all the time.

    Posted by: Todd | February 8, 2010 9:43 PM



  2. You don't count Todd. You are freakishly wonderful in that way.

    Posted by: Marshall Kirkpatrick | February 8, 2010 9:47 PM



  3. This is a great story and it hopefully points to what the future of content on the internet is if there is an evolution in content creation standards. By this I mean a movement towards a unified standard that brings underlying structure to all content that is created on the internet. Because most of the content in the fb silo is unstructured you cannot easily create rich queries without the help of very technical computer programs. If the data were structured we could move to a point where any user could query the system and possibly create mini applications on top of their queries.

    Posted by: bruce wayne | February 8, 2010 9:49 PM



  4. Thanks Bruce, and thanks for bringing up the structured data part of this topic. I didn't work it into the story but it sure is a part of what's going on.

    Posted by: Marshall | February 8, 2010 9:53 PM



  5. Hey, what's up? I am looking for a date on Valentine's Day my profile @ http://bit.ly/cqd0a2

    Posted by: Crystalsami | February 8, 2010 9:55 PM



  6. Nice job Pete! You seem to have been steadily cranking out an amazing amount of interesting stuff lately!

    Posted by: Charlie O'Keefe | February 8, 2010 10:02 PM



  7. I wanna play with that dataset. Fascinating post Marshall.

    Posted by: Brian Daniel Eisenberg | February 8, 2010 10:52 PM



  8. Let me get this straight - any academic researcher gets a list of who my Facebook friends are just for the asking? And this person is trying to sell that information? And he hasn't found any buyers? And he's spending $5,000 a month of his own money to build this?

    Four hundred million Facebook users? All of them? All of their friends? The whole enchilada? I've got lots of questions, but I'm very skeptical. This is looking like another "InfoChimps".

    They collected a metric gooseload of Twitter data, put it up for all to access for a few hours, then Twitter "asked" them to take it down. A year later they were back, this time selling "depersonalized" data for insane prices. Pardon me, but I think the Emperor is as naked as the day he was born.

    Posted by: Ed Borasky | February 8, 2010 10:59 PM



  9. Danah Boyd is spelled with an 'h'

    It's good to try to at least get to a double digit word count before you start messing up.

    Posted by: David Appelbaum | February 9, 2010 12:31 AM



  10. Great writing, it kept me reading all the way through... This is going to be a very, very interesting turn of events for Facebook... Great to see such a well thought out and academically credible / legally defensible challenge to Facebook's data privacy blunders. I wonder how much this impacts the data access deals Facebook made with Bing and Google?

    Posted by: Deane | February 9, 2010 12:58 AM



  11. if you can connect to those people continuously, that is the precious ...


    http://easy2write.blogspot.com/

    Posted by: zigan | February 9, 2010 1:01 AM



  12. Really excellent post!
    Also on Pete's blog are interesting insights.

    Posted by: Michael Altendorf | February 9, 2010 3:00 AM



  13. The problem with using Facebook information for academic research is that there are many duplicate and false accounts on the site, which Facebook is aware of but, as of this date, has chosen to do nothing about. Thus the same human may have 3 to 5 different personas on FB, leaving Warden and researchers with extremely inaccurate data and thus inaccurate conclusions. In the internet environment one never really knows who the person they are talking to really is, whether by ethnicity, gender, religion or politics. There is also the unknown of the user's intent if they friend Palin or Obama: it may just be to glean information to use in writing jokes for Jon Stewart or others. Or, on other more radical sites, to watch what those radicals are saying, but it does not mean that the person is friends with or even agrees with the site they subscribe to.

    Posted by: Sandi | February 9, 2010 5:14 AM



  14. am i the only one that's bothered by the fact that some stranger can analyze data by invading someone's privacy just because it suits him? i simply fail to see the benefits of this so-called research. I am sure that if proven successful this will be used to monitor people and their everyday activities (and no, i have not watched too many movies, i am just worried that my God given freedom will perish because of someone's need to control...... somehow i see Matrix happening before my very own eyes....frightening thought).

    Posted by: maria | February 9, 2010 5:29 AM



  15. We have been thinking about data-mining in such a way for quite a time (and we have had some experiments with Twitter in the Czech Republic), and there is also a project called http://www.mechanicalcinderella.com/ - mechanized data mining of co-occurrences of words on the internet.
    I just wonder - are the data going to be completely public or restricted to the academic sphere only? If so, how shall it work? Plus, are we going to be allowed to take a look here in Europe, or USA only? :) Still many questions to be asked, looking forward to it!

    Posted by: Adam | February 9, 2010 5:45 AM



  16. Look forward to seeing the data. I think Pete Warden is going to have to up his monthly budget once the data is public...

    Posted by: Mark Epstein | February 9, 2010 7:45 AM



  17. Incredibly interesting work. At Victus Media we're having a blast inventing different ways to develop recommendation engines with socially driven semantic data. We're hoping to build a business out of it, but giving easier access to educational institutions feels like a great value add.

    Great reporting, and good work to Danah :D

    Posted by: Mark | February 9, 2010 8:57 AM



  18. @Maria: You're mistaken. There is no invasion of privacy here. As the article says, Warden is crawling PUBLIC Facebook pages. As for the value of the research, if you don't see value in understanding the behavior of 400MM people, I can't help you. Last but not least, if you are so worried about your "God given freedom", don't splatter your personal info all across the web.

    @Bruce Wayne: Not sure a structured data approach would ever work for such inherently unruly data. I think you'd give up as much or more than you'd gain by doing that. A better approach, IMO, is to use Networks Theory and statistical analysis to identify likely trends etc. Better to be approximately right than exactly wrong...

    Posted by: Nick | February 9, 2010 9:35 AM



  19. A. This is awesome.
    B. I don't see academia jumping on this right away, because they AREN'T STUDYING THE WEB which is a sad, sad fact. Sure, maybe CMU or MIT Media will do something with it, but in general academia has yet to get their head outta the sand in terms of studying digital communities.

    Posted by: belgian | February 9, 2010 11:25 AM



  20. As mathematicians have known for at least a few thousand years, STRUCTURE trumps "content". This (proven) reality bypasses the "bad data" problem, decisively. (It also reduces the data-storage problem to nearly total irrelevance.) More info? Contact me: either at arthur.gillman@gmail.com, or else at 204/ 896-4967. (If you need to leave a message, expect my reply within one business day.) /Arthur

    Posted by: Arthur Gillman | February 9, 2010 11:42 AM



  21. We communicate with other humans on three levels: Cosmetic (data and surface descriptions), Emotional (passion, purpose) and Meta (symbolism, iconography).

    What's most relevant to us, and the source of most transformation, gets conveyed on the Emotional level. The least relevant level is the Cosmetic. As a very wise woman, Viola Spolin, once said, "Information is a poor form of communication."

    Setting aside for a second that there's an unquantifiable level of misinformation on FB or on any social networking site, most of the artifacts jailed by Warden in his data-prison are cosmetic. Furthermore these artifacts are six months old, which is forever in network time. And finally, they are but a snapshot in time, when what is much more important in the networked environment is trend and flow. It is the non-static nature of networks that give them their potential. Warden's static data-prison has none of that potential.

    Sorry to be so snarky, but that's my interpretation. Thank you for the post, very enlightening!

    Posted by: Bonifer | February 9, 2010 12:15 PM



  22. What I hate is when I friend someone on facebook, we're talking colleagues, and a few weeks later they want you as a "fan". Just absolutely, crazy when I know they don't have over a 1000 friends.

    Posted by: Ruby | February 9, 2010 1:54 PM



  23. I wonder at the usability of the data. It seems like the data would be automatically skewed by the sheer fact that the only Facebook users being crawled are the ones unaware and stupid enough to have a public profile.

    Posted by: Tara Kotthoff | February 9, 2010 5:03 PM



  24. What I find interesting is what people say in their profiles vs what is 'true' or versus what they fan. For example I'm Anglo-Catholic but I wouldn't "fan" God; instead I've got it in my religious views. I'd also like to see the info for Canada.

    Posted by: Kathryn | February 9, 2010 5:51 PM



  25. It turns out Pete Warden is a british guy with another british axe to grind toward America and the Americans. His representation of America reads like a british propaganda manual, so it could never be taken seriously.

    Mormons, nomadic Americans(?), germanic native peoples(???)... its a british riot of insipid and spectacular stupidity and racial hatred, warping historical reality and science to fit the british empires old guard frustrations like a dickless dickensian glove full of holes.

    The british are still stuntingly fuming angry at the moon at losing America to another race of beings who were always there, and they're willing to rewrite American history to the best of their snotty and sniveling dropout ability to ease the pain.

    Trolling and jacking the planet for conquests and treasures is one thing. Trolling and jacking internet account information is a whole other ballgame, but really not that much different.

    Posted by: Barnes | February 9, 2010 6:36 PM



  26. I wish there was more technical information on how he crawled Facebook without being banned in a second and how he got past the "friends of friends" barrier.

    Posted by: reputation management | February 10, 2010 3:21 AM



  27. While I've thought about the information I generate through social networking, I haven't thought about it in that much detail. Crazy stuff. Something to think about from now on, definitely.

    Posted by: Adam | February 10, 2010 7:44 AM



  28. So is the National Security Agency interested in mining this FaceBook data to find networks of incipient terrorists like the murderous Army psychiatrist at Fort Hood? Apparently he had a series of communications with a mullah who also played a role in training the Christmas underwear bomber. Since the US Government makes a point of purposely ignoring the religious/ethnic character of the terrorists, they may feel foolhardy enough to openly communicate with each other via FaceBook and other social networking venues. It would especially be a place where people sympathetic with their attacks can make contact with the active jihadists, be vetted, and then given access to more clandestine channels of communication. Thus we could expect the pattern to involve a few initial contacts, and then no further public contacts. Thus if the data could be searched for people with certain religious and/or ethnic characteristics, and with less than a set number of contacts, it might highlight people being recruited into the loose network of terrorists.

    Posted by: coltakashi | February 10, 2010 6:01 PM



  29. Great article.
    Thing this highlights for me is just how lax many people are with their security on social network sites. I set all mine on FB to 'Friends Only' from day one. Even photos are not available outside of this.
    Just like Todd, I do give a serious amount of thought to the footprint I leave on the web and already have concerns about it as I've been blogging for over 5 years.

    Posted by: Dean | February 10, 2010 9:53 PM



  30. What people don't realise is that this information is not a breach of privacy, and nor would it be realistic ever to use it as one. It's only going to show any meaningful data when it's aggregated such as above; and that data is incredibly useful. It allows governments to set policies, it allows companies to identify market trends; but it would not be helpful in tracking people, as I bet 99% of people who might list 'terrorism' as an interest are not terrorists. If it was used in that way the man hours wasted would be immense.

    Posted by: Matt Sleight | February 11, 2010 9:41 AM



  31. For those of you who think your profile is private, think again. Even if you hide everything, FaceBook makes the pages you "fan" and the groups you follow, public. You can't change that setting. Use FaceBook's privacy tools to see what others see on your page, and you'll understand. If you're a fan of something, it's not a secret on FaceBook.

    Posted by: Karen | February 11, 2010 10:47 AM



  32. Oh goodie...more Big Brother looking over my shoulder. Apparently Pete Warden believes all academicians are only capable of doing GOOD including the ones at East Anglia U that jiggered the climate data to promote a "Global Warming Crisis" that allowed politicians to tax carbon usage and thereby punish all energy consumption in the modern world. How starry eyed all these progressives are... They never consider the costly mischief and societal devastation of their lofty good intentions. (ie. generation of welfare kids growing up fatherless.) Focusing only on utopia and the good they hope will happen. Wasn't that the pap Obama "the social network genius" sold who then suggested we report health reform subversives criticizing his efforts....NICE!

    Posted by: Melissa | February 11, 2010 4:43 PM



  33. Nothing against Danah, but this does open the discussion once again regarding your right to privacy online. Two of the most powerful companies, Google and Facebook, have had their CEO's tell us that we have no privacy online. This is just another example. If Danah can do this and provide to the academic research community, you can bet someone will or already has done so, purely to make money...

    Posted by: Click | February 12, 2010 3:55 PM



  34. Time for a "Do Not Friend Me" registry.

    Posted by: Charles | February 13, 2010 5:21 AM



  35. Respectable academics won't touch this with a 100-foot pole, and not because they "don't study the web" (WTF?). Stolen data, accessed via a security hole? Sure it's "public", but given the well-known problems with the usability of Facebook's privacy settings, how many people have inadvertently made (or left) their profiles public? Plus, as others have pointed out, Facebook is full of fake profiles. So this data might be fun for the kind of amateur hour analysis Warden is doing on his blog, but it's not a great foundation for solid research. If anything, like the AOL search data release, this will *hurt* academic research on social media.

    Posted by: http://people.ischool.berkeley.edu/~ryanshaw/wordpress/bio | February 13, 2010 12:23 PM



  36. I haven't seen a follow up, was the data finally released or not, does anybody know? It's Sunday.

    Posted by: Follow-upper | February 14, 2010 5:19 AM




  37. This is an interesting article and it hopefully points to what the future of content on the internet is if there is an evolution in content creation standards.
    Social networking is considered an effective way to market and promote business online, and Facebook is a social utility that helps people better understand the world around them.

    Thanks for sharing this article with us!

    Posted by: sipldataservices | May 27, 2010 6:27 AM



  38. I am not sure what you would use the information for. To what purpose would you put the knowledge that Glenn Beck is popular in a conservative area (he is also Mormon)? I liked the graphs of links; the southern states seemed the most linked and I found that curious.

    Posted by: Mange Dogs | July 7, 2010 3:25 PM



  39. The data was useful and the tags on the areas on the map were funny as well it was an interesting post.

    Posted by: Conversational Hypnosis | July 16, 2010 7:43 AM



  40. Even when he gathers the data I am not sure what he is going to sell as a product. You can find out if Glenn Beck is popular in other ways as well. I hope it works for him but am not sure how.

    Posted by: Massage Cupping | July 20, 2010 7:29 AM



  41. Every one always assumes people are like them and think like them. So when you are given choices that relate to you on a web site it certainly would seem to confirm it.

    Posted by: Acne free | July 20, 2010 9:33 AM



  42. The social media sites do tend to only let you see those like you since your friends on them are going to be like you and share the interests.

    Posted by: Fibromyalgia Remedies | July 20, 2010 10:28 AM



  43. Well, on the likelihood of having Obama-liking friends if you are a Sarah Palin fan, it is almost approaching 0. If you discuss your political views at all then they are either going to have to admit he is an idiot or move away and not be a friend.

    Posted by: Traveling Slave | August 14, 2010 6:28 AM



  44. Well I live in Texas and I am pleased you used the term Greater Texas on it. The chart was interesting and it shows the adoption of technology in the populations around the country great work thank you.

    Posted by: Slave to Hypnosis | August 19, 2010 6:07 AM



  45. I was impressed with the traffic volumes concentrated in different areas around the countries. I was surprised that Texas has such a high volume. I had assumed New York and Los Angeles but was surprised at the traffic for Texas.

    Posted by: Ralph Gout | August 30, 2010 6:36 AM



  46. I was expecting the big cities to show up and have all the traffic but that did not seem to be the case. You could pick out Los Angeles, Dallas, New York and a few others but it was not as concentrated as I assumed it would be.

    Posted by: Ralph Fibromyalgia Remedy | August 31, 2010 7:16 AM


