ReadWriteWeb

Twitter Data Dump: InfoChimps Puts 1B Connections Up for Sale

Written by Marshall Kirkpatrick / November 11, 2009 9:57 PM / 11 Comments

Data extracted from 500 million Twitter messages was released today by a tiny Texas startup company that forward-looking geeks have been watching for a year. Austin-based Infochimps announced this afternoon that it is now selling two important and very large sets of Twitter data. Limited samples of the data are available for free and a third, most important, set of data still won't be ready for a few more hours.

"What we want is to see people use this to build web apps," Infochimps co-founder Flip Kromer told us today. "You take this data, mash it up with any other very large corpus of data with timestamps - and you've got a web app."


This is specific, extracted data, though, not the full text of tweets. "We're trying to be careful," Kromer says. "We are not yet exposing the contents of tweets." And this data isn't cheap if you want the numbers broken out by the hour instead of the month.

This is a very big move because most developers struggle to get access to a large quantity of data from Twitter.

Here's what InfoChimps is putting on sale:

[Image caption: Tweet #38 in the History of Twitter: "oh this is going to be addictive" - by @dom]

  1. Hashtags, links and smiley emoticons used across Twitter on an hour-by-hour basis.

  2. @ messages, retweets and favorites, and who they came from: 1 billion relations, making up what the company calls a "conversation metric."

  3. A useful if less exciting set of data that will help developers map user ID numbers from search.twitter over to the different ID numbers used in the primary Twitter API. These systems were never merged and it can require a lot of API calls to merge user data.

The company believes it is capturing about 10% of the total data on Twitter right now, but Kromer says that he believes he can ramp that up to 30%.

Data as a Pot of Gold

InfoChimps is a bulk data marketplace with more than 5000 data sets in its catalog so far. The vast majority are free and were added by the company's own staff, but not all. The decades-old polling firm Zogby International, for example, is selling some Iraqi polling data through InfoChimps. Cross-reference that polling data with publicly available data about civilian casualties in Iraq and you can see some interesting patterns, InfoChimps' PR rep Josh Dilworth told us. (Dilworth is known as the most data-savvy PR guy in the Web 2.0 world and also represents Wolfram Alpha and Twine.)

The company hopes that it can sell the data derived from sitting on the Twitter API as a demonstration of the value that this and other data sets have. InfoChimps says it can help companies monetize data that they'd otherwise be paying to serve up through repeated API calls, if at all.

From sentiment analysis (not yet an option with the current InfoChimps data set) to social graph discovery (definitely an option), we've written extensively here before about the impacts that social data could have on business, social and political policies in the future.

John Zogby, founder of polling firm Zogby International, spoke to us at length (in a separate phone interview several months ago) about the value of using online social networks to measure public opinion. "We've been particularly known for innovating and polling new technologies," he said.

"83% of all households are online today and 92% of likely voters, so with online polling we are today about where the country was with telephone penetration when telephone surveys started. Social networking is not as representative as online access [in general] yet, but I'm comfortable with caveats: that you can do a random sampling, so long as you claim that's what your universe is, as long as you don't extrapolate to all Americans, etc. It has tremendous, tremendous value.

"I know that the landline era is coming to an end - not today or tomorrow but we've got to find new and different ways of doing our work. It's the same kind of crossroads as the '70s, when we moved away from the door-to-door and mail-in results to the landlines.

"Online, frankly just like telephone, doesn't have the minority population, but for market surveys you may be looking for a different kind of consumer.

"We know that the landline phone is pushing us away; we know that we can't use the cell phone in the same way; and we know that we've got to reinvent this industry [of measuring public opinion]. What's happening are simultaneous new technologies and at the same time growing penetration of these new technologies. We're riding a bucking bronco."

Use Cases

The conversation metric data that InfoChimps is selling is the most exciting to me. Imagine a third-party app using historical social-conversation data to filter Twitter or other messages based on the strongest social connections that I or other people have. Imagine, for example, social Q&A service Aardvark combining the Twitter Lists API with this InfoChimps data set for a scenario like this: "You have a question about stock options? How would you like us to find a person who knows about that, is regularly conversed-with by people on Robert Scoble's Twitter list of Venture Capitalists and is available right now?" That sounds pretty great to me.

The possible applications are many. "I see Twitter as a data acquisition device for what people talk about and how they relate to each other," InfoChimps' Kromer says.

Right now InfoChimps is selling the hashtag and link dataset for $8,000 and the social metric data set for $9,500. Eventually the company will likely move to a subscription model.

How They Got the Data

How did InfoChimps get the data? The company hits the Twitter Developer API 20,000 times an hour (the standard for developers) but takes big swaths of data each time it does. "I have a priority queue," Kromer told us.

"I can set a search term, and for each search term I can get 1500 tweets per API call. If I get 1500 tweets at a time, then the number of wasted tweets at the end of a series of searches is the smallest. If I'm searching for a term and get less than 1500 results back, then I forecast how long it will take to fill that number of results back up to the maximum and move it down the priority queue accordingly. On the lowest priority I have searches for RT or http. There will always be 1500 results for that. It's only API calls that limit me. As is, it's like a fisherman setting nets: what matters is that dinner is tasty."
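For the curious, here is a minimal sketch in Python of the kind of priority queue Kromer describes. It is not InfoChimps' code. The class name, function names, the linear refill forecast and the example hashtag are all assumptions made for illustration; only the 1500-result ceiling and the RT/http fallback searches come from his description.

```python
import heapq

MAX_RESULTS = 1500               # results per search call, as quoted by Kromer
FALLBACK_TERMS = ("RT", "http")  # searches he says always return a full page


def forecast_refill_seconds(results, elapsed_seconds, max_results=MAX_RESULTS):
    """Estimate how long a term needs to accumulate a full page of new tweets.

    Assumes tweets keep arriving at the rate seen since the last poll.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if results <= 0:
        return float("inf")  # no activity observed: send the term to the back
    rate = results / elapsed_seconds  # tweets per second
    return max_results / rate


class SearchQueue:
    """Hypothetical scheduler: poll each term when its next page should be full."""

    def __init__(self):
        self._heap = []  # entries are (due_time, term); earliest due time first

    def schedule(self, term, due_time):
        heapq.heappush(self._heap, (due_time, term))

    def next_term(self, now):
        """Return the most overdue term, or a fallback search if none is due."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return FALLBACK_TERMS[0]

    def record_result(self, term, results, last_polled, now):
        """Re-queue a term according to how quickly it filled up last time."""
        wait = forecast_refill_seconds(results, now - last_polled)
        self.schedule(term, now + wait)


queue = SearchQueue()
queue.schedule("#followfriday", due_time=60)
term = queue.next_term(now=60)  # "#followfriday" is due, so it is polled
# Suppose the call returned 300 tweets covering the last 60 seconds:
queue.record_result(term, results=300, last_polled=0, now=60)
print(queue.next_term(now=100))  # prints "RT": nothing is due, so spend the call on a fallback
print(queue.next_term(now=360))  # prints "#followfriday": its page should be full again
```

The point of the forecast is that no call comes back with a half-empty page: at 5 tweets a second, the term above needs 300 seconds to reach 1500 results, so it is not polled again until then.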

Does that sound so hard? Worth thousands of dollars? Here's what Kromer says:

"It's not magic. If you talk to people who use Hadoop and do social networking analysis, this is underwhelming. You take 30 million users, 1 billion links, adorn each link with info at the end of the link and accrue it with the person at the head of the link. That breaks conventional databases; the plumbing is hard. The math is easy but when you do it a billion times, it starts to get interesting. You have to be careful and clever. We plan to do stuff that is structural - a clustering coefficient, true PageRank."
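To make the "adorn and accrue" step concrete, here is a toy in-memory version in Python. It is a sketch under assumptions: the function name, the sample users and the choice of follower counts as the "info" are hypothetical, and we read "head" as the user a link starts from and "end" as the user it points to. At InfoChimps' scale the same join-then-sum would run as a distributed job rather than a loop over a list.

```python
from collections import defaultdict


def accrue_to_heads(links, user_info):
    """Sum, for each user at the head of a link, the info found at the link's end.

    links: iterable of (head, end) user pairs, e.g. who messaged whom.
    user_info: dict mapping a user to a number (here, a follower count).
    """
    totals = defaultdict(int)
    for head, end in links:
        # "Adorn" the link with information about the user at its end...
        weight = user_info.get(end, 0)
        # ...then "accrue" that information to the user at its head.
        totals[head] += weight
    return dict(totals)


links = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
followers = {"bob": 10, "carol": 25}
print(accrue_to_heads(links, followers))  # prints {'alice': 35, 'bob': 25}
```

Each step is a lookup and an addition, which is the "easy math" Kromer mentions. The hard part is that a billion links and their lookups do not fit comfortably in one machine's memory or one conventional database.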

Ultimately it's about specialization and data as a service. "The people we need to come in and connect this info with human beings," Kromer says, "aren't the people who should be wasting their time on the math. And the guys who are good at doing these things should not be building Web apps."

But Can They Get Away With It?

There's some question whether Twitter will allow InfoChimps to sell data based on Twitter data. Kromer says he'd much rather resell the data on commission than have to do all the work he's done to set up the extraction system. InfoChimps first caught the eye of people who love data a year ago, by releasing a large collection of scraped Twitter data.

The InfoChimps blog post for that read: "Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data."

But then Twitter founder Evan Williams asked InfoChimps to take those data sets down until a Terms of Service for them could be figured out. That never happened, and communication between the two companies hasn't progressed very far over the last year.

InfoChimps does not have Twitter's permission to do what it did today, but Kromer says Twitter hasn't contacted them either. No one from Twitter headquarters has responded to our request for comment yet.

"We talked to our lawyer about this a lot," Kromer told us. "We are on absolutely solid ground with regards to copyright, user privacy and use of the API. This is clearly for the benefit of their community."

It's nice that Kromer feels so assured, but his attitude seems a little unrealistic.

We asked technology journalist Robert Scoble what he thought of the dilemma, and his opinion is pretty clear. "If Twitter wants to be a platform, they have to behave like a platform," he said. "Don't be king-makers. Let the marketplace choose the winners. If they are going to say nobody should study the data because we're going to sell that, that's not being a platform. Twitter tries to pick the winners and it pisses me off. They admit that they are king-makers. All that does is make everyone vote against them and hope a competitor comes around."

Perhaps time will tell. But these are very early days in what looks to be an era of widespread innovation built on top of social data analysis.


Comments


  1. This could force Twitter to lock down data to 3rd parties further than it needs to, in order to prevent this kind of thing. I don't like where this is headed. Privacy concerns will spring up over the next few days, and we'll see more restrictive terms of service. Blah.

     Posted by: Justyn Howard | November 11, 2009 9:40 PM



  2. I don't know. Allowing your app to take that many API calls is pretty hard. I'm pretty shocked Twitter didn't move to prevent this before and hasn't been clear about this from the beginning. I always figured that data+search was where they would get the money. I guess they will have to rely on the underwear gnome method now:
    http://www.urbandictionary.com/define.php?term=underwear%20gnome%20economics

    Posted by: Michelle Greer | November 11, 2009 9:45 PM



  3. Michelle, 20k per hour is the standard developer limit on API hits (I edited to clarify that) and the net impact on Twitter's servers could be a big relief if a bunch of people hit InfoChimps once instead of Twitter over and over again.

    That said, +1 for underpants gnomes.

     Posted by: Marshall Kirkpatrick | November 11, 2009 9:55 PM



  4. I have to agree with Michelle, I thought it was all about the data too. It does seem odd that they would let it go so easily. I wonder if they will make the information available to universities at a discount. I'm sure a lot of Sociology departments would love to get their hands on it. With that said, I'm sure Facebook will be watching this very closely as they sit on the mountain of data too.

     Posted by: Michael Fidler | November 11, 2009 10:15 PM



  5. Michael, yes you and Michelle are right - no idea how Twitter will respond. Re academic research: InfoChimps says it intends to offer a special deal. And yup, I certainly thought about FB too ;)

     Posted by: Marshall Kirkpatrick | November 11, 2009 10:24 PM



  6. Couple of issues with @infochimps experiment:

    1. Privacy concerns and copyright issues will not allow them to "sell" the harvested corpus as such. Even the extracted statistics could be arguable legally, because just like any other blogging platform, a tweet is owned by the twitterer and Twitter has the license to use it to provide better service to its users who agreed to the TOS.

    2. Google (or any other search engine) collects tons of data from the web that is owned by its respective copyright holders. To facilitate research in NLP/computational linguistics, it provides terabytes of extracted data, so-called n-grams (n-word tuples), free of charge through LDC. Infochimps might have to do the same.

    3. Most importantly, it's the real-time data that will be important, so the value of tweets from the past year or month would be very little. Besides, there is an ample amount of real-time data generated, beyond current processing capabilities. Thanks to the Twitter stream API, the firehose is getting ready to be unleashed.

    4. On the positive side, though, we would need 20 such centers, not just for gathering but for processing the tweet streams, to cognize what's happening in the Twitterverse and provide a movie-like zoomable rendering of the global semantics.

     Posted by: Cognizr | November 12, 2009 5:15 AM



  7. I'm overjoyed someone's finally doing this, if only to give Twitter a much-needed kick in the arse. They've let all this hugely valuable data lie fallow for years; it was only a matter of time before someone stepped up and took it for themselves (or others in this case). Think it's ballsy of Infochimps, especially for an unknown startup, and I love it!

     Posted by: Carla Thompson | November 12, 2009 6:40 AM



  8. Will Twitter make the game changing move and shut Pandora's box? Or will they sit on the sidelines and let the masses dictate to them?

    Should be interesting to see how Twitter responds. It could really change the game in a big way!

     Posted by: Rex | November 12, 2009 8:38 AM



  9. Twitter has always been very clear that you own the data you put into the system. On the flip side they also make it clear that while you own the data others may reuse the data with no compensation to you.

    Such additional uses by Twitter, or other companies, organizations or individuals who partner with Twitter, may be made with no compensation paid to you with respect to the Content that you submit, post, transmit or otherwise make available through the Services.

    From https://twitter.com/tos

    It's a strange "contract" that boils down to a free-for-all, but the way I see it Twitter wouldn't work any other way.

    Also, if Twitter were to put a stop to this then other companies would be subject to the same precedent. Think about companies like CoTweet or Gnip who in varying ways redistribute Twitter data as part of their business model.

     Posted by: Ross Bates | November 12, 2009 9:25 AM



  10. Beyond the questions of users making informed decisions on how their data is used, the biggest concern I see is in what InfoChimps is saying about the intent here:

    "You take this data, mash it up with any other very large corpus of data with timestamps - and you've got a web app."

    Data that is somehow attributable to an individual, combined with a loose definition of its intended use, is a real recipe for privacy problems...for example, in issues related to healthcare. For those interested, my blog post today deals more with this topic: http://blogs.sas.com/hls

    Thanks RWW for another compelling story.

    Posted by: Jason Burke | November 12, 2009 12:39 PM



  11. "But Mommy! The Emperor is NAKED!"

     Posted by: Ed Borasky | November 13, 2009 5:05 PM


