data mining - ReadWriteWeb http://www.readwriteweb.com/feeds/search/data mining en Copyright 2009 Richard MacManus readwriteweb@gmail.com Tue, 24 Nov 2009 12:40:23 -0800 http://www.sixapart.com/movabletype/?v=4.23-en http://blogs.law.harvard.edu/tech/rss

Government Report Finds Data Mining an Ineffective Way To Smoke Out Terrorists

Remember the "pre-cog" cop-things in Minority Report, able to figure out who was going to commit a crime before they committed it? If that's ever going to happen, it looks like it's going to have to be something supernatural - because at least these days, technology is a long way from being able to predict who's going to commit a crime.

A new 350-page report released today, written by heavyweights like former US Secretary of Defense William Perry and National Academy of Engineering President Charles Vest and sponsored by the Department of Homeland Security, argues that large-scale data mining of consumer and other records is of "limited effectiveness" in finding suspects preparing to commit acts of terrorism.

The report was published by the National Research Council and was titled "All Counterterrorism Programs That Collect and Mine Data Should Be Evaluated for Effectiveness, Privacy Impacts; Congress Should Consider New Privacy Safeguards." CNet's Declan McCullagh says the report offers a retort to the aims of the office of Total Information Awareness, whose duties were dispersed throughout the Federal government after extensive controversy several years ago.

The report notes that while credit agencies have been able to use data mining to find fraudulent financial activities, the tactic is of limited effectiveness in finding would-be terrorists for two reasons: first, because so little is known about the psychology and behavior of terrorists, and second, because the resulting data is so rife with false positives that it's of very low quality.
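
The false-positive problem is easier to see with some back-of-the-envelope arithmetic. The sketch below is ours, not the report's: the population size, the number of real targets and both accuracy figures are hypothetical values chosen only for illustration, and the function name is made up.

```python
def positive_predictive_value(population, true_targets, sensitivity, false_positive_rate):
    """Share of flagged people who are real targets, computed from raw counts."""
    true_positives = true_targets * sensitivity
    false_positives = (population - true_targets) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs (assumptions, not figures from the report):
# a very rare target group and a screening system that is right 99% of the time.
ppv = positive_predictive_value(
    population=300_000_000,
    true_targets=3_000,
    sensitivity=0.99,
    false_positive_rate=0.01,
)
print(f"{ppv:.2%} of flagged records point at a real target")
```

Under these assumed numbers, only about one flagged record in a thousand would be a real target, simply because the innocent population is so much larger than the group being searched for.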

The report argued that it was much more effective to use data mining to track known terrorists or to find people exhibiting very specific behavior. It warned against using such tactics as tracking emotional or psychological states as those are things the authors believe individuals should not be called to account for. Apparently that doesn't go without saying anymore.

Much of the report's summary, and clearly its title, focused on the privacy implications of these false positives in particular and of this kind of data mining in general. Presumably the report was written in a different era, before it became appropriate to try out for the Vice Presidency of this country with words like "Al Qaida terrorists still plot to inflict catastrophic harm on America, and [Barack Obama] he's worried that someone won't read them their rights." (Palin acceptance speech) Evidently we live in a post-rights world now.

Thus what's most significant in today's report is the finding that pre-emptive data mining just doesn't work. Surely the ineffectiveness of pre-emptive actions is significant, isn't it? The report warned against using anti-terrorism data mining as an opportunity to find other actionable information.

The report offers a series of recommendations that include close monitoring of any such programs, possibly even including subjecting data-mining activities to regular data-mining based assessments of their effectiveness. The report said that "legislation to clarify private-sector rights, responsibilities, and liability in turning over data to the government" was an area "ripe for congressional activity." At a time when neither party running for the US Presidency is willing to mention anything like this, such recommendations might seem either refreshing or insane.

http://www.readwriteweb.com/archives/government_report_finds_data_m.php Analysis Tue, 07 Oct 2008 13:23:07 -0800 Marshall Kirkpatrick
Facebook Data Mining: Truth in Association?

With a product as ubiquitous as Facebook, the public has raised a number of privacy-related concerns including optional settings, privacy policies and data mining. In the past, ReadWriteWeb covered Facebook's plans to sell user data for market research purposes. However, today's article in the Boston Globe suggests that user information can be mined for more than just advertising purposes.

An MIT experiment dubbed "Gaydar" by creators Carter Jernigan and Behram Mistree has employed computational analysis to identify user traits based on information listed by their Facebook friends. Through friend profiles, the program predicts the likelihood of your religious affiliations, political leanings and even your sexual orientation. Essentially the idea is that friends are likely to share traits. So if you're in the closet, but you've got loads of vocal friends, a program of this nature could potentially out you.
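
To make the "friends share traits" idea concrete, here is a deliberately naive sketch. It is not the MIT students' method, which the article does not describe. It only compares how common a trait is among someone's friends with an assumed baseline; the function names, the sample profiles and the `margin` parameter are all hypothetical.

```python
def friend_trait_share(friends, trait):
    """Fraction of friends whose public profile lists the given trait."""
    if not friends:
        return 0.0
    listed = sum(1 for profile in friends if trait in profile)
    return listed / len(friends)


def looks_overrepresented(friends, trait, baseline_share, margin=2.0):
    """True when the trait is at least `margin` times as common among friends as in the baseline."""
    return friend_trait_share(friends, trait) >= margin * baseline_share


# Hypothetical data: each friend is a set of publicly listed traits.
friends = [{"liberal"}, {"liberal", "vegetarian"}, {"conservative"}, {"liberal"}]
share = friend_trait_share(friends, "liberal")
flagged = looks_overrepresented(friends, "liberal", baseline_share=0.3)
print(share, flagged)
```

Even a crude ratio like this never looks at the user's own profile, which is the privacy point: the inference runs entirely on what other people chose to publish.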

Said Hal Abelson, a professor who co-taught the course, "[It] pulls the rug out from a whole policy and technology perspective that the point is to give you control over your information - because you don't have control over your information."

With the service being used to catch tax evaders, in addition to a conspiracy theory citing CIA ties, it'll be interesting to see how the public reacts to this latest show of Facebook data mining capabilities. While it's unlikely that terrorist suspects are friending each other on Facebook, there are a number of associations that need not be publicized to corporate partners or governments.

Photo Credit: Steve Jurvetson

http://www.readwriteweb.com/archives/facebook_data_mining_truth_in_association.php Facebook Sun, 20 Sep 2009 19:41:26 -0800 Dana Oshiro
Yahoo! Experiments in Reality Mining with Bluetooth MyBlogLog

Yahoo!-owned MyBlogLog is stepping into dangerous waters with a new experiment in mobile presence tracking through Bluetooth.

Demonstrated at the eTech conference today, m.mybloglog.com says it allows users to: "Bind your Bluetooth address to your MyBlogLog account and discover others nearby and [sic] find out if you have any shared interests. Meetspace keeps track of time spent with others so you have a running log of people to meet and things to talk about."

The new Mobile MyBlogLog uses a Java applet to tie your Bluetooth device to your MyBlogLog account, then polls for new activity every two minutes. In some ways it's not that different from Google's Dodgeball or other mobile presence trackers. MyBlogLog is very tied into your online behavior, though, most recently relaunching with an emphasis on online lifestreaming. This new feature will let you, and Microhoo, view the recent online activities of the (participating) people you've been near lately.

Reality Mining

"Reality mining" is a phrase coined by MIT researcher Sandy Pentland, whose work we wrote about in December. Pentland is working on processing more than 350,000 hours of data collected from people's cell phones. Pentland's Nokia-funded work is studying proximity, location and activity data using information including interactions recorded between Bluetooth devices.

Previous coverage of what Pentland is up to is worth a read on its own. Obviously he's not the only one working on passive collection of presence and activity data through the interaction of mobile devices.

The Privacy Lab That is MyBlogLog

MyBlogLog is a great laboratory for Yahoo! to experiment with behavioral tracking and personal information among early adopter crowds. There's a lot of fascinating work being done there. It sometimes borders on creepy, though, and this is one of those times.

If you've signed up for a MyBlogLog account, you've probably experienced the ambivalent feelings that can arise from, on one hand, being interested to see the faces of other people who read your blog or the blogs you like but, on the other hand, feeling a little uneasy with your own blog reading being very public. The MyBlogLog cookie is very persistent, too. Of course this is opt-in, but how far down the rabbit hole are we going to go before that's no longer sufficient justification for new levels of tracking?

Data portability and lifestreaming online have huge potential, but once experiments like this start creeping into reality mining territory there are some gigantic privacy questions that come up. I don't know why MyBlogLog thinks it can get away with introducing this kind of service when it knows it has a shaky public image on privacy.

My first thought upon seeing this was: the internet brain implant creeps closer every day. Maybe I'm overreacting, but how often do you see people who never take their Bluetooth headsets off? This kind of tracking needs to stay as far away from the inside of my head as possible.

I have said several times that Yahoo! is pushing the envelope on data portability with MyBlogLog while the standards community sits too far towards the sidelines having a different discussion. The web, and data portability itself, need a big discussion of the privacy half of the data portability discussion. To keep track of these important discussions here's an RSS feed you can subscribe to that contains DataPortability.org discussions that contain the word "privacy" and Ask.com blogsearch results for the query: privacy AND "data portability" OR authorization. Enjoy. Here's a preview of the last few things that have come through this feed.

Recent Items in Data Portability and Privacy Feed

http://www.readwriteweb.com/archives/yahoo_reality_mining.php Mon, 03 Mar 2008 21:02:09 -0800 Marshall Kirkpatrick
Web as Platform For Research on Oceans, Galaxies

The University of Washington has announced two new research projects that will utilize cloud computing platforms from Internet companies such as Google, Microsoft, Amazon and IBM. According to the press release published on Genetic Engineering News, the University of Washington has won grants from the National Science Foundation to fund projects examining ocean climate simulations and analyzing astronomical images. Both of these projects will utilize cloud computing to examine and interact with "the massive datasets that are becoming more and more common in science."

The University of Washington projects tie into a couple of major trends in the current era of the Web: there's now much more data being created for the Web, or being transported to the Web; and we're seeing Web technologies being used to analyze and make sense of that data.

It's not only in scientific realms. We're seeing this on the Consumer Web too, as Marshall Kirkpatrick explained this morning in an article about social media monitoring tools. He wrote that data mining tools are being democratized and used more nowadays, similar to how online publishing tools were democratized in Web 2.0. The cloud computing servers that the University of Washington will utilize are relatively cheap and easy to use Web platforms that will enable data mining on a scale not seen before. These projects will access a cloud datacenter established for educational use in 2007, through a partnership between Google, IBM and six academic institutions (including the University of Washington).

Oceans and Galaxies of Data

Bill Howe, a researcher at the UW's eScience Institute, explained the impact of cloud computing on his ocean climate simulation project. Instead of running a simulation to test a single hypothesis, he said, climate scientists are now running long-term simulations and then sifting through tens of thousands of gigabytes of resulting data to discover trends.

Andrew Connolly, a UW associate professor of astronomy, explained that for his project analyzing astronomical images, cloud computing makes it easier to store and process information in the cloud and make the information available over the Web. He said that whereas scientists once competed for time on telescopes, recorded data and then studied the individual images in detail, now "telescopes continuously record high-resolution images that are available to all, providing millions of times more information." So the shift is that data gathering has been automated, and the data is available for scientists to analyze on a much larger scale than before.

Data Rich - And Useful

This current era of the Web, which some are calling 'Web 3.0' (but we frankly don't know what it's called yet) is increasingly data rich. The same thing could have been said about the Web 2.0 era, when oceans of 'User Generated Content' were created. However the world of sensors is rapidly pouring even more data onto the Web. Ed Lazowska, a UW professor of computer science and engineering, noted that "the rapid evolution of sensors is transforming all sciences from data-poor to data-rich." He said that "the challenge is to use modern cloud computing resources, such as Amazon Web Services, and modern computer science advances, such as data mining and machine learning, to explore these massive volumes of data." He claimed that this new computational science will be pervasive and will have enormous impact.

We're always pleased when the Web has a meaningful impact on the 'real world' - and particularly on science projects such as this, where the findings could be profound.

http://www.readwriteweb.com/archives/web_as_platform_for_research_on_oceans_galaxies.php Real World Wed, 15 Apr 2009 18:45:43 -0800 Richard MacManus
Do You Trust Google to Resist Data Mining Across Services?

Google's breadth of services is truly awesome and the amount of information the company touches concerning our lives and world can sometimes feel downright frightening. While almost no one takes the old phrase "Don't Be Evil" seriously anymore now that there are billions of dollars on the table and Chinese autocrats to satisfy, regular evaluations of Google's ethical positions still seem advisable.

One of the big questions being asked with increasing frequency is this: Is Google taking data it collects through particular services and using it for its benefit in other services? We know the company scans our Gmail and uses the text there to sell ads, but is this a tactic being employed across services? Some people appear to believe it is.

The Fears

When enterprise wiki service Socialtext announced this morning that they were folding Dan Bricklin's SocialCalc (VisiCalc) spreadsheet into their offerings, the announcement included this interesting customer quote:

"The timing of SocialCalc is perfect - we were in need of a wikified spreadsheet that had all of the utility of Google Docs without the datamining," remarks Brandon Stafford, Principal Engineer at GreenMountain Engineering.

We found it very interesting that a new application would specifically aim at Google's data mining as a weakness. That kind of tactic is likely to become increasingly frequent.

Similarly, when Google's Mark Lucovsky was a guest on last week's Gillmor Gang podcast, he was pressed on the question of data mining concerning the free javascript libraries that Google hosts and offers to developers. Is Google monitoring everything that goes on at the sites that use the libraries and using those observations for market intelligence such as ad sales?

You might remember that was a question people asked about MyBlogLog when Yahoo! bought the widely embedded service. Was Yahoo! using MyBlogLog to spy on AdSense and other activity unrelated to their own technology?

In Google's Defense

The information available cross-application is probably too seductive for Google, or almost any company, to pass up. The search and ad giant's saving grace may be that it has so much information in each silo already that it's uniquely satisfied not cross-pollinating.

Google's Lucovsky told Gillmor that "the Slashdot crowd" might think there's some kind of conspiracy, but that there really isn't. He assured listeners that Google only uses the information it collects from his javascript libraries to improve the service of the javascript library service. "The Slashdot crowd" is old school lingo for nonprofessional writers who post on the web but don't have a vested interest in respecting power - so they point at alleged conspiracies more often than the tamer professional press does.

Behind every alleged conspiracy at a giant company though is just a bunch of people doing their jobs. Only occasionally, we presume, do some of them come up with what would be a great idea as long as they don't get caught.

Data Portability

Some cross pollination of data from one service to another might in fact be great - if users had control over it and could use the same tactic for their own direct benefit. Until that kind of data portability policy and technology are in place, though, many of us would prefer that data remains right where it is and keeps its hands in plain sight.

Perspective

One of the first posts I wrote in my time at TechCrunch was about a Google experiment that would use your computer's microphone to track the ambient audio in a room, determine what TV shows you were watching and then serve up related ads in your browser. Presumably that program hasn't gone anywhere; snooping-obsessed researcher Shumeet Baluja has moved on to other research like monitoring video game players' behavior and psychology for ad targeting and watching how much porn people look at on their mobile phones.

Outside of Google's actions - data integrity (privacy) in hosted services has long been a concern and is now being responded to by some enterprise sales teams with boxes carrying applications locally behind customers' firewalls. As recently as the end of last year, SalesForce.com admitted that one of its employees fell for a phishing scam and handed over the key to that company's customer email accounts.

What if it wasn't an accident or an outside party, though? What if data that was collected in "anonymous aggregate" proved just too juicy for personalization-hungry ad sales teams or security-obsessed government agencies? Do you trust Google to resist mining your data across the various Google services you use? Is avoiding "Google data mining" an effective selling point that would increase your consideration of products from another vendor? We expect that the answers to these questions will change over time and we think it would be wise to revisit them periodically.

http://www.readwriteweb.com/archives/do_you_trust_google_to_resist_data_mining_across_services.php Analysis Tue, 10 Jun 2008 11:05:05 -0800 Marshall Kirkpatrick
MIT Researcher Collecting Passive Social Graph Data From Cellphone Activity, Bluetooth

Sandy Pentland, a researcher at MIT whose work has received funding from Nokia, is working on processing more than 350,000 hours of data collected from people's cell phones. More than just who calls who, Pentland is also studying proximity, location and activity data using information like interactions recorded between Bluetooth devices.

The result is a field Pentland has given the obnoxious name "reality mining."

In an interview yesterday with MIT's Technology Review (found via author Nick Carr), Pentland says that self-reporting of social connections and roles is far inferior to the kinds of analysis that can be done using passively collected data via mobile devices. While calling this data "reality" denies the importance of our hearts, minds and other parts of reality as yet imperceptible by our cell phones, it is very interesting research nonetheless.

This is where discussions about things like OpenID, OAuth and OpenSocial are likely to be played out. Passive mobile data will be a huge part of, and will leverage, your Social Graph. Once this kind of data becomes readily accessible in sophisticated ways, that could be when we'll see Telcos pressuring web services to produce standards-compliant data - so they can make use of it for mobile marketing and services. Some of those services will be awesome and I anticipate them with both eagerness and caution.

Pentland predicts a future when he'll be able to use frequency of calls, physical proximity and interruptions in conversations to determine for example who among your Facebook friends is a real life friend, who you've never met in person and who is your superior in a workplace hierarchy. I see different ring tones for these different groups of people some time in the future!


Pentland also says that the data mobile devices can capture will be good for early alerting of things like epidemics (15% of the residents of an apartment building didn't go to work today - could be a problem). Using special software and already available hardware, there's a whole lot of data that can be collected - it's just a matter of figuring out how best to crunch that data.
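
The apartment-building example boils down to a threshold check on an absence rate. Here's a minimal sketch of that check; the function names and the sample counts are hypothetical, and the 15% cutoff is simply borrowed from the example above rather than from any real alerting system.

```python
def absence_rate(stayed_home, total_residents):
    """Fraction of a building's residents who did not go to work today."""
    if total_residents <= 0:
        raise ValueError("total_residents must be positive")
    return stayed_home / total_residents


def possible_outbreak(stayed_home, total_residents, threshold=0.15):
    """True when the absence rate reaches the (assumed) alert threshold."""
    return absence_rate(stayed_home, total_residents) >= threshold


# Hypothetical building with 200 residents.
print(possible_outbreak(30, 200))  # 30 of 200 stayed home, which is 15%
print(possible_outbreak(5, 200))   # 5 of 200 stayed home, which is 2.5%
```

A real system would need a baseline for each building (weekends, holidays, shift workers) before a raw percentage like this meant anything, which is part of the "how best to crunch that data" problem.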

Just Imagine the Shopping Opportunities!

Some people seem dead set on making the movie Minority Report a reality, Pentland among them. (Can we just have the interface without the mind reading, please?) Obviously the marketing opportunities that will arise from this kind of data are huge. Big, big money.

When your phone and Facebook put their heads together with your boss's Amazon wishlist - the only question that will remain is whether the birthday presents will be purchased via your phone or via your web enabled brain implant (11% of US respondents say they are somewhat or very likely to get one).

What Will the Rules Be?

Data mining is not bad. In fact, it's quite an exciting idea with a whole lot of potential. As long as it's not used to catch me thinking subversive thoughts - then let's go with it. That's not even an "if" - that's pretty much a deal breaker. Let's ignore that for just a moment, though.

Pentland articulates two good rules in his interview. First, there has to be an opt-out (or opt-in) option. Second, aggregate data needs to be anonymized and your individual data needs to be viewable by no human eyes but your own. When he says we need a "new deal" for privacy, I think that's probably a good choice of phrases.

Mobile devices are wonderful, life and world changing things. They are also the hardware for projects like Pentland's, for better or for worse.

http://www.readwriteweb.com/archives/reality_mining.php Mobile Services Fri, 21 Dec 2007 18:39:43 -0800 Marshall Kirkpatrick
Read/WriteWeb Daily

The Daily is back, now that I'm over my jet lag :-)

- Scoble: I’m not an edge case (If you listen carefully, you'll hear me whoop near the end of Scoble's excellent outburst. I've never whooped in my entire life - yet here I am carrying on like I'm on the Oprah Show...)

- Alex Barnett's 'Edge Case' series on Flickr (caption to pic on left: "If someone calls me an edge case....")

- Dion Hinchcliffe on Live Labs (Microsoft's think tank and incubator is indeed an interesting project -- the best part for me is that they're going to invite external people, and not just scientists either, to play a part)

- Product Development: TV Guide will roll their own (cool - I did some analysis work on this...)

- Rumors of a Google homepage makeover (here's a screenshot c/- Flickr... I like the look of it)

- Google misses Street targets, shares tumble ("[this] ended the uninterrupted winning streak Google has had since its August 2004 public offering.")

- Apple analyst predicts big things (sees "potential for new iBooks by April [...], a potential "media hub" product (and more services), new iPods into year-end (including a new media player) and even a new cell phone within a year.")

- Ben has details of Aussie 2.0 action (Yahoo7, NewsCorp's truelocal.com.au, Fairfax - all ramping up for a Web media battle)

- The Online Storage Gang (TechCrunch has an excellent reference and analysis piece on online storage solutions, sure to be one of the key products on the Web by the end of this year. Great to see aussie company OmniDrive as their #1 pick!)

- Mining the Two Types of User-Supplied Content (Josh ponders the data mining efforts of Yahoo and Google)

- Internet Explorer 7 Beta 2 Preview released (Dave Winer says it's significant because it's "the first Microsoft release that includes comprehensive support for RSS not only on the producing side, but also on the consuming side.")

Flickr pic by Alex Barnett

http://www.readwriteweb.com/archives/readwriteweb_da_2.php List of Links Tue, 31 Jan 2006 20:58:10 -0800 Richard MacManus
Four Ad-Free Ways that Mined Data Can Make Money

Machines can do wonderful things. Side by side with the rise of a new world of publishers, the computer scientists of the world are cranking it up as well - building new ways to create value from the sea of data being published by people. And then they take their work and they sell it to advertisers!

Barf-o-rama!

We have some appreciation for advertising technology and we certainly appreciate our advertisers here at RWW - but why do so many innovative technologies end up slinking away into the ad tech world and watching their grand visions for user empowerment fade?

The most obvious answer to that question might be that advertising is where the money is made. Data mining, machine processing large quantities of information in order to unearth patterns or other valuable insights, seems just made for demographic and behavioral targeting by advertisers.

We argue here, however, that money can be and is being made from data mining in ways other than by the sale of information to ad networks. Cooler, more exciting ways. We briefly discuss four markets for mined data that we believe exist now or could hold strong demand for analysis of aggregate data from online activity.

Let's be honest - we're thinking about Twitter here. When people ask how Twitter is going to make money, we think data mining has huge potential. More than Twitter, though, all kinds of apps will soon trade in user data as a primary currency.

Here's what we think that could look like.

Traffic

The most obvious example that's already real is Internet Service Providers selling customer web traffic data to traffic analyst firms. When you see a company that measures web traffic of sites around the web, you can be pretty sure they are buying data about what sites you are visiting from your ISP.

This isn't the most interesting example because traffic analysis does find some of its meaning in advertising. It's also used for competitive intelligence, identifying vertical leaders and generally adding some semi-verifiable sophistication to our understanding of the landscape of the web. Unfortunately, as any publisher online will tell you - the resulting traffic estimates from these services are often wildly inaccurate.

Sentiment Analysis

More interesting than simple comings and goings is sentiment analysis of language used online about a given topic. There are PR uses for this data, but there's also a market for it in analyst firms who use it to make recommendations to their subscribers and clients.

Summize Labs have gone behind the cover of acquisition, but we believe that at least some of the original work continues.

This was the real technology being built by Summize, the search engine recently bought by Twitter. You might have noticed that though Summize is now called search.twitter - there's still no link to it from the Twitter site. Perhaps search wasn't the most important part of Summize after all - perhaps it's the sentiment analysis that's got the most potential.

Vertical Trend Watching

Ok, so maybe sentiment analysis of online activity could be solid enough to be interesting and worth a lot of money some day. And if wishes and buts were candy and nuts, we'd all have a merrier Christmas.

You know who's not messing around when it comes to stuff like this, though? People who trade in money. Hedge fund buyers in particular are willing to try out hard core technology in order to get more and better information faster than anyone else. They are nuts for crazy tech; they pay thousands of dollars for research tools that could wrap Google Reader up like a pretzel and swallow it in one bite.

FirstRain parses data you'd probably not think to imagine.

Check out our review of power news dashboard FirstRain, and RootMarkets, a company that aims to trade in futures of web browsing data, ultimately for lead generation.

We want to see this kind of data crunching research tech outside of financial markets, though. We'd love to see some trends crunched out of the Twitter streams from real estate pros, people in the Navy or biotech researchers. Users are segmented into these categories already by the directory Twellow, for example. We think rapid analysis of emerging trends in those verticals is something people would pay for.

Benchmarking

Google Analytics will now let you identify what kind of industry your website serves and once you do, they'll tell you how your website traffic trends compare to what's being seen by others in your industry. FreshBooks, a startup that provides online invoicing for independent professionals, offers benchmark data by industry to its users as well. Compared to other graphic designers, for example, you're charging less and getting your invoices filled slower than most.
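
Benchmarking of this kind is, at its simplest, a percentile rank against a peer group. The sketch below is a generic illustration and makes no claim about how Google Analytics or FreshBooks actually compute their benchmarks; the function name and the sample hourly rates are hypothetical.

```python
def percentile_rank(value, peer_values):
    """Percentage of peers whose value is strictly below `value`."""
    if not peer_values:
        raise ValueError("need at least one peer value")
    below = sum(1 for peer in peer_values if peer < value)
    return 100.0 * below / len(peer_values)


# Hypothetical hourly rates charged by ten graphic designers.
peer_rates = [40, 55, 60, 65, 70, 75, 80, 90, 100, 120]
my_rank = percentile_rank(50, peer_rates)
print(f"You charge more than {my_rank:.0f}% of your peers")
```

With these made-up numbers, a designer charging 50 an hour out-charges only one peer in ten, which is the kind of nudge a benchmark is meant to deliver.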

Benchmark data helps people and businesses make better decisions, hopefully saving or making more money than they would have otherwise. Isn't that a lot more interesting than advertising?

Pointing out patterns of information gets people talking, too. Recommendation engine Strands offers a mobile banking service that prompts users to fill out their profile information by sharing interesting trivia with them about patterns in the data of users as a whole. "Did you know: married people spend 110% on groceries what single people do? Are you married or single?" Knowing whether customers are married or single lets a bank offer them targeted services, understand the risks its customers face, and so on.

The Future of Data as Currency

These are just a few ways that large quantities of data can be used to derive value other than targeted advertising. All of them are more interesting than advertising, too.

Just like grocery stores give customers discounts in exchange for capturing their purchase histories, so too will users of online applications receive compensation for the data they co-produce with service providers that's subsequently monetized.

Beyond money, user co-producers of data will likely call for the ability to take their data from one service over to another, where they can contribute it to another aggregate of data and thus participate in another instance of value creation through the processing of data. That's data portability, or one way to articulate it.

We hope to see more examples of creative thinking about data mining and more startups that avoid taking the path of serving advertisers as their ultimate customers. The use of a tool impacts its orientation over time and these great technologies we are beginning to use online should be formed with greater goals in mind. There's too much utility at stake and the world's problems are too great for all this potential to be stunted by the seductive call of ad money. We hope an economy will grow to support alternative uses of user data and we hope it happens soon.

Top photo: Data processing center, CC from Flickr user Marcin Wichary

http://www.readwriteweb.com/archives/four_adfree_ways_that_mined_da.php http://www.readwriteweb.com/archives/four_adfree_ways_that_mined_da.php Analysis Sun, 24 Aug 2008 10:14:46 -0800 Marshall Kirkpatrick
The State of the Market in Semantic Technologies Tom Tague from Thomson Reuters' OpenCalais team did a keynote speech today at SemTech in San Jose. His presentation was a wonderful wrapup of current semantic technology trends, and what we can expect over the next few years.

To open, he said that where we are now in the evolution of the Web is content rich, but information poor - plus "experientially deficient". He suggested that 'web 3.0' is about cleaning up the mess of web 2.0 and improving interfaces. In terms of semantic technology, he explained that over the past 5 years it has evolved from invention of standards to a period of commercial innovation on top of those inventions. While standards are still being worked on, now "we are at an inflection point where innovation is exploding."

Tague called Calais, the project he leads at Thomson Reuters, "a web service a.k.a. plumbing". They've had 13 releases, talked with 100+ customers about Calais, and have 13,000 registered developers. He put the ideas that he's been talking about with customers and developers into 6 buckets, which we've listed with sub-categories below.

Tools

  • Semantic data mgmt
  • Semantic data generation
  • Databases
  • Integration and workflow

Tague said that tools are important, particularly in the enterprise. He sounded a note of caution to tools vendors: they need to simplify their stories and offer "simple basic tools."

Social

  • Semantics-powered link sharing
  • Network mining
  • News sharing
  • Tweet mining

Tague said that we shouldn't focus on providing "frosting" on top of current social Web tools. He advised focusing on commercial imperatives, such as the categories above.

Advertising

  • Semantic ad placement
  • Contextual ad placement
  • Semantically driven landing pages
  • Mashup ads

There are clearly opportunities to improve advertising using semantic technology, said Tague.

Search

Tague noted that semantic search may be "the answer to the question nobody is asking." He said that we should distinguish general "semantic search" from domain-specific, semantically enhanced search. The latter is where the commercial opportunity actually is; he questioned the economics of general semantic search.

Publishing

He put this into 3 sub-categories:

  • A-Content Producers - from back office to user experience
  • B-Editorial + Aggregation Publishing Models
  • C-Robotic publishing - aggregation only

Tague explained that Calais has really focused on this over the last 8-9 months. He said that classic publishers can get an enormous amount of value from this. Right now the big focus is "back in the boiler room," for example to cut editors from 3 to 2. He expects that later on more focus will go on enhancing the user experience.

Tague thinks that B is the biggest opportunity, using Huffington Post as an example. He said that it gives a "near newspaper like experience" at perhaps a 5th of the cost. It's an area where they're seeing adoption of Calais.

Interface

Tague noted that gaming is a huge industry that the semantic technology industry can learn from. He listed these attributes:

  • Great story line
  • High interactivity, immediate responsiveness
  • No interruptions
  • Graphically engaging
  • Seamless
  • Fun

So he asked: who out there is really trying to change the user experience in semantic technology? He listed 4 companies (all of whom we've profiled on ReadWriteWeb):

  • Zemanta
  • Apture
  • Feedly
  • Glue

Tague told the audience that the next big innovation in interface will be something that stays with the user where they are, which will be mobile and in the browser.

To sum up, Tague suggested that semantic technologies vendors should decide whether they care about semantics or about user value. If it's semantics, then be a tools vendor. He said the basic building blocks are out there already, so focus on user experience.

Disclosure: SemTech has been a recent sponsor of ReadWriteWeb

http://www.readwriteweb.com/archives/the_state_of_the_market_in_semantic_technologies.php http://www.readwriteweb.com/archives/the_state_of_the_market_in_semantic_technologies.php Conferences Tue, 16 Jun 2009 09:23:17 -0800 Richard MacManus
Crgslst: The Endangered, Sexy Craigslist Search Tool Denver, Colorado based Superhero.es has built crgslst, a very slick multi-city search tool for Craigslist. Craigslist itself doesn't offer a multi-search service. By combining the publicly available RSS feeds from Craigslist with AJAX, crgslst fills this need "so fast, we left the vowels behind."

Unfortunately, crgslst may be in violation of the Craigslist terms of use and could face the same shutdown that other similar projects have in the past. This situation brings up a number of questions about intellectual property, RSS and mashups.

Three years ago developer Jeff Atwood built a service at his site Coding Horror that performed a multi-city search of Craigslist, only to receive a shutdown order from Craigslist by email. That email included lines from the Terms of Use that are still present today.
Additionally, you agree not to:... use automated means, including spiders, robots, crawlers, data mining tools, or the like to download data from the Service - unless expressly permitted by craigslist;

What's an RSS feed though, but an API that lets 3rd parties download data from a site by automated means? Isn't Craigslist, or at least Housing Maps, the long-time darling of the mashup world? Some folks at least contend that an API is a way for noncommercial mashups to be developed without a lengthy, formal business development process.

There's no indication that crgslst has received any contact from Craigslist, but the history of similar services and the continued presence of the language above in the Terms of Use don't bode well.

Just the thought of a service like this getting shut down is sad. It's a great little site, offering a user experience that Craigslist itself would do well to offer. Whose IP is at work at crgslst, though?

For now, you can check out crgslst and see just one more example of the kind of magic that becomes possible when a website offers its data in a standards-based format like RSS.
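For the curious, here is a rough sketch of the multi-city idea: read one feed per city and filter the merged items by keyword. This is not crgslst's code. The feeds below are hand-written stand-ins in simplified RSS 2.0, and the city names, titles and example.com links are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written feeds standing in for per-city RSS output.
SAMPLE_FEEDS = {
    "denver": """<rss version="2.0"><channel>
        <item><title>Road bike, 54cm</title><link>http://example.com/denver/1</link></item>
        <item><title>Oak desk</title><link>http://example.com/denver/2</link></item>
    </channel></rss>""",
    "portland": """<rss version="2.0"><channel>
        <item><title>Mountain bike</title><link>http://example.com/portland/1</link></item>
    </channel></rss>""",
}


def search_feeds(feeds, keyword):
    """Return (city, title, link) for every item whose title contains keyword."""
    keyword = keyword.lower()
    results = []
    for city, xml_text in feeds.items():
        root = ET.fromstring(xml_text)
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            link = item.findtext("link", default="")
            if keyword in title.lower():
                results.append((city, title, link))
    return results


matches = search_feeds(SAMPLE_FEEDS, "bike")
for city, title, link in matches:
    print(city, title, link)
```

A real feed may use a different RSS flavor with XML namespaces, so the tag lookups would need adjusting.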

http://www.readwriteweb.com/archives/crgslst_the_endangered_sexy_craigslist_search_tool.php http://www.readwriteweb.com/archives/crgslst_the_endangered_sexy_craigslist_search_tool.php Products Wed, 12 Mar 2008 12:51:07 -0800 Marshall Kirkpatrick
Facebook Lexicon Launches - Google Trends for Facebook Facebook has just launched a neat new trend mapping tool, called Lexicon. Similar to Google Trends, it allows you to create a trend graph for different words and (two-word) phrases on Facebook Walls. It has a surprisingly slick UI too, with the scroll bar enabling you to zoom in and out to get different views of the trend line. You can compare up to 5 different trends by separating words/phrases with a comma.

Although Lexicon compares favorably to Google Trends, it has some flaws. In our tests it had trouble with low frequency words (like "semantic") and it also choked on "web 2.0" ("Invalid term: web 2.0. Check that each term is a single word or two-word phrase, and that each term uses only alphanumeric characters"). Also, to compare apples to apples, Google Trends has a wider range of data - including breakdowns by region, city and language.
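The error message above spells out the input rules clearly enough to sketch them. The snippet below is our guess at a validator that follows those rules (comma-separated terms, at most 5, each one or two alphanumeric words); the function name and the regular expression are ours, not Facebook's.

```python
import re

MAX_TERMS = 5  # the tool compares up to 5 trends at once

# One or two alphanumeric words separated by a single space.
TERM_PATTERN = re.compile(r"[A-Za-z0-9]+( [A-Za-z0-9]+)?")


def parse_query(query):
    """Split a comma-separated query and validate each term."""
    terms = [t.strip() for t in query.split(",") if t.strip()]
    if len(terms) > MAX_TERMS:
        raise ValueError("too many terms")
    for term in terms:
        if not TERM_PATTERN.fullmatch(term):
            raise ValueError(f"Invalid term: {term}")
    return terms
```

Under these rules "web 2.0" fails because of the period, which matches what we saw in our tests.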

Here is an example of Lexicon:

...and a comparable trend map from Google Trends:

In announcing this new service, Facebook was careful to emphasize that no privacy violations have occurred:

"We have a cluster of computers that count the number of occurrences of every term (for example, "juno") across profile, group and event Walls every day. The system strips out all personally identifiable information so that there is no way to track a mention back to a specific person. No human at Facebook ever reads these Wall posts, and Lexicon does not look at personal messages, invitations, or any other private user-to-user communications."

Overall, it's good to see Facebook mining some of the vast data that they have - but not stepping on sensitive privacy toes while doing so.
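The counting Facebook describes in its announcement can be sketched in a few lines: tally single words and two-word phrases per day, and throw away the author before anything is stored. This is only an illustration under our own assumptions. The post format, the tokenizer and the sample posts are invented, and Facebook's real system runs on a cluster, not a dictionary.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep only alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_terms(posts):
    """posts: iterable of (author, day, text). Returns {day: Counter of terms}."""
    counts = defaultdict(Counter)
    for _author, day, text in posts:  # the author is dropped, never stored
        words = tokenize(text)
        counts[day].update(words)  # single words
        counts[day].update(" ".join(pair) for pair in zip(words, words[1:]))  # two-word phrases
    return counts


# Made-up Wall posts for illustration.
posts = [
    ("alice", "2008-04-14", "Just saw Juno, loved it"),
    ("bob", "2008-04-14", "Juno was great"),
    ("carol", "2008-04-15", "party tonight"),
]
daily = count_terms(posts)
print(daily["2008-04-14"]["juno"])
```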

http://www.readwriteweb.com/archives/facebook_lexicon.php http://www.readwriteweb.com/archives/facebook_lexicon.php Products Tue, 15 Apr 2008 14:53:54 -0800 Richard MacManus
Database Analytics Startup Aster Data Launches, Analyzes MySpace Grid-computing startup Aster Data Systems will officially launch today, three years after it was founded. Aster, which began in the Ph.D. program at Stanford, is a provider of "massively parallel processing databases" for organizations that have mammoth quantities of data that need to be stored and analyzed quickly. The Redwood City, California-based company is backed by Sequoia Capital, Cambrian Ventures, and First-Round Capital.

Aster's nCluster software allows companies with large amounts of data to store it on commodity hardware and scale with one click, adding new servers as the data set grows. The company's first major client is MySpace, which generates hundreds of terabytes of traffic data from its 110 million monthly unique users. Mining that data to understand how customers use and interact with the site requires some pretty robust architecture.

Aster's solution for MySpace uses a 100-node cluster of off-the-shelf commodity servers that can capture and load 100% of the data and run complex queries quickly. "MySpace needed to analyze complete datasets - not just samples or summaries. Sampling would completely miss infrequently occurring but highly profitable patterns," according to Aster, which says that nCluster has allowed MySpace to work with all of its terabytes of data and avoid the need to sample.
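The point about sampling is easy to check with some back-of-the-envelope math. If a pattern shows up in only a handful of rows and each row lands in the sample independently with some probability, the chance that the sample contains none of those rows is (1 - fraction) raised to the number of occurrences. The numbers below are ours, chosen only to illustrate; they are not from Aster or MySpace.

```python
def miss_probability(occurrences, sample_fraction):
    """Chance that a random sample contains none of the rows showing a pattern.

    Assumes each row is included independently with probability sample_fraction.
    """
    return (1.0 - sample_fraction) ** occurrences


# A pattern that occurs 50 times, looked for in a 1% sample.
print(round(miss_probability(50, 0.01), 3))
```

With these made-up numbers the sample misses the pattern entirely about 60% of the time, which is the kind of blind spot Aster is describing.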

nCluster works by splitting up the cloud into smaller bits that each have a specific task. "Loader" nodes load data from external sources (and export to them), while "worker" boxes keep data stored on local disks. A "queen" layer directs the entire operation, intelligently routing queries to the proper node. The "loader" tier can scale independently as needed, says Aster. "This enables query load-balancing to eliminate hot-spots and increase performance, returning results in seconds or minutes versus hours or 'did not finish,'" writes the company in a case study.
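To make the division of labor concrete, here is a toy version of the loader, worker and queen roles. It is a sketch under our own assumptions and not Aster's design: we picked hash partitioning to decide which worker stores a row, and a simple scatter-and-merge for queries. The class names, method names and sample rows are hypothetical.

```python
import zlib
from collections import Counter


class Worker:
    """Keeps its share of the rows locally and answers partial queries."""

    def __init__(self):
        self.rows = []

    def store(self, row):
        self.rows.append(row)

    def count_by(self, field):
        return Counter(row[field] for row in self.rows)


class Queen:
    """Routes rows to workers and merges their partial answers."""

    def __init__(self, n_workers):
        self.workers = [Worker() for _ in range(n_workers)]

    def _route(self, key):
        # Assumption: hash partitioning on a key, so a given key always
        # lands on the same worker.
        index = zlib.crc32(key.encode("utf-8")) % len(self.workers)
        return self.workers[index]

    def load(self, rows, key_field):
        # Plays the "loader" role: push external rows onto the workers.
        for row in rows:
            self._route(row[key_field]).store(row)

    def count_by(self, field):
        # Ask every worker for its partial count, then merge.
        total = Counter()
        for worker in self.workers:
            total.update(worker.count_by(field))
        return total


rows = [
    {"user": "u1", "page": "home"},
    {"user": "u2", "page": "music"},
    {"user": "u1", "page": "music"},
    {"user": "u3", "page": "home"},
    {"user": "u2", "page": "music"},
]
queen = Queen(n_workers=4)
queen.load(rows, key_field="user")
print(queen.count_by("page"))
```

Because every worker scans only its own rows, adding workers spreads the load, which is the general idea behind running queries over the full data set and not a sample.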

The software reminds me of 3Tera's AppLogic (our coverage), which is a grid computing operating system that makes it easier for companies to deploy their own compute cloud on commodity hardware. nCluster is essentially the same idea, but with an eye specifically toward managing and querying massive databases.

http://www.readwriteweb.com/archives/database_analytics_startup_aster.php http://www.readwriteweb.com/archives/database_analytics_startup_aster.php Products Tue, 20 May 2008 00:01:01 -0800 Josh Catone
Could Wikipedia's Future Be as a Development Platform? Content creation at Wikipedia is slowing down. The already small number of active regular editors is on the decline and Jimmy Wales has called for live edits to be held for approval on many pages, a step sure to slow contributions even further.

The tapering of fresh content doesn't have to mean Wikipedia's death, though. The site contains a gargantuan amount of human created and tended but largely machine readable and structured data. That's a potential gold mine for innovation. Wikipedia can offer developers opportunities to glean analysis, supplemental content and structured data from its years-old store of collaboratively generated information. All of that is possible, but Wikipedia as a platform can't be taken for granted.

Above: Edit history via the WikiDashboard browser add-on, by Paul Irish.

If the sun is setting on Wikipedia's time as a fast-growing collection of user-contributed knowledge, maybe that part of the site's life was just its adolescence. Wiki inventor Ward Cunningham told us he thinks the moves by Wales to require approval before displaying edits are an "inevitable" maturing of the site, though not one he's necessarily happy about or believes is consistent with The Wiki Way. Nonetheless, the huge mass of knowledge amassed by the world's biggest wiki now offers developers and other websites all kinds of value that has only begun to be explored.

There is no formal Wikipedia Application Programming Interface (API) but the data there is relatively accessible anyway. It can be downloaded and processed locally. This spring a project called WikiXMLDB began offering a thoroughly XML-ified database of Wikipedia as well. We shouldn't fail to point out DBPedia, as well, where people are collaborating to make structured data available from Wikipedia. People are accessing the data in a variety of ways and are beginning to find good uses for it. One or more formal APIs from Wikipedia, though, would be exciting in ways similar to how it's exciting that the New York Times is opening up a number of APIs.
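Processing a downloaded copy locally mostly means streaming through a very large XML file one page at a time. Here's a minimal sketch of that. The sample below is a tiny hand-written stand-in for a dump, and the namespace URI and element layout in it are our assumptions about the export format; the code ignores namespaces so a different schema version shouldn't matter.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for a dump file (real ones are many gigabytes).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Michael Jordan</title>
    <revision><text>American basketball player.</text></revision>
  </page>
  <page>
    <title>Data mining</title>
    <revision><text>Process of discovering patterns in data.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs without loading the whole file into memory."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # free the finished page's subtree


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
for page_title, page_text in pages:
    print(page_title, "-", page_text)
```

For a real dump you would pass an open file (or a decompressing stream) in place of the in-memory bytes.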

What Would People Do With Wikipedia Data?

Wikipedia as a tool to identify key sources of knowledge. Mainstream media coverage of Wikipedia in its early days often focused on the seemingly random contributors to the site. Some old guy with a beard down to his knees and living in a trailer park in New Mexico likes to edit entries about astronomy and the culinary arts. Isn't that quirky?

Wikipedia has managed to set free in a big way the knowledge stashed away in the minds of people all over the world. Identifying those people in a systematic way is just one example of the kind of value add that can be built on top of Wikipedia. Identifying key influencers online is a fast emerging industry and Wikipedia is one more place that can happen.

The Palo Alto Research Center recently built an application called WikiDashboard, a service to analyze recent changes and editors of any Wikipedia entry. Paul Irish, who incidentally is the editor of one of the best music blogs on the web, turned that data into a Greasemonkey script that gives one click access to the data from any page on Wikipedia (image above).
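At its simplest, this kind of analysis boils down to counting who made the edits to an entry. The sketch below is not WikiDashboard's code; the revision format and the editor names are invented for illustration.

```python
from collections import Counter


def editor_shares(revisions):
    """revisions: list of (editor, timestamp). Returns [(editor, share)], biggest first."""
    counts = Counter(editor for editor, _timestamp in revisions)
    total = sum(counts.values())
    return [(editor, n / total) for editor, n in counts.most_common()]


# Made-up revision history for a single entry.
revisions = [
    ("ann", "2009-02-20T10:00"),
    ("bo", "2009-02-21T09:30"),
    ("ann", "2009-02-22T18:45"),
    ("cy", "2009-02-23T07:15"),
]
shares = editor_shares(revisions)
print(shares[0])
```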

That's just the beginning of what could be done with contributor data, though there are so few active participants on Wikipedia that the user data may be more limited than you'd think.

Wikipedia as news radar. Wikipedia puts a great emphasis on current events, but the opposite is true as well - current events are reflected in Wikipedia. The site WikiRage treats Wikipedia edits like signals of significance - its subtitle is "Monitoring the Hive Mind Through Wikipedia Edits."
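One simple way to treat edits as signals is to compare today's edit count for a page against its recent daily average and flag the pages that spike. We don't know how WikiRage scores pages, so take this as our own sketch; the threshold, the floor on the baseline and the sample edits are all assumptions.

```python
from collections import Counter


def trending_pages(edits, today, min_ratio=3.0):
    """edits: list of (page, day). Flag pages edited far more today than usual.

    Every day other than `today` is treated as history.
    """
    today_counts = Counter(page for page, day in edits if day == today)
    earlier_counts = Counter(page for page, day in edits if day != today)
    earlier_days = {day for _page, day in edits if day != today}
    n_days = max(len(earlier_days), 1)

    scored = []
    for page, n_today in today_counts.items():
        baseline = earlier_counts[page] / n_days
        ratio = n_today / max(baseline, 1.0)  # floor the baseline to avoid dividing by zero
        if ratio >= min_ratio:
            scored.append((page, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up edit log: one page spikes on the last day, the other does not.
edits = (
    [("Slumdog Millionaire", "2009-02-21"), ("Slumdog Millionaire", "2009-02-22")]
    + [("Slumdog Millionaire", "2009-02-23")] * 6
    + [("Astronomy", "2009-02-21"), ("Astronomy", "2009-02-23")]
)
hot = trending_pages(edits, today="2009-02-23")
print(hot)
```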

We've written here about non-advertising-based forms of data mining that could be huge in the future and how big a Facebook sentiment engine could be. Wikipedia edits are far less numerous than Twitter and Facebook updates, but they may be of higher value, and at the very least they seem like an important complement to a social media data mining strategy.

The Best Use Case: Leveraging Wikipedia's Structured Data

Last month we wrote here that Google appears to be exposing some semantically structured data in some of its search results. Some of that data may be originally analyzed at Google, but a lot of it is clearly coming in from Wikipedia. That's structured data that many, many companies could take advantage of.

Recommendation service MSpoke has been doing just that. (Disclosure: MSpoke's Sean Ammirati is the long-time producer of our podcast ReadWriteTalk.)

This business news tracking service uses Wikipedia to train its recommendation engines. Ammirati says that Wikipedia's disambiguation pages are very useful in helping the company's technology know that there are, for example, two famous Michael Jordans - one of whom is a basketball player and the other is a statistician. That kind of distinction makes all the difference when you're in the business of recommendations.
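To give a feel for how that distinction can be drawn, here is a bare-bones sketch that picks the sense whose description shares the most words with the surrounding text. We have no idea how MSpoke actually does it; the word-overlap approach is our stand-in, and the two descriptions are hand-written placeholders for what you would pull from the entries themselves.

```python
import re

# Hand-written descriptions standing in for text taken from each entry.
SENSES = {
    "basketball player": "american basketball player chicago bulls nba championships",
    "statistician": "american scientist professor machine learning statistics berkeley",
}


def tokens(text):
    """Lowercase words in the text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(context, senses):
    """Return the sense whose description overlaps most with the context."""
    context_words = tokens(context)
    return max(senses, key=lambda name: len(context_words & tokens(senses[name])))


print(disambiguate("He led the Bulls to another NBA title", SENSES))
print(disambiguate("The professor published a new machine learning paper", SENSES))
```

A production system would weigh words and handle ties far more carefully, but the principle is the same: the entry text tells you which Michael Jordan an article is about.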

By using a subset of Wikipedia's hierarchy of terms, MSpoke has been able to get an immediate foundation for its own taxonomy and quickly understand the content of articles it finds around the web.

This is the kind of thing that Metaweb and Powerset have tried to do in the past, as well. Powerset was absorbed by the Borg, and we're hearing rumors that things aren't going well at Metaweb. It's one thing to build added value from Wikipedia; it may be another to make it what you bet the farm on.

There could be something here, though. Wikipedia could do quite well for itself becoming less a destination site focused on public editing and more an open database, built up and still maintained after years of formerly frenetic public editing.

There's a chance that Wikipedia still isn't populated enough to be able to make that leap, that its political turbulence and waning enthusiasm are coming too soon. Only time will tell, but we have high hopes.

http://www.readwriteweb.com/archives/could_wikipedias_future_api.php http://www.readwriteweb.com/archives/could_wikipedias_future_api.php Analysis Mon, 23 Feb 2009 13:21:50 -0800 Marshall Kirkpatrick
Google Warns of Privacy Issues on the Social Web In a recent paper about social privacy, Google researchers caution that the expansion of the social Web and our growing involvement with it are compromising our privacy while offering the false sense of security that we act in the privacy of our own social circle.

Specifically, the paper suggests three areas where the social Web compromises user privacy.

1. Lack of control over activity streams

According to the paper, there are two primary ways in which lack of control over activity streams may compromise our privacy: the lack of control we have over events going into our activity streams (examples given are Facebook Beacon and coComment), and the lack of control we have over who can see our activity stream, as with Google Reader.

2. Unwelcome linkage

The authors define unwelcome linkage as occurring when links on the Internet reveal information about you that you had not intended to reveal, for instance trackbacks and accidental linkage.

3. De-anonymization through merging of social graphs

Given that social networking sites collect a fair amount of personally identifiable information, the authors suggest it may be possible to uncover personal information by comparing data across social networking sites. In fact, this method of merging social graphs has already been used when researchers identified Netflix users by combining Netflix data with data from IMDb (PDF).
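The basic risk is easy to demonstrate with a toy example: if an "anonymous" record and a public profile share enough of the same ratings, the two can be linked. The snippet below is a deliberately naive illustration with invented data. It is not the method the Netflix researchers used, which was considerably more involved.

```python
def best_match(anon_ratings, public_profiles, min_overlap=3):
    """Link an anonymous set of (movie, stars) ratings to the most similar public profile.

    Returns the profile name, or None if no profile shares at least min_overlap ratings.
    """
    best_name, best_score = None, 0
    for name, ratings in public_profiles.items():
        score = len(anon_ratings & ratings)  # ratings the two records have in common
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else None


# Invented data: one anonymous record and two public profiles.
anonymous = {("Movie A", 5), ("Movie B", 2), ("Movie C", 4), ("Movie D", 1)}
public = {
    "user_x": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4)},
    "user_y": {("Movie A", 5), ("Movie E", 3)},
}
print(best_match(anonymous, public))
```

Even this crude overlap count shows why stripping names from a data set is not the same as making it anonymous.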

The Google paper suggests various solutions:

  • Applications should be explicit about which user activities automatically generate events for their activity stream
  • Users should be given control over which events make it into their activity stream and be able to remove events from the stream after they have been added by an application
  • Users should be explicitly told who the audience is for their activity stream; users should also have control over who the audience is for their activity stream
  • Application developers should build their applications such that the creation of activity stream events is more likely to be in sync with user expectation
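The recommendations above translate fairly directly into code. Here's a small sketch of an activity stream where nothing is published unless the user has opted in to that event type, the audience is explicit, and events can be removed after the fact. The class and method names are hypothetical and not taken from the paper.

```python
class ActivityStream:
    """A stream that only publishes what the user allowed, to whom they allowed."""

    def __init__(self, audience):
        self.audience = set(audience)  # the user always knows who can see the stream
        self.allowed_types = set()     # nothing is published until the user opts in
        self.events = []

    def allow(self, event_type):
        self.allowed_types.add(event_type)

    def publish(self, event_type, detail):
        """Add an event only if the user opted in to this type. Returns True if added."""
        if event_type not in self.allowed_types:
            return False
        self.events.append((event_type, detail))
        return True

    def remove(self, index):
        """Let the user take an event back out after an application added it."""
        del self.events[index]

    def visible_to(self, viewer):
        return list(self.events) if viewer in self.audience else []


stream = ActivityStream(audience=["friend_1", "friend_2"])
stream.allow("shared_link")
stream.publish("shared_link", "an article about privacy")
stream.publish("purchase", "a surprise gift")  # not opted in, so it is dropped
print(stream.visible_to("friend_1"))
print(stream.visible_to("stranger"))
```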

The paper also proposes the building of tools that describe what information is available about you on the Internet; a warning system of sorts that includes an automatic link discovery tool which will quickly show you whether there are any privacy risks involved, so you can be better informed before creating new content.

As reported in New Scientist, the Google paper, (Under)mining privacy in social networks (PDF), will be presented at the Web 2.0 Security and Privacy 2009 workshop in May.

Image credit: Darwin Bell

http://www.readwriteweb.com/archives/google_warns_of_privacy_issues.php http://www.readwriteweb.com/archives/google_warns_of_privacy_issues.php Google Sat, 10 Jan 2009 10:14:26 -0800 Lidija Davis
Sense Networks: 4 Million Sensors to Help You Find a Party in San Francisco Yesterday we discussed MIT's project WikiCity, which monitors location data in cities via mobile sensors and creates visualizations from that. That project comes out of the SENSEable City lab at MIT and in our post we questioned whether there is any practical value in WikiCity currently or if it is simply "info porn". In this post we look at a commercial company that is doing much of the same thing by using data mining and real time analytics and trying to make a business from that. The company is Sense Networks and its stated aim is to index the real world "using real-time and historical location data for predictive analytics across multiple industries." Sense Networks was founded by top computer scientists from MIT and Columbia University.

Sense Networks has a platform, called Macrosense, that "receives streaming location data in real-time, analyzes and processes the data in the context of billions of historical data points, and stores it in a way that can be easily queried to better understand aggregate human activity." The company has so far built one consumer product on top of this platform: Citysense, an iPhone and Blackberry app that allows people in San Francisco to see the most happening nightlife in real time. Citysense currently accesses cell-phone and taxi GPS data from about four million GPS sensors, to see where the local hot spots are. It then links to Yelp and Google to show what venues are operating at popular locations. The product is currently only available in San Francisco, but a New York version is coming soon.
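Finding hot spots from raw GPS points can be approximated by dropping each point into a grid cell and counting. The sketch below is our own simplification and not how Macrosense works; the cell size and the coordinates are made up for illustration.

```python
import math
from collections import Counter


def hotspots(points, cell_size=0.01, top_n=3):
    """points: iterable of (lat, lon). Returns the busiest grid cells with their counts."""
    cells = Counter(
        (math.floor(lat / cell_size), math.floor(lon / cell_size))
        for lat, lon in points
    )
    return cells.most_common(top_n)


# Made-up GPS readings: three close together, one across the bay.
points = [
    (37.7749, -122.4194),
    (37.7751, -122.4192),
    (37.7752, -122.4190),
    (37.8044, -122.2712),
]
busiest = hotspots(points, cell_size=0.01, top_n=1)
print(busiest)
```

Only counts per cell come out the other end, which also hints at how aggregate data can be shared without exposing any one person's trail.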

Citysense isn't the only such app doing this; there are currently a lot of location-based social networking plays. They include "social compass" service Loopt (our review), Nokia-owned Plazes (our review), Pelago's Whrrl, ULocate, and GyPSii. Probably our favorite right now is mobile social network app Brightkite, which at the end of last year we named our Most Promising startup for 2009. All of these apps offer something unique. For example Brightkite relies on actions from its user base to make it useful, whereas Citysense's strength is its 4 million sensors and the aggregate data it derives and analyzes from those.

Sense Networks was the subject of a recent review by MIT's Technology Review publication, which described how the next release of Citysense will show "not only where people are gathering in real time, but where people with similar behavioral patterns - students, tourists, or businesspeople, for instance - are congregating." So we can see that Citysense is slowly evolving into a social networking tool, like Brightkite. In the next release Citysense will categorize people into "tribes" - so far 20 tribes have been identified, including "young and edgy," "business traveler," "weekend mole," and "homebody." In order to do this, Sense Networks not only uses GPS data, but company address data and demographic data about people from the U.S. Census Bureau.

The company's monetization plans are predictably centered around location-based mobile advertising. This involves providing GPS data about city activity to advertisers; however, the company insists that it will be aggregate data only - so user privacy is maintained. The user also can turn off the tracking and delete their existing data. An example of how advertising could work, via Technology Review, is data showing that "a particular demographic heads to bars downtown between 6 and 9 P.M. on weekdays. Advertisers could then tailor ads on a billboard screen to that specific crowd."

We think this is a good example of how sensors and mobile data are being used to provide real value to consumers. Let us know other examples you've come across lately.

http://www.readwriteweb.com/archives/sense_networks_citysense.php http://www.readwriteweb.com/archives/sense_networks_citysense.php Real World Mon, 06 Apr 2009 19:11:29 -0800 Richard MacManus