In my post over the weekend about how you don't need a diploma to code, I was thinking about the multi-volume The Art of Computer Programming, written by former Stanford CS professor Donald Knuth. Knuth turns 74 tomorrow, and it is worth taking a look at the various things you can find online as a way of celebrating his rich and varied life.
A few months ago we linked to Tomaž Kovačič's overview of text extraction algorithms. Now Kovačič has posted an evaluation of several text extraction algorithms and services, including Boilerpipe, NCleaner, the Python and Node.js versions of Readability and the Extractiv API.
To conduct his evaluations, Kovačič used the cleaneval dataset, which includes 681 documents, and a Google News dataset with 621 documents harvested by the authors of Boilerpipe.
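For readers who want a feel for how such an evaluation can be scored, here is a minimal sketch that compares an extractor's output with a hand-cleaned reference text using word-level precision, recall and F1. This is a common way to score text extraction, not necessarily the metric Kovačič used, and the function names and sample strings are made up for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def score_extraction(extracted, gold):
    """Return (precision, recall, f1) for extracted text against a reference.

    Words are compared as bags (multisets), so word order is ignored.
    This is an illustrative metric, not the one from the evaluation above.
    """
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(gold))
    # Multiset intersection: words that appear in both texts.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the extractor kept two navigation words
# ("menu", "home") and missed one word of the article ("jumps").
precision, recall, f1 = score_extraction(
    "the quick brown fox menu home",
    "the quick brown fox jumps",
)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Low precision means leftover boilerplate such as menus and footers. Low recall means the extractor cut away part of the article itself.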
Rapid-I announced this week that it will offer a marketplace for extensions to its open source data mining tool RapidMiner. "Over the years, many of you have been developing new RapidMiner Extensions dedicated to a broad set of topics," the company's announcement says. "Whereas these extensions are easy to install in RapidMiner - just download and place them in the plugins folder - the hard part is to find them in the vastness that is the Internet." You can visit the beta version of the extension marketplace here.
It doesn't appear that there's a mechanism for offering paid extensions, yet. But Decision Stats blogger Ajay Ohri hopes to see this turn into an app store for algorithms.
Growing demand for text mining tools, for services like Instapaper and Readability, and for Web scraping has increased the importance of extracting article text from HTML pages.
Computer science student Tomaž Kovačič wrote an overview of text extraction algorithms. He also compiled a big list of resources for hackers working with text extraction, including research papers and articles, software and Web APIs.
Here's another contest, but this one has a larger prize and greater purpose: The Heritage Provider Network is offering $3 million for an algorithm that will predict hospitalizations by using a pre-defined set of patient data sources. The Heritage Health Prize site claims that over $30 billion was spent on unnecessary hospital admissions in 2006. If patients that are at risk for hospitalization can be identified in advance, health care providers could develop new care plans to prevent hospitalization. The contest will be open for registration early next year.
Researchers at the Palo Alto Research Center (PARC) are developing a new Twitter client application that aims to derive meaning from the never-ending influx of tweets. The application, called "Eddi," automatically groups tweets for you into topics mentioned either explicitly or, unlike most Twitter clients that also provide topic browsing, implicitly. The end result is a Twitter app you can use to quickly find the popular discussions within your own personal Twitter stream, either by search, tag cloud, timeline or category list. It even suggests tweets you might be interested in reading, helping you sort the signal from the noise.
The Website Taste Predictor is a new Twitter tool that analyzes your Twitter account in order to recommend websites you would like. The project uses Twitter's OAuth authentication protocol to access your Twitter account so you don't have to enter in your username and password in order to try it out. How exactly it works, we can't say. There's no "about" page, "FAQ" or other explanation. In fact, there's not even a credit as to who made it, only a URL. But the URL is a big hint: it's hosted on the MIT.edu domain underneath the subheading ~peretti. And just who is ~peretti? Only the co-founder of the Huffington Post and the viral tracker BuzzFeed, Jonah Peretti.
According to a post on Google's Webmaster Central blog, Google is now discovering web sites by automatically scanning RSS and Atom feeds. This new process will help Google more quickly identify web pages and will allow users to find new content in search results as soon as it goes live. While not exactly "real-time," using feeds to identify updates to websites is an arguably faster method than the traditional crawling techniques Google has used in the past. And Google may get even faster in the near future - the post also notes that the company may soon explore using mechanisms like the real-time protocol PubSubHubbub to identify updated items going forward.
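To make the idea concrete, here is a minimal sketch of how a crawler might pull page URLs out of an RSS or Atom feed rather than following links page by page. It illustrates the general technique only and says nothing about how Google implements it. The function name and the sample feed are invented for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical RSS 2.0 feed used only for this example.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>http://example.com/first</link></item>
    <item><title>Second post</title><link>http://example.com/second</link></item>
  </channel>
</rss>"""


def links_from_feed(xml_text):
    """Return the page URLs listed in an RSS 2.0 or Atom feed document."""
    root = ET.fromstring(xml_text)
    urls = []
    # RSS 2.0 stores the URL as the text of <item><link>.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            urls.append(link.strip())
    # Atom stores the URL in the href attribute of <entry><link>.
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            # A link with no rel attribute is treated as "alternate".
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                urls.append(link.get("href"))
    return urls


print(links_from_feed(SAMPLE_RSS))
```

Because a feed lists new items in one small document, checking it is far cheaper than re-crawling a whole site to spot what changed.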
In an effort to help advertisers reach key consumers, Peerset is launching what it describes as a "psycho-graphic targeting tool." Not unlike dating algorithms, Peerset's targeting algorithm takes keywords and metadata from online profiles and matches them with relevant information. With dating sites, users receive recommendations on potential mates; with Peerset, users receive advertisements and deals on relevant products and services. Video life-streaming network Justin.tv is just one of the groups already reaping the benefits of this system.
ContextSense is a newly launched sentiment extraction technology from Wingify, a company focused on website optimization solutions. As a part of their core product which helps website owners identify visitor demographics and behavior, target ads, and optimize landing pages, ContextSense demonstrates how Wingify's contextual targeting technology works. To use the tool, you simply enter in a URL or a piece of text, and it will then reveal the overall sentiment of the website (positive or negative), relevant tags, concepts, categories, and contextually similar links. The end result is a quick glimpse into what a site is all about.