ReadWriteHack

Overview of Text Extraction Algorithms

Demand for text mining tools, for services like Instapaper and Readability, and for Web scraping has increased the importance of extracting article text from HTML pages.

Computer science student Tomaž Kovačič wrote an overview of text extraction algorithms. He also compiled a big list of resources for hackers working with text extraction, including research papers and articles, software, and Web APIs.

Some of the techniques Kovačič covers include:

See also: our coverage of Extractiv, a text extraction and analysis service.

Image by Andrew Mason

