The Easy Way to Extract Useful Text from Arbitrary HTML - AI Depot

ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/, posted 2011 by peter in ai development nlp python scraping

This article shows you how to write a relatively simple script that extracts text paragraphs from large chunks of HTML code without knowing the document's structure or the tags used. It works on news articles and blog pages with worthwhile text content, among others…
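
As a rough illustration of the idea described above, the following sketch keeps only blocks of a page where visible text outweighs markup. It is not the article's own script: the text-to-markup density heuristic, the names `DensityExtractor` and `extract_text_blocks`, and the `min_density` and `min_length` thresholds are all assumptions chosen for this example, and it uses only the Python standard library.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption; extend as needed).
BLOCK_TAGS = {
    "p", "div", "li", "ul", "ol", "td", "th", "tr", "table", "br",
    "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre",
    "article", "section",
}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class DensityExtractor(HTMLParser):
    """Collects text blocks whose text-to-markup ratio is high."""

    def __init__(self, min_density=0.5, min_length=40):
        super().__init__(convert_charrefs=True)
        self.min_density = min_density
        self.min_length = min_length
        self.blocks = []
        self._text = []
        self._markup_chars = 0
        self._skip_depth = 0

    def _flush(self):
        # Normalize whitespace, then decide whether to keep the block.
        text = " ".join("".join(self._text).split())
        total = len(text) + self._markup_chars
        if len(text) >= self.min_length and len(text) / total >= self.min_density:
            self.blocks.append(text)
        self._text = []
        self._markup_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        else:
            # Inline markup (links, spans, ...) counts against density.
            self._markup_chars += len(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()
        else:
            self._markup_chars += len(tag) + 3  # length of "</tag>"

    def handle_data(self, data):
        if not self._skip_depth:
            self._text.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing block


def extract_text_blocks(html, min_density=0.5, min_length=40):
    """Return the text of blocks that look like real content."""
    parser = DensityExtractor(min_density, min_length)
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `extract_text_blocks(page_source)` on a typical article page should drop navigation lists and link-heavy blocks and return the longer paragraphs; the thresholds will likely need tuning for each kind of site.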

pyquery: a jquery-like library for python - pyquery v0.6.1 documentation

packages.python.org/pyquery/, posted 2010 by peter in development free html python scraping software xml

pyquery allows you to make jquery queries on xml documents. The API is as similar as possible to jquery. pyquery uses lxml for fast xml and html manipulation.

About [Selenium]

seleniumhq.org/projects/remote-control/, posted 2008 by peter in development free html java javascript perl python scraping software testing

Selenium Remote Control (RC) is a test tool that allows you to write automated web application UI tests in any programming language against any HTTP website using any mainstream JavaScript-enabled browser.

JWebUnit - JWebUnit

jwebunit.sourceforge.net/, posted 2008 by peter in development free html java javascript scraping software testing

JWebUnit is a Java-based testing framework for web applications. It wraps existing testing frameworks such as HtmlUnit and Selenium with a unified, simple testing interface to allow you to quickly test the correctness of your web applications.

HtmlUnit - Welcome to HtmlUnit

htmlunit.sourceforge.net/, posted 2008 by peter in development free html java javascript scraping software testing

HtmlUnit is a "browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, and so on, just as you would in your "normal" browser. It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use. It is typically used for testing purposes or to retrieve information from web sites.

Crowbar - SIMILE [Scraping with XULRunner]

simile.mit.edu/wiki/Crowbar, posted 2008 by peter in development download free linux mashup mozilla scraping software windows

Crowbar is a web scraping environment built on a server-side, headless, Mozilla-based browser. Its purpose is to let you run JavaScript scrapers against a DOM to automate website scraping while avoiding all the syntax normalization issues.
