Bookmark
The Easy Way to Extract Useful Text from Arbitrary HTML - AI Depot
ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/, posted 2011 by peter in ai development nlp python scraping
This article shows you how to write a relatively simple script to extract text paragraphs from large chunks of HTML code, without knowing its structure or the tags used. It works on news articles and blog pages with worthwhile text content, among others…
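The summary above does not say which technique the article uses, so the following is only a minimal sketch of one common approach to this problem: a text-to-markup density filter. It assumes that content paragraphs sit on lines that are mostly text, while navigation and layout lines are mostly tags. The function name `extract_text` and the thresholds `min_density` and `min_chars` are hypothetical choices for illustration, not taken from the article.

```python
import re
from html import unescape

# Matches any single tag such as <p>, </div> or <a href="...">.
TAG_RE = re.compile(r"<[^>]+>")
# Matches whole <script> and <style> elements, including their contents.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def extract_text(html, min_density=0.5, min_chars=40):
    """Return lines of HTML that look like readable text paragraphs.

    A line is kept when the text left after stripping tags makes up at
    least `min_density` of the raw line and is at least `min_chars` long.
    Both thresholds are illustrative guesses and will need tuning.
    """
    # Script and style bodies are not tags, so remove them up front.
    html = SCRIPT_STYLE_RE.sub("", html)
    paragraphs = []
    for line in html.splitlines():
        line = line.strip()
        if not line:
            continue
        # Strip the tags, then decode entities such as &amp;.
        text = unescape(TAG_RE.sub("", line)).strip()
        density = len(text) / len(line)
        if density >= min_density and len(text) >= min_chars:
            paragraphs.append(text)
    return paragraphs
```

Calling `extract_text(page_source)` on a page whose markup keeps each paragraph on its own line returns the paragraph strings in document order. Pages that are minified onto a single line would need to be split on block-level tags first, which this sketch does not attempt.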
pyquery: a jquery-like library for python — pyquery v0.6.1 documentation
packages.python.org/pyquery/, posted 2010 by peter in development free html python scraping software xml
pyquery allows you to make jQuery queries on XML documents. The API is as similar to jQuery as possible. pyquery uses lxml for fast XML and HTML manipulation.
About [Selenium]
seleniumhq.org/projects/remote-control/, posted 2008 by peter in development free html java javascript perl python scraping software testing
Selenium Remote Control (RC) is a test tool that allows you to write automated web application UI tests in any programming language against any HTTP website using any mainstream JavaScript-enabled browser.
JWebUnit - JWebUnit
jwebunit.sourceforge.net/, posted 2008 by peter in development free html java javascript scraping software testing
JWebUnit is a Java-based testing framework for web applications. It wraps existing testing frameworks such as HtmlUnit and Selenium with a unified, simple testing interface to allow you to quickly test the correctness of your web applications.
HtmlUnit - Welcome to HtmlUnit
htmlunit.sourceforge.net/, posted 2008 by peter in development free html java javascript scraping software testing
HtmlUnit is a "browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser. It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use. It is typically used for testing purposes or to retrieve information from web sites.
Crowbar - SIMILE [Scraping with XULRunner]
simile.mit.edu/wiki/Crowbar, posted 2008 by peter in development download free linux mashup mozilla scraping software windows
Crowbar is a web scraping environment based on a server-side headless Mozilla-based browser. Its purpose is to allow running JavaScript scrapers against a DOM to automate website scraping while avoiding all the syntax normalization issues.