Python and Screen Scraping

I wanted to do a quick test on a website using Python. I knew about BeautifulSoup, but I wanted the power of jQuery, so I found pyquery.

I found some instructions to get started and noticed some people complaining about how difficult it is to install. Hmm, I wonder why?

It only depends on lxml, which I want to install with easy_install, which in turn needs setuptools. Oh, that’s why people complain. Oh well, let’s try.

  • Download setuptools-0.6c11-py2.6.egg, because my version of Python is 2.6
  • Run the setuptools egg as if it were a shell script. Apparently this installs easy_install
  • Now I can install lxml with easy_install: $ sudo easy_install lxml

This failed for me, so I’m going to make sure I have libxml2 and libxslt. First I install libxml2-dev: $ sudo apt-get install libxml2-dev
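
A quick way to see whether the install actually worked, and which libxml2 and libxslt lxml is running against, is to ask lxml itself. This is only a minimal sketch and not part of the steps I followed; the function name `lxml_status` is made up for illustration, while `LXML_VERSION`, `LIBXML_VERSION` and `LIBXSLT_VERSION` are attributes exposed by `lxml.etree`.

```python
def lxml_status():
    """Return a dict saying whether lxml imports and which libraries it uses."""
    try:
        from lxml import etree
    except ImportError as exc:
        # lxml is not installed, or its compiled extension failed to load
        return {"installed": False, "error": str(exc)}
    return {
        "installed": True,
        "lxml": etree.LXML_VERSION,        # version tuple of lxml itself
        "libxml2": etree.LIBXML_VERSION,   # libxml2 version in use
        "libxslt": etree.LIBXSLT_VERSION,  # libxslt version in use
    }


if __name__ == "__main__":
    for key, value in sorted(lxml_status().items()):
        print("%s: %s" % (key, value))
```

If `installed` comes back false, the error text is usually the first hint about which library is missing.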

Then I need to build libxml2 from source

wget ftp://xmlsoft.org/libxml2/libxml2-sources-2.7.6.tar.gz
tar -xvzf libxml2-sources-2.7.6.tar.gz
cd libxml2-2.7.6/
./configure --prefix=/usr/local/libxml2
sudo make install

Next, build libxslt

wget ftp://xmlsoft.org/libxslt/libxslt-1.1.26.tar.gz
tar -xvzf libxslt-1.1.26.tar.gz
cd libxslt-1.1.26
./configure --prefix=/usr/local/libxslt --with-libxml-prefix=/usr/local/libxml2/
sudo make install

Still errors… I’m stopping
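
One possible lead, not verified here: lxml's build looks for the `xml2-config` and `xslt-config` scripts, and with the custom `--prefix` values above they would land in `/usr/local/libxml2/bin` and `/usr/local/libxslt/bin`, which are normally not on `PATH`. The sketch below only checks where those scripts can be found; `find_on_path` is a helper invented for illustration, and the two directories are assumptions derived from the configure lines above.

```python
import os


def find_on_path(name, search_path=None):
    """Return the full path of executable `name`, or None if it is not found."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        # Skip empty PATH entries; require a regular, executable file
        if directory and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Directories assumed from the --prefix values used in the builds above
    custom = os.pathsep.join(["/usr/local/libxml2/bin", "/usr/local/libxslt/bin"])
    for tool in ("xml2-config", "xslt-config"):
        print("%s on PATH: %s" % (tool, find_on_path(tool)))
        print("%s under custom prefixes: %s" % (tool, find_on_path(tool, custom)))
```

If the scripts only show up under the custom prefixes, putting those `bin` directories on `PATH` before running easy_install again is one thing to try.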