Wednesday, December 14, 2011

Using the right tool for the job

From time to time I see people asking questions about how to use an automation tool to do something the tool was never meant to do. For example, how do I use Selenium to get the web page for a site without loading the javascript or CSS?

Selenium is designed to simulate a user browsing a website. When I open a web page with a browser, the website sends me javascript and CSS files. The browser just naturally processes those. If I don't want that, I shouldn't use a browser. If I am not using a browser, why would I use Selenium to send the HTTP request?

That is all the get() method in Selenium does. It opens a connection to the website and sends an HTTP request using the web browser. The website sends back an HTTP response and the browser processes it.

If all I want to do is send the request and get the response back, unprocessed, I don't need a web browser.

So how can I send an HTTP request and get the HTTP response back? There are a number of tools to do this.

Fiddler2: http://www.fiddler2.com/fiddler2/

The way Fiddler works is that it sits between your web browser and the website as a proxy (Fiddler actually configures the browser's proxy settings for you). Now when you use the web browser, if Fiddler is running, the browser sends each HTTP request to Fiddler; Fiddler records the request and passes it on to the intended website. The website sends the response back to Fiddler and Fiddler passes it back to the web browser.

You can save the request/response pair and play them back. Before you play the request back you can edit it: you can change the website address, you can change the context root of the URL and, if there is POST data, you can edit the data as well.
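Anything that can send its HTTP traffic through a proxy can be pointed at Fiddler, not just a web browser. As a rough example, assuming Fiddler is running and listening on its default port of 8888, a command line client such as curl (covered below) can be told to route a request through Fiddler so the request gets recorded like any other:

curl -x http://127.0.0.1:8888 -o output.txt http://your.website.com/some/context/root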

Charles: http://www.charlesproxy.com/

Charles is much like Fiddler2 but there are two main differences. The first is that Charles is not free. You can get an evaluation copy of Charles but ultimately you need to pay for it. So why would you use Charles? With purchase comes support. If something is not working (SSL decryption, for example) you can get help with it. The second difference is that Fiddler is only available on Windows; Charles works on Mac OS X and Linux as well.

curl: http://curl.haxx.se/

Fiddler and Charles are GUI applications with menus and dialogs. They are intended for interacting with humans. If you are more of a script writer or want something you can add to an automated test, you want something you can run from the command line. That would be curl. Because it is lightweight and command line driven, I can run curl commands over and over again. I can even use it for crude load testing.
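To give a rough idea of what crude load testing could look like, here is a sketch using a UNIX-style shell loop and a placeholder URL; the -s option silences the progress output, -o /dev/null throws the page away and -w prints just the HTTP status code and total time for each request:

for i in $(seq 1 100); do curl -s -o /dev/null -w "%{http_code} %{time_total}\n" http://your.website.com/some/context/root; done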

The most common use for curl is checking the contents of a web page or checking that a website is up and running. There are many command line options (-d to pass POST data, -k to ignore certificate errors, etc.) but the general use is curl -o output.txt http://your.website.com/some/context/root. This will send the HTTP request for /some/context/root to the website your.website.com. A more realistic example would be:

curl -o output.txt http://www.google.ca/search?q=curl

I could then use another command line tool to parse the output.txt file. Or I could pipe the output directly to another program.
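For example, rather than saving the search results to output.txt, I could pipe the page straight into grep to confirm it contains the text I expect, something like:

curl -s http://www.google.ca/search?q=curl | grep -i curl

Since grep exits with 0 when it finds a match, a one-liner like this can double as a pass/fail check in a shell script.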


Another nice command line tool is wget. The wget command, like curl, will let you send an HTTP request. The nice thing about wget is that you can use it to crawl an entire website. One of my favourite wget commands is:

wget -t 1 -nc -S --ignore-case -x -r -l 999 -k -p http://your.website.com

The -t option sets the number of tries. I always figure if they don't send it to me on the first try they probably won't send it to me ever.

The -nc option is for 'no clobber'. If two files are sent with the same name, it will write the first file using the full name and the second file with a .1 on the end. You might wonder, how could the same directory have the same file twice? The answer is UNIX versus Windows. On a UNIX system there might be index.html and INDEX.html. To UNIX these are different files, but downloading them to Windows I need to treat them as the same file.

The -S option prints the server response headers to stderr. They don't get saved to the files but they let me see that things are still going and something is being sent back.

The --ignore-case option is there because Windows ignores case, so we should as well.

The -x option forces the creation of directories. This will create a directory structure similar to the original website. This is important because two different directories on the server might have the same file name and we want to preserve that.

The -r option is for recursive: keep going down into subdirectories. The -l option sets the number of levels to recurse; if you don't specify it, the default is 5.

The -k option is for converting links. If there are links in the pages being downloaded, they get converted. Relative links like src="../../index.html" will be fine, but if they hard coded something like src="http://your.website.com/foo.html" we want to convert this to a file:// link rather than go back to the original website.

Finally, the -p option says to get entire pages. If the HTML page we retrieve needs other things like CSS files, javascript, images, etc. then the -p option will retrieve them as well.
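If you only want a single page rather than a crawl of the whole site, wget can be used much like the earlier curl example; the -O option (capital O) names the output file:

wget -O output.txt http://your.website.com/some/context/root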

These are just some of the tools I use when Selenium is not the right tool for the job.
