Crawling the web with PHP

The internet is full of web spiders (also known as robots). The biggest and most powerful web spider is probably Googlebot, Google's web spider, responsible for crawling the web, looking for new web pages to index, and checking whether pages already in its index have been updated. This post shows how we can use PHP to build our own web spiders.

PHP comes with a library called the 'Client URL Library' (cURL), which is capable of fetching web pages and returning the results. The various functions and configuration options for using the native cURL library with PHP can be found in the PHP manual. For this example we will use a convenient wrapper for PHP's cURL library. It can be downloaded here (originally posted at php.net).
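For comparison, fetching a page with the native cURL functions directly (no wrapper) looks roughly like this; the URL is just a placeholder:

$handle = curl_init("http://www.example.com/"); // create a cURL handle for the target URL
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // return the response as a string rather than printing it
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // follow any HTTP redirects
$html = curl_exec($handle); // perform the request
curl_close($handle); // free the handle

The wrapper used below essentially packages calls like these into a small object with convenient methods.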

In the following example we will use our web crawler to access the Google home page, search for a query, and output the URLs of the first 10 search results. A tool such as this might be useful for someone working in search engine optimisation. Below I will show the code required and then explain each step in detail.


First we import the cURL wrapper so that we can use its classes in our script. Please note that the file curl.php needs to be in the same directory as our script for this to work. Change "curl.php" to "/path/to/curl.php" if necessary.

require_once("curl.php");

Next we assign our query to a variable. We can change "web spiders" to a different query if we wish to search for something else.

$query = "web spiders";

Now we create an instance of the Curl class and call its "get" method in order to download our search results from Google. "http://www.google.com/search" (our first argument) is the location of Google's search page. "$vars = array("q"=>$query)" (our second argument) tells our object to append a parameter to our request. Because we are making a GET request, the query is appended to the URL as a query string, so the final URL will look like this: http://www.google.com/search?q=web+spiders

$curl = new Curl();
$response = $curl->get( "http://www.google.com/search",$vars = array("q"=>$query) );
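For reference, PHP's http_build_query function produces the same query string that gets appended for us:

$queryString = http_build_query(array("q" => $query)); // "q=web+spiders"
echo "http://www.google.com/search?" . $queryString; // http://www.google.com/search?q=web+spiders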

We have now fetched our web page, and the raw HTML output of our request can be found in the body property of our $response object ("$response->body"). In order to extract our top ten URLs from the HTML output we will create a DOMDocument object. This will allow us to easily traverse the Document Object Model (DOM) of our page.

$domDocument = new DOMDocument();
$domDocument->loadHTML($response->body);
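One practical note: real-world HTML is rarely perfectly valid, and loadHTML will emit warnings for any markup it cannot parse cleanly. If those warnings clutter the output, the libxml error functions can buffer them:

libxml_use_internal_errors(true); // buffer parser warnings instead of printing them
$domDocument = new DOMDocument();
$domDocument->loadHTML($response->body);
libxml_clear_errors(); // discard the collected warnings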

Next we use PHP's DOMDocument method getElementsByTagName to collect every HTML anchor tag (<a></a>) in the document. Anchor tags are used for links in HTML.

$documentLinks = $domDocument->getElementsByTagName("a");

Our DOMDocument contains more links than we need. By gathering every link in the page we have also picked up links to other Google pages, such as Images, Maps, etc. We need to filter these out so that we only output links relating to our search query. Fortunately these links are easy to identify: all search results in the search engine results page have a class attribute of "l". The following loop iterates through all of the links in our page, checks whether each one has a class attribute of "l", and if so outputs its URL.

for($i=0;$i<$documentLinks->length;$i++) 
{
	$documentLink = $documentLinks->item($i);
	
	if($documentLink->getAttribute("class") == "l")
	{
		echo $documentLink->getAttribute("href") . "\n";
	}
}
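As an aside, the same filtering can be written with DOMXPath, which selects only the anchors we care about in a single query. Keep in mind that the "l" class is simply what Google's result markup uses at the time of writing and may change:

$xpath = new DOMXPath($domDocument); // query the parsed document with XPath
$resultLinks = $xpath->query('//a[@class="l"]'); // anchors whose class attribute is exactly "l"
foreach($resultLinks as $resultLink)
{
	echo $resultLink->getAttribute("href") . "\n";
}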

This example is intentionally simple; however, this technique can be used to gather all sorts of information from different pages across the web. cURL also allows us to make HTTP requests using POST, so we can use it to automatically fill in forms on the internet, so long as the forms do not use "Completely Automated Public Turing test to tell Computers and Humans Apart" (CAPTCHA) images for verification.
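As a rough sketch, and assuming the wrapper exposes a post method that mirrors get (the class linked above does), submitting a form might look something like this; the URL and field names are placeholders:

$curl = new Curl();
$response = $curl->post( "http://www.example.com/login", $vars = array("username"=>"bob", "password"=>"secret") );
echo $response->body; // the page returned after the form was submitted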

