Thursday 30 May 2013

PHP Web Page Scraping Tutorial

Web scraping, also known as web harvesting or web data extraction, is the process of extracting data from a given website or web page. In this tutorial I will go over a way for you to extract the title of a page, as well as its meta keywords, meta description, and links. With some basic knowledge of PHP and regular expressions you can accomplish this with ease.

First, let's go over the regular expression metacharacters we will be using in this tutorial.
(.*)

The dot (.) stands for any single character, while the asterisk (*) stands for zero or more repetitions of whatever precedes it. Combined as (.*), the pattern tells the system you are looking for any sequence of characters with a length of zero or more.

As for our PHP, we will be using three functions to extract our data. The first is file_get_contents(), which fetches the desired page and returns all of its HTML content as a string. The second is preg_match(), which returns a single result for a given regular expression. The final function is preg_match_all(), which works the same way as preg_match() except that it returns every match rather than just the first.
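To make the difference between the two matching functions concrete, here is a small side example (not one of the tutorial files):

<?php
// Side example: preg_match() returns only the first match,
// while preg_match_all() collects every match it finds.
$html = "<b>one</b>\n<b>two</b>";

preg_match('/<b>(.*)<\/b>/i', $html, $single);
echo $single[1];    // prints "one"

preg_match_all('/<b>(.*)<\/b>/i', $html, $all);
print_r($all[1]);   // prints Array ( [0] => one [1] => two )
?>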

For this tutorial I have included one HTML page that contains a title tag, a meta description, meta keywords, and some links. We will be using that file for our scraping purposes.
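The attached file itself is not reproduced in this post, but a minimal page with that structure (roughly what the regular expressions below expect; the titles and URLs here are only illustrative) might look like this:

<html>
<head>
<title>My Sample Page</title>
<meta name="keywords" content="php, scraping, tutorial" />
<meta name="description" content="A sample page to scrape." />
</head>
<body>
<ul>
<li><a href="http://example.com/one">Link One</a></li>
<li><a href="http://example.com/two">Link Two</a></li>
</ul>
</body>
</html>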

Let's start by setting up the variable that will contain the string of HTML from the external file.

<?php
$file_string = file_get_contents('page_to_scrape.html');
?>

What we did above is simply read all of the contents of our file page_to_scrape.html and store them in a string. Now that we have our string, we can proceed to the extraction itself.

* Hint: You can replace page_to_scrape.html with any page or URL you want to scrape. Some sites may have terms against scraping, so be sure to read the terms of use before you decide to scrape a site.
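For example, pointing the same call at a live URL and checking that the request actually succeeded might look like the sketch below (the URL is a placeholder; fetching remote URLs this way also requires allow_url_fopen to be enabled in php.ini):

<?php
// Placeholder URL; file_get_contents() returns false if the request fails.
$file_string = file_get_contents('http://www.example.com/');
if ($file_string === false) {
    die('Could not retrieve the page.');
}
?>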

Let's start by extracting the text within our <title></title> tags. To accomplish this we use the preg_match() function. Given three parameters, preg_match() fills an array with our result. The first parameter is our regular expression, the second is the variable containing the HTML content, and the third is the output array that will receive our results.

<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>

Let me explain what I did in the above code. We want the text from within the title tags <title></title>, so we place (.*) between the title tags to capture whatever characters appear there. When using a regular expression with preg_match() we need to enclose the pattern in a pair of delimiters, most commonly two forward slashes (other delimiters, such as braces, would also work; for this example we will stick with forward slashes). I append a lowercase i after the closing delimiter to make the search case-insensitive. We also need to escape the forward slash in the closing title tag so that it is not mistaken for the end of the pattern. For the second parameter I pass our variable $file_string, which we defined earlier to hold the HTML content. The third parameter receives the array of results: index 0 holds the full match and index 1 holds the text captured by (.*). Finally, I assign the element we want to the variable $title_out for later use.

Next we need to get the meta description and meta keywords. We do the same as above, changing only the pattern and the output names, as follows.

preg_match('/<meta name="keywords" content="(.*)" \/>/i', $file_string, $keywords);
$keywords_out = $keywords[1];
preg_match('/<meta name="description" content="(.*)" \/>/i', $file_string, $description);
$description_out = $description[1];

Finally we need to retrieve the list of links on the page. In my sample HTML document the links are enclosed within <li></li> tags, so I will use those in conjunction with the <a></a> tags to extract the data. For this we need the preg_match_all() function, so that we can get back more than one result. We pass it parameters just as we did with preg_match().

preg_match_all('/<li><a href="(.*)">(.*)<\/a><\/li>/i', $file_string, $links);

With the above code we now have an array of all the results assigned to $links. Notice that I used the metacharacters (.*) more than once this time. Since the href value and the link text differ from link to link, we need to tell the script that any set of characters may appear in those places. The $links array will contain both the data inside href="" and the text between the <a></a> tags.
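Assuming a sample page with two links like the illustrative one above, the array would come back structured roughly like this (index 0 holds the full matches, index 1 the href values, and index 2 the link text):

<?php
print_r($links);
/* Roughly:
Array
(
    [0] => Array ( [0] => <li><a href="http://example.com/one">Link One</a></li>
                   [1] => <li><a href="http://example.com/two">Link Two</a></li> )
    [1] => Array ( [0] => http://example.com/one [1] => http://example.com/two )
    [2] => Array ( [0] => Link One [1] => Link Two )
)
*/
?>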

Now that we have all of the data we want to collect, we can simply print it out as follows:

<p><strong>Title:</strong> <?php echo $title_out; ?></p>
<p><strong>Keywords:</strong> <?php echo $keywords_out; ?></p>
<p><strong>Description:</strong> <?php echo $description_out; ?></p>
<p><strong>Links:</strong> <em>(Name - Link)</em></p>
<?php
echo '<ol>';
for ($i = 0; $i < count($links[1]); $i++) {
    echo '<li>' . $links[2][$i] . ' - ' . $links[1][$i] . '</li>';
}
echo '</ol>';
?>

Attached are the files used in this tutorial. Let me know if you have any questions below.


Source: http://www.devblog.co/php-web-page-scraping-tutorial/

Monday 27 May 2013

Web Page Scraping using Java

In this post, we are going to learn about web scraping fundamentals and the implementation of a web scraper using a Java API.

Agenda of this post

    What is Web Scraping
    Web Scraping technique
    Useful API for web scraping
    Sample code using java API



Web scraping (also called web harvesting or web data extraction) is a technique for extracting information from websites.
It describes any of various means of extracting content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context.
Using a web scraper, you can extract the useful content from a web page and convert it into whatever format you need.

Web Scraping technique:
These are a few suggested steps for web scraping (a rough code sketch follows the list):

    Connect : Connect with the remote site over HTTP or FTP.
    Extract : Extract information from the website
    Process : Filter the useful data from the source and format it in a useful way
    Save : Save the data in the desired format.
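As a rough illustration of these four steps (sketched in PHP for continuity with the tutorial earlier on this page; the original post itself uses Web-Harvest in Java, and all names and URLs below are placeholders):

<?php
// Connect: fetch the remote page over HTTP.
$html = file_get_contents('http://www.example.com/');

// Extract: pull every link out of the page.
preg_match_all('/<a href="(.*?)">(.*?)<\/a>/i', $html, $links);

// Process: reshape the matches into simple name,url rows.
$rows = array();
for ($i = 0; $i < count($links[1]); $i++) {
    $rows[] = $links[2][$i] . ',' . $links[1][$i];
}

// Save: write the result out as a CSV-style file.
file_put_contents('links.csv', implode("\n", $rows));
?>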

There are various web scraping software packages and APIs available. I am going to use Web-Harvest for my web scraping example.

Web-Harvest
Web-Harvest is an open-source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them.

Source: http://half-wit4u.blogspot.in/2011/01/web-scraping-using-java-api.html

Friday 24 May 2013

Unraveling data scraping: Understanding how to scrape data can facilitate journalists' work

Ever heard of "data scraping?" The term may seem new, but programmers have been using this technique for quite a while, and now it is attracting the attention of journalists who need to access and organize data for investigative reporting.

Scraping is a way of retrieving data from websites and placing it in a simple and flexible format so it can be cross-analyzed more easily. Many times the information necessary to support a story is available, but it is found on websites that are hard to navigate or in a database that is hard to use. To automatically collect and display this information, reporters need to turn to computer programs known as "scrapers."

Even though it may seem like a "geek" thing, journalists don't need to take advanced courses in programming or know a complicated language in order to scrape data. According to hacker Pedro Markun, who worked on several data scraping projects for the House of Digital Culture in Sao Paulo, the level of knowledge necessary to use this technique is "very basic."

“Scrapers are programs that are easy to handle. The big challenge and constant exercise is to find a pattern in the web pages' data - some pages are very simple, others are a never-ending headache,” said Markun in an interview with the Knight Center for Journalism in the Americas.

Markun has a public profile on the website Scraperwiki, which allows you to create your own data scraper online or to access those written by others.

Like Scraperwiki, other online tools exist to facilitate data scraping, such as Mozenda, software with a simple interface that automates most of the work, and Screen Scraper, a more complex tool that works with several programming languages to extract data from the Web. Another similarly useful tool is Firebug for Firefox.

Likewise, Google offers the program Google Refine for manipulating confusing data and converting it into more manageable formats.

Journalists can also download Ruby, a simple and efficient programming language, for free and use it with Nokogiri to scrape documents and websites.

Data is not always available in open formats or easy to scrape. Scanned documents, for example, need to be converted into machine-readable text. To do this, there is Tesseract, an OCR (Optical Character Recognition) tool from Google that "reads" scanned texts and converts them into digital text.

Information and guidelines about the use of these tools are available on websites such as Propublica, which offers several articles and tutorials on scraping tools for journalism. YouTube videos also can prove a helpful source.

Even if you have adopted the hacker philosophy, and reading tutorials or working hands-on tends to be your way of learning, you may encounter some doubts or difficulties when using these tools. If this is the case, a good option is to get in contact with more experienced programmers via discussion groups such as Thackday and the Scraperwiki Community, which offer both free and paid options for finding someone to help with a scraping job.

While navigating databases might be old school for some journalists, better understanding how to retrieve and organize data has gained in importance as we've entered an age of information overload, which makes taking advantage of such data-scraping tips all the more worthwhile.

Source: https://knightcenter.utexas.edu/blog/00-9676-unraveling-data-scraping-understanding-how-scrape-data-can-facilitate-journalists-work

Friday 17 May 2013

Data Scraping Wikipedia


Tony Hirst of the Ouseful.Info blog has written an excellent article explaining how you can use the importHTML function in Google Spreadsheets to retrieve data from any table on any website.

Tony uses the function to scrape data from Wikipedia tables and then uses Yahoo Pipes to geocode the data and create a Google Map mash-up (here is a Google Map showing UK city populations as scraped from Wikipedia).

I've been playing with the importHTML function for a few days now (since reading Tony's article) and, instead of Yahoo Pipes, I've been using Batchgeocode to retrieve the latitudes and longitudes and then the Google Spreadsheet Map Wizard to create a map from the data.

The Google Maps API Tricks blog also has a post on how you can use Google Maps' own geocoder. The Google geocoder can export CSV data, which can then be directly imported into a Google Spreadsheet.

One of the awesome things about the importHTML function is that it is dynamic and automatically refreshes. This means you could use Tony's tutorial with weather or other data presented in table format on the web and create dynamic Google Maps that automatically refresh when the data is updated.

Source: http://www.mapsmaniac.com/2008/10/data-scaping-wikipedia.html

Monday 6 May 2013

The Simple Way to Scrape an HTML Table: Google Docs

Raw data is the best data, but a lot of public data can still only be found in tables rather than as directly machine-readable files. One example is the FDIC’s List of Failed Banks. Here is a simple trick to scrape such data from a website: Use Google Docs.

The table on that page is even relatively nice because it includes some JavaScript to sort it. But a large table with close to 200 entries is still not exactly the best way to analyze that data.

I first tried dabbledb for this task, and it worked in principle. The only problem was that it only extracted 17 rows for some reason. I have no idea what the issue was, but I didn’t want to invest the time to figure it out.

After some digging around and even considering writing my own throw-away extraction script, I remembered having read something about Google Docs being able to import tables from websites. And indeed, it has a very useful function called ImportHtml that will scrape a table from a page.

To extract a table, create a new spreadsheet and enter the following expression in the top left cell: =ImportHtml(URL, "table", num). URL here is the URL of the page (between quotation marks), "table" is the element to look for (Google Docs can also import lists), and num is the number of the element, in case there are more on the same page (which is rather common for tables). The latter supposedly starts at 1, but I had to use 0 to get it to pick up the correct table on the FDIC page.
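As a concrete example, the expression described above would look roughly like this (the URL is only a placeholder; substitute the address of the page whose table you want to scrape):

=ImportHtml("http://www.example.com/banklist.html", "table", 0)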

Once this is done, Google Docs retrieves the data and inserts it into the spreadsheet, including the headers. The last step is to download the spreadsheet as a CSV file.

This is very simple and quick, and a much better idea than writing a custom script. Of course, the real solution would be to offer all data as a CSV file in addition to the table to begin with. But until that happens, we will need tools like this to get the data into a format that is actually useful.

Source: http://eagereyes.org/data/scrape-tables-using-google-docs