Raw data is the best data, but a lot of public data can still only be found in tables rather than as directly machine-readable files. One example is the FDIC’s List of Failed Banks. Here is a simple trick to scrape such data from a website: Use Google Docs.
The table on that page is even relatively nice because it includes some JavaScript to sort it. But a large table with close to 200 entries is still not exactly the best way to analyze that data.
I first tried dabbledb for this task, and it worked in principle. The only problem was that it only extracted 17 rows for some reason. I have no idea what the issue was, but I didn’t want to invest the time to figure it out.
After some digging around and even considering writing my own throw-away extraction script, I remembered having read something about Google Docs being able to import tables from websites. And indeed, it has a very useful function called ImportHtml that will scrape a table from a page.
To extract a table, create a new spreadsheet and enter the following expression in the top left cell: =ImportHtml(URL, "table", num). URL here is the URL of the page (between quotation marks), "table" is the element to look for (Google Docs can also import lists), and num is the number of the element, in case there are more on the same page (which is rather common for tables). The latter supposedly starts at 1, but I had to use 0 to get it to pick up the correct table on the FDIC page.
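For example, a filled-in version of the formula might look something like the line below (the exact FDIC URL is my assumption, not taken from the original post, so substitute the address of whatever page you are scraping):

=ImportHtml("http://www.fdic.gov/bank/individual/failed/banklist.html", "table", 0)

Note the straight quotation marks around both the URL and "table"; curly quotes pasted from a web page will make the formula fail. If the wrong table shows up, try nudging the last argument up or down until the one you want appears.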
Once this is done, Google Docs retrieves the data and inserts it into the spreadsheet, including the headers. The last step is to download the spreadsheet as a CSV file.
This is very simple and quick, and a much better idea than writing a custom script. Of course, the real solution would be to offer all data as a CSV file in addition to the table to begin with. But until that happens, we will need tools like this to get the data into a format that is actually useful.
Source: http://eagereyes.org/data/scrape-tables-using-google-docs