Sunday 30 June 2013

Why Outsourcing Data Mining Services?

Are huge volumes of raw data waiting to be converted into information you can use? Your organization's hunt for valuable information ends with data mining, which can bring more accuracy and clarity to the decision-making process.

Today's world is hungry for information, and with the Internet offering flexible communication there is a remarkable flow of data. It is important to make that data available in a readily workable format so it can be of real help to your business. Properly filtered data is of considerable use to an organization: it can increase profits, smooth workflow and reduce overall risk.

Data mining is the process of sorting through vast amounts of data and seeking out the pertinent information. In most instances it is conducted by professionals, business organizations and financial analysts, although a growing number of fields are discovering its benefits for their business.

Data mining helps make decisions quicker and better informed. The information it produces is used in decision-making for direct marketing, e-commerce, customer relationship management, healthcare, scientific research, telecommunications, financial services and utilities.

Data mining services include the following (a brief scraping sketch follows the list):

    Collecting data from websites into an Excel database
    Searching for and collecting contact information from websites
    Using software to extract data from websites
    Extracting and summarizing stories from news sources
    Gathering information about competitors' businesses
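
As a rough illustration of the second item, here is a minimal R sketch that collects email addresses from a single page and writes them to a CSV file that opens in Excel. The URL is hypothetical and the httr package is assumed to be installed; a real project would also respect the site's terms of use.

library(httr)

page_url <- "http://www.example.com/contact"   # hypothetical URL
html_text <- content(GET(page_url), as = "text", encoding = "UTF-8")

# pull out anything that looks like an email address
emails <- regmatches(html_text,
                     gregexpr("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}",
                              html_text))[[1]]

# save the result so it can be opened in Excel
write.csv(data.frame(email = unique(emails)), "contacts.csv", row.names = FALSE)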

In this era of globalization, handling important data has become a headache for many business verticals, and outsourcing is a profitable option. Since all projects are customized to suit the exact needs of the customer, huge savings in time, money and infrastructure can be realized.

Advantages of Outsourcing Data Mining Services:

    Skilled and qualified technical staff who are proficient in English
    Improved technology scalability
    Advanced infrastructure resources
    Quick turnaround time
    Cost-effective prices
    Secure Network systems to ensure data safety
    Increased market coverage

Outsourcing will help you focus on your core business operations and thus improve overall productivity, which makes data mining outsourcing a wise choice for business. Outsourcing these services helps businesses manage their data effectively, which in turn enables them to achieve higher profits.


Source: http://ezinearticles.com/?Why-Outsourcing-Data-Mining-Services?&id=3066061

Friday 28 June 2013

What Your Expectations Should Be About a Web Analytics Solution

But what do you do when the conversion process does not finish on your website? What if the nature of your online business means users cannot complete the purchase online? Maybe they just send in a form requesting a price quotation, or fill out some kind of call-me-back form. Is there any way you could measure those?

You need to know a few plain facts about gauging non-direct conversions. Today, a number of web analytics tools offer special solutions to such problems. For instance, they let you passively import the required data from your back-end system. Such an approach is ideal, as you get to comprehensively analyze what visitors did while moving around your website. This is pretty much comparable with the analysis of a full-fledged online conversion.

If you look at the big picture, all the data on sources, behaviour and trends are readily available, so you can do virtually everything you wish. Unfortunately, some web analytics tools with limited capacity won't let you passively import data into the system.

So you need to choose something that establishes a connection between your back-end system and your web analytics application. The reality is that you cannot import your back-end data into online applications like Google Analytics. Can you instead extract the essential data from Google Analytics and match it to each and every conversion in the back-end system? Fortunately, you can extract the data from Google Analytics if you make use of its API. With that in mind, let's take a closer look at the different aspects of creating the connection.

You need a web analytics solution that lets you set Custom Variables. The big idea behind custom variables is that they create a workable connection between your Google Analytics data and the data found in your back-end system. A good approach is to create a unique identifier for every visit (this could be a hash or anything of that sort) and post that value both into a Custom Variable and into a hidden field in the price quotation form. If you store this hidden field in the back-end system, you'll be able to export all the individual transactions later on - each including its visit-id hash.
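
To make the matching step concrete, here is a minimal R sketch, assuming both the analytics tool and the back-end system can export CSV files that carry the shared visit-id hash described above; the file names and column names are hypothetical.

ga_visits  <- read.csv("ga_custom_variable_export.csv")   # visit_id, source, medium, keyword
backend_tx <- read.csv("backend_quotations.csv")          # visit_id, quote_value, converted

# match every back-end quotation to the visit that produced it
joined <- merge(backend_tx, ga_visits, by = "visit_id", all.x = TRUE)

# quotations and actual conversions, broken down by traffic source
quotes      <- aggregate(visit_id ~ source, data = joined, FUN = length)
conversions <- aggregate(converted ~ source, data = joined, FUN = sum)
merge(quotes, conversions, by = "source")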

It's also possible to request the whole chunk of data from your web analytics application's API if you can segment on that particular identifier. This lets you easily get full insight into what that user did during their visit and how they arrived at the website. Understandably, this is particularly interesting when you examine traffic sources and the keywords used. You also get to compare the sources that drive price quotations with the sources that trigger actual conversions. That is how your experience with a web analytics system should be.



Source: http://ezinearticles.com/?What-Your-Expectations-Should-Be-About-a-Web-Analytics-Solution&id=5242536

Tuesday 25 June 2013

Hard Drive Data Recovery - Protect Your Precious Data

With the progress of electronics and the widespread use of computerized storage, protecting devices such as hard disk drives, memory sticks and flash drives has become a worry for many small and mid-sized firms. Because there are plenty of hard drive data recovery applications and service providers, however, this need not be a concern. Of the two, sending your media for a professional checkup is usually the better option. To build a clear picture of which is ideal for you, we'll discuss both.

Hard drive data recovery Software:

Day by day, recovery programs are becoming more flexible and capable, and most of them work on both Mac and Windows machines. With so many choices available online, in addition to local providers, what should you look at when buying software?

Traits of hard drive data recovery Software programs:

• User Interface. Make sure the tool is easy to handle and quick, with a good range of functions

• Known software. Programs from top brands tend to be better researched and tested

• Compatibility. Check whether the software works with your computer's configuration

• Efficiency. A good program is able to recover most types of files

• File Preview. This feature lets you inspect recoverable files before saving them

• Trial Version. Many vendors offer a free downloadable version so you can try the software in advance

• Support Service. This includes upgrades, newsletters, helpful hints and other practical services after you purchase the software, and is the sign of a good vendor.

Hard drive data recovery Services:

As stated earlier, shopping for certified data recovery service providers is the preferred route. They have considerable practical experience in this field and can tell whether or not the data can be recovered. Professional providers offer a free investigation of your device and, if the data are recoverable, will ask for your approval and give a quote before carrying out the work.

Some things to consider when choosing a service:

• Above all, prices vary widely in this service category. Some agencies may charge $4,000 or more while others charge less, but a high price does not necessarily mean higher-quality service.

• Data recovery firms can rescue data from almost any device, such as Apple iPods or hard disks.

• Let them know what kind of data was held on the device and check whether they can recover all of those files.

• Privacy matters. Businesses and individual clients may have sensitive data stored on the crashed device, so confirm that your details will stay confidential.

• They should have capable, well-informed specialists performing the data recovery procedures.

• Lastly, many firms guarantee to extract the data if they judge that the unit will respond.

Companies that invest millions of dollars in information technology and commercial infrastructure should also invest adequately in data backup. Buying hard drive data recovery services should be a key strategy for small and mid-sized companies, and customers should seek expert guidance immediately after the recovery process to repair any weaknesses in their data storage systems.



Source: http://ezinearticles.com/?Hard-Drive-Data-Recovery---Protect-Your-Precious-Data&id=7060267

Monday 24 June 2013

Is Web Scraping Relevant in Today's Business World?

Different techniques and processes have been created and developed over time to collect and analyze data. Web scraping is one of the processes that has hit the business market recently. It is a powerful process that provides businesses with vast amounts of data from different sources such as websites and databases.

It is good to clear the air and let people know that data scraping is a legal process. The main reason is that the information is already publicly available on the internet. It is important to know that it is not a process of stealing information but rather a process of collecting reliable information. Some people have nevertheless regarded the technique as unsavory behavior, their main argument being that, over time, the practice will be overused and effectively amount to plagiarism.

We can therefore simply define web scraping as a process of collecting data from a wide variety of websites and databases. The process can be carried out either manually or with software. The rise of data mining companies has led to greater use of web extraction and web crawling, and other main functions of such companies are to process and analyze the harvested data. An important aspect of these companies is that they employ experts who know the viable keywords, the kind of information that can produce usable statistics, and which pages are worth the effort. The role of data mining companies is therefore not limited to mining data; they also help their clients identify relationships and build models.

Some of the common web scraping methods include web crawling, text grepping, DOM parsing and expression matching; DOM parsing in particular relies on HTML parsers, and some approaches add semantic annotation. There are many different ways of scraping data, but they all work toward the same goal: to retrieve and compile data contained in databases and websites. This is a necessary process for a business that wants to remain relevant.
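
As a small illustration of the DOM parsing mentioned above, the sketch below uses R's XML package to parse a page and address elements by structure rather than by raw text matching; the URL and the class names are hypothetical.

library(XML)

doc <- htmlParse("http://www.example.com/products")   # build a DOM tree from the page

# XPath addresses nodes by their position in the document structure
titles <- xpathSApply(doc, "//h2[@class='product-title']", xmlValue)
prices <- xpathSApply(doc, "//span[@class='price']", xmlValue)

head(data.frame(title = titles, price = prices))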

The main questions asked about web scraping touch on relevance. Is the process relevant in the business world? The answer is yes. The fact that it is employed by large companies around the world, and has delivered many rewards, says it all. Some people still regard the technology as a plagiarism tool, while others consider it a useful way to harvest the data required for business success.

Using web scraping to extract data from the internet for competitor analysis is highly recommended. If you do, be sure to look for any pattern or trend that can work in a given market.


Source: http://ezinearticles.com/?Is-Web-Scraping-Relevant-in-Todays-Business-World?&id=7091414

Friday 21 June 2013

Data Mining for Dollars

The more you know, the more you're aware you could be saving. And the deeper you dig, the richer the reward.

That's today's data mining capsulation of your realization: awareness of cost-saving options amid logistical obligations.

According to the global trade group Association for Information and Image Management (AIIM), fewer than 25% of organizations in North America and Europe are currently utilizing captured data as part of their business process. Given how easy and inexpensive it is to put that information to work, this unawareness is shocking. And costly.

Shippers, you're in a prime position to benefit the most: by using a freight bill processing provider to mine and assess your electronically captured billing records, you can realize significant savings.

Whatever your volume, the more you know about your transportation options, throughout all modes, the easier it is to ship smarter and save. A freight bill processor is able to offer insight capable of saving you 5% - 15% annually on your transportation expenditures.

The University of California - Los Angeles states that data mining is the process of analyzing data from different perspectives and summarizing it into useful information - knowledge that can be used to increase revenue, cut costs, or both. Data mining software is an analytical tool that lets you investigate data from many different dimensions, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations among dozens of fields in large relational databases. Practically, it leads you to noticeable shipping savings.

Data mining and subsequent reporting of shipping activity will yield timely, actionable information that empowers you to make the best logistics decisions based on carrier options, along with associated routes, rates and fees. This function also provides a deeper understanding of trends, opportunities, weaknesses and threats. Exploration of pertinent data, in any combination over any time period, gives you an operational and financial view of your functional flow, ultimately providing significant cost savings.

With data mining, you can create a report based on a radius from a ship point, or identify opportunities for service or modal shifts, providing insight regarding carrier usage by lane, volume, average cost per pound, shipment size and service type. Performance can be measured based on overall shipping expenditures, variances from trends in costs, volumes and accessorial charges.
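
A minimal R sketch of the kind of lane-level report described above, assuming a CSV export of freight bill records; the file name and column names are hypothetical.

bills <- read.csv("freight_bills.csv")   # lane, carrier, mode, cost, weight_lbs

bills$cost_per_lb <- bills$cost / bills$weight_lbs

# average cost per pound and total volume, by lane and carrier
avg_cost <- aggregate(cost_per_lb ~ lane + carrier, data = bills, FUN = mean)
volume   <- aggregate(weight_lbs ~ lane + carrier, data = bills, FUN = sum)
report   <- merge(avg_cost, volume, by = c("lane", "carrier"))
report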

The easiest way to get into data mining of your transportation information is to form an alliance with a freight bill processor that provides this independent analytical tool, and utilize their unbiased technologies and related abilities to make shipping decisions that'll enable you to ship smarter and save.



Source: http://ezinearticles.com/?Data-Mining-for-Dollars&id=7061178

Wednesday 19 June 2013

Web Data Extraction Services and Data Collection From Website Pages

For any business, market research and surveys play a crucial role in strategic decision making. Web scraping and data extraction techniques help you find relevant information and data for business or personal use. Too often, professionals manually copy and paste data from web pages or download a whole website, wasting time and effort.

Instead, consider using web scraping techniques that crawl through thousands of website pages to extract specific information and simultaneously save it into a database, CSV file, XML file or any other custom format for future reference.

Examples of the web data extraction process include the following (a brief sketch in R follows the list):
• Spider a government portal, extracting names of citizens for a survey
• Crawl competitor websites for product pricing and feature data
• Use web scraping to download images from a stock photography site for website design
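
Here is a minimal sketch of the second example, using the rvest package (assumed to be installed) to pull an HTML table into a data frame and save it as CSV; the URL and the choice of the first table are hypothetical.

library(rvest)

page <- read_html("http://www.example.com/competitor-pricing")

# grab the first HTML table on the page as a data frame
pricing <- html_table(html_elements(page, "table"))[[1]]

# store it for future reference
write.csv(pricing, "competitor_pricing.csv", row.names = FALSE)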

Automated Data Collection
Web scraping also allows you to monitor changes in website data over a stipulated period and collect the data automatically on a schedule. Automated data collection helps you discover market trends, understand user behavior and predict how the data will change in the near future.

Examples of automated data collection include the following (a scheduling sketch appears after the list):
• Monitor price information for selected stocks on an hourly basis
• Collect mortgage rates from various financial firms on a daily basis
• Check weather reports on a constant basis as and when required
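
As a rough sketch of scheduled collection, the loop below samples a rate a few times at fixed intervals and appends each observation to a log; in practice this would normally be launched from a scheduler such as cron. The URL and CSS selector are hypothetical, and the rvest package is assumed.

library(rvest)

collect_rate <- function() {
  page <- read_html("http://www.example.com/mortgage-rates")   # hypothetical URL
  rate <- html_text(html_element(page, ".headline-rate"))      # hypothetical selector
  data.frame(timestamp = Sys.time(), rate = rate)
}

rate_log <- data.frame()
for (i in 1:3) {                   # three samples, one per hour
  rate_log <- rbind(rate_log, collect_rate())
  if (i < 3) Sys.sleep(60 * 60)    # wait an hour between samples
}
write.csv(rate_log, "mortgage_rates_log.csv", row.names = FALSE)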

Using web data extraction services, you can mine any data related to your business objective and download it into a spreadsheet so it can be analyzed and compared with ease.

In this way you get accurate results more quickly, saving hundreds of man-hours and a great deal of money!

With web data extraction services you can easily fetch product pricing information, sales leads, mailing databases, competitor data, profile data and much more on a consistent basis.


Source: http://ezinearticles.com/?Web-Data-Extraction-Services-and-Data-Collection-Form-Website-Pages&id=4860417

Monday 17 June 2013

How Your Online Information is Stolen - The Art of Web Scraping and Data Harvesting

Web scraping, also known as web or internet harvesting, involves the use of a computer program that extracts data from another program's display output. The main difference between standard parsing and web scraping is that the output being scraped is meant for display to human viewers rather than simply as input to another program.

Therefore, it isn't generally documented or structured for practical parsing. Web scraping usually requires that binary data be ignored - this usually means multimedia data or images - along with the formatting that would obscure the desired goal: the text data. This means that, in fact, optical character recognition software is a form of visual web scraper.

Usually a transfer of data occurring between two programs would utilize data structures designed to be processed automatically by computers, saving people from having to do this tedious job themselves. This usually involves formats and protocols with rigid structures that are therefore easy to parse, well documented, compact, and function to minimize duplication and ambiguity. In fact, they are so "computer-based" that they are generally not even readable by humans.

If human readability is desired, then the only automated way to accomplish this kind of a data transfer is by way of web scraping. At first, this was practiced in order to read the text data from the display screen of a computer. It was usually accomplished by reading the memory of the terminal via its auxiliary port, or through a connection between one computer's output port and another computer's input port.

Web scraping has therefore become a way to parse the HTML text of web pages. The web scraping program is designed to process the text data that is of interest to the human reader, while identifying and removing any unwanted data, images, and formatting used for the web design.
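
A minimal sketch of that idea in R, assuming the rvest package and a hypothetical URL: drop the markup, scripts and images, and keep the text a human reader would see.

library(rvest)

page <- read_html("http://www.example.com/article")

# remove elements that are not part of the readable text
xml2::xml_remove(html_elements(page, "script, style, img"))

# what remains is the text a human reader would see
article_text <- html_text2(html_element(page, "body"))
cat(substr(article_text, 1, 500))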

Though web scraping is often done for legitimate reasons, it is also frequently performed to swipe data of "value" from another person's or organization's website in order to apply it to someone else's - or to sabotage the original text altogether. Many efforts are now being made by webmasters to prevent this form of theft and vandalism.




Source: http://ezinearticles.com/?How-Your-Online-Information-is-Stolen---The-Art-of-Web-Scraping-and-Data-Harvesting&id=923976

Friday 14 June 2013

Data Mining and Financial Data Analysis


Introduction:

Most marketers understand the value of collecting financial data, but also realize the challenges of leveraging this knowledge to create intelligent, proactive pathways back to the customer. Data mining - technologies and techniques for recognizing and tracking patterns within data - helps businesses sift through layers of seemingly unrelated data for meaningful relationships, so they can anticipate, rather than simply react to, customer and financial needs. In this accessible introduction, we provide a business and technological overview of data mining and outline how, along with sound business processes and complementary technologies, it can reinforce and redefine financial analysis.

Objective:

1. The main objective is to discuss how customized data mining tools should be developed for financial data analysis.

2. Usage patterns can be categorized, in terms of purpose, according to the needs of financial analysis.

3. Develop a tool for financial analysis through data mining techniques.

Data mining:

Data mining is the procedure for extracting, or mining, knowledge from large quantities of data; we can call it "knowledge mining from data" or Knowledge Discovery in Databases (KDD). The broader process covers data collection, database creation, data management, data analysis and understanding.

The process of knowledge discovery in databases involves the following steps (a short sketch of the pre-processing steps follows the list):

1. Data cleaning. (To remove noise and inconsistent data.)

2. Data integration. (Where multiple data sources may be combined.)

3. Data selection. (Where data relevant to the analysis task are retrieved from the database.)

4. Data transformation. (Where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)

5. Data mining. (An essential process where intelligent methods are applied in order to extract data patterns.)

6. Pattern evaluation. (To identify the truly interesting patterns representing knowledge, based on interestingness measures.)

7. Knowledge presentation. (Where visualization and knowledge representation techniques are used to present the mined knowledge to the user.)
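
A tiny R sketch of steps 1-4 on a hypothetical pair of exports, just to show what cleaning, integration, selection and transformation look like in practice; the file names and column names are invented for illustration.

transactions <- read.csv("transactions.csv")   # hypothetical export: customer_id, amount, ...
customers    <- read.csv("customers.csv")      # hypothetical export: customer_id, region, ...

# 1. cleaning: drop rows with missing or inconsistent amounts
clean <- transactions[!is.na(transactions$amount) & transactions$amount >= 0, ]

# 2. integration: combine the two sources on a shared key
combined <- merge(clean, customers, by = "customer_id")

# 3. selection: keep only the fields relevant to the analysis task
selected <- combined[, c("customer_id", "region", "amount")]

# 4. transformation: aggregate to one summary row per customer
mining_input <- aggregate(amount ~ customer_id + region, data = selected, FUN = sum)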

Data Warehouse:

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema and which usually resides at a single site.

Text:

Most banks and financial institutions offer a wide variety of banking services such as checking, savings, business and individual customer transactions, and credit and investment services such as mutual funds. Some also offer insurance services and stock investment services.

There are different types of analysis available, but here we focus on one known as "evolution analysis".

Data evolution analysis is used for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification or clustering of time-related data, evolution analysis is typically carried out through time series analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

Data collected from the banking and financial sectors are often relatively complete, reliable and of high quality, which facilitates analysis and data mining. Here we discuss a few cases.

Example 1. Suppose we have stock market data for the last few years and would like to invest in shares of the best companies. A data mining study of stock exchange data may identify stock evolution regularities, both for stocks overall and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to our decision making regarding stock investments.

Example 2. One may wish to view debt and revenue changes by month, by region and by other factors, along with minimum, maximum, total, average and other statistical information. Data warehouses support this kind of comparative analysis, and outlier analysis also plays an important role in financial data analysis and mining.

Example 3. Loan payment prediction and customer credit analysis are critical to a bank's business. Many factors can strongly influence loan payment performance and customer credit rating, and data mining may help identify the important factors and eliminate the irrelevant ones.

Factors related to the risk of loan payments include the term of the loan, debt ratio, payment-to-income ratio, credit history and many more. The bank then decides whose profile shows relatively low risk according to this critical-factor analysis.
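
A minimal R sketch of Example 3, assuming a hypothetical loans.csv containing the factors listed above plus a default indicator; the rpart package, which is included with standard R installations, fits a simple classification tree.

library(rpart)

loans <- read.csv("loans.csv")   # term, debt_ratio, payment_to_income, credit_history, defaulted
loans$defaulted <- factor(loans$defaulted)   # treat the default flag as a class label

# fit a decision tree to see which factors drive default risk
fit <- rpart(defaulted ~ term + debt_ratio + payment_to_income + credit_history,
             data = loans, method = "class")

# variable importance suggests which factors matter and which can be dropped
fit$variable.importance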

We can perform the task faster and create a more sophisticated presentation with financial analysis software. These products condense complex data analyses into easy-to-understand graphic presentations. And there's a bonus: such software can vault our practice to a more advanced business consulting level and help us attract new clients.

To help us find a program that best fits our needs-and our budget-we examined some of the leading packages that represent, by vendors' estimates, more than 90% of the market. Although all the packages are marketed as financial analysis software, they don't all perform every function needed for full-spectrum analyses. It should allow us to provide a unique service to clients.

The Products:

ACCPAC CFO (Comprehensive Financial Optimizer) is designed for small and medium-size enterprises and can help make business-planning decisions by modeling the impact of various options. This is accomplished by demonstrating the what-if outcomes of small changes. A roll forward feature prepares budgets or forecast reports in minutes. The program also generates a financial scorecard of key financial information and indicators.

Customized Financial Analysis by BizBench provides financial benchmarking to determine how a company compares to others in its industry by using the Risk Management Association (RMA) database. It also highlights key ratios that need improvement and year-to-year trend analysis. A unique function, Back Calculation, calculates the profit targets or the appropriate asset base to support existing sales and profitability. Its DuPont Model Analysis demonstrates how each ratio affects return on equity.

Financial Analysis CS reviews and compares a client's financial position with business peers or industry standards. It also can compare multiple locations of a single business to determine which are most profitable. Users who subscribe to the RMA option can integrate with Financial Analysis CS, which then lets them provide aggregated financial indicators of peers or industry standards, showing clients how their businesses compare.

iLumen regularly collects a client's financial information to provide ongoing analysis. It also provides benchmarking information, comparing the client's financial performance with industry peers. The system is Web-based and can monitor a client's performance on a monthly, quarterly and annual basis. The network can upload a trial balance file directly from any accounting software program and provide charts, graphs and ratios that demonstrate a company's performance for the period. Analysis tools are viewed through customized dashboards.

PlanGuru by New Horizon Technologies can generate client-ready integrated balance sheets, income statements and cash-flow statements. The program includes tools for analyzing data, making projections, forecasting and budgeting. It also supports multiple resulting scenarios. The system can calculate up to 21 financial ratios as well as the breakeven point. PlanGuru uses a spreadsheet-style interface and wizards that guide users through data entry. It can import from Excel, QuickBooks, Peachtree and plain text files. It comes in professional and consultant editions. An add-on, called the Business Analyzer, calculates benchmarks.

ProfitCents by Sageworks is Web-based, so it requires no software or updates. It integrates with QuickBooks, CCH, Caseware, Creative Solutions and Best Software applications. It also provides a wide variety of businesses analyses for nonprofits and sole proprietorships. The company offers free consulting, training and customer support. It's also available in Spanish.

ProfitSystem fx Profit Driver by CCH Tax and Accounting provides a wide range of financial diagnostics and analytics. It provides data in spreadsheet form and can calculate benchmarking against industry standards. The program can track up to 40 periods.


Source: http://ezinearticles.com/?Data-Mining-and-Financial-Data-Analysis&id=2752017

Thursday 13 June 2013

Three Common Methods For Web Data Extraction

Probably the most common technique traditionally used to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.
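
To illustrate the "small job" case just described, here is a minimal R sketch that pulls headline text out of a page with one regular expression; the URL and the <h2 class="headline"> markup are hypothetical, and a real page would usually need a fuzzier pattern.

html <- paste(readLines("http://www.example.com/news"), collapse = "\n")

# a deliberately simple pattern; adjust it to the markup you actually see
matches   <- gregexpr("<h2 class=\"headline\">(.*?)</h2>", html, perl = TRUE)
headlines <- regmatches(html, matches)[[1]]

# strip the surrounding tags, leaving just the text
headlines <- gsub("<.*?>", "", headlines, perl = TRUE)
print(headlines)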

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.



Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Tuesday 11 June 2013

Best method for scraping data from Web using VB macro?


This is something of a conceptual question rather than on the specifics of code (ie am I going about this the right way in general or is there a better technique I could use?). I think that my problem represents a broad issue affecting many of the inexperienced people who post on this forum so an overview and sharing of best practice would also help many people.

My aim is to scrape statistical data from a website (here is an exemplar page: www.racingpost.com/horses/result_home.sd?race_id=572318&r_date=2013-02-26&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS).

I have a very basic (if you pardon the pun) knowledge of VB which I use through excel but know nothing about other programming languages or conventions (SQL, HTML, XML etc.), however I am quite good at writing code to manipulate strings- that is, once I can scrape the data, even if it is in a very noisy form then I am expert at processing it. I am trying to build an automated process that will scrape up to 1000 pages in one hit. In one form or another, I have been working on this for years and the last few weeks have been very frustrating in that I have come up with several new methods which have taken days of work but have each had one fatal flaw that has stopped my progress.

Here are the methods I have tried (all using a VB macro run from Excel):
1) Control Firefox (as a shell application) - this was the poorest; I found that I could not interact with Firefox properly using a VB Excel macro - I tried mainly using keystrokes etc.
2) Get inner text, outer text, inner html or outer html from internet explorer (IE)- this method was by far the most reliable but the data was, at times, very hard to parse and did not always contain everything I needed (good for some applications but bad for others)
3) automated Copy and Paste from IE- this was tantalisingly close to being perfect but is given to throwing up inexplicable errors whereby the type of information copied to the clipboard differs depending on whether it is done manually (ie CTRL+A, CTRL+C) or through the automated process (with the former I could get the HTML structure- ie tables, with the latter only text). The enigma here was that I could get the automated copy/paste to give me the right info IF I FIRST CLICKED INSIDE THE IE WINDOW USING MOUSE POINTER- however I was unable to automate this using a VB MACRO (I tried sendkeys and various other methods)
4) By automating an excel webquery- I recorded a macro of a query, this worked flawlessly giving me the structure of tables I needed. Snag was it was very very slow- even for a single page it might take 14 to 16 seconds (some of the other methods I used were near instantaneous). Also this method appears to encounter severe lagging/crashing problems when many refreshes are done (that may be because I don't know how to update the queries with different criteria, or properly extinguish them)
5) Loading the page as an XML document- I am investigating this method now- I know next-to-nothing about XML but have a hunch the sort of pages I am scraping (see example above) are suitable for this. I have managed to load the pages as an XML object but at present seem to be running into difficulties trying to parse the structure (ie various nodes) to extract text- keep running into object errors.

For the record I have posted highly specific questions with code relating to these individual methods without response so I am trying a broader question. What is the experience of others here? Which of these methods should I focus on? (bear in mind I am trying to keep everything to Excel VB Macros). I am getting to the point where I might look to get someone to code something for me and pay them (as this is taking hundreds of hours) - have people had good experiences employing others to write ad hoc code in this manner?



Source: http://www.mrexcel.com/forum/excel-questions/688229-best-method-scraping-data-web-using-vbulletin-macro.html

Friday 7 June 2013

Data Mining - Critical for Businesses to Tap the Unexplored Market

Knowledge discovery in databases (KDD) is an emerging field and is increasingly gaining importance in today's business. The knowledge discovery process, however, is vast, involving understanding of the business and its requirements, data selection, processing, mining and evaluation or interpretation; it does not have any pre-defined set of rules for solving a problem. Among the other stages, the data mining process holds high importance, as the task involves identification of new patterns that have not been detected earlier in the dataset. It is a relatively broad concept involving web mining, text mining, online mining and so on.

What data mining is, and what it is not

Data mining is the process of extracting information from a dataset that has been collected, analyzed and prepared, and identifying new patterns in that information. At this juncture, it is also important to understand what it is not. The concept is often confused with knowledge gathering, processing, analysis and interpretation or inference derivation. While these processes are not data mining themselves, they are very much necessary for its successful implementation.

The 'First-mover Advantage'

One of the major goals of the data mining process is to identify an unknown or rather unexplored segment that had always existed in the business or industry, but was overlooked. The process, when done meticulously using appropriate techniques, could even make way for niche segments providing companies the first-mover advantage. In any industry, the first-mover would bag the maximum benefits and exploit resources besides setting standards for other players to follow. The whole process is thus considered to be a worthy approach to identify unknown segments.

Online knowledge collection and research involve many complications, and outsourcing data mining services therefore often proves viable for large companies that cannot devote time to the task. Outsourcing web mining or text mining services saves an organization productive time that would otherwise be spent on research.

The data mining algorithms and challenges

Every data mining task follows certain algorithms using statistical methods, cluster analysis or decision tree techniques. However, there is no single universally accepted technique that can be adopted for all. Rather, the process completely depends on the nature of the business, industry and its requirements. Thus, appropriate methods have to be chosen depending upon the business operations.
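
As a small illustration of the cluster analysis mentioned above, the sketch below uses R's built-in kmeans() to group customers into segments that might reveal an overlooked niche; the data file and column names are hypothetical.

customers <- read.csv("customer_metrics.csv")   # hypothetical columns: spend, frequency, recency_days

# k-means is sensitive to scale, so standardize the inputs first
scaled <- scale(customers[, c("spend", "frequency", "recency_days")])

set.seed(42)                      # reproducible cluster assignment
segments <- kmeans(scaled, centers = 4)

# average profile of each discovered segment
aggregate(customers, by = list(segment = segments$cluster), FUN = mean)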

The whole process is a subset of the knowledge discovery process and as such involves its own challenges. Analysis and preparation of the dataset are crucial, since well-researched material helps in extracting only the relevant yet previously unidentified information useful for the business. The analysis of the gathered material and the preparation of the dataset, which must also take industry standards into account, therefore consume considerable time and labor. Investment is another major challenge, as the process involves the substantial cost of deploying professionals with adequate domain knowledge as well as statistical and technological expertise.

The importance of maintaining a comprehensive database prompted the need for data mining, which in turn paved the way for niche concepts. Though the concept has been around for years, companies faced with ever-growing competition have realized its importance only in recent years. Besides being relevant, the dataset from which the information is extracted also has to be large enough to allow a new dimension to be identified. A standardized approach, moreover, results in better understanding and implementation of the newly identified patterns.


Source: http://ezinearticles.com/?Data-Mining---Critical-for-Businesses-to-Tap-the-Unexplored-Market&id=6745886

Wednesday 5 June 2013

Web Data Scraping It To Remain Relevant To The Business Process

Different techniques and processes for collecting and analyzing data have developed over time. Web scraping is one of the processes that has recently hit the business market. It is a process that provides large amounts of data from various sources such as websites and databases.

It is worth clearing the air here as well: web scraping is a legal process, because the information it gathers is already publicly available on the internet. It is not a system for stealing data but a way of collecting reliable data, even though some people still regard the technique as questionable behavior.

Web scraping can simply be defined as collecting data from a variety of different websites and databases, a process that can be carried out either manually or with software. The rise of data extraction and mining companies has led to increased use of web crawling. Another important function of these companies is to process and analyze the data collected, and a big reason for their success is that they employ experts.

You might now be thinking: "Can't I simply hide behind a proxy so the site never sees me?" Unfortunately, the obvious answer is not a good one. The option many people consider, which looks like an easy alternative but is in fact incredibly risky, is the free public proxy server.

There are literally thousands of free proxy servers located around the world, and they are very easy to use; finding them is not the hard part. The problem is that you cannot tell who operates a given server or what it does with your traffic, so sending sensitive data through a public proxy is a bad idea.

Some of the most common methods used are web crawling, text grepping, DOM parsing and expression matching, implemented with HTML parsers or semantic annotation. There are many different ways of scraping the data, but they all work toward the same goal: keeping the business process supplied with relevant data.

The central question asked about web scraping concerns its relevance. Is it relevant to the business world? The answer is yes.

Using web scraping to extract information from the internet for competitor analysis is recommended. If you do, be sure to look for any pattern or trend that can work in a given market.


Source: http://simplecallsolutions.com/web-data-scraping-it-to-remain-relevant-to-the-business-process/

Monday 3 June 2013

Scraping Data off a Web Site

I’m taking the Data Analysis class through Coursera and one of the topics we’ve covered so far is how to “scrape” data off a web site. The idea is to programmatically go through the source code of a web page, pull out some data, and then clean it up so you can analyze it. This may seem like overkill at first glance. After all, why not just select the data with your mouse and copy-and-paste into a spreadsheet? Well, for one, there may be dozens (or hundreds) of pages to visit and copying-and-pasting from each one would be time-consuming and impractical. Second, rarely does a copy-and-paste off a web site produce data ready for analysis. You have to tidy it up, sometimes quite a bit. Clearly these are both tasks we would like to automate.

To put this idea to use, I decided to scrape some data from the box scores of Virginia Tech football games. I attended Tech and love watching their football team, so this seemed like a fun exercise. Here’s an example of one of their box scores. You’ll see it has everything but what songs the band played during halftime. I decided to start simple and just scrape the Virginia Tech Drive Summaries. This summarizes each drive, including things like number of plays, number of yards gained, and time of possession. Here’s the function I wrote in R, called vtFballData:

library(XML) # provides readHTMLTable(); install the package first if needed

vtFballData <- function(start,stop,season){
    dsf <- c()
    # read the source code
    for (i in start:stop){
    url <- paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
    web_page <- readLines(url)

    # find where VT drive summary begins
    dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                         (grep("Virginia Tech Drive Summary", web_page) + 18)]
    dsum2 <- readHTMLTable(dsum)
    rn <- dim(dsum2[[1]])[1]
    cn <- dim(dsum2[[1]])[2]
    ds <- dsum2[[1]][4:rn,c(1,(cn-2):cn)]
    ds[,3] <- as.character(ds[,3]) # convert from factor to character
    py <- do.call(rbind,strsplit(sub("-"," ",ds[,3])," "))
    ds2 <- cbind(ds,py)
    ds2[,5] <- as.character(ds2[,5]) # convert from factor to character
    ds2[,6] <- as.character(ds2[,6]) # convert from factor to character
    ds2[,5] <- as.numeric(ds2[,5]) # convert from character to numeric
    ds2[,6] <- as.numeric(ds2[,6]) # convert from character to numeric
    ds2[,3] <- NULL # drop original pl-yds column

    names(ds2) <-c("quarter","result","top","plays","yards")
    # drop unused factor levels carried over from readlines
    ds2$quarter <- ds2$quarter[, drop=TRUE]
    ds2$result <- ds2$result[, drop=TRUE]

    # convert TOP from factor to character
    ds2[,3] <- as.character(ds2[,3])
    # convert TOP from M:S to just seconds
    ds2$top <- sapply(strsplit(ds2$top,":"),
          function(x) {
            x <- as.numeric(x)
            x[1]*60 + x[2]})

    # need to add opponent
    opp <- web_page[grep("Drive Summary", web_page)]
    opp <- opp[grep("Virginia Tech", opp, invert=TRUE)] # not VT
    opp <- strsplit(opp,">")[[1]][2]
    opp <- sub(" Drive Summary</td","",opp)
    ds2 <- cbind(season,opp,ds2)
    dsf <- rbind(dsf,ds2)
    }
return(dsf)
}

I’m sure this is three times longer than it needs to be and could be written much more efficiently, but it works and I understand it. Let’s break it down.

My function takes three values: start, stop, and season. Start and stop are both numerical values needed to specify a range of URLs on hokiesports.com. Season is simply the year of the season. I could have scraped that as well but decided to enter it by hand since this function is intended to retrieve all drive summaries for a given season.

The first thing I do in the function is define an empty variable called “dsf” (“drive summaries final”) that will ultimately be what my function returns. Next I start a for loop that will start and end at numbers I feed the function via the “start” and “stop” parameters. For example, the box score of the 1st game of the 2012 season has a URL ending in 14871. The box score of the last regular season game ends in 14882. To hit every box score of the 2012 season, I need to cycle through this range of numbers. Each time through the loop I “paste” the number to the end of “http://www.hokiesports.com/football/stats/showstats.html?” and create my URL. I then feed this URL to the readLines() function which retrieves the code of the web page and I save it as “web_page”.

Let’s say we’re in the first iteration of our loop and we’re doing the 2012 season. We just retrieved the code of the box score web page for the Georgia Tech game. If you go to that page, right click on it and view source, you’ll see exactly what we have stored in our “web_page” object. You’ll notice it has a lot of stuff we don’t need. So the next part of my function zeros in on the Virginia Tech drive summary:

# find where VT drive summary begins
dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                 (grep("Virginia Tech Drive Summary", web_page) + 18)]

This took some trial and error to assemble. The grep() function tells me which line contains the phrase “Virginia Tech Drive Summary”. I subtract 2 from that line to get the line number where the HTML table for the VT drive summary begins (i.e., where the opening <table> tag appears). I need this for the upcoming function. I also add 18 to that line number to get the final line of the table code. I then use this range of line numbers to extract the drive summary table and store it as “dsum”. Now I feed “dsum” to the readHTMLTable() function, which converts an HTML table to a dataframe (in a list object) and save it as “dsum2”. The readHTMLTable() function is part of the XML package, so you have to download and install that package first and call library(XML) before running this function.

At this point we have a pretty good looking table. But it has 4 extra rows at the top we need to get rid of. Plus I don’t want every column. I only want the first column (quarter) and last three columns (How lost, Pl-Yds, and TOP). This is a personal choice. I suppose I could have snagged every column, but decided to just get a few. To get what I want, I define two new variables, “rn” and “cn”. They stand for row number and column number, respectively. “dsum2” is a list object with the table in the first element, [[1]]. I reference that in the call to the dim() function. The first element returned is the number of rows, the second the number of columns. Using “rn” and “cn” I then index dsum2 to pull out a new table called “ds”. This is pretty much what I wanted. The rest of the function is mainly just formatting the data and giving names to the columns.

The next three lines of code serve to break up the “Pl-Yds” column into two separate columns: plays and yards. The following five lines change variable classes and remove the old “Pl-Yds” column. After that I assign names to the columns and drop unused factor levels. Next up I convert TOP into seconds. This allows me to do mathematical operations, such as summing and averaging.

The final chunk of code adds the opponent. This was harder than I thought it would be. I’m sure it can be done faster and easier than I did it, but what I did works. First I use the grep() function to identify the two lines that contain the phrase “Drive Summary”. One will always have Virginia Tech and the other their opponent. The next line uses the invert parameter of grep to pick the line that does not contain Virginia Tech. The selected line looks like this for the first box score of 2012: “<td colspan=\”9\”>Georgia Tech Drive Summary</td>”. Now I need to extract “Georgia Tech”. To do this I split the string by “>” and save the second element:

opp <- strsplit(opp,">")[[1]][2]

It looks like this after I do the split:

[[1]]
[1] "<td colspan=\"9\""              "Georgia Tech Drive Summary</td"

Hence the need to add the “[[1]][2]” reference. Finally I substitute ” Drive Summary</td” with nothing and that leaves me with “Georgia Tech”. Then I add the season and opponent to the table and update the “dsf” object. The last line is necessary to allow me to add each game summary to the bottom of the previous table of game summaries.

Here’s how I used the function to scrape all VT drive summaries from the 2012 regular season:

dsData2012 <- vtFballData(14871,14882,2012)

To identify start and stop numbers I had to go to the VT 2012 stats page and hover over all the box score links to figure out the number sequence. Fortunately they go in order. (Thank you VT athletic dept!) The bowl game is out of sequence; its number is 15513. But I could get it by calling vtFballData(15513,15513,2012). After I call the function, which takes about 5 seconds to run, I get a data frame that looks like this:

 season          opp quarter result top plays yards
   2012 Georgia Tech       1   PUNT 161     6    24
   2012 Georgia Tech       1     TD 287    12    56
   2012 Georgia Tech       1  DOWNS 104     5    -6
   2012 Georgia Tech       2   PUNT 298     7    34
   2012 Georgia Tech       2   PUNT  68     4    10
   2012 Georgia Tech       2   PUNT  42     3     2

Now I’m ready to do some analysis! There are plenty of other variables I could have added, such as whether VT won the game, whether it was a home or away game, whether it was a noon, afternoon or night game, etc. But this was good enough as an exercise. Maybe in the future I’ll revisit this little function and beef it up.
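
For example, a quick look at the scraped data frame (the dsData2012 object created above) might summarize drives by opponent; this is just a sketch of the kind of analysis it supports:

# average yards, plays and time of possession (in seconds) per drive, by opponent
aggregate(cbind(yards, plays, top) ~ opp, data = dsData2012, FUN = mean)

# share of drives ending in a touchdown, by opponent
aggregate(result ~ opp, data = dsData2012, FUN = function(r) mean(r == "TD"))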


Source: http://www.clayford.net/statistics/scraping-data-off-a-web-site/