Scraping Data From Websites: The Need for Specialised Data Mining Techniques for Web 2.0

Web 2.0 is not exactly a new version of the Web, but rather a way to describe a new generation of interactive websites centred on the user. These websites offer interactive information sharing and collaboration - wikis and blogs being a case in point - and the approach is now expanding to other areas as well. These new sites are the result of new technologies and new ideas and are on the cutting edge of Web development. Due to their novelty, they create a rather interesting challenge for data mining.

Data mining is simply a process of finding patterns in masses of data. There is such a vast amount of information on the Web that data mining tools are needed to make sense of it. Traditional data mining techniques are not very effective on these new Web 2.0 sites because the user interface is so varied. Since Web 2.0 sites are built largely from user-supplied content, there is even more data to mine for valuable information. Having said that, the additional freedom in format makes it much more difficult to sift through the content to find what is usable.

The data available is very valuable, so where there is a new platform, new techniques must be developed for mining it. The trick is that the data mining methods must themselves be as flexible as the sites they are targeting. In the early days of the World Wide Web, referred to as Web 1.0, data mining programs knew where to look for the desired information. Web 2.0 sites lack structure, meaning there is no single spot for the mining program to target. It must be able to scan and sift through all of the user-generated content to find what is needed. The upside is that there is a lot more data out there, which means more, and more accurate, results if the data can be properly utilized. The downside is that with all that data, if the selection criteria are not specific enough, the results will be meaningless. Too much of a good thing is definitely a bad thing.

Wikis and blogs have been around long enough now that enough research has been carried out to understand them better. This research can now be used, in turn, to devise the best possible data mining methods. New algorithms are being developed that will allow data mining applications to analyse this data and return useful results.

The main challenge in developing these algorithms does not lie in finding the data; there is more than enough of it. The challenge is filtering out irrelevant data to get to what is meaningful. At this point none of the techniques has been perfected. This makes Web 2.0 data mining an exciting and frustrating field, and yet another challenge in the never-ending series of technological hurdles that have stemmed from the internet.

There are numerous problems to overcome. One is the inability to rely on keywords, which used to be the best method of searching. Keyword matching does not capture the context or sentiment associated with the keywords, which can drastically change the meaning of the keyword population. Another problem is that there are many cul-de-sacs on the internet now, where groups of people share information freely, but only behind walls and barriers that keep it away from the general public. Social networking sites are a good example of this: you can share information with everyone you know, but it is more difficult for that information to proliferate outside of those circles. This is good in terms of protecting privacy, but it does not add to the collective knowledge base, and it can lead to a skewed understanding of public sentiment based on which social structures you have entry into.

Attempts to use artificial intelligence have been less than successful because the approaches have not been adequately focused in their methodology. Data mining depends on collecting data and sorting the results to create reports on the individual metrics that are the focus of interest. The data sets are simply too large for traditional computational techniques to tackle. That is why a new answer needs to be found.

Data mining is an important necessity for managing the backhaul of the internet. As Web 2.0 grows exponentially, it is increasingly hard to keep track of everything that is out there and to summarize and synthesize it in a useful way. Data mining is necessary for companies to be able to really understand what customers like and want so that they can create products to meet these needs.
In the increasingly aggressive global market, companies also need the reports resulting from data mining to remain competitive. If they are unable to keep track of the market and stay abreast of popular trends, they will not survive. The solution has to come from open source, with options to scale databases depending on needs. Companies are now working on these ideas and sharing the results with others to improve them further. So, just as the open source and collective information sharing of Web 2.0 created these new data mining challenges, it will be a collective effort that solves them.
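
The keyword limitation described earlier can be illustrated with a small Python sketch. It is not taken from the article: the sample posts, the keyword set, the negation list and the two-word window are all assumptions chosen for illustration, and the negation check is a deliberately crude stand-in for real sentiment analysis. It shows that plain keyword counting gives two opposite opinions the same score, while even a minimal look at context separates them.

```python
import re

# Hypothetical sample posts, invented for illustration only.
POSTS = [
    "I love this phone, the battery is great",
    "I do not love this phone, the battery is not great",
]

# Assumed keyword and negation lists; a real system would need far more.
KEYWORDS = {"love", "great"}
NEGATIONS = {"not", "never", "no"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_hits(text, keywords):
    """Naive approach: count keyword occurrences, ignoring context."""
    return sum(1 for token in tokenize(text) if token in keywords)


def negation_aware_score(text, keywords, window=2):
    """Count each keyword as +1, or as -1 when a negation word
    appears within `window` tokens before it."""
    tokens = tokenize(text)
    score = 0
    for i, token in enumerate(tokens):
        if token in keywords:
            preceding = tokens[max(0, i - window):i]
            score += -1 if NEGATIONS & set(preceding) else 1
    return score


for post in POSTS:
    print(keyword_hits(post, KEYWORDS), negation_aware_score(post, KEYWORDS), post)
```

Both posts contain the same two keywords, so the naive count cannot tell them apart; only the context-aware score reflects that the second post is negative.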

It is important to view this as a process of constant improvement, not one where an answer will be absolute for all time. Since its advent, the internet has changed significantly, as has the way users interact with it. Data mining will always be a critical part of corporate internet usage, and its methods will continue to evolve just as the Web and its content do.

There is a huge incentive for creating better data mining solutions to tackle the complexities of Web 2.0. For this reason, several companies exist just for the purpose of analysing the data mining problem and creating solutions to it. They find eager buyers for their applications in companies which are desperate for information on markets and potential customers. The companies in question do not simply want more data; they want better data. This requires a system that can classify and group data, and then make sense of the results. While the data mining process is expensive to start with, it is well worth it for a retail company because it provides insight into the market and thus enables quick decisions. The speed at which a company with insightful information on the marketplace can react to changes gives it a huge advantage over the competition. Not only can the company react quickly, it is likely to steer itself in the right direction if its information is based on up-to-date data. Advanced data mining will allow companies not only to make snap decisions, but also to plan long-range strategies based on the direction the marketplace is heading. Data mining brings the company closer to its customers. The real winners here are the companies that have discovered that they can make a living by improving existing data mining techniques. They have filled a niche that was created only recently, which no one could have foreseen, and have done quite a good job of it.
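
As a rough sketch of what "classify and group data" might look like at its simplest, the Python below assigns short posts to topics by vocabulary overlap and then counts the posts per topic. The topic names, vocabularies and posts are hypothetical, invented for illustration; the article does not describe any specific system, and a production tool would use far more robust classification than word matching.

```python
from collections import defaultdict

# Hypothetical topic vocabularies and posts, invented for illustration.
TOPICS = {
    "price": {"cheap", "expensive", "price", "cost"},
    "delivery": {"shipping", "delivery", "late", "arrived"},
}

POSTS = [
    "The price was fair but delivery was late",
    "Too expensive for what you get",
    "Arrived on time, shipping was fast",
    "Nice colour",
]


def classify(post):
    """Return every topic whose vocabulary overlaps the post's words,
    or ["other"] when no topic matches."""
    words = set(post.lower().replace(",", " ").split())
    return [topic for topic, vocab in TOPICS.items() if words & vocab] or ["other"]


def group_posts(posts):
    """Group posts by topic; a post may land in several groups."""
    groups = defaultdict(list)
    for post in posts:
        for topic in classify(post):
            groups[topic].append(post)
    return dict(groups)


groups = group_posts(POSTS)
for topic, members in groups.items():
    print(f"{topic}: {len(members)} post(s)")
```

The per-topic counts are the "make sense of the results" step in miniature: they turn unstructured comments into a summary a company could act on.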

Source: http://ezinearticles.com/?The-Need-for-Specialised-Data-Mining-Techniques-for-Web-2.0&id=7412130


Thursday, 25 July 2013
