Since you seem like the try-first, ask-questions-later type (which is a very good thing), I won't give you an answer, but a (very detailed) guide on how to find the answer.
The thing is, unless you are a Yahoo developer, you probably don't have access to the source code of the site you're trying to scrape. That is to say, you don't know exactly how the site is built or how your requests to it are processed on the server side. You can, however, investigate the client side and try to emulate it. I like using Chrome Developer Tools for this, but you can use others, such as Firefox's Firebug.
So first off, we need to figure out what's going on. Here's how it works: you click on 'Show Comments' and it loads the first ten; then you need to keep clicking to get the next ten comments each time. Notice, however, that all this clicking isn't taking you to a different link, but fetches the comments live, which is a very neat UI but for our case requires a bit more work. I can tell two things right away:
They're using JavaScript to load the comments (because I'm staying on the same page).
They load them dynamically with AJAX calls each time you click (meaning that instead of loading the comments with the page and just showing them to you, each click makes another request to the server).
Now let's right-click that button and inspect the element. It's actually just a simple span with text:
<span>View Comments (2077)</span>
By looking at that, we still don't know how it's generated or what it does when clicked. Fine. Now, keeping the devtools window open, let's click on it. That loaded the first ten comments. But in fact, a request was made on our behalf to fetch them, and Chrome DevTools recorded it. We look in the Network tab of the devtools and see a lot of confusing data. Wait, here's one that makes sense:
http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1
See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object (it looks like a Python dictionary) containing the ten comments that the request fetched. That's the request you need to emulate, because it's the one that gives you what you want. First, let's translate it into a normal request that a human can read:
go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
'_media.modules.content_comments.switches._enable_mutecommenter': '1',
'_media.modules.content_comments.switches._enable_view_others': '1',
'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
'count': '10',
'enable_collapsed_comment': '1',
'isNext': 'true',
'offset': '20',
'pageNumber': '2',
'sortBy': 'highestRated'}
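If it helps to see that translation in code, here is a minimal sketch using only Python's standard library (with the third-party requests library, the equivalent would be requests.get(url, params=params)). The URL and parameters are copied from the devtools capture above; note that the content_id identifies one specific article.

```python
from urllib.parse import urlencode

# Endpoint and parameters copied from the request captured in devtools.
url = "http://news.yahoo.com/_xhr/contentcomments/get_comments/"
params = {
    "_device": "full",
    "_media.modules.content_comments.switches._enable_mutecommenter": "1",
    "_media.modules.content_comments.switches._enable_view_others": "1",
    "content_id": "42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc",  # one specific article
    "count": "10",
    "enable_collapsed_comment": "1",
    "isNext": "true",
    "offset": "20",
    "pageNumber": "2",
    "sortBy": "highestRated",
}

# Rebuild the full query URL; fetching it (with urllib.request.urlopen or
# requests.get) should return the same JSON the browser received.
full_url = url + "?" + urlencode(params)
```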
Now it's just a matter of trial and error. However, a few things to note here:
Obviously, count is what decides how many comments you get. I tried changing it to 100 to see what happens and got a bad request, and it was nice enough to tell me why: "Offset should be multiple of total rows". So now we understand how to use offset.
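In other words, offset has to advance in multiples of count. A small helper (hypothetical, not taken from the site's own code) makes the paging pattern concrete:

```python
def comment_pages(total, count=10):
    """Yield (offset, pageNumber) pairs that walk through `total` comments,
    `count` at a time, keeping offset a multiple of count as the API expects."""
    for page, offset in enumerate(range(0, total, count), start=1):
        yield offset, page
```

For example, list(comment_pages(25)) yields offsets 0, 10, and 20 as pages 1, 2, and 3.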
The content_id probably identifies the article you are reading, meaning you need to fetch it from the original page somehow. Try digging around a little; you'll find it.
Also, you obviously don't want to fetch ten comments at a time, so it's probably a good idea to find the total number of comments somehow (either find out how the page gets it, or just fetch it from within the article itself).
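Since the total is printed right in the button text we inspected earlier ('View Comments (2077)'), one low-effort option is to pull it out with a regular expression. A rough sketch:

```python
import re

def total_comments(span_text):
    """Extract the count from button text like 'View Comments (2077)'.
    Returns 0 if no parenthesised number is found."""
    match = re.search(r"\((\d+)\)", span_text)
    return int(match.group(1)) if match else 0
```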
Using the devtools, you have access to all the client-side scripts. By digging, you can find that the link to /get_comments/ is kept within a JavaScript object named YUI. You can then try to understand how it makes the request, and try to emulate that (though you can probably figure it out yourself).
You might need to overcome some security measures. For example, you might need a session key from the original article before you can access the comments; this is used to prevent direct access to some parts of a site. I won't trouble you with the details, because it doesn't seem to be a problem in this case, but you do need to be aware of it in case it shows up.
Finally, you'll have to parse the JSON object (Python has excellent built-in tools for that) and then parse the HTML comments you get (for which you might want to check out BeautifulSoup).
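To make that last step concrete, here is a minimal sketch using only the standard library (json plus html.parser; BeautifulSoup would be more convenient). The JSON field names here ("comments", "comment_html") are made up for illustration; inspect the real response in devtools to find the actual structure.

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Bare-bones stand-in for BeautifulSoup's get_text(): collects text nodes."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

# Field names below are invented for this sketch; the real endpoint's
# JSON layout must be checked against what devtools shows you.
payload = '{"comments": [{"comment_html": "<p>Nice <b>article</b>!</p>"}]}'
texts = []
for comment in json.loads(payload)["comments"]:
    parser = TextExtractor()
    parser.feed(comment["comment_html"])
    texts.append(parser.text())
```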
As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task either.
So don't panic.
It's just a matter of digging and digging until you find gold (and having some basic web knowledge doesn't hurt).
Then, if you hit a roadblock and really can't go any further, come back here to SO and ask again. Someone will help you.
Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python