Blog · Rafael Oliveira · 3 min read

Crawling Yahoo! Answers for fun


As you've probably already read in the news, Yahoo! Answers is closing its doors. To be honest, I'm surprised it took this long; I can't remember using it in many, many years. But with knowledge pearls like this, it's a shame to let all that knowledge go down the drain.

How would I know it?

While searching to see whether someone was already taking care of archiving it (yes, the awesome community already is), I found this article from Gizmodo, in which they were also crawling it for fun. I was surprised by the crawling speed they mentioned: a single article per second. Having written web crawlers in the past, I had a feeling we could probably improve on that. So let's do it.

The Gizmodo article used links from the Yahoo! Answers sitemap file as a way to shortcut having to write a real crawler:

Site maps inception

For those who, like me, have never stared at a sitemap file before: it's a collection of sub-sitemaps in a quite simple XML format that just lists which URLs are available and when they last changed. Real crawlers can use this as a starting point, but they also parse the downloaded HTML files to extract new links to follow. If you want to learn more about crawlers, this article has a great introduction to them.
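For reference, a sitemap index just points at sub-sitemap files, and each sub-sitemap lists the actual page URLs. The snippets below only illustrate the format; the file names and URLs are made up:

```xml
<!-- Sitemap index: points at the sub-sitemap files -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://answers.yahoo.com/sitemap/sitemap-1.xml.gz</loc>
    <lastmod>2021-04-05</lastmod>
  </sitemap>
</sitemapindex>

<!-- Sub-sitemap: lists the actual page URLs -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://answers.yahoo.com/question/index?qid=...</loc>
    <lastmod>2020-11-12</lastmod>
  </url>
</urlset>
```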

Back to the sitemap XML data: I used Visual Studio's Paste XML as Classes command to auto-generate the data model:
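The generated model ends up looking something like this. Treat it as a cleaned-up sketch rather than the exact generator output; the class and property names below are my own, only the element names and namespace come from the sitemap schema:

```csharp
using System.Xml.Serialization;

// Sketch of the data model for a <urlset> sitemap file.
[XmlRoot("urlset", Namespace = "http://www.sitemaps.org/schemas/sitemap/0.9")]
public class Urlset
{
    [XmlElement("url")]
    public SitemapUrl[] Urls { get; set; }
}

[XmlType(Namespace = "http://www.sitemaps.org/schemas/sitemap/0.9")]
public class SitemapUrl
{
    [XmlElement("loc")]
    public string Loc { get; set; }

    // Kept as a string: sitemap <lastmod> allows several W3C datetime formats.
    [XmlElement("lastmod")]
    public string LastMod { get; set; }
}
```

Deserializing a downloaded sitemap is then a single `XmlSerializer` call against `typeof(Urlset)`.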

With this step done, we just need to iterate over the sitemap files and download all the URLs. To parallelize the downloads, I use a simple and lazy pattern built around HttpClient's GetStreamAsync() method:
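The original code isn't reproduced here, but a minimal sketch of such a pattern could look like the following. The `pages` output folder, the parallelism limit, and the file-naming scheme are all placeholders of my own:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class Crawler
{
    private static readonly HttpClient Http = new HttpClient();

    // Downloads every URL, keeping at most maxParallel requests in flight.
    public static async Task DownloadAllAsync(IEnumerable<string> urls, int maxParallel = 64)
    {
        Directory.CreateDirectory("pages");
        using var throttle = new SemaphoreSlim(maxParallel);
        var tasks = new List<Task>();

        foreach (var url in urls)
        {
            await throttle.WaitAsync();
            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    // Stream the response body straight to a file on disk.
                    using var stream = await Http.GetStreamAsync(url);
                    using var file = File.Create(Path.Combine("pages", $"{Guid.NewGuid():N}.html"));
                    await stream.CopyToAsync(file);
                }
                catch (HttpRequestException)
                {
                    // Skip failed downloads; a real crawler would retry here.
                }
                finally
                {
                    throttle.Release();
                }
            }));
        }

        await Task.WhenAll(tasks);
    }
}
```

The `SemaphoreSlim` bounds the number of in-flight requests, which is usually enough to saturate a home connection without exhausting sockets.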

Lazy & easy multi-threading in C#

The final crawler code is fewer than 200 lines and quite simple. On my machine, I'm able to download approximately 100 pages per second; the limiting factor seems to be my network connection, which gets fully saturated.

At this rate, it would take a bit over nine days to download all of the almost 84 million links in the sitemap: 84 million links at 100 pages per second is roughly 840,000 seconds, or about 9.7 days. As I didn't want my network connection saturated for over a week, I moved the code to an Azure VM. This has the added benefit of a much faster network connection: there, the crawler downloads approximately 300 pages per second.

Crawler code running on an Azure VM

Luckily, Yahoo! is not rate-limiting the requests, probably to aid the archiving efforts. I don't recommend pointing this crawler at other websites: you'll saturate their servers and probably get your IP blocked!

Meanwhile, I’ll take a look at how to parse all these HTML files to extract the actual questions and answers…
