I didn't plan to post anything until next month, but I'm beginning to feel that if I limit myself to one post per month as I originally intended, it will take forever to exhaust the topics that I want to write about. So for now I'm going to adopt an ad hoc schedule and post whenever I have the time to write - and boy, do I have a lot to write about.
I mentioned in my last post that a lot of interesting technologies went into my web service at www.botslist.com. One of them deals with the problem of automatically detecting when the user agent string of a search engine is spoofed. There is a rather simple and elegant solution to the problem, but before I describe it, I want to describe the solution that the major search engines recommend. And in case anyone is thinking of following their recommendation, let me give you my own recommendation in one word: DON'T. Why? Because the solution that the search engines recommend is inefficient, overly complicated and totally unnecessary. Let me explain.
According to this link, this link, and this post by Google's Matt Cutts, the recommended steps for detecting spoofed user agent strings are as follows.
(1) Start with the IP address of the suspected search engine that just sent a request to your web server.
(2) Do a reverse DNS lookup on the IP address from the previous step. What do you get? A hostname? No, not on your life. The DNS system allows a single IP address to be mapped to multiple names; for example, a single IP address may be used to host a DNS server, an FTP server, a mail server, and so on. It is possible that Google has a policy of mapping a single IP address to a single name, but since when do you write code for Google only? Besides, what happens to your carefully crafted web service code if Google changes its policy, assuming it has such a policy to begin with? So if you want to write robust code that works in all cases, you must be prepared to handle a list of names when you do the reverse lookup that they are suggesting. Let's call this name list NAMES[N].
(3) Now for each entry in the name list from the above step, you are supposed to do a forward DNS lookup. Again, what do you expect? A single IP address? Nope, not at all. You see, the DNS system also allows a single name to be mapped to multiple addresses; for example, www.google.com mapped to four different addresses last time I checked. So if you want to write robust code that will work in all cases, you must be prepared to handle a list of IP addresses FOR EACH ENTRY IN THE NAME LIST from step (2). Are we there yet? Sorry, but no, not quite.
(4) From step (3) we have a list of IP addresses for each entry in our name list. Let's call this the address list, ADDR[N, M]. Now we must look in this address list for the IP address of the suspected search engine from step (1). If we don't find it, we can conclude that the user agent string is spoofed. Or can we? Look at the steps again carefully. We haven't used the user agent string at all! All we have done is check whether the suspected search engine has a proper reverse DNS entry or not. For example, if the search engine happens to be Googlebot connecting to your web server from an IP address that Google has not set a reverse DNS entry for, the algorithm will conclude that the user agent string is spoofed, which is not only misleading but simply incorrect. And even if we do find the IP address of the suspected search engine in our ADDR list, we still can't conclude that the user agent string is not spoofed. In order to draw that conclusion, we must now check the entry in our NAMES list (i.e. the one that resulted in a matching IP address) to see if it belongs to one of Google's domains. But how do we know what Google's domains are? If we know them today, will they be the same tomorrow? What about other search engines? Wasn't the point of all this to avoid hardcoding the IP addresses of search engines into our code? Instead we are required to hardcode the search engine domain names, as if the search engines could never change the domain names that their robots crawl from!
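To make the amount of work in steps (1) to (4) concrete, here is a rough Python sketch of the recommended procedure. It is only an illustration under a few assumptions of mine: the function names are made up, the lookups are IPv4 only, and the list of trusted domain suffixes is exactly the hardcoded list complained about in step (4). The two resolver functions are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket
from typing import Callable, Iterable, List


def reverse_names(ip: str) -> List[str]:
    """Step (2): every name the reverse lookup yields for ip (NAMES[N])."""
    try:
        primary, aliases, _ = socket.gethostbyaddr(ip)
    except OSError:  # no reverse entry, or the lookup failed
        return []
    return [primary] + list(aliases)


def forward_addrs(name: str) -> List[str]:
    """Step (3): every IPv4 address the forward lookup yields for name."""
    try:
        _, _, addrs = socket.gethostbyname_ex(name)
    except OSError:
        return []
    return addrs


def recommended_check(
    ip: str,
    trusted_suffixes: Iterable[str],
    reverse: Callable[[str], List[str]] = reverse_names,
    forward: Callable[[str], List[str]] = forward_addrs,
) -> bool:
    """Step (4): True only if some name round-trips to ip AND sits in a
    hardcoded trusted domain. Note the user agent string is never used."""
    suffixes = [s.lower() for s in trusted_suffixes]
    for name in reverse(ip):
        if ip not in forward(name):  # one forward lookup per name
            continue
        host = name.lower().rstrip(".")
        if any(host == s or host.endswith("." + s) for s in suffixes):
            return True
    return False
```

A call would look like recommended_check(remote_ip, ["example.com"]), where both arguments are placeholders. Note that it costs one reverse lookup plus up to N forward lookups per request.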
There is a simpler and much more elegant solution. What we need to do is break the problem down into two separate problems. Problem #1 is how to verify whether or not the user agent string is spoofed. We must solve this problem in a way that does not depend on any particular search engine. If we determine that spoofing has taken place, there is no need to solve the second problem: the suspected search engine can be thrown off our server. So how do we solve problem #1? Well, we have the IP address of the suspected search engine, right? So what we need to do is obtain the IP address from another source and compare the two addresses. Which other source do we have at our disposal? The DNS system, of course. And how do we obtain an IP address from DNS? By doing a forward lookup on a hostname, of course. But where can we get the hostname from? Well, think very hard. Remember the user agent string that we are trying to verify? For most search engines it already contains a URL, and a URL does contain a hostname. So Bingo! The solution to problem #1 is to parse the user agent string for a URL, extract the hostname from that URL, do a single forward lookup on the hostname to get a list of addresses, and check whether the IP address of the suspected search engine is in the list. If it is, the user agent string is verified; otherwise, it is spoofed. That's how it's done on www.botslist.com (the proper way, if I may say so), although no major search engine has passed this verification test yet.
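For comparison, the solution to problem #1 described above can be sketched in a few lines of Python. This is not the actual botslist.com code, just a minimal illustration. The regular expression, the function names and the decision to treat a user agent string without a URL as unverified are all my own assumptions, and the default lookup is IPv4 only.

```python
import re
import socket
from typing import Callable, List, Optional
from urllib.parse import urlsplit

# Assumed pattern: the first http(s) URL in the string, stopping at
# whitespace or the punctuation that usually surrounds it.
URL_RE = re.compile(r"https?://[^\s;()<>\"']+", re.IGNORECASE)


def hostname_from_user_agent(user_agent: str) -> Optional[str]:
    """Return the hostname of the first URL in the string, or None."""
    match = URL_RE.search(user_agent)
    if match is None:
        return None
    try:
        return urlsplit(match.group(0)).hostname
    except ValueError:  # malformed URL
        return None


def forward_addrs(name: str) -> List[str]:
    """Single forward lookup: every IPv4 address for name."""
    try:
        _, _, addrs = socket.gethostbyname_ex(name)
    except OSError:
        return []
    return addrs


def user_agent_verified(
    ip: str,
    user_agent: str,
    forward: Callable[[str], List[str]] = forward_addrs,
) -> bool:
    """True if the hostname named in the user agent resolves to ip."""
    host = hostname_from_user_agent(user_agent)
    if host is None:
        return False  # assumption: no URL means nothing to verify against
    return ip in forward(host)
```

There is exactly one DNS lookup per request, and nothing in the code mentions any particular search engine.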
So what is problem #2 and how do we solve it? Problem #2 is to determine whether the verified user agent string belongs to your favourite search engine, be it Yahoo, MSN, Google or whoever. And the cleanest solution to the problem? Simple - just check whether the domain name in the verified user agent string belongs to that search engine. Admittedly, it is not a particularly clean solution, because you still have to know in advance which domain names belong to which search engine. But then, the problem itself isn't particularly clean to begin with, since it presumes that you favour some search engines over others for some reason; otherwise, why would you want to know whether the search engine belongs to Google and not its competitors? At botslist.com no search engine is favoured over any other, so luckily problem #2 is not an issue.
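If you do need to solve problem #2, the check amounts to a table lookup on the hostname that was verified in problem #1. The sketch below is only illustrative: the table contents are placeholder domains and engine names I made up, and maintaining the real table is exactly the hardcoding burden mentioned above.

```python
from typing import Dict, Optional

# Hypothetical table; real entries would have to be maintained by hand.
ENGINE_DOMAINS: Dict[str, str] = {
    "example.com": "ExampleSearch",
    "example.net": "OtherSearch",
}


def engine_for_host(
    host: str, table: Dict[str, str] = ENGINE_DOMAINS
) -> Optional[str]:
    """Return the engine owning host (or a parent domain), else None."""
    host = host.lower().rstrip(".")
    for domain, engine in table.items():
        # Match the domain itself or any subdomain, but not lookalikes
        # such as "notexample.com".
        if host == domain or host.endswith("." + domain):
            return engine
    return None
```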