Friday, May 11, 2007

Automatic Verification Of Machine Agents, Or How To Tell When Search Engine Names Are Spoofed

I didn't plan to post anything until next month, but I'm beginning to feel that were I to limit myself to one post per month as I originally intended, it would take forever to exhaust the topics that I want to write about. So for now I'm going to adopt an ad hoc schedule by posting whenever I have the time to write - and boy, do I have a lot to write about.

I mentioned in my last post that a lot of interesting technologies went into my web service at www.botslist.com. One of those technologies dealt with the problem of automatically detecting when the user agent string of a search engine is spoofed. There is a rather simple and elegant solution to the problem, but before I describe it, I want to describe the solution that the major search engines recommend. And in case anyone is thinking of following their recommendation, let me give you my own recommendation in one word: DON'T. Why? Because the solution that the search engines recommend is quite inefficient, overly complicated and totally unnecessary. Let me explain.

According to this link, this link, and this post by Google's Matt Cutts, the recommended steps for detecting spoofed user agent strings are as follows.

(1) Start with the ip address of the suspected search engine that just sent a request to your webserver.

(2) Do a reverse DNS lookup on the ip address from the previous step. What do you get? A hostname? No, not on your life. The DNS system allows a single ip address to be mapped to multiple names; for example, a single ip address may be used to host a dns server, an ftp server, a mail server, and so on. It is possible that Google has a policy of mapping a single ip address to a single name, but since when do you write code for Google only? Besides, what happens to your carefully crafted web service code if Google changes that policy, assuming they have such a policy to begin with? So if you want to write robust code that works in all cases, you must be prepared to handle a list of names when you do the reverse lookup that they are suggesting. Let's call this the name list, NAMES[N].

(3) Now, for each entry in the name list from the step above, you are supposed to do a forward DNS lookup. Again, what do you expect? A single ip address? Nope, not at all. You see, the DNS system allows a single name to be mapped to multiple addresses; for example, www.google.com mapped to four different addresses last time I checked. So if you want to write robust code that will work in all cases, you must be prepared to handle a list of ip addresses FOR EACH ENTRY IN THE NAME LIST from step (2). Are we there yet? Sorry, but no, not quite.

(4) From step (3) we have a list of ip addresses for each entry in our name list. Let's call this the address list, ADDR[N, M]. Now we must look in this address list for the ip address of the suspected search engine from step (1). If we don't find it, we can conclude that the user agent string is spoofed. Or can we? Look at the steps again carefully. We haven't used the user agent string at all! All we have done is check whether the suspected search engine has a proper reverse DNS entry or not. For example, if the search engine happens to be Googlebot connecting to your webserver from an ip address that Google has not set a reverse DNS entry for, the algorithm will conclude that the user agent string is spoofed --- which is not only misleading but also strictly incorrect. Even if we find the ip address of the suspected search engine in our ADDR list, we still can't conclude that the user agent string is not spoofed. In order to draw that conclusion, we must now check the entry in our NAMES list (i.e. the one that resulted in a matching ip address) to see if it belongs to Google's domain. But how do we know what Google's domains are? If we know them today, will they be the same tomorrow? What about other search engines? Wasn't the point of all this to avoid hardcoding the ip addresses of search engines into our code? Yet instead we are required to hardcode the search engine domain names, as if the search engines cannot change the domain names that their robots are crawling from! A sketch of the whole recommended procedure appears below.
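To make those objections concrete, here is a minimal sketch of the recommended procedure, written in Python using only the standard socket module. The hardcoded domain suffixes in KNOWN_DOMAINS are purely illustrative assumptions on my part - they are exactly the kind of list this approach forces you to maintain by hand.

import socket

# Illustrative, hardcoded domain suffixes -- the very thing the recommended
# procedure forces you to maintain yourself.
KNOWN_DOMAINS = (".googlebot.com", ".google.com")

def recommended_check(client_ip):
    # Step (2): reverse DNS lookup on the connecting address.
    try:
        primary, aliases, _ = socket.gethostbyaddr(client_ip)
    except socket.herror:
        return False  # no reverse entry at all
    names = [primary] + aliases  # NAMES[N]

    for name in names:
        # Step (3): forward lookup on each name we got back.
        try:
            _, _, addresses = socket.gethostbyname_ex(name)  # ADDR[N, M]
        except socket.gaierror:
            continue
        # Step (4): the original address must appear among the forward
        # results, and the name must fall under a hardcoded domain.
        if client_ip in addresses and name.endswith(KNOWN_DOMAINS):
            return True
    return False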

There is a simpler and much more elegant solution. What we need to do is break the problem down into two different problems. Problem #1 is how to verify whether or not the user agent string is spoofed. We must solve this problem in a way that does not depend on any particular search engine. If we determine that spoofing has taken place, there is no need to solve the second problem: the suspected search engine can be thrown off our server.

So how do we solve problem #1? Well, we have the ip address of the suspected search engine, right? So what we need to do is obtain the ip address from another source and compare the two addresses. Which other source do we have at our disposal? The DNS system, of course. And how do we obtain an ip address from DNS? By doing a forward lookup on a hostname, of course. But where can we get the hostname from? Well, think very hard. Remember the user agent string that we are trying to verify? It already contains a url for most search engines, and a url does contain a hostname. So Bingo! The solution to problem #1 is to parse the user agent string for a url, extract the hostname from that url, do a single forward lookup on the hostname to get a list of addresses, and if the ip address of the suspected search engine is in the list, then the user agent string is verified --- otherwise, it is spoofed. That's how it's done on www.botslist.com --- the proper way, if I may say so --- but no major search engine has passed this verification test yet.
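Here is a minimal sketch of that check, again in Python with the standard socket and urllib modules. The regular expression that pulls the url out of the user agent string is my own illustrative guess; the actual parsing on botslist.com may well differ.

import re
import socket
from urllib.parse import urlparse

# Matches the url that most crawlers embed in their user agent string,
# e.g. "Googlebot/2.1 (+http://www.google.com/bot.html)".
URL_RE = re.compile(r"https?://[^\s;)]+")

def agent_is_verified(user_agent, client_ip):
    match = URL_RE.search(user_agent)
    if not match:
        return False  # no url to verify against
    hostname = urlparse(match.group(0)).hostname
    if not hostname:
        return False

    # A single forward DNS lookup on the hostname from the user agent string.
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False

    # Verified only if the connecting address is among the results.
    return client_ip in addresses

For example, agent_is_verified("Googlebot/2.1 (+http://www.google.com/bot.html)", "66.249.66.1") would succeed only if 66.249.66.1 happened to be one of the addresses that www.google.com resolves to - which helps explain why the major engines do not pass the test today: their crawlers connect from addresses that the hostnames in their user agent strings do not resolve to.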

So what is problem #2 and how do we solve it? Problem #2 is to determine whether the verified user agent string belongs to your favourite search engine, be it Yahoo, or MSN, or Google, or whoever. And the cleanest solution to the problem? Simple - just check whether the domain name in the verified user agent string belongs to the search engine. It is not a particularly clean solution, because you still have to know in advance which domain names belong to which search engine. But then, the problem itself isn't particularly clean to begin with, since it presumes that you favour some search engines over others for some reason; otherwise, why would you want to know if the search engine belongs to Google and not its competitors? At botslist.com no search engine is favoured over any other, so luckily problem #2 is not an issue.
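For completeness, here is one way problem #2 could be handled, sketched in Python. The ENGINE_DOMAINS table is entirely hypothetical; you would have to decide for yourself which domains to recognise, which is exactly the impurity described above.

# Hypothetical mapping of domains to the engines a site happens to favour.
ENGINE_DOMAINS = {
    "google.com": "Google",
    "googlebot.com": "Google",
    "yahoo.com": "Yahoo",
    "msn.com": "MSN",
}

def engine_for(verified_hostname):
    # Walk up the hostname's labels until a known domain matches, e.g.
    # "www.google.com" -> "google.com" -> "Google".
    parts = verified_hostname.lower().split(".")
    for i in range(len(parts) - 1):
        domain = ".".join(parts[i:])
        if domain in ENGINE_DOMAINS:
            return ENGINE_DOMAINS[domain]
    return None  # verified, but not an engine this site recognises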

11 comments:

Anonymous said...

If no major search engine passes the verification test to date, how is the verification test useful?

Mike Adewole said...

Some machine agents do pass the test and nothing stops the major search engines from passing the test too once they become aware of it. See the x-verified header at this link http://www.botslist.com/search?name=anonymouse

Anonymous said...

I guess you still don't understand how easy it is for bots to spoof the IP address -- making all this moot.

The REAL solution is to get Google et al to start using challenge SHA256 keys registered on the search engine sites themselves (per domain). Thus nothing can be spoofed. From what I hear, Google is working on this too.

Mike Adewole said...

If a suspected bot tries to pass the test by using an ip address that is spoofed, say by forging an address that belongs to Google, the suspected bot can't receive your content because the response will be misdirected to Google.

So a bot that wants your content can't use a spoofed address to pass the test. Of course, a bot that doesn't want your content is hardly a bot in the context of my article - it's an attacker - and the test isn't meant to defend against attacks.

See for example the explanation at http://www.iss.net/security_center/advice/Underground/Hacking/Methods/Technical/Spoofing/default.htm

Anonymous said...

Hey Mike, wait a sec...

Maybe I'm missing something, but you said that the major search engines (Google, Yahoo, Live) don't pass through your suggested test?

If so, where's the sense in using this test at all?

Unknown said...

so, how do you technically implement the reverse dns?

Mike Adewole said...

@alvaro braga:

The idea is to encourage the big search engines to pass the test because it is much better than the other methods currently in use.

Mike Adewole said...

@r4ccoon:

The idea is that when a search engine visits a site, it will send a host name in a header field so that the web site can perform a single forward dns lookup on the name. If the name resolves to the ip address that the search engine is connecting from, the test is passed.

There are many options that search engines can use to pass the host name to web sites. One option is to use the host name part of the url in the user agent string. This is the option described in my post.

Another option is for search engine bots to automatically include a uahost: header field with their requests.

If the search engine vendors are not cooperative, yet another option (used at botslist.com for example) is for a site to maintain a mapping of the domain contained in the user agent string (e.g. google.com) to the known domain for the search engine bot (e.g. googlebot.com) and use that mapping to implement the test.
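As a rough illustration, one possible reading of that fallback is sketched below in Python. The UA_TO_BOT_DOMAIN table and the reverse-plus-forward lookup against the mapped crawler domain are my own assumptions about how such a mapping might be applied; the actual botslist.com implementation may differ.

import socket

# Hypothetical mapping from the domain found in the user agent string to the
# domain the engine's crawlers are known to crawl from.
UA_TO_BOT_DOMAIN = {
    "google.com": "googlebot.com",
}

def fallback_verify(ua_hostname, client_ip):
    host = ua_hostname.lower()
    if host.startswith("www."):
        host = host[4:]
    bot_domain = UA_TO_BOT_DOMAIN.get(host)
    if bot_domain is None:
        return False
    # Reverse lookup the connecting address, require a name under the mapped
    # crawler domain, then confirm the name with a forward lookup.
    try:
        name, _, _ = socket.gethostbyaddr(client_ip)
        _, _, addresses = socket.gethostbyname_ex(name)
    except (socket.herror, socket.gaierror):
        return False
    return name.endswith("." + bot_domain) and client_ip in addresses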

Unknown said...

I was searching again for this kind of problem: we can redirect the fake bot with a 301 or 302.
Use .htaccess to do that.

ganool said...

how to do that with php?

Anonymous said...

@ganool

search for "php header redirect"

or look here:

http://www.plus2net.com/php_tutorial/php_redirect.php