Friday, August 10, 2007

Reverse Proxies Are A Menace To Web Sessions — And To Bot Traps

A bot trap is a server-side script that tries to detect when too many requests are being sent to a web server by a client computer (and it may be designed to take some action to help reduce the load on the server). However, a web server can not only receive requests from several computers A, B, C as if they were coming from a single forward proxy computer X, it can also receive requests from a single computer Y as if they were coming from several reverse proxy computers P, Q, R. Take these things together and you get an interesting dynamic, along with a very difficult problem to solve.

To understand the problem, suppose that your bot trap only checks the IP address of the client computer. Then requests from computers A, B, C will all appear to be coming from computer X, and this will incorrectly trigger your bot trap sooner or later. This situation is not as far-fetched as it may seem because, these days, many companies provide access to the internet through firewalls and/or proxies such as X. So in order to avoid blocking legitimate visitors to your site, your bot trap cannot rely on checking the IP address of the client computer alone.
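To make the limitation concrete, here is a rough sketch of an IP-only bot trap (Python, purely for illustration; the window and threshold are made-up numbers, not anything a real deployment should copy). Behind a forward proxy like X, every visitor shares the proxy's address, so the counter fills up no matter how well-behaved each individual visitor is.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60   # arbitrary example window
    MAX_REQUESTS = 100    # arbitrary example threshold

    _recent = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_bot(ip):
        """Return True when an IP has sent too many requests in the window.

        Behind a forward proxy, A, B and C all show up as the proxy's IP,
        so their combined traffic can trip this check even though no single
        visitor is misbehaving.
        """
        now = time.time()
        timestamps = _recent[ip]
        timestamps.append(now)
        while timestamps and now - timestamps[0] > WINDOW_SECONDS:
            timestamps.popleft()
        return len(timestamps) > MAX_REQUESTS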

What your bot trap needs to do is associate each client computer with a session-id which the client computer must return with every request to your web server. But if you do this diligently, you will soon run into another problem with reverse proxies: requests from computer Y will now appear to come from computers P, Q, R, and all of them will carry the same session-id. The problem is that, from the point of view of your bot-trapping code, there are at least three interpretations of the situation:

(a) the requests are coming from a single computer Y which is behind reverse proxies such as P, Q, R;

(b) the requests are coming from a single computer Y and a man-in-the-middle who is trying to hijack Y's session;

(c) the requests are coming from a stealth bot which is trying to avoid triggering your bot trap.

Because there is no way to distinguish these three cases, you are forced into an all-or-nothing position: if you allow requests from one case to be served, you will knowingly or unknowingly be serving requests from the other cases as well.

So a decision must be made. At botslist.com we decided to disregard these requests altogether. When we issue a session-id to a client computer's IP address, we expect the session-id to be returned with requests from the same IP address. If the session-id is received with requests from a different IP address, the session-id is disregarded and a new session-id is issued for the new IP address. In practice this means that requests from case (b) are frustrated, so men-in-the-middle don't get to hijack our users' established sessions. It also frustrates requests from case (c) because no content is served until the session is established — and the session is established only when the session-id has been received back from the associated IP address.
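Here is a minimal sketch of that policy (Python for illustration only; the in-memory table and function names are mine, not the actual botslist.com code, and a real service would keep the table in persistent storage):

    import secrets

    # session-id -> the IP address it was issued to (in-memory stand-in)
    _sessions = {}

    def issue_session(ip):
        """Issue a fresh session-id bound to the requesting IP address."""
        session_id = secrets.token_hex(16)
        _sessions[session_id] = ip
        return session_id

    def validate_request(session_id, ip):
        """Return a (valid, session_id) pair for an incoming request.

        If the session-id comes back from the IP it was issued to, the
        session is established. If it arrives from a different IP
        (reverse proxy, man-in-the-middle or stealth bot; we cannot tell
        which), it is disregarded and a new session-id is issued for the
        new IP.
        """
        if session_id in _sessions and _sessions[session_id] == ip:
            return True, session_id
        return False, issue_session(ip)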

But requests from case (a) are a real problem because they too are frustrated. Luckily for us, reverse proxies do not seem to be as widely deployed as firewalls and forward proxies and so not many of our visitors are affected. (No, AOL users don't count — just kidding :)

Friday, May 11, 2007

Automatic Verification Of Machine Agents, Or How To Tell When Search Engine Names Are Spoofed

I didn't plan to post anything until next month, but I'm beginning to feel that were I to limit myself to one post per month as I originally intended, it would take forever to exhaust the topics that I want to write about. So for now I'm going to adopt an ad hoc schedule by posting whenever I have the time to write - and boy, do I have a lot to write about.

I mentioned in my last post that a lot of interesting technologies went into my web service at www.botslist.com. One of those technologies dealt with the problem of automatically detecting when the user agent string of a search engine is spoofed. There is a rather simple and elegant solution to the problem, but before I describe it, I want to describe the solution that the major search engines recommend. And in case anyone is thinking of following their recommendation, let me give you my own recommendation in one word: DON'T. Why? Because the solution that the search engines recommend is inefficient, overly complicated and totally unnecessary. Let me explain.

According to this link, this link, and this post by Google's Matt Cutts, the recommended steps for detecting spoofed user agent strings are as follows.

(1) Start with the IP address of the suspected search engine that just sent a request to your web server.

(2) Do a reverse DNS lookup on the IP address from the previous step. What do you get? A hostname? No, not on your life. The DNS system allows a single IP address to be mapped to multiple names; for example, a single IP address may be used to host a DNS server, an FTP server, a mail server, and so on. It is possible that Google has a policy of mapping a single IP address to a single name, but since when do you write code for Google only? Besides, what happens to your carefully crafted web service code if Google changes that policy, assuming they have such a policy to begin with? So if you want to write robust code that works in all cases, you must be prepared to handle a list of names when you do the reverse lookup they are suggesting. Let's call this the name list, NAMES[N].

(3) Now, for each entry in the name list from the above step, you are supposed to do a forward DNS lookup. Again, what do you expect? A single IP address? Nope, not at all. You see, the DNS system allows a single name to be mapped to multiple addresses; for example, www.google.com mapped to four different addresses the last time I checked. So if you want to write robust code that will work in all cases, you must be prepared to handle a list of IP addresses FOR EACH ENTRY IN THE NAME LIST from step (2). Are we there yet? Sorry, but no, not quite.

(4) From step (3) we have a list of IP addresses for each entry in our name list. Let's call this the address list, ADDR[N, M]. Now we must look in this address list for the IP address of the suspected search engine from step (1). If we don't find it, we can conclude that the user agent string is spoofed. Or can we? Look at the steps again carefully. We haven't used the user agent string at all! All we have done is check whether the suspected search engine has a proper reverse DNS entry or not. For example, if the search engine happens to be Googlebot connecting to your web server from an IP address that Google has not set a reverse DNS entry for, the algorithm will conclude that the user agent string is spoofed --- which is not only misleading but strictly incorrect. Even if we find the IP address of the suspected search engine in our ADDR list, we still can't conclude that the user agent string is not spoofed. In order to draw that conclusion, we must now check the entry in our NAMES list (i.e. the one that resulted in a matching IP address) to see if it belongs to Google's domain. But how do we know what Google's domains are? If we know them today, will they be the same tomorrow? What about other search engines? Wasn't the point of all this to avoid hardcoding the IP addresses of search engines into our code? Yet instead we are required to hardcode the search engine domain names, as if the search engines cannot change the domain names that their robots are crawling from!
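For the record, here is roughly what the recommended procedure boils down to in code (a Python sketch using the standard socket module; the hard-coded domain suffixes in the default argument are purely illustrative, and they are precisely the part I object to):

    import socket

    def verify_by_reverse_dns(ip, trusted_suffixes=(".googlebot.com", ".google.com")):
        """The search engines' recommended check, roughly as described above.

        Step (2): the reverse lookup may yield several names.
        Step (3): each name may forward-resolve to several addresses.
        Step (4): the original IP must be among them AND the matching name
        must end in a domain we trust, which means hard-coding the search
        engines' domain names after all.
        """
        try:
            primary, aliases, _ = socket.gethostbyaddr(ip)   # reverse lookup
        except socket.herror:
            return False          # no reverse DNS entry at all
        for name in [primary] + aliases:
            try:
                _, _, addresses = socket.gethostbyname_ex(name)  # forward lookup
            except socket.gaierror:
                continue
            if ip in addresses and name.endswith(trusted_suffixes):
                return True
        return False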

There is a simpler and much more elegant solution. What we need to do is break the problem down into two different problems. Problem #1 is how to verify whether or not the user agent string is spoofed. We must solve this problem in a way that does not depend on any particular search engine. If we determine that spoofing has taken place, there is no need to solve the second problem: the suspected search engine can be thrown off our server. So how do we solve problem #1? Well, we have the IP address of the suspected search engine, right? So what we need to do is try to obtain the IP address from another source and compare the two addresses. Which other source do we have at our disposal? The DNS system, of course. And how do we obtain an IP address from DNS? By doing a forward lookup on a hostname, of course. But where can we get the hostname from? Well, think very hard. Remember the user agent string that we are trying to verify? It already contains a URL for most search engines, and a URL does contain a hostname. So Bingo! The solution to problem #1 is to parse the user agent string for a URL, extract a hostname from that URL, do a single forward lookup on the hostname to get a list of addresses, and if the IP address of the suspected search engine is in the list, then the user agent string is verified --- otherwise, it is spoofed. That's how it's done on www.botslist.com --- the proper way, if I may say so --- but no major search engine has passed this verification test yet.
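A sketch of that check might look like the following (Python for illustration; the regular expression and function name are mine, not the actual botslist.com code, and a real implementation would want to cache the DNS lookups):

    import re
    import socket

    def user_agent_spoofed(user_agent, client_ip):
        """Return True when the user agent string appears to be spoofed.

        Most crawler user agent strings embed a URL, e.g.
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)".
        Extract the hostname from that URL, forward-resolve it, and accept
        the request only if the client's IP is among the resulting addresses.
        """
        match = re.search(r"https?://([A-Za-z0-9.-]+)", user_agent)
        if not match:
            return True        # no URL to verify against
        hostname = match.group(1)
        try:
            _, _, addresses = socket.gethostbyname_ex(hostname)  # forward lookup
        except socket.gaierror:
            return True        # hostname doesn't resolve
        return client_ip not in addresses

Note that the per-request cost is a single forward lookup, with no reverse lookup and no nested lists of names and addresses.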

So what is problem #2 and how do we solve it? Problem #2 is to determine whether the verified user agent string belongs to your favourite search engine, be it Yahoo, or MSN, or Google, or whoever. And the solution? Simple - just check to see if the domain name in the verified user agent string belongs to the search engine. It is not a particularly clean solution because you still have to know in advance which domain names belong to which search engine. But then, the problem itself isn't particularly clean to begin with, since it presumes that you favour some search engines over others for some reason; otherwise, why would you want to know whether the search engine belongs to Google and not its competitors? At botslist.com no search engine is favoured over any other, so luckily problem #2 is not an issue.
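For completeness, here is what that check might look like, continuing the sketch above (the domain table is illustrative and has to be maintained by hand, which is exactly the unclean part):

    # Illustrative, hand-maintained mapping; keeping it current is the
    # maintenance burden discussed above.
    SEARCH_ENGINE_DOMAINS = {
        "google": (".google.com",),
        "yahoo": (".yahoo.com",),
        "msn": (".msn.com", ".live.com"),
    }

    def engine_for_hostname(hostname):
        """Return which search engine (if any) a verified hostname belongs to."""
        hostname = "." + hostname.lower()
        for engine, suffixes in SEARCH_ENGINE_DOMAINS.items():
            if hostname.endswith(suffixes):
                return engine
        return None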

Wednesday, May 2, 2007

Botslist.com Goes Beta

After several months of development and testing, Botslist.com has been opened for testing by the general public. The development took much longer than anticipated, but I believe the effort has been worthwhile because the site is hands-down one of the most comprehensive and sophisticated bot tracking services on the internet. That in itself would be saying nothing, of course, if it weren't for the pain and agony that bad bots are causing many website owners like myself.

There are those who think bots should be allowed to roam and plunder websites as they like, but I beg to differ. The bots were becoming an issue on my sites and I wasn't about to take it lying down. Since I could code my way out of the proverbial box, I felt it was time to do something about the problem. Before long I had a solution in place, and it worked pretty well as designed. But something was amiss.

It didn't take long for me to figure out what was missing. I had a pretty good solution to the bot problem, and as an aspiring MicroISV I was always on the lookout for ways to help people solve problems with software. So naturally it occurred to me that with some additional effort, I could offer my bot solution to other website operators who were having headaches from bad bots and were willing to do something about the problem.

But I wasn't about to invest months of development time into a solution without doing some minimal market research first. So off I went to my favourite search engines to see what people were doing to solve their bot problem. Since I already had my own solution in place, I knew exactly what I was looking for.

It wasn't CrawlWall, which appears to have been in beta since late 2006. It wasn't BotSense, which appears to be abandoned. It wasn't AntiCrawl, which appears to be something out of the outer reaches of the MLM world. It wasn't AntiLeech, which appears to work with WordPress only, and in any case I wasn't prepared to deploy PHP and MySQL on my server. What I was looking for was something close to my own solution, because I was convinced that this was the best --- and maybe even the only viable --- solution to the bot problem.

Now, if my conviction was accurate, I would expect someone else on the internet to have had a hint of the solution by now. Without this additional confirmation that my solution was an idea whose time had come, I wasn't going to put too much time into offering it to others as a service. But then I came across this post by Kjell Gunnar Bleivik, in which he basically outlines the essential components of the proposed solution. That was all the confirmation I needed. Time to get my hands dirty with some coding.

It's almost done now - thank goodness. The major components are all in place and working as designed. There are a few points of detail where my solution differs from the one described by Kjell Gunnar Bleivik, but that was to be expected. For instance, I believe strongly that human nature would cripple a system based on donations, so instead I offer both a free service and a paid service to help fund the system. I also believe that a system that requires a massive hidden network would eventually fail, so instead I invite website operators who are already running bot detection scripts to redirect their bad bots to nonexistent URLs on botslist.com, which automatically registers the bots in the botslist database.
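For anyone wondering what "redirecting" means in practice, an existing detection script only needs to answer flagged requests with a 302 pointing at botslist.com. A rough WSGI-style sketch (Python for illustration; the trap path below is a placeholder, not an actual botslist.com URL, and the detection function is a stand-in for whatever the site already runs):

    # Placeholder trap URL; substitute a nonexistent path on botslist.com.
    TRAP_URL = "http://www.botslist.com/some-nonexistent-path"

    def looks_like_bad_bot(environ):
        """Stand-in for whatever bot detection the site already runs."""
        return False  # plug your own detection logic in here

    def application(environ, start_response):
        """Minimal WSGI app: serve normally, or bounce a detected bad bot."""
        if looks_like_bad_bot(environ):
            # Redirecting to a nonexistent URL on botslist.com is what
            # registers the bot in the bots database.
            start_response("302 Found", [("Location", TRAP_URL)])
            return [b""]
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body>normal content here</body></html>"]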

There are a lot of exciting technologies involved in botslist.com, and I intend to describe some of them in future posts. I'm tempted to go on and on right here, but this post is long enough as it is and you, my dearly beloved reader, probably have something else to do. So I will leave the rest till next month. Enjoy.