Friday, August 10, 2007

Reverse Proxies Are A Menace To Web Sessions — And To Bot Traps

A bottrap is a server-side script that tries to detect when too many requests are being sent to a web server by a client computer (and it may be designed to take some action to help reduce the load on the server). The web server can however not only receive requests from several computers A, B, C as if they are coming from a single forward proxy computer X, but it can also receive requests from a single computer Y as if they are coming from several reverse proxy computers P, Q, R. Take these things together and you get an interesting dynamics along with a very difficult problem to solve.

To understand the problem, suppose that your bottrap only checks the ip address of the client computer. Then requests from computers A, B, C will all appear to be coming from computer X and this will incorrectly trigger your bottrap sooner or later. This situation is not as far fetched as it may seem because these days, many companies provide access to the internet through firewalls and/or proxies such as X. So in order to avoid blocking legitimate visitors to your site, your bottrap cannot rely on checking the ip address of the client computer alone.

What your bottrap needs to do is to associate each client computer with a session-id which the client computer must return with every request to your web server. But if you do this dilligently, you will soon run into another problem with reverse proxies: requests from computer Y will now appear to come from computers P, Q, R and all of them will carry the same session-id. The problem is that from the point of view of your bot trapping code, there are at least three interpretations of the situation:

(a) the requests are coming from a single computer Y which is behind reverse proxies such as P, Q, R;

(b) the requests are coming from a single computer Y and a man-in-the-middle who is trying to hijack Y's session;

(c) the requests are coming from a stealth bot which is trying to avoid triggering your bot trap.

Because there is no way to distinguish these three cases, you are forced into the situation that if you allow requests from one case to be served, you will knowingly or unknowingly be serving requests from the other cases as well.

So a decision must be made. At botslist.com we decided to disregard these requests altogether. When we issue a session-id to a client computer's ip address, we expect the session-id to be returned with requests from the same ip address. If the session-id is received with requests from a different ip address, the session-id is disregarded and a new session-id is issued for the new ip address. In practice this means that requests from case (b) are frustrated so men-in-the-middle don't get to hijack our users' established sessions. It also frustrates requests from case (c) because no content is served until the session is established — and the session is established only when the session-id has been received back from the associated ip address.

But requests from case (a) are a real problem because they too are frustrated. Luckily for us, reverse proxies do not seem to be as widely deployed as firewalls and forward proxies and so not many of our visitors are affected. (No, AOL users don't count — just kidding :)