Wednesday, November 26, 2008

Preventing SQL Injection Attacks The Right Way

SQL injection attacks are everywhere these days. They have become ubiquitous beyond measure and have been exploited by hackers of all grades across different applications, database engines and operating systems. Try to find a solution to the problem on the web and you will be blessed with a multitude of solutions and recommendations.

All these solutions have one thing in common though: they go to great lengths to place the blame for sql injections squarely at the feet of the web application developer. Granted, there are many defensive techniques which a developer should employ in order to foil sql injection attacks among other things, but in my opinion the fundamental security holes that make these attacks possible can only be fixed by sql database vendors themselves and not by web application developers. Let me explain.

There was a time when operating systems opened all ports on a computer by default, when smtp servers allowed anyone to relay mail through them, and when dns servers allowed anyone to submit recursive queries and perform zone transfers. By and by, all these features were found to be exploitable holes and blocked by all modern software. However, it is not just that these holes are now commonly blocked — what is vitally important is that they are now commonly blocked by default.

This attitude of being secure by default has proven time and time again to be the correct one to take whenever an exploitable feature is identified or recognized. It has prevented countless unanticipated attacks against misconfigured computers while at the same time allowing experts to open these holes under controlled conditions when necessary. The problem with sql injection attacks today is that the insecure features that make these attacks possible have not been recognized as such even though they have been identified for quite some time.

What are these features, you ask. There are two of them and for the purposes of this article I will refer to them as multistatement query execution (MQEXEC) and union query execution (UQEXEC). The first feature allows an sql database engine to execute multiple statements in one query. For example, when faced with a query such as
SELECT a FROM table1; DROP table2;

the database engine will happily execute both statements in the query one after the other. The second feature (UQEXEC) allows the database engine to combine the results of multiple substatements into one dataset. For example, when faced with a query such as
SELECT a,b FROM table1 UNION SELECT c,d FROM table2;

the database engine will happily combine the results of each select substatement into one dataset (subject to certain restrictions which are not relevant here).

The problem with these features, and the reason they are such a huge security hole in today's computing environment, is that they allow an attacker to modify queries in web applications in unforeseeable ways with devastating consequences for all parties concerned. For example, it is quite natural for a web application developer to write a query such as
SELECT a,b FROM table1 WHERE c = '+ParameterValue;

along with a text box that allows the web application user to enter the parameter value. But an attacker can instead enter a parameter value that will turn this simple query into a multistatement query with malicious side-effects. Many pundits will no doubt blame the developer for writing a query like this in the first place. Yet that is not really the problem here. The problem here, really, is that when the developer wrote this query, he or she wrote it under a very specific assumption and most database engines do not provide a way for him or her to require that this assumption be enforced.
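To make the attack concrete, here is a toy python sketch (using the table and column names from the example above) showing how naive string concatenation lets a crafted parameter value close the string literal and smuggle in a second statement:

```python
# A toy sketch of the injection described above: the query is built by
# string concatenation, so user input becomes part of the sql text itself.
def build_query(parameter_value):
    # Mirrors the vulnerable pattern: SELECT a,b FROM table1 WHERE c = '...'
    return "SELECT a,b FROM table1 WHERE c = '" + parameter_value + "'"

print(build_query("widget"))                     # the intended, benign query
print(build_query("x'; DROP TABLE table2; --"))  # now a multistatement query
```

The second call yields `SELECT a,b FROM table1 WHERE c = 'x'; DROP TABLE table2; --'`, which an engine with MQEXEC enabled will happily execute as two statements.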

The specific assumption made by the developer here is that this query will not ever be a multistatement query. Unfortunately for the developer, however, the sql dialect of most database engines does not provide any way for the developer to make this assumption explicit or to require its enforcement. Even worse for the developer is the fact that these database engines operate in a mode that enables the MQEXEC and UQEXEC features by default — thereby making the attacker's task tremendously easier than the developer's, with the expectation that the developer should work around this sorry state of affairs in some way.

But suppose for a moment that sql database engines offered ways to selectively enable the MQEXEC and UQEXEC features. In Microsoft SQL Server, for example, one might imagine hypothetical statements such as
SET MQEXEC ON;
SET UQEXEC ON;

similar in many ways to the familiar NOEXEC option. Suppose further that the default mode for these options is OFF. Then the vast majority of sql injection attacks would be thwarted by default at the database engine level.

This recommendation has an additional benefit for security auditing in that by simply flipping these options to OFF, all areas in an application where multistatement and union queries are needed or assumed will fail unconditionally. Security experts can then review the code in these areas carefully and selectively enable the options as necessary.

With the MQEXEC and UQEXEC options OFF by default, the only sql injection attacks possible are those involving single statement, non-union queries. These generally tend to have less devastating consequences but can in any case be guarded against using application-level techniques such as query parametrization. But relying exclusively on application-level techniques to fight sql injection attacks is a recipe for disaster as real-world experiences continue to show.
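For completeness, here is a minimal sketch of query parametrization using python's sqlite3 module (the table and data are illustrative): the placeholder binds user input as data, never as sql text, so a classic single-statement injection attempt matches nothing.

```python
# Query parametrization: the ? placeholder keeps user input out of the
# sql text, so the malicious value is compared as a literal string.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (a TEXT, b TEXT, c TEXT)")
conn.execute("INSERT INTO table1 VALUES ('1', '2', 'safe')")

user_input = "safe' OR '1'='1"   # a classic single-statement injection attempt
rows = conn.execute(
    "SELECT a, b FROM table1 WHERE c = ?", (user_input,)  # bound as data
).fetchall()
print(rows)   # [] -- no row has the literal value "safe' OR '1'='1" in column c
```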

Saturday, September 6, 2008

How NOT To Verify Googlebot

Maintaining a blog takes a lot more effort than one would imagine. Between family, lifestyle and work, it's hard to understand how people manage to find the time to keep their blogs up to date. I suppose the trick is to find a way to bring blogging under work or lifestyle so that blogging becomes something you do without much ado. Definitely something I need to work on.

Type "verify googlebot" into a major search engine and you will come across many articles telling you how to verify googlebot and other search engines using what I will call the reverse lookup verification method. This method is recommended by all the major search engines bar none. Google, Yahoo!, Microsoft, Ask and many internet experts recommend it. So it may come as a surprise to some people to read my article where I strongly recommended that this method not be used.

I feel that webmasters who use this verification method may not fully understand the implications of their actions, so I'm going to try to explain the issue once again. You see, when a suspected search engine hits your website, you come into possession of an ip address that you know absolutely nothing about. Using this ip address, you are supposed to do a reverse dns lookup to get a list of hostnames to verify. If you are lucky and the ip address happens to belong to Google or Yahoo!, their server will respond to your dns request with a nice list of hostnames for you to verify. However, if that ip address happens to belong to the bad guys, you would be naive to expect them to play by the rules.

Now, in a perfect world the bad guys may respond to your dns requests with a nice list of names for you to verify. But in the real world they are more likely to dig out their playbook and launch every dirty trick they possess against your web server. If you have been following internet security closely, you must have heard of recent dns attacks where the bad guys needed to find a way to make you initiate dns requests. You know what? When you send a dns query to their servers on your own initiative, you have saved them the trouble of needing to trick you in the first place!! All they have to do now is jump straight into their attacks against your server.

In fact, if this method of verification becomes hugely popular, you can expect the bad guys to make good use of it to attempt to compromise even more web servers. It really is remarkably simple: all they need to do is send a request to your web server and wait for your dns requests to come in to launch their attacks. We all love to rave and rant about how insecure dns is, but using that insecure dns to send automated requests to random servers from your production web servers takes the insecurity to a whole new level. JUST DON'T DO IT, no matter what the big search engines tell you. Try using one of the following solutions instead.


In this solution, any suspected search engine that hits your website will be verified against a locally maintained database on your network without sending any reverse dns queries from your web server. Then the ip address will be submitted to a separate and isolated verification server along with any other information from the request headers that you care about. The verification server should do the reverse lookup verification and update the shared database on your network.

This server should also do other kinds of verification just in case the ip address does belong to a major search engine but the search engine company failed to setup reverse dns for the ip address. The idea here is to isolate the verification server as much as you can from your production web servers so that even if the verification server is attacked or compromised, the attacks will not immediately affect your production servers.
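The split design described above can be sketched in a few lines. This is a hypothetical sketch with invented names and addresses: the web server consults only a locally maintained, pre-verified database and never issues dns queries itself, while unknown addresses are handed off to the isolated verification server, modelled here as a simple queue.

```python
# The web server side of the design: local lookups only, no dns traffic.
from queue import Queue

verified_db = {"192.0.2.10": "examplebot"}   # locally maintained database
pending = Queue()                            # hand-off to the verification server

def classify(ip):
    """Return the verified bot name, or None while verification is pending."""
    if ip in verified_db:
        return verified_db[ip]
    pending.put(ip)   # the isolated server does the reverse lookup later
    return None

print(classify("192.0.2.10"))    # known, served from the local database
print(classify("203.0.113.9"))   # unknown, queued for offline verification
```

The verification server drains the queue, does its reverse lookups in isolation, and writes its verdicts back into the shared database; an attack triggered by those lookups never touches the production web server.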

Of course, if you are not in a position to maintain a verification server yourself, you can do the smart thing by signing up for a regularly updated free database like the one I offer.


If you insist on doing the verification on your production web servers instead of using an isolated server or signing up for a third-party database, then consider using the forward lookup verification method exclusively. I have described this verification method elsewhere. I will only say here that this method is much faster and more secure than the reverse lookup method that everyone else is promoting.

Also, when I first described this method over a year ago, the database contained only one or two records of servers that passed the forward lookup test. Today, however, I'm very pleased to count twelve matching records in the database. The interesting thing here is that because the small search engines pass the forward lookup test, verifying them is much faster and safer than verifying the big search engines!

I think we can expect that many smaller search engines will continue to take the initiative on this issue while the big search engines will continue to do their thing. Perhaps until a few high profile websites using the reverse lookup method are compromised using that method as an attack vector. To be forewarned is to be forearmed, like they say.

Friday, August 10, 2007

Reverse Proxies Are A Menace To Web Sessions — And To Bot Traps

A bottrap is a server-side script that tries to detect when too many requests are being sent to a web server by a client computer (and it may be designed to take some action to help reduce the load on the server). The web server can however not only receive requests from several computers A, B, C as if they are coming from a single forward proxy computer X, but it can also receive requests from a single computer Y as if they are coming from several reverse proxy computers P, Q, R. Take these things together and you get an interesting dynamic along with a very difficult problem to solve.

To understand the problem, suppose that your bottrap only checks the ip address of the client computer. Then requests from computers A, B, C will all appear to be coming from computer X and this will incorrectly trigger your bottrap sooner or later. This situation is not as far-fetched as it may seem because these days, many companies provide access to the internet through firewalls and/or proxies such as X. So in order to avoid blocking legitimate visitors to your site, your bottrap cannot rely on checking the ip address of the client computer alone.
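A toy illustration of this failure mode (the threshold is invented): three distinct clients A, B and C behind one forward proxy X all present X's address, so a per-ip counter trips the trap for everyone.

```python
# Naive per-ip request counting: shared proxies defeat it.
from collections import Counter

LIMIT = 2          # allow at most two requests per ip in the window
hits = Counter()

def bottrap_triggered(client_ip):
    hits[client_ip] += 1
    return hits[client_ip] > LIMIT

# One request each from A, B and C, all arriving as proxy X's address:
results = [bottrap_triggered("198.51.100.7") for _ in range(3)]
print(results)   # [False, False, True] -- the third visitor trips the trap
```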

What your bottrap needs to do is to associate each client computer with a session-id which the client computer must return with every request to your web server. But if you do this diligently, you will soon run into another problem with reverse proxies: requests from computer Y will now appear to come from computers P, Q, R and all of them will carry the same session-id. The problem is that from the point of view of your bot trapping code, there are at least three interpretations of the situation:

(a) the requests are coming from a single computer Y which is behind reverse proxies such as P, Q, R;

(b) the requests are coming from a single computer Y and a man-in-the-middle who is trying to hijack Y's session;

(c) the requests are coming from a stealth bot which is trying to avoid triggering your bot trap.

Because there is no way to distinguish these three cases, you are forced into the situation that if you allow requests from one case to be served, you will knowingly or unknowingly be serving requests from the other cases as well.

So a decision must be made. On our site we decided to disregard these requests altogether. When we issue a session-id to a client computer's ip address, we expect the session-id to be returned with requests from the same ip address. If the session-id is received with requests from a different ip address, the session-id is disregarded and a new session-id is issued for the new ip address. In practice this means that requests from case (b) are frustrated, so men-in-the-middle don't get to hijack our users' established sessions. It also frustrates requests from case (c) because no content is served until the session is established — and the session is established only when the session-id has been received back from the associated ip address.
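The policy can be sketched in a few lines (names invented for illustration): a session-id is bound to the ip address it was issued to, and a request presenting that session-id from any other address is disregarded and given a fresh session instead.

```python
# Session-ids bound to ip addresses: a mismatched id starts a new session.
import secrets

sessions = {}   # session_id -> ip address it was issued to

def handle_request(ip, session_id=None):
    if session_id is not None and sessions.get(session_id) == ip:
        return session_id             # established session from the same ip
    new_id = secrets.token_hex(16)    # mismatch or no id: issue a new session
    sessions[new_id] = ip
    return new_id

first = handle_request("10.0.0.1")           # new visitor gets a session-id
same = handle_request("10.0.0.1", first)     # same ip keeps its session
other = handle_request("10.0.0.2", first)    # cases (b)/(c): id disregarded
print(first == same, first == other)         # True False
```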

But requests from case (a) are a real problem because they too are frustrated. Luckily for us, reverse proxies do not seem to be as widely deployed as firewalls and forward proxies and so not many of our visitors are affected. (No, AOL users don't count — just kidding :)

Friday, May 11, 2007

Automatic Verification Of Machine Agents, Or How To Tell When Search Engine Names Are Spoofed

I didn't plan to post anything until next month, but I'm beginning to feel that were I to limit myself to one post per month as I originally intended, it would take forever to exhaust the topics that I want to write about. So for now I'm going to adopt an ad hoc schedule by posting whenever I have the time to write - and boy, do I have a lot to write about.

I mentioned in my last post that a lot of interesting technologies went into my web service. One of those technologies dealt with the problem of automatically detecting when the user agent string of a search engine is spoofed. There is a rather simple and elegant solution to the problem, but before I describe it, I want to describe the solution that the major search engines recommend. And in case anyone is thinking of following their recommendation, let me give you my own recommendation in one word: DON'T. Why? Because the solution that the search engines recommend is quite inefficient, overly complicated and totally unnecessary. Let me explain.

According to this link, this link and this post by Google's Matt Cutts, the recommended steps for detecting spoofed user agent strings are as follows.

(1) Start with the ip address of the suspected search engine that just sent a request to your webserver.

(2) Do a reverse DNS lookup on the ip address from the previous step. What do you get? A hostname? No, not on your life. The DNS system allows a single ip address to be mapped to multiple names; for example, a single ip address may be used to host a dns server, an ftp server, a mail server, and so on. It is possible that Google may have a policy of mapping a single ip address to a single name, but since when do you write code for Google only? Besides, what happens to your carefully crafted web service code if Google changes their policy, assuming they have such a policy to begin with? So if you want to write robust code that works in all cases, you must be prepared to handle a list of names when you do the reverse lookup that they are suggesting. Let's call this the name list NAMES[N].

(3) Now for each entry in the name list from the above step, you are supposed to do a forward DNS lookup. Again, what do you expect? A single ip address? Nope, not at all. You see, the DNS system allows a single name to be mapped to multiple addresses; for example, one well-known hostname mapped to four different addresses last time I checked. So if you want to write robust code that will work in all cases, you must be prepared to handle a list of ip addresses FOR EACH ENTRY IN THE NAME LIST from step (2). Are we there yet? Sorry, but no, not quite.

(4) From step (3) we have a list of ip addresses for each entry in our name list. Let's call this the address list, ADDR[N, M]. Now we must look in this address list for the ip address of the suspected search engine from step (1). If we don't find it, we can conclude that the user agent string is spoofed. Or can we? Look at the steps again carefully. We haven't used the user agent string at all! All we have done is check whether the suspected search engine has a proper reverse DNS entry or not. For example, if the search engine happens to be Googlebot connecting to your webserver from an ip address that Google has not set a reverse DNS entry for, the algorithm will conclude that the user agent string is spoofed --- which is not only misleading but also strictly incorrect. Even if we find the ip address of the suspected search engine in our ADDR list, we still can't conclude that the user agent string is not spoofed. In order to draw that conclusion, we must now check the entry in our NAMES list (i.e. the one that resulted in a matching ip address) to see if it belongs to Google's domain. But how do we know what Google's domains are? If we know them today, will they be the same tomorrow? What about other search engines? Wasn't the point of all this to avoid hardcoding the ip addresses of search engines into our code? Yet instead we are required to hardcode the search engine domain names, as if the search engines cannot change the domain names that their robots are crawling from!
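For reference, the four steps above can be sketched as follows. This is a hedged sketch using python's standard socket calls; the resolver arguments exist only so the logic can be exercised without live dns.

```python
# The reverse-then-forward procedure from steps (1)-(4).
import socket

def reverse_forward_check(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Return the first name whose forward lookup contains ip, else None."""
    try:
        primary, aliases, _ = reverse(ip)   # step (2): possibly many names
    except OSError:
        return None                         # no reverse entry at all
    for name in [primary] + aliases:        # the NAMES[N] list
        try:
            _, _, addrs = forward(name)     # step (3): many addresses per name
        except OSError:
            continue
        if ip in addrs:                     # step (4): match the original ip
            return name
    return None

# Exercised with stand-in resolvers instead of live dns:
fake_reverse = lambda ip: ("crawl-1.example.net", [], [ip])
fake_forward = lambda name: (name, [], ["192.0.2.10"])
print(reverse_forward_check("192.0.2.10", fake_reverse, fake_forward))
print(reverse_forward_check("198.51.100.1", fake_reverse, fake_forward))
```

Note that even when the function returns a name, the caller must still decide whether that name belongs to the search engine's domain, which is exactly the hardcoding problem discussed above.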

There is a simpler and much more elegant solution. What we need to do is break the problem down into two different problems. Problem #1 is how to verify whether or not the user agent string is spoofed. We must solve this problem in a way that does not depend on any particular search engine. If we determine that spoofing has taken place, there is no need to solve the second problem: the suspected search engine can be thrown off our server. So how do we solve problem #1? Well, we have the ip address of the suspected search engine, right? So what we need to do is try to obtain the ip address from another source and compare the two addresses. Which other source do we have at our disposal? The DNS system, of course. And how do we obtain an ip address from DNS? By doing a forward lookup on a hostname, of course. But where can we get the hostname from? Well, think very hard. Remember the user agent string that we are trying to verify? It already contains a url for most search engines, and a url does contain a hostname. So Bingo! The solution to problem #1 is to parse the user agent string for a url, extract a hostname from that url, do a single forward lookup on the hostname to get a list of addresses, and if the ip address of the suspected search engine is in the list, then the user agent string is verified --- otherwise, it is spoofed. That's how it's done on our site --- the proper way, if I may say so --- but no major search engine has passed this verification test yet.
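The forward-only check just described can be sketched as follows. The bot name and addresses are invented for illustration, and the resolver argument exists only so the logic can be exercised without live dns.

```python
# Forward-only verification: parse the user agent string for a url,
# forward-resolve its hostname once, and accept the request only if the
# connecting ip is among the results.
import re
import socket
from urllib.parse import urlparse

def forward_verify(ip, user_agent, forward=socket.gethostbyname_ex):
    match = re.search(r"https?://[^\s;)]+", user_agent)
    if not match:
        return False                  # no url in the string to verify against
    host = urlparse(match.group(0)).hostname
    try:
        _, _, addrs = forward(host)   # a single forward lookup
    except OSError:
        return False
    return ip in addrs                # verified only if the addresses match

# Exercised with a stand-in resolver instead of live dns:
ua = "Mozilla/5.0 (compatible; ExampleBot/1.0; +http://bot.example.com/info)"
fake_forward = lambda host: (host, [], ["192.0.2.10"])
print(forward_verify("192.0.2.10", ua, fake_forward))    # True: ip matches
print(forward_verify("198.51.100.1", ua, fake_forward))  # False: spoofed
```

One forward lookup on a hostname taken from the request itself, instead of a reverse lookup plus a forward lookup per returned name: faster, and no dns query is ever sent to a server chosen by the visitor.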

So what is problem #2 and how do we solve it? Problem #2 is to determine whether the verified user agent string belongs to your favourite search engine, be it Yahoo, or MSN or Google or whoever. And the solution? Simple - just check to see if the domain name in the verified user agent string belongs to the search engine. It is not a particularly clean solution because you still have to know in advance which domain names belong to which search engine. But then, the problem itself isn't particularly clean to begin with since it presumes that you favour some search engines over others for some reason; otherwise, why would you want to know if the search engine belongs to Google and not its competitors? On our site no search engine is favoured over any other, so luckily problem #2 is not an issue.

Wednesday, May 2, 2007

Goes Beta

After several months of development and testing, the site has been opened for testing by the general public. The development took much longer than anticipated, but I believe the effort has been worthwhile because the site is hands-down one of the most comprehensive and sophisticated bot tracking services on the internet. That in itself would be saying nothing, of course, if it wasn't for the pain and agony that bad bots are causing many website owners like myself.

There are those who think bots should be allowed to roam and plunder websites as they like, but I beg to differ. The bots were becoming an issue on my sites and I wasn't about to take it lying down. Since I could code my way out of the proverbial box, I felt it was time to do something about the problem. Before long I had a solution in place, and it worked pretty well as designed. But something was amiss.

It didn't take long for me to figure out what was missing. I had a pretty good solution to the bot problem and as an aspiring MicroISV I was always on the lookout for ways to help people solve problems with software. So naturally it occurred to me that with some additional effort, I could offer my bot solution to other website operators who are having headaches from bad bots and are willing to do something about the problem.

But I wasn't about to invest months of development time into a solution without doing some minimal market research first. So off I went to my favourite search engines to see what people were doing to solve their bot problem. Since I already had my own solution in place, I knew exactly what I was looking for.

It wasn't CrawlWall, which appears to have been in beta since late 2006. It wasn't BotSense, which appears to be abandoned. It wasn't AntiCrawl, which appears to be something out of the outer MLM world. It wasn't AntiLeech, which appears to work with WordPress only, and in any case I wasn't prepared to deploy PHP and MySQL on my server. What I was looking for was something close to my own solution because I was convinced that this was the best --- and maybe even the only viable --- solution to the bot problem.

Now if my conviction was accurate, then I would expect someone else on the internet to have had a hint of the solution by now. Without this additional confirmation that my solution is an idea whose time has come, I wasn't going to put too much time into offering the solution to others as a service. But then I came across this post by Kjell Gunnar Bleivik in which he basically outlines the essential components of the proposed solution. That was all the confirmation I needed. Time to get my hands dirty with some coding.

It's almost done now - thank goodness. The major components are all in place and working as designed. There are a few points of detail where my solution differs from the one described by Kjell Gunnar Bleivik, but that was to be expected. For instance, I believe strongly that human nature would cripple a system based on donations, so instead I offer both a free service and a paid service to help fund the system. I believe also that a system requiring a massive hidden network would eventually fail, so instead I invite website operators who are already running bot detection scripts to redirect their bad bots to nonexistent urls on my site, which automatically register the bots in the botslist database.

There are a lot of exciting technologies involved in the service, and I intend to describe some of these in future posts. I'm tempted to go on and on right here, but this post is long enough as it is and you, my dearly beloved reader, probably have something else to do. So I will leave the rest till next month. Enjoy.