Last modified: 2006-09-12 17:02:17 UTC
The US government, as part of the Voice of America's availability and propagation monitoring, monitors various web pages to determine whether they can be accessed from locations in the Orient - I suppose you could say that they want to be sure that their products are not disoriented. I will not guess at their motives, but I noticed that www.wikipedia.org had a miserable availability percentage - yet it was not zero, which made me wonder whether this was poorly implemented government censorship. I then wondered whether the poor availability was Chinese government censorship or whether the problem was occidental.

This monitoring program and its detailed results are public record; the base URL of the reporting program is http://voa.his.com/ For web pages, they compare the availability of their own web pages with some other widely used web pages such as www.yahoo.com and www.wikipedia.org. The web page monitoring system itself is at http://asia.ibbmonitor.com/RMSWebMonitor/cgi-bin/WebMonitorCGI.acgi?webmon_query=yes When they note a failure - which seems to mean any return code other than 200 - they record the details of what happened and the contents of the successful or error page, and a traceroute is automatically done and recorded in case of failure.

The recordings of their failures to fetch http://www.wikipedia.org/ showed a return code of "403 forbidden resource requested". The failures presented a very short web page with the content:

Please provide a User-Agent header

In attempting to duplicate the error (or not) for myself, I was able to determine that the servers which serve www.wikipedia.org reject HTTP/1.0. This is common, but I had to send more than one header to make HTTP/1.1 work. I was able to sometimes duplicate the error that they sometimes received with the following command, run under bash in Linux. (nc is Netcat, a non-standard command that simply connects to the system and port it is told to connect to, passes the data it gets on standard in to that port, puts what it receives on standard out, and, if told to, records some things about what it does on standard error. If it is not installed, you probably should install it; failing that, telnet might work if it does not mangle the stream too much, though this may require a local option depending on the implementation of telnet on your system.) Anyway, this command SOMETIMES gets the error, since it is actually a problem of non-transparent caching:

( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\n\r\n"; cat ) |nc -v www.wikipedia.org 80

This command NEVER got the error:

( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\nUser-Agent: I told you not to be stupid, you moron\r\n\r\n"; cat ) |nc -v www.wikipedia.org 80

Howard Stern fans note that this is not aimed at anyone and is a quote from the show.

http://www.w3.org/Protocols/rfc2616/rfc2616.html says that the User-Agent header field "SHOULD" be included by the user agent. However, it is not required. Contrast this with the Host: header field, which "MUST" be included.

My guess is that the caching is matched against the user-agent, and that if the user-agent does not match, there is a cache miss. There is nothing wrong with that in itself, but I believe that the main wikipedia engine gets this right, while squid gets it wrong. Even if a user-agent really is needed for something, the point is that the contents of the user-agent line are meaningless. There is no need to require it; if it is left out, just insert an uncommon string as a default.
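If anyone wants to test the cache-keyed-on-user-agent guess for themselves, something like the following sketch may do it. The agent strings are placeholders I made up, and I am assuming that the X-Cache and X-Cache-Lookup headers in the reply are an honest hit/miss indicator - neither assumption comes from the monitoring people or from wikipedia. Run the first command twice with the same made-up agent, then switch to the second agent, and interrupt nc with control-C once the headers appear:

( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\nUser-Agent: made-up-agent-one/0.1\r\n\r\n"; cat ) |nc -v www.wikipedia.org 80 |egrep -i '^(HTTP/|X-Cache)'

( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\nUser-Agent: made-up-agent-two/0.1\r\n\r\n"; cat ) |nc -v www.wikipedia.org 80 |egrep -i '^(HTTP/|X-Cache)'

If the guess is right, repeating the first command should eventually show HIT, while switching to the second agent should drop straight back to MISS.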
In fact, it arguably violates the standard to reject the request - and randomly rejecting some requests and not others is certainly not something that should be done; at the very least, the cache should be transparent.

This is an example of the error recurring moments ago - prompts were edited to protect the innocent. The output was not edited at all.

$ ( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\n\r\n"; cat ) |nc -v www.wikipedia.org 80
rr.pmtpa.wikimedia.org [66.230.200.112] 80 (http) open
HTTP/1.0 403 Forbidden
Date: Mon, 11 Sep 2006 07:37:39 GMT
Server: Apache
X-Powered-By: PHP/5.1.2
Content-Type: text/html; charset=utf-8
X-Cache: MISS from will.wikimedia.org
X-Cache-Lookup: MISS from will.wikimedia.org:80
Via: 1.0 will.wikimedia.org:80 (squid/2.6.STABLE3)
Connection: close

Please provide a User-Agent header
[press enter]
$

If you do not have netcat installed, you may be able to use bash and cat in this manner:

$ (echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\n\r\n" 1>&0; cat -nA 1>&2 & cat </dev/tty 1>&0) 0<>/dev/tcp/www.wikipedia.com/80
     1  HTTP/1.0 403 Forbidden^M$
     2  Date: Mon, 11 Sep 2006 08:13:20 GMT^M$
     3  Server: Apache^M$
     4  X-Powered-By: PHP/5.1.2^M$
     5  Content-Type: text/html; charset=utf-8^M$
     6  X-Cache: MISS from srv10.wikimedia.org^M$
     7  X-Cache-Lookup: MISS from srv10.wikimedia.org:80^M$
     8  Via: 1.0 srv10.wikimedia.org:80 (squid/2.6.STABLE3)^M$
     9  Connection: close^M$
    10  ^M$
    11  Please provide a User-Agent header
cat: write error: Bad file descriptor
$

In this case, the cat error is generated locally.
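For what it is worth, the same /dev/tcp trick can be written with a separate file descriptor, which avoids the file descriptor 0 gymnastics. This is only a sketch, assuming a bash built with /dev/tcp support (some distributions compile it out); asking for Connection: close means the server ends the connection itself, so the command returns to the prompt on its own with no stray cat error:

exec 3<>/dev/tcp/www.wikipedia.org/80     # open a TCP connection on descriptor 3
echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\nConnection: close\r\n\r\n" >&3
cat -A <&3                                # dump the reply, showing carriage returns as ^M
exec 3>&-                                 # close the descriptor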
This is to block simple, abusive bots.
In that case it should be fixed so that it always fails. As it stands it fails only sometimes, so a client that decides the site is unreliable can usually get through simply by retrying over and over. That retrying actually causes more work for the site - and does nothing beyond slowing down the scum who ignore robots.txt anyway.
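To make that concrete, here is a sketch of the sort of retry loop any client - polite or abusive - could wrap around the request. Nothing in it is specific to any real bot; the one second delay is an arbitrary number I picked, and it reuses the /dev/tcp trick from above. Every time the 403 comes back, the loop simply asks again, so each rejection costs the squids another request instead of getting rid of the client:

while :; do
    exec 3<>/dev/tcp/www.wikipedia.org/80
    echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\nConnection: close\r\n\r\n" >&3
    read -r status <&3            # the first line of the reply is the status line
    exec 3>&-
    case "$status" in
        *' 403 '*) sleep 1 ;;     # rejected - just try again, generating yet more load
        *)         echo "$status"; break ;;
    esac
done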