Last modified: 2006-09-12 17:02:17 UTC
The US government, as part of the Voice of America's availability and
propagation monitoring, checks various web pages to determine whether they can
be accessed from locations in the Orient - I suppose you could say that they
want to be sure that their products are not disoriented. I will not guess at
their motives, but I noticed that www.wikipedia.org had a miserable
availability percentage - yet it was not zero, which made me wonder whether
this was poorly implemented government censorship. Was the poor availability
Chinese government censorship, or was the problem occidental?
This monitoring program and its detailed results are public record; the base
URL of the reporting program is http://voa.his.com/ and the web page
monitoring system is reached from there.
For web pages, they compare the availability of their own web pages with that
of some other widely used web pages such as www.yahoo.com and
www.wikipedia.org.
They record details of what happened whenever they note a failure - which
seems to mean a return code other than 200 - along with the contents of the
successful or the error page; a traceroute is automatically done and recorded
in case of failure.
Their records of the failures to fetch http://www.wikipedia.org/ showed a
return code of "403 forbidden resource requested". The failures presented a
very short web page with the content:
"Please provide a User-Agent header"
In attempting to duplicate the error (or not) for myself, I was able to
determine that the servers which serve www.wikipedia.org reject HTTP/1.0. This
is common, but I had to send more than one header to make HTTP/1.1 work.
I was able to sometimes duplicate the error that they sometimes received with
the following command, run under bash in Linux. (nc is Netcat, a non-standard
command that simply connects to the system and port it is told to connect to,
passes the data it gets on standard in to that port, puts what it receives on
standard out, and - if told to - records some things about what it does on
standard error. If it is not installed, you probably should install it;
failing that, telnet might work if it does not mung the stream much, though
this may require a local option depending on the implementation of telnet on
your system.)
Anyway, this command SOMETIMES gets the error, since the failure is
intermittent:
( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\n\r\n"; cat ) |nc -v www.wikipedia.org 80
This command NEVER got the error:
( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\nUser-Agent: I told you
not to be stupid, you moron\r\n\r\n"; cat ) |nc -v www.wikipedia.org 80
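To make the difference between the two commands concrete, here is a small
Python sketch of the same raw requests. (build_request is my own throwaway
helper, not part of any library or of the monitoring system.) It reflects the
point below: HTTP/1.1 demands the Host header, while User-Agent is optional
and gets added only when supplied.

```python
# Sketch: construct the same raw HTTP/1.1 requests the nc commands send.
# build_request is a hypothetical helper written for this illustration.

def build_request(host, path="/", user_agent=None):
    """Return the raw bytes of a minimal HTTP/1.1 GET request.

    Host: is mandatory in HTTP/1.1; User-Agent is merely recommended,
    so it is included only if the caller supplies one.
    """
    lines = [
        "GET %s HTTP/1.1" % path,
        "Host: %s" % host,          # the one header HTTP/1.1 requires
    ]
    if user_agent is not None:
        lines.append("User-Agent: %s" % user_agent)
    # Each header line ends in CRLF; a blank line ends the request.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

# The request that SOMETIMES draws the 403 from the squid caches:
bare = build_request("www.wikipedia.org")
# The request that never failed in my tests:
polite = build_request("www.wikipedia.org", user_agent="not-a-bot/1.0")
```

The bytes of `bare` are exactly what the first nc command pipes to port 80.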
Howard Stern fans will note that this is not aimed at anyone - it is a quote
from the show.
http://www.w3.org/Protocols/rfc2616/rfc2616.html says that the User-Agent
header field "SHOULD" be included by the user agent. However, it is not
required. Contrast this with the Host: header field, which "MUST" be included.
My guess is that the caching is matched against the User-Agent and that if the
User-Agent does not match, there is a cache miss. There is nothing wrong with
this, but I believe that the main wikipedia engine gets this right, while
squid gets it wrong. If this behavior is really needed, it should at least be
applied consistently.
The point is that the contents of the User-Agent line are meaningless. There
is no need to require it; if it is left out, just insert an uncommon string as
a default. In fact, it arguably violates the standard to reject the request -
but randomly rejecting some requests and not others is certainly not something
that should be done. At the least, the cache should be transparent.
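If my guess is right, the fix is nearly a one-liner. Here is a toy model in
Python - my own speculation, not squid's actual code, and every name in it is
made up - of a cache keyed on (URL, User-Agent), showing the suspected failure
mode and the uncommon-default-string fix:

```python
# Toy model of a cache whose key includes the User-Agent header.
# This is a guess at the failure mode, NOT squid's real implementation.

DEFAULT_UA = "no-user-agent-supplied/0"  # an uncommon default string

class ToyCache:
    def __init__(self, normalize=False):
        self.store = {}
        self.normalize = normalize

    def key(self, url, user_agent):
        if self.normalize and user_agent is None:
            # The suggested fix: never reject a UA-less request,
            # just substitute a default before building the key.
            user_agent = DEFAULT_UA
        return (url, user_agent)

    def get(self, url, user_agent):
        # None means a MISS; in the real system a UA-less miss
        # apparently falls through to something that returns 403.
        return self.store.get(self.key(url, user_agent))

    def put(self, url, user_agent, page):
        self.store[self.key(url, user_agent)] = page

# Suspected behavior: a request with no User-Agent never matches anything.
broken = ToyCache()
broken.put("/", "Mozilla/5.0", "<html>...</html>")
miss = broken.get("/", None)           # always a MISS

# Fixed behavior: a missing User-Agent maps to the default and is cacheable.
fixed = ToyCache(normalize=True)
fixed.put("/", None, "<html>...</html>")
hit = fixed.get("/", None)             # a HIT
```

The fixed variant never needs to reject anything, and the cache stays
transparent to clients that omit the header.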
This is an example of the error recurring moments ago - prompts were edited to
protect the innocent. The output was not edited at all.
$ ( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\n\r\n"; cat ) |nc -v www.wikipedia.org 80
rr.pmtpa.wikimedia.org [126.96.36.199] 80 (http) open
HTTP/1.0 403 Forbidden
Date: Mon, 11 Sep 2006 07:37:39 GMT
Content-Type: text/html; charset=utf-8
X-Cache: MISS from will.wikimedia.org
X-Cache-Lookup: MISS from will.wikimedia.org:80
Via: 1.0 will.wikimedia.org:80 (squid/2.6.STABLE3)
Please provide a User-Agent header
If you do not have netcat installed, you may be able to use bash and cat in
this way:
$ (echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\n\r\n" 1>&0; cat -nA
1>&2 & cat </dev/tty 1>&0) 0<>/dev/tcp/www.wikipedia.org/80
1 HTTP/1.0 403 Forbidden^M$
2 Date: Mon, 11 Sep 2006 08:13:20 GMT^M$
3 Server: Apache^M$
4 X-Powered-By: PHP/5.1.2^M$
5 Content-Type: text/html; charset=utf-8^M$
6 X-Cache: MISS from srv10.wikimedia.org^M$
7 X-Cache-Lookup: MISS from srv10.wikimedia.org:80^M$
8 Via: 1.0 srv10.wikimedia.org:80 (squid/2.6.STABLE3)^M$
9 Connection: close^M$
10 ^M$
11 Please provide a User-Agent header
cat: write error: Bad file descriptor
In this case, the cat error is generated locally.
Presumably this is meant to block simple, abusive bots. In that case it should
be fixed to always fail. As it sits, it fails sometimes and not always, and a
client that decided the site was unreliable could usually get past the issue
simply by retrying over and over. This actually causes more work for the
site - and does no more than slow down the scum who are ignoring robots.txt.
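To put a rough number on that extra work: if a fraction p of attempts draw the
403, a client that retries until success averages 1/(1-p) requests per page.
A quick simulation confirms it (the failure rate here is invented for
illustration, not measured from the VOA data):

```python
import random

# Illustrative only: p is a made-up failure probability, not a measurement.
def requests_per_success(p, trials=100_000, rng=None):
    """Average attempts a retry-until-success client makes when each
    attempt independently fails with probability p."""
    rng = rng or random.Random(42)   # fixed seed for reproducibility
    total = 0
    for _ in range(trials):
        attempts = 1
        while rng.random() < p:      # got the 403; try again
            attempts += 1
        total += attempts
    return total / trials

# The attempt count is geometric, so the exact mean is 1/(1-p).
avg = requests_per_success(0.5)      # roughly 2.0
```

With p = 0.5, the retrying client costs the site about twice the requests that
either a consistent failure or a consistent success would.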