Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator, and bug reports are now handled in Wikimedia Phabricator. This static website is read-only and kept for historical purposes: it is not possible to log in, and links other than those that display bug reports and their history may be broken. See T9289, the corresponding Phabricator task, for complete and up-to-date information on this bug report.
Bug 7289 - HTTP User-Agent: request header sometimes required.

Status: RESOLVED INVALID
Product: Wikimedia
Classification: Unclassified
Component: Bugzilla
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2006-09-11 08:17 UTC by Nick Simicich
Modified: 2006-09-12 17:02 UTC
CC List: 0 users

See Also:
Web browser: ---
Mobile Platform: ---

Attachments: (none)

Description Nick Simicich 2006-09-11 08:17:35 UTC
The US government, as part of the Voice of America's availability and
propagation monitoring, monitors various web pages to determine if they can be
accessed from locations in the Orient - I suppose you could say that they want
to be sure that their products are not disoriented. I will not guess at their
motives, but I noticed that www.wikipedia.org had a miserable availability
percentage - but it was not zero, making me wonder if it was poorly implemented
government censorship. I then wondered if the poor availability was Chinese
government censorship or if the problem was occidental.

This monitoring program and its detailed results are public record; the base URL
of the reporting program is http://voa.his.com/

For web pages, they compare availability of their own web pages with some other
widely used web pages such as www.yahoo.com and www.wikipedia.org.

http://asia.ibbmonitor.com/RMSWebMonitor/cgi-bin/WebMonitorCGI.acgi?webmon_query=yes

is the URL of the web page monitoring system.

When they note a failure, which seems to mean any return code other than 200,
they record the details of what happened and the contents of the successful or
error page; in case of failure a traceroute is also run and recorded automatically.

The recordings of their failure to fetch http://www.wikipedia.org/ showed a
return code of 

"403 forbidden resource requested". The failures presented a very short web page
with the content:

"Please provide a User-Agent header"

In attempting to duplicate the error (or to rule it out) for myself, I was able
to determine that the servers which serve www.wikipedia.org reject HTTP/1.0.
This is common, but I had to send more than one header to make HTTP/1.1 work.

I was sometimes able to duplicate the error they sometimes received with the
following command, run under bash on Linux. (nc is Netcat, a non-standard
command that simply connects to the host and port it is given, passes whatever
it reads on standard input to that connection, writes whatever it receives to
standard output, and, if asked, records some details of what it did on standard
error. If it is not installed, you probably should install it; failing that,
telnet might work if it does not mangle the stream too much, though that may
require a local option depending on the telnet implementation on your system.)

Anyway, this command SOMETIMES gets the error, since it is actually a problem of
non-transparent caching.

( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\n\r\n"; cat ) |nc -v
www.wikipedia.org 80 

This command NEVER got the error:

( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\nUser-Agent: I told you
not to be stupid, you moron\r\n\r\n"; cat ) |nc -v www.wikipedia.org 80 

Howard Stern fans note that this is not aimed at anyone and is a quote from the
show.

http://www.w3.org/Protocols/rfc2616/rfc2616.html says that the User-Agent header
field "SHOULD" be included by the user agent; however, it is not required.
Contrast this with the Host: header field, which "MUST" be included.
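
For anyone without netcat, the same two cases can be exercised with curl; this
is my own illustration rather than anything the monitoring system uses, and it
assumes a curl where passing a header with an empty value via -H suppresses
curl's default User-Agent (the probe name in the second command is made up):

# Request with no User-Agent header at all (allowed by RFC 2616, where
# User-Agent is only a SHOULD); -H 'User-Agent:' removes curl's default.
curl -s -o /dev/null -D - -H 'User-Agent:' http://www.wikipedia.org/

# The same request with an arbitrary User-Agent value supplied.
curl -s -o /dev/null -D - -A 'example-probe/0.1' http://www.wikipedia.org/

If the guess below about caching is right, the first form should reproduce the
intermittent 403 after a few tries and the second should not, matching the
netcat results above.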

My guess is that cached entries are matched against the User-Agent and that if
the User-Agent does not match, there is a cache miss. There is nothing wrong with
this, but I believe that the main Wikipedia engine gets this right, while Squid
gets it wrong. If this check is really needed, it should at least be applied
consistently.

The point is that the contents of the User-Agent line are meaningless. There is
no need to require it; if it is left out, just insert an uncommon string as a
default. In fact, it arguably violates the standard to reject the request, but
randomly rejecting some requests and not others is certainly not something that
should be done; at the least, the cache should be transparent.
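
A minimal sketch of the fallback suggested above, written as a CGI-style shell
fragment purely for illustration (the real change would belong in the Squid or
PHP layer; HTTP_USER_AGENT is the conventional CGI variable and the default
string here is invented):

#!/bin/bash
# If the client sent no User-Agent header, substitute an uncommon fixed
# default instead of rejecting the request outright.
ua="${HTTP_USER_AGENT:-unspecified-user-agent/0.0}"

# The (possibly substituted) value can then be used consistently, for
# example as part of whatever cache key the front end builds.
echo "Effective User-Agent: $ua"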

This is an example of the error recurring moments ago - prompts were edited to
protect the innocent. The output was not edited at all.

$ ( echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\n\r\n"; cat ) | nc -v www.wikipedia.org 80
rr.pmtpa.wikimedia.org [66.230.200.112] 80 (http) open
HTTP/1.0 403 Forbidden
Date: Mon, 11 Sep 2006 07:37:39 GMT
Server: Apache
X-Powered-By: PHP/5.1.2
Content-Type: text/html; charset=utf-8
X-Cache: MISS from will.wikimedia.org
X-Cache-Lookup: MISS from will.wikimedia.org:80
Via: 1.0 will.wikimedia.org:80 (squid/2.6.STABLE3)
Connection: close

Please provide a User-Agent header

[press enter]
$

If you do not have netcat installed, you may be able to use bash and cat in this
manner:

$ (echo -en "GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\n\r\n" 1>&0; cat -nA 1>&2 & cat </dev/tty 1>&0) 0<>/dev/tcp/www.wikipedia.com/80
     1  HTTP/1.0 403 Forbidden^M$
     2  Date: Mon, 11 Sep 2006 08:13:20 GMT^M$
     3  Server: Apache^M$
     4  X-Powered-By: PHP/5.1.2^M$
     5  Content-Type: text/html; charset=utf-8^M$
     6  X-Cache: MISS from srv10.wikimedia.org^M$
     7  X-Cache-Lookup: MISS from srv10.wikimedia.org:80^M$
     8  Via: 1.0 srv10.wikimedia.org:80 (squid/2.6.STABLE3)^M$
     9  Connection: close^M$
    10  ^M$
    11  Please provide a User-Agent header
cat: write error: Bad file descriptor
$
In this case, the cat error is generated locally.
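
The spurious descriptor error can be avoided with a slightly tidier variant of
the same /dev/tcp trick; this is my own sketch, not something from the
monitoring logs:

# Open a bidirectional TCP connection to port 80 on file descriptor 3.
exec 3<>/dev/tcp/www.wikipedia.org/80

# Send the request; "Connection: close" makes the server close the
# connection when it is done, so the read below terminates by itself.
printf 'GET / HTTP/1.1\r\nHost: www.wikipedia.org\r\nConnection: close\r\n\r\n' >&3

# Read and display the response, then close the descriptor.
cat <&3
exec 3>&-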
Comment 1 Brion Vibber 2006-09-11 08:19:07 UTC
This is to block simple, abusive bots.
Comment 2 Nick Simicich 2006-09-12 03:52:06 UTC
In that case it should be fixed to always fail. As it sits, it fails sometimes
and not always, and anyone who decides the site is simply unreliable can usually
get past the block by trying over and over again. This will actually cause more
work for the site, and will do no more than slow down the scum who are ignoring
robots.txt.
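
To make the point concrete, here is a rough sketch (mine, not the commenter's)
of the retry behaviour described above: a client that treats the 403 as
flakiness and simply keeps asking until something else comes back:

# Repeat a User-Agent-less request until the response is not a 403.
# Each retry is extra load on the site; the block only delays the client.
while :; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -H 'User-Agent:' http://www.wikipedia.org/)
  [ "$code" != "403" ] && break
  sleep 1
done
echo "Eventually got HTTP $code"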
