Last modified: 2014-03-01 21:26:02 UTC

Bug 19242 - [dbzip2] English Wikipedia dump has a wrong size
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal major
Target Milestone: ---
Assigned To: Tomasz Finc
URL: http://download.wikimedia.org/enwiki/...
Reported: 2009-06-16 18:38 UTC by Gene Wang
Modified: 2014-03-01 21:26 UTC (History)
Web browser: Internet Explorer

Description Gene Wang 2009-06-16 18:38:41 UTC
The size listed on the download page is 4.9GB, but the file actually downloaded is only 899MB.

In WinXP/IE6, the download completes at 899MB. The download progress window states that the target size is 899MB.

In Vista/IE6, the file cannot be downloaded. IE complains about an HTTP header error.

In Vista/IE7, the download completes at 899MB. The download progress window states that the target size is 4.9GB.

wget also downloads a file of only 899MB.

899MB is too small for an English Wikipedia article dump. The previous dump (July 2008) was around 3.8GB.
Comment 1 Gurch 2009-06-16 23:05:26 UTC
(In reply to comment #0)
> Previous dump (July 2008) was around 3.8GB.

The previous dump was the week before, and there have been 5 since dumps were started again in May, e.g. http://download.wikimedia.org/enwiki/20090610/

Are all of these affected? I don't have the bandwidth to find out.

Comment 2 Gene Wang 2009-06-17 21:26:20 UTC
I obtained a dump successfully about a year ago.

I tried all the dumps currently available at http://download.wikimedia.org/enwiki and they had the same problem.
Comment 3 Brion Vibber 2009-06-23 01:28:12 UTC
Note that you may have an old version of wget which was known to have problems with files over 4GB.

Have never attempted large files with IE, but recent versions presumably should work if on an NTFS filesystem. (If you're downloading to a FAT32 filesystem like many default USB drives it will likely fail, but I think it should fail differently -- reporting an error at 2GB or 4GB rather than cropping off to the 0.9GB.)

Another possibility is that you're accessing the internet through a proxy which fails to understand large files properly. This might explain the Content-Length header being passed on (so you get a correct report of the 4.9GB to come) but the intermediary crapping out at the 0.9GB 32-bit-wrapped limit.

Tomasz, putting this one on your bench; be good to double-check we haven't broken the server or something ;) but afaik it should be serving out fine.
Comment 4 Tomasz Finc 2009-06-23 02:48:22 UTC
Did a quick verify on OSX 10.5 using wget 1.11.4 and everything is showing up just like it should. 4.9GB

/opt/local/bin/wget -S http://download.wikimedia.org/enwiki/20090610/enwiki-20090610-pages-articles.xml.bz2
......
HTTP request sent, awaiting response... 
  HTTP/1.0 200 OK
  Connection: keep-alive
  Content-Type: application/octet-stream
  Accept-Ranges: bytes
  Content-Length: 5227630350
  Date: Tue, 23 Jun 2009 02:24:22 GMT
  Server: lighttpd/1.4.19
Length: 5227630350 (4.9G) [application/octet-stream]

Looking at http://tinyurl.com/ozafl2 shows the same correct content length header being returned if the user agent is IE.

It also correctly downloads past 899MB from my personal server that is not running in the wikimedia cluster.

I'll try this with IE after re-installing Windows in the next day or so just to make sure it's not an issue with the browser, but otherwise I'm really suspecting a 32-bit proxy here.

Gene, could you post the Content-Length you are seeing on the wget downloads by adding "-S"?

It would also be nice to know if you are going through a 32-bit proxy as Brion says. That could easily do it.
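
To illustrate the 32-bit suspicion, here is a minimal Python sketch of the arithmetic. It assumes (nothing in this report confirms it) that the intermediary keeps the body length in an unsigned 32-bit counter, so it effectively sees Content-Length modulo 2**32. The function names are made up for illustration; the Content-Length value is the one from the wget -S output above.

```python
# Assumption: a faulty intermediary stores the body length in an unsigned
# 32-bit counter, so it only sees the real length modulo 2**32.
UINT32_MODULUS = 2 ** 32


def wrapped_length(content_length: int) -> int:
    """Return the length that an unsigned 32-bit counter would hold."""
    return content_length % UINT32_MODULUS


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


# Content-Length of enwiki-20090610-pages-articles.xml.bz2 as reported by wget -S.
content_length = 5227630350
wrapped = wrapped_length(content_length)

print(f"real length:    {content_length} bytes ({to_mib(content_length):.1f} MiB)")
print(f"wrapped length: {wrapped} bytes ({to_mib(wrapped):.1f} MiB)")
```

For the 20090610 dump the wrapped length comes out at 932663054 bytes, roughly 0.9GB, which is in the same range as the truncated downloads reported in this bug.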

Comment 5 Gene Wang 2009-06-23 18:38:21 UTC
I guess the problem is related to the proxy server. I switched to a different proxy server and a download attempt via IE stopped at 918MB.

I obtained wget 1.11.4 and ran it with the -S option. There was a hiccup at around 900MB -- the connection got closed. Fortunately wget was robust enough this time to reconnect. So far it is still running smoothly with ~3GB downloaded:

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = c:\Program Files\GnuWin32/etc/wgetrc
--2009-06-23 10:21:24--  http://download.wikimedia.org/enwiki/20090618/enwiki-20090618-pages-articles.xml.bz2
Resolving download.wikimedia.org... 208.80.152.183
Connecting to download.wikimedia.org|208.80.152.183|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Connection: Keep-Alive
  Proxy-Connection: Keep-Alive
  Content-Length: 5258589574
  Date: Tue, 23 Jun 2009 17:21:25 GMT
  Content-Type: application/octet-stream
  Server: lighttpd/1.4.19
  Accept-Ranges: bytes
Length: 5258589574 (4.9G) [application/octet-stream]
Saving to: `enwiki-20090618-pages-articles.xml.bz2'

18% [=================>                                                                                 ] 963,624,164  763K/s   in 21m 9s

2009-06-23 10:42:34 (741 KB/s) - Connection closed at byte 963624164. Retrying.

--2009-06-23 10:42:35--  (try: 2)  http://download.wikimedia.org/enwiki/20090618/enwiki-20090618-pages-articles.xml.bz2
Connecting to download.wikimedia.org|208.80.152.183|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 206 Partial Content
  Connection: close
  Proxy-Connection: close
  Content-Length: 4294965410
  Date: Tue, 23 Jun 2009 17:42:36 GMT
  Content-Range: bytes 963624164-5258589573/5258589574
  Content-Type: application/octet-stream
  Server: lighttpd/1.4.19
  Accept-Ranges: bytes
Length: 5258589574 (4.9G), 4294965410 (4.0G) remaining [application/octet-stream]
Saving to: `enwiki-20090618-pages-articles.xml.bz2'

58% [+++++++++++++++++======================================>                                         ] 3,086,259,252  773K/s  eta 46m 20s
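
The numbers in the retry above can be cross-checked with a short Python sketch. It parses the Content-Range header from the 206 response, checks that the remaining length matches the Content-Length sent with it, and compares the byte offset where the first connection was closed against the total length modulo 2**32. The helper name parse_content_range is hypothetical, and treating the proxy as a 32-bit counter is only an assumption.

```python
import re
from typing import Tuple


def parse_content_range(value: str) -> Tuple[int, int, int]:
    """Parse a 'bytes first-last/total' header value into (first, last, total)."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range value: {value!r}")
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


# Values copied from the wget -S log above.
first, last, total = parse_content_range("bytes 963624164-5258589573/5258589574")
content_length_206 = 4294965410

# Both ends of a byte range are inclusive, hence the +1.
remaining = last - first + 1
print("remaining matches Content-Length:", remaining == content_length_206)

# Assumption: a 32-bit counter would wrap the total length modulo 2**32.
wrapped_total = total % 2 ** 32
print("wrapped total length:", wrapped_total)
print("bytes received past the wrapped length:", first - wrapped_total)
```

The cut-off lands 1886 bytes past the wrapped length, which is at least consistent with a 32-bit limit somewhere between client and server, though it does not prove it.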

Thanks to all of you who have helped!
Comment 6 Tomasz Finc 2009-06-23 22:02:20 UTC
No problem. Let us know if anything else pops up.
Comment 7 Antoine "hashar" Musso (WMF) 2012-02-21 14:56:08 UTC
moving product dbzip2 to product Wikimedia tools
