Last modified: 2014-05-14 22:20:38 UTC
Hi all;

I'm trying to download Wikimedia Commons, but I have found some errors. For example:
* oi_archive_name is empty for this file: http://commons.wikimedia.org/wiki/File:Nl-scheikundig.ogg#filehistory
* the link is broken and you get an empty file: http://commons.wikimedia.org/wiki/File:SMS_Bluecher.jpg#filehistory

Are you aware of these errors in old files? Is this going to be fixed?

Regards,
emijrp
(In reply to comment #0)
> Are you aware of these errors in old files? Is this going to be fixed?

It can only be fixed if said files exist in some backup or similar.
The files may still be on NFS; I've seen this in various places.
(In reply to comment #1)
> It can only be fixed if said files exist in some backup/similar

There are more errors like those; I didn't make a comprehensive list.
There are more bugs like this.
Just came across this on http://commons.wikimedia.org/wiki/File:Pyrenees_relief_map_with_rivers-fr.svg
Bawolff, do you have suggestions on how to break this bug down into actionable items? We probably need the following:
1) a maintenance script to list the files affected by each of the problems in question (oi_archive_name empty, archived versions whose links return "404 Not Found", etc.);
2) scripts or some other means to correct the wrong metadata (where that's the problem), or to look for the missing files on NFS and restore them;
3) a bug to track the need to do something about the leftovers.

I'm downloading all the Commons files with emijrp's script, so we already have huge lists of suspects, e.g. https://archive.org/download/wikimediacommons-201208/2012-08-check.txt
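A minimal sketch of how item 1 might sort suspects into the two problem classes named above. Everything here is an assumption for illustration: the `OldVersion` record, the `http_status` field (the result of fetching the archived URL, collected elsewhere) and the sample file names are hypothetical, and this is not an existing MediaWiki maintenance script.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class OldVersion(NamedTuple):
    """One old file version; http_status is a hypothetical, separately collected field."""
    oi_name: str
    oi_timestamp: str
    oi_archive_name: str
    http_status: Optional[int]  # None if the archived URL was never fetched


def classify(version: OldVersion) -> str:
    """Return a label for the kind of problem, or 'ok' if none is detected."""
    if version.oi_archive_name == "":
        return "empty_archive_name"   # metadata problem
    if version.http_status is None:
        return "unchecked"            # cannot tell without fetching the URL
    if version.http_status == 404:
        return "missing_file"         # archived version links to 404 Not Found
    return "ok"


def group_suspects(versions: Iterable[OldVersion]) -> Dict[str, List[str]]:
    """Group the names of problematic versions by problem label."""
    groups: Dict[str, List[str]] = {}
    for version in versions:
        label = classify(version)
        if label != "ok":
            groups.setdefault(label, []).append(version.oi_name)
    return groups


# Made-up sample rows, not real Commons data.
SAMPLE = [
    OldVersion("Example_a.ogg", "20050101000000", "", None),
    OldVersion("Example_b.jpg", "20060202000000", "20060202000000!Example_b.jpg", 404),
    OldVersion("Example_c.svg", "20070303000000", "20070303000000!Example_c.svg", 200),
]

if __name__ == "__main__":
    for label, names in group_suspects(SAMPLE).items():
        print(label, names)
```

Keeping the classification separate from the fetching means the same grouping could be fed from an existing suspects list, if it were first parsed into such records.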
(Data loss -> critical.)
Well, the easiest ones to find would be everything returned by:
select oi_name, oi_timestamp from oldimage where oi_archive_name = '';
This could be done by anyone with Labs access.

After that, one can look in the thumbnail log. From what I've seen of it, it's full of lines about thumbnails failing due to a missing src path (this seems to be the main cause of failing PNG thumbnails now that vips has removed the size limit on that format).

As an aside, it'd be nice if we graphed the number of missing files somewhere in Ganglia. Anecdotally, it seems like there are more of them than there used to be. It would be good to get real stats on this very scary problem.
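For anyone who wants to try the query before running it on Labs, here is a rough sketch against a throwaway in-memory SQLite table. The toy table has only the three columns the query touches, and the rows are invented; the real `oldimage` table and the Labs database connection are not modelled here.

```python
import sqlite3

# Same query as in the comment above.
QUERY = "SELECT oi_name, oi_timestamp FROM oldimage WHERE oi_archive_name = ''"


def find_empty_archive_names(conn: sqlite3.Connection) -> list:
    """Return (oi_name, oi_timestamp) for rows whose oi_archive_name is empty."""
    return conn.execute(QUERY).fetchall()


def make_demo_db() -> sqlite3.Connection:
    """Build a toy stand-in for oldimage with made-up rows (an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE oldimage (oi_name TEXT, oi_timestamp TEXT, oi_archive_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO oldimage VALUES (?, ?, ?)",
        [
            ("Example_a.ogg", "20050101000000", ""),
            ("Example_b.jpg", "20060202000000", "20060202000000!Example_b.jpg"),
        ],
    )
    return conn


if __name__ == "__main__":
    for name, timestamp in find_empty_archive_names(make_demo_db()):
        print(name, timestamp)
```

A count over the same condition, sampled regularly, would be one possible source for the graph suggested above.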
Btw, one probable cause of recent incidents may have been fixed: see bug 54736. See also related bug 54776.
*** Bug 60766 has been marked as a duplicate of this bug. ***
*** Bug 41320 has been marked as a duplicate of this bug. ***
*** Bug 56218 has been marked as a duplicate of this bug. ***