Last modified: 2012-09-19 18:19:33 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T37047, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 35047 - [Regression] File deletions don't always purge everything
[Regression] File deletions don't always purge everything
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Media storage (Other open bugs)
unspecified
All All
: Highest critical (vote)
: ---
Assigned To: Tim Starling
:
Depends on:
Blocks: 31217
  Show dependency treegraph
 
Reported: 2012-03-08 00:02 UTC by Rainer Rillke @commons.wikimedia
Modified: 2012-09-19 18:19 UTC (History)
9 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Rainer Rillke @commons.wikimedia 2012-03-08 00:02:03 UTC
On Commons, nowadays we sometimes notice that after a file deletion via API, the file-revision-table remains visible and links to the files are not rendered as redlinks. Purging helps but if you delete, you do not expect having to go to the page and purge it.

I will try to get some screenshots.
Comment 1 Krinkle 2012-03-08 07:03:05 UTC
Not just the API.

For the past hour I've been working through a backlog of copyright violations on Commons based on this category:

https://commons.wikimedia.org/wiki/Category:Copyright_violations

All 3 files I deleted in the last 5 minutes however are different, something weird happened (or rather, didn't happen). The category member links were kept and after refreshing they're still showing up in the category (although the thumbnail is broken, and opening the link shows a "This page was recently deleted" message).

[[commons:File:Re7.png]], [[commons:File:Re5.png]], [[commons:File:Re3.png]]
Comment 2 Krinkle 2012-03-08 07:03:51 UTC
Looks like after purging the category page and file pages, they were removed from the category.

It shouldn't be possible for a non-existent page to be categorized :D
Comment 3 Krinkle 2012-03-08 07:10:44 UTC
The POST request submitting action=delete sometimes takes about to 2-3 seconds to respond.
Comment 4 Rainer Rillke @commons.wikimedia 2012-03-08 10:57:03 UTC
For me it sometimes takes up to 10s but I am on the "wrong server" (secure.wikimedia.org)
Comment 5 Jameslwoodward 2012-03-08 15:55:58 UTC
Until around the middle of February, I could use DelReqHandler to delete all of the files in a Mass DR in about ten seconds -- hit the {del} link on each file, then hit Enter to accept the default comment, and so on.  They would all go red very quickly.

I am currently working on 
http://commons.wikimedia.org/wiki/Commons:Deletion_requests/Files_in_Category:Commemorative_euro_coins_%28France%29#Files_in_Category:Commemorative_euro_coins_.28France.29

After two minutes, the first five files are still highlighted in gray with "working".  The last six are all red. If recent experience is true here, they will remain "working" indefinitely -- I once left them overnight without change.

If I then reload the DR page, several things can happen.
1) In this case, they all show up red. This is true less than half the time
2) In most cases some or all will will remain blue.
2a) Then if I go to the file page, in some cases the file will be intact.
2b) In other cases, the file will appear intact, but the pulldown will start with "Undelete", showing that the file is actually gone.  In that case, using "Purge" on the pulldown will turn it red on the DR page.
2c) in still other cases, the file page will turn up with "That file has been deleted..."

Since there are more than 100 DRs on an average day (more than 200 on today's targets), this is seriously slowing down closures.

Jim

Jameslwoodward
Comment 6 Aaron Schulz 2012-03-08 19:44:50 UTC
Are any 500 type errors or timeouts occurring?
Comment 7 Krinkle 2012-03-08 20:28:18 UTC
(In reply to comment #6)
> Are any 500 type errors or timeouts occurring?

Not that I know of.

See also 35054
Comment 8 Jameslwoodward 2012-03-09 16:02:27 UTC
Reply to Comment #6:

Occasionally, perhaps one in twenty deletes:

"API request returned code 504 errorError code is Gateway Time-out"

Sometimes just one, sometimes multiple times while doing a multiple delete.

Sometimes the process completes correctly after a 504 error; sometimes some of them hang. Five of 13 deletes are hung now at Commons:Deletion requests/The Boy in the Plastic Bubble (film).  The other eight are now red-links.

It happens more often when more than three or four deletes are running at once.

Jim
Comment 9 Aaron Schulz 2012-03-14 17:36:10 UTC
Is this still occurring?
Comment 10 Jameslwoodward 2012-03-15 15:30:51 UTC
Yes.
I just deleted
    File:Alter Way 50 x 109.jpg
    File:Reciprok logo 50 x 109.jpg
    File:AWF 50 x 109.jpg
    File:AWS LOGO 50 x 109.jpg [keep] [del]
    File:AWH 50x109.jpg [keep] [del]
After a page reload, the first three are red, the last two are blue, although they have both been deleted.

Clicking on
   File:AWS LOGO 50 x 109.jpg [keep] [del]
yields: "No file by this name exists, but you can upload it." 

while 
   File:AWH 50 x 109.jpg [keep] [del]
yields a page that shows the thumbnail and asks "View or restore 4 deleted edits?"
----

On  
    File:Badingchad.jpg [keep] [del]
    File:Jamesmadara.jpg [keep] [del]
    File:Lovetriangle.jpg [keep] [del]
    File:Gakuranboys.jpg [keep] [del]
the first one is still blue, but actually deleted.  The other three were not deleted.


Jim
Comment 11 Aaron Schulz 2012-03-15 17:26:42 UTC
How does the DelReqHandler script check the return status of the deletions. I assume it uses the API. Did it get a bad response for the files that where not deleted? Would it properly detect if it just got an error message back or treat it as success?

In any case there is definitely a cache invalidation problem going on. It could be a missing purge call, an HTMLCacheUpdate job bug, or purge request packet loss.

I wish I had a link or screenshot to "File:AWH 50 x 109.jpg" since that one sounds the strangest. Does it also show the file history?
Comment 12 Jameslwoodward 2012-03-15 23:03:27 UTC
I'm a user, not the programmer, so I can't answer technical questions, but, yes, I'm pretty sure it uses the API.

It's a transitory phenomena, so a link won't help, but the next time I see a page like File:AWH 50 x 109.jpg I'll shoot a screen shot and e-mail it to you.

---

It occurs to me -- and I'm guessing above my pay grade here -- that the servers appear to have generally slowed down over the last six weeks or so.  Is it possible that this is a stack or queue getting overloaded, that would not have been overloaded when the servers were moving much faster in December, so that hitting a bunch of quick (del) would have been cleared before the overload occurred? Certainly even a single (del) takes much longer now than it did then.

Jim
Comment 13 Rainer Rillke @commons.wikimedia 2012-03-15 23:16:12 UTC
(In reply to comment #11)
> Would it properly detect if it just got an error message back or treat
it as success?
The the former one. 

https://secure.wikimedia.org/wikipedia/commons/wiki/MediaWiki:Gadget-DelReqHandler.js

Deletion 
 deleteFile -> doAPICall -> $.ajax -> error/success (would throw an alert-box on error) (both HTTP-status errors and result.error (like bad token) are handled)
Comment 14 Andreas F. Borchert 2012-03-19 19:46:18 UTC
I just noticed a similar problem, using the regular interface and while deleting just one file [1]. The deletion took considerable time and delivered following error message:

----
Database error
Jump to: navigation, search
A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

    (SQL query hidden)

from within function "LocalFile::delete". Database returned error "1205: Lock wait timeout exceeded; try restarting transaction (10.0.6.41)".
----

Like in the case above, the links to that file remained blue and the
at the file description page, the thumbnail along with the file history
and the metadata were still to be seen. Purging helped but took also
considerable time.

Andreas.

[1] http://commons.wikimedia.org/wiki/File:Nude_man_standing_with_an_erection..jpg
Comment 15 Sumana Harihareswara 2012-03-22 22:59:29 UTC
Assigning to Tim.
Comment 16 Tim Starling 2012-03-23 01:04:24 UTC
The browser version used to view the category page after deletion may be relevant. Anyone want to say what browser version they were using?
Comment 17 Jameslwoodward 2012-03-23 02:24:39 UTC
Firefox 10.0.1 running on XP Pro

Jim
Comment 18 Tim Starling 2012-03-23 05:39:22 UTC
There are a few different issues here.

(In reply to comment #0)
> On Commons, nowadays we sometimes notice that after a file deletion via API,
> the file-revision-table remains visible and links to the files are not rendered
> as redlinks. Purging helps but if you delete, you do not expect having to go to
> the page and purge it.

The slave DB is used to check page_touched in response to an IMS request, so if the file description page looks identical to how it looked before delete, then slave lag coupled with session corruption is a possibility.

If the description page had no description text, but the "file history" section was still present, then that would point to an incomplete delete operation. In my test install for a file deletion, I count four separate transactions with write queries in them. The deletion of the image row happens after the deletion of the page row, so it's possible for the page to be deleted before the image, and it's possible for this situation to be user-visible.

If everything seemed fine after a few minutes then it was probably just slow rather than broken.

(In reply to comment #1)
> All 3 files I deleted in the last 5 minutes however are different, something
> weird happened (or rather, didn't happen). The category member links were kept
> and after refreshing they're still showing up in the category (although the
> thumbnail is broken, and opening the link shows a "This page was recently
> deleted" message).

When an image is deleted, the page_touched for the categories it is in are not updated. It's probably been that way since categories were invented, I confirmed it back to 1.17. The code to do this purge is just not there. This is not related to the other reports. 

(In reply to comment #5)
> After two minutes, the first five files are still highlighted in gray with
> "working".  The last six are all red. If recent experience is true here, they
> will remain "working" indefinitely -- I once left them overnight without
> change.

Maybe an unexpected result from api.php caused the gadget to throw an exception. In any case, failure to show an informative error message is a gadget problem, not a server problem.

> 
> If I then reload the DR page, several things can happen.
> 1) In this case, they all show up red. This is true less than half the time
> 2) In most cases some or all will will remain blue.

There's a few possible causes of that.

> 2a) Then if I go to the file page, in some cases the file will be intact.
> 2b) In other cases, the file will appear intact, but the pulldown will start
> with "Undelete", showing that the file is actually gone.  In that case, using
> "Purge" on the pulldown will turn it red on the DR page.

The deleted status used by the skin comes from the slave DB, so this could be either slave lag or an incomplete delete operation.

(In reply to comment #14)
> from within function "LocalFile::delete". Database returned error "1205: Lock
> wait timeout exceeded; try restarting transaction (10.0.6.41)".

This is the site_stats update, an exception thrown here will cause the memcached file cache update and LocalFile::purgeThumbList() to be skipped. Those two things are done in the same transaction, and LocalFile::purgeThumbList() is quite slow. But I'm not sure if it will explain all of the problems on this bug report. I'll look at dberror.log, it will tell us whether this lock wait timeout is common.
Comment 19 Tim Starling 2012-03-23 05:52:02 UTC
LocalFile::delete DB error counts per day:

     23 in Feb 1
      7 in Feb 3
      1 in Feb 5
      3 in Feb 6
      6 in Feb 9
     26 in Feb 10
      1 in Feb 11
      2 in Feb 12
     91 in Feb 13
      7 in Feb 14
      5 in Feb 15
      7 in Feb 16
      1 in Feb 17
      4 in Feb 19
     96 in Feb 20
     27 in Feb 21
      9 in Feb 22
     48 in Feb 23
     90 in Feb 24
     72 in Feb 25
    180 in Feb 26
     13 in Feb 27
     59 in Feb 28
     28 in Feb 29
     29 in Mar 1
     33 in Mar 2
     30 in Mar 3
      2 in Mar 4
     13 in Mar 6
      3 in Mar 7
      5 in Mar 8
     10 in Mar 9
      3 in Mar 10
      4 in Mar 11
      9 in Mar 12
      8 in Mar 13
      4 in Mar 14
     56 in Mar 15
     23 in Mar 16
     38 in Mar 17
     50 in Mar 18
    216 in Mar 19
     49 in Mar 20
      2 in Mar 21
      5 in Mar 22

Note the low counts around March 8, that implies it's not responsible for the original report.
Comment 20 Tim Starling 2012-03-23 06:40:57 UTC
The logs show lots of lock wait timeouts for other things that write to site_stats, but LocalFile::delete() seems like the most likely root cause. I patched it in r114442, we'll see if that fixes that part of this bug report at least.
Comment 21 Jameslwoodward 2012-03-23 11:01:22 UTC
It is far too soon to declare victory, but the problem may have gone away.

In case that it is not correct, two notes:

I would be careful of diagnosing from the counts.  There are only half a dozen Admins who close most of the DRs. I know that I have cleared many fewer -- hundreds, perhaps as many as several thousand -- DRs than I would have if this bug were not slowing me down.  I would not be surprised if my colleagues have also slowed down or stopped doing DRs. That is certainly evident from the backlog.

Note, again, that this a new problem in the last six weeks or eight weeks. I would look for some change around February 1 as the cause.

Jim
Comment 22 Jameslwoodward 2012-03-23 15:47:12 UTC
I have deleted about five hundred files using DelReqHandler this morning with no sign of the trouble detailed above.

Thank you very much for fixing this significant problem.


Jim
Comment 23 Tim Starling 2012-03-28 00:53:04 UTC
The logs show that total lock wait timeouts have dropped to a small fraction of what they were before r114442.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links