Last modified: 2014-03-20 04:10:36 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T47877, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 45877 - ResourceLoader: Bad cache stuck due to race condition with scap between load.php and index.php server
Status: REOPENED
Product: MediaWiki
Classification: Unclassified
Component: ResourceLoader (Other open bugs)
Version: 1.21.x
Hardware: All
OS: All
Importance: High major
Target Milestone: Future release
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 54945
Reported: 2013-03-08 03:39 UTC by spage
Modified: 2014-03-20 04:10 UTC (History)
CC: 6 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description spage 2013-03-08 03:39:19 UTC
E3 deployed several extensions with changed modules today. GettingStarted had a new RL module 'ext.gettingstarted.openTask' for a new JS file resources/ext.gettingstarted.openTask.js. scap finished 4:19PDT [00:19:44]

50 minutes after scap finished I visited a page that should have had the new module, but didn't. mw.loader.getState( 'resources/ext.gettingstarted' ) reported "missing". Shift-reload of the page didn't help; requesting just the bits URL that included this module [1] also reported it "missing"[2].  Modifying the version in the query string to a later time or requesting just the ext.gettingstarted.openTask module worked fine.

My hypothesis: during scap, some user requested a page that ran the new PHP code that adds the new module, so her browser made the request to bits.wikimedia.org for modules including the new module. At that time the bits server didn't yet have the JS file, so the server response reports "missing".  But then RL happily caches that response for X minutes, where X is a long time, well over 50 minutes.

The missing module continued for about 10 more minutes, until the same URL worked.[3]  The working request reported a modified time of Thu 07 Mar 2013 04:07:51 PM PST (I think 00:07:51 UTC, which matches the version= on the end of the bits URL).

The access/modify/change times of the ext.gettingstarted.openTask.js file according to stat(1) are
  Access: 2013-03-08 00:08:12.857544926 +0000
  Modify: 2013-03-07 21:49:43.000000000 +0000
  Change: 2013-03-08 00:08:12.857544926 +0000
the same as the other files in the module that existed before this deploy.

Is the workaround to avoid these damaging situations to touch every JS and CSS file that one deploys before running scap?

This might be a dupe of bug 37812, "Module cache should be invalidated (newer timestamp) when lesser old items are removed or added (scripts, style, messages)"

[1] The request with missing module:
https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.Experiments.lib|ext.UserBuckets%2CeventLogging%2CmarkAsHelpful%2CpostEdit|ext.articleFeedbackv5.startup|ext.gadget.DRN-wizard%2CHotCat%2CReferenceTooltips%2Ccharinsert%2CmySandbox%2Cteahouse|ext.gettingstarted.openTask|jquery.articleFeedbackv5.verify|jquery.autoEllipsis%2CcheckboxShiftClick%2CclickTracking%2CdelayedBind%2Chidpi%2ChighlightText%2Cjson%2CmakeCollapsible%2Cmw-jump%2Cplaceholder%2Csuggestions%2CtabIndex|mediawiki.Title%2CUri%2Capi%2Chidpi%2CsearchSuggest%2Cuser|mediawiki.api.watch|mediawiki.page.ready|mediawiki.page.watch.ajax|mobile.desktop|mw.MwEmbedSupport.style|mw.PopUpMediaTransform|schema.GettingStarted&skin=vector&version=20130308T000751Z&*

[2] The end of the (jsbeautify'd) response for this URL was:
 }, {}, {});
mw.loader.state({
    "ext.gettingstarted.openTask": "missing"
});
/* cache key: enwiki:resourceloader:filter:minify-js:7:3e6c3b5e8a0b731930af1060b6ab62b6 */

[3] The identical URL request, once it started working, returned a last line of:
/* cache key: enwiki:resourceloader:filter:minify-js:7:03dcfbab169aa00f66478e647270d340 */
Comment 1 Sam Reed (reedy) 2013-03-08 03:40:56 UTC
(In reply to comment #0)
> Is the workaround to avoid these damaging situations to touch every JS and
> CSS
> file that one deploys before running scap?

Isn't touching startup.js enough IIRC?
Comment 2 Matthew Flaschen 2013-03-08 03:45:48 UTC
Yep, sounds like a dupe.  Thank you for tracking it down.

*** This bug has been marked as a duplicate of bug 37812 ***
Comment 3 Krinkle 2013-03-08 05:30:03 UTC
(In reply to comment #1)
> (In reply to comment #0)
> > Is the workaround to avoid these damaging situations to touch every JS and
> > CSS
> > file that one deploys before running scap?
> 
> Isn't touching startup.js enough IIRC?

Yeah, because that will (as a side effect) invalidate the startup module request entirely (not just the manifest). When it re-calculates the manifest it will see that the module has in fact changed, so the next user getting the startup module will get a higher timestamp, and thus a different bits URL, and thus the working code.

However, this may not be working for the reason you think. Let me clarify just in case. The timestamp the module gets after touching startup.js is not related to the time you touch startup.js. Touching it just forces ResourceLoader to re-check the timestamps of the files in the module (which you didn't touch by touching startup.js). The timestamp will still be higher because, as Matthew correctly described, the previous time ResourceLoader was asked to check the timestamps (due to that unfortunate visitor during scap) some of the files weren't there yet.
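
To make that concrete, here is a simplified PHP sketch (not the actual ResourceLoader implementation, just an illustration of the idea) of how a module's version timestamp can be derived from the mtimes of its files:

  // Simplified illustration only; the real ResourceLoader code is more involved.
  // The module's version is the newest mtime among its files, so touching
  // startup.js merely forces a re-check; the module files' own mtimes decide
  // the resulting version.
  function getModuleTimestamp( array $files ) {
      $timestamps = array();
      foreach ( $files as $path ) {
          // A file that hasn't been synced yet contributes nothing, which is
          // why a later re-check can still produce a higher max().
          if ( file_exists( $path ) ) {
              $timestamps[] = filemtime( $path );
          }
      }
      return $timestamps ? max( $timestamps ) : 1;
  }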

(In reply to comment #2)
> Yep, sounds like a dupe.  Thank you for tracking it down.
> 
> *** This bug has been marked as a duplicate of bug 37812 ***

No, this is not a dupe of that bug.

bug 37812 is about the case where you properly deploy a change but the timestamp (no matter how many times you invalidate the startup manifest) does not change, because none of the components ResourceLoader checks have changed. bug 37812 happens when the change you make only involves removing older files from the array, or adding new files that were modified longer ago, causing the max() to still be the same.
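
For example (hypothetical mtime values, made up purely to illustrate the bug 37812 case), removing an older file or adding an even older one leaves the max() unchanged:

  // Hypothetical mtimes, for illustration only:
  $before = array( 'a.js' => 1362600000, 'b.js' => 1362500000 );
  // b.js removed, an older c.js added:
  $after  = array( 'a.js' => 1362600000, 'c.js' => 1362400000 );
  var_dump( max( $before ) === max( $after ) ); // bool(true): the version stays the same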

This bug (bug 45877) is about the race condition where one server is already embedding request URLs in the page (or mw.loader.load calls) while the module in question is not yet available on the apache server that the request will be made to.

e.g.
en.wikipedia.org --> srv123 @r12 --> outputs html mw.loader.load('foo')
-> bits.wikimedia.org --> srv214 @r11 --> mw.loader.state('foo', 'missing');

Where that exact url (with that timestamp) will get cached for 30 days.

Now it won't be broken for 30 days, because the startup manifest that en.wikipedia.org request contexts work with (from bits) will be rebuilt every 5 minutes.
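
The 30 days versus 5 minutes distinction comes from the separate cache lifetimes for versioned and unversioned load.php requests. In configuration terms it looks roughly like this (illustrative values; check DefaultSettings.php of the release in question for the exact defaults):

  // Approximate defaults of that era; exact values may differ per release.
  $wgResourceLoaderMaxage = array(
      'versioned' => array(   // URLs carrying a version= timestamp
          'server' => 30 * 24 * 60 * 60, // 30 days
          'client' => 30 * 24 * 60 * 60,
      ),
      'unversioned' => array( // e.g. the startup module request
          'server' => 5 * 60, // 5 minutes
          'client' => 5 * 60,
      ),
  );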

I'm not sure what the proper solution is for this problem. Perhaps syncing to bits first, though that might bring the opposite problem, which is likely less visible but might be equally problematic.
Comment 4 Matthew Flaschen 2013-03-08 05:37:46 UTC
S Page did the digging and filed this bug report.

I misinterpreted bug 37812 as about adding a file in general (including adding the actual file when it's already in the manifest).  Upon reading more closely, I see it's only about the registry.
Comment 5 Krinkle 2014-03-20 03:53:49 UTC
(In reply to Krinkle from comment #3)
> This bug (bug 45877) is about the race condition where one server is already
> embedding request URLs in the page (or mw.loader.load calls) while the
> module in question is not yet available on the apache server that the
> request will be made to.
> 
> e.g.
> en.wikipedia.org --> srv123 @r12 --> outputs html mw.loader.load('foo')
> -> bits.wikimedia.org --> srv214 @r11 --> mw.loader.state('foo', 'missing');
> 
> Where that exact url (with that timestamp) will get cached for 30 days.
> 
> Now it won't be broken for 30 days, because the startup manifest that
> en.wikipedia.org request contexts work with (from bits) will be rebuilt
> every 5 minutes.
> 
> I'm not sure what the proper solution is for this problem. Perhaps syncing
> to bits first, though that might bring the opposite problem, which is likely
> less visible, but might be equally problematic.

Rephrasing bug summary to more accurately reflect this.

The different relevant scenarios:


1) index.php server first, load.php server second

Whenever a change of any kind is deployed (a new module, or a change to an existing module), the code might arrive on one server first (e.g. the server serving the HTML, with the module load queue), and then the client makes a subsequent request to load.php for those modules, handled by a server that doesn't yet have the code.

In that case, the response will be module.state=missing, which is fine and degrades gracefully. This is not a broken state, simply the old state prior to this particular deployment. Once the deployment is finished, the next time a client requests the startup module from load.php, the module will be in there with a higher timestamp, so the module=missing response won't be served again.

2) index.php server and load.php/startup first, load.php/module second

First the HTML server gets it, ensuring the module is in the load queue (if not already), and sometime before or after this the server handling the startup module gets it too. The client will make a request for the module with the newer version number in the URL, which gets handled by a server that we haven't synced to yet and responds with an old version of the module (or module=missing if it's a new module).

This situation doesn't resolve itself within 5 minutes, because the bad response is stuck in the frontend Varnish cache at the correct URL.

Touching startup.js won't help either.

This can only be resolved by changing the module again (touch any of the relevant module's included files), syncing that, and hoping you don't hit the same race condition again (it's rare, but I'd say it happens about 1 in 20 times).
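
For reference, the manual remediation amounts to something like the following PHP snippet (the extension path is an assumption based on this report's GettingStarted example; a shell touch on the same files works just as well):

  // Bump the mtimes of the module's files so the next manifest rebuild
  // computes a new, higher version; the touched files then need to be
  // synced out again (and you hope the race doesn't repeat).
  foreach ( glob( 'extensions/GettingStarted/resources/*.js' ) as $file ) {
      touch( $file ); // do the same for the module's CSS files, if any
  }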
Comment 6 Krinkle 2014-03-20 04:04:12 UTC
(In reply to Krinkle from comment #5)
> The different relevant scenarios:
> 
> 
> 1) index.php server first, load.php server second
> 
> (..)
> 
> 2) index.php server and load.php/startup first, load.php/module second
> 
> (..)

3) server is in a frankenstein state

This affects both ResourceLoader (load.php) and the main MediaWiki request handling in general. Due to how we use rsync, I believe it is indeed possible at any moment during the sync action for the MediaWiki install to contain files it shouldn't, miss files it needs, and have mixed newer and older versions of some files.

The window for this is relatively small as it relates to a single server as opposed to the entire cluster.

This can lead to random PHP errors (which are not as visible, since those aren't cached, but they do get served to the public; mostly to logged-in users only, I think, unless a page gets purged of course).

This can also lead to responses from load.php that will be cached and may contain all kinds of weird concatenations of incompatible files. This might resolve itself if the older file was synced first (by the next request the sync is finished and there will be a newer file). But if the newer file was synced first, it won't fix itself (same as #2).


--

CC-ing Tim Starling. I discussed this with him a few months back and he had a few thoughts on optimising scap that would shorten the window during which this bug can occur.

--

Using a more atomic scap, we can reduce the window during which #3 can happen. #2 can still happen, though, because an atomic scap only helps a single server maintain its own state; between servers there will still be a delay of many seconds to 1-2 minutes, depending on how long the sync action takes across the cluster.

Something like git-deploy would mitigate both.

For #3: Near impossible, because it works with working copies (fetch internally first, then check out).

For #2: It first instructs all servers to fetch, and when that is finished, all of them to check out. So there's a much shorter window during which different servers have different active directories, because copy time is taken out of the equation.
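
A rough sketch of that two-phase idea (the host names and the runOnHost() helper are hypothetical, not an actual deployment tool API):

  $hosts = array( 'srv123', 'srv214' /* ... */ );

  // Phase 1: every host fetches the new code into its local repository
  // without touching the live checkout. This is the slow part, but it is
  // harmless while in progress.
  foreach ( $hosts as $host ) {
      runOnHost( $host, 'git fetch origin' );
  }

  // Phase 2: only after all fetches have finished does every host switch its
  // working copy. The switch itself is fast, so the window during which
  // servers diverge shrinks to roughly the checkout time.
  foreach ( $hosts as $host ) {
      runOnHost( $host, 'git checkout -f origin/deploy' );
  }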
Comment 7 Bryan Davis 2014-03-20 04:07:04 UTC
Is this really a problem with the atomicity of the scap process in general? 

Making the MW deploy process more atomic is a goal for the "next generation scap" project. It would be good to have some use cases such as this one documented on the talk page for the deploy system requirements [0], or somewhere nearby, so we can try to squash ugly edge-case problems such as this one related to the front-end caching architecture.

One random idea that just popped into my head would be to create some system for telling the front-end Varnishes that things will be changing soon and that they should not cache any new responses until given the all-clear sign.
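
One possible (entirely hypothetical) shape of that idea, expressed on the MediaWiki side rather than in Varnish VCL: have load.php shorten its cache headers while a deployment flag is present. The flag file path and header values below are assumptions, not an existing mechanism:

  // Hypothetical: skip long-lived caching while a deployment is in progress.
  $deployInProgress = file_exists( '/srv/mediawiki/.deploy-in-progress' );
  if ( $deployInProgress ) {
      header( 'Cache-Control: private, s-maxage=0, max-age=0' );
  } else {
      // Normal policy for versioned URLs (30 days).
      header( 'Cache-Control: public, s-maxage=2592000, max-age=2592000' );
  }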

[0]: https://www.mediawiki.org/wiki/Talk:Deployment_tooling/Notes/Deployment_system_requirements
Comment 8 Bryan Davis 2014-03-20 04:10:36 UTC
(In reply to Krinkle from comment #6)
> This affects both ResourceLoader (load.php) and the main MediaWiki request
> handling in general. Due to how we use rsync I believe it is indeed possible
> at any moment during the sync action for the MediaWiki install to contain
> files it shouldn't, miss files it needs, and have mixed newer and older
> versions of some files.


The window for this state on a given host should be very short. We give rsync the `--delete-delay` and `--delay-updates` flags, which defer removing or changing files until all of the changes are present on the local host.
