Last modified: 2014-09-24 00:02:33 UTC
Created attachment 5833 [details]
Image URL with timestamp patch

I'm optimizing the performance of MediaWiki instances, and one of the issues I came across is that image URLs in MediaWiki never change as the images themselves change, so it's impossible to set far-future Expires headers on them to keep them firmly in browser caches. Here's a link explaining this particular issue: http://developer.yahoo.com/performance/rules.html#expires

You can see this issue in action here: http://performance.webpagetest.org:8080/result/090218_132826127ab7f254499631e3e688b24b/ (a simple two-run test of the http://en.wikipedia.org/wiki/Hilary_Clinton page). Notice that on the repeat run all image requests are sent again even though the images didn't change, so we get 55 extra requests with 304 responses, which requires 4 more connections to the commons server (see the "Connection View" section below), all of which could be avoided. This can get even worse when consecutive views hit pages that share only some of the same images: in that case, loading of images after the ones already requested will be blocked. I didn't try to calculate the traffic savings (they can be significant even though only headers are being sent), but that can be done based on some statistics.

The good news is that MediaWiki already has control over the versioning of uploaded files (which is most important for images), so the solution would be to simply generate a unique query string for each version of the image. Solutions for the local file store and for remote stores might differ, but I created a patch that relies on getTimestamp() being implemented accordingly in each subclass (LocalFile.php, ForeignAPIFile.php and so on). Another, much "cleaner", approach would be to use the file revision number instead of the timestamp, but that would require more knowledge of the file store implementation, which I lack. It might also be heavier on the CPU, as it would require fetching history from the database.
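The core idea of the patch can be sketched roughly like this (a minimal Python illustration, not the actual PHP patch; the function name and paths here are hypothetical):

```python
def versioned_url(base_url: str, timestamp: str) -> str:
    """Append the file's upload timestamp as a cache-busting query string.

    `timestamp` is assumed to be the MediaWiki-style 14-digit timestamp
    (e.g. "20090218120000") that File::getTimestamp() returns.
    """
    # Reuse an existing query string if the URL already has one.
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{timestamp}"

# Each new file version yields a new URL, so far-future Expires headers
# can be set safely: stale cached copies are simply never requested.
print(versioned_url("/images/e/e0/Example.png", "20090218120000"))
# -> /images/e/e0/Example.png?20090218120000
```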
Anyway, I'm attaching a patch that already works for the local file repository, where the timestamp implementation works fine. You can see the result of this patch here: http://performance.webpagetest.org:8080/result/090219_289bbf4e150b039459abe3ba3d3ce148/ (notice that on the second run only the page itself is requested). If this all sounds right, I can apply the patch to the tree. Sergey
Yep, the patch doesn't include web server configuration for expiration headers. A simple .htaccess like the one below can be put into the images/ folder (if Apache has AllowOverride Indexes for it): ExpiresActive on ExpiresDefault A25920000
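Spelled out as a file, that sketch would look like this (assuming mod_expires is loaded; A25920000 means 25920000 seconds after access, roughly 300 days):

```apache
# images/.htaccess (requires mod_expires and AllowOverride Indexes)
ExpiresActive on
ExpiresDefault A25920000
```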
Spiffy! Offhand looks good, though would want to double-check there's no conflicts with remote repos and the on-demand thumbnailing. Tim, can you take a peek at this today and see if there's any issues there? Thanks!
Does Squid currently get purged on image reupload? I suppose it must, to deal with image links that have no size specified. I had always assumed Squid is why we didn't change image URLs, but on reflection, it seems unlikely to be a big deal to do such purges occasionally. If this works with Squid and file cache, it should probably be on by default. (Why doesn't file cache hook into Squid's purge mechanism?)
Will squids purge File:Foo.jpg?timestamp=19700101000000 entry when Foo.jpg is reuploaded? Are pages using images on remote-repos correctly purged on image reupload? (I think the problems of bug 1394 complicate it) Infinite expiry images plus squids serving pages pointing to old images...
> Does Squid currently get purged on image reupload? Currently the plain page view URL does get purged from local Squids, however any *client* that has cached the image doesn't get any such notification. So, either the browser has to go back to hit the server every time it shows it to check if it's changed (slow!), or it speculatively caches it for some amount of time with the risk of showing an outdated version. You can see this effect when you upload a new version of an image and see the old one sitting there on the File: page until you refresh. Changing the URL with a timestamp would mean that any page which has been updated will use the updated URL, giving you the updated image version when you view it. > Are pages using images on remote-repos correctly purged on image reupload? Nope, which is an issue to consider. There's not currently any registry of remote use, so the wiki doesn't know who to send purges to. (This would not be too hard to implement internally for DB-based repos so Commons could update the other Wikimedia sites, but would be much trickier for third-party sites using us via an API repo).
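The purge mechanism being discussed is an HTTP request with the non-standard PURGE method sent to the Squids. As a toy illustration of that protocol only (the handler and URL here are made up; this is not MediaWiki code):

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class CacheHandler(BaseHTTPRequestHandler):
    """A stand-in for a Squid cache that accepts PURGE requests."""
    purged = []

    def do_PURGE(self):
        # Record which URL the origin asked us to evict.
        CacheHandler.purged.append(self.path)
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

server = HTTPServer(("127.0.0.1", 0), CacheHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The wiki side: on reupload, send PURGE for the image's page URL.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("PURGE", "/wiki/File:Foo.jpg")
status = conn.getresponse().status
server.shutdown()
print(status, CacheHandler.purged)
```

Note that this only evicts the local Squids; as the comment above says, nothing can reach into a browser's cache, which is why changing the URL is the only reliable client-side cache bust.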
(In reply to comment #5)
> > Are pages using images on remote-repos correctly purged on image reupload?
>
> Nope, which is an issue to consider. There's not currently any registry of
> remote use, so the wiki doesn't know who to send purges to. (This would not be
> too hard to implement internally for DB-based repos so Commons could update the
> other Wikimedia sites, but would be much trickier for third-party sites using
> us via an API repo).

Unless we had a dedicated action=repo or similar, the API has no way of distinguishing between normal API requests and a request to act as a repo.
So what do we do with this? Can the patch be scoped so that the stores that can benefit from it are able to use this feature?
Not related to the solution but useful to measure future performance optimizations: http://www.showslow.com/details/?url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FHillary_Clinton
Just to illustrate results, here's the test for TechPresentations with this patch applied (it uses local repository): http://www.webpagetest.org/result/091210_3H74/
It's been 2 years since I provided the initial patch, but Hillary Clinton still sends 304s for static assets: http://www.webpagetest.org/result/110225_EY_5e420956c8cf54450c47902cc4e82be0/1/details/cached/ You're losing user experience and traffic.
This needs to be reviewed by someone who understands our Squid setup, like Tim or Brion. I don't think it needs a config option, it should just always be enabled, but we need to make sure the right Squid URLs are purged for it to work on Wikimedia. You're right that the status quo is unreasonable.
I think the last time this was discussed there was another issue: you have a remote repository with static assets (uploads.wikimedia.org), while smaller MW installs can use the local file system to determine the version number. In any case, it's worth implementing for both. Sergey
No, it's not a problem for Wikimedia, since it is (for now) NFS-mounted. It is a problem for people using us as a remote repository.
Basically, there are a few ways to get versions:
- from the asset itself: ideally a crc32/md5 hash of the content (it's actually pretty fast), or the modification time (which is not very good)
- from metadata: in MW's case, the file revision number

Ideally, the version should be part of the file name, e.g. http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Hillary_Rodham_Clinton_Signature.svg/rev125/128px-Hillary_Rodham_Clinton_Signature.svg.png

Notice "rev125" between the last two slashes. It can be a real folder if you prefer to keep old files, or just a pseudo-folder used only for cache busting. Cache busting is a must, as we have assets stored indefinitely in all possible caches, not only in Squid. I don't know whether you need to tell Squid to clean up old URLs or whether they'll just get LRU-evicted later.

BTW, all this applies to skin files as well. It should probably be done differently there, though: as a build script, a post-commit hook, or something like the SVN-Assets tool that checks the repository revision or the hash of the file and generates file names accordingly.

Sergey
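Both versioning schemes can be sketched as follows (a Python illustration; the path layout mirrors the example URL above, and the revision number / content-hash sources are assumptions about where the version would come from):

```python
import hashlib

def rev_path(base_dir: str, filename: str, revision: int) -> str:
    """Version from metadata: insert a revN pseudo-folder before the name."""
    return f"{base_dir}/rev{revision}/{filename}"

def hash_path(base_dir: str, filename: str, content: bytes) -> str:
    """Version from the asset itself: a short MD5 digest of the file bytes."""
    digest = hashlib.md5(content).hexdigest()[:8]
    return f"{base_dir}/{digest}/{filename}"

print(rev_path("/thumb/e/e0/Sig.svg", "128px-Sig.png", 125))
# -> /thumb/e/e0/Sig.svg/rev125/128px-Sig.png
```

The pseudo-folder can be stripped by a rewrite rule before hitting the real file system, so no old copies need to be kept on disk.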
Adding the need-review keyword to indicate that this patch still needs to be reviewed. Thanks for the patch and sorry for the wait, Sergey.
Sergey, I'm sorry, but because so much time has passed since you submitted your patch, trunk has changed and your patch no longer applies cleanly. If the problem's still happening, would you mind updating it and then letting me know? I'll then get a reviewer for it. Thanks.
It doesn't make any difference to me whether or not the patch is updated. There's not a significant amount of code review here, it's mostly about the idea.
Glad you guys are on it. I don't think I can dig into the guts of MW again to get it working right, but Tim is correct: it's not much code, just simple stuff. Still, if you can use real or pseudo-folders in the file names, that would be even better (query strings might not work as well with caches like your Squids, and with external caches too). BTW, the old way and the new way can co-exist if there are worries about some instances not being able to support remote repos: all you need to do is set infinite expires only on the versioned URLs and keep the regular URLs intact.
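That co-existence idea could look roughly like this in Apache 2.4 terms (a sketch only, assuming the 14-digit timestamp query string from the patch; requires mod_headers and is untested against a real Squid setup):

```apache
# Long expiry only for versioned URLs; plain URLs keep the default caching.
<If "%{QUERY_STRING} =~ /^[0-9]{14}$/">
    Header set Cache-Control "public, max-age=31536000"
</If>
```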
It's been 3 years already, but Hillary is still very slow: http://www.webpagetest.org/result/120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9/ The second run is not significantly more efficient than the first one: http://www.webpagetest.org/video/compare.php?tests=120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:0,120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:1
(In reply to comment #19)
> It's been 3 years already, but Hillary is still very slow:
> http://www.webpagetest.org/result/120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9/
>
> Second run is not significantly more efficient then firs one:
> http://www.webpagetest.org/video/compare.php?tests=120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:0,120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:1

Well, the second run being almost the same speed is mostly due to counting the ajax from the banner load. If you discount the ajax, it'd be roughly 6.2 seconds vs 5.5 seconds. If you also don't count one image that took insanely long to return a 304 (which could just be a rare occurrence, or could be commonplace; I don't really know), the comparison becomes 6.2 seconds vs 3.6 seconds. Hence the speedup from fixing this bug might not be as large as that test would lead you to believe (it is still probably fairly significant, though, assuming it can be done effectively).
Actually, I'm looking at render times and not at onload events.
Adding performance keyword, and removing Tim since he's not specifically looking at this. Aaron Schulz may have an idea or two about where we should go with this.
We really should look at this one again. If the WMF infra is so problematic, then perhaps we should wrap it in a conditional, so that at least it improves functionality for 'the rest of them'? It could potentially fix the problem where people upload a new version of an image and some browsers don't purge the cached copy of the thumbnail by themselves. (We still sometimes need to tell people to bypass their browser cache after 'upload new version', even though I can't really see why a browser would 'not' send a request with the current setup. Perhaps some browsers try to be too smart when there is no Cache-Control: must-revalidate and set a hidden max-age?)
Checking back 4.5 years later, are you guys still interested in saving traffic and increasing the performance of web pages? Any way I can help with this? Refreshing everybody's memory? Explaining the effect this can have on users and systems? I'll be happy to do so - can even take a day or two of vacation to help.
(In reply to comment #24)
> Checking back 4.5 years later, are you guys still interested in saving
> traffic and increasing performance of web pages?
>
> Any way I can help with this? Refreshing everybody's memory? Explaining the
> effect this can have on users and systems?
>
> I'll be happy to do so - can even take a day or two of vacation to help.

Actually, there has been recent interest in this sort of thing, but for different reasons (easier management of cache purging on the server side; obviously your reasons are good too).
Great, I'll be happy to see this implemented.