Last modified: 2014-09-24 00:02:33 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T19577, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 17577 - Image urls should have far future expires
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: File management
Version: unspecified
Hardware: All All
Importance: Low enhancement with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://performance.webpagetest.org:80...
Keywords: patch, patch-reviewed, performance, platformeng
Depends on: 64214
Blocks:
Reported: 2009-02-19 22:45 UTC by Sergey Chernyshev
Modified: 2014-09-24 00:02 UTC (History)
CC List: 20 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Image URL with timestamp patch (1.57 KB, patch)
2009-02-19 22:45 UTC, Sergey Chernyshev

Description Sergey Chernyshev 2009-02-19 22:45:09 UTC
Created attachment 5833
Image URL with timestamp patch

I'm optimizing the performance of MediaWiki instances, and one of the issues I came across is that image URLs in MediaWiki stay the same when the image itself changes. That makes it impossible to set far-future Expires headers on them and keep them firmly in browser caches.

Here's the link explaining this particular issue: http://developer.yahoo.com/performance/rules.html#expires

You can see this issue in action here: http://performance.webpagetest.org:8080/result/090218_132826127ab7f254499631e3e688b24b/ (a simple two-run test of the http://en.wikipedia.org/wiki/Hilary_Clinton page). Notice that on the repeat run all image requests are sent again even though the images didn't change, so we get 55 extra requests with 304 responses, which require 4 more connections to the commons server (see the "Connection View" section below). All of this could be avoided. It might get even worse if we test subsequent views of pages that share only some images: in that case, images that come after the already-requested ones will be blocked from loading.

I didn't try to calculate traffic savings (can be significant even though it's only headers that are being sent), but it can be done based on some statistics.

The good news is that MediaWiki already has control over the versioning of uploaded files (which is most important for images), so the solution would be to just generate a unique query string for each version of the image.
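As a rough illustration of the query-string idea (a sketch only, not the attached patch), the helper below appends a version timestamp to an image URL so that a re-upload yields a different URL. The function name `versioned_url`, the `timestamp` parameter name, and the example path are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def versioned_url(url: str, timestamp: str) -> str:
    """Return url with the file-version timestamp in its query string."""
    parts = urlsplit(url)
    # Drop any stale timestamp so the URL carries exactly one version marker.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "timestamp"
    ]
    query.append(("timestamp", timestamp))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical path and upload times: each upload gets its own URL.
old = versioned_url("/images/a/ab/Foo.jpg", "20090219224509")
new = versioned_url("/images/a/ab/Foo.jpg", "20090301120000")
print(old)
print(new)
```

Because the URL changes whenever the timestamp does, a browser holding the old URL in its cache never confuses it with the new version.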

It looks like the solutions for the local file store and for remote stores might differ, but I created a patch that relies on getTimestamp being implemented appropriately in each subclass (LocalFile.php / ForeignAPIFile.php and so on).

Another, much "cleaner", approach would be to use the file revision number instead of the timestamp, but that requires more knowledge of the file store implementation than I have. It might also be heavier on CPU, as it requires getting the history from the database.

Anyway, I'm attaching a patch that already works for the local file repository, where the timestamp implementation works fine.

You can see the result of this patch here: http://performance.webpagetest.org:8080/result/090219_289bbf4e150b039459abe3ba3d3ce148/ (notice that on the second run only the page is requested).

If it all sounds right, I can apply this patch to the tree.

          Sergey
Comment 1 Sergey Chernyshev 2009-02-19 22:48:58 UTC
Yep, the patch doesn't include the web server configuration for expiration headers.
A simple .htaccess like this can be put into the images/ folder (if Apache has AllowOverride Indexes for it):

  ExpiresActive on
  ExpiresDefault A25920000
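
For readers unfamiliar with the mod_expires syntax: the leading "A" means the lifetime is counted from the time of access, and the number is in seconds. The small sketch below only converts that value into days and shows the max-age directive that mod_expires sends alongside the Expires header; the variable names are illustrative.

```python
from datetime import timedelta

# The number after "A" in "ExpiresDefault A25920000", in seconds.
EXPIRES_SECONDS = 25920000

lifetime = timedelta(seconds=EXPIRES_SECONDS)
print(lifetime.days)  # whole days covered by the lifetime

# mod_expires sets Expires plus a matching Cache-Control max-age directive.
cache_control = f"max-age={EXPIRES_SECONDS}"
print(cache_control)
```

So the configuration above asks clients to keep images for 300 days after each access, which is only safe once image URLs change with every new version.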
Comment 2 Brion Vibber 2009-02-19 22:51:22 UTC
Spiffy!

Offhand looks good, though would want to double-check there's no conflicts with remote repos and the on-demand thumbnailing.

Tim, can you take a peek at this today and see if there's any issues there? Thanks!
Comment 3 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-02-19 23:23:18 UTC
Does Squid currently get purged on image reupload?  I suppose it must, to deal with image links that have no size specified.  I had always assumed Squid is why we didn't change image URLs, but on reflection, it seems unlikely to be a big deal to do such purges occasionally.

If this works with Squid and file cache, it should probably be on by default.  (Why doesn't file cache hook into Squid's purge mechanism?)
Comment 4 Platonides 2009-02-20 00:01:21 UTC
Will squids purge File:Foo.jpg?timestamp=19700101000000 entry when Foo.jpg is reuploaded?

Are pages using images on remote-repos correctly purged on image reupload?
(I think the problems of bug 1394 complicate it) 
Infinite expiry images plus squids serving pages pointing to old images...
Comment 5 Brion Vibber 2009-02-20 00:09:28 UTC
> Does Squid currently get purged on image reupload?

Currently the plain page view URL does get purged from local Squids, however any *client* that has cached the image doesn't get any such notification. So, either the browser has to go back to hit the server every time it shows it to check if it's changed (slow!), or it speculatively caches it for some amount of time with the risk of showing an outdated version.

You can see this effect when you upload a new version of an image and see the old one sitting there on the File: page until you refresh.

Changing the URL with a timestamp would mean that any page which has been updated will use the updated URL, giving you the updated image version when you view it.
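
The trade-off described here can be sketched with a toy browser cache. This is a simplified model, not how any real browser is implemented: the class name `BrowserCache`, the image paths, and the timestamps are all hypothetical, and the figure of 55 images is taken from the test run in the description.

```python
from dataclasses import dataclass, field


@dataclass
class BrowserCache:
    """Toy cache: with far_future, cached URLs are reused without asking."""
    far_future: bool
    seen: set = field(default_factory=set)
    requests: int = 0

    def fetch(self, url: str) -> str:
        if url in self.seen:
            if self.far_future:
                return "cache"      # served locally, no network round trip
            self.requests += 1
            return "304"            # conditional request, "not modified"
        self.seen.add(url)
        self.requests += 1
        return "200"                # first download of this URL


def repeat_view_requests(far_future: bool, urls: list) -> int:
    """Count the requests sent on the second view of the same page."""
    cache = BrowserCache(far_future=far_future)
    for url in urls:                # first view warms the cache
        cache.fetch(url)
    before = cache.requests
    for url in urls:                # repeat view
        cache.fetch(url)
    return cache.requests - before


images = [f"/images/Img{i}.png?timestamp=20090219224509" for i in range(55)]
print(repeat_view_requests(False, images))  # revalidating on every view
print(repeat_view_requests(True, images))   # far-future expiry
```

In this model a re-uploaded image simply appears under a new URL, so the far-future cache fetches it once as a fresh 200 and never shows the stale copy.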

> Are pages using images on remote-repos correctly purged on image reupload?

Nope, which is an issue to consider. There's not currently any registry of remote use, so the wiki doesn't know who to send purges to. (This would not be too hard to implement internally for DB-based repos so Commons could update the other Wikimedia sites, but would be much trickier for third-party sites using us via an API repo).
Comment 6 Chad H. 2009-02-20 17:13:06 UTC
(In reply to comment #5)
> > Are pages using images on remote-repos correctly purged on image reupload?
> 
> Nope, which is an issue to consider. There's not currently any registry of
> remote use, so the wiki doesn't know who to send purges to. (This would not be
> too hard to implement internally for DB-based repos so Commons could update the
> other Wikimedia sites, but would be much trickier for third-party sites using
> us via an API repo).
> 

And unless we had a dedicated action=repo or similar, the API has no way of distinguishing between normal API requests and a request to act as a repo.
Comment 7 Sergey Chernyshev 2009-04-10 16:20:54 UTC
So what do we do with this? Can this patch be localized so that stores that can benefit from this could utilize this feature?
Comment 8 Sergey Chernyshev 2009-05-15 22:27:44 UTC
Not related to the solution but useful to measure future performance optimizations:
http://www.showslow.com/details/?url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FHillary_Clinton
Comment 9 Sergey Chernyshev 2009-12-10 15:46:02 UTC
Just to illustrate results, here's the test for TechPresentations with this patch applied (it uses local repository): http://www.webpagetest.org/result/091210_3H74/
Comment 10 Sergey Chernyshev 2011-02-25 18:49:31 UTC
It's been 2 years since I provided the initial patch, but the Hillary Clinton page still sends 304s for static assets: http://www.webpagetest.org/result/110225_EY_5e420956c8cf54450c47902cc4e82be0/1/details/cached/

You're losing user experience and traffic.
Comment 11 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-02-25 18:55:56 UTC
This needs to be reviewed by someone who understands our Squid setup, like Tim or Brion.  I don't think it needs a config option, it should just always be enabled, but we need to make sure the right Squid URLs are purged for it to work on Wikimedia.  You're right that the status quo is unreasonable.
Comment 12 Sergey Chernyshev 2011-02-25 19:47:29 UTC
I think the last time this was discussed there was another issue - that you guys have a remote repository with static assets (uploads.wikimedia.org), while smaller MW installs can use the local system to determine the version number.

In any case, it's worth implementing in both cases.

      Sergey
Comment 13 Platonides 2011-02-26 18:05:28 UTC
No, it's not a problem for Wikimedia since it is -for now- NFS mounted.

It is a problem for people using us as a remote repository.
Comment 14 Sergey Chernyshev 2011-02-26 18:58:01 UTC
Basically, there are a few ways to get versions:
 - from asset itself
    - ideally crc32/md5 of content (it's actually pretty fast)
    - or modification time (which is not very good)
 - from meta-data (in case of MW, it's file revision number)
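
A minimal sketch of the content-based options from the list above, using Python's standard library. The function names are illustrative; nothing here is taken from MediaWiki.

```python
import hashlib
import zlib


def crc32_version(content: bytes) -> str:
    """8-hex-digit CRC-32 of the file content."""
    return format(zlib.crc32(content) & 0xFFFFFFFF, "08x")


def md5_version(content: bytes) -> str:
    """32-hex-digit MD5 digest of the file content."""
    return hashlib.md5(content).hexdigest()


first_upload = b"first version of the image bytes"
second_upload = b"second version of the image bytes"

# The token changes only when the bytes change, unlike a modification time.
print(crc32_version(first_upload), crc32_version(second_upload))
print(md5_version(first_upload))
```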

Ideally, it should be part of the file name, e.g. 
http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Hillary_Rodham_Clinton_Signature.svg/rev125/128px-Hillary_Rodham_Clinton_Signature.svg.png

Notice "rev125" between last two slashes. It can be a real folder if you prefer to keep old files or just pseudo-folder which can only be used for cache busting.
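
The pseudo-folder variant could look roughly like the sketch below: one helper adds the "revN" segment before the file name, and another strips it again so the server can map the request back to the stored file. Both helper names are hypothetical, and the stripping step is an assumption about how a pseudo-folder would be handled.

```python
import re


def add_revision(url: str, revision: int) -> str:
    """Insert a revN pseudo-folder before the final path segment."""
    head, _, tail = url.rpartition("/")
    return f"{head}/rev{revision}/{tail}"


def strip_revision(url: str) -> str:
    """Remove the revN segment if it directly precedes the file name."""
    head, _, tail = url.rpartition("/")
    parent, _, last_dir = head.rpartition("/")
    if re.fullmatch(r"rev\d+", last_dir):
        return f"{parent}/{tail}"
    return url  # not a versioned URL, leave it untouched


plain = (
    "http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/"
    "Hillary_Rodham_Clinton_Signature.svg/"
    "128px-Hillary_Rodham_Clinton_Signature.svg.png"
)
versioned = add_revision(plain, 125)
print(versioned)
print(strip_revision(versioned) == plain)
```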

Cache busting is a must, as we have assets stored indefinitely in all possible caches, not only in Squid. I don't know if you need to tell Squid to clean old URLs or whether they'll just be LRU-ed later.

BTW, all this goes for skin files as well. It should probably be done differently there, though - as a build script, a post-commit hook, or something like the SVN-Assets tool that checks the repository revision or the hash of the file and generates file names accordingly.

      Sergey
Comment 15 Sumana Harihareswara 2011-11-10 06:57:33 UTC
Adding the need-review keyword to indicate that this patch still needs to be reviewed.  Thanks for the patch and sorry for the wait, Sergey.
Comment 16 Sumana Harihareswara 2011-12-22 05:50:06 UTC
Sergey, I'm sorry, but because so much time has passed since you submitted your patch, trunk has changed and your patch no longer applies cleanly.  If the problem's still happening, would you mind updating it and then letting me know?  I'll then get a reviewer for it.  Thanks.
Comment 17 Tim Starling 2011-12-22 10:24:33 UTC
It doesn't make any difference to me whether or not the patch is updated. There's not a significant amount of code review here, it's mostly about the idea.
Comment 18 Sergey Chernyshev 2011-12-22 15:52:50 UTC
Glad you guys are on it - I don't think I can dig into the guts of MW again to get it working right, but Tim is correct, it's not much code, just simple stuff.

Still, if you can use real or pseudo-folders for file names, that would be even better (query strings might not be as good in terms of caches like your Squids and external caches too).

BTW, the old way and the new way can coexist if there are worries about some instances not being able to support remote repos - all you need to do is set up infinite expires only on versioned URLs and keep regular URLs intact.
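
One way to picture the coexistence rule is a function that picks response headers based on whether the requested URL carries a version marker. This is only a sketch: the `timestamp` query parameter, the function name, and the reuse of the 25920000-second lifetime from comment 1 are assumptions.

```python
from urllib.parse import urlsplit, parse_qs

# 300 days, the same lifetime as the ExpiresDefault value in comment 1.
FAR_FUTURE_SECONDS = 25920000


def cache_headers(url: str) -> dict:
    """Far-future caching for versioned URLs only; others stay as they are."""
    query = parse_qs(urlsplit(url).query)
    if "timestamp" in query:
        return {"Cache-Control": f"public, max-age={FAR_FUTURE_SECONDS}"}
    return {}  # unversioned URL: no extra headers, clients keep revalidating


print(cache_headers("/images/a/ab/Foo.jpg?timestamp=20090219224509"))
print(cache_headers("/images/a/ab/Foo.jpg"))
```

Sites that never emit versioned URLs would see no change in behavior under this rule.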
Comment 19 Sergey Chernyshev 2012-03-02 00:24:58 UTC
It's been 3 years already, but Hillary is still very slow:
http://www.webpagetest.org/result/120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9/

The second run is not significantly more efficient than the first one:
http://www.webpagetest.org/video/compare.php?tests=120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:0,120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:1
Comment 20 Bawolff (Brian Wolff) 2012-03-02 01:47:57 UTC
(In reply to comment #19)
> It's been 3 years already, but Hillary is still very slow:
> http://www.webpagetest.org/result/120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9/
> 
> The second run is not significantly more efficient than the first one:
> http://www.webpagetest.org/video/compare.php?tests=120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:0,120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:1

Well, the second run being almost the same speed is mostly due to counting the ajax from the banner load. If you discount the ajax, it'd be roughly 6.2 seconds vs 5.5 seconds. If you also don't count one image that took insanely long to return a 304 (which could just be a rare occurrence, or it could be commonplace, I don't really know), the comparison becomes 6.2 seconds vs 3.6 seconds. Hence the speedup from fixing this bug might not be as large as that test would lead you to believe. (It is still probably fairly significant though, assuming it can be done effectively.)
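
To put those figures in relative terms, the sketch below computes how much faster the repeat view is than the first view for each pair of numbers quoted above. It assumes 6.2 seconds is the first view in both pairs, and it describes the gap measured in that test, not a prediction of what the fix would gain.

```python
def saving(first_view: float, repeat_view: float) -> float:
    """Fraction of the first-view time saved on the repeat view."""
    return (first_view - repeat_view) / first_view


# Discounting the banner ajax only.
print(f"{saving(6.2, 5.5):.1%}")
# Also discounting the one slow 304 response.
print(f"{saving(6.2, 3.6):.1%}")
```

That works out to roughly 11% and 42% respectively.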
Comment 21 Sergey Chernyshev 2012-03-02 07:26:22 UTC
Actually, I'm looking at render times, not at load events.
Comment 22 Rob Lanphier 2012-11-01 00:41:04 UTC
Adding performance keyword, and removing Tim since he's not specifically looking at this.  Aaron Schulz may have an idea or two about where we should go with this.
Comment 23 Derk-Jan Hartman 2013-04-25 09:02:45 UTC
We really should look at this one again. If the WMF infra is so problematic, then perhaps we should wrap it in a conditional, so that at least it will improve functionality for 'the rest of them'?

It could potentially fix the problem where people upload a new version of an image and some browsers don't purge the cached copy of the thumbnail by themselves. (We still need to tell some people to bypass their browser cache after 'upload new version' at times, even though I can't really see why a browser would 'not' send a request with the current setup. Perhaps some browsers try to be too smart if there is no Cache-Control:must-revalidate and set a hidden max-age?)
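
One mechanism that might be behind such a "hidden max-age" is heuristic freshness: when a response has a Last-Modified header but no explicit expiry, the HTTP caching specification allows a cache to treat it as fresh for a fraction of its age, with 10% given as a typical value. Whether this explains the reports above is not established here; the sketch only shows the calculation, with illustrative dates and names.

```python
from datetime import datetime, timedelta, timezone

# The "typical" fraction mentioned in the HTTP caching specification.
HEURISTIC_FRACTION = 0.10


def heuristic_freshness(date: datetime, last_modified: datetime) -> timedelta:
    """How long a cache may reuse a response that has no explicit expiry."""
    age_at_fetch = max(date - last_modified, timedelta(0))
    return age_at_fetch * HEURISTIC_FRACTION


# Hypothetical thumbnail last changed 100 days before it was fetched.
fetched = datetime(2013, 4, 25, 9, 0, tzinfo=timezone.utc)
modified = fetched - timedelta(days=100)
print(heuristic_freshness(fetched, modified))
```

Under that heuristic, an image that had been unchanged for 100 days could be reused for about 10 days without any request, which would hide a fresh upload for that long.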
Comment 24 Sergey Chernyshev 2013-11-14 15:38:16 UTC
Checking back 4.5 years later, are you guys still interested in saving traffic and increasing performance of web pages?

Any way I can help with this? Refreshing everybody's memory? Explaining the effect this can have on users and systems?

I'll be happy to do so - can even take a day or two of vacation to help.
Comment 25 Bawolff (Brian Wolff) 2013-11-14 15:54:37 UTC
(In reply to comment #24)
> Checking back 4.5 years later, are you guys still interested in saving
> traffic
> and increasing performance of web pages?
> 
> Any way I can help with this? Refreshing everybody's memory? Explaining the
> effect this can have on users and systems?
> 
> I'll be happy to do so - can even take a day or two of vacation to help.

Actually there has been recent interest in this sort of thing, but for different reasons (easier management of purging cache on server side. Obviously your reasons are good too)
Comment 26 Sergey Chernyshev 2013-11-15 03:19:47 UTC
Great, I'll be happy to see this implemented.
