Last modified: 2010-09-23 15:50:50 UTC
Created attachment 6553: side-by-side comparison showing the mobile site delay. Mobile pages are occasionally out of sync with the regular pages. In one instance that I know of (see attachment), a change made 11 hours ago had still not appeared on the mobile site. Noticed after an email was sent to OTRS.
The problem is that Wikimedia Mobile uses a naive caching system with a fixed expiry time and no method, manual or automatic, to purge cache entries following an article update. There were further reports on IRC today when an image was deleted from the main page, causing long-term breakage of the mobile main page. mobile1 is currently not reaching 100% CPU even when the cache is empty. So I reduced the cache size by a factor of 5 to reduce the effective expiry time. This will reduce the impact of this bug. I suggest either removing the cache altogether or implementing a proper invalidation scheme. We can easily scale up the CPU usage by adding more servers.
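For illustration only, here is a minimal sketch of the fixed-expiry behaviour described above. It is not the Wikimedia Mobile code: the class name, the `fetch` callback and the injectable `clock` are all assumptions made for the example. It shows why an edit stays invisible until the entry ages out, since nothing ever checks the source in between.

```python
import time


class FixedExpiryCache:
    """Hypothetical cache with a fixed TTL and no way to purge an entry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self._entries = {}          # key -> (stored_at, value)

    def get(self, key, fetch):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            # Served from cache; may be stale, the source is never consulted.
            return entry[1]
        # Missing or expired: fetch a fresh copy and restart the timer.
        value = fetch(key)
        self._entries[key] = (now, value)
        return value
```

With a one-hour TTL, an article edited just after it was cached is served in its old form for up to an hour, whatever happens on the main site.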
The aggressive caching came from using our initial server, which was extremely slow and not at all prepared to handle traffic. Moving the system over to the mobile1 box has given us much more power and a lot more space. Mobile does use a very standard caching system. Since it is not integrated with the main Wikipedia, Brion and I felt that simple caching would avoid a lot of complex integration work. Personally, I don't believe this is as big an issue as it's being made out to be. It's been this way for months and, apart from people commenting that the homepage is different, there doesn't seem to be a giant problem here. The problem with *not* caching isn't CPU, it's page loading time. The server must download the entire page, parse it, modify it, and then jam it into the layout, which takes about 0.3 seconds, as opposed to grabbing it from cache and throwing it in the layout, which takes about 9ms. Certainly though, we can tweak the cache settings. A smaller cache is A-OK with me. It's doing about a 50% cache hit rate right now, which is totally fine. I'd even be fine with doing only an hour of cache. I just don't think either "removing the cache altogether or implementing a proper invalidation scheme" is a really good option. The first one means a slower site; the second one means a ton of breakable, complex integrations.
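As a rough back-of-the-envelope check on those numbers, the average page time can be estimated from the hit rate. The sketch below assumes every miss costs the full 0.3 seconds and every hit 9ms, and it ignores the cost of writing to the cache; the function name is made up for the example.

```python
def expected_latency_ms(hit_rate, hit_ms=9.0, miss_ms=300.0):
    """Average time per page, weighting hit and miss cost by the hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


print(expected_latency_ms(0.5))  # 154.5, at the 50% hit rate quoted above
print(expected_latency_ms(0.0))  # 300.0, with no cache at all
```

Under those assumptions a 50% hit rate roughly halves the average page time compared with no cache.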
It isn't that complex to implement invalidation. The MediaWiki app writes invalidation streams, and listening to them isn't that difficult: all you have to do is map the stream events to memcached objects and delete them.
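A minimal sketch of that mapping step, under stated assumptions: the event is taken to be a dict with a `title` field, the key scheme is invented for the example, and an in-memory class stands in for a real memcached client. The actual stream format and the cache keys used by the mobile site are not given in this report.

```python
import hashlib


def cache_key(title):
    """Hypothetical key scheme: one cache entry per normalized article title."""
    normalized = title.strip().replace(" ", "_")
    return "mobile:page:" + hashlib.md5(normalized.encode("utf-8")).hexdigest()


class DictCache:
    """In-memory stand-in for the get/set/delete subset of a memcached client."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        # True if an entry was actually removed.
        return self._data.pop(key, None) is not None


def handle_invalidation(cache, event):
    """Delete the cached page named by one invalidation event (assumed shape)."""
    title = event.get("title")
    if not title:
        return False  # nothing to map, so ignore the event
    return cache.delete(cache_key(title))
```

A listener would call `handle_invalidation` once per event read from the stream; the next request for that page then misses the cache and fetches the updated article.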
I've moved us to a 1 hour expiration time and we can keep the 5GB cache. Caching is much, much less aggressive now. Is this satisfactory to everyone for the time being?
I'm closing this bug. 2 hours seems to be pretty satisfactory.