
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator; bug reports are now handled in Wikimedia Phabricator. This static archive is read-only and kept for historical purposes: it is not possible to log in, and links other than bug reports and their history may be broken. See T45449, the corresponding Phabricator task, for complete and up-to-date information on this report.
Bug 43449 - Monitor effectiveness of HTCP purging
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware/OS: All / All
Importance: High enhancement (2 votes)
Assigned To: Brandon Black
Keywords: ops
Depends on: (none)
Blocks: (none)
Reported: 2012-12-27 00:25 UTC by Bawolff (Brian Wolff)
Modified: 2013-08-20 01:18 UTC
CC: 10 users
Attachments: (none)

Description Bawolff (Brian Wolff) 2012-12-27 00:25:09 UTC
Moved over from bug 43448:

Could we have some sort of monitoring of whether HTCP purge packets actually
cause things to get purged? Squid/Varnish purging suddenly not working has
happened quite a few times in the past (each time for a different reason), and we
have no monitoring for it. (We don't even have any unit tests on the MediaWiki
side for it, as far as I am aware.)

This is bad because:
*Most people don't have Squid/Varnish set up in their dev environment, so
they don't notice the problem on local test wikis.
*The symptoms are gradual and usually pass unnoticed for some time.
*With the exception of images, the people primarily affected are anons, who are
less likely to know how to report the issue.
Comment 1 Andre Klapper 2013-01-22 13:28:49 UTC
FYI, posted on ops@ by Tim Starling six hours ago:

"There is a nagios check to make sure varnishhtcpd is working, but it
only checks to see if the process is still running, it doesn't check
to see if it is actually working."
Comment 2 Andre Klapper 2013-01-23 11:56:17 UTC
For the record, in case somebody considers working on this:
<TimStarling> andre__:  just purge a URL, request it, and check its Age header
<TimStarling> it should be less than some threshold
<TimStarling> http://tools.ietf.org/rfcmarkup?doc=2616#section-5.1.2
Comment 3 Rob Lanphier 2013-01-29 00:13:27 UTC
I spoke with CT about this, and he's going to talk to his team about what can be done here.
Comment 4 Andre Klapper 2013-02-27 19:50:19 UTC
Assigning to Mark as just discussed in the Ops/Platform meeting here in SF.
Comment 5 Andre Klapper 2013-02-28 22:29:13 UTC
RT #4607
Comment 6 Greg Grossmeier 2013-04-10 20:34:32 UTC
Assigning to Brandon per Roadmap Updates meeting and email thread.
Comment 7 Andre Klapper 2013-05-16 15:39:00 UTC
Brandon: Is there any news / progress to share yet?
Comment 8 Brandon Black 2013-05-16 22:38:09 UTC
Yes, I've been implementing a replacement for varnishhtcpd.  You can see the evolving initial version at the changeset here: https://gerrit.wikimedia.org/r/#/c/60390/ .  I hope to be able to test it in prod in the next few days, and it shouldn't suffer from the perf/loss bugs of the previous implementation.

Stats output still needs implementation there as well (for monitoring the daemon's own reliability as well as other issues like loss of multicast delivery), but we'd rather put the stats work in the fresh new code than attach to the known-failing code.
Comment 9 Brandon Black 2013-05-29 15:52:54 UTC
The replacement daemon was deployed to production today.  The initial deployment is just a minimum-change swap of the two pieces of software.  Further enhancements (to performance, and logging of stats to spot multicast loss) will come once this has had a little time to stabilize without any loud complaints of being worse than before.
Comment 10 Andre Klapper 2013-06-18 07:16:08 UTC
(In reply to comment #9 by Brandon Black)
> The replacement daemon was deployed to production today.  The initial
> deployment is just a minimum-change swap of the two pieces of software. 
> Further enhancements (to performance, and logging of stats to spot multicast
> loss) will come once this has had a little time to stabilize without any loud
> complaints of being worse than before.

Great! Is it already possible to judge whether it's stable enough?
Comment 11 Andre Klapper 2013-07-03 12:23:52 UTC
Brandon: Were the last 5 weeks enough time to judge whether it's stable enough? (Is this bug report fixed?)
Comment 12 Brandon Black 2013-07-03 14:07:42 UTC
Yes, I think so, although I just fixed a bug in the software yesterday.  Still, it's a significant improvement and we've un-deployed the previous software.  May as well close this bug and then open further ones as warranted for further changes to our purging architecture?  The title of the bug doesn't precisely correlate with what ended up happening anyways (monitoring the success rate, which we still can't really do, and won't ever be able to do with any real accuracy so long as it's plain multicast).
Comment 13 Rob Lanphier 2013-07-29 16:57:50 UTC
Brandon, are you actually building proper monitoring into this daemon, or do we need to start separate work?  I remember Mark making the case that this could be done within Varnish, but I'm still kinda confused as to how we can actually do effective monitoring of Varnish purging from within Varnish.

Bug 49362 is an example of a bug that would be great to have proper monitoring for.
Comment 14 Brandon Black 2013-07-29 17:59:40 UTC
The daemon logs some stats to a file, which we could pick up and graph (but currently do not, yet).  These would basically give you the rate of multicast purge requests the daemon's receiving and whether it's failing to process any of them due to some large-impact bug that's overflowing the queue.

The larger issue that makes that relatively ineffective is that the requests arrive over multicast, which is an unreliable protocol by design.  They could be lost in the sender's output buffers, anywhere in the network, or discarded at the receiving cache (local buffering issues) and we'd have no indication that was happening.

Upgrading from multicast is also an expensive proposition in terms of complexity (after all, the reason we're using it is that it's simple and efficient).  We've thrown around some ideas about replacing multicast with http://en.wikipedia.org/wiki/Pragmatic_General_Multicast , likely using http://zeromq.org/ as the communications abstraction layer, as a solution to the unreliability of multicast.  This would basically give us a reliable sequence-number system with retransmission that's handled at that layer.

That means adding zeromq support to the php that sends the purge requests, adding it to vhtcpd, and most likely also building out a redundant, co-operating set of middleboxes as publish/subscribe multiplexers.  I'm not fond of going down this path unless we really see a strong need to upgrade from multicast, though.  It smells of too much complexity for the problem we're trying to solve, and/or that there may be a better mechanism for this if we re-think how purging is being accomplished in general.

In any case, I think that would all be outside the scope of this ticket.
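
Purely to illustrate the idea sketched above (not a deployed or planned design): with ZeroMQ's epgm:// transport, PGM supplies sequencing and retransmission underneath an ordinary PUB/SUB socket pair. The interface, group address, and port below are made-up placeholders, and the transport is only available if libzmq was built with OpenPGM support.

  # Illustration only: publish purge URLs over ZeroMQ's PGM transport, which
  # provides sequencing and retransmission below the PUB/SUB abstraction.
  # Interface name, group address, and port are made-up placeholders.
  import zmq

  ENDPOINT = "epgm://eth0;239.192.0.1:5555"  # placeholder multicast group/port

  def publisher(urls):
      ctx = zmq.Context()
      pub = ctx.socket(zmq.PUB)
      pub.connect(ENDPOINT)            # PGM peers all connect to the group
      for url in urls:
          pub.send_string(url)         # one purge request per message

  def subscriber(handle_purge):
      ctx = zmq.Context()
      sub = ctx.socket(zmq.SUB)
      sub.connect(ENDPOINT)
      sub.setsockopt_string(zmq.SUBSCRIBE, "")   # receive every purge URL
      while True:
          handle_purge(sub.recv_string())        # e.g. issue a PURGE to Varnish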
Comment 15 Bawolff (Brian Wolff) 2013-07-29 18:09:57 UTC
I think monitoring should happen outside of the daemon (since, as you say, there's a limit to what we can do with an unreliable protocol).

What I would suggest is some script (perhaps even living on Tool Labs) that does the following:
*Find the 10 most recently overwritten files. Get the thumbnail sizes that would appear on the image description page, along with the original file asset (from both the European and North American Varnish clusters). Look at the Age header. If the Age is longer than the time between the last re-upload and now, yell.
*Pick a test file at random. Request the file at some random size. Do ?action=purge, sleep for about 10 seconds, then request the file again. Check that the Age header is either not present or < 10.
*For good measure, pick a popular page like [[Wikipedia:Village pump (Technical)]] (and also some redirect page like [[WP:VPT]]). Request the page and check that the Age header is less than the time between now and the last edit. (Or, at least for the redirect case, make sure the difference isn't huge, to allow some leeway for the job queue.) A rough sketch of this last check follows below.
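
For illustration, a rough, untested sketch of that third check (the API endpoint, page title, and one-hour leeway are assumptions for the example, not anything deployed):

  # Rough sketch of the third check above: the Age header of a popular page
  # should not exceed the time since its last edit (plus some leeway).  The
  # API endpoint, page title, and leeway value are illustrative assumptions.
  import calendar
  import time
  import requests

  API = "https://en.wikipedia.org/w/api.php"
  TITLE = "Wikipedia:Village pump (technical)"
  LEEWAY = 3600  # seconds of slack for job-queue and propagation delays

  # Timestamp of the page's most recent revision, via the API.
  r = requests.get(API, params={
      "action": "query",
      "prop": "revisions",
      "titles": TITLE,
      "rvprop": "timestamp",
      "rvlimit": 1,
      "format": "json",
  }, timeout=10)
  page = list(r.json()["query"]["pages"].values())[0]
  last_edit = page["revisions"][0]["timestamp"]  # ISO 8601, UTC
  edit_age = time.time() - calendar.timegm(
      time.strptime(last_edit, "%Y-%m-%dT%H:%M:%SZ"))

  # Age of the copy the cache serves us.
  resp = requests.get("https://en.wikipedia.org/wiki/" + TITLE.replace(" ", "_"),
                      timeout=10)
  cache_age = int(resp.headers.get("Age", 0))  # no Age header -> treat as fresh

  if cache_age > edit_age + LEEWAY:
      print("PURGE PROBLEM: cached copy is %ds old, page edited %ds ago"
            % (cache_age, edit_age))
  else:
      print("OK: Age=%ds, last edit %ds ago" % (cache_age, edit_age))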
Comment 16 Brandon Black 2013-07-29 18:14:08 UTC
How would one "Find the 10 most recent overwritten files" reliably/efficiently?  Most of these solutions you're suggesting seem to give us some probabilistic idea that things are working, but really solve the problem if a random small percentage of purges are being lost in the pipe somewhere.  They'd have to run at pretty high rates to even catch singular failed elements (one varnish not receiving purges, which may or may not have already cached the test file, which you may or may not hit with your check)
Comment 17 Brandon Black 2013-07-29 18:14:54 UTC
Sorry, I meant to say "..., but really DON'T solve the problem ..."
Comment 18 Bawolff (Brian Wolff) 2013-07-29 18:21:07 UTC
(In reply to comment #16)
> How would one "Find the 10 most recent overwritten files"
> reliably/efficiently?
>  Most of these solutions you're suggesting seem to give us some probabilistic
> idea that things are working, but really solve the problem if a random small
> percentage of purges are being lost in the pipe somewhere.  They'd have to
> run
> at pretty high rates to even catch singular failed elements (one varnish not
> receiving purges, which may or may not have already cached the test file,
> which
> you may or may not hit with your check)

Very true. However, I'm more concerned with mass failures (the type of thing where doing this once every 6 hours would be sufficient). Massive failures of the purging system have happened several times in the past. Monitoring for that type of failure is, I think, important. (Fine-grained monitoring would be cool too, but seems more difficult.)
Comment 19 Bawolff (Brian Wolff) 2013-07-29 18:24:01 UTC
>How would one "Find the 10 most recent overwritten files" reliably/efficiently?

via db query (or api): select log_title from logging where log_type = 'upload' and log_action = 'overwrite' order by log_timestamp DESC limit 10;
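
For reference, a roughly equivalent lookup through the web API might look like this sketch (the Commons endpoint and the leaction=upload/overwrite filter are assumptions for the example, not taken from this bug):

  # Hypothetical API-based equivalent of the SQL above: the ten most recent
  # "overwrite" entries in the upload log.
  import requests

  API = "https://commons.wikimedia.org/w/api.php"
  r = requests.get(API, params={
      "action": "query",
      "list": "logevents",
      "letype": "upload",
      "leaction": "upload/overwrite",
      "lelimit": 10,
      "leprop": "title|timestamp",
      "format": "json",
  }, timeout=10)
  for event in r.json()["query"]["logevents"]:
      print(event["timestamp"], event["title"])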
Comment 20 Bryan Davis 2013-08-12 17:49:57 UTC
There is a patch to set up Ganglia monitoring for vhtcpd in Gerrit change #77975.
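
As an illustration of the general idea only (not the contents of that change): a standalone sketch that parses a vhtcpd stats file and pushes the counters to Ganglia with the gmetric command-line tool. The stats-file path and field format are assumptions.

  # Illustration only -- not Gerrit change 77975.  Parse a vhtcpd stats file of
  # "key:value" tokens (path and field names are assumed) and push each counter
  # to Ganglia with the gmetric command-line tool.
  import subprocess

  STATS_FILE = "/tmp/vhtcpd.stats"   # assumed location of the daemon's stats dump

  with open(STATS_FILE) as f:
      tokens = f.read().split()

  for token in tokens:
      key, _, value = token.partition(":")
      if not value.isdigit():
          continue                   # skip anything that isn't a plain counter
      subprocess.check_call([
          "gmetric",
          "--name", "vhtcpd_" + key, # e.g. vhtcpd_inpkts_dequeued
          "--value", value,
          "--type", "uint32",
      ])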
Comment 21 Gerrit Notification Bot 2013-08-13 03:37:40 UTC
Change 77975 had a related patch set uploaded by MZMcBride:
Add ganglia monitoring for vhtcpd.

https://gerrit.wikimedia.org/r/77975
Comment 22 Gerrit Notification Bot 2013-08-16 16:18:25 UTC
Change 77975 merged by BBlack:
Add ganglia monitoring for vhtcpd.

https://gerrit.wikimedia.org/r/77975
Comment 23 MZMcBride 2013-08-20 01:06:40 UTC
(In reply to comment #22)
> Change 77975 merged by BBlack:
> Add ganglia monitoring for vhtcpd.
> 
> https://gerrit.wikimedia.org/r/77975

With this changeset now merged, I'm a little unclear what's still needed to mark this bug as resolved/fixed. Brian W.: can you clarify?
Comment 24 Bawolff (Brian Wolff) 2013-08-20 01:18:48 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > Change 77975 merged by BBlack:
> > Add ganglia monitoring for vhtcpd.
> > 
> > https://gerrit.wikimedia.org/r/77975
> 
> With this changeset now merged, I'm a little unclear what's still needed to
> mark this bug as resolved/fixed. Brian W.: can you clarify?

I'm going to call this bug closed for now. There are pretty graphs at http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=vhtcpd_inpkts_dequeued&s=by+name&c=Upload+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4

One thing we were talking about earlier was doing actual tests, where we have a script that either looks at recent re-uploads and checks that the purge succeeded, or purges specific things and checks that that works. We could do that later if this monitoring turns out not to be enough.
