Last modified: 2014-04-11 08:27:42 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T54295, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 52295 - Add sanitized User-Agent to default fields logged by EventLogging
Status: NEW
Product: Analytics
Classification: Unclassified
Component: EventLogging
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: analytics
Depends on:
Blocks: 59832
Reported: 2013-07-30 20:44 UTC by Dario Taraborelli
Modified: 2014-04-11 08:27 UTC (History)
CC List: 15 users

Description Dario Taraborelli 2013-07-30 20:44:54 UTC
Logging sanitized user-agents allows us to diagnose browser-specific performance and usability issues. UAs have already been logged as part of http://meta.wikimedia.org/wiki/Schema:NavigationTiming and we added them to the instrumentation requirements for http://meta.wikimedia.org/wiki/Schema:Edit.

This proposal is to make UA a default field logged by EventLogging for all client-side events.
Comment 1 Steven Walling 2013-07-30 21:07:57 UTC
Can you expand on what "sanitized" means for user agents?
Comment 2 Dario Taraborelli 2013-07-30 21:12:50 UTC
Instead of logging the full, unparsed UA string we match it against a list of the N most popular browser/agents and log everything else as "other".
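
A minimal sketch of the matching described in this comment, in Python. The list `KNOWN_AGENTS` and the function name `sanitize_user_agent` are hypothetical and chosen only for illustration; they are not taken from EventLogging, and a real list would be derived from actual traffic.

```python
# Hypothetical allow-list: (substring to look for, label to log).
# Order matters: Chrome UAs also contain "Safari/", so Chrome is checked first.
KNOWN_AGENTS = [
    ("Firefox/", "Firefox"),
    ("Chrome/", "Chrome"),
    ("Safari/", "Safari"),
    ("MSIE ", "MSIE"),
]


def sanitize_user_agent(ua):
    """Return the label of the first known agent found in `ua`, else "other"."""
    for needle, label in KNOWN_AGENTS:
        if needle in ua:
            return label
    return "other"
```

With this sketch, any UA string that matches none of the listed substrings (for example `curl/7.30.0`) is logged only as "other", so the unparsed string never reaches storage.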
Comment 3 Luis Villa (WMF Legal) 2013-07-30 21:40:57 UTC
Or perhaps not "match against a list", and instead simply bucket them - top 10 (or 100) get left as-is, rest get bucketed by OS/browser with details removed. Marc-Andre has implemented this in labs, so ccing him.
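
The bucketing variant could look roughly like the Python sketch below. It assumes a precomputed set of the most common full UA strings (`TOP_USER_AGENTS`, hypothetical) and uses illustrative regular expressions for OS and browser family; none of these names or patterns come from the Labs implementation mentioned above.

```python
import re

# Assumption: a precomputed set of the most common full UA strings.
TOP_USER_AGENTS = frozenset([
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0",
])

# Hypothetical patterns; a real list would need many more cases.
# Order matters: Android UAs contain "Linux", iOS UAs contain "Mac OS X".
OS_PATTERNS = [
    (re.compile(r"Windows"), "Windows"),
    (re.compile(r"Android"), "Android"),
    (re.compile(r"iPhone|iPad"), "iOS"),
    (re.compile(r"Mac OS X"), "Mac OS X"),
    (re.compile(r"Linux"), "Linux"),
]
BROWSER_PATTERNS = [
    (re.compile(r"Firefox/"), "Firefox"),
    (re.compile(r"Chrome/"), "Chrome"),
    (re.compile(r"Safari/"), "Safari"),
    (re.compile(r"MSIE "), "MSIE"),
]


def first_match(patterns, ua):
    """Return the label of the first pattern that matches, else "other"."""
    for pattern, label in patterns:
        if pattern.search(ua):
            return label
    return "other"


def bucket_user_agent(ua, top=TOP_USER_AGENTS):
    """Keep common UAs as-is; reduce everything else to "OS/browser"."""
    if ua in top:
        return ua
    return "{}/{}".format(first_match(OS_PATTERNS, ua),
                          first_match(BROWSER_PATTERNS, ua))
```

Rare UA strings lose their version numbers and device details, while the common ones stay available for debugging against specific versions.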
Comment 4 Marc A. Pelletier 2013-07-31 03:07:50 UTC
Only the sanitization part has been implemented, given the smaller scope of usefulness for Tool Labs (where debugging against specific versions is less of an issue).  I was planning to use DevCamp as an opportunity to hack at bugzillas; I'll whip up a PHP version of the sanitizing code then for inclusion in EventLogging.

That said, keeping a dynamic "top N" for bucketing may or may not be reasonable in terms of performance for something called that often; we'll have to see how that fares in practice.
Comment 5 Oliver Keyes 2013-11-20 19:28:28 UTC
Marc, any news on the PHP version?
Comment 6 Dario Taraborelli 2013-12-05 20:01:19 UTC
We also have new use cases from VE (estimating how many new registered users have VE-capable browsers) and MultiMedia (see https://meta.wikimedia.org/wiki/Schema:MediaViewerPerf) that would benefit from this change.
Comment 7 Gerrit Notification Bot 2013-12-19 21:27:09 UTC
Change 102817 had a related patch set uploaded by Ori.livneh:
Add user-agent header to the format spec of EventLogging's varnishncsa instance

https://gerrit.wikimedia.org/r/102817
Comment 8 Gabriel Wicke 2013-12-19 23:21:01 UTC
(In reply to comment #7)
> Change 102817 had a related patch set uploaded by Ori.livneh:
> Add user-agent header to the format spec of EventLogging's varnishncsa
> instance
> 
> https://gerrit.wikimedia.org/r/102817

Will this make it possible to get statistics of browsers used for logged-in requests as discussed in https://bugzilla.wikimedia.org/show_bug.cgi?id=56575 ?
Comment 9 Matthew Flaschen 2013-12-20 00:32:01 UTC
(In reply to comment #8)
> Will this make it possible to get statistics of browsers used for logged-in
> requests as discussed in
> https://bugzilla.wikimedia.org/show_bug.cgi?id=56575 ?

Bug 56575 is already possible (as the initial report here said, we already log the UA in specific cases).  There's an in-progress patch at https://gerrit.wikimedia.org/r/#/c/93526/ .

This is about logging it by default, which is possible, but not required for bug 56575.
Comment 10 Gabriel Wicke 2013-12-20 05:17:26 UTC
(In reply to comment #9)
> This is about logging it by default, which is possible, but not required for
> bug 56575.

Yes, sub-sampling would definitely be sufficient to get a representative sample. For the data I am primarily interested in it would have to be sub-sampling of all page views though, which is again very close to what is discussed in this bug.

Is there information about the authentication status in the custom varnishncsa logging? If so, then https://gerrit.wikimedia.org/r/102817 could be used directly to get browser market shares for anonymous vs. logged-in users.
Comment 11 Nemo 2013-12-20 08:03:51 UTC
(In reply to comment #7)
> Change 102817 had a related patch set uploaded by Ori.livneh:
> Add user-agent header to the format spec of EventLogging's varnishncsa
> instance
> 
> https://gerrit.wikimedia.org/r/102817

Is there a list of said callers doing "user-agent logging and processing" somewhere, for curiosity, to track the progress on their standardisation (which is a very good thing to do) and to help define requirements?

(In reply to comment #10)
> Yes, sub-sampling would definitely be sufficient to get a representative
> sample.

In what cases is sub-sampling not sufficient? Can EventLogging default to sampling for UA unless otherwise requested by callers? Or maybe the bucketing mentioned above has the same results, is the implementation mentioned in comment 4 described somewhere and/or the place where logs go documented (the latter was asked by MZMcBride in https://gerrit.wikimedia.org/r/#/c/93526/ )?
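
To make the question concrete, here is a rough Python sketch of "sample by default, unless the caller asks otherwise". The function names, the `userAgent` field name and the default rate of 1% are all assumptions for illustration, not existing EventLogging behaviour.

```python
import random


def should_log_user_agent(sample_rate, force=False, rng=random.random):
    """Decide whether an event carries the UA field.

    sample_rate: fraction of events in [0, 1] that keep the UA.
    force: caller opt-in that bypasses sampling.
    rng: callable returning a float in [0, 1); injectable for testing.
    """
    if force:
        return True
    return rng() < sample_rate


def build_event(payload, user_agent, sample_rate=0.01, force_ua=False,
                rng=random.random):
    """Return a copy of `payload`, adding the UA only when sampled in."""
    event = dict(payload)
    if should_log_user_agent(sample_rate, force_ua, rng):
        event["userAgent"] = user_agent  # hypothetical field name
    return event
```

A schema that needs the UA on every event would pass `force_ua=True`; everything else would get it on roughly one event in a hundred.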

Just for fun (everybody here knows already), panopticlick.eff.org gives my main browser's UA an entropy of 13.25 bits, making it the most identifying item of all; in my secondary browser (Chromium) it's 14.42 bits, the third worst after accept-language and plugins.
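
For readers less used to these figures: assuming the numbers are self-information in bits (which is how I read Panopticlick's output), they convert to "one in N browsers" as in this small Python sketch. The function names are mine.

```python
import math


def one_in_n(bits):
    """Number of equally likely values that would carry `bits` bits each."""
    return 2 ** bits


def bits_from_share(share):
    """Self-information, in bits, of a value seen in `share` of browsers."""
    return -math.log2(share)
```

So 13.25 bits corresponds to a UA shared by roughly one browser in ten thousand (`one_in_n(13.25)`), and 14.42 bits to roughly one in twenty-two thousand.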
Comment 12 Matthew Flaschen 2013-12-24 07:08:25 UTC
(In reply to comment #11)
> Or maybe the bucketing
> mentioned above has the same results, is the implementation mentioned in
> comment 4 described somewhere and/or the place where logs go documented (the
> latter was asked by MZMcBride in https://gerrit.wikimedia.org/r/#/c/93526/ )?

It's documented at https://wikitech.wikimedia.org/wiki/EventLogging#Data_storage; it might need an update. Basically, there are text logs (mainly used for debugging), Mongo (not sure if any analysts actually use this), and MySQL (commonly used by analysts).

You can see how the MySQL tables are generated at https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging.git/36cd7fbd9f763f369fb3d7ae503ef4c9133f99bf/server%2Feventlogging%2Fjrm.py
Comment 13 Gerrit Notification Bot 2014-01-07 18:59:32 UTC
Change 102817 merged by Ottomata:
Add user-agent header to the format spec of EventLogging's varnishncsa instance

https://gerrit.wikimedia.org/r/102817
Comment 14 Dario Taraborelli 2014-01-07 22:13:04 UTC
Nemo – I started adding some details on the sanitization logic (expanding on Nuria's draft) here: https://www.mediawiki.org/wiki/EventLogging/UserAgentAnonymization 

This is still a draft, we will add more information on the next steps (particularly on the bucketing, which hasn't been implemented yet).
Comment 15 Nemo 2014-01-08 10:17:17 UTC
(In reply to comment #14)
> Nemo – I started adding some details on the sanitization logic (expanding on
> Nuria's draft) here:
> https://www.mediawiki.org/wiki/EventLogging/UserAgentAnonymization 
> 
> This is still a draft, we will add more information on the next steps
> (particularly on the bucketing, which hasn't been implemented yet).

Thank you very much! Watchlisted, will look later.
Comment 16 nuria 2014-01-08 13:42:28 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Change 102817 had a related patch set uploaded by Ori.livneh:
> > Add user-agent header to the format spec of EventLogging's varnishncsa
> > instance
> > 
> > https://gerrit.wikimedia.org/r/102817
> 
> Will this make it possible to get statistics of browsers used for logged-in
> requests as discussed in
> https://bugzilla.wikimedia.org/show_bug.cgi?id=56575 ?

(Sorry, everyone, for not answering this comment sooner.)

Gabriel, this change is intended only for EventLogging data, so (once fully implemented) you would hopefully be able to get some user-agent data. But note that the data does not represent all requests to the site equally, only those for which EventLogging events are sent out.
Comment 17 nuria 2014-01-08 13:46:49 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > This is about logging it by default, which is possible, but not required for
> > bug 56575.
> 
> Yes, sub-sampling would definitely be sufficient to get a representative
> sample. For the data I am primarily interested in it would have to be
> sub-sampling of all page views though, which is again very close to what is
> discussed in this bug.
> 
> Is there information about the authentication status in the custom
> varnishncsa
> logging? If so, then https://gerrit.wikimedia.org/r/102817 could be used
> directly to get browser market shares for anonymous vs. logged-in users.


There is no authentication info in varnishncsa at the time of logging. But once this change is fully implemented, the UA will be logged for all events, and the events themselves do carry information about whether the user is logged in.
Comment 18 nuria 2014-01-08 14:16:50 UTC
(In reply to comment #14)
> Nemo – I started adding some details on the sanitization logic (expanding on
> Nuria's draft) here:
> https://www.mediawiki.org/wiki/EventLogging/UserAgentAnonymization 
> 
> This is still a draft, we will add more information on the next steps
> (particularly on the bucketing, which hasn't been implemented yet).

As Marc pointed out above, "keeping a dynamic 'top N' for bucketing may or may not be reasonable in terms of performance for something called that often; we'll have to see how that fares in practice". A logging solution needs to be as light as possible, which means decoupled from any kind of storage lookups upon logging.
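
One way to keep the log-time path free of storage lookups is to compute the "top N" offline and ship it as an in-memory set. The Python sketch below only illustrates that idea; `build_top_n` and `label_at_log_time` are hypothetical names, and how often the set is rebuilt is left open.

```python
from collections import Counter


def build_top_n(sample_of_uas, n):
    """Offline step: return the n most frequent UA strings as a frozenset."""
    counts = Counter(sample_of_uas)
    return frozenset(ua for ua, _ in counts.most_common(n))


def label_at_log_time(ua, top):
    """Log-time step: a single in-memory set lookup, no storage access."""
    return ua if ua in top else "other"
```

The offline step could run periodically against a sample of recent logs, so that the logging code itself only ever does a constant-time membership test.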
Comment 19 Andre Klapper 2014-02-26 12:51:07 UTC
[Referred Gerrit patch has been merged; resetting status]
Comment 20 Andre Klapper 2014-02-26 12:53:31 UTC
[moving from MediaWiki extensions to Analytics product - see bug 61946]
Comment 21 Nemo 2014-04-06 15:40:43 UTC
From [[mail:analytics]]: "We also finished a numbers of unplanned tasks: [...] User Agent discussions (EventLogging)". Any published notes/recap?
Comment 22 nuria 2014-04-06 18:53:26 UTC
We have hired a hard-working Product Manager for Analytics who is getting up to speed on the issues regarding User Agents and privacy. He will publish documentation once he's had time to catch up.
