Last modified: 2014-02-13 01:05:11 UTC

Bug 12742 - Collect enwiki clickstream data (we could use it to automatically fix links to disambiguation pages and more)
Product: Datasets
Classification: Unclassified
Component: Webstatscollector
Importance: Low enhancement
Assigned To: Nobody - You can work on this!
: analytics
Reported: 2008-01-22 20:37 UTC by Jason Spiro
Modified: 2014-02-13 01:05 UTC


Description Jason Spiro 2008-01-22 20:37:42 UTC
It looks like bug 4118 ("Semi-automatic disambiguation") won't be implemented. But you developers have access to the server logs. Could you buy a tool to derive [[clickstream]] data from the enwiki logs, strip the IP addresses out of the reports, and then either post the data online or share it with people who request it? We could then use the data

* to write a bot that will use it to automatically fix links to disambig pages (this is a separate idea that I can file a bug for later)
* or for all sorts of other uses.  (I don't know what clickstream data can be used for so I don't know what these possible uses are.)
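
A minimal sketch of what "derive clickstream data and strip out the IP addresses" could look like, assuming each log record is already parsed into a dict with "ip", "referrer" and "url" keys. That record layout, the sample data and the function names are all hypothetical; the real Wikimedia log format is not described in this report. The idea is to keep only aggregated (source page, target page) counts, so the IP address never reaches the output.

```python
from collections import Counter
from urllib.parse import urlsplit, unquote

WIKI_HOST = "en.wikipedia.org"   # assumption: only enwiki article views are counted
ARTICLE_PREFIX = "/wiki/"


def title_from_url(url):
    """Return the page title for an enwiki article URL, or None for anything else."""
    parts = urlsplit(url)
    if parts.netloc != WIKI_HOST or not parts.path.startswith(ARTICLE_PREFIX):
        return None
    title = unquote(parts.path[len(ARTICLE_PREFIX):])
    return title or None


def clickstream_counts(records):
    """Count (source title, target title) pairs; the "ip" field is never read."""
    counts = Counter()
    for record in records:
        source = title_from_url(record["referrer"])
        target = title_from_url(record["url"])
        if source and target:
            counts[(source, target)] += 1
    return counts


# Hypothetical sample records (documentation-range IP addresses).
sample = [
    {"ip": "192.0.2.1", "referrer": "https://en.wikipedia.org/wiki/Mercury",
     "url": "https://en.wikipedia.org/wiki/Mercury_(planet)"},
    {"ip": "192.0.2.2", "referrer": "https://en.wikipedia.org/wiki/Mercury",
     "url": "https://en.wikipedia.org/wiki/Mercury_(planet)"},
    {"ip": "192.0.2.3", "referrer": "https://en.wikipedia.org/wiki/Mercury",
     "url": "https://en.wikipedia.org/wiki/Mercury_(element)"},
    {"ip": "192.0.2.4", "referrer": "",
     "url": "https://en.wikipedia.org/wiki/Mercury"},
]

counts = clickstream_counts(sample)
```

For the disambiguation use case, a bot could look at which target readers choose most often after landing on a disambiguation page. Dropping the IP column does not by itself make the data anonymous, since an individual browsing path can still identify someone.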
Comment 1 Bawolff (Brian Wolff) 2011-02-08 03:14:40 UTC
As a privacy nutjob.... (I think you can guess where my comment is going)
Comment 2 Jason Spiro 2011-02-09 03:37:20 UTC
Sorry Bawolff.  Based on some Google research I did in response to your comment, I found that the Wikimedia Foundation already decided last year to get some better analytics tools.[1]  :)  But remember that the Foundation has a privacy policy already.  Also, they can do a few things if they so choose:  they can limit who can see the data, and they can limit from whom they collect the data.

^  [1].

Also, if they decide to make clickstream data available to certain people (say, bot developers), they can further sanitize it by removing all records of clicks on user pages and user talk pages.
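
As a rough illustration of that sanitizing step, here is a sketch that drops every (source, target) pair touching a user page or user talk page. It assumes the clickstream has already been reduced to a mapping from title pairs to counts; the function names and the prefix list are hypothetical and only cover the two namespaces named above.

```python
# Assumption: titles use the enwiki namespace prefixes "User:" and "User talk:".
SENSITIVE_PREFIXES = ("User:", "User_talk:")


def is_sensitive(title):
    """True if the title is in the User or User talk namespace."""
    normalized = title.replace(" ", "_")
    return normalized.startswith(SENSITIVE_PREFIXES)


def sanitize(pair_counts):
    """Return a copy of pair_counts without any pair that touches a user page."""
    return {
        (source, target): n
        for (source, target), n in pair_counts.items()
        if not is_sensitive(source) and not is_sensitive(target)
    }


# Hypothetical input data.
raw = {
    ("Mercury", "Mercury_(planet)"): 2,
    ("Mercury", "User:Example"): 1,
    ("User talk:Example", "Mercury"): 1,
}
clean = sanitize(raw)
```

A prefix filter like this only catches titles that match the listed namespaces exactly, so it is a starting point rather than a complete privacy safeguard.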

I just CC'ed all five members of the analytics upgrade team to this bug, and assigned this bug to Howie Fung.  I hope both of those actions were OK.
Comment 3 Diederik van Liere 2011-02-09 03:52:50 UTC
Maybe we should assign this to Rob Lanphier or Nimish.
Comment 4 Bawolff (Brian Wolff) 2011-02-09 19:15:29 UTC
>Based on some Google research I did in response to your
>comment, I found that the Wikimedia Foundation already decided last year to get
>some better analytics tools.

My main concern was giving out such data to everyone who could potentially want it. Bot developers are a wide group of people, of varying levels of competency. I wouldn't really want such a group to have access to such data unless it was very well anonymized. Such information could be sensitive. Say someone browsed through various articles on Wikipedia about sexual topics, followed by a browse through the Commons categories for sexual images (assuming such categories still exist after the recent controversies that I haven't really been following), followed by the user visiting his own user page (so one can identify who it is. If user pages aren't listed, perhaps followed by him accidentally making a typo and going to uer:<user name>/ whatever). That might be something that the user would not want to be published.

Anyways, I'm all for better analytics tools in general (I love the page stats), but we also have to be careful. Even anonymized data can be harmful to release (for example [[AOL search data scandal]]) if not done carefully.
Comment 5 howief 2011-02-10 02:31:59 UTC
I'm not sure the benefits of fixing the disambiguation issue outweigh the potential privacy concerns. Yes, we do want better analytics, but we should think very carefully about what click data we want to track and/or publish. E.g., we may consider applying click-tracking to some types of pages but not others, if that's possible.

Robla is managing the priority list of analytics related features, so I'm going to assign this to him.

Any other use cases for this data?
Comment 6 Andre Klapper 2012-12-03 13:59:10 UTC
[mass-moving wikistats reports from Wikimedia→Statistics to Analytics→Wikistats to have stats issues under one Bugzilla product (see bug 42088) - sorry for the bugspam!]
Comment 7 Andre Klapper 2012-12-03 16:58:29 UTC
So the only potential use case I've seen mentioned so far in this report is
"to write a bot that will use such data to automatically fix links to disambig pages".
Is that all?

However, better clickstream data is mentioned at
