Last modified: 2014-09-28 17:50:53 UTC
Around http://lists.wikimedia.org/pipermail/analytics/2014-July/002351.html it seems some EventLogging schemas need to get purged.

-----------------------------

The names of the schemas are not yet fully clear, but the OP in one part said:

> we can probably just wholesale remove the associated schemas listed at
> https://meta.wikimedia.org/wiki/Research:Asking_anonymous_editors_to_register#Schemas

Removal should happen before 2014-08-04, but (as discussed in private communication) only after 2014-08-01. I made it clear in private communication that we probably cannot meet that deadline.

If I understood OP correctly, Sean will handle the database cleanup. I pushed back on cleanup of raw logs.
*** Bug 68978 has been marked as a duplicate of this bug. ***
> I pushed back on cleanup of raw logs.

Steven clarified on-list that they have an agreement with legal to remove the data. So we should do it.
On-list [1] Kevin said

> Christian: before I prioritize it, can you scope out how much work
> would be required?

The items that immediately come to mind are:

* Clarify which schemas are meant to get purged.
* Clarify how to handle future data (We're still seeing those events getting logged). We have no machinery in place to guard against data entering raw-logs.
* Clarify whether or not purging EventLogging's “raw-logs” is sufficient (Since the relevant part of the data flow starts at the caches, it goes through both the udp2log and kafka pipeline)
* Clarify if the event data got sent to universities (through udp2log forwards).
* If the event data got sent to universities (see above item), clarify how to proceed there.
* Get data removed from database (Either we get access, or we need to discuss with Sean or Ops)
* Get data removed from all relevant files in vanadium:/var/log/eventlogging/...
* Make sure the cleansed files from vanadium get rsynced over to stats1002, and stats1003.
* If necessary (see 3rd item), remove the data from kafka consumers (Might be easier to just nuke current data, as we repaved Hadoop some days ago anyways)
* If necessary (see 3rd item), remove the data from udp2log consumers (Not sure. Might turn out that effectively no udp2log filter is actually selecting this data)

Taking a quick look, it seems data-collection might have started in April 2014.

The 2nd and 3rd items probably need more discussion with Steven (probably also legal, as some items are costly).

As our team lacks the required access for most of those parts, we either need to get access [2], or consume more Ops time (which requires more preparations on our end).

As the above list of items has some “Clarify” and “If” items, it's hard to give an estimate. If those items do not resolve to much extra work: Maybe 1-2 weeks total wall-clock time. But most of this time will be waiting time. So maybe one or two man-days.
[1] http://lists.wikimedia.org/pipermail/analytics/2014-August/002367.html
[2] I already applied when receiving Steven's first email, and Toby approved. But those items just require three days of waiting.
ahalfak said in private communication that he has finished the things he needed to do, so we're good to get things moving from their end.
As discussed in private emails between Steven, Aaron and me, the request is only for the following schemas:

  SignupExpAccountCreationComplete
  SignupExpAccountCreationImpression
  SignupExpCTAButtonClick
  SignupExpCTAImpression
  SignupExpPageLinkClick
  TrackedPageContentSaveComplete

Removal of future data is beyond the scope of this request.
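For the raw-log part of the cleanup, removing the events of these schemas from a log file could be sketched roughly as below. This is only an illustration under assumptions: it assumes one JSON object per line with a top-level `schema` field, which is not confirmed anywhere in this thread, and the function name `drop_purged_events` is hypothetical. Lines that do not parse as JSON are passed through unchanged, so they would still need a manual look.

```python
import json

# Schema names taken from the list in this comment.
PURGED_SCHEMAS = frozenset({
    "SignupExpAccountCreationComplete",
    "SignupExpAccountCreationImpression",
    "SignupExpCTAButtonClick",
    "SignupExpCTAImpression",
    "SignupExpPageLinkClick",
    "TrackedPageContentSaveComplete",
})


def drop_purged_events(lines, purged=PURGED_SCHEMAS):
    """Yield only the lines that do not belong to a purged schema.

    Assumption (not confirmed in this thread): each line is a JSON
    object carrying its schema name in a top-level "schema" field.
    """
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            # Not valid JSON: keep the line untouched for manual review.
            yield line
            continue
        schema = event.get("schema") if isinstance(event, dict) else None
        if isinstance(schema, str) and schema in purged:
            continue  # event of a schema to be purged: drop it
        yield line
```

Writing the filtered output to a new file and swapping it in only after checking it would keep the original around until the result has been verified.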
The tables to be purged from the log database are

  SignupExpAccountCreationComplete_8539421
  SignupExpAccountCreationImpression_8539445
  SignupExpCTAButtonClick_8102619
  SignupExpCTAButtonClick_8965028
  SignupExpCTAImpression_8101716
  SignupExpCTAImpression_8965023
  SignupExpPageLinkClick_8101692
  SignupExpPageLinkClick_8965014
  TrackedPageContentSaveComplete_7872558
  TrackedPageContentSaveComplete_8535426

On-list announcement about the upcoming purge is at http://lists.wikimedia.org/pipermail/analytics/2014-August/002382.html
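To avoid typos when handing the table list over, the statements could be generated from the list instead of typed by hand. The sketch below only builds strings and does not touch any database. It assumes that dropping the whole table is the intended way of purging, that the database is named `log`, and that MySQL-style backtick quoting applies; none of this is confirmed in this thread, and `drop_statements` is a hypothetical helper name.

```python
import re

# Table names taken from the list in this comment.
TABLES = [
    "SignupExpAccountCreationComplete_8539421",
    "SignupExpAccountCreationImpression_8539445",
    "SignupExpCTAButtonClick_8102619",
    "SignupExpCTAButtonClick_8965028",
    "SignupExpCTAImpression_8101716",
    "SignupExpCTAImpression_8965023",
    "SignupExpPageLinkClick_8101692",
    "SignupExpPageLinkClick_8965014",
    "TrackedPageContentSaveComplete_7872558",
    "TrackedPageContentSaveComplete_8535426",
]

# Expected shape: <SchemaName>_<revision number>.
_TABLE_NAME = re.compile(r"^[A-Za-z]+_[0-9]+$")


def drop_statements(tables, database="log"):
    """Return one DROP TABLE statement per table name.

    The database name "log" is an assumption. Names that do not look
    like <SchemaName>_<revision> are rejected instead of being quoted.
    """
    statements = []
    for table in tables:
        if not _TABLE_NAME.match(table):
            raise ValueError("unexpected table name: %r" % (table,))
        statements.append(
            "DROP TABLE IF EXISTS `{}`.`{}`;".format(database, table)
        )
    return statements


if __name__ == "__main__":
    # Print the statements for review; nothing is executed here.
    for statement in drop_statements(TABLES):
        print(statement)
```

The output is meant to be reviewed by whoever has access to the database before anything is run.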