Last modified: 2014-05-29 17:25:30 UTC
(This is not, strictly, a Limn bug, but a bug in how we generate the data, though I guess it could be resolved by adding a feature to Limn, such as an explicit "key" column rather than relying on row numbers.)

We use Limn to display "active editors by country", based on the files generated by scripts, visible in /datasources. In the past (I don't know how to reproduce this), a graph created and saved featuring data for country X, _after some time_ (i.e. when different data was generated by the scripts), began displaying data for country Y.

My theory, confirmed at one point by Evan, was that the graphs relied on row numbers in the CSVs, and for whatever reason, countries must have been added or removed from the report (or perhaps countries with 0 editors don't get listed at all? That could make for very erratic changes month over month...), which shifted the row numbers, which caused the graph to display false and misleading data.

I cannot overstate how crucial this is: it makes sharing links to the actual graphs (as distinct from saving screenshots of them) impossible, because we can't guarantee a future viewer of that link would actually see correct data.

Since I have no insight into the data-generating scripts themselves, I can't ascertain that this is no longer a problem, nor can I force it to reproduce. But I tried to describe the problem as technically as I can, to help you decide if this can still happen. If you are confident it can't happen any more, I'd be thrilled to hear, and you can close this bug.
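To illustrate the failure mode described above, here is a minimal sketch in Python. It is not Limn's code or the geowiki scripts; the function names and the sample CSV contents are hypothetical. It only shows why a lookup by position can silently switch countries when a new country is added, while a lookup by header name (an explicit key) does not.

```python
import csv
import io

def column_by_index(csv_text, index):
    """Return the values of the column at a fixed position (fragile)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [row[index] for row in rows[1:]]  # skip the header row

def column_by_key(csv_text, key):
    """Return the values of the column whose header equals `key` (stable)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[key] for row in reader]

# Hypothetical data: the second run gains a country that sorts before Germany.
run_1 = "date,Germany,Ghana\n2014-01-01,120,7\n"
run_2 = "date,Gabon,Germany,Ghana\n2014-02-01,3,125,8\n"

# A graph saved against run_1 that remembered "Germany" only as position 1.
saved_index = 1

germany_then = column_by_index(run_1, saved_index)   # Germany's values
shifted_now = column_by_index(run_2, saved_index)    # now Gabon's values
germany_by_key = column_by_key(run_2, "Germany")     # still Germany's values
```

With a saved position, the same graph reads Germany's numbers from the first file and Gabon's from the second. The key-based lookup keeps returning Germany's numbers regardless of where the column ends up.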
Prioritization and scheduling of this bug is tracked on Mingle card https://mingle.corp.wikimedia.org/projects/analytics/cards/1164
We recently fixed a bug that is very similar to the problem you describe. Let me make sure it's only related and not the same thing.

Until some weeks back, when looking at the active editors for country X (e.g. [1]), the graph showed the data for one of the countries in the set {X, Y1, Y2, Y3}. Reloading the page might show the data for a different country of the same set {X, Y1, Y2, Y3}. When sorted alphabetically, Y1, Y2, and Y3 were really close to each other. The root cause seems to have been column order mismatches between different versions of the same file. This problem was solved by trying to make sense of ~17k files and removing ~15k stale files/duplicates.

(In reply to comment #0)
> [...] than relying on row numbers.)
> [...]
> My theory, confirmed at one point by Evan, was that the graphs relied on row
> numbers

Sorry to be nitpicking here, but since you are talking both here and some lines above about /row/ numbers, let me make sure we are talking about the same files. Do you really mean /row/ numbers (that could totally be the case, but would hint towards you using files that I have not yet discovered in our repos), or /column/ numbers (as used, for example, in [2])?

The files produced by Evan's geowiki scripts (i.e. the "Active Editor" data) rely on column numbers. Yes, the column number of country X in file Z.csv might change between any given day. And in fact, they not only "might" change, they actually do change often. (For the current geowiki dashboards, graphs, ... those frequent changes are not a problem, as we regenerate the relevant files for each run using the current CSVs.)

> [...], which
> shifted the row numbers, which caused the graph to display false and
> misleading
> data.

If that graph displayed false data, that's a real problem from my perspective. But since you are using the past tense in your description and you also state that you cannot force the problem to reproduce... are we still affected by the problem? If so, could you point me to a concrete file/URL that causes problems?

> Since I have no insight into the data-generating scripts themselves, [...]

The scripts are at https://gerrit.wikimedia.org/r/#/admin/projects/analytics/geowiki . As usual: Patches welcome :-)

You can find a rough overview of the geowiki dataflow at https://wikitech.wikimedia.org/wiki/Analytics/Geowiki#Dataflow .

[1] http://gp.wmflabs.org/graphs/en_germany_all
[2] http://gp.wmflabs.org/data/datafiles/gp/en_all.csv
Thanks for this insightful comment! Yes, this sounds exactly like the problem we had been having. My use of "row number" is naive: I meant that the data was identified by index (which I incorrectly assumed to be a row index rather than a column index) rather than by key. So yes, it matches the column number problem you describe.

So I'm hoping this is resolved now, but I am still confused by two statements you make:

1. On one hand, you say the "problem was solved by trying to make sense of ~17k files and removing ~15k stale files/duplicates."
2. OTOH, you say "Yes, the column number of country X in file Z.csv might change between any given day."

So... if graphs still rely on column numbers, are we still in essentially the same situation, wherein we can't trust a graph to still be pointing at data for the same country after N days/months?
(In reply to comment #3)
> So... if graphs still rely on column numbers, are we still in essentially the
> same situation, wherein we can't trust a graph to still be pointing at data
> for
> the same country after N days/months?

I do not think so.

On the one hand, we generate not only the data files but also the graph definitions daily. So the columns referenced in the graph files and the columns in the data files should correspond, even after columns get rearranged. For example, if column X1 of data file Z becomes X2, the corresponding graph file for Z is also updated to use X2 instead of X1.

On the other hand, the cleanup of the served data repositories made sure that no stale files (with outdated column indices) are lying around, waiting to be picked up by Limn.

As it seems the problem we fixed matches your observations, I am marking the bug as fixed for now. However, if you notice (once gp.wmflabs.org goes online with data again) that graphs come with wrong captions, do let us know and reopen the bug. Thanks!
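The regeneration described above can be sketched roughly as follows, in Python. This is only an illustration under assumptions: the dictionary layout of the "graph definition", the function name, and the sample data are hypothetical and do not reflect Limn's actual graph file format or the geowiki scripts.

```python
import csv
import io

def build_graph_definition(csv_text, country):
    """Derive a graph definition from the current CSV header.

    The column index is looked up from the header on every run, so it
    follows the country even if its column position has changed.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    return {"country": country, "column": header.index(country)}

# Hypothetical data files from two consecutive runs; a new country
# appears in the second run and shifts the later columns by one.
yesterday = "date,Germany,Ghana\n2014-01-01,120,7\n"
today = "date,Gabon,Germany,Ghana\n2014-01-02,3,121,7\n"

graph_yesterday = build_graph_definition(yesterday, "Germany")
graph_today = build_graph_definition(today, "Germany")
```

Because the definition is rebuilt alongside the data, `graph_today` points at column 2 while `graph_yesterday` pointed at column 1. A definition that is saved once and never rebuilt would keep the old index.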
Excellent, thanks! This was the biggest show-stopper for me.
It seems it was a short party :-( Meanwhile, there was a change that allows users to create their own graphs. Those "user-created" graphs are not updated when we recreate the "script-generated" graphs. So while "script-generated" graphs are not affected, the "user-created" graphs are affected again. Hence, reopening the bug.
[moving tickets as per bug 65903]