Last modified: 2014-06-12 17:06:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T68112, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 66112 - "data:" URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se
Status: NEW
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-06-04 08:28 UTC by christian
Modified: 2014-06-12 17:06 UTC (History)

Description christian 2014-06-04 08:28:26 UTC
In #analytics on 2014-06-03:

  22:12:42 <Nemo_bis> O_o http://stats.grok.se/pt.q/top

.

On the above page (which currently shows “Most viewed articles in 201403”),
ranks 1, 2, 6, 7, 8, and 9 match

 ^[dD]ata:image/png;base64,iVBORw0K

. This looks wrong, as they look like data scheme URLs.
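For anyone who wants to reproduce the check, here is a minimal Python sketch that applies the pattern above to a list of page titles. The function name and the sample titles are hypothetical; only the regular expression comes from the report.

```python
import re

# Pattern quoted in the report: titles that are really PNG data URIs.
DATA_TITLE_RE = re.compile(r"^[dD]ata:image/png;base64,iVBORw0K")

def is_data_uri_title(title):
    """Return True if a page title looks like an inline PNG data URI."""
    return DATA_TITLE_RE.match(title) is not None

# Hypothetical sample titles, for illustration only.
titles = [
    "Data:image/png;base64,iVBORw0KGgoAAAANSUhEUg",
    "data:image/png;base64,iVBORw0KGgoAAAANSUhEUg",
    "Main_Page",
]
suspicious = [t for t in titles if is_data_uri_title(t)]
print(len(suspicious), "of", len(titles), "titles match")
```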
Comment 1 christian 2014-06-04 08:42:01 UTC
Looking through the log files, we indeed see requests for [1]

  http://es.wikipedia.org/wiki/Data:image/png;base64,iVBORw0K[...]

so webstatscollector is doing the right thing :-/

Currently, this traffic amounts to ~500K requests per day.

We see such requests back until the first sampled log files we still
have. (But they were fewer in numbers back then)

Requested URLs are mostly to eswiki (~58%), and ptwiki (~38%).

Referrers are either empty (~97%) or coming mostly from ptwiki (to a
lesser extent eswiki, enwiki).

User Agents match '^Mozilla/5\.0 (Windows NT [56]\.' for >98% of requests.
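A rough Python sketch of the same user agent check follows. It assumes the pattern above was written in a grep-style syntax where "(" is a literal character; Python's re module needs that parenthesis escaped. The function name is hypothetical, and the Linux user agent is a made-up counterexample.

```python
import re

# Same pattern as above, with the opening parenthesis escaped for Python.
UA_RE = re.compile(r"^Mozilla/5\.0 \(Windows NT [56]\.")

def is_windows_nt_5_or_6(user_agent):
    """Return True for user agents claiming Windows NT 5.x or 6.x."""
    return UA_RE.match(user_agent) is not None

samples = [
    "Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0",
    # Hypothetical non-matching user agent, for contrast.
    "Mozilla/5.0 (X11; Linux x86_64; rv:29.0) Gecko/20100101 Firefox/29.0",
]
for ua in samples:
    print(is_windows_nt_5_or_6(ua), ua)
```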

Unwrapping the inline data from the URLs and looking at it, they seem
to be just images for UI chrome.

The images in the data uri scheme decode to images from VectorBeta like
  VectorBeta/resources/typography/images/search-fade.png
  VectorBeta/resources/typography/images/tab-break.png
  VectorBeta/resources/typography/images/tab-current-fade.png
  VectorBeta/resources/typography/images/portal-break.png





[1] Since they are just UI images, here are some concrete examples:

http://es.wikipedia.org/wiki/

http://pt.wikipedia.org/wiki/Data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAAuCAIAAABmjeQ9AAAARElEQVR42mVO2wrAUAhy/f8fz%2BniVMTYQ3hLKkgGgN/IPvgIhUYYV/qogdP75J01V%2BJwrKZr/5YPcnzN3e6t7l%2B2K%2BEFX91B1daOi7sAAAAASUVORK5CYII%3D

http://es.wikipedia.org/wiki/

http://es.wikipedia.org/wiki/Data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAAQCAIAAABY/YLgAAAAJUlEQVQIHQXBsQEAAAjDoND/73UWdnerhmHVsDQZJrNWVg3Dqge6bgMe6bejNAAAAABJRU5ErkJggg%3D%3D
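The "unwrapping" step can be sketched in Python roughly as follows, using the eswiki example URL above. This is only an illustration, not the procedure actually used for the analysis; the function name is hypothetical.

```python
import base64
from urllib.parse import unquote

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def decode_data_title(url):
    """Return the raw bytes embedded in a '/wiki/Data:...;base64,...' URL."""
    # Everything after the first "base64," is the percent-encoded payload.
    _, _, payload = url.partition("base64,")
    # Undo the percent-encoding (e.g. %3D -> "=", %2B -> "+"), then base64-decode.
    return base64.b64decode(unquote(payload))

url = (
    "http://es.wikipedia.org/wiki/Data:image/png;base64,"
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAAQCAIAAABY/YLgAAAAJUlEQVQIHQXBsQEAAAjDoND"
    "/73UWdnerhmHVsDQZJrNWVg3Dqge6bgMe6bejNAAAAABJRU5ErkJggg%3D%3D"
)
raw = decode_data_title(url)
print("PNG signature present:", raw.startswith(PNG_SIGNATURE))
```

Writing `raw` to a file gives an image that can be compared against the skin's image files.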
Comment 2 Matthew Flaschen 2014-06-04 17:49:22 UTC
This looks like a browser/crawler bug where it's interpreting data URIs as relative URLs because it does not understand the protocol (and has a weird default for unknown protocols).

(In reply to christian from comment #1)
> User Agents match '^Mozilla/5\.0 (Windows NT [56]\.' for >98% of requests.

Do you know which browsers these actually are?  Does it have the MSIE or Trident token?

It is a known issue that IE <= 7 (http://caniuse.com/#feat=datauri) does not support data URIs.  However, my understanding is that it's supposed to just drop it; I've never heard it would send a bogus request (I could be wrong, though).
Comment 3 Matthew Flaschen 2014-06-04 17:52:32 UTC
(In reply to Matthew Flaschen from comment #2)
> Do you know which browsers these actually are?  Does it have the MSIE or
> Trident token?

If you could share the full user agent, either publicly or privately, that might be helpful.
Comment 4 Matthew Flaschen 2014-06-04 17:56:03 UTC
(In reply to Matthew Flaschen from comment #2)
> It is a known issue that IE <= 7 (http://caniuse.com/#feat=datauri) does not
> support data URIs.  However, my understanding is that it's supposed to just
> drop it; I've never heard it would send a bogus request (I could be wrong,
> though).

This (old IE support) is also why we have a PNG fallback, which it's supposed to use.
Comment 5 christian 2014-06-04 19:29:33 UTC
Sadly enough. No IE<=7 issue. That was the first impression yesterday as well :-(

(In reply to Matthew Flaschen from comment #2)
> Do you know which browsers these actually are?

Yes. User Agents are for example (figured they are generic enough to post):

  Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0
  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36

> Does it have the MSIE or
> Trident token?

Nope.
Affected browsers are mostly Firefox (~65%) and Chrome (~33%).
In old versions and (as exhibited above) also new versions.

It seems to be a "Windows with (Firefox or Chrome)" issue.
Comment 6 Matthew Flaschen 2014-06-05 05:38:18 UTC
(In reply to christian from comment #5)
> It seems to be a "Windows with (Firefox or Chrome)" issue.

Or a bot spoofing their user-agent to pretend to be such.
Comment 7 Oliver Keyes 2014-06-05 06:02:54 UTC
Which is fairly common. Even IE has started deliberately making ambiguous user agents because the devs have realised that people write special rules around IE UAs.

Is there anything interesting in the x_analytics field? I recall a problem with a similar range of browsers from Zero - attempts to DDoS the ISP-level packet inspection in Bangladesh.
Comment 8 christian 2014-06-05 09:29:14 UTC
(In reply to Matthew Flaschen from comment #6)
> (In reply to christian from comment #5)
> > It seems to be a "Windows with (Firefox or Chrome)" issue.
> 
> Or a bot spoofing their user-agent to pretend to be such.

I checked that. And while of course, we cannot rule it out, it's
not too plausible to me.

The number of requests is following a strong weekly pattern.

For each day, the client IPs fall into between 200 and 500 different /24 IP groups.
(Basically all matching the country for the relevant wikis. So Brazil IPs
fetching ptwiki, Venezuelan IPs fetching eswiki.)

Sure. A /smart/ botnet still could implement a weekly pattern and grab many
relevant different IPs that are correctly geolocated.
But then ... a smart botnet would not misinterpret data uris. And even if they
did by accident, such a smart botnet would notice and fix it.

So I'd rule bots out.
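To illustrate the /24 grouping mentioned above, here is a minimal Python sketch. It is not the script used for the analysis; the function name is hypothetical and the sample addresses come from the documentation ranges, not from the logs.

```python
import ipaddress

def count_slash24_groups(ips):
    """Count how many distinct /24 networks a list of IPv4 addresses spans."""
    # strict=False lets ip_network() mask off the host bits of each address.
    networks = {ipaddress.ip_network(f"{ip}/24", strict=False) for ip in ips}
    return len(networks)

# Hypothetical client IPs (documentation ranges), for illustration only.
sample_ips = ["192.0.2.10", "192.0.2.200", "198.51.100.7", "203.0.113.5"]
print(count_slash24_groups(sample_ips))  # 3
```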
Comment 9 christian 2014-06-05 09:32:55 UTC
(In reply to Oliver Keyes from comment #7)
> Is there anything interesting in the x_analytics field?

No. X-Analytics is empty for all those requests.
Comment 10 christian 2014-06-05 12:15:44 UTC
For those who want to take a look themselves, there are prefiltered (from sampled-1000 stream) tsvs for May and June 2014 in

  /home/qchris/data-uris

on stat1002 (the date in the file name corresponds to the date in the file name
of the sampled-1000 tsv files).
Comment 11 Bartosz Dziewoński 2014-06-05 16:12:44 UTC
(In reply to christian from comment #1)
> The images in the data uri scheme decode to images from VectorBeta like
>   VectorBeta/resources/typography/images/search-fade.png
>   VectorBeta/resources/typography/images/tab-break.png
>   VectorBeta/resources/typography/images/tab-current-fade.png
>   VectorBeta/resources/typography/images/portal-break.png

These images are also part of the core Vector skin, where
they sit at [mediawiki/core]/skins/vector/images.
Comment 12 Oliver Keyes 2014-06-05 16:13:35 UTC
Humn. Worth CCing the typography peeps and seeing if there's something weird in the implementation?
Comment 13 Bartosz Dziewoński 2014-06-05 16:15:13 UTC
The images listed also do not have SVG versions, so I wouldn't blame the SVG->PNG fallback mechanism.
Comment 14 Bartosz Dziewoński 2014-06-05 16:43:46 UTC
We were missing test cases that would prove that CSSMin is not borking data: URIs generated by LESS mixins like .background-image(), so I added some in https://gerrit.wikimedia.org/r/#/c/137698/ just in case.
Comment 15 christian 2014-06-05 22:20:37 UTC
(In reply to Bartosz Dziewoński from comment #11)
> These images are also part of the core Vector skin, [...]

*Facepalm*
I had core at an old commit :-(

Yup ... they can come from core as well :-) Thanks.
Comment 16 christian 2014-06-10 07:27:37 UTC
Probably not relevant, as the CSS should be interpreted as UTF-8
... but since I've been burnt by UTF-8 support on Windows a few times,
I checked the CSS of some prominent Wikipedias [1], and it seems that,
of them, only

  eswiki [2]
  ptwiki [3]
  plwiki [4]

had CSS classes using characters beyond 7-bit ASCII.

However, while eswiki, and ptwiki are the affected ones, plwiki does not
seem to be affected.



[1] arwiki cswiki dawiki dewiki elwiki enwiki eswiki fawiki fiwiki
frwiki hewiki idwiki itwiki jawiki kowiki nlwiki nowiki plwiki ptwiki
ruwiki svwiki trwiki ukwiki zhwiki

[2] eswiki:
  .arquería
  .astronomía
  .béisbol
  .canadá
  .cómics
  .comunicación
  [...]

[3] ptwiki:
  .page-Wikipédia_Esplanada_geral
  .page-Wikipédia_Esplanada_propostas

[4] plwiki:
  .page-Wikipedia_Strona_główna
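A check along these lines could be sketched in Python as shown below. It is a simplified approximation, not the method actually used: the regular expression only looks for a "." followed by identifier characters and does not parse CSS properly. The function name and the sample stylesheet are hypothetical.

```python
import re

# Simplified: a "." followed by word characters or hyphens.
# In Python 3, \w also matches non-ASCII letters such as "é" or "ł".
CLASS_RE = re.compile(r"\.([\w-]+)")

def non_ascii_classes(css):
    """Return the sorted class names that contain characters beyond 7-bit ASCII."""
    names = set(CLASS_RE.findall(css))
    return sorted(n for n in names if any(ord(ch) > 127 for ch in n))

# Hypothetical stylesheet fragment built from the class names listed above.
css = """
.béisbol { color: red; }
.page-Wikipedia_Strona_główna .foo { margin: 0; }
.portal { padding: 1.5em; }
"""
print(non_ascii_classes(css))
```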
Comment 17 Toby Negrin 2014-06-12 17:06:18 UTC
Need collaboration with Platform to work on this further.
