Last modified: 2013-05-10 15:46:39 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T29992, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 27992 - user names show up in <ip> tags
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Ariel T. Glenn
Keywords: analytics
Reported: 2011-03-11 13:04 UTC by Daniel Kinzler
Modified: 2013-05-10 15:46 UTC (History)

Description Daniel Kinzler 2011-03-11 13:04:18 UTC
Sometimes, the <ip> tag does not contain an IP address, but a user's name. Example from dewiki-20100903-stub-meta-history.xml:

<revision>
<id>7</id>
<timestamp>2002-07-08T01:55:46Z</timestamp>
<contributor>
<ip>Ben-Zin</ip>
</contributor>
<minor/>
<comment>*</comment>
<text id="7" />
</revision>


Supposedly this happens when rev_user is 0 because the user was unknown on the wiki when the revision was created. This is expected and happens frequently when revisions are imported from other wikis. It is especially frequent for very old revisions imported from UseMod.

The expected behavior would be to only show valid IPv4 and IPv6 addresses in the <ip> tag. If the user ID is 0 but the user name is not a valid IP address, it should be exported as a regular user but without an ID:

<contributor>
<username>Ben-Zin</username>
</contributor>

This is especially important for researchers who want to distinguish between anonymous contributions and contributions by logged-in users. The presence of the <ip> tag is supposed to indicate an anonymous contribution. This bug makes that assumption false and leaves researchers no option but to work around the issue by checking for themselves whether the <ip> tag actually contains an IP address.
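For dumps that already contain such entries, the workaround mentioned above could be sketched roughly as follows. This is a minimal Python sketch, not MediaWiki code: the helper name `classify_contributor` is hypothetical, the standard-library `ipaddress` parser only approximates MediaWiki's own notion of a valid IP address (for instance, it also accepts compressed IPv6 forms), and the XML namespace that real dump files declare is ignored here.

```python
import ipaddress
import xml.etree.ElementTree as ET


def classify_contributor(contributor):
    """Return ("anon", ip), ("user", name) or ("unknown", None).

    `contributor` is a parsed <contributor> element without a namespace.
    Hypothetical helper: an <ip> value counts as anonymous only if the
    whole string parses as an IPv4 or IPv6 address.
    """
    ip_text = contributor.findtext("ip")
    if ip_text is not None:
        value = ip_text.strip()
        try:
            ipaddress.ip_address(value)
        except ValueError:
            # Not an address: a user name that was exported inside <ip>.
            return ("user", value)
        return ("anon", value)
    username = contributor.findtext("username")
    if username is not None:
        return ("user", username)
    return ("unknown", None)


# The contributor from the example revision above.
broken = ET.fromstring("<contributor><ip>Ben-Zin</ip></contributor>")
print(classify_contributor(broken))
```

Every consumer of the dumps would have to repeat a check like this, which is why a fix on the export side seems preferable.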
Comment 1 gleim 2011-03-14 15:02:15 UTC
Since I wrote a little patch for the data: it concerns the contributors of 42916 revisions (dewiki 2010-09-03). I can provide additional data if required; just contact me.
Comment 2 Ariel T. Glenn 2011-11-14 15:06:20 UTC
gleim, what does your patch do exactly?
Comment 3 Platonides 2011-11-14 15:18:34 UTC
What do you propose to do with values such as 210.50.203.xxx or 123.office.bomis.com ?

If researchers are going to make software to crawl it by themselves, doesn't seem unreasonable that they filter such values to their liking, too.
Comment 4 Daniel Kinzler 2011-11-14 19:00:14 UTC
(In reply to comment #3)
> What do you propose to do with values such as 210.50.203.xxx or
> 123.office.bomis.com ?

they are not valid IP addresses, so they should not be treated as IP addresses. We *should* recognize valid ipv6 addresses (but only in the form mediawiki uses when recording them for anon edits).

> If researchers are going to make software to crawl it by themselves, doesn't
> seem unreasonable that they filter such values to their liking, too.

requiring our users to fix our broken output isn't really the best practice, is it? we can easily fix this, so we should.

Ideally, truly anonymous edits should be distinguishable from edits by unknown authors in the database. Instead of using user=0 for both, unknown users could have -1 or something. But I suppose there's no knowing what that would break... adding an extra "anon" flag along with every xxx_user field would be a pain too. Oh, well.
Comment 5 gleim 2011-11-15 06:50:05 UTC
(In reply to comment #2)
> gleim, what does your patch do exactly?

Hello,
sorry for my late reply, I haven't been online for a few days. My patch corrects the data; I have not touched any code. All I did was check, for each case where a username is falsely marked as an anonymous IP, whether there is a newer entry with a valid user ID etc. That allows me to rewrite the old entries. Of course this does not fix anything, but it helped in our case.

Best wishes,

Rüdiger
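The data correction described in this comment could look roughly like the following Python sketch. It is not the script used here: the record layout (dicts with `timestamp`, `user_id` and `user_text` keys) and the function names are invented for illustration, and it assumes that a newer revision carrying the same name with a non-zero user ID belongs to the same account, which renames can invalidate.

```python
import ipaddress


def is_ip(text):
    """True only if the whole string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True


def backfill_user_ids(revisions):
    """Return a corrected copy of `revisions` (a list of dicts).

    Each dict is assumed to have "timestamp" (ISO 8601 string),
    "user_id" and "user_text" keys; this layout is made up for the example.
    """
    # Newest known (timestamp, id) for every name seen with a non-zero id.
    newest = {}
    for rev in revisions:
        if rev["user_id"] != 0:
            seen = newest.get(rev["user_text"])
            if seen is None or rev["timestamp"] > seen[0]:
                newest[rev["user_text"]] = (rev["timestamp"], rev["user_id"])

    fixed = []
    for rev in revisions:
        rev = dict(rev)  # leave the caller's records untouched
        known = newest.get(rev["user_text"])
        if (rev["user_id"] == 0 and not is_ip(rev["user_text"])
                and known is not None and known[0] > rev["timestamp"]):
            # A newer entry has a valid id for this name: reuse it.
            rev["user_id"] = known[1]
        fixed.append(rev)
    return fixed


# Made-up sample data; the id 42 is purely illustrative.
sample = [
    {"timestamp": "2002-07-08T01:55:46Z", "user_id": 0, "user_text": "Ben-Zin"},
    {"timestamp": "2004-01-01T00:00:00Z", "user_id": 42, "user_text": "Ben-Zin"},
]
print(backfill_user_ids(sample))
```

Names that never reappear with a valid user ID stay at 0, so this only narrows the problem rather than removing it.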
Comment 6 Ariel T. Glenn 2011-11-16 10:23:33 UTC
Rüdiger:

Would you mind making your script available as an attachment here, so that users of the dumps can make these corrections until we have a patch approved and deployed? 

To Daniel: 

We have no way to be sure that a username really exists on a project, i.e. really existed at the time of an edit.  We can't actually look up all names in the user table to see if they are valid, because if a user is renamed and all goes well, the old name disappears from the table.  Bearing that in mind, I think the only approach we can reasonably take is that if the rev_user is 0, and rev_user_text looks *exactly* like an IP address, then we log it as an IP address, otherwise not.

Examples that would be recorded as usernames, all real usernames taken from enwp:

193.251.9.132 is back for more
152.163.xx.xx
64.175.249.214 (Hephaestos)
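The rule proposed above (rev_user is 0 and rev_user_text looks exactly like an IP address) could be sketched as follows. This is a Python illustration only, not the dump code: the function names are hypothetical, Python's `ipaddress` module stands in for whatever check MediaWiki uses server side and may differ from it in edge cases, and the handling of the user ID in the username branch is deliberately left out.

```python
import ipaddress
from xml.sax.saxutils import escape


def looks_exactly_like_ip(text):
    """True only if the whole string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True


def contributor_xml(rev_user, rev_user_text):
    """Build a <contributor> element following the rule discussed above."""
    if rev_user == 0 and looks_exactly_like_ip(rev_user_text):
        return "<contributor><ip>%s</ip></contributor>" % escape(rev_user_text)
    # Everything else is exported as a user name; how the id is written
    # is not covered by this sketch.
    return ("<contributor><username>%s</username></contributor>"
            % escape(rev_user_text))


# Names quoted in this thread, plus one documentation-range address.
for name in ["193.251.9.132 is back for more", "152.163.xx.xx",
             "64.175.249.214 (Hephaestos)", "210.50.203.xxx", "192.0.2.7"]:
    print(contributor_xml(0, name))
```

Under this sketch only the last value ends up in an <ip> tag; the others are written as usernames because the string as a whole is not an address.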
Comment 7 Daniel Kinzler 2011-11-16 11:48:19 UTC
(In reply to comment #6)

yes, i agree. it can and should be implemented as "if user == 0 and user_name matches ip_pattern".

my thoughts about making the distinction explicit in the database only apply to fresh imports. this cannot be done reliably in retrospect, as you pointed out.

> Examples that would be recorded as usernames, all real usernames taken from
> enwp:
> 
> 193.251.9.132 is back for more
> 152.163.xx.xx
> 64.175.249.214 (Hephaestos)

and they should, because they *are* usernames. people actually created an account with that name (on *some* wiki, somewhere. or someone imported manipulated/broken dumps).

btw, something like 123.345.111.333 should also be logged as a user name :) not sure about 0.0.0.0, but i don't think that can happen anyway.
Comment 8 Daniel Friesen 2011-11-16 11:55:32 UTC
(In reply to comment #7)
> (In reply to comment #6)
> btw, something like 123.345.111.333 should also be logged as a user name :) not
> sure about 0.0.0.0, but i don't think that can happen anyway.

I believe we consider 0.0.0.0 to be an IP address server side, so naturally that should be considered an ip.

We already have this kind of heuristic going on server side and already have code to differentiate, any change we make should just make use of that same code.
Comment 9 Diederik van Liere 2011-11-16 14:28:28 UTC
Maybe a better place  to fix this is to have a cleanupUsernames.php file in /maintenance?
Comment 10 Daniel Kinzler 2011-11-16 14:41:03 UTC
(In reply to comment #9)
> Maybe a better place  to fix this is to have a cleanupUsernames.php file in
> /maintenance?

and what exactly would that do?

in the database, we have user entries with id 0 for two things: unknown users (from imports) and anon edits (IPs). how would a cleanup script help with that?

this bug is making this distinction correctly in xml dumps, just as mediawiki makes that distinction in other places.
Comment 11 Ariel T. Glenn 2011-11-16 19:02:54 UTC
Three things: revisions which got screwed up from undeletions or other complications with a rename user. I'm not sure what the exact bug(s) are but there are some.
Comment 12 Daniel Kinzler 2011-11-16 20:35:22 UTC
(In reply to comment #11)
> Three things:

right. and that *could* perhaps be fixed with a maintenance script. 

but as far as this ticket is concerned, it doesn't make a difference: if the userid is 0, put the username into <ip> tags only if it's a valid IP address.
Comment 13 Ariel T. Glenn 2011-11-16 20:59:49 UTC
Ah, I meant to suggest rather that some of those can't be fixed by a maintenance script actually :-D  Not without knowing for sure what the right fix is for each such revision, and I don't think we can know that.  Anyways, as has now been said to death on this bug, if it is exactly a valid ip and has rev_user 0 it goes into ip tags.
Comment 14 gleim 2011-11-17 08:37:02 UTC
(In reply to comment #6)
> Rüdiger:
> 
> Would you mind making your script available as an attachment here, so that
> users of the dumps can make these corrections until we have a patch approved
> and deployed? 
> 
I am using an optimized MySQL representation, with code that would not be of much use to others :-(.
Comment 15 Ariel T. Glenn 2011-11-17 09:23:46 UTC
Patched in rev 103448. Note that in the case where we conclude that the username is really a username, we still write out a 0 uid since that's the value in the db.
Comment 16 db [inactive,noenotif] 2011-12-19 18:30:33 UTC
(In reply to comment #15)
> Note that in the case where we conclude that the
> username is really a username, we still write out a 0 uid since that's the
> value in the db.

That is only true for imports, so that should be no problem.
Comment 17 db [inactive,noenotif] 2013-05-10 15:46:39 UTC
Already fixed by r103448
