Last modified: 2014-05-13 06:08:05 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T50260, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 48260 - possible to create duplicate sitelinks
Status: RESOLVED DUPLICATE of bug 42325
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: master
Hardware: All All
Importance: Normal critical
Target Milestone: ---
Assigned To: Wikidata bugs
 
Reported: 2013-05-08 14:51 UTC by Lydia Pintscher
Modified: 2014-05-13 06:08 UTC (History)

Description Lydia Pintscher 2013-05-08 14:51:01 UTC
reported at http://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#True_duplicate


Currently both Q12863749 and Q2618201 link to ka:დიდი ბრიტანეთი 1960 წლის ზაფხულის ოლიმპიურ თამაშებზე.

Same with Q12863758/Q146146/ka:დიდი ბრიტანეთი 1996 წლის ზაფხულის ოლიმპიურ თამაშებზე.
Comment 1 T. H. Kelly (Pink&) 2013-05-08 15:32:22 UTC
I should note that it's not possible to add either ka.wp link to any other items. (Try it for yourself @ [[d:Q4115189]].) Furthermore, it's not possible to edit any other fields on either item, as doing so generates a "Site link [[lang:page]] already used on [[Q####]]" error, even if you're just trying to set a label/description/alias. (Same if you try to use special pages instead of editing directly.)
Comment 2 Byrial Jensen 2013-05-18 17:48:23 UTC
Q12340897 and Q12343899 also both have [[da:Viborgvej (Aarhus)]] as links. I can edit labels and description in one of them (Q12343899), but not the other.
Comment 3 Daniel Kinzler 2013-05-30 10:30:23 UTC
I have tried to investigate Q12863749 and Q2618201 a bit. Here is what I found:

* Q12863749 was created on May 6 2013, with the ka link in place.
* Q2618201 got the ka link two days later, on May 8 2013.

The edit to Q2618201 that added this link should not have worked; it should have been prevented by the uniqueness constraint implemented using the database table wb_items_per_site. However, that table has an entry for the ka link on Q2618201, but none for Q12863749. This means that Q2618201 now essentially "owns" that link.

Consequently, Q2618201 can still be edited, while edits to Q12863749 will fail due to the uniqueness constraint. 

The cause of the problem is probably that the edit that created Q12863749 did not complete: it failed for some reason halfway through the process, after saving the primary data blob but before registering the site links in wb_items_per_site, leaving the database inconsistent.
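
To illustrate the failure mode described above, here is a minimal toy model. It is not Wikibase code: the class `ToyRepo`, the exception `SiteLinkConflict`, the `crash_before_index` flag and the page title are all invented for this sketch. It only assumes what the comment states, namely that the primary data is saved first, the link index is updated second, and the uniqueness check looks at the index alone.

```python
class SiteLinkConflict(Exception):
    """Raised when a site link is already registered for another item."""


class ToyRepo:
    """Toy model: two separate stores that are updated one after the other."""

    def __init__(self):
        self.blobs = {}  # item id -> set of site links (the "primary data blob")
        self.index = {}  # site link -> item id (stand-in for wb_items_per_site)

    def save(self, item_id, links, crash_before_index=False):
        # Uniqueness check: consults only the index, never the blobs.
        for link in links:
            owner = self.index.get(link)
            if owner is not None and owner != item_id:
                raise SiteLinkConflict(f"Site link {link} already used on {owner}")
        # Step 1: store the primary data.
        self.blobs[item_id] = set(links)
        # Simulated failure between the two steps (hypothetical flag).
        if crash_before_index:
            raise RuntimeError("save aborted before the index was updated")
        # Step 2: register the links in the index.
        for link in links:
            self.index[link] = item_id


repo = ToyRepo()
link = "ka:Example page"  # placeholder title, not one of the real pages

# The first item is created, but the save dies before the index update.
try:
    repo.save("Q12863749", [link], crash_before_index=True)
except RuntimeError:
    pass

# The second item is accepted, because the index has no entry for the link.
repo.save("Q2618201", [link])

# Any later edit to the first item is rejected, because the second item "owns" the link.
try:
    repo.save("Q12863749", [link])
    rejected = False
except SiteLinkConflict:
    rejected = True
```

In this model both blobs end up containing the link while the index names only Q2618201 as its owner, which matches the state observed in the database.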

Note that the risk of such inconsistencies is considered acceptable in MediaWiki design, since enforcing full consistency using transactions would make it very hard to make page updates scale to the level we need on Wikipedia.

Marking wontfix, because we can't fix this without rewriting most of MediaWiki.

As to the issue at hand, Q12863749 should probably just be deleted, since it consists only of the duplicate links.
Comment 4 T. H. Kelly (Pink&) 2013-05-30 10:55:53 UTC
{{Deleted|color=pink}} :)

Out of curiosity, does the fact that the first 2 were only 9 Q#s apart (and, to a lesser extent, that the third was only 500k away from them) mean anything? That is to say, is there any particular reason that the inconsistency occurred in so narrow a range?
Comment 5 Daniel Kinzler 2013-05-31 18:14:46 UTC
(In reply to comment #4)
> That is to say, is there any particular reason that the inconsistency
> occurred in so narrow a range?

I don't think so, except that this is more likely to happen at a time when several bots are working on importing the same set of pages.

For the record: this kind of thing should be *rare*. We can't avoid it completely, but it really shouldn't happen often. 

If this happens frequently, please re-open this bug.
Comment 6 T. H. Kelly (Pink&) 2013-05-31 19:13:10 UTC
(In reply to comment #5)
> For the record: this kind of thing should be *rare*. We can't avoid it
> completely, but it really shouldn't happen often. 
> 
> If this happens frequently, please re-open this bug.

Okay. I've created [[d:Wikidata:True duplicates]] to monitor how often this happens. Thanks for figuring this out. :)
Comment 7 Byrial Jensen 2013-05-31 22:10:53 UTC
I wrote all 30,536,568 links in the 2013-05-27 database dump to a file, and then used sort(1) and uniq(1) to find all duplicate links. The result is 38,764 duplicates, which is 1.3 per 1,000 links, so this is not a rare thing.
Comment 8 Byrial Jensen 2013-05-31 22:44:01 UTC
Correction: My default collation order sorted some different characters as the same. With LC_COLLATE=C there are only 3,182 duplicates (0.10 duplicates per 1,000 links), but I think that is still many.
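
For comparison, a minimal sketch of the same count in Python, assuming the links have already been written out as a list of strings. The function name `duplicate_stats` is made up for this example. `Counter` compares keys by exact string equality, so distinct characters are never merged by locale collation rules, which is similar in spirit to running sort(1) with LC_COLLATE=C.

```python
from collections import Counter


def duplicate_stats(links):
    """Return (number of distinct links seen more than once, rate per 1,000 links)."""
    counts = Counter(links)  # keys are compared by exact string equality
    duplicated = sum(1 for n in counts.values() if n > 1)
    total = sum(counts.values())
    per_thousand = 1000 * duplicated / total if total else 0.0
    return duplicated, per_thousand


# Tiny made-up input: one link appears twice among four.
sample = ["da:A", "da:A", "ka:B", "ar:C"]
sample_duplicated, sample_rate = duplicate_stats(sample)
```

Like `uniq -d`, this counts each duplicated link once, no matter how many times it occurs.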
Comment 9 Daniel Kinzler 2013-06-01 17:38:00 UTC
(In reply to comment #8)
> I wrote all 30,536,568 links in the 2013-05-27 database dump to a file, and then
> used sort(1) and uniq(1) to find all duplicate links.

To my knowledge, there can be no dupes in the wb_items_per_site table, because there is a primary key covering the relevant fields.

Can you show exactly what you did? What exactly does the query look like? Can you give some example duplicates?

As far as I can see, the problem described in this report occurs when there are things *missing* from wb_items_per_site, and thus conflicts fail to be detected.
Comment 10 Byrial Jensen 2013-06-01 23:02:59 UTC
(In reply to comment #9)
I did not use the wb_items_per_site table (that would be impossible for me, as it is not available in the public database dumps of Wikidata).

I downloaded http://dumps.wikimedia.org/wikidatawiki/20130527/wikidatawiki-20130527-pages-articles.xml.bz2, and parsed the stored JSON-formatted page text for each item (that is, all pages in namespace 0). There are 12,565,377 items in the file and they contain 30,536,568 links. 3,160 of the links occur twice, for example:

als:Vorlage:Navigationsleiste Schweizer Gebirgspässe
an:Piedra (desambigación)
ar:تصنيف:بريطانيا
ar:تصنيف:تاريخ الشام
ar:تصنيف:تعليقات للرد
ar:تصنيف:جامعة محمد الخامس
ar:تصنيف:ولاية أريانة
ar:تصنيف:ولاية قابس
ar:تصنيف:ويكيبيديون رجال
ar:يحيى الفخراني
arz:تصنيف:بريطانيا
arz:يحيى الفخرانى
ast:Categoría:Botsuana
ast:Categoría:Llioneses
ast:La Caleya

22 of the links occur three times, for example:

az:Kateqoriya:1802-ci ildəki hadisələr
be-x-old:Катэгорыя:Падзеі 1802 году
be:Катэгорыя:Горад Мар'іна Горка
eo:Kategorio:139° U
map-bms:Kategori:Bangsalsari, Jember
map-bms:Kategori:Cibitung, Bekasi
map-bms:Kategori:Cilebak, Kuningan
os:Категори:139° н. д.
sr:Категорија:Босанскохерцеговачки вајари
ta:இடைக்குன்றூர் கிழார்
ta:உறையூர் மருத்துவன் தாமோதரனார்

I will later prepare a complete list of the items which contain the duplicate links so they can be deleted.
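
A minimal sketch of how such a list could be built while keeping the item IDs, assuming the dump has already been parsed into pairs of an item ID and a `{site: title}` mapping. That input shape and the function name `find_shared_links` are assumptions made for this example; the actual JSON layout in the dump is not shown in this report.

```python
from collections import defaultdict


def find_shared_links(items):
    """Map each site link that occurs in more than one item to the items using it.

    `items` is an iterable of (item_id, {site: title}) pairs (assumed shape).
    """
    owners = defaultdict(set)
    for item_id, sitelinks in items:
        for site, title in sitelinks.items():
            owners[f"{site}:{title}"].add(item_id)
    return {link: sorted(ids) for link, ids in owners.items() if len(ids) > 1}


# Example using the pair reported in comment 2; the third item is invented.
parsed = [
    ("Q12340897", {"da": "Viborgvej (Aarhus)"}),
    ("Q12343899", {"da": "Viborgvej (Aarhus)"}),
    ("Q1", {"da": "Some other page"}),
]
shared = find_shared_links(parsed)
```

Grouping by link rather than only counting means every item that carries a duplicated link is listed, including the ones that have no row in the site link table.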
Comment 11 Daniel Kinzler 2013-06-02 13:10:48 UTC
Thanks for investigating, Byrial!

(In reply to comment #10)
> (In reply to comment #9)
> I did not use the wb_items_per_site table (that would be impossible for me as
> it is not available in the public database dumps of Wikidata).

Ah - I guess we should fix that.
It's available on the toolserver though. I assumed you were using that.

> There are 12,565,377 items in the file and they contain
> 30,536,568 links. 3,160 of the links occur twice, for example:

Could you please include the item IDs in that list? I can find one of the items in the database easily, but (by the nature of the bug) not the other (or the third).

> I will later prepare a complete list of the items which contain the duplicate
> links so they can be deleted.

That would be awesome, thank you!


Re-opening, so we can track the investigation and deletion of further duplicates.
Comment 12 Byrial Jensen 2013-06-03 04:42:34 UTC
Please see http://www.wikidata.org/wiki/User:Byrial/Duplicates

Each line contains one of the 3,182 duplicate links and the 2 or 3 items which contain the link. (NB: The links in the list are not exactly as they appear in the database dump file, because I changed the localized namespace names for some (but not all) languages when I originally parsed the dump.)

An item may appear on the list several times if it contains several duplicated links for different languages. The list is sorted by item number.
Comment 13 Daniel Kinzler 2013-06-27 10:04:13 UTC
We can't prevent this from happening, but there are two things we could do:

* we could try harder to detect incomplete saves
* we could have a maintenance script that removes sitelinks from entities that don't have that sitelink stored in the database table.
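
The second option could look roughly like the following sketch. It is only an illustration on plain dictionaries, not an actual maintenance script: `blobs` (item ID to the set of site links stored in the item) and `index` (site link to owning item ID, standing in for the database table) are hypothetical structures, as are the function name and the placeholder page title.

```python
def strip_unregistered_links(blobs, index):
    """Remove from each item the links that the index does not assign to that item.

    Returns the list of (item_id, link) pairs that were removed.
    """
    removed = []
    for item_id, links in blobs.items():
        # Iterate over a sorted copy so the set can be modified safely.
        for link in sorted(links):
            if index.get(link) != item_id:
                links.discard(link)
                removed.append((item_id, link))
    return removed


# Both items carry the link, but only Q2618201 is registered as its owner.
blobs = {"Q12863749": {"ka:Example page"}, "Q2618201": {"ka:Example page"}}
index = {"ka:Example page": "Q2618201"}
removed = strip_unregistered_links(blobs, index)
```

Note that this sketch also removes links that are missing from the index altogether. A real script might prefer to register such links instead when no other item claims them.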
Comment 14 denny vrandecic 2013-06-27 10:17:27 UTC

*** This bug has been marked as a duplicate of bug 42325 ***
