Last modified: 2011-03-30 14:25:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T29773, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 27773 - Length of dump text and length field in API do not match
Status: NEW
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal minor
Target Milestone: ---
Assigned To: Ariel T. Glenn
Keywords: analytics
Depends on:
Blocks: 27772
 
Reported: 2011-02-27 23:59 UTC by Diederik van Liere
Modified: 2011-03-30 14:25 UTC (History)




Description Diederik van Liere 2011-02-27 23:59:47 UTC
The length of the dump text and the length field in the API do not match (even after UTF-8 encoding) due to inconsistent line break characters and leading/trailing whitespace.
Note that this results in false negatives when detecting identity reverts.

Current workaround:
Strip whitespace from the beginning and end of the text and replace every "\r\n" (Windows line break) with "\n". With this approach, you get acceptable (99%) but still imperfect consistency between the API and the dump.
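
The workaround above can be sketched roughly as follows. This is only an illustration, not the code actually used for revert detection; the function names and the sample strings are hypothetical.

```python
def normalize_revision_text(text: str) -> str:
    """Apply the workaround: unify Windows line breaks, then strip
    whitespace from both ends of the text."""
    return text.replace("\r\n", "\n").strip()


def utf8_length(text: str) -> int:
    """Length in bytes of the text once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Hypothetical sample data: the same revision as it might appear in the
# dump (Windows line break, trailing blank lines) and in the API output.
dump_text = "First line\r\nSecond line\n\n"
api_text = "First line\nSecond line"

# Raw lengths differ, normalized lengths agree.
print(utf8_length(dump_text), utf8_length(api_text))
print(utf8_length(normalize_revision_text(dump_text)),
      utf8_length(normalize_revision_text(api_text)))
```

Comparing the normalized texts (or hashes of them) rather than the raw lengths avoids the false negatives caused by whitespace differences alone, though as noted it does not remove every mismatch.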
Comment 1 Brion Vibber 2011-03-05 00:35:09 UTC
Are there any specific examples?

Are whitespace mismatches due to problems parsing the way whitespace is encoded in the XML, or due to the XML dumps actually containing incorrect whitespace?

(The dumps may well contain incorrect whitespace, most likely due to inconsistencies in parsing the previous whitespace when doing multiple passes combining text from previous dumps with new stub dumps, etc.)
Comment 2 Roan Kattouw 2011-03-05 20:48:06 UTC
(In reply to comment #1)
> Are there any specific examples?
> 
> Are whitespace mismatches due to problems parsing the way whitespace is encoded
> in the XML, or due to the XML dumps actually containing incorrect whitespace?
> 
Do the XML dumps use the xml:space="preserve" attribute?
Comment 3 Ariel T. Glenn 2011-03-27 13:34:59 UTC
I would like a specific page ID, revision ID and dump file to look at, if someone can point me to one.
Comment 4 Aaron Halfaker 2011-03-29 17:14:28 UTC
Page: Anarchism (page ID 12)
Revision ID: 233194

From the 2010-01-30 XML dump at the end of the 233194 revision (notice the line breaks before the closing </text> tag)
----------------------------------------------------
[...]
/Talk &lt;br&gt;
    
/Todo &lt;br&gt;

[[Anarchy/Talk]] [http://www.wikipedia.com/wiki.cgi?action=history&amp;id=Anarchy Anarchy History] (The content of Anarchy and Anarchism have since been merged into this version)

</text>
-----------------------------------------------------

From the API (http://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=233194&rvprop=content&format=jsonfm) (notice that the string ends right after the last non-whitespace character)
-----------------------------------------------------
{
	"query": {
		"pages": {
			"12": {
				"pageid": 12,
				"ns": 0,
				"title": "Anarchism",
				"revisions": [
					{
						"*": "''Anarchism'' is <removed most of the text here -Aaron Halfaker> (The content of Anarchy and Anarchism have since been merged into this version)"
					}
				]
			}
		}
	}
}
-----------------------------------------------------
Comment 5 Ariel T. Glenn 2011-03-30 14:25:40 UTC
(Yes, the XML files have  <text xml:space="preserve"> in them.)

I had a look at the output we get from ExternalStore::fetchFromURL()

The text we get back has a newline after the final parenthesis. 

That text is 8884 bytes long, which matches the rev_len recorded in the revision table and in the XML dump file.  When I apply the various conversions for & < > " and strip the ^Ms I get the byte count of the text entry in the xml file: 8930.

When I do the same conversions for the json format (for " \r \n and /) I come up one byte longer, 9160, than the actual json output text, 9159.  My conclusion is that the json formatter or perhaps generally the API loses that newline at the end.
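
The XML half of the byte accounting in this comment could be reproduced along these lines. It is a minimal sketch that assumes the dump writer escapes exactly the four characters & < > " with the standard entities and that "strip the ^Ms" means removing every carriage return; the function name and the sample string are hypothetical, and this is not the code used to produce the figures above.

```python
from xml.sax.saxutils import escape


def xml_text_byte_length(raw_text: str) -> int:
    """Byte length of the text as it would appear inside <text>...</text>,
    under the assumptions stated in the lead-in."""
    # Remove carriage returns (^M) first.
    without_cr = raw_text.replace("\r", "")
    # escape() handles &, < and > by default; the quote is added explicitly.
    escaped = escape(without_cr, {'"': "&quot;"})
    return len(escaped.encode("utf-8"))


# Hypothetical sample: 16 bytes stored, one of which is a carriage return.
sample = 'a & b <br> "q"\r\n'
print(len(sample.encode("utf-8")), xml_text_byte_length(sample))
```

Running the same function on the text with and without its final newline shows whether a one-byte difference comes from that newline rather than from escaping.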
