Last modified: 2011-03-30 14:25:40 UTC
The length of the dump text and the length field in the API do not match (even after UTF-8 encoding) due to inconsistent line break characters and leading/trailing whitespace. Note that this results in false negatives when detecting identity reverts.

Current workaround: strip whitespace from the beginning and end, and replace all "\r\n" (Windows line breaks) with "\n". With this approach, you get acceptable (99%) but still imperfect consistency between API and dump.
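For reference, the workaround can be sketched roughly as below. This is only an illustration of the two steps described above, not the code actually used; the function names and the choice of SHA-1 as the checksum for revert detection are assumptions.

```python
import hashlib


def normalize_revision_text(text: str) -> str:
    """Apply the workaround: unify Windows line breaks, then trim both ends."""
    return text.replace("\r\n", "\n").strip()


def revision_digest(text: str) -> str:
    """Checksum of the normalized text (SHA-1 is an assumption, any stable hash works)."""
    return hashlib.sha1(normalize_revision_text(text).encode("utf-8")).hexdigest()


def is_same_revision_text(dump_text: str, api_text: str) -> bool:
    """Compare dump and API text after normalization."""
    return revision_digest(dump_text) == revision_digest(api_text)
```

With this, a dump text ending in extra newlines and an API text that stops at the last non-whitespace character compare as equal, which is the case that otherwise produces a false negative.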
Are there any specific examples? Are whitespace mismatches due to problems parsing the way whitespace is encoded in the XML, or due to the XML dumps actually containing incorrect whitespace? (The dumps may well contain incorrect whitespace, most likely due to inconsistencies in parsing the previous whitespace when doing multiple passes combining text from previous dumps with new stub dumps, etc.)
(In reply to comment #1)
> Are there any specific examples?
>
> Are whitespace mismatches due to problems parsing the way whitespace is encoded
> in the XML, or due to the XML dumps actually containing incorrect whitespace?
>

Do the XML dumps use the xml:space="preserve" attribute?
I would like a specific page ID, revision ID and dump file to look at, if someone can point me to one.
Anarchism(12) RevisionId: 233194

From the 2010-01-30 XML dump, at the end of the 233194 revision (notice the line breaks before the closing </text> tag):
----------------------------------------------------
[...]
/Talk <br>
/Todo <br>
[[Anarchy/Talk]]
[http://www.wikipedia.com/wiki.cgi?action=history&id=Anarchy Anarchy History]
(The content of Anarchy and Anarchism have since been merged into this version)

</text>
-----------------------------------------------------

From the API (http://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=233194&rvprop=content&format=jsonfm) (notice that the string ends right after the last non-whitespace character):
-----------------------------------------------------
{
  "query": {
    "pages": {
      "12": {
        "pageid": 12,
        "ns": 0,
        "title": "Anarchism",
        "revisions": [
          {
            "*": "''Anarchism'' is <removed most of the text here -Aaron Halfaker> (The content of Anarchy and Anarchism have since been merged into this version)"
          }
        ]
      }
    }
  }
}
-----------------------------------------------------
(Yes, the XML files have <text xml:space="preserve"> in them.)

I had a look at the output we get from ExternalStore::fetchFromURL(). The text we get back has a newline after the final parenthesis. That text is 8884 bytes long, which matches the rev_len recorded in the revision table and in the XML dump file. When I apply the various conversions for & < > " and strip the ^Ms, I get the byte count of the text entry in the XML file: 8930. When I do the same conversions for the JSON format (for " \r \n and /), I come up with 9160, one byte longer than the actual JSON output text, which is 9159.

My conclusion is that the JSON formatter, or perhaps the API in general, loses that newline at the end.
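A rough way to repeat this kind of byte count on any raw revision text is sketched below. The helper names are hypothetical, and the JSON side only approximates the API output: it assumes non-ASCII characters are left as UTF-8 and that "/" is written as "\/".

```python
import json
from xml.sax.saxutils import escape


def xml_text_length(raw: str) -> int:
    """Byte length inside <text>: drop carriage returns, escape & < > and the double quote."""
    cleaned = raw.replace("\r", "")
    return len(escape(cleaned, {'"': "&quot;"}).encode("utf-8"))


def json_text_length(raw: str) -> int:
    """Byte length of the JSON string body, without the surrounding quotes."""
    body = json.dumps(raw, ensure_ascii=False)[1:-1]
    # Assumption: the formatter also escapes forward slashes as \/
    body = body.replace("/", "\\/")
    return len(body.encode("utf-8"))


sample = 'a & b\r\n"c"/d\n'
print(len(sample.encode("utf-8")), xml_text_length(sample), json_text_length(sample))
```

Comparing these numbers with the lengths of the actual dump entry and API response shows where bytes go missing.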