Last modified: 2014-05-05 13:28:14 UTC
Capitalization in the parse_user function, which formats strings into the MediaWiki user name format, assumes 1 byte per character. This breaks for user names whose first character takes up two bytes.

Sample:

Current code:
>>> a = "èMarianne.ramsès ".decode('utf-8')
>>> s = a.strip()
>>> s = a.strip().encode('utf-8')
>>> first = s[0]
>>> print first
� -> this is 'half' a character

Correct sequence:
>>> a = "èMarianne.ramsès ".decode('utf-8')
>>> s = a.strip()
>>> first = s[0].upper().encode('utf-8')
>>> print first
È

We likely need to review all the code that does string comparisons on user_names. Perhaps having our own type for user names that wraps encoding issues is best.
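For illustration only, here is a minimal sketch of the "decode first, slice second, encode last" ordering described above. The helper name capitalize_user_name is hypothetical and is not the project's parse_user. The sample session above is Python 2, while this sketch is written for Python 3, where bytes and text are separate types.

```python
def capitalize_user_name(raw_bytes):
    """Hypothetical helper (not the project's parse_user).

    Takes UTF-8 encoded bytes and returns UTF-8 encoded bytes with the
    first character upper-cased and surrounding whitespace removed.
    """
    # Decode before slicing, so index 0 is a whole character, not one byte.
    text = raw_bytes.decode('utf-8').strip()
    if not text:
        return b''
    # Upper-case only the first character; leave the rest untouched.
    capitalized = text[0].upper() + text[1:]
    # Encode only at the very end, once all character-level work is done.
    return capitalized.encode('utf-8')


sample = u"èMarianne.ramsès ".encode('utf-8')
result = capitalize_user_name(sample)

# Slicing the encoded bytes instead yields only half of the two-byte 'è'.
half_character = sample[:1]
```

Keeping all character-level operations on decoded text is also the kind of rule a dedicated user name type could enforce in one place.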
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1548
This bug has been fixed but requires an integration test to close.
Change 129672 had a related patch set uploaded by Nuria: Adding test for cohort uploading for cohort with cyrilic and arabic usernames. https://gerrit.wikimedia.org/r/129672
Change 129672 merged by Milimetric: Adding test for cohort uploading for cohort with cyrilic and arabic usernames. https://gerrit.wikimedia.org/r/129672
This is deployed and tested in staging. Please re-open if you have issues in staging. We will deploy to production on Thursday May 1st.
Did this get deployed already? I've been testing today and found that utf8 names work fine when uploaded as a txt file, but if I try to use them in the Paste Usernames box, I get "error! Server error while processing your upload".
Sage: Would you mind opening a bug with some examples that we can use to test the issue, noting that it only happens when pasting usernames into the text box? We have been working on encoding, but there is likely more work to do on the HTTP layer regarding character parsing.
nuria: done as bug 64893.