Last modified: 2014-02-26 12:53:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T49368, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 47368 - log database in s1-analytics-slave has charecter set utf8, while every other database has charset binary


Summary:	log database in s1-analytics-slave has charecter set utf8, while every other ...

Status:	NEW

Product:	Analytics
Classification:	Unclassified
Component:	EventLogging (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2013-04-18 18:19 UTC by Yuvi Panda
Modified:	2014-02-26 12:53 UTC (History)
CC List:	8 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Yuvi Panda 2013-04-18 18:19:30 UTC

Inconsistency is causing issues with scripts which now have to treat the log database specially.

Comment 1 Yuvi Panda 2013-04-18 18:20:12 UTC

Related is https://bugzilla.wikimedia.org/show_bug.cgi?id=45718 - if we change charset for the database we should also ensure that the tables are changed too.

Comment 2 Dario Taraborelli 2013-04-18 18:35:51 UTC

Yuvi, I am not fully persuaded the charset should be consistent with MW's default binary. Other databases used for data analysis on s1 host datasets coming from various sources (squid logs, API, dumps) that make utf8 the ideal encoding. I typically rely on IDs for my work so I am probably less sensitive to this issue.

Can you expand on the issues you're referring to in the original report?

On a related note, the situation on s1 is pretty messy:

SHOW CREATE DATABASE staging;
+----------+--------------------------------------------------------------------+
| Database | Create Database                                                    |
+----------+--------------------------------------------------------------------+
| staging  | CREATE DATABASE `staging` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+--------------------------------------------------------------------+
1 row in set (0.03 sec)

SHOW CREATE DATABASE prod;
+----------+-----------------------------------------------------------------+
| Database | Create Database                                                 |
+----------+-----------------------------------------------------------------+
| prod     | CREATE DATABASE `prod` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
1 row in set (0.03 sec)

SHOW CREATE DATABASE enwiki;
+----------+-------------------------------------------------------------------+
| Database | Create Database                                                   |
+----------+-------------------------------------------------------------------+
| enwiki   | CREATE DATABASE `enwiki` /*!40100 DEFAULT CHARACTER SET binary */ |
+----------+-------------------------------------------------------------------+
1 row in set (0.03 sec)

SHOW CREATE DATABASE log;
+----------+--------------------------------------------------------------+
| Database | Create Database                                              |
+----------+--------------------------------------------------------------+
| log      | CREATE DATABASE `log` /*!40100 DEFAULT CHARACTER SET utf8 */ |
+----------+--------------------------------------------------------------+
1 row in set (0.03 sec)

Comment 3 Yuvi Panda 2013-04-18 18:46:49 UTC

So the problem is with Python. I'm using a bunch of scripts to manually 'join' data from the production slaves (commons) and eventlogging, and am having to treat the connections to log / others differently. This prevents databases from being interchangeable, plus this kinda special casing is always bad.

Is there any direct benefit from having log to be utf8 than binary?

Comment 4 Yuvi Panda 2013-04-18 19:04:54 UTC

As per IRC, I think being internally consistent inside MySQL (all mysql databases have same encoding, you don't need to check that everytime) is more important than having *some* mysql tables be consistent with some other outside sources.

Comment 5 Yuvi Panda 2013-04-18 19:13:19 UTC

https://gerrit.wikimedia.org/r/#/c/59880/ is the use case I am talking about. DB handling code should not have to care about which database it is connecting to...

Comment 6 Ori Livneh 2013-04-19 19:42:47 UTC

(In reply to comment #0)
> Inconsistency is causing issues with scripts which now have to treat the log
> database specially.

This isn't EventLogging exceptionalism, you know: URIs are UTF-8, MediaWiki speaks UTF-8, the Python source is UTF-8, the local of the system it's running on is UTF-8, the ZeroMQ stream is UTF-8, MongoDB is UTF-8, and it is the most common encoding on Android and iOS.

Binary character encoding is a bit of a misnomer, since it's not actually a character encoding but the absence of one. When all of your interactions with the database are mediated by MediaWiki, then it doesn't matter much how the database is configured as long as MediaWiki knows how to work with it. But EventLogging data has many consumers, working with data in SQL, CSV and XML format in Python and R and direct database client GUIs and who knows what else. Leaving the character encoding unspecified means leaving it up to application code, which in this case is scattered across a number of different codebases. This opens up a vast gulf of indeterminacy and a large potential for encoding issues.

Getting all these different pieces, written in different languages and running on different systems, to speak with one another was tricky as hell, but it's working now, and we'd need to have a really compelling reason for changing it.

I think this is a WONTFIX, but leaving it open for now in case Yuvi wants to make a thorough case for a migration.

Comment 7 Yuvi Panda 2013-04-19 19:50:10 UTC

*shrug* I should open a new bug to have the slaves of the production databases set to utf8, but I am unsure how productive that would be.

Comment 8 Ori Livneh 2013-04-25 06:23:09 UTC

The reason we don't use the 'utf8' character encoding setting in MySQL is, it turns out, that it can only encode BMP characters, which have a maximum width of three bytes. Supplementary characters cannot be encoded at all.

So I *do* think we need to change the character encoding, but to what?

Our databases run MariaDB 5.5, which supports a newer character encoding called 'utf8mb4', which offers true utf8 support. I'd strongly prefer that to 'binary', in the interest of being explicit and mitigating the chance that a client will default to some other value based on the system locale or what have you. But this then would leave Yuvi's issue unresolved: the EventLogging database would still be slightly different than the rest of the cluster.

Thoughts?

Comment 9 Andre Klapper 2014-02-26 12:53:46 UTC

[moving from MediaWiki extensions to Analytics product - see bug 61946]

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links