Last modified: 2014-07-17 12:52:03 UTC
As a translation admin, I want the initial alignment offered by Special:PageMigration to be balanced enough for me to orient myself when fixing it manually. First opportunity: align the == (level 2) headers in the order they appear. Irrespective of splitting (and without changing it), we can align the first source unit matching ^==[^=] with the first "target" unit matching ^==[^=], adding blank units before it if needed.
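A minimal Python sketch of that first step, for illustration only (it is not the extension's code). The names `first_h2_index` and `align_first_h2` are hypothetical, units are assumed to be plain strings, and padding the source side when the target header comes later is my assumption, since the request only says "adding blanks before if needed".

```python
import re

# A unit "starts with an h2 header" if it matches ^==[^=] (so === is excluded).
H2 = re.compile(r"^==[^=]")


def first_h2_index(units):
    """Return the index of the first unit starting with an h2 header, or None."""
    for i, unit in enumerate(units):
        if H2.match(unit):
            return i
    return None


def align_first_h2(source_units, target_units):
    """Insert blank units so the first h2 units end up at the same index.

    Blanks go immediately before the header on whichever side has it earlier.
    The inputs are not modified; two new lists are returned.
    """
    source, target = list(source_units), list(target_units)
    s, t = first_h2_index(source), first_h2_index(target)
    if s is None or t is None:
        return source, target  # nothing to align on
    if s > t:
        target[t:t] = [""] * (s - t)
    elif t > s:
        source[s:s] = [""] * (t - s)  # assumption: padding the source is allowed
    return source, target


source = ["Intro", "More intro", "== History ==", "Text"]
target = ["Intro", "== Historia ==", "Texto"]
aligned_source, aligned_target = align_first_h2(source, target)
# aligned_target is now ["Intro", "", "== Historia ==", "Texto"]
```

Only the first header is handled here; later headers stay wherever the existing splitting puts them.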
The source units on the left-hand side are not under the control of Special:PageMigration, are they? So if the source unit is like "==Section== Text text text", the corresponding target unit also needs to contain the section header as well as the text, as would have been the case at Special:Translate. I personally feel that introducing new lines (if not present) after section headers while preparing the page for translation (step 1) would make the alignment better. From the examples I tested, I found this to be the main reason for the mismatch. I feel we could finish step 1 first, see how the alignment looks, and then work on getting the best alignment once both steps are ready :)
The two concerns are separate. You're always going to have past translations which don't follow the source text (or your assumptions) in their structure, including whitespace around headers. If the success of one step of the process depends completely on the perfect success of another step, it will be hard to make progress. This is just pass 1 of the alignment improvement process you wrote down at https://www.mediawiki.org/w/index.php?title=Extension:Translate/Mass_migration_tools/Design&oldid=988113 , «In the first pass, section headers can be covered. The flow would be to simply check for section headers present as translation units and get the corresponding section from the translation, assuming that all the sections are in the same order in both the text. [1]»
BPositive> what do you mean by "adding blanks before if needed"? :) I mean adding an empty textarea/"unit".
Alright, I am thinking about the approach you mentioned. But it would be great if https://gerrit.wikimedia.org/r/#/c/136334/ got merged. That would give me an array of translationUnits in which no single unit mixes section headers with other text. Once I have such an array, I could scan the sourceUnits and translationUnits and match up section headers in the order they appear. Doing so, there won't be a need to add an extra unit before/after.
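The scan described above could look roughly like this in Python. It is a sketch under assumptions, not the patch itself: `header_indices` and `pair_headers` are hypothetical names, both arrays are assumed to hold plain strings, and headers are assumed to already sit at the start of their own units.

```python
import re

# Same header test as in the request: a unit starting with == but not ===.
H2 = re.compile(r"^==[^=]")


def header_indices(units):
    """Indices of the units that start with an h2 header, in document order."""
    return [i for i, unit in enumerate(units) if H2.match(unit)]


def pair_headers(source_units, translation_units):
    """Pair the k-th source header with the k-th translation header.

    Returns a list of (source_index, translation_index) tuples. If one side
    has more headers than the other, the surplus headers are left unpaired.
    """
    return list(zip(header_indices(source_units),
                    header_indices(translation_units)))


source_units = ["Intro", "==A==", "a", "==B==", "b"]
translation_units = ["Intro", "x", "==A2==", "==B2==", "b2", "==C2=="]
pairs = pair_headers(source_units, translation_units)
# pairs == [(1, 2), (3, 3)]
```

The pairs only say which units correspond; what to do with the non-header units between two paired headers is left open here.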
Change 138220 had a related patch set uploaded by BPositive: Simplistic alignment based on h2 headers for Special:PageMigration https://gerrit.wikimedia.org/r/138220
Note that in theory [[mw:API:Parse]] can be used to get a list of headers, e.g. for a full page: https://www.mediawiki.org/w/api.php?action=parse&oldid=629558&prop=sections As long as we stay simple that's probably not needed, but it would be if we need to do more, or more complex, things in a sane way.
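For illustration, a small Python sketch of how such a request could be built and its result filtered down to h2 headers. The helper names are hypothetical, `format=json` is added to the query above, and the sample response is hand-written to mirror the usual shape of the `sections` property (a list of entries with `level` and `line` fields) rather than copied from the real page.

```python
from urllib.parse import urlencode

API = "https://www.mediawiki.org/w/api.php"


def sections_url(oldid):
    """Build an action=parse URL asking only for the section list of a revision."""
    params = {"action": "parse", "oldid": oldid,
              "prop": "sections", "format": "json"}
    return API + "?" + urlencode(params)


def h2_titles(response):
    """Return the header text of every level 2 section in a decoded response."""
    sections = response["parse"]["sections"]
    # "level" is compared as a string so that "2" and 2 are both accepted.
    return [s["line"] for s in sections if str(s["level"]) == "2"]


# Hand-written sample, NOT real data from revision 629558.
sample = {"parse": {"sections": [
    {"toclevel": 1, "level": "2", "line": "Installation", "index": "1"},
    {"toclevel": 2, "level": "3", "line": "Requirements", "index": "2"},
    {"toclevel": 1, "level": "2", "line": "Usage", "index": "3"},
]}}

print(sections_url(629558))
print(h2_titles(sample))  # ['Installation', 'Usage']
```

Fetching the URL and decoding the JSON is left out on purpose, so the sketch runs without network access.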
Change 138220 merged by jenkins-bot: Simplistic alignment based on h2 headers for Special:PageMigration https://gerrit.wikimedia.org/r/138220