Last modified: 2014-09-24 01:28:01 UTC
Created attachment 7222 [details]
A patch for MediaWiki 1.14.1
In our company (Customized InformSystems, http://www.custis.ru/), we use MediaWiki for maintaining corporate knowledge bases, and I'm responsible for supporting it, among other things. Apologies in advance for my English :-)
I have greatly improved MediaWiki XML Export and Import mechanism and I want to submit a patch. The following features are added:
* Support for exporting and importing UPLOADS - either attached inside the output file, which in that case becomes a multipart/related document with binary parts, or as URLs + SHA1 hashes inside the XML document, so that the destination wiki can download them itself.
* Support for selecting pages from category with subcategories.
* Support for selecting pages from specific namespace.
* Support for selecting pages modified after a specific date.
* Support for selecting pages linked from the selected ones ("pagelinks closure").
* DVCS-like detection of "conflicts" during import, and an extended import report with 5 types of messages:
1) All revisions were previously imported. No local changes.
2) All revisions were previously imported. Page changed locally.
3) N revisions imported.
4) N revisions imported (new page).
5) N revisions imported (conflict: XX (import) and YY (local)).
* With all these improvements, the import mechanism remains backwards compatible and can import dumps from MediaWikis without them.
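To illustrate the first feature above, the "uploads attached inside the output file" variant could be assembled roughly as follows. This is only a sketch in Python (the patch itself is PHP), and the boundary string, part headers and `X-Content-SHA1` field are my own assumptions, not the patch's actual wire format:

```python
import hashlib

def build_multipart_export(xml_dump: bytes, uploads: dict) -> bytes:
    """Sketch: wrap an XML dump plus binary uploads into a
    multipart/related-style body. 'uploads' maps file name -> raw bytes."""
    boundary = b"--export-boundary"  # assumed boundary, not the patch's
    parts = [b"Content-Type: application/xml\r\n\r\n" + xml_dump]
    for name, data in uploads.items():
        sha1 = hashlib.sha1(data).hexdigest()
        headers = (
            "Content-Type: application/octet-stream\r\n"
            f"Content-ID: <{name}>\r\n"
            f"X-Content-SHA1: {sha1}\r\n\r\n"   # hypothetical header
        ).encode()
        parts.append(headers + data)
    body = b""
    for p in parts:
        body += boundary + b"\r\n" + p + b"\r\n"
    return body + boundary + b"--\r\n"
```

The alternative export mode would simply emit the URL and SHA1 inside the XML instead of the binary part, leaving the download to the destination wiki.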
Unfortunately (or fortunately? :-D), the patch requires a change to filerepo/LocalFile.php, because there is an inconsistency: archived versions of uploads (oldimages) are stored with a timestamp in the file name which IS NOT EQUAL to the oldimage timestamp in the database. The timestamp in the file name is the timestamp of archiving, i.e. the timestamp of the NEXT revision of the upload. IMO this is inconsistent behaviour: oldimage file names depend on the order of import. So I'll submit a patch and a tool to convert the timestamps in the oldimage archive.
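To make the inconsistency concrete, here is a sketch of the naming scheme in Python (the helper name is mine; MediaWiki's actual code does the equivalent of PHP's `gmdate( 'YmdHis' ) . '!' . $name` in LocalFile):

```python
from datetime import datetime, timezone

def archive_name(timestamp: float, name: str) -> str:
    """Build an oldimage archive name: a 14-digit UTC timestamp,
    then '!', then the file name (mirrors gmdate('YmdHis') in PHP)."""
    ts = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return ts.strftime("%Y%m%d%H%M%S") + "!" + name

# Current behaviour: the *archiving* time (i.e. the time the NEXT
# version was uploaded) goes into the name, so the same old revision
# gets a different filename depending on when it was superseded,
# e.g. on the order of import. The patch proposes using the
# revision's own upload timestamp instead, matching the DB row.
```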
I think all these features are very useful for MediaWiki users, so I'm asking whether it is possible to incorporate these changes into the MediaWiki code. Currently they are written against 1.14.1, but I can port them to trunk if needed. The submitted patch also DOES NOT conform to your code style; for example, I hate adding spaces after opening and before closing braces. But if it is needed, I can change the patch to conform.
But I think you can anyway start reviewing the patch :) IMO, it's very useful.
Created attachment 7223 [details]
A maintenance tool to change oldimage filenames
This patch also fixes Bug 9108.
I'll add this bug as a See Also to all related bugs.
P.S: If you want to see it in action, go to our external corporate Wiki: http://lib.custis.ru/Special:Export or my own wiki: http://yourcmc.ru/wiki/Special:Export
The patch is applied to both of them.
I'd certainly advise to port to trunk, considering 1.15.2 is considered stable, and 1.16 is not hugely far away :)
That increases the chances of getting it reviewed quickly.
No new functionality will be added to already released versions. Please provide a patch for trunk and reopen on submit.
Outdated patches are no reason to resolve INVALID. The patch can be updated to run against trunk.
I will definitely update the patch to run against trunk when I have time to do it :) For now, you can try the live demonstration at the URLs above...
Created attachment 7601 [details]
Patch for Trunk MediaWiki (svn70079)
Good news, everyone! :-)
I ported my patch to MediaWiki 1.17 trunk (svn revision 70079).
Could you give it some review, please?
Created attachment 7870 [details]
Patch for trunk MediaWiki (svn 77332)
Updated the patch for trunk + added some checks and the ability to export very large files using a 1 MB buffer.
Bryan, can you poke at this?
A few comments:
> <font color=red>
is a big no-no, use <span class="error">
immediate*() are deprecated, use just begin() and commit()
Comments only on the filerepo part:
+ global $wgUser;
File repo should never ever ever ever use globals like $wgUser. Pass the user object as a function argument, similar to LocalFile::upload
+ $dstPath = $this->repo->getZonePath('public') . '/archive/' . $this->getHashPath() . $dstName;
+ $dstName = gmdate( 'YmdHis', wfTimestamp( TS_UNIX, $timestamp ) ) . '!' . $this->getName();
You are essentially duplicating this from LocalFile::publish. The archive name generation should be moved to a separate function, which is called by both publish() and recordOldUpload().
E.g. $dstName = $this->generateArchiveName( $timestamp );
I find it strange that you are doing path generation in recordOldUpload(). Is there a specific reason you are not doing that in the caller and pass the entire archive name? That would seem a bit more logical to me.
+ /* Original gmdate( 'YmdHis' ) is not corrent AT ALL! */
+ /* It gives an inconsistency: file name has one timestamp and database row has another. */
Correct, we should fix that anyway.
+ $props['timestamp'] = wfTimestamp( TS_MW, $timestamp );
You should use $dbw->timestamp( $timestamp )
I don't know the import/export code at all, so I won't comment on that.
In general, good work. You are well on your way to finally getting upload importing/exporting available.
(In reply to comment #12)
> + /* Original gmdate( 'YmdHis' ) is not corrent AT ALL! */
> + /* It gives an inconsistency: file name has one timestamp and database
> row has another. */
> Correct, we should fix that anyway.
I retract that. The timestamp part of the filename is the timestamp when the file was moved into the archive, not the timestamp when the file was originally uploaded.
Thanks everybody for remarks :)
(In reply to comment #13)
> I retract that. The timestamp part of the filename is the timestamp when the
> file was moved into the archive, not the timestamp when the file was originally
Yeah, originally it is. But I personally don't understand the practical sense of this. The time when the file was moved into the archive is usually equal to, or within 1-2 seconds of, the time when a new version of the same file was uploaded. So each version's filename contains the timestamp of another version, and not always an accurate one at that.
I think versions should be independent, and should not form such "linked lists".
(In reply to comment #12)
> I find it strange that you are doing path generation in recordOldUpload(). Is
> there a specific reason you are not doing that in the caller and pass the
> entire archive name? That would seem a bit more logical to me.
Do you mean concatenating $dstPath instead of passing it as a parameter to recordOldUpload()? If so, recordOldUpload() uses $dstName as the value of oi_archive_name DB column, that's the reason.
Created attachment 8193 [details]
Rebase patch for new trunk MW (svn 82606)
Rebase patch to svn 82606 + add corrections from comment 12 by Bryan Tong Minh.
From now on, all these patches will be available at:
I also forgot to mention that in the new version of the patch, the insertion of empty "marker" revisions into imported pages is now disabled, because it leads to very fancy bugs, in particular infinitely duplicated revisions without changes in the case of two-way replication between two wikis: one wiki sees a change, imports it, and adds a newer marker revision; then the other sees this revision in the export file, imports it, and so on...
Created attachment 8386 [details]
Rebase to 85617 + fix "no history" for uploads + use sha1base36
Any progress on review? I see there were some changes to import.php and specialexport.php in trunk...
I've started with implementing the required FileRepo changes in r85635. I've heavily modified your patch to match our coding standards. I have moved most logic from LocalFile to OldLocalFile, because that means that we can use things like $this->getRel() etc.
I will next look at the multipart/related parts of the patch.
(In reply to comment #14)
> Thanks everybody for remarks :)
> (In reply to comment #13)
> > I retract that. The timestamp part of the filename is the timestamp when the
> > file was moved into the archive, not the timestamp when the file was originally
> > uploaded.
> Yeah, originally it is. But I personally don't understand the practical sense
> of this. The time when the file was moved into the archive is usually equal to
> or 1-2 second relative to the time when a new version of the same file was
> uploaded. So each version contains timestamp of another, moreover, not always
> an accurate one.
> I think versions should be independent, and should not form such "linked
Perhaps, but changing this may lead to unexpected breakage, which I don't feel inclined to search for.
Oh, thanks, sorry for the delay. I've seen your changes now.
One remark: I've also seen that base64 export has now been added? I understand this is for the sake of backwards compatibility, but it's not very optimal. Maybe leave it as an option?
> Perhaps, but changing this may lead to unexpected breakage,
> which I don't feel inclined to search for.
What, for example? Or is it completely :) unexpected, so you have no idea about it?
Also, I recently made some changes in the page selection mechanism. The use cases were:
- deny export of pages marked with some category
- allow "incremental" export, i.e. correct export of pages which were changed since a given time.
So, changes were:
- moved the selection of used images, templates and page links into the page selection box (and rewrote the code to optimize it: titles are no longer loaded one by one); the export now exports ONLY the pages that are in the textbox, and link selection is done using "Add pages";
- added a "NOT-category" filter for the page selection box ("Add pages:"), which is applied AFTER the selection of image, template and page links, and which is applied to the full list in the textbox, not only to the added pages;
- changed the behaviour of the modification date filter: it is now also applied after everything else, just like "NOT-category". This allows building page lists for incremental replication/export.
All this can be seen at http://wiki.4intra.net/, for example.
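The filter ordering described above could be sketched like this (Python pseudocode; `links_of`, `in_category` and `modified_after` are hypothetical helpers standing in for the real DB queries in Special:Export):

```python
def select_pages(seed_pages, links_of, in_category, modified_after,
                 not_category=None, since=None):
    """Sketch of the described selection order:
    1. expand image/template/page links into the page list,
    2. then apply the NOT-category filter to the FULL list,
    3. then apply the modification-date filter, also to the full list."""
    pages = set(seed_pages)
    # Step 1: link closure is part of building the list ("Add pages").
    frontier = list(pages)
    while frontier:
        page = frontier.pop()
        for linked in links_of(page):
            if linked not in pages:
                pages.add(linked)
                frontier.append(linked)
    # Step 2: NOT-category applies after link selection, to everything,
    # so a denied page cannot sneak back in via a link.
    if not_category is not None:
        pages = {p for p in pages if not in_category(p, not_category)}
    # Step 3: the date filter applies last, enabling incremental export.
    if since is not None:
        pages = {p for p in pages if modified_after(p, since)}
    return sorted(pages)
```

Applying the exclusion filters after the link closure, rather than during it, is what makes incremental export correct: the page list is computed in full first, then pruned.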
I'll try to rebase the patch against the new trunk with your WIP.
(In reply to comment #20)
> Oh, thanks, sorry for the delay. I've seen your changes now.
> One remark: I've also seen there is now base64 export added? I understand this
> is for the sake of backwards compatibility, but it's not very optimal. Maybe
> leave it as an option?
I don't understand. The base64 is backwards compatible if you parse with an XML parser.
> > Perhaps, but changing this may lead to unexpected breakage,
> > which I don't feel inclined to search for.
> What for example? Or is it completely :) unexpected so you have no idea about
I have absolutely no idea what would break, but I know MediaWiki well enough that something will break.
> I'll try to rebase the patch against the new trunk with your WIP.
That's fine. However, I don't think that I will have time to review and commit the patch, so somebody else should review it then.
Compatible, but increases the size :)
> I have absolutely no idea what would break
So you won't change it at all?
Conformance to the site style is a requirement for core code contributions.
I'm not sure what the point is of this multipart/related document type: is there some client that can read it? The exports appear to be sent out with HTTP headers specifying Content-Type: application/xml, but with a body that is plainly not XML, rather it is this MIME message format with headers embedded in the body text.
It seems like zip would have been a better choice. The wording of the messages which enable this multipart format certainly don't warn the user that what they are going to get will be so exotic.
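A zip-based container along the lines suggested here could look like the following sketch (Python's zipfile module; the entry names `Revisions.xml` and `files/` are my assumptions, not a format the patch defines):

```python
import io
import zipfile

def build_zip_export(xml_dump: bytes, uploads: dict) -> bytes:
    """Pack the XML dump plus binary uploads into one zip archive.
    'uploads' maps file name -> raw bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("Revisions.xml", xml_dump)   # assumed entry name
        for name, data in uploads.items():
            zf.writestr("files/" + name, data)   # assumed layout
    return buf.getvalue()
```

Unlike the multipart body, standard tooling can open the result directly, and the binary parts are compressed rather than inflated by base64.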
What does this:
signify? It is wrapped around some strange-looking code. Is it meant to be there?
The 5000-page limit in SpecialExport::getPagesFromCategory() appears to have been removed without explanation. Also, SpecialExport::rgetPagesFromCategory() appears to have no recursion depth limit.
It's probably feasible to use the upload timestamp of the old image version to generate the archive name, but it would need to be done with a bit more care and analysis than I see here. For example, LocalFileRestoreBatch::execute() appears to need patching. Deployment would be difficult since it would break if MediaWiki is downgraded or run with multiple versions on the same database.
It would be simpler, and head off potential future problems with analysis of image filesystems, if the archive names could be changed to a completely different regime, adding some marker to show whether they were generated the old way or the new way.
(In reply to comment #24)
> Conformance to the site style
> is a requirement for core code contributions.
I'll run my modifications through the tool.
> I'm not sure what the point is of this multipart/related document type: is
> there some client that can read it? The exports appear to be sent out with HTTP
> headers specifying Content-Type: application/xml, but with a body that is
> plainly not XML, rather it is this MIME message format with headers embedded in
> the body text.
> It seems like zip would have been a better choice. The wording of the messages
> which enable this multipart format certainly don't warn the user that what they
> are going to get will be so exotic.
Oh, thanks. I hadn't thought about zip. I agree that multipart is very exotic. For some reason, it was the first "archive" type I thought of.
> What does this:
> signify? It is wrapped around some strange-looking code. Is it meant to be
No... It's from other patch (access control extension). Sorry.
> The 5000-page limit in SpecialExport::getPagesFromCategory() appears to have
> been removed without explanation. Also, SpecialExport::rgetPagesFromCategory()
> appears to have no recursion depth limit.
I think any hard-coded limits are bad, and these were hard-coded... I just removed them, as that seemed the easiest way.
A recursion limit is harmful for small/intranet MW installations: users often create deep category hierarchies and want to export them :) and there are relatively few pages, so performance doesn't matter. The 5000-page limit is harmful, for example, when exporting the full wiki content.
These limits should probably be parametrized via $wgSomething...
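A traversal bounded by configurable caps, in the spirit of such a $wgSomething setting, might look like this sketch (Python; `pages_in` and `subcats_of` are hypothetical helpers, not MediaWiki APIs):

```python
def pages_from_category(cat, pages_in, subcats_of,
                        max_depth=5, max_pages=5000):
    """Collect pages from a category and its subcategories,
    bounded by both a recursion depth and a page-count limit."""
    seen_cats = set()
    result = []

    def walk(c, depth):
        if depth > max_depth or c in seen_cats:
            return  # the seen-set also breaks category cycles
        seen_cats.add(c)
        for p in pages_in(c):
            if len(result) >= max_pages:
                return  # page cap reached; stop collecting
            result.append(p)
        for sub in subcats_of(c):
            walk(sub, depth + 1)

    walk(cat, 0)
    return result
```

With the caps exposed as configuration, a small intranet wiki can raise or disable them while large public installations keep conservative defaults.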
> It's probably feasible to use the upload timestamp of the old image version to
> generate the archive name, but it would need to be done with a bit more care
> and analysis than I see here. For example, LocalFileRestoreBatch::execute()
> appears to need patching. Deployment would be difficult since it would break if
> MediaWiki is downgraded or run with multiple versions on the same database.
> It would be simpler, and head off potential future problems with analysis of
> image filesystems, if the archive names could be changed to a completely
> different regime, adding some marker to show whether they were generated the
> old way or the new way.
I'll change my modifications according to these remarks. Zip is really a very good idea :)
Vitaliy, have you had time to revise your patch?
Sorry, I didn't see the comments: dnsrbl.net went mad and was blocking all email on my server :)
Not fully; so far I've only implemented ZIP support. This required reworking the writers/filters/readers, so now I have several questions for the devs:
1. What was the original idea behind Dump*Filters? My opinion is that it's absolutely irrational to filter the pages AFTER loading them from the DB; plus, DumpFilters weren't actually used in export, so I've removed them. Is that OK?
2. I also don't really understand the idea of having separate write* methods (writeOpenPage, writeClosePage, etc.) in DumpOutput. Was it supposed to allow easy switching of the underlying stream format from XML to something else? I think it's useless, as there is no other dump format for now :) and it's not designed for the real needs of another format. So it's also removed :)
3. Also, in the new version of the patch, I always use temporary files to read/write dump archives, so support for streaming gzip/bzip2/etc. filters is also removed. But in my opinion that's not a big problem either: there's support for ZIP anyway.
4. Currently, the import code doesn't restore recent changes for page edits/creations. I think it must restore them for consistency, and I've implemented that. Is that OK?
5. I also don't fully understand the purpose of setting revision callbacks to methods of self in WikiImporter :) What is it for?
6. ImportStringSource: was this actually supposed to be used somewhere?
Created attachment 9466 [details]
Patch for archive filenames
I see the changes to the import xml reader happened in the meantime.
So I need to rebase my patch against trunk again.
Additionally, I still want to add archive support instead of, or in addition to, base64 encoding inside the XML, because the latter increases the file size by about 30%, which could be significant even on medium-sized installations :)
For now, I just want to submit the patch for archive file names. I suggest a prepended 'T' letter to address the backwards compatibility concern noted by Tim.
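The proposed marker would make old- and new-style archive names distinguishable at a glance, roughly like this (a sketch; the exact 'T' convention is only what is being proposed in this thread, not a settled MediaWiki scheme):

```python
def make_archive_name(ts14: str, name: str, new_style: bool = True) -> str:
    """Old style: '<archiving-timestamp>!<name>'.
    Proposed style: 'T<upload-timestamp>!<name>', so code can tell
    which timestamp semantics a given filename uses."""
    prefix = "T" if new_style else ""
    return prefix + ts14 + "!" + name

def is_new_style(archive_name: str) -> bool:
    # A plain 14-digit timestamp never starts with 'T', so the
    # prefix is an unambiguous marker for the new semantics.
    return archive_name.startswith("T")
```

This addresses the downgrade concern: code that finds no marker can fall back to the old interpretation instead of mis-reading the timestamp.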
For "A maintenance tool to change oldimage filenames", that really needs rewriting to use our Maintenance classes
Legacy code not using it isn't so bad, but adding new files that don't isn't a good way forward
I agree; it needs changing for that 'T' letter anyway :)
Created attachment 9544 [details]
ZIP support + rebase the patch to r104165
The new version of patch:
* Rebased to r104165... (again and again...)
* ZIP support, used instead of multipart/related by default. Old multipart/related is also supported, partly as an example of multiple archive formats
* Added reporting of imported upload count
* Recent changes are now restored during import (so the post-import state looks almost exactly the same as in the source wiki)
* FileRepo part does not need changes anymore, thanks to Bryan Tong Minh! :)
* There was a bug in trunk: sha1 was compared with sha1base36 in Import.php
* A note about addPages() / addPagesExec() in SpecialExport: they're static because they are useful for MW extensions, for example for a BatchEditor extension we have.
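On the sha1 vs sha1base36 point: MediaWiki stores image hashes in base36, so comparing a raw hex digest against a base36 one always fails. The conversion can be sketched like this (Python; MediaWiki's own PHP code uses wfBaseConvert, and the real format also zero-pads, which this sketch elides):

```python
import hashlib
import string

def sha1_base36(data: bytes) -> str:
    """Re-encode a hex SHA-1 digest in lowercase base36, roughly
    how MediaWiki stores img_sha1 (leading-zero padding elided)."""
    n = int(hashlib.sha1(data).hexdigest(), 16)
    digits = string.digits + string.ascii_lowercase
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or "0"
```

An importer must therefore convert one side before comparing; comparing the hex form to the base36 form directly, as the trunk bug did, rejects every valid file.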
I understand that the code maybe isn't perfect... But then, previous versions of MediaWiki don't have perfect code in the export/import part either. :)
And one more thing about backwards compatibility, ZIP, etc.: all efforts to maintain backwards compatibility with MW < 1.18 are useless anyway, because the old XML parser didn't allow differences in the element hierarchy; even if the whole difference is a single added element, it bails out with strange errors.
So I still suggest archives as the primary "dump format", not XML with base64-encoded parts :)
It would be nice to get it in for you before the patches go stale again!
Created attachment 10267 [details]
Patch for svn 114100
Rebased the patch again...
Please, someone review it!
Created attachment 10268 [details]
Updated maintenance tool
Created attachment 10312 [details]
Updated patch for svn 114100
Updated the patch (one small bugfix, use $title instead of $this->title in the error message).
Vitaliy, just wanted to check in and suggest that you get a Git/Gerrit account (see https://www.mediawiki.org/wiki/Git/Workflow)! That'll make it easier to get your patches reviewed.
OK, I've pushed it to Gerrit.
1) Change old image timestamps:
2) Quick fix for Bug 37209 - without it importing doesn't work:
3) Main part of import/export patch:
When improving Import, maybe somebody could consider looking at bug 22077.
Just an idea :)
By the way, I think I'll try to move these import/export changes to an extension.
The only precondition is this patch:
(change old image timestamps to include their own version, not the archiving timestamp)
Most of the other changes can easily be extracted to an extension. I think it will be simpler to install and use, and it won't require a long review process for my codebomb :)