Last modified: 2014-11-17 09:43:04 UTC
dumpHTML.php generates filenames that
* are Unicode-encoded, which is not well supported by all tools
* are very long
This is not very critical as long as the dump stays on a hard disk (filesystem),
but it is very problematic if you want to put the dump on a CD-ROM or DVD-ROM,
and that is exactly what we are trying to do now.
In my opinion the generated filenames should be ISO 9660 Level 3 compliant,
as that is the most widely supported filesystem.
To solve the problem, I propose to save each article under a truncated version of
the MD5 hash of its title.
Almost the same problem exists with pictures/media files.
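The idea can be sketched as follows (an illustrative Python sketch only; the actual patch is PHP inside dumpHTML.inc, and the truncation length used here is an assumption):

```python
import hashlib

def iso9660_name(title: str, ext: str = "html", length: int = 16) -> str:
    """Map an arbitrary (possibly Unicode) page title to a short,
    ASCII-only filename by truncating the MD5 hex digest of the title.
    The hex digest contains only [0-9a-f], so the result is safe on
    ISO 9660 filesystems; the file extension is kept recognizable."""
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    return f"{digest[:length]}.{ext}"

# Unicode titles map to plain ASCII names of fixed length:
print(iso9660_name("Begrüßungsbox"))
print(iso9660_name("Zuständigkeiten"))
```

Since MD5 is deterministic, the same title always maps to the same filename, so internal links can be rewritten consistently across the whole dump.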
(In reply to comment #0)
> dumpHTML.php generates filenames that are Unicode-encoded, which is not critical as long as the dump is on a hard disk (filesystem),
> but is problematic if you want to put the dump on a CD-ROM or DVD-ROM ...
> To solve the problem, I propose to save each article under a truncated version of
> the MD5 hash of its title. Almost the same problem exists with picture/media files.
I developed such a modified version of DumpHTML, which creates snapshots whose article and picture/media filenames use MD5-hashed names only. All links and URLs are MD5-hashed versions of the original (Unicode) filenames.
Snapshots were burnt onto DVDs and tested successfully on different operating systems (Windows 2000, Windows XP, Linux SUSE 11.0).
The diff to the current DumpHTML checkout will be posted soon.
Created attachment 5248 [details]
difference to dumpHTML-MW1.12-r30339.inc
The attachment solves that problem: it lets the dumpHTML.inc module encode links and local filenames of articles, images, thumbnail images and media files with the MD5 hash of the original filename. This allows snapshots to be stored on CD/DVD filesystems. Resulting snapshots on DVD have been successfully checked on Windows 2000, Windows XP and Linux SUSE 11.0 systems.
I can post the whole dumpHTML extension on request.
Created attachment 5278 [details]
new version 2.11 (diff to dumpHTML-MW1.12-r30339.inc)
New version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) fixes a small problem with articles and/or image/media files that have slashes in their names.
(In reply to comment #3)
> New version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) fixes a small problem with
> articles and/or image/media files that have slashes in their names.
I prepared a tgz of the original and modified dumpHTML files, including the diff, which is available at http://www.tgries.de/mediawiki/dumpHTML-v2.11.tgz
Is this patch integrated into trunk?
(In reply to comment #5)
> Is this patch integrated into trunk?
No. I am not a developer and do not have SVN access. Since publishing https://bugzilla.wikimedia.org/attachment.cgi?id=5278 I haven't noticed any problems using it for several different dumps. It looks stable.
The patch is buggy; I can't apply it to my files.
Created attachment 5810 [details]
Clean patch for r47214
Well, I made a new patch for the trunk version; maybe somebody could commit it to SVN.
In my wiki, people use the name of the dumped file to figure out which page the file corresponds to, so using hashed filenames would be bad. Since we generate files on Windows, though, we do end up filtering out characters that aren't appropriate for that OS with a regex. ASCII transliteration would probably work too. Regardless, if this is included, please make it optional.
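The filtering approach described here might look roughly like this (a Python sketch; the character class is the standard set of characters Windows forbids in filenames, not necessarily the exact regex that wiki uses):

```python
import re

# Characters Windows forbids in filenames: < > : " / \ | ? *
# plus ASCII control characters. Illustrative sketch only.
_WINDOWS_BAD = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

def windows_safe(title: str, replacement: str = "_") -> str:
    """Replace characters that are invalid in Windows filenames,
    leaving everything else (including non-ASCII letters) intact."""
    return _WINDOWS_BAD.sub(replacement, title)

print(windows_safe('What? A "test": yes/no'))
```

Unlike hashing, this keeps the filename human-readable, at the cost of possible collisions when two titles differ only in filtered characters.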
(In reply to comment #9)
> In my wiki, people use the name of the dumped file to figure out which page the
> file corresponds to, so using hashed filenames would be bad. Since we
> generate files on Windows, though, we do end up filtering out characters that
> aren't appropriate for that OS with a regex. ASCII transliteration would
> probably work too. Regardless, if this is included, please make it optional.
Please feel free to present a better solution; "filtering out" non-ASCII characters may not be the best approach, as it introduces at least some irregularities. I admit it helps with guessing filenames, but that was not a requirement in the first place (how often do your users access your MediaWiki articles by modifying the URL?).
Working with many different systems (Windows, Linux, ISO filesystems on CD/DVD), I found the "hash" solution a robust one (programmed in reasonable time) for storing all pages and files reliably on different media.
The original (official) DumpHTML by Tim appeared not to work across different file systems (it works fine on Linux servers): when you copy the created dumps between Linux, DVD and Windows, for example, you quickly encounter problems with non-ASCII page and image filenames, such as the umlauts in "Begrüßungsbox".
Perhaps Tim can be motivated to present a robust solution which fits all needs.
Our users actually find the dumped files via a search engine (which I have no control over) that displays the file name to users as the page title. Our page titles are also all in English, which helps. Regardless, I'm not saying you should change everything to suit my edge case. I'm just saying that if you implement your hashed solution, make it something that can be turned off via a configuration option, so that today's functionality is still available.
(In reply to comment #11)
> Our users actually find the dumped files via a search engine (which I have no
> control over) which displays the file name to users as the page title. Our
> page titles are also all in English which helps.
The _page titles_ are preserved: the "hash" solution does not touch page titles; the <title> tag content is always kept. Only the last part of the URL (the file _name_) is changed, and file extensions are also preserved (html, jpg, png, gif, doc and so on).
Does the patch work for anyone?
I tried the current dumpHTML version from SVN (30 Nov 2010) and also r47214 with the patch applied, but I always end up like this:
* when not applying the patch, I have problems with pages that have umlauts in the title, e.g. "Zuständigkeiten"; everything else seems fine
* when applying the patch, I get "PHP Warning: urldecode() expects parameter 1 to be string, object given in [...]/dumpHTML.inc on line 18" and the dump is completely broken.
I generate the dump on CentOS 5 with PHP 5.1.6 and MediaWiki v1.13.2, zipped it and sent it to my Windows 7 / Windows 2003 boxes.
I didn't try the first patch because its revision number seems wrong.
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
DaSch, if you have time to update your patch to work with current trunk, that would be neat.
DaSch, I'm sorry for the wait in response! Thank you for the patch.
If this issue is still something that you'd like to follow up on, take a look
at our current codebase and consider updating and submitting your patch
directly into our new Git source control system.
You can do this by getting and using "developer access"
Thanks again, and I apologize for the wait.
The munging strategy can be configured with a new --munge-title argument. I tried not to fix any bugs with this patch ;) so the default munge algorithm should be the same as previous behavior. The "md5" munge uses T. Gries's patch above, and the "windows" munge exposes some inaccessible code from the "getFriendly..." method.
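The dispatch behind such a --munge-title argument can be sketched as follows (a Python illustration; the function names and the "none" default are assumptions, and the real implementation is PHP in MungeTitle.inc):

```python
import hashlib
import re

def munge_none(title: str) -> str:
    # Previous default behavior: leave the name untouched.
    return title

def munge_md5(title: str) -> str:
    # MD5-hash strategy, as in T. Gries's patch.
    return hashlib.md5(title.encode("utf-8")).hexdigest()

def munge_windows(title: str) -> str:
    # Filter out characters Windows forbids in filenames.
    return re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title)

MUNGERS = {"none": munge_none, "md5": munge_md5, "windows": munge_windows}

def get_munger(name: str):
    """Look up a munge strategy by name, failing loudly on unknown names."""
    try:
        return MUNGERS[name]
    except KeyError:
        raise Exception("no such titlemunger exists: %s" % name)
```

Passing anything other than a known strategy name (or, as in the report below, a mis-parsed argument) would trigger the "no such titlemunger exists" failure path.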
(In reply to comment #17)
> The munging strategy can be configured with a new --munge-title argument. I
> tried not to fix any bugs with this patch ;) so the default munge algorithm
> should be the same as previous behavior. The "md5" munge uses T. Gries's patch
> above, and the "windows" munge exposes some inaccessible code from the
> "getFriendly..." method.
Using --munge-title windows (or any other option) I get this error:
Unexpected non-MediaWiki exception encountered, of type "Exception" exception 'Exception' with message 'no such titlemunger exists: 1' in /dir/w/extensions/DumpHTML/MungeTitle.inc:18 Stack trace:
0 /dir/w/extensions/DumpHTML/dumpHTML.inc(92): MungeTitle->__construct(1)
1 /dir/w/extensions/DumpHTML/dumpHTML.php(132): DumpHTML->__construct(Array)
Thanks for the report! The argument processing should be fixed in r115629.
With the git head of DumpHTML and MediaWiki 1.19.2 on an ext4 filesystem on Ubuntu 12.10, there is some encoding issue that sprinkles "2F" (the character code of the forward slash) into my image src URLs and filenames. This is without using the munge parameter, as I want to use an existing local image mirror.
sudo /usr/bin/php /var/lib/mediawiki/extensions/DumpHTML/dumpHTML.php -d
results in links like:
With the last commit before the munge parameter everything is fine.
I downloaded DumpHTML on a Chinese Windows OS and ran:
php D:\A\extensions\DumpHTML\dumpHTML.php -d d:\wikidump -k monobook --image-snapshot --force-copy --munge-title windows
but the images are not in the proper folders.
D:\wikidump2\articles\文\件\7E\文件~Jr01.gif.html can be opened, but the picture is not visible.
The picture URL is D:\wikidump2\images\4\42\Jr01.gif, which cannot be opened. When I search for Jr01.gif, it turns up in the folder D:\wikidump2\images\4\_\4.
What is wrong?
badhot: Could you ask on https://www.mediawiki.org/wiki/Project:Support_desk for support requests? Thanks!
[ASSIGNED status since comment 8 in 2009; obviously not the case. Resetting.]