Last modified: 2014-11-17 09:43:04 UTC

Bug 8147 - Filenames in the HTML snapshot by extension dumpHTML
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: DumpHTML
Version: unspecified
Hardware: All All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: DaSch
URL: http://www.mediawiki.org/wiki/Extensi...
Keywords: patch, patch-need-review
Reported: 2006-12-04 11:34 UTC by Kelson [Emmanuel Engelhart]
Modified: 2014-11-17 09:43 UTC (History)


Attachments
difference to dumpHTML-MW1.12-r30339.inc (3.15 KB, patch)
2008-09-01 08:36 UTC, T. Gries
new version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) (3.33 KB, patch)
2008-09-03 11:41 UTC, T. Gries
Clean patch for r47214 (3.26 KB, patch)
2009-02-13 14:58 UTC, DaSch

Description Kelson [Emmanuel Engelhart] 2006-12-04 11:34:53 UTC
dumpHTML.php generates filenames which
* are Unicode-encoded, which not every tool supports well
* are very long

This is not critical as long as the dump stays on a hard-disk filesystem,
but it is very problematic if you want to put the dump on a CD-ROM or DVD-ROM,
which is exactly what we are trying to do now.

In my opinion the generated filenames should be ISO 9660 Level 3 compliant,
since that is the most widely supported filesystem.

To solve the problem, I propose saving each article under a truncated version of the
MD5 hash of its title.

Almost the same problem exists with pictures/media files.
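
As an illustration of the proposal above, here is a minimal sketch in Python rather than in the extension's own PHP. The function names, the 16-character truncation and the ".html" suffix are assumptions made for the example; they are not taken from dumpHTML. Because truncating a hash raises the chance of two titles sharing a filename, the sketch also refuses silent collisions.

```python
import hashlib


def hashed_filename(title: str, length: int = 16, ext: str = ".html") -> str:
    """Return a short ASCII filename derived from the MD5 hash of a page title.

    `length` and `ext` are illustrative defaults, not values used by dumpHTML.
    """
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()  # 32 hex characters
    return digest[:length] + ext


def build_filename_map(titles):
    """Map each title to its hashed filename and fail loudly on a collision."""
    owner = {}    # filename -> title that first claimed it
    mapping = {}  # title -> filename
    for title in titles:
        name = hashed_filename(title)
        if name in owner and owner[name] != title:
            raise ValueError(f"hash collision: {owner[name]!r} and {title!r}")
        owner[name] = title
        mapping[title] = name
    return mapping


if __name__ == "__main__":
    for title, name in build_filename_map(["Main Page", "Begrüßungsbox"]).items():
        print(f"{title} -> {name}")
```

The resulting names contain only the characters 0-9, a-f and a dot, so they no longer depend on the encoding of the filesystem; the length to truncate to would have to be chosen against the limits of the target medium.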
Comment 1 T. Gries 2008-08-28 22:21:45 UTC
(In reply to comment #0)
> The dumpHTML.php generates filenames which are UNICODE encoded, which is not critical, as long as the dump is on a Hard-disk (filesystem) ;
> but that is problematic if you want to put the dump on a CD-ROM or DVD-ROM ...
> To solve the problem I propose to save the article in a truncated version of the
> Md5 hash of each title. Almost the same problem exists with pictures/media files.

> ... propose to save the article in a truncated version of the Md5 hash of each title.

I developed such a modified version of DumpHTML which creates snapshots with filenames of articles and picture/media files using MD5-hashed filenames only. All links and URLs are MD5 hashed versions of the original (Unicode) filenames. 

Snapshots were burnt onto DVDs and tested successfully on different operating systems (Windows 2000, Windows XP, Linux SUSE 11.0).

The diff to the current DumpHTML checkout will be posted soon.
Comment 2 T. Gries 2008-09-01 08:36:16 UTC
Created attachment 5248 [details]
difference to dumpHTML-MW1.12-r30339.inc

The attachment solves that problem: it lets the dumpHTML.inc module encode links and local filenames of articles, images, thumbnail images and media files with the MD5 hash of the original filename. This allows snapshots to be stored on CD/DVD filesystems. The resulting snapshots on DVD have been successfully checked on Windows 2000, Windows XP and Linux SUSE 11.0 systems.

I can post the whole dumpHTML extension on request.
Comment 3 T. Gries 2008-09-03 11:41:37 UTC
Created attachment 5278 [details]
new version 2.11 (diff to dumpHTML-MW1.12-r30339.inc)

New version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) fixes a small problem with articles and/or image/media files that have slashes in their names.
Comment 4 T. Gries 2008-09-03 16:48:55 UTC
(In reply to comment #3)
> New version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) fixes a small problem for
> articles and/or image/media files with slashes in it.

I prepared a tgz of the original and modified dumpHTML files including the diff which is available on http://www.tgries.de/mediawiki/dumpHTML-v2.11.tgz

Comment 5 Kelson [Emmanuel Engelhart] 2008-09-21 11:58:21 UTC
Is this patch integrated in trunk?
Comment 6 T. Gries 2008-09-21 12:06:08 UTC
(In reply to comment #5)
> Is this patch integrated in trunk ?
No. I am not a developer and do not have SVN access. Since I published https://bugzilla.wikimedia.org/attachment.cgi?id=5278 I haven't noticed any problems while using it for several different dumps. It looks stable.
Comment 7 DaSch 2009-02-13 14:50:52 UTC
The patch is buggy; I can't apply it to my files.
Comment 8 DaSch 2009-02-13 14:58:03 UTC
Created attachment 5810 [details]
Clean patch for r47214

Clean patch for r47214

Well, I made a new patch for the trunk version; maybe somebody could commit it to SVN.
Comment 9 Christian Neubauer 2009-12-28 13:30:48 UTC
In my wiki, people use the name of the dumped file to figure out what page the file corresponds to, so using hashed filenames would be bad.  Since we generate files on Windows though, we do end up filtering out characters that aren't appropriate for that OS with a regex.  ASCII transliteration would probably work too.  Regardless, if this is included, please make it optional.
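
For illustration only, the two alternatives mentioned here (regex filtering and ASCII transliteration) might look roughly like this in Python. This is a sketch under assumptions: the function names and the "_" replacement are invented for the example and are not the regex actually used on that wiki or anything in DumpHTML.

```python
import re
import unicodedata

# Characters that Windows does not allow in file names, plus ASCII control characters.
WINDOWS_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def filter_windows(name: str, replacement: str = "_") -> str:
    """Replace characters that are not valid in Windows file names."""
    return WINDOWS_FORBIDDEN.sub(replacement, name)


def transliterate_ascii(name: str) -> str:
    """Decompose accented letters and drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(filter_windows("Talk:What?"))            # Talk_What_
    print(transliterate_ascii("Zuständigkeiten"))  # Zustandigkeiten
    print(transliterate_ascii("Begrüßungsbox"))    # Begruungsbox
```

Both variants keep names readable, but neither is reversible, and two different titles can end up with the same filename. The last example also shows a character being dropped entirely: "ß" has no Unicode decomposition, so it simply disappears.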
Comment 10 T. Gries 2009-12-28 16:48:40 UTC
(In reply to comment #9)
> In my wiki, people use the name of the dumped file to figure out what page the
> file corresponds too, so using hashed filenames would be bad.  Since we
> generate files on Windows though, we do end up filtering out characters that
> aren't appropriate for that OS with a regex.  ASCII transliteration would
> probably work too.  Regardless, if this in included, please make it optional.
Please feel free to present a better solution. "Filtering out" non-ASCII characters may not be the best approach, as it introduces irregularities. I admit it helps users guess filenames, but that was not a requirement in the first place (how often do your users access your MediaWiki articles by modifying the URL?).

Working with many different systems (Windows, Linux, ISO file systems on CD/DVD), I found the "hash" solution to be a robust one (programmed in reasonable time) for storing all pages and files reliably on different media.

The original (official) DumpHTML by Tim does not appear to work across different file systems (it works fine on Linux servers). When you copy the created dumps between Linux, DVD and Windows, for example, you quickly run into problems with non-ASCII page and image filenames, such as those containing umlauts like "Begrüßungsbox".

Perhaps Tim can be motivated to present a robust solution which fits all needs.
Comment 11 Christian Neubauer 2009-12-28 17:54:04 UTC
Our users actually find the dumped files via a search engine (which I have no control over) which displays the file name to users as the page title.  Our page titles are also all in English which helps.  Regardless, I'm not saying you should change everything to suit my edge case.  I'm just saying if you implement your hashed solution, make it something that can be turned off via a configuration option so that you can still get today's functionality.
Comment 12 T. Gries 2009-12-28 22:32:26 UTC
(In reply to comment #11)
> Our users actually find the dumped files via a search engine (which I have no
> control over) which displays the file name to users as the page title.  Our
> page titles are also all in English which helps.

The _page titles_ are preserved: the "hash" solution does not touch them, and the <title> tag content is always kept. Only the last part of the URL (the file _name_ part) is changed; file extensions are also preserved (html, jpg, png, gif, doc and so on).
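
A rough Python sketch of the behaviour described here, where only the file name part is hashed while the directory and the extension stay as they are. The function name `munge_path` is hypothetical and the code is not taken from the patch.

```python
import hashlib
import posixpath


def munge_path(path: str) -> str:
    """Hash only the file name part of a path; keep directory and extension."""
    directory, filename = posixpath.split(path)
    stem, ext = posixpath.splitext(filename)
    hashed = hashlib.md5(stem.encode("utf-8")).hexdigest()
    return posixpath.join(directory, hashed + ext)


if __name__ == "__main__":
    print(munge_path("articles/Begrüßungsbox.html"))
    print(munge_path("images/d/d7/Lager_beer_in_glass.jpg"))
```

Because the mapping is deterministic, a link and the file it points to can be rewritten independently and still agree, as long as both go through the same function.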
Comment 13 stfnmstr 2010-11-30 11:08:38 UTC
Does the patch work for anyone?

I tried the current dumpHTML version from svn (30.Nov.2010) and also r47214 with the patch applied, but I always end up with the following:

* Without the patch, I have problems with pages that have umlauts in the title, e.g. "Zuständigkeiten". Everything else seems fine.

* With the patch applied, I get "PHP Warning: urldecode() expects parameter 1 to be string, object given in [...]/dumpHTML.inc on line 18" and the dump is completely broken.

I generated the dump on CentOS 5 with PHP 5.1.6 and MediaWiki v1.13.2, zipped it and sent it to my Windows 7 / Windows 2003 boxes.

I didn't try the first patch because the revision number seems wrong.
Comment 14 p858snake 2011-04-30 00:09:10 UTC
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
Comment 15 Sumana Harihareswara 2011-10-07 19:20:21 UTC
DaSch, if you have time to update your patch to work with current trunk, that would be neat.
Comment 16 Sumana Harihareswara 2012-05-23 20:00:44 UTC
DaSch, I'm sorry for the wait in response!  Thank you for the patch.

If this issue is still something that you'd like to follow up on, take a look
at our current codebase and consider updating and submitting your patch
directly into our new Git source control system.

https://www.mediawiki.org/wiki/Git/Workflow

You can do this by getting and using "developer access"

https://www.mediawiki.org/wiki/Developer_access

Thanks again, and I apologize for the wait.
Comment 17 Adam Wight 2012-07-08 05:02:58 UTC
http://www.mediawiki.org/wiki/Special:Code/MediaWiki/115597

The munging strategy can be configured with a new --munge-title argument.  I tried not to fix any bugs with this patch ;) so the default munge algorithm should be the same as previous behavior.  The "md5" munge uses T. Gries's patch above, and the "windows" munge exposes some inaccessible code from the "getFriendly..." method.
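
To make the idea of a selectable munging strategy concrete, here is a small Python sketch. It is not the code from the revision linked above: the names `MUNGERS`, `get_munger` and `parse_args`, the "none" default, and the body of the "windows" strategy are all assumptions made for illustration; only the option name --munge-title and the strategy names "md5" and "windows" come from this comment.

```python
import argparse
import hashlib
import re


def munge_none(title: str) -> str:
    """Leave the title untouched (stand-in for the previous default behaviour)."""
    return title


def munge_md5(title: str) -> str:
    """Replace the title with the MD5 hash of its UTF-8 encoding."""
    return hashlib.md5(title.encode("utf-8")).hexdigest()


def munge_windows(title: str) -> str:
    """Replace characters Windows rejects in file names (illustrative rule only)."""
    return re.sub(r'[<>:"/\\|?*]', "_", title)


# Registry of available strategies, keyed by the command-line value.
MUNGERS = {"none": munge_none, "md5": munge_md5, "windows": munge_windows}


def get_munger(name):
    """Look up a strategy by name and reject unknown names explicitly."""
    try:
        return MUNGERS[name]
    except KeyError:
        raise ValueError(f"no such title munger: {name!r}") from None


def parse_args(argv):
    """Parse the command line; the option takes the strategy name as its value."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--munge-title", default="none")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--munge-title", "windows"])
    print(get_munger(args.munge_title)("Talk:What?"))
```

Keeping the strategies in one lookup table means the default stays unchanged unless the option is given, and a mistyped or mis-parsed value fails immediately instead of producing a half-munged dump.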
Comment 18 Daniel Shirley 2012-07-28 08:16:32 UTC
(In reply to comment #17)
> http://www.mediawiki.org/wiki/Special:Code/MediaWiki/115597
> 
> The munging strategy can be configured with a new --munge-title argument.  I
> tried not to fix any bugs with this patch ;) so the default munge algorithm
> should be the same as previous behavior.  The "md5" munge uses T. Gries's patch
> above, and the "windows" munge exposes some inaccessible code from the
> "getFriendly..." method.

Using --munge-title windows or any other option, I get this error:

Unexpected non-MediaWiki exception encountered, of type "Exception" exception 'Exception' with message 'no such titlemunger exists: 1' in /dir/w/extensions/DumpHTML/MungeTitle.inc:18 Stack trace:

    0 /dir/w/extensions/DumpHTML/dumpHTML.inc(92): MungeTitle->__construct(1)
    1 /dir/w/extensions/DumpHTML/dumpHTML.php(132): DumpHTML->__construct(Array)
    2 {main}
Comment 19 Adam Wight 2012-07-28 21:37:54 UTC
Thanks for the report!  The argument processing should be fixed in r115629.
Comment 20 Jason Skomorowski 2012-12-15 17:13:45 UTC
With git head of dumpHTML and MediaWiki 1.19.2 on an ext4 filesystem on Ubuntu 12.10, there is an encoding issue that sprinkles "2F" (the hexadecimal code of the forward slash, as in the URL escape %2F) into my image src URLs and filenames. This happens without the munge parameter, which I leave off because I want to use an existing local image mirror.

sudo /usr/bin/php /var/lib/mediawiki/extensions/DumpHTML/dumpHTML.php -d /s/wikidumptest --image-snapshot

results in links like:

file:///s/wikidumptest/images/thumb2F/d/2F//d/d7/Lager_beer_in_glass.jpg/180px-Lager_beer_in_glass.jpg

With the last commit before the munge parameter everything is fine.
Comment 21 badhot 2013-06-07 07:13:03 UTC
I downloaded DumpHTML on a Chinese-language Windows system and ran:

php D:\A\extensions\DumpHTML\dumpHTML.php -d d:\wikidump -k monobook --image-snapshot --force-copy --munge-title windows

but the images are not in the proper folder.

D:\wikidump2\articles\文\件\7E\文件~Jr01.gif.html opens, but the picture is not displayed.
The picture URL is D:\wikidump2\images\4\42\Jr01.gif, which cannot be opened. When I search for Jr01.gif, I find it in the folder D:\wikidump2\images\4\_\4.

What is wrong?
Comment 22 Andre Klapper 2013-06-07 10:25:15 UTC
badhot: Could you ask on https://www.mediawiki.org/wiki/Project:Support_desk for support requests? Thanks!
Comment 23 Andre Klapper 2014-02-18 13:16:04 UTC
[ASSIGNED status since comment 8 in 2009; obviously not the case. Resetting.]
