Last modified: 2014-11-17 09:43:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T10147, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 8147 - Filenames in the HTML snapshot by extension dumpHTML


Summary:	Filenames in the HTML snapshot by extension dumpHTML

Status:	NEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	DumpHTML (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	DaSch

URL:	http://www.mediawiki.org/wiki/Extensi...
Whiteboard:
Keywords:	patch, patch-need-review

Depends on:
Blocks:

Reported:	2006-12-04 11:34 UTC by Kelson [Emmanuel Engelhart]
Modified:	2014-11-17 09:43 UTC
CC List:	11 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
difference to dumpHTML-MW1.12-r30339.inc (3.15 KB, patch) 2008-09-01 08:36 UTC, T. Gries	Details
new version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) (3.33 KB, patch) 2008-09-03 11:41 UTC, T. Gries	Details
Clean patch for r47214 (3.26 KB, patch) 2009-02-13 14:58 UTC, DaSch	Details

Description Kelson [Emmanuel Engelhart] 2006-12-04 11:34:53 UTC

dumpHTML.php generates filenames which
* are Unicode-encoded, which not every tool supports well
* are very long

This is not critical as long as the dump stays on a hard disk filesystem,
but it is a real problem if you want to put the dump on a CD-ROM or DVD-ROM,
which is exactly what we are trying to do now.

In my opinion the generated filenames should be ISO 9660 Level 3 compliant,
since ISO 9660 is the most widely supported filesystem.

To solve the problem, I propose saving each article under a truncated
MD5 hash of its title.

Almost the same problem exists with pictures/media files.
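
The truncated-hash proposal can be sketched as follows. This is a minimal Python illustration, not DumpHTML's PHP code: the function names `hashed_filename` and `find_collisions`, the 16-character default and the `html` suffix are assumptions made for the example. Truncating the digest shortens the names but raises the chance that two titles map to the same file, so the sketch also reports collisions.

```python
import hashlib
from collections import defaultdict


def hashed_filename(title: str, length: int = 16, ext: str = "html") -> str:
    """Return an ASCII-only filename built from a truncated MD5 of the title."""
    if not 1 <= length <= 32:
        raise ValueError("an MD5 hex digest has 32 characters")
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    return f"{digest[:length]}.{ext}"


def find_collisions(titles, length: int = 16):
    """Map every filename shared by more than one distinct title to those titles."""
    by_name = defaultdict(list)
    for title in set(titles):
        by_name[hashed_filename(title, length)].append(title)
    return {name: group for name, group in by_name.items() if len(group) > 1}


if __name__ == "__main__":
    name = hashed_filename("Begrüßungsbox")
    print(name, len(name))
    print(find_collisions(["Begrüßungsbox", "Zuständigkeiten"]))
```

How short the hash can safely be depends on the number of pages in the wiki; running a collision check like the one above over all titles before writing files is one way to find out.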

Comment 1 T. Gries 2008-08-28 22:21:45 UTC

(In reply to comment #0)
> The dumpHTML.php generates filenames which are UNICODE encoded, which is not critical, as long as the dump is on a Hard-disk (filesystem) ;
> but that is problematic if you want to put the dump on a CD-ROM or DVD-ROM ...
> To solve the problem I propose to save the article in a truncated version of the
> Md5 hash of each title. Almost the same problem exists with pictures/media files.

> ... propose to save the article in a truncated version of the Md5 hash of each title.

I developed such a modified version of DumpHTML which creates snapshots with filenames of articles and picture/media files using MD5-hashed filenames only. All links and URLs are MD5 hashed versions of the original (Unicode) filenames. 

Snapshots were burnt onto DVDs and tested successfully on different operating systems (Windows 2000, Windows XP, Linux SUSE 11.0).

The diff to the current DumpHTML checkout will be posted soon.

Comment 2 T. Gries 2008-09-01 08:36:16 UTC

Created attachment 5248 [details]
difference to dumpHTML-MW1.12-r30339.inc

The attachment solves the problem: it lets the dumpHTML.inc module encode links and local filenames of articles, images, thumbnail images and media files with the MD5 hash of the original filename. This allows snapshots to be stored on CD/DVD filesystems. The resulting snapshots on DVD have been successfully checked on Windows 2000, Windows XP and Linux SUSE 11.0 systems.

I can post the whole dumpHTML extension on request.

Comment 3 T. Gries 2008-09-03 11:41:37 UTC

Created attachment 5278 [details]
new version 2.11 (diff to dumpHTML-MW1.12-r30339.inc)

New version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) fixes a small problem with articles and image/media files whose names contain slashes.

Comment 4 T. Gries 2008-09-03 16:48:55 UTC

(In reply to comment #3)
> New version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) fixes a small problem for
> articles and/or image/media files with slashes in it.

I prepared a tgz of the original and modified dumpHTML files including the diff which is available on http://www.tgries.de/mediawiki/dumpHTML-v2.11.tgz

Comment 5 Kelson [Emmanuel Engelhart] 2008-09-21 11:58:21 UTC

Is this patch integrated in trunk ?

Comment 6 T. Gries 2008-09-21 12:06:08 UTC

(In reply to comment #5)
> Is this patch integrated in trunk ?
No. I am not a developer and do not have SVN access. Since the publication of the https://bugzilla.wikimedia.org/attachment.cgi?id=5278 I haven't noticed any problems using it for several different dumps. Looks stable.

Comment 7 DaSch 2009-02-13 14:50:52 UTC

The patch is buggy; I can't apply it to my files.

Comment 8 DaSch 2009-02-13 14:58:03 UTC

Created attachment 5810 [details]
Clean patch for r47214

Clean patch for r47214

Well, I made a new patch for the trunk version; maybe somebody could commit it to SVN.

Comment 9 Christian Neubauer 2009-12-28 13:30:48 UTC

In my wiki, people use the name of the dumped file to figure out what page the file corresponds to, so using hashed filenames would be bad.  Since we generate files on Windows though, we do end up filtering out characters that aren't appropriate for that OS with a regex.  ASCII transliteration would probably work too.  Regardless, if this is included, please make it optional.
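
The two alternatives mentioned here, regex filtering and ASCII transliteration, might look roughly like this. It is a Python sketch under assumptions: the character set in the regex is the commonly documented list of characters Windows rejects in file names, and the helper names and the `_` replacement are invented for the example; the regex actually used on that wiki is not shown in this report.

```python
import re
import unicodedata

# Characters Windows does not allow in file names, plus ASCII control characters.
_WINDOWS_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def strip_windows_forbidden(name: str, replacement: str = "_") -> str:
    """Replace every character that Windows rejects in a file name."""
    return _WINDOWS_FORBIDDEN.sub(replacement, name)


def to_ascii(name: str) -> str:
    """Crude transliteration: decompose accented letters, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(strip_windows_forbidden("Talk:Main/Sub?"))
    print(to_ascii("Zuständigkeiten"))
    print(to_ascii("Begrüßungsbox"))
```

Both approaches keep names readable but are lossy: with this transliteration "ß" has no decomposition and is simply dropped, and two different titles can end up with the same filename.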

Comment 10 T. Gries 2009-12-28 16:48:40 UTC

(In reply to comment #9)
> In my wiki, people use the name of the dumped file to figure out what page the
> file corresponds to, so using hashed filenames would be bad.  Since we
> generate files on Windows though, we do end up filtering out characters that
> aren't appropriate for that OS with a regex.  ASCII transliteration would
> probably work too.  Regardless, if this is included, please make it optional.
Please feel free to present a better solution. "Filtering out" non-ASCII characters may not be the best approach, as it introduces irregularities of its own. I admit it helps to guess filenames, but that was not a requirement in the first place (how often do your users access your MediaWiki articles by modifying the URL?).

Working with many different systems (Windows, Linux, ISO file systems on CD/DVD), I found the "hash" solution a robust one (programmed in reasonable time) for storing all pages and files reliably on different media.

The original (official) DumpHTML by Tim did not appear to work across different file systems (it works fine on Linux servers). When you copy the created dumps between Linux, DVD and Windows, for example, you quickly run into problems with non-ASCII page and image filenames, such as the umlauts in "Begrüßungsbox".

Perhaps Tim can be motivated to present a robust solution which fits all needs.

Comment 11 Christian Neubauer 2009-12-28 17:54:04 UTC

Our users actually find the dumped files via a search engine (which I have no control over) which displays the file name to users as the page title.  Our page titles are also all in English which helps.  Regardless, I'm not saying you should change everything to suit my edge case.  I'm just saying if you implement your hashed solution, make it something that can be turned off via a configuration option so that you can still get today's functionality.

Comment 12 T. Gries 2009-12-28 22:32:26 UTC

(In reply to comment #11)
> Our users actually find the dumped files via a search engine (which I have no
> control over) which displays the file name to users as the page title.  Our
> page titles are also all in English which helps.

The page titles are preserved: the "hash" solution does not touch them, and the <title> tag content is always kept. Only the last part of the URL (the file name part) is changed. File extensions are also preserved (html, jpg, png, gif, doc and so on).
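
A minimal Python sketch of that behavior: only the last path segment is hashed and the extension is kept. The function name `hash_filename_part` is hypothetical, and hashing the name without its extension is an assumption; the patch itself may compute the hash over a different string.

```python
import hashlib
import posixpath


def hash_filename_part(url_path: str) -> str:
    """Hash only the file name part of a path; keep directories and extension."""
    directory, filename = posixpath.split(url_path)
    stem, ext = posixpath.splitext(filename)
    hashed = hashlib.md5(stem.encode("utf-8")).hexdigest()
    return posixpath.join(directory, hashed + ext)


if __name__ == "__main__":
    print(hash_filename_part("images/d/d7/Lager_beer_in_glass.jpg"))
    print(hash_filename_part("articles/Zuständigkeiten.html"))
```

Because the page content, including the <title> element, is not rewritten, anything that reads the title from the HTML still sees the original text; a tool that shows the file name instead would see the hash.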

Comment 13 stfnmstr 2010-11-30 11:08:38 UTC

Does the patch work for anyone?

I tried the current dumpHTML version from SVN (30 Nov 2010) and also r47214 with the patch applied, but I always end up with the following:

* when not applying the patch, I have problems with pages with umlauts in the title, e.g. "Zuständigkeiten"; everything else seems fine

* when applying the patch, I get "PHP Warning: urldecode() expects parameter 1 to be string, object given in [...]/dumpHTML.inc on line 18" and the dump is completely broken.

I generated the dump on CentOS 5 with PHP 5.1.6 and MediaWiki v1.13.2, zipped it, and sent it to my Windows 7 / Windows 2003 boxes.

I didn't try the first patch because the revision number seems wrong.

Comment 14 p858snake 2011-04-30 00:09:10 UTC

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

Comment 15 Sumana Harihareswara 2011-10-07 19:20:21 UTC

DaSch, if you have time to update your patch to work with current trunk, that would be neat.

Comment 16 Sumana Harihareswara 2012-05-23 20:00:44 UTC

DaSch, I'm sorry for the wait in response!  Thank you for the patch.

If this issue is still something that you'd like to follow up on, take a look
at our current codebase and consider updating and submitting your patch
directly into our new Git source control system.

https://www.mediawiki.org/wiki/Git/Workflow

You can do this by getting and using "developer access"

https://www.mediawiki.org/wiki/Developer_access

Thanks again, and I apologize for the wait.

Comment 17 Adam Wight 2012-07-08 05:02:58 UTC

http://www.mediawiki.org/wiki/Special:Code/MediaWiki/115597

The munging strategy can be configured with a new --munge-title argument.  I tried not to fix any bugs with this patch ;) so the default munge algorithm should be the same as previous behavior.  The "md5" munge uses T. Gries's patch above, and the "windows" munge exposes some inaccessible code from the "getFriendly..." method.
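
The idea of a selectable munging strategy can be sketched like this. It is an illustrative Python analogue, not the code from r115597: the strategy names "md5" and "windows" and the `--munge-title` option come from this comment, while the name "default", the function bodies and the helpers `get_munger` and `parse_args` are assumptions for the example.

```python
import argparse
import hashlib
import re


def munge_default(title: str) -> str:
    # Stand-in for the pre-existing behavior, which this sketch does not reproduce.
    return title


def munge_md5(title: str) -> str:
    return hashlib.md5(title.encode("utf-8")).hexdigest()


def munge_windows(title: str) -> str:
    # Stand-in: replace characters that Windows rejects in file names.
    return re.sub(r'[<>:"/\\|?*]', "_", title)


MUNGERS = {"default": munge_default, "md5": munge_md5, "windows": munge_windows}


def get_munger(name):
    """Look up a strategy by name and fail loudly on anything unknown."""
    if name not in MUNGERS:
        raise ValueError(f"no such title munger: {name!r}")
    return MUNGERS[name]


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # The option takes a value; treating it as a bare on/off flag would lose the name.
    parser.add_argument("--munge-title", default="default")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--munge-title", "windows"])
    print(get_munger(args.munge_title)("Talk:Main"))
```

Keeping the old behavior as the default strategy means existing dumps are unaffected unless the option is passed explicitly.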

Comment 18 Daniel Shirley 2012-07-28 08:16:32 UTC

(In reply to comment #17)
> http://www.mediawiki.org/wiki/Special:Code/MediaWiki/115597
> 
> The munging strategy can be configured with a new --munge-title argument.  I
> tried not to fix any bugs with this patch ;) so the default munge algorithm
> should be the same as previous behavior.  The "md5" munge uses T. Gries's patch
> above, and the "windows" munge exposes some inaccessible code from the
> "getFriendly..." method.

Using --munge-title windows or any other option, I get this error:

Unexpected non-MediaWiki exception encountered, of type "Exception" exception 'Exception' with message 'no such titlemunger exists: 1' in /dir/w/extensions/DumpHTML/MungeTitle.inc:18 Stack trace:

    0 /dir/w/extensions/DumpHTML/dumpHTML.inc(92): MungeTitle->__construct(1)
    1 /dir/w/extensions/DumpHTML/dumpHTML.php(132): DumpHTML->__construct(Array)
    2 {main}

Comment 19 Adam Wight 2012-07-28 21:37:54 UTC

Thanks for the report!  The argument processing should be fixed in r115629.

Comment 20 Jason Skomorowski 2012-12-15 17:13:45 UTC

With git head of dumpHTML and MediaWiki 1.19.2 on an ext4 filesystem on Ubuntu 12.10, there is some encoding issue that sprinkles 2F (the hexadecimal code of the forward slash, as in the URL escape %2F) into my image src URLs and filenames. This is without using the munge parameter, as I want to use an existing local image mirror.

sudo /usr/bin/php /var/lib/mediawiki/extensions/DumpHTML/dumpHTML.php -d /s/wikidumptest --image-snapshot

results in links like:

file:///s/wikidumptest/images/thumb2F/d/2F//d/d7/Lager_beer_in_glass.jpg/180px-Lager_beer_in_glass.jpg

With the last commit before the munge parameter everything is fine.
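
For reference, the stray "2F" matches the hexadecimal byte value of "/", which is also what URL percent-encoding produces for a slash. The short Python check below only demonstrates that correspondence; the helper name `hex_code` is made up for the example, and whether DumpHTML actually percent-encodes path separators at this point is a guess, not a confirmed diagnosis.

```python
from urllib.parse import quote, unquote


def hex_code(ch: str) -> str:
    """Return the two-digit uppercase hexadecimal code of a single character."""
    return format(ord(ch), "02X")


if __name__ == "__main__":
    print(hex_code("/"))               # hexadecimal code of the slash
    print(quote("thumb/d", safe=""))   # percent-encode the slash as well
    print(unquote("thumb%2Fd"))        # decoding restores the separator
```

If a path separator were encoded once and then had its "%" dropped, or were never decoded again, the result would look like the "thumb2F" segment in the link above.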

Comment 21 badhot 2013-06-07 07:13:03 UTC

I downloaded DumpHTML on a Chinese-language Windows system and ran:
php D:\A\extensions\DumpHTML\dumpHTML.php -d d:\wikidump -k monobook --image-snapshot --force-copy --munge-title windows

but the images do not end up in the proper folder.

D:\wikidump2\articles\文\件\7E\文件~Jr01.gif.html opens, but the picture is not displayed.
The picture URL is D:\wikidump2\images\4\42\Jr01.gif, which cannot be opened. Searching for Jr01.gif finds the file in the folder D:\wikidump2\images\4\_\4.

What is wrong?

Comment 22 Andre Klapper 2013-06-07 10:25:15 UTC

badhot: Could you ask on https://www.mediawiki.org/wiki/Project:Support_desk for support requests? Thanks!

Comment 23 Andre Klapper 2014-02-18 13:16:04 UTC

[ASSIGNED status since comment 8 in 2009; obviously not the case. Resetting.]
