
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T27984, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 25984 - Isolate parser from database dependencies
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Importance: Normal enhancement with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: parser, patch, patch-need-review
Depends on:
Blocks: 26858

Reported: 2010-11-18 02:35 UTC by Andrew Dunbar
Modified: 2014-09-23 23:08 UTC
CC List: 7 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments
Patch for 0001-Make-MediaWiki-1.19-fetch-content-from-HTTP.patch (1.59 KB, patch), 2012-06-10 18:13 UTC, Ángel González
Patch to make MediaWiki obtain page sources from alternative locations (13.47 KB, patch), 2012-06-10 18:47 UTC, _Vi

Description Andrew Dunbar 2010-11-18 02:35:10 UTC
Many people need to parse wikitext, but due to its nature all attempts at alternative parsers have been incomplete or have failed utterly.

The only parser known to "correctly" parse wikitext is Parser.php, part of the MediaWiki source.

But it's not possible to use this parser in your own code or as a standalone PHP script, because it calls the database, directly or indirectly, for various things such as parser options (which may depend on a user) and the localisation cache.

It would be a good thing if it were possible for third parties, or even unit tests, to use the genuine MediaWiki parser without the need for a MediaWiki install and database import.

It should be possible to pass a string literal to the parser and get HTML back.
Comment 1 Bawolff (Brian Wolff) 2010-11-18 18:36:58 UTC
I don't think that'd be possible, since you need db access the moment you have wikitext with {{some template}} in it (to retrieve the template). Same with [[link]] (so the parser can figure out whether it should be a red link or not).

However, with that said, I just tried disabling all caches in my LocalSettings.php (along with the db credentials, so I know that no db access took place) and I was successfully able to parse a string literal from maintenance/eval.php as long as it didn't have any links or transclusions in it.


For reference, the LocalSettings.php I used in my test was:
<?php
$IP = dirname( __FILE__ );
$path = array( $IP, "$IP/includes", "$IP/languages" );
set_include_path( implode( PATH_SEPARATOR, $path ) . PATH_SEPARATOR . get_include_path() );
require_once( "$IP/includes/DefaultSettings.php" );
$wgReadOnly = true;
$wgMessageCacheType = CACHE_NONE;
$wgParserCacheType = CACHE_NONE;
$wgMainCacheType = CACHE_NONE;
$wgLocalisationCacheConf['storeClass'] = 'LCStore_Null';
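
For what it's worth, a rough sketch of the kind of thing one might then type at the maintenance/eval.php prompt with the settings above (1.16/1.19-era API; as noted, this only works for wikitext without links or transclusions):

// Parse a string literal to HTML using the global parser and default options.
$title = Title::newFromText( 'Sandbox' );
$options = new ParserOptions;
$output = $wgParser->parse( "'''Hello''', ''world''", $title, $options );
echo $output->getText();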
Comment 2 Max Semenik 2010-11-18 18:51:19 UTC
(In reply to comment #1)
> I don't think that'd be possible, since you need db access the moment you have
> wikitext with {{some template}} in it (to retrieve template). Same with
> [[link]] (so the parser can figure out if it should be a red link or not).

You can always abstract DB access with something like

interface WikiAccess {
   function getPageText( $title );
   function getPageExistence( $title );
   function getPageProps( $title );
}

The question is what will we achieve with this, because:

$ grep '\$wg' Parser.php | wc -l
    134

This bug should be titled "Get rid of parser's external dependencies". And how are we going to untangle it from, say, $wgContLang? It depends on half of MW.
Comment 3 Andrew Dunbar 2010-11-19 02:37:24 UTC
For getting templates and red/blue link info I suggest adding a layer of abstraction that the parser can call, rather than calling the database directly.

Parser users would then have the choice of implementing them or knowing not to try parsing template calls. In my case I have written code to do both existence checking and wikitext extraction directly from locally stored (and indexed) dump files.

ContentLang is harder but should probably also be abstracted, with an English default. The default strings could be automatically extracted to a text file included in the source tarball to make sure they're up to date. Parser users should be able to implement their own ContentLang equivalent as well.

One problem I've found with ContentLang is that it's not possible to instantiate one without a User. You either pass a user, or the default constructor seems to call the database anyway to get the language settings for the default user. This would also need to be abstracted, which should not be difficult in principle since MediaWiki already serves mostly anonymous users who are not logged in.

Getting rid of all external dependencies is probably a fair goal, but some might be fine. The first goal might be to get the Parser working from a MediaWiki tarball that has been unarchived but has not had its installer run, which is how I'm working on it on a machine with no web server or database software installed.
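
For concreteness, a rough sketch of a dump-backed implementation of the WikiAccess interface suggested in comment 2 (lookupInDumpIndex() and readFromDump() are placeholders for the indexed-dump code mentioned above, not existing MediaWiki functions):

class DumpWikiAccess implements WikiAccess {
   // Return the raw wikitext of a page, or false if it is not in the dump.
   function getPageText( $title ) {
      $offset = lookupInDumpIndex( $title );
      return $offset === false ? false : readFromDump( $offset );
   }
   // Existence check is what the parser needs for red/blue link colouring.
   function getPageExistence( $title ) {
      return lookupInDumpIndex( $title ) !== false;
   }
   // A plain XML dump carries no page_props data, so return nothing.
   function getPageProps( $title ) {
      return array();
   }
}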
Comment 4 Bawolff (Brian Wolff) 2010-11-20 02:07:53 UTC
>For getting templates and red/blue link info I suggest adding a layer of
>abstraction that the parser can call rather than calling directly to the
>database.

You could make your own custom db backend that recognizes certain queries and calls your thingy, but that kind of sounds insane.


>One problem I've found with ContentLang is it's not possible to instantiate one
>without a User. You either pass a user or the default constructor seems to call
>the database anyway to get the language settings for the default user

That doesn't seem right. $wgContLang (which is what I assume you're referring to) does not depend on the user's language pref. I'm doubtful that $wgLang hits the db for anon users. Furthermore I managed to do $wgContLang->commaList(array('foo', 'bar')); on my local install without accessing the db.

>the default strings could be automatically be extracted to a text file
>included in the source tarball to make sure they're up to date

$wgUseDatabaseMessages = false; does that

>Getting rid of all external dependencies is probably a fair goal but some might
>be fine.

I'm unconvinced that it'd be worth all the effort, given that it's not that beneficial to MediaWiki to do that (but I'm not planning to do these things, so it doesn't really matter if I see the benefit ;)

If you just want to make it work without installing db/apache/etc, you probably can make it work with just an "extension", but it'd be a bit "hacky"
Comment 5 Mark A. Hershberger 2011-02-11 18:29:20 UTC
Fixing this bug (isolating the parser) would solve so many problems.  Now, is it doable?  I think some of the best discussion I've seen yet is on this bug.
Comment 6 Bawolff (Brian Wolff) 2011-02-11 22:07:53 UTC
Which problems would that solve? It would make life easier for other folks who want to parse wikitext without MediaWiki (which would be very nice, but not exactly a super high priority in my mind). It also might be slightly cleaner architecturally, and it might help slightly with the "make a JS parser thingy so WYSIWYG thingies can be more easily implemented on the client side" goal, but not significantly, as the parser would still be written in PHP, so it couldn't just be plopped into a JS library. Other than that, I'm not exactly sure what this would solve.
Comment 7 Mark A. Hershberger 2011-02-19 02:35:36 UTC
(In reply to comment #6)
> Which problems would that solve?

It would make it easier for third parties to use, yes, but that isn't the point.  It would be easier to maintain and less "scary" for people to work on.

Maybe there is a limit to how much the parser can be isolated from MediaWiki; maybe it isn't 100% achievable. But achieving this isn't an ivory-tower goal. The point isn't simply "architectural cleanliness" but something far more pragmatic: maintainability.
Comment 8 _Vi 2012-06-09 01:05:32 UTC
Created a special hacky patch to use the MediaWiki parser without an actual database (https://github.com/vi/wiki_dump_and_read/blob/master/hacked_mediawiki/0001-Make-MediaWiki-1.19-fetch-content-from-HTTP.patch)

If the parser and database code were separated properly, it would have been simpler and less hacky.
Comment 9 Platonides 2012-06-09 12:19:02 UTC
Yes, it is hacky :)

Some ideas:
- Indent with tabs, not spaces.
- If you add a new global, it has to be defined in DefaultSettings
- Names like hackTriggered are fine for your code, but would carry no meaning if it were integrated upstream.
- Instead of downloading from a web server, load from the filesystem. Check for ../ attacks. (Ideally, there would be different classes depending if it was db-backed or filesystem-based)
- Wikipage::checkForDownloadingHack() should return itself the (cached) content, instead of manually doing the $this->downloadedContent
- No need to hack Parser::fetchTemplateAndTitle(), that can be redirected through setTemplateCallback() (see the sketch after this list).
- Why do you need to change EditPage, if you're not doing page editing?
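
For the setTemplateCallback() point, a rough sketch of how the redirection could look without touching Parser.php (myFetchTemplate() and fetchLocalWikitext() are placeholder names, not core functions; the return shape mirrors Parser::statelessFetchTemplate()):

// Callback invoked by the parser whenever it needs template wikitext.
function myFetchTemplate( $title, $parser = false ) {
	$text = fetchLocalWikitext( $title->getPrefixedDBkey() ); // your backend here
	return array(
		'text' => $text,        // false means "template does not exist"
		'finalTitle' => $title, // no redirect resolution in this sketch
		'deps' => array(),
	);
}

$options = new ParserOptions;
$options->setTemplateCallback( 'myFetchTemplate' );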
Comment 10 _Vi 2012-06-10 04:20:05 UTC
Actually it was not intended for merging into upstream; I implemented this primarily so I could grab online wikis, save them in compressed form and use a local MediaWiki to view them, without any lengthy indexing phase. The result is at https://github.com/vi/wiki_dump_and_read ("wikiexport" is also a hacky hack).

> Indent with tabs, not spaces.
Is there a patch-checking utility to catch this and other common problems, like on some other projects?

> If you add a new global, it has to be defined in DefaultSettings
OK.

> Names like hackTriggered are fine for your code, but would carry no meaning
> if it were integrated upstream.
As I don't know the MediaWiki internals, such attempts will always be hacky unless "Bug 25984" is really closed. I can rename things to something proper, but it will remain hacky. I usually explicitly mention the word "hack" to tell users "beware, something may be wrong here".

> Instead of downloading from a web server, load from the filesystem.
Maybe, but that is less flexible. The goal is to make it easy to connect MediaWiki to another source of page source code. The HTTP approach is portable; with the filesystem approach the only good way is FUSE.

> Check for ../ attacks.
That is the job of the server. https://github.com/vi/wiki_dump_and_read/blob/master/wikishelve_server.py just serves entries from Python's "shelve" database (a single file on the filesystem). And the whole thing is initially intended for local-only, read-only usage.

> Wikipage::checkForDownloadingHack() should return itself the (cached)
> content, instead of manually doing the $this->downloadedContent
Yes.

> No need to hack Parser::fetchTemplateAndTitle(), that can be redirected
> through setTemplateCallback().
I'm not a PHP/MediaWiki hacker => I just did the first thing I managed to get working.

> Why do you need to change EditPage, if you're not doing page editing?
To be able to view source (sometimes things get broken => can still view content in source form).


> Ideally, there would be different classes depending if it was
> db-backed or filesystem-based
I think creating a good class structure to support DBBackend, FilesystemBackend and HttpBackend is a step towards resolving "Bug 25984".


(Bumping this discussion was advised by a Freenode/#mediawiki user)
(Will report here again if/when I implement the improved fetch-from-HTTP patch)
Comment 11 Andrew Dunbar 2012-06-10 07:06:23 UTC
I would dearly love to see a version of this patch go upstream so that others who want to use the real live parser without a DB can see where to start.

Obviously having a proper abstraction layer between the parser and various DB / HTTP / filesystem backend classes is the best way, but since the dev team is busy with bigger projects, having (a cleaned-up version of) this starting point in the codebase will help people wanting 100% parse fidelity for offline viewers, data miners, etc.

Personally I've wanted an offline viewer that worked straight from the published dump files for years. (Other offline viewers like Kiwix need their own huge downloads in their own formats.) I had the code to extract the needed data from the dump files, but never succeeded with the next step of parser integration.

It was me who asked _Vi to participate in this discussion. He's the first one I bumped into over the years who has done some real work in this direction. Hack = prototype = as good a place to start as anywhere. (-:
Comment 12 Ángel González 2012-06-10 18:13:02 UTC
Created attachment 10720 [details]
Patch for 0001-Make-MediaWiki-1.19-fetch-content-from-HTTP.patch

_Vi is also processing to a different format. :)

Did you see http://wiki-web.es/mediawiki-offline-reader/ ?

They are not straight dumps from dumps.wikimedia.org, although the original idea was that they would eventually be processed like that, publishing the index file along with the xml dumps.
You could use the original files, treating them as a single bucket, but performance would be horrible with big dumps.

My approach was to use a new database type for reading the dumps, so it doesn't need an extra process or database.

Admittedly, it targeted the then-current MediaWiki 1.13, so it'd need an update in order to work with current MediaWiki versions (mainly things like new columns/tables).

Vi, I did some tests with your code using the eswiki-20081126 dump. For that version I store the processed file + categories + indexes in less than 800M. In your case, the shelve file needs 2.4G (a little smaller than the decompressed xml dump: 2495471616 vs 2584170611 bytes).

I had to perform a number of changes: to the patch to make it apply, to the interwikis so wikipedia is treated as a local namespace, to paths... Also, the database contains references to /home/vi/usr/mediawiki_sa_vi/w/, but it mostly works.
The more noticeable problems are that images don't work and redirects are not followed.
Other features such as categories or special pages are also broken, but I assume that's expected?
Comment 13 _Vi 2012-06-10 18:46:33 UTC
Improved the patch: 0001-Make-MediaWiki-1.20-gb7ed02-be-able-to-fetch-from-al.patch .

1. Now applies to master branch (b7ed0276e560389913c629d97a46aaa47f48798b)
2. Separate class "AlternativeSourceObtainerBackend"
3. Is not _expected_ to break existing functions unless wgAlternativeSourceObtainerUri is set
4. wgAlternativeSourceObtainerUri is properly registered in DefaultSettings
5. Some comments above the variables and functions
6. Red/blue links are supported now, at the expense of a massive number of requests to the AltBackend (one for each link).

Not implemented:
1. Tabs instead of spaces
2. "No need to hack Parser::fetchTemplateAndTitle"


It should be usable for everything PHP can fopen.
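
A rough sketch of the shape such a backend could take (illustrative only, not the attached patch; the class and setting names follow the description above, the method name is an assumption):

class AlternativeSourceObtainerBackend {
	private $baseUri;

	function __construct( $baseUri ) {
		$this->baseUri = $baseUri;
	}

	// Return the raw wikitext for a page title, or false if the backend
	// has no such page; works with any URI scheme PHP can fopen().
	function fetchText( $titleText ) {
		$data = @file_get_contents( $this->baseUri . rawurlencode( $titleText ) );
		return $data === false ? false : $data;
	}
}

// For example, in LocalSettings.php, pointing at a local wikishelve_server.py:
// $wgAlternativeSourceObtainerUri = 'http://localhost:5077/';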
Comment 14 _Vi 2012-06-10 18:47:34 UTC
Created attachment 10721 [details]
Patch to make MediaWiki obtain page sources from alternative locations
Comment 15 Sumana Harihareswara 2012-06-10 23:58:56 UTC
_Vi - thanks for your work. By the way, want developer access?

https://www.mediawiki.org/wiki/Developer_access
Comment 16 _Vi 2012-06-11 08:10:59 UTC
> _Vi is also processing to a different format. :)
>I did some tests with your code using eswiki-20081126 dump. 
The main thing is that with HTTP you can experiment with your own storage formats easily.

> In your case, the shelve file needs 2.4G
The file is expected to be stored on a compressed filesystem like reiser4/btrfs/fusecompress.

> Other features such as categories or special pages are also broken, but I
> assume that's expected?
It is expected that the old patch is usable only for simple rendering of pages without extras. The new patch should at least not break things when the AltBackend is turned off (still only basic viewing features).

> By the way, want developer access?
It is unlikely to be useful for me. /* Do you share dev access with anybody that easily? */ "Dev access" does not automatically lead to "knowledge about the system and good code quality", and I don't want to break things. If I come up with some patch, I ask on Freenode and/or attach it to some bug report.
Comment 17 Sumana Harihareswara 2012-06-11 11:54:44 UTC
Yes, we do share dev access with anyone, and recommend it for anyone who has ever given us a patch.  It's access to suggest patches directly into our git repository, but you can't break things, because a senior developer has to approve it before it gets merged.  If you get and use dev access, you make it *easier* for us to review, comment on, and eventually merge the code, and you can comment on the patch in the case someone else takes and merges it.
Comment 18 _Vi 2012-06-11 14:04:12 UTC
> If you get and use dev access, you make it
> *easier* for us to review, comment on, 
Done, https://www.mediawiki.org/wiki/Developer_access#User:Vi2

(Not sure what to do with it yet)
Comment 19 Sumana Harihareswara 2012-06-11 14:22:51 UTC
Thanks, _Vi.  You should have a temporary password in your email now.  Initial login & password change steps:

https://labsconsole.wikimedia.org/wiki/Help:Access#Initial_log_in_and_password_change

How to suggest your future patches directly into our source code repository (we use Git for version control and Gerrit for code review), in case you want to do that:

https://www.mediawiki.org/wiki/Git/Tutorial

If the patch under discussion in this bug is just a hacky prototype for discussion, then it's fine to keep on discussing it here and attaching improved patches here.
Comment 20 Ángel González 2012-06-13 00:00:20 UTC
> > In your case, the shelve file needs 2.4G
> The file is expected to be stored on compressed filesystem like
> reiser4/btrfs/fusecompress.

How would you do that?
And even more, how would you *share* such file?
Comment 21 _Vi 2012-06-13 11:09:58 UTC
> How would you do that?
For example, in this way:
$ pv pages_talk_templates_dump.xml.xz | wikishelve_create.py shelve
800M 0:NN:NN [100KB/s] [============] 100%
$ fusecompress -o fc_c:bzip2,fc_b:512,allow-other store mount
$ pv shelve > mount/shelve
2.4G 0:NN:NN [1MB/s] [============] 100%
$ wikishelve_server.py mount/shelve 5077

> And even more, how would you *share* such file?
Don't share, share "pages_talk_templates_dump.xml.xz" instead.

Note: here it's better to discuss only the MediaWiki part. For the storage part of "wiki_dump_and_read", better to create an issue on GitHub.
