Last modified: 2010-05-15 15:37:56 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T5182, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 3182 - Not enough memory for ImportDump.php
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Special pages
Version: 1.5.x
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2005-08-18 09:37 UTC by Andras Fabian
Modified: 2010-05-15 15:37 UTC
CC List: 0 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description Andras Fabian 2005-08-18 09:37:04 UTC
I was experimenting with importing the latest en.wikipedia XML dump into my MySQL
database, but it always failed (at around 800,000 pages, which is only about 1/3
of the current db). The reason was very simple: the importer always ate up my
complete memory (2 GB RAM + 1 GB swap = 3 GB) within hours, and PHP bailed out.
I was looking for the reason for days, and finally I found it (I should have
thought of it earlier). It is the CacheManager: it puts every Title in the
cache and never frees it up. But I don't see why the importer needs the cache
at all when one is only looking up an ArticleID. However, I found a workaround.
In SpecialImport.php there are the following lines:

		$article = new Article( $this->title );
		$pageId = $article->getId();
		if( $pageId == 0 ) {
			# must create the page...
			$pageId = $article->insertOn( $dbw );
		}

Now the $article->getId() call is the culprit, because at this point the Title
is put into the cache ($article->getId() calls $this->mTitle->getArticleID(),
and getArticleID() puts the new Title object into the cache). But this check
for an existing article ID is not necessary at all if one only imports the
current pages (and not all revisions), because there the page IDs are distinct
and the DB should be empty (no existing articles).

The solution: comment out $pageId = $article->getId() and replace it with $pageId = 0.
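
Applied to the snippet quoted above, the patched block would look roughly like
this (a sketch of the hack, not a committed change):

		$article = new Article( $this->title );
		# workaround: skip the lookup that populates the link cache;
		# only safe when importing a "current" dump into an empty DB
		#$pageId = $article->getId();
		$pageId = 0;
		if( $pageId == 0 ) {
			# must create the page...
			$pageId = $article->insertOn( $dbw );
		}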

Now, this is a hack and should be made configurable (or switchable from the
command line), because people who import the full XML dump will need this check
(although I can't imagine how one could succeed with it, as it will need many,
many GB of RAM).
Comment 1 Brion Vibber 2005-08-19 08:43:47 UTC
I would not expect CacheManager to be invoked during this process at all; it's used
when viewing pages if the file cache is on. Further, as I understand it, a
CacheManager shouldn't actually keep any data around or grow.

Can you clarify what you are referring to?
Comment 2 Andras Fabian 2005-08-19 21:23:08 UTC
You can reproduce this behaviour very easily. Take for example the big (1 GB)
http://download.wikipedia.org/wikipedia/en/pages_current.xml.gz and feed it to
maintenance/importDump.php. Then, just looking at "top", you will see how fast
the memory consumption of the PHP process grows (and consequently how rapidly
free memory goes away). Now, if you comment out the $article->getId() line in
SpecialImport.php (which should be obsolete if you are importing a "current"
dump, because, if I understand it correctly, it contains every page only once -
the current/latest version), then you will see the big difference. The PHP
process will not grow in memory (only 27-30 MB on my computer) and the import
runs fine until the end.
The reason for the big memory consumption is (as far as I understand from
reading the code): $article->getId() calls $this->mTitle->getArticleID(). Then,
if I look at "function getArticleID" in Title.php, I see in both branches of the
"if" clause:

$this->mArticleID = $wgLinkCache->addLinkObj( $this );

And I suspect that addLinkObj is the one consuming memory: if it happens for
every Title object during the import process without any memory cleanup, at some
point you run out of memory.
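
For illustration, here is a minimal, self-contained sketch of the failure mode
(a toy class, not the actual MediaWiki LinkCache; names and bodies are
simplified assumptions): a cache whose add method never evicts grows by one
entry per imported title.

	<?php
	# Toy stand-in for LinkCache: addLinkObj() remembers every title
	# it is asked about, and nothing ever evicts the entries.
	class ToyLinkCache {
		var $mGoodLinks = array();   # title text => article ID

		function addLinkObj( $titleText ) {
			if ( !isset( $this->mGoodLinks[$titleText] ) ) {
				# one entry per distinct title; after ~800,000
				# imported pages this alone can eat gigabytes
				$this->mGoodLinks[$titleText] = 0; # pretend DB lookup
			}
			return $this->mGoodLinks[$titleText];
		}
	}

	$cache = new ToyLinkCache();
	for ( $i = 0; $i < 800000; $i++ ) {
		$cache->addLinkObj( "Page_$i" ); # import loop: one title per page
	}
	echo memory_get_usage() . "\n";      # keeps climbing; nothing is freed
	?>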
Comment 3 Brion Vibber 2005-08-19 22:21:39 UTC
So when you claimed that CacheManager was at fault, you meant LinkCache?
That at least makes sense. :)

Checking for existing pages is absolutely required, since not all imports will be conflict-free.
However, disabling the link cache for these checks is probably in order.
Comment 4 Brion Vibber 2005-09-24 03:48:26 UTC
Toss this into WikiRevision::importOldRevision() in SpecialImport.php:

+               // avoid memory leak...?
+               global $wgLinkCache;
+               $wgLinkCache->clear();

Done in CVS HEAD and REL1_5; will be in next 1.5 release.
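
In context, the effect is simply that the link cache is reset after every
imported revision, so it never accumulates titles. A rough sketch, assuming a
hypothetical per-revision driver loop (the actual call site is inside
WikiRevision::importOldRevision()):

	global $wgLinkCache;
	foreach ( $revisions as $revision ) {
		$revision->importOldRevision(); # creates the page/revision rows
		$wgLinkCache->clear();          # drop cached titles; memory stays flat
	}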
