I was experimenting with importing the latest en.wikipedia XML dump into my MySQL database, but it always failed (at around 800,000 pages, which is only about 1/3 of the current database). The reason was simple: the importer always ate up my complete memory (2 GB RAM + 1 GB swap = 3 GB) within hours, and PHP bailed out. I was looking for the cause for days and finally found it (I should have thought of it earlier): it is the CacheManager. It puts every Title into the cache and never frees it. But I don't see why the importer needs the cache at all when it only looks up an ArticleID.

I found a workaround. In SpecialImport.php there are the following lines:

 $article = new Article( $this->title );
 $pageId = $article->getId();
 if( $pageId == 0 ) {
     # must create the page...
     $pageId = $article->insertOn( $dbw );
 }

The $article->getId() call is the culprit, because at this point the Title is put into the cache (Article::getId() calls $this->mTitle->getArticleID(), and getArticleID() puts the new Title object into the cache). This check for an existing article ID is not necessary at all if one only imports the current pages (and not all revisions), because there the page IDs are distinct and the database should be empty (no existing articles).

The solution: comment out $pageId = $article->getId() and replace it with $pageId = 0.

This is a hack and should be made configurable (or switchable from the command line), because people who import the full-history XML will need this check (though I can't imagine how one could succeed with that, since it would need many, many GB of RAM).
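A minimal sketch of how that could be made configurable (the $wgImportSkipIdCheck name is hypothetical, not an existing MediaWiki setting):

 // Hypothetical switch: skip the existing-page lookup when importing a
 // "current pages" dump into an empty wiki, so no Title enters the cache.
 global $wgImportSkipIdCheck;

 $article = new Article( $this->title );
 if( !empty( $wgImportSkipIdCheck ) ) {
     $pageId = 0; // assume the page does not exist yet
 } else {
     $pageId = $article->getId();
 }
 if( $pageId == 0 ) {
     # must create the page...
     $pageId = $article->insertOn( $dbw );
 }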
I would not expect CacheManager to be invoked during this process at all; it's used when viewing pages if the file cache is on. Further, a CacheManager shouldn't actually keep any data around or grow, as far as I understand. Can you clarify what you are referring to?
You can reproduce this behaviour very easily. Take, for example, the big (1 GB) http://download.wikipedia.org/wikipedia/en/pages_current.xml.gz and feed it to maintenance/importDump.php. Then, just looking at "top", you will see how fast the memory consumption of the PHP process grows (and consequently how rapidly free memory goes away). Now, if you comment out the $article->getId() line in SpecialImport.php (which should be unnecessary if you are importing a "current" dump, since, if I understand correctly, it contains every page only once, in its current/latest version), you will see the big difference: the PHP process does not grow in memory (only 27-30 MB on my computer) and the import runs well until the end.

The reason for the big memory consumption, as far as I understand from reading the code, is this: Article::getId() calls $this->mTitle->getArticleID(). Looking into Title.php at function getArticleID(), I see in both branches of the "if" clause:

 $this->mArticleID = $wgLinkCache->addLinkObj( $this );

I suspect that addLinkObj() is what is consuming the memory: if it is called for every Title object during the import without any memory ever being freed, at some point you run out of memory.
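A rough, simplified illustration of the failure mode (this is not the actual LinkCache code, just a stand-in showing why a per-title cache with no eviction scales with the size of the dump):

 // Stand-in for a link cache that records every title it is asked about
 // and never evicts anything.
 class NaiveLinkCache {
     var $mGoodLinks = array();

     function addLinkObj( $titleKey ) {
         if( !isset( $this->mGoodLinks[$titleKey] ) ) {
             // One entry per imported page, kept for the whole run.
             $this->mGoodLinks[$titleKey] = 1;
         }
         return $this->mGoodLinks[$titleKey];
     }
 }

 // During an import this gets called once per page, so the array grows to
 // millions of entries and memory use grows with the dump size instead of
 // staying constant.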
So when you claimed that CacheManager was at fault, you meant LinkCache? That at least makes sense. :) Checking for existing pages is absolutely required, since not all imports will be conflict-free. However, disabling the link cache for these checks is probably in order.
Toss this into WikiRevision::importOldRevision() in SpecialImport.php:

 + // avoid memory leak...?
 + global $wgLinkCache;
 + $wgLinkCache->clear();

Done in CVS HEAD and REL1_5; will be in the next 1.5 release.
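For context, a sketch of where that patch sits inside the method (the surrounding lines are paraphrased from the snippet quoted earlier in this discussion; the exact placement in the real fix may differ):

 function importOldRevision() {
     // ... $dbw and other setup elided ...

     // avoid memory leak...?
     global $wgLinkCache;
     $wgLinkCache->clear();

     $article = new Article( $this->title );
     $pageId = $article->getId();
     if( $pageId == 0 ) {
         # must create the page...
         $pageId = $article->insertOn( $dbw );
     }

     // ... rest of the revision insertion elided ...
 }

Clearing the cache once per imported revision keeps its size bounded by a single page's lookups instead of growing with the whole dump.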