Last modified: 2013-10-29 16:34:21 UTC
Some search backends, like LuceneSearch, rely on XML dumps to build the search index. The indexer has no knowledge of content models, so it indexes everything in the dump as-is. For non-text content models, this means it indexes the serialized form, which often leads to bad search results (see bug 42234). To solve this, a brief discussion on wikitech-l suggested implementing an option for the dump creation process that writes generated text, instead of the raw serialized data, into the dumps. This option could then be used to create dumps specifically for rebuilding a search index. See http://www.gossamer-threads.com/lists/wiki/wikitech/340638 for the discussion. The Content interface already defines the function getTextForSearchIndex for generating such pseudo-content. It only needs to be hooked up to dump generation.
The work should probably be done in Export.php, because then all the actual dump infrastructure will 'just work'. Additionally, someone using Special:Export would be able to export content in this format (given the right checkboxes).
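As a rough illustration of the proposal, here is a minimal sketch in Python. It is a conceptual model only, not MediaWiki's PHP code: the classes TextContent and JsonContent, the function dump_page, and the for_search_index flag are all hypothetical names invented for this example. Only the idea of a getTextForSearchIndex method on content objects comes from the description above.

```python
import json
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class TextContent:
    """Plain text content: the serialized form is already searchable."""
    text: str

    def serialize(self) -> str:
        return self.text

    def get_text_for_search_index(self) -> str:
        return self.text


@dataclass
class JsonContent:
    """Hypothetical non-text content model that is stored as JSON."""
    data: dict

    def serialize(self) -> str:
        return json.dumps(self.data, sort_keys=True)

    def get_text_for_search_index(self) -> str:
        # Expose only the human-readable values, not the JSON syntax.
        return " ".join(str(value) for value in self.data.values())


def dump_page(title: str, content, for_search_index: bool = False) -> str:
    """Render one simplified <page> element for a dump.

    The flag (an assumed option, mirroring the proposed dump setting)
    selects generated search text instead of the raw serialized data.
    """
    if for_search_index:
        body = content.get_text_for_search_index()
    else:
        body = content.serialize()
    return (
        f"<page><title>{escape(title)}</title>"
        f"<text>{escape(body)}</text></page>"
    )
```

With a JsonContent page, the default call writes the JSON serialization into the text element, while passing for_search_index=True writes only the plain values. Putting the switch in the single place where page text is emitted is what lets the rest of the dump pipeline stay unchanged.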
Since the move from Lucene to ElasticSearch, this is no longer an issue.