Last modified: 2013-09-26 15:06:42 UTC
1. Search test2wiki for "environment": https://test2.wikipedia.org/w/index.php?search=environment&title=Special%3ASearch
2. Notice that the first result has, as the text snippet: "Geography making maps Countries of the world Natural environment[edit | edit source] Climate Soil Rivers Rocks"
3. Click through to https://test2.wikipedia.org/wiki/Geography and notice that "[edit | edit source]" is not in the real text of the article.
I think CirrusSearch should not be displaying "[edit | edit source]" in the text snippets in the search results.
Another repro case, slightly different: Search for "Valiant" https://test2.wikipedia.org/w/index.php?search=valiant&title=Special%3ASearch and you'll see the result "Blooper", with the text excerpt being: "A Bug's Life, Toy Story 2, Monsters, Inc., and Valiant. Contents 1 The "blooper" in pop culture 1" The words after "Valiant" are part of the table of contents of the page.
Created attachment 13104: "[edit | edit source]" in the search results snippet
Try this one: https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=edit+source&fulltext=Search The action item here: remove the edit links and any other automatically added text from the page before dropping it into the search backend. Also, remove the table of contents if possible. I'm pretty sure the edit links and their ilk are super high priority, but I'm not sure of the priority on the table of contents.
https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=video+sorry&fulltext=Search gets me a link to https://test2.wikipedia.org/wiki/Birch_beer that includes the excerpt: "a heart as big as a whale. Also: enjoy this video! Sorry, your browser either has JavaScript disabled" So that's one more automatically added bit of text to remove from the search corpus.
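To illustrate the kind of cleanup being asked for, here is a minimal sketch in Python (not the actual CirrusSearch code, which is not shown in this report). It walks the rendered HTML and drops the text of any element that looks like an automatically added interface element before the text would be handed to the search backend. The class and id names (`mw-editsection`, `toc`) are assumptions about how the edit links and table of contents are marked up, and `extract_indexable_text` is a hypothetical helper name. The markup of the "JavaScript disabled" warning is not known from this report, so it is not covered; its class or id would need to be added to the sets. The sketch also assumes well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Assumed markers for automatically added UI elements; adjust to the real markup.
EXCLUDED_CLASSES = {"mw-editsection", "toc"}
EXCLUDED_IDS = {"toc"}

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IndexableTextExtractor(HTMLParser):
    """Collects page text, skipping everything inside excluded elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside an excluded element

    def _is_excluded(self, attrs):
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        return bool(classes & EXCLUDED_CLASSES) or attr_map.get("id") in EXCLUDED_IDS

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside an excluded element, count every nested open tag.
        if self._skip_depth or self._is_excluded(attrs):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so snippets read cleanly.
        return " ".join(" ".join(self._chunks).split())


def extract_indexable_text(html):
    """Hypothetical helper: return only the text that should be indexed."""
    parser = IndexableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Fed a heading that contains an edit-link span, the helper keeps the heading text and drops the "[edit | edit source]" part; fed a page with a table-of-contents block, it drops the whole block, so snippets like the ones reported above would no longer contain that text.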
I've pushed a fix to gerrit: https://gerrit.wikimedia.org/r/#/c/80018/ I'll set this bug to PATCH_TO_REVIEW once I push some regression tests to review as well.
Tests: https://gerrit.wikimedia.org/r/#/c/80021/ I forgot to include the bug number in the commit messages but these links should help.
Tweaked the summary a bit. Older summary included: "[edit | edit source]", ToC text, & "JS disabled" warning. Genericized this to user interface elements and clarified that this is a CirrusSearch-specific issue.
Change 80018 had a related patch set uploaded by TTO: Remove parts of rendered page from search. https://gerrit.wikimedia.org/r/80018
Change 80018 merged by jenkins-bot: Remove parts of rendered page from search. https://gerrit.wikimedia.org/r/80018
Live and working.