Last modified: 2014-06-27 12:01:40 UTC
The German community held a vote on a "technical wish list", and "source search" made it into the top 20 wishes. See [[w:de:WP:Umfragen/Technische_Wünsche/Suche#Wünsche]].

I had the chance to talk with one of the current CirrusSearch developers, and we think this should be fairly easy to implement: in addition to the current field (which contains only the visible text), we could add a second field that contains the plain, untransformed wikitext. I suggest a keyword "insource:..." for searching this field. This could be very powerful in combination with the existing "hastemplate:...".

Possible problems:
1. This will roughly double the size of the index. Is this worth it?
2. Stemming should be disabled on this field, if that's possible. It probably needs a few more tweaks as well.
3. Searching for special characters can't work, right?
4. Can this still work if we switch to Parsoid some day? It should, right?
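To make the proposed combination of "insource:..." and "hastemplate:..." concrete, here is a minimal sketch that builds such a query string and a search URL for the standard MediaWiki API (action=query, list=search). The helper names, the choice of de.wikipedia.org as the target wiki, and the quoted-phrase form of the keywords are assumptions for illustration only; they are not taken from the patch.

```python
from urllib.parse import urlencode

# Assumed target wiki; any MediaWiki api.php endpoint would do.
API_ENDPOINT = "https://de.wikipedia.org/w/api.php"


def build_insource_query(source_text, template=None):
    """Combine the proposed insource: keyword with the existing hastemplate:.

    Hypothetical helper. It does no escaping, so source_text and template
    must not contain double quotes.
    """
    parts = []
    if template:
        parts.append(f'hastemplate:"{template}"')
    parts.append(f'insource:"{source_text}"')
    return " ".join(parts)


def build_search_url(query, limit=10):
    """Return a search URL for the given query string (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_insource_query("{{Infobox", template="Infobox Ort")
    print(query)
    print(build_search_url(query))
```

Only the URL is built here; nothing is sent over the network.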
Change 137733 had a related patch set uploaded by Manybubbles: Basic insource support https://gerrit.wikimedia.org/r/137733
Change 137733 merged by jenkins-bot: Insource support https://gerrit.wikimedia.org/r/137733
(In reply to Gerrit Notification Bot from comment #2)
> Change 137733 merged by jenkins-bot:
> Insource support
>
> https://gerrit.wikimedia.org/r/137733

Whoa, really?
(In reply to MZMcBride from comment #3)
> (In reply to Gerrit Notification Bot from comment #2)
> > Change 137733 merged by jenkins-bot:
> > Insource support
> >
> > https://gerrit.wikimedia.org/r/137733
>
> Whoa, really?

Yep, should start making its way live with the next wmf branch come Thursday.
(In reply to Chad H. from comment #4)
> (In reply to MZMcBride from comment #3)
> > (In reply to Gerrit Notification Bot from comment #2)
> > > Change 137733 merged by jenkins-bot:
> > > Insource support
> > >
> > > https://gerrit.wikimedia.org/r/137733
> >
> > Whoa, really?
>
> Yep, should start making its way live with the next wmf branch come Thursday.

Caveats for regexes:
1. It's kinda slow.
2. We only allow 2 concurrent queries at a time.
3. We have a maximum queue of 10. This is to keep more than 12 apaches from getting stuck waiting for it.
4. Syntax error feedback is only OK, not great.
5. If you fill up the queue then you won't get a useful error message.
6. No highlighting of results at all. Something I'll work on fixing in the next couple of weeks.
7. It's going to take some time after the initial release for all pages to be indexed. We didn't have the source indexed before, so we'll have to regenerate all the documents, and we didn't write anything fancy to do just the source, so we'll end up rerendering everything. It's slow, but it'll work.
8. The regex language is actually Lucene's regex, which is designed to be efficient rather than super expressive. I chose it because it's safe.
9. Other stuff I don't remember?

Docs are here: https://www.mediawiki.org/wiki/Search/CirrusSearchFeatures#insource:

We were tired of waiting for ops to build out infrastructure for easy copying to labs. So we figured we'd just make it in prod and limit it to a few executors. Hopefully everything will be just fine. We might decide, though we haven't yet, that it'd be best to limit it to users with a permission, or signed-in users, or something. We'd only do that if we saw that it was crushing us or that some asshole was keeping the queue full and no legitimate users could use it.
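The limits in caveats 2 and 3 (2 concurrent queries plus a queue of 10, so at most 12 waiting callers) can be illustrated with a small sketch. This is not how CirrusSearch implements the limit; the class and exception names below are hypothetical, and the sketch only models the behavior described above: callers block while queued, and anything beyond the capacity is rejected immediately.

```python
import threading


class QueueFullError(Exception):
    """Raised when both the running slots and the queue are full."""


class BoundedQueryPool:
    """Hypothetical model: max_running queries execute, max_queued callers wait."""

    def __init__(self, max_running=2, max_queued=10):
        self._capacity = max_running + max_queued
        self._running = threading.Semaphore(max_running)
        self._lock = threading.Lock()
        self._in_flight = 0  # callers currently running or queued

    @property
    def in_flight(self):
        with self._lock:
            return self._in_flight

    def run(self, fn, *args):
        # Reject at once if running + queued callers already fill the capacity.
        with self._lock:
            if self._in_flight >= self._capacity:
                raise QueueFullError("regex search queue is full")
            self._in_flight += 1
        try:
            # Queued callers block here until one of the running slots frees up.
            with self._running:
                return fn(*args)
        finally:
            with self._lock:
                self._in_flight -= 1
```

With the defaults, a 13th simultaneous caller gets QueueFullError instead of tying up another worker. Unlike caveat 5, this sketch does give the rejected caller a specific error.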
(In reply to Nik Everett from comment #5)
> We were tired of waiting for ops to build out infrastructure for easy
> copying to labs.

Well, it's a lower priority for Swift than, say, image storage, so I understand the delay. We still want it though for backups and labs :)
(In reply to Chad H. from comment #6)
> (In reply to Nik Everett from comment #5)
> > We were tired of waiting for ops to build out infrastructure for easy
> > copying to labs.
>
> Well it's a lower priority for Swift than say image storage, so I understand
> the delay. We still want it though for backups and labs :)

Yeah! I totally want backups! I just was tired of waiting for it for regexes. Hopefully it won't turn out to be a mistake.