Last modified: 2014-11-06 10:01:41 UTC
Code Review - Error secondary index must be enabled for comment:pginer ---- Also for operators message: and file: (hence tentatively marking as blocker of bug 36467, bug 49674). I hear that «as part of the secondary indexing change its no longer allowed to run in its naive form. I think that is OK, we are going to encourage everyone to turn on secondary indexing.» https://groups.google.com/forum/#!topic/repo-discuss/rCS2ooVEpbg Is this the infamous Lucene search? If yes, is this a WONTFIX or what blocks it other than caching (<https://code.google.com/p/gitblit/issues/detail?id=274#c9>)? P.s.: Pasting here IRC discussion about the URL above for lack of a better place. http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-dev/20140217.txt [10:37:35] <ori> Krinkle: have you seen mwgrep yet? [10:38:23] <ori> ssh to tin and run: mwgrep mw.user.anon [10:40:40] <hoo> ori: ooowwwm, super handy... thanks :) [10:41:09] <hoo> I've been using a hacky bash script for that until now :P [10:41:38] <ori> how did it work? [10:42:06] <hoo> ori: It basically just pulled in all JS pages I wan interested in (mostly common.js) and ran grep on them :P [10:42:11] <hoo> * was [10:42:14] <ori> ahh [10:48:49] <Krinkle> ori: how performant is that (how much is search engine, how much is local script), you think we could expose this publicly somehow? [10:49:28] <Krinkle> I wrote two separate toolserver tools over the past years to do essentially that, all of them extremely hacky and slow. [10:49:44] <hoo> hoo@terbium:~$ time mwgrep 453894375dfs [10:49:44] <hoo> [...] [10:49:44] <hoo> real 0m0.176s [10:49:52] <ori> the local script is basically constructing the URL for a request; it's all elasticsearch [10:50:09] <Krinkle> couldn't use any API or search backend for it at the time, as it mangled code quite badly (punctuation normalisation etc.), and no way to search multiple wikis [10:50:11] <ori> nik (manybubbles) said the load it generates is laughably small [10:50:28] <Krinkle> yeah, looked at the source code already [10:50:41] <Krinkle> wasn't sure where ""_source['text'].contains('%s')" % args.term" ran though [10:50:42] <hoo> ori: Does that service used in there have a public host? [10:50:58] <hoo> http://search.svc.eqiad.wmnet:9200 [10:51:09] <ori> you could change it to a regexp search and as long as it was restricted to the mw ns it would be fine (again, per nik) [10:51:53] <ori> it's not publicly-accessible; pathological queries will run it into the ground, so you need to have a layer in front that restricts the types of searches [10:52:02] <ori> in our case, the mediawiki search interface [10:52:43] <Krinkle> ori: Hm.. it currently doesn't take regex though, right? Would need a minor change to the script. Just verifying I'm not using it incorrectly [10:52:56] <ori> yeah, doesn't take a regexp at the moment [10:53:29] <ori> because this particular query is not risky (because ns:8, title ends with 'js' or 'css' restricts the pool of candidates), we could have a public front-end for it, but chad's plan is to make the entire elastic index queryable from labs [10:53:40] <Nemo_bis> hoo: does this help the non-protocol-relative scripts searches? [10:54:07] <hoo> Nemo_bis: Not much w/o regexp support, yet [10:54:13] <Nemo_bis> ok [10:54:29] <hoo> But I already have quite a list of stuff to clean up which I gathered using my old bash script [10:55:52] <ori> Krinkle: http://p.defau.lt/?Yswucu3jR5S9QqZfFre7Vg is the query (first line is path, everything below is the json query, which should be sent as POST payload) [10:56:56] <ori> Krinkle, hoo: you can replace 'contains' with 'matches' to make it a regexp search [10:57:29] <Krinkle> ori: what script language is that? [10:57:44] <hoo> python [10:57:48] <Krinkle> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-script-filter.html [10:57:56] <Krinkle> ok [10:58:24] <ori> Krinkle: no, MVEL [10:58:26] <ori> <http://github.com/mvel/mvel> [10:58:36] <ori> context: http://p.defau.lt/?EZ6aPeZRMWdjZtwpOnFgWQ [10:58:53] <Krinkle> yeah, it says that [10:59:05] <Krinkle> <a>scripts</a> -> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html [10:59:09] <Krinkle> uses by default mvel [10:59:24] <ori> ah, yep [11:00:27] <hoo> ori: mh... regexp don't seem to work after I did the change you suggested... [11:00:34] <hoo> or aren't that perl-style regexp? [11:02:18] <Krinkle> Im trying to find docs on mvel regarding available string methods [11:02:20] <Krinkle> no luck [11:03:18] <Krinkle> http://mvel.codehaus.org/MVEL+2.0+Operators shows a different syntax ( foo contains "x", and, foo ~= '[a-z].+' ) [11:03:47] <ori> Krinkle: "By default, like Java, MVEL provides all java.lang.* classes pre-imported as part of the compiler and runtime." [11:03:54] <Krinkle> right [11:04:23] <ori> so since _source['text'] is a string, you can call http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#matches(java.lang.String) [11:05:06] <ori> I don't know how that is functionally different from using MVEL's ~= operator [11:05:10] <Krinkle> Ah, this is going to be important to document and protect against. Working 3 or 4 levels of abstraction deep to end up evaluating java. [11:05:22] <Krinkle> java, mvel, json, http [11:05:52] <ori> elastic's http interface is designed to be a private api tho [11:06:00] <Krinkle> yeah [11:06:10] <ori> but yeah, it is a bit like a matryoshka doll [11:06:29] <ori> https://upload.wikimedia.org/wikipedia/commons/4/45/7babushkadolls.JPG [11:06:33] <Krinkle> except that its bigger on the inside [11:06:34] <ori> i'll let you decide which one is java :P [11:06:51] <Krinkle> http < json < mvel < java [11:07:16] <Krinkle> well, kinda sorta [11:07:17] <Krinkle> nvm [11:07:41] <ori> don't forget shell-style argv parsing > json [11:08:06] <Krinkle> yeah, but out final API isn't going to do that :P [11:08:31] <ori> at least i hope not [11:08:59] <Krinkle> ApiAwesome.php, $apiResult = `"/bin/mwgrep escapeshellarg($_GET['q'])"` [11:09:25] <ori> I'm sure we could come up with something more hideous than that [11:09:33] <ori> it needs Puppet [11:09:41] <ori> and ideally also XML [11:10:00] <Krinkle> and php serialise/wakeup [11:10:37] <ori> see also <http://stackoverflow.com/questions/7797217/how-to-index-source-code-with-elasticsearch> for the "proper" way to do code searches without resulting to this sort of trickery [11:12:17] <ori> basically teach lucene to run <https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/parser/js/Parser.java> against the text and index names [11:12:35] * ori chuckles at 'import com.google.caja.SomethingWidgyHappenedError;' [11:13:56] <Nemo_bis> If ElasticSearch can be used for proper code searches, maybe gitblit can be convinced to use it instead of Lucene? [11:14:23] <Nemo_bis> If yes please file a bug upstream, the request for gitweb-like grep search was rejected. [11:14:37] <ori> elastic uses lucene under the hood [11:16:04] <ori> I've been poking at its overall performance (<https://code.google.com/p/gitblit/issues/detail?id=274#c9>), it's hideously slow atm [11:18:37] <Nemo_bis> I know it uses Lucene, hence I'd think it would be slightly easier to convince them to adopt it. :) [11:19:21] <Nemo_bis> Oh, thanks for the link, I thought I had starred that one. [11:22:47] <Nemo_bis> And thanks for working on that
file: search would be extremely useful to construct gerrit dashboards which show changes related to some specific topic (for example, I would like to create a dashboard showing pending core changes related to file handling).
Should be easy to enable, but according to the docs gerrit needs to be offline to create the index. https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#index
(In reply to Nemo from comment #0) > Is this the infamous Lucene search? Yes. Gerrit's current search back-end gerrit uses is not great. Agreed. But Gerrit's Lucene search back-end had it's own warts when I last tried (which admittedly is some time back). Lucene is fresh in Gerrit. I have no say here, but since we're heading Phabricator anyways, I do not think switching the search back-end is worth it at this point.
Thanks for commenting. (In reply to christian from comment #3) > I have no say here, but since we're heading Phabricator anyways, I do not > think switching the search back-end is worth it at this point. AFAIK there is no timeline for that yet, will probably take one year or two from my understanding. Depending on how hard it is, even just one team using the feature for their dashboard(s) as Tgr suggested above can pay the effort back; and it's unlikely it would be one team only given how big a usage enjoys even a spammy feature as the gerrit reviewer bot.
I'd love to be able to do file: / path: searches. Since 2.9, there is conflicts:, which would help me as a workaround, but we are on 2.8.
(In reply to Adrian Lang from comment #5) > Since 2.9, there is conflicts:, but we are on 2.8. see bug 68271