Last modified: 2014-11-06 10:01:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T63463, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 61463 - Gerrit search: enable secondary index to allow latest operators


Summary:	Gerrit search: enable secondary index to allow latest operators

Status:	NEW

Product:	Wikimedia
Classification:	Unclassified
Component:	Git/Gerrit (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized enhancement with 1 vote (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:	https://gerrit.wikimedia.org/r/#/q/co...
Whiteboard:
Keywords:

Depends on:
Blocks:	36467 49674
	Show dependency tree / graph

Reported:	2014-02-17 14:22 UTC by Nemo
Modified:	2014-11-06 10:01 UTC (History)
CC List:	11 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Nemo 2014-02-17 14:22:00 UTC

Code Review - Error
secondary index must be enabled for comment:pginer
----

Also for operators message: and file: (hence tentatively marking as blocker of bug 36467, bug 49674).

I hear that «as part of the secondary indexing change its no longer allowed
to run in its naive form. I think that is OK, we are going to
encourage everyone to turn on secondary indexing.»
https://groups.google.com/forum/#!topic/repo-discuss/rCS2ooVEpbg

Is this the infamous Lucene search? If yes, is this a WONTFIX or what blocks it other than caching (<https://code.google.com/p/gitblit/issues/detail?id=274#c9>)?

P.s.: Pasting here IRC discussion about the URL above for lack of a better place. http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-dev/20140217.txt

[10:37:35] <ori>	 Krinkle: have you seen mwgrep yet?
[10:38:23] <ori>	 ssh to tin and run: mwgrep mw.user.anon
[10:40:40] <hoo>	 ori: ooowwwm, super handy... thanks :)
[10:41:09] <hoo>	 I've been using a hacky bash script for that until now :P
[10:41:38] <ori>	 how did it work?
[10:42:06] <hoo>	 ori: It basically just pulled in all JS pages I wan interested in (mostly common.js) and ran grep on them :P
[10:42:11] <hoo>	 * was
[10:42:14] <ori>	 ahh
[10:48:49] <Krinkle>	 ori: how performant is that (how much is search engine, how much is local script), you think we could expose this publicly somehow? 
[10:49:28] <Krinkle>	 I wrote two separate toolserver tools over the past years to do essentially that, all of them extremely hacky and slow.
[10:49:44] <hoo>	 hoo@terbium:~$ time mwgrep 453894375dfs
[10:49:44] <hoo>	 [...]
[10:49:44] <hoo>	 real	0m0.176s
[10:49:52] <ori>	 the local script is basically constructing the URL for a request; it's all elasticsearch
[10:50:09] <Krinkle>	 couldn't use any API or search backend for it at the time, as it mangled code quite badly (punctuation normalisation etc.), and no way to search multiple wikis
[10:50:11] <ori>	 nik (manybubbles) said the load it generates is laughably small
[10:50:28] <Krinkle>	 yeah, looked at the source code already
[10:50:41] <Krinkle>	 wasn't sure where ""_source['text'].contains('%s')" % args.term" ran though
[10:50:42] <hoo>	 ori: Does that service used in there have a public host?
[10:50:58] <hoo>	 http://search.svc.eqiad.wmnet:9200
[10:51:09] <ori>	 you could change it to a regexp search and as long as it was restricted to the mw ns it would be fine (again, per nik)
[10:51:53] <ori>	 it's not publicly-accessible; pathological queries will run it into the ground, so you need to have a layer in front that restricts the types of searches
[10:52:02] <ori>	 in our case, the mediawiki search interface
[10:52:43] <Krinkle>	 ori: Hm.. it currently doesn't take regex though, right? Would need a minor change to the script. Just verifying I'm not using it incorrectly
[10:52:56] <ori>	 yeah, doesn't take a regexp at the moment
[10:53:29] <ori>	 because this particular query is not risky (because ns:8, title ends with 'js' or 'css' restricts the pool of candidates), we could have a public front-end for it, but chad's plan is to make the entire elastic index queryable from labs
[10:53:40] <Nemo_bis>	 hoo: does this help the non-protocol-relative scripts searches?
[10:54:07] <hoo>	 Nemo_bis: Not much w/o regexp support, yet
[10:54:13] <Nemo_bis>	 ok
[10:54:29] <hoo>	 But I already have quite a list of stuff to clean up which I gathered using my old bash script
[10:55:52] <ori>	 Krinkle: http://p.defau.lt/?Yswucu3jR5S9QqZfFre7Vg is the query (first line is path, everything below is the json query, which should be sent as POST payload)
[10:56:56] <ori>	 Krinkle, hoo: you can replace 'contains' with 'matches' to make it a regexp search
[10:57:29] <Krinkle>	 ori: what script language is that?
[10:57:44] <hoo>	 python
[10:57:48] <Krinkle>	 http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-script-filter.html
[10:57:56] <Krinkle>	 ok
[10:58:24] <ori>	 Krinkle: no, MVEL
[10:58:26] <ori>	 <http://github.com/mvel/mvel>
[10:58:36] <ori>	 context: http://p.defau.lt/?EZ6aPeZRMWdjZtwpOnFgWQ
[10:58:53] <Krinkle>	 yeah, it says that
[10:59:05] <Krinkle>	 <a>scripts</a> -> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
[10:59:09] <Krinkle>	 uses by default mvel 
[10:59:24] <ori>	 ah, yep
[11:00:27] <hoo>	 ori: mh... regexp don't seem to work after I did the change you suggested...
[11:00:34] <hoo>	 or aren't that perl-style regexp?
[11:02:18] <Krinkle>	 Im trying to find docs on mvel regarding available string methods
[11:02:20] <Krinkle>	 no luck
[11:03:18] <Krinkle>	 http://mvel.codehaus.org/MVEL+2.0+Operators shows a different syntax ( foo contains "x", and, foo ~= '[a-z].+' )
[11:03:47] <ori>	 Krinkle: "By default, like Java, MVEL provides all java.lang.* classes pre-imported as part of the compiler and runtime."
[11:03:54] <Krinkle>	 right
[11:04:23] <ori>	 so since _source['text'] is a string, you can call http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#matches(java.lang.String)
[11:05:06] <ori>	 I don't know how that is functionally different from using MVEL's ~= operator
[11:05:10] <Krinkle>	 Ah, this is going to be important to document and protect against. Working 3 or 4 levels of abstraction deep to end up evaluating java.
[11:05:22] <Krinkle>	 java, mvel, json, http
[11:05:52] <ori>	 elastic's http interface is designed to be a private api tho
[11:06:00] <Krinkle>	 yeah
[11:06:10] <ori>	 but yeah, it is a bit like a matryoshka doll
[11:06:29] <ori>	 https://upload.wikimedia.org/wikipedia/commons/4/45/7babushkadolls.JPG
[11:06:33] <Krinkle>	 except that its bigger on the inside
[11:06:34] <ori>	 i'll let you decide which one is java :P
[11:06:51] <Krinkle>	 http < json < mvel < java
[11:07:16] <Krinkle>	 well, kinda sorta
[11:07:17] <Krinkle>	 nvm
[11:07:41] <ori>	 don't forget shell-style argv parsing > json
[11:08:06] <Krinkle>	 yeah, but out final API isn't going to do that :P
[11:08:31] <ori>	 at least i hope not
[11:08:59] <Krinkle>	 ApiAwesome.php, $apiResult = `"/bin/mwgrep escapeshellarg($_GET['q'])"` 
[11:09:25] <ori>	 I'm sure we could come up with something more hideous than that
[11:09:33] <ori>	 it needs Puppet
[11:09:41] <ori>	 and ideally also XML
[11:10:00] <Krinkle>	 and php serialise/wakeup
[11:10:37] <ori>	 see also <http://stackoverflow.com/questions/7797217/how-to-index-source-code-with-elasticsearch> for the "proper" way to do code searches without resulting to this sort of trickery
[11:12:17] <ori>	 basically teach lucene to run <https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/parser/js/Parser.java> against the text and index names
[11:12:35] * ori  chuckles at 'import com.google.caja.SomethingWidgyHappenedError;'
[11:13:56] <Nemo_bis>	 If ElasticSearch can be used for proper code searches, maybe gitblit can be convinced to use it instead of Lucene?
[11:14:23] <Nemo_bis>	 If yes please file a bug upstream, the request for gitweb-like grep search was rejected.
[11:14:37] <ori>	 elastic uses lucene under the hood
[11:16:04] <ori>	 I've been poking at its overall performance (<https://code.google.com/p/gitblit/issues/detail?id=274#c9>), it's hideously slow atm
[11:18:37] <Nemo_bis>	 I know it uses Lucene, hence I'd think it would be slightly easier to convince them to adopt it. :)
[11:19:21] <Nemo_bis>	 Oh, thanks for the link, I thought I had starred that one.
[11:22:47] <Nemo_bis>	 And thanks for working on that

Comment 1 Tisza Gergő 2014-05-27 22:15:52 UTC

file: search would be extremely useful to construct gerrit dashboards which show changes related to some specific topic (for example, I would like to create a dashboard showing pending core changes related to file handling).

Comment 2 Jan Zerebecki 2014-07-16 20:08:16 UTC

Should be easy to enable, but according to the docs gerrit needs to be offline to create the index.
https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#index

Comment 3 christian 2014-07-24 11:34:13 UTC

(In reply to Nemo from comment #0)
> Is this the infamous Lucene search?

Yes.

Gerrit's current search back-end gerrit uses is not great. Agreed.

But Gerrit's Lucene search back-end had it's own warts when I last tried
(which admittedly is some time back). Lucene is fresh in Gerrit.

I have no say here, but since we're heading Phabricator anyways, I do not
think switching the search back-end is worth it at this point.

Comment 4 Nemo 2014-07-24 11:57:09 UTC

Thanks for commenting.

(In reply to christian from comment #3)
> I have no say here, but since we're heading Phabricator anyways, I do not
> think switching the search back-end is worth it at this point.

AFAIK there is no timeline for that yet, will probably take one year or two from my understanding. Depending on how hard it is, even just one team using the feature for their dashboard(s) as Tgr suggested above can pay the effort back; and it's unlikely it would be one team only given how big a usage enjoys even a spammy feature as the gerrit reviewer bot.

Comment 5 Adrian Lang 2014-11-06 09:40:21 UTC

I'd love to be able to do file: / path: searches. Since 2.9, there is conflicts:, which would help me as a workaround, but we are on 2.8.

Comment 6 Andre Klapper 2014-11-06 10:01:41 UTC

(In reply to Adrian Lang from comment #5)
> Since 2.9, there is conflicts:, but we are on 2.8.

see bug 68271

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links