Last modified: 2014-01-22 17:53:39 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T30493, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 28493 - Monitor and index error logs for trends and new errors
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: platformeng
Depends on:
Blocks:
Reported: 2011-04-11 17:21 UTC by Krinkle
Modified: 2014-01-22 17:53 UTC
CC List: 9 users

Description Krinkle 2011-04-11 17:21:18 UTC
We need a way to report PHP errors, or database errors such as the following:

----
Database error

A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

  (SQL query hidden)

from within function "LinksUpdate::incrTableUpdate". Database returned error "1205: Lock wait timeout exceeded; try restarting transaction (10.0.6.41)".
-----

An IRC bot could be written that reports DB errors, and perhaps PHP/MediaWiki errors or other kinds of "should be rare" errors as well.

According to Reedy there's a "global db/sql error" file on fenari.
Comment 1 p858snake 2011-04-12 06:24:24 UTC
Since these should be "rare" is there really all that much need for it?
Comment 2 Roan Kattouw 2011-04-12 09:51:03 UTC
(In reply to comment #1)
> Since these should be "rare" is there really all that much need for it?
They *should* be rare, but that doesn't mean that they are ;)

Interesting files on fenari are:
/home/wikipedia/syslog/apache.log -- Aggregated error.log for all Apaches. Needs some filtering to be usable; grep -i fatal works well for me
/home/wikipedia/log/dberror.log -- DB errors

Some additional notes:
* these files have different formats
* repeated errors have to be filtered out so this doesn't get too noisy
* srv numbers should be reported, as well as db numbers for DB errors
Comment 3 Krinkle 2011-04-12 10:00:23 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > Since these should be "rare" is there really all that much need for it?
> They *should* be rare, but that doesn't mean that they are ;)
> 
> Interesting files on fenari are:
> /home/wikipedia/syslog/apache.log -- Aggregated error.log for all Apaches.
> Needs some filtering to be usable; grep -i fatal works well for me
> /home/wikipedia/log/dberror.log -- DB errors
> 
> Some additional notes:
> * these files have different formats
> * repeated errors have to be filtered out so this doesn't get too noisy
> * srv numbers should be reported, as well as db numbers for DB errors

Could you publish a sample of each somewhere ?
Comment 4 Sam Reed (reedy) 2011-04-12 10:12:02 UTC
I can later if Roan doesn't get round to it before me
Comment 5 Roan Kattouw 2011-04-12 10:51:47 UTC
Censored sample of dberror.log:

Tue Apr 12 6:35:04 UTC 2011     srv254  enwiki  Error connecting to 10.0.6.22: Can't connect to MySQL server on '10.0.6.22' (4) (10.0.6.22)
Tue Apr 12 6:35:15 UTC 2011     srv265  dewiki  User::invalidateCache   10.0.6.33       1205    Lock wait timeout exceeded; Try restarting transaction (10.0.6.33)      UPDATE  `user` SET user_touched = 'CENSORED' WHERE user_id = 'CENSORED'
Tue Apr 12 6:51:14 UTC 2011     srv278  ruwiki  GlobalUsage::insertLinks        10.0.6.41       1062    Duplicate entry 'Houston_City_Hall_from_Hermann_Square_(HDR).jpg-ruwiki-4401' for key 'PRIMARY' (10.0.6.41)     INSERT  INTO `globalimagelinks` (gil_wiki,gil_page,gil_page_namespace_id,gil_page_namespace,gil_page_title,gil_to) VALUES ('ruwiki','4401','0','','Заглавная_страница','Houston_City_Hall_from_Hermann_Square_(HDR).jpg')
Tue Apr 12 6:52:33 UTC 2011     srv163  frwiki  Job::pop        10.0.6.39       1213    Deadlock found when trying to get lock; Try restarting transaction (10.0.6.39)  DELETE FROM `job` WHERE job_id = '136559964'
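
A minimal sketch of how a reporting script might split such lines into fields. It assumes the columns in dberror.log are tab-separated (the sample above only shows runs of whitespace), and the field names are hypothetical, inferred from the sample rather than taken from any documentation. Connection errors appear to carry only four columns, query errors eight.

```python
from typing import NamedTuple, Optional


class DbError(NamedTuple):
    # Field names are assumptions inferred from the sample lines.
    timestamp: str
    host: str                  # application server, e.g. "srv265"
    wiki: str                  # e.g. "dewiki"
    function: Optional[str]    # calling function, absent for connection errors
    db_server: Optional[str]   # DB server IP, absent for connection errors
    errno: Optional[int]       # MySQL error number, absent for connection errors
    message: str
    sql: Optional[str]         # query text, absent for connection errors


def parse_dberror_line(line: str) -> DbError:
    """Split one dberror.log line, assuming tab-separated columns."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) == 4:
        # Connection error: timestamp, host, wiki, message.
        ts, host, wiki, message = fields
        return DbError(ts, host, wiki, None, None, None, message, None)
    if len(fields) == 8:
        # Query error: all eight columns are present.
        ts, host, wiki, function, db_server, errno, message, sql = fields
        return DbError(ts, host, wiki, function, db_server, int(errno), message, sql)
    raise ValueError("unexpected number of columns: %d" % len(fields))
```

Raising on an unexpected column count keeps unknown line shapes visible instead of silently misreporting them.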

Censored sample of grep -i fatal apache.log :
Apr 12 07:13:18 10.0.8.3 apache2[6295]: PHP Fatal error:  Maximum execution time of CENSORED seconds exceeded in /usr/local/apache/common-local/php-1.17/includes/parser/Parser.php on line 3202
Apr 12 07:19:48 10.0.8.2 apache2[3887]: PHP Fatal error:  Allowed memory size of CENSORED bytes exhausted (tried to allocate CENSORED bytes) in /usr/local/apache/common-local/php-1.17/includes/parser/LinkHolderArray.php on line 265
Apr 12 08:48:56 10.0.8.18 apache2[14920]: PHP Fatal error:  Call to a member function isRedirect() on a non-object in /usr/local/apache/common-local/php-1.17/extensions/Collection/Collection.php on line 369

As you can see the DB servers in dberror.log and the srv servers in apache.log are stored as IP addresses, so you'd need to resolve those:

$ host 10.0.8.18
18.8.0.10.in-addr.arpa domain name pointer srv268.pmtpa.wmnet.
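
The same lookup can be done from a script. The following is a sketch only: the helper name `make_resolver` is hypothetical, the lookup uses Python's standard `socket.gethostbyaddr`, and results are cached so that a noisy log does not trigger one DNS query per line. Addresses that do not resolve are returned unchanged.

```python
import socket
from functools import lru_cache
from typing import Callable


def make_resolver(lookup: Callable = socket.gethostbyaddr) -> Callable[[str], str]:
    """Return a cached function mapping an IP address to a short host name.

    `lookup` is injectable so the resolver can be exercised without DNS.
    """
    @lru_cache(maxsize=None)
    def resolve(ip: str) -> str:
        try:
            # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
            hostname = lookup(ip)[0]
        except OSError:
            # socket.herror and socket.gaierror are OSError subclasses.
            return ip
        # "srv268.pmtpa.wmnet" -> "srv268"
        return hostname.split(".", 1)[0]

    return resolve
```

With a resolver created once at startup, `resolve("10.0.8.18")` would yield `srv268` on a host that can see the internal DNS, going by the `host` output above.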

I'm not sure the censoring of the limits in apache.log was necessary, but I do think we'll want to censor SQL queries before posting them to a public channel. There is code for this in MW already (censoring and generalizing SQL queries for profiling purposes), somewhere.
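
To illustrate the kind of censoring meant here, a rough sketch follows. It is not the MediaWiki code referred to above, only an assumption about what such generalizing could look like: quoted literals become X, bare numbers become N, and whitespace is collapsed. The function name `generalize_sql` is hypothetical, and the regular expressions do not handle every quoting style (for example, a quote character nested inside the other kind of quotes).

```python
import re

_SINGLE_QUOTED = re.compile(r"'(?:[^'\\]|\\.)*'")
_DOUBLE_QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_WHITESPACE = re.compile(r"\s+")


def generalize_sql(sql: str) -> str:
    """Replace literal values in a query so only its shape remains."""
    sql = _SINGLE_QUOTED.sub("'X'", sql)   # 'some value' -> 'X'
    sql = _DOUBLE_QUOTED.sub('"X"', sql)   # "some value" -> "X"
    sql = _NUMBER.sub("N", sql)            # 136559964 -> N
    return _WHITESPACE.sub(" ", sql).strip()
```

Because all values collapse to the same placeholder, the generalized query also works as a grouping key: two failures of the same statement with different IDs produce the same string.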
Comment 6 Krinkle 2011-04-12 11:13:04 UTC
(In reply to comment #5)
> Censored sample of dberror.log:
> 
> Tue Apr 12 6:35:04 UTC 2011     srv254  enwiki  Error connecting to 10.0.6.22:
> Can't connect to MySQL server on '10.0.6.22' (4) (10.0.6.22)

Looks good.

> Censored sample of grep -i fatal apache.log :
> Apr 12 07:13:18 10.0.8.3 apache2[6295]: PHP Fatal error:  Maximum execution
> time of CENSORED seconds exceeded i

Why only fatals though? I think we should apply the code conventions we use for trunk to wmf as well: no notices, warnings or fatals should appear. Since we're just getting started on this, it makes sense to start with a filtered output to existing channels, but an unfiltered output could be set up as well, e.g. in #wikimedia-debug or whatever.

(unfiltered, not uncensored)

> I'm not sure the censoring of the limits in apache.log was necessary, but I do
> think we'll want to censor SQL queries before posting them to a public channel.
> There is code for this in MW already (censoring and generalizing SQL queries
> for profiling purposes), somewhere.

Nice!

> As you can see the DB servers in dberror.log and the srv servers in apache.log
> are stored as IP addresses, so you'd need to resolve those:

Is this information available on noc.wikimedia.org as well? There are some IP-to-name mappings there; I'm not sure whether these can or should be there as well.


In mc.php there's 46 => '10.0.8.18:11000',
Comment 7 Roan Kattouw 2011-04-12 11:39:22 UTC
(In reply to comment #6)
> Why only fatals though ?
Because there's lots of garbage like this polluting the logs all the time:

Apr 12 06:31:22 10.0.8.21 apache2[29754]: [error] [client 208.80.152.81] Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/meta/style, referer: http://cursilloswfla.org/
Apr 12 06:31:22 10.0.8.21 apache2[29880]: [error] [client 208.80.152.71] (36)File name too long: access to /Category:Banks_of_S%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525C3%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525A3o_Tom%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525C3%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525A9_and_Pr%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525C3%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252
5252525252525252525252525252525252525252525252
Apr 12 06:31:23 10.0.2.233 apache2[23144]: [error] [client 208.80.152.87] Directory index forbidden by Options directive: /usr/local/apache/common/docroot/commons/w/


> I think we should keep our code conventions to trunk
> to wmf as well, no notices, warnings or fatals should appear. Although since
> we're just getting started on this, it makes sense to start with a filtered
> output to existing channels, but an unfiltered output could be set as well. ie.
> #wikimedia-debug or whatever.
> 
> (unfiltered, not uncensored)
> 
Reporting notices and warnings is fine, as long as the garbage mentioned above is filtered out.
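
One way to do that filtering, sketched under assumptions: the syslog layout is taken from the samples in this thread, only the "PHP Fatal error", "PHP Parse error", "PHP Warning" and "PHP Notice" prefixes are recognized, and the function name `parse_php_error` is hypothetical. Anything else, including the Apache "[error] [client ...]" lines, is dropped.

```python
import re
from typing import Optional

_PHP_LINE = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s[\d:]+)\s"    # e.g. "Apr 12 08:48:56"
    r"(?P<ip>\S+)\sapache2\[\d+\]:\s"      # reporting server and process
    r"PHP (?P<level>Fatal error|Parse error|Warning|Notice):\s+"
    r"(?P<message>.*)$"
)


def parse_php_error(line: str) -> Optional[dict]:
    """Return the parts of a PHP error line, or None for any other line."""
    match = _PHP_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None
```

Matching on an explicit allow-list of prefixes, rather than excluding known garbage, means new kinds of Apache noise are ignored by default.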

> > As you can see the DB servers in dberror.log and the srv servers in apache.log
> > are stored as IP addresses, so you'd need to resolve those:
> 
> Is this information available on noc.wikimedia.org as well ? There are some IPs
> and names relations there, not sure if these can or should be there as well.
> 
> 
> In mc.php there's 46 => '10.0.8.18:11000',
No, this info is not available there, but this script would have to run on fenari or another host within the cluster anyway, so it can just do reverse DNS lookups.
Comment 8 Krinkle 2011-04-12 11:47:27 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > Why only fatals though ?
> Because there's lots of garbage like this polluting the logs all the time:
> 
> Apr 12 06:31:22 10.0.8.21 apache2[29754]: [error] [client 208.80.152.81]
> Symbolic link not allowed or link target not accessible:
> /usr/local/apache/common/docroot/meta/style, referer: http://cursilloswfla.org/
> Apr 12 06:31:22 10.0.8.21 apache2[29880]: [error] [client 208.80.152.71]
> (36)File name too long: access to
> /Category:Banks_of_S%252525252
> 
> > output to existing channels, but an unfiltered output could be set as well. ie.
> > #wikimedia-debug or whatever.
> > 
> > (unfiltered, not uncensored)
> > 
> Reporting notices and warnings is fine, as long as the garbage mentioned above
> is filtered out.

Yeah, I forgot Apache logs aren't just PHP's errors.
Comment 9 Krinkle 2013-10-31 11:40:21 UTC
So, we've got:

* Aggregated logs on the servers:
  https://wikitech.wikimedia.org/wiki/Logs

* Ganglia and Graphite graphing some of these as numerical statistics,
  but no actual errors or trends. One still needs to open the logs for
  details. That's fine when working on a major exception spike
  (regression), but when trying to find minor notices and warnings not
  affecting everyone we need something else.

translatewiki.net has an IRC bot echoing all these error logs. That's too much for us (at the very least we'd need to de-duplicate things).

However, I think it should be feasible to develop something that monitors these, detects similar errors (similar to how we group them in fatalmonitor), and only reports to IRC when new errors are first seen or errors seen earlier become significantly more common.
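
A rough sketch of that idea, with every name and threshold being an assumption for illustration: messages are grouped by a signature (digit runs replaced with N, which is not necessarily how fatalmonitor groups them), counted per time window, and reported only when a signature is new or when its count is at least `spike_factor` times the previous window's count and at least `min_count`.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def signature(message: str) -> str:
    """Group similar messages by replacing every run of digits with N."""
    return re.sub(r"\d+", "N", message)


class TrendMonitor:
    def __init__(self, spike_factor: float = 3.0, min_count: int = 5) -> None:
        self.spike_factor = spike_factor
        self.min_count = min_count
        self.seen = set()          # every signature reported or counted so far
        self.previous = Counter()  # counts from the previous window

    def process_window(self, messages: Iterable[str]) -> List[Tuple[str, str, int]]:
        """Return (kind, signature, count) for new or sharply rising errors."""
        current = Counter(signature(m) for m in messages)
        reports = []
        for sig, count in sorted(current.items()):
            if sig not in self.seen:
                reports.append(("new", sig, count))
            elif (count >= self.min_count
                  and count >= self.spike_factor * self.previous.get(sig, 0)):
                reports.append(("spike", sig, count))
        self.seen.update(current)
        self.previous = current
        return reports
```

An IRC reporter would call `process_window` once per interval with that interval's messages and echo only the returned tuples, so a steady stream of a known error stays quiet.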

We need to be careful about what is exposed, but all-in-all a nice web dashboard to show the details and an IRC bot to report trends and new ones could be quite useful.

The web dashboard should probably not be written from scratch (perhaps use logstash). If it also has an API to query trends and new errors, we can write an IRC reporter off of that.

This would either need to be run in production (proxied through fenari or whatever we do for things like graphite/gdash these days), or we'd need to replicate the necessary data to a wmflabs instance.
