Last modified: 2014-10-14 09:58:11 UTC
The Jenkins master on gallium is unable to connect to the deployment-cxserver01.eqiad.wmflabs instance to update the content translation code base (job is beta-cxserver-update-eqiad). When starting the Jenkins agent over ssh:

[10/08/14 08:41:58] [SSH] Opening SSH connection to 10.68.17.162:22.
java.io.IOException: There was a problem while connecting to 10.68.17.162:22
...
[10/08/14 08:42:01] [SSH] Connection closed.
[10/08/14 08:42:01] Launch failed - cleaning up connection
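(For reference, a quick TCP reachability check from the Jenkins master can rule out network problems before blaming the agent itself. This is only a minimal sketch: the host and port come from the log above, everything else is illustrative.)

    import socket

    HOST = "10.68.17.162"   # agent IP taken from the SSH launch log above
    PORT = 22
    TIMEOUT = 5             # seconds; arbitrary value for illustration

    try:
        # Open a plain TCP connection to see whether the SSH port answers at all.
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT) as sock:
            banner = sock.recv(64)
            print("Port reachable, banner:", banner.decode(errors="replace").strip())
    except OSError as exc:
        # Same class of failure Jenkins reports as java.io.IOException.
        print("Connection failed:", exc)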
The virt1005 compute node died overnight, might explain the issue.
The instance is hosted on virt1005, which died overnight. I have marked the Jenkins slave as offline: https://integration.wikimedia.org/ci/computer/deployment-cxserver01/ I attempted to reboot it via OpenStackManager but it does not come back. I guess the VM is corrupted.

Impact:
* the content translation server is not running for the beta cluster
* code updates for the content translation server are obviously not pushed :D

Moving to Infrastructure (corrupted VM apparently).
Link to instance information: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000421.eqiad.wmflabs
If this instance has any important data I can try to reclaim the drive contents. Otherwise you should just delete and recreate.
(In reply to Andrew Bogott from comment #4)
> If this instance has any important data I can try to reclaim the drive
> contents. Otherwise you should just delete and recreate.

Thank you for the time spent investigating the issue. I will check with the cxserver folks (Kartik and Niklas, added to cc) and see whether they need any data. Otherwise we will recreate it and update the relevant configuration files. It is fully puppetized AFAIK.
Creating an instance deployment-cxserver02:
Size: m1.medium
OS: Ubuntu Trusty
Security rules: default, cxserver

I.e. the same as deployment-cxserver01 used to be.
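(The instance was created through the wikitech/OpenStackManager interface, but purely for illustration, here is a rough python-novaclient sketch of booting an equivalent instance. The credentials, auth URL and image name are placeholders, not the real values; the flavor and security groups are the ones listed above.)

    from novaclient import client as nova_client

    # Placeholder credentials and endpoint -- purely illustrative values.
    nova = nova_client.Client("2", "USERNAME", "PASSWORD",
                              "deployment-prep", "http://AUTH_URL:5000/v2.0")

    # Flavor and security groups as listed above; the image name is an assumption.
    flavor = nova.flavors.find(name="m1.medium")
    image = nova.images.find(name="ubuntu-14.04-trusty")

    server = nova.servers.create(
        name="deployment-cxserver02",
        image=image,
        flavor=flavor,
        security_groups=["default", "cxserver"],
    )
    print("Requested instance:", server.id)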
Kart confirmed we can get rid of the instance. Since beta cluster is out of quota, that is convenient.
(Context:
> The virt1005 compute node died overnight, might explain the issue.
https://lists.wikimedia.org/pipermail/labs-l/2014-October/002982.html )
CRITICAL: deployment-prep.deployment-cxserver02.puppetagent.failed_events.value (100.00%)
(In reply to Greg Grossmeier from comment #9)
> CRITICAL:
> deployment-prep.deployment-cxserver02.puppetagent.failed_events.value
> (100.00%)

Yup, that is due to this bug. I wanted to acknowledge the alarm, but since the monitor is on the production Icinga I lack permissions to do so.
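(For anyone without Icinga access who wants to see the raw metric behind that alert, a hedged sketch against the Graphite render API; the Graphite host is an assumption, and the target path is simply taken from the alert text above.)

    import json
    import urllib.request

    # Assumed Graphite endpoint; the metric name comes from the alert above.
    GRAPHITE = "https://graphite.wikimedia.org"
    TARGET = "deployment-prep.deployment-cxserver02.puppetagent.failed_events.value"

    url = f"{GRAPHITE}/render?target={TARGET}&from=-24h&format=json"

    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)

    # Each series holds [value, timestamp] pairs; print the most recent non-null ones.
    for s in series:
        recent = [(ts, v) for v, ts in s["datapoints"] if v is not None]
        print(s["target"], recent[-5:])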
(In reply to Antoine "hashar" Musso from comment #10)
> (In reply to Greg Grossmeier from comment #9)
> > CRITICAL:
> > deployment-prep.deployment-cxserver02.puppetagent.failed_events.value
> > (100.00%)
>
> Yup that is due to this bug. I wanted to acknowledge the alarm, but since
> the monitor is on the production Icinga I lack permissions to do so.

Yuvi: Halp? How should we address this?
So puppet fails on cxserver02 because it tries to create an LVM volume and fails (/mnt, I think), leading to cascading failures (among which this is one, I presume). ^d ran into the same problem on his new ES box there as well, I think. I'll investigate in a bit, but andrewbogott/coren/others feel free to take this as well...
(In reply to Yuvi Panda from comment #12)
> So puppet fails on cxserver02 because it tries to create an LVM volume and
> fails (/mnt, I think), leading to cascading failures (among which this is
> one, I presume). ^d ran into the same problem on his new ES box there as
> well, I think.
>
> I'll investigate in a bit, but andrewbogott/coren/others feel free to take
> this as well...

Yup, I have filed another bug for that under Labs > Infrastructure: https://bugzilla.wikimedia.org/show_bug.cgi?id=71873
https://cxserver-beta.wmflabs.org/ has been rebuilt and is working fine (on deployment-cxserver03), so closing this now.