Jenkins is down

SethTisue · February 1, 2018, 8:46pm

scala-ci.typesafe.com is down while we deal with a security issue. PR validation in the scala/scala repo won’t be available until it’s back up. (ETA: ???)

mghildiy · February 3, 2018, 5:11am

Any idea when it is going to be up?

SethTisue · February 5, 2018, 4:51pm

Rather than just bringing everything back up exactly as it was, Adriaan is using the downtime as an opportunity to make some changes to how we manage and configure Jenkins. Not sure exactly how long it will take, but we are certainly aiming for days and not weeks.

adriaanm · February 6, 2018, 10:43pm

tl;dr: it’s a matter of a day or two until Jenkins is back, I hope

Back story: our old Jenkins setup was stuck at a version with known vulnerabilities (and automatic exploits that are popular in the crypto currency miner^Wfreeloader community…).

We’re very close to upgrading to Jenkins 2, but we have to rewrite our jobs because the job dsl plugin was dropped :-/

Sneak preview (currently locked down until we harden the config further):

Alas, the 90s chic look was preserved for 2.0.

mghildiy · February 11, 2018, 12:07pm

Can PRs be submitted now?

adriaanm · February 12, 2018, 11:29am

Yes. We’re still fine tuning to take advantage of the new features in Jenkins 2, but CI is mostly operational. Thank you for your patience!

SethTisue · February 13, 2018, 6:14pm

PR validation seems to be working on new PRs. On existing PRs, sometimes /rebuild works, maybe not always? If not, rebasing your PR is worth a try to see if it brings Scabot back to its senses. If neither tactic helps, please make noise on the PR.

adriaanm · February 13, 2018, 8:48pm

If there’s no failed commit, /rebuild does nothing. Use /synch to have Scabot look at Jenkins for the expected jobs and fail the build if none are found. Then /rebuild will actually do something.
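To make the interaction between the two commands concrete, here is a minimal sketch in Python. It is not Scabot's actual code: the `PullRequest` class, the `synch` and `rebuild` functions, and the status strings are all hypothetical names invented for illustration, and the model assumes a single job per commit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

SUCCESS, FAILURE = "success", "failure"  # hypothetical status labels


@dataclass
class PullRequest:
    # commit sha -> status last reported on the PR (None = nothing reported yet)
    statuses: Dict[str, Optional[str]]


def synch(pr: PullRequest, jenkins_jobs: Dict[str, str]) -> None:
    """Compare each commit with what Jenkins knows (sha -> job result).

    A commit with a successful job is marked as passing; a commit with no
    job, or no successful one, is marked as failed.
    """
    for sha in pr.statuses:
        result = jenkins_jobs.get(sha)
        pr.statuses[sha] = SUCCESS if result == SUCCESS else FAILURE


def rebuild(pr: PullRequest) -> List[str]:
    """Return the commits that would be rebuilt: only those marked failed."""
    return [sha for sha, status in pr.statuses.items() if status == FAILURE]
```

In this model, a PR whose commit never got a status yields an empty list from `rebuild`, which is the "does nothing" case. Calling `synch` first, with no matching Jenkins job, marks the commit as failed, and only then does `rebuild` have something to pick up.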

mghildiy · February 14, 2018, 2:30am

I do an empty commit:

git commit --allow-empty -m "message"

Jasper-M · February 14, 2018, 12:34pm

/synch
/rebuild
/synch

worked for me.

NthPortal · February 14, 2018, 8:24pm

Jenkins seems to be working fine, except that it doesn’t report statuses back when it finishes without a /synch (at least for me)

adriaanm · February 14, 2018, 9:38pm

For new builds, this shouldn’t happen anymore. If you see this again, please post a link to the PR and I’ll investigate!

Normally, you should see a log entry that jenkins notified scabot in the console output (in the beginning):

Notifying endpoint with url 'http://scala-ci.typesafe.com:8888/jenkins'

I just looked, and this is the case for https://scala-ci.typesafe.com/job/scala-2.13.x-validate-main/104/console (for your PR at https://github.com/scala/scala/pull/6319). Scabot is definitely limited in how it handles weird states that arise while jenkins is available intermittently. I made /synch slightly more aggressive so that it will fail commits that don’t have a corresponding successful jenkins job.

cheers
adriaan

NthPortal · February 15, 2018, 5:10am

I actually noticed this behaviour for https://github.com/scala/scala/pull/6323, which was PRed after Jenkins was back up, and did not sync automatically. Looking at recent PRs, this one was only opened a couple hours ago, and hasn’t been marked as passing even though its build has succeeded.

lrytz · February 15, 2018, 8:54am

https://github.com/scala/scala/pull/6326 is a fresh PR from 10 hours ago, CI passed, but the status was not synched back to GitHub

adriaanm · February 15, 2018, 10:15am

Thanks, I’ll see what I can dig up in the logs.

adriaanm · February 15, 2018, 10:56am

I can’t reproduce, but I’ve added some logging to scabot when it receives an unhandled jenkins event.

SethTisue · February 22, 2018, 6:10pm

summary of current situation:

old PRs should be rebased
/rebuild doesn’t seem to be necessary anymore (after rebasing)
PR validation snapshots are being published successfully (enabling e.g. running the community build against a PR)
but /synch is still usually (always?) necessary to get a successful result reflected on the PR page. The cause is elusive

NthPortal · March 3, 2018, 4:59am

Builds seem to stall (maybe >= 10% of the time, in my experience?), usually in the middle of running ScalaCheck tests.

An example is this build, for scala/scala#6360 (which I won’t rebuild yet, in case it’s helpful for debugging).

adriaanm · March 4, 2018, 8:25pm

Thanks for reporting! Tracking here: https://github.com/scala/scala-jenkins-infra/issues/249
I’ve noticed other jobs taking a long time to complete, but no deadlocks so far… No idea, maybe the increased parallelism is revealing some underlying bugs?

adriaanm · March 4, 2018, 8:28pm

In other news, I finally figured out why Scabot wasn’t (always) reporting job completion. The Jenkins notification plugin had started sending notifications (300k JSON blobs!) that exceeded the default Play entity length limit (100kb), so the request never made it to my code that was logging incoming JSON. Should be resolved now.
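The failure mode is easy to reproduce in miniature. The sketch below is plain Python, not Play and not Scabot's code: `handle_notification`, `DEFAULT_MAX_LENGTH`, and the 413 return value are assumptions chosen for illustration. It only shows why a size check that runs before the handler hides oversized requests from any logging inside the handler.

```python
import json

# The 100kb default mentioned above; the constant name is made up for this sketch.
DEFAULT_MAX_LENGTH = 100 * 1024


def handle_notification(body: bytes, log: list,
                        max_length: int = DEFAULT_MAX_LENGTH) -> int:
    """Return an HTTP-like status code; record the parsed JSON on success."""
    if len(body) > max_length:
        # Rejected before parsing, so the logging below never sees the request.
        return 413
    log.append(json.loads(body))  # stands in for "log incoming JSON"
    return 200
```

With the default limit, a small notification is logged and accepted, while a 300k one is rejected without leaving any trace in `log`. Passing a larger `max_length` lets it through. In Play itself the limit is, as far as I know, raised through the `play.http.parser.maxMemoryBuffer` setting or a per-action `maxLength` on the body parser, but check the docs for the Play version in use.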