Jenkins is down

scala-ci.typesafe.com is down while we deal with a security issue. PR validation in the scala/scala repo won’t run until it’s back up. (ETA: ???)

Any idea when it is going to be up?

Rather than just bringing everything back up exactly as it was, Adriaan is using the downtime as an opportunity to make some changes to how we manage and configure Jenkins. Not sure exactly how long it will take, but we are certainly aiming for days and not weeks.

tl;dr: it’s a matter of a day or two until Jenkins is back, I hope

Back story: our old Jenkins setup was stuck at a version with known vulnerabilities (and automatic exploits that are popular in the crypto currency miner^Wfreeloader community…).

We’re very close to upgrading to Jenkins 2, but we have to rewrite our jobs because the job dsl plugin was dropped :-/

Sneak preview (currently locked down until we harden the config further):

Alas, the 90s chic look was preserved for 2.0.

Can PRs be submitted now?

Yes. We’re still fine-tuning to take advantage of the new features in Jenkins 2, but CI is mostly operational. Thank you for your patience!

PR validation seems to be working on new PRs. On existing PRs, sometimes /rebuild works, maybe not always? If not, rebasing your PR is worth a try to see if it brings Scabot back to its senses. If neither tactic helps, please make noise on the PR.

If there’s no failed commit, /rebuild does nothing. Use /synch to have scabot look at jenkins for the expected jobs and fail the build if none are found. Then /rebuild will actually do something.

I do an empty commit:

git commit --allow-empty -m "message"

/synch
/rebuild
/synch

worked for me.

Jenkins seems to be working fine, except that it doesn’t report statuses back when it finishes without a /sync (at least for me)

For new builds, this shouldn’t happen anymore. If you see this again, please post a link to the PR and I’ll investigate!

Normally, you should see a log entry that jenkins notified scabot in the console output (in the beginning):

Notifying endpoint with url 'http://scala-ci.typesafe.com:8888/jenkins'

I just looked, and this is the case for https://scala-ci.typesafe.com/job/scala-2.13.x-validate-main/104/console (for your PR at https://github.com/scala/scala/pull/6319). Scabot is definitely limited in how it handles weird states that arise while jenkins is available intermittently. I made /synch slightly more aggressive so that it will fail commits that don’t have a corresponding successful jenkins job.
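The more aggressive /synch rule described above can be sketched roughly as follows. This is only an illustration of the stated rule, not Scabot's actual code: the `JenkinsJob` type, the `synch_statuses` function and the status strings are made up for the example, and it assumes a single job per commit for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical type for illustration; this is not Scabot's real data model.
@dataclass(frozen=True)
class JenkinsJob:
    commit_sha: str
    succeeded: bool


def synch_statuses(commit_shas: List[str], jobs: List[JenkinsJob]) -> Dict[str, str]:
    """Mark a commit "success" only if some Jenkins job for it succeeded.

    Every other commit (job failed, or no job found at all) is marked
    "failure", which is what gives a later /rebuild something to act on.
    """
    passed = {job.commit_sha for job in jobs if job.succeeded}
    return {sha: ("success" if sha in passed else "failure") for sha in commit_shas}


# Two commits on a PR; Jenkins only knows about a successful job for the first.
statuses = synch_statuses(
    ["abc123", "def456"],
    [JenkinsJob("abc123", succeeded=True)],
)
print(statuses)
```

Taken literally, this rule also fails a commit whose job is still running; whether the real implementation treats that case differently is not stated here.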

cheers
adriaan

I actually noticed this behaviour for https://github.com/scala/scala/pull/6323, which was PRed after Jenkins was back up, and did not sync automatically. Looking at recent PRs, this one was only opened a couple hours ago, and hasn’t been marked as passing even though its build has succeeded.

https://github.com/scala/scala/pull/6326 is a fresh PR from 10 hours ago, CI passed, but the status was not synched back to GitHub

Thanks, I’ll see what I can dig up in the logs.

I can’t reproduce, but I’ve added some logging to scabot when it receives an unhandled jenkins event.

summary of current situation:

  • old PRs should be rebased
  • /rebuild doesn’t seem to be necessary anymore (after rebasing)
  • PR validation snapshots are being published successfully (enabling e.g. running the community build against a PR)
  • but /synch, to get a successful result reflected on the PR page, is still usually (always?) necessary. the cause is elusive

Builds seem to stall (maybe >= 10% of the time, in my experience?), usually in the middle of running ScalaCheck tests.

An example is this build, for scala/scala#6360 (which I won’t rebuild yet, in case it’s helpful for debugging).

Thanks for reporting! Tracking here: https://github.com/scala/scala-jenkins-infra/issues/249
I’ve noticed other jobs taking a long time to complete, but no deadlocks so far… No idea, maybe the increased parallelism is revealing some underlying bugs?

In other news, I finally figured out why Scabot wasn’t (always) reporting job completion. The Jenkins notification plugin had started sending notifications (300 KB JSON blobs!) that exceeded the default Play entity length limit (100 KB), so those requests never made it to my code that logs incoming JSON. Should be resolved now.
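To illustrate why nothing showed up in the logs, here is a minimal sketch of a handler that enforces a body size limit before parsing. It is not Play's body parser or Scabot's code: `handle_notification`, `MAX_BODY_BYTES` and the payload fields are assumptions made up for the example, with the limit set to the 100 KB figure mentioned above.

```python
import json

# Stand-in for the 100 KB default limit mentioned above (assumed value).
MAX_BODY_BYTES = 100 * 1024


def handle_notification(raw_body: bytes, max_bytes: int = MAX_BODY_BYTES):
    """Return (http_status, parsed_json_or_None).

    Oversized bodies are rejected with 413 (Payload Too Large) before any
    parsing or logging happens, so application code never sees them.
    """
    if len(raw_body) > max_bytes:
        return 413, None
    return 200, json.loads(raw_body)


# Hypothetical payloads: a small one, and one of roughly 300 KB.
small = json.dumps({"build": {"phase": "COMPLETED"}}).encode()
large = json.dumps({"build": {"log": "x" * 300_000}}).encode()

print(handle_notification(small)[0])                             # small body is accepted
print(handle_notification(large)[0])                             # oversized body is rejected
print(handle_notification(large, max_bytes=512 * 1024)[0])       # accepted once the limit is raised
```

The general fix is either to raise the limit for that endpoint or to make the sender's payload smaller; which of the two was done here is not stated.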
