While not directly related to our switch from Jenkins to GitHub Actions, one of the first improvements we made to the build after the switchover was to help developers diagnose build timeouts. In this post, I’ll talk about how we used a little command-line magic and the upload-artifact action to capture thread dumps of stuck builds.
In the Bad Old Days of 8-hour Jenkins builds, we would commonly have tests get stuck in some way. This was typically a “test problem” (usually in the shutdown logic), but occasionally we had real deadlocks or livelocks that only sporadically surfaced. Back in those days, the only diagnostic information we had was the Gradle stdout from the build log. Sometimes the log offered a clue, such as an exception, but usually you had to manually work out which tests had started but never completed.
Once you knew which test had gotten stuck, you would try to reproduce the issue locally. Due to differences between the CI environment and developers’ local environments, reproducing these failures was next to impossible. And well… you know how this typically goes.
Issue State Changed: Closed (Can’t Reproduce)
On rare occasions, a developer might be able to reproduce the problem or stare extra hard at the code and manifest the bug out of hiding. More commonly, we would ignore the timeout and simply try the build again.
To properly investigate a stuck build, we need a thread dump.
Enter the Linux timeout command. This utility simply runs another command with a time limit. If the command exceeds the limit, timeout exits with code 124; otherwise, you receive the exit code of the wrapped command. This lets us manage timeouts explicitly rather than relying on the CI framework.
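As a quick illustration of that behavior (not part of our build, just a sketch you can run in any shell with GNU coreutils):

timeout 10s sleep 1   # sleep finishes within the limit, so we get sleep's exit code
echo $?               # 0
timeout 1s sleep 10   # sleep is killed after 1 second
echo $?               # 124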
The approach for capturing thread dumps in Apache Kafka’s CI has two parts. First, we run a bash script in the background that produces a thread dump for all Gradle test worker processes after a given duration (the timeout minus 5 minutes).
SLEEP_MINUTES=$(($TIMEOUT_MINUTES-5))
echo "Dumping threads in $SLEEP_MINUTES minutes"
sleep $(($SLEEP_MINUTES*60));
echo "Timed out after $SLEEP_MINUTES minutes. Dumping threads now..."
mkdir thread-dumps
sleep 5;
for GRADLE_WORKER_PID in $(jps | grep GradleWorkerMain | awk -F" " '{print $1}');
do
  FILENAME="thread-dumps/GradleWorkerMain-$GRADLE_WORKER_PID.txt"
  echo "Dumping threads for GradleWorkerMain pid $GRADLE_WORKER_PID into $FILENAME";
  jstack $GRADLE_WORKER_PID > $FILENAME
  # Only keep dumps from workers that are actually running Kafka code
  if ! grep -q "kafka" $FILENAME; then
    echo "No match for 'kafka' in thread dump file $FILENAME, discarding it."
    rm $FILENAME;
  fi;
  sleep 5;
done;
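If you want to try the script outside of CI first, something like the following should work, assuming you save it as thread-dump.sh (the name used in the step below) and supply TIMEOUT_MINUTES via the environment, as the script’s first line implies:

export TIMEOUT_MINUTES=10   # thread dump fires after 5 minutes (10 - 5)
chmod +x thread-dump.sh
./thread-dump.sh &          # run the watcher in the background
./gradlew test              # run the tests as usual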
Next, the main Gradle command is run with the timeout utility to ensure it gets terminated in time. The whole step looks a bit like this:
shell: bash
id: junit-test
run: |
  set +e
  thread-dump.sh &
  timeout 180m ./gradlew test
  exitcode="$?"
  echo "gradle-exitcode=$exitcode" >> $GITHUB_OUTPUT
Now we have both enforced a timeout for the Gradle execution and ensured that we produce a thread dump before the process is killed. Using the output from this “junit-test” step, we can add conditional logic later in the build. At the moment, the only conditional step we run using this output is “upload-artifact”, which archives the thread dumps.
name: Archive Thread Dumps
id: thread-dump-upload-artifact
if: steps.junit-test.outputs.gradle-exitcode == '124'
uses: actions/upload-artifact@v4
with:
  name: junit-thread-dumps-${{ matrix.java }}
  path: thread-dumps/*
  compression-level: 9
  if-no-files-found: ignore
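When a run does time out, a developer can grab the dumps from the run’s artifacts through the web UI or with the GitHub CLI. A rough sketch, where the run ID and the Java version suffix are placeholders:

gh run download 1234567890 --name junit-thread-dumps-17 --dir dumps
less dumps/GradleWorkerMain-*.txt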
We quickly started seeing the benefits after making this improvement. Here are a few issues that were found (and fixed!) within a few weeks of this change.
https://issues.apache.org/jira/browse/KAFKA-17680
https://issues.apache.org/jira/browse/KAFKA-17766
https://issues.apache.org/jira/browse/KAFKA-17886
In all these cases, we would have been hard-pressed to track these problems down without thread dumps from the CI environment.
In addition to fixing quite a few test issues, we have all but eliminated timeouts from our CI system. Nice!