A long-term roadmap for the core design of Concourse, a general-purpose CI/CD tool.

Accompanying slides. Recommended viewing: episode 1 of Yu-Gi-Oh.

Concourse's design philosophy is to be expressive, versatile, and safe while limited to a handful of simple, proven concepts. The design of these concepts should make good practices feel intuitive and bad practices feel uncomfortable.

Coming up with these designs can be very challenging. There are many different workflows and patterns across the software industry, and they each have to be deeply understood in order to know what the good and bad practices are.

This post provides a bit of insight into what we've been up to with Concourse's core design - chiefly regarding 'spaces', which has become a bit of a white whale on our roadmap.

There are a lot of words here - sorry! If you just want to skim, I've added a single-paragraph summary under each roadmap entry.

Each roadmap entry corresponds to an RFC or an issue, linked in their header. If you want to get involved in our design process or just provide feedback, please check them out and submit a PR review! (Thanks!)

Table of contents

  1. Where is 'spaces'?
  2. Where are we now?
  3. Where are we going?
  4. What comes after all this?
  5. Thanks!

Where is 'spaces'?

For those of you not familiar with spaces, it was a big ol' feature that enabled the following workflows:

  • Dynamically running against things like branches/pull requests, which change over time (i.e. commits to a branch) and space (i.e. the set of branches themselves). Hence the name 'spaces.'
  • Fanning in using passed constraints across spaces. This is currently impossible to do with separate pipelines, because pipelines can't reference each other's resources.
  • Automatically cleaning up spaces for closed PRs, etc. This is annoying to automate and requires keeping track of state.

These workflows still make sense, so why is 'spaces' dead?

Well, I approached it the wrong way. To me, the idea of resources tracking change over time and space felt pretty solid from a theoretical standpoint. In hindsight, maybe it just sounded cool.

I had no reservations baking 'spaces' in to every layer of the stack - it would add more depth to all the existing ideas. Everything was going to change: the resource interface, the web UI, how jobs work... It was all so exciting!

But as time went on it became terrifying. It was a double-or-nothing bet. Either 'spaces' made sense everywhere, or 'spaces' didn't make sense at all. I tried to carve out work that could be done before fully committing to spaces, but it didn't make the monolithic feature any less monolithic.

me vs the space dragon

Where are we now?

First off, I want to give a quick update on a couple of big things that you can expect in v6.0:

Issue #3602: a new algorithm

We are re-designing the algorithm used for determining the candidate input versions for a job. The new approach will rely less on brute force and will perform better with large installations.

This new algorithm fixes long-standing architectural issues with the old one, which loaded each pipeline's entire build and resource version history into memory and determined valid candidates using brute force.

The key difference between the old and new algorithm is how passed constraints are implemented, specifically when multiple inputs depend on the same job:

plan:
- get: foo
  passed: [foo-unit, integration]
- get: bar
  passed: [bar-unit, integration]
- get: baz
  passed: [integration]

In Concourse, this means "give me versions of foo, bar, and baz that have passed through integration together in the same build, with the same version of foo having passed foo-unit and the same version of bar having passed bar-unit."

How does this work? Well, it's hard to describe either algorithm succinctly, but I'll try:

  • The old algorithm goes through resource versions, newest first, and checks whether each version satisfies the input's own passed constraints. Next it checks that any other already-chosen input versions which mention the same job in their passed constraints also came from the same build, recursing and walking through versions until everything is satisfied. This process is brute-force, an uses a lot of CPU.
  • The new algorithm instead loops over build output version sets via the jobs listed in each passed constraint, assigning all the relevant versions for a given build at once as long as the versions match the other already-chosen versions assigned via builds of prior jobs in the passed constraint.

This new approach really simplifies things because the versions are inherently coming from the same build. Now that we don't have to do the extra cross-referencing, the new flow can just make a handful of cheap database queries instead of having to load the whole pipeline's dataset into memory.

We've been testing the old and new algorithm in two separate environments, each at the scale of 1,000 jobs with varying passed constraints and a sprinkle of version: every across four web nodes.

  • The old algorithm starts off very fast but grows slower and slower as the pipeline dataset grows, eventually exhausting the web nodes of RAM and swap.
  • The new algorithm starts off slightly slower than the old one - it's hard to beat an in-memory dataset - but it stays stable, uses less CPU, and does not leak memory.

We're making a few final touches as we to get as much performance out of the new algorithm as possible, since we don't tend to touch it often. Once we're finished, we'll jump straight to...:

Issue #413: build re-triggering

The new algorithm changes the behavior for today's pinning-based flow for re-triggering a build, so we're going to implement proper support for build re-triggering and ship these two features together in v6.0.

Right now the only way to "re-trigger" a build is to pin each of the job's upstream resources to the version from the build, trigger a new build, and go back and un-pin them all. It's pretty cumbersome and error-prone.

It also kind of breaks with the new algorithm. Now that the new algorithm is based on build output sets and not version history, once the new build succeeds its older versions will end up being the first set attempted for that job, potentially propagating them to downstream jobs.

That's not what I would expect from a re-trigger. I would expect a re-trigger to act "in-place," while preserving the logs of the original failure for posterity.

To avoid this surprising change in behaviour, we're going to implement build re-triggering properly and stop abusing the version pinning feature, which was originally designed for temporarily pinning a broken upstream dependency to a "known good" one.

Build re-triggering will be implemented in a way that preserves the order of the builds that the algorithm will go over. If the re-triggered build succeeds, its set of outputs will be available to downstream jobs based on the original build's order.

Another benefit to implementing re-triggering soon is that folks using a pull request resource will have a much easier time re-triggering failed pull requests, without having to wait on the rest of the roadmap (i.e. 'spaces').

Where are we going?

So, going back to the 'spaces' initiative. The pieces really started to fall into place over the past few months, and I think I've arrived at a roadmap that accomplishes all of the goals of 'spaces' but in a significantly more Concourse-y way.

Instead of one monolithic feature, I have a bunch of smaller features to propose that are independently valuable and can be delivered in any order. As we complete them, a bigger picture will start to take shape.

Let's jump right in!

RFC #24: resources v2

Resources v2 is the first major revision of the resource interface since Concourse's inception. It's a step to take very carefully. I think we're finally ready to go.

The v2 interface brings long-awaited polish to the interface: it renames in and out to get and put to match their step names, introduces a delete action, standardises TLS configuration, and revises terminology so as to not be coupled to the 'versioned artifacts' use case.

The latest proposal for Resources v2, RFC #24, is a lot like RFC #1 but with one big difference: 'spaces' is no longer a foundational piece of the interface. Instead, RFC #24 proposes that we generalize and simplify the interface to an extent that it can be used for various pipeline workflows, not just versioning artifacts.

The new direction is to leverage composition between resources and pipelines via config fragments, which can be passed from one resource to another or used for ((vars)) in a pipeline template. 'Config fragments' replace 'versions' in the interface, and are used as versions for the 'versioned artifacts' flow (today's primary use of resources).

By generalizing the resource concept we set the stage for proper pipeline-level support for notifications (e.g. Slack alerts, GitHub commit status), trigger-only resources (e.g. time), and spatial resources (e.g. branches, pull requests) without tying each use case into the interface itself.

Now that 'spaces' is gone from the interface, the actual change in the interface protocol is somewhat cosmetic. As a result, Concourse pipelines will be able to use v1 and v2 resources side-by-side for all the same functionality. This way we can move forward with pipeline-level resource features without fragmenting the resource ecosystem!

RFC #26: artifact resources

Artifact resources are an interpretation of the generic resource interface that maps to today's usage of the resource interface.

Artifact resources use config fragments as versions, modeling change to an external entity over time. This should sound familiar to anyone using Concourse today: they're the sole use case that Concourse resources were originally designed around.

The 'artifact resources' proposal clarifies that this is now just one use case for the general resource interface, and outlines a few long-awaited features:

  • Versions can be deleted using the delete action in the resource interface.
  • The put action can emit multiple versions. Each will be recorded as an output of the build.
  • The automatic get after the put step will be made opt-in. (Huzzah!)

The automatic get after each put is something that has confused and occasionally frustrated users, but we didn't want to break backwards compatibility and we didn't want users to have to 'opt out' (that's too many knobs to turn).

This RFC will provide a backwards-compatible transition path to artifact resources. Check out RFC #26 for more details!

RFC #31: set_pipeline step

The first step on our journey towards 'spaces' is to introduce a simple, but critical piece of the puzzle: a set_pipeline step.

The set_pipeline step is used like so:

jobs:
- name: bootstrap
  plan:
  - get: ci
    trigger: true
  - set_pipeline: concourse
    file: ci/pipelines/concourse.yml

This job will configure a concourse pipeline within the job's team. The pipeline will be automatically unpaused, and no authentication is required.

The first thing this lets us do is deprecate the concourse-pipeline resource, which has two pretty fundamental problems:

  • Having to configure auth is really awkward - you have to set up a local user and give the resource the keys to the kingdom.
  • Keeping the version of fly within the resource in sync with your own Concourse's version is a bit clunky.

With the set_pipeline step, both of these problems immediately go away and pipelines start to feel a more first-class rather than just being the tip of the abstraction iceberg.

RFC #32: projects

Ok, I promised to provide a tl;dr for each roadmap entry, but projects can't really be summed up that easily. This is the most impactful feature on this roadmap.

  • A "project" is a new concept bootstrapped by two existing ones: a resource from which to continuously load the project's config, which specifies a build plan to execute whenever the project resource changes.
  • Projects act as a namespace for pipelines, and provide a long-requested workflow for automating their configuration. As the roadmap goes on, this workflow becomes more and more powerful.
  • Projects allow you to define project-wide resources which let you clean up duplicate definitions across your pipelines and support cross-pipeline passed constraints.
  • Projects also define project-wide tasks, which remove the need to thread a resource through all your jobs just to have the task configs to execute, and finally gives meaning to task names (the x in task: x).

A project's build plan can be used for anything you want. Small projects could use the build plan to run tests and/or perform various steps in a single build - a workflow more familiar to users of other CI systems:

name: ci
plan:
- get: booklit
  trigger: true
- task: unit

Larger projects could use the build plan to execute set_pipeline steps. Concourse has long encouraged users to keep their pipelines under source control, but it never enforced it: fly set-pipeline was still a manual operation, and users would often forget to check in their changes. Projects will fix that:

name: ci
plan:
- set_pipeline: booklit

Small projects may start without pipelines and start using pipelines as they grow. Our original slogan, 'CI that scales with your project,' is now pretty literal! The hope is that by introducing build plans without requiring knowledge of pipelines and jobs, we'll have made Concourse's learning curve more gradual and made Concourse feel less overkill for side-projects.

This feature will have far-reaching implications for Concourse, so it won't be sneaking in quietly. I've opened RFC #32 and would really appreciate feedback!

RFC #33: archiving pipelines

Archiving pipelines is a way to soft-delete a pipeline while still being able to peruse the build history for a pipeline you no longer want.

Well, after that bombshell this one's pretty easy to explain. Let's take a look at our own Concourse team's pipelines:

Look at all that cruft! So many old, paused or bit-rotting pipelines which I really don't care about anymore but don't really have the heart to delete. That old-concourse pipeline served us well for years - it has sentimental value. In some cases you may also want to keep the history around for auditing purposes.

Archiving pipelines will allow you to humanely retire a pipeline in a way that gets it out of your way while still allowing you to peruse the build history should you ever need to. Archived pipelines are no longer active and will allow you to re-use their name without bringing the old pipeline back.

There's already an open pull request for this: #2518 - shout-out to @tkellen! The ball has been in our court for a while to figure out the UI/UX, so we're just going to submit a new RFC and work out all the details.

RFC #34: instanced pipelines

Instanced pipelines group together pipelines which share a common template configured with different ((vars)). They provide a simple two-level hierarchy and automatic archiving of instances which are no longer needed.

Instanced pipelines are an important piece of the 'spaces' puzzle: it's how users will navigate through their spatial pipelines, and it's what keeps no-longer-relevant spaces for e.g. merged PRs and deleted branches from piling up forever.

Pipeline instances are created using the set_pipeline step like so:

plan:
- set_pipeline: branch
  instance_vars:
    branch: feature/projects
- set_pipeline: branch
  instance_vars:
    branch: feature/new-algorithm

At the end of a build which uses set_pipeline, all instances of the named pipelines which were not configured by the build will be automatically archived.

Check out RFC #34 for more details!

RFC #29: spatial resources

Spatial resources are resources whose check monitors spatial change, not change over time. Two common examples are the set of branches or open pull requests for a repo. The across step allows a build to process each 'space' and trigger on changes to the set.

When used with the new across step, the  set_pipeline step, and instanced pipelines, this enables dynamic pipeline configuration across spatial change.

The final piece of the puzzle for 'spaces' is the addition of an across step. This step points to a resource and has a plan which will execute for every config fragment returned by the resource's check, all within one build.

Let's first look at a simple use case, which is to execute a task across many variants:

plan:
# ...
- across: supported-go-versions
  as: go
  do:
  - task: unit
    image: go

In this case, imagine we have a supported-go-versions resource whose check returns a config fragment for each tag and digest based on a pre-configured list of supported tags (e.g. 1.10, 1.11, 1.12), and whose in/get fetches the image.

When nested, the across step enables dynamic build matrices:

plan:
# ...
- across: supported-go-versions
  as: go
  do: # needed so we can define another 'across'
  - across: other-things
    as: some-input
    task: unit
    image: go

When used with set_pipeline and instanced pipelines, it enables dynamic pipeline matrices:

plan:
- across: repo-branches
  as: repo-branch
  set_pipeline: branch
  instance_vars:
    branch_name: ((repo-branch.name))

(Assuming we provide the ability to access fields of an artifact with ((vars)).)

RFC #27: trigger resources

Trigger resources allow jobs to specify parameters that can trigger new builds but don't have anything to fetch - they just propagate config fragments to the build.

This is also a relatively simple feature, but it will improve today's usage of the time resource by having per-job trigger semantics rather than having all jobs downstream of one time resource leading to a thundering herd of builds hitting your workers all at once.

Rough sketch:

jobs:
- name: smoke-test
  plan:
  - param: 10m
    trigger: true

Semantically, param is similar to get but with one key difference: there is no central version history. Rather than being used as an artifact, the resource is used solely for its config fragments. Concourse will check against the job's last used config fragment for the trigger resource, 10m, and if a different fragment is returned the job will trigger with the new one.

This skips the get, eliminates the thundering herd issue (because all jobs have their own interval), and could enable an interesting pattern for manually-parameterized builds: just write a resource type that can fetch user-provided config fragments from some external source (i.e. a repo).

Here's one idea of what that may look like, where the config fragments returned by param are somehow usable with ((vars)) syntax in subsequent steps:

plan:
- param: environment
- task: smoke-test
  vars:
    environment: ((environment.name))

Another interesting use case would be to use it as a instance_fragment with the set_pipeline step.

This idea is pretty half-baked - I've been mainly focusing on the 'spatial resources' idea. Follow along in the RFC #27 and help the idea develop!

RFC #28: notification resources

Notification resources will allow you to monitor the flow of a resource through your pipeline and emit build status notifications (e.g. Slack alerts, GitHub commit status) without having to sprinkle put steps all over your pipeline.

Have you ever wanted to reflect your CI build status on your GitHub commits? Or send a Slack notification whenever the build is fixed or broken?

If so, you're probably aware of how ugly it can make your pipelines, both in YAML and in the UI.

A simple pipeline quickly turns into a mess of boxes and lines:

a simple pipeline before and after notifications were added

Not only is it a lot of manual work to copy-paste those on_success and on_failure hooks, when you finally configure it it really ruins the signal-to-noise ratio of the pipeline UI.

So, the plan for notification resources is to leverage composition, a pattern set forth in the Resources v2 RFC (#24). Instead of annotating every single job, you annotate a resource, and any time that resource is used in a build a notification will be fired, by executing the notification resource's put step with the config fragment of the original resource (e.g. ref: abcdef) and the status of the build.

This way you don't have to update all of your jobs, and notifications don't clutter up the pipeline UI. Neato!

This idea is also a bit half-baked - follow along in RFC #28 when you have time!

What comes after all this?

I dunno.

I have a lot of respect for software that is eventually considered 'done.' I would really like Concourse's core design to achieve that someday.

We'll always have things to improve, whether it's through better efficiency, better UX, support for new underlying technologies (Kubernetes, Nomad), or just making our codebase more accessible for contributors. But from a core design standpoint, I think the most important thing is stability.

The software industry changes quickly. Hot new tools show up all the time and get replaced by newer and better tools. I don't want our users to have to keep re-doing their CI stack just to keep up.

Concourse should insulate projects from the constant churn in the industry by providing a solid set of principles and abstractions that hold true regardless of the underlying technology.

We will continue to listen to user feedback and improve Concourse. Our goal is for it to support good patterns and prevent anti-patterns that we can identify in workflows across the industry. Thankfully patterns don't change as frequently as tools do.

Thanks!

Everything I've outlined here comes from years of feedback through all of your GitHub issues, forum posts, and conversations in Discord (or Slack for you OGs). I'm very thankful for those of you that have stuck around and helped us understand your workflows, and I'm especially grateful for your patience.

For those of you who couldn't wait and ultimately had to switch tools, I hope we accomplished one of our original goals, and I hope to see you back in the future!