As part of our effort to open up the development processes of the Venice project, we are announcing today the publication of our latest stable release, 0.4.85!
This post is not about the details of that specific release. Rather, it sheds light on the release process itself, and on what it means for us to call a release stable.
In terms of cadence, we typically cut releases about twice a week, but not all of them go sitewide at LinkedIn. Out of the three services, the controllers are deployed most frequently, followed by routers and finally servers. The first two are stateless, which makes them easy to upgrade and roll back quickly, whereas servers are stateful and thus carry more inertia.
Before a release gets deployed sitewide, we go through many steps:
We deploy to our certification environment, where we run tests that are difficult to carry out in the regular test suite. This includes performance, chaos, longevity and compatibility testing.
If certification is satisfactory, we progress to the staging environment, which is not intended for us to test Venice, but rather for our internal users to test their own applications. Still, we occasionally find issues at this step.
After staging we hit prod, starting with canaries (1 node per cluster) in a single region. We bake the release by performing additional checks (including load tests, EKG and Dyno) in this environment, to capture the idiosyncrasies of our diverse workloads. If the canaries are successful, we then roll out to entire clusters, still in just one region. The fully released region bakes some more, and if no issues are discovered, we finally deploy to the other regions as well.
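The progression described above can be pictured as an ordered list of gates, where a release only advances if the checks at the current stage pass. The following is a minimal sketch in Python, not our actual deployment tooling: the `Stage` names, the `furthest_stage` function and the shape of the `checks` mapping are all hypothetical and exist only for illustration.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Stage(Enum):
    """Hypothetical names for the rollout steps described in this post."""
    CERTIFICATION = 1
    STAGING = 2
    PROD_CANARY = 3        # 1 node per cluster, in a single region
    PROD_FIRST_REGION = 4  # entire clusters, still in one region
    PROD_ALL_REGIONS = 5   # sitewide


def furthest_stage(checks: Dict[Stage, Callable[[], bool]]) -> Optional[Stage]:
    """Return the last stage a release cleared, or None if it failed the first one.

    `checks` maps each stage to a callable that returns True when the
    release looks healthy at that stage (an assumption of this sketch).
    """
    reached = None
    for stage in Stage:  # Enum members iterate in definition order
        if not checks[stage]():
            break  # stop advancing; what to do next is a judgment call
        reached = stage
    return reached


# Example: a release that passes its canaries but fails while baking
# in the first fully released region.
checks = {stage: (lambda: True) for stage in Stage}
checks[Stage.PROD_FIRST_REGION] = lambda: False
assert furthest_stage(checks) is Stage.PROD_CANARY
```

In this model, a release is sitewide only when `furthest_stage` returns `Stage.PROD_ALL_REGIONS`. The sketch deliberately says nothing about what happens after a failed check, since in practice that depends on the scope and severity of the issue.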
If an issue is discovered at any step, a judgment call is made on whether to roll back the release fully, partially or not at all, depending on the scope and severity of the issue. In general, we tend to be fairly strict on release viability, and even an issue as minor as breaking a metric could be sufficient to scuttle a release.
In some cases, we have tactical improvements that apply to specific workloads and are thus released first to the affected clusters. Overall, we try to move fast without breaking things. As such, we don't let the main branch get too far ahead of what's in prod, since large changesets make issues harder to debug.
For all of the above reasons, we routinely run a mix of release versions, and only occasionally align the whole site on a single version. As a matter of fact, we have already moved beyond this stable release in some parts of the site.
In order to mark a version as a stable release, we’ll use the criterion that it has been successfully rolled out sitewide. This doesn’t mean other versions are bad, but merely that they have received less testing at scale.
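To make the criterion concrete, here is a minimal sketch in Python of how it could be evaluated. It assumes a hypothetical record of which versions were successfully rolled out to each (region, cluster) pair. The `RolloutHistory` type, the function names and the sample data are illustrative assumptions, not part of Venice or of our tooling.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical record: (region, cluster) -> versions successfully rolled out there.
RolloutHistory = Dict[Tuple[str, str], Set[str]]


def stable_versions(history: RolloutHistory) -> Set[str]:
    """Versions that were successfully rolled out to every cluster in every region."""
    if not history:
        return set()
    return set.intersection(*history.values())


def latest_stable(history: RolloutHistory) -> Optional[str]:
    """Highest stable version, comparing dotted version numbers numerically."""
    candidates = stable_versions(history)
    if not candidates:
        return None
    return max(candidates, key=lambda v: tuple(int(part) for part in v.split(".")))


# Made-up data: one cluster has already moved past 0.4.85, but only
# 0.4.85 has reached every cluster, so it is the only stable version.
history = {
    ("region-a", "cluster-1"): {"0.4.85", "0.4.90"},
    ("region-a", "cluster-2"): {"0.4.85"},
    ("region-b", "cluster-1"): {"0.4.80", "0.4.85"},
}
assert stable_versions(history) == {"0.4.85"}
assert latest_stable(history) == "0.4.85"
```

The sample data also shows why a newer version running somewhere on the site is not automatically stable: it only qualifies once it has gone everywhere.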
Needless to say, we encourage new users to stick to the stable releases, though we hope that in time some power users from the open source community will help share the burden of stress testing the bleeding edge releases as well.
Finally, for each stable release, we will regenerate the Docker images on Docker Hub. And starting from the next stable release, we will publish release notes on this blog to highlight some of the major changes since the last stable release.
If you have any questions or feedback, please let us know on the Venice community Slack!