Friday, December 15, 2017

An Approach to Blue/Green Sitecore Deployments

You may have heard someone speak of blue-green deployments, and you may have found yourself wondering what those colors have to do with software development. Simply put, blue-green deployments reduce downtime and risk when deploying changes. The central idea is to have two identical production environments labeled Blue and Green. At any given time one of the environments is live and the other is not. New code is always pushed to the non-live environment where it can be tested. Once validated, web traffic is pointed to the freshly updated environment. The other environment no longer serves traffic and can serve as a rollback in the unlikely event an issue surfaces in the newly live environment.

Obviously, zero-downtime deployments are very valuable to enterprise clients. Since blue-green deployments minimize downtime (perhaps even to zero) it behooves us as Sitecore architects to utilize the strategy. Of course, if you have any familiarity with the complexity of a Sitecore environment, blue-green deployments may seem like an unattainable goal.

My motivation for this blog post is to sketch out an approach for blue-green deployments with Sitecore. I'd like to think through the problem and demonstrate it is possible insofar as we can trust a thought-experiment.

The Challenge

The primary problem posed by Sitecore with blue-green deployments is the database layer. Since the database is a shared resource amongst all Sitecore servers in an environment, any change there can affect the entire environment. Additionally, since we are dealing with a CMS, we must expect that authors regularly introduce changes to the database.

This database challenge is exacerbated by two factors:
  1. We cannot control the schema. Sitecore must own that.
  2. We must actually think about two database layers (SQL and Mongo) -- in which some databases are inter-related -- as well as their attendant search indexes.

Point #1 should be self-evident. Sitecore's API encapsulates the database layer. Any changes to the database must be done through the API; this is de rigueur for any CMS and Sitecore is no exception.

Point #2 probably requires a little more explanation. The databases in Mongo function as a very large "net" that captures all interaction data with visitors to the site. Mongo is organized in such a way as to make writes very fast. The Reporting database in SQL represents data from Mongo that has been reorganized to support efficient reads so that report performance is optimized. The Analytics index lets us query Mongo data from the API efficiently. Thus, Mongo is the source of truth for visitor interaction data, and the Reporting database and the Analytics index are coupled to it. This means any changes introduced to the data in Mongo must also be represented in SQL and in the Analytics index. We must treat those three systems as a unit.


A Solution

Clearly, we must mitigate the problems posed by the database layers. I believe we can, but let's first pose a few assumptions:
  1. Content Authors will be inactive during deployments.
  2. We cannot use InProc mode for session state on CD servers.
  3. Sticky sessions (server affinity) should be disabled.
  4. The load balancer supports configuration changes through scripting.
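
Assumption 2 is one we can check mechanically before a deployment starts. Here is a minimal Python sketch, assuming the CD server's web.config declares session state through the standard ASP.NET sessionState element. The function names and the sample config are mine, purely for illustration, and the sketch only looks at that one element; any session settings kept elsewhere in Sitecore's own config would need a separate check.

```python
import xml.etree.ElementTree as ET


def session_mode(web_config_xml: str) -> str:
    """Return the ASP.NET session-state mode declared in a web.config document."""
    root = ET.fromstring(web_config_xml)
    node = next(root.iter("sessionState"), None)
    # ASP.NET falls back to InProc when the element or the attribute is absent.
    if node is None:
        return "InProc"
    return node.get("mode", "InProc")


def uses_inproc_sessions(web_config_xml: str) -> bool:
    """True when the config would break assumption 2 (no InProc on CD servers)."""
    return session_mode(web_config_xml).lower() == "inproc"


# Hypothetical sample configs, not taken from a real Sitecore install.
bad_config = """<configuration>
  <system.web>
    <sessionState mode="InProc" timeout="20" />
  </system.web>
</configuration>"""

good_config = bad_config.replace("InProc", "SQLServer")

print(uses_inproc_sessions(bad_config))
print(uses_inproc_sessions(good_config))
```

A check like this could run as a gate at the very start of the deployment so that a misconfigured CD server stops the process before any traffic is moved.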


Step 0: Initial State

Now to the fun part! Before the deployment begins we imagine blue and green PROD environments where the green environment is live and one version ahead of the blue environment. For this initial state, I've greatly simplified the server topology. I'm representing all of the Sitecore servers with a single app server and all databases as a single database. I've also eliminated search indexes.



Step 1: Synchronize Content

The first step in the deployment process is to synchronize all data managed by content authors in Sitecore from the live (green) environment to the blue environment. Typical examples of this sort of data would be the site pages and data items which we would expect to be descendants of /sitecore/content and media assets in the Media Library. One possible mechanism for automating the synchronization process is to use Razl. Here's a nice YouTube video demonstrating the capability. It's important we synchronize content prior to deploying new code so that our deployment is working against the same Sitecore data as it would in a 'conventional' deployment to PROD. Note in the diagram below that SQL data in the blue environment is still version 0. The asterisk represents the addition of managed content synchronized from the green environment.
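
To make the scope of this step concrete, here is a rough Python sketch of the selection logic a sync tool applies conceptually. This is not Razl's API. I'm assuming each environment can be summarized as a map from item path to a revision marker, and the roots and sample items are invented for illustration. The sketch only finds items to copy from live to non-live; it does not handle deletions.

```python
# Roots that hold author-managed data (an assumption based on a typical solution).
AUTHOR_ROOTS = ("/sitecore/content/", "/sitecore/media library/")


def items_to_sync(live_items, idle_items, roots=AUTHOR_ROOTS):
    """Return paths of author-managed items that are missing or stale in the idle environment."""
    pending = []
    for path, revision in sorted(live_items.items()):
        if not path.lower().startswith(roots):
            continue  # templates, layouts, etc. arrive with the code deployment instead
        if idle_items.get(path) != revision:
            pending.append(path)
    return pending


# Hypothetical snapshots of green (live) and blue (non-live).
green_items = {
    "/sitecore/content/Home": "r2",
    "/sitecore/content/Home/News": "r1",
    "/sitecore/media library/Images/logo": "r1",
    "/sitecore/templates/Page": "r5",
}
blue_items = {
    "/sitecore/content/Home": "r1",
    "/sitecore/media library/Images/logo": "r1",
    "/sitecore/templates/Page": "r4",
}

for path in items_to_sync(green_items, blue_items):
    print(path)
```

Note that the template difference is deliberately ignored here. Developer-owned items are the job of the deployment in the next step, not of content synchronization.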



Step 2: Deploy New Version

The next step is to deploy our changes to the blue environment. I've represented this work as performed by Octopus Deploy. It is my preferred tool for this job, but it's not the only one. Octopus manages making changes to all Sitecore servers (files placed on the file system and Sitecore items written to SQL) as well as ensuring we publish and rebuild indexes as required. When this step completes the blue environment will be one version ahead of the green environment (while retaining the synchronized content from green).

Update: There is nothing, I believe, about this strategy that requires indexes to be rebuilt. There could be something about your particular solution that needs an index to be rebuilt. If so, this is the correct step to perform that work.



Step 3: Testing

At this point, the blue environment is ready for testing. The specific forms of testing are wide-open: starting with someone doing basic smoke-testing all the way through a full suite of automated tests.



Step 4: Change Connection Strings and Analytics Index

Now comes the hard part: dealing with databases. Fear not, however; we have a plan. Recall from step 1 that we've already dealt with any differences between blue and green due to changes made by content authors (also, we assume a content freeze during deployment). In step 2 we deployed the latest solution as well as published and rebuilt indexes if required. Therefore, the last hurdle is analytics and session data.

First, let's take a moment to think through analytics data. It all starts with the four Mongo databases. From the Mongo databases we have the derived data in the SQL Reporting database and the Analytics search index. So, really we need to think of these three subsystems as a unit. Another important consideration is that we never want to mix the live and non-live analytics data. For example, we don't want to dirty visitor interaction data with clicks generated by smoke-tests performed during a deployment. We also need to be careful with session data. Live users will be transitioned from the green environment to the blue environment. All of their serialized session data must remain valid and coherent.

Great...how do we do this?

We change the connection strings for the Mongo, Reporting, and Session databases for all Sitecore servers in the blue environment. We also modify the analytics index config to use the 'analytics_live' Solr core which is replicated from a corresponding 'analytics_live' Solr core in the green environment. In this way, we guarantee that all analytics and session data for a blue Sitecore server corresponds to live data.
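
As a rough illustration of the connection string change, here is a small Python sketch that repoints selected entries in a connection strings file. I'm assuming the file is XML with add elements carrying name and connectionString attributes. The entry names and every connection string value below are placeholders I made up, so check them against your own configuration before borrowing any of this.

```python
import xml.etree.ElementTree as ET


def repoint_connection_strings(config_xml: str, live_values: dict) -> str:
    """Return config_xml with the named entries pointed at the live data stores."""
    root = ET.fromstring(config_xml)
    for entry in root.iter("add"):
        name = entry.get("name")
        if name in live_values:
            entry.set("connectionString", live_values[name])
    return ET.tostring(root, encoding="unicode")


# Hypothetical blue config before step 4 (placeholder values throughout).
blue_config = """<connectionStrings>
  <add name="master" connectionString="blue-sql-master" />
  <add name="analytics" connectionString="blue-mongo-analytics" />
  <add name="reporting" connectionString="blue-sql-reporting" />
  <add name="session" connectionString="blue-sql-session" />
</connectionStrings>"""

# Only the analytics, reporting, and session entries move to the live stores.
live_values = {
    "analytics": "live-mongo-analytics",
    "reporting": "live-sql-reporting",
    "session": "live-sql-session",
}

updated = repoint_connection_strings(blue_config, live_values)
print(updated)
```

The content databases are left alone on purpose: blue keeps its own freshly deployed master database, and only the data that must stay continuous for live visitors is redirected.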



Step 5: Make Blue the Live Environment

We are now ready to redirect incoming traffic from the Internet to the blue environment, thus making it the live environment. At the same time, we also need to reverse the direction of Solr replication so that the blue analytics_live index is now the master index and green's version is the slave. Since we assume there are no sticky sessions, the traffic switch should be nearly instantaneous.



Step 6: Finish Retiring the Green Environment

Finally, we must finish retiring the green environment from being live. This step is really just the inverse of step 4. That is, we change the connection strings for the Mongo, Reporting, and Session databases for all Sitecore servers in the green environment. Again, we also modify the analytics index config. This time, however, we wire up the analytics index in Sitecore to use the 'analytics' Solr core rather than the 'analytics_live' Solr core.




Rollbacks

While I listed steps 4-6 above as discrete steps to help illustrate the idea, as a practical matter we should think of them as a single continuous step. Even better, if we automate 4-6 as a single operation, the process of rolling back from blue to green becomes much easier.

Imagine you finish step 6 and proudly watch traffic seamlessly flow to the blue servers only to discover that despite all the testing in step 3, an issue pops up related to the code just deployed. You could roll back to the green environment by inverting steps 4-6. That's a simple proposition if you took the time to automate 4-6 as a single operation.
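
To show why a single operation makes the rollback cheap, here is a toy Python model of steps 4-6. Nothing in it talks to a real load balancer, Solr, or Sitecore server. The Environment class and its fields are hypothetical stand-ins for the state each step changes, and the only point is that a rollback is the same function called with the two environments swapped.

```python
from copy import deepcopy
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    uses_live_data: bool    # Mongo, Reporting, and Session connection strings point at live stores
    analytics_core: str     # Solr core wired into the Sitecore analytics index
    replication_role: str   # role of this environment's analytics_live core
    serves_traffic: bool


def cutover(outgoing: Environment, incoming: Environment) -> None:
    """Run steps 4-6 as one operation, moving live status from outgoing to incoming."""
    # Step 4: point the incoming environment at the live data and the live core.
    incoming.uses_live_data = True
    incoming.analytics_core = "analytics_live"
    # Step 5: send traffic to incoming and reverse Solr replication.
    incoming.serves_traffic, outgoing.serves_traffic = True, False
    incoming.replication_role, outgoing.replication_role = "master", "slave"
    # Step 6: detach the outgoing environment from the live data.
    outgoing.uses_live_data = False
    outgoing.analytics_core = "analytics"


green = Environment("green", True, "analytics_live", "master", True)
blue = Environment("blue", False, "analytics", "slave", False)
before = deepcopy((green, blue))

cutover(outgoing=green, incoming=blue)   # deployment: blue becomes live
after_deploy = deepcopy((green, blue))

cutover(outgoing=blue, incoming=green)   # rollback: same operation, arguments swapped
print((green, blue) == before)
```

In a real pipeline each assignment would be replaced by a scripted change against the relevant system, but the shape stays the same: one parameterized operation, run in either direction.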