Day 37 — Writing the manual while building the plane while flying it.

Developing our Business Continuity Plan

3 min read · Jan 6, 2021

What happens when I spend 2 weeks working on data security.

Winston Churchill said:

“Success consists of going from failure to failure without loss of enthusiasm.”

Well…I think being a startup founder is moving from problem to problem without loss of enthusiasm.

Today was a struggle. My problem to solve was “business continuity planning” (a playbook for how we’ll respond to outages and disasters) and I just couldn’t find the enthusiasm. First, I quickly realized I was out of my league. Second, there are so many things still left to build that planning for what happens when something we build breaks just seemed premature.

Nevertheless…I survived and ended up learning a lot about how we manage failure.

Here’s what I accomplished.

Business Continuity Planning

Starting this work today, I had no idea what I was in for. I spent half the day just learning exactly what business continuity planning is and the other half applying what I learned by developing a “playbook” for how we’ll manage outages and disruptions to our key systems.

I found surprisingly few (good) examples online of how companies actually manage this today. There were plenty of policies, but few tangible playbooks.

The few examples of business continuity plans I did find were focused on natural disasters impacting work settings, which really isn’t a problem for us since we work remotely. So instead I focused mostly on our 3rd party systems that manage our data and keep our systems running.

Rating the Business Impact

I started by rating all of our key systems. It boiled down to:

HIGH PRIORITY (if these stopped working, our app service would be disrupted): AWS
MEDIUM PRIORITY (if these stopped working, our app service would be degraded): Apple, Google, and Expo.io
LOW PRIORITY (if these services were disrupted, it would be annoying): Slack, Google Workspace, and other internal messaging tools
VERY LOW PRIORITY (if these services were disrupted, customers would never know and we’d manage until they’re back): e.g., Calendly, analytics tools, etc.
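As a minimal sketch, the ratings above could also be captured as data, so a playbook or script can look up a system's tier. The names here (`Priority`, `SYSTEMS`, `customer_facing`, `triage`) are hypothetical, and the tier assignments simply mirror the list above.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Business-impact tiers; a lower number means more urgent."""
    HIGH = 1       # app service would be disrupted
    MEDIUM = 2     # app service would be degraded
    LOW = 3        # annoying for the team
    VERY_LOW = 4   # customers would never know


# Hypothetical registry that mirrors the ratings in the list above.
SYSTEMS = {
    "AWS": Priority.HIGH,
    "Apple": Priority.MEDIUM,
    "Google": Priority.MEDIUM,
    "Expo.io": Priority.MEDIUM,
    "Slack": Priority.LOW,
    "Google Workspace": Priority.LOW,
    "Calendly": Priority.VERY_LOW,
}


def customer_facing(system: str) -> bool:
    """True when an outage of `system` would disrupt or degrade the app."""
    return SYSTEMS[system] <= Priority.MEDIUM


def triage(down: list[str]) -> list[str]:
    """Order the systems that are currently down from most to least urgent."""
    return sorted(down, key=lambda name: SYSTEMS[name])
```

For example, `triage(["Slack", "AWS", "Expo.io"])` puts AWS first, and `customer_facing("Calendly")` is false, which matches the idea that only the top two tiers affect customers.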

Backups

AWS is by far the most critical system we have. Because we are planning to use a third-party vendor like Aptible or MedStack for our technical security, and will lean on that vendor to help configure backups, I didn’t tackle backups today.

Monitoring

One of the key activities for managing outages is detecting that an outage has occurred in the first place. For this, we rely on a bunch of third-party systems, all of which publish status pages or other outage-reporting tools.

Eventually we’ll get smarter and automate this with notifications to Slack, etc., but for now I just included links to each vendor’s status page.
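Here is a rough sketch of what that automation might eventually look like. It assumes a vendor's status page is hosted on Atlassian Statuspage, which serves a JSON summary at `/api/v2/status.json` with a `status.indicator` field (`none`, `minor`, `major`, or `critical`). Which of our vendors actually use that format is something to verify per vendor, and the helper names (`parse_status`, `build_alert`, `fetch_status`, `notify_slack`) and the webhook URL are placeholders, not anything we have built.

```python
import json
import urllib.request
from typing import Optional


def parse_status(payload: dict) -> tuple:
    """Pull (indicator, description) out of a Statuspage-style status.json body."""
    status = payload.get("status", {})
    return status.get("indicator", "unknown"), status.get("description", "")


def build_alert(system: str, payload: dict) -> Optional[dict]:
    """Return a Slack message body, or None when the vendor reports no incident.

    An unrecognized or missing indicator still produces an alert, so a
    changed response format fails loudly instead of silently.
    """
    indicator, description = parse_status(payload)
    if indicator == "none":
        return None
    return {"text": f"{system} status: {indicator} ({description})"}


def fetch_status(base_url: str, timeout: float = 10.0) -> dict:
    """Download status.json from a status page (assumes the Statuspage format)."""
    url = f"{base_url}/api/v2/status.json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def notify_slack(webhook_url: str, message: dict, timeout: float = 10.0) -> None:
    """POST the message to a Slack incoming webhook (the URL is a placeholder)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=timeout).close()
```

A scheduled job could call `fetch_status` for each status page, pass the result to `build_alert`, and call `notify_slack` only when an alert comes back. Since Slack itself may be the system that is down, the links to the status pages are still worth keeping as a manual fallback.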

Notifying the Team and Customers

We mainly use Slack for internal messaging, which will be our primary tool in an outage (unless Slack is down!). For customers, whether we need to message them directly will depend on the impact.

Responding to Outages

Each system brings with it its own plan:

AWS: A variety of backup methods and approaches exist to manage this. They need to be configured, which we’ll rely on a 3rd party vendor to help with.
Google/Apple/Expo: If any of these services are down, it will impact our ability to push app updates. While it’s highly unlikely an extended outage will occur, we plan to monitor these services and notify customers if they’re affected.
Slack/Google Workspace/Dropbox: Alternative means are available for these tools, including (gasp!) just making a phone call. Each is identified in the playbook.

The Result: Business Continuity Playbook

It can still be vastly improved (suggestions welcome!), but we have a start and it’s at least something for my team to react to and improve.

And that’s cause for enthusiasm.