


Chaos engineering and databases

25 August, 2021 by Rambler

Embrace the CHAOS!  (aka resiliency) 

Chaos engineering is defined as "the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production." It's a discipline practised by many companies, notably Netflix, Google and Amazon.

Distributed systems are becoming even more complex: the surface area is growing, exposing systems to a wider range of vulnerabilities. The old assumptions of developers and engineers are being questioned. Typical assumptions include: networks are stable, topologies never change, one admin can do everything, and there is only one data centre.

Chaos engineering is not about breaking things in production!   



There is a lot of documentation around this topic. One great resource outlines an experiment framework:


1) Define and measure your system’s “steady state.”

2) Create a hypothesis

3) Simulate what could happen in the real world - ask "What could go wrong?" and then simulate it. Or even: "You never run out of things that can go wrong"

4) Prove or disprove your hypothesis. Compare the steady state with the impact of the disturbance, and improve
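The four steps above can be sketched as a small experiment harness. This is only a minimal illustration, not a real chaos tool: `ToySystem`, `run_experiment`, the latency model and the 10% tolerance are all hypothetical names and numbers chosen for the example.

```python
import statistics
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float    # steady-state value of the metric
    disturbed: float   # value of the metric while the fault is active
    tolerance: float   # allowed relative deviation, e.g. 0.10 = 10%

    @property
    def hypothesis_holds(self) -> bool:
        # Step 2: the hypothesis is "the metric stays within tolerance".
        return abs(self.disturbed - self.baseline) <= self.tolerance * self.baseline


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float = 0.10,
    samples: int = 5,
) -> ExperimentResult:
    # Step 1: define and measure the steady state.
    baseline = statistics.mean(measure() for _ in range(samples))
    # Step 3: simulate a real-world failure.
    inject()
    try:
        disturbed = statistics.mean(measure() for _ in range(samples))
    finally:
        # Always undo the disturbance, even if measuring fails.
        rollback()
    # Step 4: the result lets the caller prove or disprove the hypothesis.
    return ExperimentResult(baseline, disturbed, tolerance)


class ToySystem:
    """Hypothetical stand-in for a real system: latency grows as replicas stop."""

    def __init__(self) -> None:
        self.replicas_up = 3

    def latency_ms(self) -> float:
        return 30.0 / self.replicas_up

    def stop_replica(self) -> None:
        self.replicas_up -= 1

    def start_replica(self) -> None:
        self.replicas_up += 1


if __name__ == "__main__":
    system = ToySystem()
    result = run_experiment(system.latency_ms, system.stop_replica, system.start_replica)
    print(result, "hypothesis holds:", result.hypothesis_holds)
```

In a real experiment, `measure` would read a production metric and `inject` would trigger an actual fault; the harness shape (measure, disturb, measure again, roll back, compare) stays the same.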


Thinking about databases: chaos engineering has loads of documentation around stateless devices, but there aren't as many ideas discussed around managing large database systems.


These are some ideas to start a conversation. The possibilities are endless - but more value is gained from focusing on potential failures that are fairly likely to happen and could have a significant impact.

In order to improve a system, you must fail constantly and strengthen it. Confidence increases by introducing turbulence.

-> Clustered database environments - turn off replicas and check that other replicas are promoted

-> Simulate a ransomware attack, e.g. encrypt a database or make it unavailable. Can you complete a system recovery?

-> Test the supporting systems around the database servers, e.g. the monitoring system

-> Backups - test your restores! If a restore were required right now, could you complete a point-in-time restore?
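As one way to make the restore idea concrete, here is a minimal sketch of a point-in-time restore drill on a toy in-memory model: a full backup plus an ordered change log, replayed up to a target time. The names `LogRecord`, `restore_to_point` and `restore_drill` are hypothetical, and a real drill would use your database's own backup and log-replay tooling rather than this model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogRecord:
    ts: int                # logical timestamp of the change
    key: str
    value: Optional[str]   # None represents a delete


def restore_to_point(
    full_backup: Dict[str, str], log: List[LogRecord], target_ts: int
) -> Dict[str, str]:
    """Rebuild the state as of target_ts from a full backup plus a change log."""
    state = dict(full_backup)  # copy, so the backup itself is never modified
    for rec in sorted(log, key=lambda r: r.ts):
        if rec.ts > target_ts:
            break  # stop replaying just before the unwanted change
        if rec.value is None:
            state.pop(rec.key, None)
        else:
            state[rec.key] = rec.value
    return state


def restore_drill(
    full_backup: Dict[str, str],
    log: List[LogRecord],
    target_ts: int,
    expected: Dict[str, str],
) -> bool:
    """Hypothesis: restoring to target_ts reproduces the expected state."""
    return restore_to_point(full_backup, log, target_ts) == expected


if __name__ == "__main__":
    backup = {"acct:1": "100"}
    log = [
        LogRecord(1, "acct:2", "50"),
        LogRecord(2, "acct:1", "80"),
        LogRecord(3, "acct:2", None),  # simulated damaging change
    ]
    ok = restore_drill(backup, log, 2, {"acct:1": "80", "acct:2": "50"})
    print("point-in-time restore drill passed:", ok)
```

The value of the drill is the comparison against a known-good expected state: a restore that completes but yields the wrong data is still a failed restore.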



Read more on systems management

The Agile DBA

Day 2 Operations - Are you ready?


Author: Rambler

