25 August, 2021 by Rambler
Embrace the CHAOS! (aka resiliency)
Chaos engineering is defined as "the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production" (from principlesofchaos.org). It's a discipline practised by many companies - notably Netflix, Google & Amazon.
Distributed systems are becoming even more complex - their surface area is growing, exposing systems to a wider range of vulnerabilities. The old assumptions of developers and engineers are being questioned. Typical assumptions include: networks are stable, topologies never change, a single admin can do everything & there is only one data centre.
Chaos engineering is not about breaking things in production!
There is lots of documentation around this topic. principlesofchaos.org is a great resource and outlines an experiment framework:
1) Define and measure your system’s “steady state”.
2) Create a hypothesis.
3) Simulate what could happen in the real world - ask "What could go wrong?" and then simulate it. Or even: "You never run out of things that can go wrong".
4) Prove or disprove your hypothesis. Compare the steady state with the impact of the disturbance, and improve.
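The four steps above can be sketched as a small harness. This is only an illustration, not something taken from principlesofchaos.org: the names `run_experiment`, `measure`, `inject_fault`, `remove_fault` and `tolerance` are hypothetical, and the hypothesis is assumed to be "the metric stays within a given fraction of its steady-state value while the fault is active".

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float        # steady-state value of the metric
    disturbed: float       # value of the metric while the fault was active
    hypothesis_held: bool  # True if the metric stayed within tolerance


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    tolerance: float,
    samples: int = 5,
) -> ExperimentResult:
    """Run one experiment (all callables are hypothetical placeholders)."""
    # Step 1: define and measure the steady state.
    baseline = mean(measure() for _ in range(samples))

    # Step 3: simulate the real-world event, always cleaning up afterwards.
    inject_fault()
    try:
        disturbed = mean(measure() for _ in range(samples))
    finally:
        remove_fault()

    # Steps 2 and 4: the hypothesis is that the metric stays within
    # `tolerance` (a fraction of the baseline); compare and report.
    held = abs(disturbed - baseline) <= tolerance * abs(baseline)
    return ExperimentResult(baseline, disturbed, held)


if __name__ == "__main__":
    # Toy stand-in for a real system: a dict holding a latency figure.
    state = {"latency_ms": 100.0}
    result = run_experiment(
        measure=lambda: state["latency_ms"],
        inject_fault=lambda: state.update(latency_ms=130.0),
        remove_fault=lambda: state.update(latency_ms=100.0),
        tolerance=0.2,
    )
    print(result)
```

In a real experiment `measure` would read from your monitoring system and `inject_fault` would do something like stopping a replica. The `try`/`finally` is there so the fault is removed even if measuring fails.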
Thinking about databases. Chaos engineering has loads of documentation around stateless devices, but there aren't as many ideas discussed around managing large database systems.
These are some ideas to start a conversation. The possibilities are endless - but more value is gained from focusing on potential failures that are fairly likely to happen and could have a significant impact.
In order to improve a system, you must fail constantly & strengthen. Increasing confidence comes from introducing turbulence.
-> Clustered database environments - turn off replicas and check that other replicas are getting promoted
-> Simulate a ransomware attack, such as encrypting a database or making it unavailable. Can you complete a system recovery?
-> Test the supporting systems around the database servers - e.g. the monitoring system
-> Backups - testing of restores! If a restore was required right now, could you complete a point-in-time restore?
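As a starting point for the backup idea, here is a minimal sketch of a pre-check that asks whether a point-in-time restore is even possible from the backups you hold. It assumes a simplified model: a list of full backup times plus a list of log backup ranges that must form an unbroken chain from a full backup up to the target time. The function name `can_restore_to` and the data layout are hypothetical, not any database vendor's API.

```python
from datetime import datetime
from typing import List, Tuple


def can_restore_to(
    target: datetime,
    full_backups: List[datetime],
    log_ranges: List[Tuple[datetime, datetime]],
) -> bool:
    """Return True if the backups cover `target` (simplified, assumed model)."""
    # A restore needs a full backup taken at or before the target time.
    candidates = [b for b in full_backups if b <= target]
    if not candidates:
        return False

    # Start from the most recent usable full backup and walk the log chain.
    covered_until = max(candidates)
    for start, end in sorted(log_ranges):
        if end <= covered_until:
            continue  # this log backup adds nothing new
        if start > covered_until:
            break  # gap in the chain: later logs cannot be applied
        covered_until = end
        if covered_until >= target:
            return True
    return covered_until >= target


if __name__ == "__main__":
    day = datetime(2021, 8, 25)
    fulls = [day.replace(hour=0)]
    logs = [
        (day.replace(hour=0), day.replace(hour=1)),
        (day.replace(hour=1), day.replace(hour=2)),
        (day.replace(hour=3), day.replace(hour=4)),  # 02:00-03:00 is missing
    ]
    print(can_restore_to(day.replace(hour=1, minute=30), fulls, logs))
    print(can_restore_to(day.replace(hour=3, minute=30), fulls, logs))
```

A check like this only tells you the backup catalogue looks complete. The experiment itself is still to perform the restore on a scratch server and confirm the data is usable.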