11 October, 2024 by Rambler
Recently, some developers were running Chaos Engineering experiments on applications hosted on EKS. They were using the Fault Injection Simulator to simulate an Availability Zone (AZ) failure, causing Multi-AZ RDS instances and Aurora clusters in a single Region to fail over to another AZ in the same Region.
They gave me some feedback: when RDS failed over, their application was unable to connect and write to the database. A little research quickly identified the source of the issue.
DNS caching is the process of temporarily storing a DNS record in an OS, browser or application-layer component.
Some applications use DNS caching, which has the advantage of speeding up address resolution for a target database connection, but under certain circumstances it leaves the application holding a stale address and unable to reach the database.
IP addresses associated with RDS can change. Because the IP address is dynamic, DNS caching can lead to connection issues until the cached DNS record is refreshed.
There are multiple circumstances in which the IP address of an RDS instance can change, such as an RDS restart, a planned or unplanned Multi-AZ failover, a major or minor version upgrade, or an AZ outage.
RDS Multi-AZ - When a failover occurs, planned or unplanned, RDS automatically switches over to the standby replica in the alternate AZ and the DNS record is updated to point to that replica. The standby replica has a different IP address.
Aurora Cluster - When the Aurora DB cluster fails over, the failover mechanism promotes a new DB instance to be the primary (writer) instance for the cluster, and the IP address that the cluster endpoint points to changes.
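One way to see this from the application side is to re-resolve the endpoint and compare the result with the address the application originally connected to. The following is a minimal sketch, not a definitive implementation: the function names, the default port 5432 and the host name and addresses in the usage note are assumptions made for illustration, and `socket.getaddrinfo` may itself be answered from an OS-level cache.

```python
import socket
from typing import Callable, Set


def resolve_ips(hostname: str, port: int = 5432) -> Set[str]:
    """Ask the system resolver for the addresses currently behind hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def endpoint_moved(hostname: str, known_ips: Set[str],
                   resolver: Callable[[str], Set[str]] = resolve_ips) -> bool:
    """Return True when none of the addresses we know about are still published."""
    current = resolver(hostname)
    return known_ips.isdisjoint(current)
```

For example, `endpoint_moved("mydb.example.internal", {"10.0.1.15"})` (a hypothetical host name and address) would report whether the endpoint has stopped resolving to the address the application is still holding.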
If you use any form of connection pooling/multiplexing, the application layer must be able to flush any cached DNS information or reduce its TTL.
Many applications cache the IP address of a host name and never re-resolve it after a failure. You may also want to check that your app handles a database disconnect correctly.
If the application has to be restarted after every failover, revisit the application architecture and establish a process to manage the loss of the database connection.
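As a rough illustration of what "flush or reduce the TTL" can look like at the application layer, here is a small sketch of a cache that keeps a lookup for only a few seconds and can be flushed on demand. The class name, the 5 second default and the injectable `resolver` and `clock` parameters are assumptions for this example; a real connection pool or driver may expose its own setting for this.

```python
import socket
import time
from typing import Callable, Dict, Optional, Tuple


def system_resolve(hostname: str) -> str:
    """Resolve hostname to one IP address using the system resolver."""
    return socket.getaddrinfo(hostname, None)[0][4][0]


class ShortTtlDnsCache:
    """Caches hostname lookups for at most ttl_seconds and can be flushed."""

    def __init__(self, ttl_seconds: float = 5.0,
                 resolver: Callable[[str], str] = system_resolve,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        # hostname -> (ip address, time at which the entry expires)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, hostname: str) -> str:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh, reuse the cached address
        ip = self._resolver(hostname)  # expired or missing, resolve again
        self._entries[hostname] = (ip, now + self._ttl)
        return ip

    def flush(self, hostname: Optional[str] = None) -> None:
        """Drop one cached entry, or every entry when hostname is None."""
        if hostname is None:
            self._entries.clear()
        else:
            self._entries.pop(hostname, None)
```

A pool could call `flush()` whenever a connection attempt fails, so that the next attempt resolves the endpoint again instead of reusing the address of the old primary.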
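One possible shape for such a process is a retry loop that opens a fresh connection on every attempt, so the host name is resolved again rather than reusing a dead connection. This is a sketch under stated assumptions: `run_with_reconnect`, `connect` and `operation` are hypothetical names, the backoff values are illustrative, and a real database driver will usually raise its own exception types, which would need to be added to the `except` clause.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_reconnect(connect: Callable[[], object],
                       operation: Callable[[object], T],
                       max_attempts: int = 5,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation(conn); on a connection error, reconnect and try again."""
    last_error: Optional[BaseException] = None
    for attempt in range(max_attempts):
        try:
            # connect() is called on every attempt so the endpoint is
            # looked up again instead of reusing a broken connection.
            conn = connect()
            return operation(conn)
        except OSError as exc:  # ConnectionError is a subclass of OSError
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff: 0.5s, 1s, 2s, ... by default.
                sleep(base_delay * (2 ** attempt))
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

Retrying is only safe when the operation can be repeated without side effects, so a write that may have partially succeeded needs its own handling.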
Links
What do I need to know about the IP addresses assigned to my Amazon RDS DB instances?
RDS Multi-AZ - Failing over a Multi-AZ DB instance for Amazon RDS
Aurora Cluster - Improve application availability on Amazon Aurora
Aurora Cluster - Operational Guide - Basic operational guidelines for Amazon Aurora