Kavin Ragvan



Cloud Performance Engineer with a demonstrated history of working in the information technology and services industry. Has over a decade of experience in non-functional testing, with strong expertise in Performance Testing & Engineering, Chaos Engineering, and Site Reliability Engineering in the resiliency and observability areas.

Specializes in AWS and GCP cloud performance testing and in designing and implementing cloud test frameworks for performance, resiliency, and observability. Has created automated Performance & Resilience Engineering frameworks and implemented continuous integration and continuous delivery to run early performance, resilience, and accessibility tests and identify potential performance bottlenecks during the development phase. Has presented four whitepapers on Cloud Performance Testing, Chaos Engineering, and Microservices at software conferences.

Title: Designing for Site Reliability & Observability using AWS FIS



Chaos experiments on AWS stress an application in test or production environments by creating real-world disruptive events, such as a sudden increase in CPU or memory consumption, then observing how the system responds and implementing improvements to the application's performance, observability, and resiliency. AWS Fault Injection Simulator (FIS) is an AWS managed service for running chaos experiments on AWS. For AWS fault simulations, its pay-per-use model makes FIS a cost-efficient alternative to commercial tools on the market such as Gremlin and NetHavoc.

FIS Real Time Uses: 

  1. Design & Simulate Failures in PreProd Environment: Simulate real-world failures in a test environment such as Stage or Perf to understand autoscaling thresholds, health checks, MTTR, etc.
  2. Production Chaos: Continue the tests in production, creating potential failure conditions and observing how effectively the team and the system respond.
  3. Build Observability: FIS can be used to verify the observability of the system by testing alerts and monitoring dashboards with fault simulations.
  4. CICD Integration: AWS Fault Injection Simulator can be integrated into a continuous delivery pipeline to repeatedly test the impact of fault actions as part of the software delivery process.
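
As a rough illustration of the CICD integration point, a pipeline step could poll a running experiment until it reaches a terminal status and fail the build unless it completed cleanly. The sketch below is an assumption about how such a gate might look, not part of FIS itself: `wait_for_experiment`, `gate_passed`, and `fetch_status` are hypothetical names, and the set of terminal statuses is assumed. In a real pipeline, `fetch_status` might wrap the boto3 FIS client's `get_experiment` call and return `response["experiment"]["state"]["status"]`.

```python
import time
from typing import Callable

# Assumed terminal statuses; any other status is treated as "still in progress".
TERMINAL_STATUSES = {"completed", "stopped", "failed"}


def wait_for_experiment(
    fetch_status: Callable[[], str],
    poll_seconds: float = 15.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the experiment reaches a terminal status, then return it."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("experiment did not finish within the polling budget")


def gate_passed(status: str) -> bool:
    """Treat only a cleanly completed experiment as a passing pipeline step.

    A "stopped" status usually means a stop condition fired or someone
    stopped the experiment manually, so this sketch fails the build for it.
    """
    return status == "completed"
```

Passing `fetch_status` and `sleep` as parameters keeps the gate independent of any AWS SDK, so the same logic can be exercised in unit tests with a fake status sequence.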

Supported AWS Services 


  • FIS supports AWS services such as EC2, RDS, EKS, and ECS, and can be customized to support additional fault simulations for AWS Fargate, EBS volumes, etc.
  • FIS can also inject API throttling, API unavailable, and API exception errors

FIS Experiment Design Steps

  1. Create and assign an FIS IAM role for the experiment
  2. Specify the target instances
  3. Pass the inputs for the experiment as JSON in the actions
  4. Configure or select CloudWatch alarms as experiment stop conditions
  5. Run from the console or CLI
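
The design steps above can be sketched as a single experiment template. The example below is a minimal illustration under stated assumptions, not a definitive implementation: the helper name `build_stop_instances_template`, the target and action names (`app-instances`, `stop-instances`), the tag, and the ARNs are all placeholders. The field names follow the shape of the FIS `CreateExperimentTemplate` request as I understand it.

```python
import json
from typing import Any, Dict


def build_stop_instances_template(
    role_arn: str,
    alarm_arn: str,
    tag_key: str,
    tag_value: str,
) -> Dict[str, Any]:
    """Return a dict shaped like an FIS experiment template (hypothetical helper)."""
    return {
        "description": "Stop one tagged EC2 instance and restart it after 5 minutes",
        # Step 1: IAM role that FIS assumes to run the experiment.
        "roleArn": role_arn,
        # Step 2: target instances, selected here by tag.
        "targets": {
            "app-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        # Step 3: action inputs passed as JSON parameters.
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "app-instances"},
            }
        },
        # Step 4: a CloudWatch alarm that halts the experiment if it fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


if __name__ == "__main__":
    # Placeholder ARNs and tag; replace with real values before use.
    template = build_stop_instances_template(
        role_arn="arn:aws:iam::111122223333:role/example-fis-role",
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-alarm",
        tag_key="Environment",
        tag_value="perf",
    )
    print(json.dumps(template, indent=2))
```

For step 5, the printed JSON could be saved to a file and supplied to `aws fis create-experiment-template --cli-input-json file://template.json`, followed by `aws fis start-experiment` with the returned template ID. The same dictionary could also be passed as keyword arguments to the boto3 FIS client's `create_experiment_template` call.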

FIS Benefits:

  • Customizable Fault Simulations: FIS allows combining different levels of fault simulation (state, resource, network) and customizing and saving fault actions to fit the use case
  • Experiment Control & Visibility: FIS integrates with CloudWatch and uses existing metrics to monitor experiments. Each experiment's running and completed status and its triggers are visible in the console
  • No Setup or Agents Needed: FIS needs no prerequisite setup; the dependencies for fault simulations are managed by AWS
  • Cost: Because FIS uses a pay-per-use model, the overall cost incurred will be far lower than with commercial tools
  • Security: Experiments are tied to IAM for security. Because FIS is an AWS managed service, it is safe and secure and eliminates the need to install any other agents on the instances
  • Console access: FIS can be used from the console, the CLI, and the AWS APIs, which helps with continuous integration

Sample Fault Simulations: 

  1. Terminate single/multiple EC2 Instances across zones and regions 
  2. Reboot single/multiple App/Cache/DB Instances across zones and regions 
  3. Stop single/multiple EC2 Instances across zones and regions 
  4. CPU Stress in the EC2 Instances (High CPU / Throttle CPU / CPU Burn)
  5. Memory Stress in the EC2 Instances (Insufficient Memory)
  6. Hybrid Resource Stress in the EC2 Instances
  7. Latency in the EC2 Instances
  8. Disconnect Primary DB-Reboot DB Instance 
  9. Failover RDS DB 
  10. Insufficient Memory Issues with DB instance 
  11. Kill a particular microservice or process (by PID or name) in an instance
  12. Latency in Producer or Consumer Instances
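
As one possible illustration of the resource-stress simulations listed above, the CPU stress case could be expressed as an FIS action that runs an AWS-provided Systems Manager document. This is a sketch under assumptions: `build_cpu_stress_action` and the target name are hypothetical, the instances are assumed to be managed by Systems Manager, and the document name `AWSFIS-Run-CPU-Stress` with its `DurationSeconds` parameter reflects my understanding of the AWS-provided document and should be checked against current AWS documentation.

```python
import json
from typing import Any, Dict


def build_cpu_stress_action(region: str, target_name: str, minutes: int) -> Dict[str, Any]:
    """Return an FIS action dict that runs a CPU stress SSM document (hypothetical helper)."""
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            # AWS-owned SSM document; the account field in the ARN is left empty.
            "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress",
            # Document parameters are passed to FIS as a JSON-encoded string.
            "documentParameters": json.dumps({"DurationSeconds": str(minutes * 60)}),
            # How long FIS keeps the action running, as an ISO 8601 duration.
            "duration": f"PT{minutes}M",
        },
        # Must match a target name defined in the same experiment template.
        "targets": {"Instances": target_name},
    }


if __name__ == "__main__":
    print(json.dumps(build_cpu_stress_action("us-east-1", "app-instances", 2), indent=2))
```

The returned dictionary is meant to be placed under a name of your choosing in the `actions` section of an experiment template, next to a target definition with the same `target_name`.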