Speaker

Kavin Arvind Ragavan is a Cloud Performance Architect with 12 years of experience in Performance and Reliability Engineering.

Kavin has strong expertise in architecting and designing solutions for validating cloud performance (server side and client side) and resilience. He specializes in AWS and GCP cloud performance and resilience engineering. He has architected cloud chaos frameworks for GCP using GCP Cloud Workflows and open-source tools, and has presented them as a solution at conferences.

Kavin has designed and presented CI/CD-based frameworks that perform early performance, resilience, and accessibility validation in the CI/CD pipeline and identify potential performance bottlenecks during the development phase. He has extensive experience with cloud performance testing solutions and with monitoring strategies that use performance and APM tools for cloud migrations and new application development in the cloud.

He has also presented technical whitepapers, participated in interactive talks, and conducted workshops on cloud performance testing, resilience testing, and microservices at various software conferences. He has published blogs on chaos and observability frameworks at various events and on platforms such as Medium.com.

Title: Applying SRE Principles to Containers: Kubernetes Chaos Engineering

SRE Best Practices for Containers

1. Shift Left into the Dev Cycle

Dev, Perf, and Kubernetes SRE teams can identify weaknesses and potential outages in infrastructure earlier by injecting chaos tests in a controlled way in the CI pipeline. Chaos tests can be run at any stage of the DevOps cycle.
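
As a rough illustration of a chaos stage in a CI pipeline, the sketch below compares a steady-state metric measured before and during fault injection and returns an exit code the pipeline can act on. The function name `chaos_gate`, the p95 latency inputs, and the 20% tolerance are hypothetical choices for illustration, not part of any specific tool.

```python
def chaos_gate(baseline_p95_ms: float, chaos_p95_ms: float,
               max_regression: float = 0.20) -> int:
    """Return 0 (pass) if latency under chaos stays within tolerance, else 1.

    All thresholds here are assumptions; a real pipeline would take them
    from the team's own service-level objectives.
    """
    if baseline_p95_ms <= 0:
        raise ValueError("baseline latency must be positive")
    # Relative slowdown caused by the injected fault.
    regression = (chaos_p95_ms - baseline_p95_ms) / baseline_p95_ms
    return 0 if regression <= max_regression else 1


# Hypothetical measurements taken before and during a fault injection.
exit_code = chaos_gate(baseline_p95_ms=180.0, chaos_p95_ms=205.0)
```

A CI job could pass `exit_code` to `sys.exit` so that a regression beyond the tolerance fails the build.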

 

2. Shift Right into Production

Resilience can be validated in the staging environment and eventually in production under actual user load. Finding and fixing bugs and vulnerabilities this way increases the resilience of the system. The extent of chaos tests varies from lower-level environments to production.
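
One way to express how the extent of chaos varies by environment is a small policy table that maps each environment to a blast radius. The sketch below is an assumption made for illustration: the environment names and the percentage are hypothetical, while `mode` and `value` are intended to mirror the fields Chaos Mesh uses to choose how many matching Pods an experiment targets (`all`, `fixed-percent`, `one`).

```python
# Hypothetical policy: wider blast radius in lower environments,
# a single Pod at a time in production.
BLAST_RADIUS = {
    "dev": {"mode": "all"},
    "staging": {"mode": "fixed-percent", "value": "50"},
    "production": {"mode": "one"},
}


def blast_radius(environment: str) -> dict:
    """Return a copy of the target-selection settings for an environment."""
    try:
        return dict(BLAST_RADIUS[environment])
    except KeyError:
        raise ValueError(f"no chaos policy defined for {environment!r}") from None
```

Failing loudly on an unknown environment is a deliberate choice here, so that an experiment never runs with an unintended default scope.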

   

3. Testing for Kubernetes Changes

Run chaos tests in all change scenarios, such as deploying new code, adding dependencies, observing changes in usage patterns, mitigating problems, certifying Kubernetes upgrades, and validating services after an upgrade.

 

4. Validate Application/Service Resilience

Verify application resilience whenever the underlying stack changes. This can also take the form of Continuous Resilience: the process of continuously verifying that the application service is resilient against faults.

     

5. Validate Infrastructure Resilience 

Application resilience depends more on the underlying stack than on the application itself. Once the application is stable, the resilience of a service running on Kubernetes depends mostly on other components and the infrastructure.

   

6. Resilience Benchmarking

Chaos workflows support the user in defining the expected result, observing the actual result, analyzing overall system behavior, and deciding whether the system needs tuning to improve resilience. The same results provide a basis for resilience benchmarking.
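
A benchmark needs a number that can be compared across runs. As a minimal sketch, the function below computes a weighted resilience score from per-experiment results, where each experiment carries a weight and the percentage of its checks that passed. The weighting scheme and all names are assumptions made for illustration, not a definitive formula from any tool.

```python
from typing import Iterable, Tuple


def resilience_score(results: Iterable[Tuple[int, float]]) -> float:
    """Weighted average of pass percentages, on a 0-100 scale.

    Each item is (weight, pass_percentage); a higher weight marks an
    experiment the team considers more important.
    """
    results = list(results)
    total_weight = sum(weight for weight, _ in results)
    if total_weight <= 0:
        raise ValueError("at least one experiment with positive weight is required")
    return sum(weight * passed for weight, passed in results) / total_weight


# Hypothetical run: three experiments with weights 10, 5, and 5.
score = resilience_score([(10, 100.0), (5, 50.0), (5, 0.0)])
```

Tracking this score per release makes it possible to see whether tuning actually improved resilience.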

 

No. | Category | Type | Faults
1 | Platform | Pod Chaos | Simulates Pod failures, such as Pod node restart, Pod's persistent unavailability, and certain container failures in a specific Pod.
2 | Platform | Node Chaos | Simulates GCP platform failures, such as the GCP node restart.
3 | Network | Network Chaos | Simulates network failures, such as network latency, packet loss, packet disorder, and network partitions.
4 | Network | DNS Chaos | Simulates DNS failures, such as the parsing failure of DNS domain name and the wrong IP address returned.
5 | Infrastructure | Stress Chaos | Simulates CPU or memory stress.
6 | Infrastructure | File IO Chaos | Simulates the I/O failure of an application file, such as I/O delays, read and write failures.
7 | Infrastructure | Time Chaos | Simulates the time jump exception.
8 | Infrastructure | Kernel Chaos | Simulates kernel failures, such as an exception of the application memory allocation.
9 | Application | HTTP Chaos | Simulates HTTP communication failures, such as HTTP communication latency.
10 | Application | JVM Chaos | Simulates JVM application failures, such as the function call delay.
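
To make the fault types above concrete, the sketch below builds two Chaos Mesh experiment manifests as Python dictionaries and serializes one to JSON, which `kubectl apply -f` accepts alongside YAML. The field names are intended to follow the Chaos Mesh `PodChaos` and `NetworkChaos` custom resources and should be verified against the installed Chaos Mesh version. The experiment names, namespace, labels, and durations are hypothetical.

```python
import json

API_VERSION = "chaos-mesh.org/v1alpha1"


def _selector(namespace: str, labels: dict) -> dict:
    """Target Pods in one namespace that carry the given labels."""
    return {"namespaces": [namespace], "labelSelectors": labels}


def pod_failure(name: str, namespace: str, labels: dict,
                duration: str = "30s") -> dict:
    """Pod Chaos: make one matching Pod unavailable for `duration`."""
    return {
        "apiVersion": API_VERSION,
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-failure",
            "mode": "one",
            "duration": duration,
            "selector": _selector(namespace, labels),
        },
    }


def network_delay(name: str, namespace: str, labels: dict,
                  latency: str = "100ms", duration: str = "60s") -> dict:
    """Network Chaos: add latency to traffic of all matching Pods."""
    return {
        "apiVersion": API_VERSION,
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            "duration": duration,
            "selector": _selector(namespace, labels),
            "delay": {"latency": latency},
        },
    }


# Hypothetical target: Pods labelled app=checkout in the "demo" namespace.
manifest = pod_failure("checkout-pod-failure", "demo", {"app": "checkout"})
print(json.dumps(manifest, indent=2))
```

Generating manifests from code keeps the target selector in one place, which helps when the same experiment is reused across environments.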

 

Resilience/Chaos Engineering for Containers

  • Resiliency is the ability of a system to gracefully handle and recover from hardware and software failures while providing an acceptable level of service to the business.

  • Resilience/Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

  • Applying Chaos Engineering experiments to cloud services and Kubernetes helps to continuously improve an application’s performance, observability, and resiliency through different fault simulations.

  1. SPOF Failures: Failure of one service or component should not have a cascading impact on other components.
  2. Dependency Failures: Failure of a dependent service, such as a database or cache, should not bring the application down.
  3. App-Level Failure Injections: Introduce resource, state, and network-level faults into the application.
  4. Data Failures: Data should remain available to applications if the system that originally hosted the data fails.
  5. Canary Deployment Failures: Verify the automated rollback mechanism for code in production in case of failure.
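
The dependency-failure case can be sketched in code: the application wraps a call to a dependency so that a failure degrades the response instead of taking the application down. The function names and the simulated cache outage below are hypothetical, and a real service would typically add timeouts, retries, or a circuit breaker.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    handled: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Try the primary dependency; on a handled failure, use the fallback."""
    try:
        return primary()
    except handled:
        return fallback()


def read_from_cache() -> dict:
    # Simulated fault: the cache is unreachable, as a chaos test would arrange.
    raise ConnectionError("cache unavailable")


def read_from_database() -> dict:
    return {"user": "42", "source": "database"}


result = call_with_fallback(read_from_cache, read_from_database)
```

A chaos experiment that kills or isolates the cache then verifies this path in a running system rather than only in a unit test.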

Tools for Kubernetes Chaos 

Chaos Mesh is an open-source, cloud-native Chaos Engineering platform. It offers various types of fault simulation and can orchestrate complex fault scenarios. Using Chaos Mesh, we can simulate abnormalities that might occur in development, testing, and production environments and find potential problems in the system.

Litmus is a cloud-native Chaos Engineering framework with cross-cloud support. It helps Kubernetes SREs and developers find weaknesses both in non-Kubernetes platforms and in platforms and applications running on Kubernetes by providing a complete Chaos Engineering framework and associated chaos experiments.

Chaos Mesh Framework and Features  

  • Authenticated login: Role-based access control (RBAC) governs login to clusters.
  • Cloud native: Chaos Mesh supports Kubernetes environments and automates chaos experiments through its CRDs.
  • Workflow orchestration: Design your own chaos experiment scenarios on the platform, including multiple combined experiments and application status checks.
  • High security: Chaos Mesh is designed with multiple layers of security control.
  • Community support: Chaos Mesh is an incubating project hosted by CNCF, with a growing number of contributors and adopters around the world.

Litmus Framework and Features

Users & Teams

  • Creating users with role-based access control
  • Creating a team of multiple users
  • Authenticating users

Monitoring & Observability

  • Connecting a data source (from any agent) and monitoring workflows
  • Monitoring the effect of chaos in real time with interleaved events and metrics from a Prometheus data source
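
To compare metrics before, during, and after a fault, one option is to pull a time window around the experiment from Prometheus. The sketch below only builds the request URL for the `query_range` endpoint of the Prometheus HTTP API; the server address, the metric name in the PromQL expression, and the timestamps are hypothetical.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, promql: str, start: int, end: int,
                    step: str = "15s") -> str:
    """Build a Prometheus range-query URL for the window [start, end].

    `start` and `end` are Unix timestamps in seconds.
    """
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical 10-minute window covering a chaos experiment.
url = query_range_url(
    "http://prometheus.example:9090",
    'sum(rate(http_requests_total{status=~"5.."}[1m]))',
    start=1700000000,
    end=1700000600,
)
```

Fetching this URL with any HTTP client returns the series for the window, which can then be lined up against the experiment's start and end events.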

Customized Workflows

  • Creating scenarios from templates, as custom workflows from scratch (using Chaos Hubs), or from pre-created YAML files
  • Attaching priorities to chaos experiments based on your use cases

Modern fault scenarios

  • Many new Kubernetes-native chaos scenarios for simulating faults when testing distributed systems