In the fast-paced world of software development and operations, Site Reliability Engineering (SRE) has emerged as a critical discipline for ensuring the stability, performance, and reliability of complex systems. But what does it take to be an effective SRE practitioner in the real world?
In this first installment of our "Insights from the Trenches" series, we dive into the world of Site Reliability Engineering and explore how seasoned practitioners navigate the challenges of maintaining complex systems.
We are joined by Harish Padmanabhan, an SRE professional, who shares his valuable insights on managing a complex, multi-cloud infrastructure that supports a major financial services platform. Harish discusses the challenges his team faces and the solutions they have implemented to ensure the platform's reliability and performance while handling an incredibly high volume of transactions.
Interviewer: Can you tell us about your team structure and the support model you follow?
Harish: Our SRE team is distributed across three geographical locations: North America, Europe, and Asia Pacific. We follow a "follow the sun" model to ensure continuous support coverage throughout the day. The team is responsible for setting up the platform, handling Continuous Customer Delivery (CCD), and providing continuous monitoring.
In North America, we have team members located in Houston and Argentina. The European team operates from Glasgow and London, while the Asia Pacific team is based in Singapore and Bangalore. This global distribution allows us to provide 24/7 support, with each region taking on the responsibility during their respective working hours.
Interviewer: What kind of infrastructure and observability tools have you used in your experience?
Harish: I have worked with platforms utilizing a hybrid infrastructure, consisting of Red Hat machines, Pivotal Cloud Foundry (a managed private cloud by VMware), and AWS for the public-facing portal. This hybrid approach allows us to leverage the strengths of each solution while ensuring high availability and scalability.
For logging, I have used Splunk with a custom setup involving Fluentd and Kafka topics. Instead of using the traditional Splunk forwarder, we use Fluentd to ship logs from the instances to Kafka topics. From there, the logs are consumed by Splunk. This setup enables us to efficiently deliver logs as feeds to more than 100 application teams without granting them direct access to the Splunk instance. By providing log feeds, we can control access and ensure better performance compared to allowing direct access to Splunk.
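To make the log-feed pattern concrete, here is a minimal Python sketch of how an application team might consume its own feed from a Kafka topic instead of querying Splunk directly. The topic name, broker address, and log fields are illustrative placeholders, not the actual production setup.

```python
# Minimal sketch of an application team consuming its log feed from a Kafka
# topic instead of querying Splunk directly. Topic and broker names are
# illustrative, not the real production configuration.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "app-team-42-logs",                       # hypothetical per-team topic
    bootstrap_servers=["kafka-broker:9092"],  # hypothetical broker address
    group_id="app-team-42-feed",
    auto_offset_reset="latest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for record in consumer:
    event = record.value
    # Each team processes only its own feed; no direct Splunk access needed.
    if event.get("level") == "ERROR":
        print(event.get("timestamp"), event.get("service"), event.get("message"))
```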
On the application performance monitoring (APM) front, we rely on Dynatrace. Dynatrace allows us to capture service-level metrics such as response times and JVM-level information. It provides us with detailed insights into the performance of our applications and helps us identify any potential bottlenecks or issues.
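As a rough illustration of the kind of service-level data Dynatrace exposes, the sketch below queries response-time percentiles through the Dynatrace Metrics API v2. The environment URL, API token, and metric selector are placeholders; consult your tenant's API documentation for the exact selectors available to you.

```python
# Hedged sketch: pulling service response-time metrics from the Dynatrace
# Metrics API v2 over plain HTTP. The tenant URL and token are placeholders.
import requests

DYNATRACE_ENV = "https://abc12345.live.dynatrace.com"  # placeholder tenant
API_TOKEN = "dt0c01.XXXX"                              # placeholder token

resp = requests.get(
    f"{DYNATRACE_ENV}/api/v2/metrics/query",
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    params={
        "metricSelector": "builtin:service.response.time:percentile(95)",
        "from": "now-1h",
        "resolution": "5m",
    },
    timeout=30,
)
resp.raise_for_status()

# Print each series' dimensions and data points from the query result.
for series in resp.json().get("result", []):
    for data in series.get("data", []):
        print(data.get("dimensions"), data.get("values"))
```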
Interviewer: Can you walk us through your process for onboarding new services?
Harish: Onboarding new services is a critical process that requires strict gating to ensure the services meet the necessary standards before being pushed to production. Our onboarding process involves several key steps.
First, the QA team performs regression tests to confirm the service functions as expected and does not introduce any regressions. Once the regression tests pass, the service moves on to the next stage.
Next, our SRE team runs a series of test scripts to validate the service's adherence to logging and performance standards. We have developed automated tools that validate the log formats and generate reports confirming whether the service is ready for production. This automation helps us streamline the onboarding process and ensures consistency across all services.
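The sketch below shows, in broad strokes, what such a log-format validation step might look like: it checks that sample log lines are well-formed JSON with a set of required fields and emits a simple readiness summary. The required field names are illustrative, not the team's actual standard.

```python
# Illustrative log-format validation: check that each line of a sample log
# file is JSON containing the required fields, then report readiness.
import json
import sys

REQUIRED_FIELDS = {"timestamp", "service", "level", "traceId", "message"}

def validate_log_file(path: str) -> dict:
    total, failures = 0, []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            total += 1
            try:
                record = json.loads(line)
                missing = REQUIRED_FIELDS - record.keys()
                if missing:
                    failures.append((line_no, f"missing fields: {sorted(missing)}"))
            except json.JSONDecodeError:
                failures.append((line_no, "not valid JSON"))
    return {"lines_checked": total, "failures": failures, "ready": not failures}

if __name__ == "__main__":
    report = validate_log_file(sys.argv[1])
    print(json.dumps({"lines_checked": report["lines_checked"],
                      "ready": report["ready"],
                      "failure_count": len(report["failures"])}, indent=2))
```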
We follow a regular release cycle, with a 15-day digital release for bug fixes and a 30-day candidate release for major features and integrations. The 15-day digital release focuses on addressing any critical bugs or issues that have been identified in the previous 15-30 days. On the other hand, the 30-day candidate release is a more comprehensive release that includes major feature updates and integrations.
To keep track of the services and their related information, we maintain documentation in SharePoint. This documentation includes service mappings, application development manager (ADM) information, GitHub repositories, and details about any open major incidents related to each service. By centralizing this information, we can easily access and manage service-related data.
Interviewer: How do you handle incident management and problem-solving?
Harish: Incident management is a crucial aspect of our SRE process, and we have established a robust system to handle incidents effectively. We use ServiceNow as our incident management tool, which automatically creates war rooms and engages the relevant teams based on predefined project IDs and application owners.
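For readers who want a feel for this kind of integration, here is a hedged sketch of raising an incident programmatically through the ServiceNow Table API, the sort of hook an alerting pipeline might call. The instance URL, credentials, and field values are placeholders rather than the team's real configuration.

```python
# Hedged sketch: create a ServiceNow incident via the Table API, as an
# alerting pipeline might. Instance, credentials, and fields are placeholders.
import requests

INSTANCE = "https://example.service-now.com"   # placeholder instance
AUTH = ("svc_account", "secret")               # placeholder credentials

payload = {
    "short_description": "High error rate on payments-api (auto-created)",
    "urgency": "1",
    "impact": "1",
    "assignment_group": "SRE-Platform",        # hypothetical group
    "cmdb_ci": "payments-api",                 # hypothetical CI / project ID
}

resp = requests.post(
    f"{INSTANCE}/api/now/table/incident",
    auth=AUTH,
    headers={"Content-Type": "application/json", "Accept": "application/json"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print("Created incident:", resp.json()["result"]["number"])
```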
When a major incident occurs, our first priority is to assemble the necessary teams and stakeholders in the war room. We collaborate closely with the application development teams to identify the root cause of the issue and implement fixes. The problem management team, which is a subset of the SRE team, takes the lead in initiating war rooms and driving the Root Cause Analysis (RCA) process.
Once the immediate issue is resolved, we conduct a structured postmortem process to document the incident, identify the contributing factors, and derive learnings from it. The postmortem process involves a thorough analysis of the incident timeline, the actions taken, and the outcomes. We focus on identifying areas for improvement and defining action items to prevent similar incidents from occurring in the future.
To ensure that the learnings from incidents are properly implemented, we assign ownership of the action items to specific individuals or teams. The progress of these action items is tracked and regularly reviewed to ensure they are completed in a timely manner.
Interviewer: Can you explain your approach to SLO and alert customization?
Harish: Service Level Objectives (SLOs) and alert customization are essential components of our SRE strategy. We work closely with the application development teams to define SLOs based on the monitoring data collected over a 15-day period.
To streamline the process of setting up alerts, we have developed an automated tool that allows application teams to configure standard alerts by filling out a simple form. The form includes fields for specifying the service name, alert patterns, thresholds, and notification preferences. This automated tool reduces the manual effort required to set up alerts and ensures consistency across different services.
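A simplified sketch of that idea follows: take the fields from such a form and render a standard alert definition. The output schema here is hypothetical and not tied to any specific monitoring product.

```python
# Illustrative sketch: turn a simple form submission into a standard alert
# definition. Field names mirror those described above; the output schema is
# hypothetical, not a specific monitoring product's format.
from dataclasses import dataclass
import json

@dataclass
class AlertForm:
    service_name: str
    alert_pattern: str       # e.g. a log pattern or metric name
    threshold: float
    window_minutes: int
    notify: list             # notification targets (emails, channels)

def render_alert(form: AlertForm) -> dict:
    """Build a standard alert definition from the form fields."""
    return {
        "name": f"{form.service_name}-{form.alert_pattern}-alert",
        "query": f'service="{form.service_name}" AND {form.alert_pattern}',
        "condition": {"threshold": form.threshold,
                      "window": f"{form.window_minutes}m"},
        "notifications": form.notify,
    }

if __name__ == "__main__":
    form = AlertForm("payments-api", "http_5xx_rate", 0.02, 5,
                     ["sre-oncall@example.com"])
    print(json.dumps(render_alert(form), indent=2))
```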
However, there are cases where manual customization is necessary, particularly for complex alerting scenarios. For example, when we need to correlate queue size with specific error codes, it requires custom configuration. In such cases, we collaborate with the application teams to understand their specific requirements and develop tailored alerting solutions.
We also define Service Level Indicators (SLIs) based on logs and metrics, capturing key aspects such as availability, stability, and performance. Availability is measured by the number of HTTP calls, while stability takes into account the number of HTTP calls and custom exceptions. Performance is evaluated based on response times, including the 95th and 99th percentile values.
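One possible way to compute these SLIs from a batch of request records is sketched below: availability as the fraction of non-5xx HTTP calls, stability additionally excluding custom exceptions, and performance as the 95th and 99th percentile response times. The record fields and exact definitions are illustrative interpretations, not the team's formulas.

```python
# Rough SLI computation over a batch of request records: availability,
# stability (excluding custom exceptions), and p95/p99 response times.
import statistics

def compute_slis(requests_batch):
    total = len(requests_batch)
    ok = sum(1 for r in requests_batch if r["status"] < 500)
    clean = sum(1 for r in requests_batch
                if r["status"] < 500 and not r.get("custom_exception"))
    latencies = [r["response_ms"] for r in requests_batch]
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "availability": ok / total,
        "stability": clean / total,
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }

if __name__ == "__main__":
    sample = [{"status": 200, "response_ms": 120 + i,
               "custom_exception": i % 40 == 0} for i in range(200)]
    print(compute_slis(sample))
```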
By setting up comprehensive SLOs and customizing alerts based on the specific needs of each service, we can proactively detect and respond to issues before they impact end-users.
Interviewer: What are some of the challenges you face in your SRE journey, and how do you address them?
Harish: One of the main challenges we face is ensuring that application development teams follow the required processes and provide the necessary documentation. Initially, there was resistance from some teams, as they were not accustomed to the level of documentation and process adherence we required. To address this, we focused on building bridges and educating the teams about the importance of these processes. We emphasized how following these processes would ultimately benefit them by reducing incidents and improving the overall reliability of their services.
Another challenge is the manual effort required to set up and maintain custom alert scenarios. These scenarios often involve complex correlations between different metrics or logs, making them time-consuming to configure. To mitigate this, we are continuously working on automating as much of the alerting setup process as possible. By developing reusable templates and scripts, we aim to reduce the manual effort involved in configuring custom alerts.
Automating infrastructure setup and reducing toil is an ongoing process, and there are still some tasks that require manual intervention. For example, configuring firewalls for newly procured instances is a manual task that we are working on automating. We have made significant progress in automation, but there is always room for improvement.
Maintaining up-to-date documentation for a large number of services (around 1500) is another challenge, especially with regular releases and changes in ownership. To tackle this, we have established processes and guidelines for documentation updates. We encourage teams to treat documentation as a first-class citizen and allocate time for updating it regularly. We also leverage automation wherever possible to keep the documentation in sync with the actual state of the services.
Ensuring comprehensive monitoring coverage for all services is an ongoing effort, particularly when dealing with a large number of services and application teams. We continuously assess our monitoring coverage and identify gaps. We work closely with the application teams to onboard new services and ensure they have the necessary monitoring and alerting in place.
Interviewer: How do you approach toil reduction and automation?
Harish: Toil reduction and automation are key focus areas for our SRE team. We are constantly looking for opportunities to automate repetitive tasks and reduce manual effort.
One of the areas where we have made significant progress is in the automation of infrastructure setup. We have developed custom tools and scripts that automate the provisioning and configuration of resources. This includes automating the setup of monitoring agents, configuring logging pipelines, and provisioning the necessary infrastructure components.
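As one possible shape for this kind of automation, the sketch below launches an AWS instance whose user data installs a monitoring agent at first boot. The AMI ID, instance type, and agent install command are placeholders, not the team's actual tooling.

```python
# Hedged sketch of the provision-plus-agent pattern: launch an EC2 instance
# with user data that installs a monitoring agent at boot. All identifiers
# and the install command are placeholders.
import boto3

USER_DATA = """#!/bin/bash
# Placeholder: download and install the monitoring agent at first boot.
curl -sSL https://example.com/install-agent.sh | bash
"""

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "managed-by", "Value": "sre-automation"}],
    }],
)
print("Launched:", response["Instances"][0]["InstanceId"])
```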
We have also invested in developing custom tools for reporting, alerting, and service discovery. These tools help us streamline our processes and provide valuable insights into the health and performance of our services.
While we have made significant progress in automation, there are still some tasks that require manual effort. For example, configuring custom alerts for new services or ensuring comprehensive monitoring coverage can be time-consuming. We are actively working on improving our automation capabilities in these areas to further reduce toil.
One approach we take is to identify common patterns and develop reusable templates or scripts that can be easily adapted for different services. By leveraging these templates, we can reduce the manual effort required to set up monitoring and alerting for new services.
We also encourage a culture of automation within the team. Whenever someone encounters a manual task that is repetitive or time-consuming, we encourage them to explore ways to automate it. This could involve writing scripts, developing tools, or leveraging existing automation frameworks.
By continuously focusing on toil reduction and automation, we aim to free up our SRE team's time and allow them to focus on higher-value tasks, such as improving the overall reliability and performance of our services.
Key takeaways from Harish's approach:
1. Implement a strict gating process for onboarding new services, ensuring adherence to logging and performance standards
2. Automate log format validation and report generation to streamline the onboarding process
3. Use incident management tools like ServiceNow to create war rooms and engage relevant teams automatically
4. Collaborate with application development teams to define SLOs and customize alerts based on specific requirements
5. Continuously work on automating infrastructure setup and reducing manual toil
6. Use tools for reporting, alerting, and service discovery to improve efficiency
7. Maintain up-to-date documentation and ensure comprehensive monitoring coverage for all services
To put these practices to work in your own organization:
1. Evaluate your current team structure and consider adopting a "follow the sun" model for global coverage
2. Assess your infrastructure and identify opportunities to leverage a hybrid approach with both on-premises and cloud solutions
3. Implement a strict gating process for onboarding new services, with automated validation of logging and performance standards
4. Invest in developing custom tools for reporting, alerting, and service discovery to improve efficiency and reduce toil
5. Foster collaboration between SRE and application development teams to define SLOs and customize alerts
6. Prioritize efforts to automate infrastructure setup and reduce manual interventions
7. Establish processes for maintaining up-to-date documentation and ensuring comprehensive monitoring coverage for all services
Harish Padmanabhan's insights into managing the multi-cloud infrastructure highlight the importance of a comprehensive approach to SRE. By focusing on automation, collaboration with application development teams, and continuous improvement, Harish and his team have been able to ensure the reliability and performance of the chase.com platform, despite the challenges of handling a massive volume of transactions.
Harish Padmanabhan is an accomplished Site Reliability Engineering (SRE) professional with over 7 years of experience at JPMorgan Chase & Co. Currently serving as the Vice President of Site Reliability Engineering, Harish has been instrumental in building and growing the SRE team, implementing SRE process models, and driving innovation efforts.
Linkedin: https://www.linkedin.com/in/harish-padmanaban-ph-d-39727442/
Harish Padmanabhan | SRE Lead (SRE / DevOps / Platform) at JPMorgan Chase & Co.