The Sr Director, Site Reliability Engineering (SRE) is responsible for developing and implementing a comprehensive strategy for site reliability, encompassing scalability, performance, and reliability improvements. The role will align SRE objectives with overall business goals and technology roadmaps. It will foster a spirit of continuous improvement within the SRE function and position that function to support organizational objectives across the Berkley Corporation.
The person in this role is responsible for overseeing SRE team operations, ensuring the reliability and availability of key applications and supporting infrastructure. This role will work effectively with Service Management to enforce best practices for system reliability, monitoring, capacity planning, incident response, problem management, disaster recovery, change management, and workflow automation. They will also own and administer the tools and technologies necessary to generate a complete view of SRE metrics and improvement areas, including (but not limited to) monitoring, logging, notification, dashboarding, and AIOps.
Team Performance Management:
- Establish and build a robust SRE team over time and integrate SRE into Berkley’s product development and operational processes.
- Recruit, mentor, and develop a high-performing team of SRE professionals.
- Monitor ongoing staff performance; identify and communicate opportunities for improvement.
- Provide leadership and support to ensure projects are staffed appropriately and timelines are met.
Collaboration and Relationship Building:
- Collaborate with the BTS IT Leadership Teams and other groups across the IT organization to drive a unified approach to site reliability that reduces downtime and minimizes the business impact of outages.
- Foster strong relationships with delivery organization leadership to align SRE efforts with organizational goals. Work collaboratively with other business and IT leaders to ensure cross-functional problems are addressed cohesively across the organization.
- Work cross-functionally in partnership with software development teams to guide product development in creating resilient and durable software systems.
- Collaborate with EA to institute design patterns for resilient systems and mechanisms for scoring applications against industry-recognized configurations (including active-active, active-passive, recover-from-scratch, and data replication scenarios).
Execution, Project, and Work Management:
- Define and track reliability and observability OKRs for infrastructure and key systems.
- Implement robust monitoring and alerting systems to proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
- Implement AIOps functionality to enable auto-response, self-healing, and anomaly trend analysis.
- Drive the development and implementation of automation solutions to remove “toil”, streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.
- Work closely with product, development, infrastructure, and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand. Anticipate growth and scalability requirements.
- Establish and oversee effective high-severity incident response processes, ensure timely incident resolution, and conduct post-mortems to identify root causes and implement preventive measures.
- Improve reliability by identifying and addressing gaps in our architecture, services, and tooling.
- Oversee the disaster recovery program for both on-premises and cloud-based Berkley solutions.
- Perform other duties as assigned.