Our Information Systems group builds and maintains over 100 applications that enable data-driven, real-time decisions across the company. As a Site Reliability Engineer, you will maintain and improve the reliability of several mission-critical applications that must run 24x7x365 within the data center.
Responsibilities:
• Monitor and maintain the agreed-upon uptime SLAs for mission-critical applications
• Document downtime incidents and the response to each, and meet with the relevant teams to identify root causes and establish remediation procedures
• Understand, operate, and document complex production execution platforms in order to triage downtime incidents and develop a documented triage process
• Proactively monitor the health of software and hardware
• Design and develop solutions that support high availability, reliability, and security for existing applications
• Develop and maintain automation tools and processes for deployment, configuration, monitoring, and alerting of existing systems
• Collaborate with development and operations teams to identify and resolve software and infrastructure issues
• Identify, advocate for, and implement best practices for system scalability, security, and performance
• Conduct research and development to identify new tools and technologies that can improve existing software systems
• Participate in an on-call rotation to respond to critical incidents and ensure system availability and reliability
Requirements:
• Bachelor's degree in Computer Science, Computer Engineering, or a related field
• 5+ years of experience in software engineering or site reliability engineering
• Proficiency in Linux
• Knowledge of containerization and container orchestration technologies such as Docker and Kubernetes
• Experience debugging production systems using instrumentation and monitoring
• Experience with observability platforms
• Development experience with Python is preferred