Our client, a worldwide media and entertainment brand, is seeking a Senior Site Reliability Engineer to join their team!
Responsibilities:
Build monitoring and automation to quickly triage and discover failures across hardware, software, applications and platforms
Perform in-depth analysis of service trends and implement adjustments to mitigate risk and prevent issue recurrence
Provide guidance to application engineers related to design patterns that are resistant to failure
Collaborate with organizational partners to ensure services are designed to be scalable and operable
Participate in a 24/7 on-call rotation to support Major Incident response
Required Skills:
Solid experience supporting AWS Cloud solutions, product offerings, and container technologies (e.g., ECS, Kubernetes)
Experience supporting and monitoring Google Cloud environments
Strong understanding of proactive monitoring methodologies using APM solutions (e.g., AppDynamics, New Relic) or other monitoring tools
Experience implementing self-healing recovery capabilities, monitoring tools, and dashboards
Knowledge of Prometheus, Grafana, automation development, Splunk, Linux, and system and application monitoring
Understanding of technical architecture, application systems design, and integration in a large heterogeneous enterprise environment, with hands-on experience in SOA, Angular/Node, Java/J2EE, and Oracle or MySQL/MariaDB
5+ years of programming experience in one or more of: Java, Node, Python, Perl, or C
5+ years of UNIX systems knowledge and/or systems administration background
Passion for analyzing and troubleshooting large-scale distributed systems
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership
Strong understanding of OpenTelemetry technology and best practices
Experience contributing to open source code using version control