We have partnered with our client in search of a Site Reliability Engineer
Roles & Responsibilities:
a. Skillset – AWS, Big Data, Spark, Python, Shell / Perl Scripting, Control-M, Autosys, Grafana, AppDynamics, APICA
b. Experience –
• At least 5 years of experience in AWS, Big Data, Spark
• 2-3 years in Python, Shell Scripting
Monitor applications, infrastructure, servers, databases, and distributed batch jobs using Splunk or Dynatrace, and support sustained resiliency, disaster recovery, and high availability events
Triage Distributed and Mainframe applications and Middleware Platform related Incidents
Incident and Problem Management functions (i.e. responding to service requests from Client-facing support team, Operations, Risk/Control partners, etc.)
Perform Unix Shell, Perl, and Bash scripting as required.
Analyze incidents reported for IBM WebSphere, Apache Tomcat, and IBM DataPower, and triage them for resolution.
Run job schedulers such as Control-M
Troubleshoot technical issues (Java/J2EE, .Net, or Cloud) and escalate and work with appropriate technology teams to provide solutions
Incident Management (coordinate incident management to ensure appropriate coverage, call facilitation, coordination, and communications during critical outage situations)
Call documentation, queue management, ticket analysis, and interfacing with impacted lines of business for incident impact analysis via the Production Assurance process
Provide an end-to-end view of issues for objectivity, and be able to act as a single voice for the line of business
Influence senior technology leads across organizations to ensure timely resolution of incidents
Required Skills:
Proven expertise in application development and support environment with more than one technology and multiple design techniques
Advanced knowledge of development toolset to design, develop, test, deploy, maintain, and improve software
Proficiency in AWS, Akamai, and Datadog technologies
Proficiency in Splunk, Dynatrace, Unix, Linux, Tomcat, and WebSphere.
Experience in setting up Splunk alerting and monitoring.
Experience in building monitoring dashboards in Dynatrace and the ability to triage application performance through deep-dive analysis in Dynatrace.
Experience in configuring Tomcat instances and WebSphere JVMs.
Nice To Have Skills:
Experience with one or more general-purpose programming languages (Java, Python, .Net, C++, etc.)
Experience with cloud platforms like AWS and understanding of Pivotal Cloud Foundry.
Understanding of network topologies, load balancing concepts, and content delivery networks
Understanding of HAProxy (context-based routing) and Pivotal GemFire (extended L2 cache)
Understanding of web and mobile applications.
Understanding of relational databases like Oracle/DB2
Understanding of non-relational databases like Cassandra.
Understanding of IBM MQ and Kafka.
Understanding of risk controls and compliance to departmental and company-wide standards
Ability to work collaboratively in teams and develop meaningful relationships to achieve common goals
Ability to run production incident conference bridges