SUMMARY:
Client is seeking a Site Reliability Engineer to join a fast-growing Operations team focused on monitoring, deployment stability, and system reliability. You will use your strong operations background and scripting skills to instrument our monitoring platforms so that anomalies are detected before they become problems.
RESPONSIBILITIES:
Develop monitoring and notification policies for production applications.
Own operability, scalability, and performance processes.
Establish and maintain performance thresholds to reduce noise and increase reliability, responsiveness, and accuracy.
Document alerts and define procedures for resolving them.
Facilitate post-incident reviews to identify areas for improvement.
Work with the deployment team to ensure parity between environments and a stable build and release cycle.
Help oversee and maintain Jenkins instances and automated/scheduled jobs.
Work with engineering teams to understand system architecture, identify single points of failure, and design a reliable production environment.
Additional responsibilities as assigned.
EDUCATION/EXPERIENCE/LICENSURE:
KNOWLEDGE, SKILLS, AND ABILITIES:
Comfortable working in an environment of frequent, incremental code testing and deployment.
Comfortable collaborating and communicating openly across functional and organizational boundaries.
Desire to learn and apply new technologies is required.
Ability to work effectively within a team environment is required.
ADDITIONAL INFORMATION:
This position typically requires 40 hours per week, with occasional evening and/or weekend work to meet deadlines.
This position requires participation in an on-call rotation.
Flexibility in work schedule is needed due to the on-call rotation and night/weekend deadlines.