Job Type: Full Time
Amazon TechOps is at the heart of the high availability of Amazon Web Services. We make customer-impacting events shorter and less frequent by providing large-scale event and incident management. Our automated tooling quickly identifies the cause of an issue and helps mitigate its impact, and much of our engineering time is spent on projects to improve the tooling and automation. We also provide manual incident management for AWS and other Amazon groups, directing the resolution of an issue with service teams and diving deep into those events to drive improvements to the tooling. It's an exciting time to join our team as we are rapidly growing and expanding our offerings.
As a System Development Engineer on the team, you will build tooling to automate the detection and resolution of issues within AWS and Amazon infrastructure. You will also spend a portion of your time directing the resolution of high-visibility incidents by leading conference calls and virtual teams. Using data learned from those incidents, you will drive further improvements into our automation, tooling, and processes so that the next event is shorter or avoided entirely. You will participate on project teams to expand use of our tooling to additional areas across Amazon. If you're looking for a team with great growth potential and an opportunity to make a huge impact, this is the team to join.
- Drive the resolution of large-scale, customer-impacting issues as part of a team rotation, including some weekends and holidays
- Lead projects and virtual teams to drive operational improvements
- Design, build, and enhance event detection and management tools
- Participate in Agile sprints to evolve business processes and technologies
- Create and review documentation; design new standard operating procedures
- Identify and troubleshoot recurring platform issues and own projects to drive improvements
- Mentor peers in your areas of technical and operational strength
- Bachelor's Degree in Computer Science or at least 4 years relevant experience in a large-scale technical environment
- 3-5 years of experience using and troubleshooting Linux or Unix based systems
- 3-5 years of experience troubleshooting and resolving technical issues in a distributed environment
- 2+ years of experience automating tasks using scripting languages. Software development experience with compiled languages is a plus.
- 2+ years of experience driving collaborative projects from conception to delivery using Agile/Scrum methodology
- Solid grasp of networking fundamentals
- Effective organizational skills and the ability to maintain a consistently high standard of operations in a busy environment
- English language written and verbal communication skills
- Experience building services for a large-scale cloud platform such as AWS
- Knowledge of current best practice frameworks such as ITIL
- Experience driving and managing large troubleshooting efforts
- Experience dealing effectively with internal technical teams during problem resolution
- Ability to operate and communicate effectively under pressure
- Experience dealing effectively with internal customers during problem resolution