As we grow rapidly, so does our need to establish stellar site reliability engineering practices. You will help form our first dedicated SRE team.
We are rapidly growing our platform and strive to make performance and platform reliability key indicators of success. As part of the team, you will take a leading role in creating a performance culture, enabling automation, introducing measurements, and facilitating sharing across teams.
Key responsibilities:
Define SLIs, SLOs and error budgets together with the product team
Establish error budgets for the product team: how much flexibility can we allow without business impact?
Help further strengthen our application performance monitoring
Ensure visibility of SLIs and SLOs to empower teams to make sound performance decisions
Set up alerts and triggers to proactively identify performance issues
Assist the product team in identifying opportunities for performance improvement projects and take an active part in their execution
Take an active part in architectural work and contribute performance-related considerations
Document your work to equip the product team to practice what you preach
Improve our deployment process to make it as boring as possible
Your skills:
Ability to think in systems and architecture, including edge cases and error handling
5+ years of engineering experience
Strong communication skills: this is not a one-person job; you have to pave the way for our teams
An enthusiastic and pragmatic go-for-it attitude. When you see something broken, you can't help but fix it.
Experience with some of the technologies in our stack: Node, GraphQL, PostgreSQL, AWS, Google Cloud and Heroku
Professional experience in product teams working on large-scale systems
This job comes with several perks and benefits