Senior Site Reliability Engineer

Job Expired
We’re looking for someone to help improve our reliability and performance through deep analysis and remediation of our AWS infrastructure, monitors, alerts, and code.

 Key Responsibilities 
  • Refactor our existing monitors and alerts to be actionable and reliable, recommending and implementing diagnostic techniques and monitoring tools.
  • Perform deep-dive analysis of RDS (Aurora PostgreSQL) performance, using that data to inform scaling policies and automation
  • Help discover correlations between customer experience and performance indicators to determine what customers actually notice, then suggest and implement improvements based on the findings
  • Help us develop SLIs, SLOs, and SLAs that are meaningful in terms of our customers’ experience
  • Help triage outages and issues across multiple teams, services, and codebases as they arise, leading root cause analysis and creating stories to prevent and/or detect those issues in the future
  • Serve as technical lead for deep dives to identify solutions to prevent future incidents
  • Introduce chaos engineering, promoting experimentation in production to discover and remediate systemic weaknesses and improve performance and reliability

 Skills, Knowledge, and Expertise 
  • Expertise in AWS
  • Expertise with RDS, preferably Aurora PostgreSQL engine
  • Expertise with containerization
  • Experience with open source monitoring and visualization systems and tools, e.g. Prometheus (metrics monitoring + alerting), Grafana/Kibana (dashboards), Graylog (logging)
  • Experience implementing, maintaining, and troubleshooting continuous integration/continuous delivery (CI/CD) tooling
  • Experience with implementing improvements in areas such as maintainability, scalability, availability, extensibility and security
  • Ability to work with many teams across disciplines (cloud, platform, development, QA, and security) to resolve issues as they arise and implement improvements
  • Experience with distributed tracing, diagnostic tooling, application performance monitoring, and the golden signals

 Our Stack 
Our stack is evolving over the next year and we’d love you to be a part of that! 
Currently we’re using:
  • Back-end: JavaScript/TypeScript, Node.js, ES6, Go
  • Data: Aurora PostgreSQL, Redis, Elasticsearch
  • DevOps & Deployment: All things AWS, Terraform (and Terraform Cloud), Jenkins, GitHub, Grafana, Graylog
  • Testing: Playwright, Mocha, Jest
  • Front-end: Vue.js, Webpack, SCSS

Company Information
  • Location California
  • Full Address 2150 Shattuck Ave, Berkeley, CA 94704, US