We are looking for an experienced Site Reliability Engineer (SRE) to harden, and improve visibility into, both infrastructure and customer resources, making them more robust and secure.

Responsibilities include but are not limited to:

  • Monitor alerts and respond to outages or performance degradation
  • Develop tools to streamline alert and outage response and automate as much of it as possible
  • Reduce manual, repetitive, error-prone work, freeing up engineering to take on longer-term projects and become more proactive than reactive
  • Reduce alert fatigue by ensuring the monitoring system minimizes false positives and prioritizes clearly actionable, timely alerts
  • Continuously improve and refine both monitoring systems and deployment workflows
  • Prioritize both security and the end user experience
  • Provide value not only to customers but also to the rest of the Blockdaemon team
  • Perform other duties and responsibilities as assigned

Skills and Required Qualifications:

  • Linux service administration (systemd, Docker, etc.)
  • Linux shell (bash, ssh, etc.)
  • Certificate and key management (SSH keys, TLS, CA, PKI, etc.)
  • Linux troubleshooting (curl, tcpdump, ps, top, swap, memory, CPU usage, kernel logs, etc.)
  • Cloud VM/network provisioning and administration (AWS, GCE, Azure)
  • Beats, Elasticsearch, Logstash, Kibana
  • Prometheus
  • Bare metal provisioning, including system performance tuning, RAID provisioning, disk partitioning, NVMe/SSD wear monitoring and mitigation, etc.
  • Terraform and/or Ansible; Vault a plus
  • Git and continuous integration
  • Kubernetes (K8s) experience a plus
  • Strong verbal and written communication skills, including presentation skills
  • Ability to prioritize work, pivot as necessary, and meet deadlines