The Site Reliability Team at Bed Bath and Beyond is looking for a Site Reliability Engineer (SRE) who can build, instrument, troubleshoot, automate and triage highly scalable legacy and modern systems.
The candidate will join a team whose mission is to blend a variety of skill sets and work collaboratively, not only to ensure that we deliver quality but also to take an active role in determining which architectures and technologies perform, scale, and deliver services reliably.
Troubleshoot issues across the entire stack - hardware, software, applications and network.
Design, build, test, and automate discovery, instrumentation, alerting, and escalation of monitoring.
Document all efforts clearly, and communicate and demonstrate them to the team with ease.
Respond to major/critical events and be an active participant in determining solutions and instrumentation.
Hands-on experience building fault-tolerant infrastructure and monitoring instrumentation with technologies such as Kubernetes, Kafka, Cassandra, AWS, and GCP.
Experience instrumenting and researching issues with CA Monitoring Suite, Nagios, InfluxDB, Grafana, Prometheus, Stackdriver, Sumo Logic, New Relic, Quantum Metric, Tealeaf, etc.
Familiarity with tools such as Puppet, Ansible, Salt, Chef, or CFEngine would be a plus.
Additional familiarity with log analysis tools such as Sumo Logic, ELK, and Splunk would also be helpful.
Practical knowledge of shell scripting and at least one scripting language (e.g., Python or Ruby).