Senior Site Reliability Engineer - D&R

Location(s)

Ireland - Dublin

Team(s)

Product & Engineering


Company:

Rapid7 (NASDAQ: RPD) helps organizations across the globe protect what matters most so innovation can thrive in an increasingly connected world. Our comprehensive technology, services, and community-focused research simplify the complex for security teams, helping them reduce vulnerabilities, monitor for malicious behavior, be in 10 places at once, and shut down attacks. We're on a mission to make security solutions easier to use and access so we can bring safety and resilience to more people.
With more than 10,000 customers across 140+ countries, Rapid7 is a leader in cybersecurity that has earned numerous industry accolades and recognition for our technology and culture.

Our Dublin Engineering group is responsible for providing log management services such as search, alerting and data visualisation to security professionals. Our systems ingest large amounts of data and need to be highly available and performant at all times.

We are looking for a talented Site Reliability Engineer (SRE) with a deep interest in distributed systems, cloud computing and the architecture of large-scale systems. The SRE will be part of a team that ensures our Log Management services have the reliability and uptime appropriate to our users' needs. You will work with other engineering teams to help solve extremely challenging problems.

Some of the technologies we use include:

Java, Python, Terraform, Jenkins, Artifactory, Chef, Puppet, Ansible, Zookeeper, Docker, AWS (EC2, S3, CloudFormation, etc.), Cassandra, PostgreSQL, Kafka

Responsibilities:

  • Working closely with Engineering, Architecture, Infrastructure and Product teams to improve the lifecycle of the Log Management services, from inception and design through deployment, operations, monitoring, security, upgrades and maintenance

  • Supporting services before they go live through activities such as design, deployment, migration strategy, monitoring, and playbook reviews  

  • Maintaining services once they are live by measuring and monitoring availability, latency, and overall system health

  • Scaling systems through automation, service and infrastructure improvements, and other means

  • Troubleshooting production issues and liaising with the relevant Engineering or Infrastructure team for a resolution

  • Participating in on-call support and incident response follow-ups such as post-mortems

Skills and Understanding:

  • Previous experience in an SRE role

  • 5+ years of experience scaling SaaS services and infrastructure

  • Solid experience of developing, scaling, deploying and troubleshooting large-scale systems

  • Solid understanding of deployment and monitoring frameworks

  • Ability to debug and optimise code and to automate routine tasks

  • Advanced understanding of system performance and tuning

  • Excellent knowledge of NoSQL concepts

  • Excellent knowledge of OOP languages such as Java 

  • Excellent knowledge of scripting languages such as Python 

  • Experience with algorithms and data structures

  • Excellent knowledge of RESTful architectures

  • Understanding of Unix/Linux operating systems

  • Proficiency with AWS services such as EC2, RDS, VPC, networking, S3 and MSK

  • Systematic problem-solving approach

  • Excellent communication & influencing skills

  • Excellent technical writing skills
