Rapid7 is a leading provider of security data and analytics solutions that enable organisations to implement an effective, analytics-driven approach to cyber security. We combine our extensive experience in security data and analytics and deep insight into attacker behaviours and techniques to make sense of the wealth of data available to organisations about their IT environments and users.
Our Dublin Engineering group is responsible for providing log management services, such as search, alerting and data visualisation, to security professionals. Our systems ingest large amounts of data that need to be highly available and performant at all times.
We are looking for a talented Site Reliability Engineer (SRE) with a deep interest in distributed systems, cloud computing and the architecture of large-scale systems. The SRE will be part of a team that ensures our Log Management services have the reliability and uptime appropriate to our users' needs. You will work with other engineering teams to help solve extremely challenging problems.
Working closely with the SRE Lead and the Engineering, Architecture, Infrastructure and Product teams to improve the lifecycle of the Log Management services, from inception and design through deployment, operations, monitoring, security, upgrade and maintenance
Supporting services before they go live through activities such as design, deployment, migration strategy, monitoring, and playbook reviews
Maintaining services once they are live by measuring and monitoring availability, latency, and overall system health
Scaling systems through automation, service and infrastructure improvements, and other means
Troubleshooting production issues and liaising with the relevant Engineering or Infrastructure team for a resolution
Participating in on-call support and incident response follow-ups such as post-mortems
Skills and Understanding:
Previous experience in an SRE role
5+ years of experience scaling SaaS services and infrastructure
Solid experience of developing, scaling, deploying and troubleshooting large-scale systems
Solid understanding of deployment and monitoring frameworks
Ability to debug, optimise code and automate routine tasks
Advanced understanding of system performance and tuning
Excellent knowledge of NoSQL concepts
Excellent knowledge of OOP languages such as Java
Excellent knowledge of scripting languages such as Python
Experience with algorithms and data structures
Excellent knowledge of RESTful architectures
Understanding of Unix/Linux operating systems
Proficient in AWS services, including EC2, RDS, VPC, networking, S3, MSK, etc.