Rapid7 (NASDAQ: RPD) helps organizations across the globe protect what matters most so innovation can thrive in an increasingly connected world. Our comprehensive technology, services, and community-focused research simplify the complex for security teams, helping them reduce vulnerabilities, monitor for malicious behavior, be in 10 places at once, and shut down attacks. We're on a mission to make security solutions easier to use and access so we can bring safety and resilience to more people.
With more than 10,000 customers across 140+ countries, Rapid7 is a leader in cybersecurity that has earned numerous industry accolades and recognition for our technology and culture.
Our Dublin Engineering group is responsible for providing log management services such as search, alerting and data visualisation to security professionals. Our systems ingest large amounts of data and need to be highly available and performant at all times.
We are looking for a talented Site Reliability Engineer (SRE) with a deep interest in distributed systems, cloud computing and the architecture of large-scale systems. The SRE will be part of a team that ensures our Log Management services have the reliability and uptime appropriate to our users' needs. You will work with other engineering teams to help solve extremely challenging problems.
Some of the technologies we use include:
Java, Python, Terraform, Jenkins, Artifactory, Chef, Puppet, Ansible, ZooKeeper, Docker, AWS (EC2, S3, CloudFormation, etc.), Cassandra, PostgreSQL, Kafka
Responsibilities:
Working closely with Engineering, Architecture, Infrastructure and Product teams to improve the lifecycle of the Log Management services, from inception and design through deployment, operations, monitoring, security, upgrades and maintenance
Supporting services before they go live through activities such as design, deployment, migration strategy, monitoring, and playbook reviews
Maintaining services once they are live by measuring and monitoring availability, latency, and overall system health
Scaling systems through automation, service and infrastructure improvements, and other means
Troubleshooting production issues and liaising with the relevant Engineering or Infrastructure team for a resolution
Participating in on-call support and incident response follow-ups such as post-mortems
Skills and Understanding:
Previous experience in an SRE role
5+ years of experience scaling SaaS services and infrastructure
Solid experience of developing, scaling, deploying and troubleshooting large-scale systems
Solid understanding of deployment and monitoring frameworks
Ability to debug and optimise code and to automate routine tasks
Advanced understanding of system performance and tuning
Excellent knowledge of NoSQL concepts
Excellent knowledge of OOP languages such as Java
Excellent knowledge of scripting languages such as Python
Experience with algorithms and data structures
Excellent knowledge of RESTful architectures
Understanding of Unix/Linux operating systems
Proficiency in AWS services, including EC2, RDS, VPC, networking, S3, MSK, etc.
Systematic problem-solving approach
Excellent communication & influencing skills
Excellent technical writing skills