Site Reliability Engineer
Aventus
Job title: Site Reliability Engineer
Company: Aventus
Job description:
About the job
We are seeking a skilled Data Site Reliability Engineer (SRE) with data platform experience to join a dynamic international company. The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our systems and applications. As an SRE, you will collaborate closely with development and operations teams to design, implement, and maintain robust infrastructure solutions. You will also be involved in monitoring, troubleshooting, and optimizing our systems to meet the demands of our rapidly growing business.
The role sits in the Cloud Engineering team, where you will develop and maintain cloud-native technology:
- Highly scalable Kubernetes clusters
- Cloud access management automation and integration with k8s
As a Data Platform Site Reliability Engineer, you will manage infrastructure and applications on cloud computing platforms to deliver data processing, governance, and storage. As an SRE, you’ll need to solve problems that arise using empirical data, teamwork, and your own unique expertise.
The Data Platform SRE will work directly with our data platform and engineering teams in an embedded SRE model, operating in unison with the developers to deliver seamless experiences for our customers.
Responsibilities:
- Design, implement, and maintain scalable and reliable data infrastructure solutions for storing, processing, and analyzing large volumes of data.
- Collaborate with data engineering and data science teams to define and implement operational requirements for data pipelines, ETL processes, and analytical workflows.
- Automate deployment, configuration, and monitoring of data systems and services to ensure efficient and reliable operation.
- Develop and maintain monitoring and alerting systems to proactively identify and address issues with data availability, quality, and performance.
- Troubleshoot and resolve data-related issues in a timely manner, minimizing impact on downstream applications and users.
- Implement data governance and security best practices to ensure the confidentiality, integrity, and availability of our data assets.
- Perform capacity planning and performance tuning to optimize the performance and cost-effectiveness of our data infrastructure.
- Participate in on-call rotations and respond to data-related incidents outside of regular business hours when necessary.
- Evaluate and adopt new data technologies and tools to improve the efficiency, reliability, and scalability of our data infrastructure.
- Document system designs, configurations, and operational procedures to facilitate knowledge sharing and collaboration.
Qualifications:
- Strong sense of ownership and integrity, demonstrated through clear communication and collaboration.
- Experience in architecting, developing, operating, and troubleshooting Kubernetes clusters and/or other highly available systems at scale.
- Proficiency with the architecture, deployment, performance tuning, and troubleshooting of open-source data analytics technologies, especially Apache Spark, Trino, and related software, in a large-scale environment.
- The ability to design, author, and release code in languages like Go, Python, or Java.
- Acute drive to automate manual operations and to improve them through repeated iteration.
- Understanding of the Linux operating system and of standard networking protocols and components.
- Experience with cloud-native services on AWS/GCP.
- Hands-on experience managing large numbers of diverse systems with configuration management or software delivery platforms (such as Terraform, CloudFormation, ArgoCD, and Flux).
- Experience with deploying, supporting, and monitoring new and existing services, platforms, and application stacks.
- Excellent troubleshooting and problem-solving skills.
- Experience with scale testing, disaster recovery, and capacity planning.
- Effective communication and collaboration skills, including the ability to drive and promote technical partnerships across teams.
- Incident response and/or incident management experience.
This is a long-term remote contract role; candidates in the surrounding regions are preferred due to the time zone. If the above matches your skill set, please apply.
Expected salary:
Location: United Arab Emirates
Job posting date: Fri, 28 Jun 2024 07:24:15 GMT