Overview
The Production System Operations Engineer ensures that the organization's systems, services, and networks function efficiently and securely. The role involves collaborating with stakeholders to align technical operations with business objectives and to ensure the high availability, performance, and scalability of all systems.
Key Responsibilities
Infrastructure and Operations Management
- Oversee the maintenance and optimization of technical infrastructure, including cloud environments, applications, databases, networks, and storage.
- Manage systems monitoring, incident management, and problem resolution to minimize downtime and ensure high availability.
- Participate in the team's on-call rotation to monitor system and application alerts, notifications, and dashboards 24/7.
- Acknowledge and categorize incidents based on severity and impact.
- Perform initial troubleshooting and diagnosis using predefined procedures and the knowledge base.
- Collaborate with the Security Operations Center (SOC) team on security event and incident response.
- Plan and execute disaster recovery and business continuity strategies.
- Develop, implement, and maintain policies and procedures for technical operations, ensuring compliance with industry and government standards and regulations.
Strategic Planning and Execution
- Collaborate with leadership to develop and execute the technical operational strategy.
- Identify opportunities for technology enhancements and cost optimization.
- Align technical operations with organizational goals and initiatives.
- Prepare and manage the technical operations budget, ensuring cost-effectiveness.
Vendor and Stakeholder Management
- Manage relationships with vendors, service providers, and contractors to ensure the delivery of high-quality services and products.
- Negotiate contracts, review service level agreements (SLAs), and oversee vendor performance.
- Act as the primary point of contact for technical escalations and coordinate resolutions with internal teams and external stakeholders.
Security and Compliance
- Ensure the security, integrity, and compliance of technical operations in line with organizational and regulatory requirements.
- Collaborate with cybersecurity teams to identify and mitigate risks.
- Perform regular audits and assessments of systems and processes.
Qualifications
Education
- Bachelor’s degree in Computer Science, Information Technology, or a related field (Master’s degree preferred).
Experience
- 5+ years of experience in cloud technical operations and infrastructure.
- Proven experience in managing large-scale systems, cloud environments, and enterprise networks.
- Strong background in incident management, system administration, and technical troubleshooting.
Skills and Competencies
- Proficiency in, and in-depth knowledge of, cloud infrastructure technologies (e.g., Azure, Linux, MySQL, Windows, virtualization, cloud platforms, security, and networking).
- Azure: web application gateway, firewall, AE server (DNS), SLB, VPC, security group, ECS, MySQL (HA), Key Vault, Blob Storage, Redis, Bastion
- Cloudflare: Anti-DDoS, WAF, CDN, DNS
- Containerization: Docker, Kubernetes, and related tools
- Applications: web services (API, management system), backend services, frontend services, OP1 (CI/CD, ELK, middleware management, monitoring system)
- Additional nice-to-have technologies: Nginx, ELK, Kafka, Nacos, RabbitMQ, XXL-JOB cluster, ES, Canal, ZooKeeper, VPN
- Experience with ITIL practices, SRE methodologies, and DevOps principles.
- Solid understanding of CI/CD pipelines.
- Ability to work effectively in a fast-paced environment and prioritize tasks accordingly.
- Excellent leadership and interpersonal skills to manage teams and stakeholders effectively.
- Good communication skills and the ability to collaborate effectively with team members and stakeholders.
- Expertise in IT service management (ITSM) frameworks such as ITIL.
- Strong analytical, problem-solving, and decision-making abilities.
- Project management skills and familiarity with methodologies such as Agile or Waterfall.
Key Performance Indicators (KPIs)
- Uptime percentage for critical systems and infrastructure.
- Mean Time to Resolution (MTTR) for incidents.
- Budget adherence and cost-saving initiatives.
- Team productivity and employee satisfaction scores.
- Compliance with security and regulatory standards.
Offer
- Fantastic new office on Yas Island.
- Opportunity to work in a growing business.
- Chance to work with like-minded professionals.
- A diverse environment with a shared determination to reach our goals.
- Training and learning opportunities.
- Company benefits which support your health and well-being.
Interested? Apply directly with your CV.
#momentumservices #igaming #hiring #UAE #UAEjobs