Data Centre System Operation Engineer (Johor Bahru)

We are seeking an L1 Data Center Operations Engineer to support 24x7 daily operations of large-scale GPU clusters and supporting data center infrastructure. This role provides first-line monitoring, incident triage, ticket handling, service coordination and physical break-fix support to ensure stable, reliable, and efficient operation of high-performance GPU platforms supporting AI workloads.

KEY RESPONSIBILITIES

Oversee daily operations of GPU clusters and data center systems in a 24x7 shift-based environment, ensuring services remain stable, available, and operating within defined SLAs.
Monitor system health, performance, and capacity using monitoring and alerting tools, and proactively identify abnormal conditions or potential risks.
Acknowledge, triage, and respond to operational incidents, perform first-level troubleshooting based on documented SOPs, and escalate to L2/L3 teams when issues cannot be resolved at L1, ensuring timely service restoration.
Own the ticket lifecycle in the ITSM system, including creation, categorization (Incident, Service Request, Change), prioritization, regular updates, and closure with proper evidence.
Coordinate with hardware vendors for GPU server break-fix, including opening cases, providing logs, tracking progress, and validating restoration.
Perform physical data center tasks such as cabling (fiber/copper), optics replacement, PDU visual inspection, labeling, basic rack checks, and environmental inspections, following approved work orders.
Support deployment and commissioning activities for racks and infrastructure under guidance from senior engineers, including racking, cabling, and basic validation.
Collect logs, screenshots, and diagnostic outputs from systems and monitoring tools to support troubleshooting and vendor cases.
Work closely with Network, Systems, Platform, Facilities, and Vendors to support AI workloads and coordinate incidents and maintenance activities.
Participate in shift handovers, clearly documenting open issues, risks, pending vendor actions, and planned activities.

Requirements

Bachelor’s degree in Computer Science, Information Technology, Electrical Engineering, or a related field. Equivalent practical experience will be considered.
2+ years of experience in IT infrastructure operations, data center operations, NOC, or a similar environment.
Familiarity with GPU hardware platforms (e.g., NVIDIA GPUs) and basic awareness of AI / HPC environments.
Basic Linux operating system skills (command line usage, log review, service status, file systems).
Experience using monitoring and alerting tools (e.g., Prometheus, Grafana, Zabbix, or similar).
Experience working with ticketing and IT service management tools (e.g., Jira Service Management or similar).
Hands-on experience performing IT hardware replacement and basic break-fix tasks.
Experience working in data center operations, system operations, or technical support roles.

Job Type: Permanent

Pay: From RM5,000.00 per month

Application Question(s)

This role requires working 24/7 shifts. Are you comfortable with this?

Experience

System Operation Engineer: 2 years (Required)
Data Centre IT operations: 2 years (Required)

Work Location: In person

Job Type: Full Time
Location: Kulai, Johor
