Data Centre System Operation Engineer (Johor Bahru)
Neuron Solutions
We are seeking an L1 Data Center Operations Engineer to support 24x7 daily operations of large-scale GPU clusters and supporting data center infrastructure. This role provides first-line monitoring, incident triage, ticket handling, service coordination, and physical break-fix support to ensure stable, reliable, and efficient operation of high-performance GPU platforms supporting AI workloads.
KEY RESPONSIBILITIES
- Oversee daily operations of GPU clusters and data center systems in a 24x7 shift-based environment, ensuring services remain stable, available, and operating within defined SLAs.
- Monitor system health, performance, and capacity using monitoring and alerting tools, and proactively identify abnormal conditions or potential risks.
- Acknowledge, triage, and respond to operational incidents, perform first-level troubleshooting based on documented SOPs, and escalate to L2/L3 teams when issues cannot be resolved at L1, ensuring timely service restoration.
- Own ticket lifecycle in the ITSM system, including creation, categorization (Incident, Service Request, Change), prioritization, regular updates, and closure with proper evidence.
- Coordinate with hardware vendors for GPU server break-fix, including opening cases, providing logs, tracking progress, and validating restoration.
- Perform physical data center tasks such as cabling (fiber/copper), optics replacement, PDU visual inspection, labeling, basic rack checks, and environmental inspections, following approved work orders.
- Support deployment and commissioning activities for racks and infrastructure under guidance from senior engineers, including racking, cabling, and basic validation.
- Collect logs, screenshots, and diagnostic outputs from systems and monitoring tools to support troubleshooting and vendor cases.
- Work closely with Network, Systems, Platform, Facilities, and Vendors to support AI workloads and coordinate incidents and maintenance activities.
- Participate in shift handovers, clearly documenting open issues, risks, pending vendor actions, and planned activities.
Requirements
- Bachelor’s degree in Computer Science, Information Technology, Electrical Engineering, or a related field. Equivalent practical experience will be considered.
- 2+ years of experience in IT infrastructure operations, data center operations, NOC, or similar environment.
- Familiarity with GPU hardware platforms (e.g., NVIDIA GPUs) and basic awareness of AI / HPC environments.
- Basic Linux operating system skills (command line usage, log review, service status, file systems).
- Experience using monitoring and alerting tools (e.g., Prometheus, Grafana, Zabbix, or similar).
- Experience working with ticketing and IT service management tools (e.g., Jira Service Management or similar).
- Hands-on experience performing IT hardware replacement and basic break-fix tasks.
- Experience working in data center operations, system operations, or technical support roles.
Job Type: Permanent
Pay: From RM5,000.00 per month
Application Question(s)
- This role requires working 24x7 shifts. Are you comfortable with that?
Experience
- System Operation Engineer: 2 years (Required)
- Data Centre IT operations: 2 years (Required)
Work Location: In person
Job Type: Full Time
Location: Kulai, Johor
