HPC Linux System Administrator

Date: Jul 30, 2019

Location: Saudi Arabia

Company: King Abdullah University of Science & Technology

Position Summary: 

 

The HPC senior systems administrator will work with other Advanced Computing Infrastructure team members to administer the 850-node Ibex+ supercomputer, storage systems, other support computers, networking fabrics and other related systems and services. The required competencies of the senior administrator are: a high degree of systems design skill with respect to service dependencies and overall systems availability; the ability to assess and prioritise tasks based on the operational requirements of the systems; and the ability to mentor junior staff in systems technology and in interacting with end users. The required competencies also include those of the “Systems Administrator”, which are: a high level of Linux administration experience (RHEL6, RHEL7 or equivalent); experience managing high performance data storage systems; the ability to use the Slurm workload manager effectively; automation experience (including appropriate scripting languages); experience monitoring large-scale systems; interaction with end users as required; and the ability to adapt to changing task priorities as specified by the team lead.

 

Major Responsibilities: 

 

  • Provide a high level of technical competency and mentor junior staff in all aspects of systems infrastructure administration.
  • Work with other team members, as necessary, to maintain the Advanced Computing Infrastructure. This includes: supporting end users; operational maintenance of existing systems; architectural and design components of upgrades (hardware and software); planned decommissioning of obsolete systems; ensuring the systems are accurately monitored; participating in design and evaluation exercises to maximise utilisation of the infrastructure; and implementing and assessing test bed systems to evaluate architectural ideas and concepts.
  • Work closely with end users and provide educational support, as necessary, for them to make more efficient use of the resources.
  • Maintain a broad knowledge of the current best practices of HPC systems.
  • Ensure systems are configured and maintained in compliance with university and laboratory policies.
  • Maintain and participate in continuous development of IT skills as related to HPC systems.
  • As required, develop solutions to requirements that meet or exceed the expectations of university research staff.
  • Provide individual or group training on a variety of topics related to HPC infrastructure.
  • Be mindful of university and laboratory safety policies at all times, and ensure any issues that arise are dealt with in compliance with all relevant policies.
  • Interact with other university IT groups professionally.
  • Work closely with the HPC team lead and other team members in the development of high level plans.
  • Continually maintain and monitor the HPC data storage fabric (including design, testing and analysis as required).
  • Develop infrastructure and mechanisms to reliably report on systems utilisation as required.

 

Competencies:

 

  • Demonstrated knowledge of clustered Linux systems, including securing systems, day-to-day troubleshooting, monitoring, support, software packaging, and working within industry-wide best practices
  • Administering, configuring, and supporting HPC clusters, including systems with accelerators, and high performance file systems and storage.
  • Hardware installation, configuration, upgrades and repairs
  • Fault diagnosis and subsequent rectification of computer systems hardware
  • Automated system management tools (e.g. Puppet)
  • Cluster provisioning systems (e.g. Warewulf)
  • Slurm workload scheduler
  • Managing and supporting InfiniBand-based networks is desirable
  • Virtualisation in the Linux environment (libvirt / QEMU)
  • Applications / Technologies: Git, Apache, Tomcat, Kerberos, LDAP
  • Networking technologies (IP addressing; configuring network switches (Cisco and/or Arista); L2 vs L3; RFC 1918 – Private Internets; VLANs; IEEE 802.3ad – Link Aggregation)
  • Implementing firewalls (as applied to Linux and network devices)
  • Monitoring systems using open source tools (e.g. Nagios, Graphite, Grafana, collectd)
  • Utilising data and system security techniques, practices and standards as they relate to HPC systems, storage and networks
  • High speed data transfer methods (e.g. GridFTP, Globus Online, Aspera, bbftp or similar)
  • Ability to work closely with end users to minimise barriers to their efficient use of the Advanced Computing resources

 

Qualifications / Experience:

 

  • Bachelor of Science (or equivalent) in a relevant discipline plus 10 years’ experience
  • OR Master of Science (or equivalent) in a relevant discipline plus 7 years’ experience
  • OR Doctor of Philosophy (or equivalent) in a relevant discipline plus 5 years’ experience