Engineer – Systems, DevOps for HPC, Big Data and ML/DL Administrator

Donostia / San Sebastián

Systems Administrator, DevOps for HPC, Big Data and ML/DL

We are looking for motivated candidates with proven experience in managing IT and cloud systems to join a team dedicated to designing and launching a common computing system that supports research projects across the whole organization. The aim is to meet researchers' computing needs in a more efficient and democratic way, and to make management and resource maintenance more efficient.

The candidate should have the knowledge and experience to manage, following good practice, the configuration, maintenance, updating and monitoring of the computing system at two levels: (1) the hardware infrastructure for HPC, Artificial Intelligence and Big Data, including local machine clusters that use the latest GPU technology and other hardware environments for training, testing and inference of Deep Learning models, as well as distributed storage and CI/CD; (2) the software platform that provides the applications and services needed to develop research projects efficiently and to simplify integration, maintenance and monitoring operations. The role also involves active participation in research projects, supporting researchers in the implementation of emerging and innovative technologies.

Candidates should show a proactive attitude towards problem solving, excellent IT skills, the ability to work in a team and a commitment to understanding colleagues' needs. Candidates must also have the necessary skills in cloud, DevOps and systems technologies, particularly distributed computing and storage systems, scalability and security. Knowledge of machine learning and general data processing will be highly valued.

Candidates must have:

Education: At least a master's degree in computer science or telecommunications

Experience: We are looking for a versatile candidate with proven experience in the following areas:

Experience in Linux environments (user management, scripting, service management, monitoring and process settings)
Experience in network configuration (communications, security and network traffic monitoring)
Distributed storage systems (BeeGFS, Lustre, Ceph, NAS configuration)
HPC workload systems: Slurm
Container technologies: Docker
Microservice orchestration technologies: Kubernetes
CI/CD tools: GitLab

The following will be highly valued:

Experience in HPC architectures, GPU servers, data-driven architectures, distributed storage
Bare-metal virtualization solutions: Proxmox, MAAS, OpenStack
Big Data and database system implementation: Kafka, PostgreSQL, Spark, MongoDB, Cassandra
Configuration automation tools: Ansible, Puppet, etc.
Knowledge of different cloud service providers and their offerings (e.g. IaaS, PaaS): Amazon Web Services, Google Cloud Platform, Microsoft Azure
Infrastructure as code: for example, AWS CloudFormation, Terraform
MLOps and AI workflow management tools: Airflow, Kubeflow, etc.

Tasks and responsibilities will include:

To evaluate the existing hardware infrastructure (focusing on GPU servers, file servers and network servers), identify the system's needs and participate in designing the system upgrade
To anticipate future hardware needs
To help implement the internal HPC platform (by collaborating with external consultants)
To implement good practices for CI/CD and MLOps
To develop middleware for MLOps
To support the implementation of MLOps for third parties and on private/public clouds and clusters

Our offer

The opportunity to join a dynamic, innovative centre that is a leader in the Computer Graphics, Visual Computing and Multimedia, and Data Analysis sectors, both nationally and in Europe.


Closed offer