Systems and DevOps Administrator for HPC, Big Data and ML/DL

24.07.2020

We are looking for motivated candidates with proven experience in managing IT and Cloud systems to join a team dedicated to designing and starting up a common computing system for the whole organization, in support of research projects. The aim is to meet the computing needs of the organization’s researchers in a more efficient and democratic way, and to make management and resource maintenance more efficient.

The candidate should have the knowledge and experience to manage, following good practice, the configuration, maintenance, updating and monitoring of the computing system at two levels: (1) The hardware infrastructure for HPC, Artificial Intelligence and Big Data, including local machine clusters that use the latest GPU technology and other hardware environments to train, test and run inference on Deep Learning models, as well as distributed storage and CI/CD. (2) The software platform that provides the applications and services needed to carry out research projects efficiently and to simplify integration, maintenance and monitoring operations. The role also involves active participation in research projects, supporting researchers in the implementation of emerging and innovative technologies.

Candidates should show a proactive attitude towards problem solving, excellent IT skills, teamwork and a commitment to understanding the needs of colleagues. Candidates must also have the necessary skills in cloud, DevOps and systems technologies, particularly distributed computing and storage systems, scalability and security. Knowledge of machine learning and general data processing will be highly valued.

Candidates must have:

Education:          At least a master's degree in computer science or telecommunications

Experience:       We are looking for a versatile candidate with proven experience in the following areas:

  • Experience in Linux environments (user management, scripting, service management, monitoring and process settings)
  • Experience in network configuration (communications, security, network traffic monitoring)
  • Distributed storage systems (BeeGFS, Lustre, Ceph, NAS configuration)
  • HPC workload systems: Slurm
  • Container technologies: Docker
  • Microservices and orchestration technologies: Kubernetes
  • CI/CD tools: GitLab

The following will be highly valued:

  • Experience in HPC architectures, GPU servers, data-driven architectures, distributed storage
  • Bare-metal virtualization solutions: Proxmox, MAAS, OpenStack
  • Big Data and database system implementation: Kafka, PostgreSQL, Spark, MongoDB, Cassandra
  • Configuration automation tools: Ansible, Puppet,…
  • Knowledge of different cloud service providers and their offerings (e.g. IaaS, PaaS): Amazon Web Services, Google Cloud Platform, Microsoft Azure
  • Infrastructure as code: for example, AWS CloudFormation, Terraform
  • MLOps and AI workflow management tools: Airflow, Kubeflow, etc.

The successful candidate's tasks and responsibilities will include:

  • To evaluate the existing HW infrastructure (focusing on GPU servers, file servers and network servers), to identify the system’s needs and to participate in designing its upgrade
  • To anticipate future HW needs
  • To help implement the internal HPC platform (by collaborating with external consultants)
  • To implement good practices for CI/CD and MLOps
  • To develop MLOps for middleware
  • To support the implementation of MLOps for third parties and on private/public clouds and clusters

Our offer

Joining a dynamic, innovative centre that is a leader in the Computer Graphics, Visual Computing and Multimedia, and Data Analysis sectors, both nationally and in Europe.

Vicomtech

Parque Científico y Tecnológico de Gipuzkoa,
Paseo Mikeletegi 57,
20009 Donostia / San Sebastián (Spain)

+(34) 943 309 230
