HPC Batch support for GitLab CI - Part 1


Garrett Adams

Hanover, MD, 14 September 2020

GitLab’s built-in Continuous Integration (CI) tools are some of the best in the industry. Onyx Point, Inc. has been continuing our efforts to improve GitLab’s CI security. Continue reading to learn more about integrating GitLab CI with high-performance computing (HPC) resource schedulers.

If you haven’t seen our previous blog posts on the SetUID Runner for GitLab, please go back and read them first: the extension and implementation we discuss here are a direct continuation of that SetUID work.

Following our development and implementation of the SetUID GitLab Runner, the next problem to tackle was how to bring it to batch High-Performance Computing (HPC) systems. Unlike the SetUID Runner, which was designed to work on the login node, this added feature allows jobs to execute on the batch compute nodes, utilizing additional resources and parallel processing capabilities. This opens up an entirely new paradigm of CI for organizations that rely heavily on resource managers and HPC systems. Users can now submit CI jobs directly to batch systems such as IBM’s Spectrum Load Sharing Facility (LSF) or the open source Slurm Workload Manager.

With many of the Department of Energy (DOE) labs using different resource managers, we began researching an efficient way for the GitLab Runner code base (written entirely in Go) to interact with the batch systems already installed at each of these locations. To accomplish this, we looked into the Distributed Resource Management Application API (DRMAA). This API has implementations in C, Java, JavaScript, Python, and Go, with built-in support for LSF, Slurm, and numerous other schedulers. DRMAA looked like an ideal solution to our problem.

Figure: DRMAA stack

By choosing this API as a middle layer, we could avoid developing direct communication with each individual scheduler. This made the development process easier and faster.

From the beginning, we had some concerns that DRMAA wouldn’t provide the features we needed to accomplish our goal. Despite this, we decided the benefit of “write once, run anywhere” outweighed any issues we could foresee, so we began development on this new Batch executor with DRMAA at its center. For the most part, the executor itself is a close copy of the original shell executor. We simply removed the portion that executes the script contents with the system shell and replaced it with a call to DRMAA, which would then handle LSF, Slurm, and other schedulers in the background.

At first, things seemed to be going well. We were able to submit jobs to the underlying batch systems very quickly, but it didn’t take long for issues to present themselves. For example, DRMAA has no way of handling output from the batch job execution. To get the output we needed, we would have to use the .outputPath function to direct it to a file and then traverse the generated output file to grab the data (which was different for each batch system).
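To make the output-file workaround concrete, here is a minimal sketch in plain Python. It does not use any DRMAA binding: a local `sh` process stands in for the batch job, and the function name `run_job_with_output_file` is hypothetical. It only illustrates the pattern of pointing a job's output at a file and reading it back afterwards.

```python
import subprocess
import tempfile
from pathlib import Path


def run_job_with_output_file(script: str, output_path: Path) -> str:
    """Run a script with its output redirected to a file, then read it back.

    Assumption: a local `sh` process stands in for the batch job. With a
    real scheduler the file would be written on a compute node and might
    carry scheduler-specific content that has to be filtered out.
    """
    with output_path.open("w") as sink:
        # The caller never sees stdout directly, only the file on disk.
        subprocess.run(
            ["sh", "-c", script],
            stdout=sink,
            stderr=subprocess.STDOUT,
            check=True,
        )
    return output_path.read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        text = run_job_with_output_file("echo building; echo testing", Path(tmp) / "job.out")
        print(text, end="")
```

The extra step of parsing the file is where the per-scheduler differences mentioned above would have to be handled.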

Furthermore, we realized that the previously implemented SetUID functionality no longer applied, because the runner no longer used the shell. This put us in a difficult position, as DRMAA has no built-in ability to SetUID the job being run.

The final nail in the coffin was a bug in IBM’s LSF/DRMAA code. LSF has an argument, -p, that lets you specify the number of processes for your job to run. Although this is a perfectly valid argument when using LSF on the command line, passing it through DRMAA caused the runner to crash without submitting the job. We eventually discovered that, deep inside IBM’s thousands of lines of LSF/DRMAA code, there was an array of all the valid arguments that can be passed to the batch system, and -p was missing from it. Once we added -p to the array, everything worked properly.
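The failure mode is easy to reproduce in miniature. The sketch below is not IBM's code; the allowlist contents and the function name `validate_native_options` are assumptions chosen for illustration. It shows how a flag that is valid on the command line can still be rejected when a hard-coded list of accepted options omits it.

```python
# Hypothetical allowlist standing in for the array of valid arguments
# described above. The real array's contents are not shown in this post;
# the point is only that "-p" is absent.
VALID_OPTIONS = ["-q", "-n", "-W", "-o", "-e"]


def validate_native_options(options: list[str], valid: list[str]) -> None:
    """Raise ValueError for any flag that is not in the allowlist."""
    for token in options:
        # Only tokens that look like flags are checked; values pass through.
        if token.startswith("-") and token not in valid:
            raise ValueError(f"unsupported option: {token}")


if __name__ == "__main__":
    requested = ["-n", "4", "-p", "8"]
    try:
        validate_native_options(requested, VALID_OPTIONS)
    except ValueError as err:
        print(f"rejected: {err}")
    # Adding the missing flag to the list is the whole fix.
    validate_native_options(requested, VALID_OPTIONS + ["-p"])
    print("accepted after adding -p")
```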

Despite the strides we made to keep DRMAA a part of the runner, we eventually decided it simply didn’t provide all of the features we needed. To fulfill our vision of the perfect CI solution for batch systems, we decided it would be more valuable to write our own API. It could handle all the basics we needed, including creating job submissions and passing native arguments, while leaving room to add any nuanced features that come up in the future. We knew our own API would be the most scalable option.

When the Batch executor is properly enabled, our in-house API announces that it is running as a Batch runner and maintains the previous SetUID functionality. It detects the resource manager in use, pulls the contents of the submitted CI job, saves that content into the required shell script, and forwards the script to the detected resource manager. In doing so, the executor preserves any arguments defined in the .gitlab-ci.yml file and passes them to the resource manager. The output is then returned to the GitLab web interface.
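The flow above can be sketched in a few functions. This is an illustration in Python, not the executor's actual code (the runner itself is written in Go), and every function name here is hypothetical. It assumes the resource manager is detected by looking for its submit command (`bsub` for LSF, `sbatch` for Slurm) on the PATH, and that the script path is simply appended to the submit command; the real executor may do either differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Submit commands for the two schedulers named in this post.
SUBMIT_COMMANDS = {"lsf": "bsub", "slurm": "sbatch"}


def detect_resource_manager(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first scheduler whose submit command is found on PATH."""
    for name, command in SUBMIT_COMMANDS.items():
        if which(command):
            return name
    return None


def write_job_script(lines: list[str], directory: Path) -> Path:
    """Save the CI job's script lines into an executable shell script."""
    path = directory / "ci_job.sh"
    path.write_text("#!/bin/bash\n" + "\n".join(lines) + "\n")
    path.chmod(0o700)
    return path


def build_submission(manager: str, script: Path, native_args: list[str]) -> list[str]:
    """Build the submit command, passing native arguments through unchanged."""
    return [SUBMIT_COMMANDS[manager], *native_args, str(script)]


if __name__ == "__main__":
    manager = detect_resource_manager()
    with tempfile.TemporaryDirectory() as tmp:
        script = write_job_script(["make", "make test"], Path(tmp))
        if manager is None:
            print("no supported resource manager found")
        else:
            # Printed rather than executed, so the sketch never submits a job.
            print(build_submission(manager, script, ["-n", "4"]))
```

Keeping detection, script generation, and command construction separate is one way to leave room for additional schedulers later.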

Figure: CI Job Flow

By extending the SetUID functionality with the Batch executor, we add a feature that allows jobs to execute on the batch compute nodes, using additional resources and parallel processing capability above and beyond the standard shell executor. In the next post in this series, we will delve into the details of the Batch runner implementation!

Onyx Point is dedicated to supporting open standards, reducing vendor lock-in, and contributing to the open source community. We have partnered with GitLab to help extend modern development practices to more organizations by providing professional services, training, and custom development.


Garrett is a software engineer for Onyx Point, Inc. He previously worked on the Exascale Computing CI initiative, implementing modifications on both the GitLab Server and GitLab Runner projects. Garrett now leads development on the SIMP Scanner project and actively contributes to other SIMP Console codebases.

At Onyx Point, our engineers focus on Security, System Administration, Automation, Dataflow, and DevOps consulting for government and commercial clients. We offer professional services for Puppet, Red Hat, SIMP, NiFi, GitLab, and the other solutions in place that keep your systems running securely and efficiently. We offer Open Source Software support and Engineering and Consulting services through GSA IT Schedule 70. As Open Source contributors and advocates, we encourage the use of FOSS products in Government as part of an overarching IT Efficiencies plan to reduce ongoing IT expenditures attributed to software licensing. Our support of and contributions to Open Source are just one of our many guiding principles:

  • Customer First.
  • Security in All We Do.
  • Pursue Innovation with Integrity.
  • Communicate Openly and Respectfully.
  • Offer Your Talents, and Appreciate the Talents of Others.

programming, open-source, gitlab, DoE, security
