
Introduction to High Throughput Computing with HTCondor
Learn about High Throughput Computing using HTCondor, a specialized workload management system offering job queueing, scheduling, resource and priority management, access to additional computing resources, reliability features, and workflow automation. Understand the differences between HTC and HPC, job scheduling, reliability, job submission requirements, and more.
Presentation Transcript
Introduction to High Throughput Computing with HTCondor
Yulia Pustovalova & Alexandra Sasha Pozhidaeva
Slides kindly provided by Jonathan Wedell
NMRbox summer workshop, June 10-11, 2024
What is HTCondor?
A specialized workload management system for compute-intensive jobs. It provides:
- Job queueing, scheduling, and querying
- Resource management
- Priority management
- Access to additional computing resources
- Reliability (failed jobs can be re-run automatically)
- Workflow automation
HTC vs. HPC
- HTCondor is for high-throughput computing, not high-performance computing.
- When your work is embarrassingly parallel, think HTCondor.

Condor Pool
- HTCondor lets you create pools of heterogeneous machines.
- Submitted jobs can run on any machine in the pool that meets their requirements.
Job Scheduling
- When should the job run?
- How many times should it run?
- What resources are needed? (memory, disk, CPU cores)
- When should it run on which machines? (e.g., desktops at night, servers during the day)
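The "desktops at night" policy above is set on the machine side. A minimal sketch of what such a policy might look like in a machine's condor_config, assuming the standard ClockMin (minutes past midnight) and KeyboardIdle machine attributes; the specific schedule is an assumption for illustration:

```
# Hypothetical condor_config fragment: only start jobs outside 9:00-17:00,
# or whenever the keyboard has been idle for at least 15 minutes.
WorkHours = (ClockMin >= 9 * 60 && ClockMin < 17 * 60)
START = ( $(WorkHours) == False ) || ( KeyboardIdle > 15 * 60 )
```

A pool administrator would tune the START expression to the site's actual policy.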
Reliability
- Jobs can automatically restart upon failure.
- Logs automatically capture when jobs ran and any output or errors generated.
- Checkpointing lets a job save its state, move from one machine to another, or resume from the last checkpoint after a failure (may require special linking of the executable).
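For applications that can write their own checkpoint files, HTCondor supports self-checkpointing via the submit file. A sketch under the assumption that the (hypothetical) executable exits with a designated code after writing its checkpoint file:

```
# Self-checkpointing sketch: the application exits with code 85 each time
# it writes state.ckpt; HTCondor transfers the checkpoint and restarts the job.
checkpoint_exit_code     = 85
transfer_checkpoint_files = state.ckpt
```

The exit code and file name here are illustrative; they must match what the application actually does.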
Submitting to HTCondor
The submit file is the most basic description of a job to run using HTCondor. It specifies:
- Executable
- Arguments
- Requirements (hardware and software)
- Log locations
- Universe (HTCondor execution environment)
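Putting those pieces together, a minimal submit file might look like the following; the executable and file names are hypothetical:

```
# Minimal vanilla-universe submit file sketch
universe   = vanilla
executable = my_analysis
arguments  = input.dat
log        = job.log
output     = job.out
error      = job.err
queue
```

Saved as, say, my_job.sub, it would be submitted with condor_submit my_job.sub.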
Requirements
- CPU, GPU, memory, and disk availability
- Network filesystems
- Other arbitrary parameters, e.g. "this job can only run on machine xyz" or "this job requires MATLAB version x"
Other options
- Where to save log files
- What arguments to run the software with
- How many times to run (potentially with different arguments)
- Under what conditions to stop or pause execution
- What files to transfer to the execute machine (downloading files over HTTP is supported)
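File transfer, including the HTTP download mentioned above, is also declared in the submit file. A sketch with hypothetical file names and URL:

```
# File-transfer sketch: a local config file plus an input fetched over HTTP.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = params.cfg, https://example.org/data/input.tar.gz
```

HTCondor fetches URL entries in transfer_input_files on the execute machine before the job starts.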
Script examples

executable = /usr/software/bin/alphafold
arguments  = $(JobName).fasta
getenv     = True
output     = $(JobName).stdout
error      = $(JobName).stderr

JobName = 2dog
queue
JobName = 2cow
queue
JobName = 1rcf
queue
Script examples

executable = monte_carlo_method.py
arguments  = -seed $(PROCID)
log        = logs/mc.log
output     = logs/mc_$(PROCID).out
error      = logs/mc_$(PROCID).err
queue 1000
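The worker script itself is not shown in the slides. A hypothetical sketch of what a monte_carlo_method.py accepting -seed might look like, here estimating pi so that each of the 1000 queued jobs produces an independent estimate:

```python
# Hypothetical monte_carlo_method.py worker: each HTCondor job passes a
# different -seed, so the 1000 runs are independent Monte Carlo trials.
import argparse
import random


def estimate_pi(seed: int, samples: int = 100_000) -> float:
    """Estimate pi by sampling points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-seed", type=int, default=0)
    # parse_known_args tolerates extra arguments a wrapper might add
    args, _ = parser.parse_known_args()
    print(estimate_pi(args.seed))
```

Each job's result lands in its own logs/mc_N.out file, to be aggregated after the cluster finishes.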
Script examples: Requirements

request_gpus   = 1
request_cpus   = 4
request_disk   = 1M
request_memory = 4G
requirements   = NMRPIPE == "11.5 rev 2023.105.21.31"
Universes
Suggested universe: Vanilla (default)
Other universes: Standard, Java, Parallel, Scheduler, Grid, VM
https://htcondor.readthedocs.io/en/latest/users-manual/choosing-an-htcondor-universe.html
Interacting with HTCondor
- condor_status: shows available machines and cores
- condor_q: shows jobs in the queue
- condor_q -better-analyze: shows jobs in the queue with additional information on hosts and requirements
- condor_submit submission.file: submits a job
- condor_rm job_id (or username): removes jobs by ID or by user
Useful links
HTCondor documentation and tutorials:
https://github.com/NMRbox/htcondor-tutorial
https://htcondor.readthedocs.io/en/latest/users-manual/quick-start-guide.html
GPU requirements:
https://portal.osg-htc.org/documentation/htc_workloads/specific_resource/gpu-jobs/
Examples
AlphaFold; Molecular Dynamics Simulation using GROMACS:

executable = /usr/software/bin/gmx
arguments  = mdrun -v -deffnm step7_production
log        = basic.log
output     = basic.out
error      = basic.err
getenv     = True
queue
AlphaFold on NMRbox
- Version 3 is now available on the Google server. Limitations: 20 jobs per day; at most 5,000 residues in the system.
- NMRbox runs version 2.3.2: multiple jobs can be submitted at the same time through HTCondor; ~8,000 residues on H100 machines.
Grace Hopper machines (grace.nmrbox.org, by request only) can handle >12,000 residues:
- 480 GB of system memory
- H100 GPU with 96 GB of built-in RAM
- 72 ARM-based cores
- The CPU can access the GPU RAM faster and has access to the full 480 + 96 GB.
AF2 on NMRbox
An implementation of the popular Colab / Google server, customizable via /reboxitory/data/alphafold/2.3.2/alphafold.py
Arguments:
--database
--max_template_date
--use_precomputed_msas
--num_multimer_predictions_per_model
--no-gpu
--custom_config_file
Acknowledgments
Jon Wedell
Gerard Weatherby
The NMRhub team
For all support questions, please email us at support@nmrbox.org
Join our Slack community channel!