
Introduction to High Throughput Computing with HTCondor
Learn about High Throughput Computing using HTCondor, a specialized workload management system offering job queueing, scheduling, resource and priority management, access to additional computing resources, reliability features, and workflow automation. Understand the differences between HTC and HPC, job scheduling, reliability, job submission requirements, and more.
Presentation Transcript
Introduction to High Throughput Computing with HTCondor
Yulia Pustovalova & Alexandra Sasha Pozhidaeva
Slides kindly provided by Jonathan Wedell
NMRbox summer workshop, June 10-11, 2024
What is HTCondor?
A specialized workload management system for compute-intensive jobs. It provides:
- Job queueing, scheduling, and querying
- Resource management
- Priority management
- Access to additional computing resources
- Reliability (failed jobs can be re-run automatically)
- Workflow automation
HTC vs. HPC
- HTCondor is for high-throughput computing, not high-performance computing.
- When your work is embarrassingly parallel, think HTCondor.

Condor Pool
- HTCondor lets you create pools of heterogeneous machines.
- Submitted jobs can run on any machine in the pool that meets their requirements.
Job Scheduling
- When should the job run?
- How many times should it run?
- What resources are needed? (memory, disk, CPU cores)
- When should it run on which machines? (e.g., desktops at night, servers during the day)
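The "desktops at night" policy above is set on the machine side. A minimal sketch of what such a policy might look like in a machine's condor_config, assuming the standard ClockMin (minutes past midnight) and KeyboardIdle machine attributes; the specific schedule is an assumption for illustration:

```
# Hypothetical condor_config fragment: only start jobs outside 9:00-17:00,
# or whenever the keyboard has been idle for at least 15 minutes.
WorkHours = (ClockMin >= 9 * 60 && ClockMin < 17 * 60)
START = ( $(WorkHours) == False ) || ( KeyboardIdle > 15 * 60 )
```

A pool administrator would tune the START expression to the site's actual policy.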
Reliability
- Jobs can automatically restart upon failure.
- Logs automatically capture when jobs ran and any output or errors generated.
- Checkpointing lets a job save its state, move from one machine to another, or resume from the last checkpoint after a failure (may require special linking of the executable).
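For applications that can write their own checkpoint files, HTCondor supports self-checkpointing via the submit file. A sketch under the assumption that the (hypothetical) executable exits with a designated code after writing its checkpoint file:

```
# Self-checkpointing sketch: the application exits with code 85 each time
# it writes state.ckpt; HTCondor transfers the checkpoint and restarts the job.
checkpoint_exit_code     = 85
transfer_checkpoint_files = state.ckpt
```

The exit code and file name here are illustrative; they must match what the application actually does.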
Submitting to HTCondor
The submit file is the most basic description of a job to run using HTCondor. It specifies:
- Executable
- Arguments
- Requirements (hardware and software)
- Log locations
- Universe (HTCondor execution environment)
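Putting those pieces together, a minimal submit file might look like the following; the executable and file names are hypothetical:

```
# Minimal vanilla-universe submit file sketch
universe   = vanilla
executable = my_analysis
arguments  = input.dat
log        = job.log
output     = job.out
error      = job.err
queue
```

Saved as, say, my_job.sub, it would be submitted with condor_submit my_job.sub.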
Requirements
- CPU, GPU, memory, and disk availability
- Network filesystems
- Other arbitrary parameters, e.g. "this job can only run on machine xyz" or "this job requires MATLAB version x"
Other options
- Where to save log files
- What arguments to run the software with
- How many times to run (potentially with different arguments)
- Under what conditions to stop or pause execution
- What files to transfer to the execute machine (downloading files over HTTP is supported)
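File transfer, including the HTTP download mentioned above, is also declared in the submit file. A sketch with hypothetical file names and URL:

```
# File-transfer sketch: a local config file plus an input fetched over HTTP.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = params.cfg, https://example.org/data/input.tar.gz
```

HTCondor fetches URL entries in transfer_input_files on the execute machine before the job starts.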
Script examples

executable = /usr/software/bin/alphafold
arguments  = $(JobName).fasta
getenv     = True
output     = $(JobName).stdout
error      = $(JobName).stderr

JobName = 2dog
queue
JobName = 2cow
queue
JobName = 1rcf
queue
Script examples

executable = monte_carlo_method.py
arguments  = -seed $(PROCID)
log        = logs/mc.log
output     = logs/mc_$(PROCID).out
error      = logs/mc_$(PROCID).err
queue 1000
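The worker script itself is not shown in the slides. A hypothetical sketch of what a monte_carlo_method.py accepting -seed might look like, here estimating pi so that each of the 1000 queued jobs produces an independent estimate:

```python
# Hypothetical monte_carlo_method.py worker: each HTCondor job passes a
# different -seed, so the 1000 runs are independent Monte Carlo trials.
import argparse
import random


def estimate_pi(seed: int, samples: int = 100_000) -> float:
    """Estimate pi by sampling points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-seed", type=int, default=0)
    # parse_known_args tolerates extra arguments a wrapper might add
    args, _ = parser.parse_known_args()
    print(estimate_pi(args.seed))
```

Each job's result lands in its own logs/mc_N.out file, to be aggregated after the cluster finishes.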
Script examples: Requirements

request_gpus   = 1
request_cpus   = 4
request_disk   = 1M
request_memory = 4G
requirements   = NMRPIPE == "11.5 rev 2023.105.21.31"
Universes
Suggested universe: Vanilla (default)
Other universes: Standard, Java, Parallel, Scheduler, Grid, VM
https://htcondor.readthedocs.io/en/latest/users-manual/choosing-an-htcondor-universe.html
Interacting with HTCondor
- condor_status: shows available machines and cores
- condor_q: shows jobs in the queue
- condor_q -better-analyze: shows jobs in the queue with additional information on hosts and requirements
- condor_submit submission.file: submits a job
- condor_rm job_id (or username): removes jobs by ID or by user
Useful links
HTCondor documentation and tutorials:
https://github.com/NMRbox/htcondor-tutorial
https://htcondor.readthedocs.io/en/latest/users-manual/quick-start-guide.html
GPU requirements:
https://portal.osg-htc.org/documentation/htc_workloads/specific_resource/gpu-jobs/
Examples
AlphaFold; Molecular Dynamics Simulation using GROMACS:

executable = /usr/software/bin/gmx
arguments  = mdrun -v -deffnm step7_production
log        = basic.log
output     = basic.out
error      = basic.err
getenv     = True
queue
AlphaFold on NMRbox
- Version 3 is now available on the Google server. Limitations: 20 jobs per day; at most 5,000 residues in the system.
- NMRbox runs version 2.3.2: multiple jobs can be submitted at the same time through HTCondor; ~8,000 residues on H100 machines.
Grace Hopper machines (grace.nmrbox.org, by request only) can handle >12,000 residues:
- 480 GB of system memory
- H100 GPU with 96 GB of built-in RAM
- 72 ARM-based cores
- The CPU can access the GPU RAM faster and has access to the full 480 + 96 GB.
AF2 on NMRbox
An implementation of the popular Colab / Google server, customizable via /reboxitory/data/alphafold/2.3.2/alphafold.py
Arguments:
--database
--max_template_date
--use_precomputed_msas
--num_multimer_predictions_per_model
--no-gpu
--custom_config_file
Acknowledgments
Jon Wedell
Gerard Weatherby
The NMRhub team
For all support questions, please email us at support@nmrbox.org
Join our Slack community channel!