
Cross-Architecture High-Order Epistasis Detection on CPU and GPU Devices
Uncover the genetic basis of complex diseases through Genome-wide association studies (GWAS) and Single Nucleotide Polymorphism (SNP) analysis. Explore how epistasis helps explain certain traits and discover CUDA-based high-order epistasis detection methods. Learn about cutting-edge research in understanding genetic interactions for improved disease detection, treatment, and prevention.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
oneAPI Developer Summit IWOCL & SYCLcon2021 Cross-architecture high-order exhaustive epistasis detection on CPU and GPU devices oneAPI Great Cross-Architecture Challenge Highlight Ricardo Nobre Sergio Santander-Jim nez, Leonel Sousa, Aleksandar Ilic inesc-id.pt
Genes encode phenotype Genome-wide association studies (GWAS) have been uncovering the genetic basis of complex diseases, e.g. asthma, cancer, diabetes data for large number of samples (for different genetic markers) markers most correlated w/ trait cases controls do not have trait have trait e.g. disease predisposition Improved detection, treatment and prevention for more conditions Exploiting Novel HW w/ Efficient Methods 2 | 26-April-21
Single Nucleotide Polymorphism (SNP) SNP G A C G T A A C T G C A T T person 1 : G A C A T A A C T G T A T T person 2 : at specific DNA position (> 1% population) homozygous major (?) XX major allele (X) minor allele (x) heterozygous (?) Xx lowest freq. highest freq. homozygous minor (?) xx 3 | 26-April-21
Epistasis helps explaining certain traits1,2,3 k SNPs interacting (transcends additive effects) Per each set of SNPs: (1) Count genotype frequencies (2) Score combinations of SNPs (3) Reduce scores / identify solution [1] J. Sun, Hidden Risk Genes with High-Order Intragenic Epistasis in Alzheimer s Disease. J Alzheimers Dis, 2014 [2] C. Im, Genome-wide search for higher order epistasis as modifiers of treatment effects on bone mineral density in childhood cancer survivors. Eur J Hum Genet, 2018 [3] Y. Quan, Facilitating Anti-Cancer Combinatorial Drug Discovery by Targeting Epistatic Disease Genes. Molecules, 2018 4 | 26-April-21
CUDA-based high-order epistasis detection R. Nobre, S. Santander-Jim nez, L. Sousa and A. Ilic. Accelerating 3-way Epistasis Detection with CPU+GPU processing. Paper presented in 23rd Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), New Orleans, 2020. DOI: 10.1007/978-3-030-63171-0_6 ~3 faster than related art w/ospecialized HW tackles 3-way searches (rare in SoA) R. Nobre, A. Ilic, S. Santander-Jim nez and L. Sousa. Exploring the Binary Precision Capabilities of Tensor Cores for Epistasis Detection. Paper presented in 34th International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, 2020. DOI: 10.1109/IPDPS47924.2020.00043 [Best Paper nominee] R. Nobre, A. Ilic, S. Santander-Jim nez and L. Sousa. Retargeting Tensor Accelerators for Epistasis Detection. IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 9, pp. 2160-2174, 1 Sept. 2021. DOI: 10.1109/TPDS.2021.3060322. 2-way and 3-way using HW targeting BNNs faster than any related art (incl. vs. 6-GPU nodes) How difficult could it be to port some of these to Data Parallel C++? 5 | 26-April-21
Converting Application to DPC++ functionally equivalent DPC++ implementation Intel DPC++ Compatibility Tool epistasis detection implemented in CUDA 1. Run intercept-build to create compile_commands.json with compilation command (nvcc ) and pointers to CUDA sources of the application 2. dpct -p compile_commands.json -in-root=$PROJ_DIR -out-root=dpcpp_out *.cu - -cuda-include-path=$CUDA_DIR/include 3. Create Makefile that uses dpcpp instead of nvcc, adjusting libraries used (e.g. -fiopenmp) Intel DPC++ Compatibility Tool Developer Guide and Reference https://software.intel.com/content/www/us/en/develop/documentation/intel-dpcpp-compatibility-tool-user-guide/top/usage-workflow-overview.html 6 | 26-April-21
DPCT provides helpful feedback Migrated API does not return error code. (*, 0) is inserted. You may need to rewrite this code. int ret = cudaMalloc((unsigned long long**)&d_datasetCases, datasetCases_size * sizeof(unsigned long long)); SYCL uses exceptions to report errors and does not use the error codes. The call was replaced with 0. You need to rewrite this code. cudaError_t err = cudaGetLastError(); The call to __ldg was removed, because there is no corresponding API in DPC++. Most are optional. Minimal modifications had to be performed to compile the DPC++ translated code E.g. sycl::min(val, shfl_xor_32(val, 1, item_ct1)); 7 | 26-April-21
Be aware of assumptions about HW in code __inline__ float warpReduceMin(float val, sycl::nd_item<3> item_ct1) { val = sycl::min(val, shfl_xor_32(val, 1, item_ct1)); val = sycl::min(val, shfl_xor_32(val, 2, item_ct1)); val = sycl::min(val, shfl_xor_32(val, 4, item_ct1)); val = sycl::min(val, shfl_xor_32(val, 8, item_ct1)); val = sycl::min(val, shfl_xor_32(val, 16, item_ct1)); return val; } subgroups might not have 32 work-items (warp in NVIDIA) depends on the target and on how the code is compiled __inline__ float warpReduceMin(float val, sycl::nd_item<3> item_ct1) { for(int i=1; i < item_ct1.get_sub_group().get_local_range().get(0); i = i * 2) { val = sycl::min(val, shfl_xor_32(val, i, item_ct1)); } return val; } Assumptions like these can result in code that compiles but produces incorrect output 8 | 26-April-21
Running third-order searches on DevCloud nodes The application can now be executed on more GPU devices and also on CPU-based systems (2 ) Xeon Gold 6128 [CPU] 5.2 run on dual Xeon 6128 make run_cpu qsub -l nodes=1:ppn=2:iris_xe_max -d . run.sh UHD Graphics P630 (Gen9.5) [iGPU] 8.7 Iris Xe MAX (Gen12) [dGPU] 107.2 run on Iris Xe MAX make run_gpu qsub -l nodes=1:ppn=2:gold6128 -d . run.sh 0 50 100 150 PERFORMANCE [giga unique sets samples / sec] [dataset: 1024 SNPs 4096samples] Code repository: https://github.com/rjfnobre/crossarch-episdet DevMesh entry: https://devmesh.intel.com/projects/cross-architecture-high-order-exhaustive-epistasis-detection-on- cpu-and-gpu-devices 9 | 26-April-21
Improving execution on CPU-based system for(int comb_i = 0; comb_i < COMB_SIZE; comb_i++) { unsigned long long acc = 0xFFFFFFFFFFFFFFFF; for(int epistasis_i=0; epistasis_i < EPISTASIS_SIZE; epistasis_i++) { acc = acc & controlsArr[epistasis_i * 3 + ((int) (comb_i / pow_table[epistasis_i])) % 3]; } observedValues_shared[comb_i * 2 * WORKGROUP_SIZE + 0 * WORKGROUP_SIZE + local_id] += sycl::popcount(acc); } for(int a_i = 0; a_i < 3; a_i++) { for(int b_i = 0; b_i < 3; b_i++) { for(int c_i = 0; c_i < 3; c_i++) { uint comb_i = a_i * 9 + b_i * 3 + c_i; observedValues_shared[comb_i * 2 * WORKGROUP_SIZE + 0 * WORKGROUP_SIZE + local_id] += sycl::popcount(controlsArr[0 * 3 + a_i] & controlsArr[1 * 3 + b_i] & controlsArr[2 * 3 + c_i]); } } } Increased perf. on CPU-based system (2 ) Xeon Gold 6128 [CPU] 5.2 21.9 Minimal impact on GPU devices 0 5 10 15 20 25 Direct translation Alternative indexing PERFORMANCE [giga unique sets samples / sec] 10 | 26-April-21
Improving execution on GPU devices Local mem. used contingency table: uint 27 2 4 bytes ushort 27 2 2 bytes 2 33genotype frequencies per contingency table determines # of work-groups that can execute on each subslice Each work-item constructs a cont. table more work-items executing concurrently kernel parameters (host code): in device code: dpct_local_acc_ct1(sycl::range<1>(2 * WORKGROUP_SIZE * comb * sizeof(DATA_TYPE)), cgh); auto smem = (DATA_TYPE *)dpct_local; DATA_TYPE * observedValues_shared = smem; UHD Graphics P630 (Gen9.5) [iGPU] 1.16 / 1.57 on Iris Xe MAX / P630 Iris Xe MAX (Gen12) [dGPU] Slower on CPU (~7%) 0 50 100 150 32-bit (uint) 16-bit (ushort) PERFORMANCE [giga unique sets samples / sec] 11 | 26-April-21
Ongoing / Future work Conclusions DPC++ Compatibility Tool proved to be an efficient way to port our CUDA code Further tune application to particular architectures / devices Research algorithms to eficiently generalize to higher order searches Application is capable of targeting HW not supported by the original implementation With the growth of DPC++/SYCL, more architectures will be supported Add support for the FPGA devices available in DevCloud 12 | 26-April-21
Thank you! Supported by national funds through FCT, under project UIDB/50021/2020 and Grant SFRH/BPD/119220/2016, and the ERDF, under project LISBOA- 01-0145-FEDER-031901 (PTDC/CCI-COM/31901/2017, HiPErBio).