CN114067917A - GATK super computer system based on tuning parameters - Google Patents

GATK super computer system based on tuning parameters

Info

Publication number
CN114067917A
Authority
CN
China
Prior art keywords
gatk
servers
cpu
cores
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111356155.5A
Other languages
Chinese (zh)
Inventor
徐恩格
顾文彬
单晓冬
蒋鹏飞
鲍复劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou International Science Park Data Center Co ltd
Original Assignee
Suzhou International Science Park Data Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou International Science Park Data Center Co ltd filed Critical Suzhou International Science Park Data Center Co ltd
Priority to CN202111356155.5A
Publication of CN114067917A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
    • G16B 50/30 - Data warehousing; Computing architectures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G06F 9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F 9/5016 - Allocation of resources to service a request, the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a GATK (Genome Analysis Toolkit) supercomputer system based on tuning parameters, and belongs to the field of computers. The system comprises a plurality of CPU servers and an LSF job scheduling system; the LSF job scheduling system is in signal connection with 8 CPU servers; each CPU server comprises 2 or 4 CPU sockets; each socket has 20 cores, so that the total number of cores per server is 80 or 40; each core runs 1 thread; and the CPU servers process the calculation tasks of the LSF job scheduling system using the multithreading parameter NumThreads. The invention can be used in any supercomputing center, so that the GATK software can improve its computing capability across multiple server platforms.

Description

GATK super computer system based on tuning parameters
Technical Field
The invention belongs to the field of computers, and relates to a GATK (Genome Analysis Toolkit) supercomputer system based on tuning parameters.
Background
GATK (The Genome Analysis Toolkit) is software developed by the Broad Institute for the analysis of second-generation re-sequencing data. It contains many useful tools, focuses mainly on variant discovery and genotyping, and places strong emphasis on data quality assurance. In existing GATK workflows, however, the steps are run one after another; because the CPU occupancy of each step differs greatly, and some individual steps reach their best throughput with only a few cores, computing resources are wasted.
Disclosure of Invention
In view of the above, the present invention is directed to a GATK supercomputer system based on tuning parameters.
In order to achieve this purpose, the invention provides the following technical solution:
the GATK super computer system based on the tuning parameters comprises a plurality of CPU servers and an LSF operation scheduling system;
the LSF job scheduling system is in signal connection with 8CPU servers;
the CPU server comprises 2 or 4 slots;
the number of cores of the slot is 20; the total number of cores is 80 or 40;
the number of threads of the kernel is 1;
and the CPU server processes the calculation task of the LSF job scheduling system by using a multithreading parameter NumThreads.
Optionally, the number of the CPU servers is 8.
Optionally, the 8CPU servers include 5 servers with a model number of Intel 6248CPU total kernel of 80 cores and 3 servers with a model number of Intel 6248CPU total kernel of 40 cores.
Optionally, when the total kernel is 40, the memory is 384 GB; when the total kernel is 80, the memory is 1536 GB;
the server 384GB of the 40 kernel runs 6-7 tasks simultaneously to the limit;
the server 1536GB of the 80 core runs 17 tasks to the limit.
Optionally, the GATK supercomputer system further includes a network card provided with a 100Gbps Infiniband computing network interface.
The invention has the beneficial effects that: the invention can be used in any supercomputing center, so that the GATK software can improve its computing capability across multiple server platforms.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a system diagram of the present invention;
FIG. 2 is a script flow diagram;
FIG. 3 is a schematic diagram of thread scaling for single-sample identification.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only for the purpose of illustrating the invention and not for limiting it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
Referring to FIG. 1, the GATK supercomputer system based on tuning parameters runs the following steps. BWA-Mem: maps low-divergence sequences against a large reference genome. SortSam: sorts the input SAM or BAM and creates an index file. MarkDuplicates: examines the aligned records in the provided SAM or BAM file to locate duplicate reads. RealignerTargetCreator: emits the target regions as intervals for IndelRealigner, which realigns them locally. IndelRealigner: performs local realignment of reads to correct misalignments caused by the presence of indels. BaseRecalibrator: identifies systematic errors in the base quality scores reported by the sequencer and computes a recalibration model for adjusting those scores accordingly. PrintReads: produces the recalibrated, merged output BAM file sorted in coordinate order. HaplotypeCaller: calls germline SNPs and indels by local reassembly of haplotypes, producing a genomic VCF covering all possible genotypes at all sites.
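For orientation, the eight steps above can be summarized as an ordered pipeline. The following Python sketch lists the tools in execution order together with rough thread behaviour; the "typical threads" values are illustrative assumptions drawn from the observations later in this description (most tools saturate at a handful of cores, while BWA-Mem and HaplotypeCaller can use a whole node), not values fixed by the invention.

```python
# Illustrative summary of the eight-step pipeline described above.
# The "typical threads" column is an assumption for discussion only.
PIPELINE = [
    ("BWA-Mem",                "map reads against the reference genome",        "up to all cores"),
    ("SortSam",                "sort the SAM/BAM and create an index",          "3-8 cores"),
    ("MarkDuplicates",         "flag duplicate reads in the aligned BAM",       "3-8 cores"),
    ("RealignerTargetCreator", "define intervals for local realignment",        "3-8 cores"),
    ("IndelRealigner",         "locally realign reads around indels",           "3-8 cores"),
    ("BaseRecalibrator",       "model systematic base-quality errors",          "3-8 cores"),
    ("PrintReads",             "write the recalibrated, coordinate-sorted BAM", "3-8 cores"),
    ("HaplotypeCaller",        "call germline SNPs/indels and emit a gVCF",     "up to all cores"),
]

for tool, purpose, threads in PIPELINE:
    print(f"{tool:24s} {purpose:48s} typical threads: {threads}")
```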
Since a batch of data comprises many genomes to be processed, and a whole-genome sample passes through several steps whose CPU usage differs, scripts were designed accordingly, as shown in FIG. 2.
The first script traverses each genome in the batch and submits it to all machines in turn, running the first group of steps, i.e. the steps that can be completed with only a few threads (about 8). The second script is executed only after all of these computations are complete.
The second script again traverses each genome and submits it to all machines in turn, running the later group of steps, i.e. the steps that need about 40 threads for their calculation. Whenever a step needs many threads, it is handled by this second script. A sketch of this two-phase submission is given below.
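A minimal Python sketch of the two-phase submission, assuming jobs are handed to LSF with bsub. The wrapper scripts low_thread_steps.sh and high_thread_steps.sh, the sample names and the core counts are hypothetical placeholders, and the wait between phases is expressed with bsub's -w dependency option as one possible way of ensuring the first phase has finished.

```python
import subprocess

# Hypothetical batch of genome samples; names are placeholders.
SAMPLES = ["sample_001", "sample_002", "sample_003"]

def bsub(command, cores, job_name, depends_on=None):
    """Submit one job to LSF (-n core count, -J job name, -w dependency)."""
    args = ["bsub", "-n", str(cores), "-J", job_name]
    if depends_on:
        args += ["-w", f"done({depends_on})"]
    args.append(command)
    subprocess.run(args, check=True)

# Script 1: the steps that finish with only a few threads.
for sample in SAMPLES:
    bsub(f"bash low_thread_steps.sh {sample}", cores=8, job_name=f"prep_{sample}")

# Script 2: the heavily threaded steps, started only after the
# corresponding first-phase job for that sample has completed.
for sample in SAMPLES:
    bsub(f"bash high_thread_steps.sh {sample}", cores=40,
         job_name=f"call_{sample}", depends_on=f"prep_{sample}")
```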
1. A sample task currently consists of multiple steps; some of these steps can fully load the 80 cores of a single machine, while others can only use 3-8 cores, leaving resources idle.
2. Under the current test conditions, at most 20 whole genomes can be processed in one week, which is roughly equivalent to running the tasks on a machine with poor performance.
3. Software efficiency therefore needs to be improved so that, with the existing resources, the number of samples calculated per week is maximized. The available resources are 8 high-performance CPU servers, namely 5 servers fitted with Intel 6248 CPUs (80 cores in total each) and 3 servers fitted with Intel 6248 CPUs (40 cores in total each), giving 520 cores in total, with 1.3 TB of memory, 550 TB of parallel file storage and 50 M of bandwidth, as shown in Table 1.
There are two server types in total.
One type is fitted with Intel 6248 CPUs and has 80 cores in total; the other is fitted with Intel 6248 CPUs and has 40 cores in total.
A total of 8 servers are used, which is the whole of our resource pool. The system can be built from as few as one server, and in principle there is no upper limit on the number of machines.
Table 1 server configuration table (the table is reproduced as an image in the original publication)
LSF (Load Sharing Facility) is a distributed resource management tool used to schedule, monitor and analyze the load of networked computers. It is commercial, paid software. Through centralized monitoring and scheduling, resources such as CPUs, memory, disks and licenses are fully shared.
On a cluster or supercomputer platform, parallel computing programs generally cannot be launched directly with mpiexec or mpirun; instead, computing tasks must be submitted through the job management system provided on the platform.
As an important component of cluster system software, a cluster job management system can uniformly manage and schedule the cluster's software and hardware resources according to user requirements, ensure that user jobs share the cluster resources fairly and reasonably, and improve system utilization and throughput. Other commonly used cluster job management systems are PBS (Portable Batch System), SLURM (Simple Linux Utility for Resource Management) and SGE (Sun Grid Engine).
The whole analysis process is carried out in 8 steps. Tests on the supercomputing servers showed that the two steps of read alignment (BWA-Mem) and post-alignment variant calling (HaplotypeCaller) together took 15 hours, whereas the same calculation on the previous server took 6 hours, a large difference in computing speed.
Multidimensional stress tests covering the network, storage, server performance and job scheduling revealed no problems. Monitoring showed, however, that CPU occupancy remained low even though every step had been allocated 80 threads. It was then found that the multithreading parameter, the "NumThreads" parameter, had not been set in the software parameters: all computing tasks were running single-threaded, and the performance of the servers was not being exploited. After parameter tuning, the original calculation time of 15 hours was shortened to under 5 hours. An illustrative form of such multithreaded invocations is sketched below.
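By way of illustration, GATK 3.x exposes thread control through the -nt flag (number of data threads) and the -nct flag (number of CPU threads per data thread); which flag a tool accepts differs per tool. The sketch below shows how thread settings could be attached to a few of the pipeline steps; the jar path, file paths, Java heap size and thread counts are placeholder assumptions, not the exact settings used by the invention.

```python
import subprocess

# Base command for GATK 3.x; the jar path and Java heap size are assumptions.
GATK = ["java", "-Xmx32g", "-jar", "GenomeAnalysisTK.jar"]

def run_gatk(tool, thread_flag, threads, extra_args):
    """Run one GATK tool with an explicit thread setting."""
    cmd = GATK + ["-T", tool, thread_flag, str(threads)] + extra_args
    subprocess.run(cmd, check=True)

# -nt: data threads (e.g. RealignerTargetCreator)
run_gatk("RealignerTargetCreator", "-nt", 8,
         ["-R", "ref.fa", "-I", "sample.bam", "-o", "targets.intervals"])

# -nct: CPU threads per data thread (e.g. BaseRecalibrator, HaplotypeCaller)
run_gatk("BaseRecalibrator", "-nct", 8,
         ["-R", "ref.fa", "-I", "sample.bam",
          "-knownSites", "dbsnp.vcf", "-o", "recal.table"])
run_gatk("HaplotypeCaller", "-nct", 40,
         ["-R", "ref.fa", "-I", "sample.bam",
          "--emitRefConfidence", "GVCF", "-o", "sample.g.vcf"])
```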
Tuning of the other steps:
among 8 steps, after the optimization of partial steps, 80 cores of a single machine can be fully run, but partial steps can only use 3-8 cores, and a resource idle state exists. Research shows that some steps of software can not break through the upper limit of threads and can not perform parallel computation. The higher the number of threads, the less significant the reduction in execution time. This is due to the nature of parallel processing. Initially there are several threads processing a large data set, which shortens run time. However, it does take time to divide the run into smaller parts, issue on separate threads, and finally recombine the parts into one large file. This overhead increases as the number of threads being used increases, so at some point more threads may instead stall the total run time. According to the test situation, only 20 whole genome samples can be run at most within one week for identification, and the actual situation is basically equivalent to the state of running the tasks.
After careful study, 4 scripts were written, solving the problem that a large share of the computing resources went unused. The original script could only run all tasks on a single server and could not make use of several servers at the same time (that is, 200 samples could only be run sequentially on one server). It also could not adjust the number of threads requested according to need, which wasted server computing resources (the script could not be adapted to the different CPU-thread requirements of the different steps). Of the new scripts, the first two split the workload into two major phases according to the thread ceilings of the individual steps, and the last two submit jobs in batches to the two server types, so that the computing speed can be improved as far as possible.
As the number of threads increases, the execution time for single-sample identification decreases. The results show that, as more threads are used, the processing time of BWA-Mem and HaplotypeCaller is reduced significantly compared with the rest of the pipeline. Similar or better results can be achieved using thread- and process-level optimization. Practice has shown that running most steps with 8 or 10 cores (except BWA-Mem and HaplotypeCaller, which need to occupy all resources) works better, and with 8- or 10-core tasks running in parallel, all cores of a single server can be kept occupied.
The size of the memory must also be considered: the memory of a single server is limited, and if too many tasks run in parallel the memory may become insufficient, affecting the running of every task on that server. A rough capacity calculation is sketched below.
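The capacity limits quoted earlier can be checked with a rough calculation that takes both cores and memory into account. In the sketch below, the per-task figures (about 5 cores and 55 GB per task, with some memory held back for the operating system) are illustrative assumptions chosen only to make the arithmetic concrete; they are not values specified by the invention.

```python
# Rough check of how many pipeline tasks fit on one node before either the
# cores or the memory run out. Per-task figures are illustrative assumptions.
NODES = {
    "40-core node (384 GB)":  {"cores": 40, "mem_gb": 384},
    "80-core node (1536 GB)": {"cores": 80, "mem_gb": 1536},
}

def max_parallel_tasks(cores, mem_gb, cores_per_task=5, mem_per_task_gb=55,
                       mem_headroom_gb=32):
    """Concurrent tasks limited by whichever resource is exhausted first."""
    by_cores = cores // cores_per_task
    by_memory = (mem_gb - mem_headroom_gb) // mem_per_task_gb
    return min(by_cores, by_memory)

for name, spec in NODES.items():
    print(f"{name}: about {max_parallel_tasks(**spec)} concurrent tasks (modelled)")
```

With these assumed figures the model lands close to the limits stated above (6-7 concurrent tasks on the 40-core, 384 GB servers and about 17 on the 80-core, 1536 GB servers); in practice the limit also depends on which steps happen to be running at the same time.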
Table 2 thread scaling - single sample identification list (the table is reproduced as an image in the original publication)
Table 2 and FIG. 3 show how the execution time of each step in the sequencing pipeline varies with the number of threads. For the data set Solexa-272221, the run time of every step decreases as the number of threads increases from 1 to 36. The higher the number of threads, the less significant the further reduction in execution time. This is inherent in parallel processing: initially several threads work on a large data set, which shortens the run time, but it takes time to divide the run into smaller parts, issue them on separate threads, and finally recombine the parts into one large file. This overhead increases as the number of threads in use increases, so at some point adding threads may instead stall the total run time. The number of threads therefore cannot be chosen arbitrarily; thread scaling depends on the problem size, since the work done by each individual thread must outweigh the cost of combining and synchronizing the threads' results. By choosing the number of threads judiciously in view of the number of samples being analyzed and the number of threads available, a set of analyses can be completed in a shorter time. A toy cost model illustrating this trade-off is sketched below.
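The diminishing return described above can be illustrated with a toy cost model: a fixed serial portion, a portion that divides evenly across threads, and a per-thread overhead for splitting the work and recombining the partial results. The constants below are illustrative and are not fitted to the measurements in Table 2.

```python
# Toy model of thread scaling: serial part + parallel part / n threads
# + per-thread overhead for splitting and merging. Constants are illustrative.
def modelled_run_time(n_threads, serial_h=1.0, parallel_h=30.0, overhead_h=0.05):
    return serial_h + parallel_h / n_threads + overhead_h * n_threads

for n in (1, 2, 4, 8, 16, 36, 64):
    print(f"{n:3d} threads -> {modelled_run_time(n):5.2f} h (modelled)")
```

In this model the run time falls steeply at first and then flattens; beyond a certain thread count the overhead term dominates and the total time starts to rise again, which matches the behaviour observed in practice.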
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (5)

1. The GATK super computer system based on the tuning parameters is characterized in that: the system comprises a plurality of CPU servers and an LSF job scheduling system;
the LSF job scheduling system is in signal connection with 8 CPU servers;
each CPU server comprises 2 or 4 CPU sockets;
each socket has 20 cores, so that the total number of cores per server is 80 or 40;
each core runs 1 thread;
and the CPU servers process the calculation tasks of the LSF job scheduling system using the multithreading parameter NumThreads.
2. A tuning parameter based GATK supercomputer system according to claim 1, wherein: the number of the CPU servers is 8.
3. A tuning parameter based GATK supercomputer system according to claim 1, wherein: the 8 CPU servers comprise 5 servers fitted with Intel 6248 CPUs and 80 cores in total, and 3 servers fitted with Intel 6248 CPUs and 40 cores in total.
4. A tuning parameter based GATK supercomputer system according to claim 3, wherein: when the total core count is 40, the memory is 384 GB; when the total core count is 80, the memory is 1536 GB;
a 40-core, 384 GB server runs at most 6-7 tasks simultaneously;
an 80-core, 1536 GB server runs at most 17 tasks simultaneously.
5. A tuning parameter based GATK supercomputer system according to claim 4, characterized in that: the GATK supercomputer system further comprises a network card providing a 100 Gbps InfiniBand computing network interface.
CN202111356155.5A 2021-11-16 2021-11-16 GATK super computer system based on tuning parameters Pending CN114067917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111356155.5A CN114067917A (en) 2021-11-16 2021-11-16 GATK super computer system based on tuning parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111356155.5A CN114067917A (en) 2021-11-16 2021-11-16 GATK super computer system based on tuning parameters

Publications (1)

Publication Number Publication Date
CN114067917A true CN114067917A (en) 2022-02-18

Family

ID=80272663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111356155.5A Pending CN114067917A (en) 2021-11-16 2021-11-16 GATK super computer system based on tuning parameters

Country Status (1)

Country Link
CN (1) CN114067917A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
CN114968559B (en) * 2022-05-06 2023-12-01 苏州国科综合数据中心有限公司 LSF-based multi-host multi-GPU distributed arrangement deep learning model method

Similar Documents

Publication Publication Date Title
Chen et al. Improving MapReduce performance using smart speculative execution strategy
Boyer et al. Load balancing in a changing world: dealing with heterogeneity and performance variability
US8578381B2 (en) Apparatus, system and method for rapid resource scheduling in a compute farm
Nghiem et al. Towards efficient resource provisioning in MapReduce
US10176014B2 (en) System and method for multithreaded processing
Yang et al. Design adaptive task allocation scheduler to improve MapReduce performance in heterogeneous clouds
CN1914597A (en) Dynamic loading and unloading for processing unit
Garcia Pinto et al. A visual performance analysis framework for task‐based parallel applications running on hybrid clusters
Ino et al. Sequence homology search using fine grained cycle sharing of idle GPUs
Zhang et al. Fine-grained multi-query stream processing on integrated architectures
Jin et al. Towards low-latency batched stream processing by pre-scheduling
Maroulis et al. A holistic energy-efficient real-time scheduler for mixed stream and batch processing workloads
CN114067917A (en) GATK super computer system based on tuning parameters
Ruan et al. A comparative study of large-scale cluster workload traces via multiview analysis
Choi et al. Interference-aware co-scheduling method based on classification of application characteristics from hardware performance counter using data mining
CN108132834A (en) Method for allocating tasks and system under multi-level sharing cache memory framework
CN111176831A (en) Dynamic thread mapping optimization method and device based on multithread shared memory communication
Fu et al. Optimizing speculative execution in spark heterogeneous environments
Yu et al. Exploiting online locality and reduction parallelism for sampled dense matrix multiplication on gpus
CN1851652A (en) Method for realizing process priority-level round robin scheduling for embedded SRAM operating system
Wu et al. Paraopt: Automated application parameterization and optimization for the cloud
Simakov et al. A quantitative analysis of node sharing on hpc clusters using xdmod application kernels
CN110928659A (en) Numerical value pool system remote multi-platform access method with self-adaptive function
CN116360921A (en) Cloud platform resource optimal scheduling method and system for electric power Internet of things
CN115033389A (en) Energy-saving task resource scheduling method and device for power grid information system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination