CN114067917A - GATK super computer system based on tuning parameters - Google Patents

GATK super computer system based on tuning parameters

Info

Publication number
CN114067917A
Authority
CN
China
Prior art keywords
gatk
servers
cpu
cores
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111356155.5A
Other languages
Chinese (zh)
Inventor
徐恩格
顾文彬
单晓冬
蒋鹏飞
鲍复劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou International Science Park Data Center Co ltd
Original Assignee
Suzhou International Science Park Data Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou International Science Park Data Center Co ltd filed Critical Suzhou International Science Park Data Center Co ltd
Priority to CN202111356155.5A
Publication of CN114067917A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
    • G16B 50/30 - Data warehousing; Computing architectures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G06F 9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F 9/5016 - Allocation of resources to service a request, the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a GATK (Genome Analysis Toolkit) supercomputer system based on tuning parameters, and belongs to the field of computers. The system comprises a plurality of CPU servers and an LSF job scheduling system; the LSF job scheduling system is in signal connection with 8 CPU servers; each CPU server comprises 2 or 4 CPU sockets; each socket has 20 cores, so that the total number of cores per server is 80 or 40; each core runs 1 thread; and the CPU servers process the calculation tasks of the LSF job scheduling system using the multithreading parameter NumThreads. The invention can be used in any supercomputing center, so that the GATK software can improve its computing capability across multiple server platforms.

Description

GATK super computer system based on tuning parameters
Technical Field
The invention belongs to the field of computers, and relates to a GATK (Genome Analysis Toolkit) supercomputer system based on tuning parameters.
Background
GATK (The Genome Analysis Toolkit) is software developed by the Broad Institute for the analysis of second-generation re-sequencing data. It contains many useful tools, focuses mainly on variant discovery and genotyping, and places strong emphasis on data quality assurance. In existing GATK workflows, however, the steps are run one after another; because the CPU occupancy of each step differs greatly, and some individual steps reach their best throughput with only a few cores, computing resources are wasted.
Disclosure of Invention
In view of the above, the present invention is directed to a GATK supercomputer system based on tuning parameters.
In order to achieve this purpose, the invention provides the following technical solution:
the GATK super computer system based on the tuning parameters comprises a plurality of CPU servers and an LSF operation scheduling system;
the LSF job scheduling system is in signal connection with 8CPU servers;
the CPU server comprises 2 or 4 slots;
the number of cores of the slot is 20; the total number of cores is 80 or 40;
the number of threads of the kernel is 1;
and the CPU server processes the calculation task of the LSF job scheduling system by using a multithreading parameter NumThreads.
Optionally, the number of the CPU servers is 8.
Optionally, the 8CPU servers include 5 servers with a model number of Intel 6248CPU total kernel of 80 cores and 3 servers with a model number of Intel 6248CPU total kernel of 40 cores.
Optionally, when the total kernel is 40, the memory is 384 GB; when the total kernel is 80, the memory is 1536 GB;
the server 384GB of the 40 kernel runs 6-7 tasks simultaneously to the limit;
the server 1536GB of the 80 core runs 17 tasks to the limit.
Optionally, the GATK supercomputer system further includes a network card provided with a 100Gbps Infiniband computing network interface.
The invention has the beneficial effects that: the invention can be used in any supercomputing center, so that the GATK software can improve its computing capability across multiple server platforms.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a system diagram of the present invention;
FIG. 2 is a script flow diagram;
FIG. 3 is a schematic diagram of thread scaling for single-sample identification.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only for the purpose of illustrating the invention and not for limiting it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
Referring to FIG. 1, the GATK supercomputer system based on tuning parameters runs the following steps. BWA-Mem: maps low-divergence sequences against a large reference genome. SortSam: sorts the input SAM or BAM and creates an index file. MarkDuplicates: examines the aligned records in the provided SAM or BAM file to locate duplicate reads. RealignerTargetCreator: emits the target regions as intervals for IndelRealigner, which realigns them locally. IndelRealigner: performs local realignment of reads to correct misalignments caused by the presence of indels. BaseRecalibrator: identifies systematic errors in the base quality scores reported by the sequencer and computes a recalibration model for adjusting those scores accordingly. PrintReads: produces the recalibrated, merged output BAM file sorted in coordinate order. HaplotypeCaller: calls germline SNPs and indels by local reassembly of haplotypes, producing a genomic VCF covering all possible genotypes at all sites.
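For orientation, the eight steps above can be summarized as an ordered pipeline. The following Python sketch lists the tools in execution order together with rough thread behaviour; the "typical threads" values are illustrative assumptions drawn from the observations later in this description (most tools saturate at a handful of cores, while BWA-Mem and HaplotypeCaller can use a whole node), not values fixed by the invention.

```python
# Illustrative summary of the eight-step pipeline described above.
# The "typical threads" column is an assumption for discussion only.
PIPELINE = [
    ("BWA-Mem",                "map reads against the reference genome",        "up to all cores"),
    ("SortSam",                "sort the SAM/BAM and create an index",          "3-8 cores"),
    ("MarkDuplicates",         "flag duplicate reads in the aligned BAM",       "3-8 cores"),
    ("RealignerTargetCreator", "define intervals for local realignment",        "3-8 cores"),
    ("IndelRealigner",         "locally realign reads around indels",           "3-8 cores"),
    ("BaseRecalibrator",       "model systematic base-quality errors",          "3-8 cores"),
    ("PrintReads",             "write the recalibrated, coordinate-sorted BAM", "3-8 cores"),
    ("HaplotypeCaller",        "call germline SNPs/indels and emit a gVCF",     "up to all cores"),
]

for tool, purpose, threads in PIPELINE:
    print(f"{tool:24s} {purpose:48s} typical threads: {threads}")
```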
Since a batch of data comprises many genomes to be processed, and a whole-genome sample passes through several steps whose CPU usage differs, scripts were designed accordingly, as shown in FIG. 2.
The first script traverses each genome in the batch and submits it to all machines in turn, running the first group of steps, i.e. the steps that can be completed with only a few threads (about 8). The second script is executed only after all of these computations are complete.
The second script again traverses each genome and submits it to all machines in turn, running the later group of steps, i.e. the steps that need about 40 threads for their calculation. Whenever a step needs many threads, it is handled by this second script. A sketch of this two-phase submission is given below.
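A minimal Python sketch of the two-phase submission, assuming jobs are handed to LSF with bsub. The wrapper scripts low_thread_steps.sh and high_thread_steps.sh, the sample names and the core counts are hypothetical placeholders, and the wait between phases is expressed with bsub's -w dependency option as one possible way of ensuring the first phase has finished.

```python
import subprocess

# Hypothetical batch of genome samples; names are placeholders.
SAMPLES = ["sample_001", "sample_002", "sample_003"]

def bsub(command, cores, job_name, depends_on=None):
    """Submit one job to LSF (-n core count, -J job name, -w dependency)."""
    args = ["bsub", "-n", str(cores), "-J", job_name]
    if depends_on:
        args += ["-w", f"done({depends_on})"]
    args.append(command)
    subprocess.run(args, check=True)

# Script 1: the steps that finish with only a few threads.
for sample in SAMPLES:
    bsub(f"bash low_thread_steps.sh {sample}", cores=8, job_name=f"prep_{sample}")

# Script 2: the heavily threaded steps, started only after the
# corresponding first-phase job for that sample has completed.
for sample in SAMPLES:
    bsub(f"bash high_thread_steps.sh {sample}", cores=40,
         job_name=f"call_{sample}", depends_on=f"prep_{sample}")
```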
1. A sample task currently consists of multiple steps; some of these steps can fully load the 80 cores of a single machine, while others can only use 3-8 cores, leaving resources idle.
2. Under the current test conditions, at most 20 whole genomes can be processed in one week, which is roughly equivalent to running the tasks on a machine with poor performance.
3. Software efficiency therefore needs to be improved so that, with the existing resources, the number of samples calculated per week is maximized. The available resources are 8 high-performance CPU servers, namely 5 servers fitted with Intel 6248 CPUs (80 cores in total each) and 3 servers fitted with Intel 6248 CPUs (40 cores in total each), giving 520 cores in total, with 1.3 TB of memory, 550 TB of parallel file storage and 50 M of bandwidth, as shown in Table 1.
There are two server types in total.
One type is fitted with Intel 6248 CPUs and has 80 cores in total; the other is fitted with Intel 6248 CPUs and has 40 cores in total.
A total of 8 servers are used, which is the whole of our resource pool. The system can be built from as few as one server, and in principle there is no upper limit on the number of machines.
Table 1 server configuration table (the table is reproduced as an image in the original publication)
LSF (Load Sharing Facility) is a distributed resource management tool used to schedule, monitor and analyze the load of networked computers. It is commercial, paid software. Through centralized monitoring and scheduling, resources such as CPUs, memory, disks and licenses are fully shared.
On a cluster or supercomputer platform, parallel computing programs generally cannot be launched directly with mpiexec or mpirun; instead, computing tasks must be submitted through the job management system provided on the platform.
As an important component of cluster system software, a cluster job management system can uniformly manage and schedule the cluster's software and hardware resources according to user requirements, ensure that user jobs share the cluster resources fairly and reasonably, and improve system utilization and throughput. Other commonly used cluster job management systems are PBS (Portable Batch System), SLURM (Simple Linux Utility for Resource Management) and SGE (Sun Grid Engine).
The whole analysis process is carried out in 8 steps. Tests on the supercomputing servers showed that the two steps of read alignment (BWA-Mem) and post-alignment variant calling (HaplotypeCaller) together took 15 hours, whereas the same calculation on the previous server took 6 hours, a large difference in computing speed.
Multidimensional stress tests covering the network, storage, server performance and job scheduling revealed no problems. Monitoring showed, however, that CPU occupancy remained low even though every step had been allocated 80 threads. It was then found that the multithreading parameter, the "NumThreads" parameter, had not been set in the software parameters: all computing tasks were running single-threaded, and the performance of the servers was not being exploited. After parameter tuning, the original calculation time of 15 hours was shortened to under 5 hours. An illustrative form of such multithreaded invocations is sketched below.
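By way of illustration, GATK 3.x exposes thread control through the -nt flag (number of data threads) and the -nct flag (number of CPU threads per data thread); which flag a tool accepts differs per tool. The sketch below shows how thread settings could be attached to a few of the pipeline steps; the jar path, file paths, Java heap size and thread counts are placeholder assumptions, not the exact settings used by the invention.

```python
import subprocess

# Base command for GATK 3.x; the jar path and Java heap size are assumptions.
GATK = ["java", "-Xmx32g", "-jar", "GenomeAnalysisTK.jar"]

def run_gatk(tool, thread_flag, threads, extra_args):
    """Run one GATK tool with an explicit thread setting."""
    cmd = GATK + ["-T", tool, thread_flag, str(threads)] + extra_args
    subprocess.run(cmd, check=True)

# -nt: data threads (e.g. RealignerTargetCreator)
run_gatk("RealignerTargetCreator", "-nt", 8,
         ["-R", "ref.fa", "-I", "sample.bam", "-o", "targets.intervals"])

# -nct: CPU threads per data thread (e.g. BaseRecalibrator, HaplotypeCaller)
run_gatk("BaseRecalibrator", "-nct", 8,
         ["-R", "ref.fa", "-I", "sample.bam",
          "-knownSites", "dbsnp.vcf", "-o", "recal.table"])
run_gatk("HaplotypeCaller", "-nct", 40,
         ["-R", "ref.fa", "-I", "sample.bam",
          "--emitRefConfidence", "GVCF", "-o", "sample.g.vcf"])
```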
Tuning of the other steps:
among 8 steps, after the optimization of partial steps, 80 cores of a single machine can be fully run, but partial steps can only use 3-8 cores, and a resource idle state exists. Research shows that some steps of software can not break through the upper limit of threads and can not perform parallel computation. The higher the number of threads, the less significant the reduction in execution time. This is due to the nature of parallel processing. Initially there are several threads processing a large data set, which shortens run time. However, it does take time to divide the run into smaller parts, issue on separate threads, and finally recombine the parts into one large file. This overhead increases as the number of threads being used increases, so at some point more threads may instead stall the total run time. According to the test situation, only 20 whole genome samples can be run at most within one week for identification, and the actual situation is basically equivalent to the state of running the tasks.
After careful study, 4 scripts were written, solving the problem that a large share of the computing resources went unused. The original script could only run all tasks on a single server and could not make use of several servers at the same time (that is, 200 samples could only be run sequentially on one server). It also could not adjust the number of threads requested according to need, which wasted server computing resources (the script could not be adapted to the different CPU-thread requirements of the different steps). Of the new scripts, the first two split the workload into two major phases according to the thread ceilings of the individual steps, and the last two submit jobs in batches to the two server types, so that the computing speed can be improved as far as possible.
As the number of threads increases, the execution time for single-sample identification decreases. The results show that, as more threads are used, the processing time of BWA-Mem and HaplotypeCaller is reduced significantly compared with the rest of the pipeline. Similar or better results can be achieved using thread- and process-level optimization. Practice has shown that running most steps with 8 or 10 cores (except BWA-Mem and HaplotypeCaller, which need to occupy all resources) works better, and with 8- or 10-core tasks running in parallel, all cores of a single server can be kept occupied.
The size of the memory must also be considered: the memory of a single server is limited, and if too many tasks run in parallel the memory may become insufficient, affecting the running of every task on that server. A rough capacity calculation is sketched below.
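The capacity limits quoted earlier can be checked with a rough calculation that takes both cores and memory into account. In the sketch below, the per-task figures (about 5 cores and 55 GB per task, with some memory held back for the operating system) are illustrative assumptions chosen only to make the arithmetic concrete; they are not values specified by the invention.

```python
# Rough check of how many pipeline tasks fit on one node before either the
# cores or the memory run out. Per-task figures are illustrative assumptions.
NODES = {
    "40-core node (384 GB)":  {"cores": 40, "mem_gb": 384},
    "80-core node (1536 GB)": {"cores": 80, "mem_gb": 1536},
}

def max_parallel_tasks(cores, mem_gb, cores_per_task=5, mem_per_task_gb=55,
                       mem_headroom_gb=32):
    """Concurrent tasks limited by whichever resource is exhausted first."""
    by_cores = cores // cores_per_task
    by_memory = (mem_gb - mem_headroom_gb) // mem_per_task_gb
    return min(by_cores, by_memory)

for name, spec in NODES.items():
    print(f"{name}: about {max_parallel_tasks(**spec)} concurrent tasks (modelled)")
```

With these assumed figures the model lands close to the limits stated above (6-7 concurrent tasks on the 40-core, 384 GB servers and about 17 on the 80-core, 1536 GB servers); in practice the limit also depends on which steps happen to be running at the same time.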
Table 2 thread scaling - single sample identification list (the table is reproduced as an image in the original publication)
Table 2 and FIG. 3 show how the execution time of each step in the sequencing pipeline varies with the number of threads. For the data set Solexa-272221, the run time of every step decreases as the number of threads increases from 1 to 36. The higher the number of threads, the less significant the further reduction in execution time. This is inherent in parallel processing: initially several threads work on a large data set, which shortens the run time, but it takes time to divide the run into smaller parts, issue them on separate threads, and finally recombine the parts into one large file. This overhead increases as the number of threads in use increases, so at some point adding threads may instead stall the total run time. The number of threads therefore cannot be chosen arbitrarily; thread scaling depends on the problem size, since the work done by each individual thread must outweigh the cost of combining and synchronizing the threads' results. By choosing the number of threads judiciously in view of the number of samples being analyzed and the number of threads available, a set of analyses can be completed in a shorter time. A toy cost model illustrating this trade-off is sketched below.
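The diminishing return described above can be illustrated with a toy cost model: a fixed serial portion, a portion that divides evenly across threads, and a per-thread overhead for splitting the work and recombining the partial results. The constants below are illustrative and are not fitted to the measurements in Table 2.

```python
# Toy model of thread scaling: serial part + parallel part / n threads
# + per-thread overhead for splitting and merging. Constants are illustrative.
def modelled_run_time(n_threads, serial_h=1.0, parallel_h=30.0, overhead_h=0.05):
    return serial_h + parallel_h / n_threads + overhead_h * n_threads

for n in (1, 2, 4, 8, 16, 36, 64):
    print(f"{n:3d} threads -> {modelled_run_time(n):5.2f} h (modelled)")
```

In this model the run time falls steeply at first and then flattens; beyond a certain thread count the overhead term dominates and the total time starts to rise again, which matches the behaviour observed in practice.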
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (5)

1. The GATK super computer system based on the tuning parameters is characterized in that: the system comprises a plurality of CPU servers and an LSF job scheduling system;
the LSF job scheduling system is in signal connection with 8 CPU servers;
each CPU server comprises 2 or 4 CPU sockets;
each socket has 20 cores, so that the total number of cores per server is 80 or 40;
each core runs 1 thread;
and the CPU servers process the calculation tasks of the LSF job scheduling system using the multithreading parameter NumThreads.
2. A tuning parameter based GATK supercomputer system according to claim 1, wherein: the number of the CPU servers is 8.
3. A tuning parameter based GATK supercomputer system according to claim 1, wherein: the 8 CPU servers comprise 5 servers fitted with Intel 6248 CPUs and 80 cores in total, and 3 servers fitted with Intel 6248 CPUs and 40 cores in total.
4. A tuning parameter based GATK supercomputer system according to claim 3, wherein: when the total core count is 40, the memory is 384 GB; when the total core count is 80, the memory is 1536 GB;
a 40-core, 384 GB server runs at most 6-7 tasks simultaneously;
an 80-core, 1536 GB server runs at most 17 tasks simultaneously.
5. A tuning parameter based GATK supercomputer system according to claim 4, characterized in that: the GATK supercomputer system further comprises a network card providing a 100 Gbps InfiniBand computing network interface.
CN202111356155.5A 2021-11-16 2021-11-16 GATK super computer system based on tuning parameters Pending CN114067917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111356155.5A CN114067917A (en) 2021-11-16 2021-11-16 GATK super computer system based on tuning parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111356155.5A CN114067917A (en) 2021-11-16 2021-11-16 GATK super computer system based on tuning parameters

Publications (1)

Publication Number Publication Date
CN114067917A true CN114067917A (en) 2022-02-18

Family

ID=80272663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111356155.5A Pending CN114067917A (en) 2021-11-16 2021-11-16 GATK super computer system based on tuning parameters

Country Status (1)

Country Link
CN (1) CN114067917A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
CN114968559B (en) * 2022-05-06 2023-12-01 苏州国科综合数据中心有限公司 LSF-based multi-host multi-GPU distributed arrangement deep learning model method

Similar Documents

Publication Publication Date Title
Chen et al. Improving MapReduce performance using smart speculative execution strategy
Boyer et al. Load balancing in a changing world: dealing with heterogeneity and performance variability
US8578381B2 (en) Apparatus, system and method for rapid resource scheduling in a compute farm
Nghiem et al. Towards efficient resource provisioning in MapReduce
US10176014B2 (en) System and method for multithreaded processing
Yang et al. Design adaptive task allocation scheduler to improve MapReduce performance in heterogeneous clouds
CN1914597A (en) Dynamic loading and unloading for processing unit
Garcia Pinto et al. A visual performance analysis framework for task‐based parallel applications running on hybrid clusters
Ino et al. Sequence homology search using fine grained cycle sharing of idle GPUs
Zhang et al. Fine-grained multi-query stream processing on integrated architectures
Jin et al. Towards low-latency batched stream processing by pre-scheduling
Maroulis et al. A holistic energy-efficient real-time scheduler for mixed stream and batch processing workloads
CN114067917A (en) GATK super computer system based on tuning parameters
Ruan et al. A comparative study of large-scale cluster workload traces via multiview analysis
Choi et al. Interference-aware co-scheduling method based on classification of application characteristics from hardware performance counter using data mining
CN108132834A (en) Method for allocating tasks and system under multi-level sharing cache memory framework
CN111176831A (en) Dynamic thread mapping optimization method and device based on multithread shared memory communication
Fu et al. Optimizing speculative execution in spark heterogeneous environments
Yu et al. Exploiting online locality and reduction parallelism for sampled dense matrix multiplication on gpus
CN1851652A (en) Method for realizing process priority-level round robin scheduling for embedded SRAM operating system
Wu et al. Paraopt: Automated application parameterization and optimization for the cloud
Simakov et al. A quantitative analysis of node sharing on hpc clusters using xdmod application kernels
CN110928659A (en) Numerical value pool system remote multi-platform access method with self-adaptive function
CN116360921A (en) Cloud platform resource optimal scheduling method and system for electric power Internet of things
CN115033389A (en) Energy-saving task resource scheduling method and device for power grid information system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination