KR20180122775A

KR20180122775A - Bio information analysis system and analysis method of the same

Info

Publication number: KR20180122775A
Application number: KR1020170056773A
Authority: KR
Inventors: 서영준; 감민재; 김유하; 김비
Original assignee: 서영준
Priority date: 2017-05-04
Filing date: 2017-05-04
Publication date: 2018-11-14

Abstract

The present invention relates to a bio information analysis system and an analysis method thereof. The bio information analysis system according to an embodiment of the present invention includes: a master that generates a disk image including at least one analysis tool for analyzing bio information data included in a bio information analysis pipeline; and one or more workers that receive the disk image generated by the master and analyze the bio information data included in the bio information analysis pipeline by running the received disk image as a container. Accordingly, the present invention can analyze bio information using the corresponding pipeline without installing additional software on the workers.

Description

TECHNICAL FIELD

The present invention relates to a bioinformation analysis system and analysis method, and more particularly, to a bioinformation analysis system and analysis method that analyze bioinformation using a bioinformation analysis pipeline.

Precision medicine draws on technology, science, and medical records to gain a new understanding of the roots of disease and to develop targeted therapies that ultimately save lives.

Clinical genomics based on next generation sequencing (NGS) is used for such precision medicine. Precision medicine can then identify a patient's genome and select an appropriate therapy, reducing the economic burden on patients and countries and improving patient care and survival.

In pursuit of precision medicine, the volume of biological data has grown exponentially with the development of high-speed analytical instruments, and computational processing is needed to analyze this data effectively. Such analysis can be performed using a pipeline for analyzing bioinformation. A pipeline is one of the hardware techniques capable of parallel processing, so that large-scale bioinformation data can be analyzed through a predetermined procedure.

To perform analysis with a pipeline, the software corresponding to the bioinformation being processed is required in order to carry out the predetermined procedure of the pipeline. The procedure to be carried out, and the kind of software used, may vary depending on the type of bioinformation processed through the pipeline.

When bioinformation is analyzed using a pipeline, the analysis can be performed in a distributed clustering environment. In that case, when a pipeline for analyzing specific bioinformation is executed on a computation node of the distributed clustering environment, the software needed by the pipeline must be installed on that node. Because the kinds of software required vary widely with the kind of bioinformation to be analyzed, a large amount of software has to be installed on every computation node of the distributed clustering environment.

Furthermore, whenever the software installed on the computation nodes of the distributed clustering environment is updated, the update has to be applied on each node individually.

Korean Registered Patent No. 10-1279392 (June 31, 2013)

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a bioinformation analysis system and analysis method that can run various kinds of analysis pipelines in a distributed clustering environment.

A bioinformation analysis system according to an embodiment of the present invention includes: a master that generates a disk image including at least one analysis tool for analyzing bioinformation data included in a bioinformation analysis pipeline; and at least one worker that receives the disk image generated by the master and runs the received disk image as a container to analyze the bioinformation data included in the bioinformation analysis pipeline.

The master may acquire the path (URL) of the bioinformation data file in order to configure the bioinformation analysis pipeline, and may receive the parameters of the pipeline as input.

The at least one worker may allocate at least one of a CPU resource, a memory, and a communication resource to the container that it runs.

In addition, when the at least one worker receives a plurality of disk images from the master, it may set priorities for the received disk images and analyze the bioinformation data included in the bioinformation analysis pipeline accordingly.

The priority is such that disk images are analyzed by the at least one worker in the order in which they were registered in the master. When an analysis of bioinformation data requires two or more bioinformation analysis pipelines to be run, registration is regarded as complete at the time the disk images for all of those pipelines have been registered.

Meanwhile, a bioinformation analysis method according to an embodiment of the present invention includes: registering, in a master, a job for analyzing bioinformation data included in a bioinformation analysis pipeline; constructing the bioinformation analysis pipeline for analysis of the bioinformation data; generating a disk image including at least one analysis tool of the bioinformation analysis pipeline; transmitting the job, in the form of the generated disk image, to at least one worker that performs bioinformation analysis; running the job received by the at least one worker as a container; and analyzing the bioinformation data included in the bioinformation analysis pipeline using the running container.

The step of constructing the bioinformation analysis pipeline may include: checking the location where the bioinformation data is stored; and inputting parameters of the bioinformation analysis pipeline.

When the disk image is generated, the method may further include the master storing the generated disk image and uploading the bioinformation data.

In addition, the method may further include checking whether there is a waiting job when the analysis of the bioinformation data is completed.

The method may further include checking the priorities of the waiting jobs if there are two or more of them.

The priority of the jobs is such that jobs are analyzed by the at least one worker in the order in which they were registered in the master. When an analysis of bioinformation data requires two or more bioinformation analysis pipelines to be run, registration in the master is regarded as complete at the time the jobs for all of those pipelines have been registered.

According to the present invention, when bioinformation is analyzed using an NGS analysis pipeline, each worker in the distributed clustering environment runs a container from an image in which the software required for the bioinformation to be analyzed is already installed. The bioinformation can therefore be analyzed with the pipeline even if no such software is installed on the worker.

In addition, because the analysis of bioinformation is performed under job scheduling, pipeline analysis can be carried out more efficiently.

FIG. 1 is a schematic diagram illustrating an analysis system in accordance with an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an analysis method according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of scheduling a job in an analysis method according to an embodiment of the present invention.
FIG. 4 is a graph comparing the execution time of the proposed pipeline of the analysis system according to an embodiment of the present invention with the execution time of the BWA standard pipeline.
FIG. 5 is a graph comparing the accuracy of the proposed pipeline of the analysis system according to an embodiment of the present invention with the accuracy of the BWA standard pipeline.
FIG. 6 is a diagram for explaining various kinds of pipelines for genome analysis.

Preferred embodiments of the present invention will be described more specifically with reference to the accompanying drawings.

The bioinformation analysis system according to the present invention can process bioinformation analysis pipelines for analyzing various kinds of bioinformation. A bioinformation analysis pipeline may include analysis tools, each of which takes bioinformation as input, analyzes it, and outputs result data, together with data conversion scripts that link the analysis tools by converting the result data output by each tool into a format suitable for the next analysis tool in order.
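The structure described above, analysis tools chained together by conversion scripts, can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: the `Stage` class, `run_pipeline`, and the toy string-based "tools" are not named in this description and only show how the output of one tool is converted before it reaches the next.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stage:
    """One pipeline step: an analysis tool plus an optional conversion script."""
    name: str
    tool: Callable[[str], str]                      # analysis tool: input data -> result data
    convert: Optional[Callable[[str], str]] = None  # reformats the result for the next tool


def run_pipeline(stages: List[Stage], data: str) -> str:
    """Run the stages in order, converting each result before passing it on."""
    for stage in stages:
        data = stage.tool(data)
        if stage.convert is not None:
            data = stage.convert(data)
    return data


# Toy stand-ins for real tools (hypothetical; they only tag the data they receive).
def align(reads: str) -> str:
    return "aligned(" + reads + ")"


def sort_records(alignment: str) -> str:
    return "sorted:" + alignment


def call_variants(alignment: str) -> str:
    return "variants[" + alignment + "]"


stages = [
    Stage("alignment", align, convert=sort_records),
    Stage("variant calling", call_variants),
]
result = run_pipeline(stages, "reads")
print(result)
```

Keeping the conversion step attached to the stage that produces the data means a tool can be swapped out together with its converter without touching the rest of the chain.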

In the present invention, the pipeline for analyzing bioinformation can be run in a distributed clustering environment, and a cloud service apparatus can therefore be used. In addition, the distributed cluster can be built as needed from low-end computers (preferably with 16 Gb or more of memory) connected to a network switch. Small-scale research institutes and individuals can thus easily build clusters for bioinformation analysis, such as genome analysis, and operate pipelines for analyzing various kinds of bioinformation such as genomes.

FIG. 1 is a schematic diagram illustrating an analysis system in accordance with an embodiment of the present invention.

Referring to FIG. 1, a bioinformation analysis system 100 according to an embodiment of the present invention includes a master 110, a storage unit 120, a first worker 130, and a second worker 140.

The master 110 may include an input unit 112, a first server 114, a second server 116, and a scheduler 118. The input unit 112 lets a user create and configure a pipeline for analyzing bioinformation such as a genome. The information entered through the input unit 112 may include the job, the share of CPU and memory to be used, the job name, and the like. The job name must not duplicate that of any other job. The input unit 112 may be a web server that the user can reach over the web, so that the user can access the web page of the input unit 112 through the network and enter the information for the job.

The first server 114 sends the result data output from a pipeline completed on the first worker 130 or the second worker 140 to the storage unit 120 so that it is stored there. To this end, the first server 114 may be an SFTP server.

The second server 116 may construct an analysis pipeline based on the job information input through the input unit 112, and may store an image of the constructed pipeline. It may be connected to one or more of the first and second workers 130 and 140 so that the stored pipeline image can be run on at least one of the first and second workers 130 and 140.

The scheduler 118 performs scheduling for the pipelines configured in the second server 116. That is, when a plurality of pipelines are configured, the scheduler 118 determines which pipeline's analysis is performed next.

The storage unit 120 may communicate with the master 110, the first worker 130, and the second worker 140 via the network, and may store data received from the master 110, the first worker 130, and the second worker 140.

The first worker 130 and the second worker 140 receive the pipeline image transmitted from the master 110 and can run the pipeline using a lightweight container technique; the bioinformation analysis is performed by running the pipeline in the container. In the present embodiment, the first worker 130 and the second worker 140 are each described as running a first container 134 and a second container 136, but each may run more containers.

The first worker 130 and the second worker 140 may each include a kernel 132, and the lightweight containers run by the first worker 130 and the second worker 140 may share the kernel 132 of the worker on which they run. In the present embodiment, CentOS may be installed on each of the first worker 130 and the second worker 140, and Ubuntu may be installed in the lightweight containers.

Although only the first worker 130 and the second worker 140 are described in the present embodiment, additional workers, such as a third worker or a fourth worker, may also be provided.

As described above, the first and second workers 130 and 140 in the present embodiment may be low-end computers (preferably with 16 Gb or more of memory) connected to a network switch, or may be high-end computers.

For example, when the image file of the pipeline is transmitted from the master 110 to the first worker 130 and executed there, the container corresponding to the pipeline image is run on the first worker 130. The bioinformation analysis can then be performed in the container according to the configuration of the pipeline.

The container created by executing the pipeline image can load the bioinformation data and analyze the bioinformation using the loaded data. The container is provided with the software needed for the bioinformation analysis, so the pipeline analysis is carried out as the container runs.
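As a rough illustration of a worker running a pipeline image as a container with a bounded share of CPU and memory, the sketch below assembles a `docker run` command line. It assumes Docker as the container runtime, which this description mentions only by way of a Dockerfile example; the function name, the image name, and the host directory are hypothetical. The command is built but not executed.

```python
from typing import List


def build_run_command(image: str, cpus: float, memory_gb: int, data_dir: str) -> List[str]:
    """Assemble a `docker run` invocation for one pipeline job.

    --cpus and --memory cap the resources given to the container, and
    -v mounts the host directory holding the input data at /data.
    """
    return [
        "docker", "run", "--rm",
        "--cpus", str(cpus),
        "--memory", f"{memory_gb}g",
        "-v", f"{data_dir}:/data",
        image,
    ]


# Hypothetical image name and data directory, for illustration only.
command = build_run_command("pipeline-job01:latest", cpus=2.0, memory_gb=8, data_dir="/srv/job01")
print(" ".join(command))
```

On a worker, a list like this could be handed to `subprocess.run(command, check=True)`; returning it as a list rather than a single string avoids shell quoting problems.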

FIG. 2 is a flowchart illustrating an analysis method according to an embodiment of the present invention. FIG. 3 is a diagram illustrating an example of job scheduling in an analysis method according to an embodiment of the present invention.

First, the analysis method will be described with reference to FIG. 2. When a job starts (S101), the user registers the job through the input unit 112 (S103). In doing so, the user sets the share of CPU and memory the job may occupy when it runs on the first worker 130 and the second worker 140, specifies whether the job is a normal job, a parent job, or a child job, and sets the job name. The job name is checked for duplicates so that a unique name is used.
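The registration step can be sketched as a small in-memory registry that validates the job type and resource shares and rejects duplicate names. The class and field names are hypothetical, and expressing the CPU and memory shares as fractions between 0 and 1 is an assumption; the description does not say how the shares are encoded.

```python
from dataclasses import dataclass
from typing import Dict, List

JOB_KINDS = ("normal", "parent", "child")


@dataclass(frozen=True)
class JobRequest:
    name: str
    kind: str            # one of JOB_KINDS
    cpu_share: float     # assumed to be a fraction in (0, 1]
    memory_share: float  # assumed to be a fraction in (0, 1]


class JobRegistry:
    """Keeps registered jobs and enforces unique job names."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobRequest] = {}

    def register(self, request: JobRequest) -> None:
        if request.kind not in JOB_KINDS:
            raise ValueError(f"unknown job kind: {request.kind}")
        for label, share in (("cpu_share", request.cpu_share),
                             ("memory_share", request.memory_share)):
            if not 0.0 < share <= 1.0:
                raise ValueError(f"{label} must be in (0, 1], got {share}")
        if request.name in self._jobs:
            raise ValueError(f"job name already in use: {request.name}")
        self._jobs[request.name] = request

    def names(self) -> List[str]:
        return list(self._jobs)


registry = JobRegistry()
registry.register(JobRequest("exome-run-01", "normal", 0.5, 0.5))
try:
    registry.register(JobRequest("exome-run-01", "parent", 0.25, 0.25))
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected, registry.names())
```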

When the job is registered in this manner, a pipeline corresponding to the job is configured (S105). In the case of an NGS (next generation sequencing) analysis pipeline, the pipeline can be divided into two stages. The first is raw data processing and cleaning: samples obtained from the human body are read with sequencing equipment, and the resulting raw data (fastq) is compared against the standard reference (fasta) and sorted. The second is variant calling and annotation: mutations are found and linked to related information. Together these steps can be called the genome analysis pipeline.

Next, the data URL identifying the location of the bioinformation data to be used by the pipeline is acquired (S107). This step only confirms the path to the data that will be processed through the pipeline; the data itself is not uploaded to the pipeline at this stage.

After the data URL is confirmed, the user inputs the parameters of the pipeline (S109). Once the parameters are input, a Dockerfile, for example, can be generated, together with a shell script for use inside the container.
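How pipeline parameters might be turned into a Dockerfile can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `ubuntu:22.04` base image tag, the package list, the `/opt/pipeline` path, and the script name `run_pipeline.sh` are all hypothetical choices, not details given in this description.

```python
from typing import List


def render_dockerfile(base_image: str, packages: List[str], script_name: str) -> str:
    """Render Dockerfile text that installs the analysis tools and runs the job script."""
    lines = [
        f"FROM {base_image}",
        "RUN apt-get update && apt-get install -y " + " ".join(packages),
        f"COPY {script_name} /opt/pipeline/{script_name}",
        f'CMD ["bash", "/opt/pipeline/{script_name}"]',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical parameters: an Ubuntu base image and two tools available as Ubuntu packages.
dockerfile_text = render_dockerfile("ubuntu:22.04", ["bwa", "samtools"], "run_pipeline.sh")
print(dockerfile_text)
```

Because the tools are installed when the image is built, a worker that runs the image needs only a container runtime, not the tools themselves.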

If the data URL is obtained and the pipeline parameters are input successfully, the job is registered (S111). If the input of the pipeline parameters fails, the flow returns to step S105.

Once the job is registered, the pipeline can be stored in the form of a disk image. To this end, a shell script is generated, the generated disk image is stored in the second server 116 (DB), and the data is uploaded according to the data URL obtained in step S107 (S113).

The generated disk image is uploaded from the master 110 to the first worker 130 and the second worker 140 (S115). If the upload fails, the process ends (S117); the disk image stored in the second server 116 is deleted, along with the shell script and other files generated in step S113.

If the upload succeeds, it is checked whether the number of jobs running on the first worker 130 and the second worker 140 is smaller than a maximum value (S119). If the number of running jobs is not smaller than the maximum value, the job of the uploaded disk image is put on hold (S121). In this case, the disk image stored in the second server 116 is deleted, along with the shell script and other files.

If the number of running jobs is smaller than the maximum value, the job in the form of the uploaded disk image is started (S123), and the job is then executed (S125).
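The check in step S119 amounts to an admission test against a maximum number of concurrently running jobs. A minimal sketch follows; the class and method names are hypothetical, and a first-come-first-served queue stands in for the priority check that the description performs in a later step.

```python
from collections import deque
from typing import Deque, List


class WorkerQueue:
    """Runs a job only while fewer than max_running jobs are active; otherwise holds it."""

    def __init__(self, max_running: int) -> None:
        self.max_running = max_running
        self.running: List[str] = []
        self.waiting: Deque[str] = deque()

    def submit(self, job: str) -> str:
        # S119: compare the number of running jobs with the maximum value.
        if len(self.running) < self.max_running:
            self.running.append(job)   # S123/S125: start and execute the job
            return "running"
        self.waiting.append(job)       # S121: put the job on hold
        return "waiting"

    def complete(self, job: str) -> None:
        # When a job finishes, start the oldest waiting job, if any.
        self.running.remove(job)
        if self.waiting:
            self.running.append(self.waiting.popleft())


queue = WorkerQueue(max_running=2)
states = [queue.submit(name) for name in ("job-a", "job-b", "job-c")]
queue.complete("job-a")
print(states, queue.running)
```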

It is then checked whether the job was executed successfully (S127). If not, a notification is issued that the job has failed (S129). The disk image stored in the second server 116 is then deleted, along with the shell script and other files generated in step S113.

If the job is executed successfully, it is checked whether any job is waiting (S131). If there is no waiting job, the success of the job is reported (S133). The disk image stored in the second server 116 is then deleted, along with the shell script and other files generated in step S113.

If there is a waiting job, the second server 116 deletes the disk image of the job that succeeded (S135), and the priorities of the waiting jobs are checked (S137). Once the priorities are confirmed, the process continues from step S123 with the job of highest priority.

In this embodiment, scheduling may be implemented to set the priority of jobs, as in step S137. Priority is assigned so that jobs registered earlier are executed first; for an analysis composed of a parent job and a child job, registration is regarded as complete only once both the parent job and the child job have been registered.

A normal job is a job in which bioinformation analysis is performed independently through a single pipeline. A parent job and a child job together form one analysis: the parent job is performed first and the child job is performed afterwards, so that a single piece of bioinformation is analyzed. In other words, a child job is the job that follows its parent job, and once a parent job has been executed, its child job must be performed immediately afterwards.

As described above, the reason why the child job is performed after the execution of the parent job is that the child job is performed using the result of the parent job.

An example will be described with reference to FIG. 3, for the case where Normal job 01, Parent job 01, Normal job 02, Parent job 02, Child job 02, and Child job 01 are input in that order through the input unit 112 of the bioinformation analysis system 100. The bioinformation analysis system 100 according to the present embodiment basically executes jobs in the order in which they are input in the distributed clustering environment, but the scheduling distinguishes normal jobs from parent and child jobs.

Normal job 01 is input first and is therefore executed first. Parent job 01 is input next, followed by Normal job 02. When Normal job 01 completes, Parent job 01 is not executed even though it is next in input order; Normal job 02 is executed first. Parent job 01 is skipped because Child job 01, which corresponds to Parent job 01, has not yet been input.

After Normal job 02 completes, Parent job 02 and Child job 02 are executed consecutively, ahead of Parent job 01. Because Child job 01 is input after Child job 02, Parent job 01 and then Child job 01 are executed once Child job 02 completes.

Parent job 01 and Child job 01 are executed last for the following reason. When a job completes, the container corresponding to the completed job is removed from the worker on which it ran, such as the first worker 130. If Parent job 01 were executed first in input order and Normal job 02 were executed next, the container corresponding to Parent job 01 could be deleted before the result of Parent job 01 had been passed to Child job 01. Parent job 01 and Child job 01 are therefore executed back to back once Child job 01 has been input, so that Child job 01 runs with the result of Parent job 01 available to it.
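The ordering rule in this example can be reproduced with a short sketch. It assumes a simplified batch model in which all jobs are already registered before the order is computed, and it links each parent to its child through a `pair` key; that key, the `Job` class, and the function name are hypothetical. A normal job is ready as soon as it is registered, while a parent and child pair counts as registered only when the later of the two arrives.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Job:
    name: str
    kind: str                   # "normal", "parent", or "child"
    pair: Optional[str] = None  # hypothetical key linking a parent job to its child job


def execution_order(jobs_in_input_order: List[Job]) -> List[str]:
    """Order jobs by the time their registration is complete.

    A parent/child pair forms one unit that is complete when its second
    half arrives; the parent then runs immediately before the child.
    """
    units: List[Tuple[int, List[str]]] = []  # (registration-complete time, names to run back to back)
    half_registered: Dict[str, Job] = {}     # pairs still missing their other half
    for time, job in enumerate(jobs_in_input_order):
        if job.kind == "normal":
            units.append((time, [job.name]))
            continue
        other = half_registered.pop(job.pair, None)
        if other is None:
            half_registered[job.pair] = job  # wait for the matching parent or child
            continue
        parent, child = (job, other) if job.kind == "parent" else (other, job)
        units.append((time, [parent.name, child.name]))
    units.sort(key=lambda unit: unit[0])
    return [name for _, names in units for name in names]


submitted = [
    Job("Normal job 01", "normal"),
    Job("Parent job 01", "parent", pair="01"),
    Job("Normal job 02", "normal"),
    Job("Parent job 02", "parent", pair="02"),
    Job("Child job 02", "child", pair="02"),
    Job("Child job 01", "child", pair="01"),
]
order = execution_order(submitted)
print(order)
```

Treating the pair as a single unit is what keeps the parent's container alive until its child has consumed the result: no other job can be scheduled between the two.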

FIG. 4 is a graph comparing the execution time of the proposed pipeline of the analysis system according to an embodiment of the present invention with the execution time of the BWA standard pipeline. FIG. 5 is a graph comparing the accuracy of the proposed pipeline of the bioinformation analysis system 100 according to an embodiment of the present invention with the accuracy of the BWA standard pipeline. FIG. 6 is a diagram for explaining various kinds of pipelines for genome analysis.

Referring to FIG. 4, comparing the analysis speed of the pipeline run on the bioinformation analysis system 100 of the present invention with that of the BWA standard pipeline shows that the proposed pipeline completes in about 25% of the time taken by the BWA standard pipeline. In other words, the analysis is about 4 times faster than with the BWA standard pipeline.
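The relationship between the two figures quoted above is simple arithmetic. Writing $T_{\mathrm{BWA}}$ for the execution time of the BWA standard pipeline and $T_{\mathrm{prop}}$ for that of the proposed pipeline (symbols introduced here for illustration only), the speedup $S$ follows from the approximate 25% ratio:

```latex
T_{\mathrm{prop}} \approx 0.25\, T_{\mathrm{BWA}}
\quad\Longrightarrow\quad
S = \frac{T_{\mathrm{BWA}}}{T_{\mathrm{prop}}} \approx \frac{1}{0.25} = 4
```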

Referring to FIG. 5, the bioinformation analysis system 100 (Naligner) according to an embodiment of the present invention has a higher true positive rate than BWA, which confirms that its accuracy is higher than that of the BWA standard pipeline.

Furthermore, by packaging a pipeline as a disk image and running it as a container, as in the embodiment of the present invention, pipelines for various kinds of genome analysis such as those shown in FIG. 6 can be operated easily in a distributed clustering environment.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. It should be understood that the scope of the present invention is to be understood as the scope of the following claims and their equivalents.

100: Bioinformation analysis system
110: Master 112: Input unit
114: First server 116: Second server
118: Scheduler
120: Storage unit 130: First worker
132: Kernel 134: First container
136: Second container
140: Second worker

Claims

1. A bioinformation analysis system comprising:
a master that generates a disk image including at least one analysis tool for analyzing bioinformation data included in a bioinformation analysis pipeline; and
at least one worker that receives the disk image generated by the master and runs the received disk image as a container to analyze the bioinformation data included in the bioinformation analysis pipeline.

2. The system according to claim 1,
wherein the master acquires a path (URL) of a bioinformation data file in order to configure the bioinformation analysis pipeline, and receives parameters of the pipeline as input.

3. The system according to claim 1,
wherein the at least one worker allocates at least one of a CPU resource, a memory, and a communication resource to the running container.

4. The system according to claim 1,
wherein, when a plurality of disk images are received from the master, the at least one worker sets priorities for the plurality of received disk images and analyzes the bioinformation data included in the bioinformation analysis pipeline.

5. The system according to claim 4,
wherein the priority is such that analysis is performed first in the at least one worker according to the order of registration in the master, and
when analysis of bioinformation data requires two or more bioinformation analysis pipelines to be run, registration is complete at the time the disk images for all of the bioinformation analysis pipelines are registered.

6. A bioinformation analysis method comprising:
registering, in a master, a job for analyzing bioinformation data included in a bioinformation analysis pipeline;
constructing the bioinformation analysis pipeline for analysis of the bioinformation data;
generating a disk image including at least one analysis tool of the bioinformation analysis pipeline;
transmitting the job, in the form of the generated disk image, to at least one worker that performs bioinformation analysis;
running the job received by the at least one worker as a container; and
analyzing the bioinformation data included in the bioinformation analysis pipeline using the running container.

7. The method according to claim 6,
wherein the step of constructing the bioinformation analysis pipeline comprises:
confirming a location where the bioinformation data is stored; and
inputting parameters of the bioinformation analysis pipeline.

8. The method according to claim 6,
further comprising, when the disk image is generated, storing the generated disk image and uploading the bioinformation data.

9. The method according to claim 6,
further comprising, when the analysis of the bioinformation data is completed, checking whether there is a waiting job.

10. The method according to claim 9,
further comprising confirming the priorities of the waiting jobs if there are two or more waiting jobs.

11. The method according to claim 10,
wherein the priority of the jobs is such that analysis is performed first in the at least one worker according to the order of registration in the master, and
when analysis of bioinformation data requires two or more bioinformation analysis pipelines to be run, registration in the master is complete at the time the jobs for all of the bioinformation analysis pipelines are registered.