KR20180122775A - Bio information analysis system and analysis method of the same - Google Patents
- Publication number
- KR20180122775A
- Authority
- KR
- South Korea
- Prior art keywords
- analysis
- life information
- job
- pipeline
- worker
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B99/00—Subject matter not provided for in other groups of this subclass
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
The present invention relates to a life information analysis system and an analysis method, and more particularly, to a life information analysis system and an analysis method that analyze life information by using a life information analysis pipeline.
Precision medicine uses technology, science, and medical records to gain a new understanding of the roots of disease and to develop targeted therapies that ultimately save people's lives.
Clinical genomics based on next generation sequencing (NGS) is used for such precision medicine. Precision medicine can identify a patient's genome and select appropriate therapies, thereby reducing the economic burden on patients and countries and improving patient care and survival.
In the pursuit of such precision medicine, biological data has increased exponentially with the development of high-speed analytical instruments, and computational processing is needed to analyze such biological data effectively. In this case, the analysis can be performed using a pipeline for analyzing life information. The pipeline is one of the hardware techniques capable of parallel processing, so that large-scale life information data can be analyzed through a predetermined procedure.
In order to perform the analysis using the pipeline, software corresponding to the life information processed in the pipeline is required so that the data can pass through the predetermined procedure included in the pipeline. The procedure or process to be carried out in the pipeline, and the type of software used, may vary depending on the type of life information processed through the pipeline.
When life information is analyzed using the pipeline, the analysis can be performed in a distributed clustering environment. In this case, when the pipeline for analyzing specific life information is executed in a calculation node of the distributed clustering environment, the software necessary for the analysis must be installed in that calculation node so that the analysis using the pipeline can be performed. However, because the kinds of necessary software are numerous and depend on the kind of life information to be analyzed, there is a problem in that the required software must be installed in each calculation node of the distributed clustering environment.
Further, when the software installed in the calculation nodes of the distributed clustering environment is updated, there is a problem in that the software must be updated in each of the calculation nodes.
SUMMARY OF THE INVENTION It is an object of the present invention to provide a life information analysis system and an analysis method that use pipelines in a distributed clustering environment and are capable of running various kinds of analysis pipelines.
The system for analyzing life information according to an embodiment of the present invention includes: a master for generating a disk image including at least one analysis tool for analyzing life information data included in a life information analysis pipeline; and at least one worker for receiving the disk image generated by the master and driving the received disk image as a container to analyze the life information data included in the life information analysis pipeline.
At this time, the master may acquire the path (URL) of the life information data file to configure the life information analysis pipeline, and may receive input of the parameters of the pipeline.
The at least one worker may allocate at least one of a CPU resource, a memory, and a communication resource to the container that it drives.
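The resource allocation described above can be illustrated with a minimal sketch. It assumes the container runtime is Docker (the description mentions a docker file elsewhere but does not name the runtime used by the worker) and maps the CPU, memory, and communication resources onto the `--cpus`, `--memory`, and `--network` options of `docker run`. The class name, function name, and image name are hypothetical. The function only builds the command line and does not start a container.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContainerResources:
    """Resources a worker assigns to one container (all fields optional)."""
    cpus: Optional[float] = None    # number of CPU cores, e.g. 2.0
    memory: Optional[str] = None    # memory limit, e.g. "16g"
    network: Optional[str] = None   # container network name (assumed mapping
                                    # for the "communication resource")


def build_run_command(image: str, res: ContainerResources) -> List[str]:
    """Build a `docker run` command that applies the given resource limits."""
    cmd = ["docker", "run", "--rm"]
    if res.cpus is not None:
        cmd += ["--cpus", str(res.cpus)]
    if res.memory is not None:
        cmd += ["--memory", res.memory]
    if res.network is not None:
        cmd += ["--network", res.network]
    cmd.append(image)  # the disk image received from the master
    return cmd
```

A worker could pass the returned list to a process launcher of its choice; resources that are left unset are simply not limited in this sketch.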
In addition, when the at least one worker receives a plurality of disk images from the master, the worker can analyze the life information data included in the life information analysis pipeline by setting priorities for the plurality of received disk images.
At this time, the priority is set so that analysis is performed in the at least one worker in the order of the time of registration in the master, and when two or more life information analysis pipelines must be driven to analyze the life information data, the registration is regarded as complete at the time when the disk images for all of the life information analysis pipelines are registered.
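One possible reading of this priority rule is sketched below: an analysis that needs several pipelines counts as registered only when its last disk image has been registered, and registered analyses are then processed in order of that registration time. The class and function names and the numeric timestamps are hypothetical; this is an illustration under that reading, not the scheduler of the embodiment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegisteredAnalysis:
    name: str
    pipelines_required: int                      # disk images the analysis needs
    image_times: List[float] = field(default_factory=list)

    def register_image(self, registered_at: float) -> None:
        """Record the time at which one disk image was registered in the master."""
        self.image_times.append(registered_at)

    def is_registered(self) -> bool:
        """Registration is complete once every required image is registered."""
        return len(self.image_times) >= self.pipelines_required

    def registration_time(self) -> float:
        """Completion time is the time of the last registered image."""
        return max(self.image_times)


def analysis_order(analyses: List[RegisteredAnalysis]) -> List[str]:
    """Return names of fully registered analyses, earliest registration first."""
    ready = [a for a in analyses if a.is_registered()]
    ready.sort(key=lambda a: a.registration_time())
    return [a.name for a in ready]
```

Under this reading, an analysis whose first image arrives early but whose second image arrives late is ordered by the later time, so a single-pipeline analysis registered in between runs first.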
Meanwhile, a method for analyzing life information according to an embodiment of the present invention includes: registering, in a master, a job for analyzing life information data included in a life information analysis pipeline; constructing a life information analysis pipeline for analysis of the life information data; generating a disk image including at least one analysis tool in the life information analysis pipeline; transmitting the generated job in the form of a disk image to at least one worker performing bioinformation analysis; driving, by the at least one worker, the received job as a container; and analyzing the life information data included in the life information analysis pipeline using the driven container.
At this time, the step of constructing the life information analysis pipeline includes: checking the location where the life information data is stored; and inputting parameters of the bioinformation analysis pipeline.
When the disk image is generated, the master may further store the generated disk image and upload the life information data.
In addition, the method may further include checking, when the analysis of the life information data is completed, whether there is a waiting job.
The method may further include checking the priorities of the waiting jobs if there are two or more waiting jobs.
At this time, the priority of the job is set so that analysis is performed in the at least one worker in the order of the time of registration in the master, and when two or more life information analysis pipelines must be driven to analyze the life information data, the registration in the master is regarded as complete at the time when the jobs for all of the life information analysis pipelines are registered.
According to the present invention, when life information is analyzed using the NGS analysis pipeline, each worker in the distributed clustering environment runs, as a container, an image in which the software necessary for the life information to be analyzed is installed, so that life information can be analyzed using the pipeline even if no such software is installed on the worker.
In addition, when analyzing the life information, analysis can be performed through job scheduling, and pipeline analysis can be performed more efficiently.
FIG. 1 is a schematic diagram illustrating an analysis system in accordance with an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an analysis method according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of scheduling a job in an analysis method according to an embodiment of the present invention.
FIG. 4 is a graph comparing the execution time of the proposed pipeline of the analysis system according to an embodiment of the present invention with the execution time of the BWA standard pipeline.
FIG. 5 is a graph comparing the accuracy of the proposed pipeline of the analysis system according to an embodiment of the present invention with the accuracy of the BWA standard pipeline.
FIG. 6 is a diagram for explaining various kinds of pipelines for genome analysis.
Preferred embodiments of the present invention will be described more specifically with reference to the accompanying drawings.
The bioinformation analysis system according to the present invention can process bioinformation analysis pipelines for analyzing various kinds of life information. A pipeline for bioinformation analysis can include analysis tools that take life information as input, analyze it, and output result data, and, for linking the analysis tools, data conversion scripts that convert the result data output from each analysis tool into a format suitable for the input of the next analysis tool in the sequence.
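The linking of analysis tools through data conversion steps can be sketched as follows. This is a minimal illustration in which each tool and each converter is modeled as a plain function from string to string; the type alias and function name are hypothetical, and real analysis tools would normally exchange files rather than strings.

```python
from typing import Callable, List, Optional, Tuple

# One pipeline step: an analysis tool plus an optional converter that
# reshapes the tool's output for the next tool in the sequence.
Step = Tuple[Callable[[str], str], Optional[Callable[[str], str]]]


def run_pipeline(data: str, steps: List[Step]) -> str:
    """Run each tool in order, converting its output when a converter is given."""
    for tool, converter in steps:
        data = tool(data)            # analysis tool produces result data
        if converter is not None:
            data = converter(data)   # adapt the result for the next tool
    return data
```

Keeping the converter separate from the tool means that a tool can be reused in different pipelines with different neighbors, at the cost of one extra function per link.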
In the present invention, the pipeline operation for analyzing life information can be performed in a distributed clustering environment, and thus a cloud service apparatus can be used. In addition, the distributed cluster can be configured, as needed, with low-end computers (preferably with 16 GB or more of memory) connected to a network switch. Therefore, small-scale research institutes and individuals can easily construct clusters for bioinformation analysis, such as genome analysis, and can operate pipelines for analyzing life information such as various genomes.
FIG. 1 is a schematic diagram illustrating an analysis system in accordance with an embodiment of the present invention.
Referring to FIG. 1, a life
In the present embodiment, although only the
In the present embodiment, as described above, the first and
For example, when the image file of the pipeline transmitted from the
The container formed by executing the image of the pipeline can load the life information data and analyze the life information using the loaded data. The container may be provided with the software for performing the analysis of the life information. Accordingly, the pipeline analysis is performed as the container is driven.
FIG. 2 is a flowchart illustrating an analysis method according to an embodiment of the present invention. FIG. 3 is a diagram illustrating an example of job scheduling in an analysis method according to an embodiment of the present invention.
First, the analysis method will be described with reference to FIG. When one job starts (S101), the user registers the job through the input unit 112 (S103). At this time, when the job is performed in the
When the job is registered in this manner, a pipeline corresponding to the job is configured (S105). In the case of an NGS (next generation sequencing) analysis pipeline, the pipeline can be divided into two stages. The first stage is raw data processing and cleaning, in which samples obtained from the human body are analyzed with sequencing equipment and the resulting raw data (fastq) is compared with the standard reference (fasta) and sorted. The second stage consists of variant calling and annotating, which find mutations and link related information to them. Together, these steps can be called the genomic analysis pipeline.
Then, the data URL for confirming the location of the life information data to be used in the pipeline is acquired (S107). This step only confirms the path at which the data to be processed through the pipeline is located; the data is not uploaded to the pipeline at this stage.
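Step S107 can be illustrated with a small sketch that checks a data URL without transferring any data. The function name and the set of accepted URL schemes are assumptions made for illustration; the description does not state which kinds of paths are accepted.

```python
from urllib.parse import urlparse

# Assumption: schemes accepted for life information data files.
ACCEPTED_SCHEMES = {"http", "https", "ftp", "file"}


def confirm_data_url(url: str) -> str:
    """Check that the URL names a location; no data is read or uploaded here."""
    parts = urlparse(url)
    if parts.scheme not in ACCEPTED_SCHEMES or not parts.path:
        raise ValueError(f"not a usable data URL: {url!r}")
    return url
```

The actual upload is deferred to a later step, so a job can be registered quickly even when the data file is large.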
After the data URL of the pipeline is confirmed, the parameters of the pipeline are input by the user (S109). As the parameters of the pipeline are input, for example, a Dockerfile can be created and an inner container shell can be created.
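How a Dockerfile might be produced from pipeline parameters can be sketched as follows. The function name, the directory `/opt/pipeline`, and the use of `apt-get` (which presumes a Debian-based base image that packages the listed tools) are all assumptions for illustration; only the Dockerfile instructions `FROM`, `RUN`, `COPY`, and `ENTRYPOINT` are standard.

```python
from typing import List


def render_dockerfile(base_image: str, tools: List[str], entry_script: str) -> str:
    """Render Dockerfile text for a pipeline image from user-supplied parameters."""
    lines = [f"FROM {base_image}"]
    if tools:
        # Assumption: the analysis tools are installable as distribution packages.
        lines.append("RUN apt-get update && apt-get install -y " + " ".join(tools))
    # Copy the shell script that runs inside the container and make it the entry point.
    lines.append(f"COPY {entry_script} /opt/pipeline/{entry_script}")
    lines.append(f'ENTRYPOINT ["/bin/sh", "/opt/pipeline/{entry_script}"]')
    return "\n".join(lines) + "\n"
```

For example, `render_dockerfile("ubuntu:22.04", ["bwa", "samtools"], "run_pipeline.sh")` yields a four-line Dockerfile whose image contains the listed tools and starts the script when the container is driven.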
As described above, if the data URL is obtained and the parameters of the pipeline are successfully input, the job is registered (S111). If the input of the parameters of the pipeline fails, the flow returns to step S105.
Once the job is registered, the pipeline can be stored in the form of a disk image. Accordingly, a shell script is generated, the generated disk image is stored in the second server 116 (DB), and the data is uploaded according to the data URL obtained in step S107 (S113).
The generated disk image is uploaded from the
If the upload is successful, it is checked whether the number of jobs running in the
If the number of jobs to be operated is smaller than the maximum value, a job in the form of an uploaded disk image is executed (S123). Then, the job is executed (S125).
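The check against the maximum number of running jobs can be sketched as a simple admission loop. The function name is hypothetical, and the sketch assumes that a job which cannot be started stays in the waiting list until a running job finishes; the description does not spell out that case.

```python
from typing import List


def start_waiting_jobs(waiting: List[str], running: List[str], max_jobs: int) -> List[str]:
    """Start waiting jobs, oldest first, while fewer than max_jobs are running."""
    started = []
    while waiting and len(running) < max_jobs:
        job = waiting.pop(0)   # the job that was uploaded first
        running.append(job)    # the worker now drives this job as a container
        started.append(job)
    return started
```

Calling the function again after a running job is removed from `running` starts the next waiting job, which matches the check for waiting jobs performed after each analysis completes.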
(S127). If unsuccessful, it is notified that the job has failed (S129). Then, the disk image stored in the
If the job is successfully executed, it is checked whether there is a job being pending (S131). If there is no waiting job, the success of the job is announced (S133). Then, the disk image stored in the
If there is a waiting job, the
At this time, in this embodiment, as in step S137, scheduling may be implemented to set the priority of the job. The priority is set so that the job registered first is executed first; when an analysis is composed of a parent job and a child job, its registration is regarded as complete when both the parent job and the child job have been registered.
A normal job is a job in which life information analysis is performed independently through a single pipeline. A parent job and a child job together form one analysis: the parent job is performed first and the child job is performed next to analyze one piece of life information. That is, a child job is a job performed subsequently to its parent job, and after a parent job is executed, the child job corresponding to that parent job should be performed immediately.
The child job is performed after the parent job because the child job uses the result of the parent job.
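The relationship between parent and child jobs can be illustrated with a minimal sketch in which each job is modeled as a function and the child receives the parent's result. The type alias, the function name, and the job names in the usage note are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

# (job name, parent step, optional child step that consumes the parent's result)
QueuedJob = Tuple[str, Callable[[], str], Optional[Callable[[str], str]]]


def run_queue(queue: Deque[QueuedJob]) -> List[Tuple[str, str]]:
    """Run jobs in queue order; a child runs right after its own parent."""
    log = []
    while queue:
        name, parent_step, child_step = queue.popleft()
        result = parent_step()
        log.append((name, result))
        if child_step is not None:
            # The child job runs before any other waiting job and
            # takes the parent's result as its input.
            log.append((name + "/child", child_step(result)))
    return log
```

With a queue holding a parent/child analysis followed by a normal job, the log lists the parent, then its child, then the normal job, so no other job is interleaved between a parent and its child.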
An example will be described with reference to FIG. A description will be given of a case where
According to the input order, the first
After
As described above, the reason why the
FIG. 4 is a graph comparing the execution time of the proposed pipeline of the analysis system according to an embodiment of the present invention with the execution time of the BWA standard pipeline. FIG. 5 is a graph comparing the accuracy of the proposed pipeline and the accuracy of the BWA standard pipeline in the
Referring to FIG. 4, it can be seen that the analysis speed of the pipeline using the
Referring to FIG. 5, it can be seen that the life information analysis system 100 (Naligner) according to an embodiment of the present invention has a higher true positive rate than the BWA. Therefore, it can be confirmed that the accuracy is higher than the BWA standard pipeline.
Also, as in the embodiment of the present invention, by embodying a pipeline as a disk image and running it as a container, pipelines for various types of genome analysis, as shown in FIG. 6, can easily be operated in a distributed clustering environment.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. It should be understood that the scope of the present invention is to be understood as the scope of the following claims and their equivalents.
100: Life information analysis system
110: Master
112: Input unit
114: First server
116: Second server
118: Scheduler
120: Storage unit
130: First worker
132: Kernel
134: First container
136: Second container
140: Second worker
Claims (11)
And at least one worker for receiving the disk image generated by the master and driving the received disk image as a container to analyze the life information data included in the life information analysis pipeline.
Wherein the master acquires a path (URL) of a life information data file to configure the life information analysis pipeline, and receives input of parameters of the pipeline.
Wherein the at least one worker assigns at least one of a CPU resource, a memory, and a communication resource to the driven container.
Wherein, when a plurality of disk images are received from the master, the at least one worker sets priorities for the plurality of received disk images and analyzes the life information data included in the life information analysis pipeline.
Wherein the priority is set so that analysis is performed in the at least one worker in the order of the time of registration in the master,
and, when two or more life information analysis pipelines must be driven to analyze the life information data, the registration is completed at the time when the disk images for all of the life information analysis pipelines are registered.
Constructing a life information analysis pipeline for analysis of the life information data;
Generating a disk image including at least one analysis tool in the life information analysis pipeline;
Transmitting the generated job in the form of a disk image to at least one worker performing bioinformation analysis;
Driving, by the at least one worker, the received job as a container; and
Analyzing the life information data included in the life information analysis pipeline using the driven container.
Wherein the step of constructing the life information analysis pipeline comprises:
Confirming a location where the life information data is stored; and
Inputting parameters of the life information analysis pipeline.
Further comprising the step of, when the disk image is created, storing the generated disk image and uploading the life information data.
Further comprising the step of checking, when the analysis of the life information data is completed, whether there is a waiting job.
Further comprising the step of confirming the priorities of the waiting jobs if there are two or more waiting jobs.
Wherein the priority of the job is set so that analysis is performed in the at least one worker in the order of the time of registration in the master,
and, when two or more life information analysis pipelines must be driven to analyze the life information data, the registration in the master is completed at the time when the jobs for all of the life information analysis pipelines are registered.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020170056773A KR20180122775A (en) | 2017-05-04 | 2017-05-04 | Bio information analysis system and analysis method of the same |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20180122775A (en) | 2018-11-14 |
Family
ID=64328343
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102282755B1 (en) * | 2020-11-27 | 2021-07-29 | 주식회사 비아이티 | Method, device and system for providing solution to analyze genome based on container |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E601 | Decision to refuse application |