CN113326123A - Biological information analysis and calculation system and method based on container technology

Info

Publication number: CN113326123A
Application number: CN202110484623.0A
Authority: CN
Inventors: 余育超; 朱晓文
Original assignee: Hangzhou Shengwu Technology Co ltd
Current assignee: Hangzhou Shengwu Technology Co ltd
Priority date: 2021-04-30
Filing date: 2021-04-30
Publication date: 2021-08-31
Anticipated expiration: 2041-04-30
Also published as: CN113326123B

Abstract

The invention relates to a biological information analysis and calculation system and method based on container technology. The system comprises: a Web interaction module, through which a user inputs the raw data requiring biological information analysis and submits an analysis task; a management module, which sends deployment instructions from a management and control node and creates analysis and calculation containers; a calculation module, comprising a plurality of analysis and calculation containers that perform analysis and calculation according to the raw data input by the user and the content of the analysis task to obtain an analysis result; a storage module, which stores the raw data input by the user and the analysis result obtained by the calculation module; and a data delivery platform, which compiles the analysis result into a report for the user. The invention can further advance the industrial standardization of gene data analysis and calculation; its pipelined analysis mode reduces the idle cost of resources such as CPU, memory, storage, and network, and improves efficiency.

Description

Biological information analysis and calculation system and method based on container technology

Technical Field

The invention belongs to the technical field of biological information analysis, and particularly relates to a system and a method for analyzing and calculating biological information based on a container technology.

Background

Biological information analysis and calculation mainly refers to processing the large amounts of raw biological data (including gene data, protein data, and the like) generated by current biological detection technologies. Computation on such big data requires dedicated servers, and the existing technical schemes fall mainly into the following two types:

(1) Setting up local servers to analyze and calculate biological information data. According to the different computing requirements, various types of servers, such as tower servers and rack servers, are purchased to build a local computing platform. This scheme suffers from high one-time investment cost, poor scalability, high daily maintenance cost, and a long payback period.

(2) Purchasing cloud servers (ECS), batch computing, and high-performance computing services from a cloud service provider to perform biological information data computing. Because this is a frontier field of the industry, the services offered by cloud service providers are poorly adapted to it, and the waste of cloud computing resources means there is no great cost advantage.

Both of these modes are small-workshop-style analysis and calculation, in which greater production investment merely builds a larger workshop. With the explosive growth of gene data, the capacity bottleneck becomes more prominent.

Furthermore, bioinformatics analysis covers many analysis types and involves a large amount of analysis software written in many development languages (including Perl, Python, Java, R, and the like). Therefore, in the prior art, as many operating systems and software packages as possible are installed on the server to meet the analysis requirements. As a result, the system software configuration of the server is complicated and hard to manage.

When a gene data analysis task runs, different software follows different development logic and places different demands on the CPU and memory of the server. The CPU-to-memory ratio of a single server can hardly suit all of these demands. In the prior art, a high-performance server is configured locally, or a high-performance cloud server is purchased in the cloud. During analysis and calculation this leaves a large amount of CPU or memory idle, so the cost of analysis and calculation is high.

Disclosure of Invention

In order to solve the above problems, the present invention provides a system and a method for analyzing and calculating biological information based on container technology, which can further advance the industrial standardization of gene data calculation, reduce the idle cost of resources such as CPU, memory, storage, and network through pipelined analysis, and improve efficiency.

The technical scheme of the invention is as follows:

a biological information analysis and calculation system based on container technology, comprising:

a Web interaction module: through which a user inputs the raw data requiring biological information analysis and submits an analysis task;

a regulation module: which splits the task into analysis subtasks of various types according to the task information submitted by the user, creates node containers for analysis and calculation, monitors and processes the data states of the tasks in real time, and updates and feeds back the task states in real time;

a calculation analysis module: comprising various types of biological information analysis and calculation containers, from which the system selects the containers corresponding to the different analysis subtasks and rapidly deploys node containers for analysis and calculation;

a data sharing module: which stores the data generated by the operation of each task node container; sharing the data reduces data transmission among the node containers, which effectively shortens task execution time and improves efficiency;

a data storage module: which stores the raw data input by the user and the analysis result data;

a data delivery platform: which compiles the analysis result into a report for the user.

Preferably, the analysis and calculation containers are created as follows: according to the different system environments required in the different analysis steps and the different software used, specific images are built and packaged using Docker technology; containers are then created from the packaged images and matched with computing resources corresponding to the packaged software and the data analysis requirements.

Preferably, the storage module includes an object storage server and a file storage server; the object storage server provides long-term storage of data and copies data out, and the file storage server receives the copied data and, with the copied data at the center, calls the various analysis and calculation containers to analyze and process it.

The invention provides a biological information analysis and calculation method based on a container technology, which comprises the following steps:

S1: a user uploads the raw data required for biological information analysis through the front-end web interface;

S2: a resident small management and control server is created to support front-end web interaction and to send management and control instructions;

S3: an analysis task is submitted on the front-end web, and the resident small management and control server issues instructions to create a plurality of analysis and calculation containers of different types from preset analysis and calculation task container images;

S4: based on the raw gene data and the requirements of the data analysis task, the various analysis and calculation containers perform pipelined analysis and calculation centered on the data on the file storage server;

S5: after the analysis and calculation are completed, the analysis result and related data are stored, and each type of analysis and calculation container follows its container life cycle: a container that still has tasks continues analyzing them, and a container with no remaining tasks is closed and its computing resources are released;

S6: on receiving the calculation completion information, the resident small management and control server also receives the result download address and displays it on the front-end web, from which the result information is downloaded according to that address and delivered to the user.
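As a non-limiting illustration, the container life cycle of step S5 can be sketched as follows. This is a minimal sketch under stated assumptions: `AnalysisContainer` and `drain_queue` are hypothetical names introduced here, and the container is a plain in-memory object standing in for a real container managed by a cloud service.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AnalysisContainer:
    """Hypothetical stand-in for one analysis and calculation container."""
    kind: str
    running: bool = True
    completed: list = field(default_factory=list)

    def run(self, task: str) -> None:
        # A real container would execute analysis software here.
        self.completed.append(task)

    def close(self) -> None:
        # Closing the container releases its computing resources (step S5).
        self.running = False


def drain_queue(container: AnalysisContainer, queue: deque) -> list:
    """Keep analyzing queued tasks, then close the container when none remain."""
    while queue:
        container.run(queue.popleft())
    container.close()
    return container.completed


tasks = deque(["sample_A", "sample_B", "sample_C"])
box = AnalysisContainer(kind="alignment")
print(drain_queue(box, tasks), box.running)
```

The container keeps taking tasks while any remain and is closed only once the queue is empty, so resources are held only while there is work to do.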

Preferably, the analysis and calculation task container images are produced as follows: the steps of the various biological information analysis processes are decomposed, analysis software with similar hardware and software resource requirements is grouped together, container images with a suitable system software environment are built according to the different configuration requirements of the various analysis tasks, and suitable computing resources are allocated to each container when it is created.

Preferably, the analysis and calculation containers in step S3 are built according to the decomposition of the analysis task, where the analysis task is decomposed into: data quality control and cleaning, reference genome alignment, post-alignment data processing and annotation, and data screening and visual display;

data quality control and cleaning: the requirements on CPU and memory are low, the software is mostly single-threaded, and the step is time-consuming;

reference genome alignment: the requirements on CPU and memory are high, and memory usage is large;

post-alignment data processing and annotation: the memory requirement is high and increases with the data volume;

data screening and visual display: the requirements on CPU and memory are low;

these steps differ in their requirements for system software environment and computing resources; the decomposition follows the system environment requirements and language requirements, after which the specific images are built and packaged, and the plan for creating the computing containers is derived from the decomposed images.
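As a non-limiting illustration, the differing resource requirements of the four steps can be written down as per-step resource profiles. This is a minimal sketch: the core counts and memory sizes are illustrative assumptions that only preserve the relative ordering stated above, the linear memory growth for annotation is likewise assumed, and `ResourceProfile` and `profile_for` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceProfile:
    cpu_cores: int    # illustrative value, not taken from the source
    memory_gib: int   # illustrative value, not taken from the source


# Relative sizing follows the text: quality control and screening are light,
# alignment is CPU- and memory-heavy, post-alignment annotation is memory-heavy.
STAGE_PROFILES = {
    "quality_control": ResourceProfile(cpu_cores=2, memory_gib=4),
    "genome_alignment": ResourceProfile(cpu_cores=16, memory_gib=64),
    "post_alignment_annotation": ResourceProfile(cpu_cores=4, memory_gib=32),
    "screening_visualization": ResourceProfile(cpu_cores=2, memory_gib=4),
}


def profile_for(stage: str, data_gib: float = 0.0) -> ResourceProfile:
    """Return the profile for a stage; annotation memory grows with data size."""
    base = STAGE_PROFILES[stage]
    if stage == "post_alignment_annotation":
        # Assumed linear growth: one extra GiB of memory per GiB of input.
        return ResourceProfile(base.cpu_cores, base.memory_gib + int(data_gib))
    return base


print(profile_for("post_alignment_annotation", data_gib=10.5))
```

Keeping the profiles in one table means each container can be sized for its own step rather than for the most demanding step of the whole workflow.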

Preferably, after the analysis and calculation are completed, the storage service of the object storage is used to store the analysis result, and the object storage sends the result download address and the account password for extracting the analysis result to the resident small management and control server.

Preferably, the storage service of the object storage is stored in a low-frequency storage mode with AES encryption or an archive storage mode.

Preferably, the step S5 further includes: the resident small management and control server uses the service of the file storage server to copy the raw data described in step S1 into the mounted file storage server.

Preferably, the pipelined analysis and calculation proceeds as follows: the file storage server holding the data is mounted on all of the created analysis and calculation containers, so that the raw data circulates within the file storage; each analysis and calculation container executes its workflow analysis task in turn and, once the task is completed, returns completion information to the resident small management and control server, after which the next analysis and calculation container takes over and continues the analysis until all analysis tasks are completed.

The invention has the beneficial effects that:

1. Only one resident small server is deployed in this scheme, so the initial investment cost is low; the various computing containers participate in computing only when computing tasks occur and are closed automatically according to their life cycle, which minimizes the cost of each use.

2. The various analysis and calculation containers created with container technology can use different CPU and memory combinations and make maximum use of computing resources. The number of containers is limited only by the upper limit set by the cloud computing service provider, so multi-sample, multi-task parallel computing is supported and time cost is minimized.

3. The various analysis containers created with container technology are managed through container image management, and their deployment time and deployment efficiency are better than those of traditional virtual machine technology and snapshot technology. Computing software with similar hardware and software requirements is integrated into the same container image. The system software environments required by the individual biological information analysis products are independent of one another, which makes management convenient and deployment rapid.

4. Container technology shares the kernel of the same operating system while isolating each application from the rest of the system, so each piece of analysis software runs relatively independently during calculation and analysis, mutual interference is avoided, calculation accuracy is ensured, and the error rate is reduced.

5. The invention uses file storage to hold the data files and database files during operation, and mounts the file storage on the various analysis containers. Keeping the data at the center, as on a production line in industrial manufacturing, frees network bandwidth, improves analysis performance, and reduces the cost of computation. The storage service of the object storage OSS is used for uploading the raw data and delivering the analysis result, which provides higher data transmission speed and data security protection.

Drawings

Fig. 1 is a block diagram of a system architecture provided in the present invention.

Fig. 2 is a schematic flow chart of the method of the present invention.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the present invention provides a biological information analysis and calculation system based on container technology, comprising:

A Web interaction module: the user inputs the raw data requiring biological information analysis and submits an analysis task.

A regulation module: according to the task information submitted by the user, it splits the task into analysis subtasks of various types, creates node containers (Docker containers) for analysis and calculation, monitors and processes the data states of the tasks in real time, and updates and feeds back the task states in real time.

A calculation analysis module: it comprises various types of biological information analysis and calculation containers; the system selects the containers corresponding to the different analysis subtasks and rapidly deploys node containers for analysis and calculation.

A data sharing module: it stores the data generated by the operation of each task node container; sharing the data reduces data transmission among the node containers, which effectively shortens task execution time and improves efficiency.

Each container in the calculation analysis module comprises computing resources, an analysis system, and the software used for operation; different CPU and memory combinations are used to make maximum use of the computing resources.
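As a non-limiting illustration, the task splitting and state feedback performed by the regulation module can be sketched as follows. This is a minimal sketch: `Regulator`, `State`, and the three state names are hypothetical, and the subtask types are borrowed from the RNA sequencing example given later in this description.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"


# Subtask types taken from the RNA sequencing example in this description.
SUBTASK_TYPES = (
    "quality_control",
    "genome_alignment",
    "post_alignment_annotation",
    "screening_visualization",
)


class Regulator:
    """Hypothetical regulation module: splits a task and tracks subtask states."""

    def __init__(self) -> None:
        self.states: dict[tuple[str, str], State] = {}

    def split(self, task_id: str) -> list[tuple[str, str]]:
        """Split one submitted task into one subtask per analysis type."""
        subtasks = [(task_id, kind) for kind in SUBTASK_TYPES]
        for sub in subtasks:
            self.states[sub] = State.PENDING
        return subtasks

    def update(self, subtask: tuple[str, str], state: State) -> None:
        """Record a state change reported by a node container."""
        self.states[subtask] = state

    def task_state(self, task_id: str) -> State:
        """Aggregate state fed back to the user for the whole task."""
        own = [s for (tid, _), s in self.states.items() if tid == task_id]
        if not own:
            raise KeyError(task_id)
        if all(s is State.DONE for s in own):
            return State.DONE
        if any(s is not State.PENDING for s in own):
            return State.RUNNING
        return State.PENDING


regulator = Regulator()
subtasks = regulator.split("task-001")
regulator.update(subtasks[0], State.DONE)
print(regulator.task_state("task-001"))
```

The aggregate state is recomputed from the subtask states on every query, so the feedback shown to the user always reflects the latest report from each node container.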

As shown in Fig. 2, the present invention also provides a biological information analysis and calculation method based on container technology, comprising the following steps:

A resident small server (ECS) is created to support front-end web interaction and the sending of management and control instructions.

After an analysis task is submitted through the front-end web, the resident server uses the elastic container computing service provided by cloud computing and issues instructions to create a plurality of corresponding analysis and calculation containers from the preset analysis and calculation task container images, and matches computing and storage resources to them according to the corresponding analysis requirements.

Using the object storage and file storage services, the file storage service is mounted on all of the created analysis and calculation containers, and the data provided by the user on the front-end web is copied into the file storage service. All of the analysis and calculation containers then start calculation and analysis; a container that completes its calculation task ends its container life cycle and transitions to the successfully-run state. If other data analysis tasks exist, the container can be scheduled to continue calculating the other data. After all calculation is finished, the result is delivered to the object storage for storage, and completion information is returned to the resident small server.

After receiving the completion information, the resident server delivers the result download information to the user through the front-end web interface.

The above processes can be implemented on the web side of a cloud computing platform or on a small locally built server, so that any device with a browser can access the system and issue analysis instructions.

As an embodiment of the present invention, the analysis and calculation task container images are formed by decomposing the various biological information analysis processes and then combining and sorting the various analysis software to build container images for the decomposed tasks. Specifically, analysis software with similar resource requirements is integrated together, and container images for the different analysis tasks are configured with suitable computing resources.
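As a non-limiting illustration, grouping software with similar requirements into shared container images can be sketched as follows. This is a minimal sketch under stated assumptions: the tool names are placeholders rather than software named in this description, the `native` runtime assigned to the alignment tool is an assumption, and using the pair of language runtime and resource class as the grouping key is one possible reading of "similar resource requirements".

```python
from collections import defaultdict

# Hypothetical tool catalog: (tool name, language runtime, resource class).
TOOLS = [
    ("qc_tool", "python", "light"),
    ("clean_tool", "python", "light"),
    ("align_tool", "native", "cpu_mem_heavy"),
    ("annotate_tool", "java", "mem_heavy"),
    ("sort_tool", "java", "mem_heavy"),
    ("filter_tool", "perl", "light"),
    ("plot_tool", "r", "light"),
]


def plan_images(tools):
    """Group tools whose runtime and resource class match into one image."""
    images = defaultdict(list)
    for name, runtime, resource_class in tools:
        images[f"{runtime}-{resource_class}"].append(name)
    return dict(images)


for image, members in plan_images(TOOLS).items():
    print(image, members)
```

With this catalog, seven tools collapse into five images, and each image needs only one language runtime and one resource profile.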

Taking the currently widely used RNA sequencing gene data analysis software as an example, the raw gene data is analyzed and calculated through the following steps: data quality control and cleaning, reference genome alignment (mapping), post-alignment data processing and annotation, and data screening and visual display.

Data quality control and cleaning: the requirements on CPU and memory are low, the software is mostly single-threaded, and the step is time-consuming. The software is mostly based on Python, Java, and the like.

Reference genome alignment (mapping): the requirements on CPU and memory are high, and memory usage is large.

Post-alignment data processing and annotation: the memory requirement is high and increases with the data volume. The software is mostly based on Python, Java, and the like.

Data screening and visual display: the requirements on CPU and memory are low. The software is mostly based on Perl, R, and the like.

These steps differ in their requirements for system software environment and computing resources. The decomposition can follow the system environment requirements and the language requirements (including Perl, Python, Java, R, and the like), after which the specific images are built and packaged. The containers are then created according to the decomposed images.

After the analysis and calculation are completed, the storage service of the object storage is used to store the analysis result, and the object storage sends the result download address and the account password for extracting the analysis result to the resident small server. In this embodiment, the storage service of the object storage uses a low-frequency storage mode or an archive storage mode with AES encryption.
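As a non-limiting illustration, the delivery information returned to the resident small server can be sketched as a small record. This is a minimal sketch: `DeliveryInfo` and `make_delivery_info` are hypothetical names, the address uses the reserved `example.invalid` domain as a placeholder, the password is generated locally for illustration only, and the AES-encrypted storage itself is left to the storage service and is not shown.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveryInfo:
    task_id: str
    download_url: str
    extraction_password: str


def make_delivery_info(task_id: str, bucket: str) -> DeliveryInfo:
    """Build the record the object storage would return to the resident server."""
    # Placeholder address on a reserved domain; a real service supplies its own.
    url = f"https://{bucket}.example.invalid/results/{task_id}.tar.gz"
    # Random one-off extraction password derived from 16 random bytes.
    password = secrets.token_urlsafe(16)
    return DeliveryInfo(task_id, url, password)


info = make_delivery_info("task-001", "results-bucket")
print(info.download_url)
```

The resident server only needs to store and display this record; the result data itself never passes through it.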

As an embodiment of the present invention, when capacity needs to be expanded, pipelined analysis and calculation is adopted, specifically: the file storage server holding the data is mounted on all of the created analysis and calculation containers, so that the raw data circulates within the file storage; each analysis and calculation container executes its workflow analysis task in turn and, once the task is completed, returns completion information to the resident small management and control server, after which the next analysis and calculation container takes over and continues the analysis until all analysis tasks are completed.
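As a non-limiting illustration, the data-centered pipeline can be sketched with a shared directory standing in for the mounted file storage. This is a minimal sketch: the three stage functions are toy placeholders rather than real bioinformatics steps, a temporary directory replaces the file storage server, and the returned list stands in for the completion information sent to the resident small management and control server.

```python
import tempfile
from pathlib import Path


# Each function stands in for one analysis and calculation container.
def quality_control(text: str) -> str:
    # Toy rule: drop reads that contain an ambiguous base "N".
    return "\n".join(line for line in text.splitlines() if "N" not in line)


def align(text: str) -> str:
    # Toy rule: tag each read with its position in the file.
    return "\n".join(f"{i}\t{read}" for i, read in enumerate(text.splitlines()))


def annotate(text: str) -> str:
    # Toy rule: append the read length as an annotation column.
    rows = []
    for row in text.splitlines():
        index, read = row.split("\t")
        rows.append(f"{index}\t{read}\tlen={len(read)}")
    return "\n".join(rows)


STAGES = [("qc", quality_control), ("align", align), ("annotate", annotate)]


def run_pipeline(shared: Path, raw_name: str) -> list[str]:
    """Run the stages in order; data stays in `shared`, only file names move on."""
    completed = []
    current = shared / raw_name
    for name, stage in STAGES:
        out = shared / f"{name}.txt"
        out.write_text(stage(current.read_text()))
        completed.append(name)  # completion information for the controller
        current = out           # the next container picks up this file
    return completed


with tempfile.TemporaryDirectory() as tmp:
    shared_dir = Path(tmp)
    (shared_dir / "raw.txt").write_text("ACGT\nANGT\nGGCA\n")
    print(run_pipeline(shared_dir, "raw.txt"))
    print((shared_dir / "annotate.txt").read_text())
```

Every stage reads from and writes to the same shared location, so no data is transferred between containers; only the completion signal and the name of the next input change hands.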

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions and not to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, anyone skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or substitute equivalents for some of their technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the present invention and are intended to be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A biological information analysis computing system based on container technology, comprising:

a data sharing module: used for storing data generated by the operation of each task node container;

2. The system of claim 1, wherein the analysis and calculation container is created by using Docker technology to build and package a specific image according to the different system environments and different software required in different analysis steps, and then matching corresponding computing resources according to the software in the packaged image and the data analysis requirements.

3. The system according to claim 1, wherein the storage module comprises an object storage server for long-term storage and data copying of data and a file storage server for receiving copied data and invoking various analysis and computation containers to analyze and process data centered on the copied data.

4. A biological information analysis and calculation method based on container technology is characterized by comprising the following steps:

S5: after the analysis and calculation are completed, storing the analysis result and related data, each type of analysis and calculation container following its container life cycle, a container that still has tasks continuing to analyze them, and a container with no remaining tasks being closed automatically and its resources released;

5. The method for analyzing and calculating the biological information based on the container technology according to claim 4, wherein the analysis and calculation task container image is produced by the following steps: decomposing the steps of the various biological information analysis processes, grouping analysis software with similar hardware and software resource requirements, building container images with a suitable system software environment according to the different configuration requirements of the various analysis tasks, and allocating suitable computing resources to each container when it is created.

6. The method for analyzing and calculating biological information based on container technology according to claim 4, wherein the analysis and calculation container in step S3 is built according to the decomposition of the analysis task, and the analysis task is decomposed into: data quality control and cleaning, reference genome alignment, post-alignment data processing and annotation, and data screening and visual display;

7. The method for analyzing and calculating the biological information based on the container technology according to claim 4, wherein after the analysis and calculation are completed, the analysis result is stored using the storage service of the object storage, and the object storage sends the result download address and the account password for extracting the analysis result to the resident small management and control server.

8. The method according to claim 4, wherein the storage service of the object storage is stored in a low-frequency storage mode with AES encryption or an archival storage mode.

9. The method for analyzing and calculating biological information based on container technology according to claim 4, wherein the step S5 further comprises: the resident small management and control server uses the service of the file storage server to copy the raw data described in step S1 into the mounted file storage server.

10. The method for analyzing and calculating biological information based on container technology according to claim 4, wherein the pipelined analysis and calculation proceeds as follows: the file storage server holding the data is mounted on all of the created analysis and calculation containers, so that the raw data circulates within the file storage; each analysis and calculation container executes its workflow analysis task in turn and, once the task is completed, returns completion information to the resident small management and control server, after which the next analysis and calculation container takes over and continues the analysis until all analysis tasks are completed.