CN113139798A - Gene sequencing process management control method and system - Google Patents


Publication number
CN113139798A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110633608.8A
Other languages
Chinese (zh)
Other versions
CN113139798B (en)
Inventor
谭光明
康宁
张春明
段勃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Western Research Institute Of China Science And Technology Computing Technology
Original Assignee
Western Research Institute Of China Science And Technology Computing Technology
Application filed by Western Research Institute Of China Science And Technology Computing Technology
Priority to CN202110633608.8A
Publication of CN113139798A
Application granted
Publication of CN113139798B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Abstract

The invention relates to the technical field of gene sequencing process management, and in particular to a gene sequencing process management control method and system. The system comprises a plurality of heterogeneous units, each of which comprises a process control module. Each process control module is used for receiving process management information, arbitrating the process management information to obtain management commands, and distributing the management commands to preset management message queues; it is also used for receiving data information, calling the management command in the corresponding management message queue according to the data information, and executing the management command. When a heterogeneous unit receives process management information, it calls its process control module to obtain the management message queues. This scheme provides programmable gene sequencing process control, realizes centralized management of the gene sequencing process, and effectively reduces the time overhead of the system in gene sequencing process control.

Description

Gene sequencing process management control method and system
Technical Field
The invention relates to the technical field of gene sequencing process management, in particular to a gene sequencing process management control method and a gene sequencing process management control system.
Background
With the rapid development of bioinformatics, genetic analysis has become a widely used technique in scientific research and industry, and has been successfully applied to species identification, disease diagnosis, and the like. Gene sequencing has become an increasingly important part of genetic research. In general, gene sequencing involves determining the nucleotide sequence of nucleic acids such as RNA or DNA fragments. Shorter gene sequences are analyzed, and the resulting sequence information is used in various bioinformatics methods to logically fit multiple fragments together, so that the sequence of a longer stretch of genetic material can be reliably determined.
Gene sequencing technology is closely tied to computer technology. The overall computer processing flow of gene sequencing can be roughly divided into six steps: BWA-MEM, Sort, Mark Duplicates, Indel Realignment, BQSR, and Variant Calling. The existing gene sequencing process is usually controlled by a CPU, and the process is fixed when the program is written, so the control of the gene sequencing process cannot be adjusted afterwards; for example, it cannot be adapted when the sequencing requirements call for storing the gene sequences produced by each processing link of each processing step. Meanwhile, because the gene sequencing process is controlled by the CPU, the corresponding computer processing is carried out in the CPU, so the CPU load is high and the CPU must switch between different gene sequencing steps, which increases the time overhead of gene sequencing.
Therefore, there is a need for a gene sequencing process management control method and system that can be programmed for gene sequencing process control and reduce the time overhead in process control.
Disclosure of Invention
The first object of the present invention is to provide a gene sequencing process management control system that provides programmable gene sequencing process control and reduces the time overhead of gene sequencing process control.
The invention provides a first basic scheme: the gene sequencing process management control system comprises a plurality of heterogeneous units, wherein each heterogeneous unit comprises a process control module. Each process control module is used for receiving process management information, arbitrating the process management information to obtain management commands, and distributing the management commands to preset management message queues; the process control module is also used for receiving data information, calling the management command in the corresponding management message queue according to the data information, and executing the management command.
The beneficial effects of the first basic scheme are as follows: the process control module is arranged in each heterogeneous unit, the gene sequencing process in each heterogeneous unit is managed through the process control module, and the centralized management of the gene sequencing process is realized through the programmable management of process management information.
The process management information comprises a plurality of management commands. The process control module arbitrates the process management information to obtain these management commands and distributes them to the management message queues for storage. When data information is received, the management command is called from the corresponding management message queue and executed, thereby realizing the processing and transmission of the data information. Meanwhile, the processing and transmission of the data information are offloaded to the heterogeneous units, so that the delay of data information control is reduced and efficient control of the gene sequencing process is realized.
By adopting the scheme, programmable gene sequencing flow control is provided through the flow control module and the flow management information, meanwhile, the centralized management of the gene sequencing flow is realized, and the time overhead of the system in the gene sequencing flow control is effectively reduced.
Furthermore, each management message queue comprises a queue number, and the process control module comprises a management message arbitration submodule, a data circulation submodule and a multi-queue submodule. The management message arbitration submodule is used for analyzing the process management information to obtain a message queue number, and for writing the management commands into the corresponding management message queue in sequence according to the message queue number and the queue number;
the data circulation submodule is used for analyzing the data information to obtain a data queue number, screening the corresponding management message queue according to the data queue number and the queue number, and calling a management command in the screened management message queue; the multi-queue submodule is used for deleting the corresponding management command from the management message queue when that management command is called.
Beneficial effects: each management message queue in the process control module has a unique queue number, and the management message arbitration submodule obtains the message queue number from the process management information, which identifies the management message queue into which the management commands are to be put. Because the order of operations executed in the gene sequencing process is fixed, the management commands are written into the management message queue in sequence, so that subsequent management commands can be called quickly.
The data information includes gene data and the queue number of the management message queue that holds the gene sequencing process to be executed on that gene data, that is, the data queue number. The data circulation submodule screens the corresponding management message queue based on the data queue number and the queue number, and calls a management command for execution, thereby completing the gene sequencing process. The multi-queue submodule deletes the corresponding management command after it is called, so that the command to be executed next in the gene sequencing process is at the head of the management message queue and can be called quickly when the next data information arrives.
Furthermore, the process control module is also used for judging, after executing a management command, whether that management command is a write-data command; if it is, the process control module waits for the next data information; otherwise, it calls the next management command in the management message queue according to the data information and executes it.
Beneficial effects: by identifying the management command, it can be judged whether the gene sequencing process, or a certain step of it, is finished. When the gene sequencing process or one of its steps is finished, data is finally written to local or remote storage; the subsequent execution step is therefore decided on the basis of this final write, thereby realizing control of the gene sequencing process.
Furthermore, the system also comprises a central processing unit, and the heterogeneous units comprise a memory computing unit and a storage computing unit. The central processing unit is used for acquiring a gene data reading request and sending the gene data reading request to the storage computing unit, and the storage computing unit is used for, when receiving the gene data reading request, taking the gene data reading request as data information and calling the corresponding process control module to acquire pre-stored gene data;
the memory computing unit, the central processing unit and the storage computing unit sequentially process the gene data;
the storage computing unit is also used for sending the gene data to the memory computing unit when receiving the processed gene data, and the central processing unit is also used for extracting the gene data from the memory computing unit when the memory computing unit finishes receiving the gene data, processing the gene data to obtain processed gene data, and sending the processed gene data to the storage computing unit; the storage computing unit is also used for compressing and storing the gene data.
Beneficial effects: the gene data are stored in the storage computing unit, and when gene sequencing needs to be performed on the gene data, the corresponding gene data are called through a gene data reading request. The gene data are taken as data information, and the corresponding operations are performed on them by the memory computing unit, the central processing unit and the storage computing unit. After the gene sequencing process or one of its steps is completed, the memory computing unit, on receiving the gene data, notifies the central processing unit that reception is complete, and the central processing unit extracts the gene data for corresponding processing.
Furthermore, when the memory computing unit, the central processing unit and the storage computing unit sequentially process the gene data,
the memory computing unit is used for, when receiving the gene data sent by the storage computing unit, taking the gene data as data information and calling the corresponding process control module to obtain processed gene data;
the central processing unit is also used for, when receiving the gene data sent by the memory computing unit, processing the gene data to obtain processed gene data;
the storage computing unit is also used for, when receiving the gene data sent by the central processing unit, taking the gene data as data information and calling the corresponding process control module to compress and store the gene data.
Beneficial effects: the gene sequencing process includes a plurality of steps, each of which has a plurality of processing links, and these processing links are executed by the memory computing unit, the central processing unit and the storage computing unit, thereby completing gene sequencing of the gene data.
The second object of the present invention is to provide a gene sequencing process management control method.
The invention provides a second basic scheme: the gene sequencing process management control method comprises the following steps:
a command management step: receiving process management information, arbitrating the process management information to obtain management commands, and distributing the management commands to preset management message queues;
a command execution step: receiving data information, calling a management command in the management message queue according to the data information, and executing the management command.
The second basic scheme has the beneficial effects that: the process management information comprises a plurality of management commands, the process management information is arbitrated to obtain the plurality of management commands, and the management commands are distributed to the management message queue for storage. And when the data information is received, calling the management command from the corresponding management message queue for execution, thereby realizing the processing and transmission of the data information.
By adopting the scheme, programmable gene sequencing flow control is provided through flow management information, centralized management of the gene sequencing flow is realized, and time overhead of the system in the gene sequencing flow control is effectively reduced.
Further, each management message queue includes a queue number, and distributing the management commands to the preset management message queues includes the following:
analyzing the process management information to obtain a message queue number, and writing the management commands into the corresponding management message queue in sequence according to the message queue number and the queue number;
calling a management command in the management message queue according to the data information includes the following:
analyzing the data information to obtain a data queue number, screening the corresponding management message queue according to the data queue number and the queue number, calling a management command in the screened management message queue, and deleting the called management command from the management message queue.
Beneficial effects: each management message queue has a unique queue number, and the message queue number obtained by analyzing the process management information identifies the management message queue into which the management commands are to be put. Because the order of operations executed in the gene sequencing process is fixed, the management commands are written into the management message queue in sequence, so that subsequent management commands can be called quickly and efficient control of the gene sequencing process is realized.
The data information includes gene data and the queue number of the management message queue that holds the gene sequencing process to be executed on that gene data, that is, the data queue number. The corresponding management message queue is screened based on the data queue number and the queue number, and a management command is called for execution so as to complete the gene sequencing process. After a management command is called, it is deleted, so that the command to be executed next in the gene sequencing process is at the head of the management message queue; the management command can then be called quickly when the next data information arrives, which reduces the time overhead of the system in gene sequencing process control.
Further, the management commands comprise executing a gene sequencing operator, reading data and writing data, and the method also includes the following:
an execution judging step: after a management command is executed, judging whether the management command is a write-data command; if it is, waiting for the next data information; otherwise, calling the command execution step again according to the data information.
Beneficial effects: the execution judging step identifies the management command and thereby determines whether the gene sequencing process, or a certain step of it, is completed. When the gene sequencing process or one of its steps is finished, data is finally written to local or remote storage; the subsequent execution step is therefore decided on the basis of this final write, thereby realizing control of the gene sequencing process.
Further, the method is applied to a memory computing unit, a central processing unit and a storage computing unit, and also includes the following:
a data processing step: the memory computing unit calls the command execution step to obtain processed data information; the central processing unit processes the data information processed by the memory computing unit; the storage computing unit calls the command execution step to compress and store the data information processed by the central processing unit.
Beneficial effects: the gene sequencing process comprises a plurality of steps, each of which comprises a plurality of processing links. The data processing step controls the memory computing unit, the central processing unit and the storage computing unit, and the management commands are acquired and executed by calling the command execution step, so that gene sequencing is completed and the processed data information is stored, where it can be called or further processed in subsequent steps. Meanwhile, the processing and transmission of the data information are offloaded to the memory computing unit, the central processing unit and the storage computing unit, so that the delay of data information control is reduced and efficient control of the gene sequencing process is realized.
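The three-stage hand-off of the data processing step can be illustrated with a minimal Python sketch. The function names, the placeholder transformations (upper-casing, stripping line breaks) and the use of zlib are assumptions made for illustration only; the text does not specify the operators or the compression algorithm.

```python
import zlib


def memory_compute_unit(gene_data: bytes) -> bytes:
    # Hypothetical stand-in for an operator executed by the memory computing unit.
    return gene_data.upper()


def central_processing_unit(gene_data: bytes) -> bytes:
    # Hypothetical stand-in for CPU-side processing: drop line breaks.
    return gene_data.replace(b"\n", b"")


def storage_compute_unit(gene_data: bytes, store: dict, key: str) -> None:
    # Compress, then store. zlib is an assumption, not the patented compression.
    store[key] = zlib.compress(gene_data)


def data_processing_step(gene_data: bytes, store: dict, key: str) -> None:
    # Memory computing unit -> central processing unit -> storage computing unit.
    stage1 = memory_compute_unit(gene_data)
    stage2 = central_processing_unit(stage1)
    storage_compute_unit(stage2, store, key)


store = {}
data_processing_step(b"acgt\nacgt\n" * 100, store, "sample-1")
restored = zlib.decompress(store["sample-1"])
```

In this sketch each unit only sees the output of the previous one, which mirrors the sequential order stated in the data processing step.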
Further, the method also comprises a gene sequencing step, which comprises the following:
S1: acquiring process management information, and calling the command management step to obtain the management message queues;
S2: acquiring a gene data reading request, taking the gene data reading request as data information, and calling the command execution step to acquire pre-stored gene data;
S3: taking the gene data as data information, and calling the data processing step to process the gene data;
S4: the memory computing unit receives the processed gene data; when the memory computing unit has finished receiving the gene data, the central processing unit extracts the gene data from the memory computing unit, takes the gene data as data information, processes the data information to obtain processed gene data, and sends the processed gene data to the storage computing unit; the storage computing unit compresses and stores the received gene data.
Beneficial effects: the process management information comprises all the steps required by the gene sequencing process, and different steps of the gene sequencing process are distributed to different management message queues for control, so that the gene data are processed in parallel and the parallel efficiency among the different processing steps of gene sequencing is effectively improved.
The gene data are pre-stored at a designated position, and when gene sequencing needs to be performed on the gene data, the corresponding gene data are called through a gene data reading request. The gene data are taken as data information, and the corresponding operations are performed on them through the data processing step. After the gene sequencing process or one of its steps is completed, the memory computing unit, on receiving the gene data, notifies the central processing unit that reception is complete; the central processing unit extracts the gene data for corresponding processing, and the processed gene data are compressed and stored, which reduces the storage space required for the gene data.
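Steps S1 to S4 can be read as one pass over a single management message queue. The sketch below is a simplified illustration under several assumptions: the dictionary layout of the process management information and of the read request, the string command names, the upper-casing "operator" and zlib compression are all hypothetical, and the memory computing unit and central processing unit hand-off of S4 is collapsed into one function.

```python
import zlib
from collections import deque

storage = {"sample-1": b"acgtacgt"}   # pre-stored gene data (hypothetical)
queues = {}                            # management message queues by queue number


def s1_command_management(process_info: dict) -> None:
    # S1: build the management message queue from process management information.
    queues[process_info["message_queue_number"]] = deque(process_info["commands"])


def s2_read(request: dict) -> bytes:
    # S2: the read request is the data information; its queue number selects the queue.
    queues[request["data_queue_number"]].popleft()   # consumes "read_local"
    return storage[request["key"]]


def s3_process(gene_data: bytes, queue_number: int) -> bytes:
    # S3: consume the next command and apply a stand-in sequencing operator.
    queues[queue_number].popleft()                   # consumes "exec_operator"
    return gene_data.upper()


def s4_store(gene_data: bytes, key: str) -> None:
    # S4 (simplified): compress and store the processed gene data.
    storage[key] = zlib.compress(gene_data)


s1_command_management({"message_queue_number": 0,
                       "commands": ["read_local", "exec_operator", "write_local"]})
raw = s2_read({"data_queue_number": 0, "key": "sample-1"})
processed = s3_process(raw, 0)
s4_store(processed, "sample-1.out")
```

After S3 only the write command remains at the head of the queue, which is the point at which the method would wait for the next data information.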
Drawings
FIG. 1 is a schematic diagram of the process control module according to the first embodiment of the present invention;
FIG. 2 is a block diagram of a second embodiment of the gene sequencing process management control system of the present invention;
FIG. 3 is a schematic diagram of the structure of a processing module of the gene sequencing process management control system according to the present invention;
FIG. 4 is a schematic diagram of the gene data field segmentation of the gene sequencing process management control system of the present invention;
FIG. 5 is a flow chart showing the compression steps of the gene sequencing process management control method of the present invention.
Detailed Description
The invention is described in further detail below by way of specific embodiments:
Example one
The gene sequencing process management control system comprises a plurality of heterogeneous units, wherein each heterogeneous unit comprises a process control module. The process control module is used for receiving process management information, arbitrating the process management information to obtain management commands, and distributing the management commands to preset management message queues; the process control module is further used for receiving data information, calling the management command in the corresponding management message queue according to the data information, and executing the management command. Specifically:
As shown in FIG. 1, the process control module includes a multi-queue submodule, a management message arbitration submodule, and a data circulation submodule. The multi-queue submodule is used for storing a plurality of management message queues and the context information of each management message queue. The context information describes the management message queue and includes a queue number and a data depth; that is, each management message queue has a queue number. Providing a plurality of management message queues supports a multi-queue message interface and allows a plurality of gene sequencing processes to be handled concurrently, thereby reducing the time overhead of the gene sequencing process.
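A minimal Python sketch of the multi-queue submodule and its per-queue context information follows. The class names are hypothetical, and "data depth" is assumed here to mean the number of commands currently held in the queue; the text does not define it further.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ManagementQueue:
    """One management message queue together with its context information."""
    queue_number: int                                  # unique queue number
    commands: deque = field(default_factory=deque)     # pending management commands

    @property
    def data_depth(self) -> int:
        # Assumption: data depth = number of commands currently queued.
        return len(self.commands)


class MultiQueueSubmodule:
    """Stores several management message queues, keyed by queue number."""

    def __init__(self, queue_numbers):
        self.queues = {n: ManagementQueue(n) for n in queue_numbers}

    def context(self, queue_number: int) -> dict:
        q = self.queues[queue_number]
        return {"queue_number": q.queue_number, "data_depth": q.data_depth}


multi_queue = MultiQueueSubmodule(queue_numbers=[0, 1, 2])
multi_queue.queues[1].commands.append("placeholder command")
```

Keeping one independent queue per sequencing process is what lets several processes advance without blocking each other in this sketch.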
The management message arbitration submodule is used for receiving process management information, which is transmitted from outside, for example from process management software. The management message arbitration submodule is also used for arbitrating the process management information to obtain management commands, analyzing the process management information to obtain a message queue number, and writing the management commands into the corresponding management message queue in sequence according to the message queue number and the queue number. A management message queue contains a plurality of management commands, and each management command comprises a message operation code and operation code parameters. The message operation code marks the operation to be executed on the data: executing a gene sequencing operator, reading or writing data locally, or reading or writing data remotely. In other words, the management commands comprise executing a gene sequencing operator, reading data and writing data, where reading data covers reading from local and reading from remote, and writing data covers writing to local and writing to remote. The operation code parameters provide the sideband information required by the operation; for example, executing a management command that reads data from local storage requires the address and size of the local data.
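The command layout and the arbitration step described above can be sketched as follows. The enum members, the dictionary layout of the process management information and the parameter names (`address`, `size`, `operator`) are assumptions for illustration; the text only states that a command has an operation code and operation code parameters.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Opcode(Enum):
    EXEC_OPERATOR = "exec_operator"   # execute a gene sequencing operator
    READ_LOCAL = "read_local"
    WRITE_LOCAL = "write_local"
    READ_REMOTE = "read_remote"
    WRITE_REMOTE = "write_remote"


@dataclass(frozen=True)
class ManagementCommand:
    opcode: Opcode    # message operation code
    params: dict      # operation code parameters (sideband information)


def arbitrate(process_info: dict, queues: dict) -> None:
    """Split process management information into commands and enqueue them in order."""
    target = queues[process_info["message_queue_number"]]
    for opcode, params in process_info["commands"]:
        target.append(ManagementCommand(opcode, params))


queues = {0: deque(), 1: deque()}
process_info = {
    "message_queue_number": 1,
    "commands": [
        (Opcode.READ_LOCAL, {"address": 0x1000, "size": 4096}),
        (Opcode.EXEC_OPERATOR, {"operator": "sort"}),
        (Opcode.WRITE_LOCAL, {"address": 0x2000}),
    ],
}
arbitrate(process_info, queues)
```

Because the commands are appended in the order given, the queue preserves the fixed operation order of the sequencing process.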
The data circulation submodule is used for receiving data information, which is either transmitted from outside or pre-stored by the system and obtained by calling. The data circulation submodule is also used for analyzing the data information to obtain a data queue number, screening the corresponding management message queue according to the data queue number and the queue number, and calling a management command in the screened management message queue. The multi-queue submodule is also used for deleting the corresponding management command from the management message queue when that management command is called. Deleting a management command once it has been called places the management command to be executed next in the gene sequencing process at the head of the management message queue, so that it can be called quickly when the next data information arrives.
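The lookup-and-delete behaviour can be sketched in a few lines of Python. The function name and the dictionary layout of the data information are hypothetical; commands are shown as plain strings to keep the example short.

```python
from collections import deque


def fetch_next_command(data_info: dict, queues: dict):
    """Select the queue named by the data and remove its head command."""
    queue = queues.get(data_info["data_queue_number"])
    if not queue:
        return None  # unknown queue number, or no pending command
    # popleft() both calls the head command and deletes it from the queue.
    return queue.popleft()


queues = {0: deque(["read_local", "exec_operator", "write_local"]), 1: deque()}
data_info = {"data_queue_number": 0, "gene_data": b"ACGT"}
first = fetch_next_command(data_info, queues)
```

After the call, the next command of the process is at the head of queue 0, so the next lookup is a constant-time operation.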
The process control module is also used for judging, after executing a management command, whether that management command is a write-data command; if it is, the process control module waits for the next data information; otherwise, it calls the next management command in the management message queue according to the data information and executes it.
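This execute-and-judge behaviour amounts to a loop that keeps consuming commands for the same data until a write command has run. The sketch below assumes string command names and a caller-supplied `execute` callback, neither of which is specified in the text.

```python
from collections import deque

WRITE_OPS = {"write_local", "write_remote"}


def run_until_write(data_info: dict, queues: dict, execute) -> list:
    """Execute queued commands for one piece of data until a write command completes."""
    queue = queues[data_info["data_queue_number"]]
    executed = []
    while queue:
        command = queue.popleft()
        execute(command, data_info)
        executed.append(command)
        if command in WRITE_OPS:
            break  # stage finished; wait for the next data information
    return executed


queues = {0: deque(["read_local", "exec_operator", "write_local",
                    "read_local", "write_remote"])}
log = []
done = run_until_write({"data_queue_number": 0}, queues,
                       lambda cmd, info: log.append(cmd))
```

The commands left in the queue after the loop belong to the next stage and are only consumed when new data information arrives.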
With this scheme, a process control module is arranged in each heterogeneous unit, the gene sequencing process in each heterogeneous unit is managed by its process control module, and centralized management of the gene sequencing process is realized through programmable process management information. The processing and transmission of the data information are offloaded to the heterogeneous units, so that the delay of data information control is reduced, the time overhead of the system in gene sequencing process control is effectively reduced, and efficient control of the gene sequencing process is realized.
In addition, the present embodiment further provides a gene sequencing process management control method, which uses the gene sequencing process management control system described above and includes the following steps:
Command management step: receiving process management information, arbitrating the process management information to obtain management commands, and distributing the management commands to preset management message queues. The command management step specifically comprises the following:
Process management information is received, which is transmitted from outside, for example from process management software.
The process management information is arbitrated to obtain management commands and analyzed to obtain a message queue number, and the management commands are written into the corresponding management message queue in sequence according to the message queue number and the queue number.
There are a plurality of preset management message queues. After the management commands are written, a management message queue comprises a plurality of management commands, and each management command comprises a message operation code and operation code parameters. The message operation code marks the operation to be executed on the data: executing a gene sequencing operator, reading or writing data locally, or reading or writing data remotely. In other words, the management commands comprise executing a gene sequencing operator, reading data and writing data, where reading data covers reading from local and reading from remote, and writing data covers writing to local and writing to remote. The operation code parameters provide the sideband information required by the operation; for example, executing a management command that reads data from local storage requires the address and size of the local data.
Command execution step: data information is received, a management command in the management message queue is called according to the data information, and the management command is executed. Context information is also preset for each management message queue; it describes the queue and comprises the queue number and the data depth, so each management message queue carries a queue number. The command execution step specifically comprises the following:
Data information is received; it is either transmitted from outside or prestored by the system and obtained by calling.
The data information is parsed to obtain a data queue number; the management message queue whose queue number matches the data queue number is selected; a management command in the selected queue is called; and the called management command is deleted from the queue.
The management command is executed. Each management command comprises a message opcode and opcode parameters. The message opcode identifies the operation to be performed on the data: executing a gene sequencing operator, reading data from or writing data to local storage, or reading data from or writing data to a remote location. In other words, the management commands comprise executing a gene sequencing operator, reading data, and writing data, where reading data covers local and remote reads and writing data covers local and remote writes. The opcode parameters provide the sideband information the operation requires; for example, a management command that reads data locally needs the address and size of the local data.
Judging step: after a management command has been executed, it is determined whether the command is a data write. If it is, the system waits for the next data information; otherwise, the command execution step is called again using the data information produced by the executed command.
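The interplay of the command execution step and the judging step can be illustrated with the following minimal Python sketch. It assumes that commands are (opcode, parameters) pairs, that the data produced by one command stays on the same queue number, and that the handler functions stand in for real operators; all of these names are hypothetical.

```python
from collections import deque

# Opcodes treated as data writes by the judging step (names are assumptions).
WRITE_OPS = {"write_local", "write_remote"}


def run_queue(queues, data_info, handlers):
    """Execute commands for one piece of data information until a write completes."""
    # Select the management message queue whose number matches the data queue number.
    queue = queues[data_info["queue_number"]]
    payload = data_info["payload"]
    executed = []
    while queue:
        # Calling a command also deletes it from the queue.
        opcode, params = queue.popleft()
        payload = handlers[opcode](payload, params)
        executed.append(opcode)
        # Judging step: after a write, stop and wait for the next data information.
        if opcode in WRITE_OPS:
            break
    return payload, executed


handlers = {
    "read_local": lambda data, p: "ACGTNACGT",           # stand-in for a local read
    "exec_operator": lambda data, p: data.replace("N", ""),  # stand-in operator
    "write_remote": lambda data, p: data,                # stand-in for a remote write
}
queues = {0: deque([
    ("read_local", {"address": 0, "size": 9}),
    ("exec_operator", {"operator": "filter"}),
    ("write_remote", {"target": "cpu"}),
    ("exec_operator", {"operator": "compress"}),
])}
result, executed = run_queue(queues, {"queue_number": 0, "payload": None}, handlers)
```

In this sketch the fourth command stays in the queue after the write, ready for the next data information that carries queue number 0.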
With this scheme, the process management information provides programmable control of the gene sequencing process, centralizes management of the process, and effectively reduces the time the system spends on process control.
Example two
The difference between the present embodiment and the first embodiment is:
the gene sequencing process management control system, as shown in fig. 2, further includes a central processing unit, and the heterogeneous unit includes a memory computing unit and a storage computing unit. The central processing unit, the memory computing unit, and the storage computing unit communicate with one another through a root complex; the communication includes transmission of gene data and exchange of management messages. Specifically, the root complex is connected to the central processing unit via the FSB, to the memory computing unit via the DIMM, and to the storage computing unit via PCIe.
The memory computing unit and the storage computing unit each include a flow control module; to distinguish them, these are referred to as the memory flow control module and the storage flow control module, respectively.
The central processing unit is used for acquiring the process management information and sending it to the heterogeneous unit. Upon receiving the process management information, the heterogeneous unit calls the corresponding flow control module to obtain the management message queue: the flow control module in the memory computing unit is called to obtain the management message queue in the memory computing unit, and the flow control module in the storage computing unit is called to obtain the management message queue in the storage computing unit.
The central processing unit is also used for acquiring a gene data read request and sending it to the storage computing unit; upon receiving the request, the storage computing unit takes it as data information and calls the corresponding flow control module to obtain the prestored gene data. Specifically, the storage computing unit further comprises an SSD module and a processing module, and gene data for gene sequencing is prestored in the SSD module. The storage flow control module parses the gene data read request to obtain a data queue number and calls the first management command in the management message queue with the corresponding queue number; at this point the management command is to read data locally, and the processing module reads the gene data from the SSD module according to the management command. The storage flow control module also deletes the management command once it has been called.
The memory computing unit, the central processing unit, and the storage computing unit process the gene data in sequence. The memory computing unit is used for, upon receiving gene data sent by the storage computing unit, taking the gene data as data information and calling the corresponding flow control module to obtain processed gene data. The central processing unit is also used for, upon receiving gene data sent by the memory computing unit, processing the gene data to obtain processed gene data. The storage computing unit is also used for, upon receiving gene data sent by the central processing unit, taking the gene data as data information and calling the corresponding flow control module to compress and store the gene data. Specifically, the memory computing unit further comprises a DDR module and a logic computing module, and the central processing unit comprises a customized operation function module. The memory flow control module is used for calling, according to the data information, the first management command in the management message queue corresponding to the queue number; at this point the management command is to execute the seed operation, and the logic computing module executes the seed operation on the gene data according to the management command to obtain processed gene data. After that management command has been executed, the memory flow control module calls the first management command in the management message queue corresponding to the queue number; at this point the management command is to execute the filter operation, and the logic computing module executes the filter operation on the gene data according to the management command to obtain processed gene data. The customized operation function module is used for executing the extend operation on gene data received from the memory computing unit to obtain processed gene data.
The storage flow control module is used for calling, according to the received gene data, the first management command in the management message queue corresponding to the queue number; at this point the management command is to execute the compression operation, and the processing module executes the compression operation on the gene data according to the management command to obtain processed gene data, the compressed gene data being in BAM format. After that management command has been executed, the storage flow control module calls the first management command in the management message queue corresponding to the queue number; at this point the management command is to execute the storage operation, and the processing module executes the storage operation according to the management command and stores the gene data in the SSD module. The seed, filter, and extend operations are stages of the gene sequencing process; the remaining operations are executed in the same way and are not described again.
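The order in which the units consume their management commands in this embodiment can be traced with a short Python sketch. The queue contents below are an assumption: the description names the read, seed, filter, extend, compress, and store operations, while the explicit "write_remote" hand-over commands are added here only to make the transfers between units visible.

```python
from collections import deque

# Commands after which a unit hands the data on and waits (assumed set).
WRITES = {"write_remote", "store"}


def drain(unit, queue, trace):
    """Run commands from one unit's queue until a write hands the data on."""
    while queue:
        command = queue.popleft()
        trace.append((unit, command))
        if command in WRITES:
            break


storage_queue = deque(["read_local", "write_remote", "compress", "store"])
memory_queue = deque(["seed", "filter", "write_remote"])

trace = []
drain("storage", storage_queue, trace)   # gene data read request arrives
drain("memory", memory_queue, trace)     # gene data arrives from the storage computing unit
trace.append(("cpu", "extend"))          # customized operation function module
drain("storage", storage_queue, trace)   # gene data arrives from the central processing unit

order = [command for _, command in trace]
```

Each unit only ever looks at the head of its own queue, so no central scheduler is needed between the stages in this sketch.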
In other embodiments, the processing module is configured to perform a compression operation on the gene data according to the management command; specifically, as shown in fig. 3, the processing module includes a field separator, an operator pool, an operator selector, an operator combiner, and a field merger.
The operator pool stores several types of compression operators in advance. In this embodiment, the compression operators comprise data conversion operators, entropy coding operators, and general coding operators: the data conversion operators include run-length coding, MTF coding, LZ77, BWT, and the like; the entropy coding operators include Huffman coding, arithmetic coding, and the like; and the general coding operators include unary coding, Rice coding, and the like. The compression operators in the operator pool are all provided as configurable hardware libraries.
The field separator is used for dividing the gene data into a plurality of data fields according to data type and, specifically, for dividing each of the N data fields into M data blocks, where N is the first level of parallelism, at the field level, and M is the second level of parallelism, at the field-algorithm level. N is determined by the complexity and richness of the gene data, and M is limited by hardware resources and compression effect.
The data types comprise name information, gene sequence information, and the quality scores corresponding to the bases in the gene sequence information; the gene sequence information stores the relative position information of the G, A, T, and C bases. The data fields generated by the division correspondingly comprise a name field, a sequence field, and a quality score field. As shown in fig. 4, the first and third rows are name fields, collectively referred to as field 1; the second row is the sequence field, referred to as field 2; and the fourth row is the quality score field, referred to as field 3.
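The field separation described here can be illustrated with the following Python sketch, which assumes plain four-line FASTQ records and contiguous, near-equal blocks. The function names separate_fields and split_blocks are hypothetical, and the sample reads are made up for the example.

```python
def separate_fields(fastq_text: str) -> dict:
    """Divide four-line FASTQ records into name, sequence, and quality fields."""
    lines = [ln for ln in fastq_text.splitlines() if ln]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have four lines each")
    fields = {"name": [], "sequence": [], "quality": []}
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        fields["name"].extend([header, plus])   # rows 1 and 3: field 1
        fields["sequence"].append(seq)          # row 2: field 2
        fields["quality"].append(qual)          # row 4: field 3
    return fields


def split_blocks(items: list, m: int) -> list:
    """Split one data field into m contiguous blocks of near-equal length."""
    size, extra = divmod(len(items), m)
    blocks, start = [], 0
    for b in range(m):
        end = start + size + (1 if b < extra else 0)
        blocks.append(items[start:end])
        start = end
    return blocks


fastq = "@r1\nGATC\n+\nIIII\n@r2\nCTAG\n+\nHHHH\n"
fields = separate_fields(fastq)                 # N = 3 data fields
blocks = split_blocks(fields["sequence"], 2)    # M = 2 data blocks per field
```

Here N corresponds to the three fields and M to the number of blocks per field, so up to N × M blocks could be handed to compression pipelines at once.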
The operator selector is used for receiving each data field and the compression requirement corresponding to each data field. The compression requirement comprises a compression rate and a compression performance, where compression performance means the performance and resource usage when compression is implemented in hardware. The operator selector is also used for selecting compression operators from the operator pool according to the compression requirement of each data field.
The operator combiner is used for combining the compression operators selected for each data field into that field's compression algorithm; each compression algorithm comprises at least one compression operator.
Different compression operators may be selected for the same data field; the operators are chosen and combined into an optimal compression algorithm based on the differences in their compression rate and compression performance. In this embodiment, taking the gene data in fig. 4 as an example, field 1 is the name field and is encoded with a general coding method; field 2 is the sequence field and is encoded with a combination of the BWT operator and the MTF operator; and field 3 is the quality score field and is encoded with a combination of a differential coding operator and a run-length coding operator.
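To make the operator combinations concrete, the following Python sketch chains textbook software versions of BWT, MTF, differential coding, and run-length coding. These are illustrative stand-ins, not the configurable hardware operators of the embodiment; the combine helper and the choice of "$" as the BWT end marker are assumptions.

```python
def bwt(s: str, end: str = "$") -> str:
    """Naive Burrows-Wheeler transform; `end` must not occur in s."""
    s += end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def mtf(s: str) -> list:
    """Move-to-front coding over the sorted alphabet of the input."""
    alphabet = sorted(set(s))   # a decoder would need this alphabet as well
    out = []
    for ch in s:
        idx = alphabet.index(ch)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))
    return out


def delta(values: list) -> list:
    """Differential coding: first value, then successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def rle(values: list) -> list:
    """Run-length coding as (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def combine(*operators):
    """Combine operators into one algorithm applied left to right."""
    def algorithm(data):
        for op in operators:
            data = op(data)
        return data
    return algorithm


sequence_algorithm = combine(bwt, mtf)                                   # field 2
quality_algorithm = combine(lambda q: [ord(c) for c in q], delta, rle)   # field 3
quality_result = quality_algorithm("IIII")
```

The quality example shows why the pairing helps: a run of identical scores becomes one literal followed by a single long run of zeros.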
The field merger is used for compressing each data field with its combined compression algorithm and merging the compression results of the data fields. The results are merged by storing the compression result of each data field, in a specific format, in the same file.
When the compression results are merged, the combination of compression operators in the compression algorithm selected for each data field is recorded in the file header, so that the corresponding operators can be called during decompression.
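A possible container layout for the merged file is sketched below in Python. The specification only says that a "specific format" is used, so the layout here (a 4-byte header length, a JSON header, then the payloads) is an assumption, and zlib and bz2 from the Python standard library stand in for the hardware compression operators.

```python
import bz2
import json
import struct
import zlib

# Stand-in operator pool: name -> (compress, decompress).
OPERATORS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}


def merge_fields(fields: dict, choices: dict) -> bytes:
    """Compress each field with its operator chain and merge into one blob."""
    header, payloads = [], []
    for name, raw in fields.items():
        data = raw
        for op in choices[name]:
            data = OPERATORS[op][0](data)
        # Record the operator combination per field in the file header.
        header.append({"field": name, "operators": choices[name], "length": len(data)})
        payloads.append(data)
    head = json.dumps(header).encode("utf-8")
    return struct.pack(">I", len(head)) + head + b"".join(payloads)


def restore_fields(blob: bytes) -> dict:
    """Read the header and undo each field's operators in reverse order."""
    (head_len,) = struct.unpack(">I", blob[:4])
    header = json.loads(blob[4:4 + head_len].decode("utf-8"))
    offset, fields = 4 + head_len, {}
    for entry in header:
        data = blob[offset:offset + entry["length"]]
        offset += entry["length"]
        for op in reversed(entry["operators"]):
            data = OPERATORS[op][1](data)
        fields[entry["field"]] = data
    return fields


fields = {"name": b"@r1\n+\n", "sequence": b"GATCGATC", "quality": b"IIIIHHHH"}
choices = {"name": ["zlib"], "sequence": ["bz2", "zlib"], "quality": ["zlib"]}
blob = merge_fields(fields, choices)
restored = restore_fields(blob)
```

Because the header names the operators per field, the decompressor needs no out-of-band knowledge of which combination was selected.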
The storage computing unit is also used for sending the gene data to the memory computing unit when receiving the processed gene data; the central processing unit is also used for, once the memory computing unit has finished receiving the gene data, extracting the gene data from the memory computing unit, processing it to obtain processed gene data, and sending the processed gene data to the storage computing unit; and the storage computing unit is also used for compressing and storing the gene data.
With this scheme, the heterogeneous unit distributes different steps of the gene sequencing process to different flow control modules for control, so that gene data is processed in parallel. This effectively improves the parallel efficiency among the different processing steps of gene sequencing and the performance of the sequencing process.
In addition, this embodiment further provides a gene sequencing process management control method that uses the above gene sequencing process management control system, which includes a memory computing unit, a central processing unit, and a storage computing unit. The method further includes the following steps:
Data processing step: the memory computing unit calls the command execution step to obtain processed data information; the central processing unit processes the data information processed by the memory computing unit; and the storage computing unit calls the command execution step to compress and store the data information processed by the central processing unit.
In other embodiments, the compression step, as shown in fig. 5, may include the following:
Acquisition step: acquiring gene data and the compression requirement for that gene data. The compression requirement is the compression rate and compression performance after the user has weighed one against the other; compression performance means the performance and resource usage when compression is implemented in hardware.
Field separation step: dividing the gene data into a plurality of data fields according to data type and, specifically, dividing each of the N data fields into M data blocks, where N is the first level of parallelism, at the field level, and M is the second level of parallelism, at the field-algorithm level. N is determined by the complexity and richness of the gene data, and M is limited by hardware resources and compression effect.
The data types comprise name information, gene sequence information, and quality score information corresponding to the bases in the gene sequence information; the gene sequence information stores the relative position information of the G, A, T, and C bases. The data fields generated by the division correspondingly comprise a name field, a sequence field, and a quality score field. Fig. 4 shows part of the data in a FASTQ gene data file: the first and third rows are name fields, collectively referred to as field 1; the second row is the sequence field, referred to as field 2; and the fourth row is the quality score field, referred to as field 3.
Operator selection and combination step: according to the compression requirement of each data field, selecting the corresponding compression operators from the preset compression operators and combining them into the compression algorithm for that data field; each compression algorithm comprises at least one compression operator.
The preset compression operators comprise data conversion operators, entropy coding operators, and general coding operators: the data conversion operators include run-length coding, MTF coding, LZ77, BWT, and the like; the entropy coding operators include Huffman coding, arithmetic coding, and the like; and the general coding operators include unary coding, Rice coding, and the like. The preset compression operators are stored by category in an operator pool, and compression operators are selected from the operator pool according to the compression requirement of each data field. The operator pool also records the compression rate and compression performance of each compression operator in a list.
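One way such an operator pool and a requirement-driven selection could look is sketched below in Python. The scores are placeholder values invented for the example, not measurements, and the weighted-sum rule in select_operator is an assumption; the specification does not state how the compression rate is traded against compression performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolEntry:
    name: str
    category: str        # "conversion", "entropy", or "general"
    ratio_score: float   # placeholder: higher means better compression rate
    speed_score: float   # placeholder: higher means cheaper or faster in hardware


# Illustrative list only; every number here is made up for the sketch.
POOL = [
    PoolEntry("RLE", "conversion", 0.3, 0.9),
    PoolEntry("MTF", "conversion", 0.4, 0.8),
    PoolEntry("LZ77", "conversion", 0.7, 0.5),
    PoolEntry("BWT", "conversion", 0.8, 0.3),
    PoolEntry("Huffman", "entropy", 0.6, 0.7),
    PoolEntry("Arithmetic", "entropy", 0.8, 0.4),
    PoolEntry("Unary", "general", 0.2, 0.9),
    PoolEntry("Rice", "general", 0.5, 0.8),
]


def select_operator(pool, category: str, ratio_weight: float) -> PoolEntry:
    """Pick the operator of a category with the best weighted score.

    ratio_weight = 1.0 favours compression rate only; 0.0 favours performance only.
    """
    candidates = [e for e in pool if e.category == category]
    if not candidates:
        raise ValueError(f"no operator in category {category!r}")
    return max(
        candidates,
        key=lambda e: ratio_weight * e.ratio_score + (1 - ratio_weight) * e.speed_score,
    )


best_ratio = select_operator(POOL, "conversion", ratio_weight=1.0)
best_speed = select_operator(POOL, "conversion", ratio_weight=0.0)
balanced = select_operator(POOL, "entropy", ratio_weight=0.5)
```

Keeping the scores in the pool itself means a new operator can be added without touching the selection logic.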
Different compression operators may be selected for the same data field; the operators are chosen and combined into an optimal compression algorithm based on the differences in their compression rate and compression performance. In this embodiment, taking the gene data in fig. 4 as an example, field 1 is the name field and is encoded with a general coding method; field 2 is the sequence field and is encoded with a combination of the BWT operator and the MTF operator; and field 3 is the quality score field and is encoded with a combination of a differential coding operator and a run-length coding operator.
The operator selection and combination step further comprises the following steps:
S101: arranging the compression operators of a compression algorithm in parallel as M identical compression pipelines; each data field is allocated M identical compression pipelines; a compression pipeline comprises several compression algorithms, each formed by combining several compression operators.
S102: obtaining the first parallelism $K_N$ of the compression operators in the compression pipeline and, from the first parallelism $K_N$, obtaining the second parallelism $M \times K_N$ of the N-th data field;
S103: according to the second parallelism $M \times K_N$ of each data field, analyzing the time at which each data field completes compression, and recording the synchronization rate of completion;
S104: judging whether the synchronization rate meets a set value; if not, adjusting the combination of compression operators or compression algorithms in the compression pipeline to obtain an updated first parallelism $K_N'$ of the compression pipeline and an updated second parallelism $M \times K_N'$ of each data field;
S105: repeating step S103 and step S104 until the synchronization rate meets the set value.
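The loop over steps S103 to S105 can be illustrated with a small Python model. The specification does not define how completion time or the synchronization rate is computed, so this sketch assumes that a field's completion time is its size divided by its second parallelism M × K_N, that the synchronization rate is the ratio of the shortest to the longest completion time, and that the adjustment in S104 raises the first parallelism of the slowest field by one. All function names are hypothetical.

```python
def completion_times(sizes, m, k):
    """Assumed model: time of field n = size_n / (M * K_n)."""
    return [size / (m * k_n) for size, k_n in zip(sizes, k)]


def sync_rate(times):
    """Assumed definition: shortest completion time over longest (1.0 = in step)."""
    return min(times) / max(times)


def balance(sizes, m, k, target, max_rounds=100):
    """Repeat S103 and S104 until the synchronization rate meets the set value."""
    k = list(k)
    for _ in range(max_rounds):
        times = completion_times(sizes, m, k)      # S103
        if sync_rate(times) >= target:             # S104: set value met
            break
        slowest = times.index(max(times))
        k[slowest] += 1                            # S104: adjust the slowest field
    return k, sync_rate(completion_times(sizes, m, k))


# Three data fields of different sizes (arbitrary units), M = 4 pipelines each.
k_final, rate = balance(sizes=[100, 400, 400], m=4, k=[1, 1, 1], target=0.8)
```

Under these assumptions the two larger fields end up with four times the operator parallelism of the small one, so all three fields finish compressing at the same time.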
Field compression step: compressing each data field with its combined compression algorithm to obtain the compression result of each data field.
Field merging step: merging the compression results of the data fields. Specifically, the compression results of the data fields are stored, in a specific format, in the same file. When the compression results are merged, the combination of compression operators in the compression algorithm selected for each data field is recorded in the file header, so that the corresponding operators can be called during decompression.
Compression performance analysis step: analyzing the compression performance of the gene data according to the first parallelism $K_N$ and the second parallelism $M \times K_N$. This specifically comprises:
obtaining $\min(K_N)$ from the first parallelism $K_N$ of each data field;
obtaining the third parallelism $M \times N \times \min(K_N)$ of the gene data from the second parallelism $M \times K_N$ of each data field;
analyzing the compression performance of the gene data according to the third parallelism $M \times N \times \min(K_N)$.
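As a worked illustration of these quantities, the following Python sketch computes the second parallelism of each data field and the third parallelism of the gene data. The function name and the sample values of M and K_N are assumptions chosen for the example.

```python
def parallelism(m: int, k: list) -> tuple:
    """Return (second parallelism per field, third parallelism of the gene data)."""
    n = len(k)                          # N data fields
    second = [m * k_n for k_n in k]     # M x K_N for each field
    third = m * n * min(k)              # M x N x Min(K_N)
    return second, third


# Example: M = 4 pipelines per field, N = 3 fields with K_N = 2, 3 and 5.
second, third = parallelism(4, [2, 3, 5])
```

The third parallelism is bounded by the smallest K_N because the field with the least operator parallelism limits how evenly the whole file can be processed.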
The gene sequencing step comprises the following steps:
S1: acquiring process management information, and calling the command management step to obtain the management message queue;
S2: acquiring a gene data read request, taking the gene data read request as data information, and calling the command execution step to obtain the prestored gene data;
S3: taking the gene data as data information, and calling the data processing step to process the gene data;
S4: the memory computing unit receives the processed gene data; once the memory computing unit has received the gene data, the central processing unit extracts the gene data from the memory computing unit, takes it as data information, processes the data information to obtain processed gene data, and sends the processed gene data to the storage computing unit; and the storage computing unit compresses and stores the received gene data.
With this scheme, the processing and transmission of data information are offloaded to the memory computing unit, the central processing unit, and the storage computing unit, which reduces the latency of data information control and achieves efficient control of the gene sequencing process. At the same time, different steps of the gene sequencing process are distributed to different management message queues for control, so that gene data is processed in parallel and the parallel efficiency among the different processing steps of gene sequencing is effectively improved.
The foregoing is merely an embodiment of the present invention. Common general knowledge, such as well-known specific structures and characteristics, is not described here in detail. A person skilled in the art knows all of the common technical knowledge in the field as of the filing date or the priority date, has access to all of the prior art in the field, and is able to apply routine experimentation as of that date; such a person can, in light of the teaching of this application and using their own abilities, complete and implement the present scheme, and certain typical well-known structures or methods should not be an obstacle to implementing this application. It should be noted that a person skilled in the art can make several changes and improvements without departing from the structure of the present invention; these also fall within the protection scope of the present invention and do not affect the effect of implementing the invention or the practicability of the patent. The scope of protection claimed by this application is determined by the content of the claims, and the description of the embodiments in the specification may be used to interpret the content of the claims.

Claims (10)

1. A gene sequencing process management control system, characterized by comprising a heterogeneous unit, wherein the heterogeneous unit comprises a flow control module; the flow control module is used for receiving flow management information, arbitrating the flow management information to obtain a management command, and distributing the management command to a preset management message queue; and the flow control module is also used for receiving data information, calling the management command in the corresponding management message queue according to the data information, and executing the management command.
2. The gene sequencing process management control system of claim 1, wherein: the management message queue comprises a queue number; the flow control module comprises a management message arbitration submodule, a data stream conversion submodule, and a multi-queue submodule; the management message arbitration submodule is used for parsing the flow management information to obtain a message queue number, and writing the management commands in order into the corresponding management message queue according to the message queue number and the queue number;
the data stream conversion submodule is used for parsing the data information to obtain a data queue number, selecting the corresponding management message queue according to the data queue number and the queue number, and calling a management command in the selected management message queue; and the multi-queue submodule is used for deleting the corresponding management command from the management message queue when a management command in the management message queue is called.
3. The gene sequencing process management control system of claim 1, wherein: the management commands comprise executing a gene sequencing operator, reading data, and writing data; the flow control module is also used for judging, after a management command has been executed, whether the management command is a data write, waiting for the next data information when the management command is a data write, and otherwise calling a management command in the management message queue according to the data information produced by the executed management command and executing that management command.
4. The gene sequencing process management control system according to any one of claims 1 to 3, wherein: the heterogeneous unit comprises a memory computing unit and a storage computing unit; the central processing unit is used for acquiring a gene data read request and sending the gene data read request to the storage computing unit; and the storage computing unit is used for, upon receiving the gene data read request, taking the gene data read request as data information and calling the corresponding flow control module to obtain prestored gene data;
the memory computing unit, the central processing unit, and the storage computing unit process the gene data in sequence;
the storage computing unit is also used for sending the gene data to the memory computing unit when receiving the processed gene data; the central processing unit is also used for, once the memory computing unit has finished receiving the gene data, extracting the gene data from the memory computing unit, processing it to obtain processed gene data, and sending the processed gene data to the storage computing unit; and the storage computing unit is also used for compressing and storing the gene data.
5. The gene sequencing process management control system of claim 4, wherein: when the memory computing unit, the central processing unit, and the storage computing unit process the gene data in sequence,
the memory computing unit is used for, upon receiving gene data sent by the storage computing unit, taking the gene data as data information and calling the corresponding flow control module to obtain processed gene data;
the central processing unit is also used for, upon receiving gene data sent by the memory computing unit, processing the gene data to obtain processed gene data;
the storage computing unit is also used for, upon receiving gene data sent by the central processing unit, taking the gene data as data information and calling the corresponding flow control module to compress and store the gene data.
6. A gene sequencing process management control method, characterized by comprising the following steps:
command management step: receiving flow management information, arbitrating the flow management information to obtain a management command, and distributing the management command to a preset management message queue;
command execution step: receiving data information, calling a management command in the management message queue according to the data information, and executing the management command.
7. The gene sequencing process management control method of claim 6, wherein: the management message queue comprises a queue number, and distributing the management command to a preset management message queue comprises:
parsing the flow management information to obtain a message queue number, and writing the management commands in order into the corresponding management message queue according to the message queue number and the queue number;
calling a management command in the management message queue according to the data information comprises:
parsing the data information to obtain a data queue number, selecting the corresponding management message queue according to the data queue number and the queue number, calling a management command in the selected management message queue, and deleting the corresponding management command from the management message queue.
8. The gene sequencing process management control method of claim 6, wherein: the management commands comprise executing a gene sequencing operator, reading data, and writing data, and the method further comprises:
judging step: after a management command has been executed, judging whether the management command is a data write; waiting for the next data information when the management command is a data write; and otherwise calling the command execution step according to the data information produced by the executed management command.
9. The gene sequencing process management control method of any one of claims 6 to 8, using a memory computing unit, a central processing unit, and a storage computing unit, and further comprising:
data processing step: the memory computing unit calls the command execution step to obtain processed data information; the central processing unit processes the data information processed by the memory computing unit; and the storage computing unit calls the command execution step to compress and store the data information processed by the central processing unit.
10. The gene sequencing process management control method of claim 9, wherein: the method further comprises a gene sequencing step, and the gene sequencing step comprises:
S1: acquiring process management information, and calling the command management step to obtain the management message queue;
S2: acquiring a gene data read request, taking the gene data read request as data information, and calling the command execution step to obtain the prestored gene data;
S3: taking the gene data as data information, and calling the data processing step to process the gene data;
S4: the memory computing unit receives the processed gene data; once the memory computing unit has received the gene data, the central processing unit extracts the gene data from the memory computing unit, takes it as data information, processes the data information to obtain processed gene data, and sends the processed gene data to the storage computing unit; and the storage computing unit compresses and stores the received gene data.
CN202110633608.8A 2021-06-07 2021-06-07 Gene sequencing flow management control method and system Active CN113139798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633608.8A CN113139798B (en) 2021-06-07 2021-06-07 Gene sequencing flow management control method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110633608.8A CN113139798B (en) 2021-06-07 2021-06-07 Gene sequencing flow management control method and system

Publications (2)

Publication Number Publication Date
CN113139798A true CN113139798A (en) 2021-07-20
CN113139798B CN113139798B (en) 2024-02-20

Family

ID=76815999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633608.8A Active CN113139798B (en) 2021-06-07 2021-06-07 Gene sequencing flow management control method and system

Country Status (1)

Country Link
CN (1) CN113139798B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020082869A1 (en) * 2000-12-27 2002-06-27 Gateway, Inc. Method and system for providing and updating customized health care information based on an individual's genome
US20040029161A1 (en) * 2001-08-17 2004-02-12 Perlegen Sciences, Inc. Methods for genomic analysis
US20040259099A1 (en) * 2001-11-22 2004-12-23 Takamasa Katoh Information processing system using information on base sequence
US20130312010A1 (en) * 2012-05-21 2013-11-21 International Business Machines Corporation Processing Posted Receive Commands In A Parallel Computer
US20150067291A1 (en) * 2013-08-30 2015-03-05 Kabushiki Kaisha Toshiba Controller, memory system, and method
CN107370667A (en) * 2017-07-31 2017-11-21 北京北信源软件股份有限公司 Multi-threading parallel process method and apparatus, computer-readable recording medium and storage control
CN108537008A (en) * 2018-03-20 2018-09-14 常州大学 High-throughput gene sequencing big data analysis cloud platform system
US20190026641A1 (en) * 2017-07-21 2019-01-24 James Lu Genomic services platform supporting multiple application providers
US20190138375A1 (en) * 2017-11-03 2019-05-09 Dell Products L. P. Optimization of message oriented middleware monitoring in heterogenenous computing environments
CN110554976A (en) * 2018-06-01 2019-12-10 苹果公司 Memory cache management for graphics processing
CN111653317A (en) * 2019-05-24 2020-09-11 北京哲源科技有限责任公司 Gene comparison accelerating device, method and system
CN112365928A (en) * 2020-11-16 2021-02-12 赛福解码(北京)基因科技有限公司 Biological information data analysis and result quality control automation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Chao; Xu Ruzhi; Yang Feng: "Multi-process data processing system based on message queues", Computer Engineering and Design (计算机工程与设计), vol. 31, no. 13, pages 3128-3131 *

Also Published As

Publication number Publication date
CN113139798B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
US11836081B2 (en) Methods and systems for handling data received by a state machine engine
US20230196065A1 (en) Methods and devices for programming a state machine engine
US9886017B2 (en) Counter operation in a state machine lattice
CN110428868B (en) Method and system for compressing, preprocessing and decompressing and reducing gene sequencing mass data
JP2022532432A (en) Data compression methods and computing devices
US20210089358A1 (en) Techniques for improving processing of bioinformatics information to decrease processing time
US20200401553A1 (en) Devices for time division multiplexing of state machine engine signals
CN114220479B (en) Protein structure prediction method, protein structure prediction device and medium
CN114237911A (en) CUDA-based gene data processing method and device and CUDA framework
CN113139798A (en) Gene sequencing process management control method and system
CN110176276B (en) Biological information analysis process management method and system
CN111258950B (en) Atomic access and storage method, storage medium, computer equipment, device and system
CN113241120A (en) Gene sequencing system and sequencing method
CN113268269B (en) Acceleration method, system and device for dynamic programming algorithm
WO2015143708A1 (en) Method and apparatus for constructing suffix array
US6205546B1 (en) Computer system having a multi-pointer branch instruction and method
CN110021342B (en) Method and system for accelerating identification of variant sites
CN110413849A (en) A kind of data reordering method and device
CN115881225B (en) Analysis method of biological information sequence, computer storage medium and electronic device
KR102258897B1 (en) Error recovery method in genome sequence analysis and genome sequence analysis apparatus
US11842048B2 (en) Process acceleration for software defined storage
CN113268460B (en) Multilayer parallel-based gene data lossless compression method and device
US20030113767A1 (en) Confirmation sequencing
CN113535638B (en) Parallel operation acceleration system and operation method thereof
CN115827221A (en) BAM file parallel reading method, system and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant