WO2024130660A1 - System and method for analyzing gene sequencing data, electronic device, and storage medium - Google Patents


Info

Publication number
WO2024130660A1
WO2024130660A1 · PCT/CN2022/141142
Authority
WO
WIPO (PCT)
Prior art keywords
task
analysis
shard
computing
file analysis
Prior art date
Application number
PCT/CN2022/141142
Other languages
English (en)
Chinese (zh)
Inventor
王志扬
颜旭
黎宇翔
曾涛
云全新
杨恢亮
董宇亮
章文蔚
徐讯
Original Assignee
深圳华大生命科学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大生命科学研究院 filed Critical 深圳华大生命科学研究院
Priority to PCT/CN2022/141142 priority Critical patent/WO2024130660A1/fr
Publication of WO2024130660A1 publication Critical patent/WO2024130660A1/fr

Definitions

  • the present application relates to the field of computer technology, and in particular to a gene sequencing data analysis system, method, electronic device and storage medium.
  • Gene sequencing is a typical application of high-performance computing.
  • third-generation gene sequencing technology has gradually become the mainstream sequencing technology.
  • third-generation sequencing generates unprecedentedly large volumes of data, and its underlying principle means its data output will keep growing, far exceeding that of earlier sequencing technologies.
  • existing data analysis deployed on a single machine, or even on a single high-performance computing server, can no longer support the analysis of massive gene sequencing data at the current rate of data growth, suffering from low analysis efficiency and slow computing speed.
  • the present application provides a gene sequencing data analysis system, method, electronic device and storage medium, the main purpose of which is to solve the current technical problems of low data analysis efficiency and slow calculation speed when analyzing gene sequencing data.
  • a gene sequencing data analysis system comprising: a data center server, a computing server cluster, the computing server cluster comprising a plurality of computing servers, each of the plurality of computing servers being respectively configured with a plurality of computing nodes;
  • the data center server is used to obtain the path information of the shard files generated in the gene sequencing process, create a shard file analysis task containing the path information, and fill the shard file analysis task into the task queue;
  • the data center server is connected to the computing server cluster, and the data center server is also used to determine at least one target computing server in the computing server cluster that executes the shard file analysis task.
  • the at least one target computing server is used to pull the shard file analysis task from the task queue, and call the configured multiple computing nodes to execute the shard file analysis task, and send the multiple shard file analysis results output by the multiple computing nodes to the data center server.
  • the computing node is used to retrieve the shard file according to the path information contained in the shard file analysis task, wherein the shard file includes base fragment signals divided by a specified count or by channel source in the upstream gene sequencing step.
  • a gene analysis model is configured in the computing node, and the computing node is used to analyze the shard file using the gene analysis model to obtain a shard file analysis result.
  • the data center server is used to receive the multiple shard file analysis results sent by the multiple computing nodes, and merge the multiple shard file analysis results when detecting that the execution of the shard file analysis task is completed.
  • the data center server is further used to generate a task status queue, and the task status queue is used to update and store status information of the shard file analysis tasks in the task queue.
  • the data center server is further used to determine whether the shard file analysis task has been completed by detecting the status information in the task status queue.
  • the data center server is also used to clear redundant intermediate files generated by the at least one target computing server and reclaim the computing resources of the multiple computing nodes configured on the target computing server when detecting that execution of the shard file analysis task is complete.
  • a method for analyzing gene sequencing data is provided, the method being applied to a data center server, the method comprising:
  • a method for analyzing gene sequencing data is provided, the method being applied to a computing server, the method comprising:
  • the shard file analysis results are the results obtained by the computing nodes retrieving shard files according to the path information contained in the shard file analysis task and analyzing the shard files using the gene analysis model;
  • the multiple shard file analysis results are sent to the data center server.
  • an electronic device including:
  • the memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described in the second aspect or the third aspect.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to enable the computer to execute the method described in the second or third aspect above.
  • a computer program product comprising a computer program, wherein when the computer program is executed by a processor, the computer program implements the method as described in the second aspect or the third aspect.
  • the data center server can create a shard file analysis task containing the path information and fill it into the task queue; it can then receive multiple shard file analysis results sent by the target computing server and merge them upon detecting that execution of the shard file analysis task is complete. Each shard file analysis result is obtained after the target computing server pulls the shard file analysis task from the task queue and calls its configured computing nodes, which retrieve the shard file according to the path information contained in the task and analyze it using the gene analysis model.
  • in the technical scheme of the present disclosure, a centralized data center server allocates data analysis tasks, and multiple computing nodes undertake the analysis tasks of the corresponding data shards.
  • the asynchronous, multi-process implementation of the pipeline analysis operations in the computing nodes can effectively improve the speed of analyzing the large volumes of data generated by third-generation sequencing, thereby improving data analysis efficiency and meeting the analysis needs of massive gene sequencing data at the current rate of data growth.
  • FIG1 is a schematic diagram of the structure of a gene sequencing data analysis system provided in an embodiment of the present application.
  • FIG2 is a schematic diagram of a flow chart of a method for analyzing gene sequencing data provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of a flow chart of a method for analyzing gene sequencing data provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of a principle flow chart of a gene sequencing data analysis provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • 1 - data center server; 2 - computing server cluster; 20 - computing server; 200 - computing node.
  • Distributed data processing methods that use virtualization technology have been widely adopted in recent years and have developed rapidly. With the explosive growth of data in biology, finance, the Internet, the Internet of Things, and artificial intelligence (AI), the computing power of a single physical host has long been a bottleneck in massive-data scenarios and cannot support data analysis, storage, AI recognition, and other processes. Distributed data processing solves these problems well, offering resource sharing, good scalability, accelerated computing, high reliability, and convenient communication. Its advantages are significant, including but not limited to: increasing the system's data capacity and computing power; enhancing availability, giving the system high availability and a degree of fault tolerance; modularity that makes the system more reusable; more convenient development and deployment; and node-by-node scalability.
  • the present disclosure provides a gene sequencing data analysis system, as shown in Figure 1, the system adopts a distributed architecture of task center and computing nodes 1+N (one data center server + multiple computing nodes): the centralized data center allocates data analysis tasks and provides some necessary analysis data resources, and multiple computing nodes undertake the analysis tasks of corresponding data slices.
  • the analysis data resources include central processing unit (CPU), graphics processing unit (GPU), computing accelerator card, etc.
  • the system includes: a data center server 1 and a computing server cluster 2, wherein the computing server cluster 2 includes multiple computing servers 20, and each computing server 20 in the multiple computing servers 20 is respectively configured with multiple computing nodes 200; the data center server 1 is used to obtain the path information of the shard files generated in the gene sequencing process, and create a shard file analysis task containing the path information, and fill the shard file analysis task into the task queue; the data center server 1 is connected to the computing server cluster 2, and the data center server 1 is also used to determine at least one target computing server in the computing server cluster 2 to execute the shard file analysis task, and the at least one target computing server is used to pull the shard file analysis task in the task queue, and call the configured multiple computing nodes to execute the shard file analysis task, and send the multiple shard file analysis results output by the multiple computing nodes to the data center server 1.
  • the shard files are generated in the upstream gene sequencing step, sharded by a specified count or by channel source, for example 5,000 base fragment signals per shard, or all base fragment signals from a single channel as one shard.
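As an illustrative sketch of the count-based sharding rule above (the helper name `shard_signals` is hypothetical; the 5,000-signal shard size follows the example in the text):

```python
def shard_signals(signals, shard_size=5000):
    """Split a sequence of base fragment signals into fixed-size shards.

    The last shard may be smaller when the total is not a multiple
    of shard_size.
    """
    return [signals[i:i + shard_size]
            for i in range(0, len(signals), shard_size)]

# 12,000 signals -> shards of 5000, 5000, and 2000 signals
shards = shard_signals(list(range(12000)))
```

Channel-based sharding would instead group the signals by their channel identifier rather than by count.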
  • the path of the shard file needs to be filled into the task queue as one of the inputs, so that subsequent processes can read the file's content by path for further analysis; the task queue is the queue holding task parameters to be submitted to the process pool for execution, implemented here as a Redis queue.
  • the shard file analysis task can be packaged into elements and filled into the task queue.
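The producer side of this queue can be sketched as follows. The real system uses a Redis list; a `collections.deque` stands in for it here so the sketch is self-contained, and the paths and parameter names are purely illustrative:

```python
import json
from collections import deque

task_queue = deque()  # stand-in for the Redis list used in the real system

def create_shard_task(shard_path, params):
    """Package a shard file analysis task (path + analysis parameters)
    as a serialized queue element."""
    return json.dumps({"path": shard_path, "params": params})

def fill_task_queue(shard_paths, params):
    """Enqueue one analysis task per shard file path."""
    for path in shard_paths:
        task_queue.append(create_shard_task(path, params))

# illustrative paths and parameters, not from the patent
fill_task_queue(
    ["/data/run1/shard_000.h5", "/data/run1/shard_001.h5"],
    {"model": "basecaller-v1", "cut_adapter": True},
)
```

With Redis, `append` would become an `RPUSH` and the consumer's pop a `BLPOP`, but the element format stays the same.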
  • the computing node is a virtualized computing container node.
  • Virtualization refers to the virtualization of a physical host into multiple logical computers through software technology. Each logical computer can independently run different operating systems and various applications. Through virtualization technology, each virtual machine has its own virtual hardware (virtual CPU, network card, memory, etc.), and the operating system running on the virtual machine thinks that it has a physical host exclusively. The software on the virtual machine runs on a virtual platform, not a real hardware platform.
  • a container is just a special process running on the host machine, and multiple containers share the same host operating system kernel. A container does not rely on a guest operating system for its application environment, and the application process can be isolated and restricted through mechanisms and features provided by the host kernel.
  • the data center server 1 may select target servers at random or according to preset screening rules.
  • as one possible implementation, the task type of the shard file analysis task may be determined, and the computing servers 20 in the computing server cluster 2 matching that task type may be chosen as target computing servers; as another possible implementation, the working status information of each computing server 20 in the computing server cluster 2 may be obtained, and the computing servers 20 whose working status information meets preset conditions may be chosen as target computing servers, where the working status information may include working duration, working status, working priority, etc., which are not specifically limited here.
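Both selection strategies reduce to filtering the cluster by a predicate. A minimal sketch, assuming hypothetical field names (`supported_types`, `load`) and a load threshold as the preset condition:

```python
def select_target_servers(servers, task_type, max_load=0.8):
    """Pick servers that support the task type and whose load
    satisfies the preset condition (here: load below max_load)."""
    return [s for s in servers
            if task_type in s["supported_types"] and s["load"] < max_load]

# illustrative cluster description
cluster = [
    {"name": "srv-a", "supported_types": {"basecall"}, "load": 0.3},
    {"name": "srv-b", "supported_types": {"basecall"}, "load": 0.9},
    {"name": "srv-c", "supported_types": {"align"},    "load": 0.1},
]
targets = select_target_servers(cluster, "basecall")
```

Random selection would simply replace the predicate with `random.sample` over the matching servers.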
  • when processing data, the data center server plays the producer role: it creates shard file analysis tasks, monitors task status, and performs the necessary file operations at the end, and it also supports receiving and removing tasks. Specifically, when creating a task, it obtains the list and paths of all shard files to be analyzed, fills the task queue, and creates status information; when deleting a task, it sends deletion information to the computing nodes so that they can restart their services according to the docker_id. Input operation: a data name and a set of analysis parameters are selected on any client's web page, and clicking the submit button sends a request to the data center server; the server parses the parameters to locate the data, creates a task from the paths of the current data's shard files and other information, and fills it into the message queue.
  • the analysis parameters may include the model name for deep-learning base recognition, whether to cut the adapter (the through-pore base signal corresponding to the fixed primer segment preceding the read's base signal), the base filter length, and other domain-specific parameters; parsing the parameters yields the file path, identifying which data folder on the data server the data corresponds to, and the shard file paths, identifying which shard files under that data need to be handed to the queue for analysis.
  • the list of shard files to be analyzed contains the paths of the shard files to be analyzed; the shard files are generated in the upstream sequencing step.
  • the path of each shard file needs to be filled into the task queue as one of the inputs, so that subsequent processes can read the file's content by path for further analysis.
  • the status information maintains several data structures that record which shard files have completed the corresponding analysis steps. Accordingly, the data center server is also used to generate a task status queue, which updates and stores the status information of the shard file analysis tasks in the task queue.
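The completion check described here can be sketched as a per-task status map whose "all done" condition triggers the merge step. The class and state names are illustrative, not from the patent:

```python
class TaskStatusQueue:
    """Track the status of each shard analysis task; the merge step
    fires once every task has reached the 'done' state."""

    def __init__(self, task_ids):
        self.status = {t: "pending" for t in task_ids}

    def update(self, task_id, state):
        self.status[task_id] = state

    def all_done(self):
        return all(s == "done" for s in self.status.values())

ts = TaskStatusQueue(["shard_000", "shard_001"])
ts.update("shard_000", "done")
assert not ts.all_done()      # one shard still pending
ts.update("shard_001", "done")
```

In the real system this map would live in Redis alongside the task queue so both producer and consumers can read it.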
  • the computing nodes are responsible for executing the tasks.
  • the computing nodes will pull and consume tasks according to the information provided by the data center server. After pulling the corresponding tasks, the computing nodes will trigger their own data analysis tasks.
  • the nodes are relatively independent in terms of resource occupation and task operation. When resources are sufficient, they can be considered independent of each other.
  • each computing node generates its corresponding shard result after the data shard analysis process and feeds it back to the data center server upon completion.
  • the specific data flow is: task queue - fetch the shard file in Hierarchical Data Format version 5 (HDF5) format to the local node over the optical fiber network - parse the HDF5 shard file - run AI model inference on the data - decode the AI inference results - write the decoded results to the shard result file - transmit the results back to the data center server over the optical fiber network.
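The per-node flow is a linear pipeline of stages. A minimal sketch with stand-in functions (the real stages parse an HDF5 file and run a trained model; every function here is a hypothetical placeholder):

```python
def parse_shard(raw):
    """Stand-in for reading and parsing the HDF5 shard file."""
    return raw["signals"]

def ai_infer(signals):
    """Stand-in for the AI model's forward pass over the signals."""
    return [s * 2 for s in signals]

def decode(logits):
    """Stand-in for decoding inference output into a base sequence."""
    return "".join("ACGT"[v % 4] for v in logits)

def analyze_shard(raw):
    """Chain the stages: parse -> infer -> decode -> shard result."""
    return decode(ai_infer(parse_shard(raw)))

result = analyze_shard({"signals": [0, 1, 2, 3]})
```

Because each stage only consumes the previous stage's output, the stages can be run asynchronously on different shards at once, which is the multi-process behavior the text describes.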
  • when the computing node obtains the shard file, it can retrieve it according to the path information contained in the shard file analysis task, wherein the shard file includes the base fragment signals divided by a specified count or by channel source in the upstream gene sequencing step.
  • each computing node is respectively configured with a gene analysis model. After retrieving the shard file, the computing node can use the gene analysis model to analyze the shard file, obtain the shard file analysis result, and transmit the shard file analysis result back to the data center server through the optical fiber network, that is, the data center server can receive multiple shard file analysis results sent by multiple computing nodes.
  • the gene analysis models of multiple computing nodes can constitute a gene analysis cluster.
  • the gene analysis cluster refers to the use of artificial intelligence learning methods and means to implement the analysis of gene sequencing data.
  • gene sequencing data can be converted, through the unique physical and chemical properties of its bases, into various signals with pronounced regularities and characteristics, and signals such as images and electrical currents lend themselves well to artificial-intelligence analysis models. Building a gene analysis cluster distributes the computing resources for gene sequencing data analysis, so that more computing resource nodes can accelerate and support the AI learning process, improving the speed and accuracy of gene sequencing data analysis.
  • the gene analysis model can be any computing model that can realize the gene sequencing data analysis task, such as a neural network model, a deep learning model, etc.
  • the gene analysis model in the present disclosure can be selected as a hidden Markov model, a conditional random field, and a neural network model, etc., which are not specifically limited here.
  • the gene analysis model configured in each computing node may be the same or different, and is not specifically limited here.
  • the corresponding training method is applied until the gene analysis model is determined to have reached convergence, at which point training is complete. The trained gene analysis model can then be put directly into the gene sequencing data analysis task of the computing node: the shard file is input into the trained model, which directly outputs the corresponding shard file analysis result.
  • after the computing node executes the shard file analysis task and obtains the shard file analysis results, the producer and consumer roles are exchanged: the computing node that generates the shard result provides the corresponding completion information and plays the producer role, while the data center server plays the consumer role, monitoring for completion.
  • the shard file analysis results provided by the computing node are merged to produce the final required complete results.
  • the entire analysis process, from the beginning through result merging, forms a closed "whole-parts-whole" loop. After it ends, intermediate results and other resources are cleared and returned to the overall system.
  • the merging flow for shard file analysis results is: receive the HDF5 shard result files sent by the computing nodes - detect that all tasks are complete - read and merge all HDF5 result files - obtain the final analysis result, removing redundant intermediate files at the same time.
  • the data center server can determine whether the shard file analysis task has been completed by checking the status information in the task status queue; upon detecting completion, it clears the redundant intermediate files generated by the at least one target computing server and reclaims the computing resources of the multiple computing nodes configured on it.
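The merge-and-cleanup step can be sketched with plain text files standing in for the HDF5 result files (the function name and file layout are illustrative):

```python
import os
import tempfile

def merge_shard_results(result_paths, merged_path):
    """Concatenate per-shard result files into one final result file,
    then delete the redundant intermediate files."""
    with open(merged_path, "w") as out:
        for p in sorted(result_paths):   # deterministic shard order
            with open(p) as f:
                out.write(f.read())
            os.remove(p)                 # clear redundant intermediates

# build two illustrative shard result files in a temp directory
tmp = tempfile.mkdtemp()
parts = []
for i, text in enumerate(["ACGT\n", "TTAA\n"]):
    path = os.path.join(tmp, f"shard_{i}.txt")
    with open(path, "w") as f:
        f.write(text)
    parts.append(path)

merged = os.path.join(tmp, "final.txt")
merge_shard_results(parts, merged)
```

With real HDF5 results, the concatenation would instead copy each file's datasets into the merged file (e.g. via h5py), but the detect-merge-clean sequence is the same.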
  • the development environment involved in the present disclosure is relatively complex.
  • the software and hardware environment requirements of the data center server are shown in Table 1 below, and the software and hardware environment requirements of the computing node are shown in Table 2 below.
  • the algorithm module involves the reading, writing and storage of a large amount of data, as well as the training and testing of the model, which requires a large amount of data calculation and operation.
  • Running in a high-performance CPU or GPU-configured software and hardware environment can significantly improve efficiency and stability.
  • an embodiment of the present disclosure provides a method for analyzing gene sequencing data.
  • the method is applied to a gene sequencing data analysis system, and is specifically applied to a data center server.
  • the method embodiment includes:
  • Step 101: Obtain the path information of the shard files generated during the gene sequencing process.
  • the shard files are generated in the upstream gene sequencing step, sharded by a specified count or by channel source, for example 5,000 base fragment signals per shard or all base fragment signals from one channel as a shard. Specifically, they can come from a third-generation nanopore sequencing device and its companion PC.
  • the nanopore sequencing device collects the current signal generated as the molecule passes through the pore; sequencing fragment results and other information are obtained through algorithmic identification and screening and saved in the h5 file structure, stored on a hard disk specially mounted on the PC. The results are uploaded to the data center server when the sequencing run completes.
  • Step 102: Create a shard file analysis task containing the path information, and fill it into a task queue.
  • the task queue refers to the queue holding task parameters to be submitted to the process pool for execution, implemented here as a Redis queue.
  • the shard file analysis task can be packaged into elements and filled into the task queue. Filling the task queue avoids backlogging the current process and lets the next process consume queue elements to carry out the subsequent analysis steps asynchronously.
  • Step 103: Receive multiple shard file analysis results sent by the target computing server, and merge them upon detecting completion of the shard file analysis task, wherein each shard file analysis result is obtained after the target computing server pulls the shard file analysis task from the task queue and calls its configured computing nodes, which retrieve the shard file according to the path information contained in the task and analyze it using the gene analysis model.
  • the target computing server is a computing server selected from the computing server cluster and to be used for executing the shard file analysis task.
  • the target computing server can pull the shard file analysis task from the task queue, and call multiple configured computing nodes to retrieve the shard file according to the path information contained in the shard file analysis task. After pulling the corresponding shard file, the computing nodes will trigger their own data analysis tasks, and specifically use the gene analysis model to analyze the shard file to obtain the shard file analysis results.
  • the shard file analysis results can then be sent to the data center server, that is, the data center server can receive multiple shard file analysis results sent by multiple computing nodes.
  • the gene analysis model can be any computing model that can realize the task of gene sequencing data analysis, such as a neural network model, a deep learning model, etc.
  • the gene analysis model in the present disclosure can be selected as a hidden Markov model, a conditional random field, and a neural network model, etc., which are not specifically limited here.
  • the gene analysis model configured in each computing node may be the same or different, which is not specifically limited here. It should be noted that there are many types of gene analysis models and they are adapted to different usage scenarios.
  • the algorithms in this application are mainly integrated and embedded, so they are not elaborated here; iterative updates of the integrated algorithm content are supported.
  • the data center server can create a shard file analysis task containing the path information and fill it into the task queue; it can then receive multiple shard file analysis results sent by the target computing server and merge them upon detecting that execution of the shard file analysis task has finished, wherein each shard file analysis result is obtained after the target computing server pulls the task from the queue and calls its configured computing nodes, which retrieve the shard files by the path information in the task and analyze them using the gene analysis model.
  • in the technical solution of the present disclosure, a centralized data center server allocates data analysis tasks, and multiple computing nodes undertake the analysis tasks of the corresponding data shards.
  • the asynchronous, multi-process implementation of the pipeline analysis operations in the computing nodes can effectively improve the speed of analyzing the large volumes of data generated by third-generation sequencing, thereby improving data analysis efficiency and meeting the analysis needs of massive gene sequencing data at the current rate of data growth.
  • an embodiment of the present disclosure provides a method for analyzing gene sequencing data.
  • the method is applied to an analysis system for gene sequencing data, and is specifically applied to a target computing server.
  • the method embodiment includes:
  • Step 201: Pull the shard file analysis task from the task queue.
  • the target computing server can pull the shard file analysis task to be executed from the task queue.
  • the shard file analysis task can include the path information of the shard file.
  • the multiple computing nodes configured by the target computing server can further read the content in the shard file based on the path information and further analyze it.
  • Step 202: Call the multiple configured computing nodes to execute the shard file analysis task and obtain multiple shard file analysis results, wherein each shard file analysis result is obtained by a computing node retrieving the shard file according to the path information contained in the task and analyzing it using the gene analysis model.
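The fan-out in this step can be sketched as follows. The real system dispatches each task to a separate container compute node; a thread pool stands in for that parallel, asynchronous execution here, and `analyze` is a hypothetical placeholder for the per-node model run:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(task):
    """Stand-in for one node retrieving a shard by path and
    running the gene analysis model on its signals."""
    return {"path": task["path"], "bases": len(task["signals"])}

# illustrative tasks, one per shard file
tasks = [{"path": f"/data/shard_{i}.h5", "signals": [0] * (i + 1)}
         for i in range(4)]

# threads stand in for the cluster's container compute nodes;
# map preserves task order in the collected results
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyze, tasks))
```

In production each worker would be a separate process (or container) holding one compute card, as the system section describes.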
  • the computing node is a virtualized computing container node.
  • virtualization refers to using software technology to virtualize one physical host into multiple logical computers, each of which can independently run different operating systems and applications. Through virtualization, each virtual machine has its own virtual hardware (virtual CPU, network card, memory, etc.), and the operating system running in the virtual machine believes it has exclusive use of a physical host; software in the virtual machine runs on a virtual platform rather than real hardware. A container, by contrast, is just a special process running on the host machine, and multiple containers share the same host operating system kernel. A container does not rely on a guest operating system for its application environment, and the application process can be isolated and restricted through mechanisms and features provided by the host kernel.
  • the computing node will pull and consume tasks according to the information provided by the data center server. After obtaining the corresponding shard files, the computing nodes will trigger their own data analysis tasks.
  • the nodes are relatively independent in terms of resource occupation and task operation. When resources are sufficient, they can be considered independent of each other.
  • each computing node generates its corresponding shard result after the data shard analysis process and feeds it back to the data center server upon completion.
  • the specific data flow is: task queue - fetch the shard file in Hierarchical Data Format version 5 (HDF5) format to the local node over the optical fiber network - parse the HDF5 shard file - run AI model inference on the data - decode the AI inference results - write the decoded results to the shard result file - transmit the results back to the data center server over the optical fiber network.
  • the computing node can retrieve the shard file according to the path information contained in the shard file analysis task, wherein the shard file includes the base fragment signal divided according to the specified number or channel source in the upstream gene sequencing step.
  • each computing node is configured with a gene analysis model.
  • the computing node can use the gene analysis model to analyze the shard file, obtain the analysis results of the shard file, and transmit the analysis results of the shard file back to the data center server through the optical fiber network, that is, the data center server can receive multiple shard file analysis results sent by multiple computing nodes.
  • the gene analysis models of multiple computing nodes can form a gene analysis cluster.
  • the gene analysis cluster refers to the use of artificial intelligence learning methods and means to implement the analysis of gene sequencing data.
  • gene sequencing data can be converted, through the unique physical and chemical properties of its bases, into various signals with pronounced regularities and characteristics, and signals such as images and electrical currents lend themselves well to artificial-intelligence analysis models. Building a gene analysis cluster distributes the computing resources for gene sequencing data analysis, so that more computing resource nodes can accelerate and support the AI learning process, improving the speed and accuracy of gene sequencing data analysis.
  • Step 203: Send the multiple shard file analysis results to the data center server.
  • the technical solution disclosed in the present invention can be deployed by a centralized data center to allocate data analysis tasks, and multiple computing nodes can undertake the analysis tasks of the corresponding data fragments.
  • implementing the streaming analysis operation in the computing node asynchronously and with multiple processes can effectively improve the speed of analyzing the large volume of data generated by third-generation sequencing, thereby improving data analysis efficiency and meeting the analysis needs of massive gene sequencing data at the current data growth rate.
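As a rough illustration of overlapping shard analyses instead of running them serially, the sketch below uses a thread pool; a real deployment, per the text above, would use asynchronous multi-process workers, and `analyze_shard` is a hypothetical stub:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_shard(shard_id, signals):
    # Stand-in for the per-shard pipeline (parse -> infer -> decode):
    # here each signal is "decoded" to the index of its largest component.
    return shard_id, [max(range(len(s)), key=s.__getitem__) for s in signals]

shards = {0: [[1, 3, 2]], 1: [[5, 0, 1], [0, 9, 0]], 2: [[2, 2, 7]]}

# Submit all shard tasks at once; workers pick them up as capacity frees,
# mirroring how a node keeps its compute saturated rather than idle.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(analyze_shard, sid, sig) for sid, sig in shards.items()]
    results = dict(f.result() for f in futures)

print(results)
```

Threads stand in for processes here only to keep the sketch self-contained; with CPU-bound decoding, `ProcessPoolExecutor` (or per-GPU worker processes) would be the natural substitution.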
  • the present invention creatively integrates the rapidly increasing bioinformatics data in the third-generation sequencing scenario with the distributed data analysis and processing, virtualization technology, AI artificial intelligence, GPU parallel computing, asynchronous, multi-process, message queue and other advanced technologies in the current computer field, and proposes and implements a gene data analysis solution with easy expansion of computing resources, iterative algorithms, node obstacle avoidance and full resource utilization in combination with computer principles and specific software code design.
  • the rest are natively designed and implemented.
  • adopting better distributed framework products from the computer field can further improve deployment and management efficiency, or the system can self-iterate into a new, specialized application framework.
  • the data analysis algorithm deployed by the present invention can continue to increase the overall analysis speed as it is iteratively optimized.
  • the project itself can keep pace with such iterations for further improvement, so that the analysis speed grows in step with hardware improvements.
  • the data center server (data center/data server) can create a shard file analysis task containing the path information, fill the task queue with the shard file analysis task, and generate a task status queue for updating the status information of the shard file analysis tasks stored in the task queue. Thereafter, the data center server can determine at least one target computing server (a single- or multi-computing-card workstation) in the computing server cluster to execute the shard file analysis task; the target computing server can pull the shard file analysis task from the task queue via fiber mounting/transmission and call the multiple configured computing nodes (container nodes each holding a single computing card's resources) to execute the shard file analysis task.
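The interplay between the task queue and the task status queue described above might be sketched as follows (the queue here is in-process for illustration, whereas the system described would use a distributed message queue; all function names are assumptions):

```python
import queue

PENDING, RUNNING, DONE = "pending", "running", "done"

task_queue = queue.Queue()
task_status = {}  # stand-in for the task status queue, keyed by task id

def create_shard_task(task_id, shard_path):
    """Data center side: create a shard file analysis task holding the
    shard's path information, enqueue it, and record its status."""
    task_queue.put({"id": task_id, "path": shard_path})
    task_status[task_id] = PENDING

def pull_task():
    """Target computing server side: pull the next task and mark it running."""
    task = task_queue.get_nowait()
    task_status[task["id"]] = RUNNING
    return task

def report_done(task_id):
    """On completion, the status queue entry is updated so the data center
    can detect that this shard file analysis task has finished."""
    task_status[task_id] = DONE

create_shard_task(0, "/shards/lane0.h5")
create_shard_task(1, "/shards/lane1.h5")
t = pull_task()
report_done(t["id"])
print(task_status)
```

Keeping status separate from the work queue is what lets the data center poll for completion without disturbing the tasks still waiting to be pulled.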
  • a gene analysis model is configured in each computing node.
  • the computing node can retrieve the shard file according to the path information contained in the shard file analysis task and use the gene analysis model to analyze it, obtaining the shard file analysis result (shard result), which it then transmits back to the data center server over the optical fiber network; the data center server can determine whether the shard file analysis task has been completed by checking the status information in the task status queue.
  • the data center server merges the received multiple shard file analysis results, clears the redundant intermediate files generated by the at least one target computing server, and reclaims the computing resources of the multiple computing nodes configured on the target computing server.
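A minimal sketch of this final merge-and-cleanup step (the merge key, file layout, and function names are assumptions for illustration):

```python
import os
import tempfile

def merge_shard_results(shard_results):
    """Concatenate per-shard base calls in shard order into one final result,
    regardless of the order in which nodes returned them."""
    merged = []
    for _, calls in sorted(shard_results.items()):
        merged.extend(calls)
    return merged

def clear_intermediate_files(paths):
    """Delete redundant intermediate files left behind by the computing
    servers, freeing the nodes' scratch space for the next batch of tasks."""
    for p in paths:
        if os.path.exists(p):
            os.remove(p)

# Shards may arrive out of order; the merge restores sequencing order.
tmp = tempfile.NamedTemporaryFile(delete=False)  # a mock intermediate file
tmp.close()
final = merge_shard_results({1: ["C", "G"], 0: ["A", "T"]})
clear_intermediate_files([tmp.name])
print(final, os.path.exists(tmp.name))
```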
  • the present application also provides an electronic device, a readable storage medium and a computer program product.
  • Fig. 5 shows a schematic block diagram of an example electronic device 500 that can be used to implement an embodiment of the present application.
  • the electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • the electronic device can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the present application described herein and/or required.
  • the device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 502 or a computer program loaded from a storage unit 508 to a RAM (Random Access Memory) 503.
  • the RAM 503 can also store various programs and data required for the operation of the device 500.
  • the computing unit 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504.
  • An I/O (Input/Output) interface 505 is also connected to the bus 504.
  • a number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard, a mouse, etc.; an output unit 507, such as various types of displays, speakers, etc.; a storage unit 508, such as a disk, an optical disk, etc.; and a communication unit 509, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • the computing unit 501 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any appropriate processor, controller, microcontroller, etc.
  • the computing unit 501 performs the various methods and processes described above, such as a method for processing FASTQ data.
  • the method for processing FASTQ data may be implemented as a computer software program, which is tangibly contained in a machine-readable medium, such as a storage unit 508.
  • part or all of the computer program may be loaded and/or installed on the device 500 via the ROM 502 and/or the communication unit 509.
  • When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the method described above may be performed.
  • the computing unit 501 may be configured to execute the aforementioned communication data processing method in any other appropriate manner (for example, by means of firmware).
  • Various embodiments of the systems and techniques described above herein may be implemented in digital electronic circuit systems, integrated circuit systems, FPGAs (Field Programmable Gate Array), ASICs (Application-Specific Integrated Circuit), ASSPs (Application Specific Standard Product), SOCs (System On Chip), CPLDs (Complex Programmable Logic Device), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be special-purpose or general-purpose, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • the program code for implementing the method of the present application can be written in any combination of one or more programming languages. These program codes can be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing device, so that the program code, when executed by the processor or controller, implements the functions/operations specified in the flow chart and/or block diagram.
  • the program code can be executed entirely on the machine, partially on the machine, partially on the machine and partially on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, optical fiber, CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
  • Other types of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and techniques described herein may be implemented in a computing system that includes backend components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes frontend components (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such backend components, middleware components, or frontend components.
  • the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), the Internet, and blockchain networks.
  • a computer system may include a client and a server.
  • the client and the server are generally remote from each other and usually interact through a communication network.
  • the relationship between the client and the server is generated by computer programs running on the corresponding computers and having a client-server relationship with each other.
  • the server may be a cloud server, also known as a cloud computing server or cloud host, a host product in the cloud computing service system that remedies the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
  • the server may also be a server for a distributed system, or a server combined with a blockchain.
  • artificial intelligence is a discipline that studies how computers can simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.), and includes both hardware-level and software-level technologies.
  • Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, as well as machine learning/deep learning, big data processing technology, knowledge graph technology, and other major directions.

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the technical field of computers. Disclosed are a gene sequencing data analysis system and method, an electronic device, and a storage medium. The method comprises: acquiring path information of a shard file generated during a gene sequencing process; creating a shard file analysis task that comprises the path information and filling a task queue with the shard file analysis task; and, when it is detected that execution of the shard file analysis task is complete, merging the plurality of shard file analysis results, the shard file analysis results being results obtained by the target computing server pulling the shard file analysis task from the task queue, then calling a plurality of configured computing nodes to retrieve the shard file according to the path information comprised in the shard file analysis task, and analyzing the shard file using a gene analysis model. The present invention can solve the technical problems of low data analysis efficiency and slow computing speed when gene sequencing data is analyzed.
PCT/CN2022/141142 2022-12-22 2022-12-22 Gene sequencing data analysis system and method, electronic device and storage medium WO2024130660A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/141142 WO2024130660A1 (fr) 2022-12-22 2022-12-22 Gene sequencing data analysis system and method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/141142 WO2024130660A1 (fr) 2022-12-22 2022-12-22 Gene sequencing data analysis system and method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2024130660A1 true WO2024130660A1 (fr) 2024-06-27

Family

ID=91587423

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141142 WO2024130660A1 (fr) 2022-12-22 2022-12-22 Gene sequencing data analysis system and method, electronic device and storage medium

Country Status (1)

Country Link
WO (1) WO2024130660A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427262A (zh) * 2019-09-26 2019-11-08 深圳华大基因科技服务有限公司 Gene data analysis method and heterogeneous scheduling platform
US20200350035A1 (en) * 2017-10-27 2020-11-05 Sysmex Corporation Gene analysis method, gene analysis apparatus, management server, gene analysis system, program, and storage medium
CN112992270A (zh) * 2021-04-01 2021-06-18 山东英信计算机技术有限公司 Gene sequencing method and apparatus
CN114756173A (zh) * 2022-04-15 2022-07-15 京东科技信息技术有限公司 File merging method, system, device and computer-readable medium
US20220375545A1 (en) * 2021-05-18 2022-11-24 Arizona Board of Arizona, Tech Transfer Arizona Systems and methods for generating and analyzing a customized genomic sequence incorporating gene fusions for therapeutic applications

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200350035A1 (en) * 2017-10-27 2020-11-05 Sysmex Corporation Gene analysis method, gene analysis apparatus, management server, gene analysis system, program, and storage medium
CN110427262A (zh) * 2019-09-26 2019-11-08 深圳华大基因科技服务有限公司 Gene data analysis method and heterogeneous scheduling platform
CN112992270A (zh) * 2021-04-01 2021-06-18 山东英信计算机技术有限公司 Gene sequencing method and apparatus
US20220375545A1 (en) * 2021-05-18 2022-11-24 Arizona Board of Arizona, Tech Transfer Arizona Systems and methods for generating and analyzing a customized genomic sequence incorporating gene fusions for therapeutic applications
CN114756173A (zh) * 2022-04-15 2022-07-15 京东科技信息技术有限公司 File merging method, system, device and computer-readable medium

Similar Documents

Publication Publication Date Title
CN107301170B (zh) Artificial intelligence-based method and apparatus for sentence segmentation
JP7358698B2 (ja) Semantic representation model training method, apparatus, device and storage medium
US20210343287A1 (en) Voice processing method, apparatus, device and storage medium for vehicle-mounted device
US20230033019A1 (en) Data processing method and apparatus, computerreadable medium, and electronic device
JP7351942B2 (ja) Domain phrase mining method, apparatus and electronic device
US20230306081A1 (en) Method for training a point cloud processing model, method for performing instance segmentation on point cloud, and electronic device
US20220237376A1 (en) Method, apparatus, electronic device and storage medium for text classification
WO2023197554A1 (fr) Model inference acceleration method and apparatus, electronic device, and storage medium
JP7357114B2 (ja) Liveness detection model training method, apparatus, electronic device and storage medium
CN112559378A (zh) Autonomous driving algorithm evaluation method and apparatus, and scenario library generation method and apparatus
EP4283465A1 (fr) Data processing method and apparatus, and storage medium
WO2024036662A1 (fr) Parallel graph rule mining method and apparatus based on data sampling
CN114820279A (zh) Multi-GPU-based distributed deep learning method and apparatus, and electronic device
JP7387964B2 (ja) Learning-to-rank model training method, ranking method, apparatus, device and medium
US20210365406A1 (en) Method and apparatus for processing snapshot, device, medium and product
US20200104465A1 (en) Real-Time Prediction of Chemical Properties Through Combining Calculated, Structured and Unstructured Data at Large Scale
CN113344214A (zh) Data processing model training method and apparatus, electronic device, and storage medium
CN112989797A (zh) Model training and text expansion method, apparatus, device, and storage medium
WO2024130660A1 (fr) Gene sequencing data analysis system and method, electronic device and storage medium
US20220207427A1 (en) Method for training data processing model, electronic device and storage medium
US12007965B2 (en) Method, device and storage medium for deduplicating entity nodes in graph database
CN115186738B (zh) Model training method, apparatus and storage medium
CN115794742A (zh) File path data processing method, apparatus, device, and storage medium
US20220129418A1 (en) Method for determining blood relationship of data, electronic device and storage medium
CN115639966A (zh) Data writing method and apparatus, terminal device, and storage medium