CN112885412B

CN112885412B - Genome annotation method, apparatus, visualization platform and storage medium

Info

Publication number: CN112885412B
Application number: CN202110209957.7A
Authority: CN
Inventors: 董玥; 贺增泉; 徐则洋; 李悦; 郑大鹏
Original assignee: BGI Technology Solutions Co Ltd
Current assignee: BGI Technology Solutions Co Ltd
Priority date: 2021-02-25
Filing date: 2021-02-25
Publication date: 2023-03-28
Anticipated expiration: 2041-02-25
Also published as: CN112885412A

Abstract

The application provides a genome annotation method, a genome annotation apparatus, a visualization platform and a storage medium. The method comprises: obtaining sequencing data to be annotated, and splitting the sequencing data according to a gene chromosome field to obtain a plurality of pieces of sub-sequencing data; and mapping each piece of sub-sequencing data to a corresponding subunit gene database for search and analysis to obtain an annotation result, wherein the subunit gene database is obtained by splitting a known gene database according to the gene chromosome field. The method divides a gene database holding billion-level data into subunit gene databases with smaller data volumes, and then divides the sequencing data to be annotated into sub-sequencing data of the same size and type as the subunit gene databases. During annotation, each piece of sub-sequencing data only needs to be compared with its corresponding subunit gene database, so the query range is greatly reduced and both computation time and computation amount are saved.

Description

Genome annotation method, apparatus, visualization platform and storage medium

Technical Field

The application relates to the technical field of bioinformatics, in particular to a genome annotation method, a genome annotation device, a visualization platform and a computer-readable storage medium.

Background

After the sequencing of gene data is completed, mutation detection analysis is usually performed with matching analysis software. Mutation detection analysis compares the sequenced genome data with a reference genome to obtain the base positions and change information that are inconsistent with the reference sequence (namely, the basic information of a mutation site). However, the basic information of a variant site is often insufficient to determine whether the site is pathogenic. Genome annotation of the variant information is therefore required to assist pathogenicity determination. The human whole genome contains about 3 billion DNA base pairs, and the content to be annotated includes not only single-base mutations (SNPs) but also small insertion-deletion mutations (indels). The number of records in each database is huge, and gene detection typically involves large sample sizes annotated concurrently. As a result, genome annotation consumes substantial resources and takes a long time, and the concurrent annotation of multiple samples is inefficient.

Summary of the Application

In view of the above, embodiments of the present application provide a genome annotation method, apparatus, visualization platform, and computer-readable storage medium, so as to overcome the problems of large resource consumption, long annotation time, and low annotation efficiency of the genome annotation method in the prior art.

In a first aspect, the embodiments of the present application provide a genome annotation method, which includes:

obtaining sequencing data to be annotated, and splitting the sequencing data according to gene chromosome fields to obtain a plurality of pieces of sub-sequencing data;

mapping each sub-sequencing data to a corresponding subunit gene database for searching and analyzing to obtain an annotation result; wherein the subunit gene database is obtained by splitting a known gene database according to gene chromosome fields.

In a second aspect, embodiments provide a genome annotation apparatus, comprising: the sequencing data acquisition module is used for acquiring sequencing data to be annotated;

the sequencing data splitting module is used for splitting the sequencing data according to a gene chromosome field to obtain a plurality of pieces of sub-sequencing data;

the mapping module is used for mapping each sub-sequencing data to a corresponding subunit gene database for searching and analyzing to obtain an annotation result; wherein the subunit gene database is obtained by splitting a known gene database according to gene chromosome fields.

In a third aspect, an embodiment of the present application provides a visualization platform, including: one or more clients; a memory; one or more servers coupled with the memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more servers, the one or more applications being configured to perform the genome annotation method provided by the first aspect above.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, in which program codes are stored, and the program codes can be called by a processor to execute the genome annotation method provided in the first aspect.

The genome annotation method, the genome annotation device, the visualization platform and the computer-readable storage medium provided by the embodiment of the application comprise the steps of obtaining sequencing data to be annotated, splitting the sequencing data according to a gene chromosome field to obtain a plurality of pieces of sub-sequencing data; mapping each sub-sequencing data to a corresponding subunit gene database for searching and analyzing so as to obtain an annotation result, wherein the subunit gene database is obtained by splitting a known gene database according to gene chromosome fields; according to the genome annotation method, the gene database of billion-level data is divided into the subunit gene database with less data volume, then the sequencing data to be annotated is divided into the sub-sequencing data with the same size and type as the subunit gene database, and the sub-sequencing data is compared with the corresponding subunit gene database when annotation is carried out, so that the query range is greatly reduced, and the calculation time and the calculation amount are saved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic diagram of an application scenario of a genome annotation method provided in an embodiment of the present application;

FIG. 2 is a schematic flow chart of a genome annotation process provided in one embodiment of the present application;

FIG. 3 is a diagram illustrating a structure of mapping lookup data provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of the structure of a genome annotation apparatus provided in one embodiment of the present application;

FIG. 5 is a schematic structural diagram of a visualization platform provided in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a computer-readable storage medium provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely below, and it should be understood that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Explanation of terms:

1. High-throughput sequencing: also known as "next-generation" sequencing technology, it is characterized by the ability to sequence hundreds of thousands to millions of DNA molecules in parallel in a single run, and generally by short read lengths. Sequencing means determining the base sequence of a specific DNA fragment, that is, the order of adenine (A), thymine (T), cytosine (C) and guanine (G). The advent of rapid DNA sequencing methods has greatly facilitated biological and medical research and discovery.

2. Hash function: a hash function is a method for creating a small digital "fingerprint" from any kind of data. It compresses a message or data into a digest, so that the amount of data becomes small and its format becomes fixed. The function scrambles and mixes the data to create a fingerprint called a hash value (also hash sum, or hash). The hash value is typically represented as a short string of seemingly random letters and digits. A good hash function rarely produces hash collisions within its input domain. In hash tables and data processing, if collisions are not suppressed so that data can be distinguished, database records become harder to find.

3. Genome: in the fields of molecular biology and genetics, the genome is the sum of all the genetic material of an organism. This genetic material consists of DNA, or of RNA in the case of RNA viruses. The genome comprises coding DNA and non-coding DNA, as well as mitochondrial DNA and chloroplast DNA.

4. Gene mutation: in biology, a mutation is an alteration of the genetic material in a cell, usually the deoxyribonucleic acid (DNA) present in the cell nucleus. It includes point mutations caused by single-base changes, as well as deletions, duplications and insertions of multiple bases. The cause may be errors in the replication of the genetic material during cell division, or the influence of chemicals, genotoxicity, radiation or viruses.

5. Point mutation: a type of mutation in which a single nucleotide base is replaced by another nucleotide in the genetic material (DNA or RNA). Generally, the term also includes insertions or deletions that affect only a single base pair.

6. Basic information annotation: such as: mutation position, gene, transcript number, base change, amino acid change, etc.;

7. Prediction software result annotation: predicting the functional harmfulness of a variant by means of bioinformatics algorithms. Prediction and pathway analysis help to discover potential new genes and sites.

8. Population variation frequency annotation: an important reference; the frequency with which a given variant occurs in large-scale populations can be obtained from the records of public, authoritative databases.

9. Mutation-related disease annotation: databases relating genetic variants to clinical phenotypes, formed by experts through the evaluation of variant information, clinical phenotypes, empirical data, functional annotation, analysis and the like.

In order to explain the present application in more detail, a genome annotation method, apparatus, terminal device and computer storage medium provided by the present application are specifically described below with reference to the accompanying drawings.

Referring to fig. 1, fig. 1 shows a schematic diagram of an application scenario of the genome annotation method provided in the embodiment of the present application, where the application scenario includes a terminal device 100 provided in the embodiment of the present application, and the terminal device 100 may be various electronic devices (such as the block diagrams of 102, 104, 106, and 108) having a display screen, including but not limited to a smartphone and a computer device, where the computer device may be at least one of a desktop computer, a portable computer, a laptop computer, a tablet computer, and the like. The terminal device 100 may be generally referred to as one of a plurality of terminal devices, and the present embodiment is only illustrated by the terminal device 100. Those skilled in the art will appreciate that the number of terminal devices may be greater or fewer. For example, the number of the terminal devices may be only a few, or the number of the terminal devices may be tens of or hundreds, or may be more, and the number and the type of the terminal devices are not limited in the embodiment of the present application. The terminal device 100 may be used to perform a genome annotation method provided in the embodiments of the present application.

In an optional implementation manner, the application scenario may include a server in addition to the terminal device 100 provided in the embodiment of the present application, where a network is provided between the server and the terminal device. Networks are used as the medium for providing communication links between terminal devices and servers. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

It should be understood that the number of terminal devices, networks, and servers is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server may be a server distribution system composed of a plurality of servers. Wherein, the terminal device interacts with the server through the network to receive or send messages and the like. The server may be a server that provides various services. Wherein the server may be used to perform the steps of a genome annotation method provided in the embodiments of the present application. In addition, when the terminal device executes a genome annotation method provided in the embodiment of the present application, a part of the steps may be executed at the terminal device, and a part of the steps may be executed at the server, which is not limited herein.

Based on this, a genome annotation method is provided in the embodiments of the present application. Referring to fig. 2, fig. 2 shows a schematic flowchart of a genome annotation method provided in an embodiment of the present application, which is described by taking the method as an example for being applied to the terminal device in fig. 1, and includes the following steps:

Step 102, obtaining sequencing data to be annotated, and splitting the sequencing data according to a gene chromosome field to obtain a plurality of pieces of sub-sequencing data;

The sequencing data to be annotated refers to data obtained by performing high-throughput sequencing on any sample that needs genome annotation. The sequencing data is a batch of (i.e. a large number of) sequences, usually in the form of vcf files; a vcf file is a text file that describes the results for SNPs (variations of a single base), INDELs (insertion-deletion markers) and SVs (structural variation sites).

The gene chromosome field is also called the gene chromosome position coordinate field, and comprises fields such as the chromosome number, the chromosome position, the base before mutation and the base after mutation.
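As a minimal sketch of how such a field could be read from a vcf data line, the following assumes the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, ...). The names `ChromField` and `parse_vcf_line` are hypothetical and are not taken from the application.

```python
from typing import List, NamedTuple


class ChromField(NamedTuple):
    """Hypothetical container for the gene chromosome field."""
    chrom: str  # chromosome number
    pos: int    # chromosome position
    ref: str    # base before mutation
    alt: str    # base after mutation


def parse_vcf_line(line: str) -> List[ChromField]:
    """Return one ChromField per ALT allele of a VCF data line."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, ref, alts = cols[0], int(cols[1]), cols[3], cols[4]
    # A VCF line may list several comma-separated ALT alleles.
    return [ChromField(chrom, pos, ref, alt) for alt in alts.split(",")]


line = "chr1\t12345\trs1\tA\tG,T\t50\tPASS\t."
fields = parse_vcf_line(line)
```

Each resulting tuple can serve as the key by which a variant is split and later looked up.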

Step 104, mapping each sub-sequencing data to a corresponding subunit gene database for searching and analyzing to obtain an annotation result; the subunit gene database is obtained by splitting a known gene database according to a gene chromosome field.

The known genome database refers to any currently published database related to genome annotation; it may be a genome database of humans, animals, or even pathogens. The subunit gene databases are formed by splitting a known genome database based on the gene chromosome field.

In addition, the gene chromosome field used to split the sequencing data to be annotated should be the same as that used to split the known genome database, and the sub-sequencing data formed after splitting should have the same size as the subunit gene database. The sub-sequencing data and the subunit gene database formed after splitting can each be regarded as a data matrix; having the same size means that the matrices hold the same amount of data along their rows and columns.

Mapping each sub-sequencing data to a corresponding subunit gene database refers to mapping each sub-sequencing data to a subunit gene database with the same splitting field as the sub-sequencing data and the same size.

For ease of understanding, a detailed embodiment is given. Assume that there are known genome databases $G_a, G_b, G_c, \ldots, G_z$. Splitting them yields the subunit gene databases $G_{a1}, G_{a2}, G_{a3}, \ldots, G_{an}$ (i.e., the subunit gene databases corresponding to $G_a$), $G_{b1}, G_{b2}, G_{b3}, \ldots, G_{bn}$ (corresponding to $G_b$), $G_{c1}, G_{c2}, G_{c3}, \ldots, G_{cn}$ (corresponding to $G_c$), ..., and $G_{z1}, G_{z2}, G_{z3}, \ldots, G_{zn}$ (corresponding to $G_z$). There is sequencing data $V_a$ to be annotated (in practice there is much sequencing data to be annotated; only one item is illustrated here), and splitting $V_a$ yields $V_{a(1,0)}, V_{a(1,1)}, V_{a(1,2)}, \ldots, V_{a(1,n)}$. Then $V_{a(1,0)}, V_{a(1,1)}, V_{a(1,2)}, \ldots, V_{a(1,n)}$ are mapped to the corresponding subunit gene databases for search and analysis. The result is shown in FIG. 3: during the mapping lookup, $V_{a(1,0)}$ only reaches $G_{a1}, G_{b1}, G_{c1}, \ldots$, and $V_{a(1,1)}$ only reaches $G_{a2}, G_{b2}, G_{c2}, \ldots$
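The lookup pattern of this example can be sketched in a few lines. The sketch assumes, purely for illustration, that the split key is the chromosome name alone and that each database is a dictionary keyed by (chromosome, position, base before, base after); the toy databases `G_a` and `G_b`, their contents, and the helper names are hypothetical.

```python
from collections import defaultdict


def split_db(db):
    """Split a database {key: annotation} into subunits keyed by chromosome."""
    parts = defaultdict(dict)
    for key, ann in db.items():
        parts[key[0]][key] = ann
    return dict(parts)


def split_variants(variants):
    """Split variant keys into pieces of sub-sequencing data by chromosome."""
    parts = defaultdict(list)
    for variant in variants:
        parts[variant[0]].append(variant)
    return dict(parts)


# Toy known databases (illustrative values only).
G_a = {("chr1", 100, "A", "G"): {"freq": 0.01},
       ("chr2", 200, "C", "T"): {"freq": 0.20}}
G_b = {("chr1", 100, "A", "G"): {"disease": "example phenotype"}}
subunit_dbs = {"G_a": split_db(G_a), "G_b": split_db(G_b)}

V_a = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr2", 999, "G", "A")]


def annotate(variants, subunit_dbs):
    results = {}
    for chrom, sub_variants in split_variants(variants).items():
        for variant in sub_variants:
            merged = {}
            for parts in subunit_dbs.values():
                # Only the subunit database with the same split key is searched.
                hit = parts.get(chrom, {}).get(variant)
                if hit:
                    merged.update(hit)
            results[variant] = merged
    return results


annotations = annotate(V_a, subunit_dbs)
```

Because a variant on one chromosome never touches the subunits of another, the search space per lookup shrinks with the number of subunits.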

The genome annotation method provided by the embodiment of the application comprises the steps of obtaining sequencing data to be annotated, splitting the sequencing data according to gene chromosome fields to obtain a plurality of pieces of sub-sequencing data; mapping each sub-sequencing data to a corresponding subunit gene database for searching and analyzing so as to obtain an annotation result, wherein the subunit gene database is obtained by splitting a known gene database according to a gene chromosome field; according to the genome annotation method, the gene database of billion-level data is divided into the subunit gene database with less data volume, then the sequencing data to be annotated is divided into the sub-sequencing data with the same size and type as the subunit gene database, and the sub-sequencing data is compared with the corresponding subunit gene database when annotation is carried out, so that the query range is greatly reduced, and the calculation time and the calculation amount are saved.

In one embodiment, the subunit gene database is obtained by:

acquiring a known genome database, a gene chromosome field and a subunit data volume parameter; and splitting each genome database according to the gene chromosome field and the subunit data quantity parameters to obtain the subunit gene database.

In one embodiment, the subunit gene database is calculated by the following formula:

wherein $G_{a1}, G_{a2}, G_{a3}, \ldots, G_{an}$ respectively represent the subunit gene databases; X represents the gene chromosome field; P represents the subunit data volume parameter, which is used for adjusting the size of the subunit gene database.

The subunit data volume parameter is a parameter for adjusting the data volume of a subunit, or the size of its data matrix. In practical applications, it is generally set between $2^3$ and $2^{10}$ according to the actual data volume of the database. When a known genome database is split, not only the gene chromosome field but also the size of the subunit gene databases is taken into account.
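The exact splitting formula is not reproduced in this text, so the following is only one possible reading, stated as an assumption: each record is assigned to a subunit by its chromosome together with its position modulo P. The rule `subunit_key` and the toy records are hypothetical; the point of the sketch is merely that a larger P yields more, smaller subunits.

```python
from collections import defaultdict


def subunit_key(chrom: str, pos: int, p: int):
    """Hypothetical split rule: chromosome plus position modulo P."""
    return (chrom, pos % p)


def split_database(records, p):
    """Split (chrom, pos, ref, alt, annotation) records into subunits."""
    subunits = defaultdict(list)
    for rec in records:
        subunits[subunit_key(rec[0], rec[1], p)].append(rec)
    return dict(subunits)


# 64 toy records on one chromosome, positions 1..64.
db = [("chr1", pos, "A", "G", f"ann{pos}") for pos in range(1, 65)]

sizes_p8 = {k: len(v) for k, v in split_database(db, 2 ** 3).items()}
sizes_p16 = {k: len(v) for k, v in split_database(db, 2 ** 4).items()}
```

With P = 2^3 the toy database falls into 8 subunits of 8 records each; with P = 2^4 it falls into 16 subunits of 4 records each. Applying the same rule to the sequencing data keeps both sides aligned.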

In some embodiments, the sequencing data to be annotated comprises a plurality of vcf files, the sub-sequencing data comprising sub-unit vcf files; the method for obtaining the plurality of sub-sequencing data by splitting the sequencing data according to the gene chromosome field comprises the following steps: splitting each vcf file according to the file variable parameters to obtain a plurality of sub vcf files; and splitting each sub-vcf file according to the gene chromosome field and the data quantity parameters of the sub-units to obtain a plurality of sub-unit vcf files.

In some embodiments, the child vcf file is obtained by the following formula:

wherein $V_a$ represents the a-th vcf file, $V_{a1}, V_{a2}, V_{a3}, \ldots, V_{an}$ respectively represent the sub-vcf files, and K represents the file variable parameter.

In particular, the sequencing data to be annotated comprises a plurality of vcf files, e.g., $V_a, V_b, V_c, \ldots, V_z$. In an optional embodiment, the number of vcf files read each time may differ, and may be determined according to the hardware configuration of the server.

The file variable parameter refers to a parameter for adjusting the size of the sub-vcf files. Specifically, after a vcf file is obtained, the file first needs to be split, usually by the formula referred to above, where $V_a$ represents the a-th vcf file, $V_{a1}, V_{a2}, V_{a3}, \ldots, V_{an}$ respectively represent the sub-vcf files, and K represents the file variable parameter. Only $V_a$ is taken as an example here; other vcf files are split in the same way as $V_a$. In addition, the value of the file variable parameter K may be the same or different when different vcf files are split, and is determined according to actual requirements.
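As a rough sketch of this first-level split, the following assumes that K is the desired number of sub-vcf files and that each sub-file keeps the original header lines; the application does not state how K enters the formula, so both points are assumptions. The function name `split_vcf_text` is hypothetical.

```python
def split_vcf_text(vcf_text: str, k: int):
    """Split VCF text into up to k sub-VCF texts, each keeping the header."""
    lines = vcf_text.splitlines()
    header = [ln for ln in lines if ln.startswith("#")]
    body = [ln for ln in lines if ln and not ln.startswith("#")]
    # Ceiling division: number of data lines per sub-file.
    chunk = -(-len(body) // k) if body else 0
    subs = []
    for start in range(0, len(body), chunk or 1):
        subs.append("\n".join(header + body[start:start + chunk]) + "\n")
    return subs


vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    + "".join(f"chr1\t{p}\t.\tA\tG\t.\t.\t.\n" for p in range(1, 6))
)
sub_vcfs = split_vcf_text(vcf, 2)
```

Here five data lines are divided into two sub-files of three and two lines; a larger K produces smaller sub-files.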

After the sub-vcf files are produced by splitting, each sub-vcf file needs to be further split using the gene chromosome field and the subunit data volume parameter. For example:

wherein $V_{a1}$ represents the first sub-vcf file of $V_a$; $V_{a(1,0)}, V_{a(1,1)}, V_{a(1,2)}, \ldots, V_{a(1,n)}$ respectively represent the subunit vcf files; X represents the gene chromosome field; and P represents the subunit data volume parameter.

In some embodiments, the step of mapping each piece of sub-sequencing data into a corresponding subunit gene database for search and analysis comprises: mapping each piece of sub-sequencing data to the corresponding subunit gene database by means of a Hash algorithm for search and analysis.

Specifically, a Hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length; this smaller binary value is called a hash value. A hash value is an extremely compact numerical representation of a piece of data, and is typically used for fast lookup and in encryption algorithms. In this embodiment, using the Hash algorithm when mapping each piece of sub-sequencing data to the corresponding subunit gene database for search and analysis allows the data lookup to be completed quickly.
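One way such hash-based routing could look is sketched below. The application does not name a particular hash function; CRC32 from the Python standard library is used here only as a stable stand-in, and the bucket count `N`, the function `bucket_of`, and the toy annotations are all hypothetical.

```python
import zlib


def bucket_of(chrom, pos, ref, alt, n_buckets):
    """Map a gene chromosome field to a bucket index with a stable hash."""
    key = f"{chrom}:{pos}:{ref}:{alt}".encode("utf-8")
    # zlib.crc32 gives the same value in every process, unlike the
    # built-in hash() of str, which is salted per interpreter run.
    return zlib.crc32(key) % n_buckets


N = 8  # hypothetical number of subunit gene databases
db_buckets = [dict() for _ in range(N)]

known = {("chr1", 100, "A", "G"): "benign",
         ("chr7", 5500, "C", "T"): "pathogenic"}
for field, annotation in known.items():
    db_buckets[bucket_of(*field, N)][field] = annotation


def lookup(variant):
    """Search only the single bucket the variant hashes to."""
    return db_buckets[bucket_of(*variant, N)].get(variant)
```

Because the database records and the variants are routed by the same function, `lookup` inspects one bucket instead of the whole database.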

In one embodiment, the method further comprises: performing, on a distributed system, the splitting of the sequencing data according to the gene chromosome field and/or the mapping of each piece of sub-sequencing data to the corresponding subunit gene database for search and analysis.

A distributed system is a system of computer nodes that communicate over a network and work in concert to accomplish a common task. Distributed systems emerged so that inexpensive, ordinary machines could together perform computing and storage tasks that a single computer cannot; the purpose is to process more data with more machines. In this embodiment, a distributed system is used when splitting the sequencing data according to the gene chromosome field and when mapping each piece of sub-sequencing data to the corresponding subunit gene database for search and analysis, so that the data processing capacity can be greatly increased and the data processing time reduced.
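Since each piece of sub-sequencing data is searched only against its own subunit gene database, the pieces are independent and can be processed in parallel. The sketch below uses a local thread pool from the Python standard library as a stand-in for a distributed system; a real deployment would dispatch the same tasks to separate nodes. The partition contents and the name `annotate_partition` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_partition(task):
    """Annotate one piece of sub-sequencing data against its subunit database."""
    sub_variants, sub_db = task
    return {variant: sub_db.get(variant) for variant in sub_variants}


# Each entry pairs sub-sequencing data with the matching subunit database.
partitions = {
    "chr1": ([("chr1", 100, "A", "G")],
             {("chr1", 100, "A", "G"): "ann1"}),
    "chr2": ([("chr2", 200, "C", "T"), ("chr2", 300, "G", "A")],
             {("chr2", 200, "C", "T"): "ann2"}),
}

with ThreadPoolExecutor(max_workers=2) as pool:
    partial_results = list(pool.map(annotate_partition, partitions.values()))

# Merge the per-partition results into one annotation table.
merged = {}
for result in partial_results:
    merged.update(result)
```

No coordination between workers is needed beyond the final merge, which is what makes the split-then-map design suitable for distribution.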

It should be understood that, although the steps in the flowchart of FIG. 2 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

The embodiments disclosed in the present application describe a genome annotation method in detail, and the methods disclosed in the present application can be implemented by various devices, so the present application also discloses a genome annotation apparatus corresponding to the methods, and the detailed description of the specific embodiments is given below.

Please refer to fig. 4, which is a genome annotation apparatus disclosed in the embodiment of the present application, and mainly includes:

a sequencing data acquisition module 402, configured to acquire sequencing data to be annotated;

a sequencing data splitting module 404, configured to split the sequencing data according to the gene chromosome field to obtain multiple pieces of sub-sequencing data;

a mapping module 406, configured to map each sub-sequencing data into a corresponding subunit gene database for search and analysis, so as to obtain an annotation result; wherein the subunit gene database is obtained by splitting a known gene database according to gene chromosome fields.

In one embodiment, the apparatus further comprises a parameter acquisition module:

the parameter acquisition module is used for acquiring a known genome database, a gene chromosome field and a subunit data quantity parameter;

and the sequencing data splitting module is also used for splitting each genome database according to the gene chromosome field and the subunit data quantity parameters to obtain the subunit gene database.

In one embodiment, the apparatus further comprises a subunit gene database obtaining module:

the subunit gene database obtaining module is used for obtaining the subunit gene database through calculation according to the following formula:

wherein $G_{a1}, G_{a2}, G_{a3}, \ldots, G_{an}$ respectively represent the subunit gene databases; X represents the gene chromosome field; P represents the subunit data volume parameter, which is used for adjusting the size of the subunit gene database.

In some embodiments, the sequencing data to be annotated comprises a plurality of vcf files, the sub-sequencing data comprising sub-unit vcf files; the sequencing data splitting module further comprises:

the vcf file splitting module is used for splitting each vcf file according to the file variable parameter to obtain a plurality of sub-vcf files;

and the sub-vcf file splitting module is used for splitting each sub-vcf file according to the gene chromosome field and the sub-unit data quantity parameter to obtain a plurality of sub-unit vcf files.

In some embodiments, the apparatus further comprises:

the child vcf file obtaining module is used for obtaining a child vcf file through the following formula:

In some embodiments, the mapping module is further configured to use a Hash algorithm to map each sub-sequencing data into a corresponding subunit gene database for lookup analysis.

In one embodiment, the sequencing data splitting module is further configured to perform the splitting of the sequencing data according to the gene chromosome field on a distributed system;

and/or

the mapping module is further configured to perform, on a distributed system, the mapping of each piece of sub-sequencing data to the corresponding subunit gene database for search and analysis.

For the specific definition of the genome annotation device, see the above definition of the method, which is not repeated herein. The various modules in the above-described apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent of a processor in the terminal device, and can also be stored in a memory in the terminal device in a software form, so that the processor can call and execute operations corresponding to the modules.

Referring to FIG. 5, FIG. 5 is a block diagram illustrating the structure of a visualization platform according to an embodiment of the present application. The visualization platform in the present application may include one or more of the following components: a client, a server, a memory (i.e. the server hard disk in the figure), known gene databases (e.g. Ga, Gb, Gc), and one or more application programs, wherein the one or more application programs may be stored in the memory and configured to be executed by the one or more servers, and are configured to execute the methods described in the above genome annotation method embodiments. In addition, the client may exchange information with the server; for example, the client sends an annotation request, and after the annotation is completed, the server sends the annotated file to the client. There may be many clients, and the WES generates a cluster. When sending an annotation request, a client transmits a file to the server; the server may read the files (namely Va to Vf in the figure) in queue mode and split each of them, for example Va. After splitting, the resulting data is mapped to and searched in the subunit gene databases (not shown in the figure), which are stored on the server hard disk and formed by splitting the known gene databases, to complete the annotation; the annotation result is then fed back to the client.
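The queue-mode handling of annotation requests can be illustrated with a first-in, first-out queue. This is a simplified, single-process sketch: the client names, file names, and the placeholder `annotate_fn` are hypothetical, and a real server would run the split-and-map annotation in place of the placeholder.

```python
from collections import deque


def serve(requests, annotate_fn):
    """Process (client, filename) annotation requests in arrival order."""
    pending = deque(requests)
    completed = []
    while pending:
        client, filename = pending.popleft()  # oldest request first
        completed.append((client, filename, annotate_fn(filename)))
    return completed


# Placeholder annotation step that only renames the file.
done = serve(
    [("client1", "Va.vcf"), ("client2", "Vb.vcf")],
    lambda name: name + ".annotated",
)
```

Each completed entry records which client the annotated file should be returned to.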

Further, the server may include one or more processors. The processor connects the various parts of the server through various interfaces and lines, and performs the functions of the server and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory and by calling data stored in the memory. Alternatively, the processor may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It is to be understood that the modem may also be implemented by a separate communication chip rather than being integrated into the processor.

The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code, code sets, or instruction sets. The memory may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the method embodiments described above, and the like. The data storage area may store data created by the visualization platform in use, and the like.

Those skilled in the art will appreciate that the structure shown in fig. 5 is a block diagram of only a portion of the structure relevant to the present application, and does not constitute a limitation on the terminal device to which the present application is applied, and a particular terminal device may include more or less components than those shown in the drawings, or combine certain components, or have a different arrangement of components.

In summary, the visualization platform provided in the embodiment of the present application is used to implement the corresponding genome annotation method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.

Referring to fig. 6, a block diagram of a computer-readable storage medium according to an embodiment of the present disclosure is shown. The computer-readable storage medium 60 stores program code that can be called by a processor to execute the methods described in the genome annotation method embodiments.

The computer-readable storage medium 60 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 60 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 60 has storage space for program code 62 for performing any of the method steps of the methods described above. The program code can be read from or written into one or more computer program products. The program code 62 may, for example, be compressed in a suitable form.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by one skilled in the art provided they do not contradict one another.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for genome annotation, the method comprising:

obtaining sequencing data to be annotated, and splitting the sequencing data according to a gene chromosome field to obtain a plurality of pieces of sub-sequencing data;

mapping each sub-sequencing data to a corresponding subunit gene database for searching and analyzing to obtain an annotation result; wherein the subunit gene database is obtained by splitting a known gene database according to a gene chromosome field;

the sequencing data to be annotated comprises a plurality of vcf files, and the sub-sequencing data comprises subunit vcf files; the step of splitting the sequencing data according to the gene chromosome field to obtain a plurality of pieces of sub-sequencing data comprises the following steps:

splitting each vcf file according to a file variable parameter to obtain a plurality of sub-vcf files;

splitting each sub-vcf file according to the gene chromosome field and the subunit data size parameter to obtain a plurality of subunit vcf files;

the subunit gene database is obtained by:

acquiring a known genome database, a gene chromosome field and a subunit data size parameter; the subunit data size parameter is a parameter for adjusting the size of the subunit gene database;

splitting each genome database according to the gene chromosome field and the subunit data size parameter to obtain the subunit gene databases;

the subunit gene database is calculated by the following formula:

wherein G_a1, G_a2, G_a3, ..., G_an respectively represent the subunit gene databases; X represents the gene chromosome field; and P represents the subunit data size parameter for adjusting the size of the subunit gene database.
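As a non-limiting illustration of the splitting recited in claim 1, the following Python sketch groups records by a chromosome field X and then cuts each group into subunits bounded by a size parameter P. The function name `split_by_chromosome`, the record layout, and the reading of P as a maximum record count per subunit are assumptions made for illustration only; they are not taken from the application, and the same sketch could apply equally to database records or to vcf records.

```python
from collections import defaultdict

def split_by_chromosome(records, chrom_field="CHROM", max_records=2):
    """Group records by chromosome, then cut each group into subunits.

    chrom_field plays the role of X; max_records is an assumed
    interpretation of P (maximum records per subunit).
    """
    by_chrom = defaultdict(list)
    for rec in records:
        by_chrom[rec[chrom_field]].append(rec)

    subunits = {}
    for chrom, recs in by_chrom.items():
        for start in range(0, len(recs), max_records):
            # Key each subunit by (chromosome, index of the chunk).
            subunits[(chrom, start // max_records)] = recs[start:start + max_records]
    return subunits

# Hypothetical records standing in for database or vcf entries.
records = [
    {"CHROM": "chr1", "POS": 100},
    {"CHROM": "chr1", "POS": 250},
    {"CHROM": "chr1", "POS": 900},
    {"CHROM": "chr2", "POS": 40},
]
subunits = split_by_chromosome(records, "CHROM", 2)
```

Because the database and the sequencing data are split with the same field and the same size rule, a subunit of sequencing data only ever needs to be compared against subunits carrying the same chromosome.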

2. The method of claim 1, wherein the sub vcf file is obtained by the following formula:

wherein V_a represents the a-th vcf file; V_a1, V_a2, V_a3, ..., V_an respectively represent the sub-vcf files; and K represents the file variable parameter.

3. The method of claim 1 or 2, wherein the step of performing a lookup analysis by mapping each of the sub-sequencing data to a corresponding subunit gene database comprises:

transmitting each sub-sequencing data to the corresponding subunit gene database by using a hash algorithm for searching and analyzing.
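As a non-limiting illustration of the hash-based mapping recited in claim 3, the following Python sketch routes both the known records and a queried variant through the same stable hash of the chromosome name, so that a query only touches one bucket. The helper names `bucket_for` and `annotate`, the bucket count, the choice of SHA-256, and the sample records are all assumptions for illustration; the application does not specify which hash algorithm is used.

```python
import hashlib

def bucket_for(chrom, n_buckets):
    """Map a chromosome name to a bucket index with a stable hash."""
    digest = hashlib.sha256(chrom.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

N_BUCKETS = 4  # assumed number of subunit databases

# Hypothetical known records: (chrom, pos, ref, alt) -> annotation.
known = {
    ("chr1", 100, "A", "G"): "benign",
    ("chr2", 40, "C", "T"): "pathogenic",
}

# Place every known record in the bucket chosen by its chromosome.
buckets = [dict() for _ in range(N_BUCKETS)]
for key, note in known.items():
    buckets[bucket_for(key[0], N_BUCKETS)][key] = note

def annotate(variant):
    """Look the variant up only in the bucket its chromosome hashes to."""
    shard = buckets[bucket_for(variant[0], N_BUCKETS)]
    return shard.get(variant)

result = annotate(("chr2", 40, "C", "T"))
missing = annotate(("chr1", 101, "A", "G"))
```

A cryptographic digest is used here instead of Python's built-in `hash` because the built-in string hash is randomized per process, which would route the same chromosome to different buckets on different machines.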

4. The method of claim 3, further comprising:

the process of splitting the sequencing data according to the gene chromosome field and/or mapping each of the sub-sequencing data to a corresponding subunit gene database for searching and analyzing is performed on a distributed system.
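As a non-limiting illustration of why the per-subunit lookups of claim 4 lend themselves to distributed execution, the following Python sketch annotates independent (sub-sequencing data, subunit database) pairs concurrently. A local thread pool is used purely as a stand-in for a distributed system; the function name `annotate_subunit`, the task layout, and the sample records are assumptions for illustration and are not taken from the application.

```python
from concurrent.futures import ThreadPoolExecutor

def annotate_subunit(task):
    """Annotate one subunit: look each variant up in its own subunit database."""
    variants, database = task
    return [(v, database.get(v)) for v in variants]

# Hypothetical tasks: each pairs sub-sequencing data with its subunit database.
tasks = [
    ([("chr1", 100, "A", "G")],
     {("chr1", 100, "A", "G"): "benign"}),
    ([("chr2", 40, "C", "T"), ("chr2", 41, "G", "A")],
     {("chr2", 40, "C", "T"): "pathogenic"}),
]

# Each task touches only its own data, so tasks can run independently.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_subunit = list(pool.map(annotate_subunit, tasks))

# Merge the per-subunit results into one annotation list (input order is kept).
annotations = [pair for chunk in per_subunit for pair in chunk]
```

No task reads another task's database, so in an actual deployment the same function could be dispatched to separate worker nodes without coordination beyond the final merge.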

5. A genome annotation apparatus, wherein the apparatus comprises:

the sequencing data acquisition module is used for acquiring sequencing data to be annotated;

the mapping module is used for mapping each sub-sequencing data to a corresponding subunit gene database for searching and analyzing to obtain an annotation result; wherein the subunit gene database is obtained by splitting a known gene database according to a gene chromosome field;

the sequencing data to be annotated comprises a plurality of vcf files, and the sub-sequencing data comprises subunit vcf files; the sequencing data splitting module further comprises:

the sub-vcf file splitting module is used for splitting each sub-vcf file according to the gene chromosome field and the subunit data size parameter to obtain a plurality of subunit vcf files;

further comprising: the parameter acquisition module is used for acquiring a known genome database, a gene chromosome field and a subunit data size parameter; the subunit data size parameter is a parameter for adjusting the size of the subunit gene database;

the sequencing data splitting module is used for splitting each genome database according to the gene chromosome field and the subunit data size parameter to obtain the subunit gene databases; the subunit gene database is calculated by the following formula:

6. A visualization platform, comprising:

one or more clients; one or more servers coupled to the memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more servers, the one or more applications configured to perform the method of any of claims 1-4.

7. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 4.