CN115662523B - Group-oriented genome index representation and construction method and equipment - Google Patents

Group-oriented genome index representation and construction method and equipment Download PDF

Info

Publication number
CN115662523B
CN115662523B CN202211295056.5A CN202211295056A CN115662523B CN 115662523 B CN115662523 B CN 115662523B CN 202211295056 A CN202211295056 A CN 202211295056A CN 115662523 B CN115662523 B CN 115662523B
Authority
CN
China
Prior art keywords
minimizer
node
index
genome
mer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211295056.5A
Other languages
Chinese (zh)
Other versions
CN115662523A (en
Inventor
国宏哲
李高阳
刘博�
王亚东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202211295056.5A priority Critical patent/CN115662523B/en
Publication of CN115662523A publication Critical patent/CN115662523A/en
Application granted granted Critical
Publication of CN115662523B publication Critical patent/CN115662523B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method and equipment for representing and constructing group genome index belong to the field of combination of gene technology and computer technology. The invention aims to solve the problem that an effective index structure cannot be constructed for PB-class group genome data by the existing genome index structure construction method. Constructing a de Bruijn graph model index representation for a reference genome, and determining each unique path unipath; overlapping and dividing a reference genome according to a fixed length interval, wherein every two adjacent local areas are partially overlapped, and aiming at a haplotype of a certain individual, collecting variation in each local area to generate a local variation sequence, and further generating a variation sequence index file alt string; and respectively converting the unipath list and the alt string list into character string lists, and carrying out representation and construction of group genome indexes based on minimizer.

Description

Group-oriented genome index representation and construction method and equipment
Technical Field
The invention belongs to the field of combination of gene technology and computer technology, and particularly relates to a method and equipment for representing and constructing group-oriented genome indexes.
Background
Genome sequencing costs are declining year by year with advances in sequencing means, and the scale of single human genome projects currently underway is advancing to the order of hundreds of thousands and millions. The continued advancement and implementation of these large genome projects, accompanied by explosive growth of genome sequencing data, will produce data sizes and corresponding data analysis sizes that will also reach PB-ZB scale. The explosive growth in data volume and data analysis requirements has also led to a number of problems becoming more pronounced and urgent to be resolved.
The design and implementation of the genome big data index structure are the basis and core support of downstream data analysis, and the functions and performances of the index structure directly influence the data analysis speed and accuracy of genome sequence comparison, sequence splicing, mutation detection and the like. Currently, the existing indexing technology is a reference genome oriented to a single species, and a sequence diagram model indexing structure is mostly adopted, so that the memory occupation is large, and the number of samples of the genome of an integrated population is limited. The existing method cannot construct an effective index structure for PB-level group genome data, cannot support increasing genome big data analysis, and severely restricts gene data value mining and industrial transformation.
Disclosure of Invention
The invention aims to solve the problem that an effective index structure cannot be constructed for PB-class group genome data by the existing genome index structure construction method.
A method of population-oriented genome index representation and construction, comprising the steps of:
step one, data collection and pretreatment:
acquiring human reference genome data and formulated mutation data known to contain different types of mutation, and performing normalized operation pretreatment such as redundant data removal and the like on the two data files;
step two, constructing a de Bruijn graph model index representation by reference to a genome:
traversing the complete reference genome, extracting all k-mer sequences, simultaneously recording four tuples (k-mer sequences, incoming edge bases, outgoing edge bases and offset positions) corresponding to each k-mer, constructing a de Bruijn graph model based on the four tuples corresponding to each k-mer, extracting the association relationship between different k-mer sequences of input data and the k-mer sequences, wherein de Bruin graph G= < V, E > is a directed graph model, nodes V in the graph are short sequence fragments k-mers with different lengths of k, and all different k-mers on the extracted genome form node sets V= { V1, V2, …, vM }; for two nodes vi and vj, if vi overlaps with vj with the sequence of (k-1) -mer, then there is a directed edge vi→vj;
each k-mer sequence represents a node of the de Bruijn graph model, a unique path unipath is determined according to the number of the incoming edges and the outgoing edges of the node, the incoming degree and the outgoing degree of the node are respectively defined as the number of the incoming edges and the outgoing edges of the node, and one or more incoming edges and one or more outgoing edges of the node can exist; a path which can be formed by any two nodes in the graph G is a unique path unipath if the entrance degree of the initial node of the path is greater than 1 and the exit degree is 1, the exit degree of the end node is greater than 1 and the entrance degree is 1, the entrance degree of the middle node is 1 and the exit degree is 1;
utilizing all unique paths generated; for each path, the nodes thereon have associated therewith a set of generated k-mer sequence positions; traversing each node from the beginning to the back aiming at a certain path, acquiring two position sets of a current node and a subsequent node, and finding out similar position elements of the two sets; because each position set is an ordered set, traversing certain set elements, adopting a binary search method, finding out all similar positions and forming a new position set; according to the method, each node on the path is traversed backwards in sequence, the position searching and the set merging operation are executed after each new node is traversed until the end node of the path is encountered, and the position set formed finally is the position set corresponding to the current unique path;
generating a position set corresponding to all unique paths according to the method, and representing the position set as a binary index file and F-pos;
step three, expressing and constructing the local variant sequence index:
firstly, overlapping and dividing a reference genome according to fixed-length intervals, wherein each divided fixed-length interval is used as a local area, each two adjacent local areas are partially overlapped, and the length of the overlapped area is half of that of the divided area;
for a haplotype of an individual, collecting the variation in each local region and forming a local individual genome sequence with known variation, namely generating a local variation sequence; the length of the overlapping area is half of the length of the dividing area;
generating variant sequence index files by the method, wherein each variant sequence index file is correspondingly used as an alt string;
step four, for 'unipath' and 'alt string', minizer-based group genome index representation and construction:
before indexes are built for 'unipath' and 'alt string', respectively converting a 'unipath' diagram and an 'alt string' list into a character string list; then, selecting a k-mer with the smallest hash value in a sliding window as a local representative k-mer according to a minimizer selection criterion for each character string in the character string list; these local minimum k-mers, known as minimizer, will be stored in a unified list;
the minimizer lists corresponding to all the character strings are combined into a complete list, and the lists are ordered according to the hash value of the minimizer; then, constructing a secondary index of the minimizer according to the high K bit of the hash value of each minimizer, wherein K is a secondary index parameter;
the secondary index partitions the overall index of the minimizer into b=2 ^K Each cell, when assuming that the total number of minimizer is M, each cell contains M/B minimizer on average; each constructed minimizer occupies 96 bits of storage space, wherein the last Q bits of the hash value of the minimizer occupy 30 bits of storage space; the minimizer's position on the original "uniath" or "alt string" list occupies 65 bits of space.
Further, when all k-mer sequences are extracted by traversing the complete reference genome, the k-mer sequences need to be sequenced and de-duplicated, and the specific process comprises the following steps:
according to the sequence content of the first f bp of the k-mer as a file name, wherein bp represents the total quantity of bases, writing the quadruple corresponding to the k-mer sequence into a corresponding temporary file;
then, importing all k-mer sequences in a file into a memory, performing quick sequencing on k-mers in the file in the current memory to enable quadruplets corresponding to the k-mer sequences in the file to be ordered, and outputting the quadruplets to a hard disk to form a corresponding ordered file;
quick sequencing: traversing the k-mer set in the current file from the beginning, merging adjacent identical k-mer sequences, and merging offset position combinations corresponding to the k-mers;
and sequentially importing all files into a memory and rapidly sequencing the files to generate all different k-mer sequences and ensure that the position sets corresponding to the k-mers are orderly.
Further, the process of determining the unique path unipath according to the number of the incoming edges and the outgoing edges of the nodes comprises the following steps:
firstly, according to the quantity of the incoming edges and the outgoing edges of the nodes, the nodes are divided into the following types:
1) 'X' -type node: a plurality of edges and a plurality of edges;
2) 'FY' node: a plurality of edges and an edge;
3) 'RY' -type node: one in side and a plurality of out sides;
4) 'L' -shaped node: one in side and one out side;
the way in which the start and end nodes of a unique path are formed is several possibilities:
(1) The start and end nodes are the same 'X' -type node;
(2) The start node is a 'FY' type node and the end node is a 'RY' type node;
(3) The starting node is a successor node of an 'X' type node or a successor node of an 'FY' type node and the ending node is a predecessor node of an 'X' type node or a predecessor node of an 'RY' type node;
then traversing all nodes in the graph, and if the current node is the initial node of the unique path, executing backward extension operation; backward extending operation: calculating to obtain a rear-drive node according to the edge information of the current node, if the rear-drive node is an L-shaped node, continuing to extend backwards, and sequentially cycling according to the method until encountering an end node; all unique paths of the graph model can be generated by the method, and each node in the paths is ensured not to be repeated;
all the unique path index files are generated through the method, and each unique path index file is correspondingly used as a unipath.
Further, the specific process of selecting the k-mer with the smallest hash value in the sliding window in the fourth step comprises the following steps:
the minimizer is selected with two key parameters, namely window size and k-mer length; assuming that the sequence of the index to be constructed is L and the length of K-mers is L_K, when the sliding window is selected, (L-L_K+1) K-mers are selected; assuming that the windows size is W, selecting one min_k-mer1 with the smallest hash value from the 5 k-mers with the numbers of 1-5; then, selecting the smallest min_k-mer2 from the 5 of 2 to 6; then 3-7 and so on; the min_k-mers selected by adjacent different minimizer selection windows may be identical, and each repetition is recorded only once; the min_k-mer is called a minimizer.
Further, q=minizer_len in the last Q bits of the hash value is 2-K, where minizer_len represents the MINIMIZER length.
Further, when the minimizer-based group genome index representation and construction are performed for "unipath", the stored data structure of minimizer in unipath graph model is as follows:
the minimizer index of the constructed unpath sequence has a 65bit position area besides a 30bit minimizer storage space;
the location area is: wherein 32 bits are used for storing the ID of the UNITIG to which the minimizer belongs, 32 bits are used for storing the position of the UNITIG to which the minimizer belongs, and 1bit is used for storing the direction of the minimizer relative to the UNITIG.
Further, when the minimizer-based population genome index representation and construction is performed for "alt string", the stored data structure of the minimizer for alt string is as follows:
each item of minimizer of "alt string" is stored as 96bit data structure, all minimzers together form data table; the minimizer belonging to "alt string" has its coordinates uniquely marked as follows:
each minimizer contains a 65-bit location area, wherein,
(a) The 25 bits are used for storing the ID of the window block to which the minimizer belongs;
(b) 19 bits store the ID of the replotype where the minimizer is located;
(c) 22 bits are used to store the location on the replotype;
(d) The 1bit is used to store the orientation of the minimizer relative to the orientation on the replotype.
Further, minimizer-based population genome index representation and construction for "alt string" is performed by constructing an index for only sweet spot variations in the middle of each window block that are close to each other, the sweet spot being a segment of each window block from 25% L to 75% L.
Further, the method also comprises a minimizer searching process, wherein the minimizer searching process comprises the following steps:
firstly, generating a minimizer list of the read by using a minimizer generation algorithm; then, aiming at a certain minimizer, the first K bits of the hash value are acquired and positioned in a cell of the minimizer index, wherein the cell comprises M/B minimzers on average; then, in the cell, using the rest Q bits as KEY to perform binary search;
after finding the specific position of the minimizer in the index, resolving the position of the minimizer of the read in the uniath figure or alt string list through a 65-bit position area of the minimizer; if the minimizer index of the uniath figure is searched, resolving the UNITIG index into the ID of the corresponding UNITIG, the position on the UNITIG and the direction of the minimizer relative to the UNITIG; if the minimizer index of the "alt string" list is looked up, then the parse is: the ID of the window block, the ID of the replotype, the position on the replotype, and the orientation of the minimizer relative to the replotype, i.e. the coordinates of the minimizer on the WB structure.
A population-oriented genome index representation and construction apparatus, the apparatus comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement the population-oriented genome index representation and construction method.
The beneficial effects are that:
the invention provides a group-oriented genome index representation and construction method, which solves the performance bottleneck problem of the existing genome index structure. The invention can organize and represent PB-level large-scale group genome data and support data analysis tasks such as rapid sequence comparison, sequence splicing and the like. Meanwhile, the genome sequence information of the large-scale population is effectively integrated, and the problem of systematic deviation caused by a single genome is solved. The invention makes full use of and organizes the repeated sequences of the group genome to construct a graph model based on a single path representation method, realizes the construction of the graph model on a large-scale group genome in a limited memory space, effectively breaks through the bottleneck of the number scale of the index group genome and greatly improves the application value of the group genome index structure.
Drawings
FIG. 1 is a schematic diagram of a population-oriented genome index representation and construction method.
Fig. 2 is a schematic diagram of partitioning of window blocks.
FIG. 3 is a schematic diagram of unitary compression within a window block.
Detailed Description
The first embodiment is as follows: the present embodiment will be described with reference to figure 1,
the embodiment is a method for representing and constructing group-oriented genome indexes, which comprises the following steps:
step one, data collection and pretreatment:
firstly, the latest version of human reference genome data and the formulated mutation data known to contain different types of mutation are acquired, and normalized operation pretreatment such as redundant data removal is carried out on the two data files.
Step two, constructing a de Bruijn graph model index representation by reference to a genome:
the de Bruijn map model is an important data structure that represents and organizes genomic and sequencing data, and is widely used in numerous genomic sequence analysis tasks. The de Bruijn map model effectively organizes the repeat sequences of the reference genome. de Bruin graph G= < V, E > is a directed graph model, in which node V is a short sequence segment k-mer of length k different from each other, and all different k-mers on the genome are extracted to form node set V= { V1, V2, …, vM }. For two nodes vi and vj, if vi overlaps with vj with the sequence of (k-1) -mer, then there is a directed edge vi→vj. The ingress and egress of a node are defined as the number of ingress and egress edges, respectively, of the node, and there may be one or more ingress and egress edges of the node. A path that any two nodes in the graph G can form is a unique path if the start node ingress of the path is greater than 1 and the egress is 1, the end node egress is greater than 1 and the ingress is 1, the intermediate node ingress is 1 and the egress is 1. Thus, constructing the de Bruijn graph model requires extracting the different k-mer sequences of the input data and the associations between the k-mer sequences (edges in the graph model).
Step two, generating all different k-mer sequences:
traversing the complete reference genome, extracting all k-mer sequences, and recording the quadruplets (k-mer sequences, in-edge bases, out-edge bases, offset positions) corresponding to each k-mer. Because the genome sequence has large data quantity and large number of k-mer sequences, all k-mer sequences cannot be simultaneously placed in a memory for sorting, de-duplication and other operations, and therefore, the invention adopts a block sorting method to realize the ordering of all k-mer sequences.
Block ordering: and taking the sequence content (the sequence formed by the word strings) of the first f (default value is 6) bp (bp represents the total number of bases) of the k-mer as a file name, and writing the quadruplet corresponding to the k-mer sequence into a corresponding temporary file. By this method, each k-mer sequence is written into a corresponding file while ensuring that the k-mer sets are ordered (by word order) between files. At this point, the k-mer sequences within the file are unordered.
And then, importing all k-mer sequences in one file into a memory, performing quick sequencing on k-mers in the file in the current memory, enabling quaternary groups corresponding to the k-mer sequences in the file to be ordered, and outputting the quaternary groups to a hard disk to form a corresponding ordered file. Quick sequencing: traversing the k-mer set in the current file from the beginning, merging adjacent identical k-mer sequences, and merging offset position combinations corresponding to the k-mers.
And sequentially importing all files into a memory and rapidly sequencing the files to generate all different k-mer sequences and ensure that the position sets corresponding to the k-mers are orderly. Since f bp are taken as file names, so that files are ordered, all k-mers are ordered after all k-mer sequence files are respectively ordered.
Step two, generating a unique path of the graph model:
each k-mer sequence represents a node of the de Bruijn graph model, and the nodes are divided into the following types according to the number of the incoming edges and the outgoing edges of the nodes:
1) 'X' -type node: a plurality of edges and a plurality of edges;
2) 'FY' node: a plurality of edges and an edge;
3) 'RY' -type node: one in side and a plurality of out sides;
4) 'L' -shaped node: one in and one out.
The way in which the start and end nodes of a unique path are formed is several possibilities:
(1) The start and end nodes are the same 'X' -type node;
(2) The start node is a 'FY' type node and the end node is a 'RY' type node;
(3) The start node is a successor node of the 'X' type node or a successor node of the 'FY' type node and the end node is a predecessor node of the 'X' type node or a predecessor node of the 'RY' type node.
Traversing all nodes in the graph, and if the current node is the initial node of the unique path, executing backward extension operation. Backward extending operation: and calculating out the trailing node of the current node according to the edge information of the current node, if the trailing node is an L-shaped node, continuing to extend backwards, and sequentially cycling according to the method until the end node is encountered. All unique paths of the graph model can be generated by the method, and the nodes in the paths are ensured not to be repeated.
All the unique path index files are generated through the method, and each unique path index file is correspondingly used as one unipath and is expressed as F-unipath.
Step two, generating a position set index file of a unique path:
and utilizing all unique paths generated in the second step. For each path, the nodes thereon are associated with the set of k-mer sequence positions generated in step two. For a certain path, traversing each node from the beginning to the back, acquiring two position sets of the current node and the subsequent nodes, and finding out the similar position elements of the two sets (the position difference is smaller than a certain threshold value, and the default value is set to be 50 bp). Because each position set is an ordered set, traversing certain set elements and adopting a binary search method to find out all similar positions and form a new position set. According to the method, each node on the path is traversed backwards in turn, the position searching and the set merging operation are executed after each new node is traversed until the end node of the path is encountered, and the finally formed position set is the position set corresponding to the current unique path.
According to the method, a position set corresponding to all the unique paths is generated and expressed as a binary index file, and the position set is expressed as F-pos. The index file effectively supports genome sequence comparison based on path information of the graph model, and is a basis of a data analysis method based on a group genome index structure.
Step three, expressing and constructing the local variant sequence index:
the reference genome represents the genetic characteristics of a population of a species. And genomic variation information is a genetic characteristic unique to the representative individual that is different from the reference genome, including types of variation such as mismatches, insertions, deletions, inversions, and transpositions. The complete population genome information includes the reference genome and variation information for different individuals. For a plurality of variations in a region, the variation combinations of different individuals are independent of each other, and the variation combinations on the haploids of the same individual are also independent of each other.
The reference genome is firstly divided in an overlapping manner according to fixed-length intervals, each divided fixed-length interval is used as a local area, every two adjacent local areas (300 bp in default length) are partially overlapped (150 bp in default length), and the length of each overlapped area is half of the length of each divided area.
The variation in each local region is collected for a haplotype of an individual and local individual genomic sequences with known variation are formed, i.e., local variant sequences are generated. At the same time, the length of the overlapping region is half of the length of the dividing region, and any seed is supported and ensured to fall on a certain local region on the genome. Irrespective of the fact that there is no known variation in a region.
Through the method, variant sequence index files are generated, and each variant sequence index file is correspondingly used as one alt string and is expressed as F-alt.
Step four, representing and constructing the group genome index based on minimizer:
step four, minimizer index construction:
we are facing the "unipath" graph structure and the "alt string" list to build the index separately. At the build level of the minimizer index, the two different indexes have similar construction methods and construction processes. Overall, the minimizer index builder takes a list of strings as input and a built index as output.
Before building the index for "unipath" and "alt string", the constructor converts the "unipath" graph and "alt string" list into a string list, respectively. Then, selecting a k-mer with the smallest hash value in a sliding window as a local representative k-mer according to a minimizer selection criterion for each character string in the character string list; these local minimum k-mers, known as minimizer, will be stored in a unified list.
The specific process of selecting the k-mer with the smallest hash value in the sliding window comprises the following steps: the minimizer is selected with two key parameters, namely window size and k-mer length; assuming that the sequence of the index to be constructed is L and the K-mer length is L_K, then the sliding window is selected with (L-L_K+1) K-mers selected. Assuming that the windows size is W (a default value of 5 is usually selected), selecting one min_k-mer1 with the smallest hash value from the 5 k-mers with the numbers of 1-5; then, selecting the smallest min_k-mer2 from the 5 of 2 to 6; followed by 3-7 and so on. The min_k-mers selected by adjacent different minimizer selection windows may be identical (e.g., 1-5 and 2-6 are the same, all 3) and each repeat is recorded only once. This k-mer is known as a minimizer.
The minimizer lists corresponding to all strings are combined into a complete list, which is sorted (typically from small to large) by the size of the minimizer hash value. Thereafter, the secondary index of the minimizer is constructed with the high-K bits (K being the secondary index parameter, typically 26) of the hash value of each minimizer.
The secondary index partitions the overall index of the minimizer into b=2 ^K (typically 64M) cells, each containing on average M/B mini cells when assuming a total number of mini-ers of Mzer (assuming a value of M of 4G, each cell contains on average 64 minimizer). The built MINIMIZER each occupy 96 bits of memory space, wherein the post Q (q=minmizer_len 2-K, minmizer_len representing the MINIMIZER length, typically 10-30 bits of memory space depending on the MINIMIZER length) of the hash value of the MINIMIZER occupies 30 bits of memory space; the minimizer occupies 65bit space in the original 'unipath' or 'alt string' list; plus 1bit of useless redundant space.
Storage data structure of minimizer in unipath graph model:
the minimizer index of the unpath sequence is constructed, and besides the 30bit minimizer storage space, there is a 65bit location area. The location area is defined as follows: wherein 32 bits are used for storing the ID of the UNITIG to which the minimizer belongs; the 32 bits store the location of the UNITIG where the minimizer is located, and the 1bit is used to store the orientation of the minimizer relative to the UNITIG.
Stored data structure of minimizer of local variant sequence (alt string):
similar to the stored data structure of minimizer in unipath graph model, each item of minimizer of "alt string" will also be stored as 96bit data structure, all minimzers together forming a data table.
The minimizer belonging to "alt string" has its coordinates uniquely marked as follows: each minimizer contains a 65-bit location area, wherein,
(a) 25 bits are used to store the ID of the window block to which the minimizer belongs (up to 32M windows blocks are supported);
(b) 19 bits store the ID of the haloprogype where the minimizer is located (each window block contains at most 0.5M different haloprogpes);
(c) 22 bits are used to store the location on the halopype (up to support a length of 4M halopype);
(d) The 1bit is used to store the minimizer relative to the direction on the replotype (forward or reverse direction).
To fully illustrate the meaning of alt string, two supplementary descriptions are given to window block, etc.:
explanation 1: window block.
To facilitate the compression of the population reference genome, we divide the linear reference genome (a so-called human reference genome, which contains base data of 22 autosomes and 2 sex chromosomes) into blocks according to a fixed length, which is a concept opposed to the universal genome, which is published and updated by the international organization. Each block is no less than twice the length (denoted as L) of the second generation sequencing sequence to be processed, with overlap of length L between blocks. The rule of division is fixed, and we can obtain the corresponding reference genome interval through the ID of the block, and determine the ID of the block to which the block belongs through the reference genome coordinate (or interval), such block is called windows block, hereinafter referred to as WB, and has a fixed parameter L, and we rely on the known parameter L and the linear reference genome to complete the division and construction of the WB system, as shown in FIG. 2.
(A) Assuming that the single-ended sequencing data is L (bases) in length, a fixed-length interval division is performed for each chromosome of the reference genome (e.g., GRCH 38). Each interval is 2L in length, while there is an overlap of L bases between adjacent intervals. The first interval ranges from 1 to 2L, the second interval ranges from l+1 to 3L, the nth interval ranges from (n-1) l+1 to nL, each such interval being referred to as a Window Block (WB). Taking l=150 as an example, the WB intervals for chromosome 1 are [1,300] respectively; [151,450]; [301,600] … … (see FIG. 2).
Interpretation 2: the hap_seq is a halopype.
(B) The structure of the generic genome (here specifically the genome containing the variant data) data is constructed for the sequencing-by-synthesis (i.e., second generation sequencing method, NGS) method. It is assumed that the population variation database (large scale) has a total of h samples and 2h haplotypes (i.e., the haplotype is a diploid organism, there are two sets of chromosomes, one from the father and mother, respectively). Within each window block, the different haplotypes have their specific local base sequences (length near 2L, or fluctuating with the introduction of a variation, hereinafter referred to as hap_seq). 2h haplotypes total 2h local base sequences.
(C) The partial base sequences of different haplotypes in the same window block may or may not be identical. Within the same window block, deduplication is performed on identical hap_seq, while deleting exactly the same (hap_seq) as the hap_seq of the linear reference genome, leaving mutually different local base sequences, each containing at least one variation, as shown in fig. 3.
Interpretation 3: alt string.
Alt string is the local region in hap_seq that contains the variation. It is defined as follows: in hap_seq, the base that is mutated and the P bases in front of it (typically 28 bases) and the 28 bases behind it will be marked as the base affected by the mutation. All variant affected bases in hap_seq together constitute alt string. For example, if SNP occurs at each of bases 8 and 10, alt string is a 1 to 38 substring of hap_seq. For example, if SNP occurs in each of the bases 8 and 100, alt string is two substrings of 1 to 36 (total 36 bases) and 72 to 128 (total 57 bases) of hap_seq.
Explanation 4: listing minimizer.
Each alt string has a minimizer list; each hap_seq has an alt string list; each WB has a hap_seq list and the whole ubiquity structure has a plurality of WB. All minimizer of the whole genome structure are combined in the order WB followed by hap_seq, called complete single list.
Deduplication of minimizer of local variant sequences (alt string):
when the window blocks are 2L in length (typically 300 bp), there is an overlap of L/2 (typically 150 bp) between each adjacent window block. At this time, each mutation appears in two adjacent windows blocks at the same time. This repetition results in doubling the number of minimizer. To save memory, we simply index the variation in the middle of each window block (and its surrounding regions), called sweet spots (sweet spots), that represent a region from 25% L to 75% L of each window block (typically a region from 75bp to 225bp within a window block). In this way, it is ensured that all variations and surrounding areas are indexed, while also ensuring that they are indexed only once.
Step four, minimizer searching:
we use a hybrid of hash schemes and binary search schemes to search for the location of the read minimizer in the index. Depending on the establishment of the secondary index, the scheme gives consideration to the speed of hash searching and simultaneously avoids space redundancy and hash collision.
The specific searching method comprises the following steps:
firstly, generating a minimizer list of the read by using a minimizer generation algorithm; the first K bits of its hash value (K being a secondary index parameter, typically 26) are then obtained for a particular minimizer, and located within a cell of the minimizer index that contains on average M/B minimzers (typically average no more than 64). Then, in this cell, a binary search is performed using the remaining Q (q=minimill_len 2-K, typically 10 to 30 bits) bits as KEY.
Depending on the establishment of the secondary index, the binary search process for minimizer is limited to a local contiguous memory space. The method can greatly reduce memory page replacement errors (pages fault or miss) and reduce the binary search times, thereby improving the index speed.
After finding the specific position (a section of continuous interval) of the minimizer in the index, the position of the minimizer of the read in the unipath diagram or the alt string list is resolved through the 65-bit position area of the minimizer. If the minimizer index of the "uniath" map is looked up, then the ID of the corresponding UNITG, the position on the UNITG (POS), and the orientation of the minimizer relative to the UNITG are resolved. If the minimizer index of the "alt string" list is looked up, then the parse is: the ID of the window block, the ID of the replotype, the position on the replotype, and the orientation of the minimizer relative to the replotype, i.e. the coordinates of the minimizer on the WB structure.
The second embodiment is as follows:
the present embodiment is a population-oriented genome index representation and construction device, the device comprising a processor and a memory, it being understood that any device comprising a processor and a memory described herein may also comprise other units, modules for displaying, interacting, processing, controlling, etc. and other functions by signals or instructions;
it should be understood that any method, including those described herein, may be provided as a computer program product, software, or computerized method, which may include a non-transitory machine-readable medium having stored thereon instructions, which may be used to program a computer system, or other electronic device. The storage medium may include, but is not limited to, magnetic storage media, optical storage media; the magneto-optical storage medium includes: read only memory ROM, random access memory RAM, erasable programmable memory (e.g., EPROM and EEPROM), and flash memory layers; or other type of medium suitable for storing electronic instructions.
The memory stores at least one instruction that is loaded and executed by the processor to implement the population-oriented genome index representation and construction method.
The above examples of the present invention are only for describing the calculation model and calculation flow of the present invention in detail, and are not limiting of the embodiments of the present invention. Other variations and modifications of the above description will be apparent to those of ordinary skill in the art, and it is not intended to be exhaustive of all embodiments, all of which are within the scope of the invention.

Claims (10)

1. A method for population genome-oriented index representation and construction, comprising the steps of:
step one, data collection and pretreatment:
acquiring human reference genome data and formulated variant data known to contain different types of variants, and performing redundant data removal normalization operation pretreatment on the two data files;
step two, constructing a de Bruijn graph model index representation by reference to a genome:
traversing the complete reference genome, extracting all k-mer sequences, simultaneously recording four tuples (k-mer sequences, incoming edge bases, outgoing edge bases and offset positions) corresponding to each k-mer, constructing a de Bruijn graph model based on the four tuples corresponding to each k-mer, extracting the association relationship between different k-mer sequences of input data and the k-mer sequences, wherein de Bruijn graph G= < V, E > is a directed graph model, nodes V in the graph are short sequence fragments k-mers with different lengths of k, and all different k-mers on the extracted genome form a node set V= { V1, V2, …, vM }; for two nodes vi and vj, if vi overlaps with vj with the sequence of (k-1) -mer, then there is a directed edge vi→vj;
each k-mer sequence represents a node of the de Bruijn graph model, a unique path unipath is determined according to the number of the incoming edges and the outgoing edges of the node, the incoming degree and the outgoing degree of the node are respectively defined as the number of the incoming edges and the outgoing edges of the node, and one or more incoming edges and one or more outgoing edges of the node can exist; a path which can be formed by any two nodes in the graph G is a unique path unipath if the entrance degree of the initial node of the path is greater than 1 and the exit degree is 1, the exit degree of the end node is greater than 1 and the entrance degree is 1, the entrance degree of the middle node is 1 and the exit degree is 1;
utilizing all unique paths generated; for each path, the nodes thereon have associated therewith a set of generated k-mer sequence positions; traversing each node from the beginning to the back aiming at a certain path, acquiring two position sets of a current node and a subsequent node, and finding out similar position elements of the two sets; because each position set is an ordered set, traversing certain set elements, adopting a binary search method, finding out all similar positions and forming a new position set; according to the method, each node on the path is traversed backwards in sequence, the position searching and the set merging operation are executed after each new node is traversed until the end node of the path is encountered, and the position set formed finally is the position set corresponding to the current unique path;
generating a position set corresponding to all unique paths according to the method, and representing the position set as a binary index file and F-pos;
step three, expressing and constructing the local variant sequence index:
firstly, overlapping and dividing a reference genome according to fixed-length intervals, wherein each divided fixed-length interval is used as a local area, each two adjacent local areas are partially overlapped, and the length of the overlapped area is half of that of the divided area;
for a haplotype of an individual, collecting the variation in each local region and forming a local individual genome sequence with known variation, namely generating a local variation sequence; the length of the overlapping area is half of the length of the dividing area;
generating variant sequence index files by the method, wherein each variant sequence index file is correspondingly used as an alt string;
step four, for 'unipath' and 'alt string', minizer-based group genome index representation and construction:
before indexes are built for 'unipath' and 'alt string', respectively converting a 'unipath' diagram and an 'alt string' list into a character string list; then, selecting a k-mer with the smallest hash value in a sliding window as a local representative k-mer according to a minimizer selection criterion for each character string in the character string list; these local minimum k-mers, known as minimizer, will be stored in a unified list;
the minimizer lists corresponding to all the character strings are combined into a complete list, and the lists are ordered according to the hash value of the minimizer; then, constructing a secondary index of the minimizer according to the high K bit of the hash value of each minimizer, wherein K is a secondary index parameter;
the secondary index partitions the overall index of the minimizer into b=2 ^K Each cell, when assuming that the total number of minimizer is M, each cell contains M/B minimizer on average; each constructed minimizer occupies 96 bits of storage space, wherein the last Q bits of the hash value of the minimizer occupy 30 bits of storage space; minimizer is found in the original "unipathThe position on the "or" alt string "list occupies 65 bits of space.
2. The method of claim 1, wherein the k-mer sequences are ordered and de-duplicated when traversing the complete reference genome to extract all k-mer sequences, the method comprising the steps of:
according to the sequence content of the first f bp of the k-mer as a file name, wherein bp represents the total quantity of bases, writing the quadruple corresponding to the k-mer sequence into a corresponding temporary file;
then, importing all k-mer sequences in a file into a memory, performing quick sequencing on k-mers in the file in the current memory to enable quadruplets corresponding to the k-mer sequences in the file to be ordered, and outputting the quadruplets to a hard disk to form a corresponding ordered file;
quick sequencing: traversing the k-mer set in the current file from the beginning, merging adjacent identical k-mer sequences, and merging offset position combinations corresponding to the k-mers;
and sequentially importing all files into a memory and rapidly sequencing the files to generate all different k-mer sequences and ensure that the position sets corresponding to the k-mers are orderly.
3. The method for representing and constructing a group-oriented genome index according to claim 2, wherein the step of determining unique paths unipath according to the number of in-edges and out-edges of the nodes comprises the steps of:
firstly, according to the quantity of the incoming edges and the outgoing edges of the nodes, the nodes are divided into the following types:
1) 'X' -type node: a plurality of edges and a plurality of edges;
2) 'FY' node: a plurality of edges and an edge;
3) 'RY' -type node: one in side and a plurality of out sides;
4) 'L' -shaped node: one in side and one out side;
the way in which the start and end nodes of a unique path are formed is several possibilities:
(1) The start and end nodes are the same 'X' -type node;
(2) The start node is a 'FY' type node and the end node is a 'RY' type node;
(3) The starting node is a successor node of an 'X' type node or a successor node of an 'FY' type node and the ending node is a predecessor node of an 'X' type node or a predecessor node of an 'RY' type node;
then traversing all nodes in the graph, and if the current node is the initial node of the unique path, executing backward extension operation; backward extending operation: calculating to obtain a rear-drive node according to the edge information of the current node, if the rear-drive node is an L-shaped node, continuing to extend backwards, and sequentially cycling according to the method until encountering an end node; all unique paths of the graph model can be generated by the method, and each node in the paths is ensured not to be repeated;
all the unique path index files are generated through the method, and each unique path index file is correspondingly used as a unipath.
4. The method for representing and constructing a group-oriented genome index according to claim 3, wherein the specific process of selecting the k-mer with the smallest hash value in the sliding window in the fourth step comprises the following steps:
the minimizer is selected with two key parameters, namely window size and k-mer length; assuming that the sequence of the index to be constructed is L and the length of K-mers is L_K, when the sliding window is selected, (L-L_K+1) K-mers are selected; assuming that the windows size is W, selecting one min_k-mer1 with the smallest hash value from the 5 k-mers with the numbers of 1-5; then, selecting the smallest min_k-mer2 from the 5 of 2 to 6; then 3-7 and so on; the min_k-mers selected by adjacent different minimizer selection windows may be identical, and each repetition is recorded only once; the min_k-mer is called a minimizer.
5. The method of claim 4, wherein Q = minizer_len x 2-K in the last Q bits of the hash value, wherein minizer_len represents a MINIMIZER length.
6. The method of claim 5, wherein the minimizer-based representation and construction of the population genome index for "uniath" is performed by:
the minimizer index of the constructed unpath sequence has a 65bit position area besides a 30bit minimizer storage space;
the location area is: wherein 32 bits are used for storing the ID of the UNITIG to which the minimizer belongs, 32 bits are used for storing the position of the UNITIG to which the minimizer belongs, and 1bit is used for storing the direction of the minimizer relative to the UNITIG.
7. The method of population genome index expression and construction of claim 6, wherein the minimizer-based population genome index expression and construction for "alt string" is performed with a stored data structure of the minimizer for alt string as follows:
each item of minimizer of "alt string" is stored as 96bit data structure, all minimzers together form data table; the minimizer belonging to "alt string" has its coordinates uniquely marked as follows:
each minimizer contains a 65-bit location area, wherein,
(a) The 25 bits are used for storing the ID of the window block to which the minimizer belongs;
(b) 19 bits store the ID of the replotype where the minimizer is located;
(c) 22 bits are used to store the location on the replotype;
(d) The 1bit is used to store the orientation of the minimizer relative to the orientation on the replotype.
8. The method of claim 7, wherein the minimizer-based population genome index representation and construction is performed for "alt string" by indexing only sweet spot variations in the middle of each window block that are close to each other, the sweet spot being a region of each window block from 25% L to 75% L.
9. A method of population genome-oriented index representation and construction according to claim 7 or 8, further comprising a minimizer lookup procedure comprising the steps of:
firstly, generating a minimizer list of the read by using a minimizer generation algorithm; then, aiming at a certain minimizer, the first K bits of the hash value are acquired and positioned in a cell of the minimizer index, wherein the cell comprises M/B minimzers on average; then, in the cell, using the rest Q bits as KEY to perform binary search;
after finding the specific position of the minimizer in the index, resolving the position of the minimizer of the read in the uniath figure or alt string list through a 65-bit position area of the minimizer; if the minimizer index of the uniath figure is searched, resolving the UNITIG index into the ID of the corresponding UNITIG, the position on the UNITIG and the direction of the minimizer relative to the UNITIG; if the minimizer index of the "alt string" list is looked up, then the parse is: the ID of the window block, the ID of the replotype, the position on the replotype, and the orientation of the minimizer relative to the replotype, i.e. the coordinates of the minimizer on the WB structure.
10. Population genome-oriented index representation and construction apparatus, characterized in that it comprises a processor and a memory in which at least one instruction is stored, which is loaded and executed by the processor to implement a population genome-oriented index representation and construction method according to one of claims 1 to 9.
CN202211295056.5A 2022-10-21 2022-10-21 Group-oriented genome index representation and construction method and equipment Active CN115662523B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211295056.5A CN115662523B (en) 2022-10-21 2022-10-21 Group-oriented genome index representation and construction method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211295056.5A CN115662523B (en) 2022-10-21 2022-10-21 Group-oriented genome index representation and construction method and equipment

Publications (2)

Publication Number Publication Date
CN115662523A CN115662523A (en) 2023-01-31
CN115662523B true CN115662523B (en) 2023-06-20

Family

ID=84989722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211295056.5A Active CN115662523B (en) 2022-10-21 2022-10-21 Group-oriented genome index representation and construction method and equipment

Country Status (1)

Country Link
CN (1) CN115662523B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665772B (en) * 2023-05-30 2024-02-13 之江实验室 Genome map analysis method, device and medium based on memory calculation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336916A (en) * 2013-07-05 2013-10-02 中国科学院数学与系统科学研究院 Sequencing sequence mapping method and sequencing sequence mapping system
CN110491441A (en) * 2019-05-06 2019-11-22 西安交通大学 A kind of gene sequencing data simulation system and method for simulation crowd background information
CN114424288A (en) * 2019-09-30 2022-04-29 朗斯科技有限公司 Method for determining a measure related to the probability that two mutated sequence reads are derived from the same sequence comprising a mutation

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10068054B2 (en) * 2013-01-17 2018-09-04 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
CN105989249B (en) * 2014-09-26 2019-03-15 南京无尽生物科技有限公司 For assembling the method, system and device of genome sequence
CN104899476A (en) * 2015-06-15 2015-09-09 中国人民解放军国防科学技术大学 Parallel accelerating method for BWT index construction for multiple sequences
US20170270245A1 (en) * 2016-01-11 2017-09-21 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing
CN110008217B (en) * 2019-04-08 2021-11-30 湖南大地同年生物科技有限公司 Genome mutation data oriented storage and index processing method
CN111028897B (en) * 2019-12-13 2023-06-20 内蒙古农业大学 Hadoop-based distributed parallel computing method for genome index construction
US20210375397A1 (en) * 2020-02-14 2021-12-02 Guardant Health, Inc. Methods and systems for determining fusion events

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336916A (en) * 2013-07-05 2013-10-02 中国科学院数学与系统科学研究院 Sequencing sequence mapping method and sequencing sequence mapping system
CN110491441A (en) * 2019-05-06 2019-11-22 西安交通大学 A kind of gene sequencing data simulation system and method for simulation crowd background information
CN114424288A (en) * 2019-09-30 2022-04-29 朗斯科技有限公司 Method for determining a measure related to the probability that two mutated sequence reads are derived from the same sequence comprising a mutation

Also Published As

Publication number Publication date
CN115662523A (en) 2023-01-31

Similar Documents

Publication Publication Date Title
CA2839802C (en) Methods and systems for data analysis
US7640256B2 (en) Data collection cataloguing and searching method and system
US8762073B2 (en) Transcript mapping method
JP5183155B2 (en) Batch search method and search system for a large number of sequences
US7809510B2 (en) Positional hashing method for performing DNA sequence similarity search
CN109712674B (en) Annotation database index structure, and method and system for rapidly annotating genetic variation
CN115662523B (en) Group-oriented genome index representation and construction method and equipment
US20180247016A1 (en) Systems and methods for providing assisted local alignment
US8788522B2 (en) Pair character string retrieval system
US20220005546A1 (en) Non-redundant gene set clustering method and system, and electronic device
Wang et al. A brief review of machine learning methods for RNA methylation sites prediction
CN109658981B (en) Data classification method for single cell sequencing
US20230230657A1 (en) Methods and Systems for Improved K-mer Storage and Retrieval
WO2011073680A1 (en) Improvements relating to hash tables
Zhang et al. SMOTIF: efficient structured pattern and profile motif search
CN109828785B (en) Approximate code clone detection method accelerated by GPU
Gagie et al. Compressing and indexing aligned readsets
US20190205394A1 (en) Method for generating text string dictionary, method for searching text string dictionary, and system for processing text string dictionary
JPH07105224A (en) Character array retrieving method
Nicolas et al. Finding and characterizing repeats in plant genomes
TWI785847B (en) Data processing system for processing gene sequencing data
Sulistyawan et al. An adaptive BWT-HMM-based lossless compression system for genomic data
JP7422367B2 (en) Approximate string matching method and computer program for realizing the method
Kleffe et al. ClustDB: A high-performance tool for large scale sequence matching
Nguyen et al. BWTaligner: a genome short-read aligner

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant