CN112071367B

CN112071367B - Manifold evolutionary graph construction method, device, equipment and storage medium

Info

Publication number: CN112071367B
Application number: CN202010911346.2A
Authority: CN
Inventors: 田圃
Original assignee: Jilin University
Current assignee: Jilin University
Priority date: 2020-09-02
Filing date: 2020-09-02
Publication date: 2023-04-07
Anticipated expiration: 2040-09-02
Also published as: CN112071367A

Abstract

The invention belongs to the technical field of data processing and provides a method, a device, equipment and a storage medium for constructing a manifold evolutionary graph. The method comprises the following steps: constructing a basic component sequence vector tree, which comprises at least one main sequence node together with the basic component sequence vector and the weight vector of that main sequence node; determining, from the basic component sequence vector and the weight vector of the main sequence node, the basic component sequence vectors and weight vectors of candidate sequence nodes having different neighbor relations with the main sequence node; when a candidate sequence node is found in the basic component sequence vector tree according to its basic component sequence vector and weight vector, labeling the corresponding neighbor relation; and constructing a manifold evolutionary graph from the main sequence nodes and the neighbor relations. Because the relations among sequence nodes are obtained from neighbor relations in manifold space, the amount of computation is small and the reliability is high, and the evolutionary relationships of any large-scale, complex, high-dimensional, nonlinear, strongly correlated data can be analyzed.

Description

Manifold evolutionary graph construction method, device, equipment and storage medium

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a storage medium for constructing a manifold evolutionary graph.

Background

Clarifying the relationships among complex, high-dimensional, nonlinear, strongly correlated objects/data (such as biological sequences, biological macromolecular structures, pictures, texts, audio and video) is a challenge that existing artificial intelligence technology has not effectively solved but must overcome. The strong correlation between dimensions means that these data actually lie in a manifold space whose dimension is far lower than the nominal dimension.

The commonly used distance measures (e.g., Euclidean distance, Manhattan distance, Mahalanobis distance, Minkowski distance, Chebyshev distance, biological sequence identity, etc.) are all computed directly in the nominal high-dimensional space. They generally approximate the manifold-space distance only when the distance in the nominal high-dimensional space is small, and are often ineffective otherwise. All complex high-dimensional nonlinear strongly correlated data share this difficulty: the physically meaningful distance between data records/objects is the distance in the relatively low-dimensional manifold space, not the distance in the nominal high-dimensional space. However, because of the complexity of the nonlinear strong correlations among dimensions, it is generally very difficult to determine theoretically the shape of the manifold space formed by any actual data objects, or its mapping to the nominal high-dimensional space. For example, when a pair of protein sequences has about 30% identity, it is difficult to tell whether the matching amino acids arose by chance or reflect a close homologous evolutionary relationship.

At present, the evolutionary relationships of biological sequences are generally determined by constructing an evolutionary tree. However, reliable evolutionary tree construction (for example, consistency-based methods) has high computational complexity and is difficult to apply to large numbers of biological sequences (ten thousand or more), while current sequencing capacity is bringing protein sequence data sets of more than ten billion sequences within reach. In addition, because evolutionary tree construction usually rests on identity calculated from sequence alignment, it cannot overcome the fundamental limitation that sequence identity is unreliable when it is low, nor the obvious limitation that sequence alignment depends on several highly uncertain parameters (such as the scoring matrix used in pairwise alignment and other parameters used to build multiple sequence alignments). Other high-dimensional nonlinear strongly correlated data face similar difficulties.

Therefore, the distance measurement method in the prior art has the problem that the distance between high-dimensional nonlinear strongly-correlated data samples in the low-dimensional manifold space cannot be effectively characterized.

Disclosure of Invention

The embodiment of the invention aims to provide a manifold evolutionary graph construction method, and aims to solve the problem that a distance measurement mode in the prior art cannot effectively represent the distance between high-dimensional nonlinear strongly-associated data samples in a low-dimensional manifold space.

The embodiment of the invention is realized in such a way that a method for constructing a manifold evolutionary graph comprises the following steps:

acquiring a data set to be processed;

constructing a basic component sequence vector tree according to the data set to be processed; the base component sequence vector tree comprises at least one main sequence node and a base component sequence vector and a base component weight vector of the main sequence node;

determining a basic component sequence vector and a basic component weight vector of a candidate sequence node with different neighbor relations with the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node;

when the candidate sequence node is found in the basic component sequence vector tree according to the basic component sequence vector and the basic component weight vector of the candidate sequence node with different neighbor relations, labeling the corresponding neighbor relations mutually;

and constructing a manifold evolutionary graph according to the main sequence nodes and the neighbor relation.

Another objective of an embodiment of the present invention is to provide a manifold evolutionary graph building apparatus, including:

the data set acquisition unit is used for acquiring a data set to be processed;

the component sequence vector tree construction unit is used for constructing a basic component sequence vector tree according to the data set to be processed; the base component sequence vector tree comprises at least one main sequence node and a base component sequence vector and a base component weight vector of the main sequence node;

the sequence vector and weight vector determining unit is used for determining the basic component sequence vector and the basic component weight vector of the candidate sequence node with different neighbor relations with the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node;

a neighbor relation labeling unit, configured to label the corresponding neighbor relations mutually when the candidate sequence node is found in the basic component sequence vector tree according to the basic component sequence vector and the basic component weight vector of the candidate sequence node with different neighbor relations; and

a manifold evolutionary graph constructing unit, configured to construct a manifold evolutionary graph according to the main sequence nodes and the neighbor relations.

A further object of embodiments of the present invention is to provide a computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the manifold evolutionary graph construction method.

Another object of embodiments of the present invention is to provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps of the manifold evolutionary graph construction method.

The manifold evolutionary graph construction method provided by the embodiment of the invention builds the manifold evolutionary graph from the basic component sequence vector tree, with each sequence node in the tree serving as a node of the graph. Sequence alignment is needed to label neighbor relations only in the rare cases where a few highly similar but not identical sequences share both the basic component sequence vector and the basic component weight vector; the vast majority of neighbor relations are obtained from the basic component sequence vectors and/or weight vectors, so the amount of computation is small. For the rare sequences that do require alignment, their high similarity means that the alignment result depends little on the scoring matrix and other related parameters, so the reliability is high. In addition, sequences with different basic component weight vectors in the same node of the basic component sequence vector tree have their neighbor relations distinguished by the differences between those weight vectors, so the evolutionary relationships of any super-large-scale, complex, high-dimensional, nonlinear, strongly correlated data can be analyzed.

Drawings

FIG. 1 is a flowchart illustrating an implementation of a method for constructing a manifold evolutionary graph according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating an implementation of another method for constructing a manifold evolutionary graph according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating an implementation of another method for constructing a manifold evolutionary graph according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of adjacent sequence relationships in a manifold evolution diagram according to an embodiment of the present invention;

FIG. 5 is a flowchart illustrating an implementation of a method for constructing a manifold evolutionary graph according to another embodiment of the present invention;

FIG. 6 is a block diagram of a device for constructing a manifold evolutionary graph according to an embodiment of the present invention;

FIG. 7 is a block diagram illustrating the structure of a sequence vector and weight vector determining unit in the device for constructing a manifold evolutionary graph according to an embodiment of the present invention;

FIG. 8 is a block diagram of the internal configuration of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

To solve the problem that prior-art distance measures cannot effectively characterize the distance between high-dimensional, nonlinear, strongly correlated data samples in the low-dimensional manifold space, an embodiment of the invention provides a manifold evolutionary graph construction method. A basic component sequence vector tree containing every unique record (also called a sequence, and also a graph node in the manifold evolutionary graph to be constructed) is built from the data set to be processed. The tree includes at least one root node and one non-empty subtree node; each non-empty subtree node contains at least one record/sequence/graph node, and its position in the tree is determined by the corresponding basic component sequence vector. Depending on the size of the data set, what is stored for each record/sequence/graph node may be one, two or all of the following: the record itself, its number, and its component weight vector. For each unique record/sequence/graph node, the basic component sequence vectors and weight vectors (or the complete record information) of the candidate records in different neighbor relations with it are determined from its own basic component sequence vector and weight vector (or its complete information). Each candidate record of each unique record is then looked up in the constructed tree; if it exists, a relation label is placed between the record and the corresponding candidate record, and this label is an edge in the manifold evolutionary graph.

Once the candidate records of all the different neighbor relations have been processed, all the added edges together with all the graph nodes form the manifold evolutionary graph. The relations among all records in the invention are obtained from highly reliable neighbor relations in manifold space, so the amount of computation is small, the reliability is high, and the evolutionary relationships of any super-large-scale, complex, high-dimensional, nonlinear, strongly correlated data can be analyzed.

It should be noted that, among the nodes of the basic component sequence vector tree (abbreviated as tree nodes), each basic component sequence vector (CRV, composition rank vector) corresponds to exactly one tree node but may correspond to several basic component weight vectors (CWV, composition weight vector), so a tree node may hold many records. For ease of searching, a binary tree may be used to store the graph nodes inside a tree node that holds many records; each node of this binary tree corresponds to a unique weight vector (CWV) and may therefore be called a CWV node. Each record/sequence is a graph node, so the terms record, sequence, graph node and main sequence node refer to the same thing in the present invention. Two CWVs are compared by their first elements; if these are equal, the second elements are compared, and so on. It is rare for different sequences/records to share both CRV and CWV; when this happens, they are stored in a list (of record numbers or complete records) in the corresponding CWV node of the binary tree inside the tree node.
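The element-by-element CWV comparison and the per-tree-node storage can be sketched as follows. This is only an illustration under stated assumptions: the class name `TreeNodeStore` and the record identifiers are hypothetical, and a sorted list searched with `bisect` stands in for the binary tree described above.

```python
from bisect import bisect_left


class TreeNodeStore:
    """Hypothetical store for all records that share one CRV (one tree node)."""

    def __init__(self):
        self.cwvs = []      # CWV tuples, kept in ascending order
        self.records = []   # records[i]: record ids whose CWV is cwvs[i]

    def add(self, cwv, record_id):
        key = tuple(cwv)    # tuples compare first element, then second, ...
        i = bisect_left(self.cwvs, key)
        if i < len(self.cwvs) and self.cwvs[i] == key:
            self.records[i].append(record_id)   # rare case: shared CRV and CWV
        else:
            self.cwvs.insert(i, key)
            self.records.insert(i, [record_id])

    def find(self, cwv):
        key = tuple(cwv)
        i = bisect_left(self.cwvs, key)
        if i < len(self.cwvs) and self.cwvs[i] == key:
            return self.records[i]
        return []


store = TreeNodeStore()
store.add([3, 2, 1], "rec-1")
store.add([3, 1, 1], "rec-2")
store.add([3, 2, 1], "rec-3")   # shares its CWV with rec-1
```

Python tuples already compare in the order described (first element, then second, and so on), which is why no custom comparator is needed in this sketch.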

To further explain the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects according to the present invention will be given with reference to the accompanying drawings and preferred embodiments.

Fig. 1 is a flowchart of an implementation of a method for constructing a manifold evolutionary graph according to an embodiment of the present invention, which is described in detail below.

Step S101, a data set to be processed is obtained.

In the embodiment of the present invention, the data set to be processed refers to complex high-dimensional spatial nonlinear strongly-associated objects/data, which include biological sequences, biological macromolecular structures, and may also extend to text, sound/audio, images/video, or a combination of these data and other complex high-dimensional nonlinear strongly-associated data.

And S102, constructing a basic component sequence vector tree according to the data set to be processed.

In an embodiment of the present invention, the basic component sequence vector tree includes at least one main sequence node and a basic component sequence vector and a basic component weight vector of the main sequence node. In the invention, each sequence node in the basic component sequence vector tree is used as a node of the manifold evolutionary graph. To facilitate the subsequent construction of the manifold evolutionary graph, the sequence nodes in the basic component sequence vector tree can store the already computed weight vector (CWV) so that it does not have to be recomputed later.

In this embodiment of the present invention, as shown in fig. 2, the step S102 includes:

step S201, defining basic components according to the data set to be processed and a preset definition rule.

In the present embodiment, taking protein sequence data as an example, the basic components may be defined as the twenty natural amino acids; as pairwise combinations of the twenty natural amino acids; or by first grouping the twenty amino acids into fewer classes (for example, the seven classes commonly defined in biochemistry textbooks) and then combining these classes in pairs or in larger groups.

In the present example, for protein structure data, the eight single-residue secondary structure states defined by DSSP may first be used as the basic components: H, α-helix; G, 3₁₀ helix; I, π-helix; E, extended strand in a parallel and/or antiparallel β-sheet; B, isolated β-bridge; T, hydrogen-bonded turn; S, bend; and coil. Alternatively, pairwise combinations of these eight classes form 64 basic components; combination with the 20 amino acid types of the sequence forms 8 × 20 = 160 basic components; or other class definitions of single-residue secondary structure may be combined with sequence-based component definitions.

In the embodiment of the present invention, taking DNA sequences as an example, the four bases A, T, C and G are arranged in permutations of a given length, and these permutations serve as the basic components. For RNA sequences, the basic components used to construct the basic component sequence vector are defined likewise from permutations of the four bases A, U, C and G, or from different combinations of such permutations.
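As a rough illustration of step S201 for nucleotide data, the sketch below enumerates all arrangements of a given length and counts them in a sequence. The function names are hypothetical, and counting overlapping windows is an assumption; the text does not say how the arrangements are extracted from a sequence.

```python
from collections import Counter
from itertools import product


def kmer_components(alphabet, k):
    """All length-k arrangements of the alphabet, e.g. 4**k of them for DNA."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def count_components(sequence, k):
    """Count length-k windows in the sequence (overlap is an assumption)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


dna_2mers = kmer_components("ATCG", 2)       # 16 candidate basic components
counts = count_components("ATATCG", 2)       # windows: AT, TA, AT, TC, CG
```

Replacing `"ATCG"` with `"AUCG"` gives the corresponding components for RNA.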

Step S202, determining a basic component sequence vector according to the basic components.

In the examples of the present invention, the sequence of the protein molecule Alpha Hemolysin has 87 amino acids of 19 different types, as follows:

MELKNSISDYTEAEFVQLLKEIEKENVAATDDVLDVLLEHFVKITEHPDGTDLIYYPSDNRDDSPEGIVKEIKEWRAANGKPGFKQG

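Among them there are 11 E, 9 D, 8 K, 7 L, 6 each of V and I, 5 each of A and G, 4 each of S, T, P and N, 3 each of F and Y, 2 each of R, Q and H, and 1 each of M and W. Ordering the components from most to least abundant gives the basic component sequence vector (CRV) [EDKLVIAGSTPNFYRQHMW] and the corresponding weight vector (CWV) [11,9,8,7,6,6,5,5,4,4,4,4,3,3,2,2,2,1,1].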

In the embodiment of the present invention, taking protein structure as an example, the protein molecule structure of the Flock House virus particle (PDB ID 4FTB) has 44 amino acids in 3 types of secondary structure, as follows:

--HHHHHHHHHHHHHHT---------------------------

Of these, 29 are '-', 14 are H and one is T, so the sequence vector (CRV) is [-HT] and the corresponding weight vector (CWV) is [29,14,1].
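The computation in step S202 can be sketched as below for the Alpha Hemolysin example. The function name `crv_cwv` and the `tie_order` parameter are hypothetical. The text does not state how components with equal counts are ordered, so alphabetical order is assumed here; the example above lists tied components in a different order (for instance V before I), while the weight vector is the same either way.

```python
from collections import Counter

ALPHA_HEMOLYSIN = (
    "MELKNSISDYTEAEFVQLLKEIEKENVAATDDVLDVLLEHFVKITEHPDGTDLIYYPSDNRDDSPEGIVKEIKEWRAANGKPGFKQG"
)


def crv_cwv(record, tie_order="ACDEFGHIKLMNPQRSTVWY"):
    """Return (CRV, CWV): components by descending count, and their counts."""
    rank = {c: i for i, c in enumerate(tie_order)}
    counts = Counter(record)
    # Descending count first; ties fall back to tie_order (an assumption).
    ranked = sorted(
        counts.items(),
        key=lambda kv: (-kv[1], rank.get(kv[0], len(rank)), kv[0]),
    )
    crv = "".join(component for component, _ in ranked)
    cwv = [weight for _, weight in ranked]
    return crv, cwv


crv, cwv = crv_cwv(ALPHA_HEMOLYSIN)
```

Passing a different `tie_order` changes only the relative order of equally frequent components in the CRV.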

Step S203, inserting the basic component sequence vector into the corresponding node of the multi-branch tree, and constructing a basic component sequence vector tree.

In the embodiment of the invention, the basic component sequence vector is inserted into the corresponding node of the multi-way tree. This is done independently for each record, without comparing it against all the records already in the multi-way tree, which is the root cause of the high efficiency of unsupervised classification by basic component sequence vector. If, during insertion, a record is found to duplicate an existing record in the same node, it is discarded. In addition, the tree is checked for similar sequences that end up in non-adjacent nodes because the order of a few components is easily reversed; if any are found, they are labeled as related to each other.
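A minimal sketch of the insertion in step S203 follows, assuming the multi-way tree is a trie keyed by successive CRV components, with each node holding a mapping from CWV to records. The class `CRVTree` and its layout are illustrative assumptions, not the stored format described in the text.

```python
class CRVTree:
    """Hypothetical multi-way tree: one child per next component of the CRV."""

    def __init__(self):
        self.root = {"children": {}, "records": {}}

    def _node(self, crv, create):
        node = self.root
        for component in crv:               # one step per CRV component
            nxt = node["children"].get(component)
            if nxt is None:
                if not create:
                    return None
                nxt = {"children": {}, "records": {}}
                node["children"][component] = nxt
            node = nxt
        return node

    def insert(self, record, crv, cwv):
        """Insert a record; return False if it duplicates a stored record."""
        node = self._node(crv, create=True)
        bucket = node["records"].setdefault(tuple(cwv), [])
        if record in bucket:
            return False                    # duplicate record is discarded
        bucket.append(record)
        return True

    def find(self, crv, cwv):
        node = self._node(crv, create=False)
        return [] if node is None else node["records"].get(tuple(cwv), [])


tree = CRVTree()
first = tree.insert("AAB", "AB", [2, 1])
repeat = tree.insert("AAB", "AB", [2, 1])   # same record again
other = tree.insert("ABA", "AB", [2, 1])    # shares CRV and CWV with "AAB"
```

Each insertion walks only the path given by the record's own CRV, so its cost depends on the CRV length rather than on how many records are already stored.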

And step S103, determining the basic component sequence vector and the basic component weight vector of the candidate sequence node with different neighbor relations with the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node.

In the embodiment of the invention, for each record in the constructed component sequence vector tree (that is, each main sequence node), the required candidate records (that is, candidate sequence nodes) at different adjacency levels are determined, and the weight vector (CWV) of each candidate record is stored together with a label of its neighbor relation to the source record. The source record is the record used to generate the candidate record; some candidate records may have several source records.

In the embodiment of the present invention, a candidate sequence node in a different neighbor relation with a main sequence node is a candidate sequence node corresponding to a new basic component sequence vector generated by adjusting the basic components of the main sequence node. For example, applying one or more amino acid mutations, insertions or deletions to the sequence of the protein molecule Alpha Hemolysin generates new basic component sequence vectors, and the candidate sequences corresponding to these new vectors are in different neighbor relations with the Alpha Hemolysin sequence. The neighbor level is determined by the number of mutations, insertions or deletions, from small to large: a new component sequence vector generated by one mutation/insertion/deletion is called a primary adjacent component sequence vector of the current sequence, and the corresponding sequence is called a primary adjacent sequence; a new component sequence vector generated by two mutations/insertions/deletions is called a secondary adjacent component sequence vector of the current sequence, and the corresponding sequence is called a secondary adjacent sequence; and so on.

Step S104, when the candidate sequence node is found in the basic component sequence vector tree according to the basic component sequence vector and the basic component weight vector of the candidate sequence node with different neighbor relations, the corresponding neighbor relations are labeled mutually.

In the embodiment of the invention, for each sequence/record/graph node (source record) in the constructed basic component sequence vector tree, each candidate sequence/record/graph node (neighboring record) is looked up according to the basic component sequence vectors and basic component weight vectors of the candidate sequence nodes with different neighbor relations. If a neighboring record exists, the two records are labeled mutually; this label is an edge in the manifold evolutionary graph and may record the relation between the source record and the neighboring record. If the neighboring record does not exist, nothing is done. All graph nodes in the basic component sequence vector tree together with all edges added in these steps form the manifold evolutionary graph.
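The lookup-and-label step can be sketched as follows. For brevity a plain dictionary keyed by (CRV, CWV) stands in for the tree lookup; the function name, the key layout and the record identifiers are assumptions made for illustration.

```python
def label_neighbors(index, source_key, candidates):
    """Return undirected edges between source records and existing candidates.

    index:      {(crv, cwv_tuple): [record ids]}, a stand-in for the tree
    candidates: [((crv, cwv_tuple), adjacency_level), ...]
    """
    edges = set()
    for candidate_key, level in candidates:
        for src in index.get(source_key, []):
            # A candidate absent from the index contributes nothing.
            for dst in index.get(candidate_key, []):
                if src != dst:
                    # frozenset makes the label mutual (no direction).
                    edges.add((frozenset((src, dst)), level))
    return edges


index = {
    ("AB", (2, 1)): ["r1"],     # e.g. record "AAB"
    ("BA", (2, 1)): ["r2"],     # e.g. record "ABB"
}
edges = label_neighbors(
    index,
    ("AB", (2, 1)),
    [(("BA", (2, 1)), 1), (("A", (3,)), 1)],   # second candidate is not stored
)
```

Only the candidate that is actually present produces an edge; the missing one is silently skipped, matching the "nothing is done" case.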

And S105, constructing a manifold evolutionary graph according to the main sequence nodes and the neighbor relation.

In the embodiment of the invention, each sequence in the basic component sequence vector tree is taken as a node of the manifold evolutionary graph. First, within each tree node, all sequences that lie in the same tree node and have the same weight vector (that is, share both CRV and CWV) form a "basic component recombination" super node of the manifold evolutionary graph, and the mutation relations inside it must be determined by sequence alignment. Second, sequences with different weight vectors in the same tree node are labeled with neighbor relations according to the differences between their weight vectors. Note that nodes of the manifold evolutionary graph (graph nodes) are distinct from nodes of the component sequence vector tree (tree nodes): a tree node may be empty, or may contain multiple graph nodes or "basic component recombination" super nodes.

In the embodiment of the invention, according to the labels of adjacent basic component sequence vectors, each sequence is connected to every sequence corresponding to its primary adjacent component sequence vectors and weight vectors, and the weight/distance of each such connection is the corresponding adjacency level. If necessary, the same connection procedure is then carried out at higher levels.

In the embodiment of the invention, graph analysis is performed after the nodes of the manifold evolutionary graph have been connected. If all sequence nodes form a single connected graph once all first-level adjacent relations have been added, construction of the manifold evolutionary graph is complete, and the resulting manifold has the highest reliability (first-level reliability). If the sequence nodes do not all form one connected graph, each connected subgraph is defined as a first-level reliability subgraph; edges for second-level adjacent sequences are then added and the same connectivity analysis is repeated, and any first-level reliability subgraphs that become connected are merged and marked as a second-level reliability subgraph (the original first-level subgraph marks are retained). This continues until a single connected graph is formed or the edges for the farthest adjacency level in use have all been added. If several connected subgraphs still remain, possible options are to increase the farthest adjacency level further, or to leave them unprocessed and wait for subsequent data.
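The level-by-level connectivity analysis can be sketched with a union-find structure, as below. Union-find is a choice made for this illustration and is not named in the text; the function name and the toy data are hypothetical.

```python
def build_by_levels(nodes, edges_by_level):
    """Add edges level by level; record the connected subgraphs after each level.

    Stops as soon as all nodes form one connected graph.
    Returns [(level, [set_of_nodes, ...]), ...].
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    components = len(parent)
    history = []
    for level in sorted(edges_by_level):
        for a, b in edges_by_level[level]:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
                components -= 1
        groups = {}
        for n in nodes:
            groups.setdefault(find(n), set()).add(n)
        history.append((level, list(groups.values())))
        if components == 1:
            break                            # whole graph is connected
    return history


nodes = ["a", "b", "c", "d"]
edges_by_level = {
    1: [("a", "b"), ("c", "d")],   # first-level adjacent relations
    2: [("b", "c")],               # second-level adjacent relations
    3: [("a", "d")],               # never needed in this toy example
}
history = build_by_levels(nodes, edges_by_level)
```

In this toy input the first level leaves two subgraphs, the second level merges them, and the third level is never examined. Keeping the per-level subgraph lists corresponds to retaining the earlier reliability marks.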

In addition, each record in the data set to be processed is numbered and stored separately in a file or a database. If duplicates exist, they are discarded while the component sequence vector tree is being built, and the stored file or database is updated accordingly. When each record occupies little space or the total number of records is small, the complete records can be stored in the component sequence vector tree and the manifold evolutionary graph. If the records are large and numerous, the records themselves need to be stored separately in a database, and the basic component sequence vector tree and the manifold evolutionary graph can hold just the record number and the CWV.

The manifold evolutionary graph construction method provided by the embodiment of the invention builds the manifold evolutionary graph from the basic component sequence vector tree, with each sequence node in the tree serving as a node of the graph. Sequence alignment is needed to label neighbor relations only in the rare cases where a few highly similar but not identical sequences share both the basic component sequence vector and the basic component weight vector; the vast majority of neighbor relations are obtained from the basic component sequence vectors and/or weight vectors, so the amount of computation is small. For the rare sequences that do require alignment, their high similarity means that the alignment result depends little on the scoring matrix and other related parameters, so the reliability is high. In addition, sequences with different basic component weight vectors in the same node of the basic component sequence vector tree have their neighbor relations distinguished by the differences between those weight vectors, so the evolutionary relationships of any super-large-scale, complex, high-dimensional, nonlinear, strongly correlated data can be analyzed.

Fig. 3 is a flowchart of an implementation of another method for constructing a manifold evolutionary graph according to an embodiment of the present invention, which is similar to the above embodiment, except that the step S103 includes the following steps:

step S301, determining the basic component sequence vector and the basic component weight vector of the primary candidate sequence node adjacent to the primary of the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node.

In the embodiment of the present invention, when the main sequence node is a protein sequence, the base component sequence vector of the primary candidate sequence node that is one-stage adjacent to the main sequence node is formed by one amino acid mutation/insertion/deletion in the base component sequence vector of the main sequence node. When the main sequence node is a DNA/RNA sequence, the basic component sequence vector of the primary candidate sequence node adjacent to the primary of the main sequence node is formed by one base mutation/insertion/deletion in the basic component sequence vector of the main sequence node.

Step S302, according to the basic component sequence vector and the basic component weight vector of the main sequence node, determining the basic component sequence vector and the basic component weight vector of the secondary candidate sequence node which is adjacent to the main sequence node in the second level.

In the embodiment of the present invention, when the main sequence node is a protein sequence, the base component sequence vector of the secondary candidate sequence node secondarily adjacent to the main sequence node is formed by two amino acid mutations/insertions/deletions in the base component sequence vector of the main sequence node. When the main sequence node is a DNA/RNA sequence, the basic component sequence vector of the secondary candidate sequence node which is secondarily adjacent to the main sequence node is formed by mutation/insertion/deletion of two bases in the basic component sequence vector of the main sequence node.

In the present example, taking the case where the twenty natural amino acids are defined as the basic components of protein sequence data, and using the Alpha Hemolysin sequence above: a new basic component sequence vector generated by a single mutation/insertion/deletion is called a primary adjacent component sequence vector, and the corresponding sequence is called a primary adjacent sequence, that is, a primary candidate sequence node; a new basic component sequence vector formed by two mutations/insertions/deletions is called a secondary adjacent component sequence vector of the current sequence, and the corresponding sequence is called a secondary adjacent sequence, that is, a secondary candidate sequence node; and so on. Mutations that do not generate a new basic component sequence vector are not considered, because all the resulting sequences fall into the same node of the basic component sequence vector tree and need not be searched for.

Further, if V in the above example is mutated to I, the new sequence vector (CRV) becomes [ edklivivasstpnfyrqww ] and the corresponding weight vector (CWV) becomes [11, 9, 8, 7, 7, 5, 5, 5, 4, 4, 4, 4, 3, 3, 2, 2, 1, 1]. The new sequence vector and weight vector are used to check whether corresponding sequences exist in the component sequence vector tree (there may be zero or more). If they exist, they are labeled: they and the current sequence are primary-adjacent to each other, and the mutation relationship is reciprocal, V → I in one direction and I → V in the other. Regardless of the size of the whole component sequence vector tree, the cost of checking whether a sequence corresponding to each adjacent component sequence vector exists is a constant no greater than the depth of the tree. The number of candidate adjacent sequences to be checked grows exponentially with the adjacency level defined above, so the labeling of adjacent sequences must be limited to a chosen farthest adjacency level (no more than two levels is recommended). In this example, insertions and deletions give 39 possible primary adjacent sequences in total: inserting any one of the 20 classes of amino acids, or deleting one amino acid of any of the 19 classes already present in the sequence. For mutation, an amino acid of each of the 19 classes already present could in theory mutate into any of the other 19 classes, so there are 19 × 19 = 361 candidate primary adjacent sequences formed by mutation; in general, however, only a small portion of these candidates change the component sequence vector. In practice, an amino acid is unlikely to mutate directly into one with very different physicochemical properties, so only the few amino acids close to it need be considered, which reduces the number of candidate primary (and secondary) adjacent sequences.
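The candidate counts in this paragraph (20 insertions, 19 deletions, 19 × 19 = 361 mutations) can be reproduced with a short enumeration over amino acid classes. This is only a sketch: the names `primary_candidates` and `apply_edit` and the composition dictionary are hypothetical, and the toy composition merely has 19 classes present, as in the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acid classes

def primary_candidates(counts):
    """List the class-level edits that produce primary candidate compositions."""
    present = [a for a in AMINO_ACIDS if counts.get(a, 0) > 0]
    insertions = [("ins", a) for a in AMINO_ACIDS]
    deletions = [("del", a) for a in present]
    mutations = [("mut", a, b) for a in present for b in AMINO_ACIDS if b != a]
    return insertions, deletions, mutations

def apply_edit(counts, edit):
    """Return the composition obtained after one insertion, deletion or mutation."""
    new = dict(counts)
    if edit[0] in ("del", "mut"):
        new[edit[1]] -= 1
        if new[edit[1]] == 0:
            del new[edit[1]]
    if edit[0] == "ins":
        new[edit[1]] = new.get(edit[1], 0) + 1
    if edit[0] == "mut":
        new[edit[2]] = new.get(edit[2], 0) + 1
    return new

# Hypothetical composition with 19 classes present (every class except C).
composition = {a: 2 for a in AMINO_ACIDS if a != "C"}
insertions, deletions, mutations = primary_candidates(composition)
print(len(insertions), len(deletions), len(mutations))
```

Each resulting composition would then be converted to a CRV and CWV and looked up in the tree; restricting `mutations` to physicochemically similar pairs is one way to shrink the candidate list, as the paragraph suggests.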

Further, the possible secondary adjacent sequences of the above sequence are counted as follows. Two insertions chosen from the 20 amino acids give 20 × 20 = 400 possibilities. Of the 19 classes already present, 17 contain at least two amino acids, giving 17 × 17 = 289 possibilities for deleting two amino acids from these classes; the other two classes (M and W) contain only one amino acid each, and either both can be deleted, or one of them can be deleted together with an amino acid from any of the 17 classes, giving 1 + 2 × 17 = 35 possibilities; deleting two amino acids therefore gives 289 + 35 = 324 possibilities in total. For two mutations, the 17 classes with at least two amino acids give (19 × 19)² = 130321 possibilities, and the remaining two classes, each with a single amino acid that can be mutated, give 19 × 19 = 361 possibilities. The number of possible secondary adjacent sequences thus totals 131401, which is expensive to check. On the other hand, the more insertions/deletions/mutations are involved, the more evolutionary pathways are actually possible and the less reliable the mutations that can be inferred directly from the sequences, so surveying potential neighbors of more distant sequences in this way has little practical value.

In this example, for nucleic acid sequence data, the 16 two-base segments are defined as the basic components: AA, AC, AG, AT, CC, CA, CG, CT, GG, GA, GC, GT, TT, TA, TC and TG. Taking the COI gene sequence as an example, the sequence is 624 bases long and contains all 16 kinds of two-base segments; it is as follows:

TTGGAATCTGAACAGGACTAGTAGCCACGAGAATGAGACTCCTAATTCGAGCTGAGCTTGGACAACCTGGAACTCTTCTAGGAGACGATCAAATTTATAATTGCCTTATTACCGCTCATGGTCTATTAATGATATTTTTTGTAGTCCTACCTATTTTAATAGGAGGATTTGGAAATTGACTAGTTCCCTTAATACTAGGAGCTCCAGACATGGCTTTTCCCCGGATTAATAATCTTGGGTTCTGACTTATTCCCCCCGCAGTAATTCTCCTAGTAATATCCGCTTTTATCGAAAAAGGGGCTGGAACAGGATGAACTGTCTACCCTCCTTTAGCCTCTAATATTGCCCATGCAGGGCCATGCATTGATTTAGCTATTTTTGCCCTTCATTTATCCGGAGTATCCTCAATTCTAGCCTCTATCAACTTTATTACAACTGTAATAAATATACGATATAAAGGTCTTCGACTAGAACGAGTTCCTTTATTTGTATGAAGAGTAAAACTAACTGCAGTTCTTCTTCTTCTCTCAATTCCAGTTCTTGCCGGTGGACTTACTATACTTCTCACCGATCGAAATTTAAATACGTCCTTCTTTGACCCCGCAGGAGGAGGGGACCCAGTTC

Among them are 73 TT, 58 CT, 56 TA, 53 AT, 46 TC, 44 AA, 42 CC, 42 GA, 36 AG, 32 AC, 30 TG, 30 GG, 23 CA, 21 GC, 20 GT and 17 CG. The corresponding CRV is [TT, CT, TA, AT, TC, AA, CC, GA, AG, AC, TG, GG, CA, GC, GT, CG], and the corresponding CWV is [73, 58, 56, 53, 46, 44, 42, 42, 36, 32, 30, 30, 23, 21, 20, 17].
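The counting of two-base segments can be sketched as follows. Two points are assumptions of this sketch rather than statements of the description: the segments are taken as overlapping windows that advance one base at a time (consistent with the insertion example given for this sequence), and equal counts are ordered alphabetically. The function name `dinucleotide_crv_cwv` and the toy sequence are hypothetical.

```python
from collections import Counter

def dinucleotide_crv_cwv(sequence):
    """Return (CRV, CWV) using overlapping two-base segments as basic components."""
    # Assumption: windows overlap, so a sequence of n bases yields n - 1 segments.
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    # Assumption: ties are broken alphabetically; the description gives no tie rule.
    ranked = sorted(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
    crv = tuple(pair for pair, _ in ranked)
    cwv = tuple(count for _, count in ranked)
    return crv, cwv

toy = "AATTAA"  # hypothetical toy sequence: segments AA, AT, TT, TA, AA
toy_crv, toy_cwv = dinucleotide_crv_cwv(toy)
print(toy_crv, toy_cwv)
```

Applied to the COI sequence above, the same function would return its CRV and CWV, although the order of components with equal counts may differ from the listing in the description.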

A primary adjacent sequence of this sequence may be formed by the insertion, deletion or mutation of one base. There are 625 possible positions for a single-base insertion, 624 for a single-base deletion and 624 for a single-base mutation, the latter giving 624 × 3 = 1872 possible mutations. For a nucleic acid sequence with n bases, the potential primary adjacent sequences thus have 3n + 1 possible edit sites. The number of potential secondary adjacent sequences is clearly proportional to n², and checking them one by one carries a high computational cost, O(n²). However, because each secondary adjacent sequence of a target sequence is a primary adjacent sequence of one of its primary adjacent sequences, when the sequence database is large the manifold evolutionary graph can be constructed by labeling primary adjacent sequences only.
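The site counts above can be checked with a small sketch. The function names are hypothetical. Note that the paragraph counts insertion positions; enumerating which of the four bases is inserted, as `primary_edits` does, multiplies the number of insertion candidates by four.

```python
BASES = "ACGT"

def site_counts(n):
    """Counts of edit sites for a sequence of n bases, as described in the text."""
    return {
        "insertion_sites": n + 1,
        "deletion_sites": n,
        "mutation_sites": n,
        "substitutions": 3 * n,  # each site can change to 3 other bases
    }

def primary_edits(sequence):
    """Enumerate concrete single-base edits (position, base) of a sequence."""
    n = len(sequence)
    insertions = [(i, b) for i in range(n + 1) for b in BASES]
    deletions = list(range(n))
    substitutions = [(i, b) for i in range(n) for b in BASES if b != sequence[i]]
    return insertions, deletions, substitutions

coi_counts = site_counts(624)
toy_insertions, toy_deletions, toy_substitutions = primary_edits("ACGT")
print(coi_counts)
```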

It should be noted that, for nucleic acid sequences whose basic components are two-base segments, the site of the edit (such as the 1st or the 27th position) must be specified for each potential primary adjacent sequence in order to determine its CRV and CWV. For protein sequences based on the 20 amino acid types, it suffices to state that one amino acid of a given type is inserted, deleted or mutated into another type. The need to specify a site is an inconvenience that comes from using combined basic components: if the 400 classes of dipeptide fragments were defined as the basic components, defining the adjacent sequences of a protein sequence would also require specifying the site. On the other hand, when the site is specified, the information of each candidate record is completely determined, whereas for protein sequences with the 20 amino acids as basic components, a candidate record determines only the component sequence vector and weight vector, not the exact sequence.

In the COI gene sequence, the primary adjacent sequences are exemplified as follows:

1) Inserting a base A between the first and second positions changes the basic components: there is one more TA, one more AT and one fewer TT. No new CRV results, but the CWV changes to [72, 58, 57, 54, 46, 44, 42, 42, 36, 32, 30, 30, 23, 21, 20, 17].

2) If the second T is mutated to G, there is one fewer TT and one more GG. The CRV changes to [TT, CT, TA, AT, TC, AA, CC, GA, AG, AC, GG, TG, CA, GC, GT, CG] and the CWV changes to [72, 58, 56, 53, 46, 44, 42, 42, 36, 32, 31, 30, 23, 21, 20, 17].
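The component changes in these two examples can be reproduced on the first ten bases of the COI sequence, since an edit only affects the two-base segments that overlap the edited site. This is a minimal sketch assuming overlapping two-base windows; the helper names `pair_counts` and `count_delta` are hypothetical.

```python
from collections import Counter

def pair_counts(sequence):
    """Count overlapping two-base segments."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))

def count_delta(before, after):
    """Return the segments whose counts differ, mapped to the change in count."""
    b, a = pair_counts(before), pair_counts(after)
    return {k: a[k] - b[k] for k in sorted(set(a) | set(b)) if a[k] != b[k]}

prefix = "TTGGAATCTG"                      # first ten bases of the COI sequence
inserted = prefix[:1] + "A" + prefix[1:]   # example 1: A between positions 1 and 2
mutated = prefix[:1] + "G" + prefix[2:]    # example 2: second T mutated to G

insertion_delta = count_delta(prefix, inserted)
mutation_delta = count_delta(prefix, mutated)
print(insertion_delta)
print(mutation_delta)
```

Adding these differences to the CWV of the full sequence gives the candidate CWV without recounting all segments.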

In the embodiment of the present invention, as shown in fig. 4, A, B and C in the figure represent three sequences (records); a thick line represents a primary adjacent relationship and a thin line a secondary adjacent relationship, and more distant adjacent relationships are not shown. A and B are primary-adjacent, B and C are primary-adjacent, and A and C are secondary-adjacent; A and C may also be connected through other paths. When there are enough records, detecting only the primary adjacent sequences is therefore sufficient to draw the manifold evolutionary graph. At present, all knowledge about evolution is inference, and primary adjacent relationships are not only few in number but, more importantly, the most reliable inference. The advantage of using the basic component sequence vector tree to construct the evolutionary manifold is thus that a secondary adjacent sequence of sequence A is a primary adjacent sequence of its primary adjacent sequence B, so that when the sequence data set is large enough, almost all more distant adjacent sequences can be found by considering primary adjacent sequences only.

It is worth noting that secondary or more distant adjacent sequences may be found while searching for primary adjacent sequences/records. When the basic component sequence vector of a primary candidate sequence leads to a target node, the sequence/record with the candidate weight vector may not be found there, while the node contains other sequences/records that share the basic component sequence vector but have different basic component weight vectors. Similar sequences/records may also exist in adjacent nodes of the basic component sequence vector tree. In both cases, the adjacency level of those sequences/records can be determined directly by comparing the differences between the weight vectors. Similarly, more distant adjacent sequences may be found when detecting secondary candidate sequences, although detecting secondary candidate sequences is unnecessary when the sequence data set is large enough.

In the embodiment of the present invention, on the finished manifold evolutionary graph, the evolutionary distance between any two sequences is the minimum connection distance in the connected graph shown in fig. 4. This distance is essentially independent of sequence alignment parameters, and although we use sequence alignment within "super-component recombination nodes", the alignment results are essentially independent of the scoring matrix due to the very high degree of sequence similarity within these same super-nodes.
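The minimum connection distance can be computed with a standard shortest-path search. The sketch below uses Dijkstra's algorithm and assumes, for illustration only, that a primary edge has length 1 and a secondary edge length 2; the description does not fix edge lengths. The function name `manifold_distance` and the record D are hypothetical.

```python
import heapq

def manifold_distance(edges, source, target):
    """Shortest path length between two records; edges are (u, v, level) triples."""
    graph = {}
    for u, v, level in edges:
        # Assumption: the edge length equals the adjacency level (1 or 2).
        graph.setdefault(u, []).append((v, level))
        graph.setdefault(v, []).append((u, level))
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(nxt, float("inf")):
                best[nxt] = candidate
                heapq.heappush(heap, (candidate, nxt))
    return None  # the two records are not connected

# A, B, C as in fig. 4, plus a hypothetical record D joined to C by a secondary edge.
edges = [("A", "B", 1), ("B", "C", 1), ("A", "C", 2), ("C", "D", 2)]
distance_a_c = manifold_distance(edges, "A", "C")
distance_a_d = manifold_distance(edges, "A", "D")
print(distance_a_c, distance_a_d)
```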

In the embodiment of the invention, the constructed manifold evolutionary graph can be used to analyze the evolutionary relationships between sequences/records. At present, it is difficult to judge the degree of homology of sequences with about 30% sequence identity, whereas the manifold distance on the evolutionary manifold graph gives a good indication of the relative likelihood of homology. Secondly, the graph can be used to select representative sequences at different proximities, which is often needed in machine learning studies that take protein sequences as input. Conventional clustering methods for protein sequences or similar data work only for one given proximity at a time, whereas with an evolutionary manifold graph representative sequences can be selected for any given proximity. This is done by selecting a sequence/record in fig. 4 and deleting the records within the given proximity of it, repeating until every record in the evolutionary manifold is either selected as a representative sequence/record or deleted. Moreover, the manifold evolutionary graph can be used for statistical inference and inference of evolutionary direction on evolutionary relationships that approach the global scale. In addition, manifold evolutionary graphs can be constructed separately for protein structures and protein sequences. As mentioned above in the description of the basic component sequence vector tree, the protein structure component sequence vector tree is constructed by defining the basic components as combinations of secondary structure state and amino acid type. This goes beyond the current situation, in which evolutionary information is used only through the construction and use of PSSM matrices, and allows sequence-structure correspondence to be studied on a scale close to that of all natural proteins, providing new basic technical infrastructure for synthetic biology design and for the prediction and design of protein molecular structures.
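The selection of representative sequences described above can be sketched as a greedy procedure: pick a record, delete every record within the given proximity, and repeat. In this sketch proximity is measured in hops over primary edges and records are visited in sorted order; both choices, and the function names, are assumptions made for illustration.

```python
from collections import deque

def within(adjacency, start, radius):
    """Return all records reachable from start in at most `radius` hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == radius:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)

def pick_representatives(adjacency, radius):
    """Greedily select representatives so every record is selected or deleted."""
    remaining = set(adjacency)
    representatives = []
    for node in sorted(adjacency):  # assumption: visit records in sorted order
        if node in remaining:
            representatives.append(node)
            remaining -= within(adjacency, node, radius)
    return representatives

# Hypothetical chain of five records joined by primary edges.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
reps_radius_1 = pick_representatives(chain, 1)
reps_radius_2 = pick_representatives(chain, 2)
print(reps_radius_1, reps_radius_2)
```

Because the graph is built once, representatives for a different proximity only require calling the function again with another radius.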

It is worth noting that the method can be used for analyzing the evolutionary relationship of any ultra-large-scale complex high-dimensional nonlinear strongly-correlated data. Taking the protein sequence as an example, the method can realize the analysis of the evolutionary relationship of any large protein sequence data set independent of a scoring matrix and large-scale sequence comparison, and based on the same principle, the method can also be used for gene sequences, genome sequences and other high-dimensional complex nonlinear strong correlation data.

In one embodiment, as shown in fig. 5, a method for constructing a manifold evolutionary graph is different from the method shown in fig. 1 in that the method further comprises:

Step S501, when a new sequence node exists, calculating a basic component sequence vector and a basic component weight vector of the new sequence node.

Step S502, according to the basic component sequence vector and the basic component weight vector, the new sequence node is inserted into a basic component sequence vector tree, and the basic component sequence vector and the basic component weight vector of a new candidate sequence node with different neighbor relations with the new sequence node are obtained.

Step S503, adding the new sequence node and the new candidate sequence node into the manifold evolutionary graph according to the basic component sequence vector and the basic component weight vector of the new sequence node, the basic component sequence vector and the basic component weight vector of the new candidate sequence node, and the neighbor relationship between the new sequence node and the new candidate sequence node.

In the embodiment of the invention, for each record added subsequently, i.e., each new sequence node, the CRV and CWV are calculated and the record is inserted into the basic component sequence vector tree. All of its adjacent sequences, from the primary level up to the farthest adjacency level, are then labeled, and the sequence is added to the manifold evolutionary graph using the adjacency relations level by level, until the relations with all labeled adjacent sequences have been used or the whole connected graph is complete. This is an important advantage of constructing the manifold graph on the component sequence vector tree: current evolutionary analysis algorithms based on protein sequence clustering or on evolutionary trees must all be rerun from the beginning when new sequences are added. In the invention, adding a new sequence is easy: it is added to the evolutionary manifold graph using the same adjacent sequence analysis as before. More importantly, once the number of available sequences has grown large enough, a connected graph can be obtained by considering primary adjacent sequences only.
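The incremental addition of a record can be sketched as below for protein sequences with amino acid classes as basic components. A plain dictionary keyed by (CRV, CWV) stands in for the basic component sequence vector tree, only primary adjacency is labeled, and ties in the CRV are broken alphabetically; these simplifications and all names are assumptions of the sketch.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def key_of(counts):
    """Build a (CRV, CWV) key from component counts."""
    ranked = sorted(((c, n) for c, n in counts.items() if n > 0),
                    key=lambda kv: (-kv[1], kv[0]))
    return tuple(c for c, _ in ranked), tuple(n for _, n in ranked)

def primary_keys(counts):
    """Keys of all compositions one insertion, deletion or mutation away."""
    keys = set()
    for a in ALPHABET:
        inserted = Counter(counts)
        inserted[a] += 1
        keys.add(key_of(inserted))
        if counts[a] > 0:
            deleted = Counter(counts)
            deleted[a] -= 1
            keys.add(key_of(deleted))
            for b in ALPHABET:
                if b != a:
                    mutated = Counter(deleted)
                    mutated[b] += 1
                    keys.add(key_of(mutated))
    return keys

def add_record(index, edges, name, sequence):
    """Insert one record and label its primary neighbors already in the index."""
    counts = Counter(sequence)
    for neighbor_key in primary_keys(counts):
        for other in index.get(neighbor_key, []):
            edges.add(frozenset((name, other)))
    index.setdefault(key_of(counts), []).append(name)

index, edges = {}, set()
add_record(index, edges, "s1", "MKLV")   # hypothetical toy records
add_record(index, edges, "s2", "MKLI")   # V -> I relative to s1
add_record(index, edges, "s3", "MKLIA")  # one insertion relative to s2
print(sorted(sorted(edge) for edge in edges))
```

Existing records and edges are left untouched when a record is added, so nothing is recomputed from the beginning.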

As shown in fig. 6, in an embodiment, a manifold evolutionary graph construction apparatus is provided, which may specifically include a data set obtaining unit 610, a component order vector tree constructing unit 620, an order vector and weight vector determining unit 630, a neighbor relation labeling unit 640, and a manifold evolutionary graph constructing unit 650.

The data set obtaining unit 610 is configured to obtain a data set to be processed.

The component order vector tree constructing unit 620 is configured to construct a basic component order vector tree according to the to-be-processed data set.

In an embodiment of the present invention, the basic component order vector tree includes at least one main sequence node, and a basic component order vector and a basic component weight vector of the main sequence node. In the invention, each sequence node in the basic component sequence vector tree is used as a node of the manifold evolutionary graph, and in order to facilitate the construction of the subsequent manifold evolutionary graph, the sequence nodes in the basic component sequence vector tree can store the Calculated Weight Vector (CWV) so as to avoid subsequent repeated calculation.

The order vector and weight vector determining unit 630 is configured to determine, according to the basic component order vector and the basic component weight vector of the main sequence node, a basic component order vector and a basic component weight vector of candidate sequence nodes having different neighbor relations with the main sequence node.

In the embodiment of the present invention, a candidate sequence node having a different neighbor relation with the main sequence node is a candidate sequence node corresponding to a new basic component sequence vector produced by adjusting the basic components of the main sequence node. For example, for the sequence of the protein molecule Alpha Hemolysin, a candidate sequence corresponds to a new basic component sequence vector produced by applying one or more amino acid mutations, insertions or deletions to the basic component sequence vector of that sequence. The neighbor relations are ordered from near to far by the number of mutations, insertions or deletions: a new component sequence vector produced by one mutation/insertion/deletion is called a primary adjacent component sequence vector, and the corresponding sequence is called a primary adjacent sequence; a new component sequence vector produced by two mutations/insertions/deletions is called a secondary adjacent component sequence vector of the current sequence, and the corresponding sequence is called a secondary adjacent sequence; and so on.

The neighbor relation labeling unit 640 is configured to label the corresponding neighbor relations with each other when the candidate sequence node is found in the basic component order vector tree according to the basic component order vector and the basic component weight vector of the candidate sequence node with different neighbor relations.

In the embodiment of the invention, for each sequence/record/graph node (source record) in the constructed basic component sequence vector tree, whether each candidate sequence/record/graph node (neighboring record) exists is searched according to the basic component sequence vector and the basic component weight vector of the candidate sequence node with different neighboring relations, if the neighboring record exists, mutual labeling is carried out, and the labeling is an edge (edge) in the manifold evolutionary graph, wherein the labeling can comprise the relation between the corresponding source record and the neighboring record. If the neighbor record does not exist, no action is performed. All graph nodes in the basic component sequence vector tree and all edges added in the steps form the manifold evolutionary graph.
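The lookup of a candidate record can be sketched with a prefix tree keyed by the successive components of the CRV, so that a search takes at most one step per component. The tree layout, the class `TreeNode` and the functions `insert`, `find` and `label` are assumptions of this sketch; the actual structure of the basic component sequence vector tree is defined elsewhere in the description.

```python
class TreeNode:
    """One node of a hypothetical component sequence vector tree."""
    def __init__(self):
        self.children = {}  # next CRV component -> child node
        self.records = {}   # CWV tuple -> list of record names

def insert(root, crv, cwv, name):
    node = root
    for component in crv:
        node = node.children.setdefault(component, TreeNode())
    node.records.setdefault(tuple(cwv), []).append(name)

def find(root, crv, cwv):
    """Return the records stored under (crv, cwv), or an empty list."""
    node = root
    for component in crv:
        node = node.children.get(component)
        if node is None:
            return []
    return node.records.get(tuple(cwv), [])

def label(root, edges, source_name, candidate_crv, candidate_cwv, relation):
    """Add an edge for every neighbor record found; do nothing otherwise."""
    for neighbor in find(root, candidate_crv, candidate_cwv):
        edges.append((source_name, neighbor, relation))

root, edges = TreeNode(), []
insert(root, ("L", "K", "A"), (3, 2, 1), "r1")   # hypothetical records
insert(root, ("L", "K", "A"), (3, 2, 2), "r2")
label(root, edges, "r1", ("L", "K", "A"), (3, 2, 2), "primary: insert A")
label(root, edges, "r1", ("K", "L", "A"), (3, 2, 1), "primary: L -> K")
print(edges)
```

Only the first call adds an edge, because no record is stored under the second candidate key.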

The manifold evolutionary graph constructing unit 650 is configured to construct a manifold evolutionary graph according to the main sequence node and the neighbor relation.

In the embodiment of the invention, each sequence in the basic component sequence vector tree is taken as a node of the manifold evolutionary graph. All sequences that are located at the same node of the basic component sequence vector tree and have the same weight vector (i.e., share both CRV and CWV) form a "basic component recombination" super manifold evolutionary graph node. The mutation relationships inside such a super node must be determined by sequence alignment, but first, the amount of calculation is small, and because the sequences are highly similar, the alignment result depends little on the scoring matrix and other related parameters and is highly reliable. Secondly, sequences with different weight vectors in the same node of the component sequence vector tree have their adjacency relations labeled according to the differences between their weight vectors. Note the distinction between nodes of the manifold evolutionary graph (graph nodes) and nodes of the component sequence vector tree (tree nodes): a tree node may be empty, or may contain multiple graph nodes or "basic component recombination" super manifold evolutionary graph nodes.

In the embodiment of the invention, graph analysis is carried out after the connection of the primary candidate sequence nodes in the manifold evolutionary graph is finished, and if all the sequence nodes are connected together to form a connected graph (connected graph), the construction of the manifold evolutionary graph is finished. The manifold is the most reliable (primary reliability) manifold. If all sequence nodes do not form a connected graph, each connected subgraph is defined as a first-level reliability subgraph, then the same connection analysis is carried out on the edges added with a second-level adjacent sequence, and if the original upper-level reliability subgraphs are connected, the combined subgraphs are marked as second-level reliability subgraphs (the original upper-level subgraph marks are kept). And so on until an overall connected graph is formed or all edges of the furthest adjacent sequence are used. If a plurality of connected subgraphs remain, the possible processing scheme is to continue to increase the farthest adjacent level or not to process and wait for the subsequent data.
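The level-by-level connectivity analysis can be sketched with a union-find structure: edges of one adjacency level are merged, the connected subgraphs are recorded, and the next level is added only if more than one subgraph remains. The function name `build_by_levels` and the toy data are hypothetical.

```python
def build_by_levels(nodes, edges_by_level):
    """Return [(level, connected subgraphs)] until the graph is connected."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    history = []
    for level in sorted(edges_by_level):
        for u, v in edges_by_level[level]:
            parent[find(u)] = find(v)
        groups = {}
        for n in nodes:
            groups.setdefault(find(n), set()).add(n)
        subgraphs = sorted(sorted(g) for g in groups.values())
        history.append((level, subgraphs))  # subgraphs at this reliability level
        if len(groups) == 1:
            break  # an overall connected graph has been formed
    return history

nodes = ["A", "B", "C", "D"]
edges_by_level = {1: [("A", "B"), ("C", "D")], 2: [("B", "C")]}
history = build_by_levels(nodes, edges_by_level)
print(history)
```

In this toy case the primary edges leave two first-level reliability subgraphs, which the secondary edge merges into one second-level subgraph.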

The manifold evolutionary graph construction apparatus constructs the manifold evolutionary graph through the basic component sequence vector tree. Each sequence node in the basic component sequence vector tree is used as a node of the manifold evolutionary graph, and all sequences that are located at the same basic component sequence vector tree node and have the same weight vector (i.e., share both CRV and CWV) form a "basic component recombination" super manifold evolutionary graph node. The mutation relationships inside a super node are determined by sequence alignment, but the amount of calculation is small, and because the sequences are highly similar, the alignment result depends little on the scoring matrix and other related parameters and is highly reliable. In addition, sequences with different basic component weight vectors in the same node of the basic component sequence vector tree have their adjacency relations distinguished according to the differences between their basic component weight vectors, so that the evolutionary relationships of any ultra-large-scale complex high-dimensional nonlinear strongly-correlated data can be analyzed.

As shown in fig. 7, in an embodiment, the order vector and weight vector determining unit 630 specifically includes: a primary candidate vector determination module 631, and a secondary candidate vector determination module 632.

The primary candidate vector determining module 631 is configured to determine, according to the basic component sequence vector and the basic component weight vector of the primary sequence node, a basic component sequence vector and a basic component weight vector of a primary candidate sequence node that is one-level adjacent to the primary sequence node.

The secondary candidate vector determining module 632 is configured to determine a basis component order vector and a basis component weight vector of a secondary candidate sequence node that is two-stage adjacent to the primary sequence node according to the basis component order vector and the basis component weight vector of the primary sequence node.

Further, if V in the above example is mutated to I, the new sequence vector (CRV) becomes [ edklivivasstpnfyrqww ] and the corresponding weight vector (CWV) becomes [11, 9, 8, 7, 7, 5, 5, 5, 4, 4, 4, 4, 3, 3, 2, 2, 1, 1]. The new sequence vector and weight vector are used to check whether corresponding sequences exist in the component sequence vector tree (there may be zero or more). If they exist, they are labeled: they and the current sequence are primary-adjacent to each other, and the mutation relationship is reciprocal, V → I in one direction and I → V in the other. Regardless of the size of the whole component sequence vector tree, the cost of checking whether a sequence corresponding to each adjacent component sequence vector exists is a constant no greater than the depth of the tree. The number of candidate adjacent sequences to be checked grows exponentially with the adjacency level defined above, so the labeling of adjacent sequences must be limited to a chosen farthest adjacency level (preferably no more than two levels). In this example, insertions and deletions give 39 possible primary adjacent sequences in total: inserting any one of the 20 classes of amino acids, or deleting one amino acid of any of the 19 classes already present in the sequence. For mutation, an amino acid of each of the 19 classes already present could in theory mutate into any of the other 19 classes, so there are 19 × 19 = 361 candidate primary adjacent sequences formed by mutation; in general, however, only a small portion of these candidates change the component sequence vector. In practice, an amino acid is unlikely to mutate directly into one with very different physicochemical properties, so only the few amino acids close to it need be considered, which reduces the number of candidate primary (and secondary) adjacent sequences.

Further, the possible secondary adjacent sequences of the above sequence are counted as follows. Two insertions chosen from the 20 amino acids give 20 × 20 = 400 possibilities. Of the 19 classes already present, 17 contain at least two amino acids, giving 17 × 17 = 289 possibilities for deleting two amino acids from these classes; the other two classes (M and W) contain only one amino acid each, and either both can be deleted, or one of them can be deleted together with an amino acid from any of the 17 classes, giving 1 + 2 × 17 = 35 possibilities; deleting two amino acids therefore gives 289 + 35 = 324 possibilities in total. For two mutations, the 17 classes with at least two amino acids give (19 × 19)² = 130321 possibilities, and the remaining two classes, each with a single amino acid that can be mutated, give 19 × 19 = 361 possibilities. The number of possible secondary adjacent sequences thus totals 131401, which is expensive to check. On the other hand, the more insertions/deletions/mutations are involved, the more evolutionary pathways are actually possible and the less reliable the mutations that can be inferred directly from the sequences, so surveying potential neighbors of more distant sequences in this way has little practical value.

In the embodiment of the present invention, as shown in fig. 4, A, B and C in the figure represent three sequences (records); a thick line represents a primary adjacent relationship and a thin line a secondary adjacent relationship, and more distant adjacent relationships are not shown. A and B are primary-adjacent, B and C are primary-adjacent, and A and C are secondary-adjacent; A and C may also be connected through other paths. When there are enough records, detecting only the primary adjacent sequences is therefore sufficient to draw the manifold. At present, all knowledge about evolution is inference, and primary adjacent relationships are not only few in number but, more importantly, the most reliable inference. The advantage of using the basic component sequence vector tree to construct the evolutionary manifold is thus that a secondary adjacent sequence of sequence A is a primary adjacent sequence of its primary adjacent sequence B, so that when the sequence data set is large enough, almost all more distant adjacent sequences can be found by considering primary adjacent sequences only.

It is worth noting that secondary or more distant adjacent sequences may be found while searching for primary adjacent sequences/records. When the component sequence vector of a primary candidate adjacent sequence leads to a target node, the sequence/record with the candidate weight vector may not be found there, while the node contains other sequences/records that share the component sequence vector but have different weight vectors. Similar sequences/records may also exist in adjacent nodes of the component sequence vector tree. In both cases, the adjacency level of those sequences/records can be determined directly by comparing the differences between the weight vectors. Similarly, more distant adjacent sequences may be found when detecting secondary adjacent sequences (which is unnecessary when the sequence data set is large enough).

FIG. 8 is a diagram illustrating the internal structure of a computer device in one embodiment. As shown in fig. 8, the computer device includes a processor, a memory, a network interface, an input device, and a display screen, which are connected through a system bus. The memory comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the manifold evolutionary graph construction method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the manifold evolutionary graph construction method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.

Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, the manifold evolutionary graph construction apparatus provided by the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 8. The memory of the computer device may store the program modules constituting the manifold evolutionary graph construction apparatus, such as the data set obtaining unit 610, the component order vector tree constructing unit 620, the order vector and weight vector determining unit 630, the neighbor relation labeling unit 640, and the manifold evolutionary graph constructing unit 650 shown in fig. 6. The computer program constituted by the respective program modules causes the processor to execute the steps of the manifold evolutionary graph construction method of the embodiments of the present application described in this specification.

For example, the computer device shown in fig. 8 may execute step S101 through the data set acquisition unit 610 module in the manifold evolutionary graph building apparatus shown in fig. 6. The computer device may perform step S102 by the component order vector tree construction unit 620. The computer device may perform step S103 through the order vector and weight vector determination unit 630. The computer device may perform step S104 by the manifold construction unit 640.

In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

acquiring a data set to be processed;

determining the basic component sequence vector and the basic component weight vector of a candidate sequence node with different neighbor relations with the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node;

and constructing a manifold evolutionary graph according to the neighbor relation, the basic component sequence vector and the basic component weight vector.
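The determination of candidate sequence nodes can be illustrated with a minimal sketch. For simplicity it assumes that candidates are generated directly on raw sequence strings rather than on the basic component sequence vector and weight vector, and that a plain Python set stands in for the basic component sequence vector tree. The function names `one_edit_neighbors` and `two_edit_neighbors` are hypothetical.

```python
from typing import Set

def one_edit_neighbors(seq: str, alphabet: str) -> Set[str]:
    """All strings reachable from seq by one mutation, insertion or deletion."""
    out = set()
    for i in range(len(seq) + 1):
        for c in alphabet:
            out.add(seq[:i] + c + seq[i:])              # insertion before position i
        if i < len(seq):
            out.add(seq[:i] + seq[i + 1:])              # deletion of position i
            for c in alphabet:
                if c != seq[i]:
                    out.add(seq[:i] + c + seq[i + 1:])  # mutation at position i
    out.discard(seq)
    return out

def two_edit_neighbors(seq: str, alphabet: str) -> Set[str]:
    """Strings reachable by two edits, excluding seq and its one-edit neighbors."""
    first = one_edit_neighbors(seq, alphabet)
    out = set()
    for s in first:
        out |= one_edit_neighbors(s, alphabet)
    return out - first - {seq}

# Candidates count as neighbors only if they are present in the data set.
present = {"AC", "AG", "GG", "TTT"}
primary = one_edit_neighbors("AC", "ACGT") & present
secondary = two_edit_neighbors("AC", "ACGT") & present
```

Generating candidates and looking them up avoids comparing every pair of sequences, which is the reason the candidate-based search scales with the number of sequences rather than its square.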

In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of: acquiring a data set to be processed;

and constructing a manifold evolutionary graph according to the neighbor relation, the basic component sequence vector and the basic component weight vector.

It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages need not be performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several embodiments of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Although the basic component sequence vector tree is one of the most effective means of constructing the manifold evolutionary graph, other faster sequence clustering means (e.g., MMSEQ) can also be used to find similar sequences, after which the manifold evolutionary graph is constructed by performing sequence alignment on the candidate sequences. Constructing a manifold evolutionary graph from the evolutionary relationships among highly similar sequences, and thereby understanding evolution, is therefore the core idea of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
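The alternative route, in which similar sequences are grouped first and then compared pairwise, can be sketched as follows. The sketch assumes that an external clustering tool has already produced a cluster of similar sequences, and it substitutes a plain Levenshtein edit distance for a full sequence alignment. The function names `edit_distance` and `edges_within_cluster` and the `max_level` parameter are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or mutation
        prev = cur
    return prev[-1]

def edges_within_cluster(cluster: Iterable[str],
                         max_level: int = 2) -> Dict[Tuple[str, str], int]:
    """Map each pair at distance 1..max_level to its neighbor level."""
    edges = {}
    for a, b in combinations(sorted(set(cluster)), 2):
        if abs(len(a) - len(b)) > max_level:
            continue  # the length gap alone already exceeds the allowed edits
        d = edit_distance(a, b)
        if 1 <= d <= max_level:
            edges[(a, b)] = d
    return edges

edges = edges_within_cluster(["ACGT", "ACGA", "ACG", "TTTTTT"])
```

The pairwise comparison is quadratic in the cluster size, so this route only pays off when the preceding clustering step keeps each cluster small.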

Claims

1. A manifold evolutionary graph construction method for protein sequence data set evolutionary relationship analysis is characterized by comprising the following steps:

acquiring a data set to be processed;

constructing a basic component sequence vector tree according to the data set to be processed; the base component sequence vector tree comprises at least one main sequence node and a base component sequence vector and a base component weight vector of the main sequence node; the main sequence node is a protein sequence; the step of constructing a basic component sequence vector tree according to the data set to be processed comprises the following steps:

defining basic components according to the data set to be processed and a preset definition rule; wherein the base component is defined using a combination of a class definition of the secondary structure of a single amino acid and a sequence component definition;

determining a basic component sequence vector according to the basic component;

inserting the basic component sequence vector into a corresponding node of a multi-branch tree to construct a basic component sequence vector tree;

determining the basic component sequence vector and the basic component weight vector of the candidate sequence node with different neighbor relations with the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node, comprising:

determining a basic component sequence vector and a basic component weight vector of a primary candidate sequence node that is primary-adjacent to the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node; the basic component sequence vector of the primary candidate sequence node that is primary-adjacent to the main sequence node is formed by mutation/insertion/deletion of one amino acid in the basic component sequence vector of the main sequence node;

determining a basic component sequence vector and a basic component weight vector of a secondary candidate sequence node that is secondary-adjacent to the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node; the basic component sequence vector of the secondary candidate sequence node that is secondary-adjacent to the main sequence node is formed by mutation/insertion/deletion of two amino acids in the basic component sequence vector of the main sequence node;

when the candidate sequence node is found in the basic component sequence vector tree according to the basic component sequence vector and the basic component weight vector of the candidate sequence node with different neighbor relations, the corresponding neighbor relations are labeled mutually;

constructing a manifold evolutionary graph according to the main sequence nodes and the neighbor relation;

for each sequence/record/graph node in the constructed basic component sequence vector tree, searching whether each candidate sequence/record/graph node exists according to the basic component sequence vector and the basic component weight vector of the candidate sequence nodes with different neighbor relations, and if so, labeling the nodes mutually, wherein each label is an edge in the manifold evolutionary graph and may contain the relation between the corresponding source record and neighbor record; carrying out graph analysis after the nodes of the manifold evolutionary graph are connected; if, after all the first-level neighbor relations are added, all sequence nodes are connected together to form a connected graph, construction of the manifold evolutionary graph is finished, and the manifold evolutionary graph is a manifold evolutionary graph with first-level reliability; if the sequence nodes do not all form a connected graph, defining each connected subgraph as a first-level reliability subgraph, further adding the edges of second-level neighbor sequences and performing the same connectivity analysis, and if previous-level reliability subgraphs become connected, merging them and marking the result as a second-level reliability subgraph; and so on, until the whole graph is connected or the edges formed by the neighbor relations of all levels up to the farthest neighbor level have been used; if a plurality of connected subgraphs still remain in the end, either continuing to increase the farthest neighbor level, or performing no further processing and waiting for subsequent data.
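The level-by-level connectivity analysis recited above can be sketched with a disjoint-set (union-find) structure. This is only one possible way to carry out the analysis; the claim does not prescribe a data structure. The names `DisjointSet` and `connect_by_level` and the `edges_by_level` mapping are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

class DisjointSet:
    """Union-find over graph nodes; each root identifies one connected subgraph."""
    def __init__(self, items: Iterable[Hashable]) -> None:
        self.parent = {x: x for x in items}

    def find(self, x: Hashable) -> Hashable:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: Hashable, b: Hashable) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def count(self) -> int:
        return len({self.find(x) for x in self.parent})

def connect_by_level(
    nodes: Iterable[Hashable],
    edges_by_level: Dict[int, List[Tuple[Hashable, Hashable]]],
) -> Tuple[Optional[int], Dict[int, int]]:
    """Add edges level by level; stop at the first level that connects the graph.

    Returns the reliability level at which a single connected graph was
    reached (None if it never was) and the subgraph count after each level.
    """
    ds = DisjointSet(nodes)
    counts = {}
    for level in sorted(edges_by_level):
        for a, b in edges_by_level[level]:
            ds.union(a, b)
        counts[level] = ds.count()
        if counts[level] == 1:
            return level, counts
    return None, counts

level, counts = connect_by_level(
    ["A", "B", "C", "D"],
    {1: [("A", "B"), ("C", "D")], 2: [("B", "C")]},
)
```

A `None` result corresponds to the final branch of the claim: the caller may either raise the farthest neighbor level and supply more edges, or leave the subgraphs as they are until more data arrives.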

2. A manifold evolutionary graph construction method for DNA/RNA sequence data set evolutionary relationship analysis is characterized by comprising the following steps:

acquiring a data set to be processed;

constructing a basic component sequence vector tree according to the data set to be processed; the base component sequence vector tree comprises at least one main sequence node and a base component sequence vector and a base component weight vector of the main sequence node; the main sequence node is a DNA/RNA sequence; the step of constructing a basic component sequence vector tree according to the data set to be processed comprises the following steps:

defining basic components according to the data set to be processed and a preset definition rule; for a DNA sequence, permuting and combining the four bases A, T, C and G and taking the resulting permutations and combinations of bases as basic components; for an RNA sequence, defining permutations and combinations of the four bases A, U, C and G, or different classes of such combinations, as basic components;

determining a basic component sequence vector according to the basic component;

determining a basic component sequence vector and a basic component weight vector of a primary candidate sequence node that is primary-adjacent to the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node; the basic component sequence vector of the primary candidate sequence node that is primary-adjacent to the main sequence node is formed by mutation/insertion/deletion of one base in the basic component sequence vector of the main sequence node;

determining a basic component sequence vector and a basic component weight vector of a secondary candidate sequence node that is secondary-adjacent to the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node; the basic component sequence vector of the secondary candidate sequence node that is secondary-adjacent to the main sequence node is formed by mutation/insertion/deletion of two bases in the basic component sequence vector of the main sequence node;

for each sequence/record/graph node in the constructed basic component sequence vector tree, searching whether each candidate sequence/record/graph node exists according to the basic component sequence vectors and the basic component weight vectors of the candidate sequence nodes with different neighbor relations, and if so, labeling the nodes mutually, wherein each label is an edge in the manifold evolutionary graph and may contain the relation between the corresponding source record and neighbor record; carrying out graph analysis after the nodes of the manifold evolutionary graph are connected; if, after all the first-level neighbor relations are added, all sequence nodes are connected together to form a connected graph, construction of the manifold evolutionary graph is finished, and the manifold evolutionary graph is a manifold evolutionary graph with first-level reliability; if the sequence nodes do not all form a connected graph, defining each connected subgraph as a first-level reliability subgraph, further adding the edges of second-level neighbor sequences and performing the same connectivity analysis, and if previous-level reliability subgraphs become connected, merging them and marking the result as a second-level reliability subgraph; and so on, until the whole graph is connected or the edges formed by the neighbor relations of all levels up to the farthest neighbor level have been used; if a plurality of connected subgraphs still remain, either continuing to increase the farthest neighbor level, or performing no further processing and waiting for subsequent data.
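Searching the tree for a candidate node can be pictured with a minimal multi-branch tree (trie) sketch. It assumes that each level of the tree branches on the next element of a vector of integers and that records sharing the same vector are stored together at the end of the path. The actual content of the basic component sequence vector is defined elsewhere in the specification, so the vectors below are placeholders, and the class name `VectorTree` is hypothetical.

```python
from typing import Dict, List, Sequence

_END = object()  # sentinel key marking the records stored at the end of a path

class VectorTree:
    """Multi-branch tree keyed element by element on an integer vector."""
    def __init__(self) -> None:
        self.root: Dict = {}

    def insert(self, vector: Sequence[int], record_id: str) -> None:
        node = self.root
        for value in vector:
            node = node.setdefault(value, {})  # descend, creating branches as needed
        node.setdefault(_END, []).append(record_id)

    def find(self, vector: Sequence[int]) -> List[str]:
        """Return the records stored under exactly this vector, or an empty list."""
        node = self.root
        for value in vector:
            node = node.get(value)
            if node is None:
                return []
        return list(node.get(_END, []))

tree = VectorTree()
tree.insert((2, 0, 1), "seq1")
tree.insert((2, 0, 1), "seq2")
tree.insert((2, 1), "seq3")
```

A lookup costs time proportional to the vector length and is independent of how many records the tree holds, which is what makes checking a large number of candidate vectors affordable.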

3. The manifold evolutionary graph building method according to claim 1 or 2, further comprising:

when a new sequence node exists, calculating a basic component sequence vector and a basic component weight vector of the new sequence node;

inserting the new sequence node into the basic component sequence vector tree according to the basic component sequence vector and the basic component weight vector, and acquiring the basic component sequence vector and the basic component weight vector of a new candidate sequence node which has different neighbor relations with the new sequence node;

and adding the new sequence node and the new candidate sequence node into the manifold evolutionary graph according to the basic component sequence vector and the basic component weight vector of the new sequence node, the basic component sequence vector and the basic component weight vector of the new candidate sequence node and the neighbor relation between the new sequence node and the new candidate sequence node.
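The incremental update of claim 3 can be sketched as follows. The sketch assumes that the graph is a dictionary mapping each sequence to its neighbors and their neighbor level, that candidates are generated on raw strings rather than on the basic component vectors, and that only first-level and second-level neighbors are considered. The names `one_edit` and `add_sequence` are hypothetical.

```python
from typing import Dict, Set

Graph = Dict[str, Dict[str, int]]  # node -> {neighbor: neighbor level}

def one_edit(seq: str, alphabet: str) -> Set[str]:
    """Strings one mutation, insertion or deletion away from seq."""
    out = set()
    for i in range(len(seq) + 1):
        out.update(seq[:i] + c + seq[i:] for c in alphabet)          # insertion
        if i < len(seq):
            out.add(seq[:i] + seq[i + 1:])                           # deletion
            out.update(seq[:i] + c + seq[i + 1:] for c in alphabet)  # mutation
    out.discard(seq)
    return out

def add_sequence(graph: Graph, seq: str, alphabet: str = "ACGT") -> Graph:
    """Insert seq and link it to the existing nodes found among its candidates."""
    if seq in graph:
        return graph
    level1 = one_edit(seq, alphabet)
    level2 = set().union(*(one_edit(s, alphabet) for s in level1)) - level1 - {seq}
    graph[seq] = {}
    for level, candidates in ((1, level1), (2, level2)):
        for other in candidates:
            if other in graph:
                graph[seq][other] = level  # label both directions,
                graph[other][seq] = level  # as the edge is mutual
    return graph

graph: Graph = {}
for s in ["ACGT", "ACGA", "ACAA", "TTTT"]:
    add_sequence(graph, s)
```

Only the candidates of the new sequence are examined, so existing edges are left untouched and the cost of an update does not depend on the size of the graph already built.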

4. A manifold evolutionary graph construction device for protein sequence data set evolutionary relationship analysis, characterized in that it is applied to the manifold evolutionary graph construction method for protein sequence data set evolutionary relationship analysis and comprises:

the data set acquisition unit is used for acquiring a data set to be processed;

the component sequence vector tree construction unit is used for constructing a basic component sequence vector tree according to the data set to be processed; the base component sequence vector tree comprises at least one main sequence node and a base component sequence vector and a base component weight vector of the main sequence node; the component sequence vector tree construction unit:

defining basic components according to the data set to be processed and a preset definition rule; wherein the base component is defined using a combination of a class definition of the secondary structure of a single amino acid and a definition of the sequence component;

determining a basic component sequence vector according to the basic component;

the sequence vector and weight vector determining unit is used for determining the basic component sequence vector and the basic component weight vector of the candidate sequence node with different neighbor relations with the main sequence node according to the basic component sequence vector and the basic component weight vector of the main sequence node; wherein the sequence vector and weight vector determining unit includes:

a primary candidate vector determining module, configured to determine, according to the basic component sequence vector and the basic component weight vector of the main sequence node, a basic component sequence vector and a basic component weight vector of a primary candidate sequence node that is primary-adjacent to the main sequence node; the basic component sequence vector of the primary candidate sequence node that is primary-adjacent to the main sequence node is formed by mutation/insertion/deletion of one amino acid in the basic component sequence vector of the main sequence node; and

a secondary candidate vector determining module, configured to determine, according to the basic component sequence vector and the basic component weight vector of the main sequence node, a basic component sequence vector and a basic component weight vector of a secondary candidate sequence node that is secondary-adjacent to the main sequence node; the basic component sequence vector of the secondary candidate sequence node that is secondary-adjacent to the main sequence node is formed by mutation/insertion/deletion of two amino acids in the basic component sequence vector of the main sequence node;

a neighbor relation labeling unit, configured to label the neighbor relations when the candidate sequence node is found in the basic component order vector tree according to the basic component order vector and the basic component weight vector of the candidate sequence node with different neighbor relations; and

5. A manifold evolutionary graph construction device for DNA/RNA sequence data set evolutionary relationship analysis, characterized in that it is applied to the manifold evolutionary graph construction method for DNA/RNA sequence data set evolutionary relationship analysis of claim 2 and comprises:

the data set acquisition unit is used for acquiring a data set to be processed;

the component sequence vector tree construction unit is used for constructing a basic component sequence vector tree according to the data set to be processed; the base component sequence vector tree comprises at least one main sequence node and a base component sequence vector and a base component weight vector of the main sequence node; the main sequence node is a DNA/RNA sequence; the component sequence vector tree construction unit:

determining a basic component sequence vector according to the basic component;

a primary candidate vector determining module, configured to determine, according to the basic component sequence vector and the basic component weight vector of the main sequence node, a basic component sequence vector and a basic component weight vector of a primary candidate sequence node that is primary-adjacent to the main sequence node; the basic component sequence vector of the primary candidate sequence node that is primary-adjacent to the main sequence node is formed by mutation/insertion/deletion of one base in the basic component sequence vector of the main sequence node; and

a secondary candidate vector determining module, configured to determine, according to the basic component sequence vector and the basic component weight vector of the main sequence node, a basic component sequence vector and a basic component weight vector of a secondary candidate sequence node that is secondary-adjacent to the main sequence node; the basic component sequence vector of the secondary candidate sequence node that is secondary-adjacent to the main sequence node is formed by mutation/insertion/deletion of two bases in the basic component sequence vector of the main sequence node;

and the manifold evolutionary graph building unit builds a manifold evolutionary graph according to the main sequence node and the neighbor relation.

6. A computer device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to carry out the steps of the manifold evolutionary graph construction method of any one of claims 1 to 4.

7. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of the manifold evolutionary graph construction method of any one of claims 1 to 4.