CN110060735B

CN110060735B - Biological sequence clustering method based on k-mer group segmentation

Info

Publication number: CN110060735B
Application number: CN201910271872.4A
Authority: CN
Inventors: 江育娥; 俞婷婷; 林劼
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2019-04-04
Filing date: 2019-04-04
Publication date: 2023-02-10
Anticipated expiration: 2039-04-04
Also published as: CN110060735A

Abstract

The invention discloses a biological sequence clustering method based on k-mer group segmentation, comprising the following steps: step 1, segmenting the sequences in a data set and counting the word frequency of the resulting k-mers; step 2, constructing a bipartite graph from the relationship between the sequences and the k-mers; step 3, randomly grouping the k-mers and calculating the importance of each sequence under each group of k-mers; step 4, sorting the sequences by importance in reverse order, screening out candidate sequences, and de-duplicating them; step 5, clustering the candidate sequences and finding the sequence centers; and step 6, clustering all the sequences. By constructing a bipartite graph model, the invention performs cluster analysis on biological sequences, extracts deeper information and reliable conclusions from biological sequence data, and effectively addresses problems of existing methods: the high complexity of computing node weights, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.

Description

Biological sequence clustering method based on k-mer group segmentation

Technical Field

The invention relates to the field of bioinformatics, in particular to a biological sequence clustering method based on k-mer group segmentation.

Background

With the continuing development of second-generation sequencing technology, biological databases are growing rapidly in scale. These huge databases offer researchers broad opportunities but also pose a challenge: how to extract useful information from billions of biological records. Data mining provides a basic means for this work. The similarity of biological sequences often reflects the correlation of their functions, and cluster analysis is a commonly used technique in data mining.

In biology, sequence alignment identifies similar regions by aligning biological sequences; these regions can describe functional, structural, and evolutionary relationships between the sequences. Clustering divides similar sequences into the same group and dissimilar sequences into different groups, so that the distance between sequences within a group is minimized while the distance between different groups is maximized. If two similar sequences fall into the same group, this indicates homology to some extent, which greatly saves the time and effort of re-determining the structure and function of an unknown sequence. In addition, sequence alignment generally determines the analysis results of many bioinformatics techniques and programs, affects the conclusions and biological interpretations of many sequence comparison studies, and is an important part of studies such as biological sequence cluster analysis.

Bipartite graphs are among the more widely used graph models, with many real-life applications: how to assign work so that each person's requirements are met as far as possible, given what each person is competent to do; or how to schedule courses so that the constraints among classrooms, teachers and students are satisfied. These are matching problems on bipartite graphs, which can be solved by building graph models.

Clustering biological sequences with a graph model groups functionally related sequences into one class. This helps researchers quickly understand the functions of biological sequences and determine the diversity of internal populations, thereby promoting the protection of biological diversity and the rational development and use of biological resources.

Disclosure of Invention

The invention aims to provide a biological sequence clustering method based on k-mer group segmentation.

The technical scheme adopted by the invention is as follows:

A biological sequence clustering method based on k-mer group segmentation comprises the following steps:

Step 1: setting the sliding window size, segmenting the sequences in the data set, and counting the word frequency of the segmented k-mers;

Step 2: constructing a bipartite graph according to the relation between the sequences and the k-mers, and counting, for any sequence s_i, the word frequency of the k-mers that co-occur with other sequences;

Step 3: dividing the k-mers uniformly at random into t groups g_1, g_2, …, g_t, and calculating the importance of the sequences under each group of k-mers;

Step 4: sorting the sequences by importance in reverse order; setting k as the number of centers of the m sequences, and screening k sequences at uniform intervals from each of the t groups to serve as candidate sequences; and de-duplicating the t × k candidate sequences to obtain non-repeated candidate sequences;

Step 5: clustering the candidate sequences with K-means, where the k-mers set is obtained by segmenting the sequences according to the set sliding window size; from each cluster in the K-means result, screening out the point closest to the current centroid as a sequence center; by analogy, k center sequences are obtained;

Step 6: clustering the m sequences S = {s_i | i = 1, 2, ..., m} based on the k center sequences.

Further, the data set in step 1 is a sequence set S = {s_i | i = 1, 2, ..., m} of size m, the sliding window size is L, and the segmented k-mers set is K = {k_j | j = 1, 2, ..., n}.

Further, the bipartite graph constructed in step 2 is G = (V, E), also called the sequence-k-mers graph. G = (V, E) is an undirected graph model composed of a set of nodes and a set of edges, where V is the set of nodes and can be decomposed into two disjoint subsets, i.e., V = S ∪ K and S ∩ K = ∅.

S = {s_i | i = 1, 2, ..., m} is the set of sequences, where s_i is the i-th sequence; K = {k_j | j = 1, 2, ..., n} is the set of k-mers, where k_j is the j-th k-mer. E is the set of edges formed by the interactions between nodes, and the two endpoints of each edge in E lie in subset S and subset K respectively, i.e., E = {e(s_i, k_j) | s_i ∈ S, k_j ∈ K}, where e(s_i, k_j) indicates a membership relationship between sequence s_i and k-mer k_j.

Further, the method for determining the importance of the sequence in step 3 comprises:

Step 3.1: calculating the weights of the edges. When two sequences v_i and v_j share a common k-mer, v_i and v_j are considered adjacent nodes connected by an edge, and the weight w_ji of that edge is the co-occurrence count of the k-mers present in both sequences. That is, for any two sequences v_i and v_j that share k-mers, w_ji represents the undirected interaction between the nodes.

Step 3.2: calculating the weights of the nodes.

For any two nodes v_i and v_j, node v_i passes its influence to node v_j through the edge w_ji connecting them, and the magnitude of the edge weight determines how strongly v_i affects v_j. When node v_j has edges to several nodes, i.e., has several adjacent nodes, the weight of node v_j is the sum of the contributions it receives from the other nodes.

Step 3.3: iteratively calculating the weight of each node v_i to obtain the importance of node v_i.

Further, in step 3.1 the weight of the edge between adjacent nodes v_i and v_j is w_ji, calculated by the following formula:

$$w_{ji} = \sum_{kmer \in v_i \,\&\, kmer \in v_j} \min\left(|kmer_{v_i}|,\; |kmer_{v_j}|\right)$$

where kmer ∈ v_i & kmer ∈ v_j indicates that the k-mer is present both at node v_i and at node v_j, and |kmer_{v_i}| and |kmer_{v_j}| are the frequencies of occurrence of the current k-mer in node v_i and node v_j respectively. Since the interaction between nodes is undirected, w_ji = w_ij.

Further, in step 3.2, w_j· denotes the total contribution that node v_j receives from the other nodes, calculated by the following formula:

$$w_{j\cdot} = \sum_{v_i \in V} w_{ji}$$

where w_ji is the contribution of each node v_i in the node set V to node v_j.

Further, the importance of each node v_i in step 3.3 is WS(v_i), and the corresponding SeqRank formula is:

$$WS(v_i) = (1-d) + d \cdot \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}} \, WS(v_j)$$

where d is a damping coefficient (0 ≤ d ≤ 1) representing the probability of walking from one node to another at any time, so that each node has probability (1 − d) of jumping randomly to another node. v_j ∈ e(v_i, v_j) means that nodes v_i and v_j share an edge; in v_k ∈ e(v_j, v_k), v_k is a node that shares an edge with node v_j. w_ij (or w_ji) is the weight of the edge connecting v_i and v_j, i.e., the sum of the co-occurrence frequencies of the k-mers present in both nodes. The denominator Σ_{v_k ∈ e(v_j, v_k)} w_jk is the sum of the weights of the edges from node v_j to its neighbours v_k. WS(v_j) is the importance of node v_j after the previous iteration.

Since computing a node's weight requires the weights of the nodes themselves, the calculation must be iterative. Writing WS(v_i)^t for the importance of node v_i after t iterations, the update can be expressed as formula (3):

$$WS(v_i)^{t} = (1-d) + d \cdot \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}} \, WS(v_j)^{t-1} \tag{3}$$

The SeqRank algorithm iterates over the graph model until a convergence condition is met.

Further, d in step 3.3 satisfies 0 ≤ d ≤ 1.

Further, d is 0.85.

Further, the method for determining the center sequence in step 5 comprises:

Step 5.1: clustering the candidate sequences with the K-means algorithm, where the number of K-means centers is k and the features are the k-mers frequencies of the candidate sequences;

Step 5.2: for each cluster, screening out the point closest to the current centroid as the sequence center.

Further, the method for determining the sequence cluster in step 6 comprises the following steps:

Step 6.1: labeling the k center sequences as μ_1, μ_2, ..., μ_k;

Step 6.2: for each sequence s_i in the sequence set S, calculating its predicted class with the following formula:

$$pre_i = \arg\min_{1 \le j \le k} \lVert w_i - \mu_j \rVert^2$$

where pre_i is the class among the k clusters closest to sequence s_i, i.e., the predicted class of the i-th sequence; ||w_i − μ_j||² is the (squared) Euclidean distance between w_i, the importance value of sequence s_i, and each center point; and arg min selects the center point closest to w_i as the predicted class of the i-th sequence. The predicted classes of all m sequences are obtained in this way.

With the above technical scheme, biological sequences are cluster-analyzed by constructing a bipartite graph model, and deeper information and reliable conclusions are obtained from biological sequence data at the level of cluster analysis. This effectively addresses problems in the prior art such as the high complexity of node weight calculation, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.

Drawings

The invention is described in further detail below with reference to the accompanying drawings and the detailed description;

FIG. 1 is a schematic flow chart of a biological sequence clustering method based on k-mer group segmentation according to the present invention;

FIG. 2 is a schematic diagram of group g_1 obtained by randomly and uniformly grouping the k-mers according to the present invention;

FIG. 3 is a schematic diagram of group g_2 obtained by randomly and uniformly grouping the k-mers according to the present invention;

FIG. 4 is a schematic diagram of the bipartite graph of the three sequences X, Y and Z according to the present invention;

FIG. 5 is a diagram showing the result of K-means clustering performed on sequences according to the present invention.

Detailed Description

As shown in FIGS. 1 to 5, the invention discloses a biological sequence clustering method based on k-mer group segmentation. A bipartite graph is constructed from the relationship between sequences and k-mers under different sliding window sizes; the k-mers are grouped, the importance of the sequences under each group is calculated, and the sequence centers are found and used for clustering. This effectively addresses prior-art problems such as the high complexity of calculating node weights, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.

As shown in FIG. 1, the invention discloses a clustering algorithm based on k-mer group segmentation, comprising the following steps:

Step 1: according to the sliding window size L, segment the given sequence set S = {s_i | i = 1, 2, ..., m}. The resulting k-mers set is K = {k_j | j = 1, 2, ..., n}, where m and n are the sizes of the sequence set S and the k-mers set respectively. Count the k-mers word frequencies.

Specifically, in a sequence s_i, the number of n-gram elements of length L is (|s_i| − L + 1).

For each sequence s_i, the frequency of occurrence of each k-mer is counted. The set of k-mers that may occur across sequences is Σ^L, which contains |Σ|^L elements. That is, for DNA sequences with 4 base types the number of possible k-mers is 4^L, and for protein sequences with 20 amino acid types it is 20^L.
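The two counts above can be checked with a short Python sketch. The helper names `num_windows` and `all_possible_kmers` are illustrative assumptions and do not appear in the patent.

```python
from itertools import product

def num_windows(seq, L):
    """Number of length-L windows in seq: |s_i| - L + 1 (0 if seq is shorter than L)."""
    return max(len(seq) - L + 1, 0)

def all_possible_kmers(alphabet, L):
    """Enumerate every length-L string over the alphabet (the set written as Sigma^L)."""
    return ["".join(p) for p in product(alphabet, repeat=L)]

dna_2mers = all_possible_kmers("ACGT", 2)
print(num_windows("ACAGT", 2))  # 5 - 2 + 1 = 4 windows
print(len(dna_2mers))           # 4**2 = 16 possible DNA 2-mers
```

For protein sequences the same helper would be called with a 20-letter amino acid alphabet, giving 20^L possible k-mers.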

For example, for the three sequences X = "ACAGT", Y = "ACACG" and Z = "CACGT" with the k-mer sliding window size set to 2, each of the three sequences contains 4 n-gram elements of length 2, and the occurrences of the k-mers in each sequence are shown in Table 1:

TABLE 1 occurrence of k-mers between sequences

k-mers	AC	AG	CA	CG	GT	sum
Sequence X	1	1	1	0	1	4
Sequence Y	2	0	1	1	0	4
Sequence Z	1	0	1	1	1	4
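As a minimal sketch of Step 1, the following Python snippet reproduces the counts of Table 1 with a sliding window of size 2. The function name `count_kmers` is an illustrative assumption, not a name used in the patent.

```python
from collections import Counter

def count_kmers(seq, L):
    """Slide a window of size L over seq and count each k-mer that appears."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))

sequences = {"X": "ACAGT", "Y": "ACACG", "Z": "CACGT"}
kmer_counts = {name: count_kmers(seq, 2) for name, seq in sequences.items()}

for name, counts in kmer_counts.items():
    # A Counter returns 0 for k-mers that do not occur in the sequence.
    print(name, dict(sorted(counts.items())), "sum =", sum(counts.values()))
```

Each row sums to 4, matching |s_i| − L + 1 = 5 − 2 + 1.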

Step 2: construct a bipartite graph G = (V, E) according to the relation between the sequences and the k-mers. G = (V, E) is an undirected graph model composed of a node set and an edge set, where V is the set of nodes and E is the set of edges formed by the interactions between nodes, E = {e(s_i, k_j) | s_i ∈ S, k_j ∈ K}, where e(s_i, k_j) indicates a membership relationship between sequence s_i and k-mer k_j.

Specifically, as Table 1 shows, sequences X and Y share the k-mers AC and CA, so sequence X is considered to be connected to sequence Y by an edge; sequences X and Z share the k-mers AC, CA and GT, so sequence X is considered to be connected to sequence Z by an edge; and sequences Y and Z share the k-mers AC, CA and CG, so sequence Y is considered to be connected to sequence Z by an edge.
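One possible way to represent the sequence-k-mers bipartite graph is as a set of (sequence, k-mer) membership edges; two sequences are then adjacent when they share at least one k-mer. The sketch below assumes this representation, and the names `edges` and `shared_kmers` are hypothetical.

```python
sequences = {"X": "ACAGT", "Y": "ACACG", "Z": "CACGT"}
L = 2

# Edge set E of the bipartite graph: one edge e(s_i, k_j) per k-mer contained in a sequence.
edges = {
    (name, seq[i:i + L])
    for name, seq in sequences.items()
    for i in range(len(seq) - L + 1)
}

def shared_kmers(a, b):
    """k-mers that have an edge to both sequence a and sequence b."""
    kmers_a = {k for s, k in edges if s == a}
    kmers_b = {k for s, k in edges if s == b}
    return sorted(kmers_a & kmers_b)

print(shared_kmers("X", "Y"))
print(shared_kmers("X", "Z"))
print(shared_kmers("Y", "Z"))
```

A set of pairs keeps the sketch small; an adjacency dictionary keyed by k-mer would scale better for large data sets.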

Step 3: calculate the importance of the m sequences.

Step 3.1: randomly and uniformly divide the n k-mers into t groups g_1, g_2, ..., g_t. As shown in FIG. 2, sequence s_1 and sequence s_i share k-mer k_1, and sequence s_1 and sequence s_m share k-mer k_3; that is, there is an interaction between s_1 and s_i and between s_1 and s_m. In this way, taking the sequences and k-mers as nodes, a graph model G = (V, E) can be constructed.
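The random uniform grouping of Step 3.1 could be sketched as a shuffle followed by a round-robin split. The patent does not prescribe a particular procedure, so the function `split_kmers` and its `seed` parameter are assumptions made for illustration.

```python
import random

def split_kmers(kmers, t, seed=None):
    """Shuffle the k-mers and deal them into t groups whose sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(kmers)
    rng.shuffle(shuffled)
    # Group g takes every t-th k-mer starting at position g.
    return [shuffled[g::t] for g in range(t)]

all_kmers = ["AC", "AG", "CA", "CG", "GT"]
groups = split_kmers(all_kmers, t=2, seed=0)
print(groups)
```

Every k-mer lands in exactly one group, so the groups partition the k-mers set K.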

Step 3.2: calculate the weights of the edges. When two sequences v_i and v_j share a common k-mer, v_i and v_j are considered adjacent nodes connected by an edge, and the weight w_ji of that edge is the co-occurrence count of the k-mers present in both sequences. That is, for any two sequences v_i and v_j that share k-mers, w_ji represents the undirected interaction between the nodes and is calculated by the following formula:

$$w_{ji} = \sum_{kmer \in v_i \,\&\, kmer \in v_j} \min\left(|kmer_{v_i}|,\; |kmer_{v_j}|\right)$$

Specifically, for the three sequences X = "ACAGT", Y = "ACACG" and Z = "CACGT" with the k-mer sliding window size set to 2, the bipartite graph of the three sequences is shown in FIG. 3.

The k-mers frequencies of the three sequences X, Y and Z are shown in Table 2.

TABLE 2 k-mers occurrence of the three sequences

k-mers	AC	AG	CA	CG	GT
Sequence X	1	1	1	0	1
Sequence Y	2	0	1	1	0
Sequence Z	1	0	1	1	1

As can be seen from Table 2, for sequence X and sequence Y, there is a common k-mers between the two sequences: AC and CA. Since the weight of an edge is the number of co-occurrences of k-mers in the two sequences, the relationship between sequence X and sequence Y in Table 2 can be simplified as follows:

TABLE 3 Co-occurrence frequency of k-mers in the presence of both X and Y sequences

k-mers	AC	CA
Sequence X	1	1
Sequence Y	2	1

In this case the interaction w_YX between sequence X and sequence Y can be calculated as follows:

w_YX = min(|AC_X|, |AC_Y|) + min(|CA_X|, |CA_Y|) = 1 + 1 = 2
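The calculation above can be written as a small Python function that sums, over the shared k-mers, the smaller of the two frequencies. The names `count_kmers` and `edge_weight` are illustrative assumptions.

```python
from collections import Counter

def count_kmers(seq, L):
    """Count the k-mers produced by a sliding window of size L."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))

def edge_weight(counts_a, counts_b):
    """Sum of min(frequency in a, frequency in b) over the k-mers present in both."""
    shared = counts_a.keys() & counts_b.keys()
    return sum(min(counts_a[k], counts_b[k]) for k in shared)

X = count_kmers("ACAGT", 2)
Y = count_kmers("ACACG", 2)
Z = count_kmers("CACGT", 2)

print(edge_weight(Y, X))  # min(1, 2) + min(1, 1) = 2
print(edge_weight(X, Z))  # AC, CA and GT each contribute 1, giving 3
```

Because `min` is symmetric in its arguments, `edge_weight(a, b)` equals `edge_weight(b, a)`, consistent with w_ji = w_ij.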

Step 3.3: calculate the weights of the nodes. For any two nodes v_i and v_j, node v_i passes its influence to node v_j through the edge w_ji connecting them, and the magnitude of the edge weight determines how strongly v_i affects v_j. When node v_j has edges to several nodes, i.e., has several adjacent nodes, w_j· denotes the total contribution that node v_j receives from the other nodes, calculated by the following formula:

$$w_{j\cdot} = \sum_{v_i \in V} w_{ji}$$

Extending the calculation of Table 3 to every pair of sequences, the co-occurrence frequencies of the k-mers shared among the sequences X, Y and Z give the following relationship matrix M:

TABLE 4 Co-occurrence frequency of k-mers in the presence of three sequences X, Y and Z

	Sequence X	Sequence Y	Sequence Z
Sequence X	0	2	3
Sequence Y	2	0	3
Sequence Z	3	3	0

In Table 4, the size of the matrix M is |V| × |V|, where |V| is the number of nodes. Each value in the matrix is the interaction between two sequences; for example, M[1, 3] indicates that the interaction between sequence X and sequence Z is 3.
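A short sketch, under the same assumptions as the earlier examples, that builds the relationship matrix M of Table 4 and the node weights w_j· as row sums. The variable names are hypothetical, and the code uses 0-based indices, so M[0][2] corresponds to M[1, 3] in the text.

```python
from collections import Counter

def count_kmers(seq, L):
    """Count the k-mers produced by a sliding window of size L."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))

names = ["X", "Y", "Z"]
counts = [count_kmers(s, 2) for s in ("ACAGT", "ACACG", "CACGT")]

# M[i][j] is the edge weight between sequence i and sequence j; the diagonal stays 0.
M = [
    [
        0 if i == j else sum(min(counts[i][k], counts[j][k])
                             for k in counts[i].keys() & counts[j].keys())
        for j in range(len(names))
    ]
    for i in range(len(names))
]

# Node weight w_j.: total contribution received from all neighbours (row sum).
node_weights = [sum(row) for row in M]

for name, row, w in zip(names, M, node_weights):
    print(name, row, "w =", w)
```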

The weight of each node v_i is calculated iteratively to obtain its importance WS(v_i). The corresponding SeqRank formula is:

$$WS(v_i) = (1-d) + d \cdot \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}} \, WS(v_j)$$

where d is a damping coefficient (0 ≤ d ≤ 1) representing the probability of walking from one node to another at any time, so that each node has probability (1 − d) of jumping randomly to another node. v_j ∈ e(v_i, v_j) means that nodes v_i and v_j share an edge; in v_k ∈ e(v_j, v_k), v_k is a node that shares an edge with node v_j. w_ij (or w_ji) is the weight of the edge connecting v_i and v_j, i.e., the sum of the co-occurrence frequencies of the k-mers present in both nodes. The denominator Σ_{v_k ∈ e(v_j, v_k)} w_jk is the sum of the weights of the edges from node v_j to its neighbours v_k. WS(v_j) is the importance of node v_j after the previous iteration.

Since computing a node's weight requires the weights of the nodes themselves, the calculation must be iterative. Writing WS(v_i)^t for the importance of node v_i after t iterations, the update can be expressed as formula (3):

$$WS(v_i)^{t} = (1-d) + d \cdot \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}} \, WS(v_j)^{t-1} \tag{3}$$
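The iteration can be sketched in Python as follows. The update rule in the code is an assumption: a TextRank-style form, (1 − d) plus d times the weighted sum of neighbour scores, chosen because it matches the terms described in the text (damping coefficient, edge weight w_ji, and the denominator summing the edge weights of v_j). The function name `seqrank`, the initial score of 1.0, and the stopping tolerance are also illustrative choices.

```python
def seqrank(M, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate the assumed update until the largest score change falls below tol."""
    n = len(M)
    out_sum = [sum(row) for row in M]  # denominator: total edge weight of each node v_j
    ws = [1.0] * n                     # assumed initial importance
    for _ in range(max_iter):
        new = []
        for i in range(n):
            acc = 0.0
            for j in range(n):
                if j != i and M[j][i] > 0 and out_sum[j] > 0:
                    acc += M[j][i] / out_sum[j] * ws[j]
            new.append((1 - d) + d * acc)
        if max(abs(a - b) for a, b in zip(new, ws)) < tol:
            return new
        ws = new
    return ws

# Relationship matrix of Table 4 for sequences X, Y and Z.
M = [[0, 2, 3], [2, 0, 3], [3, 3, 0]]
scores = seqrank(M)
print(dict(zip("XYZ", scores)))
```

Under this assumed rule, sequence Z, which has the largest total edge weight in Table 4, receives the highest score, while X and Y tie by symmetry.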

Step 4: sort the sequences by importance in reverse order. The importance of the m sequences under the different groups of k-mers is represented by a t × m matrix I, shown in Table 5.

TABLE 5 importance of m sequences under t groups

Specifically, in matrix I the rows represent the t groups of k-mers and the columns represent the m sequences. I[p, q] denotes the importance value of the q-th sequence (1 ≤ q ≤ m) under the p-th group of k-mers (1 ≤ p ≤ t); for example, I[1, ·] gives the importance of the m sequences under group g_1, and I[·, q] gives the importance of the q-th sequence calculated under all groups.

Note that the same sequence may have different importance in different groups; for example, in matrix I sequence s_1 has importance 0.96823 under group g_1 but 1.040769 under group g_2. Conversely, the sequence with the maximum importance may differ between groups; for example, the most important sequence under group g_1 is s_m, while the most important sequence under group g_p is s_1.

The results obtained by sorting the importance of the sequences in reverse order in units of groups are shown in Table 6.

TABLE 6 reverse order ranking of importance

Unlike in Table 5, the values of matrix I in Table 6 are the sequence numbers after reverse-order sorting. For example, I[2, 1] = 7 means that sequence 7 is the most important sequence under group g_2, and I[p, m] = 7 means that sequence 7 is the least important sequence under group g_p. The most important sequence numbers calculated under different groups of k-mers are not the same.

Let k (k ≤ m) be the number of centers of the m sequences. Taking the sorted sequence numbers SR_1 of group g_1 as an example, k sequences are screened at uniform intervals from each of the t groups, as shown in the shaded portion of Table 7.

Table 7 screening for importance

The sequences corresponding to the k sequence numbers screened from each group are taken as candidate sequences, so the candidate set has size t × k. The t × k sequences may contain repeats; for example, sequences 5 and 7 in group g_1 are also in the candidate set of group g_t, and sequence 78 in group g_t is also in the candidate set of group g_2. The t × k selected candidates are therefore de-duplicated to obtain n (n ≤ t × k) non-repeated candidate sequences.
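A minimal sketch of the screening and de-duplication in Step 4. The text does not spell out how the uniform interval is computed, so the positions used here (every m/k-th rank, starting from the most important sequence) are an assumption, as are the function name `screen_candidates` and the toy rankings.

```python
def screen_candidates(rankings, k):
    """Pick k evenly spaced ranks from each group's ranking, then drop repeats.

    rankings: list of t lists, each holding sequence numbers sorted from
    most to least important under one group of k-mers.
    """
    candidates, seen = [], set()
    for ranked in rankings:
        step = len(ranked) / k  # assumed interval between selected ranks
        for r in range(k):
            seq_id = ranked[int(r * step)]
            if seq_id not in seen:  # de-duplicate across the t groups
                seen.add(seq_id)
                candidates.append(seq_id)
    return candidates

# Hypothetical rankings of m = 8 sequences under t = 2 groups.
rankings = [
    [5, 7, 1, 3, 2, 6, 4, 8],
    [7, 2, 5, 8, 1, 4, 3, 6],
]
candidates = screen_candidates(rankings, k=4)
print(candidates)
```

Here t × k = 8 selections shrink to 6 distinct candidates, because sequences 5 and 1 are picked in both groups.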

Step 5: cluster the candidate sequences with the K-means algorithm. The number of K-means centers is k, and the features are the k-mers frequencies of the candidate sequences. With sliding window size L, the set of k-mers that may occur in DNA sequences is Σ^L. Assuming L = 2, the following matrix O is obtained:

TABLE 8 frequency of occurrence of k-mers in candidate sequences

The size of the matrix O is n × |Σ|^L, where n is the number of candidate sequences and Σ^L is the k-mers set for sliding window size L. When L = 2, for DNA sequences Σ^L = {AA, AC, AG, ..., TT}. In matrix O, O[s_i, ·] gives the frequency of occurrence of each k-mer in the i-th sequence, namely 140, 122, 200, ..., 101.

The results obtained with K-means are shown in FIG. 4, in which the n candidate sequences are grouped into k classes: Cluster1, Cluster2, and so on. For each cluster, the point closest to the current centroid is screened out as the sequence center. For Cluster2, for example, under the chosen distance measure the point closest to the centroid of Cluster2 is point A, so the sequence represented by point A is taken as the center sequence of Cluster2. By analogy, k center sequences are obtained.
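Step 5 can be sketched with a small pure-Python K-means followed by a search for the real data point nearest each centroid. The initialization (the first k points), the toy frequency vectors, and the function names are assumptions made for illustration; a production implementation would more likely rely on an established K-means library.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100):
    """Plain K-means; centroids start at the first k points (naive assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def center_sequences(points, centroids, labels):
    """For each non-empty cluster, the index of the point closest to its centroid."""
    centers = []
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            centers.append(min(members, key=lambda i: sq_dist(points[i], centroid)))
    return centers

# Hypothetical k-mer frequency vectors of six candidate sequences (rows of matrix O).
points = [[4, 0, 1], [0, 5, 1], [5, 1, 0], [1, 4, 0], [4, 1, 1], [0, 4, 1]]
centroids, labels = kmeans(points, k=2)
centers = center_sequences(points, centroids, labels)
print(labels, centers)
```

Choosing an actual data point rather than the centroid itself guarantees that each center corresponds to a real sequence.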

Step 6: cluster the m sequences S = {s_i | i = 1, 2, ..., m}.

Specifically, when the sliding window size is L, the k-mers set of the m sequences is K = {k_j | j = 1, 2, ..., |Σ|^L}, and the k-mers frequency matrix O constructed at this point has size m × |Σ|^L.

Given the k centers, and with the frequency matrix O of the k-mers in the m sequences as the features, the clustering process of the K-means algorithm is as follows:

(1) Label the k screened centers as μ_1, μ_2, ..., μ_k;

(2) For each point w_i in the matrix O, calculate the class to which it belongs using equation (5):

$$pre_i = \arg\min_{1 \le j \le k} \lVert w_i - \mu_j \rVert^2 \tag{5}$$

In equation (5), pre_i is the class among the k clusters closest to sequence s_i, i.e., the predicted class of the i-th sequence; ||w_i − μ_j||² is the (squared) Euclidean distance between w_i, the importance value of sequence s_i, and each center point; and arg min selects the center point closest to w_i as the predicted class of the i-th sequence.
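The final assignment of Step 6 amounts to a nearest-center lookup. The sketch below assumes that each sequence and each center is represented by a feature vector; the data and the function name `predict_classes` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_classes(points, centers):
    """pre_i = argmin over j of ||w_i - mu_j||^2, returned as 0-based center indices."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(w, centers[j]))
        for w in points
    ]

# Hypothetical feature vectors for four sequences and k = 2 center sequences.
centers = [[4, 1, 1], [0, 4, 1]]
points = [[4, 0, 1], [0, 5, 1], [5, 1, 0], [1, 4, 0]]
predictions = predict_classes(points, centers)
print(predictions)
```

Because the centers are fixed at this stage, each sequence is assigned in a single pass with no further iteration.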

Claims

1. A biological sequence clustering method based on k-mer group segmentation, characterized by comprising the following steps:

Step 1: acquiring a data set of the sequence set to be processed, and segmenting the sequences in the data set according to the set sliding window size to obtain a k-mers set;

Step 2: constructing a bipartite graph according to the relation between the sequences and the k-mers set, and counting, for any sequence s_i, the word frequency of the k-mers that co-occur with other sequences; the constructed bipartite graph is G = (V, E), also called the sequence-k-mers graph; G = (V, E) is an undirected graph model composed of a node set and an edge set,

where V is the set of nodes and can be decomposed into two disjoint subsets, i.e., V = S ∪ K and S ∩ K = ∅;

S = {s_i | i = 1, 2, …, m} is the set of sequences, s_i being the i-th sequence; K = {k_j | j = 1, 2, …, n} is the set of k-mers, k_j being the j-th k-mer;

E is the set of edges formed by the interactions between nodes, and the two endpoints of each edge in E lie in subset S and subset K respectively, i.e., E = {e(s_i, k_j) | s_i ∈ S, k_j ∈ K}, where e(s_i, k_j) indicates a membership relationship between sequence s_i and k-mer k_j;

Step 3: dividing the k-mers uniformly at random into t groups g_1, g_2, …, g_t, and calculating the importance of the sequences under each group of k-mers; the method for determining sequence importance comprises the following steps:

Step 3.1: calculating the weights of the edges: when two sequences v_i and v_j share a common k-mer, v_i and v_j are considered adjacent nodes connected by an edge, and the weight w_ji of that edge is the co-occurrence count of the k-mers present in both sequences,

wherein the weight of the edge between adjacent nodes v_i and v_j is w_ji, calculated by the following formula:

$$w_{ji} = \sum_{kmer \in v_i \,\&\, kmer \in v_j} \min\left(|kmer_{v_i}|,\; |kmer_{v_j}|\right)$$

wherein kmer ∈ v_i & kmer ∈ v_j indicates that the k-mer is present both at node v_i and at node v_j; |kmer_{v_i}| and |kmer_{v_j}| are the frequencies of occurrence of the current k-mer in node v_i and node v_j respectively; and w_ji = w_ij;

Step 3.2: calculating the weights of the nodes:

for any two nodes v_i and v_j, node v_i passes its influence to node v_j through the edge w_ji connecting them, and the magnitude of the edge weight determines how strongly v_i affects v_j;

when node v_j has edges to several nodes, i.e., has several adjacent nodes, the weight of node v_j is the sum of the contributions it receives from the other nodes;

wherein w_j· denotes the total contribution that node v_j receives from the other nodes, calculated by the following formula:

$$w_{j\cdot} = \sum_{v_i \in V} w_{ji}$$

wherein w_ji is the contribution of any node v_i in the node set V to node v_j;

Step 3.3: iteratively calculating the weight of each node v_i to obtain the importance of node v_i; the importance of each node v_i is WS(v_i), and the corresponding SeqRank formula is:

$$WS(v_i) = (1-d) + d \cdot \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}} \, WS(v_j)$$

wherein d is a damping coefficient with 0 ≤ d ≤ 1, representing the probability of walking from one node to another at any time; v_j ∈ e(v_i, v_j) means that nodes v_i and v_j share an edge; in v_k ∈ e(v_j, v_k), v_k is a node that shares an edge with node v_j; w_ij or w_ji is the weight of the edge connecting v_i and v_j, i.e., the sum of the co-occurrence frequencies of the k-mers present in both nodes; the denominator Σ_{v_k ∈ e(v_j, v_k)} w_jk is the sum of the weights of the edges from node v_j to its neighbours v_k; WS(v_j) is the importance of node v_j after the previous iteration;

with WS(v_i)^t denoting the importance of node v_i after t iterations, the update can be expressed as formula (3):

$$WS(v_i)^{t} = (1-d) + d \cdot \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}} \, WS(v_j)^{t-1} \tag{3}$$

the SeqRank algorithm iterates over the graph model until a convergence condition is met;

Step 4: sorting the sequences by importance in reverse order; setting k as the number of centers of the m sequences, and screening k sequences at uniform intervals from each of the t groups to serve as candidate sequences; and de-duplicating the t × k candidate sequences to obtain non-repeated candidate sequences;

Step 6: clustering the m sequences S = {s_i | i = 1, 2, …, m} based on the k center sequences to obtain the predicted classes of the m sequences.

2. The method of claim 1, characterized in that the data set in step 1 is a sequence set S = {s_i | i = 1, 2, …, m} of size m, the sliding window size is L, and the segmented k-mers set is K = {k_j | j = 1, 2, …, n}.

3. The method of claim 1, characterized in that the value of d is 0.85.

4. The method of claim 1, characterized in that the method for determining the center sequences in step 5 comprises the following steps:

Step 5.1: clustering the candidate sequences with the K-means algorithm, where the number of K-means centers is k and the features are the k-mers frequencies of the candidate sequences;

Step 5.2: for each cluster, screening out the point closest to the current centroid as the sequence center.

5. The method of claim 1, characterized in that the method for determining the sequence clustering in step 6 comprises the following steps:

Step 6.1: labeling the k center sequences as μ_1, μ_2, …, μ_k;

Step 6.2: for each sequence s_i in the sequence set S, calculating its predicted class using the following formula:

$$pre_i = \arg\min_{1 \le j \le k} \lVert w_i - \mu_j \rVert^2$$

wherein pre_i is the class among the k clusters closest to sequence s_i, i.e., the predicted class of the i-th sequence; ||w_i − μ_j||² is the (squared) Euclidean distance between w_i, the importance value of sequence s_i, and each center point; and arg min selects the center point closest to w_i as the predicted class of the i-th sequence.