CN110060735B - Biological sequence clustering method based on k-mer group segmentation - Google Patents

Info

Publication number
CN110060735B
Authority
CN
China
Prior art keywords
node
sequence
sequences
mers
importance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910271872.4A
Other languages
Chinese (zh)
Other versions
CN110060735A (en)
Inventor
江育娥
俞婷婷
林劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University
Priority to CN201910271872.4A
Publication of CN110060735A
Application granted
Publication of CN110060735B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Analytical Chemistry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a biological sequence clustering method based on k-mer group segmentation, which comprises the following steps: step 1, segmenting the sequences in a data set and counting the word frequency of the segmented k-mers; step 2, constructing a bipartite graph according to the relation between the sequences and the k-mers; step 3, randomly grouping the k-mers and calculating the importance of the sequences under each group of k-mers; step 4, sorting the sequences in reverse order of importance, screening out candidate sequences and de-duplicating them; step 5, clustering the candidate sequences and finding the sequence centers; and step 6, clustering all the sequences. By constructing a bipartite graph model and performing cluster analysis on the biological sequences, the invention obtains deeper information and reliable conclusions from biological sequence data, and effectively solves the problems of conventional calculation: the high complexity of node weight calculation, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.

Description

Biological sequence clustering method based on k-mer group segmentation
Technical Field
The invention relates to the field of bioinformatics, in particular to a biological sequence clustering method based on k-mer group segmentation.
Background
With the continuous development of second-generation sequencing technology, biological databases keep growing in scale. These huge databases offer researchers broad opportunities but also pose a challenge: how to mine useful information from biological data of this scale. Data mining provides basic tools for such research. The similarity of biological sequences often reflects the correlation of their functions, and cluster analysis is a commonly used data mining technique.
In biology, sequence alignment identifies regions of similarity by aligning biological sequences; these regions can describe functional, structural and evolutionary relationships between the sequences. Clustering divides similar sequences into the same group and dissimilar sequences into different groups, so that the distance between sequences within a group is as small as possible while the distance between different groups is as large as possible. If two similar sequences fall into the same group, which to some extent indicates homology, much of the time and effort needed to determine the structure and function of an unknown sequence can be saved. In addition, sequence alignment generally determines the analysis results of many bioinformatics techniques and programs and affects the conclusions and biological interpretations of many sequence comparison studies, so it is an important part of studies such as biological sequence cluster analysis.
Bipartite graphs are among the more widely used graph models and arise in everyday problems: how to assign jobs so that each person's requirements are met as far as possible, given what each person is competent to do; or how to schedule courses so that the constraints among classrooms, teachers and students are satisfied. These are matching problems on bipartite graphs, which can be solved by building graph models.
Clustering biological sequences with a graph model groups functionally related sequences into one class. This helps researchers quickly understand the functions of biological sequences and determine the diversity within populations, which promotes the protection of biodiversity and the rational development and use of biological resources.
Disclosure of Invention
The invention aims to provide a biological sequence clustering method based on k-mer group segmentation.
The technical scheme adopted by the invention is as follows:
a biological sequence clustering method based on k-mer group segmentation comprises the following steps:
step 1: setting the size of a sliding window, segmenting the sequences in the data set, and counting the word frequency of the segmented k-mers;
step 2: constructing a bipartite graph according to the relation between the sequences and the k-mers, and respectively counting, for any sequence s_i, the word frequency of the k-mers co-occurring with other sequences;
step 3: randomly and uniformly dividing the k-mers into t groups g_1, g_2, ..., g_t, and calculating the importance of the sequences under each group of k-mers;
step 4: sorting the sequences in reverse order of importance: with k set as the number of centers for the m sequences, screening k sequences from each of the t groups at uniform intervals as candidate sequences, and de-duplicating the t × k candidate sequences to obtain non-repeated candidate sequences;
step 5: performing K-means clustering on the candidate sequences, using as features the k-mers obtained by segmenting the DNA sequences with the set sliding window size; for each cluster in the K-means clustering result, screening out the point closest to the current centroid as the sequence center; by analogy, k center sequences are obtained;
step 6: clustering the m sequences S = {s_i | i = 1, 2, ..., m} based on the k center sequences.
Further, the data set in step 1 is a sequence set S = {s_i | i = 1, 2, ..., m} containing m sequences, the sliding window size is L, and the segmented k-mers set is K = {k_j | j = 1, 2, ..., n}.
Further, the bipartite graph constructed in step 2 is G = (V, E), also called a sequence-k-mers graph. G = (V, E) is an undirected graph model composed of a node set and an edge set, where V is the node set and can be decomposed into two disjoint subsets:
$$V = S \cup K, \qquad S \cap K = \varnothing$$
S = {s_i | i = 1, 2, ..., m} is the set of sequences and s_i is the i-th sequence; K = {k_j | j = 1, 2, ..., n} is the set of k-mers and k_j is the j-th k-mers. E represents the set of edges formed by the interactions between nodes, and the two end points of each edge in E lie in subset S and subset K respectively, i.e. E = {e(s_i, k_j) | s_i ∈ S, k_j ∈ K}, where e(s_i, k_j) indicates that there is a membership relationship between sequence s_i and k-mers k_j.
Further, the method for determining the importance of the sequences in step 3 comprises:
step 3.1: calculating the weight of the edges. When two sequences v_i and v_j share a common k-mers, v_i and v_j are regarded as adjacent nodes connected by an edge, and the weight w_ji of the edge is the number of co-occurrences of the k-mers present in both sequences. That is, for any two sequences v_i and v_j that share a common k-mers, w_ji represents the undirected interaction between the two nodes.
step 3.2: calculating the weight of the nodes.
For any two nodes v_i and v_j, node v_i passes its influence to node v_j through the edge w_ji connecting them, and the magnitude of the edge weight determines the magnitude of the influence of v_i on v_j. When node v_j has edges to several nodes, i.e. node v_j has several adjacent nodes, the weight of node v_j is the sum of the contributions it receives from the other nodes.
step 3.3: iteratively calculating the weight of each node v_i to obtain the importance of node v_i.
Further, in step 3.1 the weight of the edge between adjacent nodes v_i and v_j is w_ji, which is calculated by the following formula:
$$w_{ji} = \sum_{kmer \in v_i \,\&\, kmer \in v_j} \min\left(|kmer_{v_i}|,\ |kmer_{v_j}|\right)$$
where kmer ∈ v_i & kmer ∈ v_j indicates that the k-mers is present both in node v_i and in node v_j, and |kmer_{v_i}| and |kmer_{v_j}| respectively represent the frequency of occurrence of the current k-mers in node v_i and in node v_j. Since the interaction between nodes is undirected, w_ji = w_ij.
Further, in step 3.2 w_j· represents the contribution that node v_j receives from the other nodes, and is calculated by the following formula:
$$w_{j\cdot} = \sum_{v_i \in V} w_{ji}$$
where w_ji represents the degree of contribution of each node v_i in the node set V to node v_j.
Further, the importance of each node v_i in step 3.3 is WS(v_i), and the corresponding SeqRank calculation formula is as follows:
$$WS(v_i) = (1 - d) + d \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}}\, WS(v_j)$$
where d is a damping coefficient (0 ≤ d ≤ 1) representing the probability of walking from one node to another at any moment; that is, each node jumps randomly to another node with probability (1 − d). v_j ∈ e(v_i, v_j) indicates that node v_i and node v_j share a common edge; in v_k ∈ e(v_j, v_k), v_k is a node that shares a common edge with node v_j. w_ij (or w_ji) represents the weight of the edge connecting node v_i and node v_j, i.e. the sum of the co-occurrence frequencies of the k-mers present in both node v_i and node v_j. The denominator
$$\sum_{v_k \in e(v_j, v_k)} w_{jk}$$
denotes, for v_k ∈ e(v_j, v_k), the sum of the weights of the edges from node v_j to the nodes v_k. WS(v_j) is the importance of node v_j after the previous iteration.
Since calculating the weight of a node requires the weights of the nodes themselves, the calculation must be iterative. If WS(v_i)^t denotes the importance of node v_i after t iterations, it can be expressed as formula (3):
$$WS(v_i)^{t} = (1 - d) + d \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}}\, WS(v_j)^{t-1} \tag{3}$$
The SeqRank algorithm iterates over the graph model until a convergence condition is met.
Further, d in step 3.3 satisfies 0 ≤ d ≤ 1.
Further, d is 0.85.
Further, the method for determining the center sequences in step 5 comprises:
step 5.1: clustering the candidate sequences with the K-means algorithm, where the number of K-means centers is k and the features are the k-mers frequencies of the candidate sequences;
step 5.2: for each cluster, screening out the point closest to the current centroid as the sequence center.
Further, the method for determining the sequence clusters in step 6 comprises the following steps:
step 6.1: labeling the k center sequences as μ_1, μ_2, ..., μ_k;
step 6.2: for each sequence s_i in the sequence set S, calculating its prediction class using the following formula:
$$pre_i = \arg\min_{1 \le j \le k} \lVert w_i - \mu_j \rVert_2$$
where pre_i represents the class, among the k clusters, that is closest to sequence s_i, i.e. the prediction class of the i-th sequence; ‖w_i − μ_j‖_2 is the Euclidean distance between the value w_i of sequence s_i and each center point μ_j; and arg min selects the center point closest to w_i, which determines the prediction class of the i-th sequence. The prediction classes of all m sequences are obtained in this way.
By adopting this technical scheme, the biological sequences are cluster-analyzed by constructing a bipartite graph model, and deeper information and reliable conclusions are obtained from the biological sequence data at the level of cluster analysis. This effectively solves problems of the prior art such as the high complexity of node weight calculation, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
FIG. 1 is a schematic flow chart of a biological sequence clustering method based on k-mer group segmentation according to the present invention;
FIG. 2 is a schematic diagram of group g_1 obtained by randomly and uniformly grouping the k-mers according to the present invention;
FIG. 3 is a schematic diagram of group g_2 obtained by randomly and uniformly grouping the k-mers according to the present invention;
FIG. 4 is a schematic diagram of the bipartite graph of the three sequences X, Y and Z according to the present invention;
FIG. 5 is a diagram showing the result of K-means clustering performed on sequences according to the present invention.
Detailed Description
As shown in FIGS. 1 to 5, the invention discloses a biological sequence clustering method based on k-mer group segmentation. A bipartite graph is constructed according to the relation between the sequences and the k-mers under different sliding window sizes, the k-mers are grouped, the importance of the sequences under the different groups is calculated, and the sequence centers are found and used for clustering. This effectively solves problems of the prior art such as the high complexity of node weight calculation, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.
As shown in FIG. 1, the invention discloses a k-mer group segmentation-based clustering algorithm, which comprises the following steps:
step 1: s = { S for a given set of sequences in terms of sliding window size L i I =1, 2.. Multidot.m } is divided, and the K-mers set after the division is K = { K = j I j =1,2,. Eta., n }, m and n being the sizes of the sequence set S and k-mers set, respectively, the statistical k-mers word frequency.
In particular in the sequence s i The number of all elements with the length of the n-Gram being L is (| s) i L + 1).
For each sequence s i And counting the occurrence word frequency of each k-mers. The k-mers set between sequences is sigma L The number of elements in the set is [ sigma ]) L Namely: the number of k-mers which may be present in a DNA sequence having the base type 4 is 4 L (ii) a For a protein sequence with the amino acid species of 20,the number of possible k-mers is 20 L
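As a quick check of these counts, the following Python sketch enumerates the possible k-mers over an alphabet and counts the sliding windows in one sequence. The function names `possible_kmers` and `window_count` are illustrative and do not come from the patent.

```python
from itertools import product


def possible_kmers(alphabet, L):
    """Return all |alphabet|**L strings of length L over the alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=L)]


def window_count(seq, L):
    """Number of length-L windows in seq: |s| - L + 1 (0 if seq is too short)."""
    return max(len(seq) - L + 1, 0)


dna_2mers = possible_kmers("ACGT", 2)
protein_2mers = possible_kmers("ACDEFGHIKLMNPQRSTVWY", 2)

print(len(dna_2mers))            # 4**2 possible DNA 2-mers
print(len(protein_2mers))        # 20**2 possible protein 2-mers
print(window_count("ACAGT", 2))  # 5 - 2 + 1 windows
```

The size of the k-mers space grows exponentially with L, which is one reason to keep the window size small in practice.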
For example, for the three sequences X = "ACAGT", Y = "ACACG" and Z = "CACGT", when the sliding window size of the k-mer elements is set to 2, each of the three sequences contains 4 elements of n-gram length 2, and the occurrence of the k-mers in each sequence is shown in Table 1:
TABLE 1 Occurrence of k-mers in the sequences
k-mers       AC   AG   CA   CG   GT   sum
Sequence X    1    1    1    0    1     4
Sequence Y    2    0    1    1    0     4
Sequence Z    1    0    1    1    1     4
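The counts in Table 1 can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `kmer_counts` is hypothetical and not part of the patent.

```python
from collections import Counter


def kmer_counts(seq, L):
    """Slide a window of size L over seq and count each k-mers."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


sequences = {"X": "ACAGT", "Y": "ACACG", "Z": "CACGT"}
counts = {name: kmer_counts(s, 2) for name, s in sequences.items()}

for name, c in counts.items():
    # A Counter returns 0 for k-mers that do not occur in the sequence.
    row = [c[k] for k in ("AC", "AG", "CA", "CG", "GT")]
    print(name, row, "sum =", sum(row))
```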
Step 2: constructing a bipartite graph G = (V, E) according to the relation between the sequences and the k-mers. G = (V, E) is an undirected graph model composed of a node set and an edge set, where V is the set of nodes and E represents the set of edges formed by the interactions between nodes: E = {e(s_i, k_j) | s_i ∈ S, k_j ∈ K}, where e(s_i, k_j) indicates that there is a membership relationship between sequence s_i and k-mers k_j.
Specifically, as can be seen from Table 1, sequence X and sequence Y share the common k-mers AC and CA, so sequence X is considered to be connected to sequence Y by an edge; sequence X and sequence Z share the common k-mers AC, CA and GT, so sequence X is considered to be connected to sequence Z by an edge; and sequence Y and sequence Z share the common k-mers AC, CA and CG, so sequence Y is considered to be connected to sequence Z by an edge.
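A minimal Python sketch of the sequence-k-mers relation for this example follows. Representing the bipartite graph as a sorted list of (sequence, k-mers) pairs, and the names `kmer_set`, `edges` and `shared`, are assumptions made for illustration.

```python
from itertools import combinations


def kmer_set(seq, L):
    """Distinct k-mers of length L found in seq."""
    return {seq[i:i + L] for i in range(len(seq) - L + 1)}


sequences = {"X": "ACAGT", "Y": "ACACG", "Z": "CACGT"}
kmers = {name: kmer_set(s, 2) for name, s in sequences.items()}

# Bipartite edges e(s_i, k_j): one edge per (sequence, k-mers) membership.
edges = sorted((name, k) for name, ks in kmers.items() for k in ks)

# Two sequences are adjacent when they share at least one k-mers.
shared = {
    (a, b): sorted(kmers[a] & kmers[b])
    for a, b in combinations(sequences, 2)
}
for pair, common in shared.items():
    print(pair, common)
```

Every edge in `edges` joins a sequence node to a k-mers node, so no edge connects two nodes of the same subset.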
Step 3: calculating the importance of the m sequences.
Step 3.1: randomly and uniformly dividing the n k-mers into t groups g_1, g_2, ..., g_t. As shown in FIG. 2, sequence s_1 and sequence s_i share the common k-mers k_1, and sequence s_1 and sequence s_m share the common k-mers k_3; that is, there is an interaction between s_1 and s_i and between s_1 and s_m. Taking the sequences and the k-mers as nodes, a graph model G = (V, E) can thus be constructed.
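One plausible way to realize the random uniform grouping is to shuffle the k-mers and deal them out round-robin, as in the sketch below. The patent does not specify the random procedure; `split_kmers` and its `seed` parameter are assumptions for illustration.

```python
import random


def split_kmers(kmers, t, seed=0):
    """Shuffle the k-mers and deal them into t groups of near-equal size."""
    shuffled = list(kmers)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return [shuffled[g::t] for g in range(t)]


all_kmers = ["AC", "AG", "CA", "CG", "GT"]
groups = split_kmers(all_kmers, t=2)
for index, group in enumerate(groups, start=1):
    print(f"g_{index}:", group)
```

Each k-mers lands in exactly one group, and group sizes differ by at most one.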
Step 3.2: calculating the weight of the edges. When two sequences v_i and v_j share a common k-mers, v_i and v_j are regarded as adjacent nodes connected by an edge, and the weight w_ji of the edge is the number of co-occurrences of the k-mers present in both sequences. That is, for any two sequences v_i and v_j that share a common k-mers, w_ji represents the undirected interaction between the nodes and is calculated by the following formula:
$$w_{ji} = \sum_{kmer \in v_i \,\&\, kmer \in v_j} \min\left(|kmer_{v_i}|,\ |kmer_{v_j}|\right)$$
Specifically, for the three sequences X = "ACAGT", Y = "ACACG" and Z = "CACGT", when the sliding window size of the k-mer elements is set to 2, the bipartite graph of the three sequences is shown in FIG. 3.
The k-mers frequencies of the three sequences X, Y and Z are shown in Table 2.
TABLE 2 Occurrence of k-mers in the three sequences
k-mers       AC   AG   CA   CG   GT
Sequence X    1    1    1    0    1
Sequence Y    2    0    1    1    0
Sequence Z    1    0    1    1    1
As can be seen from Table 2, sequence X and sequence Y share the common k-mers AC and CA. Since the weight of an edge is the number of co-occurrences of the k-mers present in both sequences, the relationship between sequence X and sequence Y in Table 2 can be simplified as follows:
TABLE 3 Co-occurrence frequency of the k-mers present in both sequence X and sequence Y
k-mers       AC   CA
Sequence X    1    1
Sequence Y    2    1
In this case the interaction w_YX between sequence X and sequence Y can be calculated as:
$$w_{YX} = \min(|AC_X|, |AC_Y|) + \min(|CA_X|, |CA_Y|) = 1 + 1 = 2$$
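The same edge weight can be computed programmatically. The sketch below sums, over the k-mers common to both sequences, the smaller of the two frequencies; the helper names `kmer_counts` and `edge_weight` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, L):
    """Count each k-mers of length L in seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


def edge_weight(counts_a, counts_b):
    """Sum of min frequencies over the k-mers present in both sequences."""
    common = counts_a.keys() & counts_b.keys()
    return sum(min(counts_a[k], counts_b[k]) for k in common)


x = kmer_counts("ACAGT", 2)
y = kmer_counts("ACACG", 2)
w_yx = edge_weight(x, y)
print(w_yx)  # min(1, 2) for AC plus min(1, 1) for CA
```

Because `min` is symmetric in its arguments, `edge_weight(x, y)` equals `edge_weight(y, x)`, matching the undirected edge.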
Step 3.3: calculating the weight of the nodes. For any two nodes v_i and v_j, node v_i passes its influence to node v_j through the edge w_ji connecting them, and the magnitude of the edge weight determines the magnitude of the influence of v_i on v_j. When node v_j has edges to several nodes, i.e. node v_j has several adjacent nodes, w_j· represents the contribution that node v_j receives from the other nodes and is calculated by the following formula:
$$w_{j\cdot} = \sum_{v_i \in V} w_{ji}$$
Extending Table 3 by calculating the co-occurrence frequency of the k-mers for each pair of the sequences X, Y and Z gives the following relationship matrix M:
TABLE 4 Co-occurrence frequency of the k-mers present in the three sequences X, Y and Z
             Sequence X   Sequence Y   Sequence Z
Sequence X       0            2            3
Sequence Y       2            0            3
Sequence Z       3            3            0
In Table 4, the size of the matrix M is |V| × |V|, where |V| represents the number of nodes. The values in the matrix represent the interaction between two sequences; for example, M[1, 3] indicates that the interaction between sequence X and sequence Z is 3.
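To make the construction of M concrete, the following sketch builds the matrix for X, Y and Z and derives each node's received contribution as a row sum. The helper names and the use of plain nested lists are assumptions for illustration.

```python
from collections import Counter


def kmer_counts(seq, L):
    """Count each k-mers of length L in seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


def edge_weight(counts_a, counts_b):
    """Sum of min frequencies over the k-mers present in both sequences."""
    common = counts_a.keys() & counts_b.keys()
    return sum(min(counts_a[k], counts_b[k]) for k in common)


names = ["X", "Y", "Z"]
counts = [kmer_counts(s, 2) for s in ("ACAGT", "ACACG", "CACGT")]
n = len(counts)

# M[i][j] is the edge weight between sequences i and j; the diagonal is 0.
M = [
    [0 if i == j else edge_weight(counts[i], counts[j]) for j in range(n)]
    for i in range(n)
]

# Contribution received by each node: the sum of its incident edge weights.
node_weight = [sum(row) for row in M]

for name, row, total in zip(names, M, node_weight):
    print(name, row, "received =", total)
```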
The weight of each node v_i is calculated iteratively to obtain the importance WS(v_i) of node v_i. The corresponding SeqRank calculation formula is as follows:
$$WS(v_i) = (1 - d) + d \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}}\, WS(v_j)$$
where d is a damping coefficient (0 ≤ d ≤ 1) representing the probability of walking from one node to another at any moment; that is, each node jumps randomly to another node with probability (1 − d). v_j ∈ e(v_i, v_j) indicates that node v_i and node v_j share a common edge; in v_k ∈ e(v_j, v_k), v_k is a node that shares a common edge with node v_j. w_ij (or w_ji) represents the weight of the edge connecting node v_i and node v_j, i.e. the sum of the co-occurrence frequencies of the k-mers present in both node v_i and node v_j. The denominator
$$\sum_{v_k \in e(v_j, v_k)} w_{jk}$$
denotes, for v_k ∈ e(v_j, v_k), the sum of the weights of the edges from node v_j to the nodes v_k. WS(v_j) is the importance of node v_j after the previous iteration.
Since calculating the weight of a node requires the weights of the nodes themselves, the calculation must be iterative. If WS(v_i)^t denotes the importance of node v_i after t iterations, it can be expressed as the following formula (3):
$$WS(v_i)^{t} = (1 - d) + d \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}}\, WS(v_j)^{t-1} \tag{3}$$
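The sketch below iterates a rank update in the spirit of the description above: a damping coefficient d, with each edge weight w_ji normalized by the total edge weight of node v_j. The exact update rule, the initial value 1.0, the tolerance, the iteration cap and the name `seqrank` are assumptions for illustration, and the input is the matrix from Table 4.

```python
def seqrank(M, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate WS(v_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(v_j)."""
    n = len(M)
    out = [sum(row) for row in M]  # sum_k w_jk for every node j
    ws = [1.0] * n                 # assumed initial importance
    for _ in range(max_iter):
        new = [
            (1 - d) + d * sum(
                M[j][i] / out[j] * ws[j]
                for j in range(n)
                if M[j][i] > 0     # only neighbours of v_i contribute
            )
            for i in range(n)
        ]
        delta = max(abs(a - b) for a, b in zip(new, ws))
        ws = new
        if delta < tol:            # assumed convergence condition
            break
    return ws


M = [[0, 2, 3],
     [2, 0, 3],
     [3, 3, 0]]
ws = seqrank(M)
for name, value in zip("XYZ", ws):
    print(name, round(value, 4))
```

In this toy graph X and Y are symmetric and therefore tie, while Z, which has the heaviest edges, ranks highest.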
Step 4: sorting the sequences in reverse order of importance. The importance of the m sequences under the different groups of k-mers is represented by a t × m matrix I, which is shown in Table 5.
TABLE 5 importance of m sequences under t groups
[Table 5 is an image in the original publication and is not reproduced here.]
Specifically, in matrix I the rows represent the t groups of k-mers and the columns represent the m sequences. I[p, q] denotes the importance value of the q-th (1 ≤ q ≤ m) sequence under the p-th (1 ≤ p ≤ t) group of k-mers. For example, I[1, ·] gives the importance of the m sequences under group g_1, and I[·, q] gives the importance of the q-th sequence calculated under all groups.
Note that the same sequence may show different importance in different groups; for example, sequence s_1 in matrix I has importance 0.96823 under group g_1 but 1.040769 under group g_2. Conversely, the maximum importance in different groups may belong to different sequences; for example, the most important sequence under group g_1 is s_m, while the most important sequence under group g_p is s_1.
The results obtained by sorting the importance of the sequences in reverse order in units of groups are shown in Table 6.
TABLE 6 reverse order ranking of importance
[Table 6 is an image in the original publication and is not reproduced here.]
Unlike Table 5, the values of matrix I in Table 6 are sequence numbers in reverse sorted order. For example, I[2, 1] = 7 means that sequence 7 is the most important sequence calculated under group g_2, and I[p, m] = 7 means that sequence 7 is considered the least important sequence under group g_p. The most important sequence numbers calculated under different groups of k-mers are not the same.
Let k (k ≤ m) be the number of centers for the m sequences. As for the sequence numbers SR_1 of group g_1, k sequences are screened at uniform intervals from each of the t groups, as shown in the shaded portion of Table 7.
Table 7 screening for importance
[Table 7 is an image in the original publication and is not reproduced here.]
The sequences corresponding to the k sequence numbers screened from each group are taken as candidate sequences, so the candidate sequence set has size t × k. The t × k sequences may contain repeats; for example, sequences 5 and 7 of group g_1 are also in the candidate sequence set of group g_t, and sequence 78 of group g_t is also in the candidate sequence set of group g_2. The t × k screened candidate sequences are therefore de-duplicated to obtain n (n ≤ t × k) non-repeated candidate sequences.
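The ranking, screening and de-duplication can be sketched as follows. The importance values are made-up toy numbers, and the screening rule (take every (m // k)-th entry of the ranked list, starting from the most important) is only one possible reading of "uniform intervals"; the names `reverse_rank` and `screen_uniform` are hypothetical.

```python
def reverse_rank(importance):
    """Sequence indices ordered from most to least important."""
    return sorted(range(len(importance)),
                  key=lambda q: importance[q], reverse=True)


def screen_uniform(ranked, k):
    """Pick k entries of the ranked list at a fixed stride (assumed rule)."""
    if not 1 <= k <= len(ranked):
        raise ValueError("k must satisfy 1 <= k <= m")
    step = len(ranked) // k
    return [ranked[p * step] for p in range(k)]


# Toy t x m importance matrix (t = 2 groups, m = 6 sequences); values invented.
importance = [
    [0.9, 1.2, 0.7, 1.1, 0.8, 1.3],
    [0.6, 1.0, 0.5, 1.4, 0.9, 1.2],
]
k = 3

picked = [q for row in importance
          for q in screen_uniform(reverse_rank(row), k)]

# De-duplicate while keeping first-seen order, so t * k picks become n <= t * k.
candidates = list(dict.fromkeys(picked))
print(picked)
print(candidates)
```

Sampling across the whole ranking, instead of taking only the top k, spreads the candidates over high, medium and low importance levels.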
Step 5: clustering the candidate sequences with the K-means algorithm. The number of K-means centers is k, and the features are the k-mers frequencies of the candidate sequences. With a sliding window size of L, the set of k-mers that may occur in DNA sequences is Σ^L. Assuming L = 2, the following matrix O is obtained:
TABLE 8 frequency of occurrence of k-mers in candidate sequences
[Table 8 is an image in the original publication and is not reproduced here.]
The size of the matrix O is n × |Σ|^L, where n is the number of candidate sequences and Σ^L is the k-mers set corresponding to the sequences for the current sliding window size L. When L = 2, for DNA sequences Σ^L = {AA, AC, AG, ..., TT}. In the matrix O, O[s_i, ·] represents the frequency of occurrence of each k-mers in the i-th sequence, namely 140, 122, 200, ..., 101.
The results obtained with K-means are shown in FIG. 4, where the n candidate sequences are grouped into k classes: Cluster 1, Cluster 2, and so on. For each cluster, the point closest to the current centroid is screened out as the sequence center. For Cluster 2, for example, under the chosen distance measure the point closest to the centroid of Cluster 2 is point A, so the sequence corresponding to point A is taken as the center sequence of Cluster 2. By analogy, k center sequences are obtained.
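A dependency-free sketch of this step is given below: a plain Lloyd-style K-means followed by picking, in each cluster, the member nearest to the centroid. The feature rows are invented toy k-mers frequencies, and the deterministic initialization (the first k points) and all function names are assumptions; a production implementation would more likely use a library K-means.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster happens to be empty.
            new.append([sum(col) / len(members) for col in zip(*members)]
                       if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return labels, centroids


def center_sequences(points, labels, centroids):
    """Index of the member closest to the centroid, for each cluster."""
    centers = []
    for c, mu in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            centers.append(min(members, key=lambda i: dist2(points[i], mu)))
    return centers


# Toy k-mers frequency rows for six candidate sequences (values invented).
features = [[5, 1], [1, 5], [4, 2], [2, 4], [5, 2], [1, 4]]
labels, centroids = kmeans(features, k=2)
centers = center_sequences(features, labels, centroids)
print(labels)
print(centers)
```

Choosing an actual member as the center, instead of the centroid itself, guarantees that every center corresponds to a real sequence.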
Step 6: clustering the m sequences S = {s_i | i = 1, 2, ..., m}.
Specifically, when the sliding window size is L, the k-mers set of the m sequences is K = {k_j | j = 1, 2, ..., |Σ|^L}, and the k-mers frequency matrix O constructed at this point has size m × |Σ|^L.
For the k centers, the clustering process of the K-means algorithm, whose features are the frequency matrix O of the k-mers in the m sequences, is as follows:
(1) The k screened centers are labeled μ_1, μ_2, ..., μ_k.
(2) For each point w_i (row i of the matrix O), the class to which it belongs is calculated using formula (5):
$$pre_i = \arg\min_{1 \le j \le k} \lVert w_i - \mu_j \rVert_2 \tag{5}$$
In formula (5), pre_i represents the class, among the k clusters, that is closest to sequence s_i, i.e. the prediction class of the i-th sequence; ‖w_i − μ_j‖_2 is the Euclidean distance between the value w_i of sequence s_i and each center point μ_j; and arg min selects the center point closest to w_i, which determines the prediction class of the i-th sequence. The prediction classes of all m sequences are obtained in this way.
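The final assignment amounts to a nearest-center lookup. The sketch below assumes each sequence is represented by its k-mers frequency row and each center by the row of its center sequence; the data are invented toy values and `predict` is a hypothetical name.

```python
import math


def predict(features, centers):
    """Assign each feature row to the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(w, centers[j]))
        for w in features
    ]


# Toy k-mers frequency rows for m = 6 sequences (values invented).
features = [[5, 1], [1, 5], [4, 2], [2, 4], [5, 2], [1, 4]]
# Rows of the k = 2 center sequences (here sequences 4 and 5 of the toy data).
centers = [features[4], features[5]]

pre = predict(features, centers)
print(pre)
```

Each center sequence is assigned to its own class, since its distance to itself is zero.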
By adopting this technical scheme, the biological sequences are cluster-analyzed by constructing a bipartite graph model, and deeper information and reliable conclusions are obtained from the biological sequence data at the level of cluster analysis. This effectively solves problems of the prior art such as the high complexity of node weight calculation, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.

Claims (5)

1. A biological sequence clustering method based on k-mer group segmentation, characterized by comprising the following steps:
step 1: acquiring a data set of the sequence set to be processed, and segmenting the sequences in the data set according to the set sliding window size to obtain a k-mers set;
step 2: constructing a bipartite graph according to the relation between the sequences and the k-mers set, and respectively counting, for any sequence s_i, the word frequencies of the k-mers co-occurring with other sequences; that is, the constructed bipartite graph is G = (V, E), also called a sequence-k-mers graph; G = (V, E) is an undirected graph model composed of a node set and an edge set,
where V is the node set and can be decomposed into two subsets:
$$V = S \cup K, \qquad S \cap K = \varnothing$$
S = {s_i | i = 1, 2, ..., m} is the set of sequences and s_i is the i-th sequence; K = {k_j | j = 1, 2, ..., n} is the set of k-mers and k_j is the j-th k-mers;
E represents the set of edges formed by the interactions between nodes, and the two end points of each edge in E lie in the subset S and the subset K respectively, i.e. E = {e(s_i, k_j) | s_i ∈ S, k_j ∈ K}, where e(s_i, k_j) indicates that there is a membership relationship between sequence s_i and k-mers k_j;
step 3: randomly and uniformly dividing the k-mers into t groups g_1, g_2, ..., g_t, and calculating the importance of the sequences under each group of k-mers; the method for determining the importance of the sequences comprises the following steps:
step 3.1: calculating the weight of the edges: when two sequences v_i and v_j share a common k-mers, v_i and v_j are regarded as adjacent nodes connected by an edge, and the weight w_ji of the edge is the number of co-occurrences of the k-mers present in both sequences,
wherein the weight of the edge between adjacent nodes v_i and v_j is w_ji, calculated by the following formula:
$$w_{ji} = \sum_{kmer \in v_i \,\&\, kmer \in v_j} \min\left(|kmer_{v_i}|,\ |kmer_{v_j}|\right)$$
where kmer ∈ v_i & kmer ∈ v_j indicates that the k-mers is present both in node v_i and in node v_j; |kmer_{v_i}| and |kmer_{v_j}| respectively represent the frequency of occurrence of the current k-mers in node v_i and in node v_j; and w_ji = w_ij;
Step 3.2: calculating the weight of the node:
for any two nodes v i And v j Node v i By connecting their edges w ji To node v j Transitive, the magnitude of the edge weight determines v i For v j The magnitude of the effect of (c);
when node v j With edge relationships to multiple nodes, i.e. node v j With a plurality of adjacent nodes, in which case node v j Is weighted as node v j Receiving the sum of the effects from other nodes;
wherein, w j. Representing a node v j Upon receiving the contribution from the other node, w is calculated by the following equation j.
Figure FDA0004000873840000015
where $w_{ji}$ represents the degree of contribution of any node i in the node set V to node $v_j$;
Step 3.3: iteratively calculating the weight of each node $v_i$ to obtain the importance of node $v_i$; the importance of each node $v_i$ is $WS(v_i)$, and the corresponding SeqRank calculation formula is as follows:
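Read literally, the node weight described in step 3.2 is the sum of the weights of all edges incident to the node. The following sketch assumes exactly that; the nested-dictionary layout `w[j][i]` and the function name `node_weights` are illustrative choices, not part of the patent.

```python
def node_weights(w):
    """w[j][i] holds the edge weight w_ji; return w_j. = sum over i of w_ji."""
    return {j: sum(neighbours.values()) for j, neighbours in w.items()}

# Toy symmetric graph: node 0 is adjacent to nodes 1 and 2.
w = {0: {1: 3, 2: 1}, 1: {0: 3}, 2: {0: 1}}
totals = node_weights(w)
```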
Figure FDA0004000873840000021
where d is a damping coefficient with $0 \le d \le 1$, representing the probability of wandering from one node to another at any time; $v_j \in e(v_i,v_j)$ indicates that node $v_i$ and node $v_j$ have a common edge; in $v_k \in e(v_j,v_k)$, $v_k$ is a node having a common edge with node $v_j$; $w_{ij}$ or $w_{ji}$ represents the weight of the edge connecting node $v_i$ and node $v_j$, i.e. the sum of the co-occurrence frequencies of the k-mers present in both node $v_i$ and node $v_j$; the denominator
Figure FDA0004000873840000022
denotes, for $v_k \in e(v_j,v_k)$, the sum of the weights of the edges from node $v_j$ to the nodes $v_k$, and $WS(v_j)$ is the importance of node $v_j$ after the previous iteration;
with $WS(v_i)^t$ denoting the importance of node $v_i$ after t iterations, the importance can be expressed as formula (3):
Figure FDA0004000873840000023
the SeqRank algorithm carries out iterative computation on the graph model until a convergence condition is met;
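The SeqRank formula is only present as an image, so the sketch below assumes a weighted PageRank/TextRank-style update that agrees with the surrounding definitions: each node receives $(1-d)$ plus $d$ times the sum, over its neighbours $v_j$, of $w_{ji}$ divided by the total edge weight of $v_j$, multiplied by $WS(v_j)$ from the previous iteration. The constant term $(1-d)$, the initial value 1.0, the tolerance `tol` and the cap `max_iter` are assumptions, and a non-empty graph with symmetric weights is assumed.

```python
def seqrank(w, d=0.85, tol=1e-6, max_iter=200):
    """Iterate an assumed SeqRank update on a weighted undirected graph.

    w[i][j] is the symmetric edge weight between nodes i and j.
    Returns a dict mapping each node to its importance WS.
    """
    out_sum = {j: sum(nbrs.values()) for j, nbrs in w.items()}
    ws = {i: 1.0 for i in w}
    for _ in range(max_iter):
        new = {
            i: (1 - d) + d * sum(w[j][i] / out_sum[j] * ws[j] for j in w[i])
            for i in w
        }
        # Stop once no importance value changes by more than tol.
        if max(abs(new[i] - ws[i]) for i in w) < tol:
            return new
        ws = new
    return ws

w = {"a": {"b": 3, "c": 1}, "b": {"a": 3}, "c": {"a": 1}}   # toy graph
scores = seqrank(w)
```

With d = 0.85, the value given in claim 3, the update shrinks differences by a factor of at most d per pass, so the loop settles well within the iteration cap on this toy graph.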
Step 4: sorting the sequences in descending order of importance: setting k as the number of centers for the m sequences, and selecting k sequences from each of the t groups at uniform intervals as candidate sequences; and de-duplicating the resulting t × k sequences to obtain non-repeated candidate sequences;
Step 5: performing k-mer-based clustering on the candidate sequences: segmenting the DNA sequences based on the set sliding-window size to obtain a k-mer set; selecting, from each cluster in the K-means clustering result, the point closest to the current centroid as a sequence center; in this way, k center sequences are obtained;
Step 6: clustering the m sequences $S=\{s_i \mid i=1,2,\dots,m\}$ based on the k center sequences to obtain the predicted categories corresponding to the m sequences.
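The candidate selection of step 4 can be sketched as follows. The claim does not say how the "uniform intervals" are computed, so this sketch assumes a stride of m // k over the descending ranking, starting from the most important sequence; `pick_candidates` and its input layout are hypothetical.

```python
def pick_candidates(importance_by_group, k):
    """importance_by_group: one dict {sequence_id: importance} per k-mer group.

    From each group, rank sequences by descending importance, take k of them
    at a uniform stride, and merge the picks without duplicates.
    """
    candidates = []
    for scores in importance_by_group:
        ranked = sorted(scores, key=scores.get, reverse=True)
        step = max(len(ranked) // k, 1)   # assumed spacing rule
        for seq_id in ranked[::step][:k]:
            if seq_id not in candidates:
                candidates.append(seq_id)
    return candidates

groups = [   # toy importance values for t = 2 groups
    {"s1": 0.9, "s2": 0.5, "s3": 0.7, "s4": 0.1},
    {"s1": 0.2, "s2": 0.8, "s3": 0.6, "s4": 0.4},
]
cands = pick_candidates(groups, k=2)
```

Here t × k = 4 picks are made, and de-duplication leaves three distinct candidates because `s2` is chosen in both groups.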
2. The method of claim 1, wherein the method comprises the steps of: the data set in step 1 is a sequence set $S=\{s_i \mid i=1,2,\dots,m\}$ of length m, the sliding window size is L, and the segmented k-mer set is $K=\{k_j \mid j=1,2,\dots,n\}$.
3. The method of claim 1, wherein the method comprises the steps of: the value of d is 0.85.
4. The method of claim 1, wherein the method comprises the steps of: the method for determining the center sequence in the step 5 comprises the following steps:
Step 5.1: clustering the candidate sequences using the K-means algorithm, wherein the number of K-means centers is k and the features are the k-mer frequencies of the candidate sequences;
Step 5.2: selecting, for each cluster, the point closest to the current center as the sequence center.
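Step 5.2 can be illustrated in isolation. The sketch below assumes that K-means has already been run by some implementation and that its cluster labels are available; it then picks, per cluster, the member whose k-mer frequency vector lies nearest to the cluster centroid. The names `center_sequences`, `features` and `labels` are hypothetical.

```python
import math

def center_sequences(features, labels):
    """features: {seq_id: k-mer frequency vector}; labels: {seq_id: cluster}.

    Returns {cluster: seq_id of the member closest to the cluster centroid}.
    """
    clusters = {}
    for seq_id, c in labels.items():
        clusters.setdefault(c, []).append(seq_id)
    centers = {}
    for c, members in clusters.items():
        dim = len(features[members[0]])
        centroid = [
            sum(features[s][t] for s in members) / len(members)
            for t in range(dim)
        ]
        centers[c] = min(members, key=lambda s: math.dist(features[s], centroid))
    return centers

features = {"s1": [0, 0], "s2": [1, 0], "s3": [5, 0],
            "s4": [10, 10], "s5": [11, 10], "s6": [12, 10]}   # toy vectors
labels = {"s1": 0, "s2": 0, "s3": 0, "s4": 1, "s5": 1, "s6": 1}
centers = center_sequences(features, labels)
```

Choosing an actual member rather than the centroid itself guarantees that every center is a real sequence from the candidate set.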
5. The method of claim 1, wherein the method comprises the steps of: the method for determining the sequence clustering in the step 6 comprises the following steps:
Step 6.1: the k center sequences are respectively denoted $\mu_1, \mu_2, \dots, \mu_k$;
Step 6.2: for each sequence $s_i$ in the sequence set S, its predicted category is calculated using the following formula:
Figure FDA0004000873840000031
where $pre_i$ denotes the class closest to sequence $s_i$ among the k clusters, i.e. the predicted category of the i-th sequence; $\lVert w_i - \mu_j \rVert_2$ denotes the Euclidean distance between the importance value $w_i$ of sequence $s_i$ and each center point;
Figure FDA0004000873840000032
indicates that the center point nearest to $w_i$ is taken as the predicted category of the i-th sequence.
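The assignment rule of claim 5 amounts to a nearest-center search under the Euclidean distance. A minimal sketch follows, assuming that each sequence and each center is represented by a numeric vector of the same length; how the vector $w_i$ is built is not spelled out in this section, so the toy vectors are placeholders and `predict` is a hypothetical name.

```python
import math

def predict(vectors, centers):
    """Return, for each vector, the index j of the nearest center (Euclidean)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
        for v in vectors
    ]

centers = [[0.0, 0.0], [10.0, 10.0]]                  # stand-ins for mu_1, mu_2
vectors = [[1.0, 2.0], [9.0, 8.0], [4.0, 4.0]]        # stand-ins for w_i
pre = predict(vectors, centers)
```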
CN201910271872.4A 2019-04-04 2019-04-04 Biological sequence clustering method based on k-mer group segmentation Active CN110060735B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910271872.4A CN110060735B (en) 2019-04-04 2019-04-04 Biological sequence clustering method based on k-mer group segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910271872.4A CN110060735B (en) 2019-04-04 2019-04-04 Biological sequence clustering method based on k-mer group segmentation

Publications (2)

Publication Number Publication Date
CN110060735A CN110060735A (en) 2019-07-26
CN110060735B CN110060735B (en) 2023-02-10

Family

ID=67318406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910271872.4A Active CN110060735B (en) 2019-04-04 2019-04-04 Biological sequence clustering method based on k-mer group segmentation

Country Status (1)

Country Link
CN (1) CN110060735B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851655B (en) * 2019-11-07 2024-05-17 中国银联股份有限公司 Method and system for simplifying complex network
CN114822699B (en) * 2022-04-07 2023-04-07 天津大学四川创新研究院 Clustering algorithm-based high-performance k-mer frequency counting method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100518781B1 (en) * 2001-10-17 2005-10-06 한국과학기술원 The Devices and Methods for Hyper-rectangle Based Multidimensional Data Segmentation and Clustering
CN107480471B (en) * 2017-07-19 2020-09-01 福建师范大学 Sequence similarity analysis method based on wavelet transform characteristics
CN109326327B (en) * 2018-08-28 2021-11-12 福建师范大学 Biological sequence clustering method based on SeqRank graph algorithm

Also Published As

Publication number Publication date
CN110060735A (en) 2019-07-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant