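WO2022062114A1 - Similarity analysis method of negative sequential patterns based on biological sequences, implementation system and medium - Google Patents
Similarity analysis method of negative sequential patterns based on biological sequences, implementation system and medium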
- Publication number
- WO2022062114A1 (PCT/CN2020/128253)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- sequences
- negative
- frequent
- similarity
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Definitions
- The invention relates to a similarity analysis method of negative sequence patterns based on biological sequences, an implementation system and a medium, and belongs to the technical field of applications of decision-making and efficient negative sequence rules.
- Sequence pattern mining algorithms help to identify co-occurring biological sequences and discover relationships in DNA or protein sequences. Studying missing base pair sequences is therefore more significant than mining frequent sequence patterns alone.
- Sequence similarity analysis is by no means a simple mechanical comparison; it has to be approached in varied ways, and many mathematical and statistical methods are needed for auxiliary analysis and evaluation.
- Sequence alignment is the most commonly used and classic research method in sequence similarity analysis. Analyzing the similarity of biological sequences at the sequence level in order to infer their structure, function and evolutionary relationships is the basis of gene identification, molecular evolution, and research on the origin of life. However, two problems in sequence alignment directly affect the similarity result.
- Biological sequence data often contains a lot of valuable biological information. For example, genes and protein fragments that frequently appear in biological sequences often contain a lot of unknown information, and it is of great significance to mine this information; some bacteria attack the human body by their genes.
- the present invention proposes a similarity analysis method based on the negative sequence pattern of biological sequences
- the present invention also provides an implementation system of the above similarity analysis method.
- A DNA sequence, also known as a gene sequence, is the primary structure of a real or hypothetical DNA molecule carrying genetic information, represented by a string of letters.
- The f-NSP algorithm uses bitmaps to store PSP (positive sequence pattern) data and calculates NSC (negative candidate sequence) support through bit operations. It creates a bitmap for each PSP whose size is greater than 1: if the positive sequence is contained in the i-th data sequence, the i-th position of its bitmap is set to 1; otherwise it is set to 0. The length of each bitmap is equal to the number of sequences in the database. This bitmap storage structure allows the bitwise OR operation to replace the original union operation.
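The bitmap idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python integers stand in for the bitmaps, and the helper names `build_bitmap`, `is_subsequence` and `popcount`, as well as the toy database, are hypothetical and do not come from the patent.

```python
def is_subsequence(data_sequence, pattern):
    """True if the items of pattern occur in data_sequence in order (gaps allowed)."""
    it = iter(data_sequence)
    return all(item in it for item in pattern)


def build_bitmap(pattern, database):
    """Integer bitmap whose i-th bit is 1 if database[i] contains pattern."""
    bitmap = 0
    for i, data_sequence in enumerate(database):
        if is_subsequence(data_sequence, pattern):
            bitmap |= 1 << i
    return bitmap


def popcount(bitmap):
    """Number of 1 bits, i.e. the number of data sequences marked in the bitmap."""
    return bin(bitmap).count("1")


# Toy database (illustrative only): four data sequences.
database = ["ACGT", "AGGT", "CCTA", "ACCA"]
bm_ac = build_bitmap("AC", database)   # sequences 0 and 3 contain <A C>
bm_ct = build_bitmap("CT", database)   # sequences 0 and 2 contain <C T>

# The union of the two supporting sets is a single bitwise OR.
union = bm_ac | bm_ct
```

Here the support of a pattern is the number of set bits in its bitmap, so the union of two supporting sets costs one integer OR rather than a set merge.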
- ns contains only one negative element, then the support of the sequence ns is:
- The f-NSP algorithm includes the following steps. 1. Find all PSPs in the sequence database using the GSP algorithm; all PSPs and their bitmaps are stored in a hash table PSPHash. 2. Generate NSCs for each PSP using the NSC (negative candidate sequence) generation method. 3. Calculate the support of each nsc: the support of a 1-neg-size nsc is calculated using equations (2) and (3), and the support of other nscs is calculated by formula (1). Specifically, first get the bitmap of each 1-negMS in the 1-negMSS of the nsc; second, use the OR operation to get the union of these bitmaps; then calculate the support of the nsc according to formula (1); finally, determine whether the nsc is an NSP by comparing its support with min_sup. 4. Return the result and end the algorithm.
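Step 3 can be illustrated with a short sketch. Formula (1) is not reproduced in this text, so the code assumes the support rule commonly used in the e-NSP family of algorithms: the support of an nsc equals the support of MPS(nsc) minus the number of data sequences covered by the union of the bitmaps of the positive partners of its 1-negMS. That rule, and the names `nsc_support` and `is_nsp`, are assumptions rather than the patent's stated formula.

```python
def popcount(bitmap):
    """Number of 1 bits in an integer bitmap."""
    return bin(bitmap).count("1")


def nsc_support(mps_support, partner_bitmaps):
    """Assumed rule: sup(nsc) = sup(MPS(nsc)) - |union of partner bitmaps|.

    partner_bitmaps holds one integer bitmap per 1-negMS of the nsc,
    taken from the bitmap of its positive partner.
    """
    union = 0
    for bitmap in partner_bitmaps:
        union |= bitmap          # bitwise OR replaces the set union
    return mps_support - popcount(union)


def is_nsp(mps_support, partner_bitmaps, min_sup):
    """An nsc is kept as an NSP when its support reaches min_sup."""
    return nsc_support(mps_support, partner_bitmaps) >= min_sup


# Illustrative numbers only: MPS(nsc) is supported by 6 data sequences,
# and the two positive partners cover sequences {0, 1} and {1, 2}.
example_support = nsc_support(6, [0b000011, 0b000110])
```

With these made-up inputs the union covers three data sequences, so the candidate keeps a support of 3.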
- GSP algorithm: GSP is a mining algorithm based on a breadth-first search strategy. The algorithm scans the database to obtain the frequent itemsets it contains, then generates candidate sequences of increasing length through join and pruning operations, and obtains the support of the candidate sequences by repeatedly scanning the database in order to determine the positive sequence patterns.
- The GSP algorithm is a typical Apriori-like algorithm. On top of the Apriori algorithm, GSP adds taxonomy hierarchies, time constraints and sliding time windows, which optimizes the algorithm as a whole. GSP also restricts how the dataset is scanned, which reduces the number of candidate sequences that need to be scanned and the generation of useless patterns.
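The level-wise join, prune and scan loop can be sketched as follows. This is a deliberately simplified, GSP-style miner and not the full GSP algorithm: every element is a single item, and taxonomies, time constraints and sliding windows are left out. The function names and the toy database are hypothetical.

```python
def is_subsequence(sequence, pattern):
    """True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def count_support(pattern, database):
    """Number of data sequences that contain pattern."""
    return sum(1 for sequence in database if is_subsequence(sequence, pattern))


def mine_positive_patterns(database, min_sup):
    """Return {pattern tuple: support} for all frequent positive patterns."""
    items = sorted({item for sequence in database for item in sequence})
    level = {}
    for item in items:
        sup = count_support((item,), database)
        if sup >= min_sup:
            level[(item,)] = sup
    frequent = {}
    while level:
        frequent.update(level)
        seeds = list(level)
        # Join: p and q combine when p without its first item equals q without its last.
        candidates = {p + q[-1:] for p in seeds for q in seeds if p[1:] == q[:-1]}
        next_level = {}
        for cand in sorted(candidates):
            # Apriori pruning: dropping any one item must leave a frequent pattern.
            if all(cand[:i] + cand[i + 1:] in level for i in range(len(cand))):
                sup = count_support(cand, database)
                if sup >= min_sup:
                    next_level[cand] = sup
        level = next_level
    return frequent


patterns = mine_positive_patterns(["ACGT", "ACT", "AGT", "CAT"], min_sup=3)
```

On this toy input the miner stops after length 2, because no length-3 candidate can be joined from the surviving length-2 patterns.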
- In the complex plane, the points representing real numbers a all lie on the x-axis, so the x-axis is also called the "real axis"; the points representing pure imaginary numbers bi all lie on the y-axis, so the y-axis is also called the "imaginary axis". The only real point on the y-axis is the origin "0".
- The purine-pyrimidine map, in simple terms, draws vectors on a plane so that the different base pairs in a DNA sequence are represented unambiguously.
- The first and second quadrants are purines (A, G and their negative counterparts).
- The third and fourth quadrants are pyrimidines (T, C and their negative counterparts).
- The unit vectors representing the four nucleotides A, G, T, C and their corresponding negative sequences are as follows. In this way, different base pairs are represented uniquely, and the conjugation relationship between base pairs is satisfied.
- This purine-pyrimidine map has the property that a DNA sequence corresponds one-to-one to its time series.
- DTW: Dynamic Time Warping
- A similarity analysis method of negative sequence patterns based on biological sequences, comprising the following steps:
- (1) Data preprocessing: the letters in the DNA sequence are represented by numbers; because DNA sequences are very long, the numerically represented DNA sequence is divided into several blocks with the same number of bases in each block, and the resulting blocks are used as the data set for frequent pattern mining;
- The similarity matrix can be used to evaluate the effectiveness of DNA similarity analysis algorithms, and it indirectly reveals the evolutionary or genetic relationships between different species.
- Calculating the distance between DNA sequences is the basis of DNA similarity analysis. Euclidean distance and correlation angle are the most commonly used distance measures: the smaller the Euclidean distance between two sequences, the more similar the DNA sequences; likewise, the smaller the angle between their two vectors, the more similar the DNA sequences.
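The two measures mentioned here can be written down directly. The sketch assumes each DNA sequence has already been turned into a numeric feature vector of equal length with at least one non-zero entry; how those vectors are built is not specified at this point, and the function names are illustrative.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_angle(u, v):
    """Angle in radians between two non-zero vectors (smaller means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cosine)


d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
theta = correlation_angle([1.0, 0.0], [0.0, 1.0])
```

Both values shrink as the vectors become more alike, which is the sense in which the text says that smaller means more similar.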
- In step (2), the f-NSP algorithm is used to mine the data set D, and the steps are as follows:
- The sequence patterns of length 1 are taken from the original seed set P_1 and joined to generate the candidate sequence set C_2 of length 2. The Apriori property is used to prune C_2, and C_2 is then scanned to determine the support of the remaining sequences. The sequence patterns whose support is higher than the minimum support are saved and output as the sequence patterns L_2 of length 2, which serve as the seed set of length 2 used to generate candidate sequences of increasing length.
- In the same way, the sequence patterns L_3 of length 3, the sequence patterns L_4 of length 4, ... are output until no new sequence patterns can be mined.
- The minimum support is the manually set support threshold min_sup.
- NSC refers to a negative candidate sequence; positive frequent sequences are collectively referred to as positive sequences.
- The key step in generating NSCs is to convert non-adjacent elements of a positive pattern into their negative partners.
- NSCs refers to all negative candidate sequences.
- After the NSCs are generated, their support is calculated; a negative candidate sequence whose support meets the minimum support is a negative frequent sequence pattern.
- Suppose ns = <a_1 a_2 … a_m> is a negative sequence. If ns' consists only of all positive elements in ns, then ns' is called the maximum positive subsequence of ns, denoted MPS(ns). A sequence consisting of MPS(ns) and one negative element a of ns is called a 1-neg-size maximum subsequence, denoted 1-negMS.
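A small sketch may make NSC generation, MPS and 1-negMS concrete. It assumes single-item elements stored as strings, writes a negative element with a leading "¬", and bounds the number of negated elements by the smallest integer not less than k/2, as stated in the claims. The function names and the example pattern <a b c> are hypothetical.

```python
from itertools import combinations
from math import ceil

NEG = "¬"  # marker used here for a negative element (notation is an assumption)


def generate_nscs(psp):
    """All candidates obtained by negating m non-adjacent elements, m = 1..ceil(k/2)."""
    k = len(psp)
    nscs = []
    for m in range(1, ceil(k / 2) + 1):
        for positions in combinations(range(k), m):
            # Keep only choices where no two negated positions are neighbours.
            if all(b - a > 1 for a, b in zip(positions, positions[1:])):
                nscs.append(tuple(NEG + e if i in positions else e
                                  for i, e in enumerate(psp)))
    return nscs


def mps(ns):
    """Maximum positive subsequence: all positive elements of ns, in order."""
    return tuple(e for e in ns if not e.startswith(NEG))


def one_neg_ms(ns):
    """Every 1-negMS of ns: MPS(ns) plus exactly one negative element of ns."""
    return [tuple(x for j, x in enumerate(ns) if j == i or not x.startswith(NEG))
            for i, e in enumerate(ns) if e.startswith(NEG)]


candidates = generate_nscs(("a", "b", "c"))
```

For the 3-size pattern <a b c> this yields four candidates: three with one negated element and one, <¬a b ¬c>, with two non-adjacent negated elements.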
- Graphically representing the maximum frequent positive and negative sequence patterns includes: constructing a purine-pyrimidine map in the complex plane, in which the first and second quadrants are purines, including A, G and their negative counterparts, and the third and fourth quadrants are pyrimidines, including T, C and their negative counterparts; the unit vectors of the four nucleotides A, G, T, C and their corresponding negative sequences are as shown in formula (I) to formula (VIII):
- In formulas (I) to (VIII), b and d are non-zero real numbers; A and T are conjugate, and G and C are also conjugate. A, T, C, G represent base pairs that actually occur, while their negative counterparts represent base pairs that should appear in the DNA sequence but do not, also known as missing base pairs;
- In formula (X), j denotes the base type at the 0th, 1st, 2nd, ..., n-th position in the sequence S, and n is the length of the DNA sequence under study;
- The time series of the 12 transformed frequent sequence patterns are then obtained.
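The mapping can be sketched as below, with several assumptions stated up front. Formulas (I) to (IV) are taken from the claims (A = b + di, G = d + bi, T = b - di, C = d - bi). Formulas (V) to (X) are not reproduced in this text, so the placement of the negative elements (mirrored across the imaginary axis, which puts them in the second and third quadrants) and the cumulative-sum form of the time series starting at s(0) = 0 are assumptions, as are the concrete values chosen for b and d and all names in the code.

```python
import math

B = 0.5                # assumed value; the source only requires b to be non-zero
D = math.sqrt(3) / 2   # assumed value; the source only requires d to be non-zero

UNIT = {
    "A": complex(B, D),     # formula (I)
    "G": complex(D, B),     # formula (II)
    "T": complex(B, -D),    # formula (III)
    "C": complex(D, -B),    # formula (IV)
    # Assumed placement of the negative (missing) elements:
    "¬A": complex(-B, D),   # second quadrant, purine
    "¬G": complex(-D, B),   # second quadrant, purine
    "¬T": complex(-B, -D),  # third quadrant, pyrimidine
    "¬C": complex(-D, -B),  # third quadrant, pyrimidine
}


def to_time_series(pattern):
    """Assumed form: s(0) = 0 and s(j) = s(j-1) + y(j), y(j) being the unit vector."""
    series = [0j]
    for element in pattern:
        series.append(series[-1] + UNIT[element])
    return series


walk = to_time_series(["A", "T"])
```

With this choice of b and d each vector has length 1, and A and T (as well as G and C) are complex conjugates, as the text requires.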
- In step (4), a distance matrix is obtained, and the distance matrix is used to represent the similarity of different DNA sequences.
- D(m,n) is the minimum accumulated value of the warping path in A_{m×n}.
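The accumulated value D(m,n) can be computed with the standard dynamic time warping recurrence, sketched below. Formula (XI) and the definition of the matrix elements are not reproduced in this text, so the sketch assumes the textbook form: the local cost is the absolute difference between two series values, and each cell adds its cost to the minimum of its three predecessor cells. The function names are illustrative.

```python
def dtw_distance(x, y):
    """Minimum accumulated cost of a warping path between series x and y."""
    m, n = len(x), len(y)
    inf = float("inf")
    # acc[i][j] is the best accumulated cost aligning x[:i] with y[:j].
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])  # assumed local distance
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    return acc[m][n]


def distance_matrix(series_list):
    """Symmetric matrix of pairwise DTW distances."""
    size = len(series_list)
    matrix = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(a + 1, size):
            matrix[a][b] = matrix[b][a] = dtw_distance(series_list[a], series_list[b])
    return matrix


dist = distance_matrix([[1, 2, 3], [1, 2, 2, 3], [0, 0]])
```

Because `abs` also works on complex numbers, the same function accepts real-valued or complex-valued series of different lengths.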
- The implementation system of the above similarity analysis method includes a data preprocessing module, a frequent pattern mining module, a graphic representation module, and a similarity analysis module connected in sequence; the data preprocessing module performs step (1); the frequent pattern mining module performs step (2); the graphic representation module performs step (3); the similarity analysis module performs step (4).
- A computer-readable storage medium storing a similarity analysis program based on negative sequence patterns of biological sequences; when the program is executed by a processor, the steps of any one of the above similarity analysis methods based on negative sequence patterns of biological sequences are implemented.
- The present invention can effectively express and analyze negative sequences, and different analysis results can be obtained by selecting different combinations of maximum frequent patterns.
- The present invention performs the similarity analysis on frequent patterns only, which greatly reduces the computer's memory and time consumption.
- Fig. 1 is the flow chart of the similarity analysis method based on the negative sequence pattern of biological sequence of the present invention
- Fig. 2 is the schematic diagram of purine pyrimidine diagram of the present invention
- Fig. 3 is the structural block diagram of the realization system of the similarity analysis method based on the negative sequence pattern of biological sequence of the present invention
- FIG. 4 is a schematic diagram of a bit-or (OR) operation process in an embodiment
- Figure 5(a) is a schematic diagram of a phylogenetic tree drawn after similarity analysis of the most frequent sequences Human1, Opossum2, Rat2 and Chimpanzee2;
- Figure 5(b) is a schematic diagram of a phylogenetic tree drawn after similarity analysis of the most frequent sequences Human2, Opossum1, Rat2, and Chimpanzee1;
- Figure 6(a) is a schematic diagram of a phylogenetic tree drawn after similarity analysis of the most frequent sequences Human2, Opossum2, Rat2 and Chimpanzee1;
- Figure 6(b) is a schematic diagram of a phylogenetic tree drawn after similarity analysis of the most frequent sequences Human3, Opossum3, Rat3 and Chimpanzee3;
- Figure 7 is a schematic diagram of normalized species distances.
- A similarity analysis method of negative sequence patterns based on biological sequences includes the following steps:
- (1) Data preprocessing: the letters in the DNA sequence are represented by numbers; because DNA sequences are very long, the numerically represented DNA sequence is divided into several blocks with the same number of bases in each block, and the resulting blocks are used as the data set for frequent pattern mining;
- Each sequence is first divided into several blocks, and each block consists of the same number of consecutive bases.
- The blocks are independent of each other, and the block size can vary in practice. If the last block is smaller than the specified block size, it is discarded.
- Here is an example of splitting into blocks. There are two sequences S_1 and S_2. Assuming a block size of 15, the two sequences are divided into 2 and 3 blocks, respectively, and the last block of size 3 is discarded. Each block is marked with curved and straight lines. This step, also called sequence blocking, brings two main advantages. First, fine-grained information about the sequence can be captured, including location and ordering information. Second, even for long sequences, blocking reduces the memory and time consumption of sequence processing.
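The preprocessing step can be sketched as follows. The letter-to-number table is an assumption (the text only says that letters are represented by numbers), and the function names and the sample sequence are hypothetical. A trailing block shorter than the block size is dropped, as described above.

```python
# Hypothetical encoding; the source does not state which numbers are used.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(dna):
    """Replace each base letter with its number."""
    return [ENCODING[base] for base in dna.upper()]


def split_into_blocks(encoded, block_size):
    """Cut the sequence into full blocks of block_size bases; drop any remainder."""
    full_blocks = len(encoded) // block_size
    return [encoded[i * block_size:(i + 1) * block_size] for i in range(full_blocks)]


# A 33-base sequence with block size 15 gives 2 blocks; the last 3 bases are discarded.
sample = "ACGT" * 8 + "A"
blocks = split_into_blocks(encode(sample), 15)
```

The resulting list of equal-length blocks is the data set that the frequent pattern mining step scans.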
- The selected datasets are the first exons of the β-globin genes of four species, as shown in Table 1:
- a bitwise OR operation is explained using FIG. 4 .
- A sequence s is called a frequent (positive) sequence pattern if sup(s) ≥ min_sup, and an infrequent sequence pattern if sup(s) < min_sup.
- b and d are non-zero real numbers; A and T are conjugate, and G and C are also conjugate. A, T, C, G represent base pairs that actually occur, while their negative counterparts represent base pairs that should have appeared in the DNA sequence but did not, also known as missing base pairs, as shown in Figure 2.
- In step (4), a distance matrix is obtained by the DTW algorithm, and the distance matrix is used to represent the similarity of different DNA sequences.
- D(m,n) is the minimum accumulated value of the warping path in A_{m×n}.
- Figure 5(a) is a schematic diagram of the phylogenetic tree drawn after the similarity analysis of the most frequent sequences Human1, Opossum2, Rat2 and Chimpanzee2; Figure 5(b) is a schematic diagram of the phylogenetic tree drawn after the similarity analysis of the most frequent sequences Human2, Opossum1, Rat2 and Chimpanzee1; Figure 6(a) is a schematic diagram of the phylogenetic tree drawn after the similarity analysis of the most frequent sequences Human2, Opossum2, Rat2 and Chimpanzee1; Figure 6(b) is a schematic diagram of the phylogenetic tree drawn after the similarity analysis of the most frequent sequences Human3, Opossum3, Rat3 and Chimpanzee3. The present invention selects four combinations of frequent patterns and obtains four different classification results, all of which conform to the evolutionary law of species.
- Figure 7 is a schematic diagram of normalized species distances, where the ordinate is the normalized distance. Figure 7 shows the Pearson correlation coefficients between the MEGA results and the results of this method and of the two comparative methods. Table 5 details the distances between humans and the other species as computed by the four methods.
- The correlation coefficient between the method of the present invention and MEGA is the highest, indicating that the method of the present invention calculates the similarity between DNA sequences more accurately.
- In Fig. 7, the curve calculated by the method of the present invention is the closest to that of MEGA, which again shows that the method of the present invention has the highest correlation with MEGA.
- The implementation system of the similarity analysis method of negative sequence patterns based on biological sequences described in any one of Embodiments 1-4, as shown in FIG. 3, includes a data preprocessing module, a frequent pattern mining module, a graphic representation module and a similarity analysis module; the data preprocessing module is used to execute step (1); the frequent pattern mining module is used to execute step (2); the graphic representation module is used to execute step (3); the similarity analysis module is used to execute step (4).
- A computer-readable storage medium, characterized in that the computer-readable storage medium stores a similarity analysis program based on negative sequence patterns of biological sequences, and when the program is executed by a processor, the steps of the similarity analysis method of negative sequence patterns based on biological sequences described in any one of Embodiments 1-4 are implemented.
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- Chemical & Material Sciences (AREA)
- Bioethics (AREA)
- Pure & Applied Mathematics (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Software Systems (AREA)
- Computational Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Public Health (AREA)
- Artificial Intelligence (AREA)
- Epidemiology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Algebra (AREA)
- General Engineering & Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Claims (7)
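- A similarity analysis method of negative sequence patterns based on biological sequences, characterized by comprising the following steps: (1) data preprocessing: the letters in the DNA sequence are represented by numbers, and the numerically represented DNA sequence is divided into several blocks with the same number of bases in each block; the resulting blocks are used as the data set for frequent pattern mining; (2) frequent pattern mining: the f-NSP algorithm is used to mine the data set to obtain the maximum frequent positive and negative sequence patterns; (3) the maximum frequent positive and negative sequence patterns are represented graphically; (4) similarity analysis of DNA sequences: the similarity of different DNA sequences is computed; the smaller the similarity value, the more similar the DNA sequences.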
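- The similarity analysis method of negative sequence patterns based on biological sequences according to claim 1, characterized in that, in step (2), the f-NSP algorithm is used to mine the data set D, comprising the following steps: A. use the GSP algorithm to obtain all positive frequent sequences, and store the bitmap corresponding to each positive frequent sequence in a hash table, including: a. scan the data set to obtain all sequence patterns of length 1 and put them into the original seed set P_1; b. take the sequence patterns of length 1 from the original seed set P_1 and join them to generate the candidate sequence set C_2 of length 2; prune C_2 using the Apriori property, then scan C_2 to determine the support of the remaining sequences; save the sequence patterns whose support is higher than the minimum support, output them as the sequence patterns L_2 of length 2, and use them as the seed set of length 2; in the same way, output the sequence patterns L_3 of length 3, the sequence patterns L_4 of length 4, ..., the sequence patterns L_{n+1} of length n+1, until no new sequence patterns can be mined; the resulting sequence patterns are all the positive frequent sequences; the minimum support is the manually set support threshold min_sup; B. generate the corresponding NSCs based on all positive frequent sequences; NSC refers to a negative candidate sequence, and positive frequent sequences are collectively referred to as positive sequences; for a k-size PSP, its NSCs are generated by changing any m non-adjacent elements into their negatives, where m is bounded by ⌈k/2⌉, the smallest integer not less than k/2; k-size means that the size of the sequence is k; NSCs refers to all negative candidate sequences; C. use bit operations to quickly calculate the support of the negative candidate sequences; the support of NSCs is calculated as follows: given an m-size and n-neg-size negative sequence ns, for […], the support of ns in the data set D is: […]; m-size means that the sequence size is m; suppose ns = <a_1 a_2 … a_m> is a negative sequence; if ns′ consists only of all positive elements in ns, then ns′ is called the maximum positive subsequence of ns, denoted MPS(ns); a sequence consisting of MPS(ns) and one negative element a of ns is called a 1-neg-size maximum subsequence, denoted 1-negMS; through frequent pattern mining, 12 maximum frequent positive and negative sequence patterns are obtained.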
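- The similarity analysis method of negative sequence patterns based on biological sequences according to claim 1, characterized in that, in step (3), graphically representing the maximum frequent positive and negative sequence patterns includes: constructing a purine-pyrimidine map in the complex plane, in which the first and second quadrants are purines, including A, G and their negative counterparts, and the third and fourth quadrants are pyrimidines, including T, C and their negative counterparts; the unit vectors of the four nucleotides A, G, T, C and their corresponding negative sequences are as shown in formulas (I) to (VIII): (b+di)→A (I); (d+bi)→G (II); (b−di)→T (III); (d−bi)→C (IV); in formulas (I) to (VIII), b and d are non-zero real numbers, A and T are conjugate, and G and C are also conjugate, that is, A, T, C, G represent base pairs that actually exist, and their negative counterparts represent base pairs that should have appeared in the DNA sequence but did not, also called missing base pairs; in formula (IX), s(0)=0, where y(j) satisfies formula (X); in formula (X), j denotes the base type at the 0th, 1st, 2nd, ..., n-th position in the sequence S, and n is the length of the DNA sequence under study; formula (X) is used to convert the 12 maximum frequent positive and negative sequence patterns into numerical sequences.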
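- The similarity analysis method of negative sequence patterns based on biological sequences according to any one of claims 1-3, characterized in that, in step (4), a distance matrix is obtained, and the distance matrix is used to represent the similarity of different DNA sequences.
- The similarity analysis method of negative sequence patterns based on biological sequences according to claim 4, characterized in that, in step (4), the distance matrix is obtained by the DTW algorithm; let the time series obtained by converting the DNA sequences be […], with lengths m and n respectively; sorted by their time positions, an m×n matrix A_{m×n} is constructed, each element of which is […]; in the matrix, a set of adjacent matrix elements is called a warping path, denoted W = w_1, w_2, ..., w_K, where the k-th element of W is w_k = (a_ij)_k; the path satisfies the following conditions: ① max{m,n} ≤ K ≤ m+n−1; ② w_1 = a_11, w_K = a_mn; in formula (XI), i = 2, 3, ..., m; j = 2, 3, ..., n, and D(m,n) is the minimum accumulated value of the warping path in A_{m×n}.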
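- An implementation system of the similarity analysis method of negative sequence patterns based on biological sequences according to any one of claims 1-5, characterized by comprising a data preprocessing module, a frequent pattern mining module, a graphic representation module and a similarity analysis module connected in sequence; the data preprocessing module is used to perform step (1); the frequent pattern mining module is used to perform step (2); the graphic representation module is used to perform step (3); the similarity analysis module is used to perform step (4).
- A computer-readable storage medium, characterized in that the computer-readable storage medium stores a similarity analysis program based on negative sequence patterns of biological sequences, and when the program is executed by a processor, the steps of the similarity analysis method of negative sequence patterns based on biological sequences according to any one of claims 1-5 are implemented.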
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020217034664A KR20220042300A (ko) | 2020-09-25 | 2020-11-12 | Similarity analysis method of negative sequential patterns based on biological sequences, implementation system and medium |
CA3129990A CA3129990A1 (en) | 2020-09-25 | 2020-11-12 | A similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium |
JP2021561803A JP7260934B2 (ja) | 2020-09-25 | 2020-11-12 | Similarity analysis method of negative sequential patterns based on biological sequences, implementation system and medium |
US17/446,176 US20220101949A1 (en) | 2020-09-25 | 2021-08-27 | Similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011022788.8A CN112182497B (zh) | 2020-09-25 | 2020-09-25 | Similarity analysis method of negative sequential patterns based on biological sequences, implementation system and medium |
CN202011022788.8 | 2020-09-25 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/446,176 Continuation US20220101949A1 (en) | 2020-09-25 | 2021-08-27 | Similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022062114A1 (zh) | 2022-03-31 |
Family ID=73943524
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/128253 WO2022062114A1 (zh) | 2020-09-25 | 2020-11-12 | Similarity analysis method of negative sequential patterns based on biological sequences, implementation system and medium |
Country Status (4)
Country | Link |
---|---|
CN (1) | CN112182497B (zh) |
AU (1) | AU2020103216A4 (zh) |
LU (1) | LU102312B1 (zh) |
WO (1) | WO2022062114A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113742396B (zh) * | 2021-08-26 | 2023-10-27 | Central China Normal University | Method and device for mining object learning behavior patterns |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
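CN101950326A (zh) * | 2010-09-10 | 2011-01-19 | Chongqing University | DNA sequence similarity detection method based on the Hurst exponent |
US20140121985A1 (en) * | 2012-07-30 | 2014-05-01 | Khalid Sayood | Classification of nucleotide sequences by latent semantic analysis |
CN104574153A (zh) * | 2015-01-19 | 2015-04-29 | Qilu University of Technology | Application of a fast negative sequence mining pattern in customer purchase behavior analysis |
CN107516020A (zh) * | 2017-08-17 | 2017-12-26 | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences | Method, apparatus, device and storage medium for determining the importance of sequence sites |
CN109146542A (zh) * | 2018-07-10 | 2019-01-04 | Qilu University of Technology | A method for mining positive and negative sequence rules |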
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
BRPI1015129A2 (pt) * | 2009-06-30 | 2016-07-12 | Dow Agrosciences Llc | Application of machine learning methods to association rule mining in data sets containing molecular genetic markers of plants and animals, followed by classification or prediction using features created from these association rules |
JP2011086252A (ja) * | 2009-10-19 | 2011-04-28 | Fujitsu Ltd | Pattern extraction program and pattern extraction method |
CN103995690B (zh) * | 2014-04-25 | 2016-08-17 | Graduate School at Shenzhen, Tsinghua University | A GPU-based parallel time series mining method |
CN107729762A (zh) * | 2017-08-31 | 2018-02-23 | Xuzhou Medical University | A DNA closed frequent motif identification method based on a differential privacy protection model |
CN109783696B (zh) * | 2018-12-03 | 2021-06-04 | Institute of Information Engineering, Chinese Academy of Sciences | A multi-pattern graph index construction method and system for weak structural correlation |
CN111581262A (zh) * | 2020-06-15 | 2020-08-25 | Hebei University of Technology | Order-preserving sequential pattern mining method |
2020
- 2020-09-25 CN CN202011022788.8A patent/CN112182497B/zh active Active
- 2020-11-04 AU AU2020103216A patent/AU2020103216A4/en not_active Ceased
- 2020-11-12 WO PCT/CN2020/128253 patent/WO2022062114A1/zh active Application Filing
- 2020-12-18 LU LU102312A patent/LU102312B1/de active IP Right Grant
Also Published As
Publication number | Publication date |
---|---|
LU102312B1 (de) | 2021-06-30 |
CN112182497B (zh) | 2021-04-27 |
CN112182497A (zh) | 2021-01-05 |
AU2020103216A4 (en) | 2021-01-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| ENP | Entry into the national phase | Ref document number: 2021561803; Country of ref document: JP; Kind code of ref document: A |
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20954958; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20954958; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 021023) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20954958; Country of ref document: EP; Kind code of ref document: A1 |