WO2022062114A1 - Similarity analysis method based on negative sequential patterns of biological sequences, implementation system, and medium - Google Patents

Similarity analysis method based on negative sequential patterns of biological sequences, implementation system, and medium

Info

Publication number
WO2022062114A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
sequences
negative
frequent
similarity
Prior art date
Application number
PCT/CN2020/128253
Other languages
English (en)
French (fr)
Inventor
董祥军
芦月
Original Assignee
齐鲁工业大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 齐鲁工业大学 (Qilu University of Technology)
Priority to KR1020217034664A priority Critical patent/KR20220042300A/ko
Priority to CA3129990A priority patent/CA3129990A1/en
Priority to JP2021561803A priority patent/JP7260934B2/ja
Priority to US17/446,176 priority patent/US20220101949A1/en
Publication of WO2022062114A1 publication Critical patent/WO2022062114A1/zh

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • The invention relates to a similarity analysis method based on negative sequential patterns of biological sequences, an implementation system, and a medium, and belongs to the technical field of applications of actionable high-utility negative sequential rules.
  • In biological sequence analysis, sequential pattern mining algorithms help identify co-occurring biological sequences and discover relationships in DNA or protein sequences, so studying missing base-pair sequences is more meaningful than mining frequent sequential patterns alone.
  • In bioinformatics research, similarity analysis of biological sequences is by no means a simple mechanical comparison; it takes many forms.
  • Many mathematical and statistical methods are needed for auxiliary analysis and evaluation.
  • In sequence similarity analysis, alignment is the most common and classic research method. Analyzing similarity at the level of biological sequences and inferring structural, functional, and evolutionary relationships is the basis of gene identification, molecular evolution, and origin-of-life research. However, two issues in sequence alignment directly affect the similarity score: the substitution matrix and the gap penalty.
  • In biological sequence analysis in particular, sequential pattern mining algorithms help identify co-occurring biological sequences and discover relationships in DNA or protein sequences. Studying missing base-pair sequences is therefore more meaningful than mining frequent sequential patterns alone.
  • sequence pattern mining algorithms help to identify co-occurring biological sequences and discover relationships in DNA or protein sequences.
  • Biological sequence data often contain a great deal of valuable biological information. For example, genes and protein fragments that occur frequently in biological sequences often carry much unknown information, and mining it is of great significance; the way some bacteria attack the human body is influenced by certain segments of their genes.
  • the present invention proposes a similarity analysis method based on the negative sequence pattern of biological sequences
  • the present invention also provides an implementation system of the above similarity analysis method.
  • A DNA sequence, also known as a gene sequence, is the primary structure of a real or hypothetical DNA molecule carrying genetic information, represented by a string of letters.
  • The f-NSP algorithm uses bitmaps to store PSP data and calculates NSC support through bit operations. It creates a bitmap for every PSP whose size is greater than 1. If a positive sequence is contained in the i-th data sequence, the i-th bit of that positive sequence's bitmap is set to 1; otherwise it is set to 0. The length of each bitmap equals the number of sequences in the database. With this bitmap storage structure, the bitwise OR operation replaces the original union operation.
  • If ns contains only one negative element, the support of ns is: sup(ns) = sup(MPS(ns)) - sup(p(ns)).
  • The f-NSP algorithm includes the following steps. 1. Find all PSPs in the sequence database using the GSP algorithm; all PSPs and their bitmaps are stored in a hash table, PSPHash. 2. Generate NSCs for each PSP using the NSC (negative candidate sequence) generation method. 3. Calculate the support of each 1-neg-size nsc using formulas (2) and (3); the support of the other nscs is calculated with formula (1). Specifically, first obtain the bitmap of each 1-negMS in 1-negMSS_nsc; second, take the union of these bitmaps with the OR operation; then calculate the support of the nsc according to formula (1); finally, decide whether the nsc is an NSP by comparing its support with min_sup. 4. Return the results and end the algorithm.
  • The GSP algorithm is a mining algorithm based on a breadth-first search strategy. It scans the database once to obtain the frequent items it contains, then generates candidate sequences of increasing length through join and pruning steps, and rescans the database to obtain the support of the candidates and identify the positive sequential patterns.
  • The GSP algorithm is a typical Apriori-like algorithm. On top of Apriori, GSP adds taxonomy hierarchies, time constraints, and sliding time windows, which improve the algorithm overall. GSP also restricts the conditions under which the data set is scanned, which reduces the number of candidate sequences to scan and the generation of useless patterns.
  • In the complex plane, the complex number z = a + bi corresponds to the point (a, b), where a is the abscissa and b is the ordinate. Points representing real numbers a lie on the x-axis, so the x-axis is also called the "real axis"; points representing pure imaginary numbers bi lie on the y-axis, so the y-axis is also called the "imaginary axis"; the only real point on the y-axis is the origin "0".
  • The purine-pyrimidine map, put simply, draws vectors on a plane so that the different base pairs of a DNA sequence are represented exactly.
  • The first and second quadrants hold the purines (A, ¬A, G, and ¬G).
  • The third and fourth quadrants hold the pyrimidines (T, ¬T, C, and ¬C).
  • Unit vectors represent the four nucleotides A, G, T, and C and their corresponding negative elements. In this way each base pair is represented uniquely, and the conjugate relationships between base pairs are preserved.
  • This purine-pyrimidine map gives a one-to-one correspondence between a DNA sequence and its time series.
  • DTW (Dynamic Time Warping) is a nonlinear warping technique that combines time alignment with distance measurement. It was first widely used in speech recognition and computes the maximum similarity, i.e. the minimum distance, between two time series.
  • a method for similarity analysis of negative sequence patterns based on biological sequences comprising the following steps:
  • The letters in the DNA sequence are represented by numbers; because DNA sequences are very long, the numerically encoded sequence is split into several blocks, each containing the same number of bases, and the resulting blocks serve as the data set for frequent pattern mining;
  • The similarity matrix can be used to evaluate the effectiveness of a DNA similarity analysis algorithm, and it indirectly reveals the evolutionary or genetic relationships between species.
  • Computing the distance between DNA sequences is the basis of DNA similarity analysis. Euclidean distance and the correlation angle are the most common distance measures. The smaller the Euclidean distance between two sequences, the more similar the DNA sequences; likewise, the smaller the correlation angle between two vectors, the more similar the DNA sequences.
  • In step (2), the data set D is mined with the f-NSP algorithm as follows:
  • Take the length-1 sequential patterns from the original seed set P1 and join them to generate the length-2 candidate set C2; prune C2 using the Apriori property, then scan the candidate set C2 to determine the support of the sequences that remain, keep the patterns whose support is higher than the minimum support, and output them as the length-2 sequential patterns L2, which serve as the length-2 seed set for generating longer candidates.
  • Continue in the same way to output the length-3 patterns L3, the length-4 patterns L4, and so on.
  • The minimum support is a user-defined support threshold, min_sup.
  • NSC refers to negative candidate sequences, and positive frequent sequences are collectively referred to as positive sequences.
  • The key step in generating NSCs is to convert non-contiguous elements of a positive pattern into their negative partners.
  • NSCs refers to all negative candidate sequences.
  • After the NSCs are generated, their support is calculated; an NSC whose support meets the minimum support is a negative frequent sequential pattern.
  • Suppose ns = <a1 a2 ... am> is a negative sequence. If ns' consists of all the positive elements of ns, then ns' is called the maximum positive subsequence of ns, denoted MPS(ns). A sequence composed of MPS(ns) and one negative element a of ns is called a 1-neg-size maximum subsequence, denoted 1-negMS.
  • Graphically representing the maximal frequent positive and negative sequential patterns includes constructing a purine-pyrimidine map in the complex plane, in which the first and second quadrants hold the purines (A, ¬A, G, and ¬G) and the third and fourth quadrants hold the pyrimidines (T, ¬T, C, and ¬C). The unit vectors of the four nucleotides A, G, T, and C and of their corresponding negative elements are given by formulas (I) to (VIII).
  • b and d are nonzero real numbers; A and T are conjugate, and G and C are also conjugate. A, T, C, and G represent base pairs that are actually present, while their negative counterparts represent base pairs that should appear in the DNA sequence but do not, also called missing base pairs;
  • j represents the base type at the 0, 1, 2, ..., nth position in the sequence S, and n is the length of the DNA sequence to be studied;
  • the time series after the transformation of 12 frequent sequence patterns can be obtained.
  • step (4) a distance matrix is obtained, and the distance matrix is used to represent the similarity of different DNA sequences.
  • D(m,n) is the minimum accumulated value of a warping path in the m×n matrix A.
  • The implementation system of the above similarity analysis method includes a data preprocessing module, a frequent pattern mining module, a graphical representation module, and a similarity analysis module connected in sequence; the data preprocessing module performs step (1); the frequent pattern mining module performs step (2); the graphical representation module performs step (3); the similarity analysis module performs step (4).
  • A computer-readable storage medium stores a similarity analysis program based on negative sequential patterns of biological sequences; when the program is executed by a processor, the steps of any of the similarity analysis methods based on negative sequential patterns of biological sequences are implemented.
  • the present invention can effectively express and analyze negative sequences, and by selecting different maximum frequent pattern combinations, different analysis results can be obtained.
  • the present invention selects the frequent mode for similarity analysis, which greatly saves the memory and time consumption of the computer.
  • Fig. 1 is the flow chart of the similarity analysis method based on the negative sequence pattern of biological sequence of the present invention
  • Fig. 2 is the schematic diagram of purine pyrimidine diagram of the present invention
  • Fig. 3 is the structural block diagram of the realization system of the similarity analysis method based on the negative sequence pattern of biological sequence of the present invention
  • FIG. 4 is a schematic diagram of a bit-or (OR) operation process in an embodiment
  • Figure 5(a) is a schematic diagram of the phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human1, Opossum2, Rat2, and Chimpanzee2;
  • Figure 5(b) is a schematic diagram of the phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human2, Opossum1, Rat2, and Chimpanzee1;
  • Figure 6(a) is a schematic diagram of the phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human2, Opossum2, Rat2, and Chimpanzee1;
  • Figure 6(b) is a schematic diagram of the phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human3, Opossum3, Rat3, and Chimpanzee3;
  • Figure 7 is a schematic diagram of normalized species distances.
  • a method for similarity analysis of negative sequence patterns based on biological sequences includes the following steps:
  • Each sequence is first split into several blocks, each consisting of the same number of consecutive bases.
  • The blocks are independent of one another, and the block size can be varied in practice. Note that if the last block is smaller than the specified block size, it is discarded.
  • Consider an example of block splitting with two sequences, S1 and S2. With a block size of 15, the two sequences are split into 2 and 3 blocks respectively, and the final block of size 3 is discarded. Each block is marked with curved and straight lines. This step, called sequence blocking, is important and brings two main advantages. First, fine-grained information about the sequence is captured, including position and ordering information. Second, even for long sequences, blocking reduces the memory and time consumed by sequence processing.
  • The selected data set comes from the first exon of the β-globin gene of four species, as shown in Table 1:
  • a bitwise OR operation is explained using FIG. 4 .
  • A sequence s is called a frequent (positive) sequential pattern if sup(s) ≥ min_sup, and an infrequent sequential pattern if sup(s) < min_sup.
  • b and d are nonzero real numbers; A and T are conjugate, and G and C are also conjugate. A, T, C, and G represent base pairs that are actually present, while their negative counterparts represent base pairs that should have appeared in the DNA sequence but did not, also called missing base pairs; the unit vectors of A, G, T, C and their corresponding negative elements are shown in Figure 2.
  • step (4) a distance matrix is obtained by the DTW algorithm, and the distance matrix is used to represent the similarity of different DNA sequences.
  • Figure 5(a) is a schematic diagram of the phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human1, Opossum2, Rat2, and Chimpanzee2; Figure 5(b) is the corresponding tree for Human2, Opossum1, Rat2, and Chimpanzee1; Figure 6(a) is the corresponding tree for Human2, Opossum2, Rat2, and Chimpanzee1; Figure 6(b) is the corresponding tree for Human3, Opossum3, Rat3, and Chimpanzee3. The present invention selects four combinations of frequent patterns and obtains four different classification results, all of which are consistent with the evolutionary relationships of the species.
  • Figure 7 is a schematic diagram of normalized species distances; the ordinate is the normalized distance. Figure 7 shows the Pearson correlation coefficients between the MEGA results and the results of this method and of the two comparison methods. Table 5 details the distances between humans and the other species under the four methods.
  • the correlation coefficient between the method of the present invention and MEGA is the highest, indicating that the method of the present invention can more accurately calculate the similarity between DNA sequences.
  • In Fig. 7, the curve calculated by the method of the present invention is closest to that of MEGA, which again shows that the method of the present invention has the highest correlation with MEGA.
  • The implementation system of the similarity analysis method based on negative sequential patterns of biological sequences described in any one of Embodiments 1-4, as shown in FIG. 3, includes a data preprocessing module, a frequent pattern mining module, a graphical representation module, and a similarity analysis module; the data preprocessing module executes step (1); the frequent pattern mining module executes step (2); the graphical representation module executes step (3); the similarity analysis module executes step (4).
  • a computer-readable storage medium characterized in that the computer-readable storage medium stores a similarity analysis program based on the negative sequence pattern of biological sequences, and when the similarity analysis program based on the negative sequence pattern of biological sequences is executed by a processor , to implement the steps of the method for similarity analysis based on the negative sequence pattern of biological sequences described in any one of Embodiments 1-4.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Algebra (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

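A similarity analysis method based on negative sequential patterns of biological sequences, an implementation system, and a medium, comprising: (1) data preprocessing: representing the letters in a DNA sequence with numbers and splitting the sequence into several blocks, which serve as the data set for frequent pattern mining; (2) frequent pattern mining: mining the data set with the f-NSP algorithm; (3) graphically representing the maximal frequent positive and negative sequential patterns and converting them into numerical sequences; (4) similarity analysis of DNA sequences: computing the similarity value between different DNA sequences and selecting the DNA sequence with the smallest value as the DNA sequence to be studied. The method can effectively represent and analyze negative sequences, yields different analysis results depending on which combination of maximal frequent patterns is selected, and greatly reduces memory and time consumption.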

Description

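Similarity analysis method based on negative sequential patterns of biological sequences, implementation system, and medium
Technical Field
The present invention relates to a similarity analysis method based on negative sequential patterns of biological sequences, an implementation system, and a medium, and belongs to the technical field of applications of actionable high-utility negative sequential rules.
Background
In recent years, massive amounts of biological sequence data have been obtained. As DNA and protein sequencing technology advances, demand has grown for data analysis tools that can interpret the information contained in biological sequence data, especially the genetic and regulatory information in DNA sequences and the relationship between protein sequence structure and function, and sequence similarity analysis has come into wide use. Whenever a new DNA sequence is obtained, similarity analysis is used to show that it resembles some known sequences; if it is homologous to a known sequence, much of the time and effort needed to determine the function of the new sequence is saved, which matters all the more because biological sequences are so large. In biological sequence analysis, sequential pattern mining algorithms help identify co-occurring biological sequences and discover relationships in DNA or protein sequences, so studying missing base-pair sequences is more meaningful than mining frequent sequential patterns alone. In bioinformatics research, similarity analysis of biological sequences is by no means a simple mechanical comparison; it takes many forms and requires many mathematical and statistical methods for auxiliary analysis and evaluation. In sequence similarity analysis, alignment is the most common and classic approach. Analyzing similarity at the level of biological sequences and inferring structural, functional, and evolutionary relationships is the basis of gene identification, molecular evolution, and origin-of-life research. However, two issues directly affect the similarity score in sequence alignment: the substitution matrix and the gap penalty. Crude alignment methods describe the relationship between two bases only as identical or different. Similarity analysis of biological sequences is used to extract the information stored in protein sequences, and many mathematical schemes have been proposed for it. A graphical representation of a biological sequence can expose the information content of any sequence and help biologists choose another, more complex theoretical or experimental method. Graphical representation provides not only a visual, qualitative inspection of genetic data but also a mathematical description through objects such as matrices. Most of these mathematical schemes are based on 2-D and 3-D representations.
In sequential pattern mining, positive sequential pattern (PSP) mining considers only events (behaviors) that have occurred. Unlike this traditional approach, negative sequential pattern (NSP) mining also considers events (behaviors) that have not occurred, that is, items absent from the sequence, and can thus provide more comprehensive information for decision making. Examples include the varying effects of campus conditions on students' study and life, insured persons suspected of medical fraud erasing unfavorable drug purchase records, and missing gene fragments that may induce latent diseases. Such non-occurring events are easily overlooked, and they are therefore attracting growing attention from data mining practitioners. In biological sequence analysis in particular, sequential pattern mining algorithms help identify co-occurring biological sequences and discover relationships in DNA or protein sequences, so studying missing base-pair sequences is more meaningful than mining frequent sequential patterns alone. Biological data analysis, or biological data mining, faces several important problems, such as finding co-occurring biological sequences, classifying biological sequences effectively, and clustering biological sequences, and sequential pattern mining algorithms help identify co-occurring biological sequences and discover relationships in DNA or protein sequences. Biological sequence data often contain a great deal of valuable biological information. For example, genes and protein fragments that occur frequently in biological sequences often carry much unknown information, and mining it is of great significance; the way some bacteria attack the human body is influenced by certain segments of their genes; and extreme expansion of some variable-number tandem repeats may cause related neurological diseases. In addition, discovering frequent patterns in DNA sequences is an effective way to explain biological inheritance, since these patterns often indicate possible trends in the data implicit in biological sequences and serve as markers associated with certain events. Mining frequent patterns in biological sequences such as proteins or DNA is therefore of great value.
Existing similarity analysis methods mainly target PSPs; for mined NSPs there is still no unified similarity measure. Sequence alignment also has shortcomings that have prompted a search for other ways to compare DNA sequence similarity. NSPs are unavoidable in biological data and are even crucial to some disease-causing genes. A method is therefore needed for similarity analysis of DNA with missing base sequences.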
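Summary of the Invention
In view of the deficiencies of the prior art, the present invention provides a similarity analysis method based on negative sequential patterns of biological sequences;
the present invention also provides an implementation system for the above similarity analysis method.
To analyze the similarity of DNA sequences effectively, the following key problems should be addressed: (1) how to represent the primary DNA sequence effectively as a numerical sequence; (2) how to obtain and select suitable descriptors that can be regarded as features of the DNA sequence and characterize it from the numerical sequence; (3) how to handle DNA sequences of different lengths effectively and consistently; (4) how to perform effective similarity analysis on negative sequences.
Explanation of terms:
1. DNA sequence: also called a gene sequence, the primary structure of a real or hypothetical DNA molecule carrying genetic information, represented by a string of letters.
2. f-NSP algorithm: f-NSP uses bitmaps to store PSP data and calculates NSC support through bit operations. It creates a bitmap for every PSP whose size is greater than 1. If a positive sequence is contained in the i-th data sequence, the i-th bit of that positive sequence's bitmap is set to 1; otherwise it is set to 0. The length of each bitmap equals the number of sequences in the database. With this bitmap storage structure, the bitwise OR operation replaces the original union operation. Let s be a positive sequence, let B(s) denote its bitmap, and let N(B(s)) denote the number of 1 bits in the bitmap. Then, for an m-size and n-neg-size negative sequence ns, the support is: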
Figure PCTCN2020128253-appb-000001
If ns contains only one negative element, the support of ns is:
sup(ns) = sup(MPS(ns)) - sup(p(ns))   (2)
In particular, for a single-element negative sequence
Figure PCTCN2020128253-appb-000002
Figure PCTCN2020128253-appb-000003
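The f-NSP algorithm includes the following steps. 1. Find all PSPs in the sequence database using the GSP algorithm; all PSPs and their bitmaps are stored in a hash table, PSPHash. 2. Generate NSCs for each PSP using the NSC (negative candidate sequence) generation method. 3. Calculate the support of each 1-neg-size nsc using formulas (2) and (3); the support of the other nscs is calculated with formula (1). Specifically, first obtain the bitmap of each 1-negMS in 1-negMSS_nsc; second, take the union of these bitmaps with the OR operation; then calculate the support of the nsc according to formula (1); finally, decide whether the nsc is an NSP by comparing its support with min_sup. 4. Return the results and end the algorithm.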
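The bitmap-based support calculation in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: formula (1) appears only as an image here, so the sketch assumes it has the form sup(ns) = sup(MPS(ns)) - N(OR of the bitmaps of the positive partners of every 1-negMS), which is how the steps above describe it. Bitmaps are stored in Python integers, and the helper names (`contains`, `bitmap`, `negative_support`) and the toy database are hypothetical.

```python
def contains(data_seq, pattern):
    """True if `pattern` occurs in `data_seq` as an ordered, possibly gapped, subsequence."""
    it = iter(data_seq)
    return all(symbol in it for symbol in pattern)


def bitmap(db, pattern):
    """Bit i is 1 when the i-th data sequence contains the positive pattern."""
    bits = 0
    for i, data_seq in enumerate(db):
        if contains(data_seq, pattern):
            bits |= 1 << i
    return bits


def popcount(bits):
    """N(B(s)): number of 1 bits in a bitmap."""
    return bin(bits).count("1")


def negative_support(db, ns):
    """Support of a negative sequence given as (symbol, is_negative) pairs.

    Assumed form of formula (1):
    sup(ns) = sup(MPS(ns)) - N(OR_i B(p(1-negMS_i))).
    """
    mps = [sym for sym, neg in ns if not neg]
    union = 0
    for idx, (_, neg) in enumerate(ns):
        if neg:
            # Positive partner of the 1-negMS: MPS(ns) plus this one
            # negative element, with the negation removed.
            partner = [sym for j, (sym, n) in enumerate(ns) if not n or j == idx]
            union |= bitmap(db, partner)  # bitwise OR replaces set union
    return popcount(bitmap(db, mps)) - popcount(union)


db = ["ATCG", "ACG", "AG", "TTC"]  # toy database, one block per entry
ns = [("A", False), ("T", True), ("G", False)]  # the negative sequence <A ¬T G>
print(negative_support(db, ns))
```

In the toy database, three blocks contain <A G> and one of them also contains <A T G>, so the sketch reports a support of 2 for <A ¬T G>.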
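3. GSP algorithm: GSP is a mining algorithm based on a breadth-first search strategy. It scans the database once to obtain the frequent items it contains, then generates candidate sequences of increasing length through join and pruning steps, and rescans the database to obtain the support of the candidates and identify the positive sequential patterns. GSP is a typical Apriori-like algorithm. On top of Apriori, GSP adds taxonomy hierarchies, time constraints, and sliding time windows, which improve the algorithm overall. GSP also restricts the conditions under which the data set is scanned, which reduces the number of candidate sequences to scan and the generation of useless patterns.
4. Complex plane: the complex number z = a + bi corresponds to the point (a, b), where a is the abscissa and b is the ordinate in the complex plane. Points representing real numbers a lie on the x-axis, so the x-axis is also called the "real axis"; points representing pure imaginary numbers bi lie on the y-axis, so the y-axis is also called the "imaginary axis"; the only real point on the y-axis is the origin "0".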
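5. Purine-pyrimidine map: put simply, vectors are drawn on a plane so that the different base pairs of a DNA sequence are represented exactly. Here a purine-pyrimidine map is constructed in the complex plane: the first and second quadrants hold the purines (A, ¬A, G, and ¬G), and the third and fourth quadrants hold the pyrimidines (T, ¬T, C, and ¬C). Unit vectors represent the four nucleotides A, G, T, and C and their corresponding negative elements. In this way each base pair is represented uniquely, and the conjugate relationships between base pairs are preserved. This purine-pyrimidine map gives a one-to-one correspondence between a DNA sequence and its time series.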
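6. DTW (Dynamic Time Warping): a nonlinear warping technique that combines time alignment with distance measurement. It was first widely used in speech recognition and computes the maximum similarity, i.e. the minimum distance, between two time series.
7. Apriori property: every non-empty subset of a frequent itemset must also be frequent.
The technical solution of the present invention is as follows:
A similarity analysis method based on negative sequential patterns of biological sequences, comprising the following steps:
(1) Data preprocessing
Every sequence or genome to be processed is preprocessed before frequent pattern mining. The letters in the DNA sequence are represented by numbers; because DNA sequences are very long, the numerically encoded sequence is split into several blocks, each containing the same number of bases, and the resulting blocks serve as the data set for frequent pattern mining;
(2) Frequent pattern mining
The data set is mined with the f-NSP algorithm to obtain the maximal frequent positive and negative sequential patterns;
(3) Graphical representation of the maximal frequent positive and negative sequential patterns
(4) Similarity analysis of DNA sequences
The similarity between different DNA sequences is computed as a distance; the smaller the value, the more similar the DNA sequences.
The similarity matrix can be used to evaluate the effectiveness of a DNA similarity analysis algorithm, and it indirectly reveals the evolutionary or genetic relationships between species. Computing the distance between DNA sequences is the basis of DNA similarity analysis; Euclidean distance and the correlation angle are the most common distance measures. The smaller the Euclidean distance between two sequences, the more similar the DNA sequences; likewise, the smaller the correlation angle between two vectors, the more similar the DNA sequences.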
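Preferably according to the present invention, in step (2), the data set D is mined with the f-NSP algorithm as follows:
A. Use the GSP algorithm to obtain all positive frequent sequences, and store the bitmap of each positive frequent sequence in a hash table; this includes:
a. Scan the data set to obtain all sequential patterns of length 1 and put them into the original seed set P1;
b. Take the length-1 sequential patterns from the original seed set P1 and join them to generate the length-2 candidate set C2; prune C2 using the Apriori property, then scan the candidate set C2 to determine the support of the sequences that remain, keep the patterns whose support is higher than the minimum support, and output them as the length-2 sequential patterns L2, which serve as the length-2 seed set for generating longer candidates. Continue in the same way to output the length-3 patterns L3, the length-4 patterns L4, ..., the length-(n+1) patterns Ln+1, until no new sequential pattern can be mined. The resulting patterns are all the positive frequent sequences. The minimum support is a user-defined support threshold, min_sup. The process can be described as:
L1 → C2 → L2 → C3 → L3 → C4 → L4 → ...; stop when Ln+1 cannot be generated.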
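The level-wise chain L1 → C2 → L2 → ... can be sketched as follows. This is a simplified illustration, not the full GSP algorithm: it assumes every element is a single base, counts support as the number of blocks that contain a pattern as an ordered subsequence, treats min_sup as an absolute count that a pattern must reach, and omits GSP's taxonomy, time-constraint, and sliding-window features. The function names and the toy data are hypothetical.

```python
def is_subsequence(pattern, block):
    """True if `pattern` occurs in `block` in order, gaps allowed."""
    it = iter(block)
    return all(base in it for base in pattern)


def support(pattern, db):
    """Number of blocks in the data set that contain the pattern."""
    return sum(is_subsequence(pattern, block) for block in db)


def mine_positive_patterns(db, min_sup):
    """Level-wise mining: join, Apriori pruning, then support counting."""
    items = sorted({base for block in db for base in block})
    level = [x for x in items if support(x, db) >= min_sup]  # L1
    frequent = list(level)
    while level:
        previous = set(level)
        # Join: two length-k patterns overlap on k-1 elements.
        candidates = {a + b[-1] for a in level for b in level if a[1:] == b[:-1]}
        # Prune (Apriori property): every subsequence obtained by deleting
        # one element must itself be frequent.
        candidates = [
            c for c in sorted(candidates)
            if all(c[:i] + c[i + 1:] in previous for i in range(len(c)))
        ]
        level = [c for c in candidates if support(c, db) >= min_sup]
        frequent.extend(level)
    return frequent


blocks = ["ACG", "ACT", "AG"]  # toy data set D
print(mine_positive_patterns(blocks, min_sup=2))
```

With the toy blocks and a threshold of 2, the length-1 patterns are A, C, and G, the length-2 patterns are AC and AG, and no length-3 candidate can be joined, so the loop stops there.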
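B. Generate the corresponding NSCs from all positive frequent sequences;
NSC denotes a negative candidate sequence, and positive frequent sequences are collectively called positive sequences. To generate all non-redundant NSCs from the positive sequences, the key step is to convert non-contiguous elements of a positive pattern into their negative partners. For a k-size PSP, the NSCs are generated by changing any m non-adjacent elements into their negatives, denoted by
Figure PCTCN2020128253-appb-000008
where
Figure PCTCN2020128253-appb-000009
Figure PCTCN2020128253-appb-000010
is the smallest integer not less than k/2, that is, ⌈k/2⌉; k-size means that the size of the sequence is k; for example, the sequence S = {A T T C C} has size 5 (5-size). NSCs refers to all negative candidate sequences.
For example, the NSCs of <A T C C> include: (1) for m = 1,
Figure PCTCN2020128253-appb-000011
(2) for m = 2,
Figure PCTCN2020128253-appb-000012
Two consecutive negative items are not allowed.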
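The generation rule above (negate any m non-adjacent elements, for m from 1 up to ⌈k/2⌉) can be sketched as follows. The function name `generate_nscs` and the use of a "¬" prefix to mark negative elements are illustrative assumptions, since the original notation survives only as images.

```python
from itertools import combinations
from math import ceil


def generate_nscs(psp):
    """Return every negative candidate obtained by negating m non-adjacent elements."""
    k = len(psp)
    candidates = []
    for m in range(1, ceil(k / 2) + 1):
        for positions in combinations(range(k), m):
            # Skip any choice that would put two negative elements side by side.
            if any(b - a == 1 for a, b in zip(positions, positions[1:])):
                continue
            candidates.append(
                tuple("¬" + e if i in positions else e for i, e in enumerate(psp))
            )
    return candidates


for nsc in generate_nscs("ATCC"):
    print("<" + " ".join(nsc) + ">")
```

For the 4-size pattern <A T C C> this produces four candidates with one negative element and three with two, seven in total.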
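C. Use bit operations to calculate the support of the negative candidate sequences quickly;
After the NSCs are generated, their support is calculated; an NSC whose support meets the minimum support is a negative frequent sequential pattern. The support of an NSC is calculated as follows: given an m-size and n-neg-size negative sequence ns, for
Figure PCTCN2020128253-appb-000013
the support of ns in the data set D is:
Figure PCTCN2020128253-appb-000014
m-size means that the size of the sequence is m. Suppose ns = <a1 a2 ... am> is a negative sequence. If ns′ consists of all the positive elements of ns, then ns′ is called the maximum positive subsequence of ns, denoted MPS(ns); for example,
Figure PCTCN2020128253-appb-000015
A sequence composed of MPS(ns) and one negative element a of ns is called a 1-neg-size maximum subsequence, denoted 1-negMS. For example,
Figure PCTCN2020128253-appb-000016
then its 1-negMSs are
Figure PCTCN2020128253-appb-000017
Figure PCTCN2020128253-appb-000018
Frequent pattern mining yields 12 maximal frequent positive and negative sequential patterns;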
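Preferably according to the present invention, in step (3), graphically representing the maximal frequent positive and negative sequential patterns includes constructing a purine-pyrimidine map in the complex plane, in which the first and second quadrants hold the purines, namely A, ¬A, G, and ¬G, and the third and fourth quadrants hold the pyrimidines, namely T, ¬T, C, and ¬C. The unit vectors of the four nucleotides A, G, T, and C and of their corresponding negative elements ¬A, ¬G, ¬T, and ¬C are given by formulas (I) to (VIII):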
b + di → A   (I)
d + bi → G   (II)
b - di → T   (III)
d - bi → C   (IV)
Figure PCTCN2020128253-appb-000025
Figure PCTCN2020128253-appb-000026
Figure PCTCN2020128253-appb-000027
Figure PCTCN2020128253-appb-000028
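In formulas (I) to (VIII), b and d are nonzero real numbers,
Figure PCTCN2020128253-appb-000029
A and T are conjugate, and G and C are also conjugate, that is,
Figure PCTCN2020128253-appb-000030
A, T, C, and G represent base pairs that are actually present, while
Figure PCTCN2020128253-appb-000031
represent base pairs that should have appeared in the DNA sequence but did not, also called missing base pairs; the vectors above are the unit vectors of A, G, T, C and their corresponding negative elements;
With this representation, a DNA base sequence
Figure PCTCN2020128253-appb-000032
is reduced to a numerical sequence s(n), as shown in formula (IX):
Figure PCTCN2020128253-appb-000033
In formula (IX), s(0) = 0, and y(j) satisfies formula (X):
Figure PCTCN2020128253-appb-000034
In formula (X), j runs over positions 0, 1, 2, ..., n of the sequence S, y(j) is determined by the base type at position j, and n is the length of the DNA sequence under study;
Through the above steps, the time series of the original DNA sequence is obtained uniquely from the purine-pyrimidine map;
Formula (X) is used to convert the 12 maximal frequent positive and negative sequential patterns into numerical sequences. For example, for the sequence Human1, formulas (IX) and (X) give the complex sequence s(H1) = {0.866+0.5i, 1.366-0.366i, 2.2321+0.134i, 3.0981+0.634i, 3.5981+1.5i, 4.4641+2i}, and the time series formed by the moduli is S(H1) = {1.0000, 1.4142, 2.2361, 3.1623, 3.8982, 4.8916}. In this way the time series of all 12 frequent sequential patterns are obtained.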
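The conversion from a pattern to its time series can be sketched as a cumulative sum of unit vectors followed by taking moduli. The sketch below rests on several assumptions that are inferred rather than stated in the text: b = √3/2 and d = 1/2 (the values that reproduce the Human1 numbers), the base string ACAAGA for Human1 (read off from the successive differences of s(H1)), and a plain running sum for formula (IX). Only the four positive bases of formulas (I) to (IV) are covered, because the vectors for the negative elements are available only as images.

```python
import math

B = math.sqrt(3) / 2  # assumed value of b
D = 0.5               # assumed value of d

# Unit vectors from formulas (I) to (IV); A/T and G/C are complex conjugates.
UNIT_VECTOR = {
    "A": complex(B, D),
    "G": complex(D, B),
    "T": complex(B, -D),
    "C": complex(D, -B),
}


def to_complex_sequence(pattern):
    """Running sum of unit vectors, starting from s(0) = 0."""
    total = 0j
    points = []
    for base in pattern:
        total += UNIT_VECTOR[base]
        points.append(total)
    return points


def to_time_series(pattern):
    """Moduli of the running sums, used as the time series of the pattern."""
    return [abs(z) for z in to_complex_sequence(pattern)]


human1 = "ACAAGA"  # inferred from the worked example, not given in the text
print([round(v, 4) for v in to_time_series(human1)])
```

Under these assumptions the moduli agree with S(H1) quoted above to four decimal places.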
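Preferably according to the present invention, in step (4), a distance matrix is computed; the distance matrix represents the similarity between different DNA sequences.
Preferably according to the present invention, in step (4), the distance matrix is computed with the DTW algorithm. Let the time series obtained by converting the DNA sequences be
Figure PCTCN2020128253-appb-000035
with lengths m and n respectively. Ordering them by time position, an m×n matrix A(m×n) is constructed, in which each element is
Figure PCTCN2020128253-appb-000036
In this matrix, a set of adjacent matrix elements is called a warping path, written W = w1, w2, ..., wK, where the k-th element of W is wk = (aij)k. The path satisfies the following conditions:
① max{m, n} ≤ K ≤ m + n - 1;
② w1 = a11, wK = amn;
③ for wk = aij and wk-1 = ai'j', 0 ≤ i - i' ≤ 1 and 0 ≤ j - j' ≤ 1 must hold; then
Figure PCTCN2020128253-appb-000037
The DTW algorithm uses dynamic programming to find the optimal path with the minimum warping cost, as shown in formula (XI):
Figure PCTCN2020128253-appb-000038
where i = 2, 3, ..., m and j = 2, 3, ..., n. D(m, n) is the minimum accumulated value of a warping path in A(m×n).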
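The dynamic-programming search for D(m, n) can be sketched as follows. This is the textbook DTW recurrence, offered as an assumption about formula (XI) because that formula is available only as an image; the local cost |x_i - y_j| is likewise assumed, and the function names are hypothetical.

```python
def dtw_distance(x, y):
    """Minimum accumulated cost of a warping path between two time series."""
    m, n = len(x), len(y)
    inf = float("inf")
    # acc[i][j] is the best accumulated cost of aligning x[:i] with y[:j].
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])  # assumed local distance
            acc[i][j] = cost + min(
                acc[i - 1][j],      # step down
                acc[i][j - 1],      # step right
                acc[i - 1][j - 1],  # diagonal step
            )
    return acc[m][n]


def distance_matrix(series):
    """Pairwise DTW distances; smaller entries mean more similar sequences."""
    return [[dtw_distance(a, b) for b in series] for a in series]


print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

Because the three allowed steps match condition ③ and the path is anchored at both corners as in condition ②, series of different lengths can be compared directly, which is the reason DTW suits patterns of unequal length.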
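The implementation system of the above similarity analysis method comprises a data preprocessing module, a frequent pattern mining module, a graphical representation module, and a similarity analysis module connected in sequence; the data preprocessing module performs step (1); the frequent pattern mining module performs step (2); the graphical representation module performs step (3); the similarity analysis module performs step (4).
A computer-readable storage medium, characterized in that the computer-readable storage medium stores a similarity analysis program based on negative sequential patterns of biological sequences which, when executed by a processor, implements the steps of any of the similarity analysis methods based on negative sequential patterns of biological sequences described above.
The beneficial effects of the present invention are:
1. The present invention can effectively represent and analyze negative sequences, and different analysis results can be obtained by selecting different combinations of maximal frequent patterns.
2. The present invention performs similarity analysis on frequent patterns, which greatly reduces memory and time consumption.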
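Brief Description of the Drawings
Fig. 1 is a flow chart of the similarity analysis method of negative sequential patterns based on biological sequences according to the invention;
Fig. 2 is a schematic diagram of the purine-pyrimidine graph of the invention;
Fig. 3 is a structural block diagram of the implementation system of the similarity analysis method of negative sequential patterns based on biological sequences according to the invention;
Fig. 4 is a schematic diagram of the bitwise OR operation in the embodiments;
Fig. 5(a) is a schematic phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human1, Opossum2, Rat2 and Chimpanzee2;
Fig. 5(b) is a schematic phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human2, Opossum1, Rat2 and Chimpanzee1;
Fig. 6(a) is a schematic phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human2, Opossum2, Rat2 and Chimpanzee1;
Fig. 6(b) is a schematic phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human3, Opossum3, Rat3 and Chimpanzee3;
Fig. 7 is a schematic diagram of the normalized species distances.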
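Detailed Description
The invention is further described below with reference to the drawings and embodiments, but is not limited thereto.
Embodiment 1
A similarity analysis method of negative sequential patterns based on biological sequences, as shown in Fig. 1, comprising the following steps:
(1) Data preprocessing
Every sequence or genome to be processed is preprocessed before frequent pattern mining. The letters in the DNA sequence are represented by numbers; because DNA sequences are very long, the numerically represented DNA sequence is divided into several blocks, each containing the same number of bases, and the resulting blocks serve as the data set for frequent pattern mining;
In the invention, each sequence is first divided into several blocks, each consisting of the same number of consecutive bases. The blocks are independent of each other, and the block size can be changed in practice. Note that if the last block is smaller than the specified block size, it is discarded. For clarity, an example of block splitting follows. In this example there are two sequences, S1 and S2. With a block size of 15, the two sequences are divided into 2 and 3 blocks respectively, and the last block, of size 3, is discarded. Each block is marked with curved and straight lines. This step, also called sequence blocking, is important and brings two main advantages. First, fine-grained information about the sequence can be captured, including positional and ordering information. Second, even for long sequences, blocking reduces the memory and time consumed by sequence processing.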
Figure PCTCN2020128253-appb-000039
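The preprocessing step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the digit assigned to each base (`BASE_TO_DIGIT`) and the input string are hypothetical, since the text does not state which numbers are used.

```python
# Hypothetical numeric encoding; the source does not specify the digits.
BASE_TO_DIGIT = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(sequence):
    """Replace each base letter with its (assumed) numeric code."""
    return [BASE_TO_DIGIT[base] for base in sequence.upper()]


def split_into_blocks(encoded, block_size):
    """Cut a sequence into consecutive, non-overlapping blocks of equal size.

    A trailing block shorter than block_size is discarded, as described above.
    """
    full_blocks = len(encoded) // block_size
    return [encoded[i * block_size:(i + 1) * block_size]
            for i in range(full_blocks)]


# Illustrative 33-base string: two blocks of 15, the last 3 bases are dropped.
blocks = split_into_blocks(encode("ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCC"), 15)
```

With a block size of 15, the 33-element input yields two blocks and the remaining three elements are dropped, mirroring the discarded size-3 block in the example.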
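Few DNA sequences are currently available for sequence similarity studies, and finding more suitable DNA sequences remains a problem. The three exon sequences of the β-globin gene from 15 species are the most commonly used DNA sequences. These three gene sequences comprise the first, second and third exons, with average lengths of 92, 222 and 114 bases respectively. Among them, the first exon of the β-globin gene of 11 different species is the most widely used DNA sequence data.
The selected data set comes from the first exon of the β-globin gene of four species, as shown in Table 1:
Table 1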
Figure PCTCN2020128253-appb-000040
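(2) Frequent pattern mining
The f-NSP algorithm is used to mine the data set to obtain the maximal frequent positive and negative sequential patterns;
(3) Graphical representation of the maximal frequent positive and negative sequential patterns
(4) Similarity analysis of DNA sequences
The similarity between different DNA sequences is computed; the smaller the similarity value (a distance), the more similar the DNA sequences.
The similarity matrix can be used to evaluate the effectiveness of a DNA similarity analysis algorithm. It can indirectly reveal evolutionary or genetic relationships between species. Computing the distance between DNA sequences is the basis of DNA similarity analysis; the Euclidean distance and the correlation angle are the most commonly used distance measures. The smaller the Euclidean distance between two sequences, the more similar the DNA sequences; likewise, the smaller the correlation angle between two vectors, the more similar the DNA sequences.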
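The two distance measures named above can be written down directly. The sketch below assumes both sequences have already been converted to numeric vectors of equal length; the function names are illustrative.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_angle(u, v):
    """Angle in radians between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)
```

For both measures a smaller value means more similar inputs. Neither handles vectors of different lengths, which is one reason a method such as DTW is useful for patterns of unequal length.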
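Embodiment 2
The similarity analysis method of negative sequential patterns based on biological sequences according to Embodiment 1, differing in that:
In step (2), the f-NSP algorithm is used to mine the data set, denoted D, comprising the following steps:
A. Use the GSP algorithm to obtain all positive frequent sequences, and store the bitmap corresponding to each positive frequent sequence in a hash table; comprising:
a. Scan the data set to obtain all sequential patterns of length 1 and put them into the original seed set P1;
b. Take the length-1 sequential patterns from the original seed set P1 and join them to generate the length-2 candidate sequence set C2; prune C2 using the Apriori property, then determine by scanning the support of the sequences remaining in C2; keep the sequential patterns whose support is higher than the minimum support and output them as the length-2 sequential patterns L2, which also serve as the length-2 seed set used to generate candidates of increasing length. Continue in this way to output the length-3 sequential patterns L3, the length-4 sequential patterns L4, ..., the length-(n+1) sequential patterns Ln+1, until no new sequential pattern can be mined; the resulting sequential patterns are all the positive frequent sequences. The minimum support is a user-set support threshold min_sup. The process can be summarized as:
L1→C2→L2→C3→L3→C4→L4 ...; stop when Ln+1 cannot be generated.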
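The level-wise loop L1→C2→L2→... can be illustrated with a deliberately simplified sketch. It is not GSP itself: patterns are treated as contiguous substrings of each block, support is the fraction of blocks containing the pattern, and blocks are plain strings. All three choices are assumptions made to keep the example short.

```python
def support(pattern, blocks):
    """Fraction of blocks that contain the pattern as a contiguous substring."""
    return sum(pattern in block for block in blocks) / len(blocks)


def mine_frequent_patterns(blocks, min_sup):
    """Level-wise mining: L1 -> C2 -> L2 -> ... until no new level is found."""
    alphabet = sorted({base for block in blocks for base in block})
    level = [p for p in alphabet if support(p, blocks) >= min_sup]  # L1
    frequent = list(level)
    while level:
        # Join step: p and q combine when p without its first item equals
        # q without its last item. For contiguous patterns this also covers
        # the Apriori prune, since both length-(k-1) sub-patterns are frequent.
        candidates = {p + q[-1] for p in level for q in level
                      if p[1:] == q[:-1]}
        # Count step: keep only the candidates that reach the threshold.
        level = sorted(c for c in candidates if support(c, blocks) >= min_sup)
        frequent.extend(level)
    return frequent


patterns = mine_frequent_patterns(["ACGT", "ACGA", "TCGA"], min_sup=0.6)
```

The loop ends as soon as a level comes back empty, which corresponds to the stopping rule "Ln+1 cannot be generated".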
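Fig. 4 is used to illustrate the bitwise OR operation. A sequence s is called a frequent (positive) sequential pattern if sup(s)≥min_sup, and an infrequent sequential pattern if sup(s)<min_sup. Suppose a positive frequent sequence is <G C T A> and sup(<C A>)=5; then, according to the negative candidate generation method, one negative candidate sequence ns is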
Figure PCTCN2020128253-appb-000041
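Correspondingly, MPS(ns)=<C A>, P(1-negMS1)=<G C A> and P(1-negMS2)=<C T A>. Suppose B(<G C A>)=|1|0|0|1|0| and B(<C T A>)=|1|1|0|1|0|. The bitmap of B(<G C A>) OR B(<C T A>) is then as shown in Fig. 4. N(unionbitmap)=4 is therefore readily obtained, and formula 1 then gives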
Figure PCTCN2020128253-appb-000042
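B. Generate the corresponding NSCs from all positive frequent sequences;
NSC stands for negative sequential candidate, and the positive frequent sequences are collectively called positive sequences. To generate all non-redundant NSCs from the positive sequences, the key step is to convert non-adjacent elements of a positive pattern into their negative counterparts. For a k-size PSP (positive sequential pattern), its NSCs are generated by changing any m non-adjacent elements into their negatives, denoted by
Figure PCTCN2020128253-appb-000043
where
Figure PCTCN2020128253-appb-000044
Figure PCTCN2020128253-appb-000045
is the smallest integer not less than k/2; k-size means that the size of the sequence is k; for example, the sequence S={A T T C C} has size 5. NSCs refers to all the negative sequential candidates.
For example, the NSCs of <A T C C> include: (1) when m=1,
Figure PCTCN2020128253-appb-000046
(2) when m=2,
Figure PCTCN2020128253-appb-000047
Two consecutive negative items are not allowed.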
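The candidate generation rule, negate any m non-adjacent elements for m from 1 up to the smallest integer not less than k/2, can be sketched as below. Writing a negative item with a "¬" prefix and representing a pattern as a tuple of strings are illustrative choices, not notation taken from the source.

```python
from itertools import combinations
from math import ceil


def generate_nscs(positive_pattern):
    """Generate negative sequential candidates from one positive pattern.

    Any m pairwise non-adjacent positions are negated, m = 1..ceil(k/2),
    so that no two negative items are ever consecutive.
    """
    k = len(positive_pattern)
    candidates = []
    for m in range(1, ceil(k / 2) + 1):
        for positions in combinations(range(k), m):
            # Skip position sets that contain two neighbouring indices.
            if any(b - a == 1 for a, b in zip(positions, positions[1:])):
                continue
            candidates.append(tuple(
                "¬" + item if i in positions else item
                for i, item in enumerate(positive_pattern)))
    return candidates


nscs = generate_nscs(("A", "T", "C", "C"))
```

For the 4-size pattern <A T C C> this produces four candidates with one negative item and three with two, seven in total.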
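C. Quickly compute the support of the negative sequential candidates using bit operations;
After the NSCs are generated, their supports are computed; a candidate whose support meets the minimum support is a negative frequent sequential pattern. The support of an NSC is computed as follows: given an m-size, n-neg-size negative sequence ns, for
Figure PCTCN2020128253-appb-000048
the support of ns in data set D is:
Figure PCTCN2020128253-appb-000049
m-size means that the sequence size is m. Suppose ns=<a1 a2 ... am> is a negative sequence; if ns′ consists only of all the positive elements of ns, ns′ is called the maximum positive subsequence of ns, defined as MPS(ns); for example,
Figure PCTCN2020128253-appb-000050
A sequence consisting of MPS(ns) and one negative element a of ns is called a 1-neg-size maximum subsequence, defined as 1-negMS. For example,
Figure PCTCN2020128253-appb-000051
then its 1-negMS are
Figure PCTCN2020128253-appb-000052
Figure PCTCN2020128253-appb-000053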
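The support formula itself is only an image reference here. The sketch below assumes it has the form sup(ns) = sup(MPS(ns)) - N(unionbitmap), where the union bitmap is the bitwise OR of the bitmaps of the positive partners P(1-negMS_i); that reading follows the description of the OR operation but is not confirmed by the text. The bitmaps in the usage line are made-up values, stored as Python integers.

```python
def popcount(bitmap):
    """Number of bits set to 1 in an integer bitmap."""
    return bin(bitmap).count("1")


def nsc_support(sup_mps, partner_bitmaps):
    """Support of a negative candidate under the assumed formula.

    sup_mps: support of MPS(ns), the maximum positive subsequence.
    partner_bitmaps: one bitmap per P(1-negMS_i); a set bit marks a data
    sequence that supports MPS(ns) and also contains that positive partner.
    """
    union = 0
    for bitmap in partner_bitmaps:
        union |= bitmap  # bitwise OR accumulates the union bitmap
    return sup_mps - popcount(union)


# Hypothetical values: MPS supported by 6 sequences, two partner bitmaps.
example_support = nsc_support(6, [0b100100, 0b110010])
```

Because only OR and a bit count are needed, no rescan of the data set is required once the positive bitmaps are stored, which is the speed-up the step refers to.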
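Through frequent pattern mining, 12 maximal frequent positive and negative sequential patterns are obtained;
Maximal frequent sequential patterns: given a DNA sequence S, which is a base sequence, S=<s1 s2 ... sn>, where each si (1≤i≤n) is a character from the character set Ω={A, T, C, G}. If the support of a pattern <sk sk+1 ... sm> (1≤k≤m≤n) is not less than the minimum support, the pattern is a frequent sequence. A maximal frequent pattern is a pattern none of whose super-sequences is frequent. With min_sup=0.3, a number of maximal frequent sequential patterns are obtained. Twelve of them are selected as the data set for sequential pattern analysis. These 12 frequent sequential patterns are listed in Table 2:
Table 2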
Figure PCTCN2020128253-appb-000054
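The definition of a maximal frequent pattern (no frequent super-sequence) translates into a simple filter over the set of frequent patterns. The sketch assumes positive patterns stored as strings and contiguous containment, matching the <sk sk+1 ... sm> form of the definition; the input list is made up.

```python
def maximal_patterns(frequent):
    """Keep only the patterns that are not contained in a longer frequent one."""
    return [p for p in frequent
            if not any(p != q and p in q for q in frequent)]


frequent = ["A", "C", "G", "T", "AC", "CG", "GA", "ACG", "CGA"]
maximal = maximal_patterns(frequent)
```

Here "T" survives because no longer frequent pattern contains it, while "GA" is dropped because it is part of "CGA".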
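Embodiment 3
The similarity analysis method of negative sequential patterns based on biological sequences according to Embodiment 1, differing in that: in step (3), graphically representing the maximal frequent positive and negative sequential patterns comprises: constructing a purine-pyrimidine graph on the complex plane, in which the first and second quadrants hold the purines, including A,
Figure PCTCN2020128253-appb-000055
G and
Figure PCTCN2020128253-appb-000056
and the third and fourth quadrants hold the pyrimidines, including T,
Figure PCTCN2020128253-appb-000057
C and
Figure PCTCN2020128253-appb-000058
The unit vectors of the four nucleotides A, G, T, C and of their corresponding negative sequences
Figure PCTCN2020128253-appb-000063
are given by formulas (Ⅰ) to (Ⅷ):
(b+di)→A(Ⅰ)
(d+bi)→G(Ⅱ)
(b-di)→T(Ⅲ)
(d-bi)→C(Ⅳ)
Figure PCTCN2020128253-appb-000059
Figure PCTCN2020128253-appb-000060
Figure PCTCN2020128253-appb-000061
Figure PCTCN2020128253-appb-000062
In formulas (Ⅰ) to (Ⅷ), b and d are non-zero real numbers,
Figure PCTCN2020128253-appb-000064
A and T are conjugate, and G and C are also conjugate, that is,
Figure PCTCN2020128253-appb-000065
A, T, C and G represent bases that are actually present, while
Figure PCTCN2020128253-appb-000066
represent bases that should have appeared in the DNA sequence but did not, also called missing bases; together these are the unit vectors of A, G, T, C and of their corresponding negative sequences, as shown in Fig. 2.
With this representation, a DNA base sequence
Figure PCTCN2020128253-appb-000067
is reduced to a numeric sequence s(n), as shown in formula (Ⅸ):
Figure PCTCN2020128253-appb-000068
In formula (Ⅸ), s(0)=0, and y(j) satisfies formula (Ⅹ):
Figure PCTCN2020128253-appb-000069
In formula (Ⅹ), j denotes the base type at positions 0, 1, 2, ..., n of sequence S, and n is the length of the DNA sequence under study;
Through the above steps, the time series of the original DNA sequence is uniquely obtained from the purine-pyrimidine graph;
Formula (Ⅹ) is used to convert the 12 maximal frequent positive and negative sequential patterns into numeric sequences. For example, for the sequence Human1, formulas (Ⅸ)-(Ⅹ) give the complex sequence s(H1)={0.866+0.5i, 1.366-0.366i, 2.2321+0.134i, 3.0981+0.634i, 3.5981+1.5i, 4.4641+2i}, and the time series formed by the moduli is S(H1)={1.0000, 1.4142, 2.2361, 3.1623, 3.8982, 4.8916}. The time series of all 12 frequent sequential patterns are obtained in the same way.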
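The conversion from a pattern to a time series can be checked numerically against the Human1 example. Two things in the sketch are inferred rather than stated in the text: the values b=√3/2 and d=1/2, and the base string "ACAAGA", both of which were worked backwards from the successive differences of s(H1). Negative items are left out because formulas (Ⅴ) to (Ⅷ) are available only as image references.

```python
import math

# Assumed values, chosen because they reproduce the Human1 numbers.
B = math.sqrt(3) / 2
D = 0.5

# Unit vectors of the positive bases, following formulas (I) to (IV).
UNIT = {
    "A": complex(B, D),
    "G": complex(D, B),
    "T": complex(B, -D),
    "C": complex(D, -B),
}


def to_time_series(pattern):
    """Cumulative sum of unit vectors, reported as moduli (s(0) = 0)."""
    s = 0j
    series = []
    for base in pattern:
        s += UNIT[base]
        series.append(abs(s))
    return series


# "ACAAGA" is inferred from the differences of s(H1), not quoted from Table 2.
human1 = [round(value, 4) for value in to_time_series("ACAAGA")]
```

Rounded to four decimals, the moduli agree with the S(H1) values listed above, which supports the cumulative-sum reading of formula (Ⅸ).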
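Embodiment 4
The similarity analysis method of negative sequential patterns based on biological sequences according to Embodiment 1, differing in that:
In step (4), the distance matrix is computed with the DTW (dynamic time warping) algorithm; the distance matrix represents the similarity between different DNA sequences.
Let the time series obtained by converting the DNA sequences be
Figure PCTCN2020128253-appb-000070
with lengths m and n respectively; ordering them by time position, an m×n matrix A_m×n is constructed, each element of which is
Figure PCTCN2020128253-appb-000071
In the matrix, a set of adjacent matrix elements is called a warping path, denoted W=w_1,w_2,...,w_K, where the k-th element of W is w_k=(a_ij)_k. The path satisfies the following conditions:
① max{m,n}≤K≤m+n-1;
② w_1=a_11, w_K=a_mn;
③ for w_k=a_ij and w_k-1=a_i'j', 0≤i-i'≤1 and 0≤j-j'≤1 must hold; then
Figure PCTCN2020128253-appb-000072
The DTW algorithm uses dynamic programming to find an optimal path with the minimum warping cost, as shown in formula (Ⅺ):
Figure PCTCN2020128253-appb-000073
where i=2,3,...,m and j=2,3,...,n. D(m,n) is the minimum cumulative value of a warping path in A_m×n.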
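A compact dynamic-programming sketch of the DTW distance follows. It uses the textbook recurrence and takes the local cost a_ij to be the absolute difference |x_i - y_j|; both are assumptions, because the element definition and formula (Ⅺ) are only image references in this text.

```python
def dtw_distance(x, y):
    """Minimum cumulative warping cost D(m, n) between two time series."""
    m, n = len(x), len(y)
    inf = float("inf")
    # acc[i][j] holds D(i, j); row 0 and column 0 are padding.
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])  # assumed local cost a_ij
            acc[i][j] = cost + min(acc[i - 1][j - 1],  # diagonal move
                                   acc[i - 1][j],      # step in x only
                                   acc[i][j - 1])      # step in y only
    return acc[m][n]


def distance_matrix(series_list):
    """Pairwise DTW distances between all time series in the list."""
    return [[dtw_distance(a, b) for b in series_list] for a in series_list]
```

Because the path may repeat an element of either series, the two inputs do not need the same length, which suits frequent patterns of different sizes.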
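By performing DTW distance measurement on the time series converted from the 12 frequent sequences, the distance matrices among the 8 PSPs and among the 4 NSPs are obtained, as shown in Table 3 and Table 4 respectively:
Table 3
Figure PCTCN2020128253-appb-000074
Table 4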
Figure PCTCN2020128253-appb-000075
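Human and Chimpanzee are primates, Rat is a rodent, and Opossum is a metatherian. The overall pattern of the results of the method agrees with this classification, so the proposed method is effective and feasible. The method works for both short and long sequences. Because the data used are the frequent patterns obtained by mining, the sequences being compared are generally shorter while retaining the characteristics of the original sequences, so the computation is very simple and computer memory consumption is reduced. Comparing the similarity among the four species shows that different pattern combinations yield different results, which may be useful under different considerations.
Several maximal frequent sequences are selected at random; their distance matrices are shown in Tables 3 and 4. If the similarities of the different data groups listed in Tables 3 and 4 can be clustered reasonably, phylogenetic trees are constructed using the method of the invention. MEGA5 (Molecular Evolutionary Genetics Analysis) is user-friendly software for building sequence alignments and phylogenetic trees. A phylogenetic tree is a tree-like branching diagram that summarizes the genetic or evolutionary relationships among organisms. Fig. 5(a) is a schematic phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human1, Opossum2, Rat2 and Chimpanzee2; Fig. 5(b) is a schematic phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human2, Opossum1, Rat2 and Chimpanzee1; Fig. 6(a) is a schematic phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human2, Opossum2, Rat2 and Chimpanzee1; Fig. 6(b) is a schematic phylogenetic tree drawn after similarity analysis of the maximal frequent sequences Human3, Opossum3, Rat3 and Chimpanzee3. The invention selects four combinations of frequent patterns and obtains four different classification results, all of which are consistent with the evolution of the species.
The data are normalized so that the results of the invention can be compared with those of other methods. Fig. 7 is a schematic diagram of the normalized species distances, where the vertical axis is the normalized distance. Fig. 7 shows the Pearson correlation coefficients between the results of the present method, the two comparison methods and MEGA. Table 5 details, for the four methods, the distances between the other species and human.
Table 5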
Figure PCTCN2020128253-appb-000076
Figure PCTCN2020128253-appb-000077
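In Table 5, the values in parentheses are the true distances normalized to between 0 and 1. For Ref. [1], see Zhiyi Mo, Wen Zhu, Yi Sun, Qilin Xiang, Ming Zheng, Min Chen, Zejun Li. One novel representation of DNA sequence based on the global and local position information [J]. Scientific Reports, 2018, 8(1). For Ref. [2], see Yu Hong-Jie, Huang De-Shuang. Graphical representation for DNA sequences via joint diagonalization of matrix pencil [J]. IEEE Journal of Biomedical & Health Informatics, 2013, 17(3): 503-511. The Pearson correlation coefficients between the results of the present method and those of the two comparison methods were computed.
It can be seen that the present method has the highest correlation coefficient with MEGA, indicating that it computes the similarity between DNA sequences more accurately. In addition, Fig. 7 shows that the curve of the present method is closer to the curve computed by MEGA, which again indicates that the present method has the highest correlation with MEGA.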
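The comparison rests on two standard operations: scaling distances into the range 0 to 1 and computing a Pearson correlation coefficient. The sketch below assumes min-max scaling, which the text does not specify, and the numbers in the usage lines are made up, not taken from Table 5.

```python
import math


def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Hypothetical distances from human to three other species, two methods.
method_a = min_max_normalize([2.0, 4.0, 6.0])
method_b = min_max_normalize([10.0, 25.0, 30.0])
r = pearson(method_a, method_b)
```

A coefficient close to 1 means the two methods rank and space the species in nearly the same way, which is the sense in which agreement with MEGA is assessed.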
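The comparison shows that this method can effectively represent and analyze negative sequences, and that different combinations of maximal frequent patterns yield different analysis results. Because similarity analysis is performed on frequent patterns, computer memory and time consumption are greatly reduced. The method also has the highest correlation with MEGA.
Embodiment 5
An implementation system for the similarity analysis method of negative sequential patterns based on biological sequences according to any one of Embodiments 1-4, as shown in Fig. 3, comprising a data preprocessing module, a frequent pattern mining module, a graphical representation module and a similarity analysis module connected in sequence; the data preprocessing module is configured to perform step (1); the frequent pattern mining module is configured to perform step (2); the graphical representation module is configured to perform step (3); the similarity analysis module is configured to perform step (4).
Embodiment 6
A computer-readable storage medium, characterized in that the computer-readable storage medium stores a similarity analysis program for negative sequential patterns based on biological sequences, and when the program is executed by a processor, the steps of the similarity analysis method of negative sequential patterns based on biological sequences according to any one of Embodiments 1-4 are implemented.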

Claims (7)

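  1. A similarity analysis method of negative sequential patterns based on biological sequences, characterized by comprising the following steps:
    (1) Data preprocessing
    The letters in a DNA sequence are represented by numbers; the numerically represented DNA sequence is divided into several blocks, each containing the same number of bases, and the resulting blocks serve as the data set for frequent pattern mining;
    (2) Frequent pattern mining
    The f-NSP algorithm is used to mine the data set to obtain the maximal frequent positive and negative sequential patterns;
    (3) Graphical representation of the maximal frequent positive and negative sequential patterns
    (4) Similarity analysis of DNA sequences
    The similarity between different DNA sequences is computed; the smaller the similarity value, the more similar the DNA sequences.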
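  2. The similarity analysis method of negative sequential patterns based on biological sequences according to claim 1, characterized in that, in step (2), the f-NSP algorithm is used to mine the data set, denoted D, comprising the following steps:
    A. Use the GSP algorithm to obtain all positive frequent sequences, and store the bitmap corresponding to each positive frequent sequence in a hash table; comprising:
    a. Scan the data set to obtain all sequential patterns of length 1 and put them into the original seed set P1;
    b. Take the length-1 sequential patterns from the original seed set P1 and join them to generate the length-2 candidate sequence set C2; prune C2 using the Apriori property, then determine by scanning the support of the sequences remaining in C2; keep the sequential patterns whose support is higher than the minimum support and output them as the length-2 sequential patterns L2, which also serve as the length-2 seed set; continue in this way to output the length-3 sequential patterns L3, the length-4 sequential patterns L4, ..., the length-(n+1) sequential patterns Ln+1, until no new sequential pattern can be mined; the resulting sequential patterns are all the positive frequent sequences; the minimum support is a user-set support threshold min_sup;
    B. Generate the corresponding NSCs from all positive frequent sequences;
    NSC stands for negative sequential candidate, and the positive frequent sequences are collectively called positive sequences; for a k-size PSP, its NSCs are generated by changing any m non-adjacent elements into their negatives, denoted by
    Figure PCTCN2020128253-appb-100001
    where
    Figure PCTCN2020128253-appb-100002
    is the smallest integer not less than k/2; k-size means that the size of the sequence is k; NSCs refers to all the negative sequential candidates;
    C. Quickly compute the support of the negative sequential candidates using bit operations;
    The support of an NSC is computed as follows: given an m-size, n-neg-size negative sequence ns, for
    Figure PCTCN2020128253-appb-100003
    the support of ns in data set D is:
    Figure PCTCN2020128253-appb-100004
    m-size means that the sequence size is m; suppose ns=<a1 a2 ... am> is a negative sequence; if ns′ consists only of all the positive elements of ns, ns′ is called the maximum positive subsequence of ns, defined as MPS(ns); a sequence consisting of MPS(ns) and one negative element a of ns is called a 1-neg-size maximum subsequence, defined as 1-negMS;
    Through frequent pattern mining, 12 maximal frequent positive and negative sequential patterns are obtained.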
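  3. The similarity analysis method of negative sequential patterns based on biological sequences according to claim 1, characterized in that, in step (3), graphically representing the maximal frequent positive and negative sequential patterns comprises: constructing a purine-pyrimidine graph on the complex plane, in which the first and second quadrants hold the purines, including A,
    Figure PCTCN2020128253-appb-100005
    G and
    Figure PCTCN2020128253-appb-100006
    and the third and fourth quadrants hold the pyrimidines, including T,
    Figure PCTCN2020128253-appb-100007
    C and
    Figure PCTCN2020128253-appb-100008
    The unit vectors of the four nucleotides A, G, T, C and of their corresponding negative sequences
    Figure PCTCN2020128253-appb-100009
    are given by formulas (Ⅰ) to (Ⅷ):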
    (b+di)→A(Ⅰ)
    (d+bi)→G(Ⅱ)
    (b-di)→T(Ⅲ)
    (d-bi)→C(Ⅳ)
    Figure PCTCN2020128253-appb-100010
    Figure PCTCN2020128253-appb-100011
    Figure PCTCN2020128253-appb-100012
    Figure PCTCN2020128253-appb-100013
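    In formulas (Ⅰ) to (Ⅷ), b and d are non-zero real numbers,
    Figure PCTCN2020128253-appb-100014
    A and T are conjugate, and G and C are also conjugate, that is,
    Figure PCTCN2020128253-appb-100015
    A, T, C and G represent bases that are actually present, while
    Figure PCTCN2020128253-appb-100016
    represent bases that should have appeared in the DNA sequence but did not, also called missing bases; together these are the unit vectors of A, G, T, C and of their corresponding negative sequences;
    With this representation, a DNA base sequence
    Figure PCTCN2020128253-appb-100017
    is reduced to a numeric sequence s(n), as shown in formula (Ⅸ):
    Figure PCTCN2020128253-appb-100018
    In formula (Ⅸ), s(0)=0, and y(j) satisfies formula (Ⅹ):
    Figure PCTCN2020128253-appb-100019
    In formula (Ⅹ), j denotes the base type at positions 0, 1, 2, ..., n of sequence S, and n is the length of the DNA sequence under study;
    Formula (Ⅹ) is used to convert the 12 maximal frequent positive and negative sequential patterns into numeric sequences.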
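  4. The similarity analysis method of negative sequential patterns based on biological sequences according to any one of claims 1-3, characterized in that, in step (4), a distance matrix is computed; the distance matrix represents the similarity between different DNA sequences.
  5. The similarity analysis method of negative sequential patterns based on biological sequences according to claim 4, characterized in that, in step (4), the distance matrix is computed with the DTW algorithm; let the time series obtained by converting the DNA sequences be
    Figure PCTCN2020128253-appb-100020
    with lengths m and n respectively; ordering them by time position, an m×n matrix A_m×n is constructed, each element of which is
    Figure PCTCN2020128253-appb-100021
    In the matrix, a set of adjacent matrix elements is called a warping path, denoted W=w_1,w_2,...,w_K, where the k-th element of W is w_k=(a_ij)_k; the path satisfies the following conditions:
    ① max{m,n}≤K≤m+n-1;
    ② w_1=a_11, w_K=a_mn;
    ③ for w_k=a_ij and w_k-1=a_i'j', 0≤i-i'≤1 and 0≤j-j'≤1 must hold; then
    Figure PCTCN2020128253-appb-100022
    The DTW algorithm uses dynamic programming to find an optimal path with the minimum warping cost, as shown in formula (Ⅺ):
    Figure PCTCN2020128253-appb-100023
    In formula (Ⅺ), i=2,3,...,m and j=2,3,...,n; D(m,n) is the minimum cumulative value of a warping path in A_m×n.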
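  6. An implementation system for the similarity analysis method of negative sequential patterns based on biological sequences according to any one of claims 1-5, characterized by comprising a data preprocessing module, a frequent pattern mining module, a graphical representation module and a similarity analysis module connected in sequence; the data preprocessing module is configured to perform step (1); the frequent pattern mining module is configured to perform step (2); the graphical representation module is configured to perform step (3); the similarity analysis module is configured to perform step (4).
  7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a similarity analysis program for negative sequential patterns based on biological sequences, and when the program is executed by a processor, the steps of the similarity analysis method of negative sequential patterns based on biological sequences according to any one of claims 1-5 are implemented.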
PCT/CN2020/128253 2020-09-25 2020-11-12 Similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium WO2022062114A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR1020217034664A KR20220042300A (ko) 2020-09-25 2020-11-12 Similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium
CA3129990A CA3129990A1 (en) 2020-09-25 2020-11-12 A similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium
JP2021561803A JP7260934B2 (ja) 2020-09-25 2020-11-12 Similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium
US17/446,176 US20220101949A1 (en) 2020-09-25 2021-08-27 Similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011022788.8A CN112182497B (zh) 2020-09-25 2020-09-25 Similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium
CN202011022788.8 2020-09-25

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/446,176 Continuation US20220101949A1 (en) 2020-09-25 2021-08-27 Similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium

Publications (1)

Publication Number Publication Date
WO2022062114A1 true WO2022062114A1 (zh) 2022-03-31

Family

ID=73943524

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/128253 WO2022062114A1 (zh) 2020-09-25 2020-11-12 Similarity analysis method of negative sequential patterns based on biological sequences and its implementation system and medium

Country Status (4)

Country Link
CN (1) CN112182497B (zh)
AU (1) AU2020103216A4 (zh)
LU (1) LU102312B1 (zh)
WO (1) WO2022062114A1 (zh)

Also Published As

Publication number Publication date
LU102312B1 (de) 2021-06-30
CN112182497B (zh) 2021-04-27
CN112182497A (zh) 2021-01-05
AU2020103216A4 (en) 2021-01-14

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021561803

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20954958

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20954958

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 021023)
