CN116741265A - Machine learning-based nanopore protein sequencing data processing method and application thereof - Google Patents
Abstract
The invention belongs to the field of nanopore protein sequencing and discloses a machine learning-based method for processing nanopore protein sequencing data, and an application thereof. The method processes and analyzes nanopore sequencing data with two components: a clustering algorithm based on dynamic time warping (DTW) and K-Means, and a classification algorithm based on a convolutional neural network and a recurrent neural network. Together these solve two problems in experimental data from nanopore de novo protein sequencing: determining the direction in which an unlabeled molecule translocates through the nanopore, and identifying which of several different molecules produced a given translocation event.
Description
Technical Field
The invention belongs to the field of nanopore protein sequencing, and particularly discloses a machine learning-based nanopore protein sequencing data processing method and application thereof.
Background
Protein analysis provides key information about most biological processes, cell phenotypes, and diseases. Studies have shown that many diseases are caused by protein dysfunction, and that the primary structure of a protein, i.e. the order of its amino acids (AA), determines its higher-order structure and thus ultimately its function. De novo protein sequencing is therefore one of the core technologies for the further development of proteomics.
Nanopore sensors offer single-molecule sensitivity, long read lengths, and high throughput. Besides their successful application in DNA sequencing (for example, the MinION from Oxford Nanopore Technologies in the UK and the QNome-3841 from Qitan Technology in China), they therefore have great potential in protein analysis and protein sequencing. Studies have shown that the primary structure of a protein can be read using solid-state nanopores (G. Timp et al. 2016, Nature Nanotechnology 11, 968-976), that 13 of the 20 proteinogenic amino acids can be distinguished with an aerolysin nanopore from the characteristics of their current blocking signals, and that the residual current level is closely related to the volume of the amino acid being detected (A. Oukhaled et al. 2020, Nature Biotechnology 38, 176-181). A 2021 study showed that nanopores can re-read an individual peptide repeatedly and distinguish amino acid substitutions with single-residue resolution (C. Dekker et al. 2021, Science 374(6574), 1509-1513). When a single protein molecule translocates across the membrane through a nanopore, its amino acid sequence information is hidden in the fluctuations of the corresponding blocking current. With a sound and efficient signal processing method, the primary structure of the protein can therefore be read from the subtle pattern of residual-current-level fluctuations in the blocking current signal.
The essence of protein sequencing is to read the primary structure of the protein, i.e. to determine the amino acid sequence of the peptide chain. Compared with gene sequencing, it faces three difficulties. First, protein sequencing has no counterpart to a reference genome, so data analysis lacks a reference benchmark. Second, the amino acids that make up the peptide chain are small, only about one tenth the size of a nucleotide, and the signal is subject to heavy noise interference, so the signal-to-noise ratio is low and the accuracy and reliability of the data suffer. Third, there are 20 or more kinds of amino acids, far more than the 4 bases that make up a DNA strand, and this greater sequence diversity makes signal complexity grow exponentially, which greatly increases the difficulty of signal processing and analysis. As a result, existing data processing and analysis methods for nanopore DNA sequencing are difficult to apply to protein sequencing. Traditional proteomic analysis methods, such as mass spectrometry, have difficulty distinguishing different analytes at single-amino-acid resolution, whereas proteomics research currently requires a protein sequencing technology with single-residue specificity. Notably, nanopore single-molecule sequencing has succeeded in genomics and has begun to advance into proteomics. However, the problem of determining the directionality of unlabeled protein molecules translocating through a nanopore remains unaddressed: there is as yet no way to determine accurately whether each polypeptide chain, once captured by the electric field, enters the pore amino-terminus first or carboxy-terminus first. Resolving this is a precondition for accurately reading the primary structure of a protein with a nanopore.
Disclosure of Invention
To solve the above problems, the invention discloses a machine learning-based nanopore protein sequencing data processing method and an application thereof. The method determines the direction in which an unlabeled single protein molecule translocates across the membrane through the pore, improves the calculation of the consensus current trace, and identifies different molecules during translocation.
To achieve the above purpose, the invention adopts the following technical scheme:
A machine learning-based nanopore protein sequencing data processing method comprises the following steps:
1) Determining the directionality of unlabeled molecules as they translocate through the nanopore, using a clustering algorithm based on dynamic time warping (DTW) and K-Means;
2) Determining the identities of different molecules during translocation, using a classification algorithm based on a convolutional neural network and a recurrent neural network;
DTW is a nonlinear time-warping technique for comparing the similarity of two time series; it finds the optimal alignment path between the two series by matching their time points;
K-Means is an unsupervised learning algorithm that measures the similarity between samples by Euclidean distance and iteratively assigns highly similar samples to the same cluster;
the unlabeled molecule is a linear protein (polypeptide) molecule.
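The DTW comparison described above can be illustrated with a minimal pure-Python sketch. It is not the patent's implementation: the function name `dtw_distance`, the squared-difference local cost, and the optional band half-width `window` (a Sakoe-Chiba style constraint) are assumptions made for illustration.

```python
import math


def dtw_distance(a, b, window=None):
    """DTW distance between two numeric sequences.

    `window` is a hypothetical parameter: the half-width of a band around
    the diagonal that limits how far the alignment may stretch in time.
    None means unconstrained warping.
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, or no path reaches the corner.
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            local = (a[i - 1] - b[j - 1]) ** 2  # assumed local cost
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return math.sqrt(cost[n][m])


# A time-stretched copy of a series is at distance zero under DTW.
stretched = dtw_distance([0, 1, 2, 3], [0, 1, 1, 2, 3])
# With window=0 and equal lengths, DTW reduces to the Euclidean distance.
rigid = dtw_distance([0, 1, 0], [1, 0, 0], window=0)
free = dtw_distance([0, 1, 0], [1, 0, 0])
```

Allowing the alignment to stretch along the time axis is what lets blocking events with different translocation speeds be compared; the rigid case shows what is lost without it.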
Further, in the above machine learning-based nanopore protein sequencing data processing method, step 1) comprises the following steps:
first, the DTW distance is used to measure the similarity between time series, and a distance matrix is constructed for the corresponding data set;
next, following the K-Means clustering principle, the number of clusters K is set to 2; the linear amino acid crystallographic volume model corresponding to the primary structure of the protein under test, and the same sequence flipped along the time axis, are taken as the two initial cluster centers; the distance between the time-domain current sequence of each blocking event and each of the two cluster centers is calculated, and the event is assigned to the closer center;
after all events have been assigned, the mean of all events in each of the two clusters is calculated and used as the updated cluster center; this process is repeated, iteratively updating the cluster centers and improving the accuracy of the clustering result, until the cluster centers no longer change;
finally, the clustering result is analyzed, the calculation of the consensus current trace is refined, and the clustering result is evaluated with similarity indexes including but not limited to the Pearson correlation coefficient (PCC).
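The clustering loop in these steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's code: the names `cluster_by_direction` and `max_iter` are hypothetical, events are assumed to have equal length so that the cluster mean can be taken point by point, and the DTW here is unconstrained.

```python
import math


def dtw_distance(a, b):
    """Unconstrained DTW distance with a squared-difference local cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return math.sqrt(cost[n][m])


def cluster_by_direction(events, model, max_iter=50):
    """K-Means with K=2, DTW distance, and model-based initial centers.

    Cluster 0 starts from the volume model, cluster 1 from the model
    flipped along the time axis. Both models also join the data set.
    """
    forward, backward = list(model), list(reversed(model))
    data = [list(e) for e in events] + [forward, backward]
    centers = [forward, backward]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min((0, 1), key=lambda k: dtw_distance(seq, centers[k]))
            for seq in data
        ]
        # Unchanged assignments imply unchanged centers: stop iterating.
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [s for s, lab in zip(data, labels) if lab == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels[: len(events)], centers


model = [0, 1, 2, 3, 4, 5]
events = [[0, 1, 2, 3, 4, 5.2], [0.1, 1, 2, 3, 4, 5], [5, 4, 3, 2, 1, 0.1]]
labels, centers = cluster_by_direction(events, model)
```

Seeding the two centers with the model and its time-reversed copy gives each cluster a physical meaning from the first iteration, which random K-Means initialization would not.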
Further, in the above machine learning-based nanopore protein sequencing data processing method, step 1) comprises the following steps:
a. acquiring time series data of the blocking current generated when unlabeled molecules translocate through the nanopore;
b. preprocessing the data: current blocking events of unequal length are uniformly interpolated to 500 points using the MATLAB built-in function interp1, and Z-Score normalized to eliminate differences in data scale;
c. processing the data set with a K-Means clustering algorithm: according to the data characteristics, the number of clusters K is set to 2; the DTW algorithm, which can stretch along the time axis, is selected as the similarity measure between current time series; the two initial cluster centers are customized to be the volume model corresponding to the primary structure of the unlabeled molecule and the same volume model flipped along the time axis; and both volume models are added to the data set so that they take part in the iteration;
d. during clustering, the distance between each event and each of the two cluster centers is calculated and the event is assigned to the nearest center; the mean of each cluster is then calculated from the new assignment and replaces the original cluster center; this process is repeated until the cluster centers are no longer updated, at which point iteration stops and the current cluster centers and the assignment of each event in the data set are output;
e. once all events in the data set have been assigned to two clusters, with the original and flipped volume models assigned to different clusters, the PCC between the two cluster centers is calculated, and the correctness of the clustering result is judged by checking whether the two cluster centers show at least a medium-strength correlation, i.e. |PCC| ≥ 0.3, both before and after flipping. (For example, if the PCC between the two cluster centers is -0.6, the two are negatively correlated; if, after one cluster center is flipped, the PCC between them is 0.58, the two are positively correlated. In general, an absolute PCC value of at least 0.3 indicates a medium-strength correlation between two variables, so in this case the clustering result can be preliminarily judged correct.)
f. determining the directionality of each event from the clustering result, averaging a number of typical events to obtain the consensus current trace of the unlabeled molecule, and calculating the PCC between the consensus current trace and the volume model of the unlabeled molecule;
g. selecting a suitable DTW warping window size W such that, after the consensus current trace is aligned to the amino acid volume model by DTW, the PCC between the two is highest.
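Steps b and e above can be sketched in pure Python as follows. This is a hedged illustration rather than the patent's MATLAB code: `resample` performs linear interpolation (the default behaviour of interp1), the Z-Score here uses the population standard deviation (an assumption, since the text does not say which variant is used), and the function names are hypothetical.

```python
import math


def resample(signal, n_points=500):
    """Linearly interpolate a sequence onto n_points evenly spaced positions."""
    if len(signal) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    step = (len(signal) - 1) / (n_points - 1)
    out = []
    for i in range(n_points):
        x = i * step
        lo = min(int(x), len(signal) - 2)  # left neighbour, clamped at the end
        frac = x - lo
        out.append(signal[lo] * (1 - frac) + signal[lo + 1] * frac)
    return out


def z_score(signal):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((v - mean) ** 2 for v in signal) / len(signal))
    if std == 0:
        raise ValueError("a constant signal cannot be z-scored")
    return [(v - mean) / std for v in signal]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def flip_check(center_1, center_2, threshold=0.3):
    """PCC between two cluster centers before and after flipping the second."""
    before = pearson(center_1, center_2)
    after = pearson(center_1, list(reversed(center_2)))
    return before, after, abs(before) >= threshold and abs(after) >= threshold


event = z_score(resample([0.0, 10.0, 4.0, 6.0], n_points=500))
before, after, plausible = flip_check([0, 1, 2, 3], [3, 2, 1, 0.5])
```

Resampling every event to the same length is what makes the pointwise cluster mean in step d well defined, and the Z-Score removes amplitude offsets between events before distances are compared.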
Further, in the above machine learning-based nanopore protein sequencing data processing method, the unlabeled molecule is the β-amyloid peptide Native Aβ1-42.
Furthermore, in the above machine learning-based nanopore protein sequencing data processing method, the unlabeled molecule is Scrambled Aβ1-42.
Further, in the above machine learning-based nanopore protein sequencing data processing method, step 2) uses an encoder-decoder framework. The encoder first applies several convolution operations to the input current signals of different classes with a convolutional neural network to extract their spatial features, and then uses a recurrent neural network to learn their temporal features and capture the temporal correlations in the sequence;
the decoder uses a multi-layer perceptron to reduce the dimensionality of the signal representation nonlinearly and predict the final current signal class; by iteratively comparing the difference between the predicted result and the true label, the parameter values of every layer of the neural network are updated by backpropagation, improving the accuracy of the prediction.
Further, in the above machine learning-based nanopore protein sequencing data processing method, step 2) identifies two different unlabeled molecules using an encoder-decoder method, comprising the following steps:
I. the encoder comprises a convolutional neural network and a recurrent neural network; the convolutional neural network comprises two convolutional layers, each containing 256 convolution kernels; each kernel performs a convolution over the input current signal, multiplying the elements of the current sequence by the weights of the kernel and summing the products, to extract frequency components and spatial features of the sequence; features at different levels of the sequence are extracted by the different kernels, yielding the output feature sequence of the final convolutional layer;
II. a recurrent neural network in the encoder learns the time-step features of the current signal; while processing the input sequence it remembers previously processed information, and it models the sequence data by taking the hidden state of the previous time step as an input at the current time step, thereby capturing the temporal correlations in the sequence;
III. in the decoder, a multi-layer perceptron maps the feature sequence output by the encoder to the final current signal class; the output representation is normalized by a sigmoid function to obtain the probability that the current signal belongs to each final class;
IV. the weights of every layer of the model are updated by comparing the error between the model output and the true label, so that the model predicts the class membership probability more accurately.
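The multiply-and-sum operation that each convolution kernel performs in step I can be sketched as below. The patent specifies two convolutional layers of 256 kernels each but does not state the kernel width, stride, padding, or activation, so the width-2 kernels, stride 1, no padding, and ReLU activation used here are assumptions, and `conv1d` is a hypothetical helper name.

```python
def conv1d(signal, kernels, biases):
    """Slide each kernel over the signal (stride 1, no padding), then ReLU.

    Returns one feature row per kernel.
    """
    width = len(kernels[0])
    n_out = len(signal) - width + 1
    features = []
    for kernel, bias in zip(kernels, biases):
        row = []
        for t in range(n_out):
            # Multiply the window elements by the kernel weights and sum.
            s = bias + sum(signal[t + k] * kernel[k] for k in range(width))
            row.append(max(0.0, s))  # ReLU activation (assumed)
        features.append(row)
    return features


# Two toy kernels: one responds to rising edges, one to falling edges.
signal = [1.0, 2.0, 3.0, 4.0, 2.0]
features = conv1d(signal, kernels=[[-1.0, 1.0], [1.0, -1.0]], biases=[0.0, 0.0])
```

In the architecture described above, the rows produced by the final convolutional layer form the feature sequence that is passed to the recurrent network, one feature vector per time step.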
Furthermore, in the above machine learning-based nanopore protein sequencing data processing method, the recurrent neural network in step II is a long short-term memory (LSTM) structure. The LSTM comprises three gating mechanisms, an input gate, a forget gate, and an output gate, which respectively control the input of new information, the forgetting of previous information, and the output of information in the current sequence, so that long-term information is stored and controlled effectively; each LSTM gate contains 64 neurons, each of which learns a different feature representation of the signal.
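A single time step of a standard LSTM cell, showing how the three gates act on the cell state, can be sketched as follows. This is the textbook formulation rather than code from the patent; the parameter layout (`params` mapping a gate name to a `(W, U, b)` triple) is a hypothetical convention chosen for readability, and the demo uses 2 hidden units instead of the 64 stated above.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params[name] = (W, U, b) with W: hidden x input, U: hidden x hidden,
    b: hidden, for name in {"input", "forget", "output", "candidate"}.
    """
    def affine(name):
        W, U, b = params[name]
        return [
            sum(w * xi for w, xi in zip(W[j], x))
            + sum(u * hi for u, hi in zip(U[j], h_prev))
            + b[j]
            for j in range(len(b))
        ]

    i = [sigmoid(v) for v in affine("input")]         # how much new info enters
    f = [sigmoid(v) for v in affine("forget")]        # how much old state is kept
    o = [sigmoid(v) for v in affine("output")]        # how much state is exposed
    g = [math.tanh(v) for v in affine("candidate")]   # proposed new content
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


hidden, n_in = 2, 1
zeros = ([[0.0] * n_in] * hidden, [[0.0] * hidden] * hidden, [0.0] * hidden)
params = {name: zeros for name in ("input", "forget", "output", "candidate")}
h, c = lstm_step([0.7], [0.0, 0.0], [1.0, -2.0], params)
```

With all-zero parameters every gate sits at 0.5, so the step simply halves the previous cell state; trained weights instead let the forget gate preserve or discard information selectively over long sequences.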
Further, the multi-layer perceptron in step III consists of three fully connected layers: two hidden layers and an output layer. The two hidden layers contain 100 and 50 neurons respectively, and each neuron is connected to all neurons of the previous layer. Each layer combines and abstracts different features through a nonlinear transformation to generate a low-dimensional representation of the sequence. After transformation by the hidden layers, the signal reaches the output layer, where the output representation is normalized by a sigmoid function to obtain the probability that the current signal belongs to each final class.
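A forward pass through a perceptron with the stated layer sizes (100 and 50 hidden neurons, sigmoid output) can be sketched as below. Several details are assumptions not given in the text: the 64-dimensional input (taken from the LSTM size mentioned earlier), the ReLU hidden activation, the single output neuron for a two-class problem, and the random untrained weights.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def make_layer(n_in, n_out, rng):
    """Fully connected layer with small random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Each output neuron is connected to every input value."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(x, layers):
    """Hidden layers with ReLU (assumed), then a sigmoid output probability."""
    h = x
    for layer in layers[:-1]:
        h = [max(0.0, v) for v in dense(h, layer)]
    return sigmoid(dense(h, layers[-1])[0])


rng = random.Random(0)
sizes = [64, 100, 50, 1]  # input size 64 is an assumption
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
features = [rng.gauss(0.0, 1.0) for _ in range(64)]
probability = mlp_forward(features, layers)
```

The sigmoid maps the final activation into (0, 1), so the output can be read directly as the probability of one of the two molecule classes.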
In another aspect, the invention discloses the application of the machine learning-based nanopore protein sequencing data processing method to large-scale protein sequencing.
The invention has the following beneficial effects:
When processing the nanopore sequencing data of Native Aβ1-42 and Scrambled Aβ1-42, the problem that must be solved before computing a consensus current trace from hundreds to thousands of individual single-molecule translocation events is how to determine the directionality of each unlabeled single protein molecule during translocation. Because the data have only two possible directions, a K-Means clustering algorithm with K = 2 is selected and combined with DTW as the distance measure. The initialization of the cluster centers is changed: the volume model corresponding to the primary structure of the single protein molecule under test, and the same volume model flipped along the time axis, are used as the two initial cluster centers and added to the data set. The directionality of each unlabeled time series is thereby determined, and the data set is finally divided into two clusters. Once the directionality of each translocation event has been obtained from the clustering result, the consensus current trace of the protein under test is calculated; it is found to be highly correlated with the corresponding linear amino acid volume model, with a PCC as high as 0.8.
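The consensus calculation described above can be sketched as follows, assuming the events have already been resampled to a common length and labeled by direction. The names `consensus_trace` and `labels` are hypothetical, and the convention that label 1 marks a reversed event is an assumption; the patent averages a selection of typical events, whereas this sketch averages whatever events it is given.

```python
def consensus_trace(events, labels):
    """Flip reversed events to a common orientation, then average pointwise.

    labels[k] == 0 means event k already matches the model orientation;
    labels[k] == 1 means it must be reversed along the time axis first.
    """
    oriented = [
        list(e) if lab == 0 else list(reversed(e))
        for e, lab in zip(events, labels)
    ]
    return [sum(col) / len(oriented) for col in zip(*oriented)]


events = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0], [0.0, 1.0, 4.0]]
trace = consensus_trace(events, labels=[0, 1, 0])
```

Averaging without first orienting the events would blur forward and backward signals together, which is why the directionality step must come before the consensus step.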
By using a neural network-based classification algorithm to classify the sequencing signals of different target proteins, efficient and accurate classification is achieved, with a classification AUC (AUC ∈ [0.5, 1]) of 0.97 ± 0.01. (AUC measures the model's ability to distinguish positive from negative samples across different thresholds and can be used to evaluate the overall performance of the model; accuracy is the ratio of the number of correctly classified samples to the total number of samples.) The method captures abstract features in the signals and makes full use of the sequence information and structural features of different proteins, which is how it achieves accurate classification. It is also robust and generalizes well: it can process protein sequencing signals of various types and lengths and adapts readily to new cases. Through efficient algorithm design and continued optimization, it achieves a high classification speed while maintaining accuracy, so that a large number of protein sequencing signals can be classified in a short time, improving processing efficiency and productivity.
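The two metrics defined in the parenthetical can be computed as in the sketch below. It uses the rank-based definition of AUC (the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as half), which is equivalent to the area under the ROC curve; the function names and the 0.5 decision threshold for accuracy are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of samples whose thresholded score matches the true label."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
acc = accuracy(labels, scores)
```

Unlike accuracy, AUC does not depend on a single decision threshold, which is why it is used here to summarize overall classifier performance.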
Drawings
FIG. 1: Flow chart for determining the directionality of unlabeled single protein molecules translocating across the membrane through a nanopore, based on the DTW and K-Means clustering algorithm. (a) Original current time series of different lengths generated when single protein molecules translocate through the nanopore. (b) Each blocking event is interpolated to 500 points and Z-Score normalized. (c) Setup of the K-Means clustering algorithm: K is set to 2, the cluster centers are initialized as the volume model before and after flipping, and the distance measure is DTW. A: blocking event; B: cluster center; W: size of the DTW warping window. (d) Clustering result: all events in the data set are grouped into two clusters. Gray: blocking events; black: cluster centers.
FIG. 2: (a) and (d) Distributions of the events in the Native Aβ1-42 data set and the Scrambled Aβ1-42 data set, respectively, over blocking duration and current blocking ratio. Each square represents one blocking event. Black squares: events with a blocking duration between 100 and 500 μs. Red squares: typical blocking events. Contours: the black contours connect the points whose value is 50% of PDFmax in the normalized heat map of the probability density function (PDF) of the data set. (b) Red solid line: consensus current trace of Native Aβ1-42, averaged from 50 typical events in panel (a). Black solid line: linear amino acid volume model of Native Aβ1-42 (kmer=1). Gray solid lines: correspondence between the two sequences when they are aligned with the DTW algorithm. (c) Red solid line: result of aligning the Native consensus current trace to the linear amino acid volume model of Native Aβ1-42 (kmer=1) using DTW with W=12. (e) Red solid line: consensus current trace of Scrambled Aβ1-42, averaged from 50 typical events in panel (d). Black solid line: linear amino acid volume model of Scrambled Aβ1-42 (kmer=1). Gray solid lines: correspondence between the two sequences when they are aligned with the DTW algorithm. (f) Red solid line: result of aligning the Scrambled consensus current trace to the linear amino acid volume model of Scrambled Aβ1-42 (kmer=1) using DTW with W=12.
FIG. 3: Classification algorithm based on a convolutional neural network and a recurrent neural network, which solves the problem of identifying different translocating molecules.
(a) Framework of the neural network model underlying the classification algorithm. (b) Receiver operating characteristic (ROC) curve of the classification algorithm: one Scrambled Aβ1-42 data set and three different Native Aβ1-42 data sets were input into the neural network model for training and evaluation, giving a mean AUC of 0.97.
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments that a person skilled in the art can obtain from these embodiments without inventive effort fall within the scope of protection of the invention.
Reagents or instruments used in the examples for which no manufacturer is identified are conventional, commercially available products.
Example 1
Determination of the translocation directionality of Native Aβ1-42 through the nanopore.
After the blocking-current time-series data set generated by translocation of β-amyloid (Native Aβ1-42) through the nanopore was obtained, the directionality of the Native Aβ1-42 blocking events was determined according to the flow shown in FIG. 1. The data were first preprocessed: current blocking events of unequal length were uniformly interpolated to 500 points using the Matlab built-in function interp1 and then Z-Score normalized to eliminate differences in scale between the data. The data set was then processed with the K-Means clustering algorithm. According to the characteristics of the data, the number of clusters K was set to 2, and the DTW algorithm, which allows stretching along the time axis, was selected as the similarity measure between current time series. The two initial cluster centers were customized as the volume model corresponding to the primary structure of Native Aβ1-42 and the same volume model flipped along the time axis, and both volume models were added to the data set to take part in the iterative process. During clustering, the distance between each event and each of the two cluster centers was calculated and the event was assigned to the nearest cluster center; the mean of each of the two clusters was then calculated from the new assignment to replace the previous cluster center. This process was repeated until the cluster centers no longer changed, at which point the iteration stopped and the current cluster centers and the assignment of every event in the data set were output. In the end, all events in the data set were assigned to two clusters. The volume models before and after flipping were assigned to different clusters. The PCC between the two cluster centers was -0.63, and the PCC between cluster center 1 and the time-flipped cluster center 2 was 0.58, which preliminarily indicates that the clustering result is correct.
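The preprocessing step described above can be illustrated in Python. This is only a sketch under stated assumptions: the source uses Matlab's interp1, whereas the function names below (`resample_linear`, `z_score`, `preprocess_event`) are hypothetical, linear interpolation is assumed as the interpolation method, and the population standard deviation is assumed for the Z-Score.

```python
from statistics import fmean, pstdev

def resample_linear(signal, n_points=500):
    """Linearly interpolate `signal` onto `n_points` evenly spaced positions."""
    if len(signal) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    last = len(signal) - 1
    out = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional index into the original event
        lo = min(int(pos), last - 1)     # left neighbour, clamped so lo + 1 is valid
        frac = pos - lo
        out.append(signal[lo] * (1.0 - frac) + signal[lo + 1] * frac)
    return out

def z_score(signal):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = fmean(signal)
    sigma = pstdev(signal)
    if sigma == 0:
        raise ValueError("a constant signal cannot be Z-Score normalized")
    return [(x - mu) / sigma for x in signal]

def preprocess_event(signal, n_points=500):
    """Resample one blocking event to a fixed length, then normalize it."""
    return z_score(resample_linear(signal, n_points))
```

After this step every event has the same length and scale, which is what allows cluster means to be computed point by point later on.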
Event directionality was then determined from the clustering result, and 50 typical events (shown in FIG. 2 (a)) were averaged to obtain the consensus current trace of Native Aβ1-42. The PCC between this trace and the volume model of Native Aβ1-42 (k-mer=1) was 0.84, as shown in FIG. 2 (b), so the two are highly correlated. After the consensus current trace was aligned to the amino acid volume model using DTW with w=12, the PCC between the consensus current trace and the volume model reached 0.96, as shown in FIG. 2 (c).
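The consensus trace and the PCC used above can be sketched as follows. The function names are hypothetical, and the sketch assumes that the events have already been resampled to a common length and oriented in the same direction, and that the volume model has been brought onto the same number of points as the trace; the source does not state how that resampling is done.

```python
from math import sqrt

def consensus_trace(events):
    """Point-wise mean of equal-length, direction-corrected events."""
    n = len(events[0])
    if any(len(e) != n for e in events):
        raise ValueError("all events must have the same length")
    return [sum(column) / len(events) for column in zip(*events)]

def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

A PCC close to 1 means the consensus trace rises and falls together with the volume model; a value close to -1 would suggest that the events were oriented in the opposite direction.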
Example 2
Determination of the translocation directionality of Scrambled Aβ1-42 through the nanopore.
After the blocking-current time-series data set generated by translocation of Scrambled Aβ1-42 through the nanopore was obtained, the directionality of the Scrambled Aβ1-42 blocking events was determined according to the flow shown in FIG. 1. The data were first preprocessed: current blocking events of unequal length were uniformly interpolated to 500 points using the Matlab built-in function interp1 and then Z-Score normalized to eliminate differences in scale between the data. The data set was then processed with the K-Means clustering algorithm. According to the characteristics of the data, the number of clusters K was set to 2, and the DTW algorithm, which allows stretching along the time axis, was selected as the similarity measure between current time series. The two initial cluster centers were customized as the volume model corresponding to the primary structure of Scrambled Aβ1-42 and the same volume model flipped along the time axis, and both volume models were added to the data set to take part in the iterative process. During clustering, the distance between each event and each of the two cluster centers was calculated and the event was assigned to the nearest cluster center; the mean of each of the two clusters was then calculated from the new assignment to replace the previous cluster center. This process was repeated until the cluster centers no longer changed, at which point the iteration stopped and the current cluster centers and the assignment of every event in the data set were output. In the end, all events in the data set were assigned to two clusters. The volume models before and after flipping were assigned to different clusters. The PCC between the two cluster centers was -0.65, and the PCC between cluster center 1 and the time-flipped cluster center 2 was 0.56, which preliminarily indicates that the clustering result is correct.
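The clustering procedure used in Examples 1 and 2 can be sketched in Python as below. This is an illustration, not the inventors' implementation: the function names are hypothetical, the DTW uses an absolute-difference cost without a window, cluster centers are updated as point-wise means (which assumes the events and the volume model share one length), and convergence is detected when the assignments stop changing, which implies the centers stop changing too.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # extend the cheapest of the three allowed predecessor cells
            cur[j] = abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]

def cluster_directions(events, model, max_iter=100):
    """2-cluster K-Means seeded with the volume model and its time reversal.

    Returns (labels, centers); label 0 means "same direction as `model`".
    """
    flipped = list(reversed(model))
    centers = [list(model), flipped]
    # Both templates join the data set, as the text describes.
    data = [list(model), flipped] + [list(e) for e in events]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if dtw_distance(s, centers[0]) <= dtw_distance(s, centers[1]) else 1
            for s in data
        ]
        if new_labels == labels:  # assignments stable, so centers are stable
            break
        labels = new_labels
        for k in (0, 1):
            members = [s for s, lab in zip(data, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels[2:], centers  # drop the labels of the two templates
```

Seeding the centers with the forward and reversed templates is what gives the two clusters a physical meaning. Note that plain-Python DTW on 500-point events is slow; the sketch is meant for reading rather than for production use.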
Event directionality was then determined from the clustering result, and 50 typical events (shown in FIG. 2 (d)) were averaged to obtain the consensus current trace of Scrambled Aβ1-42. The PCC between this trace and the amino acid volume model of Scrambled Aβ1-42 (k-mer=1) was 0.76, as shown in FIG. 2 (e), so the two are highly correlated. After the consensus current trace was aligned to the amino acid volume model using DTW with w=12, the PCC between the consensus current trace and the volume model reached 0.96, as shown in FIG. 2 (f).
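The DTW alignment step with w=12 can be illustrated as follows. The source does not define w; this sketch assumes it is the half-width of a Sakoe-Chiba warping window, and the way the trace is mapped onto the model's time axis (averaging all trace samples matched to each model point) is likewise an assumption. The function names are hypothetical.

```python
def dtw_path(a, b, w):
    """DTW restricted to a band |i - j| <= w; returns the warping path."""
    n, m = len(a), len(b)
    w = max(w, abs(n - m))  # the band must at least cover the length difference
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path

def align_to_model(trace, model, w=12):
    """Warp `trace` onto the time axis of `model` along the DTW path."""
    buckets = [[] for _ in model]
    for i, j in dtw_path(trace, model, w):
        buckets[j].append(trace[i])
    # every model index appears on the path, so no bucket is empty
    return [sum(b) / len(b) for b in buckets]
```

The aligned trace has the same length as the model, so a PCC can be computed between them directly. A small window keeps the warping local, which limits how far the alignment can distort the trace.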
Example 3
Identification of the different translocating molecules.
For the blocking-current time-series data sets generated by translocation of Native Aβ1-42 and Scrambled Aβ1-42 through the nanopore, the present invention uses an encoder-decoder method to identify the two different target proteins. The neural network model of the present invention is implemented in the Python programming language with the TensorFlow framework.
Specifically, the convolutional neural network in the encoder section comprises two convolutional layers. Each convolutional layer contains 256 convolution kernels. Each kernel performs a convolution operation on the input current signal: the elements of the current sequence are multiplied by the weights of the kernel and summed, which extracts features of the sequence such as frequency components and spatial features. Different kernels extract features of the sequence at different levels, yielding the output feature sequence of the final convolutional layer. Stacking two convolutional layers improves the expressive capacity and prediction accuracy of the model. Next, a recurrent neural network is used to learn the time-step features of the current signal. While processing the input sequence, a recurrent neural network retains the information it has processed previously, and it models the sequence data by taking the hidden state of the previous time step as an input to the current time step, thereby capturing temporal correlations in the sequence. The long short-term memory (LSTM) network used in the present invention is a special recurrent neural network structure that effectively alleviates the vanishing-gradient and exploding-gradient problems when learning long sequences and improves the learning capacity and generalization capacity of the model. The LSTM structure comprises three gating mechanisms: an input gate, a forget gate and an output gate. These gates control, respectively, the input of the current sequence element, the forgetting of previous information and the output, thereby enabling effective long-term storage and control of information. Each LSTM layer contains 64 neurons, each learning a different feature representation of the signal. Stacking three LSTM layers captures long-term dependencies in the sequence more effectively and improves the expressive capacity of the model.
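The two building blocks described above, the multiply-and-sum convolution and the gated LSTM update, can be written out in plain Python for illustration. This sketch follows the standard textbook formulation rather than the TensorFlow implementation used in the invention; the function names, the parameter layout (`params[name] = (W, b)` acting on the concatenation of the input and the previous hidden state) and any kernel size are assumptions, since the source does not give them.

```python
from math import exp, tanh

def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal`; multiply element-wise and sum at each step."""
    k = len(kernel)
    return [
        sum(signal[t + j] * kernel[j] for j in range(k))
        for t in range(len(signal) - k + 1)
    ]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def affine(W, x, b):
    """Return W @ x + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with input (i), forget (f) and output (o) gates."""
    z = list(x) + list(h_prev)  # current input joined with previous hidden state
    gate = {}
    for name in ("i", "f", "o", "g"):
        W, b = params[name]
        pre = affine(W, z, b)
        # "g" is the candidate cell content (tanh); the gates use a sigmoid
        gate[name] = [tanh(v) if name == "g" else sigmoid(v) for v in pre]
    c = [f * cp + i * g
         for f, cp, i, g in zip(gate["f"], c_prev, gate["i"], gate["g"])]
    h = [o * tanh(cv) for o, cv in zip(gate["o"], c)]
    return h, c

def run_lstm(sequence, params, hidden_size):
    """Feed a sequence of feature vectors through the cell, step by step."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs
```

The forget gate scales the previous cell state and the input gate scales the new candidate content, so information can persist across many time steps when the forget gate stays close to 1.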
The pooling layer downsamples the feature sequence output by the LSTM layers to reduce its dimensionality. It aggregates the elements of the feature sequence by taking the mean of each feature channel over the sequence and using that mean as the representative value. This reduces the size of the feature sequence, lowers the number of parameters and the computational complexity of the model, and improves its robustness.
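The averaging described above corresponds to what is commonly called global average pooling over the time axis. A minimal Python sketch, with a hypothetical function name and a list-of-time-steps layout chosen only for illustration:

```python
def global_average_pool(feature_sequence):
    """Average each feature channel over all time steps.

    `feature_sequence` is a list of time steps, each a list of channel values.
    """
    if not feature_sequence:
        raise ValueError("feature sequence is empty")
    steps = len(feature_sequence)
    return [sum(channel) / steps for channel in zip(*feature_sequence)]
```

With 64 feature channels coming out of the last LSTM layer, the pooled result would be a single vector of 64 values regardless of the sequence length.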
In the decoder section, a multi-layer perceptron maps the feature sequence output by the encoder to the final current-signal class. The multi-layer perceptron consists of three fully connected layers: two hidden layers, containing 100 and 50 neurons respectively, and one output layer. Each neuron is connected to all neurons of the previous layer. Each layer combines and abstracts the features through a nonlinear transformation, generating a low-dimensional representation of the sequence. Finally, after the transformations of the hidden layers, the signal reaches the output layer, where the output representation is normalized by a sigmoid function to give the probability that the current signal belongs to each class.
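The decoder described above can be sketched as a forward pass in plain Python. Several details are assumptions not stated in the source: the input size of 64 (inferred from 64 LSTM neurons followed by per-channel averaging), ReLU as the hidden-layer nonlinearity, a single sigmoid output unit for the two classes, and the small random initial weights. All function names are hypothetical.

```python
import random
from math import exp

def dense(x, W, b):
    """Fully connected layer: every output neuron sees every input value."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def mlp_predict(features, layers):
    """ReLU hidden layers followed by one sigmoid output unit."""
    a = list(features)
    for W, b in layers[:-1]:
        a = [max(0.0, v) for v in dense(a, W, b)]  # assumed ReLU nonlinearity
    W, b = layers[-1]
    (logit,) = dense(a, W, b)  # the output layer has exactly one unit
    return sigmoid(logit)

def random_layers(sizes, seed=0):
    """Untrained weights for illustration, e.g. sizes = [64, 100, 50, 1]."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]
```

Under these assumptions the three layers hold 64×100+100, 100×50+50 and 50×1+1 parameters, that is 11,601 in total. The returned value can be read as the probability of one class, with the other class having the complementary probability.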
The weights of each layer of the model are updated by comparing the error between the output of the model and the true label, so that the model predicts the class probability more accurately. During training, the weights were updated over 100 rounds of iteration, and the neural network model achieved AUC = 0.97 ± 0.01 in classifying Native Aβ1-42 and Scrambled Aβ1-42.
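The AUC reported above is the area under the ROC curve. One way to compute it, shown here only as an illustration with a hypothetical function name, is as the fraction of positive-negative pairs in which the positive example receives the higher score (ties counted as one half). Which class is treated as positive is an assumption left to the caller.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for the positive class and 0 for the negative class.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes. The pairwise form is quadratic in the number of events, which is acceptable for a sketch but not for very large data sets.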
As can be seen from the above embodiments, because the data have only two possible directions, a K-Means clustering algorithm with K = 2 is selected and combined with DTW as the distance measure, and the initialization of the cluster centers is changed: the volume model corresponding to the primary structure of the single protein molecule to be detected and the same volume model flipped along the time axis are used as the two initial cluster centers and are added to the data set. In this way the directionality of the unlabeled time series is determined and the data set is finally divided into two clusters. After the directionality of each translocation event is obtained from the clustering result, the consensus current trace of the protein to be detected is calculated. This trace is found to be highly correlated with the corresponding linear amino acid volume model, with PCC values of about 0.8 (0.84 and 0.76 in Examples 1 and 2).
By using a neural-network-based classification algorithm to classify the sequencing signals of different target proteins, an efficient and accurate classification result is achieved, with a classification AUC of 0.97 ± 0.01. Compared with traditional methods, the algorithm has a strong capability for classifying diverse signals and can accurately distinguish the differences and similarities between the sequencing signals of different proteins. The method captures abstract features in the signals and makes full use of the sequence information and structural features of different proteins, thereby achieving accurate classification. The method also has high robustness and generalization capability, can process protein sequencing signals of various types and lengths, and is readily adapted and extended. In addition, through efficient algorithm design and continuous optimization, a high classification speed is achieved while accuracy is maintained. The method can classify a large number of protein sequencing signals in a short time, improving processing efficiency and productivity.
The basic principles, main features and advantages of the present invention have been shown and described above. Those skilled in the art will appreciate that the above embodiments do not limit the present invention in any way, and that all technical solutions obtained by equivalent substitution or equivalent transformation fall within the scope of protection of the present invention.
Claims (10)
1. A machine learning-based nanopore protein sequencing data processing method, characterized by comprising the following steps:
1) determining the directionality of unlabeled molecules as they translocate through the nanopore by using a clustering algorithm based on the dynamic time warping (DTW) algorithm and K-Means;
2) determining the identities of the different molecules in the translocation process by using a classification algorithm based on a convolutional neural network and a recurrent neural network;
wherein DTW is a nonlinear warping technique for comparing the similarity of two time series, which finds the optimal alignment path between the two time series by aligning their coordinates;
K-Means is an unsupervised learning algorithm that measures the similarity between samples by Euclidean distance and assigns highly similar samples to the same class through iterative optimization;
and the unlabeled molecule is a linear protein or polypeptide molecule.
2. The machine learning-based nanopore protein sequencing data processing method according to claim 1, wherein step 1) comprises the following steps:
first, measuring the similarity between time series by the DTW distance and constructing the distance matrix of the corresponding data set;
then, following the K-Means clustering principle, setting the number of clusters K to 2, taking the linear amino acid crystallographic volume model corresponding to the primary structure of the protein to be detected and the same sequence flipped along the time axis as the two initial cluster centers, calculating the distance between the time-domain current sequence of each blocking event and the two cluster centers, and assigning the event to the nearest cluster center;
after all events have been assigned, calculating the mean of all events in each of the two clusters and updating the cluster centers, and repeating this process so that the cluster centers are iteratively updated and the accuracy of the clustering result improves, until the cluster centers no longer change and the iteration stops;
and finally, analyzing the clustering result, refining the calculation of the consensus current trace, and evaluating the clustering result with similarity indices including but not limited to the Pearson correlation coefficient PCC ∈ [-1, 1], to obtain an experimental result with strong positive correlation, namely PCC ≥ 0.5.
3. The machine learning-based nanopore protein sequencing data processing method according to claim 1, wherein step 1) comprises the following steps:
a. collecting the blocking-current time-series data generated by translocation of unlabeled molecules through the nanopore;
b. preprocessing the data: uniformly interpolating current blocking events of unequal length to 500 points using the Matlab built-in function interp1, and applying Z-Score normalization to eliminate differences in scale between the data;
c. processing the data set with the K-Means clustering algorithm: setting the number of clusters K to 2 according to the characteristics of the data, selecting the DTW algorithm, which allows stretching along the time axis, as the similarity measure between current time series, customizing the two initial cluster centers as the volume model corresponding to the primary structure of the unlabeled molecule and the same volume model flipped along the time axis, and adding the two volume models to the data set to take part in the iterative process;
d. during clustering, calculating the distance between each event and each of the two cluster centers, assigning the event to the nearest cluster center, calculating the mean of each of the two clusters from the new assignment to replace the previous cluster center, repeating this process until the cluster centers no longer change, then stopping the iteration and outputting the current cluster centers and the assignment of every event in the data set;
e. with all events in the data set assigned to two clusters and the volume models before and after flipping assigned to different clusters, calculating the PCC between the two cluster centers, and judging the correctness of the clustering result by checking whether the two cluster centers, before and after flipping, show a correlation of at least moderate strength, namely |PCC| ≥ 0.3;
f. determining event directionality from the clustering result, averaging a number of typical events to obtain the consensus current trace of the unlabeled molecule, and calculating the PCC between the consensus current trace and the volume model of the unlabeled molecule;
g. selecting a suitable w value for DTW such that the PCC between the consensus current trace and the volume model is highest after the consensus current trace is aligned to the amino acid volume model.
4. The machine learning-based nanopore protein sequencing data processing method according to claim 1, wherein the unlabeled molecule is β-amyloid Native Aβ1-42.
5. The machine learning-based nanopore protein sequencing data processing method according to claim 1, wherein the unlabeled molecule is Scrambled Aβ1-42.
6. The machine learning-based nanopore protein sequencing data processing method according to claim 1, wherein in step 2) an encoder-decoder framework is used, in which the encoder part first applies a convolutional neural network to perform a plurality of convolution operations on the input current signals of different classes to extract the spatial features of the current signals, and then uses a recurrent neural network to learn their temporal features and capture temporal correlations in the sequence;
and the decoder part uses a multi-layer perceptron to perform nonlinear dimensionality reduction on the signal and predict the final current-signal class, and, by iteratively comparing the difference between the predicted result and the true label, back-propagates updates to the parameter values of each intermediate layer of the neural network, thereby improving the accuracy of the prediction.
7. The machine learning-based nanopore protein sequencing data processing method according to claim 1, wherein in step 2) two different unlabeled molecules are identified using an encoder-decoder method, comprising the following steps:
I. the encoder part comprises a convolutional neural network and a recurrent neural network, wherein the convolutional neural network comprises two convolutional layers, each convolutional layer contains 256 convolution kernels, each kernel performs a convolution operation on the input current signal in which the elements of the current sequence are multiplied by the weights of the kernel and summed to extract the frequency components and spatial features of the sequence, and different kernels extract features of the sequence at different levels to obtain the output feature sequence of the final convolutional layer;
II. a recurrent neural network is used in the encoder to learn the time-step features of the current signal, retaining previously processed information while processing the input sequence, and modeling the sequence data by taking the hidden state of the previous time step as an input to the current time step, thereby capturing temporal correlations in the sequence;
III. in the decoder part, a multi-layer perceptron maps the feature sequence output by the encoder to the final current-signal class, and the output representation is normalized by a sigmoid function to obtain the probability that the current signal belongs to each class;
IV. the weights of each layer of the model are updated by comparing the error between the output of the model and the true label, so that the model predicts the class probability more accurately.
8. The machine learning-based nanopore protein sequencing data processing method according to claim 7, wherein the recurrent neural network structure in step II is a long short-term memory (LSTM) structure, the LSTM structure comprises three gating mechanisms, namely an input gate, a forget gate and an output gate, which respectively control the input of the current sequence element, the output, and the forgetting of previous information, so as to achieve effective long-term storage and control of information; and each LSTM layer contains 64 neurons, each learning a different feature representation of the signal.
9. The machine learning-based nanopore protein sequencing data processing method according to claim 7, wherein the multi-layer perceptron in step III consists of three fully connected layers, namely two hidden layers, containing 100 and 50 neurons respectively, and one output layer; each neuron is connected to all neurons of the previous layer; each layer combines and abstracts the features through a nonlinear transformation to generate a low-dimensional representation of the sequence; and finally, after the transformations of the hidden layers, the signal reaches the output layer, where the output representation is normalized by a sigmoid function to obtain the probability that the current signal belongs to each class.
10. Use of the machine learning-based nanopore protein sequencing data processing method according to any one of claims 1-9 in large-scale protein sequencing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310705437.4A CN116741265A (en) | 2023-06-14 | 2023-06-14 | Machine learning-based nanopore protein sequencing data processing method and application thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116741265A (en) | 2023-09-12 |
Family
ID=87904053
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310705437.4A Pending CN116741265A (en) | 2023-06-14 | 2023-06-14 | Machine learning-based nanopore protein sequencing data processing method and application thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116741265A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117095754A (en) * | 2023-10-19 | 2023-11-21 | 江苏正大天创生物工程有限公司 | Method for classifying proteins by machine learning |
CN117095754B (en) * | 2023-10-19 | 2023-12-29 | 江苏正大天创生物工程有限公司 | Method for classifying proteins by machine learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107622182B (en) | Method and system for predicting local structural features of protein | |
CN106295124B (en) | The method of a variety of image detecting technique comprehensive analysis gene subgraph likelihood probability amounts | |
CN111210871A (en) | Protein-protein interaction prediction method based on deep forest | |
Kodogiannis et al. | Artificial odor discrimination system using electronic nose and neural networks for the identification of urinary tract infection | |
CN111126575A (en) | Gas sensor array mixed gas detection method and device based on machine learning | |
CN110880369A (en) | Gas marker detection method based on radial basis function neural network and application | |
CN116741265A (en) | Machine learning-based nanopore protein sequencing data processing method and application thereof | |
CN108877947B (en) | Depth sample learning method based on iterative mean clustering | |
CN111079805A (en) | Abnormal image detection method combining attention mechanism and information entropy minimization | |
CN112116950B (en) | Protein folding identification method based on depth measurement learning | |
CN114692732A (en) | Method, system, device and storage medium for updating online label | |
CN117153268A (en) | Cell category determining method and system | |
CN112784921A (en) | Task attention guided small sample image complementary learning classification algorithm | |
CN114743600A (en) | Gate-controlled attention mechanism-based deep learning prediction method for target-ligand binding affinity | |
CN116484289A (en) | Carbon emission abnormal data detection method, terminal and storage medium | |
CN115472221A (en) | Protein fitness prediction method based on deep learning | |
CN113764034B (en) | Method, device, equipment and medium for predicting potential BGC in genome sequence | |
CN108694375B (en) | Imaging white spirit identification method applicable to multi-electronic nose platform | |
Zhang et al. | Robust learning from noisy web images via data purification for fine-grained recognition | |
Li et al. | MSSort-DIAXMBD: A deep learning classification tool of the peptide precursors quantified by OpenSWATH | |
Lee et al. | Neuralfp: out-of-distribution detection using fingerprints of neural networks | |
CN110781822B (en) | SAR image target recognition method based on self-adaptive multi-azimuth dictionary pair learning | |
CN115511012B (en) | Class soft label identification training method with maximum entropy constraint | |
Gan et al. | DSAE-Impute: Learning discriminative stacked autoencoders for imputing single-cell rna-seq data | |
CN114758721B (en) | Deep learning-based transcription factor binding site positioning method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||