CN116741265A - Machine learning-based nanopore protein sequencing data processing method and application thereof

Publication number
CN116741265A
Authority
CN
China
Prior art keywords
clustering
sequence
current
machine learning
neural network
Legal status
Pending
Application number
CN202310705437.4A
Other languages
Chinese (zh)
Inventor
董竹新
谢勇
陈乐�
Current Assignee
Central South University
Original Assignee
Central South University
Application filed by Central South University
Priority to CN202310705437.4A
Publication of CN116741265A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Biotechnology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention belongs to the field of nanopore protein sequencing and discloses a machine learning-based nanopore protein sequencing data processing method and its application. The method processes and analyzes nanopore sequencing data with two components: a clustering algorithm built on dynamic time warping (DTW) and K-Means, and a classification algorithm built on a convolutional neural network and a recurrent neural network. Together they address two problems in experimental data from nanopore de novo protein sequencing: determining the direction in which an unlabeled molecule translocates through the nanopore, and identifying which molecule produced a given translocation event.

Description

Machine learning-based nanopore protein sequencing data processing method and application thereof
Technical Field
The invention belongs to the field of nanopore protein sequencing and specifically discloses a machine learning-based nanopore protein sequencing data processing method and its application.
Background
Protein analysis provides key information about most biological processes, cell phenotypes and diseases. Studies have shown that many diseases are caused by protein dysfunction, and that the primary structure of a protein, i.e. its amino acid (AA) sequence, determines its higher-order structure and thus ultimately its function. De novo protein sequencing is therefore one of the core technologies driving the further development of proteomics.
Nanopore sensors offer single-molecule sensitivity, long read lengths and high throughput. Beyond their successful use in DNA sequencing (for example, the MinION from Oxford Nanopore Technologies in the UK and the QNome-3841 from Qitan Technology in China), they show great potential for protein analysis and protein sequencing. Studies have shown that the primary structure of proteins can be read using solid-state nanopores (G. Timp et al. 2016, Nature Nanotechnology 11, 968-976), and that 13 of the 20 proteinogenic amino acids can be distinguished with an aerolysin nanopore from the characteristics of their current blockade signals, with the residual current level closely related to the volume of the detected amino acid (A. Oukhaled et al. 2020, Nature Biotechnology 38, 176-181). A 2021 study showed that individual peptides can be reread repeatedly with a nanopore, discriminating amino acid substitutions at single-residue resolution (C. Dekker et al. 2021, Science 374 (6574): 1509-1513). When a single protein molecule translocates through a nanopore across the membrane, its amino acid sequence information is hidden in the fluctuations of the corresponding blockade current. With a sound and efficient signal processing method, the primary structure of a protein can therefore be read from the subtle pattern of residual-level fluctuations in the blockade current signal.
The essence of protein sequencing is reading the primary structure of the protein, that is, determining the amino acid sequence of the peptide chain. Compared with gene sequencing, it faces three difficulties. First, protein sequencing has no counterpart to a reference genome, so data analysis lacks a reference benchmark. Second, the amino acid units of a peptide chain are small, only about one tenth the size of a nucleotide, and the measurements carry substantial noise, so the signal-to-noise ratio is low and the accuracy and reliability of the data suffer. Third, there are 20 or more kinds of amino acids, far more than the 4 bases of a DNA strand; the greater sequence diversity makes signal complexity grow exponentially and greatly increases the difficulty of signal processing and analysis. As a result, existing data processing and analysis methods for nanopore DNA sequencing are hard to apply to protein sequencing. Traditional proteomic analysis methods such as mass spectrometry have difficulty distinguishing analytes at single-amino-acid resolution, and proteomics research currently needs a protein sequencing technology with single-residue specificity. Nanopore single-molecule sequencing has succeeded in genomics and is beginning to advance into proteomics. However, no study has yet addressed how to determine the directionality of an unlabeled protein molecule translocating through a nanopore, that is, how to determine accurately whether each polypeptide chain, once captured by the electric field, enters the pore amino terminus first or carboxyl terminus first. Resolving this is a precondition for reading the primary structure of a protein accurately with a nanopore.
Disclosure of Invention
To solve the above problems, the invention discloses a machine learning-based nanopore protein sequencing data processing method and its application. The method determines the directionality of unlabeled single protein molecules during transmembrane translocation through the pore, refines the calculation of the consensus current trace, and identifies the different molecules that translocate.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A machine learning-based nanopore protein sequencing data processing method comprises the following steps:
1) determining the directionality of unlabeled molecules as they translocate through the nanopore, using a clustering algorithm based on dynamic time warping (DTW) and K-Means;
2) determining the identities of the different translocating molecules, using a classification algorithm based on a convolutional neural network and a recurrent neural network;
wherein DTW is a nonlinear warping technique for comparing the similarity of two time series, which finds the optimal alignment path between them by matching up their sample points;
K-Means is an unsupervised learning algorithm that measures the similarity between samples by Euclidean distance and, through iterative optimization, assigns highly similar samples to the same cluster;
and the unlabeled molecule is a linear protein polypeptide molecule.
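The DTW definition above can be illustrated with a minimal sketch. It is not the patent's implementation: the function name `dtw`, the absolute-difference local cost and the Sakoe-Chiba style band used for the warping window are assumptions made for illustration.

```python
import numpy as np

def dtw(a, b, window=None):
    """Return (distance, path) for two 1-D sequences.

    window is a half-width in samples that limits how far the alignment
    may stray from the diagonal (assumed band form); None means no limit.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            step = min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            cost[i, j] = abs(a[i - 1] - b[j - 1]) + step
    # Backtrack from the last cell to recover the optimal alignment path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda ij: cost[ij])
        path.append((i - 1, j - 1))
    return float(cost[n, m]), path[::-1]
```

Each pair in the returned path matches one sample of the first sequence to one sample of the second, which is what lets DTW compare events that are stretched or compressed in time.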
Further, in the above machine learning-based nanopore protein sequencing data processing method, step 1) comprises the following:
first, the DTW distance is used to measure the similarity between time series and to construct the distance matrix of the data set;
then, following the K-Means principle, the number of clusters K is set to 2; the linear amino acid crystallographic volume model corresponding to the primary structure of the protein under test, and the same sequence reversed along the time axis, serve as the two initial cluster centers; the distance between the time-domain current sequence of each blockade event and each of the two cluster centers is calculated, and the event is assigned to the nearer center;
after all events have been assigned, the mean of all events in each of the two clusters is calculated and used to update the cluster centers; this process is repeated, iteratively updating the cluster centers to improve the accuracy of the clustering, until the cluster centers no longer change;
finally, the clustering result is analyzed, the calculation of the consensus current trace is refined, and the clustering result is evaluated using similarity indices including, but not limited to, the Pearson correlation coefficient (PCC).
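The clustering loop described above can be sketched as follows. This is an illustrative outline only: the function names, the `max_iter` cap and the empty-cluster guard are assumptions, all events are assumed to have been resampled to the same length so that a pointwise mean is defined, and adding the two volume models to the data set is left to the caller.

```python
import numpy as np

def dtw_distance(a, b):
    """Unconstrained DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            cost[i, j] = abs(a[i - 1] - b[j - 1]) + step
    return cost[n, m]

def cluster_by_direction(events, model, max_iter=50):
    """K-Means with K = 2, seeded with the model and its time-reversed copy."""
    events = np.asarray(events, dtype=float)
    model = np.asarray(model, dtype=float)
    centers = np.stack([model, model[::-1]])
    labels = None
    for _ in range(max_iter):
        dists = np.array([[dtw_distance(e, c) for c in centers] for e in events])
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments, and therefore centers, no longer change
        labels = new_labels
        for k in range(2):
            if np.any(labels == k):  # keep the old center if a cluster is empty
                centers[k] = events[labels == k].mean(axis=0)
    return labels, centers
```

Under this seeding, label 0 marks events that resemble the model as written and label 1 marks events that resemble its time-reversed copy.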
Further, in the above machine learning-based nanopore protein sequencing data processing method, step 1) comprises the following steps:
a. acquiring time series data of the blockade current generated when unlabeled molecules translocate through the nanopore;
b. preprocessing the data: current blockade events of unequal length are uniformly interpolated to 500 points using Matlab's built-in function interp1 and then Z-Score normalized to eliminate differences in data scale;
c. processing the data set with the K-Means clustering algorithm: according to the characteristics of the data, the number of clusters K is set to 2; DTW, which can stretch and compress along the time axis, is chosen as the similarity measure between current time series; the two initial cluster centers are specified as the volume model corresponding to the primary structure of the unlabeled molecule and the same volume model reversed along the time axis; and both volume models are added to the data set so that they take part in the iteration;
d. during clustering, calculating the distance between each event and each of the two cluster centers and assigning the event to the nearer center; calculating the mean of each of the two clusters from the new assignment to replace the previous cluster centers; repeating this process until the cluster centers are no longer updated; and outputting the current cluster centers and the cluster assignment of every event in the data set;
e. after all events in the data set have been assigned to the two clusters, with the original and reversed volume models falling into different clusters, calculating the PCC between the two cluster centers and judging the correctness of the clustering result by whether the two centers show at least a moderate correlation, i.e. |PCC| ≥ 0.3, both before and after one of them is reversed (for example, if the PCC between the two cluster centers is -0.6, the two are negatively correlated; if, after one center is reversed, the PCC is 0.58, the two are positively correlated, and the clustering result can be preliminarily judged correct; in general, an absolute PCC value of at least 0.3 indicates a moderate correlation between two variables);
f. determining the directionality of the events from the clustering result, averaging a number of typical events to obtain the consensus current trace of the unlabeled molecule, and calculating the PCC between the consensus current trace and the volume model of the unlabeled molecule;
g. selecting a suitable DTW warping window size W such that, after the consensus current trace is aligned to the amino acid volume model by DTW, the PCC between the aligned trace and the volume model is highest.
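Steps b and e lend themselves to a short sketch. The code below is an assumption-laden illustration, not the patent's code: `np.interp` stands in for Matlab's interp1 (whose default method is also linear), and the function names are hypothetical.

```python
import numpy as np

def preprocess(event, n_points=500):
    """Step b: resample one blockade event to n_points samples, then Z-score it."""
    event = np.asarray(event, dtype=float)
    old_x = np.linspace(0.0, 1.0, len(event))
    new_x = np.linspace(0.0, 1.0, n_points)
    resampled = np.interp(new_x, old_x, event)  # linear interpolation
    # Note: a perfectly flat event has zero standard deviation and must be
    # filtered out before this division.
    return (resampled - resampled.mean()) / resampled.std()

def pcc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

def centers_look_consistent(center_a, center_b, threshold=0.3):
    """Step e: require |PCC| >= threshold before and after reversing one center."""
    before = pcc(center_a, center_b)
    after = pcc(center_a, np.asarray(center_b)[::-1])
    ok = abs(before) >= threshold and abs(after) >= threshold
    return ok, before, after
```

For a correct clustering the two centers are expected to be negatively correlated as stored and positively correlated once one of them is reversed, as in the -0.6 and 0.58 example of step e.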
Further, in the above machine learning-based nanopore protein sequencing data processing method, the unlabeled molecule is the beta-amyloid Native Aβ1-42.
Further, in the above machine learning-based nanopore protein sequencing data processing method, the unlabeled molecule is Scrambled Aβ1-42.
Further, in step 2) of the above machine learning-based nanopore protein sequencing data processing method, an encoder-decoder framework is used. The encoder first applies a convolutional neural network, which performs several convolution operations on the input current signals of the different classes to extract their spatial features, and then uses a recurrent neural network to learn their temporal features and capture the temporal correlations in the sequence;
the decoder uses a multi-layer perceptron to perform nonlinear dimensionality reduction of the signal representation and predict the final class of the current signal; the difference between the prediction and the true label is compared over successive iterations and back-propagated to update the parameters of every layer of the network, improving prediction accuracy.
Further, in step 2) of the above machine learning-based nanopore protein sequencing data processing method, two different unlabeled molecules are identified using an encoder-decoder method comprising the following steps:
I. the encoder comprises a convolutional neural network and a recurrent neural network; the convolutional neural network comprises two convolutional layers, each with 256 convolution kernels; each kernel performs a convolution on the input current signal, multiplying the elements of the current sequence by the kernel weights and summing the products, to extract the frequency components and spatial features of the sequence; features of the sequence at different levels are extracted by using different numbers of convolution kernels, giving the output feature sequence of the final convolutional layer;
II. the recurrent neural network in the encoder learns the time-step features of the current signal; while processing the input sequence it retains previously processed information and models the sequence data by taking the hidden state of the previous time step as an input at the current time step, thereby capturing the temporal correlations in the sequence;
III. in the decoder, a multi-layer perceptron maps the feature sequence output by the encoder to the final current signal class, and the output representation is normalized by a sigmoid function to give the probability that the current signal belongs to each class;
IV. the weights of every layer of the model are updated by comparing the error between the model output and the true label, so that the model predicts the class probability more accurately.
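The multiply-and-sum operation of step I can be made concrete with a small NumPy sketch. It is illustrative only: the source gives neither the kernel width nor the activation function, so the width used in the example and the ReLU are assumptions, and `conv1d` is a hypothetical name.

```python
import numpy as np

def conv1d(signal, kernels, bias=None):
    """Valid 1-D convolution (cross-correlation form) with several kernels.

    signal:  shape (length, in_channels)
    kernels: shape (n_kernels, width, in_channels)
    returns: shape (length - width + 1, n_kernels), after an assumed ReLU
    """
    signal = np.asarray(signal, dtype=float)
    kernels = np.asarray(kernels, dtype=float)
    n_kernels, width, _ = kernels.shape
    out_len = signal.shape[0] - width + 1
    out = np.empty((out_len, n_kernels))
    for t in range(out_len):
        window = signal[t:t + width]  # (width, in_channels)
        # Multiply the window by each kernel's weights and sum the products.
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    if bias is not None:
        out += bias
    return np.maximum(out, 0.0)
```

With 256 kernels per layer, as in step I, the output of one layer has 256 channels, and a second layer would take that output as its `signal` with `in_channels` equal to 256.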
Further, in the above machine learning-based nanopore protein sequencing data processing method, the recurrent neural network in step II is a long short-term memory (LSTM) network. The LSTM structure includes three gating mechanisms, an input gate, a forget gate and an output gate, which respectively control what enters the cell state, what is output, and what earlier information is forgotten, enabling effective long-term storage and control of information; each LSTM layer contains 64 neurons, each learning a different feature representation of the signal.
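How the three gates interact within one time step can be sketched in NumPy. This follows the standard LSTM equations rather than any code in the source; the weight layout, random initialization and function names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    Rows are ordered [input gate, forget gate, candidate, output gate].
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new content to write
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much old state to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate cell content
    o = sigmoid(z[3 * units:4 * units])  # output gate: how much state to expose
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(sequence, units=64, seed=0):
    """Run a randomly initialized LSTM layer over a (time, features) array."""
    rng = np.random.default_rng(seed)
    input_dim = sequence.shape[1]
    W = rng.normal(scale=0.1, size=(4 * units, input_dim))
    U = rng.normal(scale=0.1, size=(4 * units, units))
    b = np.zeros(4 * units)
    h = np.zeros(units)
    c = np.zeros(units)
    for x in sequence:
        h, c = lstm_step(x, h, c, W, U, b)
    return h
```

The final hidden state has one value per unit, so a 64-unit layer summarizes the whole feature sequence as a 64-dimensional vector.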
Further, the multi-layer perceptron in step III consists of three fully connected layers: two hidden layers, containing 100 and 50 neurons respectively, and an output layer. Each neuron is connected to all neurons of the previous layer, and each layer combines and abstracts features through a nonlinear transformation to produce a low-dimensional representation of the sequence. After the hidden-layer transformations the signal reaches the output layer, where the output representation is normalized by a sigmoid function to give the probability that the current signal belongs to each class.
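A forward pass through such a perceptron might look like the sketch below. The layer sizes 100 and 50 come from the text; the ReLU hidden activation, the weight initialization, the single sigmoid output unit for the two-class case and the function names are assumptions.

```python
import numpy as np

def init_mlp(input_dim, hidden=(100, 50), seed=0):
    """Create (weights, bias) pairs for input -> 100 -> 50 -> 1."""
    rng = np.random.default_rng(seed)
    sizes = [input_dim, *hidden, 1]
    return [
        (rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def mlp_forward(features, params):
    """Map encoder features of shape (batch, input_dim) to class probabilities."""
    h = np.asarray(features, dtype=float)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # hidden layers, ReLU assumed
    W, b = params[-1]
    logits = h @ W + b
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid output in (0, 1)
```

The output can be read as the probability that an event belongs to one of the two molecule classes, with its complement giving the other.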
In another aspect, the invention discloses the application of the machine learning-based nanopore protein sequencing data processing method to large-scale protein sequencing.
The invention has the following beneficial effects:
When processing the nanopore sequencing data of Native Aβ1-42 and Scrambled Aβ1-42, the problem to be solved before calculating a consensus current trace from hundreds to thousands of single-molecule transmembrane translocation events is how to determine the directionality of each unlabeled single protein molecule during translocation. Because the data admit only two possible directions, a K-Means clustering algorithm with K = 2 is chosen, with DTW as the distance measure. The initialization of the cluster centers is modified: the volume model corresponding to the primary structure of the protein under test and the same model reversed along the time axis are used as the two initial cluster centers and added to the data set. The directionality of the unlabeled time series is thereby determined, and the data set is finally divided into two clusters. Once the directionality of each translocation event is known from the clustering result, the consensus current trace of the protein under test is calculated; it is found to be highly correlated with the corresponding linear amino acid volume model, with a PCC as high as 0.8.
Classifying the sequencing signals of different target proteins with a neural network-based classification algorithm gives efficient and accurate results, with a classification AUC (AUC ∈ [0.5, 1]) of 0.97 ± 0.01. (AUC measures the ability of the model to separate positive and negative samples across different thresholds and can be used to evaluate its overall performance; accuracy is the ratio of correctly classified samples to the total number of samples.) The method captures abstract features in the signals and makes full use of the sequence information and structural features of the different proteins, which enables accurate classification. It is also robust and generalizes well: it can process protein sequencing signals of various types and lengths. Through efficient algorithm design and continued optimization, it achieves high classification speed while maintaining accuracy, so that large numbers of protein sequencing signals can be classified in a short time, improving processing efficiency and productivity.
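As a reminder of what the AUC figure measures, the sketch below computes it from labels and predicted scores using the rank-sum identity. It is a generic illustration with a hypothetical function name, not the evaluation code behind the reported 0.97.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity; ties get average ranks."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank
        i = j + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))
```

The value equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, so 0.5 corresponds to chance and 1.0 to perfect separation.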
Drawings
Fig. 1: Flow chart for determining the directionality of unlabeled single protein molecules translocating through the nanopore, based on the DTW and K-Means clustering algorithm. (a) Raw time-domain current sequences of different lengths generated when single protein molecules translocate through the nanopore. (b) Each blockade event is interpolated to 500 points and Z-Score normalized. (c) Configuration of the K-Means clustering algorithm: K is set to 2, the cluster centers are initialized to the original and reversed volume models, and DTW is the distance measure. A: blockade event; B: cluster center; W: size of the DTW warping window. (d) Clustering result: all events in the data set are grouped into two clusters. Grey: blockade events; black: cluster centers.
Fig. 2: (a) and (d) Distributions of the events in the Native Aβ1-42 data set and the Scrambled Aβ1-42 data set, respectively, over blockade duration and current blockade ratio. Each square represents one blockade event. Black squares: events with a blockade duration between 100 and 500 μs. Red squares: typical blockade events. Contours: the black contour connects the points at 50% × PDFmax in the normalized heat map of the probability density function (PDF) of the data set. (b) Red solid line: consensus current trace of Native Aβ1-42, obtained by averaging the 50 typical events in panel (a). Black solid line: linear amino acid volume model of Native Aβ1-42 (k-mer = 1). Gray solid lines: correspondence between the two sequences when they are aligned using the DTW algorithm. (c) Red solid line: result of aligning the Native consensus current trace to the linear amino acid volume model of Native Aβ1-42 (k-mer = 1) using DTW with W = 12. (e) Red solid line: consensus current trace of Scrambled Aβ1-42, obtained by averaging the 50 typical events in panel (d). Black solid line: linear amino acid volume model of Scrambled Aβ1-42 (k-mer = 1). Gray solid lines: correspondence between the two sequences when they are aligned using the DTW algorithm. (f) Red solid line: result of aligning the Scrambled consensus current trace to the linear amino acid volume model of Scrambled Aβ1-42 (k-mer = 1) using DTW with W = 12.
Fig. 3: Classification algorithm based on a convolutional neural network and a recurrent neural network, which identifies the different translocating molecules. (a) Framework of the neural network-based classification algorithm. (b) Receiver operating characteristic (ROC) curve of the classification algorithm: one Scrambled Aβ1-42 data set and three different Native Aβ1-42 data sets were input into the neural network model for training and evaluation, giving a mean AUC of 0.97.
Detailed Description
The embodiments of the invention are described clearly and completely below. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
Reagents and instruments used in the examples for which no manufacturer is indicated are conventional, commercially available products.
Example 1
Determining the directionality of Native Aβ1-42 translocation through the nanopore.
After the time series data set of the blockade current generated by translocation of beta-amyloid (Native Aβ1-42) through the nanopore was obtained, the directionality of the Native Aβ1-42 blockade events was determined following the flow shown in Fig. 1. The data were first preprocessed: current blockade events of unequal length were uniformly interpolated to 500 points using Matlab's built-in function interp1 and Z-Score normalized to eliminate differences in data scale. The data set was then processed with the K-Means clustering algorithm. According to the characteristics of the data, the number of clusters K was set to 2, and DTW, which can stretch and compress along the time axis, was chosen as the similarity measure between current time series. The two initial cluster centers were specified as the volume model corresponding to the primary structure of Native Aβ1-42 and the same volume model reversed along the time axis, and both volume models were added to the data set to take part in the iteration. During clustering, the distance between each event and each of the two cluster centers was calculated, and the event was assigned to the nearer center; the mean of each cluster was then calculated from the new assignment to replace the previous cluster center. This process was repeated until the cluster centers were no longer updated, at which point the current cluster centers and the assignment of every event in the data set were output. In the end, all events in the data set were assigned to two clusters, with the original and reversed volume models falling into different clusters. The PCC between the two cluster centers was -0.63, and the PCC between cluster center 1 and the reversed cluster center 2 was 0.58, which preliminarily indicates that the clustering result is correct.
Event directionality was then determined from the clustering result, and 50 typical events (shown in Fig. 2(a)) were averaged to obtain the consensus current trace of Native Aβ1-42. The PCC between this trace and the Native Aβ1-42 volume model (k-mer = 1) was 0.84, as shown in Fig. 2(b), so the two are highly correlated. With W = 12, after the consensus current trace was aligned to the amino acid volume model using DTW, the PCC between the aligned trace and the volume model was as high as 0.96, as shown in Fig. 2(c).
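The averaging step in this example can be sketched as follows, assuming the selected events are already resampled to a common length and that events assigned to the reversed-model cluster are flipped before averaging. The function names and the label convention are illustrative assumptions.

```python
import numpy as np

def consensus_trace(events, labels, reversed_label=1):
    """Average events after flipping those assigned to the reversed cluster."""
    events = np.asarray(events, dtype=float)
    labels = np.asarray(labels)
    flip = (labels == reversed_label)[:, None]
    # Reverse the time axis of flipped events so all share one orientation.
    oriented = np.where(flip, events[:, ::-1], events)
    return oriented.mean(axis=0)

def pcc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])
```

Comparing the result with the volume model through `pcc` mirrors the check reported above; for the two to have equal length, the volume model would first need to be resampled to the same number of points as the events.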
Example 2
Determining the directionality of Scrambled Aβ1-42 translocation through the nanopore.
After the time series data set of the blockade current generated by translocation of Scrambled Aβ1-42 through the nanopore was obtained, the directionality of the Scrambled Aβ1-42 blockade events was determined following the flow shown in Fig. 1. The data were first preprocessed: current blockade events of unequal length were uniformly interpolated to 500 points using Matlab's built-in function interp1 and Z-Score normalized to eliminate differences in data scale. The data set was then processed with the K-Means clustering algorithm. According to the characteristics of the data, the number of clusters K was set to 2, and DTW, which can stretch and compress along the time axis, was chosen as the similarity measure between current time series. The two initial cluster centers were specified as the volume model corresponding to the primary structure of Scrambled Aβ1-42 and the same volume model reversed along the time axis, and both volume models were added to the data set to take part in the iteration. During clustering, the distance between each event and each of the two cluster centers was calculated, and the event was assigned to the nearer center; the mean of each cluster was then calculated from the new assignment to replace the previous cluster center. This process was repeated until the cluster centers were no longer updated, at which point the current cluster centers and the assignment of every event in the data set were output. In the end, all events in the data set were assigned to two clusters, with the original and reversed volume models falling into different clusters. The PCC between the two cluster centers was -0.65, and the PCC between cluster center 1 and the reversed cluster center 2 was 0.56, which preliminarily indicates that the clustering result is correct.
Event directionality was then determined from the clustering result, and 50 typical events (shown in Fig. 2(d)) were averaged to obtain the consensus current trace of Scrambled Aβ1-42. The PCC between this trace and the Scrambled Aβ1-42 amino acid volume model (k-mer = 1) was 0.76, as shown in Fig. 2(e), so the two are highly correlated. With W = 12, after the consensus current trace was aligned to the amino acid volume model using DTW, the PCC between the aligned trace and the volume model was as high as 0.96, as shown in Fig. 2(f).
Example 3
Identification of the different molecules of the translocation.
For Native Abeta 1-42 And Scramble Abeta 1-42 The present invention uses encoder-decoder methods to identify two different target proteins by translocating the resulting occlusion current time series data set in the nanopore. The neural network model of the present invention is implemented using the Python programming language and the tensorflow.
Specifically, the convolutional neural network in the encoder section comprises two convolutional layers. Each convolutional layer contains 256 convolution kernels. Each kernel performs a convolution operation on the input current signal, multiplying the elements of the current sequence by the weights in the kernel and summing the products, thereby extracting frequency components, spatial features and the like from the sequence. Features of the sequence at different levels can be extracted by using different numbers of convolution kernels, yielding the output feature sequence of the final convolutional layer. Stacking two convolutional layers improves the expressive capacity and prediction accuracy of the model. Next, a recurrent neural network is used to learn the time-step features of the current signal. While processing the input sequence, a recurrent neural network retains previously processed information and models the sequence data by taking the hidden state of the previous time step as an input at the current time step, thereby capturing temporal correlations in the sequence. The long short-term memory (LSTM) network used in the present invention is a special recurrent neural network structure that can effectively alleviate the vanishing-gradient and exploding-gradient problems when learning long sequences and improves the learning and generalization capacity of the model. The LSTM structure comprises three gating mechanisms: an input gate, a forget gate and an output gate. These gates control, respectively, the input at the current step, the output, and the forgetting of previous information, enabling effective long-term storage and control of information. Each LSTM layer contains 64 neurons, each learning a different feature representation of the signal; stacking three LSTM layers captures long-term dependencies in the sequence more effectively and improves the expressive power of the model.
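To make the two encoder operations concrete, the sketch below shows a single convolution kernel sliding over a current sequence and one single-unit LSTM cell with its three gates. It is a scalar, standard-library illustration only, not the TensorFlow model of the invention; the parameter dictionary `p` and its key names are assumptions introduced for this example.

```python
import math

def conv1d(signal, kernel, bias=0.0):
    """Slide one kernel over the signal: multiply element-wise and sum.

    No padding is used, so the output is shorter than the input.
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM with scalar input x."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add new information
    h = o * math.tanh(c)     # expose part of the memory as the hidden state
    return h, c

def run_lstm(sequence, p):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h, c, outputs = 0.0, 0.0, []
    for x in sequence:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs
```

For example, `conv1d([1, 2, 3, 4], [1, 0, -1])` returns `[-2.0, -2.0]`, since each output is the difference between samples two positions apart. A layer with 256 kernels or 64 LSTM units repeats these operations with separate weights per kernel or unit.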
The pooling layer downsamples the feature sequence output by the LSTM layers to reduce its dimensionality. It aggregates the elements of the feature sequence by taking the mean of each feature channel over the sequence as a representative value, which reduces the size of the feature sequence, lowers the number of parameters and the computational complexity of the model, and improves its robustness.
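The channel-wise averaging described above corresponds to what is commonly called global average pooling. A minimal sketch follows, assuming the feature sequence is stored as a list of time steps, each holding one value per channel; the function name is hypothetical.

```python
def global_average_pool(feature_seq):
    """Average each feature channel over all time steps.

    feature_seq: list of time steps, each a list with one value per channel.
    Returns one representative value per channel.
    """
    steps = len(feature_seq)
    return [sum(channel) / steps for channel in zip(*feature_seq)]
```

For instance, `global_average_pool([[1, 10], [3, 20], [5, 30]])` returns `[3.0, 20.0]`: three time steps with two channels collapse into one vector of two values.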
In the decoder section, a multi-layer perceptron maps the feature sequence output by the encoder to the final current signal class. The multi-layer perceptron consists of three fully connected layers: two hidden layers, containing 100 and 50 neurons respectively, and one output layer. Each neuron is connected to all neurons of the previous layer. Each layer combines and abstracts the different features through a nonlinear transformation, generating a low-dimensional representation of the sequence. Finally, after the transformations of the hidden layers, the signal reaches the output layer, where the output representation is normalized by a sigmoid function to give the probability that the current signal belongs to the final class.
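A forward pass through such a decoder can be sketched as below. The layer sizes (100, 50 and 1) follow the text; the ReLU activation in the hidden layers, the uniform weight initialization and the 64-dimensional input (one value per LSTM neuron after pooling) are assumptions made for this illustration, since the text only specifies a nonlinear transformation.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, activation):
    """Every output neuron is connected to all inputs of the previous layer."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def build_decoder(n_features, rng):
    # Two hidden layers (100 and 50 neurons) and one output neuron.
    return [init_layer(n_features, 100, rng),
            init_layer(100, 50, rng),
            init_layer(50, 1, rng)]

def decoder_forward(features, layers):
    """Return the predicted probability of the positive class."""
    h1 = dense(features, layers[0], relu)     # hidden-layer activation: assumed
    h2 = dense(h1, layers[1], relu)
    return dense(h2, layers[2], sigmoid)[0]   # sigmoid output, as in the text
```

A call such as `decoder_forward([0.1] * 64, build_decoder(64, random.Random(0)))` yields a single value between 0 and 1, which can be read as the probability that the event belongs to one of the two proteins.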
The weights of each layer of the model are updated by comparing the error between the model output and the true labels, so that the model predicts class-membership probabilities more accurately. During training, the weights were updated over 100 rounds of iteration, and the neural network model achieved a classification performance of AUC = 0.97 ± 0.01 on Native Aβ1-42 and Scramble Aβ1-42.
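The error measure and the evaluation metric mentioned above can be illustrated as follows. The text does not name the loss function; binary cross-entropy is shown here as a common choice for a sigmoid output and is an assumption, as is the convention of labelling one protein 1 and the other 0. The AUC is computed from its pairwise definition, namely the fraction of positive-negative pairs that the model ranks correctly.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between true labels (0/1) and predictions."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` returns `0.75`, because three of the four positive-negative pairs are ordered correctly.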
As can be seen from the above embodiments, because the data have only two possible directions, a K-Means clustering algorithm with K = 2 is selected and combined with DTW as the distance measure. The cluster-center initialization is modified so that the volume model corresponding to the primary structure of the single protein molecule under test and the same volume model flipped along the time axis serve as the two initial cluster centers and are added to the dataset. In this way the directionality of the unlabeled time series is determined, and the dataset is finally divided into two clusters. After the directionality of each translocation event is obtained from the clustering result, the consensus current trace of the protein under test is calculated and found to be highly correlated with the corresponding linear amino acid volume model, with a PCC value as high as 0.8.
By using a neural-network-based classification algorithm to classify the sequencing signals of different target proteins, efficient and accurate classification is achieved, with an AUC value of 0.97 ± 0.01. Compared with traditional methods, the algorithm has a strong capability for classifying diverse signals and can accurately distinguish the differences and similarities between different protein sequencing signals. The method captures abstract features in the signals and makes full use of the sequence information and structural features of different proteins, thereby achieving accurate classification. The method is also robust and generalizes well: it can process protein sequencing signals of various types and lengths and has good adaptability and extensibility. Through efficient algorithm design and continuous optimization, a higher classification speed is achieved while accuracy is maintained. The method can classify a large number of protein sequencing signals in a short time, improving processing efficiency and productivity.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be appreciated by persons skilled in the art that the above embodiments are not intended to limit the invention in any way, and that all technical solutions obtained by means of equivalent substitutions or equivalent transformations fall within the scope of the invention.

Claims (10)

1. The machine learning-based nanopore protein sequencing data processing method is characterized by comprising the following steps of:
1) Judging the directionality of unlabeled molecules when translocation occurs through the nanopore by using a clustering algorithm based on the dynamic time warping (DTW) algorithm and K-Means;
2) Judging the identities of different molecules in the translocation process by using a classification algorithm based on a convolutional neural network and a recurrent neural network;
the DTW is a nonlinear warping technique for comparing the similarity of two time series, and finds the optimal alignment path between the two time series by aligning the coordinates in the two time series;
the K-Means is an unsupervised learning algorithm, which measures the similarity between samples through Euclidean distance and assigns samples with high similarity to the same category in an iterative optimization manner;
the unlabeled molecule is a linear protein polypeptide molecule.
2. The machine learning based nanopore protein sequencing data processing method of claim 1, wherein said step 1) comprises the steps of:
firstly, using the similarity between the DTW distance measurement time sequences to construct a distance matrix of a corresponding data set;
setting the clustering class number K value as 2 by using the K-Means clustering principle, taking a linear amino acid crystallography volume model corresponding to the primary structure of the protein to be detected and a sequence turned over along a time axis as two initialization clustering centers, calculating the distance between a time domain current sequence in each blocking event and the two clustering centers, and distributing the distance to the closest clustering center;
after the distribution of all the events is completed, calculating the average value of all the events in the two clusters respectively, updating the clustering center, repeating the processes, continuously updating and iterating the clustering center, improving the accuracy of the clustering result, and stopping iterating until the clustering center is not changed;
and finally, analyzing the clustering result, perfecting a calculation method of the consensus current track, and evaluating the clustering result according to similarity indexes including but not limited to the Pearson correlation coefficient PCC ∈ [-1, 1] to obtain an experimental result with strong positive correlation, namely PCC ≥ 0.5.
3. The machine learning based nanopore protein sequencing data processing method of claim 1, wherein said step 1) comprises the steps of:
a. collecting and acquiring time sequence data of blocking current generated by translocation of unlabeled molecules in the nanopore;
b. preprocessing data, uniformly interpolating current blocking events with unequal lengths to 500 points by using the Matlab built-in function interp1, and normalizing by Z-Score to eliminate the difference of data dimension;
c. processing a data set by using a K-Means clustering algorithm, setting a clustering class number K value to be 2 according to data characteristics, selecting a DTW algorithm which can stretch on a time axis as a similarity measure between current time sequences, customizing two initialized clustering centers respectively to be a volume model corresponding to a primary structure of a label-free molecule and a volume model turned over along the time axis, adding the two volume models into the data set, and participating in an iteration process;
d. in the clustering process, the distance between each event and two clustering centers is calculated respectively, the event is distributed to the clustering center closest to the event, the average value of the two clusters is calculated according to the newly distributed result to replace the original clustering center, the process is repeated until the clustering center is not updated any more, iteration is stopped, and the current clustering center and the distribution result of each event in the data set are output;
e. all events in the data set are distributed into two clusters, the volume models before and after overturning are distributed into the two clusters respectively, PCC values between the two clustering centers are calculated, and the correctness of the clustering result is judged by comparing whether the two clustering centers show a correlation of medium intensity or more, i.e., |PCC| ≥ 0.3, before and after overturning;
f. judging event directionality according to the clustering result, selecting a plurality of typical events for averaging to obtain a consensus current track of the unlabeled molecule, and judging a PCC value between the consensus current track and a volume model of the unlabeled molecule;
g. and selecting an appropriate w value, so that the PCC value between the consensus current track and the volume model is highest after the consensus current track is aligned to the amino acid volume model by DTW.
4. The machine learning based nanopore protein sequencing data processing method according to claim 1, wherein the unlabeled molecule is β-amyloid Native Aβ1-42.
5. The machine learning based nanopore protein sequencing data processing method of claim 1, wherein the unlabeled molecule is Scramble Aβ1-42.
6. The machine learning based nanopore protein sequencing data processing method of claim 1, wherein in step 2), using an encoder-decoder framework, the encoder portion first performs a plurality of convolution operations on the input current signals of different classes using a convolutional neural network to extract spatial features of the current signals, and then learns temporal features thereof using a recurrent neural network to capture temporal correlations in the sequence;
the decoder part uses a multi-layer perceptron to perform nonlinear dimension reduction on the signals and predict the final current signal category, and back-propagates updates to the parameter values of each intermediate level of the neural network by iteratively comparing the difference between the predicted result and the real label, thereby improving the accuracy of the predicted result.
7. The machine learning based nanopore protein sequencing data processing method according to claim 1, wherein in step 2), two different unlabeled molecules are identified using an encoder-decoder method, comprising the steps of:
I. the encoder part comprises a convolutional neural network and a recurrent neural network, wherein the convolutional neural network comprises two convolutional layers, each convolutional layer comprises 256 convolutional kernels, each convolutional kernel carries out a convolution operation on an input current signal, the elements in the current sequence and the weights in the convolutional kernels are multiplied and summed, frequency components and spatial features of the sequence are extracted, and features of different layers of the sequence are extracted through different numbers of convolutional kernels to obtain an output feature sequence of the final convolutional layer;
II. using a recurrent neural network in the encoder to learn the time step characteristics of the current signal, memorizing the information processed previously while processing the input sequence, and modeling the sequence data by taking the hidden state of the previous moment as the input of the current moment, thereby capturing the time correlation in the sequence;
III. in a decoder part, mapping the characteristic sequence output by the encoder into a final current signal class by using a multi-layer perceptron, normalizing the output sequence representation by a sigmoid function, and obtaining the probability of the final class attribution of the current signal;
IV. updating the weight of each level of the model by comparing the error between the output of the model and the real label, so that the model can predict the category attribution probability more accurately.
8. The machine learning-based nanopore protein sequencing data processing method according to claim 7, wherein the recurrent neural network structure in the step II is a long short-term memory (LSTM) structure, and the LSTM structure includes three gating mechanisms of an input gate, a forget gate and an output gate, which respectively control the input, the output and the forgetting of previous information in the current sequence, so as to realize effective long-term information storage and control; each LSTM layer contains 64 neurons, each learning a different characteristic representation of the signal.
9. The machine learning based nanopore protein sequencing data processing method of claim 7, wherein the multi-layer perceptron in step III is composed of three fully connected layers, including two hidden layers and an output layer, the two hidden layers containing 100 and 50 neurons respectively, each neuron being connected to all neurons of the previous layer, each layer combining and abstracting different features by nonlinear transformation to generate a low dimensional representation of the sequence, and finally after transformation of the hidden layers, the signal reaches the output layer, and after normalization of the output sequence representation by a sigmoid function, the probability of final class assignment of the current signal is obtained.
10. Use of the machine learning based nanopore protein sequencing data processing method of any of claims 1-9 in large scale protein sequencing.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310705437.4A CN116741265A (en) 2023-06-14 2023-06-14 Machine learning-based nanopore protein sequencing data processing method and application thereof

Publications (1)

Publication Number Publication Date
CN116741265A true CN116741265A (en) 2023-09-12

Family

ID=87904053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310705437.4A Pending CN116741265A (en) 2023-06-14 2023-06-14 Machine learning-based nanopore protein sequencing data processing method and application thereof

Country Status (1)

Country Link
CN (1) CN116741265A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095754A (en) * 2023-10-19 2023-11-21 江苏正大天创生物工程有限公司 Method for classifying proteins by machine learning
CN117095754B (en) * 2023-10-19 2023-12-29 江苏正大天创生物工程有限公司 Method for classifying proteins by machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination