WO2023070493A1 - Rna定位预测方法、装置及存储介质 - Google Patents

RNA定位预测方法、装置及存储介质 (RNA localization prediction method, device and storage medium)

Info

Publication number
WO2023070493A1
WO2023070493A1 (PCT/CN2021/127273)
Authority
WO
WIPO (PCT)
Prior art keywords
feature information
lncrna
mer
sequence
value
Prior art date
Application number
PCT/CN2021/127273
Other languages
English (en)
French (fr)
Inventor
胡玉兰
张振中
Original Assignee
京东方科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司 filed Critical 京东方科技集团股份有限公司
Priority to PCT/CN2021/127273 priority Critical patent/WO2023070493A1/zh
Priority to CN202180003135.1A priority patent/CN117501374A/zh
Publication of WO2023070493A1 publication Critical patent/WO2023070493A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present disclosure relates to the field of biological information, in particular to a method, device and storage medium for RNA location prediction.
  • Long non-coding ribonucleic acid (LncRNA) refers to non-protein-coding RNA molecules whose sequences are longer than 200 bases.
  • LncRNAs play different biological functions in different subcellular locations. For example, lncRNAs located in the nucleus usually participate in the regulation of gene transcription, and lncRNAs located in the cytoplasm usually regulate gene expression post-transcriptionally.
  • LncRNA subcellular localization can be determined by experimental methods or by computational prediction methods. However, because experimental determination of LncRNA subcellular localization is time-consuming, labor-intensive and largely blind, developing accurate and efficient computational prediction methods is becoming increasingly important.
  • Existing computational methods for predicting the subcellular location of LncRNA mainly include the following three.
  • The first is LncLocator, a method based on autoencoders and ensemble learning. It represents sequence features with 3-mers, extracts high-level sequence features with stacked autoencoders, and performs classification with an ensemble of SVM, random forest and stacked networks.
  • The second is iLoc-lncRNA, a method based on PseKNC feature representation. It extracts sequence features by combining 8-mers with pseudo K-tuple nucleotide composition (PseKNC), applies a binomial-distribution-based feature selection method, and classifies with a Support Vector Machine (SVM).
  • The third is DeepLncRNA, a sequence-feature-based deep learning method. It uses three types of features (k-mer encoding, RNA-binding-protein motif features and genomic loci) and performs binary nucleus/cytoplasm classification with Deep Neural Networks (DNN).
  • Because all of these methods encode the LncRNA sequence with k-mer frequencies, they are prone to feature sparsity and to losing the order information of the sequence in continuous space, so the accuracy of LncRNA subcellular localization prediction is limited.
  • the present disclosure provides an RNA localization prediction method, device and storage medium to solve the above-mentioned technical problems existing in the prior art.
  • In a first aspect, the RNA localization prediction method provided by the embodiments of the present disclosure is as follows:
  • obtain the sequence feature information and structural feature information of the LncRNA to be located; compute the sequence feature information and/or the structural feature information based on an attention mechanism to obtain their attention value; and input the attention value into a classification prediction model to obtain the location prediction result of the LncRNA to be located.
  • In a possible implementation, obtaining the sequence feature information of the LncRNA to be located includes: performing k-mer coding on the sequence of the LncRNA to be located to obtain at least one k-mer coding set, where the k-mers in each coding set contain the same number of bases and the k-mers in different coding sets contain different numbers of bases; obtaining the embedded vector representation of each k-mer in each k-mer coding set based on a k-mer pre-training model; and
  • using a convolutional neural network to extract the sequence feature information of the LncRNA to be located from all the embedded vector representations.
  • In a possible implementation, performing k-mer coding on the sequence of the LncRNA to be located to obtain multiple k-mer coding sets includes:
  • according to the k corresponding to each k-mer coding set, taking k consecutive bases at a time, starting from the first base of the sequence of the LncRNA to be located, to form one k-mer of the corresponding coding set, until the last k bases of the sequence have been taken, which yields the corresponding k-mer coding set; the first bases of two adjacent k-mers in the same k-mer coding set are adjacent in the LncRNA sequence, and k is a natural number.
  • In a possible implementation, the training process of the k-mer pre-training model includes:
  • performing k-mer coding on the sequence of each second LncRNA in a second LncRNA set to obtain multiple second k-mer coding sets for each sequence; using all the second k-mer coding sets together with multiple special characters as the vocabulary of a BERT model; and iteratively training the BERT model with all the second k-mer coding sets to predict the embedded vector representations of the masked elements in the second k-mer coding sets, stopping when the value of the BERT model's loss function no longer decreases, which yields the k-mer pre-training model. The BERT model contains only the MASK-LM task; in that task the special characters partially mask the elements of the second k-mer coding sets, different second k-mer coding sets have different masking rates, and the masking rate is the proportion of special characters in a second k-mer coding set after masking.
  • the convolutional neural network includes:
  • the convolution layer includes a plurality of convolution kernels of different sizes, and each convolution kernel is used to perform a convolution operation on a matrix corresponding to the embedded vector representation;
  • a maximum pooling layer connected to the output of the convolution layer, which segments the convolution results output by the convolution layer and combines the maximum feature value obtained from each segment into the sequence feature information.
  • In a possible implementation, obtaining the structural feature information of the LncRNA to be located includes: converting the secondary structure of the LncRNA to be located into a tree structure; and
  • using a Tree LSTM to extract tree-structure features from the tree structure as the structural feature information of the LncRNA to be located.
  • In a possible implementation, converting the secondary structure of the LncRNA to be located into a tree structure includes:
  • starting from the first base of the sequence of the LncRNA to be located and following the base-pairing relations of the secondary structure, taking each complementarily paired base pair as the root node for the following base and taking each unpaired base as a leaf node of the previous node, until the last base of the sequence, which yields the tree structure; when the first base is unpaired, the root node of the tree structure is empty.
  • In a possible implementation, using the Tree LSTM to extract tree-structure features from the tree structure includes:
  • starting from the leaf nodes of the tree structure, taking the outputs of all child nodes of the node currently being processed as the input of that node, and updating the gating vectors and memory cell of the current node according to the states of its child nodes, until the current node is the root node of the tree structure, where the input of a leaf node is the corresponding base; and
  • taking the output of the root node as the tree-structure feature.
  • In a possible implementation, computing the sequence feature information and/or the structural feature information based on an attention mechanism to obtain their attention value includes: calculating a correlation value between the value of each dimension of the sequence feature information and the structural feature information; normalizing the correlation value of each dimension to obtain a first attention weight for the value of each dimension of the sequence feature information; and
  • summing, based on the attention mechanism, the products of each first attention weight and the structural feature information to obtain the attention value of the sequence feature information relative to the structural feature information.
  • In a possible implementation, computing the sequence feature information and/or the structural feature information based on an attention mechanism to obtain their attention value includes: calculating a correlation value between the value of each dimension of the structural feature information and the sequence feature information; normalizing the correlation value of each dimension to obtain a second attention weight for the value of each dimension of the structural feature information; and
  • summing, based on the attention mechanism, the products of each second attention weight and the sequence feature information to obtain the attention value of the structural feature information relative to the sequence feature information.
  • In a possible implementation, inputting the attention value into a classification prediction model to obtain the location prediction result of the LncRNA to be located includes: inputting the attention value into the classification prediction model to obtain the location prediction values of the LncRNA to be located; and
  • taking the subcellular location corresponding to the highest probability among the location prediction values as the subcellular location of the LncRNA to be located, which gives the location prediction result.
  • In a possible implementation, the training process of the classification prediction model includes: obtaining the first sequence feature information and first structural feature information of each first LncRNA in a labelled first LncRNA set; computing a first attention value for each first LncRNA from its first sequence feature information and first structural feature information based on the attention mechanism; inputting the first attention value into the classification prediction model to obtain a first location prediction value of the corresponding first LncRNA; and computing a loss value from the first location prediction value and the label value of the corresponding first LncRNA, then adjusting the parameters of the classification prediction model by back-propagation until the loss value meets a preset condition, which yields the trained classification prediction model.
  • In a second aspect, an embodiment of the present disclosure also provides a device for RNA localization prediction, including: at least one processor, and
  • a memory connected to the at least one processor;
  • the memory stores instructions executable by the at least one processor, and the at least one processor executes the method described in the first aspect above by executing the instructions stored in the memory.
  • an embodiment of the present disclosure further provides a readable storage medium, including:
  • the memory is used to store instructions, and when the instructions are executed by the processor, the device including the readable storage medium completes the method described in the first aspect above.
  • FIG. 1 is a flowchart of a method for predicting RNA localization provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of k-mer coding for a sequence of an LncRNA to be located according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of k-mer encoding of another LncRNA sequence to be located according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of obtaining an Embedding representation of a k-mer coding set provided by an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of obtaining sequence feature information of an LncRNA to be located provided by an embodiment of the present disclosure
  • Figure 6 is a schematic diagram of converting the secondary structure of the LncRNA to be located into a tree structure provided by an embodiment of the present disclosure
  • FIG. 7 is a schematic structural diagram of the Tree LSTM provided by an embodiment of the present disclosure.
  • Fig. 8 is a schematic diagram of obtaining the predicted value of the LncRNA sequence to be located provided by the embodiment of the present disclosure.
  • the embodiments of the present disclosure provide an RNA localization prediction method, device and storage medium to solve the above-mentioned technical problems existing in the prior art.
  • an embodiment of the present disclosure provides a method for predicting RNA localization, and the process of the method is as follows.
  • Step 101 Obtain sequence feature information and structural feature information of the LncRNA to be located.
  • LncRNAs may be distributed in subcellular locations such as the cytoplasm, nucleus, chromatin, nucleolus and mitochondria, and LncRNAs in different subcellular locations can play different biological roles; for example, LncRNAs in the nucleus usually participate in regulating gene transcription, while LncRNAs in the cytoplasm usually regulate gene expression post-transcriptionally.
  • the sequence of the LncRNA is the arrangement of bases in the LncRNA.
  • Biologists can obtain the sequence feature information and structural feature information of the LncRNA to be located manually; to improve efficiency, the present disclosure also provides the following ways of obtaining the sequence feature information and structural feature information of the LncRNA to be located.
  • The aforementioned k-mer coding of the sequence of the LncRNA to be located, which yields multiple k-mer coding sets, can be implemented as follows: according to the k corresponding to each k-mer coding set, k consecutive bases are taken at a time, starting from the first base of the sequence of the LncRNA to be located, to form one k-mer of the corresponding coding set, until the last k bases of the sequence have been taken, which gives the corresponding k-mer coding set; the first bases of two adjacent k-mers in the same k-mer coding set are adjacent in the LncRNA sequence, and k is a natural number.
  • For example, suppose the sequence of the LncRNA to be located has length L and is R = R_1R_2R_3…R_L, where each R_i ∈ {A, U, G, C}. For k = 3, starting from R_1, three bases are taken at a time in an overlapping manner until all bases have been taken: first the three bases R_1R_2R_3, then the three bases starting from R_2 (that is, R_2R_3R_4), and so on until the last three bases R_{L-2}R_{L-1}R_L, forming a 3-mer coding set [R_1R_2R_3, R_2R_3R_4, …, R_{L-2}R_{L-1}R_L] with L-2 elements.
  • Alternatively, three bases can be taken from R_1 in a non-overlapping manner until all bases are taken, for example R_1R_2R_3 first, then R_4R_5R_6, and so on, forming a 3-mer coding set with about L/3 elements.
  • Likewise, for k = 4 taken in a non-overlapping manner, the coding set is R_4' = [R_1R_2R_3R_4, R_5R_6R_7R_8, …, R_{L-3}R_{L-2}R_{L-1}R_L], with L/4 elements in total.
  • It should be understood that the length of the sequence of the LncRNA to be located is the number of bases it contains, each element of a k-mer coding set is a fragment of that sequence, and the length of a k-mer coding set is the number of elements it contains. When k > 1 and the bases of the LncRNA sequence are taken in an overlapping manner, the number of overlapping bases between two adjacent fragments can be chosen according to actual needs and is not limited to the above examples. A minimal sketch of the overlapping scheme is given below.
  • Performing k-mer coding with several different values of k at the same time prevents the feature sparsity and the loss of sequence order information in continuous space that a single k-mer coding may cause, and thus improves the accuracy of LncRNA subcellular localization prediction. After the multiple k-mer coding sets of the sequence are obtained, they are input into the k-mer pre-training model to obtain the embedding vector representation of each k-mer coding set, and a convolutional neural network is then used to extract the sequence feature information of the LncRNA to be located from all the embedding vector representations.
  • the training process of the above-mentioned k-mer pre-training model includes:
  • k-mer coding is performed on the sequence of each second LncRNA in the second LncRNA set to obtain multiple second k-mer coding sets for each sequence; all the second k-mer coding sets, together with multiple special characters, are used as the vocabulary of the BERT model; the BERT model is then trained iteratively with all the second k-mer coding sets to predict the embedding vector representations of the masked elements in the second k-mer coding sets, and training stops when the value of the BERT model's loss function no longer decreases, which yields the k-mer pre-training model. The BERT model contains only the MASK-LM task; in that task special characters partially mask the elements of the second k-mer coding sets, different second k-mer coding sets have different masking rates, and the masking rate is the proportion of special characters in a second k-mer coding set after masking.
  • the second LncRNA in the second LncRNA set may be an unlabeled historical LncRNA collected from the RNAcentral database (that is, a historical LncRNA without annotated subcellular location).
  • After the second LncRNA set is obtained, k-mer coding is performed on the sequence of each second LncRNA in it to obtain the multiple second k-mer coding sets corresponding to each second LncRNA sequence. Specifically, the sequences of the second LncRNAs are k-mer coded in the same way, and with the same range of k, as the sequence of the LncRNA to be located; for example, if the LncRNA to be located requires 1-mer to 4-mer coding, then each second LncRNA is also coded from 1-mer to 4-mer, and the details are not repeated here. All the second k-mer coding sets, together with the special characters '<pad>', '<mask>', '<cls>', '<sep>' and '<unk>', form the vocabulary (the k-mer vocabulary) of the BERT model.
  • the traditional BERT model includes two training tasks, Masked LM and Next Sentence Prediction.
  • Masked LM is described as: Given a sentence, one or several words in the sentence are randomly erased, and it is required to predict what the erased words are based on the remaining vocabulary. For example, in a sentence, 15% of the words are randomly selected for prediction.
  • Next Sentence Prediction is described as: given two sentences from a document, determine whether the second sentence immediately follows the first in the text. During pre-training, 50% correct sentence pairs and 50% incorrect sentence pairs are randomly selected from the text corpus for training; combined with the Masked LM task, this lets the BERT model describe semantic information at the sentence and even document level more accurately.
  • The BERT model in this disclosure contains only the MASK-LM task, that is, the module corresponding to the Next Sentence Prediction task of the traditional BERT model is removed (which reduces training complexity). In the MASK-LM task, multiple special characters are used to partially mask the elements of the second k-mer coding sets, and different second k-mer coding sets have different masking rates, the masking rate being the proportion of special characters in a second k-mer coding set after masking. For example, when training the embedding vector representations, 40% of the 1-mers, 20% of the 2-mers, 20% of the 3-mers and 20% of the 4-mers are masked. A sketch of such per-k masking is given below.
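  • The per-k masking rates quoted above could be applied as in the following illustrative sketch; the "<mask>" token and the helper name are assumptions.

```python
import random

# Apply a different masking rate per k, as described above (40% for 1-mers,
# 20% for 2-/3-/4-mers); illustrative, not the patent's exact implementation.
MASK_RATES = {1: 0.40, 2: 0.20, 3: 0.20, 4: 0.20}

def mask_kmer_tokens(tokens: list[str], k: int, rng: random.Random) -> list[str]:
    masked = list(tokens)
    n_mask = int(len(tokens) * MASK_RATES[k])
    for idx in rng.sample(range(len(tokens)), n_mask):
        masked[idx] = "<mask>"          # replace the chosen k-mers with the mask token
    return masked
```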
  • Because the BERT model in this disclosure uses multiple second k-mer coding sets as its k-mer vocabulary, its semantic information is richer, and because multiple special characters are added in the MASK-LM task, the predictive power of the BERT model is improved.
  • In the embodiments provided by this disclosure, the BERT model is composed of multiple Transformer units, multiple multi-head attention heads, hidden layers and so on, where the number of hidden-layer nodes is an integer multiple of the total number of attention heads.
  • For example, the BERT model uses 12 Transformer units and 12 attention heads, the number of hidden-layer nodes is 768, and the position-vector embedding dimension is 6149. A configuration sketch of such a model is given below.
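  • A rough configuration sketch using the Hugging Face transformers library is shown below, assuming the sizes quoted above; the vocabulary size (all 1- to 4-mer tokens plus five special characters) and the reading of 6149 as the maximum position index are assumptions.

```python
from transformers import BertConfig, BertForMaskedLM

# Masked-LM-only BERT of the quoted size (12 layers, 12 heads, hidden size 768).
config = BertConfig(
    vocab_size=4 + 16 + 64 + 256 + 5,   # assumed: 1- to 4-mer tokens + 5 special characters
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=6149,       # assumed reading of the "position vector" dimension
)
model = BertForMaskedLM(config)
```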
  • After the BERT model containing only the MASK-LM task has been built and the second k-mer coding sets of the second LncRNA set have been prepared, the BERT model is trained iteratively with the second k-mer coding sets of each second LncRNA to predict the embedding vector representations of the masked elements, and training stops when the value of the BERT model's loss function no longer decreases, yielding the k-mer pre-training model. The k-mer pre-training model can then be used to obtain the embedding vector representation of each k-mer coding set of the LncRNA to be located.
  • For example, if the dimension of the k-mer pre-training model is D and k runs from 1 to 4, the embedding vector representations corresponding to the 1-mer to 4-mer coding sets of the sequence can be represented in turn by matrices of size L × D, (L-1) × D, (L-2) × D and (L-3) × D, where L is the length of the sequence of the LncRNA to be located.
  • FIG. 4 is a schematic diagram of obtaining an embedding vector representation of a k-mer coding set provided by an embodiment of the present disclosure.
  • the convolutional neural network can be used to extract the sequence feature information of the LncRNA to be located from all the embedding vector representations.
  • the above convolutional neural network includes:
  • the convolution layer includes a plurality of convolution kernels of different sizes, and each convolution kernel is used to perform a convolution operation on a matrix corresponding to the embedding vector representation.
  • a maximum pooling layer connected to the output of the convolution layer, which segments the convolution results output by the convolution layer and combines the maximum feature value of each segment into the sequence feature information.
  • For example, the convolution layer may contain three convolution kernels of sizes h = 3, 4 and 5, i.e. kernels of dimensions 3 × D, 4 × D and 5 × D. Convolving these kernels with the matrix corresponding to each embedding vector representation gives, for each representation, a convolution result of size (n-h+1) × 1 × m, where n is the number of elements in the corresponding k-mer coding set, h is the kernel size and m is the number of kernels in the convolution layer.
  • The convolution calculation formula is:
  • C_k' = g(w · x_{k':k'+h-1} + b);
  • where C_k' is the output of the k'-th convolution operation of a kernel of size h, k' = 1 to n, g() is the activation function, w is the convolution kernel, b is the bias of the convolution kernel, and x_{k':k'+h-1} denotes rows k' to k'+h-1 of the matrix corresponding to the embedding vector representation of the k-mer coding set.
  • Using this formula, the convolution results of the embedding vector representation of each k-mer coding set can be computed, giving all the convolution results of the embedding vector representations of the LncRNA to be located, denoted C = (C_1, C_2, …, C_n), i.e. the convolution results output by the convolution layer.
  • Afterwards, the maximum pooling layer segments the convolution results output by the convolution layer, takes the maximum feature value of each segment (which corresponds to extracting primary features), and combines these maxima into the sequence feature information of the LncRNA to be located (which corresponds to extracting higher-level features). The pooling formula used by the maximum pooling layer is:
  • P = (max C_{1:q}, …, max C_{n-q+1:n});
  • where P is the sequence feature information of the LncRNA to be located, q = n/p, p is the number of segments, C_{1:q} denotes rows 1 to q of the matrix corresponding to the convolution results output by the convolution layer, C_{n-q+1:n} denotes rows n-q+1 to n of that matrix, max C_{1:q} takes the largest feature in C_{1:q}, and max C_{n-q+1:n} takes the largest feature in C_{n-q+1:n}. A sketch of this convolution and pooling step is given below.
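  • The convolution and segmented max-pooling formulas above can be sketched in NumPy as follows; the choice of ReLU as the activation g() and the function names are assumptions.

```python
import numpy as np

# C_k' = g(w · x_{k':k'+h-1} + b) over an n x D embedding matrix, then split the
# resulting vector C into p segments and keep the maximum value of each segment.
def conv1d_over_rows(x: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    n, _ = x.shape
    h = w.shape[0]                                            # kernel covers h consecutive rows
    out = np.empty(n - h + 1)
    for i in range(n - h + 1):
        out[i] = np.maximum(0.0, np.sum(w * x[i:i + h]) + b)  # g() assumed to be ReLU
    return out                                                # C = (C_1, ..., C_{n-h+1})

def segmented_max_pool(c: np.ndarray, p: int) -> np.ndarray:
    q = len(c) // p                                           # q = n / p
    return np.array([c[i * q:(i + 1) * q].max() for i in range(p)])
```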
  • FIG. 5 is a schematic diagram of obtaining sequence feature information of an LncRNA to be located according to an embodiment of the present disclosure.
  • the sequence feature information under different k-mer conditions is extracted through the convolutional neural network. Since the convolutional neural network can be calculated in parallel, it can quickly capture sequence feature information of different scales.
  • The secondary structure of the LncRNA to be located can be represented as a planar graph, as secondary-structure plane text (a CT file), or in dot-bracket notation.
  • The core idea of dot-bracket notation is to use "(" and ")" to indicate that two bases are complementarily paired and "." to indicate that a base is unpaired.
  • the secondary structure of the LncRNA to be located can be calculated by the program, and can also be obtained by database search, experimental verification and other ways.
  • Taking the sequence "AGUGAAGGCACAAGCCUUAC" of an LncRNA to be located as an example, its secondary structure in dot-bracket notation is ".((((.(((....)))))))". After the secondary structure of the LncRNA to be located is obtained, it is converted into a tree structure: starting from the first base of the sequence and following the base-pairing relations of the secondary structure, each complementarily paired base pair is taken as the root node for the following base and each unpaired base is taken as a leaf node of the previous node, until the last base of the sequence, which yields the tree structure; when the first base is unpaired, the root node of the tree structure is empty.
  • FIG. 6 is a schematic diagram, provided by an embodiment of the present disclosure, of converting the secondary structure of the LncRNA to be located into a tree structure.
  • In FIG. 6 the sequence of the LncRNA to be located is "AGUGAAGGCACAAGCCUUAC" and its secondary structure in dot-bracket notation is ".((((.(((....)))))))". Because A is unpaired, the root node is empty and A and G are leaves of the empty root node. Because G has a complementary base (C), the base pair (G-C) serves as the root node for the next base (U); because base U pairs with base A, the base pair U-A serves as the root node for the next base, and so on until the last base (C) of the sequence, giving the tree structure shown in FIG. 6.
  • the above tree structure may be stored according to the hierarchical order of the nodes in the tree structure.
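  • The conversion from dot-bracket notation to a nested tree of this kind can be sketched as follows; the Node class and the exact construction details are illustrative assumptions rather than the patent's implementation.

```python
# Each matched "(" ... ")" pair becomes an internal node holding the base pair, and
# each "." becomes a leaf attached to the innermost open pair (or to an empty root).
class Node:
    def __init__(self, label):
        self.label = label                    # e.g. "G-C" for a pair, "A" for an unpaired base
        self.children = []

def dotbracket_to_tree(seq: str, structure: str) -> Node:
    root = Node("")                           # empty root, used when the first base is unpaired
    stack = [root]
    for base, sym in zip(seq, structure):
        if sym == "(":
            node = Node(base)                 # closing base is appended when ")" is seen
            stack[-1].children.append(node)
            stack.append(node)
        elif sym == ")":
            node = stack.pop()
            node.label = f"{node.label}-{base}"   # complete the base-pair label
        else:                                 # "." : unpaired base -> leaf of the current node
            stack[-1].children.append(Node(base))
    return root

tree = dotbracket_to_tree("AGUGAAGGCACAAGCCUUAC", ".((((.(((....)))))))")
```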
  • After the tree structure of the LncRNA to be located has been obtained, a Tree LSTM can be used to extract tree-structure features from it, and these are used as the structural feature information of the LncRNA to be located.
  • Using the Tree LSTM to extract tree-structure features from the tree structure can be implemented as follows:
  • starting from the leaf nodes of the tree structure, the outputs of all child nodes of the node currently being processed are used as the input of that node, and the gating vectors and memory cell of the current node are updated according to the states of its child nodes, until the current node is the root node of the tree structure; the input of a leaf node is the corresponding base, and
  • the output of the root node is used as the tree-structure feature.
  • FIG. 7 is a schematic structural diagram of the Tree LSTM provided by an embodiment of the present disclosure.
  • In FIG. 7, node 2, node 4, node 5 and node 6 are leaf nodes; nodes 4 to 6 are the child nodes of node 3; nodes 2 and 3 are the child nodes of node 1; and node 1 is the root node. x1 to x6 are the inputs of the corresponding nodes in the tree, and y1 to y6 are their outputs.
  • Taking node 3 in FIG. 7 as the current node, the outputs (y4 to y6) of all child nodes of node 3 (nodes 4 to 6) are used as the input of node 3, and the gating vectors and memory cell of node 3 are updated according to the states of nodes 4 to 6. The gating vectors and memory cell of node 1 are updated in the same way, finally giving the output of node 1, which is taken as the tree-structure feature of the tree shown in FIG. 7.
  • The above processing can be expressed with the Tree-LSTM formulas:
  • h̃_j = Σ_{k∈C(j)} h_k;
  • i_j = σ(W^(i) x_j + U^(i) h̃_j + b^(i));
  • f_jk = σ(W^(f) x_j + U^(f) h_k + b^(f));
  • o_j = σ(W^(o) x_j + U^(o) h̃_j + b^(o));
  • u_j = tanh(W^(u) x_j + U^(u) h̃_j + b^(u));
  • c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_jk ⊙ c_k;
  • h_j = o_j ⊙ tanh(c_j);
  • where j is the current node in the tree, C(j) is the set of child nodes of the current node j, and x_j is the input vector of the current node j; i_j is the input gate of the current node j, f_jk is the forget gate for each child node k of the current node j, o_j is the output gate of the current node j, c_j is the memory cell of the current node j, h_j is the hidden state of the current node j, and u_j is the temporary memory cell of the current node j; W^(i), W^(f), W^(o), W^(u), U^(i), U^(f), U^(o), U^(u) are weight matrices and b^(i), b^(f), b^(o), b^(u) are bias vectors; h̃_j is the sum of the hidden states of the child nodes of the current node j, σ is the sigmoid function, and ⊙ denotes element-wise multiplication.
  • The input of the Tree-LSTM is a set of child nodes, and its output encodes the child nodes to produce their parent node; the dimension of the parent node is the same as that of each child node. That is, starting from the bottom of the tree, the vectors produced by encoding the child nodes at the same level are used as the input of the corresponding parent node, until the root node at the top of the tree has been processed; the output of the root node is used as the tree-structure feature of the whole tree and serves as the structural feature information of the LncRNA to be located. A minimal sketch of a single node update is given below.
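  • A minimal NumPy sketch of one Child-Sum Tree-LSTM node update, following the equations above, is given below; the parameter shapes, dictionary layout and initialisation are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(x_j, child_h, child_c, W, U, b):
    """One node update. W, U, b are dicts keyed by 'i', 'f', 'o', 'u';
    child_h / child_c are lists of the child nodes' hidden states and memory cells."""
    h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros(U["i"].shape[1])
    i = sigmoid(W["i"] @ x_j + U["i"] @ h_tilde + b["i"])     # input gate
    o = sigmoid(W["o"] @ x_j + U["o"] @ h_tilde + b["o"])     # output gate
    u = np.tanh(W["u"] @ x_j + U["u"] @ h_tilde + b["u"])     # temporary memory cell
    c = i * u
    for h_k, c_k in zip(child_h, child_c):
        f_k = sigmoid(W["f"] @ x_j + U["f"] @ h_k + b["f"])   # one forget gate per child
        c = c + f_k * c_k
    h = o * np.tanh(c)                                        # hidden state of the node
    return h, c
```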
  • After the sequence feature information and structural feature information of the LncRNA to be located have been obtained, step 102 can be performed.
  • Step 102 Calculate the sequence feature information and/or structure feature information based on the attention mechanism, and obtain the attention value of the sequence feature information and/or structure feature information.
  • the attention calculation is performed on the sequence feature information and/or the structure feature information, and the attention value of the sequence feature information and/or the structure feature information can be obtained, which can be realized in the following two ways:
  • The first way: calculate the correlation value between the value of each dimension of the sequence feature information and the structural feature information; normalize the correlation value of each dimension to obtain the first attention weight of the value of each dimension of the sequence feature information; and, based on the attention mechanism, sum the products of each first attention weight and the structural feature information to obtain the attention value of the sequence feature information relative to the structural feature information.
  • The second way: calculate the correlation value between the value of each dimension of the structural feature information and the sequence feature information; normalize the correlation value of each dimension to obtain the second attention weight of the value of each dimension of the structural feature information; and, based on the attention mechanism, sum the products of each second attention weight and the sequence feature information to obtain the attention value of the structural feature information relative to the sequence feature information.
  • For example, suppose the sequence feature information of the LncRNA to be located is A = [a_1, a_2, …, a_s] and its structural feature information is B = [b_1, b_2, …, b_t], where s and t are the dimensions of the vectors A and B; A can be regarded as a 1 × s tensor and B as a 1 × t tensor. Let Q = B and K = V = A, i.e. the query is B and the key and value are A, where Q, K, V, query, key and value are the usual Transformer quantities. The attention weights are computed as α_i" = Softmax(f(Q_i", K)), where Softmax() is the normalization function, f is the function that computes the correlation between Q_i" and K, f(Q_i", K) = B_i" W A with i" = 1 to t, Q_i" = B_i" = b_i", and W is initialized with a random value.
  • The computed weights are then combined with the corresponding values in a weighted sum, giving the attention value of the structural feature information relative to the sequence feature information, Attention(Q, K, V), which is a 1 × s tensor. In the same way, the attention of the sequence feature information relative to the structural feature information can be computed by setting Q = A and K = V = B, in which case the result is a 1 × t tensor. A rough sketch of this computation is given below.
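  • The sketch below follows the two steps literally (per-dimension correlation, Softmax, weighted sum). The exact form of the correlation function f (here the feature value times a learned projection of the other vector) and the feature dimensions are assumptions, not the patent's precise choice.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention_over(query_vec: np.ndarray, other_vec: np.ndarray, w: np.ndarray) -> np.ndarray:
    # one correlation score per dimension of query_vec (assumed form of f)
    scores = np.array([q_i * float(w @ other_vec) for q_i in query_vec])
    alpha = softmax(scores)                                   # attention weights
    # weighted sum of the weights with the other feature vector (the "value")
    return np.sum(alpha[:, None] * other_vec[None, :], axis=0)

rng = np.random.default_rng(0)
seq_feat = rng.normal(size=8)     # sequence feature information A (s = 8, assumed size)
struct_feat = rng.normal(size=5)  # structural feature information B (t = 5, assumed size)
w = rng.normal(size=8)
attn = attention_over(struct_feat, seq_feat, w)               # 1 x s tensor, as in the text
```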
  • Step 103 can be executed after obtaining the sequence feature information and/or the attention value of the structure feature information of the LncRNA to be located in the above manner.
  • Step 103: Input the attention value into the classification prediction model to obtain the location prediction result of the LncRNA to be located.
  • Before the attention value of the sequence feature information and/or structural feature information of the LncRNA to be located is input into the classification prediction model, the classification prediction model itself must first be obtained by training. The training process of the classification prediction model is as follows:
  • obtain the first sequence feature information and first structural feature information of each first LncRNA in a labelled first LncRNA set; compute the first attention value of each first LncRNA from its first sequence feature information and first structural feature information based on the attention mechanism; input the first attention value into the classification prediction model to obtain the first location prediction value of the corresponding first LncRNA; compute a loss value from the first location prediction value and the label value of the corresponding first LncRNA; and adjust the parameters of the classification prediction model by back-propagation until the loss value meets a preset condition, which yields the trained classification prediction model.
  • The preset condition may be that the value of the loss function on the validation set no longer decreases, or that the accuracy on the training set or validation set no longer improves.
  • the first LncRNA in the above-mentioned first LncRNA set can be a labeled historical LncRNA collected from the RNALocate database (that is, the subcellular location of the historical LncRNA is marked), and the RNALocate database is a database dedicated to RNA subcellular localization.
  • the latest version of the database RNALocatev2.0 has recorded more than 210,000 RNA-related subcellular localization entries and experimental data, involving more than 110,000 RNAs, including 171 subcellular localizations of 104 species.
  • At present, 9587 historical LncRNAs can be extracted from the RNALocate database; after duplicates are removed, 6897 distinct historical LncRNAs remain and form the first LncRNA set of the present disclosure. Their locations are distributed over 40 different subcellular locations, including the cytoplasm, nucleus, chromatin, nucleolus and mitochondria.
  • After the first LncRNA set is obtained, the first sequence feature information and first structural feature information of each first LncRNA in the set are obtained in the same way as the sequence feature information and structural feature information of the LncRNA to be located, which is not repeated here. Afterwards, the first sequence feature information and first structural feature information of each first LncRNA are processed based on the attention mechanism to obtain the corresponding first attention values, and all first attention values are split into a training set and a validation set; the training set is used to train a Multilayer Perceptron (MLP), and the validation set is used to validate the trained MLP until the validation result meets the preset condition, which gives the classification prediction model.
  • The training parameters of the multilayer perceptron include: the optimizer may be SGD, the batch size may be set to 32, dropout may be set to 0.001, the number of epochs may be set to 100, and the dimension of the embedding vector may be set to 768. A configuration sketch is given below.
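  • A minimal PyTorch sketch of such an MLP classifier with the quoted settings follows; the hidden-layer size, learning rate, and the number of output classes (40, matching the labelled set above) are assumptions not specified in this passage.

```python
import torch
from torch import nn

num_classes = 40                              # assumed: one class per subcellular location
model = nn.Sequential(
    nn.Linear(768, 256),                      # 768-dimensional attention value as input
    nn.ReLU(),
    nn.Dropout(p=0.001),                      # dropout value quoted above, read as a probability
    nn.Linear(256, num_classes),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # lr is an assumption
criterion = nn.CrossEntropyLoss()
# training loop: 100 epochs with mini-batches of 32 from a DataLoader (omitted)
```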
  • After the classification prediction model is obtained, its performance can also be evaluated with 5-fold cross-validation; the performance indicators include accuracy (ACC), sensitivity (Sn), specificity (Sp), precision (Pre), F1-score, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUC).
  • Once the performance of the classification prediction model meets the requirements, the attention value corresponding to the LncRNA to be located can be input into the classification prediction model to obtain its prediction result, which can be implemented as follows:
  • the attention value is input into the classification prediction model to obtain the location prediction values of the LncRNA to be located, and the subcellular location corresponding to the maximum probability among those values is taken as the location of the LncRNA, which gives the location prediction result.
  • FIG. 8 is a schematic diagram of obtaining the prediction values of the LncRNA to be located according to an embodiment of the present disclosure.
  • In FIG. 8 the sequence of the LncRNA to be located is AGUGAAGGCACAAGCCUUAC and its secondary structure is ".((((.(((....)))))))". The sequence feature information and structural feature information of this LncRNA can be obtained in the way described above; inputting them into the classification prediction model gives the predicted localization values of the LncRNA for each subcellular location: cytoplasm 0.757, cytosol 0.182, ribosome 0.001, endoplasmic reticulum 0.014, exosome 0.035, synapse 0.013.
  • The subcellular location with the highest probability among these prediction values (the cytoplasm) is taken as the location of the LncRNA to be localized, which gives the localization prediction result (see the small example below).
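  • As a tiny illustration of this final selection step, using the probabilities quoted for FIG. 8 (variable names are illustrative):

```python
# Pick the subcellular location with the highest predicted probability.
pred = {"cytoplasm": 0.757, "cytosol": 0.182, "ribosome": 0.001,
        "endoplasmic reticulum": 0.014, "exosome": 0.035, "synapse": 0.013}
predicted_location = max(pred, key=pred.get)   # -> "cytoplasm"
```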
  • In the embodiments provided by the present disclosure, the sequence feature information and structural feature information of the LncRNA to be located are obtained; the sequence feature information and/or structural feature information are processed based on an attention mechanism to obtain their attention value; and the attention value is input into the classification prediction model to obtain the location prediction result of the LncRNA to be located. In this way, the correlation between the sequence feature information and the structural feature information of the LncRNA is fully considered when predicting its subcellular location, which improves the accuracy of the location prediction.
  • an embodiment of the present disclosure provides a device for LncRNA location prediction, including: at least one processor, and
  • a memory connected to the at least one processor
  • the memory stores instructions executable by the at least one processor, and the at least one processor executes the above-mentioned RNA localization prediction method by executing the instructions stored in the memory.
  • embodiments of the present disclosure also provide a readable storage medium, including:
  • the memory is used to store instructions, and when the instructions are executed by the processor, the device including the readable storage medium completes the RNA localization prediction method described above.
  • the readable storage medium can be any available medium or data storage device that can be accessed by a processor, including volatile memory or nonvolatile memory, or can include both volatile memory and nonvolatile memory.
  • nonvolatile memory may include read-only memory (Read-Only Memory, ROM), programmable ROM (Programmable read-only memory, PROM), electrically programmable ROM (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable ROM (Electrically Erasable Programmable read only memory, EEPROM) or flash memory, solid state hard drive (Solid State Disk or Solid State Drive, SSD), magnetic memory (such as floppy disk, hard disk, tape , Magneto-Optical disc (Magneto-Optical disc, MO), etc.), optical storage (such as CD, DVD, BD, HVD, etc.).
  • Volatile memory can include Random Access Memory (RAM), which can act as external cache memory.
  • RAM Random Access Memory
  • By way of example and not limitation, RAM is available in many forms, such as dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM) and synchronous link DRAM (SLDRAM).
  • the storage devices of the disclosed aspects are intended to include, but are not limited to, these and other suitable types of memory.
  • Those skilled in the art should understand that the embodiments of the present disclosure may be provided as a method, a system or a program product. Accordingly, the embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present disclosure may take the form of a computer program product implemented on one or more readable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) containing computer/processor usable program code.
  • The embodiments of the present disclosure are described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to the embodiments of the present disclosure. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks therein, can be implemented by computer program instructions. These program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.
  • These program instructions may also be stored in a readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the readable memory produce an article of manufacture including an instruction device, which realizes the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.
  • These program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer/processor-implemented processing, whereby the instructions executed on the computer or other programmable device provide the steps for realizing the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An RNA localization prediction method, device and storage medium. The method includes: obtaining sequence feature information and structural feature information of an LncRNA to be located (101); computing the sequence feature information and/or the structural feature information based on an attention mechanism to obtain an attention value of the sequence feature information and/or the structural feature information (102); and inputting the attention value into a classification prediction model to obtain a location prediction result of the LncRNA to be located (103).

Description

RNA定位预测方法、装置及存储介质 技术领域
本公开涉及生物信息领域,尤其是涉及RNA定位预测方法、装置及存储介质。
背景技术
长链非编码核糖核酸(Long non-coding Ribonucleic Acid,lncRNA)是指序列长度大于200个碱基的非编码蛋白质的RNA分子。
研究证明,LncRNA在不同位置的亚细胞发挥着不同的生物学功能,如,定位于细胞核的lncRNA通常参与调节基因的转录,定位于细胞质的LncRNA通常转录后调控基因表达。通常,LncRNA亚细胞定位包括实验方法和计算预测方法,然而由于实验方法测定LncRNA亚细胞存在定位费时费力且盲目性较高,这使得开发准确、高效的计算预测方法变得越来越重要。
目前已有的计算LncRNA的亚细胞定位预测方法主要包括以下几种:
第一种,基于自编码器和集成学习的方法LncLocator,该方法使用3-mer表示序列特征,利用堆叠栈式自编码器抽取序列高层特征,并使用SVM、随机森林和堆叠网络集成学习进行分类预测;第二种,基于PseKNC特征表示的方法iLoc-lncRNA,该方法通过8-mer和伪K-元组核苷酸组成(pseudo K-tuple nucleotide composition,PseKNC)结合抽取序列特征,使用基于二项分布的特征选择方法,并利用支持向量机(Support Vector Machine,SVM)进行分类预测;第三种,基于序列特征的深度学习方法DeepLncRNA,该方法使用k-mer编码、RNA结合蛋白基序特征、基因组位点三种类型的特征,使用深度神经网络(Deep Neural Networks,DNN)进行细胞核和细胞质的二分类预测。
由于上述几种方法都是使用LncRNA的序列的k-mer频率编码序列,容易出现特征稀疏,丢失连续空间中序列的顺序信息等问题,导致LncRNA亚 细胞定位预测的准确度不高。
发明内容
本公开提供RNA定位预测方法、装置及存储介质,用以解决现有技术中存在的上述技术问题。
第一方面,为解决上述技术问题,本公开实施例提供的一种RNA定位预测方法的技术方案如下:
获取待定位LncRNA的序列特征信息和结构特征信息;
基于注意力机制对所述序列特征信息和/或所述结构特征信息进行计算,获得所述序列特征信息和/或所述结构特征信息的注意力值;
将所述注意力值输入分类预测模型,获得所述待定位LncRNA的定位预测结果。
一种可能的实施方式,获取待定位LncRNA的序列特征信息,包括:
对所述待定位LncRNA的序列进行k-mer编码,获得至少一个k-mer编码集;其中,每个k-mer编码集中的k-mer包含的碱基数量相同,不同k-mer编码集的k-mer包含的碱基数量不同;
基于k-mer预训练模型获取每个k-mer编码集中的每个k-mer编码的嵌入向量表示;
用卷积神经网络从所有嵌入向量表示中,提取所述待定位LncRNA的序列特征信息。
一种可能的实施方式,对所述待定位LncRNA的序列进行k-mer编码,获得多个k-mer编码集,包括:
根据每个k-mer编码集对应的k,从所述待定位LncRNA的序列的第一个碱基开始依次取连续的k个碱基构成对应k-mer编码集中的一个k-mer,直至取完所述待定位LncRNA的序列中的最后k个碱基,构成对应k-mer编码集;其中,同一k-mer编码集中相邻两个k-mer的首个碱基在所述LncRNA的序列中相邻,k为自然数。
一种可能的实施方式,所述k-mer预训练模型的训练过程,包括:
对所述第二LncRNA集中的每个第二LncRNA的序列进行K-me r编码,获得所述每个第二LncRNA的序列对应的多个第二k-mer编码集;
将所有第二k-mer编码集以及多个特殊字符作为BERT模型的词表;
用所有第二k-mer编码集迭代训练所述BERT模型预测所述第二k-mer编码集中被遮蔽的元素的嵌入向量表示,直至所述BERT模型预测的损失函数的值不再下降停止训练,获得所述k-mer预训练模型;其中,所述BERT模型仅包含MASK-LM任务,在所述MASK-LM任务中使用所述特殊字符对第二k-mer编码集中的元素进行部分遮蔽,且不同第二k-mer编码集对应的遮蔽率不同,所述遮蔽率为遮蔽后的第二k-mer编码中特殊字符的占比。
一种可能的实施方式,所述卷积神经网络,包括:
卷积层,所述卷积层包括多个不同大小的卷积核,每个卷积核用于对所述嵌入向量表示对应的矩阵进行卷积运算;
最大池化层,与所述卷积层的输出端连接,用于对所述卷积层输出的卷积运算结果进行分段,并将获取的每个分段中的最大特征值组合为所述序列特征信息。
一种可能的实施方式,获取待定位LncRNA的结构特征信息,包括:
将所述待定位LncRNA的二级结构转换为树结构;
用Tree Lstm从所述树结构中提取树结构特征,作为所述待定位LncRNA的结构特征信息。
一种可能的实施方式,将所述待定位LncRNA的二级结构转换为树结构,包括:
从所述待定位LncRNA的序列的第一个碱基开始,根据所述二级结构中碱基的配对关系,将碱基互补配对的碱基对作为所述树结构的根节点,将未配对的碱基作为所述树结构中上一节点的叶子节点,直至待定位LncRNA的序列的最后一个碱基,获得所述树结构;其中,当所述第一个碱基未配对时,所述树结构的根节点为空。
一种可能的实施方式,用Tree Lstm从所述树结构中提取树结构特征,包括:
从所述树结构的叶子节点开始,将所述树结构中当前正在处理的当前节点的所有孩子节点的输出作为所述当前节点的输入,并根据所述孩子节点的状态更新所述当前节点对应的门控向量和记忆单元,直至所述当前节点为所述树结构的根节点;其中,所述叶子节点的输入为对应的碱基;
将所述根节点的输出作为所述树结构特征。
一种可能的实施方式,基于注意力机制对所述序列特征信息和/或所述结构特征信息进行计算,获得所述序列特征信息和/或所述结构特征信息的注意力值,包括:
计算所述序列特征信息中每个维度的值与所述结构特征信息的相关性值;
对所述序列特征信息中每个维度对应的相关性值进行归一化计算,获得所述序列特征信息中每个维度的值的第一注意力权重;
基于所述注意力机制,对每个第一注意力权重与所述结构特征信息的积进行和运算,获得所述序列特征信息相对所述结构特征信息的注意力值。
一种可能的实施方式,基于注意力机制对所述序列特征信息和/或所述结构特征信息进行计算,获得所述序列特征信息和/或所述结构特征信息的注意力值,包括:
计算所述结构特征信息中每个维度的值与所述序列特征信息的相关性值;
对所述结构特征信息中每个维度对应的相关性值进行归一化计算,获得所述结构特征信息中每个维度的值的第二注意力权重;
基于所述注意力机制,对每个第二注意力权重与所述序列特征信息的积进行和运算,获得所述结构特征信息相对所述序列特征信息的注意力值。
一种可能的实施方式,将所述注意力值输入分类预测模型,获得所述待定位LncRNA的定位预测结果,包括:
将所述注意力值输入分类预测模型,获得所述待定位LncRNA的定位预测值;
将所述定位预测值中最大概率对应的亚细胞,作为所述待定位LncRNA所在的亚细胞,获得所述定位预测结果。
一种可能的实施方式,所述分类预测模型的训练过程,包括:
获取有标签的第一LncRNA集中每个第一LncRNA的第一序列特征信息和第一结构特征信息;
基于注意力机制对每个第一LncRNA的第一序列特征信息和第一结构特征信息进行计算得到第一注意力值;
将所述第一注意力值输入分类预测模型,获得对应第一LncRNA的第一定位预测值;
基于所述第一定位预测值和对应第一LncRNA的标签值计算损失值,通过反向传播算法调整所述分类预测模型中的参数,直至所述损失值达到预设条件,获得训练好的分类预测模型。
第二方面,本公开实施例还提供一种RNA定位预测的装置,包括:
至少一个处理器,以及
与所述至少一个处理器连接的存储器;
其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述至少一个处理器通过执行所述存储器存储的指令,执行如上述第一方面所述的方法。
第三方面,本公开实施例还提供一种可读存储介质,包括:
存储器,
所述存储器用于存储指令,当所述指令被处理器执行时,使得包括所述可读存储介质的装置完成如上述第一方面所述的方法。
附图说明
图1为本公开实施例提供的一种RNA定位预测方法的流程图;
图2为本公开实施例提供的一种对待定位LncRNA的序列进行k-mer编码的示意图;
图3为本公开实施例提供的另一种对待定位LncRNA的序列进行k-mer编码的示意图;
图4所示为本公开实施例提供的获得k-mer编码集的Embedding表示的示意图;
图5为本公开实施例提供的获取待定位LncRNA的序列特征信息的示意图;
图6为本公开实施例提供的将待定位LncRNA的二级结构转换为树结构的示意图;
图7为本公开实施例提供的Tree LSTM的结构示意图;
图8为本公开实施例提供的获得待定位LncRNA序列的预测值的示意图。
具体实施方式
本公开实施列提供RNA定位预测方法、装置及存储介质,以解决现有技术中存在的上述技术问题。
为了更好的理解上述技术方案,下面通过附图以及具体实施例对本公开技术方案做详细的说明,应当理解本公开实施例以及实施例中的具体特征是对本公开技术方案的详细的说明,而不是对本公开技术方案的限定,在不冲突的情况下,本公开实施例以及实施例中的技术特征可以相互组合。
请参考图1,本公开实施例提供一种RNA定位预测方法,该方法的处理过程如下。
步骤101:获取待定位LncRNA的序列特征信息和结构特征信息。
LncRNA可能分布如细胞质、细胞核、染色质、核仁、线粒体等亚细胞中,LncRNA位于不同亚细胞中可以发挥不同的生物学功能,如在细胞核中的LncRNA通常参与调节基因的转录,位于细胞质的LncRNA通常转录后调控基因表达。
LncRNA的序列即LncRNA中碱基的排列,生物学工作者可以手动获取待定位LncRNA的序列特征信息及结构特征信息,为了提高工作效率,本公 开还提供了以下方式获取待定位LncRNA的序列特征信息及结构特征信息。
针对获取待定位LncRNA的序列特征信息,可以通过下列方式实现:
对待定位LncRNA的序列进行k-mer编码,获得至少一个k-mer编码集;其中,每个k-mer编码集中的k-mer包含的碱基数量相同,不同k-mer编码集的k-mer包含的碱基数量不同;基于k-mer预训练模型获取每个k-mer编码集中每个k-mer编码的嵌入向量表示;用卷积神经网络从所有嵌入向量表示中,提取待定位LncRNA的序列特征信息。
上述对待定位LncRNA的序列进行k-mer编码,获得多个k-mer编码集,可以通过下列方式实现:
根据每个k-mer编码集对应的k,从待定位LncRNA的序列的第一个碱基开始依次取连续的k个碱基构成对应k-mer编码集中的一个k-mer,直至取完待定位LncRNA的序列中的最后k个碱基,构成对应k-mer编码集;其中,同一k-mer编码集中相邻两个k-mer的首个碱基在LncRNA的序列中相邻,k为自然数。
例如,序列长度为L的待定位LncRNA的序列为R=R 1R 2R 3…R L,其中,R i’∈(A,U,G,C),i’的取值范围为1~L,R i’表示待定位LncRNA的序列中第i’个碱基,L为自然数,A、U、G、C表示LncRNA中的碱基。
若k=1,则从R 1开始取1个碱基(即R 1对应的1个碱基),取完后再取R 2对应的1个碱基,…,直到取完最后一个碱基(R L对应的碱基),形成1-mer编码集,可以表示为R 1’=[R 1,R 2,R 3,…,R L],共L个元素。
若k=2,则从R 1开始以重叠的方式连续取2个碱基直至取完所有碱基,如先取R 1R 2对应的2个碱基,取完后从R 2开始取2个碱基(即R 2R 3对应的2个碱基),…,直到取完最后两个碱基(R L-1R L对应的碱基),形成2-mer编码集,可以表示为R 2’=[R 1R 2,R 2R 3,…,R L-1R L],共L-1个元素;或者,也可以从R 1开始非重叠的方式连续取2个碱基序列直至取完所有碱基,如先取R 1R 2对应的2个碱基,再取R 3R 4,…,直至取完最后两个碱基R L-1R L,形成2-mer编码集,可以表示为R 2’=[R 1R 2,R 3R 4,…,R L-1R L],共L/2个元素。
若k=3,则从R 1开始以重叠的方式连续取3个碱基直至取完所有碱基,如先取R 1R 2R 3对应的3个碱基,取完后从R 2开始取3个碱基(即R 2R 3R 4对应的3个碱基),…,直到取完最后三个碱基(R L-2R L-1R L对应的碱基),形成3-mer编码集,可以表示为R 3’=[R 1R 2R 3,R 2R 3R 4,…,R L-2R L-1R L],共L-2个元素,如图2所示为本公开实施例提供的一种对待定位LncRNA的序列进行k-mer编码的示意图。或者,也可以从R 1开始以非重叠的方式连续取3个碱基,直至取完所有碱基,如先取R 1R 2R 3,再取R 4R 5R 6,…,直到取完最后三个碱基(R L-2R L-1R L),形成3-mer编码集,可以表示为R 2’=[R 1R 2R 3,R 4R 5R 6,…,R L-2R L-1R L],共L/3个元素如图3所示为本公开实施例提供的另一种对待定位LncRNA的序列进行k-mer编码的示意图。
若k=4,则从R 1开始以重叠的方式连续取4个碱基,直至取完所有碱基,如先取R 1R 2R 3R 4对应的4个碱基,取完后从R 2开始取4个碱基(即R 2R 3R 4R 5对应的4个碱基),…,直到取完最后四个碱基(R L-3R L-2R L-1R L对应的碱基),形成4-mer编码集,可以表示为R 4’=[R 1R 2R 3R 4,R 2R 3R 4R 5,…,R L-3R L-2R L-1R L],共L-3个元素;或者,从R 1开始以非重叠的方式连续取4个碱基,直至取完所有碱基,如先取R 1R 2R 3R 4,再取R 5R 6R 7R 8,…,直至取完最后四个碱基(R L-3R L-2R L-1R L对应的碱基),形成4-mer编码集,可以表示为R 4’=[R 1R 2R 3R 4,R 5R 6R 7R 8,…,R L-3R L-2R L-1R L],共L/4个元素。
需要理解的是,待定位LncRNA的序列的长度也就是其包含的碱基数量,k-mer编码集中的每个元素为待定位LncRNA的序列中的一个片段,k-mer编码集的长度为k-mer编码集中包含元素的数量,在k>1时,采用重叠的方式取LncRNA的序列中的碱基时,相邻两个片段中重叠的碱基数量可以根据实际需要确定,不限于上述举例。
通过对待定位LncRNA的序列同时进行不同k值的k-mer编码,可以防止进行单一k-mer编码而出现序列特征稀疏,而导致连续空间中序列信息丢失的问题,进而提高LncRNA亚细胞定位预测的准确率。
在得到待定位LncRNA的序列的多个k-mer编码集后,将它们输入k-mer 预训练模型,得到各个k-mer编码集对应的嵌入向量表示,之后用卷积神经网络从所有嵌入向量表示中提取待定位LncRNA的序列特征信息。
上述k-mer预训练模型的训练过程,包括:
对第二LncRNA集中的每个第二LncRNA的序列进行k-mer编码,获得每个第二LncRNA的序列对应的多个第二k-mer编码集;将所有第二k-mer编码集以及多个特殊字符作为BERT模型的词表;用所有第二k-mer编码集迭代训练BERT模型预测第二k-mer编码集中被遮蔽的元素的嵌入向量表示,直至BERT模型的损失函数的值不再下降停止训练,获得k-mer预训练模型;其中,BERT模型仅包含MASK-LM任务,在MASK-LM任务中使用特殊字符对第二k-mer编码集中元素进行部分遮蔽,且不同第二k-mer编码集对应的遮蔽率不同,遮蔽率为遮蔽后的第二k-mer编码中特殊字符的占比。
第二LncRNA集中的第二LncRNA可以是从RNAcentral数据库收集的无标签的历史LncRNA(即未标注亚细胞位置的历史LncRNA)。
在得到第二LncRNA集后,对其中的每个第二LncRNA的序列进行k-mer编码,得到每个第二LncRNA序列对应的多个第二k-mer编码集;具体的,对第二LncRNA的序列进行k-mer编码的方式,与前述介绍的对待定位LncRNA的序列进行k-mer编码的方式相同,且k的取值范围相同,如前述待定位LncRNA需要进行1-mer编码~4-mer编码,则第二LncRNA也需要进行1-mer编码~4-mer编码,对此不再赘述。
将上述所有第二k-mer编码集以及多个特殊字符作为BERT模型的词表(也可称之为k-mer词表),其中,特殊字符包括'<pad>','<mask>','<cls>','<sep>','<unk>'。
通常,传统BERT模型包括Masked LM和下一句预测(NextSentence Prediction)两个与训练任务。Masked LM的任务描述为:给定一句话,随机抹去这句话中的一个或几个词,要求根据剩余词汇预测被抹去的几个词分别是什么。如,在一句话中随机选择15%的词汇用于预测,对于在原句中被抹去的词汇,80%情况下采用一个特殊符号[MASK]替换,10%情况下采用一个 任意词替换,剩余10%情况下保持原词汇不变,这样在预测一个词汇时,BERT模型并不知道输入对应位置的词汇是否为正确的词汇(10%概率),这就迫使BERT模型更多地依赖于上下文信息去预测词汇,并且赋予了BERT模型一定的纠错能力。而Next Sentence Prediction的任务描述为:给定一篇文章中的两句话,判断第二句话在文本中是否紧跟在第一句话之后,在实际预训练过程中,从文本语料库中随机选择50%正确语句对和50%错误语句对进行训练,与Masked LM任务相结合,让BERT模型能够更准确地刻画语句乃至篇章层面的语义信息。
本公开中的BERT模型是仅包含MASK-LM任务的BERT模型,即取消了传统BERT模型中NextSentence Prediction任务对应的模块(这样可以减少训练复杂度),并且在MASK-LM任务中使用多个特殊字符对第二k-mer编码集中元素进行部分遮蔽,且不同第二k-mer编码集对应的遮蔽率不同,遮蔽率为遮蔽后的第二k-mer编码中特殊字符的占比,如训练1-mer编码集对应的嵌入向量表示,在遮蔽时,mask(遮蔽)掉40%的1-mer,mask掉20%的2-mer,mask掉20%的3-mer,20%的4-kmer。
由于本公开中的BERT模型使用的是多个第二k-mer编码集作为k-mer词表,使得其语义信息更加丰富,并且由于在MASK-LM任务中增加了多个特殊字符,可以提高BERT模型的预测能力。
在本公开提供的实施例中,BERT模型由多个Transformer结构单元、多个多头注意力、隐藏层等组成,其中,隐藏层的节点数是多头注意力总数的整数倍。如,BERT模型使用12个Transformer结构单元,12个多头注意力,隐藏层节点个数为768,位置向量Embedding维度取6149。
搭建好上述仅包含MASK-LM任务的BERT模型,以及准备好第二LncRNA集对应的第二k-mer编码集后,用各个第二LncRNA对应的多个第二k-mer编码集迭代训练BERT模型预测第二k-mer编码集中被遮蔽的元素的嵌入向量表示,直至BERT模型的损失函数的值不再下降停止训练,获得k-mer预训练模型。这样就可以用k-mer预训练模型,获取待定位LncRNA的各个 k-mer编码集的嵌入向量表示,如k-mer预训练模型的维度为D,k为1~4,则待定位LncRNA的序列的1-mer编码集~4-mer编码集对应的嵌入向量表示可以依次用以下矩阵表示:L×D、(L-1)×D、(L-2)×D、(L-3)×D,L为待定位LncRNA的序列的长度。如图4所示为本公开实施例提供的获得k-mer编码集的嵌入向量表示的示意图。
在得到待定位LncRNA的序列的各个k-mer编码集的嵌入向量表示后,可以使用卷积神经网络从所有嵌入向量表示中,提取待定位LncRNA的序列特征信息。
上述卷积神经网络包括:
卷积层,卷积层包括多个不同大小的卷积核,每个卷积核用于对嵌入向量表示对应的矩阵进行卷积运算。
最大池化层,与卷积层的输出端连接,用于对卷积层输出的卷积运算结果进行分段,并将获取的每个分段中的最大特征值组合为序列特征信息。
如,卷积层中可以包括3个卷积核,这3个卷积核的大小h=3、4、5,相应的这3个卷积核的尺寸为3×D、4×D、5×D,用上述3个卷积核与每个嵌入向量表示对应的矩阵做卷积,得到每个嵌入向量表示对应的卷积结果为(n-h+1)×1×m的矩阵,其中n为嵌入向量表示对应k-mer编码集中元素总数,h为卷积核的大小、m为卷积层中卷积核的总数。
卷积计算公式为:
C k’=g(wx k’:k’+h-1+b);
其中,C k’为卷积核h的第k’个卷积运算的输出结果,k’=1~n,g()为激活函数,w为卷积核,b为卷积核的偏置,x k’:k’+h-1表示k’-mer编码集的嵌入向量表示对应矩阵中的第k’行到第k’+h-1行的元素。采用卷积计算公式可以计算出每个k-mer编码集的嵌入向量表示,进而得到待定位LncRNA对应的嵌入向量表示的所有卷积运算结果(记为C,也就是卷积层输出的卷积运算结果,C=(C 1,C 2,…,C n)。
之后,用最大池化层,将卷积层输出的卷积运算结果分段,并获取的每 个分段中的最大特征值(相当于提取初级特征),将这些最大特征值组合为待定位LncRNA的序列特征信息(相当于提取高级特征)。
最大池化层采用的池化计算公式:
P=(maxC 1:q,…,maxC n-q+1:n);
其中,P为待定位LncRNA的序列特征信息,q=n/p,p为分段数,C 1:q为卷积层输出的卷积结果对应矩阵的第1~q行,C n-q+1:n为卷积层输出的卷积结果对应矩阵的第n~n-q+1,maxC 1:q为取C 1:q中的最大特征,maxC n-q+1:n为取C n-q+1:n中的最大特征。请参见图5为本公开实施例提供的获取待定位LncRNA的序列特征信息的示意图。
通过卷积神经网络抽取不同k-mer条件下的序列特征信息,由于卷积神经网络可以并行计算,因此能够快速捕获不同尺度的序列特征信息。
针对获取待定位LncRNA的结构特征信息,可以通过下列方式实现:
将待定位LncRNA的二级结构转换为树结构;用Tree Lstm从树结构中提取树结构特征,作为待定位LncRNA的结构特征信息。
待定位LncRNA的二级结构可以用平面图形表示、二级结构平面图文本(CT文件)表示和点括号图(Dot-bracket notation)等表示。其中,点括号图表示方法的核心思想是用“(”和“)”表示两个碱基互补配对,“.”表示碱基不配对。待定位LncRNA的二级结构可通过程序计算而得,也可通过数据库检索,实验验证等其它途径获得。
以待定位LncRNA的序列“AGUGAAGGCACAAGCCUUAC”为例,其二级结构用点括号图表示法可以表示为:“.((((.(((....)))))))”。
在得到上述待定位LncRNA的二级结构后,将其转换为树结构,具体可以通过下列方式实现:
从待定位LncRNA的序列的第一个碱基开始,根据二级结构中碱基的配对关系,将碱基互补配对的碱基对作为树结构的根节点,将未配对的碱基作为树结构中上一节点的叶子节点,直至待定位LncRNA的序列的最后一个碱基,获得树结构;其中,当第一个碱基未配对时,树结构的根节点为空。
请参见图6为本公开实施例提供的将待定位LncRNA的二级结构转换为树结构的示意图。
图6中待定位LncRNA的序列为:“AGUGAAGGCACAAGCCUUAC”,二级结构用点括号图为:“.((((.(((....)))))))”,由于A未配对,因此根节点为空,A和G为空根节点的叶子,由于G有互补配对的碱基(C),因此将碱基对(GC)做为下一碱(U)的根节点,由于碱基U与碱基A互补配对,因此将碱基对UA作为下一碱基的根节点,直至待定位LncRNA的序列的最后一个碱基(C),得到如图6所示的树结构。上述树结构可以按树结构中节点的层序进行存储。
在获得待定位LncRNA的树结构后,可以利用Tree Lstm从待定位LncRNA的树结构中提取树结构特征,并将之作为待定位LncRNA的结构特征信息。
用Tree Lstm从树结构中提取树结构特征,可以通过下列方式实现:
从树结构的叶子节点开始,将树结构中当前正在处理的当前节点的所有孩子节点的输出作为当前节点的输入,并根据孩子节点的状态更新当前节点对应的门控向量和记忆单元,直至当前节点为树结构的根节点;其中,叶子节点的输入为对应的碱基;将根节点的输出作为树结构特征。
请参见图7为本公开实施例提供的Tree LSTM的结构示意图。图7中节点2、节点4、节点5、节点6为叶子节点,节点4~节点6为节点3的孩子节点,节点2和节点3为节点1的孩子节点,节点1为根节点;x1~x6为树中对应节点的输入,y1~y6为树中对应节点的输出,以图7中的节点3为当前节点为例,将节点3的所有孩子节点(节点4~节点6)的输出(y4~y6)作为节点3的输入,并根据节点4~节点6的状态更新节点3的门控向量和记忆单元;同理可以更新节点1的门控向量和记忆单元,最后得到节点1的输出,将节点1的输出作为图7所示的树结构的树结构特征。
上述处理方式,用Tree LSTM公式表示如下:
h̃ j=∑ k∈C(j)h k；
i j=δ(W (i)x j+U (i)h̃ j+b (i))；
f jk=δ(W (f)x j+U (f)h k+b (f))；
o j=δ(W (o)x j+U (o)h̃ j+b (o))；
u j=tanh(W (u)x j+U (u)h̃ j+b (u))；
c j=i j⊙u j+∑ k∈C(j)f jk⊙c k
h j=o j⊙tanh(c j);
其中,j为树中的当前节点,C(j)表示当前节点j的孩子节点的集合。x j为当前节点j的输入向量。
i j是当前节点j的输入门,f jk是当前节点j的每个孩子节点k的遗忘门,o j是当前节点j的输出门,c j为当前节点j的记忆单元,h j是当前节点j的隐藏层状态,u j为当前节点j的临时记忆单元;W (i),W (f),W (o),W (u),U (i),U (f),U (o),U (u)为权重矩阵;b (i),b (f),b (o),b (u)为偏置向量;
h̃ j=∑ k∈C(j)h k
为当前节点j的孩子节点隐藏层的求和;δ为sigmoid函数;⊙表示一种乘法运算(按照元素乘法)。
Tree-LSTM的输入是多个孩子节点,输出是将孩子节点编码后产生父节点,父节点的维度和每个孩子节点是相同的。即,从树的最底层开始,将位于同一层的孩子节点编码后产生的向量作为对应父节点的输入,直至处理完树的最顶层的根节点,将根节点的输出作为整棵树的树结构特征,将之作为待定位LncRNA的结构特征信息。
在获得待定位LncRNA的序列特征信息和结构特征信息之后,便可执行步骤102。
步骤102:基于注意力机制对序列特征信息和/或结构特征信息进行计算,获得序列特征信息和/或结构特征信息的注意力值。
基于注意力机制对序列特征信息和/或结构特征信息进行注意力计算,获 得序列特征信息和/或结构特征信息的注意力值,可以通过下列两种方式实现:
第一种方式:计算序列特征信息中每个维度的值与结构特征信息的相关性值;对序列特征信息中每个维度对应的相关性值进行归一化计算,获得序列特征信息中每个维度的值的第一注意力权重;基于注意力机制,对每个第一注意力权重与结构特征信息的积进行和运算,获得序列特征信息相对结构特征信息的注意力值。
第二种方式:计算结构特征信息中每个维度的值与序列特征信息的相关性值;对结构特征信息中每个维度对应的相关性值进行归一化计算,获得结构特征信息中每个维度的值的第二注意力权重;基于注意力机制,对每个第二注意力权重与序列特征信息的积进行和运算,获得结构特征信息相对序列特征信息的注意力值。
例如,假设待定位LncRNA的序列特征信息为A=[a 1,a 2,…,a s],其中A为一个多维向量,s为多维向量A的维度,a 1、a 2、a s分别代表多维向量A中对应维度的值;待定位LncRNA的结构特征信息为B=[b 1,b 2,…,b t],其中B为一个多维向量,t为多维向量B的维度,b 1、b 2、b t分别代表多维向量B中对应维度的值,可以将A视为1×s的张量,B视为1×t的张量。
令Q=B、K=V=A,即query=B,key和value为X,其中,Q、K、V、query、key、value为transformer中的参数,注意力值的计算方式为:
α i"=Softmax(f(Q i",K))；
其中,α i"为注意力权重,Softmax()为归一化函数,f为计算Q i”和K相关性的函数,f(Q i”,K)=B i”WA,i”=1~t。
Q i”=B i”=b i”,W的初始化值为一个随机的数值。
将上述计算出的权重和相应的键值value进行加权求和,得到结构特征信息相对序列结构信息的注意力值(Attention):
Attention(Q,K,V)=∑ i"=1 t α i"V；
上述得到的Attention(Q,K,V)是一个1×s的张量。
同理,也可以计算序列特征信息对结构特征信息的attention:此时令Q=A,K=V=B,即query=A,key和value为B,得到的attention是一个1*t的张量。
通过上述方式得到待定位LncRNA的序列特征信息和/或结构特征信息的注意力值后,便可执行步骤103。
步骤103:将注意力值输入分类预测模型,获得LncRNA序列的定位预测结果。
在将待定位LncRNA的序列特征信息和/或结构特征信息的注意力值输入分类预测模型之前,还需先通过训练得到分类预测模型,具体分类预测模型的训练过程如下:
获取有标签的第一LncRNA集中每个第一LncRNA的第一序列特征信息和第一结构特征信息;基于注意力机制对每个第一LncRNA的第一序列特征信息和第一结构特征信息进行计算得到第一注意力值;将第一注意力值输入分类预测模型,获得对应第一LncRNA的第一定位预测值;基于第一定位预测值和对应第一LncRNA的标签值计算损失值,通过反向传播算法调整分类预测模型中的参数,直至损失值达到预设条件,获得训练好的分类预测模型。上述预设条件可以是验证集的损失函数的值不再下降,或,训练集或验证集的准确率不再提升。
上述第一LncRNA集中的第一LncRNA可以是从RNALocate数据库收集的有标签的历史LncRNA(即标注了历史LncRNA的亚细胞位置),RNALocate数据库是专门用于RNA亚细胞定位的数据库,该数据库最新版本RNALocatev2.0已经记录了超过210000个RNA相关的亚细胞定位条目录和实验数据,涉及超过110000个RNA,包含104个物种的171个亚细胞定位。目前,从RNALocate数据库中可以抽取得到LncRNA数据9587个历史LncRNA,去除重复后得到包含6897个不同的历史LncRNA构成本公开的第 一LncRNA集,其位置分布在40个不同的亚细胞,包括细胞质、细胞核、染色质、核仁、线粒体等。
在得到上述第一LncRNA集后,获取第一LncRNA集中每个第一LncRNA的第一序列特征信息和第一结构特征信息,获得方式与获取待定位LncRNA的序列特征信息和结构特征信息相同,对此不再赘述;之后,基于注意力机制对每个第一LncRNA的第一序列特征信息和第一结构特征信息进行计算,获得对应的第一注意力值,并将所有的第一注意力值分为训练集和验证集;并用训练集对多层感知机(Multilayer Perceptron,MLP)进行训练,用验证集对训练后的MLP进行验证,直至验证结果达到预设条件,获得分类预测模型。
多层感知机的训练参数包括:优化器可以用SGD、批量大小(batch)可以设置为32、dropout可以设置为0.001,迭代次数(epoch)可以设置为100,嵌入向量的维度可以设置为768。
得到分类预测模型后,还可以用5折交叉验证评估上述分类预测模型的性能,5折交叉验证评估的性能指标包括准确率(accuracy,ACC)、敏感性(Sn)、特异性(Sp)、精确率(precision,Pre)、F1分数(F1-score)、马修斯相关系数(Matthews correlation coefficient,MCC)、感受性曲线(Receiver Operating characteristic Curve,ROC)下的面积(Area under Curve,AUC)。
当分类预测模型的性能达到要求后,就可以将待定位LncRNA对应的注意力值输入分类预测模型,获得待定位LncRNA的预测结果,具体可以通过下列方式实现:
将注意力值输入分类预测模型,获得待定位LncRNA的定位预测值;将定位预测值中最大概率对应的亚细胞,作为待定位LncRNA所在的亚细胞,获得定位预测结果。
如图8所示,为本公开实施例提供的获得待定位LncRNA的预测值的示意图。
图8中待定位LncRNA的序列为AGUGAAGGCACAAGCCUUAC,待定位LncRNA的二级结构为.((((.(((….))))))),通过前面介绍的方式可以得到待定 位LncRNA序列的序列特征信息和结构特征信息,将它们输入分类预测模型便可得到上述待定位LncRNA在各个亚细胞的定位预测值为:细胞质:0.757,细胞液:0.182,核糖体:0.001,内质网:0.014,外泌体:0.035,突触:0.013。将上述定位预测值中最大概率对应的亚细胞(细胞质),作为上述待定位LncRNA所在的亚细胞,获得定位预测结果。
在本公开提供的实施例中,通过获取待定位LncRNA的序列特征信息和结构特征信息;基于注意力机制对序列特征信息和/或结构特征信息进行计算,获得序列特征信息和/或结构特征信息的注意力值;将注意力值输入分类预测模型,获得待定位LncRNA的定位预测结果,这使得在预测待定位LncRNA所在亚细胞位置时,充分考虑了待定位LncRNA的序列特征信息和结构特征信息的相关性,从而提高了待定位LncRNA定位预测的准确率。
基于同一发明构思,本公开实施例中提供了一种LncRNA定位预测的装置,包括:至少一个处理器,以及
与所述至少一个处理器连接的存储器;
其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述至少一个处理器通过执行所述存储器存储的指令,执行如上所述的RNA定位预测的方法。
基于同一发明构思,本公开实施例还提一种可读存储介质,包括:
存储器,
所述存储器用于存储指令,当所述指令被处理器执行时,使得包括所述可读存储介质的装置完成如上所述的RNA定位预测的方法。
所述可读存储介质可以是处理器能够存取的任何可用介质或数据存储设备,包括易失性存储器或非易失性存储器,或者可以包括易失性存储器和非易失性存储器两者。作为例子而非限制性的,非易失性存储器可以包括只读存储器(Read-Only Memory,ROM)、可编程ROM(Programmable read-only memory,PROM)、电可编程ROM(Erasable Programmable Read-Only Memory,EPROM)、电可擦写可编程ROM(Electrically Erasable Programmable read only  memory,EEPROM)或快闪存储器、固态硬盘(Solid State Disk或Solid State Drive,SSD)、磁性存储器(例如软盘、硬盘、磁带、磁光盘(Magneto-Optical disc,MO)等)、光学存储器(例如CD、DVD、BD、HVD等)。易失性存储器可以包括随机存取存储器(Random Access Memory,RAM),该RAM可以充当外部高速缓存存储器。作为例子而非限制性的,RAM可以以多种形式获得,比如动态RAM(Dynamic Random Access Memory,DRAM)、同步DRAM(Synchronous Dynamic Random-Access Memory,SDRAM)、双数据速率SDRAM(Double Data Rate SDRAM,DDR SDRAM)、增强SDRAM(EnhancedSynchronousDRAM,ESDRAM)、同步链路DRAM(Sync Link DRAM,SLDRAM)。所公开的各方面的存储设备意在包括但不限于这些和其它合适类型的存储器。
本领域内的技术人员应明白,本公开实施例可提供为方法、系统、或程序产品。因此,本公开实施例可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本公开实施例可采用在一个或多个其中包含有计算机/处理器可用程序代码的可读存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的机程序产品的形式。
本公开实施例是参照根据本公开实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的可读存储器中,使得存储在该可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框 图一个方框或多个方框中指定的功能。
这些程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机/处理器实现的处理,从而在计算机/处理器或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
显然,本领域的技术人员可以对本公开进行各种改动和变型而不脱离本公开的精神和范围。这样,倘若本公开的这些修改和变型属于本公开权利要求及其等同技术的范围之内,则本公开也意图包含这些改动和变型在内。

Claims (14)

  1. 一种RNA定位预测方法,其中,包括:
    获取待定位LncRNA的序列特征信息和结构特征信息;
    基于注意力机制对所述序列特征信息和/或所述结构特征信息进行计算,获得所述序列特征信息和/或所述结构特征信息的注意力值;
    将所述注意力值输入分类预测模型,获得所述待定位LncRNA的定位预测结果。
  2. 如权利要求1所述的方法,其中,获取待定位LncRNA的序列特征信息,包括:
    对所述待定位LncRNA的序列进行k-mer编码,获得至少一个k-mer编码集;其中,每个k-mer编码集中的k-mer包含的碱基数量相同,不同k-mer编码集的k-mer包含的碱基数量不同;
    基于k-mer预训练模型获取每个k-mer编码集中的每个k-mer编码的嵌入向量表示;
    用卷积神经网络从所有嵌入向量表示中,提取所述待定位LncRNA的序列特征信息。
  3. 如权利要求2所述的方法,其中,对所述待定位LncRNA进行k-mer编码,获得多个k-mer编码集,包括:
    根据每个k-mer编码集对应的k,从所述待定位LncRNA的序列的第一个碱基开始依次取连续的k个碱基构成对应k-mer编码集中的一个k-mer,直至取完所述待定位LncRNA的序列中的最后k个碱基,构成对应k-mer编码集;其中,同一k-mer编码集中相邻两个k-mer的首个碱基在所述LncRNA的序列中相邻,k为自然数。
  4. The method according to claim 2, wherein the training process of the k-mer pre-trained model comprises:
    performing k-mer encoding on the sequence of each second LncRNA in a second LncRNA set to obtain a plurality of second k-mer encoding sets corresponding to the sequence of each second LncRNA;
    using all the second k-mer encoding sets and a plurality of special characters as the vocabulary of a BERT model; and
    iteratively training the BERT model with all the second k-mer encoding sets to predict embedding vector representations of masked elements in the second k-mer encoding sets, and stopping the training when the value of the loss function of the BERT model no longer decreases, so as to obtain the k-mer pre-trained model; wherein the BERT model includes only a MASK-LM task, in the MASK-LM task the special characters are used to partially mask the elements of the second k-mer encoding sets, different second k-mer encoding sets correspond to different masking rates, and the masking rate is the proportion of special characters in a masked second k-mer encoding.
  5. The method according to claim 2, wherein the convolutional neural network comprises:
    a convolutional layer, the convolutional layer including a plurality of convolution kernels of different sizes, each convolution kernel being used to perform a convolution operation on the matrix corresponding to the embedding vector representations; and
    a max pooling layer, connected to the output of the convolutional layer and used to segment the convolution results output by the convolutional layer and to combine the maximum feature value obtained from each segment into the sequence feature information.
  6. The method according to claim 1, wherein acquiring the structural feature information of the LncRNA to be located comprises:
    converting the secondary structure of the LncRNA to be located into a tree structure; and
    extracting tree structure features from the tree structure with a Tree LSTM, as the structural feature information of the LncRNA to be located.
  7. The method according to claim 6, wherein converting the secondary structure of the LncRNA to be located into a tree structure comprises:
    starting from the first base of the sequence of the LncRNA to be located and according to the base pairing relations in the secondary structure, taking complementarily paired base pairs as root nodes of the tree structure and taking unpaired bases as leaf nodes of the previous node in the tree structure, until the last base of the sequence of the LncRNA to be located, so as to obtain the tree structure; wherein, when the first base is unpaired, the root node of the tree structure is empty.
  8. The method according to claim 6, wherein extracting tree structure features from the tree structure with the Tree LSTM comprises:
    starting from the leaf nodes of the tree structure, taking the outputs of all child nodes of the current node being processed in the tree structure as the input of the current node, and updating the gating vectors and memory cell corresponding to the current node according to the states of the child nodes, until the current node is the root node of the tree structure, wherein the input of a leaf node is its corresponding base; and
    taking the output of the root node as the tree structure feature.
  9. The method according to claim 1, wherein computing the sequence feature information and/or the structural feature information based on the attention mechanism to obtain the attention value of the sequence feature information and/or the structural feature information comprises:
    computing a correlation value between the value of each dimension of the sequence feature information and the structural feature information;
    normalizing the correlation value corresponding to each dimension of the sequence feature information to obtain a first attention weight for the value of each dimension of the sequence feature information; and
    based on the attention mechanism, summing the products of each first attention weight and the structural feature information to obtain the attention value of the sequence feature information with respect to the structural feature information.
  10. The method according to claim 1, wherein computing the sequence feature information and/or the structural feature information based on the attention mechanism to obtain the attention value of the sequence feature information and/or the structural feature information comprises:
    computing a correlation value between the value of each dimension of the structural feature information and the sequence feature information;
    normalizing the correlation value corresponding to each dimension of the structural feature information to obtain a second attention weight for the value of each dimension of the structural feature information; and
    based on the attention mechanism, summing the products of each second attention weight and the sequence feature information to obtain the attention value of the structural feature information with respect to the sequence feature information.
  11. The method according to any one of claims 1-7, wherein inputting the attention value into the classification prediction model to obtain the location prediction result of the LncRNA to be located comprises:
    inputting the attention value into the classification prediction model to obtain location prediction values of the LncRNA to be located; and
    taking the subcellular location corresponding to the largest probability among the location prediction values as the subcellular location of the LncRNA to be located, so as to obtain the location prediction result.
  12. The method according to claim 11, wherein the training process of the classification prediction model comprises:
    acquiring first sequence feature information and first structural feature information of each first LncRNA in a labeled first LncRNA set;
    computing the first sequence feature information and the first structural feature information of each first LncRNA based on the attention mechanism to obtain a first attention value;
    inputting the first attention value into the classification prediction model to obtain a first location prediction value of the corresponding first LncRNA; and
    computing a loss value based on the first location prediction value and the label value of the corresponding first LncRNA, and adjusting the parameters of the classification prediction model through a back-propagation algorithm until the loss value reaches a preset condition, so as to obtain the trained classification prediction model.
  13. An RNA location prediction apparatus, comprising:
    at least one processor, and
    a memory connected to the at least one processor;
    wherein the memory stores instructions executable by the at least one processor, and the at least one processor performs the method according to any one of claims 1-12 by executing the instructions stored in the memory.
  14. A readable storage medium, comprising a memory,
    the memory being configured to store instructions which, when executed by a processor, cause an apparatus including the readable storage medium to perform the method according to any one of claims 1-12.
PCT/CN2021/127273 2021-10-29 2021-10-29 RNA location prediction method, device and storage medium WO2023070493A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/127273 WO2023070493A1 (zh) 2021-10-29 2021-10-29 RNA location prediction method, device and storage medium
CN202180003135.1A CN117501374A (zh) 2021-10-29 2021-10-29 RNA location prediction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/127273 WO2023070493A1 (zh) 2021-10-29 2021-10-29 RNA location prediction method, device and storage medium

Publications (1)

Publication Number Publication Date
WO2023070493A1 true WO2023070493A1 (zh) 2023-05-04

Family

ID=86160394

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127273 WO2023070493A1 (zh) 2021-10-29 2021-10-29 RNA location prediction method, device and storage medium

Country Status (2)

Country Link
CN (1) CN117501374A (zh)
WO (1) WO2023070493A1 (zh)

Citations (6)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210202043A1 (en) * 2018-08-20 2021-07-01 Nantomics, Llc Methods and systems for improved major histocompatibility complex (mhc)-peptide binding prediction of neoepitopes using a recurrent neural network encoder and attention weighting
CN111341386A (zh) * 2020-02-17 2020-06-26 Dalian University of Technology Multi-scale CNN-BiLSTM prediction method for non-coding RNA interactions incorporating attention
CN111696624A (zh) * 2020-06-08 2020-09-22 Tianjin University Deep learning method for DNA-binding protein identification and functional annotation based on a self-attention mechanism
CN111986730A (zh) * 2020-07-27 2020-11-24 Suzhou Institute of Intelligent Computing Industry Technology, Institute of Computing Technology, Chinese Academy of Sciences A method for predicting siRNA silencing efficiency
CN112270955A (zh) * 2020-10-23 2021-01-26 Dalian Minzu University A method for predicting RBP binding sites on lncRNA with an attention mechanism
CN113053462A (zh) * 2021-03-11 2021-06-29 Tongji University Method and system for predicting RNA-protein binding preference based on a bidirectional attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Master's Thesis", 9 January 2019, SHANGHAI JIAOTONG UNIVERSITY, CN, article XIAO, YIQUN: "MicroRNA Subcellular Localization Based on Deep Mining of Sequence Patterns", pages: 1 - 69, XP009545248, DOI: 10.27307/d.cnki.gsjtu.2019.002779 *

Also Published As

Publication number Publication date
CN117501374A (zh) 2024-02-02

Similar Documents

Publication Publication Date Title
US11604956B2 (en) Sequence-to-sequence prediction using a neural network model
US11928600B2 (en) Sequence-to-sequence prediction using a neural network model
JP2023082017A (ja) Computer system
US20180107927A1 (en) Architectures for training neural networks using biological sequences, conservation, and molecular phenotypes
CN115485696A (zh) Adversarial pretraining of machine learning models
JP2021526259A (ja) Method and apparatus for multimodal prediction using a trained statistical model
JP2019191827A (ja) Question answering device, question answering method and program
CN108427865B (zh) A method for predicting associations between LncRNA and environmental factors
CN113539372A (zh) An efficient method for predicting LncRNA-disease associations
WO2023070493A1 (zh) RNA location prediction method, device and storage medium
Gupta et al. DAVI: Deep learning-based tool for alignment and single nucleotide variant identification
Sharma et al. Drugs–Protein affinity‐score prediction using deep convolutional neural network
Navamajiti et al. McBel-Plnc: a deep learning model for multiclass multilabel classification of protein-lncRNA interactions
Di Persia et al. exp2GO: Improving prediction of functions in the Gene Ontology with expression data
Park et al. A molecular generative model with genetic algorithm and tree search for cancer samples
Bickmann et al. TEclass2: Classification of transposable elements using Transformers
Wang et al. PreSubLncR: Predicting Subcellular Localization of Long Non-Coding RNA Based on Multi-Scale Attention Convolutional Network and Bidirectional Long Short-Term Memory Network
WO2020190887A1 (en) Methods and systems for de novo molecular configurations
CN117976047B (zh) Deep-learning-based key protein prediction method
US20230297606A1 (en) Low-resource, multi-lingual transformer models
Miller et al. Exploring neural network models for LncRNA sequence identification
US20220106589A1 (en) Methods and systems for modeling of design representation in a library of editing cassettes
JP6918030B2 (ja) Learning device, learning method, program and information processing system
Ovens et al. Juxtapose: A Python tool for gene embedding for co-expression network comparison
CN117423389A (zh) Disease-associated circRNA prediction method based on graph-attention random walk

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21961855

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18291563

Country of ref document: US