WO2023044927A1 - RNA-protein interaction prediction method, device, medium and electronic equipment - Google Patents

RNA-protein interaction prediction method, device, medium and electronic equipment

Publication number
WO2023044927A1
Authority
WO
WIPO (PCT)
Prior art keywords
rna
protein
sequence
mer
pair
Prior art date
Application number
PCT/CN2021/121089
Inventor
张春会
张振中
Original Assignee
京东方科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司
Priority to PCT/CN2021/121089 (WO2023044927A1)
Priority to CN202180002692.1A (CN116490926A)
Publication of WO2023044927A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, in particular, to a method for predicting RNA-protein interaction, a device for predicting RNA-protein interaction, a computer-readable storage medium and electronic equipment.
  • Noncoding RNA (ncRNA) participates in many complex cellular processes, plays an important role in life processes such as alternative splicing, chromatin modification and epigenetics, and is closely related to many diseases. Studies have shown that most noncoding RNAs achieve their regulatory functions by interacting with proteins. Therefore, studying the interaction between noncoding RNA and protein is of great significance for revealing the molecular mechanism of noncoding RNA in human diseases and life activities, and has become one of the important ways to analyze the functions of noncoding RNA and protein.
  • the present disclosure provides a method for predicting RNA-protein interaction, a device for predicting RNA-protein interaction, a computer-readable storage medium and electronic equipment.
  • the present disclosure provides a method for predicting RNA-protein interaction, including:
  • vectorizing the RNA-protein pair to be predicted, and obtaining an RNA sequence representation vector and a protein sequence representation vector in the RNA-protein pair to be predicted;
  • based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, using multiple interaction prediction models to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted;
  • An interaction between the RNA and the protein is determined based on the plurality of interaction prediction values.
  • the feature extraction of the RNA-protein pair to be predicted is performed to obtain the sequence features of the RNA-protein pair to be predicted, including:
  • sequence features of the RNA-protein pair to be predicted are determined according to the original sequence feature set.
  • the determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
  • Each k-mer subsequence is searched in the original sequence feature set, and the sequence feature of the RNA-protein pair to be predicted is obtained according to the search result.
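  • The lookup described above can be sketched as follows. This is a minimal illustration, assuming that the "search result" is a binary vector marking which entries of the original sequence feature set occur in the sequence; the function names, the example feature set and the overlapping left-to-right reading are assumptions made for illustration, not details stated in the disclosure.

```python
from typing import List, Sequence


def kmers(seq: str, k: int) -> List[str]:
    """Return the overlapping k-mers of seq, read left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def lookup_features(seq: str, feature_set: Sequence[str], k: int) -> List[int]:
    """Binary vector: 1 where a feature-set k-mer occurs in seq, else 0."""
    present = set(kmers(seq, k))
    return [1 if f in present else 0 for f in feature_set]


# Hypothetical original sequence feature set (illustrative values only).
feature_set = ["AGA", "AUG", "CCC"]
features = lookup_features("AGAUGG", feature_set, k=3)
print(features)
```

  • The same function can be applied to the recoded protein sequence, and the two resulting vectors concatenated to form the sequence features of the pair.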
  • the determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
  • the RNA sequence and the protein sequence in the RNA-protein pair to be predicted are respectively converted into k-mer subsequences, and the k-mer subsequences include RNA k-mer subsequences and protein k-mer subsequences;
  • the sequence characteristics of the RNA-protein pair to be predicted are composed of the first sequence characteristics and the second sequence characteristics.
  • said obtaining the original sequence feature set includes:
  • the feature extraction is performed on each RNA-protein pair in the original data set to obtain the original sequence feature set, including:
  • the basic units of RNA and protein are arranged and combined to obtain k-mer subsequences;
  • the original sequence feature set is determined according to the variance of each k-mer sequence.
  • the performing statistics on the occurrence frequency of each k-mer subsequence in the original data set, and calculating the variance of each k-mer subsequence according to the occurrence frequency, includes:
  • the variance of each k-mer subsequence is calculated according to the occurrence frequency of each k-mer subsequence in the original data set and the marker value in each RNA-protein pair.
  • the calculating the variance of each k-mer subsequence according to the occurrence frequency of each k-mer subsequence in the original data set and the marker value in each RNA-protein pair includes:
  • the variance Var_i of the i-th k-mer subsequence is calculated from the marker value of the i-th k-mer subsequence in the n-th RNA-protein pair, where
  • Freq_i is the occurrence frequency of the i-th k-mer subsequence in the original data set, and
  • N is the total number of RNA-protein pairs in the original data set.
  • the determining the original sequence feature set according to the variance of each k-mer sequence includes:
  • the k-mer subsequences satisfying the preset condition are determined according to the variance of each k-mer subsequence, and the original sequence feature set is composed of the k-mer subsequences satisfying the preset condition.
  • performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set further includes:
  • the RNA sequence and the protein sequence in each RNA-protein pair are respectively converted into k-mer subsequences, and the first candidate itemset is formed by the k-mer subsequences, where the k-mer subsequences include RNA k-mer subsequences and protein k-mer subsequences;
  • the RNA k-mer subsequences and the protein k-mer subsequences in the frequent itemset are cross-combined, and the k-mer subsequence pairs obtained by combination form the second candidate itemset;
  • the original sequence feature set is composed of k-mer subsequence pairs whose support meets a preset condition.
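  • The candidate-itemset procedure above resembles a two-pass frequent-itemset search. The sketch below is one possible reading, assuming that support is the fraction of RNA-protein pairs containing an item, that the same threshold applies in both passes, and that each pair is already given as a set of RNA k-mers and a set of protein k-mers; the function name, threshold and data are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List, Set, Tuple

# One RNA-protein pair, represented as (RNA k-mers, protein k-mers).
Pair = Tuple[Set[str], Set[str]]


def frequent_kmer_pairs(pairs: List[Pair], min_support: float) -> List[Tuple[str, str]]:
    """Return (RNA k-mer, protein k-mer) pairs whose support meets min_support."""
    n = len(pairs)
    # Pass 1: support of single k-mers (first candidate itemset).
    rna_count, prot_count = Counter(), Counter()
    for rna, prot in pairs:
        rna_count.update(rna)
        prot_count.update(prot)
    freq_rna = {k for k, c in rna_count.items() if c / n >= min_support}
    freq_prot = {k for k, c in prot_count.items() if c / n >= min_support}
    # Pass 2: cross-combine frequent RNA and protein k-mers
    # (second candidate itemset) and count their joint support.
    pair_count = Counter()
    for rna, prot in pairs:
        pair_count.update(product(sorted(rna & freq_rna), sorted(prot & freq_prot)))
    return sorted(p for p, c in pair_count.items() if c / n >= min_support)


# Illustrative data: protein k-mers use the 7-class recoding.
data = [
    ({"AGA", "GAU"}, {"111", "112"}),
    ({"AGA"}, {"111"}),
    ({"GAU"}, {"112"}),
]
result = frequent_kmer_pairs(data, min_support=0.6)
print(result)
```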
  • the vectorization of the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and protein sequence representation vector in the RNA-protein pair to be predicted includes:
  • the RNA sequence and the protein sequence in the RNA-protein pair to be predicted are respectively converted into k-mer subsequences, and the k-mer subsequences include M RNA k-mer subsequences and N protein k-mer subsequences;
  • the interaction prediction model respectively obtains multiple interaction prediction values of the RNA-protein pair to be predicted, including:
  • inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one second interaction prediction model to obtain at least one second interaction prediction value.
  • the interaction prediction model respectively obtains multiple interaction prediction values of the RNA-protein pair to be predicted, including:
  • inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value.
  • each of the deep learning models includes at least two sub-deep learning models; the inputting the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value includes:
  • the first sequence feature and the second sequence feature are fused, and the second interaction prediction value is obtained according to the fused feature.
  • the traditional machine learning model includes at least one of a logistic regression model, a support vector machine model, and a decision tree model
  • the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.
  • the determining the interaction between the RNA and the protein according to the plurality of interaction prediction values includes:
  • the interaction between the RNA and the protein is determined based on the calculation results.
  • the determining the interaction between the RNA and the protein according to the calculation results includes:
  • when the calculation result is less than or equal to the preset interaction prediction value threshold, it is determined that there is no interaction between the RNA and the protein.
  • the method further includes:
  • the plurality of interaction prediction models are jointly trained.
  • the joint training of the multiple interaction prediction models includes:
  • RNA-protein pair a positive example RNA-protein pair and a negative example RNA-protein pair in the training data set
  • Model parameters of the plurality of interaction prediction models are adjusted according to the loss value.
  • the obtaining the joint prediction value of each RNA-protein pair in the training data set according to the plurality of interaction prediction values includes:
  • a weighted summation is performed on multiple interaction prediction values of each RNA-protein pair in the training data set to obtain a joint prediction value of each RNA-protein pair.
  • the weighted summation of multiple interaction prediction values of each RNA-protein pair in the training data set to obtain a joint prediction value of each RNA-protein pair includes:
  • the joint prediction value y_out of each RNA-protein pair in the training data set is calculated as a weighted sum of y_1, y_2 and y_3, where
  • y_1 is the output value of the traditional machine learning model,
  • y_2 is the output value of the convolutional neural network model,
  • y_3 is the output value of the recurrent neural network model, and
  • the three weight parameters are those of the traditional machine learning model, the convolutional neural network model and the recurrent neural network model, respectively.
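  • The weighted summation can be sketched as below. The weight values, the model outputs and the 0.5 threshold are hypothetical numbers chosen for illustration; the final comparison assumes that the thresholding rule stated for the determination step (no interaction when the result is less than or equal to the threshold) is applied to the joint value.

```python
from typing import Sequence


def joint_prediction(preds: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the per-model interaction prediction values."""
    if len(preds) != len(weights):
        raise ValueError("one weight per model output is required")
    return sum(w * y for w, y in zip(weights, preds))


# Hypothetical outputs of the traditional machine learning model (y1),
# the convolutional model (y2) and the recurrent model (y3).
y1, y2, y3 = 0.9, 0.6, 0.7
weights = (0.2, 0.4, 0.4)   # hypothetical weight parameters
threshold = 0.5             # hypothetical preset threshold

y_out = joint_prediction((y1, y2, y3), weights)
interacts = y_out > threshold  # <= threshold means "no interaction"
print(round(y_out, 3), interacts)
```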
  • the adjusting model parameters of the plurality of interaction prediction models according to the loss value includes:
  • the plurality of interaction prediction models are used to predict the interaction of the RNA-protein pair to be predicted.
  • the method further includes:
  • a prediction result of the interaction between the RNA and the protein is output.
  • an RNA-protein interaction prediction device, including:
  • a data acquisition module, configured to acquire the RNA-protein pair to be predicted;
  • a feature extraction module configured to perform feature extraction on the RNA-protein pair to be predicted, to obtain sequence features of the RNA-protein pair to be predicted;
  • a data vectorization module used to vectorize the RNA-protein pair to be predicted, and obtain the RNA sequence representation vector and protein sequence representation vector in the RNA-protein pair to be predicted;
  • an interaction prediction module, configured to use multiple interaction prediction models to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
  • An interaction determination module configured to determine the interaction between the RNA and the protein according to the plurality of interaction prediction values.
  • the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, any one of the methods described above is implemented.
  • the present disclosure provides an electronic device, including: a processor; and a memory, configured to store executable instructions of the processor; wherein the processor is configured to perform any one of the methods described above by executing the executable instructions.
  • Figure 1 shows a schematic diagram of an exemplary system architecture of an RNA-protein interaction prediction method and device that can be applied to an embodiment of the present disclosure
  • Figure 2 schematically shows a flow chart of an RNA-protein interaction prediction method according to an embodiment of the present disclosure
  • Fig. 3 schematically shows a flow chart of determining the sequence characteristics of the RNA-protein pair to be predicted according to an embodiment of the present disclosure
  • Fig. 4 schematically shows a flow chart of obtaining an original sequence feature set according to an embodiment of the present disclosure
  • Fig. 5 schematically shows a flow chart of obtaining an original sequence feature set according to another embodiment of the present disclosure
  • FIG. 6 schematically shows a flow chart of training multiple interaction prediction models according to an embodiment of the present disclosure
  • Figure 7 schematically shows a block diagram of an RNA-protein interaction prediction device according to an embodiment of the present disclosure
  • FIG. 8 shows a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments may, however, be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
  • numerous specific details are provided in order to give a thorough understanding of embodiments of the present disclosure.
  • those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced with one or more of the specific details omitted, or that other methods, components, devices, steps, etc. may be adopted.
  • well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
  • FIG. 1 shows a schematic diagram of the system architecture of an exemplary application environment in which a method and device for predicting RNA-protein interaction according to an embodiment of the present disclosure can be applied.
  • the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104 and a server 105.
  • the network 104 is used as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105.
  • Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
  • Terminal devices 101, 102, 103 may be various electronic devices, including but not limited to desktop computers, portable computers, smart phones, and tablet computers. It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are only illustrative. According to the implementation needs, there can be any number of terminal devices, networks and servers.
  • the server 105 may be one server, or a server cluster composed of multiple servers, or a cloud computing platform or a virtualization center.
  • the server 105 can be used to perform: obtaining the RNA-protein pair to be predicted; performing feature extraction on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted; vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted; using multiple interaction prediction models to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted; and determining the interaction between the RNA and the protein according to the multiple interaction prediction values.
  • the RNA-protein interaction prediction method provided by the embodiments of the present disclosure is generally executed by the server 105.
  • the RNA-protein interaction prediction device is generally set in the server 105, and the server can send the prediction result of the RNA-protein interaction to be predicted to the terminal device, which displays it to the user.
  • the RNA-protein interaction prediction method provided by the embodiments of the present disclosure can also be executed by one or more of the terminal devices 101, 102, and 103.
  • the prediction device can also be set in the terminal devices 101, 102, 103. For example, after the method is executed by the terminal device, the prediction result can be directly displayed on the display screen of the terminal device, or the prediction result can be provided to the user through voice broadcast, which is not specifically limited in this exemplary embodiment.
  • ncRPI noncoding RNA-protein interactions
  • the RNA-protein interaction prediction method may include the following steps S210 to S250:
  • Step S210: obtain the RNA-protein pair to be predicted;
  • Step S220: perform feature extraction on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted;
  • Step S230: vectorize the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
  • Step S240: based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, use multiple interaction prediction models to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted;
  • Step S250: determine the interaction between the RNA and the protein according to the multiple interaction prediction values.
  • the RNA-protein pair to be predicted is obtained; feature extraction is performed on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted; the RNA-protein pair to be predicted is vectorized to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted; based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, multiple interaction prediction models are used to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted; and the interaction between the RNA and the protein is determined according to the multiple interaction prediction values.
  • RNA sequences and protein sequences can be fully mined, so as to accurately predict the interaction between RNA and proteins;
  • the effective combination of the characteristics of the interaction prediction model can further improve the accuracy of predicting the interaction between RNA and protein.
  • In step S210, the RNA-protein pair to be predicted is obtained.
  • The RNA-protein pair to be predicted can be obtained, and the interaction between the RNA and the protein in each RNA-protein pair to be predicted is unknown.
  • the user can input the RNA-protein pair to be predicted through the terminal device.
  • the user may manually input the RNA-protein pair to be predicted, or input the RNA-protein pair to be predicted by voice, which is not specifically limited in this example.
  • an RNA can be input, and then a protein can be input, and there is no limitation on the input order of the two.
  • RNA and protein can be entered into different text boxes, or they can be entered into the same text box. For example, after the input is completed, click the "Start Prediction" button, and then start to execute the prediction steps provided in some embodiments of the present application.
  • The interaction between RNA and protein means that the function of a protein is reflected in its interaction with other proteins and RNA.
  • protein-RNA interactions play an important role in protein synthesis.
  • many functions of RNA are also inseparable from the interaction with proteins.
  • the interaction can be regulation, guidance, etc., and is not limited here.
  • in the presence of an interaction, RNA can guide protein synthesis, or RNA can regulate protein function.
  • the interaction between RNA and protein can also mean that the two can regulate each other's life cycle and function through physical interaction.
  • the RNA coding sequence can guide protein synthesis, and correspondingly, the protein can also regulate the expression and function of RNA.
  • the prediction result of the RNA-protein interaction to be predicted can also be output to the terminal device for users to view.
  • the prediction result may be directly displayed on the display screen of the terminal device, or the prediction result may be provided to the user through voice broadcast, which is not specifically limited in this example.
  • At least one RNA sequence to be predicted can also be obtained, and a protein sequence that interacts with each input RNA sequence to be predicted can be searched for in the database through multiple interaction prediction models.
  • at least one protein sequence in the database can be selected, and multiple RNA-protein pairs are formed from the RNA sequence to be predicted and each protein sequence; the multiple interaction prediction models then predict the interaction of each RNA-protein pair, and the protein sequences that can interact with the RNA sequence to be predicted are output according to the prediction results.
  • several types of protein sequences can be stored in the database in advance, so as to be recalled when predicting the interaction of RNA-protein pairs.
  • the protein sequence can be stored in the Redis database or in the MySQL database, and then the protein sequence to be predicted can be queried and selected in real time.
  • Redis is a key-value storage system.
  • the Redis database can include key-value pairs, each formed by a sequence identifier (such as a sequence number) and the corresponding protein sequence, where the key is the sequence identifier and the value is the corresponding protein sequence.
  • Redis can support more than 100K read and write operations per second, and has certain advantages in data reading and storage speed.
  • MySQL is a relational database management system. The relational database stores data in different tables instead of storing all data in a unified manner, which increases storage speed and flexibility. It has stable advantages in data storage and can avoid
  • RNA sequences can also be stored in the database in advance, so as to be recalled when predicting the interaction of RNA-protein pairs. Therefore, at least one protein sequence to be predicted can also be obtained, and an RNA sequence that interacts with each input protein sequence to be predicted can be searched for in the database through multiple interaction prediction models. Similarly, after the user enters the protein sequence through the terminal device, at least one RNA sequence in the database can be selected, and multiple RNA-protein pairs are formed from the protein sequence to be predicted and each RNA sequence. The multiple interaction prediction models then predict the interaction of each RNA-protein pair, and the RNA sequences that can interact with the protein sequence to be predicted are output according to the prediction results, which is not specifically limited in the present disclosure.
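  • The database search described above can be sketched as follows. A plain Python dictionary stands in for the key-value layout (sequence identifier as key, protein sequence as value); no real Redis or MySQL client is used. The identifiers, the sequences, the threshold and the scoring function `toy_predict` are hypothetical placeholders, not the disclosed models.

```python
from typing import Callable, Dict, List

# Stand-in for the key-value store: sequence identifier -> protein sequence.
protein_db: Dict[str, str] = {"P001": "ALQDVG", "P002": "MKKLLP"}


def find_interacting(
    rna: str,
    db: Dict[str, str],
    predict: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Pair the RNA with every stored protein and keep the predicted hits."""
    hits = []
    for seq_id, protein in db.items():
        # Each (rna, protein) combination forms one RNA-protein pair.
        if predict(rna, protein) > threshold:
            hits.append(seq_id)
    return hits


def toy_predict(rna: str, protein: str) -> float:
    """Placeholder scorer used only to make the sketch runnable."""
    return 0.9 if protein.startswith("A") else 0.1


hits = find_interacting("AGAUGG", protein_db, toy_predict)
print(hits)
```

  • The symmetric case (a protein sequence queried against stored RNA sequences) follows by swapping the roles of the two arguments.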
  • In step S220, feature extraction is performed on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted.
  • Before predicting the interaction of each RNA-protein pair to be predicted through multiple interaction prediction models, it is necessary to obtain the input features of each interaction prediction model.
  • feature extraction can be performed on the RNA-protein pair to be predicted, that is, feature extraction is performed in turn on the RNA sequence and the protein sequence in the RNA-protein pair to be predicted to obtain the corresponding RNA sequence features and protein sequence features. The sequence features of the RNA-protein pair to be predicted are composed of the RNA sequence features and the protein sequence features, and the sequence features can be used as the input of the interaction prediction model.
  • the RNA-protein pair to be predicted can also be vectorized, that is, the RNA sequence and the protein sequence in the RNA-protein pair are respectively vectorized to obtain the corresponding RNA sequence representation vector and protein sequence representation vector, and the RNA sequence representation vector and the protein sequence representation vector are respectively used as inputs of the interaction prediction model.
  • It is also possible to perform feature extraction and vectorization on the RNA-protein pair to be predicted at the same time, to obtain the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted. The sequence features, the RNA sequence representation vector and the protein sequence representation vector can then all be used as inputs of the interaction prediction model, which is not specifically limited in the present disclosure.
  • feature extraction can be performed on the RNA-protein pair to be predicted according to step S310 and step S320.
  • In step S310, the original sequence feature set is obtained.
  • the original data set can be obtained, and feature extraction is performed on each RNA-protein pair in the original data set to obtain the original sequence feature set.
  • the RPI1807 data set can be used as the original data set.
  • This data set can contain 3243 RNA-protein pairs, and 1807 pairs of positive examples and 1436 pairs of negative examples are included in the 3243 RNA-protein pairs.
  • a positive example can indicate that there is an interaction between the RNA and the protein in the RNA-protein pair
  • a negative example can indicate that there is no interaction between the RNA and the protein in the RNA-protein pair.
  • the RPI2241 data set, the RPI369 data set, etc. may also be used as the original data set for experimentation, which is not specifically limited in the present disclosure.
  • feature extraction can be performed on the RNA-protein pairs in the original data set according to steps S410 to S430 to obtain the original sequence feature set.
  • Step S410 Arranging and combining the basic units of RNA and protein to obtain k-mer subsequences.
  • bases are the basic units of RNA.
  • An RNA sequence can include four kinds of bases, namely adenine (A), uracil (U), guanine (G) and cytosine (C). All k-mer subsequences of the RNA sequence can be obtained by permuting and combining the four bases.
  • amino acids are the basic units of proteins.
  • A protein sequence can include 20 amino acids, which are coded in turn as A, G, V, I, L, F, P, Y, M, T, S, H, N, Q, W, R, K, D, E, C.
  • the 20 amino acids can be divided into 7 types, namely {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E} and {C}, and each type of amino acid is recoded, for example as 1, 2, 3, 4, 5, 6 and 7.
  • the protein sequence ALQDVG can be converted to 124611.
  • the seven types of amino acids can be arranged and combined to obtain all k-mer subsequences of the amino acid sequence.
  • the 20 amino acids can also be classified according to the composition of the amino acids, or the k-mer subsequences of the amino acid sequence can be obtained directly from the arrangement and combination of the 20 amino acids without classification, which is not specifically limited in this disclosure.
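  • The 7-class recoding can be sketched as a simple lookup table. The grouping and the digits 1 to 7 follow the example above; the function name is an assumption, and residues outside the 20 listed letters are not handled (they raise a `KeyError` in this sketch).

```python
# The seven amino-acid groups from the example, in order of their codes 1..7.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

# Map each one-letter amino-acid code to the digit of its group.
CODE = {aa: str(i) for i, group in enumerate(GROUPS, start=1) for aa in group}


def recode_protein(seq: str) -> str:
    """Replace each amino acid by the digit of its group."""
    return "".join(CODE[aa] for aa in seq.upper())


encoded = recode_protein("ALQDVG")
print(encoded)  # 124611, matching the example in the text
```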
  • the k-mer subsequence refers to a subsequence formed by taking k bases, or k amino acids (or amino acid type codes), as a group.
  • the k-mer subsequence may include an RNA k-mer subsequence and a protein k-mer subsequence.
  • the k-mer subsequence may refer to an RNA k-mer subsequence obtained by permuting and combining four types of bases, and for a certain value of k, 4^k types of k-mer subsequences may be obtained.
  • a k-mer subsequence may also refer to a protein k-mer subsequence obtained by permuting and combining 7 types of amino acids, and 7^k types of k-mer subsequences can be obtained for a certain k value. It can be understood that the classification of the 20 amino acids into 7 categories is only illustrative, and the amino acids may also be left unclassified. Similarly, the four bases of the RNA sequence can also be classified according to actual needs.
  • one or more values of k may be used, and the specific value of k may be adjusted according to actual conditions, which is not limited herein.
  • two values of 3 and 4 may be taken as an example for illustration.
  • AAA and AUC are two 3-mer subsequences of RNA sequences
  • AAAA and AAAU are two 4-mer subsequences of RNA sequences.
  • 111 and 112 are two 3-mer subsequences of the protein sequence, and 1111 and 1122 are two 4-mer subsequences of the protein sequence.
  • k may only be 3 or only 4, which is not specifically limited in the present disclosure.
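  • Enumerating every possible k-mer over an alphabet can be sketched with `itertools.product`. The alphabets are taken from the text (four bases, seven amino acid type codes); the function name is an assumption made for illustration.

```python
from itertools import product
from typing import List


def all_kmers(alphabet: str, k: int) -> List[str]:
    """All len(alphabet)**k strings of length k over the alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


RNA_BASES = "AUGC"
PROTEIN_CLASSES = "1234567"

rna_3 = all_kmers(RNA_BASES, 3)
rna_4 = all_kmers(RNA_BASES, 4)
prot_3 = all_kmers(PROTEIN_CLASSES, 3)
prot_4 = all_kmers(PROTEIN_CLASSES, 4)

print(len(rna_3), len(rna_4), len(prot_3), len(prot_4))
```

  • With k = 3 and k = 4 this yields 4^3 = 64 and 4^4 = 256 RNA k-mers, and 7^3 = 343 and 7^4 = 2401 protein k-mers.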
  • Step S420 Count the occurrence frequency of each k-mer subsequence in the original data set, and calculate the variance of each k-mer subsequence according to the occurrence frequency.
  • all 3-mer subsequences and 4-mer subsequences of RNA sequences and protein sequences can be obtained according to step S410, that is, 64 kinds of 3-mer subsequences and 256 kinds of 4-mer subsequences of RNA sequences, and 343 kinds of 3-mer subsequences and 2401 kinds of 4-mer subsequences of protein sequences.
  • the occurrence frequency of each 3-mer subsequence or 4-mer subsequence in the original data set can be counted, and the variance of each 3-mer subsequence or 4-mer subsequence can be calculated according to the occurrence frequency.
  • the 3-mer subsequences of the sequence may include "AGA", "GAU", "AUG" and "UGG"
  • the 4-mer subsequences of the sequence may include "AGAU", "GAUG" and "AUGG"; that is, the RNA sequence can be read with forward overlapping to obtain the corresponding 3-mer subsequences or 4-mer subsequences.
  • the RNA sequence can also be read with reverse overlapping to obtain the corresponding 3-mer subsequences or 4-mer subsequences.
  • the 3-mer subsequences of this sequence can also include "GGU", "GUA", "UAG" and "AGA"
  • the 4-mer subsequences of this sequence can also include "GGUA", "GUAG" and "UAGA".
  • the RNA sequence can also be read in a non-overlapping manner to obtain the corresponding 3-mer subsequences or 4-mer subsequences; for example, the 3-mer subsequences of the sequence can also include "AGA" and "UGG", which is not specifically limited in this disclosure.
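  • The three reading modes above can be sketched as follows. The example sequence "AGAUGG" is not stated in the text; it is inferred here as the shortest sequence consistent with all of the listed subsequences, and the function names are hypothetical.

```python
def overlapping_kmers(seq, k):
    """Forward overlapping read: slide a window of width k one symbol at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reverse_overlapping_kmers(seq, k):
    """Reverse overlapping read: the same sliding window applied to the reversed sequence."""
    return overlapping_kmers(seq[::-1], k)


def non_overlapping_kmers(seq, k):
    """Non-overlapping read: consecutive blocks of width k (a short tail is dropped)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


seq = "AGAUGG"  # inferred example sequence (assumption)
print(overlapping_kmers(seq, 3))
print(overlapping_kmers(seq, 4))
print(reverse_overlapping_kmers(seq, 3))
print(reverse_overlapping_kmers(seq, 4))
print(non_overlapping_kmers(seq, 3))
```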
  • each 3-mer subsequence and/or 4-mer subsequence of the RNA sequences and protein sequences in the original data set can be counted to obtain the number of occurrences of each 3-mer subsequence and/or 4-mer subsequence in the original data set.
  • the ratio of the number of occurrences of a given 3-mer subsequence in the original data set to the total number of RNA-protein pairs in the original data set can be calculated to obtain the occurrence frequency of that 3-mer subsequence in the original data set.
  • each 3-mer subsequence and/or 4-mer subsequence is flagged for its presence in each RNA-protein pair.
  • the variance of each k-mer subsequence can be calculated according to the occurrence frequency of each 3-mer subsequence and/or 4-mer subsequence in the original data set and the marker value in each RNA-protein pair.
  • the subsequence may be a 3-mer subsequence of an RNA sequence or a protein sequence, or a 4-mer subsequence of an RNA sequence or a protein sequence.
  • The number of occurrences of the subsequence in the RPI1807 data set can be counted first. For example, the N RNA-protein pairs (N = 3243) in the RPI1807 data set can be traversed; if the subsequence appears in the current RNA-protein pair, the occurrence count is increased by 1.
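  • A minimal sketch of this presence-based statistic is given below. It assumes that the marker value is 1 when the subsequence appears in a pair and 0 otherwise, and that the variance is the population variance of the marker values around the occurrence frequency; both the divisor and the toy sequences are assumptions for illustration, not data from RPI1807.

```python
def presence_flags(kmer, sequences):
    """Marker value per sequence: 1 if the k-mer occurs in it, otherwise 0."""
    return [1 if kmer in seq else 0 for seq in sequences]


def presence_frequency_and_variance(kmer, sequences):
    """Occurrence frequency (share of sequences containing the k-mer) and the
    population variance of the 0/1 marker values around that frequency."""
    flags = presence_flags(kmer, sequences)
    n = len(flags)
    frequency = sum(flags) / n
    variance = sum((flag - frequency) ** 2 for flag in flags) / n
    return frequency, variance


# Hypothetical toy RNA sequences, one per RNA-protein pair.
toy_rna = ["AGAUGG", "CCCAGA", "UUUUUU", "GGGCCC"]
frequency, variance = presence_frequency_and_variance("AGA", toy_rna)
print(frequency, variance)
```

  • With 0/1 marker values this variance equals p(1 - p), where p is the occurrence frequency, so subsequences present in roughly half of the pairs receive the largest variance.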
  • Step S430 Determine the original sequence feature set according to the variance of each k-mer subsequence.
  • the k-mer subsequences that meet a preset condition can be determined according to the variance of each k-mer subsequence, and the original sequence feature set is composed of the k-mer subsequences that meet the preset condition.
  • all 3-mer subsequences and 4-mer subsequences of the RNA sequence and all 3-mer subsequences and 4-mer subsequences of the protein sequence can be sorted according to the size of the variance, for example in descending order, and the top-ranked k-mer subsequences can be selected to constitute the original sequence feature set.
  • the top 560 k-mer subsequences can be selected, and these 560 k-mer subsequences form the original sequence feature set.
  • these can include the top 60 RNA 3-mer subsequences, the top 200 RNA 4-mer subsequences, the top 200 protein 3-mer subsequences, and the top 100 protein 4-mer subsequences.
  • the number of selected k-mer subsequences is only illustrative, and any number of k-mer subsequences can be selected according to actual needs.
  • a variance threshold may also be preset, and k-mer subsequences with a variance greater than the threshold are screened out, and the filtered k-mer subsequences form the original sequence feature set.
  • the preset variance threshold is 3
  • k-mer subsequences with a variance greater than 3 can be selected to form the original sequence feature set.
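  • The two selection rules above (top-ranked by variance, or variance above a preset threshold) can be sketched as follows. The function names and the toy variance values are hypothetical; in practice the top-n rule could be applied separately to each category (RNA 3-mers, RNA 4-mers, protein 3-mers, protein 4-mers).

```python
def top_n_by_variance(variances, n):
    """Sort k-mers by variance in descending order and keep the first n."""
    ranked = sorted(variances.items(), key=lambda item: item[1], reverse=True)
    return [kmer for kmer, _ in ranked[:n]]


def above_threshold(variances, threshold):
    """Keep the k-mers whose variance is strictly greater than the threshold."""
    return [kmer for kmer, value in variances.items() if value > threshold]


# Hypothetical variance values for four k-mers.
toy_variances = {"AGA": 0.25, "CCC": 0.10, "UGG": 0.19, "111": 0.24}
print(top_n_by_variance(toy_variances, 2))
print(above_threshold(toy_variances, 0.2))
```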
  • the average value of the number of occurrences of each k-mer subsequence in each RNA-protein pair can also be calculated, and the variance of each k-mer subsequence is calculated according to the average number of occurrences, Further, the original sequence feature set is determined according to the variance of each k-mer subsequence.
  • the number of occurrences of each 3-mer subsequence or 4-mer subsequence in each RNA-protein pair can be determined.
  • the total number of occurrences of each 3-mer subsequence or 4-mer subsequence in each RNA-protein pair can be calculated to obtain the total number of occurrences of the subsequence in the original data set.
  • According to the total number of occurrences, the average number of occurrences of each 3-mer subsequence or 4-mer subsequence in each RNA-protein pair can be calculated.
  • the variance of each subsequence can be calculated from the average number of occurrences of each 3-mer subsequence or 4-mer subsequence in each RNA-protein pair and the number of occurrences in each RNA-protein pair.
  • the subsequence may be a 3-mer subsequence of an RNA sequence or a protein sequence, or a 4-mer subsequence of an RNA sequence or a protein sequence.
  • The total number of occurrences of the subsequence in the RPI1807 data set can be counted first. For example, the n RNA-protein pairs (n = 3243) in the RPI1807 data set can be traversed, and the number of occurrences of the subsequence in each RNA-protein pair can be obtained by counting as $x_1, x_2, \ldots, x_n$.
  • the total number of occurrences of the subsequence in the RPI1807 data set is obtained, which is recorded as $num_i$.
  • the average number of occurrences of the subsequence in each RNA-protein pair can be calculated according to the total number of occurrences $num_i$, that is, according to $m_i = num_i / n$.
  • the variance of the i-th k-mer subsequence can be calculated from its average number of occurrences in each RNA-protein pair and its number of occurrences in each RNA-protein pair, where:
  • $n$ is the number of RNA-protein pairs in the RPI1807 data set
  • $m_i$ is the average number of occurrences of the subsequence in each RNA-protein pair
  • $x_n$ is the number of occurrences of the subsequence in the n-th RNA-protein pair
  • $x_1$ is the number of occurrences of the subsequence in the first RNA-protein pair
  • $x_2$ is the number of occurrences of the subsequence in the second RNA-protein pair.
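  • A minimal sketch of this count-based variance follows. It assumes the population form $\frac{1}{n}\sum_{j=1}^{n}(x_j - m_i)^2$; the divisor $n$ (rather than $n - 1$) is an assumption, as are the function names and the toy sequences.

```python
def count_overlapping(kmer, seq):
    """Number of (possibly overlapping) occurrences of the k-mer in one sequence."""
    k = len(kmer)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer)


def mean_and_variance(counts):
    """Average count per pair and population variance (divisor n is an assumption)."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((x - mean) ** 2 for x in counts) / n
    return mean, variance


# Hypothetical toy RNA sequences, one per RNA-protein pair.
toy_rna = ["AAAA", "AAAU", "CCCC", "GAAA"]
counts = [count_overlapping("AAA", seq) for seq in toy_rna]  # x_1 ... x_n
total = sum(counts)                                          # total occurrences
mean, variance = mean_and_variance(counts)
print(counts, total, mean, variance)
```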
  • all k-mer subsequences can be sorted according to the size of the variance, for example in descending order, and the top-ranked k-mer subsequences can be selected to form the original sequence feature set.
  • a variance threshold may also be preset, and k-mer subsequences with a variance greater than the threshold are screened out, and the filtered k-mer subsequences form the original sequence feature set.
  • the k-mer feature of each RNA-protein pair in the original data set may be extracted, and the extracted k-mer feature of the RNA sequence and the k-mer feature of the protein sequence form the original sequence feature set.
  • the k-mer feature may include monomer component information (that is, each base contained) and sequence order information of the RNA sequence. Therefore, using the k-mer feature can better describe an RNA sequence, that is, an RNA sequence can be more accurately determined according to the k-mer feature, and different RNA sequences can also be distinguished by the k-mer feature.
  • the frequent itemset features of each RNA-protein pair in the original data set can also be extracted, and the original sequence feature set is composed of the extracted frequent itemset features.
  • the frequent itemset feature can combine the k-mer feature of the RNA sequence and the k-mer feature of the protein sequence. Therefore, using frequent itemset features can better distinguish between interacting and non-interacting RNA-protein pairs. It is also possible to extract k-mer features and frequent itemset features at the same time, and combine them to form the original sequence feature set. By combining k-mer features and frequent itemset features, the interaction between RNA and protein in unknown RNA-protein pairs can be predicted more accurately, which is not specifically limited in this disclosure.
  • the frequent itemset feature refers to a k-mer subsequence pair, composed of an RNA k-mer subsequence and a protein k-mer subsequence, that has a certain degree of support in the original data set, where the support degree of items A and B refers to the proportion of records in which A and B occur together.
  • For example, {AAU, 137} denotes a 3-mer subsequence pair composed of the RNA 3-mer subsequence AAU and the protein 3-mer subsequence 137.
  • the support degree of this subsequence pair is the ratio of RNA-protein pairs containing both subsequences AAU and 137 to all RNA-protein pairs in the original data set.
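  • The support degree of a subsequence pair such as {AAU, 137} can be sketched as below. The toy RNA-protein pairs are hypothetical, the protein sequences are assumed to be already encoded with the seven class labels, and the function name `support` is chosen for this sketch only.

```python
def support(rna_kmer, protein_kmer, pairs):
    """Share of RNA-protein pairs whose RNA contains rna_kmer and whose
    (class-encoded) protein contains protein_kmer."""
    hits = sum(
        1 for rna, protein in pairs
        if rna_kmer in rna and protein_kmer in protein
    )
    return hits / len(pairs)


# Hypothetical toy data set of (RNA sequence, class-encoded protein sequence).
toy_pairs = [
    ("GAAUC", "21375"),  # contains AAU and 137
    ("AAUAA", "11111"),  # contains AAU only
    ("CCCCC", "13777"),  # contains 137 only
    ("UAAUG", "51376"),  # contains AAU and 137
]
print(support("AAU", "137", toy_pairs))
```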
  • the frequent itemset features of each RNA-protein pair in the original data set can be extracted according to steps S510 to S550 to obtain the original sequence feature set.
  • Step S510 Convert the RNA sequence and protein sequence in each RNA-protein pair into k-mer subsequences respectively, and form the first candidate item set from the k-mer subsequences, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences.
  • the RNA sequence and protein sequence of each RNA-protein pair in the RPI1807 data set can be converted into 3-mer subsequences and 4-mer subsequences respectively.
  • By traversing the RPI1807 data set, all RNA 3-mer subsequences, RNA 4-mer subsequences, protein 3-mer subsequences and protein 4-mer subsequences in the data set can be found, and all 3-mer subsequences and 4-mer subsequences in the data set form the first candidate item set C1.
  • Step S520 Count the frequency of occurrence of each k-mer subsequence in the first candidate item set in the original data set, and form a frequent itemset from k-mer subsequences satisfying a preset frequency threshold.
  • the occurrence frequency of each k-mer subsequence in the first candidate item set C1 in the original data set can be counted.
  • the subsequence may be a 3-mer subsequence of an RNA sequence or a protein sequence, or may be a 4-mer subsequence of an RNA sequence or a protein sequence.
  • the occurrence frequency in the RPI1807 data set of each 3-mer subsequence or 4-mer subsequence in the first candidate item set C1 can be calculated.
  • all 3-mer subsequences and 4-mer subsequences can be screened according to preset occurrence frequency thresholds. For example, RNA 3-mer subsequences with frequency greater than the first threshold, RNA 4-mer subsequences with frequency greater than the second threshold, protein 3-mer subsequences with frequency greater than the third threshold, and protein 4-mer subsequences with frequency greater than the fourth threshold can be selected, and together constitute the frequent itemset L1.
  • the first threshold, the second threshold, the third threshold and the fourth threshold may be the same or different, which is not specifically limited in the present disclosure.
  • the occurrence frequencies of the 3-mer subsequences and 4-mer subsequences of RNA sequences and the occurrence frequencies of the 3-mer subsequences and 4-mer subsequences of protein sequences can also be sorted in descending order, and the top-ranked subsequences form the frequent itemset L1, which is not specifically limited in this disclosure.
  • Step S530 Cross-combining the RNA k-mer subsequences and protein k-mer subsequences in the frequent itemset, and forming a second candidate item set from the combined k-mer subsequence pairs.
  • RNA 3-mer subsequence and protein 3-mer subsequence, RNA 4-mer subsequence and protein 4-mer subsequence in the frequent itemset L1 can be cross-combined in pairs to obtain a variety of 3-mer subsequences pairs and 4-mer subsequence pairs, and the second candidate item set C2 is composed of multiple subsequence pairs obtained by combination.
  • For example, the 3-mer subsequence pairs "AUC_137" and "AUC_123" can be obtained by cross-combining RNA 3-mer subsequences and protein 3-mer subsequences, and the 4-mer subsequence pairs "AAUU_1737", "AAUU_1234", "AGUC_1737" and "AGUC_1234" can be obtained by cross-combining RNA 4-mer subsequences and protein 4-mer subsequences.
  • Step S540 Counting the occurrence frequency of each k-mer subsequence pair in the second candidate item set in the original data set to obtain the support degree of each k-mer subsequence pair.
  • the occurrence frequency of each subsequence pair in the second candidate item set C2 in the RPI1807 data set can be counted.
  • the subsequence pair may be a 3-mer subsequence pair, or a 4-mer subsequence pair.
  • the number of occurrences of the subsequence pair in the RPI1807 data set can be counted first. For example, the N RNA-protein pairs in the RPI1807 data set can be traversed. If the subsequence pair appears in the current RNA-protein pair, the occurrence count is increased by 1; otherwise, the occurrence count remains unchanged.
  • the number of occurrences of the f-th k-mer subsequence pair in the RPI1807 data set obtained by counting is recorded as $num_f$, and the occurrence frequency of the subsequence pair in the RPI1807 data set, that is, the support degree $support_f$ of the subsequence pair, can then be calculated from $num_f$ as $support_f = num_f / N$.
  • the support degree of each subsequence pair in the second candidate item set C2 can be calculated.
  • Step S550 Composing the original sequence feature set from the k-mer subsequence pairs whose support meets the preset condition.
  • the subsequence pair satisfying the preset condition can be determined according to the support degree of each subsequence pair, and the original sequence feature set is composed of the subsequence pair satisfying the preset condition.
  • a support threshold may be preset, and subsequence pairs whose support is greater than the threshold are screened out, and the screened subsequence pairs form the original sequence feature set.
  • For example, if there are 370 subsequence pairs whose support degree is greater than the threshold, these 370 subsequence pairs are 370 frequent itemset features, and the original sequence feature set can be composed of the 370 frequent itemset features.
  • all subsequence pairs in the second candidate item set C2 can also be sorted in descending order according to the support degree, and the top-ranked subsequence pairs are selected to form the original sequence feature set, which is not specifically limited in this disclosure .
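  • Steps S510 to S550 can be sketched end to end as follows. This is a minimal illustration for a single value of k, under the assumptions that presence is counted once per RNA-protein pair, that one frequency threshold is shared by RNA and protein subsequences, and that the toy pairs and thresholds are hypothetical; the function names are not taken from the disclosure.

```python
from itertools import product


def kmers(seq, k):
    """Set of overlapping k-mers of one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mine_frequent_pairs(pairs, k, item_threshold, support_threshold):
    n = len(pairs)
    rna_sets = [kmers(rna, k) for rna, _ in pairs]
    protein_sets = [kmers(protein, k) for _, protein in pairs]

    # S510: first candidate item set C1 (all k-mers seen in the data set).
    c1_rna = set().union(*rna_sets)
    c1_protein = set().union(*protein_sets)

    # S520: frequent itemset L1 (k-mers whose frequency exceeds the threshold).
    l1_rna = sorted(x for x in c1_rna
                    if sum(x in s for s in rna_sets) / n > item_threshold)
    l1_protein = sorted(x for x in c1_protein
                        if sum(x in s for s in protein_sets) / n > item_threshold)

    # S530: second candidate item set C2 (cross-combination of RNA and protein k-mers).
    c2 = list(product(l1_rna, l1_protein))

    # S540: support of each pair = share of RNA-protein pairs containing both k-mers.
    supports = {
        f"{r}_{p}": sum(r in rs and p in ps
                        for rs, ps in zip(rna_sets, protein_sets)) / n
        for r, p in c2
    }

    # S550: keep the pairs whose support exceeds the support threshold.
    return {pair: s for pair, s in supports.items() if s > support_threshold}


toy_pairs = [("AUCG", "1371"), ("AUCA", "1372"), ("GGGG", "1373"), ("AUCU", "7777")]
features = mine_frequent_pairs(toy_pairs, k=3, item_threshold=0.5, support_threshold=0.4)
print(features)
```

  • Filtering single k-mers first (S520) keeps the cross-combination in S530 small, since only frequent RNA and protein subsequences are paired.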
  • the RNA sequence and the protein sequence in each RNA-protein pair can be respectively converted into k-mer subsequences to obtain k-mer subsequence pairs.
  • the frequency of occurrence of each k-mer subsequence pair in the original data set is counted, and the k-mer subsequence pair that meets the preset condition of the frequency of occurrence is used as the frequent itemset feature and constitutes the original sequence feature set.
  • the RNA sequences and protein sequences of all positive RNA-protein pairs in the RPI1807 dataset can be converted into positive 3-mer subsequences and positive 4-mer subsequences, respectively.
  • the RNA sequences and protein sequences of all negative RNA-protein pairs in this dataset are converted into negative 3-mer subsequences and negative 4-mer subsequences, respectively.
  • RNA 3-mer subsequences and protein 3-mer subsequences, and RNA 4-mer subsequences and protein 4-mer subsequences, in the data set are combined in pairs to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs.
  • Exemplarily, positive example RNA 3-mer subsequences and positive example protein 3-mer subsequences can be cross-combined to obtain positive example 3-mer subsequence pairs.
  • Negative RNA 3-mer subsequences and negative protein 3-mer subsequences can be cross-combined to obtain negative 3-mer subsequence pairs.
  • the positive example RNA 4-mer subsequence and the positive example protein 4-mer subsequence can be cross-combined to obtain the positive example 4-mer subsequence pair.
  • the negative example RNA 4-mer subsequence and the negative example protein 4-mer subsequence can be cross-combined to obtain the negative example 4-mer subsequence pair.
  • the occurrence frequency of each subsequence pair in the data set can be counted. For example, for any positive example 3-mer subsequence pair, the occurrence frequency can be calculated as the ratio num / NUM, where:
  • num is the number of occurrences of the positive 3-mer subsequence pair in the data set
  • NUM is the total number of occurrences of all positive 3-mer subsequence pairs in the data set.
  • all 3-mer subsequence pairs and 4-mer subsequence pairs can be sorted according to the frequency of occurrence, for example in descending order, and the top-ranked k-mer subsequence pairs can be selected to form frequent itemsets. For example, all the positive 3-mer subsequence pairs are sorted in descending order, and the first m 3-mer subsequence pairs can be selected to form the frequent itemset A1. All the positive 4-mer subsequence pairs are sorted in descending order, and the first n 4-mer subsequence pairs can be selected to form the frequent itemset A2.
  • All the negative 3-mer subsequence pairs are sorted in descending order, and the first p 3-mer subsequence pairs can be selected to form the frequent itemset A3. All negative 4-mer subsequence pairs are sorted in descending order, and the first q 4-mer subsequence pairs can be selected to form frequent itemsets A4. Then the original sequence feature set is composed of these four frequent itemsets A1, A2, A3 and A4.
  • the occurrence frequency threshold can also be preset, and the k-mer subsequence pairs whose occurrence frequency is greater than the threshold are filtered out, and the filtered k-mer subsequence pairs are used as frequent itemset features to form the original sequence feature set , which is not specifically limited in the present disclosure.
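  • The frequency ranking for one group (for example, positive 3-mer subsequence pairs) can be sketched as below. How occurrences are counted is an assumption here: every RNA k-mer position is combined with every protein k-mer position within a pair, and the frequency is num / NUM over that group. The toy positive pairs and function names are hypothetical.

```python
from collections import Counter
from itertools import product


def pair_frequencies(pairs, k):
    """Occurrence frequency num / NUM of each RNA-protein k-mer subsequence pair."""
    counts = Counter()
    for rna, protein in pairs:
        rna_kmers = [rna[i:i + k] for i in range(len(rna) - k + 1)]
        protein_kmers = [protein[i:i + k] for i in range(len(protein) - k + 1)]
        counts.update(f"{r}_{p}" for r, p in product(rna_kmers, protein_kmers))
    total = sum(counts.values())  # NUM: occurrences of all subsequence pairs
    return {pair: num / total for pair, num in counts.items()}


def top_m(frequencies, m):
    """First m subsequence pairs in descending order of frequency (ties by name)."""
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked[:m]]


# Hypothetical positive RNA-protein pairs (protein already class-encoded).
positive_pairs = [("AUCA", "1371"), ("AUCG", "1372")]
positive_3mer_freq = pair_frequencies(positive_pairs, 3)
A1 = top_m(positive_3mer_freq, 1)
print(A1, positive_3mer_freq["AUC_137"])
```

  • The same two functions could be applied to the positive 4-mer, negative 3-mer and negative 4-mer groups to obtain A2, A3 and A4.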
  • the frequent itemset feature can combine the k-mer feature of the RNA sequence and the k-mer feature of the protein sequence, and the frequent itemset feature can be used to better distinguish between interacting and non-interacting RNA-protein pairs. Therefore, when the original sequence feature set is composed of frequent itemset features, and the feature extraction of the RNA-protein pair to be predicted is performed based on the original sequence feature set, the extracted sequence features of the RNA-protein pair to be predicted can be used to determine more accurately whether the RNA-protein pair to be predicted has an interaction.
  • Step S320 Determine the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.
  • the RNA sequence and protein sequence in the RNA-protein pair to be predicted can be converted into k-mer subsequences respectively, and after the original sequence feature set is obtained, each k-mer subsequence can be searched for in the original sequence feature set, and the sequence features of the RNA-protein pair to be predicted can be obtained according to the search results.
  • the sequence feature of the RNA-protein pair to be predicted may refer to a complete sequence feature composed of RNA sequence features and protein sequence features.
  • the original sequence feature set can be composed of 560 kinds of k-mer subsequences, for example, the 560 kinds of k-mer subsequences can be [CCC, ..., AGU, CCCC, ..., CUGG, 777, ..., 373, 7774 , ..., 7571].
  • the feature calculation can be performed on the RNA sequence and the protein sequence in the RNA-protein pair to be predicted based on the original sequence feature set to obtain the sequence feature of the RNA-protein pair to be predicted.
  • the features of each feature dimension correspond to a k-mer subsequence.
  • the subsequence CCC is the feature of the first feature dimension
  • the subsequence 7571 is the feature of the 560th feature dimension. All 3-mer subsequences and 4-mer subsequences of the RNA-protein pair to be predicted can be searched in the original sequence feature set, and it is determined whether the features on each feature dimension in the original sequence feature set exist according to the search results.
  • If the feature exists, the feature value on the feature dimension is 1, and if it does not exist, it is 0.
  • For example, if the RNA sequence in the RNA-protein pair to be predicted is AAAACCCGGG
  • the feature CCC of the first feature dimension in the original sequence feature set is also a 3-mer subsequence of the RNA-protein pair. Therefore, it can be determined that the feature CCC on the first feature dimension in the original sequence feature set exists, and the corresponding feature value can be recorded as 1.
  • the subsequence AGU is not a 3-mer subsequence of this RNA sequence, so the feature value on the feature dimension corresponding to the feature AGU in the original sequence feature set can be recorded as 0.
  • a feature value vector [1, 0, ..., 0, 1, ...] can thus be obtained, and this feature value vector is the sequence feature of the RNA-protein pair to be predicted. It can be understood that each feature value contained in the feature value vector is in one-to-one correspondence with a feature dimension of the original sequence feature set.
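  • The lookup described above can be sketched as a binary presence vector. The RNA sequence AAAACCCGGG and the listed feature names come from the example; the shortened eight-entry feature set (the elided entries are simply left out), the protein sequence "7774373" and the function names are hypothetical and serve only to make the sketch runnable.

```python
def kmer_set(seq, ks=(3, 4)):
    """All overlapping 3-mer and 4-mer subsequences of one sequence."""
    return {seq[i:i + k] for k in ks for i in range(len(seq) - k + 1)}


def kmer_feature_vector(rna, protein, feature_set):
    """1 if the feature k-mer occurs in the RNA or the class-encoded protein, else 0.
    RNA k-mers use letters and protein k-mers use digits, so they cannot collide."""
    present = kmer_set(rna) | kmer_set(protein)
    return [1 if feature in present else 0 for feature in feature_set]


# Shortened stand-in for the 560-entry original sequence feature set (assumption).
feature_set = ["CCC", "AGU", "CCCC", "CUGG", "777", "373", "7774", "7571"]
vector = kmer_feature_vector("AAAACCCGGG", "7774373", feature_set)
print(vector)
```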
  • the original sequence feature set is composed of the extracted k-mer feature of the RNA sequence and the k-mer feature of the protein sequence.
  • the feature extraction of the RNA-protein pair to be predicted is performed to obtain the sequence characteristics of the RNA-protein pair to be predicted.
  • the k-mer feature can include the monomer component information (that is, each base contained) and the sequence order information of the RNA sequence. Therefore, using the k-mer feature can better describe an RNA sequence, that is, an RNA sequence can be more accurately determined according to the k-mer feature, and different RNA sequences can also be distinguished by the k-mer feature.
  • the RNA sequence and protein sequence in the RNA-protein pair to be predicted can be converted into k-mer subsequences respectively, and the RNA k-mer subsequences and protein k-mer subsequences are cross-combined to obtain a variety of RNA-protein k-mer subsequence pairs.
  • For example, RNA 3-mer subsequences and protein 3-mer subsequences, and RNA 4-mer subsequences and protein 4-mer subsequences, can be cross-combined in pairs to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs.
  • Each RNA-protein k-mer subsequence pair can be searched in the original sequence feature set, and the sequence features of the RNA-protein pair can be obtained according to the search results.
  • the original sequence feature set may be composed of 370 frequent itemset features, for example, the 370 frequent itemset features may be [CUG_122, AAU_122, ..., CUUU_1312, UCUG_1312, ...].
  • RNA 3-mer subsequences and protein 3-mer subsequences, RNA 4-mer subsequences and protein 4-mer subsequences can be paired to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs. Then, feature calculation can be performed on the RNA sequence and protein sequence in the RNA-protein pair to be predicted based on the original sequence feature set to obtain the sequence feature of the RNA-protein pair.
  • the feature of each feature dimension corresponds to a kind of k-mer subsequence pair.
  • the subsequence pair CUG_122 is a feature of the first feature dimension. All subsequence pairs of the to-be-predicted RNA-protein pair can be searched in the original sequence feature set, and it is determined whether the features on each feature dimension in the original sequence feature set exist according to the search results. If it exists, the feature value on the feature dimension is 1, and if it does not exist, it is 0.
  • For example, the RNA sequence in the RNA-protein pair to be predicted is AUCUGAAAU
  • and the protein sequence is 512261312. It can be seen that CUG_122, AAU_122, and UCUG_1312 among the subsequence pairs of the RNA-protein pair exist in the original sequence feature set. Therefore, the feature values on the corresponding feature dimensions in the original sequence feature set can be recorded as 1.
  • a 370-dimensional feature value vector [1, 1, ..., 0, 1, ...] can be obtained, and this feature value vector is the sequence feature of the RNA-protein pair to be predicted.
  • each feature value contained in the feature value vector is in one-to-one correspondence with a feature dimension of the original sequence feature set.
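  • The example above can be reproduced with the following minimal sketch, which cross-combines overlapping RNA and protein k-mers of equal length and checks each frequent itemset feature for presence. Only the four features named in the text are used (the elided entries of the 370-feature set are left out), and the function names are hypothetical.

```python
from itertools import product


def kmers(seq, k):
    """Set of overlapping k-mers of one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pair_feature_vector(rna, protein, pair_features, ks=(3, 4)):
    """1 if the RNA k-mer and the protein k-mer of a feature both occur in the
    RNA-protein pair to be predicted, otherwise 0."""
    present = set()
    for k in ks:
        present.update(
            f"{r}_{p}" for r, p in product(kmers(rna, k), kmers(protein, k))
        )
    return [1 if feature in present else 0 for feature in pair_features]


# The four frequent itemset features named in the example.
pair_features = ["CUG_122", "AAU_122", "CUUU_1312", "UCUG_1312"]
vector = pair_feature_vector("AUCUGAAAU", "512261312", pair_features)
print(vector)
```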
  • the RNA sequence and protein sequence in the RNA-protein pair to be predicted can be converted into k-mer subsequences respectively, and each k-mer subsequence can be searched for in the original sequence feature set to obtain the first sequence features. Then, RNA k-mer subsequences and protein k-mer subsequences can be combined to obtain multiple RNA-protein k-mer subsequence pairs, and each RNA-protein k-mer subsequence pair can be searched for in the original sequence feature set to obtain the second sequence features. Finally, the sequence features of the RNA-protein pair to be predicted can be composed of the first sequence features and the second sequence features.
  • the original sequence feature set may include two feature subsets, which include 560 k-mer subsequences [CCC, ..., CCCC, ..., 777, ..., 7774, ...] and 370 frequent itemset features [CUG_122, AAU_122, ..., CUUU_1312, UCUG_1312, ...], respectively.
  • the RNA-protein pair to be predicted can be converted into RNA 3-mer subsequence, RNA 4-mer subsequence, protein 3-mer subsequence and protein 4-mer subsequence.
  • RNA 3-mer subsequences can also be paired with protein 3-mer subsequences, and RNA 4-mer subsequences with protein 4-mer subsequences, to obtain various 3-mer subsequence pairs and 4-mer subsequence pairs.
  • the feature calculation can be performed on the RNA sequence and the protein sequence in the RNA-protein pair to be predicted according to the original sequence feature set to obtain the sequence feature of the RNA-protein pair to be predicted.
  • all subsequences and subsequence pairs of the RNA-protein pair to be predicted can be searched in the original sequence feature set, and it is determined whether the features on each feature dimension in the original sequence feature set exist according to the search results.
  • a 560-dimensional feature value vector [1, ..., 1, ..., 1, ..., 0, ...] and a 370-dimensional feature value vector can be obtained.
  • a 930-dimensional feature value vector can be obtained by concatenating the two feature value vectors, and it is the sequence feature of the RNA-protein pair to be predicted. It is also possible to input the two feature value vectors directly into the interaction prediction model at the same time, which is not specifically limited in this disclosure.
  • a 930-dimensional original sequence feature set can also be composed of 560 k-mer subsequences and 370 frequent itemset features.
  • the original sequence feature set is [CCC, ..., CCCC, ..., 777, ..., 7774, ..., CCA_121, ..., UCUG_1312, ..., AAU_122, ..., CUUU_1312, ...].
  • All subsequences and subsequence pairs of the RNA-protein pair can be searched for in the original sequence feature set, and according to the search results, it is determined whether the feature on each feature dimension in the original sequence feature set exists, giving a 930-dimensional feature value vector [1, ..., 1, ..., 1, ..., 0, ..., 1, ..., 1, ..., 1, ..., 0, ...]; this feature value vector is the sequence feature of the RNA-protein pair to be predicted.
  • the frequent itemset features of each RNA-protein pair in the original data set can also be extracted, and the original sequence feature set is composed of the extracted frequent itemset features.
  • the frequent itemset feature can combine the k-mer feature of the RNA sequence and the k-mer feature of the protein sequence. Therefore, using frequent itemset features can better distinguish between interacting and non-interacting RNA-protein pairs. It is also possible to extract k-mer features and frequent itemset features at the same time, and combine them to form the original sequence feature set. By combining k-mer features and frequent itemset features, the interaction between RNA and protein in unknown RNA-protein pairs can be predicted more accurately.
  • In step S230, the RNA-protein pair to be predicted is vectorized to obtain an RNA sequence representation vector and a protein sequence representation vector of the RNA-protein pair to be predicted.
  • the obtained sequence features can be used as the input of the first interaction prediction model.
  • the RNA-protein pair to be predicted can also be vectorized, and the obtained vectors can be used as the input of at least one second interaction prediction model.
  • the RNA sequence and the protein sequence in the RNA-protein pair to be predicted can be converted into k-mer subsequences respectively.
  • the RNA sequence can be divided, without overlap, into M RNA k-mer subsequences
  • and the protein sequence can be divided, without overlap, into N protein k-mer subsequences.
  • For example, if the RNA sequence is AUCUGAAAU, it can be divided into three RNA 3-mer subsequences, namely AUC, UGA and AAU.
  • the non-overlapping division of the RNA sequence and the protein sequence into multiple k-mer subsequences serves to vectorize the RNA sequence and the protein sequence, that is, the bases in the RNA sequence and the amino acids in the protein sequence are vectorized in units of k-mers.
  • each base contained in the RNA sequence in the RNA-protein pair can also be vectorized to obtain multiple base vectors, and the multiple base vectors can be spliced to obtain the representation vector of the RNA sequence.
  • each amino acid contained in the protein sequence in the RNA-protein pair can be vectorized to obtain multiple amino acid vectors, and multiple amino acid vectors can be spliced to obtain the representation vector of the protein sequence.
  • the RNA sequence can also be divided, with overlap, into P RNA k-mer subsequences, and the protein sequence can be divided, with overlap, into Q protein k-mer subsequences, which is not specifically limited in this disclosure. Then, the RNA sequence representation vector and the protein sequence representation vector can be respectively input into the second interaction prediction model.
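  • The two ways of dividing a sequence before vectorization can be sketched with a single helper. The function name `split_kmers` and its `overlap` flag are hypothetical; the RNA sequence AUCUGAAAU is the one used in the example.

```python
def split_kmers(seq, k, overlap=False):
    """Divide a sequence into k-mers.

    overlap=False: consecutive blocks of width k (a short tail is dropped).
    overlap=True:  sliding window moved one symbol at a time.
    """
    step = 1 if overlap else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


rna = "AUCUGAAAU"
non_overlapping = split_kmers(rna, 3)               # M = 3 subsequences
overlapping = split_kmers(rna, 3, overlap=True)     # P = 7 subsequences
print(non_overlapping)
print(overlapping)
```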
  • each k-mer subsequence of the RNA sequence and the protein sequence can be encoded first.
  • Each RNA 3-mer subsequence and protein 3-mer subsequence can be Embedding (vector mapping) encoded in sequence, so that each 3-mer subsequence is represented by a low-dimensional vector, and the corresponding multiple 3-mer subsequence vectors can be obtained.
  • each 3-mer subsequence can be One-Hot encoded; One-Hot encoding is also called one-bit effective encoding.
  • the method is to use an N-bit status register to encode N states, where each state has its own independent register bit, and at any time only one bit in the register is valid.
  • for the i-th type of RNA 3-mer subsequence, a 64-dimensional One-Hot vector can be obtained by encoding, in which the i-th element is set to 1 and all other elements are set to 0, such as [0, 1, 0, 0, ..., 0].
  • for the j-th type of protein 3-mer subsequence, a 343-dimensional One-Hot vector can be obtained by encoding, in which the j-th element is set to 1 and all other elements are set to 0.
  • each RNA 3-mer subsequence and protein 3-mer subsequence can correspond to a 3-mer One-Hot vector.
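  • As an illustrative sketch of the One-Hot encoding described above, each possible 3-mer can be assigned a fixed position in a lookup table. The names `build_index` and `one_hot` are hypothetical. The dimension 64 follows from 4 bases (4^3), and the dimension 343 equals 7^3, which implies seven symbols per position; how amino acids map to those seven symbols is not stated in this passage, so the sketch uses the placeholder symbols 0 to 6.

```python
from itertools import product

RNA_ALPHABET = "ACGU"          # 4 bases -> 4**3 = 64 possible 3-mers
PROTEIN_CLASSES = "0123456"    # placeholder 7-symbol alphabet -> 7**3 = 343


def build_index(alphabet, k=3):
    """Map every possible k-mer over `alphabet` to a fixed position."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def one_hot(kmer, index):
    """Return a vector with a single 1 at the k-mer's position."""
    vec = [0] * len(index)
    vec[index[kmer]] = 1
    return vec


rna_index = build_index(RNA_ALPHABET)
protein_index = build_index(PROTEIN_CLASSES)
v = one_hot("AAC", rna_index)   # "AAC" is the second 3-mer in this ordering
```

  • The ordering of the 3-mers in the lookup table is arbitrary; what matters is that the same table is used for training and prediction.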
  • dense vectors can also be used to represent each 3-mer subsequence.
  • the Word2vec algorithm can be used to map each 3-mer subsequence into a vector space, and each 3-mer subsequence can be represented by a subsequence vector in the vector space.
  • RNA sequence data can be obtained, and the BERT pre-training model can be used for training. After the training is completed, a given RNA sequence can be input into the trained model to obtain the high-dimensional vector of the RNA sequence, which is not specifically limited in the present disclosure.
  • The RNA sequence and the protein sequence in the RNA-protein pair to be predicted can be converted into 3-mer subsequences respectively, for example obtaining M RNA 3-mer subsequences and N protein 3-mer subsequences.
  • The M 3-mer One-Hot vectors corresponding to the M RNA 3-mer subsequences can be determined by lookup and spliced in sequence, for example in the row direction, to obtain an M×64 two-dimensional matrix.
  • the two-dimensional matrix is the 3-mer One-Hot representation vector of the RNA sequence.
  • The N 3-mer One-Hot vectors corresponding to the N protein 3-mer subsequences can likewise be determined by lookup and spliced in sequence in the row direction to obtain an N×343 two-dimensional matrix; this two-dimensional matrix is the 3-mer One-Hot representation vector of the protein sequence. It is understandable that the M 3-mer One-Hot vectors or the N 3-mer One-Hot vectors can also be spliced in the column direction, or spliced directly (i.e., head to tail) to obtain a single serialized 3-mer One-Hot vector, which is not specifically limited in the present disclosure.
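  • The row-direction splicing and the head-to-tail splicing described above can be sketched as follows for the RNA case. This is only an illustration under stated assumptions: the names `encode_sequence` and `flatten` are hypothetical, and the non-overlapping 3-mer division is used.

```python
from itertools import product

K = 3
RNA_INDEX = {"".join(p): i for i, p in enumerate(product("ACGU", repeat=K))}


def encode_sequence(sequence, index, k=K):
    """Return one One-Hot row per non-overlapping k-mer (row-wise splicing)."""
    rows = []
    for start in range(0, len(sequence) - k + 1, k):
        row = [0] * len(index)
        row[index[sequence[start:start + k]]] = 1
        rows.append(row)
    return rows


def flatten(matrix):
    """Head-to-tail splicing: concatenate all rows into one long vector."""
    return [value for row in matrix for value in row]


matrix = encode_sequence("AUGGCUACG", RNA_INDEX)   # M = 3 rows, 64 columns
flat = flatten(matrix)                             # length M * 64
```

  • The matrix form keeps the position of each 3-mer explicit, which suits convolutional or recurrent models; the flattened form is a single vector of length M×64.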
  • the RNA sequence representation vector and the protein sequence representation vector can be used as the input of the deep learning model to further discover feature combinations that are rare or new in the data and to reveal the interactions between implicit features.
  • In step S240, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, multiple interaction prediction models are used to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted.
  • the multiple interaction prediction models are obtained based on joint training.
  • the multiple interaction prediction models may all be traditional machine learning models, or all may be deep learning models, or may include at least one traditional machine learning model and at least one deep learning model at the same time.
  • Traditional machine learning models are limited in their ability to process natural data in its raw form. For example, building a pattern recognition or machine learning system requires expert knowledge to extract features from raw data (such as the pixel values of an image) and convert them into an appropriate feature representation.
  • the traditional machine learning model may include a linear regression model, a logistic regression model, a support vector machine model, a decision tree model, a K-nearest neighbor (K-Nearest Neighbor, KNN) model, a random forest model, a naive Bayesian model, etc.
  • the deep learning model has the ability to automatically extract features, and can be composed of multiple processing layers to form a complex computing model, thereby automatically obtaining data representation and multiple levels of abstraction, which is a kind of learning for feature representation.
  • the deep learning model may include a convolutional neural network model, a recurrent neural network model, and the like.
  • the RNA-protein pair to be predicted need not be vectorized; instead, only the sequence features obtained by feature extraction of the RNA-protein pair to be predicted are used as the input of each traditional machine learning model.
  • feature extraction of the RNA-protein pair to be predicted need not be performed; instead, only the RNA sequence representation vector and the protein sequence representation vector obtained by vectorizing the RNA-protein pair to be predicted are used as the input of each deep learning model.
  • the sequence features of the RNA-protein pair can be input into at least one first interaction prediction model to obtain at least one first interaction prediction value.
  • the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted can be input into at least one second interaction prediction model to obtain at least one second interaction prediction value.
  • the at least one first interaction prediction model and the at least one second interaction prediction model are obtained based on joint learning, where joint learning refers to combining multiple sub-models into one model to complete the final target task.
  • At least one first interaction prediction model and at least one second interaction prediction model may be combined, and a final prediction result may be obtained by fusing the outputs of each model.
  • During joint training, the output of each model and the result of their weighted summation can be considered at the same time, and the model parameters of all models can be optimized simultaneously to obtain the best overall model, thereby improving the predictive ability of the overall model.
  • the first interaction prediction model may be a traditional machine learning model
  • the second interaction prediction model may be a deep learning model.
  • the sequence features of the RNA-protein pair to be predicted can be input into at least one traditional machine learning model to obtain at least one first interaction prediction value.
  • the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted can be input into at least one deep learning model to obtain at least one second interaction prediction value.
  • each deep learning model can include at least two sub-deep learning models, which are respectively used to process the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted.
  • the RNA sequence representation vector can be input into the first sub-deep learning model to obtain the first sequence feature.
  • the protein sequence representation vector can be input into the second sub-deep learning model to obtain the second sequence feature.
  • the first sequence feature and the second sequence feature can be fused through the fully connected layer of each deep learning model to obtain a second interaction prediction value based on the fused features.
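  • The two-branch structure described above (one sub-model per sequence, followed by fusion in a fully connected layer) can be sketched in plain Python. This is a toy, untrained illustration under loud assumptions: mean pooling stands in for the CNN or recurrent sub-models, the fusion is a simple concatenation, and all names, dimensions and weights are hypothetical.

```python
import math
import random


def mean_pool(matrix):
    """Stand-in branch encoder: average the rows of a representation matrix."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]


def dense_sigmoid(features, weights, bias):
    """Fully connected layer with one output unit and a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def predict_pair(rna_matrix, protein_matrix, weights, bias):
    rna_feature = mean_pool(rna_matrix)          # first sequence feature
    protein_feature = mean_pool(protein_matrix)  # second sequence feature
    fused = rna_feature + protein_feature        # fusion by concatenation
    return dense_sigmoid(fused, weights, bias)


random.seed(0)
rna_matrix = [[1, 0, 0, 0], [0, 0, 1, 0]]           # toy 2 x 4 representation
protein_matrix = [[0, 1, 0], [0, 1, 0], [1, 0, 0]]  # toy 3 x 3 representation
weights = [random.uniform(-1, 1) for _ in range(4 + 3)]
y2 = predict_pair(rna_matrix, protein_matrix, weights, bias=0.0)
```

  • In a real model each branch would be a trained network and the fully connected layer would have learned weights; the sketch only shows how two separately computed sequence features meet in one output value between 0 and 1.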
  • the traditional machine learning model may include at least one of an LR (Logistic Regression) model, an SVM model and a decision tree model
  • the deep learning model may include at least one of a CNN model and a recurrent neural network model, wherein the recurrent neural network model can be an LSTM model or a BiLSTM (Bi-directional LSTM, bidirectional long short-term memory network) model.
  • a 930-dimensional original sequence feature set can be composed of 560 k-mer subsequences and 370 frequent itemset features. Based on the k-mer subsequences of the RNA-protein pair to be predicted, a feature value is calculated for each entry of the original sequence feature set to obtain a 930-dimensional feature value vector, which is used as the input of the LR model to obtain the interaction prediction value $y_1$.
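  • One possible way to turn a feature set into a feature value vector is sketched below. It is an assumption-laden illustration rather than the disclosed calculation: the feature value is taken to be a 0/1 presence flag, a frequent itemset feature is represented as an (RNA k-mer, protein k-mer) tuple, and the name `sequence_features` and the 4-entry feature set are hypothetical.

```python
def split_kmers(sequence, k=3):
    """Non-overlapping k-mer division (incomplete tail dropped)."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]


def sequence_features(rna, protein, feature_set, k=3):
    """Return one 0/1 value per entry of the original sequence feature set.

    An entry is either a single k-mer (str) or an (RNA k-mer, protein k-mer)
    tuple standing in for a frequent-itemset feature. A real feature set
    would need to tag each single k-mer as RNA or protein, because the two
    alphabets share letters; this toy version does not.
    """
    rna_kmers = set(split_kmers(rna, k))
    protein_kmers = set(split_kmers(protein, k))
    present = rna_kmers | protein_kmers
    vector = []
    for feature in feature_set:
        if isinstance(feature, tuple):
            hit = feature[0] in rna_kmers and feature[1] in protein_kmers
        else:
            hit = feature in present
        vector.append(1 if hit else 0)
    return vector


# Toy feature set with 4 entries; the example in the text has 930.
feature_set = ["AUG", "GGG", "MKV", ("AUG", "MKV")]
x = sequence_features("AUGGCU", "MKVLAA", feature_set)
```

  • The resulting vector has one position per feature-set entry, so its length is fixed by the feature set and does not depend on the length of the input sequences.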
  • the RNA-protein pair to be predicted can also be vectorized to obtain the k-mer One-Hot vectors of the RNA sequence and the protein sequence respectively, which are used as the input of the CNN model and the BiLSTM model respectively to obtain the interaction prediction values $y_2$ and $y_3$.
  • the LR model has a good memory ability, and can learn the correlation between sequences or features from data.
  • the CNN model and the BiLSTM model have good generalization capabilities, and can discover combinations of features that appear rarely or new in the data, and then reveal the interaction between implicit features.
  • the CNN model can better capture features but ignores the location information of the features, while the BiLSTM model has better memory ability, and can use the sequence information and location information of the data to make up for the defects of the CNN model in memory ability.
  • the LR model is simple and has good interpretability.
  • the joint learning of CNN model, BiLSTM model and LR model can enhance the interpretability of the overall model for RPI prediction.
  • the present disclosure can effectively combine the characteristics of each interaction prediction model, thereby improving the prediction ability of the overall model.
  • step S250 the interaction between the RNA and the protein is determined according to the plurality of interaction prediction values.
  • a weighted sum calculation can be performed on the multiple interaction prediction values, and the interaction between RNA and protein can be determined according to the calculation results. If the calculation result is greater than the preset interaction prediction value threshold, it can be determined that there is an interaction between the RNA and the protein. If the calculation result is less than or equal to the preset interaction prediction value threshold, it can be determined that there is no interaction between the RNA and the protein. By fusing the output values of each interaction prediction model to obtain the final prediction result, the interaction between RNA and protein in the RNA-protein pair to be predicted can be more accurately determined.
  • When predicting the interaction of an RNA-protein pair, the interaction prediction value $y_{out}$ of the RNA-protein pair to be predicted can be calculated as a weighted sum:
  • $y_{out} = \alpha y_1 + \beta y_2 + \gamma y_3$
  • where $y_1$ is the output value of the logistic regression model, $y_2$ is the output value of the convolutional neural network model, $y_3$ is the output value of the recurrent neural network model, and $\alpha$, $\beta$, $\gamma$ are the weight parameters of the logistic regression model, the convolutional neural network model and the recurrent neural network model, respectively; $y_{out}$ can take any value between 0 and 1.
  • the interaction prediction value threshold can be preset as 0.5. When $y_{out} > 0.5$, the prediction result can be marked as 1, which means that the RNA-protein pair has an interaction. When $y_{out} \le 0.5$, the prediction result can be marked as 0, which means that the RNA-protein pair has no interaction.
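  • The weighted fusion and the threshold decision described above can be sketched as follows. The function names `fuse` and `decide`, as well as the example prediction values and weights, are hypothetical and chosen only for illustration.

```python
def fuse(predictions, weights):
    """Weighted sum of the individual interaction prediction values."""
    return sum(w * y for w, y in zip(weights, predictions))


def decide(y_out, threshold=0.5):
    """Return 1 (interaction) if y_out exceeds the threshold, else 0."""
    return 1 if y_out > threshold else 0


# Hypothetical values: y1 (LR), y2 (CNN), y3 (recurrent model) and weights.
y_out = fuse([0.9, 0.6, 0.7], [0.2, 0.4, 0.4])
label = decide(y_out)
```

  • Because the comparison is strict, a fused value exactly equal to the threshold is marked as 0, matching the rule stated above.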
  • the weight parameters $\alpha$, $\beta$, and $\gamma$ can be obtained based on joint learning and training.
  • multiple interaction prediction models and related parameters can be jointly trained in advance according to steps S610 to S650, so as to optimize all model parameters in each prediction model. The final model obtained by training is then used to make predictions for RNA-protein pairs whose interactions are unknown.
  • Step S610 Obtain a training data set, which includes positive RNA-protein pairs and negative RNA-protein pairs.
  • All RNA-protein pairs in the original data set can be used as the training data set, or only some of the RNA-protein pairs in the original data set can be used as the training data set.
  • For example, the data set contains 3243 RNA-protein pairs, including 1807 positive pairs and 1436 negative pairs. Exemplarily, 1200 positive examples and 1000 negative examples may be selected as the training data set. It can be understood that the number of RNA-protein pairs in the training data set is only illustrative, and any number of RNA-protein pairs can be obtained to train each interaction prediction model multiple times to improve the performance of each interaction prediction model.
  • Positive RNA-protein pairs can be labeled, and the resulting label value is "1", which means that the RNA-protein pair has an interaction.
  • Negative RNA-protein pairs can be labeled, and the resulting label value is "0", which means that the RNA-protein pair has no interaction.
  • Step S620 Obtain multiple interaction prediction values for each RNA-protein pair in the training data set by using the multiple interaction prediction models.
  • multiple interaction prediction models can be LR model, CNN model and BiLSTM model respectively.
  • feature extraction is performed on each RNA-protein pair in the training data set, and the extracted sequence features can be sequentially input into the LR model.
  • Each RNA-protein pair is vectorized, and the obtained RNA sequence representation vector and protein sequence representation vector can be input into the CNN model and the BiLSTM model.
  • Training each prediction model with the i-th RNA-protein pair is taken as an example.
  • the sequence features of the RNA-protein pair can be input into the LR model, and the predicted value $y_1$ is output.
  • The RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair can be respectively input into two CNN sub-models, and the outputs of the two CNN sub-models are spliced to finally output a predicted value $y_2$.
  • the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair can be input into the BiLSTM model, and the predicted value $y_3$ can be output. It can be understood that, for each RNA-protein pair in the training data set, three interaction prediction values can be obtained through the three interaction prediction models.
  • Step S630 Obtain the joint prediction value of each RNA-protein pair in the training data set according to the plurality of interaction prediction values.
  • the multiple interaction prediction values can be weighted and summed to obtain a joint prediction value for each RNA-protein pair. Still taking the i-th RNA-protein pair in the training data set as an example, the joint prediction value $y_{out}$ can be calculated as:
  • $y_{out} = \alpha y_1 + \beta y_2 + \gamma y_3$
  • where $y_1$ is the output value of the LR model, $y_2$ is the output value of the CNN model, $y_3$ is the output value of the BiLSTM model, and $\alpha$, $\beta$, $\gamma$ are the weight parameters of the LR model, the CNN model and the BiLSTM model, respectively.
  • Step S640 Use a loss function to compute, from the joint prediction value and the label value of each RNA-protein pair in the training data set, the corresponding loss value.
  • the i-th RNA-protein pair is positive data, and its corresponding label value is 1.
  • the loss function can be evaluated on the joint prediction value $y_{out}$ and the label value 1 of the RNA-protein pair to obtain the corresponding loss value.
  • the cross-entropy loss function can be selected as the objective function.
  • the cross-entropy loss function is a performance function in the prediction model, which can be used to measure the degree of inconsistency between the prediction value of the prediction model and the label value. The smaller the value of the calculated cross-entropy loss function, the better the prediction effect of the model.
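  • Since the label values are 0 or 1, the cross-entropy loss can be illustrated in its binary form, $L = -[t \log y + (1 - t) \log(1 - y)]$ for prediction $y$ and label $t$. The binary form and the clipping constant `eps` are assumptions of this sketch; the function names are hypothetical.

```python
import math


def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Cross-entropy between a predicted probability and a 0/1 label."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


def mean_loss(predictions, labels):
    """Average loss over a set of RNA-protein pairs."""
    losses = [binary_cross_entropy(p, t) for p, t in zip(predictions, labels)]
    return sum(losses) / len(losses)


good = binary_cross_entropy(0.9, 1)   # confident and correct: small loss
bad = binary_cross_entropy(0.1, 1)    # confident and wrong: large loss
```

  • The loss shrinks as the joint prediction value approaches the label value, which is why a smaller loss indicates a better prediction effect.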
  • Step S650 Adjust model parameters of the plurality of interaction prediction models according to the loss value.
  • the model parameters of the multiple interaction prediction models can be iteratively updated based on the calculated loss values, and when the iteration termination condition is met, the training of the model parameters and each weight parameter of the multiple interaction prediction models is completed.
  • the parameters may be updated using a stochastic gradient descent algorithm. According to the principle of backpropagation, the objective function such as cross-entropy loss function is continuously calculated, and the model parameters and weight parameters of each interaction prediction model are simultaneously updated according to the calculated loss value. When the objective function converges to the minimum value, the training of all model parameters is completed.
  • the model parameters can also be updated iteratively through backpropagation, and the training of all model parameters is completed when a preset number of iterations is reached.
  • the preset number of iterations may be 20, and each interaction prediction model keeps updating its model parameters during the 20 backpropagation iterations.
  • the optimized model parameters can be obtained.
  • the least squares method, Adam optimization algorithm, etc. can also be used to minimize the objective function, and the model parameters and weight parameters are updated sequentially from the back to the front to optimize the parameters.
  • the weight parameters $\alpha$, $\beta$, and $\gamma$ are also part of the model parameters, and the weight parameters will also be trained and continuously optimized during the joint training of multiple interaction prediction models.
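  • The idea that sub-model parameters and fusion weights are updated together from a single loss can be illustrated with a toy example. Everything here is an assumption made for the sketch: three one-feature logistic sub-models stand in for the LR, CNN and BiLSTM models, the fusion weights are passed through a softmax so that the joint prediction stays between 0 and 1, and central finite differences replace backpropagation to keep the code short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def joint_predict(params, x):
    """params = [w1, b1, w2, b2, w3, b3, a1, a2, a3]."""
    ys = [sigmoid(params[2 * j] * x + params[2 * j + 1]) for j in range(3)]
    exps = [math.exp(a) for a in params[6:9]]
    weights = [e / sum(exps) for e in exps]      # softmax keeps y_out in (0, 1)
    return sum(w * y for w, y in zip(weights, ys))


def loss(params, data):
    """Mean binary cross-entropy of the joint prediction."""
    total = 0.0
    for x, t in data:
        y = joint_predict(params, x)
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / len(data)


def gradient(params, data, h=1e-6):
    """Central finite differences, used here in place of backpropagation."""
    grads = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + h] + params[i + 1:]
        down = params[:i] + [params[i] - h] + params[i + 1:]
        grads.append((loss(up, data) - loss(down, data)) / (2 * h))
    return grads


def train(params, data, lr=0.1, iterations=20):
    for _ in range(iterations):                  # preset number of iterations
        grads = gradient(params, data)
        params = [p - lr * g for p, g in zip(params, grads)]  # update all at once
    return params


data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]   # toy (feature, label) pairs
initial = [0.1, 0.0, 0.2, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0]
trained = train(initial, data)
```

  • In this toy run the fusion weight of the sub-model that starts out most accurate grows relative to the others, showing that the weights are learned alongside the sub-model parameters rather than fixed by hand.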
  • When multiple interaction prediction models are used to predict the interaction of unknown RNA-protein pairs, combining the logistic regression model with the deep learning models makes it possible to use the logistic regression model to learn the correlation between the k-mer features and/or the frequent itemset features, and to use the deep learning models to reveal the interaction between implicit features while combining the characteristics of the CNN model and the BiLSTM model. During the joint training of the multiple models, the model parameters of each model are optimized at the same time, thereby improving the accuracy of the overall model in predicting the interaction of unknown RNA-protein pairs.
  • When joint training is performed on multiple interaction prediction models in advance, the original data set may be divided into a training data set, a verification data set and a test data set in proportion.
  • The RPI1807 data set is taken as an example of the original data set for illustration.
  • the data set may be divided into training data set, verification data set and test data set according to the ratio of 7:2:1.
  • the ratio of positive and negative cases in each data set can be consistent with the distribution of the overall data set, that is, the ratio is 1807:1436, which is about 1.25:1.
  • 1250 positive examples and 1000 negative examples can be selected as the training data set
  • 360 positive examples and 280 negative examples can be selected as the verification data set
  • 180 positive examples and 140 negative examples can be selected as the test data set.
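  • A proportional split that preserves the positive-to-negative ratio in each subset can be sketched as follows. The function name `stratified_split`, the fixed random seed and the placeholder pair identifiers are assumptions of the sketch.

```python
import random


def stratified_split(positives, negatives, ratios=(7, 2, 1), seed=0):
    """Split positives and negatives separately so each subset keeps the
    overall positive:negative proportion."""
    rng = random.Random(seed)
    total = sum(ratios)
    splits = {"train": [], "validation": [], "test": []}
    for group in (positives, negatives):
        items = list(group)
        rng.shuffle(items)
        n_train = len(items) * ratios[0] // total
        n_val = len(items) * ratios[1] // total
        splits["train"] += items[:n_train]
        splits["validation"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits


# Placeholder pairs: (identifier, label); real entries would hold sequences.
positives = [(f"pos{i}", 1) for i in range(1807)]
negatives = [(f"neg{i}", 0) for i in range(1436)]
splits = stratified_split(positives, negatives)
```

  • An exact integer 7:2:1 split of 1807 positive and 1436 negative pairs gives subset sizes slightly different from the rounded example counts above; the rounded counts in the text are illustrative.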
  • the training data set can be input into multiple interaction prediction models, and the corresponding model parameters can be determined by using the backpropagation algorithm to obtain the first joint model.
  • the training data set may be input into each interaction prediction model, and the model parameters may be adjusted.
  • model parameters can be weight parameters, bias parameters, intercept parameters, and so on.
  • each model parameter may be updated using a stochastic gradient descent algorithm. According to the principle of backpropagation, the objective function is continuously calculated, and the model parameters are updated according to the objective function. When the objective function converges to the minimum value, the training of the model parameters is completed, so as to obtain the first joint model.
  • the verification data set can be input into the first joint model to verify the performance of the first joint model, and the second joint model can be obtained according to the verification result.
  • a set of hyperparameters may be initialized first, and multiple interaction prediction models are continuously trained using the training data set to obtain the first joint model.
  • the hyperparameters can be the learning rate, the number of CNN layers, the size of the convolution kernel, etc.
  • the verification data set can be input into the trained first joint model to verify the prediction accuracy of the first joint model.
  • If the prediction accuracy reaches the preset accuracy threshold, the current first joint model can be used as the second joint model to obtain the final training model.
  • the final performance of the model can be tested using the test dataset on the trained model.
  • If the prediction accuracy does not reach the preset accuracy threshold, a new set of hyperparameters can be set, and the interaction prediction models can be trained and verified in turn using the training data set and the verification data set. When the prediction accuracy obtained by the trained interaction prediction models on the verification data set reaches the preset accuracy threshold, the final performance of the prediction model can be tested with a new test data set.
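  • The train-then-verify loop over hyperparameter sets can be sketched as follows. The callables `train_fn` and `validate_fn` are hypothetical stand-ins for joint training and for evaluation on the verification data set; the stub values exist only so the sketch runs.

```python
def select_model(candidates, train_fn, validate_fn, accuracy_threshold):
    """Try hyperparameter sets in turn until validation accuracy is high enough.

    `train_fn(hyperparameters)` returns a trained joint model and
    `validate_fn(model)` returns its accuracy on the verification data set;
    both are hypothetical callables supplied by the caller.
    """
    for hyperparameters in candidates:
        model = train_fn(hyperparameters)
        accuracy = validate_fn(model)
        if accuracy >= accuracy_threshold:
            return model, hyperparameters, accuracy
    return None, None, None   # no candidate reached the threshold


# Stub functions so the sketch runs without a real model.
candidates = [{"learning_rate": 0.1}, {"learning_rate": 0.01}]
fake_accuracy = {0.1: 0.80, 0.01: 0.93}
train_fn = lambda hp: hp["learning_rate"]
validate_fn = lambda model: fake_accuracy[model]
model, chosen, accuracy = select_model(candidates, train_fn, validate_fn, 0.9)
```

  • The test data set is deliberately absent from this loop: it is only used once, after a model has passed verification, so that the reported performance is not biased by the hyperparameter search.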
  • the accuracy rate of the second joint model may be determined according to the test data set, and a third joint model is obtained when the accuracy rate is greater than a preset threshold, and the third joint model includes a plurality of trained interaction prediction models.
  • After the second joint model is obtained, each RNA-protein pair in the test data set can be input into the second joint model to judge the accuracy of the second joint model. If the accuracy of the model is greater than the preset accuracy threshold, a third joint model is obtained.
  • the third joint model includes multiple trained interaction prediction models, and then multiple interaction prediction models can be used to predict the interaction between unknown RNA-protein pairs.
  • the test data set can also be used to judge the Matthews correlation coefficient of the second joint model.
  • the Matthews correlation coefficient refers to the correlation coefficient between the actual classification and the predicted classification, and its value range is [-1, 1]. A larger value indicates that the predicted value is more correlated with the real value: a value of 1 indicates that the prediction result is completely correct, a value of 0 indicates a prediction no better than random, and a value of -1 indicates that the prediction is completely opposite to the actual classification. If the Matthews correlation coefficient of the model is greater than a preset threshold, a third joint model is obtained.
  • the test data set can also be used to judge the specificity and the recall rate of the second joint model, which is not specifically limited in the present disclosure. For example, if the accuracy rate of the second joint model is not greater than the preset accuracy rate threshold, a new training data set can be obtained to train the model parameters of each interaction prediction model again, so as to continuously improve the model performance.
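  • The evaluation indicators mentioned above can be computed from the confusion counts as sketched below, using the standard definitions of accuracy, recall, specificity and the Matthews correlation coefficient, $MCC = (TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$. The function names are hypothetical, and returning 0 when a denominator is 0 is a common convention assumed here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # MCC is conventionally reported as 0 when the denominator is 0.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


y_true = [1, 1, 1, 0, 0, 0]
perfect = metrics(y_true, y_true)
inverted = metrics(y_true, [1 - t for t in y_true])
```

  • Unlike accuracy, the Matthews correlation coefficient uses all four confusion counts, so it remains informative when positive and negative examples are not balanced.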
  • parameters can be adjusted by using a training data set and a verification data set to train an optimal prediction model, and then the generalization performance of the prediction model can be tested by using a test data set.
  • the model parameters in the LR model, CNN model and BiLSTM model can be trained at the same time.
  • the model parameters of the three interaction prediction models can be adjusted at the same time according to the loss value calculated by the joint prediction value and label value of each RNA-protein pair in the training data set, and through multiple backpropagation, Finally, the model parameters of the three interaction prediction models can all tend to converge, or the training can be terminated after a certain number of iterations are satisfied.
  • the three interaction prediction models of LR model, CNN model and BiLSTM model can be trained at the same time, and then the joint learning of the three interaction prediction models can be realized. At the same time, it can not only ensure higher precision and accuracy of each interaction prediction model in predicting RNA-protein pair interaction, but also improve the training efficiency of each interaction prediction model.
  • After the training of each model is completed, the finally obtained interaction prediction models can be used to predict the interaction between unknown RNA-protein pairs, and the prediction result of the RNA-protein pair interaction can be output to the terminal device for users to view.
  • the same training data set, verification data set and test data set can be used to train and test the individual LR model, CNN model, BiLSTM model and the joint learning model of the three.
  • the performance of each prediction model can be evaluated using two performance indicators, accuracy rate and Matthews correlation coefficient.
  • the accuracy rate and the Matthews correlation coefficient of the joint learning model are superior to those of the individual LR model, CNN model, and BiLSTM model; that is, the joint learning model has better predictive ability than any single model.
  • At least one RNA sequence can also be obtained, and a protein sequence that interacts with each input RNA sequence can be searched in the database through multiple interaction prediction models.
  • multiple interaction prediction models may be jointly trained in advance by using the original data set with reference to FIG. 6 .
  • all protein sequences participating in the joint training can be stored in the database.
  • the database can also include other protein sequences that have not participated in the joint training; that is, the number of protein sequences in the database can be arbitrary. The database can also include any number of RNA sequences, for example including but not limited to all RNA sequences participating in the joint training, which is not specifically limited in the present disclosure.
  • each input RNA sequence can be combined with all protein sequences in the database to form several RNA-protein pairs.
  • the jointly learned multiple interaction prediction models can be applied according to step S220 to step S250, so as to predict the interaction of each RNA-protein pair.
  • feature extraction and vectorization processing can be performed on each RNA-protein pair, and the obtained sequence features, RNA sequence representation vectors and protein sequence representation vectors of the RNA-protein pair are input into the multiple interaction prediction models to obtain the interaction prediction value of each RNA-protein pair.
  • An interaction prediction value of 1 indicates that the RNA-protein pair has an interaction
  • an interaction prediction value of 0 indicates that the RNA-protein pair has no interaction.
  • all RNA-protein pairs with an interaction prediction value of 1 can be screened out, and the protein sequence in each such RNA-protein pair can be output to the terminal device so that users can view the protein sequences that interact with the input RNA sequence.
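  • The database screening described above can be sketched as follows. The callable `predict_fn` is a hypothetical stand-in for the trained joint model, and the protein sequences and scores are made up for illustration.

```python
def find_interacting_proteins(rna, protein_database, predict_fn, threshold=0.5):
    """Pair `rna` with every protein in the database and keep predicted hits.

    `predict_fn(rna, protein)` is a hypothetical callable returning the joint
    prediction value y_out of the trained models for one RNA-protein pair.
    """
    hits = []
    for protein in protein_database:
        y_out = predict_fn(rna, protein)
        if y_out > threshold:          # prediction result marked as 1
            hits.append(protein)
    return hits


# Stub predictor so the sketch runs: scores are made up for illustration.
fake_scores = {"MKVLAA": 0.91, "GGSGGS": 0.12, "MRRKKA": 0.67}
predict_fn = lambda rna, protein: fake_scores[protein]
hits = find_interacting_proteins("AUGGCUACG", list(fake_scores), predict_fn)
```

  • The reverse search, from an input protein sequence to interacting RNA sequences, follows the same pattern with the roles of the two sequence types swapped.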
  • At least one protein sequence can also be obtained, and the RNA sequence that interacts with each input protein sequence can be searched in the database through multiple interaction prediction models.
  • each input protein sequence can be combined with all RNA sequences in the database to form several RNA-protein pairs.
  • the jointly learned multiple interaction prediction models can be applied according to step S220 to step S250, so as to predict the interaction of each RNA-protein pair.
  • feature extraction and vectorization processing can be performed on each RNA-protein pair, and the obtained sequence features, RNA sequence representation vectors and protein sequence representation vectors of the RNA-protein pair are input into the multiple interaction prediction models to obtain the interaction prediction value of each RNA-protein pair.
  • An interaction prediction value of 1 indicates that the RNA-protein pair has an interaction
  • an interaction prediction value of 0 indicates that the RNA-protein pair has no interaction. Then, all RNA-protein pairs with an interaction prediction value of 1 can be screened out, and the RNA sequence in each such RNA-protein pair is output to the terminal device so that users can view the RNA sequences that interact with the input protein sequence.
  • the RNA-protein pair to be predicted is obtained; feature extraction is performed on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted; the RNA-protein pair to be predicted is vectorized to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted; based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, multiple interaction prediction models are used to obtain multiple interaction prediction values of the RNA-protein pair to be predicted; and the interaction between the RNA and the protein is determined according to the multiple interaction prediction values.
  • the information contained in RNA sequences and protein sequences can be fully mined, so as to accurately predict the interaction between RNA and proteins;
  • the characteristics of the individual interaction prediction models can be effectively combined, which further improves the accuracy of predicting the interaction between RNA and protein.
  • RNA-protein interaction prediction device 700 may include a data acquisition module 710, a feature extraction module 720, a data vectorization module 730, an interaction prediction module 740, and an interaction determination module 750, wherein:
  • a data acquisition module 710 configured to acquire RNA-protein pairs to be predicted
  • Feature extraction module 720 for performing feature extraction on the RNA-protein pair to be predicted, to obtain the sequence features of the RNA-protein pair to be predicted;
  • the data vectorization module 730 is used to vectorize the RNA-protein pair to be predicted, and obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
  • An interaction prediction module 740 configured to use multiple interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted;
  • An interaction determination module 750 configured to determine the interaction between the RNA and the protein according to the plurality of interaction prediction values.
  • the feature extraction module 720 includes:
  • the feature set acquisition module is used to obtain the original sequence feature set
  • a feature determination module configured to determine the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.
  • the feature determination module includes:
  • a sequence conversion unit for converting the RNA sequence and protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively;
  • the first sequence search unit is configured to search for each k-mer subsequence in the original sequence feature set, and obtain the sequence feature of the RNA-protein pair to be predicted according to the search result.
  • the feature determination module includes:
  • a sequence combination unit for combining the RNA k-mer subsequence and the protein k-mer subsequence to obtain a variety of RNA-protein k-mer subsequence pairs;
  • the second sequence search unit is configured to search for each RNA-protein k-mer subsequence pair in the original sequence feature set, and obtain the sequence feature of the RNA-protein pair to be predicted according to the search result.
  • the feature determination module includes:
  • a sequence conversion unit for converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
  • a first sequence search unit configured to search for each k-mer subsequence in the original sequence feature set to obtain the first sequence feature
  • a sequence combination unit for combining the RNA k-mer subsequence and the protein k-mer subsequence to obtain a variety of RNA-protein k-mer subsequence pairs;
  • a second sequence search unit configured to search for each RNA-protein k-mer subsequence pair in the original sequence feature set to obtain a second sequence feature
  • a feature splicing unit configured to form the sequence feature of the RNA-protein pair to be predicted from the first sequence feature and the second sequence feature.
  • the feature set acquisition module includes:
  • the data set acquisition module is used to obtain the original data set
  • a feature extraction module configured to perform feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set.
  • the feature extraction module includes:
  • the sequence generation unit is used to arrange and combine the basic units of RNA and protein to obtain k-mer subsequences
  • a variance calculation unit used to count the frequency of occurrence of each k-mer subsequence in the original data set, and calculate the variance of each k-mer subsequence according to the frequency of occurrence;
  • the data set determination unit is configured to determine the original sequence feature set according to the variance of each k-mer subsequence.
  • the variance calculation unit includes:
  • a frequency statistics subunit used to count the number of occurrences of each k-mer subsequence in the original data set;
  • a frequency calculation subunit configured to calculate the occurrence frequency of each k-mer subsequence in the original data set from the number of occurrences;
  • a sequence labeling subunit configured to traverse the original data set and mark whether each k-mer subsequence appears in each RNA-protein pair;
  • a variance calculation subunit for calculating the variance of each k-mer subsequence according to its occurrence frequency in the original data set and its marker value in each RNA-protein pair.
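  • the variance calculation subunit is configured to calculate the variance $\mathrm{Var}_i$ of the i-th k-mer subsequence by summing, over all RNA-protein pairs, the squared difference between the marker value and the occurrence frequency:
  • $$\mathrm{Var}_i = \sum_{n=1}^{N}\left(x_i^{(n)} - \mathrm{Freq}_i\right)^2$$
  • where $x_i^{(n)}$ is the marker value of the i-th k-mer subsequence in the n-th RNA-protein pair (the symbol is assumed here; the original formula is an image, Figure PCTCN2021121089-appb-000001), $\mathrm{Freq}_i$ is the occurrence frequency of the i-th k-mer subsequence in the original data set, and N is the total number of RNA-protein pairs in the original data set.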
  • the data set determination unit is configured to determine, according to the variance of each k-mer subsequence, the k-mer subsequences that satisfy a preset condition, and the k-mer subsequences that satisfy the preset condition constitute the original sequence feature set.
  • the feature extraction module also includes:
  • the first item set generation unit is used to convert the RNA sequence and the protein sequence in each RNA-protein pair into k-mer subsequences respectively, the k-mer subsequences forming a first candidate itemset and including RNA k-mer subsequences and protein k-mer subsequences;
  • a frequent itemset generating unit configured to count the occurrence frequency, in the original data set, of each k-mer subsequence in the first candidate itemset, the k-mer subsequences that satisfy a preset occurrence frequency threshold forming a frequent itemset;
  • the second item set generating unit is used to cross-combine the RNA k-mer subsequences and the protein k-mer subsequences in the frequent itemset, the resulting k-mer subsequence pairs forming a second candidate itemset;
  • a support determination unit configured to count the occurrence frequency, in the original data set, of each k-mer subsequence pair in the second candidate itemset, to obtain the support of each k-mer subsequence pair;
  • a feature set acquisition unit configured to form the original sequence feature set from k-mer subsequence pairs whose support meets a preset condition.
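The itemset-based units above resemble a two-level frequent-itemset search. The following is a minimal sketch, assuming that a k-mer "occurs" in a pair if it appears at least once, that support is the fraction of pairs in which both k-mers of a pair co-occur, and that both thresholds are inclusive; the function names and threshold parameters are hypothetical and not taken from the disclosure.

```python
from collections import Counter
from itertools import product


def kmer_set(sequence: str, k: int) -> set[str]:
    """Distinct overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def pair_feature_set(pairs, k, min_freq, min_support):
    """pairs: list of (rna_sequence, protein_sequence) tuples."""
    n = len(pairs)
    rna_sets = [kmer_set(rna, k) for rna, _ in pairs]
    protein_sets = [kmer_set(protein, k) for _, protein in pairs]

    def frequent(kmer_sets):
        # First candidate itemset -> frequent itemset (assumed threshold rule).
        counts = Counter(kmer for s in kmer_sets for kmer in s)
        return {kmer for kmer, c in counts.items() if c / n >= min_freq}

    frequent_rna = frequent(rna_sets)
    frequent_protein = frequent(protein_sets)

    # Second candidate itemset: cross-combine frequent RNA and protein k-mers.
    features = set()
    for rna_kmer, protein_kmer in product(frequent_rna, frequent_protein):
        hits = sum(
            1 for rs, ps in zip(rna_sets, protein_sets)
            if rna_kmer in rs and protein_kmer in ps
        )
        if hits / n >= min_support:  # support = co-occurrence fraction (assumption)
            features.add((rna_kmer, protein_kmer))
    return features


toy_pairs = [("AUGC", "1234"), ("AUGA", "1235"), ("CCCC", "7777")]
print(pair_feature_set(toy_pairs, k=3, min_freq=0.5, min_support=0.5))
```

Filtering single k-mers first keeps the cross-combination step small, since only frequent RNA and protein k-mers are paired.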
  • the data vectorization module 730 includes:
  • a sequence conversion unit for converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively, the k-mer subsequences including M RNA k-mer subsequences and N protein k-mer subsequences;
  • the first vectorization unit is used to vectorize each RNA k-mer subsequence to obtain M RNA k-mer vectors;
  • the first splicing unit is used to splice the M RNA k-mer vectors to obtain the RNA sequence representation vector;
  • the second vectorization unit is used to vectorize each protein k-mer subsequence to obtain N protein k-mer vectors;
  • the second splicing unit is used to splice the N protein k-mer vectors to obtain the protein sequence representation vector.
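The vectorize-then-splice flow of the data vectorization module can be sketched as follows. This section does not state how an individual k-mer is turned into a vector, so the sketch assumes a hypothetical lookup table of randomly initialized vectors (a trained embedding could take its place) and reads "splicing" as concatenation in reading order; `build_embedding_table` and `sequence_representation` are illustrative names.

```python
import random
from itertools import product


def build_embedding_table(alphabet: str, k: int, dim: int, seed: int = 0):
    """Hypothetical k-mer -> vector table with random entries (assumption)."""
    rng = random.Random(seed)
    return {
        "".join(chars): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for chars in product(alphabet, repeat=k)
    }


def sequence_representation(sequence: str, k: int, table) -> list[float]:
    """Vectorize each overlapping k-mer, then concatenate the vectors."""
    vectors = [table[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]
    return [value for vector in vectors for value in vector]


rna_table = build_embedding_table("AUGC", k=3, dim=4)
rna_vector = sequence_representation("AGAUGG", 3, rna_table)
print(len(rna_vector))  # 4 k-mers x 4 dimensions = 16
```

The same two functions apply to a protein sequence once it is written over its own alphabet, for example the seven-class encoding described elsewhere in the specification.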
  • the interaction prediction module 740 includes:
  • a first prediction unit configured to input the sequence features of the RNA-protein pair to be predicted into at least one first interaction prediction model to obtain at least one first interaction prediction value
  • the second prediction unit is configured to input the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted into at least one second interaction prediction model to obtain at least one second interaction prediction value;
  • the interaction prediction module 740 includes:
  • a third prediction unit configured to input the sequence features of the RNA-protein pair to be predicted into at least one traditional machine learning model to obtain at least one first interaction prediction value
  • the fourth prediction unit is configured to input the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value.
  • each deep learning model includes at least two sub-deep learning models; the fourth prediction unit includes:
  • the first feature generation unit is used to input the RNA sequence representation vector in the RNA-protein pair to be predicted into the first sub-deep learning model to obtain the first sequence feature;
  • the second feature generation unit is used to input the protein sequence representation vector in the RNA-protein pair to be predicted into the second sub-deep learning model to obtain the second sequence feature;
  • a feature fusion unit configured to fuse the first sequence features and the second sequence features, and obtain the second interaction prediction value according to the fused features.
  • the traditional machine learning model includes at least one of a logistic regression model, a support vector machine model, and a decision tree model, and the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.
  • the interaction determining module 750 includes:
  • a weighted calculation unit configured to perform a weighted sum calculation on the plurality of interaction prediction values
  • an interaction determining unit configured to determine the interaction between the RNA and the protein according to the calculation result.
  • the interaction determination unit includes:
  • the first interaction determination subunit is used to determine that there is an interaction between the RNA and the protein if the calculation result is greater than a preset interaction prediction value threshold;
  • the second interaction determination subunit is configured to determine that there is no interaction between the RNA and the protein if the calculation result is less than or equal to a preset interaction prediction value threshold.
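The weighted-sum and threshold rule of the interaction determining module can be sketched as follows. The weights and the threshold value of 0.5 are placeholder assumptions; the disclosure only states that a preset threshold exists and that a result equal to the threshold counts as no interaction.

```python
def combine_predictions(predictions, weights):
    """Weighted sum of the individual interaction prediction values."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per prediction value")
    return sum(w * p for w, p in zip(weights, predictions))


def has_interaction(predictions, weights, threshold=0.5):
    """True only if the weighted sum is strictly greater than the threshold."""
    return combine_predictions(predictions, weights) > threshold


print(has_interaction([0.9, 0.6, 0.7], [0.2, 0.4, 0.4]))
print(has_interaction([0.5], [1.0]))
```

The strict comparison mirrors the two subunits above: "greater than" yields an interaction, while "less than or equal to" yields none.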
  • the RNA-protein interaction prediction device 700 also includes:
  • a joint training module configured to perform joint training on the multiple interaction prediction models.
  • the joint training module includes:
  • a training data acquisition unit configured to acquire a training data set that includes positive-example RNA-protein pairs and negative-example RNA-protein pairs;
  • the first predicted value output unit is used to use the multiple interaction prediction models to obtain multiple interaction prediction values for each RNA-protein pair in the training data set;
  • the second predicted value output unit is used to obtain the joint predicted value of each RNA-protein pair in the training data set according to the plurality of interaction predicted values;
  • a loss value calculation unit configured to apply a loss function to the joint prediction value and the label value of each RNA-protein pair in the training data set to obtain a corresponding loss value;
  • a model parameter adjustment unit configured to adjust model parameters of the plurality of interaction prediction models according to the loss value.
  • the second predicted value output unit includes:
  • the second predictive value output subunit is configured to perform weighted summation of multiple interaction predictive values of each RNA-protein pair in the training data set to obtain a joint predictive value of each RNA-protein pair.
  • the second predicted value output subunit is configured to calculate the joint prediction value $y_{out}$ of each RNA-protein pair in the training data set according to:
  • $$y_{out} = \alpha \cdot y_1 + \beta \cdot y_2 + \gamma \cdot y_3$$
  • where $y_1$ is the output value of the traditional machine learning model, $y_2$ is the output value of the convolutional neural network model, $y_3$ is the output value of the recurrent neural network model, and $\alpha$, $\beta$, $\gamma$ are the weight parameters of the traditional machine learning model, the convolutional neural network model, and the recurrent neural network model, respectively.
  • the model parameter adjustment unit is configured to iteratively update the model parameters of the plurality of interaction prediction models based on the loss value and, when an iteration termination condition is met, to complete the training of those model parameters, so that the trained interaction prediction models can be used to predict the interaction of the RNA-protein pair to be predicted.
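The joint training loop described by these units can be illustrated with a toy sketch. It makes several assumptions that are not taken from the disclosure: three single-feature logistic units stand in for the traditional machine learning, convolutional, and recurrent models; the mixing weights are fixed at 0.4, 0.3, and 0.3; the loss is binary cross-entropy; the update is plain stochastic gradient descent; and the termination condition is a fixed number of epochs.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def joint_output(params, mix, inputs):
    """Per-model outputs y_1..y_3 and their weighted sum y_out."""
    outs = [sigmoid(w * x + b) for (w, b), x in zip(params, inputs)]
    return outs, sum(a * y for a, y in zip(mix, outs))


def joint_train(samples, mix=(0.4, 0.3, 0.3), lr=0.5, epochs=200):
    params = [[0.0, 0.0] for _ in mix]  # [weight, bias] per stand-in model
    history = []
    for _ in range(epochs):  # termination condition: fixed epoch count (assumption)
        total = 0.0
        for inputs, label in samples:
            outs, y_out = joint_output(params, mix, inputs)
            y_out = min(max(y_out, 1e-7), 1.0 - 1e-7)  # keep log() finite
            total += -(label * math.log(y_out) + (1 - label) * math.log(1.0 - y_out))
            d_out = (y_out - label) / (y_out * (1.0 - y_out))  # dLoss/dy_out
            for p, a, y, x in zip(params, mix, outs, inputs):
                d_z = d_out * a * y * (1.0 - y)  # chain rule through the sigmoid
                p[0] -= lr * d_z * x
                p[1] -= lr * d_z
        history.append(total / len(samples))
    return params, history


# Toy data: one scalar input per stand-in model, label 1 = interaction.
SAMPLES = [
    ((1.0, 1.0, 1.0), 1), ((-1.0, -1.0, -1.0), 0),
    ((2.0, 1.5, 1.0), 1), ((-2.0, -1.0, -1.5), 0),
]
trained, losses = joint_train(SAMPLES)
print(losses[0] > losses[-1])
```

Because the loss is computed on the joint prediction value rather than on each model's own output, a single loss value drives the parameter updates of all three models at once.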
  • the RNA-protein interaction prediction device 700 also includes:
  • the data output module is used for outputting the prediction result of the interaction between the RNA and the protein.
  • Each module in the above-mentioned device can be a general-purpose processor, including a central processing unit, a network processor, etc.; it can also be a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Each module may also be implemented in software, firmware, or other forms. The processors in the above device may be independent processors or may be integrated together.
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above-mentioned method in this specification is stored.
  • various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code.
  • When the program product is run on an electronic device, the program code causes the electronic device to execute the steps according to various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section above of this specification.
  • the program product may take the form of a portable compact disc read-only memory (CD-ROM) and include program code, and may run on an electronic device, such as a personal computer.
  • the program product of the present disclosure is not limited thereto.
  • a readable storage medium may be any tangible medium containing or storing a program, and the program may be used by or in combination with an instruction execution system, apparatus or device.
  • a program product may take the form of any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer readable signal medium may include a data signal carrying readable program code in baseband or as part of a carrier wave. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a readable signal medium may also be any readable medium other than a readable storage medium that can transmit, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
  • Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
  • An electronic device 800 according to such an exemplary embodiment of the present disclosure is described below with reference to FIG. 8 .
  • the electronic device 800 shown in FIG. 8 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • electronic device 800 may take the form of a general-purpose computing device.
  • Components of the electronic device 800 may include, but are not limited to: at least one processing unit 810 , at least one storage unit 820 , a bus 830 connecting different system components (including the storage unit 820 and the processing unit 810 ), and a display unit 840 .
  • the storage unit 820 stores program codes, which can be executed by the processing unit 810, so that the processing unit 810 executes the steps described in the above "Exemplary Methods" section of this specification according to various exemplary embodiments of the present disclosure.
  • the processing unit 810 may execute any one or more method steps in FIG. 2 to FIG. 6 .
  • the storage unit 820 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 821 and/or a cache storage unit 822 , and may further include a read-only storage unit (ROM) 823 .
  • Storage unit 820 may also include a program/utility tool 824 having a set (at least one) of program modules 825, such program modules 825 including but not limited to: an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination of them, may include an implementation of a network environment.
  • Bus 830 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 800 can also communicate with one or more external devices 900 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 800 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interface 850.
  • the electronic device 800 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through the network adapter 860 . As shown, the network adapter 860 communicates with other modules of the electronic device 800 through the bus 830 .
  • other hardware and/or software modules may be used in conjunction with electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the RNA-protein interaction prediction method described in this disclosure can be performed by the processing unit 810 of the electronic device.
  • the RNA-protein pair to be predicted/RNA sequence to be predicted/protein sequence to be predicted, raw data sets, and training data sets used to train each interaction prediction model can be input through the input interface 850.
  • the RNA-protein pair to be predicted, the original data set, and the training data set used to train each interaction prediction model are input through the user interface of the electronic device.
  • the prediction result of the to-be-predicted RNA-protein interaction can be output to the external device 900 through the output interface 850 for viewing by the user.
  • the technical solutions according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) or on a network, and which includes several instructions that cause a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.

Abstract

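An RNA-protein interaction prediction method, device, medium, and electronic device, relating to the technical field of artificial intelligence. The method includes: acquiring an RNA-protein pair to be predicted (S210); performing feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the pair (S220); vectorizing the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector of the pair (S230); using multiple interaction prediction models, based on the sequence features, the RNA sequence representation vector, and the protein sequence representation vector, to obtain multiple interaction prediction values of the pair (S240); and determining the interaction between the RNA and the protein according to the multiple interaction prediction values (S250).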

Description

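RNA-protein interaction prediction method, device, medium, and electronic device
Technical Field
The present disclosure relates to the technical field of artificial intelligence and, in particular, to an RNA-protein interaction prediction method, an RNA-protein interaction prediction device, a computer-readable storage medium, and an electronic device.
Background
Noncoding RNA (ncRNA) participates in many complex cellular processes, plays an important role in life processes such as alternative splicing, chromatin modification, and epigenetics, and is closely related to many diseases. Studies have shown that most noncoding RNAs achieve their regulatory functions by interacting with proteins. Studying the interaction between noncoding RNA and proteins is therefore of great significance for revealing the molecular mechanisms of noncoding RNA in human diseases and life activities, and has become one of the important ways to analyze the functions of noncoding RNA and proteins.
It should be noted that the information disclosed in the Background section above is only intended to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary
The present disclosure provides an RNA-protein interaction prediction method, an RNA-protein interaction prediction device, a computer-readable storage medium, and an electronic device.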
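The present disclosure provides an RNA-protein interaction prediction method, including:
acquiring an RNA-protein pair to be predicted;
performing feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted;
vectorizing the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector of the RNA-protein pair to be predicted;
using multiple interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the pair, to obtain multiple interaction prediction values of the RNA-protein pair to be predicted;
determining the interaction between the RNA and the protein according to the multiple interaction prediction values.
In an exemplary embodiment of the present disclosure, performing feature extraction on the RNA-protein pair to be predicted to obtain the sequence features of the pair includes:
acquiring an original sequence feature set;
determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.
In an exemplary embodiment of the present disclosure, determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
converting the RNA sequence and the protein sequence of the RNA-protein pair to be predicted into k-mer subsequences, respectively;
searching the original sequence feature set for each k-mer subsequence, and obtaining the sequence features of the RNA-protein pair to be predicted according to the search results.
In an exemplary embodiment of the present disclosure, determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
converting the RNA sequence and the protein sequence of the RNA-protein pair to be predicted into k-mer subsequences, respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
combining the RNA k-mer subsequences with the protein k-mer subsequences to obtain multiple RNA-protein k-mer subsequence pairs;
searching the original sequence feature set for each RNA-protein k-mer subsequence pair, and obtaining the sequence features of the RNA-protein pair to be predicted according to the search results.
In an exemplary embodiment of the present disclosure, determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
converting the RNA sequence and the protein sequence of the RNA-protein pair to be predicted into k-mer subsequences, respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
searching the original sequence feature set for each k-mer subsequence to obtain first sequence features;
combining the RNA k-mer subsequences with the protein k-mer subsequences to obtain multiple RNA-protein k-mer subsequence pairs;
searching the original sequence feature set for each RNA-protein k-mer subsequence pair to obtain second sequence features;
forming the sequence features of the RNA-protein pair to be predicted from the first sequence features and the second sequence features.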
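In an exemplary embodiment of the present disclosure, acquiring the original sequence feature set includes:
acquiring an original data set;
performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set.
In an exemplary embodiment of the present disclosure, performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set includes:
enumerating permutations and combinations of the basic units of RNA and of protein, respectively, to obtain k-mer subsequences;
counting the occurrence frequency of each k-mer subsequence in the original data set, and calculating the variance of each k-mer subsequence according to the occurrence frequency;
determining the original sequence feature set according to the variance of each k-mer subsequence.
In an exemplary embodiment of the present disclosure, counting the occurrence frequency of each k-mer subsequence in the original data set and calculating the variance of each k-mer subsequence according to the occurrence frequency includes:
counting the number of occurrences of each k-mer subsequence in the original data set;
calculating the occurrence frequency of each k-mer subsequence in the original data set from the number of occurrences;
traversing the original data set and marking whether each k-mer subsequence appears in each RNA-protein pair;
calculating the variance of each k-mer subsequence according to its occurrence frequency in the original data set and its marker value in each RNA-protein pair.
In an exemplary embodiment of the present disclosure, calculating the variance of each k-mer subsequence according to its occurrence frequency in the original data set and its marker value in each RNA-protein pair includes: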
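calculating the variance $\mathrm{Var}_i$ of the i-th k-mer subsequence as the sum, over all RNA-protein pairs, of the squared difference between the marker value and the occurrence frequency:
$$\mathrm{Var}_i = \sum_{n=1}^{N}\left(x_i^{(n)} - \mathrm{Freq}_i\right)^2$$
where $x_i^{(n)}$ is the marker value of the i-th k-mer subsequence in the n-th RNA-protein pair (the symbol is assumed here; the original formula and symbol are images, Figure PCTCN2021121089-appb-000001 and Figure PCTCN2021121089-appb-000002), $\mathrm{Freq}_i$ is the occurrence frequency of the i-th k-mer subsequence in the original data set, and N is the total number of RNA-protein pairs in the original data set.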
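In an exemplary embodiment of the present disclosure, determining the original sequence feature set according to the variance of each k-mer subsequence includes:
determining, according to the variance of each k-mer subsequence, the k-mer subsequences that satisfy a preset condition, the k-mer subsequences that satisfy the preset condition forming the original sequence feature set.
In an exemplary embodiment of the present disclosure, performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set further includes:
converting the RNA sequence and the protein sequence of each RNA-protein pair into k-mer subsequences, respectively, the k-mer subsequences forming a first candidate itemset and including RNA k-mer subsequences and protein k-mer subsequences;
counting the occurrence frequency, in the original data set, of each k-mer subsequence in the first candidate itemset, the k-mer subsequences that satisfy a preset occurrence frequency threshold forming a frequent itemset;
cross-combining the RNA k-mer subsequences and the protein k-mer subsequences in the frequent itemset, the resulting k-mer subsequence pairs forming a second candidate itemset;
counting the occurrence frequency, in the original data set, of each k-mer subsequence pair in the second candidate itemset to obtain the support of each k-mer subsequence pair;
forming the original sequence feature set from the k-mer subsequence pairs whose support satisfies a preset condition.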
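In an exemplary embodiment of the present disclosure, vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector of the pair includes:
converting the RNA sequence and the protein sequence of the RNA-protein pair to be predicted into k-mer subsequences, respectively, the k-mer subsequences including M RNA k-mer subsequences and N protein k-mer subsequences;
vectorizing each RNA k-mer subsequence to obtain M RNA k-mer vectors;
splicing the M RNA k-mer vectors to obtain the RNA sequence representation vector;
vectorizing each protein k-mer subsequence to obtain N protein k-mer vectors;
splicing the N protein k-mer vectors to obtain the protein sequence representation vector.
In an exemplary embodiment of the present disclosure, using multiple interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the pair, to obtain multiple interaction prediction values of the pair includes:
inputting the sequence features of the RNA-protein pair to be predicted into at least one first interaction prediction model to obtain at least one first interaction prediction value;
inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one second interaction prediction model to obtain at least one second interaction prediction value.
In an exemplary embodiment of the present disclosure, using multiple interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the pair, to obtain multiple interaction prediction values of the pair includes:
inputting the sequence features of the RNA-protein pair to be predicted into at least one traditional machine learning model to obtain at least one first interaction prediction value;
inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value.
In an exemplary embodiment of the present disclosure, each deep learning model includes at least two sub-deep learning models, and inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value includes:
inputting the RNA sequence representation vector of the RNA-protein pair to be predicted into a first sub-deep learning model to obtain first sequence features;
inputting the protein sequence representation vector of the RNA-protein pair to be predicted into a second sub-deep learning model to obtain second sequence features;
fusing the first sequence features and the second sequence features, and obtaining the second interaction prediction value according to the fused features.
In an exemplary embodiment of the present disclosure, the traditional machine learning model includes at least one of a logistic regression model, a support vector machine model, and a decision tree model, and the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.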
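In an exemplary embodiment of the present disclosure, determining the interaction between the RNA and the protein according to the multiple interaction prediction values includes:
performing a weighted sum calculation on the multiple interaction prediction values;
determining the interaction between the RNA and the protein according to the calculation result.
In an exemplary embodiment of the present disclosure, determining the interaction between the RNA and the protein according to the calculation result includes:
if the calculation result is greater than a preset interaction prediction value threshold, determining that there is an interaction between the RNA and the protein;
if the calculation result is less than or equal to the preset interaction prediction value threshold, determining that there is no interaction between the RNA and the protein.
In an exemplary embodiment of the present disclosure, the method further includes:
performing joint training on the multiple interaction prediction models.
In an exemplary embodiment of the present disclosure, performing joint training on the multiple interaction prediction models includes:
acquiring a training data set that includes positive-example RNA-protein pairs and negative-example RNA-protein pairs;
using the multiple interaction prediction models to obtain multiple interaction prediction values for each RNA-protein pair in the training data set;
obtaining a joint prediction value for each RNA-protein pair in the training data set according to the multiple interaction prediction values;
applying a loss function to the joint prediction value and the label value of each RNA-protein pair in the training data set to obtain a corresponding loss value;
adjusting the model parameters of the multiple interaction prediction models according to the loss value.
In an exemplary embodiment of the present disclosure, obtaining the joint prediction value for each RNA-protein pair in the training data set according to the multiple interaction prediction values includes:
performing a weighted summation of the multiple interaction prediction values of each RNA-protein pair in the training data set to obtain the joint prediction value of each RNA-protein pair.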
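In an exemplary embodiment of the present disclosure, performing a weighted summation of the multiple interaction prediction values of each RNA-protein pair in the training data set to obtain the joint prediction value of each RNA-protein pair includes calculating the joint prediction value $y_{out}$ according to:
$$y_{out} = \alpha \cdot y_1 + \beta \cdot y_2 + \gamma \cdot y_3$$
where $y_1$ is the output value of the traditional machine learning model, $y_2$ is the output value of the convolutional neural network model, $y_3$ is the output value of the recurrent neural network model, and $\alpha$, $\beta$, $\gamma$ are the weight parameters of the traditional machine learning model, the convolutional neural network model, and the recurrent neural network model, respectively.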
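In an exemplary embodiment of the present disclosure, adjusting the model parameters of the multiple interaction prediction models according to the loss value includes:
iteratively updating the model parameters of the multiple interaction prediction models based on the loss value and, when an iteration termination condition is met, completing the training of those model parameters, so that the trained interaction prediction models can be used to predict the interaction of the RNA-protein pair to be predicted.
In an exemplary embodiment of the present disclosure, the method further includes:
outputting the prediction result of the interaction between the RNA and the protein.
The present disclosure provides an RNA-protein interaction prediction device, including:
a data acquisition module for acquiring an RNA-protein pair to be predicted;
a feature extraction module for performing feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the pair;
a data vectorization module for vectorizing the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector of the pair;
an interaction prediction module for using multiple interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the pair, to obtain multiple interaction prediction values of the pair;
an interaction determination module for determining the interaction between the RNA and the protein according to the multiple interaction prediction values.
The present disclosure provides a computer-readable storage medium on which a computer program is stored, the computer program implementing any one of the methods described above when executed by a processor.
The present disclosure provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform any one of the methods described above by executing the executable instructions.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.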
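Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure. Evidently, the drawings described below are merely some embodiments of the present disclosure, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an exemplary system architecture to which an RNA-protein interaction prediction method and device of an embodiment of the present disclosure can be applied;
FIG. 2 schematically shows a flowchart of an RNA-protein interaction prediction method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flowchart of determining the sequence features of an RNA-protein pair to be predicted according to an embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of acquiring an original sequence feature set according to an embodiment of the present disclosure;
FIG. 5 schematically shows a flowchart of acquiring an original sequence feature set according to another embodiment of the present disclosure;
FIG. 6 schematically shows a flowchart of training multiple interaction prediction models according to an embodiment of the present disclosure;
FIG. 7 schematically shows a block diagram of an RNA-protein interaction prediction device according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a computer system of an electronic device suitable for implementing embodiments of the present disclosure.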
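Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced while omitting one or more of the specific details, or that other methods, components, devices, steps, and the like may be employed. In other instances, well-known technical solutions are not shown or described in detail, to avoid obscuring aspects of the present disclosure.
In addition, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and repeated descriptions of them are therefore omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
FIG. 1 is a schematic diagram of the system architecture of an exemplary application environment to which an RNA-protein interaction prediction method and device of an embodiment of the present disclosure can be applied.
As shown in FIG. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables. The terminal devices 101, 102, 103 may be various electronic devices, including but not limited to desktop computers, portable computers, smartphones, and tablet computers. It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there may be any number of terminal devices, networks, and servers, as required by the implementation. For example, the server 105 may be a single server, a server cluster composed of multiple servers, a cloud computing platform, or a virtualization center. Specifically, the server 105 may be configured to: acquire an RNA-protein pair to be predicted; perform feature extraction on the pair to obtain its sequence features; vectorize the pair to obtain its RNA sequence representation vector and protein sequence representation vector; use multiple interaction prediction models, based on the sequence features, the RNA sequence representation vector, and the protein sequence representation vector, to obtain multiple interaction prediction values of the pair; and determine the interaction between the RNA and the protein according to the multiple interaction prediction values.
The RNA-protein interaction prediction method provided by the embodiments of the present disclosure is generally executed by the server 105, and the RNA-protein interaction prediction device is accordingly generally provided in the server 105; the server may send the prediction result for the RNA-protein pair to be predicted to a terminal device, which presents it to the user. However, those skilled in the art will readily understand that the method may also be executed by one or more of the terminal devices 101, 102, 103, in which case the RNA-protein interaction prediction device may be provided in the terminal devices 101, 102, 103. For example, after execution by a terminal device, the prediction result may be displayed directly on the display screen of the terminal device or provided to the user by voice broadcast; this is not specifically limited in this exemplary embodiment.
The technical solutions of the embodiments of the present disclosure are described in detail below: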
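At present, experimental methods can be used to study noncoding RNA-protein interactions (ncRPI). Traditional experimental methods can yield valuable data from which ncRNA-protein interaction networks can be constructed, but they are expensive and time-consuming.
This example embodiment provides an RNA-protein interaction prediction method, which may be applied to the server 105 above or to one or more of the terminal devices 101, 102, 103 above; this is not specifically limited in this exemplary embodiment. Referring to FIG. 2, the RNA-protein interaction prediction method may include the following steps S210 to S250:
Step S210. Acquire an RNA-protein pair to be predicted;
Step S220. Perform feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the pair;
Step S230. Vectorize the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector of the pair;
Step S240. Use multiple interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the pair, to obtain multiple interaction prediction values of the pair;
Step S250. Determine the interaction between the RNA and the protein according to the multiple interaction prediction values.
In the RNA-protein interaction prediction method provided by the example embodiments of the present disclosure, an RNA-protein pair to be predicted is acquired; feature extraction is performed on the pair to obtain its sequence features; the pair is vectorized to obtain its RNA sequence representation vector and protein sequence representation vector; multiple interaction prediction models are used, based on the sequence features, the RNA sequence representation vector, and the protein sequence representation vector, to obtain multiple interaction prediction values of the pair; and the interaction between the RNA and the protein is determined according to the multiple interaction prediction values. On the one hand, performing feature extraction and vectorization on the RNA-protein pair makes it possible to fully mine the relationship between the RNA sequence and the protein sequence, so that the interaction between the RNA and the protein can be predicted accurately; on the other hand, effectively combining the characteristics of multiple interaction prediction models can further improve the accuracy of predicting RNA-protein interactions.
The above steps of this example embodiment are described in more detail below.
In step S210, an RNA-protein pair to be predicted is acquired.
In this example embodiment, at least one RNA-protein pair to be predicted may be acquired, where the interaction between the RNA and the protein in each pair is unknown. Illustratively, a user may input the RNA-protein pair to be predicted through a terminal device, for example manually or by voice; this is not specifically limited in this example. For example, an RNA may be input followed by a protein, and the input order of the two is not limited. The RNA and the protein may be entered in different text boxes or in the same text box. For example, after input is complete, clicking a "Start prediction" button starts the prediction steps provided in some embodiments of the present application.
Here, the interaction between RNA and protein refers to the fact that the function of a protein is manifested in its interactions with other proteins and RNA. For example, protein-RNA interactions play an important role in protein synthesis, and many RNA functions likewise depend on interactions with proteins. The interaction may be a regulatory effect, a guiding effect, and so on, without limitation here; for example, where an interaction exists, the RNA may guide the synthesis of the protein or regulate the protein's function. The interaction between RNA and protein may also mean that the two regulate each other's life cycle and function through physical interaction. Illustratively, an RNA coding sequence can guide protein synthesis and, correspondingly, a protein can regulate the expression and function of RNA.
After the RNA-protein pairs to be predicted are acquired, multiple interaction prediction models may be used to predict the interaction of each input pair, and whether each pair interacts is determined according to the prediction results. The prediction result for each pair may also be output to the terminal device for the user to view. For example, the prediction result may be displayed directly on the display screen of the terminal device or provided to the user by voice broadcast; this is not specifically limited in this example.
In other examples, at least one RNA sequence to be predicted may be acquired, and multiple interaction prediction models are used to search a database for protein sequences that interact with each input RNA sequence. Illustratively, after the user inputs an RNA sequence to be predicted through the terminal device, at least one protein sequence may be selected from the database, and the RNA sequence to be predicted is paired with each protein sequence to form multiple RNA-protein pairs; the interaction of each pair can then be predicted by the multiple interaction prediction models, and the protein sequences that can interact with the RNA sequence to be predicted are output according to the prediction results. Preferably, several kinds of protein sequences may be stored in a database in advance so that they can be retrieved when predicting the interaction of RNA-protein pairs. For example, the protein sequences may be stored in a Redis database or in a MySQL database, so that protein sequences to be predicted can be queried and selected in real time. Redis is a key-value storage system; when protein sequences are stored in a Redis database, each entry may be a key-value pair formed from a sequence identifier (such as a sequence number) and the corresponding protein sequence, where the key is the sequence identifier and the value is the corresponding protein sequence. As an efficient caching technology, Redis can support read/write rates of more than 100K operations per second and therefore has an advantage in the speed of reading and storing data. MySQL is a relational database management system; a relational database keeps data in separate tables rather than storing all data together, which increases speed and flexibility, offers stable data storage, and helps avoid data loss.
It will be understood that several kinds of RNA sequences may likewise be stored in a database in advance so that they can be retrieved when predicting the interaction of RNA-protein pairs. Accordingly, at least one protein sequence to be predicted may be acquired, and multiple interaction prediction models are used to search the database for RNA sequences that interact with each input protein sequence. Similarly, after the user inputs a protein sequence through the terminal device, at least one RNA sequence may be selected from the database, and the protein sequence to be predicted is paired with each RNA sequence to form multiple RNA-protein pairs; the interaction of each pair can then be predicted by the multiple interaction prediction models, and the RNA sequences that can interact with the protein sequence to be predicted are output according to the prediction results. The present disclosure does not specifically limit this.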
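In step S220, feature extraction is performed on the RNA-protein pair to be predicted to obtain the sequence features of the pair.
In an example embodiment, the case of acquiring at least one RNA-protein pair to be predicted and predicting its interaction is taken as an example. Before the interaction of each pair is predicted by the multiple interaction prediction models, the input features of each interaction prediction model need to be obtained. Illustratively, feature extraction may be performed on the RNA-protein pair to be predicted, that is, on the RNA sequence and the protein sequence of the pair in turn, to obtain corresponding RNA sequence features and protein sequence features, which together form the sequence features of the pair; these sequence features can be used as input to an interaction prediction model. Alternatively, the pair may be vectorized, that is, the RNA sequence and the protein sequence are each represented as vectors to obtain the corresponding RNA sequence representation vector and protein sequence representation vector, which are each used as input to an interaction prediction model. Feature extraction and vectorization may also both be performed, yielding the sequence features, the RNA sequence representation vector, and the protein sequence representation vector of the pair, all of which can be used as input to the interaction prediction models; the present disclosure does not specifically limit this.
Referring to FIG. 3, feature extraction may be performed on the RNA-protein pair to be predicted according to steps S310 and S320.
In step S310, an original sequence feature set is acquired.
In an example embodiment, an original data set may be acquired, and feature extraction is performed on each RNA-protein pair in it to obtain the original sequence feature set. For example, the RPI1807 data set may be used as the original data set; it may contain 3243 RNA-protein pairs, of which 1807 are positive examples and 1436 are negative examples. A positive example indicates that the RNA and the protein in the pair interact; a negative example indicates that they do not. It will be understood that in other examples the RPI2241 data set, the RPI369 data set, and the like may also be used as the original data set for experiments; the present disclosure does not specifically limit this.
After the original data set is acquired, referring to FIG. 4, feature extraction may be performed on the RNA-protein pairs in the original data set according to steps S410 to S430 to obtain the original sequence feature set.
Step S410. Enumerate permutations and combinations of the basic units of RNA and of protein, respectively, to obtain k-mer subsequences.
For example, bases are taken as the basic units of RNA. An RNA sequence may include four bases: adenine (A), uracil (U), guanine (G), and cytosine (C). All k-mer subsequences of RNA sequences can be obtained by enumerating permutations and combinations of the four bases. Likewise, amino acids are taken as the basic units of protein. A protein sequence may include 20 amino acids, which are encoded in turn as A, G, V, I, L, F, P, Y, M, T, S, H, N, Q, W, R, K, D, E, C. Illustratively, the 20 amino acids may first be divided into seven classes according to their physicochemical properties, namely {A,G,V}, {I,L,F,P}, {Y,M,T,S}, {H,N,Q,W}, {R,K}, {D,E}, and {C}, and each class is then re-encoded, for example as 1, 2, 3, 4, 5, 6, and 7 in turn. For example, the protein sequence ALQDVG is converted to 124611. All k-mer subsequences of amino acid sequences can then be obtained by enumerating permutations and combinations of the seven classes. In other examples, the 20 amino acids may instead be classified by composition, or k-mer subsequences may be obtained directly from the 20 amino acids without classification; the present disclosure does not specifically limit this.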
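The seven-class re-encoding of amino acids can be sketched as follows. The class membership and the digits 1 to 7 are taken from the text; the function and constant names are illustrative, and raising an error for residues outside the 20 listed amino acids is an assumption.

```python
# Seven physicochemical classes as listed in the text, numbered 1..7 in order.
AMINO_ACID_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

# Map each one-letter amino acid code to the digit of its class.
RESIDUE_TO_CLASS = {
    residue: str(index)
    for index, members in enumerate(AMINO_ACID_CLASSES, start=1)
    for residue in members
}


def encode_protein(sequence: str) -> str:
    """Re-encode a protein sequence over the reduced alphabet 1..7.

    A residue outside the 20 listed amino acids raises KeyError (assumption).
    """
    return "".join(RESIDUE_TO_CLASS[residue] for residue in sequence.upper())


print(encode_protein("ALQDVG"))  # 124611, matching the example in the text
```

Reducing the alphabet from 20 letters to 7 classes keeps the number of possible protein k-mers manageable, for example 343 rather than 8000 for k = 3.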
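Here, a k-mer subsequence is a k-tuple formed from a group of k bases or k amino acid classes. Accordingly, in the example embodiments of the present disclosure, k-mer subsequences may include RNA k-mer subsequences and protein k-mer subsequences. Illustratively, a k-mer subsequence may be an RNA k-mer subsequence obtained by permuting and combining the four bases, which gives $4^k$ distinct k-mer subsequences for a given k. A k-mer subsequence may also be a protein k-mer subsequence obtained by permuting and combining the seven amino acid classes, which gives $7^k$ distinct k-mer subsequences for a given k. It will be understood that dividing the 20 amino acids into seven classes is merely illustrative, and classification may be omitted. Similarly, the four bases of RNA sequences may also be classified as needed.
In the example embodiments of the present disclosure, k may take one or more values, and the specific values can be adjusted according to the actual situation, without limitation here. In one example, k takes the two values 3 and 4. RNA sequences then have $4^3 = 64$ distinct 3-mer subsequences and $4^4 = 256$ distinct 4-mer subsequences, and protein sequences have $7^3 = 343$ distinct 3-mer subsequences and $7^4 = 2401$ distinct 4-mer subsequences. For example, AAA and AUC are two RNA 3-mer subsequences, and AAAA and AAAU are two RNA 4-mer subsequences; 111 and 112 are two protein 3-mer subsequences, and 1111 and 1122 are two protein 4-mer subsequences. In other examples, k may take only the value 3 or only the value 4; the present disclosure does not specifically limit this.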
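The counts above can be checked by enumerating every k-mer over each alphabet. This is a minimal sketch using the four RNA bases and the seven-class protein alphabet from the text; the function name `all_kmers` is illustrative.

```python
from itertools import product

RNA_ALPHABET = "AUGC"                # four bases
PROTEIN_CLASS_ALPHABET = "1234567"   # seven amino acid classes


def all_kmers(alphabet: str, k: int) -> list[str]:
    """Every length-k string over the alphabet (repetition allowed)."""
    return ["".join(chars) for chars in product(alphabet, repeat=k)]


for k in (3, 4):
    print(k, len(all_kmers(RNA_ALPHABET, k)), len(all_kmers(PROTEIN_CLASS_ALPHABET, k)))
# 3 64 343
# 4 256 2401
```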
步骤S420.统计每种k-mer子序列在所述原始数据集中的出现频率,并根据所述出现频率计算所述每种k-mer子序列的方差。
一种示例实施方式中,可以根据步骤S410得到RNA序列和蛋白质序列的所有3-mer子序列和4-mer子序列,即包括RNA序列的64种3-mer子序列、256种4-mer子序列和蛋白质序列的343种3-mer子序列、2401种4-mer子序列。可以统计每种3-mer子序列或4-mer子序列在原始数据集中的出现频率,并根据出现频率计算每种3-mer子序列或4-mer子序列的方差。其中,在统计每种3-mer子序列或4-mer子序列在原始数据集中的出现频率之前,需要将原始数据集中每个RNA-蛋白质对的RNA序列和蛋白质序列转化为3-mer子序列和4-mer子序列。例如,对于RNA序列“AGAUGG”,该序列的3-mer子序列可以包括“AGA”、“GAU”、“AUG”和“UGG”,该序列的4-mer子序列可以包括“AGAU”、“GAUG”和“AUGG”,也即可以通过正向重叠读取该RNA序列得到对应的3-mer子序列或4-mer子序列。类似的,也可以通过反向重叠读取该RNA序列得到对应的3-mer子序列或4-mer子序列。例如,该序列的3-mer子序列也可以包括“GGU”、“GUA”、“UAG”和“AGA”,该序列的4-mer子序列也可以包括“GGUA”、“GUAG”和“UAGA”。在一些实施方式中,还可以通过非重叠的方式读取该RNA序列得到对应的3-mer子序列或4-mer子序列,例如,该序列的3-mer子序列也可以包括“AGA”和“UGG”,本公开对此不做具体限定。
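The three reading modes described above (forward overlapping, reverse overlapping, and non-overlapping) can be sketched with a single helper. The function name `kmers` and its `step` and `reverse` parameters are assumptions made for illustration.

```python
def kmers(sequence, k, step=1, reverse=False):
    """Read k-mers from a sequence.

    step=1 gives overlapping reads, step=k gives non-overlapping reads;
    reverse=True reads the sequence from its last symbol to its first.
    """
    if reverse:
        sequence = sequence[::-1]
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]

rna = "AGAUGG"
print(kmers(rna, 3))                # ['AGA', 'GAU', 'AUG', 'UGG']
print(kmers(rna, 4))                # ['AGAU', 'GAUG', 'AUGG']
print(kmers(rna, 3, reverse=True))  # ['GGU', 'GUA', 'UAG', 'AGA']
print(kmers(rna, 3, step=3))        # ['AGA', 'UGG']
```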
示例性的,可以统计RNA序列和蛋白质序列的每种3-mer子序列和/或4-mer子序列在 原始数据集中的出现频次,根据出现频次计算得到每种3-mer子序列和/或4-mer子序列在原始数据集中的出现频率。例如,可以计算某种3-mer子序列在原始数据集中的出现频次与原始数据集中RNA-蛋白质对的总个数之间的比值得到该3-mer子序列在原始数据集中的出现频率。通过遍历原始数据集,对每种3-mer子序列和/或4-mer子序列是否出现在每个RNA-蛋白质对中进行标记。可以根据每种3-mer子序列和/或4-mer子序列在原始数据集中的出现频率和在每个RNA-蛋白质对中的标记值计算得到每种k-mer子序列的方差。
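For example, the i-th kind of k-mer subsequence may be a 3-mer subsequence of an RNA or protein sequence, or a 4-mer subsequence of an RNA or protein sequence. The occurrence count of this subsequence in the RPI1807 dataset may be tallied first. For instance, the N RNA-protein pairs in the RPI1807 dataset (N = 3243) may be looped over: if the subsequence appears in the current RNA-protein pair, the occurrence count is increased by 1; if it does not appear, the occurrence count is unchanged; the next RNA-protein pair is then checked in the same way. The resulting occurrence count of the i-th kind of k-mer subsequence in the RPI1807 dataset is denoted $num_i$, from which the occurrence frequency $Freq_i$ of the subsequence in the RPI1807 dataset can be calculated, namely:
$$Freq_i = \frac{num_i}{N}$$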
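After the occurrence frequency of the i-th kind of k-mer subsequence in the RPI1807 dataset is determined, the dataset may be traversed to check whether the subsequence appears in each RNA-protein pair, and each result is marked, denoted here $f_n^i$. That is, if the subsequence appears in the n-th RNA-protein pair, the marker value is $f_n^i = 1$; if it does not appear in the n-th RNA-protein pair, the marker value is $f_n^i = 0$.
After the occurrence frequency $Freq_i$ of the subsequence in the RPI1807 dataset and its marker value $f_n^i$ in the n-th RNA-protein pair are obtained, the variance may be calculated according to:
$$Var_i = \frac{1}{N}\sum_{n=1}^{N}\left(f_n^i - Freq_i\right)^2$$
That is, the squared differences between the marker value of the i-th kind of k-mer subsequence in each RNA-protein pair of the RPI1807 dataset and the occurrence frequency of that k-mer subsequence in the RPI1807 dataset are summed to obtain the variance $Var_i$ of the i-th kind of k-mer subsequence in the RPI1807 dataset. Here, $f_n^i$ is the marker value of the i-th kind of k-mer subsequence in the n-th RNA-protein pair, $Freq_i$ is the occurrence frequency of the i-th kind of k-mer subsequence in the RPI1807 dataset, and N is the total number of RNA-protein pairs in the RPI1807 dataset.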
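As a rough illustration of step S420, the sketch below computes the presence frequency and the variance of the 0/1 marker values for one k-mer over a toy dataset. The toy pairs, the helper name `presence_stats`, and the normalization of the squared-difference sum by N are all assumptions; the text only states that the squared differences are summed.

```python
def presence_stats(kmer, pairs):
    """Return (frequency, variance) of a k-mer's 0/1 presence markers.

    pairs is a list of (rna_sequence, class_encoded_protein_sequence) tuples.
    The variance divides the sum of squared differences by N (an assumption).
    """
    flags = [1 if (kmer in rna or kmer in protein) else 0 for rna, protein in pairs]
    n = len(flags)
    freq = sum(flags) / n                          # occurrence count / N
    var = sum((f - freq) ** 2 for f in flags) / n  # mean squared deviation of the markers
    return freq, var

# Toy data, not taken from RPI1807.
toy_pairs = [("AGAUGG", "1246"), ("CCCAUG", "7711"), ("GGGGGG", "1111"), ("UUUUUU", "2222")]
print(presence_stats("AUG", toy_pairs))  # (0.5, 0.25)
```

For a 0/1 marker this variance equals Freq × (1 − Freq), so it is largest for k-mers present in about half of the pairs.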
步骤S430.根据所述每种k-mer序列的方差大小确定所述原始序列特征集。
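After the variance of each kind of k-mer subsequence is calculated, the k-mer subsequences that satisfy a preset condition may be determined according to their variances, and the original sequence feature set is composed of those k-mer subsequences. Exemplarily, all RNA 3-mer subsequences, RNA 4-mer subsequences, protein 3-mer subsequences, and protein 4-mer subsequences may each be sorted by variance, for example in descending order, and the top-ranked k-mer subsequences may be selected to form the original sequence feature set. For example, the 560 top-ranked kinds of k-mer subsequences may be selected to form the original sequence feature set, comprising the top 60 RNA 3-mer subsequences, the top 200 RNA 4-mer subsequences, the top 200 protein 3-mer subsequences, and the top 100 protein 4-mer subsequences. It can be understood that the number of selected k-mer subsequences is merely illustrative, and any number of k-mer subsequences may be selected as actually needed. In other examples, a variance threshold may be preset, and the k-mer subsequences whose variance exceeds the threshold are screened out to form the original sequence feature set. For example, when the preset variance threshold is 3, the k-mer subsequences with a variance greater than 3 may be selected to form the original sequence feature set. It should be noted that features with a larger variance may be preferred during feature selection: a larger variance indicates that the feature varies more across the data, so the feature distinguishes samples better, which in turn can improve the classification and prediction ability of the interaction prediction model.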
另一种示例实施方式中,也可以计算每种k-mer子序列在每个RNA-蛋白质对中出现次数的平均值,并根据出现次数的平均值计算每种k-mer子序列的方差,进而根据每种k-mer子序列的方差大小确定所述原始序列特征集。
示例性的,通过遍历原始数据集,可以确定每种3-mer子序列或4-mer子序列在每个RNA-蛋白质对中的出现次数。对每种3-mer子序列或4-mer子序列在每个RNA-蛋白质对中的出现次数进行统计可以得到该子序列在原始数据集中的出现总次数,根据出现总次数可以计算得到每种3-mer子序列或4-mer子序列在每个RNA-蛋白质对中出现频次的平均值。最后,可以根据每种3-mer子序列或4-mer子序列在每个RNA-蛋白质对中出现次数的平均值和在每个RNA-蛋白质对的出现次数计算每种子序列的方差。
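For example, the i-th kind of k-mer subsequence may be a 3-mer subsequence of an RNA or protein sequence, or a 4-mer subsequence of an RNA or protein sequence. The total number of occurrences of this subsequence in the RPI1807 dataset may be tallied first. For instance, the n RNA-protein pairs in the RPI1807 dataset (n = 3243) may be looped over to obtain the numbers of occurrences of the subsequence in each RNA-protein pair, $x_1, x_2, \dots, x_n$; summing $x_1, x_2, \dots, x_n$ gives the total number of occurrences of the subsequence in the RPI1807 dataset, denoted $num_i$. The mean number of occurrences $m_i$ of the subsequence per RNA-protein pair can then be calculated from $num_i$ according to:
$$m_i = \frac{num_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
The variance of the subsequence can then be calculated from its mean number of occurrences per RNA-protein pair and its number of occurrences in each RNA-protein pair, according to:
$$s^2 = \frac{1}{n}\left[(x_1 - m_i)^2 + (x_2 - m_i)^2 + \dots + (x_n - m_i)^2\right]$$
which gives the variance $s^2$ of the i-th kind of k-mer subsequence. Here, n is the number of RNA-protein pairs in the RPI1807 dataset, $m_i$ is the mean number of occurrences of the subsequence per RNA-protein pair, and $x_n$ is the number of occurrences of the subsequence in the n-th RNA-protein pair; similarly, $x_1$ is its number of occurrences in the first RNA-protein pair and $x_2$ its number of occurrences in the second.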
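The count-based variant can be sketched in the same way. The example below counts overlapping occurrences per sequence and then takes the mean and the variance of those counts; the helper names, the toy sequences, and the division by n are assumptions for illustration only.

```python
def count_overlapping(kmer, sequence):
    """Count occurrences of kmer in sequence, allowing overlaps."""
    k = len(kmer)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == kmer)

def count_stats(kmer, sequences):
    """Return per-sequence counts, their mean, and their variance (divided by n)."""
    counts = [count_overlapping(kmer, s) for s in sequences]
    n = len(counts)
    mean = sum(counts) / n
    var = sum((x - mean) ** 2 for x in counts) / n
    return counts, mean, var

# Toy sequences, not taken from any dataset.
print(count_stats("AA", ["AAAA", "AAUAA", "GGGG", "AAGG"]))  # ([3, 2, 0, 1], 1.5, 1.25)
```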
计算得到每种k-mer子序列的方差后,示例性的,可以根据方差大小将所有k-mer子序列进行排序,如降序排序,可以选取排名靠前的k-mer子序列组成原始序列特征集。其他示例中,也可以预设方差阈值,将方差大于该阈值的k-mer子序列筛选出来,并由筛选得到的k-mer子序列组成原始序列特征集。
本公开示例实施方式中,可以提取原始数据集中每个RNA-蛋白质对的k-mer特征,由提取到的RNA序列的k-mer特征和蛋白质序列的k-mer特征组成原始序列特征集。以RNA序列的k-mer特征为例,k-mer特征中可以包含RNA序列的单体组分信息(即包含的各个碱基)和序列顺序信息。因此,使用k-mer特征可以更好的刻画一个RNA序列,也即可以根据k-mer特征更加准确的确定一个RNA序列,同时也可以通过k-mer特征区分不同的RNA序列。为了进一步挖掘RNA序列和蛋白质序列之间的联系,也可以提取原始数据集中的每个RNA-蛋白质对的频繁项集特征,由提取到的频繁项集特征组成原始序列特征集。频繁项集特征可以将RNA序列的kmer特征和蛋白质序列的kmer特征组合到一起。因此,使用频繁项集特征可以更好的区分有相互作用和没有相互作用的RNA-蛋白质对。还可以同时提取k-mer特征和频繁项集特征,并由二者共同组成原始序列特征集,通过结合k-mer特征和频繁项集特征的特性,可以更加准确地预测未知RNA-蛋白质对中RNA和蛋白质之间的相互作用,本公开对此不做具体限定。其中,频繁项集特征是指在原始数据集中具备一定支持度的RNA k-mer子序列与蛋白质k-mer子序列组合成的k-mer子序列对,支持度是指同时包含A和B的事务占所有事务的比例。例如,对于子序列对(AAU,137),表示由一个RNA的3-mer子序列AAU和一个蛋白质的3-mer子序列137组合成的3-mer子序列对。该子序列对的支持度即为原始数据集中同时包含子序列AAU和137的RNA-蛋白质对占所有RNA-蛋白质对的比例。
参考图5所示,可以根据步骤S510至步骤S550提取原始数据集中的每个RNA-蛋白质对的频繁项集特征,以得到原始序列特征集。
步骤S510.将所述每个RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,由所述k-mer子序列组成第一候选项集,所述k-mer子序列包括RNA k-mer子序列和蛋白质k-mer子序列。
一种示例实施方式中,可以先将RPI1807数据集中每个RNA-蛋白质对的RNA序列和蛋白质序列分别转化为3-mer子序列和4-mer子序列。通过遍历RPI1807数据集,可以找出该数据集中所有的RNA 3-mer子序列、RNA 4-mer子序列、蛋白质3-mer子序列和蛋白质4-mer子序列,并由该数据集中的所有3-mer子序列和4-mer子序列组成第一候选项集C1。
步骤S520.统计所述第一候选项集中每种k-mer子序列在所述原始数据集中的出现频率,由满足预设出现频率阈值的k-mer子序列组成频繁项集。
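The occurrence frequency in the original dataset of each kind of k-mer subsequence in the first candidate itemset C1 may be tallied. Exemplarily, the j-th kind of k-mer subsequence may be a 3-mer subsequence of an RNA or protein sequence, or a 4-mer subsequence of an RNA or protein sequence. The occurrence count of this subsequence in the RPI1807 dataset may be tallied first. For example, the N RNA-protein pairs in the RPI1807 dataset may be looped over: if the subsequence appears in the current RNA-protein pair, the occurrence count is increased by 1; if it does not appear, the occurrence count is unchanged. The resulting occurrence count of the j-th kind of k-mer subsequence in the RPI1807 dataset is denoted $num_j$, from which the occurrence frequency $Freq_j$ of the subsequence in the RPI1807 dataset can be calculated, namely:
$$Freq_j = \frac{num_j}{N}$$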
类似的,可以计算出第一候选项集C1中每种3-mer子序列或4-mer子序列在RPI1807数据集中的出现频率。然后,可以根据预设出现频率阈值对所有的3-mer子序列和4-mer子序列进行筛选。例如,可以选取出现频率大于第一阈值的RNA 3-mer子序列、出现频率大于第二阈值的RNA 4-mer子序列、出现频率大于第三阈值的蛋白质3-mer子序列和出现频率大于第四阈值的蛋白质4-mer子序列共同组成频繁项集L1。其中,第一阈值、第二阈值、第三阈值和第四阈值可以相同,也可以不相同,本公开对此不做具体限定。其他示例中,也可以将RNA序列的3-mer子序列、4-mer子序列的出现频率以及蛋白质序列的3-mer子序列、4-mer子序列的出现频率分别进行降序排序,由排名靠前的子序列组成频繁项集L1,本公开对此不做具体限定。
步骤S530.将所述频繁项集中的RNA k-mer子序列和蛋白质k-mer子序列进行交叉组合,由组合得到的k-mer子序列对组成第二候选项集。
可以将频繁项集L1中的RNA 3-mer子序列和蛋白质3-mer子序列、RNA 4-mer子序列和蛋白质4-mer子序列分别进行两两交叉组合,得到多种3-mer子序列对和4-mer子序列对,并由组合得到的多种子序列对组成第二候选项集C2。举例而言,若频繁项集中包括[AAU,AUC,137,123,AAUU,AGUC,1737,1234],可以将“AAU”和“137”组合得到3-mer子序列对“AAU_137”,也可以将“AAU”和“123”组合得到3-mer子序列对“AAU_123”。类似的,通过将RNA 3-mer子序列和蛋白质3-mer子序列进行交叉组合还可以得到3-mer子序列对“AUC_137”、“AUC_123”,通过将RNA 4-mer子序列和蛋白质4-mer子序列进行交叉组合可以得到4-mer子序列对“AAUU_1737”、“AAUU_1234”、“AGUC_1737”和“AGUC_1234”。
步骤S540.统计所述第二候选项集中每种k-mer子序列对在所述原始数据集中的出现频率,得到所述每种k-mer子序列对的支持度。
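The occurrence frequency in the RPI1807 dataset of each kind of subsequence pair in the second candidate itemset C2 may be tallied. Exemplarily, the f-th kind of k-mer subsequence pair may be a 3-mer subsequence pair or a 4-mer subsequence pair. The occurrence count of this subsequence pair in the RPI1807 dataset may be tallied first. For example, the N RNA-protein pairs in the RPI1807 dataset may be looped over: if the subsequence pair appears in the current RNA-protein pair, the occurrence count is increased by 1; if it does not appear, the occurrence count is unchanged. The resulting occurrence count of the f-th kind of k-mer subsequence pair in the RPI1807 dataset is denoted $num_f$, from which the occurrence frequency of the subsequence pair in the RPI1807 dataset, that is, its support $support_f$, can be calculated, namely:
$$support_f = \frac{num_f}{N}$$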
类似的,可以计算出第二候选项集C2中每种子序列对的支持度。
步骤S550.由所述支持度满足预设条件的k-mer子序列对组成所述原始序列特征集。
计算得到每种子序列对的支持度后,可以根据每种子序列对的支持度大小确定满足预设条件的子序列对,并由满足预设条件的子序列对组成原始序列特征集。示例性的,可以预设支持度阈值,将支持度大于该阈值的子序列对筛选出来,并由筛选得到的子序列对组成原始序列特征集。例如,支持度大于该阈值的子序列对共有370个,这370个子序列对即为370个频繁项集特征,可以由370个频繁项集特征组成原始序列特征集。其他示例中,也可以根据支持度大小将第二候选项集C2中的所有子序列对进行降序排序,并选取排名靠前的子序列对组成原始序列特征集,本公开对此不做具体限定。
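Steps S510 to S550 resemble a two-level frequent-itemset search. The sketch below walks through them for a single k on toy data; the function names, the thresholds, and the `RNA_protein` key format are assumptions chosen to mirror the notation used elsewhere in this description.

```python
from itertools import product

def kmer_set(seq, k):
    """Set of overlapping k-mers present in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def frequent_pair_features(pairs, k, min_freq, min_support):
    """Return {"RNAkmer_PROTEINkmer": support} for pairs passing both thresholds."""
    n = len(pairs)
    rna_sets = [kmer_set(rna, k) for rna, _ in pairs]
    pro_sets = [kmer_set(pro, k) for _, pro in pairs]

    def frequent(sets):
        # S510/S520: candidate k-mers, kept when their presence frequency exceeds min_freq.
        candidates = set().union(*sets)
        return sorted(m for m in candidates if sum(m in s for s in sets) / n > min_freq)

    features = {}
    # S530: cross-combine frequent RNA k-mers with frequent protein k-mers.
    for r, p in product(frequent(rna_sets), frequent(pro_sets)):
        # S540: support = share of RNA-protein pairs containing both k-mers.
        support = sum((r in rs) and (p in ps) for rs, ps in zip(rna_sets, pro_sets)) / n
        # S550: keep the subsequence pairs whose support exceeds the threshold.
        if support > min_support:
            features[f"{r}_{p}"] = support
    return features

# Toy data, not taken from RPI1807.
toy = [("AAUC", "1371"), ("GAAU", "2137"), ("CCCC", "7777"), ("AAUG", "1234")]
print(frequent_pair_features(toy, k=3, min_freq=0.4, min_support=0.4))  # {'AAU_137': 0.5}
```

Pruning single k-mers before pairing keeps the second candidate set small, since only frequent RNA and protein k-mers are cross-combined.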
另一种示例实施方式中,可以将每个RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,得到k-mer子序列对。统计每种k-mer子序列对在原始数据集中的出现频率,将满足出现频率预设条件的k-mer子序列对作为频繁项集特征并组成原始序列特征集。
举例而言,可以先将RPI1807数据集中所有正例RNA-蛋白质对的RNA序列和蛋白质序列分别转化为正例3-mer子序列和正例4-mer子序列。类似的,将该数据集中所有负例RNA-蛋白质对的RNA序列和蛋白质序列分别转化为负例3-mer子序列和负例4-mer子序列。通过遍历RPI1807数据集,可以找出该数据集中所有的正、负例RNA 3-mer子序列、正、负例RNA 4-mer子序列、正、负例蛋白质3-mer子序列和正、负例蛋白质4-mer子序列。将该数据集中的RNA 3-mer子序列和蛋白质3-mer子序列、RNA 4-mer子序列和蛋白质4-mer子序列进行两两交叉组合,得到多种3-mer子序列对和4-mer子序列对。示例性的,可以将正例RNA 3-mer子序列和正例蛋白质3-mer子序列进行交叉组合,得到正例3-mer子序列对。可以将负例RNA 3-mer子序列和负例蛋白质3-mer子序列进行交叉组合,得到负例3-mer子序列对。可以将正例RNA 4-mer子序列和正例蛋白质4-mer子序列进行交叉组合,得到正例4-mer子序列对。可以将负例RNA 4-mer子序列和负例蛋白质4-mer子序列进行交叉组合,得到负例4-mer子序列对。
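The occurrence frequency of each kind of subsequence pair in the dataset may be tallied. For example, for any positive 3-mer subsequence pair, its occurrence frequency Freq in the dataset may be calculated according to:
$$Freq = \frac{num}{NUM}$$
where num is the number of occurrences of that positive 3-mer subsequence pair in the dataset, and NUM is the total number of occurrences of all positive 3-mer subsequence pairs in the dataset.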
计算得到每种k-mer子序列对在原始数据集中的出现频率后,示例性的,可以分别将所有的3-mer子序列对和4-mer子序列对根据出现频率的大小分别进行排序,如降序排序,可以选取排名靠前的k-mer子序列对组成频繁项集。例如,将所有的正例3-mer子序列对进行降序排序,可以选取前m个3-mer子序列对组成频繁项集A1。将所有的正例4-mer子序列对进行降序排序,可以选取前n个4-mer子序列对组成频繁项集A2。将所有的负例3-mer子序列对进行降序排序,可以选取前p个3-mer子序列对组成频繁项集A3。将所有 的负例4-mer子序列对进行降序排序,可以选取前q个4-mer子序列对组成频繁项集A4。进而由这四种频繁项集A1、A2、A3和A4组成原始序列特征集。其他示例中,也可以预设出现频率阈值,将出现频率大于该阈值的k-mer子序列对筛选出来,并由筛选得到的k-mer子序列对作为频繁项集特征并组成原始序列特征集,本公开对此不做具体限定。
本公开示例实施方式中,频繁项集特征可以将RNA序列的kmer特征和蛋白质序列的kmer特征组合到一起,使用频繁项集特征可以更好的区分有相互作用和没有相互作用的RNA-蛋白质对。因此,由频繁项集特征组成原始序列特征集,并基于该原始序列特征集对待预测的RNA-蛋白质对进行特征提取时,根据提取到的该待预测的RNA-蛋白质对的序列特征可以更加准确的确定该待预测的RNA-蛋白质对是否有相互作用。
步骤S320.根据所述原始序列特征集确定所述待预测的RNA-蛋白质对的序列特征。
一种示例实施方式中,可以将待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,得到原始序列特征集后,可以在原始序列特征集中查找每种k-mer子序列,并根据查找结果得到待预测的RNA-蛋白质对的序列特征。其中,待预测的RNA-蛋白质对的序列特征可以是指由RNA序列特征和蛋白质序列特征组成的完整序列特征。
示例性的,原始序列特征集可以由560种k-mer子序列组成,例如,560种k-mer子序列可以为[CCC,…,AGU,CCCC,…,CUGG,777,…,373,7774,…,7571]。将待预测的RNA-蛋白质对转化为3-mer子序列和4-mer子序列,得到RNA 3-mer子序列、RNA4-mer子序列、蛋白质3-mer子序列和蛋白质4-mer子序列。可以基于原始序列特征集对待预测的RNA-蛋白质对中的RNA序列和蛋白质序列进行特征计算,得到该待预测的RNA-蛋白质对的序列特征。
具体地,对于包括560个特征维度的原始序列特征集[CCC,…,AGU,CCCC,…,CUGG,777,…,373,7774,…,7571],每个特征维度的特征对应于一种k-mer子序列。例如,子序列CCC为第一个特征维度的特征,子序列7571为第560个特征维度的特征。可以在原始序列特征集中查找该待预测的RNA-蛋白质对的所有3-mer子序列和4-mer子序列,并根据查找结果确定原始序列特征集中每个特征维度上的特征是否存在。若存在则该特征维度上的特征值为1,不存在则为0。举例而言,若待预测的RNA-蛋白质对中的RNA序列为AAAACCCGGG,可以看出,原始序列特征集中的第一个特征维度的特征CCC同时也是该RNA-蛋白质对的一个3-mer子序列。因此,可以确定原始序列特征集中第一个特征维度上的特征CCC存在,对应的可以将特征值记为1。再例如,原始序列特征集中的特征AGU在该RNA序列中并不存在,则可以将原始序列特征集中与特征AGU对应的特征维度上的特征值记为0。最后,可以计算得到一个560维的特征值向量[1,0,…,0,1,…],该特征值向量即为待预测的RNA-蛋白质对的序列特征。可以理解的是,该特征值向量中包含的每个特征值与原始序列特征集中每个特征维度的特征值是一一对应的。
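The lookup described above amounts to building a 0/1 vector over the feature set. In the sketch below the short feature list stands in for the 560-dimensional set, and the protein string "1777" is a made-up class-encoded sequence, since the text gives only the RNA sequence for this example.

```python
def kmer_set(seq, k):
    """Set of overlapping k-mers present in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sequence_feature(rna, protein, feature_set, ks=(3, 4)):
    """One feature value per dimension: 1 if that k-mer occurs in the pair, else 0."""
    present = set()
    for k in ks:
        present |= kmer_set(rna, k) | kmer_set(protein, k)
    return [1 if feature in present else 0 for feature in feature_set]

# A six-dimensional stand-in for the 560-dimensional feature set.
features = ["CCC", "AGU", "CCCC", "CUGG", "777", "373"]
print(sequence_feature("AAAACCCGGG", "1777", features))  # [1, 0, 0, 0, 1, 0]
```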
该示例实施方式中,通过提取原始数据集中每个RNA-蛋白质对的k-mer特征,由提取到的RNA序列的k-mer特征和蛋白质序列的k-mer特征组成原始序列特征集。基于该原始 序列特征集对待预测的RNA-蛋白质对进行特征提取得到该待预测的RNA-蛋白质对的序列特征,以RNA序列的k-mer特征为例,k-mer特征中可以包含RNA序列的单体组分信息(即包含的各个碱基)和序列顺序信息。因此,使用k-mer特征可以更好的刻画一个RNA序列,也即可以根据k-mer特征更加准确的确定一个RNA序列,同时也可以通过k-mer特征区分不同的RNA序列。
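In another example embodiment, after the original sequence feature set is obtained, the RNA sequence and the protein sequence in the RNA-protein pair to be predicted may each be converted into k-mer subsequences, and the RNA k-mer subsequences and protein k-mer subsequences may be cross-combined to obtain multiple kinds of RNA-protein k-mer subsequence pairs. Exemplarily, after the RNA 3-mer subsequences, protein 3-mer subsequences, RNA 4-mer subsequences, and protein 4-mer subsequences of the pair to be predicted are obtained, the RNA 3-mer subsequences may be cross-combined pairwise with the protein 3-mer subsequences, and the RNA 4-mer subsequences with the protein 4-mer subsequences, to obtain multiple kinds of 3-mer subsequence pairs and 4-mer subsequence pairs. Each kind of RNA-protein k-mer subsequence pair may then be looked up in the original sequence feature set, and the sequence feature of the RNA-protein pair is obtained from the lookup result.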
示例性的,原始序列特征集可以由370个频繁项集特征组成,例如,370个频繁项集特征可以为[CUG_122,AAU_122,…,CUUU_1312,UCUG_1312,…]。将待预测的RNA-蛋白质对转化为3-mer子序列和4-mer子序列,得到RNA 3-mer子序列、RNA 4-mer子序列、蛋白质3-mer子序列和蛋白质4-mer子序列。可以将RNA 3-mer子序列和蛋白质3-mer子序列、RNA 4-mer子序列和蛋白质4-mer子序列进行配对,得到多种3-mer子序列对和4-mer子序列对。然后,可以基于原始序列特征集对待预测的RNA-蛋白质对中的RNA序列和蛋白质序列进行特征计算,得到该RNA-蛋白质对的序列特征。
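Specifically, for an original sequence feature set comprising 370 feature dimensions, [CUG_122, AAU_122, …, CUUU_1312, UCUG_1312, …], the feature of each feature dimension corresponds to one kind of k-mer subsequence pair. For example, the subsequence pair CUG_122 is the feature of the first feature dimension. All subsequence pairs of the RNA-protein pair to be predicted may be looked up in the original sequence feature set, and whether the feature on each feature dimension is present is determined from the lookup result: the feature value on that dimension is 1 if the feature is present and 0 otherwise. For instance, suppose the RNA sequence in the pair to be predicted is AUCUGAAAU and the protein sequence is 512261312. The subsequence pairs CUG_122, AAU_122, and UCUG_1312 of this RNA-protein pair are present in the original sequence feature set, so the feature values on the corresponding feature dimensions of the original sequence feature set may be recorded as 1. Finally, a 370-dimensional feature value vector [1, 1, …, 0, 1, …] is obtained, which is the sequence feature of the RNA-protein pair to be predicted. Similarly, each feature value in this vector corresponds one-to-one to a feature dimension of the original sequence feature set.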
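The pair lookup can be sketched as follows, using the RNA and protein sequences from the example above. The four-entry feature list stands in for the 370-dimensional set, and `pair_feature` is a hypothetical helper; a substring test is used because a k-mer is present exactly when it occurs as a contiguous substring.

```python
def pair_feature(rna, protein, feature_set):
    """Feature value 1 when both halves of an "RNA_protein" k-mer pair occur, else 0."""
    vector = []
    for feature in feature_set:
        rna_kmer, protein_kmer = feature.split("_")
        vector.append(1 if rna_kmer in rna and protein_kmer in protein else 0)
    return vector

# A four-dimensional stand-in for the 370 frequent itemset features.
features = ["CUG_122", "AAU_122", "CUUU_1312", "UCUG_1312"]
print(pair_feature("AUCUGAAAU", "512261312", features))  # [1, 1, 0, 1]
```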
再一种示例实施方式中,得到原始序列特征集后,可以将待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,在原始序列特征集中查找每种k-mer子序列,得到第一序列特征。然后,可以将RNA k-mer子序列和蛋白质k-mer子序列进行组合,得到多种RNA-蛋白质k-mer子序列对,在原始序列特征集中查找每种RNA-蛋白质 k-mer子序列对,得到第二序列特征。最后,可以由第一序列特征和第二序列特征组成待预测的RNA-蛋白质对的序列特征。
示例性的,原始序列特征集可以包括两个特征子集,两个特征子集分别包括560种k-mer子序列[CCC,…,CCCC,…,777,…,7774,…]和370个频繁项集特征[CUG_122,AAU_122,…,CUUU_1312,UCUG_1312,…]。可以将待预测的RNA-蛋白质对转化为RNA 3-mer子序列、RNA 4-mer子序列、蛋白质3-mer子序列和蛋白质4-mer子序列,同时,也可以将RNA 3-mer子序列和蛋白质3-mer子序列、RNA 4-mer子序列和蛋白质4-mer子序列进行配对,得到多种3-mer子序列对和4-mer子序列对。然后,可以根据原始序列特征集对待预测的RNA-蛋白质对中的RNA序列和蛋白质序列进行特征计算,得到该待预测的RNA-蛋白质对的序列特征。
具体地,可以在原始序列特征集中查找该待预测的RNA-蛋白质对的所有子序列和子序列对,并根据查找结果确定原始序列特征集中每个特征维度上的特征是否存在。例如,通过查找该待预测的RNA-蛋白质对的所有子序列可以计算得到一个560维的特征值向量[1,…,1,…,1,…,0,…],即第一序列特征。通过查找该RNA-蛋白质对的所有子序列对可以计算得到一个370维的特征值向量[1,1,…,0,1,…],即第二序列特征。例如,可以通过将两个特征值向量进行拼接,得到一个930维的特征值向量即为该待预测的RNA-蛋白质对的序列特征。也可以直接将两个特征值向量同时输入相互作用预测模型中,本公开对此不做具体限定。
其他示例中,也可以由560种k-mer子序列和370个频繁项集特征组成一个930维的原始序列特征集。例如,原始序列特征集为[CCC,…,CCCC,…,777,…,7774,…,CCA_121,…,UCUG_1312,…,AAU_122,…,CUUU_1312,…]。可以在原始序列特征集中查找该RNA-蛋白质对的所有子序列和子序列对,并根据查找结果确定原始序列特征集中每个特征维度上的特征是否存在,计算得到一个930维的特征值向量[1,…,1,…,1,…,0,…,1,…,1,…,1,…,0,…],该特征值向量即为待预测的RNA-蛋白质对的序列特征。
为了进一步挖掘RNA序列和蛋白质序列之间的联系,也可以提取原始数据集中的每个RNA-蛋白质对的频繁项集特征,由提取到的频繁项集特征组成原始序列特征集。频繁项集特征可以将RNA序列的kmer特征和蛋白质序列的kmer特征组合到一起。因此,使用频繁项集特征可以更好的区分有相互作用和没有相互作用的RNA-蛋白质对。还可以同时提取k-mer特征和频繁项集特征,并由二者共同组成原始序列特征集,通过结合k-mer特征和频繁项集特征的特性,可以更加准确地预测未知RNA-蛋白质对中RNA和蛋白质之间的相互作用。
在步骤S230中,向量化所述待预测的RNA-蛋白质对,得到所述待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量。
对待预测的RNA-蛋白质对进行特征提取后,可以将得到的序列特征作为第一相互作用 预测模型的输入。为了获取RNA序列和蛋白质序列的序列信息以及临近碱基、氨基酸之间的信息,还可以对待预测的RNA-蛋白质对进行向量化处理,并将得到的向量作为至少一个第二相互作用预测模型的输入。
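In an example embodiment, the RNA sequence and the protein sequence in the RNA-protein pair to be predicted may each be converted into k-mer subsequences. For example, the RNA sequence may be divided without overlap into M RNA k-mer subsequences, and the protein sequence may be divided without overlap into N protein k-mer subsequences. For instance, if the RNA sequence is AUCUGAAAU, it can be divided into three RNA k-mer subsequences (here with k = 3): AUC, UGA, and AAU. It can be understood that the RNA sequence and the protein sequence are divided without overlap into multiple k-mer subsequences in order to vectorize them, that is, to represent the bases in the RNA sequence and the amino acids in the protein sequence as vectors in k-tuple form. Similarly, in other examples, each base contained in the RNA sequence of the RNA-protein pair may be vectorized to obtain multiple base vectors, and the base vectors are concatenated to obtain the representation vector of the RNA sequence. Likewise, each amino acid contained in the protein sequence of the RNA-protein pair may be vectorized to obtain multiple amino acid vectors, and the amino acid vectors are concatenated to obtain the representation vector of the protein sequence. The RNA sequence may also be divided with overlap into P RNA k-mer subsequences, and the protein sequence with overlap into Q protein k-mer subsequences; the present disclosure does not specifically limit this. The RNA sequence representation vector and the protein sequence representation vector may then each be input into the second interaction prediction model.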
示例性的,可以先将RNA序列和蛋白质序列的每种k-mer子序列进行编码。示例性的,当k=3时,RNA 3-mer子序列可以有64种,蛋白质3-mer子序列可以有343种。可以将每种RNA 3-mer子序列和蛋白质3-mer子序列依次进行Embedding(向量映射)编码,将每个3-mer子序列分别用一个低维向量表示,可以得到对应的多个3-mer子序列向量。一种示例中,可以将每个3-mer子序列进行One-Hot(独热)编码,One-Hot编码又称作一位有效编码,其方法是使用N位状态寄存器来对N个状态进行编码,每个状态都有独立的寄存器位,并且在任意时候,寄存器中只有一位有效。举例而言,对于第i种RNA 3-mer子序列,即索引为整数i的RNA 3-mer子序列,通过编码可以得到一个64维的One-Hot向量,并将该向量中第i个元素设为1,其它元素均设为0,形如[0,1,0,0,…,0]。对于第j个蛋白质3-mer子序列,即索引为整数j的蛋白质3-mer子序列,通过编码可以得到一个343维的One-Hot向量,并将该向量中第j个元素设为1,其它元素均设为0。类似的,每种RNA 3-mer子序列和蛋白质3-mer子序列可以对应一个3-mer One-Hot向量。其他示例中,也可以利用稠密向量来表示每个3-mer子序列。例如,可以利用Word2vec算法将每个3-mer子序列映射到向量空间中,每个3-mer子序列可以用该向量空间中的一个子序列向量表示。还可以利用BERT(来自变换器的双向编码表示)预训练模型对每个3-mer子序列进行编码,得到对应的多个3-mer子序列向量。具体地,可以获取大规模的RNA序列数据,使用BERT预训练模型进行训练,训练完成后,将某个RNA序列输入训练好的模型后即可得到该RNA序列的高维向量,本公开对此不做具体限定。
得到RNA序列和蛋白质序列的所有3-mer One-Hot向量后,可以将待预测的RNA-蛋 白质对中的RNA序列和蛋白质序列分别转化为3-mer子序列,如得到M个RNA 3-mer子序列和N个蛋白质3-mer子序列。然后,通过查询可以确定M个RNA 3-mer子序列对应的M个3-mer One-Hot向量,将M个3-mer One-Hot向量依次进行拼接,如可以在行方向上进行拼接,得到一个M*64的二维矩阵,如:
[[0,1,0,0,….,0];[0,0,0,1,….,0];…;[1,0,0,0,….,0]]
该二维矩阵即为RNA序列的3-mer One-Hot表示向量。通过查询也可以确定N个蛋白质3-mer子序列对应的N个3-mer One-Hot向量,将N个3-mer One-Hot向量在行方向依次进行拼接,可以得到一个N*343的二维矩阵,该二维矩阵即为蛋白质序列的3-mer One-Hot表示向量。可以理解的是,也可以将M个3-mer One-Hot向量或N个3-mer One-Hot向量进行列拼接,还可以进行直接(即尾部)拼接得到序列的3-mer One-Hot向量,本公开对此不做具体限定。
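A minimal sketch of the row-wise One-Hot representation for an RNA sequence follows. The ordering of the 64 3-mers (and hence which position holds the 1) is an assumption, as are the helper names; plain Python lists are used in place of an array library.

```python
from itertools import product

# Index every RNA 3-mer; the enumeration order is an arbitrary choice.
RNA_3MERS = ["".join(p) for p in product("AUGC", repeat=3)]
INDEX = {kmer: i for i, kmer in enumerate(RNA_3MERS)}

def split_non_overlapping(seq, k=3):
    """Divide a sequence into consecutive, non-overlapping k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def one_hot_matrix(seq, k=3):
    """Stack one 64-dimensional One-Hot row per 3-mer, giving an M x 64 matrix."""
    rows = []
    for kmer in split_non_overlapping(seq, k):
        row = [0] * len(INDEX)
        row[INDEX[kmer]] = 1
        rows.append(row)
    return rows

matrix = one_hot_matrix("AUCUGAAAU")
print(split_non_overlapping("AUCUGAAAU"))  # ['AUC', 'UGA', 'AAU']
print(len(matrix), len(matrix[0]))         # 3 64
```

The protein side works the same way with the 343 class-encoded 3-mers, giving an N x 343 matrix.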
本公开示例实施方式中,通过将待预测的RNA-蛋白质对中的RNA序列和蛋白质序列向量化,可以将RNA序列表示向量和蛋白质序列表示向量作为深度学习模型的输入,以进一步发现数据中出现次数很少或者新的特征组合,进而揭示隐式特征之间的相互作用。
在步骤S240中,基于所述待预测的RNA-蛋白质对的序列特征、待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量,使用多个相互作用预测模型分别得到所述待预测的RNA-蛋白质对的多个相互作用预测值。
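In an example embodiment, the multiple interaction prediction models are obtained through joint training. The multiple interaction prediction models may all be traditional machine learning models, may all be deep learning models, or may include at least one traditional machine learning model together with at least one deep learning model. A traditional machine learning model does not operate directly on natural data in its raw form: building a pattern recognition or machine learning system requires expert knowledge to extract features from the raw data (such as the pixel values of an image) and convert them into a suitable feature representation. Exemplarily, traditional machine learning models may include linear regression models, logistic regression models, support vector machine models, decision tree models, K-nearest neighbor (KNN) models, random forest models, naive Bayes models, and the like. A deep learning model, by contrast, can extract features automatically; it can form a complex computational model from multiple processing layers and thereby automatically learn representations of the data at multiple levels of abstraction, which is a form of feature representation learning. Exemplarily, deep learning models may include convolutional neural network models, recurrent neural network models, and the like. When the multiple interaction prediction models are all traditional machine learning models, the RNA-protein pair to be predicted need not be vectorized, and only the sequence feature obtained by feature extraction is used as the input of each traditional machine learning model. When the multiple interaction prediction models are all deep learning models, feature extraction need not be performed on the RNA-protein pair to be predicted, and only the RNA sequence representation vector and the protein sequence representation vector obtained by vectorization are used as the input of each deep learning model.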
得到待预测RNA-蛋白质对的序列特征、待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量后,可以将RNA-蛋白质对的序列特征输入到至少一个第一相互作用预测模型中,得到至少一个第一相互作用预测值。同时,可以将待预测的RNA-蛋白质 对中的RNA序列表示向量和蛋白质序列表示向量输入到至少一个第二相互作用预测模型中,得到至少一个第二相互作用预测值。需要说明的是,第一相互作用预测模型和至少一个第二相互作用预测模型是基于联合学习的方式得到的。其中,联合学习是指将多个子模型结合成一个模型以完成最终的目标任务。类似的,在本公开示例实施方式中,可以将至少一个第一相互作用预测模型和至少一个第二相互作用预测模型相结合,并通过融合各个模型的输出得到最终的预测结果。而且,在模型训练过程中可以同时考虑各个模型以及加权求和的结果,并同时优化各个模型的模型参数,以得到最佳整体模型,从而可以提高整体模型的预测能力。
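In an example embodiment, the first interaction prediction model may be a traditional machine learning model and the second interaction prediction model may be a deep learning model. Correspondingly, the sequence feature of the RNA-protein pair to be predicted may be input into at least one traditional machine learning model to obtain at least one first interaction prediction value, and the RNA sequence representation vector and the protein sequence representation vector of the pair may be input into at least one deep learning model to obtain at least one second interaction prediction value. It should be noted that each deep learning model may in turn include at least two sub deep learning models, used to process the RNA sequence representation vector and the protein sequence representation vector respectively. For example, the RNA sequence representation vector may be input into a first sub deep learning model to obtain a first sequence feature, while the protein sequence representation vector is input into a second sub deep learning model to obtain a second sequence feature. Finally, the first sequence feature and the second sequence feature may be fused by the fully connected layer of each deep learning model, and the second interaction prediction value is obtained from the fused feature.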
本公开示例实施方式中,传统机器学习模型可以包括LR(Logistic Regression,逻辑回归)模型、SVM模型和决策树模型中的至少一种,深度学习模型可以包括CNN模型和循环神经网络模型中的至少一种,其中,循环神经网络模型可以是LSTM模型、BiLSTM(Bi-directional LSTM,双向长短记忆网络)模型。
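The following takes joint learning of an LR model, a CNN model, and a BiLSTM model to predict the interaction of an RNA-protein pair as an example. Exemplarily, a 930-dimensional original sequence feature set may be formed from the 560 kinds of k-mer subsequences and the 370 frequent itemset features; feature calculation against the original sequence feature set is performed using the k-mer subsequences of the RNA-protein pair to be predicted, yielding a 930-dimensional feature value vector, which is used as the input of the LR model to obtain the interaction prediction value $y_1$. Meanwhile, the RNA-protein pair to be predicted may also be vectorized to obtain the k-mer One-Hot vectors of the RNA sequence and of the protein sequence, which are used as the inputs of the CNN model and the BiLSTM model to obtain the interaction prediction values $y_2$ and $y_3$, respectively.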
在本公开示例实施方式中,首先,LR模型具有良好的记忆能力,可以从数据中学习序列或者特征之间的相关性。CNN模型和BiLSTM模型具有较好的泛化能力,可以发现数据中出现次数很少或者新的特征组合,进而揭示隐式特征之间的相互作用。其次,CNN模型可以更好地捕捉特征但忽略了特征所在的位置信息,而BiLSTM模型具有更好的记忆能力,可以利用数据的序列信息与位置信息,弥补CNN模型在记忆能力上的缺陷。最后,LR模型简单,且可解释性好,将CNN模型、BiLSTM模型与LR模型进行联合学习,可以增强 整体模型对RPI进行预测的可解释性。总的来说,本公开可以将各个相互作用预测模型的特性进行有效结合,从而可以提高整体模型的预测能力。
在步骤S250中,根据所述多个相互作用预测值确定所述RNA和蛋白质之间的相互作用。
得到多个相互作用预测值后,可以对多个相互作用预测值进行加权求和计算,并根据计算结果确定RNA和蛋白质之间的相互作用。若该计算结果大于预设的相互作用预测值阈值时,可以确定RNA和蛋白质之间有相互作用。若该计算结果小于或等于预设的相互作用预测值阈值时,可以确定RNA和蛋白质之间没有相互作用。通过融合各个相互作用预测模型的输出值得到最终的预测结果,可以更加准确的确定待预测的RNA-蛋白质对中RNA和蛋白质之间的相互作用。
示例性的,对RNA-蛋白质对的相互作用进行预测时,可以根据:
$$y_{out} = \alpha y_1 + \beta y_2 + \gamma y_3$$
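the interaction prediction value $y_{out}$ of the RNA-protein pair to be predicted is calculated. Here, $y_1$ is the output value of the logistic regression model, $y_2$ is the output value of the convolutional neural network model, $y_3$ is the output value of the recurrent neural network model, and $\alpha$, $\beta$, and $\gamma$ are the weight parameters of the logistic regression model, the convolutional neural network model, and the recurrent neural network model respectively; $y_{out}$ may be any value between 0 and 1. Exemplarily, the interaction prediction value threshold may be preset to 0.5. When $y_{out} > 0.5$, the prediction result may be marked as 1, indicating that the RNA-protein pair interacts. When $y_{out} \le 0.5$, the prediction result may be marked as 0, indicating that the RNA-protein pair does not interact.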
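The weighted fusion and thresholding can be sketched in a few lines. The weight values below are invented for illustration (in the disclosure they are learned), and choosing non-negative weights that sum to 1 is an assumption that keeps the fused value between 0 and 1.

```python
def fuse(y1, y2, y3, alpha, beta, gamma, threshold=0.5):
    """Weighted sum of three model outputs, then a strict greater-than threshold."""
    y_out = alpha * y1 + beta * y2 + gamma * y3
    label = 1 if y_out > threshold else 0
    return y_out, label

# Illustrative outputs of the LR, CNN, and BiLSTM models with made-up weights.
y_out, label = fuse(0.9, 0.6, 0.7, alpha=0.2, beta=0.4, gamma=0.4)
print(round(y_out, 2), label)  # 0.7 1
```

A fused value exactly equal to the threshold is labelled 0, matching the "less than or equal to" case in the text.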
例如,上述权重参数α、β、γ可以基于联合学习训练得到。
本公开示例实施方式中,参考图6所示,可以根据步骤S610至步骤S650对多个相互作用预测模型和相关参数预先进行联合训练,以实现各个预测模型中所有模型参数的优化,可以根据训练得到的最终模型对相互作用未知的RNA-蛋白质对进行预测。
步骤S610.获取训练数据集,所述训练数据集中包括正例RNA-蛋白质对和负例RNA-蛋白质对。
可以将原始数据集中的所有RNA-蛋白质对作为训练数据集,也可以将原始数据集中的部分RNA-蛋白质对作为训练数据集。以RPI1807数据集为例,该数据集中共有3243个RNA-蛋白质对,具体包含1807对正例和1436对负例。示例性的,可以选取1200个正例和1000个负例作为训练数据集。可以理解的是,训练数据集中RNA-蛋白质对的数目仅仅是示意性的,可以获取任意数目的RNA-蛋白质对对每个相互作用预测模型进行多次训练,以提高每个相互作用预测模型的性能。需要说明的是,可以对正例RNA-蛋白质对进行标记,得到的标签值为“1”,即表示RNA-蛋白质对有相互作用。可以对负例RNA-蛋白质对进行标记,得到的标签值为“0”,即表示RNA-蛋白质对没有相互作用。
步骤S620.使用所述多个相互作用预测模型分别得到所述训练数据集中每个RNA-蛋白质对的多个相互作用预测值。
例如,多个相互作用预测模型可以分别为LR模型、CNN模型和BiLSTM模型。类似 的,对训练数据集中的每个RNA-蛋白质对进行特征提取,可以将提取到的序列特征依次输入LR模型中。对每个RNA-蛋白质对进行向量化处理,可以将得到RNA序列表示向量和蛋白质序列表示向量输入CNN模型和BiLSTM模型中。示例性的,以利用第i个RNA-蛋白质对对各个预测模型进行训练为例。可以将该RNA-蛋白质对的序列特征输入LR模型中,并输出预测值y 1。可以将该RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量分别输入两个CNN子模型中,并将两个CNN子模型的输出进行拼接,最终输出预测值y 2。类似的,可以将该RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入BiLSTM模型中,并输出预测值y 3。可以理解的是,对于训练数据集中的每个RNA-蛋白质对,通过这三个相互作用预测模型均可以得到三个相互作用预测值。
步骤S630.根据所述多个相互作用预测值得到所述训练数据集中每个RNA-蛋白质对的联合预测值。
得到训练数据集中的每个RNA-蛋白质对的多个相互作用预测值后,可以将多个相互作用预测值进行加权求和得到每个RNA-蛋白质对的联合预测值。仍以训练数据集中的第i个RNA-蛋白质对为例,可以根据:
$$y_{out} = \alpha y_1 + \beta y_2 + \gamma y_3$$
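the joint prediction value $y_{out}$ of the RNA-protein pair is calculated, where $y_1$ is the output value of the LR model, $y_2$ is the output value of the CNN model, $y_3$ is the output value of the BiLSTM model, and $\alpha$, $\beta$, and $\gamma$ are the weight parameters of the LR model, the CNN model, and the BiLSTM model respectively.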
步骤S640.利用损失函数对所述训练数据集中每个RNA-蛋白质对的联合预测值和标签值进行计算,得到对应的损失值。
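Each RNA-protein pair in the training dataset has a label value; for example, the label value of each positive pair is 1 and that of each negative pair is 0. Exemplarily, if the i-th RNA-protein pair is a positive example, its label value is 1. The loss function may be evaluated on the joint prediction value $y_{out}$ of this RNA-protein pair and the label value 1 to obtain the corresponding loss value. During training of the prediction models, the joint prediction value should be brought as close as possible to the label value, that is, the objective function is minimized. In one example, the cross-entropy loss function may be chosen as the objective function to be minimized. When computing the cross-entropy loss, if the label value is 1, the closer $y_{out}$ is to 1 the smaller the loss value, and the closer $y_{out}$ is to 0 the larger the loss value. If the label value is 0, the closer $y_{out}$ is to 0 the smaller the loss value, and the closer $y_{out}$ is to 1 the larger the loss value. It can be understood that the cross-entropy loss function is a performance function of the prediction model that measures the degree of inconsistency between the model's predicted values and the label values. The smaller the computed cross-entropy loss, the better the model's predictions.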
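The behaviour of the cross-entropy loss described above can be checked numerically. The sketch uses the standard binary cross-entropy form; the clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(y_out, label, eps=1e-12):
    """Binary cross-entropy between a prediction in [0, 1] and a 0/1 label."""
    y = min(max(y_out, eps), 1 - eps)  # clip to keep the logarithms finite
    return -(label * math.log(y) + (1 - label) * math.log(1 - y))

for y_out in (0.9, 0.5, 0.1):
    print(y_out, binary_cross_entropy(y_out, 1), binary_cross_entropy(y_out, 0))
```

With label 1 the loss shrinks as the prediction approaches 1, and with label 0 it shrinks as the prediction approaches 0.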
步骤S650.根据所述损失值调整所述多个相互作用预测模型的模型参数。
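The model parameters of the multiple interaction prediction models may be iteratively updated based on the computed loss value; when an iteration termination condition is met, training of the model parameters of the multiple interaction prediction models and of the weight parameters is complete. Exemplarily, the parameters may be updated using the stochastic gradient descent algorithm. Following the backpropagation principle, the objective function, such as the cross-entropy loss function, is computed repeatedly, and the model parameters of every interaction prediction model and the weight parameters are updated simultaneously according to the computed loss value. When the objective function converges to its minimum, training of all model parameters is complete. Alternatively, the model parameters may be updated iteratively by backpropagation until a preset number of iterations is reached. Exemplarily, the preset number of iterations may be 20; the model parameters of each interaction prediction model are updated continually over the 20 backpropagation iterations, after which the optimized model parameters are obtained. In other examples, the objective function may also be minimized using alternating least squares, the Adam optimization algorithm, or the like, updating the model parameters and the weight parameters in order from back to front so as to optimize them. It should be noted that the weight parameters $\alpha$, $\beta$, and $\gamma$ are also part of the model parameters, and they too are trained and continually optimized during joint training of the multiple interaction prediction models.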
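To make the idea of joint training concrete, the toy sketch below trains two one-parameter logistic sub-models and their fusion weights against a single cross-entropy loss on the fused output, using full-batch gradient descent for 20 iterations. Everything here is a deliberately simplified assumption: the sub-models stand in for the LR, CNN, and BiLSTM models, the data are invented, and the gradients are written out by hand rather than obtained from a deep learning framework.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def joint_train(samples, iterations=20, lr=0.1):
    """samples: list of (x, z, label); x feeds sub-model 1, z feeds sub-model 2."""
    w1, w2, a, b = 0.0, 0.0, 0.5, 0.5  # sub-model parameters and fusion weights
    history = []
    n = len(samples)
    for _ in range(iterations):
        g_w1 = g_w2 = g_a = g_b = loss = 0.0
        for x, z, t in samples:
            y1, y2 = sigmoid(w1 * x), sigmoid(w2 * z)
            y_out = min(max(a * y1 + b * y2, 1e-7), 1 - 1e-7)  # fused prediction
            loss += -(t * math.log(y_out) + (1 - t) * math.log(1 - y_out))
            g = (y_out - t) / (y_out * (1 - y_out))  # d(loss)/d(y_out)
            g_a += g * y1                            # gradients of the fusion weights
            g_b += g * y2
            g_w1 += g * a * y1 * (1 - y1) * x        # gradients flow back into each sub-model
            g_w2 += g * b * y2 * (1 - y2) * z
        history.append(loss / n)
        # One shared loss updates every parameter in the same step.
        w1 -= lr * g_w1 / n
        w2 -= lr * g_w2 / n
        a -= lr * g_a / n
        b -= lr * g_b / n
    return (w1, w2, a, b), history

# Invented toy samples: positive inputs are labelled 1, negative inputs 0.
SAMPLES = [(1.0, 1.0, 1), (-1.0, -1.0, 0), (2.0, 0.5, 1), (-0.5, -2.0, 0)]
params, history = joint_train(SAMPLES)
print(history[0], history[-1])  # the mean loss at the start and at the last iteration
```

The point of the sketch is that the fusion weights and the sub-model parameters receive gradients from the same loss, rather than each model being trained separately and combined afterwards.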
本公开示例实施方式中,利用多个相互作用预测模型对未知RNA-蛋白质对的相互作用进行预测时,通过将逻辑回归模型与深度学习模型相结合,可以利用逻辑回归模型从原始数据集中学习各个k-mer特征和/或各个频繁项集特征之间的相关性,又可以利用深度学习模型揭示隐式特征之间的相互作用,并结合了CNN模型与BiLSTM模型各自的特性,在多个模型进行联合训练的过程中同时优化各个模型的模型参数,从而提高了整体模型对未知RNA-蛋白质对的相互作用进行预测的准确性。
另一示例实施方式中,对多个相互作用预测模型预先进行联合训练时,可以先将原始数据集按比例分为训练数据集、验证数据集和测试数据集。示例性的,可以以原始数据集是RPI1807数据集为例进行说明。该数据集中共有3243个RNA-蛋白质对,具体包含1807对正例和1436对负例。示例性的,可以按照7:2:1的比例将该数据集划分成训练数据集、验证数据集和测试数据集。其中,每个数据集中正例和负例的比例可以与整体数据集的分布保持一致,即比例为1807:1436,约为1.25:1。举例而言,可以选取1250个正例和1000个负例作为训练数据集,选取360个正例和280个负例作为验证数据集,选取180个正例和140个负例作为测试数据集。
可以将训练数据集输入多个相互作用预测模型中,利用反向传播算法确定对应的模型参数,得到第一联合模型。示例性的,可以将训练数据集输入各个相互作用预测模型中,对模型参数进行调整。例如,模型参数可以是权重参数、偏置参数和截距参数等。示例性的,可以利用随机梯度下降算法对各个模型参数进行更新。根据反向传播原理,不断计算目标函数,并根据目标函数更新模型参数。当目标函数收敛至最小值时,完成对模型参数的训练,从而得到第一联合模型。
可以将验证数据集输入第一联合模型中,以对第一联合模型的性能进行验证,并根据验证结果得到第二联合模型。示例性的,对模型参数进行优化时,可以先初始化一组超参数,利用训练数据集对多个相互作用预测模型进行不断训练,得到第一联合模型。其中,超参数可以是学习率、CNN层数、卷积核尺寸等。然后,可以将验证数据集输入到训练好的第一联合模型中,以对第一联合模型的预测精度进行验证。当预测精度达到预设精度阈值时,可以将当前的第一联合模型作为第二联合模型,即得到最终训练模型。最后,可以在该训练模型上使用测试数据集对模型的最终性能进行测试。可以理解的是,若根据测试数据集得出第二联合模型的预测效果较差,则可以重新设定一组超参数,利用训练数据集、 验证数据集再次对相互作用预测模型依次进行训练和验证,当训练好的相互作用预测模型在验证数据集上得到的预测精度达到预设精度阈值时,可以利用新的测试数据集对该预测模型的最终性能进行测试。
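Exemplarily, the accuracy of the second joint model may be determined on the test dataset, and a third joint model is obtained when the accuracy exceeds a preset threshold; the third joint model includes the trained multiple interaction prediction models. For example, after the second joint model is obtained, each RNA-protein pair in the test dataset may be input into the second joint model to evaluate its accuracy. If the accuracy of the model is greater than the preset accuracy threshold, the third joint model is obtained. It can be understood that the third joint model includes the trained multiple interaction prediction models, which can then be used to predict the interactions of unknown RNA-protein pairs. The Matthews correlation coefficient of the second joint model may also be evaluated on the test dataset. The Matthews correlation coefficient is the correlation coefficient between the actual classification and the predicted classification; it takes values in [-1, 1], where a larger value indicates stronger agreement between the predicted and true values, and a value of 1 indicates that the predictions are completely correct. If the Matthews correlation coefficient of the model is greater than a preset threshold, the third joint model is obtained. The specificity, recall, and the like of the second joint model may also be evaluated on the test dataset; the present disclosure does not specifically limit this. For example, if the accuracy of the second joint model does not exceed the preset accuracy threshold, a new training dataset may be obtained to train the model parameters of each interaction prediction model again, so as to keep improving model performance. In example embodiments of the present disclosure, the parameters may be tuned using the training dataset and the validation dataset to train the best prediction model, whose generalization performance may then be tested using the test dataset.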
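Accuracy and the Matthews correlation coefficient can both be computed from the four confusion-matrix counts. The sketch below uses the standard definitions; the function names and the convention of returning 0 when the denominator vanishes are choices made here for illustration.

```python
import math

def confusion(labels, preds):
    """Return (tp, tn, fp, fn) for 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(labels, preds):
    tp, tn, fp, fn = confusion(labels, preds)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(labels, preds):
    tp, tn, fp, fn = confusion(labels, preds)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

labels = [1, 1, 0, 0]
print(accuracy(labels, [1, 1, 0, 0]), mcc(labels, [1, 1, 0, 0]))  # 1.0 1.0
print(accuracy(labels, [0, 0, 1, 1]), mcc(labels, [0, 0, 1, 1]))  # 0.0 -1.0
```

Fully inverted predictions give a coefficient of -1, and predictions unrelated to the labels give a value near 0.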
在以上训练过程中,可以对LR模型、CNN模型和BiLSTM模型三者中的模型参数同时进行训练。示例性的,可以基于目标函数,根据训练数据集中每个RNA-蛋白质对的联合预测值和标签值计算得到的损失值同时调整三个相互作用预测模型的模型参数,通过多次反向传播,最终可以使三个相互作用预测模型的模型参数均趋于收敛,或者满足一定迭代次数后终止训练。通过这样的训练方式,可以同时对LR模型、CNN模型和BiLSTM模型三个相互作用预测模型进行训练,进而实现三个相互作用预测模型的联合学习。同时,不仅可以保证各相互作用预测模型在预测RNA-蛋白质对相互作用时的精度和准确度更高,还可以提升各相互作用预测模型的训练效率。
各个模型训练完成后,可以利用最终得到的各个相互作用预测模型对未知RNA-蛋白质对之间的相互作用进行预测,并将该RNA-蛋白质对相互作用的预测结果输出至终端设备以供用户查看。
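In an example embodiment, the same training dataset, validation dataset, and test dataset may be used to train and test the standalone LR model, CNN model, and BiLSTM model as well as the model obtained by jointly learning the three. For example, accuracy and the Matthews correlation coefficient may be used as two performance metrics to evaluate each prediction model. When the RPI1807 dataset is used for model training and testing, the accuracy and the Matthews correlation coefficient of the joint learning model are both better than those of the LR model, the CNN model, and the BiLSTM model; that is, the joint learning model outperforms each single model and thus has better prediction ability.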
本公开示例实施方式中,也可以获取至少一个RNA序列,并通过多个相互作用预测模 型在数据库中查找与输入的每个RNA序列之间有相互作用的蛋白质序列。具体地,获取原始数据集后,可以参考图6利用原始数据集对多个相互作用预测模型预先进行联合训练。训练完成后,可以将参与联合训练的所有蛋白质序列存储至数据库中。需要说明的是,数据库中也可以包括其他未参与联合训练的蛋白质序列,也即数据库中的蛋白质序列的种类数目可以是任意的,数据库中还可以包括任意数目的RNA序列,如可以包括但不限于参与联合训练的所有RNA序列,本公开中对此不做具体限定。
当用户输入至少一个RNA序列后,可以将输入的每个RNA序列与数据库中的所有蛋白质序列组合成若干个RNA-蛋白质对。进一步的,可以根据步骤S220至步骤S250将多个相互作用模型进行联合学习,以对每个RNA-蛋白质对的相互作用进行预测。具体地,可以对每个RNA-蛋白质对进行特征提取和向量化处理,将得到的序列特征、RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入多个相互作用预测模型中,并得到每个RNA-蛋白质对的相互作用预测值。相互作用预测值为1表示该RNA-蛋白质对有相互作用,相互作用预测值为0表示RNA-蛋白质对没有相互作用。然后,可以将所有相互作用预测值为1的RNA-蛋白质对筛选出来,并将每个RNA-蛋白质对中的蛋白质序列输出至终端设备,以供用户查看与输入RNA序列有相互作用的蛋白质序列。
类似的,本公开示例实施方式中,还可以获取至少一个蛋白质序列,并通过多个相互作用预测模型在数据库中查找与输入的每个蛋白质序列之间有相互作用的RNA序列。示例性的,当用户输入至少一个蛋白质序列后,可以将输入的每个蛋白质序列与数据库中的所有RNA序列组合成若干个RNA-蛋白质对。进一步的,可以根据步骤S220至步骤S250将多个相互作用模型进行联合学习,以对每个RNA-蛋白质对的相互作用进行预测。具体地,可以对每个RNA-蛋白质对进行特征提取和向量化处理,将得到的序列特征、RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入多个相互作用预测模型中,并得到每个RNA-蛋白质对的相互作用预测值。相互作用预测值为1表示该RNA-蛋白质对有相互作用,相互作用预测值为0表示RNA-蛋白质对没有相互作用。然后,可以将所有相互作用预测值为1的RNA-蛋白质对筛选出来,并将每个RNA-蛋白质对中的RNA序列输出至终端设备,以供用户查看与输入蛋白质序列有相互作用的RNA序列。
在本公开示例实施方式所提供的RNA-蛋白质相互作用预测方法中,获取待预测的RNA-蛋白质对;对所述待预测的RNA-蛋白质对进行特征提取,得到所述待预测的RNA-蛋白质对的序列特征;向量化所述待预测的RNA-蛋白质对,得到所述待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量;基于所述待预测的RNA-蛋白质对的序列特征、待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量,使用多个相互作用预测模型分别得到所述待预测的RNA-蛋白质对的多个相互作用预测值;根据所述多个相互作用预测值确定所述RNA和蛋白质之间的相互作用。一方面,通过对RNA-蛋白质对进行特征提取和向量化处理,可以充分挖掘RNA序列和蛋白质序列之间的联系,以便于准确地预测RNA和蛋白质的相互作用;另一方面,将多种相互作用预测模型的特性 进行有效结合,可以进一步提高预测RNA与蛋白质相互作用的准确性。
应当注意,尽管在附图中以特定顺序描述了本公开中方法的各个步骤,但是,这并非要求或者暗示必须按照该特定顺序来执行这些步骤,或是必须执行全部所示的步骤才能实现期望的结果。附加的或备选的,可以省略某些步骤,将多个步骤合并为一个步骤执行,以及/或者将一个步骤分解为多个步骤执行等。
进一步的,本示例实施方式中,还提供了一种RNA-蛋白质相互作用预测装置。该装置可以应用于一服务器或终端设备。参考图7所示,该RNA-蛋白质相互作用预测装置700可以包括数据获取模块710、特征提取模块720、数据向量化模块730、相互作用预测模块740以及相互作用确定模块750,其中:
数据获取模块710,用于获取待预测的RNA-蛋白质对;
特征提取模块720,用于对所述待预测的RNA-蛋白质对进行特征提取,得到所述待预测的RNA-蛋白质对的序列特征;
数据向量化模块730,用于向量化所述待预测的RNA-蛋白质对,得到所述待预测RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量;
相互作用预测模块740,用于基于所述待预测的RNA-蛋白质对的序列特征、所述待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量,使用多个相互作用预测模型分别得到所述待预测的RNA-蛋白质对的多个相互作用预测值;
相互作用确定模块750,用于根据所述多个相互作用预测值确定所述RNA和蛋白质之间的相互作用。
在一种可选的实施方式中,特征提取模块720包括:
特征集获取模块,用于获取原始序列特征集;
特征确定模块,用于根据所述原始序列特征集确定所述待预测的RNA-蛋白质对的序列特征。
在一种可选的实施方式中,特征确定模块包括:
序列转化单元,用于将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列;
第一序列查找单元,用于在所述原始序列特征集中查找每种k-mer子序列,并根据查找结果得到所述待预测的RNA-蛋白质对的序列特征。
在一种可选的实施方式中,特征确定模块包括:
序列转化单元,用于将所述待预测的RNA-蛋白质对中RNA序列和蛋白质序列分别转化为k-mer子序列,所述k-mer子序列包括RNA k-mer子序列和蛋白质k-mer子序列;
序列组合单元,用于将所述RNA k-mer子序列和蛋白质k-mer子序列进行组合,得到多种RNA-蛋白质k-mer子序列对;
第二序列查找单元,用于在所述原始序列特征集中查找每种RNA-蛋白质k-mer子序列对,并根据查找结果得到所述待预测的RNA-蛋白质对的序列特征。
在一种可选的实施方式中,特征确定模块包括:
序列转化单元,用于将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,所述k-mer子序列包括RNA k-mer子序列和蛋白质k-mer子序列;
第一序列查找单元,用于在所述原始序列特征集中查找每种k-mer子序列,得到第一序列特征;
序列组合单元,用于将所述RNA k-mer子序列和蛋白质k-mer子序列进行组合,得到多种RNA-蛋白质k-mer子序列对;
第二序列查找单元,用于在所述原始序列特征集中查找每种RNA-蛋白质k-mer子序列对,得到第二序列特征;
特征拼接单元,用于由所述第一序列特征和第二序列特征组成所述待预测的RNA-蛋白质对的序列特征。
在一种可选的实施方式中,特征集获取模块包括:
数据集获取模块,用于获取原始数据集;
特征提取模块,用于对所述原始数据集中的每个RNA-蛋白质对进行特征提取,得到所述原始序列特征集。
在一种可选的实施方式中,特征提取模块包括:
序列生成单元,用于对RNA和蛋白质的基本单元分别进行排列组合得到k-mer子序列;
方差计算单元,用于统计每种k-mer子序列在所述原始数据集中的出现频率,并根据所述出现频率计算所述每种k-mer子序列的方差;
数据集确定单元,用于根据所述每种k-mer序列的方差大小确定所述原始序列特征集。
在一种可选的实施方式中,方差计算单元包括:
频次统计子单元,用于统计所述每种k-mer子序列在所述原始数据集中的出现频次;
频率计算子单元,用于根据所述出现频次计算得到所述每种k-mer子序列在所述原始数据集中的出现频率;
序列标记子单元,用于遍历所述原始数据集,对所述每种k-mer子序列是否出现在所述每个RNA-蛋白质对中进行标记;
方差计算子单元,用于根据所述每种k-mer子序列在所述原始数据集中的出现频率和在每个RNA-蛋白质对中的标记值计算所述每种k-mer子序列的方差。
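In an optional embodiment, the variance calculation subunit is configured to calculate the variance $Var_i$ of the i-th kind of k-mer subsequence according to:
$$Var_i = \frac{1}{N}\sum_{n=1}^{N}\left(f_n^i - Freq_i\right)^2$$
where $f_n^i$ is the marker value of the i-th kind of k-mer subsequence in the n-th RNA-protein pair, $Freq_i$ is the occurrence frequency of the i-th kind of k-mer subsequence in the original dataset, and N is the total number of RNA-protein pairs in the original dataset.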
在一种可选的实施方式中,数据集确定单元被配置为用于根据所述每种k-mer子序列 的方差大小确定满足预设条件的k-mer子序列,并由所述满足预设条件的k-mer子序列组成所述原始序列特征集。
在一种可选的实施方式中,特征提取模块还包括:
第一项集生成单元,用于将所述每个RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,由所述k-mer子序列组成第一候选项集,所述k-mer子序列包括RNA k-mer子序列和蛋白质k-mer子序列;
频繁项集生成单元,用于统计所述第一候选项集中每种k-mer子序列在所述原始数据集中的出现频率,由满足预设出现频率阈值的k-mer子序列组成频繁项集;
第二项集生成单元,用于将所述频繁项集中的RNA k-mer子序列和蛋白质k-mer子序列进行交叉组合,由组合得到的k-mer子序列对组成第二候选项集;
支持度确定单元,用于统计所述第二候选项集中每种k-mer子序列对在所述原始数据集中的出现频率,得到所述每种k-mer子序列对的支持度;
特征集获取单元,用于由所述支持度满足预设条件的k-mer子序列对组成所述原始序列特征集。
在一种可选的实施方式中,数据向量化模块730包括:
序列转化单元,用于将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,所述k-mer子序列包括M个RNA k-mer子序列和N个蛋白质k-mer子序列;
第一向量化单元,用于将每个RNA k-mer子序列向量化,得到M个RNA k-mer向量;
第一拼接单元,用于拼接所述M个RNA k-mer向量得到所述RNA序列表示向量;
第二向量化单元,用于将每个蛋白质k-mer序列向量化,得到N个蛋白质k-mer向量;
第二拼接单元,用于拼接所述N个蛋白质k-mer向量得到所述蛋白质序列表示向量。
在一种可选的实施方式中,相互作用预测模块740包括:
第一预测单元,用于将所述待预测的RNA-蛋白质对的序列特征输入到至少一个第一相互作用预测模型中,得到至少一个第一相互作用预测值;
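a second prediction unit, configured to input the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one second interaction prediction model to obtain at least one second interaction prediction value.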
在一种可选的实施方式中,相互作用预测模块740包括:
第三预测单元,用于将所述待预测的RNA-蛋白质对的序列特征输入到至少一个传统机器学习模型中,得到至少一个第一相互作用预测值;
第四预测单元,用于将所述待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入到至少一个深度学习模型中,得到至少一个第二相互作用预测值。
在一种可选的实施方式中,所述每个深度学习模型至少包括两个子深度学习模型;第四预测单元包括:
第一特征生成单元,用于将所述待预测的RNA-蛋白质对中的RNA序列表示向量输入第一子深度学习模型中,得到第一序列特征;
第二特征生成单元,用于将所述待预测的RNA-蛋白质对中的蛋白质序列表示向量输入第二子深度学习模型中,得到第二序列特征;
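a feature fusion unit, configured to fuse the first sequence feature and the second sequence feature, and to obtain the second interaction prediction value from the fused feature.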
在一种可选的实施方式中,所述传统机器学习模型包括逻辑回归模型、支持向量机模型和决策树模型中的至少一种,深度学习模型包括卷积神经网络模型和循环神经网络模型中的至少一种。
在一种可选的实施方式中,相互作用确定模块750包括:
加权计算单元,用于对所述多个相互作用预测值进行加权求和计算;
相互作用确定单元,用于根据计算结果确定所述RNA和蛋白质之间的相互作用。
在一种可选的实施方式中,相互作用确定单元包括:
第一相互作用确定子单元,用于若所述计算结果大于预设的相互作用预测值阈值,确定所述RNA和蛋白质之间有相互作用;
第二相互作用确定子单元,用于若所述计算结果小于或等于预设的相互作用预测值阈值,确定所述RNA和蛋白质之间没有相互作用。
在一种可选的实施方式中,RNA-蛋白质相互作用预测装置700还包括:
联合训练模块,用于对所述多个相互作用预测模型进行联合训练。
在一种可选的实施方式中,联合训练模块包括:
训练数据获取单元,用于获取训练数据集,所述训练数据集中包括正例RNA-蛋白质对和负例RNA-蛋白质对;
第一预测值输出单元,用于使用所述多个相互作用预测模型分别得到所述训练数据集中每个RNA-蛋白质对的多个相互作用预测值;
第二预测值输出单元,用于根据所述多个相互作用预测值得到所述训练数据集中每个RNA-蛋白质对的联合预测值;
损失值计算单元,用于利用损失函数对所述训练数据集中每个RNA-蛋白质对的联合预测值和标签值进行计算,得到对应的损失值;
模型参数调整单元,用于根据所述损失值调整所述多个相互作用预测模型的模型参数。
在一种可选的实施方式中,第二预测值输出单元包括:
第二预测值输出子单元,用于对所述训练数据集中每个RNA-蛋白质对的多个相互作用预测值进行加权求和得到每个RNA-蛋白质对的联合预测值。
在一种可选的实施方式中,第二预测值输出子单元被配置为用于:
根据:
$$y_{out} = \alpha y_1 + \beta y_2 + \gamma y_3$$
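the joint prediction value $y_{out}$ of each RNA-protein pair in the training dataset is calculated, where $y_1$ is the output value of the traditional machine learning model, $y_2$ is the output value of the convolutional neural network model, $y_3$ is the output value of the recurrent neural network model, and $\alpha$, $\beta$, and $\gamma$ are the weight parameters of the traditional machine learning model, the convolutional neural network model, and the recurrent neural network model respectively.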
在一种可选的实施方式中,模型参数调整单元被配置为用于基于所述损失值对所述多个相互作用预测模型的模型参数进行迭代更新,当满足迭代终止条件时,完成对所述多个相互作用预测模型的模型参数的训练,以使用训练好的所述多个相互作用预测模型对所述待预测的RNA-蛋白质对的相互作用进行预测。
在一种可选的实施方式中,RNA-蛋白质相互作用预测装置700还包括:
数据输出模块,用于输出所述RNA和蛋白质之间的相互作用的预测结果。
上述RNA-蛋白质相互作用预测装置中各模块的具体细节已经在对应的RNA-蛋白质相互作用预测方法中进行了详细的描述,因此此处不再赘述。
上述装置中各模块可以是通用处理器,包括:中央处理器、网络处理器等;还可以是数字信号处理器、专用集成电路、现场可编程门阵列或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。各模块也可以由软件、固件等形式来实现。上述装置中的各处理器可以是独立的处理器,也可以集成在一起。
本公开的示例性实施方式还提供了一种计算机可读存储介质,其上存储有能够实现本说明书上述方法的程序产品。在一些可能的实施方式中,本公开的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当程序产品在电子设备上运行时,程序代码用于使电子设备执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤。该程序产品可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在电子设备,例如个人电脑上运行。然而,本公开的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、 光缆、RF等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言的任意组合来编写用于执行本公开操作的程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。
本公开的示例性实施方式还提供了一种能够实现上述方法的电子设备。下面参照图8来描述根据本公开的这种示例性实施方式的电子设备800。图8显示的电子设备800仅仅是一个示例,不应对本公开实施方式的功能和使用范围带来任何限制。
如图8所示,电子设备800可以以通用计算设备的形式表现。电子设备800的组件可以包括但不限于:至少一个处理单元810、至少一个存储单元820、连接不同系统组件(包括存储单元820和处理单元810)的总线830和显示单元840。
存储单元820存储有程序代码,程序代码可以被处理单元810执行,使得处理单元810执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤。例如,处理单元810可以执行图2至图6中任意一个或多个方法步骤。
存储单元820可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)821和/或高速缓存存储单元822,还可以进一步包括只读存储单元(ROM)823。
存储单元820还可以包括具有一组(至少一个)程序模块825的程序/实用工具824,这样的程序模块825包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。
总线830可以为表示几类总线结构中的一种或多种,包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。
电子设备800也可以与一个或多个外部设备900(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得用户能与该电子设备800交互的设备通信,和/或与使得该电子设备800能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口850进行。并且,电子设备800还可以通过网络适配器860与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器860通过总线830与电子设备800的其它模块通信。应当明白,尽管图中未示出,可以结合电子设备800使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。
在一些实施例中,可以由电子设备的处理单元810执行本公开中所述的RNA-蛋白质相互作用预测方法。在一些实施例中,可以通过输入接口850输入待预测的RNA-蛋白质对/待预测的RNA序列/待预测的蛋白质序列、原始数据集、以及用于训练各个相互作用预测模型的训练数据集等。例如,通过电子设备的用户交互界面输入待预测的RNA-蛋白质对、原始数据集、以及用于训练各个相互作用预测模型的训练数据集等等。在一些实施例中,可以通过输出接口850将该待预测的RNA-蛋白质对的相互作用的预测结果输出至外部设备900以供用户查看。
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本公开实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本公开示例性实施方式的方法。
此外,上述附图仅是根据本公开示例性实施方式的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本公开的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。
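It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.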

Claims (27)

  1. 一种RNA-蛋白质相互作用预测方法,其特征在于,包括:
    获取待预测的RNA-蛋白质对;
    对所述待预测的RNA-蛋白质对进行特征提取,得到所述待预测的RNA-蛋白质对的序列特征;
    向量化所述待预测的RNA-蛋白质对,得到所述待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量;
    基于所述待预测的RNA-蛋白质对的序列特征、待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量,使用多个相互作用预测模型分别得到所述待预测的RNA-蛋白质对的多个相互作用预测值;
    根据所述多个相互作用预测值确定所述RNA和蛋白质之间的相互作用。
  2. 根据权利要求1所述的RNA-蛋白质相互作用预测方法,其特征在于,所述对所述待预测的RNA-蛋白质对进行特征提取,得到所述待预测的RNA-蛋白质对的序列特征,包括:
    获取原始序列特征集;
    根据所述原始序列特征集确定所述待预测的RNA-蛋白质对的序列特征。
  3. 根据权利要求2所述的RNA-蛋白质相互作用预测方法,其特征在于,所述根据所述原始序列特征集确定所述待预测的RNA-蛋白质对的序列特征,包括:
    将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列;
    在所述原始序列特征集中查找每种k-mer子序列,并根据查找结果得到所述待预测的RNA-蛋白质对的序列特征。
  4. 根据权利要求2所述的RNA-蛋白质相互作用预测方法,其特征在于,所述根据所述原始序列特征集确定所述待预测的RNA-蛋白质对的序列特征,包括:
    将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,所述k-mer子序列包括RNA k-mer子序列和蛋白质k-mer子序列;
    将所述RNA k-mer子序列和蛋白质k-mer子序列进行组合,得到多种RNA-蛋白质k-mer子序列对;
    在所述原始序列特征集中查找每种RNA-蛋白质k-mer子序列对,并根据查找结果得到所述待预测的RNA-蛋白质对的序列特征。
  5. 根据权利要求2所述的RNA-蛋白质相互作用预测方法,其特征在于,所述根据所述原始序列特征集确定所述待预测的RNA-蛋白质对的序列特征,包括:
    将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,所述k-mer子序列包括RNA k-mer子序列和蛋白质k-mer子序列;
    在所述原始序列特征集中查找每种k-mer子序列,得到第一序列特征;
    将所述RNA k-mer子序列和蛋白质k-mer子序列进行组合,得到多种RNA-蛋白质k-mer子序列对;
    在所述原始序列特征集中查找每种RNA-蛋白质k-mer子序列对,得到第二序列特征;
    由所述第一序列特征和第二序列特征组成所述待预测的RNA-蛋白质对的序列特征。
  6. 根据权利要求2所述的RNA-蛋白质相互作用预测方法,其特征在于,所述获取原始序列特征集,包括:
    获取原始数据集;
    对所述原始数据集中的每个RNA-蛋白质对进行特征提取,得到所述原始序列特征集。
  7. 根据权利要求6所述的RNA-蛋白质相互作用预测方法,其特征在于,所述对所述原始数据集中的每个RNA-蛋白质对进行特征提取,得到所述原始序列特征集,包括:
    对RNA和蛋白质的基本单元分别进行排列组合得到k-mer子序列;
    统计每种k-mer子序列在所述原始数据集中的出现频率,并根据所述出现频率计算所述每种k-mer子序列的方差;
    根据所述每种k-mer序列的方差大小确定所述原始序列特征集。
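  8. The RNA-protein interaction prediction method according to claim 7, characterized in that counting the occurrence frequency of each kind of k-mer subsequence in the original data set, and calculating the variance of each kind of k-mer subsequence according to the occurrence frequency, comprises:
    counting the number of occurrences of each kind of k-mer subsequence in the original data set;
    calculating, from the number of occurrences, the occurrence frequency of each kind of k-mer subsequence in the original data set;
    traversing the original data set and marking whether each kind of k-mer subsequence appears in each RNA-protein pair; and
    calculating the variance of each kind of k-mer subsequence according to its occurrence frequency in the original data set and its mark value in each RNA-protein pair.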
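  9. The RNA-protein interaction prediction method according to claim 8, characterized in that calculating the variance of each kind of k-mer subsequence according to its occurrence frequency in the original data set and its mark value in each RNA-protein pair comprises:
    according to:
    Figure PCTCN2021121089-appb-100001
    calculating the variance $\mathrm{Var}_i$ of the i-th kind of k-mer subsequence, wherein
    Figure PCTCN2021121089-appb-100002
    is the mark value of the i-th kind of k-mer subsequence in the n-th RNA-protein pair, $\mathrm{Freq}_i$ is the occurrence frequency of the i-th kind of k-mer subsequence in the original data set, and N is the total number of RNA-protein pairs in the original data set.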
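As an illustration only, the variance-based screening of claims 7 to 10 can be sketched as follows. The formula of claim 9 survives only as an image reference, so the sketch assumes the ordinary population variance of the 0/1 mark values around the occurrence frequency, with the occurrence frequency taken as the fraction of records that contain the k-mer. The "keep the `top_n` largest variances" rule and all function names are likewise hypothetical.

```python
def mark_values(kmer, sequences):
    """Mark, for every record, whether the k-mer appears in it (1) or not (0)."""
    return [1 if kmer in seq else 0 for seq in sequences]


def variance_from_marks(marks):
    """Assumed form: (1/N) * sum((mark_n - freq) ** 2), freq being the mean mark."""
    n = len(marks)
    freq = sum(marks) / n
    return sum((m - freq) ** 2 for m in marks) / n


def select_by_variance(candidates, sequences, top_n):
    """Keep the top_n candidate k-mers with the largest variance (assumed condition)."""
    scored = {k: variance_from_marks(mark_values(k, sequences)) for k in candidates}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


# Hypothetical toy data: the RNA sequences of four RNA-protein pairs.
rna_sequences = ["AUGC", "AUGG", "CCCC", "GGGG"]
selected = select_by_variance(["AUG", "CCC", "UUU"], rna_sequences, top_n=2)
print(selected)
```

Under this assumed formula a k-mer that appears in every record or in none has zero variance and carries no discriminating information, which is why ranking by variance can serve as a filter.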
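  10. The RNA-protein interaction prediction method according to claim 7, characterized in that determining the original sequence feature set according to the magnitude of the variance of each kind of k-mer subsequence comprises:
    determining, according to the magnitude of the variance of each kind of k-mer subsequence, the k-mer subsequences that satisfy a preset condition, and forming the original sequence feature set from the k-mer subsequences that satisfy the preset condition.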
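  11. The RNA-protein interaction prediction method according to claim 6, characterized in that performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set further comprises:
    converting the RNA sequence and the protein sequence in each RNA-protein pair into k-mer subsequences, respectively, and forming a first candidate itemset from the k-mer subsequences, the k-mer subsequences comprising RNA k-mer subsequences and protein k-mer subsequences;
    counting the occurrence frequency in the original data set of each kind of k-mer subsequence in the first candidate itemset, and forming a frequent itemset from the k-mer subsequences that satisfy a preset occurrence frequency threshold;
    cross-combining the RNA k-mer subsequences and the protein k-mer subsequences in the frequent itemset, and forming a second candidate itemset from the resulting k-mer subsequence pairs;
    counting the occurrence frequency in the original data set of each kind of k-mer subsequence pair in the second candidate itemset to obtain the support of each kind of k-mer subsequence pair; and
    forming the original sequence feature set from the k-mer subsequence pairs whose support satisfies a preset condition.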
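As an illustration only, the two-stage candidate filtering of claim 11 can be sketched as follows. The sketch assumes that occurrence frequency and support are both measured as the fraction of RNA-protein pairs containing the item, and that "satisfying" a threshold means being greater than or equal to it. The names `frequent_pair_features`, `min_freq`, and `min_support` are hypothetical.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return the overlapping k-mer subsequences of seq (window step 1, an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def frequent_pair_features(pairs, k_rna, k_protein, min_freq, min_support):
    """Return {(rna_kmer, protein_kmer): support} for sufficiently supported pairs."""
    n = len(pairs)
    rna_sets = [set(kmers(rna, k_rna)) for rna, _ in pairs]
    protein_sets = [set(kmers(protein, k_protein)) for _, protein in pairs]

    # First candidate itemset: single k-mers, counted once per RNA-protein pair.
    rna_counts = Counter(k for s in rna_sets for k in s)
    protein_counts = Counter(k for s in protein_sets for k in s)

    # Frequent itemset: single k-mers whose frequency reaches the threshold.
    frequent_rna = sorted(k for k, c in rna_counts.items() if c / n >= min_freq)
    frequent_protein = sorted(k for k, c in protein_counts.items() if c / n >= min_freq)

    # Second candidate itemset: cross combinations of the frequent k-mers.
    result = {}
    for rk, pk in product(frequent_rna, frequent_protein):
        hits = sum(1 for rs, ps in zip(rna_sets, protein_sets) if rk in rs and pk in ps)
        support = hits / n
        if support >= min_support:
            result[(rk, pk)] = support
    return result


# Hypothetical toy data set of three RNA-protein pairs.
toy_pairs = [("AUGC", "MKV"), ("AUGG", "MKL"), ("CCCC", "MKV")]
pair_features = frequent_pair_features(toy_pairs, 3, 2, min_freq=0.6, min_support=0.6)
print(pair_features)
```

Filtering single k-mers before forming pairs keeps the second candidate itemset small, because a pair cannot be frequent unless both of its members are.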
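  12. The RNA-protein interaction prediction method according to claim 1, characterized in that vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted comprises:
    converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences, respectively, the k-mer subsequences comprising M RNA k-mer subsequences and N protein k-mer subsequences;
    vectorizing each RNA k-mer subsequence to obtain M RNA k-mer vectors;
    concatenating the M RNA k-mer vectors to obtain the RNA sequence representation vector;
    vectorizing each protein k-mer subsequence to obtain N protein k-mer vectors; and
    concatenating the N protein k-mer vectors to obtain the protein sequence representation vector.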
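As an illustration only, the vectorize-then-concatenate step of claim 12 can be sketched as follows. The claim does not say how a single k-mer is turned into a vector, so the sketch assumes a lookup table (`embedding`, hypothetical) that maps each k-mer to a fixed-length vector, with unknown k-mers mapped to zeros.

```python
def kmers(seq, k):
    """Return the overlapping k-mer subsequences of seq (window step 1, an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sequence_vector(seq, k, embedding, dim):
    """Concatenate the vectors of all k-mers of seq into one representation vector."""
    vector = []
    for kmer in kmers(seq, k):
        # Unknown k-mers fall back to a zero vector (an assumption of this sketch).
        vector.extend(embedding.get(kmer, [0.0] * dim))
    return vector


# Hypothetical 2-dimensional embedding table for RNA 2-mers.
toy_embedding = {"AU": [1.0, 0.0], "UG": [0.0, 1.0]}
rna_vector = sequence_vector("AUGC", 2, toy_embedding, dim=2)
print(rna_vector)
```

A sequence with M k-mers yields a vector of length M times `dim`, so a downstream model with fixed input size would additionally need padding or truncation, which is not shown here.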
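  13. The RNA-protein interaction prediction method according to claim 1, characterized in that using the plurality of interaction prediction models to respectively obtain the plurality of interaction prediction values for the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted, comprises:
    inputting the sequence features of the RNA-protein pair to be predicted into at least one first interaction prediction model to obtain at least one first interaction prediction value; and
    inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one second interaction prediction model to obtain at least one second interaction prediction value.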
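  14. The RNA-protein interaction prediction method according to claim 1, characterized in that using the plurality of interaction prediction models to respectively obtain the plurality of interaction prediction values for the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted, comprises:
    inputting the sequence features of the RNA-protein pair to be predicted into at least one traditional machine learning model to obtain at least one first interaction prediction value; and
    inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value.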
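  15. The RNA-protein interaction prediction method according to claim 14, characterized in that each deep learning model comprises at least two sub deep learning models, and inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into the at least one deep learning model to obtain the at least one second interaction prediction value comprises:
    inputting the RNA sequence representation vector of the RNA-protein pair to be predicted into a first sub deep learning model to obtain a first sequence feature;
    inputting the protein sequence representation vector of the RNA-protein pair to be predicted into a second sub deep learning model to obtain a second sequence feature; and
    fusing the first sequence feature and the second sequence feature, and obtaining the second interaction prediction value according to the fused feature.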
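  16. The RNA-protein interaction prediction method according to claim 14, characterized in that the traditional machine learning model comprises at least one of a logistic regression model, a support vector machine model, and a decision tree model, and the deep learning model comprises at least one of a convolutional neural network model and a recurrent neural network model.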
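  17. The RNA-protein interaction prediction method according to claim 1, characterized in that determining the interaction between the RNA and the protein according to the plurality of interaction prediction values comprises:
    performing a weighted summation of the plurality of interaction prediction values; and
    determining the interaction between the RNA and the protein according to the calculation result.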
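  18. The RNA-protein interaction prediction method according to claim 17, characterized in that determining the interaction between the RNA and the protein according to the calculation result comprises:
    if the calculation result is greater than a preset interaction prediction value threshold, determining that there is an interaction between the RNA and the protein; and
    if the calculation result is less than or equal to the preset interaction prediction value threshold, determining that there is no interaction between the RNA and the protein.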
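As an illustration only, the weighted summation and threshold decision of claims 17 and 18 can be sketched as follows. The weights, the threshold value of 0.5, and the function name `predict_interaction` are hypothetical; the claims only require a preset threshold and a strict "greater than" comparison for a positive result.

```python
def predict_interaction(predictions, weights, threshold=0.5):
    """Return (interacts, score), where score is the weighted sum of the predictions."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per prediction value")
    score = sum(w * p for w, p in zip(weights, predictions))
    # Strictly greater than the threshold means "interaction"; equal means "none".
    return score > threshold, score


# Hypothetical outputs of three models and hypothetical weights.
interacts, score = predict_interaction([0.9, 0.6, 0.3], [0.5, 0.3, 0.2])
print(interacts, round(score, 2))
```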
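  19. The RNA-protein interaction prediction method according to any one of claims 1 to 18, characterized in that the method further comprises:
    jointly training the plurality of interaction prediction models.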
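  20. The RNA-protein interaction prediction method according to claim 19, characterized in that jointly training the plurality of interaction prediction models comprises:
    acquiring a training data set, the training data set comprising positive RNA-protein pairs and negative RNA-protein pairs;
    using the plurality of interaction prediction models to respectively obtain a plurality of interaction prediction values for each RNA-protein pair in the training data set;
    obtaining a joint prediction value for each RNA-protein pair in the training data set according to the plurality of interaction prediction values;
    applying a loss function to the joint prediction value and the label value of each RNA-protein pair in the training data set to obtain a corresponding loss value; and
    adjusting the model parameters of the plurality of interaction prediction models according to the loss value.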
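  21. The RNA-protein interaction prediction method according to claim 20, characterized in that obtaining the joint prediction value for each RNA-protein pair in the training data set according to the plurality of interaction prediction values comprises:
    performing a weighted summation of the plurality of interaction prediction values of each RNA-protein pair in the training data set to obtain the joint prediction value of each RNA-protein pair.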
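  22. The RNA-protein interaction prediction method according to claim 21, characterized in that performing the weighted summation of the plurality of interaction prediction values of each RNA-protein pair in the training data set to obtain the joint prediction value of each RNA-protein pair comprises:
    according to:
    $y_{\mathrm{out}} = \alpha y_1 + \beta y_2 + \gamma y_3$
    calculating the joint prediction value $y_{\mathrm{out}}$ of each RNA-protein pair in the training data set, wherein $y_1$ is the output value of the traditional machine learning model, $y_2$ is the output value of the convolutional neural network model, $y_3$ is the output value of the recurrent neural network model, and $\alpha$, $\beta$, and $\gamma$ are the weight parameters of the traditional machine learning model, the convolutional neural network model, and the recurrent neural network model, respectively.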
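  23. The RNA-protein interaction prediction method according to claim 20, characterized in that adjusting the model parameters of the plurality of interaction prediction models according to the loss value comprises:
    iteratively updating the model parameters of the plurality of interaction prediction models based on the loss value, and, when an iteration termination condition is satisfied, completing the training of the model parameters of the plurality of interaction prediction models, so that the trained plurality of interaction prediction models are used to predict the interaction of the RNA-protein pair to be predicted.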
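As an illustration only, the joint prediction, loss, and parameter update of claims 20 to 23 can be sketched as follows. The claims do not name the loss function or the optimizer, so the sketch assumes binary cross-entropy and a single plain gradient-descent step, and it updates only the three fusion weights of the weighted sum. Updating the parameters inside each sub-model as well would normally be left to an automatic-differentiation framework and is not shown. All function names and numeric values are hypothetical.

```python
import math


def joint_prediction(outputs, weights):
    """Weighted sum of the model outputs for one RNA-protein pair."""
    return sum(w * y for w, y in zip(weights, outputs))


def _clip(y, eps=1e-12):
    """Keep a prediction strictly inside (0, 1) so the logarithms stay finite."""
    return min(max(y, eps), 1.0 - eps)


def bce_loss(y_out, label):
    """Binary cross-entropy between a joint prediction and a 0/1 label (assumed loss)."""
    y = _clip(y_out)
    return -(label * math.log(y) + (1 - label) * math.log(1.0 - y))


def mean_loss(batch, weights):
    """Average loss over (outputs, label) samples."""
    return sum(bce_loss(joint_prediction(o, weights), t) for o, t in batch) / len(batch)


def weight_gradient(batch, weights):
    """Gradient of the mean loss with respect to the fusion weights."""
    grads = [0.0] * len(weights)
    for outputs, label in batch:
        y = _clip(joint_prediction(outputs, weights))
        # d(loss)/d(y) for binary cross-entropy is (y - label) / (y * (1 - y)).
        factor = (y - label) / (y * (1.0 - y))
        for j, o in enumerate(outputs):
            grads[j] += factor * o / len(batch)
    return grads


# Hypothetical model outputs for one positive and one negative training pair.
batch = [([0.9, 0.8, 0.7], 1), ([0.2, 0.1, 0.3], 0)]
weights = [0.4, 0.3, 0.3]
loss_before = mean_loss(batch, weights)
grads = weight_gradient(batch, weights)
weights = [w - 0.1 * g for w, g in zip(weights, grads)]  # one descent step
loss_after = mean_loss(batch, weights)
print(loss_before, loss_after)
```

In an iterative setting this step would be repeated until a termination condition, for example a maximum number of iterations or a sufficiently small loss change, is met.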
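  24. The RNA-protein interaction prediction method according to claim 1, characterized in that the method further comprises:
    outputting the prediction result of the interaction between the RNA and the protein.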
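  25. An RNA-protein interaction prediction apparatus, characterized by comprising:
    a data acquisition module configured to acquire an RNA-protein pair to be predicted;
    a feature extraction module configured to perform feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted;
    a data vectorization module configured to vectorize the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector of the RNA-protein pair to be predicted;
    an interaction prediction module configured to use a plurality of interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted, to respectively obtain a plurality of interaction prediction values for the RNA-protein pair to be predicted; and
    an interaction determination module configured to determine the interaction between the RNA and the protein according to the plurality of interaction prediction values.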
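  26. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 24.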
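  27. An electronic device, characterized by comprising:
    a processor; and
    a memory configured to store instructions executable by the processor;
    wherein the processor is configured to perform the method according to any one of claims 1 to 24 by executing the executable instructions.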
PCT/CN2021/121089 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium and electronic device WO2023044927A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
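PCT/CN2021/121089 WO2023044927A1 (zh) 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium and electronic device
CN202180002692.1A CN116490926A (zh) 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium and electronic device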

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/121089 WO2023044927A1 (zh) 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium and electronic device

Publications (1)

Publication Number Publication Date
WO2023044927A1 (zh) 2023-03-30

Family

ID=85719929

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/121089 WO2023044927A1 (zh) 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium and electronic device

Country Status (2)

Country Link
CN (1) CN116490926A (zh)
WO (1) WO2023044927A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100070438A1 (en) * 2006-10-31 2010-03-18 Keio University Method for predicting interaction between protein and chemical
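CN111192631A (zh) * 2020-01-02 2020-05-22 中国科学院计算技术研究所 Method and system for constructing a model for predicting protein-RNA interaction binding sites
CN111916148A (zh) * 2020-08-13 2020-11-10 中国计量大学 Method for predicting protein interactions
CN112420127A (zh) * 2020-10-26 2021-02-26 大连民族大学 Method for predicting non-coding RNA-protein interactions based on secondary structure and multi-model fusion
CN113313167A (zh) * 2021-05-28 2021-08-27 湖南工业大学 Deep-learning-based method for predicting lncRNA-protein interactions using a dual neural network structure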

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHUPING CHENG, JIANJUN TAN, JINGRUI MEN: "Prediction of ncRNA-protein interactions based on machine learning methods ", BEIJING BIOMEDICAL ENGINEERING, vol. 38, no. 4, 1 August 2019 (2019-08-01), XP093053645 *
ZHU, MIN: "Prediction of Protein-protein Interactions Based on Ensemble Learning", JOURNAL OF SICHUAN UNIVERSITY (ENGINEERING SCIENCE EDITION), vol. 43, no. 3, 20 May 2011 (2011-05-20), XP093053655 *

Also Published As

Publication number Publication date
CN116490926A (zh) 2023-07-25

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase (Ref document number: 202180002692.1; Country of ref document: CN)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21958045; Country of ref document: EP; Kind code of ref document: A1)