CN116825187A - lncRNA-protein interaction prediction method and related equipment thereof

Info

Publication number: CN116825187A
Application number: CN202310769833.3A
Authority: CN
Inventors: 刘小双
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2023-06-27
Filing date: 2023-06-27
Publication date: 2023-09-29

Abstract

The embodiment of the application belongs to the technical field of digital medical treatment, is applied to the scenario of predicting interactions between lncRNAs and proteins, and relates to an lncRNA-protein interaction prediction method and related equipment thereof. The method comprises: constructing a first association graph according to lncRNA sequences; constructing a second association graph according to protein sequences; obtaining target matrices respectively corresponding to the first association graph and the second association graph; obtaining group characterization vectors based on those target matrices; and predicting, based on the group characterization vectors, whether the lncRNA sequence and the protein sequence in each sequence group interact. Graph self-encoder training is carried out separately on the lncRNA sequences and the protein sequences in a cooperative training mode, and predictor training is then carried out on the group characterization vectors, so that the structural information of the lncRNA sequences and the protein sequences and the interaction information between them are fully utilized, which improves the effectiveness and synergy with which known information is used and the prediction accuracy.

Description

lncRNA-protein interaction prediction method and related equipment thereof

Technical Field

The application relates to the technical field of digital medical treatment, is applied to the scenario of predicting interactions between lncRNAs and proteins, and in particular relates to an lncRNA-protein interaction prediction method and related equipment thereof.

Background

With the development of the computer industry and artificial intelligence and the arrival of the big data era, the traditional medical model is gradually shifting to a digital medical model. Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs longer than 200 nt and an important component of the non-coding genome. Numerous studies have shown that lncRNAs are involved in a variety of biological processes, including DNA methylation, histone modification, post-transcriptional regulation of RNA and regulation of protein translation, and in the regulation of a variety of physiological and pathological processes.

Predicting potential lncRNA-protein interactions is therefore very important for disease prevention and treatment, and provides a new reference for active research fields such as tumor biology and the study of the novel coronavirus. Existing methods predict either from sequence similarity or from existing labels, but the two are run as separate prediction processes, which reduces the effectiveness and synergy with which known information is used.

Disclosure of Invention

The embodiment of the application aims to provide an lncRNA-protein interaction prediction method and related equipment thereof, so as to solve the problem that the prior art cannot use known information reasonably, effectively and cooperatively when predicting potential lncRNA-protein interactions.

In order to solve the above technical problems, the embodiment of the present application provides an lncRNA-protein interaction prediction method, which adopts the following technical scheme:

a method of predicting lncRNA-protein interactions comprising the steps of:

obtaining N sequence groups to be subjected to interaction prediction, wherein each sequence group comprises an lncRNA sequence and a protein sequence, and N is a positive integer;

constructing, according to the lncRNA sequences in the N sequence groups, a sequence group association graph based on lncRNA similarity as a first association graph;

constructing, according to the protein sequences in the N sequence groups, a sequence group association graph based on protein similarity as a second association graph;

performing association graph reconstruction on the first association graph and the second association graph respectively by using a graph self-encoder in a preset interaction prediction model to obtain target matrices respectively corresponding to the first association graph and the second association graph, wherein the preset interaction prediction model is trained in advance on sequence groups with known interactions;

based on the target matrices respectively corresponding to the first association graph and the second association graph, splicing the embedded vector corresponding to the lncRNA sequence and the embedded vector corresponding to the protein sequence in each sequence group to obtain a group characterization vector;

inputting the group characterization vector corresponding to each sequence group into a predictor of the interaction prediction model to obtain a prediction result for the lncRNA sequence and the protein sequence in each sequence group, wherein the prediction result is interaction or non-interaction.

In order to solve the above technical problems, the embodiment of the present application further provides an lncRNA-protein interaction prediction apparatus, which adopts the following technical scheme:

an lncRNA-protein interaction prediction apparatus comprising:

the to-be-predicted sequence group acquisition module is used for acquiring N sequence groups to be subjected to interaction prediction, wherein each sequence group comprises an lncRNA sequence and a protein sequence, and N is a positive integer;

the first association graph construction module is used for constructing, according to the lncRNA sequences in the N sequence groups, a sequence group association graph based on lncRNA similarity as a first association graph;

the second association graph construction module is used for constructing, according to the protein sequences in the N sequence groups, a sequence group association graph based on protein similarity as a second association graph;

the graph self-encoder coding module is used for performing association graph reconstruction on the first association graph and the second association graph respectively by using a graph self-encoder in a preset interaction prediction model to obtain target matrices respectively corresponding to the first association graph and the second association graph, wherein the preset interaction prediction model is trained in advance on sequence groups with known interactions;

the group characterization vector acquisition module is used for splicing, based on the target matrices respectively corresponding to the first association graph and the second association graph, the embedded vector corresponding to the lncRNA sequence and the embedded vector corresponding to the protein sequence in each sequence group to obtain a group characterization vector;

and the predictor prediction module is used for inputting the group characterization vector corresponding to each sequence group into the predictor of the interaction prediction model to obtain a prediction result for the lncRNA sequence and the protein sequence in each sequence group, wherein the prediction result is interaction or non-interaction.

In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:

a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the lncRNA-protein interaction prediction method described above.

In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:

a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the lncRNA-protein interaction prediction method as described above.

Compared with the prior art, the embodiment of the application has the following main beneficial effects:

according to the lncRNA-protein interaction prediction method, N sequence groups to be subjected to interaction prediction are obtained; a first association graph is constructed according to the lncRNA sequences in the N sequence groups; a second association graph is constructed according to the protein sequences in the N sequence groups; association graph reconstruction is performed on the first association graph and the second association graph respectively by using a graph self-encoder in a preset interaction prediction model, and target matrices respectively corresponding to the first association graph and the second association graph are obtained; the embedded vectors respectively corresponding to the lncRNA sequence and the protein sequence in each sequence group are spliced based on those target matrices to obtain group characterization vectors; and the group characterization vector corresponding to each sequence group is input into a predictor of the interaction prediction model to predict whether the lncRNA sequence and the protein sequence in each sequence group interact. In the application, graph self-encoder training is carried out separately on the lncRNA sequences and the protein sequences in a cooperative training mode, and predictor training is carried out on the group characterization vectors that combine the two, so that the structural information of the lncRNA sequences and the protein sequences and the interaction information between them are fully utilized during training and prediction, the effectiveness and synergy with which known information is used are improved, and the accuracy of lncRNA-protein interaction prediction is improved.

Drawings

In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow chart of one embodiment of a method of lncRNA-protein interaction prediction according to the present application;

FIG. 3 is a flow chart of one embodiment of step 202 of FIG. 2;

FIG. 4 is a flow chart of one embodiment of step 203 shown in FIG. 2;

FIG. 5 is a flow chart of one embodiment of step 302 shown in FIG. 3;

FIG. 6 is a flow chart of one embodiment of step 303 shown in FIG. 3;

FIG. 7 is a flow chart of one embodiment of step 401 shown in FIG. 4;

FIG. 8 is a flow chart of one embodiment of step 402 shown in FIG. 4;

FIG. 9 is a schematic diagram of one embodiment of a lncRNA-protein interaction prediction device according to the present application;

FIG. 10 is a schematic diagram of an embodiment of a computer device in accordance with the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the lncRNA-protein interaction prediction method provided by the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the lncRNA-protein interaction prediction apparatus is generally disposed in the server/terminal device.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to fig. 2, a flow chart of one embodiment of a lncRNA-protein interaction prediction method in accordance with the application is shown. The lncRNA-protein interaction prediction method comprises the following steps:

step 201, obtaining N sequence groups to be subjected to interaction prediction, wherein each sequence group contains lncRNA sequences and protein sequences, and N is a positive integer.

Step 202, constructing, according to the lncRNA sequences in the N sequence groups, a sequence group association graph based on lncRNA similarity as a first association graph.

With continued reference to FIG. 3, FIG. 3 is a flow chart of one embodiment of step 202 shown in FIG. 2, comprising:

step 301, segment processing is performed on the lncRNA sequences in each sequence group, so as to obtain M lncRNA subsequences respectively, where M is a positive integer.

Step 302, sequentially calculating sequence similarity among M lncRNA subsequences contained in different sequence groups by an edit distance method.

With continued reference to fig. 5, fig. 5 is a flow chart of one embodiment of step 302 shown in fig. 3, comprising:

step 501, combining lncRNA sequences in the different sequence groups in pairs to obtain lncRNA sequence comparison groups, wherein two lncRNA sequences contained in each lncRNA sequence comparison group are respectively a first lncRNA sequence and a second lncRNA sequence;

step 502, obtaining M lncRNA subsequences corresponding to a first lncRNA sequence and a second lncRNA sequence in a current lncRNA sequence comparison group respectively;

step 503, calculating, by the edit distance method and according to the M lncRNA subsequences respectively corresponding to the first lncRNA sequence and the second lncRNA sequence in the current lncRNA sequence comparison group, the minimum edit distance required to convert the first lncRNA sequence into the second lncRNA sequence;

step 504, determining the similarity between the first lncRNA sequence and the second lncRNA sequence in the current lncRNA sequence comparison group according to the minimum edit distance, wherein the minimum edit distance and the similarity are negatively correlated, i.e. the smaller the minimum edit distance, the larger the similarity;

step 505, taking each of the different lncRNA sequence comparison groups in turn as the current lncRNA sequence comparison group, and repeating steps 502 to 504 to determine the similarity between the first lncRNA sequence and the second lncRNA sequence in every lncRNA sequence comparison group.

In this embodiment, since an lncRNA is a long non-coding RNA longer than 200 nt, directly comparing two full-length lncRNA sequences makes the comparison length excessive. Each lncRNA sequence to be compared is therefore first split into M lncRNA subsequences, the edit distance algorithm is applied to the subsequences to obtain the minimum edit distance, and the similarity of the two lncRNA sequences is determined from that minimum edit distance. Splitting the sequences before applying the edit distance algorithm reduces the complexity of the comparison and improves comparison efficiency.
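Steps 301 and 501 to 505 can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the function names, the near-equal segmentation, the summing of per-segment Levenshtein distances, and the mapping from distance to a similarity in [0, 1] are all assumptions, since the text only requires that similarity decrease as the minimum edit distance grows.

```python
def split_into_segments(seq: str, m: int) -> list[str]:
    """Split seq into m contiguous segments of near-equal length (assumed scheme)."""
    base, extra = divmod(len(seq), m)
    segments, start = [], 0
    for i in range(m):
        end = start + base + (1 if i < extra else 0)
        segments.append(seq[start:end])
        start = end
    return segments


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lncrna_similarity(seq_a: str, seq_b: str, m: int = 4) -> float:
    """Similarity from summed per-segment edit distances (assumed normalization)."""
    segs_a = split_into_segments(seq_a, m)
    segs_b = split_into_segments(seq_b, m)
    total = sum(edit_distance(x, y) for x, y in zip(segs_a, segs_b))
    worst = sum(max(len(x), len(y)) for x, y in zip(segs_a, segs_b))
    return 1.0 if worst == 0 else 1.0 - total / worst
```

For two sequences of length L, the dynamic program on the full sequences fills about L x L cells, whereas M segment pairs of length L/M fill about L x L / M cells in total, which is one way to read the efficiency gain described above.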

Step 303, constructing the sequence group association graph based on lncRNA similarity according to the sequence similarity among the M lncRNA subsequences contained in different sequence groups.

With continued reference to fig. 6, fig. 6 is a flow chart of one embodiment of step 303 shown in fig. 3, comprising:

step 601, presetting N associated nodes, equal in number to the different sequence groups, wherein each associated node represents one sequence group;

step 602, judging whether the similarity between the first lncRNA sequence and the second lncRNA sequence in the current lncRNA sequence comparison group meets a preset first similarity threshold;

step 603, if the similarity between the first lncRNA sequence and the second lncRNA sequence in the current lncRNA sequence comparison group does not meet the preset first similarity threshold, setting another lncRNA sequence comparison group as the current lncRNA sequence comparison group, and continuing to execute step 602;

step 604, if the similarity between the first lncRNA sequence and the second lncRNA sequence in the current lncRNA sequence comparison group meets the preset first similarity threshold, constructing a node connection line between the associated nodes respectively corresponding to the first lncRNA sequence and the second lncRNA sequence, setting another lncRNA sequence comparison group as the current lncRNA sequence comparison group, and continuing to execute step 602;

step 605, stopping the execution of step 602 once all lncRNA sequence comparison groups have been set as the current lncRNA sequence comparison group, and taking the node connection lines obtained among the N associated nodes as the sequence group association graph based on lncRNA similarity.

In this embodiment, the sequence group association graph is constructed based on the similarity of lncRNA sequences; that is, the lncRNA sequences in different sequence groups are compared on their own to obtain the first association graph. In essence, the N associated nodes are connected according to the similarity between pairs of lncRNA sequences, and the resulting node connection graph serves as the first association graph, so that the sequence structure of the lncRNA sequences themselves is fully utilized.
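The thresholded node connection of steps 601 to 605 can be sketched as a loop over all pairs of sequence groups. The names `build_association_edges` and `toy_similarity` are hypothetical, the toy similarity is only a stand-in for the edit-distance similarity, and reading "meets the threshold" as "greater than or equal to" is an assumption.

```python
from itertools import combinations
from typing import Callable, Sequence


def build_association_edges(
    sequences: Sequence[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> set[tuple[int, int]]:
    """Connect nodes i < j whenever the pairwise similarity meets the threshold."""
    edges = set()
    for i, j in combinations(range(len(sequences)), 2):
        if similarity(sequences[i], sequences[j]) >= threshold:
            edges.add((i, j))
    return edges


def toy_similarity(a: str, b: str) -> float:
    """Stand-in similarity: fraction of aligned positions that match."""
    length = max(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / length if length else 1.0


# Node k represents sequence group k; only groups 0 and 1 are similar enough here.
edges = build_association_edges(["ACGU", "ACGA", "UUUU"], toy_similarity, 0.7)
```

The same function can serve for the second association graph by passing protein sequences and a protein similarity function.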

Step 203, constructing, according to the protein sequences in the N sequence groups, a sequence group association graph based on protein similarity as a second association graph.

With continued reference to fig. 4, fig. 4 is a flow chart of one embodiment of step 203 shown in fig. 2, comprising:

step 401, sequentially calculating sequence similarity between protein sequences contained in different sequence groups by adopting a local sequence comparison algorithm.

With continued reference to fig. 7, fig. 7 is a flow chart of one embodiment of step 401 shown in fig. 4, comprising:

step 701, combining protein sequences in the different sequence groups in pairs to obtain protein sequence comparison groups, wherein two protein sequences contained in each protein sequence comparison group are respectively a first protein sequence and a second protein sequence;

Step 702, identifying a similar molecular structure region between a first protein sequence and a second protein sequence in a current protein sequence comparison group according to the local sequence comparison algorithm, and obtaining an identification result, wherein the local sequence comparison algorithm is a Smith-Waterman algorithm, and the Smith-Waterman algorithm is used for identifying the similar molecular structure region between the two protein sequences;

step 703, scoring the identification result by a preset scoring method to obtain a scoring result, wherein the preset scoring method is a BLOSUM62 matrix scoring method, and the BLOSUM62 matrix scoring method is used for performing similarity evaluation on a similar molecular structure region between two protein sequences identified by a Smith-Waterman algorithm;

step 704, setting the scoring result as the similarity of the first protein sequence and the second protein sequence in the current protein sequence comparison group;

step 705, sequentially taking the different protein sequence comparison groups as the current protein sequence comparison group, and repeatedly executing steps 702 to 704 to determine the similarity between the first protein sequence and the second protein sequence in the different protein sequence comparison groups.

In this embodiment, because protein sequences are composed of different amino acids, the local structures of the protein sequences in different sequence groups are compared using the Smith-Waterman algorithm, and the similar regions identified by the Smith-Waterman algorithm between two protein sequences are then scored with the BLOSUM62 matrix scoring method, so as to obtain the similarity of the protein sequences in different sequence groups.
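A minimal sketch of the Smith-Waterman local alignment score used in steps 702 to 704 is given below. To stay self-contained it does not embed the BLOSUM62 table: `toy_substitution` (match +2, mismatch -1) is a placeholder that a BLOSUM62 lookup would replace, and the linear gap penalty of 4 is likewise an assumption not stated in the text.

```python
from typing import Callable


def smith_waterman_score(
    a: str,
    b: str,
    substitution: Callable[[str, str], int],
    gap_penalty: int = 4,
) -> int:
    """Best local alignment score of a and b (linear gap penalty, score only)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            score = max(
                0,                                   # start a new local alignment
                prev[j - 1] + substitution(ca, cb),  # align ca with cb
                prev[j] - gap_penalty,               # gap in b
                curr[j - 1] - gap_penalty,           # gap in a
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def toy_substitution(x: str, y: str) -> int:
    """Placeholder scoring; a BLOSUM62 lookup would be substituted here."""
    return 2 if x == y else -1
```

Because cell scores are floored at zero, dissimilar flanking regions do not penalize a well-matched local region, which is why a local rather than global alignment suits proteins that share only a domain.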

Step 402, constructing the sequence group association graph based on protein similarity according to the sequence similarity among the protein sequences contained in different sequence groups.

With continued reference to fig. 8, fig. 8 is a flow chart of one embodiment of step 402 shown in fig. 4, comprising:

step 801, judging whether the similarity between the first protein sequence and the second protein sequence in the current protein sequence comparison group meets a preset second similarity threshold;

step 802, if the similarity between the first protein sequence and the second protein sequence in the current protein sequence comparison group does not meet the preset second similarity threshold, setting another protein sequence comparison group as the current protein sequence comparison group, and continuing to execute step 801;

step 803, if the similarity between the first protein sequence and the second protein sequence in the current protein sequence comparison group meets the preset second similarity threshold, constructing a node connection line between the associated nodes respectively corresponding to the first protein sequence and the second protein sequence, setting another protein sequence comparison group as the current protein sequence comparison group, and continuing to execute step 801;

step 804, stopping the execution of step 801 once all protein sequence comparison groups have been set as the current protein sequence comparison group, and taking the node connection lines obtained among the N associated nodes as the sequence group association graph based on protein similarity.

In this embodiment, the sequence group association graph is constructed based on protein similarity; that is, the protein sequences in different sequence groups are compared on their own to obtain the second association graph. In essence, the N associated nodes are connected according to the similarity between pairs of protein sequences, and the resulting node connection graph serves as the second association graph, so that the sequence structure of the protein sequences themselves is fully utilized.

In this embodiment, the N associated nodes are connected according to the similarity between pairs of lncRNA sequences to obtain a node connection graph as the first association graph, and the N associated nodes are connected according to the similarity between pairs of protein sequences to obtain a node connection graph as the second association graph. The first association graph and the second association graph are thus constructed separately but in a cooperative manner, so that the data information in the sequence groups to be predicted is used effectively and fully, which helps ensure the accuracy of the prediction result.

Step 204, performing association graph reconstruction on the first association graph and the second association graph respectively by using a graph self-encoder in a preset interaction prediction model to obtain target matrices respectively corresponding to the first association graph and the second association graph, wherein the preset interaction prediction model is trained in advance on sequence groups with known interactions.

In this embodiment, the step of performing association graph reconstruction on the first association graph and the second association graph by using the graph self-encoder in the preset interaction prediction model to obtain the target matrices respectively corresponding to the first association graph and the second association graph specifically includes: constructing a first adjacency matrix according to the node connection lines among the N associated nodes in the first association graph; constructing a second adjacency matrix according to the node connection lines among the N associated nodes in the second association graph; inputting the first adjacency matrix and the second adjacency matrix into the graph self-encoder, wherein the graph self-encoder comprises an encoding layer based on a graph convolutional network (GCN) and a decoding layer based on an inner-product algorithm; obtaining, through the encoding layer, the embedded vector corresponding to each associated node of the first adjacency matrix and the embedded vector corresponding to each associated node of the second adjacency matrix; reconstructing the first adjacency matrix through the decoding layer from the embedded vectors corresponding to the associated nodes of the first adjacency matrix to obtain the target matrix corresponding to the first association graph; and reconstructing the second adjacency matrix through the decoding layer from the embedded vectors corresponding to the associated nodes of the second adjacency matrix to obtain the target matrix corresponding to the second association graph.
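The forward pass of such a graph self-encoder (commonly called a graph autoencoder) can be sketched with NumPy as below. The text names only a GCN encoding layer and an inner-product decoding layer, so everything else here is an assumption made for illustration: the two-layer depth, the ReLU activation, the symmetric normalization with self-loops, the identity node features, the sigmoid on the inner product, and the random untrained weights `w0` and `w1`.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2}, a common GCN propagation matrix."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # degrees >= 1 due to self-loops
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_encode(adj, features, w0, w1) -> np.ndarray:
    """Two-layer GCN encoder: one embedded vector per associated node."""
    a_norm = normalize_adjacency(adj)
    hidden = np.maximum(a_norm @ features @ w0, 0.0)  # ReLU
    return a_norm @ hidden @ w1


def inner_product_decode(z: np.ndarray) -> np.ndarray:
    """Reconstruct edge probabilities as sigmoid(z z^T)."""
    return 1.0 / (1.0 + np.exp(-(z @ z.T)))


rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy 3-node graph
features = np.eye(3)                       # identity features (assumption)
w0 = rng.normal(scale=0.1, size=(3, 8))    # untrained illustrative weights
w1 = rng.normal(scale=0.1, size=(8, 4))
embeddings = gcn_encode(adj, features, w0, w1)   # shape (3, 4)
reconstructed = inner_product_decode(embeddings)  # shape (3, 3), symmetric
```

In this reading, `reconstructed` plays the role of the target matrix and the rows of `embeddings` are the per-node embedded vectors; the same code would be run once on the first adjacency matrix and once on the second.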

Specifically, the first adjacency matrix is constructed according to the node connection lines among the N associated nodes in the first association graph: if a node connection line exists between two associated nodes, the matrix entry for those two nodes is set to 1, and if no node connection line exists between them, the entry is set to 0. The second adjacency matrix is constructed in the same way from the node connection lines among the N associated nodes in the second association graph, so that it is likewise a matrix of 1s and 0s. The graph self-encoder encodes the lncRNA sequences and the protein sequences separately, and is optimized through cooperative training that fuses the known information, so that the encodings of the lncRNA sequences and of the protein sequences promote and constrain each other, which improves the prediction performance of the model.
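The 0/1 adjacency matrix described above can be built from a set of node connection lines as in the following sketch. The function name and the edge representation as index pairs are assumptions, and the matrix is filled symmetrically because the node connection lines are undirected.

```python
def adjacency_matrix(n: int, edges: set[tuple[int, int]]) -> list[list[int]]:
    """Build an n x n symmetric 0/1 matrix from undirected node connection lines."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Three sequence groups; only groups 0 and 1 are connected.
first_adjacency = adjacency_matrix(3, {(0, 1)})
```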

In this embodiment, before the step of performing association graph reconstruction on the first association graph and the second association graph by using the graph self-encoder in the interaction prediction model to obtain the target matrices respectively corresponding to the first association graph and the second association graph, the method further includes: obtaining sequence groups with known interactions from a preset database and dividing them into a training set and a verification set, wherein the preset database comprises the NPInter database and the lncPro database, and a sequence group with a known interaction is a sequence group whose lncRNA sequence and protein sequence are known to interact; obtaining, through the graph self-encoder in the interaction prediction model, the group characterization vector corresponding to each sequence group in the training set and the group characterization vector corresponding to each sequence group in the verification set; performing initial training on the predictor of the interaction prediction model according to the group characterization vectors corresponding to the sequence groups in the training set to obtain an initially trained predictor; performing iterative optimization training on the initially trained predictor based on the group characterization vectors corresponding to the sequence groups in the verification set; and completing the training of the interaction prediction model when its loss value meets a preset loss condition. Under the preset loss condition, the parameters of the interaction prediction model are optimized by a back propagation algorithm that minimizes the loss value, and iterative optimization training continues until the number of iterations reaches a preset maximum; the loss value of the graph self-encoder and the loss value of the predictor are obtained for each training iteration and summed, and the model parameters at which this sum is smallest are taken as the parameters of the trained interaction prediction model.

The NPInter database is a non-coding RNA interaction data resource that covers comprehensive, multi-dimensional interactions among non-coding RNAs, proteins, RNAs and the genome, provides rich interaction and molecular function annotations, and offers a reference for research in active fields such as tumor biology and the novel coronavirus. The lncPro database is a data platform developed by the Peking University Health Science Center for predicting interactions between lncRNAs and proteins, and covers known interactions between non-coding RNAs and proteins. A sequence group of known interaction may be represented by an interaction tag or an identification code corresponding to the known interaction between the non-coding RNA and the protein.

In essence, to obtain, through the graph self-encoder in the interaction prediction model, the group characterization vector corresponding to each sequence group in the training set and in the verification set, this embodiment obtains the lncRNA sequences and the protein sequences in the different sequence groups, constructs the first association graph and the second association graph corresponding to all sequence groups in the training set, and likewise constructs the first association graph and the second association graph corresponding to all sequence groups in the verification set; the group characterization vector of a sequence group is then obtained from the first association graph and the second association graph for that sequence group. In the model training stage, training thus combines cooperative training with the graph self-encoder, so that the data information of the lncRNA sequences and the protein sequences is fully utilized.
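For orientation, one forward pass of a graph self-encoder with a GCN encoding layer and an inner-product decoding layer (the structure named in claim 5) can be sketched as below. This is a minimal sketch under stated assumptions, not the trained model of the application: the two-layer encoder, the ReLU activation, the identity node features, the random weights `w0` and `w1`, and all function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 commonly used by GCN layers."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encode(adj, features, w0, w1):
    """Two GCN layers mapping each associated node to an embedded vector."""
    a_norm = normalize_adjacency(adj)
    hidden = np.maximum(a_norm @ features @ w0, 0.0)  # ReLU (assumed activation)
    return a_norm @ hidden @ w1

def inner_product_decode(z):
    """Reconstruct edge probabilities as sigmoid(Z Z^T)."""
    return 1.0 / (1.0 + np.exp(-(z @ z.T)))

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
features = np.eye(4)            # assumption: identity node features
w0 = rng.normal(size=(4, 8))    # untrained, randomly initialized weights
w1 = rng.normal(size=(8, 2))

embeddings = gcn_encode(adj, features, w0, w1)      # one embedded vector per node
target_matrix = inner_product_decode(embeddings)    # reconstructed matrix
```

In training, the weights would be updated so that the reconstructed matrix approaches the input adjacency matrix; the same pass would be run once for the first association graph and once for the second.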

In the model verification stage, a minimum-loss-value back propagation algorithm is used to optimize the parameters of the interaction prediction model. Because the interaction prediction model comprises a graph self-encoder and a predictor, the loss of the model is governed by the sum of their loss values: a maximum number of iterations is preset, the loss sum after each iteration is compared, and the model parameters at the minimum loss sum are selected as the parameters of the trained interaction prediction model. To safeguard the accuracy of the model, the loss calculation can also be broken into more terms. For example, in the training and verification stage, the minimum loss value may be obtained jointly from the loss value of the first association graph, the loss value of the second association graph and the loss value of the predictor; or it may be obtained jointly from the loss value of the encoder and the loss value of the decoder in the graph self-encoder together with the loss value of the predictor. In short, whether the loss calculation uses more or fewer terms depends mainly on whether the loss value of the graph self-encoder is subdivided.
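The parameter selection rule described above can be sketched as follows. This is an illustrative example only: the function name `select_best_parameters`, the layout of `history`, and the loss figures are hypothetical, and the length of `history` stands in for the preset maximum number of iterations.

```python
def select_best_parameters(history):
    """Pick the parameters whose summed loss is smallest.

    history: list of (params, autoencoder_loss, predictor_loss), one entry
    per training iteration, up to the preset maximum number of iterations.
    """
    best_params, best_total = None, float("inf")
    for params, autoencoder_loss, predictor_loss in history:
        total = autoencoder_loss + predictor_loss  # loss sum of the whole model
        if total < best_total:
            best_params, best_total = params, total
    return best_params, best_total

# Hypothetical loss values recorded over three iterations.
history = [
    ({"iteration": 1}, 0.90, 0.70),
    ({"iteration": 2}, 0.60, 0.55),
    ({"iteration": 3}, 0.62, 0.50),
]
best_params, best_total = select_best_parameters(history)
```

Splitting the graph self-encoder loss into finer terms (for example one term per association graph) would only change how `total` is computed, not the selection rule.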

Verification training of the model determines the parameters of the interaction prediction model with the minimum-loss-value back propagation algorithm, which safeguards the accuracy of later predictions; during training, cooperative training is combined with the graph self-encoder so that the data information of the lncRNA sequence and the protein sequence is fully utilized.

Step 205, based on the target matrixes respectively corresponding to the first correlation diagram and the second correlation diagram, splicing the embedded vector corresponding to the lncRNA sequence and the embedded vector corresponding to the protein sequence in each sequence group, and obtaining a group characterization vector.

In this embodiment, before the step of performing the step of splicing the embedded vector corresponding to the lncRNA sequence and the embedded vector corresponding to the protein sequence in each sequence group based on the target matrices respectively corresponding to the first correlation diagram and the second correlation diagram to obtain the group characterization vector, the method further includes: obtaining embedded vectors corresponding to each lncRNA sequence in the N sequence groups respectively through a target matrix corresponding to the first association diagram; and acquiring the embedded vectors corresponding to the protein sequences in the N sequence groups respectively through the adjacent matrix corresponding to the second association diagram.

In this embodiment, the step of splicing the embedded vector corresponding to the lncRNA sequence and the embedded vector corresponding to the protein sequence in each sequence group based on the target matrices respectively corresponding to the first correlation diagram and the second correlation diagram to obtain the group characterization vector specifically includes: acquiring and splicing an embedded vector corresponding to the lncRNA sequence and an embedded vector corresponding to the protein sequence in the current sequence group to serve as a group characterization vector of the current sequence group; and sequentially taking the different sequence groups as the current sequence groups to obtain group characterization vectors respectively corresponding to the different sequence groups.
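The splicing step can be illustrated with a short sketch. The function name `group_characterization_vectors`, the embedding dimensions and the example values are assumptions for illustration; row k of each input is taken to hold the embedded vector of the lncRNA sequence or protein sequence in sequence group k.

```python
import numpy as np

def group_characterization_vectors(lncrna_embeddings, protein_embeddings):
    """Concatenate, per sequence group, the lncRNA and protein embedded vectors."""
    lnc = np.asarray(lncrna_embeddings, dtype=float)
    prot = np.asarray(protein_embeddings, dtype=float)
    if lnc.shape[0] != prot.shape[0]:
        raise ValueError("both inputs must contain one row per sequence group")
    return np.concatenate([lnc, prot], axis=1)

# Hypothetical embeddings for N = 2 sequence groups.
lnc = np.array([[0.1, 0.2], [0.3, 0.4]])
prot = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
group_vectors = group_characterization_vectors(lnc, prot)
```

Each row of `group_vectors` is the group characterization vector of one sequence group, with the lncRNA part first and the protein part second.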

Step 206, inputting the group characterization vector corresponding to each sequence group into the predictor of the interaction prediction model to obtain a prediction result for the lncRNA sequence and the protein sequence in each sequence group, wherein the prediction result is either that an interaction exists or that no interaction exists.

After performing the step of inputting the group characterization vector corresponding to each sequence group into the predictor of the interaction prediction model to predict whether there is an interaction between the lncRNA sequence and the protein sequence in each sequence group, the method further comprises: obtaining and parsing the prediction result output by the predictor of the interaction prediction model; if the parsing shows that an interaction exists between the lncRNA sequence and the protein sequence in the current sequence group, marking the current sequence group distinctively and outputting interaction field information; otherwise, sending a prompt message of unknown interaction to a target monitoring end.
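The handling of prediction results might be sketched as below. The application does not state the form of the predictor output in this passage, so the probability scores, the 0.5 threshold, and the function name `handle_predictions` are all assumptions; the two returned lists stand in for the groups to be marked and the groups for which a prompt message of unknown interaction would be sent.

```python
def handle_predictions(group_ids, probabilities, threshold=0.5):
    """Split sequence groups by predicted interaction.

    Returns (marked, unknown): groups predicted to interact, and groups for
    which an 'unknown interaction' prompt would go to the monitoring end.
    """
    marked, unknown = [], []
    for group_id, probability in zip(group_ids, probabilities):
        if probability >= threshold:  # assumed decision rule
            marked.append(group_id)
        else:
            unknown.append(group_id)
    return marked, unknown

# Hypothetical predictor scores for three sequence groups.
marked, unknown = handle_predictions(["g1", "g2", "g3"], [0.9, 0.2, 0.5])
```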

The application obtains N sequence groups to be subjected to interaction prediction; constructs a first association graph according to the lncRNA sequences in the N sequence groups; constructs a second association graph according to the protein sequences in the N sequence groups; reconstructs the first association graph and the second association graph respectively by using the graph self-encoder in a preset interaction prediction model, obtaining the target matrices respectively corresponding to the first association graph and the second association graph; splices, based on those target matrices, the embedded vectors respectively corresponding to the lncRNA sequence and the protein sequence in each sequence group to obtain group characterization vectors; and inputs the group characterization vector corresponding to each sequence group into the predictor of the interaction prediction model to predict whether an interaction exists between the lncRNA sequence and the protein sequence in each sequence group. By training the graph self-encoder on the lncRNA sequences and the protein sequences separately in a cooperative training mode, and then training the predictor on the group characterization vectors that combine the lncRNA sequence and the protein sequence, the application makes full use, during both training and prediction, of the structural information of the lncRNA sequences and the protein sequences and of the interaction information between them, which improves the effectiveness and synergy with which known information is used and the accuracy of lncRNA-protein interaction prediction.

The embodiments of the application can acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results.

Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.

In the embodiment of the application, the graph self-encoder is trained on the lncRNA sequence and the protein sequence separately in a cooperative training mode, and the predictor is then trained on the group characterization vectors that combine the lncRNA sequence and the protein sequence, so that the structural information of the lncRNA sequence and the protein sequence and the interaction information between them are fully utilized during training and prediction, and the effectiveness and synergy of the utilization of known information are improved.

With further reference to fig. 9, as an implementation of the method shown in fig. 2 described above, the present application provides an embodiment of an lncRNA-protein interaction prediction apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.

As shown in fig. 9, the lncRNA-protein interaction prediction apparatus 900 according to the present embodiment includes: a sequence group to be tested acquisition module 901, a first association diagram construction module 902, a second association diagram construction module 903, a graph self-encoder encoding module 904, a group characterization vector acquisition module 905 and a predictor prediction module 906. Wherein:

the sequence group to be tested acquisition module 901 is configured to acquire N sequence groups to be subjected to interaction prediction, wherein each sequence group comprises an lncRNA sequence and a protein sequence, and N is a positive integer;

the first association diagram construction module 902 is configured to construct a sequence group association diagram based on lncRNA similarity according to lncRNA sequences in the N sequence groups, as a first association diagram;

a second correlation diagram construction module 903, configured to construct a sequence group correlation diagram based on protein similarity according to protein sequences in the N sequence groups, as a second correlation diagram;

the graph self-encoder encoding module 904 is configured to reconstruct the first correlation graph and the second correlation graph by using a graph self-encoder in a preset interaction prediction model, and obtain target matrices corresponding to the first correlation graph and the second correlation graph, where the preset interaction prediction model is trained in advance according to a sequence group of known interactions;

the group characterization vector obtaining module 905 is configured to splice an embedded vector corresponding to the lncRNA sequence and an embedded vector corresponding to the protein sequence in each sequence group based on the target matrices respectively corresponding to the first correlation diagram and the second correlation diagram, so as to obtain a group characterization vector;

and a predictor prediction module 906, configured to input a group characterization vector corresponding to each sequence group into a predictor of the interaction prediction model, to obtain a predicted result of the lncRNA sequence and the protein sequence in each sequence group, where the predicted result is that there is an interaction or no interaction.

In some embodiments of the present application, the lncRNA-protein interaction prediction apparatus 900 further includes an interaction prediction model training module. The interaction prediction model training module is configured to obtain sequence groups of known interaction from a preset database and divide them into a training set and a verification set; to acquire, through the graph self-encoder in the interaction prediction model, the group characterization vector corresponding to each sequence group in the training set and the group characterization vector corresponding to each sequence group in the verification set; to perform initial training on the predictor of the interaction prediction model according to the group characterization vectors corresponding to the sequence groups in the training set, to obtain an initially trained predictor; to perform iterative optimization training on the initially trained predictor based on the group characterization vectors corresponding to the sequence groups in the verification set; and to complete the training of the interaction prediction model when its loss value meets a preset loss condition. Under the preset loss condition, the parameters of the interaction prediction model are optimized by a minimum-loss-value back propagation algorithm; iterative optimization training is carried out until the number of iterations reaches a preset maximum; the loss value of the graph self-encoder and the loss value of the predictor before each training iteration are obtained and summed; and the model parameters for which this sum is smallest are taken, by comparison, as the parameters of the trained interaction prediction model.

Those skilled in the art will appreciate that all or part of the processes of the above embodiment methods may be implemented by computer readable instructions stored on a computer readable storage medium, and that the instructions, when executed, may carry out the steps of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a read-only memory (Read-Only Memory, ROM), or it may be a random access memory (Random Access Memory, RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.

In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 10, fig. 10 is a basic structural block diagram of a computer device according to the present embodiment.

The computer device 10 includes a memory 10a, a processor 10b, and a network interface 10c communicatively coupled to each other via a system bus. It should be noted that only the computer device 10 having components 10a-10c is shown in the figure, but it should be understood that not all of the illustrated components need be implemented, and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculations and/or information processing in accordance with predetermined or stored instructions, the hardware of which includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.

The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.

The memory 10a includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 10a may be an internal storage unit of the computer device 10, such as a hard disk or an internal memory of the computer device 10. In other embodiments, the memory 10a may also be an external storage device of the computer device 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card or the like provided on the computer device 10. Of course, the memory 10a may also include both an internal storage unit of the computer device 10 and an external storage device thereof. In this embodiment, the memory 10a is typically used to store an operating system and various application software installed on the computer device 10, such as computer readable instructions of an lncRNA-protein interaction prediction method. Further, the memory 10a may be used to temporarily store various types of data that have been output or are to be output.

The processor 10b may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 10b is generally used to control the overall operation of the computer device 10. In this embodiment, the processor 10b is configured to execute computer readable instructions stored in the memory 10a or process data, such as computer readable instructions for executing the lncRNA-protein interaction prediction method.

The network interface 10c may comprise a wireless network interface or a wired network interface, the network interface 10c typically being used to establish a communication connection between the computer device 10 and other electronic devices.

The computer device provided by this embodiment belongs to the technical field of digital medical treatment and is applied to the scenario of predicting interactions between lncRNA and proteins. The application obtains N sequence groups to be subjected to interaction prediction; constructs a first association graph according to the lncRNA sequences in the N sequence groups; constructs a second association graph according to the protein sequences in the N sequence groups; reconstructs the first association graph and the second association graph respectively by using the graph self-encoder in a preset interaction prediction model, obtaining the target matrices respectively corresponding to the first association graph and the second association graph; splices, based on those target matrices, the embedded vectors respectively corresponding to the lncRNA sequence and the protein sequence in each sequence group to obtain group characterization vectors; and inputs the group characterization vector corresponding to each sequence group into the predictor of the interaction prediction model to predict whether an interaction exists between the lncRNA sequence and the protein sequence in each sequence group.
According to the application, the graph self-encoder training is respectively carried out on the lncRNA sequence and the protein sequence by adopting a cooperative training mode, and the predictor training is carried out by combining the group characterization vectors between the lncRNA sequence and the protein sequence, so that the structural information of the lncRNA sequence and the protein sequence and the interaction information between the lncRNA sequence and the protein sequence are fully utilized during training and prediction, the effectiveness and the synergy of the utilization of known information are improved, and the accuracy of the lncRNA-protein interaction prediction is improved.

The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by a processor to cause the processor to perform the steps of the lncRNA-protein interaction prediction method as described above.

The computer readable storage medium provided by this embodiment belongs to the technical field of digital medical treatment and is applied to the scenario of predicting interactions between lncRNA and proteins. The application obtains N sequence groups to be subjected to interaction prediction; constructs a first association graph according to the lncRNA sequences in the N sequence groups; constructs a second association graph according to the protein sequences in the N sequence groups; reconstructs the first association graph and the second association graph respectively by using the graph self-encoder in a preset interaction prediction model, obtaining the target matrices respectively corresponding to the first association graph and the second association graph; splices, based on those target matrices, the embedded vectors respectively corresponding to the lncRNA sequence and the protein sequence in each sequence group to obtain group characterization vectors; and inputs the group characterization vector corresponding to each sequence group into the predictor of the interaction prediction model to predict whether an interaction exists between the lncRNA sequence and the protein sequence in each sequence group.
According to the application, the graph self-encoder training is respectively carried out on the lncRNA sequence and the protein sequence by adopting a cooperative training mode, and the predictor training is carried out by combining the group characterization vectors between the lncRNA sequence and the protein sequence, so that the structural information of the lncRNA sequence and the protein sequence and the interaction information between the lncRNA sequence and the protein sequence are fully utilized during training and prediction, the effectiveness and the synergy of the utilization of known information are improved, and the accuracy of the lncRNA-protein interaction prediction is improved.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.

It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application; the drawings show preferred embodiments of the application but do not limit the scope of the patent claims. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be understood thoroughly and completely. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions described in the foregoing embodiments may be modified, or equivalents may be substituted for some of their technical features. All equivalent structures made using the content of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.

Claims

1. A method for predicting lncRNA-protein interactions, comprising the steps of:

2. The method for predicting lncRNA-protein interactions as set forth in claim 1, wherein the step of constructing a sequence group association diagram based on lncRNA similarity according to the lncRNA sequences in the N sequence groups specifically includes:

segmenting lncRNA sequences in each sequence group to obtain M lncRNA subsequences respectively, wherein M is a positive integer;

sequentially calculating sequence similarity among M lncRNA subsequences contained in different sequence groups by an edit distance method;

constructing a sequence group association diagram based on the lncRNA similarity according to the sequence similarity among M lncRNA subsequences contained in different sequence groups;

the step of constructing a sequence group association diagram based on protein similarity according to the protein sequences in the N sequence groups specifically comprises the following steps:

sequentially calculating sequence similarity among protein sequences contained in different sequence groups by adopting a local sequence comparison algorithm;

and constructing the sequence group association diagram based on the protein similarity according to the sequence similarity among protein sequences contained in different sequence groups.

3. The method for predicting lncRNA-protein interactions according to claim 2, wherein the step of sequentially calculating the sequence similarity between M lncRNA subsequences included in different sequence groups by an edit distance method, specifically comprises:

step 505, sequentially taking the different lncRNA sequence comparison groups as the current lncRNA sequence comparison group, and repeatedly executing the steps 502 to 504 to determine the similarity between the first lncRNA sequence and the second lncRNA sequence in the different lncRNA sequence comparison groups;

the step of constructing the sequence group association diagram based on the lncRNA similarity according to the sequence similarity among M lncRNA subsequences contained in different sequence groups specifically comprises the following steps:

4. The method for predicting lncRNA-protein interactions as set forth in claim 3, wherein the step of sequentially calculating the sequence similarity between the protein sequences included in the different sequence groups by using a local sequence comparison algorithm comprises:

step 705, sequentially taking the different protein sequence comparison groups as the current protein sequence comparison group, and repeatedly executing the steps 702 to 704 to determine the similarity between the first protein sequence and the second protein sequence in the different protein sequence comparison groups;

the step of constructing the sequence group association diagram based on the protein similarity according to the sequence similarity among the protein sequences contained in different sequence groups specifically comprises the following steps:

5. The lncRNA-protein interaction prediction method of claim 4, wherein the step of reconstructing the first and second correlation maps by using a map self-encoder in a preset interaction prediction model to obtain target matrices corresponding to the first and second correlation maps, respectively, specifically comprises:

constructing a first adjacency matrix according to node connecting lines among the N associated nodes in the first association graph;

constructing a second adjacency matrix according to node connecting lines among the N associated nodes in the second association graph;

inputting the first adjacency matrix and the second adjacency matrix into the graph self-encoder, wherein the graph self-encoder comprises an encoding layer based on a graph convolutional network (GCN) and a decoding layer based on an inner-product algorithm;

obtaining embedded vectors corresponding to each associated node of the first adjacent matrix respectively according to the coding layer, and embedded vectors corresponding to each associated node of the second adjacent matrix respectively;

reconstructing the first adjacent matrix according to the embedded vectors respectively corresponding to the decoding layer and each associated node of the first adjacent matrix to obtain a target matrix corresponding to the first associated graph;

and reconstructing the second adjacent matrix according to the embedded vectors respectively corresponding to the decoding layer and each associated node of the second adjacent matrix to obtain a target matrix corresponding to the second associated graph.

6. The lncRNA-protein interaction prediction method of claim 1 or 5, wherein before performing the step of performing the graph reconstruction on the first and second correlation graphs, respectively, using a graph self-encoder in a preset interaction prediction model, the method further comprises:

obtaining a sequence group of known interaction from a preset database, and dividing a training set and a verification set, wherein the preset database comprises an NPInter database and a lncPro database, and the sequence group of known interaction represents the sequence group of interaction between a known lncRNA sequence and a protein sequence;

acquiring group characterization vectors corresponding to each sequence group in the training set and acquiring group characterization vectors corresponding to each sequence group in the verification set through a graph self-encoder in the interaction prediction model;

performing initial training on the predictors of the interaction prediction model according to the group characterization vectors corresponding to each sequence group in the training set to obtain predictors with completed initial training;

performing iterative optimization training on the initially trained predictor based on the group characterization vectors corresponding to each sequence group in the verification set;

and completing the training of the interaction prediction model when the loss value of the interaction prediction model meets a preset loss condition, wherein the preset loss condition is as follows: the parameters of the interaction prediction model are optimized through a back-propagation algorithm that minimizes the loss value; iterative optimization training is carried out until the number of iterations reaches a preset maximum number; the loss value of the graph self-encoder and the loss value of the predictor before each round of iterative training are obtained and summed; and the model parameters for which the sum of the loss value of the graph self-encoder and the loss value of the predictor is smallest are identified by comparison and taken as the parameters of the trained interaction prediction model.
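
The parameter-selection rule in the preset loss condition can be sketched as follows. This is only an illustration under stated assumptions: `train_with_best_checkpoint`, `toy_train_step` and the hard-coded loss schedule are hypothetical names and data, and a real training step would run back-propagation on the graph self-encoder and the predictor rather than read losses from a list.

```python
import copy

def train_with_best_checkpoint(params, train_step, max_iterations):
    """Run up to max_iterations rounds and keep the parameters whose summed
    graph self-encoder loss and predictor loss is smallest."""
    best_params, best_total = copy.deepcopy(params), float("inf")
    for _ in range(max_iterations):
        params, gae_loss, predictor_loss = train_step(params)
        total = gae_loss + predictor_loss          # sum of the two loss values
        if total < best_total:                     # comparison across rounds
            best_params, best_total = copy.deepcopy(params), total
    return best_params, best_total

# Hypothetical (graph self-encoder loss, predictor loss) per round; made-up numbers.
TOY_LOSSES = [(0.90, 0.70), (0.50, 0.40), (0.45, 0.50)]

def toy_train_step(params):
    """Stand-in for one optimization round; only advances a step counter."""
    step = params["step"] + 1
    gae_loss, predictor_loss = TOY_LOSSES[step - 1]
    return {"step": step}, gae_loss, predictor_loss

best_params, best_total = train_with_best_checkpoint(
    {"step": 0}, toy_train_step, max_iterations=3
)
# The second round has the smallest summed loss, so its parameters are kept.
```

Keeping a deep copy of the best parameters means the final round does not have to be the selected one, which matches the comparison step in the claim.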

7. The lncRNA-protein interaction prediction method of claim 5, wherein, before the step of splicing the embedded vector corresponding to the lncRNA sequence and the embedded vector corresponding to the protein sequence in each sequence group, based on the target matrices respectively corresponding to the first association graph and the second association graph, to obtain a group characterization vector, the method further comprises:

obtaining, through the target matrix corresponding to the first association graph, the embedded vectors respectively corresponding to each lncRNA sequence in the N sequence groups;

acquiring, through the adjacency matrix corresponding to the second association graph, the embedded vectors respectively corresponding to each protein sequence in the N sequence groups;

the step of splicing the embedded vector corresponding to the lncRNA sequence and the embedded vector corresponding to the protein sequence in each sequence group, based on the target matrices respectively corresponding to the first association graph and the second association graph, to obtain a group characterization vector specifically comprises:

acquiring and splicing an embedded vector corresponding to the lncRNA sequence and an embedded vector corresponding to the protein sequence in the current sequence group to serve as a group characterization vector of the current sequence group;

and sequentially taking each of the different sequence groups as the current sequence group, to obtain the group characterization vectors respectively corresponding to the different sequence groups.
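
The splicing step can be illustrated with a short NumPy sketch. It assumes that the embedded vectors are already available as the rows of two arrays and that each sequence group is given as an (lncRNA index, protein index) pair; the function name, the array layout and the toy numbers are assumptions for illustration, not details taken from the application.

```python
import numpy as np

def group_characterization_vectors(lnc_embeddings, protein_embeddings, groups):
    """Concatenate, for each (lncRNA index, protein index) pair, the lncRNA
    embedded vector with the protein embedded vector."""
    return np.stack([
        np.concatenate([lnc_embeddings[i], protein_embeddings[j]])
        for i, j in groups
    ])

# Toy embedded vectors: 2 lncRNA sequences (dim 2), 3 protein sequences (dim 3).
lnc_embeddings = np.array([[0.1, 0.2],
                           [0.3, 0.4]])
protein_embeddings = np.array([[1.0, 1.1, 1.2],
                               [2.0, 2.1, 2.2],
                               [3.0, 3.1, 3.2]])
groups = [(0, 2), (1, 0)]   # each sequence group pairs one lncRNA with one protein

vectors = group_characterization_vectors(lnc_embeddings, protein_embeddings, groups)
# vectors has one row per sequence group and 2 + 3 = 5 columns.
```

Each row of `vectors` would then serve as the input to the predictor for the corresponding sequence group.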

8. An lncRNA-protein interaction prediction apparatus comprising:

9. A computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the lncRNA-protein interaction prediction method of any of claims 1 to 7.

10. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the lncRNA-protein interaction prediction method of any of claims 1 to 7.