CN111613273A - Model training method, protein interaction prediction method, device and medium - Google Patents


Info

Publication number
CN111613273A
Authority
CN
China
Prior art keywords
information
protein
description information
coding
artificial intelligence
Prior art date
Legal status
Granted
Application number
CN202010277398.9A
Other languages
Chinese (zh)
Other versions
CN111613273B (en)
Inventor
赵拴平
金海�
李默
贾玉堂
徐磊
Current Assignee
Institute of Animal Husbandry and Veterinary Medicine of Anhui Academy of Agricultural Sciences
Original Assignee
Institute of Animal Husbandry and Veterinary Medicine of Anhui Academy of Agricultural Sciences
Priority date
Filing date
Publication date
Application filed by Institute of Animal Husbandry and Veterinary Medicine of Anhui Academy of Agricultural Sciences filed Critical Institute of Animal Husbandry and Veterinary Medicine of Anhui Academy of Agricultural Sciences
Priority to CN202010277398.9A
Publication of CN111613273A
Application granted
Publication of CN111613273B
Legal status: Active

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 — ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks


Abstract

The invention discloses a model training method, a protein interaction prediction method, a device and a storage medium. The model training method comprises obtaining training data consisting of first description information, second description information and label information, and using the training data to perform at least one iteration operation until the loss function of an artificial intelligence model meets a convergence condition. Training gives the artificial intelligence model the ability to measure the similarity between the sequence information of different proteins; judging whether two proteins interact according to this similarity accords better with the mechanism of protein interaction, so the trained model can accurately predict whether two proteins interact. The invention is widely applicable in the field of computer technology.

Description

Model training method, protein interaction prediction method, device and medium
Technical Field
The invention relates to the technical field of computers, in particular to an artificial intelligence model training method, a protein interaction prediction method, a device and a storage medium.
Background
Proteins are the material basis of life: their interactions influence basic cellular processes and play an important role in realizing various physiological functions in living organisms, so studying protein-protein interactions is an important subject. Studying these interactions through biological experiments requires testing many protein pairs, which is costly in both time and money. Techniques have therefore emerged that determine by computation whether an interaction occurs between proteins. Some prior-art techniques use computer coding algorithms to turn protein sequence data into feature vectors, and then use a trained classifier to predict from the feature vectors whether an interaction occurs. These techniques have two limitations: the prediction result depends on how the protein sequence data is vectorized, i.e. processing the same sequence data with different coding algorithms at the start may yield different final predictions; and the principle relied on is the classification of proteins, but proteins are so varied that the prior art can only classify them coarsely, making the final prediction unreliable.
Disclosure of Invention
In view of at least one of the above technical problems, it is an object of the present invention to provide an artificial intelligence model training method, a protein interaction prediction method, an apparatus and a storage medium.
In one aspect, an embodiment of the present invention includes an artificial intelligence model training method, including the following steps:
acquiring a plurality of groups of training data; the training data comprises first description information used for describing a first protein, second description information used for describing a second protein and label information used for describing whether an interaction occurs between the first protein and the second protein, wherein the first description information is obtained by processing a plurality of different coding information of the first protein, and the second description information is obtained by processing a plurality of different coding information of the second protein;
iteratively training the artificial intelligence model using the training data; the iterative training comprises at least one iteration operation; each iteration operation obtains the first description information, the second description information and the label information from the training data, inputs the first description information into the artificial intelligence model to obtain corresponding first feature information, inputs the second description information into the artificial intelligence model to obtain corresponding second feature information, obtains the distance between the first feature information and the second feature information, determines the value of the loss function for that iteration operation from the distance and the label information, and updates the parameters of the artificial intelligence model according to the value of the loss function.
Further, the artificial intelligence model training method further comprises the following steps:
acquiring first sequence information corresponding to the first protein and second sequence information corresponding to the second protein;
processing the first sequence information by using a plurality of coding algorithms to obtain first coding information respectively output by each coding algorithm;
processing the second sequence information by using a plurality of coding algorithms to acquire second coding information respectively output by each coding algorithm;
fusing each piece of first coding information into the first description information, and fusing each piece of second coding information into the second description information;
and obtaining the label information for describing whether the interaction between the first protein and the second protein occurs from the protein interaction database.
Further, the step of fusing each piece of first encoding information into the first description information and each piece of second encoding information into the second description information specifically includes:
linearly interpolating at least one piece of the first coding information so that all of the first coding information has length d1;
linearly interpolating at least one piece of the second coding information so that all of the second coding information has length d1, where d1 is the maximum of the lengths of the first coding information and the second coding information before linear interpolation;
normalizing each piece of first coding information and then arranging the normalized pieces in parallel into the first description information;
and normalizing each piece of second coding information and then arranging the normalized pieces in parallel into the second description information.
Further, the artificial intelligence model is a Siamese network, the Siamese network comprises a convolution layer, a pooling layer, a first full-connection layer and a second full-connection layer which are sequentially connected, and the convolution layer and the first full-connection layer are activated by a ReLU function.
Further, the loss function is:

L = \frac{1}{2N}\sum_{j=1}^{N}\left[ y_j D_j^2 + (1 - y_j)\,\max(m - D_j, 0)^2 \right]

wherein N represents the total number of sets of the training data; y_j represents the label information in the j-th set of the training data, y_j = 0 meaning that no interaction occurs between the first protein and the second protein in the j-th set and y_j = 1 meaning that an interaction occurs between them; D_j represents the Euclidean distance between the first feature information and the second feature information in the j-th set of the training data; and m represents a set first threshold.
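For illustration, the loss function of the preceding claim — a contrastive loss over the per-pair distances D_j and labels y_j — can be sketched in a few lines of Python. The function name and sample values here are illustrative only, and the 1/(2N) scaling is one common convention for this loss.

```python
import numpy as np

def contrastive_loss(d, y, m=2.0):
    """Contrastive loss over N training pairs.

    d : Euclidean distances D_j between first and second feature information
    y : labels y_j (1 = interaction occurs, 0 = no interaction)
    m : the set first threshold (margin)
    """
    n = len(d)
    # interacting pairs are pulled together (y * d^2); non-interacting pairs
    # are pushed apart until their distance exceeds the margin m
    return (1.0 / (2 * n)) * np.sum(y * d**2 + (1 - y) * np.maximum(m - d, 0.0)**2)

# one interacting pair at distance 0.5, one non-interacting pair at distance 1.0
print(contrastive_loss(np.array([0.5, 1.0]), np.array([1, 0])))  # 0.3125
```

Note that a non-interacting pair whose distance already exceeds the margin contributes zero loss, so the model is not pushed to separate such pairs further.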
In another aspect, the present invention further provides a method for predicting protein interaction, including the following steps:
acquiring third description information for describing a third protein and fourth description information for describing a fourth protein; the third description information is obtained by processing a plurality of different coding information of the third protein, and the fourth description information is obtained by processing a plurality of different coding information of the fourth protein;
inputting the third description information into the artificial intelligence model to obtain corresponding third feature information, and inputting the fourth description information into the artificial intelligence model to obtain corresponding fourth feature information; the artificial intelligence model is trained by a training method in an embodiment;
determining a distance between the third feature information and the fourth feature information;
comparing the distance with a set second threshold value;
determining whether an interaction occurs between the third protein and the fourth protein according to the comparison result.
Further, the step of acquiring third description information for describing the third protein and fourth description information for describing the fourth protein specifically includes:
acquiring third sequence information corresponding to the third protein and fourth sequence information corresponding to the fourth protein;
processing the third sequence information by using a plurality of coding algorithms to obtain third coding information respectively output by each coding algorithm;
processing the fourth sequence information by using a plurality of coding algorithms to acquire fourth coding information output by each coding algorithm;
and fusing each piece of third encoding information into the third description information, and fusing each piece of fourth encoding information into the fourth description information.
Further, the step of determining whether an interaction occurs between the third protein and the fourth protein according to the comparison result specifically includes:
determining that an interaction occurs between the third protein and the fourth protein when the distance is less than the second threshold as a result of the comparison;
and when the distance is greater than or equal to the second threshold value as a result of the comparison, determining that no interaction occurs between the third protein and the fourth protein.
In another aspect, embodiments of the present invention further include a computer apparatus including a memory for storing at least one program and a processor for loading the at least one program to perform the artificial intelligence model training method and/or the protein interaction prediction method described in the embodiments.
In another aspect, the present invention further includes a storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the artificial intelligence model training method and/or the protein interaction prediction method of the embodiments.
The invention has the beneficial effects that: in the artificial intelligence model training method of the embodiments, the training data used to train the artificial intelligence model include description information of multiple proteins, obtained by encoding the protein sequence information in different encoding modes and fusing the results; the description information included in the training data therefore contains information from multiple encoding modes. Compared with a single encoding mode, such description information describes the protein sequence information more fully, and because it contains information from multiple encoding modes it carries a certain information redundancy, which makes up for the respective deficiencies of the individual encoding modes; fusing the coding information into the description information exploits the information of multiple encoding modes and avoids the information loss caused by using only a single encoding mode. Training the artificial intelligence model with these data gives the model the ability to measure the similarity between the sequence information of different proteins; judging whether two proteins interact according to this similarity accords better with the mechanism of protein interaction, so the artificial intelligence model trained in the embodiments can accurately predict whether two proteins interact.
In the protein interaction prediction method of the embodiments, the Euclidean distance between the third feature information and the fourth feature information expresses the similarity between the third protein and the fourth protein: the smaller the distance, the higher the similarity. When this distance is smaller than the second threshold, the probability that the third and fourth proteins interact is considered sufficiently large, and on this basis it is determined that an interaction occurs; this judgment achieves high accuracy.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence model training method in an embodiment;
FIG. 2 is a flowchart of a protein interaction prediction method in the examples.
Detailed Description
Example 1
In this embodiment, a training method for an artificial intelligence model is provided, which trains the model so that the trained model can predict whether an interaction can occur between proteins. The artificial intelligence model to be trained may be a Siamese network.
In this embodiment, the artificial intelligence model to be trained is a Siamese network, and two sets of description information input in the Siamese network share network parameters, so that the network parameters for deep learning are reduced, the training process is easier to learn, and the stability is better.
In this embodiment, the artificial intelligence model training method includes the following steps:
p1, acquiring a plurality of groups of training data;
p2, performing iterative training on the artificial intelligence model by using the training data; the iterative training comprises at least one iterative operation, and each iterative operation is used for updating parameters of the artificial intelligence model.
In this embodiment, the principle of the artificial intelligence model training method is shown in fig. 1. Each set of acquired training data includes first descriptive information, second descriptive information, and label information. For a set of training data, the included first description information is obtained by processing a plurality of different coding information of the first protein, so that the first description information can describe the first protein; the second description information is obtained by processing a plurality of different codes of the second protein, so that the second description information can describe the second protein; the tag information may describe whether the first protein and the second protein to which the set of training data corresponds are capable of interacting.
In this embodiment, "first" and "second" in the first protein, the second protein, the first description information and the second description information serve to distinguish items within the same set of training data, that is, to distinguish the two pieces of description information in one set and the two proteins they correspond to. Since training may run multiple iterations over multiple sets of training data, reading at least one set each time to update the parameters of the artificial intelligence model, "first" and "second" are not used to distinguish description information or proteins across different sets: each set of training data has its own first protein, second protein, first description information and second description information; the first protein of one set and the first protein of another set may be the same protein or different proteins, and likewise the first description information in one set and in another set may have the same content or different content.
In this embodiment, when acquiring and storing multiple sets of training data, each set may be stored separately, that is, each set is taken as a whole and its first description information, second description information and label information are stored together. In the case of n proteins, pairwise pairing forms C(n, 2) = n(n − 1)/2 combinations, i.e. n(n − 1)/2 sets of training data in total; each set comprises 1 piece of first description information, 1 piece of second description information and 1 piece of label information, so all the training data together comprise 3n(n − 1)/2 pieces of information.
In this embodiment, when acquiring and storing multiple sets of training data, the description information corresponding to the n proteins may instead be acquired first, giving n pieces of description information in total, which are stored and numbered: description information 1, description information 2, description information 3, ..., description information n. The n proteins pairwise form n(n − 1)/2 combinations; whether the two proteins of each combination interact under experimental conditions is then determined by querying a protein interaction database such as DIP, BIND, BioGRID, I2D, IntAct, Mentha, MINT or Reactome, thereby determining the label information corresponding to that combination, i.e. n(n − 1)/2 pieces of label information in total. All the training data then consist of n pieces of description information and n(n − 1)/2 pieces of label information, i.e. n + n(n − 1)/2 pieces of information in total. When a set of training data is read to train the artificial intelligence model, a pair of description information can be read at random to serve as the first description information and the second description information of the set; the pair represents a pair of proteins, and the label information describing whether that pair of proteins can interact is read as well. The label information together with the read first description information and second description information form a complete set of training data.
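The second storage scheme — one description per protein plus one label per unordered pair — can be sketched as follows (the names are illustrative, not from the patent):

```python
from itertools import combinations

def build_pairs(protein_ids):
    """All unordered pairs of proteins; each pair needs one piece of label information."""
    return list(combinations(protein_ids, 2))

proteins = ["P1", "P2", "P3", "P4"]
pairs = build_pairs(proteins)
print(len(pairs))                  # n(n-1)/2 = 6 pairs for n = 4
# total stored items under this scheme: n descriptions + one label per pair
print(len(proteins) + len(pairs))  # 4 + 6 = 10
```

Storing each description once and pairing at read time avoids duplicating description information across the n(n − 1)/2 training sets.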
In this embodiment, the description information of a protein is obtained by processing the sequence data of the protein. The following steps obtain the description information corresponding to a protein:
p101, acquiring sequence information corresponding to the protein; in this embodiment, the obtained sequence information reflects the components and structure of the protein, and has uniqueness;
p102, processing the sequence information by using a plurality of coding algorithms to obtain the coding information respectively output by each coding algorithm; in this embodiment, the sequence information may be processed using a triplet coding algorithm (CT), an auto-covariance coding algorithm (AC), a local feature coding algorithm (LD), and the like, and each coding algorithm outputs one piece of coding information;
p103, fusing the coding information output by each coding algorithm into description information; in this embodiment, the coding information output by the triplet coding algorithm, the coding information output by the auto-covariance coding algorithm, and the coding information output by the local feature coding algorithm are fused into the description information.
The coding information output by the triplet coding algorithm is a 343-dimensional vector, the coding information output by the auto-covariance coding algorithm is a 420-dimensional vector, and the coding information output by the local feature coding algorithm is a 630-dimensional vector, so step P103 is to fuse a 343-dimensional vector, a 420-dimensional vector, and a 630-dimensional vector into a matrix, i.e., description information. In this embodiment, in order to enable the vectors of different dimensions to be fused, the vector having the highest dimension among the vectors to be fused is first determined, and then the vectors having other dimensions less than the highest dimension are complemented to have the highest dimension. In this embodiment, the vector with the highest dimension is the coding information output by the local feature coding algorithm, and the dimension of the vector is 630 dimensions, so that 343-dimensional coding information output by the triplet coding algorithm and 420-dimensional coding information output by the auto-covariance coding algorithm need to be respectively changed into 630-dimensional vectors by means of linear interpolation.
After linear interpolation is completed, the coding information output by the triplet coding algorithm, the coding information output by the autocovariance coding algorithm and the coding information output by the local feature coding algorithm are all 630-dimensional vectors. In this embodiment, the 3 pieces of encoded information may be normalized respectively to be row vectors or column vectors, so as to form a matrix with a size of 630 × 3 or 3 × 630, where the matrix is the description information obtained by fusion.
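The interpolation-and-normalization fusion described above can be sketched as follows; min-max normalization is assumed here, since the embodiment does not name a specific normalization, and the function name is illustrative.

```python
import numpy as np

def fuse_encodings(encodings):
    """Fuse coding vectors of different lengths into one description matrix.

    Each vector is stretched to the longest length d by linear interpolation,
    min-max normalized (an assumption), and stacked as a column, giving a
    d x t matrix for t coding algorithms.
    """
    d = max(len(e) for e in encodings)
    cols = []
    for e in encodings:
        e = np.asarray(e, dtype=float)
        # resample onto d evenly spaced points across the original index range
        stretched = np.interp(np.linspace(0, len(e) - 1, d), np.arange(len(e)), e)
        lo, hi = stretched.min(), stretched.max()
        cols.append((stretched - lo) / (hi - lo) if hi > lo else np.zeros(d))
    return np.stack(cols, axis=1)

# dimensions from the embodiment: CT -> 343, AC -> 420, LD -> 630
desc = fuse_encodings([np.random.rand(343), np.random.rand(420), np.random.rand(630)])
print(desc.shape)  # (630, 3)
```

Transposing the result gives the alternative 3 × 630 layout mentioned in the text.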
Since "protein" in steps P101-P103 may refer to any protein, those skilled in the art may apply steps P101-P103 to step P1 to obtain the first description information corresponding to the first protein and the second description information corresponding to the second protein when constructing each set of training data.
As can be seen from steps P101 to P103, when 3 encoding algorithms are used and the maximum dimension of the encoded information they output is 630, a matrix of size 630 × 3 or 3 × 630 is obtained as description information. By analogy, when t encoding algorithms are used and the maximum dimension of the encoded information they output is d, a matrix of size d × t or t × d is obtained as description information; that is, the first description information and the second description information mentioned in this embodiment both have size d × t or t × d, so the structure of the artificial intelligence model should match the size of the first and second description information it receives. In this embodiment, assuming d = 600 and t = 3, i.e. the first and second description information both have size 600 × 3, a Siamese network is used as the artificial intelligence model to be trained; its structure is shown in Table 1.
TABLE 1

Layer | Input | Output
Convolution layer (comprising a 1 × 3 convolution kernel) | 600 × 3 | 600 × 1
Pooling layer | 600 × 1 | 300 × 1
First fully-connected layer | 150 × 300 | 150 × 1
Second fully-connected layer | 50 × 100 | 50 × 1
When step P2 is executed to train the Siamese network, training is divided into multiple iteration operations. After each iteration operation ends, it is checked whether the loss function of the Siamese network meets the convergence condition or whether the number of iteration operations has reached a set value; if so, training of the Siamese network ends, and if not, the next iteration operation is executed.
In each iteration operation, at least one set of training data, i.e. first description information, second description information and label information, is read, and the first and second description information are input into the Siamese network in turn.
As shown in Table 1, the Siamese network comprises a convolution layer, a pooling layer, a first fully-connected layer and a second fully-connected layer connected in sequence. The network first convolves the input 600 × 3 first description information with a 1 × 3 convolution kernel to obtain a 600 × 1 column vector, then obtains a 300 × 1 column vector through the max-pooling operation of the pooling layer, and then applies two fully-connected operations through the first and second fully-connected layers. This is the process by which the Siamese network processes the first description information to obtain the first feature information; the processing of the second description information to obtain the second feature information is similar.
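A rough NumPy sketch of one branch of this forward pass, with random weights. The 1 × 3 kernel is applied across the three coding channels at each of the 600 rows, and the fully-connected weight shapes are assumptions chosen simply to chain the stated output sizes (600 → 300 → 150 → 50); both inputs go through the same parameters, which is the weight sharing that defines a Siamese network.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def branch(desc, params):
    """One branch of the Siamese network: conv -> max-pool -> fc1 -> fc2."""
    k, W1, b1, W2, b2 = params
    x = relu(desc @ k)                  # 1x3 kernel per row: 600x3 -> 600x1
    x = x.reshape(300, 2).max(axis=1)   # max pooling, window 2: -> 300x1
    x = relu(W1 @ x + b1)               # first fully-connected: -> 150x1
    return W2 @ x + b2                  # second fully-connected: -> 50x1

params = (rng.normal(size=3),
          0.01 * rng.normal(size=(150, 300)), np.zeros(150),
          0.01 * rng.normal(size=(50, 150)), np.zeros(50))

f1 = branch(rng.random((600, 3)), params)  # first feature information
f2 = branch(rng.random((600, 3)), params)  # same params: shared weights
print(f1.shape, float(np.linalg.norm(f1 - f2)))  # (50,) and the distance D
```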
The Siamese network processes description information to obtain feature information, so it can be regarded as a function f with parameters w; f(x; w) then denotes the feature information obtained by the Siamese network processing description information x. Let the first description information in the j-th set of training data input to the Siamese network be x_j^{(1)} and the second description information be x_j^{(2)}; the first feature information output by the Siamese network can then be expressed as f(x_j^{(1)}; w) and the second feature information as f(x_j^{(2)}; w). The label information in the j-th set of training data is denoted y_j, where y_j = 0 means that no interaction occurs between the first and second proteins corresponding to the first and second description information in the j-th set, and y_j = 1 means that an interaction occurs between them.
The loss function of the Siamese network can be expressed as

L = \frac{1}{2N}\sum_{j=1}^{N}\left[ y_j D_j^2 + (1 - y_j)\,\max(m - D_j, 0)^2 \right]

wherein D_j = \left\| f(x_j^{(1)}; w) - f(x_j^{(2)}; w) \right\|_2 represents the Euclidean distance between the first feature information and the second feature information in the j-th set of training data, N is the number of sets of training data read, and m is a settable first threshold, generally set to a value between 2 and 10.
The convergence condition of the loss function may be: the value of the loss function is smaller than a preset threshold, or the number of iteration operations performed has reached a set value even though the value of the loss function is not yet smaller than the preset threshold.
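This stop criterion can be sketched as a simple loop; step_fn here stands in for one iteration operation (forward pass, loss evaluation, parameter update) and is purely illustrative.

```python
def train(step_fn, max_iters=1000, tol=1e-3):
    """Run iteration operations until the loss falls below tol
    (convergence condition) or the iteration budget is exhausted."""
    for i in range(1, max_iters + 1):
        loss = step_fn()
        if loss < tol:           # convergence condition met
            return i, loss
    return max_iters, loss       # stopped by iteration count instead

# toy stand-in: the loss halves on every iteration operation
losses = iter([0.5 * 0.5 ** k for k in range(50)])
iters, final = train(lambda: next(losses))
print(iters, final)  # converges on the 10th iteration
```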
In this embodiment, the training data used to train the artificial intelligence model include description information of multiple proteins, obtained by encoding the protein sequence information in different encoding modes and fusing the results; the description information in the training data therefore contains information from multiple encoding modes. Compared with a single encoding mode, it describes the sequence information of the proteins more fully, and its information redundancy makes up for the respective deficiencies of the individual encoding modes, exploiting the information of multiple encoding modes and avoiding the information loss caused by using only a single encoding mode. Training the artificial intelligence model with these data gives the model the ability to measure the similarity between the sequence information of different proteins; judging whether two proteins interact according to this similarity accords better with the mechanism of protein interaction, so the artificial intelligence model trained in this embodiment can accurately predict whether two proteins interact.
Example 2
Based on the artificial intelligence model trained in embodiment 1, this embodiment performs a protein interaction prediction method using that model. Referring to fig. 2, the protein interaction prediction method includes the following steps:
s1, acquiring third description information for describing a third protein and fourth description information for describing a fourth protein; in this embodiment, referring to steps P101 to P103 in embodiment 1, after third sequence information corresponding to a third protein is obtained, the third sequence information is encoded by using a triplet coding algorithm, an autocovariance coding algorithm, and a local feature coding algorithm, respectively, and then the coding information obtained by the three coding algorithms is subjected to fusion processing, so as to obtain third description information; based on the same principle, fourth description information can be obtained;
s2, inputting third description information into the artificial intelligence model obtained in the embodiment 1, and receiving third characteristic information output by the artificial intelligence model; inputting the fourth description information into the artificial intelligence model obtained in the embodiment 1, and receiving fourth feature information output by the artificial intelligence model;
s3, determining an Euclidean distance between the third characteristic information and the fourth characteristic information;
S4, comparing the Euclidean distance between the third characteristic information and the fourth characteristic information with a set second threshold; in this embodiment, the second threshold is half of the first threshold m in embodiment 1, namely m/2;
S5, if the Euclidean distance between the third characteristic information and the fourth characteristic information is smaller than the second threshold m/2, determining that an interaction will occur between the third protein and the fourth protein; if the Euclidean distance between the third characteristic information and the fourth characteristic information is greater than or equal to the second threshold m/2, determining that no interaction occurs between the third protein and the fourth protein.
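Steps S3 to S5 reduce to a single distance comparison against half the training margin. A minimal sketch, assuming the two feature vectors are plain numeric arrays and the second threshold is m/2 as described above (the function name is illustrative):

```python
import numpy as np

def predict_interaction(feat_a, feat_b, m=2.0):
    """Decide whether two proteins interact from the euclidean distance
    between their feature vectors; the second threshold is half the
    first threshold (margin) m used during training."""
    dist = float(np.linalg.norm(np.asarray(feat_a, dtype=float)
                                - np.asarray(feat_b, dtype=float)))
    return dist < m / 2          # True: interaction predicted
```

Using half the margin places the decision boundary midway between the distance-0 target for interacting pairs and the distance-m target for non-interacting pairs.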
In this embodiment, by performing steps S1 to S5, the third description information and the fourth description information are processed using the trained artificial intelligence model obtained in embodiment 1. This extracts the high-level feature information of the third protein and the fourth protein, namely the third feature information and the fourth feature information, which can essentially express the two proteins, so that the similarity between the third feature information and the fourth feature information reflects the likelihood of an interaction between the third protein and the fourth protein. The Euclidean distance between the third feature information and the fourth feature information expresses this similarity: the smaller the Euclidean distance, the higher the similarity. When the Euclidean distance between the third feature information and the fourth feature information is smaller than the second threshold, the possibility of an interaction between the third protein and the fourth protein is considered sufficiently large, and on the basis of this analysis it is determined that the interaction will occur; this determination achieves a high accuracy.
Example 3
In this embodiment, a computer apparatus includes a memory for storing at least one program and a processor for loading the at least one program to perform the artificial intelligence model training method described in embodiment 1 or the protein interaction prediction method described in embodiment 2, achieving the same technical effects as described in embodiment 1 and embodiment 2.
In this embodiment, a storage medium stores processor-executable instructions which, when executed by a processor, perform the artificial intelligence model training method described in embodiment 1 or the protein interaction prediction method described in embodiment 2, achieving the same technical effects as described in those embodiments.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, the descriptions of upper, lower, left, right, etc. used in the present disclosure are only relative to the mutual positional relationship of the constituent parts of the present disclosure in the drawings. As used in this disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this example have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language ("e.g.," "such as," or "like") provided with this embodiment is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer-readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application-specific integrated circuit for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described in this embodiment, transforming the input data to generate output data that is stored to a non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the present invention, the transformed data represents a physical and tangible object, including a particular visual depiction of the physical and tangible object produced on a display.
The above description covers only preferred embodiments of the present invention, and the present invention is not limited to the above embodiments. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention, as long as they achieve the technical effects of the present invention by the same means. Other modifications and variations of the technical solution and/or its implementation are likewise possible within the protection scope of the invention.

Claims (10)

1. An artificial intelligence model training method is characterized by comprising the following steps:
acquiring a plurality of groups of training data; the training data comprises first description information used for describing a first protein, second description information used for describing a second protein and label information used for describing whether an interaction occurs between the first protein and the second protein, wherein the first description information is obtained by processing a plurality of different coding information of the first protein, and the second description information is obtained by processing a plurality of different coding information of the second protein;
iteratively training the artificial intelligence model using the training data; the iterative training comprises at least one iterative operation; in each iteration operation, the first description information, the second description information and the label information in the training data are obtained, the first description information is input into the artificial intelligence model to obtain corresponding first feature information, the second description information is input into the artificial intelligence model to obtain corresponding second feature information, the distance between the first feature information and the second feature information is obtained, the value of the loss function corresponding to the iteration operation is determined according to the distance and the label information, and the parameter of the artificial intelligence model is updated according to the value of the loss function.
2. The artificial intelligence model training method of claim 1, further comprising the steps of:
acquiring first sequence information corresponding to the first protein and second sequence information corresponding to the second protein;
processing the first sequence information by using a plurality of coding algorithms to obtain first coding information respectively output by each coding algorithm;
processing the second sequence information by using a plurality of coding algorithms to acquire second coding information respectively output by each coding algorithm;
fusing each piece of first coding information into the first description information, and fusing each piece of second coding information into the second description information;
and obtaining the label information for describing whether the interaction between the first protein and the second protein occurs from the protein interaction database.
3. The method for training an artificial intelligence model according to claim 2, wherein the step of fusing each of the first encoded information into the first description information and each of the second encoded information into the second description information specifically includes:
linearly interpolating at least one piece of the first encoded information so that every piece of the first encoded information has a length d1;
linearly interpolating at least one piece of the second encoded information so that every piece of the second encoded information has the length d1; the d1 is the maximum of the lengths of the pieces of the first encoded information and the second encoded information before linear interpolation;
normalizing each piece of the first encoded information and then concatenating the normalized pieces in parallel into the first description information;
and normalizing each piece of the second encoded information and then concatenating the normalized pieces in parallel into the second description information.
4. The artificial intelligence model training method of any one of claims 1-3, wherein the artificial intelligence model is a Siamese network comprising a convolutional layer, a pooling layer, a first fully-connected layer, and a second fully-connected layer connected in series, the convolutional layer and the first fully-connected layer being activated by a ReLU function.
5. A method for artificial intelligence model training according to any one of claims 1-3 wherein the loss function is:
L = (1/2N) · Σ_{j=1}^{N} [ y_j · D_j² + (1 − y_j) · max(m − D_j, 0)² ]
wherein N represents the total number of sets of the training data; y_j represents the label information in the jth set of the training data, y_j = 0 meaning that no interaction occurs between the first protein and the second protein in the jth set of the training data, and y_j = 1 meaning that an interaction occurs between the first protein and the second protein in the jth set of the training data; D_j represents the Euclidean distance between the first feature information and the second feature information in the jth set of the training data; and m represents a set first threshold.
6. A method for predicting protein interactions, comprising the steps of:
acquiring third description information for describing a third protein and fourth description information for describing a fourth protein; the third description information is obtained by processing a plurality of different coding information of the third protein, and the fourth description information is obtained by processing a plurality of different coding information of the fourth protein;
inputting the third description information into the artificial intelligence model to obtain corresponding third feature information, and inputting the fourth description information into the artificial intelligence model to obtain corresponding fourth feature information; the artificial intelligence model is trained by the training method according to any one of claims 1 to 5;
determining a distance between the third feature information and the fourth feature information;
comparing the distance with a set second threshold value;
determining whether an interaction occurs between the third protein and the fourth protein according to the comparison result.
7. The method for predicting protein interaction according to claim 6, wherein the step of obtaining third description information describing a third protein and fourth description information describing a fourth protein includes:
acquiring third sequence information corresponding to the third protein and fourth sequence information corresponding to the fourth protein;
processing the third sequence information by using a plurality of coding algorithms to obtain third coding information respectively output by each coding algorithm;
processing the fourth sequence information by using a plurality of coding algorithms to acquire fourth coding information output by each coding algorithm;
and fusing each piece of third encoding information into the third description information, and fusing each piece of fourth encoding information into the fourth description information.
8. The method for predicting protein interaction according to claim 6 or 7, wherein the step of determining whether the interaction between the third protein and the fourth protein occurs according to the comparison result comprises:
determining that an interaction occurs between the third protein and the fourth protein when the distance is less than the second threshold as a result of the comparison;
and when the distance is greater than or equal to the second threshold value as a result of the comparison, determining that no interaction occurs between the third protein and the fourth protein.
9. A computer apparatus comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method of any one of claims 1-8.
10. A storage medium having stored therein processor-executable instructions, which when executed by a processor, are for performing the method of any one of claims 1-8.
CN202010277398.9A 2020-04-10 2020-04-10 Model training method, protein interaction prediction method, device and medium Active CN111613273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010277398.9A CN111613273B (en) 2020-04-10 2020-04-10 Model training method, protein interaction prediction method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010277398.9A CN111613273B (en) 2020-04-10 2020-04-10 Model training method, protein interaction prediction method, device and medium

Publications (2)

Publication Number Publication Date
CN111613273A true CN111613273A (en) 2020-09-01
CN111613273B CN111613273B (en) 2023-03-28

Family

ID=72199487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010277398.9A Active CN111613273B (en) 2020-04-10 2020-04-10 Model training method, protein interaction prediction method, device and medium

Country Status (1)

Country Link
CN (1) CN111613273B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022078170A1 (en) * 2020-10-16 2022-04-21 腾讯科技(深圳)有限公司 Methods for determining interaction information and for training prediction model, an apparatus, and medium
CN115881211A (en) * 2021-12-23 2023-03-31 上海智峪生物科技有限公司 Protein sequence alignment method, device, computer equipment and storage medium
WO2023130200A1 (en) * 2022-01-04 2023-07-13 京东方科技集团股份有限公司 Vector model training method, negative-sample generation method, medium and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080021686A1 (en) * 2006-02-16 2008-01-24 Microsoft Corporation Cluster modeling, and learning cluster specific parameters of an adaptive double threading model
CN107742061A (en) * 2017-09-19 2018-02-27 中山大学 A kind of prediction of protein-protein interaction methods, systems and devices
US20190304568A1 (en) * 2018-03-30 2019-10-03 Board Of Trustees Of Michigan State University System and methods for machine learning for drug design and discovery

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080021686A1 (en) * 2006-02-16 2008-01-24 Microsoft Corporation Cluster modeling, and learning cluster specific parameters of an adaptive double threading model
CN107742061A (en) * 2017-09-19 2018-02-27 中山大学 A kind of prediction of protein-protein interaction methods, systems and devices
US20190304568A1 (en) * 2018-03-30 2019-10-03 Board Of Trustees Of Michigan State University System and methods for machine learning for drug design and discovery

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI, ZHEQIAN ET AL.: "Protein interaction prediction based on an improved support vector machine method", CHINESE JOURNAL OF BIOMEDICAL ENGINEERING *
MA, JIQUAN ET AL.: "Design and implementation of a protein function prediction algorithm based on random walk", JOURNAL OF ENGINEERING OF HEILONGJIANG UNIVERSITY *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022078170A1 (en) * 2020-10-16 2022-04-21 腾讯科技(深圳)有限公司 Methods for determining interaction information and for training prediction model, an apparatus, and medium
CN115881211A (en) * 2021-12-23 2023-03-31 上海智峪生物科技有限公司 Protein sequence alignment method, device, computer equipment and storage medium
CN115881211B (en) * 2021-12-23 2024-02-20 上海智峪生物科技有限公司 Protein sequence alignment method, protein sequence alignment device, computer equipment and storage medium
WO2023130200A1 (en) * 2022-01-04 2023-07-13 京东方科技集团股份有限公司 Vector model training method, negative-sample generation method, medium and device

Also Published As

Publication number Publication date
CN111613273B (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN111613273B (en) Model training method, protein interaction prediction method, device and medium
CN111950638B (en) Image classification method and device based on model distillation and electronic equipment
CN110866181B (en) Resource recommendation method, device and storage medium
EP3792840A1 (en) Neural network method and apparatus
JP7250126B2 (en) Computer architecture for artificial image generation using autoencoders
CN105447498A (en) A client device configured with a neural network, a system and a server system
CN115204183B (en) Knowledge enhancement-based two-channel emotion analysis method, device and equipment
CN111414353A (en) Intelligent missing data filling method and device and computer readable storage medium
EP3766021B1 (en) Cluster compression for compressing weights in neural networks
CN112395979A (en) Image-based health state identification method, device, equipment and storage medium
Ye et al. Variable selection via penalized neural network: a drop-out-one loss approach
CN113033090B (en) Push model training method, data push device and storage medium
CN112529068B (en) Multi-view image classification method, system, computer equipment and storage medium
CN112347361A (en) Method for recommending object, neural network and training method, equipment and medium thereof
WO2019095587A1 (en) Face recognition method, application server, and computer-readable storage medium
Liao et al. Logsig-RNN: A novel network for robust and efficient skeleton-based action recognition
US20220230262A1 (en) Patent assessment method based on artificial intelligence
CN110232154B (en) Random forest-based product recommendation method, device and medium
CN113254687B (en) Image retrieval and image quantification model training method, device and storage medium
WO2022063076A1 (en) Adversarial example identification method and apparatus
EP3888091A1 (en) Machine learning for protein binding sites
US20230410465A1 (en) Real time salient object detection in images and videos
CN113868543B (en) Method for sorting recommended objects, method and device for model training and electronic equipment
CN112686306B (en) ICD operation classification automatic matching method and system based on graph neural network
CN114819140A (en) Model pruning method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant