CN111091874B - Protein feature construction method, apparatus, device, storage medium, and program product
- Publication number
- CN111091874B (application number CN201911329568.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/10—Ontologies; Annotations
Abstract
The application discloses a protein feature construction method. A vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance. When a feature vector needs to be constructed for a protein, such as a protein to be identified, target gene ontology information of that protein is determined according to the gene ontology database, and a target vector corresponding to the target gene ontology information is determined from the pre-obtained vectors according to the identity of the target gene ontology information. A feature vector of the protein to be identified is then constructed from the target vector. Because genes determine the functions and characteristics of proteins, and gene ontology information reflects gene information together with molecular functions or biological processes, constructing the feature vector from gene ontology information takes these aspects of the protein into account, which improves the accuracy of the constructed protein features.
Description
Technical Field
The present application relates to the field of biological information technology, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for constructing protein features.
Background
Proteins are the most important basic units through which living things carry out life activities. They can be regarded as the smallest automatic machines in nature and play an irreplaceable role in the operation of biological systems. Protein function plays an important role in biotechnology and medical research, such as new drug development, new crop development, and the development of synthetic biochemicals such as biofuels.
The feature information of a protein can be used to represent its function, so constructing protein features is important for predicting and classifying protein function. Traditional protein feature construction methods work from the amino acid sequence, for example by counting amino acid occurrence frequencies, calculating amino acid physicochemical properties, or building a protein feature matrix such as a position-specific scoring matrix (PSSM) through homology searches.
However, a protein is determined by its gene and is the product of gene transcription and translation. The methods above consider only sequence information and ignore the gene information and the molecular functions or biological processes of the protein, so the accuracy of the constructed protein features is low.
Disclosure of Invention
In order to solve the technical problems in the related art, the application provides a protein characteristic construction method, a device, equipment, a storage medium and a program product, and the method and the device consider the gene information and the molecular function or the biological process of the protein when constructing the protein characteristic, so that the accuracy of the constructed protein characteristic is improved.
In one aspect, an embodiment of the present application provides a method for constructing protein features, where a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance, the method includes:
determining target gene ontology information of the protein to be identified according to the gene ontology database, wherein the target gene ontology information has an identity;
determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity;
and constructing a characteristic vector of the protein to be identified according to the target vector.
Optionally, the determining mode of the vector corresponding to each piece of gene ontology information in the gene ontology database is as follows:
expressing the category and axiom content included in each piece of gene ontology information as sentences to obtain training corpus, wherein the training corpus comprises sentences corresponding to the category and the axiom content;
and training word vectors according to the training corpus, and generating vectors corresponding to each piece of gene ontology information.
Optionally, if the target gene ontology information includes a plurality of pieces, the constructing the feature vector of the protein to be identified according to the target vector includes:
and adding and averaging the target feature vectors corresponding to the plurality of pieces of target gene ontology information to obtain the feature vectors.
Optionally, the constructing the feature vector of the protein to be identified according to the target vector includes:
performing dimension reduction processing on the target vector according to the preset length of the feature vector;
and constructing the feature vector according to the processed target vector.
Optionally, the feature vector is used for classifying the function of the protein to be identified and/or for predicting the binding site of the protein to be identified.
Optionally, if the feature vector is used for predicting the binding site of the protein to be identified, after the feature vector of the protein to be identified is constructed according to the target vector, the method further includes:
acquiring the residue characteristics of a target binding site in the protein to be identified;
expanding the residue characteristics of the target binding site according to the characteristic vector.
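The optional binding-site step above can be illustrated with a minimal sketch. It assumes that "expanding" means appending the protein-level feature vector to each residue's own feature vector; the text does not specify the operation, so this concatenation, the function name `expand_residue_features`, and the sample values are all illustrative assumptions.

```python
from typing import List, Sequence


def expand_residue_features(
    residue_features: Sequence[Sequence[float]],
    protein_vector: Sequence[float],
) -> List[List[float]]:
    """Append the protein-level feature vector to every residue feature vector.

    Assumption: 'expanding' is modelled as plain concatenation.
    """
    return [list(residue) + list(protein_vector) for residue in residue_features]


# Made-up values: two residues with 3 features each, and a 2-dimensional protein vector.
residues = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
protein_vec = [1.0, 2.0]
expanded = expand_residue_features(residues, protein_vec)
```

Each expanded residue vector then carries both local (residue) and global (gene ontology based) information, which is one plausible way to feed both into a binding-site predictor.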
On the other hand, an embodiment of the present application provides a protein feature construction device, where a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance, and the device includes a first determining unit, a second determining unit, and a construction unit:
the first determining unit is used for determining target gene ontology information of the protein to be identified according to the gene ontology database, wherein the target gene ontology information is provided with an identity;
the second determining unit is used for determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity;
the construction unit is used for constructing the characteristic vector of the protein to be identified according to the target vector.
Optionally, the first determining unit is further configured to:
expressing the category and axiom content included in each piece of gene ontology information as sentences to obtain training corpus, wherein the training corpus comprises sentences corresponding to the category and the axiom content;
and training word vectors according to the training corpus, and generating vectors corresponding to each piece of gene ontology information.
Optionally, if the target gene ontology information includes a plurality of pieces, the construction unit is configured to:
and adding and averaging the target feature vectors corresponding to the plurality of pieces of target gene ontology information to obtain the feature vectors.
Optionally, the construction unit is configured to:
performing dimension reduction processing on the target vector according to the preset length of the feature vector;
and constructing the feature vector according to the processed target vector.
Optionally, the feature vector is used for classifying the function of the protein to be identified and/or for predicting the binding site of the protein to be identified.
Optionally, if the feature vector is used to predict the binding site of the protein to be identified, the device further comprises an acquisition unit and an expansion unit:
the acquisition unit is used for acquiring the residue characteristics of the target binding site in the protein to be identified;
the expansion unit is used for expanding the residue characteristics of the target binding site according to the characteristic vector.
In another aspect, an embodiment of the present application provides a data processing apparatus, including a memory and a processor, where the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to execute any one of the protein feature construction methods according to instructions in the program code.
On the other hand, an embodiment of the present application provides a storage medium, where an instruction is stored, and when the instruction is executed on a terminal device, the instruction causes the terminal device to execute any one of the protein feature construction methods.
In another aspect, embodiments of the present application provide a computer program product, which when run on a terminal device, causes the terminal device to perform any of the protein feature construction methods described herein.
Compared with the prior art, the embodiment of the invention has the following advantages:
In the embodiments of the present application, a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance. When a feature vector needs to be constructed for a protein, such as a protein to be identified, target gene ontology information of that protein is determined according to the gene ontology database, and a target vector corresponding to the target gene ontology information is determined from the pre-obtained vectors according to the identity of the target gene ontology information. A feature vector of the protein to be identified is then constructed from the target vector. Because genes determine the functions and characteristics of proteins, and gene ontology information reflects gene information together with molecular functions or biological processes, constructing the feature vector from gene ontology information takes these aspects of the protein into account, which improves the accuracy of the constructed protein features.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary diagram of an application scenario of a protein feature construction method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for constructing protein features according to an embodiment of the present application;
FIG. 3 is an interface diagram for determining ontology information of a target gene according to an embodiment of the present application;
FIG. 4a is a block diagram of a protein characterization device provided in an embodiment of the present application;
FIG. 4b is a block diagram of a protein characterization device provided in an embodiment of the present application.
Detailed Description
In order to make the present invention better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the related art, protein features are constructed based on amino acid sequences. This approach considers only sequence information and neglects the gene information and the molecular functions or biological processes of the protein, so the accuracy of the constructed protein features is low.
For example, suppose protein a is composed of amino acids A, B, and C, and amino acid C is mutated to amino acid D while protein a is being produced by gene transcription and translation. The resulting protein is actually composed of amino acids A, B, and D, and may be referred to as protein b. If protein features are constructed from the amino acid sequences, the features of protein a and protein b differ because their sequences differ, which suggests that the functions of the two proteins are not similar. In practice, however, the functions of protein a and protein b are similar, so features constructed from amino acid sequences have difficulty reflecting the actual characteristics of proteins accurately.
Gene Ontology (GO) is a method for systematically annotating the genes of species and the attributes of their gene products. As biotechnology advances ever faster, more and more data becomes available, and a way of organizing this information is needed. The gene ontology provides a reasonable solution: gene products are assigned GO information in a database, so that biologically relevant information can be queried from the gene ontology database. The gene ontology is a directed acyclic graph (DAG) type ontology and provides uniformly defined entries to represent the attributes of gene products.
The most basic concept in the gene ontology database is the node, or entry (GO Term). Each node has a name, such as "Cell", "Fibroblast Growth Factor Receptor Binding", or "Signal Transduction", and a unique number of the form "GO_nnnnnnn".
The gene ontology mainly comprises three branches. The cellular component branch covers each part of the cell and the extracellular environment of the cell. The molecular function branch covers the major activities of gene products at the molecular level, such as binding and enzymatic catalysis. The biological process branch covers processes or sets of molecular events with a defined beginning and end.
Based on these characteristics of the gene ontology, the embodiments of the present application provide a protein feature construction method in which a vector for each piece of gene ontology information in the gene ontology database is obtained in advance by training. When the features of a protein need to be constructed, the vectors of its gene ontology information are used, so that the protein features are constructed based on gene ontology information.
The protein characteristic construction method provided by the embodiment of the application can be applied to various application scenes, such as protein similarity comparison, protein function classification, protein binding site prediction and the like.
It should be noted that the method may be applied to a data processing device, where the data processing device may be a terminal device, and the terminal device may be, for example, an intelligent terminal, a computer, a personal digital assistant (Personal Digital Assistant, abbreviated as PDA), a tablet computer, or the like.
The data processing device may also be a server, which may be a stand-alone server or a cluster server. When the data processing device is a server, the server can acquire the protein to be identified sent by the terminal device, so that the characteristic vector of the protein to be identified is constructed, and the terminal device or the server can perform subsequent processing according to the constructed characteristic vector.
The data processing device may also include both a terminal device and a server, which cooperate to execute the protein feature construction method provided in the embodiments of the present application. For example, the terminal device may determine the target gene ontology information of the protein to be identified and send the identity of the target gene ontology information to the server, and the server then performs the subsequent steps to construct the feature vector of the protein to be identified.
For example, fig. 1 shows an application scenario of the protein feature construction method provided in the embodiment of the present application. The application scenario may include the terminal device 101, where the terminal device 101 may store a vector corresponding to each piece of gene ontology information in the gene ontology database obtained in advance.
The terminal device 101 acquires a protein to be identified, wherein the protein to be identified may be a protein requiring processing such as functional classification or prediction, binding site prediction, or the like. In order to perform the processing such as functional classification or prediction and binding site prediction on the protein to be identified, it is necessary to construct a feature vector of the protein to be identified so that the feature vector of the protein to be identified is used as an input for the processing such as functional classification or prediction and binding site prediction.
When it is necessary to construct a feature vector of a protein to be identified using the terminal device 101, the terminal device 101 determines target gene ontology information of the protein to be identified according to the gene ontology database, and then determines a target vector corresponding to the target gene ontology information from among vectors obtained in advance according to an identity of the target gene ontology information. Next, the terminal apparatus 101 constructs a feature vector of the protein to be identified from the target vector so as to take the feature vector as an input for subsequent processing.
Because the gene ontology information reflects the gene information and the molecular function or biological process, the method considers the gene information and the molecular function or biological process, thereby avoiding the construction of wrong protein characteristics caused by mutation in the gene transcription and translation process, improving the accuracy of the constructed protein characteristics, and further ensuring the accuracy of the subsequent processing results (such as the function prediction or classification result and the binding site prediction result).
It should be noted that the above application scenario is only shown for the convenience of understanding the present invention, and embodiments of the present invention are not limited in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
Various non-limiting embodiments of the present invention are described in detail below with reference to the attached drawing figures.
Exemplary method
Referring to fig. 2, a schematic flow chart of a method for constructing protein features according to an embodiment of the present invention is shown. In this embodiment, taking a terminal device as an example of the data processing device, the method may specifically include the following steps:
S201, determining target gene ontology information of the protein to be identified according to the gene ontology database, wherein the target gene ontology information has an identity.
When constructing the feature vector of the protein to be identified, the terminal device can determine the target gene ontology information of the protein by searching the gene ontology database. In this embodiment, the target gene ontology information is provided by the GO definition in the form of Web Ontology Language (OWL) files.
It should be noted that, in one possible implementation manner, the detailed process of determining the target gene ontology information by the terminal device may be shown in fig. 3, and the user may input the identifier, such as the name or the number, of the protein to be identified in the interface shown in 301 in fig. 3, and click on the "search" button, so as to trigger the search function of the terminal device, so that the terminal device may search the gene ontology database for the target gene ontology information of the protein to be identified.
The target gene ontology information of the protein to be identified may include one or more pieces. In general, the target gene ontology information of one protein includes a plurality of pieces, each of which may be referred to as a GO Term, as shown at 302 in fig. 3, where P62258 is the number of the protein to be identified and the pieces of information listed under GO-Molecular function are its target gene ontology information, namely a plurality of GO Terms.
It will be appreciated that if the protein to be identified is known, its identifier may be entered directly at 301. If the protein to be identified is unknown, a search may be performed with the Basic Local Alignment Search Tool (BLAST), a search tool based on a local alignment algorithm, and the identifier of the protein most similar to the protein to be identified may be used as its identifier.
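As a rough illustration of this fallback, the sketch below picks the most similar protein from BLAST tabular output (the 12-column `-outfmt 6` layout, in which the second column is the subject identifier and the twelfth is the bit score). Using the highest bit score as the measure of "most similar" is an assumption, and the hit identifiers and scores in the sample are made up.

```python
from typing import Optional


def best_hit(blast_tabular: str) -> Optional[str]:
    """Return the subject identifier with the highest bit score, or None if there are no hits."""
    best_id, best_score = None, float("-inf")
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, bit_score = fields[1], float(fields[11])
        if bit_score > best_score:
            best_id, best_score = subject_id, bit_score
    return best_id


# Made-up sample with two hits for one query sequence.
sample = (
    "query\tHIT_A\t62.1\t240\t91\t2\t1\t240\t5\t243\t1e-90\t310.0\n"
    "query\tHIT_B\t95.0\t250\t12\t0\t1\t250\t1\t250\t1e-170\t505.0\n"
)
closest = best_hit(sample)
```

The identifier returned here would then be entered in place of the unknown protein's identifier when querying the gene ontology database.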
S202, determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity.
In this embodiment, a vector corresponding to each piece of gene ontology information in the gene ontology database is obtained in advance and stored; for example, a correspondence between the identity of each piece of gene ontology information and its vector may be stored. The vectors may be stored in the terminal device, or in another device independent of the terminal device, such as a database or a server. Thus, after determining the target gene ontology information corresponding to the protein to be identified, the terminal device can determine the target vector from the stored vectors according to the identity of the target gene ontology information.
Since a protein may have a plurality of GO Terms, for example n of them, n target vectors may be determined. With n the number of GO Terms and m the vector dimension of one GO Term, the target vectors form an n×m feature matrix.
It is understood that the vector corresponding to each piece of gene ontology information in the gene ontology database may be pre-trained. The current gene ontology database contains 577454 axioms (Axiom) and 43828 classes (Class) in total. GO is organized in the form of classes, and each GO Term carries several axioms, each of which expresses an intrinsic relationship of the GO. On this basis, when determining the vector corresponding to each piece of gene ontology information in the gene ontology database, the class and axiom content included in each piece of gene ontology information can be expressed as sentences to obtain a training corpus, which comprises the sentences corresponding to the classes and the axiom content.
Specifically, for one piece of gene ontology information (GO Term), such as GO_0000054, the GO-formatted language of each class in it can be converted into sentences, and the GO-formatted language of its axiom content can likewise be converted into sentences. All of these sentences are then combined and spliced to obtain the description corpus of that GO Term. Traversing all GO Terms in this way builds the description corpus of every GO Term in the gene ontology database, which yields the training corpus.
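A minimal sketch of this lookup, assuming the pre-obtained vectors are held in an in-memory dictionary keyed by GO identity. The table name `GO_VECTORS`, the vector values, the dimension m = 4, and the policy of skipping identities with no stored vector are illustrative assumptions, not part of the described method.

```python
from typing import Dict, List, Sequence

# Hypothetical pre-obtained table: GO identity -> m-dimensional vector (m = 4, made-up values).
GO_VECTORS: Dict[str, List[float]] = {
    "GO_0001669": [0.1, 0.2, 0.3, 0.4],
    "GO_0016021": [0.5, 0.1, 0.0, 0.2],
    "GO_0022857": [0.3, 0.3, 0.1, 0.9],
}


def lookup_target_vectors(
    go_ids: Sequence[str], table: Dict[str, List[float]]
) -> List[List[float]]:
    """Return the n x m matrix of target vectors for the given GO identities.

    Identities without a stored vector are skipped (an assumed policy).
    """
    return [table[go_id] for go_id in go_ids if go_id in table]


# Two of the three requested identities are in the table, so n = 2 and m = 4.
matrix = lookup_target_vectors(["GO_0001669", "GO_0022857", "GO_9999999"], GO_VECTORS)
```

In a deployment the table could equally live in a database or on a server, as the text notes; the dictionary stands in for any store that maps an identity to its vector.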
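The corpus construction can be sketched as follows, under the simplifying assumption that each axiom has already been extracted from the OWL file as a (subject, relation, object) triple and that each class contributes one sentence made of its identity and label. The labels, the placeholder identifiers `GO_EXAMPLE_B` and `GO_PARENT_A`, and the relations attached to GO_0000054 are hypothetical and are not taken from the real ontology.

```python
from typing import Dict, List, Tuple

Axiom = Tuple[str, str, str]  # (subject, relation, object): an assumed simplification of OWL axioms


def term_sentences(term_id: str, label: str, axioms: List[Axiom]) -> List[List[str]]:
    """Express one GO Term's class and axiom content as tokenized sentences."""
    sentences = [[term_id] + label.lower().split()]  # class sentence
    for subject, relation, obj in axioms:            # one sentence per axiom
        sentences.append([subject, relation, obj])
    return sentences


def build_corpus(terms: Dict[str, Tuple[str, List[Axiom]]]) -> List[List[str]]:
    """Traverse all GO Terms and splice their sentences into one training corpus."""
    corpus: List[List[str]] = []
    for term_id, (label, axioms) in terms.items():
        corpus.extend(term_sentences(term_id, label, axioms))
    return corpus


# Hypothetical input: labels and relations are placeholders for illustration only.
TERMS = {
    "GO_0000054": ("placeholder label one",
                   [("GO_0000054", "SubClassOf", "GO_PARENT_A")]),
    "GO_EXAMPLE_B": ("placeholder label two",
                     [("GO_EXAMPLE_B", "SubClassOf", "GO_PARENT_A"),
                      ("GO_EXAMPLE_B", "part_of", "GO_0000054")]),
}
corpus = build_corpus(TERMS)
```

Keeping the GO identity as a single token in every sentence is what later allows the word vector of that token to serve as the vector of the GO Term.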
Word vector training is then performed on the training corpus to generate the vector corresponding to each piece of gene ontology information. In one possible implementation, a neural network learning technique is used: for example, word vectors are trained on the training corpus with the continuous bag-of-words (CBOW) model of word2vec (Word to Vector), yielding a word vector for each word. The word vector of an entity word (a word representing an entity) may be used as the word vector of the corresponding GO Term, which gives the vector corresponding to each GO Term.
For example, the GO information corresponding to the P protein Q63564 includes GO_0001669, GO_0016021, GO_0022857, GO_0030054, GO_0030672, GO_0043195, and GO_0055085, and the vector corresponding to each piece of GO information is then determined by the above method.
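To make the CBOW idea concrete, the following is a deliberately tiny pure-Python sketch with a full softmax output layer: the mean of the context word embeddings predicts the centre word, and the learned input embeddings of the GO identity tokens are taken as the GO Term vectors. It is not the implementation used in the application; in practice a library such as gensim (`Word2Vec` with `sg=0`) would normally be used. The corpus, hyperparameters, and token names are illustrative assumptions.

```python
import math
import random
from typing import Dict, List


def train_cbow(corpus: List[List[str]], dim: int = 8, window: int = 2,
               epochs: int = 20, lr: float = 0.05, seed: int = 0) -> Dict[str, List[float]]:
    """Train a minimal CBOW model and return {token: input embedding}."""
    rng = random.Random(seed)
    vocab = sorted({tok for sent in corpus for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}
    w_in = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]
    for _ in range(epochs):
        for sent in corpus:
            ids = [index[t] for t in sent]
            for pos, target in enumerate(ids):
                lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
                context = [ids[j] for j in range(lo, hi) if j != pos]
                if not context:
                    continue
                # Hidden layer: mean of the context embeddings.
                h = [sum(w_in[c][k] for c in context) / len(context) for k in range(dim)]
                scores = [sum(w_out[v][k] * h[k] for k in range(dim)) for v in range(len(vocab))]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
                total = sum(exps)
                grad_h = [0.0] * dim
                for v in range(len(vocab)):
                    err = exps[v] / total - (1.0 if v == target else 0.0)
                    for k in range(dim):
                        grad_h[k] += err * w_out[v][k]   # uses the pre-update output weight
                        w_out[v][k] -= lr * err * h[k]
                for c in context:
                    for k in range(dim):
                        w_in[c][k] -= lr * grad_h[k] / len(context)
    return {tok: w_in[index[tok]] for tok in vocab}


# Toy corpus with placeholder GO identities (not real ontology content).
toy_corpus = [
    ["GO_TERM_A", "SubClassOf", "GO_TERM_B"],
    ["GO_TERM_C", "SubClassOf", "GO_TERM_B"],
    ["GO_TERM_A", "part_of", "GO_TERM_C"],
]
vectors = train_cbow(toy_corpus, dim=8)
# Keep only the entity words, i.e. the tokens that are GO identities.
go_vectors = {tok: vec for tok, vec in vectors.items() if tok.startswith("GO_")}
```

The full softmax is only workable for a toy vocabulary; real word2vec implementations replace it with negative sampling or hierarchical softmax.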
S203, constructing the characteristic vector of the protein to be identified according to the target vector.
The terminal device can construct a feature vector of the protein to be identified according to the determined target vector. The feature vector reflects the features of the protein to be identified, which completes the construction of the features of the protein to be identified.
For example, for the P protein Q63564, the terminal device may process the vectors corresponding to the determined pieces of GO information to construct the feature vector of the P protein Q63564.
In the embodiments of the present application, the vector corresponding to each piece of gene ontology information in the gene ontology database is obtained in advance. When a feature vector needs to be constructed for a protein, such as the protein to be identified, the target gene ontology information of the protein to be identified is determined according to the gene ontology database, and the target vector corresponding to the target gene ontology information is determined from the vectors obtained in advance according to the identity of the target gene ontology information. The feature vector of the protein to be identified is then constructed from the target vector. Because genes determine the functions and characteristics of proteins, and because gene ontology information reflects gene information together with molecular functions or biological processes, a feature vector constructed from gene ontology information takes the gene information and the molecular functions or biological processes of the protein into account, which improves the accuracy of the constructed protein features.
It should be noted that the embodiments of the present application provide several ways of constructing the feature vector from the target vector. If the target gene ontology information includes a plurality of pieces, one construction method is to sum the target vectors corresponding to the plurality of pieces of target gene ontology information and take their average to obtain the feature vector. In this way, proteins of different lengths can be converted into feature vectors of a fixed length, so that the feature vectors of different proteins are comparable, which improves the accuracy of protein function classification.
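The sum-and-average construction can be sketched in a few lines. The function name `average_vectors` and the numeric values are illustrative assumptions; the point is only that proteins with different numbers of GO terms end up with feature vectors of the same length.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one target vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all target vectors must have the same length")
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Two hypothetical proteins with different numbers of GO terms (toy values).
feature_a = average_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
feature_b = average_vectors([[2.0, 0.0]])
```

Here `feature_a` is built from three target vectors and `feature_b` from one, yet both have length 2 and can be compared directly.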
Because a functional classification scenario may impose a specific requirement on the length of the feature vector, another way of constructing the feature vector is to preset the length of the feature vector according to that requirement, perform dimension reduction processing on the target vector according to the preset length, and then construct the feature vector from the processed target vector.
Many dimension reduction methods are available, such as principal component analysis (PCA) and singular value decomposition (SVD), and this embodiment does not limit the choice.
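As one possible reading of the dimension reduction step, the sketch below estimates the top principal components of a set of GO vectors by power iteration and projects a target vector onto them, so that the result has the preset length `k`. The function names, the use of power iteration instead of a library PCA or SVD routine, and the random toy vectors are all assumptions made for illustration; the sketch also assumes at least two input rows and a covariance matrix of rank at least `k`.

```python
import math
import random


def principal_components(rows, k, iters=200, seed=0):
    """Estimate the mean and top-k principal directions by power iteration."""
    rng = random.Random(seed)
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - mean[d] for d in range(dim)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]

    def orthonormalize(v, basis):
        # Remove the parts of v along already-found components, then normalize.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    components = []
    for _ in range(k):
        v = orthonormalize([rng.gauss(0, 1) for _ in range(dim)], components)
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            v = orthonormalize(w, components)
        components.append(v)
    return mean, components


def reduce_vector(vector, mean, components):
    """Project one vector onto the principal components (length = k)."""
    centered = [x - m for x, m in zip(vector, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]


# Toy stand-in for pre-trained GO vectors: 20 random 6-dimensional vectors.
data_rng = random.Random(1)
go_vectors = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(20)]
mean, components = principal_components(go_vectors, k=3)
reduced = reduce_vector(go_vectors[0], mean, components)
```

The reduced target vectors could then be averaged as described earlier to obtain a feature vector of the preset length.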
The feature vector obtained by the method can be used for classifying the functions of the proteins to be identified and determining the similarity between the proteins, and can also be used for predicting the binding sites of the proteins to be identified.
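For the similarity use case, one common choice is cosine similarity between two protein feature vectors. The text does not name a specific similarity measure, so the use of cosine similarity here, the function name `cosine_similarity`, and the toy vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy feature vectors for hypothetical proteins.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```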
When classifying the function of the protein to be identified according to the feature vector, the terminal device can combine the feature vector with a deep learning convolution technique to complete tasks such as the functional classification of various proteins.
When predicting the binding site of the protein to be identified according to the feature vector, the feature vector may be used to expand the residue feature of the binding site in order to improve the accuracy of the binding site prediction. Specifically, the terminal device may obtain the residue feature of a target binding site in the protein to be identified. Because the residue feature is obtained from the amino acid sequence of the protein to be identified, the terminal device may expand the residue feature of the target binding site according to the feature vector so that the residue feature reflects the features of the target binding site more accurately, and then represent the features of the target binding site with the expanded residue feature. That is, the features of the target binding site are represented by combining the residue feature determined from the amino acid sequence of the protein to be identified with the feature vector.
Because the feature vector is obtained based on gene ontology information, the expanded residue feature also reflects the features of the target binding site from the gene perspective, which makes the expanded residue feature more accurate and improves the accuracy of the subsequent binding site prediction.
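One simple way to realize this expansion is to append the protein-level feature vector to the feature of each residue. The text only says that the two are combined, so concatenation is an assumption, as are the function name `extend_residue_features` and the toy values.

```python
def extend_residue_features(residue_features, protein_vector):
    """Append the protein-level feature vector to every residue feature."""
    return [list(residue) + list(protein_vector) for residue in residue_features]


# Toy sequence-derived features for two residues (invented values).
residues = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
protein_vector = [1.0, 2.0]  # hypothetical GO-based feature vector
extended = extend_residue_features(residues, protein_vector)
```

Every residue of the same protein receives the same appended vector, so the expanded feature carries both residue-level and protein-level information.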
Exemplary apparatus
Based on the protein feature construction method provided in the foregoing embodiments, an embodiment of the present application further provides a protein feature construction device, in which the vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance. Referring to Fig. 4a, the device includes a first determining unit 401, a second determining unit 402, and a construction unit 403:
the first determining unit 401 is configured to determine target gene ontology information of a protein to be identified according to the gene ontology database, where the target gene ontology information has an identity;
the second determining unit 402 is configured to determine, according to the identity, a target vector corresponding to the target gene ontology information from among vectors obtained in advance;
the construction unit 403 is configured to construct a feature vector of the protein to be identified according to the target vector.
Optionally, the first determining unit 401 is further configured to:
expressing the category and axiom content included in each piece of gene ontology information as sentences to obtain training corpus, wherein the training corpus comprises sentences corresponding to the category and the axiom content;
and training word vectors according to the training corpus, and generating vectors corresponding to each piece of gene ontology information.
Optionally, if the target gene ontology information includes a plurality of pieces, the construction unit 403 is configured to:
sum and average the target vectors corresponding to the plurality of pieces of target gene ontology information to obtain the feature vector.
Optionally, the construction unit 403 is configured to:
perform dimension reduction processing on the target vector according to the preset length of the feature vector;
and construct the feature vector according to the processed target vector.
Optionally, the feature vector is used for classifying the function of the protein to be identified and/or for predicting the binding site of the protein to be identified.
Optionally, if the feature vector is used to predict the binding site of the protein to be identified, referring to fig. 4b, the device further comprises an acquisition unit 404 and an expansion unit 405:
the acquisition unit 404 is configured to acquire a residue feature of a target binding site in the protein to be identified;
the expansion unit 405 is configured to expand the residue feature of the target binding site according to the feature vector.
The embodiment of the application also provides a data processing device, which comprises a memory and a processor,
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute, according to the instructions in the program code, the protein feature construction method of any one of the embodiments corresponding to Fig. 2.
An embodiment of the present application provides a storage medium having instructions stored therein which, when run on a data processing device, cause the data processing device to perform the protein feature construction method of any one of the embodiments corresponding to Fig. 2.
An embodiment of the present application provides a computer program product which, when run on a data processing device, causes the data processing device to perform the protein feature construction method of any one of the embodiments corresponding to Fig. 2.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Because the device embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
The foregoing is merely an exemplary embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of protection of the present application.
Claims (7)
1. A protein feature construction method, characterized in that a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance, the gene ontology information reflecting gene information, molecular functions and biological processes, the method comprising:
determining target gene ontology information of proteins to be identified according to the gene ontology database, wherein the target gene ontology information has an identity;
determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity;
constructing a feature vector of the protein to be identified according to the target vector, wherein the feature vector is applied to protein similarity comparison, protein function classification and protein binding site prediction;
if the feature vector is used for predicting the binding site of the protein to be identified, acquiring the residue feature of the target binding site in the protein to be identified;
expanding the residue feature of the target binding site according to the feature vector, and representing the features of the target binding site by using the expanded residue feature, which specifically comprises: representing the features of the target binding site by combining the residue feature determined based on the amino acid sequence of the protein to be identified with the feature vector;
if the feature vector is used for protein function classification, carrying out protein function classification through a feature vector and a deep learning convolution algorithm;
the method for determining the vector corresponding to each piece of gene ontology information in the gene ontology database comprises the following steps:
expressing the category and axiom content included in each piece of gene ontology information as sentences to obtain training corpus, wherein the training corpus comprises sentences corresponding to the category and the axiom content;
and training word vectors according to the training corpus, and generating vectors corresponding to each piece of gene ontology information.
2. The method according to claim 1, wherein if the target gene ontology information includes a plurality of pieces, the constructing the feature vector of the protein to be identified from the target vector includes:
and adding and averaging the target feature vectors corresponding to the plurality of pieces of target gene ontology information to obtain the feature vectors.
3. The method according to claim 1, wherein said constructing a feature vector of the protein to be identified from the target vector comprises:
performing dimension reduction processing on the target vector according to the preset length of the feature vector;
and constructing the feature vector according to the processed target vector.
4. The method according to claim 1, characterized in that the feature vector is used for classifying the function of the protein to be identified and/or for predicting the binding site of the protein to be identified.
5. A protein characteristic construction apparatus characterized in that vectors corresponding to each piece of gene ontology information in a gene ontology database are obtained in advance, the gene ontology information reflecting gene information, molecular functions and biological processes, the apparatus comprising a first determination unit, a second determination unit and a construction unit:
the first determining unit is used for determining target gene ontology information of the protein to be identified according to the gene ontology database, wherein the target gene ontology information is provided with an identity;
the second determining unit is used for determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity;
the construction unit is used for constructing a characteristic vector of the protein to be identified according to the target vector, and the characteristic vector is applied to protein similarity comparison, protein function classification and protein binding site prediction;
the acquisition unit is used for acquiring the residue characteristics of the target binding site in the protein to be identified if the characteristic vector is used for predicting the binding site of the protein to be identified;
the expansion unit is used for expanding the residue feature of the target binding site according to the feature vector, and representing the features of the target binding site by using the expanded residue feature, which specifically comprises: representing the features of the target binding site by combining the residue feature determined based on the amino acid sequence of the protein to be identified with the feature vector;
the classifying unit is used for classifying the protein functions through the feature vector and the deep learning convolution algorithm if the feature vector is used for classifying the protein functions;
the method for determining the vector corresponding to each piece of gene ontology information in the gene ontology database comprises the following steps:
expressing the category and axiom content included in each piece of gene ontology information as sentences to obtain training corpus, wherein the training corpus comprises sentences corresponding to the category and the axiom content;
and training word vectors according to the training corpus, and generating vectors corresponding to each piece of gene ontology information.
6. A data processing apparatus, comprising a memory and a processor,
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the protein feature construction method of any one of claims 1-4 according to instructions in the program code.
7. A storage medium having instructions stored therein that, when executed on a data processing device, cause the data processing device to perform the protein feature construction method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911329568.7A CN111091874B (en) | 2019-12-20 | 2019-12-20 | Protein feature construction method, protein feature construction device, protein feature construction apparatus, protein feature construction program product, and protein feature construction program product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111091874A CN111091874A (en) | 2020-05-01 |
CN111091874B true CN111091874B (en) | 2024-01-19 |
Family
ID=70396642
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911329568.7A Active CN111091874B (en) | 2019-12-20 | 2019-12-20 | Protein feature construction method, protein feature construction device, protein feature construction apparatus, protein feature construction program product, and protein feature construction program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111091874B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106778070A (en) * | 2017-03-31 | 2017-05-31 | 上海交通大学 | A kind of human protein's subcellular location Forecasting Methodology |
CN106845149A (en) * | 2017-02-09 | 2017-06-13 | 景德镇陶瓷大学 | A kind of new protein sequence method for expressing based on gene ontology information |
CN107563150A (en) * | 2017-08-31 | 2018-01-09 | 深圳大学 | Forecasting Methodology, device, equipment and the storage medium of protein binding site |
CN109886385A (en) * | 2019-03-04 | 2019-06-14 | 上海宝藤生物医药科技股份有限公司 | Determination method, apparatus, equipment and the medium of cell-signaling pathways network characterization |
Non-Patent Citations (1)
Title |
---|
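Liu Bingjing et al. Prediction of protein subcellular localization using position-specific scoring matrices and gene ontology as features. Journal of Fuzhou University (Natural Science Edition). 2017, Vol. 45 (No. 45), pp. 16-24. * |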
Also Published As
Publication number | Publication date |
---|---|
CN111091874A (en) | 2020-05-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||