CN111091874B - Protein feature construction method, apparatus, device, storage medium, and program product - Google Patents

Protein feature construction method, apparatus, device, storage medium, and program product

Info

Publication number
CN111091874B
Authority
CN
China
Prior art keywords
protein
gene ontology
target
vector
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911329568.7A
Other languages
Chinese (zh)
Other versions
CN111091874A (en
Inventor
汤一凡
崔朝辉
赵立军
张霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201911329568.7A
Publication of CN111091874A
Application granted
Publication of CN111091874B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/10 Ontologies; Annotations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Biochemistry (AREA)
  • Library & Information Science (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a protein feature construction method. A vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance. When a feature vector needs to be constructed for a protein, such as a protein to be identified, the target gene ontology information of that protein is determined from the gene ontology database, and the target vector corresponding to the target gene ontology information is selected from the pre-obtained vectors according to the identity of the target gene ontology information. A feature vector of the protein to be identified is then constructed from the target vector. Because genes determine the functions and characteristics of proteins, and gene ontology information reflects gene information together with molecular functions and biological processes, a feature vector built from gene ontology information takes this information into account, which improves the accuracy of the constructed protein features.

Description

Protein feature construction method, apparatus, device, storage medium, and program product
Technical Field
The present application relates to the field of bioinformatics, and in particular to a method, an apparatus, a device, a storage medium, and a program product for constructing protein features.
Background
Proteins are the basic units through which all living things carry out life activities. They can be regarded as the smallest automatic machines in nature and play an irreplaceable role in the operation of biological systems. Protein function is important in biotechnology and medical research, for example in the development of new drugs, new crops, and synthetic biochemicals such as biofuels.
The feature information of a protein can be used to represent its function, so constructing protein features is important for predicting and classifying protein function. Traditional methods construct protein features from the amino acid sequence, for example by counting amino acid frequencies, computing the physicochemical properties of amino acids, or running homology searches to build feature matrices such as a position-specific scoring matrix (PSSM).
However, a protein is determined by its gene and is the product of gene transcription and translation. The traditional methods consider only sequence information and ignore the gene information and the molecular functions or biological processes of the protein, so the accuracy of the constructed protein features is low.
Disclosure of Invention
To solve the above technical problem in the related art, the present application provides a protein feature construction method, apparatus, device, storage medium, and program product that take the gene information and the molecular function or biological process of a protein into account when constructing protein features, thereby improving the accuracy of the constructed features.
In one aspect, an embodiment of the present application provides a method for constructing protein features, where a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance, the method includes:
determining target gene ontology information of proteins to be identified according to the gene ontology database, wherein the target gene ontology information has an identity;
determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity;
and constructing a characteristic vector of the protein to be identified according to the target vector.
Optionally, the vector corresponding to each piece of gene ontology information in the gene ontology database is determined as follows:
expressing the category and axiom content included in each piece of gene ontology information as sentences to obtain a training corpus, wherein the training corpus comprises sentences corresponding to the category and the axiom content;
and training word vectors on the training corpus to generate the vector corresponding to each piece of gene ontology information.
Optionally, if the target gene ontology information includes a plurality of pieces, the constructing the feature vector of the protein to be identified according to the target vector includes:
summing and averaging the target vectors corresponding to the plurality of pieces of target gene ontology information to obtain the feature vector.
Optionally, the constructing the feature vector of the protein to be identified according to the target vector includes:
performing dimension reduction processing on the target vector according to the preset length of the feature vector;
and constructing the feature vector according to the processed target vector.
Optionally, the feature vector is used for classifying the function of the protein to be identified and/or for predicting the binding site of the protein to be identified.
Optionally, if the feature vector is used for predicting the binding site of the protein to be identified, after the feature vector of the protein to be identified is constructed according to the target vector, the method further includes:
acquiring the residue characteristics of a target binding site in the protein to be identified;
expanding the residue characteristics of the target binding site according to the characteristic vector.
On the other hand, an embodiment of the present application provides a protein feature construction device, where a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance, and the device includes a first determining unit, a second determining unit, and a construction unit:
the first determining unit is used for determining target gene ontology information of the protein to be identified according to the gene ontology database, wherein the target gene ontology information is provided with an identity;
the second determining unit is used for determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity;
the construction unit is used for constructing the characteristic vector of the protein to be identified according to the target vector.
Optionally, the first determining unit is further configured to:
expressing the category and axiom content included in each piece of gene ontology information as sentences to obtain training corpus, wherein the training corpus comprises sentences corresponding to the category and the axiom content;
and training word vectors according to the training corpus, and generating vectors corresponding to each piece of gene ontology information.
Optionally, if the target gene ontology information includes a plurality of pieces, the construction unit is configured to:
summing and averaging the target vectors corresponding to the plurality of pieces of target gene ontology information to obtain the feature vector.
Optionally, the construction unit is configured to:
performing dimension reduction processing on the target vector according to the preset length of the feature vector;
and constructing the feature vector according to the processed target vector.
Optionally, the feature vector is used for classifying the function of the protein to be identified and/or for predicting the binding site of the protein to be identified.
Optionally, if the feature vector is used to predict the binding site of the protein to be identified, the device further comprises an acquisition unit and an expansion unit:
the acquisition unit is used for acquiring the residue characteristics of the target binding site in the protein to be identified;
the expansion unit is used for expanding the residue characteristics of the target binding site according to the characteristic vector.
In another aspect, an embodiment of the present application provides a data processing apparatus, including a memory and a processor, where the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to execute any one of the protein feature construction methods according to instructions in the program code.
On the other hand, an embodiment of the present application provides a storage medium, where an instruction is stored, and when the instruction is executed on a terminal device, the instruction causes the terminal device to execute any one of the protein feature construction methods.
In another aspect, embodiments of the present application provide a computer program product, which when run on a terminal device, causes the terminal device to perform any of the protein feature construction methods described herein.
Compared with the prior art, the embodiment of the invention has the following advantages:
In the embodiments of the present application, a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance. When a feature vector needs to be constructed for a protein, such as a protein to be identified, the target gene ontology information of that protein is determined from the gene ontology database, and the target vector corresponding to the target gene ontology information is selected from the pre-obtained vectors according to the identity of the target gene ontology information. A feature vector of the protein to be identified is then constructed from the target vector. Because genes determine the functions and characteristics of proteins, and gene ontology information reflects gene information together with molecular functions and biological processes, a feature vector built from gene ontology information takes this information into account, which improves the accuracy of the constructed protein features.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary diagram of an application scenario of a protein feature construction method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for constructing protein features according to an embodiment of the present application;
FIG. 3 is an interface diagram for determining ontology information of a target gene according to an embodiment of the present application;
FIG. 4a is a block diagram of a protein feature construction device provided in an embodiment of the present application;
FIG. 4b is a block diagram of a protein feature construction device provided in an embodiment of the present application.
Detailed Description
In order to make the present invention better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the related art, protein features are constructed from amino acid sequences. This approach considers only sequence information and ignores the gene information and the molecular functions or biological processes of the protein, so the accuracy of the constructed protein features is low.
For example, suppose protein a is composed of amino acids A, B, and C, and amino acid C is mutated to amino acid D while protein a is being produced by gene transcription and translation. The resulting protein, which may be called protein b, is actually composed of amino acids A, B, and D. If protein features are constructed from amino acid sequences, the features of protein a and protein b differ because their sequences differ, which suggests that their functions are dissimilar. In practice, however, the functions of protein a and protein b are similar, so features constructed from amino acid sequences do not accurately reflect the actual characteristics of the proteins.
The Gene Ontology (GO) is a framework for systematically annotating the genes of species and the attributes of their gene products. As biotechnology advances, more and more data become available, and a way to organize this information is needed. The gene ontology provides a reasonable solution: gene products are annotated with GO information in a database, so the gene ontology database can be queried for the related biological information. The gene ontology is structured as a directed acyclic graph (DAG) and provides uniformly defined entries to represent the attributes of gene products.
The most basic concept in the gene ontology database is the node, or entry (GO Term). Each node has a name, such as "Cell", "Fibroblast Growth Factor Receptor Binding", or "Signal Transduction", and a unique number of the form "GO_nnnnnnn".
The gene ontology comprises three main branches: cellular component, which covers the parts of the cell and its extracellular environment; molecular function, which covers the main activities of gene products at the molecular level, such as binding and enzymatic catalysis; and biological process, which covers processes or sets of molecular events with a defined beginning and end.
Based on these characteristics of the gene ontology, the embodiments of the present application provide a protein feature construction method in which a vector for each piece of gene ontology information in the gene ontology database is obtained by training in advance. When the features of a protein need to be constructed, they are built from the vectors of that protein's gene ontology information, so that the protein features are constructed based on gene ontology information.
The protein characteristic construction method provided by the embodiment of the application can be applied to various application scenes, such as protein similarity comparison, protein function classification, protein binding site prediction and the like.
It should be noted that the method may be applied to a data processing device, where the data processing device may be a terminal device, and the terminal device may be, for example, an intelligent terminal, a computer, a personal digital assistant (Personal Digital Assistant, abbreviated as PDA), a tablet computer, or the like.
The data processing device may also be a server, which may be a stand-alone server or a cluster server. When the data processing device is a server, the server can acquire the protein to be identified sent by the terminal device, so that the characteristic vector of the protein to be identified is constructed, and the terminal device or the server can perform subsequent processing according to the constructed characteristic vector.
The data processing device may also include both a terminal device and a server, which cooperate to execute the protein feature construction method provided in the embodiments of the present application. For example, the terminal device may determine the target gene ontology information of a protein to be identified and send the identity of the target gene ontology information to the server, and the server then performs the subsequent steps to construct the feature vector of the protein to be identified.
For example, fig. 1 shows an application scenario of the protein feature construction method provided in the embodiment of the present application. The application scenario may include the terminal device 101, where the terminal device 101 may store a vector corresponding to each piece of gene ontology information in the gene ontology database obtained in advance.
The terminal device 101 acquires a protein to be identified, wherein the protein to be identified may be a protein requiring processing such as functional classification or prediction, binding site prediction, or the like. In order to perform the processing such as functional classification or prediction and binding site prediction on the protein to be identified, it is necessary to construct a feature vector of the protein to be identified so that the feature vector of the protein to be identified is used as an input for the processing such as functional classification or prediction and binding site prediction.
When it is necessary to construct a feature vector of a protein to be identified using the terminal device 101, the terminal device 101 determines target gene ontology information of the protein to be identified according to the gene ontology database, and then determines a target vector corresponding to the target gene ontology information from among vectors obtained in advance according to an identity of the target gene ontology information. Next, the terminal apparatus 101 constructs a feature vector of the protein to be identified from the target vector so as to take the feature vector as an input for subsequent processing.
Because the gene ontology information reflects the gene information and the molecular function or biological process, the method considers the gene information and the molecular function or biological process, thereby avoiding the construction of wrong protein characteristics caused by mutation in the gene transcription and translation process, improving the accuracy of the constructed protein characteristics, and further ensuring the accuracy of the subsequent processing results (such as the function prediction or classification result and the binding site prediction result).
It should be noted that the above application scenario is only shown for the convenience of understanding the present invention, and embodiments of the present invention are not limited in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
Various non-limiting embodiments of the present invention are described in detail below with reference to the attached drawing figures.
Exemplary method
Referring to fig. 2, a schematic flow chart of a method for constructing protein features according to an embodiment of the present invention is shown. In this embodiment, taking the data processing device as an example of a terminal device, the method specifically may include the following steps:
S201, determining target gene ontology information of the protein to be identified according to the gene ontology database, wherein the target gene ontology information has an identity.
When constructing the feature vector of the protein to be identified, the terminal device can determine the target gene ontology information of the protein by searching the gene ontology database. In this embodiment, the GO definition provides the target gene ontology information in the form of Web Ontology Language (OWL) files.
It should be noted that, in one possible implementation manner, the detailed process of determining the target gene ontology information by the terminal device may be shown in fig. 3, and the user may input the identifier, such as the name or the number, of the protein to be identified in the interface shown in 301 in fig. 3, and click on the "search" button, so as to trigger the search function of the terminal device, so that the terminal device may search the gene ontology database for the target gene ontology information of the protein to be identified.
The target gene ontology information of the protein to be identified may include one or more pieces. In general, one protein has a plurality of pieces, and each piece may be referred to as a GO Term, as shown at 302 in fig. 3. There, P62258 is the number of the protein to be identified, and the pieces of information listed under GO-Molecular function are the target gene ontology information of the protein to be identified, that is, a plurality of GO Terms.
It will be appreciated that if the protein to be identified is known, its identifier may be entered directly at 301. If the protein to be identified is unknown, the Basic Local Alignment Search Tool (BLAST), a search tool based on a local alignment algorithm, can be used to find the most similar protein, and the identifier of that protein is used as the identifier of the protein to be identified.
S202, determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity.
In this embodiment, a vector corresponding to each piece of gene ontology information in the gene ontology database is obtained in advance and stored; for example, a correspondence between the identity of each piece of gene ontology information and its vector may be stored. The vectors may be stored in the terminal device or in another device independent of the terminal device, such as a database or a server. Thus, after determining the target gene ontology information corresponding to the protein to be identified, the terminal device can determine the target vector from the stored vectors according to the identity of the target gene ontology information.
Since a protein may have a plurality of GO Terms, for example n, n target vectors may be determined. With n the number of GO Terms and m the vector dimension of one GO Term, the target vectors form an n×m feature matrix.
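The lookup in S202 can be sketched as follows. This is a minimal illustration under stated assumptions: the store `GO_VECTORS`, its three-dimensional values, and the function name `lookup_target_vectors` are hypothetical, and skipping identifiers that have no stored vector is a policy chosen for the sketch, not one the description specifies.

```python
from typing import Dict, List

# Hypothetical pre-trained store: GO identity -> m-dimensional vector (m = 3 here).
# The numeric values are placeholders, not trained vectors.
GO_VECTORS: Dict[str, List[float]] = {
    "GO_0001669": [0.1, 0.2, 0.3],
    "GO_0016021": [0.4, 0.5, 0.6],
    "GO_0022857": [0.7, 0.8, 0.9],
}


def lookup_target_vectors(go_ids: List[str],
                          store: Dict[str, List[float]]) -> List[List[float]]:
    """Return the n x m matrix of stored vectors for the given GO identities.

    Identities missing from the store are skipped (an assumption of this sketch).
    """
    return [list(store[go_id]) for go_id in go_ids if go_id in store]


# "GO_9999999" is a made-up identity used to show the skip behaviour.
matrix = lookup_target_vectors(["GO_0001669", "GO_0022857", "GO_9999999"], GO_VECTORS)
print(len(matrix), len(matrix[0]))  # n rows, m columns
```

Each row of `matrix` is the target vector of one GO Term, so the result has the n×m shape described above.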
It is understood that the vector corresponding to each piece of gene ontology information in the gene ontology database may be pre-trained. The current gene ontology database contains 577454 axioms (Axiom) and 43828 categories (Classes) in total. The GO is organized in the form of Classes, and each GO Term has several Axioms, which express the intrinsic relationships between GO Terms. Accordingly, when determining the vector corresponding to each piece of gene ontology information in the gene ontology database, the category and axiom content included in each piece of gene ontology information can be expressed as sentences to obtain a training corpus, which comprises the sentences corresponding to the category and the axiom content.
Specifically, for one piece of gene ontology information (GO Term), such as GO_0000054, the GO-formatted description of each class and of each axiom is converted into sentences, and all of these sentences are concatenated to obtain the description corpus of that GO Term. All GO Terms are traversed in this way to construct the description corpus of every GO Term in the gene ontology database, which yields the training corpus.
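One way to picture the corpus construction is the sketch below. The dictionary layout, the field names `id`, `label`, and `axioms`, the placeholder label, and the parent identities `GO_PARENT_A` and `GO_PARENT_B` are all hypothetical; a real OWL file is richer and would need an OWL parser, which is outside the scope of this sketch.

```python
from typing import Dict, List

# Hypothetical, simplified stand-in for one parsed GO Term.
# The label and the axiom targets are placeholders, not real GO content.
GO_TERM: Dict = {
    "id": "GO_0000054",
    "label": "placeholder term label",
    "axioms": [("SubClassOf", "GO_PARENT_A"), ("part_of", "GO_PARENT_B")],
}


def term_to_sentences(term: Dict) -> List[List[str]]:
    """Express the class declaration and each axiom of a term as a token list."""
    sentences = [[term["id"], "label"] + term["label"].split()]
    for relation, target in term["axioms"]:
        sentences.append([term["id"], relation, target])
    return sentences


def build_corpus(terms: List[Dict]) -> List[List[str]]:
    """Traverse all terms and collect their sentences into one training corpus."""
    corpus: List[List[str]] = []
    for term in terms:
        corpus.extend(term_to_sentences(term))
    return corpus


corpus = build_corpus([GO_TERM])
for sentence in corpus:
    print(" ".join(sentence))
```

Keeping the GO identity as a token in every sentence means the identity itself receives a word vector during training, which matches the idea of using the entity word's vector as the GO Term's vector.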
Word vector training is then performed on the training corpus to generate the vector corresponding to each piece of gene ontology information. In one possible implementation, a neural network learning technique is used: for example, word vectors are trained on the training corpus with the continuous bag-of-words (CBOW) model of word2vec to obtain a word vector for each word, and the word vector of an entity word (the word representing the entity) is used as the vector of the corresponding GO Term, so that a vector is obtained for each GO Term.
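To make the CBOW step concrete, here is a small pure-Python sketch that trains CBOW with a full softmax on a toy corpus. It is an illustration under assumptions rather than the training setup of the embodiment: the toy tokens `GO_A`, `GO_B`, `GO_C`, the hyperparameters, and the function name `train_cbow` are hypothetical, and a production system would normally rely on an optimized word2vec implementation with negative sampling or hierarchical softmax.

```python
import math
import random
from typing import Dict, List


def train_cbow(corpus: List[List[str]], dim: int = 8, window: int = 2,
               epochs: int = 50, lr: float = 0.05, seed: int = 0) -> Dict[str, List[float]]:
    """Train CBOW with a full softmax and return token -> input embedding."""
    rng = random.Random(seed)
    vocab = sorted({tok for sent in corpus for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}
    v = len(vocab)
    w_in = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in range(v)]
    w_out = [[0.0] * dim for _ in range(v)]
    for _ in range(epochs):
        for sent in corpus:
            ids = [index[t] for t in sent]
            for pos, target in enumerate(ids):
                ctx = ids[max(0, pos - window):pos] + ids[pos + 1:pos + 1 + window]
                if not ctx:
                    continue
                # Hidden layer: mean of the context word embeddings.
                h = [sum(w_in[c][k] for c in ctx) / len(ctx) for k in range(dim)]
                scores = [sum(w_out[j][k] * h[k] for k in range(dim)) for j in range(v)]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                total = sum(exps)
                # Cross-entropy gradient w.r.t. the scores: softmax - one_hot(target).
                err = [e / total for e in exps]
                err[target] -= 1.0
                grad_h = [sum(err[j] * w_out[j][k] for j in range(v)) for k in range(dim)]
                for j in range(v):
                    for k in range(dim):
                        w_out[j][k] -= lr * err[j] * h[k]
                for c in ctx:
                    for k in range(dim):
                        w_in[c][k] -= lr * grad_h[k] / len(ctx)
    return {tok: w_in[index[tok]] for tok in vocab}


# Toy corpus with made-up identities; real sentences would come from the GO corpus.
toy_corpus = [
    ["GO_A", "SubClassOf", "GO_B"],
    ["GO_A", "part_of", "GO_C"],
    ["GO_B", "SubClassOf", "GO_C"],
]
vectors = train_cbow(toy_corpus)
# Keep only the entity words: their vectors serve as the GO Term vectors.
go_vectors = {tok: vec for tok, vec in vectors.items() if tok.startswith("GO_")}
print(sorted(go_vectors), len(go_vectors["GO_A"]))
```

The final filter mirrors the text: every token gets a word vector, but only the vectors of entity words are kept as the vectors of GO Terms.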
For example, the GO information corresponding to the protein Q63564 includes GO_0001669, GO_0016021, GO_0022857, GO_0030054, GO_0030672, GO_0043195, and GO_0055085, and the vector corresponding to each of these is determined by the above method.
S203, constructing the characteristic vector of the protein to be identified according to the target vector.
The terminal equipment can construct a characteristic vector of the protein to be identified according to the determined target vector, and the characteristic vector can reflect the characteristics of the protein to be identified, so that the construction of the characteristics of the protein to be identified is completed.
For example, for the protein Q63564, the terminal device may process the vectors corresponding to the determined pieces of GO information to construct the feature vector of the protein Q63564.
In the embodiments of the present application, a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance. When a feature vector needs to be constructed for a protein, such as a protein to be identified, the target gene ontology information of that protein is determined from the gene ontology database, and the target vector corresponding to the target gene ontology information is selected from the pre-obtained vectors according to the identity of the target gene ontology information. A feature vector of the protein to be identified is then constructed from the target vector. Because genes determine the functions and characteristics of proteins, and gene ontology information reflects gene information together with molecular functions and biological processes, a feature vector built from gene ontology information takes this information into account, which improves the accuracy of the constructed protein features.
It should be noted that the embodiments of the present application provide several ways of constructing the feature vector from the target vectors. If the target gene ontology information includes a plurality of pieces, one way is to sum and average the target vectors corresponding to those pieces to obtain the feature vector. In this way, proteins whose n×m feature matrices differ in size, because they have different numbers of GO Terms, are converted into feature vectors of a fixed length m, which makes the feature vectors of different proteins comparable and improves the accuracy of protein function classification.
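The sum-and-average construction amounts to a column-wise mean over the target vectors. A minimal sketch follows; the function name `mean_pool` and the sample numbers are chosen for illustration only.

```python
from typing import List


def mean_pool(target_vectors: List[List[float]]) -> List[float]:
    """Sum the n target vectors element-wise and divide by n."""
    if not target_vectors:
        raise ValueError("at least one target vector is required")
    n = len(target_vectors)
    m = len(target_vectors[0])
    if any(len(vec) != m for vec in target_vectors):
        raise ValueError("all target vectors must have the same dimension")
    return [sum(vec[k] for vec in target_vectors) / n for k in range(m)]


# Three GO Term vectors of dimension 2 collapse to one vector of dimension 2.
feature = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(feature)  # [3.0, 4.0]
```

Whatever the number of GO Terms n, the output length is always m, which is what makes the vectors of different proteins directly comparable.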
Because a functional classification scenario may impose a specific requirement on the length of the feature vector, another way to construct the feature vector is to preset the length of the feature vector as required, perform dimension reduction on the target vector according to the preset length, and construct the feature vector from the processed target vector.
Many dimension reduction techniques may be used, such as principal component analysis (PCA) and singular value decomposition (SVD); this embodiment does not limit the choice.
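As one possible illustration of reducing a target vector to a preset length, the sketch below implements PCA through NumPy's SVD. It assumes that the projection is fitted on a matrix of pre-obtained vectors (one row per vector); the function names `fit_pca` and `reduce_dimension` are hypothetical, and the application does not prescribe this particular procedure.

```python
import numpy as np


def fit_pca(vectors: np.ndarray, preset_length: int):
    """Fit a PCA projection on row vectors; return (mean, components)."""
    if not 1 <= preset_length <= min(vectors.shape):
        raise ValueError("preset_length must be between 1 and min(n_rows, n_cols)")
    mean = vectors.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by
    # decreasing singular value of the centered data.
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:preset_length]


def reduce_dimension(vector, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project one vector (or a matrix of row vectors) to the preset length."""
    return (np.asarray(vector) - mean) @ components.T
```

Fitting once and reusing `mean` and `components` keeps every protein in the same reduced space, so the reduced feature vectors remain comparable with one another.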
The feature vector obtained by the above method can be used to classify the function of the protein to be identified, to determine the similarity between proteins, and to predict the binding sites of the protein to be identified.
When classifying the function of the protein to be identified according to the feature vector, the terminal device can complete tasks such as classifying the functions of various proteins by combining the feature vector with a deep learning convolution technique.
When predicting the binding site of the protein to be identified according to the feature vector, the feature vector may be used to expand the residue feature of the binding site in order to improve the accuracy of the prediction. Specifically, the terminal device may obtain the residue feature of a target binding site in the protein to be identified. Because the residue feature is obtained from the amino acid sequence of the protein to be identified, the terminal device may expand the residue feature of the target binding site according to the feature vector so that it reflects the target binding site more accurately, and then represent the target binding site with the expanded residue feature, that is, with the combination of the residue feature determined from the amino acid sequence of the protein to be identified and the feature vector.
Because the feature vector is obtained from gene ontology information, the expanded residue feature also reflects the target binding site from the genetic perspective. The expanded residue feature is therefore more accurate, which improves the accuracy of the subsequent binding site prediction.
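The expansion step can be sketched as below. The description only states that the residue feature is combined with the feature vector; interpreting that combination as concatenation is an assumption made here for illustration, and the function name `expand_residue_features` is hypothetical.

```python
from typing import List, Sequence


def expand_residue_features(
    residue_features: Sequence[Sequence[float]],
    protein_feature_vector: Sequence[float],
) -> List[List[float]]:
    """Append the protein-level feature vector to each residue feature.

    Concatenation is an assumed form of "combination"; other forms are possible.
    """
    return [list(r) + list(protein_feature_vector) for r in residue_features]
```

For example, two residue features of length 2 expanded with a protein feature vector of length 3 yield two expanded features of length 5, each carrying the same protein-level information.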
Exemplary apparatus
Based on the protein feature construction method provided in the foregoing embodiments, an embodiment of the application further provides a protein feature construction device, in which a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance. Referring to fig. 4a, the device includes a first determining unit 401, a second determining unit 402, and a construction unit 403:
the first determining unit 401 is configured to determine target gene ontology information of a protein to be identified according to the gene ontology database, where the target gene ontology information has an identity;
the second determining unit 402 is configured to determine, according to the identity, a target vector corresponding to the target gene ontology information from among vectors obtained in advance;
the construction unit 403 is configured to construct a feature vector of the protein to be identified according to the target vector.
Optionally, the first determining unit 401 is further configured to:
express the category and axiom content included in each piece of gene ontology information as sentences to obtain a training corpus, wherein the training corpus comprises the sentences corresponding to the category and the axiom content;
and train word vectors according to the training corpus to generate the vector corresponding to each piece of gene ontology information.
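The corpus construction step can be sketched as follows. The entry layout (an `id` plus a list of `(relation, target)` axioms), the relation names, and the function name `build_training_corpus` are all assumptions made for illustration; the actual serialization of categories and axioms is not specified here.

```python
from typing import Dict, List


def build_training_corpus(go_entries: List[Dict]) -> List[List[str]]:
    """Turn each (class, relation, target) axiom into one tokenized sentence."""
    corpus: List[List[str]] = []
    for entry in go_entries:
        for relation, target in entry["axioms"]:
            # One sentence per axiom: the class identity, the relation, the target.
            corpus.append([entry["id"], relation, target])
    return corpus


# Hypothetical entries; identities and relations are placeholders.
EXAMPLE_ENTRIES = [
    {"id": "GO:0000002", "axioms": [("SubClassOf", "GO:0000001")]},
    {"id": "GO:0000003", "axioms": [("SubClassOf", "GO:0000001"),
                                    ("part_of", "GO:0000002")]},
]
```

Because each identity appears as a token in these sentences, a word-vector trainer run over the corpus would yield one vector per identity, which is the form the later lookup by identity relies on.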
Optionally, if the target gene ontology information includes a plurality of pieces, the construction unit 403 is configured to:
add and average the target vectors corresponding to the plurality of pieces of target gene ontology information to obtain the feature vector.
Optionally, the construction unit 403 is configured to:
performing dimension reduction processing on the target vector according to the preset length of the feature vector;
and constructing the feature vector according to the processed target vector.
Optionally, the feature vector is used for classifying the function of the protein to be identified and/or for predicting the binding site of the protein to be identified.
Optionally, if the feature vector is used to predict the binding site of the protein to be identified, referring to fig. 4b, the device further comprises an acquisition unit 404 and an expansion unit 405:
the acquisition unit 404 is configured to acquire a residue feature of a target binding site in the protein to be identified;
the expansion unit 405 is configured to expand the residue feature of the target binding site according to the feature vector.
The embodiment of the application also provides a data processing device, which comprises a memory and a processor,
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the protein feature construction method according to any one of the corresponding embodiments of fig. 2 according to the instructions in the program code.
An embodiment of the present application provides a storage medium having instructions stored therein, which when executed on a data processing apparatus, cause the data processing apparatus to perform the protein feature construction method according to any one of the corresponding embodiments of fig. 2.
Embodiments of the present application provide a computer program product which, when run on a data processing apparatus, causes the data processing apparatus to perform the protein feature construction method of any of the corresponding embodiments of fig. 2.
It is noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Because the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The foregoing is merely exemplary of the application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the application, and such modifications and adaptations are also intended to fall within the scope of protection of the application.

Claims (7)

1. A protein feature construction method, characterized in that a vector corresponding to each piece of gene ontology information in a gene ontology database is obtained in advance, the gene ontology information reflecting gene information, molecular functions and biological processes, the method comprising:
determining target gene ontology information of proteins to be identified according to the gene ontology database, wherein the target gene ontology information has an identity;
determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity;
constructing a feature vector of the protein to be identified according to the target vector, wherein the feature vector is applied to protein similarity comparison, protein function classification and protein binding site prediction;
if the feature vector is used for predicting the binding site of the protein to be identified, acquiring the residue feature of the target binding site in the protein to be identified;
expanding the residue feature of the target binding site according to the feature vector, and representing the target binding site by the expanded residue feature, which specifically comprises: characterizing the target binding site by combining a residue feature determined based on the amino acid sequence of the protein to be identified with the feature vector;
if the feature vector is used for protein function classification, performing protein function classification using the feature vector and a deep learning convolution algorithm;
the method for determining the vector corresponding to each piece of gene ontology information in the gene ontology database comprises the following steps:
expressing the category and axiom content included in each piece of gene ontology information as sentences to obtain training corpus, wherein the training corpus comprises sentences corresponding to the category and the axiom content;
and training word vectors according to the training corpus, and generating vectors corresponding to each piece of gene ontology information.
2. The method according to claim 1, wherein if the target gene ontology information includes a plurality of pieces, the constructing the feature vector of the protein to be identified from the target vector includes:
and adding and averaging the target feature vectors corresponding to the plurality of pieces of target gene ontology information to obtain the feature vectors.
3. The method according to claim 1, wherein said constructing a feature vector of the protein to be identified from the target vector comprises:
performing dimension reduction processing on the target vector according to the preset length of the feature vector;
and constructing the feature vector according to the processed target vector.
4. The method according to claim 1, characterized in that the feature vector is used for classifying the function of the protein to be identified and/or for predicting the binding site of the protein to be identified.
5. A protein characteristic construction apparatus characterized in that vectors corresponding to each piece of gene ontology information in a gene ontology database are obtained in advance, the gene ontology information reflecting gene information, molecular functions and biological processes, the apparatus comprising a first determination unit, a second determination unit and a construction unit:
the first determining unit is used for determining target gene ontology information of the protein to be identified according to the gene ontology database, wherein the target gene ontology information is provided with an identity;
the second determining unit is used for determining a target vector corresponding to the target gene ontology information from the vectors obtained in advance according to the identity;
the construction unit is used for constructing a characteristic vector of the protein to be identified according to the target vector, and the characteristic vector is applied to protein similarity comparison, protein function classification and protein binding site prediction;
the acquisition unit is used for acquiring the residue characteristics of the target binding site in the protein to be identified if the characteristic vector is used for predicting the binding site of the protein to be identified;
the expansion unit is used for expanding the residue feature of the target binding site according to the feature vector, and representing the target binding site by the expanded residue feature, which specifically comprises: characterizing the target binding site by combining a residue feature determined based on the amino acid sequence of the protein to be identified with the feature vector;
the classifying unit is used for classifying the protein functions through the feature vector and the deep learning convolution algorithm if the feature vector is used for classifying the protein functions;
the method for determining the vector corresponding to each piece of gene ontology information in the gene ontology database comprises the following steps:
expressing the category and axiom content included in each piece of gene ontology information as sentences to obtain training corpus, wherein the training corpus comprises sentences corresponding to the category and the axiom content;
and training word vectors according to the training corpus, and generating vectors corresponding to each piece of gene ontology information.
6. A data processing apparatus, comprising a memory and a processor,
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the protein feature construction method of any one of claims 1-4 according to instructions in the program code.
7. A storage medium having instructions stored therein that, when executed on a data processing device, cause the data processing device to perform the protein feature construction method of any one of claims 1-4.
CN201911329568.7A 2019-12-20 2019-12-20 Protein feature construction method, protein feature construction device, protein feature construction apparatus, protein feature construction program product, and protein feature construction program product Active CN111091874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911329568.7A CN111091874B (en) 2019-12-20 2019-12-20 Protein feature construction method, protein feature construction device, protein feature construction apparatus, protein feature construction program product, and protein feature construction program product

Publications (2)

Publication Number Publication Date
CN111091874A CN111091874A (en) 2020-05-01
CN111091874B CN111091874B (en) 2024-01-19

Family

ID=70396642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911329568.7A Active CN111091874B (en) 2019-12-20 2019-12-20 Protein feature construction method, protein feature construction device, protein feature construction apparatus, protein feature construction program product, and protein feature construction program product

Country Status (1)

Country Link
CN (1) CN111091874B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant