CN111627494B

CN111627494B - Protein property prediction method and device based on multidimensional features and computing equipment

Info

Publication number: CN111627494B
Application number: CN202010474289.6A
Authority: CN
Inventors: 王天元; 翟珂; 黄健; 张琳; 赖力鹏; 温书豪; 马健
Original assignee: Beijing Jingtai Technology Co ltd
Current assignee: Beijing Jingtai Technology Co ltd
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2023-12-01
Anticipated expiration: 2040-05-29
Also published as: CN111627494A

Abstract

The invention discloses a protein property prediction method based on multidimensional features. The method is executed in a computing device that comprises a protein property prediction model whose input is the assembled protein features and whose output is the predicted protein property. The method comprises the following steps: acquiring sequence data and structure data of a protein to be tested, and extracting from them an amino acid sequence feature, a specified residue feature and a three-dimensional structure diagram feature of the protein, wherein the amino acid sequence feature comprises amino acid composition and physicochemical properties, the specified residue feature comprises the self attributes and environmental attributes of the specified residue, and the three-dimensional structure diagram feature comprises residue node attributes and edge attributes; and assembling the three extracted features into protein features, and processing the protein features with the protein property prediction model to obtain the predicted property of the protein to be tested. The invention also discloses a corresponding protein property prediction device and computing device based on multidimensional features.

Description

Protein property prediction method and device based on multidimensional features and computing equipment

Technical Field

The invention relates to the field of drug virtual screening, in particular to a protein property prediction method, device and computing equipment based on multidimensional characteristics.

Background

As is well known, drug development is a long process that suffers from long development cycles, low success rates and high costs. With advances in computer technology and big data, artificial intelligence is delivering great value across industries and is receiving a great deal of attention in the pharmaceutical industry. In new drug discovery, virtual screening can improve the enrichment of active molecules; by predicting the properties of compounds, it can save a great deal of manpower and material resources, shorten the drug development cycle and accelerate the translation of research results, and it has therefore received great attention from research institutions and pharmaceutical companies in recent years.

In virtual drug screening, proteins need to be vectorized to extract feature representations so that a computer can process the protein data. However, existing protein feature extraction is limited to a single type of feature, and the lack of a general-purpose tool that extracts protein features from multiple angles greatly limits machine learning modeling of biological macromolecules.

Disclosure of Invention

To this end, the present invention provides a method, apparatus and computing device for predicting protein properties based on multidimensional features in an effort to solve or at least alleviate at least one of the problems presented above.

According to one aspect of the present invention there is provided a method of predicting protein properties based on multi-dimensional features, adapted to be executed in a computing device comprising a protein property prediction model having inputs for assembled protein features and outputs for predicted property attributes, the method comprising the steps of: acquiring sequence data and structure data of a protein to be detected; respectively extracting an amino acid sequence characteristic, a specified residue characteristic and a three-dimensional structure diagram characteristic of the protein to be detected from sequence data and structure data, wherein the amino acid sequence characteristic comprises an amino acid composition and physicochemical properties, the specified residue characteristic comprises self attribute and environmental attribute of the specified residue, and the three-dimensional structure diagram characteristic comprises a residue node attribute and an edge attribute; and assembling the amino acid sequence characteristic, the designated residue characteristic and the three-dimensional structure diagram characteristic into protein characteristics, and processing the protein characteristics by adopting a protein property prediction model to obtain the property attribute of the protein to be detected.

Optionally, in the protein property prediction method according to the present invention, the amino acid sequence features include at least one of: frequency of occurrence of amino acids, group characteristics of consecutive N amino acids, sequence binary characteristics, amino acid index characteristics, auto-correlation of amino acid indices, protein sequence order correlation, amino acid structure and property distribution characteristics.

Optionally, in the protein property prediction method according to the present invention, the three-dimensional structure diagram is represented as nodes and edges, each node representing one residue and labeled with the protein chain identifier, the residue number and the amino acid type, and each edge representing the interaction between two residues.

Optionally, in the protein property prediction method according to the present invention, the interaction comprises at least one of: hydrophobic interactions, disulfide bonds, hydrogen bonds, ionic bonds, aromatic ring interactions, aromatic ring and sulfur interactions, cation and pi bond interactions, and backbone atom interactions.

Optionally, in the protein property prediction method according to the present invention, the self attribute includes at least one of: the solvent accessible surface area of the residue, the temperature factor mean of all atoms in the residue, the distance mean of all atoms in the residue from the solvent accessible surface, and the dihedral angle of the peptide bond backbone of the residue.

Optionally, in the protein property prediction method according to the present invention, the environmental attribute includes a secondary structure of a region where the residue is located and/or a number of carbon atoms within a predetermined distance around the residue.

Optionally, in the protein property prediction method according to the present invention, the residue node attribute includes at least one of: residue amino acid class, isoelectric point, molecular weight, number of adjacent nodes, and interaction class.

Optionally, in the protein property prediction method according to the present invention, the edge attribute includes a shortest path of the node pair.

Optionally, in the protein property prediction method according to the present invention, the three-dimensional structure diagram feature further includes a graph attribute including at least one of: the number of edges, and the minimum and maximum of the eccentricities of all nodes.

Optionally, in the protein property prediction method according to the present invention, further comprising a training step of a protein property prediction model: acquiring sequence data, structure data and property attributes of a plurality of sample proteins; extracting amino acid sequence characteristics, specified residue characteristics and three-dimensional structure diagram characteristics of the sample protein from sequence data and structure data respectively, and assembling the three characteristics into protein characteristics; and taking the protein characteristics as sample input, taking the corresponding property attributes as sample labels, and training the protein property prediction model to obtain a trained protein property prediction model.

Optionally, in the protein property prediction method according to the present invention, the protein property prediction model is any one of the following: a functional peptide prediction model, whose output property attribute is whether the polypeptide has a specific function; a protein druggability prediction model, whose output property attribute is whether the protein can be used as a target site.

Optionally, in the protein property prediction method according to the present invention, the protein property prediction model is a LightGBM model.

According to a further aspect of the present invention there is provided a multi-dimensional feature based protein property prediction apparatus adapted to reside in a computing device comprising a protein property prediction model having inputs for assembled protein features and outputs for predicted property attributes, the apparatus comprising: a data acquisition module, adapted to acquire sequence data and structure data of the protein to be tested; a feature generation module, adapted to extract an amino acid sequence feature, a specified residue feature and a three-dimensional structure diagram feature of the protein to be tested from the sequence data and the structure data, wherein the amino acid sequence feature comprises amino acid composition and physicochemical properties, the specified residue feature comprises the self attributes and environmental attributes of the specified residue, and the three-dimensional structure diagram feature comprises residue node attributes and edge attributes; and a property prediction module, adapted to assemble the amino acid sequence feature, the specified residue feature and the three-dimensional structure diagram feature into protein features, and to process the protein features with the protein property prediction model to obtain the property attribute of the protein to be tested.

According to yet another aspect of the present invention, there is provided a computing device comprising: a memory; one or more processors; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the multi-dimensional feature-based protein property prediction method as described above.

According to yet another aspect of the present invention, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform the multi-dimensional feature-based protein property prediction method as described above.

According to the technical scheme of the invention, protein feature vectors are extracted from both the sequence and the structure of the protein, and protein features are extracted from multiple angles and in three dimensions by combining methods based on amino acid statistics, residue structural characteristics and graph representation. The extracted features are both representative and compact, and the protein is comprehensively characterized by the most appropriate feature types and compositions. This ensures the accuracy of subsequent model prediction, reduces the amount of feature and model computation, and improves the overall efficiency and accuracy of virtual drug screening.

The invention can improve the accuracy of the activity screening of the small molecular compound and greatly accelerate the research and development flow of the small molecular medicine.

The foregoing is only an overview of the technical scheme of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set forth below.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which set forth the various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to fall within the scope of the claimed subject matter. The above, as well as additional objects, features, and advantages of the present disclosure will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Like reference numerals generally refer to like parts or elements throughout the present disclosure.

FIG. 1 illustrates a schematic diagram of a computing device 100 according to some implementations of the invention;

FIG. 2 illustrates a flow chart of a multi-dimensional feature based protein property prediction method 200 according to one embodiment of the invention;

FIG. 3 shows a schematic diagram of an amino acid group box according to one embodiment of the invention;

FIG. 4 is a schematic diagram showing a three-dimensional structure of a protein according to one embodiment of the present invention;

FIG. 5 illustrates a block diagram of a multi-dimensional feature based protein property prediction apparatus 500 in accordance with one embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 is a block diagram of a computing device 100 according to one embodiment of the invention. In a basic configuration 102, computing device 100 typically includes a system memory 106 and one or more processors 104. The memory bus 108 may be used for communication between the processor 104 and the system memory 106.

Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a first level cache 110 and a second level cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations, the memory controller 118 may be an internal part of the processor 104.

Depending on the desired configuration, system memory 106 may be any type of memory including, but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some implementations, the application 122 may be arranged to operate with the program data 124 on the operating system. In computing device 100 according to the present invention, program data 124 contains instructions for performing the multi-dimensional feature based protein property prediction method 200.

Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to basic configuration 102 via bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 via one or more communication ports 164 over a network communication link.

The network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

Computing device 100 may be implemented as a server, such as a file server, a database server, an application server or a web server, or as part of a small-sized portable (or mobile) electronic device, such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-browsing device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer including desktop and notebook computer configurations. In some embodiments, the computing device 100 is configured to perform a multi-dimensional feature-based protein property prediction method 200.

FIG. 2 illustrates a flow chart of a multi-dimensional feature based protein property prediction method 200 according to one embodiment of the invention. The method 200 is performed in a computing device, such as the computing device 100, to predict properties of a protein to be tested using a trained protein property prediction model. As shown in FIG. 2, the method starts at step S210.

In step S210, sequence data and structure data of a protein to be tested are acquired. The sequence data and the structure data of the protein to be detected can be obtained from a public database. Generally, after receiving a name or number of a protein to be detected, which is input or selected by a user in an application platform interface of the computing device, the computing device invokes primary sequence data and three-dimensional structure data of the protein to be detected from a database.

Subsequently, in step S220, the amino acid sequence characteristics, the specified residue characteristics, and the three-dimensional structural diagram characteristics of the protein to be tested are extracted from the sequence data and the structural data, respectively.

Here, the present invention extracts protein features from both the primary amino acid sequence and the 3D structure of the protein, and the extracted features fall into two levels: full-length protein and residue. The amino acid sequence features extracted from the primary amino acid sequence include amino acid composition and physicochemical properties. The specified residue features extracted from the 3D structure of the protein include the self attributes and environmental attributes of the specified residue. The three-dimensional structure diagram features extracted from the three-dimensional structure diagram of the protein include residue node attributes and edge attributes, and may further include graph attributes.

According to one embodiment of the invention, the amino acid sequence features include at least one of the following: frequency of occurrence of amino acids, group characteristics of consecutive N amino acids, sequence binary characteristics, amino acid index characteristics, auto-correlation of amino acid indices, protein sequence order correlation, amino acid structure and property distribution characteristics.

Here, the frequency of occurrence of amino acids is the frequency with which each of the 20 natural amino acids occurs in the protein sequence. If the total length of the sequence is $M$ and the number of occurrences of amino acid $t$ is $M_t$, the frequency of occurrence of that amino acid is $f(t) = M_t / M$. For the group features of N consecutive amino acids, N consecutive amino acids form an amino acid group frame, and the frame slides from the N-terminus to the C-terminus one amino acid at a time, yielding the amino acid group composition of the protein sequence. The amino acid group composition is the frequency with which each group frame occurs among all group frames. N may take a value from 3 to 7, and may specifically be 5. FIG. 3 is a schematic diagram of counting amino acid group frames, in which the first group frame is VQLQE, the second group frame is QLQES, and so on.
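The two statistics described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names and the example sequence (which merely begins with the FIG. 3 frames VQLQE and QLQES) are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 natural amino acids, one-letter codes


def amino_acid_frequency(seq):
    """Return f(t) = M_t / M for each of the 20 natural amino acids."""
    counts = Counter(seq)
    total = len(seq)  # M, the sequence length (assumed non-zero)
    return {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}


def group_frame_frequency(seq, n=5):
    """Slide a frame of n residues from the N-terminus to the C-terminus,
    one residue at a time, and return each frame's share of all frames."""
    frames = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(frames)
    return {frame: c / len(frames) for frame, c in counts.items()}


# Hypothetical sequence; only its first residues follow the FIG. 3 example.
seq = "VQLQESGGGLVQ"
freq = amino_acid_frequency(seq)
frames = group_frame_frequency(seq, n=5)
```

Because every frame is counted against the total number of frames, the frame frequencies sum to 1, just as the single amino acid frequencies do.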

The sequence binary feature represents each amino acid as a binary number of length 20: the single-letter abbreviations of the 20 natural amino acids are arranged in alphabetical order, and in the code of each amino acid the bit at its own position is set to 1 and all other bits are set to 0. For example, amino acid A is denoted 10000000000000000000.
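As a small sketch of this encoding (the function name `binary_encode` is hypothetical, and residues outside the 20 natural amino acids are assumed not to occur):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes in alphabetical order


def binary_encode(seq):
    """Encode each residue as a 20-bit string with a single 1 at its own position."""
    codes = []
    for aa in seq:
        bits = ["0"] * len(AMINO_ACIDS)
        # str.index raises ValueError for a non-standard residue letter.
        bits[AMINO_ACIDS.index(aa)] = "1"
        codes.append("".join(bits))
    return codes


encoded = binary_encode("AC")
```

A sequence of length L therefore yields L codes of 20 bits each, which can be concatenated into a single 20 × L binary vector if a flat feature is needed.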

The amino acid index features can be obtained from an amino acid index database, which collects numerical characterizations of various physicochemical properties for all 20 amino acids; in the invention, 8 amino acid indices may be used. For the autocorrelation of amino acid indices, the mean $\bar{P}$ and standard deviation $\sigma$ of each index over the 20 natural amino acids are first calculated, and all amino acid index values are normalized.

Thereafter, the autocorrelation is calculated for each lag $d$, where $d$ is the lag of the autocorrelation coefficient and nlag is the maximum lag (30 by default). $P_i$ and $P_{i+d}$ are the index values of the amino acids at positions $i$ and $i+d$, respectively, and $\bar{P}$ is the mean of the index $P$ over the amino acids of a protein sequence of length $N$.
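The symbols defined above (a lag, a maximum lag of 30, index values at positions i and i+d, and a sequence mean over length N) are consistent with a Moran-type autocorrelation descriptor. The sketch below assumes that form; this is an assumption rather than something the text confirms, and the function name and toy input values are hypothetical.

```python
def moran_autocorrelation(values, nlag=30):
    """Moran-type autocorrelation of per-residue index values (assumed form).

    I(d) = [1/(N-d) * sum_{i=1..N-d} (P_i - mean) * (P_{i+d} - mean)]
           / [1/N * sum_{i=1..N} (P_i - mean)^2],   for d = 1..nlag
    """
    n = len(values)  # N, assumed non-zero
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    coefficients = []
    for d in range(1, nlag + 1):
        if d >= n or variance == 0:
            # The lag exceeds the sequence, or the index is constant.
            coefficients.append(0.0)
            continue
        covariance = sum(
            (values[i] - mean) * (values[i + d] - mean) for i in range(n - d)
        ) / (n - d)
        coefficients.append(covariance / variance)
    return coefficients


# Toy, already-normalized index values for a six-residue sequence.
toy_values = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
coeffs = moran_autocorrelation(toy_values, nlag=2)
```

With one coefficient per lag and per amino acid index, 8 indices and the default nlag of 30 would give 240 values per protein under this assumed form.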

Amino acid sequence features may also include one or more of: composition/transition/distribution features (C/T/D features), conjoint triad (association triplet) features, BLOSUM62 features, Z-scale features, quasi-sequence-order features, and pseudo-amino acid composition features. The C/T/D features describe the distribution, along the protein sequence, of amino acids having a particular structural or physicochemical property. Thirteen physicochemical properties have been used to calculate these features, such as hydrophobicity, van der Waals force, polarity, polarizability, charge, secondary structure and solvent accessibility. Based on each property the 20 natural amino acids can be divided into 3 groups, and the composition and distribution of the 3 groups are counted as the C/T/D features.
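To make the three-group idea concrete, the sketch below computes the composition and transition parts of a C/T/D feature for one property. The hydrophobicity grouping used here is a commonly cited one and is an assumption, since the text does not list the groups; the function names are hypothetical and the distribution part is omitted for brevity.

```python
# Assumed three-way hydrophobicity grouping: 1 = polar, 2 = neutral, 3 = hydrophobic.
HYDROPHOBICITY_GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}


def to_group_string(seq, groups=HYDROPHOBICITY_GROUPS):
    """Replace every residue by the label of the group it belongs to."""
    lookup = {aa: label for label, members in groups.items() for aa in members}
    return "".join(lookup[aa] for aa in seq)


def composition(group_string):
    """Fraction of residues that fall into each of the 3 groups."""
    return {g: group_string.count(g) / len(group_string) for g in "123"}


def transition(group_string):
    """Fraction of adjacent residue pairs that switch between two given groups."""
    pairs = list(zip(group_string, group_string[1:]))
    result = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        changes = sum(1 for x, y in pairs if {x, y} == {a, b})
        result[a + b] = changes / len(pairs)
    return result


groups = to_group_string("RGCRG")  # toy five-residue sequence
comp = composition(groups)
trans = transition(groups)
```

Repeating the same calculation for each physicochemical property, each with its own three-way grouping, yields one block of C/T/D values per property.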

The association triplet feature takes any three consecutive amino acids in the protein sequence as one "triplet" unit, and the properties of an amino acid together with its adjacent amino acids are taken as the feature of the triplet. The protein sequence is represented by a two-tuple (V, F): V is the vector space of sequence features, in which each feature $V_i$ represents one triplet type, and F is the vector of counts corresponding to V, i.e., $F_i$ (the value of the i-th dimension of F) is the number of times the triplet type $V_i$ appears in the protein sequence.

The BLOSUM62 feature uses a BLOSUM62 matrix as the feature set, each row in the matrix representing an amino acid in the sequence. The matrix is constructed by first representing each residue in the training set using a matrix of m x n elements, where n represents the length of each peptide chain and m is set to 20 (20 amino acids).

The Z-scale feature characterizes each amino acid with five physical and chemical property variables, denoted as Z1-Z5, respectively, which result in a Z-scale characterization of each amino acid in the protein sequence.

According to another embodiment of the present invention, the structure-related properties of a specified residue site are extracted from the PDB file of the 3D structure of the protein. In this case, the computing device receives as input the PDB file of the protein to be tested and the specified residue site, and outputs a CSV file listing the features of the specified residue.

Specifically, the self attributes of the specified residue include at least one of: the solvent accessible surface area of the residue, the mean temperature factor (B-factor) of all atoms in the residue, the mean distance of all atoms in the residue from the solvent accessible surface, and the dihedral angles of the peptide backbone of the residue. The environmental attributes of the specified residue include the secondary structure of the region in which the residue is located and/or the number of carbon atoms within a predetermined distance around the residue.
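Two of these attributes, the mean B-factor and the surrounding carbon count, can be read directly from the fixed-width ATOM records of a PDB file. The sketch below is illustrative only: the helper names, the toy coordinates, the 10 Å default cutoff, and the choice to exclude the residue's own atoms from the carbon count are all assumptions. Solvent accessibility, depth, dihedral angles and secondary structure are left out because they normally require dedicated structure-analysis tools.

```python
import math


def atom_line(serial, name, res_name, chain, res_seq, x, y, z, b_factor, element):
    """Format one fixed-width PDB ATOM record (helper for the toy input only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{b_factor:6.2f}          {element:>2}"
    )


def parse_atoms(pdb_text):
    """Read chain, residue number, coordinates, B-factor and element per atom."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atoms.append({
            "chain": line[21],
            "res_seq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            "b_factor": float(line[60:66]),
            "element": line[76:78].strip(),
        })
    return atoms


def residue_features(atoms, chain, res_seq, cutoff=10.0):
    """Mean B-factor of a residue and carbon atoms of other residues within cutoff."""
    own = [a for a in atoms if a["chain"] == chain and a["res_seq"] == res_seq]
    others = [a for a in atoms if not (a["chain"] == chain and a["res_seq"] == res_seq)]
    b_mean = sum(a["b_factor"] for a in own) / len(own)
    carbons = sum(
        1 for a in others
        if a["element"] == "C"
        and min(math.dist(a["xyz"], o["xyz"]) for o in own) <= cutoff
    )
    return {"b_factor_mean": b_mean, "carbon_count": carbons}


# Toy structure: residue A79 plus two neighbours, one of them far away.
pdb_text = "\n".join([
    atom_line(1, "N", "GLN", "A", 79, 0.0, 0.0, 0.0, 10.0, "N"),
    atom_line(2, "CA", "GLN", "A", 79, 1.0, 0.0, 0.0, 20.0, "C"),
    atom_line(3, "N", "SER", "A", 80, 4.0, 0.0, 0.0, 12.0, "N"),
    atom_line(4, "CA", "SER", "A", 80, 3.0, 0.0, 0.0, 14.0, "C"),
    atom_line(5, "CA", "CYS", "B", 12, 50.0, 0.0, 0.0, 30.0, "C"),
])
features_a79 = residue_features(parse_atoms(pdb_text), "A", 79)
```

The resulting dictionary maps naturally onto one row of the CSV feature list described above, with one column per attribute.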

According to yet another embodiment of the invention, the three-dimensional structure diagram is represented as nodes and edges, each node representing a residue and labeled with the protein chain identifier, the residue number and the amino acid type, and each edge representing the interaction between two residues. The interaction includes at least one of: hydrophobic interactions, disulfide bonds, hydrogen bonds, ionic bonds, aromatic ring interactions, aromatic ring and sulfur interactions, cation and pi bond interactions, and backbone atom interactions. FIG. 4 is a graph representation of the 3D structure of a protein, in which A79GLN denotes residue 79 of chain A, whose amino acid type is glutamine (GLN).

In general, the computing device outputs three levels of features based on the input protein PDB file and the specified residue positions: residue node attributes, edge attributes, and graph attributes. Specifically, the residue node attributes include at least one of: residue amino acid category, isoelectric point, molecular weight, number of adjacent nodes, interaction category, and the reciprocal of the average distance to other nodes.

The amino acid category of a residue is computed by one-hot encoding: the code is initially a 20-dimensional vector in which every entry is 0, and the entry corresponding to the amino acid of the residue is changed to 1. For example, cysteine CYS is represented as [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]. The interaction category can likewise be one-hot encoded: there are 9 interaction positions, each initially 0, and when a given interaction occurs the corresponding position is changed to 1. The isoelectric point and molecular weight are values associated with the residue's amino acid category. Node importance refers to the importance, for the protein property, of the amino acid residue represented by a node, since some amino acids or polypeptides form the core structure underlying the protein property. Node weight reflects that a key node has multiple nodes pointing to it and may itself point to multiple nodes at the same time; the node weight can therefore be calculated from the incoming and outgoing links of each node. The statistics of the amino acid categories of surrounding nodes count, for a specific node, the number of nodes of each amino acid category among all nodes within a distance of one, two or three nodes.

The edge attributes include the shortest path of the node pair. The graph attributes include at least one of: the number of edges contained in the whole graph, and the minimum and maximum of the eccentricities of all nodes.
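The node, edge and graph attributes named above can all be derived from hop distances in the residue interaction graph. The following sketch uses only the standard library and a breadth-first search; the function names are hypothetical, the residue labels other than A79GLN and the interactions between them are invented for illustration, and the graph is assumed to be undirected and connected.

```python
from collections import deque


def build_graph(edges):
    """Undirected residue graph: node -> {neighbor: interaction type}."""
    graph = {}
    for a, b, interaction in edges:
        graph.setdefault(a, {})[b] = interaction
        graph.setdefault(b, {})[a] = interaction
    return graph


def hop_distances(graph, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def graph_features(graph):
    distances = {node: hop_distances(graph, node) for node in graph}
    # Eccentricity of a node: its largest shortest-path distance to any node.
    eccentricity = {node: max(d.values()) for node, d in distances.items()}
    nodes = {}
    for node, d in distances.items():
        total = sum(d.values())
        nodes[node] = {
            "neighbors": len(graph[node]),
            # Reciprocal of the average distance to the other nodes.
            "closeness": (len(d) - 1) / total if total else 0.0,
        }
    return {
        "nodes": nodes,
        "shortest_paths": distances,
        "edge_count": sum(len(nb) for nb in graph.values()) // 2,
        "min_eccentricity": min(eccentricity.values()),
        "max_eccentricity": max(eccentricity.values()),
    }


# Hypothetical interactions; only the label A79GLN comes from the FIG. 4 example.
edges = [
    ("A79GLN", "A80SER", "hydrogen bond"),
    ("A80SER", "A81CYS", "backbone"),
    ("A81CYS", "B12CYS", "disulfide bond"),
]
features = graph_features(build_graph(edges))
```

In graph-theoretic terms the minimum and maximum eccentricity are the radius and diameter of the graph, so these two graph attributes summarize how compact the interaction network is.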

Three characteristics of the protein to be tested are extracted through step S220. For each feature, other attribute parameters may be added as desired by those skilled in the art, as the present invention is not limited in this regard.

Subsequently, in step S230, the amino acid sequence feature, the specified residue feature, and the three-dimensional structure feature are assembled into protein features, and the protein features are processed by using a protein property prediction model to obtain property attributes of the protein to be measured.
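One simple way to assemble the three feature groups is to concatenate them into a single named vector. The sketch below assumes each group has already been reduced to a flat mapping of numeric values; the function name, the key prefixes and the example values are hypothetical.

```python
def assemble_features(sequence_feats, residue_feats, graph_feats):
    """Concatenate three feature groups into one ordered, prefixed mapping."""
    assembled = {}
    blocks = (("seq", sequence_feats), ("res", residue_feats), ("graph", graph_feats))
    for prefix, feats in blocks:
        # Sorting the names keeps the column order identical across proteins.
        for name in sorted(feats):
            assembled[f"{prefix}.{name}"] = float(feats[name])
    return assembled


# Made-up example values for a single protein.
protein_features = assemble_features(
    {"freq_A": 0.08, "freq_C": 0.02},
    {"b_factor_mean": 15.0, "carbon_count": 12},
    {"edge_count": 3, "max_eccentricity": 3, "min_eccentricity": 2},
)
feature_names = list(protein_features)
feature_vector = list(protein_features.values())
```

Keeping a fixed column order matters because the same layout must be used when the model is trained and when it is applied to a new protein.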

Here, the input of the protein property prediction model is the assembled protein features, and the output is a predicted property attribute, i.e., a particular property of the protein, which may be a discrete value or a continuous value. The property attribute may be whether the protein has a certain property, such as whether a polypeptide is active or whether the protein is druggable. The property attribute may also be a specific predicted value for the protein, such as its stability or activity.

In general, the protein property prediction model may be any one of the following: a functional peptide prediction model, whose output property attribute is whether the polypeptide has a specific function; a protein druggability prediction model, whose output property attribute is whether the protein can be used as a target site.

The protein property prediction model can be set up as a regression model or a classification model as needed, for example a random forest, a support vector machine, stochastic gradient descent, or Bayesian regression. It should be understood that a variety of classification and regression models can predict specific properties; the invention is not limited to a particular form, and all classification or regression models that can predict protein properties are within the scope of the invention. Moreover, the specific structure and parameters of the model can be set by those skilled in the art as required, and the present invention is not limited thereto. Optionally, the protein property prediction model is a LightGBM model. Among the hyperparameters of the model, the boosting type is GBDT, the maximum number of leaf nodes of each base learner is 32-38, the learning rate is 0.005-0.02, the evaluation metric is AUC, the L1 regularization coefficient is 0.3-0.8, and the L2 regularization coefficient is 0.
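The hyperparameter ranges above map onto LightGBM's parameter names roughly as follows. The specific values are illustrative picks from inside the stated ranges, and the binary objective is an assumption that fits a yes/no property label; neither is prescribed by the text.

```python
# Illustrative LightGBM parameter dictionary; values are example picks only.
lgbm_params = {
    "objective": "binary",      # assumption: the property label is yes/no
    "boosting_type": "gbdt",    # boosting type: GBDT
    "num_leaves": 35,           # maximum leaves per base learner, within 32-38
    "learning_rate": 0.01,      # within 0.005-0.02
    "metric": "auc",            # evaluation metric: AUC
    "lambda_l1": 0.5,           # L1 regularization coefficient, within 0.3-0.8
    "lambda_l2": 0.0,           # L2 regularization coefficient
}
```

A dictionary of this shape can be passed as the `params` argument of `lightgbm.train` together with a training dataset built from the assembled protein features and their labels.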

According to one embodiment of the present invention, the method 200 may further include a training step of the protein property prediction model, specifically including: and acquiring sequence data, structure data and property attributes of a plurality of sample proteins, and respectively extracting amino acid sequence features, designated residue features and three-dimensional structure diagram features of the sample proteins from the sequence data and the structure data, so as to assemble the three features into protein features. And then, taking the assembled protein characteristics as sample input, taking the corresponding property attributes as sample labels, and training the protein property prediction model to obtain the trained protein property prediction model.

The sequence data, structure data, and property attributes of the sample proteins may be obtained from existing databases (e.g., the PDBbind dataset) and represented in CSV format. They specifically include simplified molecular-input line-entry system (SMILES) strings, molecule numbers, and specific property information, for example activity values, which may be expressed as IC50, Ki, Kd, etc., although not limited thereto. Whichever property attribute the model is to predict, the corresponding property attribute of the sample proteins in the database is selected as the sample label. After the protein property prediction model has been trained, it can be used to predict the properties of the protein to be tested; that is, the assembled protein features of the protein to be tested are input into the model for prediction.
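Reading such CSV records and selecting one property column as the sample label might look like the sketch below. The column names (`molecule_id`, `smiles`, `ic50_nm`), the helper `load_labeled_rows`, and the three rows of data are all made up for illustration; the actual layout of any given database export will differ.

```python
import csv
import io

# Hypothetical CSV content; real files would come from a database export.
SAMPLE_CSV = """molecule_id,smiles,ic50_nm
M001,CCO,120.0
M002,c1ccccc1,
M003,CC(=O)O,35.5
"""


def load_labeled_rows(csv_text, label_column):
    """Parse CSV text and keep rows that have a value in the label column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        raw = (row[label_column] or "").strip()
        if not raw:
            continue  # skip samples that lack the property to be predicted
        rows.append({
            "molecule_id": row["molecule_id"],
            "smiles": row["smiles"],
            "label": float(raw),
        })
    return rows


rows = load_labeled_rows(SAMPLE_CSV, "ic50_nm")
```

Choosing a different `label_column` corresponds to training the model for a different property attribute.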

In practical application, a user only needs to enter or select the number or name of the protein to be tested and designate the residue name in an interface; the three types of features of the protein are then calculated and assembled into protein features, and the corresponding protein property is output.

FIG. 5 illustrates a multi-dimensional feature-based protein property prediction apparatus 500 according to one embodiment of the invention. The apparatus is suitable for residing in a computing device that includes a protein property prediction model whose input is the assembled protein features and whose output is the predicted property attribute. As shown in FIG. 5, the apparatus includes a data acquisition module 510, a feature generation module 520, and a property prediction module 530.

The data acquisition module 510 acquires sequence data and structure data of the protein to be tested. The data acquisition module 510 may perform a process corresponding to the process described above in step S210, and a detailed description will not be repeated here.

The feature generation module 520 extracts amino acid sequence features, specified residue features, and three-dimensional structure map features of the protein to be tested from the sequence data and the structure data, respectively. The amino acid sequence features include amino acid composition and physicochemical properties, the specified residue features include self attributes and environmental attributes of the specified residue, and the three-dimensional structure map features include residue node attributes and edge attributes. The feature generation module 520 may perform a process corresponding to the process described above in step S220, and a detailed description will not be repeated here.
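As one concrete instance of an amino acid composition feature, the frequency of occurrence of each amino acid can be computed directly from the sequence. The sketch below is illustrative only; the function name, the fixed ordering of the 20 standard residues, and the choice to ignore non-standard letters are assumptions.

```python
# The 20 standard amino acids in one-letter code, in a fixed feature order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the frequency of each standard amino acid in the sequence.

    Assumption: letters outside the 20 standard residues are ignored.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for letter in sequence.upper():
        if letter in counts:
            counts[letter] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


composition = amino_acid_composition("AACD")
```

The result is a 20-element vector whose entries sum to 1, which can serve as one block of the amino acid sequence features.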

The property prediction module 530 assembles the amino acid sequence feature, the specified residue feature, and the three-dimensional structure map feature into protein features, and processes the protein features using a protein property prediction model to obtain property attributes of the protein to be measured. The property prediction module 530 may perform a process corresponding to the process described above in step S230, and a detailed description thereof will not be repeated here.
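The cooperation of the three modules can be sketched as a small pipeline. This is a structural illustration under stated assumptions, not the apparatus itself: the class `PredictionApparatus`, its field names, and the toy stand-in functions are hypothetical, and concatenation is assumed as the assembly step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

FeatureGroups = Tuple[List[float], List[float], List[float]]


@dataclass
class PredictionApparatus:
    """Hypothetical wiring of modules 510, 520 and 530."""
    acquire: Callable[[str], Tuple[str, str]]       # data acquisition (510)
    generate: Callable[[str, str], FeatureGroups]   # feature generation (520)
    model: Callable[[List[float]], float]           # trained prediction model

    def predict(self, protein_id: str) -> float:
        sequence, structure = self.acquire(protein_id)
        seq_f, res_f, graph_f = self.generate(sequence, structure)
        # Property prediction (530): assemble the features, then apply the model.
        protein_features = [*seq_f, *res_f, *graph_f]
        return self.model(protein_features)


# Toy stand-ins so the sketch runs; none of them is a real extractor or model.
def toy_acquire(protein_id):
    return "ACD", "placeholder structure record"


def toy_generate(sequence, structure):
    return [float(len(sequence))], [1.0], [2.0]


apparatus = PredictionApparatus(acquire=toy_acquire, generate=toy_generate, model=sum)
result = apparatus.predict("example-protein")
```

Keeping each module behind a plain callable makes it straightforward to swap in a different feature extractor or model without touching the others.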

According to one embodiment of the invention, the apparatus 500 may further include a model training module adapted to: acquire sequence data, structure data, and property attributes of a plurality of sample proteins; extract amino acid sequence features, specified residue features, and three-dimensional structure diagram features of the sample proteins from the sequence data and the structure data, respectively, and assemble the three features into protein features; and, taking the protein features as sample input and the corresponding property attributes as sample labels, train the protein property prediction model to obtain a trained protein property prediction model.

According to the technical scheme of the invention, protein feature vectors are extracted from both the sequence and the structure of the protein, and methods based on amino acid statistical information, residue structural features, and graph representation are used together to extract protein features from multiple angles and in three dimensions for subsequent prediction scenarios. The invention provides a general protein feature extraction tool with which three types of sequence and structural features of a protein are calculated together. The features are comprehensive and representative, so the ability to model and learn from biomacromolecules can be improved while the feature computation and model computation are kept as low as possible. The invention trains an efficient model using a small, highly representative combination of features, balancing model prediction accuracy against model computation, thereby improving the efficiency and accuracy of biological analysis as a whole.

A8, the method of any of A1-A7, wherein the edge attribute comprises a shortest path of a node pair. A9, the method of any of A1-A8, wherein the three-dimensional structure map features further comprise graph-level attributes comprising at least one of: the number of edges, and the minimum and maximum eccentricity over all nodes.
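The quantities named in A8 and A9 can be computed with a breadth-first search over the residue graph. The sketch below is illustrative and rests on assumptions: the graph is treated as undirected, unweighted, and connected, so path length is a hop count; the function names and the four-residue toy graph are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Hop-count shortest paths from source via breadth-first search."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def graph_attributes(edges):
    """Edge count, min/max node eccentricity and all-pairs shortest paths."""
    if not edges:
        raise ValueError("graph has no edges")
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    paths = {node: shortest_path_lengths(adjacency, node) for node in adjacency}
    if any(len(d) != len(adjacency) for d in paths.values()):
        raise ValueError("eccentricity is undefined for a disconnected graph")
    # Eccentricity of a node: its greatest shortest-path distance to any node.
    eccentricity = {node: max(d.values()) for node, d in paths.items()}
    return {
        "edge_count": len({frozenset(edge) for edge in edges}),
        "min_eccentricity": min(eccentricity.values()),
        "max_eccentricity": max(eccentricity.values()),
        "shortest_paths": paths,
    }


# Toy residue graph: a chain of four residues joined by three interactions.
toy_edges = [("A1", "A2"), ("A2", "A3"), ("A3", "A4")]
attributes = graph_attributes(toy_edges)
```

The minimum and maximum eccentricity correspond to the radius and diameter of the graph, respectively.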

A10, the method of any one of A1-A9, further comprising the training step of the protein property prediction model: acquiring sequence data, structure data and property attributes of a plurality of sample proteins; extracting amino acid sequence features, specified residue features and three-dimensional structure diagram features of the sample proteins from the sequence data and the structure data respectively, and assembling the three features into protein features; and taking the protein characteristics as sample input, taking the corresponding property attributes as sample labels, and training the protein property prediction model to obtain a trained protein property prediction model.

A11, the method of any one of A1-A10, wherein the protein property prediction model is any one of the following: a functional peptide prediction model, whose output property attribute is whether a polypeptide has a specific function; or a protein drug-resistance prediction model, whose output property attribute is whether the protein can be used as a target site. A12, the method of any one of A1-A11, wherein the protein property prediction model is a LightGBM model.

The technology discussed herein refers to processor cores, processors, servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from these systems. The inherent flexibility of computer-based systems allows for a variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For example, the processes discussed herein may be implemented using a single device or component or multiple devices or components working in combination. Databases and applications may be implemented on a single system or distributed across multiple systems. Distributed components may operate sequentially or in parallel.

The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions of the methods and apparatus of the present invention, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy diskettes, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the method of the invention in accordance with instructions in said program code stored in the memory.

By way of example, and not limitation, readable media comprise readable storage media and communication media. The readable storage medium stores information such as computer readable instructions, data structures, program modules, or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.

In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with examples of the invention. The required structure for a construction of such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into a plurality of sub-modules.

Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

Furthermore, some of the embodiments are described herein as methods or combinations of method elements that may be implemented by a processor of a computer system or by other means of performing the functions. Thus, a processor with the necessary instructions for implementing the described method or method element forms a means for implementing the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.

As used herein, unless otherwise specified the use of the ordinal terms "first," "second," "third," etc., to describe a general object merely denote different instances of like objects, and are not intended to imply that the objects so described must have a given order, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the above description, will appreciate that other embodiments are contemplated within the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is defined by the appended claims.

Claims

1. A method of protein property prediction based on multi-dimensional features, adapted to be performed in a computing device comprising a protein property prediction model having inputs for assembled protein features and outputs for predicted property attributes, the property attributes comprising: whether a polypeptide has a specific function, whether a protein can serve as a targeting site, the method comprising the steps of:

acquiring sequence data and structure data of a protein to be detected;

respectively extracting an amino acid sequence characteristic, a specified residue characteristic and a three-dimensional structure diagram characteristic of the protein to be detected from the sequence data and the structure data, wherein the amino acid sequence characteristic comprises an amino acid composition and physicochemical properties, the specified residue characteristic comprises self attribute and environmental attribute of the specified residue, and the three-dimensional structure diagram characteristic comprises a residue node attribute and an edge attribute; and

assembling the amino acid sequence characteristic, the specified residue characteristic and the three-dimensional structure diagram characteristic into protein characteristics, and processing the protein characteristics by adopting the protein property prediction model to obtain the predicted property of the protein to be detected.

2. The method of claim 1, wherein the amino acid sequence features comprise at least one of:

frequency of occurrence of amino acids, group characteristics of consecutive N amino acids, sequence binary characteristics, amino acid index characteristics, auto-correlation of amino acid indices, protein sequence order correlation, amino acid structure and property distribution characteristics.

3. The method according to claim 1 or 2, wherein,

the three-dimensional structure is represented as nodes and edges, each node representing a residue and labeled with the protein chain number, the residue number and the amino acid type, and each edge representing the interaction between two residues.

4. A method according to claim 3, wherein the interaction comprises at least one of:

hydrophobic interactions, disulfide bonds, hydrogen bonds, ionic bonds, aromatic ring interactions, aromatic ring and sulfur interactions, cation and pi bond interactions, and backbone atom interactions.

5. The method of claim 1 or 2, wherein the self-attribute comprises at least one of:

the solvent accessible surface area of the residue, the temperature factor mean of all atoms in the residue, the distance mean of all atoms in the residue from the solvent accessible surface, and the dihedral angle of the peptide bond backbone of the residue.

6. The method of claim 1 or 2, wherein the environmental attribute comprises the secondary structure of the region in which the residue is located and/or the number of carbon atoms within a predetermined distance around the residue.

7. The method of claim 1 or 2, wherein the residue node properties comprise at least one of:

residue amino acid class, isoelectric point, molecular weight, number of adjacent nodes, and interaction class.

8. The method of claim 1 or 2, wherein the edge attribute comprises a shortest path of a node pair.

9. The method of claim 1 or 2, wherein the three-dimensional structure diagram characteristic further comprises graph-level attributes comprising at least one of:

the number of edges, and the minimum and maximum eccentricity over all nodes.

10. The method of claim 1 or 2, further comprising the step of training the protein property prediction model:

acquiring sequence data, structure data and property attributes of a plurality of sample proteins;

extracting amino acid sequence features, specified residue features and three-dimensional structure diagram features of the sample proteins from the sequence data and the structure data respectively, and assembling the three features into protein features;

and taking the protein characteristics as sample input, taking the corresponding property attributes as sample labels, and training the protein property prediction model to obtain a trained protein property prediction model.

11. The method of claim 1 or 2, wherein the protein property prediction model is any one of:

a functional peptide prediction model, the output property attribute of which is whether the polypeptide has a specific function;

a protein drug-resistance prediction model, the output property attribute of which is whether the protein can be used as a target site.

12. The method of claim 1 or 2, wherein the protein property prediction model is a LightGBM model.

13. A protein property prediction apparatus based on multi-dimensional features, adapted to reside in a computing device, the computing device comprising a protein property prediction model having inputs for assembled protein features and outputs for predicted property attributes, the property attributes comprising: whether a polypeptide has a specific function, whether a protein can serve as a targeting site, the device comprising:

the data acquisition module is suitable for acquiring sequence data and structure data of the protein to be detected;

the characteristic generation module is suitable for respectively extracting an amino acid sequence characteristic, a specified residue characteristic and a three-dimensional structure diagram characteristic of the protein to be detected from the sequence data and the structure data, wherein the amino acid sequence characteristic comprises an amino acid composition and physicochemical properties, the specified residue characteristic comprises self attribute and environmental attribute of the specified residue, and the three-dimensional structure diagram characteristic comprises a residue node attribute and an edge attribute; and

the property prediction module is suitable for assembling the amino acid sequence characteristic, the specified residue characteristic and the three-dimensional structure diagram characteristic into protein characteristics, and processing the protein characteristics by adopting the protein property prediction model to obtain the predicted property of the protein to be detected.

14. A computing device, comprising:

a memory;

one or more processors;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-12.

15. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-12.