WO2019041333A1 - Method, apparatus, device and storage medium for predicting protein binding sites - Google Patents

Method, apparatus, device and storage medium for predicting protein binding sites

Info

Publication number
WO2019041333A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
amino acid
vector
protein sequence
feature
Prior art date
Application number
PCT/CN2017/100314
Inventor
张勇
何威
徐勇
赵东宁
Original Assignee
深圳大学
哈尔滨工业大学深圳研究生院
Priority date
Filing date
Publication date
Application filed by 深圳大学 and 哈尔滨工业大学深圳研究生院
Priority to JP2019511995A (patent JP6850874B2, ja)
Priority to US16/255,857 (patent US11620567B2, en)
Publication of WO2019041333A1 (zh)

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • The present invention belongs to the field of bioinformatics, and particularly relates to a method, apparatus, device and storage medium for predicting protein binding sites.
  • Bioinformatics has received widespread attention, and a growing number of researchers from different fields are devoting themselves to it.
  • Bioinformatics is an interdisciplinary field that studies information and its flow in biology and biological systems. Its body of knowledge draws on biology (genetics, biochemistry, etc.), mathematics (probability and mathematical statistics, algorithms, etc.), computer science (machine learning, computational theory, etc.), physical chemistry (molecular modeling, thermodynamics, etc.) and many other disciplines.
  • Proteins are the embodiment of life activities and the most important basic unit through which all living things express life. A protein can be regarded as the smallest automatic machine in nature, and it plays an irreplaceable role in the operation of biological systems.
  • The different roles of proteins in cells are regulated by interactions between proteins and other proteins, DNA, RNA, and ligands. Protein-protein interactions involve the association of protein molecules and play a critical role in every biological process of living cells, such as DNA synthesis, gene transcriptional activation, protein translation, modification and localization, and information transmission. Therefore, exploring the sequence and structural properties of protein-protein interactions is critical to understanding cellular activity.
  • The object of the present invention is to provide a method, apparatus, computing device and storage medium for predicting protein binding sites, aiming to solve the problem that the prior art predicts protein binding sites with insufficient accuracy and generality.
  • The invention provides a method of predicting a protein binding site, the method comprising the steps of:
  • The document feature vector and the biological feature vector are classified using a preset amino acid residue classification model to obtain the amino acid residue type of the protein sequence.
  • The present invention provides a protein binding site prediction apparatus, the apparatus comprising:
  • a sequence division unit configured to receive a protein sequence to be predicted and divide it using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence;
  • a first vector construction unit configured to construct a word vector of the protein sequence according to the plurality of amino acid subsequences, wherein each word element of the word vector represents one of the amino acid subsequences, to perform document feature extraction on the word elements, and to construct a document feature vector of the protein sequence according to the extracted document features;
  • a second vector construction unit configured to perform protein chain biological feature extraction on the amino acid subsequences represented by the word elements, and to construct a biological feature vector of the protein sequence according to the extracted biological features; and
  • a result obtaining unit configured to classify the document feature vector and the biological feature vector using a preset amino acid residue classification model to obtain an amino acid residue type of the protein sequence.
  • The present invention also provides a computing device that supplies the computing environment required for sequence division and classification model construction, together with a computer program operable in that environment; the processor executes the computer program to implement the steps of the protein binding site prediction method described above.
  • The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the protein binding site prediction method described above.
  • The present invention receives a protein sequence to be predicted and divides it using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence. A word vector of the protein sequence is constructed from the obtained amino acid subsequences, with each word element of the word vector representing one amino acid subsequence. Document feature extraction is performed on the word elements, and a document feature vector of the protein sequence is constructed from the extracted document features. Protein chain biological feature extraction is performed on the amino acid subsequences, and a biological feature vector of the protein sequence is constructed from the extracted biological features. The amino acid subsequences represented by the document feature vector and the biological feature vector are then classified using the preset amino acid residue classification model to obtain the amino acid residue type of the protein sequence, thereby improving the accuracy and generality of protein binding site prediction.
  • FIG. 1 is a flow chart of an implementation of a method for predicting a protein binding site according to a first embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a protein binding site prediction apparatus according to a second embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a protein binding site prediction apparatus according to a third embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a computing device according to a fourth embodiment of the present invention.
  • Embodiment 1:
  • FIG. 1 shows the implementation flow of a method for predicting a protein binding site according to Embodiment 1 of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, which are described in detail as follows:
  • In step S101, the protein sequence to be predicted is received, and the protein sequence is divided using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted.
  • Embodiments of the invention are applicable to prediction systems for protein binding sites.
  • After the protein sequence to be predicted is received, a sliding window is started, and the protein sequence is divided by adjusting the size of the sliding window and the sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted, so that local blocks of the protein sequence serve as the units of subsequent analysis.
  • The size of the sliding window is (2*window + 1 - 2*b), where window is a preset value and b is a randomly generated variable between 0 and window-1. Such a sliding window contains window-b neighboring residues on each side of the target residue.
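The window rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `split_into_subsequences`, drawing a fresh `b` for every window, and clipping windows at the ends of the chain are assumptions, since the text does not say how often `b` is drawn or how terminal residues are handled.

```python
import random

def split_into_subsequences(sequence, window=3, step=1, rng=None):
    """Return (center_index, subsequence) pairs, one per target residue."""
    rng = rng or random.Random(0)          # seeded for reproducibility (assumption)
    pieces = []
    for center in range(0, len(sequence), step):
        b = rng.randint(0, window - 1)     # random shrink, 0 <= b <= window - 1
        half = window - b                  # neighbours kept on each side
        start = max(0, center - half)      # clip at the N-terminus (assumption)
        end = min(len(sequence), center + half + 1)  # clip at the C-terminus
        pieces.append((center, sequence[start:end]))
    return pieces
```

Away from the chain ends, each piece has the nominal length 2*window + 1 - 2*b, with the target residue in the middle.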
  • The amino acid residue classification model is obtained by machine learning training before the protein sequence to be predicted is received.
  • A Stacking ensemble learning algorithm can be used for the machine learning to improve the classification accuracy and generalization ability of the amino acid residue classification model.
  • During training, each training protein sequence in the preset training set is first divided using the preset sliding window and sliding step to obtain a plurality of training amino acid subsequences constituting the training protein sequence. A training word vector of the training protein sequence is then constructed from the obtained training amino acid subsequences, with each training word element of the training word vector representing one training amino acid subsequence. Document feature extraction is performed on the training word elements, and a document feature training vector of the training protein sequence is constructed from the extracted document features. Protein chain biological feature extraction is performed on the training amino acid subsequences represented by the training word elements, and a biological feature training vector of the training protein sequence is constructed from the extracted biological features. Finally, the pre-built classification model is trained using the training amino acid subsequences represented by the document feature training vector and the biological feature training vector.
  • When the preset training end condition is reached, the trained classification model is set as the amino acid residue classification model, which provides a classification model for the subsequent classification of amino acid residues and improves classification efficiency.
  • The training end condition may be that the number of training iterations reaches a preset count or that the loss during training reaches a preset value.
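The two end conditions can be illustrated with a small training loop. The sketch below is hypothetical: it uses plain logistic regression with gradient descent as a stand-in model (the text does not specify the model or loss at this point), and the names `train_until_done`, `max_epochs` and `loss_target` are invented for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_until_done(samples, labels, max_epochs=500, loss_target=0.05, lr=0.5):
    """Stop when the epoch budget is used up or the mean loss reaches the target."""
    dim, n = len(samples[0]), len(samples)
    weights, bias = [0.0] * dim, 0.0
    for epoch in range(1, max_epochs + 1):
        grad_w, grad_b, loss = [0.0] * dim, 0.0, 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            p = min(max(p, 1e-12), 1 - 1e-12)      # keep log() finite
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            for j, v in enumerate(x):
                grad_w[j] += (p - y) * v
            grad_b += p - y
        if loss / n <= loss_target:                # end condition 2: loss value
            return weights, bias, epoch, "loss_reached"
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias, max_epochs, "max_epochs_reached"  # end condition 1
```

Returning the reason alongside the parameters makes it easy to tell whether training converged or simply ran out of iterations.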
  • The Stacking ensemble learning algorithm is used to train the preset model to obtain the amino acid residue classification model.
  • Specifically, the first layer of the Stacking model trains multiple base classifiers, each on a different kind of protein chain biological feature. The prediction results of the base classifiers are then spliced with the document feature vector, and the spliced vector is used as the final feature vector to train the amino acid residue classification model.
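The two-layer arrangement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a nearest-centroid rule stands in for both the base classifiers and the final classifier, labels are assumed to be 0 or 1, and the function names and the `bio_groups` dictionary layout are hypothetical. A production stacking setup would normally feed out-of-fold base predictions to the second layer, which this sketch omits for brevity.

```python
def fit_centroids(vectors, labels):
    """Nearest-centroid stand-in classifier: one mean vector per class label."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict_centroid(model, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vec))
    return min(model, key=lambda lab: dist(model[lab]))

def fit_stacking(bio_groups, doc_vectors, labels):
    """bio_groups maps a feature-kind name to one vector per sample."""
    names = sorted(bio_groups)
    # Layer 1: one base classifier per kind of biological feature.
    base = {n: fit_centroids(bio_groups[n], labels) for n in names}
    # Splice base predictions with the document feature vector.
    spliced = [
        [float(predict_centroid(base[n], bio_groups[n][i])) for n in names] + list(doc)
        for i, doc in enumerate(doc_vectors)
    ]
    # Layer 2: final classifier trained on the spliced vectors.
    return names, base, fit_centroids(spliced, labels)

def predict_stacking(model, bio_sample, doc_vector):
    names, base, final = model
    preds = [float(predict_centroid(base[n], bio_sample[n])) for n in names]
    return predict_centroid(final, preds + list(doc_vector))
```

The same splice-then-classify path is used at prediction time, so the final classifier always sees vectors with the layout it was trained on.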
  • In step S102, a word vector of the protein sequence is constructed from the obtained amino acid subsequences, with each word element of the word vector representing one amino acid subsequence; document feature extraction is performed on the word elements, and a document feature vector of the protein sequence is constructed from the extracted document features.
  • The word vector of the protein sequence is first constructed from the amino acid subsequences, with each word element of the word vector representing one amino acid subsequence. Document feature extraction is then performed on the word elements, and finally a document feature vector of the protein sequence is constructed from the extracted document features.
  • The extracted document features include TFIDF sequence features and N-gram sequence features.
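To make the two document features concrete, the sketch below treats each protein as a document whose words are its amino acid subsequences. It is an assumption-laden illustration: the text does not state which TF-IDF weighting or which N-gram unit is used, so this uses the common tf = count / length and idf = log(N / df) variant, and counts N-grams over any token sequence. The names `tfidf` and `ngram_counts` are hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of word lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

def ngram_counts(tokens, n=2):
    """Count contiguous n-grams over any sequence (a string or a list of words)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

With this weighting, a subsequence that occurs in every protein gets weight zero, while one that is frequent in a single protein is weighted up.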
  • When constructing the word vector of the protein sequence, each amino acid subsequence is assigned a unique number, and the word2vec algorithm maps these unique numbers into a k-dimensional vector space to obtain the word vector of the protein sequence. This effectively reduces the feature dimension, yields a deeper feature representation of the text data, and makes use of all the data in the high-dimensional word vectors, enlarging the effective data scale and benefiting the subsequent classification.
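The numbering step, and the kind of training examples a word2vec model consumes, can be sketched as follows. The embedding training itself is deliberately left out, since it would normally be delegated to an existing word2vec implementation. The skip-gram pairing shown here is one of word2vec's two architectures and is an assumption, as the text does not say which one is used; the names `build_vocabulary` and `skipgram_pairs` are hypothetical.

```python
def build_vocabulary(subsequences):
    """Assign each distinct subsequence a unique integer, in order of first appearance."""
    vocab = {}
    for sub in subsequences:
        vocab.setdefault(sub, len(vocab))
    return vocab

def skipgram_pairs(ids, context=2):
    """(center, neighbour) ID pairs: the training examples of a skip-gram model."""
    pairs = []
    for i, center in enumerate(ids):
        lo, hi = max(0, i - context), min(len(ids), i + context + 1)
        pairs.extend((center, ids[j]) for j in range(lo, hi) if j != i)
    return pairs

# Usage: number the subsequences of one protein, then derive training pairs.
subsequences = ["MKT", "KTA", "TAY", "KTA"]
vocab = build_vocabulary(subsequences)
ids = [vocab[s] for s in subsequences]
pairs = skipgram_pairs(ids, context=1)
```

A trained model then maps each ID to a dense k-dimensional vector, which is far smaller than a one-hot encoding over all distinct subsequences.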
  • In step S103, protein chain biological feature extraction is performed on the amino acid subsequences represented by the word elements, and the biological feature vector of the protein sequence is constructed from the extracted biological features.
  • Protein chain biological features are first extracted from the amino acid subsequences obtained by the sequence division, and the biological feature vector of the protein sequence is then constructed from the extracted biological features. The extracted biological features include position-specific scoring matrix features and pseudo-amino acid composition features, which effectively represent local information such as the order in which amino acids appear in the sequence, strengthen the ability of the feature vector to represent protein sequence information, and thus make the biological features in the biological feature vector more comprehensive.
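As a rough illustration of the pseudo-amino acid composition feature, the sketch below computes a simplified variant: 20 composition terms plus `lam` sequence-order correlation terms. It assumes the 20 standard residues only, uses the Kyte-Doolittle hydropathy index as the single, unnormalized physicochemical property, and treats `lam` and `weight` as free parameters; none of these choices come from the text. Position-specific scoring matrix features are not shown because they require running a sequence alignment tool.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Kyte-Doolittle hydropathy index, used here as the only property (assumption).
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}

def pseudo_aac(sequence, lam=2, weight=0.05):
    """Return a (20 + lam)-dimensional simplified pseudo amino acid composition."""
    if lam >= len(sequence):
        raise ValueError("lam must be smaller than the sequence length")
    # Ordinary amino acid composition: frequency of each standard residue.
    freqs = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    # Sequence-order terms: mean squared property difference at gap k.
    thetas = []
    for k in range(1, lam + 1):
        diffs = [
            (HYDROPATHY[sequence[i]] - HYDROPATHY[sequence[i + k]]) ** 2
            for i in range(len(sequence) - k)
        ]
        thetas.append(sum(diffs) / len(diffs))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

The last `lam` components are what carry the residue-order information that plain composition discards.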
  • In step S104, the amino acid subsequences represented by the document feature vector and the biological feature vector are classified using the preset amino acid residue classification model to obtain the amino acid residue type of the protein sequence.
  • The amino acid residue type indicates whether the amino acid residue is a binding site of the protein sequence.
  • During classification, a prediction is first made from the biological feature vector, the prediction result is then spliced with the document feature vector, and finally the spliced feature vector is classified, thereby further improving the accuracy of protein binding site prediction.
  • The preset amino acid residue classification model is the amino acid residue classification model obtained by the training described above, which improves the prediction accuracy for binding sites of the protein sequence.
  • Embodiment 2:
  • FIG. 2 shows the structure of a protein binding site prediction apparatus according to a second embodiment of the present invention. For the convenience of description, only parts related to the embodiment of the present invention are shown, including:
  • The sequence dividing unit 21 is configured to receive the protein sequence to be predicted, and divide the protein sequence using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted.
  • The first vector construction unit 22 is configured to construct a word vector of the protein sequence from the obtained amino acid subsequences, with each word element of the word vector representing one amino acid subsequence, to perform document feature extraction on the word elements, and to construct a document feature vector of the protein sequence from the extracted document features.
  • The second vector construction unit 23 is configured to perform protein chain biological feature extraction on the amino acid subsequences represented by the word elements, and to construct a biological feature vector of the protein sequence from the extracted biological features.
  • The result obtaining unit 24 is configured to classify the amino acid subsequences represented by the document feature vector and the biological feature vector using a preset amino acid residue classification model to obtain the amino acid residue type of the protein sequence.
  • In this embodiment, the sequence dividing unit 21 receives the protein sequence to be predicted and divides it using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted. The first vector construction unit 22 constructs a word vector of the protein sequence from the obtained amino acid subsequences, with each word element representing one amino acid subsequence, performs document feature extraction on the word elements, and constructs the document feature vector of the protein sequence from the extracted document features. The second vector construction unit 23 performs protein chain biological feature extraction on the amino acid subsequences represented by the word elements and constructs a biological feature vector of the protein sequence from the extracted biological features. The result obtaining unit 24 uses the preset amino acid residue classification model to classify the amino acid subsequences represented by the document feature vector and the biological feature vector, obtaining the amino acid residue type of the protein sequence and thereby improving the accuracy and generality of protein binding site prediction.
  • Each unit of the protein binding site prediction apparatus may be implemented by a corresponding hardware or software unit; each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit. This is not intended to limit the invention.
  • Embodiment 3:
  • FIG. 3 shows the structure of a protein binding site prediction apparatus according to a third embodiment of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, including:
  • The training sequence dividing unit 31 is configured to divide the training protein sequences in the preset training set using a preset sliding window and sliding step to obtain a plurality of training amino acid subsequences constituting each training protein sequence.
  • The first feature processing unit 32 is configured to construct a training word vector of the training protein sequence from the obtained training amino acid subsequences, with each training word element of the training word vector representing one training amino acid subsequence, to perform document feature extraction on the training word elements, and to construct a document feature training vector of the training protein sequence from the extracted document features.
  • The second feature processing unit 33 is configured to perform protein chain biological feature extraction on the training amino acid subsequences represented by the training word elements, and to construct a biological feature training vector of the training protein sequence from the extracted biological features.
  • The model training unit 34 is configured to train the pre-built classification model using the training amino acid subsequences represented by the document feature training vector and the biological feature training vector and, when the preset training end condition is reached, to set the trained classification model as the amino acid residue classification model.
  • The amino acid residue classification model is obtained by machine learning training before the protein sequence to be predicted is received.
  • A Stacking ensemble learning algorithm can be used for the machine learning to improve the classification accuracy and generalization ability of the amino acid residue classification model.
  • During training, the training sequence dividing unit 31 first divides each training protein sequence in the preset training set using the preset sliding window and sliding step to obtain a plurality of training amino acid subsequences constituting the training protein sequence. The first feature processing unit 32 then constructs the training word vector and the document feature training vector, and the second feature processing unit 33 constructs the biological feature training vector of the training protein sequence from the extracted biological features. Finally, the model training unit 34 trains the pre-built classification model using the training amino acid subsequences represented by the document feature training vector and the biological feature training vector to obtain the amino acid residue classification model, which provides a classification model for the subsequent classification of amino acid residues and improves classification efficiency.
  • The training end condition may be that the number of training iterations reaches a preset count or that the loss during training reaches a preset value.
  • A Stacking ensemble learning algorithm is used to train the preset model to obtain the amino acid residue classification model. Specifically, the first layer of the Stacking model trains multiple base classifiers, each on a different kind of protein chain biological feature. The prediction results of the base classifiers are then spliced with the document feature vector, and the spliced vector is used as the final feature vector to train the amino acid residue classification model.
  • The sequence dividing unit 35 is configured to receive the protein sequence to be predicted, and divide the protein sequence using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted.
  • To reflect the aggregation characteristic of protein-protein binding sites, after receiving the protein sequence to be predicted, the sequence dividing unit 35 starts a sliding window and divides the protein sequence by adjusting the size of the sliding window and the sliding step, obtaining a plurality of amino acid subsequences constituting the protein sequence to be predicted, so that local blocks of the protein sequence serve as the units of subsequent analysis.
  • The size of the sliding window is (2*window + 1 - 2*b), where window is a preset value and b is a randomly generated variable between 0 and window-1. Such a sliding window contains window-b neighboring residues on each side of the target residue.
  • The first vector construction unit 36 is configured to construct a word vector of the protein sequence from the obtained amino acid subsequences, with each word element of the word vector representing one amino acid subsequence, to perform document feature extraction on the word elements, and to construct a document feature vector of the protein sequence from the extracted document features.
  • The first vector construction unit 36 first constructs a word vector of the protein sequence from the amino acid subsequences, with each word element of the word vector representing one amino acid subsequence, then performs document feature extraction on the word elements, and finally constructs a document feature vector of the protein sequence from the extracted document features.
  • The extracted document features include TFIDF sequence features and N-gram sequence features.
  • When constructing the word vector of the protein sequence, each amino acid subsequence is assigned a unique number, and the word2vec algorithm maps these unique numbers into a k-dimensional vector space to obtain the word vector of the protein sequence. This effectively reduces the feature dimension, yields a deeper feature representation of the text data, and makes use of all the data in the high-dimensional word vectors, enlarging the effective data scale and benefiting the subsequent classification.
  • The second vector construction unit 37 is configured to perform protein chain biological feature extraction on the amino acid subsequences represented by the word elements, and to construct a biological feature vector of the protein sequence from the extracted biological features.
  • The second vector construction unit 37 first extracts protein chain biological features from the amino acid subsequences obtained by the sequence division, and then constructs the biological feature vector of the protein sequence from the extracted biological features.
  • The extracted biological features include position-specific scoring matrix features and pseudo-amino acid composition features, which effectively represent local information such as the order in which amino acids appear in the sequence, strengthen the ability of the feature vector to represent protein sequence information, and thus make the biological features in the biological feature vector more comprehensive.
  • The result obtaining unit 38 is configured to classify the amino acid subsequences represented by the document feature vector and the biological feature vector using a preset amino acid residue classification model to obtain the amino acid residue type of the protein sequence.
  • The amino acid residue type indicates whether the amino acid residue is a binding site of the protein sequence.
  • During classification, a prediction is first made from the biological feature vector, the prediction result is then spliced with the document feature vector, and finally the spliced feature vector is classified, thereby further improving the accuracy of protein binding site prediction.
  • The preset amino acid residue classification model is the amino acid residue classification model obtained by the training described above, which improves the prediction accuracy for binding sites of the protein sequence.
  • The result obtaining unit 38 includes:
  • a feature splicing unit 381 configured to make a prediction from the biological feature vector and to splice the prediction result with the document feature vector; and
  • a feature classification unit 382 configured to classify the spliced feature vector obtained by the feature splicing.
  • Each unit of the protein binding site prediction apparatus may be implemented by a corresponding hardware or software unit; each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit. This is not intended to limit the invention.
  • Embodiment 4:
  • FIG. 4 shows the structure of a computing device provided by Embodiment 4 of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown.
  • The computing device 4 of an embodiment of the present invention includes a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and operable on the processor 40.
  • When executing the computer program 42, the processor 40 implements the steps of the above-described method of predicting protein binding sites, such as steps S101 through S104 shown in FIG. 1.
  • Alternatively, the processor 40 executes the computer program 42 to implement the functions of the units in the above-described apparatus embodiments, such as the functions of the units 21 to 24 shown in FIG. 2 and the units 31 to 38 shown in FIG. 3.
  • When the processor 40 executes the computer program 42 to implement the steps of the foregoing method for predicting a protein binding site, a protein sequence to be predicted is received and divided using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted. A word vector of the protein sequence is constructed from the obtained amino acid subsequences, with each word element of the word vector representing one amino acid subsequence. Document feature extraction is performed on the word elements, and a document feature vector of the protein sequence is constructed from the extracted document features. Protein chain biological feature extraction is performed on the amino acid subsequences represented by the word elements, and a biological feature vector of the protein sequence is constructed from the extracted biological features. The amino acid subsequences represented by the document feature vector and the biological feature vector are classified using a preset amino acid residue classification model to obtain the amino acid residue type of the protein sequence, thereby improving the accuracy and generality of protein binding site prediction.
  • a computer readable storage medium storing a computer program, the computer program being executed by a processor, and implementing the method for predicting a protein binding site
  • the computer program is executed by the processor to implement the functions of the units in the respective apparatus embodiments, such as the functions of the units 21 to 24 shown in Fig. 2 and the units 31 to 38 shown in Fig. 3.
  • the protein sequence to be predicted is received, and the protein sequence is sequenced by using a preset sliding window and a sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted, according to the obtained
  • the plurality of amino acid subsequences construct a word vector of the protein sequence, the word element of the word vector represents each amino acid subsequence, the document feature extraction is performed on the word element, and the document feature vector of the protein sequence is constructed according to the extracted document features, and the word element is
  • the expressed amino acid subsequence extracts the biological characteristics of the protein chain, constructs a biological feature vector of the protein sequence according to the extracted biological features, and uses a predetermined amino acid residue classification model to represent the document feature vector and the biological feature vector.
  • the amino acid subsequences are classified to obtain the type of amino acid residues of the protein sequence, thereby improving the accuracy and versatility of protein binding site prediction.
  • for the method for predicting the protein binding site implemented when the computer program is executed by the processor, refer to the description of the steps in the foregoing method embodiments; details are not repeated here.
  • a computer-readable storage medium of an embodiment of the invention may comprise any entity or device capable of carrying computer program code, or a recording medium, such as ROM/RAM, a magnetic disk, an optical disk, or flash memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

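A method, apparatus, device and storage medium for predicting protein binding sites, applicable to the field of bioinformatics technology. The method comprises: receiving a protein sequence to be predicted and dividing it using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences (S101); constructing a word vector of the protein sequence from these amino acid subsequences, performing document feature extraction on the word elements, and constructing a document feature vector of the protein sequence from the extracted document features (S102); performing protein-chain biological feature extraction on these amino acid subsequences and constructing a biological feature vector of the protein sequence from the extracted biological features (S103); and using a preset amino acid residue classification model to classify the amino acid subsequences represented by the document feature vector and the biological feature vector, obtaining the amino acid residue type of the protein sequence (S104), thereby improving the accuracy and versatility of protein binding site prediction.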

Description

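Method, apparatus, device and storage medium for predicting protein binding sites
Technical Field
[0001] The present invention belongs to the field of bioinformatics technology, and in particular relates to a method, apparatus, device and storage medium for predicting protein binding sites.
Background
[0002] In recent years bioinformatics has attracted wide attention, and a growing number of researchers from different fields have taken up work in it. Bioinformatics is a comprehensive discipline that studies the content and flow of information in biological and biology-related systems. Its body of knowledge draws on biology (genetics, biochemistry, etc.), mathematics (probability theory and mathematical statistics, algorithms, etc.), computer science (machine learning, theory of computation, etc.), physical chemistry (molecular modeling, thermodynamics, etc.) and other disciplines.
[0003] Proteins are the embodiment of life activities and the most important basic units through which all organisms express life. They can be regarded as the smallest automatic machines in nature and play an irreplaceable role in the operation of biological systems. The different functions of proteins within a cell are regulated by interactions between proteins, between proteins and DNA, between proteins and RNA, and between proteins and ligands. Protein-protein interaction involves the association of protein molecules, and this association plays a key role in every biological process of a living cell: DNA synthesis, transcriptional activation of genes, protein translation, modification and localization, and signal transduction all involve protein-protein interactions. Exploring the sequence and structural properties of protein-protein interactions is therefore essential to understanding cellular activity.
[0004] With the continuing development of next-generation sequencing technology, the volume of determined protein sequence data keeps growing, and so does the demand for computational tools that can identify protein binding sites quickly and reliably. Locating protein binding sites is essential for analyzing and understanding the molecular details of protein interactions and protein function. At present, most research on predicting protein binding sites, in China and abroad, relies on physicochemical features obtained by specialized measurement of individual sites and on inter-site sequence features computed by analyzing the protein chain. This ignores the clustering behavior of protein binding sites and the association information between amino acid residues, so the accuracy and versatility of protein binding site prediction remain low.
Technical Problem
[0005] The object of the present invention is to provide a method, apparatus, computing device and storage medium for predicting protein binding sites, intended to solve the problem that the prior art predicts protein binding sites with low accuracy and versatility.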
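Solution to Problem
Technical Solution
[0006] In one aspect, the present invention provides a method for predicting protein binding sites, the method comprising the following steps:
[0007] receiving a protein sequence to be predicted, and dividing the protein sequence using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence;
[0008] constructing a word vector of the protein sequence from the plurality of amino acid subsequences, the word elements of the word vector representing each of the amino acid subsequences, performing document feature extraction on the word elements, and constructing a document feature vector of the protein sequence from the extracted document features;
[0009] performing protein-chain biological feature extraction on the amino acid subsequences represented by the word elements, and constructing a biological feature vector of the protein sequence from the extracted biological features;
[0010] using a preset amino acid residue classification model to classify the document feature vector and the biological feature vector, to obtain the amino acid residue type of the protein sequence.
[0011] In another aspect, the present invention provides an apparatus for predicting protein binding sites, the apparatus comprising:
[0012] a sequence division unit, configured to receive a protein sequence to be predicted and divide the protein sequence using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence;
[0013] a first vector construction unit, configured to construct a word vector of the protein sequence from the plurality of amino acid subsequences, the word elements of the word vector representing each of the amino acid subsequences, to perform document feature extraction on the word elements, and to construct a document feature vector of the protein sequence from the extracted document features;
[0014] a second vector construction unit, configured to perform protein-chain biological feature extraction on the amino acid subsequences represented by the word elements and to construct a biological feature vector of the protein sequence from the extracted biological features; and
[0015] a result acquisition unit, configured to use a preset amino acid residue classification model to classify the document feature vector and the biological feature vector, to obtain the amino acid residue type of the protein sequence.
[0016] In another aspect, the present invention further provides the computing environment required for sequence division and classification-model construction, and a computer program runnable in that environment; when a processor executes the computer program, the steps of the method for predicting protein binding sites are implemented.
[0017] In another aspect, the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for predicting protein binding sites.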
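Advantageous Effects of Invention
Beneficial Effects
[0018] The present invention receives a protein sequence to be predicted and divides it using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted; constructs a word vector of the protein sequence from the obtained amino acid subsequences, the word elements of the word vector representing each amino acid subsequence; performs document feature extraction on the word elements and constructs a document feature vector of the protein sequence from the extracted document features; performs protein-chain biological feature extraction on the amino acid subsequences and constructs a biological feature vector of the protein sequence from the extracted biological features; and uses a preset amino acid residue classification model to classify the amino acid subsequences represented jointly by the document feature vector and the biological feature vector, obtaining the amino acid residue type of the protein sequence, thereby improving the accuracy and versatility of protein binding site prediction.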
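Brief Description of Drawings
Description of Drawings
[0019] Fig. 1 is a flowchart of the method for predicting protein binding sites provided by Embodiment 1 of the invention;
[0020] Fig. 2 is a schematic structural diagram of the apparatus for predicting protein binding sites provided by Embodiment 2 of the invention;
[0021] Fig. 3 is a schematic structural diagram of the apparatus for predicting protein binding sites provided by Embodiment 3 of the invention; and
[0022] Fig. 4 is a schematic structural diagram of the computing device provided by Embodiment 4 of the invention.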
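Embodiments of the Invention
[0023] To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
[0024] The specific implementation of the invention is described in detail below with reference to specific embodiments:
[0025] Embodiment 1:
[0026] Fig. 1 shows the flow of the method for predicting protein binding sites provided by Embodiment 1 of the invention. For ease of description, only the parts relevant to this embodiment are shown, as detailed below: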
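[0027] In step S101, a protein sequence to be predicted is received and divided using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted.
[0028] This embodiment is applicable to a system for predicting protein binding sites. To reflect the aggregation behavior of protein-protein binding sites, a sliding window is started after the protein sequence to be predicted is received, and the protein sequence is divided by adjusting the window size and the sliding step, yielding a plurality of amino acid subsequences constituting the protein sequence to be predicted. Local blocks of the protein sequence thus serve as the units of subsequent analysis.
[0029] In this embodiment, the size of the sliding window is preferably (2*window + 1 - 2*b), where window is a preset value and b is a randomly generated variable between 0 and window-1. Such a window contains window-b neighboring residues on each side of the target residue. As the window slides along the amino acid sequence, its size changes randomly between 3 (b = window-1) and 2*window+1 (b = 0), producing protein blocks made up of several amino acid residues. Using protein blocks as the basic unit of subsequent analysis fully reflects the clustering behavior of protein binding sites and in turn improves feature representation, prediction accuracy and versatility.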
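The window rule in [0029] can be illustrated with a short sketch. This is a minimal illustration rather than the applicants' implementation: the function name `split_sequence`, the choice to draw a fresh b for every target residue, and the truncation of blocks at the two ends of the chain are all assumptions not stated in the source.

```python
import random

def split_sequence(sequence, window, step=1, rng=None):
    """Return one variable-length block per target residue.

    For each target position, b is drawn uniformly from 0..window-1 and the
    block keeps (window - b) residues on each side of the target, so an
    untruncated block has length 2*window + 1 - 2*b.
    """
    rng = rng or random.Random()
    blocks = []
    for center in range(0, len(sequence), step):
        b = rng.randint(0, window - 1)        # inclusive on both ends
        half = window - b                     # neighbours kept on each side
        start = max(0, center - half)         # assumption: truncate at chain ends
        end = min(len(sequence), center + half + 1)
        blocks.append(sequence[start:end])
    return blocks

# Hypothetical 22-residue sequence, window = 3, step = 1.
blocks = split_sequence("MKTAYIAKQRQISFVKSHFSRQ", window=3, step=1,
                        rng=random.Random(0))
```

With window = 3, every block centred far enough from the chain ends has length 3, 5 or 7, matching the stated range of 3 to 2*window+1.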
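[0030] Preferably, before the protein sequence to be predicted is received, the amino acid residue classification model is obtained by machine-learning training. Preferably, the Stacking ensemble learning algorithm may be used for the machine learning, which improves the classification accuracy and generalization ability of the amino acid residue classification model.
[0031] Preferably, when the amino acid residue classification model is obtained by machine-learning training, the training protein sequences in a preset training set are first divided using the preset sliding window and sliding step to obtain a plurality of training amino acid subsequences constituting each training protein sequence. A training word vector of the training protein sequence is then constructed from the obtained training amino acid subsequences, the training word elements of the training word vector representing each training amino acid subsequence; document feature extraction is performed on the training word elements, and a document feature training vector of the training protein sequence is constructed from the extracted document features. Protein-chain biological feature extraction is performed on the training amino acid subsequences represented by the training word elements, and a biological feature training vector of the training protein sequence is constructed from the extracted biological features. Finally, the training amino acid subsequences represented by the document feature training vector and the biological feature training vector are used to train a pre-built classification model; when a preset training end condition is reached, the trained classification model is set as the amino acid residue classification model. This provides a classification model for subsequent amino acid residue classification and improves the classification efficiency of the classification model. The training end condition may be set as the number of training rounds reaching a preset number or the loss during training reaching a preset value.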
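The two training end conditions named in [0031] can be sketched as a training loop that stops on whichever is met first. The logistic model, the gradient-descent update and every name and default value here (`train_until_done`, `max_epochs`, `loss_target`, `lr`) are assumptions chosen only to make the stopping logic concrete; the source does not specify the model being trained.

```python
import math

def train_until_done(samples, labels, max_epochs=200, loss_target=0.2, lr=0.5):
    """Fit a logistic model by gradient descent; stop on either preset condition.

    Returns (weights, bias, epochs_run, loss), where loss is the mean
    cross-entropy measured at the start of the last epoch that ran.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    loss = float("inf")
    n = len(samples)
    for epoch in range(1, max_epochs + 1):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            p = min(max(p, 1e-12), 1.0 - 1e-12)    # keep log() finite
            loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            for j, xi in enumerate(x):
                grad_w[j] += (p - y) * xi
            grad_b += p - y
        loss /= n
        if loss <= loss_target:                    # condition: loss reached preset value
            return weights, bias, epoch, loss
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias, max_epochs, loss         # condition: preset number of rounds
```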
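[0032] Specifically, once the multiple types of features have been obtained, the Stacking ensemble learning algorithm is used to train a preset model to obtain the amino acid residue classification model. The first layer of the Stacking model trains multiple base classifiers, each on a different kind of protein-chain biological feature; the predictions of these base classifiers are then concatenated with the document feature vector, and the result is used as the final feature vector for training, yielding the amino acid residue classification model.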
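A minimal sketch of the feature flow described in [0032] follows. The source does not name the base classifiers, so a toy nearest-centroid scorer stands in for them; `fit_centroid_scorer`, `stack_features` and the sample values are hypothetical. A practical stacking setup would normally produce the first-layer outputs from out-of-fold predictions, which is omitted here for brevity.

```python
def fit_centroid_scorer(rows, labels):
    """Toy base classifier: higher score means closer to the positive centroid."""
    def centroid(cls):
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(col) / len(members) for col in zip(*members)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    pos, neg = centroid(1), centroid(0)
    return lambda row: dist(row, neg) - dist(row, pos)

def stack_features(bio_views, doc_vectors, labels):
    """First layer: one base classifier per biological feature type.

    Each output row is the base-classifier scores followed by the document
    feature vector, i.e. the input of the second-layer classifier.
    """
    scorers = [fit_centroid_scorer(view, labels) for view in bio_views]
    stacked = []
    for i, doc in enumerate(doc_vectors):
        base_outputs = [scorer(view[i]) for scorer, view in zip(scorers, bio_views)]
        stacked.append(base_outputs + list(doc))
    return scorers, stacked

# Hypothetical data: two biological feature types, one document feature vector each.
pssm_like = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
pseaac_like = [[0.3], [0.4], [0.7], [0.9]]
doc_vectors = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.2], [0.1, 0.8, 0.3]]
labels = [1, 1, 0, 0]
scorers, stacked = stack_features([pssm_like, pseaac_like], doc_vectors, labels)
```

The second-layer model is then trained on `stacked` against `labels`; at prediction time the same `scorers` are applied to the new biological vectors before concatenation.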
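[0033] In step S102, a word vector of the protein sequence is constructed from the obtained amino acid subsequences, the word elements of the word vector representing each amino acid subsequence; document feature extraction is performed on the word elements, and a document feature vector of the protein sequence is constructed from the extracted document features.
[0034] In this embodiment, after sequence division yields the plurality of amino acid subsequences, a word vector of the protein sequence is first constructed from the amino acid subsequences, the word elements of the word vector representing each amino acid subsequence; document feature extraction is then performed on the word elements; and finally a document feature vector of the protein sequence is constructed from the extracted document features. The extracted document features include TFIDF sequence features, N-gram sequence features and the like.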
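The two document features named in [0034] can be sketched as follows, treating each protein as a document whose words are its amino acid subsequences. The source gives no formulas, so the TF-IDF variant used here (term frequency times log(N/df)) and the reading of N-gram features as counts of contiguous residue n-grams inside a subsequence are assumptions; `tfidf` and `ngram_counts` are illustrative names.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of word lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

def ngram_counts(word, n=2):
    """Count contiguous residue n-grams inside one subsequence."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))

# Two hypothetical proteins, already split into subsequence "words".
docs = [["MKT", "KTA", "TAY"], ["MKT", "AYI", "AYI"]]
weights = tfidf(docs)
bigrams = ngram_counts("MKTAK", n=2)
```

A word that occurs in every document, such as "MKT" here, receives weight 0 under this variant, while a word specific to one protein is weighted up.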
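[0035] Preferably, when the word vector of the protein sequence is constructed from the amino acid subsequences, each kind of amino acid subsequence is assigned a unique number, and the word2vec algorithm maps the original unique subsequence numbers into a K-dimensional vector space to obtain the word vector of the protein sequence. This effectively reduces the feature dimension, gives the text data a deeper feature representation, and makes use of all the data in the high-dimensional word vector, so that the data scale is larger, which helps improve the subsequent classification.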
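The preprocessing implied by [0035] can be sketched in two steps: numbering the distinct subsequences, and listing the (center, context) ID pairs that a skip-gram style word2vec trainer would consume. The embedding training itself is omitted; the choice of the skip-gram variant, the context width, and the names `number_subsequences` and `skipgram_pairs` are assumptions, since the source only says that word2vec is used.

```python
def number_subsequences(words):
    """Assign each distinct subsequence a unique integer ID, in order of first appearance."""
    vocab = {}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab

def skipgram_pairs(ids, context=1):
    """List (center_id, context_id) pairs within `context` positions of each other."""
    pairs = []
    for i, center in enumerate(ids):
        for j in range(max(0, i - context), min(len(ids), i + context + 1)):
            if j != i:
                pairs.append((center, ids[j]))
    return pairs

# Hypothetical subsequences of one protein, in sequence order.
words = ["MKT", "KTA", "MKT", "TAY"]
vocab = number_subsequences(words)
ids = [vocab[w] for w in words]
pairs = skipgram_pairs(ids, context=1)
```

A trainer would then learn one K-dimensional vector per ID from these pairs, and the protein's word vector is built by looking up the vector for each subsequence ID.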
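[0036] In step S103, protein-chain biological feature extraction is performed on the amino acid subsequences represented by the word elements, and a biological feature vector of the protein sequence is constructed from the extracted biological features.
[0037] In this embodiment, protein-chain biological feature extraction is first performed on the amino acid subsequences obtained by sequence division, and a biological feature vector of the protein sequence is then constructed from the extracted biological features. The extracted biological features include position-specific scoring matrix features, pseudo amino acid composition features and the like, which effectively express local information such as the order in which amino acids appear in the sequence, strengthen the feature vector's ability to represent protein sequence information, and thereby make the biological features in the biological feature vector more comprehensive.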
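Of the two biological features named in [0037], pseudo amino acid composition can be sketched without external tools (a position-specific scoring matrix normally comes from a sequence-alignment search and is not shown). The sketch below is a simplified single-property form: 20 composition terms plus `lam` sequence-order terms. The use of Kyte-Doolittle hydropathy as the property, the weight 0.05, and the name `pseudo_aa_composition` are assumptions; the source does not say which properties or parameters are used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Kyte-Doolittle hydropathy values, used here as one illustrative property.
HYDROPATHY = dict(zip(AMINO_ACIDS, [
    1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9, 3.8,
    1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3,
]))

def pseudo_aa_composition(seq, lam=2, weight=0.05):
    """Return 20 composition terms followed by `lam` sequence-order terms.

    Requires len(seq) > lam and only the 20 standard residues.
    """
    # Standardise the property over the 20 amino acids.
    mean = sum(HYDROPATHY.values()) / 20
    std = (sum((v - mean) ** 2 for v in HYDROPATHY.values()) / 20) ** 0.5
    h = {aa: (v - mean) / std for aa, v in HYDROPATHY.items()}
    freq = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # theta[k-1]: mean squared property difference between residues k apart.
    theta = [
        sum((h[seq[i]] - h[seq[i + k]]) ** 2 for i in range(len(seq) - k)) / (len(seq) - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freq) + weight * sum(theta)
    return [f / denom for f in freq] + [weight * t / denom for t in theta]

vector = pseudo_aa_composition("MKTAYIA", lam=2)
```

The last `lam` entries carry the order information that plain amino acid composition lacks, which is the property [0037] attributes to these features.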
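[0038] In step S104, a preset amino acid residue classification model is used to classify the amino acid subsequences represented by the document feature vector and the biological feature vector, obtaining the amino acid residue type of the protein sequence.
[0039] In this embodiment, the amino acid residue type indicates whether an amino acid residue is a binding site of the protein sequence. Preferably, when the document feature vector and the biological feature vector are classified, a prediction is first made on the biological feature vector, the prediction result is then concatenated with the document feature vector, and finally the concatenated feature vector is classified, which further improves the accuracy of protein binding site prediction. The preset amino acid residue classification model is the model obtained by the training described above, which improves the accuracy with which the binding sites of the protein sequence are predicted.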
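[0040] Embodiment 2:
[0041] Fig. 2 shows the structure of the apparatus for predicting protein binding sites provided by Embodiment 2 of the invention. For ease of description, only the parts relevant to this embodiment are shown, including:
[0042] a sequence division unit 21, configured to receive a protein sequence to be predicted and divide it using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted;
[0043] a first vector construction unit 22, configured to construct a word vector of the protein sequence from the obtained amino acid subsequences, the word elements of the word vector representing each amino acid subsequence, to perform document feature extraction on the word elements, and to construct a document feature vector of the protein sequence from the extracted document features;
[0044] a second vector construction unit 23, configured to perform protein-chain biological feature extraction on the amino acid subsequences represented by the word elements and to construct a biological feature vector of the protein sequence from the extracted biological features; and
[0045] a result acquisition unit 24, configured to use a preset amino acid residue classification model to classify the amino acid subsequences represented by the document feature vector and the biological feature vector, obtaining the amino acid residue type of the protein sequence.
[0046] In this embodiment, the sequence division unit 21 receives the protein sequence to be predicted and divides it using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted; the first vector construction unit 22 constructs a word vector of the protein sequence from the obtained amino acid subsequences, the word elements of the word vector representing each amino acid subsequence, performs document feature extraction on the word elements, and constructs a document feature vector of the protein sequence from the extracted document features; the second vector construction unit 23 performs protein-chain biological feature extraction on the amino acid subsequences represented by the word elements and constructs a biological feature vector of the protein sequence from the extracted biological features; and the result acquisition unit 24 uses a preset amino acid residue classification model to classify the amino acid subsequences represented by the document feature vector and the biological feature vector, obtaining the amino acid residue type of the protein sequence, thereby improving the accuracy and versatility of protein binding site prediction.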
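The cooperation of units 21 to 24 can be pictured as four functions chained in order. This is only an architectural sketch under the assumption that each unit is a software component; the function `predict_binding_sites` and the toy stand-ins passed to it are hypothetical and carry none of the actual feature logic.

```python
def predict_binding_sites(sequence, divide, build_doc_vectors, build_bio_vectors, classify):
    """Chain the four units; every argument after `sequence` is a caller-supplied callable."""
    subsequences = divide(sequence)                   # sequence division unit 21
    doc_vectors = build_doc_vectors(subsequences)     # first vector construction unit 22
    bio_vectors = build_bio_vectors(subsequences)     # second vector construction unit 23
    return classify(doc_vectors, bio_vectors)         # result acquisition unit 24

# Toy stand-ins, used only to show the data flow between the units.
residue_types = predict_binding_sites(
    "MKTAY",
    divide=lambda s: [s[i:i + 3] for i in range(len(s) - 2)],
    build_doc_vectors=lambda subs: [[len(set(x))] for x in subs],
    build_bio_vectors=lambda subs: [[x.count("K")] for x in subs],
    classify=lambda docs, bios: [int(d[0] == 3 and b[0] > 0) for d, b in zip(docs, bios)],
)
```

Passing the units in as callables mirrors the statement that they may be independent or integrated: the pipeline is unchanged whichever way they are packaged.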
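[0047] In this embodiment, each unit of the apparatus for predicting protein binding sites may be implemented by a corresponding hardware or software unit; the units may be independent software or hardware units or may be integrated into a single software or hardware unit, and this is not intended to limit the invention. For the specific implementation of each unit, refer to the description of Embodiment 1, which is not repeated here.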
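[0048] Embodiment 3:
[0049] Fig. 3 shows the structure of the apparatus for predicting protein binding sites provided by Embodiment 3 of the invention. For ease of description, only the parts relevant to this embodiment are shown, including:
[0050] a training sequence division unit 31, configured to divide the training protein sequences in a preset training set using a preset sliding window and sliding step to obtain a plurality of training amino acid subsequences constituting each training protein sequence.
[0051] a first feature processing unit 32, configured to construct a training word vector of the training protein sequence from the obtained training amino acid subsequences, the training word elements of the training word vector representing each training amino acid subsequence, to perform document feature extraction on the training word elements, and to construct a document feature training vector of the training protein sequence from the extracted document features.
[0052] a second feature processing unit 33, configured to perform protein-chain biological feature extraction on the training amino acid subsequences represented by the training word elements and to construct a biological feature training vector of the training protein sequence from the extracted biological features.
[0053] a model training unit 34, configured to train a pre-built classification model using the training amino acid subsequences represented by the document feature training vector and the biological feature training vector, and, when a preset training end condition is reached, to set the trained classification model as the amino acid residue classification model.
[0054] In this embodiment, before the protein sequence to be predicted is received, the amino acid residue classification model is obtained by machine-learning training. Preferably, the Stacking ensemble learning algorithm may be used for the machine learning, which improves the classification accuracy and generalization ability of the amino acid residue classification model.
[0055] Specifically, when the amino acid residue classification model is obtained by machine-learning training, the training sequence division unit 31 first divides the training protein sequences in the preset training set using the preset sliding window and sliding step to obtain a plurality of training amino acid subsequences constituting each training protein sequence. The first feature processing unit 32 then constructs a training word vector of the training protein sequence from the obtained training amino acid subsequences, the training word elements of the training word vector representing each training amino acid subsequence, performs document feature extraction on the training word elements, and constructs a document feature training vector of the training protein sequence from the extracted document features. The second feature processing unit 33 performs protein-chain biological feature extraction on the training amino acid subsequences represented by the training word elements and constructs a biological feature training vector of the training protein sequence from the extracted biological features. Finally, the model training unit 34 trains the pre-built classification model using the training amino acid subsequences represented by the document feature training vector and the biological feature training vector; when the preset training end condition is reached, the trained classification model is set as the amino acid residue classification model. This provides a classification model for subsequent amino acid residue classification and improves the classification efficiency of the classification model. The training end condition may be set as the number of training rounds reaching a preset number or the loss during training reaching a preset value.
[0056] Specifically, once the multiple types of features have been obtained, the Stacking ensemble learning algorithm is used to train a preset model to obtain the amino acid residue classification model. The first layer of the Stacking model trains multiple base classifiers, each on a different kind of protein-chain biological feature; the predictions of these base classifiers are then concatenated with the document feature vector, and the result is used as the final feature vector for training, yielding the amino acid residue classification model.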
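[0057] a sequence division unit 35, configured to receive a protein sequence to be predicted and divide it using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted.
[0058] In this embodiment, to reflect the aggregation behavior of protein-protein binding sites, the sequence division unit 35 starts a sliding window after the protein sequence to be predicted is received and divides the protein sequence by adjusting the window size and the sliding step, yielding a plurality of amino acid subsequences constituting the protein sequence to be predicted. Local blocks of the protein sequence thus serve as the units of subsequent analysis.
[0059] In this embodiment, the size of the sliding window is preferably (2*window + 1 - 2*b), where window is a preset value and b is a randomly generated variable between 0 and window-1. Such a window contains window-b neighboring residues on each side of the target residue. As the window slides along the amino acid sequence, its size changes randomly between 3 (b = window-1) and 2*window+1 (b = 0), producing protein blocks made up of several amino acid residues. Using protein blocks as the basic unit of subsequent analysis fully reflects the clustering behavior of protein binding sites and in turn improves feature representation, prediction accuracy and versatility.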
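[0060] a first vector construction unit 36, configured to construct a word vector of the protein sequence from the obtained amino acid subsequences, the word elements of the word vector representing each amino acid subsequence, to perform document feature extraction on the word elements, and to construct a document feature vector of the protein sequence from the extracted document features.
[0061] In this embodiment, after sequence division yields the plurality of amino acid subsequences, the first vector construction unit 36 first constructs a word vector of the protein sequence from the amino acid subsequences, the word elements of the word vector representing each amino acid subsequence, then performs document feature extraction on the word elements, and finally constructs a document feature vector of the protein sequence from the extracted document features. The extracted document features include TFIDF sequence features, N-gram sequence features and the like.
[0062] Preferably, when the word vector of the protein sequence is constructed from the amino acid subsequences, each kind of amino acid subsequence is assigned a unique number, and the word2vec algorithm maps the original unique subsequence numbers into a K-dimensional vector space to obtain the word vector of the protein sequence. This effectively reduces the feature dimension, gives the text data a deeper feature representation, and makes use of all the data in the high-dimensional word vector, so that the data scale is larger, which helps improve the subsequent classification.
[0063] a second vector construction unit 37, configured to perform protein-chain biological feature extraction on the amino acid subsequences represented by the word elements and to construct a biological feature vector of the protein sequence from the extracted biological features.
[0064] In this embodiment, the second vector construction unit 37 first performs protein-chain biological feature extraction on the amino acid subsequences obtained by sequence division and then constructs a biological feature vector of the protein sequence from the extracted biological features. The extracted biological features include position-specific scoring matrix features, pseudo amino acid composition features and the like, which effectively express local information such as the order in which amino acids appear in the sequence, strengthen the feature vector's ability to represent protein sequence information, and thereby make the biological features in the biological feature vector more comprehensive.
[0065] a result acquisition unit 38, configured to use a preset amino acid residue classification model to classify the amino acid subsequences represented by the document feature vector and the biological feature vector, obtaining the amino acid residue type of the protein sequence.
[0066] In this embodiment, the amino acid residue type indicates whether an amino acid residue is a binding site of the protein sequence. Preferably, when the document feature vector and the biological feature vector are classified, a prediction is first made on the biological feature vector, the prediction result is then concatenated with the document feature vector, and finally the concatenated feature vector is classified, which further improves the accuracy of protein binding site prediction. The preset amino acid residue classification model is the model obtained by the training described above, which improves the accuracy with which the binding sites of the protein sequence are predicted.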
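[0067] Therefore, preferably, the result acquisition unit 38 comprises:
[0068] a feature concatenation unit 381, configured to make a prediction on the biological feature vector and concatenate the prediction result with the document feature vector; and
[0069] a feature classification unit 382, configured to classify the concatenated feature vector obtained by the feature concatenation.
[0070] In this embodiment, each unit of the apparatus for predicting protein binding sites may be implemented by a corresponding hardware or software unit; the units may be independent software or hardware units or may be integrated into a single software or hardware unit, and this is not intended to limit the invention.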
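[0071] Embodiment 4:
[0072] Fig. 4 shows the structure of the computing device provided by Embodiment 4 of the invention. For ease of description, only the parts relevant to this embodiment are shown.
[0073] The computing device 4 of this embodiment comprises a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and executable on the processor 40. When executing the computer program 42, the processor 40 implements the steps of the above embodiment of the method for predicting protein binding sites, for example steps S101 to S104 shown in Fig. 1. Alternatively, when executing the computer program 42, the processor 40 implements the functions of the units in the above apparatus embodiments, for example units 21 to 24 shown in Fig. 2 and units 31 to 38 shown in Fig. 3.
[0074] In this embodiment, when the processor 40 executes the computer program 42 to implement the steps of the above embodiments of the method for predicting protein binding sites, the protein sequence to be predicted is received and divided using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted; a word vector of the protein sequence is constructed from the obtained amino acid subsequences, the word elements of the word vector representing each amino acid subsequence; document feature extraction is performed on the word elements, and a document feature vector of the protein sequence is constructed from the extracted document features; protein-chain biological feature extraction is performed on the amino acid subsequences represented by the word elements, and a biological feature vector of the protein sequence is constructed from the extracted biological features; and a preset amino acid residue classification model classifies the amino acid subsequences represented by the document feature vector and the biological feature vector, obtaining the amino acid residue type of the protein sequence, thereby improving the accuracy and versatility of protein binding site prediction. For the steps implemented when the processor 40 of the computing device 4 executes the computer program 42, refer to the description of the method in Embodiment 1, which is not repeated here.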
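[0075] Embodiment 5:
[0076] This embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above embodiment of the method for predicting protein binding sites, for example steps S101 to S104 shown in Fig. 1. Alternatively, when executed by a processor, the computer program implements the functions of the units in the above apparatus embodiments, for example units 21 to 24 shown in Fig. 2 and units 31 to 38 shown in Fig. 3.
[0077] In this embodiment, the protein sequence to be predicted is received and divided using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence to be predicted; a word vector of the protein sequence is constructed from the obtained amino acid subsequences, the word elements of the word vector representing each amino acid subsequence; document feature extraction is performed on the word elements, and a document feature vector of the protein sequence is constructed from the extracted document features; protein-chain biological feature extraction is performed on the amino acid subsequences represented by the word elements, and a biological feature vector of the protein sequence is constructed from the extracted biological features; and a preset amino acid residue classification model classifies the amino acid subsequences represented by the document feature vector and the biological feature vector, obtaining the amino acid residue type of the protein sequence, thereby improving the accuracy and versatility of protein binding site prediction. For the method for predicting protein binding sites implemented when the computer program is executed by a processor, refer further to the description of the steps in the foregoing method embodiment, which is not repeated here.
[0078] The computer-readable storage medium of this embodiment may include any entity or device capable of carrying computer program code, or a recording medium, for example a memory such as ROM/RAM, a magnetic disk, an optical disk or flash memory.
The above are merely preferred embodiments of the invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.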

Claims

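[Claim 1] A method for predicting protein binding sites, characterized in that the method comprises the following steps:
receiving a protein sequence to be predicted, and dividing the protein sequence using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence;
constructing a word vector of the protein sequence from the plurality of amino acid subsequences, the word elements of the word vector representing each of the amino acid subsequences, performing document feature extraction on the word elements, and constructing a document feature vector of the protein sequence from the extracted document features;
performing protein-chain biological feature extraction on the amino acid subsequences represented by the word elements, and constructing a biological feature vector of the protein sequence from the extracted biological features;
using a preset amino acid residue classification model to classify the amino acid subsequences represented by the document feature vector and the biological feature vector, to obtain the amino acid residue type of the protein sequence.
[Claim 2] The method according to claim 1, characterized in that, before the step of receiving the protein sequence to be predicted, the method further comprises:
dividing the training protein sequences in a preset training set using a preset sliding window and sliding step to obtain a plurality of training amino acid subsequences constituting the training protein sequence;
constructing a training word vector of the training protein sequence from the plurality of training amino acid subsequences, the training word elements of the training word vector representing each of the training amino acid subsequences, performing document feature extraction on the training word elements, and constructing a document feature training vector of the training protein sequence from the extracted document features;
performing protein-chain biological feature extraction on the training amino acid subsequences represented by the training word elements, and constructing a biological feature training vector of the training protein sequence from the extracted biological features;
training a pre-built classification model using the training amino acid subsequences represented by the document feature training vector and the biological feature training vector, and, when a preset training end condition is reached, setting the trained classification model as the amino acid residue classification model.
[Claim 3] The method according to claim 1, characterized in that the size of the preset sliding window is (2*window + 1 - 2*b), where window is a preset value and b is a randomly generated variable between 0 and window-1.
[Claim 4] The method according to claim 1 or 2, characterized in that the document features include TFIDF sequence features and N-gram sequence features, and the biological features include position-specific scoring matrix features and pseudo amino acid composition features.
[Claim 5] The method according to claim 1, characterized in that the step of using a preset amino acid residue classification model to classify the document feature vector and the biological feature vector comprises:
making a prediction on the biological feature vector, and concatenating the prediction result with the document feature vector;
classifying the concatenated feature vector obtained by the feature concatenation.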
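[Claim 6] An apparatus for predicting protein binding sites, characterized in that the apparatus comprises:
a sequence division unit, configured to receive a protein sequence to be predicted and divide the protein sequence using a preset sliding window and sliding step to obtain a plurality of amino acid subsequences constituting the protein sequence;
a first vector construction unit, configured to construct a word vector of the protein sequence from the plurality of amino acid subsequences, the word elements of the word vector representing each of the amino acid subsequences, to perform document feature extraction on the word elements, and to construct a document feature vector of the protein sequence from the extracted document features;
a second vector construction unit, configured to perform protein-chain biological feature extraction on the amino acid subsequences represented by the word elements and to construct a biological feature vector of the protein sequence from the extracted biological features; and
a result acquisition unit, configured to use a preset amino acid residue classification model to classify the amino acid subsequences represented by the document feature vector and the biological feature vector, to obtain the amino acid residue type of the protein sequence.
[Claim 7] The apparatus according to claim 6, characterized in that the apparatus further comprises:
a training sequence division unit, configured to divide the training protein sequences in a preset training set using a preset sliding window and sliding step to obtain a plurality of training amino acid subsequences constituting the training protein sequence;
a first feature processing unit, configured to construct a training word vector of the training protein sequence from the plurality of training amino acid subsequences, the training word elements of the training word vector representing each of the training amino acid subsequences, to perform document feature extraction on the training word elements, and to construct a document feature training vector of the training protein sequence from the extracted document features;
a second feature processing unit, configured to perform protein-chain biological feature extraction on the training amino acid subsequences represented by the training word elements and to construct a biological feature training vector of the training protein sequence from the extracted biological features; and
a model training unit, configured to train a pre-built classification model using the training amino acid subsequences represented by the document feature training vector and the biological feature training vector, and, when a preset training end condition is reached, to set the trained classification model as the amino acid residue classification model.
[Claim 8] The apparatus according to claim 6, characterized in that the result acquisition unit comprises:
a feature concatenation unit, configured to make a prediction on the biological feature vector and concatenate the prediction result with the document feature vector; and
a feature classification unit, configured to classify the concatenated feature vector obtained by the feature concatenation.
[Claim 9] A computing device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 5.
[Claim 10] A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.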
PCT/CN2017/100314 2017-08-31 2017-09-04 Method, apparatus, device and storage medium for predicting protein binding site WO2019041333A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2019511995A JP6850874B2 (ja) 2017-08-31 2017-09-04 Method, apparatus, device and storage medium for predicting protein binding site
US16/255,857 US11620567B2 (en) 2017-08-31 2019-01-24 Method, apparatus, device and storage medium for predicting protein binding site

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710770933.2 2017-08-31
CN201710770933.2A CN107563150B (zh) 2017-08-31 2017-08-31 Method, apparatus, device and storage medium for predicting protein binding site

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/255,857 Continuation US11620567B2 (en) 2017-08-31 2019-01-24 Method, apparatus, device and storage medium for predicting protein binding site

Publications (1)

Publication Number Publication Date
WO2019041333A1 true WO2019041333A1 (zh) 2019-03-07

Family

ID=60977894

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/100314 WO2019041333A1 (zh) 2017-08-31 2017-09-04 Method, apparatus, device and storage medium for predicting protein binding site

Country Status (4)

Country Link
US (1) US11620567B2 (zh)
JP (1) JP6850874B2 (zh)
CN (1) CN107563150B (zh)
WO (1) WO2019041333A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
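CN110517730A (zh) * 2019-09-02 2019-11-29 河南师范大学 A method for identifying thermophilic proteins based on machine learning
CN111091865A (zh) * 2019-12-20 2020-05-01 东软集团股份有限公司 Method, apparatus, device and storage medium for generating a MoRFs prediction model
CN114023376A (zh) * 2021-11-02 2022-02-08 四川大学 RNA-protein binding site prediction method and system based on self-attention mechanism
CN114927165A (zh) * 2022-07-20 2022-08-19 深圳大学 Method, apparatus, system and storage medium for identifying ubiquitination sites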

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
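CN108830043B (zh) * 2018-06-21 2021-03-30 苏州大学 Protein functional site prediction method based on structural network model
CN109147868B (zh) * 2018-07-18 2022-03-22 深圳大学 Protein function prediction method, apparatus, device and storage medium
CN109326324B (zh) * 2018-09-30 2022-01-25 河北省科学院应用数学研究所 Method, system and terminal device for detecting antigen epitopes
CN109215737B (zh) * 2018-09-30 2021-03-02 东软集团股份有限公司 Method and apparatus for protein feature extraction, functional model generation and function prediction
CN109637580B (zh) * 2018-12-06 2023-06-13 上海交通大学 A method for predicting protein amino acid association matrices
CN109767814A (zh) * 2019-01-17 2019-05-17 中国科学院新疆理化技术研究所 A GloVe-model-based method for representing global feature vectors of amino acids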
US11210554B2 (en) 2019-03-21 2021-12-28 Illumina, Inc. Artificial intelligence-based generation of sequencing metadata
US11783917B2 (en) 2019-03-21 2023-10-10 Illumina, Inc. Artificial intelligence-based base calling
US11593649B2 (en) 2019-05-16 2023-02-28 Illumina, Inc. Base calling using convolutions
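CN110335640B (zh) * 2019-07-09 2022-01-25 河南师范大学 A method for predicting drug-DBPs binding sites
CN110706738B (zh) * 2019-10-30 2020-11-20 腾讯科技(深圳)有限公司 Method, apparatus, device and storage medium for predicting protein structure information
CN112818679A (zh) * 2019-11-15 2021-05-18 阿里巴巴集团控股有限公司 Event category determination method, apparatus and electronic device
CN111091871B (zh) * 2019-12-19 2022-02-18 上海交通大学 Method for predicting protein signal peptides and their cleavage sites
CN111091874B (zh) * 2019-12-20 2024-01-19 东软集团股份有限公司 Protein feature construction method, apparatus, device, storage medium and program product
CN111063393B (zh) * 2019-12-26 2023-04-07 青岛科技大学 Prokaryotic acetylation site prediction method based on information fusion and deep learning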
AU2021224871A1 (en) 2020-02-20 2022-09-08 Illumina, Inc. Artificial intelligence-based many-to-many base calling
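CN111599412B (zh) * 2020-04-24 2024-03-29 山东大学 DNA replication origin region identification method based on word vectors and convolutional neural network
CN111462822B (zh) * 2020-04-29 2023-12-05 北京晶泰科技有限公司 Method, apparatus and computing device for generating protein sequence features
CN112489723B (zh) * 2020-12-01 2022-09-06 南京理工大学 DNA-binding protein prediction method based on local evolutionary information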
US20220336054A1 (en) 2021-04-15 2022-10-20 Illumina, Inc. Deep Convolutional Neural Networks to Predict Variant Pathogenicity using Three-Dimensional (3D) Protein Structures
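CN113299339B (zh) * 2021-05-28 2024-05-07 平安科技(深圳)有限公司 Deep-learning-based drug efficacy prediction method, apparatus, device and storage medium
CN116884473B (zh) * 2023-05-22 2024-04-26 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Method and apparatus for generating a protein function prediction model
CN116844637B (zh) * 2023-07-07 2024-02-09 北京分子之心科技有限公司 Method and device for obtaining a second-source protein sequence corresponding to a first-source antibody sequence
CN117711532B (zh) * 2024-02-05 2024-05-10 北京悦康科创医药科技股份有限公司 Training method for a polypeptide amino acid sequence generation model and method for generating polypeptide amino acid sequences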

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
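CN1773517A (zh) * 2005-11-10 2006-05-17 上海交通大学 Protein sequence feature extraction method based on Chinese word segmentation technology
CN102760210A (zh) * 2012-06-19 2012-10-31 南京理工大学常熟研究院有限公司 A method for predicting protein adenosine triphosphate binding sites
CN104077499A (zh) * 2014-05-25 2014-10-01 南京理工大学 Protein-nucleotide binding site prediction method based on supervised up-sampling learning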
US20150278441A1 (en) * 2014-03-25 2015-10-01 Nec Laboratories America, Inc. High-order semi-Restricted Boltzmann Machines and Deep Models for accurate peptide-MHC binding prediction
CN104992079A (zh) * 2015-06-29 2015-10-21 南京理工大学 Protein-ligand binding site prediction method based on sampling learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473483A (zh) * 2013-10-07 2013-12-25 谢华林 An online method for predicting protein structure and function
US9652688B2 (en) * 2014-11-26 2017-05-16 Captricity, Inc. Analyzing content of digital images
CN105930318B (zh) * 2016-04-11 2018-10-19 深圳大学 A word vector training method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
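CN1773517A (zh) * 2005-11-10 2006-05-17 上海交通大学 Protein sequence feature extraction method based on Chinese word segmentation technology
CN102760210A (zh) * 2012-06-19 2012-10-31 南京理工大学常熟研究院有限公司 A method for predicting protein adenosine triphosphate binding sites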
US20150278441A1 (en) * 2014-03-25 2015-10-01 Nec Laboratories America, Inc. High-order semi-Restricted Boltzmann Machines and Deep Models for accurate peptide-MHC binding prediction
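CN104077499A (zh) * 2014-05-25 2014-10-01 南京理工大学 Protein-nucleotide binding site prediction method based on supervised up-sampling learning
CN104992079A (zh) * 2015-06-29 2015-10-21 南京理工大学 Protein-ligand binding site prediction method based on sampling learning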

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
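CN110517730A (zh) * 2019-09-02 2019-11-29 河南师范大学 A method for identifying thermophilic proteins based on machine learning
CN111091865A (zh) * 2019-12-20 2020-05-01 东软集团股份有限公司 Method, apparatus, device and storage medium for generating a MoRFs prediction model
CN111091865B (zh) * 2019-12-20 2023-04-07 东软集团股份有限公司 Method, apparatus, device and storage medium for generating a MoRFs prediction model
CN114023376A (zh) * 2021-11-02 2022-02-08 四川大学 RNA-protein binding site prediction method and system based on self-attention mechanism
CN114927165A (zh) * 2022-07-20 2022-08-19 深圳大学 Method, apparatus, system and storage medium for identifying ubiquitination sites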

Also Published As

Publication number Publication date
US20190156915A1 (en) 2019-05-23
US11620567B2 (en) 2023-04-04
JP2019535057A (ja) 2019-12-05
CN107563150A (zh) 2018-01-09
CN107563150B (zh) 2021-03-19
JP6850874B2 (ja) 2021-03-31

Similar Documents

Publication Publication Date Title
WO2019041333A1 (zh) Method, apparatus, device and storage medium for predicting protein binding site
JP2019535057A5 (zh)
Wong et al. DNA motif elucidation using belief propagation
US11769073B2 (en) Methods and systems for producing an expanded training set for machine learning using biological sequences
Lee et al. A comprehensive survey on genetic algorithms for DNA motif prediction
CN114420211A (zh) An attention-mechanism-based RNA-protein binding site prediction method
Wei et al. CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data
KR20180017827A (ko) Method and system for predicting RNA sequence regions that bind to proteins using base profiles and composition
Salekin et al. A deep learning model for predicting transcription factor binding location at single nucleotide resolution
Katara et al. Phylogenetic footprinting: a boost for microbial regulatory genomics
Bi et al. Predicting Gene Ontology functions based on support vector machines and statistical significance estimation
Reid et al. STEME: a robust, accurate motif finder for large data sets
Beiko et al. GANN: genetic algorithm neural networks for the detection of conserved combinations of features in DNA
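CN111048145A (zh) Method, apparatus, device and storage medium for generating a protein prediction model
CN114627964B (zh) A multiple-kernel-learning-based method and device for predicting enhancers and classifying their strength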
Mesa et al. Hidden Markov models for gene sequence classification: Classifying the VSG gene in the Trypanosoma brucei genome
James et al. MeShClust2: Application of alignment-free identity scores in clustering long DNA sequences
Mahony et al. Self-organizing maps of position weight matrices for motif discovery in biological sequences
Kerdprasop et al. Constraint-Based System for Genomic Analysis
CN118212975A (zh) A multi-task-learning-based method and system for predicting peptide, MHC and TCR binding
Ulrich et al. Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters
Balaji Santiago Segarra
Abdollahyan The Partial Order Kernel and its Application to Understanding the Regulatory Grammar of Conserved Non-coding Elements
Sharma et al. Applications of Signal Processing, Computational Techniques in Bioinformatics and Biotechnology
Zhu et al. Metagenomic Classification Using an Abstraction Augmented Markov Model

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019511995

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.09.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17923434

Country of ref document: EP

Kind code of ref document: A1