CN108009405A - Method for predicting bacterial outer membrane proteins based on machine learning - Google Patents

Method for predicting bacterial outer membrane proteins based on machine learning

Info

Publication number
CN108009405A
Authority
CN
China
Prior art keywords
outer membrane
protein
pssm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711435147.3A
Other languages
Chinese (zh)
Inventor
陈抗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Wickham Biomedical Technology Co.,Ltd.
Original Assignee
Chongqing Bai Nogee Biotechnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Bai Nogee Biotechnology Co Ltd
Priority to CN201711435147.3A
Publication of CN108009405A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The invention discloses a method that uses machine learning to predict the outer membrane proteins encoded in a bacterial genome. The method uses the PSI-BLAST algorithm to compute a position-specific feature vector for a protein, transforms the features with an autocorrelation function, and builds a support vector machine classifier that separates outer membrane proteins from non-outer-membrane proteins. A local computer program receives a protein sequence entered by the user and predicts whether it is an outer membrane protein. The invention can run predictions on the protein sequences encoded by a whole bacterial genome with high sensitivity and fast computation, providing an effective tool for the rapid identification and screening of outer membrane proteins in bacterial genomes. It is an accurate and effective screening method that can be widely applied to identifying outer membrane proteins in newly sequenced bacterial genomes.

Description

Method for predicting bacterial outer membrane proteins based on machine learning
Technical field
The invention belongs to the technical field of bacterial outer membrane protein prediction, and in particular relates to a method for predicting bacterial outer membrane proteins based on machine learning.
Background
A large number of beta-barrel transmembrane proteins are distributed across the outer membrane of Gram-negative bacteria. Some of these proteins act when bacteria invade cells, and some are the targets that the host immune system recognizes when clearing bacteria. They mediate the development of many diseases and also activate the body's immune mechanisms against bacterial infection.
At present, outer membrane proteins in a new bacterial genome are identified mainly by experiment. Experimental identification, however, consumes substantial manpower and material resources; it is costly and inefficient. A new bacterial genome often encodes thousands of proteins, so identifying its outer membrane proteins one by one with traditional experimental methods is extremely difficult. Bioinformatics prediction by computer can be automated, is fast and inexpensive, and is therefore an effective way to identify outer membrane proteins across a whole bacterial genome. Establishing a fast and accurate bioinformatics prediction and recognition algorithm has thus become the main problem this field needs to solve.
Summary of the invention
The object of the invention is to provide a method for predicting bacterial outer membrane proteins based on machine learning, intended to solve the problem that outer membrane proteins in new bacterial genomes are currently identified mainly by experiment.
The invention is realized as follows: a method for predicting bacterial outer membrane proteins based on machine learning, which can predict outer membrane proteins at the whole-genome level of a bacterium, is:
Using the PSI-BLAST algorithm, the protein sequence is aligned against a non-redundant protein sequence database to compute a position-specific scoring matrix (PSSM). A composition function and an autocorrelation function are then used to compute, respectively, the PSSM composition feature (amino acid residue composition, PSSM_AAC) and the PSSM autocorrelation feature (autocorrelated position-specific amino acid composition, PSSM_AC). A classifier based on a support vector machine is built to separate outer membrane proteins from non-outer-membrane proteins, and a local computer program receives a protein sequence entered by the user and predicts whether it is an outer membrane protein.
Further, the method for predicting bacterial outer membrane proteins based on machine learning, which can predict outer membrane proteins at the whole-genome level of a bacterium, specifically comprises the following steps:
Step 1: the user inputs the protein sequence to be predicted, in FASTA format, into the local computer program;
Step 2: the computer program uses the PSI-BLAST program to align the protein sequence against non-redundant protein sequences;
Step 3: the computer program calls Matlab to run the core prediction program, which computes features of the protein sequence such as the position-specific amino acid composition and the autocorrelated position-specific amino acid composition;
Step 4: the Matlab program performs feature selection and combination on the several classes of features in a preset manner, producing one protein feature vector;
Step 5: the Matlab program calls the libSVM program and, using a previously trained model, predicts the probability that the protein is an outer membrane protein;
Step 6: whether the protein is an outer membrane protein is judged from the SVM prediction result, and the result is output to the screen or saved to the local computer disk.
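Step 1 takes FASTA-formatted input. As a minimal sketch of that input stage, assuming Python in place of the Matlab program the text describes, the following reads FASTA text into (header, sequence) pairs and checks that a sequence uses only the 20 standard amino acid letters. The function names `parse_fasta` and `is_standard_protein` are hypothetical and do not come from the patent.

```python
# Hypothetical helper names; the patent does not publish its input-handling code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated and upper-cased.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_standard_protein(sequence):
    """Return True if the sequence is non-empty and uses only the 20 standard residues."""
    return bool(sequence) and set(sequence) <= STANDARD_AMINO_ACIDS
```

Sequences containing ambiguous residue codes such as X would be rejected by this check; whether the original program rejects or tolerates them is not stated.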
Further, the PSSM is computed as follows:
Using the PSI-BLAST algorithm with the e-value set to 0.001 and 3 iterations, the protein sequence is aligned against the NCBI non-redundant protein database nr, and the PSSM obtained during the computation is output.
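The settings above map naturally onto a command line. The sketch below assumes the NCBI BLAST+ `psiblast` executable, which the text does not name explicitly, and only builds the argument list; it does not run anything. Whether the stated e-value of 0.001 corresponds to `-evalue` (as assumed here) or to `-inclusion_ethresh` is not specified, and the file names are placeholders.

```python
def build_psiblast_command(query_fasta, database, pssm_out,
                           evalue=0.001, iterations=3):
    """Build an argument list for NCBI BLAST+ psiblast (assumed tool).

    The defaults follow the values quoted in the text: e-value 0.001, 3 iterations.
    """
    return [
        "psiblast",
        "-query", query_fasta,              # FASTA file holding one protein sequence
        "-db", database,                    # e.g. a local copy of the NCBI nr database
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),             # assumption: could instead be -inclusion_ethresh
        "-out_ascii_pssm", pssm_out,        # write the final PSSM as a text file
    ]


# Placeholder file names for illustration only.
command = build_psiblast_command("query.fasta", "nr", "query.pssm")
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where BLAST+ and a formatted database are installed.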
Further, the PSSM composition and PSSM autocorrelation features are computed as follows:
The protein sequence is treated as a string over the 20 amino acid residues. The sequence is aligned against a local protein sequence database by a local PSI-BLAST program, which outputs the position-specific scoring matrix (PSSM). The composition feature PSSM_AAC of the PSSM of a protein sequence is calculated as:
$$V^{PSSM\_AAC} = [\bar{S}_1, \bar{S}_2, \bar{S}_3, \ldots, \bar{S}_i, \ldots, \bar{S}_{20}]$$
$$\bar{S}_i = \frac{1}{L}\sum_{j=1}^{L} s_{j,i}, \quad i = 1, 2, 3, \ldots, 20$$
Here, $s_{j,i}$ is the PSSM score for the amino acid at position $j$ of the protein sequence mutating to amino acid $i$, $L$ is the sequence length, and $\bar{S}_i$ is the average, over all positions of the protein, of the score for mutation to amino acid $i$. PSSM_AC denotes the autocorrelation score of the PSSM and is calculated as:
$$V^{PSSM\_AC}_{i,lg} = \frac{1}{L-lg}\sum_{j=1}^{L-lg}\left(s_{j,i}-\bar{S}_i\right)\times\left(s_{j+lg,i}-\bar{S}_i\right), \quad lg = 1, 2, 3, \ldots, LG$$
Here, $lg$ is the distance between a residue and its neighbor, $LG$ is the maximum value of $lg$, and $V$ is the average autocorrelation between two residues separated by $lg$ positions. In this way, 20*LG variables can be computed from the PSSM of a protein sequence.
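The two feature definitions can be sketched in a few lines of Python. This is an illustration under assumptions, not the patent's Matlab code: the PSSM is taken to be a list of L rows with one score per amino acid column, the function names are hypothetical, and the autocorrelation pairs position j with position j+lg in the same column, which is the usual form of this feature.

```python
def pssm_aac(pssm):
    """Column means of the PSSM: one average score per amino acid column."""
    length = len(pssm)
    n_cols = len(pssm[0])
    return [sum(row[i] for row in pssm) / length for i in range(n_cols)]


def pssm_ac(pssm, max_lag):
    """Autocorrelation of each PSSM column for lags 1..max_lag.

    Returns n_cols * max_lag values, ordered by column and then by lag.
    """
    length = len(pssm)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    means = pssm_aac(pssm)
    features = []
    for i, mean_i in enumerate(means):
        for lag in range(1, max_lag + 1):
            # Average product of mean-centred scores at positions j and j+lag.
            total = sum(
                (pssm[j][i] - mean_i) * (pssm[j + lag][i] - mean_i)
                for j in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4x2 matrix for illustration; a real PSSM has 20 columns.
toy = [[1, 0], [3, 0], [1, 2], [3, 2]]
composition = pssm_aac(toy)      # two column means
autocorr = pssm_ac(toy, 2)       # 2 columns * 2 lags = 4 values
```

With a real 20-column PSSM, concatenating `pssm_aac(pssm)` and `pssm_ac(pssm, LG)` gives a vector of 20 + 20*LG values; the value of LG used by the inventors is not stated in the text.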
Further, the PSSM composition reflects the conservation of amino acid i at individual positions; the PSSM autocorrelation feature reflects the conservation of the amino acids occurring in a segment of the amino acid sequence.
Further, in the local computer program that computes the protein sequence features, the protein sequence entered by the user is passed to a Matlab script, which computes a multidimensional feature vector from the sequence according to the PSSM composition and autocorrelation feature calculations, using preset parameters.
Further, the method of building the support vector machine (SVM) classifier that separates outer membrane proteins from non-outer-membrane proteins is: an SVM classifier is built with libSVM 3.14 and the multidimensional feature vector is supplied as input; the classifier uses training datasets of outer membrane proteins and non-outer-membrane proteins assembled by data mining, and a classification model is built according to the SVM algorithm from the training data, an RBF kernel function, and parameters.
Further, the classification model trained according to the SVM algorithm from the training data, kernel function, and parameters is constructed as follows:
Sample sequences are collected by database search and sequence alignment, and redundant sequences are removed with the BLASTCLUST algorithm, yielding experimentally verified outer membrane protein sequences and non-outer-membrane protein sequences as the training dataset; the sequence similarity between any two protein sequences does not exceed 25%. The RBF kernel is chosen as the SVM kernel function, and the penalty factor parameter is selected by grid search with ten-fold cross-validation. Classification models are trained with SVMtrain, and prediction performance is assessed jointly by sensitivity, specificity, overall prediction accuracy, and the Matthews correlation coefficient. The feature combination and model parameters with the best Matthews correlation coefficient are finally selected from the test results, and the best-performing model is exported and saved as the final model.
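The four performance measures named above all follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions of sensitivity, specificity, accuracy, and the Matthews correlation coefficient; the function name and the example counts are made up for illustration and are not results from the patent.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification measures from confusion-matrix counts.

    tp/fn: outer membrane proteins predicted correctly / missed.
    tn/fp: non-outer-membrane proteins predicted correctly / wrongly flagged.
    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Illustrative counts only.
example = classification_metrics(tp=8, fn=2, tn=9, fp=1)
```

Selecting on the Matthews correlation coefficient, as the text describes, is a common choice when the classes are unbalanced, because accuracy alone can look high even if most positives are missed.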
The invention applies bioinformatics methods to the field of bacterial outer membrane protein prediction. Its core idea is a method for mining protein sequence features from the composition and autocorrelation of the PSSM, combined with a machine learning algorithm to design a highly accurate prediction model and algorithm. The method uses the PSI-BLAST algorithm, an autocorrelation function, and related tools to compute protein sequence features, capturing the PSSM scores at a position and their correlation with upstream and downstream positions. In addition, the support vector machine algorithm, which has shown excellent performance in pattern recognition and machine learning, is used to build the classification model, with grid search determining the optimal SVM kernel parameters. Standard training and test datasets are established by database retrieval and literature mining; data redundancy is removed with BLASTCLUST homology comparison; and prediction performance is measured with several indices, including sensitivity, specificity, prediction accuracy, and the Matthews correlation coefficient. An optimized SVM classification model is established through extensive performance testing; it can make a prediction for any unknown protein sequence and give the probability that it is an outer membrane protein. A local program receives the bacterial genome protein sequences entered by the user and predicts whether each is an outer membrane protein, with very high prediction accuracy.
A non-redundant training dataset was established comprising 208 outer membrane proteins, 674 globular protein sequences, and 206 alpha-helical proteins. When ten-fold cross-validation was used on the training dataset to verify the performance of the invention, the results showed that the sensitivity, specificity, overall prediction accuracy, and Matthews correlation coefficient of the method in distinguishing outer membrane proteins from non-outer-membrane proteins reached 93.08%, 99.50%, 98.27%, and 0.945, respectively, exceeding most existing methods. In addition, the prediction tool was applied to the whole-genome proteins of Escherichia coli. After the training data derived from Escherichia coli were removed, the remaining 189 outer membrane proteins and 657 non-outer-membrane protein sequences were used to train the prediction model. The trained model was used to predict the 4128 protein sequences encoded by the E. coli genome, and 30 possible outer membrane proteins were predicted in total. Of these, 22 are confirmed outer membrane proteins, indicating that the prediction model has good sensitivity. In theory, outer membrane proteins account for about 1%-3% of the proteins encoded in a bacterial genome; the outer membrane proteins predicted by this method in the E. coli genome amount to no more than 1%, indicating a relatively low false positive rate.
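As a quick check of the proportions in the Escherichia coli test, using only the counts stated in the text:

```latex
\[
\frac{30}{4128} \approx 0.73\%, \qquad
\frac{22}{30} \approx 73.3\%, \qquad
189 + 657 = 846
\]
```

The first value is the share of the genome's proteins predicted to be outer membrane proteins, consistent with the "no more than 1%" statement. The second is the fraction of predictions that are already confirmed outer membrane proteins; the text does not say whether the other 8 predictions are false positives or unconfirmed candidates. The third is the size of the training set after the E. coli sequences were removed.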
The invention can be widely applied in research on identifying bacterial outer membrane proteins. Bacterial outer membrane proteins are important molecules in bacterial pathogenesis and the targets of many antibacterial drugs. With the invention and the prediction program it provides, the outer membrane proteins in a new bacterial genome can be predicted quickly, yielding a small set of candidate outer membrane proteins for experimental identification or other purposes, thereby accelerating the identification of outer membrane proteins in bacterial genomes.
Brief description of the drawings
Fig. 1 is a flow chart of the method for predicting bacterial outer membrane proteins based on machine learning provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of the program provided by an embodiment of the method for predicting bacterial outer membrane proteins based on machine learning.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the invention. The terms used in the description of the invention are intended only to describe specific embodiments and are not intended to limit the invention.
Embodiment one
The invention is realized as follows: a method for predicting bacterial outer membrane proteins based on machine learning, which can predict outer membrane proteins at the whole-genome level of a bacterium, is:
Using the PSI-BLAST algorithm, the protein sequence is aligned against a non-redundant protein sequence database to compute a position-specific scoring matrix (PSSM). A composition function and an autocorrelation function are then used to compute, respectively, the PSSM composition feature (amino acid residue composition, PSSM_AAC) and the PSSM autocorrelation feature (autocorrelated position-specific amino acid composition, PSSM_AC). A classifier based on a support vector machine is built to separate outer membrane proteins from non-outer-membrane proteins, and a local computer program receives a protein sequence entered by the user and predicts whether it is an outer membrane protein.
Embodiment two
A method for predicting bacterial outer membrane proteins based on machine learning, which can predict outer membrane proteins at the whole-genome level of a bacterium, comprises the following steps:
Step 1: the user inputs the protein sequence to be predicted, in FASTA format, into the local computer program;
Step 2: the computer program uses the PSI-BLAST program to align the protein sequence against non-redundant protein sequences;
Step 3: the computer program calls Matlab to run the core prediction program, which computes the PSSM composition feature and PSSM autocorrelation feature of the protein;
Step 4: the Matlab program performs feature selection and combination on the several classes of features in a preset manner, producing one protein feature vector;
Step 5: the Matlab program calls the libSVM program and, using a previously trained model, predicts the probability that the protein is an outer membrane protein;
Step 6: whether the protein is an outer membrane protein is judged from the SVM prediction result, and the result is output to the screen or saved to the local computer disk.
Further, the PSSM is computed as follows:
Using the PSI-BLAST algorithm with the e-value set to 0.001 and 3 iterations, the protein sequence is aligned against the NCBI non-redundant protein database nr, and the PSSM obtained during the computation is output.
Further, the PSSM composition and PSSM autocorrelation features are computed as follows:
The protein sequence is treated as a string over the 20 amino acid residues. The sequence is aligned against a local protein sequence database by a local PSI-BLAST program, which outputs the position-specific scoring matrix (PSSM). The composition feature PSSM_AAC of the PSSM of a protein sequence is calculated as:
$$V^{PSSM\_AAC} = [\bar{S}_1, \bar{S}_2, \bar{S}_3, \ldots, \bar{S}_i, \ldots, \bar{S}_{20}]$$
$$\bar{S}_i = \frac{1}{L}\sum_{j=1}^{L} s_{j,i}, \quad i = 1, 2, 3, \ldots, 20$$
Here, $s_{j,i}$ is the PSSM score for the amino acid at position $j$ of the protein sequence mutating to amino acid $i$, $L$ is the sequence length, and $\bar{S}_i$ is the average, over all positions of the protein, of the score for mutation to amino acid $i$. PSSM_AC denotes the autocorrelation score of the PSSM and is calculated as:
$$V^{PSSM\_AC}_{i,lg} = \frac{1}{L-lg}\sum_{j=1}^{L-lg}\left(s_{j,i}-\bar{S}_i\right)\times\left(s_{j+lg,i}-\bar{S}_i\right), \quad lg = 1, 2, 3, \ldots, LG$$
Here, $lg$ is the distance between a residue and its neighbor, $LG$ is the maximum value of $lg$, and $V$ is the average autocorrelation between two residues separated by $lg$ positions. In this way, 20*LG variables can be computed from the PSSM of a protein sequence.
Further, the PSSM composition reflects the conservation of amino acid i at individual positions; the PSSM autocorrelation feature reflects the conservation of the amino acids occurring in a segment of the amino acid sequence.
Further, in the local computer program that computes the protein sequence features, the protein sequence entered by the user is passed to a Matlab script, which computes a multidimensional feature vector from the sequence according to the PSSM composition and autocorrelation feature calculations, using preset parameters.
Further, the method of building the support vector machine (SVM) classifier that separates outer membrane proteins from non-outer-membrane proteins is: an SVM classifier is built with libSVM 3.14 and the multidimensional feature vector is supplied as input; the classifier uses training datasets of outer membrane proteins and non-outer-membrane proteins assembled by data mining, and a classification model is built according to the SVM algorithm from the training data, an RBF kernel function, and parameters.
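For reference, the RBF kernel used by libSVM has the form exp(-gamma * ||u - v||^2). The sketch below evaluates it for two feature vectors in plain Python; the function name is hypothetical, and the gamma value in the example is arbitrary rather than the value tuned by the inventors.

```python
import math


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Identical vectors give 1.0; the value decays toward 0 as vectors move apart.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

A larger gamma makes the kernel more local, so each training protein influences a smaller neighborhood of feature space; this is one of the parameters a grid search would tune.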
Further, the classification model trained according to the SVM algorithm from the training data, kernel function, and parameters is constructed as follows:
Sample sequences are collected by database search and sequence alignment, and redundant sequences are removed with the BLASTCLUST algorithm, yielding experimentally verified outer membrane protein sequences and non-outer-membrane protein sequences as the training dataset; the sequence similarity between any two protein sequences does not exceed 25%. The RBF kernel is chosen as the SVM kernel function, and the penalty factor parameter is selected by grid search with ten-fold cross-validation. Classification models are trained with SVMtrain, and prediction performance is assessed jointly by sensitivity, specificity, overall prediction accuracy, and the Matthews correlation coefficient. The feature combination and model parameters with the best Matthews correlation coefficient are finally selected from the test results, and the best-performing model is exported and saved as the final model.
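The grid search with ten-fold cross-validation can be outlined without any SVM library. The sketch below only generates the candidate (C, gamma) pairs and the train/test index splits; training and scoring each pair would be done with libSVM. The exponent ranges are assumptions, since the patent does not state the ranges it searched, and the split is a simple interleaved one with no shuffling or class stratification.

```python
def parameter_grid(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """Candidate (C, gamma) pairs on a power-of-two grid.

    Each tuple is (start exponent, end exponent, step); the ranges are assumed.
    """
    c_values = [2.0 ** e for e in range(log2c[0], log2c[1] + 1, log2c[2])]
    g_values = [2.0 ** e for e in range(log2g[0], log2g[1] - 1, log2g[2])]
    return [(c, g) for c in c_values for g in g_values]


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


grid = parameter_grid()
splits = list(k_fold_indices(25, k=10))
```

For each (C, gamma) pair, a model would be trained on every `train_idx` set and evaluated on the matching `test_idx` set, and the pair with the best cross-validated Matthews correlation coefficient would be kept.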
The invention applies bioinformatics methods to the field of bacterial outer membrane protein prediction. Its core idea is a method for mining protein sequence features from the composition and autocorrelation of the PSSM, combined with a machine learning algorithm to design a highly accurate prediction model and algorithm. The method uses the PSI-BLAST algorithm, an autocorrelation function, and related tools to compute protein sequence features, capturing the PSSM scores at a position and their correlation with upstream and downstream positions. In addition, the support vector machine algorithm, which has shown excellent performance in pattern recognition and machine learning, is used to build the classification model, with grid search determining the optimal SVM kernel parameters. Standard training and test datasets are established by database retrieval and literature mining; data redundancy is removed with BLASTCLUST homology comparison; and prediction performance is measured with several indices, including sensitivity, specificity, prediction accuracy, and the Matthews correlation coefficient. An optimized SVM classification model is established through extensive performance testing; it can make a prediction for any unknown protein sequence and give the probability that it is an outer membrane protein. A local program receives the bacterial genome protein sequences entered by the user and predicts whether each is an outer membrane protein, with very high prediction accuracy.
A non-redundant training dataset was established comprising 208 outer membrane proteins, 674 globular protein sequences, and 206 alpha-helical proteins. When ten-fold cross-validation was used on the training dataset to verify the performance of the invention, the results showed that the sensitivity, specificity, overall prediction accuracy, and Matthews correlation coefficient of the method in distinguishing outer membrane proteins from non-outer-membrane proteins reached 93.08%, 99.50%, 98.27%, and 0.945, respectively, exceeding most existing methods. In addition, the prediction tool was applied to the whole-genome proteins of Escherichia coli. After the training data derived from Escherichia coli were removed, the remaining 189 outer membrane proteins and 657 non-outer-membrane protein sequences were used to train the prediction model. The trained model was used to predict the 4128 protein sequences encoded by the E. coli genome, and 30 possible outer membrane proteins were predicted in total. Of these, 22 are confirmed outer membrane proteins, indicating that the prediction model has good sensitivity. In theory, outer membrane proteins account for about 1%-3% of the proteins encoded in a bacterial genome; the outer membrane proteins predicted by this method in the E. coli genome amount to no more than 1%, indicating a relatively low false positive rate.
The invention can be widely applied in research on identifying bacterial outer membrane proteins. Bacterial outer membrane proteins are important molecules in bacterial pathogenesis and the targets of many antibacterial drugs. With the invention and the prediction program it provides, the outer membrane proteins in a new bacterial genome can be predicted quickly, yielding a small set of candidate outer membrane proteins for experimental identification or other purposes, thereby accelerating the identification of outer membrane proteins in bacterial genomes.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be completed by a program controlling the relevant hardware; the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (7)

  1. A method for predicting bacterial outer membrane proteins based on machine learning, characterized in that the method comprises:
    Using the PSI-BLAST algorithm, the protein sequence is aligned against a non-redundant protein sequence database and a position-specific scoring matrix is computed; using an autocorrelation function, the PSSM composition feature and PSSM autocorrelation feature of the same amino acid within a region of the sequence are computed and together form the protein feature vector; a machine learning classification model is built with a support vector machine and is trained and optimized on a training dataset; the trained model can classify an unknown protein sequence and judge whether it is an outer membrane protein.
  2. A method for predicting bacterial outer membrane proteins based on machine learning, characterized in that the method comprises the following steps:
    Step 1: the user inputs the protein sequence to be predicted, in FASTA format, into the local computer program;
    Step 2: the computer program uses the PSI-BLAST program to align the protein sequence against non-redundant protein sequences;
    Step 3: the computer program calls Matlab to run the core prediction program, which computes the PSSM composition feature and PSSM autocorrelation feature of the protein;
    Step 4: the Matlab program performs feature selection and combination on the several classes of features in a preset manner, producing one protein feature vector;
    Step 5: the Matlab program calls the libSVM program and, using a previously trained model, predicts the probability that the protein is an outer membrane protein;
    Step 6: whether the protein is an outer membrane protein is judged from the SVM prediction result, and the result is output.
  3. The method for predicting bacterial outer membrane proteins based on machine learning as claimed in claim 2, characterized in that step 2, in which the computer program uses the PSI-BLAST program to align the protein sequence against non-redundant protein sequences, specifically comprises: passing the protein sequence entered by the user to a Matlab script; the Matlab script calls PSI-BLAST to align the sequence against the non-redundant protein database and computes the PSSM, and, by the PSSM composition calculation, converts the PSSM into the PSSM composition feature and the PSSM autocorrelation feature, obtaining one combined feature vector.
  4. The method for predicting bacterial outer membrane proteins based on machine learning techniques as claimed in claim 2, characterized in that, in Step 3, the computer program calls Matlab to run the core prediction program, which calculates the PSSM composition features and PSSM autocorrelation features of the protein;
    A protein sequence is treated as a character string composed of the 20 kinds of amino acid residues. The protein sequence is compared against a local protein sequence database by a local PSI-BLAST program, which outputs the PSSM. The PSSM composition feature (PSSM_AAC) is calculated as:
    $$V^{PSSM\_AAC} = [\bar{S}_1\ \bar{S}_2\ \bar{S}_3\ \ldots\ \bar{S}_i\ \ldots\ \bar{S}_{20}]$$
    $$\bar{S}_i = \frac{1}{L}\sum_{j=1}^{L} s_{j,i}, \quad i = 1, 2, 3, \ldots, 20$$
    Here L is the length of the protein sequence, $s_{j,i}$ is the PSSM score for the j-th amino acid of the protein sequence mutating to amino acid i, and $\bar{S}_i$ is the average, over all positions of the protein, of the scores for mutation to amino acid i. PSSM_AC denotes the PSSM autocorrelation feature, calculated as:
    $$V_{i,g}^{PSSM\_AC} = \frac{1}{L-g}\sum_{j=1}^{L-g} (s_{j,i} - \bar{S}_i) \times (s_{j+g,i} - \bar{S}_i), \quad g = 1, 2, 3, \ldots, LG$$
    Here g (written lg in the original) is the distance between a residue and its neighbour, LG is the maximum value of g, and V is the average autocorrelation value between two residues separated by g amino acid residues. In this way, 20*LG variables can be calculated from the PSSM of a protein sequence.
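The composition and autocorrelation formulas of claim 4 can be illustrated with a short sketch. It is not the patent's Matlab program: the function names are hypothetical, the PSSM is assumed to be already parsed into a list of L rows with one score per amino acid column, and the autocorrelation is read as the lag-g product within a single column i.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def pssm_composition(pssm: Matrix) -> List[float]:
    """Column means S_bar_i over all L sequence positions."""
    length = len(pssm)
    n_cols = len(pssm[0])
    return [sum(row[i] for row in pssm) / length for i in range(n_cols)]


def pssm_autocorrelation(pssm: Matrix, max_lag: int) -> List[float]:
    """V[i, g] for g = 1..max_lag, flattened lag by lag, column by column."""
    length = len(pssm)
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < sequence length")
    means = pssm_composition(pssm)
    features: List[float] = []
    for lag in range(1, max_lag + 1):
        for i, mean_i in enumerate(means):
            # Product of mean-centred scores at positions j and j + lag.
            total = sum(
                (pssm[j][i] - mean_i) * (pssm[j + lag][i] - mean_i)
                for j in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


def combined_features(pssm: Matrix, max_lag: int) -> List[float]:
    """Composition features followed by autocorrelation features."""
    return pssm_composition(pssm) + pssm_autocorrelation(pssm, max_lag)


# Toy 4-position, 2-column matrix (a real PSSM has 20 columns).
toy = [[1, 0], [3, 0], [1, 0], [3, 0]]
toy_features = combined_features(toy, max_lag=2)
```

With a 20-column PSSM the combined vector has 20 + 20*LG entries, which matches the 20*LG autocorrelation variables stated in the claim plus the 20 composition values.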
  5. The method for predicting bacterial outer membrane proteins based on machine learning techniques as claimed in claim 4, characterized in that the PSSM composition feature reflects the positional conservation information that a given position is amino acid i; the PSSM autocorrelation feature reflects the conservation with which the amino acids of a given fragment occur in the amino acid sequence.
  6. The method for predicting bacterial outer membrane proteins based on machine learning techniques as claimed in claim 2, characterized in that, in Step 5, the Matlab program calls the libSVM program and uses the pre-trained model to predict the probability that the protein is an outer membrane protein, the specific method being: a classifier based on the support vector machine (SVM) is built with libSVM 3.14 and the multidimensional feature vector is supplied to it as input; the SVM classifier uses the outer membrane protein and non-outer-membrane protein training datasets assembled with data mining techniques, together with a classification model built according to the SVM algorithm and trained with the training data, the RBF kernel function, and its parameters.
  7. The method for predicting bacterial outer membrane proteins based on machine learning techniques as claimed in claim 6, characterized in that the classification model is built according to the SVM algorithm and trained with the training data, the RBF kernel function, and its parameters; the specific method is:
    Sample sequences are collected by database searching and sequence alignment, and redundant sequences are removed with the BLASTCLUST algorithm. The resulting experimentally verified outer membrane protein sequences and non-outer-membrane protein sequences serve as the training dataset, with the sequence similarity between any two protein sequences not exceeding 25%. The RBF kernel is selected as the SVM kernel function, and the penalty factor parameter is determined by grid search with ten-fold cross-validation. The classification model is trained with SVMtrain, and prediction performance is assessed jointly by sensitivity, specificity, overall prediction accuracy, and the Matthews correlation coefficient. Finally, the feature combination and model parameters with the best Matthews correlation coefficient are selected from the test results, and the best-performing model is exported and saved as the final model.
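The evaluation and model-selection steps named in claim 7 can be sketched as follows. This is an illustration under stated assumptions, not the patent's libSVM workflow: all function names are hypothetical, the powers-of-two grid over the penalty factor C and the RBF gamma is an assumed convention (the claim only states that grid search is used), and the round-robin fold split stands in for whatever ten-fold partition the authors used.

```python
import math
from itertools import product
from typing import Dict, List, Sequence, Tuple


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy and Matthews correlation (labels 1/0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def k_fold_indices(n_samples: int, k: int = 10) -> List[List[int]]:
    """Assign sample indices to k test folds in round-robin order."""
    return [list(range(fold, n_samples, k)) for fold in range(k)]


def parameter_grid(
    c_exponents: Sequence[int], gamma_exponents: Sequence[int]
) -> List[Tuple[float, float]]:
    """Candidate (C, gamma) pairs on an assumed powers-of-two grid."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exponents, gamma_exponents)]


def select_best(
    results: Sequence[Tuple[Tuple[float, float], float]]
) -> Tuple[float, float]:
    """Return the (C, gamma) pair whose cross-validated MCC is highest."""
    return max(results, key=lambda item: item[1])[0]


# Toy labels: 1 = outer membrane protein, 0 = non-outer-membrane protein.
metrics = evaluate([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 0, 1, 1])
```

In a full pipeline, each (C, gamma) pair from `parameter_grid` would be trained and scored on every fold from `k_fold_indices`, and `select_best` would pick the pair with the highest mean Matthews correlation coefficient.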
CN201711435147.3A 2017-12-26 2017-12-26 A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter Pending CN108009405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711435147.3A CN108009405A (en) 2017-12-26 2017-12-26 A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711435147.3A CN108009405A (en) 2017-12-26 2017-12-26 A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter

Publications (1)

Publication Number Publication Date
CN108009405A (en) 2018-05-08

Family

ID=62061548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711435147.3A Pending CN108009405A (en) 2017-12-26 2017-12-26 A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter

Country Status (1)

Country Link
CN (1) CN108009405A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215737A (en) * 2018-09-30 2019-01-15 东软集团股份有限公司 Protein characteristic extracts, functional mode generates, the method and device of function prediction
CN109637589A (en) * 2018-12-13 2019-04-16 上海交通大学 Based on frequent mode and the double nuclear localization signal prediction algorithms for recommending system of machine learning
CN109754843A (en) * 2018-12-04 2019-05-14 志诺维思(北京)基因科技有限公司 A kind of method and device detecting genome small fragment insertion and deletion
CN109785901A (en) * 2018-12-26 2019-05-21 东软集团股份有限公司 A kind of protein function prediction technique and device
CN109801675A (en) * 2018-12-26 2019-05-24 东软集团股份有限公司 A kind of method, apparatus and equipment of determining protein liposomal function
CN110060738A (en) * 2019-04-03 2019-07-26 中国人民解放军军事科学院军事医学研究院 Method and system based on machine learning techniques prediction bacterium protective antigens albumen
CN110428865A (en) * 2019-08-14 2019-11-08 信阳师范学院 A kind of method of high-throughput prediction Antifreeze protein
CN110517730A (en) * 2019-09-02 2019-11-29 河南师范大学 A method of thermophilic protein is identified based on machine learning
CN112242179A (en) * 2020-09-09 2021-01-19 天津大学 Method for identifying type of membrane protein
CN112634988A (en) * 2021-01-07 2021-04-09 内江师范学院 Python language-based gene variation detection method and system
CN116721695A (en) * 2023-03-07 2023-09-08 安徽农业大学 Identification method, device, equipment and medium of candidate gene for regulating bacterial shape

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930687A (en) * 2016-04-11 2016-09-07 中国人民解放军第三军医大学 Method for predicting outer membrane proteins at bacterial whole genome level
CN105938522A (en) * 2016-04-11 2016-09-14 中国人民解放军第三军医大学 Method for predicting effector molecules of bacterial IV-type secretory system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sun Chao et al.: "Computer Technology in Bioinformatics" (《生物信息学中的计算机技术》), 31 July 2002 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215737A (en) * 2018-09-30 2019-01-15 东软集团股份有限公司 Protein characteristic extracts, functional mode generates, the method and device of function prediction
CN109754843A (en) * 2018-12-04 2019-05-14 志诺维思(北京)基因科技有限公司 A kind of method and device detecting genome small fragment insertion and deletion
CN109637589B (en) * 2018-12-13 2022-07-26 上海交通大学 Nuclear localization signal prediction method based on frequent pattern and machine learning dual recommendation system
CN109637589A (en) * 2018-12-13 2019-04-16 上海交通大学 Based on frequent mode and the double nuclear localization signal prediction algorithms for recommending system of machine learning
CN109785901A (en) * 2018-12-26 2019-05-21 东软集团股份有限公司 A kind of protein function prediction technique and device
CN109801675A (en) * 2018-12-26 2019-05-24 东软集团股份有限公司 A kind of method, apparatus and equipment of determining protein liposomal function
CN110060738A (en) * 2019-04-03 2019-07-26 中国人民解放军军事科学院军事医学研究院 Method and system based on machine learning techniques prediction bacterium protective antigens albumen
CN110428865A (en) * 2019-08-14 2019-11-08 信阳师范学院 A kind of method of high-throughput prediction Antifreeze protein
CN110517730A (en) * 2019-09-02 2019-11-29 河南师范大学 A method of thermophilic protein is identified based on machine learning
CN112242179A (en) * 2020-09-09 2021-01-19 天津大学 Method for identifying type of membrane protein
CN112634988A (en) * 2021-01-07 2021-04-09 内江师范学院 Python language-based gene variation detection method and system
CN116721695A (en) * 2023-03-07 2023-09-08 安徽农业大学 Identification method, device, equipment and medium of candidate gene for regulating bacterial shape
CN116721695B (en) * 2023-03-07 2024-03-08 安徽农业大学 Identification method, device, equipment and medium of candidate gene for regulating bacterial shape

Similar Documents

Publication Publication Date Title
CN108009405A (en) A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter
Käll et al. A combined transmembrane topology and signal peptide prediction method
CN108960319A (en) It is a kind of to read the candidate answers screening technique understood in modeling towards global machine
Li et al. Protein contact map prediction based on ResNet and DenseNet
CN110853756B (en) Esophagus cancer risk prediction method based on SOM neural network and SVM
CN107463795A (en) A kind of prediction algorithm for identifying tyrosine posttranslational modification site
Kaur et al. Prediction of enhancers in DNA sequence data using a hybrid CNN-DLSTM model
CN102129565B (en) Object detection method based on feature redundancy elimination AdaBoost classifier
CN110060738A (en) Method and system based on machine learning techniques prediction bacterium protective antigens albumen
CN109326329B (en) Zinc binding protein action site prediction method
CN105046106B (en) A kind of Prediction of Protein Subcellular Location method realized with nearest _neighbor retrieval
CN105930687A (en) Method for predicting outer membrane proteins at bacterial whole genome level
CN110363302B (en) Classification model training method, prediction method and device
CN111048145A (en) Method, device, equipment and storage medium for generating protein prediction model
CN113724779B (en) SNAREs protein identification method, system, storage medium and equipment based on machine learning technology
CN115497564A (en) Antigen identification model establishing method and antigen identification method
Yaseen et al. Protein binding affinity prediction using support vector regression and interfecial features
Li et al. ctP 2 ISP: Protein–Protein Interaction Sites Prediction Using Convolution and Transformer With Data Augmentation
KR20210052855A (en) Electronic device for selecting biomarkers for predicting cancer prognosis based on patient-specific genetic characteristics and operating method thereof
Arango-Argoty et al. An adaptation of Pfam profiles to predict protein sub-cellular localization in Gram positive bacteria
Li et al. Deepacr: Predicting anti-crispr with deep learning
Singh et al. Classification of non-coding rna-a review from machine learning perspective
CN111951889B (en) Recognition prediction method and system for M5C locus in RNA sequence
Cheng et al. AttBind: Prediction of Transcription Factor Binding Sites Across Cell-types Based on Attention Mechanism
KC et al. Interpretable structured learning with sparse gated sequence encoder for protein-protein interaction prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200519

Address after: Room 534, No. 25, Lane 3399, Kangxin Road, Pudong New District, Shanghai 200000

Applicant after: Shanghai Wickham Biomedical Technology Co.,Ltd.

Address before: 400000 38 5, 5 office building, 1 Olympic road, Jiulongpo District, Chongqing.

Applicant before: CHONGQING BIOOGENE BIOTECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20180508
