CN108009405A - Method for predicting bacterial outer membrane proteins based on machine learning techniques - Google Patents
Method for predicting bacterial outer membrane proteins based on machine learning techniques
- Publication number
- CN108009405A (application CN201711435147.3A)
- Authority
- CN
- China
- Prior art keywords
- outer membrane
- protein
- pssm
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a method that uses machine learning techniques to predict the outer membrane proteins encoded in a bacterial genome. The method uses the PSI-BLAST algorithm to compute the position-specific feature vector of a protein, transforms the features with an autocorrelation function, and builds a classifier based on a support vector machine that separates outer membrane proteins from non-outer membrane proteins. A local computer program receives a protein sequence entered by the user and predicts whether it is an outer membrane protein. The invention can run predictions on the protein sequences encoded by a complete bacterial genome with high sensitivity and fast computation, providing an effective tool for the rapid identification and screening of outer membrane proteins in bacterial genomes. It is an accurate and effective screening method that can be widely applied to identifying outer membrane proteins in newly sequenced bacterial genomes.
Description
Technical field
The invention belongs to the technical field of bacterial outer membrane protein prediction, and in particular relates to a method for predicting bacterial outer membrane proteins based on machine learning techniques.
Background art
The outer membrane of Gram-negative bacteria carries a large number of beta-barrel transmembrane proteins. Some of these proteins are effector proteins through which bacteria invade cells, as well as targets that the host immune system recognizes when clearing bacteria. They mediate the development of a variety of diseases and also activate the body's immune mechanisms against bacterial infection.
At present, outer membrane proteins in a new bacterial genome are identified mainly by experiment. However, experimental identification consumes substantial labor and materials, is costly, and is inefficient. A new bacterial genome often encodes thousands of proteins, and identifying the outer membrane proteins among them one by one with traditional experimental means is extremely difficult. Computer-based bioinformatics prediction can be automated, is fast, and is inexpensive, which makes it an effective way to identify outer membrane proteins across a complete bacterial genome. Establishing a fast and accurate bioinformatics prediction and recognition algorithm has therefore become the main problem to be solved in this field.
Summary of the invention
The object of the invention is to provide a method for predicting bacterial outer membrane proteins based on machine learning techniques, intended to solve the problem that outer membrane proteins in new bacterial genomes are currently identified mainly by experiment.
The invention is realized as a method for predicting bacterial outer membrane proteins based on machine learning techniques. The method, which can predict outer membrane proteins at the whole-genome level, is as follows:
Using the PSI-BLAST algorithm, the protein sequence is aligned against a non-redundant protein sequence database and a position-specific scoring matrix (PSSM) is computed. A composition function and an autocorrelation function are then used to compute, respectively, the PSSM composition feature (amino acid residue composition, PSSM_AAC) and the PSSM autocorrelation feature (autocorrelated amino acid position-specific composition, PSSM_AC). A classifier based on a support vector machine is built to classify outer membrane proteins and non-outer membrane proteins, and a local computer program receives a protein sequence entered by the user and predicts whether it is an outer membrane protein.
Further, the method for predicting outer membrane proteins based on machine learning techniques specifically comprises the following steps:
Step 1: the user enters the protein sequence to be predicted, in FASTA format, into the local computer program;
Step 2: the computer program uses the PSI-BLAST program to align the protein sequence against non-redundant protein sequences;
Step 3: the computer program calls Matlab to run the core prediction program, which computes features of the protein sequence such as the amino acid position-specific composition and the autocorrelated amino acid position-specific composition;
Step 4: the Matlab program performs feature selection and combination on the multiple feature types in a preset manner to produce a protein feature vector;
Step 5: the Matlab program calls the libSVM program and uses the pre-trained model to predict the probability that the protein is an outer membrane protein;
Step 6: whether the protein is an outer membrane protein is judged from the SVM prediction result, and the result is output to the screen or saved to the local computer disk.
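The FASTA input of Step 1 can be illustrated with a minimal Python sketch. The patent's program is written in Matlab and its input handling is not disclosed, so the function names `read_fasta` and `is_standard_protein` and the rule of keeping only the first word of each header are assumptions made for illustration.

```python
from typing import Dict

# The 20 standard amino acid residue types (one-letter codes).
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {identifier: sequence} mapping.

    Hypothetical helper: the identifier is the first word after '>'.
    """
    records: Dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            # Fall back to a generated name when the header is empty.
            current = words[0] if words else f"seq{len(records) + 1}"
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped; concatenate them.
            records[current] += line.upper()
    return records


def is_standard_protein(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only the 20 standard residues."""
    return bool(sequence) and set(sequence) <= STANDARD_RESIDUES


records = read_fasta(">p1 example protein\nMKT\nAAL\n>p2\nMXZ\n")
```

Sequences that fail `is_standard_protein` could be reported to the user before PSI-BLAST is run; whether the original program performs such a check is not stated.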
Further, the PSSM is computed as follows:
Using the PSI-BLAST algorithm with the e-value set to 0.001 and 3 iterations, the protein sequence is aligned against the NCBI non-redundant protein database nr, and the PSSM obtained during the computation is output.
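The PSI-BLAST settings above can be sketched as a command line for the NCBI BLAST+ `psiblast` program, assembled here in Python. This is an illustration under assumptions: the patent does not say whether the value 0.001 is the reporting e-value or the profile inclusion threshold, so the sketch maps it to `-evalue`; the file names and the function name `build_psiblast_command` are hypothetical.

```python
from typing import List


def build_psiblast_command(
    query_fasta: str,
    database: str = "nr",
    evalue: float = 0.001,
    iterations: int = 3,
    pssm_out: str = "query.pssm",
) -> List[str]:
    """Build an argument list for BLAST+ psiblast that writes an ASCII PSSM.

    Assumption: 0.001 is passed as -evalue; the original program may instead
    use it as the inclusion threshold (-inclusion_ethresh).
    """
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_out,
    ]


command = build_psiblast_command("protein.fasta")
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where BLAST+ and a local copy of nr are installed.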
Further, the PSSM composition and PSSM autocorrelation features are computed as follows:
The protein sequence is treated as a character string composed of the 20 amino acid residue types. The local PSI-BLAST program aligns the protein sequence against a local protein sequence database and outputs the position-specific scoring matrix (PSSM). The composition feature PSSM_AAC of the protein sequence's PSSM is computed as:
$$V^{PSSM\_AAC} = [\bar{S}_1, \bar{S}_2, \bar{S}_3, \ldots, \bar{S}_i, \ldots, \bar{S}_{20}]$$
$$\bar{S}_i = \frac{1}{L}\sum_{j=1}^{L} s_{j,i}, \quad i = 1, 2, 3, \ldots, 20$$
Here, $s_{j,i}$ is the PSSM score for the amino acid at position j of the protein sequence mutating to amino acid i, L is the length of the sequence, and $\bar{S}_i$ is the average score for mutation to amino acid i over all positions of the protein. PSSM_AC is the autocorrelation score of the PSSM, computed as:
$$V_{i,g}^{PSSM\_AC} = \frac{1}{L-g}\sum_{j=1}^{L-g}\left(s_{j,i}-\bar{S}_i\right)\times\left(s_{j+g,i}-\bar{S}_i\right), \quad g = 1, 2, 3, \ldots, LG$$
Here, g (also written lg) is the distance between a residue and its neighbor, LG is the maximum value of g, and V is the average autocorrelation value between two residues separated by g amino acid residues. In this way, 20*LG variables can be computed from the PSSM of a protein sequence.
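The composition and autocorrelation features described above can be sketched in plain Python as follows. This is not the patent's Matlab code: the function names are hypothetical, the PSSM is assumed to be given as a list of rows (one row per sequence position, one column per amino acid), and the toy matrix has only 2 columns so the result can be checked by hand; a real PSSM has 20.

```python
from typing import List, Sequence


def pssm_aac(pssm: Sequence[Sequence[float]]) -> List[float]:
    """Column means of the PSSM: one average score per amino acid type."""
    length = len(pssm)
    n_columns = len(pssm[0])
    return [sum(row[i] for row in pssm) / length for i in range(n_columns)]


def pssm_ac(pssm: Sequence[Sequence[float]], max_lag: int) -> List[float]:
    """Autocorrelation of each PSSM column for lags g = 1..max_lag (LG).

    Returns n_columns * max_lag values, ordered by column and then by lag.
    """
    length = len(pssm)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    means = pssm_aac(pssm)
    features: List[float] = []
    for i, mean_i in enumerate(means):
        for g in range(1, max_lag + 1):
            total = sum(
                (pssm[j][i] - mean_i) * (pssm[j + g][i] - mean_i)
                for j in range(length - g)
            )
            features.append(total / (length - g))
    return features


# Toy PSSM: 4 positions, 2 columns.
toy = [[1, 0], [3, 0], [5, 0], [7, 0]]
aac = pssm_aac(toy)
ac = pssm_ac(toy, max_lag=2)
feature_vector = aac + ac  # combined vector: n_columns + n_columns * LG values
```

With 20 columns the combined vector has 20 + 20*LG entries. The value of LG used by the invention is one of its preset parameters and is not disclosed in this text.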
Further, the PSSM composition reflects the conservation information for amino acid i at a given position, while the PSSM autocorrelation feature reflects the conservation of the amino acids occurring within a segment of the amino acid sequence.
Further, in the local computer program that computes protein sequence features, the protein sequence entered by the user is passed to a Matlab script, which computes a multidimensional feature vector from the protein sequence according to the PSSM composition and autocorrelation computation methods, using preset parameters.
Further, the classifier based on a support vector machine (SVM) that classifies outer membrane proteins and non-outer membrane proteins is built as follows: an SVM classifier is built with libSVM 3.14 and takes the multidimensional feature vector as input; training datasets of outer membrane proteins and non-outer membrane proteins are built using data mining techniques; and, following the SVM algorithm, a classification model is trained on the training data with an RBF kernel function and its parameters.
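As an illustration of how a multidimensional feature vector can be handed to libSVM, the sketch below writes one sample in libSVM's text format (`<label> <index>:<value> ...`, with feature indices starting at 1). The label convention (+1 for outer membrane proteins, -1 for non-outer membrane proteins) and the function name `to_libsvm_line` are assumptions; the patent calls libSVM from Matlab and does not describe its data files.

```python
from typing import Sequence


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Format one sample as a line of libSVM's text input format.

    Assumed labels: +1 = outer membrane protein, -1 = non-outer membrane protein.
    """
    # libSVM feature indices are 1-based and must be in ascending order.
    pairs = [f"{index}:{value:.6g}" for index, value in enumerate(features, start=1)]
    return " ".join([f"{label:+d}"] + pairs)


line = to_libsvm_line(1, [0.5, -2.0, 0.0])
```

One such line per training protein yields a file that the libSVM command-line tools can read.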
Further, the classification model trained on the training data according to the SVM algorithm, with its kernel function and parameters, is constructed as follows:
Sample sequences are collected by database search and sequence alignment, and redundant sequences are removed with the BLASTCLUST algorithm. The resulting experimentally verified outer membrane protein sequences and non-outer membrane protein sequences serve as the training dataset, in which the sequence similarity between any two protein sequences does not exceed 25%. The RBF kernel is chosen as the SVM kernel function, and the penalty factor parameter is determined by grid search with ten-fold cross-validation. The classification model is trained with SVMtrain, and prediction performance is evaluated jointly by sensitivity, specificity, overall prediction accuracy, and the Matthews correlation coefficient. Finally, the feature combination and model parameters with the best Matthews correlation coefficient are selected from the test results, and the best-performing model is exported and saved as the final model.
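The four performance measures named above follow from the counts of a binary confusion matrix. The sketch below uses their standard definitions; the counts in the example are illustrative only and are not taken from the patent's experiments, and the function name `evaluate` is hypothetical.

```python
import math
from typing import Dict


def evaluate(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient.

    tp/fn: outer membrane proteins predicted correctly / missed.
    tn/fp: non-outer membrane proteins predicted correctly / wrongly flagged.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Illustrative counts, not results from the patent.
scores = evaluate(tp=90, fn=10, tn=95, fp=5)
```

In a grid search, these scores would be computed for each candidate parameter setting over the ten cross-validation folds, and the setting with the highest Matthews correlation coefficient kept.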
The invention is an application of bioinformatics methods to bacterial outer membrane protein prediction. Its core idea is a protein sequence feature mining method based on the composition and autocorrelation features of the PSSM, combined with a machine learning algorithm to design a highly accurate prediction model and algorithm. The method applies the PSI-BLAST algorithm, an autocorrelation function, and related techniques to the computation of protein sequence features, capturing the correlation of PSSM scores at a position with those upstream and downstream of it. In addition, the support vector machine algorithm, which performs well in pattern recognition and machine learning, is used to build the classification model, and grid search determines the optimal SVM kernel function parameters. Standard training and test datasets are built by database retrieval and literature mining, data redundancy is removed with the BLASTCLUST homology comparison technique, and prediction performance is measured with multiple indicators, including sensitivity, specificity, prediction accuracy, and the Matthews correlation coefficient. An optimized SVM classification model, established through extensive performance testing, can predict any unknown protein sequence and give the probability that it is an outer membrane protein. The local program receives bacterial genome protein sequences entered by the user and predicts whether each is an outer membrane protein with high prediction accuracy.
A non-redundant training dataset was built containing 208 outer membrane proteins, 674 globular protein sequences, and 206 alpha-helical proteins. Ten-fold cross-validation on the training dataset was used to verify the performance of the invention. The results show that the sensitivity, specificity, overall prediction accuracy, and Matthews correlation coefficient of the method in distinguishing outer membrane proteins from non-outer membrane proteins reach 93.08%, 99.50%, 98.27%, and 0.945, respectively, exceeding most existing methods. In addition, the prediction tool was applied to the proteins of the complete Escherichia coli genome. After the training data derived from Escherichia coli were removed, the remaining 189 outer membrane proteins and 657 non-outer membrane protein sequences were used to train the prediction model. The trained model was then used to predict the 4128 protein sequences encoded by the Escherichia coli genome, yielding 30 candidate outer membrane proteins in total. Of these, 22 are confirmed outer membrane proteins, indicating that the prediction model has good sensitivity. In theory, outer membrane proteins account for about 1% to 3% of the proteins in a bacterial genome; the proportion predicted by this method in the Escherichia coli genome is below 1%, indicating a low false positive rate.
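The proportions behind these statements can be checked directly from the reported counts. The worked arithmetic below uses only the numbers given above; reading the second ratio as the confirmed fraction of the predictions is an interpretation, since the text does not say whether the remaining 8 candidates are false positives or simply unverified.

```latex
\frac{30}{4128} \approx 0.73\% \quad \text{(predicted outer membrane proteins among all encoded proteins)}
\qquad
\frac{22}{30} \approx 73.3\% \quad \text{(predictions that are confirmed outer membrane proteins)}
```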
The invention can be widely applied in research on identifying bacterial outer membrane proteins. Bacterial outer membrane proteins are important molecules involved in bacterial pathogenesis and are the targets of many antibacterial drugs. With the invention and the prediction program it provides, the outer membrane proteins in a new bacterial genome can be predicted quickly, yielding a small set of candidate outer membrane proteins for experimental identification or other purposes and thereby speeding up the identification of outer membrane proteins in bacterial genomes.
Brief description of the drawings
Fig. 1 is a flow chart of the method for predicting bacterial outer membrane proteins based on machine learning techniques provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of the program provided by an embodiment of the method for predicting bacterial outer membrane proteins based on machine learning techniques.
Detailed description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Unless otherwise defined, all technical and scientific terms used here have the meaning commonly understood by those skilled in the technical field of the invention. The terms used in this description are intended only to describe specific embodiments and are not intended to limit the invention.
Embodiment 1
The invention is realized as a method for predicting bacterial outer membrane proteins based on machine learning techniques. The method, which can predict outer membrane proteins at the whole-genome level, is as follows:
Using the PSI-BLAST algorithm, the protein sequence is aligned against a non-redundant protein sequence database and a position-specific scoring matrix (PSSM) is computed. A composition function and an autocorrelation function are then used to compute, respectively, the PSSM composition feature (amino acid residue composition, PSSM_AAC) and the PSSM autocorrelation feature (autocorrelated amino acid position-specific composition, PSSM_AC). A classifier based on a support vector machine is built to classify outer membrane proteins and non-outer membrane proteins, and a local computer program receives a protein sequence entered by the user and predicts whether it is an outer membrane protein.
Embodiment 2
A method for predicting bacterial outer membrane proteins based on machine learning techniques, which can predict outer membrane proteins at the whole-genome level, comprises the following steps:
Step 1: the user enters the protein sequence to be predicted, in FASTA format, into the local computer program;
Step 2: the computer program uses the PSI-BLAST program to align the protein sequence against non-redundant protein sequences;
Step 3: the computer program calls Matlab to run the core prediction program, which computes the PSSM composition feature and the PSSM autocorrelation feature of the protein;
Step 4: the Matlab program performs feature selection and combination on the multiple feature types in a preset manner to produce a protein feature vector;
Step 5: the Matlab program calls the libSVM program and uses the pre-trained model to predict the probability that the protein is an outer membrane protein;
Step 6: whether the protein is an outer membrane protein is judged from the SVM prediction result, and the result is output to the screen or saved to the local computer disk.
Those of ordinary skill in the art will understand that all or part of the steps in the above embodiment methods can be carried out by a program controlling the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (7)
- 1. A method for predicting bacterial outer membrane proteins based on machine learning techniques, characterized in that the method comprises: using the PSI-BLAST algorithm to align a protein sequence against a non-redundant protein sequence database and compute a position-specific scoring matrix (PSSM); using an autocorrelation function to compute the PSSM composition feature and the PSSM autocorrelation feature of the same amino acid within a region of the sequence, which together form the protein feature vector; building a machine learning classification model with a support vector machine; and training and optimizing the model on a training dataset, such that the trained model can classify an unknown protein sequence and judge whether it is an outer membrane protein.
- 2. A method for predicting bacterial outer membrane proteins based on machine learning techniques, characterized in that the method comprises the following steps: Step 1: the user enters the protein sequence to be predicted, in FASTA format, into the local computer program; Step 2: the computer program uses the PSI-BLAST program to align the protein sequence against non-redundant protein sequences; Step 3: the computer program calls Matlab to run the core prediction program, which computes the PSSM composition feature and the PSSM autocorrelation feature of the protein; Step 4: the Matlab program performs feature selection and combination on the multiple feature types in a preset manner to produce a protein feature vector; Step 5: the Matlab program calls the libSVM program and uses the pre-trained model to predict the probability that the protein is an outer membrane protein; Step 6: whether the protein is an outer membrane protein is judged from the SVM prediction result, and the result is output.
- 3. The method for predicting bacterial outer membrane proteins based on machine learning techniques according to claim 2, characterized in that Step 2, in which the computer program uses the PSI-BLAST program to align the protein sequence against non-redundant protein sequences, specifically comprises: passing the protein sequence entered by the user to a Matlab script; the Matlab script calls PSI-BLAST to align the sequence against a non-redundant protein database and compute the PSSM, and, using the PSSM composition computation method, converts the PSSM into the PSSM composition feature and the PSSM autocorrelation feature to obtain a combined feature vector.
- 4. The method for predicting bacterial outer membrane proteins based on machine learning techniques according to claim 2, characterized in that, in Step 3, the computer program calls Matlab to run the core prediction program, which computes the PSSM composition feature and the PSSM autocorrelation feature of the protein; the protein sequence is treated as a character string composed of the 20 amino acid residue types, and the local PSI-BLAST program aligns the protein sequence against a local protein sequence database and outputs the PSSM; the PSSM composition feature is computed as:
$$V^{PSSM\_AAC} = [\bar{S}_1, \bar{S}_2, \bar{S}_3, \ldots, \bar{S}_i, \ldots, \bar{S}_{20}]$$
$$\bar{S}_i = \frac{1}{L}\sum_{j=1}^{L} s_{j,i}, \quad i = 1, 2, 3, \ldots, 20$$
where $s_{j,i}$ is the PSSM score for the amino acid at position j of the protein sequence mutating to amino acid i, and $\bar{S}_i$ is the average score for mutation to amino acid i over all positions of the protein; PSSM_AC denotes the PSSM autocorrelation feature score, computed as:
$$V_{i,g}^{PSSM\_AC} = \frac{1}{L-g}\sum_{j=1}^{L-g}\left(s_{j,i}-\bar{S}_i\right)\times\left(s_{j+g,i}-\bar{S}_i\right), \quad g = 1, 2, 3, \ldots, LG$$
where g (also written lg) is the distance between a residue and its neighbor, LG is the maximum value of g, and V is the average autocorrelation value between two residues separated by g amino acid residues; in this way, 20*LG variables can be computed from the PSSM of a protein sequence.
- 5. The method for predicting bacterial outer membrane proteins based on machine learning techniques as claimed in claim 4, characterized in that the PSSM composition feature reflects the conservation of amino acid i at a given position, and the PSSM autocorrelation feature reflects the conservation of the amino acids occurring in a given fragment of the amino acid sequence.
- 6. The method for predicting bacterial outer membrane proteins based on machine learning techniques as claimed in claim 2, characterized in that, in Step 5, the Matlab program calls the libSVM program and uses the pre-trained model to predict the probability that a protein is an outer membrane protein, specifically: a support vector machine (SVM) classifier is established using libSVM 3.14 and the multidimensional feature vector is input to it; the SVM classifier uses a training data set of outer membrane proteins and non-outer membrane proteins built with data mining techniques, and a classification model trained according to the SVM algorithm with the training data, the RBF kernel function, and its parameters.
- 7. The method for predicting bacterial outer membrane proteins based on machine learning techniques as claimed in claim 6, characterized in that the classification model trained according to the SVM algorithm with the training data, the RBF kernel function, and its parameters is obtained as follows: sample sequences are collected by database search and sequence alignment, and redundant sequences are removed with the BLASTCLUST algorithm; the resulting experimentally verified outer membrane protein sequences and non-outer membrane protein sequences form the training data set, in which the sequence similarity between any two protein sequences does not exceed 25%; the RBF kernel function is selected as the SVM kernel function, and the penalty factor parameter is determined by grid search with ten-fold cross-validation; the classification model is trained with svmtrain, and prediction performance is assessed jointly by sensitivity, specificity, overall prediction accuracy, and the Matthews correlation coefficient; finally, the feature combination and model parameters with the best Matthews correlation coefficient are selected from the test results, and the best-performing model is output as the final model and saved.
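The PSSM features of claim 4 can be illustrated with a short sketch. It is written in Python rather than the Matlab the claims mention, and it assumes that the PSSM has already been parsed into a list of L rows with one score per amino acid column. It also assumes that the autocorrelation term pairs position j with position j+g in the same column i, which is the usual PSSM-AC definition. The function names and the toy matrix are hypothetical and serve only as illustration.

```python
from typing import List

Matrix = List[List[float]]


def pssm_composition(pssm: Matrix) -> List[float]:
    """Column means of an L x N PSSM: S_bar_i = (1/L) * sum_j s[j][i]."""
    length = len(pssm)
    if length == 0:
        raise ValueError("PSSM must contain at least one row")
    n_cols = len(pssm[0])
    return [sum(row[i] for row in pssm) / length for i in range(n_cols)]


def pssm_autocorrelation(pssm: Matrix, max_lag: int) -> List[float]:
    """Autocorrelation features, ordered by column i, then by lag g.

    For each column i and each lag g in 1..max_lag, this averages
    (s[j][i] - S_bar_i) * (s[j+g][i] - S_bar_i) over the L - g valid
    positions, giving N * max_lag values for an L x N matrix.
    """
    length = len(pssm)
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < L")
    means = pssm_composition(pssm)
    features: List[float] = []
    for i, mean_i in enumerate(means):
        for g in range(1, max_lag + 1):
            total = sum(
                (pssm[j][i] - mean_i) * (pssm[j + g][i] - mean_i)
                for j in range(length - g)
            )
            features.append(total / (length - g))
    return features


# Toy 4 x 2 matrix (a real PSSM has 20 columns); the values are made up.
toy_pssm = [
    [1.0, 0.0],
    [3.0, 0.0],
    [1.0, 2.0],
    [3.0, 2.0],
]
aac = pssm_composition(toy_pssm)                 # one mean per column
ac = pssm_autocorrelation(toy_pssm, max_lag=2)   # 2 columns * 2 lags values
print(aac)
print(ac)
```

With a real 20-column PSSM and a maximum lag LG, the second function returns the 20 × LG variables mentioned in claim 4. Concatenating both outputs gives one possible multidimensional feature vector for the classifier.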
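The evaluation step of claim 7 (sensitivity, specificity, overall prediction accuracy, and the Matthews correlation coefficient, with the final choice made on the latter) can be sketched as follows. This is a minimal illustration in Python, not the Matlab/libSVM code of the claims. The candidate (C, gamma) pairs and their predictions are invented for the example, and in practice each prediction list would come from cross-validated SVM output.

```python
import math
from typing import Dict, Sequence, Tuple


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Binary metrics, with 1 = outer membrane protein and 0 = other."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(y_true) if len(y_true) else 0.0,
        # MCC is conventionally set to 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical labels and predictions for two (C, gamma) grid points.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
candidates: Dict[Tuple[float, float], Sequence[int]] = {
    (1.0, 0.1): [1, 1, 1, 0, 0, 0, 0, 1],
    (10.0, 0.01): [1, 1, 1, 1, 0, 0, 0, 1],
}

scores = {params: evaluate(y_true, pred) for params, pred in candidates.items()}
# Keep the parameter pair with the highest Matthews correlation coefficient.
best_params = max(scores, key=lambda params: scores[params]["mcc"])
print(best_params, scores[best_params])
```

Selecting on the Matthews correlation coefficient rather than on accuracy alone is a reasonable choice when outer membrane proteins are much rarer than other proteins in the data, because the coefficient uses all four cells of the confusion matrix.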
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711435147.3A CN108009405A (en) | 2017-12-26 | 2017-12-26 | A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711435147.3A CN108009405A (en) | 2017-12-26 | 2017-12-26 | A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108009405A (en) | 2018-05-08 |
Family ID: 62061548
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711435147.3A Pending CN108009405A (en) | 2017-12-26 | 2017-12-26 | A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108009405A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109215737A (en) * | 2018-09-30 | 2019-01-15 | 东软集团股份有限公司 | Protein characteristic extracts, functional mode generates, the method and device of function prediction |
CN109637589A (en) * | 2018-12-13 | 2019-04-16 | 上海交通大学 | Based on frequent mode and the double nuclear localization signal prediction algorithms for recommending system of machine learning |
CN109754843A (en) * | 2018-12-04 | 2019-05-14 | 志诺维思(北京)基因科技有限公司 | A kind of method and device detecting genome small fragment insertion and deletion |
CN109785901A (en) * | 2018-12-26 | 2019-05-21 | 东软集团股份有限公司 | A kind of protein function prediction technique and device |
CN109801675A (en) * | 2018-12-26 | 2019-05-24 | 东软集团股份有限公司 | A kind of method, apparatus and equipment of determining protein liposomal function |
CN110060738A (en) * | 2019-04-03 | 2019-07-26 | 中国人民解放军军事科学院军事医学研究院 | Method and system based on machine learning techniques prediction bacterium protective antigens albumen |
CN110428865A (en) * | 2019-08-14 | 2019-11-08 | 信阳师范学院 | A kind of method of high-throughput prediction Antifreeze protein |
CN110517730A (en) * | 2019-09-02 | 2019-11-29 | 河南师范大学 | A method of thermophilic protein is identified based on machine learning |
CN112242179A (en) * | 2020-09-09 | 2021-01-19 | 天津大学 | Method for identifying type of membrane protein |
CN112634988A (en) * | 2021-01-07 | 2021-04-09 | 内江师范学院 | Python language-based gene variation detection method and system |
CN116721695A (en) * | 2023-03-07 | 2023-09-08 | 安徽农业大学 | Identification method, device, equipment and medium of candidate gene for regulating bacterial shape |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105930687A (en) * | 2016-04-11 | 2016-09-07 | 中国人民解放军第三军医大学 | Method for predicting outer membrane proteins at bacterial whole genome level |
CN105938522A (en) * | 2016-04-11 | 2016-09-14 | 中国人民解放军第三军医大学 | Method for predicting effector molecules of bacterial IV-type secretory system |
Non-Patent Citations (1)
Title |
---|
孙超 等: "《生物信息学中的计算机技术》", 31 July 2002 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109215737A (en) * | 2018-09-30 | 2019-01-15 | 东软集团股份有限公司 | Protein characteristic extracts, functional mode generates, the method and device of function prediction |
CN109754843A (en) * | 2018-12-04 | 2019-05-14 | 志诺维思(北京)基因科技有限公司 | A kind of method and device detecting genome small fragment insertion and deletion |
CN109637589B (en) * | 2018-12-13 | 2022-07-26 | 上海交通大学 | Nuclear localization signal prediction method based on frequent pattern and machine learning dual recommendation system |
CN109637589A (en) * | 2018-12-13 | 2019-04-16 | 上海交通大学 | Based on frequent mode and the double nuclear localization signal prediction algorithms for recommending system of machine learning |
CN109785901A (en) * | 2018-12-26 | 2019-05-21 | 东软集团股份有限公司 | A kind of protein function prediction technique and device |
CN109801675A (en) * | 2018-12-26 | 2019-05-24 | 东软集团股份有限公司 | A kind of method, apparatus and equipment of determining protein liposomal function |
CN110060738A (en) * | 2019-04-03 | 2019-07-26 | 中国人民解放军军事科学院军事医学研究院 | Method and system based on machine learning techniques prediction bacterium protective antigens albumen |
CN110428865A (en) * | 2019-08-14 | 2019-11-08 | 信阳师范学院 | A kind of method of high-throughput prediction Antifreeze protein |
CN110517730A (en) * | 2019-09-02 | 2019-11-29 | 河南师范大学 | A method of thermophilic protein is identified based on machine learning |
CN112242179A (en) * | 2020-09-09 | 2021-01-19 | 天津大学 | Method for identifying type of membrane protein |
CN112634988A (en) * | 2021-01-07 | 2021-04-09 | 内江师范学院 | Python language-based gene variation detection method and system |
CN116721695A (en) * | 2023-03-07 | 2023-09-08 | 安徽农业大学 | Identification method, device, equipment and medium of candidate gene for regulating bacterial shape |
CN116721695B (en) * | 2023-03-07 | 2024-03-08 | 安徽农业大学 | Identification method, device, equipment and medium of candidate gene for regulating bacterial shape |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108009405A (en) | A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter | |
Käll et al. | A combined transmembrane topology and signal peptide prediction method | |
CN108960319A (en) | It is a kind of to read the candidate answers screening technique understood in modeling towards global machine | |
Li et al. | Protein contact map prediction based on ResNet and DenseNet | |
CN110853756B (en) | Esophagus cancer risk prediction method based on SOM neural network and SVM | |
CN107463795A (en) | A kind of prediction algorithm for identifying tyrosine posttranslational modification site | |
Kaur et al. | Prediction of enhancers in DNA sequence data using a hybrid CNN-DLSTM model | |
CN102129565B (en) | Object detection method based on feature redundancy elimination AdaBoost classifier | |
CN110060738A (en) | Method and system based on machine learning techniques prediction bacterium protective antigens albumen | |
CN109326329B (en) | Zinc binding protein action site prediction method | |
CN105046106B (en) | A kind of Prediction of Protein Subcellular Location method realized with nearest _neighbor retrieval | |
CN105930687A (en) | Method for predicting outer membrane proteins at bacterial whole genome level | |
CN110363302B (en) | Classification model training method, prediction method and device | |
CN111048145A (en) | Method, device, equipment and storage medium for generating protein prediction model | |
CN113724779B (en) | SNAREs protein identification method, system, storage medium and equipment based on machine learning technology | |
CN115497564A (en) | Antigen identification model establishing method and antigen identification method | |
Yaseen et al. | Protein binding affinity prediction using support vector regression and interfecial features | |
Li et al. | ctP 2 ISP: Protein–Protein Interaction Sites Prediction Using Convolution and Transformer With Data Augmentation | |
KR20210052855A (en) | Electronic device for selecting biomarkers for predicting cancer prognosis based on patient-specific genetic characteristics and operating method thereof | |
Arango-Argoty et al. | An adaptation of Pfam profiles to predict protein sub-cellular localization in Gram positive bacteria | |
Li et al. | Deepacr: Predicting anti-crispr with deep learning | |
Singh et al. | Classification of non-coding rna-a review from machine learning perspective | |
CN111951889B (en) | Recognition prediction method and system for M5C locus in RNA sequence | |
Cheng et al. | AttBind: Prediction of Transcription Factor Binding Sites Across Cell-types Based on Attention Mechanism | |
KC et al. | Interpretable structured learning with sparse gated sequence encoder for protein-protein interaction prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | TA01 | Transfer of patent application right | Effective date of registration: 20200519. Address after: Room 534, No. 25, Lane 3399, Kangxin Road, Pudong New District, Shanghai 200000. Applicant after: Shanghai Wickham Biomedical Technology Co.,Ltd. Address before: 400000 38 5, 5 office building, 1 Olympic road, Jiulongpo District, Chongqing. Applicant before: CHONGQING BIOOGENE BIOTECHNOLOGY Co.,Ltd. |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180508 |