CN1773517A - Protein sequence characteristic extracting method based on Chinese participle technique - Google Patents

Publication number
CN1773517A
Authority
CN
China
Prior art keywords
sequence
dictionary
speech
protein sequence
characteristic extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2005101102164A
Other languages
Chinese (zh)
Inventor
杨旸
吕宝粮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

A protein sequence feature extraction method based on Chinese word segmentation comprises: building a dictionary from the sequences in the training samples, so as to find a set of amino acid sequence substrings useful for classification; segmenting all samples by matching each sequence sample against the entries of the dictionary; counting how often each dictionary word occurs in each sequence, thereby converting the sequence into a numeric vector; and classifying proteins using the converted features.

Description

Protein sequence feature extraction method based on Chinese word segmentation
Technical field
The present invention relates to a method in the field of computer application technology, specifically a protein sequence feature extraction method based on Chinese word segmentation.
Background art
Over the past 20 years, bioinformatics and computational biology have flourished. Computer science is rapidly and markedly changing the face of biological research, and much work that scientists once had to carry out in the laboratory can now be done on a computer. Protein classification and structure prediction are important research topics in this field. The number of protein sequences stored in public databases is growing exponentially, and understanding these sequences requires sophisticated pattern recognition techniques. How to represent biological sequences effectively, that is, how to perform feature extraction so that a classifier can be designed and applied appropriately, is therefore particularly important. Feature extraction is a key problem in pattern recognition. Because the most important features are often hard to find in practical problems, feature selection and extraction is complicated and is one of the most difficult tasks in building a pattern recognition system.
There has been much research on feature extraction for protein sequences. In 1994, Nakashima and Nishikawa used the composition of single amino acids and of residue pairs to successfully discriminate intracellular from extracellular proteins. It was later found that the N-terminal sorting signal of a protein sequence is an important biological feature that can effectively distinguish several kinds of proteins; however, this method depends strongly on the leader sequence and is error-prone when the leader sequence is unreliable. In addition, Chou proposed the pseudo amino acid composition, as well as methods that use Gene Ontology and functional domain information, but such information is difficult to obtain for new protein sequences. Although various pattern classification methods, such as Bayesian classifiers, k-nearest neighbors, hidden Markov models, neural networks and support vector machines, have been tried on the protein classification problem, the improvement in classification performance that can be gained from the classifier alone is very limited.
A search of the prior-art literature found that in 2003 Park published, in the journal Bioinformatics, the article "Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs" (vol. 19, 2003, pp. 1656-1663). The article uses the composition of amino acid monomers and of four kinds of residue pairs as protein features. It reaches a classification accuracy of 78% on more than 7,000 eukaryotic proteins, confirming the effectiveness of this feature extraction approach. However, the article performs no feature selection: all residue pairs are used as features, so the dimension of the feature space is high. Nor does it mention the possibility of classifying with longer k-mers.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a protein sequence feature extraction method based on Chinese word segmentation that both improves classification accuracy and speeds up classification. Borrowing from Chinese word segmentation, the invention segments protein amino acid sequences. It is not limited to amino acid pairs but also examines longer k-mers, and it selects from these k-mers those that are meaningful for classification, thereby finding protein features.
The present invention is achieved by the following technical solution, which comprises the following steps:
(1) build a dictionary from the sequences in the training samples, finding a set of amino acid sequence substrings that are useful for classification;
(2) segment the sequences, i.e. segment all samples by matching each sequence sample against the entries of the built dictionary, and select the optimal segmentation;
(3) after segmentation, compute sequence statistics: count the frequency with which each dictionary word occurs in each sequence, converting the sequence into a numeric vector;
(4) finally, classify the proteins using the converted features.
Building the dictionary means obtaining a vocabulary; the words in this vocabulary are exactly the ones that should be cut out during segmentation. A protein sequence is much simpler than text, but whereas text is composed of meaningful words, it is hard to judge which amino acid k-mers in a sequence are meaningful. The invention puts the k-mers with a high frequency of occurrence, together with all 20 amino acid monomers, into the dictionary. First, a maximum substring length is set. For every substring no longer than this length, the number of times it occurs in the sample set is counted (the sum of its occurrences over all sequences, with k-mer occurrences counted in overlapping fashion). Finally, for each length greater than 1, the most frequent substrings (k-mers) of that length are taken as words in the dictionary. All 20 single-amino-acid words are placed in the dictionary so that the segmentation stage never meets a sequence that cannot be segmented.
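The dictionary construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the `words_per_length` parameter name, and the alphabetical tie-breaking between equally frequent k-mers are assumptions made here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def count_kmers(sequences, max_len):
    """Overlapping occurrence counts of every substring of length 1..max_len,
    summed over all sequences in the sample set."""
    counts = {k: Counter() for k in range(1, max_len + 1)}
    for seq in sequences:
        for k in range(1, max_len + 1):
            for i in range(len(seq) - k + 1):  # window slides by 1: overlapping
                counts[k][seq[i:i + k]] += 1
    return counts

def build_dictionary(sequences, max_len, words_per_length):
    """All 20 monomers plus the most frequent k-mers of each length 2..max_len."""
    counts = count_kmers(sequences, max_len)
    dictionary = set(AMINO_ACIDS)  # monomers guarantee every sequence can be cut
    for k in range(2, max_len + 1):
        # Assumption: ties are broken alphabetically to keep the result stable.
        ranked = sorted(counts[k].items(), key=lambda kv: (-kv[1], kv[0]))
        dictionary.update(word for word, _ in ranked[:words_per_length])
    return dictionary, counts
```

For example, `build_dictionary(["AAAA", "AAC"], max_len=2, words_per_length=1)` counts "AA" four times (three overlapping hits in the first sequence, one in the second) and adds it to the 20 monomers.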
Segmenting a sequence means matching the sequence sample against the entries of the built dictionary; if a string is found in the dictionary, the match succeeds and a word has been identified. A protein sequence usually consists of several hundred to several thousand amino acids, so many segmentations may be possible. To find the best one, first, following the principle of preferring long-word matches, the segmentations that yield the fewest segments are found; several candidates may still remain. The invention assigns a weight to each word in the dictionary and proposes a maximum weight product matching rule: the segmentation whose words have the largest weight product is optimal. The details are as follows.
For a single amino acid, let $\mathrm{frequency}_{1,i}$ be its number of occurrences and $\mathrm{Freq}_1$ the maximum frequency of occurrence among the 20 single amino acids. The weight of a single amino acid is then defined as:
$$\mathrm{weight}_{1,i} = \frac{\mathrm{frequency}_{1,i}}{\mathrm{Freq}_1}, \quad 1 \le i \le 20$$
Similarly, the weight of a k-mer is:
$$\mathrm{weight}_{k,i} = \frac{\mathrm{frequency}_{k,i}}{\mathrm{Freq}_k} \times C^{k-1}, \quad 1 \le i \le N,\ 1 \le k \le \mathrm{maxLen},\ C \ge 1$$
where N is the number of k-mers occurring in the data set, maxLen is the maximum k-mer length, and C is an adjustable parameter whose role is to ensure that long words have larger weights than short words.
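A short sketch of the weight definition may help. It assumes that Freq_k is the largest count among all k-mers of length k seen in the data set (by analogy with the stated definition of Freq_1), and it takes the counts as a plain mapping from length to per-word counts. The function name and the default value of `c` are hypothetical; the patent gives no value for C.

```python
def word_weights(dictionary, counts, c=2.0):
    """weight = frequency / Freq_k * c**(k-1) for each dictionary word of length k.

    counts maps each length k to a dict {k-mer: overlapping count}.
    c is the adjustable parameter C (must be >= 1); 2.0 is only a placeholder.
    """
    if c < 1:
        raise ValueError("C must be at least 1")
    weights = {}
    for word in dictionary:
        k = len(word)
        per_length = counts.get(k, {})
        freq_k = max(per_length.values(), default=0)  # Freq_k: largest count at length k
        frequency = per_length.get(word, 0)
        # A word never seen in the data gets weight 0 (an edge case chosen here).
        weights[word] = (frequency / freq_k) * c ** (k - 1) if freq_k else 0.0
    return weights
```

With counts `{1: {"A": 6, "C": 1}, 2: {"AA": 4, "AC": 1}}` and `c=2.0`, the most frequent monomer "A" gets weight 1.0 while the most frequent dimer "AA" gets 2.0, so a larger C shifts weight toward longer words.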
For each segmentation with the minimum number of segments, the weight product is defined as:
$$P_{S,T} = \prod_{w \in W} \mathrm{weight}_w^{\,n_w}$$
where W is the set of all matched words and $n_w$ is the number of times word w is matched when the given sequence S is segmented in manner T. Computing the weight product makes it possible to select the optimal segmentation.
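One way to apply the two criteria (fewest segments first, then largest weight product) is a dynamic program over sequence positions. The patent does not say how the search is carried out, so the sketch below is only one possible realization. It compares sums of negative log weights instead of raw products, which gives the same ordering while avoiding underflow on long sequences; `segment` and its arguments are hypothetical names.

```python
import math

def segment(sequence, weights):
    """Return the segmentation with the fewest words and, among those,
    the largest product of word weights. weights maps word -> weight."""
    max_len = max(len(word) for word in weights)
    # best[i] = (segment count, -log(weight product), start of last word) for sequence[:i]
    best = [(0, 0.0, -1)] + [None] * len(sequence)
    for end in range(1, len(sequence) + 1):
        for k in range(1, min(max_len, end) + 1):
            start = end - k
            word = sequence[start:end]
            if word not in weights or best[start] is None:
                continue
            w = weights[word]
            cost = -math.log(w) if w > 0 else math.inf  # zero weight: worst choice
            candidate = (best[start][0] + 1, best[start][1] + cost, start)
            # Tuple comparison: fewer segments first, then larger product.
            if best[end] is None or candidate[:2] < best[end][:2]:
                best[end] = candidate
    if best[len(sequence)] is None:
        raise ValueError("sequence cannot be segmented with this dictionary")
    words, end = [], len(sequence)
    while end > 0:  # walk the back-pointers to recover the words
        start = best[end][2]
        words.append(sequence[start:end])
        end = start
    return words[::-1]
```

With weights `{"A": 1.0, "C": 0.5, "D": 0.5, "AC": 0.8, "CD": 1.2}`, the sequence "ACD" has two two-word segmentations: AC|D with product 0.4 and A|CD with product 1.2, so the second is chosen.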
Counting the frequency of each word in a sequence means that, after the sequence has been segmented into a string of dictionary words, the number of occurrences of each word in that sequence is counted and used as the value of the corresponding dimension of the vector. A protein sequence can thus be expressed as a vector.
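The conversion to a vector can be sketched in a few lines. The fixed ordering of the vocabulary (here simply the order of the list passed in) is an assumption; any ordering works as long as it is the same for every sequence. The function name is hypothetical.

```python
from collections import Counter

def to_feature_vector(segmented_words, vocabulary):
    """Count how often each dictionary word appears in one segmented sequence.

    vocabulary is an ordered list of dictionary words; position j of the
    result is the count of vocabulary[j].
    """
    counts = Counter(segmented_words)
    return [counts[word] for word in vocabulary]
```

For instance, the segmentation `["A", "CD", "A"]` with vocabulary `["A", "C", "D", "CD"]` yields `[2, 0, 0, 1]`. Because the segmentation is non-overlapping, the "C" and "D" inside "CD" are not counted separately.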
The invention extends the study of protein sequence substrings from dimers to longer k-mers and, through effective feature selection, overcomes the drawback that protein k-mers are too numerous to count. A lower-dimensional feature space means faster learning and classification, which is vital for large-scale biological problems with massive data. Moreover, the invention uses non-overlapping segmentation, which can find features different from those found by earlier overlapping counting methods and thereby improve classification accuracy. The features extracted by the invention can be applied to many protein classification problems, such as protein subcellular localization and structure prediction.
Brief description of the drawings
Fig. 1 is a schematic diagram of the principle of the invention.
Embodiment
A specific embodiment is given below with reference to the drawing, and the technical solution of the invention is described in further detail.
The invention notes that Chinese sentences and protein sequences are similar: both are continuous character strings, and only the language and the words used differ. It therefore borrows the method of Chinese word segmentation to segment protein amino acid sequences and thereby find their features.
Fig. 1 shows the three main steps of the invention: building the dictionary, word segmentation, and conversion into a feature vector. Building the dictionary involves two considerations: the criterion for including words and the number of words, i.e. the dictionary capacity. Several inclusion criteria are possible, such as absolute word frequency, relative word frequency (using the TF-IDF formula), or other methods. The capacity depends on two parameters: the maximum length of the included words and the number of words of each length. The segmentation stage comprises three steps: assigning a weight to each word, segmenting the sequence, and selecting the optimal segmentation. Segmentation takes the minimum number of segments as its criterion, and the optimal segmentation is then selected by the maximum word weight product.
The invention is applied to predicting protein subcellular location: 7579 protein sequences distributed over 12 subcellular locations (chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracellular, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, plasma membrane, vacuole) are classified; the data distribution is shown in Table 1. The maximum length of the words in the dictionary is set to 3, 4 or 5, and the number of words of each length to 2, 5, 10, 20, 50 or 100. K-nearest neighbors and a support vector machine (SVM) are each used as the classifier, with K for the K-nearest-neighbor method chosen from 2 to 16. In this embodiment, high prediction accuracy is reached with a maximum word length of 3, 4 or 5. Prediction is best with 5 words of each length; that is, high accuracy is already reached with a 30-dimensional feature space, so very few features are needed for classification and classification efficiency is greatly improved. Table 2 compares, using the support vector machine classifier, the dimension and accuracy of the invention and five other feature extraction methods.
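The dimension of the feature space follows directly from the two dictionary parameters. The small sketch below enumerates the parameter grid of this embodiment, assuming the training data contains at least the requested number of distinct k-mers at every length; the function name is hypothetical. Under that assumption, 30 dimensions corresponds to a maximum word length of 3 with 5 words per length (20 monomers plus 5 dimers plus 5 trimers).

```python
def feature_dimension(max_len, words_per_length, n_monomers=20):
    """Dictionary size: all monomers plus words_per_length words
    for each length from 2 to max_len."""
    return n_monomers + words_per_length * (max_len - 1)

# Parameter grid used in the embodiment.
grid = {
    (max_len, n): feature_dimension(max_len, n)
    for max_len in (3, 4, 5)
    for n in (2, 5, 10, 20, 50, 100)
}
```

In this grid only the setting `(3, 5)` gives 30 dimensions; the largest setting, `(5, 100)`, gives 420, which is comparable to the 400 dimensions of the amino acid pair features.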
Table 1. Class distribution of the data set

Location                 Number
Chloroplast                 671
Cytoplasm                  1241
Cytoskeleton                 40
Endoplasmic reticulum       114
Extracellular               861
Golgi apparatus              47
Lysosome                     93
Mitochondrion               727
Nucleus                    1932
Peroxisome                  125
Plasma membrane            1674
Vacuole                      54
Total                      7579
Table 2. Comparison of dimension and classification accuracy of feature extraction methods

Method                           Dimension   Macro-average (%)   Micro-average (%)
Amino acid composition                  20                53.4                70.3
Amino acid pair composition            400                58.4                73.6
One-gapped amino acid pairs            400                60.0                73.9
Two-gapped amino acid pairs            400                55.8                72.5
Three-gapped amino acid pairs          400                57.3                72.5
The present invention                   30                61.2                74.7
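The patent does not define the two averages reported in Table 2. Under the usual convention, which is assumed here, the macro-average is the mean of the per-class accuracies and the micro-average is the overall fraction of correctly classified sequences. The sketch below, with a hypothetical function name, shows how the two can differ on an imbalanced data set such as the one in Table 1.

```python
def macro_micro_accuracy(true_labels, predicted_labels):
    """Return (macro, micro) accuracy as fractions in [0, 1].

    macro: unweighted mean of per-class accuracy (every class counts equally)
    micro: correct predictions divided by all predictions
    """
    totals, correct = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (truth == guess)
    macro = sum(correct[c] / totals[c] for c in totals) / len(totals)
    micro = sum(correct.values()) / sum(totals.values())
    return macro, micro
```

With four nucleus sequences all predicted correctly and one vacuole sequence predicted as nucleus, the macro-average is 0.5 while the micro-average is 0.8. Small classes that are hard to predict pull the macro-average down, which may be why the macro figures in Table 2 are lower than the micro figures.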

Claims (8)

1. A protein sequence feature extraction method based on Chinese word segmentation, characterized in that it comprises the following steps:
(1) building a dictionary from the sequences in the training samples, finding a set of amino acid sequence substrings that are useful for classification;
(2) segmenting the sequences, i.e. segmenting all samples by matching each sequence sample against the entries of the built dictionary, and selecting the optimal segmentation;
(3) after segmentation, computing sequence statistics: counting the frequency with which each dictionary word occurs in each sequence, converting the sequence into a numeric vector;
(4) finally, classifying the proteins using the converted features.
2. The protein sequence feature extraction method based on Chinese word segmentation according to claim 1, characterized in that building the dictionary means: putting the k-mers with a high frequency of occurrence and all 20 amino acid monomers into the dictionary; first setting a maximum substring length; counting, for every substring whose length is less than or equal to this length, the number of times it occurs in the sample set; and finally taking, for each length, the most frequent substrings of that length as words in the dictionary.
3. The protein sequence feature extraction method based on Chinese word segmentation according to claim 2, characterized in that the number of times a substring occurs in the sample set is the sum of its occurrences over all sequences, and k-mer occurrences are counted in overlapping fashion.
4. The protein sequence feature extraction method based on Chinese word segmentation according to claim 1, characterized in that segmenting the sequences means matching each sequence sample against the entries of the built dictionary; if a string is found in the dictionary, the match succeeds and a word has been identified; a protein sequence consists of several hundred to several thousand amino acids, so multiple segmentations exist.
5. The protein sequence feature extraction method based on Chinese word segmentation according to claim 4, characterized in that each word in the dictionary is assigned a weight, and the best segmentation is selected by the dual criterion of the minimum number of segments and the maximum weight product of the resulting words.
6. The protein sequence feature extraction method based on Chinese word segmentation according to claim 4 or 5, characterized in that, for a single amino acid, $\mathrm{frequency}_{1,i}$ is its number of occurrences and $\mathrm{Freq}_1$ is the maximum frequency of occurrence among the 20 single amino acids, and the weight of a single amino acid is defined as:
$$\mathrm{weight}_{1,i} = \frac{\mathrm{frequency}_{1,i}}{\mathrm{Freq}_1}, \quad 1 \le i \le 20$$
Similarly, the weight of a k-mer is:
$$\mathrm{weight}_{k,i} = \frac{\mathrm{frequency}_{k,i}}{\mathrm{Freq}_k} \times C^{k-1}, \quad 1 \le i \le N,\ 1 \le k \le \mathrm{maxLen},\ C \ge 1$$
where N is the number of k-mers occurring in the data set, maxLen is the maximum k-mer length, and C is an adjustable parameter whose role is to ensure that long words have larger weights than short words.
7. The protein sequence feature extraction method based on Chinese word segmentation according to claim 4 or 5, characterized in that, for each segmentation with the minimum number of segments, the weight product is defined as:
$$P_{S,T} = \prod_{w \in W} \mathrm{weight}_w^{\,n_w}$$
where W is the set of all matched words and $n_w$ is the number of times word w is matched when the given sequence S is segmented in manner T.
8. The protein sequence feature extraction method based on Chinese word segmentation according to claim 1, characterized in that counting the frequency of each word in a sequence means that, after the sequence has been segmented into a string of dictionary words, the number of occurrences of each word in that sequence is counted and used as the value of the corresponding dimension of the vector, so that the protein sequence can be expressed as a vector.
CNA2005101102164A 2005-11-10 2005-11-10 Protein sequence characteristic extracting method based on Chinese participle technique Pending CN1773517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2005101102164A CN1773517A (en) 2005-11-10 2005-11-10 Protein sequence characteristic extracting method based on Chinese participle technique

Publications (1)

Publication Number Publication Date
CN1773517A (en) 2006-05-17

Family

ID=36760480

Country Status (1)

Country Link
CN (1) CN1773517A (en)


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication