CN1773517A

CN1773517A - Protein sequence feature extraction method based on Chinese word segmentation

Info

Publication number: CN1773517A
Application number: CNA2005101102164A
Authority: CN
Inventors: 杨旸; 吕宝粮
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2005-11-10
Filing date: 2005-11-10
Publication date: 2006-05-17

Abstract

A protein sequence feature extraction method based on Chinese word segmentation comprises: building a dictionary from the sequences in the training samples, so as to find a set of amino acid sequence substrings that are useful for classification; segmenting all samples by matching each sequence sample against the dictionary entries; counting how often each dictionary word occurs in each sequence, thereby converting the sequence into a numerical vector; and classifying the proteins using the converted features.

Description

Protein sequence feature extraction method based on Chinese word segmentation

Technical field

The present invention relates to a method in the field of computer application technology, specifically a protein sequence feature extraction method based on Chinese word segmentation.

Background technology

Over the past 20 years, bioinformatics and computational biology have flourished. Computer science is rapidly and markedly changing the face of biological research, and much research that scientists once had to carry out in the laboratory can now be carried out on computers. Protein classification and structure prediction are important research topics in this field. The number of protein sequences stored in public databases is growing exponentially, and understanding these sequences requires sophisticated pattern recognition techniques. How to represent biological sequences effectively, that is, how to perform feature extraction so that classifiers can be designed and applied sensibly, is therefore particularly important. Feature extraction is a key problem in pattern recognition. In many practical problems the most important features are not easy to find, which complicates feature selection and extraction and makes it one of the most difficult tasks in building a pattern recognition system.

Much research has been done on feature extraction for protein sequences. In 1994, Nakashima and Nishikawa successfully distinguished intracellular from extracellular proteins using the composition of single amino acids and of residue pairs. It was later found that the N-terminal sorting signal of a protein sequence is an important biological feature that can effectively distinguish many kinds of proteins; however, this approach depends strongly on the leader sequence and is error-prone when the leader sequence is unreliable. In addition, Chou proposed the pseudo-amino acid composition, as well as methods that use gene ontology and functional domain information, but such information is hard to obtain for new protein sequences. Although various pattern classification methods have been tried on the protein classification problem, such as Bayesian classifiers, k-nearest neighbors, hidden Markov models, neural networks and support vector machines, relying on the classifier alone gives only a very limited improvement in classification performance.

A literature search of the prior art found that Park, in 2003, published in the journal Bioinformatics the article "Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs" (vol. 19, 2003, pp. 1656-1663). The article uses the compositions of amino acid monomers and of four kinds of residue pairs as protein features, and reaches a classification accuracy of 78% on more than 7,000 eukaryotic proteins, which verifies the effectiveness of this feature extraction approach. However, the article performs no feature selection: all residue pairs are used as features, so the feature space has a high dimension. Nor does it mention the possibility of classifying with longer multimers.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a protein sequence feature extraction method based on Chinese word segmentation that both improves classification accuracy and speeds up classification. Borrowing from Chinese word segmentation, the invention segments protein amino acid sequences. It is not limited to amino acid pairs: longer multimers are also examined, and those that are meaningful for classification are selected from them, thereby finding protein features.

The invention is achieved through the following technical solution, which comprises the following steps:

(1) build a dictionary from the sequences in the training samples, finding a set of amino acid sequence substrings that are useful for classification;

(2) segment the sequences: segment all samples by matching each sequence sample against the entries of the established dictionary, and select the optimal segmentation;

(3) after segmentation, compute sequence statistics: count how often each dictionary word occurs in each sequence, converting the sequence into a numerical vector;

(4) finally, classify the proteins using the converted features.

Building the dictionary means obtaining a vocabulary whose words are exactly those that should be cut out during segmentation. Although protein sequences are much simpler than text, text is composed of meaningful words, whereas it is hard to judge which amino acid multimers in a sequence are meaningful. The invention puts the frequently occurring multimers and all 20 amino acid monomers into the dictionary. First, a maximum substring length is set. For every substring no longer than this length, the number of times it occurs in the sample set is counted (the sum of its occurrences over all sequences, with multimers counted in an overlapping manner). Finally, for each length greater than 1, the several most frequent substrings (i.e., multimers) are taken as dictionary words. To prevent the segmentation stage from encountering a sequence that cannot be segmented, all 20 single amino acids are put into the dictionary.
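The dictionary-building step described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the one-letter amino acid alphabet and the toy sequences are assumptions introduced here, and ties between equally frequent substrings are broken arbitrarily.

```python
from collections import Counter

# Assumption: sequences use the 20 standard one-letter amino acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def count_substrings(sequences, max_len):
    """Overlapping counts of every substring of length 1..max_len,
    summed over all sequences in the sample set."""
    counts = {k: Counter() for k in range(1, max_len + 1)}
    for seq in sequences:
        for k in range(1, max_len + 1):
            # Sliding the window one residue at a time gives overlapping counts.
            for start in range(len(seq) - k + 1):
                counts[k][seq[start:start + k]] += 1
    return counts

def build_dictionary(sequences, max_len, words_per_length):
    """All 20 monomers plus the most frequent substrings of each length > 1."""
    counts = count_substrings(sequences, max_len)
    dictionary = set(AMINO_ACIDS)  # monomers guarantee every sequence can be segmented
    for k in range(2, max_len + 1):
        for word, _ in counts[k].most_common(words_per_length):
            dictionary.add(word)
    return dictionary, counts

# Hypothetical toy training set, for illustration only.
toy_dictionary, toy_counts = build_dictionary(
    ["MKKLLA", "KKLLKK"], max_len=3, words_per_length=1
)
```

Keeping the per-length counters alongside the dictionary is a convenience of this sketch, since the same counts are needed later to derive word weights.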

Segmenting a sequence means matching the sequence sample against the entries of the established dictionary; if a string is found in the dictionary, the match succeeds and a word is identified. A protein sequence usually consists of several hundred to several thousand amino acids, so many segmentations may exist. To find the best one, the segmentations with the fewest segments are first found, following the principle of preferring long-word matches; several choices may still remain at this point. The invention therefore assigns a weight to each dictionary word and proposes a maximum weight product matching rule: the segmentation whose words have the largest weight product is optimal. The details are as follows:

For a single amino acid, let {frequency}_{1, i} be its number of occurrences and let {Freq}_{1} be the maximum frequency among the 20 single amino acids. The weight of a single amino acid is then defined as:

{weight}_{1, i} = \frac{{frequency}_{1, i}}{{Freq}_{1}}, \quad 1 \leq i \leq 20

Similarly, the weight of a k-mer is:

{weight}_{k, i} = \frac{{frequency}_{k, i}}{{Freq}_{k}} \times C^{k - 1}, \quad 1 \leq i \leq N, \quad 1 \leq k \leq \mathit{maxLen}, \quad C \geq 1

where N is the number of k-mers occurring in the data set, maxLen is the maximum multimer length, and C is an adjustable parameter whose role is to ensure that long words have larger weights than short words.
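The weight definitions above can be illustrated with a short sketch. It assumes that Freq_k, by analogy with Freq_1, is the largest occurrence count among substrings of length k, and it uses C = 2.0 purely as an example value; the function name and the toy counts are hypothetical.

```python
def word_weights(frequency, dictionary, C=2.0):
    """Return weight = frequency / Freq_k * C**(k - 1) for every dictionary word.

    frequency maps each substring to its occurrence count. Freq_k is taken
    here as the largest count among substrings of length k (an assumption
    for k > 1). Every dictionary word is assumed to appear in frequency.
    """
    if C < 1:
        raise ValueError("C must be >= 1")
    max_by_len = {}
    for word, freq in frequency.items():
        k = len(word)
        max_by_len[k] = max(max_by_len.get(k, 0), freq)
    weights = {}
    for word in dictionary:
        k = len(word)
        weights[word] = frequency[word] / max_by_len[k] * C ** (k - 1)
    return weights

# Hypothetical counts, for illustration only.
toy_frequency = {"K": 10, "L": 5, "KK": 3, "KL": 2}
toy_weights = word_weights(toy_frequency, ["K", "L", "KK"], C=2.0)
```

With C greater than 1, the most frequent word of each length receives weight C^(k-1), so longer words are favored when weight products are compared.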

For each segmentation with the minimum number of segments, the weight product is defined as:

P_{S, T} = \prod_{w \in W} {weight}_{w}^{n_{w}}

where W is the set of all matched words and n_{w} is the number of times word w is matched when the given sequence S is segmented in manner T. Once the weight products are computed, the optimal segmentation can be selected.
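The two-level selection rule (fewest segments first, then largest weight product) can be sketched with dynamic programming over sequence prefixes. The dynamic programming formulation is a choice made for this sketch; the text does not state how the search is carried out. Weight products are compared through sums of logarithms to avoid numerical underflow on long sequences, and the function name and toy weights are hypothetical.

```python
import math

def segment(sequence, weights, max_len):
    """Split sequence into dictionary words using the fewest segments,
    breaking ties by the largest product of word weights."""
    n = len(sequence)
    # best[i] = (segment count, negative log weight product, start of last word)
    # for the prefix sequence[:i]; None means the prefix cannot be segmented.
    best = [None] * (n + 1)
    best[0] = (0, 0.0, 0)
    for end in range(1, n + 1):
        for k in range(1, min(max_len, end) + 1):
            start = end - k
            word = sequence[start:end]
            if word not in weights or best[start] is None:
                continue
            w = weights[word]
            cost = -math.log(w) if w > 0 else math.inf
            cand = (best[start][0] + 1, best[start][1] + cost, start)
            # Lexicographic comparison: fewer segments, then larger product.
            if best[end] is None or cand[:2] < best[end][:2]:
                best[end] = cand
    if best[n] is None:
        raise ValueError("sequence cannot be segmented with this dictionary")
    words, end = [], n
    while end > 0:
        start = best[end][2]
        words.append(sequence[start:end])
        end = start
    return words[::-1]

# Hypothetical weights: "KLA" has two 2-segment splits, KL|A and K|LA.
toy_weights = {"A": 1.0, "K": 0.8, "L": 0.5, "KL": 2.0, "LA": 1.0}
toy_words = segment("KLA", toy_weights, max_len=2)
```

In the toy case both candidate segmentations have two segments, so the weight product decides: KL|A scores 2.0 × 1.0 against 0.8 × 1.0 for K|LA.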

Counting the frequency of each word in a sequence means that, after the sequence has been segmented into a string of dictionary words, the number of occurrences of each word in the sequence is counted and used as the value of the corresponding dimension of the vector. In this way a protein sequence can be expressed as a vector.
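The conversion to a numerical vector can be sketched as follows, assuming the sequence has already been segmented into dictionary words and that the dictionary is given in a fixed order so that each word maps to one dimension. The function name and the example data are hypothetical.

```python
from collections import Counter

def to_feature_vector(segmented_words, dictionary_order):
    """Count how often each dictionary word occurs in one segmented sequence.

    dictionary_order fixes which word corresponds to which vector dimension.
    """
    counts = Counter(segmented_words)
    return [counts[word] for word in dictionary_order]

# Hypothetical segmentation result and dictionary order.
toy_vector = to_feature_vector(["KL", "A", "KL", "K"], ["A", "K", "L", "KL"])
```

Because the words come from a non-overlapping segmentation, the counts sum to the number of segments, not to the number of overlapping occurrences in the raw sequence.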

The invention extends the study of protein sequence substrings from dimers to multimers and, through effective feature selection, overcomes the drawback that protein multimers are too numerous to count. Reducing the dimension of the feature space means faster learning and classification, which is vital for large-scale biological problems with massive data. In addition, the invention uses a non-overlapping segmentation method, which can find features different from those of earlier overlapping counting methods and thereby improve classification accuracy. The features extracted by the invention can be used for many protein classification problems, such as protein subcellular localization and structure prediction.

Description of drawings

Fig. 1 is a schematic diagram of the principle of the invention.

Embodiment

A specific embodiment is given below with reference to the drawing, and the technical solution of the invention is described in further detail.

The invention observes that Chinese sentences and protein sequences are similar: both are continuous character strings, differing only in the characters and words they use. The method of Chinese word segmentation is therefore borrowed to segment protein amino acid sequences and thereby find their features.

Fig. 1 shows the three main steps of the invention: building the dictionary, word segmentation, and conversion into a feature vector. Building the dictionary involves two considerations: the criterion for including a word, and the number of words, i.e., the dictionary capacity. Several inclusion criteria are possible, such as absolute word frequency, relative word frequency (using the TF-IDF formula), or other methods. The capacity depends on two parameters: the maximum length of included words and the number of words of each length. The segmentation stage comprises three steps: assigning a weight to each word, segmenting the sequence, and selecting the optimal segmentation. Segmentation uses the minimum number of segments as its criterion, and selection of the optimum uses the maximum word weight product.

The invention was applied to the prediction of protein subcellular location, classifying 7579 protein sequences distributed over 12 subcellular locations (chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracellular, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, vacuole); the data distribution is shown in Table one. The maximum length of dictionary words was set to 3, 4 or 5, and the number of words of each length to 2, 5, 10, 20, 50 and 100. The K-nearest-neighbor method and the support vector machine (SVM) were each used as the classifier, with the K value of the K-nearest-neighbor method ranging from 2 to 16. In this embodiment, high prediction accuracy is reached for maximum word lengths of 3, 4 and 5. Prediction is best with 5 words of each length; that is, high accuracy is reached with a feature space of 30 dimensions, so classification can use very few features, which greatly improves classification efficiency. Table two compares the dimension and accuracy of the invention and five other feature extraction methods, using support vector machine classification.
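The relation between the two dictionary parameters and the feature dimension can be checked with a small calculation. The sketch assumes the dictionary holds the 20 monomers plus the stated number of words for each length from 2 up to the maximum; under that assumption the 30-dimension figure corresponds to a maximum word length of 3 with 5 words per length, although the text does not say which maximum length produced it. The function name is hypothetical.

```python
def feature_dimension(max_word_len, words_per_length, n_monomers=20):
    """Dictionary size: all monomers plus words_per_length words
    for each word length from 2 to max_word_len."""
    return n_monomers + (max_word_len - 1) * words_per_length

# Dimensions for the parameter grid mentioned in the embodiment.
grid = {
    (max_len, n): feature_dimension(max_len, n)
    for max_len in (3, 4, 5)
    for n in (2, 5, 10, 20, 50, 100)
}
```

Even the largest setting in this grid (maximum length 5, 100 words per length) gives 420 dimensions, comparable to the 400 dimensions of a full amino acid pair composition.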

Table one: class distribution of the data set

Location	Number
Chloroplast	671
Cytoplasm	1241
Cytoskeleton	40
Endoplasmic reticulum	114
Extracellular	861
Golgi apparatus	47
Lysosome	93
Mitochondria	727
Nucleus	1932
Peroxisome	125
Plasma membrane	1674
Vacuole	54
Total	7579

Table two: comparison of dimension and classification accuracy for various feature extraction methods

Method	Dimension	Macro-average (%)	Micro-average (%)
Amino acid composition	20	53.4	70.3
Amino acid pair composition	400	58.4	73.6
Amino acid pairs separated by one residue	400	60.0	73.9
Amino acid pairs separated by two residues	400	55.8	72.5
Amino acid pairs separated by three residues	400	57.3	72.5
The present invention	30	61.2	74.7
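The text does not define the macro-average and micro-average columns. The sketch below assumes the conventional meanings: the macro-average is the mean of the per-class accuracies, and the micro-average is the overall fraction of correctly classified sequences. The function name and the toy labels are hypothetical.

```python
def macro_micro_accuracy(y_true, y_pred):
    """Return (macro-average, micro-average) accuracy as fractions.

    Assumed definitions: macro = mean of per-class accuracies,
    micro = correctly classified samples / all samples.
    """
    classes = sorted(set(y_true))
    correct = {c: 0 for c in classes}
    total = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    macro = sum(correct[c] / total[c] for c in classes) / len(classes)
    micro = sum(correct.values()) / len(y_true)
    return macro, micro

# Toy imbalanced example: the rare class is always misclassified.
toy_true = ["nucleus"] * 4 + ["cytoskeleton"]
toy_pred = ["nucleus"] * 5
toy_macro, toy_micro = macro_micro_accuracy(toy_true, toy_pred)
```

Under these assumed definitions, a data set as imbalanced as the one in Table one (40 cytoskeleton sequences against 1932 nuclear ones) would naturally give a macro-average below the micro-average, since small classes count as much as large ones in the macro-average.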

Claims

1. A protein sequence feature extraction method based on Chinese word segmentation, characterized in that it comprises the following steps:

(1) building a dictionary from the sequences in the training samples, finding a set of amino acid sequence substrings that are useful for classification;

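(2) segmenting the sequences: segmenting all samples by matching each sequence sample against the entries of the established dictionary, and selecting the optimal segmentation;

(3) after segmentation, computing sequence statistics: counting how often each dictionary word occurs in each sequence, converting the sequence into a numerical vector;

(4) finally, classifying the proteins using the converted features.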

2. The protein sequence feature extraction method based on Chinese word segmentation according to claim 1, characterized in that building the dictionary means: putting the frequently occurring multimers and all 20 amino acid monomers into the dictionary; first setting a maximum substring length; counting, for each substring whose length is less than or equal to this length, the number of times it occurs in the sample set; and finally taking, for each length, the several most frequent substrings as dictionary words.

3. The protein sequence feature extraction method based on Chinese word segmentation according to claim 2, characterized in that the number of occurrences counted in the sample set is the sum of the substring's occurrences over all sequences, and multimers are counted in an overlapping manner.

4. The protein sequence feature extraction method based on Chinese word segmentation according to claim 1, characterized in that segmenting the sequence means matching the sequence sample against the entries of the established dictionary; if a string is found in the dictionary, the match succeeds and a word is identified; a protein sequence consists of several hundred to several thousand amino acids, so multiple segmentations exist.

5. The protein sequence feature extraction method based on Chinese word segmentation according to claim 4, characterized in that a weight is assigned to each dictionary word, and the best segmentation is selected by the dual criterion of minimum number of segments and maximum weight product of the segmented words.

6. The protein sequence feature extraction method based on Chinese word segmentation according to claim 4 or 5, characterized in that, for a single amino acid, {frequency}_{1, i} is its number of occurrences and {Freq}_{1} is the maximum frequency among the 20 single amino acids, and the weight of a single amino acid is defined as:

{weight}_{1, i} = \frac{{frequency}_{1, i}}{{Freq}_{1}}, \quad 1 \leq i \leq 20

Similarly, the weight of a k-mer is:

{weight}_{k, i} = \frac{{frequency}_{k, i}}{{Freq}_{k}} \times C^{k - 1}, \quad 1 \leq i \leq N, \quad 1 \leq k \leq \mathit{maxLen}, \quad C \geq 1

7. The protein sequence feature extraction method based on Chinese word segmentation according to claim 4 or 5, characterized in that, for each segmentation with the minimum number of segments, the weight product is defined as:

P_{S, T} = \prod_{w \in W} {weight}_{w}^{n_{w}}

where W is the set of all matched words and n_{w} is the number of times word w is matched when the given sequence S is segmented in manner T.

8. The protein sequence feature extraction method based on Chinese word segmentation according to claim 1, characterized in that counting the frequency of each word in a sequence means that, after the sequence has been segmented into a string of dictionary words, the number of occurrences of each word in the sequence is counted and used as the value of the corresponding dimension of the vector, so that the protein sequence can be expressed as a vector.