Gene function prediction method based on R-SVM and TPR rules
Technical field
The present invention relates to data mining in bioinformatics, and in particular to a method for predicting gene function.
Background technology
A gene is a DNA fragment that carries hereditary information. Genes support the essential structures and functions of life and store all the information about an organism's race, blood type, species, growth, and apoptosis. They are also the internal factors that determine life and health: illness, growth, aging, and death are all related to genes. Clarifying the biological function (Biological Function) of genes is therefore of great significance for understanding biological processes in organisms, analyzing pathogenic mechanisms, developing new drugs, and many other areas.
At present, the functions of many genes in many organisms, such as mouse and human, remain unknown, and a large amount of gene function annotation work remains to be done. How to predict and ultimately determine the biological functions of genes has become a research focus of genomics. Since the gene function prediction problem can be converted into a classification problem in machine learning and data mining, function prediction based on classification is also a current research hotspot.
The main task of gene function prediction is, according to known genes and their functional category information together with a gene function annotation scheme, to predict the functions of genes of unknown function and obtain the functions they may have. In the gene function prediction problem, each gene is regarded as a sample, the functions possessed by a gene are regarded as class labels, and the annotation scheme is the set of all possible class labels. Predicting the function of a gene amounts to classifying the gene according to a given function annotation scheme and obtaining the function class labels it carries, so the gene function prediction problem can be treated as a classification problem. Classification-based gene function prediction methods mainly process gene-related data with various classification algorithms and then output the function class labels of genes of unknown function.
A function of a gene is also called a functional label, a functional category, or a function class label. A gene may have several functions at the same time, i.e., it carries several function class labels simultaneously. These function class labels are not mutually independent: they are related to one another and obey a specific hierarchical structure. Hierarchical relationships generally fall into two kinds, tree (Tree) structures and directed acyclic graph (Directed Acyclic Graph, DAG) structures. For gene function prediction, gene functions are generally classified according to a predefined annotation scheme. Gene function annotation schemes naturally carry a hierarchical structure: annotation according to the FunCat scheme follows a tree structure, while annotation according to the GO scheme follows a directed acyclic graph structure, so this hierarchy is predefined and known. Because of these characteristics, the gene function classification problem does not belong to traditional binary classification but to a more challenging class of problems in machine learning and data mining: hierarchical multi-label classification (Hierarchical Multi-label Classification, HMC).
Problems of the prior art: hierarchical multi-label classification combines the characteristics of multi-label classification (Multi-label Classification) and hierarchical classification (Hierarchical Classification). In such problems, each sample can carry multiple labels, a predefined and known hierarchical relationship exists among the labels, and a sample may carry two or more labels in any one layer at the same time. Classical binary classification algorithms and flat multi-label classification algorithms cannot be applied directly to this problem. Since hierarchical multi-label classification combines the difficulties of both multi-label and hierarchical classification, the resulting data set imbalance problem, prediction depth problem, multi-label problem, hierarchy constraint problem, and prediction consistency problem must all be taken into account. Overcoming these difficulties and designing an effective hierarchical multi-label classification algorithm is an important task. At present there is little domestic research on this problem and few research results have been obtained, so considerable research space remains.
Invention content
The present invention realizes the prediction of gene function and solves the multi-label problem and the hierarchy constraint problem that arise when gene function prediction is realized with classification algorithms.
In the GO annotation scheme, functions are gradually refined from top to bottom, so some genes may not have the function represented by a bottom-level node. For these nodes, the number of samples with the function is small while the number of samples without it is large; this situation is called the data set imbalance problem. Its presence reduces classification accuracy, so when constructing the positive and negative sample sets, a suitable strategy is needed to solve this problem.
Because a gene sample is represented by a multi-dimensional vector, a sample has many attribute values, and processing all of them introduces a large amount of computation. For different function nodes, the attributes related to the function may differ, and some attributes may be irrelevant; processing irrelevant attributes reduces classifier performance. The sample attribute selection problem therefore has to be solved separately for each function node.
Since certain constraint relations exist among the functions in the GO annotation scheme, the results given by the classifiers must also conform to this hierarchical relationship; this is another problem to be solved.
The gene function prediction method based on R-SVM and TPR rules includes the following steps:
Step 1: take genes of known function as training samples to form the training set, and represent each gene as a multi-dimensional vector; each element of the vector is called an attribute. The content of the vector is the digitized representation of actual experimental results, all of which come from standard biological databases.
In the field of machine learning, an attribute is a property or characteristic of a research object; it differs between objects and may change over time. A research object may have several properties or characteristics, so an object may have several different attributes. In practice, an attribute of an object is associated with a numerical or symbolic value according to a certain rule, and this value is called the value of the attribute. The same attribute may take different values for different objects, so each object can be represented by a multi-dimensional vector.
For the research and application background of this method, the research objects are genes, and the attributes of a research object include gene sequence length, molecular weight, the amino acid ratios of the encoded protein, and so on.
Each gene may have several functions; that is, when classifying, a gene is regarded as a sample and each sample may carry several class labels. These class labels are the terms of the GO annotation scheme, i.e., the nodes of the GO annotation scheme. For existing data, a group of genes can be regarded as a group of samples whose functions are known; that is, the class labels of these samples are also known. For an unknown gene sample, the goal is to obtain the function class labels it may have.
Step 2: in a classification problem, for a given class label, a sample that carries the label is called a positive sample, and the set of positive samples is called the positive sample set; a sample without the label is called a negative sample, and the set of negative samples is called the negative sample set. If the number of positive samples is far smaller than the number of negative samples, the problem is called an unbalanced data set problem, a positive/negative sample set imbalance problem, or a sample imbalance problem.
Each node in the GO annotation scheme represents a class label. For each node in the GO annotation scheme, first construct the positive sample set and the negative sample set from the samples in the training set according to the improved siblings policy.
Step 3: for each node in the GO annotation scheme, perform attribute selection on the corresponding data set, selecting the attributes that contribute most when classifying with respect to the function of that node.
Step 4: for each node in the GO annotation scheme, train an R-SVM classifier on the data set of that node, obtaining a group of R-SVM classifiers.
R-SVM uses a threshold adjustment technique to improve the ability of SVM to handle unbalanced data sets. R-SVM does not rely on assumptions about the distribution of the processed data set and does not change that distribution, so it can be used to solve the unbalanced data set problem.
R-SVM first selects a group of potentially best SVM thresholds with the potential best threshold selection (Potential Best Threshold Selection) method, then computes the best threshold with the best threshold estimation (Best Threshold Estimation) method and applies it to the SVM.
Step 5: each node corresponds to one classifier, so all the nodes of the GO annotation scheme yield a group of classifiers. Use the group of R-SVM classifiers obtained in the training stage to classify unknown samples, obtaining a group of preliminary R-SVM classification results.
Step 6: convert this group of R-SVM classification results into posterior probability values with the sigmoid method proposed by Platt.
Step 7: use the weighted TPR ensemble algorithm for directed acyclic graph hierarchies to realize the prediction of gene function, under the premise that the final prediction results satisfy the hierarchy constraints of the directed acyclic graph.
The present invention has the following effects:
The hierarchical multi-label classification method proposed by the invention can be used for gene function prediction under the GO annotation scheme. It realizes the prediction of gene function, gives prediction results for the multiple functions a gene may have, and solves the multi-label problem in gene function prediction.
By using the weighted TPR ensemble algorithm for directed acyclic graph hierarchies, the proposed method solves the problem that the prediction results produced by existing gene function prediction methods may fail to satisfy the hierarchy constraints.
The positive/negative sample set construction method and the R-SVM classifier of the invention solve the data set imbalance problem that arises in gene function prediction under the GO annotation scheme.
The invention accomplishes the important task of gene function prediction, alleviating the current situation in which the massive data produced by high-throughput experiments cannot be processed promptly and effectively, and provides a basis and direction for biological experimental verification, so that biological experiments can be carried out purposefully. It greatly shortens the time needed for gene function annotation, saves the corresponding experimental cost, and is of far-reaching practical significance for research in functional genomics.
Description of the drawings
Fig. 1 is a schematic diagram of potential best threshold selection;
Fig. 2 is part of the GO annotation graph of the biological process ontology in the GO annotation scheme.
Specific implementation mode
Specific implementation mode one:
In the GO annotation scheme, functions are gradually refined from top to bottom, so some genes may not have the function represented by a bottom-level node. For these nodes, the number of samples with the function is small while the number of samples without it is large; this situation is called the data set imbalance problem. Its presence reduces classification accuracy, so when constructing the positive and negative sample sets, a suitable strategy is needed to solve this problem.
Because a gene sample is represented by a multi-dimensional vector, a sample has many attribute values, and processing all of them introduces a large amount of computation. For different function nodes, the attributes related to the function may differ, and some attributes may be irrelevant; processing irrelevant attributes reduces classifier performance. The sample attribute selection problem therefore has to be solved separately for each function node.
Since certain constraint relations exist among the functions in the GO annotation scheme, the results given by the classifiers must also conform to this hierarchical relationship; this is another problem to be solved.
This method is used to predict the functions of genes, where gene function is defined by the GO gene function annotation scheme. The GO annotation scheme gives the functions a gene may have; every function is represented by a term (term). Fig. 2 shows part of the biological process ontology of the GO annotation scheme.
Each node in Fig. 2 is a term, i.e., it represents one function. In the GO structure graph, the annotation of protein function by the terms becomes gradually more detailed from top to bottom; each node represents a term, and the closer a term is to the bottom leaf nodes, the larger its amount of functional information and the more specific its functional interpretation. The GO annotation scheme satisfies the TPR rule: if a term annotates a gene, its parent term and the terms above it also annotate the gene. For example, in the figure, if response to stress (GO:0006950) annotates a gene, its parent term node response to stimulus (GO:0050896) also annotates the gene.
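As an illustration, the upward propagation implied by the TPR rule can be sketched as a closure over the parent relation. The fragment below uses the two terms named above, plus GO:0008150 (biological_process) as an assumed common root; the dictionary shape is illustrative, not part of the patent.

```python
# parents: child term -> list of parent terms.  Hypothetical three-term fragment.
parents = {
    "GO:0006950": ["GO:0050896"],   # response to stress -> response to stimulus
    "GO:0050896": ["GO:0008150"],   # response to stimulus -> biological_process (assumed root)
}

def propagate(terms, parents):
    """Close a set of annotated terms upward under the true-path rule."""
    closed = set(terms)
    stack = list(terms)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in closed:
                closed.add(p)
                stack.append(p)
    return closed

result = propagate({"GO:0006950"}, parents)
# result contains GO:0006950, GO:0050896 and GO:0008150
```

That is, a gene annotated with response to stress is automatically also annotated with response to stimulus and every ancestor up to the root.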
Each gene can be represented by a multi-dimensional numeric vector whose content is the digitized representation of actual experimental results, all of which come from standard biological databases.
Each gene may have several functions; that is, when classifying, a gene is regarded as a sample and each sample may carry several class labels, which are the terms of the GO annotation scheme. For existing data, a group of genes can be regarded as samples whose functions are known; that is, the class labels of these samples are also known. For an unknown gene sample, the goal is to obtain the function class labels it may have.
The gene function prediction method based on R-SVM and TPR rules includes the following steps:
Step 1: take genes of known function as training samples to form the training set, and represent each gene as a multi-dimensional vector; each element of the vector is called an attribute. The content of the vector is the digitized representation of actual experimental results, all of which come from standard biological databases.
In the field of machine learning, an attribute is a property or characteristic of a research object; it differs between objects and may change over time. A research object may have several properties or characteristics, so an object may have several different attributes. In practice, an attribute of an object is associated with a numerical or symbolic value according to a certain rule, and this value is called the value of the attribute. The same attribute may take different values for different objects, so each object can be represented by a multi-dimensional vector.
For the research and application background of this method, the research objects are genes, and the attributes of a research object include gene sequence length, molecular weight, the amino acid ratios of the encoded protein, and so on.
Each gene may have several functions; that is, when classifying, a gene is regarded as a sample and each sample may carry several class labels. These class labels are the terms of the GO annotation scheme, i.e., the nodes of the GO annotation scheme. For existing data, a group of genes can be regarded as a group of samples whose functions are known; that is, the class labels of these samples are also known. For an unknown gene sample, the goal is to obtain the function class labels it may have.
Step 2: in a classification problem, for a given class label, a sample that carries the label is called a positive sample, and the set of positive samples is called the positive sample set; a sample without the label is called a negative sample, and the set of negative samples is called the negative sample set. If the number of positive samples is far smaller than the number of negative samples, the problem is called an unbalanced data set problem, a positive/negative sample set imbalance problem, or a sample imbalance problem.
Each node in the GO annotation scheme represents a class label. For each node in the GO annotation scheme, first construct the positive sample set and the negative sample set from the samples in the training set according to the improved siblings policy.
Step 3: for each node in the GO annotation scheme, perform attribute selection on the corresponding data set, selecting the attributes that contribute most when classifying with respect to the function of that node.
Step 4: for each node in the GO annotation scheme, train an R-SVM classifier on the data set of that node, obtaining a group of R-SVM classifiers.
R-SVM uses a threshold adjustment technique to improve the ability of SVM to handle unbalanced data sets. R-SVM does not rely on assumptions about the distribution of the processed data set and does not change that distribution, so it can be used to solve the unbalanced data set problem.
R-SVM first selects a group of potentially best SVM thresholds with the potential best threshold selection (Potential Best Threshold Selection) method, then computes the best threshold with the best threshold estimation (Best Threshold Estimation) method and applies it to the SVM.
Step 5: each node corresponds to one classifier, so all the nodes of the GO annotation scheme yield a group of classifiers. Use the group of R-SVM classifiers obtained in the training stage to classify unknown samples, obtaining a group of preliminary R-SVM classification results.
Step 6: convert this group of R-SVM classification results into posterior probability values with the sigmoid method proposed by Platt.
Step 7: use the weighted TPR ensemble algorithm for directed acyclic graph hierarchies to realize the prediction of gene function, under the premise that the final prediction results satisfy the hierarchy constraints of the directed acyclic graph.
Specific implementation mode two:
The specific process of constructing the positive sample set and the negative sample set according to the improved siblings policy in step 2 of this embodiment is as follows:
For each node in the GO annotation scheme, within the training set, take the samples belonging to the node as positive samples and the samples belonging to the node's sibling nodes as initial negative samples; at the same time, reject from the initial negative sample set the samples that also belong to the positive sample set, and take the remainder as the final negative sample set, i.e., the negative sample set. If a node has no sibling nodes, trace upward and select the samples belonging to the sibling nodes of its parent node as negative samples.
In symbols:
Tr+(c_j) = *(c_j)
Tr-(c_j) = *(sib(c_j)) \ ( *(sib(c_j)) ∩ *(c_j) )
where Tr denotes the training set containing all samples; node c_j represents the corresponding class label; Tr+(c_j) denotes the positive sample set of node c_j; *(sib(c_j)) ∩ *(c_j) denotes the set of samples that belong both to node c_j and to its sibling nodes, i.e., samples that carry the class label of c_j and of its siblings at the same time; Tr-(c_j) denotes the negative sample set of node c_j; *(c_j) denotes the set of specific samples corresponding to node c_j; sib(·) denotes the sibling nodes of a node, ↑ denotes the parent node, ↓ denotes child nodes, anc(·) denotes ancestor nodes, and desc(·) denotes descendant nodes; \ denotes rejecting certain samples from a sample set.
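As an illustration, the improved siblings policy above can be sketched as follows; the input names (`samples`, `annotations`, `parents`, `children`) are hypothetical, describing which samples carry which node labels and how nodes are linked in the GO DAG.

```python
def siblings(node, parents, children):
    """Nodes sharing at least one parent with `node`, excluding `node` itself."""
    sibs = set()
    for p in parents.get(node, ()):
        sibs |= children.get(p, set())
    sibs.discard(node)
    return sibs

def build_sets(node, samples, annotations, parents, children):
    """Positive/negative sample sets for one GO node (improved siblings policy)."""
    positive = {s for s in samples if node in annotations[s]}
    # Siblings of the node; if there are none, trace upward to the siblings
    # of the parent node(s), and further up if needed.
    level, sibs = {node}, set()
    while level and not sibs:
        sibs = set().union(*(siblings(n, parents, children) for n in level))
        level = {p for n in level for p in parents.get(n, ())}
    initial_negative = {s for s in samples if annotations[s] & sibs}
    # Reject samples that also belong to the positive sample set.
    return positive, initial_negative - positive
```

For a node whose only sibling samples also carry the node's own label, the rejection step leaves them out of the negative set, which is the point of the "improved" policy.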
Other steps and parameters are the same as in specific implementation mode one.
Specific implementation mode three:
The specific process of step 3 of this embodiment is as follows:
Step 3.1:
First, compute the information gain of each attribute using the information gain concept of the C4.5 decision tree algorithm, and at the same time compute the gain ratio of each attribute.
For a given node, let D be the sample set, Gain(R) the information gain of attribute R, and GainRatio(R) the information gain ratio for attribute R. The formulas are:
Gain(R) = Info(D) - Info_R(D)
Info(D) = -Σ_{i=1..m} p_i × log2(p_i)
Info_R(D) = Σ_{j=1..k} (|D_j| / |D|) × Info(D_j)
SplitInfo_R(D) = -Σ_{j=1..k} (|D_j| / |D|) × log2(|D_j| / |D|)
GainRatio(R) = Gain(R) / SplitInfo_R(D)
where p_i is the proportion of samples in the sample set belonging to class i, and m is the number of classes in the sample set; Info(·) is the entropy of the sample set, i.e., the amount of information still needed to separate the different classes of the sample set; k is the number of distinct values of attribute R, and D_j is the subset of samples whose value of attribute R is j; Info_R(·) is the entropy of the sample set with respect to attribute R, i.e., the amount of information still needed to separate the classes after partitioning by attribute R; SplitInfo_R(·) is the split information of attribute R; |·| is the number of samples in a set.
Step 3.2:
For a given node, after the information gain ratio value of each attribute has been obtained, select the attributes that contribute most to the classification result and reject the irrelevant ones; a larger information gain ratio value indicates a larger contribution to the classification result. To choose an appropriate number of sample attributes, so that a large amount of sample information is not lost while a sufficient number of attributes is kept, two conditions are introduced: the minimum information gain ratio value and the minimum attribute number rate value. The specific procedure for selecting the final attribute combination is:
Suppose each sample x_j can be represented by an n-dimensional vector, i.e., it contains n attributes, denoted (a_1, …, a_n). For node i, set the minimum information gain ratio value g_i, 0 < g_i ≤ 1, and the minimum attribute number rate value q_i, 0 < q_i ≤ 1.
First, compute the minimum attribute number Q_i = n × q_i from the minimum attribute number rate value q_i.
Then arrange the attributes from largest to smallest information gain ratio value. Starting from the attribute with the largest information gain ratio value, when the sum of the first several information gain ratio values becomes greater than or equal to the minimum information gain ratio value g_i, check whether the number of these attributes exceeds the minimum attribute number Q_i; if not, continue selecting the attributes with the largest information gain ratio values from the remaining attributes until the number of attributes is greater than or equal to Q_i. The attributes satisfying both conditions are selected, and the remaining attributes are rejected as irrelevant. This procedure keeps the attributes with large information gain ratio values, i.e., it selects the attributes that contribute most to the classification result.
An illustration of step 3.2:
Case 1: suppose n = 10, i.e., there are 10 attributes, and for node i set g_i = 0.95 and q_i = 0.25; then Q_i = 10 × 0.25 = 2.5 ≈ 3.
For node i, let the information gain ratio values of the attributes be {0.4, 0.3, 0.1, 0.1, 0.05, 0.01, 0.01, 0.01, 0.01, 0.01}, which sum to 1. We select the first 5 attributes: the sum of their information gain ratio values is 0.95, which equals g_i, so the minimum information gain ratio requirement is met. At the same time, the number of selected attributes is 5, which exceeds the minimum attribute number Q_i = 3, so the first 5 attributes are selected to represent the sample and the remaining 5 attributes are discarded. After this operation, the sample changes from a 10-dimensional vector to a 5-dimensional vector.
Case 2: again suppose n = 10, i.e., there are 10 attributes, and for node i set g_i = 0.95 and q_i = 0.25; then Q_i = 10 × 0.25 = 2.5 ≈ 3.
For node i, let the information gain ratio values of the attributes be {0.8, 0.15, 0.01, 0.02, 0.01, 0.01, 0, 0, 0, 0}, which sum to 1. We select the first 2 attributes: the sum of their information gain ratio values is 0.95, which meets the minimum information gain ratio requirement. However, the number of selected attributes is 2, which is smaller than the minimum attribute number Q_i = 3, so the first 3 attributes are selected to represent the sample and the remaining 7 attributes are discarded. After this operation, the sample changes from a 10-dimensional vector to a 3-dimensional vector.
Step 3.3:
The process described in steps 3.1 and 3.2 performs attribute selection for one node of the GO annotation scheme; repeat steps 3.1 and 3.2 to perform attribute selection for all nodes of the GO annotation scheme.
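A minimal sketch of steps 3.1 and 3.2, assuming discrete attribute values and the standard C4.5 formulas. `select_attributes` takes precomputed gain ratio values, normalizes them against their sum (the worked examples above assume the values sum to 1), rounds Q_i up as in the examples (2.5 ≈ 3), and uses a small epsilon to guard against floating-point round-off.

```python
import math

def entropy(labels):
    """Info(D): entropy of the class labels of a sample set."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(values, labels):
    """C4.5 gain ratio of one discrete attribute with respect to the labels."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    info_r = sum(len(g) / n * entropy(g) for g in groups.values())            # Info_R(D)
    split = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values()) # SplitInfo_R(D)
    gain = entropy(labels) - info_r                                           # Gain(R)
    return gain / split if split > 0 else 0.0                                 # GainRatio(R)

def select_attributes(ratios, g_min, q_min):
    """Step 3.2: keep attributes by decreasing gain ratio until their share of
    the total reaches g_min AND at least ceil(n * q_min) attributes are kept."""
    n = len(ratios)
    q = math.ceil(n * q_min)                  # minimum attribute number Q_i
    order = sorted(range(n), key=lambda i: ratios[i], reverse=True)
    total = sum(ratios) or 1.0
    kept, acc = [], 0.0
    for i in order:
        kept.append(i)
        acc += ratios[i] / total
        if acc >= g_min - 1e-12 and len(kept) >= q:
            break
    return kept
```

With the gain ratios of case 1 above, `select_attributes([0.4, 0.3, 0.1, 0.1, 0.05, 0.01, 0.01, 0.01, 0.01, 0.01], 0.95, 0.25)` keeps the first five attributes; with case 2 it keeps three, matching the worked examples.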
Other steps and parameters are the same as in specific implementation mode one or two.
Specific implementation mode four:
The specific process of step 4 of this embodiment is as follows:
R-SVM uses a threshold adjustment technique to improve the ability of SVM to handle unbalanced data sets. R-SVM does not rely on assumptions about the distribution of the processed data set and does not change that distribution, so it can be used to solve the unbalanced data set problem.
R-SVM first selects a group of potentially best SVM thresholds with the potential best threshold selection (Potential Best Threshold Selection) method, then computes the best threshold with the best threshold estimation (Best Threshold Estimation) method and applies it to the SVM. The detailed process is as follows:
Step 4.1, select a group of potential best thresholds:
Process the unbalanced data set with a standard SVM, obtain the output values of all samples, and sort them from high to low. Find the adjacent samples whose true labels differ; the thresholds between adjacent samples with different true labels are the potential best thresholds, because the prediction results change when the threshold is adjusted across them. For each node in the GO annotation scheme, this yields a group of potential best thresholds, i.e., the potential best threshold set.
Step 4.2, determine the best threshold estimate:
For each node in the GO annotation scheme, divide the original training set into several training subsets with the Partitioning (PT) method, i.e., divide the training set into several non-overlapping subsets, each of which is then regarded as a training subset. Then, on each training subset, select one best threshold from the potential best threshold set. Finally, average the best thresholds selected on all training subsets to obtain the final threshold θ.
The detailed process of "selecting the best threshold from the potential best threshold set" is as follows: apply each threshold of the potential best threshold set to the training subset, and take the threshold with the best classification result as the best threshold of that training subset.
Step 4.3, correct the result of the SVM with the final threshold θ. For node i, the prediction result for sample x_j is computed as h_i*(x_j) = h_i(x_j) - θ,
where h_i(·) is the classification function given by the SVM of node i and h_i(x_j) is the classification result the SVM gives for sample x_j; h_i*(x_j) is the corrected result, i.e., the result given by R-SVM. If h_i*(x_j) is greater than or equal to 0, x_j is judged to belong to the positive class; if h_i*(x_j) is less than 0, x_j is judged to belong to the negative class.
An illustrative example:
Let X be a training set containing n samples and let the number of sample labels be m, i.e., there are m nodes; X = {x_1, x_2, …, x_n}, and Y = {y_11, y_12, …, y_1m, …, y_n1, y_n2, …, y_nm} are the true class labels corresponding to each sample, i.e., the nodes of the GO annotation scheme. x_j is a sample in the training set and y_ji is the class label of sample x_j for node i: y_ji = 1 means the sample belongs to node i and y_ji = -1 means it does not. Let Θ be the set of all possible SVM thresholds; we want a threshold θ ∈ Θ that makes the classification performance of the SVM best.
The steps by which R-SVM computes the threshold θ are:
A. Taking Fig. 1 as an example, first classify the sample set with an ordinary SVM and obtain the result for each sample. If there are 10 samples, the SVM gives 10 results; the best threshold is assumed to appear where the labels of two adjacent samples disagree. In the example shown there are three such thresholds, which are called the potential best thresholds.
B. Divide the training set X into S parts; the division method is the Partitioning (PT) method, which divides the training set into S non-overlapping training subsets.
C. Verify each potential best threshold on each training subset separately, and select the best threshold for each subset.
D. Average the best thresholds selected for the subsets to obtain the final best threshold. Suppose there are three potential best thresholds {1.1, 1.7, 2.1} and the training set is divided into 5 training subsets with verification results {1.1, 1.1, 1.7, 1.1, 2.1}; then the final threshold is θ = 1.42.
E. Correct the result of the SVM with this threshold: h_i*(x_j) = h_i(x_j) - θ.
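Steps A-E can be sketched as follows, under two stated assumptions: subsets are formed by simple striding (the patent only requires them to be non-overlapping), and "best classification result" is taken to mean the largest number of correct predictions on the subset.

```python
def potential_thresholds(scores, labels):
    """Step A: midpoints between adjacent sorted samples whose true labels differ."""
    ordered = sorted(zip(scores, labels), reverse=True)
    return [(a + b) / 2
            for (a, ya), (b, yb) in zip(ordered, ordered[1:]) if ya != yb]

def best_threshold(scores, labels, cuts):
    """Step C: the candidate threshold giving the most correct predictions."""
    def correct(t):
        return sum((s - t >= 0) == (y == 1) for s, y in zip(scores, labels))
    return max(cuts, key=correct)

def estimate_threshold(scores, labels, n_parts=5):
    """Steps B-D: split into disjoint subsets, pick a best threshold per subset,
    and average them into the final theta."""
    cuts = potential_thresholds(scores, labels)
    parts = [range(i, len(scores), n_parts) for i in range(n_parts)]  # PT split
    best = [best_threshold([scores[j] for j in part],
                           [labels[j] for j in part], cuts) for part in parts]
    return sum(best) / len(best)   # step E then predicts sign(h(x) - theta)
```

With per-subset winners {1.1, 1.1, 1.7, 1.1, 2.1} as in the example above, the average is θ = 1.42.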
Other steps and parameters are the same as in specific implementation mode two or three.
Specific implementation mode five:
The specific implementation process of step 6 of the present embodiment is as follows:
Let X be a training set containing n samples, and let the number of sample labels be m; that is, there are m nodes. X = {x_1, x_2, …, x_n}; Y = {y_11, y_12, …, y_1m, …, y_n1, y_n2, …, y_nm} is the set of true class labels corresponding to each sample, i.e., the nodes in the GO annotation scheme. x_j is a sample in the training set, and y_ji is the class label of sample x_j for node i: y_ji = 1 indicates that the sample belongs to node i, and y_ji = -1 indicates that it does not.
For node i, the output value h_i*(x_j) of that node's SVM for a sample x_j is converted into a probability value p_i(x_j). The formula is:
p_i(x_j) = 1 / (1 + exp(A·h_i*(x_j) + B)),
where A and B are two coefficients for converting the result.
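Assuming the conversion takes Platt's standard sigmoid form p = 1/(1 + exp(A·h + B)), a minimal sketch of the score-to-probability mapping is as follows; the function name and the example coefficient values are illustrative, not part of the embodiment:

```python
import math

def svm_to_prob(h, A, B):
    # Convert a revised SVM output h = h_i*(x_j) into a probability
    # p_i(x_j) = 1 / (1 + exp(A*h + B)).
    return 1.0 / (1.0 + math.exp(A * h + B))
```

With a negative A the probability increases with the SVM margin, so a large positive margin maps to a probability near 1.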
Other steps and parameters are identical to one of specific implementation modes one to four.
Specific implementation mode six:
The solution procedure for the coefficients A and B described in step 6 of the present embodiment is as follows:
For node i, the values of A and B can be obtained by solving the following minimization over the training set:
min_{A,B} −Σ_j [ t_j·log p_i(x_j) + (1 − t_j)·log(1 − p_i(x_j)) ],
where t_j = (N_+ + 1)/(N_+ + 2) if y_ji = 1 and t_j = 1/(N_− + 2) if y_ji = −1; N_+ is the number of samples in the sample set that belong to node i, and N_− is the number of samples in the sample set that do not belong to node i.
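A minimal sketch of solving for A and B, assuming the standard smoothed-target cross-entropy objective of Platt scaling, with the targets t_+ = (N_+ + 1)/(N_+ + 2) and t_− = 1/(N_− + 2). Plain gradient descent is used here for illustration rather than any particular solver; all names and hyperparameters are assumptions.

```python
import math

def fit_platt(scores, labels, iters=5000, lr=0.01):
    """Fit A, B in p = 1/(1+exp(A*s+B)) by gradient descent on the
    cross-entropy against Platt's smoothed targets."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    targets = [t_pos if y == 1 else t_neg for y in labels]
    A, B = 0.0, 0.0
    for _ in range(iters):
        gA = gB = 0.0
        for s, t in zip(scores, targets):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # Per-sample gradient of the negative log likelihood:
            # dL/dA = (t - p) * s,  dL/dB = (t - p).
            gA += (t - p) * s
            gB += (t - p)
        A -= lr * gA
        B -= lr * gB
    return A, B
```

On toy data where positive samples have larger margins, the fitted A is negative, so the sigmoid assigns higher probabilities to larger margins.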
Other steps and parameters are identical to one of specific implementation modes one to five.
Specific implementation mode seven:
The specific implementation process of step 7 of the present embodiment is as follows:
Step 7.1: In a directed acyclic graph structure a node may have multiple parent nodes, so there may be multiple paths from the root node to a given node. For this case we define the level of a node as being determined by the longest path from the root node to that node; how many levels the directed acyclic graph has therefore depends on the node with the longest path in the directed acyclic graph. Let r be the root node of the directed acyclic graph, let node i be any non-root node in the directed acyclic graph, let p(r, i) denote a path from the root node r to node i, and let l(p(r, i)) denote the length of that path. ψ(i) is the function that determines the level of node i, as follows:
ψ(i) = max_{p(r, i)} l(p(r, i))
The level of each node in the GO annotation scheme is obtained from ψ(i): the root node is defined as level 0, followed by level 1, level 2, and so on down to the bottom level of the GO annotation scheme.
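The level function ψ(i) of step 7.1 can be sketched as a longest-path computation over a parent-list representation of the DAG; the adjacency format and names below are illustrative assumptions:

```python
def node_levels(parents, root):
    """Compute psi(i) = length of the LONGEST path from the root to
    node i; the root is level 0. `parents` maps each non-root node to
    the list of its parent nodes (the graph is assumed acyclic)."""
    levels = {root: 0}
    def level(i):
        if i not in levels:
            # Longest root-to-i path = 1 + deepest parent level.
            levels[i] = 1 + max(level(p) for p in parents[i])
        return levels[i]
    for node in parents:
        level(node)
    return levels
```

For example, a node with parents at levels 2 and 0 is placed at level 3, matching the longest-path definition above.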
Step 7.2: For the GO annotation scheme, the prediction result at each node depends on a bottom-up process: the prediction for a sample at a node depends on the base classifier of that node and on those of its child nodes whose prediction result is the positive class. Whether a sample has the function represented by node i therefore depends not only on the result given by that node's classifier but also on the results given by the classifiers of the node's child nodes.
For a sample x_j, processing starts from the nodes at the bottom level of the GO annotation scheme and proceeds upward level by level, computing at each node the combination of the result given by the node's classifier and the results given by its child-node classifiers. The detailed process is: for a node i in the GO annotation scheme, let φ_i denote the set of child nodes of node i whose prediction result is the positive class, and let p̂_i(x_j) denote the classification result for node i after combining the child-node classifier results. Then p̂_i(x_j) is calculated as:
p̂_i(x_j) = ω·p_i(x_j) + ((1 − ω)/|φ_i|)·Σ_{c ∈ φ_i} p̂_c(x_j), with p̂_i(x_j) = p_i(x_j) when φ_i is empty.
Here ω is a weight parameter used to balance the contributions of the base classifier and of the child nodes to the final result; it can be set to 0.5 or adjusted according to the actual situation. Through this step, the positive-class prediction results of the lower levels are passed level by level up to the corresponding upper-level nodes.
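The bottom-up pass of step 7.2 can be sketched as follows, assuming a weighted combination p̂_i = ω·p_i + (1 − ω)·(mean over positively predicted children), with a fallback to the base probability when no child is predicted positive. The node names, traversal-order argument, and fallback rule are illustrative assumptions:

```python
def bottom_up(probs, children, order, omega=0.5):
    """TPR bottom-up pass: combine each node's base classifier
    probability with the average of its positively predicted children.
    `order` lists nodes from the deepest level upward."""
    p_hat = {}
    for i in order:
        pos_kids = [c for c in children.get(i, []) if p_hat[c] >= 0.5]
        if pos_kids:
            child_avg = sum(p_hat[c] for c in pos_kids) / len(pos_kids)
            p_hat[i] = omega * probs[i] + (1 - omega) * child_avg
        else:
            # Assumed fallback: no positive child, keep the base probability.
            p_hat[i] = probs[i]
    return p_hat
```

With ω = 0.5, a node whose base probability is 0.3 but whose only child scores 0.8 is pulled up to 0.55, illustrating how lower-level positive predictions are passed upward.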
Step 7.3: For the GO annotation scheme, the top-down process follows the bottom-up process; its main goal is, after the bottom-up process has completed, to pass the results of upper-level nodes judged to be of the negative class down to the corresponding lower-level nodes. It also proceeds by level-by-level transmission, modifying the predicted value of each level's nodes; the final prediction result is then obtained from the corresponding threshold and the final predicted value of each node. The particular content is:
For a sample x_j, the final calculation result p̄_i(x_j) is:
p̄_i(x_j) = p̂_i(x_j) if i is the root node; otherwise p̄_i(x_j) = min( p̂_i(x_j), min_{k ∈ par(i)} p̄_k(x_j) ),
where par(i) denotes the parent node(s) of node i.
In the bottom-up process the goal is to calculate p̂_i(x_j), i.e., a result combining the child-node results, from the classifier result of each node; the top-down process then calculates the final calculation result p̄_i(x_j) from p̂_i(x_j). p̄_i(x_j) is the probability that the sample belongs to node i, a number greater than or equal to 0 and less than or equal to +1. If p̄_i(x_j) is greater than or equal to 0.5, the sample belongs to the node; if p̄_i(x_j) is less than 0.5, it does not belong to the node.
Step 7.4: For a sample x_j, the final calculation result for node i is p̄_i(x_j). The number of labels in the GO annotation scheme is m, that is to say, there are m nodes; for a sample x_j, the final calculation result is therefore {p̄_1(x_j), …, p̄_m(x_j)}.
Step 7.5: For a sample x_j, if p̄_i(x_j) is greater than or equal to 0.5, the prediction is the positive class, i.e., the sample belongs to node i and carries the class label represented by node i; if p̄_i(x_j) is less than 0.5, the prediction is the negative class, i.e., the sample does not belong to node i and does not carry the class label represented by node i. That is, the final prediction result Y_ji of the class label of sample x_j is expressed as:
Y_ji = 1 if p̄_i(x_j) ≥ 0.5, and Y_ji = −1 if p̄_i(x_j) < 0.5.
Step 7.6: Finally, the nodes in the GO annotation scheme to which a sample x_j belongs, i.e., the class labels that sample x_j has, are obtained. The final prediction result Y_j for all class labels of sample x_j can be expressed as Y_j = {Y_j1, …, Y_ji, …, Y_jm}, which realizes the label prediction for sample x_j, that is, the prediction of gene function.
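The top-down pass and the final thresholding of steps 7.3 to 7.6 can be sketched as follows, under the assumption that a node's final probability is capped by the smallest final probability among its parents, so a negative upper-level result propagates downward. Node names and the traversal-order argument are illustrative:

```python
def top_down(p_hat, parents, order, root):
    """Cap each node's probability by the minimum of its parents'
    final probabilities. `order` lists non-root nodes, upper levels
    first, so every parent is finalized before its children."""
    p_bar = {root: p_hat[root]}
    for i in order:
        p_bar[i] = min(p_hat[i], min(p_bar[k] for k in parents[i]))
    return p_bar

def predict_labels(p_bar):
    """Steps 7.4-7.6: threshold the final probabilities at 0.5 to get
    the class label (+1 belongs to the node, -1 does not)."""
    return {i: (1 if p >= 0.5 else -1) for i, p in p_bar.items()}

# Illustrative run: a negative root (0.4 < 0.5) pulls every descendant
# below the threshold, however confident their own classifiers are.
p_hat = {"r": 0.4, "a": 0.9, "b": 0.55, "c": 0.8}
parents = {"a": ["r"], "b": ["r"], "c": ["b"]}
p_bar = top_down(p_hat, parents, ["a", "b", "c"], "r")
labels = predict_labels(p_bar)
```

The example shows the consistency the top-down step enforces: no node can be predicted positive while an ancestor is predicted negative.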
Other steps and parameters are identical to one of specific implementation modes one to six.
Specific implementation mode eight:
The concrete values of the minimum information-gain-ratio value g_i and the minimum attribute count value Q_i described in step 3.2 of the present embodiment must be set by repeated training, choosing the values that give the highest accuracy. The detailed process is as follows: first, initial values of the minimum information-gain-ratio value g_i and the minimum attribute count value Q_i are selected empirically; steps 4 to 7 are then carried out. After this process is complete, g_i and Q_i are adjusted according to the accuracy of the prediction results, and steps 4 to 7 are repeated. After several repetitions, the concrete values of the two quantities are set according to the case with the highest prediction accuracy.
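The tuning loop of the present embodiment can be sketched as a simple grid search; `run_pipeline` is a hypothetical callable standing in for steps 4 to 7 and returning a prediction accuracy, and the candidate value lists are illustrative:

```python
def tune_g_q(g_values, q_values, run_pipeline):
    """Try every (g_i, Q_i) pair, run the pipeline (steps 4-7) for
    each, and keep the pair with the highest prediction accuracy."""
    best_acc, best_g, best_q = max(
        (run_pipeline(g, q), g, q) for g in g_values for q in q_values
    )
    return best_g, best_q
```

In practice the embodiment adjusts the values over several rounds rather than exhaustively; the grid search above is just the simplest way to express "choose the setting with the highest accuracy".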
Other steps and parameters are identical to one of specific implementation modes one to seven.