Gene function prediction method based on R-SVM and TPR rules
Technical field
The present invention relates to data mining in bioinformatics, and in particular to a method for predicting gene function.
Background technology
A gene is a DNA fragment that carries hereditary information. Genes support the essential structures and functions of life and store all the information about an organism's race, blood type, species, growth, and apoptosis. They are also the internal factors that determine life and health: illness, growth, aging, and death are all related to genes. Clarifying the biological function (Biological Function) of genes is therefore of great significance for understanding biological processes in organisms, analyzing pathogenic mechanisms, developing new drugs, and many other areas.
At present, the functions of many genes in many organisms, such as mouse and human, remain unknown, and a large amount of gene function annotation work remains to be done. How to predict and ultimately determine the biological functions of genes has become a research focus of genomics. Since the gene function prediction problem can be converted into a classification problem in machine learning and data mining, function prediction based on classification is also a current research hotspot.
The main task of gene function prediction is, according to known genes and their functional category information together with a gene function annotation scheme, to predict the functions of genes of unknown function and obtain the functions they may have. In the gene function prediction problem, each gene is regarded as a sample, the functions possessed by a gene are regarded as class labels, and the annotation scheme is the set of all possible class labels. Predicting the function of a gene amounts to classifying the gene according to a given function annotation scheme and obtaining the function class labels it carries, so the gene function prediction problem can be treated as a classification problem. Classification-based gene function prediction methods mainly process gene-related data with various classification algorithms and then output the function class labels of genes of unknown function.
A function of a gene is also called a functional label, a functional category, or a function class label. A gene may have several functions at the same time, i.e., it carries several function class labels simultaneously. These function class labels are not mutually independent: they are related to one another and obey a specific hierarchical structure. Hierarchical relationships generally fall into two kinds, tree (Tree) structures and directed acyclic graph (Directed Acyclic Graph, DAG) structures. For gene function prediction, gene functions are generally classified according to a predefined annotation scheme. Gene function annotation schemes naturally carry a hierarchical structure: annotation according to the FunCat scheme follows a tree structure, while annotation according to the GO scheme follows a directed acyclic graph structure, so this hierarchy is predefined and known. Because of these characteristics, the gene function classification problem does not belong to traditional binary classification but to a more challenging class of problems in machine learning and data mining: hierarchical multi-label classification (Hierarchical Multi-label Classification, HMC).
Problems of the prior art: hierarchical multi-label classification combines the characteristics of multi-label classification (Multi-label Classification) and hierarchical classification (Hierarchical Classification). In such problems, each sample can carry multiple labels, a predefined and known hierarchical relationship exists among the labels, and a sample may carry two or more labels in any one layer at the same time. Classical binary classification algorithms and flat multi-label classification algorithms cannot be applied directly to this problem. Since hierarchical multi-label classification combines the difficulties of both multi-label and hierarchical classification, the resulting data set imbalance problem, prediction depth problem, multi-label problem, hierarchy constraint problem, and prediction consistency problem must all be taken into account. Overcoming these difficulties and designing an effective hierarchical multi-label classification algorithm is an important task. At present there is little domestic research on this problem and few research results have been obtained, so considerable research space remains.
Invention content
The present invention realizes the prediction of gene function and solves the multi-label problem and the hierarchy constraint problem that arise when gene function prediction is realized with classification algorithms.
In the GO annotation scheme, functions are gradually refined from top to bottom, so some genes may not have the function represented by a bottom-level node. For these nodes, the number of samples with the function is small while the number of samples without it is large; this situation is called the data set imbalance problem. Its presence reduces classification accuracy, so when constructing the positive and negative sample sets, a suitable strategy is needed to solve this problem.
Because a gene sample is represented by a multi-dimensional vector, a sample has many attribute values, and processing all of them introduces a large amount of computation. For different function nodes, the attributes related to the function may differ, and some attributes may be irrelevant; processing irrelevant attributes reduces classifier performance. The sample attribute selection problem therefore has to be solved separately for each function node.
Since certain constraint relations exist among the functions in the GO annotation scheme, the results given by the classifiers must also conform to this hierarchical relationship; this is another problem to be solved.
The gene function prediction method based on R-SVM and TPR rules includes the following steps:
Step 1: take genes of known function as training samples to form the training set, and represent each gene as a multi-dimensional vector; each element of the vector is called an attribute. The content of the vector is the digitized representation of actual experimental results, all of which come from standard biological databases.
In the field of machine learning, an attribute is a property or characteristic of a research object; it differs between objects and may change over time. A research object may have several properties or characteristics, so an object may have several different attributes. In practice, an attribute of an object is associated with a numerical or symbolic value according to a certain rule, and this value is called the value of the attribute. The same attribute may take different values for different objects, so each object can be represented by a multi-dimensional vector.
For the research and application background of this method, the research objects are genes, and the attributes of a research object include gene sequence length, molecular weight, the amino acid ratios of the encoded protein, and so on.
Each gene may have several functions; that is, when classifying, a gene is regarded as a sample and each sample may carry several class labels. These class labels are the terms of the GO annotation scheme, i.e., the nodes of the GO annotation scheme. For existing data, a group of genes can be regarded as a group of samples whose functions are known; that is, the class labels of these samples are also known. For an unknown gene sample, the goal is to obtain the function class labels it may have.
Step 2: in a classification problem, for a given class label, a sample that carries the label is called a positive sample, and the set of positive samples is called the positive sample set; a sample without the label is called a negative sample, and the set of negative samples is called the negative sample set. If the number of positive samples is far smaller than the number of negative samples, the problem is called an unbalanced data set problem, a positive/negative sample set imbalance problem, or a sample imbalance problem.
Each node in the GO annotation scheme represents a class label. For each node in the GO annotation scheme, first construct the positive sample set and the negative sample set from the samples in the training set according to the improved siblings policy.
Step 3: for each node in the GO annotation scheme, perform attribute selection on the corresponding data set, selecting the attributes that contribute most when classifying with respect to the function of that node.
Step 4: for each node in the GO annotation scheme, train an R-SVM classifier on the data set of that node, obtaining a group of R-SVM classifiers.
R-SVM uses a threshold adjustment technique to improve the ability of SVM to handle unbalanced data sets. R-SVM does not rely on assumptions about the distribution of the processed data set and does not change that distribution, so it can be used to solve the unbalanced data set problem.
R-SVM first selects a group of potentially best SVM thresholds with the potential best threshold selection (Potential Best Threshold Selection) method, then computes the best threshold with the best threshold estimation (Best Threshold Estimation) method and applies it to the SVM.
Step 5: each node corresponds to one classifier, so all the nodes of the GO annotation scheme yield a group of classifiers. Use the group of R-SVM classifiers obtained in the training stage to classify unknown samples, obtaining a group of preliminary R-SVM classification results.
Step 6: convert this group of R-SVM classification results into posterior probability values with the sigmoid method proposed by Platt.
Step 7: use the weighted TPR ensemble algorithm for directed acyclic graph hierarchies to realize the prediction of gene function, under the premise that the final prediction results satisfy the hierarchy constraints of the directed acyclic graph.
The present invention has the following effects:
The hierarchical multi-label classification method proposed by the invention can be used for gene function prediction under the GO annotation scheme. It realizes the prediction of gene function, gives prediction results for the multiple functions a gene may have, and solves the multi-label problem in gene function prediction.
By using the weighted TPR ensemble algorithm for directed acyclic graph hierarchies, the proposed method solves the problem that the prediction results produced by existing gene function prediction methods may fail to satisfy the hierarchy constraints.
The positive/negative sample set construction method and the R-SVM classifier of the invention solve the data set imbalance problem that arises in gene function prediction under the GO annotation scheme.
The invention accomplishes the important task of gene function prediction, alleviating the current situation in which the massive data produced by high-throughput experiments cannot be processed promptly and effectively, and provides a basis and direction for biological experimental verification, so that biological experiments can be carried out purposefully. It greatly shortens the time needed for gene function annotation, saves the corresponding experimental cost, and is of far-reaching practical significance for research in functional genomics.
Description of the drawings
Fig. 1 is a schematic diagram of potential best threshold selection;
Fig. 2 is part of the GO annotation graph of the biological process ontology in the GO annotation scheme.
Specific implementation mode
Specific implementation mode one:
In the GO annotation scheme, functions are gradually refined from top to bottom, so some genes may not have the function represented by a bottom-level node. For these nodes, the number of samples with the function is small while the number of samples without it is large; this situation is called the data set imbalance problem. Its presence reduces classification accuracy, so when constructing the positive and negative sample sets, a suitable strategy is needed to solve this problem.
Because a gene sample is represented by a multi-dimensional vector, a sample has many attribute values, and processing all of them introduces a large amount of computation. For different function nodes, the attributes related to the function may differ, and some attributes may be irrelevant; processing irrelevant attributes reduces classifier performance. The sample attribute selection problem therefore has to be solved separately for each function node.
Since certain constraint relations exist among the functions in the GO annotation scheme, the results given by the classifiers must also conform to this hierarchical relationship; this is another problem to be solved.
This method is used to predict the functions of genes, where gene function is defined by the GO gene function annotation scheme. The GO annotation scheme gives the functions a gene may have; every function is represented by a term (term). Fig. 2 shows part of the biological process ontology of the GO annotation scheme.
Each node in Fig. 2 is a term, i.e., it represents one function. In the GO structure graph, the annotation of protein function by the terms becomes gradually more detailed from top to bottom; each node represents a term, and the closer a term is to the bottom leaf nodes, the larger its amount of functional information and the more specific its functional interpretation. The GO annotation scheme satisfies the TPR rule: if a term annotates a gene, its parent term and the terms above it also annotate the gene. For example, in the figure, if response to stress (GO:0006950) annotates a gene, its parent term node response to stimulus (GO:0050896) also annotates the gene.
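As an illustration, the upward propagation implied by the TPR rule can be sketched as a closure over the parent relation. The fragment below uses the two terms named above, plus GO:0008150 (biological_process) as an assumed common root; the dictionary shape is illustrative, not part of the patent.

```python
# parents: child term -> list of parent terms.  Hypothetical three-term fragment.
parents = {
    "GO:0006950": ["GO:0050896"],   # response to stress -> response to stimulus
    "GO:0050896": ["GO:0008150"],   # response to stimulus -> biological_process (assumed root)
}

def propagate(terms, parents):
    """Close a set of annotated terms upward under the true-path rule."""
    closed = set(terms)
    stack = list(terms)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in closed:
                closed.add(p)
                stack.append(p)
    return closed

result = propagate({"GO:0006950"}, parents)
# result contains GO:0006950, GO:0050896 and GO:0008150
```

That is, a gene annotated with response to stress is automatically also annotated with response to stimulus and every ancestor up to the root.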
Each gene can be represented by a multi-dimensional numeric vector whose content is the digitized representation of actual experimental results, all of which come from standard biological databases.
Each gene may have several functions; that is, when classifying, a gene is regarded as a sample and each sample may carry several class labels, which are the terms of the GO annotation scheme. For existing data, a group of genes can be regarded as samples whose functions are known; that is, the class labels of these samples are also known. For an unknown gene sample, the goal is to obtain the function class labels it may have.
The gene function prediction method based on R-SVM and TPR rules includes the following steps:
Step 1: take genes of known function as training samples to form the training set, and represent each gene as a multi-dimensional vector; each element of the vector is called an attribute. The content of the vector is the digitized representation of actual experimental results, all of which come from standard biological databases.
In the field of machine learning, an attribute is a property or characteristic of a research object; it differs between objects and may change over time. A research object may have several properties or characteristics, so an object may have several different attributes. In practice, an attribute of an object is associated with a numerical or symbolic value according to a certain rule, and this value is called the value of the attribute. The same attribute may take different values for different objects, so each object can be represented by a multi-dimensional vector.
For the research and application background of this method, the research objects are genes, and the attributes of a research object include gene sequence length, molecular weight, the amino acid ratios of the encoded protein, and so on.
Each gene may have several functions; that is, when classifying, a gene is regarded as a sample and each sample may carry several class labels. These class labels are the terms of the GO annotation scheme, i.e., the nodes of the GO annotation scheme. For existing data, a group of genes can be regarded as a group of samples whose functions are known; that is, the class labels of these samples are also known. For an unknown gene sample, the goal is to obtain the function class labels it may have.
Step 2: in a classification problem, for a given class label, a sample that carries the label is called a positive sample, and the set of positive samples is called the positive sample set; a sample without the label is called a negative sample, and the set of negative samples is called the negative sample set. If the number of positive samples is far smaller than the number of negative samples, the problem is called an unbalanced data set problem, a positive/negative sample set imbalance problem, or a sample imbalance problem.
Each node in the GO annotation scheme represents a class label. For each node in the GO annotation scheme, first construct the positive sample set and the negative sample set from the samples in the training set according to the improved siblings policy.
Step 3: for each node in the GO annotation scheme, perform attribute selection on the corresponding data set, selecting the attributes that contribute most when classifying with respect to the function of that node.
Step 4: for each node in the GO annotation scheme, train an R-SVM classifier on the data set of that node, obtaining a group of R-SVM classifiers.
R-SVM uses a threshold adjustment technique to improve the ability of SVM to handle unbalanced data sets. R-SVM does not rely on assumptions about the distribution of the processed data set and does not change that distribution, so it can be used to solve the unbalanced data set problem.
R-SVM first selects a group of potentially best SVM thresholds with the potential best threshold selection (Potential Best Threshold Selection) method, then computes the best threshold with the best threshold estimation (Best Threshold Estimation) method and applies it to the SVM.
Step 5: each node corresponds to one classifier, so all the nodes of the GO annotation scheme yield a group of classifiers. Use the group of R-SVM classifiers obtained in the training stage to classify unknown samples, obtaining a group of preliminary R-SVM classification results.
Step 6: convert this group of R-SVM classification results into posterior probability values with the sigmoid method proposed by Platt.
Step 7: use the weighted TPR ensemble algorithm for directed acyclic graph hierarchies to realize the prediction of gene function, under the premise that the final prediction results satisfy the hierarchy constraints of the directed acyclic graph.
Specific implementation mode two:
The specific process of constructing the positive sample set and the negative sample set according to the improved siblings policy in step 2 of this embodiment is as follows:
For each node in the GO annotation scheme, within the training set, take the samples belonging to the node as positive samples and the samples belonging to the node's sibling nodes as initial negative samples; at the same time, reject from the initial negative sample set the samples that also belong to the positive sample set, and take the remainder as the final negative sample set, i.e., the negative sample set. If a node has no sibling nodes, trace upward and select the samples belonging to the sibling nodes of its parent node as negative samples.
In symbols:
Tr+(c_j) = *(c_j)
Tr-(c_j) = *(sib(c_j)) \ ( *(sib(c_j)) ∩ *(c_j) )
where Tr denotes the training set containing all samples; node c_j represents the corresponding class label; Tr+(c_j) denotes the positive sample set of node c_j; *(sib(c_j)) ∩ *(c_j) denotes the set of samples that belong both to node c_j and to its sibling nodes, i.e., samples that carry the class label of c_j and of its siblings at the same time; Tr-(c_j) denotes the negative sample set of node c_j; *(c_j) denotes the set of specific samples corresponding to node c_j; sib(·) denotes the sibling nodes of a node, ↑ denotes the parent node, ↓ denotes child nodes, anc(·) denotes ancestor nodes, and desc(·) denotes descendant nodes; \ denotes rejecting certain samples from a sample set.
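As an illustration, the improved siblings policy above can be sketched as follows; the input names (`samples`, `annotations`, `parents`, `children`) are hypothetical, describing which samples carry which node labels and how nodes are linked in the GO DAG.

```python
def siblings(node, parents, children):
    """Nodes sharing at least one parent with `node`, excluding `node` itself."""
    sibs = set()
    for p in parents.get(node, ()):
        sibs |= children.get(p, set())
    sibs.discard(node)
    return sibs

def build_sets(node, samples, annotations, parents, children):
    """Positive/negative sample sets for one GO node (improved siblings policy)."""
    positive = {s for s in samples if node in annotations[s]}
    # Siblings of the node; if there are none, trace upward to the siblings
    # of the parent node(s), and further up if needed.
    level, sibs = {node}, set()
    while level and not sibs:
        sibs = set().union(*(siblings(n, parents, children) for n in level))
        level = {p for n in level for p in parents.get(n, ())}
    initial_negative = {s for s in samples if annotations[s] & sibs}
    # Reject samples that also belong to the positive sample set.
    return positive, initial_negative - positive
```

For a node whose only sibling samples also carry the node's own label, the rejection step leaves them out of the negative set, which is the point of the "improved" policy.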
Other steps and parameters are the same as in specific implementation mode one.
Specific implementation mode three:
The specific process of step 3 of this embodiment is as follows:
Step 3.1:
First, compute the information gain of each attribute using the information gain concept of the C4.5 decision tree algorithm, and at the same time compute the gain ratio of each attribute.
For a given node, let D be the sample set, Gain(R) the information gain of attribute R, and GainRatio(R) the information gain ratio for attribute R. The formulas are:
Gain(R) = Info(D) - Info_R(D)
Info(D) = -Σ_{i=1..m} p_i × log2(p_i)
Info_R(D) = Σ_{j=1..k} (|D_j| / |D|) × Info(D_j)
SplitInfo_R(D) = -Σ_{j=1..k} (|D_j| / |D|) × log2(|D_j| / |D|)
GainRatio(R) = Gain(R) / SplitInfo_R(D)
where p_i is the proportion of samples in the sample set belonging to class i, and m is the number of classes in the sample set; Info(·) is the entropy of the sample set, i.e., the amount of information still needed to separate the different classes of the sample set; k is the number of distinct values of attribute R, and D_j is the subset of samples whose value of attribute R is j; Info_R(·) is the entropy of the sample set with respect to attribute R, i.e., the amount of information still needed to separate the classes after partitioning by attribute R; SplitInfo_R(·) is the split information of attribute R; |·| is the number of samples in a set.
Step 3.2:
For a given node, after the information gain ratio value of each attribute has been obtained, select the attributes that contribute most to the classification result and reject the irrelevant ones; a larger information gain ratio value indicates a larger contribution to the classification result. To choose an appropriate number of sample attributes, so that a large amount of sample information is not lost while a sufficient number of attributes is kept, two conditions are introduced: the minimum information gain ratio value and the minimum attribute number rate value. The specific procedure for selecting the final attribute combination is:
Suppose each sample x_j can be represented by an n-dimensional vector, i.e., it contains n attributes, denoted (a_1, …, a_n). For node i, set the minimum information gain ratio value g_i, 0 < g_i ≤ 1, and the minimum attribute number rate value q_i, 0 < q_i ≤ 1.
First, compute the minimum attribute number Q_i = n × q_i from the minimum attribute number rate value q_i.
Then arrange the attributes from largest to smallest information gain ratio value. Starting from the attribute with the largest information gain ratio value, when the sum of the first several information gain ratio values becomes greater than or equal to the minimum information gain ratio value g_i, check whether the number of these attributes exceeds the minimum attribute number Q_i; if not, continue selecting the attributes with the largest information gain ratio values from the remaining attributes until the number of attributes is greater than or equal to Q_i. The attributes satisfying both conditions are selected, and the remaining attributes are rejected as irrelevant. This procedure keeps the attributes with large information gain ratio values, i.e., it selects the attributes that contribute most to the classification result.
An illustration of step 3.2:
Case 1: suppose n = 10, i.e., there are 10 attributes, and for node i set g_i = 0.95 and q_i = 0.25; then Q_i = 10 × 0.25 = 2.5 ≈ 3.
For node i, let the information gain ratio values of the attributes be {0.4, 0.3, 0.1, 0.1, 0.05, 0.01, 0.01, 0.01, 0.01, 0.01}, which sum to 1. We select the first 5 attributes: the sum of their information gain ratio values is 0.95, which equals g_i, so the minimum information gain ratio requirement is met. At the same time, the number of selected attributes is 5, which exceeds the minimum attribute number Q_i = 3, so the first 5 attributes are selected to represent the sample and the remaining 5 attributes are discarded. After this operation, the sample changes from a 10-dimensional vector to a 5-dimensional vector.
Case 2: again suppose n = 10, i.e., there are 10 attributes, and for node i set g_i = 0.95 and q_i = 0.25; then Q_i = 10 × 0.25 = 2.5 ≈ 3.
For node i, let the information gain ratio values of the attributes be {0.8, 0.15, 0.01, 0.02, 0.01, 0.01, 0, 0, 0, 0}, which sum to 1. We select the first 2 attributes: the sum of their information gain ratio values is 0.95, which meets the minimum information gain ratio requirement. However, the number of selected attributes is 2, which is smaller than the minimum attribute number Q_i = 3, so the first 3 attributes are selected to represent the sample and the remaining 7 attributes are discarded. After this operation, the sample changes from a 10-dimensional vector to a 3-dimensional vector.
Step 3.3:
The process described in steps 3.1 and 3.2 performs attribute selection for one node of the GO annotation scheme; repeat steps 3.1 and 3.2 to perform attribute selection for all nodes of the GO annotation scheme.
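A minimal sketch of steps 3.1 and 3.2, assuming discrete attribute values and the standard C4.5 formulas. `select_attributes` takes precomputed gain ratio values, normalizes them against their sum (the worked examples above assume the values sum to 1), rounds Q_i up as in the examples (2.5 ≈ 3), and uses a small epsilon to guard against floating-point round-off.

```python
import math

def entropy(labels):
    """Info(D): entropy of the class labels of a sample set."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(values, labels):
    """C4.5 gain ratio of one discrete attribute with respect to the labels."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    info_r = sum(len(g) / n * entropy(g) for g in groups.values())            # Info_R(D)
    split = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values()) # SplitInfo_R(D)
    gain = entropy(labels) - info_r                                           # Gain(R)
    return gain / split if split > 0 else 0.0                                 # GainRatio(R)

def select_attributes(ratios, g_min, q_min):
    """Step 3.2: keep attributes by decreasing gain ratio until their share of
    the total reaches g_min AND at least ceil(n * q_min) attributes are kept."""
    n = len(ratios)
    q = math.ceil(n * q_min)                  # minimum attribute number Q_i
    order = sorted(range(n), key=lambda i: ratios[i], reverse=True)
    total = sum(ratios) or 1.0
    kept, acc = [], 0.0
    for i in order:
        kept.append(i)
        acc += ratios[i] / total
        if acc >= g_min - 1e-12 and len(kept) >= q:
            break
    return kept
```

With the gain ratios of case 1 above, `select_attributes([0.4, 0.3, 0.1, 0.1, 0.05, 0.01, 0.01, 0.01, 0.01, 0.01], 0.95, 0.25)` keeps the first five attributes; with case 2 it keeps three, matching the worked examples.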
Other steps and parameters are the same as in specific implementation mode one or two.
Specific implementation mode four:
The specific process of step 4 of this embodiment is as follows:
R-SVM uses a threshold adjustment technique to improve the ability of SVM to handle unbalanced data sets. R-SVM does not rely on assumptions about the distribution of the processed data set and does not change that distribution, so it can be used to solve the unbalanced data set problem.
R-SVM first selects a group of potentially best SVM thresholds with the potential best threshold selection (Potential Best Threshold Selection) method, then computes the best threshold with the best threshold estimation (Best Threshold Estimation) method and applies it to the SVM. The detailed process is as follows:
Step 4.1, select a group of potential best thresholds:
Process the unbalanced data set with a standard SVM, obtain the output values of all samples, and sort them from high to low. Find the adjacent samples whose true labels differ; the thresholds between adjacent samples with different true labels are the potential best thresholds, because the prediction results change when the threshold is adjusted across them. For each node in the GO annotation scheme, this yields a group of potential best thresholds, i.e., the potential best threshold set.
Step 4.2, determine the best threshold estimate:
For each node in the GO annotation scheme, divide the original training set into several training subsets with the Partitioning (PT) method, i.e., divide the training set into several non-overlapping subsets, each of which is then regarded as a training subset. Then, on each training subset, select one best threshold from the potential best threshold set. Finally, average the best thresholds selected on all training subsets to obtain the final threshold θ.
The detailed process of "selecting the best threshold from the potential best threshold set" is as follows: apply each threshold of the potential best threshold set to the training subset, and take the threshold with the best classification result as the best threshold of that training subset.
Step 4.3, correct the result of the SVM with the final threshold θ. For node i, the prediction result for sample x_j is computed as h_i*(x_j) = h_i(x_j) - θ,
where h_i(·) is the classification function given by the SVM of node i and h_i(x_j) is the classification result the SVM gives for sample x_j; h_i*(x_j) is the corrected result, i.e., the result given by R-SVM. If h_i*(x_j) is greater than or equal to 0, x_j is judged to belong to the positive class; if h_i*(x_j) is less than 0, x_j is judged to belong to the negative class.
An illustrative example:
Let X be a training set containing n samples and let the number of sample labels be m, i.e., there are m nodes; X = {x_1, x_2, …, x_n}, and Y = {y_11, y_12, …, y_1m, …, y_n1, y_n2, …, y_nm} are the true class labels corresponding to each sample, i.e., the nodes of the GO annotation scheme. x_j is a sample in the training set and y_ji is the class label of sample x_j for node i: y_ji = 1 means the sample belongs to node i and y_ji = -1 means it does not. Let Θ be the set of all possible SVM thresholds; we want a threshold θ ∈ Θ that makes the classification performance of the SVM best.
The steps by which R-SVM computes the threshold θ are:
A. Taking Fig. 1 as an example, first classify the sample set with an ordinary SVM and obtain the result for each sample. If there are 10 samples, the SVM gives 10 results; the best threshold is assumed to appear where the labels of two adjacent samples disagree. In the example shown there are three such thresholds, which are called the potential best thresholds.
B. Divide the training set X into S parts; the division method is the Partitioning (PT) method, which divides the training set into S non-overlapping training subsets.
C. Verify each potential best threshold on each training subset separately, and select the best threshold for each subset.
D. Average the best thresholds selected for the subsets to obtain the final best threshold. Suppose there are three potential best thresholds {1.1, 1.7, 2.1} and the training set is divided into 5 training subsets with verification results {1.1, 1.1, 1.7, 1.1, 2.1}; then the final threshold is θ = 1.42.
E. Correct the result of the SVM with this threshold: h_i*(x_j) = h_i(x_j) - θ.
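Steps A-E can be sketched as follows, under two stated assumptions: subsets are formed by simple striding (the patent only requires them to be non-overlapping), and "best classification result" is taken to mean the largest number of correct predictions on the subset.

```python
def potential_thresholds(scores, labels):
    """Step A: midpoints between adjacent sorted samples whose true labels differ."""
    ordered = sorted(zip(scores, labels), reverse=True)
    return [(a + b) / 2
            for (a, ya), (b, yb) in zip(ordered, ordered[1:]) if ya != yb]

def best_threshold(scores, labels, cuts):
    """Step C: the candidate threshold giving the most correct predictions."""
    def correct(t):
        return sum((s - t >= 0) == (y == 1) for s, y in zip(scores, labels))
    return max(cuts, key=correct)

def estimate_threshold(scores, labels, n_parts=5):
    """Steps B-D: split into disjoint subsets, pick a best threshold per subset,
    and average them into the final theta."""
    cuts = potential_thresholds(scores, labels)
    parts = [range(i, len(scores), n_parts) for i in range(n_parts)]  # PT split
    best = [best_threshold([scores[j] for j in part],
                           [labels[j] for j in part], cuts) for part in parts]
    return sum(best) / len(best)   # step E then predicts sign(h(x) - theta)
```

With per-subset winners {1.1, 1.1, 1.7, 1.1, 2.1} as in the example above, the average is θ = 1.42.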
Other steps and parameters are the same as in specific implementation mode two or three.
Specific implementation mode five:
The specific implementation process of step 6 of the present embodiment is as follows:
Let X be a training set containing n samples, and let the number of sample labels be m; that is, there are m nodes. X = {x_1, x_2, …, x_n}; Y = {y_11, y_12, …, y_1m, …, y_n1, y_n2, …, y_nm} is the set of true class labels corresponding to each sample, i.e., the nodes in the GO annotation scheme. x_j is a sample in the training set, and y_ji is the class label of sample x_j for node i: y_ji = 1 indicates that the sample belongs to node i, and y_ji = -1 indicates that it does not.
For node i, the output value h_i*(x_j) of that node's SVM for a sample x_j is converted into a probability value p_i(x_j). The formula is:
p_i(x_j) = 1 / (1 + exp(A·h_i*(x_j) + B)),
where A and B are two coefficients for converting the result.
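Assuming the conversion takes Platt's standard sigmoid form p = 1/(1 + exp(A·h + B)), a minimal sketch of the score-to-probability mapping is as follows; the function name and the example coefficient values are illustrative, not part of the embodiment:

```python
import math

def svm_to_prob(h, A, B):
    # Convert a revised SVM output h = h_i*(x_j) into a probability
    # p_i(x_j) = 1 / (1 + exp(A*h + B)).
    return 1.0 / (1.0 + math.exp(A * h + B))
```

With a negative A the probability increases with the SVM margin, so a large positive margin maps to a probability near 1.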
Other steps and parameters are identical to one of specific implementation modes one to four.
Specific implementation mode six:
The solution procedure for the coefficients A and B described in step 6 of the present embodiment is as follows:
For node i, the values of A and B can be obtained by solving the following minimization over the training set:
min_{A,B} −Σ_j [ t_j·log p_i(x_j) + (1 − t_j)·log(1 − p_i(x_j)) ],
where t_j = (N_+ + 1)/(N_+ + 2) if y_ji = 1 and t_j = 1/(N_− + 2) if y_ji = −1; N_+ is the number of samples in the sample set that belong to node i, and N_− is the number of samples in the sample set that do not belong to node i.
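A minimal sketch of solving for A and B, assuming the standard smoothed-target cross-entropy objective of Platt scaling, with the targets t_+ = (N_+ + 1)/(N_+ + 2) and t_− = 1/(N_− + 2). Plain gradient descent is used here for illustration rather than any particular solver; all names and hyperparameters are assumptions.

```python
import math

def fit_platt(scores, labels, iters=5000, lr=0.01):
    """Fit A, B in p = 1/(1+exp(A*s+B)) by gradient descent on the
    cross-entropy against Platt's smoothed targets."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    targets = [t_pos if y == 1 else t_neg for y in labels]
    A, B = 0.0, 0.0
    for _ in range(iters):
        gA = gB = 0.0
        for s, t in zip(scores, targets):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # Per-sample gradient of the negative log likelihood:
            # dL/dA = (t - p) * s,  dL/dB = (t - p).
            gA += (t - p) * s
            gB += (t - p)
        A -= lr * gA
        B -= lr * gB
    return A, B
```

On toy data where positive samples have larger margins, the fitted A is negative, so the sigmoid assigns higher probabilities to larger margins.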
Other steps and parameters are identical to one of specific implementation modes one to five.
Specific implementation mode seven:
The specific implementation process of step 7 of the present embodiment is as follows:
Step 7.1: In a directed acyclic graph structure a node may have multiple parent nodes, so there may be multiple paths from the root node to a given node. For this case we define the level of a node as being determined by the longest path from the root node to that node; how many levels the directed acyclic graph has therefore depends on the node with the longest path in the directed acyclic graph. Let r be the root node of the directed acyclic graph, let node i be any non-root node in the directed acyclic graph, let p(r, i) denote a path from the root node r to node i, and let l(p(r, i)) denote the length of that path. ψ(i) is the function that determines the level of node i, as follows:
ψ(i) = max_{p(r, i)} l(p(r, i))
The level of each node in the GO annotation scheme is obtained from ψ(i): the root node is defined as level 0, followed by level 1, level 2, and so on down to the bottom level of the GO annotation scheme.
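The level function ψ(i) of step 7.1 can be sketched as a longest-path computation over a parent-list representation of the DAG; the adjacency format and names below are illustrative assumptions:

```python
def node_levels(parents, root):
    """Compute psi(i) = length of the LONGEST path from the root to
    node i; the root is level 0. `parents` maps each non-root node to
    the list of its parent nodes (the graph is assumed acyclic)."""
    levels = {root: 0}
    def level(i):
        if i not in levels:
            # Longest root-to-i path = 1 + deepest parent level.
            levels[i] = 1 + max(level(p) for p in parents[i])
        return levels[i]
    for node in parents:
        level(node)
    return levels
```

For example, a node with parents at levels 2 and 0 is placed at level 3, matching the longest-path definition above.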
Step 7.2: For the GO annotation scheme, the prediction result at each node depends on a bottom-up process: the prediction for a sample at a node depends on the base classifier of that node and on those of its child nodes whose prediction result is the positive class. Whether a sample has the function represented by node i therefore depends not only on the result given by that node's classifier but also on the results given by the classifiers of the node's child nodes.
For a sample x_j, processing starts from the nodes at the bottom level of the GO annotation scheme and proceeds upward level by level, computing at each node the combination of the result given by the node's classifier and the results given by its child-node classifiers. The detailed process is: for a node i in the GO annotation scheme, let φ_i denote the set of child nodes of node i whose prediction result is the positive class, and let p̂_i(x_j) denote the classification result for node i after combining the child-node classifier results. Then p̂_i(x_j) is calculated as:
p̂_i(x_j) = ω·p_i(x_j) + ((1 − ω)/|φ_i|)·Σ_{c ∈ φ_i} p̂_c(x_j), with p̂_i(x_j) = p_i(x_j) when φ_i is empty.
Here ω is a weight parameter used to balance the contributions of the base classifier and of the child nodes to the final result; it can be set to 0.5 or adjusted according to the actual situation. Through this step, the positive-class prediction results of the lower levels are passed level by level up to the corresponding upper-level nodes.
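The bottom-up pass of step 7.2 can be sketched as follows, assuming a weighted combination p̂_i = ω·p_i + (1 − ω)·(mean over positively predicted children), with a fallback to the base probability when no child is predicted positive. The node names, traversal-order argument, and fallback rule are illustrative assumptions:

```python
def bottom_up(probs, children, order, omega=0.5):
    """TPR bottom-up pass: combine each node's base classifier
    probability with the average of its positively predicted children.
    `order` lists nodes from the deepest level upward."""
    p_hat = {}
    for i in order:
        pos_kids = [c for c in children.get(i, []) if p_hat[c] >= 0.5]
        if pos_kids:
            child_avg = sum(p_hat[c] for c in pos_kids) / len(pos_kids)
            p_hat[i] = omega * probs[i] + (1 - omega) * child_avg
        else:
            # Assumed fallback: no positive child, keep the base probability.
            p_hat[i] = probs[i]
    return p_hat
```

With ω = 0.5, a node whose base probability is 0.3 but whose only child scores 0.8 is pulled up to 0.55, illustrating how lower-level positive predictions are passed upward.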
Step 7.3: For the GO annotation scheme, the top-down process follows the bottom-up process; its main goal is, after the bottom-up process has completed, to pass the results of upper-level nodes judged to be of the negative class down to the corresponding lower-level nodes. It also proceeds by level-by-level transmission, modifying the predicted value of each level's nodes; the final prediction result is then obtained from the corresponding threshold and the final predicted value of each node. The particular content is:
For a sample x_j, the final calculation result p̄_i(x_j) is:
p̄_i(x_j) = p̂_i(x_j) if i is the root node; otherwise p̄_i(x_j) = min( p̂_i(x_j), min_{k ∈ par(i)} p̄_k(x_j) ),
where par(i) denotes the parent node(s) of node i.
In the bottom-up process the goal is to calculate p̂_i(x_j), i.e., a result combining the child-node results, from the classifier result of each node; the top-down process then calculates the final calculation result p̄_i(x_j) from p̂_i(x_j). p̄_i(x_j) is the probability that the sample belongs to node i, a number greater than or equal to 0 and less than or equal to +1. If p̄_i(x_j) is greater than or equal to 0.5, the sample belongs to the node; if p̄_i(x_j) is less than 0.5, it does not belong to the node.
Step 7.4: For a sample x_j, the final calculation result for node i is p̄_i(x_j). The number of labels in the GO annotation scheme is m, that is to say, there are m nodes; for a sample x_j, the final calculation result is therefore {p̄_1(x_j), …, p̄_m(x_j)}.
Step 7.5: For a sample x_j, if p̄_i(x_j) is greater than or equal to 0.5, the prediction is the positive class, i.e., the sample belongs to node i and carries the class label represented by node i; if p̄_i(x_j) is less than 0.5, the prediction is the negative class, i.e., the sample does not belong to node i and does not carry the class label represented by node i. That is, the final prediction result Y_ji of the class label of sample x_j is expressed as:
Y_ji = 1 if p̄_i(x_j) ≥ 0.5, and Y_ji = −1 if p̄_i(x_j) < 0.5.
Step 7.6: Finally, the nodes in the GO annotation scheme to which a sample x_j belongs, i.e., the class labels that sample x_j has, are obtained. The final prediction result Y_j for all class labels of sample x_j can be expressed as Y_j = {Y_j1, …, Y_ji, …, Y_jm}, which realizes the label prediction for sample x_j, that is, the prediction of gene function.
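The top-down pass and the final thresholding of steps 7.3 to 7.6 can be sketched as follows, under the assumption that a node's final probability is capped by the smallest final probability among its parents, so a negative upper-level result propagates downward. Node names and the traversal-order argument are illustrative:

```python
def top_down(p_hat, parents, order, root):
    """Cap each node's probability by the minimum of its parents'
    final probabilities. `order` lists non-root nodes, upper levels
    first, so every parent is finalized before its children."""
    p_bar = {root: p_hat[root]}
    for i in order:
        p_bar[i] = min(p_hat[i], min(p_bar[k] for k in parents[i]))
    return p_bar

def predict_labels(p_bar):
    """Steps 7.4-7.6: threshold the final probabilities at 0.5 to get
    the class label (+1 belongs to the node, -1 does not)."""
    return {i: (1 if p >= 0.5 else -1) for i, p in p_bar.items()}

# Illustrative run: a negative root (0.4 < 0.5) pulls every descendant
# below the threshold, however confident their own classifiers are.
p_hat = {"r": 0.4, "a": 0.9, "b": 0.55, "c": 0.8}
parents = {"a": ["r"], "b": ["r"], "c": ["b"]}
p_bar = top_down(p_hat, parents, ["a", "b", "c"], "r")
labels = predict_labels(p_bar)
```

The example shows the consistency the top-down step enforces: no node can be predicted positive while an ancestor is predicted negative.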
Other steps and parameters are identical to one of specific implementation modes one to six.
Specific implementation mode eight:
The concrete values of the minimum information-gain-ratio value g_i and the minimum attribute count value Q_i described in step 3.2 of the present embodiment must be set by repeated training, choosing the values that give the highest accuracy. The detailed process is as follows: first, initial values of the minimum information-gain-ratio value g_i and the minimum attribute count value Q_i are selected empirically; steps 4 to 7 are then carried out. After this process is complete, g_i and Q_i are adjusted according to the accuracy of the prediction results, and steps 4 to 7 are repeated. After several repetitions, the concrete values of the two quantities are set according to the case with the highest prediction accuracy.
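The tuning loop of the present embodiment can be sketched as a simple grid search; `run_pipeline` is a hypothetical callable standing in for steps 4 to 7 and returning a prediction accuracy, and the candidate value lists are illustrative:

```python
def tune_g_q(g_values, q_values, run_pipeline):
    """Try every (g_i, Q_i) pair, run the pipeline (steps 4-7) for
    each, and keep the pair with the highest prediction accuracy."""
    best_acc, best_g, best_q = max(
        (run_pipeline(g, q), g, q) for g in g_values for q in q_values
    )
    return best_g, best_q
```

In practice the embodiment adjusts the values over several rounds rather than exhaustively; the grid search above is just the simplest way to express "choose the setting with the highest accuracy".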
Other steps and parameters are identical to one of specific implementation modes one to seven.