CN104820687A - Construction method of directed link type classifier and classification method - Google Patents


Info

Publication number
CN104820687A
CN104820687A
Authority
CN
China
Prior art keywords
iteration
classifier
condition
weak classifier
cut
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510192537.7A
Other languages
Chinese (zh)
Inventor
张晓宇
侯子骄
王树鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201510192537.7A priority Critical patent/CN104820687A/en
Publication of CN104820687A publication Critical patent/CN104820687A/en
Pending legal-status Critical Current

Abstract

The invention discloses a construction method for a directed-link classifier and a corresponding classification method. The construction method comprises the following steps: 1) initialize the weight distribution of a labelled training data set T, an increment set, and an iteration cut-off condition; 2) for the m-th iteration, train a weak classifier Gm(x) on the labelled training set T(m), and update the weight distribution of T(m) using the classification error rate and the coefficient of the current Gm(x); predict an unlabeled set U with the current Gm(x), then select from the prediction results the K samples with the highest confidence, together with their predicted labels, and put them into (or use them to update) the increment set; 3) when the iteration cut-off condition is satisfied, stop iterating and construct a strong classifier G(x) from the weak classifiers obtained in each iteration. Through the shared transmission and cooperative guidance of valuable knowledge, the construction method fully mines and exploits both labelled and unlabeled samples, achieving effective utilization and fused enhancement of model information.

Description

Construction method of a directed-link classifier and classification method
Technical field
The present invention relates to a construction method for a directed-link classifier and a corresponding classification method, belonging to the field of computer software technology.
Background technology
In the field of intelligent information analysis, many typical applications can be formulated as classification problems, such as malicious code identification and intrusion detection. Traditional classification approaches either depend heavily on human judgement or rely on simple, direct empirical rules; both their effectiveness and their efficiency need improvement. Against this background, intelligent, automated classification is regarded as an effective solution, and the choice of classifier is a crucial link. Boosting algorithms, owing to advantages such as structural simplicity and proven performance gains, have become widely used; among them, AdaBoost (Adaptive Boosting) is the most representative.
From the perspective of machine learning, traditional automatic classification belongs to supervised learning: such methods build the classification model entirely from labelled samples used as the training set. Its counterpart is unsupervised learning, i.e. mining the structural information implicit in the data from unlabeled samples alone. Supervised learning depends strongly on the size of the labelled sample set: the more labelled samples, the more reliable the classification model. In many practical classification problems, however, because labelling is costly in both human effort and time, the large quantity of fully labelled samples needed for model training often cannot be obtained; typically only a small fraction of the samples are labelled, while the vast majority remain unlabeled. Therefore, even an effective classifier such as AdaBoost struggles to accurately capture and reveal the true classification model when training samples are scarce.
Defects of the prior art
In the traditional AdaBoost construction method, the weak classifiers are combined into a strong classifier only through the weights obtained from their training error rates; there is no direct connection between the weak classifiers themselves. If each weak classifier is regarded as a node in a graph model, then in traditional AdaBoost there are no edges linking these nodes; in other words, the nodes are relatively isolated. From the viewpoint of information flow, there is no information interaction between weak classifiers, so the knowledge learned by earlier weak classifiers cannot directly guide the construction of subsequent ones, and valuable information is wasted.
Summary of the invention
The object of the present invention is to provide a directed-link classifier construction method and classification method that design directed link information paths between weak classifiers, realizing the shared transmission and cooperative guidance of model knowledge. With this method, the limited labelled samples can be fully exploited to obtain better classification results, providing an effective solution for data classification scenarios in which labelled samples are costly to obtain and few in number while unlabeled samples are abundant and ubiquitous.
Addressing the limitations of the traditional AdaBoost framework, the present invention designs a cooperative guidance framework among weak classifiers and proposes a directed-link AdaBoost classifier construction method. The method establishes directed link information paths between weak classifiers; through the shared transmission and cooperative guidance of valuable knowledge, it fully mines and exploits both labelled and unlabeled samples, achieving effective utilization and fused enhancement of model information.
The core idea of the directed-link AdaBoost construction method is to use the previously trained weak classifier to classify the unlabeled set and recommend the samples with the highest prediction confidence to subsequent weak classifiers. On the one hand, this passes highly reliable information on to subsequent weak classifiers and guides their construction; on the other hand, sharing valuable information effectively "expands" the training set, so that overall classification performance can be improved while making full use of the limited training data. Specifically, in each iteration of the directed-link AdaBoost method, the trained weak classifier G_m(x) is applied not only to the labelled set to obtain the weight coefficient, but also to the unlabeled set to select the K samples with the highest prediction confidence; these samples, together with their predicted labels, form an incremental training set ΔT_m that is recommended to subsequent weak classifiers, guiding their construction in a targeted way while expanding the existing training set. The flow of the directed-link AdaBoost method is shown in Fig. 1.
According to the recommendation scope of the incremental training set, the directed-link AdaBoost method can be further divided into two modes: "modern" and "accumulation". For the sake of clarity, this document uses x_i to denote a sample's input feature vector and y_i ∈ {-1, +1} its corresponding class label; according to whether samples are labelled, the sample set X is divided into a labelled set L and an unlabeled set U, and the samples in L together with their labels form the training set T for model learning.
Modern mode: under this mode, the current incremental training set is recommended only to the next weak classifier, so information interaction exists only between adjacent weak classifiers (as shown in Fig. 2). Let T^(m) denote the expanded training set used to build the weak classifier G_m(x), and ΔT_m the incremental training set generated by G_m(x); formally:
T^(1) = T,  T^(m+1) = T + ΔT_m    (1)
Accumulation mode: under this mode, the current incremental training set is recommended to all subsequent weak classifiers, so each weak classifier receives guidance information from all of the weak classifiers before it (as shown in Fig. 3). The corresponding formulation is:
T^(1) = T,  T^(m+1) = T^(m) + ΔT_m = T + Σ_{j=1}^{m} ΔT_j    (2)
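As a concrete illustration, the two recommendation modes of Eqs. (1) and (2) can be sketched in Python as follows. The function and variable names are illustrative (not taken from the patent), and increment sets are represented simply as lists of (sample, label) pairs.

```python
# Sketch of the two increment-recommendation modes, Eqs. (1) and (2).
# T is the labelled training set; deltas[j] holds the increment set
# ΔT_{j+1} produced by weak classifier G_{j+1}. Names are illustrative.

def training_set_modern(T, deltas, m):
    """T^(m) in modern mode: T plus only the previous increment ΔT_{m-1}."""
    if m <= 1:
        return list(T)                        # T^(1) = T
    return list(T) + list(deltas[m - 2])      # deltas[m-2] is ΔT_{m-1}

def training_set_cumulative(T, deltas, m):
    """T^(m) in accumulation mode: T plus every increment ΔT_1 .. ΔT_{m-1}."""
    out = list(T)
    for j in range(m - 1):
        out += list(deltas[j])
    return out
```

In modern mode the expanded set stays small (at most |T| + K samples), while in accumulation mode it grows by K pseudo-labelled samples per iteration.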
Finally, after a series of weak classifiers has been constructed using the expanded training sets, the directed-link AdaBoost method blends them organically into a strong classifier. Concretely, the implementation steps of the directed-link AdaBoost method are as follows:
Step 1: input the training set T = {(x_i, y_i)}, 1 ≤ i ≤ N, where the class labels y_i ∈ {-1, +1}, and the unlabeled set U.
Step 2: initialize the weight distribution of the training data set T and the increment set ΔT_0:
ω_{1,i} = 1/N,  i = 1, 2, ..., N
ΔT_0 = {}
Step 3: for m from 1 to M, loop over the following steps:
Step 3.1: build the training set T^(m):
T^(m) = T + ΔT_{m-1}    (modern mode)
or
T^(m) = T + Σ_{j=0}^{m-1} ΔT_j    (accumulation mode)
Step 3.2: train the weak classifier G_m(x) on T^(m):
T^(m) ~ G_m(x): X → {-1, +1}
Step 3.3: compute the classification error rate of G_m(x):
err_m = Σ_{i=1}^{N} ω_{m,i} · I(G_m(x_i) ≠ y_i)
where I(·) is the indicator function.
Step 3.4: compute the coefficient of G_m(x):
α_m = (1/2) · log((1 − err_m) / err_m)
Step 3.5: update the training-data weight distribution:
ω_{m+1,i} = (ω_{m,i} / Z_m) · exp(−α_m · y_i · G_m(x_i)),  i = 1, 2, ..., N
where Z_m is the normalization factor:
Z_m = Σ_{i=1}^{N} ω_{m,i} · exp(−α_m · y_i · G_m(x_i))
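Step 3.5 can be written as a minimal NumPy sketch (the helper name `update_weights` is illustrative):

```python
# Sketch of Step 3.5: the AdaBoost weight update with normalization.
import numpy as np

def update_weights(w, alpha, y, pred):
    """w: current weights ω_m; y, pred: true labels and G_m(x_i) in {-1,+1}."""
    unnorm = w * np.exp(-alpha * y * pred)  # ω_{m,i} · exp(−α_m·y_i·G_m(x_i))
    return unnorm / unnorm.sum()            # dividing by Z_m keeps Σω = 1
```

Misclassified samples (where y_i · G_m(x_i) = −1) have their weight multiplied by exp(α_m) before normalization, so the next weak classifier focuses on them.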
Step 3.6: predict the unlabeled set U with G_m(x), select the K samples with the highest confidence together with their predicted labels, and build ΔT_m:
ΔT_m = top-K_{x ∈ U} Cert(G_m(x))
where Cert(·) denotes the prediction confidence of G_m(x) on a sample.
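Step 3.6 can be sketched as follows, assuming the weak classifier exposes scikit-learn-style `predict_proba` and `classes_` attributes. This is an assumption for illustration: the patent leaves the confidence measure Cert(·) abstract, and the top class probability is used here as a stand-in.

```python
# Sketch of Step 3.6: select the K most confidently predicted samples of U.
import numpy as np

def top_k_increment(clf, U, k):
    """Return the k most confident (sample, predicted label) pairs from U."""
    proba = clf.predict_proba(U)
    conf = proba.max(axis=1)                  # Cert(G_m(x)): top class prob
    order = np.argsort(conf)[::-1][:k]        # indices, most confident first
    labels = clf.classes_[proba.argmax(axis=1)]
    return [(U[i], labels[i]) for i in order]
```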
Step 4: after the loop completes, weight each weak classifier by the coefficient obtained in Step 3.4 and superimpose them to output the strong classifier G(x):
G(x) = sign(Σ_{m=1}^{M} α_m · G_m(x))
Step 5: predict the unlabeled set U with the strong classifier G(x) and output the prediction results.
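The complete loop of Steps 1-5 (in "modern" mode) can be sketched end-to-end. Everything below is an illustrative assumption rather than the patent's prescription: the weak learner is a weighted decision stump, the confidence Cert(·) is taken as the distance to the stump's threshold, and pseudo-labelled increment samples receive weight 1/N.

```python
# End-to-end sketch of the directed-link AdaBoost loop (modern mode).
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump; returns (predict, confidence) functions.
    Confidence = distance to the split threshold (illustrative Cert)."""
    best = (np.inf, 0, 0.0, 1)               # (error, feature, threshold, sign)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for s in (1, -1):
                pred = np.where(X[:, f] <= t, -s, s)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, t, s)
    _, f, t, s = best
    return (lambda Xq: np.where(Xq[:, f] <= t, -s, s),
            lambda Xq: np.abs(Xq[:, f] - t))

def directed_link_adaboost(X, y, U, M=3, K=1):
    N = len(X)
    w = np.full(N, 1.0 / N)                          # Step 2: uniform weights
    inc_X, inc_y = np.empty((0, X.shape[1])), np.empty(0)
    learners, alphas = [], []
    for m in range(M):                               # Step 3
        Xm = np.vstack([X, inc_X])                   # 3.1: T^(m) = T + ΔT_{m-1}
        ym = np.concatenate([y, inc_y])
        wm = np.concatenate([w, np.full(len(inc_y), 1.0 / N)])
        g, cert = fit_stump(Xm, ym, wm / wm.sum())   # 3.2: train weak learner
        pred = g(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)   # 3.3
        a = 0.5 * np.log((1 - err) / err)            # 3.4: coefficient α_m
        w = w * np.exp(-a * y * pred); w /= w.sum()  # 3.5: weight update
        idx = np.argsort(cert(U))[::-1][:K]          # 3.6: top-K confident
        inc_X, inc_y = U[idx], g(U)[idx]
        learners.append(g); alphas.append(a)
    return lambda Xq: np.sign(                       # Step 4: weighted vote
        sum(a * g(Xq) for a, g in zip(alphas, learners)))
```

Each iteration first expands the training set with the previous increment, then trains, reweights, and finally recommends a fresh increment to the next weak classifier, exactly mirroring the directed links of Fig. 2.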
Compared with the prior art, the present invention has the following beneficial effects:
The present invention designs a cooperative guidance framework and, according to the recommendation scope of the incremental training set, two information-recommendation modes: modern and accumulation. The directed-link AdaBoost classification method provided by the invention has the following advantages:
First, aimed at the common situation in large-scale databases where labelled samples are costly to obtain and limited in number while unlabeled samples are abundant and easy to obtain, the invention builds, on the basis of the traditional AdaBoost framework, a cooperative guidance framework that constructs information paths between weak classifiers that previously had no direct connection; through the transmission of highly reliable, valuable information it improves the efficiency of information usage and effectively expands the training set. Second, through the modern and accumulation information-recommendation modes, it achieves cooperative enhancement of model knowledge over different scopes. In traditional AdaBoost, the weak classifiers are merely superimposed to form the strong classifier and no information is transmitted between them; the present invention is the first to establish directed links between weak classifiers, achieving the transmission of valuable information on the one hand and the expansion of the training set on the other.
Brief description of the drawings
Fig. 1 is the flow chart of one iteration of the directed-link AdaBoost method of the present invention;
Fig. 2 is a schematic diagram of the modern-mode directed-link AdaBoost method of the present invention;
Fig. 3 is a schematic diagram of the accumulation-mode directed-link AdaBoost method of the present invention.
Detailed description
The present invention is explained in further detail below through an example; the example serves only to explain the invention and is not intended to limit its scope.
Malicious code (malware) generally refers to software deliberately created to perform unauthorized and usually harmful actions, including computer viruses, backdoors, Trojan programs, worms, spyware, and so on. Most common malicious code today runs on Windows systems, and it typically accesses important system resources such as the file system and the registry by calling the Win32 API functions (hereinafter "API") provided by the operating system. The malicious code data set is built by analysing each sample's API call behaviour, recording it in the corresponding dynamic behavioural feature items, and thereby obtaining a feature matrix.
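As a hypothetical illustration of the feature-matrix construction described above (the API names and the helper function are invented for the example; the patent does not specify the exact encoding):

```python
# Illustrative sketch: turn per-sample Win32 API call traces into a
# binary feature matrix. API names and helper name are hypothetical.
def api_feature_matrix(traces, api_vocab):
    """traces: list of API-name sequences, one per code sample.
    Returns an N x len(api_vocab) 0/1 matrix of observed API calls."""
    index = {api: j for j, api in enumerate(api_vocab)}
    matrix = [[0] * len(api_vocab) for _ in traces]
    for i, trace in enumerate(traces):
        for api in trace:
            if api in index:                 # ignore APIs outside the vocab
                matrix[i][index[api]] = 1
    return matrix
```

The resulting rows serve as the input feature vectors x_i for the classifier below.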
Classifying malicious code with the directed-link AdaBoost method comprises the following steps:
Step 1: input the malicious code training set T = {(x_i, y_i)}, 1 ≤ i ≤ N, where the class labels y_i ∈ {-1, +1}, and the unlabeled set U.
Step 2: initialize the weight distribution of the training data set T and the increment set ΔT_0;
Step 3: iterate the weak classifiers for the specified number of loop rounds;
Step 4: after the iterations complete, output the strong classifier G(x);
Step 5: predict the unlabeled set U with the strong classifier G(x) and output the prediction results.
Step 3 is specifically: for m from 1 to M, loop over the following steps:
Step 3.1: build the training set T^(m);
Step 3.2: train the weak classifier G_m(x) on T^(m);
Step 3.3: compute the classification error rate of G_m(x);
Step 3.4: compute the coefficient of G_m(x);
Step 3.5: update the training-data weight distribution;
Step 3.6: predict the unlabeled set U with G_m(x), select the K samples with the highest confidence together with their predicted labels, and build ΔT_m.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within its scope of protection.

Claims (9)

1. A directed-link AdaBoost classifier construction method, the steps of which are:
1) initialize the weight distribution of a labelled training data set T, an increment set, and an iteration cut-off condition;
2) for the m-th iteration, train a weak classifier G_m(x) on the labelled training set T^(m), and update the weight distribution of T^(m) using the classification error rate and the coefficient of the current weak classifier G_m(x); use the current weak classifier G_m(x) to predict an unlabeled set U, then select from the prediction results the K samples with the highest confidence, together with their predicted labels, and put them into the increment set, denoted ΔT_m; wherein T^(m) = T + ΔT_{m-1}, ΔT_{m-1} being the increment set from the previous iteration;
3) when the iteration cut-off condition is met, stop iterating and build a strong classifier G(x) from the weak classifiers obtained in each iteration.
2. the method for claim 1, is characterized in that, described iteration cut-off condition is iteration M time.
3. the method for claim 1, is characterized in that, described iteration cut-off condition is the condition of convergence of setting.
4. the method as described in claim 1 or 2 or 3, is characterized in that, the construction method of described strong classifier G (x) is: the Weak Classifier that each iteration obtains linearly is weighted superposition, forms described strong classifier G (x).
5. A directed-link AdaBoost classifier construction method, the steps of which are:
1) initialize the weight distribution of a labelled training data set T, an increment set, and an iteration cut-off condition;
2) for the m-th iteration, train a weak classifier G_m(x) on the labelled training set T^(m), and update the weight distribution of T^(m) using the classification error rate and the coefficient of the current weak classifier G_m(x); use the current weak classifier G_m(x) to predict an unlabeled set U, then select from the prediction results the K samples with the highest confidence together with their predicted labels and use them to update the samples in the increment set, denoted ΔT_m; wherein T^(m) = T + ΔT_{m-1}, ΔT_{m-1} being the increment set from the previous iteration;
3) when the iteration cut-off condition is met, stop iterating and build a strong classifier G(x) from the weak classifiers obtained in each iteration.
6. the method for claim 1, is characterized in that, described iteration cut-off condition is iteration M time.
7. the method for claim 1, is characterized in that, described iteration cut-off condition is the condition of convergence of setting.
8. the method as described in claim 1 or 2 or 3, is characterized in that, the construction method of described strong classifier G (x) is: the Weak Classifier that each iteration obtains linearly is weighted superposition, forms described strong classifier G (x).
9. A directed-link AdaBoost classification method, characterized in that the strong classifier G(x) constructed by the method of claim 1 or claim 5 is used to predict the unlabeled set U and output the prediction results.
CN201510192537.7A 2015-04-22 2015-04-22 Construction method of directed link type classifier and classification method Pending CN104820687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510192537.7A CN104820687A (en) 2015-04-22 2015-04-22 Construction method of directed link type classifier and classification method


Publications (1)

Publication Number Publication Date
CN104820687A 2015-08-05

Family

ID=53730982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510192537.7A Pending CN104820687A (en) 2015-04-22 2015-04-22 Construction method of directed link type classifier and classification method

Country Status (1)

Country Link
CN (1) CN104820687A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105938565A (en) * 2016-06-27 2016-09-14 西北工业大学 Multi-layer classifier and Internet image aided training-based color image emotion classification method
WO2018072663A1 (en) * 2016-10-18 2018-04-26 腾讯科技(深圳)有限公司 Data processing method and device, classifier training method and system, and storage medium
CN111881446A (en) * 2020-06-19 2020-11-03 中国科学院信息工程研究所 Method and device for identifying malicious codes of industrial internet
CN112231775A (en) * 2019-07-15 2021-01-15 天津大学 Hardware Trojan horse detection method based on Adaboost algorithm
CN112697179A (en) * 2020-11-17 2021-04-23 浙江工业大学 AdaBoost-based Brillouin frequency shift extraction method
CN113951868A (en) * 2021-10-29 2022-01-21 北京富通东方科技有限公司 Method and device for detecting man-machine asynchrony of mechanically ventilated patient
WO2022174436A1 (en) * 2021-02-22 2022-08-25 深圳大学 Incremental learning implementation method and apparatus for classification model, and electronic device and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103208008A (en) * 2013-03-21 2013-07-17 北京工业大学 Fast adaptation method for traffic video monitoring target detection based on machine vision
US20140321737A1 (en) * 2013-02-08 2014-10-30 Emotient Collection of machine learning training data for expression recognition
CN104318242A (en) * 2014-10-08 2015-01-28 中国人民解放军空军工程大学 High-efficiency SVM active half-supervision learning algorithm
CN104504393A (en) * 2014-12-04 2015-04-08 西安电子科技大学 SAR (Synthetic Aperture Radar) image semi-supervised classification method based on integrated learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xu Jun: "Research on genre classification and sentiment analysis techniques for financial information retrieval", China Doctoral Dissertations Full-text Database, Information Science and Technology *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150805