CN104820687A - Construction method of directed link type classifier and classification method - Google Patents


Info

Publication number
CN104820687A
CN104820687A
Authority
CN
China
Prior art keywords
iteration
classifier
condition
weak classifier
cut
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510192537.7A
Other languages
Chinese (zh)
Inventor
张晓宇
侯子骄
王树鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201510192537.7A priority Critical patent/CN104820687A/en
Publication of CN104820687A publication Critical patent/CN104820687A/en
Pending legal-status Critical Current

Abstract

The invention discloses a construction method for a directed-link classifier and a corresponding classification method. The construction method comprises the following steps: 1) initialize the weight distribution of a labelled training data set T, an increment set, and an iteration cut-off condition; 2) for the m-th iteration, train a weak classifier Gm(x) on the labelled training set T(m), and update the weight distribution of T(m) using the classification error rate and the coefficient of the current Gm(x); predict an unlabeled set U with the current Gm(x), then select from the prediction results the K samples with the highest confidence, together with their predicted labels, and put them into (or use them to update) the increment set; 3) when the iteration cut-off condition is satisfied, stop iterating and construct a strong classifier G(x) from the weak classifiers obtained in each iteration. Through the shared transmission and cooperative guidance of valuable knowledge, the construction method fully mines and exploits both labelled and unlabeled samples, achieving effective utilization and fused enhancement of model information.

Description

Construction method of a directed-link classifier and classification method
Technical field
The present invention relates to a construction method for a directed-link classifier and a corresponding classification method, belonging to the field of computer software technology.
Background technology
In the field of intelligent information analysis, many typical applications can be formulated as classification problems, such as malicious code identification and intrusion detection. Traditional classification approaches either depend heavily on human judgement or rely on simple, direct empirical rules; both their effectiveness and their efficiency need improvement. Against this background, intelligent, automated classification is regarded as an effective solution, and the choice of classifier is a crucial link. Boosting algorithms, owing to advantages such as structural simplicity and proven performance gains, have become widely used; among them, AdaBoost (Adaptive Boosting) is the most representative.
From the perspective of machine learning, traditional automatic classification belongs to supervised learning: such methods build the classification model entirely from labelled samples used as the training set. Its counterpart is unsupervised learning, i.e. mining the structural information implicit in the data from unlabeled samples alone. Supervised learning depends strongly on the size of the labelled sample set: the more labelled samples, the more reliable the classification model. In many practical classification problems, however, because labelling is costly in both human effort and time, the large quantity of fully labelled samples needed for model training often cannot be obtained; typically only a small fraction of the samples are labelled, while the vast majority remain unlabeled. Therefore, even an effective classifier such as AdaBoost struggles to accurately capture and reveal the true classification model when training samples are scarce.
Defects of the prior art
In the traditional AdaBoost construction method, the weak classifiers are combined into a strong classifier only through the weights obtained from their training error rates; there is no direct connection between the weak classifiers themselves. If each weak classifier is regarded as a node in a graph model, then in traditional AdaBoost there are no edges linking these nodes; in other words, the nodes are relatively isolated. From the viewpoint of information flow, there is no information interaction between weak classifiers, so the knowledge learned by earlier weak classifiers cannot directly guide the construction of subsequent ones, and valuable information is wasted.
Summary of the invention
The object of the present invention is to provide a directed-link classifier construction method and classification method that design directed link information paths between weak classifiers, realizing the shared transmission and cooperative guidance of model knowledge. With this method, the limited labelled samples can be fully exploited to obtain better classification results, providing an effective solution for data classification scenarios in which labelled samples are costly to obtain and few in number while unlabeled samples are abundant and ubiquitous.
Addressing the limitations of the traditional AdaBoost framework, the present invention designs a cooperative guidance framework among weak classifiers and proposes a directed-link AdaBoost classifier construction method. The method establishes directed link information paths between weak classifiers; through the shared transmission and cooperative guidance of valuable knowledge, it fully mines and exploits both labelled and unlabeled samples, achieving effective utilization and fused enhancement of model information.
The core idea of the directed-link AdaBoost construction method is to use the previously trained weak classifier to classify the unlabeled set and recommend the samples with the highest prediction confidence to subsequent weak classifiers. On the one hand, this passes highly reliable information on to subsequent weak classifiers and guides their construction; on the other hand, sharing valuable information effectively "expands" the training set, so that overall classification performance can be improved while making full use of the limited training data. Specifically, in each iteration of the directed-link AdaBoost method, the trained weak classifier G_m(x) is applied not only to the labelled set to obtain the weight coefficient, but also to the unlabeled set to select the K samples with the highest prediction confidence; these samples, together with their predicted labels, form an incremental training set ΔT_m that is recommended to subsequent weak classifiers, guiding their construction in a targeted way while expanding the existing training set. The flow of the directed-link AdaBoost method is shown in Fig. 1.
According to the recommendation scope of the incremental training set, the directed-link AdaBoost method can be further divided into two modes: "modern" and "accumulation". For the sake of clarity, this document uses x_i to denote a sample's input feature vector and y_i ∈ {-1, +1} its corresponding class label; according to whether samples are labelled, the sample set X is divided into a labelled set L and an unlabeled set U, and the samples in L together with their labels form the training set T for model learning.
Modern mode: under this mode, the current incremental training set is recommended only to the next weak classifier, so information interaction exists only between adjacent weak classifiers (as shown in Fig. 2). Let T^(m) denote the expanded training set used to build the weak classifier G_m(x), and ΔT_m the incremental training set generated by G_m(x); formally:
T^(1) = T,  T^(m+1) = T + ΔT_m    (1)
Accumulation mode: under this mode, the current incremental training set is recommended to all subsequent weak classifiers, so each weak classifier receives guidance information from all of the weak classifiers before it (as shown in Fig. 3). The corresponding formulation is:
T^(1) = T,  T^(m+1) = T^(m) + ΔT_m = T + Σ_{j=1}^{m} ΔT_j    (2)
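As a concrete illustration, the two recommendation modes of Eqs. (1) and (2) can be sketched in Python as follows. The function and variable names are illustrative (not taken from the patent), and increment sets are represented simply as lists of (sample, label) pairs.

```python
# Sketch of the two increment-recommendation modes, Eqs. (1) and (2).
# T is the labelled training set; deltas[j] holds the increment set
# ΔT_{j+1} produced by weak classifier G_{j+1}. Names are illustrative.

def training_set_modern(T, deltas, m):
    """T^(m) in modern mode: T plus only the previous increment ΔT_{m-1}."""
    if m <= 1:
        return list(T)                        # T^(1) = T
    return list(T) + list(deltas[m - 2])      # deltas[m-2] is ΔT_{m-1}

def training_set_cumulative(T, deltas, m):
    """T^(m) in accumulation mode: T plus every increment ΔT_1 .. ΔT_{m-1}."""
    out = list(T)
    for j in range(m - 1):
        out += list(deltas[j])
    return out
```

In modern mode the expanded set stays small (at most |T| + K samples), while in accumulation mode it grows by K pseudo-labelled samples per iteration.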
Finally, after a series of weak classifiers has been constructed using the expanded training sets, the directed-link AdaBoost method blends them organically into a strong classifier. Concretely, the implementation steps of the directed-link AdaBoost method are as follows:
Step 1: input the training set T = {(x_i, y_i)}, 1 ≤ i ≤ N, where the class labels y_i ∈ {-1, +1}, and the unlabeled set U.
Step 2: initialize the weight distribution of the training data set T and the increment set ΔT_0:
ω_{1,i} = 1/N,  i = 1, 2, ..., N
ΔT_0 = {}
Step 3: for m from 1 to M, loop over the following steps:
Step 3.1: build the training set T^(m):
T^(m) = T + ΔT_{m-1}    (modern mode)
or
T^(m) = T + Σ_{j=0}^{m-1} ΔT_j    (accumulation mode)
Step 3.2: train the weak classifier G_m(x) on T^(m):
T^(m) ~ G_m(x): X → {-1, +1}
Step 3.3: compute the classification error rate of G_m(x):
err_m = Σ_{i=1}^{N} ω_{m,i} · I(G_m(x_i) ≠ y_i)
where I(·) is the indicator function.
Step 3.4: compute the coefficient of G_m(x):
α_m = (1/2) · log((1 − err_m) / err_m)
Step 3.5: update the training-data weight distribution:
ω_{m+1,i} = (ω_{m,i} / Z_m) · exp(−α_m · y_i · G_m(x_i)),  i = 1, 2, ..., N
where Z_m is the normalization factor:
Z_m = Σ_{i=1}^{N} ω_{m,i} · exp(−α_m · y_i · G_m(x_i))
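Step 3.5 can be written as a minimal NumPy sketch (the helper name `update_weights` is illustrative):

```python
# Sketch of Step 3.5: the AdaBoost weight update with normalization.
import numpy as np

def update_weights(w, alpha, y, pred):
    """w: current weights ω_m; y, pred: true labels and G_m(x_i) in {-1,+1}."""
    unnorm = w * np.exp(-alpha * y * pred)  # ω_{m,i} · exp(−α_m·y_i·G_m(x_i))
    return unnorm / unnorm.sum()            # dividing by Z_m keeps Σω = 1
```

Misclassified samples (where y_i · G_m(x_i) = −1) have their weight multiplied by exp(α_m) before normalization, so the next weak classifier focuses on them.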
Step 3.6: predict the unlabeled set U with G_m(x), select the K samples with the highest confidence together with their predicted labels, and build ΔT_m:
ΔT_m = top-K_{x ∈ U} Cert(G_m(x))
where Cert(·) denotes the prediction confidence of G_m(x) on a sample.
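Step 3.6 can be sketched as follows, assuming the weak classifier exposes scikit-learn-style `predict_proba` and `classes_` attributes. This is an assumption for illustration: the patent leaves the confidence measure Cert(·) abstract, and the top class probability is used here as a stand-in.

```python
# Sketch of Step 3.6: select the K most confidently predicted samples of U.
import numpy as np

def top_k_increment(clf, U, k):
    """Return the k most confident (sample, predicted label) pairs from U."""
    proba = clf.predict_proba(U)
    conf = proba.max(axis=1)                  # Cert(G_m(x)): top class prob
    order = np.argsort(conf)[::-1][:k]        # indices, most confident first
    labels = clf.classes_[proba.argmax(axis=1)]
    return [(U[i], labels[i]) for i in order]
```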
Step 4: after the loop completes, weight each weak classifier by the coefficient obtained in Step 3.4 and superimpose them to output the strong classifier G(x):
G(x) = sign(Σ_{m=1}^{M} α_m · G_m(x))
Step 5: predict the unlabeled set U with the strong classifier G(x) and output the prediction results.
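The complete loop of Steps 1-5 (in "modern" mode) can be sketched end-to-end. Everything below is an illustrative assumption rather than the patent's prescription: the weak learner is a weighted decision stump, the confidence Cert(·) is taken as the distance to the stump's threshold, and pseudo-labelled increment samples receive weight 1/N.

```python
# End-to-end sketch of the directed-link AdaBoost loop (modern mode).
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump; returns (predict, confidence) functions.
    Confidence = distance to the split threshold (illustrative Cert)."""
    best = (np.inf, 0, 0.0, 1)               # (error, feature, threshold, sign)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for s in (1, -1):
                pred = np.where(X[:, f] <= t, -s, s)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, t, s)
    _, f, t, s = best
    return (lambda Xq: np.where(Xq[:, f] <= t, -s, s),
            lambda Xq: np.abs(Xq[:, f] - t))

def directed_link_adaboost(X, y, U, M=3, K=1):
    N = len(X)
    w = np.full(N, 1.0 / N)                          # Step 2: uniform weights
    inc_X, inc_y = np.empty((0, X.shape[1])), np.empty(0)
    learners, alphas = [], []
    for m in range(M):                               # Step 3
        Xm = np.vstack([X, inc_X])                   # 3.1: T^(m) = T + ΔT_{m-1}
        ym = np.concatenate([y, inc_y])
        wm = np.concatenate([w, np.full(len(inc_y), 1.0 / N)])
        g, cert = fit_stump(Xm, ym, wm / wm.sum())   # 3.2: train weak learner
        pred = g(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)   # 3.3
        a = 0.5 * np.log((1 - err) / err)            # 3.4: coefficient α_m
        w = w * np.exp(-a * y * pred); w /= w.sum()  # 3.5: weight update
        idx = np.argsort(cert(U))[::-1][:K]          # 3.6: top-K confident
        inc_X, inc_y = U[idx], g(U)[idx]
        learners.append(g); alphas.append(a)
    return lambda Xq: np.sign(                       # Step 4: weighted vote
        sum(a * g(Xq) for a, g in zip(alphas, learners)))
```

Each iteration first expands the training set with the previous increment, then trains, reweights, and finally recommends a fresh increment to the next weak classifier, exactly mirroring the directed links of Fig. 2.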
Compared with the prior art, the present invention has the following beneficial effects:
The present invention designs a cooperative guidance framework and, according to the recommendation scope of the incremental training set, two information-recommendation modes: modern and accumulation. The directed-link AdaBoost classification method provided by the invention has the following advantages:
First, aimed at the common situation in large-scale databases where labelled samples are costly to obtain and limited in number while unlabeled samples are abundant and easy to obtain, the invention builds, on the basis of the traditional AdaBoost framework, a cooperative guidance framework that constructs information paths between weak classifiers that previously had no direct connection; through the transmission of highly reliable, valuable information it improves the efficiency of information usage and effectively expands the training set. Second, through the modern and accumulation information-recommendation modes, it achieves cooperative enhancement of model knowledge over different scopes. In traditional AdaBoost, the weak classifiers are merely superimposed to form the strong classifier and no information is transmitted between them; the present invention is the first to establish directed links between weak classifiers, achieving the transmission of valuable information on the one hand and the expansion of the training set on the other.
Brief description of the drawings
Fig. 1 is the flow chart of one iteration of the directed-link AdaBoost method of the present invention;
Fig. 2 is a schematic diagram of the modern-mode directed-link AdaBoost method of the present invention;
Fig. 3 is a schematic diagram of the accumulation-mode directed-link AdaBoost method of the present invention.
Detailed description
The present invention is explained in further detail below through an example; the example serves only to explain the invention and is not intended to limit its scope.
Malicious code (malware) generally refers to software deliberately created to perform unauthorized and usually harmful actions, including computer viruses, backdoors, Trojan programs, worms, spyware, and so on. Most common malicious code today runs on Windows systems, and it typically accesses important system resources such as the file system and the registry by calling the Win32 API functions (hereinafter "API") provided by the operating system. The malicious code data set is built by analysing each sample's API call behaviour, recording it in the corresponding dynamic behavioural feature items, and thereby obtaining a feature matrix.
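As a hypothetical illustration of the feature-matrix construction described above (the API names and the helper function are invented for the example; the patent does not specify the exact encoding):

```python
# Illustrative sketch: turn per-sample Win32 API call traces into a
# binary feature matrix. API names and helper name are hypothetical.
def api_feature_matrix(traces, api_vocab):
    """traces: list of API-name sequences, one per code sample.
    Returns an N x len(api_vocab) 0/1 matrix of observed API calls."""
    index = {api: j for j, api in enumerate(api_vocab)}
    matrix = [[0] * len(api_vocab) for _ in traces]
    for i, trace in enumerate(traces):
        for api in trace:
            if api in index:                 # ignore APIs outside the vocab
                matrix[i][index[api]] = 1
    return matrix
```

The resulting rows serve as the input feature vectors x_i for the classifier below.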
Classifying malicious code with the directed-link AdaBoost method comprises the following steps:
Step 1: input the malicious code training set T = {(x_i, y_i)}, 1 ≤ i ≤ N, where the class labels y_i ∈ {-1, +1}, and the unlabeled set U.
Step 2: initialize the weight distribution of the training data set T and the increment set ΔT_0;
Step 3: iterate the weak classifiers for the specified number of loop rounds;
Step 4: after the iterations complete, output the strong classifier G(x);
Step 5: predict the unlabeled set U with the strong classifier G(x) and output the prediction results.
Step 3 is specifically: for m from 1 to M, loop over the following steps:
Step 3.1: build the training set T^(m);
Step 3.2: train the weak classifier G_m(x) on T^(m);
Step 3.3: compute the classification error rate of G_m(x);
Step 3.4: compute the coefficient of G_m(x);
Step 3.5: update the training-data weight distribution;
Step 3.6: predict the unlabeled set U with G_m(x), select the K samples with the highest confidence together with their predicted labels, and build ΔT_m.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within its scope of protection.

Claims (9)

1. A directed-link AdaBoost classifier construction method, the steps of which are:
1) initialize the weight distribution of a labelled training data set T, an increment set, and an iteration cut-off condition;
2) for the m-th iteration, train a weak classifier G_m(x) on the labelled training set T^(m), and update the weight distribution of T^(m) using the classification error rate and the coefficient of the current weak classifier G_m(x); use the current weak classifier G_m(x) to predict an unlabeled set U, then select from the prediction results the K samples with the highest confidence, together with their predicted labels, and put them into the increment set, denoted ΔT_m; wherein T^(m) = T + ΔT_{m-1}, ΔT_{m-1} being the increment set from the previous iteration;
3) when the iteration cut-off condition is met, stop iterating and build a strong classifier G(x) from the weak classifiers obtained in each iteration.
2. the method for claim 1, is characterized in that, described iteration cut-off condition is iteration M time.
3. the method for claim 1, is characterized in that, described iteration cut-off condition is the condition of convergence of setting.
4. the method as described in claim 1 or 2 or 3, is characterized in that, the construction method of described strong classifier G (x) is: the Weak Classifier that each iteration obtains linearly is weighted superposition, forms described strong classifier G (x).
5. A directed-link AdaBoost classifier construction method, the steps of which are:
1) initialize the weight distribution of a labelled training data set T, an increment set, and an iteration cut-off condition;
2) for the m-th iteration, train a weak classifier G_m(x) on the labelled training set T^(m), and update the weight distribution of T^(m) using the classification error rate and the coefficient of the current weak classifier G_m(x); use the current weak classifier G_m(x) to predict an unlabeled set U, then select from the prediction results the K samples with the highest confidence together with their predicted labels and use them to update the samples in the increment set, denoted ΔT_m; wherein T^(m) = T + ΔT_{m-1}, ΔT_{m-1} being the increment set from the previous iteration;
3) when the iteration cut-off condition is met, stop iterating and build a strong classifier G(x) from the weak classifiers obtained in each iteration.
6. the method for claim 1, is characterized in that, described iteration cut-off condition is iteration M time.
7. the method for claim 1, is characterized in that, described iteration cut-off condition is the condition of convergence of setting.
8. the method as described in claim 1 or 2 or 3, is characterized in that, the construction method of described strong classifier G (x) is: the Weak Classifier that each iteration obtains linearly is weighted superposition, forms described strong classifier G (x).
9. A directed-link AdaBoost classification method, characterized in that the strong classifier G(x) constructed by the method of claim 1 or claim 5 is used to predict the unlabeled set U and output the prediction results.
CN201510192537.7A 2015-04-22 2015-04-22 Construction method of directed link type classifier and classification method Pending CN104820687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510192537.7A CN104820687A (en) 2015-04-22 2015-04-22 Construction method of directed link type classifier and classification method


Publications (1)

Publication Number Publication Date
CN104820687A 2015-08-05

Family

ID=53730982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510192537.7A Pending CN104820687A (en) 2015-04-22 2015-04-22 Construction method of directed link type classifier and classification method

Country Status (1)

Country Link
CN (1) CN104820687A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105938565A (en) * 2016-06-27 2016-09-14 西北工业大学 Multi-layer classifier and Internet image aided training-based color image emotion classification method
WO2018072663A1 (en) * 2016-10-18 2018-04-26 腾讯科技(深圳)有限公司 Data processing method and device, classifier training method and system, and storage medium
CN111881446A (en) * 2020-06-19 2020-11-03 中国科学院信息工程研究所 Method and device for identifying malicious codes of industrial internet
CN112231775A (en) * 2019-07-15 2021-01-15 天津大学 Hardware Trojan horse detection method based on Adaboost algorithm
CN112697179A (en) * 2020-11-17 2021-04-23 浙江工业大学 AdaBoost-based Brillouin frequency shift extraction method
CN113951868A (en) * 2021-10-29 2022-01-21 北京富通东方科技有限公司 Method and device for detecting man-machine asynchrony of mechanically ventilated patient
WO2022174436A1 (en) * 2021-02-22 2022-08-25 深圳大学 Incremental learning implementation method and apparatus for classification model, and electronic device and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103208008A (en) * 2013-03-21 2013-07-17 北京工业大学 Fast adaptation method for traffic video monitoring target detection based on machine vision
US20140321737A1 (en) * 2013-02-08 2014-10-30 Emotient Collection of machine learning training data for expression recognition
CN104318242A (en) * 2014-10-08 2015-01-28 中国人民解放军空军工程大学 High-efficiency SVM active half-supervision learning algorithm
CN104504393A (en) * 2014-12-04 2015-04-08 西安电子科技大学 SAR (Synthetic Aperture Radar) image semi-supervised classification method based on integrated learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xu Jun: "Research on genre classification and sentiment analysis techniques for financial information retrieval", China Doctoral Dissertations Full-text Database, Information Science and Technology *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150805