CN107273912A - An active learning method based on three-way decision theory - Google Patents

An active learning method based on three-way decision theory

Info

Publication number
CN107273912A
CN107273912A (application CN201710326684.8A)
Authority
CN
China
Prior art keywords
sample
data
value
active learning
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710326684.8A
Other languages
Chinese (zh)
Inventor
胡峰
张苗
张清华
于洪
程麟焰
余春霖
靳义林
李智星
王进
雷大江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201710326684.8A priority Critical patent/CN107273912A/en
Publication of CN107273912A publication Critical patent/CN107273912A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An active learning algorithm based on three-way decisions is claimed in the present invention, which uses the idea of three-way decisions to solve the problem posed by currently unlabeled samples. It relates to rough sets, data mining, and related fields. First, the uncertainty of the unlabeled data is determined by the margin strategy. The unlabeled data are then divided by uncertainty into three different regions: the positive region, the negative region, and the boundary region. The data in each region are handled with a corresponding solution; the purpose is to select highly informative, strongly representative data for labeling. The labeled data are added to the training set and a new classifier is created. Learning proceeds by repeated training iterations until the preset number of iterations is reached or the desired evaluation criterion is met. The present invention can better improve all aspects of classifier performance.

Description

An active learning method based on three-way decision theory
Technical field
The invention belongs to the fields of rough sets, machine learning, and data mining, and in particular relates to an active learning method.
Background technology
During data analysis and processing, a model must be trained from known data (a training set). In concrete practice, we find that data can be obtained easily and efficiently, but these data are often unlabeled. With the arrival of the big-data era, data affect every aspect of people's lives, yet these data are redundant, cumbersome, and unlabeled. Obtaining labeled data directly is often infeasible: not only are the conditions not in place, but substantial money and time are required. However, we can start from the unlabeled data; how to process these data and bring their latent value into full play poses new requirements for existing technology.
Active learning is a kind of machine learning method that can effectively solve the above problem. An active learning method selects the most useful data and hands them to an expert for labeling, expanding the training set and creating a more effective model. Compared with traditional passive learning, this method selects highly informative, representative data for labeling, avoiding redundant and unnecessary additions of data. At the same time, it reduces the manpower and material resources needed to label data in bulk, lowering the cost of data analysis.
In 1974, Simon proposed ideas related to active learning. Valiant proved from a statistical perspective that selecting training examples can effectively reduce the amount of data needed for training. In recent years, more and more scholars have locked onto active learning as a research direction and proposed concepts and theories related to it. Compared with passive learning's way of randomly drawing data, an active learning algorithm selects useful samples for the user to label rather than passively receiving data. According to experimental results, under the premise of achieving the same accuracy, active selection can greatly reduce the number of samples required compared with random selection. The execution of an active learning algorithm can be divided into two processes. First: a sample selection algorithm picks out the most informative, most valuable samples for labeling, and the labeled samples are added to the labeled set. Second: a base classifier is created from the existing labeled data, a suitable evaluation index is selected, and the classification performance of the classifier is measured by supervised learning. The two processes are executed alternately, iterating continuously so that the performance of the classifier becomes optimal, or a number of iterations is set and iteration continues until the preset condition is reached.
According to the form in which unlabeled data are selected, active learning can be divided into two classes: stream-based active learning and pool-based active learning.
Stream-based active learning: this method is widely applied in natural language processing directions such as part-of-speech tagging and word-sense disambiguation. The learning algorithm judges the samples one by one in the form of a stream, and the judgment has only two possible outcomes: label, or do not label. The samples that need labeling are passed to an expert for labeling; the samples not labeled are discarded directly and never used again. A representative method is Query By Committee (QBC).
Pool-based active learning: whereas stream-based active learning requires sequential sampling and judges samples one by one, this method makes a unified judgment over the data in a certain range, selecting the top-k samples by some index for labeling. Of course, pool-based samples can also be handled with the stream-based learning method, drawing a small number of samples from the pool each time and judging them individually. Pool-based active learning is currently the most studied, most followed, and most widely applied method.
Sample selection strategies for active learning:
1. Uncertainty-based sample selection
Using an uncertainty measure, the samples whose class is hardest to determine are picked out and handed to a human expert for labeling. For a binary classification problem, a probabilistic model can be built and the samples whose posterior probability is closest to 0.5 are selected for labeling. For multi-class problems, the samples with the lowest confidence are usually selected, or, under the margin criterion, the samples with the smallest difference between the largest and the second-largest posterior probabilities. The uncertainty of a sample can also be computed from information entropy: the larger the entropy, the greater the uncertainty.
2. Sample selection based on expected error reduction
Each unlabeled sample is tentatively labeled, added to the training set, and a new classifier is trained; the sample whose addition reduces the classifier's expected error the most is selected for labeling. Because estimating the expected error reduction requires a very large amount of computation, this method is usually applied only to binary classification problems.
3. Sample selection based on committee query
The committee-query approach creates multiple models to form a committee; each model makes a decision on the unlabeled samples, and the samples on which the decisions are least consistent are regarded as the samples whose class is hardest to determine. Adding such samples to the set to be labeled minimizes the version space.
Three-way decision theory was developed by Yao Y. Y. on the basis of probabilistic rough sets and decision-theoretic rough sets. The probabilistic rough set model introduces two parameters, α and β, and divides the whole space into three regions: the positive region, the negative region, and the boundary region. Yao Y. Y. first proposed the related concepts of three-way decisions: the positive region means affirmation of a thing, the negative region means negation of a thing, and the boundary region means there is no full assurance about a thing and no decision can be made immediately. Three-way decisions endow probabilistic rough sets with new semantics, and their appearance provides a new way of thinking for decision problems.
Summary of the invention
The present invention seeks to address the above problems of the prior art by proposing an active learning method based on three-way decision theory that better improves all aspects of classifier performance and can be effectively used to handle the current situation in which a large number of class labels are missing. The technical scheme of the invention is as follows:
An active learning method based on three-way decision theory, comprising the following steps:
1) obtain a data set and call a random function to shuffle it, then divide it proportionally into a labeled data set, an unlabeled data set, and a test set;
2) train a naive Bayes classifier on the labeled data set, estimate posterior probabilities of the unlabeled data with the naive Bayes classifier, and compute the uncertainty of each unlabeled sample;
3) divide the unlabeled data by the magnitude of their uncertainty into three regions, namely the positive region, the negative region, and the boundary region; two division methods are involved: division by thresholds and division by sorting the uncertainty values;
4) handle the samples in the different regions of step 3) separately, select the informative, valuable samples for labeling, add the labeled samples to the training set, and train a new classifier for result testing on the test set; to test classifier performance, different evaluation criteria are used for verification.
Further, step 2) uses the margin strategy to estimate posterior probabilities of the unlabeled data, computes the difference of the posterior probabilities, and determines the uncertainty of each unlabeled sample:
D_value(x) = p(y_first | x, L) - p(y_second | x, L)   (1)
where D_value(x) represents the magnitude of the uncertainty, p(y_first | x, L) is the largest posterior probability, and p(y_second | x, L) is the second-largest posterior probability.
Further, dividing the unlabeled data into three regions by thresholds in step 3) specifically comprises: dividing according to thresholds threshold_α and threshold_β, defined as follows:
if D_value(x) ≤ threshold_α, then x ∈ POS(X)
if threshold_α < D_value(x) < threshold_β, then x ∈ BND(X)   (2)
if D_value(x) ≥ threshold_β, then x ∈ NEG(X)
where 0 ≤ threshold_α < threshold_β ≤ 1, and threshold_α and threshold_β can be determined from empirical values or a chosen confidence level.
Further, dividing by sorting the uncertainty values in step 3) specifically comprises: sorting the samples by the uncertainty measure D_value(x) from small to large and dividing the sample space by controlling the rank; the first top-k samples belong to the positive region, the last top-k samples belong to the negative region, and the middle part belongs to the boundary region; the value of k is determined by the selection quantity selectNum.
Further, handling the samples in the different regions of step 3) separately in step 4) specifically comprises the steps of:
For the positive region, i.e. the sample set with x ∈ POS(X): append the samples directly to the sequence to be labeled and delete them from the unlabeled data. For the negative region, i.e. the sample set with x ∈ NEG(X): do no processing of such samples. For the boundary region, x ∈ BND(X): whether to label must be determined further, comprising the steps of: determining the neighborhood radius of the boundary-region samples from the pairwise distances between samples; computing the neighborhood density and selecting the most representative samples; and sorting the samples by representativeness in descending order, adding the top-k samples to the sequence select to be labeled.
Further, computing the pairwise distances between boundary-region samples comprises: if an attribute is continuous, the Euclidean distance is used, defined for samples X = {x_1, x_2, ..., x_j, ..., x_m} and Y = {y_1, y_2, ..., y_j, ..., y_m} as
dis(X, Y) = sqrt( Σ_{j=1}^{m} (x_j - y_j)² )
where x_j is the j-th attribute of sample X and y_j the j-th attribute of sample Y; if an attribute is discrete, the Value Difference Metric (VDM) is used, defined as follows: let V_1 and V_2 be two values of a discrete attribute taken by samples x_1 and x_2; then
VDM(V_1, V_2) = Σ_i | C_{1i}/C_1 - C_{2i}/C_2 |^K
where C_1 is the number of samples whose value on this attribute is V_1, C_{1i} the number of those belonging to class i, C_2 the number of samples whose value is V_2, C_{2i} the number of those belonging to class i, and K is a constant, usually taken as 1.
Further, the formula for determining the neighborhood radius of a sample is:
δ = min(dis(x_i, s)) + w × range(dis(x_i, s)), 0 ≤ w ≤ 1   (5)
where min(dis(x_i, s)) is the distance from x_i to its nearest neighbor, range(dis(x_i, s)) is the span of its distances within the specified data set, and w controls the size of the radius. The δ-neighborhood of x_i is defined as δ(x_i) = { x_j ∈ U | dis(x_i, x_j) ≤ δ }, where δ is a predefined distance threshold.
Further, the representative point is defined as follows: D_value(x) represents the magnitude of the uncertainty, a smaller value meaning greater uncertainty, so 1 - D_value(x) serves as the uncertainty weight; x_k denotes the unlabeled samples within the neighborhood radius, and N is the number of x_k for which dis(x, x_k) ≤ δ holds. Assuming sample x and sample x_k have attribute vectors x = {x_1, x_2, ..., x_j, ..., x_m} and x_k = {x_k^1, x_k^2, ..., x_k^j, ..., x_k^m}, the similarity of the two is computed with the cosine formula
cos(x, x_k) = ( Σ_{j=1}^{m} x_j x_k^j ) / ( sqrt(Σ_{j=1}^{m} x_j²) × sqrt(Σ_{j=1}^{m} (x_k^j)²) )
and the representativeness of x is obtained by weighting the cosine similarities of the N neighbors by the uncertainty weight.
Advantages and beneficial effects of the present invention:
The present invention applies three-way decision theory to active learning, dividing the sample space by the magnitude of uncertainty into three regions: the positive region, the boundary region, and the negative region. Two division methods are proposed: method one divides by thresholds; method two divides by sorting the uncertainty values. Different processing is chosen for the samples in each region: samples in the positive region are appended directly to the sequence select to be labeled; samples in the negative region receive no processing; for samples in the boundary region, the neighborhood density is computed on the basis of the neighborhood to determine their representativeness, and the top-k samples are added to the sequence select to be labeled; finally an expert labels the samples in select. This targeted sample selection, starting from the differences between individual samples, can pick out the samples worth labeling more finely and thus better improve all aspects of classifier performance, such as classification accuracy, ROC, and F-value. Extending this method to active learning makes it effective for handling the current situation in which a large number of class labels are missing.
Brief description of the drawings
Fig. 1 is a flow chart of active learning according to a preferred embodiment of the present invention;
Fig. 2: schematic diagram of dividing the positive and negative regions;
Fig. 3: schematic diagram of selecting representative samples in the boundary region;
Fig. 4: schematic diagram of region division based on uncertainty;
Fig. 5: flow chart of active learning based on three-way decisions.
Detailed description of the embodiments
The technical scheme in the embodiments of the present invention is described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention.
The technical scheme by which the present invention solves the above technical problem is as follows. The method of the present invention comprises the following steps:
Step 1: divide the experimental data. The data are divided into labeled data (the training set), unlabeled data, and data to be tested (the test set).
Step 2: train a base classifier, a naive Bayes classifier, on the existing labeled data, i.e. the training set, and estimate the posterior probability values of each sample in the unlabeled data with the created classifier. If the classifier is a binary classifier, subtract the posterior probability of the other class from the posterior probability of the largest class. For multi-class problems, select the largest and second-largest posterior probability values and compute their difference D_value.
Step 3: based on the uncertainty, divide the whole unlabeled data space into three different regions according to a threshold. Data whose D_value is small are regarded as highly uncertain: they are divided into the positive region, appended to the sequence select to be labeled, left for the human expert to label, and deleted from the unlabeled set. Data whose D_value is large are regarded as having a determinable class: they are divided into the negative region and receive no processing. The remaining data are divided into the boundary region. It is not that labeling the boundary-region data is completely meaningless; to select the data with more labeling value, whether to label them must be decided anew.
Step 4: process the points in the boundary region. For the data in the boundary region, to make labeling more valuable, the concept of the neighborhood is introduced and the density of unlabeled data is computed within the neighborhood, so that the distribution of the samples is taken into account and the most representative data are selected.
Step 5: re-sort by representativeness and informativeness, and add the top-k samples to the data set select to be labeled.
Step 6: hand the data in select to a domain expert for labeling, and update the training set and the unlabeled set according to the result of select. Create a new classifier with the updated training set and test the results on the test set.
Steps 2 through 6 are repeated until the preset number of iterations is met or the required performance level is reached; the sketch below illustrates how these steps fit together.
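For illustration only, the following minimal Python sketch (not part of the patent; all identifiers are hypothetical, scikit-learn's GaussianNB stands in for the naive Bayes classifier, and the boundary-region handling of Steps 4-5 is elided) shows one way the Step 1 through Step 6 loop could be wired up:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def margin_d_value(clf, X_unlab):
    """D_value(x) = p(y_first|x,L) - p(y_second|x,L); smaller = more uncertain."""
    proba = np.sort(clf.predict_proba(X_unlab), axis=1)
    return proba[:, -1] - proba[:, -2]

def active_learning_loop(X_lab, y_lab, X_unlab, oracle, X_test, y_test,
                         alpha=0.05, max_iter=10):
    clf = GaussianNB().fit(X_lab, y_lab)              # Step 2: base classifier
    for _ in range(max_iter):
        d_value = margin_d_value(clf, X_unlab)
        # Step 3: positive region by threshold (division method one);
        # boundary-region scoring (Steps 4-5) is elided in this sketch.
        select = np.where(d_value <= alpha)[0]        # most uncertain -> label
        if select.size == 0:
            break
        y_new = oracle(X_unlab[select])               # Step 6: expert labels
        X_lab = np.vstack([X_lab, X_unlab[select]])
        y_lab = np.concatenate([y_lab, y_new])
        X_unlab = np.delete(X_unlab, select, axis=0)
        clf = GaussianNB().fit(X_lab, y_lab)          # retrain, test, iterate
        print("test accuracy:", clf.score(X_test, y_test))
    return clf
```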
Further, regarding Step 1: to estimate the learner's generalization performance well, the test result obtained from a single data division is often not stable or reliable enough to be convincing. If instead the data are randomly divided several times, the experiment is repeated, and the results are finally averaged, the resulting assessment is clearly more reasonable and more convincing. When experimenting, the number of repeated runs is specified and the test results under the different data divisions in each run are observed. Therefore, when dividing the data, a random function must be called to realize a random division.
Further, regarding Step 2: a naive Bayes classifier is created. Naive Bayes is premised on conditional attribute independence, assuming that each attribute influences the classification result independently. This avoids estimating the joint probability distribution of all attributes in the class-conditional probability P(x | c): computing the joint probability directly from a finite sample faces combinatorial explosion in computation and sample sparsity in the data, and the problem becomes even more serious when the data set has especially many attributes.
The principle by which a naive Bayes classifier determines the class of a sample is as follows.
The set of attribute features: x = {a_1, a_2, a_3, ..., a_m}
The set of class attributes: C = {y_1, y_2, y_3, ..., y_n}
P(x) is the "evidence" factor used for normalization. For a given sample x, the evidence factor p(x) has no relation to the class attribute and does not change for any class value, so it suffices to maximize the numerator. The expression of the naive Bayes classifier is usually defined as follows:
p(y_i | x) = p(y_i) Π_{j=1}^{m} p(a_j | y_i) / p(x)   (2)
The y_i that maximizes p(y_i) Π_{j=1}^{m} p(a_j | y_i) is taken as the decision result. From this result it can be seen that if the output probability values were used directly, taking the difference between the largest joint probability p(y_first, x) and the second-largest joint probability p(y_second, x), different probability values could give the same difference. Take binary classification as an example. Scenario one: p(y_first, x) and p(y_second, x) are 0.4 and 0.2, so D_value = p(y_first, x) - p(y_second, x) = 0.2. Scenario two: p(y_first, x) and p(y_second, x) are 0.5 and 0.3, so again D_value = 0.2. In fact, after the evidence factor is applied, scenario one gives 0.4/0.6 - 0.2/0.6 ≈ 0.333, whereas scenario two gives 0.5/0.8 - 0.3/0.8 = 0.25. The difference in scenario two is clearly smaller than in scenario one, so the uncertainty in scenario two is somewhat larger. Therefore, when naive Bayes is chosen as the classifier, the probability values obtained from formula (2) must be normalized by the evidence factor p(x); the output probability values cannot be used to take differences directly.
Further, regarding the computation of the class-conditional probability p(x_j | y_i): for a discrete attribute, the formula p(x_j | y_i) = |D_{y_i, x_j}| / |D_{y_i}| is used, where D_{y_i, x_j} denotes the set of samples of class y_i whose j-th attribute value is x_j. For a continuous attribute, p(x_j | y_i) is assumed to follow a normal distribution, i.e., p(x_j | y_i) ~ N(μ_{y_i,j}, σ²_{y_i,j}), so the conditional probability for a continuous attribute is computed as
p(x_j | y_i) = 1/(sqrt(2π) σ_{y_i,j}) × exp( -(x_j - μ_{y_i,j})² / (2σ²_{y_i,j}) )   (3)
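As an aside, a hand-rolled version of the Gaussian posterior computation (a sketch under the stated normality assumption, continuous attributes only; not code from the patent) makes the role of the evidence factor explicit:

```python
import numpy as np

def gaussian_nb_posteriors(X_train, y_train, x):
    """p(y_i|x) for continuous attributes: prior times the product of
    N(mu, sigma^2) densities per formula (3), then normalized by p(x)."""
    classes = np.unique(y_train)
    joint = []
    for c in classes:
        Xc = X_train[y_train == c]
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9    # per-attribute moments
        dens = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        joint.append(len(Xc) / len(X_train) * dens.prod())  # p(y_i) * prod_j p(x_j|y_i)
    joint = np.array(joint)
    return classes, joint / joint.sum()                     # divide by evidence p(x)
```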
Further, regarding Step 2 and Step 3: a naive Bayes classifier is constructed and the most uncertain samples are selected for labeling. Here the margin strategy is chosen as the uncertainty measure; the margin formula is as follows:
x* = argmin_x ( p(y_first | x, L) - p(y_second | x, L) )   (4)
where y_first is the class with the largest posterior probability and y_second the class with the second-largest posterior probability; the smaller the difference between the two values, the greater the uncertainty.
Take binary classification (y, n) as an example. If D_value is very small, the probabilities that the sample belongs to class y and to class n are close; the classifier cannot make an accurate decision on this sample, and labeling it can improve classification performance to a large extent. This is exactly how the samples most worth labeling are selected on the basis of uncertainty. If D_value is large, for example the posterior probability of class y is far larger than that of class n, then within the allowed error the classifier is already fully confident of the sample's class, and in this case such samples need no processing.
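Formula (4) in isolation might be transcribed as follows (an illustrative sketch; predict_proba is scikit-learn's API, assumed here as the source of the posteriors):

```python
import numpy as np

def most_uncertain(clf, X_unlab):
    """x* = argmin_x (p(y_first|x,L) - p(y_second|x,L)), formula (4)."""
    proba = np.sort(clf.predict_proba(X_unlab), axis=1)
    d_value = proba[:, -1] - proba[:, -2]   # margin between top two posteriors
    return int(np.argmin(d_value))          # index of the most uncertain sample
```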
Further, regarding Step 3: based on the uncertainty, the data are divided according to thresholds; two division methods are referred to here. Method one: determine threshold_α and threshold_β from empirical values or a chosen confidence level and divide accordingly. Method two: sort by the uncertainty measure from small to large and divide by controlling the rank. To better illustrate the division of the different regions, Fig. 2 is drawn to explain the problem further.
For sample A, posterior probability values are obtained from the classifier and the largest probability is far larger than the second-largest; its class is considered determined. Even if the density around A is very large, A is no longer labeled; A is divided into the negative region.
For sample B, the largest and second-largest posterior probabilities obtained from the classifier are very close, so the probability that the classifier misclassifies it is large, even though the density around sample B is not as large as around sample A. Clearly, labeling B is more meaningful; B is divided into the positive region.
Further, regarding Step 4: after the three-way division of the data based on uncertainty, most samples attain neither extreme; their uncertainty is intermediate. If such a sample is known to have many unlabeled samples around it, the sample is representative, and labeling it can reduce the uncertainty of the surrounding samples and improve classifier performance.
The processing of the data in the boundary region is illustrated in Fig. 3: the constructed classifier is uncertain about the classes of the unlabeled samples A and B, but sample A is clearly more representative, and labeling A benefits learning more. For the samples in the boundary region, the distribution of the surrounding samples must be taken into account so that the most representative samples are labeled.
A specific embodiment is described below, comprising the following steps:
(1) data are divided
A random function is called to shuffle the data, and the data are divided; the ratio labeled data set : unlabeled data set : test set can be set to 1 : 69 : 30, i.e., 1% of the data serve as labeled data, 69% as unlabeled data, and 30% for testing. In each iteration, samples from the 69% unlabeled set are selectively added to the 1% labeled set (the training set): each iteration of the active learning method selects from the unlabeled set, picking out the most valuable, most meaningful samples for labeling. The labeled samples are added to the training set, a new classifier is trained and tested on the test set, and the classifier's performance after each addition is compared.
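One way to realize this division (the 1:69:30 proportions follow the text; the function name and seeding are illustrative):

```python
import numpy as np

def split_1_69_30(X, y, seed=0):
    """Shuffle with a random function, then split into 1% labeled,
    69% unlabeled, and 30% test data (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_lab, n_unlab = int(0.01 * len(X)), int(0.69 * len(X))
    lab, unlab = idx[:n_lab], idx[n_lab:n_lab + n_unlab]
    test = idx[n_lab + n_unlab:]
    return (X[lab], y[lab]), X[unlab], (X[test], y[test])
```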
(2) Compute the uncertainty of the unlabeled data and divide it into different regions by uncertainty
A naive Bayes classifier is constructed and, following the margin strategy, the difference of the posterior probabilities is computed to determine the uncertainty of each unlabeled sample:
D_value(x) = p(y_first | x, L) - p(y_second | x, L)   (1)
Based on the magnitude of the uncertainty, the whole sample space is divided into three regions: the positive region, the negative region, and the boundary region.
Division method one: divide according to thresholds threshold_α and threshold_β, defined as follows:
if D_value(x) ≤ threshold_α, then x ∈ POS(X)
if threshold_α < D_value(x) < threshold_β, then x ∈ BND(X)   (2)
if D_value(x) ≥ threshold_β, then x ∈ NEG(X)
where 0 ≤ threshold_α < threshold_β ≤ 1. The thresholds threshold_α and threshold_β can be determined from empirical values or a chosen confidence level. Take binary classification as an example, with threshold_α = 0.05 and threshold_β = 0.95.
When D_value = 0.05: solving the equations p(y_1|x) + p(y_2|x) = 1 and p(y_1|x) - p(y_2|x) = 0.05 gives p(y_1|x) = 0.525 and p(y_2|x) = 0.475; that is, when the posterior probabilities of the two classes are 0.525 and 0.475, the sample is divided into the positive region and its class is regarded as completely undeterminable.
When D_value = 0.95: solving p(y_1|x) + p(y_2|x) = 1 and p(y_1|x) - p(y_2|x) = 0.95 gives p(y_1|x) = 0.975 and p(y_2|x) = 0.025; that is, when the posterior probabilities of the two classes are 0.975 and 0.025, the sample is divided into the negative region and its class is regarded as determinable.
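Division method one is a direct transcription of formula (2) (a sketch; the threshold defaults follow the worked example above):

```python
import numpy as np

def three_way_partition(d_value, alpha=0.05, beta=0.95):
    """Formula (2): split sample indices into POS / BND / NEG by D_value."""
    pos = np.where(d_value <= alpha)[0]                      # label immediately
    bnd = np.where((d_value > alpha) & (d_value < beta))[0]  # decide further
    neg = np.where(d_value >= beta)[0]                       # leave unprocessed
    return pos, bnd, neg
```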
Division method two (see Fig. 4): sort the samples by the uncertainty measure D_value(x) from small to large and divide the sample space by controlling the rank: the first top-k samples belong to the positive region, the last top-k samples belong to the negative region, and the middle part belongs to the boundary region.
Taking k = selectNum as an example, and letting n be the number of unlabeled samples:
if top(x) ≤ selectNum, then x ∈ POS(X)
if selectNum < top(x) < n - selectNum, then x ∈ BND(X)
if top(x) ≥ n - selectNum, then x ∈ NEG(X)
where selectNum is the number of samples expected to be labeled in each iteration, and top(x) is a function returning the rank of sample x in the sorted queue.
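Division method two might look as follows (a sketch; the exact cut-offs for the middle and tail ranks are an assumption reconstructed from the text):

```python
import numpy as np

def three_way_partition_by_rank(d_value, select_num):
    """Division method two: sort by D_value ascending; the first select_num
    ranks form POS, the last select_num form NEG, the middle forms BND."""
    order = np.argsort(d_value)           # top(x): rank in the sorted queue
    pos = order[:select_num]              # most uncertain -> positive region
    neg = order[-select_num:]             # most certain  -> negative region
    bnd = order[select_num:-select_num]   # middle part   -> boundary region
    return pos, bnd, neg
```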
(3) Corresponding processing of the data in the different regions
Positive region, i.e. the sample set with x ∈ POS(X): append directly to the sequence to be labeled and delete from the unlabeled data. Negative region, i.e. the sample set with x ∈ NEG(X): labeling such samples is of little significance, so no processing is done. Boundary region, x ∈ BND(X): whether to label must be determined further, in the three steps below (a consolidated sketch follows step 3).
1. Compute the pairwise distances between samples.
If an attribute is continuous, the Euclidean distance is used, defined for samples X = {x_1, ..., x_j, ..., x_m} and Y = {y_1, ..., y_j, ..., y_m} as
dis(X, Y) = sqrt( Σ_{j=1}^{m} (x_j - y_j)² )   (3)
If an attribute is discrete, the Value Difference Metric (VDM) is used, defined as follows: let V_1 and V_2 be two values of a discrete attribute taken by samples x_1 and x_2; then
VDM(V_1, V_2) = Σ_i | C_{1i}/C_1 - C_{2i}/C_2 |^K
where C_1 is the number of samples whose value on this attribute is V_1, C_{1i} the number of those belonging to class i, C_2 the number of samples whose value is V_2, C_{2i} the number of those belonging to class i, and K is a constant, usually taken as 1.
2. Determine the neighborhood radius of a sample:
δ = min(dis(x_i, s)) + w × range(dis(x_i, s)), 0 ≤ w ≤ 1   (5)
where min(dis(x_i, s)) is the distance from x_i to its nearest neighbor, range(dis(x_i, s)) is the span of its distances within the specified data set, and w controls the size of the radius.
3. Select the most representative samples.
Because a smaller D_value means greater uncertainty, the opposite of D_value is taken so that a larger value means greater uncertainty, and 1 is added to avoid negative values, giving the uncertainty weight 1 - D_value(x). The representative point is then defined by weighting the cosine similarities of the unlabeled δ-neighbors by this uncertainty weight, where N is the number of samples x_k for which dis(x, x_k) ≤ δ holds. The samples are sorted by representativeness in descending order, and the top-k samples are added to the sequence select to be labeled.
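Steps 1-3 for the boundary region might be consolidated as below (a sketch under stated assumptions: continuous attributes only, hence Euclidean distance; the representativeness score is taken as (1 - D_value(x)) times the summed cosine similarity of the δ-neighbors, one plausible reading of the definition above):

```python
import numpy as np

def boundary_topk(X_bnd, d_value_bnd, w=0.2, k=10):
    """Boundary-region processing, steps 1-3 (illustrative sketch)."""
    n = len(X_bnd)
    # 1. pairwise Euclidean distances, formula (3)
    diff = X_bnd[:, None, :] - X_bnd[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    d_off = np.where(np.eye(n, dtype=bool), np.nan, dist)  # mask self-distances

    # 2. per-sample neighborhood radius, formula (5), with 0 <= w <= 1
    d_min = np.nanmin(d_off, axis=1)                 # nearest-neighbor distance
    delta = d_min + w * (np.nanmax(d_off, axis=1) - d_min)

    # 3. representativeness: uncertainty weight times cosine similarity
    #    summed over the N samples inside the delta-neighborhood (assumed form)
    norms = np.linalg.norm(X_bnd, axis=1) + 1e-12
    cosine = (X_bnd @ X_bnd.T) / np.outer(norms, norms)
    in_nbhd = d_off <= delta[:, None]                # nan <= x is False: self excluded
    rep = (1.0 - d_value_bnd) * np.where(in_nbhd, cosine, 0.0).sum(axis=1)

    return np.argsort(rep)[::-1][:k]                 # descending; top-k to label
```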
(4) Label the sequence to be labeled and create a new classifier
The samples selected from each region are merged, i.e., their union is taken as the sequence select to be labeled; it is handed to an expert for labeling, the labeled samples are added to the training set, and a new classifier is created.
(5) Test the results
The active learning algorithm is an iterative process: in each iteration, selectNum samples are selected, added to the set to be labeled, and labeled; the training set is updated, a new classifier is created, and its classification performance is tested on the test set. Unlabeled data are added iteration by iteration until the preset number of iterations or the evaluation criterion is met. The evaluation index here can be accuracy, ROC, F-value, and so on. To make the experimental results more reliable, the data are tested, e.g., 10 times by the random-division method, and the average of the 10 results is finally taken.
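The repeated evaluation might be realized as follows (a sketch; scikit-learn's metrics stand in for the accuracy, ROC, and F-value indices named above, `run_active_learning` is a hypothetical driver returning a fitted classifier, and `split_1_69_30` is the split sketch from part (1)):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_over_splits(X, y, run_active_learning, n_runs=10):
    """Average accuracy / F-value / ROC AUC over n_runs random divisions
    (binary classification assumed for the ROC computation)."""
    scores = []
    for seed in range(n_runs):
        (X_lab, y_lab), X_unlab, (X_te, y_te) = split_1_69_30(X, y, seed=seed)
        clf = run_active_learning(X_lab, y_lab, X_unlab, X_te, y_te)
        y_pred = clf.predict(X_te)
        scores.append((accuracy_score(y_te, y_pred),
                       f1_score(y_te, y_pred),
                       roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])))
    return np.mean(scores, axis=0)   # mean over the repeated tests
```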
The above embodiments are to be understood as merely illustrating the present invention and not limiting its scope. After reading the contents recorded herein, a person skilled in the art can make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the invention.

Claims (8)

1. An active learning method based on three-way decision theory, characterized by comprising the following steps:
1) obtaining a data set and calling a random function to shuffle it, then dividing it proportionally into a labeled data set, an unlabeled data set, and a test set;
2) training a naive Bayes classifier on the labeled data set, estimating posterior probabilities of the unlabeled data with the naive Bayes classifier, and computing the uncertainty of each unlabeled sample;
3) dividing the unlabeled data by the magnitude of their uncertainty into three regions, namely the positive region, the negative region, and the boundary region, wherein two division methods are involved: division by thresholds and division by sorting the uncertainty values;
4) handling the samples in the different regions of step 3) separately, selecting the informative, valuable samples for labeling, adding the labeled samples to the training set, and training a new classifier for result testing on the test set, wherein different evaluation criteria are used for verification in order to test classifier performance.
2. The active learning method based on three-way decision theory according to claim 1, characterized in that step 2) uses the margin strategy to estimate posterior probabilities of the unlabeled data, computes the difference of the posterior probabilities, and determines the uncertainty of each unlabeled sample:
D_value(x) = p(y_first | x, L) - p(y_second | x, L)   (1)
where D_value(x) represents the magnitude of the uncertainty, p(y_first | x, L) is the largest posterior probability, and p(y_second | x, L) is the second-largest posterior probability.
3. The active learning method based on three-way decision theory according to claim 2, characterized in that dividing the unlabeled data into three regions by thresholds in step 3) specifically comprises: dividing according to thresholds threshold_α and threshold_β, defined as follows:
if D_value(x) ≤ threshold_α, then x ∈ POS(X)
if threshold_α < D_value(x) < threshold_β, then x ∈ BND(X)
if D_value(x) ≥ threshold_β, then x ∈ NEG(X)
where 0 ≤ threshold_α < threshold_β ≤ 1, and threshold_α and threshold_β can be determined from empirical values or a chosen confidence level.
4. The active learning method based on three-way decision theory according to claim 2, characterized in that dividing by sorting the uncertainty values in step 3) specifically comprises: sorting the samples by the uncertainty measure D_value(x) from small to large and dividing the sample space by controlling the rank, wherein the first top-k samples belong to the positive region, the last top-k samples belong to the negative region, and the middle part belongs to the boundary region, the value of k being determined by the selection quantity selectNum.
5. The active learning method based on three-way decision theory according to claim 4, characterized in that handling the samples in the different regions of step 3) separately in step 4) specifically comprises the steps of: for the positive region, i.e. the sample set with x ∈ POS(X), appending the samples directly to the sequence to be labeled and deleting them from the unlabeled data; for the negative region, i.e. the sample set with x ∈ NEG(X), doing no processing of such samples; and for the boundary region, x ∈ BND(X), further determining whether to label, comprising the steps of: determining the neighborhood radius of the boundary-region samples from the pairwise distances between samples; computing the neighborhood density and selecting the most representative samples; and sorting the samples by representativeness in descending order and adding the top-k samples to the sequence select to be labeled.
6. The active learning method based on three-way decision theory according to claim 5, characterized in that computing the pairwise distances between boundary-region samples comprises: if an attribute is continuous, using the Euclidean distance, defined for samples X = {x_1, x_2, ..., x_j, ..., x_m} and Y = {y_1, y_2, ..., y_j, ..., y_m} as
dis(X, Y) = sqrt( Σ_{j=1}^{m} (x_j - y_j)² )
where x_j is the j-th attribute of sample X and y_j the j-th attribute of sample Y; and if an attribute is discrete, using the Value Difference Metric (VDM), defined as follows: let V_1 and V_2 be two values of a discrete attribute taken by samples x_1 and x_2; then
VDM(V_1, V_2) = Σ_i | C_{1i}/C_1 - C_{2i}/C_2 |^K
where C_1 is the number of samples whose value on this attribute is V_1, C_{1i} the number of those belonging to class i, C_2 the number of samples whose value on this attribute is V_2, C_{2i} the number of those belonging to class i, and K is a constant, usually taken as 1.
7. The active learning method based on three-way decision theory according to claim 5, characterized in that the formula for determining the neighborhood radius of a sample is:
δ = min(dis(x_i, s)) + w × range(dis(x_i, s)), 0 ≤ w ≤ 1   (5)
where min(dis(x_i, s)) is the distance from x_i to its nearest neighbor, range(dis(x_i, s)) is the span of its distances within the specified data set, and w controls the size of the radius; the δ-neighborhood of x_i is defined as δ(x_i) = { x_j ∈ U | dis(x_i, x_j) ≤ δ }, where δ is a predefined distance threshold.
8. The active learning method based on three-way decision theory according to claim 5, characterized in that the representative point is defined as follows: D_value(x) represents the magnitude of the uncertainty, a smaller value meaning greater uncertainty, so 1 - D_value(x) serves as the uncertainty weight; x_k denotes the unlabeled samples within the neighborhood radius, and N is the number of x_k for which dis(x, x_k) ≤ δ holds; assuming sample x and sample x_k have attribute vectors x = {x_1, x_2, ..., x_j, ..., x_m} and x_k = {x_k^1, x_k^2, ..., x_k^j, ..., x_k^m}, the similarity of the two is computed with the cosine formula
cos(x, x_k) = ( Σ_{j=1}^{m} x_j x_k^j ) / ( sqrt(Σ_{j=1}^{m} x_j²) × sqrt(Σ_{j=1}^{m} (x_k^j)²) )
and the representativeness of x is obtained by weighting the cosine similarities of the N neighbors by the uncertainty weight.
CN201710326684.8A 2017-05-10 2017-05-10 An active learning method based on three-way decision theory Pending CN107273912A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710326684.8A CN (en) An active learning method based on three-way decision theory


Publications (1)

Publication Number Publication Date
CN107273912A true CN107273912A (en) 2017-10-20

Family

ID=60074134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710326684.8A Pending An active learning method based on three-way decision theory

Country Status (1)

Country Link
CN (1) CN107273912A (en)


Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110058576A (en) * 2018-01-19 2019-07-26 临沂矿业集团有限责任公司 Equipment fault prognostics and health management method based on big data
CN108875768A (en) * 2018-01-23 2018-11-23 北京迈格威科技有限公司 Data mask method, device and system and storage medium
CN109543707A (en) * 2018-09-29 2019-03-29 南京航空航天大学 Semi-supervised change level Software Defects Predict Methods based on three decisions
CN109543707B (en) * 2018-09-29 2020-09-25 南京航空航天大学 Semi-supervised change-level software defect prediction method based on three decisions
CN109820479A (en) * 2019-01-08 2019-05-31 西北大学 A kind of fluorescent molecular tomography feasible zone optimization method
CN109977994B (en) * 2019-02-02 2021-04-09 浙江工业大学 Representative image selection method based on multi-example active learning
CN109977994A (en) * 2019-02-02 2019-07-05 浙江工业大学 A kind of presentation graphics choosing method based on more example Active Learnings
CN110784481A (en) * 2019-11-04 2020-02-11 重庆邮电大学 DDoS detection method and system based on neural network in SDN network
CN110784481B (en) * 2019-11-04 2021-09-07 重庆邮电大学 DDoS detection method and system based on neural network in SDN network
CN111914061B (en) * 2020-07-13 2021-04-16 上海乐言科技股份有限公司 Radius-based uncertainty sampling method and system for text classification active learning
CN111914061A (en) * 2020-07-13 2020-11-10 上海乐言信息科技有限公司 Radius-based uncertainty sampling method and system for text classification active learning
CN112365120A (en) * 2020-09-29 2021-02-12 重庆邮电大学 Intelligent business strategy generation method based on three decisions
CN112365120B (en) * 2020-09-29 2022-05-03 重庆邮电大学 Intelligent business strategy generation method based on three decisions
CN113240007A (en) * 2021-05-14 2021-08-10 西北工业大学 Target feature selection method based on three-branch decision
CN113240007B (en) * 2021-05-14 2024-05-14 西北工业大学 Target feature selection method based on three decisions
CN113327131A (en) * 2021-06-03 2021-08-31 太原理工大学 Click rate estimation model for feature interactive selection based on three-branch decision theory
CN114927239A (en) * 2022-04-21 2022-08-19 厦门大学 Decision rule automatic generation method and system applied to medicine analysis
CN114927239B (en) * 2022-04-21 2024-07-02 厦门大学 Automatic decision rule generation method and system applied to drug analysis
CN116452320A (en) * 2023-04-12 2023-07-18 西南财经大学 Credit risk prediction method based on continuous learning
CN116452320B (en) * 2023-04-12 2024-04-30 西南财经大学 Credit risk prediction method based on continuous learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171020