CN107273912A - An active learning method based on three-way decision theory - Google Patents
An active learning method based on three-way decision theory
- Publication number
- CN107273912A (Application No. CN201710326684.8A)
- Authority
- CN
- China
- Prior art keywords
- sample
- data
- value
- active learning
- threshold
- Prior art date
- Legal status
- Pending
Classifications
- G—PHYSICS › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/22—Matching criteria, e.g. proximity measures
- G—PHYSICS › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N20/00—Machine learning
Abstract
An active learning algorithm based on three-way decisions is claimed in the present invention, which uses the idea of three-way decisions to address the problem of currently unlabeled samples. It relates to fields such as rough sets and data mining. First, the uncertainty of the unlabeled data is determined by the margin strategy. The unlabeled data are then divided by uncertainty into three different regions: the positive region, the negative region, and the boundary region. The data in each region are handled with a corresponding strategy. The purpose is to select highly informative and strongly representative data for labeling. The labeled data are added to the training set and a new classifier is created. Training iterates until the preset number of iterations is reached or the desired evaluation criterion is met. The present invention can effectively improve all aspects of classifier performance.
Description
Technical field
The invention belongs to the fields of rough sets, machine learning, and data mining, and in particular relates to an active learning method.
Background technology
In data analysis and processing, it is necessary to train a model from known data (a training set). In concrete practice we find that data can be obtained easily and efficiently, but these data are usually unlabeled. With the arrival of the big-data era, data influence every aspect of people's lives, yet these data are redundant, cumbersome, and unlabeled. Obtaining valuable labeled data directly is often infeasible: not only are the conditions not met, but it also requires considerable funds and time. We can, however, start from these unlabeled data; how to process them and bring their latent value into full play raises new requirements for existing technology.
Active learning is a kind of machine learning method that can effectively solve the above problems. An active learning algorithm selects the most useful data and hands it to an expert for labeling, expanding the training set and creating a more effective model. Compared with traditional passive learning, this method can select highly informative, representative data for labeling, avoiding redundant and unnecessary additions of data. At the same time, it reduces the manpower and material resources needed to label data in bulk, lowering the cost of data analysis.
In 1974, Simon had already proposed ideas related to active learning. Valiant demonstrated from a statistical perspective that selecting training examples can effectively reduce the data needed for training. In recent years, more and more scholars have locked onto active learning as a research direction and proposed related concepts and theories of active learning. Compared with passive learning, which randomly draws data, an active learning algorithm selects useful samples for the user to label rather than passively receiving data. According to experimental results, under the premise of achieving the same accuracy, active selection can greatly reduce the number of samples required compared with random selection. The execution of an active learning algorithm can be divided into two processes. First: a sample selection algorithm picks out the most informative, most valuable samples for labeling, and the labeled samples are added to the labeled set. Second: a base classifier is created from the existing labeled data, a suitable evaluation index is selected, and the classification performance of the classifier is measured by supervised learning. The two processes are performed alternately; through continuous iteration the performance of the classifier is made optimal, or the number of iterations is set and the loop runs until the preset condition is reached.
According to the form in which unlabeled data are selected, active learning can be divided into two classes: stream-based active learning and pool-based active learning.
Stream-based active learning: this method is widely applied in natural language processing directions such as part-of-speech tagging and word-sense disambiguation. The learning algorithm judges the samples one by one as they arrive in stream form; the judgment has only two possible outcomes: label or do not label. The samples that need labeling are handed to an expert for labeling; the samples not labeled are discarded directly and never used again. A representative method is Query By Committee (QBC).
Pool-based active learning: whereas stream-based active learning requires sequential sampling and one-by-one judgment, this method judges the data of a certain range as a whole and selects the top-k samples for labeling according to some index. Of course, the samples in the pool can also be handled with the stream-based learning method, drawing a small number of samples from the pool each time and judging them individually. Pool-based active learning is currently the most studied, most attended-to, and most widely applied method.
Sample selection strategies for active learning:
1. Uncertainty-based sample selection strategy
Through an uncertainty measure, the samples whose class is hardest to determine are picked out and given to a human expert for labeling. For binary classification problems, a probabilistic model can be created and the samples whose posterior probability value is closest to 0.5 are selected for labeling. For multi-class problems, the samples with low confidence are often selected, or, based on the margin, the samples with the smallest difference between the largest posterior probability and the second-largest posterior probability. In addition, the uncertainty of a sample can be computed from information entropy: the larger the entropy, the larger the uncertainty.
2. Sample selection strategy based on expected error reduction
Each unlabeled sample is placed into the set to be labeled as if labeled and then added to the training set, a new classifier is trained, and the sample that reduces the classifier's expected error the most is selected for labeling. Because estimating the expected error reduction requires a very large amount of computation, this method is usually applied only to binary classification problems.
3. Sample selection strategy based on committee query
The committee-query method creates several models to constitute a committee; these models each make a decision on the unlabeled samples, and the samples on which the decisions are least consistent are regarded as the samples whose class can least be determined. Adding such samples to the set to be labeled for labeling can minimize the version space.
Three-way decisions were developed by Yao Y. Y. on the basis of probabilistic rough sets and decision-theoretic rough sets. The probabilistic rough set model introduces two parameters α and β and divides the whole space into three regions: the positive region, the negative region, and the boundary region. Yao Y. Y. first proposed the related concepts of three-way decisions: the positive region means acceptance of a thing, the negative region means rejection of a thing, and the boundary region means there is no full assurance about a thing and no decision can be made immediately. Three-way decisions endow probabilistic rough sets with new semantics, and their appearance provides a new way of thinking for decision problems.
The content of the invention
The present invention seeks to address the above problems in the prior art by proposing an active learning method based on three-way decision theory that better improves all aspects of classifier performance and can be used effectively to handle the current situation in which class labels are massively missing. The technical scheme of the invention is as follows:
An active learning method based on three-way decision theory, comprising the following steps:
1) obtain a data set and call a random function to shuffle it; the data set is divided proportionally into a labeled data set, an unlabeled data set, and a test set;
2) train a naive Bayes classifier with the labeled data set, carry out posterior probability estimation on the unlabeled data with the naive Bayes classifier, and compute the uncertainty of each unlabeled datum;
3) divide the unlabeled data into three regions, i.e., the positive region, the negative region, and the boundary region, according to the magnitude of uncertainty; two division methods are involved: division by thresholds and division by sorting on uncertainty;
4) handle the samples in the different regions of step 3) separately, select informative and valuable samples for labeling, and add the labeled samples to the training set, so that a new classifier is trained and result testing is carried out on the test set; to test the performance of the classifier, verify with different evaluation criteria.
Further, step 2) uses the margin strategy to carry out the posterior probability estimation on the unlabeled data, computing the difference of the posterior probabilities to determine the uncertainty of each unlabeled datum:

D_value(x) = p(y_first|x, L) - p(y_second|x, L)   (1)

where D_value(x) represents the magnitude of uncertainty, p(y_first|x, L) is the largest posterior probability value, and p(y_second|x, L) is the second-largest posterior probability value.
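As an illustration of formula (1), here is a minimal sketch (assuming scikit-learn's GaussianNB as the base classifier, which the text does not mandate) that computes D_value from the two largest posterior probabilities:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def margin_uncertainty(clf, X_unlabeled):
    """D_value(x) = p(y_first|x, L) - p(y_second|x, L); smaller = more uncertain."""
    proba = clf.predict_proba(X_unlabeled)   # normalized posterior probabilities
    top2 = np.sort(proba, axis=1)[:, -2:]    # per row: [second-largest, largest]
    return top2[:, 1] - top2[:, 0]           # largest minus second-largest

# usage: clf = GaussianNB().fit(X_labeled, y_labeled)
#        d_value = margin_uncertainty(clf, X_unlabeled)
```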
Further, dividing the unlabeled data into three regions by thresholds in step 3) specifically comprises: divide by the thresholds threshold_α and threshold_β, defined as follows:

If D_value(x) ≤ threshold_α, then x ∈ POS(X)
If threshold_α < D_value(x) < threshold_β, then x ∈ BND(X)   (2)
If D_value(x) ≥ threshold_β, then x ∈ NEG(X)

where 0 ≤ threshold_α < threshold_β ≤ 1; threshold_α and threshold_β can be determined from experience or from a set confidence level.
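A sketch of the threshold division in rule (2); the defaults 0.05 and 0.95 are only the illustrative values used later in the text:

```python
import numpy as np

def three_way_split(d_value, alpha=0.05, beta=0.95):
    """Indices of the positive, boundary and negative regions per rule (2)."""
    d_value = np.asarray(d_value)
    pos = np.where(d_value <= alpha)[0]                      # most uncertain: label
    bnd = np.where((d_value > alpha) & (d_value < beta))[0]  # decide further
    neg = np.where(d_value >= beta)[0]                       # class determinable
    return pos, bnd, neg
```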
Further, division by sorting on uncertainty in step 3) specifically comprises: sort by uncertainty in ascending order and divide the sample space by rank; the first top-k samples belong to the positive region, the last top-k samples belong to the negative region, and the middle part belongs to the boundary region; the value of k is governed by the selection quantity selectNum.
Further, handling the samples in the different regions of step 3) separately in step 4) specifically comprises the steps of:
for the positive region, i.e., the sample set with x ∈ POS(X): append it directly to the sequence to be labeled and delete it from the unlabeled data; for the negative region, i.e., the sample set with x ∈ NEG(X): do no processing on such samples; for the boundary region: samples with x ∈ BND(X) need a further decision on whether to label, comprising the steps of: determine the neighborhood radius of the boundary-region samples from the pairwise distances between samples; compute the neighborhood density and select the most representative samples; sort the representative samples in descending order and add the top-k samples to the sequence select to be labeled.
Further, computing the pairwise distances between boundary-region samples includes the following. If an attribute is continuous, the Euclidean distance is used, defined as follows: assume samples X = {x_1, x_2, ..., x_j, ..., x_m} and Y = {y_1, y_2, ..., y_j, ..., y_m}; then

dis(X, Y) = sqrt( Σ_{j=1..m} (x_j - y_j)² )

where x_j is the j-th attribute of sample X and y_j is the j-th attribute of sample Y. If an attribute is discrete, the VDM is used, defined as follows: assume V_1 and V_2 are two values that samples x_1 and x_2 take on a discrete attribute; then

VDM(V_1, V_2) = Σ_i | C_{1i}/C_1 - C_{2i}/C_2 |^K

where C_1 is the number of samples whose value on this attribute is V_1, C_{1i} is the number of those whose class is i, C_2 is the number of samples whose value on this attribute is V_2, C_{2i} is the number of those whose class is i, and K is a constant, usually taken as 1.
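A sketch of the two distance measures (assuming continuous attributes arrive as numeric arrays; the vdm body is the standard Value Difference Metric the text describes):

```python
import numpy as np
from collections import Counter

def euclidean(x, y):
    """dis(X, Y) = sqrt(sum_j (x_j - y_j)^2) for continuous attributes."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def vdm(v1, v2, values, labels, k=1):
    """VDM between two values v1, v2 of one discrete attribute, given that
    attribute's column `values` and the class column `labels`; K = 1."""
    if v1 == v2:
        return 0.0
    c1, c2 = Counter(), Counter()
    n1 = n2 = 0
    for v, y in zip(values, labels):
        if v == v1:
            n1 += 1
            c1[y] += 1
        elif v == v2:
            n2 += 1
            c2[y] += 1
    return sum(abs(c1[c] / max(n1, 1) - c2[c] / max(n2, 1)) ** k
               for c in set(labels))
```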
Further, the formula determining the neighborhood radius of a sample is:

δ = min(dis(x_i, s)) + w × range(dis(x_i, s)), 0 ≤ w ≤ 1   (5)

where min(dis(x_i, s)) is the distance to the sample nearest x_i, range(dis(x_i, s)) is the span of its distances within the specified data set, and w controls the size of the radius. The δ-neighborhood of x_i is defined as δ(x_i) = { x_j ∈ U | dis(x_i, x_j) ≤ δ }, where δ is a predefined distance threshold.
Further, the representativeness of a point is defined as follows: D_value(x) represents the magnitude of uncertainty, a smaller value meaning larger uncertainty; x_k is an unlabeled sample within the neighborhood radius; N is the number of x_k for which dis(x, x_k) ≤ δ holds. Assume the attribute vectors of sample x and sample x_k are x = {x_1, x_2, ..., x_j, ..., x_m} and x_k = {x_k^1, x_k^2, ..., x_k^j, ..., x_k^m}; the similarity of the two is computed with the cosine formula, which is as follows:

cos(x, x_k) = ( Σ_{j=1..m} x_j · x_k^j ) / ( sqrt(Σ_{j=1..m} x_j²) · sqrt(Σ_{j=1..m} (x_k^j)²) )
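A sketch of the neighborhood radius (5) and the representativeness score; since the original gives the representativeness formula only as a figure, the combination below — (1 - D_value(x)) weighted by the summed cosine similarity over the N in-radius neighbors — is an assumed reconstruction from the surrounding text:

```python
import numpy as np

def neighborhood_radius(dists, w=0.5):
    """delta = min(dis) + w * range(dis), 0 <= w <= 1 (formula (5));
    `dists` holds the distances from x_i to every other sample in the set."""
    d = np.asarray(dists, float)
    d = d[d > 0]                              # ignore the zero self-distance
    return float(d.min() + w * (d.max() - d.min()))

def representativeness(x, d_value_x, neighbors):
    """Assumed form: (1 - D_value(x)) times the summed cosine similarity
    to the N unlabeled neighbors x_k with dis(x, x_k) <= delta."""
    x = np.asarray(x, float)
    sim = 0.0
    for xk in neighbors:
        xk = np.asarray(xk, float)
        den = np.linalg.norm(x) * np.linalg.norm(xk)
        sim += float(np.dot(x, xk) / den) if den > 0 else 0.0
    return (1.0 - d_value_x) * sim
```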
Advantages and beneficial effects of the present invention:
The present invention applies three-way decision theory to active learning, dividing the sample space by the magnitude of uncertainty into three regions: the positive region, the boundary region, and the negative region. Two division methods are proposed: method one divides by thresholds, and method two divides by sorting on uncertainty. Different treatments are chosen for the samples in the different regions: samples in the positive region are appended directly to the sequence select to be labeled; samples in the negative region are left unprocessed; for samples in the boundary region, the neighborhood density is computed on the neighborhood to determine representativeness, and the top-k samples are added to the sequence select; finally an expert labels the samples in select. This method of selecting samples in a targeted way according to their individual differences can pick out the samples genuinely worth labeling more finely and thus better improves all aspects of classifier performance, such as classification accuracy, ROC, and F-value. Extending three-way decisions to active learning makes the method effective for handling the current situation in which class labels are massively missing.
Brief description of the drawings
Fig. 1 is the active learning flow chart of the preferred embodiment provided by the present invention;
Fig. 2 is a schematic diagram of dividing the positive region and the negative region;
Fig. 3 is a schematic diagram of selecting representative samples in the boundary region;
Fig. 4 is a schematic diagram of region division based on uncertainty;
Fig. 5 is the flow chart of active learning based on three-way decisions.
Embodiment
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme of the present invention for solving the above technical problems comprises the following steps:
Step 1: Divide the experimental data into labeled data (the training set), unlabeled data, and data to be tested (the test set).

Step 2: Train a base classifier, a naive Bayes classifier, on the existing labeled data, i.e., the training set, and use the created classifier to estimate the posterior probability values of each sample in the unlabeled data. If the classifier is a binary one, subtract the posterior probability of the other class from the largest posterior probability; for multi-class problems, select the largest and the second-largest posterior probability values and take their difference D_value.

Step 3: Based on uncertainty, divide the whole unlabeled data space into three different regions according to thresholds. Data with a small D_value are regarded as highly uncertain; they are divided directly into the positive region, added to the sequence select to be labeled, left waiting for a human expert to label them, and deleted from the unlabeled data set. Data with a large D_value are regarded as having a determinable class; they are divided into the negative region and left unprocessed. The remaining data are divided into the boundary region. Labeling the boundary-region data is not entirely meaningless; to select the data with more labeling value, whether to label them must be determined further.

Step 4: Process the points in the boundary region. For the data in the boundary region, to make labeling more valuable, the concept of a neighborhood is introduced and the density of the unlabeled data is computed within the neighborhood. The distribution of the samples is taken into account, and the most representative data are selected.

Step 5: Re-sort by representativeness and informativeness, and add the top-k to the data set select to be labeled.

Step 6: Give the data in select to a domain expert for labeling; update the training set and the unlabeled data set according to the result of select. Create a new classifier with the updated training set and test the results on the test set. Repeat Steps 2-6 until the preset iteration condition is met or the required performance metric is reached; a minimal end-to-end sketch of this loop is given below.
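To make the loop concrete, here is a minimal end-to-end sketch (assuming scikit-learn, the threshold division of method one, an oracle `query_expert` standing in for the human expert, and the `margin_uncertainty`/`three_way_split` helpers sketched earlier; the boundary-region handling is simplified):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def active_learning_loop(X_lab, y_lab, X_unlab, X_test, y_test, query_expert,
                         select_num=10, iters=20, alpha=0.05, beta=0.95):
    clf = GaussianNB().fit(X_lab, y_lab)                 # Step 2: base classifier
    for _ in range(iters):                               # Step 6: iterate
        d = margin_uncertainty(clf, X_unlab)             # Step 2: D_value
        pos, bnd, _ = three_way_split(d, alpha, beta)    # Step 3: three regions
        # Steps 4-5: the positive region goes straight into `select`; if it is
        # short, top up from the boundary region (here simply the boundary
        # samples with the smallest D_value, standing in for the
        # density/representativeness ranking of the text)
        chosen = list(pos[:select_num])
        if len(chosen) < select_num and len(bnd) > 0:
            extra = bnd[np.argsort(d[bnd])][:select_num - len(chosen)]
            chosen.extend(extra.tolist())
        if not chosen:
            break                                        # nothing left worth labeling
        chosen = np.array(chosen)
        y_new = query_expert(X_unlab[chosen])            # Step 6: expert labels
        X_lab = np.vstack([X_lab, X_unlab[chosen]])
        y_lab = np.concatenate([y_lab, y_new])
        X_unlab = np.delete(X_unlab, chosen, axis=0)
        clf = GaussianNB().fit(X_lab, y_lab)             # retrain on updated set
        print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
    return clf
```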
Further, in Step 1: to better estimate the generalization performance of the learner, note that test results obtained from a single data split are often not stable or reliable enough and are hardly convincing. Selecting several random splits, repeating the experiment, and finally averaging clearly yields a more reasonable, more convincing assessment. When experimenting, the number of repetitions is specified and the test results of the different data splits are observed in each run. Therefore, when dividing the data, a random function must be called to realize a random split of the data.
Further, in Step 2: a naive Bayes classifier is created. Naive Bayes is premised on conditional attribute independence; it assumes that the influence of each attribute on the classification result is independent of the other attributes. This avoids solving for the class-conditional probability P(x|c) as a joint distribution over all attributes: computing the joint probability directly from a finite sample would face a combinatorial explosion in computation and sample sparsity in the data, and if the data set has especially many attributes the problem becomes even more serious.

The principle by which the naive Bayes classifier determines the class of a sample is as follows:

Set of attribute features: x = {a_1, a_2, a_3, ..., a_m}
Set of class attributes: C = {y_1, y_2, y_3, ..., y_n}

p(y_i|x) = p(y_i) Π_{j=1..m} p(a_j|y_i) / p(x)

p(x) is the "evidence" factor used for normalization. For a given sample x, the evidence factor p(x) has nothing to do with the class attribute and does not change with the class value, so only the numerator needs to be maximized. The expression of the naive Bayes classifier is therefore usually defined as:

h(x) = argmax_{y_i ∈ C} p(y_i) Π_{j=1..m} p(a_j|y_i)

i.e., the y_i that maximizes the numerator is taken as the decision result. It can be found, however, that if the output values of this expression are used directly, i.e., the largest value p(y_first, x) is differenced with the second-largest value p(y_second, x), different probability values can produce identical differences. Taking binary classification as an example, scene one: p(y_first, x) and p(y_second, x) are 0.4 and 0.2, so D_value = p(y_first, x) - p(y_second, x) = 0.2; scene two: p(y_first, x) and p(y_second, x) are 0.5 and 0.3, so again D_value = 0.2. In fact, after dividing by the evidence factor, scene one gives 0.4/0.6 - 0.2/0.6 ≈ 0.333, while scene two gives 0.5/0.8 - 0.3/0.8 = 0.25. The difference in scene two is clearly smaller than in scene one, i.e., the uncertainty in scene two is somewhat larger. Therefore, when naive Bayes is chosen as the classifier, the probability values obtained from the expression above must be normalized with the evidence factor p(x); the raw output values cannot be differenced directly.

Further, regarding the computation of the conditional probability p(x_j|y_i): for a discrete attribute it is computed as p(x_j|y_i) = |D_{y_i,x_j}| / |D_{y_i}|, where D_{y_i,x_j} is the set of samples of class y_i whose j-th attribute takes the value x_j. For a continuous attribute, p(x_j|y_i) is assumed to follow a normal distribution, i.e., p(x_j|y_i) ~ N(μ_{y_i,j}, σ²_{y_i,j}), so the conditional probability for the continuous case is computed as

p(x_j|y_i) = (1 / (sqrt(2π) σ_{y_i,j})) exp( -(x_j - μ_{y_i,j})² / (2σ²_{y_i,j}) )   (3)
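As a small check of the normalization point above (a sketch; the scene numbers are the ones from the text, and the Gaussian density is the standard form of the formula marked (3)):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """p(x_j|y_i) for a continuous attribute under the normal assumption (3)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def normalized_margin(score_first, score_second):
    """Binary case: divide the two largest joint scores by the evidence
    factor p(x) before taking their difference."""
    evidence = score_first + score_second   # p(x) for two classes
    return (score_first - score_second) / evidence

print(normalized_margin(0.4, 0.2))  # scene one: 0.333...
print(normalized_margin(0.5, 0.3))  # scene two: 0.25 -> more uncertain
```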
Further, in Step 2 and Step 3: after the naive Bayes classifier is constructed, the most uncertain samples are selected for labeling. The margin strategy, a representative uncertainty method, is chosen here; the margin formula is as follows:

x* = argmin_x ( p(y_first|x, L) - p(y_second|x, L) )   (4)

where y_first is the class with the largest posterior probability and y_second the class with the second-largest posterior probability; the smaller the difference between the two values, the larger the uncertainty.

Taking binary classification (y, n) as an example: if the value of D_value is very small, the probabilities of the data belonging to class y and class n are close, and the classifier cannot make an accurate decision on the data. Labeling such data can improve classification performance to a large degree; this is the basis for selecting the samples most worth labeling by uncertainty. If the value of D_value is large, e.g., the posterior probability of class y is far larger than that of class n, then within the allowed error the classifier already has full confidence in the classification result of the data; in this case such data need no processing.
Further, in Step 3: based on uncertainty, the data are divided according to thresholds; two division methods are referred to here. Method one: determine threshold_α and threshold_β from experience or a set confidence level and divide accordingly. Method two: sort by uncertainty in ascending order and divide by rank. To better illustrate the division of the different regions, Fig. 2 is drawn to further explain the problem.
For sample A, the posterior probability values obtained by the classifier show that the largest probability is far greater than the second-largest, so its class is regarded as determinable. Even if A's density is very large, A is no longer labeled; A is now divided into the negative region.

For sample B, the largest and the second-largest posterior probability values are very close, so the probability that the classifier misjudges it is large. Although sample B's density is not as large as sample A's, labeling B is obviously more meaningful; B is now divided into the positive region.
Further, in Step 4: after the three-way division of the data based on uncertainty, most samples do not take the extreme values at the two ends; their uncertainty is in an intermediate state. If such a datum is known to have many unlabeled samples around it, the datum is representative, and labeling it can reduce the uncertainty of the surrounding samples and improve the performance of the classifier.

The processing of the data in the boundary region is illustrated in Fig. 3: the constructed classifier is uncertain about the classes of the unlabeled samples A and B, but sample A is clearly more representative, and labeling A benefits learning more. For the samples in the boundary region, the distribution of the surrounding samples must be taken into account so that the most representative samples are labeled.
A specific embodiment is described below, comprising the following steps:
(1) Dividing the data
Call a random function to shuffle the data and divide it; for example, set labeled data set : unlabeled data set : test set = 1 : 69 : 30, i.e., 1% of the data serves as labeled data, 69% as unlabeled data, and 30% for testing. Each iteration selectively adds data from the 69% unlabeled data set into the 1% labeled data set (the training set). Every iteration of the active learning method selects from the unlabeled data set, picking out the most valuable, most meaningful samples for labeling. The labeled samples are added to the training set, a new classifier is trained and tested on the test set, and the performance of the classifier after each addition is compared.
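A sketch of the 1 : 69 : 30 random split (assuming NumPy; the ratios are the example ones above):

```python
import numpy as np

def split_1_69_30(X, y, seed=0):
    """Random split into labeled (1%), unlabeled (69%) and test (30%) parts."""
    rng = np.random.default_rng(seed)        # the "random function" of the text
    idx = rng.permutation(len(X))
    n_lab = max(1, int(0.01 * len(X)))
    n_unlab = int(0.69 * len(X))
    lab, unlab = idx[:n_lab], idx[n_lab:n_lab + n_unlab]
    test = idx[n_lab + n_unlab:]
    return (X[lab], y[lab]), X[unlab], (X[test], y[test])
```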
(2) Computing the uncertainty of the unlabeled data and dividing the regions by uncertainty
Construct a naive Bayes classifier and, following the margin strategy, compute the difference of the posterior probabilities to determine the uncertainty of each unlabeled datum:

D_value(x) = p(y_first|x, L) - p(y_second|x, L)   (1)

Based on the magnitude of uncertainty, the whole sample space is divided into three regions: the positive region, the negative region, and the boundary region.
Division method one: divide by the thresholds threshold_α and threshold_β, defined as follows:

If D_value(x) ≤ threshold_α, then x ∈ POS(X)
If threshold_α < D_value(x) < threshold_β, then x ∈ BND(X)   (2)
If D_value(x) ≥ threshold_β, then x ∈ NEG(X)

where 0 ≤ threshold_α < threshold_β ≤ 1; threshold_α and threshold_β can be determined from experience or from a set confidence level. Taking binary classification as an example, let threshold_α = 0.05 and threshold_β = 0.95.

When D_value = 0.05, solving the equations p(y_1|x) + p(y_2|x) = 1 and p(y_1|x) - p(y_2|x) = 0.05 gives p(y_1|x) = 0.525 and p(y_2|x) = 0.475; i.e., when the posterior probabilities of the two classes are 0.525 and 0.475 respectively, the sample is divided into the positive region and its class is regarded as completely undeterminable.

When D_value = 0.95, solving the equations p(y_1|x) + p(y_2|x) = 1 and p(y_1|x) - p(y_2|x) = 0.95 gives p(y_1|x) = 0.975 and p(y_2|x) = 0.025; i.e., when the posterior probabilities of the two classes are 0.975 and 0.025 respectively, the sample is divided into the negative region and its class is regarded as determinable.
Division method two: as in Fig. 4, sort the samples by probability value in ascending order and divide the sample space by rank; the first top-k samples belong to the positive region, the last top-k samples belong to the negative region, and the middle part belongs to the boundary region. With k governed by selectNum, for example:

If top(x) ≤ k, then x ∈ POS(X)
If k < top(x) ≤ |U| - k, then x ∈ BND(X)
If top(x) > |U| - k, then x ∈ NEG(X)

where selectNum is the quantity expected to be labeled per iteration, top(x) is a defined function returning the rank of datum x in the sorted queue, and |U| is the number of unlabeled data.
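A sketch of division method two; tying k directly to selectNum is an assumption, since the text only says k is governed by it:

```python
import numpy as np

def three_way_split_by_rank(d_value, select_num):
    """Method two: first k ranks -> POS, last k -> NEG, middle -> BND."""
    order = np.argsort(d_value)              # ascending D_value = most uncertain first
    k = min(select_num, len(order) // 2)     # k governed by selectNum (assumed k = selectNum)
    pos = order[:k]                          # top(x) <= k
    bnd = order[k:len(order) - k]            # middle part
    neg = order[len(order) - k:]             # top(x) > |U| - k
    return pos, bnd, neg
```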
(3) Doing the corresponding processing on the data in the different regions
Positive region, i.e., the sample set with x ∈ POS(X): append it directly to the sequence to be labeled and delete it from the unlabeled data. Negative region, i.e., the sample set with x ∈ NEG(X): labeling such samples means little, so no processing is done on them. Boundary region: samples with x ∈ BND(X) need a further decision on whether to label.

1. Compute the pairwise distances between samples.
If an attribute is continuous, use the Euclidean distance defined above (formula (3)).
If an attribute is discrete, use the VDM defined above: assume V_1 and V_2 are two values that samples x_1 and x_2 take on a discrete attribute; C_1 is the number of samples whose value on this attribute is V_1, C_{1i} the number of those whose class is i, C_2 the number of samples whose value is V_2, C_{2i} the number of those whose class is i; K is a constant, usually taken as 1.
2. Determine the neighborhood radius of the samples:

δ = min(dis(x_i, s)) + w × range(dis(x_i, s)), 0 ≤ w ≤ 1   (5)

where min(dis(x_i, s)) is the distance to the sample nearest x_i, range(dis(x_i, s)) is the span of its distances within the specified data set, and w controls the size of the radius.
3. Select the most representative samples.
Because a smaller value of D_value means larger uncertainty, its opposite is taken, so that a larger value means larger uncertainty; to avoid negative values, 1 is added, giving the uncertainty term 1 - D_value(x). The representativeness of a point is then defined on this term together with the cosine similarity of x to the unlabeled samples x_k in its neighborhood, where N is the number of x_k for which dis(x, x_k) ≤ δ holds. The representative samples are sorted in descending order, and the top-k samples are added to the sequence select to be labeled.
(4) Labeling the sequence to be labeled and creating a new classifier
Take the union of the samples selected from each region, i.e., the samples in the sequence select to be labeled, and give them to an expert for labeling; add the labeled samples to the training set and create a new classifier.
(5) Testing the results
The active learning algorithm is an iterative process: each iteration selects selectNum samples, adds them to the data set to be labeled, and has them labeled. The training set is updated, a new classifier is created, and the classification performance is tested with the test set. Unlabeled data are added through continuous iteration until the preset number of iterations or the evaluation criterion is met. The evaluation index here may be accuracy, ROC, F-value, and so on. To make the experimental results more reliable, the data are tested, e.g., 10 times by the method of random splits, and the average of the 10 results is finally taken.
The above embodiments should be understood as merely illustrating the present invention rather than limiting its scope. After reading the content recorded in the present invention, those skilled in the art may make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the present invention.
Claims (8)
1. An active learning method based on three-way decision theory, characterized by comprising the following steps:
1) obtaining a data set and calling a random function to shuffle it, the data set being divided proportionally into a labeled data set, an unlabeled data set, and a test set;
2) training a naive Bayes classifier with the labeled data set, carrying out posterior probability estimation on the unlabeled data with the naive Bayes classifier, and computing the uncertainty of each unlabeled datum;
3) dividing the unlabeled data into three regions, i.e., a positive region, a negative region, and a boundary region, according to the magnitude of uncertainty, wherein two division methods are involved: division by thresholds and division by sorting on uncertainty;
4) handling the samples in the different regions of step 3) separately, selecting informative and valuable samples for labeling, adding the labeled samples to the training set, training a new classifier, and carrying out result testing on the test set; to test the performance of the classifier, verifying with different evaluation criteria.
2. The active learning method based on three-way decision theory according to claim 1, characterized in that step 2) uses the margin strategy to carry out the posterior probability estimation on the unlabeled data, computing the difference of the posterior probabilities to determine the uncertainty of each unlabeled datum:
D_value(x) = p(y_first|x, L) - p(y_second|x, L)   (1)
where D_value(x) represents the magnitude of uncertainty, p(y_first|x, L) is the largest posterior probability value, and p(y_second|x, L) is the second-largest posterior probability value.
3. The active learning method based on three-way decision theory according to claim 2, characterized in that dividing the unlabeled data into three regions by thresholds in step 3) specifically comprises: dividing by the thresholds threshold_α and threshold_β, defined as follows:
If D_value(x) ≤ threshold_α, then x ∈ POS(X)
If threshold_α < D_value(x) < threshold_β, then x ∈ BND(X)   (2)
If D_value(x) ≥ threshold_β, then x ∈ NEG(X)
where 0 ≤ threshold_α < threshold_β ≤ 1; threshold_α and threshold_β can be determined from experience or from a set confidence level.
4. The active learning method based on three-way decision theory according to claim 2, characterized in that division by sorting on uncertainty in step 3) specifically comprises: sorting by uncertainty in ascending order and dividing the sample space by rank, the first top-k samples belonging to the positive region, the last top-k samples to the negative region, and the middle part to the boundary region, the value of k being governed by the selection quantity selectNum.
5. The active learning method based on three-way decision theory according to claim 4, characterized in that handling the samples in the different regions of step 3) separately in step 4) specifically comprises the steps of:
for the positive region, i.e., the sample set with x ∈ POS(X): appending it directly to the sequence to be labeled and deleting it from the unlabeled data; for the negative region, i.e., the sample set with x ∈ NEG(X): doing no processing on such samples; for the boundary region: samples with x ∈ BND(X) need a further decision on whether to label, comprising the steps of: determining the neighborhood radius of the boundary-region samples from the pairwise distances between samples; computing the neighborhood density and selecting the most representative samples; sorting the representative samples in descending order and adding the top-k samples to the sequence select to be labeled.
6. The active learning method based on three-way decision theory according to claim 5, characterized in that computing the pairwise distances between boundary-region samples comprises: if an attribute is continuous, using the Euclidean distance, defined as follows: assume samples X = {x_1, x_2, ..., x_j, ..., x_m} and Y = {y_1, y_2, ..., y_j, ..., y_m}; then
dis(X, Y) = sqrt( Σ_{j=1..m} (x_j - y_j)² )
where x_j is the j-th attribute of sample X and y_j is the j-th attribute of sample Y; if an attribute is discrete, using the VDM, defined as follows: assume V_1 and V_2 are two values that samples x_1 and x_2 take on a discrete attribute; then
VDM(V_1, V_2) = Σ_i | C_{1i}/C_1 - C_{2i}/C_2 |^K
where C_1 is the number of samples whose value on this attribute is V_1, C_{1i} is the number of those whose class is i, C_2 is the number of samples whose value on this attribute is V_2, C_{2i} is the number of those whose class is i, and K is a constant, usually taken as 1.
7. The active learning method based on three-way decision theory according to claim 5, characterized in that the formula determining the neighborhood radius of a sample is:
δ = min(dis(x_i, s)) + w × range(dis(x_i, s)), 0 ≤ w ≤ 1   (5)
where min(dis(x_i, s)) is the distance to the sample nearest x_i, range(dis(x_i, s)) is the span of its distances within the specified data set, and w controls the size of the radius; the δ-neighborhood of x_i is defined as δ(x_i) = { x_j ∈ U | dis(x_i, x_j) ≤ δ }, where δ is a predefined distance threshold.
8. The active learning method based on three-way decision theory according to claim 5, characterized in that the representativeness of a point is defined as follows: D_value(x) represents the magnitude of uncertainty, a smaller value meaning larger uncertainty; x_k is an unlabeled sample within the neighborhood radius; N is the number of x_k for which dis(x, x_k) ≤ δ holds; assume the attribute vectors of sample x and sample x_k are x = {x_1, x_2, ..., x_j, ..., x_m} and x_k = {x_k^1, x_k^2, ..., x_k^j, ..., x_k^m}; the similarity of the two is computed with the cosine formula, which is as follows:
cos(x, x_k) = ( Σ_{j=1..m} x_j · x_k^j ) / ( sqrt(Σ_{j=1..m} x_j²) · sqrt(Σ_{j=1..m} (x_k^j)²) )
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710326684.8A CN107273912A (en) | 2017-05-10 | 2017-05-10 | An active learning method based on three-way decision theory
Publications (1)
Publication Number | Publication Date |
---|---|
CN107273912A (en) | 2017-10-20
Family
ID=60074134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710326684.8A Pending CN107273912A (en) | 2017-05-10 | 2017-05-10 | A kind of Active Learning Method based on three decision theories |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107273912A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110058576A (en) * | 2018-01-19 | 2019-07-26 | 临沂矿业集团有限责任公司 | Equipment fault prognostics and health management method based on big data |
CN108875768A (en) * | 2018-01-23 | 2018-11-23 | 北京迈格威科技有限公司 | Data mask method, device and system and storage medium |
CN109543707A (en) * | 2018-09-29 | 2019-03-29 | 南京航空航天大学 | Semi-supervised change level Software Defects Predict Methods based on three decisions |
CN109543707B (en) * | 2018-09-29 | 2020-09-25 | 南京航空航天大学 | Semi-supervised change-level software defect prediction method based on three decisions |
CN109820479A (en) * | 2019-01-08 | 2019-05-31 | 西北大学 | A kind of fluorescent molecular tomography feasible zone optimization method |
CN109977994B (en) * | 2019-02-02 | 2021-04-09 | 浙江工业大学 | Representative image selection method based on multi-example active learning |
CN109977994A (en) * | 2019-02-02 | 2019-07-05 | 浙江工业大学 | A kind of presentation graphics choosing method based on more example Active Learnings |
CN110784481A (en) * | 2019-11-04 | 2020-02-11 | 重庆邮电大学 | DDoS detection method and system based on neural network in SDN network |
CN110784481B (en) * | 2019-11-04 | 2021-09-07 | 重庆邮电大学 | DDoS detection method and system based on neural network in SDN network |
CN111914061B (en) * | 2020-07-13 | 2021-04-16 | 上海乐言科技股份有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN111914061A (en) * | 2020-07-13 | 2020-11-10 | 上海乐言信息科技有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN112365120A (en) * | 2020-09-29 | 2021-02-12 | 重庆邮电大学 | Intelligent business strategy generation method based on three decisions |
CN112365120B (en) * | 2020-09-29 | 2022-05-03 | 重庆邮电大学 | Intelligent business strategy generation method based on three decisions |
CN113240007A (en) * | 2021-05-14 | 2021-08-10 | 西北工业大学 | Target feature selection method based on three-branch decision |
CN113240007B (en) * | 2021-05-14 | 2024-05-14 | 西北工业大学 | Target feature selection method based on three decisions |
CN113327131A (en) * | 2021-06-03 | 2021-08-31 | 太原理工大学 | Click rate estimation model for feature interactive selection based on three-branch decision theory |
CN114927239A (en) * | 2022-04-21 | 2022-08-19 | 厦门大学 | Decision rule automatic generation method and system applied to medicine analysis |
CN114927239B (en) * | 2022-04-21 | 2024-07-02 | 厦门大学 | Automatic decision rule generation method and system applied to drug analysis |
CN116452320A (en) * | 2023-04-12 | 2023-07-18 | 西南财经大学 | Credit risk prediction method based on continuous learning |
CN116452320B (en) * | 2023-04-12 | 2024-04-30 | 西南财经大学 | Credit risk prediction method based on continuous learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107273912A (en) | An active learning method based on three-way decision theory | |
Huang et al. | Mos: Towards scaling out-of-distribution detection for large semantic space | |
Khodadadeh et al. | Unsupervised meta-learning for few-shot image classification | |
CN111191732B (en) | Target detection method based on full-automatic learning | |
Feng et al. | Learning fair representations via an adversarial framework | |
Sikora | A modified stacking ensemble machine learning algorithm using genetic algorithms | |
Farid et al. | Mining complex data streams: discretization, attribute selection and classification | |
Han et al. | Prediction and recovery for adaptive low-resolution person re-identification | |
CN110032682A (en) | A kind of information recommendation list generation method, device and equipment | |
CN110929848A (en) | Training and tracking method based on multi-challenge perception learning model | |
Wang et al. | Active learning with co-auxiliary learning and multi-level diversity for image classification | |
Xue et al. | PSO for feature construction and binary classification | |
Zhang et al. | Modeling the Homophily Effect between Links and Communities for Overlapping Community Detection. | |
Huang et al. | Learning consistent region features for lifelong person re-identification | |
CN115292532A (en) | Remote sensing image domain adaptive retrieval method based on pseudo label consistency learning | |
Malphedwar et al. | Squirrel search method for deep learning-based anomaly identification in videos | |
Chen et al. | Refining noisy labels with label reliability perception for person re-identification | |
El-Shorbagy et al. | Advances in Henry Gas Solubility Optimization: A Physics-Inspired Metaheuristic Algorithm With Its Variants and Applications | |
Daoud et al. | Recent Advances of Chimp Optimization Algorithm: Variants and Applications | |
Michelakos et al. | A hybrid classification algorithm evaluated on medical data | |
CN111783088B (en) | Malicious code family clustering method and device and computer equipment | |
Azarbad et al. | Brain tissue segmentation using an unsupervised clustering technique based on PSO algorithm | |
Sastry et al. | Sub-structural niching in estimation of distribution algorithms | |
Salehi et al. | Hybrid simple artificial immune system (SAIS) and particle swarm optimization (PSO) for spam detection | |
CN105160358B (en) | A kind of image classification method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171020