Detailed Description of Embodiments
To enable those skilled in the art to better understand the technical solutions in the embodiments of this specification, the technical solutions are described in detail below with reference to the accompanying drawings of the embodiments. The described embodiments are clearly only some, rather than all, of the embodiments of this specification. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in this specification shall fall within the scope of protection.
Traditional machine learning models are built on the assumption that training data and test data obey the same data distribution: a binary classifier is trained on the training data and then applied to the test data. In many situations, however, the training data and the test data do not satisfy this assumption, and re-labeling training data that obeys the same distribution as the test data incurs additional time and material costs. Researchers have therefore attempted to use training data whose distribution differs from that of the test data to train a binary classifier that can be applied to the test data and achieves a good classification result.
To achieve this goal, researchers proposed the TraAdaboost algorithm. TraAdaboost retains the basic framework of the Adaboost algorithm but differs in how it adjusts sample weights. Specifically, in the TraAdaboost algorithm a weight is set in advance for each sample in the training sample set. When a sample in the source-domain sample subset T_b of the training sample set is misclassified, the sample is considered hard to classify, so its weight is increased. Correspondingly, when a sample in the auxiliary-domain sample subset T_a of the training sample set is misclassified, the sample is considered to differ substantially from the samples in T_b, so its weight is reduced, which lowers the proportion that the sample contributes during training of the binary classifier. The difference between T_a and T_b is that T_b obeys the same distribution as the test data, whereas T_a obeys a different distribution from the test data.
In light of the above, the TraAdaboost algorithm proceeds as follows. An initial weight is set for each training sample in the training sample set, and a number of iterations is set. Each iteration then performs the following: a weak classifier is trained using a learning algorithm and the training sample set; the weak classifier classifies the training samples in T_b, where the classification threshold used is usually a fixed value preset by business personnel according to business experience, for example 0.5, so that a training sample is classified as a positive-class sample if the sample score computed for it by the weak classifier is greater than 0.5 and as a negative-class sample otherwise; the error rate of the weak classifier on T_b is calculated from the classification results; and the weights of the training samples in the training sample set are adjusted based on that error rate. Finally, after the iterations end, the weak classifiers are integrated, and the resulting strong classifier serves as the final binary classifier.
As can be seen from the above, the TraAdaboost algorithm adjusts the weights of training samples based on the overall classification error rate. The error rate is determined by the classification results, and hence by the classification threshold, which is a fixed value. Consequently, if the proportions of positive-class and negative-class samples in the training sample set are unbalanced, for example 1% positive-class samples and 99% negative-class samples, the TraAdaboost algorithm tends to assign minority-class samples to the majority class, for example by classifying all samples as negative, so that the trained binary classifier has high classification accuracy overall. For unbalanced training data, therefore, a binary classification model trained with the TraAdaboost algorithm performs poorly. To solve this problem, the embodiments of this specification provide a training method for a binary classifier.
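The imbalance problem described above can be made concrete with a small sketch. The sample counts below are illustrative only (they mirror the 1% / 99% example) and the "classifier" is a deliberately degenerate one that labels everything negative; neither is taken from the specification.

```python
# Toy illustration: 1% positive-class and 99% negative-class samples.
labels = [1] * 10 + [0] * 990

# A degenerate classifier that assigns every sample to the negative class.
predictions = [0] * len(labels)

# Overall accuracy looks excellent ...
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# ... but no positive-class sample is ever recovered.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

An error rate computed from such results stays low even though the minority class is ignored entirely, which is why a threshold fixed in advance gives the weight update little useful signal on unbalanced data.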
The training method for the binary classifier is described through the following embodiments. Referring to Fig. 1, which is a flowchart of an embodiment of a training method for a binary classifier provided by an exemplary embodiment of this specification, the method includes the following steps:
Step 102: Train using a set learning algorithm and a training sample set to obtain a weak classifier, where the training sample set includes multiple training samples and each of the multiple training samples has a weight.
In the embodiments of this specification, a training sample set may be preset. The training sample set includes multiple training samples, which are divided into positive-class samples and negative-class samples, and each training sample has a weight.
In one embodiment, assume that the training sample set is {T_1, T_2, ..., T_n, T_{n+1}, ..., T_{n+m}}, where {T_1, T_2, ..., T_n} is the auxiliary-domain sample subset T_a and {T_{n+1}, ..., T_{n+m}} is the source-domain sample subset T_b. Before the first iteration, an initial weight may be set for each training sample in the training sample set according to the following formula (one):
Based on the above training sample set, in this step training may be performed using the set learning algorithm and the training sample set, according to the weight distribution of the training samples in the training sample set, to obtain a classifier. For convenience of description, the classifier obtained in each iteration is called a weak classifier.
The weight distribution of the training samples may be calculated by the following formula (two):
In formula (two), t denotes the current iteration number; for example, t is 1 in the first iteration.
In formula (two), w^t denotes the current weight vector; in the first iteration, for example, it consists of the initial weights set according to formula (one).
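Formula (two) is not reproduced in this text. As a minimal sketch, assuming it takes the normalisation form commonly used in boosting algorithms (each weight divided by the sum of all weights), the weight distribution could be computed as follows. The function name, the sample counts, and the uniform initial weights are illustrative assumptions, not values from the specification.

```python
def weight_distribution(weights):
    """Normalise raw sample weights so that they sum to 1 (assumed form of formula (two))."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]

# Illustrative training set: n = 3 auxiliary-domain samples followed by
# m = 2 source-domain samples, all with the same illustrative initial weight.
initial_weights = [1.0] * 5
p = weight_distribution(initial_weights)
print(p)
```

The normalised vector can then be passed to any learning algorithm that accepts per-sample weights.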
In one embodiment, the set learning algorithm may be the SVM (Support Vector Machine) algorithm, the logistic regression algorithm, or the like.
In one embodiment, the weak classifier may take the form of a decision tree, or may be another, more refined classifier, for example an RF (Random Forest) classifier.
Step 104: Determine a classification threshold of the weak classifier based on the ROC curve of the weak classifier.
In the embodiments of this specification, unlike the related art in which business personnel set a fixed value as the classification threshold according to business experience, the classification threshold is determined based on the ROC curve of the weak classifier. Those skilled in the art will understand that an ROC curve is drawn with the true positive rate (sensitivity) as the ordinate and the false positive rate (1 - specificity) as the abscissa, and that each data point on the curve corresponds to a cutoff, that is, a classification threshold. The true positive rate (sensitivity) can also reflect the degree to which the positive class is covered, namely the coverage rate, and the specificity can also reflect the extent of misjudgment between the positive and negative classes, namely the disturbance rate. For the detailed process of drawing the ROC curve of the weak classifier, those skilled in the art may refer to descriptions in the related art, which are not repeated in the embodiments of this specification.
Based on the above, in one embodiment, the distance between each data point on the ROC curve of the weak classifier and a specified coordinate point may be calculated. The vertical-axis coordinate value of the specified coordinate point is a coverage rate set by business personnel according to business experience, referred to here for convenience as the specified coverage rate; the horizontal-axis coordinate value of the specified coordinate point is 1 minus a disturbance rate set by business personnel according to business experience, referred to here for convenience as the specified disturbance rate.
Then, the data point on the ROC curve with the smallest distance to the specified coordinate point is determined. As noted above, each data point on the ROC curve corresponds to a classification threshold, so the classification threshold corresponding to this nearest data point may be taken as the classification threshold to be determined in this embodiment.
In this embodiment, by setting the specified coverage rate and the specified disturbance rate according to business experience, business personnel set the performance index values they expect of the weak classifier, so that the classification threshold most likely to achieve those expected performance index values is determined on the ROC curve.
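The nearest-point selection described in this embodiment can be sketched as follows. This is a minimal illustration, assuming Euclidean distance (the text does not name a distance measure) and assuming a sample is predicted positive when its score is greater than the threshold. The function names, the scores, and the specified coverage and disturbance rates are hypothetical.

```python
import math

def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for each distinct score used as a cutoff.

    A sample is predicted positive when its score is greater than the
    threshold. Assumes labels are 0/1 and that both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points

def threshold_nearest_to_target(points, specified_coverage, specified_disturbance):
    """Pick the threshold whose ROC point is closest to the specified coordinate point."""
    target_x = 1.0 - specified_disturbance  # horizontal coordinate, as stated in the text
    target_y = specified_coverage           # vertical coordinate
    best = min(points, key=lambda p: math.hypot(p[1] - target_x, p[2] - target_y))
    return best[0]

# Illustrative weak-classifier scores and true classes (1 = positive, 0 = negative).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
points = roc_points(scores, labels)
threshold = threshold_nearest_to_target(points, specified_coverage=0.9,
                                        specified_disturbance=0.95)
print(threshold)
```

Only distinct scores are tried as cutoffs here; a finer grid of candidate thresholds would work the same way.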
In one embodiment, for each data point on the ROC curve of the weak classifier, a set algorithm may be used to perform an operation on the vertical-axis coordinate value and the horizontal-axis coordinate value of the data point. For example, the set algorithm may be as shown in the following formula (three); as another example, the set algorithm may be as shown in the following formula (four):
Then the data point with the largest operation result is determined, and the classification threshold corresponding to that data point is taken as the classification threshold to be determined in this embodiment.
In this embodiment, by performing this operation on each data point on the ROC curve, a classification threshold that optimizes the performance index value of the weak classifier is selected.
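Formulas (three) and (four) are not reproduced in this text, so the scoring function in the sketch below is a stand-in assumption: Youden's J statistic (true positive rate minus false positive rate), a widely used ROC-based criterion. It only illustrates the "compute a score per data point, keep the maximum" pattern; it is not the specification's formula. The ROC points are hand-written illustrative values.

```python
def youden_j(fpr, tpr):
    """Stand-in scoring function (assumption): Youden's J = TPR - FPR."""
    return tpr - fpr

def threshold_with_max_score(points, score_fn=youden_j):
    """points: iterable of (threshold, fpr, tpr); return the threshold of the best-scoring point."""
    best = max(points, key=lambda p: score_fn(p[1], p[2]))
    return best[0]

# Illustrative ROC data points as (threshold, false positive rate, true positive rate).
points = [(0.2, 0.50, 1.00), (0.4, 0.25, 0.75), (0.6, 0.00, 0.75), (0.8, 0.00, 0.25)]
chosen = threshold_with_max_score(points)
print(chosen)
```

Any other function of the two coordinates can be passed as `score_fn` without changing the selection logic.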
In one embodiment, the ROC curve may be an adjusted ROC curve. Specifically, the weak classifier may be used to calculate a sample score for each of the multiple training samples, and each training sample is then judged to be a positive or negative sample based on its sample score. If a training sample is judged to be a negative sample, a decision may be made not to execute a specified event for that training sample; conversely, if a training sample is judged to be a positive sample, a decision may be made to execute the specified event for that training sample. Subsequently, the execution effect of the specified event is taken as a specified index, the density function of the specified index is determined, and the density function is used as an ROC curve adjustment factor. The original ROC curve of the weak classifier is then adjusted using the ROC curve adjustment factor and the multiple training samples, and the classification threshold is determined based on the adjusted ROC curve.
In this embodiment, adjusting the ROC curve and then determining the classification threshold based on the adjusted ROC curve makes the classification results obtained with that classification threshold more accurate.
Step 106: Obtain the classification results produced by the weak classifier, using the classification threshold, for each training sample in a specified partial sample set of the training sample set.
Step 108: Adjust the weight of each of the multiple training samples based on the classification results.
Step 106 and step 108 are described below:
In the embodiments of this specification, the specified partial sample set may be the source-domain sample subset T_b. According to step 106, the weak classifier may then classify the samples in T_b based on the classification threshold determined in step 104 to obtain the classification results.
Further, the error rate of the weak classifier on the source-domain sample subset T_b, denoted ε_t, may be calculated based on the classification results. Specifically, the calculation may be as shown in the following formula (five):
In formula (five), h_t(x_i) denotes the classification result for a sample under the classification threshold determined in step 104, and c(x_i) denotes the true class of the sample.
Subsequently, β_t is set according to the following formula (six):
Further, the weight of each of the multiple training samples may be adjusted according to the following formula (seven):
In formula (seven), N is the preset threshold for the number of iterations.
In addition, in the embodiments of this specification, 1 - wAUC may be used directly in place of ε_t in the subsequent operations, where wAUC denotes the area under the adjusted ROC curve.
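Formulas (five) to (seven) are not reproduced in this text. The sketch below assumes they follow the update published for the standard TrAdaBoost algorithm: a weighted error rate on T_b, β_t = ε_t / (1 - ε_t), and a fixed factor β = 1 / (1 + sqrt(2 ln n / N)) for the n auxiliary-domain samples. The function name, the clamping of ε_t, and the example data are assumptions for illustration only.

```python
import math

def tradaboost_update(weights, predictions, labels, n_aux, n_iterations):
    """One weight update in the style of the standard TrAdaBoost algorithm.

    The first n_aux entries are auxiliary-domain samples (T_a); the rest are
    source-domain samples (T_b). predictions and labels contain 0 or 1.
    """
    mistakes = [abs(h - c) for h, c in zip(predictions, labels)]

    # Weighted error rate on T_b only (assumed form of formula (five)).
    tb_weight = sum(weights[n_aux:])
    epsilon = sum(w * e for w, e in zip(weights[n_aux:], mistakes[n_aux:])) / tb_weight
    # Assumption: clamp epsilon so that beta_t is defined and stays below 1.
    epsilon = min(max(epsilon, 1e-10), 0.499)

    beta_t = epsilon / (1.0 - epsilon)                                     # assumed formula (six)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_aux) / n_iterations))

    new_weights = []
    for i, (w, e) in enumerate(zip(weights, mistakes)):                   # assumed formula (seven)
        if i < n_aux:
            new_weights.append(w * beta ** e)       # misclassified T_a sample: weight shrinks
        else:
            new_weights.append(w * beta_t ** (-e))  # misclassified T_b sample: weight grows
    return epsilon, new_weights

# Illustrative data: 3 auxiliary-domain samples followed by 3 source-domain samples.
weights = [1.0] * 6
predictions = [1, 0, 0, 1, 0, 1]
labels = [0, 0, 0, 1, 1, 1]
epsilon, new_weights = tradaboost_update(weights, predictions, labels,
                                         n_aux=3, n_iterations=10)
print(epsilon, new_weights)
```

To use the 1 - wAUC variant mentioned above, the computed `epsilon` would simply be replaced by that value before β_t is derived.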
Step 110: Judge whether the current number of iterations has reached the preset threshold for the number of iterations; if so, proceed to step 112; otherwise, return to step 102.
Step 112: Integrate the weak classifiers obtained in each iteration to obtain the binary classifier.
For a detailed description of step 108 and step 110, those skilled in the art may refer to related descriptions in the prior art, which are not repeated in the embodiments of this specification.
In addition, in the embodiments of this specification, the output of the finally integrated binary classifier may be a sample score rather than a classification result. A higher sample score indicates a higher probability that the sample is a positive-class sample; conversely, a lower sample score indicates a lower probability that the sample is a positive-class sample, that is, a higher probability that the sample is a negative-class sample. On this basis, after step 108, any test sample in a test sample set may be input into the binary classifier to obtain the sample score of that test sample.
In addition, in the embodiments of this specification, the current weight of each of the multiple training samples may also be output after the iterations end; this processing can improve the efficiency of whitelist generation.
In the technical solution provided by the embodiments of this specification, the following steps are performed iteratively until the number of iterations reaches a preset threshold: training is performed using a set learning algorithm and a training sample set to obtain a weak classifier, where the training sample set includes multiple training samples and each of the multiple training samples has a weight; a classification threshold of the weak classifier is determined based on the ROC curve of the weak classifier; the classification results produced by the weak classifier, using the classification threshold, for each training sample in a specified partial sample set of the training sample set are obtained; and the weight of each of the multiple training samples is adjusted based on the classification results. Finally, after the iterations end, the weak classifiers obtained in each iteration are integrated to obtain the binary classifier. Because the classification threshold of the weak classifier is determined in each iteration from the ROC curve of the weak classifier just trained, the subsequent classification of training samples based on that threshold is more accurate, which in turn makes the weight adjustment based on the classification results more effective. As a result, the binary classification model finally obtained by training performs better.
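The overall flow of steps 102 to 112 can be sketched as a loop with two pluggable hooks. Everything here is a hedged illustration: `train_weak` and `pick_threshold` are hypothetical callables, the weight update assumes the standard TrAdaBoost form, and the integration step (a log(1/β_t)-weighted average of weak scores) is an assumed way of producing a sample score, since the text does not spell out how the weak classifiers are integrated.

```python
import math

def run_iterations(samples, labels, n_aux, n_iterations, train_weak, pick_threshold):
    """Skeleton of steps 102 to 110; the first n_aux samples form T_a, the rest T_b."""
    weights = [1.0] * len(samples)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_aux) / n_iterations))
    tb = range(n_aux, len(samples))
    ensemble = []
    for _ in range(n_iterations):                                   # step 110: fixed count
        total = sum(weights)
        distribution = [w / total for w in weights]
        scorer = train_weak(samples, labels, distribution)          # step 102
        scores = [scorer(x) for x in samples]
        threshold = pick_threshold(scores, labels)                  # step 104
        mistakes = [abs(int(s > threshold) - y)                     # step 106
                    for s, y in zip(scores, labels)]
        eps = sum(weights[i] * mistakes[i] for i in tb) / sum(weights[i] for i in tb)
        eps = min(max(eps, 1e-10), 0.499)                           # assumed clamp
        beta_t = eps / (1.0 - eps)
        for i, e in enumerate(mistakes):                            # step 108
            weights[i] *= beta ** e if i < n_aux else beta_t ** (-e)
        ensemble.append((scorer, beta_t))
    return ensemble, weights

def ensemble_score(ensemble, x):
    """Step 112 (assumed integration): log(1/beta_t)-weighted average of weak scores."""
    alphas = [math.log(1.0 / b) for _, b in ensemble]
    return sum(a * s(x) for a, (s, _) in zip(alphas, ensemble)) / sum(alphas)

def train_weak(samples, labels, distribution):
    """Toy hook: scores a 1-D sample by its own value and ignores the weights."""
    return lambda x: x

def pick_threshold(scores, labels):
    """Toy hook: cutoff maximising TPR - FPR over the distinct scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    def j(t):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        return tp / pos - fp / neg
    return max(sorted(set(scores)), key=j)

# Illustrative 1-D data: 3 auxiliary-domain samples, then 4 source-domain samples.
samples = [0.2, 0.9, 0.4, 0.1, 0.8, 0.7, 0.3]
labels = [0, 0, 1, 0, 1, 1, 0]
ensemble, weights = run_iterations(samples, labels, n_aux=3, n_iterations=5,
                                   train_weak=train_weak, pick_threshold=pick_threshold)
print(weights, ensemble_score(ensemble, 0.8))
```

In this toy run the misclassified auxiliary-domain sample (score 0.9, true class 0) loses weight in every iteration, while the correctly classified samples keep theirs. A real weak learner would use `distribution` during training.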
Corresponding to the above method embodiments, the embodiments of this specification also provide a training device for a binary classifier. Referring to Fig. 2, which is a block diagram of an embodiment of a training device for a binary classifier provided by an exemplary embodiment of this specification, the device may include a training module 210, a determining module 220, a classification module 230, an adjustment module 240, and an integration module 250.
The training module 210 may be configured to train using a set learning algorithm and a training sample set to obtain a weak classifier, where the training sample set includes multiple training samples and each of the multiple training samples has a weight;
The determining module 220 may be configured to determine a classification threshold of the weak classifier based on the ROC curve of the weak classifier;
The classification module 230 may be configured to obtain the classification results produced by the weak classifier, using the classification threshold, for each training sample in a specified partial sample set of the training sample set;
The adjustment module 240 may be configured to adjust the weight of each of the multiple training samples based on the classification results;
The training module 210, the determining module 220, the classification module 230, and the adjustment module 240 cooperate with one another to carry out iterative processing until a preset iteration stop condition is met;
The integration module 250 may be configured to integrate, after the iterations end, the weak classifiers obtained in each iteration to obtain the binary classifier.
In one embodiment, the determining module 220 may include (not shown in Fig. 2):
A first calculation submodule, configured to calculate, for each data point on the ROC curve of the weak classifier, the distance between the data point and a specified coordinate point, where the vertical-axis coordinate value of the specified coordinate point is a specified coverage rate and the horizontal-axis coordinate value of the specified coordinate point is 1 minus a specified disturbance rate;
A first data point determining submodule, configured to determine the data point with the smallest distance to the specified coordinate point;
A first threshold determining submodule, configured to determine the classification threshold of the weak classifier based on the determined data point.
In one embodiment, the determining module 220 may include (not shown in Fig. 2):
A second calculation submodule, configured to perform, for each data point on the ROC curve of the weak classifier, an operation on the vertical-axis coordinate value and the horizontal-axis coordinate value of the data point using a set algorithm;
A second data point determining submodule, configured to determine the data point with the largest operation result;
A second threshold determining submodule, configured to determine the classification threshold of the weak classifier based on the determined data point.
In one embodiment, the determining module 220 may include (not shown in Fig. 2):
A third calculation submodule, configured to calculate, using the weak classifier, the sample score of each of the multiple training samples, and to judge each training sample based on its sample score, with the judgment result serving as the basis for whether to execute a specified event for that sample;
An adjustment factor determining submodule, configured to estimate the density function of a specified index of the multiple training samples and to use the density function of the specified index as an ROC curve adjustment factor, where the specified index reflects the execution effect of the specified event;
A curve adjustment submodule, configured to adjust the ROC curve of the weak classifier using the ROC curve adjustment factor and the multiple training samples;
A third threshold determining submodule, configured to determine the classification threshold of the weak classifier based on the adjusted ROC curve.
In one embodiment, the device may further include (not shown in Fig. 2):
An output module, configured to output, after the iterations end, the current weight of each of the multiple training samples.
In one embodiment, the device may further include (not shown in Fig. 2):
A sample score calculation module, configured to input any test sample in a test sample set into the binary classifier to obtain the sample score of that test sample.
In one embodiment, the set learning algorithm is at least one of the following:
the SVM algorithm and the logistic regression algorithm.
It can be understood that the training module 210, the determining module 220, the classification module 230, the adjustment module 240, and the integration module 250, as five functionally independent modules, may either be configured in the device together as shown in Fig. 2 or be configured in the device individually; the structure shown in Fig. 2 should therefore not be construed as limiting the solutions of the embodiments of this specification.
In addition, for the implementation of the functions and effects of each module in the above device, see the implementation of the corresponding steps in the above method; details are not repeated here.
The embodiments of this specification also provide a computer device that includes at least a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the aforementioned training method for a binary classifier when executing the program. The method includes at least: performing iterative processing using the following steps until the number of iterations reaches a preset threshold: training using a set learning algorithm and a training sample set to obtain a weak classifier, where the training sample set includes multiple training samples and each of the multiple training samples has a weight; determining a classification threshold of the weak classifier based on the receiver operating characteristic (ROC) curve of the weak classifier; obtaining the classification results produced by the weak classifier, using the classification threshold, for each training sample in a specified partial sample set of the training sample set; and adjusting the weight of each of the multiple training samples based on the classification results; and, after the iterations end, integrating the weak classifiers obtained in each iteration to obtain the binary classifier.
Fig. 3 shows a more specific schematic diagram of the hardware structure of a computing device provided by the embodiments of this specification. The device may include a processor 310, a memory 320, an input/output interface 330, a communication interface 340, and a bus 350. The processor 310, the memory 320, the input/output interface 330, and the communication interface 340 are communicatively connected to one another inside the device through the bus 350.
The processor 310 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), one or more integrated circuits, or the like, and is used to execute the relevant programs to implement the technical solutions provided by the embodiments of this specification.
The memory 320 may be implemented as ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 320 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 320 and is called and executed by the processor 310.
The input/output interface 330 is used to connect an input/output module to implement information input and output. The input/output module may be configured in the device as a component (not shown in Fig. 3) or may be externally connected to the device to provide the corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like; output devices may include a display, a loudspeaker, a vibrator, an indicator light, and the like.
The communication interface 340 is used to connect a communication module (not shown in Fig. 3) to implement communication between this device and other devices. The communication module may communicate in a wired manner (for example USB or a network cable) or in a wireless manner (for example a mobile network, WIFI, or Bluetooth).
The bus 350 includes a path that transmits information between the components of the device (for example the processor 310, the memory 320, the input/output interface 330, and the communication interface 340).
It should be noted that although the above device shows only the processor 310, the memory 320, the input/output interface 330, the communication interface 340, and the bus 350, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also include only the components necessary to implement the solutions of the embodiments of this specification, without including all the components shown in the figure.
The embodiments of this specification also provide a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the aforementioned training method for a binary classifier. The method includes at least: performing iterative processing using the following steps until the number of iterations reaches a preset threshold: training using a set learning algorithm and a training sample set to obtain a weak classifier, where the training sample set includes multiple training samples and each of the multiple training samples has a weight; determining a classification threshold of the weak classifier based on the receiver operating characteristic (ROC) curve of the weak classifier; obtaining the classification results produced by the weak classifier, using the classification threshold, for each training sample in a specified partial sample set of the training sample set; and adjusting the weight of each of the multiple training samples based on the classification results; and, after the iterations end, integrating the weak classifiers obtained in each iteration to obtain the binary classifier.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
From the above description of the embodiments, those skilled in the art can clearly understand that the embodiments of this specification may be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of this specification.
The systems, devices, modules, or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product with a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail sending and receiving device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts may be found in the description of the method embodiments. The device embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and when the solutions of the embodiments of this specification are implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The above are only specific implementations of the embodiments of this specification. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the embodiments of this specification, and these improvements and modifications shall also be regarded as falling within the scope of protection of the embodiments of this specification.