CN105760899A - Adboost training learning method and device based on distributed computation and detection cost ordering - Google Patents

Adboost training learning method and device based on distributed computation and detection cost ordering

Info

Publication number
CN105760899A
CN105760899A CN201610201531.6A CN201610201531A
Authority
CN
China
Prior art keywords
strong classifier
classifier
sample
computer
distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610201531.6A
Other languages
Chinese (zh)
Other versions
CN105760899B (en)
Inventor
田雨农
吴子章
周秀田
于维双
陆振波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Roiland Technology Co Ltd
Original Assignee
Dalian Roiland Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Roiland Technology Co Ltd filed Critical Dalian Roiland Technology Co Ltd
Priority to CN201610201531.6A
Publication of CN105760899A
Application granted
Publication of CN105760899B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention belongs to the field of computation, relates to an Adboost training learning method and device based on distributed computation and detection cost ordering, and solves the problem that training on high-dimensional mass data samples detects slowly because a single computer has an upper limit on memory and computing power, so as to realize parallel training, form a new cascade classifier through iteration, and increase detection speed. The method is characterized by comprising: Step 3, distributed training: a first sample is split and distributed to the computers, each computer trains a strong classifier according to the strong classifier indexes it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers; and Step 4, the next round of distributed training: the misclassified first samples on the computers are collected, split, and distributed to the computers again, each computer trains a strong classifier according to the strong classifier indexes it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers.

Description

Adboost training learning method and device based on distributed computation and detection cost ordering
Technical field
The invention belongs to the field of computation and relates to a training learning method for a classifier.
Background technology
As the commercialization of artificial intelligence keeps accelerating, using machine learning algorithms to train on one-dimensional, two-dimensional, three-dimensional, or even higher-dimensional samples has become an invisible obstacle for many machine learning applications. As the dimension and quantity of samples grow, the memory occupied by training samples and the CPU and GPU computing resources required keep increasing as well. The memory of a single PC or server always has an upper limit, which to a large extent restricts the number of samples that can be used for training. Although today's traditional machine learning algorithms differ in training techniques, their consumption of system resources is unavoidable. Combining mass-data sample training with parallel and distributed computation is therefore an indispensable means.
Summary of the invention
In order to adapt to high-dimensional sample training and to solve the problem that training on high-dimensional mass data samples detects slowly because a single computer has an upper limit on memory and computing power, the present invention proposes an Adboost training learning method and device based on distributed computation and detection cost ordering, so as to realize parallel training, iteratively form a new cascade classifier, and increase detection speed.
To achieve these goals, the key technical points are as follows:
An Adboost training learning method based on distributed computation and detection cost ordering comprises the following steps:
Step 1: set the detection target and determine the number of computers for distribution according to the number of classifiers in the cascade;
Step 2: distribute the second sample to each computer;
Step 3: distributed training: split the first sample and distribute it to the computers; each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
Step 4: the next round of distributed training: collect the misclassified first samples from the computers, split them, and distribute them to the computers again; each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
Step 5: repeat Step 4 until the new cascade classifier meets the exit condition;
wherein the first sample is the more numerous class of the positive and negative samples to be tested, and the second sample is the less numerous class of the positive and negative samples to be tested.
The invention further relates to an Adboost training learning device based on distributed computation and detection cost ordering, comprising:
a detection target setting device, which sets the detection target and determines the number of computers for distribution according to the number of classifiers in the cascade;
a second sample distributing device, which distributes the second sample to each computer;
a first distributed training device, which performs distributed training: the first sample is split and distributed to the computers, each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
a second distributed training device, which performs the next round of distributed training: the misclassified first samples on the computers are collected, split, and distributed to the computers again, each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
a repeated distributed training device, which repeats the training of the second distributed training device until the new cascade classifier meets the exit condition;
wherein the first sample is the more numerous class of the positive and negative samples to be tested, and the second sample is the less numerous class of the positive and negative samples to be tested.
Beneficial effects:
1. The present invention exploits the characteristics of the Adboost algorithm itself and splits its cascade classifier into strong classifiers computed in parallel and distributed over multiple computers, overcoming the dependence on the memory and computing power of a single computer.
2. The present invention uses multiple iterations of distributed training to continuously compress and redistribute the huge sample set, and composes the final cascade classifier in the order of the classification capacity of the strong classifiers on the computers, so that the optimal detection speed is reached during real-time detection.
3. When the distributed training is iterated, the last three rounds of distributed training apply a special adjustment of the index strategy, which ensures stable convergence of the final cascade classifier.
Brief description of the drawings
Fig. 1 is a flow chart of the distributed Adboost training method.
Detailed description of the invention
Embodiment 1: an Adboost training learning method based on distributed computation and detection cost ordering comprises the following steps:
Step 1: set the detection target and determine the number of computers for distribution according to the number of classifiers in the cascade;
Step 2: distribute the second sample to each computer;
Step 3: distributed training: split the first sample and distribute it to the computers; each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
Step 4: the next round of distributed training: collect the misclassified first samples from the computers, split them, and distribute them to the computers again; each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
Step 5: repeat Step 4 until the new cascade classifier meets the exit condition, i.e., until the cascade classifier meets the set detection target; wherein the first sample is the more numerous class of the positive and negative samples to be tested, and the second sample is the less numerous class of the positive and negative samples to be tested.
This embodiment exploits the characteristics of the Adboost algorithm itself and splits its cascade classifier into strong classifiers computed in parallel and distributed over multiple computers, overcoming the dependence on the memory and computing power of a single computer. It further uses multiple iterations of distributed training to continuously compress and redistribute the huge sample set, and composes the new cascade classifier in the order of the classification capacity of the strong classifiers on the computers, so that the optimal detection speed is reached during real-time detection.
Embodiment 2: the technical scheme is the same as in Embodiment 1, and more specifically, Step 3 consists of:
S3.1. split the first sample and distribute it to the computers in turn;
S3.2. distribute the strong classifier parameters to the computers;
S3.3. each computer trains a strong classifier, and the trained strong classifiers are collected on a server; the server applies the collected strong classifiers to all test samples (preferably testing in segments) and sorts the strong classifiers by classification capacity, using the misclassification rate accumulated by each collected strong classifier over all test samples as the sorting criterion, a lower misclassification rate meaning a stronger classification capacity; the collected strong classifiers are then assembled in the sorted order into the new cascade classifier, as sketched below.
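A minimal Python sketch of this server-side step follows. It is only an illustrative reading of S3.3, not the patented implementation; the classifier interface, the segment size, and the helper names misclassification_rate and assemble_cascade are assumptions.

from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]               # (feature vector, label in {+1, -1})
StrongClassifier = Callable[[Sequence[float]], int]

def misclassification_rate(clf: StrongClassifier, samples: List[Sample],
                           segment_size: int = 10000) -> float:
    # Accumulate the error rate over all test samples, segment by segment,
    # so the full sample set never has to be scored in one pass.
    errors, total = 0, 0
    for start in range(0, len(samples), segment_size):
        for x, y in samples[start:start + segment_size]:
            errors += int(clf(x) != y)
            total += 1
    return errors / max(total, 1)

def assemble_cascade(collected: List[StrongClassifier],
                     test_samples: List[Sample]) -> List[StrongClassifier]:
    # Sort the collected strong classifiers by ascending misclassification rate
    # (lower error = stronger classification capacity) and use that order as the
    # stage order of the new cascade: stage 1 is the strongest classifier.
    return sorted(collected, key=lambda c: misclassification_rate(c, test_samples))

In use, the server would call assemble_cascade with the n strong classifiers received from the training computers and the pooled test samples, then append the resulting stages to the cascade.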
Embodiment 3: the technical scheme is the same as in Embodiment 1 or 2, and more specifically: in Step 1, the detection target is set, including the false alarm rate index of the cascade classifier and the recall rate index of the cascade classifier, where the false alarm rate target is set equal to 0 and the recall rate target is set equal to 99%; in Step 2, distributing the strong classifier parameters means distributing the same recall rate and false alarm rate indexes as those of each strong classifier in the whole cascade classifier; the number of computers is n, the number of strong classifiers is Max_Num, the number of strong classifiers assembled equals (n + n/2 + n/4 + ...), and it is required that (n + n/2 + n/4 + ...) < 2n ≤ Max_Num.
If the total number of first samples is N and the number of computers is n, the number of negative samples distributed to each computer is k = N/n. The false alarm rate of each strong classifier must satisfy Strong_False_Alarm ≤ 50%. Assuming the number of strong classifiers Max_Num is at most 20, then n ≤ 10; since the number of strong classifiers assembled in subsequent computations equals (n + n/2 + n/4 + ...) and (n + n/2 + n/4 + ...) < 2n ≤ Max_Num, the recall rate of each strong classifier must satisfy Strong_Detection_Rate ≥ 99.95%.
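A minimal sketch in Python (the helper names round_sizes and negatives_per_computer and the concrete figures in the example are assumptions, not requirements of the patent) of how the round sizes and the per-computer quota follow from these constraints:

def round_sizes(n, max_num):
    # Yield the number of computers used in each distributed round. Each round
    # the surviving negative samples shrink to at most half, so the number of
    # computers needed also halves; the total number of strong classifiers
    # produced then stays below 2n <= Max_Num.
    assert 2 * n <= max_num, "the setup requires 2n <= Max_Num"
    total = 0
    while n >= 1:
        total += n
        yield n
        n //= 2
    assert total < max_num

def negatives_per_computer(N, n):
    # k = N / n negative samples are sent to each of the n computers.
    return N // n

print(list(round_sizes(10, 20)))             # [10, 5, 2, 1] -> 18 strong classifiers in total
print(negatives_per_computer(100000, 10))    # 10000 negatives per computer (N = 100000 assumed)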
When Step 3 and Step 4 are repeated, the labels of the negative samples that have not been excluded (i.e., the misclassified negatives) on each computer are gathered on an extra computer and the set is split again. When splitting again, because the total number of remaining negative samples is at most 50% of the original scale, the number of computers needed is also halved.
First, the remaining capacity for strong classifiers in the cascade, Max_Num - n, must be compared with the number of strong classifiers about to be produced in this round, n/2. The remaining capacity must be at least as large as the number about to be produced, i.e., (n/2) ≤ (Max_Num - n), and the condition 2n ≤ Max_Num set above guarantees exactly this.
Embodiment 4: the technical scheme is the same as in Embodiment 1, 2, or 3, and more specifically: the sorting is such that strong classifiers with higher classification capacity are placed earlier and strong classifiers with weaker classification capacity are placed later. Forming the new cascade classifier: strong classifiers with higher classification capacity occupy earlier positions in the cascade and weaker ones occupy later positions; the strong classifier with the strongest classification capacity is taken as the first-stage strong classifier of the cascade, followed in order by the second-stage strong classifier, the third-stage classifier, ..., and the n-th-stage strong classifier, where n is the number of computers; the first-stage through n-th-stage strong classifiers together constitute the new cascade classifier. Here n denotes the number of computers used in the distributed training, i.e., each computer produces exactly one strong classifier.
Embodiment 5: the technical scheme is the same as in Embodiment 1, 2, 3, or 4, and more specifically: when the distributed training reaches the second-to-last and the last round, the false alarm rate allocated in the strong classifier parameters is gradually reduced, and in the last round the false alarm rate is set to no more than 0%. When the distributed training is iterated, the last three rounds of distributed training apply a special adjustment of the index strategy, which ensures stable convergence of the final cascade classifier.
Let the overall recall rate index be A and the recall rate index of each of the first k strong classifier stages be a; the overall recall rate achieved by the first k stages is then a^k, and since the cascade has not yet exited, a^k is still above A. If the next-stage strong classifier is to serve as the last strong classifier, it therefore only needs to reach a recall rate of A/a^k, so the recall rate of that strong classifier is set to A/a^k. When A/a^k < a, this setting relaxes the exit condition of the last strong classifier to a certain extent and thus accelerates its convergence.
Throughout the training process, the samples that remain misclassified at the end usually have a very low degree of separability, i.e., they are hard to classify correctly; in traditional training methods, training on this portion of the samples tends to consume most of the training resources and most of the weak classifiers. The above method can alleviate, and to a great extent even eliminate, this training cost, so that the training of the classifier converges faster.
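As a numerical illustration of the recall rate setting above (A = 99% and a = 99.95% are the index values used elsewhere in this description; the stage count k = 17 is assumed for the example): a^17 = 0.9995^17 ≈ 99.15%, so the last stage only needs a recall of A/a^17 ≈ 0.99/0.9915 ≈ 99.85%, which is lower than the per-stage index of 99.95% and therefore easier to converge to.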
Embodiment 6: this embodiment targets the Adboost learning algorithm and proposes an Adboost training learning method based on distributed computation and detection cost ordering. The mathematical properties of the algorithm are first used to perform two-class classification on the samples; at the same time, exploiting the difference in size between the positive and negative sample sets, the class with fewer samples, here the positive samples, is distributed to each computer in turn.
The negative samples are then split: if the negative samples total N and there are n computers, then k = N/n negative samples are distributed to each computer in turn. The specific training flow is shown in Fig. 1.
It is assumed here that the false alarm rate index of the whole cascade classifier is Cascade_False_Alarm = 0;
the recall rate index of the whole cascade classifier is Cascade_Detection_Rate = 99%;
the false alarm rate index of each strong classifier is Strong_False_Alarm ≤ 50%;
For example, suppose the number of strong classifiers Max_Num is at most 20; then n ≤ 10, because the number of strong classifiers assembled in subsequent computations equals (n + n/2 + n/4 + ...) and (n + n/2 + n/4 + ...) < 2n ≤ Max_Num. The recall rate index of each strong classifier is then Strong_Detection_Rate ≥ 99.95%;
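As a short check of the 99.95% figure (assuming the per-stage recalls simply multiply across the cascade): with up to Max_Num = 20 stages and an overall recall target of Cascade_Detection_Rate = 99%, each stage must retain at least 0.99^(1/20) ≈ 0.99950 of the positive samples, i.e., about 99.95%, and indeed 0.9995^20 ≈ 0.9900.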
(1) The process of the first distributed training is described below.
Here the samples on each computer are assigned indexes, i.e., the same recall rate and false alarm rate indexes as those of each strong classifier in the whole cascade classifier are distributed. The indexes received by each computer are identical: the false alarm rate of each strong classifier must satisfy Strong_False_Alarm ≤ 50%, and the recall rate of each strong classifier must satisfy Strong_Detection_Rate ≥ 99.95%.
Thus, on each computer, once the first-stage strong classifier has been trained and is used to detect the samples on that computer, it is guaranteed that at least 99.95% of the positive samples are detected and that at least 50% of the negative samples are excluded. The first-stage strong classifier on each computer is therefore capable of correctly performing two-class classification on all the positive samples and on the negative samples it excludes. The first-stage strong classifiers on the computers are then sent to the server one by one, where the strong classifiers are collected.
On the server, the strong classifier sent by each computer is used to test all the negative samples in segments (the total number of negative samples may be too large to test in one pass), and its misclassification rate on the negative samples is accumulated and taken as the index of its classification capacity. The strong classifiers are sorted by classification capacity, and the strong classifier with the strongest classification capacity becomes the first-stage strong classifier of the cascade classifier, followed by the second-stage strong classifier, ..., and the n-th-stage strong classifier.
This completes the first round of distributed Adboost training, achieving a recall of (99.95%)^n on all the positive samples and a separation capacity of at least 50% on the negative samples. The subsequent distributed training rounds therefore need to train the separation capacity on the negative samples further.
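For instance, with the example figure n = 10 used above, (99.95%)^10 ≈ 99.5%, which at this point is still above the overall 99% recall target.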
(2) The process of the second distributed training is described below.
For the negative samples that have not been excluded (i.e., the misclassified negatives) on each computer, their sample labels are gathered on an extra computer and the set is split again. When splitting again, because the total number of remaining negative samples is at most 50% of the original scale, the number of computers needed is also halved.
First, the remaining capacity for strong classifiers in the cascade, Max_Num - n, must be compared with the number of strong classifiers about to be produced in this round, n/2. The remaining capacity must be at least as large as the number about to be produced, i.e., (n/2) ≤ (Max_Num - n), and the condition 2n ≤ Max_Num set above guarantees exactly this.
Then, similarly to the first distributed training, the samples on each computer are assigned indexes, i.e., the same recall rate and false alarm rate indexes as those of each strong classifier in the whole cascade classifier. The indexes received by each computer are that the false alarm rate of each strong classifier must satisfy Strong_False_Alarm ≤ 50% and the recall rate of each strong classifier must satisfy Strong_Detection_Rate ≥ 99.95%.
(3) The third distributed training and the final distributed training rounds.
From the third distributed training onwards, each round proceeds in the same way as the previous two, until only three computers remain, at which point special handling is required. In the present invention, when only three computers remain in training, two more rounds of distributed training still need to be performed, and the remaining negative samples must be completely separated from the positive samples in these last two distributed training rounds.
Let the overall recall rate index be A and the recall rate index of each of the first k strong classifier stages be a; the overall recall rate achieved by the first k stages is then a^k, and since the cascade has not yet exited, a^k is still above A. If the next-stage strong classifier is to serve as the last strong classifier, it therefore only needs to reach a recall rate of A/a^k, so the recall rate of that strong classifier is set to A/a^k. When A/a^k < a, this setting relaxes the exit condition of the last strong classifier to a certain extent and thus accelerates its convergence.
Throughout the training process, the samples that remain misclassified at the end usually have a very low degree of separability, i.e., they are hard to classify correctly; in traditional training methods, training on this portion of the samples tends to consume most of the training resources and most of the weak classifiers. The above method can alleviate, and to a great extent even eliminate, this training cost, so that the training of the classifier converges faster.
The evaluation indexes therefore need to be adjusted accordingly: the recall rate and false alarm rate of the second-to-last distributed training are set to Strong_Detection_Rate ≥ 99.95% and Strong_False_Alarm ≤ 30%, providing a transition for the convergence of the indexes of the last training round; the indexes of the last round are then set to Strong_Detection_Rate ≥ 99.95% and Strong_False_Alarm ≤ 0%.
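A minimal sketch of this index schedule in Python (the function name stage_indexes and the round bookkeeping are assumptions; the three index pairs come from the description above):

def stage_indexes(round_idx, total_rounds):
    # Return the recall / false-alarm indexes handed to every computer in the
    # given distributed-training round (0-based). Ordinary rounds use
    # (recall >= 99.95%, false alarm <= 50%); the second-to-last round tightens
    # the false alarm budget to 30% as a transition, and the last round to 0%.
    if round_idx == total_rounds - 1:
        return {"Strong_Detection_Rate": 0.9995, "Strong_False_Alarm": 0.0}
    if round_idx == total_rounds - 2:
        return {"Strong_Detection_Rate": 0.9995, "Strong_False_Alarm": 0.30}
    return {"Strong_Detection_Rate": 0.9995, "Strong_False_Alarm": 0.50}

print([stage_indexes(r, 4)["Strong_False_Alarm"] for r in range(4)])   # [0.5, 0.5, 0.3, 0.0]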
Embodiment 7: an Adboost training learning device based on distributed computation and detection cost ordering, comprising:
a detection target setting device, which sets the detection target and determines the number of computers for distribution according to the number of classifiers in the cascade;
a second sample distributing device, which distributes the second sample to each computer;
a first distributed training device, which performs distributed training: the first sample is split and distributed to the computers, each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
a second distributed training device, which performs the next round of distributed training: the misclassified first samples on the computers are collected, split, and distributed to the computers again, each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
a repeated distributed training device, which repeats the training of the second distributed training device until the new cascade classifier meets the exit condition;
wherein the first sample is the more numerous class of the positive and negative samples to be tested, and the second sample is the less numerous class of the positive and negative samples to be tested.
More specifically, the device of this embodiment sets the detection target, including the false alarm rate index of the cascade classifier and the recall rate index of the cascade classifier; distributing the strong classifier parameters means distributing the same recall rate and false alarm rate indexes as those of each strong classifier in the whole cascade classifier; the number of computers is n, the number of strong classifiers is Max_Num, the number of strong classifiers assembled equals (n + n/2 + n/4 + ...), and (n + n/2 + n/4 + ...) < 2n ≤ Max_Num.
The sorting is such that strong classifiers with higher classification capacity are placed earlier and strong classifiers with weaker classification capacity are placed later. Forming the new cascade classifier: the strong classifier with the strongest classification capacity is taken as the first-stage strong classifier of the cascade, followed in order by the second-stage strong classifier, the third-stage classifier, ..., and the n-th-stage strong classifier, where n is the number of computers; the first-stage through n-th-stage strong classifiers together constitute the new cascade classifier.
When the distributed training reaches the second-to-last and the last round, the false alarm rate allocated in the strong classifier parameters is gradually reduced, and in the last round the false alarm rate is set to no more than 0%.
The device of any of the technical schemes described in this embodiment may be used to perform the methods described in Embodiments 1 to 6.

Claims (10)

1. An Adboost training learning method based on distributed computation and detection cost ordering, characterized by comprising the following steps:
Step 1: setting a detection target and determining the number of computers for distribution according to the number of classifiers in the cascade;
Step 2: distributing the second sample to each computer;
Step 3: distributed training: splitting the first sample and distributing it to the computers; each computer training a strong classifier according to the strong classifier parameters it is allocated, and forming a new cascade classifier according to the classification capacity of the strong classifiers;
Step 4: the next round of distributed training: collecting the misclassified first samples from the computers, splitting them, and distributing them to the computers again; each computer training a strong classifier according to the strong classifier parameters it is allocated, and forming a new cascade classifier according to the classification capacity of the strong classifiers;
Step 5: repeating Step 4 until the new cascade classifier meets the exit condition;
wherein the first sample is the more numerous class of the positive and negative samples to be tested, and the second sample is the less numerous class of the positive and negative samples to be tested.
2. The Adboost training learning method based on distributed computation and detection cost ordering according to claim 1, characterized in that Step 3 consists of:
S3.1. splitting the first sample and distributing it to the computers in turn;
S3.2. distributing the strong classifier parameters to the computers;
S3.3. training a strong classifier on each computer, collecting the trained strong classifiers on a server, applying the collected strong classifiers to all test samples on the server, sorting the strong classifiers by classification capacity, and assembling the collected strong classifiers in the sorted order into the new cascade classifier.
3. The Adboost training learning method based on distributed computation and detection cost ordering according to claim 1, characterized in that in Step 1 the detection target is set, including the false alarm rate index of the cascade classifier and the recall rate index of the cascade classifier; in Step 2, distributing the strong classifier parameters means distributing the same recall rate and false alarm rate indexes as those of each strong classifier in the whole cascade classifier; the number of computers is n, the number of strong classifiers is Max_Num, the number of strong classifiers assembled equals (n + n/2 + n/4 + ...), and (n + n/2 + n/4 + ...) < 2n ≤ Max_Num.
4. The Adboost training learning method based on distributed computation and detection cost ordering according to claim 2, characterized in that the sorting is such that strong classifiers with higher classification capacity are placed earlier and strong classifiers with weaker classification capacity are placed later; forming the new cascade classifier means taking the strong classifier with the strongest classification capacity as the first-stage strong classifier of the cascade classifier, followed in order by the second-stage strong classifier, the third-stage classifier, ..., and the n-th-stage strong classifier, where n is the number of computers, the first-stage through n-th-stage strong classifiers together constituting the new cascade classifier.
5. The Adboost training learning method based on distributed computation and detection cost ordering according to claim 1, 2, 3, or 4, characterized in that when the distributed training reaches the second-to-last and the last round, the false alarm rate allocated in the strong classifier parameters is gradually reduced, and in the last round the false alarm rate is set to no more than 0%.
6. An Adboost training learning device based on distributed computation and detection cost ordering, characterized by comprising:
a detection target setting device, which sets the detection target and determines the number of computers for distribution according to the number of classifiers in the cascade; a second sample distributing device, which distributes the second sample to each computer;
a first distributed training device, which performs distributed training: the first sample is split and distributed to the computers, each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
a second distributed training device, which performs the next round of distributed training: the misclassified first samples on the computers are collected, split, and distributed to the computers again, each computer trains a strong classifier according to the strong classifier parameters it is allocated, and a new cascade classifier is formed according to the classification capacity of the strong classifiers;
a repeated distributed training device, which repeats the training of the second distributed training device until the new cascade classifier meets the exit condition;
wherein the first sample is the more numerous class of the positive and negative samples to be tested, and the second sample is the less numerous class of the positive and negative samples to be tested.
7. The Adboost training learning device based on distributed computation and detection cost ordering according to claim 6, characterized in that the detection target is set, including the false alarm rate index of the cascade classifier and the recall rate index of the cascade classifier; distributing the strong classifier parameters means distributing the same recall rate and false alarm rate indexes as those of each strong classifier in the whole cascade classifier; the number of computers is n, the number of strong classifiers is Max_Num, the number of strong classifiers assembled equals (n + n/2 + n/4 + ...), and (n + n/2 + n/4 + ...) < 2n ≤ Max_Num.
8. The Adboost training learning device based on distributed computation and detection cost ordering according to claim 7, characterized in that the sorting is such that strong classifiers with higher classification capacity are placed earlier and strong classifiers with weaker classification capacity are placed later; forming the new cascade classifier means taking the strong classifier with the strongest classification capacity as the first-stage strong classifier of the cascade classifier, followed in order by the second-stage strong classifier, the third-stage classifier, ..., and the n-th-stage strong classifier, where n is the number of computers, the first-stage through n-th-stage strong classifiers together constituting the new cascade classifier.
9. The Adboost training learning device based on distributed computation and detection cost ordering according to claim 6, 7, or 8, characterized in that when the distributed training reaches the second-to-last and the last round, the false alarm rate allocated in the strong classifier parameters is gradually reduced, and in the last round the false alarm rate is set to no more than 0%.
10. The Adboost training learning device based on distributed computation and detection cost ordering according to claim 9, characterized in that the method of gradually reducing the false alarm rate in the strong classifier parameters is: the overall recall rate index is set to A, the recall rate index of each of the first k strong classifier stages is a, the overall recall rate achieved by the first k stages has reached a^k, and the recall rate of the strong classifier is set to A/a^k.
CN201610201531.6A 2016-03-31 2016-03-31 Training learning method and device based on distributed computing and detection cost sequence Active CN105760899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610201531.6A CN105760899B (en) 2016-03-31 2016-03-31 Training learning method and device based on distributed computing and detection cost sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610201531.6A CN105760899B (en) 2016-03-31 2016-03-31 Training learning method and device based on distributed computing and detection cost sequence

Publications (2)

Publication Number Publication Date
CN105760899A (en) 2016-07-13
CN105760899B (en) 2019-04-05

Family

ID=56347107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610201531.6A Active CN105760899B (en) 2016-03-31 2016-03-31 Training learning method and device based on distributed computing and detection cost sequence

Country Status (1)

Country Link
CN (1) CN105760899B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070112701A1 (en) * 2005-08-15 2007-05-17 Microsoft Corporation Optimization of cascaded classifiers
CN101075291A (en) * 2006-05-18 2007-11-21 中国科学院自动化研究所 Efficient promoting exercising method for discriminating human face
CN101743537A (en) * 2007-07-13 2010-06-16 微软公司 Multiple-instance pruning for learning efficient cascade detectors
US20130013542A1 (en) * 2009-08-11 2013-01-10 At&T Intellectual Property I, L.P. Scalable traffic classifier and classifier training system
CN103020712A (en) * 2012-12-28 2013-04-03 东北大学 Distributed classification device and distributed classification method for massive micro-blog data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
贺灏等 (He Hao et al.): "一种分布式计算的Adaboost训练算法" [An Adaboost Training Algorithm Based on Distributed Computing], 《计算机应用与软件》 [Computer Applications and Software] *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463444A (en) * 2017-07-13 2017-12-12 中国航空工业集团公司西安飞机设计研究所 A kind of false alarm rate distribution method

Also Published As

Publication number Publication date
CN105760899B (en) 2019-04-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant