CN109871799A - Deep-learning-based method for detecting a driver's mobile-phone use - Google Patents
Deep-learning-based method for detecting a driver's mobile-phone use
- Publication number: CN109871799A (application CN201910106254.4A)
- Authority: CN (China)
- Prior art keywords: mobile phone, bounding box, dynamic object, object region, hand
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y02D30/70 — Reducing energy consumption in wireless communication networks
Abstract
The invention discloses a deep-learning-based method for detecting a driver's mobile-phone use. On the basis of deep-learning object detection, it preprocesses the captured video and optimizes the convolutional neural network. Video of the driver using a phone in the cab is captured and converted into frame images, the video is processed with dynamic tracking, and mutually interfering target regions are trained and detected separately. The advantages are a substantial improvement in the real-time performance and accuracy of deep-learning object detection, a large reduction in computation time, and a much higher feature-extraction accuracy for the hand and the phone.
Description
Technical field
The present invention relates to a method for detecting a driver's mobile-phone use, and in particular to a deep-learning-based method for detecting a driver's mobile-phone use.
Background art
The mobile phone, one of the great inventions of the 20th century, has made communication easy and convenient, but it has also gradually introduced many negative effects into our lives. In traffic safety, for example, drivers often use their phones on sparsely travelled road sections or while waiting at traffic lights. According to estimates, a driver using a phone reacts to an emergency about 30% more slowly than a drunk driver, and the probability of a traffic accident while using a phone is roughly four times that of normal driving. The safety risk of using a phone while driving is therefore considerable.
" People's Republic of China Road Traffic Safety Law Implementation Regulations " the 62nd article of Section 3 and 123 command of the Ministry of Public Security
Middle clear stipulaties can fine to the driver in driving procedure using mobile phone 200 yuan, detain 2 points of punishment.But in reality
This punishment measure is performed practically no function substantially in the traffic administration of border.Main cause is that driving plays the illegal activities of mobile phone by artificial
Management cost is too big, and cannot still be identified well by camera.So if to manage driver plays mobile phone row when driving
For it is very necessary for detecting driver with the presence or absence of mobile phone phenomenon is played by target detection technique.
At present there is little research on detecting a driver's phone use. Existing approaches to monitoring phone use while driving fall into three classes. The first is manual enforcement: traffic police and other officers judge by eye whether a driver is using a phone. The second is detection based on the phone's signal: a signal receiver installed on the vehicle or in a roadside monitoring zone checks whether a moving vehicle emits a phone signal, and if so a camera photographs the driver for evidence. The third is camera-based detection: a camera mounted inside the vehicle or in a monitoring zone the vehicle passes through records video or images of the driver, which are analyzed with computer-vision techniques to detect phone use. Because manual monitoring is inefficient, cannot collect evidence, and has significant limitations, we analyze only the signal-based and camera-based approaches.
(1) Detection methods based on the mobile-phone signal
Rodríguez-Ascariz et al. proposed an automatic electronic system for detecting driver phone use: radio-frequency circuitry with two antennas inside the vehicle captures the power emitted when the driver uses a phone, and a signal-analysis algorithm identifies when the phone is in use. Zhi Lukui et al. invented a phone-signal shielding device that integrates tightly with the vehicle's existing sensing and braking subsystems and with microwave devices such as Bluetooth, infrared, and radar; without affecting safe driving, it shields phone signals within a range of about 0.5 square metres around the driver. This approach cuts off phone use at the source, but the driver cannot use the phone at a critical moment, which poses a serious safety risk. Bo C et al. designed and implemented the TEXIVE method, which uses the inertial sensors of the driver's own smartphone to detect the irregular and fine-grained movements of phone use while driving; the method can also distinguish driver from passenger, reducing interference with detection. Leem S et al. proposed monitoring driver-related anomalies such as vital signs (breathing, heart rate) and the phone signal with an impulse-radio ultra-wideband (IR-UWB) radar; the system can detect the driver's phone use even with various movements and changing background objects in the car. Such signal-based methods usually require installing sensors, are costly, and are prone to false detections, so they have limited social utility.
(2) Detection methods based on computer vision
Tsinghua University proposed a method for recognizing phone-call behavior. A camera installed in the vehicle monitors the driver's face in real time; the detected face region is expanded by half its size on each side, and model features indicating a hand-held phone are computed to judge whether the driver is making a call. Wang D et al. proposed a phone-use detection method based on an in-vehicle camera mounted on the windshield: a motion-analysis algorithm decomposes phone activity into three actions, and an AoG graph represents the hierarchical structure of the activity and the temporal relations between actions, from which phone use is inferred. Torres R et al. proposed detecting the mobile phone with deep-learning object detection: captured driver video is fed into a convolutional neural network trained on a large number of images to extract phone features, enabling detection and classification of a phone in the cab and hence analysis of phone use. Computer-vision-based detection has the advantages of low-cost camera equipment and contactless operation, and is currently the most promising approach. However, traditional object detection, which combines hand-crafted features with a classifier, performs poorly on non-rigid objects, while deep-learning object detection can handle non-rigid objects but still needs improvement in both accuracy and real-time performance.
Summary of the invention
The technical problem to be solved by the invention is to provide a deep-learning-based method for detecting a driver's mobile-phone use with high detection accuracy and strong real-time performance.
The technical scheme adopted by the invention to solve the above technical problem is a deep-learning-based method for detecting a driver's mobile-phone use, comprising:

Capturing video samples of the driver using a phone in the cab and converting them into frame images; subtracting corresponding pixel values to obtain a difference image; binarizing the difference image with a set threshold; filtering, dilating, eroding, and otherwise processing the binarized dynamic-object image to remove isolated noise points and obtain the dynamic object region; then computing horizontal and vertical projections of the dynamic object region to find its four critical points (top, bottom, left, right) and segment it, yielding the dynamic-object-region bounding box;
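The frame-differencing and projection step above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: the filtering/dilation/erosion step is replaced by a crude isolated-pixel removal, and the function name is ours, not the patent's.

```python
import numpy as np

def motion_bounding_box(prev_frame, curr_frame, threshold=50):
    """Coarse localization of the dynamic object region by frame differencing.

    prev_frame / curr_frame: 2-D uint8 grayscale frames of equal shape.
    Returns (top, bottom, left, right) of the moving region, or None.
    """
    # 1. Difference image: subtract corresponding pixel values.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    # 2. Binarize with the set threshold (45~60 in the text).
    mask = (diff > threshold).astype(np.uint8)
    # 3. Crude stand-in for the filter/dilate/erode step: drop isolated
    #    foreground pixels that have no 4-connected foreground neighbour.
    p = np.pad(mask, 1)
    nb = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
    mask = mask & (nb > 0)
    # 4. Horizontal/vertical projections give the four critical points.
    rows, cols = mask.sum(axis=1), mask.sum(axis=0)
    if rows.sum() == 0:
        return None
    ys, xs = np.nonzero(rows)[0], np.nonzero(cols)[0]
    return int(ys[0]), int(ys[-1]), int(xs[0]), int(xs[-1])
```

Cropping the frame to the returned box gives the candidate region that is later normalized and classified, avoiding candidate extraction over the whole image.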
Coarsely localizing the dynamic object by applying a normalized warping operation to the obtained dynamic-object-region bounding boxes, rescaling every box to a square of uniform resolution;
Feeding the normalized dynamic-object-region bounding boxes into the trained convolutional neural network; classifying each box with the classifier to distinguish the two target classes, hand and phone; assigning hand and phone labels and obtaining hand and phone confidence scores; removing interfering targets according to a confidence threshold; then localizing each box with the bounding-box regressor to obtain the hand and phone dynamic-object-region bounding boxes;
If the distance between the hand bounding box and the phone bounding box is less than or equal to the sum of the radius of the hand bounding box and the radius of the phone bounding box, the two boxes are judged to overlap and the driver is judged to be using the phone; if the distance is greater than the sum of the two radii, the boxes are judged not to overlap and the driver is judged not to be using the phone.
The threshold for binarizing the difference image is 45~60. Under constant ambient brightness, pixels whose corresponding values change by less than the threshold between consecutive frames are labelled background pixels; pixels whose corresponding values change by more than the threshold are labelled object pixels.
The confidence threshold value is 0.88~0.92, preferably 0.9.
The training steps of the convolutional neural network are as follows: 15,000 phone-use pictures are collected from the internet and made into a positive sample set containing three kinds of target pictures (hand only, phone only, and hand and phone together); 20,000 pictures of the complex backgrounds around hands and phones are collected from the internet and made into a negative sample set;

The collected data set is annotated to produce the hand and phone labels, and converted into a computer-operable format;

The converted hand and phone data are fed into two independent neural networks for training. The hand region and background are annotated first and fed into the hand network for training, followed by hand-classifier classification and regressor localization, which output a confidence score and a bounding box. At the same time, the phone region and background are annotated and fed into the phone network for training, followed by phone-classifier classification and regressor localization, which output a confidence score and a bounding box;

After training, the training data of the two independent networks are merged to obtain the trained convolutional neural network.
The neural network is an optimized neural network; the specific optimization method is as follows:

1) Optimizing the number of convolution kernels

(1) Convert each feature map into a computable matrix, the grey-level co-occurrence matrix: take an arbitrary point (x, y) in the image and a point (x+a, y+b) offset from it, and let the pair of grey values at these two points be (i, j). Moving the point (x, y) over the entire image yields all values of (i, j); if the number of grey levels is k, there are k² possible values of (i, j). Traverse the whole image, count the occurrences of each (i, j), arrange the counts in a square matrix, and normalize them into occurrence probabilities p(i, j); the square matrix of p(i, j) is the grey-level co-occurrence matrix;

(2) Compute the entropy of the grey-level co-occurrence matrix, which satisfies

Ent = -Σᵢ Σⱼ p(i, j) · log p(i, j)

where p(i, j) is the normalized grey-level co-occurrence matrix and Ent is the greyscale-image entropy;

(3) Sort the computed entropies by size and set the proportion of convolution kernels to delete:

D_E = N_E × Threshold

where D_E is the number of convolution kernels to delete and N_E is the number of convolution kernels in the current layer. The deletion threshold Threshold is set to 10%, retaining 90% of the convolution kernels.
Compared with the prior art, the invention has the following advantages. On the basis of deep-learning object detection, the captured video is preprocessed and the convolutional neural network is optimized, which substantially improves the real-time performance and accuracy of deep-learning detection, and good experimental results are achieved. Video samples of the driver using a phone in the cab are captured and converted into frame images, and the video is processed with dynamic tracking: coarse target localization is performed on dynamically anomalous regions and the resulting candidate regions are normalized. Coarse localization of the dynamic regions avoids extracting candidate regions from the whole image, greatly reducing computation time, and normalization eases subsequent feature extraction by the convolutional neural network. The normalized candidate regions are fed into the trained convolutional neural network; unlike a traditional CNN that classifies multiple target regions with one network, mutually interfering target regions are trained and detected separately, which greatly improves the feature-extraction accuracy for hand and phone, and removing ineffective convolution kernels from the convolutional layers greatly improves real-time detection while maintaining accuracy. By labelling the hand and the phone, displaying confidence scores, and outputting regression bounding boxes, and then setting a distance threshold between the two, phone use can be judged accurately.

The convolutional neural network is optimized in two respects: first, the number of convolution kernels in the convolutional layers is effectively pruned, reducing computation time; second, each target is trained with its own network, improving detection accuracy.

The experimental data in Table 1 show that, when networks with different numbers of convolution kernels are trained on the same sample database, the complete Alexnet achieves 95.7% accuracy while the model retaining 90% of the convolution kernels still achieves 93.3%. The accuracy drops only about two percentage points relative to the complete model, but the model's computation time falls by nearly 40% and its size shrinks from 212 MB to 34 MB, a reduction of more than 6 times, striking a good balance between accuracy and real-time performance.
Table 1
Training two convolutional neural networks separately has three main advantages: first, the hand and the phone do not lower each other's accuracy by occluding each other during detection; second, the relationship between the two is more easily determined from the two output bounding boxes; third, since the two networks are independent, they can be run simultaneously on two computers when the hardware is relatively weak, with the results aggregated on one computer, reducing the network's computation time.

In addition, we trained one traditional single convolutional neural network for comparison with the optimized method of the invention. The samples undergo a single pass: the hand and the phone are annotated in the image simultaneously and fed into one convolutional neural network for training; the classifier then classifies hand and phone, the regressor localizes them, and their confidence scores and bounding boxes are output.
A confusion matrix is used to assess the classification accuracy of the separately trained networks of the invention and of the traditional single network. A confusion matrix is an error matrix commonly used to visually assess the performance of a supervised learning algorithm; its size is n_classes × n_classes, where n_classes is the number of classes, here three: hand, phone, and background.

First, background is assigned index 1 in the confusion matrix, hand index 2, and phone index 3.

Second, 200 test samples are randomly selected and fed into the single network and the separately trained networks for confusion-matrix analysis.

Finally, comparing the confusion-matrix values shows that the single network is seriously disturbed by overlap between the hand and phone target regions and often misclassifies hand and phone as background, giving low accuracy; with separate training, the overlap interference between hand and phone no longer occurs, and the overall recognition rate improves by about 7% over the single network.
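The confusion-matrix evaluation above can be sketched as follows, using the class indexing from the text (1 = background, 2 = hand, 3 = phone). The function names and the example labels are our illustration, not the patent's code.

```python
import numpy as np

# Class indices as given in the text.
LABELS = {1: "background", 2: "hand", 3: "mobile phone"}

def confusion_matrix(y_true, y_pred, n_classes=3):
    """n_classes x n_classes error matrix; row = true class, column = predicted."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t - 1, p - 1] += 1
    return cm

def overall_accuracy(cm):
    """Fraction of samples on the diagonal (correctly classified)."""
    return cm.trace() / cm.sum()
```

Running both the single network's predictions and the separately trained networks' predictions through this matrix makes the hand/phone-to-background misclassifications of the single network directly visible as off-diagonal mass in the background column.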
As shown in Table 2, compared with the R-CNN model, our algorithm still follows the traditional object-detection pipeline: candidate target objects are found first, features are extracted, and a classifier and regressor perform detection and identification. The difference is that the candidate-region extraction method is changed and the convolution kernels of the network are optimized, so the time cost drops by a factor of about 47; the algorithm is nearly 20 times faster than Fast R-CNN and about 4 times faster than Faster R-CNN. In this test, our algorithm outperforms the currently mainstream Faster R-CNN model in both accuracy and real-time performance. Moreover, Faster R-CNN's candidate-region proposal network (RPN) runs on a GPU, while our dynamic-tracking algorithm runs on a CPU and so demands less of the hardware. The algorithm is also highly tolerant of lighting, background, and body posture, giving it strong practical applicability.

Table 2: model comparison results
Description of drawings
Fig. 1 is the target-region model diagram used during detection in the embodiment of the invention.
Specific embodiment
The present invention will be described in further detail below with reference to the embodiments and the drawings.
Embodiment: a deep-learning-based method for detecting a driver's mobile-phone use, comprising:

Capturing video samples of the driver using a phone in the cab and converting them into frame images; subtracting corresponding pixel values to obtain a difference image; binarizing the difference image with a set threshold of 45~60. When the ambient brightness changes little, pixels whose corresponding values change by less than the preset threshold are taken to be background pixels, while image regions whose pixel values change greatly are taken to be caused by a moving object and are marked as foreground pixels. The binarized dynamic-object image is filtered, dilated, eroded, and so on to remove isolated noise points and obtain the dynamic object region; the dynamic object region is then projected horizontally and vertically to find its four critical points and segment it, yielding the dynamic-object-region bounding box;
Coarsely localizing the dynamic object by applying a normalized warping operation to the obtained dynamic-object-region bounding boxes, rescaling every box to a square of uniform resolution. The purpose is twofold: first, many classifiers accept only fixed-size input images, and the classifier can handle the warped image without loss of classification accuracy; second, it simplifies subsequent processing by the neural network model and reduces computation. In this embodiment the normalized image resolution is 227×227.
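A minimal sketch of this normalization step, under our own assumptions: nearest-neighbour sampling is used here for simplicity, whereas a production system would more likely use bilinear interpolation; the function name is ours.

```python
import numpy as np

def normalize_region(region, size=227):
    """Rescale a cropped bounding-box image to a fixed square resolution
    (227x227 in the embodiment) by nearest-neighbour sampling."""
    h, w = region.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return region[rows][:, cols]
```

Every candidate region, whatever its original aspect ratio, then enters the classifier at the same 227×227 resolution.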
Feeding the normalized dynamic-object-region bounding boxes into the trained convolutional neural network; classifying each box with the classifier to distinguish the two target classes, hand and phone; assigning hand and phone labels and obtaining hand and phone confidence scores; removing interfering targets according to a confidence threshold of 0.88~0.92, preferably 0.9; then localizing each box with the bounding-box regressor to obtain the hand and phone dynamic-object-region bounding boxes;
The training steps of the convolutional neural network are as follows: 15,000 phone-use pictures are collected from the internet and made into a positive sample set containing three kinds of target pictures (hand only, phone only, and hand and phone together). These positive samples cover complex backgrounds, varied illumination, angles, and resolutions, ensuring sample diversity; note that since the system mainly runs in a driving environment, more phone-use samples taken while driving should be collected. 20,000 pictures of the complex backgrounds around hands and phones are collected from the internet and made into a negative sample set; the negative samples are mostly complex backgrounds around hands and phones, and since the system likewise operates in a driving environment, interference from the instrument panel and from objects around and resembling the phone should be considered when choosing them;

The collected data set is annotated with the ImageLabeler tool bundled with MATLAB: the target name to be annotated is entered in the annotation box to produce the hand and phone labels; for example, the label of the hand region can be set to "hand". The annotated image data are converted into a computer-operable table format for subsequent unified computation;

The converted hand and phone data are fed into two independent neural networks for training. The hand region and background are annotated first and fed into the hand network for training, followed by hand-classifier classification and regressor localization, which output a confidence score and a bounding box. At the same time, the phone region and background are annotated and fed into the phone network for training, followed by phone-classifier classification and regressor localization, which output a confidence score and a bounding box;

After training, the training data of the two independent networks are merged to obtain the trained convolutional neural network.
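The separate-training structure above can be illustrated schematically. The real detectors are convolutional networks; here tiny logistic-regression models on synthetic 2-D features stand in for them, purely to show that each target is trained against background independently of the other. All names, features, and data in this sketch are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_binary(X, y, lr=0.5, epochs=300):
    """Tiny logistic-regression stand-in for one independent detector network."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid scores
        g = p - y                                 # gradient of the log loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def confidence(net, x):
    """Confidence score of one detector for a feature vector x."""
    w, b = net
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# Synthetic 2-D "features" for background, hand, and phone samples.
bg    = rng.normal([0.0, 0.0], 0.3, (50, 2))
hand  = rng.normal([2.0, 0.0], 0.3, (50, 2))
phone = rng.normal([0.0, 2.0], 0.3, (50, 2))
labels = np.r_[np.zeros(50), np.ones(50)]

# Each target gets its own network, trained only against background,
# so hand and phone never interfere with each other during training.
hand_net  = train_binary(np.vstack([bg, hand]), labels)
phone_net = train_binary(np.vstack([bg, phone]), labels)
```

At detection time each candidate region is scored by both networks, and the two confidence scores and bounding boxes are combined afterwards, mirroring the merge step described above.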
Traditional convolutional neural networks often classify multiple target objects with a single network. Although this is convenient, once the detected targets overlap they interfere with each other during classification and accuracy suffers greatly; training each target with its own network therefore greatly improves detection accuracy.

The neural network used for training is an optimized network, the main optimization being the number of convolution kernels. Each convolutional layer in the Alexnet network contains tens or even hundreds of convolution-kernel filters, and the purpose of this optimization is to examine how this large number of filters affects the overall performance of the model. Since the network cannot be analyzed intuitively and effectively from the convolution kernels directly, we analyze the quality of each kernel through the convolution feature map it generates.

Taking the first convolutional layer as an example, it uses 96 filters, each an 11×11 two-dimensional matrix that judges whether a sub-matrix of the image matches the form of the convolution filter; where the form matches, the response is larger than its surroundings, producing the feature map. Evidently, not all convolution kernels contribute to feature extraction, especially for a specific target, so we propose a method of ranking kernels by the influence of their feature maps. The specific optimization steps are as follows:
(1) Convert each feature map into a computable matrix, the grey-level co-occurrence matrix: in a two-dimensional coordinate system xOy, take an arbitrary point (x, y) in the image and a point (x+a, y+b) offset from it, and let the pair of grey values at these two points be (i, j). Moving the point (x, y) over the entire image yields all values of (i, j); if the number of grey levels is k, there are k² possible values of (i, j). Traverse the whole image, count the occurrences of each (i, j), arrange the counts in a square matrix, and normalize them into occurrence probabilities p(i, j); the square matrix of p(i, j) is the grey-level co-occurrence matrix;
(2) Compute the entropy of the grey-level co-occurrence matrix, which satisfies

Ent = -Σᵢ Σⱼ p(i, j) · log p(i, j)

where p(i, j) is the normalized grey-level co-occurrence matrix and Ent is the greyscale-image entropy;
(3) Compute the entropy of the grey-level co-occurrence matrix of the feature map generated by each convolution kernel, and sort the entropies by size. The larger the entropy, the more complex the image and the higher the kernel's contribution to the model. Since a convolutional neural network contains hundreds or thousands of convolution kernels, and each layer's input depends on the previous layer's output, visual inspection is clearly infeasible. If too many convolution kernels are retained, accuracy is guaranteed but real-time performance is not; if too many are pruned, detection becomes faster but useful kernels may be deleted, lowering the model's accuracy and losing more than is gained. To grasp the balance between accuracy and real-time performance effectively, the proportion of convolution kernels to delete is set as

D_E = N_E × Threshold

where D_E is the number of convolution kernels to delete and N_E is the number of kernels in the current layer. Taking the 96 kernels of Alexnet's first layer as an example, setting the deletion threshold Threshold to 10% deletes the kernels ranked 1~9 from the bottom of the contribution ranking. Using this calculation, the threshold can be adjusted as needed to tune the number of kernel filters in each convolutional layer: reduce the threshold for better recognition, or increase it for better real-time performance, and find a suitable threshold that balances model accuracy and real-time performance;
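Steps (1)–(3) can be sketched as follows: GLCM entropy of a feature map with a fixed offset (a, b) = (0, 1), then selection of the lowest-entropy kernels for deletion. The quantization to 8 grey levels and the function names are our assumptions, not the patent's.

```python
import numpy as np

def glcm_entropy(fmap, levels=8):
    """Entropy Ent = -sum_ij p(i,j) log p(i,j) of the grey-level
    co-occurrence matrix of one feature map, offset (a, b) = (0, 1)."""
    m = fmap.max()
    q = (fmap / m * (levels - 1)).astype(int) if m > 0 else np.zeros(fmap.shape, int)
    i, j = q[:, :-1].ravel(), q[:, 1:].ravel()   # grey-value pairs (i, j)
    counts = np.zeros((levels, levels))
    np.add.at(counts, (i, j), 1)                 # k x k co-occurrence counts
    p = counts / counts.sum()                    # normalize to probabilities
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def kernels_to_delete(entropies, threshold=0.10):
    """D_E = N_E * Threshold: indices of the lowest-entropy (least
    informative) kernels, to be pruned from the current layer."""
    n_del = int(len(entropies) * threshold)
    return sorted(np.argsort(entropies)[:n_del].tolist())
```

A flat feature map yields zero entropy (a single co-occurrence cell with probability 1) and so is pruned first, which matches the intuition that a kernel producing no structure contributes little to the model.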
When both a hand region and a phone region are detected, it must be judged whether the two regions overlap, excluding the possibility that they exist separately and far apart. Because hand posture and phone size vary, the bounding-box sizes are not fixed. We therefore set a distance threshold between the two regions to decide whether the hand region and the phone region overlap, and hence whether the driver is using the phone. The main steps are:

First, read the four vertex coordinates a1, a2, a3, a4 of the hand bounding box and the four vertex coordinates b1, b2, b3, b4 of the phone bounding box; the target-region model is shown in Fig. 1.

Second, compute the centre coordinates c1 and c2 of the two regions, satisfying formula 3.

Third, compute the distance d between the two region centres, satisfying formula 4.

Finally, compute the radii r1 and r2 of the two rectangular bounding boxes, satisfying formula 5.
If the distance between the hand's dynamic-target-region bounding box and the phone's dynamic-target-region bounding box is less than or equal to the sum of their radii, d ≤ r1 + r2, the two classes of bounding boxes are judged to overlap and the driver is judged to be playing with a mobile phone; if the distance is greater than the sum of the radii, d > r1 + r2, the two classes of bounding boxes are judged not to overlap and the driver is not playing with a mobile phone.
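The four-step overlap test can be sketched as follows, assuming the "radius" of a rectangular bounding box (formula 5) means half its diagonal; the corner format and helper names are illustrative:

```python
import math

def boxes_overlap(hand_box, phone_box):
    """Each box is (x1, y1, x2, y2). Approximate it by a circle:
    centre = box centre (formula 3), radius = half the diagonal
    (formula 5); overlap is declared when d <= r1 + r2 (formula 4)."""
    def centre_radius(box):
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        r = math.hypot(x2 - x1, y2 - y1) / 2.0
        return cx, cy, r

    c1x, c1y, r1 = centre_radius(hand_box)
    c2x, c2y, r2 = centre_radius(phone_box)
    d = math.hypot(c1x - c2x, c1y - c2y)
    return d <= r1 + r2  # True -> regions overlap -> phone-playing behaviour

print(boxes_overlap((0, 0, 10, 10), (8, 8, 18, 18)))      # -> True
print(boxes_overlap((0, 0, 4, 4), (100, 100, 104, 104)))  # -> False
```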
Claims (6)
1. A detection method of a driver's mobile-phone-playing behaviour based on deep learning, characterised in that the specific detection method is as follows:
collecting video samples of the driver playing with a mobile phone in the cab and converting them into frame images; subtracting corresponding pixel values to obtain a difference image; binarising the difference image with a set threshold, then applying filtering, dilation and erosion operations to the binarised dynamic-target image to remove isolated noise points and obtain a dynamic target region; performing horizontal and vertical projection on the obtained dynamic target region to find the four critical points (top, bottom, left and right) for segmentation, obtaining a dynamic-target-region bounding box;
coarsely locating the dynamic target by applying a normalising warp operation to the obtained dynamic-target-region bounding box, unifying the bounding boxes into squares of consistent resolution;
inputting the normalised dynamic-target-region bounding boxes into the trained convolutional neural network; performing target classification on the input bounding boxes with a classifier to distinguish the two target classes, hand and mobile phone, assigning hand and phone labels and obtaining hand and phone confidences; removing interfering targets according to a confidence threshold; then performing region-frame localisation on the bounding boxes with a bounding-box regressor, obtaining the two classes of dynamic-target-region bounding boxes, hand and mobile phone;
if the distance between the hand's dynamic-target-region bounding box and the phone's dynamic-target-region bounding box is less than or equal to the sum of the radius of the hand's bounding box and the radius of the phone's bounding box, judging that the two classes of bounding boxes overlap and that the driver is playing with a mobile phone; if the distance is greater than that sum, judging that the two classes of bounding boxes do not overlap and that the driver is not playing with a mobile phone.
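The frame-differencing and projection step of claim 1 can be sketched minimally as follows. This is an illustrative NumPy version that omits the filtering, dilation and erosion stages; the threshold value 50 is just a sample from the 45 to 60 range of claim 2:

```python
import numpy as np

def dynamic_region_bbox(prev_frame, frame, threshold=50):
    """Frame differencing: pixels whose absolute change exceeds the
    threshold are foreground; the bounding box is taken from the extreme
    foreground coordinates (the horizontal/vertical projection extremes)."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    mask = diff > threshold                  # binarisation of the difference
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None                          # no moving target detected
    # four critical points of the dynamic region: left, top, right, bottom
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

a = np.zeros((8, 8), dtype=np.uint8)
b = a.copy()
b[2:5, 3:6] = 200                            # a small moving blob
print(dynamic_region_bbox(a, b))             # -> (3, 2, 5, 4)
```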
2. The detection method of a driver's mobile-phone-playing behaviour based on deep learning according to claim 1, characterised in that the threshold for binarising the difference image is 45 to 60; under identical ambient brightness, if the change in corresponding pixel values between adjacent frame images is less than the threshold, the pixels in those regions are marked as background pixels; if the change in corresponding pixel values between adjacent frame images is greater than the threshold, the pixels in those regions are marked as target pixels.
3. The detection method of a driver's mobile-phone-playing behaviour based on deep learning according to claim 1, characterised in that the confidence threshold is 0.88 to 0.92.
4. The detection method of a driver's mobile-phone-playing behaviour based on deep learning according to claim 1, characterised in that the confidence threshold is 0.9.
5. The detection method of a driver's mobile-phone-playing behaviour based on deep learning according to claim 1, characterised in that the training steps of the convolutional neural network are: collecting 15000 phone-playing pictures from the Internet to make a positive sample set comprising three classes of target pictures: hand only, phone only, and hand and phone together; collecting 20000 pictures of complex backgrounds around hands and phones from the Internet to make a negative sample set;
annotating the collected data set, producing the hand and phone labels, and converting them into a format the computer can operate on;
inputting the converted hand and phone data into two independent neural networks for training: first annotating the hand regions and background and feeding them into the hand network for training, then performing hand-classifier classification and regressor localisation to give confidence scores and bounding boxes; at the same time annotating the phone regions and background and feeding them into the phone network for training, then performing phone-classifier classification and regressor localisation to give confidences and bounding boxes;
after training is complete, merging the training data of the two independent neural networks to obtain the trained convolutional neural network.
6. The detection method of a driver's mobile-phone-playing behaviour based on deep learning according to claim 1, characterised in that the neural network is an optimised neural network, the specific optimisation method being as follows:
1) optimising the number of convolution kernels:
(1) converting the feature map into a computable matrix, namely a grey-level co-occurrence matrix: take an arbitrary point (x, y) in the image and a point (x+a, y+b) offset from it, and let the grey values of this pair of points be (i, j); moving the point (x, y) over the entire image yields all values of (i, j); if the number of grey levels is k, then (i, j) can take k² values; traverse the whole image, count the number of occurrences of each (i, j), arrange the counts in a square matrix, and normalise them into occurrence probabilities p(i, j); the square matrix formed by p(i, j) is the grey-level co-occurrence matrix;
(2) computing the entropy of the grey-level co-occurrence matrix, which satisfies the formula:
Ent = -Σ_i Σ_j p(i, j) log p(i, j)
where p(i, j) is the normalised grey-level co-occurrence matrix and Ent is the grey-image entropy;
(3) sorting the computed entropies by size and setting the proportion threshold of convolution kernels to delete, as given by the formula:
Threshold = D_E / N_E
where D_E is the number of convolution kernels to delete and N_E is the number of convolution kernels in the current layer; the deletion threshold Threshold is set to 10%, retaining 90% of the convolution kernels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910106254.4A CN109871799B (en) | 2019-02-02 | 2019-02-02 | Method for detecting mobile phone playing behavior of driver based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109871799A true CN109871799A (en) | 2019-06-11 |
CN109871799B CN109871799B (en) | 2023-03-24 |
Family
ID=66918614
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287906A (en) * | 2019-06-26 | 2019-09-27 | 四川长虹电器股份有限公司 | Method and system based on image/video detection people " playing mobile phone " |
CN111461020A (en) * | 2020-04-01 | 2020-07-28 | 浙江大华技术股份有限公司 | Method and device for identifying behaviors of insecure mobile phone and related storage medium |
CN111462180A (en) * | 2020-03-30 | 2020-07-28 | 西安电子科技大学 | Object tracking method based on AND-OR graph AOG |
CN111913857A (en) * | 2020-07-08 | 2020-11-10 | 浙江大华技术股份有限公司 | Method and device for detecting operation behavior of intelligent equipment |
CN111931587A (en) * | 2020-07-15 | 2020-11-13 | 重庆邮电大学 | Video anomaly detection method based on interpretable space-time self-encoder |
CN112686188A (en) * | 2021-01-05 | 2021-04-20 | 西安理工大学 | Front windshield and driver region positioning method based on deep learning method |
CN112926510A (en) * | 2021-03-25 | 2021-06-08 | 深圳市商汤科技有限公司 | Abnormal driving behavior recognition method and device, electronic equipment and storage medium |
CN113139577A (en) * | 2021-03-22 | 2021-07-20 | 广东省科学院智能制造研究所 | Deep learning image classification method and system based on deformable convolution network |
CN114253395A (en) * | 2021-11-11 | 2022-03-29 | 易视腾科技股份有限公司 | Gesture recognition system for television control and recognition method thereof |
WO2024001617A1 (en) * | 2022-06-30 | 2024-01-04 | 京东方科技集团股份有限公司 | Method and apparatus for identifying behavior of playing with mobile phone |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013091370A1 (en) * | 2011-12-22 | 2013-06-27 | 中国科学院自动化研究所 | Human body part detection method based on parallel statistics learning of 3d depth image information |
CN106611169A (en) * | 2016-12-31 | 2017-05-03 | 中国科学技术大学 | Dangerous driving behavior real-time detection method based on deep learning |
CN108509902A (en) * | 2018-03-30 | 2018-09-07 | 湖北文理学院 | A kind of hand-held telephone relation behavioral value method during driver drives vehicle |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |