AU2020103494A4 - Handheld call detection method based on lightweight target detection network - Google Patents

Handheld call detection method based on lightweight target detection network

Info

Publication number
AU2020103494A4
Authority
AU
Australia
Prior art keywords
depth
convolution layer
detection
model
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020103494A
Inventor
Zhongxin Zhang
Zuopeng Zhao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to AU2020103494A priority Critical patent/AU2020103494A4/en
Application granted granted Critical
Publication of AU2020103494A4 publication Critical patent/AU2020103494A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/44Event detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a handheld call detection method based on a lightweight target detection network, which comprises the following steps: S1, acquiring a driver image data set and labeling it to obtain a sample image data set; S2, constructing a handheld call detection model based on an LMS-DN network and training the model with the sample image data set; S3, conducting a performance test on the trained handheld call detection model against the indexes of detection precision, detection efficiency and model size, and repeating step S2 to optimize and retrain the model whenever the test result is lower than a preset threshold; and S4, inputting the driver image acquired in real time into the optimized handheld call detection model to obtain the driver handheld call detection result. The invention effectively improves detection precision and detection efficiency, reduces the model size, has strong anti-interference capability, is suitable for embedded equipment, can overcome the influence of strong light and weak light, and can complete real-time target detection with high accuracy in scenes with obstacle interference.

Description

Figure 1 (drawing sheet 1/8):
S1. Acquiring a driver image data set, and labeling the driver image data set to obtain a sample image data set;
S2. Constructing a handheld call detection model based on an LMS-DN network, and training the handheld call detection model through the sample image data set obtained in step S1 to obtain a trained handheld call detection model;
S3. Conducting a performance test on the trained handheld call detection model based on the indexes of the detection precision, the detection efficiency and the model size; if the performance test result is lower than a preset threshold, repeating step S2 to optimize and train the model until the performance result reaches the preset threshold;
S4. Inputting the driver image acquired in real time into the optimized handheld call detection model in step S3 to obtain the detection result of the driver handheld call.
Handheld call detection method based on lightweight target detection network
TECHNICAL FIELD
[01] The invention relates to the technical field of target detection, in particular to a handheld call detection method based on a lightweight target detection network.
BACKGROUND
[02] In order to avoid traffic accidents, detection algorithms for dangerous driver behaviors such as handheld phone calls, and their application in embedded environments, have been widely studied. However, most research on target detection networks aims to improve accuracy while ignoring model size, computational cost and parameter count. In 2014, Girshick et al. proposed the region-based convolutional neural network R-CNN, which used region proposals to detect objects. In 2015, the improved versions Fast R-CNN and Faster R-CNN were proposed to realize end-to-end detection of targets. Both Fast R-CNN and Faster R-CNN are two-stage algorithms whose accuracy is higher than that of traditional algorithms, but their detection speed is slow and cannot meet real-time requirements. Mask R-CNN is also a two-stage approach, in which ROIAlign is adopted instead of ROIPool to reduce quantization error. In 2016, Redmon et al. proposed YOLO, followed by YOLO9000 (YOLOv2) and then YOLOv3, which greatly improved the detection of small objects compared with YOLO9000. The YOLO series merges the two-stage tasks of generating and classifying candidate boxes in Faster R-CNN into a single stage, which greatly improves detection speed; however, YOLOv3 still has a large number of parameters and its detection speed remains limited. To run a target detection network on mobile equipment, the network must be both accurate enough and fast. In addition, the SSD algorithm proposed by Wei Liu et al. performs regression detection over the whole image with improved speed, but its accuracy on small targets is greatly reduced. YOLO, SSD and their derived networks are representative one-stage networks: they realize end-to-end training, use a single convolutional neural network to directly predict the types and positions of different targets, and improve detection speed at the cost of some accuracy.
[03] Tiny-YOLO was proposed in 2017 and is widely used due to its high speed and low memory consumption, but it is still difficult to achieve real-time performance on a device without a GPU (Graphics Processing Unit). In the same year, Andrew G. Howard et al. proposed MobileNet for mobile and embedded vision applications. In 2018, the MobileNet-SSD network, derived from VGG-SSD, was proposed, which greatly reduced the number of parameters and at the same time greatly improved detection speed; however, the risk of missed and false detections of small objects remained high, so real-time and accurate handheld call detection could not be realized on embedded equipment. Therefore, it is necessary to provide a handheld call detection method based on a lightweight target detection network so as to enhance the real-time performance and the accuracy of handheld call detection on embedded equipment.
SUMMARY
[04] The invention aims to provide a handheld call detection method based on a lightweight target detection network, so as to solve the technical problems in the prior art, effectively improve detection precision and detection efficiency, reduce the model size, have strong anti-interference capability, be suitable for embedded equipment, overcome the influence of strong light and weak light, and complete real-time target detection with high accuracy in a scene with certain obstacle interference.
[05] To achieve the above objectives, the invention provides the following scheme: the invention provides a handheld call test method based on a lightweight target detection network, which comprises the following steps:
[06] S1. Acquiring a driver image data set, and labeling the driver image data set to obtain the sample image data set;
[07] S2. Constructing a handheld call detection model based on an LMS-DN network, and training the handheld call detection model through the sample image data set obtained in step S1 to obtain a trained handheld call detection model;
[08] S3. Conducting a performance test on the trained handheld call detection model based on the indexes of the detection precision, the detection efficiency and the model size, wherein if the performance test result is lower than a preset threshold, repeating step S2 to optimize and train the model until the performance result reaches the preset threshold;
[09] S4. Inputting the driver image acquired in real time into the optimized handheld call detection model in step S3 to obtain the result of the driver handheld call detection.
[010] Preferably, in step S2, the LMS-DN network is divided into two parts: the first part is the basic classification network Mobilenet-I, and the second part is the SSDLite network.
[011] Preferably, the Mobilenet-I network comprises a Conv convolution layer with a depth of 3x3, a SinConv convolution layer with a depth of 3x3, a BnConv3 convolution layer with a depth of 3x3, a BnConv3 convolution layer with a depth of 5x5, two BnConv6 convolution layers with a depth of 3x3, a BnConv6 convolution layer with a depth of 5x5, a BnConv6 convolution layer with a depth of 3x3, an FC full connection layer and a pooling layer, which are sequentially connected.
[012] Preferably, the BnConv3 convolution layer of a depth of 3x3 includes a Conv convolution layer of a depth of 1x1, a DwiseConv layer of a depth of 3x3, and a Conv convolution layer of a depth of 1x1, which are sequentially connected.
[013] Preferably, the BnConv3 convolution layer of a depth of 5x5 includes a Conv convolution layer of a depth of 1x1, a DwiseConv layer of a depth of 5x5, and a Conv convolution layer of a depth of 1x1, which are sequentially connected.
[014] Preferably, the BnConv6 convolution layer of a depth of 3x3 includes a Conv convolution layer of a depth of 1x1, a DwiseConv layer of a depth of 3x3, and a Conv convolution layer of a depth of 1x1, which are sequentially connected.
[015] Preferably, the BnConv6 convolution layer of a depth of 5x5 includes a Conv convolution layer of a depth of 1x1, a DwiseConv layer of a depth of 5x5, and a Conv convolution layer of a depth of 1x1, which are sequentially connected.
[016] Preferably, the SinConv convolution layer with a depth of 3x3 comprises two branch structures, each comprising a DWConv convolution layer with a depth of 3x3 and a Conv convolution layer with a depth of 1x1 which are sequentially connected; the two branches are merged into one path of signals through a Concat function.
[017] Preferably, the SSDLite network includes a prediction layer that employs depth separable convolution.
[018] Preferably, the detection accuracy is measured through the accuracy rate, recall rate, precision rate and mean average precision (mAP); the detection efficiency is measured through the number of frames detected per second; and the model size is measured in megabytes (MB).
[019] The invention discloses the following technical effects:
[020] Aiming at a driver active safety prevention and control system, a handheld call detection model is constructed on the basis of the lightweight network LMS-DN, which combines an improved Mobilenet-I with an improved SSDLite network. The experimental results show that the handheld call detection model achieves higher detection precision and detection efficiency on small targets such as mobile phones, with a smaller model size and strong anti-interference capability. The model is suitable for embedded equipment, can overcome the influence of strong light and weak light, and can realize real-time detection of targets with high accuracy in scenes with certain obstacle interference.
BRIEF DESCRIPTION OF THE FIGURES
[040] In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the figures required by the embodiments are briefly described below. Obviously, the figures in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other figures from them without inventive labor.
[041] Figure 1 is a flowchart of a handheld call detection method based on a lightweight target detection network according to the present invention;
[042] Figure 2 is a schematic diagram of the Mobilenet-I network structure, consisting of the overall structure of the Mobilenet-I network (Figure 2(a)), the BnConv3 layer with a depth of 3x3 (Figure 2(b)), the BnConv3 convolution layer with a depth of 5x5 (Figure 2(c)), the BnConv6 convolution layer with a depth of 3x3 (Figure 2(d)), the BnConv6 convolution layer with a depth of 5x5 (Figure 2(e)), and the SinConv convolution layer with a depth of 3x3 (Figure 2(f));
[043] Figure 3 is a schematic diagram showing the overall structure of the LMS-DN network according to the present invention;
[044] Figure 4 is a schematic diagram of an SSD network structure according to the present invention;
[045] Figure 5 is an image of a SafeImgs data set according to the example in the present invention;
[046] Figure 6 is a diagram showing detection effects of the MobileNetV2-SSDLite and the LMS-DN network on the KITTI data set according to the example in the present invention, wherein Figure 6(a) shows the detection result under the MobileNetV2-SSDLite, and Figure 6(b) shows the detection result under the LMS-DN;
[047] Figure 7 shows detection results of the LMS-DN network and the MobileNetV2-SSDLite network under different thresholds according to the example in the present invention.
DESCRIPTION OF THE INVENTION
[048] The technical solutions in the embodiments of the present invention will be clearly and fully described below with reference to the accompanying figures. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained from the embodiments of the present invention by one of ordinary skill in the art without creative labor are within the scope of the present invention.
[049] In order to make the above objects, features and advantages of the present invention be more clear and understandable, the present invention will now be described in further detail with reference to the accompanying figures and specific embodiments thereof.
[050] Referring to Figure 1, the embodiment provides a handheld call detection method based on a lightweight target detection network, which specifically comprises the following steps:
[051] S1. Acquiring a driver image data set, and labeling the driver image data set to obtain the sample image data set;
[052] In the embodiment, the sample image data set is obtained from the SafeImgs data set, whose images all come from a driver video monitoring platform, as shown in Figure 5. The SafeImgs data set contains 30 videos of drivers making handheld phone calls while driving, from which a subset of driving videos is selected as the verification set. OpenCV is used to extract driver face images from the collected videos at 10 frames per second, yielding 5500 images for the training data set and 550 images for the verification data set. The images in the training and verification data sets are labeled with the labelImg annotation tool; after the samples are labeled, an xml file is generated for each sample.
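By way of illustration, the frame sampling described above can be sketched as follows. This is a minimal Python example with OpenCV; the video path, output folder and file naming are hypothetical placeholders, not details taken from the embodiment:

```python
import os
import cv2  # OpenCV, used in the embodiment to sample frames from driver videos

def extract_frames(video_path, out_dir, target_fps=10):
    """Sample frames from a driver video at roughly target_fps frames per second."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unreadable
    step = max(int(round(native_fps / target_fps)), 1)  # keep every `step`-th frame
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Hypothetical usage: sample one driver video into a training-image folder.
# extract_frames("driver_call_01.mp4", "dataset/train_images", target_fps=10)
```

The sampled images would then be annotated with labelImg, which writes one Pascal-VOC-style xml file per image, matching the labeling step described above.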
[053] S2. Constructing a handheld call detection model based on the LMS-DN network, and training the handheld call detection model through the sample image data set obtained in step S1 to obtain a trained handheld call detection model;
[054] The LMS-DN network is divided into two parts: the first part is the basic classification network Mobilenet-I, and the second part is the SSDLite network.
[055] The Mobilenet-I network is an improved version of the MobilenetV2 network: it enlarges the receptive field by drawing on the multi-branch feature of the Inception structure, incorporates separable convolutions with a depth of 5x5, and adjusts the overall structure of the MobilenetV2. The Mobilenet-I network includes: a Conv convolution layer with a depth of 3x3, a SinConv convolution layer with a depth of 3x3, a BnConv3 convolution layer with a depth of 3x3, a BnConv3 convolution layer with a depth of 5x5, two BnConv6 convolution layers with a depth of 3x3, a BnConv6 convolution layer with a depth of 5x5, a BnConv6 convolution layer with a depth of 3x3, an FC fully connected layer, and a pooling layer, which are sequentially connected, as shown in Figure 2(a). Among them, the BnConv3 convolution layer with a depth of 3x3 includes a Conv convolution layer with a depth of 1x1, a DwiseConv layer with a depth of 3x3, and a Conv convolution layer with a depth of 1x1, which are sequentially connected, as shown in Figure 2(b); the BnConv3 convolution layer with a depth of 5x5 includes a Conv convolution layer with a depth of 1x1, a DwiseConv layer with a depth of 5x5, and a Conv convolution layer with a depth of 1x1, which are sequentially connected, as shown in Figure 2(c); the BnConv6 convolution layer with a depth of 3x3 includes a Conv convolution layer with a depth of 1x1, a DwiseConv layer with a depth of 3x3, and a Conv convolution layer with a depth of 1x1, which are sequentially connected, as shown in Figure 2(d); the BnConv6 convolution layer with a depth of 5x5 includes a Conv convolution layer with a depth of 1x1, a DwiseConv layer with a depth of 5x5, and a Conv convolution layer with a depth of 1x1, which are sequentially connected, as shown in Figure 2(e); the SinConv convolution layer with a depth of 3x3 comprises two branch structures, each comprising a DWConv convolution layer with a depth of 3x3 and a Conv convolution layer with a depth of 1x1 which are sequentially connected; the two branches are merged into one path of signals through a Concat function, as shown in Figure 2(f).
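For illustration only, the block structures described above can be sketched in PyTorch-style code as follows. The embodiment itself is trained under the Caffe framework, and the channel counts, expansion factors and activation choices below are assumptions rather than the disclosed configuration:

```python
import torch
import torch.nn as nn

class BnConv(nn.Module):
    """Bottleneck block: 1x1 Conv -> kxk depthwise Conv -> 1x1 Conv (k = 3 or 5)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, expansion=3):
        super().__init__()
        mid = in_ch * expansion  # expansion factor 3 or 6 (BnConv3 / BnConv6)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),                      # 1x1 pointwise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, kernel_size, padding=kernel_size // 2,
                      groups=mid, bias=False),                          # kxk depthwise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),                      # 1x1 pointwise
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

class SinConv(nn.Module):
    """Two parallel branches (3x3 depthwise -> 1x1 conv), merged by Concat."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
                nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            )
        self.b1, self.b2 = branch(), branch()

    def forward(self, x):
        # Concatenate the two branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b2(x)], dim=1)
```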
[056] The Mobilenet-I network contains separable convolutions with a depth of 5x5, while existing networks usually use only 3x3 kernels. For depthwise separable convolutions, one 5x5 kernel can be cheaper than two stacked 3x3 kernels. Formally, given the input shape (H, W, M) and output shape (H, W, N), where H and W are the height and width of the feature map and M and N are the input and output depths respectively, the computational costs of the separable convolutions with depths of 5x5 and 3x3 are:
[057] C_5x5 = H * W * M * (25 + N)
[058] C_3x3 = H * W * M * (9 + N)
[059] C_5x5 < 2 * C_3x3, if N > 7
[060] where C_5x5 and C_3x3 are the computational costs of the separable convolutions with depths of 5x5 and 3x3, respectively.
[061] For the same effective receptive field, when the output depth N > 7, the computation of one 5x5 convolution kernel is less than that of two 3x3 convolution kernels. The final pooling layer of Mobilenet-I is removed and an auxiliary convolution layer is added to connect the base network and the SSDLite network, forming the overall structure of the LMS-DN network, as shown in Figure 3.
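The cost formulas above can be checked numerically with a short sketch; the feature map and channel sizes are arbitrary illustrative values:

```python
def sep_conv_cost(h, w, m, n, k):
    """Cost of a depthwise-separable convolution: depthwise (k*k) plus pointwise (n)."""
    return h * w * m * (k * k + n)

# One 5x5 separable layer vs. two stacked 3x3 separable layers
# covering the same effective receptive field.
h, w, m = 32, 32, 64
for n in (4, 7, 8, 64):
    c5 = sep_conv_cost(h, w, m, n, 5)        # H*W*M*(25+N)
    c3x2 = 2 * sep_conv_cost(h, w, m, n, 3)  # 2*H*W*M*(9+N)
    print(n, c5 < c3x2)  # prints True exactly when N > 7
```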
[062] A standard SSD network is used as the basis of the SSDLite, and a prediction layer is added on top of it. The added prediction layer adopts depthwise separable convolution, and the feature maps produced by several different convolution layers are fused for detection.
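A sketch of such a depthwise separable prediction layer is given below, continuing the PyTorch-style illustration above; the anchor and class counts are placeholders, not the disclosed configuration:

```python
import torch.nn as nn

def ssdlite_head(in_ch, num_anchors, num_classes):
    """SSDLite-style prediction layer: a 3x3 depthwise conv followed by a 1x1
    pointwise conv, replacing the single dense 3x3 conv of the standard SSD head."""
    out_ch = num_anchors * (num_classes + 4)  # class scores + 4 box offsets per anchor
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
        nn.BatchNorm2d(in_ch), nn.ReLU6(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1),                                      # pointwise
    )
```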
[063] The SSD is a target detection network that directly predicts target types and locations. The SSD model starts with a standard image classification architecture, called the base network, to which additional layers are appended; the feature maps output by six different convolution layers are fused for detection. The SSD is shown in Figure 4. The base network of the SSD is VGG16, and the network structure is constructed by converting two fully connected layers into convolution layers and adding four further convolution layers. The six convolution layers participating in the feature map fusion generate a certain number of default boxes, whose scale S_k is calculated by the following equation:
[064] S_k = S_min + ((S_max - S_min) / (m - 1)) * (k - 1), k ∈ [1, m]
[065] where S_min represents the minimum scale of the default boxes and equals 0.2; S_max represents the maximum scale and equals 0.95; k denotes the k-th default box; m represents the number of default boxes; and each default box is further adjusted by the aspect ratio a.
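Expressed as code, the default box scale schedule reads as follows; this is a direct transcription of the equation above, using the stated values S_min = 0.2 and S_max = 0.95:

```python
def default_box_scale(k, m, s_min=0.2, s_max=0.95):
    """Scale of the k-th default box (k in [1, m]) across m fused feature layers."""
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)

# Scales for the six feature layers participating in the fusion.
print([round(default_box_scale(k, 6), 3) for k in range(1, 7)])
# -> [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]
```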
[066] When the SSD calculates the loss function, the total loss is obtained as the sum of the localization loss and the confidence loss.
[067] According to the feature extraction mechanism for small targets and the characteristics of different convolution layers, the LMS-DN network additionally extracts the features of two dedicated convolution layers for detection, which is very effective for small target objects such as mobile phones.
[068] S3. Conducting a performance test on the trained handheld call detection model based on the indexes of the detection precision, the detection efficiency and the model size, wherein if the performance test result is lower than a preset threshold, step S2 is repeated to optimize and train the model until the performance result reaches the preset threshold. A target detection model has several evaluation standards; according to the emphasis of different standards, the embodiment evaluates the trained handheld call detection model by the detection accuracy, the detection efficiency and the model size. The detection accuracy is evaluated by the accuracy rate, recall rate, precision rate and mean average precision (mAP), as given by the following formulas:
[069] accuracy = (1 - a/m) * 100%
[070] precision = TP / (TP + FP) * 100%
[071] recall = TP / (TP + FN) * 100%
[072] mAP = (1 / |Q_R|) * Σ_{q ∈ Q_R} AP(q)
[073] where a is the number of misclassified samples, m is the total number of samples, TP is the number of positive samples correctly identified as positive, FP is the number of negative samples misidentified as positive, FN is the number of positive samples misidentified as negative, and Q_R is the set of categories, with |Q_R| their total number. The detection efficiency is evaluated by the number of frames detected per second (FPS), and the model size is evaluated in megabytes (MB). These performance indexes are weighed through experiments to determine which algorithm is more suitable for embedded deployment.
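For illustration, these metrics can be computed from the counts defined above as follows; this is a minimal sketch, and the per-class AP(q) values would come from the detector's precision-recall curves, which are not computed here:

```python
def accuracy(misclassified, total):
    """accuracy = (1 - a/m) * 100%"""
    return (1 - misclassified / total) * 100.0

def precision(tp, fp):
    """precision = TP / (TP + FP) * 100%"""
    return tp / (tp + fp) * 100.0

def recall(tp, fn):
    """recall = TP / (TP + FN) * 100%"""
    return tp / (tp + fn) * 100.0

def mean_average_precision(ap_per_class):
    """mAP: mean of the average precision AP(q) over all categories q."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Illustrative values only:
print(precision(tp=90, fp=10), recall(tp=90, fn=20))
print(mean_average_precision({"phone": 0.52, "face": 0.47}))
```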
[074] S4. Inputting the driver image acquired in real time into the optimized handheld call detection model in step S3 to obtain the result of the driver handheld call detection.
[075] To further verify the validity of the handheld call detection model of the invention, an experimental platform equipped with an Intel Core i5-7200U processor, an NVIDIA GTX 1080 with 8 GB of video memory, an Ubuntu 16.04 software environment and the Caffe deep learning framework is used in the embodiment. The handheld call detection model based on the LMS-DN network of the invention is compared with existing detection models in the following five aspects:
[076] Scheme 1: Experiments were conducted on the KITTI data set by combining the MobileNet, MobileNetV2, and MobileNet-I base networks with the SSD architecture, respectively, and the results are shown in Table 1:
[077] Table 1
Network         | Data set | Model Size/MB | Mean accuracy/mAP (%)
MobileNet-SSD   | KITTI    | 25.1          | 46.8
MobileNetV2-SSD | KITTI    | 21.8          | 47.0
MobileNet-I-SSD | KITTI    | 21.5          | 48.3
[078] As can be seen from Table 1, comparing the first two rows, when all detection networks use the SSD, taking MobileNetV2 as the base network reduces the model size by 3.3 MB while the detection accuracy is essentially unaffected. Comparing the second and third rows, MobileNet-I as the base network introduces the SinConv convolution unit by reference to the Inception structure and uses more separable convolutions with a depth of 5x5 than MobileNetV2, yet the model size is slightly reduced, confirming that once the network depth reaches a certain degree, the computational cost of one 5x5 convolution kernel is smaller than that of two 3x3 convolution kernels. Moreover, as a base network, MobileNet-I achieves higher detection accuracy than MobileNetV2, with the mAP increased by 1.3%, which shows that MobileNet-I is also superior to MobileNetV2 in feature extraction, giving a smaller network model and higher detection accuracy.
Scheme 2: The LMS-DN network of the invention and two popular lightweight target detection networks, MobileNet-SSD and MobileNetV2-SSDLite, are tested on the VOC, KITTI and Safe_Imgs data sets.
[079] The experimental results on the VOC and KITTI data sets are shown in Tables 2 and 3, respectively;
[080] Table 2
Network             | Data set | Model Size/MB | Mean accuracy/mAP (%)
MobileNet-SSD       | VOC0712  | 23.3          | 72.3
MobileNetV2-SSDLite | VOC0712  | 19.7          | 72.6
LMS-DN              | VOC0712  | 20.5          | 76.2
[081] Table 3
Network             | Data set | Model Size/MB | FPS | Mean accuracy/mAP (%)
MobileNet-SSD       | KITTI    | 25.1          | 53  | 46.8
MobileNetV2-SSDLite | KITTI    | 21.6          | 59  | 47.1
LMS-DN              | KITTI    | 22.5          | 58  | 49.7
[082] From Tables 2 and 3, compared with MobileNetV2-SSDLite, the LMS-DN network of the invention improves the mAP by 3.6% on VOC0712 and 2.6% on KITTI while the FPS is only slightly reduced. The detection effects of MobileNetV2-SSDLite and the LMS-DN network on the KITTI data set are shown in Figure 6, where Figure 6(a) is the detection result under MobileNetV2-SSDLite and Figure 6(b) is the detection result under LMS-DN. As can be seen from Figure 6, under MobileNetV2-SSDLite several smaller cars are not detected, whereas the LMS-DN network, with its further improvement on SSDLite, greatly improves the detection of small target objects and can detect target vehicles that are far away. Therefore, the improvement to SSDLite enables LMS-DN to detect more small target objects than MobileNetV2-SSDLite.
[083] The experimental results on the SafeImgs data set are shown in Table 4;
[084] Table 4
Network             | Model Size/MB | FPS | Accuracy rate (%) | Precision rate (%) | Recall rate (%)
MobileNet-SSD       | 18.6          | 59  | 81.3              | 90.6               | 80.6
MobileNetV2-SSDLite | 17.5          | 65  | 82.7              | 92.8               | 83.5
LMS-DN              | 17.9          | 63  | 86.2              | 96.3               | 85.2
[085] As can be seen from Table 4, the LMS-DN network of the present invention reaches an accuracy of 86.2% at the cost of only a slightly larger model, which is 4.9% higher than MobileNet-SSD and 3.5% higher than MobileNetV2-SSDLite. At the same time, both the precision rate and the recall rate are improved to different degrees.
[086] Scheme 3: The LMS-DN network is compared with the conventional MobileNetV2-SSDLite network on the SafeImgs data set under the same experimental conditions and different detection accuracy thresholds, and the comparison results are shown in Figure 7. When the threshold is gradually increased from 0.25 to 0.55, the detection performance of MobileNetV2-SSDLite decreases markedly; for LMS-DN, however, the accuracy is still 79.60% at a threshold of 0.55, showing that LMS-DN has good anti-interference capability.
[087] Scheme 4: To further verify the performance of the handheld call detection model on the embedded development board NVIDIA Jetson TX2, the LMS-DN network is compared with the conventional VGG16-SSD, MobileNet-SSD and MobileNetV2-SSDLite on the NVIDIA Jetson TX2, and the experimental results are shown in Table 5:
[088] Table 5
Network             | Average time (ms)
VGG16-SSD           | 246
MobileNet-SSD       | 58
MobileNetV2-SSDLite | 50
LMS-DN              | 56
[089] As can be seen from Table 5, although the average detection time of LMS-DN is only 6 ms longer than that of MobileNetV2-SSDLite, it still satisfies the real-time detection requirement of mobile devices, and compared with the other two models, LMS-DN realizes higher detection accuracy at almost the same speed. Therefore, the LMS-DN model of the invention is better suited to being ported to an embedded development board.
[090] Scheme 5: To further verify the reliability, the handheld call detection model of the present invention is tested on the NVIDIA Jetson TX2 with pictures taken under different illumination and obstacle occlusion conditions.
[091] The detection results under different illumination conditions are shown in Table 6:
[092] Table 6
Condition    | Accuracy (%) | Number of test pictures | Average detection time (ms)
Normal       | 89.2         | 80                      | 58
Strong light | 72.5         | 80                      | 59
Weak light   | 77.1         | 80                      | 57
[093] As can be seen from Table 6, the detection accuracy of the LMS-DN network of the present invention remains at a high level under different illumination intensities.
[094] The obstacle occlusion tests on the NVIDIA Jetson TX2 are shown in Table 7:
[095] Table 7
Condition         | Accuracy rate (%) | Number of test pictures | Average detection time (ms)
Normal            | 89.2              | 80                      | 58
Partially blocked | 70.8              | 80                      | 59
[096] As can be seen from Table 7, the LMS-DN network can still accurately detect the target mobile phone even when most of the phone is blocked by the palm. In summary, the handheld call detection model constructed on the LMS-DN network not only overcomes the influence of strong light and weak light, but also realizes real-time detection of the target with high accuracy in scenes with certain obstacle interference.
[097] Although the invention has been described with reference to specific examples, it will be appreciated by those skilled in the art that the invention may be embodied in many other forms, in keeping with the broad principles and the spirit of the invention described herein.
[098] The present invention and the described embodiments specifically include the best method known to the applicant of performing the invention. The present invention and the described preferred embodiments specifically include at least one feature that is industrially applicable.

Claims (10)

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A handheld call detection method based on a lightweight target detection network, characterized by comprising the following steps:
S1. Acquiring a driver image data set, and labeling the driver image data set to obtain the sample image data set;
S2. Constructing a handheld call detection model based on an LMS-DN network, and training the handheld call detection model through the sample image data set obtained in step S1 to obtain a trained handheld call detection model;
S3. Conducting a performance test on the trained handheld call detection model based on the indexes of the detection precision, the detection efficiency and the model size, wherein if the performance test result is lower than a preset threshold, repeating step S2 to optimize and train the model until the performance result reaches the preset threshold;
S4. Inputting the driver image acquired in real time into the optimized handheld call detection model in step S3 to obtain the result of the driver handheld call detection.
2. The handheld call detection method based on the lightweight target detection network according to claim 1, wherein in step S2, the LMS-DN network is divided into two parts: the first part is a basic classification network Mobilenet-I, and the second part is an SSDLite network.
3. The handheld call detection method based on the lightweight target detection network according to claim 2, wherein the Mobilenet-I network comprises a Conv convolution layer of a depth of 3x3, a SinConv convolution layer of a depth of 3x3, a BnConv3 convolution layer of a depth of 3x3, a BnConv3 convolution layer of a depth of 5x5, two BnConv6 convolution layers of a depth of 3x3, a BnConv6 convolution layer of a depth of 5x5, a BnConv6 convolution layer of a depth of 3x3, an FC full connection layer and a pooling layer which are sequentially connected.
4. The handheld call detection method based on the lightweight target detection network according to claim 3, wherein the BnConv3 convolution layer with a depth of 3x3 comprises a Conv convolution layer with a depth of 1x1, a DwiseConv layer with a depth of 3x3 and a Conv convolution layer with a depth of 1x1 which are sequentially connected.
5. The handheld call detection method based on the lightweight target detection network according to claim 3, wherein the BnConv3 convolution layer with a depth of 5x5 comprises a Conv convolution layer with a depth of 1x1, a DwiseConv layer with a depth of 5x5, and a Conv convolution layer with a depth of 1x1 which are sequentially connected.
6. The handheld call detection method based on the lightweight target detection network according to claim 3, wherein the BnConv6 convolution layer with a depth of 3x3 comprises a Conv convolution layer with a depth of 1x1, a DwiseConv layer with a depth of 3x3, and a Conv convolution layer with a depth of 1x1 which are sequentially connected.
7. The handheld call detection method based on the lightweight target detection network according to claim 3, wherein the BnConv6 convolution layer with a depth of 5x5 comprises a Conv convolution layer with a depth of 1x1, a DwiseConv layer with a depth of 5x5, and a Conv convolution layer with a depth of 1x1 which are sequentially connected.
8. The handheld call detection method based on the lightweight target detection network according to claim 3, wherein the SinConv convolution layer with a depth of 3x3 comprises two branch structures, each comprising a DWConv convolution layer with a depth of 3x3 and a Conv convolution layer with a depth of 1x1 which are sequentially connected; the two branches are merged into one path of signals through a Concat function.
9. The handheld call detection method based on a lightweight target detection network according to claim 2, wherein the SSDLite network includes a prediction layer that employs depth separable convolution.
10. The handheld call detection method based on the lightweight target detection network according to claim 1, wherein the detection accuracy is measured through the accuracy rate, recall rate, precision rate and mean average precision (mAP); the detection efficiency is measured through the number of frames detected per second; and the model size is measured in megabytes (MB).
AU2020103494A 2020-11-17 2020-11-17 Handheld call detection method based on lightweight target detection network Ceased AU2020103494A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020103494A AU2020103494A4 (en) 2020-11-17 2020-11-17 Handheld call detection method based on lightweight target detection network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020103494A AU2020103494A4 (en) 2020-11-17 2020-11-17 Handheld call detection method based on lightweight target detection network

Publications (1)

Publication Number Publication Date
AU2020103494A4 (en)

Family

ID=74192070

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020103494A Ceased AU2020103494A4 (en) 2020-11-17 2020-11-17 Handheld call detection method based on lightweight target detection network

Country Status (1)

Country Link
AU (1) AU2020103494A4 (en)


Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836657A (en) * 2021-02-08 2021-05-25 中国电子科技集团公司第三十八研究所 Pedestrian detection method and system based on lightweight YOLOv3
CN113313679A (en) * 2021-05-21 2021-08-27 浙江大学 Bearing surface defect detection method based on multi-source domain depth migration multi-light source integration
CN113569667B (en) * 2021-07-09 2024-03-08 武汉理工大学 Inland ship target identification method and system based on lightweight neural network model
CN113569667A (en) * 2021-07-09 2021-10-29 武汉理工大学 Inland ship target identification method and system based on lightweight neural network model
CN113591648A (en) * 2021-07-22 2021-11-02 北京工业大学 Method, system, device and medium for detecting real-time image target without anchor point
CN113591648B (en) * 2021-07-22 2024-06-28 北京工业大学 Anchor-point-free real-time image target detection method, system, equipment and medium
CN114419473B (en) * 2021-11-17 2024-04-16 中国电子科技集团公司第三十八研究所 Deep learning real-time target detection method based on embedded equipment
CN114419473A (en) * 2021-11-17 2022-04-29 中国电子科技集团公司第三十八研究所 Deep learning real-time target detection method based on embedded equipment
CN114120093B (en) * 2021-12-01 2024-04-16 安徽理工大学 Coal gangue target detection method based on improved YOLOv algorithm
CN114120093A (en) * 2021-12-01 2022-03-01 安徽理工大学 Coal gangue target detection method based on improved YOLOv5 algorithm
CN114360736B (en) * 2021-12-10 2024-07-05 三峡大学 COVID-19 identification method based on multi-information sample class self-adaptive classification network
CN114360736A (en) * 2021-12-10 2022-04-15 三峡大学 COVID-19 identification method based on multi-information sample class self-adaptive classification network
CN114495060B (en) * 2022-01-25 2024-03-26 青岛海信网络科技股份有限公司 Road traffic marking recognition method and device
CN114495060A (en) * 2022-01-25 2022-05-13 青岛海信网络科技股份有限公司 Road traffic marking identification method and device
CN114612825A (en) * 2022-03-09 2022-06-10 云南大学 Target detection method based on edge equipment
CN114612825B (en) * 2022-03-09 2024-03-19 云南大学 Target detection method based on edge equipment
CN114898171A (en) * 2022-04-07 2022-08-12 中国科学院光电技术研究所 Real-time target detection method suitable for embedded platform
CN114898171B (en) * 2022-04-07 2023-09-22 中国科学院光电技术研究所 Real-time target detection method suitable for embedded platform
CN115099297B (en) * 2022-04-25 2024-06-18 安徽农业大学 Soybean plant phenotype data statistical method based on improved YOLO v5 model
CN115099297A (en) * 2022-04-25 2022-09-23 安徽农业大学 Soybean plant phenotype data statistical method based on improved YOLO v5 model
CN115049906B (en) * 2022-06-17 2024-07-05 北京理工大学 Precision-keeping SAR ship detection method based on lightweight trunk
CN115049906A (en) * 2022-06-17 2022-09-13 北京理工大学 Precision-preserving SAR ship detection method based on lightweight trunk
CN117315550B (en) * 2023-11-29 2024-02-23 南京市特种设备安全监督检验研究院 Detection method for dangerous behavior of escalator passengers
CN117315550A (en) * 2023-11-29 2023-12-29 南京市特种设备安全监督检验研究院 Detection method for dangerous behavior of escalator passengers

Similar Documents

Publication Publication Date Title
AU2020103494A4 (en) Handheld call detection method based on lightweight target detection network
US11887064B2 (en) Deep learning-based system and method for automatically determining degree of damage to each area of vehicle
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN111814755A (en) Multi-frame image pedestrian detection method and device for night motion scene
CN111476099B (en) Target detection method, target detection device and terminal equipment
CN111582253B (en) Event trigger-based license plate tracking and identifying method
Geetha et al. Detection and estimation of the extent of flood from crowd sourced images
Tang et al. Multiple-kernel adaptive segmentation and tracking (MAST) for robust object tracking
CN113420819A (en) Lightweight underwater target detection method based on CenterNet
CN111079518A (en) Fall-down abnormal behavior identification method based on scene of law enforcement and case handling area
CN114898326A (en) Method, system and equipment for detecting reverse running of one-way vehicle based on deep learning
CN113255580A (en) Method and device for identifying sprinkled objects and vehicle sprinkling and leaking
CN114267082A (en) Bridge side falling behavior identification method based on deep understanding
CN107045630B (en) RGBD-based pedestrian detection and identity recognition method and system
CN111354016A (en) Unmanned aerial vehicle ship tracking method and system based on deep learning and difference value hashing
CN113191270B (en) Method and device for detecting throwing event, electronic equipment and storage medium
CN111709377B (en) Feature extraction method, target re-identification method and device and electronic equipment
CN111652907B (en) Multi-target tracking method and device based on data association and electronic equipment
CN116721288A (en) Helmet detection method and system based on YOLOv5
Yu et al. An Algorithm for Target Detection of Engineering Vehicles Based on Improved CenterNet.
CN116797789A (en) Scene semantic segmentation method based on attention architecture
JP2014048702A (en) Image recognition device, image recognition method, and image recognition program
CN111723614A (en) Traffic signal lamp identification method and device
CN115690732A (en) Multi-target pedestrian tracking method based on fine-grained feature extraction
KR102283053B1 (en) Real-Time Multi-Class Multi-Object Tracking Method Using Image Based Object Detection Information

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry