CN111723602B - Method, device, equipment and storage medium for identifying driver behavior - Google Patents

Method, device, equipment and storage medium for identifying driver behavior

Info

Publication number
CN111723602B
CN111723602B (application number CN201910207840.8A)
Authority
CN
China
Prior art keywords
image
target
behavior
driver
feature
Prior art date
Legal status
Active
Application number
CN201910207840.8A
Other languages
Chinese (zh)
Other versions
CN111723602A
Inventor
乔梁
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910207840.8A
Publication of CN111723602A
Application granted
Publication of CN111723602B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image

Abstract

The application discloses a driver behavior identification method, apparatus, device, and storage medium, belonging to the technical field of intelligent traffic. The method comprises the following steps: acquiring a target image; acquiring a first area image from the target image, wherein the first area image contains a preset behavior of a driver; in the target image, reducing the peripheral area of the first area image according to a first proportion threshold to obtain a second area image, and expanding the peripheral area of the first area image according to a second proportion threshold to obtain a third area image; and identifying the behavior of the driver based on the first area image, the second area image and the third area image. After the first area image is cropped inward and expanded outward, the obtained second area image and third area image can contain more information about surrounding objects related to the preset behavior, so identifying the driver's behavior based on the three area images can improve recognition accuracy.

Description

Method, device, equipment and storage medium for identifying driver behavior
Technical Field
The present disclosure relates to the field of intelligent traffic technologies, and in particular, to a method, an apparatus, a device, and a storage medium for identifying behavior of a driver.
Background
With the rapid development of intelligent transportation technology, increasing attention is paid to solving, by intelligent means, the driving safety problems caused by driver distraction. In some scenarios, the face of the driver can be photographed, the behavior of the driver is analyzed based on the captured image, and when an illegal behavior is determined to exist, an alarm prompt is issued to remind the driver to drive safely.
At present, a deep-learning network model can be used to detect the behavior of the driver. That is, a network model to be trained may be trained in advance based on training samples, which may include image samples together with the location information and behavior categories of the regions where the behaviors in the image samples are located, so that the trained network model can recognize gestures such as making a phone call or smoking based on the captured images.
However, during driving the driver may make gestures similar to illegal behaviors, such as touching the ear, chin, or mouth. In such cases, the behavior recognition method described above may produce erroneous judgments.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a device, and a storage medium for identifying the behavior of a driver, which can solve the problem that the driver behavior recognition method in the related art may cause misjudgment. The technical scheme is as follows:
in a first aspect, there is provided a method of identifying behavior of a driver, the method comprising:
acquiring a target image, wherein the target image comprises a face of a driver;
acquiring a first area image from the target image, wherein the first area image comprises a preset behavior of the driver, and the preset behavior refers to a behavior whose similarity to an illegal behavior is greater than a preset threshold;
in the target image, performing reduction processing on the peripheral area of the first area image according to a first proportion threshold value to obtain a second area image, and performing expansion processing on the peripheral area of the first area image according to a second proportion threshold value to obtain a third area image;
and identifying the behavior of the driver based on the first region image, the second region image and the third region image.
Optionally, before the identifying the behavior of the driver based on the first area image, the second area image, and the third area image, the method further includes:
Adjusting the first region image, the second region image and the third region image to region images of the same size;
accordingly, the detecting the behavior of the driver based on the first region image, the second region image, and the third region image includes:
invoking a target network model, wherein the target network model is used for determining a behavior category of a behavior based on a group of region images corresponding to any behavior;
and identifying the behavior of the driver through the target network model based on the first area image, the second area image and the third area image after the size adjustment.
Optionally, the target network model comprises an input layer, an intermediate layer, a splicing layer, a full connection layer and an output layer;
the identifying, based on the resized first, second, and third region images, the behavior of the driver via the target network model includes:
the input layer carries out channel superposition processing on the image data in the size-adjusted first region image, second region image and third region image based on the resolution and the channel number of the size-adjusted region images, so as to obtain a first feature map;
Performing convolution sampling processing on the first feature images through the intermediate layer to obtain a plurality of second feature images, wherein the second feature images are identical in size and different in channel number;
the splicing layer is used for carrying out channel superposition processing on the plurality of second feature images, and feature fusion is carried out on the feature images after channel superposition through a convolution layer in a deep network layer to obtain a third feature image;
determining, by the fully-connected layer, behavior of the driver based on the third feature map;
outputting the behavior through the output layer.
Optionally, the middle layer includes N groups of convolution layers and N groups of sampling layers, each group of convolution layers corresponds to each group of sampling layers one by one;
the convolution sampling processing is performed on the first feature map through the intermediate layer to obtain a plurality of second feature maps, including:
letting i = 1 and determining the first feature map as a target feature map; performing convolution processing on the target feature map through the i-th group of convolution layers, sampling the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-fold feature map and a second feature map of the i-th reference size, and taking the 2^i-fold feature map as the target feature map, wherein the reference size is greater than or equal to the size of the 2^i-fold feature map;
when i is smaller than N, setting i = i + 1 and returning to the operation of performing convolution processing on the target feature map through the i-th group of convolution layers, sampling the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-fold feature map and a second feature map of the i-th reference size, and taking the 2^i-fold feature map as the target feature map;
and when i is equal to N, performing convolution processing on the target feature map through the N-th group of convolution layers, sampling the obtained feature map once through the N-th group of sampling layers to obtain a second feature map of the N-th reference size, and ending the operation.
Optionally, before the target network model is invoked, the method further includes:
acquiring a plurality of training samples, wherein the plurality of training samples comprise a plurality of groups of images and behavior categories in each group of images, each group of images comprises a region image of a behavior, a reduced region image determined after the region image is subjected to reduction processing, and an expanded region image determined after the region image is subjected to expansion processing;
and training the network model to be trained based on the plurality of training samples to obtain the target network model.
Optionally, the acquiring a first area image from the target image includes:
invoking a target detection model, inputting the target image into the target detection model, and outputting a face detection frame and a preset behavior detection frame, wherein the target detection model is used for identifying the face and the preset behavior in the image based on any image;
when the number of the face detection frames is one, determining the face detection frames as target face detection frames; when the number of the face detection frames is multiple, the face detection frame with the largest area is obtained from the multiple face detection frames, and the obtained face detection frame is determined to be a target face detection frame;
and determining the first area image based on the target face detection frame.
Optionally, the determining the first area image based on the target face detection box includes:
filtering out the preset behavior detection frames which are not overlapped with the target face detection frame or the distance between the preset behavior detection frames and the target face detection frame is larger than a preset distance threshold;
and cutting out the region corresponding to the residual preset behavior detection frame after filtering from the target image to obtain the first region image.
Optionally, the acquiring the target image includes:
detecting a working mode of a camera for shooting the face of the driver;
when the working mode of the camera is an infrared shooting mode, acquiring a gray image obtained by shooting;
calling an image pseudo-color conversion model, inputting the gray image into the image pseudo-color conversion model, and outputting a three-channel color image corresponding to the gray image, wherein the image pseudo-color conversion model is used for converting any gray image into the three-channel color image corresponding to the gray image;
and acquiring the output three-channel color image as the target image.
Optionally, before the detecting the working mode of the camera for shooting the face of the driver, the method further includes:
acquiring the current speed and illumination intensity;
and when the vehicle speed is greater than a preset vehicle speed threshold and the illumination intensity is lower than an illumination intensity threshold, switching the working mode of the camera into the infrared shooting mode.
Optionally, after detecting the behavior of the driver based on the first area image, the second area image, and the third area image, the method further includes:
When the behavior of the driver belongs to the illegal behavior, counting the number of times of the illegal behavior of the driver in a preset duration;
and when the number of times of the illegal behaviors of the driver reaches a preset number threshold value within the preset duration, carrying out illegal driving alarm prompt.
In a second aspect, there is provided a behavior recognition apparatus of a driver, the apparatus comprising:
the first acquisition module is used for acquiring a target image, wherein the target image comprises a face of a driver;
the second acquisition module is used for acquiring a first area image from the target image, wherein the first area image comprises a preset behavior of the driver, and the preset behavior refers to a behavior whose similarity to an illegal behavior is greater than a preset threshold;
the image processing module is used for carrying out reduction processing on the peripheral area of the first area image according to a first proportion threshold value in the target image to obtain a second area image, and carrying out expansion processing on the peripheral area of the first area image according to a second proportion threshold value to obtain a third area image;
and the identification module is used for identifying the behavior of the driver based on the first area image, the second area image and the third area image.
Optionally, the apparatus further comprises:
a size adjustment module configured to adjust the first area image, the second area image, and the third area image to area images of the same size;
the recognition module is used for calling a target network model, and the target network model is used for determining the behavior category of the behavior based on a group of region images corresponding to any behavior;
and identifying the behavior of the driver through the target network model based on the first area image, the second area image and the third area image after the size adjustment.
Optionally, the identification module is configured to:
the target network model comprises an input layer, an intermediate layer, a splicing layer, a full-connection layer and an output layer; the input layer carries out channel superposition processing on the image data in the size-adjusted first region image, second region image and third region image based on the resolution and the channel number of the size-adjusted region images, so as to obtain a first feature map;
performing convolution sampling processing on the first feature images through the intermediate layer to obtain a plurality of second feature images, wherein the second feature images are identical in size and different in channel number;
The splicing layer is used for carrying out channel superposition processing on the plurality of second feature images, and feature fusion is carried out on the feature images after channel superposition through a convolution layer in a deep network layer to obtain a third feature image;
determining, by the fully-connected layer, behavior of the driver based on the third feature map;
outputting the behavior through the output layer.
Optionally, the identification module is configured to:
the middle layer comprises N groups of convolution layers and N groups of sampling layers, and each group of convolution layers corresponds to each group of sampling layers one by one; letting i = 1 and determining the first feature map as a target feature map; performing convolution processing on the target feature map through the i-th group of convolution layers, sampling the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-fold feature map and a second feature map of the i-th reference size, and taking the 2^i-fold feature map as the target feature map, wherein the reference size is greater than or equal to the size of the 2^i-fold feature map;
when i is smaller than N, setting i = i + 1 and returning to the operation of performing convolution processing on the target feature map through the i-th group of convolution layers, sampling the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-fold feature map and a second feature map of the i-th reference size, and taking the 2^i-fold feature map as the target feature map;
and when i is equal to N, performing convolution processing on the target feature map through the N-th group of convolution layers, sampling the obtained feature map once through the N-th group of sampling layers to obtain a second feature map of the N-th reference size, and ending the operation.
Optionally, the apparatus further comprises a training module for:
acquiring a plurality of training samples, wherein the plurality of training samples comprise a plurality of groups of images and behavior categories in each group of images, each group of images comprises a region image of a behavior, a reduced region image determined after the region image is subjected to reduction processing, and an expanded region image determined after the region image is subjected to expansion processing;
and training the network model to be trained based on the plurality of training samples to obtain the target network model.
Optionally, the second obtaining module is configured to:
invoking a target detection model, inputting the target image into the target detection model, and outputting a face detection frame and a preset behavior detection frame, wherein the target detection model is used for identifying the face and the preset behavior in the image based on any image;
When the number of the face detection frames is one, determining the face detection frames as target face detection frames; when the number of the face detection frames is multiple, the face detection frame with the largest area is obtained from the multiple face detection frames, and the obtained face detection frame is determined to be a target face detection frame;
and determining the first area image based on the target face detection frame.
Optionally, the second obtaining module is configured to:
filtering out the preset behavior detection frames which are not overlapped with the target face detection frame or the distance between the preset behavior detection frames and the target face detection frame is larger than a preset distance threshold;
and cutting out the region corresponding to the residual preset behavior detection frame after filtering from the target image to obtain the first region image.
Optionally, the first obtaining module is configured to:
detecting a working mode of a camera for shooting the face of the driver;
when the working mode of the camera is an infrared shooting mode, acquiring a gray image obtained by shooting;
calling an image pseudo-color conversion model, inputting the gray image into the image pseudo-color conversion model, and outputting a three-channel color image corresponding to the gray image, wherein the image pseudo-color conversion model is used for converting any gray image into the three-channel color image corresponding to the gray image;
And acquiring the output three-channel color image as the target image.
Optionally, the apparatus further comprises:
the third acquisition module is used for acquiring the current speed and illumination intensity;
and the switching module is used for switching the working mode of the camera into the infrared shooting mode when the vehicle speed is greater than a preset vehicle speed threshold and the illumination intensity is lower than the illumination intensity threshold.
Optionally, the apparatus further comprises:
the statistics module is used for counting the number of times of the illegal behaviors of the driver in a preset duration when the behaviors of the driver belong to the illegal behaviors;
and the alarm module is used for carrying out illegal driving alarm prompt when the number of times of the illegal behaviors of the driver reaches a preset number threshold value within the preset duration.
In a third aspect, there is provided an intelligent device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the driver behavior recognition method of the first aspect described above.
In a fourth aspect, there is provided a computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the method for identifying behavior of a driver according to the first aspect.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of identifying a behaviour of a driver as described in the first aspect above.
The beneficial effects that technical scheme that this application embodiment provided brought are:
A target image including the face of a driver is acquired, and a first area image including a preset behavior of the driver is acquired from the target image, that is, a first area image of a preset behavior close to an illegal behavior is acquired from the target image. Then, inward cropping and outward expansion of different scales are performed on the first area image in the target image to obtain a second area image and a third area image. Because the first area image may include only the preset behavior, and may not include, or include only a very small part of, the surrounding objects related to the preset behavior, whether the preset behavior is actually an illegal behavior cannot be accurately determined based on the first area image alone. After the first area image is cropped inward and expanded outward, the obtained second area image and third area image can include more information about the surrounding objects related to the preset behavior, so detecting the behavior of the driver based on the three area images can improve detection accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of a captured image of a face of a driver, according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating a method of driver behavior recognition, according to an exemplary embodiment;
FIG. 3 is a schematic diagram of a face detection box of a driver, shown according to an exemplary embodiment;
FIG. 4 is a schematic diagram of an area image shown according to an exemplary embodiment;
FIG. 5 is a schematic diagram of a convolution sampling process for an intermediate layer according to an exemplary embodiment;
fig. 6 is a schematic structural view showing a behavior recognition apparatus of a driver according to an exemplary embodiment;
fig. 7 is a schematic structural view of a behavior recognition apparatus of a driver shown according to another exemplary embodiment;
Fig. 8 is a schematic structural view of a behavior recognition apparatus of a driver shown according to another exemplary embodiment;
fig. 9 is a schematic structural view of a behavior recognition apparatus of a driver shown according to another exemplary embodiment;
fig. 10 is a schematic structural view of a behavior recognition apparatus of a driver shown according to another exemplary embodiment;
fig. 11 is a schematic diagram illustrating a structure of a terminal 1000 according to an exemplary embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before describing the behavior recognition method of the driver provided in the embodiment of the present application in detail, an application scenario and an implementation environment related to the embodiment of the present application are first described briefly.
First, an application scenario according to an embodiment of the present application will be briefly described.
Currently, image recognition is widely used in the field of intelligent transportation, and in some embodiments it may be used to detect driver behavior. While the driver is driving, the face of the driver can be photographed, and the captured image is then detected and analyzed through a trained network model to determine whether the driver has an illegal behavior. However, some behaviors of the driver may be similar to illegal behaviors. For example, referring to fig. 1, the driver's face-touching gesture is similar to a phone-call gesture; in this case, the face-touching gesture is easily misjudged as an illegal behavior when detection and analysis are performed through the trained network model, leading to a false alarm, which not only disturbs passengers but may also distract the driver. Therefore, the embodiments of the present application provide a method for identifying the behavior of the driver that can accurately determine whether the behavior of the driver is illegal; a specific implementation is shown in fig. 2 below.
Next, an implementation environment related to the embodiments of the present application will be briefly described.
The behavior recognition method of the driver can be executed by an intelligent device. The intelligent device may be configured with or connected to a camera, and the camera may be installed in areas such as the center console, the dashboard, or the A-pillar in front of the vehicle, so that the face of the driver is photographed by the camera and clear images around the driver are obtained in real time. In some embodiments, the smart device may be a mobile phone, a tablet computer, a computer device, or another terminal, or the smart device may also be a smart camera device, which is not limited in the embodiments of the present application.
After describing application scenarios and implementation environments related to the embodiments of the present application, a detailed description will be given next of a driver behavior recognition method provided in the embodiments of the present application with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for identifying the behavior of a driver according to an exemplary embodiment, which is described by using the method performed by the above-mentioned smart device as an example, the method for identifying the behavior of the driver may include the following implementation steps:
Step 201: a target image is acquired, the target image comprising a face of a driver.
In one possible implementation, the intelligent device may acquire a video image captured by the camera once every duration threshold, so as to obtain the target image. In other words, while the camera is shooting video, the intelligent device may acquire one video image frame every preset number of video image frames to obtain the target image.
The duration threshold may be set by a user in a user-defined manner according to actual requirements, or may be set by default by the intelligent device, which is not limited in the embodiment of the present application.
In addition, the preset number can be set by a user in a self-defined manner according to actual requirements, or can be set by the intelligent device in a default manner, and the embodiment of the application is not limited in this way. For example, the predetermined number may be 5, where the target image may be the first video image frame, the sixth video image frame, the eleventh video image frame, etc.
Further, the target image may be a three-channel color image, where the three channels are R, G, and B, representing the red, green, and blue values of each pixel. In implementation, the intelligent device detects the working mode of the camera used to photograph the face of the driver. When the working mode of the camera is the infrared shooting mode, a gray image obtained by shooting is acquired, an image pseudo-color conversion model is called, the gray image is input into the image pseudo-color conversion model, and a three-channel color image corresponding to the gray image is output, where the image pseudo-color conversion model is used to convert any gray image into its corresponding three-channel color image; the output three-channel color image is then acquired as the target image.
That is, the camera may autonomously switch the operation modes according to the intensity of day and night light, and the operation modes may include an infrared photographing mode and a non-infrared photographing mode. In the non-infrared shooting mode, the image shot by the camera is a three-channel color image, and at the moment, the image acquired by the camera can be directly acquired as a target image. However, in the infrared photographing mode, the image photographed by the camera is a gray image, and color information is lost due to the gray image, which may increase the difficulty of subsequent processing for the smart device. For this reason, when it is detected that the camera is in the infrared photographing mode, the grayscale image may be converted into a three-channel color image, that is, pseudo color information of the grayscale image is restored, and the three-channel color image obtained after the conversion is acquired as the target image.
In some embodiments, the grayscale image may be converted to a corresponding three-channel color image by an image pseudo-color conversion model. That is, the gray image may be input into a pseudo color conversion model that outputs a three-channel color image of the same size as the gray image.
Further, before the image pseudo-color conversion model is called, a plurality of gray image samples and color image samples corresponding to each gray image sample can be obtained, and the image pseudo-color conversion model is obtained after training the network to be trained based on the gray image samples and the color image samples corresponding to each gray image sample.
For example, 500,000 color image samples taken from natural scenes may be acquired, converted into corresponding gray image samples by the following formula (1), and then input into the network to be trained. Formula (1) is:
H=R*0.299+G*0.587+B*0.114 (1)
where H represents the gray value of the pixel, R represents the red value of the pixel, G represents the green value of the pixel, and B represents the blue value of the pixel.
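As an illustration, the following is a minimal sketch of applying formula (1) to an RGB image with NumPy; the function name and the assumption of an H x W x 3 array with channels in R, G, B order are illustrative only.

    import numpy as np

    def rgb_to_gray(rgb_image: np.ndarray) -> np.ndarray:
        """Apply formula (1): H = R*0.299 + G*0.587 + B*0.114 for each pixel."""
        r = rgb_image[..., 0].astype(np.float32)
        g = rgb_image[..., 1].astype(np.float32)
        b = rgb_image[..., 2].astype(np.float32)
        gray = r * 0.299 + g * 0.587 + b * 0.114
        return gray.astype(np.uint8)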
In one possible implementation, the network to be trained may use, but is not limited to, a condition-generating countermeasure network, where during training, the network includes a generator that can generate color images of the same size by parametric superposition of the grayscale image samples and a discriminator that iteratively trains the network by discriminating the difference values between the generated color images and the color image samples corresponding to the grayscale image samples.
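As a rough illustration of the training scheme described above, the following is a highly simplified sketch of one conditional GAN training step, assuming PyTorch. The layer stacks, loss weighting, and all names here are illustrative placeholders rather than the actual generator and discriminator used by the image pseudo-color conversion model, and images are assumed to be normalized to [-1, 1].

    import torch
    import torch.nn as nn

    # Illustrative stand-in networks; the actual architectures are not specified here.
    generator = nn.Sequential(              # maps a 1-channel gray image to a 3-channel color image
        nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
    )
    discriminator = nn.Sequential(          # scores (gray, color) pairs stacked on the channel axis
        nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
    )

    adv_loss = nn.BCEWithLogitsLoss()
    l1_loss = nn.L1Loss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    def train_step(gray: torch.Tensor, color: torch.Tensor) -> None:
        """One iteration; gray and color are N x 1 x H x W and N x 3 x H x W batches."""
        fake = generator(gray)

        # Discriminator: real (gray, color) pairs -> 1, generated pairs -> 0.
        opt_d.zero_grad()
        real_score = discriminator(torch.cat([gray, color], dim=1))
        fake_score = discriminator(torch.cat([gray, fake.detach()], dim=1))
        d_loss = adv_loss(real_score, torch.ones_like(real_score)) + \
                 adv_loss(fake_score, torch.zeros_like(fake_score))
        d_loss.backward()
        opt_d.step()

        # Generator: fool the discriminator and stay close to the reference color image.
        opt_g.zero_grad()
        fake_score = discriminator(torch.cat([gray, fake], dim=1))
        g_loss = adv_loss(fake_score, torch.ones_like(fake_score)) + 100.0 * l1_loss(fake, color)
        g_loss.backward()
        opt_g.step()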
Further, before detecting the working mode of the camera for shooting the face of the driver, the intelligent device obtains the current vehicle speed and the illumination intensity, and when the vehicle speed is greater than a preset vehicle speed threshold and the illumination intensity is lower than the illumination intensity threshold, the working mode of the camera is switched to the infrared shooting mode.
That is, the camera may be automatically switched to the infrared photographing mode only when the vehicle speed is greater than a certain preset vehicle speed threshold and the illumination intensity is low, otherwise, the camera performs video photographing in the non-infrared photographing mode, that is, performs video photographing in the normal mode.
The preset vehicle speed threshold may be set by a user according to actual requirements, or may be set by default by the intelligent device, which is not limited in the embodiment of the present application.
In addition, the illumination intensity threshold may be set by the user according to the actual requirement, or may be set by default by the intelligent device, which is not limited in the embodiment of the present application.
Further, the above description only takes the camera being controlled to automatically switch to the infrared shooting mode under one condition of the illumination intensity as an example. In another embodiment, in order to remind the driver not to drive while fatigued, there may be a need to detect whether the driver's eyes are closed. In this case, if the driver is wearing sunglasses, the camera also needs to be switched to the infrared shooting mode; at this time, the user may manually trigger an infrared shooting switching instruction, so that the intelligent device switches the camera to the infrared shooting mode based on the infrared shooting switching instruction.
In the above description, the target image is merely an example of a three-channel color image, and in another embodiment, the target image may be a gray-scale image, which is not limited in the embodiment of the present application.
It should be noted that the above description only takes as an example the case in which the camera is automatically switched to the infrared shooting mode when the vehicle speed is greater than the preset vehicle speed threshold and the illumination intensity is low, and otherwise shoots in the non-infrared shooting mode. In another embodiment, the camera may not be turned on when the vehicle speed is below the preset threshold, that is, behavior detection may not be performed on the driver when the vehicle speed is below the preset threshold. Therefore, before the target image is acquired, whether the vehicle speed is greater than the preset vehicle speed threshold may be judged; when the vehicle speed is greater than the preset vehicle speed threshold, the operation of acquiring the target image is performed, and otherwise the camera is not started, that is, the operation of acquiring the target image is not performed.
Step 202: A first area image is acquired from the target image, where the first area image comprises a preset behavior of the driver, and the preset behavior refers to a behavior whose similarity to an illegal behavior is greater than a preset threshold.
The preset behavior can be set by a user according to actual requirements, or can be set by default by the intelligent device, which is not limited in the embodiment of the present application.
For example, the preset behavior may include making a phone call, smoking, touching the chin, covering the mouth, touching the side of the face, touching the ear, and the like. Here, making a phone call refers to the gesture of holding a mobile phone with the hand close to the ear.
In addition, the preset threshold may be set by user according to actual requirements, or may be set by default by the computer device, which is not limited in the embodiment of the present application.
That is, the intelligent device determines, from the target image, the region where a preset behavior of the driver similar to an illegal behavior exists, and then acquires the determined region from the target image, so that further fine analysis can be performed on the preset behavior later to determine whether the preset behavior is actually an illegal behavior.
In some embodiments, the specific implementation of acquiring the first area image from the target image may include: and calling a target detection model, inputting the target image into the target detection model, outputting a face detection frame and a preset behavior detection frame, and identifying the face and the preset behavior in the image based on any image by the target detection model. When the number of the face detection frames is one, determining the face detection frames as target face detection frames; when the number of the face detection frames is a plurality of, the face detection frame with the largest area is obtained from the plurality of face detection frames, the obtained face detection frame is determined to be a target face detection frame, and the first area image is determined based on the target face detection frame.
That is, the target detection model can detect not only the face but also the areas corresponding to preset behaviors such as making a phone call, touching the face, smoking, and the like. In general, multiple preset behavior detection frames may be detected, and the preset behavior corresponding to a certain preset behavior detection frame may not be that of the driver; for example, it may belong to a passenger standing beside the driver. The intelligent device may therefore filter out the preset behavior detection frames irrelevant to the driver. In implementation, the face detection frame of the driver may be determined first, after which a filtering operation is performed based on the face detection frame of the driver.
Since the camera shoots the face portion of the driver, when the number of face detection frames output is one, it can be determined that the face detection frame corresponds to the face of the driver, and the face detection frame is determined as the target face detection frame. In other embodiments, a passenger may stand near the driver, for example behind the driver's side and facing the camera, in which case the target detection model may detect multiple face detection frames. Since the camera shoots the face of the driver, the face detection frame with the largest area can be determined to correspond to the face of the driver, for example as shown in fig. 3, so the face detection frame with the largest area is determined as the target face detection frame.
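A minimal sketch of the target face detection frame selection described above, assuming each detection frame is given as an (x1, y1, x2, y2) tuple in pixel coordinates; the function name is illustrative.

    def select_target_face_box(face_boxes):
        """Pick the driver's face box: the only box, or the one with the largest area."""
        if not face_boxes:
            return None
        return max(face_boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))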
Further, determining, based on the target face detection frame, a specific implementation of the first region image may include: filtering out the preset behavior detection frames which are not overlapped with the target face detection frame or the distance between the preset behavior detection frames and the target face detection frame is larger than a preset distance threshold, and cutting out the areas corresponding to the residual preset behavior detection frames after filtering from the target image to obtain the first area image.
The intelligent device may filter out, based on the target face detection frame, the preset behavior detection frames that do not belong to a preset behavior of the driver. It can be understood that if a preset behavior detection frame does not overlap with the target face detection frame, the preset behavior detection frame is far away from the face of the driver, so it can be determined that the preset behavior corresponding to that preset behavior detection frame is not a behavior of the driver. In addition, the judgment can also be made according to the distance between the preset behavior detection frame and the target face detection frame: when this distance is greater than the preset distance threshold, the preset behavior detection frame is far away from the face of the driver, so it can also be determined that the preset behavior corresponding to that preset behavior detection frame is not a behavior of the driver.
The preset distance threshold may be set by a user in a user-defined manner according to actual requirements, or may be set by default by the intelligent device, which is not limited in the embodiment of the present application.
After the intelligent device filters the preset behavior detection frames, the remaining preset behavior detection frames can be determined to be detection frames corresponding to the preset behaviors of the driver, and the intelligent device cuts out corresponding areas from the target image to obtain a first area image.
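The filtering and cropping described above could be sketched as follows, assuming (x1, y1, x2, y2) boxes and an H x W x C image array. The text does not specify how the distance between two detection frames is measured, so the center-to-center distance used here is only one possible choice; boxes are kept only if they overlap the target face detection frame and fall within the distance threshold, which is the complement of the filtering condition stated above.

    def boxes_overlap(a, b):
        """True if boxes a and b (x1, y1, x2, y2) share any area."""
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

    def center_distance(a, b):
        """Euclidean distance between box centers (one possible distance measure)."""
        ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
        bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
        return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

    def crop_first_region_images(target_image, behavior_boxes, face_box, dist_threshold):
        """Keep behavior boxes near the driver's face box and crop them from the target image."""
        kept = [b for b in behavior_boxes
                if boxes_overlap(b, face_box) and center_distance(b, face_box) <= dist_threshold]
        # target_image is assumed to be an H x W x C array indexed as [row, col].
        return [target_image[b[1]:b[3], b[0]:b[2]] for b in kept]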
It should be noted that if the currently output preset behavior detection frame does not overlap with the target face detection frame, or the distances between the currently output preset behavior detection frame and the target face detection frame are all greater than the preset distance threshold, it may be determined that no violation exists in the target image for the driver.
Further, when it is detected through the target detection model that the driver is wearing sunglasses, the camera can be automatically controlled to switch to the infrared shooting mode. That is, if it is detected through the target detection model during the daytime that the driver is wearing sunglasses, the camera may be switched to the infrared shooting mode so that it can be detected whether the driver's eyes are closed.
Further, before the target detection model is called, a plurality of image samples and face frames and preset action frames in each image sample can be obtained, and the face frames and the preset action frames in the plurality of image samples and each image sample are input into the detection model to be trained for training, so that the target detection model is obtained.
The face frames and the preset behavior frames in each image sample may be calibrated in advance, that is, based on the plurality of image samples and the face frames and the preset behavior frames calibrated in the plurality of image samples, the detection model to be trained is subjected to deep learning and training, so that the obtained target detection model can automatically detect the face and the preset behavior in any image.
In one possible implementation, the detection model to be trained may be YOLO (You Only Look Once, you see once) network, SSD (Single Shot Detector, disposable detector), etc., which is not limited by the embodiments of the present application.
Step 203: in the target image, the peripheral area of the first area image is subjected to reduction processing according to a first proportion threshold value to obtain a second area image, and the peripheral area of the first area image is subjected to expansion processing according to a second proportion threshold value to obtain a third area image.
The first area image may include only the preset behavior, and may not include, or include only a small portion of, the articles related to the preset behavior, such as a mobile phone or a cigarette (that is, the mobile phone or cigarette may not be completely contained in the first area image or may occupy only a small portion of it); at the same time, objects such as mobile phones or cigarettes may appear in different sizes due to the camera mounting position and the objects' own dimensions. In order to further determine whether the preset behavior is actually an illegal behavior, the intelligent device crops the area corresponding to the first area image inward and expands it outward at different scales, that is, in the target image, the peripheral area of the first area image is reduced according to the first proportion threshold and expanded according to the second proportion threshold, so as to obtain the second area image and the third area image.
For example, referring to fig. 4, assuming that the first ratio threshold is 0.8 and the second ratio threshold is 1.2, after the first area image is truncated and expanded from the target image, a second area image and a third area image shown as 41 and 42 in fig. 4 can be obtained, where 43 in fig. 4 is the first area image.
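A minimal sketch of the inner-crop and outer-expand operation, assuming the region is scaled about its center by the proportion thresholds (for example 0.8 and 1.2, as in the example above) and the expanded box is clamped to the image bounds; the exact geometry is not spelled out in the text, so this is an assumption.

    def scale_box(box, factor, image_width, image_height):
        """Scale box (x1, y1, x2, y2) about its center by `factor`, clamped to the image."""
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        half_w, half_h = (x2 - x1) * factor / 2, (y2 - y1) * factor / 2
        return (int(max(0, cx - half_w)), int(max(0, cy - half_h)),
                int(min(image_width, cx + half_w)), int(min(image_height, cy + half_h)))

    def build_region_images(target_image, first_box, first_ratio=0.8, second_ratio=1.2):
        """Return the first, second (inner-cropped) and third (outer-expanded) region images."""
        h, w = target_image.shape[:2]
        boxes = [first_box,
                 scale_box(first_box, first_ratio, w, h),    # reduction by the first proportion threshold
                 scale_box(first_box, second_ratio, w, h)]   # expansion by the second proportion threshold
        return [target_image[y1:y2, x1:x2] for (x1, y1, x2, y2) in boxes]

In step 204 below, the three crops would then be resized to a common resolution before being fed to the target network model.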
It is worth mentioning that the first area image, the second area image and the third area image are obtained from the target image, redundant information in the whole target image is removed, and therefore compared with the fact that behavior detection is carried out based on the whole target image, the accuracy of detection can be improved by carrying out subsequent behavior discrimination processing based on the first area image, the second area image and the third area image.
The first proportional threshold may be set by user according to actual requirements, or may be set by default by the intelligent device, which is not limited in this embodiment of the present application.
The second proportion threshold value may be set in a self-defined manner according to actual requirements, or may be set by default by the intelligent device, which is not limited in the embodiment of the present application.
Step 204: the first region image, the second region image and the third region image are adjusted to region images of the same size.
In the embodiment of the application, in order to facilitate the recognition of the behavior of the driver based on the first area image, the second area image and the third area image, the intelligent device performs the size adjustment on the three area images, that is, scales the area images under different scales to the same size.
Step 205: a target network model is invoked for determining a behavior category for any behavior based on a set of region images corresponding to the behavior.
The region image corresponding to any behavior comprises a region image of the behavior, a reduced region image and an expanded region image, wherein the reduced region image is a region image determined by reducing the periphery of a region where the behavior is located in a photographed image where the behavior is located, and the expanded region image is a region image determined by expanding the periphery of the region where the behavior is located in the photographed image where the behavior is located. Further, the area image, the reduced area image, and the extended area image of the behavior are the same in size.
Further, before invoking the target network model, a plurality of training samples may be acquired, where the plurality of training samples include a plurality of groups of images and behavior categories in each group of images, and each group of images includes a region image of a behavior, a reduced region image determined after a reduction process is performed on the region image, and an extended region image determined after an expansion process is performed on the region image, and the target network model is obtained after training a network model to be trained based on the plurality of training samples.
That is, the target network model may be obtained in advance through training. In the training process, multiple groups of images and behavior categories in each group of images can be input into a network model to be trained for training. The images of each group include a region image of a behavior, a reduced region image and an expanded region image, that is, the region image of each group of images may be obtained by cutting out a region of the behavior from a photographed image of the behavior, the reduced region image may be obtained by reducing a surrounding region of the behavior from a photographed image of the behavior, and the expanded region image may be obtained by expanding a surrounding region of the behavior from a photographed image of the behavior. Further, the sizes of the three area images in each group of images are the same, that is, after the reduction processing and the expansion processing, the obtained area images and the area images corresponding to the behaviors can be subjected to size adjustment, so that the sizes of the three area images corresponding to the behaviors are consistent.
In addition, because different channels contain different information, in order to enable the network model to be trained to learn the weight of each channel for the recognition task, a compression-activation (squeeze-and-excitation) module can be added to the network model to be trained to weight each channel, so that the learning of the network model to be trained focuses on the region of interest, thereby improving the detection accuracy of the target network model.
Step 206: and identifying the behavior of the driver through the target network model based on the first region image, the second region image and the third region image after the size adjustment.
Further, the target network model includes an input layer, an intermediate layer, a splicing layer, a full connection layer and an output layer, and the specific implementation of identifying the behavior of the driver through the target network model based on the first area image, the second area image and the third area image after the size adjustment may include: and carrying out channel superposition processing on image data in the first region image, the second region image and the third region image after the size adjustment by the input layer based on the resolution and the channel number of the region image after the size adjustment, so as to obtain a first feature map. And carrying out convolution sampling processing on the first feature images through the middle layer to obtain a plurality of second feature images, wherein the second feature images are identical in size and different in channel number, carrying out channel superposition processing on the second feature images through the splicing layer, and carrying out feature fusion on the feature images after channel superposition through a deep convolution layer of the network to obtain a third feature image. And determining the behavior of the driver based on the third feature map through the full connection layer, and outputting the behavior through the output layer.
For example, assuming that the resolution of each area image is 256 x 256 and each area image has 3 channels, the three resized area images, that is, the resized first area image, second area image, and third area image, are input into the target network model, and the three area images are subjected to channel stacking through the input layer, so that a three-dimensional matrix of 256 x 256 x 9 can be obtained, where 9 is the number of channels after channel stacking, and this three-dimensional matrix is the first feature map.
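A minimal NumPy sketch of the channel stacking performed by the input layer, using the 256 x 256 x 3 region images from the example above; the array contents here are placeholders.

    import numpy as np

    # Three resized region images, each 256 x 256 x 3 (illustrative values from the example above).
    region_images = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(3)]

    # Channel stacking in the input layer: concatenate along the channel axis.
    first_feature_map = np.concatenate(region_images, axis=2)
    print(first_feature_map.shape)  # (256, 256, 9)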
Then, convolution sampling processing is performed on the first feature map through the intermediate layer of the target network model. In one possible manner, the intermediate layer comprises N groups of convolution layers and N groups of sampling layers, each group of convolution layers corresponding one-to-one to a group of sampling layers, where N is an integer greater than 1. Further, each group of convolution layers may include at least one convolution layer, and each group of sampling layers may include at least one sampling layer. In this case, performing convolution sampling processing on the first feature map through the intermediate layer to obtain a plurality of second feature maps may be implemented as follows: let i = 1 and determine the first feature map as the target feature map; perform convolution processing on the target feature map through the i-th group of convolution layers, sample the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-fold feature map and a second feature map of the i-th reference size, and take the 2^i-fold feature map as the target feature map, where the reference size is greater than or equal to the size of the 2^i-fold feature map; when i is smaller than N, set i = i + 1 and return to the operation of performing convolution processing on the target feature map through the i-th group of convolution layers, sampling the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-fold feature map and a second feature map of the i-th reference size, and taking the 2^i-fold feature map as the target feature map; and when i is equal to N, perform convolution processing on the target feature map through the N-th group of convolution layers, sample the obtained feature map once through the N-th group of sampling layers to obtain a second feature map of the N-th reference size, and end the operation.
The reference size may be set according to actual requirements, or may be set by default by the smart device, which is not limited in this embodiment of the present application.
The convolution processing of the convolution layers can extract image features that are favorable for identifying the behavior category, and the sampling processing of the sampling layers can obtain multi-scale feature maps. Referring to fig. 5, in implementation, the first feature map output by the input layer is input to the first group of convolution layers for convolution processing to obtain a multi-channel feature map, and then down-sampling processing with different multiples is performed through the 2 x 2 sampling layer and the 8 x 8 sampling layer in the first group of sampling layers to obtain a 2-fold feature map and a second feature map of the reference size. The 2-fold feature map is then input into the second group of convolution layers for convolution processing to obtain a multi-channel feature map, and down-sampling processing with different multiples is performed through the 2 x 2 sampling layer and the 4 x 4 sampling layer in the second group of sampling layers to obtain a 4-fold feature map and a second feature map of the reference size. Convolution sampling processing continues in this manner until the N-th group of convolution layers performs convolution processing and the N-th group of sampling layers performs 2 x 2 sampling processing to obtain a second feature map of the N-th reference size, at which point the convolution sampling operation ends.
With continued reference to fig. 5, after the plurality of second feature maps are obtained, channel stacking is performed through the splicing layer and feature fusion is performed through a convolution layer to obtain a third feature map. That is, before the final prediction, the 1-times, 2-times and 4-times feature maps from the shallow layers of the network are downsampled by different multiples to feature maps of the same size, these feature maps are stacked and spliced along the channel dimension, feature fusion is performed through a convolution layer, and the fused feature map is passed to the fully connected layer.
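A similarly minimal sketch of the splicing and fusion step is given below; the random tensors only stand in for the second feature maps produced above, and the 1×1 fusion kernel and 64 output channels are assumptions rather than details of this embodiment.

```python
import torch
import torch.nn as nn

# stand-ins for the second feature maps (same reference size, different channel counts)
second_maps = [torch.randn(1, c, 28, 28) for c in (16, 32, 64)]

spliced = torch.cat(second_maps, dim=1)            # channel stacking in the splicing layer
fusion = nn.Conv2d(spliced.shape[1], 64, kernel_size=1)
third_feature_map = fusion(spliced)                # feature fusion yields the third feature map
```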
It is worth mentioning that, by fusing the multi-scale feature maps, the target network model can learn the feature distribution of objects at different scales, which improves the accuracy of behavior detection.
It should be noted that the size of the convolution kernel of each convolution layer may be set in advance according to actual requirements, while the parameters within each convolution kernel are determined during training. Moreover, in implementation, parameters such as the image size, feature map size, number of channels, number of downsampling operations, and network depth of the target network models for different behaviors may differ.
After the third feature map is obtained, the behavior of the driver is determined by the fully connected layer based on the third feature map, and the behavior is then output through the output layer, for example whether the behavior is calling or not calling. Further, the behavior may be indicated by a behavior identifier, for example, an output identifier of "1" indicates calling and an output identifier of "0" indicates not calling. In some embodiments, a confidence value in the interval [0, 1] may be output along with the result; when the confidence is greater than 0.5, the behavior is determined to be an illegal behavior, and otherwise it is determined not to be an illegal behavior.
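As a sketch only, the prediction head below maps the fused feature map to a single confidence in [0, 1] and applies the 0.5 cut-off mentioned above; the global average pooling and the 64-channel input are assumptions, not details given in this embodiment.

```python
import torch
import torch.nn as nn

third_feature_map = torch.randn(1, 64, 28, 28)   # stand-in for the fused (third) feature map

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # assumed pooling before the fully connected layer
    nn.Flatten(),
    nn.Linear(64, 1),          # fully connected layer
    nn.Sigmoid())              # confidence in [0, 1]

confidence = head(third_feature_map).item()
behavior_id = 1 if confidence > 0.5 else 0       # "1" = calling (illegal), "0" = not calling
```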
It should be noted that the foregoing is merely an example in which the target network model includes an input layer, an intermediate layer, a splicing layer, a fully connected layer, and an output layer. In other embodiments, the target network model may further include other layers that perform additional operations on the feature maps, for example a BN (Batch Normalization) layer, which is not limited in the embodiments of the present application.
Further, when the behavior of the driver belongs to an illegal behavior, the number of illegal behaviors of the driver within a preset duration is counted, and when the number of illegal behaviors of the driver reaches a preset number threshold within the preset duration, an illegal driving alarm prompt is issued.
That is, when the behavior of the driver is determined to belong to an illegal behavior, the number of consecutive illegal behaviors of the driver within a certain time period is counted; in other words, the detection results for the video image frames within the preset duration are counted, and if the number of frames showing an illegal behavior reaches the preset number threshold, dangerous driving behavior of the driver can be determined. For example, if the detection results of 75% of all video image frames within 3 seconds indicate that the behavior of the driver is an illegal behavior, the illegal driving alarm prompt can be issued.
In some embodiments, the illegal driving alarm prompt can be given through a buzzer or through voice broadcast, so that the driver is reminded to drive safely and passengers can monitor the driver's behavior.
Therefore, by issuing the illegal driving alarm prompt only when the counted number of illegal behaviors reaches the preset number threshold, false alarms caused by detection errors in individual video image frames can be avoided, and alarm accuracy is improved.
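A minimal sketch of this alarm rule is given below, using the 3-second window and 75% ratio from the example above; the 25 fps frame rate and the sliding-window representation are assumptions for illustration only.

```python
from collections import deque

FPS = 25                          # assumed camera frame rate
window = deque(maxlen=3 * FPS)    # per-frame results within the preset duration (3 s)

def update_and_check(is_violation: bool, ratio_threshold: float = 0.75) -> bool:
    """Record the latest per-frame detection and decide whether to raise the alarm."""
    window.append(is_violation)
    if len(window) < window.maxlen:
        return False              # not enough frames observed yet
    return sum(window) / len(window) >= ratio_threshold
```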
The preset duration may be set by a user in a user-defined manner according to actual requirements, or may be set by default by the intelligent device, which is not limited in the embodiment of the present application.
The preset number threshold may also be set by the user according to actual requirements, or set by default by the intelligent device, which is not limited in the embodiments of the present application.
In the embodiment of the application, a target image including a face of a driver is acquired, and a first area image including a preset behavior of the driver is acquired from the target image, wherein the preset behavior is similar to the illegal behavior, that is, the first area image of the preset behavior similar to the illegal behavior is acquired from the target image. And then, carrying out inner interception and outer expansion of different scales on the first area image in the target image to obtain a second area image and a third area image. Because the first area image may only include a small part of the preset behavior, but does not include or includes a very small part of the peripheral objects related to the preset behavior, whether the preset behavior is actually illegal or not cannot be accurately determined based on the first area image, after the first area image is subjected to inner interception and outer expansion, the obtained second area image and third area image can include more information of the peripheral objects related to the preset behavior, and therefore the behavior of a driver is detected based on the three area images, and the detection accuracy can be improved.
In addition, the three area images are subjected to channel fusion, so that the target network model can pay attention to information in a certain range around the object, the behavior category of a driver can be accurately determined by the target network model, and the robustness of the target network model is improved.
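The steps above (inner interception, outer expansion, resizing, and channel fusion of the three region images) could be sketched as follows; the 0.8/1.2 ratio thresholds, the 224×224 size, and the OpenCV-based resizing are illustrative assumptions, not values specified by this embodiment.

```python
import cv2
import numpy as np

def build_region_stack(target_img, box, shrink_ratio=0.8, expand_ratio=1.2, size=224):
    """Return the 9-channel stack of the first, second (inner) and third (outer) region images."""
    x1, y1, x2, y2 = box                          # first region image (preset behavior), in pixels
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    regions = []
    for ratio in (1.0, shrink_ratio, expand_ratio):
        half_w, half_h = w * ratio / 2.0, h * ratio / 2.0
        ix1 = max(int(cx - half_w), 0)
        iy1 = max(int(cy - half_h), 0)
        ix2 = min(int(cx + half_w), target_img.shape[1])
        iy2 = min(int(cy + half_h), target_img.shape[0])
        crop = cv2.resize(target_img[iy1:iy2, ix1:ix2], (size, size))   # resize to a common size
        regions.append(crop)
    return np.concatenate(regions, axis=2)        # channel superposition: 3 images x 3 channels
```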
Fig. 6 is a schematic structural diagram showing a behavior recognition apparatus of a driver, which may be implemented by software, hardware, or a combination of both, according to an exemplary embodiment. The behavior recognition device of the driver may include:
a first obtaining module 501, configured to obtain a target image, where the target image includes a face of a driver;
a second obtaining module 502, configured to obtain a first area image from the target image, where the first area image includes a preset behavior of the driver, and the preset behavior is a behavior with a similarity with an offence that is greater than a preset threshold;
an image processing module 503, configured to perform reduction processing on a peripheral area of the first area image according to a first ratio threshold in the target image to obtain a second area image, and perform expansion processing on the peripheral area of the first area image according to a second ratio threshold to obtain a third area image;
An identifying module 504 is configured to identify a behavior of the driver based on the first area image, the second area image, and the third area image.
Optionally, referring to fig. 7, the apparatus further includes:
a resizing module 505 for resizing the first region image, the second region image, and the third region image to region images of the same size;
the identifying module 504 is configured to invoke a target network model, where the target network model is configured to determine a behavior class of a behavior based on a set of area images corresponding to any of the behaviors;
and identifying the behavior of the driver through the target network model based on the first area image, the second area image and the third area image after the size adjustment.
Optionally, the identifying module 504 is configured to:
the target network model comprises an input layer, an intermediate layer, a splicing layer, a full-connection layer and an output layer; the input layer carries out channel superposition processing on image data in the first region image, the second region image and the third region image after size adjustment based on the resolution ratio and the channel number of the region image after size adjustment, so as to obtain a first feature image;
Performing convolution sampling processing on the first feature images through the intermediate layer to obtain a plurality of second feature images, wherein the second feature images are identical in size and different in channel number;
the splicing layer is used for carrying out channel superposition processing on the plurality of second feature images, and feature fusion is carried out on the feature images after channel superposition through a convolution layer in a deep network layer to obtain a third feature image;
determining, by the fully-connected layer, behavior of the driver based on the third feature map;
outputting the behavior through the output layer.
Optionally, the identifying module 504 is configured to:
the intermediate layer includes N groups of convolution layers and N groups of sampling layers, and each group of convolution layers corresponds to one group of sampling layers; let i=1 and determine the first feature map as a target feature map; perform convolution processing on the target feature map through the i-th group of convolution layers, and sample the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-times feature map and a second feature map of the i-th reference size, and take the 2^i-times feature map as the target feature map, the reference size being greater than or equal to the size of the 2^i-times feature map;
when i is smaller than N, let i=i+1 and return to the operation of performing convolution processing on the target feature map through the i-th group of convolution layers, sampling the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-times feature map and a second feature map of the i-th reference size, and taking the 2^i-times feature map as the target feature map;
and when i is equal to N, perform convolution processing on the target feature map through the N-th group of convolution layers, perform sampling processing once on the obtained feature map through the N-th group of sampling layers to obtain a second feature map of the N-th reference size, and end the operation.
Optionally, referring to fig. 8, the apparatus further includes a training module 506, where the training module 506 is configured to:
acquiring a plurality of training samples, wherein the plurality of training samples comprise a plurality of groups of images and behavior categories in each group of images, each group of images comprises a region image of a behavior, a reduced region image determined after the region image is subjected to reduction processing, and an expanded region image determined after the region image is subjected to expansion processing;
and training the network model to be trained based on the plurality of training samples to obtain the target network model.
Optionally, the second obtaining module 502 is configured to:
invoking a target detection model, inputting the target image into the target detection model, and outputting a face detection frame and a preset behavior detection frame, wherein the target detection model is used for identifying the face and the preset behavior in the image based on any image;
when the number of the face detection frames is one, determining the face detection frames as target face detection frames; when the number of the face detection frames is multiple, the face detection frame with the largest area is obtained from the multiple face detection frames, and the obtained face detection frame is determined to be a target face detection frame;
and determining the first area image based on the target face detection frame.
Optionally, the second obtaining module 502 is configured to:
filtering out the preset behavior detection frames which are not overlapped with the target face detection frame or the distance between the preset behavior detection frames and the target face detection frame is larger than a preset distance threshold;
and cutting out the region corresponding to the residual preset behavior detection frame after filtering from the target image to obtain the first region image.
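For illustration, the box selection and filtering performed by this module could be sketched as below; the (x1, y1, x2, y2) box format and the 150-pixel distance threshold are assumptions, and the keep condition reflects one plausible reading of the filtering rule above.

```python
def pick_target_face(face_boxes):
    """Choose the face detection box with the largest area as the target face box."""
    return max(face_boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

def keep_behavior_boxes(behavior_boxes, face_box, max_dist=150):
    """Keep preset-behavior boxes that overlap the target face box or lie close to it."""
    fx, fy = (face_box[0] + face_box[2]) / 2.0, (face_box[1] + face_box[3]) / 2.0
    kept = []
    for b in behavior_boxes:
        overlaps = not (b[2] < face_box[0] or b[0] > face_box[2] or
                        b[3] < face_box[1] or b[1] > face_box[3])
        bx, by = (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
        close = ((bx - fx) ** 2 + (by - fy) ** 2) ** 0.5 <= max_dist
        if overlaps or close:     # discard boxes that neither overlap nor are close enough
            kept.append(b)
    return kept
```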
Optionally, the first obtaining module 501 is configured to:
Detecting a working mode of a camera for shooting the face of the driver;
when the working mode of the camera is an infrared shooting mode, acquiring a gray image obtained by shooting;
calling an image pseudo-color conversion model, inputting the gray image into the image pseudo-color conversion model, and outputting a three-channel color image corresponding to the gray image, wherein the image pseudo-color conversion model is used for converting any gray image into the three-channel color image corresponding to the gray image;
and acquiring the output three-channel color image as the target image.
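A sketch of this infrared branch is shown below; the trained image pseudo-color conversion model is represented by a placeholder callable, and cv2.applyColorMap is only a stand-in used to keep the sketch runnable, not the conversion model described in this embodiment.

```python
import cv2

def to_target_image(gray_img, pseudo_color_model=None):
    """Convert a single-channel infrared gray image into a three-channel color target image."""
    if pseudo_color_model is not None:
        # placeholder for the trained image pseudo-color conversion model
        return pseudo_color_model(gray_img)
    # simple stand-in mapping so the sketch runs without the trained model
    return cv2.applyColorMap(gray_img, cv2.COLORMAP_JET)
```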
Optionally, referring to fig. 9, the apparatus further includes:
a third obtaining module 507, configured to obtain a current vehicle speed and an illumination intensity;
and the switching module 508 is configured to switch the working mode of the camera to the infrared shooting mode when the vehicle speed is greater than a preset vehicle speed threshold and the illumination intensity is lower than an illumination intensity threshold.
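The switching rule could be sketched as follows; the 30 km/h and 10 lux thresholds are illustrative assumptions rather than values specified by this embodiment.

```python
SPEED_THRESHOLD_KMH = 30.0    # assumed preset vehicle speed threshold
LUX_THRESHOLD = 10.0          # assumed illumination intensity threshold

def choose_camera_mode(speed_kmh: float, illumination_lux: float) -> str:
    """Switch to infrared shooting when the vehicle is fast enough and the ambient light is low."""
    if speed_kmh > SPEED_THRESHOLD_KMH and illumination_lux < LUX_THRESHOLD:
        return "infrared"
    return "normal"
```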
Optionally, referring to fig. 10, the apparatus further includes:
a statistics module 509, configured to, when the behavior of the driver belongs to an offence, count the number of times of the offence of the driver within a preset duration;
and the alarm module 510 is configured to perform an offending driving alarm prompt when the number of offending behaviors of the driver reaches a preset number threshold within the preset duration.
In the embodiment of the application, a target image including a face of a driver is acquired, and a first area image including a preset behavior of the driver is acquired from the target image, wherein the preset behavior is similar to the illegal behavior, that is, the first area image of the preset behavior similar to the illegal behavior is acquired from the target image. And then, carrying out inner interception and outer expansion of different scales on the first area image in the target image to obtain a second area image and a third area image. Because the first area image may only include a small part of the preset behavior, but does not include or includes a very small part of the peripheral objects related to the preset behavior, whether the preset behavior is actually illegal or not cannot be accurately determined based on the first area image, after the first area image is subjected to inner interception and outer expansion, the obtained second area image and third area image can include more information of the peripheral objects related to the preset behavior, and therefore the behavior of a driver is detected based on the three area images, and the detection accuracy can be improved.
It should be noted that: in the behavior recognition device for a driver provided in the above embodiment, when the behavior recognition method for a driver is implemented, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the behavior recognition device of the driver provided in the above embodiment belongs to the same concept as the behavior recognition method embodiment of the driver, and the specific implementation process of the behavior recognition device is detailed in the method embodiment, which is not described herein again.
Fig. 11 shows a block diagram of a terminal 1000 according to an exemplary embodiment of the present application. The terminal 1000 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion picture expert compression standard audio plane 3), an MP4 (Moving Picture Experts Group Audio Layer IV, motion picture expert compression standard audio plane 4) player, a notebook computer, or a desktop computer. Terminal 1000 can also be referred to by other names of user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, terminal 1000 can include: a processor 1001 and a memory 1002.
The processor 1001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 1001 may be implemented in at least one hardware form of DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ). The processor 1001 may also include a main processor, which is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit ), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1001 may integrate a GPU (Graphics Processing Unit, image processor) for rendering and drawing of content required to be displayed by the display screen. In some embodiments, the processor 1001 may also include an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. Memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1002 is used to store at least one instruction for execution by processor 1001 to implement the driver behavior recognition method provided by the method embodiments herein.
In some embodiments, terminal 1000 can optionally further include: a peripheral interface 1003, and at least one peripheral. The processor 1001, the memory 1002, and the peripheral interface 1003 may be connected by a bus or signal line. The various peripheral devices may be connected to the peripheral device interface 1003 via a bus, signal wire, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1004, touch display 1005, camera 1006, audio circuitry 1007, positioning component 1008, and power supply 1009.
Peripheral interface 1003 may be used to connect I/O (Input/Output) related at least one peripheral to processor 1001 and memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 1001, memory 1002, and peripheral interface 1003 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
Radio Frequency circuit 1004 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. Radio frequency circuitry 1004 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. Radio frequency circuitry 1004 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuitry 1004 may also include NFC (Near Field Communication ) related circuitry, which is not limited in this application.
The display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1005 is a touch screen, the display 1005 also has the ability to capture touch signals at or above the surface of the display 1005. The touch signal may be input to the processor 1001 as a control signal for processing. At this time, the display 1005 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, display 1005 may be one, providing a front panel of terminal 1000; in other embodiments, display 1005 may be provided in at least two, separately provided on different surfaces of terminal 1000 or in a folded configuration; in still other embodiments, display 1005 may be a flexible display disposed on a curved surface or a folded surface of terminal 1000. Even more, the display 1005 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 1005 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, virtual reality (VR) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1006 may further include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1001 for processing, or inputting the electric signals to the radio frequency circuit 1004 for voice communication. For purposes of stereo acquisition or noise reduction, the microphone may be multiple, each located at a different portion of terminal 1000. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 1007 may also include a headphone jack.
The location component 1008 is used to locate the current geographic location of terminal 1000 to enable navigation or LBS (Location Based Service, location-based services). The positioning component 1008 may be a positioning component based on the united states GPS (Global Positioning System ), the beidou system of china, or the galileo system of russia.
Power supply 1009 is used to power the various components in terminal 1000. The power source 1009 may be alternating current, direct current, disposable battery or rechargeable battery. When the power source 1009 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1000 can further include one or more sensors 1010. The one or more sensors 1010 include, but are not limited to: acceleration sensor 1011, gyroscope sensor 1012, pressure sensor 1013, fingerprint sensor 1014, optical sensor 1015, and proximity sensor 1016.
The acceleration sensor 1011 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 1000. For example, the acceleration sensor 1011 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 1001 may control the touch display 1005 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1011. The acceleration sensor 1011 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 1012 may detect the body direction and the rotation angle of the terminal 1000, and the gyro sensor 1012 may collect the 3D motion of the user to the terminal 1000 in cooperation with the acceleration sensor 1011. The processor 1001 may implement the following functions according to the data collected by the gyro sensor 1012: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
Pressure sensor 1013 may be disposed on a side frame of terminal 1000 and/or on an underlying layer of touch display 1005. When the pressure sensor 1013 is provided at a side frame of the terminal 1000, a grip signal of the terminal 1000 by a user can be detected, and the processor 1001 performs right-and-left hand recognition or quick operation according to the grip signal collected by the pressure sensor 1013. When the pressure sensor 1013 is provided at the lower layer of the touch display 1005, the processor 1001 controls the operability control on the UI interface according to the pressure operation of the user on the touch display 1005. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 1014 is used to collect a fingerprint of the user, and the processor 1001 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1001 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. Fingerprint sensor 1014 may be provided on the front, back or side of terminal 1000. When a physical key or vendor Logo is provided on terminal 1000, fingerprint sensor 1014 may be integrated with the physical key or vendor Logo.
The optical sensor 1015 is used to collect ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the touch display 1005 based on the ambient light intensity collected by the optical sensor 1015. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 1005 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 1005 is turned down. In another embodiment, the processor 1001 may dynamically adjust the shooting parameters of the camera module 1006 according to the ambient light intensity collected by the optical sensor 1015.
Proximity sensor 1016, also referred to as a distance sensor, is typically located on the front panel of terminal 1000. Proximity sensor 1016 is used to collect the distance between the user and the front of terminal 1000. In one embodiment, when proximity sensor 1016 detects a gradual decrease in the distance between the user and the front face of terminal 1000, processor 1001 controls touch display 1005 to switch from the bright screen state to the off screen state; when proximity sensor 1016 detects a gradual increase in the distance between the user and the front face of terminal 1000, processor 1001 controls touch display 1005 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 11 is not limiting and that terminal 1000 can include more or fewer components than shown, or certain components can be combined, or a different arrangement of components can be employed.
The embodiment of the application also provides a non-transitory computer readable storage medium, and when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to execute the behavior recognition method of the driver provided by the above embodiments.
The present embodiments also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the method for identifying the behavior of the driver provided in the above embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing describes only preferred embodiments of the present application and is not intended to limit the present application to those particular embodiments; the scope of protection of the present application is not restricted to the embodiments described above.

Claims (11)

1. A method of identifying behavior of a driver, the method comprising:
acquiring a target image, wherein the target image comprises a face of a driver;
acquiring a first area image from the target image, wherein the first area image comprises preset behaviors of the driver, and the preset behaviors refer to behaviors with similarity with illegal behaviors being larger than a preset threshold;
in the target image, performing reduction processing on the peripheral area of the first area image according to a first proportion threshold value to obtain a second area image, and performing expansion processing on the peripheral area of the first area image according to a second proportion threshold value to obtain a third area image;
adjusting the first region image, the second region image and the third region image to region images of the same size;
invoking a target network model, wherein the target network model is used for determining a behavior category of a behavior based on a group of region images corresponding to any behavior, and comprises an input layer, a middle layer, a splicing layer, a full-connection layer and an output layer;
the input layer carries out channel superposition processing on image data in the first region image, the second region image and the third region image after size adjustment based on the resolution ratio and the channel number of the region image after size adjustment, so as to obtain a first feature image;
Performing convolution sampling processing on the first feature images through the intermediate layer to obtain a plurality of second feature images, wherein the second feature images are identical in size and different in channel number;
the splicing layer is used for carrying out channel superposition processing on the plurality of second feature images, and feature fusion is carried out on the feature images after channel superposition through a convolution layer in a deep network layer to obtain a third feature image;
determining, by the fully-connected layer, behavior of the driver based on the third feature map;
outputting the behavior through the output layer.
2. The method of claim 1, wherein the intermediate layer comprises N sets of convolution layers and N sets of sampling layers, each set of convolution layers corresponding one-to-one to each set of sampling layers;
the convolution sampling processing is performed on the first feature map through the intermediate layer to obtain a plurality of second feature maps, including:
let i=1 and determine the first feature map as a target feature map; perform convolution processing on the target feature map through the i-th group of convolution layers, and sample the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-times feature map and a second feature map of the i-th reference size, and take the 2^i-times feature map as the target feature map, the reference size being greater than or equal to the size of the 2^i-times feature map;
when i is smaller than N, let i=i+1 and return to the operation of performing convolution processing on the target feature map through the i-th group of convolution layers, sampling the obtained feature map through the i-th group of sampling layers at two different multiples to obtain a 2^i-times feature map and a second feature map of the i-th reference size, and taking the 2^i-times feature map as the target feature map;
and when i is equal to N, perform convolution processing on the target feature map through the N-th group of convolution layers, perform sampling processing once on the obtained feature map through the N-th group of sampling layers to obtain a second feature map of the N-th reference size, and end the operation.
3. The method of claim 1, wherein prior to invoking the target network model, further comprising:
acquiring a plurality of training samples, wherein the plurality of training samples comprise a plurality of groups of images and behavior categories in each group of images, each group of images comprises a region image of a behavior, a reduced region image determined after the region image is subjected to reduction processing, and an expanded region image determined after the region image is subjected to expansion processing;
And training the network model to be trained based on the plurality of training samples to obtain the target network model.
4. The method of claim 1, wherein the acquiring a first region image from the target image comprises:
invoking a target detection model, inputting the target image into the target detection model, and outputting a face detection frame and a preset behavior detection frame, wherein the target detection model is used for identifying the face and the preset behavior in the image based on any image;
when the number of the face detection frames is one, determining the face detection frames as target face detection frames; when the number of the face detection frames is multiple, the face detection frame with the largest area is obtained from the multiple face detection frames, and the obtained face detection frame is determined to be a target face detection frame;
and determining the first area image based on the target face detection frame.
5. The method of claim 4, wherein the determining the first region image based on the target face detection box comprises:
filtering out the preset behavior detection frames which are not overlapped with the target face detection frame or the distance between the preset behavior detection frames and the target face detection frame is larger than a preset distance threshold;
And cutting out the region corresponding to the residual preset behavior detection frame after filtering from the target image to obtain the first region image.
6. The method of claim 1, wherein the acquiring the target image comprises:
detecting a working mode of a camera for shooting the face of the driver;
when the working mode of the camera is an infrared shooting mode, acquiring a gray image obtained by shooting;
calling an image pseudo-color conversion model, inputting the gray image into the image pseudo-color conversion model, and outputting a three-channel color image corresponding to the gray image, wherein the image pseudo-color conversion model is used for converting any gray image into the three-channel color image corresponding to the gray image;
and acquiring the output three-channel color image as the target image.
7. The method of claim 6, wherein prior to detecting the mode of operation of the camera for capturing the face of the driver, further comprising:
acquiring the current speed and illumination intensity;
and when the vehicle speed is greater than a preset vehicle speed threshold and the illumination intensity is lower than an illumination intensity threshold, switching the working mode of the camera into the infrared shooting mode.
8. The method of claim 1, wherein after outputting the behavior through the output layer, further comprising:
when the behavior of the driver belongs to the illegal behavior, counting the number of times of the illegal behavior of the driver in a preset duration;
and when the number of times of the illegal behaviors of the driver reaches a preset number threshold value within the preset duration, carrying out illegal driving alarm prompt.
9. A behavior recognition apparatus of a driver, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring a target image, wherein the target image comprises a face of a driver;
the second acquisition module is used for acquiring a first area image from the target image, wherein the first area image comprises preset behaviors of the driver, and the preset behaviors refer to behaviors with similarity with illegal behaviors being larger than a preset threshold;
the image processing module is used for carrying out reduction processing on the peripheral area of the first area image according to a first proportion threshold value in the target image to obtain a second area image, and carrying out expansion processing on the peripheral area of the first area image according to a second proportion threshold value to obtain a third area image;
A size adjustment module configured to adjust the first area image, the second area image, and the third area image to area images of the same size;
the recognition module is used for calling a target network model, the target network model is used for determining the behavior category of the behavior based on a group of region images corresponding to any behavior, and the target network model comprises an input layer, a middle layer, a splicing layer, a full connection layer and an output layer; the input layer carries out channel superposition processing on image data in the first region image, the second region image and the third region image after size adjustment based on the resolution ratio and the channel number of the region image after size adjustment, so as to obtain a first feature image; performing convolution sampling processing on the first feature images through the intermediate layer to obtain a plurality of second feature images, wherein the second feature images are identical in size and different in channel number; the splicing layer is used for carrying out channel superposition processing on the plurality of second feature images, and feature fusion is carried out on the feature images after channel superposition through a convolution layer in a deep network layer to obtain a third feature image; determining, by the fully-connected layer, behavior of the driver based on the third feature map; outputting the behavior through the output layer.
10. An intelligent device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the steps of any of the methods of claims 1-8.
11. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the steps of the method of any of claims 1-8.
CN201910207840.8A 2019-03-19 2019-03-19 Method, device, equipment and storage medium for identifying driver behavior Active CN111723602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910207840.8A CN111723602B (en) 2019-03-19 2019-03-19 Method, device, equipment and storage medium for identifying driver behavior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910207840.8A CN111723602B (en) 2019-03-19 2019-03-19 Method, device, equipment and storage medium for identifying driver behavior

Publications (2)

Publication Number Publication Date
CN111723602A CN111723602A (en) 2020-09-29
CN111723602B true CN111723602B (en) 2023-08-08

Family

ID=72562943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910207840.8A Active CN111723602B (en) 2019-03-19 2019-03-19 Method, device, equipment and storage medium for identifying driver behavior

Country Status (1)

Country Link
CN (1) CN111723602B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950505B (en) * 2021-03-03 2024-01-23 西安工业大学 Image processing method, system and medium based on generation countermeasure network
CN113052026B (en) * 2021-03-12 2023-07-18 北京经纬恒润科技股份有限公司 Method and device for positioning smoking behavior in cabin
CN112926510A (en) * 2021-03-25 2021-06-08 深圳市商汤科技有限公司 Abnormal driving behavior recognition method and device, electronic equipment and storage medium
CN113313012B (en) * 2021-05-26 2023-04-07 北京航空航天大学 Dangerous driving behavior identification method based on convolution generation countermeasure network
CN113705427A (en) * 2021-08-26 2021-11-26 成都上富智感科技有限公司 Fatigue driving monitoring and early warning method and system based on vehicle gauge chip SoC
CN115346363A (en) * 2022-06-27 2022-11-15 西安电子科技大学 Driver violation prediction method based on neural network
CN115147818A (en) * 2022-06-30 2022-10-04 京东方科技集团股份有限公司 Method and device for identifying mobile phone playing behaviors

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679531A (en) * 2017-06-23 2018-02-09 平安科技(深圳)有限公司 Licence plate recognition method, device, equipment and storage medium based on deep learning
CN108009475A (en) * 2017-11-03 2018-05-08 东软集团股份有限公司 Driving behavior analysis method, apparatus, computer-readable recording medium and electronic equipment
CN108205649A (en) * 2016-12-20 2018-06-26 浙江宇视科技有限公司 Driver drives to take the state identification method and device of phone
CN108764034A (en) * 2018-04-18 2018-11-06 浙江零跑科技有限公司 A kind of driving behavior method for early warning of diverting attention based on driver's cabin near infrared camera
CN108875588A (en) * 2018-05-25 2018-11-23 武汉大学 Across camera pedestrian detection tracking based on deep learning
CN109034102A (en) * 2018-08-14 2018-12-18 腾讯科技(深圳)有限公司 Human face in-vivo detection method, device, equipment and storage medium
CN109389068A (en) * 2018-09-28 2019-02-26 百度在线网络技术(北京)有限公司 The method and apparatus of driving behavior for identification

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10007854B2 (en) * 2016-07-07 2018-06-26 Ants Technology (Hk) Limited Computer vision based driver assistance devices, systems, methods and associated computer executable code

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108205649A (en) * 2016-12-20 2018-06-26 浙江宇视科技有限公司 Driver drives to take the state identification method and device of phone
CN107679531A (en) * 2017-06-23 2018-02-09 平安科技(深圳)有限公司 Licence plate recognition method, device, equipment and storage medium based on deep learning
CN108009475A (en) * 2017-11-03 2018-05-08 东软集团股份有限公司 Driving behavior analysis method, apparatus, computer-readable recording medium and electronic equipment
CN108764034A (en) * 2018-04-18 2018-11-06 浙江零跑科技有限公司 A kind of driving behavior method for early warning of diverting attention based on driver's cabin near infrared camera
CN108875588A (en) * 2018-05-25 2018-11-23 武汉大学 Across camera pedestrian detection tracking based on deep learning
CN109034102A (en) * 2018-08-14 2018-12-18 腾讯科技(深圳)有限公司 Human face in-vivo detection method, device, equipment and storage medium
CN109389068A (en) * 2018-09-28 2019-02-26 百度在线网络技术(北京)有限公司 The method and apparatus of driving behavior for identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
倪志平 等.基于驾驶行为识别的车载交通信息数据实时发布系统.科学技术与工程.2018,第18卷(第08期),全文. *

Also Published As

Publication number Publication date
CN111723602A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723602B (en) Method, device, equipment and storage medium for identifying driver behavior
CN109829456B (en) Image identification method and device and terminal
CN110895861B (en) Abnormal behavior early warning method and device, monitoring equipment and storage medium
CN111126182A (en) Lane line detection method, lane line detection device, electronic device, and storage medium
CN110839128B (en) Photographing behavior detection method and device and storage medium
CN108363982B (en) Method and device for determining number of objects
CN112084811B (en) Identity information determining method, device and storage medium
CN109886208B (en) Object detection method and device, computer equipment and storage medium
CN112406707B (en) Vehicle early warning method, vehicle, device, terminal and storage medium
CN110874905A (en) Monitoring method and device
CN110647881A (en) Method, device, equipment and storage medium for determining card type corresponding to image
CN111027490A (en) Face attribute recognition method and device and storage medium
CN111325701B (en) Image processing method, device and storage medium
CN111931712B (en) Face recognition method, device, snapshot machine and system
CN111127541A (en) Vehicle size determination method and device and storage medium
CN112749590B (en) Object detection method, device, computer equipment and computer readable storage medium
CN111860064B (en) Video-based target detection method, device, equipment and storage medium
CN111754564B (en) Video display method, device, equipment and storage medium
CN113591514B (en) Fingerprint living body detection method, fingerprint living body detection equipment and storage medium
CN113205069B (en) False license plate detection method and device and computer storage medium
CN111583669B (en) Overspeed detection method, overspeed detection device, control equipment and storage medium
CN111723615B (en) Method and device for judging matching of detected objects in detected object image
CN110728275B (en) License plate recognition method, license plate recognition device and storage medium
CN112395921A (en) Abnormal behavior detection method, device and system
CN112990424A (en) Method and device for training neural network model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant