CN110633701A - Driver call detection method and system based on computer vision technology - Google Patents


Info

Publication number
CN110633701A
CN110633701A (application CN201911013602.XA)
Authority
CN
China
Prior art keywords
calling
face
detection
model
infrared
Prior art date
Legal status
Pending
Application number
CN201911013602.XA
Other languages
Chinese (zh)
Inventor
邹尧 (Zou Yao)
王海 (Wang Hai)
王天峥 (Wang Tianzheng)
Current Assignee
Dream Innovation Technology (Shenzhen) Co Ltd
Original Assignee
Dream Innovation Technology (Shenzhen) Co Ltd
Priority date
Filing date
Publication date
Application filed by Dream Innovation Technology (Shenzhen) Co Ltd
Priority to CN201911013602.XA
Publication of CN110633701A

Classifications

    • G06F18/285: Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G06V10/50: Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]
    • G06V10/467: Encoded features or binary features, e.g. local binary patterns [LBP]
    • G06V20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G06V40/161: Human faces: detection; localisation; normalisation
    • G06V40/168: Human faces: feature extraction; face representation
    • G06V40/172: Human faces: classification, e.g. identification

Abstract

The invention discloses a driver call detection method and system based on computer vision technology. The system acquires external image information from a camera and detects faces in real time; when faces are detected, it compares the sizes of all detected face regions and selects the largest face for subsequent analysis. The largest face region is divided into blocks, and a call detection module is run to detect call gestures. The call detection module comprises several call detection sub-modules; if any sub-module detects calling behaviour, the corresponding call position is stored and the remaining sub-modules are stopped. Compared with similar products, the call detection algorithm is faster, demands less of the hardware, and is more extensible, with a lower false-alarm rate and no loss of accuracy or robustness; the whole device is smaller, its hardware cost is low and controllable, and it is easy to popularize widely as a consumer electronics product.

Description

Driver call detection method and system based on computer vision technology
Technical Field
The invention relates to the technical field of dangerous driving detection of vehicle safety systems, in particular to a driver call detection method and system based on a computer vision technology.
Background
With private cars ever more common and road traffic ever more developed, people drive far more frequently than before, yet their awareness of safety and self-protection has not grown correspondingly. How to keep driving safe under these conditions is an important problem. Among all causes of driving accidents, using a mobile phone while driving is one of the largest, but there are currently no good preventive measures against it.
Some fatigue-driving detection devices on the market can also detect mobile phone use while driving, but they are built on neural processing unit (NPU) or graphics processing unit (GPU) hardware on which a deep-learning-based call detection algorithm is run. Deep-learning algorithms generally perform well, but their heavy hardware requirements make such devices expensive, so they are mainly used for professional fleet management; for ordinary car drivers in the consumer market, the bulky, complex and costly hardware limits their applicability.
Disclosure of Invention
The invention overcomes the defects of the prior art by providing a driver call detection method and system based on computer vision technology, which can quickly detect whether the driver makes a call while driving.
The invention aims to provide a driver call detection method and system based on computer vision technology that solve at least the problems of the prior art: slow detection algorithms, high hardware requirements, and low device applicability.
In order to achieve the above purpose, the invention provides the following technical scheme: a driver call detection method based on computer vision technology comprises the following steps:
acquiring external image information through a camera, and performing face detection in real time;
when a face is detected, comparing the sizes of all detected face regions and selecting the image corresponding to the largest face region as the face image for subsequent analysis;
dividing the face image corresponding to the largest face region into blocks, and running a call detection module to detect call gestures;
the call detection module comprises several call detection sub-modules; if any sub-module detects calling behaviour, the corresponding call position is stored and the remaining sub-modules are stopped;
wherein the algorithm for detecting calling behaviour in the call detection sub-modules is established by the following modelling and training steps:
a feature extraction stage: extracting image features of the call and non-call regions of images in a general database and an infrared image database using local binary pattern (LBP) features;
a positive and negative sample construction stage: classifying the images in the general database and the infrared image database into call-region and non-call-region images, and scaling the region images to different sizes according to the different call gestures;
an original training stage: constructing a strong classifier on the general database with the conventional Float Boosting algorithm, using LBP features, to obtain a general Float Boosting model;
a transfer learning stage: constructing a strong classifier on the infrared image database starting from the general Float Boosting model, and optimizing a training objective that takes the general-database model into account, so that the resulting model has the characteristics of both the general model and the infrared image data;
a detection stage: detecting call regions on the infrared image with the infrared-enhanced Float Boosting model obtained in the transfer learning stage, based on a cascade model structure; averaging the multiple potential call regions obtained, and taking their average position as the call position.
According to the driver call detection method above, the algorithm for detecting faces comprises the following steps:
a feature extraction stage: extracting image features of faces and non-faces from images in a general database and an infrared image database using local binary pattern (LBP) and local gradient pattern (LGP) features;
a positive and negative sample construction stage: classifying the images into face and non-face images, scaling them to 40 × 40 pixels, and dividing the faces into different subsets according to pose;
an original training stage: constructing a cascade classifier on the general database with the conventional Vector Boosting algorithm, the features being the combination of LBP and LGP features, to obtain a general Vector Boosting model;
a transfer learning stage: constructing a cascade classifier on the infrared image database starting from the general Vector Boosting model, and optimizing a training objective that takes the general-database model into account, so that the resulting model has the characteristics of both the general model and the infrared image data;
a detection stage: detecting face regions on the infrared image with the infrared-enhanced Vector Boosting model obtained in the transfer learning stage, based on a vector tree model structure.
According to the driver call detection method above, in the algorithm for detecting calling behaviour, the Float Boosting objective equation used in the original training stage is:

$$\mathrm{Loss} = \sum_{i} \exp\!\bigl(-y_i\,H_M(x_i)\bigr)$$

$$h_m = \arg\min_{h}\;\mathrm{Loss}\bigl(H_{M-1}(x) + h(x)\bigr)$$

where $x$ is the input feature vector, $h(x)$ is a weak classifier, $H_M$ denotes the strong classifier combined from $M$ weak classifiers, $h_m$ denotes the $m$-th weak classifier, $y_i$ is the label of the $i$-th instance, $\mathrm{Loss}$ is the loss function of the classifier, and $\exp$ is the exponential function;
the optimization equation used in the transfer learning stage is:

$$\min\;\mathrm{Loss} + \lambda \cdot \mathrm{KL}$$

where $\mathrm{KL}$ is the KL distance between the general model and the infrared-enhanced model, and $\lambda$ is a weight balancing the two losses.
According to the driver call detection method, the method further comprises the following steps:
a result analysis and judgement step, wherein once the number of analysed frames reaches a set number, the analysis results of those frames are counted to judge the duration of the call gesture.
According to the driver call detection method above, the set frame number is 30, and the statistical methods include linear statistics on the counted call frequency, nonlinear statistical fitting, and a weighted average.
According to the driver call detection method above, in the algorithm for detecting faces, the local binary pattern (LBP) feature is computed as:

$$\mathrm{LBP}(x_c, y_c) = \sum_{p=0}^{P-1} s(i_p - i_c)\,2^p, \qquad s(z) = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases}$$

where $(x_c, y_c)$ is the position of the pixel centre, $(i_p - i_c)$ is the difference between the pixel values of the centre point $i_c$ and the neighbour point $i_p$, and $P$ is the number of pixels around each pixel. The local gradient pattern (LGP) feature is computed as:

$$\mathrm{LGP}(x_c, y_c) = \sum_{n=0}^{P-1} s(g_n - \bar{g})\,2^n, \qquad g_n = |i_n - i_c|, \qquad \bar{g} = \frac{1}{P}\sum_{n=0}^{P-1} g_n$$

where $(x_c, y_c)$ is the position of the pixel centre, $g_n = |i_n - i_c|$ is the absolute difference between the values of the centre point $i_c$ and the neighbour point $i_n$, $\bar{g}$ is the average of these differences, and $P$ is the number of pixels around each pixel.
According to the driver call detection method above, the algorithm for detecting faces comprises:
in the original training stage, using the original Vector Boosting training model, selecting a subset of feature values from the high-dimensional LGP and LBP feature values at each iteration, giving each weak classifier a weight, and re-weighting every image according to the current classifier's result, with misclassified samples given larger weights and correctly classified samples smaller ones; the weak classifier is selected by:

$$f_t = \arg\min_{f}\;\sum_i w_i^{(t)} \exp\!\bigl(-v_i \cdot f(x_i)\bigr)$$

where $f_t(x)$ is the selected weak classifier, $\exp$ is the exponential function, $f(x_i)$ is a candidate weak classifier, $v_i$ is the current class label, and $w_i^{(t)}$ is the weight of sample $i$ at the $t$-th iteration;
in the transfer learning stage, the input is the general Vector Boosting model and the output is the infrared-enhanced Vector Boosting model; the KL distance measures the difference between the two models, with the optimization formula:

$$\min\;\mathrm{Loss} + \lambda \cdot \mathrm{KL}(p\,\|\,q), \qquad \mathrm{KL}(p\,\|\,q) = \sum_i p_i \log\frac{p_i}{q_i}$$

different values of $\lambda$ are tried, and the $\lambda$ with the lowest test error rate is kept; here $p$ and $q$ are the probability distributions of the general Vector Boosting model and the infrared-enhanced model respectively, and $p_i$ and $q_i$ are the probabilities of the $i$-th instance under the two distributions;
in the detection stage, the final strong classifier $F_T(x)$ is the combination of the $T$ selected weak classifiers:

$$F_T(x) = \sum_{t=1}^{T} f_t(x)$$
Another aspect of the present invention is a computer vision technology-based driver call detection system, comprising:
the processor comprises a main control unit, an arithmetic unit, a memory unit and a system bus, wherein the main control unit processes logic judgment in the running process of the system and is also used for controlling the connection and the opening of a hardware module;
the arithmetic unit is used for reading and processing the data in the memory unit according to the command of the main control unit and outputting a processing result to the main control unit;
the memory unit provides memory support for the operation unit and the main control unit;
the camera module is connected with the processor and used for acquiring image information of an automobile driver seat and sending the image information to the memory unit through the system bus;
the storage module is used for storing an algorithm model file, parameters and a user configuration file, the storage module is connected with the processor, and the processor can call and modify data stored in the storage module;
wherein the storage module further comprises:
a face detection module, for detecting the driver's face region in real time, selecting the largest face for subsequent analysis by comparing the sizes of all detected face regions, and dividing the obtained largest face region into blocks to obtain the face and mouth blocks;
a call detection module, comprising several call detection sub-modules which each examine the mouth blocks for calling behaviour; if any sub-module detects calling behaviour, the corresponding call position is stored and the remaining sub-modules are stopped;
and a result analysis and judgement module, for counting the analysis results of the frames once the number of analysed frames reaches a set number, to judge the duration of the call gesture.
According to the driver call detection system above, the camera module uses an infrared camera together with an added infrared fill light and infrared optical filter, the fill light being a narrow-spectrum infrared lamp invisible to the naked eye; the infrared filter sits in front of the infrared fill light and the infrared camera;
the printed circuit board carrying the infrared fill light has large windows opened in its ink layer to expose the copper layer for rapid heat dissipation.
Another aspect of the invention is a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the driver call detection methods described above.
Compared with the closest prior art, the technical scheme provided by the invention has the following excellent effects:
the invention can provide a driver call detection method and system based on computer vision technology, and the core of the method is that hardware equipment comprises a DSP chip and a rapid call detection algorithm running on a hardware system. The hardware equipment firstly obtains an infrared image through an infrared camera, and then runs a real-time call detection algorithm through a digital signal processor, wherein the real-time call detection algorithm comprises the steps of detecting the moving range of the face of a driver by using a face detection technology, and then running the call detection algorithm near the face frame range to identify various different call behaviors of the driver, including holding a phone when driving, holding the phone for talkback, sending short messages by using the phone and the like.
Compared with other similar products, the calling detection algorithm has the advantages of higher speed, lower requirement on hardware, higher algorithm expandability, lower false alarm rate, and no reduction in accuracy, robustness and the like. The whole equipment volume is littleer, and the outward appearance is more pleasing to the eye and has the science and technology sense, and hardware low cost is controllable moreover, easily carries out extensive popularization as consumer electronics product.
When a driver plays a mobile phone (including calling, sending short messages, voice conversation and the like) in the driving process, the system can take appropriate measures, such as prompting and standardizing the behavior of the driver and avoiding traffic accidents by sound alarm, indicator light flash alarm, vibration mode and machine instruction alarm.
Drawings
FIG. 1 is a diagram illustrating hardware connections in an embodiment of the present invention;
FIG. 2 is a flow chart of the operation of the call detection algorithm module in an embodiment of the present invention;
FIG. 3 is a flow chart of the operation of the incoming call detection analysis integration module in an embodiment of the present invention;
FIG. 4 is a flowchart of overall system operation in an embodiment of the present invention;
FIG. 5 is an LGP feature extraction diagram in an embodiment of the present invention;
FIG. 6 is an LBP feature extraction diagram in an embodiment of the present invention;
FIG. 7 is a diagram of the face-region block division for call detection in an embodiment of the present invention;
FIG. 8 is a schematic diagram of various call gestures in an embodiment of the present invention;
FIG. 9 is a diagram illustrating a Vector Boost training process in an embodiment of the present invention;
FIG. 10 is a schematic diagram of a cascade classifier in an embodiment of the invention;
FIG. 11 is a schematic diagram of PCB slotted hole insulation design in an embodiment of the invention.
Detailed Description
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the description of the present invention, terms such as "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top" and "bottom" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience of description and do not require that the invention be constructed or operated in a specific orientation, and thus should not be construed as limiting the invention. The term "connected" used herein should be interpreted broadly: it may, for example, denote a fixed or a detachable connection, and a direct connection or an indirect connection through intermediate members; those skilled in the art will understand the specific meanings of these terms as appropriate.
Most of the computer vision algorithms in the invention need to be trained, and different training data and implementation details may influence the final effect. To achieve better results, the training data and implementation details used by the modules that need training are as follows:
face detection: during training, in order to reduce the false alarm rate, the invention trains the model of the invention by utilizing a large number of face images on the Internet, the training data is huge, and the training data comprises a large number of negative samples (non-face images), and the whole training sample comprises face images of different races, skin colors, ages, sexes and different postures.
And (3) making a call and detecting: during training, we collected approximately 40000 pictures containing calls, of which approximately 10000 were for a handheld phone below the ear (phone against ear), approximately 10000 were for a handheld phone at the side of the ear (phone against ear), approximately 10000 were for a handheld phone speaking or typing (phone on hand), and approximately 10000 were for a handheld phone voice input (phone directly in front of mouth). Firstly, training is carried out on different pictures to obtain different calling gesture detection models, and then partial infrared images containing calling gestures are collected to correct the models, so that the obtained models are more in line with the infrared images. After training is complete, all modules may retain only the test program and the resulting model trained.
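As a concrete illustration of this data organization, the sketch below (Python) groups positive training samples by call gesture, one set per detector; the directory layout and helper names are hypothetical, and the window sizes are the four given later in the text.

```python
# Illustrative sketch only: organizing the four call-gesture subsets before
# training one detector per gesture. Paths and names are hypothetical.
from pathlib import Path

GESTURES = {
    "phone_below_ear":    (22, 15),  # hand-held phone below the ear
    "phone_beside_ear":   (22, 15),  # hand-held phone at the side of the ear
    "phone_in_hand":      (22, 22),  # speaking or typing with phone in hand
    "phone_before_mouth": (15, 22),  # voice input, phone in front of the mouth
}

def collect_samples(root: str) -> dict:
    """Group positive images by gesture; each group trains its own model."""
    datasets = {}
    for gesture, window in GESTURES.items():
        images = sorted(Path(root, gesture).glob("*.png"))
        datasets[gesture] = {"window": window, "images": images}
    return datasets
```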
As shown in FIG. 1, the system for detecting the call state of a vehicle driver according to the preferred embodiment of the present invention comprises intelligent hardware, which includes: an infrared camera for acquiring infrared image information, in particular face image information; a storage unit for storing the algorithm model file, parameters and user configuration file; a memory unit providing the memory support needed to run the algorithms; a loudspeaker for announcing alarm information to the user; and a processing unit comprising a main control unit and an arithmetic unit. The main control unit handles the various logic judgements of the analysis process, receives the start-up of the control device, connects the hardware modules, and controls the loudspeaker's alarm and prompt sounds. The arithmetic unit mainly runs the computer vision algorithms of the invention; these require a large amount of computation, and relying on the main control chip alone would make the calculation very slow.
The camera (camera module) in the system is preferably an infrared camera with an added infrared fill light and optical filter. The fill light and filter are optional: if the device only needs to operate in environments with sufficient visible light, an ordinary camera suffices. If the device must also work in dark environments, an infrared fill light has to be added, an infrared camera selected, and an optical filter added to eliminate light noise such as glare and backlight.
The camera is connected to the main control unit through a data bus, so the picture information it acquires is transmitted over the bus to the memory unit; the computer vision algorithms then read the image information from the memory unit, run in the DSP arithmetic unit, and compute the various results, which are fed back to the main control unit; the main control unit directs the loudspeaker module or Bluetooth module to respond accordingly. The arithmetic unit and the main control unit are likewise connected through a memory data bus.
The basic principle of the whole system is to analyse the driver's state in real time using computer vision and to decide comprehensively whether the driver is in a calling state. The detection state obtained from analysis is compared against an experimentally determined threshold; if it exceeds the threshold, the device triggers the corresponding alarm to regulate the driver's behaviour.
In the preferred embodiment of the present invention, the software and hardware system is as follows:
Hardware system
The system hardware uses an ADI Blackfin53x as the DSP processor, with a maximum clock of 600 MHz, up to 4 GB of addressing space, up to 80 KB of L1 instruction SRAM and 2 × 32 KB of L1 data SRAM, and a rich set of integrated peripherals and interfaces. The DSP is connected to a 16 MB flash memory module (expandable to 32 MB or 64 MB) and 32 MB of off-chip SDRAM (expandable to 64 MB). The flash module stores the audio and configuration files required by the system, while the SDRAM and SRAM provide the memory the whole system needs at run time.
Other peripheral hardware modules include an infrared camera, an infrared fill light, a Bluetooth module, a status indicator light, and a loudspeaker.
The status indicator light uses different colours and blink rates to show whether the device is operating normally and whether the driver's state is normal. When the driver makes a call, the indicator blinks slowly in red, the loudspeaker warns the driver by voice not to make calls, and the Bluetooth module can send the calling behaviour to another Bluetooth device, such as a smartphone, as machine instructions and pictures. Through the Bluetooth module, a smartphone app can also configure this hardware system as follows:
1, set the installation position of the device in the car, currently from 4 options: centre console, steering wheel column, A-pillar, or windshield;
2, calibrate the installation angle of the device so that the driver's head sits in the middle of its monitoring range;
3, set the sensitivity of the device, selectable as low, medium or high;
4, control the prompt volume of the device, selectable as mute, low or high;
5, control the alarm condition of the device, so that it does not alarm when the vehicle is slow or stationary, and alarms when the vehicle speed is high;
the device can send the state of the driver to other equipment in a machine instruction mode in a wireless mode of a Bluetooth module, and also can send the state of the driver to other equipment in a machine instruction mode in 2 wired modes of a serial port (such as RS232 and RS485) and GPIO (general purpose input/output), such as a vehicle-mounted video recorder, a seat vibration system, an automobile control system or other equipment which can intervene in the control of the driver and an automobile, so that the driver can intervene in driving behaviors in various modes to realize safer driving.
In the optical structure we have realized an innovative design clearly different from similar products: a high-radiant-power, narrow-spectrum infrared fill light invisible to the human eye, with a narrow-band infrared filter inside the camera; the front of the device (in front of the camera) is additionally covered with an infrared filter that blocks visible light, largely eliminating non-infrared optical interference and noise from the environment. The camera thus relies mainly on the system's own infrared fill light, so it can capture stable, clear images in any environment (day or night, direct light or backlight, with or without interference from oncoming headlights, and other optical interference). FIG. 1 shows the hardware connections.
In the thermal design we have realized an innovative design clearly different from similar products: four elongated slots 3 isolate the camera-sensor area 2 of the PCB (printed circuit board) from its other areas 1, preventing the large amount of heat produced by electronic components in areas 1 from diffusing into the sensor area 2 and degrading imaging quality. FIG. 11 shows the PCB slot insulation design. In addition, the PCB carrying the infrared fill light uses large windows in its ink layer to expose the copper layer and dissipate its heat into the air. A miniature electronic fan integrated inside the product quickly blows internal heat out of the product for good overall heat dissipation.
Driver calling detection analysis system
The driver call monitoring system comprises two parts. The first is the set of call detection algorithms, which analyse each frame of the image to obtain raw detection information, such as the face region and the call detection results. The second is the comprehensive result analysis and judgement module, which judges whether the driver is calling from the various raw analysis data; since the raw call detection module inevitably makes occasional errors, this module integrates information across multiple frames, further improving detection accuracy and reducing the false-alarm rate. Specifically, using computer vision algorithms, the system analyses the infrared image obtained by the camera and extracts raw information, mainly the locations of faces (possibly several face regions), the locations of key points on the face contour, and the results of the various call detection algorithms. The comprehensive result analysis module then performs statistical analysis on this raw information to judge whether the driver is currently in a calling state, and finally the corresponding hardware module is called to issue a reminder according to the result.
Specifically, the computer vision algorithms included in the system are the face detection algorithm and the call detection algorithm.
The face detection algorithm uses the following technical scheme: the whole algorithm is trained with the Vector Boosting algorithm and transfer learning as the classifier framework, and image features are extracted with local binary patterns (LBP) and local gradient patterns (LGP). Unlike the original Vector Boosting algorithm, after a general face detection model is obtained on a general database, a set of infrared training images is collected and the general model is transferred onto these infrared images by transfer learning, so that the resulting face detection model performs better and is more targeted than the general one.
The specific modeling steps of the face detection algorithm (i.e. the specific construction of the face detection model) are as follows:
1) a feature extraction stage: for the images in the infrared database and the general database, local binary patterns (LBP) and local gradient patterns (LGP) are used to extract the features of all face and non-face images; the extraction schemes for the LBP and LGP features are shown in FIGS. 5 and 6, where the numbers in the leftmost box represent the gray values of the original image and the numbers in the middle box represent the comparison of each pixel value against the centre pixel value.
2) a positive and negative sample construction stage: the images in the general database and the infrared image database are classified into face and non-face images, each scaled to 40 × 40 pixels, and the faces are divided into different subsets according to pose (the face's left, right, up and down rotation angles).
3) an original training stage: using the images of the general database processed in steps 1) and 2), a cascade classifier is constructed with the conventional Vector Boosting algorithm; the features used by the classifier are the combination of LBP and LGP features, that is, the two feature vectors concatenated into one of length m + n. This stage yields a general Vector Boosting model based on the general database.
4) a transfer learning stage: the Vector Boosting algorithm is likewise used to construct a cascade classifier on the infrared database while taking the general-database model into account, i.e. training continues directly from the general Vector Boosting model on the infrared images, so the learned model stays compatible with both databases; the features are the same LGP + LBP combination as before. Optimizing this training objective gives an infrared-enhanced Vector Boosting model that has the characteristics of both the general model and the infrared image data, overcoming the shortage of infrared training data.
5) a detection stage: face regions are detected on the infrared image with the infrared-enhanced Vector Boosting model obtained in the transfer learning stage, based on a vector tree model structure.
6) The LGP and LBP features are computed as follows.
LGP formula:

$$\mathrm{LGP}(x_c, y_c) = \sum_{n=0}^{P-1} s(g_n - \bar{g})\,2^n, \qquad s(z) = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases}$$

where $(x_c, y_c)$ is the position of the pixel centre, the difference in gray value between the centre point $i_c$ and the neighbour point $i_n$ is $g_n = |i_n - i_c|$, and the average of these differences is $\bar{g} = \frac{1}{P}\sum_{n=0}^{P-1} g_n$; $P$ is the number of pixels around each pixel, here $P = 8$; $n$ is an index running from 0 to $P - 1$; and $c$ denotes the centre point.
LBP formula:

$$\mathrm{LBP}(x_c, y_c) = \sum_{p=0}^{P-1} s(i_p - i_c)\,2^p$$

where $(x_c, y_c)$ is the position of the pixel centre, $(i_p - i_c)$ is the difference between the pixel values of the centre point $i_c$ and the neighbour point $i_p$, and $P$ is the number of pixels around each pixel; in this embodiment $P = 8$.
Unlike other algorithms, in this embodiment the LBP and LGP features complement each other, and combining them improves the stability of the whole algorithm. To combine them, the two feature vectors are simply concatenated: if one has length m and the other length n, the result has length m + n. At the same false-detection rate this greatly raises the correct detection rate, while the overall feature extraction time stays low.
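To make the combination concrete, here is a minimal NumPy sketch (not from the patent) of the 8-neighbour LBP and LGP codes defined above and of concatenating the two feature maps into one vector of length m + n; border pixels are skipped for brevity.

```python
# Minimal sketch of 8-neighbour LBP and LGP codes, then concatenation.
import numpy as np

# Offsets of the 8 neighbours, in a fixed bit order.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_lgp(img: np.ndarray) -> np.ndarray:
    img = img.astype(np.int32)
    h, w = img.shape
    lbp = np.zeros((h, w), np.uint8)
    lgp = np.zeros((h, w), np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y, x]
            diffs = [abs(img[y + dy, x + dx] - c) for dy, dx in OFFSETS]
            g_bar = sum(diffs) / 8.0                 # mean absolute difference
            for bit, (dy, dx) in enumerate(OFFSETS):
                if img[y + dy, x + dx] >= c:         # LBP: sign of (i_p - i_c)
                    lbp[y, x] |= 1 << bit
                if diffs[bit] >= g_bar:              # LGP: g_n versus the mean
                    lgp[y, x] |= 1 << bit
    # Concatenate the two maps into one feature vector of length m + n.
    return np.concatenate([lbp.ravel(), lgp.ravel()])
```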
In the positive and negative sample construction stage, this embodiment deliberately constructs many more non-face pictures, making the ratio of positive to negative samples in the whole sample set unbalanced; training against this large number of negative samples lowers the face detector's false-alarm rate.
In the original training stage, the original Vector Boosting training model is used directly. At each iteration a subset of feature values is selected from the high-dimensional LGP and LBP feature values, each weak classifier is given a weight, and every image is re-weighted according to the current classifier's result, with misclassified samples given larger weights and correctly classified samples smaller ones; the specific training process is shown in FIG. 9. The weak classifier is selected by:

$$f_t = \arg\min_{f}\;\sum_i w_i^{(t)} \exp\!\bigl(-v_i \cdot f(x_i)\bigr)$$

where $f_t(x)$ is the selected weak classifier, $\exp$ is the exponential function, $f(x_i)$ is a candidate weak classifier, $v_i$ is the current class label, $w_i^{(t)}$ is the weight of sample $i$ at the $t$-th iteration, and $t$ runs up to the total number of training iterations $m$.
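The re-weighting loop above can be sketched as follows; this is a simplified scalar-label version of one boosting round (Vector Boosting proper uses vector-valued labels and outputs), shown only to illustrate the selection and re-weighting logic.

```python
# Simplified scalar-label sketch of one boosting round: pick the weak
# classifier minimising the weighted exponential loss, then re-weight.
import numpy as np

def boosting_round(candidates, X, v, w):
    """candidates: callables f(X) -> predictions; v: labels; w: weights."""
    losses = [np.sum(w * np.exp(-v * f(X))) for f in candidates]
    f_t = candidates[int(np.argmin(losses))]  # argmin of weighted exp-loss
    w = w * np.exp(-v * f_t(X))               # misclassified samples gain weight
    return f_t, w / w.sum()                   # renormalise the distribution
```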
In the transfer learning stage, the input is the model trained on the general database and the output is that model transferred onto the infrared images. To measure the difference between the models, and to carry the general-database model parameters over to the infrared images as far as possible, the KL distance between them is used. The specific optimization formula is:

$$\min\;\mathrm{Loss} + \lambda \cdot \mathrm{KL}(p\,\|\,q), \qquad \mathrm{KL}(p\,\|\,q) = \sum_i p_i \log\frac{p_i}{q_i}$$

Different values of $\lambda$ are tried and the one with the lowest test error rate is kept; in this embodiment $\lambda$ is a value between 0 and 1. Here $p$ is the output probability distribution of the model trained on the general database, $q$ that of the infrared model, and $p_i$ and $q_i$ are the probabilities of the $i$-th instance under the two distributions.
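A minimal sketch of this objective, assuming the two models' outputs are summarized as probability distributions; the function names are illustrative, not from the patent.

```python
# Sketch of the transfer objective: infrared data loss plus a KL term that
# keeps the new model close to the general one; names are illustrative.
import numpy as np

def kl(p, q, eps=1e-12) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def transfer_objective(data_loss: float, p_general, q_infrared, lam: float):
    """Total loss = infrared data loss + lambda * KL(general || infrared)."""
    return data_loss + lam * kl(p_general, q_infrared)

# lambda is chosen by trying several values in (0, 1) and keeping the one
# with the lowest test error rate, as described above.
```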
In the test (detection) phase, the final strong classifier $F_T(x)$ is the combination of the $T$ trained weak classifiers:

$$F_T(x) = \sum_{t=1}^{T} f_t(x)$$
the strong classifier is an infrared intensified Vector Boosting model used in detection; because the model of transfer learning is the same as the model of non-transfer learning, but the parameters are different, the traditional cascade classifier of the waterfall tree is adopted to detect the human face, each frame of image is subjected to pyramid scaling, the human face detection is carried out on different scaling scales, and then the detection result is scaled back to the size of the original image. In different image zooming, in order to accelerate the operation speed, images with different scales can be zoomed at the same time, characteristic values are calculated in parallel, integral images are calculated for detection, and the specific detection process is shown in fig. 10.
Because the Vector Boosting algorithm is used, the detector is more robust, handles faces in different poses, and covers a wider face detection range. Compared with a model trained directly on infrared images alone, using a large number of Internet face images in different poses strengthens robustness; compared with a general face detection model trained only on Internet pictures, adding infrared image information makes the final model more targeted, so the algorithm works better on infrared images than the general model does. Further, unlike the common Haar feature extraction method, the combination of local binary pattern (LBP) and local gradient pattern (LGP) features makes the algorithm very insensitive to illumination changes, improving detection. Once the model parameters of the whole algorithm are obtained, given a new image, its LBP and LGP features are extracted first, then a window slides over all positions on the image, and every sliding window (40 × 40) is evaluated by the waterfall cascade model to judge whether it is a face region.
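The pyramid-plus-sliding-window loop just described can be sketched as below; the scale factor, window stride and nearest-neighbour resize are assumptions for illustration, with `cascade` standing in for the trained waterfall classifier.

```python
# Sketch of cascade detection over an image pyramid; stride, scale step and
# the resize helper are assumptions, cascade() stands in for the classifier.
import numpy as np

def resize(img: np.ndarray, factor: float) -> np.ndarray:
    """Nearest-neighbour resize; stand-in for a real resize routine."""
    h, w = img.shape
    ys = (np.arange(int(h * factor)) / factor).astype(int)
    xs = (np.arange(int(w * factor)) / factor).astype(int)
    return img[np.ix_(ys, xs)]

def detect_faces(img, cascade, levels=5, scale=1.25, win=40, step=4):
    """cascade(patch) -> bool; returns (x, y, size) boxes in original coords."""
    boxes, factor = [], 1.0
    for _ in range(levels):                      # 5-level pyramid, as in the text
        scaled = resize(img, factor)
        h, w = scaled.shape
        for y in range(0, h - win, step):
            for x in range(0, w - win, step):
                if cascade(scaled[y:y + win, x:x + win]):
                    boxes.append((int(x / factor), int(y / factor),
                                  int(win / factor)))
        factor /= scale                          # shrink for the next level
    return boxes
```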
Construction of the call detection algorithm model: the call detection algorithm follows the same overall scheme as the face detection algorithm, but uses Float Boosting as the classifier model; apart from the classifier model, the feature extraction and the sub-classifier design, the whole training process and the detection steps are essentially the same as for face detection.
The algorithm comprises the following specific steps:
1) a feature extraction stage: local binary pattern (LBP) features are extracted from the images of all call and non-call regions (in both the infrared database and the general database);
2) a positive and negative sample construction stage: the images in the general database and the infrared image database are classified into call-region and non-call-region images. Because people hold their phones in many different ways when calling, and the boosting algorithm has inherent limitations, this embodiment found that a single boosting model can hardly cover all the different call gestures, so separate classifiers are trained for the different gestures. Specifically, this embodiment covers four different call gestures, with the region images scaled to the corresponding sizes (22 × 15, 22 × 15, 22 × 22, 15 × 22);
3) an original training stage: a strong classifier is constructed on the general database with the conventional Float Boosting algorithm, using LBP features, finally giving a general Float Boosting model based on the general database;
4) a transfer learning stage: a strong classifier is constructed on the infrared database starting from the general Float Boosting model while taking the general-database model into account, optimizing the corresponding training objective, so that the resulting infrared-enhanced Float Boosting model has the characteristics of both the general model and the infrared image data, overcoming the shortage of infrared training data;
5) a detection stage: call regions are detected on the infrared image with the infrared-enhanced Float Boosting model obtained in the transfer learning stage, based on a cascade model structure. Finally, the potential call rectangles obtained are averaged, and the average position of these regions is taken as the final call position, i.e. the final call region.
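Step 5)'s averaging can be sketched in a few lines; rectangles are assumed to be (x, y, w, h) tuples.

```python
# Averaging the candidate call rectangles (x, y, w, h) into one final region.
def average_region(candidates: list):
    if not candidates:
        return None
    n = len(candidates)
    return tuple(sum(c[i] for c in candidates) / n for i in range(4))

# average_region([(10, 40, 22, 15), (14, 42, 22, 15)]) -> (12.0, 41.0, 22.0, 15.0)
```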
The Float Boosting objective equation used in the original training stage is:

$$\mathrm{Loss} = \sum_i \exp\!\bigl(-y_i\,H_M(x_i)\bigr)$$

$$h_m = \arg\min_{h}\;\mathrm{Loss}\bigl(H_{M-1}(x) + h(x)\bigr) \qquad (10)$$

where $x$ is the input feature vector, $h(x)$ is a weak classifier, $H_M$ denotes the strong classifier combined from $M$ weak classifiers, $h_m$ denotes the $m$-th weak classifier, $y_i$ is the label of the $i$-th instance, $\mathrm{Loss}$ is the loss function of the classifier, and $\exp$ is the exponential function.
The optimization equation used in the transfer learning phase is:

$$\min\;\mathrm{Loss} + \lambda \cdot \mathrm{KL}$$

where $\mathrm{KL}$ denotes the KL distance between the general model and the infrared model, and $\lambda$ is a weight balancing the two losses.
After the algorithm model parameters for infrared images are obtained, given any new face region (assuming the face has already been detected), the algorithm pyramid-scales the face image region (including its surroundings), then slides windows over all positions at each scale, with window sizes 22 × 15, 22 × 15, 22 × 22 and 15 × 22 (the four call gestures, see FIG. 8), and evaluates each sliding window to judge whether it contains a call gesture.
Compared with deep learning algorithms, the Boosting algorithm runs faster but is more sensitive to appearance differences and rotation angles of the detected object. To overcome its smaller detection coverage, four different detectors are designed, one for each call gesture: hand-held phone below the ear (phone against the ear), hand-held phone at the side of the ear (phone beside the ear), speaking or typing with the phone in hand, and voice input with the phone directly in front of the mouth (see FIG. 8 for the four gestures). To detect the different call gestures, the call detection module contains these four sub-detection modules at once. Each detection module consumes little time, and the four do not affect one another, because in actual operation they run in parallel; for each specific region, as soon as any one of the four modules detects a call gesture, detection can exit early.
Because every module must run on the embedded platform, the code is implemented in fixed point, avoiding floating-point operations and greatly increasing the running speed of the whole system. To further shorten processing time on the embedded platform, and to avoid scanning the whole picture while reducing the false-alarm rate, this embodiment divides the face area into its left side, right side, lateral side and lower side, then runs the first classifier on the left side, the second on the right side, the third on the lateral side and the fourth on the lower side; as shown in FIGS. 8 and 10, the cascade classifier of FIG. 10 is trained on the four poses of FIG. 8 to obtain the four classifiers for the four call gestures. After all regions that may contain a call are obtained, they are averaged to give the call region position. This embodiment uses Floating Boost as the classifier and LBP as the feature extraction method; actual on-device tests show that LBP is more stable and more discriminative than DCT and Haar features. Compared with deep learning methods, in the relatively constrained driving environment the algorithm of this embodiment achieves the same detection performance and similar detection coverage with a shorter running time. Moreover, because the call detection algorithm runs only in the area near the face, the false-alarm rate of the whole system drops greatly and processing time is much reduced.
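A sketch of this block scheme follows: the face region is split into four blocks, the matching pose detector runs on each, and detection exits early on the first hit. The exact block geometry is not specified in the text, so the fractions below are illustrative only, and the loop is shown sequentially for simplicity where the text runs the detectors in parallel.

```python
# Illustrative sketch of the four-block scheme; block fractions are assumed.
def detect_call(face_img, detectors: dict):
    """detectors maps pose name -> callable(block) returning a region or None."""
    h, w = face_img.shape
    blocks = {
        "phone_below_ear":    face_img[:, : w // 2],      # left side of face area
        "phone_beside_ear":   face_img[:, w // 2 :],      # right side of face area
        "phone_in_hand":      face_img[h // 2 :, :],      # lateral/lower region
        "phone_before_mouth": face_img[2 * h // 3 :, :],  # around the mouth
    }
    for pose, block in blocks.items():
        region = detectors[pose](block)
        if region is not None:
            return pose, region    # early exit: the other detectors stop here
    return None
```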
Given a frame of image, the working flow of the analysis module is as follows:
(1) acquiring external image information through a camera;
(2) detecting a face through a real-time face detection module;
(3) if the face is detected, selecting the largest face as the face to be analyzed subsequently by comparing the sizes of all the detected face areas;
(4) dividing the obtained maximum face area into four blocks;
(5) operating corresponding calling detection sub-modules in the four blocks respectively;
(6) if one model detects a call, the location is saved and the other detection models are stopped.
The workflow is shown in figure 2.
The comprehensive result analysis and judgement module: the device summarizes the initial analysis results provided by each analysis module, judges from the combined analysis which driving state the driver is in, and gives the driver the corresponding prompt. The prompts currently available include: making/receiving phone calls, looking down at the phone, and holding the phone for intercom. The device processes at least 20 frames of images per second; after every 30 analysed frames it performs one comprehensive judgement, whose main criterion is to alarm on the state with the highest occurrence count in the per-state statistics.
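A minimal sketch of this judgement, assuming a simple majority count (the linear-statistics case) over a 30-frame window:

```python
# Sketch of the comprehensive judgement: after 30 analysed frames, count the
# per-frame states and alarm on the most frequent one.
from collections import Counter

WINDOW = 30  # frames per judgement round, as stated in the text

def integrate(frame_states: list):
    """frame_states: per-frame labels, e.g. 'call', 'look_down', or None."""
    if len(frame_states) < WINDOW:
        return None                              # keep accumulating frames
    counts = Counter(s for s in frame_states[-WINDOW:] if s is not None)
    if not counts:
        return None                              # nothing detected this round
    state, _ = counts.most_common(1)[0]
    return state                                 # alarm on this state
```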
The system operation flow comprises the following steps:
the whole system comprises hardware equipment and a computer vision algorithm, and the operation flow of the whole system is as follows:
(1) the system is powered on for self-checking, if the hardware has no fault, the step (2) is carried out;
(2) calling a face detection module to detect whether the current driver is in a detectable range, and if the face is not detected, prompting the driver to adjust the position of the equipment until the equipment can detect the face;
(3) when the face is detected, the loudspeaker starts to broadcast a prompt that the system starts to work;
(4) continuously acquiring image information from the camera and performing face detection and call detection until the number of analysed frames exceeds a set number;
(5) running a comprehensive result analysis and judgment module, and calling corresponding voice to prompt a driver if the driver calls the phone;
(6) and if a user shutdown signal is received, releasing the memory and exiting the loop.
The workflow is shown in figure 4.
Hereinafter, the details will be described with reference to fig. 2 to 4.
Fig. 2 is a flow chart of the operation of the call detection algorithm module, and fig. 3 is a flow chart of the analysis of the comprehensive result of the call detection.
The call detection algorithm module of fig. 2 works as follows:
(1) acquiring external image information through a camera;
(2) a face is detected by the real-time face detection module; the system trains the whole algorithm with the Vector Boosting algorithm and transfer learning as the classifier framework, extracts image features with local binary patterns (LBP) and local gradient patterns (LGP), uses a 40 × 40 sliding window and a 5-level image pyramid, and, to increase processing speed, can run the face detection classifiers for different poses in parallel at the same time; the LBP and LGP feature extraction schemes are shown in FIGS. 5 and 6;
(3) if a face is detected, the largest face is selected for subsequent analysis by comparing the sizes of all detected face regions; optionally, depending on actual conditions, only the face region within a fixed range of the image is analysed;
(4) the obtained largest face region is divided into blocks and the call detection module is called to detect call gestures. The system uses Float Boost as the classifier model and LBP as the feature extraction method; the sliding-window sizes are set to 22 × 15, 22 × 15, 22 × 22 and 15 × 22 for the different call-gesture classifiers, the image pyramid has 3 levels, and the maximum number of weak classifiers is 200. To speed up processing, the classifiers for the different gestures can run in parallel at the same time, and to reduce the false-detection rate and shorten running time, the four call detection classifiers run only in the area near the face;
(5) if calling is detected, the detection result of the call gesture is stored; meanwhile, to suit different detection precision requirements in different environments, the system also provides an interface for the user to modify the threshold;
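For reference, the concrete parameters quoted in steps (2) and (4) can be gathered into one configuration structure; the structure itself is an illustrative assumption, while the values come from the text.

```python
# Parameters quoted above, collected for reference; the dataclass layout is
# an assumption, the values come from the text.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class DetectorConfig:
    window_sizes: Tuple[Tuple[int, int], ...]  # sliding-window sizes (w, h)
    pyramid_levels: int
    max_weak_classifiers: Optional[int]

FACE_DETECTOR = DetectorConfig(((40, 40),), 5, None)  # max not stated in text
CALL_DETECTOR = DetectorConfig(((22, 15), (22, 15), (22, 22), (15, 22)), 3, 200)
```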
the work flow of the integrated result analysis module for call detection in fig. 3 is as follows:
(1) initializing various data structures and initializing arrays for storing information;
(2) acquiring current image information from an external camera;
(3) calling an analysis module to analyze the current image and obtain an analysis result;
(4) if the analyzed frame number has reached a certain frame number, entering the step (5); otherwise, the step (2) is carried out, the current image is continuously analyzed, and the number of frames in the system is 30;
(5) the system performs linear statistics on the counted call frequency; optionally, other statistical methods can be used, such as nonlinear statistical fitting or a weighted average;
(6) and (4) resetting the currently saved data to be empty, and entering the step (2) to start the next round of analysis.
Fig. 4 is a flowchart of the whole system work, and the specific work flow is as follows:
(1) the system powers on and self-checks; if the hardware has no fault, go to step (2); in this system power is supplied through a micro-USB port, and optionally USB interfaces of other specifications can also be used;
(2) the face detection module is called to check whether the current driver is within detectable range; if no face is detected, the driver is prompted to adjust the position of the device until it can detect the face; optionally, other face detection algorithms can be used to confirm that the device can detect the face;
(3) when the face is detected, the loudspeaker starts to broadcast a prompt that the system starts to work;
(4) continuously acquiring image information from a camera and carrying out face detection;
(5) continuously acquiring image information from a camera, calling a calling detection algorithm module to analyze the image, and calling a comprehensive result analysis module to comprehensively analyze the calling state of a driver;
(6) and if the user shutdown signal is received, releasing the memory, closing the Bluetooth and exiting the cycle.
In summary, the present invention is an intelligent hardware device capable of automatically detecting when a driver makes a call. An infrared image is obtained by an infrared camera, and a real-time call detection algorithm is run on a digital signal processor: face detection technology first determines the moving range of the driver's face, and the call detection algorithm is then run near the face frame to identify the driver's call-making behavior. Compared with other similar products, the call detection algorithm is faster, demands less of the hardware, and is more extensible, with a lower false alarm rate and no loss of accuracy or robustness. The whole device is small, with low and controllable hardware cost, and is easy to popularize widely as a consumer electronics product.
Other embodiments of the present technology will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the technology that follow its general principles, including such departures from the present disclosure as come within known or customary practice in the art, applied to the essential features set forth above. The specification and examples are to be considered exemplary only, with the true scope of the invention determined by the scope of protection of the present application.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.

Claims (10)

1. A driver call detection method based on computer vision technology is characterized by comprising the following steps:
acquiring external image information through a camera, and performing face detection in real time;
when a face is detected, selecting an image corresponding to the largest face area as a face image for subsequent analysis by comparing the sizes of all detected face areas;
dividing the face image corresponding to the maximum face area into blocks, and invoking the calling detection module to detect a calling gesture;
the calling detection module comprises a plurality of calling detection submodules; if any one of the calling detection submodules detects a calling behavior, the corresponding calling position is saved and the remaining calling detection submodules stop operating;
wherein the algorithm for detecting the call-making behavior in the plurality of call-making detection sub-modules is established by the following modeling training steps:
a feature extraction stage: extracting image features of the calling areas and non-calling areas of the images in the general database and the infrared image database using local binary pattern (LBP) features;
a positive and negative sample construction stage: classifying the images in the general database and the infrared image database to obtain calling-area images and non-calling-area images, and scaling the area images to different sizes corresponding to the different calling postures;
an original training stage: constructing a strong classifier for the images in the general database using the traditional Float Boosting algorithm, the classifier using LBP features, to obtain a general Float Boosting model;
a transfer learning stage: constructing a strong classifier for the images in the infrared image database starting from the general Float Boosting model, and optimizing a specific training objective equation that takes into account the model obtained on the general database, so that the resulting model has the characteristics of both the general model and the infrared image data;
a detection stage: detecting calling areas on the infrared image with the infrared-enhanced Float Boosting model obtained in the transfer learning stage, based on a cascade model structure; averaging the obtained multiple potential calling areas and taking the average position as the calling position.
2. The computer vision technology-based driver call detection method as claimed in claim 1, wherein the algorithm step of detecting the face comprises:
a feature extraction stage: extracting image features of the human-face and non-human-face images in the general database and the infrared image database using the local binary pattern (LBP) and the local gradient pattern (LGP);
a positive and negative sample construction stage: classifying images in the general database and the infrared image database to obtain face images and non-face images, scaling them to 40 x 40 pixels, and dividing the faces into different subsets according to their postures;
an original training stage: constructing a cascade classifier for the images in the general database using the traditional Vector Boosting algorithm, the features used by the classifier being a combination of LBP and LGP features, to obtain a general Vector Boosting model;
a transfer learning stage: constructing a cascade classifier for the images in the infrared image database starting from the general Vector Boosting model, and optimizing a specific training objective equation that takes into account the model obtained on the general database, so that the resulting model has the characteristics of both the general model and the infrared image data;
a detection stage: detecting the human face region on the infrared image with the infrared-enhanced Vector Boosting model obtained in the transfer learning stage, based on a vector tree model structure.
3. The computer vision technology-based driver call detection method according to claim 1, wherein, in the algorithm step of detecting calling behavior, the Float Boosting objective equation used in the original training phase is:
$$Loss(H_M) = \sum_i \exp\left(-y_i H_M(x_i)\right)$$

$$h_m = \arg\min_h Loss\left(H_{M-1}(x) + h(x)\right)$$

where $x$ is the input feature vector, $h(x)$ is a weak classifier, $H_M$ is the strong classifier combined from $M$ weak classifiers, and $h_m$ is the $m$-th weak classifier; $y_i$ is the label of the $i$-th instance, $Loss$ denotes the loss function of a classifier, and $\exp$ denotes the exponential function;
the optimization equation used in the transfer learning phase is:

$$\min_{H}\; Loss(H) + \lambda\, KL\left(H^{gen}, H\right)$$

wherein $KL$ represents the KL distance between the general model $H^{gen}$ and the infrared-enhanced model, and $\lambda$ is a weight balancing the two loss terms.
4. The computer vision technology-based driver call detection method as recited in claim 1, further comprising:
and a result analysis and judgment step: if the number of analyzed frames reaches a certain count, the analysis results of the current frames are aggregated to judge the duration of the calling gesture.
5. The computer vision technology-based driver call detection method as recited in claim 4,
the certain frame count is 30, and the statistical methods applied to the detected calling frequency comprise a linear statistical method, a nonlinear statistical fitting method and a weighted average method.
6. The computer vision technology-based driver call detection method as claimed in claim 2, wherein, in the step of detecting the human face, the local binary pattern (LBP) features are calculated as follows:
$$LBP(x_c, y_c) = \sum_{p=0}^{P-1} s\left(i_p - i_c\right) 2^p, \qquad s(v) = \begin{cases} 1, & v \ge 0 \\ 0, & v < 0 \end{cases}$$

wherein $(x_c, y_c)$ is the position of the pixel center point, $(i_p - i_c)$ is the difference between the pixel values of the center point $i_c$ and the neighbor point $i_p$, and $P$ is the number of pixels surrounding each pixel; the local gradient pattern (LGP) features are calculated as follows:

$$LGP(x_c, y_c) = \sum_{n=0}^{P-1} s\left(g_n - \bar{g}\right) 2^n$$

wherein $(x_c, y_c)$ is the position of the pixel center point, the difference between the center point $i_c$ and the neighbor point $i_n$ is $g_n = |i_n - i_c|$, the average of these differences is $\bar{g} = \frac{1}{P}\sum_{n=0}^{P-1} g_n$, and $P$ is the number of pixels surrounding each pixel.
7. The computer vision technology-based driver call detection method as claimed in claim 2, wherein the algorithm step of detecting the face comprises:
in the original training stage, the original Vector Boosting training model is used; at each iteration a subset of feature values is selected from the high-dimensional LGP and LBP feature values, each weak classifier is given a certain weight, and the sample weights are redistributed according to the result of the current classifier, misclassified samples receiving larger weights and correctly classified samples smaller weights; the weak classifier is selected by the following formula:
$$f_t(x) = \arg\min_{f} \sum_i w_i^{(t)} \exp\left(-v_i \cdot f(x_i)\right)$$

wherein $f_t(x)$ is the selected weak classifier, $\exp$ is the exponential function, $f(x_i)$ is a candidate weak classifier, $v_i$ is the current class label, and $w_i^{(t)}$ is the weight of sample $i$ at the $t$-th iteration;
in the transfer learning stage, the input is the general Vector Boosting model and the output is the infrared-enhanced Vector Boosting model; the KL distance is used to measure the difference between the general model and the infrared-enhanced model, with the specific optimization formula:
$$\min_{F}\; Loss(F) + \lambda\, KL\left(F^{gen}, F\right), \qquad KL(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$$

different values of $\lambda$ are tried, and the $\lambda$ with the lowest test error rate is finally selected; $F^{gen}$ is the general Vector Boosting model, $p$ and $q$ are two probability distributions, and $p_i$ and $q_i$ are the probabilities of the $i$-th instance under the two distributions, respectively;
in the detection phase, the final strong classifier $F_T(x)$ is the combination of the $T$ selected weak classifiers:

$$F_T(x) = \sum_{t=1}^{T} f_t(x)$$
8. A computer vision technology-based driver call detection system, comprising:
the processor comprises a main control unit, an arithmetic unit, a memory unit and a system bus; the main control unit handles the logic decisions during system operation and also controls the connection and activation of the hardware modules;
the arithmetic unit is used for reading and processing the data in the memory unit according to the command of the main control unit and outputting a processing result to the main control unit;
the memory unit provides memory support for the operation unit and the main control unit;
the camera module is connected with the processor and used for acquiring image information of an automobile driver seat and sending the image information to the memory unit through the system bus;
the storage module is used for storing an algorithm model file, parameters and a user configuration file, the storage module is connected with the processor, and the processor can call and modify data stored in the storage module;
wherein the storage module further comprises:
the face detection module is used for detecting the face area of the driver in real time, selecting the largest face as a face for subsequent analysis by comparing the sizes of all the detected face areas, and carrying out block division on the obtained largest face area to obtain a face mouth block;
the calling detection module, which comprises a plurality of calling detection sub-modules that respectively examine the mouth blocks to find calling behaviors; if any one of the calling detection sub-modules detects a calling behavior, the corresponding calling position is saved and the other calling detection sub-modules stop operating;
and the result analysis and judgment module is used for counting the analysis result of the current frame to judge the duration time of the call gesture when the analyzed frame number reaches a certain frame number.
9. The computer vision technology-based driver call detection system as recited in claim 8,
the camera module adopts an infrared camera and is additionally provided with an infrared light supplement lamp and an infrared optical filter, wherein the infrared light supplement lamp is a narrow-spectrum infrared lamp invisible to the naked eye, and the infrared filter is positioned in front of the infrared light supplement lamp and the infrared camera;
the printed circuit board on which the infrared light supplement lamp is mounted has large-area windows opened in the ink layer to expose the copper layer for rapid heat dissipation.
10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the driver call detection method according to any one of claims 1 to 7.
CN201911013602.XA 2019-10-23 2019-10-23 Driver call detection method and system based on computer vision technology Pending CN110633701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911013602.XA CN110633701A (en) 2019-10-23 2019-10-23 Driver call detection method and system based on computer vision technology

Publications (1)

Publication Number Publication Date
CN110633701A true CN110633701A (en) 2019-12-31

Family

ID=68977489

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268584A (en) * 2014-09-16 2015-01-07 南京邮电大学 Human face detection method based on hierarchical filtration
CN105354986A (en) * 2015-11-12 2016-02-24 熊强 Driving state monitoring system and method for automobile driver
CN105913022A (en) * 2016-04-11 2016-08-31 深圳市飞瑞斯科技有限公司 Handheld calling state determining method and handheld calling state determining system based on video analysis

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222477A (en) * 2020-01-10 2020-06-02 厦门瑞为信息技术有限公司 Vision-based method and device for detecting two hands leaving steering wheel
CN111222477B (en) * 2020-01-10 2023-05-30 厦门瑞为信息技术有限公司 Vision-based method and device for detecting departure of hands from steering wheel
CN111666915A (en) * 2020-06-18 2020-09-15 上海眼控科技股份有限公司 Monitoring method, device, equipment and storage medium
CN111986668A (en) * 2020-08-20 2020-11-24 深圳市一本电子有限公司 AI voice intelligent control Internet of things method using vehicle-mounted charger
CN111986668B (en) * 2020-08-20 2021-05-11 深圳市一本电子有限公司 AI voice intelligent control Internet of things method using vehicle-mounted charger
CN112487990A (en) * 2020-12-02 2021-03-12 重庆邮电大学 DSP-based driver call-making behavior detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination