CN111611971B - Behavior detection method and system based on convolutional neural network - Google Patents


Publication number
CN111611971B
CN111611971B (application CN202010485168.1A)
Authority
CN
China
Prior art keywords
image
key points
predicted
mouth
hand
Prior art date
Legal status
Active
Application number
CN202010485168.1A
Other languages
Chinese (zh)
Other versions
CN111611971A (en
Inventor
郁强
李圣权
李开民
曹喆
金仁杰
Current Assignee
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202010485168.1A
Publication of CN111611971A
Application granted
Publication of CN111611971B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language

Abstract

The invention provides a behavior detection method and system based on a convolutional neural network. The method uses a convolutional neural network to detect specific human behaviors in video or dynamic images, including but not limited to eating, drinking, and smoking. It replaces manual supervision and offers a high detection rate and accurate detection precision.

Description

Behavior detection method and system based on convolutional neural network
Technical Field
The invention relates to the field of video processing, in particular to a behavior detection method and system based on a convolutional neural network.
Background
Deep learning is a new field of machine learning research. Its motivation is to build neural networks that simulate the analytical learning of the human brain, and to use those networks in place of human activity to analyze and process data. Deep learning is widely applied in visual image processing, where it extracts useful information from images more accurately. However, most deep learning work focuses on the analysis of static images; research applying deep learning to dynamic images or video is comparatively scarce.
In real life, however, the analysis of dynamic images or video is of great research significance, particularly for detecting specific behaviors such as eating, smoking, and drinking, all of which require analyzing dynamic image or video information.
At present, some public places, such as subways, buses, movie theaters, and museums, post clear bans on eating and smoking, yet actual compliance is poor, and management units lack effective means of supervising violators. The main reasons are that these behaviors are transient, lasting minutes or even seconds, and that public places have heavy foot traffic and large areas. Even with dedicated monitoring staff, comprehensive supervision of violators is difficult to guarantee, because human attention is limited; such dedicated human monitoring is time-consuming, labor-intensive, and ineffective.
Disclosure of Invention
The invention aims to provide a behavior detection method and system based on a convolutional neural network. The method uses a convolutional neural network to detect specific human behaviors in video or dynamic images, including but not limited to eating, drinking, and smoking. It replaces manual supervision and offers a high detection rate and accurate detection precision.
The technical scheme provides a behavior detection method based on a convolutional neural network, which comprises the following steps:
acquiring image data, wherein the image data comprises at least a first image and a second image of the same detection object, the second image being acquired a fixed time period after the first image;
inputting the first image and the second image into a neural network model, and obtaining confidence maps and affinity vector maps of the predicted hand keypoints and predicted mouth keypoints in the first image and the second image, wherein the confidence maps represent the accuracy of the predicted hand and mouth keypoints, and the affinity vector maps represent the degree of association between the predicted hand keypoints and the predicted mouth keypoints;
parsing the confidence maps and affinity vector maps of the predicted hand and mouth keypoints with a greedy algorithm, and outputting the coordinate values of the predicted hand keypoints and predicted mouth keypoints;
and obtaining the mouth-hand distance of the detection object in the first image and the second image from the coordinate values of the predicted hand keypoints and predicted mouth keypoints.
In other embodiments, the method further comprises:
the image data comprises at least three consecutive images of the same detection object, each pair of consecutive images acquired a fixed time interval apart. The consecutive images are input into the neural network model to obtain the confidence maps and affinity vector values of the predicted hand and mouth keypoints of each image; the mouth-hand distance of the detection object in each image is calculated from them, and if the mouth-hand distances of adjacent images alternately increase and decrease, the detection object is judged to be smoking.
Further, edible products comprise food and tobacco, so the behavior detection method can be used to detect both eating and smoking behaviors.
The technical scheme provides a behavior detection system based on a convolutional neural network, which comprises:
an image acquisition unit configured to acquire image data comprising at least two images of the same detection object; a confidence unit for obtaining the confidence maps of the predicted hand and mouth keypoints of the images; an affinity unit for obtaining the affinity vector maps of the predicted hand and mouth keypoints in the images; an analysis unit for parsing the confidence maps and affinity vector maps of the predicted hand and mouth keypoints with a greedy algorithm and outputting the coordinate values of the predicted hand and mouth keypoints; and a calculating unit for obtaining the mouth-hand distance of the detection object in the images from the coordinate values of the predicted hand and mouth keypoints.
The present solution provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the above mentioned convolutional neural network based behavior detection method when executing the program.
The present solution provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the above mentioned convolutional neural network based behavior detection method.
Drawings
Fig. 1 is a schematic structural diagram of an edible product detection model.
Fig. 2 is a schematic structural diagram of a hand mouth keypoint detection model according to the present embodiment.
Fig. 3 is a flow chart of a behavior detection method based on convolutional neural network according to the present embodiment.
Fig. 4 is a schematic diagram of a convolutional neural network-based behavior detection system of the present scheme.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the invention, fall within the scope of protection of the invention.
It should be appreciated that embodiments of the invention may be implemented or realized by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer readable storage medium configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, in accordance with the methods and drawings described in the specific embodiments. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Furthermore, the operations of the processes described herein may be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes (or variations and/or combinations thereof) described herein may be performed under control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications), by hardware, or combinations thereof, collectively executing on one or more processors. The computer program includes a plurality of instructions executable by one or more processors.
A computer program can be applied to the input data to perform the functions herein to convert the input data to generate output data that is stored to the non-volatile memory. The output information may also be applied to one or more output devices such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including specific visual depictions of physical and tangible objects produced on a display.
Specifically, this scheme provides a behavior detection method and system based on a convolutional neural network. The method can detect dynamic behaviors of people in video or dynamic images, such as eating, drinking, and smoking, and is particularly applicable to monitoring and management in public places.
It should be noted that this scheme determines eating or smoking behavior from the change in the distance between the mouth and the hand: the position of the head (and hence the mouth) usually stays still while a person eats or smokes, so the mouth-hand distance directly reflects the dynamic motion of the hand.
In this scheme, whether the detected object is consuming an edible product is judged from its mouth-hand distance across multiple images. Edible products comprise food and tobacco: food covers the various finished products and raw materials that people eat or drink, and tobacco covers traditional cigarettes as well as electronic cigarettes, pipes, and the like. When the product is food, the person is eating; when it is a drink, the person is drinking; when it is tobacco, the person is smoking. The behavior detection method based on a convolutional neural network provided by this scheme can detect any of the eating, drinking, or smoking behaviors mentioned above.
Specifically, this scheme provides a behavior detection method based on a convolutional neural network, used to detect whether a detected object exhibits a specific behavior, such as eating food or smoking tobacco, comprising the following steps:
acquiring image data, wherein the image data comprises at least a first image and a second image of the same detection object, the second image being acquired a fixed time period after the first image;
inputting the first image and the second image into a neural network model, and obtaining confidence maps and affinity vector maps of the predicted hand keypoints and predicted mouth keypoints in the first image and the second image, wherein the confidence maps represent the accuracy of the predicted hand and mouth keypoints, and the affinity vector maps represent the degree of association between the predicted hand keypoints and the predicted mouth keypoints;
parsing the confidence maps and affinity vector maps of the predicted hand and mouth keypoints with a greedy algorithm, and outputting the coordinate values of the predicted hand keypoints and predicted mouth keypoints;
and obtaining the mouth-hand distance of the detection object in the first image and the second image from the coordinate values of the predicted hand keypoints and predicted mouth keypoints.
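The last step reduces to a Euclidean distance between two predicted coordinates. A minimal sketch (the function name and the (x, y) tuple format are illustrative assumptions, not part of the patent):

```python
import math

def mouth_hand_distance(mouth_xy, hand_xy):
    """Euclidean distance between a predicted mouth keypoint and a
    predicted hand keypoint, each given as an (x, y) coordinate pair."""
    dx = mouth_xy[0] - hand_xy[0]
    dy = mouth_xy[1] - hand_xy[1]
    return math.hypot(dx, dy)
```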
The image data is continuous image data acquired of the corresponding detection object. In some embodiments, it may be a set of consecutive video frames from surveillance video, a set of photographs captured continuously over a set time period, or a continuously recorded set of dynamic images.
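Picking images a fixed time period apart from a video stream can be sketched as selecting frame indices from the frame rate (a hypothetical helper for illustration, not from the patent):

```python
def sample_frame_indices(total_frames, fps, interval_seconds):
    """Return indices of frames spaced `interval_seconds` apart, e.g. to
    form the first, second, ... images of one detection object."""
    step = max(1, int(round(fps * interval_seconds)))
    return list(range(0, total_frames, step))
```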
Before the confidence maps and affinity vector maps of the predicted hand and mouth keypoints in the first and second images are obtained, the first and second images are passed through a convolution module to obtain their corresponding feature maps.
In addition, in this scheme, an edible product detection model can be used to detect whether an edible product is present in an acquired image, and, combined with the coordinate values of the hand keypoints, whether the detected object is holding an edible product in its hand.
Of course, the step of detecting whether the hand holds an edible product may be performed before or after the mouth-hand distance is obtained, and if continuous image data in which the detected object's hand holds an edible product is chosen by manual selection, the deep learning model is not needed to detect the product.
The edible product detection model detects the type and coordinate values of edible products in the image data, while the neural network model obtains the coordinate values of the predicted hand keypoints. Whether the detected object's hand holds an edible product is judged from the coordinate information of the product and of the predicted hand keypoints: if the two overlap or lie close together, in other words if the coordinate range of the edible product intersects the coordinates of the predicted hand keypoints, the hand is judged to be holding the product.
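The intersection test just described can be sketched as a point-in-box check between a predicted hand keypoint and the product's bounding box, in the (x, y, w, h) format the detection model outputs (the `margin` parameter is an assumption used to express "close to each other"):

```python
def hand_holds_item(hand_xy, bbox, margin=0.0):
    """True if the hand keypoint lies inside, or within `margin` pixels
    of, a bounding box (x, y, w, h) with (x, y) its top-left vertex."""
    x, y, w, h = bbox
    hx, hy = hand_xy
    return (x - margin <= hx <= x + w + margin
            and y - margin <= hy <= y + h + margin)
```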
To judge whether a person is eating, only two images taken at an interval are needed: if the hands of the detected object hold food in both the first and second images, and the absolute difference between the mouth-hand distances of the two images exceeds the set first threshold, the detected object is judged to be eating. Note that the mouth-hand distance of the second image may be larger than that of the first, in which case the person has taken the food away from the mouth after eating; it may also be smaller, in which case the person is bringing food to the mouth to complete the eating action.
It is worth mentioning that the acquisition interval between the first and second images should not exceed 30 seconds, and the first threshold should be set to no more than 0.5 meters.
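The two-image eating criterion then amounts to a single threshold comparison. A sketch under the stated assumptions (distances in meters; the 0.5 default follows the suggested bound above):

```python
def is_eating(dist_first, dist_second, threshold=0.5):
    """Judge eating from two images in which the hand holds food: the
    absolute change in mouth-hand distance must exceed the first
    threshold, whichever direction the hand moved."""
    return abs(dist_first - dist_second) > threshold
```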
For example, consider user A eating bread:
following the behavior detection method based on a convolutional neural network, a first and a second image containing user A are acquired, and user A's mouth-hand distance in each is obtained. If the difference between the mouth-hand distances of the first and second images exceeds the set first threshold, it is concluded that user A moved food to the mouth in the interval between the two images, i.e. user A ate the bread. (In this scenario, the bread in user A's hand is of course detected by the edible product detection model.)
Smoking behavior is similar to eating but not identical: when smoking, the hand repeatedly returns to the mouth. At least three images of the detected object must therefore be acquired at set time intervals, the mouth-hand distance in each image obtained, and the judgment made on the criterion that the mouth-hand distance alternately increases and decreases.
Correspondingly, this scheme provides a behavior detection method based on a convolutional neural network for detecting smoking behavior, comprising the following steps:
acquiring image data comprising at least three consecutive images of the same detection object, each pair of consecutive images acquired a fixed time interval apart; inputting the consecutive images into the neural network model to obtain the confidence maps and affinity vector values of the hand and mouth keypoints of each image, and calculating the mouth-hand distance of the detection object in each image from the confidence maps and affinity vector maps with a greedy algorithm. If the absolute differences between the mouth-hand distances across the consecutive images alternate (the distance rises and falls in turn), the detection object is judged to be smoking.
Of course, before the confidence maps and affinity vector maps of the predicted hand and mouth keypoints are obtained, each image is passed through the convolution module to obtain its corresponding feature map, the same step as described above.
Taking four images as an example, the behavior detection method based on the convolutional neural network provided by the scheme comprises the following steps:
acquiring image data, wherein the image data comprises at least a first, second, third, and fourth image of the same detection object, the second image acquired a fixed time period after the first, the third a fixed time period after the second, and the fourth a fixed time period after the third;
inputting the first, second, third, and fourth images into the neural network model, and obtaining the confidence maps and affinity vector maps of the hand and mouth keypoints in each image;
and obtaining, with a greedy algorithm, the mouth-hand distance of the detection object in the first, second, third, and fourth images from the confidence maps and affinity vector maps.
If the mouth-hand distance across the first, second, third, and fourth images alternately increases and decreases (for example, the distance in the second image is smaller than in the first, larger in the third than in the second, and smaller in the fourth than in the third; or the distance in the second image is larger than in the first, smaller in the third than in the second, and larger in the fourth than in the third), and the absolute difference between the mouth-hand distances of adjacent images exceeds the set threshold (the thresholds for different adjacent pairs may of course differ), the detection object is judged to be smoking.
It should be noted that the first, second, and third thresholds may or may not be identical, and the acquisition intervals of the consecutive images need not be identical either. Preferably, each interval is kept to no more than 10 seconds, and the threshold on the absolute difference between the mouth-hand distances of consecutive images is set to no more than 0.3 meters.
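The alternating-distance criterion for three or more images can be sketched as follows. This is a sketch under stated assumptions: distances are in meters, the 0.3 default follows the suggested bound, and a single threshold is applied to all adjacent pairs even though the text allows the thresholds to differ:

```python
def is_smoking(distances, threshold=0.3):
    """Judge smoking from the mouth-hand distances of consecutive
    images: adjacent differences must all exceed the threshold in
    absolute value and must alternate in sign (rise, fall, rise, ...)."""
    if len(distances) < 3:
        return False
    diffs = [b - a for a, b in zip(distances, distances[1:])]
    if any(abs(d) <= threshold for d in diffs):
        return False
    # consecutive differences must have opposite signs
    return all(d1 * d2 < 0 for d1, d2 in zip(diffs, diffs[1:]))
```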
The following illustrates user A smoking:
a first, second, third, and fourth image of user A are acquired, and the mouth-hand distance in each is obtained. If the absolute difference between the mouth-hand distances of the first and second images exceeds the set first threshold, that between the second and third images exceeds the set second threshold, and that between the third and fourth images exceeds the set third threshold, it is recognized that user A brought a cigarette to the mouth in the interval between the first and second images, took it away from the mouth between the second and third images, and brought it close to the mouth again between the third and fourth images, so user A is judged to be smoking.
The construction and training process of the neural network model adopted by the scheme is as follows:
Pedestrian hand and mouth keypoint detection data preparation: label the hand and mouth keypoints of pedestrians in the collected image data, along with the affinity vectors of the hand keypoints and of the mouth keypoints.
Pedestrian hand and mouth keypoint detection network structure design: the backbone consists of convolutional neural modules. Labeled image data is taken as input, and a feature map F is obtained through convolution module A. The network then splits into two branches: branch 1 predicts the confidence of the hand and mouth keypoints, and branch 2 predicts their affinity vectors. Each branch is an iterative prediction architecture; branch 1 and branch 2 together form one stage, and each stage k generates a set of detection confidence maps S^k and a set of affinity vector fields L^k. The first stage operates on the feature map alone:

S^1 = ρ^1(F),  L^1 = φ^1(F)

where ρ^1 and φ^1 are the first-stage networks of the two branches. Each subsequent stage takes as input the previous stage's predictions together with the feature map F obtained from convolution module A; with ρ^k and φ^k denoting the convolutional neural module structure of stage k, its output is:

S^k = ρ^k(F, S^(k-1), L^(k-1)),  L^k = φ^k(F, S^(k-1), L^(k-1)),  k ≥ 2
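The stage chaining above can be sketched with hypothetical callables standing in for the two convolutional branches ρ^k and φ^k (the tensor shapes and the channel-wise concatenation axis are assumptions for illustration, not details given in the patent):

```python
import numpy as np

def run_stages(F, stages):
    """Iterative two-branch prediction: the first stage sees only the
    feature map F; every later stage sees F concatenated channel-wise
    with the previous stage's confidence maps S and affinity fields L."""
    rho_1, phi_1 = stages[0]
    S, L = rho_1(F), phi_1(F)
    for rho_k, phi_k in stages[1:]:
        x = np.concatenate([F, S, L], axis=0)  # fuse with the feature map
        S, L = rho_k(x), phi_k(x)
    return S, L
```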
the confidence maps of the hand and mouth keypoints are then parsed by greedy inference, and the association between hand and mouth is learned through the non-parametric representation Part Affinity Fields (PAFs);
Pedestrian hand and mouth keypoint detection model training: initialize the network parameters and set the maximum number of iterations M; input the prepared training image dataset into the network and train. If the loss value keeps decreasing, continue training until M iterations are reached to obtain the final model; if the loss value plateaus earlier, stop iterating and take that model as final.
The loss function is:

f = Σ_k (f_S^k + f_L^k)

with two loss terms for each stage k, the squared L2 distances between the predictions and the manual labels:

f_S^k = Σ_(j=1..m) Σ_p || S_j^k(p) - S_j*(p) ||^2
f_L^k = Σ_(c=1..n) Σ_p || L_c^k(p) - L_c*(p) ||^2

where S* denotes the manually labeled confidence maps of the hand and mouth keypoints, L* the manually labeled affinity vectors of the hand and mouth keypoints, m the number of hand and mouth keypoints, and n the number of limbs (here, hand and mouth connections; one limb corresponds to two keypoints).
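The per-stage losses can be sketched directly from the formulas. This is a sketch, with arrays standing in for maps over image positions p and all names illustrative:

```python
import numpy as np

def stage_loss(S_pred, S_star, L_pred, L_star):
    """f_S^k + f_L^k for one stage: squared L2 distances between the
    predicted and manually labeled confidence maps and affinity fields."""
    f_S = float(np.sum((S_pred - S_star) ** 2))
    f_L = float(np.sum((L_pred - L_star) ** 2))
    return f_S + f_L

def total_loss(stage_outputs):
    """f = sum over all stages k of (f_S^k + f_L^k)."""
    return sum(stage_loss(*out) for out in stage_outputs)
```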
The edible product detection model is constructed and trained as follows,
wherein edible products comprise food or tobacco.
Data preparation: annotate the labeled images, where the annotation information is the bounding box of the food or tobacco plus a category label, i.e. (c_j, x_j, y_j, w_j, h_j), where c_j represents the category of the bounding box (different edible products correspond to different c_j values), x_j, y_j represent the coordinates of the top-left vertex of the bounding box, and w_j, h_j represent its width and height. The annotated data samples are divided into a training set, a validation set, and a test set in the ratio 8:1:1;
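The 8:1:1 split can be sketched as follows (function name and seed are illustrative assumptions):

```python
import random

def split_8_1_1(samples, seed=0):
    """Shuffle annotated samples and divide them into training,
    validation, and test sets in the ratio 8:1:1."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```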
Network structure design: the algorithm adopts a convolutional neural network with a multi-scale structure. The backbone is built from residual modules, with network feature channels split and shuffled. On top of the backbone, a top-down feature pyramid structure adds upsampling operations and fuses deep features with shallow feature information at several levels, yielding better features; candidate boxes of different sizes are screened and the best results are finally retained;
the network employs a swish activation function,
Figure GDA0004218860340000113
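As a quick check of the activation, swish is simply the input scaled by its sigmoid:

```python
import math

def swish(x):
    """Swish activation: f(x) = x * sigmoid(x) = x / (1 + e^(-x))."""
    return x / (1.0 + math.exp(-x))
```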
training: set the input image size to 416 x 416 and the mini-batch size to 64, set the learning rate to 10^-3, and optimize with the Adam gradient descent strategy;
model test: test data is input, and bounding box information (c, x, y, w, h) is output.
If, in the edible product detection model, the category label c_j of the various foods is 1 and that of cigarettes and other tobacco is 2, then the output bounding-box information tells whether an edible product is present, whether it is food or tobacco, and where it is located.
In addition, in this scheme, behavior management can be performed on top of the behavior detection method based on a convolutional neural network: the detection frame of a detected object consuming an edible product is fed into a pedestrian recognition detection model, the identity information of the detected object is obtained, and the event is recorded in a task library.
The scheme further provides a behavior detection system based on the convolutional neural network, which comprises:
an image acquisition unit configured to acquire image data comprising at least two images of the same detection object; a confidence unit for obtaining the confidence maps of the predicted hand and mouth keypoints of the images; an affinity unit for obtaining the affinity vector maps of the predicted hand and mouth keypoints in the images; an analysis unit for parsing the confidence maps and affinity vector maps of the predicted hand and mouth keypoints with a greedy algorithm and outputting the coordinate values of the predicted hand and mouth keypoints; and a calculating unit for obtaining the mouth-hand distance of the detection object in the images from the coordinate values of the predicted hand and mouth keypoints.
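The unit layout above can be sketched as a small composition of callables. This is a sketch with assumed names and signatures; the keypoint prediction and parsing units are folded into a single locator for brevity:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]

@dataclass
class BehaviorDetectionSystem:
    """Wires an image acquisition unit to a keypoint locator and
    computes the mouth-hand distance per image, as the calculating
    unit does."""
    acquire_images: Callable[[], List[object]]
    locate_keypoints: Callable[[object], Tuple[Point, Point]]  # (mouth, hand)

    def mouth_hand_distances(self) -> List[float]:
        distances = []
        for image in self.acquire_images():
            (mx, my), (hx, hy) = self.locate_keypoints(image)
            distances.append(((mx - hx) ** 2 + (my - hy) ** 2) ** 0.5)
        return distances
```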
Of course, the convolutional neural network based behavior detection system provided by this scheme further comprises a judging unit, which is used for comparing the mouth-hand distance of the detection object with the set threshold value. Details of how the judgment is made and how the data are used are described in the convolutional neural network based behavior detection method above and are not repeated here.
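The judging unit's two rules, as described in the method, can be sketched as follows: eating is judged when the absolute difference between the mouth-hand distances of two images exceeds a first threshold, and smoking is judged when the mouth-hand distance alternates (decreases, then increases, then decreases, and so on) across at least three consecutive images. The function names and the exact alternation test are illustrative assumptions.

```python
from typing import Sequence

def is_eating(d1: float, d2: float, threshold: float) -> bool:
    """Eating: |d1 - d2| between two images exceeds the first threshold."""
    return abs(d1 - d2) > threshold

def is_smoking(distances: Sequence[float]) -> bool:
    """Smoking: mouth-hand distance alternates across >= 3 consecutive images."""
    if len(distances) < 3:
        return False
    diffs = [b - a for a, b in zip(distances, distances[1:])]
    # Alternation: consecutive differences keep flipping sign.
    return all(a * b < 0 for a, b in zip(diffs, diffs[1:]))
```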
In addition, the convolutional neural network based behavior detection system comprises an edible product detection unit, which runs an edible product detection model for detecting whether an edible product exists in the image data and its coordinate values; the judging unit then further judges, based on the coordinate values of the edible product and the coordinate values of the predicted hand key points, whether the hand of the detection object is holding an edible product. In this behavior detection system, edible products include food and tobacco.
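The holding check described above (and spelled out in claim 4 as an intersection between the product's bounding box and the hand key point coordinates) can be sketched as a point-in-box test, assuming the box is given as a centre (x, y) with width w and height h. The helper name is an illustrative assumption.

```python
def hand_holds_product(hand_xy, box):
    """True if the hand key point lies inside the product's bounding box.

    box = (x, y, w, h), with (x, y) the box centre, w/h its width and height.
    """
    x, y, w, h = box
    hx, hy = hand_xy
    return (x - w / 2 <= hx <= x + w / 2) and (y - h / 2 <= hy <= y + h / 2)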
The training and construction processes of the edible product detection model and the hand key point detection model are as described above and are not repeated here.
Additionally, in some embodiments, the present solution provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the convolutional neural network based behavior detection method mentioned above.
There is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-mentioned convolutional neural network based behavior detection method.
The present invention is not limited to the above preferred embodiments. Any other product obtained by any person under the teaching of the present invention, whatever changes are made in its shape or structure, falls within the scope of protection of the present invention as long as its technical solution is identical or similar to that of the present application.

Claims (8)

1. A behavior detection method based on a convolutional neural network, characterized by comprising the following steps:
acquiring image data, wherein the image data at least comprises a first image and a second image of the same detection object, the second image being acquired a fixed time period after the first image;
inputting the first image and the second image into a neural network model, and obtaining confidence maps and affinity vector maps of predicted hand key points and predicted mouth key points in the first image and the second image, wherein the confidence maps represent the accuracy of the predicted hand key points and predicted mouth key points, and the affinity vector maps represent the degree of association between the predicted hand key points and the predicted mouth key points;
analyzing the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points through a greedy algorithm, and outputting coordinate values of the predicted hand key points and the predicted mouth key points;
and detecting whether an edible product exists in the acquired images by using an edible product detection model, and determining, by combining the coordinate values of the hand key points with the edible product detection model, whether the hand of the detection object in the acquired images is holding an edible product, wherein edible products comprise food or tobacco; if the hand of the detection object in the first image and the second image is holding food, acquiring the mouth-hand distance of the detection object in the first image and the second image according to the coordinate values of the predicted hand key points and the predicted mouth key points, and judging that the detection object is eating when the absolute difference between the mouth-hand distances of the first image and the second image is larger than a set first threshold value; wherein the image data at least comprises three continuous images of the same detection object, the acquisition time interval between two continuous images being fixed, and if the mouth-hand distance alternately changes between adjacent images in the continuous images, judging that the detection object is smoking the tobacco.
2. The convolutional neural network-based behavior detection method of claim 1, further comprising:
before the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points in the first image and the second image are obtained, corresponding feature maps of the first image and the second image are obtained through a convolution module.
3. The convolutional neural network-based behavior detection method of claim 1, further comprising:
inputting the continuous images into the neural network model, acquiring coordinate values of the predicted hand key points and predicted mouth key points of each image, and calculating the mouth-hand distance of the detection object in each image according to the coordinate values.
4. The behavior detection method based on a convolutional neural network according to any one of claims 1 to 3, wherein the image data is input into the edible product detection model to obtain bounding box information of the edible product, the bounding box information comprising the type and coordinate values of the edible product, and if the range formed by the coordinate values of the edible product intersects the coordinate values of a predicted hand key point, it is judged that the hand is holding the edible product.
5. A convolutional neural network-based behavior detection system, comprising:
an image acquisition unit configured to acquire image data including at least two images for the same detection object;
a confidence unit, configured to acquire confidence maps of the predicted hand key points and predicted mouth key points in the images;
an affinity unit, configured to acquire affinity vector maps of the predicted hand key points and predicted mouth key points in the images;
an analysis unit, configured to analyze the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points through a greedy algorithm and output coordinate values of the predicted hand key points and the predicted mouth key points;
a calculation unit, configured to detect whether an edible product exists in the acquired images by using an edible product detection model, and to determine, by combining the coordinate values of the hand key points with the edible product detection model, whether the hand of the detection object in the acquired images is holding an edible product, wherein edible products comprise food or tobacco; if the hand of the detection object in the first image and the second image is holding food, the mouth-hand distance of the detection object in the first image and the second image is acquired according to the coordinate values of the predicted hand key points and the predicted mouth key points, and when the absolute difference between the mouth-hand distances of the first image and the second image is larger than a set first threshold value, the detection object is judged to be eating; the image data at least comprises three continuous images of the same detection object, the acquisition time interval between two continuous images being fixed, and if the mouth-hand distance alternately changes between adjacent images in the continuous images, the detection object is judged to be smoking the tobacco.
6. The convolutional neural network-based behavior detection system of claim 5, wherein the system further comprises:
a judging unit, configured to compare the mouth-hand distance of the detection object with the set threshold value.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the method according to any one of claims 1-3.
8. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-3.
CN202010485168.1A 2020-06-01 2020-06-01 Behavior detection method and system based on convolutional neural network Active CN111611971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010485168.1A CN111611971B (en) 2020-06-01 2020-06-01 Behavior detection method and system based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010485168.1A CN111611971B (en) 2020-06-01 2020-06-01 Behavior detection method and system based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN111611971A CN111611971A (en) 2020-09-01
CN111611971B true CN111611971B (en) 2023-06-30

Family

ID=72205090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010485168.1A Active CN111611971B (en) 2020-06-01 2020-06-01 Behavior detection method and system based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN111611971B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106530730A (en) * 2016-11-02 2017-03-22 重庆中科云丛科技有限公司 Traffic violation detection method and system
CN108609018A (en) * 2018-05-10 2018-10-02 郑州天迈科技股份有限公司 Forewarning Terminal, early warning system and parser for analyzing dangerous driving behavior
CN108734125A (en) * 2018-05-21 2018-11-02 杭州杰视科技有限公司 A kind of cigarette smoking recognition methods of open space
CN110688921A (en) * 2019-09-17 2020-01-14 东南大学 Method for detecting smoking behavior of driver based on human body action recognition technology
CN110705383A (en) * 2019-09-09 2020-01-17 深圳市中电数通智慧安全科技股份有限公司 Smoking behavior detection method and device, terminal and readable storage medium
CN111027481A (en) * 2019-12-10 2020-04-17 浩云科技股份有限公司 Behavior analysis method and device based on human body key point detection

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170213014A1 (en) * 2016-01-22 2017-07-27 Covidien Lp System and method for detecting smoking behavior
CA3013053A1 (en) * 2016-01-28 2017-08-03 Savor Labs, Inc. Method and apparatus for tracking of food intake and other behaviors and providing relevant feedback
CN109522958A (en) * 2018-11-16 2019-03-26 中山大学 Based on the depth convolutional neural networks object detection method merged across scale feature
CN109543627B (en) * 2018-11-27 2023-08-01 西安电子科技大学 Method and device for judging driving behavior category and computer equipment
CN110425005B (en) * 2019-06-21 2020-06-30 中国矿业大学 Safety monitoring and early warning method for man-machine interaction behavior of belt transport personnel under mine


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Driver behavior analysis based on facial key point information: fatigue and smoking as examples; Wang Ying; China Master's Theses Full-text Database, Engineering Science and Technology II (Issue 02, 2020); full text *

Also Published As

Publication number Publication date
CN111611971A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN104281853B (en) A kind of Activity recognition method based on 3D convolutional neural networks
CN107155360B (en) Multilayer polymeric for object detection
US9111375B2 (en) Evaluation of three-dimensional scenes using two-dimensional representations
US20160071024A1 (en) Dynamic hybrid models for multimodal analysis
CN107180226A (en) A kind of dynamic gesture identification method based on combination neural net
CN105405150B (en) Anomaly detection method and device based on fusion feature
CN106780569A (en) A kind of human body attitude estimates behavior analysis method
CN109325430B (en) Real-time behavior identification method and system
CN106920243A (en) The ceramic material part method for sequence image segmentation of improved full convolutional neural networks
CN110826453A (en) Behavior identification method by extracting coordinates of human body joint points
CN107077624A (en) Track hand/body gesture
CN105051755A (en) Part and state detection for gesture recognition
WO2020039121A1 (en) Method and system for generating annotated training data
US11308348B2 (en) Methods and systems for processing image data
CN110073369A (en) The unsupervised learning technology of time difference model
CN111738908A (en) Scene conversion method and system for generating countermeasure network by combining instance segmentation and circulation
US20210263963A1 (en) Electronic device and control method therefor
CN113761259A (en) Image processing method and device and computer equipment
US20210374486A1 (en) Methods and systems of real time movement classification using a motion capture suit
Tosun et al. Real-time object detection application for visually impaired people: Third eye
CN111611971B (en) Behavior detection method and system based on convolutional neural network
CN115546491A (en) Fall alarm method, system, electronic equipment and storage medium
Borges et al. Automated generation of synthetic in-car dataset for human body pose detection
CN112487226A (en) Picture classification model obtaining method and device, electronic equipment and readable storage medium
Zhou et al. Motion balance ability detection based on video analysis in virtual reality environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant