CN111611971A - Behavior detection method and system based on convolutional neural network

Info

Publication number
CN111611971A
CN111611971A
Authority
CN
China
Prior art keywords
image
predicted
key point
mouth
hand
Prior art date
Legal status
Granted
Application number
CN202010485168.1A
Other languages
Chinese (zh)
Other versions
CN111611971B (en)
Inventor
郁强
李圣权
李开民
曹喆
金仁杰
Current Assignee
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202010485168.1A
Publication of CN111611971A
Application granted
Publication of CN111611971B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a behavior detection method and system based on a convolutional neural network. The behavior detection method uses a convolutional neural network to detect specific human behaviors, including but not limited to eating, drinking, and smoking, in videos or dynamic images. It replaces manual supervision and has the advantages of a high detection rate and accurate detection precision.

Description

Behavior detection method and system based on convolutional neural network
Technical Field
The invention relates to the field of video processing, in particular to a behavior detection method and system based on a convolutional neural network.
Background
Deep learning is a new field of machine learning research. Its motivation is to build neural networks that simulate the analysis and learning mechanisms of the human brain, and to use such networks in place of human effort to analyze and process data. To extract useful information from images more accurately, most current deep learning techniques focus on analyzing and processing static images; comparatively little research applies deep learning to dynamic images or video.
However, the analysis of dynamic images or video in real life is of great research interest, especially for detecting specific behaviors such as eating, smoking, and drinking, which requires analyzing dynamic image or video information.
At present, some special places, such as public venues like subways, buses, movie theaters, and museums, have explicit no-eating and no-smoking rules, but actual compliance with these rules is not optimistic, and management units lack practical, effective means of supervising violators. This is mainly because such dynamic behaviors are instantaneous: they can be completed within minutes or even seconds. Public places usually have heavy foot traffic and a large area, so even dedicated monitoring personnel can hardly supervise violators of smoking bans comprehensively; human attention is limited, and dedicated human monitoring is time-consuming, labor-intensive, and ineffective.
Disclosure of Invention
The invention aims to provide a behavior detection method and a behavior detection system based on a convolutional neural network. The behavior detection method uses a convolutional neural network to detect specific human behaviors in videos or dynamic images, including but not limited to eating, drinking, and smoking. It replaces manual supervision and has the advantages of a high detection rate and accurate detection precision.
This technical solution provides a behavior detection method based on a convolutional neural network, comprising the following steps:
acquiring image data, wherein the image data includes at least a first image and a second image of the same detection object, the second image being acquired a fixed time period after the first image;
inputting the first image and the second image into a neural network model, and acquiring confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points in the first image and the second image, wherein the confidence map represents the accuracy of the predicted hand and mouth key points, and the affinity vector represents the association between the predicted hand key points and the predicted mouth key points;
analyzing the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points through a greedy algorithm, and outputting the coordinate values of the predicted hand key points and the predicted mouth key points;
and acquiring the mouth-hand distance of the detection object in the first image and the second image according to the coordinate values of the predicted hand key points and the predicted mouth key points.
In other embodiments, the method further comprises:
the image data includes at least three consecutive images of the same detection object, with a fixed time interval between consecutive images; the consecutive images are input into the neural network model, the confidence map and affinity vector map of the hand key points and predicted mouth key points of each image are obtained, the mouth-hand distance of the detection object in each image is calculated from the confidence map and affinity vector map, and if the mouth-hand distances of adjacent images in the sequence alternately increase and decrease, it is judged that the detection object is smoking tobacco.
Further, edible products include food and tobacco, so the behavior detection method can be used to detect both the eating of edible products and the smoking of tobacco.
This technical solution also provides a behavior detection system based on a convolutional neural network, comprising:
an image acquisition unit, configured to acquire image data including at least two images of the same detection object;
a confidence unit, configured to acquire the confidence maps of the predicted hand key points and predicted mouth key points of each image;
an affinity unit, configured to acquire the affinity vector maps of the predicted hand key points and predicted mouth key points in each image;
an analysis unit, configured to analyze the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points through a greedy algorithm and output the coordinate values of the predicted hand key points and the predicted mouth key points;
and a calculation unit, configured to acquire the mouth-hand distance of the detection object in each image according to the coordinate values of the predicted hand key points and the predicted mouth key points.
This solution provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above convolutional neural network-based behavior detection method when executing the program.
This solution provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the above convolutional neural network-based behavior detection method.
Drawings
FIG. 1 is a schematic structural diagram of the edible product detection model.
FIG. 2 is a schematic structural diagram of the hand and mouth key point detection model according to the present embodiment.
FIG. 3 is a schematic flow diagram of the behavior detection method based on the convolutional neural network according to the present embodiment.
FIG. 4 is a schematic diagram of the behavior detection system based on the convolutional neural network according to the present embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, the operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
A computer program can be applied to input data to perform the functions herein to transform the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
Specifically, this technical solution provides a behavior detection method and system based on a convolutional neural network. The behavior detection method can be used to detect dynamic human behaviors such as eating, drinking, and smoking in videos or dynamic images, and is particularly applicable to monitoring and management in public places.
It should be noted that this solution uses the change in mouth-hand distance to determine a user's eating or smoking behavior: the position of the head (and thus the mouth) usually remains still while the user eats or smokes, so the mouth-hand distance directly reflects the dynamic motion of the hand.
In this solution, whether the detection object consumes an edible product is judged from the mouth-hand distances of the detection object across multiple images. Edible products include but are not limited to food and tobacco: food covers all finished products and raw materials for people to eat or drink, as well as articles traditionally regarded as both food and medicine; tobacco covers electronic cigarettes, tobacco pipes, and the like. When the edible product is food, the user eats it; when it is a beverage, the user drinks it; when it is tobacco, the user smokes it. Any of the eating, drinking, or smoking behaviors mentioned above can be detected by the convolutional neural network-based behavior detection method provided by this solution.
Specifically, the present invention provides a behavior detection method based on a convolutional neural network for detecting whether a detection object exhibits a specific behavior, such as eating food or smoking tobacco, comprising the following steps:
acquiring image data, wherein the image data includes at least a first image and a second image of the same detection object, the second image being acquired a fixed time period after the first image;
inputting the first image and the second image into a neural network model, and acquiring confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points in the first image and the second image, wherein the confidence map represents the accuracy of the predicted hand and mouth key points, and the affinity vector represents the association between the predicted hand key points and the predicted mouth key points;
analyzing the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points through a greedy algorithm, and outputting the coordinate values of the predicted hand key points and the predicted mouth key points;
and acquiring the mouth-hand distance of the detection object in the first image and the second image according to the coordinate values of the predicted hand key points and the predicted mouth key points, as sketched below.
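For illustration, a minimal Python sketch of this last step (the function name and the plain 2-D Euclidean metric on pixel coordinates are assumptions; the patent only states that the mouth-hand distance is derived from the predicted key point coordinates):

```python
import math

def mouth_hand_distance(mouth_xy, hand_xy):
    """Euclidean distance between the predicted mouth key point and the
    predicted hand key point, both given as (x, y) coordinate tuples as
    output by the greedy parsing step."""
    return math.hypot(mouth_xy[0] - hand_xy[0], mouth_xy[1] - hand_xy[1])

# e.g. the distance in the first and second images of the same subject
d1 = mouth_hand_distance((412.0, 305.0), (420.0, 610.0))  # hand far from mouth
d2 = mouth_hand_distance((410.0, 302.0), (415.0, 330.0))  # hand near the mouth
```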
In some embodiments, the continuous image data may be a set of consecutive video frame images from surveillance video, a set of images shot continuously within a set time period, or a set of continuously acquired dynamic images.
Before the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points in the first image and the second image are obtained, the first image and the second image are passed through a convolution module to obtain the corresponding feature maps.
In addition, in this solution, an edible product detection model may be used to detect whether an edible product appears in the image, and its output may be combined with the coordinate values of the hand key points to determine whether the detection object holds the edible product in hand.
Of course, the step of detecting whether the detection object's hand holds an edible product can be performed before or after the mouth-hand distance is obtained; and if continuous image data showing the object holding an edible product is selected manually, no deep learning model is needed to detect the edible product.
The edible product detection model can detect the category and coordinate values of edible products in the image data, while the neural network model described above obtains the coordinate values of the predicted hand key points. Whether the detection object's hand holds the edible product is judged from the coordinate information of the edible product and of the predicted hand key points: if the two sets of coordinates overlap or are close to each other, or the coordinate range of the edible product intersects the coordinates of the predicted hand key points, it is judged that the detection object's hand holds the edible product, as sketched below.
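A minimal sketch of that overlap test (the margin parameter and the point-in-box reading of "overlap or are close to each other" are assumptions; the patent fixes only the criterion, not its implementation):

```python
def hand_holds_edible(bbox, hand_xy, margin=20.0):
    """Judge whether the detection object's hand holds the edible product.

    bbox is the (c, x, y, w, h) output of the edible product detection
    model, with (x, y) the top-left vertex; hand_xy is the predicted hand
    key point. The key point must fall inside the bounding box, optionally
    enlarged by a pixel margin to cover the "close to each other" case.
    """
    _, x, y, w, h = bbox
    hx, hy = hand_xy
    return (x - margin) <= hx <= (x + w + margin) and \
           (y - margin) <= hy <= (y + h + margin)
```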
When judging whether a user eats food, only two spaced images are needed: if the detection object's hand holds food in both the first image and the second image, and the absolute difference between the mouth-hand distances of the first image and the second image is larger than a set first threshold, it is judged that the detection object is eating food. It is worth mentioning that the mouth-hand distance of the second image may be larger than that of the first image, in which case the user has finished eating and is taking the food away from the mouth; or it may be smaller, in which case the user is bringing food to the mouth to complete an eating motion.
It is also worth mentioning that the acquisition interval between the first image and the second image should not exceed 30 seconds, and the first threshold is set to no more than 0.5 meter.
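Under these constraints, the eating judgment reduces to a simple rule. In the sketch below, the function name and the assumption that distances are already calibrated to meters are illustrative; only the 30-second interval bound and the 0.5-meter threshold come from the text:

```python
MAX_INTERVAL_S = 30.0       # maximum acquisition interval between the images
EATING_THRESHOLD_M = 0.5    # first threshold, at most 0.5 m per the text

def is_eating(d_first, d_second, interval_s, threshold=EATING_THRESHOLD_M):
    """Two-image eating judgment: fires whichever direction the hand moved,
    i.e. food brought to the mouth or taken away from it."""
    if interval_s > MAX_INTERVAL_S:
        return False  # images too far apart to compare meaningfully
    return abs(d_first - d_second) > threshold
```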
For example, take user a eating bread as an example:
according to the behavior detection method based on the convolutional neural network, the first image and the second image containing the user A are obtained, the mouth-hand distance of the user A in the first image and the mouth-hand distance of the user A in the second image are respectively obtained, and if the difference value between the mouth-hand distance of the first image and the mouth-hand distance of the second image is larger than a set first threshold value, the user A is determined to move food to the mouth through the hand in the interval between the two images, namely the user A eats bread. (of course, in this scenario the bread on the user A's hand is detected by the food inspection model).
Smoking behavior is similar to but different from eating behavior: when smoking, the user's hand moves back and forth to the mouth repeatedly. Therefore, at least three images of the detection object need to be acquired, with a set time period between successive acquisitions; the mouth-hand distance of the detection object in each image is obtained, and the judgment is made according to the criterion that the mouth-hand distance alternately increases and decreases.
Correspondingly, this solution provides a behavior detection method based on a convolutional neural network for detecting tobacco smoking behavior, comprising the following steps:
acquiring image data, wherein the image data includes at least three consecutive images of the same detection object, with a fixed acquisition interval between consecutive images; inputting the consecutive images into the neural network model, acquiring the confidence map and affinity vector map of the hand key points and mouth key points of each image, and calculating the mouth-hand distance of the detection object in each image from the confidence maps and affinity vector maps through a greedy algorithm. If the mouth-hand distances in the consecutive images alternately increase and decrease, it is judged that the detection object is smoking tobacco, as sketched below.
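A minimal sketch of this alternating-distance criterion (the function name, the single shared threshold, and meter units are assumptions; as noted below, the text allows per-pair thresholds):

```python
def is_smoking(distances, threshold=0.3):
    """Smoking judgment over a sequence of at least three mouth-hand
    distances: adjacent differences must alternate in sign (the hand moves
    to and from the mouth) and each must exceed the threshold in absolute
    value."""
    if len(distances) < 3:
        return False
    diffs = [b - a for a, b in zip(distances, distances[1:])]
    if any(abs(d) <= threshold for d in diffs):
        return False  # some adjacent pair moved too little
    # strict alternation: consecutive differences have opposite signs
    return all(d1 * d2 < 0 for d1, d2 in zip(diffs, diffs[1:]))
```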
Of course, as in the steps described above, before the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points are obtained, each image is passed through a convolution module to obtain the corresponding feature map.
Taking the acquisition of four images as an example, the behavior detection method based on the convolutional neural network provided by this solution comprises the following steps:
acquiring image data, wherein the image data includes at least a first image, a second image, a third image, and a fourth image of the same detection object, the second image being acquired a fixed time period after the first image, the third image a fixed time period after the second image, and the fourth image a fixed time period after the third image;
inputting the first, second, third, and fourth images into the neural network model to obtain the confidence maps and affinity vector maps of the hand key points and mouth key points in each image;
and obtaining, through a greedy algorithm, the mouth-hand distance of the detection object in the first, second, third, and fourth images according to the confidence maps and affinity vector maps.
If the mouth-hand distance alternates in size across the first to fourth images (for example, the distance in the second image is smaller than in the first, larger in the third than in the second, and smaller in the fourth than in the third; or the reverse pattern), and the absolute difference between the mouth-hand distances of each pair of adjacent images is larger than the corresponding set threshold (the thresholds for different adjacent pairs may differ), it is judged that the detection object is smoking tobacco.
It should be noted that the first, second, and third thresholds may be set to identical or different values, and the time intervals between acquisitions of consecutive images need not be identical. Preferably, each interval is no more than 10 seconds, and the judgment threshold for the absolute difference between mouth-hand distances of consecutive images is set to no more than 0.3 meter.
Take user A smoking tobacco as an example:
acquire the first, second, third, and fourth images of user A and obtain user A's mouth-hand distance in each image. If the absolute difference between the mouth-hand distances of the first and second images is larger than the set first threshold, the absolute difference between the mouth-hand distances of the second and third images is larger than the set second threshold, and the absolute difference between the mouth-hand distances of the third and fourth images is larger than the set third threshold, it is determined that user A brought a cigarette to the mouth during the interval between the first and second images, took it away from the mouth during the interval between the second and third images, and brought it close to the mouth again during the interval between the third and fourth images, and it is therefore judged that user A is smoking tobacco.
The construction and training process of the neural network model adopted by the scheme is as follows:
preparation of pedestrian hand and mouth key point detection data: marking key points of hands and mouths of pedestrians in the collected marking image data, and marking affinity vectors of the key points of the hands and affinity vectors of the key points of the mouths;
Pedestrian hand and mouth key point detection network structure design: the backbone network is composed of convolutional neural modules. Annotated image data is taken as input, and a feature map $F$ is obtained through convolution module A. The network then splits into two branches: branch 1 predicts the confidence maps of the hand key points and mouth key points, and branch 2 predicts the affinity vectors of the hand key points and of the mouth key points. Each branch is an iterative prediction architecture, and branch 1 together with branch 2 forms one stage. At each stage $k$ the network generates a set of detection confidence maps $S^k = \rho^k(\cdot)$ and a set of affinity vector fields $L^k = \phi^k(\cdot)$, where $\rho^1$ and $\phi^1$ denote the first-stage convolutional blocks, whose only input is the feature map $F$ produced by convolution module A:

$$S^1 = \rho^1(F), \qquad L^1 = \phi^1(F).$$

At each subsequent stage, the input is the prediction result of the previous stage together with the feature map $F$. Letting $\rho^k$ and $\phi^k$ denote the convolutional neural block structure of the $k$-th stage, their outputs are:

$$S^k = \rho^k\left(F, S^{k-1}, L^{k-1}\right), \qquad L^k = \phi^k\left(F, S^{k-1}, L^{k-1}\right), \quad k \ge 2;$$
analyzing the confidence maps of the hand and mouth key points through greedy inference, and learning the association between hand and mouth through Part Affinity Fields (PAFs, part affinity vector fields);
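Although the patent fixes no concrete layer configuration, the two-branch iterative structure can be illustrated with the following PyTorch sketch. All channel counts, kernel sizes, stage depth, and class names here are assumptions; only the overall dataflow (shared feature map F from convolution module A, per-stage branches rho and phi, and concatenation of the previous predictions with F) follows the description above:

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One iterative stage: branch 1 (rho) regresses key point confidence
    maps S, branch 2 (phi) regresses affinity vector fields L."""

    def __init__(self, in_ch, n_maps, n_pafs):
        super().__init__()
        def branch(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, out_ch, 1),
            )
        self.rho = branch(n_maps)   # rho^k -> S^k
        self.phi = branch(n_pafs)   # phi^k -> L^k

    def forward(self, x):
        return self.rho(x), self.phi(x)

class HandMouthNet(nn.Module):
    # 2 confidence maps (hand, mouth); 1 limb -> 2 PAF channels (x and y)
    def __init__(self, feat_ch=64, n_maps=2, n_pafs=2, n_stages=3):
        super().__init__()
        # convolution module A: produces the shared feature map F
        self.module_a = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.stages = nn.ModuleList(
            [Stage(feat_ch, n_maps, n_pafs)] +
            [Stage(feat_ch + n_maps + n_pafs, n_maps, n_pafs)
             for _ in range(n_stages - 1)]
        )

    def forward(self, img):
        f = self.module_a(img)
        outs, x = [], f
        for stage in self.stages:
            s, l = stage(x)
            outs.append((s, l))
            # next stage input: previous predictions concatenated with F
            x = torch.cat([f, s, l], dim=1)
        return outs  # per-stage (S^k, L^k) pairs, all supervised by the loss
```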
Training the pedestrian hand and mouth key point detection model: assign initialization values to the network parameters and set the maximum number of network iterations $M$; input the prepared training image data set into the network and train; if the loss value keeps decreasing, continue training until the final model is obtained after $M$ iterations; if the loss value stabilizes partway through, stop iterating to obtain the final model;
The loss function is:

$$f = \sum_{k=1}^{K}\left(f_S^{k} + f_L^{k}\right),$$

with two loss terms for each stage $k$:

$$f_S^{k} = \sum_{j=1}^{m}\sum_{p}\left\| S_j^{k}(p) - S_j^{*}(p) \right\|_2^2, \qquad f_L^{k} = \sum_{c=1}^{n}\sum_{p}\left\| L_c^{k}(p) - L_c^{*}(p) \right\|_2^2,$$

where $S_j^{*}$ denotes the manually labeled confidence maps of the hand and mouth key points, $L_c^{*}$ denotes the manually labeled affinity vectors of the hand key points and mouth key points, $m$ denotes the number of key points (hand key points and mouth key points), and $n$ denotes the number of limbs, i.e., hand and mouth, where one limb corresponds to two key points.
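Transcribed into code (a sketch assuming the predictions and labels are dense tensors of matching shape, with no labeling mask applied):

```python
import torch

def total_loss(preds, s_star, l_star):
    """Sum of the per-stage L2 losses f_S^k + f_L^k over all K stages.

    preds is a list of per-stage (S^k, L^k) prediction pairs, such as the
    output of the network sketch above; s_star and l_star are the manually
    labeled confidence maps and affinity vector fields."""
    f = torch.zeros(())
    for s_k, l_k in preds:
        f = f + ((s_k - s_star) ** 2).sum()   # f_S^k
        f = f + ((l_k - l_star) ** 2).sum()   # f_L^k
    return f
```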
The construction and training process of the edible product detection model is as follows (the edible products comprise food or tobacco):
Preparing data: label the annotation images, where the annotation information is the bounding box of the food or tobacco and its category, i.e., $(c_j, x_j, y_j, w_j, h_j)$, in which $c_j$ indicates the category of the bounding box (different categories of edible products correspond to different $c_j$ values), $x_j, y_j$ are the coordinates of the top-left vertex of the bounding box, and $w_j, h_j$ are the width and height of the bounding box; the labeled data samples are divided into a training set, a validation set, and a test set in the ratio 8:1:1;
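For illustration, the 8:1:1 split might be implemented as follows (the function name and the use of a fixed shuffling seed are assumptions):

```python
import random

def split_samples(samples, seed=0):
    """Shuffle the labeled samples and split them into training,
    validation, and test sets in the ratio 8:1:1."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```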
and (3) network structure design: the algorithm adopts a convolutional neural network with a multi-scale structure, a backbone network is composed of residual modules, network characteristic channel separation and channel shuffling are carried out, a top-down characteristic pyramid structure is adopted on the basis of the backbone network, top-down up-sampling operation is added, deep layer characteristics and shallow layer characteristic information fusion of a plurality of layers are constructed, so that better characteristics are obtained, candidate frames with different sizes are screened, and finally an optimal result is reserved;
the network employs a swish activation function,
Figure BDA0002518780270000112
training: setting the size of an input image as 416 x 416, setting the input minimum batch data value as 64, setting the learning rate as 10 < -3 >, and performing optimized learning by adopting an Adam gradient descent strategy;
and (3) testing a model: test data is input, and bounding box information (c, x, y, w, h) is output.
In the edible product detection model, the category label $c_j$ of the various foods is set to 1, and that of cigarettes or other tobacco is set to 2. From the output bounding box information, it can be determined whether an edible product is present, whether it is food or tobacco, and what its coordinates are.
In addition, this solution can perform behavior management on the basis of the convolutional neural network-based behavior detection method, comprising the following steps: load the detection frame of the detection object consuming an edible product into a pedestrian recognition model, acquire the identity information of the detection object, and record it as a task library event.
This solution additionally provides a behavior detection system based on a convolutional neural network, comprising:
an image acquisition unit, configured to acquire image data including at least two images of the same detection object;
a confidence unit, configured to acquire the confidence maps of the predicted hand key points and predicted mouth key points of each image;
an affinity unit, configured to acquire the affinity vector maps of the predicted hand key points and predicted mouth key points in each image;
an analysis unit, configured to analyze the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points through a greedy algorithm and output the coordinate values of the predicted hand key points and the predicted mouth key points;
and a calculation unit, configured to acquire the mouth-hand distance of the detection object in each image according to the coordinate values of the predicted hand key points and the predicted mouth key points.
Certainly, the behavior detection system based on the convolutional neural network provided in this solution further includes a judging unit for judging the relationship between the mouth-hand distance of the detection object and the set threshold. The judgment details and data handling are as described above for the convolutional neural network-based behavior detection method and are not repeated here.
In addition, the behavior detection system includes an edible product detection unit that runs the edible product detection model to detect whether an edible product exists in the image data and to obtain its coordinate values; the judging unit then further judges whether the detection object's hand holds the edible product based on the coordinate values of the edible product and of the predicted hand key points. In this behavior detection system, edible products include food and tobacco.
The training and construction processes of the edible product detection model and the hand and mouth key point detection model are as described above and are not repeated here.
In addition, in some embodiments, the present solution provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-mentioned steps of the convolutional neural network-based behavior detection method when executing the program.
There is provided a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the above-mentioned steps of the convolutional neural network-based behavior detection method.
The present invention is not limited to the above preferred embodiments. Any other products in various forms may be derived by anyone in light of the present invention; however, any change in shape or structure whose technical solution is identical or similar to that of the present application falls within the protection scope of the present invention.

Claims (10)

1. A behavior detection method based on a convolutional neural network is characterized by comprising the following steps:
acquiring image data, wherein the image data includes at least a first image and a second image of the same detection object, the second image being acquired a fixed time period after the first image;
inputting the first image and the second image into a neural network model, and acquiring confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points in the first image and the second image, wherein the confidence map represents the accuracy of the predicted hand and mouth key points, and the affinity vector represents the association between the predicted hand key points and the predicted mouth key points;
analyzing the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points through a greedy algorithm, and outputting the coordinate values of the predicted hand key points and the predicted mouth key points;
and acquiring the mouth-hand distance of the detection object in the first image and the second image according to the coordinate values of the predicted hand key points and the predicted mouth key points.
2. The convolutional neural network-based behavior detection method of claim 1, further comprising:
before obtaining the confidence maps and affinity vector maps of the predicted hand key points and predicted mouth key points in the first image and the second image, passing the first image and the second image through a convolution module to obtain the corresponding feature maps.
3. The convolutional neural network-based behavior detection method of claim 1, further comprising:
and when the absolute difference value between the mouth-hand distances of the first image and the second image is larger than a set first threshold value, judging that the detection object eats food.
4. The convolutional neural network-based behavior detection method of claim 1, further comprising:
the image data includes at least three consecutive images of the same detection object, with a fixed time interval between consecutive images; the consecutive images are input into a neural network model, the coordinate values of the predicted hand key points and predicted mouth key points of each image are acquired, and the mouth-hand distance of the detection object in each image is calculated according to the coordinate values.
5. The convolutional neural network-based behavior detection method of claim 4, further comprising:
and if the mouth-hand distances of the adjacent images in the continuous images change in alternate sizes, judging that the detection object sucks tobacco.
6. The convolutional neural network-based behavior detection method as claimed in any one of claims 1 to 5, wherein the image data is input into an edible product detection model to obtain bounding box information of the edible product, the bounding box information including the category and coordinate values of the edible product, and if the range formed by the coordinate values of the edible product intersects the coordinate values of a predicted hand key point, it is determined that the user holds the edible product.
7. A convolutional neural network-based behavior detection system, comprising:
an image acquisition unit configured to acquire image data including at least two images for a same detection object;
the confidence coefficient unit is used for acquiring a confidence coefficient map of the predicted hand key point and the predicted mouth key point of the image;
the affinity unit is used for acquiring an affinity vector diagram of the predicted hand key point and the predicted mouth key point in the image;
the analysis unit is used for analyzing the confidence coefficient map and the affinity vector map of the predicted hand key point and the predicted mouth key point through a greedy algorithm and outputting coordinate values of the predicted hand key point and the predicted mouth key point;
and the calculating unit is used for acquiring the mouth-hand distance of the detection object in the image according to the coordinate values of the predicted hand key point and the predicted mouth key point.
8. The convolutional neural network-based behavior detection system as claimed in claim 7, further comprising:
and the judging unit is used for judging the difference relation between the mouth-hand distance of the detection object and the set threshold value.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and running on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the method according to any of claims 1-5.
10. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, is adapted to carry out the steps of the method according to any of the claims 1-5.
CN202010485168.1A 2020-06-01 2020-06-01 Behavior detection method and system based on convolutional neural network Active CN111611971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010485168.1A CN111611971B (en) 2020-06-01 2020-06-01 Behavior detection method and system based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010485168.1A CN111611971B (en) 2020-06-01 2020-06-01 Behavior detection method and system based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN111611971A true CN111611971A (en) 2020-09-01
CN111611971B CN111611971B (en) 2023-06-30

Family

ID=72205090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010485168.1A Active CN111611971B (en) 2020-06-01 2020-06-01 Behavior detection method and system based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN111611971B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170213014A1 (en) * 2016-01-22 2017-07-27 Covidien Lp System and method for detecting smoking behavior
US20170220772A1 (en) * 2016-01-28 2017-08-03 Savor Labs, Inc. Method and apparatus for tracking of food intake and other behaviors and providing relevant feedback
CN106530730A (en) * 2016-11-02 2017-03-22 重庆中科云丛科技有限公司 Traffic violation detection method and system
CN108609018A (en) * 2018-05-10 2018-10-02 郑州天迈科技股份有限公司 Forewarning Terminal, early warning system and parser for analyzing dangerous driving behavior
CN108734125A (en) * 2018-05-21 2018-11-02 杭州杰视科技有限公司 A kind of cigarette smoking recognition methods of open space
CN109522958A (en) * 2018-11-16 2019-03-26 中山大学 Based on the depth convolutional neural networks object detection method merged across scale feature
CN109543627A (en) * 2018-11-27 2019-03-29 西安电子科技大学 A kind of method, apparatus and computer equipment judging driving behavior classification
CN110425005A (en) * 2019-06-21 2019-11-08 中国矿业大学 The monitoring of transportation of belt below mine personnel's human-computer interaction behavior safety and method for early warning
CN110705383A (en) * 2019-09-09 2020-01-17 深圳市中电数通智慧安全科技股份有限公司 Smoking behavior detection method and device, terminal and readable storage medium
CN110688921A (en) * 2019-09-17 2020-01-14 东南大学 Method for detecting smoking behavior of driver based on human body action recognition technology
CN111027481A (en) * 2019-12-10 2020-04-17 浩云科技股份有限公司 Behavior analysis method and device based on human body key point detection

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QIANKUN TANG, JIE LI, ZHIPING SHI, YU HU: "LightDet: A Lightweight and Accurate Object Detection Network", 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
ZHE CAO, TOMAS SIMON, SHIH-EN WEI, YASER SHEIKH: "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2-5 *
王莹 (Wang Ying): "Driver Behavior Analysis Based on Facial Key Point Information: Fatigue and Smoking as Examples", China Master's Theses Full-text Database, Engineering Science and Technology II, no. 2020 *

Also Published As

Publication number Publication date
CN111611971B (en) 2023-06-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant