CN112163469B - Smoking behavior recognition method, system, equipment and readable storage medium - Google Patents

Smoking behavior recognition method, system, equipment and readable storage medium

Info

Publication number
CN112163469B
Authority
CN
China
Prior art keywords
frame
video
cigarette
neural network
key points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010955978.9A
Other languages
Chinese (zh)
Other versions
CN112163469A (en)
Inventor
徐璐
任鹏飞
徐浪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tensorsight Shanghai Intelligent Technology Co ltd
Original Assignee
Tensorsight Shanghai Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tensorsight Shanghai Intelligent Technology Co ltd filed Critical Tensorsight Shanghai Intelligent Technology Co ltd
Priority to CN202010955978.9A priority Critical patent/CN112163469B/en
Publication of CN112163469A publication Critical patent/CN112163469A/en
Application granted granted Critical
Publication of CN112163469B publication Critical patent/CN112163469B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/44 - Event detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 - Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a smoking behavior identification method, system, device and readable storage medium. For a collected video to be analyzed, the method detects whether cigarettes exist in the video frames and labels the cigarettes in the frames that contain them, forming a cigarette data set; the video to be analyzed is input into a human body posture estimation neural network to obtain the key points of the human body in the frame sequence, and the key points are input into a graph convolutional neural network to obtain the behavior label of each person in each frame; the frame sequence of the cigarette data set is input into a target detection neural network to detect the cigarette position in the frame images and obtain a cigarette confidence for each frame; finally, whether smoking behavior exists in each frame is judged from the per-frame behavior label and cigarette confidence, and a smoking label is applied to videos in which smoking behavior exists. Smoking behavior is thus identified accurately, and fires are effectively prevented.

Description

Smoking behavior recognition method, system, equipment and readable storage medium
Technical Field
The embodiments of the application relate to the technical field of artificial intelligence, and in particular to a smoking behavior identification method, system, device and readable storage medium.
Background
The number of smokers worldwide increases every year. Smoking not only seriously harms health but also raises the risk of fire, especially in areas where smoking and open flames are prohibited, such as gas stations, public places, dangerous-goods areas and flammable-goods warehouses; once a fire breaks out there, the losses are immeasurable. It is therefore important to detect smoking behavior in such areas quickly and effectively, so that problems can be prevented before they occur.
Current methods for detecting smoking behavior fall into two broad categories. The first relies mainly on sensors, for example collecting infrared thermal data from cigarette ends with infrared sensing equipment, or making the judgment with a smoke-sensor device. The second relies on real-time video. In recent years, with the rapid development of deep neural networks, a variety of solutions have emerged for identifying smoking behavior in video with intelligent video-analysis algorithms, such as: classifying single video frames with a convolutional neural network; segmenting the head and hand regions of a person in a video and detecting cigarettes in the segmented regions with a deep-learning algorithm; or detecting some key points of a person (such as the face and hands) and analyzing the positional relationship of the key points between frames.
Sensor-based smoking detection is only suitable for enclosed spaces, is strongly affected by ambient light and ventilation, and basically cannot meet the detection requirements of an outdoor smoking ban. Moreover, sensor sensitivity is difficult to tune: if it is too high, frequent false alarms occur; if it is too low, alarms come late or are missed entirely, so the sensor cannot serve a preventive role. Existing methods that detect smoking behavior through real-time video analysis basically consider only the human posture in a single frame, are prone to false alarms, are easily disturbed by factors such as external lighting, and have poor real-time performance.
Disclosure of Invention
Therefore, the embodiments of the application provide a smoking behavior identification method, system, device and readable storage medium that take a sequence of human-body key points as input and identify smoking behavior by using a graph convolutional neural network and a target detection deep neural network together, so that potential smoking behavior is detected in real time while accuracy is ensured, effectively preventing fires.
In order to achieve the above object, the embodiments of the present application provide the following technical solutions:
according to a first aspect of embodiments of the present application, there is provided a smoking behaviour identification method, the method comprising:
detecting, for the collected video to be analyzed, whether cigarettes exist in the video frames, and labeling the cigarettes in the video frames that contain them, to form a cigarette data set;
inputting the video to be analyzed into a human body posture estimation neural network to obtain key points of a human body in a frame sequence, and inputting the key points into a graph convolution neural network to obtain a behavior label of the human body in each frame;
inputting the frame sequence of the cigarette data set into a target detection neural network to detect the cigarette position in a frame image and obtain the cigarette confidence of each frame;
and judging whether smoking behavior exists in each frame according to the behavior label and the cigarette confidence of each frame, and applying a smoking label in the video in which smoking behavior exists.
Optionally, inputting the video to be analyzed into a human body posture estimation neural network to obtain key points of a human body in a frame sequence, including:
inputting the video to be analyzed into a human body posture estimation neural network, which predicts the whole-body joint connections of each person in each video frame; for each video segment, a frame sequence of a set length is taken and the same joint is connected across frames to form a graph structure, from which the key points of the human body in the frame sequence are obtained.
Optionally, the inputting the key points into a graph convolution neural network to obtain a behavior tag of the person in each frame includes:
constructing a skeleton spatio-temporal graph G = (V, E) from the video frames; the node set V = {v_ti | t = 1, ..., T; i = 1, ..., N}, where T is the number of frames and N is the number of key points; the edge set E comprises Es and Ef, where Es connects key points within a single frame along the human-body structure, and Ef connects each key point to the node of the same type in consecutive video frames;
automatically capturing the spatial configuration and temporal dynamics of the key points with the graph convolutional neural network: spatially, convolution learns local features of adjacent key points within the same frame; temporally, convolution learns features between corresponding key points across frames; the spatial and temporal convolutions alternate to extract a higher-level feature map;
and classifying the feature map into corresponding behavior categories by using a classifier to obtain the behavior label of the person in each frame.
Optionally, before detecting whether a cigarette is present in the video frame, the method further comprises:
and aiming at the collected video to be analyzed, normalizing the frame rate and the resolution of the video so as to enable the time lengths in the graph convolution neural network to be consistent.
According to a second aspect of embodiments of the present application, there is provided a smoking behaviour recognition system, the system comprising:
the data preprocessing module, used for detecting, for the collected video to be analyzed, whether cigarettes exist in the video frames, and labeling the cigarettes in the video frames that contain them to form a cigarette data set;
the key point detection module is used for inputting the video to be analyzed into a human body posture estimation neural network to obtain key points of a human body in a frame sequence, and inputting the key points into a graph convolution neural network to obtain a behavior tag of the human body in each frame;
the cigarette detection module is used for inputting the frame sequence of the cigarette data set into a target detection neural network so as to detect the cigarette position in a frame image and obtain the cigarette confidence of each frame;
and the smoking action recognition module, used for judging whether smoking behavior exists in each frame according to the behavior label and the cigarette confidence of each frame, and applying a smoking label in the video in which smoking behavior exists.
Optionally, the key point detection module is specifically configured to:
inputting the video to be analyzed into a human body posture estimation neural network, which predicts the whole-body joint connections of each person in each video frame; for each video segment, a frame sequence of a set length is taken and the same joint is connected across frames to form a graph structure, from which the key points of the human body in the frame sequence are obtained.
Optionally, the key point detection module is specifically configured to:
constructing a skeleton spatio-temporal graph G = (V, E) from the video frames; the node set V = {v_ti | t = 1, ..., T; i = 1, ..., N}, where T is the number of frames and N is the number of key points; the edge set E comprises Es and Ef, where Es connects key points within a single frame along the human-body structure, and Ef connects each key point to the node of the same type in consecutive video frames;
automatically capturing the spatial configuration and temporal dynamics of the key points with the graph convolutional neural network: spatially, convolution learns local features of adjacent key points within the same frame; temporally, convolution learns features between corresponding key points across frames; the spatial and temporal convolutions alternate to extract a higher-level feature map;
and classifying the feature map into corresponding behavior categories by using a classifier to obtain the behavior label of the person in each frame.
Optionally, the data preprocessing module is further configured to:
and aiming at the collected video to be analyzed, normalizing the frame rate and the resolution of the video so as to enable the time lengths in the graph convolution neural network to be consistent.
According to a third aspect of embodiments herein, there is provided an apparatus comprising a data acquisition device, a processor and a memory, wherein the data acquisition device is used for acquiring data, the memory is used to store one or more program instructions, and the processor is configured to execute the one or more program instructions to perform the method of any implementation of the first aspect.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium having one or more program instructions embodied therein for performing the method of any of the first aspects.
In summary, the present application provides a smoking behavior identification method, system, device and readable storage medium. For a collected video to be analyzed, whether cigarettes exist in the video frames is detected, and the cigarettes in the frames that contain them are labeled, forming a cigarette data set; the video to be analyzed is input into a human body posture estimation neural network to obtain the key points of the human body in the frame sequence, and the key points are input into a graph convolutional neural network to obtain the behavior label of each person in each frame; the frame sequence of the cigarette data set is input into a target detection neural network to detect the cigarette position in the frame images and obtain a cigarette confidence for each frame; finally, whether smoking behavior exists in each frame is judged from the per-frame behavior label and cigarette confidence, and a smoking label is applied in videos where smoking behavior exists. Taking the human key-point sequence as input and combining graph convolution with a target detection deep neural network, the method detects potential smoking behavior in real time while ensuring accuracy, effectively preventing fires.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely exemplary, and other drawings can be derived from them by those of ordinary skill in the art without inventive effort.
The structures, ratios, sizes and the like shown in this specification are used only together with the content disclosed in the specification, for the understanding of those skilled in the art; they do not limit the conditions under which the invention can be implemented and carry no substantive technical significance. Any structural modification, change of ratio relationship or adjustment of size that does not affect the effects and purposes achievable by the invention shall still fall within the scope of the invention.
Fig. 1 is a schematic flow chart of a smoking behavior identification method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a model training phase according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of a model prediction phase according to an embodiment of the present application;
fig. 4 is a block diagram of a smoking behavior recognition system according to an embodiment of the present application.
Detailed Description
The present invention is described below through particular embodiments, and other advantages and effects of the invention will become apparent to those skilled in the art from the following disclosure. It should be understood that the described embodiments are merely some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the invention.
Fig. 1 shows a flow of a smoking behavior recognition method provided in an embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
step 101: and detecting whether cigarettes exist in the video frames aiming at the collected videos to be analyzed, and labeling the cigarettes in the video frames with the cigarettes to form a cigarette data set.
Step 102: inputting the video to be analyzed into a human body posture estimation neural network to obtain key points of a human body in a frame sequence, and inputting the key points into a graph convolution neural network to obtain a behavior tag of the human body in each frame.
Step 103: and inputting the frame sequence of the cigarette data set into a target detection neural network to detect the cigarette position in the frame image and obtain the cigarette confidence of each frame.
Step 104: judge whether smoking behavior exists in each frame according to the behavior label and the cigarette confidence of each frame, and apply a smoking label in the video in which smoking behavior exists.
In a possible implementation, before detecting whether a cigarette is present in the video frame in step 101, the method further includes: for the collected video to be analyzed, normalizing the frame rate and resolution of the video so that the temporal lengths (numbers of frames) processed by the graph convolutional neural network are consistent.
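By way of illustration only, the following is a minimal Python sketch of such a normalization pass, assuming OpenCV; the target frame rate of 25 fps and resolution of 640x480 are assumed values, since the embodiment does not fix them:

```python
import cv2

def normalize_video(src_path, dst_path, fps=25, size=(640, 480)):
    """Resample a video to a fixed frame rate and resolution so every clip
    fed to the graph convolutional network covers the same number of frames
    per unit of time."""
    cap = cv2.VideoCapture(src_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or fps    # fall back if FPS is unknown
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    t, t_next = 0.0, 0.0                          # source / target timelines
    dt_src, dt_dst = 1.0 / src_fps, 1.0 / fps
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Emit output frames whenever the target timeline falls behind the source.
        while t_next <= t:
            writer.write(cv2.resize(frame, size))
            t_next += dt_dst
        t += dt_src
    cap.release()
    writer.release()
```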
In a possible implementation manner, in step 102, inputting the video to be analyzed into a human body posture estimation neural network to obtain key points of a human body in a frame sequence includes: inputting the video to be analyzed into the human body posture estimation neural network, which predicts the whole-body joint connections of each person in each video frame; for each video segment, a frame sequence of a set length is taken and the same joint is connected across frames to form a graph structure, from which the key points of the human body in the frame sequence are obtained.
In one possible implementation, in step 102, inputting the key points into a graph convolutional neural network to obtain the behavior label of each person in each frame includes: constructing a skeleton spatio-temporal graph G = (V, E) from the video frames, with node set V = {v_ti | t = 1, ..., T; i = 1, ..., N}, where T is the number of frames and N is the number of key points, and edge set E comprising Es and Ef, where Es connects key points within a single frame along the human-body structure and Ef connects each key point to the node of the same type in consecutive video frames; automatically capturing the spatial configuration and temporal dynamics of the key points with the graph convolutional neural network, where spatially, convolution learns local features of adjacent key points in the same frame, and temporally, convolution learns features between corresponding key points across frames, the spatial and temporal convolutions alternating to extract a higher-level feature map; and classifying the feature map into the corresponding behavior category with a classifier to obtain the behavior label of each person in each frame.
In the method provided by the embodiment of the application, a human body posture estimation neural network predicts the whole-body joint connections of each person in each video frame. Each video segment is a frame sequence (for example, the frames corresponding to 10 seconds), and the same joint is also connected between frames, forming a graph structure whose node features are the joint connections given as input.
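To make the graph structure concrete, the sketch below builds the two edge sets from per-frame key points; the 18-joint intra-frame skeleton follows the common OpenPose-style layout, which is an assumption here rather than a list given in the embodiment:

```python
# Assumed OpenPose-style 18-joint skeleton: (joint_i, joint_j) pairs (Es within a frame).
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
                  (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
                  (0, 14), (14, 16), (0, 15), (15, 17)]

def build_st_graph(T, N=18):
    """Index node v_ti as t * N + i and return the two edge sets:
    Es connects joints within each frame along the skeleton,
    Ef connects the same joint type across consecutive frames."""
    Es = [(t * N + i, t * N + j) for t in range(T) for i, j in SKELETON_EDGES]
    Ef = [(t * N + i, (t + 1) * N + i) for t in range(T - 1) for i in range(N)]
    return Es, Ef
```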
The spatial convolution requires a convolution kernel (a sampling space) and convolution weights. In an image, the sampling space p(h, w) is defined over the pixels adjacent to a center position x; on the graph, the sampling function is defined on the neighborhood of a node v_ti as B(v_ti) = {v_tj | d(v_tj, v_ti) ≤ D}, where d(v_tj, v_ti) denotes the minimum length of any path from node v_tj to node v_ti, and D is set to 1 (i.e., the immediate neighbors of v_ti). For the convolution weights, the neighbor set B(v_ti) is divided into 3 subsets according to the spatial-configuration partition: among the whole-body joints a root node Ot exists, and the distance d(Ot, ·) serves as the division criterion; v_ti itself forms one group, the nodes whose distance to the root Ot is greater than that of v_ti form the centrifugal group, and the remaining nodes form the centripetal group. The 3 groups receive different weight values, and the convolution shares the same weights elsewhere on the frame.
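A minimal sketch of this 3-subset spatial-configuration partition follows; how the distance r to the root node Ot is measured is deliberately left abstract, since the text above only names it d(Ot, v_ti):

```python
def partition_neighbors(neighbors, center, r):
    """Split the 1-hop neighbor set B(v_ti) into the three weight groups.
    `center` is the index of v_ti itself and `r[j]` is the (assumed) distance
    of joint j from the reference root node Ot."""
    groups = {"root": [center], "centripetal": [], "centrifugal": []}
    for j in neighbors:
        if j == center:
            continue
        # Farther from Ot than v_ti itself -> centrifugal; otherwise centripetal.
        key = "centrifugal" if r[j] > r[center] else "centripetal"
        groups[key].append(j)
    return groups  # each group receives its own learned convolution weight
```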
The temporal convolution targets joint points of the same type and is similar to the spatial convolution, except that the sampling function involves not only D but also F, the number of frames; that is, F consecutive frames form one feature map. Because the time axis is ordered, the weighting can be defined directly on the temporal neighborhood (the frames) rooted at v_ti.
A sequence of human joint-point combinations from a group of video frames over a period of time (such as 5 seconds) is learned in both the spatial and temporal directions by the graph convolutional neural network, without being limited to a fixed topology. After each spatial and temporal convolution, a ResNet-style residual mechanism is applied, and dropout randomly discards part of the features with a certain probability to prevent overfitting. After multiple rounds of spatio-temporal convolution, the final feature vector is fed to a SoftMax classifier to predict the action-label index of each person in each frame, yielding a pre-trained graph convolutional network model that can identify smoking behavior for real-time video analysis.
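Put together, one such spatio-temporal block might look like the following PyTorch sketch; the channel sizes, the temporal kernel length and the dropout probability are assumptions, and `A` stands for a normalized (N, N) adjacency tensor built from the intra-frame edges Es:

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Spatial graph convolution followed by temporal convolution, with a
    ResNet-style residual connection and dropout, as described above."""
    def __init__(self, in_c, out_c, A, gamma=9, p_drop=0.5):
        super().__init__()
        self.register_buffer("A", A)                  # normalized adjacency (N, N)
        self.spatial = nn.Conv2d(in_c, out_c, kernel_size=1)
        self.temporal = nn.Conv2d(out_c, out_c, kernel_size=(gamma, 1),
                                  padding=((gamma - 1) // 2, 0))
        self.res = (nn.Identity() if in_c == out_c
                    else nn.Conv2d(in_c, out_c, kernel_size=1))
        self.drop = nn.Dropout(p_drop)
        self.relu = nn.ReLU()

    def forward(self, x):                             # x: (batch, C, T frames, N joints)
        y = self.spatial(x)
        y = torch.einsum("bctn,nm->bctm", y, self.A)  # aggregate neighboring joints
        y = self.temporal(self.relu(y))               # mix the same joint across frames
        return self.drop(self.relu(y + self.res(x)))
```

Stacking several such blocks, pooling globally over time and joints, and feeding the result to a SoftMax classifier then yields the per-frame action-label index described above.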
The trained graph convolutional neural network directly analyzes the action sequence of the human body over continuous time and space in the real-time video stream and judges whether a smoking action exists. Meanwhile, a target detection neural network is introduced to detect cigarettes in the video frames; combining the graph-convolution analysis of the human key-point sequence with the target detection makes smoking-behavior recognition highly accurate and robust.
Fig. 2 shows a model training phase flow provided by the embodiment of the present application, and fig. 3 shows a model prediction phase flow provided by the embodiment of the present application. The smoking behavior recognition method provided by the embodiment of the present application is further explained with reference to fig. 2 and fig. 3, and specifically may include the following steps:
step 1: preprocessing a data set: the frame rate and the resolution of the video in the data set are normalized, the frame number of the video is made to be equal, and the time length (the number of frames) of the graph convolution neural network is ensured to be the same. Meanwhile, in order to detect whether cigarettes exist in the video frame, if the cigarette position marked on the video frame exists, a new data set is formed.
Step 2: key point detection: key points are predicted on the processed video frames with a human body posture estimation neural network. If more than 5 people appear in a single frame, the 5 people with the highest confidence are kept, and the human key-point position sequences of all frames of the video are saved to a JSON file. Each person has 18 key points, and each key point contains 3 features (coordinates x, y and a confidence); if some key points of a person are not predicted, they are set to (0, 0, 0).
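A sketch of the per-video JSON layout implied by this step follows; the field names are hypothetical, while the 18 key points, the 3 features per key point, the top-5-person rule and the (0, 0, 0) fill value come from the text above:

```python
import json

def save_keypoints(frames, path, max_people=5):
    """frames: one entry per video frame; each entry is a list of people, each a
    dict with a detection 'score' and 'keypoints' = 18 (x, y, confidence) triples,
    with None for key points the pose estimator failed to predict."""
    data = []
    for people in frames:
        top = sorted(people, key=lambda p: p["score"], reverse=True)[:max_people]
        data.append([{"keypoints": [kp if kp is not None else (0.0, 0.0, 0.0)
                                    for kp in p["keypoints"]]} for p in top])
    with open(path, "w") as f:
        json.dump(data, f)
```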
Step 3: graph convolutional neural network: a skeleton spatio-temporal graph G = (V, E) is constructed from the JSON file of each video produced in step 2, with node set V = {v_ti | t = 1, ..., T; i = 1, ..., N}, where T is the number of frames and N is the number of key points; E comprises Es and Ef, where Es connects key points within a single frame along the human-body structure and Ef connects each key point to the node of the same type in consecutive video frames. Graph convolution automatically captures the spatial configuration and temporal dynamics of the key points: spatially, convolution learns local features of adjacent key points in the same frame; temporally, convolution learns features between corresponding key points across frames; the spatial and temporal convolutions alternate to extract higher-level feature maps, and a classifier finally assigns each feature map to the corresponding behavior category, i.e., gives a behavior label to each person detected in each frame. Target (cigarette) detection neural network: a target detection neural network model is trained with the cigarette data set labeled in step 1 to detect the cigarette position in the image.
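The embodiment does not name a particular target detection network; purely as a stand-in, the sketch below prepares a torchvision Faster R-CNN for the two-class (background vs. cigarette) detection task:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_cigarette_detector(num_classes=2):      # class 0: background, class 1: cigarette
    """Fine-tuning setup: load a COCO-pretrained Faster R-CNN and replace its
    box predictor head so that it outputs the cigarette class."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# At inference time, model([image_tensor]) returns dicts with 'boxes', 'labels'
# and 'scores'; the top cigarette score serves as the per-frame cigarette confidence.
```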
Step 4: in fig. 3, an actually acquired video serves as the input: the human body posture estimation neural network directly predicts the key points of the human bodies in the frame sequence, and the result is passed to the graph convolutional neural network, which predicts the behavior label in each frame; meanwhile, the frame sequence is also passed to the target detection neural network, which detects whether cigarettes appear in the frames; finally, the two results are fused to judge whether smoking behavior exists in each frame, and a smoking label is applied in the video.
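A hedged sketch of this final fusion step follows; the rule that both cues must agree and the 0.5 confidence threshold are assumptions about one reasonable implementation, since the exact fusion rule is not spelled out here:

```python
def frame_has_smoking(behavior_label, cigarette_conf, thresh=0.5):
    """Fuse the per-frame behavior label from the graph convolutional network
    with the cigarette confidence from the target detector: a frame is flagged
    only when both cues agree."""
    return behavior_label == "smoking" and cigarette_conf >= thresh

def tag_video(labels, confs):
    """Flag each frame, then label the whole clip if any frame fires."""
    flags = [frame_has_smoking(l, c) for l, c in zip(labels, confs)]
    return any(flags), flags
```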
Step 5: human-computer interaction interface: an operator supplies an RTSP address or an original video to be analyzed, the pre-trained deep neural network models are loaded and instructed to perform the smoking-behavior detection, and the interface graphically displays the original input video together with the video annotated with behavior labels and key points.
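For the video-input side of such an interface, a minimal OpenCV sketch is given below; the RTSP URL is a placeholder and `analyze_frame` is a hypothetical stand-in for the detection pipeline described above:

```python
import cv2

def analyze_frame(frame):
    """Hypothetical stub: run pose estimation, graph convolution and cigarette
    detection, and draw behavior labels and key points onto the frame."""
    return frame

def run(source="rtsp://camera.example/stream"):   # placeholder URL or a file path
    cap = cv2.VideoCapture(source)                # VideoCapture accepts RTSP URLs
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow("original", frame)
        cv2.imshow("annotated", analyze_frame(frame))
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```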
In practical application, when a person is about to smoke in a no-smoking area, i.e., at the moment the person takes out a cigarette, the graph convolutional neural network and the target detection neural network can quickly and accurately judge whether potential smoking behavior exists and issue a warning in time, preventing situations in which combustible goods are given the conditions to ignite or a warehouse is left in a smoldering state. Meanwhile, the person involved can be traced through the graphical display, so that fire hazards are averted.
The embodiment of the application has the following beneficial effects:
the accuracy is high, the smoking behavior is identified by adopting the graph convolution neural network and the target detection neural network together, and the accuracy rate reaches 95.52 percent. High efficiency, and the real-time video analysis speed can reach 26 frames/second. The edge embedded equipment can be deployed flexibly, namely the edge embedded equipment can be deployed quickly, the edge embedded equipment can also be deployed in a centralized manner at the cloud end, and the edge embedded equipment can be used in the field only by one common camera, so that the edge embedded equipment is suitable for popularization and large-scale deployment.
In summary, the embodiment of the present application provides a smoking behavior identification method that, for a collected video to be analyzed, detects whether cigarettes exist in the video frames and labels the cigarettes in the frames that contain them to form a cigarette data set; inputs the video to be analyzed into a human body posture estimation neural network to obtain the key points of the human body in the frame sequence, and inputs the key points into a graph convolutional neural network to obtain the behavior label of each person in each frame; inputs the frame sequence of the cigarette data set into a target detection neural network to detect the cigarette position in the frame images and obtain a cigarette confidence for each frame; and judges whether smoking behavior exists in each frame from the per-frame behavior label and cigarette confidence, applying a smoking label in videos where smoking behavior exists. Taking the human key-point sequence as input and combining graph convolution with a target detection deep neural network, the method detects potential smoking behavior in real time while ensuring accuracy, effectively preventing fires.
Based on the same technical concept, an embodiment of the present application further provides a smoking behavior recognition system, as shown in fig. 4, the system includes:
the data preprocessing module 401 is configured to detect whether a cigarette exists in a video frame of the acquired video to be analyzed, and label the cigarette in the video frame where the cigarette exists to form a cigarette data set.
The key point detection module 402 is configured to input the video to be analyzed into the human body posture estimation neural network to obtain the key points of the human body in the frame sequence, and to input the key points into the graph convolutional neural network to obtain the behavior label of each person in each frame.
A cigarette detection module 403, configured to input the frame sequence of the cigarette data set into a target detection neural network to detect a cigarette position in a frame image, so as to obtain a cigarette confidence of each frame.
The smoking action recognition module 404 is configured to judge whether smoking behavior exists in each frame according to the behavior label and the cigarette confidence of each frame, and to apply a smoking label in videos in which smoking behavior exists.
In a possible implementation, the key point detection module 402 is specifically configured to: input the video to be analyzed into the human body posture estimation neural network, which predicts the whole-body joint connections of each person in each video frame; for each video segment, take a frame sequence of a set length and connect the same joint across frames to form a graph structure, from which the key points of the human body in the frame sequence are obtained.
In a possible implementation manner, the key point detection module 402 is specifically configured to: construct a skeleton spatio-temporal graph G = (V, E) from the video frames, with node set V = {v_ti | t = 1, ..., T; i = 1, ..., N}, where T is the number of frames and N is the number of key points, and edge set E comprising Es and Ef, where Es connects key points within a single frame along the human-body structure and Ef connects each key point to the node of the same type in consecutive video frames; automatically capture the spatial configuration and temporal dynamics of the key points with the graph convolutional neural network, where spatially, convolution learns local features of adjacent key points in the same frame, and temporally, convolution learns features between corresponding key points across frames, the spatial and temporal convolutions alternating to extract a higher-level feature map; and classify the feature map into the corresponding behavior category with a classifier to obtain the behavior label of each person in each frame.
In a possible implementation, the data preprocessing module 401 is further configured to: normalize, for the collected video to be analyzed, the frame rate and resolution of the video so that the temporal lengths (numbers of frames) processed by the graph convolutional neural network are consistent.
Based on the same technical concept, the embodiment of the present application further provides an apparatus, including: the device comprises a data acquisition device, a processor and a memory; the data acquisition device is used for acquiring data; the memory is to store one or more program instructions; the processor is configured to execute one or more program instructions to perform the method.
Based on the same technical concept, the embodiment of the present application further provides a computer-readable storage medium, which contains one or more program instructions for executing the method.
In the present specification, the method embodiments are described in a progressive manner; identical and similar parts of the embodiments can be referred to in one another, and each embodiment focuses on its differences from the others. For the system, apparatus and storage-medium embodiments, reference may be made to the corresponding parts of the description of the method embodiments.
It should be noted that although the operations of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
Although the present application provides method steps as in embodiments or flowcharts, additional or fewer steps may be included based on conventional or non-inventive approaches. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an apparatus or client product in practice executes, it may execute sequentially or in parallel (e.g., in a parallel processor or multithreaded processing environment, or even in a distributed data processing environment) according to the embodiments or methods shown in the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in processes, methods, articles, or apparatus that include the recited elements is not excluded.
The units, devices, modules, etc. set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, in implementing the present application, the functions of each module may be implemented in one or more software and/or hardware, or a module implementing the same function may be implemented by a combination of a plurality of sub-modules or sub-units, and the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present application in detail, and it should be understood that the above-mentioned embodiments are only examples of the present application and are not intended to limit the scope of the present application, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present application should be included in the scope of the present application.

Claims (8)

1. A smoking behaviour recognition method, characterised in that the method comprises:
detecting, for the collected video to be analyzed, whether cigarettes exist in the video frames, and labeling the cigarettes in the video frames that contain them, to form a cigarette data set;
inputting the video to be analyzed into a human body posture estimation neural network, predicting the whole-body joint connections of each person in each video frame, taking for each video segment a frame sequence of a set length, and connecting the same joint across frames to form a graph structure, so as to obtain the key points of the human body in the frame sequence; and inputting the key points into a graph convolutional neural network to obtain a behavior label of each person in each frame;
inputting the frame sequence of the cigarette data set into a target detection neural network to detect the cigarette position in a frame image and obtain the cigarette confidence of each frame;
and judging whether smoking behavior exists in each frame according to the behavior label and the cigarette confidence of each frame, and applying a smoking label in the video in which smoking behavior exists.
2. The method of claim 1, wherein said inputting the key points into a graph convolutional neural network to obtain the behavior label of each person in each frame comprises:
constructing a skeleton spatio-temporal graph G = (V, E) from the video frames; the node set V = {v_ti | t = 1, ..., T; i = 1, ..., N}, where T is the number of frames and N is the number of key points; the edge set E comprises Es and Ef, where Es connects key points within a single frame along the human-body structure, and Ef connects each key point to the node of the same type in consecutive video frames;
automatically capturing the spatial configuration and temporal dynamics of the key points with the graph convolutional neural network: spatially, convolution learns local features of adjacent key points within the same frame; temporally, convolution learns features between corresponding key points across frames; the spatial and temporal convolutions alternate to extract a higher-level feature map;
and classifying the feature map into corresponding behavior categories by using a classifier to obtain behavior labels of people in each frame.
3. The method of claim 1, wherein prior to detecting whether a cigarette is present in the video frame, the method further comprises:
and aiming at the collected video to be analyzed, normalizing the frame rate and the resolution of the video so as to enable the time lengths in the graph convolution neural network to be consistent.
4. A smoking behaviour recognition system, characterised in that the system comprises:
a data preprocessing module, used for detecting, for the collected video to be analyzed, whether cigarettes exist in the video frames, and labeling the cigarettes in the video frames that contain them to form a cigarette data set;
a key point detection module, used for inputting the video to be analyzed into a human body posture estimation neural network, predicting the whole-body joint connections of each person in each video frame, taking for each video segment a frame sequence of a set length, and connecting the same joint across frames to form a graph structure, so as to obtain the key points of the human body in the frame sequence, and for inputting the key points into a graph convolutional neural network to obtain a behavior label of each person in each frame;
the cigarette detection module is used for inputting the frame sequence of the cigarette data set into a target detection neural network so as to detect the cigarette position in a frame image and obtain the cigarette confidence of each frame;
and a smoking action recognition module, used for judging whether smoking behavior exists in each frame according to the behavior label and the cigarette confidence of each frame, and applying a smoking label in the video in which smoking behavior exists.
5. The system of claim 4, wherein the keypoint detection module is specifically configured to:
constructing a skeleton spatio-temporal graph G = (V, E) from the video frames; the node set V = {v_ti | t = 1, ..., T; i = 1, ..., N}, where T is the number of frames and N is the number of key points; the edge set E comprises Es and Ef, where Es connects key points within a single frame along the human-body structure, and Ef connects each key point to the node of the same type in consecutive video frames;
automatically capturing the spatial configuration and temporal dynamics of the key points with the graph convolutional neural network: spatially, convolution learns local features of adjacent key points within the same frame; temporally, convolution learns features between corresponding key points across frames; the spatial and temporal convolutions alternate to extract a higher-level feature map;
and classifying the feature map into corresponding behavior categories by using a classifier to obtain the behavior label of the person in each frame.
6. The system of claim 4, wherein the data pre-processing module is further to:
and aiming at the collected video to be analyzed, normalizing the frame rate and the resolution of the video so as to enable the time lengths in the graph convolution neural network to be consistent.
7. An apparatus, characterized in that the apparatus comprises a data acquisition device, a processor and a memory;
the data acquisition device is used for acquiring data; the memory is to store one or more program instructions; the processor, configured to execute one or more program instructions to perform the method of any of claims 1-3.
8. A computer-readable storage medium having one or more program instructions embodied therein for performing the method of any of claims 1-3.
CN202010955978.9A 2020-09-11 2020-09-11 Smoking behavior recognition method, system, equipment and readable storage medium Active CN112163469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010955978.9A CN112163469B (en) 2020-09-11 2020-09-11 Smoking behavior recognition method, system, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010955978.9A CN112163469B (en) 2020-09-11 2020-09-11 Smoking behavior recognition method, system, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112163469A CN112163469A (en) 2021-01-01
CN112163469B true CN112163469B (en) 2022-09-02

Family

ID=73858923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010955978.9A Active CN112163469B (en) 2020-09-11 2020-09-11 Smoking behavior recognition method, system, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112163469B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065474B (en) * 2021-04-07 2023-06-27 泰豪软件股份有限公司 Behavior recognition method and device and computer equipment
CN113392706A (en) * 2021-05-13 2021-09-14 上海湃道智能科技有限公司 Device and method for detecting smoking and using mobile phone behaviors
CN113688667A (en) * 2021-07-08 2021-11-23 华中科技大学 Deep learning-based luggage taking and placing action recognition method and system
CN113609963B (en) * 2021-08-03 2022-10-11 北京睿芯高通量科技有限公司 Real-time multi-human-body-angle smoking behavior detection method
CN116977883A (en) * 2022-04-22 2023-10-31 中兴通讯股份有限公司 Smoking behavior detection method and device and related equipment
CN114783061B (en) * 2022-04-26 2023-04-18 南京积图网络科技有限公司 Smoking behavior detection method, device, equipment and medium
CN115205767A (en) * 2022-09-16 2022-10-18 浪潮通信信息系统有限公司 Smoking behavior detection method, system and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144263A (en) * 2019-12-20 2020-05-12 山东大学 Construction worker high-fall accident early warning method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472573A (en) * 2019-08-14 2019-11-19 北京思图场景数据科技服务有限公司 A kind of human body behavior analysis method, equipment and computer storage medium based on body key point
CN111274930B (en) * 2020-04-02 2022-09-06 成都鼎安华智慧物联网股份有限公司 Helmet wearing and smoking behavior identification method based on deep learning
CN111582333A (en) * 2020-04-23 2020-08-25 浙江大学 Lightning arrester picture carrying state identification method combining Yolo-v3 and Open-pos

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144263A (en) * 2019-12-20 2020-05-12 山东大学 Construction worker high-fall accident early warning method and device

Also Published As

Publication number Publication date
CN112163469A (en) 2021-01-01

Similar Documents

Publication Publication Date Title
CN112163469B (en) Smoking behavior recognition method, system, equipment and readable storage medium
Lestari et al. Fire hotspots detection system on CCTV videos using you only look once (YOLO) method and tiny YOLO model for high buildings evacuation
CN108229407A (en) A kind of behavioral value method and system in video analysis
CN111325137B (en) Violent sorting detection method, device, equipment and storage medium
CN111126153B (en) Safety monitoring method, system, server and storage medium based on deep learning
CN111553326B (en) Hand motion recognition method and device, electronic equipment and storage medium
CN110569843B (en) Intelligent detection and identification method for mine target
Piciarelli et al. Surveillance-oriented event detection in video streams
CN111738218B (en) Human body abnormal behavior recognition system and method
CN110619324A (en) Pedestrian and safety helmet detection method, device and system
KR20190088087A (en) method of providing categorized video processing for moving objects based on AI learning using moving information of objects
Yandouzi et al. Investigation of combining deep learning object recognition with drones for forest fire detection and monitoring
US20230186634A1 (en) Vision-based monitoring of site safety compliance based on worker re-identification and personal protective equipment classification
CN111444850A (en) Picture detection method and related device
CN111931573A (en) Helmet detection and early warning method based on YOLO evolution deep learning model
CN116403162B (en) Airport scene target behavior recognition method and system and electronic equipment
CN115187884A (en) High-altitude parabolic identification method and device, electronic equipment and storage medium
CN116823884A (en) Multi-target tracking method, system, computer equipment and storage medium
CN111753601B (en) Image processing method, device and storage medium
CN115713806A (en) Falling behavior identification method based on video classification and electronic equipment
Tian et al. Glance and stare: Trapping flying birds in aerial videos by adaptive deep spatio-temporal features
Ali et al. Real-time safety monitoring vision system for linemen in buckets using spatio-temporal inference
CN116129523A (en) Action recognition method, device, terminal and computer readable storage medium
CN111191575B (en) Naked flame detection method and system based on flame jumping modeling
CN114663972A (en) Target marking method and device based on motion difference

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant