WO2021114892A1 - Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium - Google Patents

Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium

Info

Publication number
WO2021114892A1
Authority
WO
WIPO (PCT)
Prior art keywords: human body, frame, posture, video stream, neural network
Application number
PCT/CN2020/123214
Other languages: French (fr), Chinese (zh)
Inventors: 冯颖龙, 付佐毅, 周宸, 周宝, 陈远旭
Original Assignee: 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司
Publication of WO2021114892A1

Classifications

    • G06V 40/20: Image or video recognition or understanding; recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06N 3/045: Computing arrangements based on biological models; neural networks; architectures, e.g. interconnection topology; combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/25: Image or video recognition or understanding; image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 20/41: Scenes; scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • Y02D 10/00: Climate change mitigation technologies in information and communication technologies [ICT]; energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of video image processing technology, and also to the field of artificial intelligence, and in particular to a method, device, equipment, and storage medium for human behavior recognition based on environmental semantic understanding.
  • The mainstream prior-art approach to human body posture recognition uses top-down or bottom-up algorithms.
  • The inventor realized that bottom-up algorithms carry a high probability of misrecognition: for example, a chair or a robot standing in a warehouse may be mistaken for a human body and a human posture predicted from it. Such misrecognition severely degrades the algorithm's recognition accuracy and limits its usage scenarios, and the model's instability adds considerable uncertainty to the algorithm's applications. Bottom-up algorithms also increase the time and space complexity of the computation. Top-down algorithms, in turn, estimate poses in complex multi-person scenes with low accuracy and low speed.
  • The purpose of this application is to provide a human behavior recognition method, device, equipment, and storage medium based on environmental semantic understanding, which can solve the prior-art problems of low accuracy in human posture recognition and in action detection.
  • One technical solution adopted in this application provides a human behavior recognition method based on environmental semantic understanding, including: detecting the human bodies and items contained in each frame of image in a video stream; performing posture recognition on each human body contained in each detected frame to obtain each body's posture; inputting the posture of each human body in consecutive frames into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first action recognition result includes the occurrence probability of each action category for each human body; obtaining the items around each human body, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, and inputting the postures and surrounding items into a pre-trained second convolutional neural network used for fall action recognition to obtain a second action recognition result, which includes the probability of each human body falling; and outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Another technical solution adopted in this application is to provide a human body behavior recognition device based on the understanding of environmental semantics, including:
  • the target detection module is used to detect the human body and objects contained in each frame of the video stream;
  • the posture recognition module is used to recognize the posture of each human body contained in each frame of the detected image to obtain the posture of each human body;
  • the general action classification module is used to input the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;
  • the fall action recognition module is used to obtain the items around each human body and input the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;
  • the output module is configured to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Another technical solution adopted in this application provides an electronic device including a processor and a memory coupled to the processor, where the memory stores program instructions executable by the processor; when the processor executes the program instructions stored in the memory, the following steps are implemented:
  • detect the human bodies and items contained in each frame of image in the video stream; perform posture recognition on each human body contained in each detected frame to obtain each body's posture; input the postures in consecutive frames into the pre-trained first convolutional neural network to obtain the first action recognition result, which includes the occurrence probability of each action category for each human body; obtain the items around each human body (items whose distance from the human body in each frame of image is less than or equal to a preset threshold) and input the postures and surrounding items into the pre-trained second convolutional neural network used for fall action recognition to obtain the second action recognition result, which includes the probability of each human body falling; and output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Another technical solution adopted in this application is to provide a storage medium in which program instructions are stored, and when the program instructions are executed by a processor, the following steps are implemented:
  • detect the human bodies and items contained in each frame of image in the video stream; perform posture recognition on each human body contained in each detected frame to obtain each body's posture; input the postures in consecutive frames into the pre-trained first convolutional neural network to obtain the first action recognition result, which includes the occurrence probability of each action category for each human body; obtain the items around each human body (items whose distance from the human body in each frame of image is less than or equal to a preset threshold) and input the postures and surrounding items into the pre-trained second convolutional neural network used for fall action recognition to obtain the second action recognition result, which includes the probability of each human body falling; and output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • The human behavior recognition method, device, and storage medium based on environmental semantic understanding of this application first detect the human bodies and items contained in each frame of image in a video stream, and then perform posture recognition on each human body contained in each detected frame to obtain each body's posture; the postures are input into the first convolutional neural network to obtain the occurrence probability of each action category for each body, the postures and the items around each body are input into the second convolutional neural network to obtain the probability of each body falling, and the behavior recognition result is then output according to these probabilities.
  • In this way, items are prevented from being misrecognized as human bodies during posture recognition, and the accuracy and real-time performance of human posture recognition are improved.
  • The first convolutional neural network performs general action recognition, while the second convolutional neural network uses the body's posture and surrounding items for fall recognition, which improves the accuracy of action detection and provides good robustness against unstable human posture estimates.
  • Fig. 1 is a flowchart of a human body behavior recognition method based on understanding of environmental semantics according to the first embodiment of this application;
  • FIG. 2 is a flowchart of a human body behavior recognition method based on understanding of environmental semantics according to a second embodiment of this application;
  • FIG. 3 is a schematic structural diagram of a human body behavior recognition device based on understanding of environmental semantics according to a third embodiment of this application;
  • FIG. 4 is a schematic structural diagram of an electronic device based on environmental semantics understanding according to a fourth embodiment of this application;
  • FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the application.
  • The terms “first”, “second”, and “third” in this application are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Features qualified by “first”, “second”, or “third” may thus explicitly or implicitly include at least one such feature.
  • In this application, “plurality” means at least two, for example two or three, unless specifically defined otherwise. All directional indicators in the embodiments of this application (such as up, down, left, right, front, back, and so on) are used only to explain the relative positional relationship and movement of components in a specific posture (as shown in the drawings); if that specific posture changes, the directional indication changes accordingly.
  • FIG. 1 is a schematic flowchart of a human behavior recognition method based on understanding of environmental semantics according to a first embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 1. As shown in Figure 1, the human behavior recognition method based on the understanding of environmental semantics includes the following steps:
  • S101 Detect human bodies and objects contained in each frame of image in the video stream.
  • In step S101, the video stream includes multiple consecutive video frames captured by a robot, or any several video frames among those consecutive frames.
  • In step S101, based on an understanding of the environment's semantic information, the human bodies and items in the environment are detected: the video stream is input into a pre-trained deep learning network to obtain the human bodies and items contained in each frame of the video stream. The deep learning network is used for target prediction, where targets include human bodies and items.
  • This end-to-end deep learning network includes multiple convolutional layers, multiple max-pooling layers, and a fully connected layer, for example 23 convolutional layers and 5 max-pooling layers followed by a final fully connected layer for classification and regression.
  • Specifically, each frame of the video stream is divided into multiple grid cells according to a preset division scheme. In each cell, target prediction is performed with preset detection boxes of different types; for each detection box, the coordinates (x, y) of the predicted target, the width and height (w, h) of the box, and the confidence (Ptr) of the box are obtained, and the box with the highest confidence is taken as the detection result.
  • The prediction result includes the target, the detection box, the target's coordinate parameters, and the target's category; the detection box frames the target's bounding region, and the target categories include human bodies and items. The human bodies and items contained in each frame of the video stream are determined according to the prediction result.
  • Each frame of image can be divided into s × s grid cells, and in each cell target prediction is performed with n types of detection boxes to predict target positions and categories. There are m target categories in total, including human body, bed, table, chair, robot, yoga mat, and so on. For each detection box the result comprises the coordinate parameters (x, y), the width and height (w, h), and the confidence (Ptr), i.e. 5 parameters, so the number of output parameters is (s × s × n × (m + 5)); a decoding sketch follows below.
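  • As a concrete illustration of this output layout, the following is a minimal decoding sketch; the tensor layout (x, y, w, h, Ptr, then m category scores per box) and the threshold are assumptions for illustration, not the patent's actual implementation:

```python
import numpy as np

def decode_detections(output, conf_threshold=0.5):
    """Decode an s x s x n x (m + 5) detection tensor as described above."""
    s = output.shape[0]
    detections = []
    for i in range(s):
        for j in range(s):
            boxes = output[i, j]                  # n candidate boxes for this grid cell
            best = boxes[np.argmax(boxes[:, 4])]  # keep the highest-confidence box per cell
            x, y, w, h, ptr = best[:5]
            if ptr >= conf_threshold:
                category = int(np.argmax(best[5:]))  # index into the m categories
                detections.append((float(x), float(y), float(w), float(h), float(ptr), category))
    return detections

# Example: s=7 cells, n=2 box types, m=6 categories (human, bed, table, ...).
dummy = np.random.rand(7, 7, 2, 6 + 5)
print(len(decode_detections(dummy, conf_threshold=0.9)))
```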
  • To predict the category and position of targets in an image, the deep learning network is trained. The specific process is as follows: for each sample image in the sample image set, targets are annotated with rectangular detection boxes; the deep learning network predicts the position and category of each target in the sample image, and the network's error is determined according to the prediction result and the targets' annotation information.
  • The error is determined by the network's loss function, which includes a coordinate prediction loss function, a confidence loss function, and a category loss function, whose terms use the following variables (a reconstruction of the formulas follows this list):
  • $P_{ij}$ indicates whether the center point of the target predicted by the j-th detection box lies in the i-th grid cell;
  • $u_i$ and $v_i$ are the predicted abscissa and ordinate of the target's center point in the i-th grid cell, and $\hat{u}_i$ and $\hat{v}_i$ are the annotated ones;
  • $w_i$ and $h_i$ are the width and height of the detection box of the predicted target whose center lies in the i-th grid cell, and $\hat{w}_i$ and $\hat{h}_i$ are the annotated ones;
  • $\mathrm{Conf}_i$ is the predicted confidence and $\widehat{\mathrm{Conf}}_i$ is the annotated confidence;
  • $P_i$ indicates whether the network predicts that a target's center point exists in the i-th grid cell;
  • $P_i(m)$ is the predicted probability that the target in the i-th grid cell belongs to category m, and $\hat{P}_i(m)$ is the annotated probability that it belongs to category m.
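  • The loss terms themselves appear as formula images in the published document. A squared-error reconstruction consistent with the variable definitions above (the patent's exact weighting may differ) is:

$$L_{\text{coord}}=\sum_{i=1}^{s^{2}}\sum_{j=1}^{n}P_{ij}\left[(u_i-\hat{u}_i)^2+(v_i-\hat{v}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right]$$

$$L_{\text{conf}}=\sum_{i=1}^{s^{2}}\sum_{j=1}^{n}P_{ij}\left(\mathrm{Conf}_i-\widehat{\mathrm{Conf}}_i\right)^2$$

$$L_{\text{cls}}=\sum_{i=1}^{s^{2}}P_i\sum_{m=1}^{M}\left(P_i(m)-\hat{P}_i(m)\right)^2$$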
  • S102 Perform posture recognition on each human body included in each frame of detected images to obtain the posture of each human body.
  • The posture of a human body comprises the positions of its joint points and the connections between them; the joint points include the head, the left and right shoulder joint points, the neck joint point, the waist joint point, the left knee joint point, and so on.
  • Each human body contained in each frame of image is input into a pre-trained human posture detection network to obtain its posture. The posture detection network comprises a feedforward neural network for extracting high-dimensional features, a joint point position prediction network, and a joint point relationship prediction network.
  • The feedforward neural network comprises 10 convolutional layers and 2 pooling layers and extracts high-dimensional features of the human bodies contained in each frame; the joint point position prediction network comprises 5 convolutional layers and outputs the confidence of the j-th joint point of the k-th human body in each frame, which is used to locate the body's joint points from the high-dimensional features; the joint point relationship prediction network estimates the connection direction between pairs of joint points, and the connections between joint points are determined according to the joint point positions.
  • The positions of the joint points belonging to the same human body, together with the lines between those joint points, are taken as that body's posture.
  • A connection is a link that represents a specific structure of the human body; for example, an arm can only be represented by connecting the wrist joint point and the elbow joint point. Given the body's structure, there is therefore only one valid way to connect a set of human joint points, and the joint points and their connections together depict the body's posture.
  • The step of determining the lines between joint points according to the joint point positions proceeds as follows (see the code sketch after this list).
  • First, the direction vector between two joint points is obtained from their positions and decomposed into a component parallel to the joint and a component perpendicular to it: if the first joint point (at position a1) and the second joint point (at position a2) are the two ends of a first joint (for example, the left arm or the right arm), the direction vector from a1 to a2 is decomposed into a parallel direction vector $\mathbf{v}_{\parallel}$ and a perpendicular direction vector $\mathbf{v}_{\perp}$.
  • Second, for each pixel lying between the two joint points, whether the pixel is located on the first joint is determined from the pixel's position and the joint points' direction vectors: with L the length of the first joint, w its width, and p the pixel's position, pixel p is associated with the first joint point (a1) and the second joint point (a2) when $0 \le (p-a_1)\cdot\mathbf{v}_{\parallel} \le L$ and $|(p-a_1)\cdot\mathbf{v}_{\perp}| \le w$.
  • Third, if the pixel is located on the first joint, the correlation between the two joint points and the first joint is computed with a correlation function; the two joint points with the highest correlation are taken as the two ends of the first joint, and a line is created between them.
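  • A minimal sketch of this pixel-on-joint test, assuming 2-D joint coordinates and the parallel/perpendicular decomposition described above (function and variable names are illustrative):

```python
import numpy as np

def on_limb(p, a1, a2, limb_width):
    """Return True if pixel p lies on the joint (limb) between a1 and a2."""
    v = a2 - a1                                 # direction vector between the two joint points
    L = np.linalg.norm(v)                       # length of the joint
    v_par = v / L                               # unit vector parallel to the joint
    v_perp = np.array([-v_par[1], v_par[0]])    # unit vector perpendicular to the joint
    d = p - a1
    along = np.dot(d, v_par)                    # parallel offset along the joint
    across = abs(np.dot(d, v_perp))             # perpendicular offset from the joint axis
    return 0.0 <= along <= L and across <= limb_width

# Example: is pixel (3, 1) on a joint from (0, 0) to (6, 0) of width 2?
print(on_limb(np.array([3.0, 1.0]), np.zeros(2), np.array([6.0, 0.0]), 2.0))  # True
```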
  • S103 Input the posture of each human body in consecutive frames of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result, which includes the occurrence probability of each action category for each human body.
  • The first convolutional neural network classifies general actions and is a graph convolutional neural network. Step S103 specifically includes the following steps:
  • a graph convolution operation is performed on the joint points within each frame and a time convolution operation on corresponding joint points across frames; a fully connected layer then classifies actions according to the features output by the graph convolution operation and the time convolution operation, obtaining the occurrence probability of each action category for each human body, where (a reconstruction follows this list):
  • $g_{out}$ is the classification result and $f_{in}$ is the input feature map;
  • $p(\cdot)$ is the sampling function, which samples the joint points $v_{tj}$ closest to the current joint point $v_{ti}$;
  • $x$ is a joint point position, $w$ is the weight function, and $K$ is the size of the convolution kernel;
  • $r_i$ is the distance from the current joint point $v_{ti}$ to the center of the human body, and $r_j$ is the distance from the neighboring joint point $v_{tj}$ to the center of the human body;
  • for the time convolution, $\Gamma$ is the sampling time window size, $q$ is the sampling time, and $t$ is the current time.
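  • The graph convolution formula itself appears as an image in the published document. A sketch of a standard spatial-temporal graph convolution consistent with the variables above (the neighbor set $B$, the partition label $l$, and the normalizer $Z$ are assumptions) is:

$$g_{out}(v_{ti})=\sum_{v_{tj}\in B(v_{ti})}\frac{1}{Z_{ti}(v_{tj})}\,f_{in}\big(p(v_{ti},v_{tj})\big)\cdot w\big(l_{ti}(v_{tj})\big),\qquad B(v_{ti})=\{v_{qj}: d(v_{tj},v_{ti})\le K,\ |q-t|\le \lfloor \Gamma/2 \rfloor\}$$

  • Here $l_{ti}(v_{tj})$ partitions the neighbors by comparing $r_j$ with $r_i$ (equal to, closer to, or farther from the body center than the current joint point).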
  • S104 Obtain the items around each human body and input each body's posture and surrounding items in consecutive frames of the video stream into the pre-trained second convolutional neural network to obtain the second action recognition result; the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold.
  • Fall recognition is performed from the body's posture, the items around it, and the items' positions relative to it. For example, once a fallen-looking body and the semantic and position information of surrounding tables and chairs are identified, the judgment is refined: if the apparently fallen person is very close to a table and chair, they likely did not fall, whereas if they are far from the table and chair, a fall is more probable; and if the apparently fallen body is detected lying on a bed or a yoga mat, it can be judged that the person has not fallen but is simply lying down or exercising. Semantic information about the surrounding environment can thus greatly improve the accuracy of action detection, as sketched below.
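  • A minimal sketch of the surrounding-item filter and this kind of context-based refinement; the thresholds, labels, and hand-written rule are illustrative assumptions (in the patent this reasoning is learned by the second convolutional neural network):

```python
import numpy as np

def surrounding_items(person_center, items, threshold):
    """Keep items whose distance to the person is <= the preset threshold."""
    return [(label, pos) for label, pos in items
            if np.linalg.norm(np.asarray(pos) - np.asarray(person_center)) <= threshold]

def refine_fall_judgment(fall_prob, nearby_labels):
    # A fallen-looking person on or next to a bed or yoga mat is probably
    # lying down or exercising rather than fallen.
    if {"bed", "yoga mat"} & set(nearby_labels):
        return min(fall_prob, 0.1)
    return fall_prob

items = [("table", (1.0, 0.5)), ("bed", (0.2, 0.1)), ("chair", (5.0, 5.0))]
nearby = surrounding_items((0.0, 0.0), items, threshold=2.0)
print(refine_fall_judgment(0.9, [label for label, _ in nearby]))  # 0.1
```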
  • The second convolutional neural network is trained on a sample set comprising the posture of the human body when a fall occurs, the items around the body, and the items' positions relative to the body.
  • the training process of the second convolutional neural network includes:
  • S1041 Obtain first sample images of human bodies performing a fall action and second sample images of human bodies performing fall-like actions, and detect the human bodies and items contained in each first sample image and each second sample image;
  • for each first sample image, obtain the items whose distance from the human body is less than or equal to the preset threshold as the items around the body, and determine the items' positions relative to the body from the body's position and the items' positions; annotate the first sample image with the body's posture, the items around it, and their positions relative to it as fall training features, obtaining a first annotated sample image;
  • for each second sample image, likewise obtain the items whose distance from the human body is less than or equal to the preset threshold as the items around the body and determine their positions relative to the body; annotate the second sample image with the body's posture, the items around it, and their positions relative to it as non-fall training features, obtaining a second annotated sample image;
  • S1045 Input the first annotated sample images and the second annotated sample images into a preset initial neural network for training, so as to obtain the second convolutional neural network.
  • S105 Output a behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Specifically, the first action recognition result and the second action recognition result are weighted to compute adjusted probabilities for the different human action categories and for the human body falling, and the action category with the largest adjusted probability is output as the body's behavior recognition result, as sketched below.
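  • A minimal sketch of such weighted fusion, with illustrative weights (the patent does not disclose specific weight values):

```python
def fuse_results(general_probs, fall_prob, w_general=0.6, w_fall=0.4):
    """general_probs: {category: probability} from the first network;
    fall_prob: probability of falling from the second network."""
    adjusted = {cat: w_general * p for cat, p in general_probs.items()}
    # The fall category combines both networks' evidence.
    adjusted["fall"] = w_general * general_probs.get("fall", 0.0) + w_fall * fall_prob
    return max(adjusted, key=adjusted.get)

print(fuse_results({"walk": 0.5, "sit": 0.2, "fall": 0.3}, fall_prob=0.05))  # 'walk'
```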
  • Fig. 2 is a schematic flowchart of a human behavior recognition method based on understanding of environmental semantics according to a second embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 2. As shown in Figure 2, the human behavior recognition method based on the understanding of environmental semantics includes the following steps:
  • S201 Detect human bodies and objects contained in each frame of image in the video stream.
  • S202 Perform posture recognition on each human body included in each detected frame of image, to obtain the posture of each human body.
  • S203 Perform a de-occlusion operation on the posture of each human body included in each frame of the recognized image.
  • S204 Input the posture of each human body in consecutive frames of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result, which includes the occurrence probability of each action category for each human body.
  • S205 Obtain the items around each human body and input each body's posture and surrounding items in consecutive frames of the video stream into the pre-trained second convolutional neural network to obtain a second action recognition result; the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling.
  • S206 Output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • For steps S201 and S202 and steps S204 to S206, refer to steps S101 to S105 of the first embodiment respectively; details are not repeated here.
  • In step S203, for each detection box that contains multiple human bodies, multiple joint point groups within the box are obtained based on the postures of the bodies in the box, each joint point group comprising multiple joint points belonging to the same human body; a body's detection box frames that body's bounding region in the frame.
  • From these groups, the joint point groups whose left and right shoulder joint points lie inside the detection box are obtained; among them, the group with the largest number of joint points is marked as the target joint point group, and each joint point group corresponds to one human body.
  • The de-occlusion operation of step S203 removes the joint point groups of occluded bodies; the target joint point group corresponds to the unoccluded body, and the action is then classified according to the posture of the body corresponding to the target joint point group, as sketched below.
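  • A minimal sketch of this selection rule, assuming joint groups are dictionaries keyed by joint name (names and layout are illustrative):

```python
def select_target_group(joint_groups, box):
    """joint_groups: list of {joint_name: (x, y)} dicts, one per body;
    box: (x_min, y_min, x_max, y_max) of the detection box."""
    def inside(pt):
        x, y = pt
        return box[0] <= x <= box[2] and box[1] <= y <= box[3]

    # Keep only groups whose left and right shoulder joints lie in the box,
    # then pick the group with the most detected joints (the unoccluded body).
    candidates = [g for g in joint_groups
                  if "l_shoulder" in g and "r_shoulder" in g
                  and inside(g["l_shoulder"]) and inside(g["r_shoulder"])]
    return max(candidates, key=len) if candidates else None

groups = [
    {"l_shoulder": (2, 2), "r_shoulder": (4, 2), "head": (3, 1)},
    {"l_shoulder": (9, 9), "r_shoulder": (11, 9)},   # shoulders outside the box
]
print(select_target_group(groups, box=(0, 0, 6, 8)))  # the first group wins
```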
  • As in the first embodiment, the items around a human body are items whose distance from the human body in each frame of image is less than or equal to the preset threshold.
  • By designing the algorithm to remove occlusion in multi-person overlapping scenes, the method further avoids using an occluded person's pose information to recognize an unoccluded person's behavior, which increases the reliability and accuracy of the algorithm and allows it to be applied in real, complex scenes.
  • After step S206, the following steps are further included:
  • the human behavior recognition method further includes uploading the posture of each human body and the behavior recognition result of each human body to a blockchain, so that the blockchain stores each body's posture and behavior recognition result in encrypted form.
  • Specifically, corresponding summary information is obtained from each body's posture or behavior recognition result; the summary information is obtained by hashing the posture or the behavior recognition result, for example with the SHA-256 algorithm, as in the sketch below.
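  • A minimal sketch of deriving the summary information with SHA-256 (the record layout is an assumption):

```python
import hashlib
import json

def summary_digest(record):
    """Hash a posture or behavior-recognition record with SHA-256."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

print(summary_digest({"person_id": 1, "behavior": "fall", "probability": 0.92}))
```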
  • Uploading the summary information to the blockchain ensures its security as well as fairness and transparency for users; a user device can download the summary information from the blockchain to verify whether the behavior recognition results have been tampered with.
  • The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
  • Fig. 3 is a schematic structural diagram of a human body behavior recognition device based on understanding of environmental semantics according to a third embodiment of the present application.
  • The device 30 includes a target detection module 301, a posture recognition module 302, a general action classification module 303, a fall action recognition module 304, and an output module 305.
  • the target detection module 301 is used to detect the human body and objects contained in each frame of the image in the video stream;
  • the posture recognition module 302 is used to perform posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;
  • the general action classification module 303 is configured to input the posture of each human body in consecutive frames of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;
  • the fall action recognition module 304 is used to obtain the items around each human body and input each body's posture and surrounding items in consecutive frames of the video stream into the pre-trained second convolutional neural network to obtain the second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to the preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;
  • the output module 305 is used to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
  • the electronic device 40 includes a processor 41 and a memory 42 coupled to the processor 41.
  • the memory 42 stores program instructions for realizing human behavior recognition based on environmental semantic understanding in any of the above embodiments.
  • the processor 41 is configured to execute program instructions stored in the memory 42 to perform human behavior recognition based on the understanding of environmental semantics.
  • The processor 41 may also be referred to as a CPU (Central Processing Unit). The processor 41 may be an integrated circuit chip with signal processing capabilities.
  • The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the application.
  • The storage medium of this embodiment stores program instructions 51 capable of implementing all of the methods above; the storage medium may be non-volatile or volatile.
  • The program instructions 51 may be stored in the storage medium in the form of a software product and include several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of this application.
  • Storage media that can hold program code include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like; the instructions may also run on terminal devices such as computers, servers, mobile phones, and tablets.
  • In the several embodiments provided in this application, the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative: the division into units is only a logical functional division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • The functional units in the various embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or as a software functional unit.
  • The above are only embodiments of this application and do not limit the scope of its patent protection. Any equivalent structural or process transformation made using the contents of this application's description and drawings, or any direct or indirect application in other related technical fields, is likewise included within the scope of this application's patent protection.

Abstract

The present application relates to the technical fields of video image processing and artificial intelligence, and specifically to an environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium. The method comprises: detecting the persons and items contained in the frame images of a video stream; performing posture recognition on each detected person in each frame image to obtain the body postures; inputting the body postures into a first convolutional neural network to obtain the probabilities of the person's different action types; inputting the body postures and the items surrounding the person into a second convolutional neural network to obtain the probability of the person falling; and outputting the movement recognition result. The method prevents items from being misrecognized as persons during posture recognition and enhances the accuracy and real-time performance of body posture recognition. The second convolutional neural network uses the body postures and surrounding items to recognize falls, enhancing the accuracy of action detection and providing good robustness against the instability of body posture estimation.

Description

Human behavior recognition method, device, equipment and storage medium based on environmental semantic understanding

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 29, 2020, with application number 202010475795.7 and the invention title "Human Behavior Recognition Method, Device and Storage Medium Based on Understanding of Environmental Semantics", the entire contents of which are incorporated into this application by reference.
[Technical Field]

This application relates to the field of video image processing technology and to the field of artificial intelligence, and in particular to a method, device, equipment, and storage medium for human behavior recognition based on environmental semantic understanding.

[Background Art]

The mainstream prior-art approach to human body posture recognition uses top-down or bottom-up algorithms. The inventor realized that bottom-up algorithms carry a high probability of misrecognition: for example, a chair or a robot standing in a warehouse may be mistaken for a human body and a human posture predicted from it. Such misrecognition severely degrades the algorithm's recognition accuracy and limits its usage scenarios, and the model's instability adds considerable uncertainty to the algorithm's applications. Bottom-up algorithms also increase the time and space complexity of the computation. Top-down algorithms, in turn, estimate poses in complex multi-person scenes with low accuracy and low speed.

After the human pose is estimated, actions must be classified from the pose to recognize human behavior. The prior art mostly uses end-to-end algorithm models for action classification; such models place very high demands on the accuracy of the input human poses and on the quality of the annotated data, so end-to-end action recognition is prone to large deviations and its recognition accuracy is low.

Therefore, it is necessary to provide a new human behavior recognition method to solve the above technical problems.
[Summary of the Invention]

The purpose of this application is to provide a human behavior recognition method, device, equipment, and storage medium based on environmental semantic understanding, which can solve the prior-art problems of low accuracy in human posture recognition and in action detection.

To solve the above technical problems, one technical solution adopted in this application provides a human behavior recognition method based on environmental semantic understanding, including:

detecting the human bodies and items contained in each frame of image in a video stream;

performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;

inputting the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;

obtaining the items around each human body, and inputting the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;

outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
To solve the above technical problems, another technical solution adopted in this application provides a human behavior recognition device based on environmental semantic understanding, including:

a target detection module, used to detect the human bodies and items contained in each frame of image in a video stream;

a posture recognition module, used to perform posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;

a general action classification module, used to input the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;

a fall action recognition module, used to obtain the items around each human body and input the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;

an output module, used to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
To solve the above technical problems, another technical solution adopted in this application provides an electronic device including a processor and a memory coupled to the processor, where the memory stores program instructions executable by the processor; when the processor executes the program instructions stored in the memory, the following steps are implemented:

detecting the human bodies and items contained in each frame of image in a video stream;

performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;

inputting the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;

obtaining the items around each human body, and inputting the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;

outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
To solve the above technical problems, another technical solution adopted in this application provides a storage medium storing program instructions which, when executed by a processor, implement the following steps:

detecting the human bodies and items contained in each frame of image in a video stream;

performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;

inputting the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;

obtaining the items around each human body, and inputting the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;

outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
The beneficial effects of this application are as follows. The human behavior recognition method, device, and storage medium based on environmental semantic understanding first detect the human bodies and items contained in each frame of image in a video stream, and then perform posture recognition on each human body contained in each detected frame to obtain each body's posture; the postures are input into the first convolutional neural network to obtain the occurrence probability of each action category for each human body, the postures and the items around each body are input into the second convolutional neural network to obtain the probability of each body falling, and the behavior recognition result is then output according to these probabilities. In this way, items are prevented from being misrecognized as human bodies during posture recognition, and the accuracy and real-time performance of human posture recognition are improved; the first convolutional neural network performs general action recognition while the second convolutional neural network uses the body's posture and surrounding items for fall recognition, which improves the accuracy of action detection and provides good robustness against unstable human posture estimates.
[Description of the Drawings]

Fig. 1 is a flowchart of a human behavior recognition method based on environmental semantic understanding according to the first embodiment of this application;

Fig. 2 is a flowchart of a human behavior recognition method based on environmental semantic understanding according to the second embodiment of this application;

Fig. 3 is a schematic structural diagram of a human behavior recognition device based on environmental semantic understanding according to the third embodiment of this application;

Fig. 4 is a schematic structural diagram of an electronic device according to the fourth embodiment of this application;

Fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of this application.
[Detailed Description of the Embodiments]

The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.

The terms "first", "second", and "third" in this application are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Features qualified by "first", "second", or "third" may thus explicitly or implicitly include at least one such feature. In the description of this application, "plurality" means at least two, for example two or three, unless specifically defined otherwise. All directional indicators in the embodiments of this application (such as up, down, left, right, front, back, and so on) are used only to explain the relative positional relationship and movement of components in a specific posture (as shown in the drawings); if that specific posture changes, the directional indication changes accordingly. In addition, the terms "including" and "having", and any variations of them, are intended to cover non-exclusive inclusion: a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or other steps or units inherent to the process, method, product, or device.

Reference to an "embodiment" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of this application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Fig. 1 is a schematic flowchart of a human behavior recognition method based on environmental semantic understanding according to the first embodiment of the present application. It should be noted that, as long as substantially the same result is obtained, the method of the present application is not limited to the sequence shown in Fig. 1. As shown in Fig. 1, the method includes the following steps:

S101: Detect the human bodies and items contained in each frame of image in the video stream.

In step S101, the video stream includes multiple consecutive video frames captured by a robot, or any several video frames among those consecutive frames.
在步骤S101中,基于环境语义信息理解,检测出环境中的人体和物品,将视频流输入到预先训练完成的深度学习网络中,获取该视频流中各帧图像包含的人体和物品,该深度学习网络用于目标预测,该目标包括人体和物品,该端到端的深度学习网络包括多层卷积神经网络、多层最大池化层以及全连接层,例如为23层卷积神经网络、5层最大池化层以及最后采用全连接层进行分类和回归,具体地,按照预设的划分方式,将该视频流中各帧图像划分为多个网格;在每个网格中,通过预先设置的不同类型的检测框进行目标预测,针对每个检测框,获取该检测框预测的目标的坐标参数(x,y)、检测框的宽度和高度(w,h)以及检测框的置信度(Ptr),将置信度最高的检测框作为检测结果,该预测结果包括目标、检测框、该目标的坐标参数以及该目标的类别,该检测框为框选出所述目标的外接区域,该目标的类别包括人体和物品;根据所述预测结果确定所述视频流中各帧图像包含的人体和物品。In step S101, based on the understanding of the semantic information of the environment, the human body and objects in the environment are detected, and the video stream is input into the pre-trained deep learning network to obtain the human body and objects contained in each frame of the video stream. The learning network is used for target prediction. The target includes humans and objects. The end-to-end deep learning network includes a multi-layer convolutional neural network, a multi-layer maximum pooling layer, and a fully connected layer, such as a 23-layer convolutional neural network, 5 The maximum pooling layer and finally the fully connected layer is used for classification and regression. Specifically, each frame of the video stream is divided into multiple grids according to a preset division method; in each grid, through pre- Set different types of detection frames for target prediction, for each detection frame, obtain the coordinate parameters (x, y) of the target predicted by the detection frame, the width and height of the detection frame (w, h), and the confidence of the detection frame (Ptr), the detection frame with the highest confidence is used as the detection result. The prediction result includes the target, the detection frame, the coordinate parameters of the target, and the target category. The detection frame selects the circumscribed area of the target. The target categories include human bodies and objects; the human bodies and objects contained in each frame of image in the video stream are determined according to the prediction result.
Each frame of image may be divided into s×s grids. In each grid, target prediction is performed with n types of detection boxes to predict the position and category of targets. There are m prediction categories in total, such as human body, bed, table, chair, robot, and yoga mat. For each detection box the result contains five parameters: the coordinate parameters (x, y), the width and height (w, h), and the confidence (Ptr), so the total number of output parameters is s×s×n×(m+5).
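As an illustration of the grid-based prediction just described, the following minimal sketch decodes a prediction tensor laid out as s×s×n×(m+5), matching the parameter count above. The tensor layout, function names, and the confidence threshold are assumptions chosen for illustration, not part of the disclosed method.

```python
import numpy as np

def decode_detections(pred, s, n, m, conf_thresh=0.5):
    """Decode a grid prediction tensor of shape (s, s, n, m + 5).

    For each of the n detection boxes per grid cell, the last axis is
    assumed to hold (x, y, w, h, Ptr) followed by m class probabilities,
    matching the s x s x n x (m + 5) parameter count in the description.
    """
    detections = []
    for i in range(s):
        for j in range(s):
            for k in range(n):
                x, y, w, h, ptr = pred[i, j, k, :5]
                if ptr < conf_thresh:
                    continue
                class_probs = pred[i, j, k, 5:]
                detections.append({
                    "cell": (i, j),
                    "box": (float(x), float(y), float(w), float(h)),
                    "confidence": float(ptr),
                    "category": int(np.argmax(class_probs)),
                })
    # Sort by confidence so the highest-confidence box can be selected,
    # as the description takes the box with the highest confidence.
    detections.sort(key=lambda d: d["confidence"], reverse=True)
    return detections
```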
To predict the category and position of targets in an image, the deep learning network is trained as follows. For each sample image in the sample image set, targets are annotated with rectangular detection boxes. The deep learning network predicts the position and category of each target in the sample image, and the network error is determined from the prediction result and the annotation of the target. The error is computed with the loss function of the deep learning network, which comprises a coordinate prediction loss, a confidence loss, and a category loss, given respectively as follows:
(1) Coordinate prediction loss:

$$L_{coord}=\sum_{i=1}^{s\times s}\sum_{j=1}^{n}P_{ij}\left[(u_i-\hat{u}_i)^2+(v_i-\hat{v}_i)^2\right]+\sum_{i=1}^{s\times s}\sum_{j=1}^{n}P_{ij}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right]$$

where $P_{ij}$ indicates whether the center point of the target predicted by the $j$-th detection box lies in the $i$-th grid; $u_i$ and $v_i$ are the abscissa and ordinate of the predicted target's center point in the $i$-th grid, and $\hat{u}_i$ and $\hat{v}_i$ are the abscissa and ordinate of the annotated target's center point in the $i$-th grid; $w_i$ and $h_i$ are the width and height of the detection box of the predicted target whose center lies in the $i$-th grid, and $\hat{w}_i$ and $\hat{h}_i$ are the width and height of the annotated detection box;
(2) Confidence loss:

$$L_{conf}=\sum_{i=1}^{s\times s}\sum_{j=1}^{n}P_{ij}\left(Conf_i-\widehat{Conf}_i\right)^2$$

where $P_{ij}$ indicates whether the center point of the target predicted by the $j$-th detection box lies in the $i$-th grid, $Conf_i$ is the predicted confidence, and $\widehat{Conf}_i$ is the annotated confidence;
(3) Category loss:

$$L_{cls}=\sum_{i=1}^{s\times s}P_i\sum_{m\in classes}\left(p_i(m)-\hat{p}_i(m)\right)^2$$

where $P_i$ indicates whether the $i$-th grid contains the center point of a target, $p_i(m)$ is the predicted probability that the target in the $i$-th grid belongs to category $m$, and $\hat{p}_i(m)$ is the annotated probability that the target in the $i$-th grid belongs to category $m$.
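For illustration, the three loss terms above could be combined into a single training objective along the following lines. This is a minimal PyTorch-style sketch assuming the indicator masks P_ij and P_i are precomputed tensors; the square-root form of the size term and the coordinate weighting factor lambda_coord are assumptions in the spirit of YOLO-style detectors, not taken from the disclosure.

```python
import torch

def detection_loss(pred, target, p_ij, p_i, lambda_coord=5.0):
    """Sketch of the three-part loss (coordinate, confidence, category).

    pred/target: dicts with 'uv' (centers) and 'wh' (box sizes) of shape
    (S2, N, 2), 'conf' of shape (S2, N), and 'cls' class probabilities of
    shape (S2, M). p_ij (S2, N) and p_i (S2,) are the indicator masks
    defined in the text. All names are illustrative assumptions.
    """
    # Coordinate term: squared error on box centers, masked by P_ij.
    coord = (p_ij * ((pred["uv"] - target["uv"]) ** 2).sum(-1)).sum()
    # Size term: squared error on square roots (values assumed >= 0).
    size = (p_ij * ((pred["wh"].sqrt() - target["wh"].sqrt()) ** 2).sum(-1)).sum()
    # Confidence term, masked by P_ij.
    conf = (p_ij * (pred["conf"] - target["conf"]) ** 2).sum()
    # Category term, masked by P_i over all m classes.
    cls = (p_i.unsqueeze(-1) * (pred["cls"] - target["cls"]) ** 2).sum()
    return lambda_coord * (coord + size) + conf + cls
```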
S102: Perform posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body.
In this embodiment, the posture of the human body includes the positions of joint points and the connections between joint points. The joint points include the head, left shoulder, right shoulder, neck, waist, left knee, right knee, left wrist, right wrist, left elbow, right elbow, left ankle, and right ankle.
In this embodiment, each human body contained in each frame of image is input into a pre-trained human posture detection network to obtain the posture of the human body. Specifically, the posture detection network includes a feedforward neural network for extracting high-dimensional features, a joint-point position prediction network, and a joint-point relation prediction network. The feedforward neural network comprises a 10-layer convolutional network and a 2-layer pooling network, and extracts high-dimensional features from the human body contained in each frame of image. The joint-point position prediction network comprises a 5-layer convolutional network whose output is the confidence $c_j^k$ of the $j$-th joint point of the $k$-th human body in each frame of image, and it determines the positions of the joint points of the human body from the high-dimensional features. The joint-point relation prediction network estimates the connection direction between two joint points and determines the connections between joint points according to their positions. The positions of the joint points belonging to the same human body, together with the connections between them, are taken as the posture of that human body.
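A minimal sketch of a posture detection backbone with the stated layer counts (a 10-layer convolutional feedforward network with 2 pooling layers, followed by a 5-layer convolutional joint-position head) might look as follows. Channel widths, kernel sizes, and pooling positions are assumptions; only the layer counts and the 13 listed joint points come from the description.

```python
import torch.nn as nn

class PoseBackbone(nn.Module):
    """Sketch of the feedforward feature extractor described above:
    10 convolutional layers interleaved with 2 pooling layers, followed
    by a 5-layer convolutional head outputting per-joint confidence maps.
    """

    def __init__(self, num_joints=13, width=64):
        super().__init__()
        layers, c_in = [], 3
        for i in range(10):                       # 10-layer conv feedforward net
            layers += [nn.Conv2d(c_in, width, 3, padding=1), nn.ReLU()]
            c_in = width
            if i in (4, 9):                       # 2 pooling layers (positions assumed)
                layers.append(nn.MaxPool2d(2))
        self.features = nn.Sequential(*layers)
        head = []
        for _ in range(4):                        # 5-layer joint-position head
            head += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU()]
        head.append(nn.Conv2d(width, num_joints, 1))
        self.joint_head = nn.Sequential(*head)

    def forward(self, x):
        feat = self.features(x)                   # high-dimensional features
        return self.joint_head(feat)              # per-joint confidence maps
```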
When connecting multiple human joint points, many connection patterns are possible between them, but only one conforms to the structure of the human body and guarantees that each connection represents an actual body part. For example, only the line between the wrist joint point and the elbow joint point can represent the forearm of the human body. Therefore, there is only one way of connecting multiple human joint points that respects the human anatomy, and after connection the posture of the human body can be expressed by the joint points and their connections. Specifically, determining the connections between joint points according to their positions includes the following steps.
In the first step, for every two joint points, the direction vector of the two joint points is obtained from their positions and decomposed into a parallel direction vector and a vertical direction vector.
Specifically, it is determined whether a first joint point (at position $a_1$) and a second joint point (at position $a_2$) are the two ends of a first joint (for example, the left or right arm). The direction vector of the two joint points is

$$\vec{v}=\frac{a_2-a_1}{\lVert a_2-a_1\rVert_2},$$

which is decomposed into a parallel direction vector $\vec{v}_\parallel$ and a vertical direction vector $\vec{v}_\perp$.
In the second step, for each pixel point between the two joint points, it is judged whether the pixel point lies on the first joint according to the position of the pixel point and the direction vectors of the two joint points.
Specifically, let the length of the first joint be $L$ and the width of the first joint be $w$, and let $p$ be the position of a pixel point between the first joint point ($a_1$) and the second joint point ($a_2$). When the pixel point $p$ satisfies

$$0\le\vec{v}_\parallel\cdot(p-a_1)\le L \quad\text{and}\quad \lvert\vec{v}_\perp\cdot(p-a_1)\rvert\le w,$$

the pixel point $p$ lies on the first joint, and the first joint point ($a_1$) and the second joint point ($a_2$) are correlated.
In the third step, if the pixel point lies on the first joint, the correlation between the two joint points and the first joint is calculated with a correlation function; the two joint points with the highest correlation are taken as the two ends of the first joint, and the connection between the two joint points is generated.
Specifically, the correlation function is

$$E=\int_{u=0}^{1}F\big(p(u)\big)\cdot\frac{a_2-a_1}{\lVert a_2-a_1\rVert_2}\,du,$$

where $p(u)$ samples the pixels between the first joint point ($a_1$) and the second joint point ($a_2$), with $p(u)=(1-u)a_1+ua_2$, and $F(\cdot)$ denotes the predicted direction field at a pixel.
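A small sketch of this correlation integral, approximated by sampling along $p(u)$ as described, could read as follows. The `paf` callable standing in for the predicted direction field, and the number of samples, are assumptions for illustration.

```python
import numpy as np

def limb_correlation(a1, a2, paf, num_samples=10):
    """Approximate the correlation integral E between joint positions a1
    and a2 by sampling p(u) = (1 - u) * a1 + u * a2, as in the text.

    `paf` is assumed to be a callable returning the predicted direction
    field (a 2-vector) at a given pixel position.
    """
    a1, a2 = np.asarray(a1, float), np.asarray(a2, float)
    v = (a2 - a1) / (np.linalg.norm(a2 - a1) + 1e-8)  # unit direction vector
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        p = (1.0 - u) * a1 + u * a2                   # sample along the limb
        score += np.dot(paf(p), v)                    # alignment with the limb
    return score / num_samples
```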
S103: Input the postures of the human bodies in consecutive multi-frame images of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result. The first convolutional neural network is used for action recognition, and the first action recognition result includes the occurrence probabilities of different action categories for each human body.
In this embodiment, the first convolutional neural network classifies general actions and is a graph convolutional neural network. Step S103 specifically includes the following steps:
normalizing the postures of the human bodies in the consecutive multi-frame images of the video stream;

extracting a region of interest from each frame of the video stream with an attention network;

performing a graph convolution operation on the different joint points of each human body in each frame of the video stream;

performing a temporal convolution operation on the same joint points of each human body across consecutive frames of the video stream;

performing action classification with a fully connected layer based on the features output by the graph convolution operation and the temporal convolution operation, to obtain the occurrence probability of each action category for each human body.
Specifically, the action classification formula is as follows:
$$g_{out}(v_{ti})=\sum_{v_{tj}\in B(v_{ti})} f_{in}\big(s(x,v_{tj})\big)\, w\big(\ell(v_{tj})\big)$$

where $g_{out}$ is the classification result; $f_{in}$ is the feature map; $s(\cdot)$ is the sampling function, which selects for the current joint point $v_{ti}$ its nearest joint points $v_{tj}$ in the neighborhood $B(v_{ti})$; $x$ is the joint point position; $w$ is the weight; $\ell(\cdot)$ is the weighting (partition) function; and $K$ is the convolution kernel size. In the spatial domain,

$$\ell_{ti}(v_{tj})=\begin{cases}0, & r_j=r_i\\ 1, & r_j<r_i\\ 2, & r_j>r_i\end{cases}$$

and in the temporal domain,

$$\ell(v_{qj})=\ell_{ti}(v_{tj})+\left(q-t+\left\lfloor \Gamma/2\right\rfloor\right)\times K,$$

where $r_i$ is the distance from the current joint point $v_{ti}$ to the body center; $r_j$ is the distance from the adjacent joint point $v_{tj}$ to the body center; $\Gamma$ is the sampling time window size; $q$ is the sampling time; and $t$ is the current time.
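The partition functions above could be sketched in plain Python as follows; the function names and the floor of $\Gamma/2$ are assumptions consistent with the reconstructed formula.

```python
def spatial_partition(r_i, r_j):
    """Spatial-domain weighting function l: joints are partitioned by
    their distance to the body center relative to the current joint
    (0: same distance, 1: closer to the center, 2: farther away)."""
    if r_j == r_i:
        return 0
    return 1 if r_j < r_i else 2

def temporal_partition(l_spatial, q, t, gamma, k):
    """Temporal-domain extension: offset the spatial label by the frame
    displacement (q - t) within the sampling window of size gamma,
    scaled by the kernel size k, as in the formula above."""
    return l_spatial + (q - t + gamma // 2) * k
```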
S104: Obtain the objects around each human body, and input the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result. The second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of a fall for each human body.
In this embodiment, the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to a preset threshold.
In this embodiment, fall recognition is performed based on the posture of the human body, the objects around it, and the positions of those objects relative to the body. For example, when a possibly falling human body is recognized together with the semantic and position information of surrounding tables and chairs, a judgment can be made: if the person is very close to a table or chair, they have probably not fallen; if the person is far from the table and chair, the probability of a fall is higher. If a bed or a yoga mat is detected beneath the body, it can be judged that the person has not fallen but is merely lying down or exercising. Incorporating the semantic information of the surrounding environment in this way greatly improves the accuracy of action detection.
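A minimal sketch of the neighborhood test used here (keeping only the objects within the preset distance threshold of the body) might be the following; measuring the distance between bounding-box centers, and the dict layout, are assumptions.

```python
import numpy as np

def objects_near_person(person_center, objects, threshold):
    """Keep only the detected objects whose distance to the human body
    is less than or equal to the preset threshold, as described above.
    Each object is assumed to carry a 'center' (x, y) and a 'category'.
    """
    nearby = []
    for obj in objects:
        dist = np.linalg.norm(np.asarray(obj["center"], float)
                              - np.asarray(person_center, float))
        if dist <= threshold:
            nearby.append({**obj, "distance": float(dist)})
    return nearby
```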
In this embodiment, the second convolutional neural network is trained with a sample set consisting of the posture of a human body when a fall occurs, the objects around the body, and the positions of those objects relative to the body. Specifically, in this embodiment, the training process of the second convolutional neural network includes:
S1041: Obtain a first sample image containing a human body performing a fall action and a second sample image containing a human body performing a fall-like action, and detect the human bodies and objects contained in the first sample image and in the second sample image respectively.

S1042: Perform posture recognition on the human bodies detected in the first sample image and the second sample image respectively, to obtain the postures of the human bodies.

S1043: In the first sample image, obtain the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determine the positions of those objects relative to the human body from the position of the human body and the positions of the objects around it; annotate the first sample image with the posture of the human body, the objects around it, and the positions of those objects relative to the body as fall training features, to obtain a first annotated sample image.

S1044: In the second sample image, obtain the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determine the positions of those objects relative to the human body from the position of the human body and the positions of the objects around it; annotate the second sample image with the posture of the human body, the objects around it, and the positions of those objects relative to the body as non-fall training features, to obtain a second annotated sample image.

S1045: Input the first annotated sample image and the second annotated sample image into a preset initial neural network for training, to obtain the second convolutional neural network.
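Purely for illustration, steps S1043 and S1044 could assemble one labeled training feature roughly as follows; the dictionary layout and field names are hypothetical.

```python
def build_fall_training_sample(pose, nearby_objects, person_center, is_fall):
    """Bundle the human posture, the surrounding objects, and their
    positions relative to the body into one labeled training feature,
    in the spirit of steps S1043 (fall) and S1044 (non-fall)."""
    relative = [
        {"category": o["category"],
         "offset": (o["center"][0] - person_center[0],
                    o["center"][1] - person_center[1])}
        for o in nearby_objects
    ]
    return {"pose": pose, "objects": relative, "label": 1 if is_fall else 0}
```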
S105: Output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
In this embodiment, corresponding weights are set for the first action recognition result and the second action recognition result. The adjusted probability of each action category and the adjusted probability of a fall are calculated from the occurrence probabilities of the different action categories in the first action recognition result weighted by the first weight, and the fall probability in the second recognition result weighted by the second weight. The action category with the largest adjusted probability is output as the behavior recognition result of the human body.
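A sketch of this weighted fusion, assuming example weights w1 and w2 (the disclosure only states that corresponding weights are set, not their values), might be:

```python
def fuse_recognition_results(general_probs, fall_prob, w1=0.5, w2=0.5):
    """Weight the general action probabilities from the first network
    against the fall probability from the second network, and return
    the action category with the largest adjusted probability."""
    adjusted = {cls: p * w1 for cls, p in general_probs.items()}
    adjusted["fall"] = fall_prob * w2
    return max(adjusted, key=adjusted.get)
```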
FIG. 2 is a schematic flowchart of a human behavior recognition method based on environmental semantic understanding according to a second embodiment of the present application. It should be noted that, provided substantially the same result is obtained, the method of the present application is not limited to the sequence shown in FIG. 2. As shown in FIG. 2, the method includes the following steps:
S201: Detect the human bodies and objects contained in each frame of image in the video stream.

S202: Perform posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body.

S203: Perform a de-occlusion operation on the postures of the human bodies recognized in each frame of image.

S204: Input the postures of the human bodies in consecutive multi-frame images of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result; the first convolutional neural network is used for action recognition, and the first action recognition result includes the occurrence probabilities of different action categories for each human body.

S205: Obtain the objects around each human body, and input the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into the pre-trained second convolutional neural network to obtain the second action recognition result; the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of a fall for each human body.

S206: Output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
For step S201, step S202, and steps S204 to S206, refer to steps S101 to S105 of the first embodiment respectively; details are not repeated here.
In step S203, for each detection box, when the detection box contains multiple human bodies, multiple joint point groups within the detection box are obtained based on the postures of the human bodies located in the box. Each joint point group comprises multiple joint points belonging to the same human body, and the detection box of a human body frames the circumscribed region of that body in each frame of image. From the multiple joint point groups, the groups whose left shoulder joint point and right shoulder joint point both lie within the detection box are obtained; among these, the group with the largest number of joint points is marked as the target joint point group, and the remaining joint point groups in the detection box are marked as occluded joint point groups. In this embodiment, each joint point group corresponds to one human body. When multiple human bodies appear in one detection box, the de-occlusion operation of step S203 removes the joint point groups of the occluded bodies, and the posture of the human body corresponding to the target joint point group is used as the object of action recognition; in the subsequent steps S204 and S205, action classification is performed according to the posture of the human body corresponding to the target joint point group. In step S205, as in this embodiment, the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to the preset threshold.
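The de-occlusion rule of step S203 could be sketched as follows; the joint-group representation (a dict of named joints mapped to (x, y) positions) and the box format (x1, y1, x2, y2) are assumptions.

```python
def select_target_joint_group(joint_groups, box):
    """Among the joint groups whose left and right shoulder joints both
    fall inside the detection box, mark the group with the most joints
    as the target; all other groups in the box are treated as occluded,
    mirroring the de-occlusion rule in step S203."""
    def inside(pt):
        x, y = pt
        return box[0] <= x <= box[2] and box[1] <= y <= box[3]

    candidates = [
        g for g in joint_groups
        if "left_shoulder" in g and "right_shoulder" in g
        and inside(g["left_shoulder"]) and inside(g["right_shoulder"])
    ]
    if not candidates:
        return None, joint_groups
    target = max(candidates, key=len)     # most joint points wins
    occluded = [g for g in joint_groups if g is not target]
    return target, occluded
```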
In this embodiment, the de-occlusion algorithm designed for scenes with overlapping people prevents the pose information of an occluded person from being used to recognize the actions of an unoccluded person, which increases the reliability and accuracy of the algorithm and makes it applicable in real, complex scenes.
In an optional implementation, the following step is further included after step S206:
The human behavior recognition method further includes: uploading the postures of the human bodies and their behavior recognition results to a blockchain, so that the blockchain stores the postures and the behavior recognition results in encrypted form.
Specifically, corresponding summary information is obtained from the postures of the human bodies or their behavior recognition results; the summary information is obtained by hashing the postures or the behavior recognition results, for example with the SHA-256 algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users. User equipment can download the summary information from the blockchain to verify whether the behavior recognition results have been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
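A minimal sketch of the digest step, assuming JSON serialization before SHA-256 hashing (the disclosure names the sha256 algorithm but not the serialization), might be:

```python
import hashlib
import json

def make_summary_digest(poses, behavior_results):
    """Hash the postures and behavior recognition results with SHA-256
    to produce the summary information uploaded to the blockchain.
    JSON serialization with sorted keys is an assumption, chosen so the
    digest is deterministic for the same inputs."""
    payload = json.dumps(
        {"poses": poses, "results": behavior_results},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```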
FIG. 3 is a schematic structural diagram of a human behavior recognition apparatus based on environmental semantic understanding according to a third embodiment of the present application. As shown in FIG. 3, the apparatus 30 includes a target detection module 301, a posture recognition module 302, a general action classification module 303, a fall action recognition module 304, and an output module 305.
The target detection module 301 is configured to detect the human bodies and objects contained in each frame of image in the video stream. The posture recognition module 302 is configured to perform posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body. The general action classification module 303 is configured to input the postures of the human bodies in consecutive multi-frame images of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result; the first convolutional neural network is used for action recognition, and the first action recognition result includes the occurrence probabilities of different action categories for each human body. The fall action recognition module 304 is configured to obtain the objects around each human body and input the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into the pre-trained second convolutional neural network to obtain the second action recognition result; the objects around a human body are the objects whose distance from the body in each frame of image is less than or equal to the preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of a fall for each human body. The output module 305 is configured to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
FIG. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application. As shown in FIG. 4, the electronic device 40 includes a processor 41 and a memory 42 coupled to the processor 41.
The memory 42 stores program instructions for implementing the environmental-semantic-understanding-based human behavior recognition of any of the above embodiments.
The processor 41 is configured to execute the program instructions stored in the memory 42 to perform human behavior recognition based on environmental semantic understanding.
The processor 41 may also be referred to as a CPU (Central Processing Unit). The processor 41 may be an integrated circuit chip with signal processing capabilities. The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of this embodiment stores program instructions 51 capable of implementing all of the above methods, and the storage medium may be non-volatile or volatile. The program instructions 51 may be stored in the storage medium in the form of a software product and include several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage device includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, or a terminal device such as a computer, a server, a mobile phone, or a tablet.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. The above are only implementations of this application and do not thereby limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.
The above are merely implementations of this application. It should be noted that those of ordinary skill in the art may make improvements without departing from the inventive concept of this application, and such improvements all fall within the protection scope of this application.

Claims (19)

  1. A human behavior recognition method based on environmental semantic understanding, wherein the method comprises:
    detecting the human bodies and objects contained in each frame of image in a video stream;
    performing posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body;
    inputting the postures of the human bodies in consecutive multi-frame images of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, wherein the first convolutional neural network is used for action recognition and the first action recognition result comprises the occurrence probabilities of different action categories for each human body;
    obtaining the objects around each human body, and inputting the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, wherein the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result comprises the probability of a fall for each human body;
    outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  2. The human behavior recognition method according to claim 1, wherein detecting the human bodies and objects contained in each frame of image in the video stream comprises:
    dividing each frame of image in the video stream into a plurality of grids according to a preset division method;
    in each grid, performing target prediction with pre-set detection boxes of different types, and for each detection box, obtaining the coordinate parameters of the target predicted by the detection box, the width and height of the detection box, and the confidence of the detection box, and taking the detection box with the highest confidence as the prediction result, wherein the prediction result comprises the target, the detection box, the coordinate parameters of the target, and the category of the target, the detection box frames the circumscribed region of the target, and the target categories comprise human bodies and objects;
    determining the human bodies and objects contained in each frame of image in the video stream according to the prediction result.
  3. The human behavior recognition method according to claim 1, wherein the posture of the human body comprises the positions of joint points and the connections between joint points, and performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body comprises:
    performing high-dimensional feature extraction on the human body contained in each frame of image;
    determining the positions of the joint points of the human body according to the high-dimensional features;
    determining the connections between joint points according to the positions of the joint points, and taking the positions of the joint points and the connections between them as the posture of the human body.
  4. The human behavior recognition method according to claim 3, wherein determining the connections between joint points according to the positions of the joint points comprises:
    for every two joint points, obtaining the direction vector of the two joint points according to their positions, and decomposing the direction vector into a parallel direction vector and a vertical direction vector;
    for each pixel point between the two joint points, judging whether the pixel point lies on a first joint according to the position of the pixel point and the direction vectors of the two joint points;
    if the pixel point lies on the first joint, calculating the correlation between the two joint points and the first joint according to a correlation function, taking the two joint points with the highest correlation as the two ends of the first joint, and generating the connection between the two joint points.
  5. The human behavior recognition method according to claim 1, wherein after performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body, the method further comprises:
    for the detection box of each human body, when the detection box contains multiple human bodies, obtaining multiple joint point groups within the detection box based on the postures of the human bodies located in the box, wherein each joint point group comprises multiple joint points belonging to the same human body and the detection box of a human body frames the circumscribed region of that body in each frame of image;
    obtaining, from the multiple joint point groups, the joint point groups whose left shoulder joint point and right shoulder joint point lie within the detection box;
    selecting, from the joint point groups whose left and right shoulder joint points lie within the detection box, the group with the largest number of joint points and marking it as the target joint point group, marking the joint point groups in the detection box other than the target joint point group as occluded joint point groups, and taking the posture of the human body corresponding to the target joint point group as the object of action recognition.
  6. The human behavior recognition method according to claim 1, wherein inputting the postures of the human bodies in consecutive multi-frame images of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result of the human body comprises:
    extracting a region of interest from each frame of the video stream with an attention network;
    performing a graph convolution operation on the different joint points of each human body in each frame of the video stream;
    performing a temporal convolution operation on the same joint points of each human body across consecutive frames of the video stream;
    performing action classification with a fully connected layer based on the features output by the graph convolution operation and the temporal convolution operation, to obtain the occurrence probability of each action category for each human body.
  7. The human behavior recognition method according to claim 6, wherein the method further comprises: uploading the postures of the human bodies and their behavior recognition results to a blockchain, so that the blockchain stores the postures and the behavior recognition results in encrypted form;
    and wherein before extracting the region of interest from each frame of the video stream with the attention network, the method further comprises: normalizing the postures of the human bodies in the consecutive multi-frame images of the video stream.
  8. The human behavior recognition method according to claim 1, wherein the training process of the second convolutional neural network comprises:
    obtaining a first sample image containing a human body performing a fall action, and detecting the human body and objects contained in the first sample image;
    performing posture recognition on the human body detected in the first sample image, to obtain the posture of the human body;
    obtaining the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determining the positions of those objects relative to the human body from the position of the human body and the positions of the objects around it;
    annotating the first sample image with the posture of the human body, the objects around it, and the positions of those objects relative to the body as fall training features, to obtain a first annotated sample image;
    inputting the first annotated sample image into a preset initial neural network for training, to obtain the second convolutional neural network.
  9. A human behavior recognition apparatus based on environmental semantic understanding, wherein the apparatus comprises:
    a target detection module, configured to detect the human bodies and objects contained in each frame of image in a video stream;
    a posture recognition module, configured to perform posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body;
    a general action classification module, configured to input the postures of the human bodies in consecutive multi-frame images of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, wherein the first convolutional neural network is used for action recognition and the first action recognition result comprises the occurrence probabilities of different action categories for each human body;
    a fall action recognition module, configured to obtain the objects around each human body and input the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, wherein the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result comprises the probability of a fall for each human body;
    an output module, configured to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  10. An electronic device, wherein the electronic device comprises a processor and a memory coupled to the processor, the memory storing program instructions executable by the processor; when executing the program instructions stored in the memory, the processor implements the following steps:
    detecting the human bodies and objects contained in each frame of image in a video stream;
    performing posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body;
    inputting the postures of the human bodies in consecutive multi-frame images of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, wherein the first convolutional neural network is used for action recognition and the first action recognition result comprises the occurrence probabilities of different action categories for each human body;
    obtaining the objects around each human body, and inputting the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, wherein the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result comprises the probability of a fall for each human body;
    outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  11. The electronic device according to claim 10, wherein detecting the human bodies and objects contained in each frame of image in the video stream comprises:
    dividing each frame of image in the video stream into a plurality of grids according to a preset division method;
    in each grid, performing target prediction with pre-set detection boxes of different types, and for each detection box, obtaining the coordinate parameters of the target predicted by the detection box, the width and height of the detection box, and the confidence of the detection box, and taking the detection box with the highest confidence as the prediction result, wherein the prediction result comprises the target, the detection box, the coordinate parameters of the target, and the category of the target, the detection box frames the circumscribed region of the target, and the target categories comprise human bodies and objects;
    determining the human bodies and objects contained in each frame of image in the video stream according to the prediction result.
  12. The electronic device according to claim 10, wherein the posture of the human body comprises the positions of joint points and the connections between joint points, and performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body comprises:
    performing high-dimensional feature extraction on the human body contained in each frame of image;
    determining the positions of the joint points of the human body according to the high-dimensional features;
    determining the connections between joint points according to the positions of the joint points, and taking the positions of the joint points and the connections between them as the posture of the human body.
  13. The electronic device according to claim 10, wherein after performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body, the steps further comprise:
    for the detection box of each human body, when the detection box contains multiple human bodies, obtaining multiple joint point groups within the detection box based on the postures of the human bodies located in the box, wherein each joint point group comprises multiple joint points belonging to the same human body and the detection box of a human body frames the circumscribed region of that body in each frame of image;
    obtaining, from the multiple joint point groups, the joint point groups whose left shoulder joint point and right shoulder joint point lie within the detection box;
    selecting, from the joint point groups whose left and right shoulder joint points lie within the detection box, the group with the largest number of joint points and marking it as the target joint point group, marking the joint point groups in the detection box other than the target joint point group as occluded joint point groups, and taking the posture of the human body corresponding to the target joint point group as the object of action recognition.
  14. The electronic device according to claim 10, wherein inputting the postures of the human bodies in consecutive multi-frame images of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result of the human body comprises:
    extracting a region of interest from each frame of the video stream with an attention network;
    performing a graph convolution operation on the different joint points of each human body in each frame of the video stream;
    performing a temporal convolution operation on the same joint points of each human body across consecutive frames of the video stream;
    performing action classification with a fully connected layer based on the features output by the graph convolution operation and the temporal convolution operation, to obtain the occurrence probability of each action category for each human body.
  15. The electronic device according to claim 10, wherein the training process of the second convolutional neural network comprises:
    obtaining a first sample image containing a human body performing a fall action, and detecting the human body and objects contained in the first sample image;
    performing posture recognition on the human body detected in the first sample image, to obtain the posture of the human body;
    obtaining the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determining the positions of those objects relative to the human body from the position of the human body and the positions of the objects around it;
    annotating the first sample image with the posture of the human body, the objects around it, and the positions of those objects relative to the body as fall training features, to obtain a first annotated sample image;
    inputting the first annotated sample image into a preset initial neural network for training, to obtain the second convolutional neural network.
  16. A storage medium, wherein program instructions are stored in the storage medium, and when the program instructions are executed by a processor, the following steps are implemented:
    detecting the human bodies and objects contained in each frame of a video stream;
    performing posture recognition on each human body contained in each detected frame to obtain the posture of each human body;
    inputting the postures of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, wherein the first convolutional neural network is used for action recognition, and the first action recognition result comprises the occurrence probabilities of different action categories of each human body;
    acquiring the objects around each human body, and inputting the postures of each human body and the objects around each human body in the consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, wherein the objects around a human body are the objects whose distance from the human body in each frame is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result comprises the fall occurrence probability of each human body;
    outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
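The claim specifies that the output behavior recognition result depends on both recognition results but leaves the fusion rule open; the sketch below shows one plausible rule, under assumed interfaces, that prefers the dedicated fall network when its probability is both above a threshold and higher than the best generic action probability. The threshold value and the function signature are illustrative assumptions.

    def fuse_results(action_probs, action_names, fall_prob, fall_threshold=0.5):
        """Merge the first and second action recognition results into one label."""
        best = max(range(len(action_probs)), key=lambda i: action_probs[i])
        # Prefer the fall label when the dedicated fall network is both confident
        # and more confident than the best generic action class.
        if fall_prob >= fall_threshold and fall_prob >= action_probs[best]:
            return "fall", fall_prob
        return action_names[best], action_probs[best]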
  17. The storage medium according to claim 16, wherein detecting the human bodies and objects contained in each frame of the video stream comprises:
    dividing each frame of the video stream into a plurality of grids according to a preset division method;
    performing, in each grid, target prediction through preset detection boxes of different types; acquiring, for each detection box, the coordinate parameters of the target predicted by the detection box, the width and height of the detection box, and the confidence of the detection box; and taking the detection box with the highest confidence as the prediction result, wherein the prediction result comprises the target, the detection box, the coordinate parameters of the target, and the category of the target, the detection box is a circumscribed region framing the target, and the categories of the target comprise human body and object;
    determining the human bodies and objects contained in each frame of the video stream according to the prediction result.
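As a hedged illustration of the grid-based prediction recited above, in the spirit of the YOLO detector cited in the search report, the sketch below splits the network output into S x S grid cells, lets each cell propose several anchor-shaped detection boxes, and keeps the highest-confidence box per cell together with its coordinates, size, and predicted category. The output layout and the decoding formulas are assumptions for illustration.

    import numpy as np

    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def decode_grid_predictions(raw, anchors, img_w, img_h, conf_threshold=0.5):
        """Keep the highest-confidence detection box per grid cell."""
        # raw: (S, S, num_anchors, 5 + num_classes) with entries
        #      (tx, ty, tw, th, objectness, class scores ...)
        S = raw.shape[0]
        detections = []
        for row in range(S):
            for col in range(S):
                cell = raw[row, col]
                best = int(np.argmax(cell[:, 4]))              # most confident anchor box
                tx, ty, tw, th, obj = cell[best, :5]
                conf = _sigmoid(obj)
                if conf < conf_threshold:
                    continue
                cx = (col + _sigmoid(tx)) / S * img_w          # box center, in pixels
                cy = (row + _sigmoid(ty)) / S * img_h
                w = anchors[best][0] * np.exp(tw)              # anchor-relative width/height
                h = anchors[best][1] * np.exp(th)
                cls = int(np.argmax(cell[best, 5:]))           # e.g. human body vs. object
                detections.append((cls, conf, (cx, cy, w, h)))
        return detections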
  18. The storage medium according to claim 16, wherein after performing posture recognition on each human body contained in each detected frame to obtain the posture of each human body, the steps further comprise:
    for the detection box of each human body, when the detection box contains a plurality of human bodies, acquiring a plurality of joint point groups in the detection box based on the postures of the human bodies located in the detection box, wherein each joint point group comprises a plurality of joint points belonging to the same human body, and the detection box of a human body is a circumscribed region framing the human body contained in each frame;
    acquiring, from the plurality of joint point groups, the joint point groups whose left shoulder joint point and right shoulder joint point are located in the detection box;
    selecting, from the joint point groups whose left shoulder joint point and right shoulder joint point are located in the detection box, the joint point group with the largest number of joint points and marking it as the target joint point group, marking the joint point groups in the detection box other than the target joint point group as occluded joint point groups, and taking the posture of the human body corresponding to the target joint point group as the object of action recognition.
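A minimal sketch, under an assumed joint-group format, of the occlusion rule recited above: among the skeletons detected inside one person's detection box, only those whose left and right shoulder joint points fall inside the box are candidates, the candidate with the most joint points becomes the target joint point group, and the remaining groups are marked occluded. The joint names and return convention are illustrative assumptions.

    def select_target_joint_group(box, joint_groups):
        """Pick the action-recognition target among overlapping skeletons."""
        # box: (x1, y1, x2, y2); joint_groups: list of {joint_name: (x, y)} dicts.
        def inside(pt):
            return box[0] <= pt[0] <= box[2] and box[1] <= pt[1] <= box[3]

        candidates = [g for g in joint_groups
                      if "left_shoulder" in g and "right_shoulder" in g
                      and inside(g["left_shoulder"]) and inside(g["right_shoulder"])]
        if not candidates:
            return None, joint_groups                  # treat every group as occluded
        target = max(candidates, key=len)              # most detected joint points wins
        occluded = [g for g in joint_groups if g is not target]
        return target, occluded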
  19. The storage medium according to claim 16, wherein inputting the postures of each human body in the consecutive frames of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result comprises:
    extracting a region of interest from each frame of the video stream by using an attention network;
    performing a graph convolution operation on the different joint points of each human body in each frame of the video stream;
    performing a temporal convolution operation on the same joint points of each human body across the consecutive frames of the video stream;
    using a fully connected layer to perform action classification according to the features output by the graph convolution operation and the features output by the temporal convolution operation, and obtaining the occurrence probabilities of the different action categories of each human body.
  20. The storage medium according to claim 16, wherein the training process of the second convolutional neural network comprises:
    acquiring a first sample image containing a human body performing a fall action, and detecting the human body and the objects contained in the first sample image;
    performing posture recognition on the human body contained in the detected first sample image to obtain the posture of the human body;
    acquiring the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determining the positions of the objects relative to the human body according to the position of the human body and the positions of the objects around the human body;
    labeling the posture of the human body, the objects around the human body, and the positions of the objects relative to the human body in the first sample image as fall training features to obtain a first labeled sample image;
    inputting the first labeled sample image into a preset initial neural network for training to obtain the second convolutional neural network.
PCT/CN2020/123214 2020-05-29 2020-10-23 Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium WO2021114892A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010475795.7A CN111666857B (en) 2020-05-29 2020-05-29 Human behavior recognition method, device and storage medium based on environment semantic understanding
CN202010475795.7 2020-05-29

Publications (1)

Publication Number Publication Date
WO2021114892A1 (en) 2021-06-17

Family

ID=72385160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123214 WO2021114892A1 (en) 2020-05-29 2020-10-23 Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111666857B (en)
WO (1) WO2021114892A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666857B (en) * 2020-05-29 2023-07-04 平安科技(深圳)有限公司 Human behavior recognition method, device and storage medium based on environment semantic understanding
CN112137591B (en) * 2020-10-12 2021-07-23 平安科技(深圳)有限公司 Target object position detection method, device, equipment and medium based on video stream
CN112712061B (en) * 2021-01-18 2023-01-24 清华大学 Method, system and storage medium for recognizing multidirectional traffic police command gestures
CN114494976A (en) * 2022-02-17 2022-05-13 平安科技(深圳)有限公司 Human body tumbling behavior evaluation method and device, computer equipment and storage medium
CN114565087B (en) * 2022-04-28 2022-07-22 苏州浪潮智能科技有限公司 Method, device and equipment for reasoning intention of people and storage medium
CN115147935B (en) * 2022-09-05 2022-12-13 浙江壹体科技有限公司 Behavior identification method based on joint point, electronic device and storage medium
CN116189238B (en) * 2023-04-19 2023-07-04 国政通科技有限公司 Human shape detection and identification fall detection method based on neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220604A (en) * 2017-05-18 2017-09-29 清华大学深圳研究生院 A kind of fall detection method based on video
US10614310B2 (en) * 2018-03-22 2020-04-07 Viisights Solutions Ltd. Behavior recognition
CN110163127A (en) * 2019-05-07 2019-08-23 国网江西省电力有限公司检修分公司 A kind of video object Activity recognition method from thick to thin

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140334668A1 (en) * 2013-05-10 2014-11-13 Palo Alto Research Center Incorporated System and method for visual motion based object segmentation and tracking
CN110610154A (en) * 2019-09-10 2019-12-24 北京迈格威科技有限公司 Behavior recognition method and apparatus, computer device, and storage medium
CN111666857A (en) * 2020-05-29 2020-09-15 平安科技(深圳)有限公司 Human behavior recognition method and device based on environment semantic understanding and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
REDMON JOSEPH; DIVVALA SANTOSH; GIRSHICK ROSS; FARHADI ALI: "You Only Look Once: Unified, Real-Time Object Detection", 9 May 2016 (2016-05-09), pages 1-10, XP055556774, Retrieved from the Internet <URL:https://arxiv.org/pdf/1506.02640.pdf> [retrieved on 20190214], DOI: 10.1109/CVPR.2016.91 *
YAN SIJIE; XIONG YUANJUN; LIN DAHUA: "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition", arXiv.org, Cornell University Library, 23 January 2018 (2018-01-23), XP080853964 *
CAO ZHE: "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 14 April 2017 (2017-04-14), pages 7291-7299, XP055712609, ISBN: 978-1-5386-0457-1, DOI: 10.1109/CVPR.2017.143 *
ZHOU YICHONG: "Investigation and Application of Human-Object Interaction Detection Algorithm", Information & Technology, China Master's Theses Full-Text Database, 14 May 2019 (2019-05-14), pages 1-60, XP055820580, [retrieved on 20210702] *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673319A (en) * 2021-07-12 2021-11-19 浙江大华技术股份有限公司 Abnormal posture detection method, abnormal posture detection device, electronic device and storage medium
CN113673319B (en) * 2021-07-12 2024-05-03 浙江大华技术股份有限公司 Abnormal gesture detection method, device, electronic device and storage medium
CN113837005A (en) * 2021-08-20 2021-12-24 广州杰赛科技股份有限公司 Human body falling detection method and device, storage medium and terminal equipment
CN113743273A (en) * 2021-08-27 2021-12-03 西安交通大学 Real-time rope skipping counting method, device and equipment based on video image target detection
CN113743273B (en) * 2021-08-27 2024-04-05 西安交通大学 Real-time rope skipping counting method, device and equipment based on video image target detection
WO2023082882A1 (en) * 2021-11-15 2023-05-19 河南理工大学 Pose estimation-based pedestrian fall action recognition method and device
GB2616733A (en) * 2021-11-15 2023-09-20 Univ Henan Polytechnic Pose estimation-based pedestrian fall action recognition method and device
CN114157526A (en) * 2021-12-23 2022-03-08 广州新华学院 Digital image recognition-based home security remote monitoring method and device
CN114157526B (en) * 2021-12-23 2022-08-12 广州新华学院 Digital image recognition-based home security remote monitoring method and device
CN115082836B (en) * 2022-07-23 2022-11-11 深圳神目信息技术有限公司 Behavior recognition-assisted target object detection method and device
CN115082836A (en) * 2022-07-23 2022-09-20 深圳神目信息技术有限公司 Behavior recognition-assisted target object detection method and device
CN115131826B (en) * 2022-08-23 2022-11-11 浙江大华技术股份有限公司 Article detection and identification method, and network model training method and device
CN115131826A (en) * 2022-08-23 2022-09-30 浙江大华技术股份有限公司 Article detection and identification method, and network model training method and device
CN116311542A (en) * 2023-05-23 2023-06-23 广州英码信息科技有限公司 Human body fall detection method and system compatible with crowded scene and uncongested scene
CN116311542B (en) * 2023-05-23 2023-08-04 广州英码信息科技有限公司 Human body fall detection method and system compatible with crowded scene and uncongested scene

Also Published As

Publication number Publication date
CN111666857B (en) 2023-07-04
CN111666857A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
WO2021114892A1 (en) Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium
US11790682B2 (en) Image analysis using neural networks for pose and action identification
Jegham et al. Vision-based human action recognition: An overview and real world challenges
CN110135246B (en) Human body action recognition method and device
US10296102B1 (en) Gesture and motion recognition using skeleton tracking
CN107633207B (en) AU characteristic recognition methods, device and storage medium
US11514244B2 (en) Structured knowledge modeling and extraction from images
CN112597941B (en) Face recognition method and device and electronic equipment
WO2018228218A1 (en) Identification method, computing device, and storage medium
Ji et al. Learning contrastive feature distribution model for interaction recognition
JP2017506379A (en) System and method for identifying faces in unconstrained media
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
Arivazhagan et al. Human action recognition from RGB-D data using complete local binary pattern
CN110458235B (en) Motion posture similarity comparison method in video
CN110738650B (en) Infectious disease infection identification method, terminal device and storage medium
Kishore et al. Estimation of yoga postures using machine learning techniques
Wu et al. Pose-Guided Inflated 3D ConvNet for action recognition in videos
JP2022542199A (en) KEYPOINT DETECTION METHOD, APPARATUS, ELECTRONICS AND STORAGE MEDIA
CN112686211A (en) Fall detection method and device based on attitude estimation
Wang et al. Capturing feature and label relations simultaneously for multiple facial action unit recognition
Das et al. A fusion of appearance based CNNs and temporal evolution of skeleton with LSTM for daily living action recognition
CN115393964B (en) Fitness action recognition method and device based on BlazePose
CN111753796A (en) Method and device for identifying key points in image, electronic equipment and storage medium
CN108038451A (en) Anomaly detection method and device
GB2603640A (en) Action identification using neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20900119

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20900119

Country of ref document: EP

Kind code of ref document: A1