WO2021114892A1 - Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium - Google Patents

Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium

Info

Publication number
WO2021114892A1
Authority
WO
WIPO (PCT)
Prior art keywords: human body, frame, posture, video stream, neural network
Application number
PCT/CN2020/123214
Other languages: French (fr), Chinese (zh)
Inventors: 冯颖龙, 付佐毅, 周宸, 周宝, 陈远旭
Original Assignee: 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司
Publication of WO2021114892A1

Classifications

    • G06V 40/20: Image or video recognition or understanding; recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06N 3/045: Computing arrangements based on biological models; neural networks; architectures, e.g. interconnection topology; combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/25: Image or video recognition or understanding; image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 20/41: Scenes; scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • Y02D 10/00: Climate change mitigation technologies in information and communication technologies [ICT]; energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of video image processing technology, and also to the field of artificial intelligence, and in particular to a method, device, equipment, and storage medium for human behavior recognition based on environmental semantic understanding.
  • The mainstream prior-art approach to human body posture recognition uses top-down or bottom-up algorithms.
  • The inventor realized that bottom-up algorithms carry a high probability of misrecognition: for example, a chair or a robot standing in a warehouse may be mistaken for a human body and a human posture predicted from it. Such misrecognition severely degrades the algorithm's recognition accuracy and limits its usage scenarios, and the model's instability adds considerable uncertainty to the algorithm's applications. Bottom-up algorithms also increase the time and space complexity of the computation. Top-down algorithms, in turn, estimate poses in complex multi-person scenes with low accuracy and low speed.
  • The purpose of this application is to provide a human behavior recognition method, device, equipment, and storage medium based on environmental semantic understanding, which can solve the prior-art problems of low accuracy in human posture recognition and in action detection.
  • One technical solution adopted in this application provides a human behavior recognition method based on environmental semantic understanding, including: detecting the human bodies and items contained in each frame of image in a video stream; performing posture recognition on each human body contained in each detected frame to obtain each body's posture; inputting the posture of each human body in consecutive frames into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first action recognition result includes the occurrence probability of each action category for each human body; obtaining the items around each human body, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, and inputting the postures and surrounding items into a pre-trained second convolutional neural network used for fall action recognition to obtain a second action recognition result, which includes the probability of each human body falling; and outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Another technical solution adopted in this application is to provide a human body behavior recognition device based on the understanding of environmental semantics, including:
  • the target detection module is used to detect the human body and objects contained in each frame of the video stream;
  • the posture recognition module is used to recognize the posture of each human body contained in each frame of the detected image to obtain the posture of each human body;
  • the general action classification module is used to input the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;
  • the fall action recognition module is used to obtain the items around each human body and input the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;
  • the output module is configured to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Another technical solution adopted in this application provides an electronic device including a processor and a memory coupled to the processor, where the memory stores program instructions executable by the processor; when the processor executes the program instructions stored in the memory, the following steps are implemented:
  • detect the human bodies and items contained in each frame of image in the video stream; perform posture recognition on each human body contained in each detected frame to obtain each body's posture; input the postures in consecutive frames into the pre-trained first convolutional neural network to obtain the first action recognition result, which includes the occurrence probability of each action category for each human body; obtain the items around each human body (items whose distance from the human body in each frame of image is less than or equal to a preset threshold) and input the postures and surrounding items into the pre-trained second convolutional neural network used for fall action recognition to obtain the second action recognition result, which includes the probability of each human body falling; and output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Another technical solution adopted in this application is to provide a storage medium in which program instructions are stored, and when the program instructions are executed by a processor, the following steps are implemented:
  • detect the human bodies and items contained in each frame of image in the video stream; perform posture recognition on each human body contained in each detected frame to obtain each body's posture; input the postures in consecutive frames into the pre-trained first convolutional neural network to obtain the first action recognition result, which includes the occurrence probability of each action category for each human body; obtain the items around each human body (items whose distance from the human body in each frame of image is less than or equal to a preset threshold) and input the postures and surrounding items into the pre-trained second convolutional neural network used for fall action recognition to obtain the second action recognition result, which includes the probability of each human body falling; and output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • The human behavior recognition method, device, and storage medium based on environmental semantic understanding of this application first detect the human bodies and items contained in each frame of image in a video stream, and then perform posture recognition on each human body contained in each detected frame to obtain each body's posture; the postures are input into the first convolutional neural network to obtain the occurrence probability of each action category for each body, the postures and the items around each body are input into the second convolutional neural network to obtain the probability of each body falling, and the behavior recognition result is then output according to these probabilities.
  • In this way, items are prevented from being misrecognized as human bodies during posture recognition, and the accuracy and real-time performance of human posture recognition are improved.
  • The first convolutional neural network performs general action recognition, while the second convolutional neural network uses the body's posture and surrounding items for fall recognition, which improves the accuracy of action detection and provides good robustness against unstable human posture estimates.
  • Fig. 1 is a flowchart of a human body behavior recognition method based on understanding of environmental semantics according to the first embodiment of this application;
  • FIG. 2 is a flowchart of a human body behavior recognition method based on understanding of environmental semantics according to a second embodiment of this application;
  • FIG. 3 is a schematic structural diagram of a human body behavior recognition device based on understanding of environmental semantics according to a third embodiment of this application;
  • FIG. 4 is a schematic structural diagram of an electronic device based on environmental semantics understanding according to a fourth embodiment of this application;
  • FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the application.
  • The terms “first”, “second”, and “third” in this application are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Features qualified by “first”, “second”, or “third” may thus explicitly or implicitly include at least one such feature.
  • In this application, “plurality” means at least two, for example two or three, unless specifically defined otherwise. All directional indicators in the embodiments of this application (such as up, down, left, right, front, back, and so on) are used only to explain the relative positional relationship and movement of components in a specific posture (as shown in the drawings); if that specific posture changes, the directional indication changes accordingly.
  • FIG. 1 is a schematic flowchart of a human behavior recognition method based on understanding of environmental semantics according to a first embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 1. As shown in Figure 1, the human behavior recognition method based on the understanding of environmental semantics includes the following steps:
  • S101 Detect human bodies and objects contained in each frame of image in the video stream.
  • In step S101, the video stream includes multiple consecutive video frames captured by a robot, or any several video frames among those consecutive frames.
  • In step S101, based on an understanding of the environment's semantic information, the human bodies and items in the environment are detected: the video stream is input into a pre-trained deep learning network to obtain the human bodies and items contained in each frame of the video stream. The deep learning network is used for target prediction, where targets include human bodies and items.
  • This end-to-end deep learning network includes multiple convolutional layers, multiple max-pooling layers, and a fully connected layer, for example 23 convolutional layers and 5 max-pooling layers followed by a final fully connected layer for classification and regression.
  • Specifically, each frame of the video stream is divided into multiple grid cells according to a preset division scheme. In each cell, target prediction is performed with preset detection boxes of different types; for each detection box, the coordinates (x, y) of the predicted target, the width and height (w, h) of the box, and the confidence (Ptr) of the box are obtained, and the box with the highest confidence is taken as the detection result.
  • The prediction result includes the target, the detection box, the target's coordinate parameters, and the target's category; the detection box frames the target's bounding region, and the target categories include human bodies and items. The human bodies and items contained in each frame of the video stream are determined according to the prediction result.
  • Each frame of image can be divided into s × s grid cells, and in each cell target prediction is performed with n types of detection boxes to predict target positions and categories. There are m target categories in total, including human body, bed, table, chair, robot, yoga mat, and so on. For each detection box the result comprises the coordinate parameters (x, y), the width and height (w, h), and the confidence (Ptr), i.e. 5 parameters, so the number of output parameters is (s × s × n × (m + 5)); a decoding sketch follows below.
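  • As a concrete illustration of this output layout, the following is a minimal decoding sketch; the tensor layout (x, y, w, h, Ptr, then m category scores per box) and the threshold are assumptions for illustration, not the patent's actual implementation:

```python
import numpy as np

def decode_detections(output, conf_threshold=0.5):
    """Decode an s x s x n x (m + 5) detection tensor as described above."""
    s = output.shape[0]
    detections = []
    for i in range(s):
        for j in range(s):
            boxes = output[i, j]                  # n candidate boxes for this grid cell
            best = boxes[np.argmax(boxes[:, 4])]  # keep the highest-confidence box per cell
            x, y, w, h, ptr = best[:5]
            if ptr >= conf_threshold:
                category = int(np.argmax(best[5:]))  # index into the m categories
                detections.append((float(x), float(y), float(w), float(h), float(ptr), category))
    return detections

# Example: s=7 cells, n=2 box types, m=6 categories (human, bed, table, ...).
dummy = np.random.rand(7, 7, 2, 6 + 5)
print(len(decode_detections(dummy, conf_threshold=0.9)))
```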
  • To predict the category and position of targets in an image, the deep learning network is trained. The specific process is as follows: for each sample image in the sample image set, targets are annotated with rectangular detection boxes; the deep learning network predicts the position and category of each target in the sample image, and the network's error is determined according to the prediction result and the targets' annotation information.
  • The error is determined by the network's loss function, which includes a coordinate prediction loss function, a confidence loss function, and a category loss function, whose terms use the following variables (a reconstruction of the formulas follows this list):
  • $P_{ij}$ indicates whether the center point of the target predicted by the j-th detection box lies in the i-th grid cell;
  • $u_i$ and $v_i$ are the predicted abscissa and ordinate of the target's center point in the i-th grid cell, and $\hat{u}_i$ and $\hat{v}_i$ are the annotated ones;
  • $w_i$ and $h_i$ are the width and height of the detection box of the predicted target whose center lies in the i-th grid cell, and $\hat{w}_i$ and $\hat{h}_i$ are the annotated ones;
  • $\mathrm{Conf}_i$ is the predicted confidence and $\widehat{\mathrm{Conf}}_i$ is the annotated confidence;
  • $P_i$ indicates whether the network predicts that a target's center point exists in the i-th grid cell;
  • $P_i(m)$ is the predicted probability that the target in the i-th grid cell belongs to category m, and $\hat{P}_i(m)$ is the annotated probability that it belongs to category m.
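  • The loss terms themselves appear as formula images in the published document. A squared-error reconstruction consistent with the variable definitions above (the patent's exact weighting may differ) is:

$$L_{\text{coord}}=\sum_{i=1}^{s^{2}}\sum_{j=1}^{n}P_{ij}\left[(u_i-\hat{u}_i)^2+(v_i-\hat{v}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right]$$

$$L_{\text{conf}}=\sum_{i=1}^{s^{2}}\sum_{j=1}^{n}P_{ij}\left(\mathrm{Conf}_i-\widehat{\mathrm{Conf}}_i\right)^2$$

$$L_{\text{cls}}=\sum_{i=1}^{s^{2}}P_i\sum_{m=1}^{M}\left(P_i(m)-\hat{P}_i(m)\right)^2$$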
  • S102 Perform posture recognition on each human body included in each frame of detected images to obtain the posture of each human body.
  • The posture of a human body comprises the positions of its joint points and the connections between them; the joint points include the head, the left and right shoulder joint points, the neck joint point, the waist joint point, the left knee joint point, and so on.
  • Each human body contained in each frame of image is input into a pre-trained human posture detection network to obtain its posture. The posture detection network comprises a feedforward neural network for extracting high-dimensional features, a joint point position prediction network, and a joint point relationship prediction network.
  • The feedforward neural network comprises 10 convolutional layers and 2 pooling layers and extracts high-dimensional features of the human bodies contained in each frame; the joint point position prediction network comprises 5 convolutional layers and outputs the confidence of the j-th joint point of the k-th human body in each frame, which is used to locate the body's joint points from the high-dimensional features; the joint point relationship prediction network estimates the connection direction between pairs of joint points, and the connections between joint points are determined according to the joint point positions.
  • The positions of the joint points belonging to the same human body, together with the lines between those joint points, are taken as that body's posture.
  • A connection is a link that represents a specific structure of the human body; for example, an arm can only be represented by connecting the wrist joint point and the elbow joint point. Given the body's structure, there is therefore only one valid way to connect a set of human joint points, and the joint points and their connections together depict the body's posture.
  • The step of determining the lines between joint points according to the joint point positions proceeds as follows (see the code sketch after this list).
  • First, the direction vector between two joint points is obtained from their positions and decomposed into a component parallel to the joint and a component perpendicular to it: if the first joint point (at position a1) and the second joint point (at position a2) are the two ends of a first joint (for example, the left arm or the right arm), the direction vector from a1 to a2 is decomposed into a parallel direction vector $\mathbf{v}_{\parallel}$ and a perpendicular direction vector $\mathbf{v}_{\perp}$.
  • Second, for each pixel lying between the two joint points, whether the pixel is located on the first joint is determined from the pixel's position and the joint points' direction vectors: with L the length of the first joint, w its width, and p the pixel's position, pixel p is associated with the first joint point (a1) and the second joint point (a2) when $0 \le (p-a_1)\cdot\mathbf{v}_{\parallel} \le L$ and $|(p-a_1)\cdot\mathbf{v}_{\perp}| \le w$.
  • Third, if the pixel is located on the first joint, the correlation between the two joint points and the first joint is computed with a correlation function; the two joint points with the highest correlation are taken as the two ends of the first joint, and a line is created between them.
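  • A minimal sketch of this pixel-on-joint test, assuming 2-D joint coordinates and the parallel/perpendicular decomposition described above (function and variable names are illustrative):

```python
import numpy as np

def on_limb(p, a1, a2, limb_width):
    """Return True if pixel p lies on the joint (limb) between a1 and a2."""
    v = a2 - a1                                 # direction vector between the two joint points
    L = np.linalg.norm(v)                       # length of the joint
    v_par = v / L                               # unit vector parallel to the joint
    v_perp = np.array([-v_par[1], v_par[0]])    # unit vector perpendicular to the joint
    d = p - a1
    along = np.dot(d, v_par)                    # parallel offset along the joint
    across = abs(np.dot(d, v_perp))             # perpendicular offset from the joint axis
    return 0.0 <= along <= L and across <= limb_width

# Example: is pixel (3, 1) on a joint from (0, 0) to (6, 0) of width 2?
print(on_limb(np.array([3.0, 1.0]), np.zeros(2), np.array([6.0, 0.0]), 2.0))  # True
```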
  • S103 Input the posture of each human body in consecutive frames of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result, which includes the occurrence probability of each action category for each human body.
  • The first convolutional neural network classifies general actions and is a graph convolutional neural network. Step S103 specifically includes the following steps:
  • a graph convolution operation is performed on the joint points within each frame and a time convolution operation on corresponding joint points across frames; a fully connected layer then classifies actions according to the features output by the graph convolution operation and the time convolution operation, obtaining the occurrence probability of each action category for each human body, where (a reconstruction follows this list):
  • $g_{out}$ is the classification result and $f_{in}$ is the input feature map;
  • $p(\cdot)$ is the sampling function, which samples the joint points $v_{tj}$ closest to the current joint point $v_{ti}$;
  • $x$ is a joint point position, $w$ is the weight function, and $K$ is the size of the convolution kernel;
  • $r_i$ is the distance from the current joint point $v_{ti}$ to the center of the human body, and $r_j$ is the distance from the neighboring joint point $v_{tj}$ to the center of the human body;
  • for the time convolution, $\Gamma$ is the sampling time window size, $q$ is the sampling time, and $t$ is the current time.
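  • The graph convolution formula itself appears as an image in the published document. A sketch of a standard spatial-temporal graph convolution consistent with the variables above (the neighbor set $B$, the partition label $l$, and the normalizer $Z$ are assumptions) is:

$$g_{out}(v_{ti})=\sum_{v_{tj}\in B(v_{ti})}\frac{1}{Z_{ti}(v_{tj})}\,f_{in}\big(p(v_{ti},v_{tj})\big)\cdot w\big(l_{ti}(v_{tj})\big),\qquad B(v_{ti})=\{v_{qj}: d(v_{tj},v_{ti})\le K,\ |q-t|\le \lfloor \Gamma/2 \rfloor\}$$

  • Here $l_{ti}(v_{tj})$ partitions the neighbors by comparing $r_j$ with $r_i$ (equal to, closer to, or farther from the body center than the current joint point).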
  • S104 Obtain the items around each human body and input each body's posture and surrounding items in consecutive frames of the video stream into the pre-trained second convolutional neural network to obtain the second action recognition result; the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold.
  • Fall recognition is performed from the body's posture, the items around it, and the items' positions relative to it. For example, once a fallen-looking body and the semantic and position information of surrounding tables and chairs are identified, the judgment is refined: if the apparently fallen person is very close to a table and chair, they likely did not fall, whereas if they are far from the table and chair, a fall is more probable; and if the apparently fallen body is detected lying on a bed or a yoga mat, it can be judged that the person has not fallen but is simply lying down or exercising. Semantic information about the surrounding environment can thus greatly improve the accuracy of action detection, as sketched below.
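  • A minimal sketch of the surrounding-item filter and this kind of context-based refinement; the thresholds, labels, and hand-written rule are illustrative assumptions (in the patent this reasoning is learned by the second convolutional neural network):

```python
import numpy as np

def surrounding_items(person_center, items, threshold):
    """Keep items whose distance to the person is <= the preset threshold."""
    return [(label, pos) for label, pos in items
            if np.linalg.norm(np.asarray(pos) - np.asarray(person_center)) <= threshold]

def refine_fall_judgment(fall_prob, nearby_labels):
    # A fallen-looking person on or next to a bed or yoga mat is probably
    # lying down or exercising rather than fallen.
    if {"bed", "yoga mat"} & set(nearby_labels):
        return min(fall_prob, 0.1)
    return fall_prob

items = [("table", (1.0, 0.5)), ("bed", (0.2, 0.1)), ("chair", (5.0, 5.0))]
nearby = surrounding_items((0.0, 0.0), items, threshold=2.0)
print(refine_fall_judgment(0.9, [label for label, _ in nearby]))  # 0.1
```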
  • The second convolutional neural network is trained on a sample set comprising the posture of the human body when a fall occurs, the items around the body, and the items' positions relative to the body.
  • the training process of the second convolutional neural network includes:
  • S1041 Obtain first sample images of human bodies performing a fall action and second sample images of human bodies performing fall-like actions, and detect the human bodies and items contained in each first sample image and each second sample image;
  • for each first sample image, obtain the items whose distance from the human body is less than or equal to the preset threshold as the items around the body, and determine the items' positions relative to the body from the body's position and the items' positions; annotate the first sample image with the body's posture, the items around it, and their positions relative to it as fall training features, obtaining a first annotated sample image;
  • for each second sample image, likewise obtain the items whose distance from the human body is less than or equal to the preset threshold as the items around the body and determine their positions relative to the body; annotate the second sample image with the body's posture, the items around it, and their positions relative to it as non-fall training features, obtaining a second annotated sample image;
  • S1045 Input the first annotated sample images and the second annotated sample images into a preset initial neural network for training, so as to obtain the second convolutional neural network.
  • S105 Output a behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Specifically, the first action recognition result and the second action recognition result are weighted to compute adjusted probabilities for the different human action categories and for the human body falling, and the action category with the largest adjusted probability is output as the body's behavior recognition result, as sketched below.
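  • A minimal sketch of such weighted fusion, with illustrative weights (the patent does not disclose specific weight values):

```python
def fuse_results(general_probs, fall_prob, w_general=0.6, w_fall=0.4):
    """general_probs: {category: probability} from the first network;
    fall_prob: probability of falling from the second network."""
    adjusted = {cat: w_general * p for cat, p in general_probs.items()}
    # The fall category combines both networks' evidence.
    adjusted["fall"] = w_general * general_probs.get("fall", 0.0) + w_fall * fall_prob
    return max(adjusted, key=adjusted.get)

print(fuse_results({"walk": 0.5, "sit": 0.2, "fall": 0.3}, fall_prob=0.05))  # 'walk'
```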
  • Fig. 2 is a schematic flowchart of a human behavior recognition method based on understanding of environmental semantics according to a second embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 2. As shown in Figure 2, the human behavior recognition method based on the understanding of environmental semantics includes the following steps:
  • S201 Detect human bodies and objects contained in each frame of image in the video stream.
  • S202 Perform posture recognition on each human body included in each detected frame of image, to obtain the posture of each human body.
  • S203 Perform a de-occlusion operation on the posture of each human body included in each frame of the recognized image.
  • S204 Input the posture of each human body in consecutive frames of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result, which includes the occurrence probability of each action category for each human body.
  • S205 Obtain the items around each human body and input each body's posture and surrounding items in consecutive frames of the video stream into the pre-trained second convolutional neural network to obtain a second action recognition result; the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling.
  • S206 Output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • For steps S201 and S202 and steps S204 to S206, refer to steps S101 to S105 of the first embodiment respectively; details are not repeated here.
  • In step S203, for each detection box that contains multiple human bodies, multiple joint point groups within the box are obtained based on the postures of the bodies in the box, each joint point group comprising multiple joint points belonging to the same human body; a body's detection box frames that body's bounding region in the frame.
  • From these groups, the joint point groups whose left and right shoulder joint points lie inside the detection box are obtained; among them, the group with the largest number of joint points is marked as the target joint point group, and each joint point group corresponds to one human body.
  • The de-occlusion operation of step S203 removes the joint point groups of occluded bodies; the target joint point group corresponds to the unoccluded body, and the action is then classified according to the posture of the body corresponding to the target joint point group, as sketched below.
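  • A minimal sketch of this selection rule, assuming joint groups are dictionaries keyed by joint name (names and layout are illustrative):

```python
def select_target_group(joint_groups, box):
    """joint_groups: list of {joint_name: (x, y)} dicts, one per body;
    box: (x_min, y_min, x_max, y_max) of the detection box."""
    def inside(pt):
        x, y = pt
        return box[0] <= x <= box[2] and box[1] <= y <= box[3]

    # Keep only groups whose left and right shoulder joints lie in the box,
    # then pick the group with the most detected joints (the unoccluded body).
    candidates = [g for g in joint_groups
                  if "l_shoulder" in g and "r_shoulder" in g
                  and inside(g["l_shoulder"]) and inside(g["r_shoulder"])]
    return max(candidates, key=len) if candidates else None

groups = [
    {"l_shoulder": (2, 2), "r_shoulder": (4, 2), "head": (3, 1)},
    {"l_shoulder": (9, 9), "r_shoulder": (11, 9)},   # shoulders outside the box
]
print(select_target_group(groups, box=(0, 0, 6, 8)))  # the first group wins
```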
  • As in the first embodiment, the items around a human body are items whose distance from the human body in each frame of image is less than or equal to the preset threshold.
  • By designing the algorithm to remove occlusion in multi-person overlapping scenes, the method further avoids using an occluded person's pose information to recognize an unoccluded person's behavior, which increases the reliability and accuracy of the algorithm and allows it to be applied in real, complex scenes.
  • After step S206, the following steps are further included:
  • the human behavior recognition method further includes uploading the posture of each human body and the behavior recognition result of each human body to a blockchain, so that the blockchain stores each body's posture and behavior recognition result in encrypted form.
  • Specifically, corresponding summary information is obtained from each body's posture or behavior recognition result; the summary information is obtained by hashing the posture or the behavior recognition result, for example with the SHA-256 algorithm, as in the sketch below.
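  • A minimal sketch of deriving the summary information with SHA-256 (the record layout is an assumption):

```python
import hashlib
import json

def summary_digest(record):
    """Hash a posture or behavior-recognition record with SHA-256."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

print(summary_digest({"person_id": 1, "behavior": "fall", "probability": 0.92}))
```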
  • Uploading the summary information to the blockchain ensures its security as well as fairness and transparency for users; a user device can download the summary information from the blockchain to verify whether the behavior recognition results have been tampered with.
  • The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
  • Fig. 3 is a schematic structural diagram of a human body behavior recognition device based on understanding of environmental semantics according to a third embodiment of the present application.
  • The device 30 includes a target detection module 301, a posture recognition module 302, a general action classification module 303, a fall action recognition module 304, and an output module 305.
  • the target detection module 301 is used to detect the human body and objects contained in each frame of the image in the video stream;
  • the posture recognition module 302 is used to perform posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;
  • the general action classification module 303 is configured to input the posture of each human body in consecutive frames of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;
  • the fall action recognition module 304 is used to obtain the items around each human body and input each body's posture and surrounding items in consecutive frames of the video stream into the pre-trained second convolutional neural network to obtain the second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to the preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;
  • the output module 305 is used to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  • Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
  • the electronic device 40 includes a processor 41 and a memory 42 coupled to the processor 41.
  • the memory 42 stores program instructions for realizing human behavior recognition based on environmental semantic understanding in any of the above embodiments.
  • the processor 41 is configured to execute program instructions stored in the memory 42 to perform human behavior recognition based on the understanding of environmental semantics.
  • The processor 41 may also be referred to as a CPU (Central Processing Unit). The processor 41 may be an integrated circuit chip with signal processing capabilities.
  • The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the application.
  • The storage medium of this embodiment stores program instructions 51 capable of implementing all of the methods above; the storage medium may be non-volatile or volatile.
  • The program instructions 51 may be stored in the storage medium in the form of a software product and include several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of this application.
  • Storage media that can hold program code include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like; the instructions may also run on terminal devices such as computers, servers, mobile phones, and tablets.
  • In the several embodiments provided in this application, the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative: the division into units is only a logical functional division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • The functional units in the various embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or as a software functional unit.
  • The above are only embodiments of this application and do not limit the scope of its patent protection. Any equivalent structural or process transformation made using the contents of this application's description and drawings, or any direct or indirect application in other related technical fields, is likewise included within the scope of this application's patent protection.

Abstract

The present application relates to the technical fields of video image processing and artificial intelligence, and specifically to an environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium. The method comprises: detecting the persons and items contained in the frame images of a video stream; performing posture recognition on each detected person in each frame image to obtain the body postures; inputting the body postures into a first convolutional neural network to obtain the probabilities of the person's different action types; inputting the body postures and the items surrounding the person into a second convolutional neural network to obtain the probability of the person falling; and outputting the movement recognition result. The method prevents items from being misrecognized as persons during posture recognition and enhances the accuracy and real-time performance of body posture recognition. The second convolutional neural network uses the body postures and surrounding items to recognize falls, enhancing the accuracy of action detection and providing good robustness against the instability of body posture estimation.

Description

Human behavior recognition method, device, equipment and storage medium based on environmental semantic understanding

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 29, 2020, with application number 202010475795.7 and the invention title "Human Behavior Recognition Method, Device and Storage Medium Based on Understanding of Environmental Semantics", the entire contents of which are incorporated into this application by reference.
[Technical Field]

This application relates to the field of video image processing technology and to the field of artificial intelligence, and in particular to a method, device, equipment, and storage medium for human behavior recognition based on environmental semantic understanding.

[Background Art]

The mainstream prior-art approach to human body posture recognition uses top-down or bottom-up algorithms. The inventor realized that bottom-up algorithms carry a high probability of misrecognition: for example, a chair or a robot standing in a warehouse may be mistaken for a human body and a human posture predicted from it. Such misrecognition severely degrades the algorithm's recognition accuracy and limits its usage scenarios, and the model's instability adds considerable uncertainty to the algorithm's applications. Bottom-up algorithms also increase the time and space complexity of the computation. Top-down algorithms, in turn, estimate poses in complex multi-person scenes with low accuracy and low speed.

After the human pose is estimated, actions must be classified from the pose to recognize human behavior. The prior art mostly uses end-to-end algorithm models for action classification; such models place very high demands on the accuracy of the input human poses and on the quality of the annotated data, so end-to-end action recognition is prone to large deviations and its recognition accuracy is low.

Therefore, it is necessary to provide a new human behavior recognition method to solve the above technical problems.
[Summary of the Invention]

The purpose of this application is to provide a human behavior recognition method, device, equipment, and storage medium based on environmental semantic understanding, which can solve the prior-art problems of low accuracy in human posture recognition and in action detection.

To solve the above technical problems, one technical solution adopted in this application provides a human behavior recognition method based on environmental semantic understanding, including:

detecting the human bodies and items contained in each frame of image in a video stream;

performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;

inputting the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;

obtaining the items around each human body, and inputting the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;

outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
To solve the above technical problems, another technical solution adopted in this application provides a human behavior recognition device based on environmental semantic understanding, including:

a target detection module, used to detect the human bodies and items contained in each frame of image in a video stream;

a posture recognition module, used to perform posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;

a general action classification module, used to input the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;

a fall action recognition module, used to obtain the items around each human body and input the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;

an output module, used to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
To solve the above technical problems, another technical solution adopted in this application provides an electronic device including a processor and a memory coupled to the processor, where the memory stores program instructions executable by the processor; when the processor executes the program instructions stored in the memory, the following steps are implemented:

detecting the human bodies and items contained in each frame of image in a video stream;

performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;

inputting the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;

obtaining the items around each human body, and inputting the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;

outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
To solve the above technical problems, another technical solution adopted in this application provides a storage medium storing program instructions which, when executed by a processor, implement the following steps:

detecting the human bodies and items contained in each frame of image in a video stream;

performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body;

inputting the posture of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, where the first convolutional neural network is used for action recognition and the first action recognition result includes the occurrence probability of each action category for each human body;

obtaining the items around each human body, and inputting the posture of each human body and the items around each human body in consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, where the items around a human body are items whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of each human body falling;

outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
The beneficial effects of this application are as follows. The human behavior recognition method, device, and storage medium based on environmental semantic understanding first detect the human bodies and items contained in each frame of image in a video stream, and then perform posture recognition on each human body contained in each detected frame to obtain each body's posture; the postures are input into the first convolutional neural network to obtain the occurrence probability of each action category for each human body, the postures and the items around each body are input into the second convolutional neural network to obtain the probability of each body falling, and the behavior recognition result is then output according to these probabilities. In this way, items are prevented from being misrecognized as human bodies during posture recognition, and the accuracy and real-time performance of human posture recognition are improved; the first convolutional neural network performs general action recognition while the second convolutional neural network uses the body's posture and surrounding items for fall recognition, which improves the accuracy of action detection and provides good robustness against unstable human posture estimates.
[Description of the Drawings]

Fig. 1 is a flowchart of a human behavior recognition method based on environmental semantic understanding according to the first embodiment of this application;

Fig. 2 is a flowchart of a human behavior recognition method based on environmental semantic understanding according to the second embodiment of this application;

Fig. 3 is a schematic structural diagram of a human behavior recognition device based on environmental semantic understanding according to the third embodiment of this application;

Fig. 4 is a schematic structural diagram of an electronic device according to the fourth embodiment of this application;

Fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of this application.
[Detailed Description of the Embodiments]

The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.

The terms "first", "second", and "third" in this application are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Features qualified by "first", "second", or "third" may thus explicitly or implicitly include at least one such feature. In the description of this application, "plurality" means at least two, for example two or three, unless specifically defined otherwise. All directional indicators in the embodiments of this application (such as up, down, left, right, front, back, and so on) are used only to explain the relative positional relationship and movement of components in a specific posture (as shown in the drawings); if that specific posture changes, the directional indication changes accordingly. In addition, the terms "including" and "having", and any variations of them, are intended to cover non-exclusive inclusion: a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or other steps or units inherent to the process, method, product, or device.

Reference to an "embodiment" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of this application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Fig. 1 is a schematic flowchart of a human behavior recognition method based on environmental semantic understanding according to the first embodiment of the present application. It should be noted that, as long as substantially the same result is obtained, the method of the present application is not limited to the sequence shown in Fig. 1. As shown in Fig. 1, the method includes the following steps:

S101: Detect the human bodies and items contained in each frame of image in the video stream.

In step S101, the video stream includes multiple consecutive video frames captured by a robot, or any several video frames among those consecutive frames.
在步骤S101中,基于环境语义信息理解,检测出环境中的人体和物品,将视频流输入到预先训练完成的深度学习网络中,获取该视频流中各帧图像包含的人体和物品,该深度学习网络用于目标预测,该目标包括人体和物品,该端到端的深度学习网络包括多层卷积神经网络、多层最大池化层以及全连接层,例如为23层卷积神经网络、5层最大池化层以及最后采用全连接层进行分类和回归,具体地,按照预设的划分方式,将该视频流中各帧图像划分为多个网格;在每个网格中,通过预先设置的不同类型的检测框进行目标预测,针对每个检测框,获取该检测框预测的目标的坐标参数(x,y)、检测框的宽度和高度(w,h)以及检测框的置信度(Ptr),将置信度最高的检测框作为检测结果,该预测结果包括目标、检测框、该目标的坐标参数以及该目标的类别,该检测框为框选出所述目标的外接区域,该目标的类别包括人体和物品;根据所述预测结果确定所述视频流中各帧图像包含的人体和物品。In step S101, based on the understanding of the semantic information of the environment, the human body and objects in the environment are detected, and the video stream is input into the pre-trained deep learning network to obtain the human body and objects contained in each frame of the video stream. The learning network is used for target prediction. The target includes humans and objects. The end-to-end deep learning network includes a multi-layer convolutional neural network, a multi-layer maximum pooling layer, and a fully connected layer, such as a 23-layer convolutional neural network, 5 The maximum pooling layer and finally the fully connected layer is used for classification and regression. Specifically, each frame of the video stream is divided into multiple grids according to a preset division method; in each grid, through pre- Set different types of detection frames for target prediction, for each detection frame, obtain the coordinate parameters (x, y) of the target predicted by the detection frame, the width and height of the detection frame (w, h), and the confidence of the detection frame (Ptr), the detection frame with the highest confidence is used as the detection result. The prediction result includes the target, the detection frame, the coordinate parameters of the target, and the target category. The detection frame selects the circumscribed area of the target. The target categories include human bodies and objects; the human bodies and objects contained in each frame of image in the video stream are determined according to the prediction result.
Each frame of image may be divided into s×s grids. In each grid, target prediction is performed with n types of detection boxes to predict the position and category of targets. There are m prediction categories in total, such as human body, bed, table, chair, robot, and yoga mat. For each detection box the result contains five parameters: the coordinate parameters (x, y), the width and height (w, h), and the confidence (Ptr), so the total number of output parameters is s×s×n×(m+5).
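As an illustration of the grid-based prediction just described, the following minimal sketch decodes a prediction tensor laid out as s×s×n×(m+5), matching the parameter count above. The tensor layout, function names, and the confidence threshold are assumptions chosen for illustration, not part of the disclosed method.

```python
import numpy as np

def decode_detections(pred, s, n, m, conf_thresh=0.5):
    """Decode a grid prediction tensor of shape (s, s, n, m + 5).

    For each of the n detection boxes per grid cell, the last axis is
    assumed to hold (x, y, w, h, Ptr) followed by m class probabilities,
    matching the s x s x n x (m + 5) parameter count in the description.
    """
    detections = []
    for i in range(s):
        for j in range(s):
            for k in range(n):
                x, y, w, h, ptr = pred[i, j, k, :5]
                if ptr < conf_thresh:
                    continue
                class_probs = pred[i, j, k, 5:]
                detections.append({
                    "cell": (i, j),
                    "box": (float(x), float(y), float(w), float(h)),
                    "confidence": float(ptr),
                    "category": int(np.argmax(class_probs)),
                })
    # Sort by confidence so the highest-confidence box can be selected,
    # as the description takes the box with the highest confidence.
    detections.sort(key=lambda d: d["confidence"], reverse=True)
    return detections
```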
To predict the category and position of targets in an image, the deep learning network is trained as follows. For each sample image in the sample image set, targets are annotated with rectangular detection boxes. The deep learning network predicts the position and category of each target in the sample image, and the network error is determined from the prediction result and the annotation of the target. The error is computed with the loss function of the deep learning network, which comprises a coordinate prediction loss, a confidence loss, and a category loss, given respectively as follows:
(1) Coordinate prediction loss:

$$L_{coord}=\sum_{i=1}^{s\times s}\sum_{j=1}^{n}P_{ij}\left[(u_i-\hat{u}_i)^2+(v_i-\hat{v}_i)^2\right]+\sum_{i=1}^{s\times s}\sum_{j=1}^{n}P_{ij}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right]$$

where $P_{ij}$ indicates whether the center point of the target predicted by the $j$-th detection box lies in the $i$-th grid; $u_i$ and $v_i$ are the abscissa and ordinate of the predicted target's center point in the $i$-th grid, and $\hat{u}_i$ and $\hat{v}_i$ are the abscissa and ordinate of the annotated target's center point in the $i$-th grid; $w_i$ and $h_i$ are the width and height of the detection box of the predicted target whose center lies in the $i$-th grid, and $\hat{w}_i$ and $\hat{h}_i$ are the width and height of the annotated detection box;
(2) Confidence loss:

$$L_{conf}=\sum_{i=1}^{s\times s}\sum_{j=1}^{n}P_{ij}\left(Conf_i-\widehat{Conf}_i\right)^2$$

where $P_{ij}$ indicates whether the center point of the target predicted by the $j$-th detection box lies in the $i$-th grid, $Conf_i$ is the predicted confidence, and $\widehat{Conf}_i$ is the annotated confidence;
(3) Category loss:

$$L_{cls}=\sum_{i=1}^{s\times s}P_i\sum_{m\in classes}\left(p_i(m)-\hat{p}_i(m)\right)^2$$

where $P_i$ indicates whether the $i$-th grid contains the center point of a target, $p_i(m)$ is the predicted probability that the target in the $i$-th grid belongs to category $m$, and $\hat{p}_i(m)$ is the annotated probability that the target in the $i$-th grid belongs to category $m$.
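For illustration, the three loss terms above could be combined into a single training objective along the following lines. This is a minimal PyTorch-style sketch assuming the indicator masks P_ij and P_i are precomputed tensors; the square-root form of the size term and the coordinate weighting factor lambda_coord are assumptions in the spirit of YOLO-style detectors, not taken from the disclosure.

```python
import torch

def detection_loss(pred, target, p_ij, p_i, lambda_coord=5.0):
    """Sketch of the three-part loss (coordinate, confidence, category).

    pred/target: dicts with 'uv' (centers) and 'wh' (box sizes) of shape
    (S2, N, 2), 'conf' of shape (S2, N), and 'cls' class probabilities of
    shape (S2, M). p_ij (S2, N) and p_i (S2,) are the indicator masks
    defined in the text. All names are illustrative assumptions.
    """
    # Coordinate term: squared error on box centers, masked by P_ij.
    coord = (p_ij * ((pred["uv"] - target["uv"]) ** 2).sum(-1)).sum()
    # Size term: squared error on square roots (values assumed >= 0).
    size = (p_ij * ((pred["wh"].sqrt() - target["wh"].sqrt()) ** 2).sum(-1)).sum()
    # Confidence term, masked by P_ij.
    conf = (p_ij * (pred["conf"] - target["conf"]) ** 2).sum()
    # Category term, masked by P_i over all m classes.
    cls = (p_i.unsqueeze(-1) * (pred["cls"] - target["cls"]) ** 2).sum()
    return lambda_coord * (coord + size) + conf + cls
```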
S102: Perform posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body.
In this embodiment, the posture of the human body includes the positions of joint points and the connections between joint points. The joint points include the head, left shoulder, right shoulder, neck, waist, left knee, right knee, left wrist, right wrist, left elbow, right elbow, left ankle, and right ankle.
In this embodiment, each human body contained in each frame of image is input into a pre-trained human posture detection network to obtain the posture of the human body. Specifically, the posture detection network includes a feedforward neural network for extracting high-dimensional features, a joint-point position prediction network, and a joint-point relation prediction network. The feedforward neural network comprises a 10-layer convolutional network and a 2-layer pooling network, and extracts high-dimensional features from the human body contained in each frame of image. The joint-point position prediction network comprises a 5-layer convolutional network whose output is the confidence $c_j^k$ of the $j$-th joint point of the $k$-th human body in each frame of image, and it determines the positions of the joint points of the human body from the high-dimensional features. The joint-point relation prediction network estimates the connection direction between two joint points and determines the connections between joint points according to their positions. The positions of the joint points belonging to the same human body, together with the connections between them, are taken as the posture of that human body.
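A minimal sketch of a posture detection backbone with the stated layer counts (a 10-layer convolutional feedforward network with 2 pooling layers, followed by a 5-layer convolutional joint-position head) might look as follows. Channel widths, kernel sizes, and pooling positions are assumptions; only the layer counts and the 13 listed joint points come from the description.

```python
import torch.nn as nn

class PoseBackbone(nn.Module):
    """Sketch of the feedforward feature extractor described above:
    10 convolutional layers interleaved with 2 pooling layers, followed
    by a 5-layer convolutional head outputting per-joint confidence maps.
    """

    def __init__(self, num_joints=13, width=64):
        super().__init__()
        layers, c_in = [], 3
        for i in range(10):                       # 10-layer conv feedforward net
            layers += [nn.Conv2d(c_in, width, 3, padding=1), nn.ReLU()]
            c_in = width
            if i in (4, 9):                       # 2 pooling layers (positions assumed)
                layers.append(nn.MaxPool2d(2))
        self.features = nn.Sequential(*layers)
        head = []
        for _ in range(4):                        # 5-layer joint-position head
            head += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU()]
        head.append(nn.Conv2d(width, num_joints, 1))
        self.joint_head = nn.Sequential(*head)

    def forward(self, x):
        feat = self.features(x)                   # high-dimensional features
        return self.joint_head(feat)              # per-joint confidence maps
```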
When connecting multiple human joint points, many connection patterns are possible between them, but only one conforms to the structure of the human body and guarantees that each connection represents an actual body part. For example, only the line between the wrist joint point and the elbow joint point can represent the forearm of the human body. Therefore, there is only one way of connecting multiple human joint points that respects the human anatomy, and after connection the posture of the human body can be expressed by the joint points and their connections. Specifically, determining the connections between joint points according to their positions includes the following steps.
In the first step, for every two joint points, the direction vector of the two joint points is obtained from their positions and decomposed into a parallel direction vector and a vertical direction vector.
Specifically, it is determined whether a first joint point (at position $a_1$) and a second joint point (at position $a_2$) are the two ends of a first joint (for example, the left or right arm). The direction vector of the two joint points is

$$\vec{v}=\frac{a_2-a_1}{\lVert a_2-a_1\rVert_2},$$

which is decomposed into a parallel direction vector $\vec{v}_\parallel$ and a vertical direction vector $\vec{v}_\perp$.
In the second step, for each pixel point between the two joint points, it is judged whether the pixel point lies on the first joint according to the position of the pixel point and the direction vectors of the two joint points.
Specifically, let the length of the first joint be $L$ and the width of the first joint be $w$, and let $p$ be the position of a pixel point between the first joint point ($a_1$) and the second joint point ($a_2$). When the pixel point $p$ satisfies

$$0\le\vec{v}_\parallel\cdot(p-a_1)\le L \quad\text{and}\quad \lvert\vec{v}_\perp\cdot(p-a_1)\rvert\le w,$$

the pixel point $p$ lies on the first joint, and the first joint point ($a_1$) and the second joint point ($a_2$) are correlated.
In the third step, if the pixel point lies on the first joint, the correlation between the two joint points and the first joint is calculated with a correlation function; the two joint points with the highest correlation are taken as the two ends of the first joint, and the connection between the two joint points is generated.
Specifically, the correlation function is

$$E=\int_{u=0}^{1}F\big(p(u)\big)\cdot\frac{a_2-a_1}{\lVert a_2-a_1\rVert_2}\,du,$$

where $p(u)$ samples the pixels between the first joint point ($a_1$) and the second joint point ($a_2$), with $p(u)=(1-u)a_1+ua_2$, and $F(\cdot)$ denotes the predicted direction field at a pixel.
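A small sketch of this correlation integral, approximated by sampling along $p(u)$ as described, could read as follows. The `paf` callable standing in for the predicted direction field, and the number of samples, are assumptions for illustration.

```python
import numpy as np

def limb_correlation(a1, a2, paf, num_samples=10):
    """Approximate the correlation integral E between joint positions a1
    and a2 by sampling p(u) = (1 - u) * a1 + u * a2, as in the text.

    `paf` is assumed to be a callable returning the predicted direction
    field (a 2-vector) at a given pixel position.
    """
    a1, a2 = np.asarray(a1, float), np.asarray(a2, float)
    v = (a2 - a1) / (np.linalg.norm(a2 - a1) + 1e-8)  # unit direction vector
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        p = (1.0 - u) * a1 + u * a2                   # sample along the limb
        score += np.dot(paf(p), v)                    # alignment with the limb
    return score / num_samples
```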
S103: Input the postures of the human bodies in consecutive multi-frame images of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result. The first convolutional neural network is used for action recognition, and the first action recognition result includes the occurrence probabilities of different action categories for each human body.
In this embodiment, the first convolutional neural network classifies general actions and is a graph convolutional neural network. Step S103 specifically includes the following steps:
normalizing the postures of the human bodies in the consecutive multi-frame images of the video stream;

extracting a region of interest from each frame of the video stream with an attention network;

performing a graph convolution operation on the different joint points of each human body in each frame of the video stream;

performing a temporal convolution operation on the same joint points of each human body across consecutive frames of the video stream;

performing action classification with a fully connected layer based on the features output by the graph convolution operation and the temporal convolution operation, to obtain the occurrence probability of each action category for each human body.
Specifically, the action classification formula is as follows:
$$g_{out}(v_{ti})=\sum_{v_{tj}\in B(v_{ti})} f_{in}\big(s(x,v_{tj})\big)\, w\big(\ell(v_{tj})\big)$$

where $g_{out}$ is the classification result; $f_{in}$ is the feature map; $s(\cdot)$ is the sampling function, which selects for the current joint point $v_{ti}$ its nearest joint points $v_{tj}$ in the neighborhood $B(v_{ti})$; $x$ is the joint point position; $w$ is the weight; $\ell(\cdot)$ is the weighting (partition) function; and $K$ is the convolution kernel size. In the spatial domain,

$$\ell_{ti}(v_{tj})=\begin{cases}0, & r_j=r_i\\ 1, & r_j<r_i\\ 2, & r_j>r_i\end{cases}$$

and in the temporal domain,

$$\ell(v_{qj})=\ell_{ti}(v_{tj})+\left(q-t+\left\lfloor \Gamma/2\right\rfloor\right)\times K,$$

where $r_i$ is the distance from the current joint point $v_{ti}$ to the body center; $r_j$ is the distance from the adjacent joint point $v_{tj}$ to the body center; $\Gamma$ is the sampling time window size; $q$ is the sampling time; and $t$ is the current time.
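The partition functions above could be sketched in plain Python as follows; the function names and the floor of $\Gamma/2$ are assumptions consistent with the reconstructed formula.

```python
def spatial_partition(r_i, r_j):
    """Spatial-domain weighting function l: joints are partitioned by
    their distance to the body center relative to the current joint
    (0: same distance, 1: closer to the center, 2: farther away)."""
    if r_j == r_i:
        return 0
    return 1 if r_j < r_i else 2

def temporal_partition(l_spatial, q, t, gamma, k):
    """Temporal-domain extension: offset the spatial label by the frame
    displacement (q - t) within the sampling window of size gamma,
    scaled by the kernel size k, as in the formula above."""
    return l_spatial + (q - t + gamma // 2) * k
```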
S104: Obtain the objects around each human body, and input the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result. The second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of a fall for each human body.
In this embodiment, the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to a preset threshold.
In this embodiment, fall recognition is performed based on the posture of the human body, the objects around it, and the positions of those objects relative to the body. For example, when a possibly falling human body is recognized together with the semantic and position information of surrounding tables and chairs, a judgment can be made: if the person is very close to a table or chair, they have probably not fallen; if the person is far from the table and chair, the probability of a fall is higher. If a bed or a yoga mat is detected beneath the body, it can be judged that the person has not fallen but is merely lying down or exercising. Incorporating the semantic information of the surrounding environment in this way greatly improves the accuracy of action detection.
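A minimal sketch of the neighborhood test used here (keeping only the objects within the preset distance threshold of the body) might be the following; measuring the distance between bounding-box centers, and the dict layout, are assumptions.

```python
import numpy as np

def objects_near_person(person_center, objects, threshold):
    """Keep only the detected objects whose distance to the human body
    is less than or equal to the preset threshold, as described above.
    Each object is assumed to carry a 'center' (x, y) and a 'category'.
    """
    nearby = []
    for obj in objects:
        dist = np.linalg.norm(np.asarray(obj["center"], float)
                              - np.asarray(person_center, float))
        if dist <= threshold:
            nearby.append({**obj, "distance": float(dist)})
    return nearby
```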
In this embodiment, the second convolutional neural network is trained with a sample set consisting of the posture of a human body when a fall occurs, the objects around the body, and the positions of those objects relative to the body. Specifically, in this embodiment, the training process of the second convolutional neural network includes:
S1041: Obtain a first sample image containing a human body performing a fall action and a second sample image containing a human body performing a fall-like action, and detect the human bodies and objects contained in the first sample image and in the second sample image respectively.

S1042: Perform posture recognition on the human bodies detected in the first sample image and the second sample image respectively, to obtain the postures of the human bodies.

S1043: In the first sample image, obtain the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determine the positions of those objects relative to the human body from the position of the human body and the positions of the objects around it; annotate the first sample image with the posture of the human body, the objects around it, and the positions of those objects relative to the body as fall training features, to obtain a first annotated sample image.

S1044: In the second sample image, obtain the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determine the positions of those objects relative to the human body from the position of the human body and the positions of the objects around it; annotate the second sample image with the posture of the human body, the objects around it, and the positions of those objects relative to the body as non-fall training features, to obtain a second annotated sample image.

S1045: Input the first annotated sample image and the second annotated sample image into a preset initial neural network for training, to obtain the second convolutional neural network.
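Purely for illustration, steps S1043 and S1044 could assemble one labeled training feature roughly as follows; the dictionary layout and field names are hypothetical.

```python
def build_fall_training_sample(pose, nearby_objects, person_center, is_fall):
    """Bundle the human posture, the surrounding objects, and their
    positions relative to the body into one labeled training feature,
    in the spirit of steps S1043 (fall) and S1044 (non-fall)."""
    relative = [
        {"category": o["category"],
         "offset": (o["center"][0] - person_center[0],
                    o["center"][1] - person_center[1])}
        for o in nearby_objects
    ]
    return {"pose": pose, "objects": relative, "label": 1 if is_fall else 0}
```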
S105: Output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
In this embodiment, corresponding weights are set for the first action recognition result and the second action recognition result. The adjusted probability of each action category and the adjusted probability of a fall are calculated from the occurrence probabilities of the different action categories in the first action recognition result weighted by the first weight, and the fall probability in the second recognition result weighted by the second weight. The action category with the largest adjusted probability is output as the behavior recognition result of the human body.
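A sketch of this weighted fusion, assuming example weights w1 and w2 (the disclosure only states that corresponding weights are set, not their values), might be:

```python
def fuse_recognition_results(general_probs, fall_prob, w1=0.5, w2=0.5):
    """Weight the general action probabilities from the first network
    against the fall probability from the second network, and return
    the action category with the largest adjusted probability."""
    adjusted = {cls: p * w1 for cls, p in general_probs.items()}
    adjusted["fall"] = fall_prob * w2
    return max(adjusted, key=adjusted.get)
```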
FIG. 2 is a schematic flowchart of a human behavior recognition method based on environmental semantic understanding according to a second embodiment of the present application. It should be noted that, provided substantially the same result is obtained, the method of the present application is not limited to the sequence shown in FIG. 2. As shown in FIG. 2, the method includes the following steps:
S201: Detect the human bodies and objects contained in each frame of image in the video stream.

S202: Perform posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body.

S203: Perform a de-occlusion operation on the postures of the human bodies recognized in each frame of image.

S204: Input the postures of the human bodies in consecutive multi-frame images of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result; the first convolutional neural network is used for action recognition, and the first action recognition result includes the occurrence probabilities of different action categories for each human body.

S205: Obtain the objects around each human body, and input the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into the pre-trained second convolutional neural network to obtain the second action recognition result; the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of a fall for each human body.

S206: Output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
For step S201, step S202, and steps S204 to S206, refer to steps S101 to S105 of the first embodiment respectively; details are not repeated here.
In step S203, for each detection box, when the detection box contains multiple human bodies, multiple joint point groups within the detection box are obtained based on the postures of the human bodies located in the box. Each joint point group comprises multiple joint points belonging to the same human body, and the detection box of a human body frames the circumscribed region of that body in each frame of image. From the multiple joint point groups, the groups whose left shoulder joint point and right shoulder joint point both lie within the detection box are obtained; among these, the group with the largest number of joint points is marked as the target joint point group, and the remaining joint point groups in the detection box are marked as occluded joint point groups. In this embodiment, each joint point group corresponds to one human body. When multiple human bodies appear in one detection box, the de-occlusion operation of step S203 removes the joint point groups of the occluded bodies, and the posture of the human body corresponding to the target joint point group is used as the object of action recognition; in the subsequent steps S204 and S205, action classification is performed according to the posture of the human body corresponding to the target joint point group. In step S205, as in this embodiment, the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to the preset threshold.
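The de-occlusion rule of step S203 could be sketched as follows; the joint-group representation (a dict of named joints mapped to (x, y) positions) and the box format (x1, y1, x2, y2) are assumptions.

```python
def select_target_joint_group(joint_groups, box):
    """Among the joint groups whose left and right shoulder joints both
    fall inside the detection box, mark the group with the most joints
    as the target; all other groups in the box are treated as occluded,
    mirroring the de-occlusion rule in step S203."""
    def inside(pt):
        x, y = pt
        return box[0] <= x <= box[2] and box[1] <= y <= box[3]

    candidates = [
        g for g in joint_groups
        if "left_shoulder" in g and "right_shoulder" in g
        and inside(g["left_shoulder"]) and inside(g["right_shoulder"])
    ]
    if not candidates:
        return None, joint_groups
    target = max(candidates, key=len)     # most joint points wins
    occluded = [g for g in joint_groups if g is not target]
    return target, occluded
```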
In this embodiment, the de-occlusion algorithm designed for scenes with overlapping people prevents the pose information of an occluded person from being used to recognize the actions of an unoccluded person, which increases the reliability and accuracy of the algorithm and makes it applicable in real, complex scenes.
In an optional implementation, the following step is further included after step S206:
The human behavior recognition method further includes: uploading the postures of the human bodies and their behavior recognition results to a blockchain, so that the blockchain stores the postures and the behavior recognition results in encrypted form.
Specifically, corresponding summary information is obtained from the postures of the human bodies or their behavior recognition results; the summary information is obtained by hashing the postures or the behavior recognition results, for example with the SHA-256 algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users. User equipment can download the summary information from the blockchain to verify whether the behavior recognition results have been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
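A minimal sketch of the digest step, assuming JSON serialization before SHA-256 hashing (the disclosure names the sha256 algorithm but not the serialization), might be:

```python
import hashlib
import json

def make_summary_digest(poses, behavior_results):
    """Hash the postures and behavior recognition results with SHA-256
    to produce the summary information uploaded to the blockchain.
    JSON serialization with sorted keys is an assumption, chosen so the
    digest is deterministic for the same inputs."""
    payload = json.dumps(
        {"poses": poses, "results": behavior_results},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```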
FIG. 3 is a schematic structural diagram of a human behavior recognition apparatus based on environmental semantic understanding according to a third embodiment of the present application. As shown in FIG. 3, the apparatus 30 includes a target detection module 301, a posture recognition module 302, a general action classification module 303, a fall action recognition module 304, and an output module 305.
The target detection module 301 is configured to detect the human bodies and objects contained in each frame of image in the video stream. The posture recognition module 302 is configured to perform posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body. The general action classification module 303 is configured to input the postures of the human bodies in consecutive multi-frame images of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result; the first convolutional neural network is used for action recognition, and the first action recognition result includes the occurrence probabilities of different action categories for each human body. The fall action recognition module 304 is configured to obtain the objects around each human body and input the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into the pre-trained second convolutional neural network to obtain the second action recognition result; the objects around a human body are the objects whose distance from the body in each frame of image is less than or equal to the preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result includes the probability of a fall for each human body. The output module 305 is configured to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
FIG. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application. As shown in FIG. 4, the electronic device 40 includes a processor 41 and a memory 42 coupled to the processor 41.
The memory 42 stores program instructions for implementing the environmental-semantic-understanding-based human behavior recognition of any of the above embodiments.
The processor 41 is configured to execute the program instructions stored in the memory 42 to perform human behavior recognition based on environmental semantic understanding.
The processor 41 may also be referred to as a CPU (Central Processing Unit). The processor 41 may be an integrated circuit chip with signal processing capabilities. The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of this embodiment stores program instructions 51 capable of implementing all of the above methods, and the storage medium may be non-volatile or volatile. The program instructions 51 may be stored in the storage medium in the form of a software product and include several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage device includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, or a terminal device such as a computer, a server, a mobile phone, or a tablet.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. The above are only implementations of this application and do not thereby limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.
The above are merely implementations of this application. It should be noted that those of ordinary skill in the art may make improvements without departing from the inventive concept of this application, and such improvements all fall within the protection scope of this application.

Claims (19)

  1. A human behavior recognition method based on environmental semantic understanding, wherein the method comprises:
    detecting the human bodies and objects contained in each frame of image in a video stream;
    performing posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body;
    inputting the postures of the human bodies in consecutive multi-frame images of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, wherein the first convolutional neural network is used for action recognition and the first action recognition result comprises the occurrence probabilities of different action categories for each human body;
    obtaining the objects around each human body, and inputting the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, wherein the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result comprises the probability of a fall for each human body;
    outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  2. The human behavior recognition method according to claim 1, wherein detecting the human bodies and objects contained in each frame of image in the video stream comprises:
    dividing each frame of image in the video stream into a plurality of grids according to a preset division method;
    in each grid, performing target prediction with pre-set detection boxes of different types, and for each detection box, obtaining the coordinate parameters of the target predicted by the detection box, the width and height of the detection box, and the confidence of the detection box, and taking the detection box with the highest confidence as the prediction result, wherein the prediction result comprises the target, the detection box, the coordinate parameters of the target, and the category of the target, the detection box frames the circumscribed region of the target, and the target categories comprise human bodies and objects;
    determining the human bodies and objects contained in each frame of image in the video stream according to the prediction result.
  3. The human behavior recognition method according to claim 1, wherein the posture of the human body comprises the positions of joint points and the connections between joint points, and performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body comprises:
    performing high-dimensional feature extraction on the human body contained in each frame of image;
    determining the positions of the joint points of the human body according to the high-dimensional features;
    determining the connections between joint points according to the positions of the joint points, and taking the positions of the joint points and the connections between them as the posture of the human body.
  4. The human behavior recognition method according to claim 3, wherein determining the connections between joint points according to the positions of the joint points comprises:
    for every two joint points, obtaining the direction vector of the two joint points according to their positions, and decomposing the direction vector into a parallel direction vector and a vertical direction vector;
    for each pixel point between the two joint points, judging whether the pixel point lies on a first joint according to the position of the pixel point and the direction vectors of the two joint points;
    if the pixel point lies on the first joint, calculating the correlation between the two joint points and the first joint according to a correlation function, taking the two joint points with the highest correlation as the two ends of the first joint, and generating the connection between the two joint points.
  5. The human behavior recognition method according to claim 1, wherein after performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body, the method further comprises:
    for the detection box of each human body, when the detection box contains multiple human bodies, obtaining multiple joint point groups within the detection box based on the postures of the human bodies located in the box, wherein each joint point group comprises multiple joint points belonging to the same human body and the detection box of a human body frames the circumscribed region of that body in each frame of image;
    obtaining, from the multiple joint point groups, the joint point groups whose left shoulder joint point and right shoulder joint point lie within the detection box;
    selecting, from the joint point groups whose left and right shoulder joint points lie within the detection box, the group with the largest number of joint points and marking it as the target joint point group, marking the joint point groups in the detection box other than the target joint point group as occluded joint point groups, and taking the posture of the human body corresponding to the target joint point group as the object of action recognition.
  6. The human behavior recognition method according to claim 1, wherein inputting the postures of the human bodies in consecutive multi-frame images of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result of the human body comprises:
    extracting a region of interest from each frame of the video stream with an attention network;
    performing a graph convolution operation on the different joint points of each human body in each frame of the video stream;
    performing a temporal convolution operation on the same joint points of each human body across consecutive frames of the video stream;
    performing action classification with a fully connected layer based on the features output by the graph convolution operation and the temporal convolution operation, to obtain the occurrence probability of each action category for each human body.
  7. The human behavior recognition method according to claim 6, wherein the method further comprises: uploading the postures of the human bodies and their behavior recognition results to a blockchain, so that the blockchain stores the postures and the behavior recognition results in encrypted form;
    and wherein before extracting the region of interest from each frame of the video stream with the attention network, the method further comprises: normalizing the postures of the human bodies in the consecutive multi-frame images of the video stream.
  8. The human behavior recognition method according to claim 1, wherein the training process of the second convolutional neural network comprises:
    obtaining a first sample image containing a human body performing a fall action, and detecting the human body and objects contained in the first sample image;
    performing posture recognition on the human body detected in the first sample image, to obtain the posture of the human body;
    obtaining the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determining the positions of those objects relative to the human body from the position of the human body and the positions of the objects around it;
    annotating the first sample image with the posture of the human body, the objects around it, and the positions of those objects relative to the body as fall training features, to obtain a first annotated sample image;
    inputting the first annotated sample image into a preset initial neural network for training, to obtain the second convolutional neural network.
  9. A human behavior recognition apparatus based on environmental semantic understanding, wherein the apparatus comprises:
    a target detection module, configured to detect the human bodies and objects contained in each frame of image in a video stream;
    a posture recognition module, configured to perform posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body;
    a general action classification module, configured to input the postures of the human bodies in consecutive multi-frame images of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, wherein the first convolutional neural network is used for action recognition and the first action recognition result comprises the occurrence probabilities of different action categories for each human body;
    a fall action recognition module, configured to obtain the objects around each human body and input the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, wherein the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result comprises the probability of a fall for each human body;
    an output module, configured to output the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  10. An electronic device, wherein the electronic device comprises a processor and a memory coupled to the processor, the memory storing program instructions executable by the processor; when executing the program instructions stored in the memory, the processor implements the following steps:
    detecting the human bodies and objects contained in each frame of image in a video stream;
    performing posture recognition on each human body contained in each detected frame of image, to obtain the posture of each human body;
    inputting the postures of the human bodies in consecutive multi-frame images of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, wherein the first convolutional neural network is used for action recognition and the first action recognition result comprises the occurrence probabilities of different action categories for each human body;
    obtaining the objects around each human body, and inputting the postures of the human bodies and the objects around them in consecutive multi-frame images of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, wherein the objects around a human body are the objects whose distance from the human body in each frame of image is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result comprises the probability of a fall for each human body;
    outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
  11. The electronic device according to claim 10, wherein detecting the human bodies and objects contained in each frame of image in the video stream comprises:
    dividing each frame of image in the video stream into a plurality of grids according to a preset division method;
    in each grid, performing target prediction with pre-set detection boxes of different types, and for each detection box, obtaining the coordinate parameters of the target predicted by the detection box, the width and height of the detection box, and the confidence of the detection box, and taking the detection box with the highest confidence as the prediction result, wherein the prediction result comprises the target, the detection box, the coordinate parameters of the target, and the category of the target, the detection box frames the circumscribed region of the target, and the target categories comprise human bodies and objects;
    determining the human bodies and objects contained in each frame of image in the video stream according to the prediction result.
  12. The electronic device according to claim 10, wherein the posture of the human body comprises the positions of joint points and the connections between joint points, and performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body comprises:
    performing high-dimensional feature extraction on the human body contained in each frame of image;
    determining the positions of the joint points of the human body according to the high-dimensional features;
    determining the connections between joint points according to the positions of the joint points, and taking the positions of the joint points and the connections between them as the posture of the human body.
  13. The electronic device according to claim 10, wherein after performing posture recognition on each human body contained in each detected frame of image to obtain the posture of each human body, the steps further comprise:
    for the detection box of each human body, when the detection box contains multiple human bodies, obtaining multiple joint point groups within the detection box based on the postures of the human bodies located in the box, wherein each joint point group comprises multiple joint points belonging to the same human body and the detection box of a human body frames the circumscribed region of that body in each frame of image;
    obtaining, from the multiple joint point groups, the joint point groups whose left shoulder joint point and right shoulder joint point lie within the detection box;
    selecting, from the joint point groups whose left and right shoulder joint points lie within the detection box, the group with the largest number of joint points and marking it as the target joint point group, marking the joint point groups in the detection box other than the target joint point group as occluded joint point groups, and taking the posture of the human body corresponding to the target joint point group as the object of action recognition.
  14. The electronic device according to claim 10, wherein inputting the postures of the human bodies in consecutive multi-frame images of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result of the human body comprises:
    extracting a region of interest from each frame of the video stream with an attention network;
    performing a graph convolution operation on the different joint points of each human body in each frame of the video stream;
    performing a temporal convolution operation on the same joint points of each human body across consecutive frames of the video stream;
    performing action classification with a fully connected layer based on the features output by the graph convolution operation and the temporal convolution operation, to obtain the occurrence probability of each action category for each human body.
  15. The electronic device according to claim 10, wherein the training process of the second convolutional neural network comprises:
    obtaining a first sample image containing a human body performing a fall action, and detecting the human body and objects contained in the first sample image;
    performing posture recognition on the human body detected in the first sample image, to obtain the posture of the human body;
    obtaining the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determining the positions of those objects relative to the human body from the position of the human body and the positions of the objects around it;
    annotating the first sample image with the posture of the human body, the objects around it, and the positions of those objects relative to the body as fall training features, to obtain a first annotated sample image;
    inputting the first annotated sample image into a preset initial neural network for training, to obtain the second convolutional neural network.
  16. A storage medium, wherein program instructions are stored in the storage medium, and when the program instructions are executed by a processor, the following steps are implemented:
    detecting the human bodies and objects contained in each frame of a video stream;
    performing posture recognition on each human body contained in each detected frame to obtain the posture of each human body;
    inputting the postures of each human body in consecutive frames of the video stream into a pre-trained first convolutional neural network to obtain a first action recognition result, wherein the first convolutional neural network is used for action recognition, and the first action recognition result comprises the occurrence probabilities of different action categories of each human body;
    acquiring the objects around each human body, and inputting the postures of each human body and the objects around each human body in the consecutive frames of the video stream into a pre-trained second convolutional neural network to obtain a second action recognition result, wherein the objects around a human body are the objects whose distance from the human body in each frame is less than or equal to a preset threshold, the second convolutional neural network is used for fall action recognition, and the second action recognition result comprises the fall occurrence probability of each human body;
    outputting the behavior recognition result of each human body according to the first action recognition result and the second action recognition result.
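The claim specifies that the output behavior recognition result depends on both recognition results but leaves the fusion rule open; the sketch below shows one plausible rule, under assumed interfaces, that prefers the dedicated fall network when its probability is both above a threshold and higher than the best generic action probability. The threshold value and the function signature are illustrative assumptions.

    def fuse_results(action_probs, action_names, fall_prob, fall_threshold=0.5):
        """Merge the first and second action recognition results into one label."""
        best = max(range(len(action_probs)), key=lambda i: action_probs[i])
        # Prefer the fall label when the dedicated fall network is both confident
        # and more confident than the best generic action class.
        if fall_prob >= fall_threshold and fall_prob >= action_probs[best]:
            return "fall", fall_prob
        return action_names[best], action_probs[best]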
  17. The storage medium according to claim 16, wherein detecting the human bodies and objects contained in each frame of the video stream comprises:
    dividing each frame of the video stream into a plurality of grids according to a preset division method;
    performing, in each grid, target prediction through preset detection boxes of different types; acquiring, for each detection box, the coordinate parameters of the target predicted by the detection box, the width and height of the detection box, and the confidence of the detection box; and taking the detection box with the highest confidence as the prediction result, wherein the prediction result comprises the target, the detection box, the coordinate parameters of the target, and the category of the target, the detection box is a circumscribed region framing the target, and the categories of the target comprise human body and object;
    determining the human bodies and objects contained in each frame of the video stream according to the prediction result.
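As a hedged illustration of the grid-based prediction recited above, in the spirit of the YOLO detector cited in the search report, the sketch below splits the network output into S x S grid cells, lets each cell propose several anchor-shaped detection boxes, and keeps the highest-confidence box per cell together with its coordinates, size, and predicted category. The output layout and the decoding formulas are assumptions for illustration.

    import numpy as np

    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def decode_grid_predictions(raw, anchors, img_w, img_h, conf_threshold=0.5):
        """Keep the highest-confidence detection box per grid cell."""
        # raw: (S, S, num_anchors, 5 + num_classes) with entries
        #      (tx, ty, tw, th, objectness, class scores ...)
        S = raw.shape[0]
        detections = []
        for row in range(S):
            for col in range(S):
                cell = raw[row, col]
                best = int(np.argmax(cell[:, 4]))              # most confident anchor box
                tx, ty, tw, th, obj = cell[best, :5]
                conf = _sigmoid(obj)
                if conf < conf_threshold:
                    continue
                cx = (col + _sigmoid(tx)) / S * img_w          # box center, in pixels
                cy = (row + _sigmoid(ty)) / S * img_h
                w = anchors[best][0] * np.exp(tw)              # anchor-relative width/height
                h = anchors[best][1] * np.exp(th)
                cls = int(np.argmax(cell[best, 5:]))           # e.g. human body vs. object
                detections.append((cls, conf, (cx, cy, w, h)))
        return detections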
  18. The storage medium according to claim 16, wherein after performing posture recognition on each human body contained in each detected frame to obtain the posture of each human body, the steps further comprise:
    for the detection box of each human body, when the detection box contains a plurality of human bodies, acquiring a plurality of joint point groups in the detection box based on the postures of the human bodies located in the detection box, wherein each joint point group comprises a plurality of joint points belonging to the same human body, and the detection box of a human body is a circumscribed region framing the human body contained in each frame;
    acquiring, from the plurality of joint point groups, the joint point groups whose left shoulder joint point and right shoulder joint point are located in the detection box;
    selecting, from the joint point groups whose left shoulder joint point and right shoulder joint point are located in the detection box, the joint point group with the largest number of joint points and marking it as the target joint point group, marking the joint point groups in the detection box other than the target joint point group as occluded joint point groups, and taking the posture of the human body corresponding to the target joint point group as the object of action recognition.
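A minimal sketch, under an assumed joint-group format, of the occlusion rule recited above: among the skeletons detected inside one person's detection box, only those whose left and right shoulder joint points fall inside the box are candidates, the candidate with the most joint points becomes the target joint point group, and the remaining groups are marked occluded. The joint names and return convention are illustrative assumptions.

    def select_target_joint_group(box, joint_groups):
        """Pick the action-recognition target among overlapping skeletons."""
        # box: (x1, y1, x2, y2); joint_groups: list of {joint_name: (x, y)} dicts.
        def inside(pt):
            return box[0] <= pt[0] <= box[2] and box[1] <= pt[1] <= box[3]

        candidates = [g for g in joint_groups
                      if "left_shoulder" in g and "right_shoulder" in g
                      and inside(g["left_shoulder"]) and inside(g["right_shoulder"])]
        if not candidates:
            return None, joint_groups                  # treat every group as occluded
        target = max(candidates, key=len)              # most detected joint points wins
        occluded = [g for g in joint_groups if g is not target]
        return target, occluded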
  19. The storage medium according to claim 16, wherein inputting the postures of each human body in the consecutive frames of the video stream into the pre-trained first convolutional neural network to obtain the first action recognition result comprises:
    extracting a region of interest from each frame of the video stream by using an attention network;
    performing a graph convolution operation on the different joint points of each human body in each frame of the video stream;
    performing a temporal convolution operation on the same joint points of each human body across the consecutive frames of the video stream;
    using a fully connected layer to perform action classification according to the features output by the graph convolution operation and the features output by the temporal convolution operation, and obtaining the occurrence probabilities of the different action categories of each human body.
  20. The storage medium according to claim 16, wherein the training process of the second convolutional neural network comprises:
    acquiring a first sample image containing a human body performing a fall action, and detecting the human body and the objects contained in the first sample image;
    performing posture recognition on the human body contained in the detected first sample image to obtain the posture of the human body;
    acquiring the objects whose distance from the human body is less than or equal to the preset threshold as the objects around the human body, and determining the positions of the objects relative to the human body according to the position of the human body and the positions of the objects around the human body;
    labeling the posture of the human body, the objects around the human body, and the positions of the objects relative to the human body in the first sample image as fall training features to obtain a first labeled sample image;
    inputting the first labeled sample image into a preset initial neural network for training to obtain the second convolutional neural network.
PCT/CN2020/123214 2020-05-29 2020-10-23 Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium WO2021114892A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010475795.7A CN111666857B (en) 2020-05-29 2020-05-29 Human behavior recognition method, device and storage medium based on environment semantic understanding
CN202010475795.7 2020-05-29

Publications (1)

Publication Number Publication Date
WO2021114892A1 (en) 2021-06-17

Family

ID=72385160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123214 WO2021114892A1 (en) 2020-05-29 2020-10-23 Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111666857B (en)
WO (1) WO2021114892A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666857B (en) * 2020-05-29 2023-07-04 平安科技(深圳)有限公司 Human behavior recognition method, device and storage medium based on environment semantic understanding
CN112137591B (en) * 2020-10-12 2021-07-23 平安科技(深圳)有限公司 Target object position detection method, device, equipment and medium based on video stream
CN112712061B (en) * 2021-01-18 2023-01-24 清华大学 Method, system and storage medium for recognizing multidirectional traffic police command gestures
CN114494976A (en) * 2022-02-17 2022-05-13 平安科技(深圳)有限公司 Human body tumbling behavior evaluation method and device, computer equipment and storage medium
CN114565087B (en) * 2022-04-28 2022-07-22 苏州浪潮智能科技有限公司 Method, device and equipment for reasoning intention of people and storage medium
CN115147935B (en) * 2022-09-05 2022-12-13 浙江壹体科技有限公司 Behavior identification method based on joint point, electronic device and storage medium
CN116189238B (en) * 2023-04-19 2023-07-04 国政通科技有限公司 Human shape detection and identification fall detection method based on neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220604A (en) * 2017-05-18 2017-09-29 清华大学深圳研究生院 A kind of fall detection method based on video
US10614310B2 (en) * 2018-03-22 2020-04-07 Viisights Solutions Ltd. Behavior recognition
CN110163127A (en) * 2019-05-07 2019-08-23 国网江西省电力有限公司检修分公司 A kind of video object Activity recognition method from thick to thin

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140334668A1 (en) * 2013-05-10 2014-11-13 Palo Alto Research Center Incorporated System and method for visual motion based object segmentation and tracking
CN110610154A (en) * 2019-09-10 2019-12-24 北京迈格威科技有限公司 Behavior recognition method and apparatus, computer device, and storage medium
CN111666857A (en) * 2020-05-29 2020-09-15 平安科技(深圳)有限公司 Human behavior recognition method and device based on environment semantic understanding and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
REDMON JOSEPH; DIVVALA SANTOSH; GIRSHICK ROSS; FARHADI ALI: "You Only Look Once: Unified, Real-Time Object Detection", 9 May 2016 (2016-05-09), pages 1-10, XP055556774, Retrieved from the Internet <URL:https://arxiv.org/pdf/1506.02640.pdf> [retrieved on 20190214], DOI: 10.1109/CVPR.2016.91 *
YAN SIJIE; XIONG YUANJUN; LIN DAHUA: "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition", arXiv.org, Cornell University Library, 23 January 2018 (2018-01-23), XP080853964 *
CAO ZHE: "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 14 April 2017 (2017-04-14), pages 7291-7299, XP055712609, ISBN: 978-1-5386-0457-1, DOI: 10.1109/CVPR.2017.143 *
ZHOU YICHONG: "Investigation and Application of Human-Object Interaction Detection Algorithm", Information & Technology, China Master's Theses Full-Text Database, 14 May 2019 (2019-05-14), pages 1-60, XP055820580, [retrieved on 20210702] *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673319A (en) * 2021-07-12 2021-11-19 浙江大华技术股份有限公司 Abnormal posture detection method, abnormal posture detection device, electronic device and storage medium
CN113673319B (en) * 2021-07-12 2024-05-03 浙江大华技术股份有限公司 Abnormal gesture detection method, device, electronic device and storage medium
CN113837005A (en) * 2021-08-20 2021-12-24 广州杰赛科技股份有限公司 Human body falling detection method and device, storage medium and terminal equipment
CN113743273A (en) * 2021-08-27 2021-12-03 西安交通大学 Real-time rope skipping counting method, device and equipment based on video image target detection
CN113743273B (en) * 2021-08-27 2024-04-05 西安交通大学 Real-time rope skipping counting method, device and equipment based on video image target detection
WO2023082882A1 (en) * 2021-11-15 2023-05-19 河南理工大学 Pose estimation-based pedestrian fall action recognition method and device
GB2616733A (en) * 2021-11-15 2023-09-20 Univ Henan Polytechnic Pose estimation-based pedestrian fall action recognition method and device
CN114157526A (en) * 2021-12-23 2022-03-08 广州新华学院 Digital image recognition-based home security remote monitoring method and device
CN114157526B (en) * 2021-12-23 2022-08-12 广州新华学院 Digital image recognition-based home security remote monitoring method and device
CN115082836B (en) * 2022-07-23 2022-11-11 深圳神目信息技术有限公司 Behavior recognition-assisted target object detection method and device
CN115082836A (en) * 2022-07-23 2022-09-20 深圳神目信息技术有限公司 Behavior recognition-assisted target object detection method and device
CN115131826B (en) * 2022-08-23 2022-11-11 浙江大华技术股份有限公司 Article detection and identification method, and network model training method and device
CN115131826A (en) * 2022-08-23 2022-09-30 浙江大华技术股份有限公司 Article detection and identification method, and network model training method and device
CN116311542A (en) * 2023-05-23 2023-06-23 广州英码信息科技有限公司 Human body fall detection method and system compatible with crowded scene and uncongested scene
CN116311542B (en) * 2023-05-23 2023-08-04 广州英码信息科技有限公司 Human body fall detection method and system compatible with crowded scene and uncongested scene

Also Published As

Publication number Publication date
CN111666857B (en) 2023-07-04
CN111666857A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
WO2021114892A1 (en) Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium
US11790682B2 (en) Image analysis using neural networks for pose and action identification
Jegham et al. Vision-based human action recognition: An overview and real world challenges
CN110135246B (en) Human body action recognition method and device
US10296102B1 (en) Gesture and motion recognition using skeleton tracking
CN107633207B (en) AU characteristic recognition methods, device and storage medium
US11514244B2 (en) Structured knowledge modeling and extraction from images
CN112597941B (en) Face recognition method and device and electronic equipment
WO2018228218A1 (en) Identification method, computing device, and storage medium
Ji et al. Learning contrastive feature distribution model for interaction recognition
JP2017506379A (en) System and method for identifying faces in unconstrained media
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
Arivazhagan et al. Human action recognition from RGB-D data using complete local binary pattern
CN110458235B (en) Motion posture similarity comparison method in video
CN110738650B (en) Infectious disease infection identification method, terminal device and storage medium
Kishore et al. Estimation of yoga postures using machine learning techniques
Wu et al. Pose-Guided Inflated 3D ConvNet for action recognition in videos
JP2022542199A (en) KEYPOINT DETECTION METHOD, APPARATUS, ELECTRONICS AND STORAGE MEDIA
CN112686211A (en) Fall detection method and device based on attitude estimation
Wang et al. Capturing feature and label relations simultaneously for multiple facial action unit recognition
Das et al. A fusion of appearance based CNNs and temporal evolution of skeleton with LSTM for daily living action recognition
CN115393964B (en) Fitness action recognition method and device based on BlazePose
CN111753796A (en) Method and device for identifying key points in image, electronic equipment and storage medium
CN108038451A (en) Anomaly detection method and device
GB2603640A (en) Action identification using neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20900119

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20900119

Country of ref document: EP

Kind code of ref document: A1