CN113408390A - Human behavior real-time identification method, system, device and storage medium - Google Patents

Human behavior real-time identification method, system, device and storage medium

Info

Publication number
CN113408390A
Authority
CN
China
Prior art keywords
human body
human
image
behavior
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110654270.4A
Other languages
Chinese (zh)
Inventor
曾碧
姚壮泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110654270.4A priority Critical patent/CN113408390A/en
Publication of CN113408390A publication Critical patent/CN113408390A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time human behavior identification method, system, device and storage medium. The method comprises: acquiring a first image through a binocular camera, wherein the first image is an RGB image containing a human body; processing the first image by using a YOLOv3 network model to obtain a human body frame image; inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points; and inputting the plurality of human skeleton points into a behavior recognition network for recognition to obtain a behavior category result, wherein the behavior categories comprise falling, sitting, standing and walking. By acquiring an RGB image of the human body with the binocular camera and processing it with the YOLOv3 network model, a human body frame image is obtained, and high real-time performance and accuracy can be ensured without requiring temporal information about the human body, which solves the problem that real-time performance and accuracy are difficult to reconcile in behavior recognition methods. The invention can be widely applied to the technical field of human behavior recognition.

Description

Human behavior real-time identification method, system, device and storage medium
Technical Field
The invention relates to the technical field of human behavior recognition, and in particular to a method, system, device and storage medium for real-time human behavior recognition.
Background
Behavior detection is a pressing problem in today's society. In a household robot, human behavior recognition and abnormal-behavior alarming are very important modules, and how to recognize and detect human behavior efficiently and accurately, and to raise an alarm for abnormal behavior, is a problem worthy of attention.
As population aging becomes more severe, the demand for elderly care services grows stronger. Of the thirty to fifty thousand hip fractures recorded each year, about 90% are caused by falls; only about one quarter of hip-fracture patients recover fully, and about one quarter of patients over 50 years of age die within one year of the injury. It is therefore necessary to detect falls of the elderly and seek medical attention promptly. More and more home caregivers and elderly care institutions are aware of the importance of being able to monitor falls of the elderly in real time. In video surveillance, automatic fall detection is very important for protecting vulnerable groups such as the elderly.
In recent years, with the rapid development of deep learning, behavior recognition algorithms have also improved greatly. Current behavior recognition mostly relies on target detection and multi-target tracking; algorithms that combine pose estimation with behavior recognition need temporal information and find it difficult to meet the real-time requirements of robot behavior recognition. Common algorithms that simply judge falls instead rely on information such as angular velocity and SVM-based recognition, and their accuracy is not high.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a method, a system, a device and a storage medium for real-time human behavior identification.
The technical scheme adopted by the invention is as follows:
in one aspect, an embodiment of the present invention includes a method for real-time human behavior recognition, including:
acquiring a first image through a binocular camera, wherein the first image is an RGB (red, green and blue) image containing a human body;
processing the first image by using a YOLOv3 network model to obtain a human body frame image;
inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points;
inputting a plurality of human skeleton points into a behavior recognition network for recognition to obtain behavior category results, wherein the behavior categories comprise falling, sitting, standing and walking behaviors.
Further, after processing the first image by using the YOLOv3 network model to obtain the human body frame image, the method further includes:
acquiring coordinates of the center point of the human body frame according to the human body frame image;
acquiring a human body space position coordinate according to the human body frame central point coordinate;
and judging whether the human body position is on the ground or not according to the human body space position coordinate.
Further, the step of judging whether the human body position is on the ground according to the human body space position coordinates includes:
calculating a first height according to the space position coordinates of the human body, wherein the first height is the height of the human body from the ground;
and if the first height is smaller than a first threshold value, determining that the position of the human body is on the ground.
Further, the first height is calculated by the following formula:
[first-height formula given as an image in the original publication; h is computed from hc, d and a]
wherein h denotes the first height, hc denotes the distance of the binocular camera from the ground, d denotes the distance of the human body from the binocular camera, wherein
d = √(x² + y² + z²)
(x, y, z) represents the spatial position coordinates of the human body, and a represents the tilt angle of the binocular camera in the vertical direction.
Further, the step of inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points includes:
inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body key joint points;
and screening a plurality of human body key joint points to obtain a plurality of human body bone points.
Further, inputting the human body frame image into an AlphaPose network model, and also obtaining a bone point position and a bone point confidence corresponding to each human body bone point; the step of inputting a plurality of human skeleton points into a behavior recognition network for recognition to obtain a behavior classification result comprises the following steps:
inputting the position and confidence of the bone point corresponding to each human body bone point into a first full-connection layer to obtain a first output result;
inputting the first output result into a second full-connection layer to obtain a second output result;
inputting the second output result into a third full-connection layer to obtain a third output result;
inputting the third output result into a fourth full-connection layer to obtain a fourth output result;
inputting the fourth output result into a RELU layer to obtain a fifth output result;
and inputting the fifth output result into a dropout layer to obtain a behavior classification result.
Further, the method further comprises training the behavior recognition network, including:
constructing a training set, wherein the training set comprises an Le2i data set and a ntu-rgbd behavior recognition data set;
and inputting the training set into the behavior recognition network to train the behavior recognition network.
On the other hand, the embodiment of the invention also comprises a human body behavior real-time identification system, which comprises:
the system comprises an acquisition module, a display module and a control module, wherein the acquisition module is used for acquiring a first image through a binocular camera, and the first image is an RGB (red, green and blue) image containing a human body;
the processing module is used for processing the first image by utilizing a YOLOv3 network model to obtain a human body frame image;
the first input module is used for inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points;
and the second input module is used for inputting a plurality of human skeleton points into the behavior recognition network for recognition to obtain behavior category results, wherein the behavior categories comprise falling, sitting, standing and walking behaviors.
On the other hand, the embodiment of the invention also comprises a human behavior real-time recognition device, which comprises:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the real-time behavior recognition method.
In another aspect, the embodiment of the present invention further includes a computer readable storage medium, on which a program executable by a processor is stored, and the program executable by the processor is used for implementing the real-time behavior recognition method when being executed by the processor.
The invention has the beneficial effects that:
the method comprises the steps of obtaining RGB images containing a human body through a binocular camera; then, processing the image by using a YOLOv3 network model to obtain a human body frame image; inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points; inputting a plurality of human skeleton points into a behavior recognition network for recognition to obtain a behavior category result, so that human behaviors can be recognized in real time; according to the invention, the binocular camera is used for acquiring the RGB image of the human body, the image is processed by utilizing the YOLOv3 network model, the frame image of the human body can be obtained, and high real-time performance and accuracy rate can be ensured without the time sequence information of the human body; the problem that the real-time performance and the accuracy are difficult to be compatible in the behavior recognition method is solved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart illustrating steps of a real-time human behavior recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for determining whether a human body is located on the ground according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating calculation of a first height according to an embodiment of the present invention;
fig. 4 is a flowchart of a specific application example of inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of the labeling of key joint points of a human body according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a behavior recognition network according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a human behavior real-time identification device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more and "a plurality of" means two or more; terms such as "greater than", "less than" and "exceeding" are understood as excluding the stated number, while terms such as "above", "below" and "within" are understood as including it. If "first" and "second" are described, it is only for the purpose of distinguishing technical features, and is not to be understood as indicating or implying relative importance, implicitly indicating the number of technical features indicated, or implicitly indicating the precedence of the technical features indicated.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
The embodiments of the present application will be further explained with reference to the drawings.
Referring to fig. 1, an embodiment of the present invention provides a real-time human behavior recognition method, including but not limited to the following steps:
s101, acquiring a first image through a binocular camera, wherein the first image is an RGB (red, green and blue) image containing a human body;
s102, processing the first image by using a YOLOv3 network model to obtain a human body frame image;
s103, inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points;
and S104, inputting a plurality of human skeleton points into a behavior recognition network for recognition to obtain behavior category results, wherein the behavior categories comprise falling, sitting, standing and walking behaviors.
In this embodiment, a BGR image containing a human body is acquired by the binocular camera, and the BGR image is then converted into an RGB image containing the human body, that is, the first image, through OpenCV. The first image is processed with the YOLOv3 network model (a target detection model), i.e. the YOLOv3 algorithm serves as the detection algorithm, to obtain a human body frame image together with the upper-left and lower-right corner coordinates of the human body frame. The human body frame image is then input into the AlphaPose network model (a pose estimation network model) to obtain a plurality of human body skeleton points. Finally, the plurality of human skeleton points are input into the behavior recognition network for recognition to obtain a behavior classification result. The embodiment of the invention does not need temporal information about the human body: once the RGB image containing the human body has been acquired by the binocular camera, the behavior recognition result can be obtained using only the target detection model, the pose estimation network model and the behavior recognition network, which ensures high real-time performance and accuracy and solves the problem that real-time performance and accuracy are difficult to reconcile in behavior recognition methods.
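As a rough sketch of this acquisition and detection stage, the Python snippet below captures a frame, converts it from BGR to RGB with OpenCV, and hands it to a person detector; detect_persons is an assumed wrapper around a YOLOv3 model, not the patent's actual code.

```python
import cv2

def acquire_and_detect(camera_index=0, detect_persons=None):
    """Capture one frame, convert BGR to RGB, and run a person detector.

    detect_persons is assumed to wrap a YOLOv3 model and return a list of
    (x1, y1, x2, y2) person boxes; it is a placeholder, not the patent's code.
    """
    cap = cv2.VideoCapture(camera_index)      # binocular cameras typically expose an RGB stream as well
    ok, frame_bgr = cap.read()                # OpenCV delivers BGR frames by default
    cap.release()
    if not ok:
        raise RuntimeError("failed to read a frame from the camera")

    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # first image: RGB image containing the human body
    boxes = detect_persons(frame_rgb) if detect_persons else []
    # each box gives the top-left and bottom-right corners of a human body frame
    return frame_rgb, boxes
```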
In this embodiment, after step S102, that is, after the first image is processed with the YOLOv3 network model to obtain the human body frame image, whether the human body is located on the ground is further determined by the following steps. Specifically, referring to fig. 2, the process includes, but is not limited to, the following steps:
s201, acquiring coordinates of a center point of the human body frame according to the human body frame image;
s202, acquiring a human body space position coordinate according to the coordinate of the center point of the human body frame;
s203, judging whether the human body position is on the ground or not according to the human body space position coordinates.
In this embodiment, the coordinates of the upper-left and lower-right corners of the human body frame can be obtained from the human body frame image produced by the YOLOv3 network model. Assume the upper-left corner of the human body frame is (x1, y1) and the lower-right corner is (x2, y2); then, from the coordinates of the human body frame, the coordinates of the center point of the human body frame are obtained as
((x1 + x2) / 2, (y1 + y2) / 2).
Then, the spatial position (x, y, z) of the center point of the human body frame relative to the camera, namely the human body spatial position coordinate, can be obtained through the binocular camera, where x is the left-right offset of the human body from the camera's frontal view, y is the depth offset, and z is the up-down offset. Then, according to the spatial position coordinates of the human body, whether the human body is located on the ground can be judged.
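As a small illustration of this step, the snippet below computes the body-frame center from its two corner points; looking up the (x, y, z) coordinates of that pixel depends on the binocular camera's SDK, so query_xyz is an assumed placeholder, not an API from the patent.

```python
def box_center(x1, y1, x2, y2):
    """Center pixel of the human body frame from its top-left / bottom-right corners."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def body_spatial_position(query_xyz, box):
    """Look up the body's (x, y, z) relative to the camera at the frame's center pixel.

    query_xyz(u, v) stands in for the binocular camera SDK call and is assumed.
    """
    u, v = box_center(*box)
    return query_xyz(u, v)   # x: left-right offset, y: depth, z: up-down offset
```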
Specifically, in this embodiment, the step S203, that is, the step of determining whether the human body position is located on the ground according to the human body spatial position coordinates, includes:
s203-1, calculating a first height according to the space position coordinates of the human body, wherein the first height is the height of the human body from the ground;
s203-2, if the first height is smaller than the first threshold value, determining that the position of the human body is on the ground.
In this embodiment, referring to fig. 3, the first height is calculated by the following formula:
[first-height formula given as an image in the original publication; h is computed from hc, d and a]
wherein h denotes the first height, hc denotes the distance of the binocular camera from the ground, d denotes the distance of the human body from the binocular camera, wherein
d = √(x² + y² + z²)
(x, y, z) represents the spatial position coordinates of the human body, and a represents the tilt angle of the binocular camera in the vertical direction.
In this embodiment, if the height h of the human body from the ground is less than a threshold (for example, 30 cm), the human body is determined to be on the ground. Judging whether the human body is on the ground can further assist the subsequent real-time behavior recognition and thereby improve recognition accuracy. For example, if the ground-position judgment made in this process does not agree with the behavior subsequently identified by the behavior recognition network (such as a walking result when the ground judgment suggests otherwise), the possibility of a recognition error can be considered and further recognition and verification can be guided.
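As a rough illustration of this ground check, the sketch below computes d as in the description and derives the first height h under an assumed axis convention (the vertical drop from camera to body center taken as y·sin(a) + z·cos(a), with z measured downward in the camera frame); the patent's exact height formula is given only as an image, so this expression is an assumption rather than the patent's formula.

```python
import math

def is_on_ground(x, y, z, hc, a_rad, threshold=0.30):
    """Judge whether the body center lies on the ground (first height < threshold).

    hc: camera height above the ground; a_rad: camera tilt in the vertical direction.
    The drop term below is an assumed geometric reading, not the patent's exact formula.
    """
    d = math.sqrt(x * x + y * y + z * z)           # distance of the body from the camera (as in the patent)
    drop = y * math.sin(a_rad) + z * math.cos(a_rad)  # assumed vertical drop from camera to body center
    h = hc - drop                                   # first height: body height above the ground
    return h < threshold, h, d
```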
In this embodiment, step S103, namely, the step of inputting the human body image into the AlphaPose network model to obtain a plurality of human body skeleton points, specifically includes:
s103-1, inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body key joint points;
s103-2, screening the plurality of human body key joint points to obtain a plurality of human body bone points.
Specifically, referring to fig. 4, fig. 4 is a flowchart of a specific application example of inputting the human body frame image into the AlphaPose network model to obtain a plurality of human body skeletal points. As shown in fig. 4, after the human body frame image is input into the AlphaPose network model, 17 human body key joint points can be obtained (as shown in fig. 5). To reduce inference time, the 4 unnecessary joint points of the left eye, right eye, left ear and right ear are removed, leaving 13 key joint points: the nose, left elbow, right elbow, left palm, right palm, left hip, right hip, left knee, right knee, left ankle, right ankle, left shoulder and right shoulder joint points shown in fig. 5. The left and right shoulder joint points are then averaged to generate a 14th joint point; these 14 joint points are referred to as human skeletal points.
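A short sketch of this screening step is given below, assuming the COCO 17-keypoint ordering commonly used with AlphaPose; the patent names the joints to keep but not their indices, so the index constants here are assumptions.

```python
import numpy as np

COCO_EYES_EARS = (1, 2, 3, 4)   # left eye, right eye, left ear, right ear (COCO indexing assumed)
L_SHOULDER, R_SHOULDER = 5, 6   # assumed COCO indices for the shoulders

def to_14_skeleton_points(kpts17):
    """Reduce 17 key joint points (x, y, score) to the 14 skeletal points described above."""
    kpts17 = np.asarray(kpts17, dtype=np.float32)           # shape (17, 3)
    keep = [i for i in range(17) if i not in COCO_EYES_EARS]
    kpts13 = kpts17[keep]                                    # 13 remaining joints
    neck = (kpts17[L_SHOULDER] + kpts17[R_SHOULDER]) / 2.0   # 14th point: average of the two shoulders
    return np.vstack([kpts13, neck])                         # shape (14, 3) -> 42 input values
```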
In this embodiment, the 14 joint points (human skeletal points) are input into the behavior recognition network. Each skeletal point comprises three channels, namely the skeletal point position and the skeletal point confidence (x, y, s), so the input to the behavior recognition network has 14 × 3 = 42 values; finally, a multi-classification result is obtained.
Specifically, referring to fig. 6, step S104, namely, the step of inputting a plurality of human skeleton points into the behavior recognition network for recognition to obtain a behavior classification result, includes:
s104-1, inputting the position and confidence of the bone point corresponding to each human body bone point into a first full-connection layer to obtain a first output result;
s104-2, inputting the first output result into a second full connection layer to obtain a second output result;
s104-3, inputting the second output result into a third full connection layer to obtain a third output result;
s104-4, inputting the third output result into a fourth full connection layer to obtain a fourth output result;
s104-5, inputting the fourth output result into a RELU layer to obtain a fifth output result;
and S104-6, inputting the fifth output result into a dropout layer to obtain a behavior classification result.
In this embodiment, the skeletal point positions and confidences obtained by the AlphaPose network model are fed into the first fully connected layer to obtain the first output result, whose output dimension is 30. The first output result is fed into the second fully connected layer (input dimension 30, output dimension 20) to obtain the second output result. The second output result is fed into the third fully connected layer (input dimension 20, output dimension 10) to obtain the third output result. The third output result is fed into the fourth fully connected layer (input dimension 10, output dimension n) to obtain the fourth output result. The fourth output result is then input into the RELU layer to obtain the fifth output result, and the fifth output result is input into the dropout layer to obtain the behavior classification result. The fourth output result is processed by the RELU layer to increase the nonlinearity of the behavior recognition network model, and the fifth output result is processed by the dropout layers to suppress overfitting of the behavior recognition network model.
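For concreteness, the following PyTorch sketch mirrors the layer sizes just described (42 → 30 → 20 → 10 → n, followed by a RELU layer and dropout); it follows the ordering given in the text, is not claimed to be the authors' exact implementation, and omits any intermediate activations the text does not mention.

```python
import torch
import torch.nn as nn

class BehaviorMLP(nn.Module):
    """Fully connected behavior classifier over 14 skeletal points x (x, y, s) = 42 inputs."""

    def __init__(self, num_classes=4, p_drop=0.5):   # 4 classes: falling, sitting, standing, walking
        super().__init__()
        self.fc1 = nn.Linear(42, 30)
        self.fc2 = nn.Linear(30, 20)
        self.fc3 = nn.Linear(20, 10)
        self.fc4 = nn.Linear(10, num_classes)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p_drop)

    def forward(self, skeleton):                      # skeleton: (batch, 14, 3)
        x = skeleton.reshape(skeleton.size(0), -1)    # flatten to (batch, 42)
        x = self.fc4(self.fc3(self.fc2(self.fc1(x))))
        x = self.drop(self.relu(x))                   # RELU then dropout, as in the description
        return x                                      # logits over the behavior categories
```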
In this embodiment, the training of the behavior recognition network further includes:
p1. constructing a training set comprising a Le2i dataset and a ntu-rgbd behavioral recognition dataset;
and P2, inputting the training set into a behavior recognition network to train the behavior recognition network.
In this embodiment, the data set used by the behavior recognition network is skeletal point data constructed from the Le2i data set and the NTU RGB+D behavior recognition data set, and the labels of the skeletal point data set cover falling, sitting, standing, walking and other behaviors.
The loss function of the behavior recognition network is the cross-entropy loss:
L = -Σ_{i=1}^{K} y_i · log(p_i)
where K is the number of categories and y is the label, i.e. y_i = 1 if the category is i and 0 otherwise, and p is the network output.
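A minimal training-loop sketch under these assumptions is given below; the data loader yielding (skeleton, label) batches built from the Le2i and NTU RGB+D derived skeletal data is assumed to exist and its construction is not described in the patent.

```python
import torch
import torch.nn as nn

def train_behavior_net(model, loader, epochs=50, lr=1e-3, device="cpu"):
    """Train the classifier with the cross-entropy loss L = -sum_i y_i * log(p_i).

    loader is assumed to yield (skeleton, label) batches; epochs and lr are illustrative.
    """
    model.to(device)
    criterion = nn.CrossEntropyLoss()                     # combines log-softmax with the loss above
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for skeleton, label in loader:
            skeleton, label = skeleton.to(device), label.to(device)
            optimizer.zero_grad()
            loss = criterion(model(skeleton), label)      # label: class index per sample
            loss.backward()
            optimizer.step()
    return model
```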
The method for identifying the human body behaviors in real time has the following technical effects:
in the embodiment of the invention, the RGB image containing the human body is acquired through the binocular camera; then, processing the image by using a YOLOv3 network model to obtain a human body frame image; inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points; inputting a plurality of human skeleton points into a behavior recognition network for recognition to obtain a behavior category result, so that human behaviors can be recognized in real time; according to the invention, the binocular camera is used for acquiring the RGB image of the human body, the image is processed by utilizing the YOLOv3 network model, the frame image of the human body can be obtained, and high real-time performance and accuracy rate can be ensured without the time sequence information of the human body; the problem that the real-time performance and the accuracy are difficult to be compatible in the behavior recognition method is solved.
The embodiment of the present invention further provides a real-time human behavior recognition system, which includes:
the system comprises an acquisition module, a display module and a control module, wherein the acquisition module is used for acquiring a first image through a binocular camera, and the first image is an RGB (red, green and blue) image containing a human body;
the processing module is used for processing the first image by utilizing a YOLOv3 network model to obtain a human body frame image;
the first input module is used for inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points;
and the second input module is used for inputting a plurality of human skeleton points into the behavior recognition network for recognition to obtain behavior category results, wherein the behavior categories comprise falling, sitting, standing and walking behaviors.
The contents in the method embodiment shown in fig. 1 are all applicable to the embodiment of the present system, the functions specifically implemented by the embodiment of the present system are the same as those in the method embodiment shown in fig. 1, and the advantageous effects achieved by the embodiment of the present system are also the same as those achieved by the method embodiment shown in fig. 1.
Referring to fig. 7, an embodiment of the present invention further provides a device 200 for real-time human behavior recognition, which specifically includes:
at least one processor 210;
at least one memory 220 for storing at least one program;
when the at least one program is executed by the at least one processor 210, the at least one processor 210 is caused to implement the method as shown in fig. 1.
The memory 220, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs. The memory 220 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 220 may optionally include remote memory located remotely from processor 210, and such remote memory may be connected to processor 210 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be understood that the device structure shown in fig. 7 does not constitute a limitation of device 200, and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
In the apparatus 200 shown in fig. 7, the processor 210 may retrieve the program stored in the memory 220 and execute, but is not limited to, the steps of the embodiment shown in fig. 1.
The above-described embodiments of the apparatus 200 are merely illustrative, and the units illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purposes of the embodiments.
Embodiments of the present invention also provide a computer-readable storage medium, which stores a program executable by a processor, and the program executable by the processor is used for implementing the method shown in fig. 1 when being executed by the processor.
The embodiment of the application also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
It will be understood that all or some of the steps and systems of the methods disclosed above may be implemented as software, firmware, hardware and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media as known to those skilled in the art.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (10)

1. A real-time human behavior recognition method is characterized by comprising the following steps:
acquiring a first image through a binocular camera, wherein the first image is an RGB (red, green and blue) image containing a human body;
processing the first image by using a YOLOv3 network model to obtain a human body frame image;
inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points;
inputting a plurality of human skeleton points into a behavior recognition network for recognition to obtain behavior category results, wherein the behavior categories comprise falling, sitting, standing and walking behaviors.
2. The method for real-time human behavior recognition according to claim 1, wherein after processing the first image by using a YOLOv3 network model to obtain a human frame image, the method further comprises:
acquiring coordinates of the center point of the human body frame according to the human body frame image;
acquiring a human body space position coordinate according to the human body frame central point coordinate;
and judging whether the human body position is on the ground or not according to the human body space position coordinate.
3. The method for real-time human behavior recognition according to claim 2, wherein the step of determining whether the human body position is on the ground according to the human body space position coordinates comprises:
calculating a first height according to the space position coordinates of the human body, wherein the first height is the height of the human body from the ground;
and if the first height is smaller than a first threshold value, determining that the position of the human body is on the ground.
4. The real-time human behavior recognition method according to claim 3, wherein the first height is calculated by the following formula:
[first-height formula given as an image in the original publication; h is computed from hc, d and a]
wherein h represents the first height, hc represents the distance of the binocular camera from the ground, and d represents the distance of the human body from the binocular camera, wherein
d = √(x² + y² + z²)
(x, y, z) represents the spatial position coordinates of the human body, and a represents the tilt angle of the binocular camera in the vertical direction.
5. The method for real-time human behavior recognition according to claim 1, wherein the step of inputting the human frame image into an AlphaPose network model to obtain a plurality of human skeleton points comprises:
inputting the human body image into an AlphaPose network model to obtain a plurality of human body key joint points;
and screening a plurality of human body key joint points to obtain a plurality of human body bone points.
6. The real-time human behavior recognition method according to claim 1, wherein the human body frame image is input into an AlphaPose network model, and a bone point position and a bone point confidence corresponding to each human body bone point are further obtained; the step of inputting a plurality of human skeleton points into a behavior recognition network for recognition to obtain a behavior classification result comprises the following steps:
inputting the position and confidence of the bone point corresponding to each human body bone point into a first full-connection layer to obtain a first output result;
inputting the first output result into a second full-connection layer to obtain a second output result;
inputting the second output result into a third full-connection layer to obtain a third output result;
inputting the third output result into a fourth full-connection layer to obtain a fourth output result;
inputting the fourth output result into a RELU layer to obtain a fifth output result;
and inputting the fifth output result into a dropout layer to obtain a behavior classification result.
7. The real-time human behavior recognition method according to claim 1, further comprising training the behavior recognition network, including:
constructing a training set, wherein the training set comprises an Le2i data set and a ntu-rgbd behavior recognition data set;
inputting the training set into the behavior recognition network to train the behavior recognition network.
8. A real-time human behavior recognition system is characterized by comprising:
the system comprises an acquisition module, a display module and a control module, wherein the acquisition module is used for acquiring a first image through a binocular camera, and the first image is an RGB (red, green and blue) image containing a human body;
the processing module is used for processing the first image by utilizing a YOLOv3 network model to obtain a human body frame image;
the first input module is used for inputting the human body frame image into an AlphaPose network model to obtain a plurality of human body skeleton points;
and the second input module is used for inputting a plurality of human skeleton points into the behavior recognition network for recognition to obtain behavior category results, wherein the behavior categories comprise falling, sitting, standing and walking behaviors.
9. A human behavior real-time recognition device is characterized by comprising:
at least one processor;
at least one memory for storing at least one program;
when executed by the at least one processor, cause the at least one processor to implement the method of any one of claims 1-7.
10. Computer-readable storage medium, on which a processor-executable program is stored, which, when being executed by a processor, is adapted to carry out the method according to any one of claims 1-7.
CN202110654270.4A 2021-06-11 2021-06-11 Human behavior real-time identification method, system, device and storage medium Pending CN113408390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110654270.4A CN113408390A (en) 2021-06-11 2021-06-11 Human behavior real-time identification method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110654270.4A CN113408390A (en) 2021-06-11 2021-06-11 Human behavior real-time identification method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN113408390A true CN113408390A (en) 2021-09-17

Family

ID=77683612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110654270.4A Pending CN113408390A (en) 2021-06-11 2021-06-11 Human behavior real-time identification method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN113408390A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110706255A (en) * 2019-09-25 2020-01-17 马可 Fall detection method based on self-adaptive following
US20200090484A1 (en) * 2018-09-13 2020-03-19 Wistron Corporation Falling detection method and electronic system using the same
CN111046749A (en) * 2019-11-25 2020-04-21 西安建筑科技大学 Human body falling behavior detection method based on depth data
CN111898518A (en) * 2020-07-28 2020-11-06 中移(杭州)信息技术有限公司 Tumble detection method, electronic device and storage medium
CN112257639A (en) * 2020-10-30 2021-01-22 福州大学 Student learning behavior identification method based on human skeleton
CN112528960A (en) * 2020-12-29 2021-03-19 之江实验室 Smoking behavior detection method based on human body posture estimation and image classification

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200090484A1 (en) * 2018-09-13 2020-03-19 Wistron Corporation Falling detection method and electronic system using the same
CN110706255A (en) * 2019-09-25 2020-01-17 马可 Fall detection method based on self-adaptive following
CN111046749A (en) * 2019-11-25 2020-04-21 西安建筑科技大学 Human body falling behavior detection method based on depth data
CN111898518A (en) * 2020-07-28 2020-11-06 中移(杭州)信息技术有限公司 Tumble detection method, electronic device and storage medium
CN112257639A (en) * 2020-10-30 2021-01-22 福州大学 Student learning behavior identification method based on human skeleton
CN112528960A (en) * 2020-12-29 2021-03-19 之江实验室 Smoking behavior detection method based on human body posture estimation and image classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Jianshu et al.: "Posture Recognition Algorithm Based on Neural Networks" (基于神经网络的姿态识别算法), Software Guide (软件导刊), vol. 19, no. 11, 30 November 2020 (2020-11-30), page 33 *

Similar Documents

Publication Publication Date Title
US20180101732A1 (en) Image processing apparatus, image processing system, method for image processing, and computer program
CN110472612B (en) Human behavior recognition method and electronic equipment
WO2023010758A1 (en) Action detection method and apparatus, and terminal device and storage medium
CN111445531B (en) Multi-view camera navigation method, device, equipment and storage medium
US10586115B2 (en) Information processing device, information processing method, and computer program product
WO2023213158A1 (en) Medical image segmentation method and apparatus, device, storage medium, and program product
CN111104925A (en) Image processing method, image processing apparatus, storage medium, and electronic device
US11354923B2 (en) Human body recognition method and apparatus, and storage medium
CN110852237A (en) Object posture determining method and device, storage medium and electronic device
CN114067428A (en) Multi-view multi-target tracking method and device, computer equipment and storage medium
US20170053172A1 (en) Image processing apparatus, and image processing method
CN112115913A (en) Image processing method, device and equipment and storage medium
WO2020213099A1 (en) Object detection/tracking device, method, and program recording medium
US20230326041A1 (en) Learning device, learning method, tracking device, and storage medium
CN111563492B (en) Fall detection method, fall detection device and storage device
CN112053382A (en) Access & exit monitoring method, equipment and computer readable storage medium
CN109886780B (en) Commodity target detection method and device based on eyeball tracking
CN113408390A (en) Human behavior real-time identification method, system, device and storage medium
CN116386106A (en) Intelligent infant head recognition method, device and equipment during sleep-accompanying infant
CN116258748A (en) Track tracking method
CN112784691B (en) Target detection model training method, target detection method and device
CN114783060A (en) Standing behavior identification method and device
KR102108549B1 (en) Method and apparatus for distribution management of tile and sanitary ware
CN111222475B (en) Pig tail biting detection method, device and storage medium
CN116263949A (en) Weight measurement method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210917)