CN108229355B - Behavior recognition method and apparatus, electronic device, computer storage medium


Info

Publication number
CN108229355B
CN108229355B (application CN201711407861.1A)
Authority
CN
China
Prior art keywords
human body, key point
Prior art date
Legal status
Active
Application number
CN201711407861.1A
Other languages
Chinese (zh)
Other versions
CN108229355A (en)
Inventor
颜思捷 (Sijie Yan)
熊元骏 (Yuanjun Xiong)
林达华 (Dahua Lin)
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201711407861.1A
Publication of CN108229355A
Application granted
Publication of CN108229355B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 - Recognition of whole body movements, e.g. for sport training

Abstract

The embodiments of the present disclosure disclose a behavior recognition method and apparatus, an electronic device, a computer storage medium, and a program, wherein the method comprises the following steps: performing human body key point detection on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on the feature information of the plurality of human body key points of the at least one frame of video image and the association information of the plurality of human body key points. By combining the feature information of the human body key points with the association information between the human body key points, the embodiments make full use of both local and global information and improve the accuracy of behavior recognition.

Description

Behavior recognition method and apparatus, electronic device, computer storage medium
Technical Field
The present disclosure relates to computer vision technologies, and in particular, to a behavior recognition method and apparatus, an electronic device, and a computer storage medium.
Background
Behavior recognition identifies the actions or behaviors of a person, such as swimming, running, or sweeping, from a video, and plays an important role in understanding the content and meaning of the video. Behavior recognition can take video images, voice, or human body key point coordinates as input and use a neural network to output the category of the behavior.
Disclosure of Invention
The embodiment of the disclosure provides a behavior recognition technology.
According to an aspect of the embodiments of the present disclosure, there is provided a behavior recognition method, including:
performing human body key point detection on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image;
and obtaining a behavior identification result of each frame of video image in the at least one frame of video image based on the feature information of the plurality of human body key points of the at least one frame of video image and the associated information of the plurality of human body key points.
In another embodiment based on the above method of the present invention, the feature information of the human body key points includes coordinate information of the human body key points; alternatively,
the feature information of the human body key points comprises coordinate information of the human body key points, and estimation confidence degrees of the human body key points and/or initial features corresponding to the human body key points.
In another embodiment of the above method according to the present invention, the association information of the plurality of human body key points includes any one or more of the following: spatial association information between at least two human body key points in the same frame of video image, and temporal association information between at least two human body key points that correspond to the same human body part and belong to adjacent frames of the at least one frame of video image.
The temporal association information between at least two human body key points that correspond to the same human body part and belong to different frames of the at least one frame of video image indicates the movement track of that human body part over time in the at least one frame of video image.
In another embodiment of the foregoing method according to the present invention, the at least one frame of video image is a plurality of frames of consecutive video images in the video; and/or
the spatial association information between at least two human body key points in the same frame of video image is determined according to the connectivity relation of human body structures.
In another embodiment based on the above method of the present invention, the spatial correlation information between the at least two human key points includes a spatial neighboring relationship between the at least two key points, and/or
The time correlation information between the at least two key points comprises: and the adjacent relation of the frames to which the at least two key points belong.
In another embodiment of the above method according to the present invention, after the performing human body keypoint detection on at least one frame of video image and obtaining a plurality of human body keypoints of the at least one frame of video image, the method further includes:
establishing a space-time map based on the plurality of human key points in the at least one frame of video image, wherein the space-time map comprises feature information of the plurality of human key points in the at least one frame of video image and associated information of the plurality of human key points;
the obtaining of the behavior recognition result of the at least one frame of video image based on the feature information of the plurality of human body key points of the at least one frame of video image and the associated information of the plurality of human body key points includes:
and obtaining a behavior identification result of the at least one frame of video image based on the space-time map.
In another embodiment based on the above method of the present invention, the space-time map includes a plurality of nodes corresponding to the plurality of human key points, and each node of the plurality of nodes includes feature information of the corresponding human key point;
each node in the plurality of nodes has at least one edge, and the plurality of nodes have a plurality of edges indicating the association relations of the plurality of human body key points.
At least one edge of a human body key point indicates the association relation between that human body key point and other human body key points.
In another embodiment of the above method according to the present invention, a first node of the plurality of nodes has a spatial edge with each second node of at least one second node, wherein a first human body key point corresponding to the first node and a second human body key point corresponding to each second node of the at least one second node belong to the same frame, and the first human body key point is directly connected to the human body part corresponding to each second human body key point, and/or
a time edge is arranged between the first node and each of at least one third node, wherein the first human body key point corresponds to the same human body part as the third human body key point corresponding to each third node and belongs to an adjacent frame.
The number of the plurality of nodes is equal to the number of the plurality of human body key points, and the plurality of nodes correspond to the plurality of human body key points one to one.
In another embodiment of the foregoing method according to the present invention, the building a space-time map based on a plurality of human key points in the at least one frame of video image includes:
connecting at least two human body key points positioned in the same frame of video image by using a spatial edge according to the connection relation of human body structures;
connecting at least two human body key points of the same body part in adjacent frames of the at least one frame of video image using temporal edges.
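For illustration only, the following minimal Python sketch builds the edge structure described in this embodiment. The 18-keypoint skeleton and the bone list are assumptions made for the example, not part of the claims; a real system would use the skeleton layout of its own keypoint detector.

```python
from itertools import product

NUM_JOINTS = 18
# Hypothetical natural connections of the human structure (pairs of joint indices).
BONES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
         (0, 14), (14, 16), (0, 15), (15, 17)]

def build_space_time_graph(num_frames):
    """Return (nodes, spatial_edges, temporal_edges) for num_frames frames.

    A node is the pair (t, i): joint i observed in frame t."""
    nodes = list(product(range(num_frames), range(NUM_JOINTS)))
    # Spatial edges: keypoints in the same frame, connected per the body structure.
    spatial_edges = [((t, i), (t, j)) for t in range(num_frames) for i, j in BONES]
    # Temporal edges: the same body part in adjacent frames.
    temporal_edges = [((t, i), (t + 1, i))
                      for t in range(num_frames - 1) for i in range(NUM_JOINTS)]
    return nodes, spatial_edges, temporal_edges

nodes, e_s, e_f = build_space_time_graph(num_frames=4)
print(len(nodes), len(e_s), len(e_f))   # 72 nodes, 68 spatial edges, 54 temporal edges
```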
In another embodiment based on the above method of the present invention, the obtaining the behavior recognition result of the at least one frame of video image based on the space-time map includes:
and inputting the space-time diagram into a convolutional neural network to obtain a behavior recognition result of the at least one frame of video image.
In another embodiment based on the foregoing method of the present invention, the inputting the space-time map into a convolutional neural network for processing to obtain a behavior recognition result of the at least one frame of video image includes:
performing convolution processing on the plurality of human key points based on the associated information among the plurality of human key points to obtain convolution processing results of the plurality of human key points;
and obtaining a behavior recognition result of the at least one frame of video image based on the convolution processing result of the plurality of human body key points.
In another embodiment of the method according to the present invention, the performing convolution processing on the plurality of human body key points based on the associated information of the plurality of human body key points to obtain convolution processing results of the plurality of human body key points includes:
determining at least one fifth human key point having an association relation with a fourth human key point in the plurality of human key points based on the association information of the plurality of human key points;
and obtaining a convolution processing result of the fourth human key point based on the feature information of each human key point in the fourth human key point and the at least one fifth human key point.
The at least one fifth human body key point comprises at least one human body key point which has a spatial incidence relation with the fourth human body key point; or
The at least one fifth human body keypoint comprises at least one human body keypoint having a spatial association with the fourth human body keypoint and at least one human body keypoint having a temporal association with the fourth human body keypoint.
In another embodiment of the method according to the present invention, the obtaining a convolution processing result of the fourth human key point based on the feature information of each human key point of the fourth human key point and the at least one fifth human key point includes:
performing convolution processing on each human body key point by using a convolution parameter corresponding to a human body key point set to which each human body key point belongs in the fourth human body key point and the at least one fifth human body key point to obtain an initial convolution result of each human body key point;
and obtaining a convolution processing result of the fourth human key point based on the initial convolution result of the fourth human key point and each human key point in the at least one fifth human key point.
In another embodiment based on the above method of the present invention, before the performing convolution processing on each human body key point, the method further includes:
dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set, wherein each human body key point set comprises at least one human body key point;
determining convolution parameters of each human body key point based on the fourth human body key point and a human body key point set to which each human body key point belongs in the at least one fifth human body key point, wherein the human body key points belonging to different human body key point sets correspond to different convolution parameters.
In another embodiment of the above method according to the invention, the at least one set of human keypoints comprises a first set of human keypoints and a second set of human keypoints;
said dividing the fourth human keypoints and the at least one fifth human keypoints into at least one human keypoint set, comprising:
the fourth human keypoints are classified into the first human keypoint set, and the at least one fifth human keypoint is classified into the second human keypoint set.
In another embodiment based on the above method of the present invention, the dividing the fourth human body keypoint and the at least one fifth human body keypoint into at least one human body keypoint set comprises:
and dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set based on the distance between each human body key point in the fourth human body key point and the at least one fifth human body key point and a reference point.
In another embodiment of the above method according to the present invention, the dividing the fourth human key point and the at least one fifth human key point into at least one human key point set based on a distance between each of the fourth human key point and the at least one fifth human key point and a reference point includes:
determining a first distance between the fourth human body key point and the reference point based on the feature information of the fourth human body key point;
and determining a key point set to which each human body key point belongs based on the magnitude relation between the first distance and the distance between each human body key point in the fourth human body key point and the at least one fifth human body key point and the reference point.
In another embodiment of the above method according to the present invention, the determining the set of key points to which each human body key point belongs based on a magnitude relationship between a distance between each human body key point of the fourth human body key point and the at least one fifth human body key point and the reference point and the first distance includes:
determining that the human body key points with the distance to the reference point smaller than the first distance belong to a first key point set; and/or
Determining that the human body key points with the distance from the reference point equal to the first distance belong to a second key point set; and/or
And determining that the human key points with the distance to the reference point greater than the first distance belong to a third key point set.
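As a non-limiting illustration of the distance-based division above, the Python sketch below assigns the fourth (current) keypoint and its associated keypoints to the first, second, or third keypoint set by comparing each point's distance to a reference point with the first distance. Taking the reference point as the skeleton's center of gravity, and the tolerance `eps`, are assumptions of this example.

```python
import numpy as np

def partition_by_distance(root_xy, neighbor_xys, reference_xy, eps=1e-3):
    """Label the current keypoint and its associated keypoints by comparing
    each point's distance to the reference point with the first distance
    (the current keypoint's own distance).

    Returns one label per point: 0 = closer than the first distance,
    1 = equal (within eps), 2 = farther."""
    reference_xy = np.asarray(reference_xy, dtype=float)
    first_distance = np.linalg.norm(np.asarray(root_xy, dtype=float) - reference_xy)
    labels = []
    for xy in [root_xy, *neighbor_xys]:
        d = np.linalg.norm(np.asarray(xy, dtype=float) - reference_xy)
        if d < first_distance - eps:
            labels.append(0)   # first keypoint set
        elif d > first_distance + eps:
            labels.append(2)   # third keypoint set
        else:
            labels.append(1)   # second keypoint set
    return labels

# Example: reference point taken as the skeleton's center of gravity (an assumption).
joints = np.array([[0.5, 0.9], [0.5, 0.7], [0.4, 0.7], [0.6, 0.5]])
reference = joints.mean(axis=0)
print(partition_by_distance(joints[1], [joints[0], joints[2], joints[3]], reference))
```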
In another embodiment of the method according to the present invention, the obtaining a behavior recognition result of the at least one frame of video image based on a convolution processing result of the plurality of human body key points includes:
performing global pooling on the convolution processing result of each human body key point in the plurality of human body key points included in the space-time diagram to obtain a pooling processing result;
and obtaining a behavior identification result of the at least one frame of video image based on the pooling processing result.
In another embodiment of the above method according to the present invention, the pooling processing result comprises a one-dimensional feature vector;
the obtaining a behavior recognition result of each frame of video image in the at least one frame of video image based on the pooling processing result includes:
processing the one-dimensional feature vector by using a fully-connected layer to obtain an identification vector, wherein the identification vector includes one vector value for each behavior classification category;
and obtaining human behavior classification in the video image based on each vector value in the identification vectors.
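The sketch below illustrates the pooling and fully-connected classification steps described above. The feature dimensions, the number of behavior categories, and the random (untrained) weights are placeholders for a trained network, not values from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, channels, num_classes = 4 * 18, 256, 400   # all sizes assumed

feat = rng.standard_normal((num_nodes, channels))     # convolution results per keypoint
pooled = feat.mean(axis=0)                            # global pooling -> 1-D feature vector

W = rng.standard_normal((num_classes, channels))      # fully-connected layer (untrained here)
b = np.zeros(num_classes)
scores = W @ pooled + b                               # one vector value per behavior category

probs = np.exp(scores - scores.max())
probs /= probs.sum()                                  # softmax over categories
print(int(probs.argmax()))                            # index of the recognized behavior class
```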
According to another aspect of the embodiments of the present disclosure, there is provided a behavior recognition apparatus including:
the key point detection unit is used for executing human body key point detection on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image;
and the behavior identification unit is used for obtaining a behavior identification result of the at least one frame of video image based on the feature information of the plurality of human key points of the at least one frame of video image and the associated information of the plurality of human key points.
In another embodiment of the above apparatus according to the present invention, the feature information of the human body key points includes coordinate information of the human body key points; alternatively,
the feature information of the human body key points comprises coordinate information of the human body key points, and estimation confidence degrees of the human body key points and/or initial features corresponding to the human body key points.
In another embodiment of the above apparatus according to the present invention, the association information of the plurality of human body key points includes any one or more of the following: spatial association information between at least two human body key points in the same frame of video image, and temporal association information between at least two human body key points that correspond to the same human body part and belong to adjacent frames of the at least one frame of video image.
In another embodiment of the above apparatus according to the present invention, the at least one frame of video image is a plurality of frames of consecutive video images in the video; and/or
the spatial association information between at least two human body key points in the same frame of video image is determined according to the connectivity relation of human body structures.
In another embodiment of the above apparatus according to the present invention, the spatial correlation information between the at least two human key points includes a spatial neighboring relationship between the at least two key points, and/or
The time correlation information between the at least two key points comprises: and the adjacent relation of the frames to which the at least two key points belong.
In another embodiment of the above apparatus according to the present invention, the apparatus further comprises:
the image establishing unit is used for establishing a space-time image based on a plurality of human key points in the at least one frame of video image, wherein the space-time image comprises feature information of the human key points in the at least one frame of video image and associated information of the human key points;
the behavior identification unit is specifically configured to obtain a behavior identification result of the at least one frame of video image based on the space-time map.
In another embodiment of the above apparatus according to the present invention, the space-time map includes a plurality of nodes corresponding to the plurality of human key points, and each of the plurality of nodes includes feature information of the corresponding human key point;
each node in the plurality of nodes has at least one edge, and the plurality of nodes have a plurality of edges indicating the association relations of the plurality of human body key points.
In another embodiment of the above-described device according to the invention,
a first node of the plurality of nodes and each second node of at least one second node have a spatial edge, wherein a first human body key point corresponding to the first node and a second human body key point corresponding to each second node of the at least one second node belong to the same frame, and the first human body key point is directly connected to the human body part corresponding to each second human body key point, and/or
a time edge is arranged between the first node and each of at least one third node, wherein the first human body key point corresponds to the same human body part as the third human body key point corresponding to each third node and belongs to an adjacent frame.
In another embodiment based on the above apparatus of the present invention, the behavior recognition unit is specifically configured to input the space-time map to a convolutional neural network, so as to obtain a behavior recognition result of the at least one frame of video image.
In another embodiment of the above apparatus according to the present invention, the behavior recognizing unit includes:
the convolution processing module is used for performing convolution processing on the plurality of human key points based on the associated information among the plurality of human key points to obtain convolution processing results of the plurality of human key points;
and the convolution identification module is used for obtaining a behavior identification result of the at least one frame of video image based on the convolution processing result of the plurality of human key points.
In another embodiment of the above apparatus according to the present invention, the convolution processing module includes:
the association determining module is used for determining at least one fifth human key point which has an association relation with a fourth human key point in the plurality of human key points based on the association information of the plurality of human key points;
and the feature processing module is used for obtaining a convolution processing result of the fourth human key point based on the feature information of each human key point in the fourth human key point and the at least one fifth human key point.
In another embodiment based on the above apparatus of the present invention, the feature processing module is specifically configured to perform convolution processing on each human body key point by using a convolution parameter corresponding to a human body key point set to which each human body key point belongs, in the fourth human body key point and the at least one fifth human body key point, so as to obtain an initial convolution result of each human body key point;
and obtaining a convolution processing result of the fourth human key point based on the initial convolution result of the fourth human key point and each human key point in the at least one fifth human key point.
In another embodiment of the above apparatus according to the present invention, the behavior recognizing unit further includes:
a classification module, configured to divide the fourth human body keypoint and the at least one fifth human body keypoint into at least one human body keypoint set, where each human body keypoint set includes at least one human body keypoint;
a parameter determining module, configured to determine a convolution parameter of each human key point based on a human key point set to which each human key point belongs in the fourth human key point and the at least one fifth human key point, where the human key points belonging to different human key point sets correspond to different convolution parameters.
In another embodiment of the above apparatus according to the present invention, the at least one set of human keypoints comprises a first set of human keypoints and a second set of human keypoints;
the classification module is specifically configured to classify the fourth human keypoints into the first human keypoint set, and classify the at least one fifth human keypoint into the second human keypoint set.
In another embodiment of the above apparatus according to the present invention, the classification module is specifically configured to divide the fourth human body keypoint and the at least one fifth human body keypoint into at least one human body keypoint set based on a distance between each of the fourth human body keypoint and the at least one fifth human body keypoint and a reference point.
In another embodiment of the above apparatus according to the present invention, the classification module includes:
a first distance module, configured to determine a first distance between the fourth human body key point and the reference point based on feature information of the fourth human body key point;
a first relation module, configured to determine, based on a magnitude relation between a distance between each of the fourth human body key point and the at least one fifth human body key point and the reference point and the first distance, a key point set to which each human body key point belongs.
In another embodiment of the above apparatus according to the present invention, the first relation module is specifically configured to determine that a human key point whose distance from the reference point is smaller than the first distance belongs to a first key point set; and/or
Determining that the human body key points with the distance from the reference point equal to the first distance belong to a second key point set; and/or
And determining that the human key points with the distance to the reference point greater than the first distance belong to a third key point set.
In another embodiment based on the above apparatus of the present invention, the convolution identifying module is specifically configured to perform global pooling on a convolution processing result of each human key point in the plurality of human key points included in the space-time diagram to obtain a pooled processing result;
and obtaining a behavior identification result of the at least one frame of video image based on the pooling processing result.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including a processor, the processor including the behavior recognition apparatus as described above.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including: a memory for storing executable instructions;
and a processor in communication with the memory to execute the executable instructions to perform the behavior recognition method as described above.
According to another aspect of the embodiments of the present disclosure, there is provided a computer storage medium for storing computer-readable instructions that, when executed, perform the behavior recognition method as described above.
According to another aspect of embodiments of the present disclosure, there is provided a computer program comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the behavior recognition method as described above.
According to yet another aspect of the embodiments of the present disclosure, there is provided a computer program product for storing computer readable instructions, which when executed, cause a computer to execute the behavior recognition method described in any one of the above possible implementations.
In an alternative embodiment the computer program product is embodied as a computer storage medium, and in another alternative embodiment the computer program product is embodied as a software product, such as an SDK or the like.
The disclosed embodiment also provides another behavior identification method and a corresponding device, electronic equipment, computer storage medium, computer program and computer program product thereof, wherein the method comprises the following steps: performing human body key point detection on at least one frame of video image to obtain a plurality of human body key points of at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on the feature information of the plurality of human body key points of the at least one frame of video image and the associated information of the plurality of human body key points.
Based on the behavior recognition method and apparatus, the electronic device, the computer storage medium, and the program provided by the embodiments of the present disclosure, key point detection is performed on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image, and a behavior recognition result of the at least one frame of video image is obtained based on the feature information of the plurality of human body key points and the association information of the plurality of human body key points. This overcomes the defect of the prior art, which processes all key points together and thus makes it hard to attend to local information; by combining the feature information of the human body key points with the association information between them, local and global information are fully utilized, and the accuracy of behavior recognition is improved.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
fig. 1 is a schematic flow chart of a behavior recognition method according to an embodiment of the present disclosure.
Fig. 2 is a schematic structural diagram of a space-time graph constructed in an optional example of the behavior recognition method disclosed in the present disclosure.
Fig. 3a-d are specific exemplary diagrams of a behavior recognition method provided by an embodiment of the disclosure.
Fig. 4 is a flowchart illustrating a specific example of the behavior recognition method according to the present disclosure.
Fig. 5 is a schematic structural diagram of an embodiment of the behavior recognition apparatus of the present disclosure.
Fig. 6 is a schematic structural diagram of an electronic device for implementing an embodiment of the present application.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to computer systems/servers that are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In carrying out the present disclosure, the applicant found at least the following problems: in the prior art, whether an LSTM network or a ResNet network is adopted, the natural spatial connections between key points of a human body are not considered.
The prior art ignores the connections between adjacent key points and indiscriminately flattens the key points into vectors that are directly input into the network. Because all human body key points are considered together from the start of the network, it is difficult for the model to attend to local information.
Fig. 1 is a schematic flow chart of a behavior recognition method according to an embodiment of the present disclosure. As shown in fig. 1, the method of this embodiment includes:
step 110, performing key point detection on at least one frame of video image to obtain a plurality of human body key points of at least one frame of video image.
Specifically, the at least one frame of video image may be derived from a video acquired by the device itself or a video input by a user or a video acquired from another device, and the embodiment of the present disclosure does not limit the manner of acquiring the at least one frame of video image. Alternatively, the at least one frame of video image may be a continuous video image, for example, the at least one frame of video image may belong to a video segment corresponding to an action or behavior, based on which a behavior recognition result may be obtained. Alternatively, each of the at least one frame of video image may correspond to an action or behavior, for example, the at least one frame of video image is discontinuous and is obtained by capturing one frame of video image at intervals of a set number of frames in the video, but the embodiment of the disclosure is not limited thereto.
Specifically, human body key point detection can be performed on each frame of video image in the at least one frame of video image to obtain at least one human body key point in each frame, so that a plurality of human body key points of the at least one frame of video image can be obtained. Optionally, a machine learning method may be used to perform the key point detection, for example, a neural network, a Support Vector Machine (SVM), a Random Forest (RF), and the like; the implementation of key point detection is not limited in the embodiments of the present disclosure.
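A minimal sketch of this detection step is shown below. Here `detect_keypoints` is a hypothetical stand-in for any of the detectors mentioned (neural network, SVM, RF); the (T, N, 3) output layout, the 18 keypoints, and the random values are assumptions made for illustration.

```python
import numpy as np

def detect_keypoints(frame):
    """Hypothetical per-frame pose estimator returning an (N, 3) array of
    (x, y, confidence) for N = 18 human body keypoints. A real system would
    invoke a trained detector here; random values stand in for its output."""
    return np.random.rand(18, 3)

video = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(8)]   # 8 dummy frames
keypoints = np.stack([detect_keypoints(f) for f in video])            # shape (T, N, 3)
print(keypoints.shape)   # (8, 18, 3): coordinates plus estimation confidence per keypoint
```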
And step 120, obtaining a behavior recognition result of the at least one frame of video image based on the feature information of the plurality of human body key points and the associated information of the plurality of human body key points of the at least one frame of video image.
In one or more optional embodiments, the feature information of the human body keypoints may include coordinate information of the human body keypoints, for example, 2D or 3D coordinates of the human body keypoints. The feature information may further include an estimated confidence of the human body keypoints and/or initial features corresponding to the human body keypoints, where, optionally, the initial features may be obtained by performing feature extraction on the positions or regions where the human body keypoints are located. The feature information may also include other information related to the keypoints; which information is specifically included in the feature information of the human body keypoints is not limited in the embodiments of the present disclosure.
In 120, in addition to the feature information of each of the plurality of human key points obtained from the at least one frame of video image, behavior recognition may be performed by using the association information between the plurality of human key points, so as to improve the accuracy of behavior recognition. Optionally, the association information of the plurality of human body key points may include spatial association information of the plurality of human body key points and/or temporal association relationship of the plurality of human body key points. As an example, the at least one frame of video image may be specifically a plurality of frames of consecutive video images in a video, at this time, the association information of the plurality of human body association points may include spatial association information and temporal association relationship of the plurality of human body key points, where the spatial association information may indicate spatial association relationship between the human body key points, for example, the spatial association information may include a positional relationship, such as adjacent or direct connection, of the human body key points in the same frame of video image; the temporal association information may indicate a temporal association relationship between key points of a human body, for example, the temporal association information may include a relationship between frames to which key points corresponding to the same body part (e.g., the same joint point) belong, for example, the belonging frames are adjacent, and the like, which is not limited by the embodiments of the present disclosure.
Alternatively, the spatial association relationship between the human key points in the same frame of video image may be determined according to the connectivity relationship of human structures. For example, the key points corresponding to the elbows may be considered to be adjacent to the key points corresponding to the wrists and the shoulder, or may be predefined, and the determination manner of the association relationship between the key points of the human body according to the embodiment of the present disclosure is not limited.
As an optional example, while detecting the human body key points, information of the human body parts corresponding to the human body key points, such as human body part tags, may also be output, and the association relationship between the human body key points may be determined based on the human body part information of the human body key points. Optionally, the time correlation information between the key points of the human body corresponding to the same human body part and belonging to different frames of video images may indicate the moving track of the human body part in the at least one frame of video images along with the time.
In some optional examples, the spatial association information between the human key points may include a neighboring relationship of the human key points in a spatial position, for example, may include information indicating that body parts corresponding to at least two human key points in the same video image are directly connected. Optionally, the time correlation information between the human body key points may include an adjacent relationship of frames to which the human body key points belong, for example, information indicating that frames to which at least two human body key points corresponding to the same body part belong are adjacent frames may be included, but the embodiment of the present disclosure is not limited thereto.
In 120, optionally, the feature information and the association information of the plurality of human body key points may be processed by using a machine learning method to obtain a behavior recognition result of the at least one frame of video image. In an optional example, a neural network may be used: the feature information and the association information of the plurality of human body key points may be input to a convolutional neural network for processing to obtain a behavior recognition result of the at least one frame of video image. The feature information and the association information may be input to the convolutional neural network directly, or after preprocessing; the specific input form of the convolutional neural network is not limited in the embodiments of the present disclosure.
In one or more optional embodiments, a space-time map may be established based on feature information and associated information of a plurality of human body key points in at least one frame of video image, and accordingly, in 120, a behavior recognition result of at least one frame of video image may be obtained based on the space-time map. Optionally, the feature information and the associated information of the plurality of human body key points may also be embodied in other ways, and the embodiments of the present disclosure are not limited thereto.
Specifically, the space-time map may be established based on time correlation information and/or space correlation information of a plurality of human body key points. The space-time graph may include a plurality of nodes and a plurality of edges, and each node of the plurality of nodes may have at least one edge. In some optional examples, the human body key points may serve as nodes, and the association information between the human body key points may be embodied as edges in a space-time graph, that is, the human body key points having a temporal association relationship and/or a spatial association relationship may be connected by the edges, and the edges may be used to indicate the association relationship between the human body key points. As an optional example, the spatial association relationship and the temporal association relationship between the human body key points may be respectively indicated by a spatial edge and a temporal edge, and in addition, optionally, the human body key points may also have a self-connection edge with themselves.
As an example, assuming that a plurality of human body key points are from one frame of video image, the space-time graph may include a spatial edge obtained by connecting a node corresponding to each human body key point with a node corresponding to at least one other human body key point, and a self-connecting edge obtained by connecting a node corresponding to each human body key point with itself, where human body parts corresponding to two human body key points constituting the spatial edge are adjacent or directly connected.
As another example, assuming that a plurality of human key points are from at least two frames of video images, the space-time map may further include a time edge on the basis of including a spatial edge and a self-connected edge, where the time edge is obtained by connecting two human key points corresponding to the same human body part in two adjacent frames of video images, but the embodiment of the present disclosure is not limited thereto.
Optionally, at least one edge of a human body key point may indicate the association relation between that human body key point and other human body key points. For example, different human body key points in the same video image correspond to different human body joints, and an association relation exists between the human body key points corresponding to the two joints at the two ends of one bone. In an optional example of the space-time map, there is a "time edge" between human body key points corresponding to the same human body joint in adjacent frames (e.g., the "left elbow" of the third and fourth frames), and there is a "space edge" between adjacent human body key points in the same frame (e.g., the "left elbow" and "left wrist" of the fifth frame). Here, "adjacent human body key points" may be manually defined, determined according to the connection relation of human body structures, or determined by other methods; for example, the human body key points corresponding to the joints connected by the same bone are adjacent human body key points (e.g., the left elbow joint and the left wrist joint), but the embodiments of the present disclosure are not limited thereto.
In one or more optional embodiments, a first node in the plurality of nodes and each of the at least one second node have a spatial edge, wherein a first human body key point corresponding to the first node and a second human body key point corresponding to each of the at least one second node belong to the same frame of video image, and the first human body key point is directly communicated with a human body part corresponding to each of the second human body key points.
In one or more optional embodiments, a time edge is provided between the first node and each of the at least one third node, where the first human body key corresponds to the same human body part as the third human body key corresponding to each third node and belongs to an adjacent frame.
Optionally, when constructing the space-time graph, spatial edges may be used to connect at least two human body key points located in the same frame of video image based on the connectivity of the human body structure, and temporal edges may then be used to connect at least two human body key points of the same body part in adjacent frames of the at least one frame of video image. The graph can thus be constructed automatically, without manual assignment, so that the same network structure is general and applicable to scenes with different nodes and node connection structures, but the embodiments of the present disclosure do not limit this.
Optionally, the number of the plurality of nodes included in the space-time graph may be equal to or not equal to the number of the plurality of human body key points, where if the number of the plurality of nodes is equal to the number of the plurality of human body key points, the plurality of nodes and the plurality of human body key points may correspond one to one, but the embodiment of the present disclosure is not limited thereto.
Optionally, the behavior recognition result obtained in 120 may be a behavior recognition result corresponding to each frame of video image in at least one frame of video image, or may also be a behavior recognition result corresponding to all video images in at least one frame of video image in common, for example, the at least one frame of video image belongs to a video segment of a video stream, the video segment corresponds to a human body action, and accordingly, a behavior recognition result may be obtained based on at least one frame of video image in the video segment. Alternatively, the space-time diagram may be constructed by other procedures.
Fig. 2 is a schematic diagram of an example of a space-time map constructed in an embodiment of the present disclosure. In the space-time diagram, the human key points are associated through the spatial edges and the temporal edges, each human key point has at least one edge with other human key points, and accordingly, the adjacent key points of the temporal dimension and/or the adjacent key points of the spatial dimension of each human key point can be obtained through the space-time diagram, but the specific implementation of the space-time diagram in the embodiment of the present disclosure is not limited.
As an example, a space-time graph built on a skeleton sequence with N joints and T frames of video images may be represented as G = (V, E), where V is a node set composed of a plurality of nodes and E is an edge set composed of a plurality of edges.
Specifically, the node set V = {v_ti | t = 1, ..., T; i = 1, ..., N} includes all keypoints of the skeleton sequence, and the feature vector of a node may include a coordinate vector and an estimated confidence, or further include other information. As an optional example, the edge set E may comprise two subsets. The first subset describes the intra-frame connections between joints in each frame of the video image, denoted as E_S = {v_ti v_tj | (i, j) ∈ H}, where H is the set of naturally connected human joint pairs. The second subset contains the inter-frame edges connecting the same joints in successive frames, denoted as E_F = {v_ti v_(t+1)i}. For a particular node, all of its edges in E_F represent the trajectory of the corresponding body part over time.
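The definitions above translate directly into an adjacency matrix over the T·N nodes. The sketch below is an illustration under stated assumptions, not the claimed implementation; it includes the self-connecting edges mentioned earlier together with E_S and E_F, and the toy joint-pair set H is invented for the example.

```python
import numpy as np

def adjacency(T, N, H):
    """Adjacency matrix of G = (V, E) over the T*N nodes v_ti: self-connecting
    edges, the intra-frame subset E_S built from the joint-pair set H, and the
    inter-frame subset E_F linking the same joint in successive frames."""
    A = np.eye(T * N)                              # self-connections
    idx = lambda t, i: t * N + i
    for t in range(T):
        for i, j in H:                             # E_S: naturally connected joints
            A[idx(t, i), idx(t, j)] = A[idx(t, j), idx(t, i)] = 1.0
    for t in range(T - 1):
        for i in range(N):                         # E_F: same joint, adjacent frames
            A[idx(t, i), idx(t + 1, i)] = A[idx(t + 1, i), idx(t, i)] = 1.0
    return A

# Example with a toy 5-joint chain over 3 frames (H is an assumption).
A = adjacency(T=3, N=5, H=[(0, 1), (1, 2), (2, 3), (3, 4)])
print(A.shape)   # (15, 15)
```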
In this way, based on other key points having an association relationship with the human body key point, for example, adjacent key points, more information of the human body key point can be obtained, thereby facilitating obtaining a more accurate behavior recognition result.
Based on the behavior recognition method provided by the embodiments of the present disclosure, key point detection is performed on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image, and the behavior recognition result of the at least one frame of video image is obtained based on the feature information of the plurality of human body key points and the association information of the plurality of human body key points.
In one or more optional embodiments, the space-time map may be processed by using a neural network, so as to obtain a behavior recognition result of at least one frame of video image. In an optional example, the space-time graph may be subjected to convolution processing, and at this time, optionally, the space-time graph may be input into a convolutional neural network, for example, a feature vector of a node in the space-time graph may be input into the convolutional neural network, and the feature vector of the node may include feature information of a human body key point corresponding to the node and information of an edge corresponding to the node, but the embodiment of the present disclosure is not limited thereto. The convolutional neural network can process the input space-time image to obtain a behavior recognition result of at least one frame of video image.
In some optional embodiments, the convolution processing may be performed on the plurality of human body key points based on the association information between the plurality of human body key points, so as to obtain a convolution processing result of the plurality of human body key points. For example, each human body key point in the space-time map may be processed by using a convolutional neural network to obtain a convolution processing result of each human body key point, for example, an image feature corresponding to each human body key point, and a behavior recognition result of the at least one frame of video image may be obtained based on the convolution processing results of the plurality of human body key points. For convenience of description, the neural network based on the space-time graph may be referred to below as a spatio-temporal graph convolutional network (ST-GCN); the neural network may also have other names, and the name should not be construed as limiting the embodiments of the present disclosure.
Optionally, at least one associated key point of a certain human body key point (which may be referred to as a current human body key point) may be determined based on associated information of a plurality of human body key points, and an image feature of the current human body key point may be determined based on feature information of the current human body key point and feature information of each associated key point of the at least one associated key point. Wherein optionally, the associated keypoints may comprise neighboring keypoints. Taking the space-time graph as an example, assuming that the human body keypoint corresponding to the current node is referred to as a fourth human body keypoint, and the human body keypoint corresponding to each edge in at least one edge of the current node is referred to as a fifth human body keypoint, at least one fifth keypoint may be determined as a neighboring keypoint of the fourth human body keypoint, where optionally, the at least one edge may be part or all of the edge of the current node and may include at least one temporal edge and/or at least one spatial edge, and accordingly, at least one neighboring keypoint (i.e., at least one fifth human body keypoint) of the fourth human body keypoint may include a neighboring keypoint in a temporal dimension and/or a neighboring keypoint in a spatial dimension, but the present disclosure is not limited thereto. Alternatively, the current node further has self-connected edges, and the human key points corresponding to each edge of the current node may be determined as neighboring key points of a fourth human key point, at this time, at least one neighboring key point of the fourth human key point may include itself and at least one fifth key point, and accordingly, a convolution processing result of the fourth human key point may be determined based on feature information of each neighboring key point of the at least one neighboring key point of the fourth human key point, but the embodiment of the present disclosure is not limited thereto.
Optionally, the current human body key point and each associated key point of the at least one associated key point of the current human body key point may be convolved to obtain an initial convolution result of each human body key point, and based on the initial convolution result of the current human body key point and the initial convolution result of the at least one associated key point, for example, the initial convolution result of the current human body key point and the initial convolution result of the at least one associated key point are superimposed to obtain a convolution result of the current human body key point. For example, the convolution parameter may be utilized to perform convolution processing on the fourth human key point and each fifth human key point in the at least one fifth human key point, so as to obtain an initial convolution result of each human key point. Optionally, the convolution parameters corresponding to each human body key point may be the same or different.
In some optional embodiments, the at least one associated key point of the current human body key point may be classified, or the current human body key point together with its at least one associated key point may be classified, to obtain a classification result, and corresponding convolution parameters may be assigned based on the classification result. Alternatively, the type of each neighboring node of a node may be determined, and a convolution kernel may then be allocated according to that type for the convolution processing. For example, the fourth human body key point and the at least one fifth human body key point may be divided into at least one human body key point set, where each human body key point set includes at least one human body key point and may correspond to one group of convolution parameters; the convolution parameter of each human body key point may then be determined based on the human body key point set to which it belongs. Optionally, different human body key point sets may correspond to different convolution parameters, for example different convolution kernels, and the weight values in a convolution kernel may be initialized in advance and obtained, for example, through training. As an example, assuming the human body key points are divided into two human body key point sets, numbers may be respectively allocated to the two sets and two groups of weight values initialized in advance; for a given human body key point, the corresponding weight values, i.e., a distinct convolution kernel, may be obtained according to the number of the set to which it belongs. However, the disclosed embodiments are not so limited.
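For illustration only, the following sketch realizes this idea with one 1x1 convolution per key point set: the adjacency is split into per-partition masks, each mask is paired with its own convolution parameters, and the initial results are superimposed. The names and the mask-based formulation are assumptions of this sketch, not the patented code.

```python
import torch
import torch.nn as nn

class PartitionedGraphConv(nn.Module):
    """Graph convolution with one set of parameters per key point partition."""
    def __init__(self, in_channels, out_channels, num_partitions):
        super().__init__()
        # one 1x1 convolution (i.e., one convolution kernel) per partition
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=1)
            for _ in range(num_partitions)])

    def forward(self, x, A_parts):
        # x: (N, C, T, V); A_parts: list of (V, V) masks, one per partition
        out = 0
        for conv, A_k in zip(self.convs, A_parts):
            # initial convolution result for this partition, aggregated over
            # the masked neighbors and superimposed onto the running sum
            out = out + torch.einsum('nctv,vw->nctw', conv(x), A_k)
        return out
```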
In order to process human body key points of different classes differently, different network parameters (for example, convolution kernel parameters) are allocated to each human body key point set without changing the network structure, and the human body key points in each set are processed by the network with its allocated parameters, so that local information can be highlighted. Specifically, the allocated parameters may be predefined or determined through training; based on different tasks, the network parameters best suited to a given classification combination may be obtained through training. Since the network structure is unchanged across the different parameter groups, different classification sets may be input into the respective convolutional neural networks, the convolution operation may be performed on the human body key points by these networks, and the human body features corresponding to the differently classified human body key points may be obtained respectively.
The manner in which the relevant human body key points are classified in the embodiments of the present disclosure will be described below with reference to Figs. 3a-3d. Fig. 3a is an example of one frame of input skeleton, where the input skeleton includes 18 key points; optionally, the number of key points in the embodiments of the present disclosure may be any number, which is not limited by the embodiments of the present disclosure.
In one or more alternative embodiments, a uniform classification (Uni-labeling) manner as shown in Fig. 3b may be employed. Specifically, all the human body key points may be assigned to the same class, that is, the current human body key point and its at least one associated key point may use the same convolution parameters.
In one or more alternative embodiments, a distance classification (Distance Partitioning) manner as shown in Fig. 3c may be employed. Specifically, the at least one associated key point of the current human body key point may be classified according to its distance from the current human body key point. As an example, if the distance of the current human body key point from itself along the self-connection edge is 0, and the distance of a human body key point connected by a spatial edge or a temporal edge to the current human body key point is 1 (that is, a neighboring key point), then the current human body key point may be assigned to one class and all its other associated key points to another class. For example, the at least one human body key point set may include a first human body key point set and a second human body key point set; in this case, optionally, the fourth human body key point may be divided into the first human body key point set, and the at least one fifth human body key point may be divided into the second human body key point set.

In one or more alternative embodiments, a spatial configuration classification (Spatial Configuration Partitioning) manner as shown in Fig. 3d may be employed. Specifically, the human body key points may be classified based on their distance from a reference point, where the reference point may be any predefined point, such as the center of gravity, the center point, or another type of reference point. For example, during convolution, among the neighboring nodes of the current node, the neighboring nodes closer to the reference point than the current node belong to one class and those farther away belong to another class, and so on; more or fewer classes may also be set, which is not limited by the embodiments of the present disclosure. In this case, optionally, a first distance between the fourth human body key point and the reference point may be determined based on the coordinate information of the fourth human body key point, and the human body key point set to which each human body key point belongs may be determined based on the magnitude relationship between that key point's distance from the reference point and the first distance. For example, three key point sets may be set: a first key point set, a second key point set, and a third key point set; it may then be determined that a human body key point whose distance from the reference point is less than the first distance belongs to the first key point set, and/or that a human body key point whose distance from the reference point is equal to the first distance belongs to the second key point set, and/or that a human body key point whose distance from the reference point is greater than the first distance belongs to the third key point set. A sketch of these partitioning strategies follows.
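The following sketch, illustrative only, labels the neighbors of one node under the three strategies above; the function name, the adjacency/coordinate representation, and the exact floating-point comparison for "equal distance" are assumptions of this sketch.

```python
import numpy as np

def partition_labels(adj, coords, root, center, strategy):
    """Assign a partition index to each neighbor of node `root`.

    adj:    (V, V) 0/1 skeleton adjacency with self-loops
    coords: (V, 2) key point coordinates in the current frame
    center: (2,)   coordinates of the reference point, e.g. the gravity center
    """
    labels = {}
    d_root = np.linalg.norm(coords[root] - center)
    for j in np.nonzero(adj[root])[0]:
        if strategy == 'uniform':        # Uni-labeling: one shared class
            labels[j] = 0
        elif strategy == 'distance':     # root itself vs. 1-hop neighbors
            labels[j] = 0 if j == root else 1
        elif strategy == 'spatial':      # compare distances to the reference point
            d_j = np.linalg.norm(coords[j] - center)
            # closer -> first set, equal -> second set, farther -> third set
            labels[j] = 0 if d_j < d_root else (1 if d_j == d_root else 2)
    return labels
```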
Compared with convolution processing that does not distinguish between neighboring key points, performing convolution processing on different human body key points with different convolution parameters captures the local information of the human body key points, which helps to improve the accuracy of the behavior recognition result.
Optionally, the human body key points may also be classified in other manners, which are not limited by the embodiments of the present disclosure.
Optionally, after obtaining a plurality of human body key points, the classification process may be performed, and labeling information may be added to each human body key point, where the labeling information may indicate a category to which the human body key point belongs, for example, labeling information may be added to each human body key point or each edge in a space-time diagram to indicate a category to which the human body key point or the human body key point corresponding to the edge belongs. At this time, the convolutional neural network may allocate a corresponding convolution parameter to each human body key point according to the labeling information for convolution processing, but the embodiment of the present disclosure is not limited thereto.
Optionally, in the embodiments of the present disclosure, the convolution operation of the convolutional neural network may be replaced by a graph convolution operation, i.e., a graph-model-based convolutional neural network, so that convolution processing based on the space-time graph can be implemented without changing the network structure. The space-time graph still maintains the structure of the graph model after passing through the convolutional neural network; however, through layer-by-layer convolution, each node comes to contain high-level semantic information extracted from the underlying coordinate information.
In one or more alternative embodiments, the convolutional neural network may further include a global pooling layer and a fully connected layer after the one or more convolutional layers. In this case, correspondingly, the convolution processing result of each of the plurality of human body key points included in the space-time graph may be subjected to global pooling to obtain a pooling processing result, and the behavior recognition result of the at least one frame of video image may be obtained based on the pooling processing result.
Because the input features for behavior recognition are features of a space-time graph structure built from human body key points, all nodes in the space-time graph are passed through a global pooling layer to obtain a one-dimensional vector that synthesizes the information of all nodes; the global pooling layer is only one conversion manner, and the method of converting feature dimensions is not limited in the present application.
Specifically, the pooling processing result includes a one-dimensional feature vector. Obtaining the behavior recognition result of each frame of the at least one frame of video image based on the pooling processing result may specifically include: processing the one-dimensional feature vector by a fully connected layer to obtain an identification vector, where the identification vector includes one vector value per behavior class; and obtaining the behavior class in the video image based on the magnitudes of the values in the identification vector.
Specifically, the number of values in the obtained one-dimensional vector does not necessarily match the number of behavior classes (the behavior classes are determined by the data set used, for example 400 behavior classes in Kinetics and 200 in ActivityNet); to realize behavior classification, the one-dimensional vector may be input into a fully connected layer to obtain the identification vector, and the classification result of behavior recognition may be obtained based on the identification vector. Behavior recognition may also use a more complex network structure instead of a single fully connected layer; the network structure used for behavior recognition is not limited in the present application.
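As a hedged illustration of such a head, assuming the same (N, C, T, V) feature layout as in the earlier sketches (an assumption for illustration only, not the patented implementation):

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    """Global pooling over all nodes, then a fully connected classifier."""
    def __init__(self, channels, num_classes):
        super().__init__()
        # one output value per behavior class (e.g. 400 for Kinetics)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):
        # x: (N, C, T, V) feature map from the last graph convolution layer
        x = x.mean(dim=(2, 3))        # global pooling over frames and key points
        logits = self.fc(x)           # the identification vector
        return logits.softmax(dim=1)  # class probabilities
```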
Fig. 4 is a flowchart illustrating a behavior recognition method according to an embodiment of the disclosure. Human body key points are detected in at least one frame of video image by pose estimation, and a space-time graph of the skeleton sequence is constructed from all the detected human body key points. Applying the space-time graph convolutional network (ST-GCN) layer by layer gradually produces a high-quality feature map, and the corresponding action class is obtained by a standard softmax classifier.
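Tying the pieces together, a hypothetical end-to-end sketch of this flow might look as follows, where `pose_estimator`, the stacked `layers` (for example graph convolution layers like the sketch above), the adjacency `A`, and the classification `head` are all assumed inputs rather than components defined by the disclosure:

```python
import torch

def recognize(video_frames, pose_estimator, A, layers, head):
    # 1. detect human body key points in each frame: (T, V, 2) coordinates
    keypoints = [pose_estimator(frame) for frame in video_frames]
    x = torch.tensor(keypoints, dtype=torch.float32)  # (T, V, 2)
    x = x.permute(2, 0, 1).unsqueeze(0)               # (1, C=2, T, V)
    # 2. build features on the skeleton space-time graph, layer by layer
    for layer in layers:
        x = layer(x, A)
    # 3. global pooling + fully connected layer + softmax
    return head(x)
```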
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 5 is a schematic structural diagram of an embodiment of the behavior recognition apparatus of the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in fig. 5, the apparatus of this embodiment includes:
the key point detecting unit 51 is configured to perform human key point detection on at least one frame of video image to obtain a plurality of human key points of the at least one frame of video image.
Specifically, the at least one frame of video image may be derived from a video acquired by the device itself or a video input by a user or a video acquired from another device, and the embodiment of the present disclosure does not limit the manner of acquiring the at least one frame of video image. Alternatively, the at least one frame of video image may be a continuous video image, for example, the at least one frame of video image may belong to a video segment corresponding to an action or behavior, based on which a behavior recognition result may be obtained.
The behavior recognition unit 52 is configured to obtain a behavior recognition result of the at least one frame of video image based on the feature information of the plurality of human body key points of the at least one frame of video image and the associated information of the plurality of human body key points.
In one or more optional embodiments, the feature information of a human body key point may include the coordinate information of the human body key point, for example its 2D or 3D coordinates. The feature information may further include an estimation confidence of the human body key point and/or initial features corresponding to the human body key point, where, optionally, the initial features may be obtained by performing feature extraction on the position or region where the human body key point is located. The feature information may also include other information related to the key point; which information is specifically included is not limited by the embodiments of the present disclosure.
In one or more optional embodiments, the association information of the plurality of human body key points includes any one or more of the following: the video image processing method comprises the following steps of obtaining spatial correlation information between at least two human key points in the same frame of video image and time correlation information between at least two human key points which correspond to the same human body part and belong to the adjacent frame of video image in at least one frame of video image.
The behavior recognition apparatus provided by the above embodiment of the present disclosure performs key point detection on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image, and obtains a behavior recognition result of the at least one frame of video image based on the feature information of the plurality of human body key points and the association information of the plurality of human body key points. This overcomes the defect of the prior art, which processes all key points together and thus cannot focus on local information; by combining the feature information of the human body key points with the association information between them, both local and global information are fully utilized, improving the accuracy of behavior recognition.
In one or more alternative embodiments, the at least one frame of video image is a plurality of frames of continuous video images in the video; and/or
The spatial correlation information between at least two human body key points in the same frame of video image is determined according to the connection relation of human body structures.
In some optional examples, the spatial association information between the human key points may include a neighboring relationship of the human key points in a spatial position, for example, may include information indicating that body parts corresponding to at least two human key points in the same video image are directly connected. Optionally, the time correlation information between the human body key points may include an adjacent relationship of frames to which the human body key points belong, for example, information indicating that frames to which at least two human body key points corresponding to the same body part belong are adjacent frames may be included, but the embodiment of the present disclosure is not limited thereto.
In one or more alternative embodiments, the disclosed apparatus further comprises:
the image establishing unit is used for establishing a space-time image based on a plurality of human body key points in at least one frame of video image;
correspondingly, the behavior recognition unit is specifically configured to obtain a behavior recognition result of at least one frame of video image based on the space-time map.
The space-time map comprises feature information of a plurality of human key points in at least one frame of video image and associated information of the plurality of human key points.
Specifically, the space-time map may be established based on time correlation information and/or space correlation information of a plurality of human body key points. The space-time graph may include a plurality of nodes and a plurality of edges, and each node of the plurality of nodes may have at least one edge. In some optional examples, the human body key points may serve as nodes, and the association information between the human body key points may be embodied as edges in a space-time graph, that is, the human body key points having a temporal association relationship and/or a spatial association relationship may be connected by the edges, and the edges may be used to indicate the association relationship between the human body key points. As an optional example, the spatial association relationship and the temporal association relationship between the human body key points may be respectively indicated by a spatial edge and a temporal edge, and in addition, optionally, the human body key points may also have a self-connection edge with themselves.
In one or more optional embodiments, a first node in the plurality of nodes and each of at least one second node have a spatial edge between them, where the first human body key point corresponding to the first node and the second human body key point corresponding to each second node belong to the same frame of video image, and the human body parts corresponding to the first human body key point and each second human body key point are directly connected.
In one or more optional embodiments, a temporal edge is provided between the first node and each of at least one third node, where the first human body key point corresponds to the same human body part as the third human body key point corresponding to each third node, and they belong to adjacent frames.
Optionally, when constructing the space-time map, at least two human body key points located in the same frame of video image may be connected by using a spatial edge based on a connectivity relationship of human body structures, and then at least two human body key points of the same body part in adjacent frames of the at least one frame of video image may be connected by using a temporal edge, which is not limited in this disclosure.
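As an illustrative sketch only of this construction, assuming the graph is stored as a single adjacency matrix over all frames and that `bones` is a hypothetical list of connected joint-index pairs (both assumptions of this sketch):

```python
import numpy as np

def build_space_time_graph(num_joints, num_frames, bones):
    """Spatial edges follow the skeleton's connectivity, temporal edges link
    the same joint in adjacent frames, and every node keeps a self-loop."""
    V = num_joints * num_frames
    A = np.eye(V)                                  # self-connection edges
    for t in range(num_frames):
        base = t * num_joints
        for i, j in bones:                         # spatial edges
            A[base + i, base + j] = A[base + j, base + i] = 1
        if t + 1 < num_frames:
            nxt = (t + 1) * num_joints
            for v in range(num_joints):            # temporal edges
                A[base + v, nxt + v] = A[nxt + v, base + v] = 1
    return A
```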
In one or more optional embodiments, the behavior recognition unit 52 is specifically configured to input the space-time map into a convolutional neural network, so as to obtain a behavior recognition result of at least one frame of video image.
Alternatively, the behavior recognition result obtained by the behavior recognition unit 52 may be a behavior recognition result corresponding to each individual frame of the at least one frame of video image, or a behavior recognition result corresponding jointly to all of the at least one frame of video image; for example, the at least one frame of video image may belong to a video segment of a video stream, the video segment corresponding to one human body action, and accordingly a behavior recognition result may be obtained based on at least one frame of video image in the video segment. Alternatively, the space-time graph may be constructed by another procedure.
In one or more alternative embodiments, the behavior recognizing unit 52 includes:
the convolution processing module is used for performing convolution processing on the plurality of human key points based on the associated information among the plurality of human key points to obtain convolution processing results of the plurality of human key points;
and the convolution identification module is used for obtaining a behavior identification result of at least one frame of video image based on the convolution processing result of the plurality of human key points.
Optionally, the convolution processing module includes:
the association determining module is used for determining at least one fifth human key point which has an association relation with a fourth human key point in the plurality of human key points based on the association information of the plurality of human key points;
and the feature processing module is used for obtaining a convolution processing result of the fourth human key point based on the feature information of each human key point in the fourth human key point and the at least one fifth human key point.
In some optional embodiments, the feature processing module is specifically configured to perform convolution processing on each human body key point by using a convolution parameter corresponding to a human body key point set to which each human body key point belongs in the fourth human body key point and the at least one fifth human body key point, so as to obtain an initial convolution result of each human body key point;
and obtaining a convolution processing result of the fourth human key point based on the initial convolution result of the fourth human key point and each human key point in the at least one fifth human key point.
Optionally, the current human body key point and each of its at least one associated key point may each be convolved to obtain an initial convolution result for each of these human body key points, and the convolution processing result of the current human body key point may then be obtained based on the initial convolution result of the current human body key point and the initial convolution result(s) of the at least one associated key point, for example by superimposing them. For example, convolution parameters may be used to perform convolution processing on the fourth human body key point and each of the at least one fifth human body key point, so as to obtain an initial convolution result of each human body key point. Optionally, the convolution parameters corresponding to different human body key points may be the same or different.
In some optional embodiments, the behavior recognizing unit 52 further includes:
the classification module is used for dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set; each human body key point set comprises at least one human body key point;
and the parameter determining module is used for determining the convolution parameter of each human body key point based on the human body key point set to which each human body key point belongs in the fourth human body key point and the at least one fifth human body key point.
Wherein the human keypoints belonging to different sets of human keypoints correspond to different convolution parameters.
In order to process human body key points of different classes differently, different network parameters (for example, convolution kernel parameters) are allocated to each human body key point set without changing the network structure, and the human body key points in each set are processed by the network with its allocated parameters, so that local information can be highlighted. Specifically, the allocated parameters may be predefined or determined through training; based on different tasks, the network parameters best suited to a given classification combination may be obtained through training. Since the network structure is unchanged across the different parameter groups, different classification sets may be input into the respective convolutional neural networks, the convolution operation may be performed on the human body key points by these networks, and the human body features corresponding to the differently classified human body key points may be obtained respectively.
In one or more alternative embodiments, the at least one set of human keypoints comprises a first set of human keypoints and a second set of human keypoints;
and the classification module is specifically used for dividing the fourth human key point into the first human key point set and dividing at least one fifth human key point into the second human key point set.
In one or more optional embodiments, the classification module is specifically configured to divide the fourth human body keypoint and the at least one fifth human body keypoint into at least one human body keypoint set based on a distance between each human body keypoint and a reference point in the fourth human body keypoint and the at least one fifth human body keypoint.
Optionally, the classification module comprises:
the first distance module is used for determining a first distance between a fourth human body key point and a reference point based on the feature information of the fourth human body key point;
and the first relation module is used for determining a key point set to which each human body key point belongs based on the magnitude relation between the first distance and the distance between each human body key point and the reference point in the fourth human body key point and the at least one fifth human body key point.
The first relation module is specifically configured to determine that a human body key point whose distance to the reference point is smaller than the first distance belongs to a first key point set; and/or
Determining that the human body key points with the distance from the reference point equal to the first distance belong to a second key point set; and/or
And determining that the human key points with the distance from the reference point greater than the first distance belong to a third key point set.
Compared with convolution processing that does not distinguish between neighboring key points, performing convolution processing on different human body key points with different convolution parameters captures the local information of the human body key points, which helps to improve the accuracy of the behavior recognition result.
Optionally, the human body key points may also be classified in other manners, which are not limited by the embodiments of the present disclosure.
Optionally, after obtaining a plurality of human body key points, the classification process may be performed, and labeling information may be added to each human body key point, where the labeling information may indicate a category to which the human body key point belongs, for example, labeling information may be added to each human body key point or each edge in a space-time diagram to indicate a category to which the human body key point or the human body key point corresponding to the edge belongs. At this time, the convolutional neural network may allocate a corresponding convolution parameter to each human body key point according to the labeling information for convolution processing, but the embodiment of the present disclosure is not limited thereto.
In one or more optional embodiments, the convolution identifying module is specifically configured to perform global pooling on a convolution processing result of each human body key point of the plurality of human body key points included in the space-time diagram to obtain a pooling processing result;
and obtain a behavior recognition result of the at least one frame of video image based on the pooling processing result.
Because the input features for behavior recognition are features of a space-time graph structure built from human body key points, all nodes in the space-time graph are passed through a global pooling layer to obtain a one-dimensional vector that synthesizes the information of all nodes; the global pooling layer is only one conversion manner, and the method of converting feature dimensions is not limited in the present application.
Specifically, the pooling processing result includes a one-dimensional feature vector, and a behavior recognition result of each frame of the at least one frame of video image is obtained based on the pooling processing result.
According to an aspect of the embodiments of the present disclosure, there is provided an electronic device including a processor, where the processor includes the behavior recognition apparatus according to any one of the embodiments of the present disclosure.
According to an aspect of the embodiments of the present disclosure, there is provided an electronic device including: a memory for storing executable instructions;
and a processor in communication with the memory for executing the executable instructions to perform any of the above embodiments of the disclosed behavior recognition method.
According to an aspect of the embodiments of the present disclosure, there is provided a computer storage medium for storing computer readable instructions, which when executed perform any of the above embodiments of the behavior recognition method of the present disclosure.
According to an aspect of the embodiments of the present disclosure, there is provided a computer program including computer readable code, where, when the code runs on a device, a processor in the device executes instructions for implementing the behavior recognition method in any one of the implementations of the present disclosure.
In one or more alternative embodiments, the disclosed embodiments also provide a computer program product for storing computer readable instructions, which when executed, cause a computer to perform the behavior recognition method in any of the above possible implementations.
The computer program product may be embodied in hardware, software or a combination thereof. In one alternative, the computer program product is embodied in a computer storage medium, and in another alternative, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
In one or more optional implementation manners, the present disclosure also provides another behavior identification method and a corresponding apparatus, an electronic device, a computer storage medium, a computer program, and a computer program product, where the method includes: performing human body key point detection on at least one frame of video image to obtain a plurality of human body key points of at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on the feature information of the plurality of human body key points of the at least one frame of video image and the associated information of the plurality of human body key points.
In some embodiments, the behavior recognition indication may be embodied as a call instruction: a first device may instruct a second device to perform behavior recognition on the image by calling, and accordingly, in response to receiving the call instruction, the second device may perform the steps and/or processes in any of the above-described behavior recognition methods.
The embodiment of the disclosure also provides an electronic device, which may be a mobile terminal, a personal computer (PC), a tablet computer, a server, and the like. Referring now to Fig. 6, there is shown a schematic diagram of an electronic device 600 suitable for implementing embodiments of the present application. As shown in Fig. 6, the electronic device 600 includes one or more processors, a communication section, and the like, for example: one or more central processing units (CPUs) 601 and/or one or more graphics processing units (GPUs) 613, which may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 602 or loaded from a storage section 608 into a random access memory (RAM) 603. The communication section 612 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card.
The processor may communicate with the ROM 602 and/or the RAM 603 to execute executable instructions, connect with the communication section 612 through the bus 604, and communicate with other target devices through the communication section 612, so as to complete operations corresponding to any of the methods provided by the embodiments of the present application, for example: performing human body key point detection on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on the feature information of the plurality of human body key points of the at least one frame of video image and the associated information of the plurality of human body key points.
In addition, the RAM 603 may also store various programs and data necessary for the operation of the device. The CPU 601, the ROM 602, and the RAM 603 are connected to each other via the bus 604. When the RAM 603 is present, the ROM 602 is an optional module. The RAM 603 stores executable instructions, or executable instructions are written into the ROM 602 at runtime, and the executable instructions cause the processor 601 to perform the operations corresponding to the above-described method. An input/output (I/O) interface 605 is also connected to the bus 604. The communication section 612 may be integrated, or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) connected to the bus link.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as needed.
It should be noted that the architecture shown in fig. 6 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 6 may be selected, deleted, added or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, the GPU and the CPU may be separately set or the GPU may be integrated on the CPU, the communication part may be separately set or integrated on the CPU or the GPU, and so on. These alternative embodiments are all within the scope of the present disclosure.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart, the program code may include instructions corresponding to performing the method steps provided by embodiments of the present disclosure, e.g., performing human keypoint detection on at least one frame of video image, obtaining a plurality of human keypoints for the at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on the feature information of the plurality of human body key points of the at least one frame of video image and the associated information of the plurality of human body key points. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601.
It is to be understood that the terms "first," "second," and the like in the embodiments of the present disclosure are used for distinguishing and not limiting the embodiments of the present disclosure.
It is also understood that in the present disclosure, "plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in this disclosure is generally to be construed as one or more, unless explicitly stated otherwise or indicated to the contrary hereinafter.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
The methods and apparatus, devices of the present disclosure may be implemented in a number of ways. For example, the methods and apparatuses, devices of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (29)

1. A method of behavior recognition, comprising:
performing human body key point detection on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image;
establishing a space-time map based on the plurality of human key points in the at least one frame of video image, wherein the space-time map comprises feature information of the plurality of human key points in the at least one frame of video image and associated information of the plurality of human key points;
determining at least one fifth human body key point which has an association relation with a fourth human body key point in the plurality of human body key points based on association information between the plurality of human body key points included in the space-time map;
performing convolution processing on each human body key point by using a convolution parameter corresponding to a human body key point set to which each human body key point belongs in the fourth human body key point and the at least one fifth human body key point to obtain an initial convolution result of each human body key point;
obtaining a convolution processing result of the fourth human body key point based on the initial convolution result of the fourth human body key point and each human body key point in the at least one fifth human body key point;
and obtaining a behavior recognition result of the at least one frame of video image based on the convolution processing result of the plurality of human body key points.
2. The method according to claim 1, wherein the feature information of the human body key point includes coordinate information of the human body key point; or,
the feature information of the human body key points comprises coordinate information of the human body key points, and estimation confidence degrees of the human body key points and/or initial features corresponding to the human body key points.
3. The method according to claim 1, wherein the associated information of the plurality of human key points comprises any one or more of the following: the video image processing method comprises the following steps of obtaining spatial correlation information between at least two human key points in the same frame of video image and time correlation information between at least two human key points which correspond to the same human body part and belong to the adjacent frame of video image in the at least one frame of video image.
4. The method according to claim 3, wherein the at least one frame of video image is a plurality of frames of consecutive video images in a video; and/or
And the spatial correlation information between at least two human body key points in the same frame of video image is determined according to the connectivity relation of human body structures.
5. The method of claim 3,
the spatial correlation information between the at least two human key points comprises the spatial adjacent relation of the at least two key points, and/or
The time correlation information between the at least two key points comprises: and the adjacent relation of the frames to which the at least two key points belong.
6. The method of claim 1, wherein the space-time graph comprises a plurality of nodes corresponding to the plurality of human key points, each of the plurality of nodes comprising feature information of the corresponding human key point;
each node in the plurality of nodes has at least one edge, and the plurality of nodes have a plurality of edges indicating the association relation of the plurality of human body key points.
7. The method of claim 6,
a first node of the plurality of nodes and each second node of at least one second node have a spatial edge between them, wherein a first human body key point corresponding to the first node and a second human body key point corresponding to each second node belong to the same frame, and the human body parts corresponding to the first human body key point and each second human body key point are directly connected, and/or
And a time edge is arranged between the first node and each of at least one third node, wherein the first human body key point corresponds to the same human body part as the third human body key point corresponding to each third node and belongs to an adjacent frame.
8. The method according to any one of claims 1-7, wherein prior to said convolving each of said human keypoints, said method further comprises:
dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set, wherein each human body key point set comprises at least one human body key point;
determining convolution parameters of each human body key point based on the fourth human body key point and a human body key point set to which each human body key point belongs in the at least one fifth human body key point, wherein the human body key points belonging to different human body key point sets correspond to different convolution parameters.
9. The method of claim 8, wherein the at least one set of human keypoints comprises a first set of human keypoints and a second set of human keypoints;
said dividing the fourth human keypoints and the at least one fifth human keypoints into at least one human keypoint set, comprising:
the fourth human keypoints are classified into the first human keypoint set, and the at least one fifth human keypoint is classified into the second human keypoint set.
10. The method of claim 9, wherein said dividing the fourth human keypoint and the at least one fifth human keypoint into at least one human keypoint set comprises:
and dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set based on the distance between each human body key point in the fourth human body key point and the at least one fifth human body key point and a reference point.
11. The method of claim 10, wherein the dividing the fourth human keypoint and the at least one fifth human keypoint into at least one human keypoint set based on a distance between each of the fourth human keypoint and the at least one fifth human keypoint and a reference point comprises:
determining a first distance between the fourth human body key point and the reference point based on the feature information of the fourth human body key point;
and determining a key point set to which each human body key point belongs based on the magnitude relation between the first distance and the distance between each human body key point in the fourth human body key point and the at least one fifth human body key point and the reference point.
12. The method according to claim 11, wherein the determining the set of key points to which each of the fourth and the at least one fifth human key points belongs based on a magnitude relationship between the first distance and a distance between the reference point and each of the fourth and the at least one fifth human key points comprises:
determining that the human body key points with the distance to the reference point smaller than the first distance belong to a first key point set; and/or
Determining that the human body key points with the distance from the reference point equal to the first distance belong to a second key point set; and/or
And determining that the human key points with the distance to the reference point greater than the first distance belong to a third key point set.
13. The method according to any one of claims 1 to 7, wherein obtaining the behavior recognition result of the at least one frame of video image based on the convolution processing result of the plurality of human key points comprises:
performing global pooling on the convolution processing result of each human body key point in the plurality of human body key points included in the space-time diagram to obtain a pooling processing result;
and obtaining a behavior identification result of the at least one frame of video image based on the pooling processing result.
14. A behavior recognition apparatus, comprising:
the key point detection unit is used for executing human body key point detection on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image;
a graph establishing unit, configured to establish a space-time graph based on a plurality of human key points in the at least one frame of video image, where the space-time graph includes feature information of the plurality of human key points in the at least one frame of video image and associated information of the plurality of human key points;
A behavior recognition unit comprising:
a convolution processing module comprising: the system comprises an association determining module and a feature processing module;
the association determining module is configured to determine, based on association information between the plurality of human body key points included in the space-time map, at least one fifth human body key point having an association relationship with a fourth human body key point in the plurality of human body key points;
the feature processing module is configured to perform convolution processing on each human body key point by using a convolution parameter corresponding to a human body key point set to which each human body key point belongs in the fourth human body key point and the at least one fifth human body key point, so as to obtain an initial convolution result of each human body key point; obtaining a convolution processing result of the fourth human body key point based on the initial convolution result of the fourth human body key point and each human body key point in the at least one fifth human body key point;
and the convolution identification module is used for obtaining a behavior identification result of the at least one frame of video image based on the convolution processing result of the plurality of human key points.
15. The apparatus according to claim 14, wherein the feature information of the human body key point includes coordinate information of the human body key point; or,
the feature information of the human body key points comprises coordinate information of the human body key points, and estimation confidence degrees of the human body key points and/or initial features corresponding to the human body key points.
16. The apparatus according to claim 14, wherein the associated information of the plurality of human key points comprises any one or more of the following: the video image processing method comprises the following steps of obtaining spatial correlation information between at least two human key points in the same frame of video image and time correlation information between at least two human key points which correspond to the same human body part and belong to the adjacent frame of video image in the at least one frame of video image.
17. The apparatus according to claim 16, wherein the at least one frame of video image is a plurality of frames of consecutive video images in a video; and/or
And the spatial correlation information between at least two human body key points in the same frame of video image is determined according to the connectivity relation of human body structures.
18. The apparatus of claim 16,
the spatial correlation information between the at least two human key points comprises the spatial adjacent relation of the at least two key points, and/or
The time correlation information between the at least two key points comprises: and the adjacent relation of the frames to which the at least two key points belong.
19. The apparatus of claim 14, wherein the space-time graph comprises a plurality of nodes corresponding to the plurality of human key points, each of the plurality of nodes comprising feature information of the corresponding human key point;
each node in the plurality of nodes has at least one edge, and the plurality of nodes have a plurality of edges indicating the association relation of the plurality of human body key points.
20. The apparatus of claim 19,
a first node of the plurality of nodes and each second node of at least one second node have a spatial edge between them, wherein a first human body key point corresponding to the first node and a second human body key point corresponding to each second node belong to the same frame, and the human body parts corresponding to the first human body key point and each second human body key point are directly connected, and/or
And a time edge is arranged between the first node and each of at least one third node, wherein the first human body key point corresponds to the same human body part as the third human body key point corresponding to each third node and belongs to an adjacent frame.
21. The apparatus according to any one of claims 14-20, wherein the behavior recognition unit further comprises:
a classification module, configured to divide the fourth human body keypoint and the at least one fifth human body keypoint into at least one human body keypoint set, where each human body keypoint set includes at least one human body keypoint;
a parameter determining module, configured to determine a convolution parameter of each human key point based on a human key point set to which each human key point belongs in the fourth human key point and the at least one fifth human key point, where the human key points belonging to different human key point sets correspond to different convolution parameters.
22. The apparatus of claim 21, wherein the at least one set of human keypoints comprises a first set of human keypoints and a second set of human keypoints;
the classification module is specifically configured to classify the fourth human body key point into the first human body key point set, and classify the at least one fifth human body key point into the second human body key point set.
23. The apparatus according to claim 22, wherein the classification module is specifically configured to divide the fourth human keypoint and the at least one fifth human keypoint into at least one human keypoint set based on a distance between each of the fourth human keypoint and the at least one fifth human keypoint and a reference point.
24. The apparatus of claim 23, wherein the classification module comprises:
a first distance module, configured to determine a first distance between the fourth human body key point and the reference point based on feature information of the fourth human body key point;
a first relation module, configured to determine, based on a magnitude relation between a distance between each of the fourth human body key point and the at least one fifth human body key point and the reference point and the first distance, a key point set to which each human body key point belongs.
25. The apparatus according to claim 24, wherein the first relation module is specifically configured to determine that a human keypoint having a distance to the reference point that is smaller than the first distance belongs to a first set of keypoints; and/or
Determining that the human body key points with the distance from the reference point equal to the first distance belong to a second key point set; and/or
And determining that the human key points with the distance to the reference point greater than the first distance belong to a third key point set.
26. The apparatus according to any one of claims 14 to 20, wherein the convolution identifying module is specifically configured to perform global pooling on a convolution processing result of each of a plurality of human body key points included in the space-time map to obtain a pooled processing result;
and obtaining a behavior identification result of the at least one frame of video image based on the pooling processing result.
27. An electronic device, comprising a processor including the behavior recognition apparatus of any of claims 14 to 26.
28. An electronic device, comprising:
a memory for storing executable instructions;
and a processor for executing the executable instructions to perform the behavior recognition method of any one of claims 1 to 13.
29. A computer storage medium storing computer-readable instructions that, when executed, perform the behavior recognition method according to any one of claims 1 to 13.
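For readers who want to connect the claims above to a concrete implementation, the sketch below renders the claimed pipeline in plain NumPy: building the space-time graph over human body key points (spatial edges within a frame, time edges between adjacent frames), partitioning key points into sets by comparing their distances to a reference point against the root key point's own "first distance", applying a graph convolution whose parameters differ per set, and globally pooling the per-key-point convolution results into a behavior score. This is a minimal illustration under stated assumptions, not the patented implementation: the 12-joint skeleton, the gravity center as reference point, and every function name here are hypothetical.

# Illustrative sketch only (not the patented implementation). All names,
# the skeleton layout, and the gravity-center reference point are assumptions.
import numpy as np

NUM_JOINTS = 12  # hypothetical skeleton size
# Hypothetical spatial edges: joints of the same body connected within a frame.
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5),
                  (1, 6), (6, 7), (7, 8), (1, 9), (9, 10), (10, 11)]

def build_space_time_graph(num_frames):
    """Nodes are (frame, joint) pairs. Spatial edges link connected joints
    inside one frame; time edges link the same joint in adjacent frames."""
    edges = []
    for t in range(num_frames):
        edges += [((t, i), (t, j)) for i, j in SKELETON_EDGES]          # spatial
        if t + 1 < num_frames:
            edges += [((t, v), (t + 1, v)) for v in range(NUM_JOINTS)]  # time
    return edges

def partition_neighbors(coords, root, neighbors):
    """Split a root key point's neighbors into key point sets by comparing
    each neighbor's distance to a reference point (here: the skeleton's
    gravity center) against the root's own first distance:
    set 0 = closer than the root, set 1 = equal, set 2 = farther."""
    ref = coords.mean(axis=0)                        # reference point
    first_distance = np.linalg.norm(coords[root] - ref)
    sets = {}
    for v in neighbors:
        d = np.linalg.norm(coords[v] - ref)
        if np.isclose(d, first_distance):
            sets[v] = 1
        else:
            sets[v] = 0 if d < first_distance else 2
    return sets

def graph_conv(features, adjacency_per_set, weights_per_set):
    """Graph convolution where key points in different sets use different
    convolution parameters: one normalized adjacency matrix A_k and one
    weight matrix W_k per set, summed over sets."""
    out = np.zeros((features.shape[0], weights_per_set[0].shape[1]))
    for A_k, W_k in zip(adjacency_per_set, weights_per_set):
        out += A_k @ features @ W_k                  # (N,N)@(N,c_in)@(c_in,c_out)
    return out

def recognize(features, adjacency_per_set, weights_per_set, classifier):
    """Globally pool the convolution results of all key points in the
    space-time graph, then map the pooled vector to behavior scores with a
    (hypothetical) linear classifier."""
    h = graph_conv(features, adjacency_per_set, weights_per_set)
    pooled = h.mean(axis=0)                          # global average pooling
    return pooled @ classifier                       # behavior recognition scores

if __name__ == "__main__":
    T, C_IN, C_OUT, CLASSES = 3, 8, 16, 5
    n = T * NUM_JOINTS
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(n, C_IN))
    A = [np.eye(n) / 3 for _ in range(3)]            # stand-in adjacencies
    W = [rng.normal(size=(C_IN, C_OUT)) for _ in range(3)]
    clf = rng.normal(size=(C_OUT, CLASSES))
    print(recognize(feats, A, W, clf).shape)         # -> (5,)

Each function maps to one claim element; in a trained network the weights would be learned and the convolution stacked over many layers, but the one-to-one structure above is enough to see how local key point features and their graph relations combine into a single behavior score.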
CN201711407861.1A 2017-12-22 2017-12-22 Behavior recognition method and apparatus, electronic device, computer storage medium Active CN108229355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711407861.1A CN108229355B (en) 2017-12-22 2017-12-22 Behavior recognition method and apparatus, electronic device, computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711407861.1A CN108229355B (en) 2017-12-22 2017-12-22 Behavior recognition method and apparatus, electronic device, computer storage medium

Publications (2)

Publication Number Publication Date
CN108229355A (en) 2018-06-29
CN108229355B (en) 2021-03-23

Family

ID=62648563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711407861.1A Active CN108229355B (en) 2017-12-22 2017-12-22 Behavior recognition method and apparatus, electronic device, computer storage medium

Country Status (1)

Country Link
CN (1) CN108229355B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344705B (en) * 2018-08-27 2023-05-23 广州烽火众智数字技术有限公司 Pedestrian behavior detection method and system
CN109359568A (en) * 2018-09-30 2019-02-19 南京理工大学 A kind of human body critical point detection method based on figure convolutional network
CN109389089B (en) * 2018-10-14 2022-03-08 深圳市能信安科技股份有限公司 Artificial intelligence algorithm-based multi-person behavior identification method and device
CN109547410A (en) * 2018-10-22 2019-03-29 武汉极意网络科技有限公司 Request recognition methods, device, server and storage medium based on GCN
CN109460492A (en) * 2018-10-22 2019-03-12 武汉极意网络科技有限公司 Method for building up, device, server and the storage medium of GCN model
CN111104816B (en) * 2018-10-25 2023-11-03 杭州海康威视数字技术股份有限公司 Object gesture recognition method and device and camera
CN111107278B (en) * 2018-10-26 2022-03-01 北京微播视界科技有限公司 Image processing method and device, electronic equipment and readable storage medium
CN111191486B (en) * 2018-11-14 2023-09-05 杭州海康威视数字技术股份有限公司 Drowning behavior recognition method, monitoring camera and monitoring system
CN111368594B (en) * 2018-12-26 2023-07-18 中国电信股份有限公司 Method and device for detecting key points
CN111401106B (en) * 2019-01-02 2023-03-31 中国移动通信有限公司研究院 Behavior identification method, device and equipment
CN109858390B (en) * 2019-01-10 2020-11-24 浙江大学 Human skeleton behavior identification method based on end-to-end space-time diagram learning neural network
CN109871764A (en) * 2019-01-16 2019-06-11 深兰科技(上海)有限公司 A kind of abnormal behaviour recognition methods, device and storage medium
CN109871775A (en) * 2019-01-22 2019-06-11 北京影谱科技股份有限公司 A kind of the ice rink monitoring method and device of Behavior-based control detection
CN110008847B (en) * 2019-03-13 2021-07-20 华南理工大学 Swimming stroke identification method based on convolutional neural network
CN110047520B (en) * 2019-03-19 2021-09-17 北京字节跳动网络技术有限公司 Audio playing control method and device, electronic equipment and computer readable storage medium
CN110135319B (en) * 2019-05-09 2022-09-16 广州大学 Abnormal behavior detection method and system
CN111985266A (en) * 2019-05-21 2020-11-24 顺丰科技有限公司 Scale map determination method, device, equipment and storage medium
CN110222611B (en) * 2019-05-27 2021-03-02 中国科学院自动化研究所 Human skeleton behavior identification method, system and device based on graph convolution network
CN112149466A (en) 2019-06-28 2020-12-29 富士通株式会社 Arm action recognition method and device and image processing equipment
CN110427834A (en) * 2019-07-10 2019-11-08 上海工程技术大学 A kind of Activity recognition system and method based on skeleton data
WO2021073311A1 (en) * 2019-10-15 2021-04-22 华为技术有限公司 Image recognition method and apparatus, computer-readable storage medium and chip
CN110852261B (en) * 2019-11-08 2022-06-17 北京环境特性研究所 Target detection method and device, electronic equipment and readable storage medium
CN111163425B (en) * 2020-01-02 2020-09-18 中国平安财产保险股份有限公司 LBS track-based identity recognition method, electronic device and readable storage medium
CN111291718B (en) * 2020-02-28 2022-06-03 上海商汤智能科技有限公司 Behavior prediction method and device, gait recognition method and device
CN111881754A (en) * 2020-06-28 2020-11-03 浙江大华技术股份有限公司 Behavior detection method, system, equipment and computer equipment
CN112418135A (en) * 2020-12-01 2021-02-26 深圳市优必选科技股份有限公司 Human behavior recognition method and device, computer equipment and readable storage medium
CN112820071B (en) * 2021-02-25 2023-05-05 泰康保险集团股份有限公司 Behavior recognition method and device
CN113128436B (en) * 2021-04-27 2022-04-01 北京百度网讯科技有限公司 Method and device for detecting key points
CN113627334A (en) * 2021-08-10 2021-11-09 浙江大华技术股份有限公司 Object behavior identification method and device
TWI804123B (en) * 2021-12-21 2023-06-01 友達光電股份有限公司 Image recognition method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203503A (en) * 2016-07-08 2016-12-07 天津大学 A kind of action identification method based on skeleton sequence
CN106203363A (en) * 2016-07-15 2016-12-07 中国科学院自动化研究所 Human skeleton motion sequence Activity recognition method
CN107219925A (en) * 2017-05-27 2017-09-29 成都通甲优博科技有限责任公司 Pose detection method, device and server
CN107451568A (en) * 2017-08-03 2017-12-08 重庆邮电大学 Use the attitude detecting method and equipment of depth convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chao Li et al., "End-to-End Learning of Deep Convolutional Neural Network for 3D Human Action Recognition," Proceedings of the IEEE International Conference on Multimedia and Expo Workshops, 2017-09-07, pp. 609-612. *
Yan Song et al., "Multi-Part Boosting LSTMs for Skeleton Based Human Activity Analysis," Proceedings of the IEEE International Conference on Multimedia and Expo Workshops, 2017-09-07, pp. 605-608. *

Also Published As

Publication number Publication date
CN108229355A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108229355B (en) Behavior recognition method and apparatus, electronic device, computer storage medium
US10936911B2 (en) Logo detection
CN108460338B (en) Human body posture estimation method and apparatus, electronic device, storage medium, and program
CN108229322B (en) Video-based face recognition method and device, electronic equipment and storage medium
CN108229478B (en) Image semantic segmentation and training method and device, electronic device, storage medium, and program
CN110431560B (en) Target person searching method, device, equipment and medium
US11003949B2 (en) Neural network-based action detection
CN108875522B (en) Face clustering method, device and system and storage medium
CN108427927B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
US11663502B2 (en) Information processing apparatus and rule generation method
CN108231190B (en) Method of processing image, neural network system, device, and medium
EP3493105A1 (en) Optimizations for dynamic object instance detection, segmentation, and structure mapping
GB2555136A (en) A method for analysing media content
CN113159283B (en) Model training method based on federal transfer learning and computing node
CN107679475B (en) Store monitoring and evaluating method and device and storage medium
CN108491872B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
CN111931764B (en) Target detection method, target detection frame and related equipment
CN108229494B (en) Network training method, processing method, device, storage medium and electronic equipment
CN110399882A (en) A kind of character detecting method based on deformable convolutional neural networks
US11151412B2 (en) Systems and methods for determining actions performed by objects within images
WO2022156317A1 (en) Video frame processing method and apparatus, electronic device, and storage medium
Riedel et al. Hand gesture recognition of methods-time measurement-1 motions in manual assembly tasks using graph convolutional networks
CN108875501B (en) Human body attribute identification method, device, system and storage medium
CN113627421A (en) Image processing method, model training method and related equipment
CN108596068B (en) Method and device for recognizing actions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant