CN110472613B - Object behavior identification method and device - Google Patents


Info

Publication number
CN110472613B
CN110472613B (application CN201910777053.7A)
Authority
CN
China
Prior art keywords
frame
image
frame image
skeleton
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910777053.7A
Other languages
Chinese (zh)
Other versions
CN110472613A (en)
Inventor
张玉
高雪松
陈维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Co Ltd
Original Assignee
Hisense Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Co Ltd filed Critical Hisense Co Ltd
Priority to CN201910777053.7A priority Critical patent/CN110472613B/en
Publication of CN110472613A publication Critical patent/CN110472613A/en
Application granted granted Critical
Publication of CN110472613B publication Critical patent/CN110472613B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an object behavior identification method and device. The method comprises the following steps: acquiring the skeleton point information of each object in each frame of image in a target video; performing skeleton point tracking on each object in each frame of image according to the skeleton point information of each object in each frame of image, and determining the index number of each object in each frame of image, wherein the index number uniquely identifies the corresponding object, and objects with the same index number are the same object; and inputting the skeleton point information and the index number of each object in each frame of image into a target convolutional neural network for behavior identification, and determining the behavior information of the object corresponding to each index number in the target video. In this way, skeleton point tracking of each object is realized based on the position relationship and/or posture relationship between objects in consecutive frame images, and the accuracy and speed of identifying object behavior are improved.

Description

Object behavior identification method and device
Technical Field
The invention relates to the technical field of computer vision, in particular to a method and a device for identifying object behaviors.
Background
The object behavior identification method can identify the behavior information of an object at a specific time, place, or occasion, and is widely applied in fields such as intelligent video monitoring, patient monitoring, human-computer interaction, virtual reality, smart home, intelligent security, athlete training assistance, video retrieval, and intelligent image compression. The behavior information represents the activity state or actions of the object, such as walking, running, standing still, going up stairs, going down stairs, sleeping, or fighting.
In a single-object scene, the traditional object behavior identification method can identify the behavior information of a single object in a video or image sequence to be identified by extracting the behavior features of the object. In a multi-object scene, however, factors such as the uncertain number of objects, the mutual influence of object behaviors, and complex and variable backgrounds cause the traditional method to extract too many behavior features, so the behavior information of the objects cannot be accurately identified and the identification accuracy is low.
Disclosure of Invention
The invention provides an object behavior identification method and device, aiming to solve the problem that the behavior features extracted by the traditional object behavior identification method in a multi-object scene cannot accurately identify the behavior information of the objects.
In a first aspect, the present invention provides an object behavior identification method, including:
obtaining the bone point information of each object in each frame of image in the target video;
performing skeleton point tracking on each object in each frame image according to the skeleton point information of each object in each frame image, and determining the index number of each object in each frame image, wherein the index number is used for uniquely identifying the corresponding object, and the objects with the same index number are the same object;
and inputting the skeleton point information and the index number of each object in each frame image into a target convolutional neural network for behavior identification, and determining the behavior information of the object corresponding to each index number in the target video.
In a second aspect, the present invention provides an object behavior recognition apparatus, including:
the acquisition module is used for acquiring the bone point information of each object in each frame of image in the target video;
the determining module is used for tracking the skeleton points of each object in each frame of image according to the skeleton point information of each object in each frame of image, and determining the index number of each object in each frame of image, wherein the index number is used for uniquely identifying the corresponding object, and the objects with the same index number are the same object;
and the processing module is used for inputting the bone point information and the index number of each object in each frame of image into a target convolutional neural network for behavior identification, and determining the behavior information of the object corresponding to each index number in the target video.
According to the object behavior identification method and device provided by the invention, the skeleton point information of each object in each frame of image in the target video is obtained. Skeleton point tracking is performed on each object in each frame of image according to the skeleton point information of each object in each frame of image, and the index number of each object in each frame of image is determined, wherein the index number uniquely identifies the corresponding object, and objects with the same index number are the same object. The skeleton point information and the index number of each object in each frame of image are input into a target convolutional neural network for behavior identification, and the behavior information of the object corresponding to each index number in the target video is determined. Because human motion is continuous and the motion position and motion posture do not change greatly between consecutive frame images, tracking the skeleton points of each object in each frame of image through the position relationship and/or posture relationship of the objects in consecutive frame images allows each object in the target video to be accurately distinguished. This solves the problem of lost or incorrect tracking in a multi-object scene caused by self-occlusion of a target, mutual occlusion between targets, or occlusion of a target by the background, and improves the accuracy and speed of object classification. Moreover, because each object and its skeleton point information in each frame of image are determined without recording data such as faces, clothing, or background and without transmitting image or video data, the behavior information of the objects can be determined while protecting the user's privacy, without interfering with object identification, and without the risk of data leakage during transmission and post-processing; the limit on the number of objects in the scene is also removed, improving the accuracy of object behavior identification. In addition, the invention does not require the user to carry any hardware device, eliminating the inconvenience of wearable equipment.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the following briefly introduces the drawings needed to be used in the description of the embodiments or the prior art, and obviously, the drawings in the following description are some embodiments of the present invention, and those skilled in the art can obtain other drawings according to the drawings without inventive labor.
FIG. 1 is a flow chart of a method for identifying object behavior provided by the present invention;
FIG. 2 is a schematic view of an overall bone point;
fig. 3 is a schematic diagram of the skeleton point information of two objects in 8 consecutive frames of images in the object behavior identification method provided by the present invention;
fig. 4 is a schematic diagram of an index number of each object in 8 consecutive frames of images in the object behavior identification method provided by the present invention;
FIG. 5 is a flow chart of a method for identifying object behavior provided by the present invention;
FIG. 6 is a flow chart of a method for identifying object behavior provided by the present invention;
fig. 7 is a schematic diagram of the skeleton point information of an object in the nth frame image and the skeleton point information of an object in the (n+1)th frame image in the object behavior identification method according to the present invention;
fig. 8 is a schematic diagram of the skeleton point bounding box corresponding to each skeleton point of an object in the nth frame image and the skeleton point bounding box corresponding to each skeleton point of an object in the (n+1)th frame image in the object behavior identification method provided by the present invention;
FIG. 9 is a flowchart of an object behavior recognition method provided by the present invention;
fig. 10 is a schematic diagram of the skeleton bounding box of an object in the nth frame image and the skeleton bounding box of an object in the (n+1)th frame image in the object behavior recognition method according to the present invention;
FIG. 11 is a flow chart of a method for identifying object behavior provided by the present invention;
fig. 12 is a schematic structural diagram of an object behavior recognition apparatus provided in the present invention;
fig. 13 is a schematic diagram of a hardware structure of the electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment provides an object behavior identification method, an object behavior identification device, and a storage medium, which are applicable to both single-object and multi-object behavior identification scenes. By tracking the skeleton points of each object in a video, the method can accurately determine whether the same object appears in consecutive frame images and accurately determine the behavior features of each object, and it can also quickly identify the behavior information of each object, meeting the different requirements of industries such as smart homes, smart factories, and smart hospitals. The object behavior identification method provided in this embodiment may be executed by a server or a terminal device, which is not limited in this embodiment.
Next, a specific implementation procedure of the object behavior recognition method according to this embodiment is described in detail through a specific embodiment with a server as an execution subject.
Fig. 1 is a flowchart of an object behavior identification method provided in the present invention, and as shown in fig. 1, the object behavior identification method of this embodiment may include:
s101, obtaining skeleton point information of each object in each frame of image in the target video.
In this embodiment, the server may obtain the target video through a shooting device on the server, or may receive the target video sent by other devices. The format, size, and number of target videos are not limited in this embodiment.
Further, the server may obtain, from the target video, skeleton point information of each object in each frame of image, which is used to construct the position and posture of the corresponding object, by using a conventional technique or a newly added technique. The specific content of the skeleton point information is not limited in this embodiment. Optionally, the bone point information comprises: position information of all the bone points and posture information of all the bone points.
Fig. 2 shows a structural diagram of human skeleton points. The skeleton points of this embodiment may be the 18 human skeleton points shown in fig. 2, namely a left-eye joint point 14, a right-eye joint point 15, a nose joint point 0, a left-ear joint point 16, a right-ear joint point 17, a neck joint point 1, a left-shoulder joint point 2, a right-shoulder joint point 5, a left-elbow joint point 3, a right-elbow joint point 6, a left-hand joint point 4, a right-hand joint point 7, a left-hip joint point 8, a right-hip joint point 11, a left-knee joint point 9, a right-knee joint point 12, a left-foot joint point 10, and a right-foot joint point 13.
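For reference, the 18 skeleton points listed above can be collected into a simple lookup table. The sketch below is purely illustrative; the index-to-name mapping follows the numbering described in this paragraph for Fig. 2.

```python
# Illustrative mapping of the 18 skeleton point indices described above (per Fig. 2).
SKELETON_POINTS = {
    0: "nose", 1: "neck",
    2: "left_shoulder", 3: "left_elbow", 4: "left_hand",
    5: "right_shoulder", 6: "right_elbow", 7: "right_hand",
    8: "left_hip", 9: "left_knee", 10: "left_foot",
    11: "right_hip", 12: "right_knee", 13: "right_foot",
    14: "left_eye", 15: "right_eye", 16: "left_ear", 17: "right_ear",
}
```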
In fig. 2, the presence of a connecting line between any two bone points indicates that the two bone points are connected to each other. If no connecting line exists between any two bone points, the two bone points are not in a connecting relation.
The bone points in the present embodiment are not limited to the joint points described above. The information of the bone points acquired by the server may be all the joint points or some of the joint points, which is not limited in this embodiment.
Next, a specific implementation process of the server obtaining the bone point information of each object in each frame of image in the target video is described with a feasible implementation manner.
Optionally, the server may obtain position information of each bone point in each frame of image in the target video and pose information of each object in each frame of image in the target video, respectively, based on a vision-based pose estimation algorithm, to determine the bone point information of each object.
For example, for any frame of image in the target video, the server extracts all the features of the frame through a VGG-19 network model based on the OpenPose algorithm, and these features are fed respectively into a skeleton point position confidence network and a part affinity field network that analyzes the connection relations between joint points.
In the skeleton point position confidence network, multiple candidate positions of each skeleton point are determined by computing local confidence maps, and the position information of each skeleton point is then obtained by applying non-maximum suppression to these candidate positions.
The position information may be represented by information of horizontal and vertical coordinates of the frame image where the bone point is located, or may be in other representation forms, which is not limited in this embodiment. For convenience of processing, the position information of each bone point in each frame of image in the target video usually adopts the same coordinate system.
In the part affinity field network that analyzes the connection relations between joint points, bipartite matching is performed on the human key points using the part affinity fields, the joint points belonging to the same object are connected, human pose estimation is achieved, and the posture information of each object is obtained.
The posture information can be represented as the connection relation between any two skeleton points in the same frame image, and is used for determining the pose of the object in space. Taking the left-eye joint point 14, the right-eye joint point 15, and the nose joint point 0 in any frame image as an example, as shown in fig. 2, the posture information of the left-eye joint point 14 in that frame may indicate that the left-eye joint point 14 and the right-eye joint point 15 have no connection relation, while the left-eye joint point 14 and the nose joint point 0 have a connection relation.
In addition, the server can also connect the same type of bone points in different frame images based on the position information of each bone point so as to enrich the posture information of each object. Wherein, the posture information can also be expressed as the connection relation between all the bone points in the continuous frame images, and is used for determining the posture of the bone points in time.
Taking the left-eye joint 14 in the 2 nd frame image and the bone points in the 1 st frame image as examples, the posture information of the left-eye joint 14 in the 2 nd frame image may also indicate that the left-eye joint 14 in the 2 nd frame image has a connection relationship with the left-eye joint 14 in the 1 st frame image, and the left-eye joint 14 in the 2 nd frame image has no connection relationship with the bone points other than the left-eye joint 14 in the 1 st frame image.
S102, carrying out skeleton point tracking on each object in each frame of image according to the skeleton point information of each object in each frame of image, and determining the index number of each object in each frame of image, wherein the index number is used for uniquely identifying the corresponding object, and the objects with the same index number are the same object.
In the traditional object behavior identification method, after the skeleton point information of each object in each frame of image in the target video is acquired, the server directly inputs the skeleton point information of each object in each frame of image of the target video into the convolutional neural network together. This can cause the objects in the same frame of image to be counted repeatedly and the number of objects to be mistakenly inflated, reducing the accuracy of the identified behavior.
Considering that the behavior of each object is unique and that human motion is continuous, the position information and posture information of each skeleton point do not change greatly between consecutive frame images. Therefore, the server in this embodiment can track the position and/or posture of each object in each frame of image according to the skeleton point information of each object in each frame of image, determine whether the same object appears in consecutive frame images, and thereby obtain the index number of each object in each frame of image. This accurately identifies each object in the target video and helps improve the accuracy of applications such as behavior identification and object tracking in multi-object scenes.
S103, inputting the skeleton point information and the index number of each object in each frame of image into a target convolutional neural network for behavior identification, and determining the behavior information of the object corresponding to each index number in the target video.
In this embodiment, the server may input the skeleton point information and the index number of each object in each frame of image into the target convolutional neural network. Because the skeleton point information includes not only the position information of all skeleton points (representing spatial information) but also the posture information of all skeleton points (the connection relations between any two skeleton points in the same frame image and the connection relations between the same skeleton points in consecutive frame images, representing temporal information), the target convolutional neural network can integrate the input information along both the spatial and temporal dimensions, classify the behavior features of each object, and identify the behavior information of the object corresponding to each index number in the target video.
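As a rough illustration of how the per-frame skeleton point information and index numbers might be packed into network input, the sketch below groups each object's skeleton sequence by index number into a (frames x joints x 2) coordinate array. The array layout is an assumption for illustration only; the actual input format expected by the target convolutional neural network is not specified here.

```python
import numpy as np

def build_sequences(frames, num_joints=18):
    """frames: list of dicts {index_number: {joint_id: (x, y)}}, one dict per frame.
    Returns one (T, V, 2) array per index number, a plausible (assumed) layout for
    feeding a space-time behavior-recognition network."""
    ids = sorted({obj_id for frame in frames for obj_id in frame})
    sequences = {}
    for obj_id in ids:
        seq = np.zeros((len(frames), num_joints, 2), dtype=np.float32)
        for t, frame in enumerate(frames):
            for joint_id, (x, y) in frame.get(obj_id, {}).items():
                seq[t, joint_id] = (x, y)   # missing joints stay at zero
        sequences[obj_id] = seq
    return sequences
```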
The behavior information may refer to the foregoing content and is not described again here. The target convolutional neural network may be an existing convolutional neural network, or may be a space-time graph convolutional neural network for identifying the behavior information of an object, which is not limited in this embodiment.
In a specific embodiment, user A can monitor the activities of an elderly family member at home through a camera installed in the house, and the captured video can be stored on a cloud server. The cloud server can monitor the elderly person's activities in real time by using the object behavior identification method of this embodiment. When user A sends an instruction to the cloud server through a mobile phone, the cloud server can inform user A of the elderly person's current situation through the mobile phone, for example, that the elderly person fell at about 9 o'clock, or that the elderly person got into a fight with someone at about 10 o'clock.
For convenience of description, with reference to fig. 3 and fig. 4, the following describes the process by which the cloud server uses the object behavior identification method of this embodiment to identify that the elderly person got into a fight with another person (taking the two objects in fig. 3 and fig. 4 as an example) at about 10 o'clock.
The cloud server extracts from the video 8 consecutive frames of images containing the two objects involved in the fight, and obtains the skeleton point information of the two objects in these 8 consecutive frames as shown in fig. 3. The cloud server then performs skeleton point tracking on the two objects in the 8 consecutive frames according to the skeleton point information shown in fig. 3, and determines the index number of each object in the 8 consecutive frames as shown in fig. 4. In fig. 4, "1" and "2" respectively represent the index numbers of the two objects, and the 8 diagrams in fig. 4 correspond one to one to the 8 sub-diagrams in fig. 3. Finally, the cloud server inputs the skeleton point information of the two objects in the 8 consecutive frames and the index number of each object into a space-time graph convolutional neural network for behavior identification, and can thereby determine that a fight occurred between the two objects.
It should be noted that, when the server identifies the behavior information of the object corresponding to each index number in the target video, the server may also determine the behavior information of the specific object by using a face recognition algorithm. The face recognition algorithm may be integrated in the target convolutional neural network, or may be independently set as a convolutional neural network, which is not limited in this embodiment.
According to the object behavior identification method provided by this embodiment, the skeleton point information of each object in each frame of image in the target video is obtained. Skeleton point tracking is performed on each object in each frame of image according to the skeleton point information of each object in each frame of image, and the index number of each object in each frame of image is determined, wherein the index number uniquely identifies the corresponding object, and objects with the same index number are the same object. The skeleton point information and the index number of each object in each frame of image are input into a target convolutional neural network for behavior identification, and the behavior information of the object corresponding to each index number in the target video is determined. Because human motion is continuous and the motion position and motion posture do not change greatly between consecutive frame images, tracking the skeleton points of each object in each frame of image through the position relationship and/or posture relationship of the objects in consecutive frame images allows each object in the target video to be accurately distinguished. This solves the problem of lost or incorrect tracking in a multi-object scene caused by self-occlusion of a target, mutual occlusion between targets, or occlusion of a target by the background, and improves the accuracy and speed of object classification. Moreover, because each object and its skeleton point information in each frame of image are determined without recording data such as faces, clothing, or background and without transmitting image or video data, the behavior information of the objects can be determined while protecting the user's privacy, without interfering with object identification, and without the risk of data leakage during transmission and post-processing; the limit on the number of objects in the scene is also removed, improving the accuracy of object behavior identification. In addition, this embodiment does not require the user to carry any hardware device, eliminating the inconvenience of wearable equipment.
Based on the embodiment of fig. 1 and with reference to fig. 5, the following describes a possible implementation of S102, in which the server performs skeleton point tracking on each object in each frame of image according to the skeleton point information of each object in each frame of image and determines the index number of each object in each frame of image.
Fig. 5 is a flowchart of the object behavior identification method provided in the present invention, and as shown in fig. 5, the object behavior identification method of this embodiment may include:
s201, obtaining skeleton point information of each object in each frame of image in the target video.
S201 is similar to the implementation manner of S101 in the embodiment of fig. 1, and this embodiment is not described herein again.
S2021, respectively determining, according to all the skeleton point information of each object in the nth frame image and all the skeleton point information of each object in the (n+1)th frame image, the matching coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image, wherein n is a positive integer less than N, N is the total number of frames of the target video, and the matching coincidence proportion represents the degree of matching between each object in the (n+1)th frame image and all objects in the nth frame image.
S2022, determining the index number of each object in the (n+1)th frame image according to the index numbers of all objects in the nth frame image and the matching coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image.
Based on the characteristic that human body movement has continuity, the positions and postures of the objects in the continuous frame images do not change greatly, so that the server in the embodiment can accurately and quickly identify whether the same object exists in the continuous frame images by matching the objects in the continuous frame images in the target video, and further determine the index number of each object in each frame image.
In this embodiment, the server may start from the 1st frame image and assign an index number to each object in the 1st frame image, and then match each object in the 2nd frame image with all objects in the 1st frame image according to all the skeleton point information of each object in the 1st frame image and all the skeleton point information of each object in the 2nd frame image, so as to respectively determine the matching coincidence proportion of each object in the 2nd frame image with all objects in the 1st frame image.
The index number of an object may take various forms such as numbers, letters, and the like, which is not limited in this embodiment. In addition, the matching coincidence proportion may be expressed in various forms such as a fraction, a decimal, or a mark, which is not limited in this embodiment.
Since the matching coincidence proportion can indicate the matching degree between each object in the 2 nd frame image and all objects in the 1 st frame image, that is, the matching coincidence proportion can indicate the position relationship and/or posture relationship between each object in the 2 nd frame image and all objects in the 1 st frame image, the server can determine whether the same object exists in the 2 nd frame image and the 1 st frame image according to the index numbers of all objects in the 1 st frame image and the matching coincidence proportion between each object in the 2 nd frame image and all objects in the 1 st frame image, thereby determining the index number of each object in the 2 nd frame image.
In this embodiment, after determining the index numbers of all objects in the 2nd frame image, the server may match each object in the 3rd frame image with all objects in the 2nd frame image according to all the skeleton point information of each object in the 2nd frame image and all the skeleton point information of each object in the 3rd frame image, so as to respectively determine the matching coincidence proportion of each object in the 3rd frame image with all objects in the 2nd frame image.
Further, the server may determine whether the same object exists in the 3rd frame image and the 2nd frame image according to the index numbers of all objects in the 2nd frame image and the matching coincidence proportion of each object in the 3rd frame image with all objects in the 2nd frame image, so as to determine the index number of each object in the 3rd frame image.
In this embodiment, the server may continue to determine the index number of each object in the 4 th frame of image until the nth frame of image (i.e., the last frame of image) in the target video is traversed, so as to determine the index number of each object in each frame of image in the target video.
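The frame-by-frame traversal just described can be sketched as a loop that carries index numbers forward. In the sketch below, match_fn stands in for the matching coincidence proportion computed in S2021, and the greedy assignment plus the threshold value are assumptions for illustration; the actual matching and threshold rules are detailed in the later steps (S2021-S2022 and S501-S502).

```python
def assign_index_numbers(frames, match_fn, threshold=0.5):
    """frames: list of per-frame lists of object skeleton data.
    match_fn(obj, prev_obj) -> matching coincidence proportion in [0, 1].
    Greedy assignment and the threshold value are illustrative assumptions."""
    next_id = 1
    tracked = []                  # one {index_number: object_data} dict per frame
    prev = {}
    for frame in frames:
        current = {}
        for obj in frame:
            scores = {pid: match_fn(obj, pobj) for pid, pobj in prev.items()}
            best_id = max(scores, key=scores.get) if scores else None
            if best_id is not None and scores[best_id] >= threshold and best_id not in current:
                current[best_id] = obj        # same object as in the previous frame
            else:
                current[next_id] = obj        # new object: assign a fresh index number
                next_id += 1
        tracked.append(current)
        prev = current
    return tracked
```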
On the one hand, since the matching coincidence proportion may represent the position relationship between each object in the (n+1)th frame image and all objects in the nth frame image, the posture relationship between them, or both, the server may implement the process in S2021 of determining the matching coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image, from all the skeleton point information of each object in the nth frame image and all the skeleton point information of each object in the (n+1)th frame image, based on the position relationship and/or posture relationship of the objects in consecutive frame images.
In the following, two possible implementations of the process by which the server determines the matching coincidence proportion are described with reference to fig. 6 to fig. 10.
In a possible implementation manner, the server may determine the pose relationship between each object in the consecutive frame images based on the degree of overlap between the bone points of the same type of each object in the consecutive frame images to obtain the matching coincidence proportion between each object in the subsequent frame image and all objects in the previous frame image, so as to determine whether the same object appears in the consecutive frame images.
Next, a specific process of determining the matching overlap ratio by the server will be described with reference to fig. 6.
Fig. 6 is a flowchart of the object behavior identification method provided in the present invention, and as shown in fig. 6, the object behavior identification method of this embodiment may include:
s301, determining a skeleton point surrounding frame corresponding to each skeleton point of each object in the nth frame image according to all skeleton point information of each object in the nth frame image.
In this embodiment, the left diagram of fig. 7 shows the skeleton point information of an object in the nth frame image, and the right diagram of fig. 7 shows the skeleton point information of an object in the (n+1)th frame image (the skeleton points of the objects in fig. 7 are illustrated as including all the joint points shown in fig. 2, with each skeleton point represented by a black dot). The server may determine the skeleton point bounding box corresponding to each skeleton point based on the position information of all skeleton points in the nth frame image shown in the left diagram of fig. 7, or it may determine the skeleton point bounding boxes of each object separately based on the position information of all skeleton points of that object, processing the objects in the nth frame image one after another, as shown in the left diagram of fig. 8. Here, for convenience of explanation, a rectangular box in fig. 8 represents the skeleton point bounding box of one skeleton point.
In this embodiment, the size and shape of the skeleton point bounding box may be set according to actual conditions. For example, the skeleton point bounding box of any skeleton point may be a rectangular box 60 pixels in width and height, whose upper-left corner is obtained by subtracting 30 pixels from each of the horizontal and vertical coordinates of the skeleton point. The skeleton point bounding box may be represented by several corner coordinates, by one coordinate plus a size, and the like, which is not limited in this embodiment.
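A minimal sketch of the 60-pixel skeleton point bounding box described above; the (left, top, width, height) representation is just one of the equivalent forms mentioned.

```python
def skeleton_point_box(x, y, size=60):
    """Build the skeleton point bounding box described above: a square of `size` pixels
    whose upper-left corner is the skeleton point's coordinates shifted by size / 2."""
    half = size // 2
    return (x - half, y - half, size, size)   # (left, top, width, height)
```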
S302, determining a skeleton point surrounding frame corresponding to each skeleton point of each object in the (n + 1) th frame image according to all skeleton point information of each object in the (n + 1) th frame image.
In this embodiment, the server may determine, according to all the skeleton point information of each object in the (n+1)th frame image shown in the right diagram of fig. 7, the skeleton point bounding box corresponding to each skeleton point of each object in the (n+1)th frame image, as shown in the right diagram of fig. 8; the specific implementation process may refer to the description of S301 and is not repeated here.
S303, determining the coincidence proportion of each skeleton point of each object in the (n+1)th frame image with the corresponding skeleton points of all objects in the nth frame image, according to the skeleton point bounding box corresponding to each skeleton point of each object in the nth frame image and the skeleton point bounding box corresponding to each skeleton point of each object in the (n+1)th frame image.
In this embodiment, when obtaining the skeleton point bounding boxes corresponding to each skeleton point of each object in the nth frame image and the skeleton point bounding boxes corresponding to each skeleton point of each object in the n +1 th frame image, the server may perform overlap comparison on the skeleton point bounding boxes corresponding to the skeleton points of each object in the consecutive frame images one by one, so as to determine the proportion of overlapping of all skeleton points of each object in the next frame image with skeleton points corresponding to all objects in the previous frame image.
Further, starting from the first object in the (n+1)th frame image, the server may compare the skeleton point bounding box corresponding to each skeleton point of that object with the skeleton point bounding boxes of the corresponding skeleton points of all objects in the nth frame image, to obtain the coincidence proportion of each skeleton point of that object with the corresponding skeleton points of all objects in the nth frame image, and continue until the last object in the (n+1)th frame image, so as to determine the coincidence proportion of each skeleton point of each object in the (n+1)th frame image with the corresponding skeleton points of all objects in the nth frame image.
The coincidence proportion of the skeleton points may adopt various expression forms such as fraction, decimal, mark, etc., which is not limited in this embodiment.
In addition, the server may adopt various implementations of the process of determining the coincidence proportion of each skeleton point of each object in the (n+1)th frame image with the corresponding skeleton points of all objects in the nth frame image, according to the skeleton point bounding box corresponding to each skeleton point of each object in the nth frame image and the skeleton point bounding box corresponding to each skeleton point of each object in the (n+1)th frame image.
Alternatively, for any skeleton point (referred to as a target skeleton point) of any object (referred to as a target object) in the (n+1)th frame image, the server may calculate, based on the skeleton point bounding box of the target skeleton point of the target object in the (n+1)th frame image and the skeleton point bounding boxes of the skeleton points of the same type as the target skeleton point of all objects in the nth frame image, the ratio of the intersection to the union between these bounding boxes.
For example, the (n+1)th frame image includes an object 1 and an object 2, the object 1 includes a skeleton point 1 and a skeleton point 2, and the object 2 includes a skeleton point 3. The nth frame image includes an object 3 and an object 4, the object 3 includes a skeleton point 2, and the object 4 includes a skeleton point 1 and a skeleton point 3.
For the skeleton point 1 of the object 1 in the (n+1)th frame image, the skeleton point 1 does not exist in the object 3 in the nth frame image, but exists in the object 4. Thus, the server can calculate the ratio of the intersection to the union between the skeleton point bounding box of the skeleton point 1 of the object 1 in the (n+1)th frame image and the skeleton point bounding box of the skeleton point 1 of the object 4 in the nth frame image.
For the skeleton point 2 of the object 1 in the (n+1)th frame image, the skeleton point 2 exists in the object 3 in the nth frame image, but does not exist in the object 4. Thus, the server can calculate the ratio of the intersection to the union between the skeleton point bounding box of the skeleton point 2 of the object 1 in the (n+1)th frame image and the skeleton point bounding box of the skeleton point 2 of the object 3 in the nth frame image.
For the skeleton point 3 of the object 2 in the (n+1)th frame image, the skeleton point 3 does not exist in the object 3 in the nth frame image, but exists in the object 4. Thus, the server can calculate the ratio of the intersection to the union between the skeleton point bounding box of the skeleton point 3 of the object 2 in the (n+1)th frame image and the skeleton point bounding box of the skeleton point 3 of the object 4 in the nth frame image.
The ratio of the intersection to the union may be a ratio between an area intersection of two bounding boxes and an area union of the two bounding boxes, or a ratio between an area intersection of two bounding boxes and an area of one frame of image, which is not limited in this embodiment. In addition, the ratio of the intersection set and the union set may adopt various expression forms such as fraction, decimal, mark, and the like, which is not limited in this embodiment.
For example, the ratio of the intersection to the union of a bounding box A and a bounding box B may be expressed as the intersection over union (IOU), calculated as IOU = (A ∩ B) / (A ∪ B), which measures the degree of coincidence between bounding box A and bounding box B. The IOU is 1 in the ideal case.
Since the ratio of the intersection to the union may indicate the coincidence degree of the two skeleton point bounding boxes, the server may determine the ratio of the intersection and the union between the skeleton point bounding box corresponding to each skeleton point of each object in the image of the (n + 1) th frame and the skeleton point bounding boxes corresponding to the skeleton points of all objects in the image of the (n) th frame as the coincidence proportion of each skeleton point of each object in the image of the (n + 1) th frame and the skeleton point corresponding to all objects in the image of the (n) th frame.
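A minimal sketch of the intersection-over-union computation just described, used here as the skeleton point coincidence proportion between two skeleton point bounding boxes; boxes use the (left, top, width, height) layout from the earlier sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (left, top, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```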
S304, determining the matching coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image according to the coincidence proportion of each skeleton point of each object in the (n+1)th frame image with the corresponding skeleton points of all objects in the nth frame image.
Since the skeleton point coincidence proportion represents the degree of overlap of each skeleton point of each object in the (n+1)th frame image with the corresponding skeleton points of all objects in the nth frame image, it can represent the posture relationship between each object in the (n+1)th frame image and all objects in the nth frame image. In addition, since the matching coincidence proportion represents the degree of matching between each object in the (n+1)th frame image and all objects in the nth frame image, in this embodiment the server may, starting from the first object in the (n+1)th frame image, take the mean, maximum, minimum, or similar statistic of the coincidence proportions of all skeleton points of that object with the corresponding skeleton points of all objects in the nth frame image as the matching coincidence proportion of that object with all objects in the nth frame image, and continue until the last object in the (n+1)th frame image, so as to determine the matching coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image.
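Following S304, one hedged way to turn the per-skeleton-point coincidence proportions into a single matching coincidence proportion is to average the per-joint IOUs over the joints both objects share; using the mean (rather than the maximum or minimum, also mentioned above) is an illustrative choice. The sketch reuses skeleton_point_box() and iou() from the earlier sketches.

```python
def skeleton_point_match(obj_next, obj_prev):
    """obj_next, obj_prev: dicts {joint_id: (x, y)} for one object in frame n+1 and
    one object in frame n. Averages the IOU of the skeleton point bounding boxes
    over the joints both objects share (mean is an illustrative choice)."""
    shared = set(obj_next) & set(obj_prev)
    if not shared:
        return 0.0
    scores = [iou(skeleton_point_box(*obj_next[j]), skeleton_point_box(*obj_prev[j]))
              for j in shared]
    return sum(scores) / len(scores)
```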
In this embodiment, according to the characteristic that human motion has continuity, and the motion position and the motion posture in the continuous frame image do not change greatly, the matching coincidence proportion of each object in the next frame image and all objects in the previous frame image can be obtained through the posture relation of the objects in the continuous frame image, so as to quickly and accurately judge whether the same object exists in the continuous frame image.
In another possible implementation manner, the server may determine the pose relationship between each object in the consecutive frame images based on the overlapping degree between the bone points of the same type of each object in the consecutive frame images, and may determine the position relationship between each object in the consecutive frame images based on the overall overlapping degree between each object in the consecutive frame images, so as to obtain the matching overlapping proportion between each object in the next frame image and all objects in the previous frame image, thereby determining whether the same object appears in the consecutive frame images.
Next, a specific process of determining the matching overlap ratio by the server will be described with reference to fig. 9.
Fig. 9 is a flowchart of the object behavior identification method provided in the present invention, and as shown in fig. 9, the object behavior identification method of this embodiment may include:
s401, according to all the skeleton point information of each object in the nth frame image, determining a skeleton point surrounding frame corresponding to each skeleton point of each object in the nth frame image and a skeleton surrounding frame of each object in the nth frame image.
In this embodiment, based on the position information of each object in the nth frame image shown in the left diagram in fig. 7, the server may determine, starting from the first object in the nth frame image, a range surrounding the first object as a skeleton surrounding frame of the first object in the nth frame image until the last object in the nth frame image, so as to determine the skeleton surrounding frame of each object in the nth frame image, as shown in the left diagram in fig. 10. Here, for convenience of explanation, one rectangular box in fig. 10 represents a skeleton-surrounding box of one object.
In this embodiment, the size and shape of the skeleton bounding box of an object may be set according to actual conditions. For example, the skeleton bounding box of any object may be determined as the smallest rectangular frame enclosing that object. In addition, the skeleton bounding box of an object may be represented by several corner coordinates, by one coordinate plus a size, and the like, which is not limited in this embodiment.
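A minimal sketch of the object-level skeleton bounding box described above, taken as the smallest rectangle enclosing all of one object's skeleton points.

```python
def skeleton_box(joints):
    """joints: dict {joint_id: (x, y)} for one object. Returns the minimal enclosing
    rectangle (left, top, width, height) around all of the object's skeleton points."""
    xs = [x for x, _ in joints.values()]
    ys = [y for _, y in joints.values()]
    left, top = min(xs), min(ys)
    return (left, top, max(xs) - left, max(ys) - top)
```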
The specific implementation process of determining the bone point enclosure frame corresponding to each bone point of each object in the nth frame image by the server is similar to the implementation manner of S301 in the embodiment of fig. 6, and details of this embodiment are not repeated here.
S402, according to all the skeleton point information of each object in the (n + 1) th frame image, determining a skeleton point surrounding frame corresponding to each skeleton point of each object in the (n + 1) th frame image and a skeleton surrounding frame of each object in the (n + 1) th frame image.
The specific implementation process of determining the skeleton point bounding box corresponding to each skeleton point of each object in the (n+1)th frame image is similar to S302 in the embodiment of fig. 6 and is not repeated here. Moreover, the server may determine the skeleton bounding box of each object in the (n+1)th frame image based on all the skeleton point information of each object in the (n+1)th frame image shown in the right diagram of fig. 7, as shown in the right diagram of fig. 10; the specific implementation process may refer to the description in S401 of determining the skeleton bounding box of each object in the nth frame image, which is not repeated here.
S403, determining the coincidence proportion of each skeleton point of each object in the (n+1)th frame image with the corresponding skeleton points of all objects in the nth frame image, according to the skeleton point bounding box corresponding to each skeleton point of each object in the nth frame image and the skeleton point bounding box corresponding to each skeleton point of each object in the (n+1)th frame image.
S403 is similar to the implementation manner of S303 in the embodiment of fig. 6, and details of this embodiment are not repeated here.
S404, determining the overall coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image according to the skeleton bounding box of each object in the nth frame image and the skeleton bounding box of each object in the (n+1)th frame image.
In this embodiment, when the server obtains the skeleton bounding box of each object in the nth frame image and the skeleton bounding box of each object in the (n + 1) th frame image, the server may perform overlap comparison on the skeleton bounding boxes of each object in the consecutive frame images one by one, so as to determine the overall overlapping proportion between each object in the subsequent frame image and all objects in the previous frame image.
Further, starting from the first object in the (n+1)th frame image, the server may compare the skeleton bounding box of that object with the skeleton bounding boxes of all objects in the nth frame image to obtain the overall coincidence proportion of that object with all objects in the nth frame image, and continue until the last object in the (n+1)th frame image, so as to determine the overall coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image.
The overall coincidence proportion may be expressed in various forms such as a fraction, a decimal, or a mark, which is not limited in this embodiment.
In addition, the server may adopt various implementations of the process of determining the overall coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image, according to the skeleton bounding box of each object in the nth frame image and the skeleton bounding box of each object in the (n+1)th frame image.
Alternatively, for any object (referred to as a target object) in the (n+1)th frame image, the server may calculate, based on the skeleton bounding box of the target object in the (n+1)th frame image and the skeleton bounding boxes of all objects in the nth frame image, the ratio of the intersection to the union between the skeleton bounding box of the target object in the (n+1)th frame image and the skeleton bounding box of each object in the nth frame image.
The ratio of the intersection set to the union set can refer to the description of S303 in fig. 6, and is not described herein again.
Since the ratio of the intersection to the union can indicate the degree of coincidence of two bounding boxes, the server may determine the ratio of the intersection to the union between the skeleton bounding box of each object in the (n+1)th frame image and the skeleton bounding boxes of all objects in the nth frame image as the overall coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image.
S405, determining the matching coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image according to the weight relationship between the skeleton point coincidence proportion and the overall coincidence proportion.
Since the bone point coincidence proportion can represent the overlapping degree of all bone points of each object in the image of the (n + 1) th frame and the bone points corresponding to all objects in the image of the (n) th frame, and the overall coincidence proportion can represent the overlapping degree of each object in the image of the (n + 1) th frame and all objects in the image of the (n) th frame, the bone point coincidence proportion can represent the posture relation between each object in the image of the (n + 1) th frame and all objects in the image of the (n) th frame, and the overall coincidence proportion can represent the position relation between each object in the image of the (n + 1) th frame and all objects in the image of the (n) th frame.
And since the matching coincidence proportion represents the degree of matching between each object in the (n+1)th frame image and all objects in the nth frame image, in this embodiment the server may, starting from the first object in the (n+1)th frame image, combine the skeleton point coincidence proportion of that object (for example, the mean, maximum, or minimum of the coincidence proportions of all its skeleton points with the corresponding skeleton points of all objects in the nth frame image) with the overall coincidence proportion of that object with all objects in the nth frame image according to their weight relationship, and continue until the last object in the (n+1)th frame image, so as to determine the matching coincidence proportion of each object in the (n+1)th frame image with all objects in the nth frame image.
Based on the weight relationship between the skeleton point coincidence proportion and the overall coincidence proportion, the matching coincidence proportion may be calculated by the formula MatchScore = w1 × BoxScore + w2 × PoseScore, where BoxScore represents the skeleton point coincidence proportion, PoseScore represents the overall coincidence proportion, MatchScore represents the matching coincidence proportion, w1 + w2 = 1, w1 is the weight of the skeleton point coincidence proportion, and w2 is the weight of the overall coincidence proportion.
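A minimal sketch of the weighted combination above; the BoxScore and PoseScore roles follow the definitions given in this paragraph, and the 0.5 / 0.5 weight values are assumptions for illustration.

```python
def match_score(box_score, pose_score, w1=0.5, w2=0.5):
    """MatchScore = w1 * BoxScore + w2 * PoseScore with w1 + w2 = 1.
    box_score: skeleton point coincidence proportion; pose_score: overall coincidence
    proportion (per the definitions in the text). The 0.5 / 0.5 weights are illustrative."""
    assert abs(w1 + w2 - 1.0) < 1e-6
    return w1 * box_score + w2 * pose_score
```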
It should be noted that, when a blank frame image (caused by lens occlusion or an abnormality accidentally occurring in the processing process) appears in the target video, the skeleton point information of each object in the previous frame image may be used as the skeleton point information in the blank frame image.
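A minimal sketch of the blank-frame fallback described above: if a frame contains no skeleton point information, the previous frame's skeleton point information is carried forward.

```python
def fill_blank_frames(frames):
    """frames: list of per-frame skeleton data (None or empty for a blank frame).
    Carries the previous frame's skeleton point information into blank frames."""
    filled, last = [], None
    for frame in frames:
        if not frame and last is not None:
            frame = last                 # reuse the previous frame's skeleton points
        filled.append(frame)
        last = frame
    return filled
```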
In addition to the two possible implementation manners above, the server may also determine the positional relationship between the objects in consecutive frame images based only on the overall coincidence degree between them, obtain the matching coincidence proportion of each object in the subsequent frame image with all objects in the previous frame image, and thereby determine whether the same object appears in the consecutive frame images.
In this embodiment, because human motion is continuous and the position and posture of a moving object change little between consecutive frame images, the matching coincidence proportion of each object in the next frame image with all objects in the previous frame image can be obtained by combining the positional relationship and the posture relationship of the objects in the consecutive frame images. This avoids confusing different objects whose positions are close but whose postures differ, so that whether the same object appears in consecutive frame images can be judged quickly and accurately.
On the other hand, since the matching coincidence proportion represents the degree of matching between each object in the (n+1)-th frame image and each object in the n-th frame image, the server may, in S2022, determine the index number of each object in the (n+1)-th frame image according to the index numbers of all objects in the n-th frame image and the matching coincidence proportion of each object in the (n+1)-th frame image with all objects in the n-th frame image.
In a feasible implementation manner, fig. 11 is a flowchart of the object behavior identification method provided by the present invention, and as shown in fig. 11, the object behavior identification method of this embodiment may include:
S501, determining the maximum matching coincidence proportion from the matching coincidence proportions of a target object in the (n+1)-th frame image with all objects in the n-th frame image, wherein the target object is any object in the (n+1)-th frame image.
In this embodiment, for any object (referred to as a target object) in the (n+1)-th frame image, the server may select the maximum value from the matching coincidence proportions of the target object with all objects in the n-th frame image as the maximum matching coincidence proportion.
Further, the server may determine a size relationship between the maximum matching coincidence ratio and a preset threshold. If the maximum matching coincidence ratio is greater than or equal to the preset threshold, S502 is executed. If the maximum matching coincidence proportion is smaller than the preset threshold value, S503 is executed.
The preset threshold may be set according to an actual empirical value, which is not limited in this embodiment.
S502, determining the index number of the target object as the index number of the object in the nth frame image corresponding to the maximum matching coincidence proportion.
In this embodiment, when the maximum matching coincidence ratio is greater than or equal to the preset threshold, the server may determine that the matching degree between the target object and the object in the nth frame image corresponding to the maximum matching coincidence ratio meets the condition that two objects are determined to be the same object, so that the index number of the target object may be determined to be the index number of the object in the nth frame image corresponding to the maximum matching coincidence ratio.
For example, if the index number of the object in the n-th frame image corresponding to the maximum matching coincidence proportion is A11, the index number of the target object is A11.
When there are a plurality of objects in the nth frame image corresponding to the maximum matching overlap ratio, the index number of one object may be randomly selected as the index number of the target object.
S503, determining the index number of the target object to be different from the index numbers of all objects in the nth frame image and the index numbers of other objects in the (n + 1) th frame image.
In this embodiment, when the maximum matching coincidence proportion is smaller than the preset threshold, the server may determine that the matching degree between the target object and the object in the n-th frame image corresponding to the maximum matching coincidence proportion does not meet the condition for determining that two objects are the same object. The target object is therefore not the same object as any object in the n-th frame image or any of the other objects in the (n+1)-th frame image, and its index number is determined to be different from the index numbers of all objects in the n-th frame image and from the index numbers of the other objects in the (n+1)-th frame image.
The index number of the target object may be calculated, for example, by adding 1 to the maximum index number among the index numbers of all objects in the n-th frame image and the index numbers of the other objects in the (n+1)-th frame image, or other calculation methods may be adopted, which is not limited in this embodiment. For example, if the index numbers of all objects in the n-th frame image and of the other objects in the (n+1)-th frame image include A11-A15, the index number of the target object may be set to A16.
It should be noted that, when the maximum matching overlap ratio is equal to the preset threshold, the server may also execute S503, which is not limited in this embodiment.
In this embodiment, the maximum matching coincidence proportion between each object in the (n+1)-th frame image and all objects in the n-th frame image is determined, and whether the same object appears in consecutive frame images is then judged by comparing the maximum matching coincidence proportion with the preset threshold. This avoids determining two objects with a low matching degree to be the same object, and improves the accuracy of identifying the objects.
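A minimal sketch of this index-number assignment (S501 to S503) might look as follows, assuming the matching coincidence proportions have already been computed. The threshold value, the integer index numbers, and all identifiers are illustrative assumptions rather than details taken from the patent.

```python
# Sketch of S501-S503: pick the best match for each object of frame n+1,
# reuse the matched index number if the score passes the threshold,
# otherwise assign a new index number ("max existing index + 1" scheme).
from typing import Dict, List

def assign_index_numbers(
    match_scores: List[Dict[int, float]],   # per object in frame n+1: {index number of frame-n object: MatchScore}
    used_indices: List[int],                # index numbers already present in frame n
    match_threshold: float = 0.5,           # illustrative preset threshold
) -> List[int]:
    """Return an index number for each object of frame n+1."""
    assigned: List[int] = []
    next_new = max(used_indices, default=-1) + 1
    for scores in match_scores:
        if scores:
            best_idx, best_score = max(scores.items(), key=lambda kv: kv[1])  # S501
        else:
            best_idx, best_score = None, 0.0
        if best_idx is not None and best_score >= match_threshold:
            assigned.append(best_idx)       # S502: same object, reuse its index number
        else:
            assigned.append(next_new)       # S503: new object, new index number
            next_new += 1
    return assigned
```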
S203, inputting the skeleton point information and the index number of each object in each frame of image into a target convolutional neural network for behavior identification, and determining the behavior information of the object corresponding to each index number in the target video.
S203 is similar to the implementation manner of S103 in the embodiment of fig. 1, and this embodiment is not described herein again.
It should be noted that, in addition to all the skeleton point information and the index number of each object in each frame image of the target video, the data input to the target convolutional neural network may further include: the number of the skeleton point surrounding frame corresponding to each skeleton point of each object in each frame image of the target video, the number of the skeleton surrounding frame of each object, and the skeleton point score of each object, where the skeleton point score indicates the probability that the skeleton point is of its current type.
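For concreteness, the per-object data described above might be collected into a record such as the following minimal sketch; the field names, types, and the use of a Python dataclass are illustrative assumptions, not part of the patent.

```python
# Hypothetical per-object, per-frame record gathering the fields listed above
# before they are fed to the behavior-recognition network. All names and types
# are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class ObjectFrameRecord:
    index_number: str                                    # e.g. "A11"; identifies the tracked object
    keypoint_positions: Dict[str, Tuple[float, float]]   # bone point type -> (x, y) position
    keypoint_scores: Dict[str, float]                    # probability that the point is of its current type
    keypoint_box_numbers: Dict[str, int] = field(default_factory=dict)  # number of each skeleton point surrounding frame
    skeleton_box_number: int = -1                        # number of the object's skeleton surrounding frame
```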
In this embodiment, the skeleton point information of each object in each frame image of the target video is obtained. The matching coincidence proportion of each object in the (n+1)-th frame image with all objects in the n-th frame image is determined according to all skeleton point information of each object in the n-th frame image and all skeleton point information of each object in the (n+1)-th frame image, where n is a positive integer less than N + 1, N is the total number of frames of the target video, and the matching coincidence proportion expresses the degree of matching between each object in the (n+1)-th frame image and all objects in the n-th frame image. The index number of each object in the (n+1)-th frame image is then determined according to the index numbers of all objects in the n-th frame image and the matching coincidence proportion of each object in the (n+1)-th frame image with all objects in the n-th frame image. Finally, the skeleton point information and the index number of each object in each frame image are input into the target convolutional neural network for behavior recognition, and the behavior information of the object corresponding to each index number in the target video is determined. Because human motion is continuous and the position and posture of a moving object change little between consecutive frame images, skeleton point tracking can be performed on each object in each frame image of the video through the positional relationship and/or posture relationship of the objects in consecutive frame images, the matching coincidence proportion between each object in the next frame image and all objects in the previous frame image can be obtained, whether the same object appears in consecutive frame images can be judged, and the correspondence between the objects throughout the target video can be determined. Moreover, since the behavior information of an object can be determined from the objects identified in each frame image and their skeleton point information alone, there is no need to record data such as faces, clothing or background, and no need to transmit image or video data; the privacy requirement of the user is therefore met without hindering object identification, the risk of data leakage during transmission and post-processing is avoided, the limit on the number of objects in the scene is broken through, and the accuracy of object behavior identification is improved. In addition, this embodiment does not require the user to carry hardware devices, eliminating the inconvenience of carrying such devices.
In a specific embodiment, the detailed process of the object behavior identification method according to this embodiment may include:
Step 1, obtaining the bone point information of each object in each frame image from the target video, wherein the bone point information includes: position information of all bone points and posture information of all bone points.
Step 2, tracking the skeleton points of each object in each frame image based on the skeleton point information of each object in each frame image, and judging whether the same object appears in consecutive frame images, thereby determining the index number of each object in each frame image.
Step 21, establishing an index number for each object in the 1st frame image, and acquiring the skeleton point surrounding frame corresponding to each skeleton point of each object in the 1st frame image and the skeleton surrounding frame of each object. All skeleton point information of each object in the 1st frame image, the index number of each object, the number of the skeleton point surrounding frame corresponding to each skeleton point of each object, the number of the skeleton surrounding frame of each object and the skeleton point score of each object are then recorded.
Step 22, acquiring the skeleton point surrounding frame corresponding to each skeleton point of each object in the 2nd frame image and the skeleton surrounding frame of each object.
Step 23, calculating the skeleton point coincidence proportion PoseScore of each object in the 2nd frame image with all objects in the 1st frame image, and the overall coincidence proportion BoxScore of each object in the 2nd frame image with all objects in the 1st frame image, according to the formula IoU = (A ∩ B) / (A ∪ B).
Step 24, calculating the matching coincidence proportion MatchScore of each object in the 2nd frame image with all objects in the 1st frame image according to the formula MatchScore = w1 × BoxScore + w2 × PoseScore.
Step 25, for any object in the 2nd frame image, determining the maximum matching coincidence proportion Max_MatchScore from the matching coincidence proportions MatchScore of the object with all objects in the 1st frame image, and comparing the maximum matching coincidence proportion Max_MatchScore with a preset threshold MatchThreshold. When Max_MatchScore is greater than or equal to the preset threshold MatchThreshold, the index number of the object is determined to be the index number of the object in the 1st frame image corresponding to the maximum matching coincidence proportion. When Max_MatchScore is smaller than the preset threshold MatchThreshold, the index number of the object is determined to be a new index number different from the index numbers of all objects in the 1st frame image and the 2nd frame image. Step 25 is repeated until the index number of each object in the 2nd frame image is determined. All skeleton point information of each object in the 2nd frame image, the index number of each object, the number of the skeleton point surrounding frame corresponding to each skeleton point of each object, the number of the skeleton surrounding frame of each object and the skeleton point score of each object are then recorded.
Step 26, repeating steps 22 to 25 to determine, in turn, the index number of each object in the 3rd frame image to the N-th frame image, where N is the total number of frames of the target video.
Step 3, inputting all skeleton point information of each object in each frame image of the target video, the index number of each object, the number of the skeleton point surrounding frame corresponding to each skeleton point of each object, the number of the skeleton surrounding frame of each object and the skeleton point score of each object into the target convolutional neural network for behavior recognition, thereby determining the behavior information of the object corresponding to each index number in the target video.
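To tie steps 1 to 3 together, the following is a self-contained sketch (not the patent's code) of the tracking stage only: the per-point box half-size, the weights w1/w2, the threshold, the integer index numbers, and the mean aggregation of bone point IoUs are all illustrative assumptions; skeleton extraction and the behavior-recognition network are outside this sketch.

```python
# Consolidated sketch of the tracking described in steps 1-3: match objects
# across frames by IoU of per-point boxes (PoseScore) and whole-skeleton boxes
# (BoxScore), then assign index numbers.
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def point_box(p: Point, r: float = 10.0) -> Box:
    """Small box around one skeleton point; the half-size r is an assumption."""
    return (p[0] - r, p[1] - r, p[0] + r, p[1] + r)

def skeleton_box(points: Dict[str, Point]) -> Box:
    """Smallest box enclosing all skeleton points of one object."""
    xs = [p[0] for p in points.values()]
    ys = [p[1] for p in points.values()]
    return (min(xs), min(ys), max(xs), max(ys))

def iou(a: Box, b: Box) -> float:
    """IoU = (A ∩ B) / (A ∪ B) for two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def track(frames: List[List[Dict[str, Point]]],
          w1: float = 0.5, w2: float = 0.5, thr: float = 0.5) -> List[List[int]]:
    """frames[t] lists the objects of frame t; each object maps bone point type -> (x, y).
    Returns the index number assigned to every object in every frame."""
    all_ids: List[List[int]] = []
    prev_objs: List[Dict[str, Point]] = []
    prev_ids: List[int] = []
    next_id = 0
    for objs in frames:
        if not objs:                       # blank frame: reuse the previous frame's skeletons
            objs = prev_objs
        cur_ids: List[int] = []
        for obj in objs:
            best_id, best_score = None, 0.0
            for pobj, pid in zip(prev_objs, prev_ids):
                shared = [k for k in obj if k in pobj]
                pose = (sum(iou(point_box(obj[k]), point_box(pobj[k])) for k in shared) / len(shared)
                        if shared else 0.0)                        # PoseScore (mean over bone points)
                box = iou(skeleton_box(obj), skeleton_box(pobj))   # BoxScore
                score = w1 * box + w2 * pose                       # MatchScore
                if score > best_score:
                    best_id, best_score = pid, score
            if best_id is not None and best_score >= thr:
                cur_ids.append(best_id)    # same object: keep the previous index number
            else:
                cur_ids.append(next_id)    # new object: assign a new index number
                next_id += 1
        prev_objs, prev_ids = objs, cur_ids
        all_ids.append(cur_ids)
    return all_ids
```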
Fig. 12 is a schematic structural diagram of the object behavior recognition apparatus provided in the present invention, and as shown in fig. 12, the object behavior recognition apparatus 100 of this embodiment may include: an acquisition module 11, a determination module 12 and a processing module 13.
The acquisition module 11 is configured to acquire bone point information of each object in each frame of image in the target video;
the determining module 12 is configured to perform bone point tracking on each object in each frame of image according to the bone point information of each object in each frame of image, and determine an index number of each object in each frame of image, where the index number is used to uniquely identify the corresponding object, and the objects with the same index number are the same object;
and the processing module 13 is configured to input the bone point information and the index number of each object in each frame of image into a target convolutional neural network for behavior identification, and determine behavior information of the object corresponding to each index number in the target video.
Optionally, the determining module 12 is configured to determine, according to all skeleton point information of each object in the n-th frame image and all skeleton point information of each object in the (n+1)-th frame image, the matching coincidence proportion of each object in the (n+1)-th frame image with all objects in the n-th frame image, where n is a positive integer less than N + 1, N is the total number of frames of the target video, and the matching coincidence proportion is used to represent the degree of matching between each object in the (n+1)-th frame image and all objects in the n-th frame image; and to determine the index number of each object in the (n+1)-th frame image according to the index numbers of all objects in the n-th frame image and the matching coincidence proportion of each object in the (n+1)-th frame image with all objects in the n-th frame image.
Optionally, the determining module 12 is specifically configured to determine, according to all skeleton point information of each object in the n-th frame image, the skeleton point bounding box corresponding to each skeleton point of each object in the n-th frame image; determine, according to all skeleton point information of each object in the (n+1)-th frame image, the skeleton point bounding box corresponding to each skeleton point of each object in the (n+1)-th frame image; determine the coincidence proportion of each skeleton point of each object in the (n+1)-th frame image with the corresponding skeleton points of all objects in the n-th frame image according to the skeleton point bounding box corresponding to each skeleton point of each object in the n-th frame image and the skeleton point bounding box corresponding to each skeleton point of each object in the (n+1)-th frame image; and determine the matching coincidence proportion of each object in the (n+1)-th frame image with all objects in the n-th frame image according to the coincidence proportion of each skeleton point of each object in the (n+1)-th frame image with the corresponding skeleton points of all objects in the n-th frame image.
Optionally, the determining module 12 is further specifically configured to determine, according to all skeleton point information of each object in the n-th frame image, the skeleton point bounding box corresponding to each skeleton point of each object in the n-th frame image and the skeleton bounding box of each object in the n-th frame image; determine, according to all skeleton point information of each object in the (n+1)-th frame image, the skeleton point bounding box corresponding to each skeleton point of each object in the (n+1)-th frame image and the skeleton bounding box of each object in the (n+1)-th frame image; determine the coincidence proportion of each skeleton point of each object in the (n+1)-th frame image with the corresponding skeleton points of all objects in the n-th frame image according to the skeleton point bounding box corresponding to each skeleton point of each object in the n-th frame image and the skeleton point bounding box corresponding to each skeleton point of each object in the (n+1)-th frame image; determine the overall coincidence proportion of each object in the (n+1)-th frame image with all objects in the n-th frame image according to the skeleton bounding box of each object in the n-th frame image and the skeleton bounding box of each object in the (n+1)-th frame image; and determine the matching coincidence proportion of each object in the (n+1)-th frame image with all objects in the n-th frame image according to the weight relationship between the skeleton point coincidence proportion and the overall coincidence proportion.
Optionally, the determining module 12 is further specifically configured to determine, for a target bone point of a target object in the (n+1)-th frame image, the ratio of the intersection to the union between the bone point bounding box of the target bone point of the target object in the (n+1)-th frame image and the bone point bounding boxes of the bone points of all objects in the n-th frame image that are of the same type as the target bone point, where the target object is any object in the (n+1)-th frame image and the target bone point is any bone point of the target object; and to determine the ratios of the intersection to the union between the bone point bounding boxes of all bone points of each object in the (n+1)-th frame image and the bone point bounding boxes of the corresponding bone points of all objects in the n-th frame image as the coincidence proportions of each bone point of each object in the (n+1)-th frame image with the corresponding bone points of all objects in the n-th frame image.
Optionally, the determining module 12 is further configured to determine, for a target object in the (n+1)-th frame image, the ratio of the intersection to the union between the skeleton bounding box of the target object in the (n+1)-th frame image and the skeleton bounding boxes of all objects in the n-th frame image, where the target object is any object in the (n+1)-th frame image; and to determine the ratio of the intersection to the union between the skeleton bounding box of each object in the (n+1)-th frame image and the skeleton bounding boxes of all objects in the n-th frame image as the overall coincidence proportion of each object in the (n+1)-th frame image with all objects in the n-th frame image.
Optionally, the determining module 12 is further specifically configured to determine a maximum matching coincidence ratio from matching coincidence ratios of a target object in the n +1 th frame image and all objects in the n th frame image, where the target object is any one object in the n +1 th frame image; when the maximum matching coincidence proportion is larger than or equal to a preset threshold value, determining that the index number of the target object is the index number of an object in the nth frame image corresponding to the maximum matching coincidence proportion; and when the maximum matching coincidence proportion is smaller than a preset threshold value, determining that the index numbers of the target objects are different from the index numbers of all the objects in the nth frame image and the index numbers of the other objects in the (n + 1) th frame image.
Optionally, the bone point information comprises: position information and pose information for all bone points.
The object behavior recognition apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects thereof are similar, and are not described herein again.
In this embodiment, the object behavior recognition apparatus may be divided into functional modules according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that the division of the modules in the embodiments of the present invention is schematic, and is only a logical function division, and there may be another division manner in actual implementation.
Fig. 13 is a schematic diagram of a hardware structure of the electronic device provided in the present invention. As shown in fig. 13, the electronic device 20 is configured to implement the operation corresponding to the server or the terminal device in any of the above method embodiments, where the electronic device 20 of this embodiment may include: a memory 21 and a processor 22;
a memory 21 for storing a computer program;
a processor 22 for executing the computer program stored in the memory to implement the object behavior recognition method in the above embodiments. Reference may be made in particular to the description relating to the method embodiments described above.
Alternatively, the memory 21 may be separate or integrated with the processor 22.
When the memory 21 is a device separate from the processor 22, the electronic device 20 may further include:
a bus 23 for connecting the memory 21 and the processor 22.
Optionally, this embodiment further includes: a communication interface 24, which may be connected to the processor 22 via the bus 23. The processor 22 may control the communication interface 24 to implement the above-described receiving and transmitting functions of the electronic device 20.
The electronic device provided in this embodiment may be used to execute the object behavior recognition method, and the implementation manner and the technical effect thereof are similar, and this embodiment is not described herein again.
The present embodiment also provides a computer-readable storage medium including a computer program for implementing the object behavior recognition method in the above embodiments.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The computer-readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. An object behavior recognition method, comprising:
obtaining the bone point information of each object in each frame of image in the target video;
according to the bone point information of each object in each frame of image, performing bone point tracking on each object in each frame of image, and determining the index number of each object in each frame of image, wherein the index number is used for uniquely identifying the corresponding object, the objects with the same index number are the same object, and the bone point information comprises: position information of all the skeleton points and posture information of all the skeleton points;
and inputting the skeleton point information and the index number of each object in each frame image into a target convolutional neural network for behavior identification, and determining the behavior information of the object corresponding to each index number in the target video.
2. The method according to claim 1, wherein the determining the index number of each object in each frame of image by performing skeletal point tracking on each object in each frame of image according to the skeletal point information of each object in each frame of image comprises:
respectively determining the matching coincidence proportion of each object in the (n+1)-th frame image and all objects in the n-th frame image according to all skeletal point information of each object in the n-th frame image and all skeletal point information of each object in the (n+1)-th frame image, wherein n is a positive integer less than N + 1, N is the total frame number of the target video, and the matching coincidence proportion is used for expressing the degree of matching between each object in the (n+1)-th frame image and all objects in the n-th frame image;
and determining the index number of each object in the n +1 frame image according to the index numbers of all objects in the n frame image and the matching coincidence proportion of each object in the n +1 frame image and all objects in the n frame image.
3. The method according to claim 2, wherein the determining the matching coincidence proportion of each object in the (n+1)-th frame image and all objects in the n-th frame image according to all skeletal point information of each object in the n-th frame image and all skeletal point information of each object in the (n+1)-th frame image respectively comprises:
determining a skeleton point surrounding frame corresponding to each skeleton point of each object in the nth frame of image according to all skeleton point information of each object in the nth frame of image;
determining a skeleton point surrounding frame corresponding to each skeleton point of each object in the (n + 1) th frame image according to all skeleton point information of each object in the (n + 1) th frame image;
determining the coincidence proportion of each skeleton point of each object in the (n+1)-th frame image and the corresponding skeleton points of all objects in the n-th frame image according to the skeleton point surrounding frame corresponding to each skeleton point of each object in the n-th frame image and the skeleton point surrounding frame corresponding to each skeleton point of each object in the (n+1)-th frame image;
and determining the matching coincidence proportion of each object in the (n+1)-th frame image and all objects in the n-th frame image according to the coincidence proportion of each bone point of each object in the (n+1)-th frame image and the corresponding bone points of all objects in the n-th frame image.
4. The method according to claim 2, wherein the determining the matching coincidence proportion of each object in the (n+1)-th frame image and all objects in the n-th frame image according to all skeletal point information of each object in the n-th frame image and all skeletal point information of each object in the (n+1)-th frame image respectively comprises:
determining a skeleton point surrounding frame corresponding to each skeleton point of each object in the nth frame image and a skeleton surrounding frame of each object in the nth frame image according to all skeleton point information of each object in the nth frame image;
determining a skeleton point surrounding frame corresponding to each skeleton point of each object in the (n + 1) th frame image and a skeleton surrounding frame of each object in the (n + 1) th frame image according to all skeleton point information of each object in the (n + 1) th frame image;
determining the coincidence proportion of each skeleton point of each object in the (n+1)-th frame image and the corresponding skeleton points of all objects in the n-th frame image according to the skeleton point surrounding frame corresponding to each skeleton point of each object in the n-th frame image and the skeleton point surrounding frame corresponding to each skeleton point of each object in the (n+1)-th frame image;
determining the integral coincidence proportion of each object in the (n+1)-th frame image and all objects in the n-th frame image according to the skeleton surrounding frame of each object in the n-th frame image and the skeleton surrounding frame of each object in the (n+1)-th frame image;
and determining the matching coincidence proportion of each object in the n +1 frame image and all objects in the n frame image according to the weight relation between the bone point coincidence proportion and the integral coincidence proportion.
5. The method according to claim 3, wherein the determining the coincidence proportion of each bone point of each object in the (n+1)-th frame image and the corresponding bone points of all objects in the n-th frame image according to the bone point bounding box of each bone point of each object in the n-th frame image and the bone point bounding box of each bone point of each object in the (n+1)-th frame image comprises:
aiming at a target bone point of a target object in the (n+1)-th frame image, determining the ratio of intersection and union between the bone point surrounding frame of the target bone point of the target object in the (n+1)-th frame image and the bone point surrounding frames of the bone points of all objects in the n-th frame image which are of the same type as the target bone point, wherein the target object is any one object in the (n+1)-th frame image, and the target bone point is any one bone point of the target object;
and determining the ratio of the intersection and union of the skeleton point surrounding frames of all skeleton points of each object in the (n+1)-th frame image and the skeleton point surrounding frames of the corresponding skeleton points of all objects in the n-th frame image as the coincidence proportion of each skeleton point of each object in the (n+1)-th frame image and the corresponding skeleton points of all objects in the n-th frame image.
6. The method according to claim 4, wherein the determining the overall coincidence proportion of each object in the (n+1)-th frame image and all objects in the n-th frame image according to the skeleton bounding box of each object in the n-th frame image and the skeleton bounding box of each object in the (n+1)-th frame image comprises:
aiming at a target object in the (n+1)-th frame image, determining the ratio of intersection and union between the skeleton surrounding frame of the target object in the (n+1)-th frame image and the skeleton surrounding frames of all objects in the n-th frame image, wherein the target object is any one object in the (n+1)-th frame image;
and determining the ratio of the intersection and union of the skeleton surrounding frame of each object in the n +1 frame image and the skeleton surrounding frames of all the objects in the n frame image as the integral coincidence proportion of each object in the n +1 frame image and all the objects in the n frame image.
7. The method according to any one of claims 2 to 6, wherein the determining the index number of each object in the (n+1)-th frame image according to the index numbers of all objects in the n-th frame image and the matching coincidence proportion of each object in the (n+1)-th frame image and all objects in the n-th frame image comprises:
determining the maximum matching coincidence proportion from the matching coincidence proportions of a target object in the (n+1)-th frame image and all objects in the n-th frame image, wherein the target object is any one object in the (n+1)-th frame image;
when the maximum matching coincidence proportion is larger than or equal to a preset threshold value, determining that the index number of the target object is the index number of an object in the nth frame image corresponding to the maximum matching coincidence proportion;
and when the maximum matching coincidence proportion is smaller than a preset threshold value, determining that the index numbers of the target objects are different from the index numbers of all the objects in the nth frame image and the index numbers of the other objects in the (n + 1) th frame image.
8. The method of any one of claims 1-6, wherein the skeletal point information includes: position information and pose information for all bone points.
9. An object behavior recognition apparatus, comprising:
the acquisition module is used for acquiring the bone point information of each object in each frame of image in the target video;
a determining module, configured to perform bone point tracking on each object in each frame of image according to bone point information of each object in each frame of image, and determine an index number of each object in each frame of image, where the index number is used to uniquely identify the corresponding object, and the objects with the same index number are the same object, where the bone point information includes: position information of all the skeleton points and posture information of all the skeleton points;
and the processing module is used for inputting the bone point information and the index number of each object in each frame of image into a target convolutional neural network for behavior identification, and determining the behavior information of the object corresponding to each index number in the target video.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the object behavior recognition method according to any one of claims 1 to 8.
11. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the object behavior recognition method of any of claims 1-8 via execution of the executable instructions.
CN201910777053.7A 2019-08-22 2019-08-22 Object behavior identification method and device Active CN110472613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910777053.7A CN110472613B (en) 2019-08-22 2019-08-22 Object behavior identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910777053.7A CN110472613B (en) 2019-08-22 2019-08-22 Object behavior identification method and device

Publications (2)

Publication Number Publication Date
CN110472613A CN110472613A (en) 2019-11-19
CN110472613B true CN110472613B (en) 2022-05-10

Family

ID=68513295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910777053.7A Active CN110472613B (en) 2019-08-22 2019-08-22 Object behavior identification method and device

Country Status (1)

Country Link
CN (1) CN110472613B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160277A (en) * 2019-12-31 2020-05-15 深圳中兴网信科技有限公司 Behavior recognition analysis method and system, and computer-readable storage medium
CN113255402B (en) * 2020-02-10 2024-06-11 深圳绿米联创科技有限公司 Action recognition method and device and electronic equipment
CN113705284A (en) * 2020-05-22 2021-11-26 杭州萤石软件有限公司 Climbing identification method and device and camera
CN112580552B (en) * 2020-12-23 2023-12-12 中山大学 Murine behavior analysis method and device
CN112926541B (en) * 2021-04-09 2022-11-08 济南博观智能科技有限公司 Sleeping post detection method and device and related equipment
CN113486777A (en) * 2021-07-02 2021-10-08 北京一维大成科技有限公司 Behavior analysis method and device for target object, electronic equipment and storage medium

Citations (7)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611157A (en) * 2016-11-17 2017-05-03 中国石油大学(华东) Multi-people posture recognition method based on optical flow positioning and sliding window detection
CN107909060A (en) * 2017-12-05 2018-04-13 前海健匠智能科技(深圳)有限公司 Gymnasium body-building action identification method and device based on deep learning
CN108446585A (en) * 2018-01-31 2018-08-24 深圳市阿西莫夫科技有限公司 Method for tracking target, device, computer equipment and storage medium
CN109255296A (en) * 2018-08-06 2019-01-22 广东工业大学 A kind of daily Human bodys' response method based on depth convolutional neural networks
CN109858390A (en) * 2019-01-10 2019-06-07 浙江大学 The Activity recognition method of human skeleton based on end-to-end space-time diagram learning neural network
CN109949341A (en) * 2019-03-08 2019-06-28 广东省智能制造研究所 A kind of pedestrian target tracking based on human skeleton structured features
CN110135277A (en) * 2019-07-05 2019-08-16 南京邮电大学 A kind of Human bodys' response method based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"骨架关节点跟踪的人体行为识别方法";陈曦;《河南科技大学学报》;20150430(第2期);第43-48页 *

Also Published As

Publication number Publication date
CN110472613A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
CN110472613B (en) Object behavior identification method and device
CN111815754B (en) Three-dimensional information determining method, three-dimensional information determining device and terminal equipment
WO2019033574A1 (en) Electronic device, dynamic video face recognition method and system, and storage medium
CN111259751A (en) Video-based human behavior recognition method, device, equipment and storage medium
CN110298306B (en) Method, device and equipment for determining motion information of target object
CN108108711B (en) Face control method, electronic device and storage medium
CN111597910A (en) Face recognition method, face recognition device, terminal equipment and medium
CN113177968A (en) Target tracking method and device, electronic equipment and storage medium
CN105095853A (en) Image processing apparatus and image processing method
KR20200076267A (en) Gesture Recognition Method and Processing System using Skeleton Length Information
CN109740511B (en) Facial expression matching method, device, equipment and storage medium
CN112258647B (en) Map reconstruction method and device, computer readable medium and electronic equipment
US11074696B2 (en) Image processing device, image processing method, and recording medium storing program
WO2018179119A1 (en) Image analysis apparatus, image analysis method, and recording medium
CN110651274A (en) Movable platform control method and device and movable platform
CN115268285A (en) Device control method, device, electronic device, and storage medium
CN113902030A (en) Behavior identification method and apparatus, terminal device and storage medium
CN114463835A (en) Behavior recognition method, electronic device and computer-readable storage medium
CN113792700A (en) Storage battery car boxing detection method and device, computer equipment and storage medium
CN113626726A (en) Space-time trajectory determination method and related product
CN114972419B (en) Tumble detection method, tumble detection device, medium and electronic equipment
CN112149615B (en) Face living body detection method, device, medium and electronic equipment
CN113723306B (en) Push-up detection method, push-up detection device and computer readable medium
CN111460977B (en) Cross-view personnel re-identification method, device, terminal and storage medium
CN117612258A (en) Training method and device of human body posture estimation model, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant