CN113496143A - Action recognition method and device, and storage medium

Info

Publication number
CN113496143A
CN113496143A
Authority
CN
China
Prior art keywords
person
scene image
determining
body position
human body
Prior art date
Legal status
Pending
Application number
CN202010196461.6A
Other languages
Chinese (zh)
Inventor
王飞
王利鸣
钱晨
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202010196461.6A, published as CN113496143A
Priority to KR1020227003914A, published as KR20220027241A
Priority to TW110109832A, published as TWI776429B
Priority to JP2022506372A, published as JP2022543032A
Priority to PCT/CN2021/081556, published as WO2021185317A1
Publication of CN113496143A; legal status: pending

Classifications

    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/00 Pattern recognition
    • G06V 40/20 Recognition of biometric, human-related or animal-related patterns in image or video data; Movements or behaviour, e.g. gesture recognition
    • G06V 40/165 Human faces; Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G06V 10/469 Extraction of image or video features; Contour-based spatial representations, e.g. vector-coding
    • G06V 10/751 Image or video pattern matching; Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Abstract

The present disclosure provides an action recognition method and apparatus, and a storage medium, wherein the method comprises: acquiring a scene image; performing, on the scene image, detection of different parts of an object, association of different parts belonging to the same object, and action recognition of the object; and determining at least one object included in the scene image and a target action type of each object in the at least one object.

Description

Action recognition method and device, and storage medium
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a method and an apparatus for motion recognition, and a storage medium.
Background
Currently, there is an increasing need to analyze the actions of objects by computer vision techniques. In a typical action type recognition process, the position of each object is obtained through object detection, each object is cropped from the image according to its position, and the cropped object is input into an action classification network to obtain an action recognition result.
With this approach, the processing time of action recognition grows linearly with the number of objects in the scene. For example, if a scene includes N objects, where N is a positive integer, the action classification network needs to perform N inferences and the action recognition time increases roughly N-fold; the larger N is, the longer the action recognition takes, which requires the device to have higher computing power and consumes more time.
Disclosure of Invention
The disclosure provides a motion recognition method and device and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided an action recognition method, the method including: acquiring a scene image; performing, on the scene image, detection of different parts of an object, association of different parts belonging to the same object, and action recognition of the object; and determining at least one object included in the scene image and a target action type of each object in the at least one object.
In some optional embodiments, the object comprises a person, and the different parts of the object comprise a face and a body of the person; the performing different part detection of an object, association of different parts in the same object, and motion recognition of the object on the scene image, and determining at least one object included in the scene image and a target motion type of each object in the at least one object, includes: determining the face position of each person and the body position of each person on the scene image; associating the face position and the body position belonging to the same person; and determining at least one person included in the scene image and the target action type of each person in the at least one person according to the associated face position and the body position.
In some optional embodiments, the associating the face position and the body position belonging to the same person includes: determining a reference human body position corresponding to each human face position; and associating the human face position and the human body position belonging to the same person according to the reference human body position and the human body position.
In some optional embodiments, the determining the reference body position corresponding to each face position includes: determining a first coordinate value corresponding to each face position on the scene image; respectively determining a second coordinate value according to a preset vector and the first coordinate value; the preset vector is a vector pointing to the position of the human body from the position of the face; and taking the second coordinate value as the reference human body position.
In some optional embodiments, the associating the face position and the body position belonging to the same person according to the reference body position and the body position includes: and taking the human body position with the minimum distance from the reference human body position and the human face position corresponding to the reference human body position as the associated human face position and human body position.
In some optional embodiments, the determining at least one person included in the scene image and the target action type of each person in the at least one person according to the associated face position and the body position includes: determining at least one of the associated face position and the body position as the position of each person included in the scene image, thereby determining the at least one person included in the scene image; determining a plurality of feature vectors according to the associated face position and the body position; and determining the target action type of each person in the at least one person based on the plurality of feature vectors.
In some optional embodiments, the determining a plurality of feature vectors according to the associated face position and the body position includes: and determining the characteristic vectors which respectively correspond to at least one preset action type and point to the associated human body position by the human face position to obtain the plurality of characteristic vectors.
In some optional embodiments, the determining the target action type for each of the at least one person based on the plurality of feature vectors comprises: normalizing the plurality of feature vectors corresponding to each person to obtain a normalized value of each feature vector; taking the feature vector corresponding to the maximum normalized value of each person as the target feature vector of each person; and taking the action type corresponding to the target feature vector as the target action type of each person.
In some optional embodiments, the performing, on the scene image, different part detection of an object, association of different parts in the same object, and motion recognition of the object, and determining at least one object included in the scene image and a target motion type of each object in the at least one object, includes: determining a target position of each part of each object on the scene image through an object detection model, and then associating the target positions of different parts belonging to the same object; and determining at least one object included in the scene image and a target action type of each object in the at least one object according to the target positions of the associated different parts through the object detection model.
In some optional embodiments, the method further comprises: determining a label type in the sample image set; the label type comprises at least one of a face position label, a human body position label, an association relation label between the face position and the human body position, and an action recognition label between the human body position and the action type; and respectively training branches corresponding to the label types in a preset model by adopting the sample image set to obtain the object detection model.
In some optional embodiments, the method further comprises: and determining the accumulated detection result of the action which is made by each object in a set time period and is matched with the target action type.
In some optional embodiments, the scene image comprises a scene image captured in a classroom, the object comprises a teaching object, and the target action type comprises at least one action type in a teaching task.
According to a second aspect of the embodiments of the present disclosure, there is provided a motion recognition apparatus, the apparatus including: the image acquisition module is used for acquiring a scene image; and the action recognition module is used for detecting different parts of the object, associating different parts in the same object and recognizing the action of the object on the scene image, and determining at least one object included in the scene image and the target action type of each object in the at least one object.
In some optional embodiments, the object comprises a person, and the different parts of the object comprise a face and a body of the person; the action recognition module includes: the first determining submodule is used for determining the face position of each person and the human body position of each person on the scene image; the association submodule is used for associating the human face position and the human body position which belong to the same person; and the second determining sub-module is used for determining at least one person included in the scene image and the target action type of each person in the at least one person according to the associated face position and the associated body position.
In some optional embodiments, the association sub-module comprises: a first determination unit for determining a reference human body position corresponding to each face position; and the association unit is used for associating the human face position and the human body position belonging to the same person according to the reference human body position and the human body position.
In some optional embodiments, the first determining unit comprises: determining a first coordinate value corresponding to each face position on the scene image; respectively determining a second coordinate value according to a preset vector and the first coordinate value; the preset vector is a vector pointing to the position of the human body from the position of the face; and taking the second coordinate value as the reference human body position.
In some optional embodiments, the associating unit comprises: and taking the human body position with the minimum distance from the reference human body position and the human face position corresponding to the reference human body position as the associated human face position and human body position.
In some optional embodiments, the second determining sub-module comprises: a second determining unit, configured to determine at least one of the associated face position and the body position as the position where each person included in the scene image is located, thereby determining the at least one person included in the scene image; a third determining unit, configured to determine a plurality of feature vectors according to the associated face position and the body position; and a fourth determining unit, configured to determine the target action type of each person in the at least one person based on the plurality of feature vectors.
In some optional embodiments, the third determining unit comprises: and determining the characteristic vectors which respectively correspond to at least one preset action type and point to the associated human body position by the human face position to obtain the plurality of characteristic vectors.
In some optional embodiments, the fourth determining unit comprises: normalizing the plurality of feature vectors corresponding to each person to obtain a normalized value of each feature vector; taking the feature vector corresponding to the maximum normalized value of each person as the target feature vector of each person; and taking the action type corresponding to the target feature vector as the target action type of each person.
In some optional embodiments, the action recognition module comprises: the second association submodule is used for associating the target positions of different parts belonging to the same object after determining the target position of each part of each object on the scene image through the object detection model; and the third determining sub-module is used for determining at least one object included in the scene image and a target action type of each object in the at least one object according to the target positions of the associated different parts through the object detection model.
In some optional embodiments, the apparatus further comprises: the label type determining module is used for determining the type of a label in the sample image set; the label type comprises at least one of a face position label, a human body position label, an association relation label between the face position and the human body position, and an action recognition label between the human body position and the action type; and the training module is used for respectively training the branches corresponding to the label types in a preset model by adopting the sample image set to obtain the object detection model.
In some optional embodiments, the apparatus further comprises: and the matching determination module is used for determining the accumulated detection result of the action which is made by each object within a set time period and is matched with the target action type.
In some optional embodiments, the scene image comprises a scene image captured in a classroom, the object comprises a teaching object, and the target action type comprises at least one action type in a teaching task.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the motion recognition method according to any one of the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a motion recognition apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to invoke executable instructions stored in the memory to implement the action recognition method of any of the first aspects.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
in the embodiment of the disclosure, different part detection of an object, association of different parts in the same object, and motion recognition of the object can be performed on a scene image, so as to determine at least one object included in the scene image and a target motion type of each object in the at least one object, where the motion recognition duration is unrelated to the number of objects included in the scene image, and thus, the increase of the calculation duration due to the increase of the number of objects is avoided, thereby greatly saving calculation resources, shortening the motion recognition duration, and effectively improving the detection efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of motion recognition according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram of another method of motion recognition shown in the present disclosure in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram of another method of motion recognition shown in the present disclosure in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram of another method of motion recognition shown in the present disclosure in accordance with an exemplary embodiment;
FIG. 5 is a diagram illustrating preset vectors in accordance with an exemplary embodiment of the present disclosure;
FIG. 6 is a flow diagram of another method of motion recognition shown in the present disclosure in accordance with an exemplary embodiment;
FIG. 7 is a flow diagram of another method of motion recognition shown in the present disclosure in accordance with an exemplary embodiment;
FIG. 8 is a schematic diagram illustrating an object detection model architecture according to an exemplary embodiment of the present disclosure;
FIG. 9 is a flow chart of another method of motion recognition shown in the present disclosure in accordance with an exemplary embodiment;
FIG. 10 is a schematic diagram illustrating an object detection model training scenario in accordance with an exemplary embodiment of the present disclosure;
FIG. 11 is a flow chart of another method of motion recognition shown in the present disclosure in accordance with an exemplary embodiment;
FIG. 12 is a block diagram of a motion recognition device according to an exemplary embodiment of the present disclosure;
fig. 13 is a schematic structural diagram illustrating a motion recognition device according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if," as used herein, may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
The embodiment of the disclosure provides an action recognition scheme, which is exemplary and can be applied to terminal equipment in different scenes. Different scenes include but are not limited to a classroom, a place where an advertisement is played, or other indoor or outdoor scenes needing to identify the motion of at least one object, and the terminal device may be any terminal device with a camera, or the terminal device may also be an external camera device. The terminal device detects different parts of an object, associates different parts in the same object and identifies the motion of the object on the acquired scene image, so that at least one object included in the scene image and the target motion type of each object in the at least one object are determined.
For example, in a classroom, the terminal device may employ a teaching multimedia device with a camera deployed in the classroom, including but not limited to a teaching projector, a monitoring device in the classroom, and the like. For example, the terminal device acquires a scene image in a classroom, so that different parts of objects in the classroom are detected, different parts of the same object are associated, and the motion of the objects is identified, and a detection result is quickly obtained, wherein the detection result may include at least one object included in the scene image and a target motion type of each object, and the target motion type may include hand-lifting, standing or performing other interactive motions.
For another example, the terminal device may obtain a scene image in an elevator, the elevator is playing an advertisement, and by using the scheme provided by the embodiment of the present disclosure, a target action type corresponding to an object in the elevator when the elevator plays the advertisement may be determined, and the target action type may include, but is not limited to, turning around, focusing on an advertisement delivery position, leaning on, and the like.
For example, the motion recognition scheme provided by the embodiment of the present disclosure may also be applicable to cloud servers in different scenes, where the cloud server may be provided with an external camera, the external camera collects a scene image, the scene image is sent to the cloud server through a router or a gateway, the cloud server performs detection on different parts of an object, association between different parts in the same object, and motion recognition on the object on the scene image, and determines at least one object included in the scene image and a target motion type of each object in the at least one object.
For example, an external camera is arranged in a classroom, the external camera collects scene images in the classroom and then sends the scene images to a cloud server through a router or a gateway, the cloud server performs object different part detection, object different part association and object action recognition on the scene images, and determines at least one object included in the scene images and a target action type of each object in the at least one object. Further, the cloud server can feed back the results to the corresponding teaching task analysis server as required, so that teachers are reminded to adjust teaching contents, and teaching activities can be better performed.
For another example, in a place where the advertisement is played, assuming that the place is an elevator, the external camera is arranged in the elevator, the external camera collects a scene image in the elevator, the scene image can be sent to the cloud server through the router or the gateway, and the cloud server determines at least one object included in the scene image and a target action type of each object in the at least one object. And subsequently, the target action statistical result of the object in the elevator can be fed back to the corresponding advertiser server according to the requirement, and the advertiser adjusts the advertisement content.
In the embodiment of the present disclosure, further processing may be performed according to the detection result by the terminal device or the cloud server, for example, outputting a target image, and identifying at least one object included in the scene image and a target action type of each object in the at least one object on the target image, so as to better understand the object and the action type of each object in the current scene.
In addition, the accumulated detection result of the action matched with the target action type, which is made by each object included in the scene image in the set time period, can be determined through the terminal device or the cloud server.
If the scene image comprises a scene image captured in a classroom and the objects comprise teaching objects, such as students, the target action type may comprise at least one action type in a teaching task.
For example, in a classroom where a teacher is teaching, target action types include, but are not limited to, raising a hand, taking the initiative to answer questions, interacting with the teacher, focusing on the blackboard, writing with the head down, and the like. Through this scheme, within the time period during which the teacher teaches, for example one class period, it can be determined how many times each teaching object raises a hand, how long each teaching object focuses on the blackboard or writes with the head down, how many times each teaching object answers questions or interacts with the teacher, and so on. Further, the terminal device can display the accumulated detection result after obtaining it, or the cloud server can send the accumulated detection result to a designated terminal device for display, so that the teacher can better perform the teaching task.
The above are merely examples of applicable scenarios of the present disclosure; other indoor or outdoor scenes requiring rapid action type recognition also fall within the scope of the present disclosure.
For example, as shown in fig. 1, fig. 1 illustrates a motion recognition method according to an exemplary embodiment, including the following steps:
in step 101, an image of a scene is acquired.
In the embodiment of the present disclosure, a scene image of a current scene may be acquired, and the scene of the present disclosure includes, but is not limited to, any scene that requires motion recognition of an object in the scene, such as a classroom, a place where an advertisement is played, and the like.
In step 102, different part detection of an object, association of different parts in the same object and motion recognition of the object are performed on the scene image, and at least one object included in the scene image and a target motion type of each object in the at least one object are determined.
In the embodiment of the present disclosure, the object may include, but is not limited to, a person, the different parts may include, but is not limited to, a human face and a human body, and the detecting the different parts of the object on the scene image may include detecting a position of the human face and a position of the human body on the scene image. The association of different parts in the same object may be such that the face position and the body position belonging to the same person need to be associated. And the motion recognition of the object can determine the target motion type of each character included in the scene image from at least one preset motion type.
The preset action type can be set according to the scene requirement, and includes but is not limited to raising a hand, bending down, jumping, turning around, and the like; alternatively, the preset action types can also include a type in which no action is performed, for example, a person keeps the previous action unchanged.
In the embodiment, different part detection of the object, association of different parts in the same object, and motion recognition of the object are performed on the scene image, so that at least one object included in the scene image and a target motion type of each object in the at least one object are determined, the motion recognition duration is unrelated to the number of the objects included in the scene image, the increase of the calculation duration due to the increase of the number of the objects is avoided, the calculation resources are greatly saved, the duration of the motion recognition is shortened, and the detection efficiency is improved.
In some alternative embodiments, step 102 may include:
performing feature extraction on the scene image to obtain a feature map, and after the feature map is obtained, performing detection of different parts of the object, association of different parts of the same object, and action recognition of the object on the feature map.
In the embodiment of the present disclosure, the image features in the scene image may be extracted through a pre-trained neural network model (backbone), so as to obtain the feature map. The neural network model may be, but is not limited to, a Visual Geometry Group Network (VGG Net) model.
The dimensionality of a feature map obtained by extracting image features through the neural network model is smaller than the dimensionality of the scene image. For example, inputting a scene image with dimensions 640 × 480 into the neural network model, a feature map with dimensions 80 × 60 can be obtained.
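By way of illustration only, the following is a minimal Python sketch of such a backbone (using the PyTorch library, which is not mandated by the present disclosure); the layer configuration, the channel counts and the name Backbone are assumptions made for demonstration. Three stride-2 convolutions reduce a 640 × 480 input to an 80 × 60 feature map, consistent with the example above.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Illustrative VGG-style feature extractor (hypothetical configuration)."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),            # 480x640 -> 240x320
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 48, kernel_size=3, stride=2, padding=1),           # 240x320 -> 120x160
            nn.ReLU(inplace=True),
            nn.Conv2d(48, out_channels, kernel_size=3, stride=2, padding=1), # 120x160 -> 60x80
            nn.ReLU(inplace=True),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.layers(image)

# A 640x480 scene image (height 480, width 640) yields a 60x80 (i.e. 80x60) feature map.
features = Backbone()(torch.randn(1, 3, 480, 640))
print(features.shape)  # torch.Size([1, 64, 60, 80])
```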
The extracted image features may include, but are not limited to, color features, texture features, shape features, and the like. The color feature is a global feature describing the surface color attribute of the object corresponding to the image, and the texture feature is a global feature describing the surface texture attribute of that object. The shape feature has two types of representation: one is the contour feature and the other is the region feature; the contour feature of the image mainly concerns the outer boundary of the object, while the region feature of the image is related to the shape of the image region.
In the embodiment of the present disclosure, the action recognition is performed subsequently mainly for the feature map.
In the above embodiment, after the feature extraction is performed on the scene image to obtain the feature map, different part detection of the object, association of different parts in the same object, and motion recognition of the object are subsequently performed on the feature map, so that the at least one object included in the scene image and the target motion type of each object in the at least one object are determined quickly according to the image features, and the method is simple and convenient to implement and high in usability.
In some alternative embodiments, the object includes a person, and the different parts of the object include a face and a body of the person, for example, as shown in fig. 2, step 102 may include:
in step 102-1, a face position of each person and a body position of each person are determined on the scene image.
In the embodiment of the present disclosure, the face region belonging to a face and the body region belonging to a body on the feature map corresponding to the scene image may be detected through a Region Proposal Network (RPN). The face region can be marked by a face recognition frame, and the body region can be marked by a body recognition frame. Further, the size of the face recognition frame may be determined by the center position, the length and the width of the face recognition frame, and in the embodiment of the present disclosure, the face position may be represented by the center position of the face recognition frame. Likewise, the size of the body recognition frame may be determined by the center position, the length and the width of the body recognition frame, and the body position may be represented by the center position of the body recognition frame.
In the embodiment of the present disclosure, the above-mentioned position description information of the human face and the human body may be respectively represented by different channels. For example, the dimension of the feature map is 80 × 60, after the face region and the body region of each person are determined, a first feature map of 80 × 60 × 6 may be obtained, and the 6 channels of the first feature map respectively output the center position of the face recognition frame, the length of the face recognition frame, the width of the face recognition frame, the center position of the body recognition frame, the length of the body recognition frame, and the width of the body recognition frame.
In a possible implementation manner, the first feature maps corresponding to two channels of the center position of the face recognition frame and the center position of the human body recognition frame can be obtained, so that the face position and the human body position are respectively determined.
In step 102-2, the face position and the body position belonging to the same person are associated.
In the embodiment of the present disclosure, if the number of persons is multiple, after each face position and each body position are determined, the face position and the body position belonging to the same person need to be associated, so as to obtain the associated face position and body position. In the embodiment of the present disclosure, it is the center position of the face recognition frame and the center position of the human body recognition frame that need to be associated.
For example, the center positions of 2 face recognition frames, respectively A1 and A2, and the center positions of 2 body recognition frames, respectively B1 and B2, are determined on the feature map, and the center positions of the face recognition frames and the center positions of the body recognition frames can be associated, so that the associated center position A1 of the face recognition frame and center position B2 of the body recognition frame, and the associated center position A2 of the face recognition frame and center position B1 of the body recognition frame, are finally obtained.
In the embodiment of the present disclosure, the face position and the body position associated with the face position may be respectively represented by 2 channels. For example, the dimension of the feature map is 80 × 60, after the face region and the body region of each person are determined, a first feature map with the dimension of 80 × 60 × 6 is obtained, and further, the face position and the body position are associated to obtain a second feature map with the dimension of 80 × 60 × 2. The second feature map includes 2 channels, one channel corresponding to the face position of each person and the other channel corresponding to the body position associated with the face position.
In step 102-3, at least one person included in the scene image and the target action type of each person in the at least one person are determined according to the associated face position and the body position.
In the embodiment of the disclosure, the position of each person may be represented by the face position and/or the body position corresponding to the person, so that at least one person included in the scene image may be determined.
For example, the positions of the persons may be determined by the positions of the faces, where the position of a face refers to the center position of its face recognition box; assuming that the face positions include A1, A2, and A3, it may be determined that the scene image includes 3 persons, located at A1, A2, and A3, respectively.
In addition, in the embodiment of the present disclosure, a plurality of feature vectors may be determined according to the associated face position and body position, the feature vectors being obtained respectively for the preset action types, and further, the target action type of each of the at least one person may be determined according to the feature vectors. The target action type may be at least one of the preset action types. Assuming that the number of preset action types is n, n channels are required to correspond respectively to the different preset action types. The preset action types include action types that a person may perform as well as a type indicating that the person performs no action.
For example, the dimension of the feature map is 80 × 60, a first feature map with dimensions of 80 × 60 × 6 is obtained after the face position and the body position of each person are determined, and a second feature map with dimensions of 80 × 60 × 2 is obtained after the associated face position and body position are determined. From the second feature map, a third feature map with dimensions of 80 × 60 × n needs to be determined, and the final target action type is determined according to the third feature map.
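For illustration, the per-pixel outputs described above could be produced from the feature map by simple 1 × 1 convolution heads, as in the following hedged PyTorch sketch; the class and head names, the channel ordering and the default channel counts are assumptions for demonstration and are not taken from the present disclosure.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Illustrative 1x1-conv heads over the 80x60 feature map (channel counts from the example)."""

    def __init__(self, in_channels: int = 64, num_action_types: int = 5):
        super().__init__()
        # 6 channels: face center, face length, face width, body center, body length, body width
        self.localization = nn.Conv2d(in_channels, 6, kernel_size=1)
        # 2 channels: face position and the body position associated with it
        self.association = nn.Conv2d(in_channels, 2, kernel_size=1)
        # n channels: one per preset action type (including the "no action" type)
        self.action = nn.Conv2d(in_channels, num_action_types, kernel_size=1)

    def forward(self, feature_map: torch.Tensor):
        return (
            self.localization(feature_map),  # first feature map,  shape (N, 6, 60, 80)
            self.association(feature_map),   # second feature map, shape (N, 2, 60, 80)
            self.action(feature_map),        # third feature map,  shape (N, n, 60, 80)
        )

first_map, second_map, third_map = DetectionHeads()(torch.randn(1, 64, 60, 80))
```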
In the above embodiment, the face position of each person and the body position of each person may be determined on the feature map, and further, the face position and the body position belonging to the same person are associated, so that at least one person included in the scene image and the target action type corresponding to each person in the at least one person are determined according to the associated face position and body position. In this process, even if a plurality of persons are included in the scene image, the target action type corresponding to each person can be quickly determined, which reduces the requirement on the computing capability of the device, shortens the action recognition duration, and improves the competitiveness of the device.
In some alternative embodiments, such as shown in FIG. 3, step 102-2 may include:
in steps 102-21, a reference body position corresponding to each face position is determined.
In the embodiment of the present disclosure, the most probable center position of the human body recognition frame corresponding to the center position of each human face recognition frame may be predicted according to the center position of the human face recognition frame of each person, and the predicted position is used as the reference human body position.
In step 102-22, the associated face position and body position are determined based on the reference body position and body position.
In the embodiment of the present disclosure, each reference human body position may correspond to a human body position, and the human face position and the human body position corresponding to the same reference human body position are associated.
In the above embodiment, the reference human body position corresponding to each human face position may be determined according to the human face position of each person, so that the human face position and the human body position are associated, and the method is simple and convenient to implement and high in usability.
In some alternative embodiments, such as shown in FIG. 4, steps 102-21 may include:
in step 201, a first coordinate value corresponding to each face position is determined on the scene image.
In the embodiment of the present disclosure, the face position of each person has been previously determined on the feature map corresponding to the scene image, where the face position may be represented by the center position of the face recognition box. Then, in the image coordinate system corresponding to the feature map, a coordinate value corresponding to the center position of each face recognition frame may be determined, where the coordinate value is the first coordinate value.
In step 202, a second coordinate value is determined according to a preset vector and the first coordinate value, respectively.
In the embodiment of the present disclosure, the preset vector is a preset vector pointing to the position of the human body from the position of the human face, for example, as shown in fig. 5, the preset vector may point to the center position of the human body recognition frame from the center position of the human face recognition frame. Then, a second coordinate value can be determined according to the first coordinate value of the face position and the preset vector.
In step 203, the second coordinate value is used as the reference human body position.
In the embodiment of the present disclosure, the second coordinate value is directly used as the reference human body position.
In the above embodiment, the reference human body position corresponding to each human face position may be determined according to the human face position and the preset vector of each person, so that the human face position and the human body position are subsequently associated, and the usability is high.
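A minimal NumPy sketch of steps 201 to 203 is given below; the concrete face coordinates and the value of the preset vector are hypothetical and chosen only for demonstration.

```python
import numpy as np

# First coordinate values: face recognition box centers in the feature-map coordinate system
# (hypothetical values for persons A1 and A2).
face_positions = np.array([[12.0, 20.0],   # A1
                           [55.0, 18.0]])  # A2

# Preset vector pointing from a face position towards the corresponding body position;
# the value below is an assumption chosen only for demonstration.
preset_vector = np.array([0.0, 14.0])

# Second coordinate values, used directly as the reference body positions (step 203).
reference_body_positions = face_positions + preset_vector
print(reference_body_positions)  # [[12. 34.] [55. 32.]]
```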
In some alternative embodiments, steps 102-22 may include:
and taking the human body position with the minimum distance from the reference human body position, together with the human face position corresponding to the reference human body position, as the associated human face position and human body position.
In the embodiment of the present disclosure, after the reference human body position is determined, a human body position closest to the reference human body position in the plurality of human body positions and a human face position corresponding to the reference human body position are the human face position and the human body position belonging to the same person. Accordingly, the associated face position and body position are obtained.
For example, the reference human body positions include C1 and C2, where C1 is determined from the face position A1 and C2 is determined from the face position A2. The human body positions comprise B1 and B2, wherein the position closest to C1 among the human body positions is B2, and the position closest to C2 is B1. Therefore, it can be determined that A1 and B2 have an association relationship and A2 and B1 have an association relationship.
In the above embodiment, a human body position closest to each reference human body position may be determined among a plurality of human body positions, and the human body position and the face position corresponding to the determined reference human body position are the associated human body position and face position, which is simple and convenient to implement and high in usability.
In the embodiment of the present disclosure, it should be noted that a reference face position corresponding to each human body position may also be determined according to the human body position of each person and another preset vector, and further, the face position with the minimum distance from the reference face position and the human body position corresponding to the reference face position are taken as the face position and the human body position with the association relationship. Wherein, another preset vector may be a preset vector pointing to the face position from the body position. The method for determining the position of the reference face is the same as the method for determining the position of the reference human body, and is not described herein again.
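Continuing the illustrative sketch above, the following NumPy example associates each face position with the detected human body position closest to its reference human body position (the face-to-body direction described in this embodiment); the greedy nearest-neighbour rule and the example coordinates are assumptions for demonstration.

```python
import numpy as np

def associate(reference_body_positions, body_positions):
    """Pair each face (via its reference body position) with the nearest detected body position."""
    pairs = {}
    for face_index, reference in enumerate(reference_body_positions):
        distances = np.linalg.norm(body_positions - reference, axis=1)
        pairs[face_index] = int(np.argmin(distances))  # face index -> associated body index
    return pairs

face_positions = np.array([[12.0, 20.0], [55.0, 18.0]])            # A1, A2 (hypothetical)
reference_body_positions = face_positions + np.array([0.0, 14.0])  # preset vector, as above
body_positions = np.array([[54.0, 33.0], [13.0, 35.0]])            # B1, B2 (hypothetical)

print(associate(reference_body_positions, body_positions))
# {0: 1, 1: 0}, i.e. A1 is associated with B2 and A2 with B1, as in the example above
```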
In some alternative embodiments, such as shown in FIG. 6, step 102-3 comprises:
In steps 102-31, at least one of the associated face position and body position is determined as the position where each person included in the scene image is located, thereby determining the at least one person included in the scene image.
The position of each person can be represented by the face position and/or the body position corresponding to the person, so that the person included in the scene image can be determined.
In steps 102-32, a plurality of feature vectors are determined according to the associated face position and the body position.
In the embodiment of the present disclosure, according to preset action types, feature vectors corresponding to at least one preset action type and pointing to the associated human body position from the human face position are respectively determined, so as to obtain the plurality of feature vectors corresponding to the same person.
In steps 102-33, the target action type corresponding to each person is determined based on the plurality of feature vectors.
In the embodiment of the present disclosure, the most likely action type to be performed by the person may be determined according to the plurality of feature vectors, and this action type may be used as the target action type.
In some alternative embodiments, such as shown in FIG. 7, steps 102-33 may include:
in step 301, the feature vectors corresponding to each person are normalized to obtain a normalized value corresponding to each feature vector.
In the embodiment of the present disclosure, a normalization function, for example, a softmax function, may be used to normalize the plurality of feature vectors corresponding to each person, so as to obtain a normalization value corresponding to each feature vector.
In step 302, the feature vector corresponding to the maximum normalized value of each person is used as the target feature vector of each person.
In the embodiment of the present disclosure, after normalizing the plurality of feature vectors corresponding to each person, the feature vector corresponding to the maximum normalized value is used as the target feature vector of each person.
In step 303, the action type corresponding to the target feature vector is used as the target action type corresponding to each person.
The action type corresponding to the target feature vector is the most likely action type of the person, and accordingly, can be used as the target action type of the person.
In the above embodiment, the most likely motion type of each person is determined by performing normalization processing on a plurality of feature vectors of each person, and the motion type is taken as the target motion type, so that the purpose of identifying the motion of the object is achieved.
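As an illustrative sketch of steps 301 to 303, assuming that the scores of the n feature vectors of one person are available as a plain array, the normalization and selection could look as follows; the action type names and score values are hypothetical.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    exp = np.exp(scores - scores.max())  # subtract the maximum for numerical stability
    return exp / exp.sum()

# Hypothetical preset action types and per-type scores for one person, e.g. read from
# the n action channels at that person's position.
action_types = ["raise hand", "stand up", "write with head down", "no action"]
scores = np.array([2.3, 0.4, -1.0, 0.8])

normalized = softmax(scores)                     # normalized value of each feature vector
target_index = int(np.argmax(normalized))        # feature vector with the maximum value
target_action_type = action_types[target_index]  # -> "raise hand"
print(target_action_type)
```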
In some optional embodiments, after the scene image is acquired, the scene image may be input into a pre-trained object detection model, a target position of each part of each object is determined on the scene image by the object detection model, then the target positions of different parts belonging to the same object are associated, and according to the target positions of the associated different parts, at least one object included in the scene image and a target action type of each object in the at least one object are determined.
The structure of the object detection model is shown in fig. 8, for example, after a scene image is acquired, the scene image is input into the object detection model, and the object detection model firstly performs feature extraction on the scene image by using a backbone to obtain a feature map.
Further, the object detection model determines the face position of each person and the body position of each person on the feature map through the positioning branch.
Still further, the object detection model determines the face position and the body position that belong to the same object through an association branch. And determining at least one person included in the scene image and a target action type corresponding to each person in the at least one person according to the associated face position and the body position through an action recognition branch.
The final object detection model may output the motion detection result, which includes at least one person included in the scene image and a target motion type corresponding to each of the at least one person.
In some optional embodiments, the object detection model may also directly output a target image, and at least one object included in the scene image and a target action type of each object in the at least one object may be simultaneously identified on the target image, so that an object detection result may be more intuitively reflected.
In the embodiment, different part detection of the object, association of different parts in the same object, and motion recognition of the object can be performed on the scene image, so that at least one object included in the scene image and the target motion type of each object in the at least one object are determined, the motion recognition duration is unrelated to the number of the objects included in the scene image, the increase of the calculation duration due to the increase of the number of the objects is avoided, the calculation resources are greatly saved, the duration of the motion recognition is shortened, and the detection efficiency is effectively improved.
In some optional embodiments, it is considered that there are few ideal sample image sets labeled with all of the face position label, the body position label, the association relation label between the face position and the body position, and the action recognition label between the body position and the action type, and that labeling the missing labels for a sample image set carrying only some of the labels takes considerable time.
To solve this problem, in an embodiment of the present disclosure, for example as shown in fig. 9, the method may further include:
in step 100-1, the type of label in the sample image set is determined.
In the embodiment of the present disclosure, an existing sample image set is adopted, and the label types in the sample image set include at least one of a face position label, a body position label, an association relation label between the face position and the body position, and an action recognition label between the body position and the action type.
In step 100-2, the sample image set is adopted to train branches corresponding to the label types in a preset model respectively, and the object detection model is obtained.
In the embodiment of the present disclosure, the structure of the preset model may also include a positioning branch, an association branch, and an action recognition branch as shown in fig. 8. And respectively training branches corresponding to the label types in a preset model by adopting a sample image set, and obtaining a trained object detection model under the condition that the loss function of the corresponding branch is minimum.
The positioning branches may further include a face positioning branch and a body positioning branch (not shown in fig. 9).
For example, as shown in fig. 10, if the label types in the sample image set include only face position labels, the sample image set is used to train the face positioning branch among the positioning branches of the preset model, and no processing is done for the other branches in each training iteration. That is, the total loss determined each time equals the first loss function, and the second, third and fourth loss functions may be set to 0, for example.
And if the label type in the sample image set only comprises the human body position label, training the human body positioning branch in the positioning branches of the preset model by adopting the sample image set. If the label types in the sample image set comprise the face position label and the body position label at the same time, the sample image set can be adopted to directly train the positioning branch.
If the label type in the sample image set only includes the association relation label, the sample image set can be adopted to train the association branch of the preset model, and the loss functions corresponding to other branches are 0.
Similarly, if the type of the label in the sample image set includes only the motion recognition label, the sample image set may be used to train the motion recognition branch of the preset model, and the loss function corresponding to other branches may be 0, for example.
If the label categories in the sample image set are two or more, the sample image set may be used to train the corresponding branches of the preset model, and the loss functions corresponding to the other branches may be 0, for example.
In the above embodiment, the sample image set is adopted, and the branches corresponding to the label types of the sample image set in the preset model are trained respectively to obtain the object detection model, so that the detection performance and the generalization performance of the object detection model are improved.
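The following hedged Python sketch illustrates how per-branch losses might be masked according to which label types a training sample carries; the branch names, the placeholder mean-squared-error criterion and the data layout are assumptions for demonstration and do not reproduce the actual training procedure of the present disclosure.

```python
import torch
import torch.nn.functional as F

def total_loss(predictions: dict, labels: dict) -> torch.Tensor:
    """Sum the losses of the branches whose label type is present; absent branches contribute 0."""
    loss = torch.zeros(())
    for branch in ("face", "body", "association", "action"):
        if branch in labels:  # only the branches with labels are trained on this sample
            loss = loss + F.mse_loss(predictions[branch], labels[branch])
    return loss

# Example: a sample labelled only with face positions trains only the face positioning
# branch; the losses of the other branches are effectively set to 0.
predictions = {b: torch.randn(4) for b in ("face", "body", "association", "action")}
labels = {"face": torch.randn(4)}
print(total_loss(predictions, labels))
```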
In some alternative embodiments, such as shown in fig. 11, the method may further include:
in step 103, the accumulated detection result of the motion matched with the target motion type made by each object in a set time period is determined.
In the disclosed embodiment, the scene image comprises a scene image collected in a classroom, the object comprises a teaching object, the target action type comprises at least one action type in a teaching task, and the action types matched with the teaching task include, but are not limited to, raising a hand, interacting with the teacher, standing up to answer questions, focusing on the blackboard, writing with the head down, and the like.
For example, in a classroom, a camera-equipped teaching multimedia device, including but not limited to a teaching projector, a monitoring device in the classroom, etc., deployed in the classroom may be employed to acquire images of a scene captured in the classroom. A classroom scene image is determined to include at least one teaching object, which can be a student, and a target action type for each teaching object.
Further, the accumulated detection result of the actions made by each teaching object, for example each student, that match the target action type may be determined within a set period of time, for example the duration of a class taught by the teacher. For example, it may be determined how many times each student raised a hand during the class, how long they focused on the blackboard, how long they wrote with the head down, how many times they stood up to answer questions, how many times they interacted with the teacher, and so on. The results may be displayed by the teaching multimedia device to help the teacher better carry out the teaching task.
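As an illustrative sketch only, the accumulation over a set time period could be implemented as a simple per-student counter over frame-level recognition results; the student identifiers, action names, and data layout below are hypothetical assumptions, not part of the disclosed method.

```python
from collections import defaultdict

# Assumed target action types for the teaching task.
TARGET_ACTIONS = {"raise_hand", "stand_to_answer", "interact_with_teacher"}

def accumulate(frame_results, counters=None):
    """frame_results: iterable of (student_id, action_type) pairs for one frame."""
    if counters is None:
        counters = defaultdict(lambda: defaultdict(int))
    for student_id, action in frame_results:
        if action in TARGET_ACTIONS:
            counters[student_id][action] += 1
    return counters

# Usage over a set time period (e.g. one class):
counters = None
for frame in [[("s1", "raise_hand"), ("s2", "write_head_down")],
              [("s1", "raise_hand"), ("s2", "raise_hand")]]:
    counters = accumulate(frame, counters)
print(dict(counters["s1"]))  # {'raise_hand': 2}
```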
Corresponding to the foregoing method embodiments, the present disclosure also provides embodiments of an apparatus.
As shown in fig. 12, fig. 12 is a block diagram of a motion recognition device according to an exemplary embodiment of the present disclosure, the device including: an image acquisition module 410, configured to acquire a scene image; and an action recognition module 420, configured to perform, on the scene image, detection of different parts of an object, association of different parts belonging to the same object, and action recognition of the object, and to determine at least one object included in the scene image and a target action type of each object in the at least one object.
In some optional embodiments, the object comprises a person, and the different parts of the object comprise a face and a body of the person; the action recognition module includes: the first determining submodule is used for determining the face position of each person and the human body position of each person on the scene image; the association submodule is used for associating the human face position and the human body position which belong to the same person; and the second determining sub-module is used for determining at least one person included in the scene image and the target action type of each person in the at least one person according to the associated face position and the associated body position.
In some optional embodiments, the association sub-module comprises: a first determination unit for determining a reference human body position corresponding to each face position; and the association unit is used for associating the human face position and the human body position belonging to the same person according to the reference human body position and the human body position.
In some optional embodiments, the first determining unit is configured to: determine a first coordinate value corresponding to each face position on the scene image; determine a second coordinate value according to a preset vector and the first coordinate value, respectively, the preset vector being a vector pointing from the face position to the human body position; and take the second coordinate value as the reference human body position.
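A minimal sketch of this step is given below, assuming the first coordinate value is the center of each detected face box and that the preset vector is expressed in pixels; the concrete vector value is an arbitrary illustration.

```python
import numpy as np

# Assumed preset vector pointing from the face position toward the human body
# position; in practice it would be chosen to suit the camera setup at hand.
PRESET_VECTOR = np.array([0.0, 80.0])

def reference_body_positions(face_centers: np.ndarray) -> np.ndarray:
    """face_centers: (N, 2) array of first coordinate values (x, y)."""
    # Second coordinate values, used as the reference human body positions.
    return face_centers + PRESET_VECTOR
```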
In some optional embodiments, the associating unit is configured to take the human body position with the minimum distance to the reference human body position, together with the face position corresponding to that reference human body position, as the associated face position and human body position.
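The association could be sketched as a simple nearest-neighbour assignment between reference body positions and detected body positions, as below; this greedy matching is an assumption made for illustration and does not resolve conflicts where two faces pick the same body.

```python
import numpy as np

def associate(reference_positions: np.ndarray, body_positions: np.ndarray):
    """Return (face_index, body_index) pairs by minimum Euclidean distance."""
    pairs = []
    for face_idx, ref in enumerate(reference_positions):
        dists = np.linalg.norm(body_positions - ref, axis=1)
        pairs.append((face_idx, int(dists.argmin())))
    return pairs
```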
In some optional embodiments, the second determining sub-module comprises: a second determining unit, configured to determine at least one of the face position and the human body position as the position of each person included in the scene image, thereby determining the at least one person included in the scene image; a third determining unit, configured to determine a plurality of feature vectors according to the associated face position and human body position; and a fourth determining unit, configured to determine the target action type of each of the at least one person based on the plurality of feature vectors.
In some optional embodiments, the third determining unit is configured to determine the feature vectors that respectively correspond to at least one preset action type and that point from the face position to the associated human body position, so as to obtain the plurality of feature vectors.
In some optional embodiments, the fourth determining unit is configured to: normalize the plurality of feature vectors corresponding to each person to obtain a normalized value of each feature vector; take the feature vector with the maximum normalized value of each person as the target feature vector of that person; and take the action type corresponding to the target feature vector as the target action type of that person.
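The selection step could look like the sketch below, where each per-action feature vector is first reduced to a scalar score and the normalization is assumed to be a softmax; both choices are illustrative assumptions rather than the disclosed implementation.

```python
import numpy as np

def target_action(action_vectors: np.ndarray, action_names: list) -> str:
    """action_vectors: (num_action_types, dim) feature vectors for one person."""
    scores = np.linalg.norm(action_vectors, axis=1)   # one score per vector
    exp = np.exp(scores - scores.max())               # assumed softmax normalization
    normalized = exp / exp.sum()
    return action_names[int(normalized.argmax())]     # maximum normalized value
```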
In some optional embodiments, the action recognition module comprises: the second association submodule is used for associating the target positions of different parts belonging to the same object after determining the target position of each part of each object on the scene image through the object detection model; and the third determining sub-module is used for determining at least one object included in the scene image and a target action type of each object in the at least one object according to the target positions of the associated different parts through the object detection model.
In some optional embodiments, the apparatus further comprises: a label type determining module, configured to determine the type of labels in the sample image set, the label type comprising at least one of a face position label, a human body position label, an association relation label between the face position and the human body position, and an action identification label between the human body position and the action type; and a training module, configured to respectively train the branches corresponding to the label types in a preset model by using the sample image set to obtain the object detection model.
In some optional embodiments, the apparatus further comprises: and the matching determination module is used for determining the accumulated detection result of the action which is made by each object within a set time period and is matched with the target action type.
In some optional embodiments, the scene image comprises a scene image captured in a classroom, the object comprises a teaching object, and the target action type comprises at least one action type in a teaching task.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the disclosure also provides a computer-readable storage medium, which stores a computer program for executing any one of the above-mentioned motion recognition methods.
In some optional embodiments, the disclosed embodiments provide a computer program product comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the action recognition method provided in any one of the above embodiments.
In some optional embodiments, the present disclosure further provides another computer program product for storing computer readable instructions, which when executed, cause a computer to perform the operations of the motion recognition method provided in any one of the above embodiments.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK).
The embodiment of the present disclosure further provides an action recognition apparatus, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to call executable instructions stored in the memory to implement any of the above-described action recognition methods.
Fig. 13 is a schematic diagram of a hardware structure of an action recognition device according to an embodiment of the present disclosure. The motion recognition device 510 includes a processor 511 and may further include an input device 512, an output device 513, and a memory 514. The input device 512, the output device 513, the memory 514 and the processor 511 are connected to each other via a bus.
The memory includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM), and is used for storing instructions and data.
The input means are for inputting data and/or signals and the output means are for outputting data and/or signals. The output means and the input means may be separate devices or may be an integral device.
The processor may include one or more processors, for example, one or more Central Processing Units (CPUs), and in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The memory is used to store the program code and data of the device.
The processor is used for calling the program codes and data in the memory and executing the steps in the method embodiment. Specifically, reference may be made to the description of the method embodiment, which is not repeated herein.
It will be appreciated that fig. 13 only shows a simplified design of the motion recognition device. In practical applications, the motion recognition device may also include other necessary components, including but not limited to any number of input/output devices, processors, controllers, memories, and the like; all motion recognition devices that can implement the embodiments of the present disclosure fall within the scope of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
The above description is only exemplary of the present disclosure and should not be taken as limiting the disclosure, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (15)

1. A motion recognition method, comprising:
acquiring a scene image;
detecting different parts of an object, associating different parts in the same object and identifying the motion of the object on the scene image, and determining at least one object included in the scene image and a target motion type of each object in the at least one object.
2. The method of claim 1, wherein the object comprises a human figure, and the different parts of the object comprise a human face and a human body of the human figure;
the performing different part detection of an object, association of different parts in the same object, and motion recognition of the object on the scene image, and determining at least one object included in the scene image and a target motion type of each object in the at least one object, includes:
determining the face position of each person and the body position of each person on the scene image;
associating the face position and the body position belonging to the same person;
and determining at least one person included in the scene image and the target action type of each person in the at least one person according to the associated face position and the body position.
3. The method of claim 2, wherein associating the face position and the body position belonging to the same person comprises:
determining a reference human body position corresponding to each human face position;
and associating the human face position and the human body position belonging to the same person according to the reference human body position and the human body position.
4. The method of claim 3, wherein determining the reference body position corresponding to each face position comprises:
determining a first coordinate value corresponding to each face position on the scene image;
respectively determining a second coordinate value according to a preset vector and the first coordinate value; the preset vector is a vector pointing to the position of the human body from the position of the face;
and taking the second coordinate value as the reference human body position.
5. The method according to claim 3 or 4, wherein the associating the face position and the body position belonging to the same person according to the reference body position and the body position comprises:
and taking the human body position with the minimum distance from the reference human body position and the human face position corresponding to the reference human body position as the associated human face position and human body position.
6. The method according to any one of claims 2-5, wherein the determining the at least one person included in the scene image and the target action type of each person of the at least one person according to the associated face position and the body position comprises:
determining at least one of the face position and the human body position as the position of each person included in the scene image, so as to determine the at least one person included in the scene image;
determining a plurality of feature vectors according to the associated face position and the body position;
determining the target action type for each of the at least one person based on the plurality of feature vectors.
7. The method of claim 6, wherein determining a plurality of feature vectors based on the associated face and body positions comprises:
and determining the characteristic vectors which respectively correspond to at least one preset action type and point to the associated human body position by the human face position to obtain the plurality of characteristic vectors.
8. The method of claim 6 or 7, wherein the determining the target action type for each of the at least one person based on the plurality of feature vectors comprises:
normalizing the plurality of feature vectors corresponding to each person to obtain a normalized value of each feature vector;
taking the feature vector corresponding to the maximum normalization value of each person as the target feature vector of each person;
and taking the action type corresponding to the target feature vector as the target action type of each person.
9. The method according to any one of claims 1 to 8, wherein the performing, on the scene image, different part detection of an object, association of different parts in the same object, and motion recognition of the object, and determining at least one object included in the scene image and a target motion type of each object in the at least one object comprises:
determining a target position of each part of each object on the scene image through an object detection model, and then associating the target positions of different parts belonging to the same object;
and determining at least one object included in the scene image and a target action type of each object in the at least one object according to the target positions of the associated different parts through the object detection model.
10. The method of claim 9, further comprising:
determining a label type in the sample image set; the label type comprises at least one of a face position label, a human body position label, an association relation label between the face position and the human body position, and an action identification label between the human body position and the action type;
and respectively training branches corresponding to the label types in a preset model by adopting the sample image set to obtain the object detection model.
11. The method according to any one of claims 1-10, further comprising:
and determining the accumulated detection result of the action which is made by each object in a set time period and is matched with the target action type.
12. The method of claim 11, wherein the scene image comprises a scene image captured in a classroom, the object comprises a teaching object, and the target action type comprises at least one action type in a teaching task.
13. An action recognition device, characterized in that the device comprises:
the image acquisition module is used for acquiring a scene image;
and the action recognition module is used for detecting different parts of the object, associating different parts in the same object and recognizing the action of the object on the scene image, and determining at least one object included in the scene image and the target action type of each object in the at least one object.
14. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the motion recognition method according to any one of claims 1 to 12.
15. An action recognition device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to invoke executable instructions stored in the memory to implement the action recognition method of any of claims 1-12.
CN202010196461.6A 2020-03-19 2020-03-19 Action recognition method and device, and storage medium Pending CN113496143A (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN202010196461.6A CN113496143A (en) 2020-03-19 2020-03-19 Action recognition method and device, and storage medium
KR1020227003914A KR20220027241A (en) 2020-03-19 2021-03-18 Motion recognition method, device and storage medium
TW110109832A TWI776429B (en) 2020-03-19 2021-03-18 Action recognition method and device, computer readable storage medium
JP2022506372A JP2022543032A (en) 2020-03-19 2021-03-18 Motion recognition method, motion recognition device, computer-readable storage medium, electronic device and computer program product
PCT/CN2021/081556 WO2021185317A1 (en) 2020-03-19 2021-03-18 Action recognition method and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010196461.6A CN113496143A (en) 2020-03-19 2020-03-19 Action recognition method and device, and storage medium

Publications (1)

Publication Number Publication Date
CN113496143A true CN113496143A (en) 2021-10-12

Family

ID=77770162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010196461.6A Pending CN113496143A (en) 2020-03-19 2020-03-19 Action recognition method and device, and storage medium

Country Status (5)

Country Link
JP (1) JP2022543032A (en)
KR (1) KR20220027241A (en)
CN (1) CN113496143A (en)
TW (1) TWI776429B (en)
WO (1) WO2021185317A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463850B (en) * 2022-02-08 2022-12-20 南京科源视觉技术有限公司 Human body action recognition system suitable for multiple application scenes


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7110569B2 (en) * 2001-09-27 2006-09-19 Koninklijke Philips Electronics N.V. Video based detection of fall-down and other events
JP4479194B2 (en) * 2003-08-29 2010-06-09 富士ゼロックス株式会社 Motion identification device and object posture identification device
CN102179048A (en) * 2011-02-28 2011-09-14 武汉市高德电气有限公司 Method for implementing realistic game based on movement decomposition and behavior analysis
US10037458B1 (en) * 2017-05-02 2018-07-31 King Fahd University Of Petroleum And Minerals Automated sign language recognition
CN108229324B (en) * 2017-11-30 2021-01-26 北京市商汤科技开发有限公司 Gesture tracking method and device, electronic equipment and computer storage medium
CN109829435B (en) * 2019-01-31 2023-04-25 深圳市商汤科技有限公司 Video image processing method, device and computer readable medium
CN110347246B (en) * 2019-06-19 2023-07-18 达闼机器人股份有限公司 Man-machine interaction method and device, storage medium and electronic equipment
CN110647807A (en) * 2019-08-14 2020-01-03 中国平安人寿保险股份有限公司 Abnormal behavior determination method and device, computer equipment and storage medium
CN110781843B (en) * 2019-10-29 2022-11-04 首都师范大学 Classroom behavior detection method and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659544A (en) * 2018-06-28 2020-01-07 华南农业大学 Milk cow behavior identification method based on nonparametric spatiotemporal context trajectory model
CN108960209A (en) * 2018-08-09 2018-12-07 腾讯科技(深圳)有限公司 Personal identification method, device and computer readable storage medium
CN110135246A (en) * 2019-04-03 2019-08-16 平安科技(深圳)有限公司 A kind of recognition methods and equipment of human action

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wen Qian: "Research and Implementation of a Human Pose Estimation System Based on Convolutional Neural Networks", Master's Theses Electronic Journals, pages 1-77 *

Also Published As

Publication number Publication date
JP2022543032A (en) 2022-10-07
KR20220027241A (en) 2022-03-07
TW202139061A (en) 2021-10-16
WO2021185317A1 (en) 2021-09-23
TWI776429B (en) 2022-09-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40050702
Country of ref document: HK