CN113496143B - Action recognition method and device and storage medium - Google Patents
- Publication number: CN113496143B
- Application number: CN202010196461.6A
- Authority: China (CN)
- Prior art keywords: human body, determining, scene image, body position, person
- Legal status: Active
Classifications
- G06F18/2411—Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
- G06F18/00—Pattern recognition
- G06V40/20—Recognition of movements or behaviour, e.g. gesture recognition
- G06V40/165—Detection, localisation or normalisation of faces using facial parts and geometric relationships
- G06V10/469—Contour-based spatial representations, e.g. vector-coding
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
- G06V10/82—Image or video recognition or understanding using neural networks
Abstract
The present disclosure provides an action recognition method, an action recognition apparatus, and a storage medium. The method includes: acquiring a scene image; and performing, on the scene image, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects, thereby determining at least one object included in the scene image and a target action type of each object in the at least one object.
Description
Technical Field
The present disclosure relates to the field of computer vision, and in particular to an action recognition method and apparatus, and a storage medium.
Background
Currently, there is an increasing need to analyze the actions of objects through computer vision techniques. In a typical action recognition pipeline, object positions are obtained through object detection, each object is cropped from the image according to its position, and the cropped object is input into an action classification network to obtain an action recognition result.
The processing time of such action recognition is linearly related to the number of objects in the scene. For example, if the scene includes N objects, where N is a positive integer, the action classification network needs to perform N inference passes, so the action recognition time grows roughly N-fold; the larger N is, the longer the action recognition takes, which requires higher computing power from the device and incurs longer processing time.
Disclosure of Invention
The disclosure provides an action recognition method and device and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided an action recognition method, the method comprising: acquiring a scene image; and performing, on the scene image, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects, to determine at least one object included in the scene image and a target action type of each object in the at least one object.
In some alternative embodiments, the object comprises a person, and the different parts of the object comprise a face and a human body of the person; the performing, on the scene image, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects, and determining at least one object included in the scene image and the target action type of each object in the at least one object comprises: determining the face position of each person and the human body position of each person on the scene image; associating the face position and the human body position belonging to the same person; and determining, according to the associated face position and human body position, at least one person included in the scene image and the target action type of each person in the at least one person.
In some optional embodiments, the associating the face position and the human body position belonging to the same person includes: determining a reference human body position corresponding to each face position; and associating, according to the reference human body position and the human body position, the face position and the human body position belonging to the same person.
In some optional embodiments, the determining a reference human body position corresponding to each face position includes: determining a first coordinate value corresponding to each face position on the scene image; determining a second coordinate value according to a preset vector and the first coordinate value, where the preset vector points from the face position to the human body position; and taking the second coordinate value as the reference human body position.
In some optional embodiments, the associating, according to the reference human body position and the human body position, the face position and the human body position belonging to the same person includes: taking the human body position with the smallest distance from the reference human body position, together with the face position corresponding to that reference human body position, as the associated face position and human body position.
In some optional embodiments, the determining, according to the associated face position and human body position, the at least one person included in the scene image and the target action type of each person in the at least one person includes: determining the at least one person included in the scene image by using at least one of the associated face position and human body position as the position of each person included in the scene image; determining a plurality of feature vectors according to the associated face position and human body position; and determining the target action type of each person in the at least one person based on the plurality of feature vectors.
In some optional embodiments, the determining a plurality of feature vectors according to the associated face position and human body position includes: determining feature vectors which respectively correspond to at least one preset action type and point from the face position to the associated human body position, to obtain the plurality of feature vectors.
In some optional embodiments, the determining the target action type of each person in the at least one person based on the plurality of feature vectors includes: normalizing the plurality of feature vectors corresponding to each person to obtain a normalized value of each feature vector; taking the feature vector corresponding to the maximum normalized value of each person as the target feature vector of that person; and taking the action type corresponding to the target feature vector as the target action type of that person.
In some optional embodiments, the detecting different parts of the object, associating different parts in the same object and identifying actions of the object for the scene image, determining at least one object included in the scene image and a target action type of each object in the at least one object includes: after determining the target position of each part of each object on the scene image through an object detection model, correlating the target positions of different parts belonging to the same object; and determining at least one object included in the scene image and a target action type of each object in the at least one object according to the target positions of the associated different parts through the object detection model.
In some alternative embodiments, the method further comprises: determining the type of a label in the sample image set; the label type comprises at least one of a face position label, a human body position label, an association relation label between the face position and the human body position, and an action identification label between the human body position and the action type; and training branches corresponding to the label types in a preset model by adopting the sample image set to obtain the object detection model.
In some alternative embodiments, the method further comprises: determining the accumulated detection result of the actions which match the target action type and are made by each object within a set time period.
In some alternative embodiments, the scene image comprises a scene image captured in a classroom, the object comprises a teaching object, and the target action type comprises at least one action type in a teaching task.
According to a second aspect of embodiments of the present disclosure, there is provided an action recognition apparatus, the apparatus comprising: an image acquisition module configured to acquire a scene image; and an action recognition module configured to perform, on the scene image, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects, and to determine at least one object included in the scene image and a target action type of each object in the at least one object.
In some alternative embodiments, the object comprises a person, and the different parts of the object comprise a face and a body of the person; the action recognition module includes: the first determining submodule is used for determining the face position of each person and the human body position of each person on the scene image; the association sub-module is used for associating the face position and the human body position of the same person; and the second determining submodule is used for determining at least one person included in the scene image and the target action type of each person in the at least one person according to the associated face position and the associated human body position.
In some alternative embodiments, the association submodule includes: a first determining unit configured to determine a reference human body position corresponding to each human face position; and the association unit is used for associating the human face position and the human body position belonging to the same person according to the reference human body position and the human body position.
In some alternative embodiments, the first determining unit includes: determining a first coordinate value corresponding to each face position on the scene image; respectively determining a second coordinate value according to a preset vector and the first coordinate value; the preset vector points to the position of the human body from the position of the human face; and taking the second coordinate value as the reference human body position.
In some alternative embodiments, the association unit includes: and taking the human body position with the smallest distance from the reference human body position and the human face position corresponding to the reference human body position as the associated human face position and human body position.
In some alternative embodiments, the second determining submodule includes: a second determining unit configured to determine, as a position where each person included in the scene image is located, at least one of the face position and the body position associated therewith, the at least one person included in the scene image; a third determining unit configured to determine a plurality of feature vectors according to the associated face position and the human body position; and a fourth determining unit configured to determine the target action type of each of the at least one person based on the plurality of feature vectors.
In some alternative embodiments, the third determining unit includes: and determining feature vectors which respectively correspond to at least one preset action type and are pointed to the associated human body positions by the human face positions, and obtaining a plurality of feature vectors.
In some alternative embodiments, the fourth determining unit includes: normalizing the plurality of feature vectors corresponding to each person to obtain a normalized value of each feature vector; taking the feature vector corresponding to the maximum normalized value of each person as the target feature vector of that person; and taking the action type corresponding to the target feature vector as the target action type of that person.
In some alternative embodiments, the action recognition module includes: the second association submodule is used for associating the target positions of different parts belonging to the same object after determining the target position of each part of each object on the scene image through the object detection model; and a third determining sub-module, configured to determine, according to the target positions of the associated different parts by using the object detection model, at least one object included in the scene image and a target action type of each object in the at least one object.
In some alternative embodiments, the apparatus further comprises: the label type determining module is used for determining the label types in the sample image set; the label type comprises at least one of a face position label, a human body position label, an association relation label between the face position and the human body position, and an action identification label between the human body position and the action type; and the training module is used for training branches corresponding to the label types in a preset model by adopting the sample image set to obtain the object detection model.
In some alternative embodiments, the apparatus further comprises: and the matching determining module is used for determining the accumulated detection result of the actions which are matched with the target action type and are made by each object in a set time period.
In some alternative embodiments, the scene image comprises a scene image captured in a classroom, the object comprises a teaching object, and the target action type comprises at least one action type in a teaching task.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the action recognition method of any one of the first aspects.
According to a fourth aspect of embodiments of the present disclosure, there is provided an action recognition apparatus, comprising: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to invoke the executable instructions stored in the memory to implement the method of action recognition of any of the first aspects.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
In the embodiments of the present disclosure, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects can be performed on the scene image, so as to determine at least one object included in the scene image and a target action type of each object in the at least one object. The action recognition duration is independent of the number of objects included in the scene image, and an increase in calculation duration caused by an increase in the number of objects is avoided, which greatly saves computing resources, shortens the action recognition duration, and effectively improves detection efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flowchart of a method of action recognition, according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flowchart of another method of action recognition, according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flowchart of another method of action recognition, according to an exemplary embodiment of the present disclosure;
FIG. 4 is a flowchart of another method of action recognition, according to an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram of preset vectors shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 6 is a flowchart of another method of action recognition, according to an exemplary embodiment of the present disclosure;
FIG. 7 is a flowchart of another method of action recognition, according to an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic diagram of an object detection model structure according to an exemplary embodiment of the present disclosure;
FIG. 9 is a flowchart of another method of action recognition, according to an exemplary embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an object detection model training scenario, according to an exemplary embodiment of the present disclosure;
FIG. 11 is a flowchart of another method of action recognition, according to an exemplary embodiment of the present disclosure;
FIG. 12 is a block diagram of an action recognition device according to an exemplary embodiment of the present disclosure;
Fig. 13 is a schematic diagram of a structure for an action recognition device according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
The embodiments of the present disclosure provide an action recognition scheme applicable to terminal devices in different scenes. The different scenes include, but are not limited to, classrooms, places where advertisements are played, or other indoor or outdoor scenes requiring action recognition of at least one object. The terminal device can be any terminal device with a camera, or the terminal device can be connected to an external camera device. The terminal device performs detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects, so as to determine at least one object included in the scene image and the target action type of each object in the at least one object.
For example, in a classroom, a terminal device may employ a camera-equipped teaching multimedia device deployed in the classroom, including but not limited to a teaching projector, a monitoring device in the classroom, and the like. For example, the terminal device acquires a scene image in a classroom, so as to detect different parts of an object, correlate different parts in the same object and identify actions of the object in the classroom, and quickly acquire a detection result, where the detection result may include at least one object included in the scene image and a target action type of each object, and the target action type may include lifting a hand, standing or performing other interactive actions.
For another example, the terminal device may obtain a scene image in an elevator in which an advertisement is playing. By adopting the scheme provided by the embodiments of the present disclosure, the target action type of each object in the elevator while the advertisement is playing can be determined, where the target action type may include, but is not limited to, turning the head, focusing on the advertisement placement position, turning around, and the like.
The action recognition scheme provided by the embodiment of the disclosure may be further applicable to cloud servers in different scenes, the cloud servers may be provided with external cameras, scene images are collected by the external cameras and sent to the cloud servers through routers or gateways, the cloud servers detect different parts of objects, correlate different parts of the same object and recognize actions of the objects, and at least one object included in the scene images and a target action type of each object in the at least one object are determined.
For example, the external camera is arranged in a classroom, after the external camera collects scene images in the classroom, the scene images are sent to the cloud server through the router or the gateway, the cloud server detects different parts of objects, associates different parts in the same object and identifies actions of the objects, and at least one object included in the scene images and the target action type of each object in the at least one object are determined. Further, the cloud server can feed the results back to the corresponding teaching task analysis server according to the needs, so that a teacher is reminded of adjusting teaching contents, and teaching activities can be conducted better.
For another example, at a place where the advertisement is played, assuming that the place is an elevator, the external camera is arranged in the elevator, the external camera collects scene images in the elevator, the scene images can be sent to the cloud server through the router or the gateway, and the cloud server determines at least one object included in the scene images and a target action type of each object in the at least one object. The target action statistical result of the object in the elevator can be fed back to the corresponding advertiser server according to the requirement, and the advertiser adjusts the advertisement content.
In the embodiment of the disclosure, further processing may be performed by the terminal device or the cloud server according to the detection result, for example, outputting a target image, and identifying at least one object included in the scene image and a target action type of each object in the at least one object on the target image, so as to better understand the object under the current scene and the action type of each object.
In addition, the accumulated detection result of the actions which are matched with the target action types and are made by each object included in the scene image in the set time period can be determined through the terminal equipment or the cloud server.
If the scene image comprises a scene image captured in a classroom, the object comprises a teaching object, such as a student, and the target action type may comprise at least one action type in a teaching task.
For example, in a classroom, a teacher is teaching, and the target action types include, but are not limited to, raising a hand, standing up to answer a question, interacting with the teacher, focusing on the blackboard, head-down writing, etc. According to this scheme, during a period of teaching by the teacher, for example during one class, it can be determined how many times each teaching object raises a hand, how long each teaching object focuses on the blackboard, how long each teaching object writes with the head lowered, how many questions each teaching object answers, how many times each teaching object interacts with the teacher, and the like. Further, the terminal device can display the accumulated detection results after obtaining them, so that the teacher can better perform teaching tasks; alternatively, the cloud server can send the accumulated detection results to a designated terminal device for display after obtaining them, which also helps the teacher better perform teaching tasks.
The above are merely examples of scenes to which the present disclosure is applicable; other indoor or outdoor scenes requiring fast action type recognition also fall within the scope of the present disclosure.
For example, as shown in fig. 1, fig. 1 is a flowchart of an action recognition method according to an exemplary embodiment, which includes the following steps:
in step 101, a scene image is acquired.
In an embodiment of the present disclosure, a scene image of a current scene may be acquired, where the scene of the present disclosure includes, but is not limited to, any scene that requires action recognition of an object in the scene, such as a classroom, a place where an advertisement is played, and the like.
In step 102, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects are performed on the scene image, and at least one object included in the scene image and a target action type of each object in the at least one object are determined.
In the embodiments of the present disclosure, the object may include, but is not limited to, a person, and the different parts may include, but are not limited to, a face and a human body. The detection of different parts of objects on the scene image may include detecting the face position and the human body position of each person on the scene image. The association of the different parts belonging to the same object means that the face position and the human body position belonging to the same person need to be associated. The action recognition of the objects may determine a target action type for each person included in the scene image from at least one preset action type.
The preset action types may be set according to scene requirements, and include, but are not limited to, raising a hand, bending down, jumping, turning around, etc.; they may also include a type indicating that no new action is performed, for example, the person keeping the previous action unchanged.
In the above embodiment, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects are performed on the scene image, so as to determine at least one object included in the scene image and a target action type of each object in the at least one object. The action recognition duration is independent of the number of objects included in the scene image, and an increase in calculation duration caused by an increase in the number of objects is avoided, which greatly saves computing resources, shortens the action recognition duration, and improves detection efficiency.
In some alternative embodiments, step 102 may include:
Extracting features of the scene image to obtain a feature map, and then performing, on the feature map, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects.
In the embodiments of the present disclosure, the image features in the scene image can be extracted through a pre-trained neural network model (backbone) to obtain the feature map. The neural network model may employ, but is not limited to, a model such as the Visual Geometry Group network (VGG Net).
The dimension of the feature map obtained by extracting the image features through the neural network model is smaller than the dimension of the scene image. For example, a scene image having dimensions 640×480 is input to the neural network model, and a feature map having dimensions 80×60 can be obtained.
The extracted image features may include, but are not limited to, color features, texture features, shape features, and the like. A color feature is a global feature that describes the surface color attributes of the object corresponding to an image region; a texture feature is also a global feature, describing the surface texture attributes of the object; shape features have two kinds of representation, contour features and region features, where contour features mainly target the outer boundary of the object and region features relate to the shape of the entire image region.
In the embodiments of the present disclosure, the subsequent action recognition is mainly performed on this feature map.
In the above embodiment, after feature extraction is performed on the scene image to obtain the feature map, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects are performed on the feature map, so that at least one object included in the scene image and the target action type of each object in the at least one object can be determined quickly according to the image features; the implementation is simple and convenient, and practicality is high.
In some alternative embodiments, the object includes a person, and the different parts of the object include a face and a body of the person, for example, as shown in fig. 2, step 102 may include:
in step 102-1, a face position of each person and a body position of each person are determined on the scene image.
In the embodiments of the present disclosure, the face region belonging to a face and the human body region belonging to a human body on the feature map corresponding to the scene image can be detected through a Region Proposal Network (RPN). The face region can be identified by a face recognition frame, and the human body region can be identified by a human body recognition frame. Further, the size of the face recognition frame can be determined by the center position, the length, and the width of the face recognition frame, and in the embodiments of the present disclosure, the face position may be represented by the center position of the face recognition frame. Likewise, the size of the human body recognition frame can be determined by the center position, the length, and the width of the human body recognition frame, and the human body position may be represented by the center position of the human body recognition frame.
In the embodiment of the present disclosure, the above-mentioned facial and human position description information may be represented by different channels, respectively. For example, the dimension of the feature map is 80×60, and after the face area and the human body area of each person are determined, a first feature map of 80×60×6 can be obtained, and 6 channels of the first feature map output the center position of the face recognition frame, the length of the face recognition frame, the width of the face recognition frame, the center position of the human body recognition frame, the length of the human body recognition frame, and the width of the human body recognition frame, respectively.
In a possible implementation manner, first feature maps corresponding to two channels at the center position of the face recognition frame and the center position of the human body recognition frame can be obtained, so that the face position and the human body position are respectively determined.
In step 102-2, the face position and the body position belonging to the same person are associated.
In the embodiment of the present disclosure, if the number of persons is plural, after each face position and each body position are determined, it is necessary to correlate the face position and the body position belonging to the same person, thereby obtaining the correlated face position and body position. In the embodiment of the disclosure, all that needs to be associated is the center position of the face recognition frame and the center position of the human body recognition frame.
For example, the central positions of 2 face recognition frames, A1 and A2 respectively, and the central positions of 2 human body recognition frames, B1 and B2 respectively, are determined on the feature map, and the central positions of the face recognition frames and the central positions of the human body recognition frames can be correlated to obtain the central positions of the associated face recognition frames A1 and B2, and the central positions of the associated face recognition frames A2 and B1.
In the embodiment of the present disclosure, the face position and the human body position associated with the face position may be represented by 2 channels, respectively. For example, the dimension of the feature map is 80×60, after the face area and the human body area of each person are determined, a first feature map with the dimension of 80×60×6 is obtained, and further, the face position and the human body position are associated, and a second feature map with the dimension of 80×60×2 is obtained. The second feature map includes 2 channels, one corresponding to the face position of each person and the other corresponding to the body position associated with the face position.
In step 102-3, the target action type of at least one person and each of the at least one person included in the scene image is determined based on the associated face position and the body position.
In the embodiment of the disclosure, the position of each person may be represented by the face position and/or the body position corresponding to the person, so as to determine at least one person included in the scene image.
For example, the position of the person may be determined by the face position, which refers to the center position of the face recognition frame, and assuming that the face position includes A1, A2, and A3, it may be determined that the scene image includes 3 persons, and the positions of each person are A1, A2, and A3.
In addition, in the embodiments of the present disclosure, a plurality of feature vectors may be determined according to the associated face position and human body position, where one feature vector is obtained for each preset action type; further, the target action type of each person in the at least one person may be determined according to these feature vectors. The target action type is at least one of the preset action types. Assuming that the number of preset action types is n, the n channels correspond respectively to the different preset action types. The preset action types include the action types a person may perform, as well as a type indicating that no action is performed.
For example, the dimension of the feature map is 80×60, a first feature map having a dimension of 80×60×6 is obtained after determining the face position and the body position of each person, and a second feature map having a dimension of 80×60×2 is obtained after determining the face position and the body position having an association relationship. From the second feature map, a third feature map having dimensions 80×60×n needs to be determined. And determining the final target action type according to the third characteristic diagram.
In the above embodiment, the face position of each person and the body position of each person may be determined on the feature map, and further, the face position and the body position of the same person may be associated, so that the at least one person included in the scene image and the target action type corresponding to each person in the at least one person may be determined according to the associated face position and body position. In the process, even if the scene image comprises a plurality of people, the target action type corresponding to each person can be rapidly determined, the requirement on the computing capacity of the equipment is reduced, the action recognition time is shortened, and the competitiveness of the equipment is improved.
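To make the tensor bookkeeping in the worked example above concrete, the following minimal numpy sketch lays out the three feature maps and their channel meanings; the channel ordering and the value of n are illustrative assumptions rather than something fixed by this disclosure.

```python
import numpy as np

# Illustrative shapes following the worked example: a 640x480 scene image
# reduced to an 80x60 feature map, with n preset action types (n = 5 here
# is an arbitrary choice).
H, W, n = 60, 80, 5

# First feature map (6 channels): per-location face/body box description.
# Assumed channel order: face center, face length, face width,
# body center, body length, body width.
first_map = np.zeros((H, W, 6), dtype=np.float32)

# Second feature map (2 channels): association output, one channel for the
# face position of each person and one for the associated body position.
second_map = np.zeros((H, W, 2), dtype=np.float32)

# Third feature map (n channels): one channel per preset action type,
# from which the target action type is later read (steps 102-31 to 102-33).
third_map = np.zeros((H, W, n), dtype=np.float32)

print(first_map.shape, second_map.shape, third_map.shape)
# (60, 80, 6) (60, 80, 2) (60, 80, 5)
```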
In some alternative embodiments, such as shown in FIG. 3, step 102-2 may include:
In step 102-21, a reference body position corresponding to each face position is determined.
In the embodiments of the present disclosure, for each face, the most likely center position of the corresponding human body recognition frame may be predicted from the center position of the face recognition frame of that person, and this predicted position is used as the reference human body position.
In step 102-22, the associated face position and body position are determined based on the reference body position and the body position.
In the embodiments of the present disclosure, each reference human body position may be matched to one detected human body position, and the face position and the human body position corresponding to the same reference human body position are then associated.
In the above embodiment, the reference human body position corresponding to each human face position may be determined according to the human face position of each person, so as to correlate the human face position with the human body position, which is simple and convenient to implement and has high usability.
In some alternative embodiments, such as shown in FIG. 4, steps 102-21 may include:
in step 201, a first coordinate value corresponding to each face position is determined on the scene image.
In the embodiment of the present disclosure, the face position of each person has been previously determined on the feature map corresponding to the scene image, where the face position may be represented by the center position of the face recognition frame. Then the coordinate value corresponding to the center position of each face recognition frame can be determined in the image coordinate system corresponding to the feature map, and the coordinate value is the first coordinate value.
In step 202, second coordinate values are respectively determined according to the preset vector and the first coordinate values.
In the embodiment of the present disclosure, the preset vector is a preset vector that points to the position of the human body from the position of the human face, for example, as shown in fig. 5, the preset vector may point to the center position of the human body identification frame from the center position of the human face identification frame. Then a second coordinate value may be determined based on the first coordinate value of the face position and the predetermined vector, respectively.
In step 203, the second coordinate value is taken as the reference human body position.
In the embodiment of the present disclosure, the second coordinate value is directly used as the reference human body position.
In the above embodiment, the reference human body position corresponding to each human face position may be determined according to the human face position and the preset vector of each human object, so as to correlate the human face position and the human body position subsequently, and the usability is high.
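As an illustration of steps 201 to 203, the following sketch computes a reference human body position from a face position and a preset vector; the coordinate convention and the concrete offset value are assumptions made only for this example.

```python
import numpy as np

# A minimal sketch, assuming the face positions are the first coordinate
# values (recognition-frame centers) on the feature map and the preset
# vector is a fixed offset pointing from a face center toward the body
# center. The concrete offset is an illustrative assumption.
preset_vector = np.array([0.0, 12.0])  # e.g. body center ~12 rows below the face center

def reference_body_positions(face_positions, offset=preset_vector):
    """Second coordinate value = first coordinate value + preset vector."""
    face_positions = np.asarray(face_positions, dtype=np.float32)
    return face_positions + offset  # one reference body position per face

# Usage: two face centers A1, A2 on the feature map (x, y order assumed).
faces = np.array([[10.0, 20.0], [40.0, 22.0]])
refs = reference_body_positions(faces)
print(refs)  # [[10. 32.] [40. 34.]]
```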
In some alternative embodiments, step 102-22 may include:
Using the human body position with the smallest distance from the reference human body position, together with the face position corresponding to that reference human body position, as the face position and human body position having the association relationship.
In the embodiments of the present disclosure, after the reference human body position is determined, the human body position closest to the reference human body position among the plurality of human body positions belongs to the same person as the face position corresponding to that reference human body position. Accordingly, the associated face position and human body position are obtained.
For example, the reference body position includes C1 and C2, wherein C1 is determined from the face position A1 and C2 is determined from the face position A2. The human body position includes B1 and B2, and B2 is nearest to C1 and B1 is nearest to C2 in the human body position. Thus, it can be determined that A1 and B2 have an association relationship, and A2 and B1 have an association relationship.
In the above embodiment, one human body position closest to each reference human body position can be determined among a plurality of human body positions, and the human body position is associated with the human face position corresponding to the determined reference human body position, so that the implementation is simple and the availability is high.
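The nearest-neighbour matching described above can be illustrated with the A1/A2, B1/B2, C1/C2 example; the coordinates below are made-up values chosen so that B2 is closest to C1 and B1 is closest to C2.

```python
import numpy as np

# A minimal sketch of the association step.
faces  = {"A1": np.array([10.0, 20.0]), "A2": np.array([40.0, 22.0])}
refs   = {"A1": np.array([10.0, 32.0]), "A2": np.array([40.0, 34.0])}  # C1, C2
bodies = {"B1": np.array([41.0, 33.0]), "B2": np.array([ 9.0, 31.0])}

def associate(refs, bodies):
    """For each reference body position, pick the closest detected body position."""
    pairs = {}
    for face_id, ref in refs.items():
        body_id = min(bodies, key=lambda b: np.linalg.norm(bodies[b] - ref))
        pairs[face_id] = body_id
    return pairs

print(associate(refs, bodies))  # {'A1': 'B2', 'A2': 'B1'}
```

This per-face nearest-neighbour match is the simplest reading of the text; if two faces competed for the same body, a one-to-one assignment (for example greedy or Hungarian matching) could be substituted.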
In the embodiment of the present disclosure, it is noted that, the reference face position corresponding to each person may be determined according to the body position of each person and another preset vector, and further, the face position with the smallest distance from the reference face position and the body position corresponding to the reference face position are used as the face position and the body position with the association relationship. The other preset vector may be a preset vector pointing from the human body position to the human face position. The method for determining the reference face position is the same as the method for determining the reference body position, and will not be described here again.
In some alternative embodiments, such as shown in FIG. 6, step 102-3 includes:
in step 102-31, the at least one person included in the scene image is determined using at least one of the associated face position and the body position as the location of each person included in the scene image.
The position of each person can be represented by the face position and/or the body position corresponding to the person, so that the person included in the scene image can be determined.
In step 102-32, a plurality of feature vectors are determined based on the associated face position and the body position.
In the embodiments of the present disclosure, according to the preset action types, feature vectors which point from the face position to the associated human body position, one for each preset action type, are determined, thereby obtaining the plurality of feature vectors corresponding to the same person.
In step 102-33, the target action type corresponding to each person is determined based on the plurality of feature vectors.
In the embodiment of the disclosure, the most probable action type of the person can be determined according to the feature vectors, and the action type is taken as the target action type.
In some alternative embodiments, such as shown in FIG. 7, steps 102-33 may include:
in step 301, the plurality of feature vectors corresponding to each person are normalized, so as to obtain a normalized value corresponding to each feature vector.
In the embodiment of the present disclosure, a normalization function, for example, a softmax function, may be used to normalize a plurality of feature vectors corresponding to each person, so as to obtain a normalized value corresponding to each feature vector.
In step 302, the feature vector corresponding to the maximum normalized value of each person is used as the target feature vector of each person.
In the embodiment of the present disclosure, after normalizing a plurality of feature vectors corresponding to each person, the feature vector corresponding to the maximum normalized value is used as the target feature vector of each person.
In step 303, the action type corresponding to the target feature vector is used as the target action type corresponding to each person.
The action type corresponding to the target feature vector is the action type most likely to be performed by the person, and accordingly, the action type can be used as the target action type of the person.
In the above embodiment, the normalization processing is performed on the plurality of feature vectors of each person, so as to determine the action type most likely to be performed by each person, and the action type is used as the target action type, thereby achieving the purpose of identifying the action of the object.
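A minimal sketch of steps 301 to 303 is given below, assuming each person's feature vectors have already been reduced to one scalar score per preset action type; the action names and scores are illustrative assumptions.

```python
import numpy as np

action_types = ["raise_hand", "stand_up", "bend_down", "no_action"]

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

def target_action_type(scores):
    """Normalize the per-action scores and keep the action with the largest value."""
    normalized = softmax(np.asarray(scores, dtype=np.float32))
    return action_types[int(np.argmax(normalized))], normalized

action, probs = target_action_type([2.1, 0.3, -1.0, 0.5])
print(action)          # raise_hand
print(probs.round(3))  # ~[0.708 0.117 0.032 0.143]
```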
In some alternative embodiments, after a scene image is acquired, the scene image may be input into a pre-trained object detection model, a target position of each part of each object is determined by the object detection model on the scene image, then the target positions of different parts belonging to the same object are associated, and at least one object included in the scene image and a target action type of each object in the at least one object are determined according to the associated target positions of the different parts.
The structure of the object detection model is shown in fig. 8, for example, after a scene image is acquired, the scene image is input into the object detection model, and the object detection model firstly adopts a backstone to extract features of the scene image, so as to obtain a feature map.
Further, the object detection model determines the face position of each person and the human body position of each person on the feature map through the positioning branch.
Still further, the object detection model determines the associated face position and the body position belonging to the same object through an association branch. And determining at least one person included in the scene image and a target action type corresponding to each person in the at least one person according to the associated face position and the associated human body position through an action recognition branch.
Finally, the object detection model may output the action detection result, where the result includes at least one person included in the scene image and the target action type corresponding to each person in the at least one person.
In some alternative embodiments, the object detection model may also directly output a target image, where at least one object included in the scene image and a target action type of each object in the at least one object may be identified at the same time, so that an object detection result may be more intuitively reflected.
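Putting the structure of fig. 8 together, the following sketch shows a single-pass detector with one backbone and three branch heads. It assumes PyTorch and a toy backbone with total stride 8, so a 640x480 image yields an 80x60 feature map as in the earlier example; the actual backbone, channel semantics, and decoding are determined by the trained model, not by this sketch.

```python
import torch
import torch.nn as nn

class ActionDetector(nn.Module):
    """Sketch of the single-pass detector: one backbone, three branch heads."""
    def __init__(self, num_actions: int = 5, feat: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in for e.g. a VGG-style backbone
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.localization = nn.Conv2d(feat, 6, 1)      # face/body box centers and sizes
        self.association = nn.Conv2d(feat, 2, 1)       # face position + associated body position
        self.action = nn.Conv2d(feat, num_actions, 1)  # one channel per preset action type

    def forward(self, image):
        f = self.backbone(image)
        return self.localization(f), self.association(f), self.action(f)

model = ActionDetector(num_actions=5)
loc, assoc, act = model(torch.randn(1, 3, 480, 640))
print(loc.shape, assoc.shape, act.shape)
# torch.Size([1, 6, 60, 80]) torch.Size([1, 2, 60, 80]) torch.Size([1, 5, 60, 80])
```

Because every branch is a convolution over the shared feature map, one forward pass covers all persons in the image at once, which is what keeps the recognition time independent of the number of objects.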
In the above embodiment, detection of different parts of objects, association of the different parts belonging to the same object, and action recognition of the objects may be performed on the scene image, so as to determine at least one object included in the scene image and a target action type of each object in the at least one object. The action recognition duration is independent of the number of objects included in the scene image, and an increase in calculation duration caused by an increase in the number of objects is avoided, which greatly saves computing resources, shortens the action recognition duration, and effectively improves detection efficiency.
In some alternative embodiments, it is considered that sample image sets simultaneously carrying face position labels, human body position labels, association relation labels between face positions and human body positions, and action identification labels between human body positions and action types are relatively scarce, and that it would take considerable time to fully re-label a sample image set that carries only part of these labels.
To address this issue, in embodiments of the present disclosure, such as shown in fig. 9, the method may further include:
in step 100-1, the tag type in the sample image set is determined.
In the embodiment of the disclosure, an existing sample image set is adopted, and the label type in the sample image set comprises at least one of a face position label, a human body position label, an association relation label between a face position and a human body position, and an action identification label between a human body position and an action type.
In step 100-2, the sample image set is adopted to train branches corresponding to the label types in a preset model respectively, so as to obtain the object detection model.
In the embodiments of the present disclosure, the structure of the preset model may also include a positioning branch, an association branch, and an action recognition branch as shown in fig. 8. The branches corresponding to the label types in the preset model are trained respectively with the sample image set, and the trained object detection model is obtained when the loss function of the corresponding branch is minimized.
The positioning branch may further include a face positioning branch and a human body positioning branch (not shown in fig. 9).
For example, as shown in fig. 10, if the label type in the sample image set only includes face position labels, the sample image set is used to train the face positioning branch among the positioning branches of the preset model. In each training iteration, no update is made to the other branches; that is, the total loss determined each time is identical to the first loss function (the loss of the face positioning branch), while the second, third, and fourth loss functions may, for example, be set to 0.
If the label type in the sample image set only comprises a human body position label, training the human body positioning branch in the positioning branches of the preset model by adopting the sample image set. If the label types in the sample image set comprise the face position label and the human body position label at the same time, the sample image set can be adopted to directly train the positioning branch.
If the label type in the sample image set only comprises the association relation label, training the association branch of the preset model by adopting the sample image set, wherein the loss function corresponding to other branches is 0.
Likewise, if the label type in the sample image set only includes the action recognition label, the sample image set may be used to train the action recognition branch of the preset model, and the loss function corresponding to the other branches may be, for example, 0.
If the label categories in the sample image set are two or more, the sample image set may be used to train the corresponding branches of the preset model, and the loss functions corresponding to other branches may be, for example, 0.
In the above embodiment, the sample image set is adopted to train the branches corresponding to the label types of the sample image set in the preset model respectively, so as to obtain the object detection model, and the detection performance and generalization performance of the object detection model are improved.
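The per-label-type training rule described above can be sketched as a masked multi-task loss: only the branches whose label type is present in the sample image set contribute to the total loss, and the other branch losses are effectively 0. The branch names and the dict-based interfaces below are illustrative assumptions, not the actual training code.

```python
def total_loss(outputs, labels, loss_fns):
    """outputs/labels: dicts keyed by branch name ('face', 'body', 'assoc', 'action').

    Branches whose label type is absent from the sample image set contribute 0
    loss and therefore receive no gradient in that iteration.
    """
    loss = 0.0
    for branch, fn in loss_fns.items():
        if branch in labels:                      # label type present in the sample set
            loss = loss + fn(outputs[branch], labels[branch])
        # else: equivalent to setting this branch's loss function to 0
    return loss
```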
In some alternative embodiments, such as shown in fig. 11, the method may further comprise:
In step 103, an accumulated detection result of the actions that are made by each object within a set time period and match the target action type is determined.
In an embodiment of the disclosure, the scene image includes a scene image collected in a classroom, the object includes a teaching object, and the target action type includes at least one action type in a teaching task. The action types matching the teaching task include, but are not limited to, raising a hand, interacting with the teacher, standing up to answer a question, focusing on the blackboard, writing with the head down, and the like.
For example, a camera-equipped teaching multimedia device deployed in the classroom, including but not limited to a teaching projector, a monitoring device in the classroom, and the like, may be employed to capture scene images in the classroom. At least one teaching object included in the classroom scene image and the target action type of each teaching object are then determined, where a teaching object may be a student.
Further, the accumulated detection result of the actions matching the target action type made by each teaching object, for example each student, can be determined within a set period, for example the duration of one lesson. For instance, it can be determined how many times each student raises a hand in the class, how long the student focuses on the blackboard, how long the student writes with the head down, how many questions the student answers, how many times the student interacts with the teacher, and so on. The results can be displayed through the teaching multimedia device so that the teacher can better perform teaching tasks.
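As a rough illustration of how such accumulated detection results could be computed, the sketch below aggregates per-frame (student, action) detections over one lesson into per-student counts and durations. The function name, the action labels, and the frame-count-to-duration conversion are assumptions made for this example only.

```python
from collections import Counter, defaultdict

def accumulate_lesson_statistics(frame_results, frame_interval_s=1.0,
                                 duration_actions=('focus_blackboard', 'writing')):
    """Aggregate per-frame (student_id, action_type) detections over one lesson.

    frame_results:    iterable of per-frame detection lists, each element a
                      (student_id, action_type) pair.
    frame_interval_s: sampling interval between analysed frames, in seconds.
    duration_actions: actions reported as durations rather than counts.
    """
    frame_counts = defaultdict(Counter)
    for detections in frame_results:
        for student_id, action in detections:
            frame_counts[student_id][action] += 1

    summary = {}
    for student_id, counts in frame_counts.items():
        summary[student_id] = {
            # sustained actions become durations; the rest stay as counts
            action: (n * frame_interval_s if action in duration_actions else n)
            for action, n in counts.items()
        }
    return summary
```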
Corresponding to the foregoing method embodiments, the present disclosure also provides embodiments of the apparatus.
Fig. 12 is a block diagram of an action recognition apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 12, the apparatus comprises: an image acquisition module 410, configured to acquire a scene image; and an action recognition module 420, configured to perform detection of different parts of an object, association of different parts in the same object, and action recognition of the object on the scene image, and to determine at least one object included in the scene image and a target action type of each object in the at least one object.
In some alternative embodiments, the object comprises a person, and the different parts of the object comprise a face and a body of the person; the action recognition module includes: the first determining submodule is used for determining the face position of each person and the human body position of each person on the scene image; the association sub-module is used for associating the face position and the human body position of the same person; and the second determining submodule is used for determining at least one person included in the scene image and the target action type of each person in the at least one person according to the associated face position and the associated human body position.
In some alternative embodiments, the association submodule includes: a first determining unit configured to determine a reference human body position corresponding to each human face position; and the association unit is used for associating the human face position and the human body position belonging to the same person according to the reference human body position and the human body position.
In some alternative embodiments, the first determining unit is configured to: determine a first coordinate value corresponding to each face position on the scene image; respectively determine a second coordinate value according to a preset vector and the first coordinate value, wherein the preset vector points from the face position to the human body position; and take the second coordinate value as the reference human body position.
In some alternative embodiments, the association unit is configured to take the human body position with the smallest distance to the reference human body position, together with the face position corresponding to that reference human body position, as the associated face position and human body position.
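A minimal sketch of this association step is given below, assuming face and body detections are represented by their box centers: each face center is shifted by the preset vector to obtain a reference human body position, which is then matched to the nearest detected body. The function name, the greedy one-to-many handling, and the NumPy representation are simplifying assumptions.

```python
import numpy as np

def associate_faces_to_bodies(face_centers, body_centers, preset_vector):
    """Face-to-body association via a reference human body position (sketch).

    face_centers:  (N, 2) array of face box centers.
    body_centers:  (M, 2) array of human body box centers (assumed non-empty).
    preset_vector: (2,) vector pointing from a face position to its body
                   position; how it is obtained is outside this sketch.
    Returns a list of (face_index, body_index) pairs.
    """
    face_centers = np.asarray(face_centers, dtype=float)
    body_centers = np.asarray(body_centers, dtype=float)
    shift = np.asarray(preset_vector, dtype=float)
    pairs = []
    for i, face in enumerate(face_centers):
        reference = face + shift                              # reference human body position
        distances = np.linalg.norm(body_centers - reference, axis=1)
        j = int(np.argmin(distances))                         # body with the smallest distance
        pairs.append((i, j))
    return pairs
```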
In some alternative embodiments, the second determining submodule includes: a second determining unit configured to determine the at least one person included in the scene image by taking at least one of the associated face position and the human body position as the position of each person included in the scene image; a third determining unit configured to determine a plurality of feature vectors according to the associated face position and the human body position; and a fourth determining unit configured to determine the target action type of each of the at least one person based on the plurality of feature vectors.
In some alternative embodiments, the third determining unit is configured to determine feature vectors that respectively correspond to at least one preset action type and point from the face position to the associated human body position, so as to obtain the plurality of feature vectors.
In some alternative embodiments, the fourth determining unit is configured to: normalize the plurality of feature vectors corresponding to each person to obtain a normalized value of each feature vector; take the feature vector corresponding to the maximum normalized value of each person as the target feature vector of that person; and take the action type corresponding to the target feature vector as the target action type of that person.
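The selection of the target action type can be illustrated with the following sketch, which maps each per-action feature vector to a normalized value and picks the maximum. Using the vector norm as the raw score and a softmax as the normalization are assumptions; the description only requires that normalized values are compared and the largest one is selected.

```python
import numpy as np

def select_target_action(action_vectors, action_types):
    """Choose the target action type from per-action feature vectors (sketch).

    action_vectors: (K, D) array with one feature vector per preset action type.
    action_types:   list of K action type names.
    """
    action_vectors = np.asarray(action_vectors, dtype=float)
    scores = np.linalg.norm(action_vectors, axis=1)   # assumed raw score per action type
    exp = np.exp(scores - scores.max())
    normalized = exp / exp.sum()                      # normalized value of each feature vector
    best = int(np.argmax(normalized))                 # feature vector with the largest value
    return action_types[best], float(normalized[best])
```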
In some alternative embodiments, the action recognition module includes: a second association submodule, configured to associate the target positions of different parts belonging to the same object after the target position of each part of each object on the scene image is determined through the object detection model; and a third determining submodule, configured to determine, through the object detection model and according to the target positions of the associated different parts, at least one object included in the scene image and a target action type of each object in the at least one object.
In some alternative embodiments, the apparatus further comprises: the label type determining module is used for determining the label types in the sample image set; the label type comprises at least one of a face position label, a human body position label, an association relation label between the face position and the human body position, and an action identification label between the human body position and the action type; and the training module is used for training branches corresponding to the label types in a preset model by adopting the sample image set to obtain the object detection model.
In some alternative embodiments, the apparatus further comprises: and the matching determining module is used for determining the accumulated detection result of the actions which are matched with the target action type and are made by each object in a set time period.
In some alternative embodiments, the scene image comprises a scene image captured in a classroom, the object comprises a teaching object, and the target action type comprises at least one action type in a teaching task.
For the apparatus embodiments, since they essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the disclosed solution. Those of ordinary skill in the art can understand and implement the present disclosure without undue burden.
The embodiment of the present disclosure also provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is used for executing any one of the action recognition methods described above.
In some alternative embodiments, the disclosed embodiments provide a computer program product comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the action recognition method provided in any of the embodiments above.
In some alternative embodiments, the present disclosure also provides another computer program product for storing computer-readable instructions that, when executed, cause a computer to perform the operations of the action recognition method provided by any of the embodiments described above.
The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
The embodiment of the disclosure also provides an action recognition device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the executable instructions stored in the memory to implement the method of action recognition of any of the above.
Fig. 13 is a schematic diagram of a hardware structure of an action recognition device according to an embodiment of the present disclosure. The action recognition device 510 comprises a processor 511 and may further comprise an input device 512, an output device 513, and a memory 514. The input device 512, the output device 513, the memory 514, and the processor 511 are connected to each other via a bus.
The memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM), and is used for storing associated instructions and data.
The input means is for inputting data and/or signals and the output means is for outputting data and/or signals. The output device and the input device may be separate devices or may be a single device.
The processor may include one or more processors, for example one or more central processing units (CPUs); where the processor is a CPU, it may be a single-core CPU or a multi-core CPU.
The memory is used to store program codes and data for the network device.
The processor is used to call the program code and data in the memory to perform the steps of the method embodiments described above. Reference may be made specifically to the description of the method embodiments, and no further description is given here.
It will be appreciated that fig. 13 shows only a simplified design of an action recognition device. In practical applications, the action recognition device may also include other necessary elements, including but not limited to any number of input/output devices, processors, controllers, memories, and the like; all action recognition devices that can implement the embodiments of the present disclosure fall within the scope of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
The foregoing description of the preferred embodiments of the present disclosure is not intended to limit the disclosure, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present disclosure.
Claims (13)
1. An action recognition method, comprising:
Acquiring a scene image;
Detecting different parts of the object, associating different parts in the same object and identifying the actions of the object, and determining at least one object included in the scene image and a target action type of each object in the at least one object;
Performing the detection of different parts of the object, the association of different parts in the same object, and the action recognition of the object on the scene image, and determining at least one object included in the scene image and the target action type of each object in the at least one object comprises the following steps:
After determining the target position of each part of each object on the scene image through an object detection model, correlating the target positions of different parts belonging to the same object;
determining at least one object included in the scene image and a target action type of each object in the at least one object according to the target positions of the associated different parts through the object detection model;
The object detection model is obtained based on the following steps:
Determining the type of a label in the sample image set; the label type comprises at least one of a face position label, a human body position label, an association relation label between the face position and the human body position, and an action identification label between the human body position and the action type;
Training branches corresponding to the label types in a preset model by adopting the sample image set to obtain the object detection model; the branches corresponding to the label types comprise a face positioning branch, a human body positioning branch, an association branch and an action recognition branch.
2. The method of claim 1, wherein the object comprises a person, and wherein the different parts of the object comprise a face and a body of the person;
Performing the detection of different parts of the object, the association of different parts in the same object, and the action recognition of the object on the scene image, and determining at least one object included in the scene image and the target action type of each object in the at least one object comprises the following steps:
determining the face position of each person and the human body position of each person on the scene image;
Correlating the face position and the body position belonging to the same person;
And determining at least one person included in the scene image and the target action type of each person in the at least one person according to the associated face position and the human body position.
3. The method of claim 2, wherein the correlating the face position and the body position belonging to the same person comprises:
determining a reference human body position corresponding to each human face position;
and according to the reference human body position and the human body position, correlating the human face position and the human body position belonging to the same person.
4. A method according to claim 3, wherein said determining a reference body position corresponding to each face position comprises:
Determining a first coordinate value corresponding to each face position on the scene image;
Respectively determining a second coordinate value according to a preset vector and the first coordinate value; the preset vector points to the position of the human body from the position of the human face;
and taking the second coordinate value as the reference human body position.
5. A method according to claim 3, wherein said correlating the face position and the body position belonging to the same person according to the reference body position and the body position comprises:
and taking the human body position with the smallest distance from the reference human body position and the human face position corresponding to the reference human body position as the associated human face position and human body position.
6. The method of claim 2, wherein the determining the at least one person and the target action type for each of the at least one person included in the scene image based on the associated face position and body position comprises:
Determining at least one person included in the scene image by using at least one of the associated face position and the human body position as a position of each person included in the scene image;
Determining a plurality of feature vectors according to the associated face position and the human body position;
Determining the target action type of each of the at least one person based on the plurality of feature vectors.
7. The method of claim 6, wherein the determining a plurality of feature vectors from the associated face position and the body position comprises:
and determining feature vectors which respectively correspond to at least one preset action type and are pointed to the associated human body positions by the human face positions, and obtaining a plurality of feature vectors.
8. The method of claim 6, wherein the determining the target action type of each of the at least one person based on the plurality of feature vectors comprises:
normalizing the plurality of feature vectors corresponding to each person to obtain a normalized value of each feature vector;
taking the feature vector corresponding to the maximum normalized value of each person as the target feature vector of each person;
And taking the action type corresponding to the target feature vector as the target action type of each person.
9. The method according to any one of claims 1-8, further comprising:
And determining the accumulated detection result of the actions which are matched with the target action type and are made by each object in a set time period.
10. The method of claim 9, wherein the scene image comprises a scene image captured in a classroom, the object comprises a teaching object, and the target action type comprises at least one action type in a teaching task.
11. An action recognition device, the device comprising:
The image acquisition module is used for acquiring a scene image;
the action recognition module is used for detecting different parts of the object, associating different parts in the same object and recognizing the action of the object, and determining at least one object included in the scene image and a target action type of each object in the at least one object;
wherein the action recognition module performing the detection of different parts of the object, the association of different parts in the same object, and the action recognition of the object on the scene image, and determining the at least one object included in the scene image and the target action type of each object in the at least one object, comprises:
After determining the target position of each part of each object on the scene image through an object detection model, correlating the target positions of different parts belonging to the same object;
determining at least one object included in the scene image and a target action type of each object in the at least one object according to the target positions of the associated different parts through the object detection model;
The object detection model is obtained based on the following steps:
Determining the type of a label in the sample image set; the label type comprises at least one of a face position label, a human body position label, an association relation label between the face position and the human body position, and an action identification label between the human body position and the action type;
Training branches corresponding to the label types in a preset model by adopting the sample image set to obtain the object detection model; the branches corresponding to the label types comprise a face positioning branch, a human body positioning branch, an association branch and an action recognition branch.
12. A computer readable storage medium, characterized in that the storage medium stores a computer program for executing the action recognition method according to any one of the preceding claims 1-10.
13. An action recognition device, comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to invoke executable instructions stored in the memory to implement the action recognition method of any of claims 1-10.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010196461.6A CN113496143B (en) | 2020-03-19 | 2020-03-19 | Action recognition method and device and storage medium |
JP2022506372A JP2022543032A (en) | 2020-03-19 | 2021-03-18 | Motion recognition method, motion recognition device, computer-readable storage medium, electronic device and computer program product |
PCT/CN2021/081556 WO2021185317A1 (en) | 2020-03-19 | 2021-03-18 | Action recognition method and device, and storage medium |
KR1020227003914A KR20220027241A (en) | 2020-03-19 | 2021-03-18 | Motion recognition method, device and storage medium |
TW110109832A TWI776429B (en) | 2020-03-19 | 2021-03-18 | Action recognition method and device, computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010196461.6A CN113496143B (en) | 2020-03-19 | 2020-03-19 | Action recognition method and device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113496143A CN113496143A (en) | 2021-10-12 |
CN113496143B true CN113496143B (en) | 2024-07-16 |
Family
ID=77770162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010196461.6A Active CN113496143B (en) | 2020-03-19 | 2020-03-19 | Action recognition method and device and storage medium |
Country Status (5)
Country | Link |
---|---|
JP (1) | JP2022543032A (en) |
KR (1) | KR20220027241A (en) |
CN (1) | CN113496143B (en) |
TW (1) | TWI776429B (en) |
WO (1) | WO2021185317A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114463850B (en) * | 2022-02-08 | 2022-12-20 | 南京科源视觉技术有限公司 | Human body action recognition system suitable for multiple application scenes |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7110569B2 (en) * | 2001-09-27 | 2006-09-19 | Koninklijke Philips Electronics N.V. | Video based detection of fall-down and other events |
JP4479194B2 (en) * | 2003-08-29 | 2010-06-09 | 富士ゼロックス株式会社 | Motion identification device and object posture identification device |
CN102179048A (en) * | 2011-02-28 | 2011-09-14 | 武汉市高德电气有限公司 | Method for implementing realistic game based on movement decomposition and behavior analysis |
US10037458B1 (en) * | 2017-05-02 | 2018-07-31 | King Fahd University Of Petroleum And Minerals | Automated sign language recognition |
CN108229324B (en) * | 2017-11-30 | 2021-01-26 | 北京市商汤科技开发有限公司 | Gesture tracking method and device, electronic equipment and computer storage medium |
US20190213792A1 (en) * | 2018-01-11 | 2019-07-11 | Microsoft Technology Licensing, Llc | Providing Body-Anchored Mixed-Reality Experiences |
CN110659544A (en) * | 2018-06-28 | 2020-01-07 | 华南农业大学 | Milk cow behavior identification method based on nonparametric spatiotemporal context trajectory model |
CN108960209B (en) * | 2018-08-09 | 2023-07-21 | 腾讯科技(深圳)有限公司 | Identity recognition method, identity recognition device and computer readable storage medium |
CN109829435B (en) * | 2019-01-31 | 2023-04-25 | 深圳市商汤科技有限公司 | Video image processing method, device and computer readable medium |
CN110135246B (en) * | 2019-04-03 | 2023-10-20 | 平安科技(深圳)有限公司 | Human body action recognition method and device |
CN110096964B (en) * | 2019-04-08 | 2021-05-04 | 厦门美图之家科技有限公司 | Method for generating image recognition model |
CN110347246B (en) * | 2019-06-19 | 2023-07-18 | 达闼机器人股份有限公司 | Man-machine interaction method and device, storage medium and electronic equipment |
CN110647807A (en) * | 2019-08-14 | 2020-01-03 | 中国平安人寿保险股份有限公司 | Abnormal behavior determination method and device, computer equipment and storage medium |
CN110781843B (en) * | 2019-10-29 | 2022-11-04 | 首都师范大学 | Classroom behavior detection method and electronic equipment |
- 2020
  - 2020-03-19: CN CN202010196461.6A, patent CN113496143B (en), active
- 2021
  - 2021-03-18: TW TW110109832A, patent TWI776429B (en), active
  - 2021-03-18: KR KR1020227003914A, publication KR20220027241A (en), not active, Application Discontinuation
  - 2021-03-18: JP JP2022506372A, publication JP2022543032A (en), not active, Withdrawn
  - 2021-03-18: WO PCT/CN2021/081556, publication WO2021185317A1 (en), active, Application Filing
Non-Patent Citations (1)
Title |
---|
Research and Implementation of a Human Body Pose Estimation System Based on Convolutional Neural Networks; Wen Qian; Master's Electronic Journal; pp. 1-77 *
Also Published As
Publication number | Publication date |
---|---|
KR20220027241A (en) | 2022-03-07 |
TW202139061A (en) | 2021-10-16 |
CN113496143A (en) | 2021-10-12 |
WO2021185317A1 (en) | 2021-09-23 |
TWI776429B (en) | 2022-09-01 |
JP2022543032A (en) | 2022-10-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109522815B (en) | Concentration degree evaluation method and device and electronic equipment | |
WO2019033525A1 (en) | Au feature recognition method, device and storage medium | |
CN105451029B (en) | A kind of processing method and processing device of video image | |
JP7292492B2 (en) | Object tracking method and device, storage medium and computer program | |
CN110163211B (en) | Image recognition method, device and storage medium | |
CN110287848A (en) | The generation method and device of video | |
CN112150349A (en) | Image processing method and device, computer equipment and storage medium | |
CN111026949A (en) | Question searching method and system based on electronic equipment | |
CN111382655A (en) | Hand-lifting behavior identification method and device and electronic equipment | |
CN111325082A (en) | Personnel concentration degree analysis method and device | |
CN109986553B (en) | Active interaction robot, system, method and storage device | |
CN113705510A (en) | Target identification tracking method, device, equipment and storage medium | |
CN115546861A (en) | Online classroom concentration degree identification method, system, equipment and medium | |
CN113496143B (en) | Action recognition method and device and storage medium | |
CN111753168A (en) | Method and device for searching questions, electronic equipment and storage medium | |
CN114299546A (en) | Method and device for identifying pet identity, storage medium and electronic equipment | |
CN115410240A (en) | Intelligent face pockmark and color spot analysis method and device and storage medium | |
JP6855737B2 (en) | Information processing equipment, evaluation systems and programs | |
CN110569707A (en) | identity recognition method and electronic equipment | |
CN114511877A (en) | Behavior recognition method and device, storage medium and terminal | |
CN113469138A (en) | Object detection method and device, storage medium and electronic equipment | |
CN113573009A (en) | Video processing method, video processing device, computer equipment and storage medium | |
CN108197593B (en) | Multi-size facial expression recognition method and device based on three-point positioning method | |
CN111476195A (en) | Face detection method, face detection device, robot and computer-readable storage medium | |
CN114359646A (en) | Video analysis method, device, system, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40050702; Country of ref document: HK |
GR01 | Patent grant | |