CN116311519A - Action recognition method, model training method and device

Action recognition method, model training method and device

Info

Publication number
CN116311519A
Authority
CN
China
Prior art keywords
information
gesture
action
pose
sequence
Prior art date
Legal status
Granted
Application number
CN202310271127.6A
Other languages
Chinese (zh)
Other versions
CN116311519B (en)
Inventor
陈毅
郭紫垣
赵亚飞
范锡睿
张世昌
王志强
杜宗财
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310271127.6A
Publication of CN116311519A
Application granted
Publication of CN116311519B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an action recognition method, a model training method, and corresponding apparatus. It relates to the field of artificial intelligence, in particular to computer vision, deep learning, augmented reality, and virtual reality, and can be applied to scenarios such as the metaverse and digital humans. The scheme is implemented as follows: acquiring a current video frame in a video, wherein the video includes an object to be recognized; determining first pose information of the object in the current video frame; correcting the first pose information based on a plurality of historical video frames preceding the current video frame to obtain second pose information; and determining an action of the object in the current video frame based on the second pose information.

Description

Action recognition method, model training method and device
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to computer vision, deep learning, augmented reality, and virtual reality, and can be applied to scenarios such as the metaverse and digital humans. More particularly, it relates to an action recognition method and apparatus, a training method and apparatus for an action matching model, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Motion capture (mocap) refers to recording the motion of a moving object (e.g., a person, an animal, etc.) in an actual three-dimensional space and reconstructing that motion on a digital model (e.g., a digital human) in a virtual three-dimensional space.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a method and apparatus for motion recognition, a method and apparatus for training a motion matching model, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided an action recognition method including: acquiring a current video frame in a video, wherein the video comprises an object to be identified; determining first pose information of the object in the current video frame; correcting the first posture information based on a plurality of historical video frames before the current video frame to obtain second posture information; and determining an action of the object in the current video frame based on the second pose information.
According to an aspect of the present disclosure, there is provided a training method of an action matching model, including: acquiring a sample gesture information sequence and a gesture label corresponding to the sample gesture information sequence, wherein the sample gesture information sequence corresponds to an action sequence of a sample object, the action sequence comprises a plurality of actions, the sample gesture information sequence comprises a plurality of sample gesture information respectively corresponding to the plurality of actions, and the gesture label comprises real gesture information of the last action in the plurality of actions; inputting the sample gesture information sequence into the action matching model to obtain the predicted gesture information of the last action output by the action matching model; determining a loss value of the action matching model based at least on the predicted pose information and the true pose information; and adjusting parameters of the action matching model based on the loss value.
According to an aspect of the present disclosure, there is provided an action recognition apparatus including: an acquisition unit configured to acquire a current video frame in a video, the video including an object to be identified; a first determining unit configured to determine first pose information of the object in the current video frame; a correction unit configured to correct the first pose information based on a plurality of historical video frames preceding the current video frame to obtain second pose information; and a second determining unit configured to determine an action of the object in the current video frame based on the second pose information.
According to an aspect of the present disclosure, there is provided a training apparatus of an action matching model, including: an obtaining unit configured to obtain a sample gesture information sequence and a gesture tag corresponding to the sample gesture information sequence, the sample gesture information sequence corresponding to an action sequence of a sample object, the action sequence including a plurality of actions, the sample gesture information sequence including a plurality of sample gesture information corresponding to the plurality of actions, respectively, the gesture tag including real gesture information of a last action of the plurality of actions; an input unit configured to input the sample gesture information sequence into the action matching model to obtain predicted gesture information of the last action output by the action matching model; a determining unit configured to determine a loss value of the action matching model based at least on the predicted pose information and the true pose information; and an adjustment unit configured to adjust parameters of the action matching model based on the loss value.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to an aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the above aspects.
According to an aspect of the present disclosure, there is provided a computer program product comprising computer program instructions which, when executed by a processor, implement the method of any of the above aspects.
According to one or more embodiments of the present disclosure, the accuracy of motion recognition can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a method of action recognition according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a motion capture process according to an embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a method of training an action matching model according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a motion recognition device according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of a training apparatus of an action matching model according to an embodiment of the present disclosure; and
fig. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another element. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In the technical solution of the disclosure, the acquisition, storage, and application of any personal information of users involved all comply with the relevant laws and regulations and do not violate public order and good customs.
In the related art, trackers are typically attached to key parts of a moving object, the positions of the trackers are captured by a motion capture system, and three-dimensional coordinate data of the key parts are obtained after computer processing, so that the action of the moving object is identified. The resulting three-dimensional coordinate data may be used to drive the limb movements of digital humans (e.g., animated characters, virtual anchors, virtual customer service agents, interactive game characters, etc.) in real time. However, this scheme has high hardware cost, a complex data processing pipeline, and poor practicality.
AI (Artificial Intelligence) motion capture refers to the use of AI technology to achieve motion capture. Typically, an image of the moving object is acquired by an image acquisition device, the action of the moving object is recognized from the image using AI technology, and a digital human is then driven using the recognized action. Compared with tracker-based motion capture, AI motion capture reduces hardware cost and improves practicality. However, existing AI motion capture schemes generally perform motion recognition on a single image at a time, which causes jitter when recognizing continuous motion in video, so the recognition result is not accurate enough. In addition, because a single video frame may suffer from motion blur, partial occlusion of the moving object, and the like, the accuracy of motion recognition is further degraded.
In view of the above problems, embodiments of the present disclosure provide a motion recognition method, which can improve accuracy of motion recognition of a video, so that motion of an object recognized from the video is smoother and more natural.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, client devices 101, 102, 103, 104, 105, and 106 and server 120 may run one or more services or software applications that enable execution of the action recognition method and/or training method of the action matching model.
In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The client devices 101, 102, 103, 104, 105, and/or 106 may provide interfaces that enable a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, vehicle-mounted devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems; or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, wi-Fi), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host incorporating artificial intelligence technology. A cloud server is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical host and virtual private server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
According to some embodiments, client devices 101-106 may capture video of a moving object and send the video to server 120 via network 110. The server 120 may identify the motion of the moving object from the video by performing the motion identification method of the embodiments of the present disclosure. Alternatively, the client devices 101-106 may also perform the action recognition methods of embodiments of the present disclosure to recognize actions of moving objects from the video.
According to other embodiments, the server 120 may also obtain a video stored in the database 130 and identify the motion of the moving object from the video by performing the motion recognition method of the embodiments of the present disclosure.
Fig. 2 shows a flowchart of an action recognition method 200 according to an embodiment of the present disclosure. The subject of execution of the steps of method 200 may be a client (e.g., client devices 101-106 shown in FIG. 1) or a server (e.g., server 120 shown in FIG. 1).
As shown in fig. 2, the method 200 includes steps S210-S240.
In step S210, a current video frame in the video is acquired. The video includes an object to be identified.
In step S220, first pose information of the object in the current video frame is determined.
In step S230, the first pose information is modified based on a plurality of historical video frames preceding the current video frame to obtain second pose information.
In step S240, an action of the object in the current video frame is determined based on the second pose information.
According to an embodiment of the present disclosure, pose information (i.e., first pose information) of a current video frame is corrected based on a history video frame, resulting in corrected pose information (i.e., second pose information). In the process of motion recognition, the continuity and the correlation between video frames are considered, and the accuracy of the second gesture information of the current video frame is improved, so that the accuracy of a motion recognition result is improved, and the motion of an object recognized from the video is smoother and more natural.
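Purely for illustration, a minimal sketch of this per-frame flow (steps S210-S240) is given below. The three callables standing in for the pose detector, the correction step, and the action decoder, as well as the fixed-length history buffer, are assumptions of the sketch and not the specific implementation of the disclosure.

    from collections import deque

    def recognize_actions(video_frames, detect_first_pose, correct_pose, pose_to_action,
                          history_len=8):
        """Hypothetical sketch of method 200: per-frame recognition with a history buffer."""
        history = deque(maxlen=history_len)        # corrected poses of recent frames
        actions = []
        for frame in video_frames:                 # S210: acquire the current video frame
            first_pose = detect_first_pose(frame)  # S220: first (uncorrected) pose information
            second_pose = correct_pose(first_pose, list(history))  # S230: correction
            history.append(second_pose)            # corrected pose is reused as history later
            actions.append(pose_to_action(second_pose))            # S240: determine the action
        return actions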
The steps of method 200 are described in detail below.
In step S210, a current video frame in the video is acquired. The video includes an object to be identified.
According to some embodiments, the video acquired in step S210 may be recorded by the client device in real time. In other embodiments, already recorded video may also be read from the database.
In an embodiment of the present disclosure, the object to be identified is a moving object (e.g., a person, an animal, etc.), and the video records the continuous motion of the object. It should be noted that the present disclosure does not limit the type or magnitude of the actions. The action of the object may be a body movement, such as walking, running, jumping, or kicking; a hand movement, such as making a fist or spreading the five fingers; or a facial expression, such as happy, sad, or grimacing.
It is understood that a video includes a plurality of video frames. In an embodiment of the present disclosure, motion recognition is performed on each video frame in turn according to the temporal order of each video frame in the video. Accordingly, the current video frame refers to a video frame that is currently being processed, i.e., a video frame that is currently being motion-identified.
In step S220, first pose information of the object in the current video frame is determined.
According to some embodiments, step S220 may include the following steps S222 and S224.
In step S222, object detection is performed on the current video frame to identify an area where the object is located.
In step S224, the keypoint pose detection is performed on the region to obtain first pose information, where the first pose information includes first keypoint pose information of each of the plurality of keypoints of the object.
According to the embodiment, the first gesture information is obtained by carrying out the key point gesture detection on the region where the object is located, so that the interference of the background content in the current video frame on the gesture recognition of the object can be avoided, and the accuracy of the first gesture information is improved.
According to some embodiments, target detection may be achieved using a trained target detection model. The target detection model may be a neural network model, such as YOLO (You Only Look Once, a single network that predicts both object classes and bounding boxes), R-CNN (Region-based Convolutional Neural Network), Faster R-CNN, etc. By inputting the current video frame into the target detection model, the bounding box of the object and the category of the object part output by the target detection model can be obtained. The categories may be, for example, body, hands, and face.
According to some embodiments, a single target detection model may be used to detect the regions where different parts of the object are located. According to other embodiments, different parts may correspond to different target detection models, each used to detect the region in which the respective part is located. For example, the target detection model may include a body detection model and a hand detection model. The body detection model is used to detect the region where the body (excluding the hands) is located, and the hand detection model is used to detect the region where the hands are located.
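As a hedged illustration of step S222, the snippet below crops detected regions from the current frame; the detector output format (a list of (bounding box, category) pairs) is an assumption and does not correspond to any particular detection library.

    import numpy as np

    def crop_regions(frame, detections):
        """Crop each detected part from the frame.

        detections is assumed to be a list of ((x1, y1, x2, y2), category) pairs,
        where the box is in pixel coordinates and category is e.g. "body" or "hand".
        """
        regions = {}
        for (x1, y1, x2, y2), category in detections:
            regions.setdefault(category, []).append(frame[y1:y2, x1:x2])
        return regions

    # Dummy frame and hand-written boxes standing in for a detector's output.
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    detections = [((100, 50, 400, 470), "body"), ((320, 200, 380, 260), "hand")]
    regions = crop_regions(frame, detections)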
According to some embodiments, the key points may be joint points of the subject, such as shoulder joints, elbow joints, wrist joints, finger joints, knee joints, and the like. Further, the key points may also include facial contour points, such as eye, lip contour points.
In embodiments of the present disclosure, the first pose information refers to initial pose information (i.e., pose information that has not been corrected). The first pose information includes first keypoint pose information for each keypoint, and accordingly, the first keypoint pose information refers to initial pose information (i.e., uncorrected pose information) for each keypoint. The second posture information refers to corrected posture information. The second posture information includes second key point posture information of each key point, and accordingly, the second key point posture information refers to corrected posture information of each key point.
According to some embodiments, the pose information of a keypoint includes the rotation angles (pose angles) of the keypoint about a plurality of axes. In general, a three-dimensional coordinate system with x, y, and z axes may be preset; accordingly, the pose information of the keypoint includes the x-axis pose angle, the y-axis pose angle, and the z-axis pose angle of the keypoint.
According to some embodiments, keypoint pose detection may be implemented using a trained motion capture model. The motion capture model may be a neural network model. The region where the object is located is cropped from the current video frame and input into the motion capture model. The motion capture model outputs the position of each keypoint of the object and the first keypoint pose information (i.e., the pose angles in different directions) of each keypoint.
According to some embodiments, one motion capture model may be used to detect the first keypoint pose information for different parts of the object (e.g., body, hands, face, etc.). According to other embodiments, different parts may correspond to different motion capture models, each used to detect the first keypoint pose information of the respective part. For example, the motion capture model may include a body capture model and a hand capture model. The body capture model is used to detect the first keypoint pose information of body keypoints (e.g., shoulder joints, elbow joints, knee joints, etc., excluding hand keypoints), and the hand capture model is used to detect the first keypoint pose information of hand keypoints (e.g., carpometacarpal joints, metacarpophalangeal joints, phalangeal joints, etc.).
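The following sketch only illustrates the data that keypoint pose detection is assumed to produce: per-keypoint pose angles about the x, y, and z axes, possibly from separate body and hand capture models. The keypoint names and the dictionary layout are illustrative, not part of the disclosure.

    from dataclasses import dataclass

    @dataclass
    class KeypointPose:
        """First keypoint pose information: rotation angles (radians) about x, y, and z."""
        x_angle: float
        y_angle: float
        z_angle: float

    # Hypothetical outputs of a body capture model and a hand capture model.
    body_pose = {"shoulder_l": KeypointPose(0.1, -0.2, 0.0),
                 "elbow_l": KeypointPose(0.8, 0.0, 0.1),
                 "wrist_l": KeypointPose(0.0, 0.3, 0.0)}
    hand_pose = {"thumb_mcp_l": KeypointPose(0.2, 0.0, 0.1)}

    # The first pose information of the object combines the outputs of both models.
    first_pose = {**body_pose, **hand_pose}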
In step S230, the first pose information is modified based on a plurality of historical video frames preceding the current video frame to obtain second pose information.
According to some embodiments, step S230 may include: and correcting the first posture information based on the third posture information of the object in each of the plurality of historical video frames to obtain second posture information.
According to the above embodiment, the accuracy of the pose information of the current video frame can be improved by correcting the pose information (first pose information) of the current video frame using the pose information (third pose information) in the history video frame.
In an embodiment of the present disclosure, the third pose information refers to pose information of the object in the historical video frame. According to some embodiments, the third pose information may be corrected pose information for each historical video frame. For example, the third pose information may be second pose information obtained by correcting the first pose information of the historical video frame according to the motion recognition method of the embodiment of the present disclosure. Therefore, the first posture information of the current video frame can be corrected by utilizing the corrected and more accurate posture information of each historical video frame, so that the accuracy of the posture information of the current video frame is improved.
According to some embodiments, the third pose information of each of the plurality of historical video frames and the first pose information of the current video frame may be arranged in temporal order to generate a target pose information sequence. Then, the second pose information corresponding to the target pose information sequence is determined based on a preset association relationship between a historical pose information sequence and current pose information.
According to the embodiment, the target gesture information sequence can express the continuity and the correlation of the gesture information in time, so that the accuracy of the second gesture information is improved.
According to some embodiments, the association of the historical pose information sequence with the current pose information is represented by a trained action matching model. Correspondingly, the step of determining the second gesture information corresponding to the target gesture information sequence based on the association relationship between the preset historical gesture information sequence and the current gesture information may include: and inputting the target gesture information sequence into the action matching model to obtain second gesture information output by the action matching model.
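The sketch below shows how the target pose information sequence might be assembled and passed to a trained action matching model. It is a sketch under stated assumptions: poses are flattened into fixed-size vectors, and the model maps a whole sequence to a single corrected pose vector.

    import torch

    def correct_with_matching_model(first_pose_vec, history_vecs, matching_model):
        """Arrange historical (third) pose vectors and the current first pose vector in
        time order, then let the trained matching model output the second pose vector."""
        # history_vecs and first_pose_vec are assumed to be 1-D torch tensors of pose angles.
        seq = list(history_vecs) + [first_pose_vec]    # oldest ... newest, current frame last
        x = torch.stack(seq).unsqueeze(0)              # shape (1, T, D): one target sequence
        with torch.no_grad():
            second_pose_vec = matching_model(x)        # assumed to return shape (1, D)
        return second_pose_vec.squeeze(0)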
According to some embodiments, the action matching model may include a plurality of action matching models that respectively correspond to a plurality of locations of the object. Each motion matching model takes a historical gesture information sequence (namely, a target gesture information sequence, which is composed of gesture information of a plurality of key points of the corresponding part in a historical video frame and a current video frame) of the corresponding part as input, and outputs corrected current gesture information (namely, second gesture information) of the part. And splicing the second posture information output by the different motion matching models to obtain the second posture information of the whole object. By setting different action matching models for different parts, each action matching model can be focused on describing the action time sequence relation of one part, so that the interference of actions of other parts is avoided, and the accurate correction of the gesture information of the part is realized.
For example, the motion matching model may include a body matching model and a hand matching model (in some cases, it may further include a face matching model). By inputting the target pose information sequence of the body keypoints into the body matching model, the second pose information of the body keypoints output by the body matching model can be obtained. By inputting the target pose information sequence of the hand keypoints into the hand matching model, the second pose information of the hand keypoints output by the hand matching model can be obtained. Further, the pose information of the wrist joint is obtained from the second pose information of the body keypoints, and the second pose information of the body keypoints and the second pose information of the hand keypoints are spliced based on the pose information of the wrist joint, so that the corrected second pose information of the whole object is obtained.
According to some embodiments, the action matching model may be trained according to the training method of the action matching model of the embodiments of the present disclosure. In the case where there are a plurality of motion matching models (for example, the plurality of motion matching models includes a body matching model, a hand matching model, a face matching model, and the like), each of the motion matching models may be trained according to the training method of the motion matching model of the embodiment of the present disclosure, and the training processes of the respective motion matching models may be independent of each other.
The training method of the motion matching model will be described in detail below.
As described above, step S230 may be implemented with a trained action matching model, according to some embodiments. The action matching model learns the association relation between the historical gesture information sequence and the current gesture information in advance. Even if the current video frame is the first frame in the video, the corresponding target pose information sequence (at this time, the third pose information of the historical video frame is null, and the target pose information sequence only includes the first pose information of the current video frame) can be input into the motion matching model. The motion matching model will output the corrected pose information, i.e., the second pose information, for the current video frame.
According to some embodiments, instead of using the motion matching model, the third pose information of each of the plurality of historical video frames may be weighted and summed with the first pose information of the current video frame to obtain corrected second pose information. The weight of the third pose information may be inversely related to, for example, a time difference of the current video frame and the corresponding historical video frame. That is, the greater the time difference between the current video frame and the historical video frame (i.e., the farther apart the two are), the lower the weight of the third pose information of the historical video frame.
According to the above-described embodiment, in the case where the current video frame is the first frame in the video, since the video frame does not have the history video frame, the first pose information can be directly taken as the second pose information.
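A minimal sketch of this matching-model-free alternative is given below, assuming that each historical frame's weight decays with its frame gap to the current frame and that the first frame simply keeps its first pose information.

    import numpy as np

    def weighted_correction(first_pose, history_poses, history_frame_gaps):
        """Weighted sum of historical (third) pose vectors and the current first pose vector.
        A historical frame's weight is inversely related to its gap to the current frame."""
        if not history_poses:                      # first frame in the video: no history
            return first_pose                      # first pose is used as the second pose
        weights = np.array([1.0 / (1.0 + g) for g in history_frame_gaps] + [1.0])
        weights /= weights.sum()
        stacked = np.stack(list(history_poses) + [first_pose])   # shape (T, D)
        return (weights[:, None] * stacked).sum(axis=0)

    # Usage: pose vectors of flattened pose angles; gaps measured in frames.
    second_pose = weighted_correction(np.zeros(3), [np.ones(3), 0.5 * np.ones(3)], [2, 1])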
In step S240, an action of the object in the current video frame is determined based on the second pose information.
According to some embodiments, as described above, the second pose information includes second keypoint pose information for each of a plurality of keypoints of the object. Accordingly, step S240 may include: based on the connection relation among the plurality of key points and the second gesture information, determining the action of the object in the current video frame.
The connection relation of the key points may be preconfigured. For example, for a human subject, the shoulder joint is connected to the elbow joint, and the elbow joint is connected to the wrist joint. The corrected posture information of each key point can be obtained through step S230. And connecting the key points according to a preset key point connection relation based on the corrected gesture information, so that the action of the object can be reconstructed. The actions may be represented by an action topology graph. The action topological graph comprises a plurality of nodes and a plurality of edges, wherein the nodes represent key points, and the edges represent connection relations of the key points.
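The sketch below assembles such an action topology graph from the corrected keypoint poses and a preconfigured connection list; the node and edge representation is only illustrative.

    def build_action_topology(second_pose, connections):
        """Nodes carry corrected (second) keypoint pose information; edges encode the
        preconfigured connection relations between keypoints."""
        nodes = dict(second_pose)
        edges = [(a, b) for a, b in connections if a in nodes and b in nodes]
        return {"nodes": nodes, "edges": edges}

    # Illustrative subset of preconfigured connections for a human object.
    connections = [("shoulder_l", "elbow_l"), ("elbow_l", "wrist_l")]
    action_graph = build_action_topology(
        {"shoulder_l": (0.1, -0.2, 0.0), "elbow_l": (0.8, 0.0, 0.1), "wrist_l": (0.0, 0.3, 0.0)},
        connections)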
According to some embodiments, the method 200 may further comprise: the digital person is driven by the second gesture information so that the digital person performs the above-mentioned actions.
According to some embodiments, the connection relationship of the key points of the digital person may be preconfigured. Therefore, based on the second gesture information, all key points are connected according to the preset key point connection relation of the digital person, and the digital person can be enabled to present the same action as the object in the video. Further, the shape and the texture of the connecting line (corresponding to the muscle connecting the joint points) of each pair of adjacent key points can be set for the digital person, so that the digital person is rendered in three dimensions, and the digital person is more vivid and visual.
FIG. 3 shows a schematic diagram of a motion capture process 300 utilizing a motion recognition method of an embodiment of the present disclosure.
As shown in fig. 3, in the motion capture process 300, a current video frame 301 in a video is acquired. The current video frame 301 includes an object 302 to be identified.
The current video frame 301 is input into the object detection model 310, which outputs the region where the object is located. The object detection model 310 includes a body detection model 312 and a hand detection model 314. Inputting the current video frame 301 into the body detection model 312 yields the body region 361 (excluding the hands) of the object, and inputting the current video frame 301 into the hand detection model 314 yields the hand region 362 of the object.
The body region 361 and the hand region 362 are input into the motion capture model 320. The motion capture model 320 outputs initial pose information 363 for each key point of the subject's body (without the hand) and initial pose information 364 for each key point of the hand.
The motion capture model 320 may include only one model that outputs initial pose information 363 for each keypoint of the body and initial pose information 364 for each keypoint of the hand simultaneously. The motion capture model 320 may also include two models (not shown in fig. 3) of a body capture model and a hand capture model. The body region 361 is input into the body capturing model, and initial posture information 363 of each key point of the body (without the hand) can be obtained. The initial pose information 364 of each key point of the hand can be obtained by inputting the hand region 362 into the hand capture model.
The action matching model 330 includes a body matching model 332 and a hand matching model 334.
Initial pose information 363 for each keypoint of the body (excluding hands) is input to the body matching model 332, and the body matching model 332 outputs corrected pose information 365 for each keypoint of the body. As shown in fig. 3, the body region 361 and the initial pose information 363 of the body key point may be input into the body matching model 332 together, so that the body matching model 332 corrects the initial pose information 363 in combination with the image features of the body region 361, and the accuracy of the corrected pose information 365 is improved.
Initial pose information 364 for each keypoint of the hand is input to the hand matching model 334, and the hand matching model 334 outputs corrected pose information 366 for each keypoint of the hand. As shown in fig. 3, the hand region 362 and the initial pose information 364 of the hand keypoints may be input into the hand matching model 334 together, so that the hand matching model 334 corrects the initial pose information 364 in combination with the image features of the hand region 362, improving the accuracy of the corrected pose information 366.
The corrected pose information 365 for the body keypoints and the corrected pose information 366 for the hand keypoints are input to the action stitching module 340. The motion stitching module 340 stitches the posture information 365 of the body with the posture information 366 of the hand based on the posture information of the wrist joint, and obtains posture information 367 of the whole object.
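A minimal sketch of the splicing performed by the action stitching module 340 follows; treating the hand keypoint angles as offsets applied on top of the wrist joint pose is an assumption made only for illustration.

    def stitch_poses(body_pose, hand_pose, wrist_key="wrist_l"):
        """Splice corrected hand pose information onto corrected body pose information,
        anchored at the wrist joint, to obtain pose information of the whole object."""
        wrist = body_pose[wrist_key]               # wrist pose taken from the body branch
        whole = dict(body_pose)
        for name, angles in hand_pose.items():
            # Assumption: hand angles are expressed relative to the wrist and offset by it.
            whole[name] = tuple(w + a for w, a in zip(wrist, angles))
        return whole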
Rendering module 350 drives digital person 304 based on object pose information 367 to render generated image 303. In image 303, digital person 304 presents the same action as object 302 in the video.
According to an embodiment of the present disclosure, a training method of an action matching model is also provided. Based on this approach, a trained motion matching model can be obtained. The trained motion matching model may be applied in a motion recognition method (e.g., motion recognition method 200 described above) of an embodiment of the present disclosure to correct the first pose information.
Fig. 4 illustrates a flowchart of a method 400 of training an action matching model according to an embodiment of the present disclosure. The method 400 may be performed by a server or a client device. As shown in fig. 4, the method 400 includes steps S410-S440.
In step S410, a sample posture information sequence and a posture label corresponding to the sample posture information sequence are acquired. The sample pose information sequence corresponds to a sequence of actions of the sample object, the sequence of actions comprising a plurality of actions. The sample posture information sequence includes a plurality of sample posture information corresponding to the plurality of actions, respectively. The gesture tag includes real gesture information of a last action of the plurality of actions.
In step S420, the sample gesture information sequence is input into the motion matching model to obtain the predicted gesture information of the last motion output by the motion matching model.
In step S430, a loss value of the motion matching model is determined based at least on the predicted pose information and the true pose information.
In step S440, parameters of the motion matching model are adjusted based on the loss value.
According to the embodiment of the disclosure, the action matching model takes a pose information sequence corresponding to an action sequence as input and outputs the pose information of the last action in the action sequence. By training the action matching model, it can learn the association between a historical pose information sequence and the current pose information, so that the trained action matching model can correct the current pose information using historical pose information, thereby improving the accuracy of video action recognition.
The steps of method 400 are described in detail below.
In step S410, a sample posture information sequence and a posture label corresponding to the sample posture information sequence are acquired. The sample pose information sequence corresponds to a sequence of actions of the sample object, the sequence of actions comprising a plurality of actions. The sample posture information sequence includes a plurality of sample posture information corresponding to the plurality of actions, respectively. The gesture tag includes real gesture information of a last action of the plurality of actions.
According to some embodiments, the sample pose information sequence may be obtained as follows steps S412 and S414.
In step S412, a sequence of real pose information of the sample object is acquired, the sequence of real pose information including real pose information of each of a plurality of actions of the sample object.
In step S414, noise is added to the true pose information sequence to generate a sample pose information sequence.
According to the embodiment, the training sample is generated by adding noise to the real gesture information sequence, manual labeling is not needed, and the sample generation efficiency is improved.
According to some embodiments, the real pose information sequence may be obtained from an open-source motion capture dataset, for example AMASS (Archive of Motion Capture As Surface Shapes), AIST++ (a dance motion dataset), or the like. By adding noise to the real pose information sequence, a sample pose information sequence that simulates the pose conditions encountered in video can be obtained.
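A sketch of steps S412-S414 under stated assumptions: the real sequence is an array of pose-angle vectors, and the added noise is Gaussian; neither the noise type nor its scale is specified by the disclosure.

    import numpy as np

    def make_training_sample(real_pose_seq, noise_std=0.05, rng=None):
        """Generate a sample pose information sequence and its pose label.

        real_pose_seq: array of shape (T, D) with the real pose information of T actions.
        The label is the (unperturbed) real pose information of the last action."""
        rng = rng or np.random.default_rng(0)
        sample_seq = real_pose_seq + rng.normal(0.0, noise_std, size=real_pose_seq.shape)
        label = real_pose_seq[-1].copy()
        return sample_seq, label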
In step S420, the sample gesture information sequence is input into the motion matching model to obtain the predicted gesture information of the last motion output by the motion matching model. The predicted pose information includes predicted keypoint pose information, such as an x-axis pose angle, a y-axis pose angle, a z-axis pose angle, for each of a plurality of keypoints of the sample object.
According to some embodiments, the action matching model includes a position detection module and a gesture detection module. The sample pose information includes keypoint pose information for each of a plurality of keypoints of the sample object. Accordingly, step S420 may include the following steps S422 and S424.
In step S422, the last sample gesture information in the sample gesture information sequence is input to the position detection module, so as to obtain the predicted position information of each of the plurality of key points output by the position detection module.
In step S424, the sample gesture information sequence and the predicted position information of each of the plurality of key points are input to the gesture detection module, so as to obtain the predicted gesture information output by the gesture detection module. The predicted pose information includes predicted keypoint pose information for each of the plurality of keypoints.
The position and posture of the key points (joints) of the human body are related. According to the embodiment, the position information of the key points is predicted based on the gesture information (sample gesture information), and the predicted gesture information is generated based on the predicted position information, so that the mutual verification of the position information and the gesture information can be realized, and the accuracy of the predicted gesture information is improved.
According to some embodiments, the predicted location information of the keypoint includes three-dimensional coordinates of the keypoint, e.g. x-axis coordinates, y-axis coordinates and z-axis coordinates.
The position detection module may be implemented as a neural network model, such as an SMPL (Skinned Multi-Person Linear) model, an SMPL-X (SMPL eXpressive) model, or the like.
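Below is a hedged sketch of a two-module matching model reflecting the data flow of steps S422-S424; the GRU encoder, the layer sizes, and the use of a small MLP in place of an SMPL-style position module are all assumptions of this sketch.

    import torch
    from torch import nn

    class ActionMatchingModel(nn.Module):
        """Position module: predicts 3D keypoint positions from the last sample pose (S422).
        Pose module: predicts the last-action pose from the sequence plus those positions (S424)."""

        def __init__(self, num_kpts=24, pose_dim=3, hidden=128):
            super().__init__()
            d = num_kpts * pose_dim
            self.position_module = nn.Sequential(
                nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, num_kpts * 3))
            self.encoder = nn.GRU(input_size=d, hidden_size=hidden, batch_first=True)
            self.pose_head = nn.Linear(hidden + num_kpts * 3, d)

        def forward(self, pose_seq):                            # pose_seq: (B, T, d)
            positions = self.position_module(pose_seq[:, -1])   # last sample pose only
            _, h = self.encoder(pose_seq)                       # summary of the whole sequence
            pred_pose = self.pose_head(torch.cat([h[-1], positions], dim=-1))
            return pred_pose, positions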
According to some embodiments, the position detection module may have completed training prior to execution of the method 400. Accordingly, the method 400 is only used to train the pose detection module, i.e., only the parameters of the pose detection module are adjusted in the subsequent step S440.
In other embodiments, the method 400 may also jointly train the position detection module and the pose detection module, that is, in the subsequent step S440, the parameters of both the position detection module and the pose detection module are adjusted.
In step S430, a loss value of the motion matching model is determined based at least on the predicted pose information and the true pose information.
According to some embodiments, the loss value of the motion matching model may be determined based only on the predicted pose information and the true pose information of the last action. For example, the mean squared error (MSE) or mean absolute error (MAE) between the predicted pose information and the true pose information may be used as the loss value of the motion matching model.
According to some embodiments, in addition to the predicted pose information and the true pose information for the last action, other factors may be further considered in determining the loss value for the action matching model.
According to some embodiments, the pose tag includes the real position information, real velocity, real keypoint pose information (i.e., the real pose information described above), and real angular velocity of each of the plurality of keypoints for the last action. This real information may be obtained, for example, from the real pose information sequence corresponding to the sample pose information sequence, or by computation from that sequence. Accordingly, step S430 may include the following steps S432-S436.
In step S432, the predicted speed of each of the plurality of key points is determined based on the predicted position information of each of the plurality of key points.
In step S434, the predicted angular velocity of each of the plurality of keypoints is determined based on the predicted keypoint pose information of each of the plurality of keypoints.
In step S436, a loss value of the motion matching model is determined based on the predicted position information, the true position information, the predicted speed, the true speed, the predicted key point posture information, the true key point posture information, the predicted angular speed, and the true angular speed of each of the plurality of key points.
According to the embodiment, the loss value comprehensively considers the position, the speed, the angular speed and the attitude information of the key points, so that the jitter of the predicted attitude information output by the model can be reduced, and the motion of the model output is natural and smooth.
According to some embodiments, for step S432, for any keypoint, the first sample pose information in the sample pose information sequence may be input into the position detection module to obtain the initial position information of the keypoint under the first sample pose information. The displacement of the keypoint is the difference between its predicted position information and its initial position information, and the time difference is the difference between the timestamp of the last sample pose information and the timestamp of the first sample pose information. The predicted speed of the keypoint may be the quotient of the displacement and the time difference.
According to some embodiments, for step S434, for any keypoint, the initial position information of the keypoint under the first sample pose information may be input into the pose detection module to obtain the initial pose angle of the keypoint under the first sample pose information. The angular displacement of the keypoint is the difference between its predicted keypoint pose information (pose angle) and its initial pose angle, and the time difference is computed as above. The predicted angular velocity of the keypoint may be the quotient of the angular displacement and the time difference.
According to some embodiments, for step S436, a loss value of the action matching model may be determined based on a first gap of predicted position information and real position information of the keypoint, a second gap of predicted speed and real speed, a third gap of predicted keypoint pose information and real keypoint pose information, and a fourth gap of predicted angular speed and real angular speed. The loss value of the motion matching model may be, for example, a weighted sum of the first gap, the second gap, the third gap, and the fourth gap.
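To make steps S432-S436 concrete, the sketch below derives velocities and angular velocities as finite differences over the sequence time span and combines the four gaps with weights; the equal default weights and the squared-error form of each gap are assumptions.

    import torch
    import torch.nn.functional as F

    def matching_loss(pred_pos, true_pos, pred_pose, true_pose,
                      init_pos, init_pose, true_vel, true_ang_vel,
                      time_diff, weights=(1.0, 1.0, 1.0, 1.0)):
        """Loss of the action matching model for one sample pose information sequence.

        time_diff: timestamp difference between the last and the first sample pose info."""
        pred_vel = (pred_pos - init_pos) / time_diff       # S432: displacement / time difference
        pred_ang_vel = (pred_pose - init_pose) / time_diff # S434: angular displacement / time
        gaps = (F.mse_loss(pred_pos, true_pos),            # first gap: positions
                F.mse_loss(pred_vel, true_vel),            # second gap: speeds
                F.mse_loss(pred_pose, true_pose),          # third gap: keypoint pose angles
                F.mse_loss(pred_ang_vel, true_ang_vel))    # fourth gap: angular speeds
        return sum(w * g for w, g in zip(weights, gaps))   # S436: weighted sum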
In step S440, parameters of the motion matching model are adjusted based on the loss value.
It should be noted that steps S410 to S440 may be performed repeatedly until a preset termination condition is satisfied. The preset termination condition may be, for example, that the number of training cycles reaches a first threshold, that the loss value is smaller than a second threshold, or that the loss value converges. When the preset termination condition is satisfied, training of the motion matching model is complete. The trained motion matching model may be applied in an action recognition method of an embodiment of the present disclosure (e.g., the action recognition method 200 described above) to correct the first pose information.
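Finally, a hedged sketch of the outer training loop, repeating steps S410-S440 until one of the termination conditions above is met; the Adam optimizer, the thresholds, and the data-loader interface are assumptions.

    from torch import optim

    def train(model, data_loader, compute_loss, max_epochs=100, loss_threshold=1e-3, lr=1e-4):
        """compute_loss(pred_pose, pred_pos, labels) is assumed to implement step S430,
        e.g. along the lines of the matching_loss sketch above."""
        optimizer = optim.Adam(model.parameters(), lr=lr)
        for epoch in range(max_epochs):                       # termination: cycle-count threshold
            epoch_loss = 0.0
            for sample_seq, labels in data_loader:            # S410: sample sequence and label
                pred_pose, pred_pos = model(sample_seq)       # S420: predict last-action pose
                loss = compute_loss(pred_pose, pred_pos, labels)   # S430: loss value
                optimizer.zero_grad()
                loss.backward()                               # S440: adjust model parameters
                optimizer.step()
                epoch_loss += loss.item()
            if epoch_loss / max(len(data_loader), 1) < loss_threshold:  # termination: loss threshold
                break
        return model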
According to an embodiment of the present disclosure, there is also provided an action recognition apparatus. Fig. 5 shows a block diagram of an action recognition apparatus 500 according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus 500 includes an acquisition unit 510, a first determination unit 520, a correction unit 530, and a second determination unit 540.
The acquisition unit 510 is configured to acquire a current video frame in a video, wherein the video comprises an object to be identified.
The first determining unit 520 is configured to determine first pose information of the object in the current video frame.
The correction unit 530 is configured to correct the first pose information based on a plurality of historical video frames preceding the current video frame to obtain second pose information.
The second determining unit 540 is configured to determine an action of the object in the current video frame based on the second pose information.
According to an embodiment of the present disclosure, the pose information of the current video frame (i.e., the first pose information) is corrected based on the historical video frames to obtain corrected pose information (i.e., the second pose information). The action recognition process thereby takes the continuity and correlation between video frames into account, which improves the accuracy of the second pose information and, in turn, the accuracy of the action recognition result, so that the action of the object recognized from the video is smoother and more natural.
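As an illustration only, the units of the apparatus 500 could be wired together roughly as follows. The pose estimator, action matching model, classifier, history length, and the choice to cache corrected poses as the historical pose information are all assumptions made for this sketch, not the apparatus itself.

```python
from collections import deque

class ActionRecognizer:
    """Sketch of apparatus 500: detect pose, correct it with history, classify."""

    def __init__(self, pose_estimator, matching_model, classifier, history_len=16):
        self.pose_estimator = pose_estimator      # plays the role of unit 520
        self.matching_model = matching_model      # plays the role of unit 530
        self.classifier = classifier              # plays the role of unit 540
        self.history = deque(maxlen=history_len)  # pose info of historical frames

    def process_frame(self, frame):
        first_pose = self.pose_estimator(frame)        # first pose information
        sequence = list(self.history) + [first_pose]   # chronological sequence
        second_pose = self.matching_model(sequence)    # corrected pose information
        self.history.append(second_pose)
        return self.classifier(second_pose)            # action in the current frame
```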
According to some embodiments, the first determining unit 520 comprises: a first detection subunit configured to perform target detection on the current video frame to identify a region where the object is located; and a second detection subunit configured to perform keypoint pose detection on the region to obtain the first pose information, wherein the first pose information includes first keypoint pose information of each of a plurality of keypoints of the object.
According to some embodiments, the correction unit 530 is further configured to: correct the first pose information based on third pose information of the object in each of the plurality of historical video frames to obtain the second pose information.
According to some embodiments, the correction unit 530 includes: a generation subunit configured to arrange the third pose information of each of the plurality of historical video frames and the first pose information in chronological order to generate a target pose information sequence; and a determining subunit configured to determine the second pose information corresponding to the target pose information sequence based on a preset association between a historical pose information sequence and current pose information.
According to some embodiments, the association is represented by a trained action matching model, and the determining subunit is further configured to: input the target pose information sequence into the action matching model to obtain the second pose information output by the action matching model.
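A concrete sketch of the generation and determining subunits is shown below; the array layout and the model's call signature are assumptions introduced for the example.

```python
import numpy as np

def correct_pose(history_poses, first_pose, action_matching_model):
    """Arrange historical (third) pose info and the current (first) pose info in
    chronological order and let the trained model output the second pose info."""
    # history_poses: list of (num_keypoints, pose_dim) arrays, oldest first
    # first_pose:    (num_keypoints, pose_dim) array for the current frame
    target_sequence = np.stack(history_poses + [first_pose], axis=0)
    # The trained action matching model maps the sequence to corrected pose
    # information for the current frame (the last element of the sequence).
    return action_matching_model(target_sequence)
```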
It should be appreciated that the various elements of the apparatus 500 shown in fig. 5 may correspond to the various steps in the method 200 described with reference to fig. 2. Thus, the operations, features and advantages described above with respect to method 200 are equally applicable to apparatus 500 and the units comprised thereof. For brevity, certain operations, features and advantages are not described in detail herein.
According to an embodiment of the present disclosure, there is also provided a training apparatus of an action matching model. Fig. 6 shows a block diagram of a training apparatus 600 of an action matching model according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 includes an acquisition unit 610, an input unit 620, a determination unit 630, and an adjustment unit 640.
The acquisition unit 610 is configured to acquire a sample pose information sequence and a pose label corresponding to the sample pose information sequence, wherein the sample pose information sequence corresponds to an action sequence of a sample object, the action sequence includes a plurality of actions, the sample pose information sequence includes a plurality of pieces of sample pose information corresponding to the plurality of actions respectively, and the pose label includes real pose information of the last action among the plurality of actions.
The input unit 620 is configured to input the sample pose information sequence into the action matching model to obtain the predicted pose information of the last action output by the action matching model.
The determining unit 630 is configured to determine a loss value of the action matching model based at least on the predicted pose information and the real pose information.
The adjustment unit 640 is configured to adjust parameters of the action matching model based on the loss value.
According to the embodiment of the present disclosure, the action matching model takes a pose information sequence corresponding to an action sequence as input and outputs the pose information of the last action in the action sequence. By training the action matching model, it learns the association between a historical pose information sequence and current pose information, so that the trained action matching model can correct the current pose information using historical pose information, thereby improving the accuracy of video action recognition.
It should be appreciated that the various elements of the apparatus 600 shown in fig. 6 may correspond to the various steps in the method 400 described with reference to fig. 4. Thus, the operations, features and advantages described above with respect to method 400 are equally applicable to apparatus 600 and the units comprised thereof. For brevity, certain operations, features and advantages are not described in detail herein.
Although specific functions are discussed above with reference to specific modules, it should be noted that the functions of the various modules discussed herein may be divided into multiple modules and/or at least some of the functions of the multiple modules may be combined into a single module.
It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various units described above with respect to fig. 5 and 6 may be implemented in hardware or in hardware in combination with software and/or firmware. For example, the units may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these units may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the units 510-640 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip including one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
There is also provided, in accordance with an embodiment of the present disclosure, an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the action recognition method and/or the training method of the action matching model of embodiments of the present disclosure.
According to an embodiment of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the action recognition method and/or the training method of the action matching model of the embodiments of the present disclosure.
According to an embodiment of the present disclosure, there is also provided a computer program product comprising computer program instructions which, when executed by a processor, implement the action recognition method and/or the training method of the action matching model of embodiments of the present disclosure.
Referring to fig. 7, a block diagram of an electronic device 700, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 708 may include, but is not limited to, magnetic disks and optical disks. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices over computer networks, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth devices, 802.11 devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and/or the like.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as method 200. For example, in some embodiments, method 200 and/or method 400 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of method 200 or method 400 described above may be performed. Alternatively, in other embodiments, computing unit 701 may be configured to perform method 200 and/or method 400 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved; no limitation is imposed herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatuses are merely illustrative embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalents thereof. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. It should be noted that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (20)

1. A method of action recognition, comprising:
acquiring a current video frame in a video, wherein the video comprises an object to be identified;
determining first pose information of the object in the current video frame;
correcting the first pose information based on a plurality of historical video frames preceding the current video frame to obtain second pose information; and
determining an action of the object in the current video frame based on the second pose information.
2. The method of claim 1, wherein the determining first pose information of the object in the current video frame comprises:
performing target detection on the current video frame to identify a region where the object is located; and
performing keypoint pose detection on the region to obtain the first pose information, wherein the first pose information comprises first keypoint pose information of each of a plurality of keypoints of the object.
3. The method of claim 1 or 2, wherein the correcting the first pose information based on a plurality of historical video frames preceding the current video frame to obtain second pose information comprises:
correcting the first pose information based on third pose information of the object in each of the plurality of historical video frames to obtain the second pose information.
4. The method of claim 3, wherein the correcting the first pose information based on third pose information of the object in each of the plurality of historical video frames to obtain the second pose information comprises:
arranging the third pose information of each of the plurality of historical video frames and the first pose information in chronological order to generate a target pose information sequence; and
determining the second pose information corresponding to the target pose information sequence based on a preset association between a historical pose information sequence and current pose information.
5. The method of claim 4, wherein the association is represented by a trained action matching model, and wherein the determining the second pose information corresponding to the target pose information sequence based on a preset association between a historical pose information sequence and current pose information comprises:
inputting the target pose information sequence into the action matching model to obtain the second pose information output by the action matching model.
6. The method of any of claims 1-5, wherein the second pose information comprises second keypoint pose information for each of a plurality of keypoints of the object, and wherein determining, based on the second pose information, an action of the object in the current video frame comprises:
determining the action based on connection relationships among the plurality of keypoints and the second pose information.
7. The method of any of claims 1-6, further comprising:
driving a digital human using the second pose information, so that the digital human performs the action.
8. A method of training an action matching model, comprising:
acquiring a sample pose information sequence and a pose label corresponding to the sample pose information sequence, wherein the sample pose information sequence corresponds to an action sequence of a sample object, the action sequence comprises a plurality of actions, the sample pose information sequence comprises a plurality of pieces of sample pose information respectively corresponding to the plurality of actions, and the pose label comprises real pose information of the last action among the plurality of actions;
inputting the sample pose information sequence into the action matching model to obtain predicted pose information of the last action output by the action matching model;
determining a loss value of the action matching model based at least on the predicted pose information and the real pose information; and
adjusting parameters of the action matching model based on the loss value.
9. The method of claim 8, wherein the acquiring a sample pose information sequence comprises:
acquiring a real pose information sequence of the sample object, wherein the real pose information sequence comprises real pose information of each of the plurality of actions of the sample object; and
adding noise to the real pose information sequence to generate the sample pose information sequence.
10. The method of claim 8 or 9, wherein the action matching model comprises a position detection module and a pose detection module, the sample pose information comprises keypoint pose information of each of a plurality of keypoints of the sample object, and the inputting the sample pose information sequence into the action matching model to obtain predicted pose information of the last action output by the action matching model comprises:
inputting the last sample pose information in the sample pose information sequence into the position detection module to obtain predicted position information of each of the plurality of keypoints output by the position detection module; and
inputting the sample pose information sequence and the predicted position information of each of the plurality of keypoints into the pose detection module to obtain the predicted pose information output by the pose detection module, wherein the predicted pose information comprises predicted keypoint pose information of each of the plurality of keypoints.
11. The method of claim 10, wherein the pose label comprises real position information, a real velocity, real keypoint pose information, and a real angular velocity of each of the plurality of keypoints, and the determining a loss value of the action matching model based at least on the predicted pose information and the real pose information comprises:
determining a predicted velocity of each of the plurality of keypoints based on the predicted position information of each of the plurality of keypoints;
determining a predicted angular velocity of each of the plurality of keypoints based on the predicted keypoint pose information of each of the plurality of keypoints; and
determining the loss value of the action matching model based on the predicted position information, the real position information, the predicted velocity, the real velocity, the predicted keypoint pose information, the real keypoint pose information, the predicted angular velocity, and the real angular velocity of each of the plurality of keypoints.
12. An action recognition device, comprising:
an acquisition unit configured to acquire a current video frame in a video, wherein the video includes an object to be identified;
a first determining unit configured to determine first pose information of the object in the current video frame;
a correction unit configured to correct the first pose information based on a plurality of historical video frames preceding the current video frame to obtain second pose information; and
a second determining unit configured to determine an action of the object in the current video frame based on the second pose information.
13. The apparatus of claim 12, wherein the first determining unit comprises:
a first detection subunit configured to perform target detection on the current video frame to identify a region where the object is located; and
a second detection subunit configured to perform keypoint pose detection on the region to obtain the first pose information, wherein the first pose information comprises first keypoint pose information of each of a plurality of keypoints of the object.
14. The apparatus of claim 12 or 13, wherein the correction unit is further configured to:
correct the first pose information based on third pose information of the object in each of the plurality of historical video frames to obtain the second pose information.
15. The apparatus of claim 14, wherein the correction unit comprises:
a generation subunit configured to arrange the third pose information of each of the plurality of historical video frames and the first pose information in chronological order to generate a target pose information sequence; and
a determining subunit configured to determine the second pose information corresponding to the target pose information sequence based on a preset association between a historical pose information sequence and current pose information.
16. The apparatus of claim 15, wherein the association is represented by a trained action matching model, and wherein the determining subunit is further configured to:
input the target pose information sequence into the action matching model to obtain the second pose information output by the action matching model.
17. A training device of an action matching model, comprising:
an acquisition unit configured to acquire a sample pose information sequence and a pose label corresponding to the sample pose information sequence, wherein the sample pose information sequence corresponds to an action sequence of a sample object, the action sequence comprises a plurality of actions, the sample pose information sequence comprises a plurality of pieces of sample pose information respectively corresponding to the plurality of actions, and the pose label comprises real pose information of the last action among the plurality of actions;
an input unit configured to input the sample pose information sequence into the action matching model to obtain predicted pose information of the last action output by the action matching model;
a determining unit configured to determine a loss value of the action matching model based at least on the predicted pose information and the real pose information; and
an adjustment unit configured to adjust parameters of the action matching model based on the loss value.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
19. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-11.
20. A computer program product comprising computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1-11.
CN202310271127.6A 2023-03-17 2023-03-17 Action recognition method, model training method and device Active CN116311519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310271127.6A CN116311519B (en) 2023-03-17 2023-03-17 Action recognition method, model training method and device

Publications (2)

Publication Number Publication Date
CN116311519A (en) 2023-06-23
CN116311519B (en) 2024-04-19

Family

ID=86828425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310271127.6A Active CN116311519B (en) 2023-03-17 2023-03-17 Action recognition method, model training method and device

Country Status (1)

Country Link
CN (1) CN116311519B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410770A1 (en) * 2018-03-12 2020-12-31 LINE Plus Corporation Augmented reality (ar) providing apparatus and method for recognizing context using neural network, and non-transitory computer-readable record medium for executing the method
WO2021129064A1 (en) * 2019-12-24 2021-07-01 腾讯科技(深圳)有限公司 Posture acquisition method and device, and key point coordinate positioning model training method and device
US20220138996A1 (en) * 2020-10-29 2022-05-05 Wipro Limited Method and system for augmented reality (ar) content creation
WO2022140540A1 (en) * 2020-12-23 2022-06-30 Facebook Technologies, Llc Simulated control for 3-dimensional human poses in virtual reality environments
CN113095129A (en) * 2021-03-01 2021-07-09 北京迈格威科技有限公司 Attitude estimation model training method, attitude estimation device and electronic equipment
CN113569627A (en) * 2021-06-11 2021-10-29 北京旷视科技有限公司 Human body posture prediction model training method, human body posture prediction method and device
CN113420719A (en) * 2021-07-20 2021-09-21 北京百度网讯科技有限公司 Method and device for generating motion capture data, electronic equipment and storage medium
CN113780197A (en) * 2021-09-15 2021-12-10 广州虎牙科技有限公司 Motion attitude information completion method, electronic device and computer-readable storage medium
CN114494935A (en) * 2021-12-15 2022-05-13 北京百度网讯科技有限公司 Video information processing method and device, electronic equipment and medium
CN114494962A (en) * 2022-01-24 2022-05-13 上海商汤智能科技有限公司 Object identification method, network training method, device, equipment and medium
CN114630190A (en) * 2022-02-28 2022-06-14 北京百度网讯科技有限公司 Joint posture parameter determining method, model training method and device
CN115204044A (en) * 2022-07-11 2022-10-18 阿里巴巴(中国)有限公司 Method, apparatus and medium for generating trajectory prediction model and processing trajectory information
CN115268285A (en) * 2022-07-14 2022-11-01 深圳绿米联创科技有限公司 Device control method, device, electronic device, and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117218297A (en) * 2023-09-29 2023-12-12 北京百度网讯科技有限公司 Human body reconstruction parameter generation method, device, equipment and medium

Also Published As

Publication number Publication date
CN116311519B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
US20180088663A1 (en) Method and system for gesture-based interactions
CN114972958B (en) Key point detection method, neural network training method, device and equipment
CN116311519B (en) Action recognition method, model training method and device
CN115239888B (en) Method, device, electronic equipment and medium for reconstructing three-dimensional face image
CN114792445A (en) Mining method and device for target human body posture sample, equipment and medium
CN116228867B (en) Pose determination method, pose determination device, electronic equipment and medium
CN116245998B (en) Rendering map generation method and device, and model training method and device
CN116433847A (en) Gesture migration method and device, electronic equipment and storage medium
CN115661375B (en) Three-dimensional hair style generation method and device, electronic equipment and storage medium
CN114120448B (en) Image processing method and device
CN115761855B (en) Face key point information generation, neural network training and three-dimensional face reconstruction method
CN114119935B (en) Image processing method and device
CN116402844A (en) Pedestrian tracking method and device
CN113325950B (en) Function control method, device, equipment and storage medium
CN115393514A (en) Training method of three-dimensional reconstruction model, three-dimensional reconstruction method, device and equipment
CN114119154A (en) Virtual makeup method and device
Ojha et al. Histogram based human computer interaction for gesture recognition
CN114596476A (en) Key point detection model training method, key point detection method and device
CN116824014B (en) Data generation method and device for avatar, electronic equipment and medium
CN116030191B (en) Method, device, equipment and medium for displaying virtual object
CN115937430B (en) Method, device, equipment and medium for displaying virtual object
CN115909413B (en) Method, apparatus, device, and medium for controlling avatar
CN115578451B (en) Image processing method, training method and device of image processing model
CN115345981B (en) Image processing method, image processing device, electronic equipment and storage medium
CN114120412B (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant