CN114842549A - Method, device, equipment, storage medium and product for training a motion recognition model


Info

Publication number
CN114842549A
CN114842549A (application number CN202210265324.2A)
Authority
CN
China
Prior art keywords
video
action
sample
relative position
videos
Prior art date
Legal status
Pending
Application number
CN202210265324.2A
Other languages
Chinese (zh)
Inventor
陈柯辛
武子熙
蒋昊青
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210265324.2A
Publication of CN114842549A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with a fixed number of clusters, e.g. K-means clustering

Abstract

The application provides a method, apparatus, device, storage medium, and product for training an action recognition model. It belongs to the field of artificial intelligence and can be applied to multimedia resource processing scenarios. The method comprises the following steps: acquiring a plurality of sample videos; segmenting the target object in each sample video to obtain a plurality of action parts corresponding to the sample video, the action parts being body parts of the target object; determining relative position vectors among the action parts, which represent the positional relationships among them, to obtain the relative position vector corresponding to each sample video; clustering the sample videos based on their corresponding relative position vectors to obtain a plurality of video clusters, where each video cluster includes at least one sample video and the sample videos in the same cluster have the same action; and training an action recognition model based on the video clusters. The method improves the training efficiency of the action recognition model.

Description

Method, device, equipment, storage medium and product for training motion recognition model
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a product for training a motion recognition model.
Background
With the development of artificial intelligence technology, motion recognition models are being applied ever more widely. For example, a motion recognition model may be applied in an intelligent education scenario: the model recognizes the actions in a teacher's teaching video, so that Artificial Intelligence (AI) assessment and coaching can be provided to the teacher based on the recognized action categories.
In the related art, a supervised deep learning algorithm is used to train the motion recognition model, i.e., the model is trained on labeled sample videos. Because the action labels of the sample videos are annotated manually and the volume of sample videos is large, manual annotation of the action labels takes considerable time, which reduces the training efficiency of the motion recognition model.
Disclosure of Invention
The embodiments of the application provide a method, apparatus, device, storage medium, and product for training a motion recognition model, which can improve the training efficiency of the motion recognition model. The technical solution is as follows:
in one aspect, a method for training a motion recognition model is provided, the method including:
acquiring a plurality of sample videos;
segmenting the target object in the sample video to obtain a plurality of action parts corresponding to the sample video, where the action parts are body parts of the target object and the target object is the object performing the action;
determining relative position vectors among the action parts to obtain relative position vectors corresponding to the sample video, wherein the relative position vectors are used for representing the position relation among the action parts;
clustering the sample videos based on the corresponding relative position vectors of the sample videos respectively to obtain a plurality of first video clusters, wherein the first video clusters comprise at least one sample video, and the sample videos in the same first video cluster have the same action;
training a motion recognition model based on the plurality of first video clusters.
In another aspect, an apparatus for training a motion recognition model is provided, the apparatus including:
the first acquisition module is used for acquiring a plurality of sample videos;
a segmentation module, configured to segment the target object in the sample video to obtain a plurality of action parts corresponding to the sample video, where the action parts are body parts of the target object and the target object is the object performing the action;
a determining module, configured to determine relative position vectors between the multiple action portions, to obtain a relative position vector corresponding to the sample video, where the relative position vector is used to represent a position relationship between the multiple action portions;
the clustering module is used for clustering the sample videos based on the corresponding relative position vectors of the sample videos respectively to obtain a plurality of first video clusters, wherein the first video clusters comprise at least one sample video, and the actions of the sample videos in the same first video cluster are the same;
and the training module is used for training the motion recognition model based on the plurality of first video clusters.
In some embodiments, the determining module is configured to determine a first action part and a plurality of second action parts in the plurality of action parts based on part information of a plurality of action parts corresponding to the sample video, where the first action part is a reference action part of the target object; and determining relative position vectors of the plurality of second action parts relative to the first action part to obtain the corresponding relative position vector of the sample video.
In some embodiments, the determining module is configured to determine a first center position and a second center position, respectively, where the first center position is a center position of the first action portion, and the second center position is a center position of the second action portion; determining a vector between the first center position and the second center position as a relative position vector of the second motion part with respect to the first motion part; and forming a corresponding relative position vector of the sample video by using the relative position vectors of the plurality of second action parts relative to the first action part.
In some embodiments, the determining module is configured to determine a first center position and a plurality of boundary positions respectively, where the first center position is the center position of the first action part and the plurality of boundary positions are boundary positions of the second action part; determine the vectors between the first center position and the respective boundary positions; splice these vectors to obtain the relative position vector of the second action part relative to the first action part; and form the relative position vectors of the plurality of second action parts relative to the first action part into the relative position vector corresponding to the sample video.
In some embodiments, the sample video includes a plurality of video frames, and the relative position vectors corresponding to the sample video include the relative position vectors corresponding to the respective video frames; the clustering module is configured to splice the plurality of relative position vectors corresponding to the same video frame to obtain a first relative position vector corresponding to that video frame; splice the first relative position vectors corresponding to the video frames of the same sample video to obtain a second relative position vector corresponding to the sample video; and cluster the plurality of sample videos based on the second relative position vectors respectively corresponding to the plurality of sample videos to obtain the plurality of first video clusters.
In some embodiments, the clustering module is configured to determine a distance between any two second relative position vectors based on the second relative position vectors respectively corresponding to the plurality of sample videos; and under the condition that the distance is not greater than the preset distance, aggregating the two sample videos corresponding to the two second relative position vectors into the same first video cluster.
In some embodiments, the apparatus further comprises:
the second acquisition module is used for acquiring a target video, wherein the target video is a video of the action category to be identified;
and the input and output module is used for inputting the target video into the action recognition model and outputting the action type of the target video and relative position vectors of a plurality of action parts of a target object in the target video, wherein the relative position vectors are used for explaining the reason that the target video is recognized as the action type.
In some embodiments, the training module is configured to determine a first action part and a plurality of second action parts among the plurality of action parts based on the part information of the plurality of action parts, where the first action part is the reference action part of the target object; determine a target action part from the plurality of second action parts, the target action part being the second action part containing the fewest pixels; cluster the plurality of sample videos based on the first action parts and the target action parts of the plurality of sample videos to obtain a plurality of second video clusters, where each second video cluster includes at least one sample video and the sample videos in the same second video cluster have the same action; and train the motion recognition model based on the plurality of first video clusters and the plurality of second video clusters.
In some embodiments, the training module is configured to determine a first similarity and a second similarity of any two sample videos, where the first similarity is a similarity between first motion portions of the two sample videos, and the second similarity is a similarity between target motion portions of the two sample videos; determining a third similarity between the two sample videos based on the first similarity and the second similarity; and aggregating the two sample videos into the same second video cluster under the condition that the third similarity is greater than a preset similarity.
In some embodiments, the segmentation module is configured to extract a plurality of video frames from the sample video; dividing a target object included in the video frame to obtain a plurality of action parts corresponding to the video frame; and combining the action parts corresponding to the plurality of video frames respectively into a plurality of action parts corresponding to the sample video.
In another aspect, a computer device is provided, comprising a processor and a memory, where the memory stores at least one computer program that is loaded and executed by the processor to implement the method for training a motion recognition model in the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement the method for training a motion recognition model in the embodiments of the present application.
In another aspect, a computer program product or a computer program is provided, comprising computer program code stored in a computer-readable storage medium. A processor of a computer device reads the computer program code from the computer-readable storage medium and executes it, causing the computer device to perform the method for training a motion recognition model provided in the various optional implementations of the above aspects.
The embodiments of the application provide a method for training an action recognition model. The method segments the object in each sample video to obtain a plurality of action parts of the object. Because the relative position vectors among the action parts effectively characterize the action made by the object, clustering the sample videos based on their respective relative position vectors yields a plurality of accurately classified video clusters. The action recognition model is then trained on the sample videos in the clusters obtained by clustering, which avoids labeling action labels for the sample videos one by one and thus improves the training efficiency of the action recognition model.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment of a training method for a motion recognition model according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for training a motion recognition model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for training a motion recognition model according to an embodiment of the present disclosure;
FIG. 4 is a schematic image diagram of a video frame according to an embodiment of the present application;
FIG. 5 is a schematic image diagram of another video frame provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a relative position vector provided by an embodiment of the present application;
FIG. 7 is a flowchart of a method for training a motion recognition model according to an embodiment of the present disclosure;
FIG. 8 is a flowchart of a method for training a motion recognition model according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of an application of a motion recognition model in an intelligent education scenario according to an embodiment of the present application;
FIG. 10 is a block diagram of a training apparatus for motion recognition model according to an embodiment of the present disclosure;
fig. 11 is a block diagram of a terminal according to an embodiment of the present application;
fig. 12 is a block diagram of a server according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.
The term "at least one" in this application means one or more, and the meaning of "a plurality" means two or more.
It should be noted that the embodiments of the present application involve related data such as sample videos and position information. When the above embodiments are applied to a product or technology, the relevant permission or consent must be obtained, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Hereinafter, terms related to the present application are explained.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, comprising both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The following describes an implementation environment related to the present application:
the training method of the motion recognition model provided by the embodiment of the application can be executed by computer equipment. In some embodiments, the computer device is a terminal 101 or a server 102. First, taking a computer device as the server 102 as an example, an implementation environment schematic diagram of the method for training the motion recognition model provided in the embodiment of the present application is described below. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 can be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
In some embodiments, the terminal 101 is a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart voice interaction device, a smart appliance, a vehicle-mounted terminal, or the like, but is not limited thereto. In some embodiments, the server 102 is an independent server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. The server 102 provides background services for the applications installed on the terminal 101. In some embodiments, the server 102 undertakes the primary computing work and the terminal 101 the secondary computing work; or the server 102 undertakes the secondary computing work and the terminal 101 the primary computing work; or the server 102 and the terminal 101 compute cooperatively using a distributed computing architecture.
In some embodiments, the computer device is configured to obtain sample videos and perform model training based on them to obtain the motion recognition model. Optionally, the motion recognition model is widely applicable to scenarios such as intelligent education and social entertainment. For example, applied to an intelligent education scenario, the model recognizes actions in teachers' teaching videos, so that AI assessment and coaching can be provided to teachers based on the recognized action categories. Applied to a social entertainment scenario, the model recognizes actions in social entertainment videos, so as to determine, based on the recognized action category, whether a video contains restricted content. In some embodiments, the motion recognition model also serves scenarios such as video understanding, highlight compilation, video subtitle generation, and action recognition, acting as the underlying support of video analysis tasks to improve the efficiency of video analysis.
Fig. 2 is a flowchart of a training method of a motion recognition model according to an embodiment of the present application, and referring to fig. 2, in the embodiment of the present application, an example of the training method performed by a computer device is described. The training method of the motion recognition model comprises the following steps:
201. a plurality of sample videos are acquired.
202. Segmenting the target object in the sample video to obtain a plurality of action parts corresponding to the sample video, where the action parts are body parts of the target object and the target object is the object performing the action.
203. And determining relative position vectors among the plurality of action parts to obtain corresponding relative position vectors of the sample video, wherein the relative position vectors are used for representing the position relation among the plurality of action parts.
204. Clustering the plurality of sample videos based on the relative position vectors of the action parts corresponding to the plurality of sample videos respectively to obtain a plurality of first video clusters, wherein the first video clusters comprise at least one sample video, and the actions of the sample videos in the same first video cluster are the same.
205. Based on the plurality of first video clusters, a motion recognition model is trained.
In some embodiments, determining a relative position vector between a plurality of motion portions to obtain a corresponding relative position vector of the sample video comprises:
determining a first action part and a plurality of second action parts in the plurality of action parts based on part information of the plurality of action parts corresponding to the sample video, wherein the first action part is a reference action part of the target object;
and determining relative position vectors of the plurality of second action parts relative to the first action part to obtain corresponding relative position vectors of the sample video.
In some embodiments, determining a relative position vector of the plurality of second motion portions with respect to the first motion portion to obtain a corresponding relative position vector of the sample video comprises:
respectively determining a first central position and a second central position, wherein the first central position is the central position of a first action part, and the second central position is the central position of a second action part;
determining a vector between the first center position and the second center position as a relative position vector of the second action site with respect to the first action site;
and forming a relative position vector corresponding to the sample video by using the relative position vectors of the plurality of second action parts relative to the first action part.
In some embodiments, determining a relative position vector of the plurality of second motion portions with respect to the first motion portion to obtain a corresponding relative position vector of the sample video comprises:
respectively determining a first central position and a plurality of boundary positions, wherein the first central position is the central position of a first action part, and the plurality of boundary positions are the boundary positions of a second action part;
determining vectors between the first center position and the plurality of boundary positions respectively;
splicing the vectors between the first central position and the respective boundary positions to obtain a relative position vector of the second action part relative to the first action part;
and forming a corresponding relative position vector of the sample video by using the relative position vectors of the plurality of second motion parts relative to the first motion part.
In some embodiments, the sample video includes a plurality of video frames, and the corresponding relative position vector of the sample video includes the corresponding relative position vector of each of the plurality of video frames;
based on the relative position vectors corresponding to the plurality of sample videos respectively, clustering the plurality of sample videos to obtain a plurality of first video clusters, including:
splicing a plurality of relative position vectors corresponding to the same video frame to obtain a first relative position vector corresponding to the same video frame;
splicing the first relative position vectors corresponding to the video frames of the same sample video to obtain a second relative position vector corresponding to the sample video;
and clustering the plurality of sample videos based on the second relative position vectors respectively corresponding to the plurality of sample videos to obtain a plurality of first video clusters.
In some embodiments, clustering the plurality of sample videos based on the second relative position vectors respectively corresponding to the plurality of sample videos to obtain a plurality of first video clusters includes:
determining a distance between any two second relative position vectors based on the second relative position vectors respectively corresponding to the plurality of sample videos;
and under the condition that the distance is not greater than the preset distance, aggregating the two sample videos corresponding to the two second relative position vectors into the same first video cluster.
In some embodiments, training the motion recognition model based on the plurality of first video clusters comprises:
determining a first action part and a plurality of second action parts in the plurality of action parts based on the part information of the plurality of action parts, wherein the first action part is a reference action part of the target object;
determining a target action part from the plurality of second action parts, the target action part being the second action part containing the fewest pixels;
clustering the plurality of sample videos based on the first action parts and the target action parts of the plurality of sample videos to obtain a plurality of second video clusters, wherein the second video clusters comprise at least one sample video, and the actions of the sample videos in the same second video cluster are the same;
training a motion recognition model based on the plurality of first video clusters and the plurality of second video clusters.
In some embodiments, clustering the plurality of sample videos based on the first motion part and the target motion part of the plurality of sample videos to obtain a plurality of second video clusters includes:
determining a first similarity and a second similarity of any two sample videos, wherein the first similarity is the similarity between first action parts of the two sample videos, and the second similarity is the similarity between target action parts of the two sample videos;
determining a third similarity between the two sample videos based on the first similarity and the second similarity;
and under the condition that the third similarity is greater than the preset similarity, aggregating the two sample videos into the same second video cluster.
In some embodiments, segmenting the target object in the sample video to obtain a plurality of motion portions corresponding to the sample video includes:
extracting a plurality of video frames from a sample video;
dividing a target object included in a video frame to obtain a plurality of action parts corresponding to the video frame;
and combining the action parts corresponding to the plurality of video frames into a plurality of action parts corresponding to the sample video.
In some embodiments, the method further comprises:
acquiring a target video, wherein the target video is a video of an action category to be identified;
and inputting the target video into the motion recognition model, and outputting the motion category of the target video and relative position vectors of a plurality of motion parts of the target object in the target video, wherein the relative position vectors are used for explaining the reason that the target video is recognized as the motion category.
The embodiment of the application provides a training method of an action recognition model, the method divides an object in a sample video to obtain a plurality of action parts of the object, because relative position vectors among the action parts can effectively represent actions made by the object, and then clustering the sample videos based on the relative position vectors corresponding to the sample videos respectively to obtain a plurality of video clusters which are accurately classified, and thus, the action recognition model is trained based on the sample videos in the video clusters obtained by clustering, so that the process of labeling action labels on the sample videos in sequence is avoided, and the training efficiency of the action recognition model is improved.
Fig. 3 is a flowchart of a training method of a motion recognition model according to an embodiment of the present application. The training method of the motion recognition model comprises the following steps:
301. a computer device obtains a plurality of sample videos.
The sample video comprises a plurality of video frames, and the video duration of the sample video can be set and changed as needed; optionally, the video duration of the sample video is 1 second.
In one implementation, the computer device directly obtains the sample videos. In another implementation, the computer device obtains at least one initial sample video whose duration is greater than that of a sample video, and splits the initial sample video to obtain a plurality of sample videos. Optionally, the splitting time interval of the initial sample video can be set and changed as needed; for example, the computer device determines the splitting time interval according to the video type of the initial sample video and splits the initial sample video into m sample videos, optionally obtaining a plurality of sample videos each with a duration of 1 second. In this implementation, because the initial sample video is long and contains many actions, splitting it into a plurality of shorter sample videos keeps the action change within each sample video small and makes the sample videos easier to process, thereby improving the accuracy of recognizing the actions in the sample videos.
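To make the division concrete, the following sketch expresses the video-division stage as simple interval arithmetic; the function name, the 1-second default, and the use of Python are illustrative assumptions, since the patent does not prescribe an implementation.

```python
# A minimal sketch of the video-division stage (illustrative only).
def split_points(total_seconds: float, clip_seconds: float = 1.0):
    """Return the (start, end) times of the sample clips cut from an
    initial sample video of the given duration."""
    m = int(total_seconds // clip_seconds)  # number of whole clips
    return [(i * clip_seconds, (i + 1) * clip_seconds) for i in range(m)]

# Example: a 5-second initial video yields five 1-second sample videos.
print(split_points(5.0))  # [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]
```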
In some embodiments, the sample video comprises video in at least one of a smart education scene and a social entertainment scene. Optionally, the sample video is a video in an intelligent education scene, and the sample video includes a positive sample video in which the teacher makes an action such as writing on a blackboard, explaining, and the like, and a negative sample video in which the teacher does not make an action. Optionally, the sample video is a video in a social entertainment scene, and the sample video includes a positive sample video in which an object in the video makes a restrictive action and a negative sample video in which the object does not make the restrictive action.
302. And the computer equipment divides the target object in the sample video to obtain a plurality of action parts corresponding to the sample video.
The plurality of action parts are body parts of the target object, the object performing the action. The plurality of action parts includes at least one of the four limb parts and the torso part of the target object; optionally, the plurality of action parts includes a torso part, a left upper limb part, a left lower limb part, a right upper limb part, and a right lower limb part.
In some embodiments, the computer device segments the target object in each video frame of the sample video to obtain a plurality of action parts corresponding to that video frame, and combines the action parts corresponding to the video frames into the plurality of action parts corresponding to the sample video.
In other embodiments, the computer device first extracts a plurality of video frames from the sample video, segments the target object in each extracted video frame to obtain a plurality of action parts corresponding to that frame, and combines the action parts corresponding to the extracted video frames into the plurality of action parts corresponding to the sample video.
The action of the target object in the plurality of video frames is used to represent the action of the target object in the sample video. The number of video frames to extract can be set and changed as needed. In one implementation, the computer device determines the number of video frames to extract based on at least one of the video type of the sample video, the action recognition requirements, and the resource configuration. The video type characterizes the kind of sample video, such as education, food, or social entertainment video; the number of frames extracted varies with the video type. The action recognition requirements include the precision required for recognizing actions in the sample video, the number of action categories required, and the like; for example, education videos demand high recognition precision, so more video frames are extracted for them. The higher the resource configuration of the computer device, the more video frames are extracted; the lower it is, the fewer. In another implementation, the computer device uses a default number of extracted video frames; for example, the default number n of frames extracted from a 1-second sample video is 4.
The computer device can extract the video frames either randomly or at equal time intervals. Random extraction means randomly extracting n video frames from the sample video. Extraction at equal time intervals means that the extracted video frames are equally spaced in time: if the total duration of the sample video is t and the number of extracted frames is n, the time points at which the frames are extracted are 0, t/(n-1), 2t/(n-1), ..., t.
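As an illustration of the equal-interval scheme, the sketch below samples n frames from a clip; OpenCV and the helper name are assumptions, as the patent specifies only the sampling rule.

```python
# Hedged sketch of equal-interval frame extraction (OpenCV assumed).
import cv2
import numpy as np

def sample_frames(video_path: str, n: int = 4):
    """Extract n frames at equal time intervals from a sample video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Frame indices corresponding to time points 0, t/(n-1), ..., t.
    indices = np.linspace(0, total - 1, num=n).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```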
The computer device segments the target object in the image using the Self-Supervised Co-Part Segmentation (SCOPS) algorithm to obtain a plurality of action parts of the target object, facilitating the subsequent determination of the relative position vectors of the action parts.
It should be noted that the SCOPS algorithm considers several characteristics of good part segmentation and encodes prior knowledge of these characteristics into a loss function, thereby achieving accurate segmentation of the target object. The characteristics include geometric concentration, robustness to variation, semantic consistency, and foreground-background distinction. Geometric concentration means that the pixels of a part are geometrically concentrated and close together. Robustness to variation means that the part segmentation is robust to deformations of the target object caused by pose changes and by camera and viewpoint changes. Semantic consistency means that a part should be semantically consistent across different object instances, where instances refer to the target objects included in different video frames. Foreground-background distinction means that the segmented action parts appear on the foreground rather than the background; in this application, the foreground is the target object and the union of the action parts constitutes the foreground, so the segmentation focuses on the foreground of the segmentation map and ignores the background, which matches the requirements for segmenting the target object in an image in the embodiments of the present application.
For example, fig. 4 is a schematic image of a video frame provided according to an embodiment of the present application; it shows the image before segmentation and contains a target object performing a flip-jump action. Fig. 5 is a schematic image of another video frame provided in an embodiment of the present application, showing the effect of segmenting the image in fig. 4 with the SCOPS algorithm: the target object in the foreground is segmented into 5 action parts. Optionally, the 5 action parts are rendered as 5 different color blocks (red, yellow, blue, green, and gray), each representing a different action part.
303. The computer device determines a first action part and a plurality of second action parts among the plurality of action parts based on part information of the plurality of action parts corresponding to the sample video.
Here the first action part is the reference action part of the target object. In some embodiments, the part information includes contour information of the action parts. The computer device determines the minimum circumscribed rectangle of each action part based on its contour information; it then determines the first action part based on the minimum circumscribed rectangles of the action parts, the remaining parts being the second action parts. The minimum circumscribed rectangle is the unique rectangle that contains an action part and is tangent to its contour. The first action part is the one whose minimum circumscribed rectangle has an aspect ratio closest to the preset value of 1, i.e., the torso part, and the plurality of second action parts are the four limbs.
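The selection of the reference part can be sketched as follows; the mask representation and OpenCV's rotated minimum-area rectangle are assumptions standing in for the minimum circumscribed rectangle described above.

```python
# Hedged sketch: pick the reference (torso) part as the segment whose
# minimum circumscribed rectangle has an aspect ratio closest to 1.
import cv2
import numpy as np

def find_reference_part(part_masks):
    """part_masks: list of binary numpy arrays, one per action part.
    Returns the index of the part whose bounding rectangle is most square."""
    best_idx, best_diff = 0, float("inf")
    for i, mask in enumerate(part_masks):
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        points = np.vstack([c.reshape(-1, 2) for c in contours])
        (_, _), (w, h), _ = cv2.minAreaRect(points.astype(np.float32))
        ratio = max(w, h) / max(min(w, h), 1e-6)  # >= 1, and 1 means square
        if abs(ratio - 1.0) < best_diff:
            best_diff, best_idx = abs(ratio - 1.0), i
    return best_idx
```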
304. The computer device determines relative position vectors of the plurality of second motion parts relative to the first motion part, and obtains corresponding relative position vectors of the sample video.
In some embodiments, the step in which the computer device determines the relative position vectors of the plurality of second action parts relative to the first action part to obtain the relative position vector corresponding to the sample video comprises: the computer device determines a first center position and a second center position respectively, where the first center position is the center position of the first action part and the second center position is the center position of a second action part; the computer device determines the vector between the first center position and the second center position as the relative position vector of that second action part relative to the first action part; and the computer device forms the relative position vectors of the plurality of second action parts relative to the first action part into the relative position vector corresponding to the sample video.
For example, fig. 6 is a schematic diagram of relative position vectors provided in this embodiment of the application, where each dot is the center point of an action part and each arrow is the relative position vector of a second action part relative to the first action part. The positional relationship between the first action part and the plurality of second action parts is thus abstracted into relative position vectors, on which the sample videos are subsequently clustered.
In this embodiment, since the center position of an action part effectively represents the position information of that part, determining the vectors from the center position of the first action part to the center positions of the second action parts as the relative position vectors of the second action parts relative to the first action part improves the accuracy and representativeness of the obtained relative position vectors of the action parts.
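A minimal sketch of this center-position variant, assuming each action part is available as a binary mask (names are illustrative):

```python
# Relative position vectors from part centroids (center-position variant).
import numpy as np

def part_center(mask):
    """Centroid (x, y) of a binary part mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def relative_position_vectors(torso_mask, limb_masks):
    """One 2-D vector per limb, from the torso centroid to the limb
    centroid; with four limb parts this yields an 8-D frame descriptor."""
    torso_center = part_center(torso_mask)
    return np.concatenate([part_center(m) - torso_center for m in limb_masks])
```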
In some embodiments, the step in which the computer device determines the relative position vectors of the plurality of second action parts relative to the first action part to obtain the relative position vector corresponding to the sample video comprises: the computer device determines a first center position and a plurality of boundary positions respectively, where the first center position is the center position of the first action part and the plurality of boundary positions are boundary positions of a second action part; the computer device determines the vectors between the first center position and the respective boundary positions; the computer device splices these vectors to obtain the relative position vector of that second action part relative to the first action part; and the computer device forms the relative position vectors of the plurality of second action parts relative to the first action part into the relative position vector corresponding to the sample video. Optionally, the plurality of boundary positions includes at least one of the uppermost, lowermost, leftmost, and rightmost boundary positions of the action part.
In this embodiment, since the boundary positions of a second action part effectively represent the position information of that part, determining the vectors from the center position of the first action part to the boundary positions of the second action parts as the relative position vectors of the second action parts relative to the first action part improves the accuracy and representativeness of the obtained relative position vectors of the action parts.
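The boundary-position variant can be sketched analogously; again the mask-based representation is an assumption:

```python
# Relative position vectors from part boundary pixels (boundary variant):
# each limb contributes vectors from the torso centroid to its uppermost,
# lowermost, leftmost, and rightmost pixels (4 vectors x 2 dims per limb).
import numpy as np

def boundary_vectors(torso_center, limb_mask):
    ys, xs = np.nonzero(limb_mask)
    top = np.array([xs[ys.argmin()], ys.min()])
    bottom = np.array([xs[ys.argmax()], ys.max()])
    left = np.array([xs.min(), ys[xs.argmin()]])
    right = np.array([xs.max(), ys[xs.argmax()]])
    return np.concatenate([p - torso_center for p in (top, bottom, left, right)])
```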
305. The computer equipment clusters the sample videos based on the corresponding relative position vectors of the sample videos respectively to obtain a plurality of first video clusters.
The first video cluster comprises at least one sample video, and the motion of the sample videos in the same first video cluster is the same. The sample video comprises a plurality of video frames, and the corresponding relative position vector of the sample video comprises the corresponding relative position vector of the plurality of video frames respectively.
In some embodiments, the computer device stitches a plurality of relative position vectors corresponding to the same video frame to obtain a first relative position vector corresponding to the same video frame. And the computer equipment splices the first relative position vectors corresponding to a plurality of same video frames of the same sample video to obtain a second relative position vector corresponding to the sample video. And the computer equipment clusters the plurality of sample videos based on the second relative position vectors respectively corresponding to the plurality of sample videos to obtain a plurality of first video clusters.
In one implementation, the computer device determines the pixels of the second action parts in the same video frame and splices the relative position vectors of that frame in an order based on the number of pixels each second action part contains, thereby obtaining the first relative position vector corresponding to the frame. Optionally, the order of the relative position vectors within the first relative position vector is positively correlated with the pixel counts of the corresponding second action parts, i.e., the computer device puts the relative position vector of the second action part containing the most pixels first and that of the part containing the fewest pixels last. Alternatively, the order is negatively correlated with the pixel counts: the computer device puts the relative position vector of the second action part containing the fewest pixels first and that of the part containing the most pixels last. The embodiments of the present application do not specifically limit this.
Optionally, if the computer device determines the first relative position vector based on the first center position of the first action part and the boundary positions of the plurality of second action parts, and the boundary positions include the uppermost, lowermost, leftmost, and rightmost boundary positions of each second action part, then for each second action part the computer device orders its relative position vectors in the sequence uppermost, lowermost, leftmost, rightmost to obtain a third relative position vector for that part; it then splices the third relative position vectors of the same video frame in an order based on the pixel counts of the second action parts in that frame, obtaining the first relative position vector corresponding to the frame.
Optionally, the relative position vector is a 2-dimensional position vector whose elements are the abscissa and the ordinate of the vector relative to an origin at the first center position. If the plurality of second action parts are the four limb parts and the computer device determines the first relative position vector from the first center position of the first action part and the second center positions of the second action parts, then the first relative position vector obtained by splicing the relative position vectors of one video frame is 8-dimensional. Correspondingly, if there are 4 video frames and each corresponds to an 8-dimensional first relative position vector, the second relative position vector obtained by splicing the first relative position vectors of the 4 video frames of the same sample video is 32-dimensional. Optionally, the computer device splices the first relative position vectors of the video frames in the temporal order of the frames to obtain the second relative position vector.
Alternatively, if the plurality of second action parts are the four limb parts and the computer device determines the first relative position vector from the first center position of the first action part and the boundary positions of the second action parts, where the boundary positions include the uppermost, lowermost, leftmost, and rightmost boundary positions of each second action part, then the first relative position vector obtained by splicing the 16 2-dimensional relative position vectors of one video frame is 32-dimensional. Correspondingly, if there are 4 video frames and each corresponds to a 32-dimensional first relative position vector, the second relative position vector obtained by splicing the first relative position vectors of the 4 video frames of the same sample video is 128-dimensional. Optionally, the computer device splices the first relative position vectors of the video frames in the temporal order of the frames to obtain the second relative position vector.
In the embodiment of the application, the second relative position vector of a sample video is obtained by splicing the first relative position vectors of its video frames, so it incorporates the first relative position vectors of multiple frames and is therefore more representative, i.e., more accurate. Clustering the plurality of sample videos based on their respective second relative position vectors thus yields a plurality of first video clusters with high accuracy.
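The two splicing steps can be sketched as straightforward concatenation; the pixel-count ordering shown follows the "positively correlated" option above, and the dictionary representation is an assumption:

```python
# Splicing per-limb vectors into a frame descriptor, then frame
# descriptors into a video descriptor (e.g. 4 frames x 8 dims = 32 dims).
import numpy as np

def frame_descriptor(limb_vectors, limb_masks):
    """Concatenate limb vectors ordered by limb pixel count, descending."""
    order = sorted(limb_masks, key=lambda k: limb_masks[k].sum(), reverse=True)
    return np.concatenate([limb_vectors[k] for k in order])

def video_descriptor(frame_descriptors):
    """Concatenate per-frame descriptors in temporal order."""
    return np.concatenate(frame_descriptors)
```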
In some embodiments, the computer device determines a distance between any two second relative position vectors based on the second relative position vectors respectively corresponding to the plurality of sample videos; and under the condition that the distance is not greater than the preset distance, the computer equipment aggregates the two sample videos corresponding to the two second relative position vectors into the same first video cluster.
In this embodiment, the computer device clusters the sample videos using hierarchical clustering, one of the standard clustering algorithms. The computer device determines the similarity between items of different categories by determining the distance between them, and constructs a nested, hierarchical clustering tree. In the clustering tree, each sample video starts as its own category at the lowest level, and the top level of the tree is the root node of the clustering, i.e., the final categories of the sample videos. A clustering tree can be created by hierarchical clustering in two ways, bottom-up merging and top-down splitting; in this embodiment of the application, the sample videos are clustered by bottom-up merging. Optionally, hierarchical clustering does not require a specified number of clusters, only a given distance threshold for the clusters.
Optionally, the merging proceeds by the minimum of the distances between any two second relative position vectors. The computer device first aggregates the two sample videos whose second relative position vectors are closest into the same first video cluster. It then recalculates the distances between this first video cluster and the second relative position vectors of the other sample videos not in the cluster, and again merges the closest pair, aggregating either the first video cluster and the closest sample video, or the two closest remaining sample videos, into the same first video cluster. The computer device iterates these steps until the distance between any two remaining relative position vectors exceeds the preset threshold, at which point clustering stops and a plurality of first video clusters is obtained.
In the embodiment of the application, the plurality of sample videos is clustered based on the distances between pairs of second relative position vectors. Because the hierarchical clustering method does not require the number of clusters to be preset, multi-level clustering structures of different granularities can be obtained by setting different distance thresholds; the computation is fast and the logic simple, which improves the efficiency of clustering the sample videos.
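A hedged sketch of the distance-threshold clustering using SciPy's agglomerative implementation; single linkage and the threshold value are assumptions standing in for the merge-closest-pair procedure described above:

```python
# Hierarchical clustering of video descriptors with a distance threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_videos(descriptors, threshold=5.0):
    """descriptors: array of shape (num_videos, dim).
    Returns one first-video-cluster id per sample video."""
    Z = linkage(descriptors, method="single", metric="euclidean")
    return fcluster(Z, t=threshold, criterion="distance")
```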
In another implementation, the computer device clusters the sample videos using the k-means clustering algorithm. Correspondingly, the computer device randomly selects k sample videos as the initial cluster centers, determines the distances between the other sample videos and each cluster center, and assigns each of the other sample videos to the first video cluster of the closest center. A cluster center together with the sample videos assigned to it represents one cluster. After every sample video has been assigned, the cluster center of each cluster is recalculated based on the sample videos currently in the cluster. This process repeats until a target termination condition is met; the condition may be at least one of: no (or a minimum number of) sample videos are reassigned to different clusters, no (or a minimum number of) cluster centers change again, and the sum of squared errors reaches a local minimum. In yet another implementation, the computer device clusters the sample videos based on a representative density-based clustering algorithm, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which is not described here again.
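For the k-means alternative, a sketch with scikit-learn (the value of k is an assumption; the patent leaves it open):

```python
# k-means clustering of video descriptors (alternative to hierarchical).
from sklearn.cluster import KMeans

def cluster_videos_kmeans(descriptors, k=10):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(descriptors)
```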
In the embodiment of the application, the action division and clustering of the sample videos are unsupervised and require no action labels, which solves the prior-art problem of having to train the action recognition model with labeled sample videos.
306. The computer device trains a motion recognition model based on the plurality of first video clusters.
In some embodiments, the computer device training the motion recognition model based on the plurality of first video clusters comprises: the computer device labels the plurality of sample videos with action tags based on the plurality of first video clusters, and then trains on the labeled sample videos to obtain the action recognition model.
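A toy sketch of this pseudo-labeling step, under the assumption that the second relative position vectors serve as features and a simple scikit-learn classifier stands in for the action recognition model (the real model architecture and feature set are not specified at this level of detail):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: one second relative position vector per sample video,
# and the first-video-cluster id each video was assigned to.
vectors = np.random.rand(100, 64)
cluster_labels = np.random.randint(0, 8, size=100)

# The cluster id serves as the pseudo action tag of each sample video;
# a simple classifier stands in for the action recognition model here.
recognizer = LogisticRegression(max_iter=1000)
recognizer.fit(vectors, cluster_labels)

# The trained model now maps a new video's vector to an action category.
predicted_action = recognizer.predict(vectors[:1])
```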
In some embodiments, the step of the computer device performing motion recognition through the motion recognition model comprises: the computer device acquires a target video, inputs the target video into the action recognition model, and outputs the action category of the target video together with the relative position vectors of a plurality of action parts of the target object in the target video. The target video is a video whose action category is to be identified, and the relative position vectors are used to explain why the target video is recognized as that action category. If the target video comprises a plurality of video frames, the computer device determines, for each action part, the average of the fourth relative position vectors of that part over the plurality of video frames, obtaining the relative position vector of that part and thereby the relative position vectors of the plurality of action parts of the target object in the target video.
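A small sketch of the frame-averaging just described; the per-frame (fourth) relative position vectors of one action part are assumed to be stacked in a NumPy array, with placeholder shapes:

```python
import numpy as np

# Per-frame (fourth) relative position vectors of one action part of the
# target object: shape (num_frames, vector_dim); values are placeholders.
fourth_vectors = np.random.rand(16, 2)  # e.g. 16 sampled frames, 2-D vectors

# The part's relative position vector for the whole target video is the
# mean of its per-frame relative position vectors, as described above.
part_vector = fourth_vectors.mean(axis=0)
```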
Optionally, the motion recognition model labels the relative position vectors of the plurality of action parts on a schematic diagram such as the one shown in fig. 6, and outputs the diagram to explain why the target video is recognized as the action category. This prevents the motion recognition model from acting as a black-box model that is hard to understand and apply; that is, it solves the prior-art problem of poor interpretability of motion recognition models, giving the model wider application scenarios, such as intelligent education.
Referring to fig. 7, fig. 7 is a flowchart of a training method of a motion recognition model according to an embodiment of the present application. The method comprises five stages: video division, video frame sampling, image segmentation, relative position determination, and action clustering. In the video division stage, the computer device divides the initial sample video into short videos at certain time intervals to obtain sample videos, so that the action change within each sample video is small. In the video frame sampling stage, the computer device randomly extracts images of n video frames from the sample video, and the motion of the object in these images represents the motion of the object in the sample video. In the image segmentation stage, the computer device segments the image of each video frame, dividing the object in the image into a plurality of action parts: the torso and the limbs. In the relative position determination stage, the computer device determines the relative position vectors of the limb regions with respect to the torso region, and splices the relative position vectors across the plurality of video frames. In the action clustering stage, the computer device clusters the plurality of sample videos based on the k-means algorithm to obtain a plurality of first video clusters of different action types, and then trains the action recognition model based on the plurality of first video clusters. It should be noted that the resource configuration of the computer device may be determined according to the data size of the sample videos, to improve the flexibility of processing the sample videos; if the data volume of the sample videos is large, a plurality of background servers may be used to process the sample videos and train the motion recognition model.
In the embodiment of the application, because the relative position vectors among the plurality of action parts effectively convey the action information of the object, clustering the plurality of sample videos based on their respective relative position vectors yields a plurality of accurately classified first video clusters. Training the action recognition model based on these first video clusters then improves the recognition effect of the resulting model.
Fig. 8 is a flowchart of a training method of a motion recognition model according to an embodiment of the present application. The training method of the motion recognition model comprises the following steps:
801. a computer device obtains a plurality of sample videos.
802. The computer device divides the target object in the sample video to obtain a plurality of action parts corresponding to the sample video.
803. The computer device determines a first action part and a plurality of second action parts among the plurality of action parts based on the part information of the plurality of action parts corresponding to the sample video, the first action part being a reference action part of the target object.
804. The computer device determines the relative position vectors of the plurality of second action parts relative to the first action part, and obtains the corresponding relative position vector of the sample video.
805. The computer device clusters the plurality of sample videos based on the relative position vectors corresponding to the sample videos, respectively, to obtain a plurality of first video clusters.
Steps 801-805 are similar to steps 301-305, and are not described herein again.
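As a toy illustration of what steps 803 and 804 compute, under the assumption that the torso is the first (reference) action part and the limbs are the second action parts; all coordinates below are placeholders:

```python
import numpy as np

# Assumed part centers in normalized image coordinates.
torso_center = np.array([0.50, 0.40])          # first action part
limb_centers = {                               # second action parts
    "left_arm":  np.array([0.30, 0.35]),
    "right_arm": np.array([0.70, 0.35]),
    "left_leg":  np.array([0.42, 0.80]),
    "right_leg": np.array([0.58, 0.80]),
}

# The relative position vector of each second part is the vector from the
# first part's center to that part's center; splicing them in a fixed
# order gives the frame-level relative position vector.
relative_vectors = {name: c - torso_center for name, c in limb_centers.items()}
frame_vector = np.concatenate([relative_vectors[n] for n in sorted(relative_vectors)])
```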
806. The computer device determines a target action part from the plurality of second action parts.
Wherein the target action part is the second action part containing the fewest pixels.
In some embodiments, the target action part may instead be the second action part containing the most pixels, or a second action part whose pixel count falls within a preset range. In some embodiments, there are multiple target action parts: for example, the two second action parts containing the fewest pixels, or the two containing the most pixels, which is not limited herein.
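A minimal sketch of this selection, assuming each second action part is represented by a boolean segmentation mask; the part names and mask sizes are illustrative only:

```python
import numpy as np

# Assumed segmentation masks of the second action parts in one frame:
# each boolean mask marks the pixels belonging to that part.
masks = {
    "left_arm":  np.zeros((224, 224), dtype=bool),
    "right_arm": np.zeros((224, 224), dtype=bool),
    "left_leg":  np.zeros((224, 224), dtype=bool),
}
masks["left_arm"][100:110, 50:55] = True     # 50 pixels
masks["right_arm"][100:120, 160:170] = True  # 200 pixels
masks["left_leg"][150:200, 80:100] = True    # 1000 pixels

# The target action part is the second part containing the fewest pixels.
target_part = min(masks, key=lambda name: masks[name].sum())
print(target_part)  # -> "left_arm"
```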
807. The computer device clusters the plurality of sample videos based on the first action parts and the target action parts of the sample videos to obtain a plurality of second video clusters.
In some embodiments, the step of the computer device clustering the plurality of sample videos based on the first action parts and the target action parts of the sample videos to obtain a plurality of second video clusters includes: the computer device determines a first similarity and a second similarity of any two sample videos, where the first similarity is the similarity between the first action parts of the two sample videos and the second similarity is the similarity between the target action parts of the two sample videos; the computer device determines a third similarity between the two sample videos based on the first similarity and the second similarity; and in the case that the third similarity is greater than a preset similarity, the computer device aggregates the two sample videos into the same second video cluster.
Optionally, the computer device determines the first similarity of any two sample videos based on the Structural Similarity Index (SSIM). Correspondingly, the computer device extracts the luminance, contrast, and structural features of the first action part of each of the two sample videos, and obtains, through the SSIM function, the first similarity between the two sample videos based on these features.
In some embodiments, the sample video includes a plurality of video frames, and the first similarity corresponding to the sample video is built from the first similarities corresponding to the respective video frames. Optionally, the computer device determines the similarity of the first action parts in video frames at the same time in the two sample videos, and obtains the first similarity of the two sample videos as the average of these per-frame similarities over the multiple times. Similarly, the computer device determines the similarity of the target action parts in video frames at the same time in the two sample videos, and obtains the second similarity of the two sample videos as the average of those per-frame similarities. Optionally, the computer device takes the average of the first similarity and the second similarity as the third similarity between the two sample videos; alternatively, the computer device takes the sum of the first similarity and the second similarity as the third similarity. The preset similarity may be set and changed as needed, and is not specifically limited in the embodiment of the present application.
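A hedged sketch of the SSIM-based similarity computation described above, using scikit-image's structural_similarity; the frame crops, frame count, combination rule (mean), and preset similarity threshold are all assumptions:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Assumed inputs: per-frame grayscale crops of the first action part and
# of the target action part in two sample videos (same size, same times).
frames_a_first = [np.random.rand(64, 64) for _ in range(8)]
frames_b_first = [np.random.rand(64, 64) for _ in range(8)]
frames_a_target = [np.random.rand(64, 64) for _ in range(8)]
frames_b_target = [np.random.rand(64, 64) for _ in range(8)]

def avg_ssim(frames_a, frames_b):
    # Frame-wise SSIM at matching times, averaged over all sampled frames.
    return float(np.mean([
        ssim(fa, fb, data_range=1.0) for fa, fb in zip(frames_a, frames_b)
    ]))

first_sim = avg_ssim(frames_a_first, frames_b_first)     # first similarity
second_sim = avg_ssim(frames_a_target, frames_b_target)  # second similarity
third_sim = (first_sim + second_sim) / 2                 # mean variant above

PRESET_SIMILARITY = 0.6  # assumed threshold
same_cluster = third_sim > PRESET_SIMILARITY
```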
In the embodiment of the application, the similarity between the first action parts and the similarity between the target action parts can effectively represent the similarity between the plurality of sample videos, so that the plurality of sample videos are clustered by integrating the similarities between the first action parts of the plurality of sample videos and the similarity between the target action parts, and a plurality of second video clusters which are accurately classified can be obtained.
In the embodiment of the application, the target action part in each sample video is the second action part with the fewest pixels. Clustering the plurality of sample videos by the first action part and this target action part allows the sample videos to be compared and clustered at the same level; and because clustering uses only the target action part rather than all the second action parts, the time cost of clustering over every second action part is avoided, which improves the clustering efficiency of the sample videos and thereby the training efficiency of the action recognition model.
It should be noted that, in some embodiments, the computer device can also perform only steps 801 to 803 and steps 806 to 807 above, and train the motion recognition model based on the plurality of second video clusters. The specific training steps of the motion recognition model are the same as those of step 306, and are not described herein again.
It should be noted that, in the embodiment of the present application, the order between steps 804-805 and steps 806-807 is not limited. The computer device can perform steps 804 and 805 first and steps 806 and 807 later; or perform steps 806 and 807 first and steps 804 and 805 later; or perform both pairs of steps at the same time. This is not specifically limited in the embodiment of the present application.
808. The computer device trains a motion recognition model based on the plurality of first video clusters and the plurality of second video clusters.
In some embodiments, the computer device determines whether the plurality of first video clusters are the same as the plurality of second video clusters. If they are the same, the computer device trains the action recognition model based on either the plurality of first video clusters or the plurality of second video clusters. If they differ, it is manually determined whether inaccurately clustered video clusters exist among the first or second video clusters, and such video clusters are corrected; the computer device then trains the action recognition model based on the corrected plurality of first video clusters or the corrected plurality of second video clusters. The specific training steps of the motion recognition model are the same as those of step 306, and are not described herein again.
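One simple way to check whether the two sets of video clusters agree is sketched here with scikit-learn's adjusted Rand index; the application does not prescribe a particular comparison metric, so this choice is an assumption:

```python
from sklearn.metrics import adjusted_rand_score

# Assumed cluster assignments of the same sample videos from the two
# clustering passes (relative position vectors vs. part similarity).
first_labels = [0, 0, 1, 1, 2, 2, 2]
second_labels = [1, 1, 0, 0, 2, 2, 2]

# A score of 1.0 means the two partitions are identical up to relabeling;
# lower values flag video clusters worth manual inspection and correction.
agreement = adjusted_rand_score(first_labels, second_labels)
print(agreement)  # -> 1.0 for this toy example
```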
Referring to fig. 9, fig. 9 is an application diagram of the motion recognition model in an intelligent education scene. In this scene the target object is a teacher who is standing and teaching, with a whiteboard or projection available for displaying content; the scene includes the teacher's half body or whole body, and the action type in the scene is recognized by the motion recognition model as explanation. The action recognition model in the embodiment of the application can therefore analyze the teaching behavior of the teacher, supporting AI-assisted evaluation and teaching.
In the embodiment of the application, the plurality of sample videos are clustered separately by two factors, the relative position vector and the similarity, yielding the plurality of first video clusters and the plurality of second video clusters, respectively. By comparing the plurality of first video clusters with the plurality of second video clusters, it can be determined whether inaccurately clustered video clusters exist in the clustering results; such video clusters can then be corrected, and training the motion recognition model based on the corrected video clusters improves the accuracy of the trained model.
Fig. 10 is a block diagram of a training apparatus for motion recognition models, which is provided according to an embodiment of the present application, and is configured to perform the above-mentioned training method for motion recognition models, referring to fig. 10, the apparatus includes:
a first obtaining module 1001 configured to obtain a plurality of sample videos;
a segmentation module 1002, configured to segment the target object in the sample video to obtain a plurality of action parts corresponding to the sample video, where the plurality of action parts are action parts of the target object, and the target object is an object making an action;
a determining module 1003, configured to determine a relative position vector between the multiple action portions, to obtain a relative position vector corresponding to the sample video, where the relative position vector is used to represent a position relationship between the multiple action portions;
the clustering module 1004 is configured to cluster the plurality of sample videos based on the relative position vectors corresponding to the plurality of sample videos, respectively, to obtain a plurality of first video clusters, where each first video cluster includes at least one sample video, and the sample videos in the same first video cluster have the same action;
a training module 1005, configured to train the motion recognition model based on the plurality of first video clusters.
In some embodiments, the determining module 1003 is configured to determine, based on the part information of the plurality of action parts corresponding to the sample video, a first action part and a plurality of second action parts among the plurality of action parts, where the first action part is a reference action part of the target object; and determine the relative position vectors of the plurality of second action parts relative to the first action part to obtain the corresponding relative position vector of the sample video.
In some embodiments, the determining module 1003 is configured to determine a first center position and a second center position, respectively, where the first center position is the center position of the first action part and the second center position is the center position of the second action part; determine the vector between the first center position and the second center position as the relative position vector of the second action part relative to the first action part; and form the relative position vector corresponding to the sample video from the relative position vectors of the plurality of second action parts relative to the first action part.
In some embodiments, the determining module 1003 is configured to determine a first center position and a plurality of boundary positions, respectively, where the first center position is the center position of the first action part and the plurality of boundary positions are boundary positions of the second action part; determine the vectors between the first center position and each of the plurality of boundary positions; splice these vectors to obtain the relative position vector of the second action part relative to the first action part; and form the relative position vector corresponding to the sample video from the relative position vectors of the plurality of second action parts relative to the first action part.
In some embodiments, the sample video includes a plurality of video frames, and the corresponding relative position vector of the sample video includes the corresponding relative position vector of each of the plurality of video frames; the clustering module 1004 is configured to splice multiple relative position vectors corresponding to the same video frame to obtain a first relative position vector corresponding to the same video frame; splicing first relative position vectors corresponding to a plurality of same video frames of the same sample video to obtain second relative position vectors corresponding to the sample video; and clustering the plurality of sample videos based on the second relative position vectors respectively corresponding to the plurality of sample videos to obtain a plurality of first video clusters.
In some embodiments, the clustering module 1004 is configured to determine a distance between any two second relative position vectors based on the second relative position vectors corresponding to the plurality of sample videos, respectively;
and under the condition that the distance is not greater than the preset distance, aggregating the two sample videos corresponding to the two second relative position vectors into the same first video cluster.
In some embodiments, the apparatus further comprises:
the second acquisition module is used for acquiring a target video, wherein the target video is a video of the action category to be identified;
and the input and output module is used for inputting the target video into the action recognition model and outputting the action type of the target video and relative position vectors of a plurality of action parts of the target object in the target video, wherein the relative position vectors are used for explaining the reason that the target video is recognized as the action type.
In some embodiments, training module 1005 is configured to determine a first action part and a plurality of second action parts in the plurality of action parts based on the part information of the plurality of action parts, where the first action part is a reference action part of the target object; determining a target action part from the plurality of second action parts, wherein the target action part contains the least pixel points; clustering the plurality of sample videos based on the first action parts and the target action parts of the plurality of sample videos to obtain a plurality of second video clusters, wherein the second video clusters comprise at least one sample video, and the actions of the sample videos in the same second video cluster are the same; training a motion recognition model based on the plurality of first video clusters and the plurality of second video clusters.
In some embodiments, the training module 1005 is configured to determine a first similarity and a second similarity between any two sample videos, where the first similarity is a similarity between first motion portions of the two sample videos, and the second similarity is a similarity between target motion portions of the two sample videos; determining a third similarity between the two sample videos based on the first similarity and the second similarity; and under the condition that the third similarity is greater than the preset similarity, aggregating the two sample videos into the same second video cluster.
In some embodiments, the segmentation module 1002 is configured to extract a plurality of video frames from the sample video; dividing a target object included in a video frame to obtain a plurality of action parts corresponding to the video frame; and combining the action parts corresponding to the plurality of video frames into a plurality of action parts corresponding to the sample video.
In the embodiment of the application, the computer device can be configured as a terminal or a server, and when the computer device is configured as a terminal, the terminal is used as an execution subject to implement the technical scheme provided by the embodiment of the application; when the computer device is configured as a server, the server is used as an execution subject to implement the technical scheme provided by the embodiment of the application; or, the technical solution provided by the present application is implemented through interaction between a terminal and a server, and the embodiment of the present application is not limited thereto.
Fig. 11 shows a block diagram of a terminal 1100 according to an exemplary embodiment of the present application. The terminal 1100 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 1100 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so forth.
In general, terminal 1100 includes: a processor 1101 and a memory 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, the main processor being a processor for processing data in an awake state, also called a CPU (Central Processing Unit); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1102 is used to store at least one program code for execution by the processor 1101 to implement the method of training a motion recognition model provided by the method embodiments herein.
In some embodiments, the terminal 1100 may further include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, display screen 1105, camera assembly 1106, audio circuitry 1107, and power supply 1108.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102, and the peripheral interface 1103 may be implemented on separate chips or circuit boards, which is not limited by this embodiment.
The radio frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1104 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1104 may further include circuitry related to NFC (Near Field Communication, short-range wireless communication), which is not limited in this application.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1105, disposed on the front panel of terminal 1100; in other embodiments, there may be at least two display screens 1105, respectively disposed on different surfaces of the terminal 1100 or in a folded design; in other embodiments, the display screen 1105 may be a flexible display disposed on a curved surface or on a folded surface of terminal 1100. The display screen 1105 may even be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display screen 1105 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
Camera assembly 1106 is used to capture images or video. Optionally, camera assembly 1106 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1106 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp and can be used for light compensation under different color temperatures.
The audio circuitry 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting the sound waves into electrical signals, and inputting them to the processor 1101 for processing or to the radio frequency circuit 1104 for voice communication. For stereo sound collection or noise reduction, a plurality of microphones may be provided at different positions of terminal 1100. The microphone may also be an array microphone or an omnidirectional acquisition microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuitry 1107 may also include a headphone jack.
Power supply 1108 is used to provide power to various components within terminal 1100. The power source 1108 may be alternating current, direct current, disposable or rechargeable. When the power source 1108 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1100 can also include one or more sensors 1109. The one or more sensors 1109 include, but are not limited to: acceleration sensor 1110, gyro sensor 1111, pressure sensor 1112, optical sensor 1113, and proximity sensor 1114.
The acceleration sensor 1110 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 1100. For example, the acceleration sensor 1110 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1110. The acceleration sensor 1110 may also be used for game or user motion data acquisition.
The gyro sensor 1111 may detect a body direction and a rotation angle of the terminal 1100, and the gyro sensor 1111 may collect a 3D motion of the user with respect to the terminal 1100 in cooperation with the acceleration sensor 1110. The processor 1101 can implement the following functions according to the data collected by the gyro sensor 1111: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 1112 may be disposed on the side bezel of terminal 1100 and/or underlying display screen 1105. When the pressure sensor 1112 is disposed on the side frame of the terminal 1100, the holding signal of the user to the terminal 1100 can be detected, and the processor 1101 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1112. When the pressure sensor 1112 is disposed at a lower layer of the display screen 1105, the processor 1101 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The optical sensor 1113 is used to collect the ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 according to the ambient light intensity collected by the optical sensor 1113. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1105 is increased; when the ambient light intensity is low, the display brightness of the display screen 1105 is reduced. In another embodiment, the processor 1101 may also dynamically adjust the shooting parameters of the camera assembly 1106 according to the intensity of the ambient light collected by the optical sensor 1113.
Proximity sensor 1114, also known as a distance sensor, is typically disposed on the front panel of terminal 1100. Proximity sensor 1114 is used to capture the distance between the user and the front face of terminal 1100. In one embodiment, when proximity sensor 1114 detects that the distance between the user and the front surface of terminal 1100 is gradually decreasing, the display screen 1105 is controlled by the processor 1101 to switch from the bright screen state to the dark screen state; when proximity sensor 1114 detects that the distance between the user and the front surface of terminal 1100 is gradually increasing, the display screen 1105 is controlled by the processor 1101 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 11 does not constitute a limitation of terminal 1100, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 12 is a schematic structural diagram of a server 1200 according to an embodiment of the present application. The server 1200 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 1201 and one or more memories 1202, where the memory 1202 stores at least one computer program that is loaded and executed by the processor 1201 to implement the training method of the motion recognition model provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input/output, and may further include other components for implementing the functions of the device, which are not described herein again.
An embodiment of the present application further provides a computer-readable storage medium, where at least one program code is stored in the computer-readable storage medium, and the at least one program code is loaded and executed by a processor, so as to implement the method for training a motion recognition model according to any implementation manner.
An embodiment of the present application further provides a computer program product, where the computer program product includes at least one program code, and the at least one program code is loaded and executed by a processor to implement the method for training a motion recognition model according to any of the above implementation manners.
In some embodiments, the computer program product according to the embodiments of the present application may be deployed to be executed on one computer device or on multiple computer devices located at one site, or may be executed on multiple computer devices distributed at multiple sites and interconnected by a communication network, and the multiple computer devices distributed at the multiple sites and interconnected by the communication network may constitute a block chain system.
The present application is intended to cover various modifications, alternatives, and equivalents, which may be included within the spirit and scope of the present application. It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are authorized by the user or sufficiently authorized by various parties, and the collection, use, and processing of the relevant data is required to comply with relevant laws and regulations and standards in relevant countries and regions. For example, sample videos referred to in this application are all obtained with sufficient authorization.

Claims (14)

1. A method for training a motion recognition model, the method comprising:
acquiring a plurality of sample videos;
dividing a target object in the sample video to obtain a plurality of action parts corresponding to the sample video, wherein the action parts are action parts of the target object, and the target object is an object making action;
determining relative position vectors among the action parts to obtain relative position vectors corresponding to the sample video, wherein the relative position vectors are used for representing the position relation among the action parts;
clustering the sample videos based on the corresponding relative position vectors of the sample videos respectively to obtain a plurality of first video clusters, wherein the first video clusters comprise at least one sample video, and the sample videos in the same first video cluster have the same action;
training a motion recognition model based on the plurality of first video clusters.
2. The method of claim 1, wherein the determining a relative position vector between the plurality of motion portions to obtain a corresponding relative position vector of the sample video comprises:
determining a first action part and a plurality of second action parts in a plurality of action parts corresponding to the sample video based on part information of the plurality of action parts, wherein the first action part is a reference action part of the target object;
and determining relative position vectors of the plurality of second action parts relative to the first action part to obtain the corresponding relative position vector of the sample video.
3. The method of claim 2, wherein the determining relative position vectors of the plurality of second motion portions with respect to the first motion portion to obtain corresponding relative position vectors of the sample video comprises:
respectively determining a first central position and a second central position, wherein the first central position is the central position of the first action part, and the second central position is the central position of the second action part;
determining a vector between the first center position and the second center position as a relative position vector of the second motion part with respect to the first motion part;
and forming a corresponding relative position vector of the sample video by using the relative position vectors of the plurality of second action parts relative to the first action part.
4. The method of claim 2, wherein the determining relative position vectors of the plurality of second motion portions with respect to the first motion portion to obtain corresponding relative position vectors of the sample video comprises:
respectively determining a first central position and a plurality of boundary positions, wherein the first central position is the central position of the first action part, and the plurality of boundary positions are the boundary positions of the second action part;
determining vectors between the first center position and the plurality of boundary positions, respectively;
splicing the first central position with the vectors between the boundary positions respectively to obtain a relative position vector of the second action part relative to the first action part;
and forming a corresponding relative position vector of the sample video by using the relative position vectors of the plurality of second action parts relative to the first action part.
5. The method according to claim 1, wherein the sample video comprises a plurality of video frames, and the corresponding relative position vector of the sample video comprises corresponding relative position vectors of the plurality of video frames respectively;
the clustering the plurality of sample videos based on the relative position vectors corresponding to the plurality of sample videos respectively to obtain a plurality of first video clusters includes:
splicing a plurality of relative position vectors corresponding to the same video frame to obtain a first relative position vector corresponding to the same video frame;
splicing first relative position vectors corresponding to a plurality of same video frames of the same sample video to obtain a second relative position vector corresponding to the sample video;
and clustering the plurality of sample videos based on second relative position vectors respectively corresponding to the plurality of sample videos to obtain a plurality of first video clusters.
6. The method according to claim 5, wherein the clustering the sample videos based on the second relative position vectors corresponding to the sample videos, respectively, to obtain the first video clusters comprises:
determining a distance between any two second relative position vectors based on the second relative position vectors respectively corresponding to the plurality of sample videos;
and under the condition that the distance is not greater than the preset distance, aggregating the two sample videos corresponding to the two second relative position vectors into the same first video cluster.
7. The method according to any one of claims 1-6, further comprising:
acquiring a target video, wherein the target video is a video of an action category to be identified;
inputting the target video into the action recognition model, and outputting an action type of the target video and relative position vectors of a plurality of action parts of a target object in the target video, wherein the relative position vectors are used for explaining the reason that the target video is recognized as the action type.
8. The method of claim 1, wherein training a motion recognition model based on the first plurality of video clusters comprises:
determining a first action part and a plurality of second action parts in the plurality of action parts based on part information of the plurality of action parts, wherein the first action part is a reference action part of the target object;
determining a target action part from the plurality of second action parts, wherein the target action part comprises the least pixel points;
clustering the sample videos based on the first action parts and the target action parts of the sample videos to obtain a plurality of second video clusters, wherein the second video clusters comprise at least one sample video, and the action of the sample videos in the same second video cluster is the same;
training the motion recognition model based on the plurality of first video clusters and the plurality of second video clusters.
9. The method of claim 8, wherein clustering the plurality of sample videos based on the first motion portions and the target motion portions of the plurality of sample videos to obtain a plurality of second video clusters comprises:
determining a first similarity and a second similarity of any two sample videos, wherein the first similarity is the similarity between first action parts of the two sample videos, and the second similarity is the similarity between target action parts of the two sample videos;
determining a third similarity between the two sample videos based on the first similarity and the second similarity;
and aggregating the two sample videos into the same second video cluster under the condition that the third similarity is greater than a preset similarity.
10. The method according to claim 1, wherein the segmenting the target object in the sample video to obtain a plurality of action parts corresponding to the sample video comprises:
extracting a plurality of video frames from the sample video;
dividing a target object included in the video frame to obtain a plurality of action parts corresponding to the video frame;
and combining the action parts corresponding to the plurality of video frames respectively into a plurality of action parts corresponding to the sample video.
11. An apparatus for training a motion recognition model, the apparatus comprising:
the first acquisition module is used for acquiring a plurality of sample videos;
a segmentation module, configured to segment a target object in the sample video to obtain a plurality of action portions corresponding to the sample video, where the action portions are action portions of the target object, and the target object is an object that makes an action;
a determining module, configured to determine relative position vectors between the multiple action portions, to obtain a relative position vector corresponding to the sample video, where the relative position vector is used to represent a position relationship between the multiple action portions;
the clustering module is used for clustering the sample videos based on the corresponding relative position vectors of the sample videos respectively to obtain a plurality of first video clusters, wherein the first video clusters comprise at least one sample video, and the actions of the sample videos in the same first video cluster are the same;
and the training module is used for training the motion recognition model based on the plurality of first video clusters.
12. A computer device, characterized in that the computer device comprises a processor and a memory for storing at least one piece of computer program, which is loaded by the processor and executes the method of training a motion recognition model according to any of claims 1 to 10.
13. A computer-readable storage medium for storing at least one computer program for executing the method for training a motion recognition model according to any one of claims 1 to 10.
14. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements a method of training a motion recognition model according to any of claims 1 to 10.
CN202210265324.2A 2022-03-17 2022-03-17 Method, device, equipment, storage medium and product for training motion recognition model Pending CN114842549A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210265324.2A CN114842549A (en) 2022-03-17 2022-03-17 Method, device, equipment, storage medium and product for training motion recognition model

Publications (1)

Publication Number Publication Date
CN114842549A true CN114842549A (en) 2022-08-02

Family

ID=82562914



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code: Ref country code: HK; Ref legal event code: DE; Ref document number: 40073935; Country of ref document: HK