CN114494543A - Action generation method and related device, electronic equipment and storage medium

Info

Publication number
CN114494543A
CN114494543A (application number CN202210089863.5A)
Authority
CN
China
Prior art keywords
action
representation
individuals
feature
representations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210089863.5A
Other languages
Chinese (zh)
Inventor
宋子扬
王栋梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Technology Development Co Ltd
Original Assignee
Shanghai Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Technology Development Co Ltd filed Critical Shanghai Sensetime Technology Development Co Ltd
Priority to CN202210089863.5A
Publication of CN114494543A
Priority to PCT/CN2022/135160 (WO2023142651A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an action generation method, a related device, an electronic device and a storage medium. The action generation method includes: acquiring first feature representations that respectively represent a plurality of individuals in a plurality of action frames, and acquiring second feature representations that respectively represent the plurality of individuals with respect to a target action category; performing relational modeling based on the first feature representations and the second feature representations to obtain a fused feature representation of each individual in each action frame, wherein the type of relational modeling depends on the first total number of the individuals; and performing action mapping based on the fused feature representations to obtain an action sequence of the individuals for the target action category, the action sequence comprising the plurality of action frames and each action frame comprising an action representation of each individual. With this scheme, both single-individual and multi-individual application scenarios can be supported while improving action generation efficiency.

Description

Action generation method and related device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to an action generation method, a related apparatus, an electronic device, and a storage medium.
Background
Motion generation is key to many computer vision tasks such as animation authoring and humanoid robot interaction. At present there are mainly two approaches to motion generation: one is the modeling-and-rendering approach based on computer graphics, in which a designer must invest a great deal of time and effort in modeling, skinning, motion capture and similar work, so efficiency is low; the other is based on machine learning, in particular deep learning.
Thanks to the rapid development of machine learning in recent years, performing the action generation task with a deep neural network can greatly improve action generation efficiency. However, existing approaches typically only consider motion generation for a single individual and are not compatible with multi-individual application scenarios. In view of this, how to support both single-individual and multi-individual application scenarios while improving action generation efficiency has become an urgent problem to be solved.
Disclosure of Invention
The application provides an action generation method and a related device, an electronic device and a storage medium.
A first aspect of the present application provides an action generation method, including: acquiring first feature representations respectively representing a plurality of individuals in a plurality of action frames, and acquiring second feature representations respectively representing the plurality of individuals about target action categories; performing relational modeling based on the first feature representation and the second feature representation to obtain fusion feature representations of all individuals in all action frames; wherein the type of the relational modeling is related to a first total number of the plurality of individuals; performing action mapping based on the fusion feature representation to obtain action sequences of a plurality of individuals about the target action category; the action sequence comprises a plurality of action frames, and the action frames comprise action representations of all individuals.
Therefore, first feature representations respectively representing a plurality of individuals in a plurality of action frames are obtained, and second feature representations respectively representing the plurality of individuals with respect to the target action category are obtained. On this basis, relational modeling is performed based on the first and second feature representations to obtain a fused feature representation of each individual in each action frame, where the type of relational modeling depends on the first total number of individuals. Action mapping is then performed based on the fused feature representations to obtain an action sequence of the individuals for the target action category, the action sequence comprising the plurality of action frames and each action frame comprising an action representation of each individual. On the one hand, actions can be generated automatically without relying on manual work; on the other hand, because relational modeling is performed in a targeted manner according to the first total number of individuals, both single-individual and multi-individual application scenarios can be supported. Therefore, the method is compatible with single-individual and multi-individual application scenarios while improving action generation efficiency.
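As an illustration of the three steps above, the following Python sketch outlines the flow; all function names, shapes and interfaces (gaussian_prior, category_embedder, relation_model, action_head) are hypothetical assumptions for illustration, not the implementation described in this application.

```python
import torch

def generate_actions(gaussian_prior, category_embedder, relation_model, action_head,
                     num_individuals: int, num_frames: int, target_category: int):
    # Step 1: first feature representations of each individual in each action frame
    # (sampled from a Gaussian-process prior) and second feature representations of
    # each individual with respect to the target action category.
    first_feats = gaussian_prior(num_individuals, num_frames)            # (T, P, C)
    second_feats = category_embedder(target_category, num_individuals)   # (1, P, C)

    # Step 2: relational modeling; which relations are modeled depends on the
    # first total number of individuals (timing only vs. timing + interaction).
    fused = relation_model(torch.cat([first_feats, second_feats], dim=0),
                           multi_individual=num_individuals > 1)         # (T+1, P, C)

    # Step 3: action mapping from the fused representations of the T action frames.
    return action_head(fused[:num_frames])                               # (T, P, action_dim)
```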
Wherein, in the case that the first total number of the plurality of individuals is single, the relational modeling comprises modeling a time sequence relation between the action frames; and/or, in the case that the first total number of the individuals is multiple, the relational modeling includes modeling an interactive relationship between the individuals in each action frame and modeling a time-series relationship between the action frames.
Therefore, under the condition that the first total number of the plurality of individuals is single, the relational modeling comprises modeling of the time sequence relation among the action frames, so that the time sequence consistency among the action frames can be improved through modeling of the time sequence relation, and the authenticity of the action sequence is favorably improved.
Wherein, under the condition that the relational modeling includes modeling time sequence relation, the relational modeling is carried out based on the first feature representation and the second feature representation to obtain the fusion feature representation of each individual in each action frame, and the method comprises the following steps: selecting an individual as a target individual, and taking a first characteristic representation and a second characteristic representation corresponding to the target individual as time sequence characteristic representations of the target individual at different time sequences; respectively selecting each time sequence as a first current time sequence, and selecting the time sequence characteristic representation of the first current time sequence as a first current time sequence representation; obtaining a fusion characteristic representation corresponding to the first current time sequence representation based on the correlation degree between each first reference time sequence representation and the first current time sequence representation; the first reference time sequence representation comprises time sequence characteristic representations of the target individuals at all time sequences.
Wherein, under the condition that the relational modeling includes modeling interactive relation, the relational modeling is carried out based on the first feature representation and the second feature representation to obtain the fusion feature representation of each individual in each action frame, and the method comprises the following steps: selecting an individual as a target individual, and taking a first characteristic representation and a second characteristic representation corresponding to the target individual as time sequence characteristic representations of the target individual at different time sequences; respectively selecting each time sequence as a second current time sequence, and selecting the time sequence characteristic representation of the second current time sequence as a second current time sequence representation; obtaining fusion characteristic representations corresponding to the second current time sequence representations based on the correlation degrees of the second reference time sequence representations and the second current time sequence representations respectively; and the second reference time sequence representation comprises time sequence characteristic representations of the individuals at the second current time sequence respectively.
Therefore, an individual is selected as the target individual, and the first and second feature representations corresponding to the target individual are taken as the timing feature representations of the target individual at different timings. Each timing is selected in turn as the current timing and its timing feature representation as the current timing representation, and a fused feature representation corresponding to the current timing representation is then obtained based on the correlation between each reference timing representation and the current timing representation. When the timing relationship is modeled, the reference timing representations are the timing feature representations of the target individual at all timings; when the interaction relationship is modeled, the reference timing representations are the timing feature representations of the individuals at the reference timing, where the reference timing is the timing corresponding to the current timing. In this way, the timing relationship and the interaction relationship can be modeled by similar modeling processes, which further improves compatibility between single-individual and multi-individual application scenarios.
Under the condition that the relational modeling comprises modeling interactive relation and time sequence relation, the relational modeling is carried out based on the first feature representation and the second feature representation to obtain the fusion feature representation of each individual in each action frame, and the method comprises the following steps: modeling a prior relation based on the first feature representation and the second feature representation to obtain an output feature representation of the prior relation, and modeling a subsequent relation based on the output feature representation to obtain a fusion feature representation; the former relation is an interactive relation, the latter relation is a time sequence relation, or the former relation is a time sequence relation and the latter relation is an interactive relation.
Therefore, when the relational modeling includes modeling both the interaction relationship and the timing relationship, the output feature representation of the relationship modeled first (for example the interaction relationship) is used as the input feature representation of the relationship modeled afterwards (for example the timing relationship). In a multi-individual application scenario the interaction relationship and the timing relationship are thus modeled successively, so that every fused feature representation incorporates both the interaction relationship and the timing relationship, which helps improve the fusion effect of the two relationships.
The action sequence is obtained by an action generating model, the action generating model comprises a relational modeling network, the relational modeling network comprises a time sequence modeling sub-network and an interactive modeling sub-network, the time sequence modeling sub-network is used for modeling a time sequence relation, and the interactive modeling sub-network is used for modeling an interactive relation.
Therefore, the action sequence is obtained by the action generating model, the action generating model comprises a relation modeling network, the relation modeling network comprises a time sequence modeling sub-network and an interaction modeling sub-network, the time sequence modeling sub-network is used for modeling the time sequence relation, and the interaction modeling sub-network is used for modeling the interaction relation, so that the action generating task can be completed through the network model, and the action generating efficiency is further improved.
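A minimal sketch of what such a relational modeling network could look like is given below, using standard transformer encoder layers as stand-ins for the timing modeling sub-network and the interaction modeling sub-network; the class, the layer choice and the interaction-before-timing order are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class RelationalModelingNetwork(nn.Module):
    """Hypothetical sketch: the interaction sub-network attends across individuals
    within each timing, the timing sub-network attends across timings per individual."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.interaction = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.timing = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (P individuals, T+1 timings, C channels)
        p, t, c = x.shape
        if p > 1:
            # interaction relation: attention over the P individuals at each timing
            x = self.interaction(x.transpose(0, 1)).transpose(0, 1)
        # timing relation: attention over the T+1 timings of each individual
        return self.timing(x)
```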
Wherein the first feature representation is obtained based on Gaussian process sampling.
Therefore, the first feature representation is obtained based on Gaussian process sampling, the acquisition complexity of the first feature representation is greatly reduced, and the generation quality of the action data with rich categories can be improved.
The method for acquiring the first feature representations respectively representing the plurality of individuals in the plurality of action frames comprises the following steps: sampling a second total number of times in each of a plurality of Gaussian processes to obtain first original representations respectively representing a second total number of action frames, where the length of each first original representation is the same as the number of Gaussian processes and the characteristic length scales of the Gaussian processes differ from one another; and obtaining a third total number of first feature representations based on the first total number and the first original representations, where the third total is the product of the first total and the second total.
Therefore, a second total number of samples are drawn from each of the several Gaussian processes to obtain first original representations respectively representing a second total number of action frames, the length of each first original representation equals the number of Gaussian processes, the characteristic length scales of the Gaussian processes differ from one another, and a third total number of first feature representations, the third total being the product of the first total and the second total, are obtained based on the first total and the first original representations. Because the characteristic length scales differ and every sampling of a Gaussian process yields characteristic information for each action frame, the accuracy of each first feature representation can be improved.
Wherein the second feature representation is mapped based on the target action category.
Therefore, the second feature representation is obtained by mapping the target action category, so that the second feature representation can be obtained with simple processing such as mapping text information, which greatly reduces the complexity of driving the action generation.
Wherein, obtaining second feature representations respectively representing a plurality of individuals about the target action category comprises: embedding and expressing the target action type to obtain a second original expression; based on the first total and the second original representation, a first total number of second feature representations is obtained.
Therefore, the target action category is subjected to embedded representation to obtain a second original representation, and a first total number of second feature representations are obtained based on the first total number and the second original representation, namely the first total number of second feature representations can be obtained by performing embedded representation on the text information and performing related processing by combining the first total number, so that the complexity of obtaining the second feature representations is greatly reduced.
Wherein the first feature representation and the second feature representation are fused with position codes; wherein, in the case that the several individuals are single individuals, the position code comprises a time sequence position code, and in the case that the several individuals are multiple individuals, the position code comprises an individual position code and a time sequence position code.
Therefore, the first feature representation and the second feature representation are fused with position codes, the position codes comprise time sequence position codes under the condition that the plurality of individuals are single individuals, and the position codes comprise individual position codes and time sequence position codes under the condition that the plurality of individuals are multiple individuals, so that different feature representations can be distinguished by adopting different position coding strategies under two application scenes of single individuals and multiple individuals, the position codes of the feature representations are different, and the accuracy of the feature representation is improved.
The action sequence is obtained by the action generating model, and the position code is adjusted together with the network parameters of the action generating model in the training process of the action generating model until the training of the action generating model is converged.
Therefore, the action sequence is obtained by the action generating model, the position code is adjusted together with the network parameters of the action generating model in the training process of the action generating model until the training of the action generating model is converged, the position code is trained along with the network model, so that the representing capability of the position code can be improved, and the position code is not adjusted any more after the training is converged, namely, the position code is kept fixed, so that strong prior constraint can be added, balance can be achieved between the prior constraint and the representing capability, the accuracy of feature representation can be further improved, and the generation effect of the action sequence can be improved.
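The following sketch illustrates position codes that are trained together with the model and kept fixed afterwards, implemented as learnable parameters; shapes and names are assumptions, and for brevity the timing and individual codes are added rather than spliced as in the detailed description below.

```python
import torch
import torch.nn as nn

class LearnedPositionCodes(nn.Module):
    """Sketch with assumed shapes: timing codes for T+1 timings plus individual
    codes, trained jointly with the model and kept fixed after convergence."""
    def __init__(self, num_timings: int, max_individuals: int, dim: int):
        super().__init__()
        self.timing = nn.Parameter(torch.randn(num_timings, 1, dim))
        self.individual = nn.Parameter(torch.randn(1, max_individuals, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T+1, P, C); a single-individual input (P == 1) only receives timing codes
        t, p, _ = x.shape
        codes = self.timing[:t]
        if p > 1:
            codes = codes + self.individual[:, :p]
        return x + codes
```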
Wherein the individual action representation in the action frame comprises: in the action frame, first position information of key points of the individual and posture information of the individual, and the posture information includes second position information of a plurality of joint points of the individual.
Thus, the action representation of an individual in an action frame includes: in the action frame, the first position information of a key point of the individual and the posture information of the individual, the posture information including the second position information of a plurality of joint points of the individual. The individual's action can therefore be expressed by the position information of the key point and the joint points, which improves the accuracy of the action representation.
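A possible data layout for such an action representation is sketched below; the field names are illustrative assumptions, not terms defined by this application.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class IndividualAction:
    """One individual's action representation in a single action frame
    (field names are illustrative, not defined by the application)."""
    keypoint_position: Vec3       # first position information of a key point
    joint_positions: List[Vec3]   # second position information of the joint points

@dataclass
class ActionFrame:
    individuals: List[IndividualAction]   # one entry per individual in the frame

ActionSequence = List[ActionFrame]        # an action sequence is a list of T frames
```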
The action sequence is obtained by an action generation model, and the action generation model and an identification model are obtained through generative adversarial training.
Therefore, the action generation model and the identification model are cooperatively trained through generative adversarial training, so that the two models can promote and complement each other during the cooperative training, which ultimately helps improve the model performance of the action generation model.
Wherein the step of the generative adversarial training comprises: acquiring a sample action sequence of a number of sample individuals with respect to a sample action category, the sample action sequence comprising a preset number of sample action frames and being marked with a sample flag indicating whether the sample action sequence was actually generated by the action generation model; decomposing each sample action frame in the sample action sequence to obtain sample graph data, the sample graph data comprising the preset number of node graphs, each node graph being formed by connecting nodes, the nodes comprising the key points and the joint points, each node graph containing a node feature representation of every node, and the position feature representation of each node being obtained by splicing the position feature representations of the several sample individuals at the corresponding node; identifying the sample graph data and the sample action category with the identification model to obtain a prediction result, the prediction result comprising a first prediction flag representing the predicted likelihood that the sample action sequence was generated by the action generation model and a second prediction flag representing the predicted likelihood that the sample action sequence belongs to the sample action category; and adjusting the network parameters of either the action generation model or the identification model based on the sample flag, the first prediction flag, and the second prediction flag.
Therefore, by decomposing the sample action representation into the sample graph data, the identification of the action sequence can be skillfully decomposed into the identification of the graph data, which is beneficial to greatly reducing the training complexity and the construction difficulty of the identification model.
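The following Python sketch shows one possible form of the generative adversarial step described above, with the identification model predicting both a realness score (first prediction flag) and an action-category score (second prediction flag). The generator and discriminator interfaces and the decompose_to_graphs helper are hypothetical, and updating both models in one call is a simplification of the alternating update described above.

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, decompose_to_graphs, opt_g, opt_d,
                     real_sequence, category, num_individuals, num_frames):
    # category: LongTensor of class indices for the sample action category
    fake_sequence = generator(category, num_individuals, num_frames)

    # --- identification (discriminator) update ---
    real_score, real_cls = discriminator(decompose_to_graphs(real_sequence), category)
    fake_score, _ = discriminator(decompose_to_graphs(fake_sequence.detach()), category)
    loss_d = (F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score))
              + F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))
              + F.cross_entropy(real_cls, category))          # second prediction flag
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- generation (generator) update ---
    score, cls = discriminator(decompose_to_graphs(fake_sequence), category)
    loss_g = (F.binary_cross_entropy_with_logits(score, torch.ones_like(score))
              + F.cross_entropy(cls, category))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```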
When the sample action sequence is acquired from a real scene, the position feature representation of each node is obtained by splicing the position feature representations of the sample individuals at the corresponding node in a random order of the sample individuals.
Therefore, because the position feature representation of each node is obtained by splicing the position feature representations of the sample individuals at the corresponding node in a random order of the sample individuals, the same sample action sequence presented in different individual orders is treated and modeled as different samples during training, which achieves data augmentation and helps improve the robustness of the model.
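A small sketch of the random-order splicing for real samples follows; the tensor shapes and the splicing layout are assumptions.

```python
import torch

def splice_node_features(per_individual_feats: torch.Tensor, is_real: bool) -> torch.Tensor:
    # per_individual_feats: (P, num_nodes, C) position feature representations of
    # each sample individual at every node (key points and joint points) of the graph
    if is_real:
        # for sequences captured from a real scene, splice the individuals in a
        # random order so the same sample is seen as different training samples
        order = torch.randperm(per_individual_feats.shape[0])
        per_individual_feats = per_individual_feats[order]
    p, n, c = per_individual_feats.shape
    return per_individual_feats.permute(1, 0, 2).reshape(n, p * c)   # (num_nodes, P*C)
```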
A second aspect of the present application provides an action generating apparatus, including: the system comprises a feature acquisition module, a relation modeling module and an action mapping module, wherein the feature acquisition module is used for acquiring first feature representations respectively representing a plurality of individuals in a plurality of action frames and acquiring second feature representations respectively representing the plurality of individuals about target action categories; the relational modeling module is used for carrying out relational modeling on the basis of the first feature representation and the second feature representation to obtain fusion feature representations of all individuals in all action frames; wherein the type of the relational modeling is related to a first total number of the plurality of individuals; the action mapping module is used for carrying out action mapping based on the fusion characteristic representation to obtain action sequences of a plurality of individuals about the target action category; the action sequence comprises a plurality of action frames, and the action frames comprise action representations of all individuals.
A third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the action generation method in the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a processor, implement the action generation method of the first aspect described above.
According to the above scheme, first feature representations respectively representing a plurality of individuals in a plurality of action frames are obtained, and second feature representations respectively representing the plurality of individuals with respect to a target action category are obtained. On this basis, relational modeling is performed based on the first and second feature representations to obtain a fused feature representation of each individual in each action frame, where the type of relational modeling depends on the first total number of individuals. Action mapping is then performed based on the fused feature representations to obtain an action sequence of the individuals for the target action category, the action sequence comprising the plurality of action frames and each action frame comprising an action representation of each individual. On the one hand, actions can be generated automatically without relying on manual work; on the other hand, because relational modeling is performed in a targeted manner according to the first total number of individuals, both single-individual and multi-individual application scenarios can be supported. Therefore, the scheme is compatible with single-individual and multi-individual application scenarios while improving action generation efficiency.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of an action generation method of the present application;
FIG. 2 is a process diagram of an embodiment of the method for generating actions of the present application;
FIG. 3a is a schematic diagram of one embodiment of an action sequence;
FIG. 3b is a schematic diagram of one embodiment of an action sequence;
FIG. 3c is a schematic diagram of one embodiment of an action sequence;
FIG. 3d is a schematic diagram of one embodiment of an action sequence;
FIG. 3e is a schematic diagram of one embodiment of an action sequence;
FIG. 3f is a schematic diagram of one embodiment of an action sequence;
FIG. 4 is a flow diagram illustrating an embodiment of a method for training a motion generation model;
FIG. 5 is a schematic diagram of an embodiment of sample action frame acquisition;
FIG. 6 is a schematic diagram of one embodiment of sample graph data;
FIG. 7 is a block diagram of an embodiment of the action generating apparatus of the present application;
FIG. 8 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 9 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship. Further, the term "plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating an embodiment of an action generating method according to the present application. Specifically, the method may include the steps of:
step S11: first feature representations respectively representing a plurality of individuals in a plurality of action frames are obtained, and second feature representations respectively representing the plurality of individuals about the target action category are obtained.
In one implementation scenario, the first total number of individuals and the target action category may be specified by the user prior to the formal implementation action generation. Illustratively, the user may specify the target action category as "hug" and specify a first total of two individuals; alternatively, the user may specify the target action category as "dance" and a first total of several individuals as one; alternatively, the user may specify that the target action category is "fight," and that the first total of the number of individuals is three. It should be noted that the above examples are only a few possible embodiments in the practical application process, and the target action category and the first total number of individuals in the practical application process are not limited thereby.
In another implementation scenario, the target action category may be specified by the user before the formal implementation action is generated, and the first total number of the individuals may be automatically analyzed based on the target action category. For example, the user may specify that the target action category is "clapping", and based on the target action category, the first total number of the individuals may be automatically analyzed to be two; or the user can specify that the target action category is 'exchange goods', and the first total number of the individuals is two according to the target action category which can be automatically analyzed; alternatively, the user may specify the target action category as "carry item," and the first total of the number of individuals may be automatically analyzed to be one based on the target action category. It should be noted that the above examples are only a few possible embodiments in the practical application process, and the target action category and the first total number of individuals in the practical application process are not limited thereby.
In yet another implementation scenario, the target action category may be specified by the user before the formal implementation action is generated, and the first total of the individuals may be automatically analyzed based on the target action category, and a modification instruction of the automatically analyzed first total by the user may be accepted to correct the automatically analyzed first total. For example, the user may specify that the target action category is "fighting", then the first total number of the individuals is two based on the target action category, and a modification instruction of the user on the automatically analyzed first total number is accepted to correct the first total number into four; or, the user may specify the target action category as "walking", and the first total number of the individuals obtained by the automatic analysis based on the target action category is one, and the modification instruction of the user on the first total number obtained by the automatic analysis is received, and the first total number is corrected into two. It should be noted that the above examples are only a few possible embodiments in the practical application process, and the target action category and the first total number of individuals in the practical application process are not limited thereby.
It should be noted that several of the individuals mentioned above may be humans. Of course, it is not excluded that several individuals comprise both humans and animals. Illustratively, the target action category may be designated "dog walking," and the several individuals may include people and dogs.
In one implementation scenario, the second total number of the action frames may be pre-specified, and for example, the second total number may be 10, 15, 20, 30, and so on, which is not limited herein.
In one implementation scenario, a first feature representation of each individual at each action frame may be obtained. For example, for a case where the first total number of the several individuals is one (i.e., for a motion generation scene of a single individual), first feature representations of the single individual at respective motion frames may be acquired; alternatively, for the case that the first total number of the individuals is two (i.e. for the motion generation scenes of the two individuals), the first feature representation of each individual in each motion frame may be acquired, and for convenience of description, the two individuals may be referred to as "a" and "b", respectively, and then the first feature representation of "a" in each motion frame may be acquired, and the first feature representation of "b" in each motion frame may be acquired. Other cases may be analogized, and no one example is given here.
In one implementation scenario, it should be noted that the action frame is included in the action sequence that is finally expected to be generated in the disclosed embodiment of the action generation method of the present application, that is, when the first feature representation is obtained, the action frame is not actually generated, and the first feature representation may be regarded as feature representations that are respectively initialized in the action frames for each individual. In particular, the first feature representation may be derived based on gaussian process sampling. It should be noted that the gaussian process is one of the random processes in probability theory and mathematical statistics, and is a combination of a series of random variables that obey normal distribution in an index set, and specific meanings of the gaussian process can refer to technical details of the gaussian process, which are not described herein again.
In a specific implementation scenario, a second total number of samples may be drawn in each of a plurality of Gaussian processes, so as to obtain first original representations respectively representing the second total number of action frames, where the length of each first original representation is the same as the number of Gaussian processes and the characteristic length scales of the Gaussian processes differ from each other. On this basis, a third total number of first feature representations are obtained based on the first total number and the first original representations, the third total number being the product of the first total number and the second total number. Illustratively, for convenience of description, the second total number of action frames may be denoted as T, and the characteristic length scale of a Gaussian process may be denoted as σ_c, taking the values 1, 10, 100 and 1000 respectively. Then, in the Gaussian process with characteristic length scale σ_c = 1, sampling T times yields a one-dimensional vector of length T; similarly, the Gaussian processes with σ_c = 10, 100 and 1000 each yield a one-dimensional vector of length T by sampling. Combining the same-position elements of the 4 one-dimensional vectors obtained from these 4 Gaussian processes gives T first original representations of length 4, which correspond one-to-one with the T action frames: the first original representation corresponds to the first action frame, the second to the second action frame, ……, and the T-th to the T-th action frame. In addition, please refer to fig. 2, which is a process diagram of an embodiment of the action generation method of the present application. As shown in fig. 2, for convenience of description, the length of a first original representation obtained by the above sampling may be denoted as C_0, so the first original representations representing the several action frames can be denoted as (T, C_0). On this basis, the first original representations (T, C_0) can be fed into a mapping (for example, a multi-layer perceptron may be adopted to map the first original representations) to change the dimension C_0 of the first original representations; the number of first original representations after mapping is still T. In the above manner, because the characteristic length scales of the Gaussian processes differ and each sampling of a Gaussian process yields characteristic information for every action frame, the accuracy of each first feature representation can be improved.
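Under the assumption of an RBF kernel (the application does not fix the kernel form), the sampling described above can be sketched as follows; the jitter term and the function name are illustrative.

```python
import numpy as np

def sample_first_original_representations(T: int, length_scales=(1, 10, 100, 1000)):
    """Draw one T-step trajectory from a Gaussian process per characteristic length
    scale sigma_c and stack same-position elements, giving T first original
    representations of length len(length_scales)."""
    t = np.arange(T, dtype=float)[:, None]
    samples = []
    for sigma_c in length_scales:
        # assumed RBF covariance: K[i, j] = exp(-(t_i - t_j)^2 / (2 * sigma_c^2))
        cov = np.exp(-(t - t.T) ** 2 / (2.0 * sigma_c ** 2)) + 1e-6 * np.eye(T)
        samples.append(np.random.multivariate_normal(np.zeros(T), cov))
    return np.stack(samples, axis=1)   # shape (T, 4): one length-4 vector per frame
```

A small multi-layer perceptron (for example a stack of linear layers) can then map each length-4 row to the working channel dimension C_0, as described above.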
In a specific implementation scenario, after obtaining the first original representations respectively representing the second total number of action frames, it may be determined whether to duplicate the first original representations representing the respective action frames based on whether the first total number is equal to one or greater than one, so as to obtain the first original representations of the individuals in the respective action frames. For example, in a case that the first total number is equal to one, it may be determined that the motion is generated as a single individual scene, and then the first original representation representing each motion frame obtained by the foregoing sampling may be directly used as the first original representation of the single individual in each motion frame; or, in a case that the first total number is greater than one, it may be determined that the action is generated as a scene of multiple individuals, the first original representations representing the respective actions obtained by the foregoing sampling may be copied for the first total number of times, respectively, to obtain the first original representations of the multiple individuals in each action frame, for example, in a case that the first total number is 2, the first original representation representing the 1 st action frame may be copied into two first original representations, which respectively represent the first original representations of the two individuals in the 1 st action frame, and the other cases may be similar to each other, which is not illustrated here.
In a specific implementation scenario, please continue to refer to fig. 2. In order to distinguish different first original representations in the single-individual and multi-individual cases, the position information of each first original representation may be encoded on the basis of the first original representations to obtain the corresponding first feature representations; that is, the first feature representations are fused with position codes, and the position codes differ from one another. Specifically, when the several individuals are a single individual, the position code includes a timing position code; that is, different first original representations are distinguished mainly by encoding the action frames at different timings, which yields the first feature representations. Illustratively, still taking T action frames as an example, in the single-individual case the timing position codes (e.g., 1, 2, ..., T) may be respectively merged into the first original representations of the T action frames, so as to obtain the first feature representations characterizing the single individual in the T action frames. Similarly, when there are multiple individuals, the position code may include both a timing position code and an individual position code; that is, not only are the action frames at different timings encoded, but the multiple individuals in each action frame are also encoded, so as to distinguish the different first original representations and obtain the first feature representations (as shown by the dashed box after the position code in fig. 2). Illustratively, still taking T action frames as an example, in the multi-individual case a timing position code (e.g., 1) may be merged into the first original representations of the 1st action frame, and individual position codes (e.g., 1, 2, ...) may further be merged for the multiple individuals of the 1st action frame; the timing position code and the individual position codes are combined as the position codes, so that the first feature representations of the multiple individuals in the 1st action frame are merged with different position codes (e.g., 1-1, 1-2, ...). Similarly, a timing position code (e.g., 2) may be merged into the first original representations of the 2nd action frame, and individual position codes (e.g., 1, 2, ...) may further be merged for the multiple individuals of the 2nd action frame, so that the first feature representations of the multiple individuals in the 2nd action frame are merged with different position codes (e.g., 2-1, 2-2, ...); the remaining action frames can be handled similarly and are not listed one by one here. In addition, the above position codes are only examples; in practice, an action generation model may be trained in advance, the position codes may be adjusted together with the network parameters of the action generation model during training until the training converges, and the adjusted position codes may then be used in subsequent applications.
By the method, different feature representations can be distinguished by adopting different position coding strategies in two application scenes of a single individual and a plurality of individuals, so that the position codes of the feature representations are different, and the accuracy of the feature representation is improved.
In one implementation scenario, similar to the first feature representation, for the second feature representation, a second feature representation of each individual with respect to the target action category may be obtained. For example, for the case where the first total number of several individuals is one (i.e., for the motion generation scenario of a single individual), a second feature representation of the single individual with respect to the target motion category may be obtained; alternatively, for the case where the first total number of the several individuals is two (i.e., for the motion generation scenarios of two individuals), the second feature representation of each individual with respect to the target motion category may be acquired, and for convenience of description, the two individuals may be referred to as "a" and "b", respectively, and then the second feature representation of "a" with respect to the target motion category may be acquired, and the second feature representation of "b" with respect to the target motion category may be acquired. Other cases may be analogized, and no one example is given here.
In one implementation scenario, as previously described, the target action category may be specified by the user, and after determining the target action category, the second feature representation may be mapped based on the target action category.
In a specific implementation scenario, the target action category may be embedded and represented to obtain a second original representation, and then, based on the first total number and the second original representation, a first total number of second feature representations may be obtained. The embedded representation has the function of converting the target motion type into a vector. For example, a category vector of different action categories may be preset, for example, in a case that there are 26 different action categories in total, a category vector of 26 action categories may be preset (for example, each category vector may have a length of 200), and then after the target action category is determined, a category vector of an action category that matches the target action category may be used as the second original representation of the target action category; alternatively, the target motion category may be first subjected to one-hot (one-hot) encoding, and then subjected to linear transformation by using a fully-connected layer to obtain the second original representation of the target motion category, for example, when there are 26 different motion categories in total, the target motion category may be first subjected to one-hot (one-hot) encoding to obtain a 26-dimensional vector, and the linear transformation of the fully-connected layer may be regarded as a transformation matrix of N (e.g., 200) × 26, and then the matrix may be multiplied by the 26-dimensional one-hot encoding to obtain the second original representation of the target motion category.
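Both options above (a preset category-vector table, or one-hot encoding followed by a fully connected layer) can be sketched as follows; the sizes 26 and 200 are the examples from the text, and the category index is hypothetical.

```python
import torch
import torch.nn as nn

num_categories, embed_dim = 26, 200          # example sizes taken from the text
target_category = torch.tensor([7])          # hypothetical target action category index

# Option 1: look up a preset category vector matching the target action category.
category_table = nn.Embedding(num_categories, embed_dim)
second_original = category_table(target_category)                     # (1, 200)

# Option 2: one-hot encode the category, then apply a fully connected layer,
# i.e. multiply a 200 x 26 transformation matrix by the 26-dimensional one-hot code.
fc = nn.Linear(num_categories, embed_dim, bias=False)
one_hot = nn.functional.one_hot(target_category, num_categories).float()
second_original = fc(one_hot)                                          # (1, 200)
```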
In a specific implementation scenario, similar to the acquisition of the first feature representations, after obtaining the second original representation representing the target action category, it may be determined whether to copy the second original representation based on whether the first total number is equal to one or greater than one, so as to obtain the second original representations of the several individuals with respect to the target action category. For example, when the first total number is equal to one, the action generation is determined to be a single-individual scenario, and the second original representation obtained above that characterizes the target action category may be directly used as the single individual's second original representation of the target action category. When the first total number is greater than one, the action generation is determined to be a multi-individual scenario, and the aforementioned second original representation characterizing the target action category may be copied the first total number of times to obtain the second original representations of the multiple individuals with respect to the target action category; for example, when the first total number is 2, the second original representation characterizing the target action category may be copied into two second original representations, respectively representing the two individuals' second original representations of the target action category. Other cases can be deduced by analogy and are not listed one by one here.
In a specific implementation scenario, continuing with reference to fig. 2 and similarly to the acquisition of the first feature representations, in order to distinguish different second original representations in the single-individual and multi-individual cases, the position information of each second original representation may be encoded on the basis of the second original representations to obtain the corresponding second feature representations. That is, like the first feature representations, the second feature representations are also fused with position codes, and the position codes differ from one another. It should be noted that not only do the position codes fused into the second feature representations differ from each other, but they also differ from the position codes fused into the first feature representations. Specifically, when the several individuals are a single individual, the position code includes a timing position code; that is, in the single-individual case, the second feature representation can be distinguished from the first feature representations of the different action frames along the timing dimension. Illustratively, still taking T action frames as an example, in the single-individual case the timing position codes (e.g., 1, 2, ..., T) may be respectively merged into the first original representations of the T action frames to obtain the first feature representations characterizing the single individual in the T action frames, and then a timing position code (e.g., T+1) may be merged into the second original representation of the target action category to obtain the second feature representation of the single individual with respect to the target action category. Similarly, when there are multiple individuals, the position code may include both a timing position code and an individual position code; that is, in the multi-individual case, the timing dimension and the individual dimension need to be distinguished at the same time (as shown by the dashed box after the position code in fig. 2). Illustratively, still taking T action frames as an example, in the multi-individual case a timing position code (e.g., T+1) may first be merged into the second original representations of the multiple individuals with respect to the target action category, and then an individual position code (e.g., 1) may be merged into the second original representation of the 1st individual with respect to the target action category, an individual position code (e.g., 2) into that of the 2nd individual, and so on; the timing position code and the individual position codes are combined as the position codes, so that the representations of the multiple individuals with respect to the target action category are merged with different position codes (e.g., (T+1)-1, (T+1)-2, ...).
In addition, the position code is only an example, in the practical application process, one motion generation model may be trained in advance, and the position code may be adjusted together with the network parameters of the motion generation model in the training process of the motion generation model until the motion generation model training converges, and thereafter, in the subsequent application process, the adjusted position code may be used. By the method, different feature representations can be distinguished by adopting different position coding strategies under two application scenes of a single individual and a plurality of individuals, so that the position codes of the feature representations are different, and the accuracy of the feature representation is improved.
In an implementation scenario, as described above, the first feature representations and the second feature representations are all fused with position codes; when the several individuals are a single individual the position codes include timing position codes, and when there are multiple individuals the position codes include individual position codes and timing position codes, as described above in conjunction with fig. 2, which is not repeated here. Further, the position codes may differ from one another for the sake of distinction. Taking T action frames and P individuals (P equal to 1, or P greater than 1) as an example, the above operations finally yield (T+1) × P feature representations in total, namely T × P first feature representations respectively representing the individuals in each action frame and P second feature representations respectively representing the individuals with respect to the target action category.
It should be noted that, for each individual, the first original representations of that individual in the several action frames and its second original representation of the target action category may be taken as the original timing representations of that individual at different timings. Still taking T action frames as an example, for the p-th individual, its first original representations in the T action frames and its second original representation of the target action category can be regarded as the original timing representations of the p-th individual from the 1st timing to the (T+1)-th timing. On this basis, in a single-individual action generation scenario, when the timing t ranges from 1 to T, the timing position code TPE_t of the t-th timing can be added to the t-th original timing representation to obtain the first feature representation at the t-th timing; when t is T+1, adding the timing position code TPE_t of the t-th timing to the t-th original timing representation gives the second feature representation at the t-th timing. Similarly, in a multi-individual action generation scenario, the individual position code PPE_p of the p-th individual may first be spliced with the timing position code of the t-th timing to obtain the position code of the p-th individual at the t-th timing, PE(t, p) = concat(TPE_t, PPE_p), where concat denotes the splicing operation. When the timing t ranges from 1 to T, the position code PE(t, p) can be added to the original timing representation of the p-th individual at the t-th timing to obtain the first feature representation of the p-th individual at the t-th timing; when t is T+1, adding PE(t, p) to the original timing representation of the p-th individual at the t-th timing gives the second feature representation of the p-th individual at the t-th timing. In addition to the above combined coding of timing position codes and individual position codes, completely independent fixed codes may also be adopted without distinguishing timing position codes from individual position codes; that is, (T+1) × P independent position codes may be set in advance for an action generation scenario with T action frames and P individuals.
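The combined position code PE(t, p) = concat(TPE_t, PPE_p) can be sketched as follows; the indices are zero-based here and the shapes are assumptions.

```python
import torch

def position_code(tpe: torch.Tensor, ppe: torch.Tensor, t: int, p: int,
                  multi_individual: bool) -> torch.Tensor:
    # tpe: (T+1, C) timing position codes; ppe: (P, C) individual position codes;
    # t and p are zero-based indices of the timing and the individual.
    if not multi_individual:
        return tpe[t]                             # single individual: timing code only
    return torch.cat([tpe[t], ppe[p]], dim=-1)    # PE(t, p) = concat(TPE_t, PPE_p)

# The feature of the p-th individual at timing t is then
# original_timing_representation + position_code(tpe, ppe, t, p, multi_individual),
# with the feature length chosen to match the (possibly concatenated) code length.
```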
Step S12: and performing relational modeling based on the first feature representation and the second feature representation to obtain the fusion feature representation of each individual in each action frame.
In the embodiment of the disclosure, the type of the relational modeling is related to the first total number of the several individuals. Specifically, in the case that the first total number of the several individuals is single, the relational modeling includes modeling the timing relationship between the action frames, so that the timing continuity between the action frames is improved by modeling the timing relationship, which is beneficial to improving the authenticity of the action sequence. In the case that the first total number of the several individuals is multiple, the relational modeling includes modeling the interaction relationship between the several individuals within each action frame and modeling the timing relationship between the action frames, so that the interaction rationality between the individuals is improved by modeling the interaction relationship and the timing continuity between the action frames is improved by modeling the timing relationship, which is beneficial to improving the authenticity of the action sequence.
In one implementation scenario, in the case that the first total number of the several individuals is single, only the timing relationship needs to be modeled. In this case, the single individual may be directly selected as the target individual, and the first feature representations and the second feature representation corresponding to the target individual may be taken as the time sequence feature representations of the target individual at different time sequences. Illustratively, still taking T action frames as an example, the first feature representation of the target individual in the 1st action frame may be taken as the 1st time sequence feature representation, the first feature representation of the target individual in the 2nd action frame may be taken as the 2nd time sequence feature representation, ..., the first feature representation of the target individual in the T-th action frame may be taken as the T-th time sequence feature representation, and the second feature representation of the target individual with respect to the target action category may be taken as the (T+1)-th time sequence feature representation. On this basis, each time sequence may be selected in turn as the current time sequence, the time sequence feature representation of the current time sequence may be selected as the current time sequence representation, and the fusion feature representation corresponding to the current time sequence representation may be obtained based on the correlation between each reference time sequence representation and the current time sequence representation. That is, when the i-th time sequence feature representation is taken as the current time sequence representation, the time sequence feature representations of the target individual at each time sequence (i.e., 1 to T+1) may be taken as the reference time sequence representations, and the fusion feature representation corresponding to the i-th time sequence feature representation may be obtained based on the correlations between these reference time sequence representations and the i-th time sequence feature representation. In the action generation scenario of a single individual, T+1 fusion feature representations may thus be finally obtained, including the feature representations of the single individual in the T action frames after fusing the timing relationship and the feature representation of the single individual with respect to the target action category after fusing the timing relationship. It should be noted that, to facilitate distinguishing from the subsequent modeling step of the interaction relationship, in the timing modeling the current time sequence may be named the first current time sequence, the time sequence feature representation of the current time sequence may be named the first current time sequence representation, and the reference time sequence representations may be named the first reference time sequence representations.
In a specific implementation scenario, as mentioned above, to improve the action generation efficiency, an action generation model may be trained in advance; the action generation model may include a relational modeling network, and the relational modeling network may further include a timing modeling sub-network. For example, the timing modeling sub-network may be constructed based on a Transformer, and for convenience of description the Transformer included in the timing modeling sub-network may be referred to as the T-Former. The aforementioned T+1 time sequence feature representations may then be respectively subjected to linear transformation to obtain the {query, key, value} feature representation corresponding to each time sequence feature representation. Taking the t-th time sequence feature representation F_t as an example, the corresponding {query, key, value} representations q_t, k_t, v_t can be obtained through linear transformation:

q_t = W_q F_t, k_t = W_k F_t, v_t = W_v F_t ......(1)

In the above formula (1), W_q, W_k and W_v respectively represent linear transformation parameters and may be adjusted during the training of the action generation model. On this basis, in the case that the t-th time sequence feature representation is selected as the current time sequence representation, the correlation w_{t,t'} between the query feature representation corresponding to the t-th time sequence feature representation and the key feature representation of the t'-th time sequence feature representation (t' ranging from 1 to T+1) can be acquired:

w_{t,t'} = q_t · k_{t'} ......(2)

After the correlations w_{t,t'} are obtained, the value feature representations of the t'-th time sequence feature representations (t' ranging from 1 to T+1) can be weighted by the correlations w_{t,t'} to obtain the fusion feature representation H_t after the t-th time sequence feature representation fuses the timing relationship:

H_t = Σ_{t'=1}^{T+1} w_{t,t'} · v_{t'} ......(3)
In a specific implementation scenario, the timing modeling sub-network may be formed by stacking L (L is greater than or equal to 1) layers of Transformers. On this basis, after the fused feature representation output by the l-th layer Transformer is obtained, it can be taken as the input of the (l+1)-th layer Transformer and the above timing modeling process can be executed again to obtain the fused feature representation output by the (l+1)-th layer Transformer. By analogy, the fused feature representation output by the last layer Transformer can finally be taken as the final fused feature representation. Furthermore, after the final fused feature representations are obtained, since the 1st to T-th final fused feature representations have already sufficiently fused the target action category, the (T+1)-th final fused feature representation, which relates to the target action category, may be discarded before the action mapping in the subsequent step S13.
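For illustration only, the following NumPy sketch reproduces the computation of equations (1) to (3) for a single individual; the feature dimension and the random parameters are placeholders rather than trained values, and a standard Transformer would additionally scale and softmax-normalize the correlations, which the equations above do not state.

```python
import numpy as np

T, D = 8, 64                        # T action frames; feature dimension (illustrative)
F = np.random.randn(T + 1, D)       # time sequence feature representations F_1 .. F_{T+1}

Wq, Wk, Wv = (np.random.randn(D, D) for _ in range(3))   # linear transformation parameters of eq. (1)

q, k, v = F @ Wq.T, F @ Wk.T, F @ Wv.T    # eq. (1): q_t, k_t, v_t for every time sequence t
w = q @ k.T                               # eq. (2): w[t, t'] = q_t · k_{t'}
# A standard Transformer would scale and softmax-normalize w here.
H = w @ v                                 # eq. (3): H_t = sum over t' of w[t, t'] * v_{t'}
```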
In an implementation scenario, when the several individuals are multiple individuals, both the timing relationship and the interaction relationship need to be modeled, and the two can be modeled sequentially. Illustratively, the interaction relationship can be modeled first and then the timing relationship; alternatively, the timing relationship can be modeled first and then the interaction relationship. Further, the output feature representation of the relationship modeled earlier is the input feature representation for modeling the later relationship. That is, in the case that the relational modeling includes modeling the interaction relationship and the timing relationship, the preceding relationship may be modeled based on the first feature representation and the second feature representation to obtain the output feature representation of the preceding relationship, and then the subsequent relationship may be modeled based on the output feature representation to obtain the fused feature representation. It should be noted that the preceding relationship is the interaction relationship and the subsequent relationship is the timing relationship, or the preceding relationship is the timing relationship and the subsequent relationship is the interaction relationship.
In a specific implementation scenario, as described above, to improve the action generation efficiency, an action generation model may be trained in advance, and the action generation model may include a relational modeling network, which may include a timing modeling sub-network and an interaction modeling sub-network. Illustratively, both the timing modeling sub-network and the interaction modeling sub-network can be constructed based on Transformers, which, for convenience of description, may be referred to as the T-Former for the timing modeling sub-network and the I-Former for the interaction modeling sub-network. Similarly to the aforementioned single-individual action generation scenario, in the action generation scenario of multiple individuals one of the individuals may also be selected as the target individual; illustratively, the p-th of the P individuals may be selected as the target individual. On this basis, the first feature representations and the second feature representation corresponding to the target individual can be regarded as the time sequence feature representations of the target individual at different time sequences. For ease of distinction, the first feature representations of the target individual in the T action frames and the second feature representation with respect to the target action category are regarded as the time sequence feature representations at the 1st to (T+1)-th time sequences, and these T+1 time sequence feature representations may then be respectively subjected to linear transformation to obtain the {query, key, value} feature representation corresponding to each time sequence feature representation. Taking the p-th individual as the target individual, its t-th time sequence feature representation F_t^p can be linearly transformed to obtain the corresponding {query, key, value} feature representations q_t^p, k_t^p, v_t^p.
When the interaction relationship is modeled, similarly to the modeling of the timing relationship, after the time sequence feature representations of the target individual at different time sequences are obtained, each time sequence can be selected in turn as the current time sequence, the time sequence feature representation of the current time sequence can be selected as the current time sequence representation, and the fusion feature representation corresponding to the current time sequence representation can be obtained based on the correlation between each reference time sequence representation and the current time sequence representation. It should be noted that, in order to distinguish from the modeling step of the timing relationship, the current time sequence may be named the second current time sequence, the time sequence feature representation of the second current time sequence may be named the second current time sequence representation, and the reference time sequence representations may be named the second reference time sequence representations. Specifically, taking the t-th time sequence as the second current time sequence, the second reference time sequence representations are the time sequence feature representations of the respective individuals at the t-th time sequence, whose key feature representations are k_t^{p'}, where p' ranges from 1 to P. In this case, the correlation between the p-th individual and the p'-th individual at the t-th time sequence may be expressed as:

w_t^{p,p'} = q_t^p · k_t^{p'}

Further, the value feature representations v_t^{p'} of the p'-th individuals (p' ranging from 1 to P) at the t-th time sequence can be weighted by the correlations w_t^{p,p'} to obtain the fusion feature representation H_t^p after the time sequence feature representation of the p-th individual at the t-th time sequence fuses the interaction relationship:

H_t^p = Σ_{p'=1}^{P} w_t^{p,p'} · v_t^{p'}
In a specific implementation scenario, after the fusion feature representations H_t^p of the individuals at the respective time sequences with the interaction relationship fused are obtained, these fusion feature representations may be used as the input feature representations for modeling the timing relationship, so as to continue modeling the timing relationship as described above. The process of modeling the timing relationship can refer to the foregoing description and is not repeated here.
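The interaction modeling step can be pictured analogously: at every time sequence, attention is computed across the P individuals instead of across the time sequences. The following NumPy sketch is illustrative only; the shared projection matrices and all dimensions are assumptions.

```python
import numpy as np

T, P, D = 8, 2, 64
F = np.random.randn(T + 1, P, D)    # time sequence feature representations of the P individuals

Wq, Wk, Wv = (np.random.randn(D, D) for _ in range(3))   # I-Former parameters (assumed shared)

H = np.empty_like(F)
for t in range(T + 1):                                    # at each time sequence, attend across individuals
    q, k, v = F[t] @ Wq.T, F[t] @ Wk.T, F[t] @ Wv.T      # (P, D) each
    w = q @ k.T                     # w[p, p'] = correlation between individual p and p' at time sequence t
    H[t] = w @ v                    # interaction-fused representation of every individual
# H is then used as the input feature representation for the T-Former (timing modeling).
```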
In one specific implementation scenario, referring to FIG. 2, the I-Former for modeling the interaction relationship and the T-Former for modeling the timing relationship may be combined into one group of Transformers to jointly model the interaction relationship and the timing relationship, and the relational modeling network may include L groups of Transformers. On this basis, for the p-th individual in the action frame at the t-th time sequence, after the fused feature representation output by the l-th group of Transformers is obtained, it can be taken as the input of the (l+1)-th group of Transformers and the above modeling process can be executed again to obtain the fused feature representation output by the (l+1)-th group of Transformers. By analogy, the fused feature representation output by the last group of Transformers can finally be taken as the final fused feature representation. Furthermore, after the final fused feature representations are obtained, since the 1st to T-th final fused feature representations have already sufficiently fused the target action category, the (T+1)-th final fused feature representation, which relates to the target action category, may be discarded before the subsequent step S13 is performed. In addition, please refer to Table 1, which is a structural schematic table of an embodiment of the action generation model. As shown in Table 1, the action generation model may illustratively contain 2 groups of Transformers; of course, 3 groups, 4 groups, 5 groups, or the like may also be provided, which is not limited herein. It should be noted that the specific meanings of the input mapping layer and the category embedding layer may refer to the specific obtaining processes of the first feature representation and the second feature representation, which are not repeated here. In addition, the action generation model shown in Table 1 is only one possible implementation in practical application, and the specific structure of the action generation model is not limited thereby. For example, the number of input/output channels of each network layer shown in Table 1 may also be adaptively adjusted according to actual application needs.
Table 1 schematic configuration table of an embodiment of an action generation model
(Table 1, provided as an image in the original publication, lists the input mapping layer, the category embedding layer, the groups of Transformers and the action mapping network of the action generation model, together with the input/output channel numbers of each network layer.)
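To complement Table 1, the following illustrative sketch (not taken from the original disclosure) shows how L groups of Transformers, each consisting of an I-Former followed by a T-Former, could be stacked, and how the (T+1)-th representations are discarded before action mapping; all shapes and parameters are placeholders, and L = 2 follows the illustrative configuration mentioned above.

```python
import numpy as np

T, P, D, L = 8, 2, 64, 2                  # illustrative sizes; 2 groups of Transformers as in Table 1

def i_former(F, Wq, Wk, Wv):
    # Attention across the P individuals, independently at every time sequence.
    q, k, v = F @ Wq.T, F @ Wk.T, F @ Wv.T
    w = np.einsum('tpd,tqd->tpq', q, k)
    return np.einsum('tpq,tqd->tpd', w, v)

def t_former(F, Wq, Wk, Wv):
    # Attention across the T+1 time sequences, independently for every individual.
    q, k, v = F @ Wq.T, F @ Wk.T, F @ Wv.T
    w = np.einsum('tpd,spd->pts', q, k)
    return np.einsum('pts,spd->tpd', w, v)

F = np.random.randn(T + 1, P, D)          # position-coded first/second feature representations
params = [tuple(np.random.randn(D, D) for _ in range(6)) for _ in range(L)]

H = F
for Wq_i, Wk_i, Wv_i, Wq_t, Wk_t, Wv_t in params:
    H = i_former(H, Wq_i, Wk_i, Wv_i)     # model the interaction relationship first ...
    H = t_former(H, Wq_t, Wk_t, Wv_t)     # ... then the timing relationship
fused = H[:T]                             # drop the (T+1)-th representations before action mapping
```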
It should be noted that, as can be seen from the foregoing embodiments, whether the timing relationship or the interaction relationship is modeled, the two modeling processes are similar: an individual is selected as the target individual, the first feature representations and the second feature representation corresponding to the target individual are taken as the time sequence feature representations of the target individual at different time sequences, each time sequence is then selected in turn as the current time sequence with its time sequence feature representation as the current time sequence representation, and the fusion feature representation corresponding to the current time sequence representation is obtained based on the correlation between each reference time sequence representation and the current time sequence representation. The difference between the two is that, when modeling the timing relationship, the reference time sequence representations include the time sequence feature representations of the target individual at the respective time sequences, whereas when modeling the interaction relationship, the reference time sequence representations include the time sequence feature representations of the respective individuals at the current time sequence. Therefore, the timing relationship and the interaction relationship can be modeled through similar modeling processes, which can further improve the compatibility between the application scenarios of a single individual and multiple individuals.
Step S13: and performing action mapping based on the fusion feature representation to obtain action sequences of a plurality of individuals about the target action category.
In an embodiment of the disclosure, the action sequence includes the several action frames, and the action frames contain the action representations of the individuals. Illustratively, the action sequence may include T action frames; if the several individuals are P individuals, each action frame includes the action representations of the P individuals, so that a time-sequentially continuous three-dimensional action can be generated.
In one implementation scenario, as described above, in order to improve the action generation efficiency, the action generation model may be trained in advance, and the action generation model may include an action mapping network. As shown in Table 1, the action mapping network may specifically include a linear layer such as a fully-connected layer, and the specific structure of the action mapping network is not limited herein. On this basis, the fusion feature representation of each individual in each action frame can be input into the action mapping network to obtain the action sequence of the several individuals with respect to the target action category. Taking T action frames and P individuals as an example, T × P fusion feature representations can be obtained, and these T × P fusion feature representations can then be input into the action mapping network to obtain T action frames, each of which includes the action representations of the P individuals, so that the T action frames can be combined in time sequence order to obtain the action sequence. For ease of description, the action sequence may be denoted as {M_t | t ∈ [1, ..., T]}, where M_t denotes the t-th action frame and each action frame M_t contains the action representations of the P individuals.
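As a minimal illustration of the action mapping step (assuming a single fully-connected layer, which the text names only as one possible structure), the following sketch maps the T × P fusion feature representations to a (P, T, C) action sequence tensor; the value C = 54 (18 points × 3 coordinates) is an assumption.

```python
import numpy as np

T, P, D, C = 8, 2, 64, 54        # frames, individuals, fused-feature dim, action-representation dim

fused = np.random.randn(T, P, D)           # T x P final fused feature representations
W, b = np.random.randn(C, D), np.zeros(C)  # fully-connected action mapping layer (learned in practice)

frames = fused @ W.T + b                   # (T, P, C): each frame holds P individual action representations
action_sequence = frames.transpose(1, 0, 2)  # (P, T, C) tensor, frames M_1..M_T combined in time order
```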
In one implementation scenario, the action representation of an individual in an action frame may include: in the action frame, first position information of a key point (e.g., the crotch) of the individual and pose information of the individual, where the pose information may specifically include second position information of several joint points (e.g., the left shoulder, right shoulder, left elbow, right elbow, left knee, right knee, left foot, right foot, etc.) of the individual. Illustratively, taking the p-th individual in the t-th action frame as an example, the first position information may be recorded as the absolute position of the key point in a local coordinate system, and the pose information may be recorded as the position coordinates of each joint point in the local coordinate system. Illustratively, each action frame in the action sequence may be represented as a tensor of size (P, C), i.e., the action representation of each individual in the action frame may be represented as a C-dimensional vector; based on this, the action sequence can be expressed as a tensor of size (P, T, C). Of course, the above pose information can also be expressed as a pose representation in SMPL (i.e., the Skinned Multi-Person Linear model), a widely used parameterized human body model; its specific meaning can refer to the technical details of SMPL and is not repeated here.
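The following sketch illustrates one possible layout of such a C-dimensional action representation, splitting it into the key point position and the joint positions; treating the first of the K points as the key point and using 3-D coordinates are assumptions made for illustration only.

```python
import numpy as np

K = 18                      # key point plus joint points per individual (as in the text)
D_COORD = 3                 # 3-D coordinates per point (assumption)
C = K * D_COORD             # dimension of one individual's action representation

def unpack_action(vec):
    """Split a C-dimensional action representation into the key point position
    and the pose information (joint positions), all in a local coordinate system."""
    points = vec.reshape(K, D_COORD)
    first_position = points[0]      # assumed ordering: the key point (e.g. crotch) comes first
    pose = points[1:]               # remaining joint points
    return first_position, pose

first_pos, pose = unpack_action(np.random.randn(C))
```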
In one implementation scenario, referring to fig. 3a to 3f in combination, fig. 3a to 3f are schematic diagrams of an embodiment of an action sequence. As shown in fig. 3a to 3f, fig. 3a is a motion sequence generated when the target motion category is "congratulatory wine", fig. 3b is a motion sequence generated when the target motion category is "photograph", fig. 3c is a motion sequence generated when the target motion category is "supporting", fig. 3d is a motion sequence generated when the target motion category is "assault", fig. 3e is a motion sequence generated when the target motion category is "stretch", and fig. 3f is a motion sequence generated when the target motion category is "dancing".
In one implementation scenario, as shown in fig. 3a to 3f, the action sequence generated by the action generation model only includes the action representation of each individual in each action frame, and does not include the external appearance of each individual or the action scene. Therefore, after the action sequence is obtained, the external appearance of each individual (e.g., hair style, clothing, hair color, etc.) can be freely designed as needed, and the action scene (e.g., street, mall, park, etc.) can also be freely designed as needed. Illustratively, after determining that the target action category is "photograph" and the first total number of the several individuals is 2, the action sequence shown in fig. 3b can be generated through the foregoing process; on this basis, the appearance of the left individual in fig. 3b (e.g., short hair, shirt, shorts, black hair, etc.) and the appearance of the right individual (e.g., long hair, one-piece dress, black hair, etc.) can be designed, and the action scene can be designed as "park", so that the animation can be further enriched, which on the one hand improves the design flexibility and on the other hand greatly reduces the creation workload.
According to the above scheme, first feature representations respectively characterizing several individuals in several action frames are acquired, and second feature representations respectively characterizing the several individuals with respect to the target action category are acquired. On this basis, relational modeling is performed based on the first feature representations and the second feature representations to obtain the fusion feature representation of each individual in each action frame, where the type of the relational modeling is related to the first total number of the several individuals, and action mapping is performed based on the fusion feature representations to obtain the action sequence of the several individuals with respect to the target action category, the action sequence including the several action frames and the action frames containing the action representations of the individuals. In this way, on the one hand, actions can be generated automatically without relying on manual work; on the other hand, the relational modeling is performed in a targeted manner according to the first total number of the several individuals, so that the application scenarios of a single individual and multiple individuals can both be supported. Therefore, the method can be compatible with the two application scenarios of a single individual and multiple individuals on the premise of improving the action generation efficiency.
Referring to fig. 4, fig. 4 is a flowchart illustrating an embodiment of a training method for the action generation model. As mentioned above, the action sequence is obtained by the action generation model, and in order to enhance the training effect, the action generation model may be obtained through generating confrontation training (i.e., generative adversarial training) with the identification model. The specific training process may include the following steps:
step S41: sample action sequences of a number of sample individuals with respect to a sample action category are obtained.
In the embodiment of the present disclosure, the sample action sequence includes a preset number of sample action frames, and the sample action sequence is marked with a sample mark, where the sample mark indicates whether the sample action sequence is actually generated by the action generation model. Specifically, the sample action sequence may be generated by the action generation model, or may be acquired in a real scene.
In one implementation scenario, please refer to fig. 5, which is a schematic diagram illustrating an embodiment of obtaining sample action frames. As shown in fig. 5, sample captured images of several sample individuals with respect to the sample action category may be acquired, i.e., real individuals demonstrating the sample action category may be photographed. On this basis, the sample action representations of the sample individuals in the sample captured images can be extracted; for example, the sample action representation of each sample individual may include the position information of the key point and the several joint points of the sample individual. Each sample captured image can then be represented as one sample action frame, and the sample action representation of each sample individual in each sample action frame, similar to the action representation in the foregoing disclosed embodiments, can be represented as a C-dimensional vector.
Step S42: and respectively decomposing each sample action frame in the sample action sequence to obtain sample image data.
In the embodiment of the disclosure, the sample graph data includes the preset number of node graphs, each node graph is formed by connecting nodes, the nodes include the key point and the joint points of the sample individuals, each node graph includes the node feature representations of the nodes, and the node feature representation of a node is obtained by splicing the position feature representations of the plurality of sample individuals at the corresponding node. Still taking the case where the sample action representation of each sample individual includes the position information of the key point and several joint points of the sample individual as an example, the C-dimensional vector of the sample action representation may be decomposed into K D-dimensional vectors (as mentioned above, each vector represents position information such as position coordinates), with C = K × D, where K is the total number of the key point and the several joint points of the sample individual; for example, the total number of the key point and the several joint points of each sample individual is 18.
In one implementation scenario, please refer to fig. 6, which is a schematic diagram of an embodiment of the sample graph data. As shown in fig. 6, for a scene with a single sample individual, each node graph only needs to represent the single sample individual, so each node graph is formed by connecting K nodes, and each node on the node graph is represented by the D-dimensional vector of that node; each node graph can therefore be represented as a tensor of size (K, D), and based on this, the sample graph data can be represented as a tensor of size (T, K, D).
In another implementation scenario, unlike the scene of a single sample individual, in the scene of multiple sample individuals each node graph needs to represent the multiple sample individuals. Each node graph is still formed by connecting K nodes, but each node on the node graph is obtained by splicing the D-dimensional vectors of the multiple sample individuals at that node; each node graph can therefore be represented as a tensor of size (K, P·D), and based on this, the sample graph data can be represented as a tensor of size (T, K, P·D). In addition, for the multiple sample individuals in a sample action sequence, different orderings of the individuals may lead to different prediction results of the subsequent identification model, which brings uncertainty to model training. To make up for this defect, in the case that the sample action sequence is acquired from a real scene, the node feature representations of the nodes are obtained by splicing the position feature representations of the multiple sample individuals at the corresponding nodes in a random order of the multiple sample individuals, so that, during training, cases that differ only in ordering but actually belong to the same sample action sequence are treated as different samples and modeled accordingly; this realizes data augmentation and is beneficial to improving the robustness of the model.
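For illustration, the following sketch decomposes a sample action sequence of shape (T, P, C) into sample graph data of shape (T, K, P·D), optionally splicing the individuals in a random order for real captured sequences as described above; the function and variable names are hypothetical.

```python
import numpy as np

T, P, K, D = 8, 2, 18, 3          # frames, sample individuals, nodes, per-node dimension
C = K * D                          # dimension of each individual's sample action representation

sample_actions = np.random.randn(T, P, C)    # sample action sequence (model output or real capture)

def to_graph_data(actions, shuffle_individuals=False):
    """Decompose each sample action frame into a node graph of K nodes whose
    node features splice the P individuals' D-dimensional vectors."""
    T_, P_, _ = actions.shape
    nodes = actions.reshape(T_, P_, K, D)          # (T, P, K, D)
    if shuffle_individuals:
        # For real captured sequences, splice the individuals in a random order so
        # that different orderings act as different training samples (data augmentation).
        nodes = nodes[:, np.random.permutation(P_)]
    nodes = nodes.transpose(0, 2, 1, 3)            # (T, K, P, D)
    return nodes.reshape(T_, K, P_ * D)            # sample graph data of size (T, K, P*D)

graph_data = to_graph_data(sample_actions, shuffle_individuals=True)
```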
Step S43: and identifying the sample image data and the sample action type based on the identification model to obtain a prediction result.
In one implementation scenario, the identification model may be constructed based on a space-time graph convolutional network; for example, please refer to Table 2, which is a structural schematic table of an embodiment of the identification model. It should be noted that Table 2 is only one possible implementation of the identification model in practical application, and the specific structure of the identification model is not limited thereby. In addition, for the specific meaning of the space-time graph convolution in Table 2, reference may be made to the related technical details of space-time graph convolution, which are not repeated here.
Table 2 Structural schematic table of an embodiment of the identification model
(Table 2, provided as an image in the original publication, lists the space-time graph convolutional layers, the class embedding layer and the output mapping layer of the identification model.)
In the embodiment of the present disclosure, the prediction result includes a first prediction mark and a second prediction mark of the sample action sequence; the first prediction mark represents the predicted possibility that the sample action sequence is generated by the action generation model, and the second prediction mark represents the possibility that the sample action sequence belongs to the sample action category. It should be noted that the first prediction mark and the second prediction mark may be represented by numerical values, and the larger the numerical value, the higher the corresponding possibility. Taking the network structure shown in Table 2 as an example of the identification model, the sample graph data can be recorded as x; after being processed by the space-time graph convolutional layers of each layer, a 512-dimensional vector φ(x) can be obtained, and after the sample action category is expressed by class embedding, a 512-dimensional vector y can be obtained, the inner product of the two vectors giving φ(x)·y. Further, the vector φ(x) may be input into the output mapping layer, and the result may be combined with the inner product φ(x)·y to obtain the scores given by the identification model for the input sample action category and sample action sequence, namely the first prediction mark and the second prediction mark.
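The scoring described above resembles a projection-style conditional discriminator. The following sketch is illustrative only: the sub-networks are replaced by stand-ins, the class count is an assumption, and returning the output-mapping term and the inner product φ(x)·y as two separate scores is one possible reading of how the two prediction marks are formed.

```python
import numpy as np

FEAT_DIM, NUM_CLASSES = 512, 26        # 512-d features as in the text; class count is an assumption

def st_graph_conv_features(x):
    """Stand-in for the stacked space-time graph convolutional layers: maps the
    sample graph data x to the 512-dimensional vector phi(x)."""
    return np.random.randn(FEAT_DIM)

class_embedding = np.random.randn(NUM_CLASSES, FEAT_DIM)   # class embedding layer
output_mapping = np.random.randn(FEAT_DIM)                  # output mapping layer (to a scalar)

def identification_scores(sample_graph_data, sample_action_category):
    phi_x = st_graph_conv_features(sample_graph_data)
    y = class_embedding[sample_action_category]      # 512-d embedding of the sample action category
    authenticity_score = output_mapping @ phi_x       # term from the output mapping layer
    class_score = phi_x @ y                           # inner product phi(x)·y
    return authenticity_score, class_score            # loosely: first / second prediction marks

scores = identification_scores(np.zeros((8, 18, 6)), sample_action_category=3)
```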
Step S44: network parameters of either the action generation model or the authentication model are adjusted based on the sample flag, the first prediction flag, and the second prediction flag.
Specifically, the discrimination loss of the identification model may be measured based on the first prediction mark and the sample mark, and the generation loss of the action generation model may be measured based on the second prediction mark and the sample mark. During training, the action generation model may be trained N times (at which point the network parameters of the action generation model are adjusted) every M times the identification model is trained (at which point the network parameters of the identification model are adjusted); for example, the action generation model may be trained 1 time every 4 times the identification model is trained, which is not limited herein. On this basis, training the identification model improves its ability to identify sample action sequences (i.e., the ability to distinguish sample action sequences generated by the model from actually acquired ones), which in turn pushes the action generation model to improve the authenticity of the generated action sequences; training the action generation model improves the authenticity of the action sequences it generates (i.e., makes the generated action sequences as close as possible to actually acquired ones), which in turn pushes the identification model to improve its identification ability. The identification model and the action generation model thus promote and complement each other; after several rounds of training, the model performance of the action generation model becomes better and better until the identification model can no longer distinguish the action sequences generated by the action generation model from the actually acquired action sequences, at which point the training can be ended. It should be noted that the specific process of the generating confrontation training can refer to the technical details of generative adversarial training and is not repeated here. Furthermore, as described in the foregoing disclosed embodiments, position coding may be performed in the action generation process, and the position codes may be adjusted together with the network parameters of the action generation model during the training of the action generation model.
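The alternating schedule can be summarized by the following sketch, where train_identification_step and train_generation_step are hypothetical helpers standing in for one parameter update of the identification model and of the action generation model (including its position codes), respectively.

```python
def train_identification_step():
    pass    # adjust the identification model's network parameters using the prediction marks

def train_generation_step():
    pass    # adjust the action generation model's parameters (and the position codes)

M, N = 4, 1          # e.g. 4 identification-model updates per 1 generation-model update
num_rounds = 1000    # in practice: until the identification model can no longer tell the sequences apart

for _ in range(num_rounds):
    for _ in range(M):
        train_identification_step()
    for _ in range(N):
        train_generation_step()
```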
According to the scheme, the action generating model and the identification model are cooperatively trained by generating the confrontation training, so that the action generating model and the identification model can be mutually promoted and complement each other in the cooperative training process, and finally, the model performance of the action generating model is favorably improved; in addition, by decomposing the sample action representation into sample graph data, the identification of the action sequence can be skillfully decomposed into the identification of the graph data, which is beneficial to greatly reducing the training complexity and the construction difficulty of an identification model.
Referring to fig. 7, fig. 7 is a schematic diagram of a framework of an embodiment of the motion generating device 70 of the present application. The motion generation device 70 includes: the system comprises a feature acquisition module 71, a relation modeling module 72 and an action mapping module 73, wherein the feature acquisition module 71 is used for acquiring first feature representations respectively representing a plurality of individuals in a plurality of action frames and acquiring second feature representations respectively representing the plurality of individuals about target action categories; a relational modeling module 72, configured to perform relational modeling based on the first feature representation and the second feature representation to obtain a fused feature representation of each individual in each action frame; wherein the type of the relational modeling is related to a first total number of the plurality of individuals; the action mapping module 73 is used for carrying out action mapping based on the fusion feature representation to obtain action sequences of a plurality of individuals about the target action category; the action sequence comprises a plurality of action frames, and the action frames comprise action representations of all individuals.
According to the scheme, on one hand, actions can be automatically generated without depending on manual work, on the other hand, the relational modeling is performed in a targeted mode according to the first total number of the individuals, and two application scenes, namely a single individual and a plurality of individuals, can be compatible. Therefore, the method and the device can be compatible with two application scenes of a single individual and a plurality of individuals on the premise of improving the action generation efficiency.
In some disclosed embodiments, where the first total number of individuals is single, the relational modeling includes modeling a temporal relationship between the action frames; and/or, in the case that the first total number of the individuals is multiple, the relational modeling includes modeling an interactive relationship between the individuals in each action frame and modeling a time-series relationship between the action frames.
In some disclosed embodiments, the relational modeling module 72 includes a time sequence modeling sub-module, which includes a first selection unit configured to select an individual as a target individual, and to use a first feature representation and a second feature representation corresponding to the target individual as time sequence feature representations of the target individual at different time sequences, and to use the different time sequences as first current time sequences, respectively, and to use the time sequence feature representation of the first current time sequence as a first current time sequence representation; the time sequence modeling submodule comprises a first representation fusion unit, and is used for obtaining fusion characteristic representations corresponding to first current time sequence representations based on the correlation between each first reference time sequence representation and the first current time sequence representation; the first reference time sequence representation comprises time sequence characteristic representations of the target individuals at all time sequences.
In some disclosed embodiments, the relational modeling module 72 includes an interaction modeling sub-module, which includes a second selection unit, configured to select an individual as a target individual, use the first feature representation and the second feature representation corresponding to the target individual as time sequence feature representations of the target individual at different time sequences, use the different time sequences as second current time sequences, respectively, and use the time sequence feature representation of the second current time sequence as a second current time sequence representation; the interaction modeling sub-module includes a second representation fusion unit, configured to obtain the fusion feature representation corresponding to the second current time sequence representation based on the correlation between each second reference time sequence representation and the second current time sequence representation; and the second reference time sequence representation includes the time sequence feature representations of the individuals at the second current time sequence, respectively.
In some disclosed embodiments, where the relational modeling includes modeling an interaction relationship and a timing relationship, the relational modeling module 72 includes a prior modeling sub-module for modeling the prior relationship based on the first feature representation and the second feature representation to obtain an output feature representation of the prior relationship, and the relational modeling module 72 includes a subsequent modeling sub-module for modeling the subsequent relationship based on the output feature representation to obtain a fused feature representation; the former relation is an interactive relation, the latter relation is a time sequence relation, or the former relation is a time sequence relation and the latter relation is an interactive relation.
In some disclosed embodiments, the sequence of actions is derived from an action generating model, the action generating model comprising a relational modeling network, and the relational modeling network comprising a timing modeling sub-network for modeling the timing relationships and an interaction modeling sub-network for modeling the interaction relationships.
In some disclosed embodiments, the first feature representation is based on sampling by a Gaussian process.
In some disclosed embodiments, the feature obtaining module 71 includes a first obtaining sub-module, and the first obtaining sub-module includes a process sampling unit, configured to sample a second total number of times respectively in a plurality of Gaussian processes to obtain first original representations respectively representing the second total number of action frames; the length of the first original representation is the same as the number of Gaussian processes, and the characteristic length scales of the Gaussian processes are different from one another; the first obtaining sub-module further includes a first obtaining unit, configured to obtain a third total number of first feature representations based on the first total number and the first original representations; wherein the third total number is the product of the first total number and the second total number.
In some disclosed embodiments, the second feature representation is mapped based on the target action category.
In some disclosed embodiments, the feature obtaining module 71 includes a second obtaining sub-module, and the second obtaining sub-module includes an embedded representation unit, configured to perform embedded representation on the target action category, so as to obtain a second original representation; the second obtaining submodule comprises a second obtaining unit, and is used for obtaining a first total number of second feature representations based on the first total number and the second original representation.
In some disclosed embodiments, the first feature representation and the second feature representation are each fused with a position code; wherein, in the case that the several individuals are single individuals, the position code comprises a time sequence position code, and in the case that the several individuals are multiple individuals, the position code comprises an individual position code and a time sequence position code.
In some disclosed embodiments, the sequence of actions is derived from the action generating model, and the position code is adjusted during training of the action generating model along with network parameters of the action generating model until the action generating model training converges.
In some disclosed embodiments, the representation of the individual actions in the action frame includes: in the action frame, first position information of key points of the individual and posture information of the individual, and the posture information includes second position information of a plurality of joint points of the individual.
In some disclosed embodiments, the sequence of actions is derived from an action generation model, and the action generation model and the discriminant model are derived by generating a confrontation training.
In some disclosed embodiments, the action generating device 70 includes a sample sequence acquiring module, configured to acquire a sample action sequence of several sample individuals with respect to a sample action category; the sample action sequence comprises a preset number of sample action frames and is marked with a sample mark, wherein the sample mark indicates whether the sample action sequence is actually generated by the action generation model. The action generating device 70 includes a sample sequence decomposition module, configured to decompose each sample action frame in the sample action sequence, respectively, to obtain sample graph data; the sample graph data comprises the preset number of node graphs, each node graph is formed by connecting nodes, the nodes comprise the key points and joint points of the sample individuals, each node graph comprises the node feature representations of the nodes, and the node feature representations of the nodes are obtained by splicing the position feature representations of the plurality of sample individuals at the corresponding nodes respectively. The action generating device 70 includes a sample sequence identification module, configured to identify the sample graph data and the sample action category based on an identification model, so as to obtain a prediction result; the prediction result comprises a first prediction mark and a second prediction mark of the sample action sequence, the first prediction mark represents the predicted possibility that the sample action sequence is generated by the action generation model, and the second prediction mark represents the possibility that the sample action sequence belongs to the sample action category. The action generating device 70 includes a network parameter adjusting module, configured to adjust a network parameter of any one of the action generation model and the identification model based on the sample mark, the first prediction mark, and the second prediction mark.
In some disclosed embodiments, under the condition that the sample action sequence is acquired from a real scene, the position feature representations of the nodes are obtained by splicing the position feature representations of the plurality of sample individuals at the corresponding nodes respectively according to a random sequence of the plurality of sample individuals.
Referring to fig. 8, fig. 8 is a schematic block diagram of an embodiment of an electronic device 80 according to the present application. The electronic device 80 comprises a memory 81 and a processor 82 coupled to each other, the processor 82 being configured to execute program instructions stored in the memory 81 to implement the steps of any of the above-described embodiment of the action generating method. In one particular implementation scenario, the electronic device 80 may include, but is not limited to: a microcomputer, a server, and the electronic device 80 may also include a mobile device such as a notebook computer, a tablet computer, and the like, which is not limited herein.
In particular, the processor 82 is configured to control itself and the memory 81 to implement the steps of any of the above-described embodiments of the action generating method. The processor 82 may also be referred to as a CPU (Central Processing Unit). The processor 82 may be an integrated circuit chip having signal processing capabilities. The processor 82 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 82 may also be jointly implemented by a plurality of integrated circuit chips.
According to the scheme, on one hand, actions can be automatically generated without depending on manual work, on the other hand, the relational modeling is performed in a targeted mode according to the first total number of the individuals, and two application scenes, namely a single individual and a plurality of individuals, can be compatible. Therefore, the method and the device can be compatible with two application scenes of a single individual and a plurality of individuals on the premise of improving the action generation efficiency.
Referring to fig. 9, fig. 9 is a block diagram illustrating an embodiment of a computer-readable storage medium 90 according to the present application. The computer readable storage medium 90 stores program instructions 901 executable by the processor, the program instructions 901 for implementing the steps of any of the above-described embodiment of the action generation method.
According to the scheme, on one hand, actions can be automatically generated without depending on manual work, on the other hand, the relational modeling is performed in a targeted mode according to the first total number of the individuals, and two application scenes, namely a single individual and a plurality of individuals, can be compatible. Therefore, the method and the device can be compatible with two application scenes of a single individual and a plurality of individuals on the premise of improving the action generation efficiency.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a unit or a component may be combined or integrated with another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (19)

1. An action generating method, comprising:
acquiring first feature representations respectively representing a plurality of individuals in a plurality of action frames, and acquiring second feature representations respectively representing the plurality of individuals about target action categories;
performing relational modeling based on the first feature representation and the second feature representation to obtain a fused feature representation of each individual in each action frame; wherein the type of the relational modeling is related to a first total number of the plurality of individuals;
performing action mapping based on the fusion feature representation to obtain an action sequence of the plurality of individuals about the target action category; wherein the action sequence includes the plurality of action frames, and the action frames contain action representations of the individuals.
2. The method of claim 1, wherein the type of the relational modeling is related to a first total number of the plurality of individuals, comprising:
in the case that the first total number of the plurality of individuals is single, the relational modeling includes modeling a timing relationship between the action frames;
and/or, in the case that the first total number of the plurality of individuals is multiple, the relational modeling includes modeling an interaction relationship between the plurality of individuals in each of the action frames and modeling a timing relationship between the action frames.
3. The method according to claim 2, wherein in a case where the relational modeling includes modeling the time-series relationship, the performing relational modeling based on the first feature representation and the second feature representation to obtain a fused feature representation of each individual in each of the action frames includes:
selecting the individual as a target individual, and taking a first characteristic representation and a second characteristic representation corresponding to the target individual as time sequence characteristic representations of the target individual at different time sequences;
respectively selecting each time sequence as a first current time sequence, and selecting time sequence characteristic representation of the first current time sequence as a first current time sequence representation;
obtaining fusion characteristic representations corresponding to the first current time sequence representations based on the correlation degrees of the first reference time sequence representations and the first current time sequence representations respectively;
wherein the first reference time series representation comprises a time series signature representation of the target individual at each of the time series.
4. The method of claim 2, wherein in the case that the relational modeling includes modeling the interaction relationship, the performing the relational modeling based on the first feature representation and the second feature representation to obtain a fused feature representation of each individual in each of the action frames comprises:
selecting the individual as a target individual, and taking a first characteristic representation and a second characteristic representation corresponding to the target individual as time sequence characteristic representations of the target individual at different time sequences;
respectively selecting each time sequence as a second current time sequence, and selecting time sequence characteristic representation of the second current time sequence as second current time sequence representation;
obtaining fusion characteristic representations corresponding to the second current time sequence representations based on the correlation degrees of the second reference time sequence representations and the second current time sequence representations respectively;
wherein the second reference timing representation comprises a timing characteristic representation of each of the individuals at the second current timing, respectively.
5. The method according to claim 2, wherein in a case where the relational modeling includes modeling the interaction relationship and the time-series relationship, the performing relational modeling based on the first feature representation and the second feature representation to obtain a fused feature representation of each individual in each of the action frames comprises:
modeling a prior relationship based on the first feature representation and the second feature representation to obtain an output feature representation of the prior relationship;
modeling a posterior relation based on the output feature representation to obtain the fusion feature representation;
the prior relationship is the interactive relationship, and the subsequent relationship is the time sequence relationship, or the prior relationship is the time sequence relationship and the subsequent relationship is the interactive relationship.
6. The method of claim 2, wherein the sequence of actions is derived from an action generating model, wherein the action generating model comprises a relational modeling network, and wherein the relational modeling network comprises a timing modeling sub-network for modeling the timing relationships and an interaction modeling sub-network for modeling the interaction relationships.
7. The method of claim 1, wherein the first feature representation is obtained by sampling from a Gaussian process.
8. The method of claim 7, wherein the obtaining first feature representations respectively characterizing the plurality of individuals in the plurality of action frames comprises:
sampling a second total number of times from each of a plurality of Gaussian processes to obtain first original representations respectively characterizing the second total number of action frames; wherein the length of each first original representation is equal to the number of the Gaussian processes, and the Gaussian processes have different characteristic length scales;
obtaining a third total number of the first feature representations based on the first total number and the first original representations; wherein the third total number is the product of the first total number and the second total number.
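One plausible reading of claim 8, offered only as a sketch: draw the second total number of correlated samples from each of several Gaussian processes over the frame index, each process with its own characteristic length scale, so that every action frame receives a first original representation whose length equals the number of processes. Replicating the result once per individual (the tiling below) is an assumption, as are all names.

```python
import numpy as np

def sample_first_representations(num_individuals, num_frames, length_scales, seed=0):
    """Sample first original representations from several Gaussian processes.

    length_scales: one characteristic length scale per Gaussian process; the
    number of processes equals the length of the first original representation.
    Returns an array of shape (num_individuals * num_frames, len(length_scales)).
    """
    rng = np.random.default_rng(seed)
    t = np.arange(num_frames, dtype=float)
    dims = []
    for ls in length_scales:
        # RBF covariance over frame indices; jitter keeps it positive-definite.
        cov = np.exp(-0.5 * ((t[:, None] - t[None, :]) / ls) ** 2)
        cov += 1e-6 * np.eye(num_frames)
        dims.append(rng.multivariate_normal(np.zeros(num_frames), cov))  # (num_frames,)
    original = np.stack(dims, axis=-1)                     # (num_frames, D)
    # One copy per individual (assumption): first total * second total representations.
    return np.tile(original, (num_individuals, 1))
```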
9. The method of claim 1, wherein the second feature representation is obtained by mapping based on the target action category.
10. The method of claim 9, wherein the obtaining second feature representations respectively characterizing the plurality of individuals with respect to the target action category comprises:
performing embedded representation on the target action category to obtain a second original representation;
obtaining the first total number of the second feature representations based on the first total number and the second original representation.
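Claim 10 reads like a standard embedding lookup followed by replication per individual. A minimal sketch, assuming a learned `embedding_table` is available upstream (names illustrative):

```python
import numpy as np

def second_representations(action_category, num_individuals, embedding_table):
    """Look up an embedded representation of the target action category and
    replicate it once per individual (first total number of copies).

    embedding_table: (num_categories, D) array of learned category embeddings (assumed).
    """
    second_original = embedding_table[action_category]      # (D,)
    return np.tile(second_original, (num_individuals, 1))   # (num_individuals, D)
```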
11. The method of claim 1, wherein the first feature representation and the second feature representation are each fused with a position code;
wherein, in a case where the individuals are a single individual, the position code comprises a time-series position code, and in a case where the individuals are multiple individuals, the position code comprises an individual position code and the time-series position code.
12. The method of claim 11, wherein the action sequence is obtained by an action generation model, and the position code is adjusted along with network parameters of the action generation model during training of the action generation model until the training converges.
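Claims 11 and 12 together read like the learnable additive position encodings used in Transformer-style models: a time-series code is always added, an individual code only when there are multiple individuals, and both are trained with the network. A PyTorch sketch under that assumption (module and attribute names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class PositionCodes(nn.Module):
    """Learnable time-series and individual position codes, fused by addition
    and updated together with the rest of the network during training."""

    def __init__(self, max_individuals, max_timings, dim):
        super().__init__()
        self.timing_code = nn.Parameter(torch.randn(1, max_timings, dim) * 0.02)
        self.individual_code = nn.Parameter(torch.randn(max_individuals, 1, dim) * 0.02)

    def forward(self, feats):
        # feats: (N, T, D) stacked first/second feature representations.
        n, t, _ = feats.shape
        out = feats + self.timing_code[:, :t]
        if n > 1:  # individual position code only when there are multiple individuals
            out = out + self.individual_code[:n]
        return out
```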
13. The method of claim 1, wherein the action representation of an individual in an action frame comprises: first position information of key points of the individual in the action frame, and posture information of the individual, the posture information comprising second position information of a plurality of joint points of the individual.
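Claim 13 describes the per-frame action representation as key-point position information plus per-joint position information. A plain data-structure reading, purely illustrative (the claim does not fix coordinate dimensionality or field names):

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # assumed 3D coordinates

@dataclass
class ActionRepresentation:
    """One individual's action representation within a single action frame."""
    keypoint_positions: List[Point3D]  # first position information of the individual's key points
    joint_positions: List[Point3D]     # second position information of the joint points (posture information)
```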
14. The method of claim 1, wherein the action sequence is obtained by an action generation model, and the action generation model and a discriminant model are obtained through generative adversarial training.
15. The method of claim 14, wherein the generative adversarial training comprises:
acquiring a sample action sequence of a plurality of sample individuals with respect to a sample action category; wherein the sample action sequence comprises a preset number of sample action frames and is marked with a sample label, and the sample label indicates whether the sample action sequence is actually generated by the action generation model;
decomposing each sample action frame in the sample action sequence to obtain sample graph data; wherein the sample graph data comprises the preset number of node graphs, each node graph is formed by connecting nodes, the nodes comprise key points and joint points of the sample individuals, the node graph comprises node feature representations of the nodes, and the node feature representations are obtained by splicing position feature representations of the sample individuals at the corresponding nodes;
discriminating the sample graph data and the sample action category based on the discriminant model to obtain a prediction result; wherein the prediction result comprises a first prediction label and a second prediction label of the sample action sequence, the first prediction label represents a possibility that the sample action sequence is predicted to be generated by the action generation model, and the second prediction label represents a possibility that the sample action sequence belongs to the sample action category;
adjusting network parameters of the action generation model or the discriminant model based on the sample label, the first prediction label, and the second prediction label.
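Claim 15 follows the familiar generative-adversarial recipe with an auxiliary category head: the discriminant model predicts both whether a sequence was generated and which action category it belongs to. The PyTorch-style sketch below shows one simplified update step; the module interfaces, loss choices, and every name are assumptions rather than the patent's specification.

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, opt_g, opt_d,
                     real_graphs, real_categories, noise, target_categories):
    """One simplified generative-adversarial update (sketch only).

    Assumes discriminator(graphs, categories) -> (is_generated_logit, category_logits),
    mirroring the first and second prediction labels of claim 15."""
    # --- discriminant model update ---
    fake_graphs = generator(noise, target_categories).detach()
    real_gen_logit, real_cat_logits = discriminator(real_graphs, real_categories)
    fake_gen_logit, _ = discriminator(fake_graphs, target_categories)
    d_loss = (F.binary_cross_entropy_with_logits(real_gen_logit, torch.zeros_like(real_gen_logit))
              + F.binary_cross_entropy_with_logits(fake_gen_logit, torch.ones_like(fake_gen_logit))
              + F.cross_entropy(real_cat_logits, real_categories))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- action generation model update ---
    fake_graphs = generator(noise, target_categories)
    fake_gen_logit, fake_cat_logits = discriminator(fake_graphs, target_categories)
    g_loss = (F.binary_cross_entropy_with_logits(fake_gen_logit, torch.zeros_like(fake_gen_logit))
              + F.cross_entropy(fake_cat_logits, target_categories))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```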
16. The method of claim 15, wherein, in a case where the sample action sequence is acquired from a real scene, the node feature representations of the nodes are obtained by splicing the position feature representations of the sample individuals at the corresponding nodes in a random order of the sample individuals.
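Claim 16's random individual order for real-scene samples can be sketched as a shuffle applied before concatenating the per-individual position features at each node (illustrative names, numpy):

```python
import numpy as np

def node_feature(per_individual_position_feats, real_scene, rng=None):
    """per_individual_position_feats: (N, D) position feature of each sample
    individual at one node; concatenation order is randomized for real-scene samples."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(per_individual_position_feats)
    order = rng.permutation(n) if real_scene else np.arange(n)
    return np.concatenate([per_individual_position_feats[i] for i in order])
```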
17. An action generation device, comprising:
a feature acquisition module, configured to acquire first feature representations respectively characterizing a plurality of individuals in a plurality of action frames, and to acquire second feature representations respectively characterizing the plurality of individuals with respect to a target action category;
a relational modeling module, configured to perform relational modeling based on the first feature representations and the second feature representations to obtain a fused feature representation of each individual in each of the action frames; wherein the type of the relational modeling is related to a first total number of the plurality of individuals;
an action mapping module, configured to perform action mapping based on the fused feature representations to obtain an action sequence of the plurality of individuals with respect to the target action category; wherein the action sequence comprises the plurality of action frames, and the action frames contain an action representation of each of the individuals.
18. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the action generation method of any of claims 1 to 16.
19. A computer-readable storage medium having stored thereon program instructions, which when executed by a processor implement the action generation method of any of claims 1 to 16.
CN202210089863.5A 2022-01-25 2022-01-25 Action generation method and related device, electronic equipment and storage medium Pending CN114494543A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210089863.5A CN114494543A (en) 2022-01-25 2022-01-25 Action generation method and related device, electronic equipment and storage medium
PCT/CN2022/135160 WO2023142651A1 (en) 2022-01-25 2022-11-29 Action generation method and related apparatus, and electronic device, storage medium and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210089863.5A CN114494543A (en) 2022-01-25 2022-01-25 Action generation method and related device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114494543A true CN114494543A (en) 2022-05-13

Family

ID=81474329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210089863.5A Pending CN114494543A (en) 2022-01-25 2022-01-25 Action generation method and related device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114494543A (en)
WO (1) WO2023142651A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023142651A1 (en) * 2022-01-25 2023-08-03 上海商汤智能科技有限公司 Action generation method and related apparatus, and electronic device, storage medium and program

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116805046B (en) * 2023-08-18 2023-12-01 武汉纺织大学 Method for generating 3D human body action based on text label

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263650B (en) * 2019-05-22 2022-02-22 北京奇艺世纪科技有限公司 Behavior class detection method and device, electronic equipment and computer readable medium
JP7200069B2 (en) * 2019-08-23 2023-01-06 Kddi株式会社 Information processing device, vector generation method and program
CN112668366B (en) * 2019-10-15 2024-04-26 华为云计算技术有限公司 Image recognition method, device, computer readable storage medium and chip
CN110765967B (en) * 2019-10-30 2022-04-22 腾讯科技(深圳)有限公司 Action recognition method based on artificial intelligence and related device
CN112025692B (en) * 2020-09-01 2021-09-03 广东工业大学 Control method and device for self-learning robot and electronic equipment
CN114494543A (en) * 2022-01-25 2022-05-13 上海商汤科技开发有限公司 Action generation method and related device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023142651A1 (en) 2023-08-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination