WO2023142651A1 - Action generation method and related device, electronic equipment, storage medium and program - Google Patents

Action generation method and related device, electronic equipment, storage medium and program

Info

Publication number
WO2023142651A1
Authority
WO
WIPO (PCT)
Prior art keywords
action
representation
individuals
feature
modeling
Prior art date
Application number
PCT/CN2022/135160
Other languages
English (en)
French (fr)
Inventor
宋子扬
王栋梁
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023142651A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects

Definitions

  • This application relates to the technical field of computer vision, and involves but is not limited to an action generation method and related devices, electronic equipment, storage media and programs.
  • Motion generation is the key to many computer vision tasks such as animation creation, humanoid robot interaction, and so on.
  • Existing motion generation methods mainly fall into two types.
  • One is the modeling-and-rendering approach based on computer graphics, which requires designers to invest substantial time and effort in modeling, skinning and motion capture, so its generation efficiency is relatively low; the other is based on machine learning, especially deep learning. Thanks to the rapid development of machine learning in recent years, using deep neural networks to perform action generation tasks can greatly improve the efficiency of action generation.
  • Embodiments of the present application provide an action generation method and related devices, electronic equipment, storage media, and programs.
  • An embodiment of the present application provides an action generation method, which is applied to electronic devices.
  • The method includes: obtaining first feature representations that represent several individuals in several action frames, and obtaining second feature representations that represent the several individuals with respect to a target action category; performing relationship modeling based on the first feature representations and the second feature representations to obtain a fusion feature representation of each individual in each action frame, where the type of relationship modeling is related to a first total number of the several individuals; and performing action mapping based on the fusion feature representations to obtain an action sequence of the several individuals with respect to the target action category, where the action sequence includes the several action frames and each action frame includes an action representation of each individual.
  • In this way, first feature representations representing several individuals in several action frames are obtained, and second feature representations representing the several individuals with respect to the target action category are obtained. On this basis, relationship modeling is performed based on the first and second feature representations to obtain the fusion feature representation of each individual in each action frame, with the type of relationship modeling related to the first total number of the individuals; action mapping is then performed based on the fusion feature representations to obtain the action sequence of the several individuals with respect to the target action category, where the action sequence includes several action frames and each action frame includes the action representation of each individual.
  • On the one hand, actions can therefore be generated automatically without relying on manual work; on the other hand, because the relationship modeling is chosen according to the first total number of individuals, the model is compatible with both single-individual and multi-individual application scenarios. Thus, while improving the efficiency of action generation, the method is compatible with both single-individual and multi-individual scenarios.
  • In some embodiments, the relation between the type of relationship modeling and the first total number of the several individuals includes at least one of the following: when the first total number is one (a single individual), the relationship modeling includes modeling the temporal relationship between action frames; when the first total number is greater than one (multiple individuals), the relationship modeling includes modeling both the interaction relationship between the individuals within each action frame and the temporal relationship between action frames.
  • Therefore, when the first total number is one, modeling the temporal relationship improves the temporal coherence between action frames, which helps improve the realism of the action sequence; when the first total number is greater than one, modeling the interaction relationship improves the plausibility of the interaction between individuals and modeling the temporal relationship improves the temporal coherence between action frames, both of which help improve the realism of the action sequence.
  • When the relationship modeling includes modeling the temporal relationship, performing relationship modeling based on the first and second feature representations to obtain the fusion feature representation of each individual in each action frame includes: selecting an individual as the target individual, and taking the first feature representations and the second feature representation corresponding to the target individual as the timing feature representations of the target individual at different timings; selecting each timing in turn as the first current timing and the timing feature representation at that timing as the first current timing representation; and obtaining the fusion feature representation corresponding to the first current timing representation based on the correlations between each first reference timing representation and the first current timing representation, where the first reference timing representations include the timing feature representations of the target individual at every timing.
  • When the relationship modeling includes modeling the interaction relationship, performing relationship modeling based on the first and second feature representations to obtain the fusion feature representation of each individual in each action frame includes: selecting an individual as the target individual, and taking the first feature representations and the second feature representation corresponding to the target individual as the timing feature representations of the target individual at different timings; selecting each timing in turn as the second current timing and the timing feature representation at that timing as the second current timing representation; and obtaining the fusion feature representation corresponding to the second current timing representation based on the correlations between each second reference timing representation and the second current timing representation, where the second reference timing representations include the timing feature representations of every individual at the second current timing.
  • In both cases, an individual is selected as the target individual, the first and second feature representations corresponding to the target individual are taken as its timing feature representations at different timings, each timing feature representation is in turn taken as the current timing representation, and the fusion feature representation corresponding to the current timing representation is obtained from the correlations between the reference timing representations and the current timing representation. When the temporal relationship is modeled, the reference timing representations are the timing feature representations of the target individual at every timing; when the interaction relationship is modeled, the reference timing representations are the timing feature representations of every individual at the reference timing, the reference timing being the timing corresponding to the current timing representation. The temporal relationship and the interaction relationship can therefore be modeled through a similar modeling process, which further improves compatibility between single-individual and multi-individual application scenarios.
  • When the relationship modeling includes modeling both the interaction relationship and the temporal relationship, performing relationship modeling based on the first and second feature representations to obtain the fusion feature representation of each individual in each action frame includes: modeling the earlier relationship based on the first and second feature representations to obtain an output feature representation of the earlier relationship, and modeling the later relationship based on that output feature representation to obtain the fusion feature representation. The earlier relationship is the interaction relationship and the later relationship is the temporal relationship, or the earlier relationship is the temporal relationship and the later relationship is the interaction relationship.
  • Thus, when relationship modeling includes both the interaction relationship and the temporal relationship, the output feature representation of the relationship modeled first serves as the input feature representation of the relationship modeled second. In multi-individual application scenarios, modeling the interaction relationship and the temporal relationship in sequence fuses both relationships into the resulting feature representations, which helps improve the fusion of the interaction and temporal relationships.
  • In some embodiments, the action sequence is obtained by an action generation model; the action generation model includes a relationship modeling network; the relationship modeling network includes a temporal modeling sub-network and an interaction modeling sub-network; the temporal modeling sub-network is used to model the temporal relationship, and the interaction modeling sub-network is used to model the interaction relationship. Because the action generation task can be completed by a network model, the efficiency of action generation is further improved.
  • In some embodiments, the first feature representations are obtained based on Gaussian process sampling. Obtaining the first feature representations by sampling Gaussian processes greatly reduces the complexity of obtaining them and can also improve generation quality on action data with rich categories.
  • In some embodiments, obtaining the first feature representations representing the several individuals in the several action frames includes: sampling each of several Gaussian processes a second total number of times to obtain first original representations that respectively represent the second total number of action frames, where the length of each first original representation equals the number of Gaussian processes and the characteristic length scales of the Gaussian processes differ from one another; and obtaining a third total number of first feature representations based on the first total number and the first original representations, where the third total number is the product of the first total number and the second total number.
  • Because the characteristic length scales of the Gaussian processes differ, each round of sampling captures feature information of each action frame at a different scale, which improves the accuracy of each first feature representation.
  • In some embodiments, the second feature representations are obtained by mapping the target action category. Since the second feature representations can be obtained merely through simple processing such as mapping text information, the complexity of driving action generation is greatly reduced.
  • In some embodiments, obtaining the second feature representations representing the several individuals with respect to the target action category includes: embedding the target action category to obtain a second original representation, and obtaining the first total number of second feature representations based on the first total number and the second original representation. That is, by embedding the text information and performing related processing in combination with the first total number, the first total number of second feature representations can be obtained, which greatly reduces the complexity of obtaining the second feature representations.
  • In some embodiments, both the first feature representations and the second feature representations are fused with position codes. When the several individuals are a single individual, the position codes include temporal position codes; when the several individuals are multiple individuals, the position codes include both individual position codes and temporal position codes. Different position coding strategies can therefore be used to distinguish different feature representations in the single-individual and multi-individual application scenarios, so that the position codes of different feature representations differ, which helps improve the accuracy of the feature representations.
  • In some embodiments, the action sequence is obtained by an action generation model, and the position codes are adjusted together with the network parameters of the action generation model during training until training converges.
  • Because the position codes are trained jointly with the network, their representation ability is improved; after training converges the position codes are no longer adjusted, i.e. they remain fixed, which adds a strong prior constraint. A balance is thus reached between the prior constraint and representation ability, further improving the accuracy of the feature representations and the quality of the generated action sequences.
  • In some embodiments, the action representation of an individual in an action frame includes the first position information of a key point of the individual and the pose information of the individual in that frame, where the pose information includes the second position information of several joint points of the individual. Expressing an individual's action through the position information of both the key point and the joint points helps improve the accuracy of the action representation.
  • In some embodiments, the action sequence is obtained by an action generation model, and the action generation model and a discrimination model are obtained through generative adversarial training. Training the action generation model and the discrimination model jointly in this way lets the two models promote and complement each other, which ultimately helps improve the performance of the action generation model.
  • In some embodiments, the generative adversarial training includes: obtaining sample action sequences of several sample individuals with respect to sample action categories, where each sample action sequence includes a preset number of sample action frames and is annotated with a sample label indicating whether the sample action sequence was actually generated by the action generation model; decomposing each sample action frame in the sample action sequence to obtain sample graph data, where the sample graph data includes the preset number of node graphs, a node graph is formed by connecting nodes, the nodes include key points and joint points, the node graph includes a node feature representation for each node, and the position feature representation of a node is obtained by concatenating the position feature representations of the several sample individuals at the corresponding node; and discriminating the sample graph data and the sample action category with the discrimination model to obtain a prediction result, where the prediction result includes a first prediction label of the sample action sequence that represents the likelihood that the sample action sequence is predicted to have been generated by the action generation model.
  • In some embodiments, when the sample action sequence is collected from a real scene, the position feature representation of a node is obtained by concatenating the position feature representations of the several sample individuals at the corresponding node in a random order of the sample individuals. In this way, during training, sequences that actually belong to the same sample action sequence but are ordered differently are treated and modeled as different samples, which realizes data augmentation and helps improve the robustness of the model.
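  • As an illustration of how such node features might be assembled, the following minimal Python sketch concatenates the per-individual position features at each node and shuffles the individual order for sequences collected from real scenes. The array shapes, the is_real flag and the joint count are assumptions chosen for illustration, not details fixed by the embodiment.

    import numpy as np

    def build_node_features(sample_positions, is_real, rng=None):
        """Build per-node feature representations for the discrimination model.

        sample_positions: array of shape (P, T, N, 3) holding the 3D position of
            each of the N nodes (key point and joint points) for P sample
            individuals over T sample action frames.
        is_real: True if the sample action sequence was collected from a real
            scene; in that case the individual order is shuffled so that
            differently ordered versions of the same sequence act as distinct
            training samples (data augmentation).
        Returns an array of shape (T, N, P * 3): one node graph per frame, each
        node feature being the concatenation of all individuals' positions.
        """
        rng = np.random.default_rng() if rng is None else rng
        P, T, N, C = sample_positions.shape
        order = rng.permutation(P) if is_real else np.arange(P)
        # (P, T, N, C) -> (T, N, P, C) -> (T, N, P*C): concatenate individuals per node.
        reordered = sample_positions[order]
        return reordered.transpose(1, 2, 0, 3).reshape(T, N, P * C)

    # Example: two individuals, 20 frames, 1 key point + 17 joint points (assumed).
    graph = build_node_features(np.random.randn(2, 20, 18, 3), is_real=True)
    print(graph.shape)  # (20, 18, 6)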
  • An embodiment of the present application also provides an action generation device, including a feature acquisition part, a relationship modeling part and an action mapping part. The feature acquisition part is configured to acquire first feature representations representing several individuals in several action frames and to acquire second feature representations representing the several individuals with respect to a target action category; the relationship modeling part is configured to perform relationship modeling based on the first and second feature representations to obtain the fusion feature representation of each individual in each action frame, where the type of relationship modeling is related to the first total number of the several individuals; and the action mapping part is configured to perform action mapping based on the fusion feature representations to obtain the action sequence of the several individuals with respect to the target action category, where the action sequence includes several action frames and each action frame includes the action representation of each individual.
  • An embodiment of the present application also provides an electronic device, including a memory and a processor coupled to each other, and the processor is configured to execute program instructions stored in the memory, so as to implement any one of the above-mentioned action generating methods.
  • the embodiment of the present application also provides a computer-readable storage medium, on which program instructions are stored, and when the program instructions are executed by a processor, any one of the above-mentioned action generation methods is implemented.
  • An embodiment of the present application also provides a computer program, including computer-readable code; when the computer-readable code runs in an electronic device, the processor in the electronic device executes it to implement any one of the above action generation methods.
  • With the above solution, first feature representations representing several individuals in several action frames are obtained, and second feature representations representing the several individuals with respect to the target action category are obtained; on this basis, relationship modeling is performed based on the first and second feature representations to obtain the fusion feature representation of each individual in each action frame, where the type of relationship modeling is related to the first total number of the several individuals; action mapping is then performed based on the fusion feature representations to obtain the action sequence of the several individuals with respect to the target action category, where the action sequence includes several action frames and each action frame includes the action representation of each individual. On the one hand, actions can be generated automatically without relying on manual work; on the other hand, because the relationship modeling is chosen according to the first total number of individuals, the method is compatible with both single-individual and multi-individual application scenarios while improving the efficiency of action generation.
  • FIG. 1 is a schematic flowchart of an action generation method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the process of an action generation method provided by an embodiment of the present application.
  • Fig. 3a is a schematic diagram of the first action sequence provided by the embodiment of the present application.
  • Fig. 3b is a schematic diagram of the second action sequence provided by the embodiment of the present application.
  • Fig. 3c is a schematic diagram of the third action sequence provided by the embodiment of the present application.
  • Fig. 3d is a schematic diagram of the fourth action sequence provided by the embodiment of the present application.
  • Fig. 3e is a schematic diagram of the fifth action sequence provided by the embodiment of the present application.
  • Fig. 3f is a schematic diagram of the sixth action sequence provided by the embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a training method for an action generation model provided in an embodiment of the present application
  • Fig. 5 is a schematic diagram of acquisition of a sample action frame provided by an embodiment of the present application.
  • Fig. 6 is a schematic diagram of a sample image data provided by the embodiment of the present application.
  • Fig. 7 is a schematic framework diagram of an action generation device provided by an embodiment of the present application.
  • Fig. 8 is a schematic frame diagram of an electronic device provided by an embodiment of the present application.
  • Fig. 9 is a schematic framework diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • the execution subject of the action generating method may be an electronic device, for example, the electronic device may be a terminal device, a server or other processing device, wherein the terminal device may be a user equipment (User Equipment, UE), a mobile device, User terminals, terminals, cellular phones, cordless phones, personal digital assistants (Personal Digital Assistant, PDA), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc.
  • the action generation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • FIG. 1 is a schematic flowchart of an action generation method provided in an embodiment of the present application, which may include the following steps:
  • Step S11 Obtain first feature representations representing several individuals in several action frames, and obtain second feature representations representing several individuals with respect to the target action category.
  • the first total number of several individuals and the target action category may be specified by the user before the formal implementation of action generation.
  • For example, the user may specify the target action category as "hug" and the first total number of individuals as two; or the user may specify the target action category as "dancing" and the first total number as one; or the user may specify the target action category as "fighting" and the first total number as three. It should be noted that these examples are only several possible implementations in actual application and do not limit the target action category or the first total number of individuals in practice.
  • Alternatively, the target action category may be specified by the user before action generation is formally performed, and the first total number of the several individuals may be analyzed automatically based on the target action category.
  • For example, the user may specify the target action category as "high five", in which case the first total number automatically analyzed from the category is two; or the user may specify the target action category as "exchange items", in which case the automatically analyzed first total number is two; or the user may specify the target action category as "carrying objects", in which case the automatically analyzed first total number is one.
  • Again, these examples are only several possible implementations in actual application and do not limit the target action category or the first total number of individuals in practice.
  • In some embodiments, the target action category may be specified by the user before action generation is formally performed, the first total number of the several individuals may be analyzed automatically based on the target action category, and a user instruction modifying the automatically analyzed first total number may be accepted to correct it.
  • For example, the user may specify the target action category as "fighting", the first total number automatically analyzed from the category may be two, and a user modification instruction may be accepted to correct it to four; or the user may specify the target action category as "walking", the first total number automatically analyzed from the category may be one, and a user modification instruction may be accepted to correct it to two.
  • the above-mentioned individuals may all be human. Of course, it is not excluded that several individuals include both humans and animals.
  • the target action category can be specified as "walking the dog", and several individuals can include people and dogs.
  • the second total number of action frames may be specified in advance, for example, the second total number may be 10, 15, 20, 30, etc., which is not limited here.
  • In some embodiments, the first feature representation of each individual in each action frame may be obtained.
  • When the first total number of the several individuals is one (i.e. the action generation scenario of a single individual), the first feature representation of the single individual in each action frame is obtained; when the several individuals are, for example, two individuals, the first feature representation of each individual in each action frame is obtained. For convenience of description, the two individuals may be referred to as "A" and "B", and then the first feature representations of "A" in each action frame and the first feature representations of "B" in each action frame are obtained.
  • Other cases can be deduced by analogy and are not enumerated here.
  • It should be noted that the action frames are those contained in the action sequence that the embodiments of this application ultimately intend to generate; that is, when the first feature representations are obtained, the action frames have not actually been generated yet, and the first feature representations can be viewed as feature representations initialized for each individual in each action frame.
  • In some embodiments, the first feature representations may be obtained based on Gaussian process sampling.
  • A Gaussian process is a type of stochastic process in probability theory and mathematical statistics: a collection of random variables indexed by an index set, any finite subset of which follows a joint normal distribution.
  • For the precise definition of a Gaussian process, please refer to the relevant technical literature; details are not repeated here.
  • In some embodiments, each of several Gaussian processes may be sampled the second total number of times to obtain first original representations that respectively represent the second total number of action frames, where the length of each first original representation equals the number of Gaussian processes and the characteristic length scales of the Gaussian processes differ from one another.
  • On this basis, a third total number of first feature representations is obtained, where the third total number is the product of the first total number and the second total number.
  • For example, the second total number of action frames may be denoted as T, and the characteristic length scales σ_c of the several Gaussian processes may take the values 1, 10, 100 and 1000. The Gaussian process with characteristic length scale σ_c = 1 is sampled T times to obtain a one-dimensional vector of length T, and similarly the Gaussian processes with characteristic length scales 10, 100 and 1000 are each sampled to obtain one-dimensional vectors of length T.
  • In this way, T first original representations of length 4 are obtained, and these T first original representations correspond one-to-one to the T action frames: the first original representation corresponds to the first action frame, the second to the second action frame, ..., and the T-th to the T-th action frame.
  • FIG. 2 is a schematic process diagram of an action generation method provided by an embodiment of the present application. As shown in FIG. 2, the length of each first original representation obtained by the above sampling may be denoted as C_0, and the first original representations representing the several action frames may be denoted as (T, C_0).
  • Input mapping may be performed on the first original representations (T, C_0) (for example, a multi-layer perceptron may be used to map the first original representations) so as to change their dimensionality; the number of first original representations after mapping is still T.
  • Because the characteristic length scales of the Gaussian processes differ, each sampling of a Gaussian process captures feature information of each action frame at a different scale, so the accuracy of each first feature representation can be improved.
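  • The following minimal Python sketch illustrates this sampling scheme under stated assumptions: T points are drawn from several zero-mean Gaussian processes with an RBF kernel at the example length scales 1, 10, 100 and 1000, stacked into T first original representations of length C_0, and optionally passed through a linear input mapping standing in for the multi-layer perceptron. The kernel choice, jitter and mapped channel size are illustrative assumptions, not details fixed by the embodiment.

    import numpy as np

    def sample_first_original_representations(T, length_scales=(1, 10, 100, 1000), rng=None):
        """Sample each Gaussian process T times and stack the draws.

        Each GP uses a zero mean and an RBF kernel over the frame indices 1..T
        with a different characteristic length scale, so each draw carries
        frame-wise information at a different temporal smoothness.
        Returns an array of shape (T, C0) with C0 = len(length_scales): one
        first original representation per action frame.
        """
        rng = np.random.default_rng() if rng is None else rng
        t = np.arange(T, dtype=float)
        draws = []
        for scale in length_scales:
            # RBF kernel K[i, j] = exp(-(t_i - t_j)^2 / (2 * scale^2)), plus jitter.
            K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / scale) ** 2)
            K += 1e-6 * np.eye(T)
            draws.append(rng.multivariate_normal(np.zeros(T), K))  # length-T vector
        return np.stack(draws, axis=1)  # (T, C0)

    T = 20
    first_original = sample_first_original_representations(T)   # (20, 4)
    # Optional input mapping: a random linear map standing in for an MLP,
    # changing the channel dimension C0 -> C while keeping T representations.
    W = np.random.randn(4, 256) * 0.02
    mapped = first_original @ W                                  # (20, 256)
    print(first_original.shape, mapped.shape)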
  • After the first original representations representing the second total number of action frames are obtained, whether to copy the first original representation of each action frame may be determined according to whether the first total number equals one or is greater than one, so as to obtain the first original representations of the several individuals in each action frame.
  • When the first total number equals one, the action generation can be determined to be a single-individual scenario, and the first original representation of each action frame obtained by the above sampling is directly used as that single individual's first original representation in each action frame; when the first total number is greater than one, the action generation can be determined to be a multi-individual scenario, and the first original representation of each action frame obtained by the above sampling is copied the first total number of times to obtain the first original representations of the multiple individuals in each action frame.
  • For example, when the first total number is 2, the first original representation representing the first action frame is copied into two first original representations, which respectively represent the two individuals' first original representations in the first action frame; other cases can be deduced by analogy and are not enumerated here.
  • On the basis of the first original representations, the position information of each first original representation may be encoded to obtain the corresponding first feature representation; that is, each first feature representation is fused with a position code, and the position codes differ from one another.
  • When the several individuals are a single individual, the position codes include temporal position codes; that is, in the single-individual case, different first original representations are distinguished mainly by encoding the action frames of different timings, so as to obtain the first feature representations.
  • Still taking T action frames as an example, the first original representations of the T action frames are respectively fused with temporal position codes (e.g., 1, 2, ..., T), so as to obtain the first feature representations representing the single individual in the T action frames.
  • When the several individuals are multiple individuals, the position codes may include temporal position codes and individual position codes; that is, in the multi-individual case, not only the action frames of different timings but also the multiple individuals within each action frame need to be encoded in order to distinguish different first original representations and obtain the first feature representations (as shown in the dashed box after position encoding in FIG. 2).
  • For example, the first original representation of the first action frame may be fused with the temporal position code (e.g., 1), and the multiple individuals in the first action frame may further be fused with individual position codes (e.g., 1, 2, ...); combining the temporal position code and the individual position codes as the position codes, the first feature representations representing the multiple individuals in the first action frame are fused with different position codes (e.g., 1-1, 1-2, ...). Similarly, the first original representation of the second action frame may be fused with the temporal position code (e.g., 2) and the multiple individuals in the second action frame with individual position codes (e.g., 1, 2, ...), so that the first feature representations representing the multiple individuals in the second action frame are fused with different position codes (e.g., 2-1, 2-2, ...), and so on.
  • It should be noted that the above position codes are only examples.
  • In some embodiments, to improve the efficiency of action generation, an action generation model may be pre-trained, and the position codes may be adjusted together with the network parameters of the action generation model during training until training converges; thereafter, the adjusted position codes are used in subsequent applications.
  • With the above approach, different position coding strategies can be used to distinguish different feature representations in both the single-individual and multi-individual application scenarios, so that the position codes of different feature representations differ, which helps improve the accuracy of the feature representations.
  • In some embodiments, the second feature representation of each individual with respect to the target action category may be obtained.
  • When the first total number of the several individuals is one (i.e. the action generation scenario of a single individual), the second feature representation of the single individual with respect to the target action category is obtained; when the several individuals are, for example, two individuals, the second feature representation of each individual with respect to the target action category is obtained.
  • For convenience of description, the two individuals may be referred to as "A" and "B", and then the second feature representation of "A" with respect to the target action category and the second feature representation of "B" with respect to the target action category are obtained.
  • Other cases can be deduced by analogy and are not enumerated here.
  • In some embodiments, the target action category may be specified by the user; after the target action category is determined, mapping may be performed based on the target action category to obtain the second feature representations.
  • In some embodiments, the target action category may be embedded to obtain a second original representation, and on this basis the first total number of second feature representations is obtained based on the first total number and the second original representation.
  • The role of the above embedding is to convert the target action category into a vector.
  • In some embodiments, category vectors for different action categories may be preset.
  • For example, category vectors for 26 action categories may be preset (e.g., each category vector may have length 200); after the target action category is determined, the category vector of the action category that matches the target action category is used as the second original representation of the target action category. Alternatively, the target action category may be one-hot encoded and then linearly transformed by a fully connected layer to obtain the second original representation of the target action category.
  • For example, the target action category may be one-hot encoded as a 26-dimensional vector, and the linear transformation of the fully connected layer may be regarded as an N (e.g., 200) x 26 transformation matrix; multiplying this matrix by the 26-dimensional one-hot encoding yields the second original representation.
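  • A minimal sketch of the one-hot plus fully connected variant described above, assuming 26 action categories and an embedding length of 200 (both figures taken from the example); the weight initialization, the category index used in the demo, and the replication helper are illustrative assumptions.

    import numpy as np

    NUM_CATEGORIES = 26   # number of preset action categories (from the example)
    EMBED_DIM = 200       # length of the second original representation (from the example)

    # The fully connected layer acts as a 200 x 26 transformation matrix.
    W_fc = np.random.randn(EMBED_DIM, NUM_CATEGORIES) * 0.02

    def embed_action_category(category_index):
        """One-hot encode the target action category and linearly transform it."""
        one_hot = np.zeros(NUM_CATEGORIES)
        one_hot[category_index] = 1.0
        return W_fc @ one_hot          # second original representation, shape (200,)

    def replicate_for_individuals(second_original, first_total):
        """Copy the second original representation once per individual (P copies)."""
        return np.tile(second_original, (first_total, 1))   # (P, 200)

    second_original = embed_action_category(category_index=3)        # e.g. an assumed index
    per_individual = replicate_for_individuals(second_original, 2)   # two individuals
    print(second_original.shape, per_individual.shape)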
  • After the second original representation representing the target action category is obtained, whether to copy it may likewise be determined according to whether the first total number equals one or is greater than one, so as to obtain the second original representations of the several individuals with respect to the target action category.
  • When the first total number equals one, the action generation can be determined to be a single-individual scenario, and the second original representation representing the target action category is directly used as the single individual's second original representation with respect to the target action category; when the first total number is greater than one, the action generation can be determined to be a multi-individual scenario, and the second original representation representing the target action category is copied the first total number of times to obtain the multiple individuals' second original representations with respect to the target action category. For example, when the first total number is 2, the second original representation representing the target action category is copied into two second original representations, which respectively represent the two individuals' second original representations with respect to the target action category; other cases can be deduced by analogy and are not enumerated here.
  • Similarly to the first feature representations, the position information of each second original representation may be encoded on the basis of the second original representation to obtain the corresponding second feature representation; that is, the second feature representations are also fused with position codes, and the position codes differ from one another. It should be noted that not only do the position codes fused into different second feature representations differ, but the position codes fused into the second feature representations also differ from those fused into the first feature representations.
  • When the several individuals are a single individual, the position codes include temporal position codes; that is, in the single-individual case, the second feature representation is distinguished from the first feature representations of different action frames along the temporal dimension.
  • Still taking T action frames as an example, the first original representations of the T action frames are respectively fused with temporal position codes (e.g., 1, 2, ..., T), and the second original representation of the target action category is fused with the temporal position code (e.g., T+1), so as to obtain the single individual's second feature representation with respect to the target action category.
  • When the several individuals are multiple individuals, the position codes may include temporal position codes and individual position codes; that is, in the multi-individual case, the feature representations need to be distinguished along both the temporal dimension and the individual dimension (as shown in the dashed box after position encoding in FIG. 2).
  • For example, the second original representations of the multiple individuals with respect to the target action category may first be fused with the temporal position code (e.g., T+1); the second original representation of the first individual with respect to the target action category is further fused with the individual position code (e.g., 1), the second original representation of the second individual is further fused with the individual position code (e.g., 2), and so on. Combining the temporal position code and the individual position codes, the multiple individuals' second feature representations with respect to the target action category are fused with different position codes (e.g., T+1-1, T+1-2, ...).
  • It should be noted that the above position codes are only examples.
  • Likewise, to improve the efficiency of action generation, an action generation model may be pre-trained, and the position codes may be adjusted together with the network parameters of the action generation model during training until training converges; thereafter, the adjusted position codes are used in subsequent applications.
  • With the above approach, different position coding strategies can be used in both the single-individual and multi-individual application scenarios to distinguish different feature representations, so that their position codes differ, which helps improve the accuracy of the feature representations.
  • As described above, both the first feature representations and the second feature representations are fused with position codes: when the several individuals are a single individual, the position codes include temporal position codes, and when the several individuals are multiple individuals, the position codes include individual position codes and temporal position codes, as illustrated in FIG. 2 and the description above.
  • Each position code may be different.
  • In some embodiments, the first original representations of an individual in the several action frames and its second original representation with respect to the target action category may be taken as that individual's original timing representations at different timings. Still taking T action frames as an example, for the p-th individual, its first original representations in the T action frames and its second original representation with respect to the target action category can be regarded as its original timing representations at timings 1 to T and at timing T+1, respectively.
  • In the single-individual case, for timing t from 1 to T, the temporal position code TPE_t of the t-th timing is added to the t-th original timing representation to obtain the first feature representation at the t-th timing; when the timing t is T+1, the temporal position code TPE_t is added to the t-th original timing representation to obtain the second feature representation.
  • In the multi-individual case, for timing t from 1 to T, the position code PE(t, p) of the p-th individual at the t-th timing is added to the original timing representation of the p-th individual at the t-th timing to obtain the first feature representation of the p-th individual at the t-th timing; when the timing t is T+1, the position code PE(t, p) is added to the original timing representation of the p-th individual at the t-th timing to obtain the second feature representation of the p-th individual.
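  • The sketch below adds learnable temporal and individual position codes to the original timing representations in the way just described: a single temporal code TPE_t per timing in the single-individual case, and a combined temporal-plus-individual code PE(t, p) in the multi-individual case. Treating the codes as trainable parameters mirrors the joint training described above; the tensor shapes and initialization are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PositionCoding(nn.Module):
        """Learnable temporal (and, for multiple individuals, individual) position codes."""

        def __init__(self, num_timings, max_individuals, dim):
            super().__init__()
            # One code per timing 1..T+1 and, for the multi-individual case,
            # one code per individual slot; both are trained with the model.
            self.temporal = nn.Parameter(torch.randn(num_timings, dim) * 0.02)
            self.individual = nn.Parameter(torch.randn(max_individuals, dim) * 0.02)

        def forward(self, original_timing_repr):
            """original_timing_repr: (P, T_plus_1, dim) original timing representations."""
            P, T1, _ = original_timing_repr.shape
            codes = self.temporal[:T1].unsqueeze(0)               # (1, T+1, dim): TPE_t
            if P > 1:                                             # multi-individual: PE(t, p)
                codes = codes + self.individual[:P].unsqueeze(1)  # (P, T+1, dim)
            return original_timing_repr + codes                   # position-coded representations

    T, dim = 20, 256
    single = PositionCoding(T + 1, max_individuals=4, dim=dim)(torch.randn(1, T + 1, dim))
    multi = PositionCoding(T + 1, max_individuals=4, dim=dim)(torch.randn(2, T + 1, dim))
    print(single.shape, multi.shape)   # torch.Size([1, 21, 256]) torch.Size([2, 21, 256])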
  • Step S12 Perform relationship modeling based on the first feature representation and the second feature representation to obtain fusion feature representations of each individual in each action frame.
  • the type of relationship modeling is related to the first total number of several individuals.
  • When the first total number of the several individuals is one, the relationship modeling includes modeling the temporal relationship between action frames, so that modeling the temporal relationship improves the temporal coherence between action frames and helps improve the realism of the action sequence.
  • When the first total number of the several individuals is greater than one, the relationship modeling includes modeling the interaction relationship between the individuals within each action frame and the temporal relationship between action frames, so that modeling the interaction relationship improves the plausibility of the interaction between individuals and modeling the temporal relationship improves the temporal coherence between action frames, both of which help improve the realism of the action sequence.
  • In the single-individual case, the single individual can be directly selected as the target individual, and the first feature representations and the second feature representation corresponding to the target individual are taken as the target individual's timing feature representations at different timings.
  • For example, the first feature representation of the target individual in the first action frame is taken as the 1st timing feature representation, its first feature representation in the second action frame as the 2nd timing feature representation, ..., its first feature representation in the T-th action frame as the T-th timing feature representation, and its second feature representation with respect to the target action category as the (T+1)-th timing feature representation.
  • On this basis, each timing can be selected in turn as the current timing, the timing feature representation of the current timing is taken as the current timing representation, and the fusion feature representation corresponding to the current timing representation is obtained based on the correlations between each reference timing representation and the current timing representation.
  • When modeling the temporal relationship, the timing feature representations of the target individual at every timing can be used as the reference timing representations. Based on the correlations between these reference timing representations and, for example, the i-th timing feature representation, the fusion feature representation corresponding to the i-th timing feature representation is obtained, so that in the single-individual action generation scenario, T+1 fusion feature representations are finally obtained.
  • The T+1 fusion feature representations include: the feature representations of the single individual in the T action frames after the temporal relationship has been fused, and the feature representation of the single individual with respect to the target action category after the temporal relationship has been fused.
  • For ease of distinction, when modeling the temporal relationship, the current timing may be called the first current timing, the timing feature representation of the current timing may be called the first current timing representation, and the reference timing representations may be called the first reference timing representations.
  • In some embodiments, to improve the efficiency of action generation, an action generation model may be pre-trained; the action generation model may include a relationship modeling network, and the relationship modeling network may further include a temporal modeling sub-network.
  • The temporal modeling sub-network may be constructed based on the Transformer; for ease of distinction, the Transformer contained in the temporal modeling sub-network may be called the T-Former.
  • In the temporal modeling sub-network, the aforementioned T+1 timing feature representations may first be linearly transformed to obtain the {query, key, value} feature representations corresponding to each timing feature representation.
  • Taking the t-th timing feature representation F_t as an example, the corresponding {query, key, value} feature representations q_t, k_t and v_t can be obtained through linear transformations, e.g., q_t = W_q F_t, k_t = W_k F_t and v_t = W_v F_t, where W_q, W_k and W_v are learnable transformation matrices.
  • The value feature representation v_{t'} of the t'-th timing feature representation (t' ranging from 1 to T+1) can be weighted by the correlation degree w_{t,t'} (e.g., obtained from q_t and k_{t'} via a scaled dot product followed by softmax normalization over t') and summed to obtain the fusion feature representation H_t of the t-th timing feature representation after the temporal relationship has been fused: H_t = Σ_{t'} w_{t,t'} v_{t'}.
  • In some embodiments, the temporal modeling sub-network may be formed by stacking L (L ≥ 1) layers of Transformers.
  • When L is greater than 1, the fusion feature representations output by the l-th layer Transformer can be used as the input of the (l+1)-th layer Transformer, and the aforementioned temporal modeling process is executed again to obtain the fusion feature representations output by the (l+1)-th layer Transformer.
  • The fusion feature representations output by the last layer Transformer can be used as the final fusion feature representations.
  • The (T+1)-th final fusion feature representation, which relates to the target action category, can be discarded.
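  • A minimal sketch of the temporal attention step just described, assuming a single attention head and scaled dot-product weights for the correlation degree w_{t,t'}; the real T-Former may differ in head count, normalization and feed-forward layers, none of which are specified here.

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        """One T-Former attention step over the T+1 timing feature representations."""

        def __init__(self, dim):
            super().__init__()
            self.w_q = nn.Linear(dim, dim)   # linear transforms producing q_t, k_t, v_t
            self.w_k = nn.Linear(dim, dim)
            self.w_v = nn.Linear(dim, dim)

        def forward(self, timing_feats):
            """timing_feats: (T_plus_1, dim) timing feature representations F_1..F_{T+1}."""
            q, k, v = self.w_q(timing_feats), self.w_k(timing_feats), self.w_v(timing_feats)
            # Correlation degrees w_{t,t'} between every pair of timings.
            w = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (T+1, T+1)
            # H_t: value representations weighted by w_{t,t'} and summed over t'.
            return w @ v                                               # (T+1, dim)

    T, dim = 20, 256
    fused = TemporalAttention(dim)(torch.randn(T + 1, dim))
    fused = fused[:T]   # the (T+1)-th output, tied to the action category, is discarded
    print(fused.shape)  # torch.Size([20, 256])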
  • In the multi-individual case, the interaction relationship can be modeled first and then the temporal relationship, or the temporal relationship can be modeled first and then the interaction relationship.
  • That is, the output feature representation of the relationship modeled first is used as the input feature representation of the relationship modeled second: when the relationship modeling includes modeling both the interaction relationship and the temporal relationship, the earlier relationship is modeled based on the first and second feature representations to obtain the output feature representation of the earlier relationship, and the later relationship is then modeled based on that output feature representation to obtain the fusion feature representation.
  • The earlier relationship is the interaction relationship and the later relationship is the temporal relationship, or the earlier relationship is the temporal relationship and the later relationship is the interaction relationship.
  • In some embodiments, to improve the efficiency of action generation, an action generation model may be pre-trained; the action generation model may include a relationship modeling network, and the relationship modeling network may include a temporal modeling sub-network and an interaction modeling sub-network.
  • Both the temporal modeling sub-network and the interaction modeling sub-network may be constructed based on the Transformer; for ease of distinction, the Transformer contained in the temporal modeling sub-network may be called the T-Former, and the Transformer contained in the interaction modeling sub-network may be called the I-Former.
  • In the multi-individual case, one of the individuals may likewise be selected as the target individual; for example, the p-th individual among the P individuals may be selected as the target individual.
  • The first feature representations and the second feature representation corresponding to the target individual are taken as the target individual's timing feature representations at different timings; that is, the first feature representations of the target individual in the T action frames and its second feature representation with respect to the target action category are respectively regarded as its timing feature representations at timings 1 to T+1.
  • The T+1 timing feature representations may first be linearly transformed to obtain the {query, key, value} feature representations corresponding to each timing feature representation; taking the p-th individual selected as the target individual as an example, the {query, key, value} feature representations q_t^p, k_t^p and v_t^p corresponding to its t-th timing feature representation F_t^p can be obtained through linear transformations.
  • each time series can be selected as the current time series, and the temporal feature representation of the current time series can be selected as the current time series representation , and based on the correlation between each reference time-series representation and the current time-series representation, the fusion feature representation corresponding to the current time-series representation is obtained.
  • the reference time-series representation includes each individual respectively Timing feature representation at the current time series.
  • the current timing can be named the second current timing
  • the timing feature representation of the second current timing can be named the second current timing representation
  • the reference timing representation can be named the second current timing representation.
  • Two reference timing representation Specifically, the tth time series can be used as the reference time series, and the time series feature representation of each individual at the reference time series is the key feature representation of each individual at the reference time series
  • the value range of p′ is from 1 to P.
  • the degree of relevance can be expressed as the inner product between the query feature representation of the target individual at the t-th time step and the key feature representation of the p′-th individual at the same time step, and the value feature representations of the individuals at that time step are then weighted by the (normalized) relevance to obtain the fused feature representation of the target individual at the t-th time step, in the same form as the temporal modeling described above.
  • these fused feature representations can then be used as the input feature representations for modeling the temporal relationship, so as to continue modeling the temporal relationship.
  • for the modeling process of the temporal relationship, please refer to the related description above.
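  • purely as an illustration of the query/key/value relevance and value weighting described above, the following PyTorch-style sketch attends either over the T+1 time steps of one individual (T-Former style) or over the P individuals at one time step (I-Former style); the tensor names, the single-head form and the softmax normalization are assumptions made for the sketch and are not taken from the patent.

```python
import torch
import torch.nn as nn

class RelevanceAttention(nn.Module):
    # Computes {query, key, value} by linear transformation and fuses the value
    # representations weighted by the query-key relevance, as described above.
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_v

    def forward(self, x):
        # x: (..., N, dim); N is the axis attended over (time steps for the T-Former,
        # individuals at a fixed time step for the I-Former).
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        relevance = q @ k.transpose(-1, -2)          # w = q . k
        weights = torch.softmax(relevance, dim=-1)   # normalized relevance (softmax is an assumption)
        return weights @ v                           # weighted sum of the value representations

P, T, C = 2, 20, 256                                  # illustrative sizes
x = torch.randn(P, T + 1, C)                          # temporal feature representations per individual
temporal_fused = RelevanceAttention(C)(x)             # T-Former: attends over the T+1 time steps
interaction_fused = RelevanceAttention(C)(x.transpose(0, 1)).transpose(0, 1)  # I-Former: over the P individuals
```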
  • the I-Former used to model the interaction relationship and the T-Former used to model the temporal relationship can be combined into one group of Transformers to jointly model the interaction relationship and the temporal relationship, and the relationship modeling network can then include L groups of Transformers.
  • after the fused feature representation output by the l-th group of Transformers is obtained, it can be used as the input of the (l+1)-th group of Transformers, and the aforementioned modeling process can be executed again to obtain the fused feature representation output by the (l+1)-th group; by analogy, the fused feature representation output by the last group of Transformers is finally taken as the final fused feature representation.
  • since the first T final fused feature representations have already fully absorbed the target action category, the (T+1)-th final fused feature representation, which relates to the target action category, can be discarded before the subsequent action generation.
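  • as a rough sketch of how L groups of Transformers could be stacked so that each group first models the interaction relationship and then the temporal relationship, with the category-related token discarded at the end, one might write the following; the use of nn.MultiheadAttention, the head count and L = 2 (mirroring the exemplary configuration in Table 1) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerGroup(nn.Module):
    # One group = I-Former (attention over individuals) followed by T-Former (attention over time steps).
    def __init__(self, dim, heads=4):
        super().__init__()
        self.i_former = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_former = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                             # x: (P, T+1, dim)
        inter = x.transpose(0, 1)                     # (T+1, P, dim): attend over the P individuals
        inter, _ = self.i_former(inter, inter, inter)
        x = inter.transpose(0, 1)
        x, _ = self.t_former(x, x, x)                 # attend over the T+1 time steps
        return x

P, T, C = 2, 20, 256
groups = nn.ModuleList(TransformerGroup(C) for _ in range(2))  # e.g. L = 2 groups
h = torch.randn(P, T + 1, C)                          # first feature representations + category token
for group in groups:
    h = group(h)                                      # output of group l is the input of group l+1
final_fused = h[:, :T]                                # discard the (T+1)-th, category-related representation
```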
  • Table 1 is a structural representation of an embodiment of the action generation model.
  • the action generation model exemplarily includes 2 sets of Transformers. Of course, it is also possible to set 3 sets of Transformers, 4 sets of Transformers, or 5 sets of Transformers, etc., which are not limited here.
  • the action generation model shown in Table 1 is only a possible implementation in the actual application process, and the specific structure of the action generation model is not limited here.
  • the number of input/output channels of each network layer shown in Table 1 can also be adaptively adjusted according to actual application needs.
  • Table 1 Schematic diagram of the structure of an embodiment of the action generation model
  • whether the temporal relationship or the interaction relationship is being modeled, the two modeling processes are similar: an individual is first selected as the target individual, the first feature representations and the second feature representation corresponding to the target individual are taken as the temporal feature representations of the target individual at different time steps, each time step is then selected as the current time step and its temporal feature representation is taken as the current temporal representation, and the fused feature representation corresponding to the current temporal representation is obtained based on the correlation between each reference temporal representation and the current temporal representation.
  • the difference is that when modeling the temporal relationship, the reference temporal representations include the temporal feature representations of the target individual at every time step, whereas when modeling the interaction relationship, the reference temporal representations include the temporal feature representations of each individual at the current time step. Therefore, the temporal relationship and the interaction relationship can be modeled through a similar modeling process, which further improves the compatibility between the single-individual and multi-individual application scenarios.
  • Step S13 Perform action mapping based on the fused feature representation, and obtain the action sequences of several individuals with respect to the target action category.
  • the action sequence includes several action frames, and the action frames include action representations of each individual.
  • the action sequence may include T action frames, and the number of individuals is P individuals, then each action frame includes the action representations of P individuals, so time-series continuous three-dimensional actions can be generated.
  • in order to improve the efficiency of action generation, the action generation model can be pre-trained, and the action generation model can include an action mapping network.
  • the action mapping network can include a linear layer such as a fully connected layer; the specific structure of the action mapping network is not limited here.
  • the fusion feature representation of each individual in each action frame can be input to the action mapping network, and the action sequence of several individuals with respect to the target action category can be obtained.
  • T*P fusion feature representations can be obtained, then the above T*P fusion feature representations can be input to the action mapping network to obtain T action frames, and each An action frame contains the action representations of P individuals, so T action frames can be combined in chronological order to obtain an action sequence.
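  • as an illustration, a minimal action mapping network can be a single fully connected (linear) layer applied to every fused feature representation; the output dimension below (one key-point position plus eight joint positions in 3-D) is a hypothetical layout chosen only for the example.

```python
import torch
import torch.nn as nn

P, T, hidden_dim = 2, 20, 256
action_dim = 3 + 8 * 3                                # hypothetical: key-point position + 8 joints, 3-D each

mapping_network = nn.Linear(hidden_dim, action_dim)   # action mapping network (fully connected layer)
fused = torch.randn(P, T, hidden_dim)                 # the T * P fused feature representations
action_sequence = mapping_network(fused)              # (P, T, action_dim): frame t holds all P action representations
```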
  • the action sequence can be expressed as {M_t | t ∈ [1, …, T]}, where M_t denotes the t-th action frame and each action frame M_t contains the action representations of the P individuals.
  • the action representation of an individual in an action frame may include: in the action frame, the first position information of a key point of the individual (such as the crotch) and the posture information of the individual, and the posture information may include the second position information of several joint points (e.g., left shoulder, right shoulder, left elbow, right elbow, left knee, right knee, left foot, right foot, etc.).
  • the first position information can be the absolute position of the key point in the local coordinate system, and the posture information may include the position coordinates of each joint point in the local coordinate system.
  • each action frame in the action sequence can be expressed as a tensor with a size of (P, C), that is, the action representation of each individual in the action frame can be expressed as a C-dimensional vector.
  • action sequences can be represented as tensors of size (P,T,C).
  • the above posture information can be expressed as a pose representation in the Skinned Multi Person Model (SMPL), a widely used parametric human body model.
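  • continuing the hypothetical layout above, splitting one generated action frame back into the key-point position and the joint-point coordinates might look as follows; the split sizes are assumptions for the example only.

```python
import torch

P, C = 2, 3 + 8 * 3                                   # C-dimensional action representation per individual
frame_t = torch.randn(P, C)                           # action frame M_t: action representations of P individuals
keypoint_position = frame_t[:, :3]                    # first position information of the key point
joint_positions = frame_t[:, 3:].reshape(P, 8, 3)     # second position information of the joint points
```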
  • Figure 3a is the action sequence generated when the target action category is "toast”
  • Figure 3b is the action sequence generated when the target action category is "photographing”
  • Figure 3c is the action sequence generated when the target action category is "support”
  • Figure 3d is the action sequence generated when the target action category is "raid”
  • Figure 3e is the action sequence generated when the target action category is "stretch”
  • Fig. 3f is the action sequence generated when the target action category is "dancing”.
  • the action sequence generated by the action generation model only includes the action representation of each individual in each action frame, and does not include the appearance of each individual or the action scene, so after the action sequence is obtained, the appearance of each individual (such as hairstyle, clothing, hair color, etc.) can be freely designed as needed, and the action scene (such as streets, shopping malls, parks, etc.) can also be freely designed as needed.
  • for example, after the target action category is determined to be "photographing" and the first total number of individuals is determined to be two, the action sequence shown in Figure 3b can be generated through the aforementioned process; on this basis, the appearance of the individual on the left (such as short hair, shirt, shorts, black hair, etc.) and the appearance of the individual on the right (such as long hair, dress, black hair, etc.) can be designed, and the action scene can be designed as a "park", so that a richer animation is obtained, which on the one hand improves design flexibility and on the other hand greatly reduces the creation workload.
  • in the above solution, first feature representations respectively characterizing several individuals in several action frames are obtained, and second feature representations respectively characterizing the several individuals with respect to the target action category are obtained; on this basis, relationship modeling is performed based on the first feature representations and the second feature representations to obtain the fused feature representation of each individual in each action frame, where the type of relationship modeling is related to the first total number of the several individuals, and action mapping is then performed based on the fused feature representations to obtain the action sequence of the several individuals with respect to the target action category, where the action sequence includes several action frames and each action frame contains the action representation of each individual. Therefore, on the one hand, actions can be generated automatically without relying on manual work; on the other hand, performing relationship modeling in a targeted manner according to the first total number of the several individuals makes the method compatible with both the single-individual and multi-individual application scenarios. Hence, both application scenarios are supported while the efficiency of action generation is improved.
  • FIG. 4 is a schematic flowchart of a method for training an action generation model provided by an embodiment of the present application.
  • the action sequence is obtained by the action generation model.
  • the action generation model can be trained jointly with a discrimination model through generative adversarial training.
  • the training process can include the following steps:
  • Step S41 Obtain the sample action sequences of several sample individuals with respect to the sample action category.
  • the sample action sequence includes a preset number of sample action frames, and the sample action sequence is marked with a sample mark, which indicates whether the sample action sequence is actually generated by the action generation model.
  • the sample action sequence can be generated by an action generation model, or it can be collected in a real scene.
  • FIG. 5 is a schematic diagram of acquiring a sample action frame provided by an embodiment of the present application.
  • sample captured images of real individuals demonstrating the sample action category can be acquired, and the sample action representation of each sample individual in a sample captured image can be extracted; for example, the sample action representation of each sample individual can include the position information of the key point and of several joint points of the sample individual.
  • each sample captured image can be expressed as a sample action frame, and the sample action representation of each sample individual in each sample action frame, similar to the action representation in the previous embodiments, can be represented by a C-dimensional vector.
  • Step S42 Decompose each sample action frame in the sample action sequence to obtain sample image data.
  • the sample graph data includes a preset number of node graphs
  • a node graph is formed by connecting nodes
  • the nodes include the key point and the joint points of a sample individual
  • the node graph includes the node feature representation of each node
  • the position feature representation of a node is obtained by concatenating the position feature representations of the several sample individuals at the corresponding node.
  • FIG. 6 is a schematic diagram of sample graph data provided by an embodiment of the present application.
  • for the single-sample-individual scenario, each node graph only needs to represent a single sample individual; the C-dimensional sample action representation can be decomposed into K D-dimensional position vectors (C = K × D, where K is the total number of the key point and joint points of a sample individual, e.g., 18), so each node graph is formed by connecting K nodes, each node is expressed by its D-dimensional vector, each node graph can be expressed as a tensor of size (K, D), and on this basis the sample graph data can be expressed as a tensor of size (T, K, D).
  • for the multi-sample-individual scenario, each node graph needs to represent multiple sample individuals; each node graph is still formed by connecting K nodes, but each node is obtained by concatenating the D-dimensional vectors of the multiple sample individuals at that node, so each node graph can be expressed as a tensor of size (K, P·D), and on this basis the sample graph data can be expressed as a tensor of size (T, K, P·D).
  • when the sample action sequence is collected from a real scene, the position feature representation of a node is concatenated from the position feature representations of the several sample individuals at the corresponding node in a random order of the sample individuals, so that during training the action generation model treats differently ordered versions of what is actually the same sample action sequence as different samples and models them accordingly, thereby achieving data augmentation and improving the robustness of the model.
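  • a minimal sketch of this decomposition, assuming a sample action sequence stored as a (P, T, C) tensor with C = K × D and an optional random individual order for real-scene samples, is given below; the function and argument names are illustrative.

```python
import torch

def to_sample_graph_data(sample_sequence, num_nodes, shuffle_individuals=False):
    """Decompose a sample action sequence of shape (P, T, C), with C = K * D,
    into sample graph data of shape (T, K, P * D)."""
    P, T, C = sample_sequence.shape
    D = C // num_nodes
    nodes = sample_sequence.reshape(P, T, num_nodes, D)        # one D-dimensional vector per node
    if shuffle_individuals:                                    # random individual order (data augmentation)
        nodes = nodes[torch.randperm(P)]
    # concatenate the P individuals' vectors at each node -> (T, K, P * D)
    return nodes.permute(1, 2, 0, 3).reshape(T, num_nodes, P * D)

graph_data = to_sample_graph_data(torch.randn(2, 20, 18 * 3), num_nodes=18, shuffle_individuals=True)
```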
  • Step S43 Discriminate the sample image data and the sample action category based on the discrimination model to obtain a prediction result.
  • the identification model can be constructed based on a spatio-temporal graph convolutional network.
  • Table 2 is a structural representation of an embodiment of the identification model. It should be noted that Table 2 is only a possible implementation of the identification model in the actual application process, and does not limit the specific structure of the identification model.
  • for the specific meaning of the spatio-temporal convolutions in Table 2, please refer to the relevant technical details of spatio-temporal convolution.
  • the prediction result includes the first prediction mark and the second prediction mark of the sample action sequence
  • the first prediction mark represents the possibility of the sample action sequence being predicted to be generated by the action generation model
  • the second prediction mark represents the likelihood that the sample action sequence belongs to the sample action category.
  • the first predictive flag and the second predictive flag can be represented by numerical values, and the larger the numerical value, the higher the corresponding possibility.
  • the sample graph data can be denoted as x; after processing by the spatio-temporal graph convolution layers, a 512-dimensional vector φ(x) can be obtained, and after the sample action category is represented by the category embedding, a 512-dimensional vector y can also be obtained, and the inner product φ(x)·y of the two can be computed. Further, the vector φ(x) can be input to the output mapping layer, and the result, combined with the aforementioned inner product φ(x)·y, gives the scores assigned by the discrimination model to the input sample action category and sample action sequence, that is, the aforementioned first prediction mark and second prediction mark.
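  • the scoring described here matches the general form of a projection discriminator; the sketch below shows only the head (the spatio-temporal graph convolution backbone producing φ(x) is omitted), and the additive combination of the output-mapping result with the inner product, as well as the 26-class setting, are assumptions used for illustration.

```python
import torch
import torch.nn as nn

class ProjectionDiscriminatorHead(nn.Module):
    # Combines the pooled graph feature phi(x) with the category embedding y:
    # score = psi(phi(x)) + phi(x) . y, yielding the marks described above.
    def __init__(self, feat_dim=512, num_classes=26):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, feat_dim)   # category embedding producing y
        self.output_map = nn.Linear(feat_dim, 1)                 # output mapping layer psi

    def forward(self, phi_x, class_id):
        y = self.class_embed(class_id)                           # (B, feat_dim)
        projection = (phi_x * y).sum(dim=-1, keepdim=True)       # inner product phi(x) . y
        return self.output_map(phi_x) + projection               # score for (sample sequence, category)

head = ProjectionDiscriminatorHead()
score = head(torch.randn(4, 512), torch.tensor([3, 3, 7, 0]))    # (4, 1)
```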
  • Step S44 Adjust the network parameters of any one of the action generation model and the discrimination model based on the sample label, the first prediction label and the second prediction label.
  • the discrimination loss of the discriminative model can be measured by the first prediction mark and the sample mark
  • the generation loss of the action generation model can be measured by the second prediction mark and the sample mark.
  • during training, for every M training iterations of the discrimination model (in which the network parameters of the discrimination model are adjusted), the action generation model can be trained N times (in which the network parameters of the action generation model are adjusted); for example, the action generation model is trained once for every 4 training iterations of the discrimination model, which is not limited here.
  • training the discrimination model improves its ability to discriminate sample action sequences, that is, its ability to distinguish sample action sequences generated by the model from sample action sequences actually collected, which in turn pushes the action generation model to generate more realistic action sequences
  • training the action generation model improves the realism of the generated action sequences (i.e., the generated action sequences become as close as possible to actually collected ones), which in turn pushes the discrimination model to improve its discrimination ability, so that the discrimination model and the action generation model mutually promote and complement each other, and after several rounds of training the performance of the action generation model becomes better and better
  • when the discrimination model can no longer distinguish the action sequences generated by the action generation model from the real action sequences, the training ends.
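  • a rough sketch of this alternating schedule, assuming PyTorch, a hinge-style adversarial loss and hypothetical helpers (generator.sample_like, batch.graph, batch.category) that the patent does not define, could look as follows.

```python
import torch

def adversarial_training_step(generator, discriminator, g_opt, d_opt, real_batch, step, d_steps_per_g=4):
    """Train the discrimination model every step and the action generation model every
    d_steps_per_g steps (e.g. 4 discriminator updates per generator update)."""
    fake_batch = generator.sample_like(real_batch)               # hypothetical helper producing generated sequences

    # Discriminator update: score real (collected) sequences high and generated ones low.
    d_real = discriminator(real_batch.graph, real_batch.category)
    d_fake = discriminator(fake_batch.graph.detach(), fake_batch.category)
    d_loss = torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update every d_steps_per_g steps: make generated sequences score high.
    if step % d_steps_per_g == 0:
        g_loss = -discriminator(fake_batch.graph, fake_batch.category).mean()
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```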
  • position encoding can be performed during the action generation process, and the position encoding can be adjusted together with the network parameters of the action generation model during the training process of the action generation model.
  • in the above scheme, jointly training the action generation model and the discrimination model through generative adversarial training allows the two models to promote and complement each other during collaborative training, which ultimately helps to improve the performance of the action generation model; in addition, decomposing the sample action representations into sample graph data neatly reduces the discrimination of action sequences to the discrimination of graph data, which greatly reduces the training complexity and the difficulty of building the discrimination model.
  • FIG. 7 is a schematic framework diagram of an action generating device 70 provided in an embodiment of the present application.
  • the action generation device 70 includes: a feature acquisition part 71, a relationship modeling part 72, and an action mapping part 73.
  • the feature acquisition part 71 is configured to acquire first feature representations respectively characterizing several individuals in several action frames, and to acquire second feature representations respectively characterizing the several individuals with respect to the target action category.
  • the relationship modeling part 72 is configured to perform relationship modeling based on the first feature representations and the second feature representations to obtain the fused feature representation of each individual in each action frame, where the type of relationship modeling is related to the first total number of the several individuals; the action mapping part 73 is configured to perform action mapping based on the fused feature representations to obtain the action sequence of the several individuals with respect to the target action category, where the action sequence includes several action frames and each action frame contains the action representation of each individual.
  • the above solution on the one hand, can automatically generate actions without relying on manual work, and on the other hand, through targeted relationship modeling based on the first total number of several individuals, it can be compatible with both application scenarios of a single individual and multiple individuals. Therefore, under the premise of improving the efficiency of action generation, it is compatible with both application scenarios of a single individual and multiple individuals.
  • the type of relationship modeling is related to the first total of the several individuals, including at least one of the following: when the first total of the several individuals is a single, the relationship modeling includes modeling each Timing relationship between action frames; when the first total number of several individuals is multiple, relationship modeling includes modeling the interaction relationship between several individuals in each action frame and modeling the timing relationship between each action frame .
  • the relationship modeling part 72 includes a timing modeling subsection; the timing modeling subsection includes a first selection unit configured to select an individual as the target individual, take the first feature representations and the second feature representation corresponding to the target individual as the temporal feature representations of the target individual at different time steps, take each time step in turn as the first current time step, and take the temporal feature representation at the first current time step as the first current temporal representation; the timing modeling subsection includes a first representation fusion unit configured to obtain the fused feature representation corresponding to the first current temporal representation based on the correlation between each first reference temporal representation and the first current temporal representation, where the first reference temporal representations include the temporal feature representations of the target individual at every time step.
  • the relationship modeling part 72 includes an interaction modeling subsection; the interaction modeling subsection includes a second selection unit configured to select an individual as the target individual, take the first feature representations and the second feature representation corresponding to the target individual as the temporal feature representations of the target individual at different time steps, take each time step in turn as the second current time step, and take the temporal feature representation at the second current time step as the second current temporal representation
  • the interaction modeling subsection includes a second representation fusion unit configured to obtain the fused feature representation corresponding to the second current temporal representation based on the correlation between each second reference temporal representation and the second current temporal representation, where the second reference temporal representations include the temporal feature representations of each individual at the second current time step.
  • when relationship modeling includes modeling the interaction relationship and the temporal relationship, the relationship modeling part 72 includes a prior modeling subsection configured to model the prior relationship based on the first feature representations and the second feature representations to obtain the output feature representation of the prior relationship
  • the relationship modeling part 72 includes a subsequent modeling subsection configured to model the subsequent relationship based on that output feature representation to obtain the fused feature representation; the prior relationship is the interaction relationship and the subsequent relationship is the temporal relationship, or the prior relationship is the temporal relationship and the subsequent relationship is the interaction relationship.
  • the action sequence is obtained by an action generation model
  • the action generation model includes a relational modeling network
  • the relational modeling network includes a timing modeling subnetwork and an interaction modeling subnetwork
  • the timing modeling subnetwork is used to model Temporal relationship
  • the interaction modeling sub-network is used to model the interaction relationship.
  • the first feature representation is based on Gaussian process sampling.
  • the feature acquisition part 71 includes a first acquisition subsection, and the first acquisition subsection includes a process sampling unit configured to sample each of several Gaussian processes a second total number of times to obtain first original representations respectively characterizing the second total number of action frames, where the length of each first original representation equals the number of Gaussian processes and the characteristic length scales of the Gaussian processes are all different; the first acquisition subsection includes a first acquisition unit configured to obtain a third total number of first feature representations based on the first total number and the first original representations, where the third total number is the product of the first total number and the second total number.
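  • a minimal sketch of such sampling, assuming squared-exponential (RBF) Gaussian processes over the frame indices with the characteristic length scales 1, 10, 100 and 1000 mentioned in the description, is shown below; the kernel form and the jitter term are assumptions of the sketch.

```python
import torch

def sample_first_original_representations(num_frames, length_scales=(1.0, 10.0, 100.0, 1000.0)):
    """Sample each Gaussian process num_frames times; stacking the samples gives
    num_frames first original representations, each of length len(length_scales)."""
    t = torch.arange(num_frames, dtype=torch.float32)
    samples = []
    for scale in length_scales:
        cov = torch.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * scale ** 2))  # RBF covariance over frames
        cov = cov + 1e-4 * torch.eye(num_frames)                               # jitter for numerical stability
        gp = torch.distributions.MultivariateNormal(torch.zeros(num_frames), covariance_matrix=cov)
        samples.append(gp.sample())                                            # one draw of length num_frames
    return torch.stack(samples, dim=-1)                                        # (num_frames, len(length_scales))

first_original = sample_first_original_representations(20)                     # one row per action frame
```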
  • the second feature representation is obtained based on target action category mapping.
  • the feature acquisition part 71 includes a second acquisition subsection, and the second acquisition subsection includes an embedding representation unit configured to embed the target action category to obtain a second original representation; the second acquisition subsection includes a second acquisition unit configured to obtain the first total number of second feature representations based on the first total number and the second original representation.
  • both the first feature representations and the second feature representations are fused with position codes; when the several individuals are a single individual, the position codes include temporal position codes, and when the several individuals are multiple individuals, the position codes include individual position codes and temporal position codes.
  • the action sequence is obtained by the action generation model, and the position code is adjusted together with the network parameters of the action generation model during the training process of the action generation model until the training of the action generation model converges.
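  • a minimal sketch of such learnable position codes, assuming that in the multi-individual case the code added to a feature representation is the concatenation of a temporal code and an individual code, is given below; the dimension split and the initialization are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class LearnablePositionCodes(nn.Module):
    # Single individual: a temporal position code per time step.
    # Multiple individuals: concat(temporal code, individual code) added to the representation.
    def __init__(self, num_steps, num_individuals, dim):
        super().__init__()
        self.temporal_full = nn.Parameter(torch.randn(num_steps, dim))                     # single-individual codes
        self.temporal_half = nn.Parameter(torch.randn(num_steps, dim // 2))                # temporal part (TPE)
        self.individual_half = nn.Parameter(torch.randn(num_individuals, dim - dim // 2))  # individual part (PPE)

    def forward(self, x):                                   # x: (P, T+1, dim)
        P, S, _ = x.shape
        if P == 1:
            return x + self.temporal_full[None, :S]
        codes = torch.cat([self.temporal_half[:S].expand(P, S, -1),
                           self.individual_half[:P, None].expand(P, S, -1)], dim=-1)
        return x + codes                                    # trained jointly with the model, then frozen
```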
  • the action representation of the individual in the action frame includes: in the action frame, first position information of key points of the individual and pose information of the individual, and the pose information includes second position information of several joint points of the individual.
  • the action sequence is obtained by an action generation model, and the action generation model and the identification model are obtained through generative confrontation training.
  • the action generation device 70 includes a sample sequence acquisition part configured to obtain sample action sequences of several sample individuals with respect to a sample action category, where a sample action sequence includes a preset number of sample action frames and is labeled with a sample mark, the sample mark indicating whether the sample action sequence was actually generated by the action generation model
  • the action generation device 70 includes a sample sequence decomposition part configured to decompose each sample action frame in the sample action sequence to obtain sample graph data, where the sample graph data includes the preset number of node graphs, a node graph is formed by connecting nodes, the nodes include the key point and joint points of a sample individual, the node graph includes the node feature representation of each node, and the position feature representation of a node is obtained by concatenating the position feature representations of the several sample individuals at the corresponding node
  • the action generation device 70 includes a sample sequence discrimination part configured to discriminate the sample graph data and the sample action category based on the discrimination model to obtain a prediction result, where the prediction result includes a first prediction mark and a second prediction mark of the sample action sequence, the first prediction mark represents the likelihood that the sample action sequence is predicted to have been generated by the action generation model, and the second prediction mark represents the likelihood that the sample action sequence belongs to the sample action category
  • the action generation device 70 includes a network parameter adjustment part configured to adjust the network parameters of either the action generation model or the discrimination model based on the sample mark, the first prediction mark, and the second prediction mark.
  • the position feature representation of a node is spliced from the position feature representations of several sample individuals at corresponding nodes in a random order of several sample individuals.
  • the feature acquisition part 71, the relationship modeling part 72, and the action mapping part 73 mentioned above can all be implemented based on the processor of the electronic device.
  • FIG. 8 is a schematic frame diagram of an electronic device 80 provided in an embodiment of the present application.
  • the electronic device 80 includes a memory 81 and a processor 82 coupled to each other, and the processor 82 is configured to execute program instructions stored in the memory 81, so as to implement the steps in any of the above embodiments of the action generation method.
  • the electronic device 80 may include, but is not limited to: a microcomputer and a server.
  • the electronic device 80 may also include mobile devices such as notebook computers and tablet computers, which are not limited here.
  • the processor 82 is configured to control itself and the memory 81 to implement the steps in any of the above embodiments of the action generating method.
  • the processor 82 may also be called a central processing unit (Central Processing Unit, CPU).
  • the processor 82 may be an integrated circuit chip with signal processing capabilities.
  • the processor 82 can also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the processor 82 may be jointly realized by an integrated circuit chip.
  • the above solution on the one hand, can automatically generate actions without relying on manual work, and on the other hand, through targeted relationship modeling based on the first total number of several individuals, it can be compatible with both application scenarios of a single individual and multiple individuals. Therefore, under the premise of improving the efficiency of action generation, it is compatible with both application scenarios of a single individual and multiple individuals.
  • FIG. 9 is a schematic frame diagram of a computer-readable storage medium 90 provided by an embodiment of the present application.
  • the computer-readable storage medium 90 stores program instructions 901 that can be executed by the processor, and the program instructions 901 are used to implement the steps of any of the above embodiments of the action generating method.
  • the above solution on the one hand, can automatically generate actions without relying on manual work, and on the other hand, through targeted relationship modeling based on the first total number of several individuals, it can be compatible with both application scenarios of a single individual and multiple individuals. Therefore, under the premise of improving the efficiency of action generation, it is compatible with both application scenarios of a single individual and multiple individuals.
  • the embodiment of the present application also provides a computer program product, the computer product carries a program code, and the instructions included in the program code can be used to implement the steps in any of the above embodiments of the action generating method.
  • the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in some embodiments, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK).
  • the disclosed methods and devices may be implemented in other ways.
  • the device implementations described above are only illustrative.
  • the division of modules or units is only a logical function division. In actual implementation, there may be other division methods.
  • units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit, that is, it may be located in one place, or may also be distributed to network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially or part of the contribution to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) execute all or part of the steps of the methods in various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.


Abstract

本申请实施例公开了一种动作生成方法及相关装置、电子设备、存储介质和计算机程序,其中,动作生成方法包括:获取分别表征若干个体在若干动作帧的第一特征表示,并获取分别表征若干个体关于目标动作类别的第二特征表示;基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示;其中,关系建模的类型与若干个体的第一总数相关;基于融合特征表示进行动作映射,得到若干个体关于目标动作类别的动作序列;其中,动作序列包括若干动作帧,且动作帧包含各个体的动作表示。

Description

动作生成方法及相关装置、电子设备、存储介质和程序
相关申请的交叉引用
本申请基于申请号为202210089863.5、申请日为2022年01月25日,名称为“动作生成方法及相关装置、电子设备和存储介质”的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及计算机视觉技术领域,涉及但不限于一种动作生成方法及相关装置、电子设备、存储介质和程序。
背景技术
动作生成是诸如动画创作、仿人机器人交互等众多计算机视觉任务的关键所在。目前,现有的动作生成方式主要包括两种,一种是基于计算机图形学的建模-渲染方式,需要依赖于设计师投入大量时间和精力进行建模、蒙皮和动作捕捉等工作,效率较低;另一种是基于机器学习特别是深度学习的方式。得益于近年来机器学习技术的快速发展,利用深度神经网络执行动作生成任务,能够大大提升动作生成的效率。
发明内容
本申请实施例提供一种动作生成方法及相关装置、电子设备、存储介质和程序。
本申请实施例提供了一种动作生成方法,应用于电子设备中,该方法包括:获取分别表征若干个体在若干动作帧的第一特征表示,并获取分别表征若干个体关于目标动作类别的第二特征表示;基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示;其中,关系建模的类型与若干个体的第一总数相关;基于融合特征表示进行动作映射,得到若干个体关于目标动作类别的动作序列;其中,动作序列包括若干动作帧,且动作帧包含各个体的动作表示。因此,获取分别表征若干个体在若干动作帧的第一特征表示,并获取分别表征若干个体关于目标动作类别的第二特征表示,在此基础上,基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示,且关系建模的类型与若干个体的第一总数相关,再基于融合特征表示进行动作映射,即可得到若干个体关于目标动作类别的动作序列,且动作序列包括若干动作帧,动作帧包括各个体的动作表示,故一方面无需依赖于人工即可自动生成动作,另一方面通过根据若干个体的第一总数来针对性地进行关系建模,能够兼容单个个体和多个个体两种应用场景。故此,能够在提升动作生成效率的前提下,兼容单个个体和多个个体两种应用场景。
在一些实施例中,所述关系建模的类型与所述若干个体的第一总数相关,包括以下至少一项:在若干个体的第一总数为单个的情况下,关系建模包括建模各动作帧之间的时序关系;在若干个体的第一总数为多个的情况下,关系建模包括建模各动作帧中若干个体之间的交互关系和建模各动作帧之间的时序关系。因此,在若干个体的第一总数为单个的情况下,关系建模包括建模各动作帧之间的时序关系,故能够通过建模时序关系提升动作帧之间的时序连贯性,有利于提升动作序列的真实性,而在若干个体的第一总数为多个的情况下,关系建模包括建模各动作帧中若干个体之间的交互关系和建模各动作帧之间的时序关系,故能够通过建模交互关系提升个体之间的交互合理性,以及能够通过建模时序关系提升动作帧之间的时序连贯性,有利于提升动作序列的真实性。
在一些实施例中,在关系建模包括建模时序关系的情况下,基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示,包括:选择个体作 为目标个体,并将目标个体对应的第一特征表示和第二特征表示,作为目标个体在不同时序的时序特征表示;分别选择各个时序作为第一当前时序,并选择第一当前时序的时序特征表示作为第一当前时序表示;基于各个第一参考时序表示分别与第一当前时序表示的相关度,得到第一当前时序表示对应的融合特征表示;其中,第一参考时序表示包括目标个体在各时序的时序特征表示。
在一些实施例中,在关系建模包括建模交互关系的情况下,基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示,包括:选择个体作为目标个体,并将目标个体对应的第一特征表示和第二特征表示,作为目标个体在不同时序的时序特征表示;分别选择各个时序作为第二当前时序,并选择第二当前时序的时序特征表示作为第二当前时序表示;基于各个第二参考时序表示分别与第二当前时序表示的相关度,得到第二当前时序表示对应的融合特征表示;其中,第二参考时序表示包括各个体分别在第二当前时序的时序特征表示。因此,选择个体作为目标个体,并将目标个体对应的第一特征表示和第二特征表示,作为目标个体在不同时序的时序特征表示,基于此将不同时序的时序特征表示,分别作为当前时序表示,再基于各参考时序表示分别与当前时序表示的相关度,得到当前时序表示对应的融合特征表示,且在建模时序关系的情况下,参考时序表示包括目标个体在各时序的时序特征表示,在建模交互关系的情况下,参考时序表示包括各个体分别在参考时序的时序特征表示,而参考时序为当前时序表示对应的时序,故能够通过相似的建模流程来建模时序关系和交互关系,故能进一步提升单个个体和多个个体两种应用场景的兼容性。
在一些实施例中,在关系建模包括建模交互关系和时序关系的情况下,基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示,包括:基于第一特征表示和第二特征表示建模在先关系,得到在先关系的输出特征表示,基于输出特征表示建模在后关系,得到融合特征表示;其中,在先关系为交互关系,在后关系为时序关系,或者,在先关系为时序关系,在后关系为交互关系。因此,在关系建模包括建模交互关系和时序关系的情况下,将在先建模的交互关系的输出特征表示为在后建模的时序关系的输入特征表示,故在多个个体的应用场景下,通过先后建模交互关系和时序关系,使得各融合特征表示分别融入交互关系和时序关系,有利于提升交互关系和时序关系的融合效果。
在一些实施例中,动作序列由动作生成模型得到,动作生成模型包括关系建模网络,且关系建模网络包括时序建模子网络和交互建模子网络,时序建模子网络用于建模时序关系,交互建模子网络用于建模交互关系。因此,动作序列由动作生成模型得到,作生成模型包括关系建模网络,且关系建模网络包括时序建模子网络和交互建模子网络,时序建模子网络用于建模时序关系,交互建模子网络用于建模交互关系,故能够通过网络模型完成动作生成任务,有利于进一步提升动作生成效率。
在一些实施例中,第一特征表示基于高斯过程的采样得到。因此,基于高斯过程采样得到第一特征表示,有利于大大降低第一特征表示的获取复杂度,而且还能够提升在类别丰富的动作数据上的生成质量。
在一些实施例中,获取分别表征若干个体在若干动作帧的第一特征表示,包括:在若干高斯过程中,分别采样第二总数次,得到分别表征第二总数个动作帧的第一原始表示;其中,第一原始表示的长度与高斯过程的个数相同,各高斯过程的特征长度尺度各不相同;基于第一总数和第一原始表示,得到第三总数个第一特征表示;其中,第三总数为第一总数和第二总数的乘积。因此,在若干高斯过程中,分别采样第二总数次,得到分别表征第二总数个动作帧的第一原始表示,且第一原始表示的长度与高斯过程的个数相同,各高斯过程的特征长度尺度各不相同,基于此再基于第一总数和第一原始表示,得到第三总数个第一特征表示,且第三总数为第一总数和第二总数的乘积,由于各高斯 过程的特征长度尺度各不相同,且每次对高斯过程采样均能够得到各个动作帧的特征信息,故能够提升各个第一特征表示的准确性。
在一些实施例中,第二特征表示基于目标动作类别映射得到。因此,基于对目标动作类别进行映射得到第二特征表示,故仅需对文本信息进行映射等简单处理即可得到第二特征表示,有利于大大降低驱动动作生成的复杂度。
在一些实施例中,获取分别表征若干个体关于目标动作类别的第二特征表示,包括:对目标动作类别进行嵌入表示,得到第二原始表示;基于第一总数和第二原始表示,得到第一总数个第二特征表示。因此,对目标动作类别进行嵌入表示,得到第二原始表示,并基于第一总数和第二原始表示,得到第一总数个第二特征表示,即通过对文本信息进行嵌入表示并结合第一总数进行相关处理,即可得到第一总数个第二特征表示,有利于大大降低获取第二特征表示的复杂度。
在一些实施例中,第一特征表示和第二特征表示均融合有位置编码;其中,在若干个体为单个个体的情况下,位置编码包括时序位置编码,在若干个体为多个个体的情况下,位置编码包括个体位置编码和时序位置编码。因此,第一特征表示和第二特征表示均融合位置编码,在若干个体为单个个体的情况下,位置编码包括时序位置编码,在若干个体为多个个体的情况下,位置编码包括个体位置编码和时序位置编码,故能够在单个个体和多个个体两种应用场景下,采用不同的位置编码策略来区分不同特征表示,使得特征表示的位置编码各不相同,有利于提升特征表示的准确性。
在一些实施例中,动作序列由动作生成模型得到,且位置编码在动作生成模型的训练过程中,与动作生成模型的网络参数一同调整,直至动作生成模型训练收敛为止。因此,动作序列由动作生成模型得到,且位置编码在动作生成模型的训练过程中,与动作生成模型的网络参数一同调整,直至动作生成模型训练收敛为止,由于位置编码随网络模型一同训练,故能够提升位置编码的表示能力,而在训练收敛之后位置编码不再调整,即维持固定,从而能够加入强大的先验约束,从而能够在先验约束和表示能力两者之间达到平衡,进而能够进一步提升特征表示的准确性,有利于提升动作序列的生成效果。
在一些实施例中,动作帧中个体的动作表示包括:在动作帧中,个体的关键点的第一位置信息和个体的姿态信息,且姿态信息包括个体的若干关节点的第二位置信息。因此,动作帧中个体的动作表示包括:在动作帧中个体的关键点的第一位置信息和个体的位姿信息,且位姿信息包括个体的若干关节点的第二位置信息,故能够通过关键点和关节点两者的位置信息来表达个体动作,有利于提升动作表示的准确性。
在一些实施例中,动作序列由动作生成模型得到,且动作生成模型与鉴别模型通过生成对抗训练得到。因此,通过生成对抗训练来协同训练动作生成模型和鉴别模型,能够使动作生成模型和鉴别模型在协同训练过程中相互促进,彼此相辅相成,最终有利于提升动作生成模型的模型性能。
在一些实施例中,生成对抗训练的步骤包括:获取若干样本个体关于样本动作类别的样本动作序列;其中,样本动作序列包括预设数值个样本动作帧,且样本动作序列标注有样本标记,样本标记表示样本动作序列实际是否动作生成模型生成得到;分别对样本动作序列中各个样本动作帧进行分解,得到样本图数据;其中,样本图数据包括预设数值张节点图,节点图由节点连接形成,节点包括关键点和关节点,节点图包括各个节点的节点特征表示,且节点的位置特征表示由若干样本个体分别在对应节点处的位置特征表示拼接得到;基于鉴别模型对样本图数据和样本动作类别进行鉴别,得到预测结果;其中,预测结果包括样本动作序列的第一预测标记,第一预测标记表示样本动作序列经预测由动作生成模型生成的可能性,第二预测标记表示样本动作序列属于样本动作类别的可能性;基于样本标记、第一预测标记和第二预测标记,调整动作生成模型、鉴别模型中任一者的网络参数。因此,通过将样本动作表示分解为样本图数据,能够将动作序 列的鉴别巧妙地化解为图数据的鉴别,有利于大大降低训练复杂度以及鉴别模型的构建难度。
在一些实施例中,在样本动作序列为从真实场景采集得到的情况下,节点的位置特征表示按照若干样本个体的随机顺序,由若干样本个体分别在对应节点处的位置特征表示拼接得到。因此,节点的位置特征表示按照若干样本个体的随机顺序,由若干样本个体分别在对应节点处的位置特征表示拼接得到,从而在训练过程中使动作生成模型将不同排序而实际属于同一样本动作序列的情况视为不同样本,并对其进行建模,从而能够实现数据增强,进而有利于提升模型鲁棒性。
本申请实施例还提供了一种动作生成装置,包括:特征获取部分、关系建模部分和动作映射部分,特征获取部分,配置为获取分别表征若干个体在若干动作帧的第一特征表示,并获取分别表征若干个体关于目标动作类别的第二特征表示;关系建模部分,配置为基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示;其中,关系建模的类型与若干个体的第一总数相关;动作映射部分,配置为基于融合特征表示进行动作映射,得到若干个体关于目标动作类别的动作序列;其中,动作序列包括若干动作帧,且动作帧包含各个体的动作表示。本申请实施例还提供了一种电子设备,包括相互耦接的存储器和处理器,处理器用于执行存储器中存储的程序指令,以实现上述任意一种动作生成方法。
本申请实施例还提供了一种计算机可读存储介质,其上存储有程序指令,程序指令被处理器执行时实现上述任意一种动作生成方法。
本申请实施例还提供了一种计算机程序,包括计算机可读代码,当计算机可读代码在电子设备中运行时,电子设备中的处理器执行用于实现上述任意一种动作生成方法。
上述方案,获取分别表征若干个体在若干动作帧的第一特征表示,并获取分别表征若干个体关于目标动作类别的第二特征表示,在此基础上,基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示,且关系建模的类型与若干个体的第一总数相关,再基于融合特征表示进行动作映射,即可得到若干个体关于目标动作类别的动作序列,且动作序列包括若干动作帧,动作帧包括各个体的动作表示,故一方面无需依赖于人工即可自动生成动作,另一方面通过根据若干个体的第一总数来针对性地进行关系建模,能够兼容单个个体和多个个体两种应用场景。故此,能够在提升动作生成效率的前提下,兼容单个个体和多个个体两种应用场景。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本申请。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本申请的实施例,并与说明书一起用于说明本申请的技术方案。
图1是本申请实施例提供的一种动作生成方法的流程示意图;
图2是本申请实施例提供的一种动作生成方法的过程示意图;
图3a是本申请实施例提供的第一种动作序列的示意图;
图3b是本申请实施例提供的第二种动作序列的示意图;
图3c是本申请实施例提供的第三种动作序列的示意图;
图3d是本申请实施例提供的第四种动作序列的示意图;
图3e是本申请实施例提供的第五种动作序列的示意图;
图3f是本申请实施例提供的第六种动作序列的示意图;
图4是本申请实施例提供的一种动作生成模型的训练方法的流程示意图;
图5是本申请实施例提供的一种样本动作帧的获取示意图;
图6是本申请实施例提供的一种样本图数据的示意图;
图7是本申请实施例提供的一种动作生成装置的框架示意图;
图8是本申请实施例提供的一种电子设备的框架示意图;
图9是本申请实施例提供的一种计算机可读存储介质的框架示意图。
具体实施方式
下面结合说明书附图,对本申请实施例的方案进行详细说明。以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、接口、技术之类的具体细节,以便透彻理解本申请。本文中术语“系统”和“网络”在本文中常被可互换使用。本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。此外,本文中的“多”表示两个或者多于两个。
本申请实施例中,动作生成方法的执行主体可以是电子设备,例如,电子设备可以是终端设备、服务器或其它处理设备,其中,终端设备可以为用户设备(User Equipment,UE)、移动设备、用户终端、终端、蜂窝电话、无绳电话、个人数字助理(Personal Digital Assistant,PDA)、手持设备、计算设备、车载设备、可穿戴设备等。在一些实施例中,该动作生成方法可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。
请参阅图1,图1是本申请实施例提供的一种动作生成方法的流程示意图,可以包括如下步骤:
步骤S11:获取分别表征若干个体在若干动作帧的第一特征表示,并获取分别表征若干个体关于目标动作类别的第二特征表示。
在一个实施场景中,若干个体的第一总数以及目标动作类别可以在正式实施动作生成之前,由用户指定。示例性地,用户可以指定目标动作类别为“拥抱”,并指定若干个体的第一总数为两个;或者,用户可以指定目标动作类别为“跳舞”,并指定若干个体的第一总数为一个;或者,用户可以指定目标动作类别为“打架”,并指定若干个体的第一总数为三个。需要说明的是,以上示例仅仅是实际应用过程中几种可能的实施方式,并不因此而限定实际应用过程中的目标动作类别和若干个体的第一总数。
在另一个实施场景中,目标动作类别可以在正式实施动作生成之前由用户指定,而若干个体的第一总数可以基于目标动作类别自动分析得到。示例性地,用户可以指定目标动作类别为“击掌”,则基于该目标动作类别,可以自动分析得到若干个体的第一总数为两个;或者,用户可以指定目标动作类别为“交换物品”,则基于该目标动作类别可以自动分析得到若干个体的第一总数为两个;或者,用户可以指定目标动作类别为“搬运物品”,则基于该目标动作类别可以自动分析得到若干个体的第一总数为一个。需要说明的是,以上示例仅仅是实际应用过程中几种可能的实施方式,并不因此而限定实际应用过程中的目标动作类别和若干个体的第一总数。
在又一个实施场景中,目标动作类别可以在正式实施动作生成之前由用户指定,而若干个体的第一总数可以基于目标动作类别自动分析得到,且可以接受用户对自动分析得到的第一总数的修改指令,来更正自动分析得到的第一总数。示例性地,用户可以指定目标动作类别为“打架”,则基于该目标动作类别可以自动分析得到若干个体的第一总数为两个,并接受用户对自动分析得到的第一总数的修改指令,将其更正为四个;或者,用户可以指定目标动作类别为“散步”,则该基于目标动作类别可以自动分析得到若干个体的第一总数为一个,并接受用户对自动分析得到的第一总数的修改指令,将其更正为两个。需要说明的是,以上示例仅仅是实际应用过程中几种可能的实施方式,并不因此而限定实际应用过程中的目标动作类别和若干个体的第一总数。
需要说明的是,上述若干个体可以均为人。当然,也不排除若干个体同时包括人和动物。示例性地,可以指定目标动作类别为“遛狗”,则若干个体可以包括人和狗。
在一个实施场景中,若干动作帧的第二总数可以预先指定,示例性地,第二总数可以为10、15、20、30等等,在此不做限定。
在一个实施场景中,可以获取每个个体在每个动作帧的第一特征表示。例如,对于若干个体的第一总数为一个的情况而言(即对于单个个体的动作生成场景而言),可以获取该单个个体分别在各个动作帧的第一特征表示;或者,对于若干个体的第一总数为两个的情况而言(即对于两个个体的动作生成场景而言),可以获取每个个体分别在各个动作帧的第一特征表示,为了便于描述,可以将这两个个体分别称为“甲”和“乙”,则可以获取“甲”分别在各个动作帧的第一特征表示,并获取“乙”分别在各个动作帧的第一特征表示。其他情况可以以此类推,在此不再一一举例。
在一个实施场景中,需要说明的是,动作帧包含于本申请动作生成方法实施例最终期望生成的动作序列中,即在获取第一特征表示时,动作帧实际上并未生成,第一特征表示可以视为每个个体分别在各个动作帧中初始化的特征表示。例如,可以基于高斯过程采样得到第一特征表示。需要说明的是,高斯过程是概率论和数理统计中随机过程的一种,是一系列服从正态分布的随机变量在一指数集内的组合,高斯过程的具体含义,可以参阅高斯过程的技术细节。
在一个实施场景中,可以在若干高斯过程中,分别采样第二总数次,得到分别表征第二总数个动作帧的第一原始表示,且第一原始表示的长度与高斯过程的个数相同,各高斯过程的特征长度尺度各不相同。在此基础上,再基于第一总数和第一原始表示,得到第三总数个第一特征表示,且第三总数为第一总数和第二总数的乘积。示例性地,为了便于描述,可以将若干动作帧的第二总数记为T,若干高斯过程的特征长度尺度σ c可以分别取值1、10、100、1000,则在特征长度尺度σ c为1的高斯过程采样T次,得到一个长度为T的一维向量,以此类推,在特征长度尺度σ c为10、100、1000的高斯过程上均可以采样得到长度为T的一维向量,将上述4个高斯过程分别采样得到的长度为T的一维特征向量上相同位置元素进行组合,即可得到T个长度为4的第一原始表示,且这T个第一原始表示分别与T个动作帧一一对应,即第一个第一原始表示对应于第一个动作帧,第二个第一原始表示对应于第二个动作帧,……,第T个第一原始表示对应于第T个动作帧。此外,请结合参阅图2,图2是本申请实施例提供的一种动作生成方法的过程示意图。如图2所示,为了便于描述,可以将上述采样得到的第一原始表示的长度记为C 0,故上述分别表征若干动作帧的第一原始表示可以记为(T,C 0)。在此基础上,可以对上述第一原始表示(T,C 0)进行输入映射(如,可以采样多层感知机对第一原始表示进行映射),以改变原第一原始表示(T,C 0)的维度。此外,映射之后的第一原始表示的个数仍为T个。上述方式,由于各高斯过程的特征长度尺度各不相同,且每次对高斯过程采样均能够得到各个动作帧的特征信息,故能够提升各个第一特征表示的准确性。
在一个实施场景中,在得到分别表征第二总数个动作帧的第一原始表示之后,可以基于第一总数等于一还是大于一,确定是否对表征各个动作帧的第一原始表示进行复制,以此来得到各个动作帧中若干个体的第一原始表示。例如,在第一总数等于一的情况下,可以确定动作生成为单个个体的场景,则可以直接将前述采样得到的表征各个动作帧的第一原始表示,作为各个动作帧中该单个个体的第一原始表示;或者,在第一总数大于一的情况下,可以确定动作生成为多个个体的场景,则可以将前述采样得到的表征各个动作的第一原始表示分别复制第一总数次,得到各个动作帧中多个个体的第一原始表示,如在第一总数为2的情况下,可以将表征第1个动作帧的第一原始表示复制为两个第一原始表示,这两个第一原始表示分别表示在第1个动作帧中这两个个体的第一原始表示,其他情况可以以此类推,在此不再一一举例。
在一个实施场景中,请继续结合参阅图2,为了在单个个体以及多个个体的情况下区分不同第一原始表示,可以在第一原始表示的基础上编码各个第一原始表示的位置信息,得到对应的第一特征表示,也就是说,第一特征表示融合有位置编码,且各个位置编码各不相同。例如,在若干个体为单个个体的情况下,位置编码包括时序位置编码,也就是说,在若干个体为单个个体的情况下,主要通过对不同时序的动作帧进行编码来区分不同第一原始表示,从而得到第一特征表示。示例性地,仍以T个动作帧为例,在单个个体的情况下,可以分别对这T个动作帧的第一原始表示融入时序位置编码(如,1、2、……、T),从而得到分别表征单个个体在这T个动作帧的第一特征表示。类似地,在若干个体为多个个体的情况下,位置编码可以包括时序位置编码和个体位置编码,也就是说,在若干个体为多个个体的情况下,不仅要对不同时序的动作帧进行编码,还要对每个动作帧中多个个体进行编码,以此来区分不同第一原始表示,从而得到第一特征表示(如图2位置编码后的虚线框所示)。示例性地,仍以T个动作帧为例,在多个个体的情况下,可以对第1动作帧的第一原始表示融入时序位置编码(如,1),并进一步对第1个动作帧多个个体融入个体位置编码(如,1、2、……),从而组合时序位置编码和个体位置编码作为位置编码,使得表征多个个体在第1个动作帧的第一特征表示分别融合有不同位置编码(如,1-1、1-2、……);类似地,可以对第2个动作帧的第一原始表示融入时序位置编码(如,2),并进一步对第1个动作帧多个个体融入个体位置编码(如,1、2、……),从而组合时序位置编码和个体位置编码作为位置编码,使得表征多个个体在第2个动作帧的第一特征表示分别融合有不同位置编码(如,2-1、2-2、……),其他动作帧可以以此类推,在此不再一一举例。此外,上述位置编码仅仅作为示例,在实际应用过程中,可以预先训练一个动作生成模型,且位置编码可以在动作生成模型的训练过程中,与动作生成模型的网络参数一同调整,直至动作生成模型训练收敛为止,自此之后,在后续应用过程中,即可使用调整好的位置编码。上述方式,能够在单个个体和多个个体两种应用场景下,采用不同的位置编码策略来区分不同特征表示,使得特征表示的位置编码各不相同,有利于提升特征表示的准确性。
在一个实施场景中,与第一特征表示类似地,对于第二特征表示而言,可以获取每个个体关于目标动作类别的第二特征表示。例如,对于若干个体的第一总数为一个的情况而言(即对于单个个体的动作生成场景而言),可以获取该单个个体关于目标动作类别的第二特征表示;或者,对于若干个体的第一总数为两个的情况而言(即对于两个个体的动作生成场景而言),可以获取每个个体分别关于目标动作类别的的第二特征表示,为了便于描述,可以将这两个个体分别称为“甲”和“乙”,则可以获取“甲”关于目标动作类别的第二特征表示,并获取“乙”关于目标动作类别的第二特征表示。其他情况可以以此类推,在此不再一一举例。
在一个实施场景中,如前所述,目标动作类别可以由用户指定,则在确定目标动作类别之后,可以基于该目标动作类别进行映射得到第二特征表示。
在一个实施场景中,可以对目标动作类别进行嵌入表示,得到第二原始表示,在此基础上,再基于第一总数和第二原始表示,得到第一总数个第二特征表示。需要说明的是,上述嵌入表示的作用是将目标动作类别转换为向量。示例性地,可以预先设置不同动作类别的类别向量,如不同动作类别总计有26个的情况下,可以预先设置26个动作类别的类别向量(如,每个类别向量的长度可以为200),则在确定目标动作类别之后,可以其中与目标动作类别一致的动作类别的类别向量,作为该目标动作类别的第二原始表示;或者,也可以先将目标动作类别进行独热(one-hot)编码,再利用全连接层进行线性变换,得到该目标动作类别的第二原始表示,如不同动作类别总计有26个的情况下,可以先将目标动作类别进行独热(one-hot)编码为26维向量,而上述全连接层的线性变换可以视为N(如,200)*26的变换矩阵,再将该矩阵与26维的独热编码相乘即可得到 该目标动作类别的第二原始表示。
在一个实施场景中,与获取第一特征表示类似地,在得到表征目标动作类别的第二原始表示之后,也可以基于第一总数等于一还是大于一,确定是否对该第一原始表示进行复制,以此来得到若干个体分别关于目标动作类别的第二原始表示。例如,在第一总数等于一的情况下,可以确定动作生成为单个个体的场景,则可以直接将前述采样得到的表征目标动作类别的第二原始表示,作为该单个个体关于该目标动作类别的第二原始表示;或者,在第一总数大于一的情况下,可以确定动作生成为多个个体的场景,则可以将前述表征目标动作类别的第二原始表示复制第一总数次,得到多个个体分别关于目标动作类别的第二原始表示,如在第一总数为2的情况下,可以将表征目标动作类别的第二原始表示复制为两个第二原始表示,这两个第二原始表示分别表示这两个个体关于目标动作类别的第二原始表示,其他情况可以以此类推,在此不再一一举例。
在一个实施场景中,请继续结合参阅图2,与获取第一特征表示类似地,为了在单个个体以及多个个体的情况下区分不同第二原始表示,可以在第二原始表示的基础上编码各个第二原始表示的位置信息,得到对应的第二特征表示。也就是说,与第一特征表示类似地,第二特征表示也融合有位置编码,且各个位置编码各不相同。需要说明的是,不仅各个第二特征表示融合的位置编码各不相同,而且第二特征表示融合的位置编码与第一特征表示融合的位置编码也各不相同。例如,在若干个体为单个个体的情况下,位置编码包括时序位置编码,也就是说,在若干个体为单个个体的情况下,第二特征表示可以通过在时序维度与不同动作帧的第一特征表示进行区分。示例性地,仍以T个动作帧为例,在单个个体的情况下,可以分别对这T个动作帧的第一原始表示融入时序位置编码(如,1、2、……、T),从而得到分别表征单个个体在这T个动作帧的第一特征表示,则可以对目标动作类别的第二原始表示融入时序位置编码(如,T+1),从而得到单个个体关于该目标动作类别的第二特征表示。类似地,在若干个体为多个个体的情况下,位置编码可以包括时序位置编码和个体位置编码,也就是说,在若干个体为多个个体的情况下,需要同时在时序维度和个体维度进行区分(如图2位置编码后的虚线框所示)。示例性地,仍以T个动作帧为例,在多个个体的情况下,可以对多个个体关于目标动作类别的第二原始表示先融入时序位置编码(如,T+1),并进一步对第1个个体关于目标动作类别的第二原始表示融入个体位置编码(如,1),进一步对第2个个体关于目标动作类别的第二原始表示融入个体位置编码(如,2),以此类推,从而组合时序位置编码和个体位置编码,使得表征多个个体关于目标动作类别分别融合有不同位置编码(如,T+1-1,T+1-2、……)。此外,上述位置编码仅仅作为示例,在实际应用过程中,可以预先训练一个动作生成模型,且位置编码可以在动作生成模型的训练过程中,与动作生成模型的网络参数一同调整,直至动作生成模型训练收敛为止,自此之后,在后续应用过程中,即可使用调整好的位置编码。上述方式,能够在单个个体和多个个体两种应用场景下,采用不同的位置编码策略来区分不同特征表示,使得特征表示的位置编码各不相同,有利于提升特征表示的准确性。
在一个实施场景中,如前所述,第一特征表示和第二特征表示均融合有位置编码,且在若干个体为单个个体的情况下,位置编码包括时序位置编码,在若干个体为多个个体的情况下,位置编码包括个体位置编码和时序位置编码,可以结合参阅图2以及上述描述。进一步地,为了便于区分,各位置编码可以各不相同。以T个动作帧和P个(P等于1,或者,P大于1)个体为例,经上述操作最终可以得到(T+1)*P的特征表示,其中,包含T*P个分别表征各个动作帧中各个个体的第一特征表示,以及P个分别表征各个个体关于目标动作类别的第二特征表示。
需要说明的是,对于每个个体而言,可以将该个体分别在若干动作帧的第一原始表示和关于目标动作类别的第二原始表示,作为该个体在不同时序的原始时序表示。仍以T 个动作帧而言,则对于第p个个体而言,其在T个动作帧的第一原始表示和关于目标动作类别的第二原始表示,可以视为其在第1个时序至第T+1个时序的原始时序表示。在此基础上,在单个个体的动作生成场景下,在时序t的范围为1至T的情况下,可以将第t个时序的时序位置编码TPE t与第t个原始时序表示相加,得到第t个时序的第一特征表示,在时序t为T+1的情况下,可以将第t个时序的时序位置编码TPE t与第t个原始时序表示相加,得到第t个时序的第二特征表示。类似地,在多个个体的动作生成场景下,可以先将第p个个体的个体位置编码TPE t与第t个时序的时序位置编码拼接,得到第p个个体在第t个时序的位置编码PE(t,p)=concat(TPE t,PPE p),其中,concat表示拼接操作,则在时序t的范围为1至T的情况下,可以将第p个个体在第t个时序的位置编码PE(t,p)与第p个个体在第t个时序的原始时序表示相加,得到第p个个体在第t个时序的第一特征表示,而在时序t为T+1的情况下,可以将p个个体在第t个时序的位置编码PE(t,p)与第p个个体在第t个时序的原始时序表示相加,得到第p个个体在第t个时序的第二特征表示。此外,除了上述时序位置编码与个体位置编码的组合式编码之外,也可以不区分时序位置编码和个体位置编码,而采用完全独立的固定编码,即对于T个动作帧以及P个个体的动作生成场景而言,可以预先设置(T+1)×P个独立的位置编码。
步骤S12:基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示。
本申请实施例中,关系建模的类型与若干个体的第一总数相关,例如,在若干个体的第一总数为单个的情况下,关系建模包括建模各动作帧之间的时序关系,从而通过建模时序关系提升动作帧之间的时序连贯性,有利于提升动作序列的真实性,在若干个体的第一总数为多个的情况下,关系建模包括建模各动作帧中若干个体之间的交互关系和建模各动作帧之间的时序关系,从而通过建模交互关系提升个体之间的交互合理性,以及能够通过建模时序关系提升动作帧之间的时序连贯性,有利于提升动作序列的真实性。
在一个实施场景中,在若干个体的第一总数为单个的情况下,仅需建模时序关系,在此情况下,可以直接选择该单个个体作为目标个体,并将目标个体对应的第一特征表示和第二特征表示,作为目标个体在不同时序的时序特征表示。示例性地,仍以T个动作帧为例,可以将第1个动作帧中该目标个体的第一特征表示作为第一个时序特征表示,将第2个动作帧中该目标个体的第一特征表示作为第二个时序特征表示,……,将第T个动作帧中该目标个体的第一特征表示作为第T个时序特征表示,以及将该目标个体关于目标动作类别的第二特征表示作为第T+1个时序特征表示。在此基础上,可以分别选择各个时序分别作为当前时序,并选择当前时序的时序特征表示作为当前时序特征表示,并基于各个参考时序表示分别与当前时序表示的相关度,得到当前时序表示对应的融合特征表示。也就是说,在将第i个时序特征表示作为当前时序表示的情况下,可以将目标个体在各时序(即1至T+1)的时序特征表示作为参考时序表示,并基于这些参考时序表示分别与第i个时序特征表示之间的相关度,得到第i个时序特征表示对应的融合特征表示,从而在单个个体的动作生成场景中,最终可以得到T+1个融合特征表示,这T+1个融合特征表示包括:该单个个体融合时序关系之后在分别T个动作帧的特征表示,以及该单个个体融合时序关系之后关于目标动作类别的特征表示。需要说明的是,为了便于与后续交互关系的建模步骤加以区分,在时序建模中,当前时序可以命名为第一当前时序,当前时序的时序特征表示可以命名为第一当前时序表示,参考时序表示可以命名为第一参考时序表示。
在一个实施场景中,如前所述,为了提升动作生成效率,可以预先训练一个动作生成模型,且该动作生成模型可以包括关系建模网络,关系建模网络可以进一步包括时序建模子网络。示例性地,时序建模子网络可以基于Transformer构建,为了便于描述可以将时序建模子网络所包含的Transformer称为T-Former,则对于前述T+1个时序特征表示 可以先分别经过线性变换,得到每个时序特征表示对应的{查询、键、值}特征表示。以第t个时序特征表示F t为例,经线性变换可以得到对应的{查询、键、值}特征表示q t,k t,v t
q_t = W_q·F_t，k_t = W_k·F_t，v_t = W_v·F_t ……(1)
上述公式(1)中，W_q、W_k、W_v
分别表示线性变换参数,且可以在动作生成模型的训练过程中调整。在此基础上,在选择第t个时序特征表示作为当前时序表示的情况下,可以获取第t个时序特征表示对应的查询特征表示分别与第t′(取值范围为1至T+1)个时序特征表示的键特征表示之间的相关度w t,t′
w_{t,t′} = q_t · k_{t′} ……(2)
在得到相关度w_{t,t′}之后，即可基于该相关度w_{t,t′}对第t′（取值范围为1至T+1）个时序特征表示的值特征表示v_{t′}进行加权求和，得到第t个时序特征表示融合时序关系之后的融合特征表示H_t。
在一个实施场景中,时序建模子网络可以由L(L大于或等于1)层Transformer堆叠形成,在此基础上,在得到第l层Transformer输出的融合特征表示
Figure PCTCN2022135160-appb-000003
之后,可以将其作为第l+1层Transformer的输入,并重新执行前述时序建模过程,得到第l+1层Transformer输出的融合特征表示
Figure PCTCN2022135160-appb-000004
以此类推,最终可将最后一层Transformer输出的融合特征表示
Figure PCTCN2022135160-appb-000005
作为最终的融合特征表示。此外,在得到最终的融合特征表示
Figure PCTCN2022135160-appb-000006
之后,由于第1至第T个最终的融合特征表示已经充分融入目标动作类别,故在后续步骤S13动作生成前,可以将与目标动作类别相关的第T+1个最终的融合特征表示
Figure PCTCN2022135160-appb-000007
丢弃。
在一个实施场景中,在若干个体为多个个体的情况下,需建模时序关系和交互关系,且交互关系和时序关系可以先后建模,示例性地,可以先建模交互关系,再建模时序关系;或者,也可以先建模时序关系,再建模交互关系。此外,在先建模关系的输出特征表示为在后建模关系的输入特征表示。也就是说,在关系建模包括建模交互关系和时序关系的情况下,可以基于第一特征表示和第二特征表示建模在先关系,得到在先关系的输出特征表示,再基于输出特征表示建模在后关系,得到融合特征表示。需要说明的是,在先关系为交互关系,在后关系为时序关系,或者,在先关系为时序关系,在后关系为时序关系。
在一个实施场景中,如前所述,为了提升动作生成效率,可以预先训练一个动作生成模型,且该动作生成模型可以包括关系建模网络,关系建模网络可以包括时序建模子网络和交互建模子网络。示例性地,时序建模子网络和交互建模子网络可以均基于Transformer构建,为了便于描述可以将时序建模子网络所包含的Transformer称为T-Former,并将交互建模子网络所包含的Transformer称为I-Former。与前述单个个体的动作生成场景类似地,在多个个体的动作生成场景中,也可以选择其中一个个体作为目标个体。示例性地,可以选择P个个体中第p个个体作为目标个体。在此基础上,可以将目标个体对应的第一特征表示和第二特征表示,作为该目标个体分别在不同时序的时序特征表示。为了便于区分,该目标个体分别在T个动作帧的第一特征表示和关于目标动作类别的第二特征表示,分别视为在时序1至时序T+1的时序特征表示,则对于前述T+1个时序特征表示可以先分别经过线性变换,得到每个时序特征表示对应的{查询、键、值}特征表示。以第p个个体选择作为目标个体为例,其第t个时序特征表示
F_t^p，经线性变换可以得到对应的{查询、键、值}特征表示 q_t^p、k_t^p、v_t^p。
则在先构建交互关系时,与前述构建时序关系类似地,在得到目标个体在不同时序的时序特征表示之后,可以选择各个时序分别作为当前时序,并选择当前时序的时序特 征表示作为当前时序表示,以及基于各个参考时序表示分别与当前时序表示的相关度,得到当前时序表示对应的融合特征表示,与构建时序关系不同的是,在建模交互关系的情况下,参考时序表示包括各个体分别在当前时序的时序特征表示。需要说明的是,为了与前述时序关系的建模步骤加以区分,当前时序可以命名为第二当前时序,第二当前时序的时序特征表示可以命名为第二当前时序表示,参考时序表示可以命名第二参考时序表示。具体而言,可以将第t个时序作为参考时序,各个个体分别在参考时序的时序特征表示即为各个个体分别在参考时序的键特征表示
k_t^{p′}，其中，p′的取值范围为1至P。在此情况下，相关度可以表示为 w_t^{p,p′} = q_t^p · k_t^{p′}。进一步地，可以基于该相关度 w_t^{p,p′}，对第p′（p′取值范围为1至P）个个体在第t时序的值特征表示 v_t^{p′} 进行加权，得到第t个时序时第p个个体的时序特征表示融合交互关系之后的融合特征表示 H_t^p。
在一个实施场景中,在得到各个个体分别在各个时序融合交互关系之后的融合特征表示
Figure PCTCN2022135160-appb-000016
之后,可以如前所述,将这些融合特征表示作为构建时序关系的输入特征表示,以继续构建时序关系。时序关系的构建过程可以参阅前述相关描述。
在一个实施场景中,请结合参阅图2,用于构建交互关系的I-Former和用于构建时序关系的T-Former可以合为一组Transformer,以共同构建交互关系和时序关系,则关系构建网络可以包括L组Transformer。在此基础上,对于第t个时序的动作帧中第p个个体而言,可以将第l组Transformer输出的融合特征表示
Figure PCTCN2022135160-appb-000017
之后,可以将其作为第l+1层Transformer的输入,并重新执行前述时序建模过程,得到第l+1层Transformer输出的融合特征表示
Figure PCTCN2022135160-appb-000018
以此类推,最终可以将最后一层Transformer输出的融合特征表示
Figure PCTCN2022135160-appb-000019
作为最终的融合特征表示。此外,在得到最终的融合特征表示
Figure PCTCN2022135160-appb-000020
之后,由于第1至第T个最终的融合特征表示已经充分融入目标动作类别,故在后续步骤S13动作生成前,可以将与目标动作类别相关的第T+1个最终的融合特征表示
Figure PCTCN2022135160-appb-000021
丢弃。此外,请结合参阅表1,表1是动作生成模型一实施例的结构示意表。如表1所示,动作生成模型示例性地,可以包含2组Transformer。当然,也可以设置3组Transformer、4组Transformer、或5组Transformer等等,在此不做限定。需要说明的是,输入映射层和类别嵌入层的具体含义,可以分别参阅前述第一特征表示、第二特征表示的具体获取过程。此外,表1所示的动作生成模型仅仅是实际应用过程中一种可能的实施方式,在此对动作生成模型的具体结构不做限定。例如,表1所示的各网络层的输入/输出通道数也可以根据实际应用需要进行适应性调整。
表1 动作生成模型一实施例的结构示意表
Figure PCTCN2022135160-appb-000022
Figure PCTCN2022135160-appb-000023
需要说明的是,由前述实施方式可知,无论是建模时序关系,还是建模交互关系,两者的建模过程趋于类似,即均可以先选择个体作为目标个体,并将目标个体对应的第一特征表示和第二特征表示,作为目标个体在不同时序的时序特征表示,再选择各个时序分别作为当前时序,以及选择当前时序的时序特征表示作为当前时序表示,再基于各个参考时序表示分别与当前时序表示的相关度,得到当前时序表示对应的融合特征表示。两者的不同之处在于,在建模时序关系的情况下,参考时序表示包括目标个体在各时序的时序特征表示,在建模交互关系的情况下,参考时序表示包括各个体分别在当前时序的时序特征表示。故此,能够通过相似的建模流程来建模时序关系和交互关系,故能进一步提升单个个体和多个个体两种应用场景的兼容性。
步骤S13:基于融合特征表示进行动作映射,得到若干个体关于目标动作类别的动作序列。
本申请实施例中,动作序列包括若干动作帧,且动作帧包含各个体的动作表示。示例性地,动作序列可以包括T个动作帧,若干个体为P个个体,则每个动作帧中包含P个个体的动作表示,故此可以生成得到时序连续的三维动作。
在一个实施场景中,如前所述,为了提升动作生成效率,可以预先训练动作生成模型,且动作生成模型可以包括动作映射网络,如表1所示,动作映射网络可以包括诸如全连接层等线性层,在此对动作映射网络的具体结构不做限定。在此基础上,可以将各个动作帧中各个个体的融合特征表示输入至动作映射网络,即可得到若干个体关于目标动作类别的动作序列。以T个动作帧以及P个个体为例,可以得到T*P个融合特征表示,则可以将上述T*P个融合特征表示输入至动作映射网络,即可得到T个动作帧,且每个动作帧包含P个个体的动作表示,从而可以将T个动作帧按时序先后顺序组合,得到动作序列。为了便于描述,动作序列可以表示为{M t|t∈[1,…,T]},M t表示第t个动作帧,且每个动作帧M t包含P个个体的动作表示,即
Figure PCTCN2022135160-appb-000024
在一个实施场景中,动作帧中个体的动作表示可以包括:在动作帧中,个体的关键点(如,胯部)的第一位置信息和个体的姿态信息,且姿态信息可以包括个体的若干关节点(如,左肩、右肩、左肘、右肘、左膝、右膝、左脚、右脚等等)的第二位置信息。示例性地,以第t个动作帧中第p个个体为例,第一位置信息可记为
Figure PCTCN2022135160-appb-000025
其可以关键点在局部坐标系中的绝对位置,姿态信息可以记为
Figure PCTCN2022135160-appb-000026
其可以包括局部坐标系中各关节点的位置坐标。示例性地,动作序列中每一动作帧可以表示为大小为(P,C)的张量,即动作帧中每个个体的动作表示可以以C维向量表示。基于此,动作序列可以表示为大小为(P,T,C)的张量。上述姿态信息可以表达为Skinned Multi Person Model(SMPL)中的姿态表示,SMPL是一种广泛使用的参数化人体模型,其含义可以参阅SMPL的技术细节。
在一个实施场景中,请结合参阅图3a至图3f,图3a为在目标动作类别为“祝酒”时所生成的动作序列,图3b为目标动作类别为“照相”时所生成的动作序列,图3c为目标动作类别为“搀扶”时所生成的动作序列,图3d为目标动作类别为“突袭”时所生成的动作序列,图3e为目标动作类别为“伸展”时所生成的动作序列,图3f为目标动作类别为“跳舞”时所述生成的动作序列。
在一个实施场景中,如图3a至图3f所示,动作生成模型所生成的动作序列仅包含各个个体分别在各个动作帧的动作表示,而不包含各个个体的外形以及动作场景,故在得到动作序列之后,可以根据需要自由设计各个个体的外形(如,发型、穿着、发色等),也可以根据需要自由设计动作场景(如,街巷、商场、公园等)。示例性地,在确定目标 动作类别为“照相”且若干个体的第一总数为2个之后,可以通过前述过程生成得到如图3b所示的动作序列,在此基础上,可以设计图3b中左侧个体的外形(如,短发、衬衫、短裤、黑发等)以及右侧个体的外形(长发、连衣裙、黑发等),并可以设计动作场景为“公园”,从而可以进一步丰富得到动画,进而一方面能够提升设计灵活性,另一方面也能够大大减轻创作工作量。
上述方案,获取分别表征若干个体在若干动作帧的第一特征表示,并获取分别表征若干个体关于目标动作类别的第二特征表示,在此基础上,基于第一特征表示和第二特征表示进行关系建模,得到各动作帧中各个体的融合特征表示,且关系建模的类型与若干个体的第一总数相关,再基于融合特征表示进行动作映射,即可得到若干个体关于目标动作类别的动作序列,且动作序列包括若干动作帧,动作帧包括各个体的动作表示,故一方面无需依赖于人工即可自动生成动作,另一方面通过根据若干个体的第一总数来针对性地进行关系建模,能够兼容单个个体和多个个体两种应用场景。故此,能够在提升动作生成效率的前提下,兼容单个个体和多个个体两种应用场景。
请参阅图4,图4是本申请实施例提供的一种动作生成模型的训练方法的流程示意图。如前所述,动作序列由动作生成模型得到,为了提升训练效果,动作生成模型可以与鉴别模型通过生成对抗训练得到,训练过程可以包括如下步骤:
步骤S41:获取若干样本个体关于样本动作类别的样本动作序列。
本申请实施例中,样本动作序列包括预设数值个样本动作帧,且样本动作序列标注有样本标记,样本标记表示样本动作序列实际是否动作生成模型生成得到。例如,样本动作序列可以是由动作生成模型生成得到,也可以是在真实场景采集得到。
在一个实施场景中,请结合参阅图5,图5是本申请实施例提供的一种样本动作帧的获取示意图。如图5所示,可以获取若干样本关于样本动作类别的样本拍摄图像,即可以对真实个体演示样本动作类别进行拍摄。在此基础上,可以提取样本拍摄图像中各个样本个体的样本动作表示,如每个样本个体的样本动作表示可以包括样本个体的关键点和若干关节点的位置信息。在此基础上,每张样本拍摄图像可以表示为一个样本动作帧,且每个样本动作帧中每个样本个体的样本动作表示,与前述实施例中动作表示类似地,可以以C维向量表示。
步骤S42:分别对样本动作序列中各个样本动作帧进行分解,得到样本图数据。
本申请实施例中,样本图数据包括预设数值张节点图,节点图由节点连接形成,节点包括关键点和关节点,节点图包括各个节点的节点特征表示,且节点的位置特征表示由若干样本个体分别在对应节点处的位置特征表示拼接得到。仍以每个样本个体的样本动作表示包括该样本个体的关键点和若干关节点的位置信息为例,则该样本动作表示的C维向量可以分解为K个D维向量(如前所述,该向量表示位置信息,如位置坐标等),且C=K×D,其中,K为样本个体的关键点和若干关节点的总数,如每个样本个体的关键点和若干关节点的总数为18个。
在一个实施场景中,请参阅图6,图6是本申请实施例提供的一种样本图数据的示意图。如图6所示,对于单个样本个体的场景而言,每张节点图仅需表示单个样本个体即可,故每张节点图由K个节点连接形成,且节点图上每个节点由该节点的D维向量表达,故每张节点图可以表示为大小为(K,D)的张量,基于此样本图数据可以表示为大小为(T,K,D)的张量。
在另一个实施场景中,与单个样本个体的场景不同的是,在多个样本个体的场景中,每张节点图需要表示多个样本个体,此时每张节点图仍然由K个节点连接形成,但节点图上每个节点由多个样本个体在该节点的D维向量拼接得到,即每张节点图可以表示为大小为(K,P·D)的张量,基于此样本图数据可以表示为大小为(T,K,P·D)的张量。此外,对于样本动作序列中多个样本个体而言,如果排序不同,有可能导致后续鉴别模型的预 测结果也不同,从而给模型训练带来不确定性,为弥补这一不足,在样本动作序列为从真实场景采集得到的情况下,节点的位置特征表示按照若干样本个体的随机顺序,由若干样本个体分别在对应节点处的位置特征表示拼接得到,从而在训练过程中使动作生成模型将不同排序而实际属于同一样本动作序列的情况视为不同样本,并对其进行建模,从而能够实现数据增强,进而有利于提升模型鲁棒性。
步骤S43:基于鉴别模型对样本图数据和样本动作类别进行鉴别,得到预测结果。
在一个实施场景中,鉴别模型可以基于时空图卷积网络构建,示例性地,请参阅表2,表2是鉴别模型一实施例的结构示意表。需要说明的是,表2仅仅是实际应用过程中鉴别模型一种可能的实施方式,并不因此而限定鉴别模型的具体结构。此外,关于表2中时空卷积的具体含义,可以参阅时空卷积的相关技术细节。
表2 鉴别模型一实施例的结构示意表
Figure PCTCN2022135160-appb-000027
本申请实施例中,预测结果包括样本动作序列的第一预测标记和第二预测标记,第一预测标记表示样本动作序列经预测由动作生成模型生成的可能性,第二预测标记表示样本动作序列属于样本动作类别的可能性。需要说明的是,第一预测标记和第二预测标记可以采用数值来表示,且数值越大,对应的可能性越高。以鉴别模型采用表2所示的网络结构为例,可以将样本图数据记为x,经各层时空图卷积层处理之后,可以得到一个512维的向量φ(x),且样本动作类别经类别嵌入表示之后,也可以得到一个512维的向量y,两者进行内积得到φ(x)·y。进一步地,可以将向量φ(x)输入输出映射层,得到
Figure PCTCN2022135160-appb-000028
再结合前述内积φ(x)·y,即可得到鉴别模型对输入的样本动作类别、样本动作序列给出的分值,即前述第一预测标记和第二预测标记。
步骤S44:基于样本标记、第一预测标记和第二预测标记,调整动作生成模型、鉴别模型中任一者的网络参数。
在一些实施例中,通过第一预测标记和样本标记可以度量鉴别模型的鉴别损失,而通过第二预测标记和样本标记可以度量动作生成模型的生成损失,在训练过程中,可以每训练M次鉴别模型(此时调整鉴别模型的网络参数),就训练N次动作生成模型(此时调整动作生成模型的网络参数),如每训练4次鉴别模型,就训练1次动作生成模型,在此不做限定。在此基础上,通过训练鉴别模型,能够提升鉴别模型对样本动作序列的鉴别能力(即区分模型生成的样本动作序列和真实采集的样本动作序列的能力),以此可以促进动作生成模型提升生成动作序列的真实性,而通过训练动作训练模型,能够提升 动作生成模型生成动作序列的真实性(即模型生成的动作序列尽可能地接近真实采集的动作序列),从而又促使鉴别模型提升其鉴别能力,进而使得鉴别模型和动作生成模型相互促进,相辅相成,经若干轮训练之后,动作生成模型的模型性能越来越优秀,鉴别模型已经无法区分动作生成模型所生成的动作序列以及真实采集的动作序列,至此即可结束训练。需要说明的是,生成对抗训练的具体过程,可以参阅生成对抗训练的具体技术细节。此外,如前述实施例所述,在动作生成过程中,可以进行位置编码,且位置编码在动作生成模型的训练过程中,可以与动作生成模型的网络参数一同调整。
上述方案,通过生成对抗训练来协同训练动作生成模型和鉴别模型,能够使动作生成模型和鉴别模型在协同训练过程中相互促进,彼此相辅相成,最终有利于提升动作生成模型的模型性能;此外,通过将样本动作表示分解为样本图数据,能够将动作序列的鉴别巧妙地化解为图数据的鉴别,有利于大大降低训练复杂度以及鉴别模型的构建难度。
Referring to FIG. 7, FIG. 7 is a schematic framework diagram of an action generation apparatus 70 provided by an embodiment of the present application. The action generation apparatus 70 includes a feature obtaining part 71, a relationship modeling part 72 and an action mapping part 73. The feature obtaining part 71 is configured to obtain first feature representations respectively characterizing several individuals in several action frames, and to obtain second feature representations respectively characterizing the several individuals with respect to a target action category. The relationship modeling part 72 is configured to perform relationship modeling based on the first feature representations and the second feature representations to obtain a fused feature representation of each individual in each action frame, the type of relationship modeling being related to the first total number of the several individuals. The action mapping part 73 is configured to perform action mapping based on the fused feature representations to obtain an action sequence of the several individuals with respect to the target action category, the action sequence including the several action frames and each action frame containing the action representation of each individual.
With the above solution, on the one hand actions can be generated automatically without manual work; on the other hand, by performing relationship modeling targeted at the first total number of the several individuals, both the single-individual and the multi-individual application scenario can be supported while improving the efficiency of action generation.
In some embodiments, the type of relationship modeling being related to the first total number of the several individuals includes at least one of the following: when the first total number of the several individuals is one, the relationship modeling includes modeling the temporal relationship between the action frames; when the first total number of the several individuals is more than one, the relationship modeling includes modeling the interaction relationship between the several individuals in each action frame and modeling the temporal relationship between the action frames.
In some embodiments, the relationship modeling part 72 includes a temporal modeling sub-part. The temporal modeling sub-part includes a first selection unit, configured to select an individual as a target individual, to use the first feature representation and second feature representation corresponding to the target individual as temporal feature representations of the target individual at different time steps, to take each time step in turn as a first current time step, and to use the temporal feature representation at the first current time step as a first current temporal representation. The temporal modeling sub-part includes a first representation fusion unit, configured to obtain the fused feature representation corresponding to the first current temporal representation based on the relevance between each first reference temporal representation and the first current temporal representation, where the first reference temporal representations include the temporal feature representations of the target individual at each time step.
In some embodiments, the relationship modeling part 72 includes an interaction modeling sub-part. The interaction modeling sub-part includes a second selection unit, configured to select an individual as a target individual, to use the first feature representation and second feature representation corresponding to the target individual as temporal feature representations of the target individual at different time steps, to take each time step in turn as a second current time step, and to use the temporal feature representation at the second current time step as a second current temporal representation. The interaction modeling sub-part includes a second representation fusion unit, configured to obtain the fused feature representation corresponding to the second current temporal representation based on the relevance between each second reference temporal representation and the second current temporal representation, where the second reference temporal representations include the temporal feature representations of each individual at the second current time step.
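A minimal sketch of this relevance-based fusion is given below; a plain scaled dot-product attention is assumed here as the relevance measure, which the application itself does not prescribe. Taking the references from the same individual across all time steps models the temporal relationship, while taking them from all individuals at the same time step models the interaction relationship:

    import torch
    import torch.nn.functional as F

    def relevance_fuse(query, references):
        """query: (C,), references: (N, C) -> fused (C,) weighted by relevance to the query."""
        scores = references @ query / query.shape[-1] ** 0.5      # relevance of each reference to the query
        return F.softmax(scores, dim=0) @ references              # relevance-weighted fusion

    feats = torch.randn(2, 30, 64)                                # (P individuals, T time steps, channels)
    temporal_fused = relevance_fuse(feats[0, 5], feats[0])        # references: same individual, all time steps
    interact_fused = relevance_fuse(feats[0, 5], feats[:, 5])     # references: all individuals, same time step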
In some embodiments, when the relationship modeling includes modeling both the interaction relationship and the temporal relationship, the relationship modeling part 72 includes a prior modeling sub-part, configured to model a prior relationship based on the first feature representations and the second feature representations to obtain an output feature representation of the prior relationship, and a subsequent modeling sub-part, configured to model a subsequent relationship based on the output feature representation to obtain the fused feature representation, where the prior relationship is the interaction relationship and the subsequent relationship is the temporal relationship, or the prior relationship is the temporal relationship and the subsequent relationship is the interaction relationship.
In some embodiments, the action sequence is obtained by an action generation model, the action generation model includes a relationship modeling network, and the relationship modeling network includes a temporal modeling sub-network for modeling the temporal relationship and an interaction modeling sub-network for modeling the interaction relationship.
In some embodiments, the first feature representations are obtained by sampling Gaussian processes.
In some embodiments, the feature obtaining part 71 includes a first obtaining sub-part. The first obtaining sub-part includes a process sampling unit, configured to sample each of several Gaussian processes a second total number of times to obtain first original representations respectively characterizing the second total number of action frames, where the length of a first original representation equals the number of Gaussian processes and the characteristic length scales of the Gaussian processes differ from one another. The first obtaining sub-part includes a first obtaining unit, configured to obtain a third total number of first feature representations based on the first total number and the first original representations, where the third total number is the product of the first total number and the second total number.
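A minimal sketch of such sampling is given below; the RBF covariance and the concrete length scales are illustrative assumptions. Each Gaussian process, with its own characteristic length scale, contributes one channel of the T first original representations, which are then replicated for the P individuals to give P×T first feature representations:

    import numpy as np

    def sample_gp_noise(T, length_scales):
        """Sample one Gaussian process per length scale over T frames -> array of shape (T, len(length_scales))."""
        t = np.arange(T, dtype=float)[:, None]
        channels = []
        for ls in length_scales:
            cov = np.exp(-0.5 * (t - t.T) ** 2 / ls ** 2)                      # RBF covariance over frame indices
            channels.append(np.random.multivariate_normal(np.zeros(T), cov + 1e-6 * np.eye(T)))
        return np.stack(channels, axis=1)

    P, T = 2, 30
    raw = sample_gp_noise(T, length_scales=[1.0, 2.0, 4.0, 8.0])   # first original representations, one per frame
    first_feats = np.repeat(raw[None], P, axis=0)                  # P * T first feature representations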
In some embodiments, the second feature representations are obtained by mapping the target action category.
In some embodiments, the feature obtaining part 71 includes a second obtaining sub-part. The second obtaining sub-part includes an embedding representation unit, configured to perform embedding representation on the target action category to obtain a second original representation, and a second obtaining unit, configured to obtain the first total number of second feature representations based on the first total number and the second original representation.
In some embodiments, positional encodings are fused into both the first feature representations and the second feature representations, where when the several individuals are a single individual the positional encoding includes a temporal positional encoding, and when the several individuals are multiple individuals the positional encoding includes an individual positional encoding and the temporal positional encoding.
In some embodiments, the action sequence is obtained by an action generation model, and the positional encoding is adjusted together with the network parameters of the action generation model during its training until the training of the action generation model converges.
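A minimal sketch of such learned positional encodings is given below; the module name and dimensions are illustrative assumptions. Both encodings are ordinary trainable parameters, so they are updated together with the rest of the network during training; the individual encoding is only applied when more than one individual is present:

    import torch
    import torch.nn as nn

    class LearnedPositionalEncoding(nn.Module):
        def __init__(self, max_t=30, max_p=2, dim=64):
            super().__init__()
            self.time_pe = nn.Parameter(torch.zeros(1, max_t, dim))      # temporal positional encoding
            self.indiv_pe = nn.Parameter(torch.zeros(max_p, 1, dim))     # individual positional encoding

        def forward(self, feats):                 # feats: (P, T, dim) feature representations
            P, T, _ = feats.shape
            out = feats + self.time_pe[:, :T]
            if P > 1:                             # individual encoding only for multi-individual inputs
                out = out + self.indiv_pe[:P]
            return out

    encoded = LearnedPositionalEncoding()(torch.randn(2, 30, 64))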
In some embodiments, the action representation of an individual in an action frame includes: in the action frame, first position information of the individual's key point and pose information of the individual, the pose information including second position information of several joint points of the individual.
In some embodiments, the action sequence is obtained by an action generation model, and the action generation model is obtained together with a discrimination model through generative adversarial training.
In some embodiments, the action generation apparatus 70 includes a sample sequence obtaining part, configured to obtain a sample action sequence of several sample individuals with respect to a sample action category, where the sample action sequence includes a preset number of sample action frames and is annotated with a sample label indicating whether the sample action sequence was actually generated by the action generation model; a sample sequence decomposition part, configured to decompose each sample action frame in the sample action sequence to obtain sample graph data, where the sample graph data includes the preset number of node graphs, a node graph is formed by connecting nodes, the nodes include the key points and joint points of the sample individuals, the node graph includes a node feature representation of each node, and the position feature representation of a node is obtained by concatenating the position feature representations of the several sample individuals at the corresponding node; a sample sequence discrimination part, configured to discriminate the sample graph data and the sample action category based on the discrimination model to obtain a prediction result, where the prediction result includes a first prediction label and a second prediction label for the sample action sequence, the first prediction label indicating the predicted likelihood that the sample action sequence was generated by the action generation model and the second prediction label indicating the likelihood that the sample action sequence belongs to the sample action category; and a network parameter adjustment part, configured to adjust the network parameters of either the action generation model or the discrimination model based on the sample label, the first prediction label and the second prediction label.
In some embodiments, when the sample action sequence is captured from a real scene, the position feature representation of a node is obtained by concatenating the position feature representations of the several sample individuals at the corresponding node in a random order of the several sample individuals.
The above feature obtaining part 71, relationship modeling part 72 and action mapping part 73 may all be implemented based on a processor of an electronic device.
Referring to FIG. 8, FIG. 8 is a schematic framework diagram of an electronic device 80 provided by an embodiment of the present application. The electronic device 80 includes a memory 81 and a processor 82 coupled to each other, and the processor 82 is configured to execute program instructions stored in the memory 81 to implement the steps of any of the above action generation method embodiments. In some embodiments, the electronic device 80 may include, but is not limited to, a microcomputer or a server; in addition, the electronic device 80 may also include a mobile device such as a notebook computer or a tablet computer, which is not limited here. The processor 82 is configured to control itself and the memory 81 to implement the steps of any of the above action generation method embodiments. The processor 82 may also be referred to as a central processing unit (CPU). The processor 82 may be an integrated circuit chip with signal processing capability. The processor 82 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 82 may be implemented jointly by integrated circuit chips.
With the above solution, on the one hand actions can be generated automatically without manual work; on the other hand, by performing relationship modeling targeted at the first total number of the several individuals, both the single-individual and the multi-individual application scenario can be supported while improving the efficiency of action generation.
Referring to FIG. 9, FIG. 9 is a schematic framework diagram of a computer-readable storage medium 90 provided by an embodiment of the present application. The computer-readable storage medium 90 stores program instructions 901 executable by a processor, and the program instructions 901 are used to implement the steps of any of the above action generation method embodiments.
With the above solution, on the one hand actions can be generated automatically without manual work; on the other hand, by performing relationship modeling targeted at the first total number of the several individuals, both the single-individual and the multi-individual application scenario can be supported while improving the efficiency of action generation.
An embodiment of the present application further provides a computer program product carrying program code, and the instructions included in the program code may be used to implement the steps of any of the above action generation method embodiments. The computer program product may be implemented by hardware, software, or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium; in other embodiments, it is embodied as a software product, for example a software development kit (SDK).
In the several embodiments provided in this application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus implementations described above are merely illustrative; the division into modules or units is only a division by logical function, and there may be other divisions in actual implementation; for example, units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the present embodiments. In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (20)

  1. An action generation method, applied to an electronic device, the method comprising:
    obtaining first feature representations respectively characterizing several individuals in several action frames, and obtaining second feature representations respectively characterizing the several individuals with respect to a target action category;
    performing relationship modeling based on the first feature representations and the second feature representations to obtain a fused feature representation of each of the individuals in each of the action frames; wherein a type of the relationship modeling is related to a first total number of the several individuals; and
    performing action mapping based on the fused feature representations to obtain an action sequence of the several individuals with respect to the target action category; wherein the action sequence comprises the several action frames, and each of the action frames contains an action representation of each of the individuals.
  2. The method according to claim 1, wherein the type of the relationship modeling being related to the first total number of the several individuals comprises at least one of the following:
    in a case where the first total number of the several individuals is one, the relationship modeling comprises modeling a temporal relationship between the action frames;
    in a case where the first total number of the several individuals is more than one, the relationship modeling comprises modeling an interaction relationship between the several individuals in each of the action frames and modeling the temporal relationship between the action frames.
  3. The method according to claim 2, wherein, in a case where the relationship modeling comprises modeling the temporal relationship, the performing relationship modeling based on the first feature representations and the second feature representations to obtain the fused feature representation of each of the individuals in each of the action frames comprises:
    selecting an individual as a target individual, and using the first feature representation and the second feature representation corresponding to the target individual as temporal feature representations of the target individual at different time steps;
    selecting each of the time steps in turn as a first current time step, and selecting the temporal feature representation at the first current time step as a first current temporal representation; and
    obtaining the fused feature representation corresponding to the first current temporal representation based on a relevance between each first reference temporal representation and the first current temporal representation;
    wherein the first reference temporal representations comprise the temporal feature representations of the target individual at each of the time steps.
  4. The method according to claim 2, wherein, in a case where the relationship modeling comprises modeling the interaction relationship, the performing relationship modeling based on the first feature representations and the second feature representations to obtain the fused feature representation of each of the individuals in each of the action frames comprises:
    selecting an individual as a target individual, and using the first feature representation and the second feature representation corresponding to the target individual as temporal feature representations of the target individual at different time steps;
    selecting each of the time steps in turn as a second current time step, and selecting the temporal feature representation at the second current time step as a second current temporal representation; and
    obtaining the fused feature representation corresponding to the second current temporal representation based on a relevance between each second reference temporal representation and the second current temporal representation;
    wherein the second reference temporal representations comprise the temporal feature representations of each of the individuals at the second current time step.
  5. The method according to claim 2, wherein, in a case where the relationship modeling comprises modeling the interaction relationship and the temporal relationship, the performing relationship modeling based on the first feature representations and the second feature representations to obtain the fused feature representation of each of the individuals in each of the action frames comprises:
    modeling a prior relationship based on the first feature representations and the second feature representations to obtain an output feature representation of the prior relationship; and
    modeling a subsequent relationship based on the output feature representation to obtain the fused feature representation;
    wherein the prior relationship is the interaction relationship and the subsequent relationship is the temporal relationship, or the prior relationship is the temporal relationship and the subsequent relationship is the interaction relationship.
  6. The method according to claim 2, wherein the action sequence is obtained by an action generation model, the action generation model comprises a relationship modeling network, the relationship modeling network comprises a temporal modeling sub-network and an interaction modeling sub-network, the temporal modeling sub-network is configured to model the temporal relationship, and the interaction modeling sub-network is configured to model the interaction relationship.
  7. The method according to any one of claims 1 to 6, wherein the first feature representations are obtained by sampling Gaussian processes.
  8. The method according to claim 7, wherein the obtaining first feature representations respectively characterizing several individuals in several action frames comprises:
    sampling each of several Gaussian processes a second total number of times to obtain first original representations respectively characterizing the second total number of action frames; wherein a length of each first original representation equals the number of the Gaussian processes, and characteristic length scales of the Gaussian processes differ from one another; and
    obtaining a third total number of the first feature representations based on the first total number and the first original representations; wherein the third total number is a product of the first total number and the second total number.
  9. The method according to any one of claims 1 to 8, wherein the second feature representations are obtained by mapping the target action category.
  10. The method according to claim 9, wherein the obtaining second feature representations respectively characterizing the several individuals with respect to the target action category comprises:
    performing embedding representation on the target action category to obtain a second original representation; and
    obtaining the first total number of the second feature representations based on the first total number and the second original representation.
  11. The method according to any one of claims 1 to 10, wherein positional encodings are fused into both the first feature representations and the second feature representations;
    wherein, in a case where the several individuals are a single individual, the positional encoding comprises a temporal positional encoding, and in a case where the several individuals are multiple individuals, the positional encoding comprises an individual positional encoding and the temporal positional encoding.
  12. The method according to claim 11, wherein the action sequence is obtained by an action generation model, and the positional encoding is adjusted together with network parameters of the action generation model during training of the action generation model until the training of the action generation model converges.
  13. The method according to any one of claims 1 to 12, wherein the action representation of the individual in the action frame comprises: in the action frame, first position information of a key point of the individual and pose information of the individual, wherein the pose information comprises second position information of several joint points of the individual.
  14. The method according to any one of claims 1 to 13, wherein the action sequence is obtained by an action generation model, and the action generation model is obtained together with a discrimination model through generative adversarial training.
  15. The method according to claim 14, wherein the generative adversarial training comprises:
    obtaining a sample action sequence of several sample individuals with respect to a sample action category; wherein the sample action sequence comprises a preset number of sample action frames, the sample action sequence is annotated with a sample label, and the sample label indicates whether the sample action sequence was actually generated by the action generation model;
    decomposing each of the sample action frames in the sample action sequence to obtain sample graph data; wherein the sample graph data comprises the preset number of node graphs, each node graph is formed by connecting nodes, the nodes comprise key points and joint points of the sample individuals, each node graph comprises a node feature representation of each of the nodes, and a position feature representation of each node is obtained by concatenating position feature representations of the several sample individuals at the corresponding node;
    discriminating the sample graph data and the sample action category based on the discrimination model to obtain a prediction result; wherein the prediction result comprises a first prediction label and a second prediction label of the sample action sequence, the first prediction label indicates a predicted likelihood that the sample action sequence was generated by the action generation model, and the second prediction label indicates a likelihood that the sample action sequence belongs to the sample action category; and
    adjusting network parameters of either the action generation model or the discrimination model based on the sample label, the first prediction label and the second prediction label.
  16. The method according to claim 15, wherein, in a case where the sample action sequence is captured from a real scene, the position feature representation of the node is obtained by concatenating the position feature representations of the several sample individuals at the corresponding node in a random order of the several sample individuals.
  17. An action generation apparatus, comprising:
    a feature obtaining part, configured to obtain first feature representations respectively characterizing several individuals in several action frames, and to obtain second feature representations respectively characterizing the several individuals with respect to a target action category;
    a relationship modeling part, configured to perform relationship modeling based on the first feature representations and the second feature representations to obtain a fused feature representation of each of the individuals in each of the action frames; wherein a type of the relationship modeling is related to a first total number of the several individuals; and
    an action mapping part, configured to perform action mapping based on the fused feature representations to obtain an action sequence of the several individuals with respect to the target action category; wherein the action sequence comprises the several action frames, and each of the action frames contains an action representation of each of the individuals.
  18. An electronic device, comprising a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the action generation method according to any one of claims 1 to 16.
  19. A computer-readable storage medium having program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the action generation method according to any one of claims 1 to 16.
  20. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the action generation method according to any one of claims 1 to 16.
PCT/CN2022/135160 2022-01-25 2022-11-29 Action generation method and related apparatus, electronic device, storage medium and program WO2023142651A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210089863.5 2022-01-25
CN202210089863.5A CN114494543A (zh) Action generation method and related apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2023142651A1 true WO2023142651A1 (zh) 2023-08-03

Family

ID=81474329

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/135160 WO2023142651A1 (zh) Action generation method and related apparatus, electronic device, storage medium and program

Country Status (2)

Country Link
CN (1) CN114494543A (zh)
WO (1) WO2023142651A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494543A (zh) Action generation method and related apparatus, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263650A (zh) * 2019-05-22 2019-09-20 北京奇艺世纪科技有限公司 Behavior category detection method and apparatus, electronic device and computer-readable medium
CN110765967A (zh) * 2019-10-30 2020-02-07 腾讯科技(深圳)有限公司 Artificial-intelligence-based action recognition method and related apparatus
CN112025692A (zh) * 2020-09-01 2020-12-04 广东工业大学 Control method and apparatus for a self-learning robot, and electronic device
JP2021033602A (ja) * 2019-08-23 2021-03-01 Kddi株式会社 Information processing apparatus, vector generation method and program
CN112668366A (zh) * 2019-10-15 2021-04-16 华为技术有限公司 Image recognition method and apparatus, computer-readable storage medium and chip
CN114494543A (zh) * 2022-01-25 2022-05-13 上海商汤科技开发有限公司 Action generation method and related apparatus, electronic device and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116805046A (zh) * 2023-08-18 2023-09-26 武汉纺织大学 Method for generating 3D human body motion based on text labels
CN116805046B (zh) * 2023-08-18 2023-12-01 武汉纺织大学 Method for generating 3D human body motion based on text labels

Also Published As

Publication number Publication date
CN114494543A (zh) 2022-05-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923443

Country of ref document: EP

Kind code of ref document: A1