CN117152208A - Virtual image generation method, deep learning model training method and device - Google Patents

Virtual image generation method, deep learning model training method and device

Info

Publication number
CN117152208A
CN117152208A CN202311121138.2A
Authority
CN
China
Prior art keywords
target
sample
feature sequence
gesture
bone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311121138.2A
Other languages
Chinese (zh)
Inventor
杨少雄
徐颖
崔宪坤
赵晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311121138.2A priority Critical patent/CN117152208A/en
Publication of CN117152208A publication Critical patent/CN117152208A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/02Non-photorealistic rendering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure provides an avatar generation method and a training method and device for a deep learning model, relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like, and can be applied to metaverse and digital virtual human scenarios. The avatar generation method includes: generating a motion trajectory feature sequence according to skeletal point features of an initial pose of a target object and skeletal point features of a target pose of the target object, wherein the motion trajectory feature sequence includes skeletal point features of a plurality of transition actions that the target object needs to complete to change from the initial pose to the target pose; processing the skeletal point features of the initial pose, the motion trajectory feature sequence and the skeletal point features of the target pose to generate an action feature sequence, wherein the action feature sequence is used to drive the avatar to reach the target pose from the initial pose through continuous pose changes; and rendering the action feature sequence to generate the avatar.

Description

Virtual image generation method, deep learning model training method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like, and specifically relates to a virtual image generation method, a deep learning model training method and a device.
Background
A digital virtual human aims to create, through Computer Graphics (CG) technology, a digital figure resembling a human, to give it a specific character identity, and to narrow the visual and psychological distance between it and humans, so as to bring users more realistic emotional interaction.
As digital virtual humans are applied ever more deeply in various fields, people place increasingly high requirements on how closely the motion poses of a virtual human resemble the real motion poses of the human body.
Disclosure of Invention
The disclosure provides an avatar generation method, a training method for a deep learning model, and a training apparatus for the deep learning model.
According to an aspect of the present disclosure, there is provided an avatar generation method, including: generating a motion trajectory feature sequence according to skeletal point features of an initial pose of a target object and skeletal point features of a target pose of the target object, wherein the motion trajectory feature sequence includes skeletal point features of a plurality of transition actions that the target object needs to complete to change from the initial pose to the target pose; processing the skeletal point features of the initial pose, the motion trajectory feature sequence and the skeletal point features of the target pose to generate an action feature sequence, wherein the action feature sequence is used to drive an avatar to reach the target pose from the initial pose through continuous pose changes; and rendering the action feature sequence to generate the avatar.
According to another aspect of the present disclosure, there is provided a training method for a deep learning model, including: locally masking an initial sample action feature sequence to obtain a masked action feature sequence, wherein the initial sample action feature sequence includes a sample skeletal point feature sequence of a sample object changing from a sample initial pose to a sample target pose through continuous pose changes; inputting the masked action feature sequence into an encoding module of an initial model to obtain an encoded sample feature sequence; inputting the encoded sample feature sequence into an attention module of the initial model to obtain a fused feature sequence, wherein the fused feature sequence includes a predicted skeletal point feature sequence of the sample object changing from the sample initial pose to the sample target pose through continuous pose changes; obtaining a target loss value according to the predicted skeletal point feature sequence and the sample skeletal point feature sequence based on a target loss function; and adjusting model parameters of the initial model based on the target loss value to obtain a trained deep learning model.
According to another aspect of the present disclosure, there is provided an avatar generation apparatus, including a generation module, a processing module and a rendering module. The generation module is configured to generate a motion trajectory feature sequence according to skeletal point features of an initial pose of a target object and skeletal point features of a target pose of the target object, wherein the motion trajectory feature sequence includes skeletal point features of a plurality of transition actions that the target object needs to complete to change from the initial pose to the target pose. The processing module is configured to process the skeletal point features of the initial pose, the motion trajectory feature sequence and the skeletal point features of the target pose to generate an action feature sequence, wherein the action feature sequence is used to drive an avatar to reach the target pose from the initial pose through continuous pose changes. The rendering module is configured to render the action feature sequence to generate the avatar.
According to another aspect of the present disclosure, there is provided a training apparatus for a deep learning model, including a masking module, an encoding module, an attention module, a loss calculation module and an adjustment module. The masking module is configured to locally mask an initial sample action feature sequence to obtain a masked action feature sequence, wherein the initial sample action feature sequence includes a sample skeletal point feature sequence of a sample object changing from a sample initial pose to a sample target pose through continuous pose changes. The encoding module is configured to encode the masked action feature sequence to obtain an encoded sample feature sequence. The attention module is configured to process the encoded sample feature sequence based on an attention mechanism to obtain a fused feature sequence, wherein the fused feature sequence includes a predicted skeletal point feature sequence of the sample object changing from the sample initial pose to the sample target pose through continuous pose changes. The loss calculation module is configured to obtain a target loss value according to the predicted skeletal point feature sequence and the sample skeletal point feature sequence based on a target loss function. The adjustment module is configured to adjust model parameters of an initial model based on the target loss value to obtain a trained deep learning model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
Fig. 1 schematically illustrates an exemplary system architecture to which an avatar generation method and apparatus may be applied according to an embodiment of the present disclosure;
fig. 2 schematically illustrates a flowchart of a method of generating an avatar according to an embodiment of the present disclosure;
fig. 3 schematically illustrates a schematic diagram of a method of generating an avatar according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of generating a motion trajectory feature sequence according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of generating a sequence of motion features according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a training method of a deep learning model according to an embodiment of the present disclosure;
fig. 7 schematically illustrates a block diagram of an avatar generating apparatus according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure; and
fig. 9 schematically illustrates a block diagram of an electronic device adapted to implement a method of generating an avatar or a training method of a deep learning model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following ways of generating an action sequence for driving an avatar to move are generally adopted in the related art:
1. A return-to-zero state is designed at the beginning and end of each action. The return-to-zero state may be preconfigured, for example as a standard standing pose. The connection between any two actions passes through the return-to-zero state, so that the avatar achieves the technical effect of not jumping when performing continuous actions. However, an avatar generated in this way returns to a preconfigured pose after each action ends, which does not conform to the natural motion law of the human body.
2. A predetermined number of transition actions are inserted at the position where a transition between two actions is required, for example by linear interpolation. However, the motion distribution of an avatar generated in this way does not match the real motion distribution of the human body, resulting in a mechanical feel in the avatar's motion.
In view of the above, the present disclosure provides an avatar generation method: a motion trajectory sequence is generated according to skeletal point features of an initial pose of a target object and skeletal point features of a target pose; and an action feature sequence capable of representing continuous pose changes of the target object is generated based on the skeletal point features of the initial pose, the motion trajectory sequence and the skeletal point features of the target pose, thereby reducing the mechanical feel of the avatar as it changes from the initial pose to the target pose and making the completed action conform to the real motion distribution of the human body.
Fig. 1 schematically illustrates an exemplary system architecture to which an avatar generation method and apparatus may be applied according to an embodiment of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the avatar generation method and apparatus may be applied may include a terminal device, but the terminal device may implement the avatar generation method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client and/or social platform software, etc. (as examples only).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the avatar generation method provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the avatar generation apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the avatar generation method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the avatar generation apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The avatar generation method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the avatar generation apparatus provided by the embodiments of the present disclosure may also be provided in a server or server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
For example, the terminal device 101, 102, 103 may acquire the skeletal point feature of the initial pose and the skeletal point feature of the target pose of the target object, and then send the skeletal point feature of the initial pose and the skeletal point feature of the target pose to the server 105, and the server 105 processes the skeletal point feature of the initial pose and the skeletal point feature of the target pose to generate a motion trail feature sequence; and generating an action feature sequence according to the skeleton point features of the initial gesture, the motion track feature sequence and the skeleton point features of the target gesture, and rendering the action feature sequence to generate the virtual image. Or the skeletal point features of the initial pose and the skeletal point features of the target pose are processed by a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or server 105 and ultimately the avatar is generated.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
In the technical scheme of the disclosure, the related processes of collecting, storing, using, processing, transmitting, providing, disclosing, applying and the like of the personal information of the user all conform to the regulations of related laws and regulations, necessary security measures are adopted, and the public order harmony is not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Fig. 2 schematically illustrates a flowchart of a method of generating an avatar according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S230.
In operation S210, a motion trajectory feature sequence is generated from the skeletal point features of the initial pose of the target object and the skeletal point features of the target pose of the target object.
In operation S220, the skeletal point features of the initial pose, the motion trajectory feature sequence, and the skeletal point features of the target pose are processed to generate an action feature sequence.
In operation S230, the motion feature sequence is rendered to generate an avatar.
According to embodiments of the present disclosure, the initial pose of the target object may characterize a starting motion pose of the target object at a start time and the target pose may characterize an ending motion pose of the target object at an end time. For example: the initial pose may be a sitting pose and the target pose may be a standing pose.
According to an embodiment of the present disclosure, bone point features are used to characterize absolute positions of a plurality of bone points of a target object and relative positions between a plurality of bone points on adjacent joints.
For example: skeletal point features may be derived based on human skeletal trees in an SMPL (Skinned Multi-Person Linear Model) model. The human skeletal tree may include 23 joints and 1 root node. The root node may be the human crotch center point. Each joint can comprise a plurality of bone points, and the absolute position of the bone points can represent the coordinate value of the bone points on each joint or the coordinate value of the center point of the crotch of the human body. The relative position between the plurality of bone points on adjacent joints may be an angle of rotation between the plurality of bone points on adjacent joints. For example: the adjacent joints may be elbow joints and wrist joints. Each motion gesture of the target object can be represented by the relative position of the skeletal points between 23 joints and the absolute position of the root node.
According to an embodiment of the present disclosure, the motion trajectory feature sequence includes skeletal point features of a plurality of transitional actions that the target object needs to complete to transform from an initial pose to a target pose.
For example: the initial pose may be a standard standing pose and the target pose may be a sitting pose. The transition motion that the target subject needs to complete from standing to sitting may include multiple transition poses such as bending down, bending knees, squatting half, etc. The motion trajectory feature sequence may include skeletal point features at each transition pose.
According to the embodiment of the disclosure, each transition action in the motion trajectory feature sequence is isolated, and the distribution of the transition actions does not fully conform to the distribution law of human motion. Therefore, the skeletal point features of the initial pose, the motion trajectory feature sequence and the skeletal point features of the target pose are fused, and the resulting action feature sequence can represent the continuous pose change process of the target object from the initial pose to the target pose.
For example: the skeleton point characteristics of the initial gesture, the motion track characteristic sequences and the skeleton point characteristics of the target gesture can be input into a trained deep learning model, and the motion characteristic sequences for driving the virtual image to obtain the target gesture from the initial gesture through continuous gesture changes are output.
According to the embodiment of the present disclosure, an avatar capable of simulating a change of a target object from an initial pose to a target pose through successive poses may be generated by rendering a sequence of motion features.
Fig. 3 schematically illustrates a schematic diagram of a method of generating an avatar according to an embodiment of the present disclosure;
as shown in fig. 3, in an embodiment 300, skeletal point features of a motion trajectory sequence 303 including a plurality of transitional actions, an initial pose, and a target pose may be derived based on a linear interpolation method from skeletal point features of the initial pose 301 and skeletal point features of the target pose 302. The motion trajectory sequence 303 is input into a transducer network 304 of training numbers, and is subjected to model refinement to obtain an action feature sequence 305 for driving the avatar to change from the initial pose 301 to the target pose 302 through continuous poses.
According to the embodiment of the disclosure, a motion trajectory sequence is generated according to the skeletal point features of the initial pose of the target object and the skeletal point features of the target pose; and an action feature sequence capable of representing continuous pose changes of the target object is generated based on the skeletal point features of the initial pose, the motion trajectory sequence and the skeletal point features of the target pose, thereby reducing the mechanical feel of the avatar as it changes from the initial pose to the target pose and making the completed action conform to the real motion distribution of the human body.
In the related art, the number of transition actions from the initial pose to the target pose is generally determined based on prior experience, and the number is then adjusted according to the action effect of the generated avatar. This approach requires repeatedly generating the avatar and then adjusting the number of transition actions in reverse; the many repeated operations result in low efficiency of avatar generation.
For example: generating the motion trajectory feature sequence according to the skeletal point features of the initial pose of the target object and the skeletal point features of the target pose of the target object may include the following operations: obtaining the number of transition actions according to the skeletal point features of the initial pose and the skeletal point features of the target pose; and processing the skeletal point features of the initial pose and the skeletal point features of the target pose based on the number of transition actions to obtain the motion trajectory feature sequence.
According to embodiments of the present disclosure, the number of transition actions may characterize the number of skeletal point features that need to be inserted between the skeletal point features of the initial pose and the skeletal point features of the target pose. For example: when the difference between the initial pose and the target pose is large, more transition actions are needed to complete a smooth transition from the initial pose to the target pose; when the difference between the initial pose and the target pose is small, only a few transition actions are needed to complete a smooth transition from the initial pose to the target pose.
According to embodiments of the present disclosure, the motion trajectory feature sequence may characterize a transition action sequence. For example: the number of transition actions is I, where I is an integer greater than 1. The order of the transition actions in the transition action sequence may be, in turn, 1, 2, ..., i, ..., I.
For example: for the i-th transition action, the skeletal point feature of the i-th transition action can be generated according to the skeletal point feature of the initial pose, the number of transition actions, the order of the transition actions and the skeletal point feature of the target pose by using formula (1),
where f_a denotes the skeletal point feature of the initial pose; f_b denotes the skeletal point feature of the target pose; i denotes the order of the transition action; and n denotes the number of transition actions.
If it is determined that i is less than I, i is incremented and the operation of generating the skeletal point feature of the i-th transition action is executed again, generating the skeletal point feature of the (i+1)-th transition action; and so on, until the skeletal point feature of the I-th transition action is generated when i is determined to be equal to I. The motion trajectory feature sequence is then generated from the skeletal point features of the I transition actions.
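As a non-limiting sketch of this interpolation step (the exact form of formula (1) is not reproduced in this text), a uniform linear interpolation between the initial-pose feature and the target-pose feature could look as follows; the denominator n + 1 is an assumption of this sketch.

```python
import numpy as np

def motion_trajectory(f_a, f_b, n):
    """Generate skeletal point features of n transition actions between an
    initial-pose feature f_a and a target-pose feature f_b by linear
    interpolation. The form f_i = f_a + i/(n+1) * (f_b - f_a) is an assumed
    stand-in for the patent's formula (1), not its disclosed definition.
    """
    f_a, f_b = np.asarray(f_a, np.float32), np.asarray(f_b, np.float32)
    return [f_a + (i / (n + 1)) * (f_b - f_a) for i in range(1, n + 1)]

# Usage: splice initial pose, transitions and target pose in order, e.g.
# trajectory = [f_a] + motion_trajectory(f_a, f_b, n) + [f_b]
```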
According to the embodiment of the disclosure, based on the order of the transition actions, skeletal point features of transition actions representing the motion trajectory of the target object are generated between the initial pose and the target pose, with the skeletal point features changing at a uniform speed or uniform acceleration, so that the density of action poses between the initial pose and the target pose is increased and the difficulty of fusing the motion poses is reduced.
Because the skeletal point features can represent any pose of the target object, the number of transition actions can be accurately determined based on the differences among the skeletal point features.
According to the embodiment of the disclosure, first positions of a plurality of skeletal points of the target object in the initial pose are obtained according to the skeletal point features of the initial pose; second positions of the plurality of skeletal points of the target object in the target pose are obtained according to the skeletal point features of the target pose; and the number of transition actions is obtained according to the first positions of the plurality of skeletal points and the second positions of the plurality of skeletal points.
For example: the skeletal point features of the initial pose may be processed according to equation (2) based on a forward kinematic algorithm to obtain first positions of a plurality of skeletal points of the initial pose. The first position may characterize position coordinates of bone points on each joint of the target object in the initial pose.
p_a = FK(f_a) (2)
where f_a denotes the skeletal point feature of the initial pose; FK(·) denotes the forward kinematics function; and p_a denotes the first position, which may be a 23x3-dimensional vector (23 being the number of joints of the human skeletal tree, and 3 being the rotation angle of the skeletal points of each joint relative to the x, y and z axes of the adjacent joints).
Similarly, the skeletal point features of the target pose may be processed based on a forward kinematic algorithm to obtain second positions of the plurality of skeletal points of the target pose. The second position may characterize position coordinates of bone points on each joint of the target object in the target pose.
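For illustration only, a simplified forward kinematics pass over a generic skeletal tree might look as follows; the parent table and rest-pose bone offsets are assumed inputs of this sketch, which omits the details of the actual SMPL implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def forward_kinematics(root_pos, joint_rotvecs, parents, offsets):
    """Turn per-joint relative rotations into global joint positions (the
    p = FK(f) step). parents[j] is the parent index of joint j (-1 for the
    root, which must come first) and offsets[j] is the rest-pose bone vector
    from the parent to joint j; both are illustrative assumptions here.
    """
    n = len(parents)
    glob_rot = [None] * n
    glob_pos = np.zeros((n, 3), dtype=np.float32)
    for j in range(n):
        local = Rotation.from_rotvec(joint_rotvecs[j])
        if parents[j] < 0:                       # root joint
            glob_rot[j] = local
            glob_pos[j] = np.asarray(root_pos, dtype=np.float32)
        else:
            p = parents[j]
            glob_rot[j] = glob_rot[p] * local    # compose with parent rotation
            glob_pos[j] = glob_pos[p] + glob_rot[p].apply(offsets[j])
    return glob_pos                              # (num_joints, 3) positions
```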
According to an embodiment of the present disclosure, deriving the number of transition actions from the first positions of the plurality of skeletal points and the second positions of the plurality of skeletal points may include the following operations: obtaining, according to the identifications of the skeletal points, the position difference of a target skeletal point from the first positions of the plurality of skeletal points and the second positions of the corresponding skeletal points; and obtaining the number of transition actions according to the position difference of the target skeletal point and a predetermined transition parameter.
For example: according to the identification of the bone points, the first position of the bone points on the elbow joint in the initial posture and the second position of the bone points on the elbow joint in the target posture can be obtained, and the position difference of the bone points on the elbow joint is obtained. By analogy, skeleton points and root nodes on 23 joints can be obtained, and multiple groups of position difference values under different postures can be obtained.
According to the embodiment of the disclosure, the position differences of the bone points and root nodes on 23 joints in the initial posture and the target posture can be ordered, and the bone point with the largest position difference is determined to be the target bone point. For example: the target object may be a bone point at which the difference in position between bone points on all joints is greatest, when the position of the bone point of the ankle joint is changed from the ground to a position higher than the knee during running. Thus, the bone point of the ankle joint may be determined as the target bone point.
For example: the number of transition actions can be obtained according to equation (3) based on the difference in the positions of the target bone points and the predetermined transition parameters:
n = κ·Max(||p_a − p_b||_2) (3)
where n denotes the number of transition actions; p_a denotes the first position of a skeletal point in the initial pose; p_b denotes the second position of the skeletal point in the target pose; and κ denotes the predetermined transition parameter.
According to embodiments of the present disclosure, the predetermined transition parameter is not unique, and different position difference values may correspond to different predetermined transition parameters. The mapping relationship between the transition parameters and the position difference values can be preconfigured, and the matched preset transition parameters are selected based on the position difference values of the target bone points.
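A minimal sketch of this step is given below; the value of the transition parameter κ and the rounding to an integer count are illustrative assumptions, not values disclosed in the text.

```python
import numpy as np

def num_transitions(p_a, p_b, kappa=30.0):
    """Formula (3): n = kappa * max_j ||p_a[j] - p_b[j]||_2.

    p_a, p_b: (J, 3) joint positions from forward kinematics for the initial
    and target poses. kappa is the predetermined transition parameter; the
    default 30.0 and the rounding below are assumptions for illustration.
    """
    max_disp = np.linalg.norm(np.asarray(p_a) - np.asarray(p_b), axis=-1).max()
    return max(1, int(round(kappa * max_disp)))
```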
Fig. 4 schematically illustrates a schematic diagram of generating a motion trajectory feature sequence according to an embodiment of the present disclosure.
As shown in fig. 4, in an embodiment 400, skeletal point features 411 of the initial pose may be processed based on a forward kinematics algorithm to obtain first positions 413 of a plurality of skeletal points in the initial pose. Skeletal point features 412 of the target pose are processed based on the forward kinematics algorithm to obtain second positions 414 of the plurality of skeletal points in the target pose. The first positions 413 of the plurality of skeletal points in the initial pose and the corresponding second positions 414 in the target pose are compared in turn to obtain position differences 415 of the plurality of skeletal points. The position difference 416 of the target skeletal point is taken as the largest value among the position differences 415 of the plurality of skeletal points. The number of transition actions 417 is then derived from the position difference 416 of the target skeletal point. Based on linear interpolation, a motion trajectory feature sequence 418 is generated according to the skeletal point features 411 of the initial pose, the skeletal point features 412 of the target pose and the number of transition actions 417.
According to the embodiment of the disclosure, determining the matched number of transition actions from the differences in skeletal point positions between different poses can effectively reduce the number of times the action feature sequence of the avatar needs to be adjusted, thereby improving the generation efficiency of the avatar.
Although the linear-interpolation-based approach can increase the density of transition actions and reduce the mechanical feel of the avatar's motion, the real motion change process of the human body is not an entirely uniform-speed or uniform-acceleration process.
In view of this, the skeletal point features of the initial pose, the motion trajectory feature sequence and the skeletal point features of the target pose can be spliced to generate a skeletal point feature sequence to be processed, and the skeletal point feature sequence to be processed can be processed based on an attention mechanism to obtain the action feature sequence.
For example: the skeletal point features of the initial pose, the motion trajectory feature sequence and the skeletal point features of the target pose can be spliced according to the order of pose changes to generate the skeletal point feature sequence to be processed.
For example: the skeletal point feature sequence to be processed can be encoded to obtain an encoded feature sequence, and the encoded feature sequence can be processed based on a cross-attention mechanism to obtain the action feature sequence.
The skeletal point feature sequence to be processed can be encoded, and a position encoding can be added, to obtain the encoded feature sequence. The encoded feature sequence is input into multi-head attention layers (Multi-Head Attention); based on the attention mechanism, important information is focused on with high weights while non-important information is suppressed with low weights, and the important information is shared and exchanged with other information, so that the important information is propagated and an action feature sequence coupled with global information is obtained.
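A non-limiting PyTorch sketch of such a refinement stage is given below; the temporal convolution encoder, the sinusoidal position encoding, the layer sizes and the use of standard Transformer encoder layers are assumptions for illustration, not the disclosed network.

```python
import math
import torch
import torch.nn as nn

class MotionRefiner(nn.Module):
    """Encode the spliced skeletal feature sequence with a temporal
    convolution, add a positional encoding, and fuse global context with
    stacked multi-head attention layers; all dimensions are illustrative.
    """
    def __init__(self, feat_dim=72, model_dim=256, heads=8, layers=4, max_len=512):
        super().__init__()
        self.encode = nn.Conv1d(feat_dim, model_dim, kernel_size=3, padding=1)
        # Sinusoidal positional encoding (an assumption; the patent only says
        # a position encoding is added).
        pe = torch.zeros(max_len, model_dim)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, model_dim, 2).float() * (-math.log(10000.0) / model_dim))
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)
        block = nn.TransformerEncoderLayer(model_dim, heads, batch_first=True)
        self.attention = nn.TransformerEncoder(block, num_layers=layers)
        self.decode = nn.Linear(model_dim, feat_dim)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h = self.encode(x.transpose(1, 2)).transpose(1, 2)
        h = h + self.pe[: h.size(1)]             # add positional encoding
        return self.decode(self.attention(h))    # refined action feature sequence
```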
Fig. 5 schematically illustrates a schematic diagram of generating a sequence of motion features according to an embodiment of the present disclosure.
As shown in fig. 5, in embodiment 500, the action feature sequence is generated in two stages. In the first stage, a motion trajectory feature sequence 513 is inserted between segment 1 (511) and segment 2 (512) based on linear interpolation, resulting in a skeletal point feature sequence 514 to be processed. In the skeletal point feature sequence 514 to be processed, there are discontinuities between the transition actions. The skeletal point feature sequence 514 to be processed is input into a trained Transformer network 515. In the Transformer network 515, the skeletal point feature sequence 514 to be processed is encoded by a temporal convolution network, and a position encoding (PE: position encoding) is added to obtain the encoded skeletal point feature sequence. The encoded skeletal point feature sequence is coupled with global information through N multi-head attention layers (Multi-Head Attention), and an action feature sequence 516 with continuous pose changes is output.
According to the embodiment of the disclosure, the spliced skeletal point features are fused based on the attention mechanism, so that the skeletal point features are coupled with the skeletal point feature information of the initial pose, the transition poses and the target pose, realizing continuous pose changes that conform to the distribution of human motion.
Fig. 6 schematically illustrates a flowchart of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 6, the training method 600 may include operations S610 to S650.
In operation S610, the initial sample motion feature sequence is locally masked to obtain a masked motion feature sequence.
In operation S620, the mask action feature sequence is input to the encoding module of the initial model, resulting in an encoded sample feature sequence.
In operation S630, the encoded sample feature sequence is input to the attention module of the initial model to obtain a fused feature sequence.
In operation S640, a target loss value is obtained from the predicted bone point feature sequence and the sample bone point feature sequence based on the target loss function.
In operation S650, model parameters of the initial model are adjusted based on the target loss value, resulting in a trained deep learning model.
According to an embodiment of the present disclosure, the initial sample motion feature sequence comprises a sample skeletal point feature sequence of a sample object that varies from a sample initial pose through successive poses to a sample target pose. The fusion feature sequence comprises a predicted bone point feature sequence of a sample object from a sample initial pose through successive poses to a sample target pose.
According to embodiments of the present disclosure, the sample skeletal point features are defined in the same way as the skeletal point features described above.
According to an embodiment of the present disclosure, locally masking the initial sample action feature sequence to obtain the masked action feature sequence may include the following operations: dividing the initial sample action feature sequence to obtain a plurality of sample action feature sequence segments; and masking a predetermined number of consecutive sample action feature sequence segments among the plurality of sample action feature sequence segments to obtain the masked action sequence.
For example: the initial sample action feature sequence can be divided according to a sliding window of a predetermined size to obtain m sample action feature sequence segments, and p consecutive segments among the m segments can be masked to obtain the masked action sequence.
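A minimal sketch of this local masking step is given below; the window size, the mask value and the random choice of the masked location are illustrative assumptions, not values disclosed in the text.

```python
import numpy as np

def local_mask(sequence, window=8, p=2, mask_value=0.0, rng=None):
    """Split the initial sample action feature sequence into fixed-size
    segments and mask p consecutive segments.

    sequence: (frames, feat_dim) sample skeletal point feature sequence.
    """
    rng = rng or np.random.default_rng()
    masked = np.array(sequence, dtype=np.float32, copy=True)
    num_segments = masked.shape[0] // window
    start_seg = rng.integers(0, max(1, num_segments - p + 1))
    lo, hi = start_seg * window, (start_seg + p) * window
    masked[lo:hi] = mask_value                 # hide p consecutive segments
    return masked
```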
According to an embodiment of the disclosure, the masked action feature sequence is input into the encoding module of the initial model to obtain an encoded sample feature sequence. The encoded sample feature sequence is then input into the attention module of the initial model, and feature fusion is performed on the masked action sequence based on a cross-attention mechanism to obtain a fused feature sequence.
According to the embodiment of the disclosure, by locally masking the complete continuous pose change sequence, the model can learn implicit rules of human motion during training, and global skeletal point feature information is coupled based on the attention mechanism, so that the generated action feature sequence conforms to the distribution of human motion.
According to an embodiment of the present disclosure, the fusion feature sequence comprises a predicted bone point feature sequence of a sample object from a sample initial pose through successive poses to a sample target pose. The predicted bone point feature sequence includes a predicted rotation angle feature sequence of bone points between adjacent associated joints and a predicted position feature sequence of crotch center points of the sample object.
According to an embodiment of the present disclosure, the initial sample motion feature sequence comprises a sample skeletal point feature sequence of a sample object that varies from a sample initial pose through successive poses to a sample target pose. The sample bone point feature sequence includes a sample rotation angle feature sequence of bone points between adjacent associated joints and a sample position feature sequence of crotch bone center points of sample objects.
For example: based on the target loss function, obtaining a target loss value according to the predicted bone point feature sequence and the sample bone point feature sequence may include the following operations: based on the first loss function, obtaining a rotation angle loss value according to the sample rotation angle characteristic sequence and the predicted rotation angle characteristic sequence; based on the second loss function, obtaining a center point position loss value according to the sample position feature sequence and the predicted position feature sequence; and obtaining a target loss value according to the rotation angle loss value and the center point position loss value.
For example: the rotation angle loss value can be calculated using equation (4):
where rx_ij denotes the rotation angle feature of the j-th sample skeletal point in the i-th pose; ry_ij denotes the predicted rotation angle feature of the j-th skeletal point in the i-th pose; J denotes the number of skeletal points; i_a denotes the initial pose; and i_b denotes the target pose.
For example: the center point position loss value can be calculated using equation (5):
where tx_i denotes the sample center point position feature of the i-th pose; ty_i denotes the predicted center point position feature of the i-th pose; i_a denotes the initial pose; and i_b denotes the target pose.
According to the embodiments of the present disclosure, since a human motion pose can be represented as a vector of skeletal point features, the accuracy of model prediction can be improved based on the loss of the rotation angles of skeletal points between adjacent joints and the loss of the position of the center point.
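For illustration, a combined loss in the spirit of formulas (4) and (5) might be computed as below; since the exact formulas are not reproduced in this text, a mean-squared-error form and the loss weights are assumptions of this sketch.

```python
import torch

def target_loss(pred_rot, sample_rot, pred_root, sample_root, w_rot=1.0, w_root=1.0):
    """Combine a rotation-angle loss and a center-point position loss.

    pred_rot, sample_rot:   (frames, joints, 3) rotation angle features
    pred_root, sample_root: (frames, 3) crotch center point positions
    The MSE form and the weights w_rot, w_root are illustrative assumptions.
    """
    rot_loss = torch.mean((pred_rot - sample_rot) ** 2)    # formula (4) stand-in
    root_loss = torch.mean((pred_root - sample_root) ** 2)  # formula (5) stand-in
    return w_rot * rot_loss + w_root * root_loss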
Fig. 7 schematically illustrates a block diagram of an avatar generating apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the generating apparatus 700 may include a generating module 710, a processing module 720, and a rendering module 730.
The generating module 710 is configured to generate a motion trajectory feature sequence according to the skeletal point feature of the initial pose of the target object and the skeletal point feature of the target pose of the target object, where the motion trajectory feature sequence includes skeletal point features of a plurality of transitional actions required to be completed by the target object to be transformed from the initial pose to the target pose.
The processing module 720 is configured to process the skeletal point feature of the initial gesture, the motion track feature sequence, and the skeletal point feature of the target gesture, and generate an action feature sequence, where the action feature sequence is used to drive the avatar to obtain the target gesture from the initial gesture through continuous gesture change.
And a rendering module 730 for rendering the motion feature sequence to generate an avatar.
According to an embodiment of the present disclosure, the generating module 710 may include: the system comprises a first obtaining sub-module and a first processing sub-module.
The first obtaining submodule is used for obtaining the number of transition actions according to the bone point characteristics of the initial gesture and the bone point characteristics of the target gesture.
The first processing sub-module is used for processing the skeleton point characteristics of the initial gesture and the skeleton point characteristics of the target gesture based on the number of transition actions to obtain a motion track characteristic sequence.
According to an embodiment of the present disclosure, the first obtaining sub-module may include: a skeleton point position obtaining unit of the initial gesture, a skeleton point position obtaining unit of the target gesture and a transition number calculating unit.
And the skeleton point position obtaining unit of the initial gesture is used for obtaining first positions of a plurality of skeleton points of the target object in the initial gesture state according to the skeleton point characteristics of the initial gesture.
And the skeleton point position obtaining unit is used for obtaining second positions of a plurality of skeleton points of the target object in the state of the target gesture according to the skeleton point characteristics of the target gesture.
And the transition number calculation unit is used for obtaining the number of transition actions according to the first positions of the plurality of bone points and the second positions of the plurality of bone points.
According to an embodiment of the present disclosure, the transition number calculation unit may include: a position difference obtaining subunit and a transition number calculating subunit.
A position difference obtaining subunit, configured to obtain, according to the identification of the bone points, the position difference of a target bone point from the first positions of the plurality of bone points and the second positions of the corresponding bone points, wherein the target bone point represents the bone point with the largest position difference between the first position and the second position.
And the transition number calculation subunit is used for obtaining the number of transition actions according to the position difference of the target bone points and the preset transition parameters.
According to an embodiment of the present disclosure, the number of transition actions is I, I being an integer greater than 1, the first processing submodule comprising: the device comprises a first generation unit, a second generation unit and a third generation unit.
The first generation unit is used for generating, for the ith transition action, the skeleton point feature of the ith transition action according to the skeleton point feature of the initial gesture, the number of transition actions, the order of the transition actions and the skeleton point feature of the target gesture.
And the second generation unit is used for incrementing i and returning to execute the operation of generating the skeleton point feature of the ith transition action under the condition that i is determined to be smaller than I.
And the third generation unit is used for generating the motion track feature sequence according to the skeleton point features of the I transition actions under the condition that i is determined to be equal to I.
According to an embodiment of the present disclosure, the processing module may include: a splicing sub-module and an attention sub-module.
And the splicing sub-module is used for splicing the bone point characteristics of the initial gesture, the motion track characteristic sequence and the bone point characteristics of the target gesture to generate a bone point characteristic sequence to be processed.
And the attention sub-module is used for processing the skeleton point characteristic sequence to be processed based on an attention mechanism to obtain an action characteristic sequence.
According to an embodiment of the present disclosure, the attention sub-module may include: a coding unit and an attention unit.
The coding unit is used for coding the feature sequence of the bone point to be processed to obtain a coded feature sequence.
And the attention unit is used for processing the coded characteristic sequence based on a cross attention mechanism to obtain an action characteristic sequence.
Fig. 8 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 8, the training apparatus 800 may include a masking module 810, an encoding module 820, an attention module 830, a loss calculation module 840, and an adjustment module 850.
The masking module 810 is configured to locally mask an initial sample motion feature sequence to obtain a masking motion feature sequence, where the initial sample motion feature sequence includes a sample skeleton point feature sequence of a sample object that changes from a sample initial pose to a sample target pose through continuous poses.
The encoding module 820 is configured to encode the mask action feature sequence to obtain an encoded sample feature sequence.
The attention module 830 is configured to process the encoded sample feature sequence based on an attention mechanism to obtain a fusion feature sequence, where the fusion feature sequence includes a predicted skeleton point feature sequence of a sample object from an initial sample pose to a target sample pose through continuous pose changes.
The loss calculation module 840 is configured to obtain a target loss value according to the predicted bone point feature sequence and the sample bone point feature sequence based on the target loss function.
The adjustment module 850 is configured to adjust model parameters of the initial model based on the target loss value, to obtain a trained deep learning model.
According to an embodiment of the present disclosure, the sample bone point feature sequence includes a sample rotation angle feature sequence of bone points between adjacent associated joints and a sample position feature sequence of crotch center points of sample objects; the predicted bone point feature sequence includes a predicted rotation angle feature sequence of bone points between adjacent associated joints and a predicted position feature sequence of crotch center points of the sample object.
According to an embodiment of the present disclosure, the loss calculation module may include: the device comprises a rotation angle loss calculation sub-module, a center point loss calculation sub-module and a second obtaining sub-module.
And the rotation angle loss calculation sub-module is used for obtaining a rotation angle loss value according to the sample rotation angle characteristic sequence and the predicted rotation angle characteristic sequence based on the first loss function.
And the central point loss calculation sub-module is used for obtaining a central point loss value according to the sample position characteristic sequence and the predicted position characteristic sequence based on the second loss function.
And the second obtaining submodule is used for obtaining a target loss value according to the rotation angle loss value and the center point position loss value.
According to an embodiment of the present disclosure, a mask module may include: dividing the molecular module and masking the sub-module.
And the dividing sub-module is used for dividing the initial sample action characteristic sequence to obtain a plurality of sample action characteristic sequence fragments.
And the masking submodule is used for masking the continuous preset number of sample action characteristic sequence fragments in the plurality of sample action characteristic sequence fragments to obtain a masking action sequence.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform the method described above.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements the method described above.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the methods and processes described above, for example, the avatar generation method or the training method of the deep learning model. For example, in some embodiments, the avatar generation method or the training method of the deep learning model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the avatar generation method or the training method of the deep learning model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the avatar generation method or the training method of the deep learning model in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (25)

1. A method of generating an avatar, comprising:
generating a motion track feature sequence according to the bone point features of the initial gesture of the target object and the bone point features of the target gesture of the target object, wherein the motion track feature sequence comprises bone point features of a plurality of transition actions required to be completed by the target object for transforming from the initial gesture to the target gesture;
processing the skeleton point characteristics of the initial gesture, the motion track characteristic sequence and the skeleton point characteristics of the target gesture to generate an action characteristic sequence, wherein the action characteristic sequence is used for driving an avatar to obtain the target gesture from the initial gesture through continuous gesture change; and
rendering the action feature sequence to generate the avatar.
2. The method of claim 1, wherein the generating a sequence of motion trajectory features from skeletal point features of an initial pose of a target object and skeletal point features of a target pose of the target object comprises:
obtaining the number of transition actions according to the bone point characteristics of the initial gesture and the bone point characteristics of the target gesture; and
processing the bone point features of the initial gesture and the bone point features of the target gesture based on the number of the transition actions to obtain the motion track feature sequence.
3. The method of claim 2, wherein the obtaining the number of transition actions according to the bone point features of the initial gesture and the bone point features of the target gesture comprises:
according to the bone point characteristics of the initial gesture, obtaining first positions of a plurality of bone points of the target object in the initial gesture state;
obtaining second positions of a plurality of skeleton points of the target object in the state of the target gesture according to the skeleton point characteristics of the target gesture; and
obtaining the number of transition actions according to the first positions of the plurality of bone points and the second positions of the plurality of bone points.
4. The method of claim 3, wherein the obtaining the number of transition actions according to the first positions of the plurality of bone points and the second positions of the plurality of bone points comprises:
obtaining, according to identifications of the bone points, a position difference of a target bone point from the first positions of the plurality of bone points and the corresponding second positions of the plurality of bone points, wherein the target bone point represents the bone point with the largest position difference between the first position and the second position; and
obtaining the number of transition actions according to the position difference of the target bone point and a preset transition parameter.
5. The method of claim 2, wherein the number of transition actions is I, I is an integer greater than 1, and the processing the bone point features of the initial gesture and the bone point features of the target gesture based on the number of the transition actions to obtain the motion track feature sequence comprises:
for the ith transition action, generating skeleton point characteristics of the ith transition action according to the skeleton point characteristics of the initial gesture, the number of transition actions, the sequence of the transition actions and the skeleton point characteristics of the target gesture;
returning to perform the operation of generating the skeleton point features of the ith transition action and increasing i, in a case where it is determined that i is smaller than I; and
generating the motion track feature sequence according to the bone point features of the I transition actions, in a case where it is determined that i is equal to I.
6. The method of claim 1, wherein the processing the skeleton point features of the initial gesture, the motion track feature sequence, and the skeleton point features of the target gesture to generate an action feature sequence comprises:
splicing the skeleton point characteristics of the initial gesture, the motion track characteristic sequence and the skeleton point characteristics of the target gesture to generate a skeleton point characteristic sequence to be processed; and
processing the skeleton point feature sequence to be processed based on an attention mechanism to obtain the action feature sequence.
7. The method of claim 6, wherein the processing the skeleton point feature sequence to be processed based on an attention mechanism to obtain the action feature sequence comprises:
encoding the feature sequence of the bone points to be processed to obtain an encoded feature sequence; and
processing the encoded feature sequence based on a cross-attention mechanism to obtain the action feature sequence.
8. The method of claim 1, wherein the bone point features are used to characterize absolute positions of a plurality of bone points of the target object and relative positions between a plurality of bone points on adjacent joints.
9. A training method of a deep learning model, comprising:
performing local masking on an initial sample motion feature sequence to obtain a masking motion feature sequence, wherein the initial sample motion feature sequence comprises a sample skeleton point feature sequence of a sample object changing from a sample initial posture to a sample target posture through continuous posture changes;
encoding the mask action feature sequence to obtain an encoded sample feature sequence;
processing the encoded sample feature sequence based on an attention mechanism to obtain a fusion feature sequence, wherein the fusion feature sequence comprises a predicted skeleton point feature sequence of the sample object changing from the sample initial posture to the sample target posture through continuous posture changes;
based on a target loss function, obtaining a target loss value according to the predicted bone point feature sequence and the sample bone point feature sequence; and
adjusting model parameters of the initial model based on the target loss value to obtain a trained deep learning model.
10. The method of claim 9, wherein the sample bone point feature sequence comprises a sample rotation angle feature sequence of bone points between adjacent associated joints and a sample position feature sequence of crotch center points of the sample object; the predicted bone point feature sequence comprises a predicted rotation angle feature sequence of bone points between adjacent associated joints and a predicted position feature sequence of crotch center points of the sample object;
wherein the obtaining a target loss value according to the predicted bone point feature sequence and the sample bone point feature sequence based on the target loss function comprises:
based on a first loss function, obtaining a rotation angle loss value according to the sample rotation angle characteristic sequence and the predicted rotation angle characteristic sequence;
based on a second loss function, obtaining a center point position loss value according to the sample position feature sequence and the predicted position feature sequence; and
obtaining the target loss value according to the rotation angle loss value and the center point position loss value.
11. The method of claim 9, wherein the performing local masking on the initial sample motion feature sequence to obtain the masking motion feature sequence comprises:
dividing the initial sample motion feature sequence to obtain a plurality of sample motion feature sequence fragments; and
masking a predetermined number of consecutive sample motion feature sequence fragments among the plurality of fragments to obtain the masking motion feature sequence.
12. An avatar generation apparatus comprising:
the generation module is used for generating a motion track feature sequence according to the bone point features of the initial gesture of the target object and the bone point features of the target gesture of the target object, wherein the motion track feature sequence comprises bone point features of a plurality of transition actions required to be completed by the target object for transforming from the initial gesture to the target gesture;
the processing module is used for processing the skeleton point characteristics of the initial gesture, the motion track characteristic sequence and the skeleton point characteristics of the target gesture to generate an action characteristic sequence, wherein the action characteristic sequence is used for driving the virtual image to obtain the target gesture from the initial gesture through continuous gesture change; and
the rendering module is used for rendering the action feature sequence and generating the virtual image.
13. The apparatus of claim 12, wherein the generating means comprises:
the first obtaining submodule is used for obtaining the number of transition actions according to the bone point characteristics of the initial gesture and the bone point characteristics of the target gesture; and
the first processing sub-module is used for processing the bone point features of the initial gesture and the bone point features of the target gesture based on the number of the transition actions to obtain the motion track feature sequence.
14. The apparatus of claim 13, wherein the first obtaining submodule comprises:
a bone point position obtaining unit of an initial gesture, configured to obtain first positions of a plurality of bone points of the target object in a state of the initial gesture according to bone point features of the initial gesture;
a skeleton point position obtaining unit of a target gesture, configured to obtain second positions of a plurality of skeleton points of the target object in a state of the target gesture according to skeleton point features of the target gesture; and
the transition number calculation unit is used for obtaining the number of transition actions according to the first positions of the plurality of bone points and the second positions of the plurality of bone points.
15. The apparatus of claim 14, wherein the transition number calculation unit comprises:
a position difference obtaining subunit, configured to obtain, according to the identification of the bone points, a position difference of a target bone point according to a first position of the plurality of bone points and a second position of the corresponding plurality of bone points, where the target bone point represents a bone point with a largest position difference between the first position and the second position; and
the transition number calculation subunit is used for obtaining the number of transition actions according to the position difference of the target bone point and a preset transition parameter.
16. The apparatus of claim 13, wherein the number of transitional actions is I, I being an integer greater than 1, the first processing submodule comprising:
the first generation unit is used for generating skeleton point characteristics of the ith transition action according to the skeleton point characteristics of the initial gesture, the number of the transition actions, the sequence of the transition actions and the skeleton point characteristics of the target gesture aiming at the ith transition action;
the second generation unit is used for returning to execute the operation of generating the skeleton point features of the ith transition action and increasing i, in a case where it is determined that i is smaller than I; and
the third generation unit is used for generating the motion track feature sequence according to the skeleton point features of the I transition actions in a case where it is determined that i is equal to I.
17. The apparatus of claim 12, wherein the processing module comprises:
the splicing sub-module is used for splicing the skeleton point characteristics of the initial gesture, the motion track characteristic sequence and the skeleton point characteristics of the target gesture to generate a skeleton point characteristic sequence to be processed; and
the attention sub-module is used for processing the skeleton point feature sequence to be processed based on an attention mechanism to obtain the action feature sequence.
18. The apparatus of claim 17, wherein the attention sub-module comprises:
the coding unit is used for coding the bone point characteristic sequence to be processed to obtain a coded characteristic sequence; and
the attention unit is used for processing the coded feature sequence based on a cross-attention mechanism to obtain the action feature sequence.
19. The apparatus of claim 12, wherein the bone point features are used to characterize absolute positions of a plurality of bone points of the target object and relative positions between a plurality of bone points on adjacent joints.
20. A training device for a deep learning model, comprising:
the masking module is used for locally masking the initial sample action feature sequence to obtain a masking action feature sequence, wherein the initial sample action feature sequence comprises a sample skeleton point feature sequence of a sample object changing from a sample initial posture to a sample target posture through continuous posture changes;
the coding module is used for coding the mask action feature sequence to obtain a coded sample feature sequence;
the attention module is used for processing the coded sample feature sequence based on an attention mechanism to obtain a fusion feature sequence, wherein the fusion feature sequence comprises a predicted skeleton point feature sequence of the sample object changing from the sample initial posture to the sample target posture through continuous posture changes;
the loss calculation module is used for obtaining a target loss value according to the predicted bone point characteristic sequence and the sample bone point characteristic sequence based on a target loss function; and
the adjusting module is used for adjusting the model parameters of the initial model based on the target loss value to obtain a trained deep learning model.
21. The apparatus of claim 20, wherein the sample bone point feature sequence comprises a sample rotation angle feature sequence of bone points between adjacent associated joints and a sample position feature sequence of crotch center points of the sample object; the predicted bone point feature sequence comprises a predicted rotation angle feature sequence of bone points between adjacent associated joints and a predicted position feature sequence of crotch center points of the sample object; the loss calculation module includes:
The rotation angle loss calculation sub-module is used for obtaining a rotation angle loss value according to the sample rotation angle characteristic sequence and the predicted rotation angle characteristic sequence based on a first loss function;
the central point loss calculation sub-module is used for obtaining a central point loss value according to the sample position feature sequence and the predicted position feature sequence based on a second loss function; and
the second obtaining submodule is used for obtaining the target loss value according to the rotation angle loss value and the central point position loss value.
22. The apparatus of claim 20, wherein the masking module comprises:
the dividing sub-module is used for dividing the initial sample action characteristic sequence to obtain a plurality of sample action characteristic sequence fragments; and
the masking submodule is used for masking a predetermined number of consecutive sample action feature sequence fragments among the plurality of sample action feature sequence fragments to obtain the masking action feature sequence.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-11.
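Purely as an illustration outside the claims, one possible reading of claims 3 to 5 is sketched below: the number of transition actions is derived from the bone point with the largest displacement and a preset transition parameter, and the bone point features of the transition actions are then generated by interpolation; the linear interpolation, the function names, and the default parameter value are assumptions and not the claimed implementation.

```python
import numpy as np

def num_transition_actions(initial_positions, target_positions, transition_param=0.05):
    """Number of transition actions from the largest first/second position difference (sketch)."""
    diffs = np.linalg.norm(target_positions - initial_positions, axis=-1)  # per bone point
    return max(1, int(np.ceil(diffs.max() / transition_param)))

def motion_trajectory(initial_features, target_features, transition_param=0.05):
    """Generate bone point features of I transition actions (linear interpolation assumed)."""
    I = num_transition_actions(initial_features, target_features, transition_param)
    trajectory = []
    for i in range(1, I + 1):            # i-th transition action, i = 1 .. I
        alpha = i / (I + 1)              # order of the transition action in the sequence
        trajectory.append((1 - alpha) * initial_features + alpha * target_features)
    return np.stack(trajectory)          # motion trajectory feature sequence

# Example with an assumed 24 bone points in 3-D:
initial = np.zeros((24, 3))
target = np.random.rand(24, 3)
print(motion_trajectory(initial, target).shape)  # (I, 24, 3)
```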
CN202311121138.2A 2023-09-01 2023-09-01 Virtual image generation method, deep learning model training method and device Pending CN117152208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311121138.2A CN117152208A (en) 2023-09-01 2023-09-01 Virtual image generation method, deep learning model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311121138.2A CN117152208A (en) 2023-09-01 2023-09-01 Virtual image generation method, deep learning model training method and device

Publications (1)

Publication Number Publication Date
CN117152208A true CN117152208A (en) 2023-12-01

Family

ID=88909507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311121138.2A Pending CN117152208A (en) 2023-09-01 2023-09-01 Virtual image generation method, deep learning model training method and device

Country Status (1)

Country Link
CN (1) CN117152208A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110102050A (en) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 Virtual objects display methods, device, electronic equipment and storage medium
CN111145322A (en) * 2019-12-26 2020-05-12 上海浦东发展银行股份有限公司 Method, apparatus and computer-readable storage medium for driving avatar
CN114119908A (en) * 2020-08-27 2022-03-01 北京陌陌信息技术有限公司 Clothing model driving method, equipment and storage medium
US20230103997A1 (en) * 2021-10-05 2023-04-06 Samsung Electronics Co., Ltd. Integrating spatial locality into image transformers with masked attention
CN115205760A (en) * 2022-08-11 2022-10-18 杭州电子科技大学 Video dense description generation method based on deep local self-attention network
CN116392812A (en) * 2022-12-02 2023-07-07 阿里巴巴(中国)有限公司 Action generating method and virtual character animation generating method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349484A (en) * 2023-12-06 2024-01-05 四川物通科技有限公司 Virtual-real fusion method and system based on meta universe
CN117349484B (en) * 2023-12-06 2024-02-27 四川物通科技有限公司 Virtual-real fusion method and system based on meta universe
CN117456611A (en) * 2023-12-22 2024-01-26 拓世科技集团有限公司 Virtual character training method and system based on artificial intelligence
CN117456611B (en) * 2023-12-22 2024-03-29 拓世科技集团有限公司 Virtual character training method and system based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN117152208A (en) Virtual image generation method, deep learning model training method and device
CN112819971B (en) Method, device, equipment and medium for generating virtual image
CN115049799B (en) Method and device for generating 3D model and virtual image
CN114842123B (en) Three-dimensional face reconstruction model training and three-dimensional face image generation method and device
CN115147265B (en) Avatar generation method, apparatus, electronic device, and storage medium
CN113658309A (en) Three-dimensional reconstruction method, device, equipment and storage medium
CN113052962B (en) Model training method, information output method, device, equipment and storage medium
CN114549710A (en) Virtual image generation method and device, electronic equipment and storage medium
CN113362263A (en) Method, apparatus, medium, and program product for changing the image of a virtual idol
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN114677572B (en) Object description parameter generation method and deep learning model training method
CN114549728A (en) Training method of image processing model, image processing method, device and medium
CN116524151A (en) Method, apparatus and computer program product for generating an avatar
CN116524162A (en) Three-dimensional virtual image migration method, model updating method and related equipment
CN113327311B (en) Virtual character-based display method, device, equipment and storage medium
CN115359166A (en) Image generation method and device, electronic equipment and medium
CN114078184A (en) Data processing method, device, electronic equipment and medium
CN116452741B (en) Object reconstruction method, object reconstruction model training method, device and equipment
CN116051694B (en) Avatar generation method, apparatus, electronic device, and storage medium
CN116432732A (en) Training method and device for hairline segment processing model and hairline segment processing method
CN116385643B (en) Virtual image generation method, virtual image model training method, virtual image generation device, virtual image model training device and electronic equipment
CN114820908B (en) Virtual image generation method and device, electronic equipment and storage medium
CN115861543B (en) Three-dimensional virtual image generation method and device and electronic equipment
CN116030150B (en) Avatar generation method, device, electronic equipment and medium
CN115713582B (en) Avatar generation method, device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination