WO2024066549A1 - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number
WO2024066549A1
Authority
WO
WIPO (PCT)
Prior art keywords
style information
image sequence
style
information
semantic
Prior art date
Application number
PCT/CN2023/103012
Other languages
French (fr)
Chinese (zh)
Inventor
周世奇
许斌
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024066549A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0475 Generative networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Definitions

  • the present application relates to the field of computer technology, and in particular to a data processing method and related equipment.
  • pure manual modeling is widely used for hyper-realistic virtual humans or celebrity virtual humans, but the manual production cycle is long and the cost is very high.
  • the motion capture modeling method drives the model by collecting motion data with external scanning equipment; compared with pure manual modeling, the time and cost are much lower. It is often used in pan-entertainment industries such as film, television, and live streaming, but it requires the participation of real actors and cannot improve production efficiency.
  • the embodiment of the present application provides a data processing method and related equipment for realizing stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
  • the first aspect of the embodiment of the present application provides a data processing method, which can be applied to scenes such as animation style transfer.
  • the method can be executed by a data processing device, or by a component of the data processing device (such as a processor, a chip, or a chip system, etc.).
  • the method includes: obtaining first style information; obtaining action information of a first image sequence; generating a second image sequence based on the first style information and the action information, the second image sequence has the same action type as the first image sequence, and the second image sequence has the first style information.
  • the above-mentioned style information can be understood as a style description of the image sequence, and the style includes one or more of the following: limb/facial contour, limb/facial proportion, limb movement amplitude, emotion, personality, etc.
  • the action type is used to describe the action of the image sequence, for example, running, jumping, walking, etc.
  • action information can be understood as a low-level vector used to represent the action type; the action vectors corresponding to image sequences of the same action type may differ.
  • the style information and the action information are obtained separately, and the second image sequence is generated based on the first style information and the action information, so as to realize stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
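The pipeline of the first aspect (separate style and action extraction, then generation) can be sketched as follows. This is a minimal illustration in which random linear maps stand in for the patent's style encoder, content encoder, and decoder; all names, dimensions, and the fusion-by-concatenation choice are assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
D_FRAME, D_STYLE, D_CONTENT, T = 16, 4, 8, 10

# Random linear maps as stand-ins for the trained networks.
W_style = rng.standard_normal((D_FRAME, D_STYLE))
W_content = rng.standard_normal((D_FRAME, D_CONTENT))
W_decode = rng.standard_normal((D_STYLE + D_CONTENT, D_FRAME))

def style_encoder(seq):
    # One style vector per sequence (temporal average pooling).
    return seq.mean(axis=0) @ W_style

def content_encoder(seq):
    # One action vector per frame.
    return seq @ W_content

def decoder(style, action):
    # Fuse: broadcast the style vector over every frame, then decode.
    fused = np.concatenate([np.tile(style, (action.shape[0], 1)), action], axis=1)
    return fused @ W_decode

first_seq = rng.standard_normal((T, D_FRAME))   # original animation frames
third_seq = rng.standard_normal((T, D_FRAME))   # sequence supplying the style

first_style = style_encoder(third_seq)          # first style information
action_info = content_encoder(first_seq)        # action information
second_seq = decoder(first_style, action_info)  # generated second image sequence
print(second_seq.shape)  # (10, 16)
```

The second sequence inherits its per-frame action vectors from the first sequence and its sequence-level style vector from the third, which is the separation the text describes.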
  • before the above step of obtaining the first style information, the method further includes: obtaining a third image sequence; and obtaining the first style information includes: obtaining the first style information based on the third image sequence.
  • the first style information is obtained from another, third image sequence, which compensates for the defect that certain types of style information are difficult for users to describe.
  • the above step of: acquiring the first style information based on the third image sequence includes: extracting second style information of the third image sequence; and determining the first style information based on the second style information.
  • the style information of the third image sequence is directly used as the style information to be subsequently migrated to the first image sequence, so that the style of the generated second image sequence is similar to or the same as the style of the third image sequence, thereby satisfying the accurate migration of style.
  • the step of determining the first style information based on the second style information includes: using the second style information as the first style information.
  • the style information of the third image sequence is directly used as the style information to be subsequently migrated to the first image sequence, so that the style of the generated second image sequence is similar to or the same as the style of the third image sequence, thereby compensating for the defect that users have difficulty in describing a certain type of style information, thereby satisfying the precise migration of style.
  • the above step of: determining the first style information based on the second style information includes: displaying a second semantic tag to the user, the second semantic tag being used to describe the second style information; modifying the second semantic tag to a first semantic tag based on the user's first operation, the first semantic tag being used to describe the first style information; and determining the first style information based on the first semantic tag.
  • the user modifies the semantic label through operation on the basis of the third image sequence to achieve the description of style information and ensure user needs, so that the second image sequence generated subsequently can meet the user's style needs for the image sequence.
  • the use of labels to make style information explicit allows users to have a quantitative and qualitative analysis of style information, and then clearly know how to quantitatively describe their needs.
  • the embodiment of the present application can generate any customized stylized digital human animation.
  • the third image sequence is an image sequence of a two-dimensional animation
  • the second style information is two-dimensional style information
  • the first style information is three-dimensional style information
  • the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
  • the stock of 2D video is large enough to allow any style information of a 2D video to be migrated to the 3D original video to obtain the 3D target video.
  • the above steps also include: displaying a first interface to a user, the first interface including multiple semantic tags, the multiple semantic tags being used to describe different style information of different image sequences, and the multiple semantic tags corresponding one-to-one to the style information; obtaining the first style information, including: determining a first semantic tag from the multiple semantic tags based on a second operation of the user; and determining the first style information based on the first semantic tag.
  • any style is extracted from the video offline and a feature library is generated.
  • the user only needs to upload the semantic label of the required personalized style, and then the style information corresponding to the label is automatically identified from the feature library.
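A minimal sketch of such a feature library, assuming a plain tag-to-vector mapping built offline; the tag names and vectors below are illustrative assumptions only, not from the source.

```python
# Style vectors extracted offline from videos, indexed by semantic tag.
feature_library = {
    "lively":    [0.9, 0.1, 0.4],
    "depressed": [0.1, 0.8, 0.2],
    "excited":   [0.8, 0.0, 0.9],
}

def style_from_tag(tag):
    # The user supplies only the semantic tag; the corresponding
    # style information is looked up automatically.
    if tag not in feature_library:
        raise KeyError(f"no style extracted for tag {tag!r}")
    return feature_library[tag]

print(style_from_tag("lively"))  # [0.9, 0.1, 0.4]
```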
  • the above step of: generating a second image sequence based on the first style information and the action information includes: fusing the first style information and the action information to obtain a first motion feature; and acquiring the second image sequence based on the first motion feature.
  • the first style information represented by the first semantic tag is fused with the motion information of the original image sequence to obtain the first motion feature. Therefore, the second image sequence obtained based on the first motion feature realizes style transfer without changing other features of the original image sequence.
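The fusion step can be illustrated as follows. The patent does not name its fusion operator; adaptive instance normalisation (AdaIN), a common choice in style-transfer work, is used here purely as an example, with scalar style statistics as a simplifying assumption.

```python
import numpy as np

def adain_fuse(action_feat, style_mean, style_std, eps=1e-5):
    # Normalise the per-frame action features, then re-scale them with
    # statistics derived from the style information (AdaIN-style fusion).
    mu = action_feat.mean(axis=0, keepdims=True)
    sigma = action_feat.std(axis=0, keepdims=True)
    normalised = (action_feat - mu) / (sigma + eps)
    return normalised * style_std + style_mean

action = np.random.default_rng(1).standard_normal((10, 8))  # 10 frames x 8 dims
fused = adain_fuse(action, style_mean=0.5, style_std=2.0)   # first motion feature
print(fused.shape)  # (10, 8)
```

Because only the feature statistics are rewritten, the frame-to-frame structure of the action is preserved while the style statistics change, which matches the stated goal of transferring style without altering other features.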
  • the above-mentioned action information includes one or more of the following: a facial expression sequence, a limb image sequence.
  • this method can be applied not only to the style transfer of body movements, but also to the style transfer of facial expressions, etc., and has a wide range of applicable scenarios.
  • the above steps further include: rendering the second image sequence to the virtual object to obtain an animation.
  • This possible implementation manner may be applicable to style transfer scenarios from 2D animation to 2D animation, from 2D animation to 3D animation, or from 3D animation to 3D animation.
  • the style information of the image sequence includes explicit style information and implicit style information
  • the second semantic tag is specifically used to associate the explicit style information in the second style information
  • the style information is decomposed into explicit and implicit parts, so that the user can edit the explicit style information, and the edited explicit style information and implicit style information are combined to generate modified style information.
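A toy sketch of this decomposition, assuming the style vector is simply partitioned into an explicit half (editable via semantic tags) and an implicit half; the partition scheme and the values are illustrative assumptions, not the patent's actual representation.

```python
import numpy as np

style = np.array([0.2, 0.7, 0.1, 0.5, 0.3, 0.9])
explicit, implicit = style[:3], style[3:]   # assumed split point

edited_explicit = explicit.copy()
edited_explicit[0] = 1.0  # user edits one explicit dimension via its tag

# Recombine the edited explicit part with the untouched implicit part.
modified_style = np.concatenate([edited_explicit, implicit])
print(modified_style.tolist())  # [1.0, 0.7, 0.1, 0.5, 0.3, 0.9]
```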
  • extracting action information of the first image sequence includes: inputting the first image sequence into a content encoder to obtain action information
  • extracting second style information of the third image sequence includes: inputting the third image sequence into a style encoder to obtain second style information.
  • the above steps further include: acquiring a first training image sequence and a second training image sequence, wherein the first training image sequence and the second training image sequence have different motion features, and the motion features include action information and/or style information; inputting the first training image sequence into a style encoder and a content encoder respectively to obtain first training style information and first training action information;
  • the second training image sequence is input into the style encoder and the content encoder respectively to obtain the second training style information and the second training action information;
  • the first training style information and the second training action information are fused to obtain the first training motion feature;
  • the second training style information and the first training action information are fused to obtain the second training motion feature;
  • the first training motion feature is input into the decoder to obtain the first reconstructed image sequence;
  • the second training motion feature is input into the decoder to obtain the second reconstructed image sequence;
  • training is performed with the goal of making the value of the first loss function less than the first threshold, to obtain the trained style encoder, content encoder, and decoder
  • the accuracy of style transfer can be achieved through the above training process.
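The swap-reconstruction training steps above can be sketched as follows. Random linear maps stand in for the style encoder, content encoder, and decoder, and a plain squared-error term stands in for the (unspecified) first loss function; all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 6

def style_enc(x, W):
    return x.mean(axis=0) @ W      # sequence-level style vector

def content_enc(x, W):
    return x @ W                   # per-frame action vectors

def decode(s, c, W):
    fused = np.concatenate([np.tile(s, (c.shape[0], 1)), c], axis=1)
    return fused @ W

Ws, Wc = rng.standard_normal((D, 2)), rng.standard_normal((D, 3))
Wd = rng.standard_normal((5, D))

seq1 = rng.standard_normal((T, D))  # first training image sequence
seq2 = rng.standard_normal((T, D))  # second training image sequence

s1, c1 = style_enc(seq1, Ws), content_enc(seq1, Wc)
s2, c2 = style_enc(seq2, Ws), content_enc(seq2, Wc)

recon1 = decode(s1, c2, Wd)  # first training motion feature -> first reconstruction
recon2 = decode(s2, c1, Wd)  # second training motion feature -> second reconstruction

# Stand-in first loss function: squared reconstruction error; training would
# adjust Ws, Wc, Wd until this drops below the first threshold.
loss = ((recon1 - seq1) ** 2).mean() + ((recon2 - seq2) ** 2).mean()
print(loss > 0)  # True
```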
  • the second aspect of the embodiment of the present application provides a data processing device.
  • the data processing device includes: an acquisition unit, used to acquire first style information; the acquisition unit is also used to acquire action information of the first image sequence; a generation unit is used to generate a second image sequence based on the first style information and the action information, the second image sequence has the same action type as the first image sequence, and the second image sequence has the first style information.
  • the acquisition unit is further used to acquire a third image sequence; the acquisition unit is specifically used to acquire the first style information based on the third image sequence.
  • the acquisition unit is specifically used to extract second style information of the third image sequence; and the acquisition unit is specifically used to determine the first style information based on the second style information.
  • the acquisition unit is specifically configured to use the second style information as the first style information.
  • the above-mentioned acquisition unit is specifically used to display a second semantic tag to the user, the second semantic tag being used to describe the second style information; the acquisition unit is specifically used to modify the second semantic tag to a first semantic tag based on the user's first operation, the first semantic tag being used to describe the first style information; the acquisition unit is specifically used to determine the first style information based on the first semantic tag.
  • the third image sequence is an image sequence of a two-dimensional animation
  • the second style information is two-dimensional style information
  • the first style information is three-dimensional style information
  • the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
  • the above-mentioned data processing device also includes: a display unit, used to display a first interface to a user, the first interface including multiple semantic tags, the multiple semantic tags are used to describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information; an acquisition unit, specifically used to determine a first semantic tag from the multiple semantic tags based on the user's second operation; the acquisition unit, specifically used to determine the first style information based on the first semantic tag.
  • the above-mentioned generation unit is specifically used to fuse the first style information and the action information to obtain the first motion feature; the generation unit is specifically used to obtain the second image sequence based on the first motion feature.
  • the above-mentioned action information includes one or more of the following: a facial expression sequence, a limb image sequence.
  • the data processing device further includes: a rendering unit, configured to render the second image sequence to the virtual object to obtain an animation.
  • a third aspect of the present application provides a data processing device, comprising: a processor, the processor is coupled to a memory, the memory is used to store programs or instructions, when the program or instructions are executed by the processor, the data processing device implements the method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • a fourth aspect of the present application provides a computer-readable medium having a computer program or instruction stored thereon.
  • when the computer program or instruction is executed on a computer, the computer executes the method in the aforementioned first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the present application provides a computer program product.
  • when the computer program product is executed on a computer, it enables the computer to execute the method in the aforementioned first aspect or any possible implementation manner of the first aspect.
  • a sixth aspect of an embodiment of the present application provides a chip system, which includes at least one processor for supporting a data processing device to implement the functions involved in the above-mentioned first aspect or any possible implementation method of the first aspect.
  • the chip system may also include a memory for storing program instructions and data necessary for the data processing device.
  • the chip system may be composed of a chip, or may include a chip and other discrete devices.
  • the chip system may further include an interface circuit that provides program instructions and/or data to the at least one processor.
  • the technical effects brought about by the second, third, fourth, fifth, and sixth aspects or any possible implementation methods thereof can refer to the technical effects brought about by the first aspect or different possible implementation methods of the first aspect, and will not be repeated here.
  • the present application has the following advantages: by separating the style information and the action information, and generating the second image sequence based on the first style information and the action information, it is possible to perform stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
  • FIG. 1 is a schematic diagram of an artificial intelligence main body framework provided in an embodiment of the present application
  • FIG. 2 is a schematic diagram of the structure of the system architecture provided in an embodiment of the present application
  • FIG. 3A is a schematic diagram of a deployment scenario provided in an embodiment of the present application
  • FIG. 3B is a schematic diagram of another deployment scenario provided in an embodiment of the present application
  • FIG. 4 is a flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 5A is a schematic diagram of decomposing style information into explicit features according to an embodiment of the present application
  • FIG. 5B is a schematic diagram of a training process of a conversion module provided in an embodiment of the present application
  • FIG. 6A is another schematic flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 6B is a schematic diagram of a process of a user modifying a label according to an embodiment of the present application
  • FIG. 7 is a schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user
  • FIG. 8 is another schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user
  • FIG. 9 is another schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user
  • FIG. 10 is another schematic flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 11 is an example diagram of a first image sequence provided in an embodiment of the present application
  • FIG. 12 is an example diagram of a third image sequence provided in an embodiment of the present application
  • FIG. 13 is an example diagram of a second image sequence provided in an embodiment of the present application
  • FIG. 14 is another schematic flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 15 is another schematic flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 16 is a schematic diagram of the training process of the encoder and decoder provided in an embodiment of the present application
  • FIG. 17 is a schematic flow chart of the method provided in an embodiment of the present application applied to a gesture style transfer scenario
  • FIG. 18 is a schematic flow chart of the method provided in an embodiment of the present application applied to an expression style transfer scenario
  • FIG. 19 is a schematic diagram of a structure of a data processing device provided in an embodiment of the present application
  • FIG. 20 is another schematic diagram of the structure of the data processing device provided in an embodiment of the present application.
  • a neural network may be composed of neural units. A neural unit may refer to an operation unit that takes x_s and an intercept b as input, and the output of the operation unit may be: h = f(∑_{s=1}^{n} W_s·x_s + b)
  • where s = 1, 2, ..., n, and n is a natural number greater than 1
  • W_s is the weight of x_s
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be an area composed of several neural units.
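The single neural unit described above can be computed directly; the following minimal illustration uses the sigmoid activation mentioned in the text, with illustrative input and weight values.

```python
import math

def neuron(xs, ws, b):
    # Output of a single neural unit: f(sum_s W_s * x_s + b),
    # with the sigmoid as the activation function f.
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-z))

print(neuron([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5 (z = 0, sigmoid(0) = 0.5)
```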
  • A generative adversarial network (GAN) is a deep learning model.
  • Generative adversarial network includes at least one generator and one discriminator. It produces better output by letting two neural networks learn in a game of mutual competition. These two neural networks can be deep neural networks or convolutional neural networks.
  • the basic principle of GAN is as follows: Taking the GAN that generates pictures as an example, suppose there are two networks, G (Generator) and D (Discriminator), where G is a network that generates pictures, randomly sampling from the latent space as input to generate pictures, recorded as G (z); D is a discriminator network, which is used to determine whether a picture is "real”.
  • Its input parameter is x
  • x represents a picture
  • x is a real picture or the output of the generator network.
  • the output D(x) represents the probability that x is a real picture: if D(x) is 1, x is certainly a real picture; if D(x) is 0, x cannot be a real picture.
  • the goal of the generative network G is to generate as realistic images as possible to deceive the discriminative network D, and the output results need to imitate the real samples in the training set as much as possible.
  • the goal of the discriminative network D is to distinguish the images generated by G from the real images as much as possible.
  • the two networks compete with each other and constantly adjust parameters.
  • G and D constitute a dynamic "game" process, which is also the "confrontation" in the "generative adversarial network".
  • the ultimate goal is to make the discriminative network unable to judge whether the output results of the generative network are real.
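The role of the discriminator D(x) can be illustrated with a one-dimensional toy: a logistic classifier outputs the probability that a sample is real, and at the GAN equilibrium it can no longer tell, outputting 0.5. The 1-D setting and the parameter values here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D setting: real pictures cluster near +2, generated pictures near -2.
# D(x) = sigmoid(w*x + b) is the discriminator's probability that x is real.
w, b = 1.0, 0.0

def D(x):
    return sigmoid(w * x + b)

print(D(2.0) > 0.5)    # True: judged likely real
print(D(-2.0) < 0.5)   # True: judged likely generated
# At the equilibrium, G's outputs are indistinguishable from real data and the
# best the discriminator can do is output 0.5 everywhere.
print(D(0.0))          # 0.5
```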
  • Virtually created video content includes animated videos displayed on 2D planes, and 3D animated content displayed on 3D display devices such as augmented reality (AR), virtual reality (VR), and holographic displays; its style is not only cartoon style, but also includes realistic style, such as digital human animation, special effects film and television, etc.
  • Virtual digital people refer to virtual characters with digital appearance. Unlike robots with physical bodies, virtual digital people rely on display devices to exist, such as mobile phones, computers or smart large screens. A complete virtual digital person often needs to have the following three capabilities:
  • An image sequence can be understood as a plurality of images with a time-sequential relationship, and of course, can also be an image sequence obtained from a video.
  • the image sequence can include a limb image sequence, and/or a facial expression sequence, etc.
  • the image sequence can refer to an image sequence of the entire body limbs, or an image sequence of a part of the entire body limbs (or called a local limb), or a facial expression sequence of a character corresponding to the image sequence, etc., which is not specifically limited here.
  • the style information involved in the embodiment of the present application can be a style feature vector obtained by the image sequence through the style encoder. It can also be an explicit vector in the style feature vector. It can also be a partial feature of the explicit vector in the style feature vector, etc., which is not limited here.
  • the label corresponding to the style information can also be understood as a style description of the image sequence.
  • the style includes one or more of the following: body/facial contours, body/facial proportions, body movement range, emotions, personality, etc. Emotions can include: happy, depressed, excited, etc. Personality can include: lively, kind, feminine, mean, etc.
  • the action information involved in the embodiment of the present application may be a feature vector obtained by an image sequence through a content encoder.
  • the action type is used to describe the action of the image sequence.
  • the content refers to the action described by the image sequence (for example: running, jumping, squatting, walking, raising head, lowering head, closing eyes, etc.). It is understandable that the action vectors corresponding to the image sequence of the same action type may be different.
  • Semantic tags are used to describe the style information of image sequences, which can be understood as being used to visualize the style of image sequences.
  • the style information corresponds one-to-one to the semantic labels.
  • the semantic label may be different according to different situations of the style information.
  • the semantic label can be understood as being used to describe the style information, so as to facilitate the user to understand or edit the style of the image sequence.
  • the style information is a style feature vector of an image sequence.
  • the semantic tag is an explicit expression of the style feature vector, and the user can use the semantic tag to clarify the style of the image sequence/video (for example, the character's emotions and personality expressed by the body movements of the character in the video), so as to facilitate style editing/migration and other operations.
  • pure manual modeling is widely used for hyper-realistic virtual humans or celebrity virtual humans, but the manual production cycle is long and the cost is very high.
  • the motion capture modeling method drives the model by collecting motion data with external scanning equipment; compared with pure manual modeling, the time and cost are much lower. It is often used in pan-entertainment industries such as film, television, and live streaming, but it requires the participation of real actors and cannot improve production efficiency.
  • the artificial intelligence-driven method is based on algorithms and machine learning.
  • stylized animation editing requires giving the original animation a specified style while keeping its other features as unchanged as possible. Two issues arise: first, how to better decouple style information from animation motion information; second, how to obtain style data at low cost.
  • Video is a major data source, but how to explicitly mark the semantic label features of style in massive video data so that users can complete editing and style transfer by only semantically describing the style is also an important issue.
  • the embodiments of the present application address the defect that existing virtual digital human animation driving methods cannot be arbitrarily stylized, and propose a body movement driving solution based on video style extraction and explicit marking and editing of style information, aiming to fill the gap in AI user personalized animation driving in pan-entertainment scenarios; in addition, extracting style from video can make up for the defect that users have difficulty in describing a certain type of style.
  • an embodiment of the present invention provides a system architecture 100.
  • the data acquisition device 160 is used to collect training data.
  • the training data includes: a first training image sequence and a second training image sequence.
  • the training data is stored in the database 130, and the training device 120 obtains the target model/rule 101 based on the training data maintained in the database 130.
  • the following will describe in more detail how the training device 120 obtains the target model/rule 101 based on the training data, and the target model/rule 101 can be used to implement the data processing method provided in the embodiment of the present application.
  • the target model/rule 101 in the embodiment of the present application may specifically include a style encoder, a content encoder, and a decoder.
  • the training data maintained in the database 130 may not all come from the collection of the data acquisition device 160, but may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rule 101 based entirely on the training data maintained in the database 130, and may also obtain training data from the cloud or other places for model training. The above description should not be used as a limitation on the embodiment of the present application.
  • the target model/rule 101 obtained by training the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1 .
  • the execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an AR/VR, a vehicle terminal, etc., or a server or a cloud.
  • the execution device 110 is configured with an I/O interface 112 for data interaction with an external device.
  • the user can input data to the I/O interface 112 through the client device 140.
  • the input data may include: a first image sequence and a first semantic label; optionally, the input data may also include the first image sequence and a second image sequence, etc.
  • the input data may also be a two-dimensional animation (for example, the two-dimensional animation is the animation to which the second image sequence belongs) and a three-dimensional animation (for example, the three-dimensional animation is the animation to which the first image sequence belongs).
  • the input data may be input by a user, or uploaded by a user through a shooting device, or may come from a database, which is not specifically limited here.
  • the preprocessing module 113 is used to perform preprocessing (e.g., conversion of two-dimensional features to three-dimensional features, etc.) according to the input data received by the I/O interface 112 (e.g., a first image sequence and a first semantic label, or a first image sequence and a second image sequence, or a two-dimensional animation and a three-dimensional animation).
  • when the execution device 110 preprocesses the input data, or when the computing module 111 of the execution device 110 performs processing related to extracting action information of the first image sequence and generating a second image sequence based on the action information and the first semantic tag, the execution device 110 can call data, code, etc. in the data storage system 150 for the corresponding processing, and can also store the second image sequence, instructions, etc. obtained by that processing in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the second image sequence obtained as described above, or the three-dimensional animation corresponding to the second image sequence, to the client device 140 to provide it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks.
  • the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
  • the user can manually give input data, and the manual giving can be operated through the interface provided by the I/O interface 112.
  • the client device 140 can automatically send input data to the I/O interface 112. If the client device 140 is required to automatically send input data and needs to obtain the user's authorization, the user can set the corresponding authority in the client device 140.
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form can be a specific method such as display, sound, action, etc.
  • the client device 140 can also be used as a data acquisition terminal to collect the input data of the input I/O interface 112 and the output results of the output I/O interface 112 as shown in the figure as new sample data, and store them in the database 130.
  • the I/O interface 112 directly stores the input data of the input I/O interface 112 and the output results of the output I/O interface 112 as new sample data in the database 130.
  • FIG1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 can also be placed in the execution device 110.
  • a target model/rule 101 is obtained through training by a training device 120 .
  • the target model/rule 101 may include a style encoder, a content encoder, a decoder, etc. in the embodiment of the present application.
  • FIG2 is a chip hardware structure provided by an embodiment of the present invention, and the chip includes a neural network processor 20.
  • the chip can be set in the execution device 110 shown in FIG1 to complete the calculation work of the calculation module 111.
  • the chip can also be set in the training device 120 shown in FIG1 to complete the training work of the training device 120 and output the target model/rule 101.
  • the neural network processor 20 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or any other processor suitable for large-scale XOR operation processing.
  • taking the NPU as an example: the neural network processor 20 is mounted on the host central processing unit (CPU) as a coprocessor, and the host CPU assigns tasks to it.
  • the core part of the NPU is the operation circuit 203, and the controller 204 controls the operation circuit 203 to extract data from the memory (weight memory or input memory) and perform operations.
  • the operation circuit 203 includes multiple processing units (process engines, PEs) inside.
  • the operation circuit 203 is a two-dimensional systolic array.
  • the operation circuit 203 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • the operation circuit 203 is a general-purpose matrix processor.
  • the operation circuit takes the corresponding data of matrix B from the weight memory 202 and caches it on each PE in the operation circuit.
  • the operation circuit takes the matrix A data from the input memory 201 and performs matrix operation with matrix B, and the partial result or final result of the matrix is stored in the accumulator 208.
  • the vector calculation unit 207 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector calculation unit 207 can be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization, local response normalization, etc.
  • the vector calculation unit 207 stores the vector of processed outputs to the unified buffer 206.
  • the vector calculation unit 207 can apply a nonlinear function to the output of the operation circuit 203, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 207 generates a normalized value, a merged value, or both.
  • the vector of processed outputs can be used as an activation input to the operation circuit 203, for example, for use in a subsequent layer in a neural network.
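The dataflow described above (the operation circuit multiplying matrix A with matrix B, accumulating partial results in the accumulator, and the vector calculation unit then applying a nonlinear function) can be sketched in software as follows. This is an illustrative numpy model of the dataflow only, not a description of the actual hardware; the tile size and ReLU activation are assumptions for illustration.

```python
import numpy as np

def npu_dataflow_sketch(A, B, tile=2):
    """Software model of the dataflow: matrix A is multiplied with the
    weight matrix B tile by tile, partial results are accumulated (the
    role of accumulator 208), and the vector unit then applies a
    nonlinear activation (ReLU is assumed here)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    acc = np.zeros((m, n))            # accumulator for partial results
    for start in range(0, k, tile):   # process the K dimension in tiles
        a_tile = A[:, start:start + tile]
        b_tile = B[start:start + tile, :]
        acc += a_tile @ b_tile        # partial matrix product accumulated
    return np.maximum(acc, 0.0)       # vector unit: ReLU activation

A = np.array([[1.0, -2.0, 3.0, 0.5]])
B = np.arange(8.0).reshape(4, 2)
out = npu_dataflow_sketch(A, B)       # equals relu(A @ B)
```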
  • the unified memory 206 is used to store input data and output data.
  • the direct memory access controller (DMAC) 205 transfers data between the external memory and the on-chip memories: input data is transferred from the external memory to the input memory 201 and/or the unified memory 206, weight data in the external memory is transferred to the weight memory 202, and data in the unified memory 206 is transferred back to the external memory.
  • the bus interface unit (BIU) 210 is used to realize the interaction between the main CPU, DMAC and instruction fetch memory 209 through the bus.
  • An instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204.
  • the controller 204 is used to call the instructions cached in the memory 209 to control the working process of the computing accelerator.
  • the unified memory 206, the input memory 201, the weight memory 202 and the instruction fetch memory 209 are all on-chip memories, and the external memory is a memory outside the NPU, which can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM) or other readable and writable memory.
  • the arbitrary-style editable 3D animation generation solution provided by the embodiments of the present application can be applied to business-facing (2B) digital host scenarios and consumer-facing (2C) digital companion, assistant software and other scenarios. There are many specific deployment solutions, which are described below by example.
  • a deployment scenario provided by an embodiment of the present application is shown in FIG. 3A, where a user uploads an animation video representing a target style on the client.
  • the server extracts the target style from the video and returns the semantic label of the style to the user.
  • the user can then describe, edit, or select the style based on the semantic label of the style, such as for a style with a semantic label of excitement, hoping that the degree is slightly weaker.
  • the client completes the label editing and uploading, and after receiving the request, the server reduces the weight of the semantic label of the style according to the target to reduce the degree of excitement, thereby editing the style information, and then generating a target animation that meets the user label, and returns it to the client for rendering and display.
  • another deployment scenario provided by an embodiment of the present application is shown in FIG. 3B.
  • the server completes the extraction of any style from the video offline and generates a feature library.
  • the user only needs to upload the semantic tags of the required personalized style, such as adding a little feminine style to the excited style.
  • after receiving the request, the server automatically identifies the style information corresponding to the excited and feminine tags from the style information library, edits the features, generates style information that matches the semantic tags of the target style, and completes rendering and display.
  • the style involved in the above deployment scenario may refer to a two-dimensional style or a three-dimensional style.
  • the method provided in the embodiment of the present application may be applied to a scenario where a two-dimensional style is transferred to a two-dimensional image sequence. It may also be applied to a scenario where a three-dimensional style is transferred to a three-dimensional image sequence. It may also be applied to a scenario where a two-dimensional style is transferred to a three-dimensional image sequence, or a scenario where a three-dimensional style is transferred to a two-dimensional image sequence, etc., which are not specifically limited here.
  • FIG4 is an embodiment of a data processing method provided in an embodiment of the present application.
  • the method can be executed by a data processing device (terminal device/cloud server), or by a component of a data processing device (such as a processor, a chip, or a chip system, etc.), and the method includes steps 401 to 403.
  • the method can be applied to scenes of style transfer between animations such as children's educational animations, short video animations, promotional animations, variety show animations, and film and television preview animations.
  • Step 401 Obtain first style information.
  • the style information refers to a style feature vector of an image sequence.
  • the third image sequence is input into a style encoder to obtain the second style information.
  • the training process of the style encoder will be described later and will not be expanded here.
  • the style information refers to an explicit vector or a part of an explicit vector in a style feature vector of an image sequence. If the third image sequence is input into the style encoder to obtain a style feature vector, the style feature vector is split into an explicit vector and an implicit vector. In this case, the style information can be understood as an explicit expression of the style feature vector.
  • the style information in the embodiment of the present application can be the style feature vector corresponding to the image sequence, or it can be the explicit vector in the style feature vector corresponding to the image sequence. It can also be a partial feature of the explicit vector in the style feature vector corresponding to the image sequence, etc.
  • the style information can be decomposed into explicit vectors and implicit features.
  • this decomposition is just an example, and the style information can also be decomposed into explicit vectors, implicit features, and personalized features.
  • the personalized feature is used to express the personalized differences brought about by the same style when interpreted by different roles.
  • the personalized feature can also be related to the role in the image sequence, for example, it can be "Venus style, Trump style", etc.
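The decomposition of style information into explicit and implicit parts described above can be sketched as follows. A fixed split point in the feature vector is a simplifying assumption for illustration; the patent leaves open how the decomposition is realized (e.g., by a trained network).

```python
import numpy as np

def split_style_vector(style_vec, explicit_dim):
    """Hypothetical decomposition of a style feature vector into an
    explicit part (semantically interpretable, e.g. calm->excited) and
    an implicit part.  A fixed split point is assumed here; a trained
    neural network could instead learn the decomposition."""
    explicit = style_vec[:explicit_dim]
    implicit = style_vec[explicit_dim:]
    return explicit, implicit

vec = np.arange(8.0)                      # stand-in style feature vector
explicit, implicit = split_style_vector(vec, explicit_dim=3)
```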
  • the following description takes the case where the style information is an explicit vector as an example.
  • the first one is to obtain the first style information based on the third image sequence.
  • the data processing device first obtains the third image sequence, and obtains the first style information based on the third image sequence.
  • the data processing device may obtain the image sequence by receiving images sent by other devices, by selecting images from a database, by collecting images through various sensors in the data processing device, or through images uploaded by users, etc., which is not limited here.
  • the image sequence (for example, the first image sequence, the third image sequence, etc.) can be a two-dimensional image sequence or a three-dimensional image sequence, etc., which is not specifically limited here.
  • the third image sequence may be an image sequence extracted from the two-dimensional animation.
  • the third image sequence is extracted from the two-dimensional animation by a human posture recognition method (e.g., openpose).
  • the method for obtaining the two-dimensional animation is not limited here, and may be a method of uploading by a user, a method of receiving from other devices, or a method of selecting from a database, etc., which is not limited here.
  • the step of acquiring the first style information based on the third image sequence is divided into two cases according to whether there is a user operation, which are described below respectively.
  • the data processing device may directly extract the second style information of the third image sequence and use the second style information as the first style information, or convert the second style information into preset style information.
  • the above decomposition may be based on a trained neural network, or may be based on finding multiple image sequences for expressing the same style in a database, and then determining explicit vectors based on multiple image sequences for expressing the same style, etc., which is not specifically limited here.
  • the case of determining explicit vectors based on multiple image sequences expressing the same style may specifically include: inputting multiple image sequences of the same style into a style encoder to obtain multiple style feature vectors, and using the common features of the multiple style feature vectors as the style information.
  • the non-common parts are implicit features, etc., which are not specifically limited here.
  • multiple image sequences expressing the style of "happy” are found from the database, and the multiple image sequences are respectively input into the style encoder to obtain multiple style feature vectors.
  • the common vector of the multiple style feature vectors is determined, and the style information of "happy" is the above-mentioned common vector.
  • the corresponding relationship between the explicit style information and the common vector is determined.
  • the explicit vector = W1 × style information 1 + W2 × style information 2 + ... + Wn × style information n.
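The weighted combination above can be sketched as follows. The component names and values are illustrative stand-ins for common vectors obtained from same-style image sequences.

```python
import numpy as np

# Hypothetical style information components (common vectors extracted
# from groups of same-style image sequences); values are illustrative.
style_components = {
    "happy":    np.array([1.0, 0.0, 0.5]),
    "feminine": np.array([0.0, 1.0, 0.2]),
}

def explicit_vector(weights, components):
    """Explicit vector = W1*style_info_1 + ... + Wn*style_info_n."""
    dim = len(next(iter(components.values())))
    out = np.zeros(dim)
    for name, w in weights.items():
        out += w * components[name]
    return out

vec = explicit_vector({"happy": 0.5, "feminine": 0.5}, style_components)
```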
  • the style information is an explicit vector of the style feature vector.
  • the style information may include: “calm->excited”, “single->diverse”, “feminine->masculine”.
  • the terms on either side of "->" refer to the boundaries of a range.
  • "calm" to "excited" is a progression of emotion; in other words, the style information can be further distinguished by different weights/levels.
  • the intensity range of happiness may include several levels such as satisfaction, relief, pleasure, joy, and ecstasy.
  • the style information may also be “satisfied->ecstasy”.
  • the second style information is two-dimensional style information.
  • the second style information can be converted into the first style information through the conversion module, where the first style information is three-dimensional style information. This case is mainly used to migrate the style information of a two-dimensional animation to a three-dimensional animation, so as to broaden the sources of style information for three-dimensional animation.
  • the above-mentioned conversion module can be understood as a 2D-3D style conversion module.
  • this module is trained with a large number of style-consistent 2D-3D pairs to obtain a nonlinear transformation for embedding 2D stylized features into the 3D stylized feature space.
  • the 2D style information (i.e., the second style information) extracted from the video can be converted into 3D stylized features (i.e., the first style information) by projecting it into the 3D space using the nonlinear transformation.
  • the training process of the above conversion module can be shown in FIG5B .
  • a 3D animation sequence is obtained and the 3D stylized features of the 3D animation sequence are extracted.
  • a 2D animation sequence consistent with the style and action of the 3D animation sequence is generated, and the 2D style information is extracted.
  • the style information of both the two are aligned to the same feature space, completing the projection of the 2D style information to the 3D style information space.
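The alignment step above can be sketched as follows. The patent describes a learned nonlinear transformation trained on style/action-consistent 2D-3D pairs; for brevity this sketch fits a linear projection by least squares on synthetic paired features, which stands in for that training. All feature dimensions and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired stylized features: row i holds the 2D style
# feature and the 3D style feature of one style-consistent pair.
feats_2d = rng.normal(size=(100, 4))
true_map = rng.normal(size=(4, 6))
feats_3d = feats_2d @ true_map        # stand-in for extracted 3D features

# Fit the projection from the paired data (linear least squares here;
# the patent uses a learned nonlinear transformation).
W, *_ = np.linalg.lstsq(feats_2d, feats_3d, rcond=None)

def project_2d_style_to_3d(style_2d):
    """Embed a 2D stylized feature into the 3D stylized feature space."""
    return style_2d @ W

projected = project_2d_style_to_3d(feats_2d[0])
```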
  • after the data processing device extracts the second style information of the third image sequence, it can display a second semantic tag to the user, where the second semantic tag is used to explicitly describe the second style information. Then, based on the user's first operation, the second semantic tag is modified into the first semantic tag, and the first style information is determined based on the first semantic tag.
  • the explanation of the semantic tag can refer to the description of the aforementioned related terms, which will not be repeated here.
  • the second semantic label can be understood as a style description of the third image sequence, and the style includes one or more of the following: body/face contour, body/face proportion, body movement amplitude, emotion, personality, etc.
  • This method can also be understood as the data processing device converting the second style information vector of the image sequence into a second semantic label that can be understood by the user, and the user processes the second semantic label according to actual needs to obtain the first semantic label.
  • the data processing device converts the first semantic label into the first style information, and then subsequently generates an image sequence that meets the user's needs.
  • the above processing includes at least one of the following: addition, deletion, modification, degree control (or understood as amplitude, level adjustment), etc.
  • the first operation includes the above-mentioned addition, deletion, modification, degree control (or understood as amplitude, level adjustment), modification of semantic tag weight, etc.
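The editing operations listed above (addition, deletion, degree/weight adjustment) can be sketched as edits on a mapping from semantic tags to weights. The tag names and operation format are illustrative assumptions, not taken from the patent.

```python
def edit_semantic_tags(tags, operations):
    """Apply user edit operations (add / delete / set_weight) to a
    semantic-tag-to-weight mapping; operation format is hypothetical."""
    tags = dict(tags)
    for op, label, *rest in operations:
        if op == "add":
            tags[label] = rest[0] if rest else 1.0
        elif op == "delete":
            tags.pop(label, None)
        elif op == "set_weight" and label in tags:
            tags[label] = rest[0]
    return tags

# Second semantic label: emotional excitement, single style
second = {"calm->excited": 1.0, "single->diverse": 0.0}
first = edit_semantic_tags(second, [
    ("set_weight", "calm->excited", 0.5),    # weaken excitement -> neutral
    ("set_weight", "single->diverse", 1.0),  # single -> diverse
    ("add", "feminine", 1.0),                # add a feminine style
])
```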
  • the data processing device can determine the first operation through the user's voice, text, etc. input method, which is not specifically limited here.
  • for example, the data processing device is a cloud device, and the third image sequence is obtained by receiving it from the terminal device.
  • the process in this case can be shown in FIG. 6.
  • the process includes steps 601 to 606.
  • Step 601 The terminal device sends a third image sequence to the cloud device.
  • the user can send the third image sequence to the cloud device through the terminal device.
  • the cloud device receives the third image sequence sent by the terminal device.
  • Step 602 The cloud device generates a second semantic tag for the third image sequence.
  • after the cloud device obtains the third image sequence, it first obtains the second style information of the third image sequence and converts the second style information into the second semantic label.
  • the following takes the case where the style information is an explicit vector in the style feature vector as an example.
  • multiple image sequences expressing "happy” can be found from the database, and the multiple image sequences can be input into the style encoder to obtain multiple style feature vectors.
  • the common vector of multiple style feature vectors is determined, and the style semantic label of "happy" corresponds to the above common vector (i.e., the explicit vector).
  • the corresponding relationship between the semantic label and the style information is determined.
  • Step 603 The cloud device sends a second semantic tag to the terminal device.
  • after acquiring the second semantic tag, the cloud device sends the second semantic tag to the terminal device.
  • the terminal device receives the second semantic tag sent by the cloud device.
  • Step 604 The terminal device determines the first semantic tag based on the second semantic tag.
  • the process is similar to the above description, and here only takes determining the first style information based on the user's first operation and the third image sequence as an example.
  • after acquiring the second semantic tag, the terminal device displays the second semantic tag to the user, and then modifies the second semantic tag into the first semantic tag based on the user's first operation.
  • Step 605 The terminal device sends the first semantic tag to the cloud device.
  • after acquiring the first semantic tag, the terminal device sends the first semantic tag to the cloud device.
  • the cloud device receives the first semantic tag sent by the terminal device.
  • Step 606 The cloud device determines first style information based on the first semantic tag.
  • the cloud device may determine the first style information based on the first semantic tag.
  • FIG6B is an example of a user modifying a label.
  • the second semantic label of the third image sequence is "emotional excitement, single style".
  • the user performed the following processing: deleting excitement and maintaining neutrality; adjusting the richness of the action, changing from single to diverse; adding a feminine style.
  • the natural language processing (NLP) module in the data processing device can automatically identify and match the semantic label of the style specified by the user, select the style information that matches it, and quantify the degree of a certain style specified by the user; the two are merged to generate the edited style information.
  • the ability of the NLP module is to input a paragraph of text and output the analysis of the paragraph of text (for example, nouns, verbs, keywords that the user cares about).
  • the NLP module outputs the keywords that express style in a paragraph of text. For example, if the input is "the target style I want is half feminine and half masculine", the NLP module can output the following keywords: feminine, masculine, and half each. That is, it parses out the style-related words in the descriptive text. For another example, the user transmits the information "I want a more feminine style" by inputting text or voice, and the NLP module determines from this information that the user wants to "add a feminine style" on the basis of the second semantic tag.
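A toy stand-in for the keyword-extraction behavior described above is sketched below. The style vocabulary and the degree parsing are illustrative assumptions; a real NLP module would use a trained model rather than keyword matching.

```python
import re

# Hypothetical style vocabulary; a real module would be far richer.
STYLE_VOCAB = {"feminine", "masculine", "excited", "calm", "happy"}

def extract_style_keywords(text):
    """Pull style-related keywords and an optional degree word out of
    a free-text request (toy stand-in for the NLP module)."""
    words = re.findall(r"[a-z]+", text.lower())
    keywords = [w for w in words if w in STYLE_VOCAB]
    degree = 0.5 if "half" in words else 1.0   # crude degree quantification
    return keywords, degree

keywords, degree = extract_style_keywords(
    "the target style I want is half feminine and half masculine")
# keywords -> ['feminine', 'masculine'], degree -> 0.5
```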
  • the data processing device displays a user interface as shown in FIG7 to the user.
  • the user interface includes an animation preview interface and an editing interface.
  • the semantic label of the style in the editing interface (also referred to as the style label) can be understood as the aforementioned second semantic label.
  • the second semantic label is excitement and single.
  • the user can modify the second semantic label through the editing interface. As shown in FIG. 8, the user can drag cursor 801 to move "calm->excited" from 1.0 to 0.5, removing the excitement and changing it to neutral, and drag cursor 802 to move "single->diverse" from 0.0 to 1.0, changing single to diverse.
  • the user can click Add Label 803 to add a feminine style label, as shown in FIG. 9. Based on FIGS. 7 to 9 above, the user modifies the second semantic label (excited, single) into the first semantic label (neutral, diverse, feminine).
  • the semantic label of the style information is made explicit, and the user can edit according to the explicit label.
  • this embodiment decomposes the style information and semanticizes its explicit features, thereby realizing the labeling of style information, and identifies, matches, and quantifies the semantic labels of any style specified by the user to generate specific style information. Therefore, both returning feature labels for user editing in the deployment scheme shown in FIG. 3A and matching the user's personalized style with semantic labels in the deployment scheme shown in FIG. 3B become possible, and the user can be more aware of his or her editing behavior.
  • style information may be extracted from the third image sequence/video to compensate for the defect that certain styles are difficult for users to describe.
  • the second type is to determine the first style information based on a second operation performed by the user on the first interface.
  • the data processing device displays a first interface to the user, and the first interface includes multiple semantic tags.
  • Each of the multiple semantic tags is used to explicitly display the style information of the image sequence.
  • the data processing device determines a first semantic tag from the multiple semantic tags based on the user's second operation, and then determines the first style information based on the first semantic tag.
  • This situation can be applied to the scenario shown in FIG. 3B above.
  • the following takes the case where the data processing device is a cloud device as an example.
  • the process in this case can be shown in FIG. 10.
  • the process includes steps 1001 to 1005.
  • Step 1001 A cloud device generates a style information library and multiple semantic tags based on multiple image sequences.
  • the cloud device obtains multiple image sequences, obtains the common vectors of the style feature vectors corresponding to the multiple image sequences, and extracts different semantic labels based on different common vectors, thereby obtaining a style information library of the multiple common vectors and multiple semantic labels.
  • Step 1002 The cloud device sends a plurality of semantic tags to the terminal device.
  • after the cloud device obtains the multiple semantic tags, it sends them to the terminal device. Correspondingly, the terminal device receives the multiple semantic tags sent by the cloud device.
  • Step 1003 The terminal device determines a first semantic tag based on a second operation performed by the user on the first interface.
  • after receiving the plurality of semantic tags sent by the cloud device, the terminal device displays a first interface to the user, where the first interface includes the plurality of semantic tags.
  • the first semantic tag is determined based on a second operation of the user on the first interface.
  • the second operation may specifically be a selection operation, etc.
  • Step 1004 The terminal device sends a first semantic tag to the cloud device.
  • after determining the first semantic tag, the terminal device sends the first semantic tag to the cloud device.
  • the cloud device receives the first semantic tag sent by the terminal device.
  • Step 1005 The cloud device determines the first style information from the style information library based on the first semantic tag.
  • after receiving the first semantic tag sent by the terminal device, the cloud device finds the common vector corresponding to the first semantic tag from the style information library as the first style information.
  • This method can also be understood as the data processing device displays multiple semantic tags to the user, and the user can directly select the semantic tag he needs from the multiple semantic tags as needed, or the user inputs the weights of the multiple semantic tags in the first interface.
  • the third type is to determine the first style information based on a third operation of the user.
  • the data processing device can directly receive the third operation of the user, and determine the first semantic tag in response to the third operation.
  • the third operation may be voice, text, etc., which is not limited here.
  • the user edits "add feminine style” by voice. Then the data processing device can determine the first semantic tag as "feminine” according to the voice of "add feminine style".
  • the data processing device is a server, that is, the data processing device extracts any style from the video offline and generates a feature library.
  • the user only needs to upload the semantic tag of the required personalized style, such as adding a little feminine style to the excited style.
  • the data processing device automatically identifies the style information corresponding to the excited and feminine tags from the style information library, edits the features, generates style information that matches the semantic tag of the target style, and completes rendering and display.
  • Step 402 Acquire action information of a first image sequence.
  • the data processing device obtains a first image sequence, which can be understood as an image sequence whose style information needs to be replaced.
  • the first image sequence is a 3D image sequence.
  • the first image sequence is a 2D image sequence.
  • the first image sequence may be an image sequence extracted from a three-dimensional animation.
  • the first image sequence is extracted from the three-dimensional animation by a human posture recognition method (e.g., openpose).
  • the acquisition method of the three-dimensional animation is not limited here, and may be a method of uploading by a user, a method of receiving from other devices, or a method of selecting from a database, etc., which is not limited here.
  • for example, an example of the first image sequence is shown in FIG. 11, where the action content of the first image sequence is "walking".
  • after acquiring the first image sequence, the data processing device extracts the action information of the first image sequence.
  • the explanation of the action information can refer to the description of the above-mentioned related terms, which will not be repeated here.
  • the first image sequence is input into a content encoder to obtain the action information.
  • the training process of the content encoder will be described later and will not be expanded here.
  • Step 403 Generate a second image sequence based on the first style information and the motion information.
  • the data processing device may determine the first style information based on the first semantic tag, and then generate a second image sequence based on the first style information and the action information.
  • the first semantic tag is used to make the entire first style information explicit.
  • the first style information is determined directly based on the first semantic tag.
  • the first semantic label is used to make the explicit vector in the first style information explicit.
  • the first semantic label is first converted into an explicit vector and then fused with the implicit features of the first image sequence to obtain the first style information.
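The conversion of a semantic label into an explicit vector and its fusion with implicit features could be sketched as follows; the one-hot encoding, the label vocabulary, and the concatenation-based fusion are illustrative assumptions, not the patent's exact method:

```python
import numpy as np

# Hypothetical label vocabulary; real systems would use a learned embedding.
LABEL_VOCAB = {"happy": 0, "frustrated": 1, "excited": 2}

def label_to_explicit_vector(label: str) -> np.ndarray:
    """Encode a semantic label as an explicit (user-editable) style vector."""
    v = np.zeros(len(LABEL_VOCAB), dtype=float)
    v[LABEL_VOCAB[label]] = 1.0
    return v

def fuse_style(explicit: np.ndarray, implicit: np.ndarray) -> np.ndarray:
    """Combine the explicit vector with implicit features to form the
    first style information (here by simple concatenation)."""
    return np.concatenate([explicit, implicit])

implicit = np.zeros(8)  # placeholder for implicit features extracted from the image sequence
first_style = fuse_style(label_to_explicit_vector("frustrated"), implicit)
```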
  • the data processing device fuses the first style information with the action information to obtain a first motion feature, and acquires a second image sequence based on the first motion feature.
  • the fusion algorithm used by the data processing device to fuse the first style information and the action information into the first motion feature may include: Adaptive Instance Normalization (AdaIN), deep learning models, statistical methods, and other methods for aligning distributions.
  • the data processing device inputs the first motion feature into a decoder to obtain a second image sequence.
  • the training process of the decoder will be described later and will not be expanded here.
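The AdaIN-based fusion named above can be sketched as follows: the action (content) features are normalized per channel and then rescaled with the style features' statistics. This is the standard AdaIN formulation; the feature shapes are illustrative:

```python
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive Instance Normalization: shift/scale the content (action)
    features so their per-channel statistics match the style features."""
    c_mean = content.mean(axis=-1, keepdims=True)
    c_std = content.std(axis=-1, keepdims=True)
    s_mean = style.mean(axis=-1, keepdims=True)
    s_std = style.std(axis=-1, keepdims=True)
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

rng = np.random.default_rng(0)
action_feat = rng.normal(size=(4, 64))                   # action information (channels x time)
style_feat = rng.normal(loc=2.0, scale=0.5, size=(4, 64))  # first style information
fused = adain(action_feat, style_feat)                   # first motion feature, fed to the decoder
```

After fusion, the first motion feature would be passed to the decoder to produce the second image sequence.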
  • the first semantic tag is obtained based on the third image sequence.
  • the third image sequence is shown in FIG. 12.
  • the first style information is "frustrated".
  • the second image sequence obtained in this step is shown in FIG. 13.
  • the second image sequence is a "frustrated" walk.
  • the process from step 401 to step 403 may be as shown in FIG. 14.
  • the input end includes a third image sequence (e.g., 2D motion).
  • the 2D style information extraction module extracts the 2D stylized features of the third image sequence and converts them into 3D style information; it also makes the semantic label of the style explicit and returns it to the user for editing.
  • the user generates personalized requirements based on the semantic labels and their needs.
  • the NLP module parses these requirements and inputs them, together with the 3D style information, into the style editing module to generate an edited style information vector (i.e., the first style information).
  • the first image sequence is encoded into a feature expression characterizing its content; this is fused with the above-mentioned edited first style information and, after decoding, generates the image sequence of the 3D target animation (i.e., the second image sequence) that conforms to the user's editing information.
  • after acquiring the second image sequence, the data processing device renders the second image sequence to a virtual object to obtain an animation/video.
  • the generated animation is a 3D animation.
  • the generated animation is a 2D animation.
  • the data processing method provided in the embodiment of the present application is mainly applied to the style transfer scenario of image sequences.
  • the data processing method provided in the embodiment of the present application is mainly used in animation style transfer scenarios.
  • the style information and the action information are obtained by separation, and a second image sequence is generated based on the first style information and the action information.
  • This allows stylized animation editing to be performed without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
  • the style information is described by semantic tags, which make the style information explicit. Users edit the semantic tags to achieve style transfer, realizing a driving scheme for body movements. Users can thus analyze the style information quantitatively and qualitatively, and know clearly how to describe their needs in quantitative terms. In addition, by parsing user needs and leveraging the advantage that massive video collections can cover virtually any style, the embodiment of the present application can generate arbitrarily customized stylized digital human animations.
  • extracting style information from the video to which the second image sequence belongs can make up for the user's difficulty in describing a certain type of style information.
  • using tags makes the style information explicit.
  • another flowchart of the method provided by the embodiment of the present application is shown in FIG. 15.
  • a second image sequence is obtained from the style reference animation, and the stylized features of the second image sequence are extracted to obtain a second stylized feature.
  • the second stylized feature is then made explicit to obtain an explicit label.
  • the user edits the explicit label to obtain a first stylized feature.
  • the first stylized feature is then transferred to the original animation to obtain a stylized animation.
  • the content of the stylized animation is consistent with the original animation, and the style of the stylized animation is consistent with the style reference animation, thereby achieving stylized migration.
  • image sequence 1 and image sequence 2 are obtained. Among them, image sequence 1 has style 1 and action 1. Image sequence 2 has style 2 and action 2.
  • the style encoder and action content encoder are used to encode the style and motion content of the two input sequences respectively to decouple the style information and action information.
  • style information 1 and action information 2 are fused through a fusion algorithm (such as AdaIN), and style 1 action 2 is generated after decoding.
  • style information 2 and action information 1 are fused to generate style 2 action 1.
  • the discriminator supervises the reconstruction losses of the generated stylized animation in style and content respectively, so that the final generated stylized animation can have the greatest similarity with the target style without losing the original motion content.
  • the above process can be understood as: obtaining a first training image sequence and a second training image sequence, wherein the motion features of the first training image sequence and the second training image sequence are different, and the motion features include action information and/or style information.
  • the first training image sequence is input into a style encoder and a content encoder respectively to obtain first training style information and first training action information;
  • the second training image sequence is input into a style encoder and a content encoder respectively to obtain second training style information and second training action information.
  • the first training style information and the second training action information are fused to obtain a first training motion feature;
  • the second training style information and the first training action information are fused to obtain a second training motion feature.
  • the first training motion feature is input into a decoder to obtain a first reconstructed image sequence; the second training motion feature is input into a decoder to obtain a second reconstructed image sequence.
  • Training is performed with the goal of making the value of the first loss function less than a first threshold to obtain a trained style encoder, content encoder and decoder, the first loss function includes a style loss function and a content loss function, the style loss function is used to represent the style difference between the first reconstructed image sequence and the first training image sequence and the style difference between the second reconstructed image sequence and the second training image sequence, and the content loss function is used to represent the content difference between the first reconstructed image sequence and the second training image sequence and the content difference between the second reconstructed image sequence and the first training image sequence.
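The cross-reconstruction objective described above can be illustrated with toy stand-ins for the learned networks. The linear "encoders" and "decoder" below are purely illustrative (with them, swapping styles reconstructs exactly, so the loss is zero), and the adversarial discriminator is omitted:

```python
import numpy as np

def style_enc(x: np.ndarray) -> np.ndarray:
    """Toy style code: per-feature mean over time."""
    return x.mean(axis=0)

def content_enc(x: np.ndarray) -> np.ndarray:
    """Toy action code: per-frame deviation from the mean."""
    return x - x.mean(axis=0)

def decode(style: np.ndarray, content: np.ndarray) -> np.ndarray:
    """Toy decoder: recombine action code with a style code."""
    return content + style

def cross_recon_losses(seq1: np.ndarray, seq2: np.ndarray) -> float:
    """First loss function: style loss + content loss on style-swapped reconstructions."""
    s1, s2 = style_enc(seq1), style_enc(seq2)
    c1, c2 = content_enc(seq1), content_enc(seq2)
    recon1 = decode(s1, c2)  # style of sequence 1, action of sequence 2
    recon2 = decode(s2, c1)  # style of sequence 2, action of sequence 1
    style_loss = (np.abs(style_enc(recon1) - s1).mean()
                  + np.abs(style_enc(recon2) - s2).mean())
    content_loss = (np.abs(content_enc(recon1) - c2).mean()
                    + np.abs(content_enc(recon2) - c1).mean())
    return float(style_loss + content_loss)

rng = np.random.default_rng(1)
seq1 = rng.normal(size=(10, 6))           # training image sequence 1 (frames x features)
seq2 = rng.normal(loc=3.0, size=(10, 6))  # training image sequence 2, different style
loss = cross_recon_losses(seq1, seq2)
```

In actual training, the encoders and decoder are neural networks, and optimization continues until the first loss function falls below the first threshold.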
  • the style encoder, content encoder, and decoder obtained through training can extract 2D stylized features from the video sequence and map them to the 3D feature space to generate a 3D style consistent with its semantics, and make the 3D style information semantically explicit.
  • the user edits it according to the semantic expression of the style to generate a target style that meets their expectations, and the algorithm then generates the corresponding style information from the semantic label of the user's style.
  • the style transfer module is used to transfer the generated 3D target features to the original animation sequence to generate a target stylized virtual digital human animation sequence.
  • the third image sequence in the embodiment shown in FIG4 includes one or more of the following: a facial expression sequence, a limb image sequence.
  • limb movements include global limbs, local limbs (such as gestures, etc.), etc.
  • style transfer such as gestures and expressions.
  • the following takes voice-driven gestures as an example. The scenario in which the method is applied to gesture style transfer is shown in FIG17.
  • by inputting a piece of text or voice data, the virtual digital human is driven to make gestures with known semantics and a rhythm consistent with the voice data.
  • gesture style varies from speaker to speaker, and also with the different emotions of the same speaker, so personalized customization and transfer of style is of great significance to enriching the diversity of gestures.
  • the gesture style information that can cover almost any style is generated through the aforementioned stylized feature extraction module, and a style information database is generated offline.
  • the user specifies any personalized stylized label; the label is parsed and quantified, combined with the offline-generated style database to produce the edited style information, and the motion sequence generated by the voice-driven gesture module is stylized into the target style.
  • the scenario where this method is applied to expression style transfer is shown in FIG. 18.
  • This scenario can also be understood as a digital human expression base style editing and transfer scenario.
  • an expression base is defined as a predetermined set of coordinates of several key points on the face used to represent a neutral expression.
  • the original coefficients represent the parametric expression of a specific expression relative to the neutral expression, such as the degree of mouth opening when smiling relative to the neutral expression.
  • the whole process of FIG. 18 is as follows: first, the original coefficients corresponding to a person's expression are calculated from a preset expression base through an expression network; the coefficients corresponding to the various expressions in the video are obtained through the same set of expression bases, and the user controls the expression to be generated by editing the coefficients.
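The expression-base model described above is a linear blendshape model: the keypoints of an expression are the neutral keypoints plus a coefficient-weighted sum of basis offsets. A minimal sketch, with illustrative keypoint counts and basis values:

```python
import numpy as np

def apply_expression(neutral: np.ndarray,   # (K, 3) neutral-face keypoint coordinates
                     bases: np.ndarray,     # (N, K, 3) expression bases, as offsets from neutral
                     coeffs: np.ndarray) -> np.ndarray:  # (N,) coefficients edited by the user
    """Keypoints of an expression = neutral + sum_i coeff_i * basis_i."""
    return neutral + np.tensordot(coeffs, bases, axes=1)

# Illustrative data: 5 keypoints, 2 expression bases (e.g. "smile", "frown").
neutral = np.zeros((5, 3))
bases = np.stack([np.full((5, 3), 1.0), np.full((5, 3), -0.5)])
face = apply_expression(neutral, bases, np.array([0.6, 0.2]))
```

Editing the coefficient vector is what lets the user control the generated expression, as in the FIG. 18 flow.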
  • on the one hand, stylized features of gestures/expressions can be extracted from video sequences and converted into 3D style information, greatly enriching style diversity; on the other hand, the style of gestures/expressions extracted from the video is explicitly labeled, which facilitates semantic description of the style with the user and enables the subsequent matching and fusion of labels and style information.
  • An embodiment of the data processing device in the embodiment of the present application includes:
  • an acquisition unit 1901, used to acquire first style information;
  • the acquisition unit 1901 is further used to acquire the action information of the first image sequence;
  • the generating unit 1902 is configured to generate a second image sequence based on the first style information and the action information.
  • the second image sequence has the same action type as the first image sequence, and the second image sequence has the first style information.
  • the data processing device may also include: a display unit 1903, used to display a first interface to the user, the first interface including multiple semantic tags, the multiple semantic tags are used to describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information; an acquisition unit 1901, specifically used to determine a first semantic tag from multiple semantic tags based on a second operation of the user; and used to convert the first semantic tag into first style information.
  • the data processing device may further include: a rendering unit 1904, configured to render the second image sequence to a virtual object to obtain an animation.
  • the acquisition unit 1901 acquires the style information and the action information by separation, and the generation unit 1902 generates the second image sequence based on the first style information and the action information, so as to realize stylized animation editing without changing other features of the original image sequence, and improve the style transfer effect of the animation.
  • the data processing device may include a processor 2001, a memory 2002, and a communication port 2003.
  • the processor 2001, the memory 2002, and the communication port 2003 are interconnected via a line.
  • the memory 2002 stores program instructions and data.
  • the memory 2002 stores program instructions and data corresponding to the steps executed by the data processing device in the corresponding implementation modes shown in the aforementioned FIGS. 1 to 18 .
  • the processor 2001 is used to execute the steps performed by the data processing device shown in any of the embodiments shown in Figures 1 to 18 above.
  • the communication port 2003 can be used to receive and send data, and to execute the steps related to acquisition, sending, and receiving in any of the embodiments shown in Figures 1 to 18 above.
  • the data processing device may include more or fewer components than those in FIG. 20 , and this application is merely an illustrative description and is not intended to be limiting.
  • An embodiment of the present application further provides a computer-readable storage medium storing one or more computer-executable instructions.
  • when the instructions are executed by a processor, the processor performs the method described in the possible implementation manners of the data processing device in the aforementioned embodiments.
  • An embodiment of the present application also provides a computer program product (or computer program) storing one or more computer-executable instructions.
  • when the instructions are executed by a processor, the processor performs the method of the possible implementation manners of the above-mentioned data processing device.
  • the embodiment of the present application also provides a chip system, which includes at least one processor for supporting a terminal device to implement the functions involved in the possible implementation of the above-mentioned data processing device.
  • the chip system also includes an interface circuit, which provides program instructions and/or data for the at least one processor.
  • the chip system may also include a memory, which is used to store the necessary program instructions and data for the terminal device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

Provided in the present application is a data processing method, which can be applied to a scenario such as animation style transfer. The method comprises: acquiring first style information; extracting action information of a first image sequence; and generating a second image sequence on the basis of the first style information and the action information, wherein the action type of the second image sequence is the same as that of the first image sequence, and the second image sequence has the first style information. Style information and action information are separated and acquired, and a second image sequence is generated on the basis of the first style information and the action information, such that stylized animation editing is performed without changing other features of an original image sequence, thereby improving the effect of animation style transfer.

Description

A data processing method and related device

This application claims priority to Chinese Patent Application No. 202211202267.X, filed with the China National Intellectual Property Administration on September 29, 2022 and entitled "A data processing method and related device", which is incorporated herein by reference in its entirety.

Technical Field

The present application relates to the field of computer technology, and in particular to a data processing method and related device.
Background

With the introduction of the concept of the metaverse, "virtual digital humans" are seen as the medium through which humans will enter the metaverse in the future, and have become a focus of public attention. In addition, the Beijing Winter Olympics became a showcase for current virtual digital human technology: whether the virtual avatars of sports and entertainment stars or public-service-oriented virtual anchors (such as artificial intelligence sign language anchors), they have given the public a more intuitive and deeper understanding of virtual digital humans. As driving technology matures, virtual digital humans will surely be more widely used in practical, monetizable scenarios such as virtual customer service, virtual shopping guides, and virtual tour guides.

At present, there are several mainstream methods for driving virtual digital humans to imitate human behavior: pure manual modeling and motion capture modeling. Pure manual modeling is widely used for hyper-realistic virtual humans or celebrity virtual humans, but the manual production cycle is long and the cost is very high. Motion capture modeling completes the driving by collecting model data with external scanning equipment; compared with pure manual modeling, the time and cost are much lower, and it is commonly used in pan-entertainment industries such as film, television, and live streaming, but it requires the participation of real actors and cannot improve production efficiency.

Therefore, how to transfer different styles between animation actions is a technical problem that urgently needs to be solved.
Summary

The embodiments of the present application provide a data processing method and related device, used to perform stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.

The first aspect of the embodiments of the present application provides a data processing method, which can be applied to scenarios such as animation style transfer. The method can be executed by a data processing device, or by a component of the data processing device (such as a processor, a chip, or a chip system). The method includes: obtaining first style information; obtaining action information of a first image sequence; and generating a second image sequence based on the first style information and the action information, where the second image sequence has the same action type as the first image sequence and the second image sequence has the first style information. The above style information can be understood as a style description of the image sequence, where the style includes one or more of the following: limb/facial contour, limb/facial proportion, limb movement amplitude, emotion, personality, etc. The action type is used to describe the action of the image sequence, for example, running, jumping, or walking. The action information can be understood as a low-level vector used to represent the action type. It can be understood that the action vectors corresponding to image sequences of the same action type may differ.

In the embodiments of the present application, the style information and the action information are obtained separately, and the second image sequence is generated based on the first style information and the action information, so as to perform stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
Optionally, in a possible implementation of the first aspect, before the step of obtaining the first style information, the method further includes: obtaining a third image sequence; and obtaining the first style information includes: obtaining the first style information based on the third image sequence.

In this possible implementation, obtaining the first style information through another, third image sequence can make up for the user's difficulty in describing a certain type of style information.

Optionally, in a possible implementation of the first aspect, the step of obtaining the first style information based on the third image sequence includes: extracting second style information of the third image sequence; and determining the first style information based on the second style information.

In this possible implementation, the style information of the third image sequence is directly used as the style information to be subsequently transferred to the first image sequence, so that the style of the generated second image sequence is similar or identical to that of the third image sequence, achieving accurate style transfer.

Optionally, in a possible implementation of the first aspect, the step of determining the first style information based on the second style information includes: using the second style information as the first style information.

In this possible implementation, the style information of the third image sequence is directly used as the style information to be subsequently transferred to the first image sequence, so that the style of the generated second image sequence is similar or identical to that of the third image sequence, making up for the user's difficulty in describing a certain type of style information and achieving accurate style transfer.

Optionally, in a possible implementation of the first aspect, the step of determining the first style information based on the second style information includes: displaying a second semantic tag to the user, the second semantic tag being used to describe the second style information; modifying the second semantic tag to a first semantic tag based on a first operation of the user, the first semantic tag being used to describe the first style information; and determining the first style information based on the first semantic tag.

In this possible implementation, on the basis of the third image sequence, the user modifies the semantic tag through an operation, so as to describe the style information and meet the user's needs, ensuring that the subsequently generated second image sequence satisfies the user's style requirements for the image sequence. In other words, using tags to make style information explicit allows the user to analyze the style information quantitatively and qualitatively, and thus to know clearly how to describe their needs in quantitative terms. In addition, by parsing user needs and leveraging the advantage that videos can cover any style, the embodiments of the present application can generate arbitrarily customized stylized digital human animations.

Optionally, in a possible implementation of the first aspect, the third image sequence is an image sequence of a two-dimensional animation, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are image sequences of a three-dimensional animation.

In this possible implementation, the stock of 2D videos is large enough that any style information of a 2D video can be transferred to the original 3D video to obtain the target 3D video.
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:向用户显示第一界面,第一界面包括多个语义标签,多个语义标签用于描述不同图像序列的不同风格信息,多个语义标签与风格信息一一对应;获取第一风格信息,包括:基于用户的第二操作从多个语义标签中确定第一语义标签;基于第一语义标签确定第一风格信息。Optionally, in a possible implementation of the first aspect, the above steps also include: displaying a first interface to a user, the first interface including multiple semantic tags, the multiple semantic tags being used to describe different style information of different image sequences, and the multiple semantic tags corresponding one-to-one to the style information; obtaining the first style information, including: determining a first semantic tag from the multiple semantic tags based on a second operation of the user; and determining the first style information based on the first semantic tag.
该种可能的实现方式中,该种可能的实现方式中,可以理解为离线完成从视频中提取任意风格,并生成特征库。用户只需上传要求的个性化风格的语义标签,进而实现自动从特征库中识别标签对应的风格信息。In this possible implementation, it can be understood that any style is extracted from the video offline and a feature library is generated. The user only needs to upload the semantic label of the required personalized style, and then the style information corresponding to the label is automatically identified from the feature library.
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一风格信息与动作信息生成第二图像序列,包括:融合第一风格信息与动作信息以得到第一运动特征;基于第一运动特征获取第二图像序列。Optionally, in a possible implementation manner of the first aspect, the above step of: generating a second image sequence based on the first style information and the action information includes: fusing the first style information and the action information to obtain a first motion feature; and acquiring the second image sequence based on the first motion feature.
该种可能的实现方式中,将第一语义标签表示的第一风格信息与原图像序列的动作信息进行融合以得到第一运动特征。因此,基于该第一运动特征获取的第二图像序列,在不改变原图像序列其他特征的情况下实现风格迁移。In this possible implementation, the first style information represented by the first semantic tag is fused with the motion information of the original image sequence to obtain the first motion feature. Therefore, the second image sequence obtained based on the first motion feature realizes style transfer without changing other features of the original image sequence.
可选地,在第一方面的一种可能的实现方式中,上述动作信息包括以下一项或多项:面部表情序列、肢体图像序列。Optionally, in a possible implementation manner of the first aspect, the above-mentioned action information includes one or more of the following: a facial expression sequence, a limb image sequence.
该种可能的实现方式中,该种可能的实现方式中,该方法不仅可以应用于肢体动作的风格迁移,还可以应用于面部表情的风格迁移等,适用场景广泛。In this possible implementation, this method can be applied not only to the style transfer of body movements, but also to the style transfer of facial expressions, etc., and has a wide range of applicable scenarios.
Optionally, in a possible implementation of the first aspect, the above steps further include: rendering the second image sequence onto a virtual object to obtain an animation.
This possible implementation is applicable to style transfer scenarios from 2D animation to 2D animation, from 2D animation to 3D animation, or from 3D animation to 3D animation.
Optionally, in a possible implementation of the first aspect, the style information of the image sequence includes explicit style information and implicit style information, and the second semantic tag is specifically used to associate with the explicit style information in the second style information.
In this possible implementation, the style information is decomposed into explicit and implicit parts, enabling the user to edit the explicit style information. The edited explicit style information is then combined with the implicit style information to generate the modified style information.
Optionally, in a possible implementation of the first aspect, the above step of extracting the action information of the first image sequence includes: inputting the first image sequence into a content encoder to obtain the action information; and extracting the second style information of the third image sequence includes: inputting the third image sequence into a style encoder to obtain the second style information.
Optionally, in a possible implementation of the first aspect, the above steps further include: acquiring a first training image sequence and a second training image sequence, where the first training image sequence and the second training image sequence have different motion features, the motion features including action information and/or style information; inputting the first training image sequence into a style encoder and a content encoder respectively to obtain first training style information and first training action information; inputting the second training image sequence into the style encoder and the content encoder respectively to obtain second training style information and second training action information; fusing the first training style information with the second training action information to obtain a first training motion feature; fusing the second training style information with the first training action information to obtain a second training motion feature; inputting the first training motion feature into a decoder to obtain a first reconstructed image sequence; inputting the second training motion feature into the decoder to obtain a second reconstructed image sequence; and training with the goal of making the value of a first loss function less than a first threshold, to obtain the trained style encoder, content encoder, and decoder. The first loss function includes a style loss function and a content loss function: the style loss function represents the style difference between the first reconstructed image sequence and the first training image sequence and the style difference between the second reconstructed image sequence and the second training image sequence, and the content loss function represents the content difference between the first reconstructed image sequence and the second training image sequence and the content difference between the second reconstructed image sequence and the first training image sequence.
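The swap-and-reconstruct training above can be sketched with toy stand-ins. The linear "encoders" and "decoder" below are illustrative assumptions, not the trained networks of the application, but they show how the style and content losses are wired to the crossed reconstructions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the networks: style = sequence-level mean, content =
# per-frame deviation, decoder = content plus broadcast style.
def style_encoder(seq):
    return seq.mean(axis=0)

def content_encoder(seq):
    return seq - seq.mean(axis=0)

def decoder(content, style):
    return content + style

# Two training sequences with different motion features: 16 frames, 8 channels.
seq1 = rng.normal(size=(16, 8))
seq2 = rng.normal(size=(16, 8)) + 2.0

s1, c1 = style_encoder(seq1), content_encoder(seq1)
s2, c2 = style_encoder(seq2), content_encoder(seq2)

# Cross the codes: style of seq1 with content of seq2, and vice versa.
recon1 = decoder(c2, s1)   # first reconstructed image sequence
recon2 = decoder(c1, s2)   # second reconstructed image sequence

# Style loss: each reconstruction should keep its donated style.
style_loss = (np.abs(style_encoder(recon1) - s1).mean()
              + np.abs(style_encoder(recon2) - s2).mean())
# Content loss: each reconstruction should keep its donated content.
content_loss = (np.abs(content_encoder(recon1) - c2).mean()
                + np.abs(content_encoder(recon2) - c1).mean())
first_loss = style_loss + content_loss
print(round(first_loss, 6))  # → 0.0 here, since the toy codes decouple exactly
```

With real neural encoders the first loss would start nonzero and be driven below the first threshold by gradient descent.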
In this possible implementation, the above training process ensures the accuracy of the style transfer.
A second aspect of the embodiments of the present application provides a data processing device. The data processing device includes: an acquisition unit, configured to acquire first style information, where the acquisition unit is further configured to acquire action information of a first image sequence; and a generation unit, configured to generate a second image sequence based on the first style information and the action information, where the second image sequence has the same action type as the first image sequence and the second image sequence has the first style information.

Optionally, in a possible implementation of the second aspect, the acquisition unit is further configured to acquire a third image sequence, and the acquisition unit is specifically configured to acquire the first style information based on the third image sequence.

Optionally, in a possible implementation of the second aspect, the acquisition unit is specifically configured to extract second style information of the third image sequence, and the acquisition unit is specifically configured to determine the first style information based on the second style information.

Optionally, in a possible implementation of the second aspect, the acquisition unit is specifically configured to use the second style information as the first style information.

Optionally, in a possible implementation of the second aspect, the acquisition unit is specifically configured to display a second semantic tag to the user, the second semantic tag being used to describe the second style information; the acquisition unit is specifically configured to modify the second semantic tag into a first semantic tag based on a first operation of the user, the first semantic tag being used to describe the first style information; and the acquisition unit is specifically configured to determine the first style information based on the first semantic tag.

Optionally, in a possible implementation of the second aspect, the third image sequence is an image sequence of a two-dimensional animation, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
Optionally, in a possible implementation of the second aspect, the data processing device further includes: a display unit, configured to display a first interface to the user, the first interface including multiple semantic tags, the multiple semantic tags being used to describe different style information of different image sequences, with the semantic tags corresponding one-to-one to the style information; the acquisition unit is specifically configured to determine a first semantic tag from the multiple semantic tags based on a second operation of the user; and the acquisition unit is specifically configured to determine the first style information based on the first semantic tag.

Optionally, in a possible implementation of the second aspect, the generation unit is specifically configured to fuse the first style information and the action information to obtain a first motion feature, and the generation unit is specifically configured to acquire the second image sequence based on the first motion feature.

Optionally, in a possible implementation of the second aspect, the action information includes one or more of the following: a facial expression sequence, a limb image sequence.

Optionally, in a possible implementation of the second aspect, the data processing device further includes: a rendering unit, configured to render the second image sequence onto a virtual object to obtain an animation.
A third aspect of the present application provides a data processing device, including a processor coupled to a memory, where the memory is used to store a program or instructions; when the program or instructions are executed by the processor, the data processing device is caused to implement the method in the above first aspect or any possible implementation of the first aspect.

A fourth aspect of the present application provides a computer-readable medium on which a computer program or instructions are stored; when the computer program or instructions are run on a computer, the computer is caused to execute the method in the aforementioned first aspect or any possible implementation of the first aspect.

A fifth aspect of the present application provides a computer program product which, when executed on a computer, causes the computer to execute the method in the aforementioned first aspect or any possible implementation of the first aspect.

A sixth aspect of the embodiments of the present application provides a chip system, including at least one processor, configured to support a data processing device in implementing the functions involved in the above first aspect or any possible implementation of the first aspect.
In a possible design, the chip system may further include a memory for storing the program instructions and data necessary for the data processing device. The chip system may consist of a chip, or may include a chip and other discrete devices. Optionally, the chip system further includes an interface circuit that provides program instructions and/or data to the at least one processor.
For the technical effects brought about by the second, third, fourth, fifth, and sixth aspects or any of their possible implementations, refer to the technical effects of the first aspect or its different possible implementations; details are not repeated here.

As can be seen from the above technical solutions, the present application has the following advantages: the style information and the action information are obtained separately, and the second image sequence is generated based on the first style information and the action information. This enables stylized animation editing without changing the other features of the original image sequence, improving the style transfer effect of the animation.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system architecture provided in an embodiment of the present application;
FIG. 3A is a schematic diagram of a deployment scenario provided in an embodiment of the present application;
FIG. 3B is a schematic diagram of another deployment scenario provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 5A is a schematic diagram of decomposing style information into explicit features according to an embodiment of the present application;
FIG. 5B is a schematic diagram of the training process of a conversion module provided in an embodiment of the present application;
FIG. 6A is another schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 6B is a schematic flowchart of a user modifying a tag according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user;
FIG. 8 is another schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user;
FIG. 9 is another schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user;
FIG. 10 is another schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 11 is an example diagram of a first image sequence provided in an embodiment of the present application;
FIG. 12 is an example diagram of a third image sequence provided in an embodiment of the present application;
FIG. 13 is an example diagram of a second image sequence provided in an embodiment of the present application;
FIG. 14 is another schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 15 is another schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 16 is a schematic diagram of the training process of an encoder and a decoder provided in an embodiment of the present application;
FIG. 17 is a schematic flowchart of the method provided in an embodiment of the present application applied to a gesture style transfer scenario;
FIG. 18 is a schematic flowchart of the method provided in an embodiment of the present application applied to an expression style transfer scenario;
FIG. 19 is a schematic structural diagram of a data processing device provided in an embodiment of the present application;
FIG. 20 is another schematic structural diagram of a data processing device provided in an embodiment of the present application.
DETAILED DESCRIPTION OF EMBODIMENTS
The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.

For ease of understanding, the related terms and concepts mainly involved in the embodiments of the present application are first introduced below.
1. Neural network

A neural network may be composed of neural units. A neural unit may be an operation unit that takes inputs x_s and an intercept b, and the output of the operation unit may be:

h = f(∑_{s=1}^{n} W_s · x_s + b)

where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, where the local receptive field may be a region composed of several neural units.
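A single neural unit as defined above can be sketched in a few lines; the weights and inputs are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    # Example activation function f introducing nonlinearity.
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, ws, b):
    # Output = f(sum over s of W_s * x_s + b).
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return sigmoid(z)

print(neural_unit([1.0, 2.0], [0.5, -0.25], 0.0))  # sigmoid(0.0) → 0.5
```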
2. Loss function

In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is actually to be predicted, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network is then updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make it predict lower, and the adjustment continues until the neural network can predict the actually desired target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value"; this is the loss function or objective function, an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training a deep neural network becomes a process of minimizing this loss as much as possible.
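The "keep adjusting until the loss shrinks" process above can be sketched with a mean-squared-error loss and plain gradient descent on a one-parameter model; all values are illustrative.

```python
# One-parameter model pred = w * x trained toward the target mapping y = 2x.
def mse_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

xs = [1.0, 2.0, 3.0]
target = [2.0, 4.0, 6.0]
w = 0.0        # initialization before the first update
lr = 0.05
for _ in range(200):
    # dL/dw = 2/N * sum((w*x - t) * x): pushes w down when predictions
    # are too high and up when they are too low.
    grad = 2.0 / len(xs) * sum((w * x - t) * x for x, t in zip(xs, target))
    w -= lr * grad
print(round(w, 3))  # → 2.0, the weight that minimizes the loss
```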
3. Generative adversarial network

A generative adversarial network (GAN) is a deep learning model. A generative adversarial network includes at least one generative network (generator) and one discriminative network (discriminator); by letting the two neural networks learn by playing a game against each other, better outputs are produced. The two neural networks may be deep neural networks or convolutional neural networks. The basic principle of a GAN is as follows, taking a GAN that generates pictures as an example. Suppose there are two networks, G (generator) and D (discriminator). G is a network that generates pictures: it takes random samples from a latent space as input and generates a picture, denoted G(z). D is a discriminative network used to judge whether a picture is "real". Its input parameter is x, where x represents a picture: either a real picture or the output of the generative network. The output D(x) represents the probability that x is a real picture; 1 means the picture is certainly real, and 0 means it cannot be real. In the process of training the generative adversarial network, the goal of the generative network G is to generate pictures as realistic as possible to deceive the discriminative network D, with outputs that imitate the real samples in the training set as closely as possible, while the goal of the discriminative network D is to distinguish the pictures generated by G from real pictures as well as possible. The two networks confront each other and continuously adjust their parameters. In this way, G and D form a dynamic "game" process, which is the "adversarial" part of "generative adversarial network"; the ultimate goal is to make the discriminative network unable to judge whether the output of the generative network is real. In the ideal outcome of this game, G can generate pictures G(z) that pass for real, while D has difficulty judging whether the pictures generated by G are real, that is, D(G(z)) = 0.5. An excellent generative model G is thus obtained, which can be used to generate pictures.
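The adversarial objective described above can be sketched as follows; the scalar "discriminator" is a toy stand-in for a real network, used only to show how the two opposing losses are computed.

```python
import math

def d_out(x, theta_d):
    # Toy discriminator: D(x) = probability that sample x is real.
    return 1.0 / (1.0 + math.exp(-theta_d * x))

def discriminator_loss(real, fake, theta_d):
    # D maximizes log D(x) + log(1 - D(G(z))); we minimize the negative.
    return -(sum(math.log(d_out(x, theta_d)) for x in real)
             + sum(math.log(1.0 - d_out(x, theta_d)) for x in fake)) / len(real)

def generator_loss(fake, theta_d):
    # Non-saturating generator objective: maximize log D(G(z)).
    return -sum(math.log(d_out(x, theta_d)) for x in fake) / len(fake)

real = [1.0, 1.2, 0.8]      # samples the discriminator should score near 1
fake = [-1.0, -0.9, -1.1]   # early generator outputs score near 0
print(discriminator_loss(real, fake, theta_d=2.0))
print(generator_loss(fake, theta_d=2.0))
```

At the ideal equilibrium D(G(z)) = 0.5, the generator loss settles at -log 0.5 per sample, matching the text above.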
4. Animation

Animation refers to virtually created video content, including animated videos displayed on a 2D plane and 3D animated content displayed on 3D display devices such as augmented reality (AR), virtual reality (VR), and holographic displays. Its style is not limited to a cartoon style and also includes realistic styles, such as digital human animation and special-effects film and television.
5. Virtual digital human

A virtual digital human is a virtual character with a digital appearance. Unlike a robot with a physical body, a virtual digital human relies on a display device to exist, such as a mobile phone, a computer, or a smart large screen. A complete virtual digital human generally needs the following three capabilities:

First, it has a human appearance, with specific character features such as looks, gender, and personality.

Second, it has human behavior, with the ability to express itself through language, facial expressions, and body movements.

Third, it has human thought, with the ability to recognize the external environment and to communicate and interact with people.
6. Image sequence

An image sequence can be understood as multiple images with a temporal relationship; of course, it may also be an image sequence obtained from a video. The image sequence may include a limb image sequence and/or a facial expression sequence, etc. In addition, the image sequence may be an image sequence of the whole body, an image sequence of part of the body (also called a local limb), or the facial expression sequence of the character corresponding to the image sequence, which is not specifically limited here.
7. Style information

The style information involved in the embodiments of the present application may be a style feature vector obtained by passing an image sequence through a style encoder, an explicit vector within the style feature vector, or some features of the explicit vector within the style feature vector, which is not specifically limited here. In addition, the tag corresponding to the style information can also be understood as a style description of the image sequence. For example, the style includes one or more of the following: limb/facial contour, limb/facial proportion, range of limb movement, emotion, personality, and so on. The emotions mentioned above may include: happy, depressed, excited, and the like. The personality may include: lively, kind, feminine, mean, and the like.
8. Action information

The action information involved in the embodiments of the present application may be a feature vector obtained by passing an image sequence through a content encoder.
9. Action type

The action type is used to describe the action of an image sequence, that is, the action depicted by the image sequence (for example: running, jumping, squatting, walking, raising the head, lowering the head, closing the eyes). It should be understood that image sequences of the same action type may correspond to different action vectors.
10. Semantic tag

A semantic tag is used to describe the style information of an image sequence; it can be understood as a means of making the style of the image sequence concrete.

Style information corresponds one-to-one to semantic tags. A semantic tag may differ depending on the style information. A semantic tag can be understood as a description of the style information that makes it easier for the user to understand or edit the style of the image sequence.

For example, the style information is the style feature vector of an image sequence. A semantic tag expresses that style feature vector explicitly; through the semantic tag, the user can identify the style of the image sequence/video (for example, the emotion or personality of a character expressed by the character's body movements in the video), which facilitates operations such as style editing/transfer.
At present, there are three mainstream approaches to driving a virtual digital human to imitate human behavior: purely manual modeling, motion-capture modeling, and artificial-intelligence modeling. Purely manual modeling is widely used for hyper-realistic virtual humans or celebrity virtual humans, but the manual production cycle is long and the cost is very high. Motion-capture modeling drives the model by collecting model data with external scanning devices; compared with purely manual modeling, its time and cost are much lower, and it is commonly used in pan-entertainment industries such as film, television, and live streaming, but it requires real actors and therefore cannot improve production efficiency. The artificial-intelligence-driven approach relies on algorithms and machine learning: the premise for a machine to automatically generate a virtual digital human is to obtain sufficient data, analyze a large number of photos/videos, and extract various human data and information to drive the virtual digital human to imitate human behavior. In the above artificial-intelligence modeling approach, different styles are often transferred between animation actions to reduce the motion-capture and driving costs of virtual digital human actions.

The generation and editing of stylized human animation is an important topic in the field of computer animation. By transferring different styles between animations of the same kind, arbitrary stylization of animation can be achieved, reducing the cost of motion capture and driving. However, several key problems remain to be solved. First, stylized animation editing requires giving the animation a specified style while changing its other features as little as possible; how to properly decouple style information from animation action information is therefore an important problem. Second, how to obtain style data at low cost: video is a major data source, but how to explicitly mark the semantic tag features of styles in massive video data, so that users can complete editing and style transfer merely by describing the style semantically, is also an important problem.

To this end, in view of the defect that existing virtual digital human animation driving methods cannot achieve arbitrary stylization, the embodiments of the present application propose a body-movement driving solution based on video style extraction together with explicit marking and editing of style information, aiming to fill the gap of AI user-personalized animation driving in pan-entertainment scenarios. In addition, extracting styles from video can compensate for the difficulty users have in describing certain kinds of styles.
Before introducing the data processing method and related devices of the embodiments of the present application with reference to the accompanying drawings, the system architecture provided by the embodiments of the present application is first described.
参见附图1,本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采集设备160用于采集训练数据,本申请实施例中训练数据包括:第一训练图像序列与第二训练图像序列。并将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。下面将更详细地描述训练设备120如何基于训练数据得到目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的数据处理方法。本申请实施例中的目标模型/规则101具体可以包括风格编码器、内容编码器以及解码器。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。Referring to FIG. 1 , an embodiment of the present invention provides a system architecture 100. As shown in the system architecture 100, the data acquisition device 160 is used to collect training data. In the embodiment of the present application, the training data includes: a first training image sequence and a second training image sequence. The training data is stored in the database 130, and the training device 120 obtains the target model/rule 101 based on the training data maintained in the database 130. The following will describe in more detail how the training device 120 obtains the target model/rule 101 based on the training data, and the target model/rule 101 can be used to implement the data processing method provided in the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may specifically include a style encoder, a content encoder, and a decoder. It should be noted that in actual applications, the training data maintained in the database 130 may not all come from the collection of the data acquisition device 160, but may also be received from other devices. It should also be noted that the training device 120 may not necessarily train the target model/rule 101 based entirely on the training data maintained in the database 130, and may also obtain training data from the cloud or other places for model training. The above description should not be used as a limitation on the embodiment of the present application.
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。在附图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以 包括:第一图像序列与第一语义标签;可选地,输入数据还可以包括第一图像序列与第二图像序列等。当然,输入数据也可以是二维动画(例如,二维动画是第二图像序列所属的动画)与三维动画(例如,三维动画是第一图像序列所属的动画)。另外该输入数据可以是用户输入的,也可以是用户通过拍摄设备上传的,当然还可以来自数据库,具体此处不做限定。The target model/rule 101 obtained by training the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1 . The execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an AR/VR, a vehicle terminal, etc., or a server or a cloud. In FIG. 1 , the execution device 110 is configured with an I/O interface 112 for data interaction with an external device. The user can input data to the I/O interface 112 through the client device 140. The input data can be Including: a first image sequence and a first semantic label; optionally, the input data may also include the first image sequence and the second image sequence, etc. Of course, the input data may also be a two-dimensional animation (for example, the two-dimensional animation is the animation to which the second image sequence belongs) and a three-dimensional animation (for example, the three-dimensional animation is the animation to which the first image sequence belongs). In addition, the input data may be input by a user, or uploaded by a user through a shooting device, or may come from a database, which is not specifically limited here.
The preprocessing module 113 is configured to preprocess the input data received by the I/O interface 112 (for example, the first image sequence and the first semantic label, or the first image sequence and the second image sequence, or a two-dimensional animation and a three-dimensional animation), for example, converting two-dimensional features into three-dimensional features.
When the execution device 110 preprocesses the input data, or when the computing module 111 of the execution device 110 performs related processing such as extracting action information of the first image sequence and generating the second image sequence based on the action information and the first semantic label, the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may also store the second image sequence, instructions, and the like obtained by the corresponding processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the second image sequence obtained as described above, or the three-dimensional animation corresponding to the second image sequence, to the client device 140 so as to provide it to the user.
It is worth noting that the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals, or different tasks. The corresponding target models/rules 101 can then be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results.
In the case shown in FIG. 1, the user can manually specify the input data, and this manual specification can be operated through the interface provided by the I/O interface 112. In another case, the client device 140 can automatically send input data to the I/O interface 112. If requiring the client device 140 to automatically send input data needs the user's authorization, the user can set the corresponding permission in the client device 140. The user can view the result output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, or the like. The client device 140 can also serve as a data collection terminal, collecting the input data input to the I/O interface 112 and the output result of the I/O interface 112 as shown in the figure as new sample data, and storing them into the database 130. Of course, the collection may also bypass the client device 140; instead, the I/O interface 112 directly stores the input data input to the I/O interface 112 and the output result of the I/O interface 112 as shown in the figure into the database 130 as new sample data.
It is worth noting that FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 1, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
As shown in FIG. 1, the target model/rule 101 is obtained through training by the training device 120. In this embodiment of the present application, the target model/rule 101 may include a style encoder, a content encoder, a decoder, and the like.
The following describes a chip hardware structure provided by an embodiment of the present application.
FIG. 2 shows a chip hardware structure provided by an embodiment of the present invention; the chip includes a neural network processor 20. The chip can be provided in the execution device 110 shown in FIG. 1 to complete the computation work of the computing module 111. The chip can also be provided in the training device 120 shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101.
The neural network processor 20 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or any other processor suitable for large-scale exclusive-OR operation processing. Taking an NPU as an example: the neural network processor 20 is mounted as a coprocessor on a host central processing unit (CPU), and the host CPU assigns tasks. The core part of the NPU is the operation circuit 203; the controller 204 controls the operation circuit 203 to fetch data from a memory (a weight memory or an input memory) and perform operations.
In some implementations, the operation circuit 203 internally includes multiple processing engines (PEs). In some implementations, the operation circuit 203 is a two-dimensional systolic array. The operation circuit 203 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 203 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory 202 and caches it on each PE in the operation circuit. The operation circuit fetches the matrix A data from the input memory 201 and performs a matrix operation with matrix B; partial or final results of the resulting matrix are stored in the accumulator 208.
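As an illustration only (the embodiment describes hardware, not software), the accumulate-as-you-go matrix computation described above can be sketched in plain Python; the function name and shapes are hypothetical:

```python
def matmul_accumulate(A, B):
    """Sketch of the operation circuit's behavior: for each output element
    C[i][j], the partial products A[i][t] * B[t][j] are summed in a local
    accumulator, mirroring how partial results are collected in the
    accumulator 208 before being written out."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0  # plays the role of the accumulator
            for t in range(k):
                acc += A[i][t] * B[t][j]
            C[i][j] = acc
    return C

A = [[1, 2], [3, 4]]   # input matrix A (from input memory 201)
B = [[5, 6], [7, 8]]   # weight matrix B (from weight memory 202)
print(matmul_accumulate(A, B))  # [[19, 22], [43, 50]]
```

In a real systolic array the inner accumulation is pipelined across PEs rather than executed as a loop; the arithmetic result is the same.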
The vector calculation unit 207 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and so on. For example, the vector calculation unit 207 can be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 207 stores the processed output vector into the unified buffer 206. For example, the vector calculation unit 207 can apply a nonlinear function to the output of the operation circuit 203, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 207 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 203, for example, for use in a subsequent layer of the neural network.
The unified memory 206 is used to store input data and output data.
A direct memory access controller (DMAC) 205 transfers input data in an external memory to the input memory 201 and/or the unified memory 206, stores weight data in the external memory into the weight memory 202, and stores data in the unified memory 206 into the external memory.
A bus interface unit (BIU) 210 is used to implement interaction among the host CPU, the DMAC, and the instruction fetch buffer 209 through a bus.
An instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204.
The controller 204 is used to invoke the instructions cached in the instruction fetch buffer 209 to control the working process of the operation accelerator.
Generally, the unified memory 206, the input memory 201, the weight memory 202, and the instruction fetch buffer 209 are all on-chip memories, and the external memory is a memory outside the NPU. The external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
Several deployment scenarios provided by the embodiments of the present application are described below. The arbitrary-style editable 3D animation generation solution provided by the embodiments of the present application can be applied to 2B scenarios such as digital hosts, and to 2C scenarios such as digital companions and assistant software. There are many specific deployment solutions, which are described below by example.
A deployment scenario provided by an embodiment of the present application is shown in FIG. 3A. A user uploads an animation video representing a target style on the client. The server extracts the target style from the video and returns the semantic label of the style to the user. The user can then describe, edit, or select the style based on its semantic label; for example, for a style whose semantic label is "excited", the user may want the degree to be slightly weaker. The label editing and uploading are completed on the client. After receiving the request, the server reduces the weight of the style's semantic label according to the target so as to lessen the degree of excitement, thereby editing the style information, generating a target animation that matches the user's label, and returning it to the client for rendering and display.
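A minimal sketch of the weight-reduction step above, under the assumption (not stated in the embodiment) that style information is held as a mapping from semantic labels to scalar weights; the names and the 0.5 scale factor are illustrative:

```python
def edit_style(style_weights, label, scale):
    """Scale the weight of one semantic label in the style information,
    e.g. halving the weight of 'excited' to weaken that aspect of the
    style. Labels not mentioned are left unchanged."""
    edited = dict(style_weights)  # keep the original intact
    if label in edited:
        edited[label] = edited[label] * scale
    return edited

style = {"excited": 0.8, "diverse": 0.4}
print(edit_style(style, "excited", 0.5))  # {'excited': 0.4, 'diverse': 0.4}
```

The edited weights would then condition the animation generator so that the output matches the user's adjusted label.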
Another deployment scenario provided by an embodiment of the present application is shown in FIG. 3B. Compared with the deployment solution of FIG. 3A, in this solution the server extracts arbitrary styles from videos offline and generates a feature library. The user only needs to upload the semantic labels of the required personalized style, such as adding a little feminine style on top of the excited style. After receiving the request, the server automatically identifies the style information corresponding to the "excited" and "feminine" labels from the style information library, edits this feature to generate style information matching the semantic labels of the target style, and completes rendering and display.
It is understandable that the above two deployment scenarios are just examples. In actual applications, there may be other forms of deployment scenarios, which are not specifically limited here.
In addition, the style involved in the above deployment scenarios may be a two-dimensional style or a three-dimensional style. In other words, the method provided in the embodiments of the present application can be applied to a scenario where a two-dimensional style is transferred to a two-dimensional image sequence, a scenario where a three-dimensional style is transferred to a three-dimensional image sequence, a scenario where a two-dimensional style is transferred to a three-dimensional image sequence, a scenario where a three-dimensional style is transferred to a two-dimensional image sequence, and so on; this is not specifically limited here.
The data processing method provided in the embodiments of the present application is described in detail below with reference to the accompanying drawings.
Referring to FIG. 4, an embodiment of the data processing method provided in the embodiments of the present application may be executed by a data processing device (a terminal device/cloud server), or by a component of a data processing device (for example, a processor, a chip, or a chip system). The method includes steps 401 to 403. The method can be applied to style transfer scenarios between animations such as children's educational animations, short-video animations, promotional animations, variety-show animations, and film and television pre-visualization animations.
Step 401: obtain first style information.
In one possible implementation, the style information refers to a style feature vector of an image sequence; in this case, a third image sequence is input into the style encoder to obtain second style information. The training process of the style encoder will be described later and is not expanded upon here.
In another possible implementation, the style information refers to an explicit vector, or a partial feature of the explicit vector, in the style feature vector of an image sequence; in this case, the third image sequence is input into the style encoder to obtain a style feature vector, and the style feature vector is split into an explicit vector and an implicit vector. In this case, the style information can be understood as an explicit expression of the style feature vector.
In other words, the style information in the embodiments of the present application may be the style feature vector corresponding to an image sequence, the explicit vector in that style feature vector, or a partial feature of the explicit vector in that style feature vector, and so on. Put differently, in the latter cases the style information can be decomposed into an explicit vector and implicit features. Of course, this decomposition is only an example; the style information may also be decomposed into an explicit vector, implicit features, and personalized features. The personalized features are used to express the personalized differences brought about when the same style is interpreted by different characters. The personalized features may also be related to the character in the image sequence; for example, they may be "Jin Xing-style" or "Trump-style".
Optionally, when the style information is the explicit vector, the style feature vector also needs to be first decomposed into an explicit vector and implicit features, and the explicit vector is used as the style information.
In the embodiments of the present application, there are multiple ways for the data processing device to obtain the first style information, which are described separately below.
First way: obtain the first style information based on the third image sequence.
In this case, the data processing device first obtains the third image sequence and obtains the first style information based on the third image sequence. There are multiple ways for the data processing device to obtain the third image sequence: it may be received from another device, selected from a database, collected by sensors in the data processing device, uploaded by a user, and so on; this is not specifically limited here.
In the embodiments of the present application, an image sequence (for example, the first image sequence or the third image sequence) may be a two-dimensional image sequence or a three-dimensional image sequence; this is not specifically limited here.
Optionally, in order to obtain style information of more style types, the third image sequence may be an image sequence extracted from a two-dimensional animation, for example, extracted by a human pose estimation method (for example, OpenPose). In addition, the way of obtaining the two-dimensional animation is not limited here: it may be shot and uploaded by a user, received from another device, selected from a database, and so on; this is not specifically limited here.
The step of obtaining the first style information based on the third image sequence is further divided into two cases according to whether there is a user operation, which are described separately below.
1. Without a user operation.
After obtaining the third image sequence, the data processing device may directly extract the second style information of the third image sequence and use the second style information as the first style information, or convert the second style information into preset style information, and so on.
In addition, the basis for the above decomposition may be a trained neural network, or may be finding, in a database, multiple image sequences that express the same style and then determining the explicit vector from these sequences; this is not specifically limited here. Determining the explicit vector from multiple image sequences expressing the same style may specifically include: inputting the multiple image sequences of the same style into the style encoder to obtain multiple style feature vectors, and using the common features of the multiple style feature vectors as the style information. The non-common parts are then the implicit features, and so on; this is not specifically limited here.
For example, multiple image sequences whose expressed style is "happy" are found from the database, and the multiple image sequences are respectively input into the style encoder to obtain multiple style feature vectors. The common vector of the multiple style feature vectors is determined; the style information of "happy" is then this common vector. The correspondence between the explicit style information and the common vector is thereby determined.
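One simple way to realize "common features across same-style sequences" (an illustrative assumption; the embodiment does not fix the extraction rule) is to take the element-wise mean of the style feature vectors as the shared component, with each sequence's residual treated as its implicit part:

```python
def common_style_vector(style_vectors):
    """Extract the 'common' (explicit) part shared by style feature
    vectors of sequences with the same style label, here modeled as the
    element-wise mean; per-sequence residuals model the implicit part."""
    n = len(style_vectors)
    dim = len(style_vectors[0])
    common = [sum(v[d] for v in style_vectors) / n for d in range(dim)]
    residuals = [[v[d] - common[d] for d in range(dim)] for v in style_vectors]
    return common, residuals

# Two hypothetical "happy" style vectors from the style encoder:
vecs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
common, residuals = common_style_vector(vecs)
print(common)     # [2.0, 2.0, 2.0]
print(residuals)  # [[-1.0, 0.0, 1.0], [1.0, 0.0, -1.0]]
```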
Optionally, when the style information is a partial feature of the explicit vector in the style feature vector corresponding to the image sequence, the explicit vector needs to be split, for example: explicit vector = W1 × style information 1 + W2 × style information 2 + ... + Wn × style information n.
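The weighted split above can be sketched as follows. For simplicity this sketch assumes the style-information vectors form an orthonormal basis, so each weight Wi is a dot product; with a general basis, a least-squares fit would be used instead. All vectors here are hypothetical:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def split_explicit(explicit, basis):
    """Recover the weights W1..Wn such that
    explicit = W1*style_1 + ... + Wn*style_n,
    assuming the style-information vectors in `basis` are orthonormal."""
    return [dot(explicit, b) for b in basis]

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # style information 1 and 2
explicit = [0.7, 0.2, 0.0]                   # = 0.7*basis[0] + 0.2*basis[1]
print(split_explicit(explicit, basis))       # [0.7, 0.2]
```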
For example, take the case where the style information is the explicit vector of the style feature vector. As shown in FIG. 5A, the style information may include: "calm -> excited", "single -> diverse", "feminine -> masculine". Here, the two sides of "->" may refer to the two boundaries of a range. For example, "calm" to "excited" is a progression of emotion; alternatively, this can be understood as the style information being further distinguishable into different weights/levels. As another example, the intensity range of happiness may include several levels such as satisfaction, relief, pleasure, joy, and ecstasy. In this example, the style information may also be "satisfaction -> ecstasy".
Optionally, consider the case where the second style information is converted into the first style information, and the second style information is two-dimensional style information. After obtaining the second style information, the data processing device can convert the second style information into the first style information through a conversion module, the first style information being three-dimensional style information. This case is mainly applied to scenarios where the style information of a two-dimensional animation is transferred to a three-dimensional animation so as to change the style information of the three-dimensional animation.
The above conversion module can be understood as a 2D-3D style conversion module. This module is trained on a large number of style-consistent 2D-3D pairs to obtain a nonlinear transformation for embedding 2D stylized features into the 3D stylized feature space. Subsequently, the 2D style information extracted from a video (that is, the second style information) can be converted into 3D stylized features (that is, the first style information) after being projected into the 3D space using the nonlinear transformation.
The training process of the above conversion module can be as shown in FIG. 5B. First, a 3D animation sequence is obtained and its 3D stylized features are extracted. Then, by orthogonally projecting the 3D animation sequence, a 2D animation sequence consistent with the style and actions of the 3D animation sequence is generated, and its 2D style information is extracted. Finally, by supervising the respective style information of the two, they are aligned to the same feature space, completing the projection of the 2D style information into the 3D style information space.
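The supervision signal in this alignment can be sketched as a distance between the projected 2D style feature and its paired 3D style feature. A linear map stands in here for the nonlinear transformation described above, and all values are hypothetical:

```python
def align_loss(feat2d, feat3d, transform):
    """Alignment supervision for the 2D->3D style conversion module:
    project the 2D style feature through the transform (here a plain
    matrix) and return the squared distance to the paired 3D style
    feature. Training would minimize this over many 2D-3D pairs."""
    projected = [sum(w * x for w, x in zip(row, feat2d)) for row in transform]
    return sum((p - t) ** 2 for p, t in zip(projected, feat3d))

identity = [[1.0, 0.0], [0.0, 1.0]]
print(align_loss([0.5, 0.5], [0.5, 0.5], identity))  # 0.0 (already aligned)
print(align_loss([1.0, 0.0], [0.0, 1.0], identity))  # 2.0 (misaligned pair)
```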
2. Determine the first style information based on a first operation of the user and the third image sequence.
In this way, after the data processing device extracts the second style information of the third image sequence, it can display a second semantic label to the user, the second semantic label being used to explicitly describe the second style information. Then, based on the first operation of the user, the second semantic label is modified into a first semantic label, and the first style information is determined based on the first semantic label. For the explanation of semantic labels, reference may be made to the description of the related terms above, which is not repeated here.
The second semantic label can be understood as a style description of the third image sequence, where the style includes one or more of the following: body/facial contour, body/facial proportions, amplitude of body movements, emotion, personality, and so on. For details, reference may be made to the description in the related terms above, which is not repeated here.
This way can also be understood as follows: the data processing device converts the second style information vector of the image sequence into a second semantic label that the user can understand, and the user processes the second semantic label according to actual needs to obtain the first semantic label. The data processing device then converts the first semantic label into the first style information and subsequently generates an image sequence that meets the user's needs. The above processing includes at least one of the following: addition, deletion, modification, degree control (which can also be understood as amplitude or level adjustment), and so on.
Optionally, the first operation includes the above-mentioned addition, deletion, modification, degree control (which can also be understood as amplitude or level adjustment), modification of semantic label weights, and so on. Specifically, the data processing device can determine the first operation through the user's voice, text, or other input methods, which is not specifically limited here.
This case can be applied to the scenario shown in FIG. 3A above. Take as an example that the data processing device is a cloud device and the third image sequence is obtained by being sent from a terminal device. The procedure in this case can be as shown in FIG. 6 and includes steps 601 to 606.
步骤601,终端设备向云端设备发送第三图像序列。Step 601: The terminal device sends a third image sequence to the cloud device.
用户可以通过终端设备向云端设备发送第三图像序列。相应的,云端设备接收终端设备发送的第三图像序列。The user can send the third image sequence to the cloud device through the terminal device. Correspondingly, the cloud device receives the third image sequence sent by the terminal device.
步骤602,云端设备生成第三图像序列的第二语义标签。Step 602: The cloud device generates a second semantic tag for the third image sequence.
云端设备获取第三图像序列之后,先获取第三图像序列的第二风格信息。并将第二风格信息转化为第二语义标签。After the cloud device obtains the third image sequence, it first obtains the second style information of the third image sequence and converts the second style information into a second semantic label.
示例性的,以风格信息是风格特征向量中的显式向量为例,与前述类似,可以从数据库中寻找多个表达“开心”的图像序列,将多个图像序列分别输入风格编码器得到多个风格特征向量。确定多个风格特征向量的共有向量,则“开心”的风格语义标签对应为上述的共有向量(即显式向量)。从而确定出语义标签与风格信息之间的对应关系。For example, taking the case where the style information is an explicit vector in the style feature vector, similar to the above, multiple image sequences expressing "happy" can be found from the database, and the multiple image sequences can be input into the style encoder to obtain multiple style feature vectors. The common vector of multiple style feature vectors is determined, and the style semantic label of "happy" corresponds to the above common vector (i.e., the explicit vector). Thus, the corresponding relationship between the semantic label and the style information is determined.
步骤603,云端设备向终端设备发送第二语义标签。Step 603: The cloud device sends a second semantic tag to the terminal device.
云端设备获取第二语义标签之后,向终端设备发送第二语义标签。相应的,终端设备接收云端设备发送的第二语义标签。After acquiring the second semantic tag, the cloud device sends the second semantic tag to the terminal device. Correspondingly, the terminal device receives the second semantic tag sent by the cloud device.
步骤604,终端设备基于第二语义标签确定第一语义标签。Step 604: The terminal device determines the first semantic tag based on the second semantic tag.
如果无需用户操作则与前述描述类似,这里仅以基于用户的第一操作与第三图像序列确定第一风格信息为例。If no user operation is required, the process is similar to the above description, and here only takes determining the first style information based on the user's first operation and the third image sequence as an example.
终端设备获取第二语义标签之后,向用户显示第二语义标签。进而基于用户的第一操作将第二语义标签修改为第一语义标签。After acquiring the second semantic tag, the terminal device displays the second semantic tag to the user, and then modifies the second semantic tag to the first semantic tag based on the user's first operation.
步骤605,终端设备向云端设备发送第一语义标签。Step 605: The terminal device sends the first semantic tag to the cloud device.
终端设备获取第一语义标签之后,向云端设备发送第一语义标签。相应的,云端设备接收终端设备发送的第一语义标签。After acquiring the first semantic tag, the terminal device sends the first semantic tag to the cloud device. Correspondingly, the cloud device receives the first semantic tag sent by the terminal device.
步骤606,云端设备基于第一语义标签确定第一风格信息。 Step 606: The cloud device determines first style information based on the first semantic tag.
云端设备获取第一语义标签之后,可以基于第一语义标签确定第一风格信息。After acquiring the first semantic tag, the cloud device may determine the first style information based on the first semantic tag.
Illustratively, FIG. 6B shows an example of a user modifying a label. The second semantic label of the third image sequence is "emotion: excited; style: single". On the basis of the second semantic label, the user performs the following edits: deletes "excited" and keeps the emotion neutral; adjusts the richness of motion from single to diverse; and adds a "feminine" style. Here, a natural language processing (NLP) module in the data processing device automatically recognizes and matches the semantic label of the style specified by the user, selects the matching style information, and can quantify the degree of a particular style specified by the user; the two are fused to generate the edited style information. The role of the NLP module is to take a piece of text as input and output a parse of that text (for example, nouns, verbs, and the keywords the user cares about), in particular the keywords in the text that express style. For example, given the input "the target style I want is half feminine and half masculine", the NLP module outputs the keywords "feminine", "masculine", and "half each"; that is, it extracts the style-related words from the descriptive text. As another example, the user submits "give me a more girly style" by text or voice, from which the NLP module determines that the user wants to "add a feminine style" on top of the second semantic label.
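A minimal sketch of such keyword parsing, assuming a hand-written style lexicon (the embodiment does not specify the NLP model; the lexicon entries and the weighting rule are illustrative):

```python
# Toy stand-in for the NLP module: a lexicon lookup that pulls
# style-related keywords, plus an optional weight, out of a request.
STYLE_LEXICON = {"feminine", "masculine", "excited", "neutral", "diverse"}

def parse_style_request(text):
    """Return the style keywords found in the text, each mapped to a
    weight (0.5 when the request says 'half', otherwise 1.0)."""
    words = text.lower().replace(",", " ").split()
    weight = 0.5 if "half" in words else 1.0
    return {w: weight for w in words if w in STYLE_LEXICON}

print(parse_style_request("the target style I want is half feminine, half masculine"))
# {'feminine': 0.5, 'masculine': 0.5}
```

A production NLP module would also handle negation ("delete excited") and degree words; the sketch only shows the keyword-and-weight extraction the text describes.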
Illustratively, take modifying weighted labels as an example. The data processing device displays the user interface shown in FIG. 7, which includes an animation preview interface and an editing interface. The style semantic labels in the editing interface (also called style labels) can be understood as the aforementioned second semantic label, for example "excited" and "single". The user can modify the second semantic label through the editing interface. As shown in FIG. 8, the user drags slider 801 to move "calm->excited" from 1.0 to 0.5, that is, removing the excitement and making the emotion neutral, and drags slider 802 to move "single->diverse" from 0.0 to 1.0, that is, changing single to diverse. In addition, the user can click the add-label control 803 to add the feminine style label shown in FIG. 9. Through FIGS. 7 to 9, the user thus modifies the second semantic label (excited, single) into the first semantic label (neutral, diverse, feminine).
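The slider edits of FIGS. 7 to 9 can be modeled as updates to a label-weight dictionary. The axis names and the default weight of 1.0 for a newly added label are assumptions made for illustration:

```python
# The second semantic label as a weight dictionary: each key is a
# label axis, each value is the slider position in [0, 1].
second_label = {"calm->excited": 1.0, "single->diverse": 0.0}

def edit_label(label, changes, additions=()):
    """Apply slider drags and added labels without mutating the input."""
    edited = dict(label)
    edited.update(changes)         # drag existing sliders (FIG. 8)
    for new_axis in additions:     # 'add label' control (FIG. 9)
        edited.setdefault(new_axis, 1.0)
    return edited

first_label = edit_label(
    second_label,
    changes={"calm->excited": 0.5,    # excited -> neutral
             "single->diverse": 1.0}, # single -> diverse
    additions=("feminine",),          # newly added feminine style
)
```

The resulting dictionary corresponds to the first semantic label (neutral, diverse, feminine) that the downstream style-editing module consumes.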
In this manner, the semantic labels of the style information are made explicit, and the user can edit according to these explicit labels. In practice, for the style of an arbitrary video, it is subjectively difficult for a person to define accurately the style the video presents, and even more difficult to edit it accurately. This embodiment decomposes the style information, semanticizes its explicit features, and thereby labels the style information; it then recognizes, matches, and quantifies the semantic labels of any style specified by the user to generate specific style information. This makes both deployments possible: returning feature labels for user editing in the deployment scheme of FIG. 3A, and matching semantic labels of the user's personalized style in the deployment scheme of FIG. 3B. It also lets users understand their own editing actions more clearly.
可以理解的是,上述只是基于第三图像序列获取第一语义标签的两种方式举例,在实际应用中,还可以有其他方式,具体此处不做限定。It is understandable that the above are just two examples of ways to obtain the first semantic label based on the third image sequence. In practical applications, there may be other ways, which are not specifically limited here.
该第一种情况下,可以从第三图像序列/视频中提取风格信息,以弥补用户对某些风格难以描述的缺陷。In the first case, style information may be extracted from the third image sequence/video to compensate for the defect that certain styles are difficult for users to describe.
第二种,基于用户针对于第一界面的第二操作确定第一风格信息。The second type is to determine the first style information based on a second operation performed by the user on the first interface.
该种方式下,数据处理设备向用户显示第一界面,该第一界面包括多个语义标签。多个语义标签中的每个语义标签用于显式化图像序列的风格信息。数据处理设备再基于用户的第二操作从多个语义标签中确定第一语义标签。进而根据该第一语义标签确定第一风格信息。In this manner, the data processing device displays a first interface to the user, and the first interface includes multiple semantic tags. Each of the multiple semantic tags is used to explicitly display the style information of the image sequence. The data processing device then determines a first semantic tag from the multiple semantic tags based on the user's second operation, and then determines the first style information based on the first semantic tag.
该种情况可以应用于前述图3B所示的场景。以数据处理设备是云端设备为例。该种情况下的流程可以如图10所示。该流程包括步骤1001至步骤1005。This situation can be applied to the scenario shown in FIG. 3B above. Take the case where the data processing device is a cloud device as an example. The process in this case can be shown in FIG. 10. The process includes steps 1001 to 1005.
步骤1001,云端设备基于多个图像序列生成风格信息库与多个语义标签。Step 1001: A cloud device generates a style information library and multiple semantic tags based on multiple image sequences.
云端设备通过获取多个图像序列,获取多个图像序列对应风格特征向量的公共向量,并基于不同的公共向量提取出不同的语义标签。进而获取多个公共向量的风格信息库与多个语义标签。The cloud device obtains multiple image sequences, obtains common vectors of style feature vectors corresponding to the multiple image sequences, and extracts different semantic labels based on different common vectors, thereby obtaining a style information library and multiple semantic labels of the multiple common vectors.
步骤1002,云端设备向终端设备发送多个语义标签。Step 1002: The cloud device sends a plurality of semantic tags to the terminal device.
云端设备获取多个语义标签之后,向终端设备发送多个语义标签。相应的,终端设备接收云端设备发送的多个语义标签。After the cloud device obtains the multiple semantic tags, it sends the multiple semantic tags to the terminal device. Correspondingly, the terminal device receives the multiple semantic tags sent by the cloud device.
步骤1003,终端设备基于用户针对于第一界面的第二操作确定第一语义标签。Step 1003: The terminal device determines a first semantic tag based on a second operation performed by the user on the first interface.
终端设备接收云端设备发送的多个语义标签之后,向用户显示第一界面,该第一界面包括多个语义标签。基于用户对第一界面的第二操作确定第一语义标签。该第二操作具体可以是选择操作等。After receiving the plurality of semantic tags sent by the cloud device, the terminal device displays a first interface to the user, where the first interface includes the plurality of semantic tags. The first semantic tag is determined based on a second operation of the user on the first interface. The second operation may specifically be a selection operation, etc.
步骤1004,终端设备向云端设备发送第一语义标签。Step 1004: The terminal device sends a first semantic tag to the cloud device.
终端设备确定第一语义标签之后,向云端设备发送第一语义标签。相应的,云端设备接收终端设备发送的第一语义标签。 After determining the first semantic tag, the terminal device sends the first semantic tag to the cloud device. Correspondingly, the cloud device receives the first semantic tag sent by the terminal device.
Step 1005: The cloud device determines the first style information from the style information library based on the first semantic tag.
After receiving the first semantic tag sent by the terminal device, the cloud device looks up the common vector corresponding to the first semantic tag in the style information library and uses it as the first style information.
This manner can also be understood as follows: the data processing device displays multiple semantic tags to the user, and the user either selects the needed tags directly from them, or enters weights for the multiple semantic tags in the first interface.
第三种,基于用户的第三操作确定第一风格信息。The third type is to determine the first style information based on a third operation of the user.
该种方式下,数据处理设备可以直接接收用户的第三操作,并响应于该第三操作确定第一语义标签。In this manner, the data processing device can directly receive the third operation of the user, and determine the first semantic tag in response to the third operation.
该第三操作可以是语音、文本等,具体此处不做限定。例如,用户通过语音编辑“增加阴柔风格”。则数据处理设备可以根据“增加阴柔风格”的语音,确定第一语义标签为“阴柔”。The third operation may be voice, text, etc., which is not limited here. For example, the user edits "add feminine style" by voice. Then the data processing device can determine the first semantic tag as "feminine" according to the voice of "add feminine style".
示例性的,以数据处理设备是服务端为例,即数据处理设备离线完成从视频中提取任意风格,并生成特征库。用户只需上传要求的个性化风格的语义标签,如在兴奋的风格上增加一点阴柔风格。数据处理设备在收到请求后,自动从风格信息库中识别出兴奋与阴柔标签对应的风格信息,并对此特征进行编辑,生成与目标风格的语义标签相匹配的风格信息,完成渲染与显示。For example, the data processing device is a server, that is, the data processing device extracts any style from the video offline and generates a feature library. The user only needs to upload the semantic tag of the required personalized style, such as adding a little feminine style to the excited style. After receiving the request, the data processing device automatically identifies the style information corresponding to the excited and feminine tags from the style information library, edits the feature, generates style information that matches the semantic tag of the target style, and completes rendering and display.
可以理解的是,上述几种情况只是获取第一风格信息的几个举例,在实际应用中,还可以有其他方式,具体此处不做限定。It is understandable that the above-mentioned situations are just a few examples of obtaining the first style information. In practical applications, there may be other ways, which are not specifically limited here.
步骤402,获取第一图像序列的动作信息。Step 402: Acquire action information of a first image sequence.
数据处理设备获取第一图像序列。该第一图像序列可以理解为是需要替换风格信息的图像序列。The data processing device obtains a first image sequence, which can be understood as an image sequence whose style information needs to be replaced.
可选地,在将2D/3D动画风格信息迁移到3D动画的场景中,该第一图像序列为三维图像序列。在将2D/3D动画风格信息迁移到2D动画的场景中,该第一图像序列为二维图像序列。Optionally, in the scenario of migrating 2D/3D animation style information to 3D animation, the first image sequence is a 3D image sequence. In the scenario of migrating 2D/3D animation style information to 2D animation, the first image sequence is a 2D image sequence.
可选地,第一图像序列可以是从三维动画中提取出来的图像序列。例如,通过人体姿态识别方法(例如,openpose)从三维动画中提取出第一图像序列。另外,三维动画的获取方式这里不做限定,可以是通过用户拍摄上传的方式,也可以是通过接收其他设备发送的方式,还可以是从数据库中选取的方式等,具体此处不做限定。Optionally, the first image sequence may be an image sequence extracted from a three-dimensional animation. For example, the first image sequence is extracted from the three-dimensional animation by a human posture recognition method (e.g., openpose). In addition, the acquisition method of the three-dimensional animation is not limited here, and may be a method of uploading by a user, a method of receiving from other devices, or a method of selecting from a database, etc., which is not limited here.
示例1,第一图像序列的一种示例如图11所示。该第一图像序列的动作内容为“走步”。Example 1: An example of the first image sequence is shown in Figure 11. The action content of the first image sequence is "walking".
数据处理设备获取第一图像序列之后,提取第一图像序列的动作信息。其中,动作信息的解释可以参考前述相关术语的描述,此处不再赘述。After acquiring the first image sequence, the data processing device extracts the action information of the first image sequence. The explanation of the action information can refer to the description of the above-mentioned related terms, which will not be repeated here.
可选地,将第一图像序列输入内容编码器,以得到动作信息。其中,对于内容编码器的训练过程后续会有说明,此处不再展开。Optionally, the first image sequence is input into a content encoder to obtain the action information. The training process of the content encoder will be described later and will not be expanded here.
步骤403,基于第一风格信息与动作信息生成第二图像序列。Step 403: Generate a second image sequence based on the first style information and the motion information.
数据处理设备获取第一语义标签之后,可以基于第一语义标签确定第一风格信息。进而基于该第一风格信息与动作信息生成第二图像序列。After acquiring the first semantic tag, the data processing device may determine the first style information based on the first semantic tag, and then generate a second image sequence based on the first style information and the action information.
在一种可能实现的方式中,第一语义标签用于显式化整个第一风格信息。该种情况下,直接基于第一语义标签确定第一风格信息。In a possible implementation, the first semantic tag is used to make the entire first style information explicit. In this case, the first style information is determined directly based on the first semantic tag.
在另一种可能实现的方式中,第一语义标签用于显式化第一风格信息中的显式向量。该种情况下,先将第一语义标签转化为显式向量,然后与第一图像序列的隐式特征进行融合以得到第一风格信息。In another possible implementation, the first semantic label is used to make explicit the explicit vector in the first style information. In this case, the first semantic label is first converted into an explicit vector and then fused with the implicit features of the first image sequence to obtain the first style information.
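A sketch of this second manner, under the assumption that the fusion of explicit and implicit parts is simple concatenation (the embodiment does not fix the fusion operator), and with a hypothetical label table:

```python
def label_to_explicit(label, label_table):
    """Look up the explicit vector that a semantic label stands for."""
    return label_table[label]

def compose_style(explicit, implicit):
    """Fuse the label-derived explicit vector with the implicit
    features extracted from the first image sequence. Concatenation
    is an illustrative assumption; any learned fusion would do."""
    return list(explicit) + list(implicit)

# Hypothetical mapping and implicit features (toy values).
table = {"frustrated": [0.2, 0.7]}
style = compose_style(label_to_explicit("frustrated", table), [0.05, 0.1])
```

The composed vector plays the role of the first style information fed to the decoder together with the action information.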
可选地,数据处理设备融合第一风格信息与动作信息以得到第一运动特征。并基于第一运动特征获取第二图像序列。Optionally, the data processing device fuses the first style information with the action information to obtain a first motion feature, and acquires a second image sequence based on the first motion feature.
上述数据处理设备融合第一风格信息与动作信息以得到第一运动特征所使用的融合算法可以包括:自适应实例标准化层(Adaptive Instance Normalization,AdaIN)、深度学习模型、统计方法等分布之间的对齐方法。The fusion algorithm used by the above-mentioned data processing device to fuse the first style information and the action information to obtain the first motion feature may include: Adaptive Instance Normalization (AdaIN), deep learning models, statistical methods and other alignment methods between distributions.
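AdaIN, one of the fusion options named above, aligns the content feature's statistics to the style's. A per-channel sketch over plain lists follows; real implementations operate on feature tensors channel by channel:

```python
from statistics import mean, pstdev

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization: normalize the content feature,
    then re-scale it to the style's mean and standard deviation."""
    mu, sigma = mean(content), pstdev(content)
    return [style_std * (c - mu) / (sigma + eps) + style_mean
            for c in content]

# The fused feature keeps the content's relative pattern but carries
# the style's statistics (here, zero mean and unit scale).
fused = adain([1.0, 2.0, 3.0], style_mean=0.0, style_std=1.0)
```

In the embodiment, the style mean and standard deviation would be predicted from the first style information rather than passed as constants.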
可选地,数据处理设备将第一运动特征输入解码器以得到第二图像序列。其中,对于解码器的训练过程后续会有说明,此处不再展开。Optionally, the data processing device inputs the first motion feature into a decoder to obtain a second image sequence. The training process of the decoder will be described later and will not be expanded here.
示例性的,以第一语义标签基于第三图像序列获取为例。第三图像序列如图12所示。该第一风格信息为“沮丧”。延续上述示例1,则本步骤获取的第二图像序列如图13所示。该第二图像序列为“沮丧”的走步。Exemplarily, the first semantic tag is obtained based on the third image sequence. The third image sequence is shown in FIG12. The first style information is "frustrated". Continuing with the above example 1, the second image sequence obtained in this step is shown in FIG13. The second image sequence is a "frustrated" walk.
In this example, the flow of steps 401 to 403 can be as shown in FIG. 14. The inputs are the third image sequence (for example, an image sequence of a 2D animation), the first image sequence (for example, an image sequence of the original 3D animation), and the semantic label of the user's personalized style (i.e., the first semantic label). First, the 2D style information extraction module extracts the 2D stylized features of the third image sequence and converts them into 3D style information, while making the semantic label of that style explicit and returning it to the user for editing. Second, based on the semantic labels and the personalization the user wants to generate, the NLP module parses the user's personalized requirements and feeds them, together with the 3D style information, into the style editing module to generate an edited style information vector (i.e., the first style information). Finally, the first image sequence is content-encoded into a feature expression characterizing its content, fused with the edited first style information, and decoded into an image sequence of the 3D target animation that conforms to the user's edits (i.e., the second image sequence).
可选地,数据处理设备获取第二图像序列之后,将第二图像序列渲染至虚拟物体以得到动画/视频。Optionally, after acquiring the second image sequence, the data processing device renders the second image sequence to a virtual object to obtain an animation/video.
可选地,在第二图像序列是三维图像序列的情况下,上述生成的动画为3D动画。在第二图像序列是二维图像序列的情况下,上述生成的动画为2D动画。Optionally, when the second image sequence is a three-dimensional image sequence, the generated animation is a 3D animation. When the second image sequence is a two-dimensional image sequence, the generated animation is a 2D animation.
在一种可能实现的方式中,本申请实施例提供的数据处理方法主要应用于图像序列的风格迁移场景。In one possible implementation, the data processing method provided in the embodiment of the present application is mainly applied to the style transfer scenario of image sequences.
在另一种可能实现的方式中,本申请实施例提供的数据处理方法主要应用于动画风格迁移场景中。In another possible implementation manner, the data processing method provided in the embodiment of the present application is mainly used in animation style transfer scenarios.
In the embodiments of the present application: on one hand, style information and action information are obtained separately, and the second image sequence is generated from the first style information and the action information, so that stylized animation editing can be performed without changing the other features of the original image sequence, improving the style transfer effect of the animation. On another hand, style information is described and made explicit through semantic labels; the user edits the semantic labels to achieve style transfer, thereby realizing a driving scheme for body movements. This gives the user a quantitative and qualitative analysis of the style information, so that the user knows clearly how to describe requirements quantitatively. Furthermore, by parsing user requirements and exploiting the advantage that massive video collections can cover arbitrary styles, the embodiments can generate arbitrarily customized stylized digital human animations. In addition, extracting style information from the video to which the third image sequence belongs compensates for the user's difficulty in describing certain kinds of style information, and making style information explicit with labels keeps each editing action understandable to the user.
本申请实施例提供的方法的另一流程图可以如图15所示。从风格参照动画中获取第二图像序列,并对第二图像序列的风格化特征提取,以得到第二风格化特征。进而显式化第二风格化特征以得到显示标签。用户对显示标签进行编辑后得到第一风格化特征。再将该第一风格化特征迁移到原始动画中得到风格化动画。该风格化动画的内容与原始动画一致,风格化动画的风格与风格参照动画一致,进而实现风格化迁移。Another flowchart of the method provided by the embodiment of the present application can be shown in FIG15. A second image sequence is obtained from the style reference animation, and the stylized features of the second image sequence are extracted to obtain a second stylized feature. The second stylized feature is then made explicit to obtain a display label. The user edits the display label to obtain a first stylized feature. The first stylized feature is then transferred to the original animation to obtain a stylized animation. The content of the stylized animation is consistent with the original animation, and the style of the stylized animation is consistent with the style reference animation, thereby achieving stylized migration.
上面对本申请实施例提供的数据处理方法进行了描述,下面对上述图4所示实施例中所提的风格编码器、内容编码器、解码器的训练过程进行详细描述。训练侧,利用海量的肢体动画视频,构建近似完备的肢体动画风格化特征向量空间,可满足推理侧风格化特征的任意性。The above describes the data processing method provided in the embodiment of the present application. The following describes in detail the training process of the style encoder, content encoder, and decoder mentioned in the embodiment shown in Figure 4. On the training side, a large amount of body animation videos are used to construct an approximately complete body animation stylized feature vector space, which can meet the arbitrariness of the stylized features on the reasoning side.
训练过程如图16所示,首先,获取图像序列1与图像序列2。其中,图像序列1具有风格1与动作1。图像序列2具有风格2与动作2。其次,利用风格编码器和动作内容编码器分别对输入的两个序列的风格和运动内容进行编码,以解耦风格信息与动作信息。再通过融合算法(例如AdaIN)融合风格信息1与动作信息2,经过解码后生成风格1化的动作2。并融合风格信息2与动作信息1,生成风格2化的动作1。最后,通过判别器分别监督生成的风格化动画分别在风格和内容上的重构损失,使得最终生成的风格化动画能在不损失原始运动内容的前提下,兼备与目标风格最大的相似性。The training process is shown in Figure 16. First, image sequence 1 and image sequence 2 are obtained. Among them, image sequence 1 has style 1 and action 1. Image sequence 2 has style 2 and action 2. Secondly, the style encoder and action content encoder are used to encode the style and motion content of the two input sequences respectively to decouple the style information and action information. Then, style information 1 and action information 2 are fused through a fusion algorithm (such as AdaIN), and style 1 action 2 is generated after decoding. And style information 2 and action information 1 are fused to generate style 2 action 1. Finally, the discriminator supervises the reconstruction losses of the generated stylized animation in style and content respectively, so that the final generated stylized animation can have the greatest similarity with the target style without losing the original motion content.
The above process can be understood as follows. A first training image sequence and a second training image sequence are obtained; their motion features differ, where a motion feature includes action information and/or style information. The first training image sequence is input into the style encoder and the content encoder to obtain first training style information and first training action information; the second training image sequence is input into the style encoder and the content encoder to obtain second training style information and second training action information. The first training style information is fused with the second training action information to obtain a first training motion feature; the second training style information is fused with the first training action information to obtain a second training motion feature. The first training motion feature is input into the decoder to obtain a first reconstructed image sequence; the second training motion feature is input into the decoder to obtain a second reconstructed image sequence. Training is performed with the goal of making the value of a first loss function smaller than a first threshold, yielding the trained style encoder, content encoder, and decoder. The first loss function includes a style loss function and a content loss function: the style loss function represents the style difference between the first reconstructed image sequence and the first training image sequence, and between the second reconstructed image sequence and the second training image sequence; the content loss function represents the content difference between the first reconstructed image sequence and the second training image sequence, and between the second reconstructed image sequence and the first training image sequence.
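The cross-reconstruction loss pairing can be sketched as follows. The feature extractors and the squared-L2 distance are stand-ins, since the embodiment only fixes which pairs are compared, not the concrete metric:

```python
def l2(a, b):
    """Squared L2 distance, one possible realization of 'difference'."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def first_loss(recon1, recon2, style1, style2, content1, content2,
               style_of, content_of):
    # Style terms: each reconstruction against the sequence whose
    # style it should carry (recon1 <- seq1 style, recon2 <- seq2 style).
    style_loss = (l2(style_of(recon1), style1)
                  + l2(style_of(recon2), style2))
    # Content terms: each reconstruction against the sequence whose
    # motion it should keep (recon1 <- seq2 motion, recon2 <- seq1 motion).
    content_loss = (l2(content_of(recon1), content2)
                    + l2(content_of(recon2), content1))
    return style_loss + content_loss

# Toy check with features laid out as [style | content] and slicing
# extractors: a perfect cross-reconstruction yields zero loss.
style_of = lambda seq: seq[:2]
content_of = lambda seq: seq[2:]
loss = first_loss(
    recon1=[1.0, 0.0, 5.0, 6.0], recon2=[0.0, 1.0, 7.0, 8.0],
    style1=[1.0, 0.0], style2=[0.0, 1.0],
    content1=[7.0, 8.0], content2=[5.0, 6.0],
    style_of=style_of, content_of=content_of,
)
```

In the embodiment the comparison is mediated by a discriminator rather than a fixed distance; the sketch only makes the pairing of reconstructions with their target style and content explicit.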
In this embodiment, the style encoder, content encoder, and decoder obtained through training can extract 2D stylized features from a video sequence and map them into the 3D feature space, producing a 3D style semantically consistent with them, and the 3D style information is then given an explicit semantic expression. The user edits the style according to this semantic expression to generate a target style that meets expectations; the algorithm then generates the style information corresponding to the semantic labels of the user's style; finally, the style transfer module transfers the generated 3D target features onto the original animation sequence to produce a target-stylized virtual digital human animation sequence.
另外,图4所示实施例中的第三图像序列包括以下一项或多项:面部表情序列、肢体图像序列。例如,肢体动作包括全局肢体、局部肢体(例如手势等)等。换句话说本申请实施例提供的方法还可以应用于手势、表情等风格迁移。下面以语音驱动手势为例。该方法应用于手势风格迁移的场景如图17所示。In addition, the third image sequence in the embodiment shown in FIG4 includes one or more of the following: a facial expression sequence, a limb image sequence. For example, limb movements include global limbs, local limbs (such as gestures, etc.), etc. In other words, the method provided in the embodiment of the present application can also be applied to style transfer such as gestures and expressions. The following takes voice-driven gestures as an example. The scenario in which the method is applied to gesture style transfer is shown in FIG17.
通过输入一段文本或语音数据,驱动虚拟数字人做出与语音数据语义已知、节奏一致的手势动作。对于同一段语音或文本数据,不同演讲者的手势风格会因人而异,也会因同一人的不同情绪而异,因而风格的个性化定制与迁移对丰富手势的多样性有重要意义。By inputting a piece of text or voice data, the virtual digital human is driven to make gestures with known semantics and consistent rhythm with the voice data. For the same piece of voice or text data, the gesture style of different speakers will vary from person to person, and also from different emotions of the same person, so the personalized customization and transfer of style is of great significance to enriching the diversity of gestures.
在离线或训练阶段,通过收集海量的2D演讲视频,通过前述风格化特征提取模块产生几乎可覆盖任意风格的手势风格信息,离线生成风格信息数据库;在线使用阶段,用户指定任意个性化的风格化标签,通过对用户标签的解析与量化表示,融合离线生成的风格数据库,生成编辑后的风格信息,并将语音驱动手势模块生成的运动序列风格化为目标风格。In the offline or training stage, a large amount of 2D speech videos are collected, and the gesture style information that can cover almost any style is generated through the aforementioned stylized feature extraction module, and a style information database is generated offline. In the online use stage, the user specifies any personalized stylized label, and the user label is parsed and quantified, and the style database generated offline is integrated to generate the edited style information, and the motion sequence generated by the voice-driven gesture module is stylized into the target style.
该方法应用于表情风格迁移的场景如图18所示。该场景也可以理解为数字人表情基风格编辑与迁移场景。通过从海量表情视频中获取近乎任意的表情风格,再迁移到数字人表情肌上,驱动同一个数字人做出任意风格的表情。其中,表情基的定义是,事先确定的用于表征某个中性表情的脸部若干个关键点的坐标集合,而原始系数则表示某个特定表情相对于中性表情的参数表达,比如微笑时相对于中性表情的嘴巴的咧开程度等。因而图18的整个过程是,首先根据某个人的表情和预置的表情基,通过一个表情网络计算该表情所对应的原始系数;并通过同一组表情基获取视频中各种表情对应的系数,用户通过编辑该系数控制所要生成的表情。The scenario where this method is applied to expression style transfer is shown in Figure 18. This scenario can also be understood as a digital human expression base style editing and transfer scenario. By obtaining nearly arbitrary expression styles from massive expression videos and then transferring them to the digital human expression muscles, the same digital human can be driven to make expressions of any style. Among them, the definition of expression base is a predetermined set of coordinates of several key points on the face used to represent a neutral expression, and the original coefficient represents the parameter expression of a specific expression relative to a neutral expression, such as the degree of mouth opening relative to a neutral expression when smiling. Therefore, the whole process of Figure 18 is to first calculate the original coefficient corresponding to a person's expression and a preset expression base through an expression network; and obtain the coefficients corresponding to various expressions in the video through the same set of expression bases, and the user controls the expression to be generated by editing the coefficient.
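The expression-base (blend-shape) formulation described above can be sketched as follows. The key-point layout and the "smile" basis are toy values, and representing each basis as offsets from the neutral face follows the description of the original coefficients:

```python
def apply_expression(neutral, bases, coeffs):
    """Blend a neutral face with expression bases: each basis is a set
    of key-point offsets relative to neutral, weighted by its
    (user-editable) coefficient, as in the FIG. 18 scenario."""
    out = list(neutral)
    for basis, w in zip(bases, coeffs):
        for i, offset in enumerate(basis):
            out[i] += w * offset
    return out

neutral = [0.0, 0.0]        # two key-point coordinates (toy layout)
smile_basis = [0.3, -0.1]   # hypothetical offsets for a "smile" basis

# A coefficient of 0.5 yields a half smile; editing the coefficient
# is how the user controls the generated expression.
face = apply_expression(neutral, [smile_basis], [0.5])
```

Style transfer in this scenario then amounts to replacing or re-weighting the coefficient sequence extracted from the reference video, while the expression bases stay fixed.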
本实施例中,一方面,能从视频序列中提取并转化成手势/表情的风格化特征,极大地丰富了风格多样性;另一方面,对从视频中提取的手势/表情的风格进行显式标签化,便于与用户对手势/表情的风格进行语义性描述,进而实现后续标签与风格信息的匹配与融合。In this embodiment, on the one hand, stylized features of gestures/expressions can be extracted from video sequences and converted into them, greatly enriching the style diversity; on the other hand, the style of gestures/expressions extracted from the video is explicitly labeled, which facilitates the semantic description of the style of gestures/expressions with the user, and then realizes the subsequent matching and fusion of labels and style information.
上面对本申请实施例中的数据处理方法进行了描述,下面对本申请实施例中的数据处理设备进行描述,请参阅图19,本申请实施例中数据处理设备的一个实施例包括:The data processing method in the embodiment of the present application is described above. The data processing device in the embodiment of the present application is described below. Please refer to FIG. 19. An embodiment of the data processing device in the embodiment of the present application includes:
获取单元1901,用于获取第一风格信息;An acquiring unit 1901 is used to acquire first style information;
获取单元1901,还用于获取第一图像序列的动作信息;The acquisition unit 1901 is further used to acquire the motion information of the first image sequence;
生成单元1902,用于基于第一风格信息与动作信息生成第二图像序列,第二图像序列与第一图像序列的动作类型相同,第二图像序列具有第一风格信息。The generating unit 1902 is configured to generate a second image sequence based on the first style information and the action information. The second image sequence has the same action type as the first image sequence, and the second image sequence has the first style information.
可选地,数据处理设备还可以包括:显示单元1903,用于向用户显示第一界面,第一界面包括多个语义标签,多个语义标签用于描述不同图像序列的不同风格信息,多个语义标签与风格信息一一对应;获取单元1901,具体用于基于用户的第二操作从多个语义标签中确定第一语义标签;以及用于将第一语义标签转化为第一风格信息。Optionally, the data processing device may also include: a display unit 1903, used to display a first interface to the user, the first interface including multiple semantic tags, the multiple semantic tags are used to describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information; an acquisition unit 1901, specifically used to determine a first semantic tag from multiple semantic tags based on a second operation of the user; and used to convert the first semantic tag into first style information.
可选地,数据处理设备还可以包括:渲染单元1904,用于将第二图像序列渲染至虚拟物体以得到动画。Optionally, the data processing device may further include: a rendering unit 1904, configured to render the second image sequence to a virtual object to obtain an animation.
本实施例中,数据处理设备中各单元所执行的操作与前述图1至图18所示实施例中描述的类似,此处不再赘述。In this embodiment, the operations performed by each unit in the data processing device are similar to those described in the embodiments shown in Figures 1 to 18 above, and will not be repeated here.
In this embodiment, the acquisition unit 1901 acquires the style information and the motion information separately, and the generation unit 1902 generates the second image sequence based on the first style information and the motion information. This enables stylized animation editing without changing other features of the original image sequence, improving the style-transfer quality of the animation.
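One simple way to picture the step of fusing the first style information with the motion information to obtain a first motion feature (claim 8) is feature concatenation; a real system might instead modulate features inside a network (for example, AdaIN-style modulation). Everything below is an assumed toy representation, not the patented mechanism.

```python
# Toy sketch of "fuse style information with motion information to obtain
# a first motion feature". Concatenation is one of the simplest fusions.

def fuse(style_info, motion_frame):
    """Fuse one motion frame with the style information into one feature."""
    return list(motion_frame) + list(style_info)

def motion_features(style_info, motion_info):
    """First motion feature: one fused vector per motion frame."""
    return [fuse(style_info, m) for m in motion_info]
```

The second image sequence would then be decoded from these fused features, so every frame carries both the original motion and the chosen style.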
Referring to FIG. 20, which is a schematic structural diagram of another data processing device provided by the present application, the data processing device may include a processor 2001, a memory 2002, and a communication port 2003, which are interconnected by a line. The memory 2002 stores program instructions and data.
The memory 2002 stores the program instructions and data corresponding to the steps performed by the data processing device in the implementations shown in FIG. 1 to FIG. 18.
The processor 2001 is configured to perform the steps performed by the data processing device in any of the embodiments shown in FIG. 1 to FIG. 18.
The communication port 2003 may be configured to receive and send data, and to perform the steps related to acquiring, sending, and receiving in any of the embodiments shown in FIG. 1 to FIG. 18.
In one implementation, the data processing device may include more or fewer components than shown in FIG. 20; FIG. 20 is merely illustrative and is not limiting.
An embodiment of the present application further provides a computer-readable storage medium storing one or more computer-executable instructions. When the computer-executable instructions are executed by a processor, the processor performs the method described in the possible implementations of the data processing device in the foregoing embodiments.
An embodiment of the present application further provides a computer program product (or computer program) storing one or more computer instructions. When the computer program product is executed by a processor, the processor performs the method of the possible implementations of the foregoing data processing device.
An embodiment of the present application further provides a chip system, which includes at least one processor configured to support a terminal device in implementing the functions involved in the possible implementations of the foregoing data processing device. Optionally, the chip system further includes an interface circuit, which provides program instructions and/or data for the at least one processor. In one possible design, the chip system may further include a memory for storing the program instructions and data necessary for the terminal device. The chip system may consist of chips, or may include chips and other discrete devices.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely schematic: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application (in essence, the part that contributes to the prior art, or all or part of the technical solutions) may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (23)

  1. A data processing method, characterized in that the method comprises:
    acquiring first style information;
    acquiring motion information of a first image sequence; and
    generating a second image sequence based on the first style information and the motion information, wherein the second image sequence has the same motion type as the first image sequence, and the second image sequence has the first style information.
  2. The method according to claim 1, characterized in that before the acquiring of the first style information, the method further comprises:
    acquiring a third image sequence;
    wherein the acquiring of the first style information comprises:
    acquiring the first style information based on the third image sequence.
  3. The method according to claim 2, characterized in that the acquiring of the first style information based on the third image sequence comprises:
    extracting second style information of the third image sequence; and
    determining the first style information based on the second style information.
  4. The method according to claim 3, characterized in that the determining of the first style information based on the second style information comprises:
    using the second style information as the first style information.
  5. The method according to claim 3, characterized in that the determining of the first style information based on the second style information comprises:
    displaying a second semantic tag to a user, wherein the second semantic tag describes the second style information;
    modifying the second semantic tag into a first semantic tag based on a first operation of the user, wherein the first semantic tag describes the first style information; and
    determining the first style information based on the first semantic tag.
  6. The method according to any one of claims 2 to 5, characterized in that the third image sequence is an image sequence of a two-dimensional animation, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
  7. The method according to claim 1, characterized in that the method further comprises:
    displaying a first interface to a user, wherein the first interface comprises multiple semantic tags, the multiple semantic tags describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information;
    wherein the acquiring of the first style information comprises:
    determining a first semantic tag from the multiple semantic tags based on a second operation of the user; and
    determining the first style information based on the first semantic tag.
  8. The method according to any one of claims 1 to 7, characterized in that the generating of the second image sequence based on the first style information and the motion information comprises:
    fusing the first style information with the motion information to obtain a first motion feature; and
    acquiring the second image sequence based on the first motion feature.
  9. The method according to any one of claims 1 to 8, characterized in that the motion information comprises one or more of the following: a facial expression sequence and a body image sequence.
  10. The method according to any one of claims 1 to 9, characterized in that the method further comprises:
    rendering the second image sequence onto a virtual object to obtain an animation.
  11. A data processing device, characterized in that the data processing device comprises:
    an acquisition unit, configured to acquire first style information,
    the acquisition unit being further configured to acquire motion information of a first image sequence; and
    a generation unit, configured to generate a second image sequence based on the first style information and the motion information, wherein the second image sequence has the same motion type as the first image sequence, and the second image sequence has the first style information.
  12. The device according to claim 11, characterized in that the acquisition unit is further configured to acquire a third image sequence; and
    the acquisition unit is specifically configured to acquire the first style information based on the third image sequence.
  13. The device according to claim 12, characterized in that the acquisition unit is specifically configured to extract second style information of the third image sequence; and
    the acquisition unit is specifically configured to determine the first style information based on the second style information.
  14. The device according to claim 13, characterized in that the acquisition unit is specifically configured to use the second style information as the first style information.
  15. The device according to claim 13, characterized in that the acquisition unit is specifically configured to display a second semantic tag to a user, wherein the second semantic tag describes the second style information;
    the acquisition unit is specifically configured to modify the second semantic tag into a first semantic tag based on a first operation of the user, wherein the first semantic tag describes the first style information; and
    the acquisition unit is specifically configured to determine the first style information based on the first semantic tag.
  16. The device according to any one of claims 12 to 15, characterized in that the third image sequence is an image sequence of a two-dimensional animation, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
  17. The device according to claim 11, characterized in that the data processing device further comprises:
    a display unit, configured to display a first interface to a user, wherein the first interface comprises multiple semantic tags, the multiple semantic tags describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information;
    the acquisition unit is specifically configured to determine a first semantic tag from the multiple semantic tags based on a second operation of the user; and
    the acquisition unit is specifically configured to determine the first style information based on the first semantic tag.
  18. The device according to any one of claims 11 to 17, characterized in that the generation unit is specifically configured to fuse the first style information with the motion information to obtain a first motion feature; and
    the generation unit is specifically configured to acquire the second image sequence based on the first motion feature.
  19. The device according to any one of claims 11 to 18, characterized in that the motion information comprises one or more of the following: a facial expression sequence and a body image sequence.
  20. The device according to any one of claims 11 to 19, characterized in that the data processing device further comprises:
    a rendering unit, configured to render the second image sequence onto a virtual object to obtain an animation.
  21. A data processing device, characterized by comprising a processor coupled to a memory, wherein the memory is configured to store a program or instructions, and when the program or instructions are executed by the processor, the data processing device performs the method according to any one of claims 1 to 10.
  22. A computer storage medium, characterized by comprising computer instructions which, when run on a data processing device, cause the data processing device to perform the method according to any one of claims 1 to 10.
  23. A computer program product, characterized in that, when the computer program product is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 10.
PCT/CN2023/103012 2022-09-29 2023-06-28 Data processing method and related device WO2024066549A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211202267.XA CN117808934A (en) 2022-09-29 2022-09-29 Data processing method and related equipment
CN202211202267.X 2022-09-29

Publications (1)

Publication Number Publication Date
WO2024066549A1 true WO2024066549A1 (en) 2024-04-04

Family

ID=90433987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/103012 WO2024066549A1 (en) 2022-09-29 2023-06-28 Data processing method and related device

Country Status (2)

Country Link
CN (1) CN117808934A (en)
WO (1) WO2024066549A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018132855A (en) * 2017-02-14 2018-08-23 国立大学法人電気通信大学 Image style conversion apparatus, image style conversion method and image style conversion program
CN110909790A (en) * 2019-11-20 2020-03-24 Oppo广东移动通信有限公司 Image style migration method, device, terminal and storage medium
CN110956654A (en) * 2019-12-02 2020-04-03 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium
CN111667399A (en) * 2020-05-14 2020-09-15 华为技术有限公司 Method for training style migration model, method and device for video style migration
CN112164130A (en) * 2020-09-07 2021-01-01 北京电影学院 Video-animation style migration method based on depth countermeasure network
CN112967174A (en) * 2021-01-21 2021-06-15 北京达佳互联信息技术有限公司 Image generation model training method, image generation device and storage medium

Also Published As

Publication number Publication date
CN117808934A (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US11741668B2 (en) Template based generation of 3D object meshes from 2D images
KR102503413B1 (en) Animation interaction method, device, equipment and storage medium
US20210174072A1 (en) Microexpression-based image recognition method and apparatus, and related device
KR101306221B1 (en) Method and apparatus for providing moving picture using 3d user avatar
CN110555896B (en) Image generation method and device and storage medium
CN111553267B (en) Image processing method, image processing model training method and device
JP2022503647A (en) Cross-domain image conversion
US11514638B2 (en) 3D asset generation from 2D images
WO2023284435A1 (en) Method and apparatus for generating animation
WO2024051445A1 (en) Image generation method and related device
CN113362263A (en) Method, apparatus, medium, and program product for changing the image of a virtual idol
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
WO2024066549A1 (en) Data processing method and related device
Usman et al. Skeleton-based motion prediction: A survey
CN115690276A (en) Video generation method and device of virtual image, computer equipment and storage medium
CN114529635A (en) Image generation method, device, storage medium and equipment
CN117152843B (en) Digital person action control method and system
CN115631285B (en) Face rendering method, device, equipment and storage medium based on unified driving
WO2023207391A1 (en) Virtual human video generation method, and apparatus
CN115471618B (en) Redirection method, redirection device, electronic equipment and storage medium
CN117252791A (en) Image processing method, device, electronic equipment and storage medium
Rahman et al. Implementation of diffusion model in realistic face generation
CN117011430A (en) Game resource processing method, apparatus, device, storage medium and program product
CN116775179A (en) Virtual object configuration method, electronic device and computer readable storage medium