WO2024066549A1 - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number
WO2024066549A1
Authority
WO
WIPO (PCT)
Prior art keywords
style information
image sequence
style
information
semantic
Prior art date
Application number
PCT/CN2023/103012
Other languages
French (fr)
Chinese (zh)
Inventor
周世奇
许斌
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024066549A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0475 Generative networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Definitions

  • the present application relates to the field of computer technology, and in particular to a data processing method and related equipment.
  • pure manual modeling is widely used for hyper-realistic virtual humans or celebrity virtual humans, but the manual production cycle is long and the cost is very high.
  • the motion capture modeling method drives the model by collecting motion data with external scanning equipment; compared with pure manual modeling, the time and cost are much lower. It is often used in pan-entertainment industries such as film, television, and live streaming, but it requires the participation of real actors and cannot improve production efficiency.
  • the embodiment of the present application provides a data processing method and related equipment for realizing stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
  • the first aspect of the embodiment of the present application provides a data processing method, which can be applied to scenes such as animation style transfer.
  • the method can be executed by a data processing device, or by a component of the data processing device (such as a processor, a chip, or a chip system, etc.).
  • the method includes: obtaining first style information; obtaining action information of a first image sequence; generating a second image sequence based on the first style information and the action information, the second image sequence has the same action type as the first image sequence, and the second image sequence has the first style information.
  • the above-mentioned style information can be understood as a style description of the image sequence, and the style includes one or more of the following: limb/facial contour, limb/facial proportion, limb movement amplitude, emotion, personality, etc.
  • the action type is used to describe the action of the image sequence, for example, running, jumping, walking, etc.
  • action information can be understood as a low-level vector used to represent the action type; the action vectors corresponding to image sequences of the same action type may differ.
  • the style information and the action information are obtained separately, and the second image sequence is generated based on the first style information and the action information, so as to realize stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
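The pipeline of the first aspect (separate style and action extraction, then generation) can be sketched as follows. This is a minimal illustration in which random linear maps stand in for the patent's style encoder, content encoder, and decoder; all names, dimensions, and the fusion-by-concatenation choice are assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
D_FRAME, D_STYLE, D_CONTENT, T = 16, 4, 8, 10

# Random linear maps as stand-ins for the trained networks.
W_style = rng.standard_normal((D_FRAME, D_STYLE))
W_content = rng.standard_normal((D_FRAME, D_CONTENT))
W_decode = rng.standard_normal((D_STYLE + D_CONTENT, D_FRAME))

def style_encoder(seq):
    # One style vector per sequence (temporal average pooling).
    return seq.mean(axis=0) @ W_style

def content_encoder(seq):
    # One action vector per frame.
    return seq @ W_content

def decoder(style, action):
    # Fuse: broadcast the style vector over every frame, then decode.
    fused = np.concatenate([np.tile(style, (action.shape[0], 1)), action], axis=1)
    return fused @ W_decode

first_seq = rng.standard_normal((T, D_FRAME))   # original animation frames
third_seq = rng.standard_normal((T, D_FRAME))   # sequence supplying the style

first_style = style_encoder(third_seq)          # first style information
action_info = content_encoder(first_seq)        # action information
second_seq = decoder(first_style, action_info)  # generated second image sequence
print(second_seq.shape)  # (10, 16)
```

The second sequence inherits its per-frame action vectors from the first sequence and its sequence-level style vector from the third, which is the separation the text describes.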
  • before the above step of obtaining the first style information, the method further includes: obtaining a third image sequence; and obtaining the first style information includes: obtaining the first style information based on the third image sequence.
  • the first style information is obtained from another, third image sequence, which compensates for the defect that certain types of style information are difficult for users to describe.
  • the above step of: acquiring the first style information based on the third image sequence includes: extracting second style information of the third image sequence; and determining the first style information based on the second style information.
  • the style information of the third image sequence is directly used as the style information to be subsequently migrated to the first image sequence, so that the style of the generated second image sequence is similar to or the same as the style of the third image sequence, thereby satisfying the accurate migration of style.
  • the step of determining the first style information based on the second style information includes: using the second style information as the first style information.
  • the style information of the third image sequence is directly used as the style information to be subsequently migrated to the first image sequence, so that the style of the generated second image sequence is similar to or the same as the style of the third image sequence, thereby compensating for the defect that users have difficulty in describing a certain type of style information, thereby satisfying the precise migration of style.
  • the above step of: determining the first style information based on the second style information includes: displaying a second semantic tag to the user, the second semantic tag being used to describe the second style information; modifying the second semantic tag to a first semantic tag based on the user's first operation, the first semantic tag being used to describe the first style information; and determining the first style information based on the first semantic tag.
  • the user modifies the semantic label through operation on the basis of the third image sequence to achieve the description of style information and ensure user needs, so that the second image sequence generated subsequently can meet the user's style needs for the image sequence.
  • the use of labels to make style information explicit allows users to have a quantitative and qualitative analysis of style information, and then clearly know how to quantitatively describe their needs.
  • the embodiment of the present application can generate any customized stylized digital human animation.
  • the third image sequence is an image sequence of a two-dimensional animation
  • the second style information is two-dimensional style information
  • the first style information is three-dimensional style information
  • the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
  • the stock of 2D video is large enough to allow any style information of a 2D video to be migrated to the 3D original video to obtain the 3D target video.
  • the above steps also include: displaying a first interface to a user, the first interface including multiple semantic tags, the multiple semantic tags being used to describe different style information of different image sequences, and the multiple semantic tags corresponding one-to-one to the style information; obtaining the first style information, including: determining a first semantic tag from the multiple semantic tags based on a second operation of the user; and determining the first style information based on the first semantic tag.
  • any style is extracted from the video offline and a feature library is generated.
  • the user only needs to upload the semantic label of the required personalized style, and then the style information corresponding to the label is automatically identified from the feature library.
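A minimal sketch of such a feature library, assuming a plain tag-to-vector mapping built offline; the tag names and vectors below are illustrative assumptions only, not from the source.

```python
# Style vectors extracted offline from videos, indexed by semantic tag.
feature_library = {
    "lively":    [0.9, 0.1, 0.4],
    "depressed": [0.1, 0.8, 0.2],
    "excited":   [0.8, 0.0, 0.9],
}

def style_from_tag(tag):
    # The user supplies only the semantic tag; the corresponding
    # style information is looked up automatically.
    if tag not in feature_library:
        raise KeyError(f"no style extracted for tag {tag!r}")
    return feature_library[tag]

print(style_from_tag("lively"))  # [0.9, 0.1, 0.4]
```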
  • the above step of: generating a second image sequence based on the first style information and the action information includes: fusing the first style information and the action information to obtain a first motion feature; and acquiring the second image sequence based on the first motion feature.
  • the first style information represented by the first semantic tag is fused with the motion information of the original image sequence to obtain the first motion feature. Therefore, the second image sequence obtained based on the first motion feature realizes style transfer without changing other features of the original image sequence.
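The fusion step can be illustrated as follows. The patent does not name its fusion operator; adaptive instance normalisation (AdaIN), a common choice in style-transfer work, is used here purely as an example, with scalar style statistics as a simplifying assumption.

```python
import numpy as np

def adain_fuse(action_feat, style_mean, style_std, eps=1e-5):
    # Normalise the per-frame action features, then re-scale them with
    # statistics derived from the style information (AdaIN-style fusion).
    mu = action_feat.mean(axis=0, keepdims=True)
    sigma = action_feat.std(axis=0, keepdims=True)
    normalised = (action_feat - mu) / (sigma + eps)
    return normalised * style_std + style_mean

action = np.random.default_rng(1).standard_normal((10, 8))  # 10 frames x 8 dims
fused = adain_fuse(action, style_mean=0.5, style_std=2.0)   # first motion feature
print(fused.shape)  # (10, 8)
```

Because only the feature statistics are rewritten, the frame-to-frame structure of the action is preserved while the style statistics change, which matches the stated goal of transferring style without altering other features.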
  • the above-mentioned action information includes one or more of the following: a facial expression sequence, a limb image sequence.
  • this method can be applied not only to the style transfer of body movements, but also to the style transfer of facial expressions, etc., and has a wide range of applicable scenarios.
  • the above steps further include: rendering the second image sequence to the virtual object to obtain an animation.
  • This possible implementation manner may be applicable to style transfer scenarios from 2D animation to 2D animation, from 2D animation to 3D animation, or from 3D animation to 3D animation.
  • the style information of the image sequence includes explicit style information and implicit style information
  • the second semantic tag is specifically used to associate the explicit style information in the second style information
  • the style information is decomposed into explicit and implicit parts, so that the user can edit the explicit style information, and the edited explicit style information and implicit style information are combined to generate modified style information.
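A toy sketch of this decomposition, assuming the style vector is simply partitioned into an explicit half (editable via semantic tags) and an implicit half; the partition scheme and the values are illustrative assumptions, not the patent's actual representation.

```python
import numpy as np

style = np.array([0.2, 0.7, 0.1, 0.5, 0.3, 0.9])
explicit, implicit = style[:3], style[3:]   # assumed split point

edited_explicit = explicit.copy()
edited_explicit[0] = 1.0  # user edits one explicit dimension via its tag

# Recombine the edited explicit part with the untouched implicit part.
modified_style = np.concatenate([edited_explicit, implicit])
print(modified_style.tolist())  # [1.0, 0.7, 0.1, 0.5, 0.3, 0.9]
```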
  • extracting action information of the first image sequence includes: inputting the first image sequence into a content encoder to obtain action information
  • extracting second style information of the third image sequence includes: inputting the third image sequence into a style encoder to obtain second style information.
  • the above steps further include: acquiring a first training image sequence and a second training image sequence, wherein the first training image sequence and the second training image sequence have different motion features, and the motion features include action information and/or style information; inputting the first training image sequence into a style encoder and a content encoder respectively to obtain first training style information and first training action information;
  • the second training image sequence is input into the style encoder and the content encoder respectively to obtain the second training style information and the second training action information;
  • the first training style information and the second training action information are fused to obtain the first training motion feature;
  • the second training style information and the first training action information are fused to obtain the second training motion feature;
  • the first training motion feature is input into the decoder to obtain the first reconstructed image sequence;
  • the second training motion feature is input into the decoder to obtain the second reconstructed image sequence;
  • training is performed with the goal of making the value of the first loss function less than the first threshold, to obtain the trained style encoder, content encoder, and decoder
  • the accuracy of style transfer can be achieved through the above training process.
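The swap-reconstruction training steps above can be sketched as follows. Random linear maps stand in for the style encoder, content encoder, and decoder, and a plain squared-error term stands in for the (unspecified) first loss function; all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 6

def style_enc(x, W):
    return x.mean(axis=0) @ W      # sequence-level style vector

def content_enc(x, W):
    return x @ W                   # per-frame action vectors

def decode(s, c, W):
    fused = np.concatenate([np.tile(s, (c.shape[0], 1)), c], axis=1)
    return fused @ W

Ws, Wc = rng.standard_normal((D, 2)), rng.standard_normal((D, 3))
Wd = rng.standard_normal((5, D))

seq1 = rng.standard_normal((T, D))  # first training image sequence
seq2 = rng.standard_normal((T, D))  # second training image sequence

s1, c1 = style_enc(seq1, Ws), content_enc(seq1, Wc)
s2, c2 = style_enc(seq2, Ws), content_enc(seq2, Wc)

recon1 = decode(s1, c2, Wd)  # first training motion feature -> first reconstruction
recon2 = decode(s2, c1, Wd)  # second training motion feature -> second reconstruction

# Stand-in first loss function: squared reconstruction error; training would
# adjust Ws, Wc, Wd until this drops below the first threshold.
loss = ((recon1 - seq1) ** 2).mean() + ((recon2 - seq2) ** 2).mean()
print(loss > 0)  # True
```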
  • the second aspect of the embodiment of the present application provides a data processing device.
  • the data processing device includes: an acquisition unit, used to acquire first style information; the acquisition unit is also used to acquire action information of the first image sequence; a generation unit is used to generate a second image sequence based on the first style information and the action information, the second image sequence has the same action type as the first image sequence, and the second image sequence has the first style information.
  • the acquisition unit is further used to acquire a third image sequence; the acquisition unit is specifically used to acquire the first style information based on the third image sequence.
  • the acquisition unit is specifically used to extract second style information of the third image sequence; and the acquisition unit is specifically used to determine the first style information based on the second style information.
  • the acquisition unit is specifically configured to use the second style information as the first style information.
  • the above-mentioned acquisition unit is specifically used to display a second semantic tag to the user, the second semantic tag being used to describe the second style information; the acquisition unit is specifically used to modify the second semantic tag to a first semantic tag based on the user's first operation, the first semantic tag being used to describe the first style information; the acquisition unit is specifically used to determine the first style information based on the first semantic tag.
  • the third image sequence is an image sequence of a two-dimensional animation
  • the second style information is two-dimensional style information
  • the first style information is three-dimensional style information
  • the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
  • the above-mentioned data processing device also includes: a display unit, used to display a first interface to a user, the first interface including multiple semantic tags, the multiple semantic tags are used to describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information; an acquisition unit, specifically used to determine a first semantic tag from the multiple semantic tags based on the user's second operation; the acquisition unit, specifically used to determine the first style information based on the first semantic tag.
  • the above-mentioned generation unit is specifically used to fuse the first style information and the action information to obtain the first motion feature; the generation unit is specifically used to obtain the second image sequence based on the first motion feature.
  • the above-mentioned action information includes one or more of the following: a facial expression sequence, a limb image sequence.
  • the data processing device further includes: a rendering unit, configured to render the second image sequence to the virtual object to obtain an animation.
  • a third aspect of the present application provides a data processing device, comprising: a processor, the processor is coupled to a memory, the memory is used to store programs or instructions, when the program or instructions are executed by the processor, the data processing device implements the method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • a fourth aspect of the present application provides a computer-readable medium having a computer program or instruction stored thereon.
  • when the computer program or instruction is executed on a computer, the computer executes the method in the aforementioned first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the present application provides a computer program product.
  • when the computer program product is executed on a computer, it enables the computer to execute the method in the aforementioned first aspect or any possible implementation manner of the first aspect.
  • a sixth aspect of an embodiment of the present application provides a chip system, which includes at least one processor for supporting a data processing device to implement the functions involved in the above-mentioned first aspect or any possible implementation method of the first aspect.
  • the chip system may also include a memory for storing program instructions and data necessary for the data processing device.
  • the chip system may be composed of a chip, or may include a chip and other discrete devices.
  • the chip system may further include an interface circuit that provides program instructions and/or data to the at least one processor.
  • the technical effects brought about by the second, third, fourth, fifth, and sixth aspects or any possible implementation methods thereof can refer to the technical effects brought about by the first aspect or different possible implementation methods of the first aspect, and will not be repeated here.
  • the present application has the following advantages: by separating the style information and the action information, and generating the second image sequence based on the first style information and the action information, it is possible to perform stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
  • FIG. 1 is a schematic diagram of an artificial intelligence main body framework provided in an embodiment of the present application
  • FIG. 2 is a schematic diagram of the structure of the system architecture provided in an embodiment of the present application
  • FIG. 3A is a schematic diagram of a deployment scenario provided in an embodiment of the present application
  • FIG. 3B is a schematic diagram of another deployment scenario provided in an embodiment of the present application
  • FIG. 4 is a flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 5A is a schematic diagram of decomposing style information into explicit features according to an embodiment of the present application
  • FIG. 5B is a schematic diagram of a training process of a conversion module provided in an embodiment of the present application
  • FIG. 6A is another schematic flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 6B is a schematic diagram of a process of a user modifying a label according to an embodiment of the present application
  • FIG. 7 is a schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user
  • FIG. 8 is another schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user
  • FIG. 9 is another schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user
  • FIG. 10 is another schematic flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 11 is an example diagram of a first image sequence provided in an embodiment of the present application
  • FIG. 12 is an example diagram of a third image sequence provided in an embodiment of the present application
  • FIG. 13 is an example diagram of a second image sequence provided in an embodiment of the present application
  • FIG. 14 is another schematic flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 15 is another schematic flow chart of a data processing method provided in an embodiment of the present application
  • FIG. 16 is a schematic diagram of the training process of the encoder and decoder provided in an embodiment of the present application
  • FIG. 17 is a schematic flow chart of the method provided in an embodiment of the present application applied to a gesture style transfer scenario
  • FIG. 18 is a schematic flow chart of the method provided in an embodiment of the present application applied to an expression style transfer scenario
  • FIG. 19 is a schematic diagram of a structure of a data processing device provided in an embodiment of the present application
  • FIG. 20 is another schematic diagram of the structure of the data processing device provided in an embodiment of the present application.
  • a neural network may be composed of neural units. A neural unit may refer to an operation unit that takes x_s and an intercept b as input, and the output of the operation unit may be: h = f(∑_{s=1}^{n} W_s·x_s + b)
  • where s = 1, 2, ..., n, and n is a natural number greater than 1
  • W_s is the weight of x_s
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be an area composed of several neural units.
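The single neural unit described above can be computed directly; the following minimal illustration uses the sigmoid activation mentioned in the text, with illustrative input and weight values.

```python
import math

def neuron(xs, ws, b):
    # Output of a single neural unit: f(sum_s W_s * x_s + b),
    # with the sigmoid as the activation function f.
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-z))

print(neuron([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5 (z = 0, sigmoid(0) = 0.5)
```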
  • A generative adversarial network (GAN) is a deep learning model.
  • Generative adversarial network includes at least one generator and one discriminator. It produces better output by letting two neural networks learn in a game of mutual competition. These two neural networks can be deep neural networks or convolutional neural networks.
  • the basic principle of GAN is as follows: Taking the GAN that generates pictures as an example, suppose there are two networks, G (Generator) and D (Discriminator), where G is a network that generates pictures, randomly sampling from the latent space as input to generate pictures, recorded as G (z); D is a discriminator network, which is used to determine whether a picture is "real”.
  • Its input parameter is x
  • x represents a picture
  • x is a real picture or the output of the generator network.
  • the output D(x) represents the probability that x is a real picture: if D(x) is 1, x is certainly a real picture; if D(x) is 0, x cannot be a real picture.
  • the goal of the generative network G is to generate as realistic images as possible to deceive the discriminative network D, and the output results need to imitate the real samples in the training set as much as possible.
  • the goal of the discriminative network D is to distinguish the images generated by G from the real images as much as possible.
  • the two networks compete with each other and constantly adjust parameters.
  • G and D constitute a dynamic "game" process, which is also the "confrontation" in the "generative adversarial network".
  • the ultimate goal is to make the discriminative network unable to judge whether the output results of the generative network are real.
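The role of the discriminator D(x) can be illustrated with a one-dimensional toy: a logistic classifier outputs the probability that a sample is real, and at the GAN equilibrium it can no longer tell, outputting 0.5. The 1-D setting and the parameter values here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D setting: real pictures cluster near +2, generated pictures near -2.
# D(x) = sigmoid(w*x + b) is the discriminator's probability that x is real.
w, b = 1.0, 0.0

def D(x):
    return sigmoid(w * x + b)

print(D(2.0) > 0.5)    # True: judged likely real
print(D(-2.0) < 0.5)   # True: judged likely generated
# At the equilibrium, G's outputs are indistinguishable from real data and the
# best the discriminator can do is output 0.5 everywhere.
print(D(0.0))          # 0.5
```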
  • Virtually created video content includes animated videos displayed on 2D planes, and 3D animated content displayed on 3D display devices such as augmented reality (AR), virtual reality (VR), and holographic displays; its style is not only cartoon style, but also includes realistic style, such as digital human animation, special effects film and television, etc.
  • Virtual digital people refer to virtual characters with digital appearance. Unlike robots with physical bodies, virtual digital people rely on display devices to exist, such as mobile phones, computers or smart large screens. A complete virtual digital person often needs to have the following three capabilities:
  • An image sequence can be understood as a plurality of images with a time-sequential relationship, and of course, can also be an image sequence obtained from a video.
  • the image sequence can include a limb image sequence, and/or a facial expression sequence, etc.
  • the image sequence can refer to an image sequence of the entire body limbs, or an image sequence of a part of the entire body limbs (or called a local limb), or a facial expression sequence of a character corresponding to the image sequence, etc., which is not specifically limited here.
  • the style information involved in the embodiment of the present application can be a style feature vector obtained by the image sequence through the style encoder. It can also be an explicit vector in the style feature vector. It can also be a partial feature of the explicit vector in the style feature vector, etc., which is not limited here.
  • the label corresponding to the style information can also be understood as a style description of the image sequence.
  • the style includes one or more of the following: body/facial contours, body/facial proportions, body movement range, emotions, personality, etc. Emotions can include: happy, depressed, excited, etc. Personality can include: lively, kind, feminine, mean, etc.
  • the action information involved in the embodiment of the present application may be a feature vector obtained by an image sequence through a content encoder.
  • the action type is used to describe the action of the image sequence.
  • the content refers to the action described by the image sequence (for example: running, jumping, squatting, walking, raising head, lowering head, closing eyes, etc.). It is understandable that the action vectors corresponding to the image sequence of the same action type may be different.
  • Semantic tags are used to describe the style information of image sequences, which can be understood as being used to visualize the style of image sequences.
  • the style information corresponds one-to-one to the semantic labels.
  • the semantic label may be different according to different situations of the style information.
  • the semantic label can be understood as being used to describe the style information, so as to facilitate the user to understand or edit the style of the image sequence.
  • the style information is a style feature vector of an image sequence.
  • the semantic tag is an explicit expression of the style feature vector, and the user can use the semantic tag to clarify the style of the image sequence/video (for example, the character's emotions and personality expressed by the body movements of the character in the video), so as to facilitate style editing/migration and other operations.
  • pure manual modeling is widely used for hyper-realistic virtual humans or celebrity virtual humans, but the manual production cycle is long and the cost is very high.
  • the motion capture modeling method drives the model by collecting motion data with external scanning equipment; compared with pure manual modeling, the time and cost are much lower. It is often used in pan-entertainment industries such as film, television, and live streaming, but it requires the participation of real actors and cannot improve production efficiency.
  • the artificial intelligence-driven method is based on algorithms and machine learning.
  • stylized animation editing requires giving the original animation a specified style while keeping its other features as unchanged as possible. Two issues arise: first, how to better decouple style information from animation motion information; second, how to obtain style data at low cost.
  • Video is a major data source, but how to explicitly mark the semantic label features of style in massive video data so that users can complete editing and style transfer by only semantically describing the style is also an important issue.
  • the embodiments of the present application address the defect that existing virtual digital human animation driving methods cannot be arbitrarily stylized, and propose a body movement driving solution based on video style extraction and explicit marking and editing of style information, aiming to fill the gap in AI user personalized animation driving in pan-entertainment scenarios; in addition, extracting style from video can make up for the defect that users have difficulty in describing a certain type of style.
  • an embodiment of the present invention provides a system architecture 100.
  • the data acquisition device 160 is used to collect training data.
  • the training data includes: a first training image sequence and a second training image sequence.
  • the training data is stored in the database 130, and the training device 120 obtains the target model/rule 101 based on the training data maintained in the database 130.
  • the following will describe in more detail how the training device 120 obtains the target model/rule 101 based on the training data, and the target model/rule 101 can be used to implement the data processing method provided in the embodiment of the present application.
  • the target model/rule 101 in the embodiment of the present application may specifically include a style encoder, a content encoder, and a decoder.
  • the training data maintained in the database 130 may not all come from the collection of the data acquisition device 160, but may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rule 101 based entirely on the training data maintained in the database 130, and may also obtain training data from the cloud or other places for model training. The above description should not be used as a limitation on the embodiment of the present application.
  • the target model/rule 101 obtained by training the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1 .
  • the execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an AR/VR, a vehicle terminal, etc., or a server or a cloud.
  • the execution device 110 is configured with an I/O interface 112 for data interaction with an external device.
  • the user can input data to the I/O interface 112 through the client device 140.
  • the input data may include: a first image sequence and a first semantic label; optionally, the input data may also include the first image sequence and a second image sequence, etc.
  • the input data may also be a two-dimensional animation (for example, the two-dimensional animation is the animation to which the second image sequence belongs) and a three-dimensional animation (for example, the three-dimensional animation is the animation to which the first image sequence belongs).
  • the input data may be input by a user, or uploaded by a user through a shooting device, or may come from a database, which is not specifically limited here.
  • the preprocessing module 113 is used to perform preprocessing (e.g., conversion of two-dimensional features to three-dimensional features, etc.) according to the input data received by the I/O interface 112 (e.g., a first image sequence and a first semantic label, or a first image sequence and a second image sequence, or a two-dimensional animation and a three-dimensional animation).
  • when the execution device 110 preprocesses the input data, or when the computing module 111 of the execution device 110 performs processing related to extracting action information of the first image sequence and generating a second image sequence based on the action information and the first semantic tag, the execution device 110 can call data, code, etc. in the data storage system 150 for the corresponding processing, and can also store the second image sequence, instructions, etc. obtained by that processing in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the second image sequence obtained as described above, or the three-dimensional animation corresponding to the second image sequence, to the client device 140 to provide it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks.
  • the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
  • the user can manually give input data, and the manual giving can be operated through the interface provided by the I/O interface 112.
  • the client device 140 can automatically send input data to the I/O interface 112. If the client device 140 is required to automatically send input data and needs to obtain the user's authorization, the user can set the corresponding authority in the client device 140.
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form can be a specific method such as display, sound, action, etc.
  • the client device 140 can also be used as a data acquisition terminal to collect the input data of the input I/O interface 112 and the output results of the output I/O interface 112 as shown in the figure as new sample data, and store them in the database 130.
  • the I/O interface 112 directly stores the input data of the input I/O interface 112 and the output results of the output I/O interface 112 as new sample data in the database 130.
  • FIG1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 can also be placed in the execution device 110.
  • a target model/rule 101 is obtained through training by a training device 120 .
  • the target model/rule 101 may include a style encoder, a content encoder, a decoder, etc. in the embodiment of the present application.
  • FIG2 is a chip hardware structure provided by an embodiment of the present invention, and the chip includes a neural network processor 20.
  • the chip can be set in the execution device 110 shown in FIG1 to complete the calculation work of the calculation module 111.
  • the chip can also be set in the training device 120 shown in FIG1 to complete the training work of the training device 120 and output the target model/rule 101.
  • the neural network processor 20 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or any other processor suitable for large-scale XOR operation processing.
  • taking the NPU as an example: the neural network processor 20 is mounted on the host central processing unit (CPU) as a coprocessor, and the host CPU assigns tasks to it.
  • the core part of the NPU is the operation circuit 203, and the controller 204 controls the operation circuit 203 to extract data from the memory (weight memory or input memory) and perform operations.
  • the operation circuit 203 includes multiple processing units (process engines, PEs) inside.
  • the operation circuit 203 is a two-dimensional systolic array.
  • the operation circuit 203 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • the operation circuit 203 is a general-purpose matrix processor.
  • the operation circuit takes the corresponding data of matrix B from the weight memory 202 and caches it on each PE in the operation circuit.
  • the operation circuit takes the matrix A data from the input memory 201 and performs matrix operation with matrix B, and the partial result or final result of the matrix is stored in the accumulator 208.
  • the vector calculation unit 207 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector calculation unit 207 can be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization, local response normalization, etc.
  • the vector calculation unit 207 stores the vector of processed outputs to the unified buffer 206.
  • the vector calculation unit 207 can apply a nonlinear function to the output of the operation circuit 203, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 207 generates a normalized value, a merged value, or both.
  • the vector of processed outputs can be used as an activation input to the operation circuit 203, for example, for use in a subsequent layer in a neural network.
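The dataflow described above (the operation circuit multiplying matrix A with matrix B, accumulating partial results in the accumulator, and the vector calculation unit then applying a nonlinear function) can be sketched in software as follows. This is an illustrative numpy model of the dataflow only, not a description of the actual hardware; the tile size and ReLU activation are assumptions for illustration.

```python
import numpy as np

def npu_dataflow_sketch(A, B, tile=2):
    """Software model of the dataflow: matrix A is multiplied with the
    weight matrix B tile by tile, partial results are accumulated (the
    role of accumulator 208), and the vector unit then applies a
    nonlinear activation (ReLU is assumed here)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    acc = np.zeros((m, n))            # accumulator for partial results
    for start in range(0, k, tile):   # process the K dimension in tiles
        a_tile = A[:, start:start + tile]
        b_tile = B[start:start + tile, :]
        acc += a_tile @ b_tile        # partial matrix product accumulated
    return np.maximum(acc, 0.0)       # vector unit: ReLU activation

A = np.array([[1.0, -2.0, 3.0, 0.5]])
B = np.arange(8.0).reshape(4, 2)
out = npu_dataflow_sketch(A, B)       # equals relu(A @ B)
```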
  • the unified memory 206 is used to store input data and output data.
  • the direct memory access controller (DMAC) 205 transfers data between the external memory and the on-chip memories: input data is transferred from the external memory to the input memory 201 and/or the unified memory 206, weight data in the external memory is transferred to the weight memory 202, and data in the unified memory 206 is transferred back to the external memory.
  • the bus interface unit (BIU) 210 is used to realize the interaction between the main CPU, DMAC and instruction fetch memory 209 through the bus.
  • An instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204.
  • the controller 204 is used to call the instructions cached in the memory 209 to control the working process of the computing accelerator.
  • the unified memory 206, the input memory 201, the weight memory 202 and the instruction fetch memory 209 are all on-chip memories, and the external memory is a memory outside the NPU, which can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM) or other readable and writable memory.
  • the arbitrary-style editable 3D animation generation solution provided by the embodiments of the present application can be applied to business-facing (2B) digital host scenarios and consumer-facing (2C) digital companion, assistant software and other scenarios. There are many specific deployment solutions, which are described below by example.
  • a deployment scenario provided by an embodiment of the present application is shown in FIG. 3A, where a user uploads an animation video representing a target style on the client.
  • the server extracts the target style from the video and returns the semantic label of the style to the user.
  • the user can then describe, edit, or select the style based on the semantic label of the style, such as for a style with a semantic label of excitement, hoping that the degree is slightly weaker.
  • the client completes the label editing and uploading, and after receiving the request, the server reduces the weight of the semantic label of the style according to the target to reduce the degree of excitement, thereby editing the style information, and then generating a target animation that meets the user label, and returns it to the client for rendering and display.
  • another deployment scenario provided by an embodiment of the present application is shown in FIG. 3B.
  • the server completes the extraction of any style from the video offline and generates a feature library.
  • the user only needs to upload the semantic tags of the required personalized style, such as adding a little feminine style to the excited style.
  • after receiving the request, the server automatically identifies the style information corresponding to the excited and feminine tags from the style information library, edits the features, generates style information that matches the semantic tags of the target style, and completes rendering and display.
  • the style involved in the above deployment scenario may refer to a two-dimensional style or a three-dimensional style.
  • the method provided in the embodiment of the present application may be applied to a scenario where a two-dimensional style is transferred to a two-dimensional image sequence. It may also be applied to a scenario where a three-dimensional style is transferred to a three-dimensional image sequence. It may also be applied to a scenario where a two-dimensional style is transferred to a three-dimensional image sequence, or a scenario where a three-dimensional style is transferred to a two-dimensional image sequence, etc., which are not specifically limited here.
  • FIG4 is an embodiment of a data processing method provided in an embodiment of the present application.
  • the method can be executed by a data processing device (terminal device/cloud server), or by a component of a data processing device (such as a processor, a chip, or a chip system, etc.), and the method includes steps 401 to 403.
  • the method can be applied to scenes of style transfer between animations such as children's educational animations, short video animations, promotional animations, variety show animations, and film and television preview animations.
  • Step 401 Obtain first style information.
  • the style information refers to a style feature vector of an image sequence.
  • the third image sequence is input into a style encoder to obtain the second style information.
  • the training process of the style encoder will be described later and will not be expanded here.
  • the style information refers to an explicit vector or a part of an explicit vector in a style feature vector of an image sequence. If the third image sequence is input into the style encoder to obtain a style feature vector, the style feature vector is split into an explicit vector and an implicit vector. In this case, the style information can be understood as an explicit expression of the style feature vector.
  • the style information in the embodiment of the present application can be the style feature vector corresponding to the image sequence, or it can be the explicit vector in the style feature vector corresponding to the image sequence. It can also be a partial feature of the explicit vector in the style feature vector corresponding to the image sequence, etc.
  • the style information can be decomposed into explicit vectors and implicit features.
  • this decomposition is just an example, and the style information can also be decomposed into explicit vectors, implicit features, and personalized features.
  • the personalized feature is used to express the personalized differences brought about by the same style when interpreted by different roles.
  • the personalized feature can also be related to the role in the image sequence, for example, it can be "Venus style, Trump style", etc.
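The decomposition of style information into explicit and implicit parts described above can be sketched as follows. A fixed split point in the feature vector is a simplifying assumption for illustration; the patent leaves open how the decomposition is realized (e.g., by a trained network).

```python
import numpy as np

def split_style_vector(style_vec, explicit_dim):
    """Hypothetical decomposition of a style feature vector into an
    explicit part (semantically interpretable, e.g. calm->excited) and
    an implicit part.  A fixed split point is assumed here; a trained
    neural network could instead learn the decomposition."""
    explicit = style_vec[:explicit_dim]
    implicit = style_vec[explicit_dim:]
    return explicit, implicit

vec = np.arange(8.0)                      # stand-in style feature vector
explicit, implicit = split_style_vector(vec, explicit_dim=3)
```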
  • the following description takes the case where the style information is an explicit vector as an example.
  • the first one is to obtain the first style information based on the third image sequence.
  • the data processing device first obtains the third image sequence, and obtains the first style information based on the third image sequence.
  • the data processing device may obtain the image sequence by receiving images sent by other devices, by selecting images from a database, by collecting images through various sensors in the data processing device, or through images uploaded by users, etc., which is not limited here.
  • the image sequence (for example, the first image sequence, the third image sequence, etc.) can be a two-dimensional image sequence or a three-dimensional image sequence, etc., which is not specifically limited here.
  • the third image sequence may be an image sequence extracted from the two-dimensional animation.
  • the third image sequence is extracted from the two-dimensional animation by a human posture recognition method (e.g., openpose).
  • the method for obtaining the two-dimensional animation is not limited here, and may be a method of uploading by a user, a method of receiving from other devices, or a method of selecting from a database, etc., which is not limited here.
  • the step of acquiring the first style information based on the third image sequence is divided into two cases according to whether there is a user operation, which are described below respectively.
  • the data processing device may directly extract the second style information of the third image sequence and use the second style information as the first style information, or convert the second style information into preset style information.
  • the above decomposition may be based on a trained neural network, or may be based on finding multiple image sequences for expressing the same style in a database, and then determining explicit vectors based on multiple image sequences for expressing the same style, etc., which is not specifically limited here.
  • the case of determining explicit vectors based on multiple image sequences expressing the same style may specifically include: inputting multiple image sequences of the same style into a style encoder to obtain multiple style feature vectors, and using the common features of the multiple style feature vectors as the style information.
  • the non-common parts are implicit features, etc., which are not specifically limited here.
  • multiple image sequences expressing the style of "happy” are found from the database, and the multiple image sequences are respectively input into the style encoder to obtain multiple style feature vectors.
  • the common vector of the multiple style feature vectors is determined, and the style information of "happy" is the above-mentioned common vector.
  • the corresponding relationship between the explicit style information and the common vector is determined.
  • the explicit vector = W1 × style information 1 + W2 × style information 2 + ... + Wn × style information n.
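The weighted combination above can be sketched as follows. The component names and values are illustrative stand-ins for common vectors obtained from same-style image sequences.

```python
import numpy as np

# Hypothetical style information components (common vectors extracted
# from groups of same-style image sequences); values are illustrative.
style_components = {
    "happy":    np.array([1.0, 0.0, 0.5]),
    "feminine": np.array([0.0, 1.0, 0.2]),
}

def explicit_vector(weights, components):
    """Explicit vector = W1*style_info_1 + ... + Wn*style_info_n."""
    dim = len(next(iter(components.values())))
    out = np.zeros(dim)
    for name, w in weights.items():
        out += w * components[name]
    return out

vec = explicit_vector({"happy": 0.5, "feminine": 0.5}, style_components)
```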
  • the style information is an explicit vector of the style feature vector.
  • the style information may include: “calm->excited”, “single->diverse”, “feminine->masculine”.
  • the terms on either side of "->" refer to the boundaries of a range.
  • "calm" to "excited" is a progression of emotion; in other words, the style information can be further distinguished by different weights/levels.
  • the intensity range of happiness may include several levels such as satisfaction, relief, pleasure, joy, and ecstasy.
  • the style information may also be “satisfied->ecstasy”.
  • the second style information is two-dimensional style information.
  • the second style information can be converted into the first style information through the conversion module, where the first style information is three-dimensional style information. This case is mainly used to migrate the style information of a two-dimensional animation to a three-dimensional animation, so as to broaden the sources of style information for three-dimensional animation.
  • the above-mentioned conversion module can be understood as a 2D-3D style conversion module.
  • this module is trained with a large number of style-consistent 2D-3D pairs to obtain a nonlinear transformation for embedding 2D stylized features into the 3D stylized feature space.
  • the 2D style information (i.e., the second style information) extracted from the video can be converted into 3D stylized features (i.e., the first style information) by projecting it into the 3D space using the nonlinear transformation.
  • the training process of the above conversion module can be shown in FIG5B .
  • a 3D animation sequence is obtained and the 3D stylized features of the 3D animation sequence are extracted.
  • a 2D animation sequence consistent with the style and action of the 3D animation sequence is generated, and the 2D style information is extracted.
  • the style information of both the two are aligned to the same feature space, completing the projection of the 2D style information to the 3D style information space.
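The alignment step above can be sketched as follows. The patent describes a learned nonlinear transformation trained on style/action-consistent 2D-3D pairs; for brevity this sketch fits a linear projection by least squares on synthetic paired features, which stands in for that training. All feature dimensions and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired stylized features: row i holds the 2D style
# feature and the 3D style feature of one style-consistent pair.
feats_2d = rng.normal(size=(100, 4))
true_map = rng.normal(size=(4, 6))
feats_3d = feats_2d @ true_map        # stand-in for extracted 3D features

# Fit the projection from the paired data (linear least squares here;
# the patent uses a learned nonlinear transformation).
W, *_ = np.linalg.lstsq(feats_2d, feats_3d, rcond=None)

def project_2d_style_to_3d(style_2d):
    """Embed a 2D stylized feature into the 3D stylized feature space."""
    return style_2d @ W

projected = project_2d_style_to_3d(feats_2d[0])
```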
  • after the data processing device extracts the second style information of the third image sequence, it can display a second semantic tag to the user, where the second semantic tag is used to explicitly describe the second style information. Then, based on the user's first operation, the second semantic tag is modified into the first semantic tag, and the first style information is determined based on the first semantic tag.
  • the explanation of the semantic tag can refer to the description of the aforementioned related terms, which will not be repeated here.
  • the second semantic label can be understood as a style description of the third image sequence, and the style includes one or more of the following: body/face contour, body/face proportion, body movement amplitude, emotion, personality, etc.
  • This method can also be understood as the data processing device converting the second style information vector of the image sequence into a second semantic label that can be understood by the user, and the user processes the second semantic label according to actual needs to obtain the first semantic label.
  • the data processing device converts the first semantic label into the first style information, and then subsequently generates an image sequence that meets the user's needs.
  • the above processing includes at least one of the following: addition, deletion, modification, degree control (or understood as amplitude, level adjustment), etc.
  • the first operation includes the above-mentioned addition, deletion, modification, degree control (or understood as amplitude, level adjustment), modification of semantic tag weight, etc.
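The editing operations listed above (addition, deletion, degree/weight adjustment) can be sketched as edits on a mapping from semantic tags to weights. The tag names and operation format are illustrative assumptions, not taken from the patent.

```python
def edit_semantic_tags(tags, operations):
    """Apply user edit operations (add / delete / set_weight) to a
    semantic-tag-to-weight mapping; operation format is hypothetical."""
    tags = dict(tags)
    for op, label, *rest in operations:
        if op == "add":
            tags[label] = rest[0] if rest else 1.0
        elif op == "delete":
            tags.pop(label, None)
        elif op == "set_weight" and label in tags:
            tags[label] = rest[0]
    return tags

# Second semantic label: emotional excitement, single style
second = {"calm->excited": 1.0, "single->diverse": 0.0}
first = edit_semantic_tags(second, [
    ("set_weight", "calm->excited", 0.5),    # weaken excitement -> neutral
    ("set_weight", "single->diverse", 1.0),  # single -> diverse
    ("add", "feminine", 1.0),                # add a feminine style
])
```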
  • the data processing device can determine the first operation through the user's voice, text, etc. input method, which is not specifically limited here.
  • for example, the data processing device is a cloud device, and the third image sequence is obtained by receiving it from the terminal device.
  • the process in this case can be shown in FIG. 6.
  • the process includes steps 601 to 606.
  • Step 601 The terminal device sends a third image sequence to the cloud device.
  • the user can send the third image sequence to the cloud device through the terminal device.
  • the cloud device receives the third image sequence sent by the terminal device.
  • Step 602 The cloud device generates a second semantic tag for the third image sequence.
  • after the cloud device obtains the third image sequence, it first obtains the second style information of the third image sequence and converts the second style information into the second semantic label.
  • the following takes the case where the style information is an explicit vector in the style feature vector as an example.
  • multiple image sequences expressing "happy” can be found from the database, and the multiple image sequences can be input into the style encoder to obtain multiple style feature vectors.
  • the common vector of multiple style feature vectors is determined, and the style semantic label of "happy" corresponds to the above common vector (i.e., the explicit vector).
  • the corresponding relationship between the semantic label and the style information is determined.
  • Step 603 The cloud device sends a second semantic tag to the terminal device.
  • after acquiring the second semantic tag, the cloud device sends the second semantic tag to the terminal device.
  • the terminal device receives the second semantic tag sent by the cloud device.
  • Step 604 The terminal device determines the first semantic tag based on the second semantic tag.
  • the process is similar to the above description, and here only takes determining the first style information based on the user's first operation and the third image sequence as an example.
  • after acquiring the second semantic tag, the terminal device displays the second semantic tag to the user, and then modifies the second semantic tag into the first semantic tag based on the user's first operation.
  • Step 605 The terminal device sends the first semantic tag to the cloud device.
  • after acquiring the first semantic tag, the terminal device sends the first semantic tag to the cloud device.
  • the cloud device receives the first semantic tag sent by the terminal device.
  • Step 606 The cloud device determines first style information based on the first semantic tag.
  • the cloud device may determine the first style information based on the first semantic tag.
  • FIG6B is an example of a user modifying a label.
  • the second semantic label of the third image sequence is "emotional excitement, single style".
  • the user performed the following processing: deleting excitement and maintaining neutrality; adjusting the richness of the action, changing from single to diverse; adding a feminine style.
  • the natural language processing (NLP) module in the data processing device can automatically identify and match the semantic label of the style specified by the user, select the style information that matches it, and quantify the degree of a certain style specified by the user; the two are merged to generate the edited style information.
  • the ability of the NLP module is to input a paragraph of text and output the analysis of the paragraph of text (for example, nouns, verbs, keywords that the user cares about).
  • the NLP module outputs the keywords that express style in a paragraph of text. For example, if the input is "the target style I want is half feminine and half masculine", the NLP module can output the following keywords: feminine, masculine, and half each. That is, it parses out the style-related words in the descriptive text. For another example, the user transmits the information "I want a more feminine style" by inputting text or voice, and the NLP module determines from this information that the user wants to "add a feminine style" on the basis of the second semantic tag.
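A toy stand-in for the keyword-extraction behavior described above is sketched below. The style vocabulary and the degree parsing are illustrative assumptions; a real NLP module would use a trained model rather than keyword matching.

```python
import re

# Hypothetical style vocabulary; a real module would be far richer.
STYLE_VOCAB = {"feminine", "masculine", "excited", "calm", "happy"}

def extract_style_keywords(text):
    """Pull style-related keywords and an optional degree word out of
    a free-text request (toy stand-in for the NLP module)."""
    words = re.findall(r"[a-z]+", text.lower())
    keywords = [w for w in words if w in STYLE_VOCAB]
    degree = 0.5 if "half" in words else 1.0   # crude degree quantification
    return keywords, degree

keywords, degree = extract_style_keywords(
    "the target style I want is half feminine and half masculine")
# keywords -> ['feminine', 'masculine'], degree -> 0.5
```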
  • the data processing device displays a user interface as shown in FIG7 to the user.
  • the user interface includes an animation preview interface and an editing interface.
  • the semantic label of the style in the editing interface (also referred to as the style label) can be understood as the aforementioned second semantic label.
  • the second semantic label is excitement and single.
  • the user can modify the second semantic label through the editing interface. As shown in FIG. 8, the user can drag cursor 801 to move "calm->excited" from 1.0 to 0.5, removing the excitement and changing it to neutral, and drag cursor 802 to move "single->diverse" from 0.0 to 1.0, changing single to diverse.
  • the user can click Add Label 803 to add a feminine style label, as shown in FIG. 9. Based on FIGS. 7 to 9 above, the user modifies the second semantic label (excited, single) into the first semantic label (neutral, diverse, feminine).
  • the semantic label of the style information is made explicit, and the user can edit according to the explicit label.
  • this embodiment decomposes the style information and semanticizes its explicit features, thereby realizing the labeling of style information, and identifies, matches, and quantifies the semantic labels of any style specified by the user to generate specific style information. Therefore, both returning feature labels for user editing in the deployment scheme shown in FIG. 3A and matching the user's personalized style with semantic labels in the deployment scheme shown in FIG. 3B become possible, and the user can be more aware of his or her editing behavior.
  • style information may be extracted from the third image sequence/video to compensate for the defect that certain styles are difficult for users to describe.
  • the second type is to determine the first style information based on a second operation performed by the user on the first interface.
  • the data processing device displays a first interface to the user, and the first interface includes multiple semantic tags.
  • Each of the multiple semantic tags is used to explicitly display the style information of the image sequence.
  • the data processing device determines a first semantic tag from the multiple semantic tags based on the user's second operation, and then determines the first style information based on the first semantic tag.
  • This situation can be applied to the scenario shown in FIG. 3B above.
  • the following takes the case where the data processing device is a cloud device as an example.
  • the process in this case can be shown in FIG. 10.
  • the process includes steps 1001 to 1005.
  • Step 1001 A cloud device generates a style information library and multiple semantic tags based on multiple image sequences.
  • the cloud device obtains multiple image sequences, obtains the common vectors of the style feature vectors corresponding to the multiple image sequences, and extracts different semantic labels based on different common vectors, thereby obtaining a style information library of the multiple common vectors and multiple semantic labels.
  • Step 1002 The cloud device sends a plurality of semantic tags to the terminal device.
  • after the cloud device obtains the multiple semantic tags, it sends them to the terminal device. Correspondingly, the terminal device receives the multiple semantic tags sent by the cloud device.
  • Step 1003 The terminal device determines a first semantic tag based on a second operation performed by the user on the first interface.
  • after receiving the plurality of semantic tags sent by the cloud device, the terminal device displays a first interface to the user, where the first interface includes the plurality of semantic tags.
  • the first semantic tag is determined based on a second operation of the user on the first interface.
  • the second operation may specifically be a selection operation, etc.
  • Step 1004 The terminal device sends a first semantic tag to the cloud device.
  • after determining the first semantic tag, the terminal device sends the first semantic tag to the cloud device.
  • the cloud device receives the first semantic tag sent by the terminal device.
  • Step 1005 The cloud device determines the first style information from the style information library based on the first semantic tag.
  • after receiving the first semantic tag sent by the terminal device, the cloud device finds the common vector corresponding to the first semantic tag from the style information library as the first style information.
  • This method can also be understood as the data processing device displays multiple semantic tags to the user, and the user can directly select the semantic tag he needs from the multiple semantic tags as needed, or the user inputs the weights of the multiple semantic tags in the first interface.
  • the third type is to determine the first style information based on a third operation of the user.
  • the data processing device can directly receive the third operation of the user, and determine the first semantic tag in response to the third operation.
  • the third operation may be voice, text, etc., which is not limited here.
  • the user edits "add feminine style” by voice. Then the data processing device can determine the first semantic tag as "feminine” according to the voice of "add feminine style".
  • the data processing device is a server, that is, the data processing device extracts any style from the video offline and generates a feature library.
  • the user only needs to upload the semantic tag of the required personalized style, such as adding a little feminine style to the excited style.
  • the data processing device automatically identifies the style information corresponding to the excited and feminine tags from the style information library, edits the features, generates style information that matches the semantic tag of the target style, and completes rendering and display.
  • Step 402 Acquire action information of a first image sequence.
  • the data processing device obtains a first image sequence, which can be understood as an image sequence whose style information needs to be replaced.
  • the first image sequence is a 3D image sequence.
  • the first image sequence is a 2D image sequence.
  • the first image sequence may be an image sequence extracted from a three-dimensional animation.
  • the first image sequence is extracted from the three-dimensional animation by a human posture recognition method (e.g., openpose).
  • the acquisition method of the three-dimensional animation is not limited here, and may be a method of uploading by a user, a method of receiving from other devices, or a method of selecting from a database, etc., which is not limited here.
  • for example, an example of the first image sequence is shown in FIG. 11, where the action content of the first image sequence is "walking".
  • after acquiring the first image sequence, the data processing device extracts the action information of the first image sequence.
  • the explanation of the action information can refer to the description of the above-mentioned related terms, which will not be repeated here.
  • the first image sequence is input into a content encoder to obtain the action information.
  • the training process of the content encoder will be described later and will not be expanded here.
  • Step 403 Generate a second image sequence based on the first style information and the motion information.
  • the data processing device may determine the first style information based on the first semantic tag, and then generate a second image sequence based on the first style information and the action information.
  • the first semantic tag is used to make the entire first style information explicit.
  • the first style information is determined directly based on the first semantic tag.
  • the first semantic label is used to make the explicit vector in the first style information explicit.
  • the first semantic label is first converted into an explicit vector and then fused with the implicit features of the first image sequence to obtain the first style information.
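The conversion of a semantic label into an explicit vector and its fusion with implicit features could be sketched as follows; the one-hot encoding, the label vocabulary, and the concatenation-based fusion are illustrative assumptions, not the patent's exact method:

```python
import numpy as np

# Hypothetical label vocabulary; real systems would use a learned embedding.
LABEL_VOCAB = {"happy": 0, "frustrated": 1, "excited": 2}

def label_to_explicit_vector(label: str) -> np.ndarray:
    """Encode a semantic label as an explicit (user-editable) style vector."""
    v = np.zeros(len(LABEL_VOCAB), dtype=float)
    v[LABEL_VOCAB[label]] = 1.0
    return v

def fuse_style(explicit: np.ndarray, implicit: np.ndarray) -> np.ndarray:
    """Combine the explicit vector with implicit features to form the
    first style information (here by simple concatenation)."""
    return np.concatenate([explicit, implicit])

implicit = np.zeros(8)  # placeholder for implicit features extracted from the image sequence
first_style = fuse_style(label_to_explicit_vector("frustrated"), implicit)
```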
  • the data processing device fuses the first style information with the action information to obtain a first motion feature, and acquires a second image sequence based on the first motion feature.
  • the fusion algorithm used by the data processing device to fuse the first style information and the action information into the first motion feature may include: Adaptive Instance Normalization (AdaIN), deep learning models, statistical methods, and other methods for aligning distributions.
  • the data processing device inputs the first motion feature into a decoder to obtain a second image sequence.
  • the training process of the decoder will be described later and will not be expanded here.
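The AdaIN-based fusion named above can be sketched as follows: the action (content) features are normalized per channel and then rescaled with the style features' statistics. This is the standard AdaIN formulation; the feature shapes are illustrative:

```python
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive Instance Normalization: shift/scale the content (action)
    features so their per-channel statistics match the style features."""
    c_mean = content.mean(axis=-1, keepdims=True)
    c_std = content.std(axis=-1, keepdims=True)
    s_mean = style.mean(axis=-1, keepdims=True)
    s_std = style.std(axis=-1, keepdims=True)
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

rng = np.random.default_rng(0)
action_feat = rng.normal(size=(4, 64))                   # action information (channels x time)
style_feat = rng.normal(loc=2.0, scale=0.5, size=(4, 64))  # first style information
fused = adain(action_feat, style_feat)                   # first motion feature, fed to the decoder
```

After fusion, the first motion feature would be passed to the decoder to produce the second image sequence.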
  • the first semantic tag is obtained based on the third image sequence.
  • the third image sequence is shown in FIG. 12.
  • the first style information is "frustrated".
  • the second image sequence obtained in this step is shown in FIG. 13.
  • the second image sequence is a "frustrated" walk.
  • the process from step 401 to step 403 may be as shown in FIG. 14.
  • the input end includes a third image sequence (e.g., 2D motion).
  • the 2D style information extraction module extracts the 2D stylized features of the third image sequence and converts them into 3D style information; it also makes the semantic label of the style explicit and returns it to the user for editing.
  • the user generates personalized requirements based on the semantic labels and their needs.
  • the NLP module parses these requirements and inputs them, together with the 3D style information, into the style editing module to generate an edited style information vector (i.e., the first style information).
  • the first image sequence is encoded into a feature expression characterizing its content; this is fused with the above-mentioned edited first style information and, after decoding, generates the image sequence of the 3D target animation (i.e., the second image sequence) that conforms to the user's editing information.
  • after acquiring the second image sequence, the data processing device renders the second image sequence to a virtual object to obtain an animation/video.
  • the generated animation is a 3D animation.
  • the generated animation is a 2D animation.
  • the data processing method provided in the embodiment of the present application is mainly applied to the style transfer scenario of image sequences.
  • the data processing method provided in the embodiment of the present application is mainly used in animation style transfer scenarios.
  • the style information and the action information are obtained by separation, and a second image sequence is generated based on the first style information and the action information.
  • This allows stylized animation editing to be performed without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
  • the style information is described by semantic tags, which make the style information explicit. Users edit the semantic tags to achieve style transfer, realizing a driving scheme for body movements. Users can thus analyze the style information quantitatively and qualitatively, and know clearly how to describe their needs in quantitative terms. In addition, by parsing user needs and leveraging the advantage that massive video collections can cover virtually any style, the embodiment of the present application can generate arbitrarily customized stylized digital human animations.
  • extracting style information from the video to which the second image sequence belongs can make up for the user's difficulty in describing a certain type of style information.
  • using tags makes the style information explicit.
  • another flowchart of the method provided by the embodiment of the present application is shown in FIG. 15.
  • a second image sequence is obtained from the style reference animation, and the stylized features of the second image sequence are extracted to obtain a second stylized feature.
  • the second stylized feature is then made explicit to obtain an explicit label.
  • the user edits the explicit label to obtain a first stylized feature.
  • the first stylized feature is then transferred to the original animation to obtain a stylized animation.
  • the content of the stylized animation is consistent with the original animation, and the style of the stylized animation is consistent with the style reference animation, thereby achieving stylized migration.
  • image sequence 1 and image sequence 2 are obtained. Among them, image sequence 1 has style 1 and action 1. Image sequence 2 has style 2 and action 2.
  • the style encoder and action content encoder are used to encode the style and motion content of the two input sequences respectively to decouple the style information and action information.
  • style information 1 and action information 2 are fused through a fusion algorithm (such as AdaIN), and style 1 action 2 is generated after decoding.
  • style information 2 and action information 1 are fused to generate style 2 action 1.
  • the discriminator supervises the reconstruction losses of the generated stylized animation in style and content respectively, so that the final generated stylized animation can have the greatest similarity with the target style without losing the original motion content.
  • the above process can be understood as: obtaining a first training image sequence and a second training image sequence, wherein the motion features of the first training image sequence and the second training image sequence are different, and the motion features include action information and/or style information.
  • the first training image sequence is input into a style encoder and a content encoder respectively to obtain first training style information and first training action information;
  • the second training image sequence is input into a style encoder and a content encoder respectively to obtain second training style information and second training action information.
  • the first training style information and the second training action information are fused to obtain a first training motion feature;
  • the second training style information and the first training action information are fused to obtain a second training motion feature.
  • the first training motion feature is input into a decoder to obtain a first reconstructed image sequence; the second training motion feature is input into a decoder to obtain a second reconstructed image sequence.
  • Training is performed with the goal of making the value of the first loss function less than a first threshold to obtain a trained style encoder, content encoder and decoder, the first loss function includes a style loss function and a content loss function, the style loss function is used to represent the style difference between the first reconstructed image sequence and the first training image sequence and the style difference between the second reconstructed image sequence and the second training image sequence, and the content loss function is used to represent the content difference between the first reconstructed image sequence and the second training image sequence and the content difference between the second reconstructed image sequence and the first training image sequence.
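The cross-reconstruction objective described above can be illustrated with toy stand-ins for the learned networks. The linear "encoders" and "decoder" below are purely illustrative (with them, swapping styles reconstructs exactly, so the loss is zero), and the adversarial discriminator is omitted:

```python
import numpy as np

def style_enc(x: np.ndarray) -> np.ndarray:
    """Toy style code: per-feature mean over time."""
    return x.mean(axis=0)

def content_enc(x: np.ndarray) -> np.ndarray:
    """Toy action code: per-frame deviation from the mean."""
    return x - x.mean(axis=0)

def decode(style: np.ndarray, content: np.ndarray) -> np.ndarray:
    """Toy decoder: recombine action code with a style code."""
    return content + style

def cross_recon_losses(seq1: np.ndarray, seq2: np.ndarray) -> float:
    """First loss function: style loss + content loss on style-swapped reconstructions."""
    s1, s2 = style_enc(seq1), style_enc(seq2)
    c1, c2 = content_enc(seq1), content_enc(seq2)
    recon1 = decode(s1, c2)  # style of sequence 1, action of sequence 2
    recon2 = decode(s2, c1)  # style of sequence 2, action of sequence 1
    style_loss = (np.abs(style_enc(recon1) - s1).mean()
                  + np.abs(style_enc(recon2) - s2).mean())
    content_loss = (np.abs(content_enc(recon1) - c2).mean()
                    + np.abs(content_enc(recon2) - c1).mean())
    return float(style_loss + content_loss)

rng = np.random.default_rng(1)
seq1 = rng.normal(size=(10, 6))           # training image sequence 1 (frames x features)
seq2 = rng.normal(loc=3.0, size=(10, 6))  # training image sequence 2, different style
loss = cross_recon_losses(seq1, seq2)
```

In actual training, the encoders and decoder are neural networks, and optimization continues until the first loss function falls below the first threshold.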
  • the style encoder, content encoder, and decoder obtained through training can extract 2D stylized features from the video sequence and map them to the 3D feature space to generate a 3D style consistent with its semantics, and make the 3D style information semantically explicit.
  • the user edits it according to the semantic expression of the style to generate a target style that meets their expectations, and the algorithm then generates the corresponding style information from the semantic label of the user's style.
  • the style transfer module is used to transfer the generated 3D target features to the original animation sequence to generate a target stylized virtual digital human animation sequence.
  • the third image sequence in the embodiment shown in FIG4 includes one or more of the following: a facial expression sequence, a limb image sequence.
  • limb movements include global limbs, local limbs (such as gestures, etc.), etc.
  • style transfer such as gestures and expressions.
  • the following takes voice-driven gestures as an example. The scenario in which the method is applied to gesture style transfer is shown in FIG17.
  • by inputting a piece of text or voice data, the virtual digital human is driven to make gestures with known semantics and a rhythm consistent with the voice data.
  • gesture style varies from speaker to speaker, and also with the different emotions of the same speaker, so personalized customization and transfer of style is of great significance to enriching the diversity of gestures.
  • the gesture style information that can cover almost any style is generated through the aforementioned stylized feature extraction module, and a style information database is generated offline.
  • the user specifies any personalized stylized label; the label is parsed and quantified, combined with the offline-generated style database to produce the edited style information, and the motion sequence generated by the voice-driven gesture module is stylized into the target style.
  • the scenario where this method is applied to expression style transfer is shown in FIG. 18.
  • This scenario can also be understood as a digital human expression base style editing and transfer scenario.
  • an expression base is defined as a predetermined set of coordinates of several key points on the face used to represent a neutral expression.
  • the original coefficients represent the parametric expression of a specific expression relative to the neutral expression, such as the degree of mouth opening when smiling relative to the neutral expression.
  • the whole process of FIG. 18 is as follows: first, the original coefficients corresponding to a person's expression are calculated from a preset expression base through an expression network; the coefficients corresponding to the various expressions in the video are obtained through the same set of expression bases, and the user controls the expression to be generated by editing the coefficients.
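The expression-base model described above is a linear blendshape model: the keypoints of an expression are the neutral keypoints plus a coefficient-weighted sum of basis offsets. A minimal sketch, with illustrative keypoint counts and basis values:

```python
import numpy as np

def apply_expression(neutral: np.ndarray,   # (K, 3) neutral-face keypoint coordinates
                     bases: np.ndarray,     # (N, K, 3) expression bases, as offsets from neutral
                     coeffs: np.ndarray) -> np.ndarray:  # (N,) coefficients edited by the user
    """Keypoints of an expression = neutral + sum_i coeff_i * basis_i."""
    return neutral + np.tensordot(coeffs, bases, axes=1)

# Illustrative data: 5 keypoints, 2 expression bases (e.g. "smile", "frown").
neutral = np.zeros((5, 3))
bases = np.stack([np.full((5, 3), 1.0), np.full((5, 3), -0.5)])
face = apply_expression(neutral, bases, np.array([0.6, 0.2]))
```

Editing the coefficient vector is what lets the user control the generated expression, as in the FIG. 18 flow.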
  • on the one hand, stylized features of gestures/expressions can be extracted from video sequences and converted into 3D style information, greatly enriching style diversity; on the other hand, the style of gestures/expressions extracted from the video is explicitly labeled, which facilitates semantic description of the style with the user and enables the subsequent matching and fusion of labels and style information.
  • An embodiment of the data processing device in the embodiment of the present application includes:
  • an acquisition unit 1901, used to acquire first style information;
  • the acquisition unit 1901 is further used to acquire the action information of the first image sequence;
  • the generating unit 1902 is configured to generate a second image sequence based on the first style information and the action information.
  • the second image sequence has the same action type as the first image sequence, and the second image sequence has the first style information.
  • the data processing device may also include: a display unit 1903, used to display a first interface to the user, the first interface including multiple semantic tags, the multiple semantic tags are used to describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information; an acquisition unit 1901, specifically used to determine a first semantic tag from multiple semantic tags based on a second operation of the user; and used to convert the first semantic tag into first style information.
  • the data processing device may further include: a rendering unit 1904, configured to render the second image sequence to a virtual object to obtain an animation.
  • the acquisition unit 1901 acquires the style information and the action information by separation, and the generation unit 1902 generates the second image sequence based on the first style information and the action information, so as to realize stylized animation editing without changing other features of the original image sequence, and improve the style transfer effect of the animation.
  • the data processing device may include a processor 2001, a memory 2002, and a communication port 2003.
  • the processor 2001, the memory 2002, and the communication port 2003 are interconnected via a line.
  • the memory 2002 stores program instructions and data.
  • the memory 2002 stores program instructions and data corresponding to the steps executed by the data processing device in the corresponding implementation modes shown in the aforementioned FIGS. 1 to 18 .
  • the processor 2001 is used to execute the steps performed by the data processing device shown in any of the embodiments shown in Figures 1 to 18 above.
  • the communication port 2003 can be used to receive and send data, and to execute the steps related to acquisition, sending, and receiving in any of the embodiments shown in Figures 1 to 18 above.
  • the data processing device may include more or fewer components than those in FIG. 20 , and this application is merely an illustrative description and is not intended to be limiting.
  • An embodiment of the present application further provides a computer-readable storage medium storing one or more computer-executable instructions.
  • when the instructions are executed by a processor, the processor performs the method described in the possible implementation manners of the data processing device in the aforementioned embodiments.
  • An embodiment of the present application also provides a computer program product (or computer program) storing one or more computer-executable instructions.
  • when the instructions are executed by a processor, the processor performs the method of the possible implementation manners of the above-mentioned data processing device.
  • the embodiment of the present application also provides a chip system, which includes at least one processor for supporting a terminal device to implement the functions involved in the possible implementation of the above-mentioned data processing device.
  • the chip system also includes an interface circuit, which provides program instructions and/or data for the at least one processor.
  • the chip system may also include a memory, which is used to store the necessary program instructions and data for the terminal device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

Provided in the present application is a data processing method, which can be applied to a scenario such as animation style transfer. The method comprises: acquiring first style information; extracting action information of a first image sequence; and generating a second image sequence on the basis of the first style information and the action information, wherein the action type of the second image sequence is the same as that of the first image sequence, and the second image sequence has the first style information. Style information and action information are separated and acquired, and a second image sequence is generated on the basis of the first style information and the action information, such that stylized animation editing is performed without changing other features of an original image sequence, thereby improving the effect of animation style transfer.

Description

A data processing method and related device

This application claims priority to Chinese Patent Application No. 202211202267.X, filed with the China National Intellectual Property Administration on September 29, 2022 and entitled "A data processing method and related device", which is incorporated herein by reference in its entirety.

Technical Field

The present application relates to the field of computer technology, and in particular to a data processing method and related device.
Background

With the introduction of the concept of the metaverse, "virtual digital humans" are seen as the medium through which humans will enter the metaverse in the future, and have become a focus of public attention. In addition, the Beijing Winter Olympics became a showcase for current virtual digital human technology: whether the virtual avatars of sports and entertainment stars or public-service-oriented virtual anchors (such as artificial intelligence sign language anchors), they have given the public a more intuitive and deeper understanding of virtual digital humans. As driving technology matures, virtual digital humans will surely be more widely used in practical, monetizable scenarios such as virtual customer service, virtual shopping guides, and virtual tour guides.

At present, there are several mainstream methods for driving virtual digital humans to imitate human behavior: pure manual modeling and motion capture modeling. Pure manual modeling is widely used for hyper-realistic virtual humans or celebrity virtual humans, but the manual production cycle is long and the cost is very high. Motion capture modeling completes the driving by collecting model data with external scanning equipment; compared with pure manual modeling, the time and cost are much lower, and it is commonly used in pan-entertainment industries such as film, television, and live streaming, but it requires the participation of real actors and cannot improve production efficiency.

Therefore, how to transfer different styles between animation actions is a technical problem that urgently needs to be solved.
Summary

The embodiments of the present application provide a data processing method and related device, used to perform stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.

The first aspect of the embodiments of the present application provides a data processing method, which can be applied to scenarios such as animation style transfer. The method can be executed by a data processing device, or by a component of the data processing device (such as a processor, a chip, or a chip system). The method includes: obtaining first style information; obtaining action information of a first image sequence; and generating a second image sequence based on the first style information and the action information, where the second image sequence has the same action type as the first image sequence and the second image sequence has the first style information. The above style information can be understood as a style description of the image sequence, where the style includes one or more of the following: limb/facial contour, limb/facial proportion, limb movement amplitude, emotion, personality, etc. The action type is used to describe the action of the image sequence, for example, running, jumping, or walking. The action information can be understood as a low-level vector used to represent the action type. It can be understood that the action vectors corresponding to image sequences of the same action type may differ.

In the embodiments of the present application, the style information and the action information are obtained separately, and the second image sequence is generated based on the first style information and the action information, so as to perform stylized animation editing without changing other features of the original image sequence, thereby improving the style transfer effect of the animation.
Optionally, in a possible implementation of the first aspect, before the step of obtaining the first style information, the method further includes: obtaining a third image sequence; and obtaining the first style information includes: obtaining the first style information based on the third image sequence.

In this possible implementation, obtaining the first style information through another, third image sequence can make up for the user's difficulty in describing a certain type of style information.

Optionally, in a possible implementation of the first aspect, the step of obtaining the first style information based on the third image sequence includes: extracting second style information of the third image sequence; and determining the first style information based on the second style information.

In this possible implementation, the style information of the third image sequence is directly used as the style information to be subsequently transferred to the first image sequence, so that the style of the generated second image sequence is similar or identical to that of the third image sequence, achieving accurate style transfer.

Optionally, in a possible implementation of the first aspect, the step of determining the first style information based on the second style information includes: using the second style information as the first style information.

In this possible implementation, the style information of the third image sequence is directly used as the style information to be subsequently transferred to the first image sequence, so that the style of the generated second image sequence is similar or identical to that of the third image sequence, making up for the user's difficulty in describing a certain type of style information and achieving accurate style transfer.

Optionally, in a possible implementation of the first aspect, the step of determining the first style information based on the second style information includes: displaying a second semantic tag to the user, the second semantic tag being used to describe the second style information; modifying the second semantic tag to a first semantic tag based on a first operation of the user, the first semantic tag being used to describe the first style information; and determining the first style information based on the first semantic tag.

In this possible implementation, on the basis of the third image sequence, the user modifies the semantic tag through an operation, so as to describe the style information and meet the user's needs, ensuring that the subsequently generated second image sequence satisfies the user's style requirements for the image sequence. In other words, using tags to make style information explicit allows the user to analyze the style information quantitatively and qualitatively, and thus to know clearly how to describe their needs in quantitative terms. In addition, by parsing user needs and leveraging the advantage that videos can cover any style, the embodiments of the present application can generate arbitrarily customized stylized digital human animations.

Optionally, in a possible implementation of the first aspect, the third image sequence is an image sequence of a two-dimensional animation, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are image sequences of a three-dimensional animation.

In this possible implementation, the stock of 2D videos is large enough that any style information of a 2D video can be transferred to the original 3D video to obtain the target 3D video.
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:向用户显示第一界面,第一界面包括多个语义标签,多个语义标签用于描述不同图像序列的不同风格信息,多个语义标签与风格信息一一对应;获取第一风格信息,包括:基于用户的第二操作从多个语义标签中确定第一语义标签;基于第一语义标签确定第一风格信息。Optionally, in a possible implementation of the first aspect, the above steps also include: displaying a first interface to a user, the first interface including multiple semantic tags, the multiple semantic tags being used to describe different style information of different image sequences, and the multiple semantic tags corresponding one-to-one to the style information; obtaining the first style information, including: determining a first semantic tag from the multiple semantic tags based on a second operation of the user; and determining the first style information based on the first semantic tag.
该种可能的实现方式中,该种可能的实现方式中,可以理解为离线完成从视频中提取任意风格,并生成特征库。用户只需上传要求的个性化风格的语义标签,进而实现自动从特征库中识别标签对应的风格信息。In this possible implementation, it can be understood that any style is extracted from the video offline and a feature library is generated. The user only needs to upload the semantic label of the required personalized style, and then the style information corresponding to the label is automatically identified from the feature library.
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一风格信息与动作信息生成第二图像序列,包括:融合第一风格信息与动作信息以得到第一运动特征;基于第一运动特征获取第二图像序列。Optionally, in a possible implementation manner of the first aspect, the above step of: generating a second image sequence based on the first style information and the action information includes: fusing the first style information and the action information to obtain a first motion feature; and acquiring the second image sequence based on the first motion feature.
该种可能的实现方式中,将第一语义标签表示的第一风格信息与原图像序列的动作信息进行融合以得到第一运动特征。因此,基于该第一运动特征获取的第二图像序列,在不改变原图像序列其他特征的情况下实现风格迁移。In this possible implementation, the first style information represented by the first semantic tag is fused with the motion information of the original image sequence to obtain the first motion feature. Therefore, the second image sequence obtained based on the first motion feature realizes style transfer without changing other features of the original image sequence.
可选地,在第一方面的一种可能的实现方式中,上述动作信息包括以下一项或多项:面部表情序列、肢体图像序列。Optionally, in a possible implementation manner of the first aspect, the above-mentioned action information includes one or more of the following: a facial expression sequence, a limb image sequence.
该种可能的实现方式中,该种可能的实现方式中,该方法不仅可以应用于肢体动作的风格迁移,还可以应用于面部表情的风格迁移等,适用场景广泛。In this possible implementation, this method can be applied not only to the style transfer of body movements, but also to the style transfer of facial expressions, etc., and has a wide range of applicable scenarios.
Optionally, in a possible implementation of the first aspect, the above steps further include: rendering the second image sequence onto a virtual object to obtain an animation.
This possible implementation is applicable to style transfer scenarios from 2D animation to 2D animation, from 2D animation to 3D animation, or from 3D animation to 3D animation.
Optionally, in a possible implementation of the first aspect, the style information of the image sequence includes explicit style information and implicit style information, and the second semantic tag is specifically used to associate with the explicit style information in the second style information.
In this possible implementation, the style information is decomposed into explicit and implicit parts, enabling the user to edit the explicit style information. The edited explicit style information is then combined with the implicit style information to generate the modified style information.
Optionally, in a possible implementation of the first aspect, the above step of extracting the action information of the first image sequence includes: inputting the first image sequence into a content encoder to obtain the action information; and extracting the second style information of the third image sequence includes: inputting the third image sequence into a style encoder to obtain the second style information.
Optionally, in a possible implementation of the first aspect, the above steps further include: acquiring a first training image sequence and a second training image sequence, where the first training image sequence and the second training image sequence have different motion features, the motion features including action information and/or style information; inputting the first training image sequence into a style encoder and a content encoder respectively to obtain first training style information and first training action information; inputting the second training image sequence into the style encoder and the content encoder respectively to obtain second training style information and second training action information; fusing the first training style information with the second training action information to obtain a first training motion feature; fusing the second training style information with the first training action information to obtain a second training motion feature; inputting the first training motion feature into a decoder to obtain a first reconstructed image sequence; inputting the second training motion feature into the decoder to obtain a second reconstructed image sequence; and training with the goal of making the value of a first loss function less than a first threshold, to obtain the trained style encoder, content encoder, and decoder. The first loss function includes a style loss function and a content loss function: the style loss function represents the style difference between the first reconstructed image sequence and the first training image sequence and the style difference between the second reconstructed image sequence and the second training image sequence, and the content loss function represents the content difference between the first reconstructed image sequence and the second training image sequence and the content difference between the second reconstructed image sequence and the first training image sequence.
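The swap-and-reconstruct training above can be sketched with toy stand-ins. The linear "encoders" and "decoder" below are illustrative assumptions, not the trained networks of the application, but they show how the style and content losses are wired to the crossed reconstructions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the networks: style = sequence-level mean, content =
# per-frame deviation, decoder = content plus broadcast style.
def style_encoder(seq):
    return seq.mean(axis=0)

def content_encoder(seq):
    return seq - seq.mean(axis=0)

def decoder(content, style):
    return content + style

# Two training sequences with different motion features: 16 frames, 8 channels.
seq1 = rng.normal(size=(16, 8))
seq2 = rng.normal(size=(16, 8)) + 2.0

s1, c1 = style_encoder(seq1), content_encoder(seq1)
s2, c2 = style_encoder(seq2), content_encoder(seq2)

# Cross the codes: style of seq1 with content of seq2, and vice versa.
recon1 = decoder(c2, s1)   # first reconstructed image sequence
recon2 = decoder(c1, s2)   # second reconstructed image sequence

# Style loss: each reconstruction should keep its donated style.
style_loss = (np.abs(style_encoder(recon1) - s1).mean()
              + np.abs(style_encoder(recon2) - s2).mean())
# Content loss: each reconstruction should keep its donated content.
content_loss = (np.abs(content_encoder(recon1) - c2).mean()
                + np.abs(content_encoder(recon2) - c1).mean())
first_loss = style_loss + content_loss
print(round(first_loss, 6))  # → 0.0 here, since the toy codes decouple exactly
```

With real neural encoders the first loss would start nonzero and be driven below the first threshold by gradient descent.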
In this possible implementation, the above training process ensures the accuracy of the style transfer.
A second aspect of the embodiments of the present application provides a data processing device. The data processing device includes: an acquisition unit, configured to acquire first style information, where the acquisition unit is further configured to acquire action information of a first image sequence; and a generation unit, configured to generate a second image sequence based on the first style information and the action information, where the second image sequence has the same action type as the first image sequence and the second image sequence has the first style information.

Optionally, in a possible implementation of the second aspect, the acquisition unit is further configured to acquire a third image sequence, and the acquisition unit is specifically configured to acquire the first style information based on the third image sequence.

Optionally, in a possible implementation of the second aspect, the acquisition unit is specifically configured to extract second style information of the third image sequence, and the acquisition unit is specifically configured to determine the first style information based on the second style information.

Optionally, in a possible implementation of the second aspect, the acquisition unit is specifically configured to use the second style information as the first style information.

Optionally, in a possible implementation of the second aspect, the acquisition unit is specifically configured to display a second semantic tag to the user, the second semantic tag being used to describe the second style information; the acquisition unit is specifically configured to modify the second semantic tag into a first semantic tag based on a first operation of the user, the first semantic tag being used to describe the first style information; and the acquisition unit is specifically configured to determine the first style information based on the first semantic tag.

Optionally, in a possible implementation of the second aspect, the third image sequence is an image sequence of a two-dimensional animation, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
Optionally, in a possible implementation of the second aspect, the data processing device further includes: a display unit, configured to display a first interface to the user, the first interface including multiple semantic tags, the multiple semantic tags being used to describe different style information of different image sequences, with the semantic tags corresponding one-to-one to the style information; the acquisition unit is specifically configured to determine a first semantic tag from the multiple semantic tags based on a second operation of the user; and the acquisition unit is specifically configured to determine the first style information based on the first semantic tag.

Optionally, in a possible implementation of the second aspect, the generation unit is specifically configured to fuse the first style information and the action information to obtain a first motion feature, and the generation unit is specifically configured to acquire the second image sequence based on the first motion feature.

Optionally, in a possible implementation of the second aspect, the action information includes one or more of the following: a facial expression sequence, a limb image sequence.

Optionally, in a possible implementation of the second aspect, the data processing device further includes: a rendering unit, configured to render the second image sequence onto a virtual object to obtain an animation.
A third aspect of the present application provides a data processing device, including a processor coupled to a memory, where the memory is used to store a program or instructions; when the program or instructions are executed by the processor, the data processing device is caused to implement the method in the above first aspect or any possible implementation of the first aspect.

A fourth aspect of the present application provides a computer-readable medium on which a computer program or instructions are stored; when the computer program or instructions are run on a computer, the computer is caused to execute the method in the aforementioned first aspect or any possible implementation of the first aspect.

A fifth aspect of the present application provides a computer program product which, when executed on a computer, causes the computer to execute the method in the aforementioned first aspect or any possible implementation of the first aspect.

A sixth aspect of the embodiments of the present application provides a chip system, including at least one processor, configured to support a data processing device in implementing the functions involved in the above first aspect or any possible implementation of the first aspect.
In a possible design, the chip system may further include a memory for storing the program instructions and data necessary for the data processing device. The chip system may consist of a chip, or may include a chip and other discrete devices. Optionally, the chip system further includes an interface circuit that provides program instructions and/or data to the at least one processor.
For the technical effects brought about by the second, third, fourth, fifth, and sixth aspects or any of their possible implementations, refer to the technical effects of the first aspect or its different possible implementations; details are not repeated here.

As can be seen from the above technical solutions, the present application has the following advantages: the style information and the action information are obtained separately, and the second image sequence is generated based on the first style information and the action information. This enables stylized animation editing without changing the other features of the original image sequence, improving the style transfer effect of the animation.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system architecture provided in an embodiment of the present application;
FIG. 3A is a schematic diagram of a deployment scenario provided in an embodiment of the present application;
FIG. 3B is a schematic diagram of another deployment scenario provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 5A is a schematic diagram of decomposing style information into explicit features according to an embodiment of the present application;
FIG. 5B is a schematic diagram of the training process of a conversion module provided in an embodiment of the present application;
FIG. 6A is another schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 6B is a schematic flowchart of a user modifying a tag according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user;
FIG. 8 is another schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user;
FIG. 9 is another schematic diagram of a data processing device provided in an embodiment of the present application displaying a user interface to a user;
FIG. 10 is another schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 11 is an example diagram of a first image sequence provided in an embodiment of the present application;
FIG. 12 is an example diagram of a third image sequence provided in an embodiment of the present application;
FIG. 13 is an example diagram of a second image sequence provided in an embodiment of the present application;
FIG. 14 is another schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 15 is another schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 16 is a schematic diagram of the training process of an encoder and a decoder provided in an embodiment of the present application;
FIG. 17 is a schematic flowchart of the method provided in an embodiment of the present application applied to a gesture style transfer scenario;
FIG. 18 is a schematic flowchart of the method provided in an embodiment of the present application applied to an expression style transfer scenario;
FIG. 19 is a schematic structural diagram of a data processing device provided in an embodiment of the present application;
FIG. 20 is another schematic structural diagram of a data processing device provided in an embodiment of the present application.
DETAILED DESCRIPTION OF EMBODIMENTS
The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.

For ease of understanding, the related terms and concepts mainly involved in the embodiments of the present application are first introduced below.
1. Neural network

A neural network may be composed of neural units. A neural unit may be an operation unit that takes inputs x_s and an intercept b, and the output of the operation unit may be:

h = f(∑_{s=1}^{n} W_s · x_s + b)

where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, where the local receptive field may be a region composed of several neural units.
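A single neural unit as defined above can be sketched in a few lines; the weights and inputs are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    # Example activation function f introducing nonlinearity.
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, ws, b):
    # Output = f(sum over s of W_s * x_s + b).
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return sigmoid(z)

print(neural_unit([1.0, 2.0], [0.5, -0.25], 0.0))  # sigmoid(0.0) → 0.5
```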
2. Loss function

In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is actually to be predicted, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network is then updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make it predict lower, and the adjustment continues until the neural network can predict the actually desired target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value"; this is the loss function or objective function, an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training a deep neural network becomes a process of minimizing this loss as much as possible.
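The "keep adjusting until the loss shrinks" process above can be sketched with a mean-squared-error loss and plain gradient descent on a one-parameter model; all values are illustrative.

```python
# One-parameter model pred = w * x trained toward the target mapping y = 2x.
def mse_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

xs = [1.0, 2.0, 3.0]
target = [2.0, 4.0, 6.0]
w = 0.0        # initialization before the first update
lr = 0.05
for _ in range(200):
    # dL/dw = 2/N * sum((w*x - t) * x): pushes w down when predictions
    # are too high and up when they are too low.
    grad = 2.0 / len(xs) * sum((w * x - t) * x for x, t in zip(xs, target))
    w -= lr * grad
print(round(w, 3))  # → 2.0, the weight that minimizes the loss
```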
3. Generative adversarial network

A generative adversarial network (GAN) is a deep learning model. A generative adversarial network includes at least one generative network (generator) and one discriminative network (discriminator); by letting the two neural networks learn by playing a game against each other, better outputs are produced. The two neural networks may be deep neural networks or convolutional neural networks. The basic principle of a GAN is as follows, taking a GAN that generates pictures as an example. Suppose there are two networks, G (generator) and D (discriminator). G is a network that generates pictures: it takes random samples from a latent space as input and generates a picture, denoted G(z). D is a discriminative network used to judge whether a picture is "real". Its input parameter is x, where x represents a picture: either a real picture or the output of the generative network. The output D(x) represents the probability that x is a real picture; 1 means the picture is certainly real, and 0 means it cannot be real. In the process of training the generative adversarial network, the goal of the generative network G is to generate pictures as realistic as possible to deceive the discriminative network D, with outputs that imitate the real samples in the training set as closely as possible, while the goal of the discriminative network D is to distinguish the pictures generated by G from real pictures as well as possible. The two networks confront each other and continuously adjust their parameters. In this way, G and D form a dynamic "game" process, which is the "adversarial" part of "generative adversarial network"; the ultimate goal is to make the discriminative network unable to judge whether the output of the generative network is real. In the ideal outcome of this game, G can generate pictures G(z) that pass for real, while D has difficulty judging whether the pictures generated by G are real, that is, D(G(z)) = 0.5. An excellent generative model G is thus obtained, which can be used to generate pictures.
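The adversarial objective described above can be sketched as follows; the scalar "discriminator" is a toy stand-in for a real network, used only to show how the two opposing losses are computed.

```python
import math

def d_out(x, theta_d):
    # Toy discriminator: D(x) = probability that sample x is real.
    return 1.0 / (1.0 + math.exp(-theta_d * x))

def discriminator_loss(real, fake, theta_d):
    # D maximizes log D(x) + log(1 - D(G(z))); we minimize the negative.
    return -(sum(math.log(d_out(x, theta_d)) for x in real)
             + sum(math.log(1.0 - d_out(x, theta_d)) for x in fake)) / len(real)

def generator_loss(fake, theta_d):
    # Non-saturating generator objective: maximize log D(G(z)).
    return -sum(math.log(d_out(x, theta_d)) for x in fake) / len(fake)

real = [1.0, 1.2, 0.8]      # samples the discriminator should score near 1
fake = [-1.0, -0.9, -1.1]   # early generator outputs score near 0
print(discriminator_loss(real, fake, theta_d=2.0))
print(generator_loss(fake, theta_d=2.0))
```

At the ideal equilibrium D(G(z)) = 0.5, the generator loss settles at -log 0.5 per sample, matching the text above.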
4. Animation

Animation refers to virtually created video content, including animated videos displayed on a 2D plane and 3D animated content displayed on 3D display devices such as augmented reality (AR), virtual reality (VR), and holographic displays. Its style is not limited to a cartoon style and also includes realistic styles, such as digital human animation and special-effects film and television.
5. Virtual digital human

A virtual digital human is a virtual character with a digital appearance. Unlike a robot with a physical body, a virtual digital human relies on a display device to exist, such as a mobile phone, a computer, or a smart large screen. A complete virtual digital human generally needs the following three capabilities:

First, it has a human appearance, with specific character features such as looks, gender, and personality.

Second, it has human behavior, with the ability to express itself through language, facial expressions, and body movements.

Third, it has human thought, with the ability to recognize the external environment and to communicate and interact with people.
6. Image sequence

An image sequence can be understood as multiple images with a temporal relationship; of course, it may also be an image sequence obtained from a video. The image sequence may include a limb image sequence and/or a facial expression sequence, etc. In addition, the image sequence may be an image sequence of the whole body, an image sequence of part of the body (also called a local limb), or the facial expression sequence of the character corresponding to the image sequence, which is not specifically limited here.
7. Style information

The style information involved in the embodiments of the present application may be a style feature vector obtained by passing an image sequence through a style encoder, an explicit vector within the style feature vector, or some features of the explicit vector within the style feature vector, which is not specifically limited here. In addition, the tag corresponding to the style information can also be understood as a style description of the image sequence. For example, the style includes one or more of the following: limb/facial contour, limb/facial proportion, range of limb movement, emotion, personality, and so on. The emotions mentioned above may include: happy, depressed, excited, and the like. The personality may include: lively, kind, feminine, mean, and the like.
8. Action information

The action information involved in the embodiments of the present application may be a feature vector obtained by passing an image sequence through a content encoder.
9. Action type

The action type is used to describe the action of an image sequence, that is, the action depicted by the image sequence (for example: running, jumping, squatting, walking, raising the head, lowering the head, closing the eyes). It should be understood that image sequences of the same action type may correspond to different action vectors.
10. Semantic tag

A semantic tag is used to describe the style information of an image sequence; it can be understood as a means of making the style of the image sequence concrete.

Style information corresponds one-to-one to semantic tags. A semantic tag may differ depending on the style information. A semantic tag can be understood as a description of the style information that makes it easier for the user to understand or edit the style of the image sequence.

For example, the style information is the style feature vector of an image sequence. A semantic tag expresses that style feature vector explicitly; through the semantic tag, the user can identify the style of the image sequence/video (for example, the emotion or personality of a character expressed by the character's body movements in the video), which facilitates operations such as style editing/transfer.
At present, there are three mainstream approaches to driving a virtual digital human to imitate human behavior: purely manual modeling, motion-capture modeling, and artificial-intelligence modeling. Purely manual modeling is widely used for hyper-realistic virtual humans or celebrity virtual humans, but the manual production cycle is long and the cost is very high. Motion-capture modeling drives the model by collecting model data with external scanning devices; compared with purely manual modeling, its time and cost are much lower, and it is commonly used in pan-entertainment industries such as film, television, and live streaming, but it requires real actors and therefore cannot improve production efficiency. The artificial-intelligence-driven approach relies on algorithms and machine learning: the premise for a machine to automatically generate a virtual digital human is to obtain sufficient data, analyze a large number of photos/videos, and extract various human data and information to drive the virtual digital human to imitate human behavior. In the above artificial-intelligence modeling approach, different styles are often transferred between animation actions to reduce the motion-capture and driving costs of virtual digital human actions.

The generation and editing of stylized human animation is an important topic in the field of computer animation. By transferring different styles between animations of the same kind, arbitrary stylization of animation can be achieved, reducing the cost of motion capture and driving. However, several key problems remain to be solved. First, stylized animation editing requires giving the animation a specified style while changing its other features as little as possible; how to properly decouple style information from animation action information is therefore an important problem. Second, how to obtain style data at low cost: video is a major data source, but how to explicitly mark the semantic tag features of styles in massive video data, so that users can complete editing and style transfer merely by describing the style semantically, is also an important problem.

To this end, in view of the defect that existing virtual digital human animation driving methods cannot achieve arbitrary stylization, the embodiments of the present application propose a body-movement driving solution based on video style extraction together with explicit marking and editing of style information, aiming to fill the gap of AI user-personalized animation driving in pan-entertainment scenarios. In addition, extracting styles from video can compensate for the difficulty users have in describing certain kinds of styles.
Before introducing the data processing method and related devices of the embodiments of the present application with reference to the accompanying drawings, the system architecture provided by the embodiments of the present application is first described.
参见附图1,本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采集设备160用于采集训练数据,本申请实施例中训练数据包括:第一训练图像序列与第二训练图像序列。并将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。下面将更详细地描述训练设备120如何基于训练数据得到目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的数据处理方法。本申请实施例中的目标模型/规则101具体可以包括风格编码器、内容编码器以及解码器。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。Referring to FIG. 1 , an embodiment of the present invention provides a system architecture 100. As shown in the system architecture 100, the data acquisition device 160 is used to collect training data. In the embodiment of the present application, the training data includes: a first training image sequence and a second training image sequence. The training data is stored in the database 130, and the training device 120 obtains the target model/rule 101 based on the training data maintained in the database 130. The following will describe in more detail how the training device 120 obtains the target model/rule 101 based on the training data, and the target model/rule 101 can be used to implement the data processing method provided in the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may specifically include a style encoder, a content encoder, and a decoder. It should be noted that in actual applications, the training data maintained in the database 130 may not all come from the collection of the data acquisition device 160, but may also be received from other devices. It should also be noted that the training device 120 may not necessarily train the target model/rule 101 based entirely on the training data maintained in the database 130, and may also obtain training data from the cloud or other places for model training. The above description should not be used as a limitation on the embodiment of the present application.
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。在附图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以 包括:第一图像序列与第一语义标签;可选地,输入数据还可以包括第一图像序列与第二图像序列等。当然,输入数据也可以是二维动画(例如,二维动画是第二图像序列所属的动画)与三维动画(例如,三维动画是第一图像序列所属的动画)。另外该输入数据可以是用户输入的,也可以是用户通过拍摄设备上传的,当然还可以来自数据库,具体此处不做限定。The target model/rule 101 obtained by training the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1 . The execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an AR/VR, a vehicle terminal, etc., or a server or a cloud. In FIG. 1 , the execution device 110 is configured with an I/O interface 112 for data interaction with an external device. The user can input data to the I/O interface 112 through the client device 140. The input data can be Including: a first image sequence and a first semantic label; optionally, the input data may also include the first image sequence and the second image sequence, etc. Of course, the input data may also be a two-dimensional animation (for example, the two-dimensional animation is the animation to which the second image sequence belongs) and a three-dimensional animation (for example, the three-dimensional animation is the animation to which the first image sequence belongs). In addition, the input data may be input by a user, or uploaded by a user through a shooting device, or may come from a database, which is not specifically limited here.
The preprocessing module 113 is configured to preprocess the input data received by the I/O interface 112 (for example, the first image sequence and the first semantic label, or the first image sequence and the second image sequence, or a two-dimensional animation and a three-dimensional animation), for example, converting two-dimensional features into three-dimensional features.
When the execution device 110 preprocesses the input data, or when the computing module 111 of the execution device 110 performs related processing such as extracting action information of the first image sequence and generating the second image sequence based on the action information and the first semantic label, the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may also store the second image sequence, instructions, and the like obtained by the corresponding processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the second image sequence obtained as described above, or the three-dimensional animation corresponding to the second image sequence, to the client device 140 so as to provide it to the user.
It is worth noting that the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals, or different tasks. The corresponding target models/rules 101 can then be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results.
In the case shown in FIG. 1, the user can manually specify the input data, and this manual specification can be operated through the interface provided by the I/O interface 112. In another case, the client device 140 can automatically send input data to the I/O interface 112. If requiring the client device 140 to automatically send input data needs the user's authorization, the user can set the corresponding permission in the client device 140. The user can view the result output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, or the like. The client device 140 can also serve as a data collection terminal, collecting the input data input to the I/O interface 112 and the output result of the I/O interface 112 as shown in the figure as new sample data, and storing them into the database 130. Of course, the collection may also bypass the client device 140; instead, the I/O interface 112 directly stores the input data input to the I/O interface 112 and the output result of the I/O interface 112 as shown in the figure into the database 130 as new sample data.
It is worth noting that FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 1, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
As shown in FIG. 1, the target model/rule 101 is obtained through training by the training device 120. In this embodiment of the present application, the target model/rule 101 may include a style encoder, a content encoder, a decoder, and the like.
The following describes a chip hardware structure provided by an embodiment of the present application.
FIG. 2 shows a chip hardware structure provided by an embodiment of the present invention; the chip includes a neural network processor 20. The chip can be provided in the execution device 110 shown in FIG. 1 to complete the computation work of the computing module 111. The chip can also be provided in the training device 120 shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101.
The neural network processor 20 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or any other processor suitable for large-scale exclusive-OR operation processing. Taking an NPU as an example: the neural network processor 20 is mounted as a coprocessor on a host central processing unit (CPU), and the host CPU assigns tasks. The core part of the NPU is the operation circuit 203; the controller 204 controls the operation circuit 203 to fetch data from a memory (a weight memory or an input memory) and perform operations.
In some implementations, the operation circuit 203 internally includes multiple processing engines (PEs). In some implementations, the operation circuit 203 is a two-dimensional systolic array. The operation circuit 203 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 203 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory 202 and caches it on each PE in the operation circuit. The operation circuit fetches the matrix A data from the input memory 201 and performs a matrix operation with matrix B; partial or final results of the resulting matrix are stored in the accumulator 208.
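As an illustration only (the embodiment describes hardware, not software), the accumulate-as-you-go matrix computation described above can be sketched in plain Python; the function name and shapes are hypothetical:

```python
def matmul_accumulate(A, B):
    """Sketch of the operation circuit's behavior: for each output element
    C[i][j], the partial products A[i][t] * B[t][j] are summed in a local
    accumulator, mirroring how partial results are collected in the
    accumulator 208 before being written out."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0  # plays the role of the accumulator
            for t in range(k):
                acc += A[i][t] * B[t][j]
            C[i][j] = acc
    return C

A = [[1, 2], [3, 4]]   # input matrix A (from input memory 201)
B = [[5, 6], [7, 8]]   # weight matrix B (from weight memory 202)
print(matmul_accumulate(A, B))  # [[19, 22], [43, 50]]
```

In a real systolic array the inner accumulation is pipelined across PEs rather than executed as a loop; the arithmetic result is the same.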
The vector calculation unit 207 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and so on. For example, the vector calculation unit 207 can be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 207 stores the processed output vector into the unified buffer 206. For example, the vector calculation unit 207 can apply a nonlinear function to the output of the operation circuit 203, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 207 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 203, for example, for use in a subsequent layer of the neural network.
The unified memory 206 is used to store input data and output data.
A direct memory access controller (DMAC) 205 transfers input data in an external memory to the input memory 201 and/or the unified memory 206, stores weight data in the external memory into the weight memory 202, and stores data in the unified memory 206 into the external memory.
A bus interface unit (BIU) 210 is used to implement interaction among the host CPU, the DMAC, and the instruction fetch buffer 209 through a bus.
An instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204.
The controller 204 is used to invoke the instructions cached in the instruction fetch buffer 209 to control the working process of the operation accelerator.
Generally, the unified memory 206, the input memory 201, the weight memory 202, and the instruction fetch buffer 209 are all on-chip memories, and the external memory is a memory outside the NPU. The external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
Several deployment scenarios provided by the embodiments of the present application are described below. The arbitrary-style editable 3D animation generation solution provided by the embodiments of the present application can be applied to 2B scenarios such as digital hosts, and to 2C scenarios such as digital companions and assistant software. There are many specific deployment solutions, which are described below by example.
A deployment scenario provided by an embodiment of the present application is shown in FIG. 3A. A user uploads an animation video representing a target style on the client. The server extracts the target style from the video and returns the semantic label of the style to the user. The user can then describe, edit, or select the style based on its semantic label; for example, for a style whose semantic label is "excited", the user may want the degree to be slightly weaker. The label editing and uploading are completed on the client. After receiving the request, the server reduces the weight of the style's semantic label according to the target so as to lessen the degree of excitement, thereby editing the style information, generating a target animation that matches the user's label, and returning it to the client for rendering and display.
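A minimal sketch of the weight-reduction step above, under the assumption (not stated in the embodiment) that style information is held as a mapping from semantic labels to scalar weights; the names and the 0.5 scale factor are illustrative:

```python
def edit_style(style_weights, label, scale):
    """Scale the weight of one semantic label in the style information,
    e.g. halving the weight of 'excited' to weaken that aspect of the
    style. Labels not mentioned are left unchanged."""
    edited = dict(style_weights)  # keep the original intact
    if label in edited:
        edited[label] = edited[label] * scale
    return edited

style = {"excited": 0.8, "diverse": 0.4}
print(edit_style(style, "excited", 0.5))  # {'excited': 0.4, 'diverse': 0.4}
```

The edited weights would then condition the animation generator so that the output matches the user's adjusted label.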
Another deployment scenario provided by an embodiment of the present application is shown in FIG. 3B. Compared with the deployment solution of FIG. 3A, in this solution the server extracts arbitrary styles from videos offline and generates a feature library. The user only needs to upload the semantic labels of the required personalized style, such as adding a little feminine style on top of the excited style. After receiving the request, the server automatically identifies the style information corresponding to the "excited" and "feminine" labels from the style information library, edits this feature to generate style information matching the semantic labels of the target style, and completes rendering and display.
It is understandable that the above two deployment scenarios are just examples. In actual applications, there may be other forms of deployment scenarios, which are not specifically limited here.
In addition, the style involved in the above deployment scenarios may be a two-dimensional style or a three-dimensional style. In other words, the method provided in the embodiments of the present application can be applied to a scenario where a two-dimensional style is transferred to a two-dimensional image sequence, a scenario where a three-dimensional style is transferred to a three-dimensional image sequence, a scenario where a two-dimensional style is transferred to a three-dimensional image sequence, a scenario where a three-dimensional style is transferred to a two-dimensional image sequence, and so on; this is not specifically limited here.
The data processing method provided in the embodiments of the present application is described in detail below with reference to the accompanying drawings.
Referring to FIG. 4, an embodiment of the data processing method provided in the embodiments of the present application may be executed by a data processing device (a terminal device/cloud server), or by a component of a data processing device (for example, a processor, a chip, or a chip system). The method includes steps 401 to 403. The method can be applied to style transfer scenarios between animations such as children's educational animations, short-video animations, promotional animations, variety-show animations, and film and television pre-visualization animations.
Step 401: obtain first style information.
In one possible implementation, the style information refers to a style feature vector of an image sequence; in this case, a third image sequence is input into the style encoder to obtain second style information. The training process of the style encoder will be described later and is not expanded upon here.
In another possible implementation, the style information refers to an explicit vector, or a partial feature of the explicit vector, in the style feature vector of an image sequence; in this case, the third image sequence is input into the style encoder to obtain a style feature vector, and the style feature vector is split into an explicit vector and an implicit vector. In this case, the style information can be understood as an explicit expression of the style feature vector.
In other words, the style information in the embodiments of the present application may be the style feature vector corresponding to an image sequence, the explicit vector in that style feature vector, or a partial feature of the explicit vector in that style feature vector, and so on. Put differently, in the latter cases the style information can be decomposed into an explicit vector and implicit features. Of course, this decomposition is only an example; the style information may also be decomposed into an explicit vector, implicit features, and personalized features. The personalized features are used to express the personalized differences brought about when the same style is interpreted by different characters. The personalized features may also be related to the character in the image sequence; for example, they may be "Jin Xing-style" or "Trump-style".
Optionally, when the style information is the explicit vector, the style feature vector also needs to be first decomposed into an explicit vector and implicit features, and the explicit vector is used as the style information.
In the embodiments of the present application, there are multiple ways for the data processing device to obtain the first style information, which are described separately below.
First way: obtain the first style information based on the third image sequence.
In this case, the data processing device first obtains the third image sequence and obtains the first style information based on the third image sequence. There are multiple ways for the data processing device to obtain the third image sequence: it may be received from another device, selected from a database, collected by sensors in the data processing device, uploaded by a user, and so on; this is not specifically limited here.
In the embodiments of the present application, an image sequence (for example, the first image sequence or the third image sequence) may be a two-dimensional image sequence or a three-dimensional image sequence; this is not specifically limited here.
Optionally, in order to obtain style information of more style types, the third image sequence may be an image sequence extracted from a two-dimensional animation, for example, extracted by a human pose estimation method (for example, OpenPose). In addition, the way of obtaining the two-dimensional animation is not limited here: it may be shot and uploaded by a user, received from another device, selected from a database, and so on; this is not specifically limited here.
The step of obtaining the first style information based on the third image sequence is further divided into two cases according to whether there is a user operation, which are described separately below.
1. Without a user operation.
After obtaining the third image sequence, the data processing device may directly extract the second style information of the third image sequence and use the second style information as the first style information, or convert the second style information into preset style information, and so on.
In addition, the basis for the above decomposition may be a trained neural network, or may be finding, in a database, multiple image sequences that express the same style and then determining the explicit vector from these sequences; this is not specifically limited here. Determining the explicit vector from multiple image sequences expressing the same style may specifically include: inputting the multiple image sequences of the same style into the style encoder to obtain multiple style feature vectors, and using the common features of the multiple style feature vectors as the style information. The non-common parts are then the implicit features, and so on; this is not specifically limited here.
For example, multiple image sequences whose expressed style is "happy" are found from the database, and the multiple image sequences are respectively input into the style encoder to obtain multiple style feature vectors. The common vector of the multiple style feature vectors is determined; the style information of "happy" is then this common vector. The correspondence between the explicit style information and the common vector is thereby determined.
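One simple way to realize "common features across same-style sequences" (an illustrative assumption; the embodiment does not fix the extraction rule) is to take the element-wise mean of the style feature vectors as the shared component, with each sequence's residual treated as its implicit part:

```python
def common_style_vector(style_vectors):
    """Extract the 'common' (explicit) part shared by style feature
    vectors of sequences with the same style label, here modeled as the
    element-wise mean; per-sequence residuals model the implicit part."""
    n = len(style_vectors)
    dim = len(style_vectors[0])
    common = [sum(v[d] for v in style_vectors) / n for d in range(dim)]
    residuals = [[v[d] - common[d] for d in range(dim)] for v in style_vectors]
    return common, residuals

# Two hypothetical "happy" style vectors from the style encoder:
vecs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
common, residuals = common_style_vector(vecs)
print(common)     # [2.0, 2.0, 2.0]
print(residuals)  # [[-1.0, 0.0, 1.0], [1.0, 0.0, -1.0]]
```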
Optionally, when the style information is a partial feature of the explicit vector in the style feature vector corresponding to the image sequence, the explicit vector needs to be split, for example: explicit vector = W1 × style information 1 + W2 × style information 2 + ... + Wn × style information n.
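The weighted split above can be sketched as follows. For simplicity this sketch assumes the style-information vectors form an orthonormal basis, so each weight Wi is a dot product; with a general basis, a least-squares fit would be used instead. All vectors here are hypothetical:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def split_explicit(explicit, basis):
    """Recover the weights W1..Wn such that
    explicit = W1*style_1 + ... + Wn*style_n,
    assuming the style-information vectors in `basis` are orthonormal."""
    return [dot(explicit, b) for b in basis]

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # style information 1 and 2
explicit = [0.7, 0.2, 0.0]                   # = 0.7*basis[0] + 0.2*basis[1]
print(split_explicit(explicit, basis))       # [0.7, 0.2]
```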
For example, take the case where the style information is the explicit vector of the style feature vector. As shown in FIG. 5A, the style information may include: "calm -> excited", "single -> diverse", "feminine -> masculine". Here, the two sides of "->" may refer to the two boundaries of a range. For example, "calm" to "excited" is a progression of emotion; alternatively, this can be understood as the style information being further distinguishable into different weights/levels. As another example, the intensity range of happiness may include several levels such as satisfaction, relief, pleasure, joy, and ecstasy. In this example, the style information may also be "satisfaction -> ecstasy".
Optionally, consider the case where the second style information is converted into the first style information, and the second style information is two-dimensional style information. After obtaining the second style information, the data processing device can convert the second style information into the first style information through a conversion module, the first style information being three-dimensional style information. This case is mainly applied to scenarios where the style information of a two-dimensional animation is transferred to a three-dimensional animation so as to change the style information of the three-dimensional animation.
The above conversion module can be understood as a 2D-3D style conversion module. This module is trained on a large number of style-consistent 2D-3D pairs to obtain a nonlinear transformation for embedding 2D stylized features into the 3D stylized feature space. Subsequently, the 2D style information extracted from a video (that is, the second style information) can be converted into 3D stylized features (that is, the first style information) after being projected into the 3D space using the nonlinear transformation.
The training process of the above conversion module can be as shown in FIG. 5B. First, a 3D animation sequence is obtained and its 3D stylized features are extracted. Then, by orthogonally projecting the 3D animation sequence, a 2D animation sequence consistent with the style and actions of the 3D animation sequence is generated, and its 2D style information is extracted. Finally, by supervising the respective style information of the two, they are aligned to the same feature space, completing the projection of the 2D style information into the 3D style information space.
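The supervision signal in this alignment can be sketched as a distance between the projected 2D style feature and its paired 3D style feature. A linear map stands in here for the nonlinear transformation described above, and all values are hypothetical:

```python
def align_loss(feat2d, feat3d, transform):
    """Alignment supervision for the 2D->3D style conversion module:
    project the 2D style feature through the transform (here a plain
    matrix) and return the squared distance to the paired 3D style
    feature. Training would minimize this over many 2D-3D pairs."""
    projected = [sum(w * x for w, x in zip(row, feat2d)) for row in transform]
    return sum((p - t) ** 2 for p, t in zip(projected, feat3d))

identity = [[1.0, 0.0], [0.0, 1.0]]
print(align_loss([0.5, 0.5], [0.5, 0.5], identity))  # 0.0 (already aligned)
print(align_loss([1.0, 0.0], [0.0, 1.0], identity))  # 2.0 (misaligned pair)
```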
2. Determine the first style information based on a first operation of the user and the third image sequence.
In this way, after the data processing device extracts the second style information of the third image sequence, it can display a second semantic label to the user, the second semantic label being used to explicitly describe the second style information. Then, based on the first operation of the user, the second semantic label is modified into a first semantic label, and the first style information is determined based on the first semantic label. For the explanation of semantic labels, reference may be made to the description of the related terms above, which is not repeated here.
The second semantic label can be understood as a style description of the third image sequence, where the style includes one or more of the following: body/facial contour, body/facial proportions, amplitude of body movements, emotion, personality, and so on. For details, reference may be made to the description in the related terms above, which is not repeated here.
This way can also be understood as follows: the data processing device converts the second style information vector of the image sequence into a second semantic label that the user can understand, and the user processes the second semantic label according to actual needs to obtain the first semantic label. The data processing device then converts the first semantic label into the first style information and subsequently generates an image sequence that meets the user's needs. The above processing includes at least one of the following: addition, deletion, modification, degree control (which can also be understood as amplitude or level adjustment), and so on.
Optionally, the first operation includes the above-mentioned addition, deletion, modification, degree control (which can also be understood as amplitude or level adjustment), modification of semantic label weights, and so on. Specifically, the data processing device can determine the first operation through the user's voice, text, or other input methods, which is not specifically limited here.
This case can be applied to the scenario shown in FIG. 3A above. Take as an example that the data processing device is a cloud device and the third image sequence is obtained by being sent from a terminal device. The procedure in this case can be as shown in FIG. 6 and includes steps 601 to 606.
步骤601,终端设备向云端设备发送第三图像序列。Step 601: The terminal device sends a third image sequence to the cloud device.
用户可以通过终端设备向云端设备发送第三图像序列。相应的,云端设备接收终端设备发送的第三图像序列。The user can send the third image sequence to the cloud device through the terminal device. Correspondingly, the cloud device receives the third image sequence sent by the terminal device.
步骤602,云端设备生成第三图像序列的第二语义标签。Step 602: The cloud device generates a second semantic tag for the third image sequence.
云端设备获取第三图像序列之后,先获取第三图像序列的第二风格信息。并将第二风格信息转化为第二语义标签。After the cloud device obtains the third image sequence, it first obtains the second style information of the third image sequence and converts the second style information into a second semantic label.
示例性的,以风格信息是风格特征向量中的显式向量为例,与前述类似,可以从数据库中寻找多个表达“开心”的图像序列,将多个图像序列分别输入风格编码器得到多个风格特征向量。确定多个风格特征向量的共有向量,则“开心”的风格语义标签对应为上述的共有向量(即显式向量)。从而确定出语义标签与风格信息之间的对应关系。For example, taking the case where the style information is an explicit vector in the style feature vector, similar to the above, multiple image sequences expressing "happy" can be found from the database, and the multiple image sequences can be input into the style encoder to obtain multiple style feature vectors. The common vector of multiple style feature vectors is determined, and the style semantic label of "happy" corresponds to the above common vector (i.e., the explicit vector). Thus, the corresponding relationship between the semantic label and the style information is determined.
步骤603,云端设备向终端设备发送第二语义标签。Step 603: The cloud device sends a second semantic tag to the terminal device.
云端设备获取第二语义标签之后,向终端设备发送第二语义标签。相应的,终端设备接收云端设备发送的第二语义标签。After acquiring the second semantic tag, the cloud device sends the second semantic tag to the terminal device. Correspondingly, the terminal device receives the second semantic tag sent by the cloud device.
步骤604,终端设备基于第二语义标签确定第一语义标签。Step 604: The terminal device determines the first semantic tag based on the second semantic tag.
如果无需用户操作则与前述描述类似,这里仅以基于用户的第一操作与第三图像序列确定第一风格信息为例。If no user operation is required, the process is similar to the above description, and here only takes determining the first style information based on the user's first operation and the third image sequence as an example.
终端设备获取第二语义标签之后,向用户显示第二语义标签。进而基于用户的第一操作将第二语义标签修改为第一语义标签。After acquiring the second semantic tag, the terminal device displays the second semantic tag to the user, and then modifies the second semantic tag to the first semantic tag based on the user's first operation.
步骤605,终端设备向云端设备发送第一语义标签。Step 605: The terminal device sends the first semantic tag to the cloud device.
终端设备获取第一语义标签之后,向云端设备发送第一语义标签。相应的,云端设备接收终端设备发送的第一语义标签。After acquiring the first semantic tag, the terminal device sends the first semantic tag to the cloud device. Correspondingly, the cloud device receives the first semantic tag sent by the terminal device.
步骤606,云端设备基于第一语义标签确定第一风格信息。 Step 606: The cloud device determines first style information based on the first semantic tag.
云端设备获取第一语义标签之后,可以基于第一语义标签确定第一风格信息。After acquiring the first semantic tag, the cloud device may determine the first style information based on the first semantic tag.
Illustratively, FIG. 6B shows an example of a user modifying a label. The second semantic label of the third image sequence is "emotion: excited; style: single". On the basis of the second semantic label, the user performs the following edits: deletes "excited" and keeps the emotion neutral; adjusts the richness of motion from single to diverse; and adds a "feminine" style. Here, a natural language processing (NLP) module in the data processing device automatically recognizes and matches the semantic label of the style specified by the user, selects the matching style information, and can quantify the degree of a particular style specified by the user; the two are fused to generate the edited style information. The role of the NLP module is to take a piece of text as input and output a parse of that text (for example, nouns, verbs, and the keywords the user cares about), in particular the keywords in the text that express style. For example, given the input "the target style I want is half feminine and half masculine", the NLP module outputs the keywords "feminine", "masculine", and "half each"; that is, it extracts the style-related words from the descriptive text. As another example, the user submits "give me a more girly style" by text or voice, from which the NLP module determines that the user wants to "add a feminine style" on top of the second semantic label.
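A minimal sketch of such keyword parsing, assuming a hand-written style lexicon (the embodiment does not specify the NLP model; the lexicon entries and the weighting rule are illustrative):

```python
# Toy stand-in for the NLP module: a lexicon lookup that pulls
# style-related keywords, plus an optional weight, out of a request.
STYLE_LEXICON = {"feminine", "masculine", "excited", "neutral", "diverse"}

def parse_style_request(text):
    """Return the style keywords found in the text, each mapped to a
    weight (0.5 when the request says 'half', otherwise 1.0)."""
    words = text.lower().replace(",", " ").split()
    weight = 0.5 if "half" in words else 1.0
    return {w: weight for w in words if w in STYLE_LEXICON}

print(parse_style_request("the target style I want is half feminine, half masculine"))
# {'feminine': 0.5, 'masculine': 0.5}
```

A production NLP module would also handle negation ("delete excited") and degree words; the sketch only shows the keyword-and-weight extraction the text describes.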
Illustratively, take modifying weighted labels as an example. The data processing device displays the user interface shown in FIG. 7, which includes an animation preview interface and an editing interface. The style semantic labels in the editing interface (also called style labels) can be understood as the aforementioned second semantic label, for example "excited" and "single". The user can modify the second semantic label through the editing interface. As shown in FIG. 8, the user drags slider 801 to move "calm->excited" from 1.0 to 0.5, that is, removing the excitement and making the emotion neutral, and drags slider 802 to move "single->diverse" from 0.0 to 1.0, that is, changing single to diverse. In addition, the user can click the add-label control 803 to add the feminine style label shown in FIG. 9. Through FIGS. 7 to 9, the user thus modifies the second semantic label (excited, single) into the first semantic label (neutral, diverse, feminine).
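The slider edits of FIGS. 7 to 9 can be modeled as updates to a label-weight dictionary. The axis names and the default weight of 1.0 for a newly added label are assumptions made for illustration:

```python
# The second semantic label as a weight dictionary: each key is a
# label axis, each value is the slider position in [0, 1].
second_label = {"calm->excited": 1.0, "single->diverse": 0.0}

def edit_label(label, changes, additions=()):
    """Apply slider drags and added labels without mutating the input."""
    edited = dict(label)
    edited.update(changes)         # drag existing sliders (FIG. 8)
    for new_axis in additions:     # 'add label' control (FIG. 9)
        edited.setdefault(new_axis, 1.0)
    return edited

first_label = edit_label(
    second_label,
    changes={"calm->excited": 0.5,    # excited -> neutral
             "single->diverse": 1.0}, # single -> diverse
    additions=("feminine",),          # newly added feminine style
)
```

The resulting dictionary corresponds to the first semantic label (neutral, diverse, feminine) that the downstream style-editing module consumes.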
In this manner, the semantic labels of the style information are made explicit, and the user can edit according to these explicit labels. In practice, for the style of an arbitrary video, it is subjectively difficult for a person to define accurately the style the video presents, and even more difficult to edit it accurately. This embodiment decomposes the style information, semanticizes its explicit features, and thereby labels the style information; it then recognizes, matches, and quantifies the semantic labels of any style specified by the user to generate specific style information. This makes both deployments possible: returning feature labels for user editing in the deployment scheme of FIG. 3A, and matching semantic labels of the user's personalized style in the deployment scheme of FIG. 3B. It also lets users understand their own editing actions more clearly.
可以理解的是,上述只是基于第三图像序列获取第一语义标签的两种方式举例,在实际应用中,还可以有其他方式,具体此处不做限定。It is understandable that the above are just two examples of ways to obtain the first semantic label based on the third image sequence. In practical applications, there may be other ways, which are not specifically limited here.
该第一种情况下,可以从第三图像序列/视频中提取风格信息,以弥补用户对某些风格难以描述的缺陷。In the first case, style information may be extracted from the third image sequence/video to compensate for the defect that certain styles are difficult for users to describe.
第二种,基于用户针对于第一界面的第二操作确定第一风格信息。The second type is to determine the first style information based on a second operation performed by the user on the first interface.
该种方式下,数据处理设备向用户显示第一界面,该第一界面包括多个语义标签。多个语义标签中的每个语义标签用于显式化图像序列的风格信息。数据处理设备再基于用户的第二操作从多个语义标签中确定第一语义标签。进而根据该第一语义标签确定第一风格信息。In this manner, the data processing device displays a first interface to the user, and the first interface includes multiple semantic tags. Each of the multiple semantic tags is used to explicitly display the style information of the image sequence. The data processing device then determines a first semantic tag from the multiple semantic tags based on the user's second operation, and then determines the first style information based on the first semantic tag.
该种情况可以应用于前述图3B所示的场景。以数据处理设备是云端设备为例。该种情况下的流程可以如图10所示。该流程包括步骤1001至步骤1005。This situation can be applied to the scenario shown in FIG. 3B above. Take the case where the data processing device is a cloud device as an example. The process in this case can be shown in FIG. 10. The process includes steps 1001 to 1005.
步骤1001,云端设备基于多个图像序列生成风格信息库与多个语义标签。Step 1001: A cloud device generates a style information library and multiple semantic tags based on multiple image sequences.
云端设备通过获取多个图像序列,获取多个图像序列对应风格特征向量的公共向量,并基于不同的公共向量提取出不同的语义标签。进而获取多个公共向量的风格信息库与多个语义标签。The cloud device obtains multiple image sequences, obtains common vectors of style feature vectors corresponding to the multiple image sequences, and extracts different semantic labels based on different common vectors, thereby obtaining a style information library and multiple semantic labels of the multiple common vectors.
步骤1002,云端设备向终端设备发送多个语义标签。Step 1002: The cloud device sends a plurality of semantic tags to the terminal device.
云端设备获取多个语义标签之后,向终端设备发送多个语义标签。相应的,终端设备接收云端设备发送的多个语义标签。After the cloud device obtains the multiple semantic tags, it sends the multiple semantic tags to the terminal device. Correspondingly, the terminal device receives the multiple semantic tags sent by the cloud device.
步骤1003,终端设备基于用户针对于第一界面的第二操作确定第一语义标签。Step 1003: The terminal device determines a first semantic tag based on a second operation performed by the user on the first interface.
终端设备接收云端设备发送的多个语义标签之后,向用户显示第一界面,该第一界面包括多个语义标签。基于用户对第一界面的第二操作确定第一语义标签。该第二操作具体可以是选择操作等。After receiving the plurality of semantic tags sent by the cloud device, the terminal device displays a first interface to the user, where the first interface includes the plurality of semantic tags. The first semantic tag is determined based on a second operation of the user on the first interface. The second operation may specifically be a selection operation, etc.
步骤1004,终端设备向云端设备发送第一语义标签。Step 1004: The terminal device sends a first semantic tag to the cloud device.
终端设备确定第一语义标签之后,向云端设备发送第一语义标签。相应的,云端设备接收终端设备发送的第一语义标签。 After determining the first semantic tag, the terminal device sends the first semantic tag to the cloud device. Correspondingly, the cloud device receives the first semantic tag sent by the terminal device.
Step 1005: The cloud device determines the first style information from the style information library based on the first semantic tag.
After receiving the first semantic tag sent by the terminal device, the cloud device looks up the common vector corresponding to the first semantic tag in the style information library and uses it as the first style information.
This manner can also be understood as follows: the data processing device displays multiple semantic tags to the user, and the user either selects the needed tags directly from them, or enters weights for the multiple semantic tags in the first interface.
第三种,基于用户的第三操作确定第一风格信息。The third type is to determine the first style information based on a third operation of the user.
该种方式下,数据处理设备可以直接接收用户的第三操作,并响应于该第三操作确定第一语义标签。In this manner, the data processing device can directly receive the third operation of the user, and determine the first semantic tag in response to the third operation.
该第三操作可以是语音、文本等,具体此处不做限定。例如,用户通过语音编辑“增加阴柔风格”。则数据处理设备可以根据“增加阴柔风格”的语音,确定第一语义标签为“阴柔”。The third operation may be voice, text, etc., which is not limited here. For example, the user edits "add feminine style" by voice. Then the data processing device can determine the first semantic tag as "feminine" according to the voice of "add feminine style".
示例性的,以数据处理设备是服务端为例,即数据处理设备离线完成从视频中提取任意风格,并生成特征库。用户只需上传要求的个性化风格的语义标签,如在兴奋的风格上增加一点阴柔风格。数据处理设备在收到请求后,自动从风格信息库中识别出兴奋与阴柔标签对应的风格信息,并对此特征进行编辑,生成与目标风格的语义标签相匹配的风格信息,完成渲染与显示。For example, the data processing device is a server, that is, the data processing device extracts any style from the video offline and generates a feature library. The user only needs to upload the semantic tag of the required personalized style, such as adding a little feminine style to the excited style. After receiving the request, the data processing device automatically identifies the style information corresponding to the excited and feminine tags from the style information library, edits the feature, generates style information that matches the semantic tag of the target style, and completes rendering and display.
可以理解的是,上述几种情况只是获取第一风格信息的几个举例,在实际应用中,还可以有其他方式,具体此处不做限定。It is understandable that the above-mentioned situations are just a few examples of obtaining the first style information. In practical applications, there may be other ways, which are not specifically limited here.
步骤402,获取第一图像序列的动作信息。Step 402: Acquire action information of a first image sequence.
数据处理设备获取第一图像序列。该第一图像序列可以理解为是需要替换风格信息的图像序列。The data processing device obtains a first image sequence, which can be understood as an image sequence whose style information needs to be replaced.
可选地,在将2D/3D动画风格信息迁移到3D动画的场景中,该第一图像序列为三维图像序列。在将2D/3D动画风格信息迁移到2D动画的场景中,该第一图像序列为二维图像序列。Optionally, in the scenario of migrating 2D/3D animation style information to 3D animation, the first image sequence is a 3D image sequence. In the scenario of migrating 2D/3D animation style information to 2D animation, the first image sequence is a 2D image sequence.
可选地,第一图像序列可以是从三维动画中提取出来的图像序列。例如,通过人体姿态识别方法(例如,openpose)从三维动画中提取出第一图像序列。另外,三维动画的获取方式这里不做限定,可以是通过用户拍摄上传的方式,也可以是通过接收其他设备发送的方式,还可以是从数据库中选取的方式等,具体此处不做限定。Optionally, the first image sequence may be an image sequence extracted from a three-dimensional animation. For example, the first image sequence is extracted from the three-dimensional animation by a human posture recognition method (e.g., openpose). In addition, the acquisition method of the three-dimensional animation is not limited here, and may be a method of uploading by a user, a method of receiving from other devices, or a method of selecting from a database, etc., which is not limited here.
示例1,第一图像序列的一种示例如图11所示。该第一图像序列的动作内容为“走步”。Example 1: An example of the first image sequence is shown in Figure 11. The action content of the first image sequence is "walking".
数据处理设备获取第一图像序列之后,提取第一图像序列的动作信息。其中,动作信息的解释可以参考前述相关术语的描述,此处不再赘述。After acquiring the first image sequence, the data processing device extracts the action information of the first image sequence. The explanation of the action information can refer to the description of the above-mentioned related terms, which will not be repeated here.
可选地,将第一图像序列输入内容编码器,以得到动作信息。其中,对于内容编码器的训练过程后续会有说明,此处不再展开。Optionally, the first image sequence is input into a content encoder to obtain the action information. The training process of the content encoder will be described later and will not be expanded here.
步骤403,基于第一风格信息与动作信息生成第二图像序列。Step 403: Generate a second image sequence based on the first style information and the motion information.
数据处理设备获取第一语义标签之后,可以基于第一语义标签确定第一风格信息。进而基于该第一风格信息与动作信息生成第二图像序列。After acquiring the first semantic tag, the data processing device may determine the first style information based on the first semantic tag, and then generate a second image sequence based on the first style information and the action information.
在一种可能实现的方式中,第一语义标签用于显式化整个第一风格信息。该种情况下,直接基于第一语义标签确定第一风格信息。In a possible implementation, the first semantic tag is used to make the entire first style information explicit. In this case, the first style information is determined directly based on the first semantic tag.
在另一种可能实现的方式中,第一语义标签用于显式化第一风格信息中的显式向量。该种情况下,先将第一语义标签转化为显式向量,然后与第一图像序列的隐式特征进行融合以得到第一风格信息。In another possible implementation, the first semantic label is used to make explicit the explicit vector in the first style information. In this case, the first semantic label is first converted into an explicit vector and then fused with the implicit features of the first image sequence to obtain the first style information.
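A sketch of this second manner, under the assumption that the fusion of explicit and implicit parts is simple concatenation (the embodiment does not fix the fusion operator), and with a hypothetical label table:

```python
def label_to_explicit(label, label_table):
    """Look up the explicit vector that a semantic label stands for."""
    return label_table[label]

def compose_style(explicit, implicit):
    """Fuse the label-derived explicit vector with the implicit
    features extracted from the first image sequence. Concatenation
    is an illustrative assumption; any learned fusion would do."""
    return list(explicit) + list(implicit)

# Hypothetical mapping and implicit features (toy values).
table = {"frustrated": [0.2, 0.7]}
style = compose_style(label_to_explicit("frustrated", table), [0.05, 0.1])
```

The composed vector plays the role of the first style information fed to the decoder together with the action information.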
可选地,数据处理设备融合第一风格信息与动作信息以得到第一运动特征。并基于第一运动特征获取第二图像序列。Optionally, the data processing device fuses the first style information with the action information to obtain a first motion feature, and acquires a second image sequence based on the first motion feature.
上述数据处理设备融合第一风格信息与动作信息以得到第一运动特征所使用的融合算法可以包括:自适应实例标准化层(Adaptive Instance Normalization,AdaIN)、深度学习模型、统计方法等分布之间的对齐方法。The fusion algorithm used by the above-mentioned data processing device to fuse the first style information and the action information to obtain the first motion feature may include: Adaptive Instance Normalization (AdaIN), deep learning models, statistical methods and other alignment methods between distributions.
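AdaIN, one of the fusion options named above, aligns the content feature's statistics to the style's. A per-channel sketch over plain lists follows; real implementations operate on feature tensors channel by channel:

```python
from statistics import mean, pstdev

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization: normalize the content feature,
    then re-scale it to the style's mean and standard deviation."""
    mu, sigma = mean(content), pstdev(content)
    return [style_std * (c - mu) / (sigma + eps) + style_mean
            for c in content]

# The fused feature keeps the content's relative pattern but carries
# the style's statistics (here, zero mean and unit scale).
fused = adain([1.0, 2.0, 3.0], style_mean=0.0, style_std=1.0)
```

In the embodiment, the style mean and standard deviation would be predicted from the first style information rather than passed as constants.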
可选地,数据处理设备将第一运动特征输入解码器以得到第二图像序列。其中,对于解码器的训练过程后续会有说明,此处不再展开。Optionally, the data processing device inputs the first motion feature into a decoder to obtain a second image sequence. The training process of the decoder will be described later and will not be expanded here.
示例性的,以第一语义标签基于第三图像序列获取为例。第三图像序列如图12所示。该第一风格信息为“沮丧”。延续上述示例1,则本步骤获取的第二图像序列如图13所示。该第二图像序列为“沮丧”的走步。Exemplarily, the first semantic tag is obtained based on the third image sequence. The third image sequence is shown in FIG12. The first style information is "frustrated". Continuing with the above example 1, the second image sequence obtained in this step is shown in FIG13. The second image sequence is a "frustrated" walk.
In this example, the flow of steps 401 to 403 can be as shown in FIG. 14. The inputs are the third image sequence (for example, an image sequence of a 2D animation), the first image sequence (for example, an image sequence of the original 3D animation), and the semantic label of the user's personalized style (i.e., the first semantic label). First, the 2D style information extraction module extracts the 2D stylized features of the third image sequence and converts them into 3D style information, while making the semantic label of that style explicit and returning it to the user for editing. Second, based on the semantic labels and the personalization the user wants to generate, the NLP module parses the user's personalized requirements and feeds them, together with the 3D style information, into the style editing module to generate an edited style information vector (i.e., the first style information). Finally, the first image sequence is content-encoded into a feature expression characterizing its content, fused with the edited first style information, and decoded into an image sequence of the 3D target animation that conforms to the user's edits (i.e., the second image sequence).
可选地,数据处理设备获取第二图像序列之后,将第二图像序列渲染至虚拟物体以得到动画/视频。Optionally, after acquiring the second image sequence, the data processing device renders the second image sequence to a virtual object to obtain an animation/video.
可选地,在第二图像序列是三维图像序列的情况下,上述生成的动画为3D动画。在第二图像序列是二维图像序列的情况下,上述生成的动画为2D动画。Optionally, when the second image sequence is a three-dimensional image sequence, the generated animation is a 3D animation. When the second image sequence is a two-dimensional image sequence, the generated animation is a 2D animation.
在一种可能实现的方式中,本申请实施例提供的数据处理方法主要应用于图像序列的风格迁移场景。In one possible implementation, the data processing method provided in the embodiment of the present application is mainly applied to the style transfer scenario of image sequences.
在另一种可能实现的方式中,本申请实施例提供的数据处理方法主要应用于动画风格迁移场景中。In another possible implementation manner, the data processing method provided in the embodiment of the present application is mainly used in animation style transfer scenarios.
In the embodiments of the present application: on one hand, style information and action information are obtained separately, and the second image sequence is generated from the first style information and the action information, so that stylized animation editing can be performed without changing the other features of the original image sequence, improving the style transfer effect of the animation. On another hand, style information is described and made explicit through semantic labels; the user edits the semantic labels to achieve style transfer, thereby realizing a driving scheme for body movements. This gives the user a quantitative and qualitative analysis of the style information, so that the user knows clearly how to describe requirements quantitatively. Furthermore, by parsing user requirements and exploiting the advantage that massive video collections can cover arbitrary styles, the embodiments can generate arbitrarily customized stylized digital human animations. In addition, extracting style information from the video to which the third image sequence belongs compensates for the user's difficulty in describing certain kinds of style information, and making style information explicit with labels keeps each editing action understandable to the user.
本申请实施例提供的方法的另一流程图可以如图15所示。从风格参照动画中获取第二图像序列,并对第二图像序列的风格化特征提取,以得到第二风格化特征。进而显式化第二风格化特征以得到显示标签。用户对显示标签进行编辑后得到第一风格化特征。再将该第一风格化特征迁移到原始动画中得到风格化动画。该风格化动画的内容与原始动画一致,风格化动画的风格与风格参照动画一致,进而实现风格化迁移。Another flowchart of the method provided by the embodiment of the present application can be shown in FIG15. A second image sequence is obtained from the style reference animation, and the stylized features of the second image sequence are extracted to obtain a second stylized feature. The second stylized feature is then made explicit to obtain a display label. The user edits the display label to obtain a first stylized feature. The first stylized feature is then transferred to the original animation to obtain a stylized animation. The content of the stylized animation is consistent with the original animation, and the style of the stylized animation is consistent with the style reference animation, thereby achieving stylized migration.
上面对本申请实施例提供的数据处理方法进行了描述,下面对上述图4所示实施例中所提的风格编码器、内容编码器、解码器的训练过程进行详细描述。训练侧,利用海量的肢体动画视频,构建近似完备的肢体动画风格化特征向量空间,可满足推理侧风格化特征的任意性。The above describes the data processing method provided in the embodiment of the present application. The following describes in detail the training process of the style encoder, content encoder, and decoder mentioned in the embodiment shown in Figure 4. On the training side, a large amount of body animation videos are used to construct an approximately complete body animation stylized feature vector space, which can meet the arbitrariness of the stylized features on the reasoning side.
训练过程如图16所示,首先,获取图像序列1与图像序列2。其中,图像序列1具有风格1与动作1。图像序列2具有风格2与动作2。其次,利用风格编码器和动作内容编码器分别对输入的两个序列的风格和运动内容进行编码,以解耦风格信息与动作信息。再通过融合算法(例如AdaIN)融合风格信息1与动作信息2,经过解码后生成风格1化的动作2。并融合风格信息2与动作信息1,生成风格2化的动作1。最后,通过判别器分别监督生成的风格化动画分别在风格和内容上的重构损失,使得最终生成的风格化动画能在不损失原始运动内容的前提下,兼备与目标风格最大的相似性。The training process is shown in Figure 16. First, image sequence 1 and image sequence 2 are obtained. Among them, image sequence 1 has style 1 and action 1. Image sequence 2 has style 2 and action 2. Secondly, the style encoder and action content encoder are used to encode the style and motion content of the two input sequences respectively to decouple the style information and action information. Then, style information 1 and action information 2 are fused through a fusion algorithm (such as AdaIN), and style 1 action 2 is generated after decoding. And style information 2 and action information 1 are fused to generate style 2 action 1. Finally, the discriminator supervises the reconstruction losses of the generated stylized animation in style and content respectively, so that the final generated stylized animation can have the greatest similarity with the target style without losing the original motion content.
The above process can be understood as follows. A first training image sequence and a second training image sequence are obtained; their motion features differ, where a motion feature includes action information and/or style information. The first training image sequence is input into the style encoder and the content encoder to obtain first training style information and first training action information; the second training image sequence is input into the style encoder and the content encoder to obtain second training style information and second training action information. The first training style information is fused with the second training action information to obtain a first training motion feature; the second training style information is fused with the first training action information to obtain a second training motion feature. The first training motion feature is input into the decoder to obtain a first reconstructed image sequence; the second training motion feature is input into the decoder to obtain a second reconstructed image sequence. Training is performed with the goal of making the value of a first loss function smaller than a first threshold, yielding the trained style encoder, content encoder, and decoder. The first loss function includes a style loss function and a content loss function: the style loss function represents the style difference between the first reconstructed image sequence and the first training image sequence, and between the second reconstructed image sequence and the second training image sequence; the content loss function represents the content difference between the first reconstructed image sequence and the second training image sequence, and between the second reconstructed image sequence and the first training image sequence.
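The cross-reconstruction loss pairing can be sketched as follows. The feature extractors and the squared-L2 distance are stand-ins, since the embodiment only fixes which pairs are compared, not the concrete metric:

```python
def l2(a, b):
    """Squared L2 distance, one possible realization of 'difference'."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def first_loss(recon1, recon2, style1, style2, content1, content2,
               style_of, content_of):
    # Style terms: each reconstruction against the sequence whose
    # style it should carry (recon1 <- seq1 style, recon2 <- seq2 style).
    style_loss = (l2(style_of(recon1), style1)
                  + l2(style_of(recon2), style2))
    # Content terms: each reconstruction against the sequence whose
    # motion it should keep (recon1 <- seq2 motion, recon2 <- seq1 motion).
    content_loss = (l2(content_of(recon1), content2)
                    + l2(content_of(recon2), content1))
    return style_loss + content_loss

# Toy check with features laid out as [style | content] and slicing
# extractors: a perfect cross-reconstruction yields zero loss.
style_of = lambda seq: seq[:2]
content_of = lambda seq: seq[2:]
loss = first_loss(
    recon1=[1.0, 0.0, 5.0, 6.0], recon2=[0.0, 1.0, 7.0, 8.0],
    style1=[1.0, 0.0], style2=[0.0, 1.0],
    content1=[7.0, 8.0], content2=[5.0, 6.0],
    style_of=style_of, content_of=content_of,
)
```

In the embodiment the comparison is mediated by a discriminator rather than a fixed distance; the sketch only makes the pairing of reconstructions with their target style and content explicit.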
In this embodiment, the style encoder, content encoder, and decoder obtained through training can extract 2D stylized features from a video sequence and map them into the 3D feature space, producing a 3D style semantically consistent with them, and the 3D style information is then given an explicit semantic expression. The user edits the style according to this semantic expression to generate a target style that meets expectations; the algorithm then generates the style information corresponding to the semantic labels of the user's style; finally, the style transfer module transfers the generated 3D target features onto the original animation sequence to produce a target-stylized virtual digital human animation sequence.
另外,图4所示实施例中的第三图像序列包括以下一项或多项:面部表情序列、肢体图像序列。例如,肢体动作包括全局肢体、局部肢体(例如手势等)等。换句话说本申请实施例提供的方法还可以应用于手势、表情等风格迁移。下面以语音驱动手势为例。该方法应用于手势风格迁移的场景如图17所示。In addition, the third image sequence in the embodiment shown in FIG4 includes one or more of the following: a facial expression sequence, a limb image sequence. For example, limb movements include global limbs, local limbs (such as gestures, etc.), etc. In other words, the method provided in the embodiment of the present application can also be applied to style transfer such as gestures and expressions. The following takes voice-driven gestures as an example. The scenario in which the method is applied to gesture style transfer is shown in FIG17.
通过输入一段文本或语音数据,驱动虚拟数字人做出与语音数据语义已知、节奏一致的手势动作。对于同一段语音或文本数据,不同演讲者的手势风格会因人而异,也会因同一人的不同情绪而异,因而风格的个性化定制与迁移对丰富手势的多样性有重要意义。By inputting a piece of text or voice data, the virtual digital human is driven to make gestures with known semantics and consistent rhythm with the voice data. For the same piece of voice or text data, the gesture style of different speakers will vary from person to person, and also from different emotions of the same person, so the personalized customization and transfer of style is of great significance to enriching the diversity of gestures.
在离线或训练阶段,通过收集海量的2D演讲视频,通过前述风格化特征提取模块产生几乎可覆盖任意风格的手势风格信息,离线生成风格信息数据库;在线使用阶段,用户指定任意个性化的风格化标签,通过对用户标签的解析与量化表示,融合离线生成的风格数据库,生成编辑后的风格信息,并将语音驱动手势模块生成的运动序列风格化为目标风格。In the offline or training stage, a large amount of 2D speech videos are collected, and the gesture style information that can cover almost any style is generated through the aforementioned stylized feature extraction module, and a style information database is generated offline. In the online use stage, the user specifies any personalized stylized label, and the user label is parsed and quantified, and the style database generated offline is integrated to generate the edited style information, and the motion sequence generated by the voice-driven gesture module is stylized into the target style.
该方法应用于表情风格迁移的场景如图18所示。该场景也可以理解为数字人表情基风格编辑与迁移场景。通过从海量表情视频中获取近乎任意的表情风格,再迁移到数字人表情肌上,驱动同一个数字人做出任意风格的表情。其中,表情基的定义是,事先确定的用于表征某个中性表情的脸部若干个关键点的坐标集合,而原始系数则表示某个特定表情相对于中性表情的参数表达,比如微笑时相对于中性表情的嘴巴的咧开程度等。因而图18的整个过程是,首先根据某个人的表情和预置的表情基,通过一个表情网络计算该表情所对应的原始系数;并通过同一组表情基获取视频中各种表情对应的系数,用户通过编辑该系数控制所要生成的表情。The scenario where this method is applied to expression style transfer is shown in Figure 18. This scenario can also be understood as a digital human expression base style editing and transfer scenario. By obtaining nearly arbitrary expression styles from massive expression videos and then transferring them to the digital human expression muscles, the same digital human can be driven to make expressions of any style. Among them, the definition of expression base is a predetermined set of coordinates of several key points on the face used to represent a neutral expression, and the original coefficient represents the parameter expression of a specific expression relative to a neutral expression, such as the degree of mouth opening relative to a neutral expression when smiling. Therefore, the whole process of Figure 18 is to first calculate the original coefficient corresponding to a person's expression and a preset expression base through an expression network; and obtain the coefficients corresponding to various expressions in the video through the same set of expression bases, and the user controls the expression to be generated by editing the coefficient.
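The expression-base (blend-shape) formulation described above can be sketched as follows. The key-point layout and the "smile" basis are toy values, and representing each basis as offsets from the neutral face follows the description of the original coefficients:

```python
def apply_expression(neutral, bases, coeffs):
    """Blend a neutral face with expression bases: each basis is a set
    of key-point offsets relative to neutral, weighted by its
    (user-editable) coefficient, as in the FIG. 18 scenario."""
    out = list(neutral)
    for basis, w in zip(bases, coeffs):
        for i, offset in enumerate(basis):
            out[i] += w * offset
    return out

neutral = [0.0, 0.0]        # two key-point coordinates (toy layout)
smile_basis = [0.3, -0.1]   # hypothetical offsets for a "smile" basis

# A coefficient of 0.5 yields a half smile; editing the coefficient
# is how the user controls the generated expression.
face = apply_expression(neutral, [smile_basis], [0.5])
```

Style transfer in this scenario then amounts to replacing or re-weighting the coefficient sequence extracted from the reference video, while the expression bases stay fixed.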
本实施例中,一方面,能从视频序列中提取并转化成手势/表情的风格化特征,极大地丰富了风格多样性;另一方面,对从视频中提取的手势/表情的风格进行显式标签化,便于与用户对手势/表情的风格进行语义性描述,进而实现后续标签与风格信息的匹配与融合。In this embodiment, on the one hand, stylized features of gestures/expressions can be extracted from video sequences and converted into them, greatly enriching the style diversity; on the other hand, the style of gestures/expressions extracted from the video is explicitly labeled, which facilitates the semantic description of the style of gestures/expressions with the user, and then realizes the subsequent matching and fusion of labels and style information.
上面对本申请实施例中的数据处理方法进行了描述,下面对本申请实施例中的数据处理设备进行描述,请参阅图19,本申请实施例中数据处理设备的一个实施例包括:The data processing method in the embodiment of the present application is described above. The data processing device in the embodiment of the present application is described below. Please refer to FIG. 19. An embodiment of the data processing device in the embodiment of the present application includes:
获取单元1901,用于获取第一风格信息;An acquiring unit 1901 is used to acquire first style information;
获取单元1901,还用于获取第一图像序列的动作信息;The acquisition unit 1901 is further used to acquire the motion information of the first image sequence;
生成单元1902,用于基于第一风格信息与动作信息生成第二图像序列,第二图像序列与第一图像序列的动作类型相同,第二图像序列具有第一风格信息。The generating unit 1902 is configured to generate a second image sequence based on the first style information and the action information. The second image sequence has the same action type as the first image sequence, and the second image sequence has the first style information.
可选地,数据处理设备还可以包括:显示单元1903,用于向用户显示第一界面,第一界面包括多个语义标签,多个语义标签用于描述不同图像序列的不同风格信息,多个语义标签与风格信息一一对应;获取单元1901,具体用于基于用户的第二操作从多个语义标签中确定第一语义标签;以及用于将第一语义标签转化为第一风格信息。Optionally, the data processing device may also include: a display unit 1903, used to display a first interface to the user, the first interface including multiple semantic tags, the multiple semantic tags are used to describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information; an acquisition unit 1901, specifically used to determine a first semantic tag from multiple semantic tags based on a second operation of the user; and used to convert the first semantic tag into first style information.
可选地,数据处理设备还可以包括:渲染单元1904,用于将第二图像序列渲染至虚拟物体以得到动画。Optionally, the data processing device may further include: a rendering unit 1904, configured to render the second image sequence to a virtual object to obtain an animation.
本实施例中,数据处理设备中各单元所执行的操作与前述图1至图18所示实施例中描述的类似,此处不再赘述。In this embodiment, the operations performed by each unit in the data processing device are similar to those described in the embodiments shown in Figures 1 to 18 above, and will not be repeated here.
In this embodiment, the acquisition unit 1901 acquires the style information and the motion information separately, and the generation unit 1902 generates the second image sequence based on the first style information and the motion information. This enables stylized animation editing without changing other features of the original image sequence, improving the style-transfer quality of the animation.
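One simple way to picture the step of fusing the first style information with the motion information to obtain a first motion feature (claim 8) is feature concatenation; a real system might instead modulate features inside a network (for example, AdaIN-style modulation). Everything below is an assumed toy representation, not the patented mechanism.

```python
# Toy sketch of "fuse style information with motion information to obtain
# a first motion feature". Concatenation is one of the simplest fusions.

def fuse(style_info, motion_frame):
    """Fuse one motion frame with the style information into one feature."""
    return list(motion_frame) + list(style_info)

def motion_features(style_info, motion_info):
    """First motion feature: one fused vector per motion frame."""
    return [fuse(style_info, m) for m in motion_info]
```

The second image sequence would then be decoded from these fused features, so every frame carries both the original motion and the chosen style.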
Referring to FIG. 20, which is a schematic structural diagram of another data processing device provided by the present application, the data processing device may include a processor 2001, a memory 2002, and a communication port 2003, which are interconnected by a line. The memory 2002 stores program instructions and data.
The memory 2002 stores the program instructions and data corresponding to the steps performed by the data processing device in the implementations shown in FIG. 1 to FIG. 18.
The processor 2001 is configured to perform the steps performed by the data processing device in any of the embodiments shown in FIG. 1 to FIG. 18.
The communication port 2003 may be configured to receive and send data, and to perform the steps related to acquiring, sending, and receiving in any of the embodiments shown in FIG. 1 to FIG. 18.
In one implementation, the data processing device may include more or fewer components than shown in FIG. 20; FIG. 20 is merely illustrative and is not limiting.
An embodiment of the present application further provides a computer-readable storage medium storing one or more computer-executable instructions. When the computer-executable instructions are executed by a processor, the processor performs the method described in the possible implementations of the data processing device in the foregoing embodiments.
An embodiment of the present application further provides a computer program product (or computer program) storing one or more computer instructions. When the computer program product is executed by a processor, the processor performs the method of the possible implementations of the foregoing data processing device.
An embodiment of the present application further provides a chip system, which includes at least one processor configured to support a terminal device in implementing the functions involved in the possible implementations of the foregoing data processing device. Optionally, the chip system further includes an interface circuit, which provides program instructions and/or data for the at least one processor. In one possible design, the chip system may further include a memory for storing the program instructions and data necessary for the terminal device. The chip system may consist of chips, or may include chips and other discrete devices.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely schematic: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application (in essence, the part that contributes to the prior art, or all or part of the technical solutions) may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (23)

  1. A data processing method, characterized in that the method comprises:
    acquiring first style information;
    acquiring motion information of a first image sequence; and
    generating a second image sequence based on the first style information and the motion information, wherein the second image sequence has the same motion type as the first image sequence, and the second image sequence has the first style information.
  2. The method according to claim 1, characterized in that before the acquiring of the first style information, the method further comprises:
    acquiring a third image sequence;
    wherein the acquiring of the first style information comprises:
    acquiring the first style information based on the third image sequence.
  3. The method according to claim 2, characterized in that the acquiring of the first style information based on the third image sequence comprises:
    extracting second style information of the third image sequence; and
    determining the first style information based on the second style information.
  4. The method according to claim 3, characterized in that the determining of the first style information based on the second style information comprises:
    using the second style information as the first style information.
  5. The method according to claim 3, characterized in that the determining of the first style information based on the second style information comprises:
    displaying a second semantic tag to a user, wherein the second semantic tag describes the second style information;
    modifying the second semantic tag into a first semantic tag based on a first operation of the user, wherein the first semantic tag describes the first style information; and
    determining the first style information based on the first semantic tag.
  6. The method according to any one of claims 2 to 5, characterized in that the third image sequence is an image sequence of a two-dimensional animation, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
  7. The method according to claim 1, characterized in that the method further comprises:
    displaying a first interface to a user, wherein the first interface comprises multiple semantic tags, the multiple semantic tags describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information;
    wherein the acquiring of the first style information comprises:
    determining a first semantic tag from the multiple semantic tags based on a second operation of the user; and
    determining the first style information based on the first semantic tag.
  8. The method according to any one of claims 1 to 7, characterized in that the generating of the second image sequence based on the first style information and the motion information comprises:
    fusing the first style information with the motion information to obtain a first motion feature; and
    acquiring the second image sequence based on the first motion feature.
  9. The method according to any one of claims 1 to 8, characterized in that the motion information comprises one or more of the following: a facial expression sequence and a body image sequence.
  10. The method according to any one of claims 1 to 9, characterized in that the method further comprises:
    rendering the second image sequence onto a virtual object to obtain an animation.
  11. A data processing device, characterized in that the data processing device comprises:
    an acquisition unit, configured to acquire first style information,
    the acquisition unit being further configured to acquire motion information of a first image sequence; and
    a generation unit, configured to generate a second image sequence based on the first style information and the motion information, wherein the second image sequence has the same motion type as the first image sequence, and the second image sequence has the first style information.
  12. The device according to claim 11, characterized in that the acquisition unit is further configured to acquire a third image sequence; and
    the acquisition unit is specifically configured to acquire the first style information based on the third image sequence.
  13. The device according to claim 12, characterized in that the acquisition unit is specifically configured to extract second style information of the third image sequence; and
    the acquisition unit is specifically configured to determine the first style information based on the second style information.
  14. The device according to claim 13, characterized in that the acquisition unit is specifically configured to use the second style information as the first style information.
  15. The device according to claim 13, characterized in that the acquisition unit is specifically configured to display a second semantic tag to a user, wherein the second semantic tag describes the second style information;
    the acquisition unit is specifically configured to modify the second semantic tag into a first semantic tag based on a first operation of the user, wherein the first semantic tag describes the first style information; and
    the acquisition unit is specifically configured to determine the first style information based on the first semantic tag.
  16. The device according to any one of claims 12 to 15, characterized in that the third image sequence is an image sequence of a two-dimensional animation, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are image sequences of a three-dimensional animation.
  17. The device according to claim 11, characterized in that the data processing device further comprises:
    a display unit, configured to display a first interface to a user, wherein the first interface comprises multiple semantic tags, the multiple semantic tags describe different style information of different image sequences, and the multiple semantic tags correspond one-to-one to the style information;
    the acquisition unit is specifically configured to determine a first semantic tag from the multiple semantic tags based on a second operation of the user; and
    the acquisition unit is specifically configured to determine the first style information based on the first semantic tag.
  18. The device according to any one of claims 11 to 17, characterized in that the generation unit is specifically configured to fuse the first style information with the motion information to obtain a first motion feature; and
    the generation unit is specifically configured to acquire the second image sequence based on the first motion feature.
  19. The device according to any one of claims 11 to 18, characterized in that the motion information comprises one or more of the following: a facial expression sequence and a body image sequence.
  20. The device according to any one of claims 11 to 19, characterized in that the data processing device further comprises:
    a rendering unit, configured to render the second image sequence onto a virtual object to obtain an animation.
  21. A data processing device, characterized by comprising a processor coupled to a memory, wherein the memory is configured to store a program or instructions, and when the program or instructions are executed by the processor, the data processing device performs the method according to any one of claims 1 to 10.
  22. A computer storage medium, characterized by comprising computer instructions which, when run on a data processing device, cause the data processing device to perform the method according to any one of claims 1 to 10.
  23. A computer program product, characterized in that, when the computer program product is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 10.
PCT/CN2023/103012 2022-09-29 2023-06-28 Data processing method and related device WO2024066549A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211202267.XA CN117808934A (en) 2022-09-29 2022-09-29 Data processing method and related equipment
CN202211202267.X 2022-09-29

Publications (1)

Publication Number Publication Date
WO2024066549A1 true WO2024066549A1 (en) 2024-04-04

Family

ID=90433987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/103012 WO2024066549A1 (en) 2022-09-29 2023-06-28 Data processing method and related device

Country Status (2)

Country Link
CN (1) CN117808934A (en)
WO (1) WO2024066549A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018132855A (en) * 2017-02-14 2018-08-23 国立大学法人電気通信大学 Image style conversion apparatus, image style conversion method and image style conversion program
CN110909790A (en) * 2019-11-20 2020-03-24 Oppo广东移动通信有限公司 Image style migration method, device, terminal and storage medium
CN110956654A (en) * 2019-12-02 2020-04-03 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium
CN111667399A (en) * 2020-05-14 2020-09-15 华为技术有限公司 Method for training style migration model, method and device for video style migration
CN112164130A (en) * 2020-09-07 2021-01-01 北京电影学院 Video-animation style migration method based on depth countermeasure network
CN112967174A (en) * 2021-01-21 2021-06-15 北京达佳互联信息技术有限公司 Image generation model training method, image generation device and storage medium

Also Published As

Publication number Publication date
CN117808934A (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US11741668B2 (en) Template based generation of 3D object meshes from 2D images
KR102503413B1 (en) Animation interaction method, device, equipment and storage medium
US20210174072A1 (en) Microexpression-based image recognition method and apparatus, and related device
KR101306221B1 (en) Method and apparatus for providing moving picture using 3d user avatar
CN110555896B (en) Image generation method and device and storage medium
CN111553267B (en) Image processing method, image processing model training method and device
JP2022503647A (en) Cross-domain image conversion
US11514638B2 (en) 3D asset generation from 2D images
WO2023284435A1 (en) Method and apparatus for generating animation
WO2024051445A1 (en) Image generation method and related device
CN113362263A (en) Method, apparatus, medium, and program product for changing the image of a virtual idol
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
WO2024066549A1 (en) Data processing method and related device
Usman et al. Skeleton-based motion prediction: A survey
CN115690276A (en) Video generation method and device of virtual image, computer equipment and storage medium
CN114529635A (en) Image generation method, device, storage medium and equipment
CN117152843B (en) Digital person action control method and system
CN115631285B (en) Face rendering method, device, equipment and storage medium based on unified driving
WO2023207391A1 (en) Virtual human video generation method, and apparatus
CN115471618B (en) Redirection method, redirection device, electronic equipment and storage medium
CN117252791A (en) Image processing method, device, electronic equipment and storage medium
Rahman et al. Implementation of diffusion model in realistic face generation
CN117011430A (en) Game resource processing method, apparatus, device, storage medium and program product
CN116775179A (en) Virtual object configuration method, electronic device and computer readable storage medium