CN116882482A - Training of virtual image generation model and virtual image generation method and device - Google Patents

Training of virtual image generation model and virtual image generation method and device

Info

Publication number
CN116882482A
Authority
CN
China
Prior art keywords
avatar
image
model
generation module
generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310787077.7A
Other languages
Chinese (zh)
Inventor
李�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310787077.7A
Publication of CN116882482A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/094 Adversarial learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure provides a training method for an avatar generation model, and a method and device for generating an avatar, relating to artificial intelligence fields such as computer vision, augmented reality, virtual reality, and deep learning, and applicable to scenarios such as the metaverse and digital humans. The training method of the avatar generation model comprises the following steps: acquiring training data; constructing a neural network model comprising a vector generation module, an attribute generation module, a density generation module, a color generation module and an image generation module; obtaining sample Gaussian noise according to sample information; and training the neural network model by using the sample Gaussian noise and the avatar standard image to obtain an avatar generation model. The avatar generation method includes: acquiring input information to be processed, and obtaining Gaussian noise to be processed according to the input information to be processed; and inputting the Gaussian noise to be processed into the avatar generation model, and obtaining the avatar corresponding to the input information to be processed according to the output result of the avatar generation model.

Description

Training of virtual image generation model and virtual image generation method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like, and can be applied to scenarios such as the metaverse and digital humans. Provided are a training method of an avatar generation model, a method and a device for generating an avatar, an electronic device, and a readable storage medium.
Background
In the prior art, generating an avatar usually requires professional personnel to perform operations such as geometric modeling and texture mapping of the avatar, so the prior art suffers from low generation efficiency and high generation cost when generating avatars.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided a training method of an avatar generation model, including: acquiring training data, wherein the training data comprises sample information and an avatar standard image corresponding to the sample information; constructing a neural network model comprising a vector generation module, an attribute generation module, a density generation module, a color generation module and an image generation module, wherein the vector generation module is used for generating a hidden vector according to Gaussian noise, the attribute generation module is used for generating avatar attribute information of an avatar according to the hidden vector, the density generation module is used for generating image density information of the avatar according to the hidden vector, the color generation module is used for generating image color information of the avatar according to the hidden vector, and the image generation module is used for generating an avatar prediction image according to the avatar attribute information, the image density information and the image color information; obtaining sample Gaussian noise according to the sample information; and training the neural network model by using the sample Gaussian noise and the avatar standard image to obtain an avatar generation model.
According to a second aspect of the present disclosure, there is provided a method of generating an avatar, including: acquiring input information to be processed, and acquiring Gaussian noise to be processed according to the input information to be processed; inputting the Gaussian noise to be processed into an avatar generation model, and obtaining an avatar corresponding to the input information to be processed according to an output result of the avatar generation model.
According to a third aspect of the present disclosure, there is provided a training apparatus of an avatar generation model, comprising: an acquisition unit configured to acquire training data, wherein the training data comprises sample information and an avatar standard image corresponding to the sample information; a construction unit configured to construct a neural network model comprising a vector generation module, an attribute generation module, a density generation module, a color generation module and an image generation module, wherein the vector generation module is used for generating a hidden vector according to Gaussian noise, the attribute generation module is used for generating avatar attribute information of an avatar according to the hidden vector, the density generation module is used for generating image density information of the avatar according to the hidden vector, the color generation module is used for generating image color information of the avatar according to the hidden vector, and the image generation module is used for generating an avatar prediction image according to the avatar attribute information, the image density information and the image color information; a first processing unit configured to obtain sample Gaussian noise according to the sample information; and a training unit configured to train the neural network model by using the sample Gaussian noise and the avatar standard image to obtain an avatar generation model.
According to a fourth aspect of the present disclosure, there is provided an avatar generation apparatus including: a second processing unit configured to acquire input information to be processed and obtain Gaussian noise to be processed according to the input information to be processed; and a generating unit configured to input the Gaussian noise to be processed into an avatar generation model and obtain an avatar corresponding to the input information to be processed according to an output result of the avatar generation model.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described above.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
According to the technical scheme of the present disclosure, the neural network model comprising different generation modules is trained by using the sample Gaussian noise obtained according to the sample information, so that the finally obtained avatar generation model can generate an avatar image according to input Gaussian noise, which simplifies the steps of generating the avatar image, improves the generation efficiency of the avatar image, and enhances the realism of the generated avatar image.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing the training method of an avatar generation model or the avatar generation method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in fig. 1, the training method of the avatar generation model of the present embodiment specifically includes the following steps:
S101, acquiring training data, wherein the training data comprises sample information and an avatar standard image corresponding to the sample information;
S102, constructing a neural network model comprising a vector generation module, an attribute generation module, a density generation module, a color generation module and an image generation module, wherein the vector generation module is used for generating a hidden vector according to Gaussian noise, the attribute generation module is used for generating avatar attribute information of an avatar according to the hidden vector, the density generation module is used for generating image density information of the avatar according to the hidden vector, the color generation module is used for generating image color information of the avatar according to the hidden vector, and the image generation module is used for generating an avatar prediction image according to the avatar attribute information, the image density information and the image color information;
S103, obtaining sample Gaussian noise according to the sample information;
and S104, training the neural network model by using the sample Gaussian noise and the avatar standard image to obtain an avatar generation model.
According to the training method of the avatar generation model of this embodiment, the neural network model comprising different generation modules is trained by using the sample Gaussian noise obtained according to the sample information, so that the finally obtained avatar generation model can generate an avatar image according to input Gaussian noise, which simplifies the steps of generating the avatar image, improves the generation efficiency of the avatar image, and enhances the realism of the generated avatar image.
In the training data obtained in S101, the sample information may be text information or image information; the avatar standard image corresponding to the sample information is a pre-acquired ground-truth image of the avatar, and the avatar may be a cartoon figure, a face image, a human body image, etc.
That is, the avatar generation model trained in this embodiment can output a corresponding avatar image according to the input text information or image information; for example, if the input information is the text "a short-haired man wearing glasses", the avatar generation model of this embodiment outputs an image of a short-haired man wearing glasses.
It is understood that there may be one or more avatar standard images corresponding to the sample information acquired in S101, with different avatar standard images corresponding to different viewing angles.
In this embodiment, the avatar standard image obtained in S101 may further include preset content such as a preset expression, preset emotion, preset action, or preset clothing, so that the avatar in the image generated by the avatar generation model correspondingly has the same preset content.
In this embodiment, after the sample information and the corresponding avatar standard image are obtained in S101, S102 is performed to construct the neural network model containing the vector generation module, the attribute generation module, the density generation module, the color generation module and the image generation module.
In the neural network model constructed in S102, the vector generation module is configured to generate a hidden vector according to the input Gaussian noise (the Gaussian noise is obtained according to the sample information); the generated hidden vector contains the semantic information of the sample information, which implicitly reflects the text or image features contained in the sample information.
In the neural network model constructed in S102, the attribute generation module is used for generating the avatar attribute information of the avatar according to the hidden vector generated by the vector generation module; the generated avatar attribute information includes information such as the size of the avatar in the image, its position in the image, and its orientation in the image.
In the neural network model constructed in the step S102, the density generation module is configured to generate image density information of the avatar according to the hidden vector generated by the vector generation module; the generated image density information includes density values of all pixels in the finally generated predicted image, wherein the density value of the pixel is 1, which indicates that the pixel belongs to the avatar, and the density value of the pixel is 0, which indicates that the pixel does not belong to the avatar (i.e., the pixel belongs to the background in the image).
In the neural network model constructed in the step S102, the color generation module is configured to generate image color information of the avatar according to the hidden vector generated by the vector generation module; the image color information includes color values of all pixels in the finally generated predicted image, and the color values of the pixels represent colors corresponding to the pixels in the predicted image.
In the neural network model constructed in S102, the image generation module is configured to generate an avatar prediction image according to the avatar attribute information generated by the attribute generation module, the image density information generated by the density generation module, and the image color information generated by the color generation module; the generated avatar prediction image includes a colored avatar and a background.
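To make the data flow between these five modules concrete, the following is a minimal PyTorch sketch of the generator. All layer sizes, the attribute parameterization, and the density-times-color compositing rule are illustrative assumptions (the patent does not specify architectures), and the attribute-driven placement of the avatar in the image is omitted for brevity.

```python
import torch
import torch.nn as nn

class AvatarGenerator(nn.Module):
    """Sketch of the five-module generator: vector, attribute, density,
    color, and image generation. Sizes are illustrative assumptions."""

    def __init__(self, noise_dim=128, hidden_dim=256, image_size=64):
        super().__init__()
        self.image_size = image_size
        # Vector generation module: Gaussian noise -> hidden vector.
        self.vector_gen = nn.Sequential(
            nn.Linear(noise_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Attribute generation module: size, position, and orientation of the avatar.
        self.attr_gen = nn.Linear(hidden_dim, 5)  # e.g. scale, x, y, yaw, pitch
        # Density generation module: one density value per pixel
        # (1 = pixel belongs to the avatar, 0 = pixel belongs to the background).
        self.density_gen = nn.Sequential(
            nn.Linear(hidden_dim, image_size * image_size), nn.Sigmoid(),
        )
        # Color generation module: one RGB color value per pixel.
        self.color_gen = nn.Sequential(
            nn.Linear(hidden_dim, image_size * image_size * 3), nn.Sigmoid(),
        )

    def forward(self, noise):
        h = self.vector_gen(noise)                      # hidden vector
        attrs = self.attr_gen(h)                        # avatar attribute information
        b, s = noise.shape[0], self.image_size
        density = self.density_gen(h).view(b, 1, s, s)  # image density information
        color = self.color_gen(h).view(b, 3, s, s)      # image color information
        # Image generation module: composite the colored avatar over the background.
        image = density * color                         # avatar prediction image
        return image, attrs, density, color
```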
It can be understood that, since the generated avatar image may be used to generate a three-dimensional avatar, and pixels at the same location in the three-dimensional avatar may have different colors under different viewing angles, in the neural network model constructed by the embodiment of S102, the color generation module may further generate image color information of the avatar under the target viewing angle according to the two kinds of information, namely, the hidden vector generated by the vector generation module and the target viewing angle, so that the image generation module generates a predicted image of the avatar under the target viewing angle.
The target viewing angle in this embodiment may be one or more, for example, one or more of a front viewing angle, a rear viewing angle, an upper viewing angle, a lower viewing angle, a left viewing angle, and a right viewing angle; if the target viewing angle is only one, the image generating module in the embodiment generates only one virtual image prediction image corresponding to the target viewing angle; if there are a plurality of target views, the image generation module in this embodiment generates avatar prediction images corresponding to different target views, respectively.
The target viewing angle in this embodiment may be one or more preset viewing angles, or one or more viewing angles input by the input end may be used as the target viewing angle.
The neural network model constructed in S102 may further include a discrimination module, which is configured to distinguish the avatar standard image from the avatar prediction image; training of the neural network model stops when the discrimination module can no longer tell whether an input image is a standard image or a prediction image. Specifically, the avatar standard image and the avatar prediction image are input into the discrimination module, a loss function value is calculated according to the output result of the discrimination module, and the parameters of each module in the neural network model are adjusted by using the calculated loss function value until the neural network model converges.
That is, this embodiment constructs the neural network model based on the generative adversarial network (GAN) architecture: the vector generation module, the attribute generation module, the density generation module, the color generation module and the image generation module together form the generator of the GAN, and the discrimination module serves as the discriminator with which the training of the neural network model is completed.
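Below is a minimal sketch of one adversarial update under this GAN framing; it assumes the generator sketched above (which returns the image first) and a non-saturating binary cross-entropy loss, which the patent does not specify.

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, g_opt, d_opt, noise, standard_images):
    """One GAN update: train the discrimination module to separate standard
    images from predicted images, then train the generator to fool it."""
    fake_images, *_ = generator(noise)  # avatar prediction images

    # Discriminator step (generator frozen via detach).
    d_real = discriminator(standard_images)
    d_fake = discriminator(fake_images.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: predicted images should be judged "real".
    d_fake = discriminator(fake_images)
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```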
In this embodiment, after the neural network model is constructed in S102, S103 is performed to obtain sample Gaussian noise according to the sample information.
In this embodiment, when S103 is executed to obtain sample Gaussian noise according to the sample information, the sample information may be input into a noise generation model, and the output result of the noise generation model is used as the sample Gaussian noise; the noise generation model in this embodiment may be a pre-trained model for outputting Gaussian noise based on input text information or image information.
In this embodiment, when S103 is executed to obtain sample Gaussian noise according to the sample information, the sample information may first be converted into an embedded vector, for example by using a CLIP model; the embedded vector obtained by the conversion is then embedded with initial Gaussian noise (for example, random Gaussian noise), and the embedding result is used as the sample Gaussian noise.
That is, in this embodiment, the neural network model is trained by converting the sample information into sample Gaussian noise, which increases the perturbation of the sample information, thereby improving the training effect of the model and making the avatar generation model obtained by training more robust when generating avatar images.
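A minimal sketch of this conversion is shown below. It assumes OpenAI's clip package as the text encoder; the linear projection and the additive fusion with the initial noise are assumptions, since the text only states that the embedded vector is embedded with initial Gaussian noise.

```python
import torch
import clip  # OpenAI's CLIP package, assumed here as the text encoder

def sample_gaussian_noise(text, clip_model, projection, device="cpu"):
    """Convert sample information (text) into sample Gaussian noise:
    CLIP embedding -> projection -> fuse with random initial noise."""
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        embed = clip_model.encode_text(tokens).float()  # embedded vector
    initial_noise = torch.randn(1, projection.out_features, device=device)
    return projection(embed) + initial_noise            # sample Gaussian noise

# Illustrative usage:
# model, _ = clip.load("ViT-B/32", device="cpu")
# projection = torch.nn.Linear(512, 128)  # 512 = ViT-B/32 text embedding size
# noise = sample_gaussian_noise("a short-haired man wearing glasses", model, projection)
```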
In this embodiment, after the sample Gaussian noise is obtained in S103, S104 is performed to train the neural network model using the sample Gaussian noise and the avatar standard image, so as to obtain the avatar generation model.
In this embodiment, when S104 is executed to train the neural network model using the sample Gaussian noise and the avatar standard image to obtain the avatar generation model, the following optional implementation may be adopted: inputting the sample Gaussian noise into the neural network model to obtain the avatar prediction image output by the neural network model; calculating a loss function value according to the avatar prediction image and the avatar standard image; and updating the parameters of each module in the neural network model according to the calculated loss function value to obtain the avatar generation model.
If the neural network model includes a discrimination module, when S104 is executed to calculate the loss function value according to the avatar prediction image and the avatar standard image, this embodiment may input the avatar prediction image and the avatar standard image into the discrimination module and calculate the loss function value according to the output result of the discrimination module.
When executing S104, this embodiment may determine that the neural network model converges when the loss function values obtained over a preset number of iterations converge; convergence may also be determined when the obtained loss function value reaches a preset value, or when the number of training iterations exceeds a preset number; once the neural network model is determined to have converged, its training is considered complete, and the avatar generation model is obtained.
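A compact sketch of these stopping rules follows; the window size and thresholds are assumed hyperparameters, and model_step stands in for one parameter update that returns the loss value.

```python
def train_until_converged(model_step, data_loader,
                          max_epochs=100,       # preset number of training rounds
                          loss_threshold=0.05,  # preset loss value
                          window=10, tolerance=1e-4):
    """Stop when recent losses converge, when the loss reaches a preset
    value, or when the preset number of epochs is exceeded."""
    recent = []
    for _ in range(max_epochs):
        for noise, standard_image in data_loader:
            loss = model_step(noise, standard_image)
            recent = (recent + [loss])[-window:]
        if len(recent) == window and max(recent) - min(recent) < tolerance:
            return  # losses obtained over the window have converged
        if recent and recent[-1] < loss_threshold:
            return  # loss reached the preset value
```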
When executing S104 to calculate the loss function value according to the avatar prediction image and the avatar standard image, this embodiment may first determine the target viewing angle used when generating the avatar prediction image, then determine the target avatar standard image corresponding to that target viewing angle, and finally calculate the loss function value according to the avatar prediction image and the target avatar standard image.
If multiple target viewing angles are included, this embodiment may, when executing S104, calculate the loss function value corresponding to each target viewing angle separately, determine a final loss function value from the calculated loss function values, and update the parameters of the neural network model according to the final loss function value.
That is, when the loss function value for updating the model parameter is calculated, the present embodiment also determines the corresponding standard image in combination with the target view angle used when generating the predicted image, thereby improving the accuracy of the calculated loss function value and the accuracy of the model parameter updating.
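A minimal sketch of this per-view loss aggregation is given below; it assumes a generator that accepts a target_view argument (the view-conditioned color module described earlier), and averaging the per-view losses is an assumption, since the text does not specify how the final loss function value is formed.

```python
def multi_view_loss(generator, loss_fn, noise, standard_images_by_view):
    """One loss per target view, each against the avatar standard image
    for that view, averaged into the final loss function value."""
    losses = []
    for view, standard_image in standard_images_by_view.items():
        predicted = generator(noise, target_view=view)  # per-view prediction
        losses.append(loss_fn(predicted, standard_image))
    return sum(losses) / len(losses)
```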
If the neural network model constructed in S102 includes a discrimination module, the discrimination module needs to be removed when S104 is executed, so that the avatar generation model is obtained from the remaining generation modules; the obtained avatar generation model is used to generate a two-dimensional image of the avatar according to the input information.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure, showing a block diagram of the neural network model constructed in this embodiment. The neural network model consists of a vector generation module, an attribute generation module, a density generation module, a color generation module, an image generation module and a discrimination module, where the discrimination module is removed after training is completed to obtain the avatar generation model. The attribute generation module, the density generation module and the color generation module respectively generate the avatar attribute information, the image density information and the image color information according to the hidden vector generated by the vector generation module, and the image generation module generates the avatar prediction image according to these three kinds of information.
Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in fig. 3, the method for generating an avatar of the present embodiment specifically includes the following steps:
S301, acquiring input information to be processed, and obtaining Gaussian noise to be processed according to the input information to be processed;
S302, inputting the Gaussian noise to be processed into an avatar generation model, and obtaining an avatar corresponding to the input information to be processed according to an output result of the avatar generation model.
That is, this embodiment obtains an avatar corresponding to the input information to be processed according to the Gaussian noise to be processed, which is itself obtained from the input information, and the avatar generation model. This achieves personalized avatar generation, simplifies the avatar generation steps, and improves both the generation efficiency and the realism of the generated avatar.
In the embodiment, when S301 is executed, text information or image information input by the input end may be obtained as input information to be processed.
In the embodiment, when S301 is executed to obtain gaussian noise to be processed according to input information to be processed, the input information to be processed may be input into a noise generation model, so that an output result of the noise generation model is used as the gaussian noise to be processed; the input information to be processed can be converted into an embedded vector and then embedded with the initial Gaussian noise, so that the embedded result is used as the Gaussian noise to be processed.
In this embodiment, after the Gaussian noise to be processed is obtained in S301, S302 is performed to input the Gaussian noise to be processed into the avatar generation model, and the avatar corresponding to the input information to be processed is obtained according to the output result of the avatar generation model.
When S302 is performed, this embodiment may also first determine a target viewing angle and then input the target viewing angle into the avatar generation model together with the Gaussian noise to be processed, so that the avatar generation model outputs the avatar image corresponding to the target viewing angle.
In the embodiment, when S302 is executed, the determined target viewing angles may be one or more than one; the target viewing angle may be one or more preset viewing angles, or one or more viewing angles input by the input terminal may be used as the target viewing angle.
In addition, if the avatar generation model is trained based on the avatar standard images corresponding to different perspectives during training, the present embodiment automatically generates the avatar images corresponding to different target perspectives from the avatar generation model without determining the target perspectives when executing S302.
In the present embodiment, when S302 is performed, the avatar obtained according to the output result of the avatar generation model may be a two-dimensional avatar image or a three-dimensional avatar.
Therefore, when S302 is performed to obtain the avatar corresponding to the input information to be processed according to the output result of the avatar generation model, the following optional implementation may be adopted: obtaining an avatar image sequence from the multiple avatar images corresponding to different target viewing angles output by the avatar generation model; and generating a three-dimensional avatar based on the obtained avatar image sequence, for example by using a pre-trained three-dimensional avatar generation model that produces a three-dimensional avatar from the avatar image sequence.
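Putting the generation method together, the following is a minimal sketch of the full path from input text to three-dimensional avatar; text_to_noise is the noise helper sketched earlier, and to_3d_model stands in for any pre-trained image-sequence-to-3D model, which the text does not name.

```python
import torch

def generate_3d_avatar(text_to_noise, avatar_model, to_3d_model, text,
                       views=("front", "back", "left", "right")):
    """Text -> Gaussian noise to be processed -> one avatar image per
    target view -> avatar image sequence -> three-dimensional avatar."""
    noise = text_to_noise(text)
    with torch.no_grad():
        sequence = [avatar_model(noise, target_view=v) for v in views]
    return to_3d_model(sequence)
```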
After the three-dimensional avatar is generated in S302, this embodiment may place the generated three-dimensional avatar in a preset virtual scene, such as the metaverse, so as to construct the three-dimensional avatar in the metaverse; the generated three-dimensional avatar may also be used as a digital human for news broadcasting, entertainment interaction, etc.
After the three-dimensional avatar is generated in S302, this embodiment may further include the following: determining the use scene of the three-dimensional avatar, such as news broadcasting or entertainment interaction; acquiring a three-dimensional object model corresponding to the determined use scene, such as clothes or a hairstyle; and adding the acquired three-dimensional object model to the three-dimensional avatar, so that the three-dimensional avatar better matches the use scene, as sketched below.
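As a sketch of this scene-dependent step, the mapping below pairs each use scene with three-dimensional object models to attach; the scene names, asset files, and attach() interface are all hypothetical.

```python
SCENE_ASSETS = {  # hypothetical mapping from use scene to 3D object models
    "news_broadcast": ["suit.glb", "neat_hairstyle.glb"],
    "entertainment": ["casual_wear.glb", "headset.glb"],
}

def dress_for_scene(avatar_3d, scene, load_asset):
    """Attach the 3D object models (clothes, hairstyle) that match the
    determined use scene to the three-dimensional avatar."""
    for asset_name in SCENE_ASSETS.get(scene, []):
        avatar_3d.attach(load_asset(asset_name))  # hypothetical interface
    return avatar_3d
```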
Fig. 4 is a schematic diagram according to a fourth embodiment of the present disclosure. As shown in fig. 4, the training apparatus 400 of the avatar generation model of the present embodiment includes:
an acquisition unit 401, configured to acquire training data, where the training data includes sample information and an avatar standard image corresponding to the sample information;
a construction unit 402, configured to construct a neural network model including a vector generation module, an attribute generation module, a density generation module, a color generation module and an image generation module, where the vector generation module is configured to generate a hidden vector according to Gaussian noise, the attribute generation module is configured to generate avatar attribute information of an avatar according to the hidden vector, the density generation module is configured to generate image density information of the avatar according to the hidden vector, the color generation module is configured to generate image color information of the avatar according to the hidden vector, and the image generation module is configured to generate an avatar prediction image according to the avatar attribute information, the image density information and the image color information;
a first processing unit 403, configured to obtain sample gaussian noise according to the sample information;
and a training unit 404, configured to train the neural network model by using the sample Gaussian noise and the avatar standard image, so as to obtain an avatar generation model.
In the training data acquired by the acquisition unit 401, the sample information may be text information or image information; the avatar standard image corresponding to the sample information is a pre-acquired ground-truth image of the avatar, and the avatar may be a cartoon figure, a face image, a human body image, etc.
That is, the avatar generation model trained in the present embodiment can output a corresponding avatar image according to the inputted text information or image information.
It is understood that there may be one or more avatar standard images corresponding to the sample information acquired by the acquisition unit 401, with different avatar standard images corresponding to different viewing angles.
The avatar standard image acquired by the acquisition unit 401 may further include preset content such as a preset expression, preset emotion, preset action, or preset clothing, so that the avatar in the image generated by the avatar generation model correspondingly has the same preset content.
In this embodiment, after the acquisition unit 401 acquires the sample information and the corresponding avatar standard image, the construction unit 402 constructs a neural network model containing a vector generation module, an attribute generation module, a density generation module, a color generation module, and an image generation module.
In the neural network model constructed by the construction unit 402, the vector generation module is configured to generate a hidden vector according to the input Gaussian noise (the Gaussian noise is obtained according to the sample information); the generated hidden vector contains the semantic information of the sample information, which implicitly reflects the text or image features contained in the sample information.
In the neural network model constructed by the construction unit 402, the attribute generation module is used for generating the avatar attribute information of the avatar according to the hidden vector generated by the vector generation module; the generated avatar attribute information includes information such as the size of the avatar in the image, its position in the image, and its orientation in the image.
In the neural network model constructed by the construction unit 402, the density generation module is used for generating image density information of the virtual image according to the hidden vector generated by the vector generation module; the generated image density information includes density values of all pixels in the finally generated predicted image, wherein the density value of the pixel is 1, which indicates that the pixel belongs to the avatar, and the density value of the pixel is 0, which indicates that the pixel does not belong to the avatar (i.e., the pixel belongs to the background in the image).
In the neural network model constructed by the construction unit 402, the color generation module is used for generating image color information of the virtual image according to the hidden vector generated by the vector generation module; the image color information includes color values of all pixels in the finally generated predicted image, and the color values of the pixels represent colors corresponding to the pixels in the predicted image.
In the neural network model constructed by the construction unit 402, the image generation module is configured to generate an avatar prediction image according to the avatar attribute information generated by the attribute generation module, the image density information generated by the density generation module, and the image color information generated by the color generation module; the generated avatar prediction image includes a colored avatar and a background.
It may be appreciated that, since the generated avatar image may be used to generate a three-dimensional avatar, and pixels at the same location in the three-dimensional avatar may have different colors under different viewing angles, the color generation module may generate image color information of the avatar under the target viewing angle according to both the hidden vector generated by the vector generation module and the target viewing angle in the neural network model constructed by the construction unit 402, thereby enabling the image generation module to generate a predicted image of the avatar under the target viewing angle.
The target viewing angle in this embodiment may be one or more, for example, one or more of a front viewing angle, a rear viewing angle, an upper viewing angle, a lower viewing angle, a left viewing angle, and a right viewing angle; if the target viewing angle is only one, the image generating module in the embodiment generates only one virtual image prediction image corresponding to the target viewing angle; if there are a plurality of target views, the image generation module in this embodiment generates avatar prediction images corresponding to different target views, respectively.
The target viewing angle in this embodiment may be one or more preset viewing angles, or one or more viewing angles input by the input end may be used as the target viewing angle.
The neural network model constructed by the construction unit 402 may further include a discrimination module, which is configured to distinguish the avatar standard image from the avatar prediction image; training of the neural network model stops when the discrimination module can no longer tell whether an input image is a standard image or a prediction image. Specifically, the avatar standard image and the avatar prediction image are input into the discrimination module, a loss function value is calculated according to the output result of the discrimination module, and the parameters of each module in the neural network model are adjusted by using the calculated loss function value until the neural network model converges.
That is, the construction unit 402 constructs the neural network model based on the generative adversarial network (GAN) architecture: the vector generation module, the attribute generation module, the density generation module, the color generation module and the image generation module together form the generator of the GAN, and the discrimination module serves as the discriminator with which the training of the neural network model is completed.
In this embodiment, after the construction unit 402 constructs the vector generation module, the attribute generation module, the density generation module, the color generation module, and the image generation module, the first processing unit 403 obtains sample gaussian noise according to sample information.
When the first processing unit 403 obtains sample Gaussian noise according to the sample information, the sample information may be input into a noise generation model, and the output result of the noise generation model is used as the sample Gaussian noise; the noise generation model in this embodiment may be a pre-trained model for outputting Gaussian noise based on input text information or image information.
When the first processing unit 403 obtains the sample Gaussian noise according to the sample information, it may also first convert the sample information into an embedded vector, for example by using a CLIP model; the embedded vector obtained by the conversion is then embedded with initial Gaussian noise (for example, random Gaussian noise), and the embedding result is used as the sample Gaussian noise.
That is, the first processing unit 403 trains the neural network model by converting the sample information into sample Gaussian noise, which increases the perturbation of the sample information, thereby improving the training effect of the model and making the avatar generation model obtained by training more robust in generating avatar images.
In this embodiment, after the first processing unit 403 obtains the sample gaussian noise, the training unit 404 trains the neural network model using the sample gaussian noise and the avatar standard image, so as to obtain the avatar generation model.
When the training unit 404 trains the neural network model using the sample Gaussian noise and the avatar standard image to obtain the avatar generation model, the following optional implementation may be adopted: inputting the sample Gaussian noise into the neural network model to obtain the avatar prediction image output by the neural network model; calculating a loss function value according to the avatar prediction image and the avatar standard image; and updating the parameters of each module in the neural network model according to the calculated loss function value to obtain the avatar generation model.
If the neural network model includes a discrimination module, the training unit 404 may, when calculating the loss function value according to the avatar prediction image and the avatar standard image, input the avatar prediction image and the avatar standard image into the discrimination module and calculate the loss function value according to the output result of the discrimination module.
The training unit 404 may determine that the neural network model converges when the loss function values obtained over a preset number of iterations converge; convergence may also be determined when the obtained loss function value reaches a preset value, or when the number of training iterations exceeds a preset number; once the neural network model is determined to have converged, its training is considered complete, and the avatar generation model is obtained.
The training unit 404 may also determine a target viewing angle used when generating the avatar prediction image, determine a target avatar standard image corresponding to the target viewing angle, and calculate a loss function value based on the avatar prediction image and the target avatar standard image.
If the embodiment includes a plurality of target view angles, the training unit 404 may calculate the loss function value corresponding to each target view angle, determine a final loss function value according to the calculated plurality of loss function values, and update the parameters of the neural network model according to the final loss function value.
That is, when calculating the loss function value for updating the model parameter, the training unit 404 further determines the corresponding standard image in combination with the target view angle used when generating the predicted image, thereby improving the accuracy of the calculated loss function value and the accuracy of the model parameter updating.
If the neural network model constructed by the construction unit 402 includes a discrimination module, the training unit 404 needs to remove the discrimination module included in the neural network model, so as to obtain an avatar generation model by using the remaining generation modules; the obtained avatar generation model is used to generate a two-dimensional image of the avatar according to the input information.
Fig. 5 is a schematic diagram according to a fifth embodiment of the present disclosure. As shown in fig. 5, the avatar generation apparatus 500 of the present embodiment includes:
a second processing unit 501, configured to acquire input information to be processed and obtain Gaussian noise to be processed according to the input information to be processed;
and a generating unit 502, configured to input the Gaussian noise to be processed into an avatar generation model and obtain an avatar corresponding to the input information to be processed according to an output result of the avatar generation model.
The second processing unit 501 may acquire text information or image information input by the input terminal as input information to be processed.
When the second processing unit 501 obtains the gaussian noise to be processed according to the input information to be processed, the input information to be processed may be input into the noise generation model, so that an output result of the noise generation model is used as the gaussian noise to be processed; the input information to be processed can be converted into an embedded vector and then embedded with the initial Gaussian noise, so that the embedded result is used as the Gaussian noise to be processed.
In this embodiment, after the Gaussian noise to be processed is obtained by the second processing unit 501, the generating unit 502 inputs the Gaussian noise to be processed into the avatar generation model and obtains the avatar corresponding to the input information to be processed according to the output result of the avatar generation model.
The generating unit 502 may also first determine a target viewing angle and then input the target viewing angle into the avatar generation model together with the Gaussian noise to be processed, so that the avatar generation model outputs the avatar image corresponding to the target viewing angle.
The target viewing angle determined by the generating unit 502 may be one or more; the target viewing angle may be one or more preset viewing angles, or one or more viewing angles input by the input terminal may be used as the target viewing angle.
In addition, if the avatar generation model is trained based on the avatar standard images corresponding to different perspectives during training, the generation unit 502 automatically generates the avatar images corresponding to different target perspectives from the avatar generation model without determining the target perspectives.
The avatar obtained by the generation unit 502 according to the output result of the avatar generation model may be a two-dimensional avatar image or a three-dimensional avatar.
Therefore, when the generating unit 502 obtains the avatar corresponding to the input information to be processed according to the output result of the avatar generating model, the following alternative implementation methods may be adopted: obtaining an avatar image sequence according to a plurality of avatar images corresponding to different target visual angles output by the avatar generation model; based on the obtained avatar image sequence, a three-dimensional avatar is generated, and the embodiment may use a three-dimensional avatar generation model trained in advance to generate a three-dimensional avatar from the avatar image sequence.
After generating the three-dimensional avatar, the generating unit 502 may place the generated three-dimensional avatar in a preset virtual scene, such as the metaverse, so as to construct the three-dimensional avatar in the metaverse; the generated three-dimensional avatar may also be used as a digital human for news broadcasting, entertainment interaction, etc.
The avatar generation apparatus 500 of the present embodiment may further include an optimizing unit 503 for performing: after the generating unit 502 generates the three-dimensional avatar, determining a usage scene of the three-dimensional avatar; acquiring a three-dimensional object model corresponding to the determined use scene; the acquired three-dimensional object model is added to the three-dimensional avatar so that the three-dimensional avatar to which the three-dimensional object model is added is more matched with the use scene.
In the technical scheme of the present disclosure, the acquisition, storage, and application of the personal information of users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 shows a block diagram of an electronic device for the training method of an avatar generation model or the avatar generation method according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. The RAM 603 may also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, for example, the training of an avatar generation model or the avatar generation method. For example, in some embodiments, the training of the avatar generation model or the avatar generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the above-described training of the avatar generation model or avatar generation method may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the training of the avatar generation model or the avatar generation method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable avatar generation model training or avatar generation apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host; it is a host product in the cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed in this regard.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (21)

1. A training method of an avatar generation model, comprising:
acquiring training data, wherein the training data comprises sample information and an avatar standard image corresponding to the sample information;
constructing a neural network model comprising a vector generation module, an attribute generation module, a density generation module, a color generation module and an image generation module, wherein the vector generation module is used for generating an implicit vector according to Gaussian noise, the attribute generation module is used for generating image attribute information of an avatar according to the implicit vector, the density generation module is used for generating image density information of the avatar according to the implicit vector, the color generation module is used for generating image color information of the avatar according to the implicit vector, and the image generation module is used for generating an avatar prediction image according to the image attribute information, the image density information and the image color information;
obtaining sample Gaussian noise according to the sample information;
and training the neural network model by using the sample Gaussian noise and the avatar standard image to obtain an avatar generation model.
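For orientation, the following is a minimal PyTorch-style sketch of the five-module generator recited in claim 1. All layer choices, dimensions, and names (`AvatarGenerator`, `noise_dim`, and so on) are illustrative assumptions, not the claimed architecture itself.

```python
# Hypothetical skeleton of the claim-1 generator; dimensions are assumptions.
import torch
import torch.nn as nn

class AvatarGenerator(nn.Module):
    def __init__(self, noise_dim=128, latent_dim=256, feat_dim=64):
        super().__init__()
        # Vector generation module: Gaussian noise -> implicit vector.
        self.vector_gen = nn.Sequential(
            nn.Linear(noise_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim))
        # Attribute, density, and color generation modules: each derives one
        # kind of image information from the implicit vector.
        self.attr_gen = nn.Linear(latent_dim, feat_dim)
        self.density_gen = nn.Linear(latent_dim, feat_dim)
        self.color_gen = nn.Linear(latent_dim, feat_dim)
        # Image generation module: fuses the three kinds of information into
        # an avatar prediction image (3 x 64 x 64 in this sketch).
        self.image_gen = nn.Sequential(
            nn.Linear(3 * feat_dim, 3 * 64 * 64), nn.Tanh())

    def forward(self, noise):
        z = self.vector_gen(noise)  # implicit vector
        feats = torch.cat(
            [self.attr_gen(z), self.density_gen(z), self.color_gen(z)],
            dim=-1)
        return self.image_gen(feats).view(-1, 3, 64, 64)
```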
2. The method of claim 1, wherein the color generation module generating image color information of the avatar according to the implicit vector comprises:
determining a target viewing angle;
and generating image color information of the avatar according to the implicit vector and the target viewing angle.
3. The method of claim 1, wherein the training the neural network model using the sample Gaussian noise and the avatar standard image to obtain an avatar generation model comprises:
inputting the sample Gaussian noise into the neural network model to obtain an avatar prediction image output by the neural network model;
calculating a loss function value according to the avatar prediction image and the avatar standard image;
and updating parameters of each module in the neural network model according to the loss function value to obtain the avatar generation model.
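A hedged sketch of the claim-3 training step follows, reusing the `AvatarGenerator` skeleton above. The L1 reconstruction loss and Adam optimizer are assumptions; the claim does not fix a particular loss function or optimizer.

```python
import torch

model = AvatarGenerator()  # sketch from claim 1 above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed optimizer

def train_step(sample_noise, standard_image):
    pred = model(sample_noise)  # avatar prediction image
    # Assumed L1 loss between the prediction and the avatar standard image.
    loss = torch.nn.functional.l1_loss(pred, standard_image)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # updates parameters of each module
    return loss.item()
```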
4. The method of claim 3, wherein the neural network model includes a discrimination module;
the calculating a loss function value according to the avatar prediction image and the avatar standard image includes:
inputting the avatar prediction image and the avatar standard image into the discrimination module;
and calculating the loss function value according to the output result of the discrimination module.
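One way claim 4 could be realized is the usual GAN objective, with the discrimination module scoring the avatar prediction image against the avatar standard image. This sketch assumes a binary-cross-entropy formulation, which the claim does not mandate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

discriminator = nn.Sequential(  # assumed discrimination module
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1))

def adversarial_losses(pred_image, standard_image):
    real_score = discriminator(standard_image)
    fake_score = discriminator(pred_image.detach())
    # Discriminator loss: standard image -> "real", prediction -> "fake".
    d_loss = (F.binary_cross_entropy_with_logits(
                  real_score, torch.ones_like(real_score))
              + F.binary_cross_entropy_with_logits(
                  fake_score, torch.zeros_like(fake_score)))
    # Generator loss: make the discriminator score predictions as real.
    gen_score = discriminator(pred_image)
    g_loss = F.binary_cross_entropy_with_logits(
        gen_score, torch.ones_like(gen_score))
    return d_loss, g_loss
```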
5. The method of claim 3, wherein the calculating a loss function value according to the avatar prediction image and the avatar standard image comprises:
determining a target viewing angle used when generating the avatar prediction image;
determining a target avatar standard image corresponding to the target viewing angle;
and calculating the loss function value according to the avatar prediction image and the target avatar standard image.
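Claim 5 pairs the prediction with the ground-truth image taken from the same target viewing angle. A minimal sketch, assuming the standard images are indexed by angle in a plain dictionary:

```python
import torch

def view_consistent_loss(pred_image, target_angle, standard_images_by_angle):
    # Pick the avatar standard image captured at the same target viewing
    # angle that was used to render the prediction (assumed dict lookup).
    target_standard = standard_images_by_angle[target_angle]
    return torch.nn.functional.l1_loss(pred_image, target_standard)
```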
6. A method of generating an avatar, comprising:
acquiring input information to be processed, and acquiring Gaussian noise to be processed according to the input information to be processed;
inputting the Gaussian noise to be processed into an avatar generation model, and obtaining an avatar corresponding to the input information to be processed according to an output result of the avatar generation model;
wherein the avatar generation model is trained according to the method of any one of claims 1 to 5.
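Claim 6 leaves open how the input information is turned into Gaussian noise to be processed. One illustrative assumption is to seed a normal sampler with a hash of the input, so the same input deterministically maps to the same noise:

```python
import hashlib
import torch

def avatar_from_input(model, input_info: str, noise_dim: int = 128):
    # Hypothetical mapping: hash the input information into a seed, then
    # draw the Gaussian noise to be processed from that seed.
    seed = int(hashlib.sha256(input_info.encode()).hexdigest(), 16) % (2**31)
    gen = torch.Generator().manual_seed(seed)
    noise = torch.randn(1, noise_dim, generator=gen)
    with torch.no_grad():
        return model(noise)  # output result -> avatar image
```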
7. The method of claim 6, wherein the inputting the Gaussian noise to be processed into an avatar generation model comprises:
determining a target viewing angle;
and inputting the target viewing angle and the Gaussian noise to be processed into the avatar generation model.
8. The method of claim 6, wherein obtaining an avatar corresponding to the input information to be processed according to an output result of the avatar generation model comprises:
obtaining an avatar image sequence according to a plurality of avatar images corresponding to different target viewing angles output by the avatar generation model;
and generating a three-dimensional avatar according to the avatar image sequence.
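For claim 8, a sketch of assembling the avatar image sequence. It assumes a view-conditioned model whose forward pass also accepts a target viewing angle (the minimal skeleton above would need that extension, in line with claims 2 and 7); the final step of meshing the sequence into a three-dimensional avatar is outside this sketch.

```python
import torch

def avatar_image_sequence(model, noise, angles):
    frames = []
    with torch.no_grad():
        for angle in angles:
            # Assumed interface: model(noise, angle) renders the avatar
            # at one target viewing angle.
            frames.append(model(noise, angle))
    # Avatar image sequence: (num_views, batch, C, H, W).
    return torch.stack(frames)
```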
9. The method of claim 8, further comprising:
determining a usage scenario of the three-dimensional avatar after generating the three-dimensional avatar;
acquiring a three-dimensional object model corresponding to the usage scenario;
and adding the three-dimensional object model to the three-dimensional avatar.
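Claim 9's scene-dependent decoration could be as simple as a lookup from usage scenario to a three-dimensional object model; the registry, file names, and scene dictionary below are all hypothetical.

```python
# Hypothetical scenario-to-object registry; entries are illustrative only.
SCENE_OBJECTS = {"concert": "microphone.glb", "office": "desk.glb"}

def add_scene_object(avatar_scene: dict, usage_scenario: str) -> dict:
    object_model = SCENE_OBJECTS.get(usage_scenario)
    if object_model is not None:
        # Attach the three-dimensional object model to the avatar's scene.
        avatar_scene.setdefault("props", []).append(object_model)
    return avatar_scene
```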
10. A training apparatus of an avatar generation model, comprising:
an acquisition unit for acquiring training data, wherein the training data comprises sample information and an avatar standard image corresponding to the sample information;
a construction unit for constructing a neural network model comprising a vector generation module, an attribute generation module, a density generation module, a color generation module and an image generation module, wherein the vector generation module is used for generating an implicit vector according to Gaussian noise, the attribute generation module is used for generating image attribute information of an avatar according to the implicit vector, the density generation module is used for generating image density information of the avatar according to the implicit vector, the color generation module is used for generating image color information of the avatar according to the implicit vector, and the image generation module is used for generating an avatar prediction image according to the image attribute information, the image density information and the image color information;
a first processing unit for obtaining sample Gaussian noise according to the sample information;
and a training unit for training the neural network model by using the sample Gaussian noise and the avatar standard image to obtain an avatar generation model.
11. The apparatus of claim 10, wherein the color generation module constructed by the construction unit, when generating the image color information of the avatar according to the implicit vector, specifically performs:
determining a target viewing angle;
and generating image color information of the avatar according to the implicit vector and the target viewing angle.
12. The apparatus of claim 10, wherein the training unit, when training the neural network model using the sample Gaussian noise and the avatar standard image to obtain an avatar generation model, specifically performs:
inputting the sample Gaussian noise into the neural network model to obtain an avatar prediction image output by the neural network model;
calculating a loss function value according to the avatar prediction image and the avatar standard image;
and updating parameters of each module in the neural network model according to the loss function value to obtain the avatar generation model.
13. The apparatus of claim 12, wherein the neural network model constructed by the construction unit includes a discrimination module;
the training unit, when calculating a loss function value from the avatar prediction image and the avatar standard image, specifically performs:
inputting the avatar prediction image and the avatar standard image into the discrimination module;
and calculating the loss function value according to the output result of the discrimination module.
14. The apparatus of claim 12, wherein the training unit, when calculating the loss function value from the avatar prediction image and the avatar standard image, specifically performs:
determining a target viewing angle used when generating the avatar prediction image;
determining a target avatar standard image corresponding to the target viewing angle;
and calculating the loss function value according to the avatar prediction image and the target avatar standard image.
15. An avatar generation apparatus comprising:
a second processing unit for acquiring input information to be processed and obtaining Gaussian noise to be processed according to the input information to be processed;
a generating unit for inputting the Gaussian noise to be processed into an avatar generation model and obtaining an avatar corresponding to the input information to be processed according to an output result of the avatar generation model;
wherein the avatar generation model is trained by the apparatus of any one of claims 10 to 14.
16. The apparatus of claim 15, wherein the generating unit, when inputting the Gaussian noise to be processed into an avatar generation model, specifically performs:
determining a target viewing angle;
and inputting the target viewing angle and the Gaussian noise to be processed into the avatar generation model.
17. The apparatus of claim 15, wherein the generating unit, when obtaining an avatar corresponding to the input information to be processed according to an output result of the avatar generation model, specifically performs:
obtaining an avatar image sequence according to a plurality of avatar images corresponding to different target viewing angles output by the avatar generation model;
and generating a three-dimensional avatar according to the avatar image sequence.
18. The apparatus of claim 17, further comprising an optimization unit configured to perform:
determining a usage scenario of the three-dimensional avatar after the generating unit generates the three-dimensional avatar;
acquiring a three-dimensional object model corresponding to the usage scenario;
and adding the three-dimensional object model to the three-dimensional avatar.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-9.
CN202310787077.7A — Priority/Filing Date: 2023-06-29 — Training of virtual image generation model and virtual image generation method and device — Status: Pending — Publication: CN116882482A (en)

Priority Applications (1)

Application Number: CN202310787077.7A — Priority/Filing Date: 2023-06-29 — Title: Training of virtual image generation model and virtual image generation method and device

Publications (1)

Publication Number: CN116882482A (en) — Publication Date: 2023-10-13

Family ID: 88265453

Family Applications (1)

Application Number: CN202310787077.7A — Title: Training of virtual image generation model and virtual image generation method and device — Priority/Filing Date: 2023-06-29 — Status: Pending

Country Status (1)

CN — CN116882482A (en)


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination