CN117078816A - Virtual image generation method, device, terminal equipment and storage medium - Google Patents

Virtual image generation method, device, terminal equipment and storage medium

Info

Publication number
CN117078816A
Authority
CN
China
Prior art keywords
image
network
video
layer
avatar
Prior art date
Legal status
Pending
Application number
CN202311059569.0A
Other languages
Chinese (zh)
Inventor
李勉
刘世超
严立康
徐坚江
Current Assignee
Avatr Technology Chongqing Co Ltd
Original Assignee
Avatr Technology Chongqing Co Ltd
Priority date
Filing date
Publication date
Application filed by Avatr Technology Chongqing Co Ltd filed Critical Avatr Technology Chongqing Co Ltd
Priority to CN202311059569.0A
Publication of CN117078816A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00: Animation
    • G06T13/20: 3D [Three Dimensional] animation
    • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application relates to the technical field of image processing, and provides a method, an apparatus, a terminal device and a storage medium for generating an avatar. The method uses the image features of an original image corresponding to the avatar to generate a corresponding video through an image diffusion model, and then inputs the video respectively into a codec network contained in the image diffusion model and a ControlNet network for controlling the image diffusion model for processing. After the video has been processed by the encoding layers, intermediate layer and decoding layers of the codec network and by the encoding layers, intermediate layer and zero convolution layers of the ControlNet network, the decoding layer outputs a processed video file, which can be regarded as an avatar carrying the features of the original image. With this arrangement, the user can select the original image corresponding to the avatar to be generated according to personal preference; the resulting avatar is no longer a fixed image and can satisfy the user's personalized needs.

Description

Virtual image generation method, device, terminal equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and apparatus for generating an avatar, a terminal device, and a storage medium.
Background
With the continuous development of artificial intelligence technology, avatars such as intelligent voice assistants have been widely used in various fields. For example, in the vehicle field, an in-vehicle voice assistant can provide a driver with services such as voice navigation, vehicle control and music playing, effectively improving the user's driving experience. However, existing avatars such as in-vehicle voice assistants are generally preset, fixed images and cannot meet users' personalized requirements.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a method, an apparatus, a terminal device, and a storage medium for generating an avatar, which can meet the personalized requirements of users for the avatar.
A first aspect of an embodiment of the present application provides a method for generating an avatar, including:
acquiring an original image corresponding to the virtual image;
generating a first video corresponding to the original image through the trained image diffusion model;
inputting the first video respectively into a codec network contained in the image diffusion model and a ControlNet network for controlling the image diffusion model for processing; the ControlNet network sequentially comprises the same encoding layer as the codec network, the same intermediate layer as the codec network, and a zero convolution layer, and the output of the zero convolution layer is connected to the input of the decoding layer;
outputting the generated avatar through the decoding layer.
The embodiment of the present application uses the image features of the original image corresponding to the avatar to generate a corresponding video through the image diffusion model, and then inputs the video respectively into the codec network contained in the image diffusion model and the ControlNet network for controlling the image diffusion model for processing. After the video has been processed by the encoding layers, intermediate layer and decoding layers of the codec network and by the encoding layers, intermediate layer and zero convolution layers of the ControlNet network, the decoding layer of the codec network outputs a processed video file, which can be regarded as an avatar carrying the features of the original image. With this arrangement, the user can select the original image corresponding to the avatar to be generated according to personal preference, for example an image of a male character or an image of an animated character, and after the original image is input, the avatar corresponding to it is output automatically. The avatar thus obtained is no longer a fixed image and can satisfy the user's personalized needs.
In one implementation of the embodiment of the present application, generating the first video corresponding to the original image through the trained image diffusion model may include:
Performing image semantic segmentation processing on the original image through a SegNet network contained in the image diffusion model to obtain a characteristic image;
converting the characteristic image into a target text through a semantic learning network contained in the image diffusion model;
performing image diffusion processing on the target text and the characteristic image through an image diffusion network contained in the image diffusion model to obtain a target image;
converting the target text into target audio through a text-to-speech network contained in the image diffusion model;
and fusing the target image and the target audio to obtain a first video.
Further, the network structure of the SegNet network may comprise a plurality of network layer groups, each network layer group comprising at least one convolution layer and at least one pooling layer, and a jump connection being added between a first convolution layer and a last pooling layer of each network layer group.
In one implementation of the embodiment of the present application, inputting the first video respectively into the codec network included in the image diffusion model and the ControlNet network for controlling the image diffusion model for processing may include:
acquiring a preset avatar template image;
and taking the first video and the avatar template image as input data, and respectively inputting the input data into the codec network and the ControlNet network for processing.
In another implementation of the embodiment of the present application, inputting the first video respectively into the codec network included in the image diffusion model and the ControlNet network for controlling the image diffusion model for processing may include:
acquiring a prompt text corresponding to the avatar;
and respectively inputting the first video and the prompt text as input data into the codec network and the ControlNet network for processing.
In yet another implementation of the embodiment of the present application, the avatar is an in-vehicle voice assistant avatar of a vehicle, and inputting the first video respectively into the codec network included in the image diffusion model and the ControlNet network for controlling the image diffusion model for processing may include:
acquiring vehicle parameters of the vehicle during driving;
and respectively inputting the first video and the vehicle parameters as input data into the codec network and the ControlNet network for processing.
In one implementation of the embodiment of the present application, outputting the generated avatar through the decoding layer may include:
outputting the second video through the decoding layer;
for each frame of image in the second video, acquiring a range area containing the edge of the avatar in the image, and performing distance metric processing with each pixel point in the range area as input of a distance metric algorithm, so as to increase the distance between the pixel points belonging to the avatar edge and the pixel points not belonging to the avatar edge within the range area, thereby obtaining a processed second video;
And determining the processed second video as the generated avatar.
A second aspect of an embodiment of the present application provides an avatar generating apparatus including:
the original image acquisition module is used for acquiring an original image corresponding to the virtual image;
the video generation module is used for generating a first video corresponding to the original image through the trained image diffusion model;
the video input module is used for respectively inputting the first video into a codec network contained in the image diffusion model and a ControlNet network used for controlling the image diffusion model for processing; the ControlNet network sequentially comprises the same encoding layer as the codec network, the same intermediate layer as the codec network, and a zero convolution layer, and the output of the zero convolution layer is connected to the input of the decoding layer;
and the avatar output module is used for outputting the generated avatar through the decoding layer.
A third aspect of the embodiments of the present application provides a terminal device including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the avatar generation method as provided in the first aspect of the embodiments of the present application when the computer program is executed.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the avatar generation method as provided in the first aspect of the embodiments of the present application.
A fifth aspect of the embodiments of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to perform the avatar generation method as provided in the first aspect of the embodiments of the present application.
It will be appreciated that the advantages of the second to fifth aspects may be found in the relevant description of the first aspect, and are not described here again.
Drawings
Fig. 1 is a flowchart of a method for generating an avatar according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a codec network and a ControlNet network according to an embodiment of the present application;
fig. 3 is a schematic overall flowchart of a method for generating an avatar according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a data processing flow for generating a first video;
fig. 5 is a structural frame diagram of an avatar generation apparatus provided in an embodiment of the present application;
Fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail. Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
At present, when driving, people can use services such as voice navigation, vehicle control and music playing provided by the in-vehicle voice assistant of the vehicle-mounted terminal, which effectively improves the user's driving experience. However, the avatar of the in-vehicle voice assistant is generally a preset fixed image and cannot meet users' personalized requirements. In view of this problem, the embodiments of the present application provide a method, an apparatus, a terminal device and a storage medium for generating an avatar, in which a personalized avatar can be generated through an image processing algorithm, thereby meeting the user's personalized requirements for the avatar. For more specific technical implementation details of the embodiments of the present application, please refer to the embodiments described below.
It should be understood that the execution body of each method embodiment of the present application is a terminal device or server of various types, for example, a mobile phone, a tablet computer, a wearable device, a vehicle controller, a vehicle-mounted terminal, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), or a personal digital assistant (PDA); the specific types of the terminal device and the server are not limited in the embodiments of the present application.
Referring to fig. 1, a method for generating an avatar according to an embodiment of the present application includes:
101. acquiring an original image corresponding to the virtual image;
the execution main body of the embodiment of the application can be a vehicle-mounted terminal of a vehicle or terminal equipment such as a mobile phone/tablet personal computer used by a user. First, an original image corresponding to an avatar is acquired. The user may select an avatar desired to be generated, such as a mature man's avatar, a young woman's avatar, or a cartoon animal avatar, etc., according to his own personalized needs. In actual operation, the terminal device may collect image data and voice data corresponding to different avatars (for example, different sexes, different ages, different skin colors, different hairstyles, etc.) in advance, perform certain preprocessing on the image data and voice data, for example, data cleaning, data standardization, data missing value processing, etc., and then store the preprocessed image data and voice data. After the user selects the avatar to be generated, the terminal device can search the image data of the avatar to obtain multi-frame images as original images.
102. Generating a first video corresponding to the original image through the trained image diffusion model;
after the original image is obtained, the original image can be input into a trained image diffusion model for processing, and after the processing such as feature extraction and image diffusion of the image diffusion model, a video with the features of the original image can be output, and the video is represented by a first video. The image Diffusion model can adopt various existing image Diffusion processing algorithms, such as a Diffusion algorithm, a Stable Diffusion algorithm, a Guided Diffusion algorithm and the like.
In an implementation manner of the embodiment of the present application, the generating, by using the trained image diffusion model, the first video corresponding to the original image may include:
(1) Performing image semantic segmentation processing on the original image through a SegNet network contained in the image diffusion model to obtain a characteristic image;
(2) Converting the characteristic image into a target text through a semantic learning network contained in the image diffusion model;
(3) Performing image diffusion processing on the target text and the characteristic image through an image diffusion network contained in the image diffusion model to obtain a target image;
(4) Converting the target text into target audio through a text-to-speech network contained in the image diffusion model;
(5) And fusing the target image and the target audio to obtain a first video.
The image diffusion model designed in the embodiment of the present application may comprise several sub-networks, such as a SegNet network, a semantic learning network, an image diffusion network, a text-to-speech network and a codec network. After the original image is input into the image diffusion model, the SegNet network performs image semantic segmentation processing on the original image, thereby extracting the corresponding feature image. It should be emphasized that the feature extraction network adopted in the conventional Stable Diffusion model is a U-Net network; in contrast, the embodiment of the present application extends the network architecture of the Stable Diffusion model and introduces the SegNet network into its backbone structure, i.e., the U-Net feature extraction network of the Stable Diffusion model is replaced by the SegNet network. The Stable Diffusion model requires the network to have the same input and output dimensions, and the SegNet network meets this requirement while offering excellent image semantic segmentation capability. The embodiment of the present application therefore uses the SegNet network to perform image semantic segmentation on the original image, which not only extracts the corresponding feature image but also significantly improves the image semantic segmentation effect.
Further, the network structure of the SegNet network may comprise a plurality of network layer groups, each network layer group comprising at least one convolution layer and at least one pooling layer, and a jump connection being added between a first convolution layer and a last pooling layer of each network layer group.
Generally, to obtain a neural network model with better performance, the number of training iterations has to be set correspondingly larger, which commonly leads to problems such as vanishing and exploding gradients. To address this, the embodiment of the present application also improves the network structure of the SegNet network: the convolution layers and pooling layers of the SegNet network are grouped into several network layer groups, each of which contains at least one convolution layer and at least one pooling layer, and a skip connection is added between the first convolution layer and the last pooling layer of each network layer group. By adding skip connections similar to those in a residual network, vanishing and exploding gradients can be avoided as the model is trained over many iterations, and the model converges faster with better performance.
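A minimal PyTorch sketch of one such network layer group is given below. It assumes a residual-style addition, with a strided 1×1 projection on the skip path to match the channel count and the 2× downsampling of the pooling layer; the patent does not specify how the dimensions of the two branches are matched, so this detail is an assumption.

```python
import torch
import torch.nn as nn

class SegNetLayerGroup(nn.Module):
    """One network layer group: convolution layers plus a pooling layer, with a
    skip connection from the group input (before the first convolution) added
    to the output of the last pooling layer."""

    def __init__(self, in_channels: int, out_channels: int, num_convs: int = 2):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(num_convs):
            layers += [
                nn.Conv2d(channels, out_channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            ]
            channels = out_channels
        self.convs = nn.Sequential(*layers)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Skip path: 1x1 strided conv so channels and spatial size match the
        # pooled main branch (an assumed way to realize the "jump connection").
        self.skip = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        main = self.pool(self.convs(x))
        return main + self.skip(x)  # residual-style addition eases gradient flow


if __name__ == "__main__":
    group = SegNetLayerGroup(in_channels=3, out_channels=64)
    out = group(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 64, 128, 128])
```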
After the feature image is obtained, the feature image is converted into a natural language text by using a semantic learning network and is recorded as a target text. The natural language processing model based on the recurrent neural network can be used as a semantic learning network to convert the information of the extracted characteristic images into natural language texts, such as text information of gender, age, skin color, hairstyle, action, expression, emotion and the like. The specific principle of converting the feature image into the natural language text by using the natural language processing model can refer to the prior art, and will not be described herein.
After the target text is obtained, image diffusion processing is performed on the target text and the feature image by using the image diffusion network to obtain a target image. For example, an image diffusion network based on the Stable Diffusion algorithm may be trained, and both the target text and the feature image are input into the image diffusion network for image diffusion processing; the principle of generating an image from text and a feature image in the Stable Diffusion algorithm applies here, and the output target image contains both the image features of the original image and the text features of the target text.
In addition, the target text may be converted into target audio through the text-to-speech network; for example, text-to-speech (TTS) techniques may be used to convert the target text into the corresponding audio, denoted as the target audio.
Finally, the target image and the target audio may be fused to obtain the first video. It can be understood that the original image may comprise multiple frames, each frame yielding a corresponding target image and target audio; the target images and target audio of all frames are fused and then spliced to obtain a video file with audio, namely the first video.
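As a reading aid, the per-frame flow just described can be summarized in the Python sketch below. The callables (segment, describe, diffuse, synthesize, fuse) are hypothetical stand-ins for the sub-networks named in the text, not real library APIs.

```python
from typing import Any, Callable, List

def generate_first_video(
    original_frames: List[Any],
    segment: Callable[[Any], Any],        # SegNet: image -> feature image
    describe: Callable[[Any], str],       # semantic learning: feature image -> target text
    diffuse: Callable[[str, Any], Any],   # image diffusion: (text, feature image) -> target image
    synthesize: Callable[[str], Any],     # text-to-speech: target text -> target audio
    fuse: Callable[[List[Any], List[Any]], Any],  # fuse frames and audio into one video
) -> Any:
    """Run the per-frame pipeline described above, then fuse the results."""
    target_images, target_audios = [], []
    for frame in original_frames:
        feature_image = segment(frame)
        target_text = describe(feature_image)
        target_images.append(diffuse(target_text, feature_image))
        target_audios.append(synthesize(target_text))
    # Fuse the per-frame target images and audio, then splice them into a
    # single video file with sound (the first video).
    return fuse(target_images, target_audios)
```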
In an implementation manner of the embodiment of the present application, the method may further include:
acquiring a first audio corresponding to the virtual image;
the fusing the target image and the target audio to obtain the first video may include:
and fusing the target image, the target audio and the first audio to obtain a first video.
As described above, each avatar may previously store corresponding image data and voice data, and the voice data includes audio corresponding to each avatar. After the user selects the avatar desired to be generated, the audio corresponding to the avatar may be acquired and represented by a first audio, for example, if the avatar desired to be generated by the user is a mature man's avatar, the first audio is the sound of a mature man, if the avatar desired to be generated by the user is a young woman's avatar, the first audio is the sound of a young woman, and so on. When video synthesis is performed, the first audio and the target audio can be overlapped to obtain overlapped audio, then the overlapped audio and the target image are fused, and finally the first video is synthesized. By such arrangement, the finally generated avatar will have a personalized sound matching the avatar, thereby further enhancing the user's experience of viewing the avatar.
In one implementation manner of the embodiment of the present application, after converting the feature image into the target text, the method may further include:
and carrying out text emotion analysis processing on the target text to obtain an emotion tag.
The fusing the target image and the target audio to obtain the first video may include:
(1) Acquiring a second audio corresponding to the emotion label;
(2) And fusing the target image, the target audio and the second audio to obtain a first video.
If it is desired to combine the avatar's sound with its emotion and further improve the avatar's degree of intelligence and personalization, text emotion analysis may be performed on the target text to obtain an emotion tag, such as happy, joyful, excited, sad or low-spirited. An attention-based neural network with memory may be used to perform the emotion analysis on the target text and obtain the corresponding emotion tag; the specific principle of obtaining an emotion tag from text with an attention-based neural network can be found in the prior art and is not described here. The terminal device may store in advance the audio corresponding to each emotion tag, for example a cheerful tone for the happy tag, a sobbing tone for the sad tag, and so on. After the emotion tag is obtained through text emotion analysis of the target text, the audio corresponding to the emotion tag can be retrieved and denoted as a second audio. The target image, the target audio and the second audio are then fused to obtain the first video: the second audio and the target audio may first be superimposed to obtain a superimposed audio, and the superimposed audio is then fused with the target image to synthesize the first video, as sketched below. In this way, the sound of the finally generated avatar is associated with its emotion; for example, if the emotion tag obtained through text emotion analysis is happy, the generated avatar will speak in a happy tone.
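The following NumPy sketch illustrates only the superposition step, assuming the pre-stored emotion clip (second audio) and the target audio are mono waveforms in [-1, 1] at the same sample rate; the mix ratio is an illustrative parameter, not from the patent.

```python
import numpy as np

def superimpose_emotion_audio(target_audio: np.ndarray,
                              second_audio: np.ndarray,
                              mix_ratio: float = 0.3) -> np.ndarray:
    """Overlay the emotion clip onto the target audio over the overlapping part."""
    n = min(len(target_audio), len(second_audio))
    mixed = target_audio[:n].astype(np.float32) + mix_ratio * second_audio[:n].astype(np.float32)
    mixed = np.concatenate([mixed, target_audio[n:].astype(np.float32)])
    # Rescale if the superposition pushed the signal past full scale (clipping)
    peak = np.abs(mixed).max()
    return mixed if peak <= 1.0 else mixed / peak
```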
103. The first video is respectively input into a codec network contained in the image diffusion model and a ControlNet network for controlling the image diffusion model for processing;
After the first video is obtained, it may be further optimized to generate a realistic, personalized avatar. Specifically, the first video may be input respectively into the codec network contained in the image diffusion model and the ControlNet network for controlling the image diffusion model for processing. ControlNet is an extension of Stable Diffusion developed by Stanford University researchers, which lets authors easily control objects in AI-generated images and videos. It controls image generation according to various conditions such as edge detection, sketch processing or human body pose, and can be summarized as a simple fine-tuning method for Stable Diffusion.
The codec network may sequentially include encoding layers, an intermediate layer and decoding layers, and the ControlNet network may sequentially include the same encoding layers as the codec network, the same intermediate layer as the codec network, and zero convolution layers, with the output of each zero convolution layer connected to the input of the corresponding decoding layer. Fig. 2 is a schematic structural diagram of the codec network and the ControlNet network according to an embodiment of the present application. On the left of fig. 2 is the codec network of the image diffusion model, which sequentially comprises several encoding layers, one intermediate layer, and decoding layers corresponding to the encoding layers. On the right of fig. 2 is the ControlNet network; to keep the output of the image diffusion model stable and only fine-tune it, the ControlNet network contains the same encoding layers and the same intermediate layer as the codec network, but has no decoding layers. Instead, it has several zero convolution layers, which are 1×1 convolutions with zero-initialized weights and biases; using zero convolution layers keeps the image diffusion model consistent with its training data. The data flow follows the arrows in fig. 2: the first video is input respectively into the codec network and the ControlNet network; one path of data passes sequentially through the encoding layers, the intermediate layer and the decoding layers of the codec network, while the other path passes sequentially through the encoding layers, the intermediate layer and the zero convolution layers of the ControlNet network, and the data processed by each zero convolution layer is fed into the corresponding decoding layer of the codec network. The final output is produced by the last decoding layer of the codec network. This output is essentially a video file, which can be regarded as an avatar carrying the features of the original image.
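To make the zero convolution idea concrete, the PyTorch sketch below shows one decoding step with a zero-initialized 1×1 convolution on the ControlNet branch. The block structure is a simplified assumption for illustration; the actual encoding, intermediate and decoding layers of the patent's codec network are not reproduced here.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution whose weights and bias start at zero, so the ControlNet
    branch contributes nothing at the start of fine-tuning."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledDecoderStep(nn.Module):
    """One decoding step: the ControlNet branch output passes through a zero
    convolution and is added to the input of the corresponding decoder block.
    `decoder_block` and `control_block` stand in for copies of the codec
    network's layers; their internals are assumptions, not the patent's layers."""

    def __init__(self, decoder_block: nn.Module, control_block: nn.Module, channels: int):
        super().__init__()
        self.decoder_block = decoder_block
        self.control_block = control_block
        self.zero = zero_conv(channels)

    def forward(self, decoder_feat: torch.Tensor, control_feat: torch.Tensor) -> torch.Tensor:
        control_out = self.control_block(control_feat)
        # At initialization the zero conv outputs all zeros, so the frozen
        # diffusion model's behaviour is preserved until training updates it.
        return self.decoder_block(decoder_feat + self.zero(control_out))


if __name__ == "__main__":
    step = ControlledDecoderStep(nn.Conv2d(64, 64, 3, padding=1),
                                 nn.Conv2d(64, 64, 3, padding=1), channels=64)
    out = step(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```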
In one implementation of the embodiment of the present application, inputting the first video respectively into the codec network included in the image diffusion model and the ControlNet network for controlling the image diffusion model for processing may include:
(1) Acquiring a preset avatar template image;
(2) Taking the first video and the avatar template image as input data, and respectively inputting the input data into the codec network and the ControlNet network for processing.
When the avatar is generated from the first video, a preset avatar template image can be obtained, and then the avatar template image and the first video are fused through the codec network and the ControlNet network; that is, the image data are analyzed and processed using the ControlNet technique, and a realistic, personalized avatar is finally produced. In particular, the user may have personalized requirements for the avatar, such as gender, appearance, age, style, motion, expression, clothing, hairstyle and accessories. Corresponding avatar template images may be constructed and stored separately for each such requirement, for example avatar template images for different genders and appearances, different ages and styles, different costumes and accessories, or different actions and expressions. In addition, the user can customize the avatar template image, for example by uploading a photograph of himself or another person as the avatar template image, so that the generated avatar contains that person's appearance features. After the avatar template image is obtained, the avatar template image and the first video are used as input data and input respectively into the codec network and the ControlNet network for processing, finally generating a corresponding avatar carrying the personalized features of the avatar template image.
In another implementation of the embodiment of the present application, inputting the first video respectively into the codec network included in the image diffusion model and the ControlNet network for controlling the image diffusion model for processing may include:
(1) Acquiring a prompt text corresponding to the avatar;
(2) Respectively inputting the first video and the prompt text as input data into the codec network and the ControlNet network for processing.
In this embodiment, the user may preset a prompt text corresponding to each avatar. For example, if the avatar currently to be generated is an in-vehicle voice assistant avatar, some prompts commonly used by the in-vehicle voice assistant can be acquired as the prompt text. When the avatar is generated using the first video, the first video and the prompt text may be input as input data respectively into the codec network and the ControlNet network of the image diffusion model for processing. With this arrangement, the text of the prompt is attached to the output avatar, so that the user can see the corresponding prompt more intuitively.
In yet another implementation of the embodiment of the present application, the avatar is an in-vehicle voice assistant avatar of a vehicle, and inputting the first video respectively into the codec network included in the image diffusion model and the ControlNet network for controlling the image diffusion model for processing may include:
(1) Acquiring vehicle parameters of the vehicle during driving;
(2) Respectively inputting the first video and the vehicle parameters as input data into the codec network and the ControlNet network for processing.
If the avatar to be generated is an in-vehicle voice assistant avatar of a vehicle, vehicle parameters collected while the vehicle is driving can also be incorporated into the generated in-vehicle voice assistant avatar. Specifically, the vehicle-mounted terminal can acquire various vehicle parameters during driving, such as the vehicle's position, speed, acceleration and model. When the avatar is generated using the first video, the first video and the vehicle parameters may be input as input data respectively into the codec network and the ControlNet network of the image diffusion model for processing; a sketch of one way to pack these parameters into a conditioning input follows below. With this arrangement, the output avatar carries vehicle parameter information such as the vehicle speed; for example, the information can be displayed near the finally generated in-vehicle voice assistant avatar, so that the user can see the current vehicle parameters while viewing the avatar, improving the driving experience.
104. The generated avatar is outputted through a decoding layer of the codec network.
After the first video is input respectively into the codec network contained in the image diffusion model and the ControlNet network for controlling the image diffusion model, the processed video file can be output through the decoding layer of the codec network, and this video file can be regarded as the generated avatar. The generated avatar may be presented to the user or to an application system for subsequent analysis and processing; for example, it may be displayed on the screen of the vehicle-mounted terminal or of the user's mobile phone, and the user may evaluate it and give feedback, covering aspects such as the avatar's realism and personalization.
In one implementation of the embodiment of the present application, outputting the generated avatar through the decoding layer may include:
(1) Outputting the second video through the decoding layer;
(2) For each frame of image in the second video, acquiring a range area containing the edge of the avatar in the image, and performing distance metric processing with each pixel point in the range area as input of a distance metric algorithm, so as to increase the distance between the pixel points belonging to the avatar edge and the pixel points not belonging to the avatar edge within the range area, thereby obtaining a processed second video;
(3) And determining the processed second video as the generated avatar.
Assuming the video file output through the decoding layer is referred to as the second video, the second video may, on the one hand, be output directly as the generated avatar. On the other hand, if the user cares about the details of the edge portion of the generated avatar, distance metric learning may be introduced to process the second video, improving the accuracy of the pixel computation at the avatar's edges so that the overall presentation of the avatar is more realistic and natural. Specifically, for each frame of image in the second video, a range area containing the avatar edge in the image is obtained, and each pixel point in the range area is used as input of a distance metric algorithm for distance metric processing, so as to increase the distance between pixel points belonging to the avatar edge and pixel points not belonging to the avatar edge within the range area, thereby obtaining the processed second video. That is, pixels belonging to the avatar edge are pulled together, and pixels not belonging to the avatar edge are pushed away, so that more accurate edge information of the avatar is learned and a visually clearer, more vivid avatar is obtained; a toy objective of this kind is sketched below. The second video processed by the distance metric algorithm can be output as the final avatar.
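The patent does not name a specific distance metric algorithm; the following is a toy contrastive-style objective illustrating the pull/push idea, assuming per-pixel feature embeddings for the range area are available.

```python
import torch
import torch.nn.functional as F

def edge_contrast_loss(pixel_embed: torch.Tensor,
                       edge_mask: torch.Tensor,
                       margin: float = 1.0) -> torch.Tensor:
    """Toy distance-metric objective over the range area around the avatar edge.

    pixel_embed: (N, D) per-pixel features for the pixels inside the range area
    edge_mask:   (N,) boolean, True for pixels lying on the avatar edge

    Edge pixels are pulled toward their mean embedding, while non-edge pixels
    are pushed at least `margin` away from it."""
    edge_center = pixel_embed[edge_mask].mean(dim=0, keepdim=True)   # (1, D)
    dist = torch.norm(pixel_embed - edge_center, dim=1)              # (N,)
    pull = dist[edge_mask].mean()                    # gather edge pixels together
    push = F.relu(margin - dist[~edge_mask]).mean()  # push non-edge pixels away
    return pull + push


if __name__ == "__main__":
    feats = torch.randn(100, 16)
    mask = torch.zeros(100, dtype=torch.bool)
    mask[:20] = True
    print(edge_contrast_loss(feats, mask))
```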
The embodiment of the present application uses the image features of the original image corresponding to the avatar to generate a corresponding video through the image diffusion model, and then inputs the video respectively into the codec network contained in the image diffusion model and the ControlNet network for controlling the image diffusion model for processing. After the video has been processed by the encoding layers, intermediate layer and decoding layers of the codec network and by the encoding layers, intermediate layer and zero convolution layers of the ControlNet network, the decoding layer of the codec network outputs a processed video file, which can be regarded as an avatar carrying the features of the original image. With this arrangement, the user can select the original image corresponding to the avatar to be generated according to personal preference, for example an image of a male character or an image of an animated character, and after the original image is input, the avatar corresponding to it is output automatically. The avatar thus obtained is no longer a fixed image and can satisfy the user's personalized needs.
Fig. 3 is a schematic overall flowchart of a method for generating an avatar according to an embodiment of the present application. In fig. 3, image data and voice data corresponding to different avatars are first collected in a data acquisition stage; corresponding data preprocessing operations, such as data cleaning, data standardization and missing-value handling, are then performed on the image data and voice data; the image data are next input into a semantic feature extraction module for processing to obtain feature images and converted target text; the data are then fused through the Stable Diffusion algorithm to generate the corresponding first video; after that, the first video and the avatar template image are fused using the codec network and the ControlNet network to generate the final avatar; finally, the generated avatar can be optimized and displayed.
Fig. 4 is a schematic diagram of the data processing flow for generating the first video. In fig. 4, multiple frames of the original image corresponding to the avatar are input, and the feature images corresponding to the original image are extracted through a feature extraction network, which may be a convolutional neural network or a SegNet network. The feature image is then input into a semantic learning module consisting of an image decoder and a text decoder, which convert the feature image into the corresponding text; the converted text and the feature image are input together into the diffusion network for image diffusion processing, generating the corresponding target image. In addition, the converted text is converted into the corresponding target audio through the text-to-speech module, and finally the target image and the target audio are synthesized into the output first video.
In general, the embodiment of the present application combines Stable Diffusion with ControlNet to achieve personalized avatar generation, with the following advantages: (1) more realistic results: the avatar generation system based on Stable Diffusion and ControlNet uses deep neural networks and can better capture the complexity and diversity of image generation; (2) higher degree of personalization: the system can generate personalized avatars according to the user's preferences and needs, for example from a photo uploaded by the user or a selected avatar template; (3) faster generation: the deep neural networks used by the system allow the avatar generation process to be accelerated through parallel computation; (4) stronger scalability: the system can generate new samples through pre-trained models, transfer learning and similar techniques, improving its scalability.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present application.
The above mainly describes a method of generating an avatar, and a device for generating an avatar will be described below.
Referring to fig. 5, an embodiment of an avatar generating apparatus according to an embodiment of the present application includes:
an original image obtaining module 501, configured to obtain an original image corresponding to an avatar;
the video generation module 502 is configured to generate a first video corresponding to the original image through the trained image diffusion model;
a video input module 503, configured to respectively input the first video into a codec network contained in the image diffusion model and a ControlNet network for controlling the image diffusion model for processing; the ControlNet network sequentially comprises the same encoding layer as the codec network, the same intermediate layer as the codec network, and a zero convolution layer, and the output of the zero convolution layer is connected to the input of the decoding layer;
an avatar output module 504, configured to output the generated avatar through the decoding layer.
In one implementation manner of the embodiment of the present application, the video generating module may include:
the feature extraction unit is used for carrying out image semantic segmentation processing on the original image through a SegNet network contained in the image diffusion model to obtain a feature image;
the text conversion unit is used for converting the characteristic image into a target text through a semantic learning network contained in the image diffusion model;
the image diffusion unit is used for performing image diffusion processing on the target text and the characteristic image through an image diffusion network contained in the image diffusion model to obtain a target image;
the audio conversion unit is used for converting the target text into target audio through a text-to-speech network contained in the image diffusion model;
and the video synthesis unit is used for fusing the target image and the target audio to obtain the first video.
Further, the network structure of the SegNet network comprises a plurality of network layer groups, each network layer group comprises at least one convolution layer and at least one pooling layer, and a jump connection is added between the first convolution layer and the last pooling layer of each network layer group.
In one implementation of the embodiment of the present application, the video input module may include:
the avatar template acquisition unit is used for acquiring a preset avatar template image;
and the first processing unit is used for taking the first video and the avatar template image as input data and respectively inputting the input data into the codec network and the ControlNet network for processing.
In another implementation manner of the embodiment of the present application, the video input module may include:
a prompt text acquisition unit for acquiring a prompt text corresponding to the avatar;
and the second processing unit is used for respectively inputting the first video and the prompt text into the codec network and the ControlNet network as input data for processing.
In yet another implementation of an embodiment of the present application, the avatar is an on-board voice assistant avatar of the vehicle; the video input module may include:
a vehicle parameter acquisition unit for acquiring vehicle parameters of the vehicle in the running process;
and the third processing unit is used for respectively inputting the first video and the vehicle parameters into the codec network and the ControlNet network as input data for processing.
In one implementation of the embodiment of the present application, the avatar output module may include:
a video output unit for outputting a second video through the decoding layer;
a distance measurement processing unit, configured to obtain, for each frame of image in the second video, a range area including an edge of an avatar in the image; performing distance measurement processing by taking each pixel point in the range area as input of a distance measurement algorithm to increase the distance between the pixel points belonging to the virtual image edge and the pixel points not belonging to the virtual image edge in the range area, thereby obtaining the processed second video;
and an avatar determining unit configured to determine the processed second video as the generated avatar.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the avatar generation method as described in any one of the above embodiments.
The embodiments of the present application also provide a computer program product which, when run on a terminal device, causes the terminal device to perform a method of generating an avatar as described in any one of the embodiments above.
Fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in fig. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61 and a computer program 62 stored in said memory 61 and executable on said processor 60. The processor 60 implements the steps in the embodiments of the above-described respective avatar generation methods when executing the computer program 62, such as steps 101 to 104 shown in fig. 1. Alternatively, the processor 60, when executing the computer program 62, performs the functions of the modules/units of the apparatus embodiments described above, such as the functions of the modules 501-504 shown in fig. 5.
The computer program 62 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 62 in the terminal device 6.
The processor 60 may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used for storing the computer program and other programs and data required by the terminal device. The memory 61 may also be used for temporarily storing data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the system embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the computer readable medium may include content that is subject to appropriate increases and decreases as required by jurisdictions in which such content is subject to legislation and patent practice, such as in certain jurisdictions in which such content is not included as electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included within the scope of protection of the present application.

Claims (10)

1. A method for generating an avatar, comprising:
acquiring an original image corresponding to the avatar;
generating a first video corresponding to the original image through a trained image diffusion model;
inputting the first video respectively to a codec network contained in the image diffusion model and a ControlNet network for controlling the image diffusion model for processing; wherein the codec network sequentially comprises an encoding layer, a middle layer and a decoding layer, and the ControlNet network sequentially comprises an encoding layer identical to that of the codec network, a middle layer identical to that of the codec network, and a zero convolution layer, the output of the zero convolution layer being connected to the input of the decoding layer;
and outputting the generated avatar through the decoding layer.
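The topology recited in claim 1 can be illustrated with a minimal PyTorch sketch. The block contents, channel counts and sizes below are assumptions for illustration only; only the overall wiring follows the claim: a codec path (encoding layer, middle layer, decoding layer) and a control branch that copies the encoding and middle layers and ends in a zero convolution whose output is added at the decoding layer's input.

import torch
import torch.nn as nn

def zero_conv(channels):
    # 1x1 convolution initialized to all zeros, so the control branch
    # contributes nothing at the start of training (ControlNet-style).
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ToyCodecWithControl(nn.Module):
    # Minimal sketch of claim 1's wiring; layer contents are illustrative.
    def __init__(self, channels=3, hidden=16):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU())
        self.middle = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU())
        self.decode = nn.Conv2d(hidden, channels, 3, padding=1)
        # Control branch: copies of the encoding and middle layers, then a zero convolution.
        self.ctrl_encode = nn.Sequential(nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU())
        self.ctrl_middle = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU())
        self.ctrl_zero = zero_conv(hidden)

    def forward(self, frame, control):
        h = self.middle(self.encode(frame))
        c = self.ctrl_zero(self.ctrl_middle(self.ctrl_encode(control)))
        # The zero convolution's output is connected to the decoding layer's input.
        return self.decode(h + c)

frame = torch.randn(1, 3, 64, 64)    # one frame of the first video
control = torch.randn(1, 3, 64, 64)  # conditioning input to the control branch
print(ToyCodecWithControl()(frame, control).shape)  # torch.Size([1, 3, 64, 64])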
2. The method of claim 1, wherein generating the first video corresponding to the original image through the trained image diffusion model comprises:
performing image semantic segmentation processing on the original image through a SegNet network contained in the image diffusion model to obtain a feature image;
converting the feature image into a target text through a semantic learning network contained in the image diffusion model;
performing image diffusion processing on the target text and the feature image through an image diffusion network contained in the image diffusion model to obtain a target image;
converting the target text into target audio through a text-to-speech network contained in the image diffusion model;
and fusing the target image and the target audio to obtain the first video.
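For orientation, the sequence of operations in claim 2 can be written as a simple Python orchestration function. Every callable below (segnet, image_to_text, diffuse, tts, fuse) is a hypothetical stand-in for the corresponding sub-network of the image diffusion model; the names and signatures are assumptions, not part of the patent.

from typing import Any, Callable

def generate_first_video(
    original_image: Any,
    segnet: Callable,         # image semantic segmentation -> feature image
    image_to_text: Callable,  # semantic learning network -> target text
    diffuse: Callable,        # image diffusion on (target text, feature image) -> target image
    tts: Callable,            # text-to-speech network -> target audio
    fuse: Callable,           # fuse target image and target audio -> first video
):
    # Hypothetical orchestration of the five steps recited in claim 2.
    feature_image = segnet(original_image)
    target_text = image_to_text(feature_image)
    target_image = diffuse(target_text, feature_image)
    target_audio = tts(target_text)
    return fuse(target_image, target_audio)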
3. The method of claim 2, wherein the network structure of the SegNet network comprises a plurality of network layer groups, each network layer group comprising at least one convolutional layer and at least one pooling layer, and wherein a skip connection is added between the first convolutional layer and the last pooling layer of each network layer group.
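A minimal sketch of one such network layer group, assuming PyTorch and illustrative layer depths and channel counts: only the skip connection routing the first convolutional layer's output to the input of the group's final pooling layer follows claim 3; everything else is an assumption.

import torch
import torch.nn as nn

class SegGroup(nn.Module):
    # One "network layer group": convolutional layers followed by a pooling
    # layer, with a skip connection from the first convolutional layer to the
    # group's last pooling layer.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.first_conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.rest = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.first_conv(x)
        y = self.rest(skip)
        # Skip connection: the first conv's output is added just before the pooling layer.
        return self.pool(y + skip)

x = torch.randn(1, 3, 64, 64)
print(SegGroup(3, 16)(x).shape)  # torch.Size([1, 16, 32, 32])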
4. The method of claim 1, wherein the inputting of the first video respectively to the codec network contained in the image diffusion model and the ControlNet network for controlling the image diffusion model comprises:
acquiring a preset avatar template image;
and inputting the first video and the avatar template image as input data respectively to the codec network and the ControlNet network for processing.
5. The method of claim 1, wherein the inputting of the first video respectively to the codec network contained in the image diffusion model and the ControlNet network for controlling the image diffusion model comprises:
acquiring a prompt text corresponding to the avatar;
and inputting the first video and the prompt text as input data respectively to the codec network and the ControlNet network for processing.
6. The method of claim 1, wherein the avatar is an on-board voice assistant avatar of a vehicle, and the inputting of the first video respectively to the codec network contained in the image diffusion model and the ControlNet network for controlling the image diffusion model comprises:
acquiring vehicle parameters of the vehicle during driving;
and inputting the first video and the vehicle parameters as input data respectively to the codec network and the ControlNet network for processing.
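Claims 4 to 6 differ only in which extra conditioning is bundled with the first video before it is fed to the codec network and the ControlNet network. A hypothetical helper makes that explicit; the function name, dictionary keys and parameter names are assumptions for illustration only.

def build_input_data(first_video, *, template_image=None, prompt_text=None, vehicle_params=None):
    # The first video is always present; optional conditioning corresponds to
    # claims 4 (preset avatar template image), 5 (prompt text) and
    # 6 (vehicle parameters acquired during driving, e.g. speed or gear).
    data = {"first_video": first_video}
    if template_image is not None:
        data["template_image"] = template_image  # claim 4
    if prompt_text is not None:
        data["prompt_text"] = prompt_text        # claim 5
    if vehicle_params is not None:
        data["vehicle_params"] = vehicle_params  # claim 6
    return data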
7. The method of any one of claims 1 to 6, wherein the outputting of the generated avatar through the decoding layer comprises:
outputting a second video through the decoding layer;
for each frame image in the second video, acquiring a range area containing the edge of the avatar in the image, and performing distance measurement processing with each pixel point in the range area as input to a distance measurement algorithm, so as to increase the distance between the pixel points belonging to the avatar edge and the pixel points not belonging to the avatar edge within the range area, thereby obtaining a processed second video;
and determining the processed second video as the generated avatar.
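Claim 7's edge post-processing is described abstractly, so the sketch below is only one possible reading (an assumption, not the patent's exact algorithm): within a range area around the avatar edge, edge pixels and non-edge pixels are pushed further apart in intensity so the avatar's contour stands out. It uses NumPy and a single-channel frame for simplicity.

import numpy as np

def sharpen_avatar_edge(frame, edge_mask, band, gain=1.5):
    # frame:     HxW float image in [0, 1] (one channel of one video frame)
    # edge_mask: HxW bool array, True where a pixel lies on the avatar edge
    # band:      HxW bool array, True for the range area containing the edge
    out = frame.copy()
    band_vals = frame[band]
    mean = band_vals.mean() if band_vals.size else 0.0
    # Move edge pixels away from the band mean and non-edge band pixels toward
    # it, which increases the separation ("distance") between the two groups.
    out[band & edge_mask] = mean + gain * (frame[band & edge_mask] - mean)
    out[band & ~edge_mask] = mean + (frame[band & ~edge_mask] - mean) / gain
    return np.clip(out, 0.0, 1.0)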
8. An avatar generation apparatus, comprising:
the original image acquisition module is used for acquiring an original image corresponding to the avatar;
the video generation module is used for generating a first video corresponding to the original image through the trained image diffusion model;
the video input module is used for inputting the first video respectively to a codec network contained in the image diffusion model and a ControlNet network for controlling the image diffusion model for processing; wherein the codec network sequentially comprises an encoding layer, a middle layer and a decoding layer, and the ControlNet network sequentially comprises an encoding layer identical to that of the codec network, a middle layer identical to that of the codec network, and a zero convolution layer, the output of the zero convolution layer being connected to the input of the decoding layer;
and the avatar output module is used for outputting the generated avatar through the decoding layer.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the avatar generation method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the avatar generation method of any one of claims 1 to 7.
CN202311059569.0A 2023-08-22 2023-08-22 Virtual image generation method, device, terminal equipment and storage medium Pending CN117078816A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311059569.0A CN117078816A (en) 2023-08-22 2023-08-22 Virtual image generation method, device, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311059569.0A CN117078816A (en) 2023-08-22 2023-08-22 Virtual image generation method, device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117078816A true CN117078816A (en) 2023-11-17

Family

ID=88716585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311059569.0A Pending CN117078816A (en) 2023-08-22 2023-08-22 Virtual image generation method, device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117078816A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274450A (en) * 2023-11-21 2023-12-22 长春职业技术学院 Animation image generation system and method based on artificial intelligence
CN117274450B (en) * 2023-11-21 2024-01-26 长春职业技术学院 Animation image generation system and method based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN111833418B (en) Animation interaction method, device, equipment and storage medium
CN113592985B (en) Method and device for outputting mixed deformation value, storage medium and electronic device
US11670015B2 (en) Method and apparatus for generating video
CN113287118A (en) System and method for face reproduction
CN113228163B (en) Real-time text and audio based face rendering
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
WO2023284435A1 (en) Method and apparatus for generating animation
CN109308725B (en) System for generating mobile terminal table sentiment picture
KR20220097119A (en) Mouth shape synthesis device and method using artificial neural network including face discriminator
CN111414506B (en) Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN117078816A (en) Virtual image generation method, device, terminal equipment and storage medium
CN116704085B (en) Avatar generation method, apparatus, electronic device, and storage medium
KR101913811B1 (en) A method for analysing face information, and an appratus for analysing face information to present faces, identify mental status or compensate it
CN117351115A (en) Training method of image generation model, image generation method, device and equipment
CN113744286A (en) Virtual hair generation method and device, computer readable medium and electronic equipment
CN117857892B (en) Data processing method, device, electronic equipment, computer program product and computer readable storage medium based on artificial intelligence
CN115937853A (en) Document generation method, document generation device, electronic device, and storage medium
CN114567693A (en) Video generation method and device and electronic equipment
CN118015110A (en) Face image generation method and device, computer readable storage medium and terminal
CN117689752A (en) Literary work illustration generation method, device, equipment and storage medium
CN116797725A (en) Vehicle-mounted scene generation method, device and system
CN115630150A (en) Reply text generation method, device, equipment and storage medium
He Exploring style transfer algorithms in Animation: Enhancing visual

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination