CN114792355B - Virtual image generation method and device, electronic equipment and storage medium

Info

Publication number
CN114792355B
CN114792355B
Authority
CN
China
Prior art keywords
image
target
texture map
processing
determining
Prior art date
Legal status
Active
Application number
CN202210720644.2A
Other languages
Chinese (zh)
Other versions
CN114792355A (en)
Inventor
李�杰
赵晨
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210720644.2A
Publication of CN114792355A
Application granted
Publication of CN114792355B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/04: Texture mapping
    • G06T 19/00: Manipulating 3D models or images for computer graphics
    • G06T 19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts

Abstract

The present disclosure provides an avatar generation method, which relates to the technical field of artificial intelligence, in particular to the technical fields of virtual reality, augmented reality, computer vision, deep learning, and the like, and can be applied to scenarios such as the metaverse. The specific implementation scheme is as follows: acquiring a target image and a target style image; performing normalization processing on the target image to obtain a normalized image of a target object in the target image; determining a texture map of the target object in the normalized image; mapping the texture map to a three-dimensional space to obtain an output image; and generating an avatar corresponding to the target object based on the output image and the target style image. The present disclosure also provides an avatar generation apparatus, an electronic device, and a storage medium.

Description

Virtual image generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and more particularly to the technical fields of virtual reality, augmented reality, computer vision, and deep learning, which can be applied to scenarios such as the metaverse. More specifically, the present disclosure provides an avatar generation method, apparatus, electronic device, and storage medium.
Background
With the development of artificial intelligence technology, deep learning models are widely used for image processing and image generation in fields such as virtual reality and augmented reality. In addition, avatars are widely used in scenarios such as social networking, live streaming, and games.
Disclosure of Invention
The present disclosure provides an avatar generation method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided an avatar generation method, the method including: acquiring a target image and a target style image; performing normalization processing on the target image to obtain a normalized image of a target object in the target image; determining a texture map of the target object in the normalized image; mapping the texture map to a three-dimensional space to obtain an output image; and generating an avatar corresponding to the target object based on the output image and the target-style image.
According to another aspect of the present disclosure, there is provided an avatar generating apparatus, the apparatus including: the acquisition module is used for acquiring a target image and a target style image; the normalization module is used for executing normalization processing on the target image to obtain a normalized image of a target object in the target image; the determining module is used for determining a texture map of the target object in the normalized image; the mapping module is used for mapping the texture map to a three-dimensional space to obtain an output image; and a generating module for generating an avatar corresponding to the target object according to the output image and the target style image.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic diagram of an exemplary system architecture to which the avatar generation method and apparatus may be applied, according to one embodiment of the present disclosure;
FIG. 2 is a flow diagram of an avatar generation method according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a deep learning model according to one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of obtaining a texture map according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of adjusting parameters of a deep learning model according to one embodiment of the present disclosure;
FIG. 6 is a block diagram of an avatar generation apparatus according to one embodiment of the present disclosure; and
fig. 7 is a block diagram of an electronic device to which an avatar generation method may be applied according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The avatar may include a head. The head of the avatar may include a face. A face texture associated with the target object may be rendered onto the face of the avatar based on the texture map of the target object such that there is a higher degree of similarity between the face of the avatar and the face of the target object.
The texture map of the target object may be determined manually from a face image of the target object. However, determining the texture map manually is time-consuming, has a high skill threshold, is costly, and leads to a long iteration period.
Fig. 1 is a schematic diagram of an exemplary system architecture to which the avatar generation method and apparatus may be applied, according to one embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is used to provide a medium of communication links between the first terminal device 101, the second terminal device 102, the third terminal device 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user may use the first terminal device 101, the second terminal device 102, the third terminal device 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The first terminal device 101, the second terminal device 102, and the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to a smart phone, a tablet computer, a laptop portable computer, a desktop computer, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The backend management server may analyze and process the received data such as the user request, and feed back a processing result (for example, a web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the avatar generation method provided in the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the avatar generation apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The avatar generation method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster different from the server 105 and capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105. Accordingly, the avatar generation apparatus provided in the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105.
Fig. 2 is a flowchart of an avatar generation method according to one embodiment of the present disclosure.
As shown in fig. 2, the method may include operations S210 to S250.
In operation S210, a target image and a target style image are acquired.
For example, a target object may be included in the target image. For another example, the target object may be an object including a face and/or a head.
For example, the target object may be an object having a face or a head, such as a human, an animal, or a robot.
As another example, the target style image may be determined manually. In one example, a target style image may be manually determined from the target image.
In operation S220, a normalization process is performed on the target image to obtain a normalized image of the target object in the target image.
The normalization processing includes, for example, alignment processing. The target image is normalized so that the face or the head of the target object is at a preset position in the normalized image.
In operation S230, a texture map of the target object in the normalized image is determined.
For example, the texture map of the target object may be determined based on various ways. In one example, the normalized image may be processed using various deep learning models to derive features of the normalized image to obtain a texture map, although the disclosure is not limited thereto.
In operation S240, the texture map is mapped to a three-dimensional space, resulting in an output image.
For example, the texture map may be mapped to a three-dimensional space using a 3DMM (3D Morphable Model) to obtain a three-dimensional image of the target object. Various renderers can then render the plurality of three-dimensional point data of the three-dimensional image to obtain an output image. In one example, the renderers may include, for example, the PyTorch3D renderer.
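As an illustration of operation S240, the sketch below maps a texture map onto a head mesh and renders it with the PyTorch3D renderer mentioned above. It is a minimal example rather than the implementation of the disclosure: the mesh geometry and UV layout are random placeholders standing in for real 3DMM assets, and one UV coordinate per vertex is assumed.

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRenderer,
    MeshRasterizer, SoftPhongShader, PointLights, TexturesUV,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder geometry standing in for a 3DMM head mesh (in practice the
# vertices would come from the fitted 3DMM shape/expression parameters).
verts = torch.rand(1000, 3, device=device)                  # (V, 3) vertex positions
faces = torch.randint(0, 1000, (1800, 3), device=device)    # (F, 3) triangle indices
verts_uvs = torch.rand(1000, 2, device=device)              # (V, 2) UV coordinates

# The texture map predicted for the target object, as a (1, H, W, 3) image.
texture_map = torch.rand(1, 256, 256, 3, device=device)

# Wrap the texture map and UV layout so it can be sampled during shading.
textures = TexturesUV(maps=texture_map, faces_uvs=faces[None], verts_uvs=verts_uvs[None])
mesh = Meshes(verts=[verts], faces=[faces], textures=textures)

# A simple perspective camera, point light and Phong shader produce the output image.
cameras = FoVPerspectiveCameras(device=device)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras,
                              raster_settings=RasterizationSettings(image_size=224)),
    shader=SoftPhongShader(device=device, cameras=cameras, lights=PointLights(device=device)),
)

output_image = renderer(mesh)   # (1, 224, 224, 4) RGBA rendering of the textured mesh
```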
In operation S250, an avatar corresponding to the target object is generated based on the output image and the target-style image.
For example, the texture map may be adjusted according to a difference between the output image and the target style image, so that the three-dimensional image is adjusted accordingly. The adjusted three-dimensional image is then rendered with the PyTorch3D renderer to obtain the avatar.
Through the embodiments of the present disclosure, the texture map is generated based on a deep learning model, which can reduce labor cost, shorten the avatar generation cycle, lower the barrier to avatar generation, and help improve the user experience.
In some embodiments, performing a normalization process on the target image to obtain a normalized image of the target object in the target image comprises: performing affine transformation on the target image to obtain a first registration image.
For example, the affine transformation may include at least one of a translation operation, a scaling operation, and a rotation operation. In one example, the first registered image may be used as the normalized image described above. It can be understood that, after the affine transformation is performed on the target image, the position of the target object in the image is adjusted.
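As a rough illustration of such an affine alignment, the sketch below warps a face so that it lands at a preset position in a fixed-size crop using OpenCV. The three-point landmark template and the landmark detector that would supply `landmarks` are assumptions made for illustration and are not part of the disclosure.

```python
import cv2
import numpy as np

# Hypothetical canonical positions (left eye, right eye, nose tip) in a 224x224 crop.
TEMPLATE = np.float32([[70, 90], [154, 90], [112, 140]])

def align_face(image: np.ndarray, landmarks: np.ndarray, size: int = 224) -> np.ndarray:
    """Affine-align a face so it sits at a preset position in the output crop.

    `landmarks` is a (3, 2) array of detected eye/nose points; the detector that
    produces it is outside the scope of this sketch.
    """
    # Estimate a similarity transform (translation, scale, rotation) mapping the
    # detected landmarks onto the canonical template.
    matrix, _ = cv2.estimateAffinePartial2D(landmarks.astype(np.float32), TEMPLATE)
    # Warp the whole image with the estimated transform.
    return cv2.warpAffine(image, matrix, (size, size), flags=cv2.INTER_LINEAR)
```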
Further, in some embodiments, performing normalization processing on the target image to obtain a normalized image of the target object in the target image includes: determining processing parameters of the first registered image; and processing the first registration image according to the processing parameters to obtain a normalized image.
In embodiments of the present disclosure, the processing parameters may include lighting parameters and other parameters.
For example, the illumination parameters may also be referred to as SH (Spherical Harmonics) parameters.
For another example, the other parameters may include, for example, a shape parameter, an expression parameter, a texture parameter, camera intrinsic parameters, camera extrinsic parameters, and the like.
In one example, the first registered image described above is processed by the 3DMM according to a preset illumination parameter and preset other parameters to obtain a 1st level second registration image. Next, the preset illumination parameter and the preset other parameters may be adjusted according to a 1st level difference value between the 1st level second registration image and the target image to obtain a 1st level illumination parameter and 1st level other parameters. This is repeated multiple times; for example, the (n-1)th level second registration image is processed by the 3DMM according to the (n-1)th level illumination parameter and the (n-1)th level other parameters to obtain an nth level second registration image. Next, the (n-1)th level illumination parameter and the (n-1)th level other parameters may be adjusted according to the nth level difference value between the nth level second registration image and the target image to obtain the nth level illumination parameter and nth level other parameters. After N repetitions, the minimum value among the N levels of difference values may be determined. The illumination parameter and other parameters obtained by the adjustment corresponding to that minimum value can be used as the illumination parameter and other parameters of the first registered image. n is an integer greater than 1 and less than or equal to N, and N is an integer greater than 1.
In one example, the illumination parameter and other parameters adjusted according to that minimum value may be the Mth level illumination parameter and Mth level other parameters, where M is an integer greater than or equal to 1 and less than or equal to N.
For another example, the preset illumination parameter and the preset other parameters may be determined according to the target image.
For example, the first registered image may be processed using the 3DMM according to the Mth level illumination parameter and Mth level other parameters described above to obtain the normalized image.
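The iterative fitting described above can be sketched as a simple optimization loop, as below. `apply_3dmm` is a hypothetical wrapper around the 3DMM, the parameter sizes are illustrative, and a mean-squared photometric difference is assumed for the per-level difference value; none of these choices are specified by the disclosure.

```python
import torch
import torch.nn.functional as F

def fit_processing_parameters(first_registered, target_image, apply_3dmm,
                              num_levels: int = 100, lr: float = 0.01):
    """Iteratively refine the illumination (SH) parameter and other parameters.

    `apply_3dmm(image, sh, other)` is a hypothetical callable that processes the
    registered image with the 3DMM under the given parameters and returns a
    second registration image of the same size as the target image.
    """
    sh = torch.zeros(1, 27, requires_grad=True)      # spherical-harmonics lighting coefficients
    other = torch.zeros(1, 257, requires_grad=True)  # shape/expression/texture/camera parameters
    optimizer = torch.optim.Adam([sh, other], lr=lr)

    best_diff = float("inf")
    best_params = (sh.detach().clone(), other.detach().clone())
    for n in range(num_levels):
        second_registered = apply_3dmm(first_registered, sh, other)   # nth level second registration image
        diff = F.mse_loss(second_registered, target_image)            # nth level difference value
        optimizer.zero_grad()
        diff.backward()
        optimizer.step()
        if diff.item() < best_diff:                                   # keep the level with the minimum difference
            best_diff = diff.item()
            best_params = (sh.detach().clone(), other.detach().clone())
    return best_params   # used as the Mth level illumination parameter and other parameters
```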
In some embodiments, the normalized image may be processed using a deep learning model to obtain a texture map. This will be described in detail below with reference to fig. 3.
FIG. 3 is a schematic diagram of a deep learning model according to one embodiment of the present disclosure.
As shown in fig. 3, the deep learning model 310 may be built based on a ResNet (Residual Network) model, for example.
The deep learning model 310 includes a Conv (convolution layer) 311, a 1st level Block 312, a 2nd level Block 313, a 3rd level Block 314, a 4th level Block 315, and an FC (fully connected layer) 316.
The normalized image 301 may be used as the input to the deep learning model 310. The deep learning model 310 may process the normalized image 301 to obtain an output feature Output 302. After the feature extraction of the ith level Block of the deep learning model is completed, the ith level initial features of the normalized image can be obtained. i is an integer greater than or equal to 1 and less than or equal to I, I is an integer greater than 1, and I = 4 in this example.
For example, Conv 311 may perform convolution processing on the normalized image 301 and output a convolved image. The 1st to 4th level Blocks 312 to 315 may sequentially perform feature extraction on the convolved image and output the Ith level initial features. FC 316 may perform fully connected processing on the Ith level initial features to obtain the output feature Output 302.
In one example, the normalized image 301 may be, for example, 224 × 224 × 3 in size. The size of the convolved image may be 112 × 112, for example. The convolved image is input into the Block 312 of level 1, and the size of the resulting initial feature of level 1 may be, for example, 56 × 56. The level 2 initial feature obtained by inputting the level 1 initial feature into the level 2 Block 313 may have a size of 28 × 28, for example. The level 2 initial feature is input into the level 3 Block 314, and the size of the obtained level 3 initial feature may be 14 × 14, for example. The level 3 initial feature is input into the level 4 Block 315, and the size of the resulting level 4 initial feature may be 7 × 7, for example.
In one example, any of the level 1 initial feature, the level 2 initial feature, the level 3 initial feature, and the level 4 initial feature described above may be used as the texture map described above.
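A minimal sketch of a backbone with the Conv, Block and FC layout of FIG. 3 is given below. It assumes a recent torchvision and reuses a ResNet-18 configuration; the output dimension and the reuse of torchvision layers are assumptions for illustration, not the model disclosed here.

```python
import torch
from torch import nn
from torchvision.models import resnet18

class TextureEncoder(nn.Module):
    """ResNet-style backbone following the Conv -> Block 1..4 -> FC layout of FIG. 3."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        backbone = resnet18(weights=None)
        self.conv = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)  # Conv 311: 224x224 -> 112x112
        self.blocks = nn.ModuleList([
            nn.Sequential(backbone.maxpool, backbone.layer1),  # 1st level Block 312: -> 56x56
            backbone.layer2,                                   # 2nd level Block 313: -> 28x28
            backbone.layer3,                                   # 3rd level Block 314: -> 14x14
            backbone.layer4,                                   # 4th level Block 315: -> 7x7
        ])
        self.fc = nn.Linear(512, out_dim)                      # FC 316

    def forward(self, x: torch.Tensor):
        x = self.conv(x)
        initial_features = []
        for block in self.blocks:
            x = block(x)
            initial_features.append(x)        # ith level initial features
        pooled = x.mean(dim=(2, 3))           # global average pooling over the 7x7 map
        return self.fc(pooled), initial_features

output_feature, features = TextureEncoder()(torch.randn(1, 3, 224, 224))  # output feature: (1, 256)
```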
In some embodiments, the face of the target object in the target image may be a side face of the target object.
For example, the side face may be a left side face of the subject or a right side face of the subject.
In some embodiments, determining the texture map of the target object in the normalized image comprises: determining initial characteristics of the normalized image; turning over the initial features to obtain turning over features; fusing the initial characteristic and the turning characteristic to obtain a fused characteristic; and obtaining a texture map of the target object according to the fusion characteristics. As will be described in detail below with reference to fig. 4.
FIG. 4 is a schematic diagram of obtaining a texture map according to one embodiment of the present disclosure.
As shown in fig. 4, the ith stage Block may output an ith stage initial feature 403. Flipping the ith level initial feature 403 may result in an ith level flipped feature 404. And fusing the ith-level initial feature 403 and the ith-level flip feature 404 to obtain an ith-level fused feature 405. The fusion process may include, for example, at least one of a splicing process, an addition process, and the like.
For example, in the case of i =1, the input of the 1 st stage Block may be, for example, the convolved image described above. Stage 1 Block may output stage 1 initial features. And turning the 1 st-level initial feature to obtain a 1 st-level turning feature. And then splicing the 1 st level initial feature and the 1 st level flip feature to obtain the 1 st level fusion feature.
As another example, the input to the ith level Block may be, for example, the (i-1)th level fusion feature, where i is greater than 1 and less than or equal to I-1. Taking i = 2 as an example, the input of the 2nd level Block may be, for example, the 1st level fusion feature described above. The 2nd level Block may output the 2nd level initial feature. The 2nd level initial feature is flipped to obtain a 2nd level flip feature. The 2nd level initial feature and the 2nd level flip feature are fused to obtain a 2nd level fusion feature.
For another example, in the case of i = I, the input of the Ith level Block may be, for example, the (I-1)th level fusion feature. The Ith level Block may output the Ith level initial feature. The Ith level initial feature is flipped to obtain an Ith level flip feature. The Ith level initial feature and the Ith level flip feature are fused to obtain an Ith level fusion feature.
It can be understood that the Ith level initial features are derived from the (I-1)th level fusion features. Either the Ith level initial feature or the Ith level fusion feature can be used as the texture map.
It will be appreciated that the above-described flipping may be, for example, a horizontal flipping.
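The flip-and-fuse step of FIG. 4 reduces to two tensor operations, sketched below; concatenation along the channel dimension stands in for the splicing variant of the fusion, and element-wise addition for the other variant mentioned above.

```python
import torch

def flip_and_fuse(initial: torch.Tensor, mode: str = "concat") -> torch.Tensor:
    """Fuse an ith level initial feature with its horizontally flipped copy.

    `initial` has shape (N, C, H, W); flipping along the last (width) axis
    mirrors the feature map left-to-right.
    """
    flipped = torch.flip(initial, dims=[-1])           # ith level flip feature
    if mode == "concat":
        return torch.cat([initial, flipped], dim=1)    # splicing along channels
    return initial + flipped                           # element-wise addition

fused = flip_and_fuse(torch.randn(1, 64, 56, 56))      # ith level fusion feature, here (1, 128, 56, 56)
```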
Through the embodiments of the present disclosure, even when the target image includes a side face of the object, an avatar with a high similarity to the target object can be generated, which reduces the requirements on the target image, lowers the usage threshold for users, and improves the user experience. Further, when the entire face or head of the object is included in the target image, the realism of the avatar can be further improved.
In some embodiments, mapping the texture map into three-dimensional space, resulting in an output image comprises: mapping the texture map to a three-dimensional space to obtain a three-dimensional image; and rendering the three-dimensional image to obtain an output image.
For example, as described above, the PyTorch3D renderer may be used to render the three-dimensional image to obtain the output image.
In some embodiments, an avatar corresponding to the target object may be generated from the output image and the target style image.
For example, the parameters of the deep learning model are adjusted, so that the texture map output by the deep learning model is adjusted and, in turn, the three-dimensional image obtained from the texture map is adjusted. The avatar is then obtained from the adjusted three-dimensional image.
For another example, a first disparity value between the output image and the target style image is determined. And adjusting parameters of the deep learning model so that the first difference value converges.
For another example, the output image and the target style image are processed by using a contrast image-text pre-training model, and the output image perception characteristic and the target style image perception characteristic are obtained respectively. Determining a second difference value between the output image perception characteristic and the target style image perception characteristic. And adjusting parameters of the deep learning model so that the second difference value converges.
As another example, a learning-aware image block similarity between the output image and the target-style image is determined. And adjusting parameters of the deep learning model to make the similarity of the learning perception image blocks converged.
As will be described in detail below with reference to fig. 5.
FIG. 5 is a schematic diagram of adjusting parameters of a deep learning model according to one embodiment of the present disclosure.
In order to obtain the virtual image, the deep learning model can be adjusted for K times. K is an integer greater than 1.
In the kth adjustment of the K adjustments, the normalized image 301 is input into the deep learning model 310 obtained after the (k-1)th adjustment to obtain a kth texture map 506. It can be understood that the description of the deep learning model in fig. 3 also applies to the present embodiment and is not repeated here. It can also be understood that the above-described principle of obtaining the texture map applies to the present embodiment and is not repeated either. k is an integer less than or equal to K, and K is an integer greater than 1.
Next, the kth texture map 506 may be mapped to the three-dimensional space using the 3DMM to obtain a kth three-dimensional model. The kth three-dimensional model includes a plurality of three-dimensional point data. An image is then rendered from the kth three-dimensional model using the PyTorch3D renderer to obtain a kth output image 507.
Next, a kth adjustment may be made to deep learning model 310 based on kth output image 507 and target style image 508.
For example, using the L1 loss function, a kth first difference value 5011 between the kth output image 507 and the target style image 508 is determined.
For another example, the kth output image 507 and the target style image 508 may be processed using a CLIP (Contrastive Language-Image Pre-training) model 520 to obtain a kth output image perceptual feature 509 and a target style image perceptual feature 5010, respectively. A kth second difference value 5012 between the kth output image perceptual feature 509 and the target style image perceptual feature 5010 is determined, again using the L1 loss function.
For another example, a kth LPIPS (Learned Perceptual Image Patch Similarity) 5013 may be determined between the kth output Image 507 and the target style Image 508.
Next, the parameters of the deep learning model 310 are adjusted according to the kth first difference value 5011, the kth second difference value 5012 and the kth LPIPS 5013 to complete the kth adjustment of the K adjustments. For example, various operations can be performed according to the kth first difference value 5011, the kth second difference value 5012, and the kth LPIPS 5013 to obtain a kth loss value, so as to adjust the parameters of the deep learning model 310.
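A minimal sketch of the kth loss computation is given below. It assumes the openai `clip` and `lpips` Python packages as stand-ins for the contrastive image-text pre-training model and the LPIPS metric; the equal weighting of the three terms and the omission of CLIP's own resizing and normalization are simplifications, since the disclosure only states that various operations may combine them.

```python
import torch
import torch.nn.functional as F
import clip
import lpips

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)   # contrastive image-text pre-training model
clip_model = clip_model.float()                        # keep weights in fp32 so fp32 images match
lpips_fn = lpips.LPIPS(net="alex").to(device)          # learned perceptual image patch similarity

def kth_loss(output_image: torch.Tensor, style_image: torch.Tensor) -> torch.Tensor:
    """Combine the three terms used to adjust the deep learning model.

    Both images are (N, 3, 224, 224) tensors scaled to [-1, 1]; CLIP's usual
    preprocessing is skipped here for brevity.
    """
    first_diff = F.l1_loss(output_image, style_image)              # kth first difference value
    out_feat = clip_model.encode_image(output_image)               # kth output image perceptual feature
    style_feat = clip_model.encode_image(style_image)              # target style image perceptual feature
    second_diff = F.l1_loss(out_feat, style_feat)                  # kth second difference value
    lpips_term = lpips_fn(output_image, style_image).mean()        # kth LPIPS
    return first_diff + second_diff + lpips_term                   # kth loss value
```

In the kth adjustment, this loss value would be backpropagated through the renderer and the deep learning model 310 to update the parameters of the model.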
For another example, after the K adjustments are completed, the minimum loss value may be determined as the loss value at which the first difference value, the second difference value, and the LPIPS converge. For example, if the Jth loss value is determined to be the smallest after the K adjustments, the Jth adjusted deep learning model may be used as the target deep learning model. J is an integer greater than or equal to 1 and less than or equal to K.
The normalized image is processed using the target deep learning model to obtain a target texture map. The target texture map is mapped to the three-dimensional space to obtain a target three-dimensional image. The target three-dimensional image is then rendered using the PyTorch3D renderer to obtain the avatar.
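Tying the pieces together, the inference path after the K adjustments could look like the sketch below; `map_to_3d` and `renderer` are hypothetical wrappers around the 3DMM and the PyTorch3D renderer from the earlier sketches, not APIs of the disclosure.

```python
import torch

@torch.no_grad()
def generate_avatar(normalized_image, target_model, map_to_3d, renderer):
    """End-to-end inference once the target deep learning model is selected.

    `target_model` is the Jth adjusted deep learning model with the minimum
    loss value; `map_to_3d` turns a texture map into a textured 3D mesh and
    `renderer` rasterizes that mesh into an image.
    """
    target_texture_map = target_model(normalized_image)   # target texture map
    target_mesh = map_to_3d(target_texture_map)           # target three-dimensional image
    return renderer(target_mesh)                          # rendered avatar
```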
As described above, in the embodiment shown in fig. 5, the deep learning model is adjusted by using the first difference value, the second difference value and LPIPS at the same time. It is understood that in the embodiment of the present disclosure, one or more of the first difference value, the second difference value, and the LPIPS may also be used to adjust the parameters of the deep learning model.
Through the embodiments of the present disclosure, after at least one of the first difference value, the second difference value, and the LPIPS converges, the obtained texture map is closer to the real texture of the target image. The avatar generated based on this texture map is more realistic, has higher precision, and differs less from the target image.
Fig. 6 is a block diagram of an avatar generation apparatus according to one embodiment of the present disclosure.
As shown in fig. 6, the apparatus 600 may include an obtaining module 610, a normalizing module 620, a determining module 630, a mapping module 640, and a generating module 650.
An obtaining module 610, configured to obtain a target image and a target style image.
And the normalization module 620 is configured to perform normalization processing on the target image to obtain a normalized image of the target object in the target image.
A determining module 630 is configured to determine a texture map of the target object in the normalized image.
And the mapping module 640 is configured to map the texture map to a three-dimensional space to obtain an output image.
A generating module 650 for generating an avatar corresponding to the target object based on the output image and the target-style image.
In some embodiments, the normalization module comprises: the affine transformation submodule is used for executing affine transformation on the target image to obtain a first registration image; a first determining sub-module for determining processing parameters of the first registered image; and the first processing submodule is used for processing the first registration image according to the processing parameters to obtain a normalized image.
In some embodiments, the determining module comprises: the second determining submodule is used for determining the initial characteristics of the normalized image; the turning submodule is used for turning the initial features to obtain turning features; the fusion submodule is used for fusing the initial characteristic and the turning characteristic to obtain a fusion characteristic; and the obtaining submodule is used for obtaining the texture map of the target object according to the fusion characteristics.
In some embodiments, the mapping module comprises: the mapping submodule is used for mapping the texture map to a three-dimensional space to obtain a three-dimensional image; and the rendering submodule is used for rendering the three-dimensional image to obtain an output image.
In some embodiments, the determining module comprises: and the second processing submodule is used for processing the normalized image by using the deep learning model to obtain a texture map. The generation module comprises: the third determining submodule is used for determining a first difference value between the output image and the target style image; and the first adjusting submodule is used for adjusting the parameters of the deep learning model so as to make the first difference value converge.
In some embodiments, the generating module further comprises: the third processing submodule is used for processing the output image and the target style image by using the comparison image-text pre-training model to respectively obtain an output image perception characteristic and a target style image perception characteristic; the fourth determining submodule is used for determining a second difference value between the output image perception characteristic and the target style image perception characteristic; and the second adjusting submodule is used for adjusting the parameters of the deep learning model so as to make the second difference value converged.
In some embodiments, the generating module further comprises: the fifth determining submodule is used for determining the similarity of the learning perception image blocks between the output image and the target style image; and the third adjusting submodule is used for adjusting the parameters of the deep learning model so that the similarity of the learning perception image blocks is converged.
In some embodiments, the target object includes a head.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information of the users involved comply with the relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
In an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor. For example, the memory stores instructions executable by the at least one processor to enable the at least one processor to perform methods provided in accordance with the present disclosure.
In the disclosed embodiments, a readable storage medium stores computer instructions, which may be a non-transitory computer readable storage medium. For example, the computer instructions may cause a computer to perform a method provided according to the present disclosure.
In an embodiment of the present disclosure, the computer program product comprises a computer program which, when executed by a processor, implements the method provided according to the present disclosure. This will be described in detail below with reference to fig. 7.
FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An I/O (input/output) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 executes the respective methods and processes described above, such as the avatar generation method. For example, in some embodiments, the avatar generation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the avatar generation method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the avatar generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (8)

1. An avatar generation method, comprising:
acquiring a target image and a target style image;
performing normalization processing on the target image to obtain a normalized image of a target object in the target image, wherein the target object comprises a head;
processing the normalized image by using a deep learning model, and determining the initial characteristics of the normalized image;
turning over the initial features to obtain turning over features;
fusing the initial feature and the turning feature to obtain a fused feature;
obtaining a texture map of the target object according to the fusion characteristics;
mapping the texture map to a three-dimensional space to obtain an output image;
determining a first difference value between the output image and the target style image;
processing the output image and the target style image by using a comparison image-text pre-training model to respectively obtain an output image perception characteristic and a target style image perception characteristic;
determining a second difference value between the output image perception feature and the target style image perception feature;
determining a learning perception image block similarity between the output image and the target style image;
adjusting parameters of the deep learning model to make the first difference value, the second difference value and the learning perception image block similarity converged to obtain a target deep learning model;
processing the normalized image by using the target deep learning model to obtain a target texture map;
mapping the target texture map to the three-dimensional space to obtain a target three-dimensional image; and
rendering the target three-dimensional image to obtain an avatar corresponding to the target object.
2. The method of claim 1, wherein the performing the normalization process on the target image to obtain a normalized image of a target object in the target image comprises:
performing affine transformation on the target image to obtain a first registration image;
determining processing parameters of the first registered image; and
processing the first registration image according to the processing parameters to obtain the normalized image.
3. The method of claim 1, wherein mapping the texture map into three-dimensional space to obtain an output image comprises:
mapping the texture map to a three-dimensional space to obtain a three-dimensional image; and
rendering the three-dimensional image to obtain the output image.
4. An avatar generation apparatus, comprising:
the acquisition module is used for acquiring a target image and a target style image;
the normalization module is used for executing normalization processing on the target image to obtain a normalized image of a target object in the target image;
a determination module for determining a texture map of the target object in the normalized image;
the mapping module is used for mapping the texture map to a three-dimensional space to obtain an output image; and
a generating module for generating an avatar corresponding to the target object based on the output image and the target style image,
the determining module comprises:
the second determining submodule is used for processing the normalized image by using a deep learning model and determining the initial characteristics of the normalized image;
the turning submodule is used for turning the initial features to obtain turning features;
the fusion submodule is used for fusing the initial feature and the turning feature to obtain a fusion feature;
the obtaining submodule is used for obtaining a texture map of the target object according to the fusion characteristics;
the generation module comprises:
a third determining sub-module, configured to determine a first difference value between the output image and the target style image; and
the third processing submodule is used for processing the output image and the target style image by using a comparison image-text pre-training model to respectively obtain an output image perception characteristic and a target style image perception characteristic;
a fifth determining submodule, configured to determine a learning perception image block similarity between the output image and the target style image;
a fourth determining submodule, configured to determine a second difference value between the output image perception characteristic and the target style image perception characteristic;
a first adjusting submodule, configured to adjust a parameter of the deep learning model so that the first difference value converges;
a second adjusting submodule, configured to adjust a parameter of the deep learning model so that the second difference value converges;
a third adjusting submodule, configured to perform the following operations: adjusting parameters of the deep learning model to enable the learning perception image block similarity to be converged to obtain a target deep learning model; processing the normalized image by using the target deep learning model to obtain a target texture map; mapping the target texture map to the three-dimensional space to obtain a target three-dimensional image; and rendering the target three-dimensional image to obtain an avatar corresponding to the target object.
5. The apparatus of claim 4, wherein the normalization module comprises:
the affine transformation submodule is used for executing affine transformation on the target image to obtain a first registration image;
a first determination sub-module for determining processing parameters of the first registered image; and
and the first processing submodule is used for processing the first registration image according to the processing parameters to obtain the normalized image.
6. The apparatus of claim 4, wherein the mapping module comprises:
the mapping submodule is used for mapping the texture map to a three-dimensional space to obtain a three-dimensional image; and
and the rendering submodule is used for rendering the three-dimensional image to obtain the output image.
7. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 3.
8. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to any one of claims 1 to 3.
CN202210720644.2A 2022-06-24 2022-06-24 Virtual image generation method and device, electronic equipment and storage medium Active CN114792355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210720644.2A CN114792355B (en) 2022-06-24 2022-06-24 Virtual image generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210720644.2A CN114792355B (en) 2022-06-24 2022-06-24 Virtual image generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114792355A CN114792355A (en) 2022-07-26
CN114792355B true CN114792355B (en) 2023-02-24

Family

ID=82463103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210720644.2A Active CN114792355B (en) 2022-06-24 2022-06-24 Virtual image generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114792355B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115345980B (en) * 2022-10-18 2023-03-24 北京百度网讯科技有限公司 Generation method and device of personalized texture map
CN116012666B (en) * 2022-12-20 2023-10-27 百度时代网络技术(北京)有限公司 Image generation, model training and information reconstruction methods and devices and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705242A (en) * 2017-07-20 2018-02-16 广东工业大学 A kind of image stylization moving method of combination deep learning and depth perception
CN112488170A (en) * 2020-11-24 2021-03-12 杭州电子科技大学 Multi-feature fusion image classification method based on deep learning
CN113706678A (en) * 2021-03-23 2021-11-26 腾讯科技(深圳)有限公司 Method, device and equipment for acquiring virtual image and computer readable storage medium
CN114549710A (en) * 2022-03-02 2022-05-27 北京百度网讯科技有限公司 Virtual image generation method and device, electronic equipment and storage medium
CN114612600A (en) * 2022-03-11 2022-06-10 北京百度网讯科技有限公司 Virtual image generation method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10878612B2 (en) * 2017-04-04 2020-12-29 Intel Corporation Facial image replacement using 3-dimensional modelling techniques
US10504267B2 (en) * 2017-06-06 2019-12-10 Adobe Inc. Generating a stylized image or stylized animation by matching semantic features via an appearance guide, a segmentation guide, and/or a temporal guide
CN110866508B (en) * 2019-11-20 2023-06-27 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for identifying form of target object
CN114581869A (en) * 2022-03-04 2022-06-03 阿波罗智能技术(北京)有限公司 Method and device for determining position of target object, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705242A (en) * 2017-07-20 2018-02-16 广东工业大学 A kind of image stylization moving method of combination deep learning and depth perception
CN112488170A (en) * 2020-11-24 2021-03-12 杭州电子科技大学 Multi-feature fusion image classification method based on deep learning
CN113706678A (en) * 2021-03-23 2021-11-26 腾讯科技(深圳)有限公司 Method, device and equipment for acquiring virtual image and computer readable storage medium
CN114549710A (en) * 2022-03-02 2022-05-27 北京百度网讯科技有限公司 Virtual image generation method and device, electronic equipment and storage medium
CN114612600A (en) * 2022-03-11 2022-06-10 北京百度网讯科技有限公司 Virtual image generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114792355A (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN115147265B (en) Avatar generation method, apparatus, electronic device, and storage medium
CN114612600B (en) Virtual image generation method and device, electronic equipment and storage medium
JP2023531350A (en) A method for incrementing a sample image, a method for training an image detection model and a method for image detection
CN113379627A (en) Training method of image enhancement model and method for enhancing image
CN114627239B (en) Bounding box generation method, device, equipment and storage medium
CN114708374A (en) Virtual image generation method and device, electronic equipment and storage medium
CN113870399B (en) Expression driving method and device, electronic equipment and storage medium
CN114998433A (en) Pose calculation method and device, storage medium and electronic equipment
CN114549728A (en) Training method of image processing model, image processing method, device and medium
CN114926322B (en) Image generation method, device, electronic equipment and storage medium
CN116402914A (en) Method, device and product for determining stylized image generation model
CN113781653B (en) Object model generation method and device, electronic equipment and storage medium
CN113240780B (en) Method and device for generating animation
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN113592074B (en) Training method, generating method and device and electronic equipment
CN115578515A (en) Training method of three-dimensional reconstruction model, and three-dimensional scene rendering method and device
CN115082298A (en) Image generation method, image generation device, electronic device, and storage medium
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN114842066A (en) Image depth recognition model training method, image depth recognition method and device
CN114119990A (en) Method, apparatus and computer program product for image feature point matching
CN114820908B (en) Virtual image generation method and device, electronic equipment and storage medium
CN112990046A (en) Difference information acquisition method, related device and computer program product
CN113361535A (en) Image segmentation model training method, image segmentation method and related device
CN113344213A (en) Knowledge distillation method, knowledge distillation device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant