CN115063536A - Image generation method and device, electronic equipment and computer readable storage medium - Google Patents

Image generation method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN115063536A
Authority
CN
China
Prior art keywords
codes
appearance
feature
shape
pairs
Prior art date
Legal status
Granted
Application number
CN202210770690.3A
Other languages
Chinese (zh)
Other versions
CN115063536B (en)
Inventor
张琦
刘巧俏
邹航
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202210770690.3A priority Critical patent/CN115063536B/en
Publication of CN115063536A publication Critical patent/CN115063536A/en
Application granted granted Critical
Publication of CN115063536B publication Critical patent/CN115063536B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 15/00 - 3D [Three Dimensional] image rendering
    • G06T 15/005 - General purpose rendering architectures
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods


Abstract

The disclosure provides an image generation method and apparatus, an electronic device, and a computer-readable storage medium, and relates to the technical field of image processing. The method comprises the following steps: acquiring style control content of a target image; extracting n pairs of first shape codes and first appearance codes from the style control content according to a first neural network; acquiring n pairs of second shape codes and second appearance codes; generating n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes; acquiring position codes of spatial points corresponding to the target image; generating n first feature fields according to the position codes and the n pairs of third shape codes and third appearance codes; and processing the n first feature fields according to a second neural network to obtain the target image. In this way, the user can control the style of the target image by inputting the style control content, and a virtual image that does not actually exist can be generated.

Description

Image generation method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image generation method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In the field of image processing technology, image generation through three-dimensional reconstruction and novel-view image rendering has long been a research focus. With the emergence of concepts such as the digital twin and the metaverse, industry demand for image generation is growing steadily.
In the related art, images are generated by a classical three-dimensional neural rendering model, which can reconstruct a three-dimensional implicit representation of an object from a few input pictures of that object and then render images of the object from various angles using volume rendering.
However, generating an image with a classical three-dimensional neural rendering model requires input images of an object that actually exists, so the image generated by such a model depicts a real object and a virtual image cannot be generated. In addition, the classical three-dimensional neural rendering model treats each object and the background as a whole, so the generated images are static and images of a style specified by the user cannot be produced.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides an image generation method, an image generation apparatus, an electronic device, and a computer-readable storage medium, which at least partially overcome the problems that a virtual image cannot be generated in the related art and an image of a corresponding style cannot be generated according to the requirements of a user.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, there is provided an image generation method including: acquiring style control content of a target image;
extracting n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, wherein n is an integer greater than or equal to 1;
acquiring n pairs of second shape codes and second appearance codes, wherein the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution;
generating n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes;
acquiring a position code of a space point corresponding to the target image;
generating n first feature fields according to the position codes and the n pairs of third shape codes and third appearance codes, wherein one first feature field corresponds to one third shape code and one corresponding third appearance code;
and processing the n first feature fields according to a second neural network to obtain the target image.
In one embodiment of the present disclosure, the first neural network comprises a feature extraction model, a shape mapper, and an appearance mapper, the extracting n pairs of a first shape code and a first appearance code from the style control content according to the first neural network comprises: extracting style features of the style control content according to the feature extraction model to obtain style coding vectors; extracting n first shape codes from the style code vector according to the shape mapper; extracting n first appearance codes from the style code vector according to the appearance mapper.
In one embodiment of the present disclosure, the style control content includes text content, or image content, or voice content; the extracting style characteristics of the style control content according to the characteristic extraction model to obtain a style coding vector comprises the following steps: extracting the style coding vector from the text content according to a text encoder included in the feature extraction model under the condition that the style control content comprises the text content; extracting the style encoding vector from the image content according to an image encoder included in the feature extraction model in a case where the style control content includes the image content; in the case that the style control content comprises the voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model; and extracting the style coding vector from the text content corresponding to the voice content according to the text encoder.
In one embodiment of the present disclosure, the feature extraction model is a pre-trained neural network model.
In an embodiment of the present disclosure, the obtaining a position code of a spatial point corresponding to the target image includes: taking the origin of a camera as a center in a three-dimensional space, and acquiring H x W rays, wherein H x W corresponds to the size of the target image, and H and W are integers larger than 0; determining a first coordinate and a second coordinate of each ray in the H x W rays; sampling S space points in a line segment between the first coordinate and the second coordinate of each ray to obtain H x W x S space points, wherein S is an integer larger than 0; and respectively coding the H x W x S spatial points according to the coordinates of each spatial point in the H x W x S spatial points to obtain the position codes.
In an embodiment of the present disclosure, the generating n first feature fields according to the position coding and the n pairs of third shape coding and third appearance coding includes: and inputting the position codes and the n pairs of third shape codes and third appearance codes into a third neural network, and generating corresponding first feature domains by the third neural network according to each pair of the third shape codes and the third appearance codes and the position codes to obtain n first feature domains.
In one embodiment of the present disclosure, the second neural network comprises a combination operator, a three-dimensional volume rendering neural network, and a prediction neural network; the processing the n first feature domains according to the second neural network to obtain the target image includes: processing the n first feature domains according to the combination operator to obtain a second feature domain; rendering the second feature domain according to the three-dimensional volume rendering neural network to obtain a feature map; and predicting the feature map according to the prediction neural network to obtain the target image.
In one embodiment of the present disclosure, the spatial point corresponding to each of the n first feature domains has a corresponding spatial density and target feature; the processing the n first feature domains according to the combination operator to obtain a second feature domain includes: inputting the n first feature domains into the combination operator, and calculating the average spatial density and the average target feature of the n first feature domains at each spatial point by the combination operator according to the following formula;
μ̄(C(x, d)) = (1/n) Σ_{i=1}^{n} μ_i    (average spatial density)
f̄(C(x, d)) = (1/n) Σ_{i=1}^{n} f_i    (average target feature)
where C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, μ_i is the spatial density of the i-th first feature field at C(x, d), and f_i is the target feature of the i-th first feature field at C(x, d); and obtaining the second feature domain according to the average spatial density and the average target feature at each spatial point.
According to another aspect of the present disclosure, there is provided an image generation apparatus including: the acquisition module is used for acquiring the style control content of the target image;
the extracting module is used for extracting n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, wherein n is an integer greater than or equal to 1;
the obtaining module is further configured to obtain n pairs of second shape codes and second appearance codes, where the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times in a normal distribution fitting the sample data distribution;
a generating module for generating n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes, and the n pairs of second shape codes and second appearance codes;
the acquisition module is further used for acquiring a position code of a space point corresponding to the target image;
the generating module is further configured to generate n first feature fields according to the position code and the n pairs of third shape codes and third appearance codes, where one first feature field corresponds to one third shape code and a corresponding third appearance code;
and the processing module is used for processing the n first feature fields according to a second neural network to obtain the target image.
In an embodiment of the present disclosure, the first neural network includes a feature extraction model, a shape mapper, and an appearance mapper, and the extraction module is configured to extract style features of the style control content according to the feature extraction model to obtain a style coding vector; extracting n first shape codes from the style code vector according to the shape mapper; extracting n first appearance codes from the style code vector according to the appearance mapper.
In one embodiment of the present disclosure, the style control content includes text content, or image content, or voice content; the extraction module is used for extracting the style encoding vector from the text content according to a text encoder included in the feature extraction model under the condition that the style control content comprises the text content; extracting the style encoding vector from the image content according to an image encoder included in the feature extraction model in a case where the style control content includes the image content; in the case that the style control content comprises the voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model; and extracting the style coding vector from the text content corresponding to the voice content according to the text encoder.
In one embodiment of the present disclosure, the feature extraction model is a pre-trained neural network model.
In an embodiment of the present disclosure, the obtaining module is configured to obtain H × W rays in a three-dimensional space with an origin of a camera as a center, where H × W corresponds to the size of the target image, and H and W are integers greater than 0; determining a first coordinate and a second coordinate of each ray in the H × W rays; sampling S space points in a line segment between the first coordinate and the second coordinate of each ray to obtain H × W × S space points, wherein S is an integer larger than 0; and respectively coding the H × W × S spatial points according to the coordinates of each spatial point in the H × W × S spatial points to obtain the position codes.
In an embodiment of the disclosure, the generating module is configured to input the position code and the n pairs of third shape codes and third appearance codes into a third neural network, and the third neural network generates corresponding first feature fields according to each pair of the third shape codes and the third appearance codes and the position code, so as to obtain n first feature fields.
In one embodiment of the present disclosure, the second neural network comprises a combination operator, a three-dimensional volume rendering neural network, and a prediction neural network; the processing module is used for processing the n first feature domains according to the combination operator to obtain a second feature domain; rendering the second feature domain according to the three-dimensional volume rendering neural network to obtain a feature map; and predicting the feature map according to the prediction neural network to obtain the target image.
In one embodiment of the present disclosure, the spatial point corresponding to each of the n first feature domains has a corresponding spatial density and a target feature; the processing module is used for inputting the n first feature domains into the combination operator, and the combination operator calculates the average spatial density and the average target feature of the n first feature domains at each spatial point according to the following formula;
μ̄(C(x, d)) = (1/n) Σ_{i=1}^{n} μ_i    (average spatial density)
f̄(C(x, d)) = (1/n) Σ_{i=1}^{n} f_i    (average target feature)
where C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, μ_i is the spatial density of the i-th first feature field at C(x, d), and f_i is the target feature of the i-th first feature field at C(x, d); and obtaining the second feature domain according to the average spatial density and the average target feature at each spatial point.
According to yet another aspect of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the image generation methods described above via execution of the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the image generation methods described above.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program or computer instructions which is loaded and executed by a processor to cause a computer to implement any of the image generation methods described above.
The technical scheme provided by the embodiment of the disclosure has at least the following beneficial effects:
according to the technical scheme provided by the embodiment of the disclosure, on one hand, n pairs of first shape codes and first appearance codes are extracted from the style control content according to the first neural network, and then the normal distribution of the fitted sample data distribution is randomly sampled for n times to obtain n pairs of second shape codes and second appearance codes. Thereafter, n pairs of third shape codes and third appearance codes are generated from the n pairs of first shape codes and first appearance codes, and the n pairs of second shape codes and second appearance codes. The style of the target image generated by the third coding is influenced by the n pairs of the first shape coding and the first appearance coding, and the n pairs of the first shape coding and the first appearance coding correspond to the style control content, namely, the style of the target image can be controlled by the style control content, so that the user can realize the control of the style of the target image by inputting the style control content.
On the other hand, a pair of the third shape code and the third appearance code corresponds to a pair of the second shape code and the second appearance code, and the second shape code and the second appearance code are randomly sampled from a normal distribution fitting the sample data distribution, that is, a pair of the third shape code and the third appearance code corresponds to one sample data, and n pairs of the third shape code and the third appearance code correspond to n sample data. And n pairs of the third shape code and the third appearance code respectively correspond to n first feature fields, that is, each first feature field corresponds to one sample data. The target image generated by the n first feature fields is controlled by the n sample data, and controlling the target image by the n sample data may make the generated target image a virtual image that does not actually exist.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 shows a schematic diagram of a system architecture in an embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram of an image generation method of one embodiment of the present disclosure;
FIG. 3 illustrates a three-dimensional space ray generation and point sampling schematic in one embodiment of the present disclosure;
FIG. 4 illustrates a flow diagram for extracting n pairs of a first shape code and a first appearance code from a style control content in one embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating processing of n first feature domains according to a second neural network to obtain a target image according to an embodiment of the present disclosure;
FIG. 6 illustrates an overall architecture diagram of an image generation system in one embodiment of the present disclosure;
FIG. 7 shows a flow diagram of an image generation process in one embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of an image generation apparatus in an embodiment of the disclosure;
fig. 9 shows a block diagram of an electronic device in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 shows a schematic diagram of an exemplary system architecture of an image generation method or an image generation apparatus to which an embodiment of the present disclosure can be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
The medium of the network 104 for providing communication links between the terminal devices 101, 102, 103 and the server 105 may be a wired network or a wireless network.
Optionally, the wireless or wired networks described above use standard communication techniques and/or protocols. The Network is typically the Internet, but can be any Network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired or wireless Network, a private Network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including HyperText Mark-up Language (HTML), Extensible Mark-up Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
The terminal devices 101, 102, 103 may be various electronic devices including, but not limited to, smart phones, tablets, laptop portable computers, desktop computers, wearable devices, augmented reality devices, virtual reality devices, and the like.
The server 105 may be a server that provides various services, such as a background management server that provides support for devices operated by users using the terminal apparatuses 101, 102, 103. The background management server can analyze and process the received data such as the request and feed back the processing result to the terminal equipment.
Optionally, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like.
Those skilled in the art will appreciate that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative, and that there may be any number of terminal devices, networks, and servers, as desired. The embodiments of the present disclosure are not limited thereto.
The present exemplary embodiment will be described in detail below with reference to the drawings and examples.
The embodiments of the present disclosure provide an image generation method, which can be executed by any electronic device with computing and processing capability; for example, the electronic device may be a terminal device or a server.
Fig. 2 shows a flowchart of an image generation method in an embodiment of the present disclosure, and as shown in fig. 2, the image generation method provided in the embodiment of the present disclosure includes the following steps S201 to S207.
S201, style control content of the target image is acquired.
The target image is the image to be generated using the image generation method provided by the embodiments of the present disclosure. The style control content may include an image, text, or speech, etc., which is not limited by the embodiments of the present disclosure. For example, when the style control content includes an image, it may be an image of a red chair. As another example, when the style control content includes text, it may be the text "a red chair". For another example, when the style control content includes speech, the content corresponding to the speech may be "a red chair".
In some embodiments of the present disclosure, the style control content is stored in the terminal device or in the server. Taking a terminal device as the electronic device executing the image generation method provided by the embodiments of the present disclosure as an example, obtaining the style control content of the target image may include: the terminal device obtains the style control content from the server through the network, or the terminal device retrieves the style control content from its own memory.
In some embodiments, the terminal device is configured with a microphone through which the style control content in the form of speech may be acquired. For example, after the user says "a red chair", the terminal device receives the audio through the microphone, and then the style control content is acquired.
S202, extracting n pairs of first shape codes and first appearance codes from the style control content according to the first neural network, wherein n is an integer greater than or equal to 1.
The first neural network may be any neural network capable of extracting the first shape codes and the first appearance codes from the style control content, which is not limited by the embodiments of the present disclosure. In one embodiment, the first neural network includes a feature extraction model, a shape mapper, and an appearance mapper. The feature extraction model is used for extracting the style features of the style control content to obtain a style feature vector. The shape mapper is used for extracting the first shape codes from the style feature vector; for example, the n first shape codes may be denoted u_s^1, u_s^2, …, u_s^n. The appearance mapper is used for extracting the first appearance codes from the style feature vector; for example, the n first appearance codes may be denoted u_a^1, u_a^2, …, u_a^n.
In some embodiments of the present disclosure, the feature extraction model may be a model that includes a speech converter and a text/image encoder. The speech converter may convert speech content into text content and the text/image encoder may extract a style vector for an image or text based on style characteristics of the image or text.
And S203, acquiring n pairs of second shape codes and second appearance codes, wherein the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from the normal distribution fitting the sample data distribution.
The sample data is the data used for training the whole neural network model that generates the target image, and the normal distribution mentioned in S203 is obtained by fitting the sample data distribution in the manner used to train a generative network. Randomly sampling from the normal distribution fitted to the sample data distribution yields a noise vector; the noise vector corresponds to one item of sample data and carries the shape and appearance characteristics of that sample data. A second shape code and a second appearance code of the sample data are then extracted from the noise vector. Sampling the normal distribution fitted to the sample data distribution n times yields n noise vectors, and n pairs of second shape codes and second appearance codes are extracted from the n noise vectors. For example, the n second shape codes may be denoted v_s^1, v_s^2, …, v_s^n, and the n second appearance codes may be denoted v_a^1, v_a^2, …, v_a^n.
And S204, generating n pairs of third shape codes and third appearance codes according to the n pairs of the first shape codes and the first appearance codes and the n pairs of the second shape codes and the second appearance codes.
The third shape codes are obtained by adding the corresponding first shape codes and second shape codes, and the third appearance codes are obtained by adding the corresponding first appearance codes and second appearance codes. For example, with the n first appearance codes denoted u_a^1, …, u_a^n and the n second appearance codes denoted v_a^1, …, v_a^n, the n third appearance codes may be written z_a^i = u_a^i + v_a^i for i = 1, …, n. Similarly, with the n first shape codes denoted u_s^1, …, u_s^n and the n second shape codes denoted v_s^1, …, v_s^n, the n third shape codes may be written z_s^i = u_s^i + v_s^i for i = 1, …, n.
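As an illustrative sketch only (the tensor shapes, the code dimension, and the mean/standard-deviation parameterization of the fitted normal distribution are assumptions, not part of the disclosure), the sampling in S203 and the element-wise addition in S204 might look as follows in PyTorch-style Python:

```python
import torch

def sample_second_codes(n: int, dim: int, mean: torch.Tensor, std: torch.Tensor):
    """Draw n pairs of second shape/appearance codes from the normal distribution
    fitted to the sample data distribution (mean and std assumed pre-estimated)."""
    second_shape = mean + std * torch.randn(n, dim)    # n second shape codes
    second_appear = mean + std * torch.randn(n, dim)   # n second appearance codes
    return second_shape, second_appear

def combine_codes(first_shape, first_appear, second_shape, second_appear):
    """S204: third codes are the element-wise sums of the corresponding
    first and second codes, one pair per component."""
    return first_shape + second_shape, first_appear + second_appear
```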
And S205, acquiring the position code of the space point corresponding to the target image.
In some embodiments, obtaining a position code of a spatial point corresponding to the target image may include: taking the origin of a camera as a center in a three-dimensional space, and acquiring H x W rays, wherein H x W corresponds to the size of a target image, and H and W are integers larger than 0; determining a first coordinate and a second coordinate of each ray in the H x W rays; sampling S space points in a line segment between the first coordinate and the second coordinate of each ray to obtain H x W x S space points, wherein S is an integer larger than 0; and respectively coding the H x W x S spatial points according to the coordinates of each of the H x W x S spatial points to obtain position codes.
Wherein the camera origin is an origin in the camera coordinate system. Illustratively, as shown in FIG. 3, the camera origin corresponds to the spatial point indicated by 301 in FIG. 3. One of the H x W rays may correspond to ray 302 in fig. 3. In the ray indicated by 302 in fig. 3, the specified first coordinate may correspond to the spatial point indicated by 303 in fig. 3, or may correspond to the spatial point indicated by 304 in fig. 3; the specified second coordinate may correspond to a spatial point indicated at 306 in fig. 3, and may also correspond to a spatial point indicated at 305 in fig. 3. For example, a first coordinate of the ray indicated at 302 in fig. 3 corresponds to a spatial point indicated at 303 in fig. 3, and a second coordinate corresponds to a spatial point indicated at 306 in fig. 3, then a line segment between the first coordinate and the second coordinate samples 2 spatial points, resulting in spatial points indicated at 304 and 305 in fig. 3.
In some embodiments, the sampling manner used when sampling S spatial points in the line segment between the first coordinate and the second coordinate of each ray may be uniform sampling. Illustratively, if the first coordinate of the ray indicated at 302 in fig. 3 corresponds to the spatial point indicated at 303 in fig. 3 and the second coordinate corresponds to the spatial point indicated at 306 in fig. 3, then 2 spatial points are uniformly sampled in the line segment between the first coordinate and the second coordinate, as indicated at 307 in fig. 3.
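A minimal sketch of this uniform point sampling is given below; the near/far bounds standing in for the first and second coordinates, and the flattened (H*W, 3) ray layout, are assumptions made for illustration rather than details of the disclosure.

```python
import torch

def sample_points_on_rays(origins, directions, near, far, S):
    """Uniformly sample S spatial points on each of the H*W rays between the
    first coordinate (near bound) and the second coordinate (far bound).
    origins, directions: (H*W, 3); returns points of shape (H*W, S, 3)."""
    t = torch.linspace(0.0, 1.0, S)                # (S,)
    depths = near + (far - near) * t               # uniform depths along each ray
    return origins[:, None, :] + depths[None, :, None] * directions[:, None, :]
```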
Embodiments of the present disclosure do not limit how H × W × S spatial points are encoded. Illustratively, each of H × W × S spatial points may be encoded using equation 1.
r(p) = (sin(2^0 πp), cos(2^0 πp), …, sin(2^(L-1) πp), cos(2^(L-1) πp))    (Equation 1)
where p is a coordinate of the spatial point, r(p) is the position code of the spatial point, and L is a value set according to experiments; in some embodiments, L is typically set to an integer between 4 and 10.
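For illustration, Equation 1 can be implemented per coordinate as in the sketch below; the default L = 6 is only an example within the 4-10 range mentioned above.

```python
import torch

def positional_encoding(p: torch.Tensor, L: int = 6) -> torch.Tensor:
    """Position code r(p) of Equation 1, applied to each coordinate of p.
    p: (..., 3) spatial point coordinates; returns (..., 3 * 2 * L)."""
    freqs = (2.0 ** torch.arange(L, dtype=p.dtype)) * torch.pi   # 2^0*pi, ..., 2^(L-1)*pi
    angles = p[..., None] * freqs                                # (..., 3, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                             # concatenated sin/cos terms
```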
S206, generating n first feature fields according to the position codes and the n pairs of third shape codes and third appearance codes, wherein one first feature field corresponds to one third shape code and one corresponding third appearance code.
In some embodiments of the disclosure, generating n first feature fields from the position coding and the n-pair third shape coding and the third appearance coding may include: and inputting the position codes and the n pairs of third shape codes and third appearance codes into a third neural network, and generating corresponding first feature domains by the third neural network according to each pair of the third shape codes and the third appearance codes and the position codes to obtain n first feature domains. As to what kind of neural network the third neural network is, the embodiment of the present disclosure does not limit this, and any neural network capable of implementing S206 may be applied thereto. For example, the third neural network is a neural network composed of several fully-connected layers, or the third neural network is a neural network composed of several convolutional layers.
In some embodiments, the last of the n first feature fields is a background of the target image.
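One possible form of such a third neural network (the disclosure does not fix one) is a small conditional MLP that maps the position code together with one pair of third shape and appearance codes to a spatial density and a target feature for every spatial point; the layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureFieldMLP(nn.Module):
    """Sketch of a third neural network: position code + (third shape code,
    third appearance code) -> spatial density and target feature per point."""

    def __init__(self, pos_dim=36, code_dim=128, hidden=128, feat_dim=32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)                    # spatial density
        self.feature_head = nn.Linear(hidden + code_dim, feat_dim)  # target feature

    def forward(self, pos_code, shape_code, appear_code):
        # pos_code: (P, pos_dim); shape_code, appear_code: 1-D tensors of size code_dim
        P = pos_code.shape[0]
        h = self.trunk(torch.cat([pos_code, shape_code.expand(P, -1)], dim=-1))
        density = torch.relu(self.density_head(h))
        feature = self.feature_head(torch.cat([h, appear_code.expand(P, -1)], dim=-1))
        return density, feature   # together they form one first feature field
```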
And S207, processing the n first feature domains according to the second neural network to obtain a target image.
In some embodiments of the present disclosure, the second neural network comprises a combination operator, a three-dimensional volume rendering neural network, and a prediction neural network. The combination operator is used for combining the n first feature fields into a second feature field. The three-dimensional volume rendering neural network is used for rendering the second feature field to obtain a corresponding feature map. The prediction neural network is used for predicting the color value and the density value of the feature map to obtain the target image.
According to the technical scheme provided by the embodiments of the present disclosure, on one hand, n pairs of first shape codes and first appearance codes are extracted from the style control content according to the first neural network, and n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from the normal distribution fitted to the sample data distribution. Thereafter, n pairs of third shape codes and third appearance codes are generated from the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes. The style of the target image generated from the third shape codes and third appearance codes is therefore influenced by the n pairs of first shape codes and first appearance codes, which correspond to the style control content; that is, the style of the target image can be controlled by the style control content, so the user can control the style of the target image by inputting style control content.
On the other hand, a pair of the third shape code and the third appearance code corresponds to a pair of the second shape code and the second appearance code, and the second shape code and the second appearance code are randomly sampled from a normal distribution fitting the sample data distribution, that is, a pair of the third shape code and the third appearance code corresponds to one sample data, and n pairs of the third shape code and the third appearance code correspond to n sample data. And n pairs of the third shape code and the third appearance code respectively correspond to n first feature fields, that is, each first feature field corresponds to one sample data. The target image generated by the n first feature fields is controlled by the n sample data, and controlling the target image by the n sample data may make the generated target image a virtual image that does not actually exist.
Fig. 4 shows a flowchart of extracting n pairs of first shape codes and first appearance codes from the style control content according to an embodiment of the present disclosure, as shown in fig. 4, including the following S401 to S403.
S401, style characteristics of the style control content are extracted according to the characteristic extraction model, and style coding vectors are obtained.
In some embodiments, the style control content includes textual content, or image content, or speech content, and the feature extraction model includes a text/image encoder and a speech converter. Extracting style features of the style control content according to the feature extraction model to obtain a style coding vector, which may include: under the condition that the style control content comprises text content, extracting a style coding vector from the text content according to a text encoder included in the feature extraction model; extracting a style encoding vector from the image content according to an image encoder included in the feature extraction model in a case where the style control content includes the image content; in the case where the style control content includes voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model; and extracting style coding vectors from the text content corresponding to the voice content according to a text encoder.
For example, the style control content includes voice content corresponding to the words "a white car". After the voice content is input into the voice converter, the voice converter converts the voice content into text content, obtaining "a white car" in text form. The text content "a white car" is then input into the text encoder, and the text encoder extracts the corresponding style encoding vector from it.
The embodiments of the present disclosure do not limit the types of the voice converter and the text/image encoder; for example, the text/image encoder may be a CLIP (Contrastive Language-Image Pre-Training) model, and the voice converter may be any neural network model capable of converting voice content into text content.
In some embodiments of the present disclosure, the feature extraction model is a pre-trained neural network model, that is, the feature extraction model includes a speech converter and a text/image encoder that are both pre-trained neural network models.
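As a hedged illustration (the choice of the open-source CLIP package and the "ViT-B/32" checkpoint is an assumption, and the speech converter is left abstract because the disclosure does not fix one), the three input branches of the feature extraction model could be wired up as follows:

```python
import clip    # open-source CLIP package: pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def style_vector_from_text(text: str) -> torch.Tensor:
    """Text encoder branch: text content -> style encoding vector."""
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        return model.encode_text(tokens)[0]

def style_vector_from_image(path: str) -> torch.Tensor:
    """Image encoder branch: image content -> style encoding vector."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        return model.encode_image(image)[0]

def style_vector_from_speech(audio_path: str, speech_to_text) -> torch.Tensor:
    """Speech branch: convert speech to text with any pre-trained speech
    converter (passed in as `speech_to_text`), then reuse the text branch."""
    return style_vector_from_text(speech_to_text(audio_path))
```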
S402, extracting n first shape codes from the style code vector according to a shape mapper.
As to what network model the shape mapper is specifically, the embodiment of the disclosure is not limited, and the network model capable of extracting the first shape code from the style code vector may be applied to this. For example, the shape mapper may comprise several layers of a fully connected network, or several layers of a convolutional network. After inputting the style code vector into the shape mapper, the style code vector is processed by the shape mapper to output n first shape codes.
S403, extracting n first appearance codes from the style code vector according to the appearance mapper.
The embodiments of the present disclosure do not limit which network model the appearance mapper is; any network model capable of extracting the first appearance codes from the style encoding vector may be used. For example, the appearance mapper may comprise several fully connected layers, or several convolutional layers. After the style encoding vector is input into the appearance mapper, it is processed by the appearance mapper to output the n first appearance codes.
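A minimal sketch of such a mapper, usable for either the shape mapper or the appearance mapper, is shown below; all dimensions are illustrative assumptions rather than values specified by the disclosure.

```python
import torch
import torch.nn as nn

class CodeMapper(nn.Module):
    """A possible shape mapper or appearance mapper: a few fully connected
    layers that map one style encoding vector to n codes of dimension code_dim."""

    def __init__(self, style_dim=512, code_dim=128, n=3, hidden=256):
        super().__init__()
        self.n, self.code_dim = n, code_dim
        self.net = nn.Sequential(
            nn.Linear(style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n * code_dim))

    def forward(self, style_vec: torch.Tensor) -> torch.Tensor:
        # style_vec: (style_dim,) -> (n, code_dim): n first shape (or appearance) codes
        return self.net(style_vec).view(self.n, self.code_dim)

# Usage: two independent mappers share the same style encoding vector.
# shape_mapper, appearance_mapper = CodeMapper(), CodeMapper()
# first_shape_codes = shape_mapper(style_vec)            # (n, code_dim)
# first_appearance_codes = appearance_mapper(style_vec)  # (n, code_dim)
```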
According to the technical scheme provided by the embodiments of the present disclosure, fusion across the visual, text, and speech modalities is achieved by introducing the pre-trained text/image encoder and the pre-trained voice converter, so that a user can describe the desired style effect by inputting a sentence and can also control the style of the target image through image content or voice content. On the other hand, the style features of the image content, text content, or voice content are extracted by the pre-trained feature extraction model, and the first appearance codes and first shape codes are then obtained directly through the two mappers (the shape mapper and the appearance mapper), so that the model is a black box to the user; the user does not need to adjust the model when using it, which reduces the difficulty of applying the technical scheme provided by the embodiments of the present disclosure.
Fig. 5 shows a flowchart of processing n first feature domains according to a second neural network to obtain a target image in an embodiment of the present disclosure, and as shown in fig. 5, the flowchart includes the following S501 to S503.
S501, processing the n first feature domains according to the combined operator to obtain a second feature domain.
In some embodiments, the spatial point corresponding to each of the n first feature domains has a corresponding spatial density and target feature. In this case, processing the n first feature domains according to the combination operator to obtain the second feature domain may include: inputting the n first feature domains into the combination operator, and calculating the average spatial density and the average target feature of the n first feature domains at each spatial point by the combination operator according to the following Formula 2 and Formula 3;
μ̄(C(x, d)) = (1/n) Σ_{i=1}^{n} μ_i    (Formula 2)
f̄(C(x, d)) = (1/n) Σ_{i=1}^{n} f_i    (Formula 3)
where C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, μ_i is the spatial density of the i-th first feature field at C(x, d), and f_i is the target feature of the i-th first feature field at C(x, d); and obtaining a second feature domain according to the average spatial density and the average target feature at each spatial point.
Obtaining the second feature domain according to the average spatial density and the average target feature at each spatial point may include: taking the average spatial density and the average target feature at each spatial point as the spatial density and the target feature of that spatial point in the second feature domain, thereby obtaining the second feature domain. The target feature may be a color value feature.
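Read literally as the averages of Formula 2 and Formula 3, the combination operator reduces to a mean over the n first feature domains; the sketch below assumes the densities and target features of all fields have already been evaluated at the same P sampled spatial points.

```python
import torch

def combine_feature_fields(densities: torch.Tensor, features: torch.Tensor):
    """Combination operator sketch following Formula 2 and Formula 3.
    densities: (n, P, 1) spatial density of each first feature field at P points.
    features:  (n, P, F) target feature of each first feature field at P points."""
    mean_density = densities.mean(dim=0)   # (P, 1) average spatial density
    mean_feature = features.mean(dim=0)    # (P, F) average target feature
    return mean_density, mean_feature      # together: the second feature domain
```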
And S502, rendering the second feature domain according to the three-dimensional volume rendering neural network to obtain a feature map.
The spatial points are the points obtained by sampling S points on each of the H x W rays, so there are S spatial points on each ray. In some embodiments, rendering the second feature domain according to the three-dimensional volume rendering neural network to obtain the feature map may include: calculating, from the S spatial points on each ray, the feature of one point in the feature map according to the following Formula 4.
f = Σ_{j=1}^{S} τ_j · α_j · f_j    (Formula 4)
where τ_j is the cumulative transmittance at the j-th point on a ray of the second feature domain, α_j is the transparency at the j-th point on the ray, f_j is the target feature at the j-th point on the ray, and f is the feature of one point in the feature map.
τ_j can be calculated by the following Formula 5.
τ_j = Π_{k=1}^{j-1} (1 - α_k)    (Formula 5)
where α_k is the transparency at the k-th point on a ray of the second feature domain, and j is an integer greater than 1.
Wherein alpha is j Can be calculated as shown in the following equation 6.
Figure BDA0003723871360000154
Wherein, mu j Is the spatial density, δ, at the jth point on a ray of the second feature field j The distance between the jth point and the j +1 th point on one ray of the second feature field.
δ_j can be calculated by the following Formula 7.
δ_j = ‖x_{j+1} - x_j‖    (Formula 7)
where x_j is the coordinate of the j-th point on a ray of the second feature domain, and x_{j+1} is the coordinate of the (j+1)-th point on that ray.
The three-dimensional volume rendering neural network renders the second feature domain according to Formula 4 to Formula 7 to obtain the features of the H x W points corresponding to the feature map, and the features of these H x W points form the feature map.
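A sketch of Formula 4 to Formula 7 evaluated with tensor operations is given below; padding the last δ with a large constant and the (R, S, ·) layout with R = H x W rays are implementation assumptions, not requirements of the disclosure.

```python
import torch

def render_feature_map(density: torch.Tensor, feature: torch.Tensor, coords: torch.Tensor):
    """Feature-map rendering sketch following Formula 4 to Formula 7.
    density: (R, S, 1) spatial density mu_j of the second feature domain
    feature: (R, S, F) target feature f_j
    coords:  (R, S, 3) coordinates x_j of the S points on each of the R = H*W rays
    Returns (R, F): one feature per ray, i.e. per point of the feature map."""
    # Formula 7: delta_j = ||x_{j+1} - x_j||; the last delta is padded with a large value
    delta = (coords[:, 1:] - coords[:, :-1]).norm(dim=-1, keepdim=True)       # (R, S-1, 1)
    delta = torch.cat([delta, 1e10 * torch.ones_like(delta[:, :1])], dim=1)    # (R, S, 1)
    # Formula 6: alpha_j = 1 - exp(-mu_j * delta_j)
    alpha = 1.0 - torch.exp(-density * delta)                                   # (R, S, 1)
    # Formula 5: tau_j = prod_{k<j} (1 - alpha_k), with tau_1 = 1
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1)    # (R, S, 1)
    # Formula 4: f = sum_j tau_j * alpha_j * f_j
    return (trans * alpha * feature).sum(dim=1)                                 # (R, F)
```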
And S503, predicting the characteristic graph according to the prediction neural network to obtain a target image.
The embodiments of the present disclosure do not limit which neural network the prediction neural network is; for example, the prediction neural network may comprise a fully connected network. After the feature map is input into the prediction neural network, the prediction neural network processes the feature map, that is, it predicts the color value and density of each point according to the features of that point in the feature map, and the target image is obtained according to the prediction result, the size of the target image being H x W.
Combining the n first feature domains into one second feature domain and then rendering the second feature domain reduces the amount of computation required in rendering. In addition, because the plurality of first feature domains are combined into one second feature domain, the target image obtained by rendering and prediction is a virtual image that does not actually exist, thereby realizing the generation of a virtual image.
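As an illustrative sketch of the prediction step (the per-point fully connected form, the sigmoid output range, and the layer sizes are assumptions), the prediction neural network can be a small network applied to every feature of the H x W feature map:

```python
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """Prediction neural network sketch: a per-point fully connected network
    that predicts a color value and a density value from each feature of the
    H x W feature map, yielding the H x W target image."""

    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))    # 3 color channels + 1 density value

    def forward(self, feature_map: torch.Tensor):
        # feature_map: (H, W, feat_dim)
        out = self.net(feature_map)                   # (H, W, 4)
        color, density = out[..., :3], out[..., 3:]
        return torch.sigmoid(color), density          # (H, W, 3) image, (H, W, 1) density
```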
Under the image generation method provided by the embodiment of the present disclosure, the overall architecture of the image generation system is as shown in fig. 6. In the system architecture shown in fig. 6, the image generation process may include S701 to S712 as shown in fig. 7.
S701, obtaining the style control content, where the implementation of S701 is already described in S201 of the embodiment corresponding to fig. 2, and is not described herein again.
S702, inputting the style control content obtained in S701 into the voice converter 602, the text encoder 603 and/or the image encoder 604 to obtain a style encoding vector; among them, the speech converter 602, the text encoder 603, and the image encoder 604 belong to the feature extraction model 601.
S703, inputting the style encoding vector obtained in S702 into the shape mapper 605 and the appearance mapper 606 to obtain n pairs of first shape codes and first appearance codes, as indicated by 607 in fig. 6 (for example, the n first shape codes u_s^1, …, u_s^n and the n first appearance codes u_a^1, …, u_a^n).
S704, randomly sampling from the normal distribution fitted to the sample data distribution to obtain n pairs of second shape codes and second appearance codes, as indicated by 608 in fig. 6 (for example, the n second shape codes v_s^1, …, v_s^n and the n second appearance codes v_a^1, …, v_a^n).
S705, adding the results of S703 and S704, i.e. adding n pairs of the first shape code and the first appearance code to n pairs of the second shape code and the second appearance code, respectively, to obtain n pairs of the third shape code and the third appearance code.
S706, generating H × W rays in the three-dimensional space by taking the origin of the camera as the center, and determining S space points from each ray to obtain H × W × S space points; the implementation process of S706 is already described in S205 of the embodiment corresponding to fig. 2, and is not described herein again.
S707, performing position coding on the spatial points obtained in S706 to obtain position codes 609 of the H × W × S spatial points; the implementation process of S707 is already described in S205 of the embodiment corresponding to fig. 2, and is not described herein again.
S708, inputting the results of S707 and S705 into a feature extraction network (third neural network) to obtain n first feature domains 610; the implementation of S708 is already described in S206 of the embodiment corresponding to fig. 2, and is not described here again.
S709, inputting the n first feature domains 610 into a combination operator 611 to obtain the second feature domain; wherein the feature extraction network and the combination operator 611 belong to a component-level editable neural radiance field module 612; the implementation process of S709 is already described in S501 of the embodiment corresponding to fig. 5, and is not described here again.
S710, inputting the second feature domain into the three-dimensional volume rendering model 613 to obtain a feature map; the implementation process of S710 is already described in S502 of the embodiment corresponding to fig. 5, and is not described herein again.
S711, inputting the feature map into a prediction neural network 614 to perform color value and density prediction to obtain the target image; the implementation process of S711 is already described in S503 of the embodiment corresponding to fig. 5, and is not described herein again. When the whole image generation system is being trained, S712 is further performed.
S712, the target image and the corresponding sample image are input to the discriminator 615, the discriminator 615 compares the target image and the corresponding sample image, and the loss value corresponding to the comparison result is output for back propagation.
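The disclosure only states that the discriminator compares the generated target image with a sample image and outputs a loss value for back propagation; one common concrete choice (an assumption here, not the method fixed by the disclosure) is the standard binary cross-entropy GAN loss sketched below.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """One possible comparison loss for step S712; d_real / d_fake are the
    discriminator logits for the sample image and the generated target image."""
    real_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake_loss = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Loss back-propagated through the whole image generation system during training."""
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```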
The system architecture corresponding to the image generation method provided by the embodiments of the present disclosure provides a generative rendering neural network architecture, so that the trained rendering neural network has the capability of generating virtual images; different scenes can be obtained with different training data, and the method is applicable both to new-view generation for general scenes and to new-view generation for virtual character images. Because the feature extraction model is a pre-trained neural network model, the image generation speed is also improved to a certain extent.
Based on the same inventive concept, an image generation apparatus is also provided in the embodiments of the present disclosure, as described in the following embodiments. Because the principle of the embodiment of the apparatus for solving the problem is similar to that of the embodiment of the method, the embodiment of the apparatus can be implemented by referring to the implementation of the embodiment of the method, and repeated details are not described again.
Fig. 8 shows a schematic diagram of an image generation apparatus in an embodiment of the disclosure. As shown in fig. 8, the apparatus includes: an obtaining module 801, configured to obtain style control content of a target image; an extracting module 802, configured to extract n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, where n is an integer greater than or equal to 1; the obtaining module 801 is further configured to obtain n pairs of second shape codes and second appearance codes, where the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times in a normal distribution fitting the sample data distribution; a generating module 803, configured to generate n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes; the obtaining module 801 is further configured to obtain a position code of a spatial point corresponding to the target image; the generating module 803 is further configured to generate n first feature fields according to the position code and the n pairs of third shape codes and third appearance codes, where one first feature field corresponds to one third shape code and a corresponding third appearance code; and a processing module 804, configured to process the n first feature fields according to the second neural network to obtain the target image.
In an embodiment of the present disclosure, the first neural network includes a feature extraction model, a shape mapper, and an appearance mapper, and the extraction module 802 is configured to extract style features of the style control content according to the feature extraction model to obtain a style coding vector; extracting n first shape codes from the style code vector according to a shape mapper; n first appearance codes are extracted from the style encoding vector according to an appearance mapper.
In one embodiment of the present disclosure, the style control content comprises textual content, or image content, or voice content; an extracting module 802, configured to, when the style control content includes text content, extract a style encoding vector from the text content according to a text encoder included in the feature extraction model; extracting a style encoding vector from the image content according to an image encoder included in the feature extraction model in a case where the style control content includes the image content; in the case where the style control content includes voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model; and extracting style encoding vectors from the text content corresponding to the voice content according to a text encoder.
In one embodiment of the present disclosure, the feature extraction model is a pre-trained neural network model.
In an embodiment of the present disclosure, the obtaining module 801 is configured to obtain H × W rays in a three-dimensional space centered on the camera origin, where H × W corresponds to the size of the target image, and H and W are integers greater than 0; determine a first coordinate and a second coordinate of each of the H × W rays; sample S spatial points on the line segment between the first coordinate and the second coordinate of each ray to obtain H × W × S spatial points, where S is an integer greater than 0; and encode the H × W × S spatial points respectively according to the coordinates of each of the H × W × S spatial points to obtain the position codes.
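A minimal sketch of the ray construction and position coding described above is given below, assuming a pinhole camera at the origin, near/far bounds standing in for the first and second coordinates of each ray, and a sinusoidal position code; the intrinsics, bounds, and number of frequencies are illustrative assumptions.

```python
import torch

def sample_points(H, W, S, focal, near=2.0, far=6.0):
    # H * W rays from a pinhole camera at the origin looking down -z (assumed);
    # near/far stand in for the first and second coordinates of each ray.
    i, j = torch.meshgrid(torch.arange(W, dtype=torch.float32),
                          torch.arange(H, dtype=torch.float32), indexing="xy")
    dirs = torch.stack([(i - W / 2) / focal,
                        -(j - H / 2) / focal,
                        -torch.ones_like(i)], dim=-1)            # (H, W, 3)
    t = torch.linspace(near, far, S)                              # S samples per ray
    return dirs[..., None, :] * t[None, None, :, None]            # (H, W, S, 3)

def positional_encoding(x, num_freqs=10):
    # Sinusoidal encoding of point coordinates; num_freqs is an assumption.
    freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi
    x = x[..., None] * freqs                                      # (..., 3, num_freqs)
    return torch.cat([torch.sin(x), torch.cos(x)], dim=-1).flatten(-2)

pts = sample_points(H=64, W=64, S=32, focal=64.0)                 # H*W*S spatial points
pos_code = positional_encoding(pts)                               # (64, 64, 32, 60)
```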
In an embodiment of the present disclosure, the generating module 803 is configured to input the position code and the n pairs of the third shape codes and the third appearance codes into a third neural network, and generate, by the third neural network, corresponding first feature fields according to each pair of the third shape codes and the third appearance codes and the position codes, so as to obtain n first feature fields.
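To make the role of the third neural network concrete, the sketch below conditions a NeRF-style MLP on one pair of third shape and appearance codes and returns a spatial density and a target feature for every position-coded point, i.e. one first feature field. The layer widths, and the choice of deriving the density from the shape code and the feature from the appearance code, follow common generative-NeRF practice and are assumptions rather than the disclosed architecture.

```python
import torch
import torch.nn as nn

class FeatureField(nn.Module):
    """One first feature field: maps (position code, shape code, appearance code)
    to a spatial density and a target feature at each spatial point."""
    def __init__(self, pos_dim=60, code_dim=128, feat_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(pos_dim + code_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 256), nn.ReLU())
        self.density_head = nn.Linear(256, 1)
        self.feature_head = nn.Sequential(nn.Linear(256 + code_dim, 256), nn.ReLU(),
                                          nn.Linear(256, feat_dim))

    def forward(self, pos_code, shape_code, appearance_code):
        shape_code = shape_code.expand(*pos_code.shape[:-1], -1)
        h = self.trunk(torch.cat([pos_code, shape_code], dim=-1))
        density = torch.relu(self.density_head(h))                   # mu_i at each point
        appearance_code = appearance_code.expand(*h.shape[:-1], -1)
        feature = self.feature_head(torch.cat([h, appearance_code], dim=-1))
        return density, feature                                      # f_i at each point

# one feature field per pair of third codes; n fields in total
field = FeatureField()
density_i, feature_i = field(torch.randn(64, 64, 32, 60),            # position codes
                             torch.randn(128), torch.randn(128))     # one pair of third codes
```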
In one embodiment of the present disclosure, the second neural network comprises a combination operator, a three-dimensional volume rendering neural network, and a prediction neural network; the processing module 804 is configured to process the n first feature fields according to the combination operator to obtain a second feature field; render the second feature field according to the three-dimensional volume rendering neural network to obtain a feature map; and predict the feature map according to the prediction neural network to obtain the target image.
In one embodiment of the present disclosure, the spatial points corresponding to each of the n first feature fields have respective spatial densities and target features; the processing module 804 is configured to input the n first feature fields into the combination operator, and the combination operator calculates an average spatial density and an average target feature of the n first feature fields at each spatial point according to the following formula:
$$\bar{\mu}\big(C(x,d)\big)=\frac{1}{n}\sum_{i=1}^{n}\mu_i\big(C(x,d)\big),\qquad \bar{f}\big(C(x,d)\big)=\frac{1}{n}\sum_{i=1}^{n}f_i\big(C(x,d)\big)$$
where C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, μ_i is the spatial density of the i-th first feature field at C(x, d), and f_i is the target feature of the i-th first feature field at C(x, d); the second feature field is then obtained according to the average spatial density and the average target feature at each spatial point.
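Read plainly, the formula above takes a per-point arithmetic mean over the n first feature fields, and the combination operator can be sketched as follows. A density-weighted mean is another common choice in compositional rendering, so treat this as an assumption consistent with the text rather than the only possible reading.

```python
import torch

def combine_feature_fields(densities, features):
    # densities: (n, ..., 1) spatial densities mu_i at each spatial point C(x, d)
    # features:  (n, ..., F) target features f_i at each spatial point C(x, d)
    # Per the formula above, the second feature field holds the average spatial
    # density and the average target feature over the n first feature fields.
    return densities.mean(dim=0), features.mean(dim=0)

# n = 3 first feature fields over a (64, 64, 32) grid of spatial points
avg_density, avg_feature = combine_feature_fields(torch.randn(3, 64, 64, 32, 1).relu(),
                                                  torch.randn(3, 64, 64, 32, 64))
```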
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 900 according to this embodiment of the disclosure is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of the electronic device 900 may include, but are not limited to: at least one processing unit 910, at least one storage unit 920, and a bus 930 that couples various system components including the storage unit 920 and the processing unit 910.
Wherein the storage unit stores program code that can be executed by the processing unit 910 to cause the processing unit 910 to perform the steps according to various exemplary embodiments of the present disclosure described in the above section "detailed description" of the present specification.
The storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access memory unit (RAM) 9201 and/or a cache memory unit 9202, and may further include a read-only memory unit (ROM) 9203.
Storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 930 can be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 940 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown in FIG. 9, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium, which may be a readable signal medium or a readable storage medium, on which a program product capable of implementing the above-described method of the present disclosure is stored. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure as described in the above section "detailed description" of this specification, when the program product is run on the terminal device.
More specific examples of the computer-readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may include a propagated data signal with readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Alternatively, program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In an exemplary embodiment of the present disclosure, there is also provided a computer program product comprising a computer program or computer instructions which, when loaded and executed by a processor, cause a computer to implement any of the image generation methods described above.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.

Claims (11)

1. An image generation method, comprising:
acquiring style control content of a target image;
extracting n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, wherein n is an integer greater than or equal to 1;
acquiring n pairs of second shape codes and second appearance codes, wherein the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution;
generating n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes;
acquiring a position code of a space point corresponding to the target image;
generating n first feature fields according to the position codes and the n pairs of third shape codes and third appearance codes, wherein one first feature field corresponds to one third shape code and one corresponding third appearance code;
and processing the n first feature fields according to a second neural network to obtain the target image.
2. The method of claim 1, wherein the first neural network comprises a feature extraction model, a shape mapper, and an appearance mapper, and wherein extracting n pairs of a first shape code and a first appearance code from the style control content according to the first neural network comprises:
extracting style features of the style control content according to the feature extraction model to obtain a style encoding vector;
extracting n first shape codes from the style encoding vector according to the shape mapper;
extracting n first appearance codes from the style encoding vector according to the appearance mapper.
3. The method of claim 2, wherein the style control content comprises text content, or image content, or voice content; the extracting style features of the style control content according to the feature extraction model to obtain a style encoding vector comprises the following steps:
under the condition that the style control content comprises the text content, extracting the style encoding vector from the text content according to a text encoder included in the feature extraction model;
extracting the style encoding vector from the image content according to an image encoder included in the feature extraction model in a case where the style control content includes the image content;
in the case that the style control content comprises the voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model; and extracting the style encoding vector from the text content corresponding to the voice content according to the text encoder.
4. The method of claim 3, wherein the feature extraction model is a pre-trained neural network model.
5. The method of claim 1, wherein obtaining a position code of a spatial point corresponding to the target image comprises:
taking the origin of a camera as a center in a three-dimensional space, and acquiring H x W rays, wherein H x W corresponds to the size of the target image, and H and W are integers larger than 0;
determining a first coordinate and a second coordinate of each ray in the H x W rays;
sampling S space points in a line segment between the first coordinate and the second coordinate of each ray to obtain H x W x S space points, wherein S is an integer larger than 0;
and respectively coding the H x W x S space points according to the coordinates of each space point in the H x W x S space points to obtain the position codes.
6. The method of claim 1, wherein the generating n first feature fields according to the position codes and the n pairs of third shape codes and third appearance codes comprises:
and inputting the position codes and the n pairs of third shape codes and third appearance codes into a third neural network, and generating corresponding first feature fields by the third neural network according to each pair of third shape codes and third appearance codes and the position codes to obtain n first feature fields.
7. The method of claim 1, wherein the second neural network comprises a combination operator, a three-dimensional volume rendering neural network, and a prediction neural network; the processing the n first feature fields according to the second neural network to obtain the target image includes:
processing the n first feature fields according to the combination operator to obtain a second feature field;
rendering the second feature field according to the three-dimensional volume rendering neural network to obtain a feature map;
and predicting the feature map according to the prediction neural network to obtain the target image.
8. The method of claim 7, wherein the spatial points corresponding to each of the n first feature fields have respective spatial densities and target features; the processing the n first feature fields according to the combination operator to obtain a second feature field includes:
inputting the n first feature fields into the combination operator, and calculating the average spatial density and the average target feature of the n first feature fields at each spatial point by the combination operator according to the following formula:
$$\bar{\mu}\big(C(x,d)\big)=\frac{1}{n}\sum_{i=1}^{n}\mu_i\big(C(x,d)\big),\qquad \bar{f}\big(C(x,d)\big)=\frac{1}{n}\sum_{i=1}^{n}f_i\big(C(x,d)\big)$$
where C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, μ_i is the spatial density of the i-th first feature field at C(x, d), and f_i is the target feature of the i-th first feature field at C(x, d);
and obtaining the second feature field according to the average spatial density and the average target feature at each spatial point.
9. An image generation apparatus, comprising:
an acquisition module, configured to acquire style control content of a target image;
an extracting module, configured to extract n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, wherein n is an integer greater than or equal to 1;
the acquisition module is further configured to acquire n pairs of second shape codes and second appearance codes, wherein the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution;
a generating module, configured to generate n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes;
the acquisition module is further configured to acquire a position code of a space point corresponding to the target image;
the generating module is further configured to generate n first feature fields according to the position code and the n pairs of third shape codes and third appearance codes, wherein one first feature field corresponds to one third shape code and a corresponding third appearance code;
and a processing module, configured to process the n first feature fields according to a second neural network to obtain the target image.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the image generation method of any of claims 1-8 via execution of the executable instructions.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the image generation method of any one of claims 1 to 8.
CN202210770690.3A 2022-06-30 2022-06-30 Image generation method, device, electronic equipment and computer readable storage medium Active CN115063536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210770690.3A CN115063536B (en) 2022-06-30 2022-06-30 Image generation method, device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210770690.3A CN115063536B (en) 2022-06-30 2022-06-30 Image generation method, device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN115063536A true CN115063536A (en) 2022-09-16
CN115063536B CN115063536B (en) 2023-10-10

Family

ID=83204581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210770690.3A Active CN115063536B (en) 2022-06-30 2022-06-30 Image generation method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115063536B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200257938A1 (en) * 2019-02-08 2020-08-13 Adhark, Inc. Systems, methods, and storage media for evaluating images
US20200302176A1 (en) * 2019-03-18 2020-09-24 Nvidia Corporation Image identification using neural networks
US20210150197A1 (en) * 2019-11-15 2021-05-20 Ariel Al Ltd Image generation using surface-based neural synthesis
US20210343063A1 (en) * 2020-05-04 2021-11-04 Microsoft Technology Licensing, Llc Computing photorealistic versions of synthetic images
US20220092108A1 (en) * 2020-09-18 2022-03-24 Adobe Inc. Determining fine-grain visual style similarities for digital images by extracting style embeddings disentangled from image content
CN113160035A (en) * 2021-04-16 2021-07-23 浙江工业大学 Human body image generation method based on posture guidance, style and shape feature constraints
CN114202456A (en) * 2021-11-18 2022-03-18 北京达佳互联信息技术有限公司 Image generation method, image generation device, electronic equipment and storage medium
CN116597087A (en) * 2023-05-23 2023-08-15 中国电信股份有限公司北京研究院 Three-dimensional model generation method and device, storage medium and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
OR LITANY et al.: "Deformable Shape Completion with Graph Convolutional Autoencoders", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1886-1895 *
Xiaobai (小白): "Image-Based 3D Object Reconstruction", Zhihu article, retrieved from the Internet <URL: https://zhuanlan.zhihu.com/p/158097602> *
Zhang Zewei et al.: "Image Feature Classification Method Based on Multi-Agent Deep Reinforcement Learning", Computer Engineering and Applications, pages 1-9 *
Xiao Xu: "Research on Deep Learning Algorithms for 3D Content Generation Based on Scene Graph Understanding", China Master's Theses Full-text Database, pages 138-874 *

Also Published As

Publication number Publication date
CN115063536B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
US11151765B2 (en) Method and apparatus for generating information
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN109241286B (en) Method and device for generating text
CN116597087A (en) Three-dimensional model generation method and device, storage medium and electronic equipment
CN115019237B (en) Multi-mode emotion analysis method and device, electronic equipment and storage medium
CN111696520A (en) Intelligent dubbing method, device, medium and electronic equipment
JP2022172173A (en) Image editing model training method and device, image editing method and device, electronic apparatus, storage medium and computer program
CN115359314A (en) Model training method, image editing method, device, medium and electronic equipment
CN116578684A (en) Knowledge graph-based question and answer method and device and related equipment
CN110913229B (en) RNN-based decoder hidden state determination method, device and storage medium
JP2023169230A (en) Computer program, server device, terminal device, learned model, program generation method, and method
CN115063536B (en) Image generation method, device, electronic equipment and computer readable storage medium
CN112836040A (en) Multi-language abstract generation method and device, electronic equipment and computer readable medium
CN114339190B (en) Communication method, device, equipment and storage medium
CN115439610B (en) Training method and training device for model, electronic equipment and readable storage medium
CN115794494A (en) Data backup method, system, device, equipment and medium based on dynamic strategy
KR20220085824A (en) High quality video super high resolution with micro-structured masks
CN111552871A (en) Information pushing method and device based on application use record and related equipment
CN112464654A (en) Keyword generation method and device, electronic equipment and computer readable medium
CN117544822B (en) Video editing automation method and system
CN116562232A (en) Word vector processing method and device, storage medium and electronic equipment
CN117688559A (en) Virus program detection method and device, electronic equipment and storage medium
CN117409120A (en) Digital human face driving method, device, electronic equipment and storage medium
CN117252787A (en) Image re-illumination method, model training method, device, equipment and medium
CN114724579A (en) Voice separation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant