CN115063536B - Image generation method, device, electronic equipment and computer readable storage medium - Google Patents

Image generation method, device, electronic equipment and computer readable storage medium

Info

Publication number
CN115063536B
CN115063536B (Application CN202210770690.3A)
Authority
CN
China
Prior art keywords
codes
feature
appearance
shape
pairs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210770690.3A
Other languages
Chinese (zh)
Other versions
CN115063536A (en)
Inventor
张琦
刘巧俏
邹航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202210770690.3A priority Critical patent/CN115063536B/en
Publication of CN115063536A publication Critical patent/CN115063536A/en
Application granted granted Critical
Publication of CN115063536B publication Critical patent/CN115063536B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Geometry (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The disclosure provides an image generation method, an image generation device, an electronic device and a computer readable storage medium, and relates to the technical field of image processing. The method includes: acquiring style control content of a target image; extracting n pairs of first shape codes and first appearance codes from the style control content according to a first neural network; acquiring n pairs of second shape codes and second appearance codes; generating n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes; acquiring a position code of the spatial points corresponding to the target image; generating n first feature domains according to the position code and the n pairs of third shape codes and third appearance codes; and processing the n first feature domains according to a second neural network to obtain the target image. In this way, a user can control the style of the target image by inputting style control content, and a virtual image that does not actually exist can be generated.

Description

Image generation method, device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image generating method, an image generating device, an electronic device, and a computer readable storage medium.
Background
In the field of image processing technology, image generation through three-dimensional reconstruction and new-view-angle image rendering has long been a focus of research. With the emergence of new concepts such as the digital twin and the metaverse, industry demand for image generation is increasing.
In the related art, images are generated through a classical three-dimensional neural rendering model. Given a small number of pictures of an object as input, the classical three-dimensional neural rendering model can reconstruct a three-dimensional implicit representation of the object and generate images of the object from every angle through a volume rendering technique.
However, generating an image through a classical three-dimensional neural rendering model requires input images of the object, i.e., the object must actually exist. The images generated by such a model are therefore images that could be obtained in reality, and a virtual image cannot be generated. In addition, a classical three-dimensional neural rendering model renders the objects and the background as a whole, so the generated images are static and images of a corresponding style cannot be generated according to the needs of the user.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides an image generation method, apparatus, electronic device, and computer-readable storage medium, which overcome, at least to some extent, the problems in the related art that a virtual image cannot be generated and that an image of a corresponding style cannot be generated according to a user's requirements.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided an image generation method including: acquiring style control content of a target image;
extracting n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, wherein n is an integer greater than or equal to 1;
acquiring n pairs of second shape codes and second appearance codes, wherein the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution;
generating n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes;
Acquiring a position code of a space point corresponding to the target image;
generating n first feature fields according to the position codes, the n pairs of third shape codes and the third appearance codes, wherein one first feature field corresponds to one third shape code and one corresponding third appearance code;
and processing the n first feature domains according to a second neural network to obtain the target image.
In one embodiment of the present disclosure, the first neural network includes a feature extraction model, a shape mapper, and an appearance mapper, and the extracting n pairs of first shape codes and first appearance codes from the style control content according to the first neural network includes: extracting style features of the style control content according to the feature extraction model to obtain a style coding vector; extracting n first shape codes from the style coding vector according to the shape mapper; and extracting n first appearance codes from the style coding vector according to the appearance mapper.
In one embodiment of the present disclosure, the style control content includes text content, image content, or voice content; extracting the style features of the style control content according to the feature extraction model to obtain a style coding vector includes: in the case that the style control content includes the text content, extracting the style coding vector from the text content according to a text encoder included in the feature extraction model; in the case that the style control content includes the image content, extracting the style coding vector from the image content according to an image encoder included in the feature extraction model; and in the case that the style control content includes the voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model, and extracting the style coding vector from the text content corresponding to the voice content according to the text encoder.
In one embodiment of the present disclosure, the feature extraction model is a pre-trained neural network model.
In one embodiment of the disclosure, the acquiring the position code of the spatial points corresponding to the target image includes: acquiring H×W rays in a three-dimensional space with a camera origin as the center, wherein H×W corresponds to the size of the target image, and H and W are integers greater than 0; determining a first coordinate and a second coordinate of each ray in the H×W rays; sampling S spatial points in a line segment between the first coordinate and the second coordinate of each ray to obtain H×W×S spatial points, wherein S is an integer greater than 0; and encoding the H×W×S spatial points respectively according to the coordinates of each spatial point in the H×W×S spatial points to obtain the position code.
In one embodiment of the present disclosure, the generating n first feature fields according to the position code and the n pairs of third shape codes and third appearance codes includes: and inputting the position codes and the n pairs of third shape codes and third appearance codes into a third neural network, and generating corresponding first feature domains by the third neural network according to each pair of third shape codes and third appearance codes and the position codes to obtain n first feature domains.
In one embodiment of the present disclosure, the second neural network includes a combining operator, a three-dimensional volume rendering neural network, and a prediction neural network; the processing the n first feature domains according to the second neural network to obtain the target image includes: processing the n first feature domains according to the combination operator to obtain a second feature domain; rendering the second feature domain according to the three-dimensional volume rendering neural network to obtain a feature map; and predicting the feature map according to the prediction neural network to obtain the target image.
In one embodiment of the present disclosure, the spatial point corresponding to each of the n first feature domains has a corresponding spatial density and a target feature; the processing of the n first feature domains according to the combination operator to obtain a second feature domain includes: inputting the n first feature domains into the combination operator, and calculating average space density and average target feature of the n first feature domains at each space point by the combination operator according to the following formula;
wherein C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, $\mu_i$ is the spatial density of the i-th first feature domain at C(x, d), and $f_i$ is the target feature of the i-th first feature domain at C(x, d); and obtaining the second feature domain according to the average spatial density and the average target feature at each spatial point.
According to another aspect of the present disclosure, there is provided an image generating apparatus including: the acquisition module is used for acquiring style control content of the target image;
the extraction module is used for extracting n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, wherein n is an integer greater than or equal to 1;
the acquisition module is further used for acquiring n pairs of second shape codes and second appearance codes, wherein the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution;
a generating module, configured to generate n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes, and the n pairs of second shape codes and second appearance codes;
the acquisition module is also used for acquiring the position codes of the space points corresponding to the target image;
the generating module is further configured to generate n first feature fields according to the position code and the n pairs of third shape codes and third appearance codes, where one first feature field corresponds to one third shape code and a corresponding third appearance code;
And the processing module is used for processing the n first characteristic domains according to a second neural network to obtain the target image.
In one embodiment of the disclosure, the first neural network includes a feature extraction model, a shape mapper, and an appearance mapper, and the extraction module is configured to extract style features of the style control content according to the feature extraction model to obtain a style coding vector; extracting n first shape codes from the style code vector according to the shape mapper; and extracting n first appearance codes from the style code vectors according to the appearance mapper.
In one embodiment of the present disclosure, the style control content includes text content, or image content, or voice content; the extraction module is used for extracting the style coding vector from the text content according to a text encoder included in the characteristic extraction model when the style control content includes the text content; extracting the style coding vector from the image content according to an image encoder included in the feature extraction model when the style control content includes the image content; in the case that the style control content comprises the voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model; and extracting the style coding vector from the text content corresponding to the voice content according to the text encoder.
In one embodiment of the present disclosure, the feature extraction model is a pre-trained neural network model.
In an embodiment of the disclosure, the acquiring module is configured to acquire, in a three-dimensional space, H×W rays with a camera origin as the center, where H×W corresponds to the size of the target image, and H and W are integers greater than 0; determine a first coordinate and a second coordinate of each ray in the H×W rays; sample S spatial points in a line segment between the first coordinate and the second coordinate of each ray to obtain H×W×S spatial points, where S is an integer greater than 0; and encode the H×W×S spatial points respectively according to the coordinates of each spatial point in the H×W×S spatial points to obtain the position code.
In one embodiment of the disclosure, the generating module is configured to input the position code and the n pairs of third shape codes and third appearance codes into a third neural network, and generate, by the third neural network, corresponding first feature fields according to each pair of third shape codes and third appearance codes and the position code, to obtain n first feature fields.
In one embodiment of the present disclosure, the second neural network includes a combining operator, a three-dimensional volume rendering neural network, and a prediction neural network; the processing module is used for processing the n first characteristic domains according to the combination operator to obtain a second characteristic domain; rendering the second feature domain according to the three-dimensional volume rendering neural network to obtain a feature map; and predicting the feature map according to the prediction neural network to obtain the target image.
In one embodiment of the present disclosure, the spatial point corresponding to each of the n first feature domains has a corresponding spatial density and a target feature; the processing module is used for inputting the n first feature domains into the combination operator, and calculating average space density and average target feature of the n first feature domains at each space point according to the following formula by the combination operator;
wherein C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, $\mu_i$ is the spatial density of the i-th first feature domain at C(x, d), and $f_i$ is the target feature of the i-th first feature domain at C(x, d); and obtain the second feature domain according to the average spatial density and the average target feature at each spatial point.
According to still another aspect of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the image generation methods described above via execution of the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described image generation methods.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program or computer instructions loaded and executed by a processor to cause a computer to implement any of the above described image generation methods.
The technical scheme provided by the embodiment of the disclosure has at least the following beneficial effects:
According to the technical solution provided by the embodiments of the disclosure, on the one hand, n pairs of first shape codes and first appearance codes are extracted from the style control content according to a first neural network, and n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution. Thereafter, n pairs of third shape codes and third appearance codes are generated from the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes. The style of the target image generated from the third shape codes and third appearance codes is influenced by the n pairs of first shape codes and first appearance codes, and the n pairs of first shape codes and first appearance codes correspond to the style control content; that is, the style of the target image can be controlled by the style control content, so a user can control the style of the target image by inputting style control content.
On the other hand, each pair of third shape code and third appearance code corresponds to a pair of second shape code and second appearance code, and the second shape codes and second appearance codes are randomly sampled from the normal distribution fitted to the sample data distribution; that is, each pair of third shape code and third appearance code corresponds to one piece of sample data, and the n pairs of third shape codes and third appearance codes correspond to n pieces of sample data. The n pairs of third shape codes and third appearance codes correspond to the n first feature fields respectively, so each first feature field corresponds to one piece of sample data. The target image generated through the n first feature fields is thus controlled by the n pieces of sample data, and controlling the target image through the n pieces of sample data can make the generated target image a virtual image that does not actually exist.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 is a schematic diagram of a system architecture in an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of an image generation method of one embodiment of the present disclosure;
FIG. 3 illustrates a three-dimensional space generation ray and point sampling schematic in one embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of extracting n pairs of first shape codes and first appearance codes from style control content in one embodiment of the present disclosure;
FIG. 5 illustrates a flow chart of processing n first feature domains to obtain a target image according to a second neural network in one embodiment of the present disclosure;
FIG. 6 illustrates an overall architecture diagram of an image generation system in one embodiment of the present disclosure;
FIG. 7 illustrates a flow chart of an image generation process in one embodiment of the present disclosure;
FIG. 8 illustrates a schematic diagram of an image generation apparatus in an embodiment of the present disclosure;
fig. 9 shows a block diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which an image generation method or an image generation apparatus of an embodiment of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
The network 104 is a medium for providing a communication link between the terminal devices 101, 102, 103 and the server 105, and may be a wired network or a wireless network.
Alternatively, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network including, but not limited to, a local area network (Local Area Network, LAN), metropolitan area network (Metropolitan Area Network, MAN), wide area network (Wide Area Network, WAN), mobile, wired or wireless network, private network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including HyperText Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec), etc. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
The terminal devices 101, 102, 103 may be a variety of electronic devices including, but not limited to, smartphones, tablet computers, laptop portable computers, desktop computers, wearable devices, augmented reality devices, virtual reality devices, and the like.
The server 105 may be a server providing various services, such as a background management server providing support for devices operated by users with the terminal devices 101, 102, 103. The background management server can analyze and process the received data such as the request and the like, and feed back the processing result to the terminal equipment.
Optionally, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like.
Those skilled in the art will appreciate that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative, and that any number of terminal devices, networks, and servers may be provided as desired. The embodiments of the present disclosure are not limited in this regard.
The present exemplary embodiment will be described in detail below with reference to the accompanying drawings and examples.
In the embodiment of the disclosure, an image generation method is provided, and the method can be performed by any electronic device with computing processing capability; for example, the electronic device with computing processing capability can be a terminal device or a server.
Fig. 2 shows a flowchart of an image generation method in an embodiment of the present disclosure, and as shown in fig. 2, the image generation method provided in the embodiment of the present disclosure includes the following S201 to S207.
S201, acquiring style control content of a target image.
The target image is an image to be generated by using the image generation method provided by the embodiment of the disclosure. The style control content may include images, or text, or speech, etc. The embodiments of the present disclosure are not limited in this regard. For example, when the style control content includes an image, the style control content may be an image of a red chair. For another example, when the style control content includes text, the style control content may be "a red chair". For another example, when the style control content includes voice, the content corresponding to the voice may be "a red chair".
In some embodiments of the present disclosure, style control content is stored in a terminal device or a server, and taking an electronic device that performs the image generation method provided by the embodiments of the present disclosure as an example of the terminal device, obtaining style control content of a target image may include: the terminal equipment acquires the style control content from the server through the network, or the terminal equipment invokes the style control content from a memory of the terminal equipment.
In some embodiments, the terminal device is configured with a microphone through which the style control content in the form of speech can be obtained. For example, after the user speaks "a red chair", the terminal device receives the audio through the microphone, thereby completing the acquisition of the style control content.
S202, extracting n pairs of first shape codes and first appearance codes from style control contents according to a first neural network, wherein n is an integer greater than or equal to 1.
The first neural network may be any type of neural network capable of extracting the first shape codes and the first appearance codes from the style control content, and the embodiments of the present disclosure are not limited thereto. In one embodiment, the first neural network includes a feature extraction model, a shape mapper, and an appearance mapper. The feature extraction model is used for extracting style features from the style control content to obtain a style feature vector. The shape mapper is used for extracting the n first shape codes from the style feature vector, and the appearance mapper is used for extracting the n first appearance codes from the style feature vector.
In some embodiments of the present disclosure, the feature extraction model may be one that includes a speech converter and a text/image encoder. The speech converter may convert the speech content into text content and the text/image encoder may extract a style vector for the image or text based on the style characteristics of the image or text.
S203, n pairs of second shape codes and second appearance codes are obtained, the n pairs of second shape codes and second appearance codes being obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution.
The sample data are the sample data used for training the whole neural network model that generates the target image, and the normal distribution mentioned in S203 is obtained by fitting the sample data distribution through the training of a generative network. Random sampling from the normal distribution fitted to the sample data distribution yields a noise vector that corresponds to the sample data and carries the shape and appearance characteristics of the sample data. A second shape code and a second appearance code of the sample data are then extracted from the noise vector; n noise vectors are obtained by sampling the normal distribution fitted to the sample data distribution n times, and n pairs of second shape codes and second appearance codes are extracted from the n noise vectors.
S204, generating n pairs of third shape codes and third appearance codes according to n pairs of first shape codes and first appearance codes and n pairs of second shape codes and second appearance codes.
The third shape codes are obtained by adding the first shape codes and the second shape codes, and the third appearance codes are obtained by adding the first appearance codes and the second appearance codes; that is, each pair of third shape code and third appearance code is the sum of a pair of first codes and the corresponding pair of second codes.
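As an illustration of S203 and S204, the following sketch shows how the second codes could be sampled and combined with the first codes, assuming the codes are plain NumPy vectors and that the fitted normal distribution is parameterized by a mean and standard deviation estimated during training; all names and dimensions here are illustrative rather than part of the disclosure.

```python
import numpy as np

def sample_second_codes(n, code_dim, mu, sigma, rng=np.random.default_rng()):
    """Randomly sample n pairs of second shape/appearance codes from the
    normal distribution fitted to the sample data distribution."""
    shape_codes = rng.normal(mu, sigma, size=(n, code_dim))
    appearance_codes = rng.normal(mu, sigma, size=(n, code_dim))
    return shape_codes, appearance_codes

def combine_codes(first_shape, first_app, second_shape, second_app):
    """Third codes are obtained by adding the first and second codes pair-wise."""
    third_shape = first_shape + second_shape        # (n, code_dim)
    third_appearance = first_app + second_app       # (n, code_dim)
    return third_shape, third_appearance
```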
S205, the position code of the spatial point corresponding to the target image is acquired.
In some embodiments, acquiring the position code of the spatial points corresponding to the target image may include: acquiring H×W rays in a three-dimensional space with a camera origin as the center, wherein H×W corresponds to the size of the target image, and H and W are integers greater than 0; determining a first coordinate and a second coordinate of each ray in the H×W rays; sampling S spatial points in a line segment between the first coordinate and the second coordinate of each ray to obtain H×W×S spatial points, wherein S is an integer greater than 0; and encoding the H×W×S spatial points respectively according to the coordinates of each of the H×W×S spatial points to obtain the position code.
The camera origin is the origin of the camera coordinate system. Illustratively, as shown in FIG. 3, the camera origin corresponds to the spatial point indicated at 301 in FIG. 3. One of the H×W rays may correspond to the ray 302 in fig. 3. The designated first coordinate of the ray indicated by 302 in fig. 3 may correspond to the spatial point indicated by 303 in fig. 3 or to the spatial point indicated by 304 in fig. 3; the designated second coordinate may correspond to the spatial point indicated at 306 in fig. 3 or to the spatial point indicated at 305 in fig. 3. For example, if the first coordinate of the ray indicated by 302 in fig. 3 corresponds to the spatial point indicated by 303 in fig. 3 and the second coordinate corresponds to the spatial point indicated by 306 in fig. 3, then 2 spatial points are sampled in the line segment between the first coordinate and the second coordinate, giving the spatial points indicated by 304 and 305 in fig. 3.
In some embodiments, when sampling S spatial points in the line segment between the first coordinate and the second coordinate of each ray, the sampling manner used may be uniform sampling. Illustratively, if the first coordinate of the ray indicated at 302 in fig. 3 corresponds to the spatial point indicated at 303 in fig. 3 and the second coordinate corresponds to the spatial point indicated at 306 in fig. 3, 2 spatial points are uniformly sampled in the line segment between the first coordinate and the second coordinate, resulting in spatial points such as the one indicated at 307 in fig. 3.
The embodiments of the present disclosure are not limited in terms of how the H×W×S spatial points are encoded. Illustratively, each of the H×W×S spatial points may be encoded using equation 1.

$r(p) = (\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p))$ (equation 1)

where p is the coordinate of the spatial point, r(p) is the position code of the spatial point, and L is a value set according to experiment; in some embodiments L is typically set to an integer between 4 and 10.
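A minimal sketch of S205 under these definitions, assuming the origins and directions of the H×W rays are already known, and with illustrative sampling bounds and L value; this is one possible realization, not the only one.

```python
import numpy as np

def sample_points_on_rays(origins, directions, t_near, t_far, S):
    """Uniformly sample S spatial points on each ray between its first
    coordinate (at t_near) and its second coordinate (at t_far).
    origins, directions: (H*W, 3) arrays describing the H*W rays."""
    t = np.linspace(t_near, t_far, S)                              # (S,)
    pts = origins[:, None, :] + t[None, :, None] * directions[:, None, :]
    return pts                                                     # (H*W, S, 3)

def positional_encoding(p, L=6):
    """Equation 1: r(p) = (sin(2^0*pi*p), cos(2^0*pi*p), ...,
    sin(2^(L-1)*pi*p), cos(2^(L-1)*pi*p)), applied per coordinate."""
    freqs = (2.0 ** np.arange(L)) * np.pi                          # (L,)
    angles = p[..., None] * freqs                                  # (..., 3, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)                          # (..., 6L)
```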
S206, generating n first characteristic fields according to the position codes and the n pairs of third shape codes and third appearance codes, wherein one first characteristic field corresponds to one third shape code and one corresponding third appearance code.
In some embodiments of the present disclosure, generating n first feature fields from the position code and the n pairs of third shape codes and third appearance codes may include: inputting the position code and the n pairs of third shape codes and third appearance codes into a third neural network, and generating, by the third neural network, a corresponding first feature field from each pair of third shape code and third appearance code together with the position code, to obtain n first feature fields. The embodiments of the present disclosure do not limit what the third neural network specifically is; any neural network capable of implementing S206 is applicable. For example, the third neural network may be a neural network composed of several fully connected layers, or a neural network composed of several convolutional layers.
In some embodiments, the last of the n first feature fields is the background of the target image.
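One possible form of the third neural network described above is a small multilayer perceptron conditioned on the position code and a pair of third codes; the layer sizes and the way the appearance code is injected below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class FeatureFieldMLP(nn.Module):
    """Illustrative third neural network: maps (position code, third shape code,
    third appearance code) to a spatial density and a target feature per point."""
    def __init__(self, pos_dim, code_dim, hidden=128, feat_dim=32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)                    # spatial density
        self.feature_head = nn.Linear(hidden + code_dim, feat_dim)  # target feature

    def forward(self, pos_code, shape_code, appearance_code):
        h = self.trunk(torch.cat([pos_code, shape_code], dim=-1))
        density = torch.relu(self.density_head(h))
        feature = self.feature_head(torch.cat([h, appearance_code], dim=-1))
        return density, feature
```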
S207, processing the n first feature domains according to the second neural network to obtain a target image.
In some embodiments of the present disclosure, the second neural network includes a combining operator, a three-dimensional volume rendering neural network, and a prediction neural network. The combining operator is used for combining the n first feature fields into one second feature field. The three-dimensional volume rendering neural network is used for rendering the second feature domain to obtain a corresponding feature map. The prediction neural network is used for predicting color values and density values of the feature images to obtain target images.
According to the technical solution provided by the embodiments of the disclosure, on the one hand, n pairs of first shape codes and first appearance codes are extracted from the style control content according to a first neural network, and n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution. Thereafter, n pairs of third shape codes and third appearance codes are generated from the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes. The style of the target image generated from the third shape codes and third appearance codes is influenced by the n pairs of first shape codes and first appearance codes, and the n pairs of first shape codes and first appearance codes correspond to the style control content; that is, the style of the target image can be controlled by the style control content, so a user can control the style of the target image by inputting style control content.
On the other hand, each pair of third shape code and third appearance code corresponds to a pair of second shape code and second appearance code, and the second shape codes and second appearance codes are randomly sampled from the normal distribution fitted to the sample data distribution; that is, each pair of third shape code and third appearance code corresponds to one piece of sample data, and the n pairs of third shape codes and third appearance codes correspond to n pieces of sample data. The n pairs of third shape codes and third appearance codes correspond to the n first feature fields respectively, so each first feature field corresponds to one piece of sample data. The target image generated through the n first feature fields is thus controlled by the n pieces of sample data, and controlling the target image through the n pieces of sample data can make the generated target image a virtual image that does not actually exist.
Fig. 4 illustrates a flowchart of extracting n pairs of first shape codes and first appearance codes from style control content in one embodiment of the present disclosure, as illustrated in fig. 4, including the following S401 to S403.
S401, extracting style characteristics of style control content according to the characteristic extraction model to obtain a style coding vector.
In some embodiments, the style control content comprises text content, or image content, or speech content, and the feature extraction model comprises a text/image encoder and a speech converter. Extracting style characteristics of the style control content according to the characteristic extraction model to obtain a style coding vector can comprise: extracting a style coding vector from the text content according to a text encoder included in the feature extraction model when the style control content includes the text content; extracting a style coding vector from the image content according to an image encoder included in the feature extraction model when the style control content includes the image content; in the case that the style control content comprises voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model; and extracting style coding vectors from text contents corresponding to the voice contents according to the text encoder.
For example, the style control content includes voice content corresponding to the phrase "a white car". After the voice content is input into the voice converter, the voice converter converts the voice content into text content, obtaining "a white car" in text form. Then, the text content "a white car" is input into the text encoder, and the text encoder extracts the corresponding style coding vector from "a white car".
The embodiments of the present disclosure are not limited with respect to the types of the voice converter and the text/image encoder. The text/image encoder may be, for example, a CLIP (Contrastive Language-Image Pre-Training) model, and the voice converter may be any neural network model capable of converting voice content into text content.
In some embodiments of the present disclosure, the feature extraction model is a pre-trained neural network model, i.e., the feature extraction model includes both a speech converter and a text/image encoder that are pre-trained neural network models.
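As a sketch of how such a pre-trained feature extraction model might be wired together, the following assumes the Hugging Face transformers interface to a CLIP model for the text/image encoder and leaves the voice converter abstract (speech_to_text stands in for any pre-trained speech-to-text model); the disclosure itself does not mandate these libraries.

```python
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_style_vector(content, kind, speech_to_text=None):
    """Dispatch style control content to the appropriate encoder (S401)."""
    if kind == "voice":
        # speech_to_text is a placeholder for any pre-trained voice converter
        content, kind = speech_to_text(content), "text"
    if kind == "text":
        inputs = processor(text=[content], return_tensors="pt", padding=True)
        return clip.get_text_features(**inputs)      # style coding vector
    if kind == "image":
        inputs = processor(images=content, return_tensors="pt")
        return clip.get_image_features(**inputs)     # style coding vector
    raise ValueError(f"unsupported content type: {kind}")
```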
S402, extracting n first shape codes from the style code vectors according to the shape mapper.
As to what network model the shape mapper is in particular, embodiments of the present disclosure are not limited in this regard, and network models capable of extracting the first shape code from the style-coded vector are applicable thereto. For example, the shape mapper may comprise several layers of fully-connected networks, or several layers of convolutional networks. After the style coded vector is input into the shape mapper, the shape mapper processes the style coded vector and outputs n first shape codes.
S403, extracting n first appearance codes from the style code vectors according to the appearance mapper.
With respect to what network model the appearance mapper specifically is, the embodiments of the present disclosure are not limited; any network model capable of extracting the first appearance codes from the style coding vector is applicable. For example, the appearance mapper may comprise several layers of fully connected networks, or several layers of convolutional networks. After the style coding vector is input into the appearance mapper, the appearance mapper processes the style coding vector and outputs n first appearance codes.
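A minimal sketch of such a mapper, usable as either the shape mapper or the appearance mapper, assuming a few fully connected layers that map the style coding vector to n codes; the dimensions are illustrative.

```python
import torch.nn as nn

class CodeMapper(nn.Module):
    """Small fully connected mapper: style coding vector -> n codes.
    The same structure can serve as the shape mapper or the appearance mapper."""
    def __init__(self, style_dim, code_dim, n, hidden=256):
        super().__init__()
        self.n, self.code_dim = n, code_dim
        self.net = nn.Sequential(
            nn.Linear(style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n * code_dim),
        )

    def forward(self, style_vec):                    # (batch, style_dim)
        out = self.net(style_vec)
        return out.view(-1, self.n, self.code_dim)   # n first shape (or appearance) codes
```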
According to the technical solution provided by the embodiments of the disclosure, by introducing the pre-trained text/image encoder and voice converter, fusion across the visual, text, and voice modalities is realized, so that a user can describe the desired style effect by inputting a sentence, and can also control the style of the target image through image content or voice content. On the other hand, the style features of the image content, text content, or voice content are extracted by the pre-trained feature extraction model, and the first appearance codes and first shape codes are then obtained directly by the two mappers (the shape mapper and the appearance mapper); the whole process is therefore a black box to the user, who does not need to adapt to a fixed model when using it, which reduces the difficulty of applying the technical solution provided by the embodiments of the disclosure.
Fig. 5 shows a flowchart of processing n first feature domains according to a second neural network to obtain a target image in an embodiment of the present disclosure, as shown in fig. 5, including the following S501 to S503.
S501, processing the n first feature domains according to the combination operator to obtain a second feature domain.
In some embodiments, the spatial point corresponding to each of the n first feature domains has a corresponding spatial density and target feature, and processing the n first feature domains according to the combination operator to obtain the second feature domain may include: inputting the n first feature domains into the combination operator, and calculating the average spatial density and the average target feature of the n first feature domains at each spatial point by the combination operator according to the following formula 2 and formula 3;

$\bar{\mu}(C) = \frac{1}{n}\sum_{i=1}^{n} \mu_i$ (formula 2)

$\bar{f}(C) = \frac{1}{n}\sum_{i=1}^{n} f_i$ (formula 3)

wherein C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, $\mu_i$ is the spatial density of the i-th first feature domain at C(x, d), and $f_i$ is the target feature of the i-th first feature domain at C(x, d); and obtaining the second feature domain according to the average spatial density and the average target feature at each spatial point.
Obtaining the second feature domain according to the average spatial density and the average target feature at each spatial point may include: taking the average spatial density and the average target feature at each spatial point as the spatial density and the target feature of that spatial point in the second feature domain, thereby obtaining the second feature domain. The target feature may be a color value feature.
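A minimal sketch of the combination operator under the averaging described above, assuming the spatial densities and target features of the n first feature domains have already been evaluated at the same set of spatial points; the array shapes are assumptions.

```python
import numpy as np

def combine_feature_fields(densities, features):
    """Combination operator: average the n first feature domains point-wise.
    densities: (n, P) spatial density of each field at each of P spatial points.
    features:  (n, P, F) target feature of each field at each spatial point."""
    mean_density = densities.mean(axis=0)   # (P,)   average spatial density
    mean_feature = features.mean(axis=0)    # (P, F) average target feature
    return mean_density, mean_feature       # together they form the second feature domain
```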
And S502, rendering the second feature domain according to the three-dimensional volume rendering neural network to obtain a feature map.
The spatial points are obtained by sampling S times on each of the H×W rays, so there are S spatial points on each ray. In some embodiments, rendering the second feature domain according to the three-dimensional volume rendering neural network to obtain the feature map may include: calculating, for the S spatial points on each ray, the feature of one point in the feature map according to the following formula 4.

$f = \sum_{j=1}^{S} \tau_j \alpha_j f_j$ (formula 4)

wherein $\tau_j$ is the cumulative transmittance at the j-th point on a ray of the second feature domain, $\alpha_j$ is the transparency at the j-th point on the ray of the second feature domain, $f_j$ is the target feature at the j-th point on the ray of the second feature domain, and f is the feature of one point in the feature map.

$\tau_j$ can be calculated by the following equation 5.

$\tau_j = \prod_{k=1}^{j-1} (1 - \alpha_k)$ (equation 5)

wherein $\alpha_k$ is the transparency at the k-th point on the ray of the second feature domain, and j is an integer greater than 1.

$\alpha_j$ can be calculated as shown in the following equation 6.

$\alpha_j = 1 - \exp(-\mu_j \delta_j)$ (equation 6)

wherein $\mu_j$ is the spatial density at the j-th point on the ray of the second feature domain, and $\delta_j$ is the distance between the j-th point and the (j+1)-th point on the ray of the second feature domain.

$\delta_j$ can be calculated by the following equation 7.

$\delta_j = \lVert x_{j+1} - x_j \rVert$ (equation 7)

wherein $x_j$ is the coordinate of the j-th point on the ray of the second feature domain, and $x_{j+1}$ is the coordinate of the (j+1)-th point on the ray of the second feature domain.
After the three-dimensional volume rendering neural network renders the second feature domain according to the above formula 4 to formula 7, the features of the H×W points corresponding to the feature map are obtained, and the features of the H×W points form the feature map.
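A minimal sketch of this volume rendering step, following formula 4 to formula 7 as described above for the S points on each ray; array shapes and the padding of the last interval are assumptions for illustration.

```python
import numpy as np

def render_feature_map(density, feature, points):
    """Volume-render the second feature domain along each ray (formulas 4-7).
    density: (R, S) spatial density at the S points of each of R rays.
    feature: (R, S, F) target feature at each point.
    points:  (R, S, 3) coordinates of the sampled points along each ray."""
    delta = np.linalg.norm(points[:, 1:] - points[:, :-1], axis=-1)       # formula 7
    delta = np.concatenate([delta, delta[:, -1:]], axis=1)                # pad last interval
    alpha = 1.0 - np.exp(-density * delta)                                # formula 6
    trans = np.cumprod(1.0 - alpha + 1e-10, axis=1)
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=1)  # formula 5
    weights = trans * alpha                                               # tau_j * alpha_j
    return (weights[..., None] * feature).sum(axis=1)                     # formula 4: (R, F)
```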
And S503, predicting the feature map according to the prediction neural network to obtain a target image.
With respect to what the prediction neural network specifically is, the embodiments of the disclosure are not limited; for example, the prediction neural network may include a fully connected network. After the feature map is input into the prediction neural network, the prediction neural network processes the feature map, that is, the color value and the density of each point in the feature map are predicted according to the feature of that point, and the target image is obtained according to the prediction result, wherein the size of the target image is H×W.
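One possible sketch of the prediction neural network, assuming a per-point fully connected head that maps each feature of the feature map to an RGB color value; the hidden size and output activation are illustrative assumptions.

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """Illustrative prediction neural network: maps each point of the H x W
    feature map to an RGB color value of the target image."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB values in [0, 1]
        )

    def forward(self, feature_map):               # (H, W, feat_dim)
        return self.net(feature_map)              # (H, W, 3) target image
```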
By combining the n first feature fields into one second feature field and then rendering the second feature field, the amount of computation required during rendering can be reduced. In addition, because the plurality of first feature fields are combined into one second feature field, the rendered and predicted target image is a virtual image that does not actually exist, thereby realizing the generation of a virtual image.
Under the image generation method provided by the embodiment of the present disclosure, the overall architecture of the image generation system is shown in fig. 6. In the system architecture shown in fig. 6, the image generation process may include S701 to S712 as shown in fig. 7.
S701, acquiring style control content, the implementation of S701 is already described in S201 in the embodiment corresponding to fig. 2, and will not be described herein.
S702, inputting the style control content obtained in S701 into the voice converter 602 and/or the text encoder 603 and/or the image encoder 604 to obtain a style coding vector; wherein the speech converter 602, the text encoder 603 and the image encoder 604 belong to the feature extraction model 601.
S703, inputting the style coding vector obtained in S702 into the shape mapper 605 and the appearance mapper 606 to obtain n pairs of first shape codes and first appearance codes, e.g. the n first shape codes and n first appearance codes indicated at 607 in FIG. 6.
S704, randomly sampling from the normal distribution fitted to the sample data distribution to obtain n pairs of second shape codes and second appearance codes, e.g. the n second shape codes and n second appearance codes indicated at 608 in FIG. 6.
S705, adding the results of S703 and S704, i.e., adding n pairs of the first shape code and the first appearance code and n pairs of the second shape code and the second appearance code, respectively, to obtain n pairs of the third shape code and the third appearance code.
S706, generating H×W rays in a three-dimensional space with a camera origin as the center, and determining S spatial points on each ray to obtain H×W×S spatial points; the implementation process of S706 is already described in S205 in the corresponding embodiment of fig. 2, and will not be described here again.
S707, performing position coding on the spatial points obtained in S706 to obtain position codes 609 of the H×W×S spatial points; the implementation process of S707 is already described in S205 of the corresponding embodiment of fig. 2, and will not be described here again.
S708, inputting the results of S707 and S705 into a feature extraction network (third neural network) to obtain n first feature domains 610; the implementation of S708 is already described in S206 of the corresponding embodiment of fig. 2, and will not be described here again.
S709, inputting n first feature fields 610 into a combination operator 611 to obtain a second feature field; wherein the feature extraction network and combination operator 611 belongs to the component-level editable neural radiation field module 612; the implementation process of S709 is already described in S501 of the corresponding embodiment of fig. 5, and will not be described here again.
S710, inputting the second feature domain into the three-dimensional volume rendering model 613 to obtain a feature map; the implementation process of S710 is already described in S502 in the corresponding embodiment of fig. 5, and will not be described here again.
S711, inputting the feature map into a prediction neural network 614 for color value and density prediction to obtain a target image; the implementation process of S711 is already described in S503 of the corresponding embodiment of fig. 5, and will not be described here again. If the whole image generation system is trained, S712 is also performed.
S712, the target image and the corresponding sample image are input to the discriminator 615, the discriminator 615 compares the target image and the corresponding sample image, and the loss (loss value) corresponding to the comparison result is output to perform back propagation.
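A minimal sketch of the comparison performed in S712, assuming a standard GAN-style discriminator loss since the disclosure does not specify the loss form; the discriminator network itself is left abstract.

```python
import torch
import torch.nn.functional as F

def discriminator_step(discriminator, target_image, sample_image):
    """One illustrative training comparison (S712): score the generated target
    image against a real sample image and return a GAN-style loss to back-propagate."""
    real_score = discriminator(sample_image)
    fake_score = discriminator(target_image.detach())
    loss = F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score)) \
         + F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))
    return loss   # back-propagated to update the discriminator
```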
The system architecture corresponding to the image generation method provided by the embodiments of the disclosure provides a generative rendering neural network architecture, so that the trained rendering neural network has the capability of generating virtual images; different scenes can be obtained with different training data, so the method is applicable both to general new-view-angle generation of a scene and to new-view-angle generation of an avatar. In addition, because the feature extraction model is a pre-trained neural network model, the speed of image generation is improved to a certain extent.
Based on the same inventive concept, an image generating apparatus is also provided in embodiments of the present disclosure, as described in the following embodiments. Since the principle of solving the problem of the embodiment of the device is similar to that of the embodiment of the method, the implementation of the embodiment of the device can be referred to the implementation of the embodiment of the method, and the repetition is omitted.
Fig. 8 shows a schematic diagram of an image generation apparatus according to an embodiment of the disclosure. As shown in fig. 8, the apparatus includes: an obtaining module 801, configured to obtain style control content of a target image; an extracting module 802, configured to extract n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, where n is an integer greater than or equal to 1; the obtaining module 801 is further configured to obtain n pairs of second shape codes and second appearance codes, where the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution; a generating module 803, configured to generate n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes; the obtaining module 801 is further configured to obtain the position code of the spatial points corresponding to the target image; the generating module 803 is further configured to generate n first feature domains according to the position code and the n pairs of third shape codes and third appearance codes, where one first feature domain corresponds to one third shape code and the corresponding third appearance code; and a processing module 804, configured to process the n first feature domains according to a second neural network to obtain the target image.
In one embodiment of the present disclosure, the first neural network includes a feature extraction model, a shape mapper, and an appearance mapper, and the extraction module 802 is configured to extract style features of the style control content according to the feature extraction model, to obtain a style coding vector; extracting n first shape codes from the style code vectors according to the shape mapper; n first appearance encodings are extracted from the style-encoded vectors according to the appearance mapper.
In one embodiment of the present disclosure, the style control content includes text content, or image content, or voice content; an extracting module 802, configured to extract, in a case where the style control content includes text content, a style encoding vector from the text content according to a text encoder included in the feature extraction model; extracting a style coding vector from the image content according to an image encoder included in the feature extraction model when the style control content includes the image content; in the case that the style control content comprises voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model; and extracting style coding vectors from text contents corresponding to the voice contents according to the text encoder.
In one embodiment of the present disclosure, the feature extraction model is a pre-trained neural network model.
In one embodiment of the present disclosure, the obtaining module 801 is configured to obtain, in a three-dimensional space, H×W rays with a camera origin as the center, where H×W corresponds to the size of the target image, and both H and W are integers greater than 0; determine a first coordinate and a second coordinate of each ray in the H×W rays; sample S spatial points in a line segment between the first coordinate and the second coordinate of each ray to obtain H×W×S spatial points, where S is an integer greater than 0; and encode the H×W×S spatial points respectively according to the coordinates of each of the H×W×S spatial points to obtain the position code.
In one embodiment of the present disclosure, the generating module 803 is configured to input the position code and the n pairs of third shape codes and third appearance codes into a third neural network, and generate, by the third neural network, a corresponding first feature domain from each pair of third shape code and third appearance code together with the position code, to obtain n first feature domains.
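One way the third neural network could realize a first feature domain is sketched below: an MLP conditioned on a third shape code predicts a spatial density at each encoded point, and a second head conditioned on the corresponding third appearance code predicts a target feature. The architecture and layer sizes are assumptions for illustration only.

```python
import torch
from torch import nn

class FeatureField(nn.Module):
    """One first feature domain: maps a position code plus a third shape code to a
    spatial density, and together with a third appearance code to a target feature."""
    def __init__(self, pos_dim, latent_dim, feat_dim=128, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(pos_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.feature_head = nn.Sequential(
            nn.Linear(hidden + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, pos_code, shape_code, appearance_code):
        # pos_code: (N, pos_dim); shape_code, appearance_code: (latent_dim,)
        n_points = pos_code.shape[0]
        h = self.backbone(torch.cat([pos_code, shape_code.expand(n_points, -1)], dim=-1))
        density = torch.relu(self.density_head(h))                            # (N, 1)
        feat_in = torch.cat([h, appearance_code.expand(n_points, -1)], dim=-1)
        feature = self.feature_head(feat_in)                                  # (N, feat_dim)
        return density, feature
```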
In one embodiment of the present disclosure, the second neural network includes a combining operator, a three-dimensional volume rendering neural network, and a prediction neural network; the processing module 804 is configured to process the n first feature domains according to the combining operator to obtain a second feature domain; render the second feature domain according to the three-dimensional volume rendering neural network to obtain a feature map; and predict the feature map according to the prediction neural network to obtain the target image.
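The rendering step can be pictured as a volume-rendering accumulation along each ray, sketched below under the assumption of standard alpha compositing; the prediction neural network is then hinted at as a small convolutional decoder. Neither detail is fixed by the embodiment, so both are assumptions of this sketch.

```python
import torch

def render_feature_map(density, features, H, W, S, delta=1.0):
    """density: (H*W, S, 1), features: (H*W, S, C), evaluated along each ray.
    delta is an assumed constant spacing between consecutive samples.
    Accumulates features with alpha compositing to an (H, W, C) feature map."""
    alpha = 1.0 - torch.exp(-density * delta)                         # (H*W, S, 1)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]                                                # (H*W, S, 1)
    weights = alpha * trans                                           # (H*W, S, 1)
    feat_map = (weights * features).sum(dim=1)                        # (H*W, C)
    return feat_map.view(H, W, -1)

# The prediction neural network could then map the (permuted) feature map to the
# target image, e.g. with a small torch.nn.Conv2d-based decoder; this is only a guess
# at one possible realization.
```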
In one embodiment of the present disclosure, the spatial point corresponding to each of the n first feature domains has a corresponding spatial density and a target feature; the processing module 804 is configured to input the n first feature domains into the combining operator, and the combining operator calculates an average spatial density and an average target feature of the n first feature domains at each spatial point according to the following formula:
μ_avg = (μ_1 + μ_2 + ... + μ_n) / n,  f_avg = (f_1 + f_2 + ... + f_n) / n,
wherein C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, μ_i is the spatial density of the i-th first feature domain at C(x, d), and f_i is the target feature of the i-th first feature domain at C(x, d); the second feature domain is then obtained from the average spatial density and the average target feature at each spatial point.
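Under the arithmetic-mean reading stated above, the combining operator reduces to averaging the n densities and the n target features at every spatial point, as in the following sketch.

```python
import torch

def combine_feature_domains(densities, features):
    """densities: (n, N, 1), features: (n, N, C) for the n first feature domains
    evaluated at the same N spatial points C(x, d). Returns the average spatial
    density and average target feature at each point (assumed arithmetic mean)."""
    avg_density = densities.mean(dim=0)      # (N, 1)
    avg_feature = features.mean(dim=0)       # (N, C)
    return avg_density, avg_feature
```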
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 900 according to such an embodiment of the present disclosure is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in Fig. 9, the electronic device 900 is embodied in the form of a general-purpose computing device. Components of the electronic device 900 may include, but are not limited to: at least one processing unit 910, at least one storage unit 920, and a bus 930 connecting the different system components (including the storage unit 920 and the processing unit 910).
The storage unit stores program code executable by the processing unit 910, such that the processing unit 910 performs the steps according to the various exemplary embodiments of the present disclosure described in the detailed description section above.
The storage unit 920 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 9201 and/or cache memory 9202, and may further include Read Only Memory (ROM) 9203.
The storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The bus 930 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 940 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 900, and/or any devices (e.g., routers, modems, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 950. Also, electronic device 900 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 960. As shown in fig. 9, the network adapter 960 communicates with other modules of the electronic device 900 over the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 900, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, which may be a readable signal medium or a readable storage medium, and on which a program product capable of implementing the above-described method of the present disclosure is stored. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code; when the program product is run on a terminal device, the program code causes the terminal device to carry out the steps according to the various exemplary embodiments of the disclosure described in the detailed description section above.
More specific examples of the computer readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In this disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
In an exemplary embodiment of the present disclosure, there is also provided a computer program product comprising a computer program or computer instructions which, when loaded and executed by a processor, cause a computer to implement the image generation method of any of the above embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all of the illustrated steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
From the description of the above embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope of the disclosure being indicated by the following claims.

Claims (11)

1. An image generation method, comprising:
acquiring style control content of a target image;
extracting n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, wherein n is an integer greater than or equal to 1;
acquiring n pairs of second shape codes and second appearance codes, wherein the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution;
generating n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes and the n pairs of second shape codes and second appearance codes;
acquiring a position code of a spatial point corresponding to the target image;
generating n first feature domains according to the position code and the n pairs of third shape codes and third appearance codes, wherein each first feature domain corresponds to one third shape code and its corresponding third appearance code;
and processing the n first feature domains according to a second neural network to obtain the target image.
2. The method of claim 1, wherein the first neural network comprises a feature extraction model, a shape mapper, and an appearance mapper, and wherein extracting n pairs of first shape codes and first appearance codes from the style control content according to the first neural network comprises:
extracting style characteristics of the style control content according to the characteristic extraction model to obtain a style coding vector;
extracting n first shape codes from the style code vector according to the shape mapper;
and extracting n first appearance codes from the style coding vector according to the appearance mapper.
3. The method of claim 2, wherein the style control content comprises text content, or image content, or voice content; extracting the style characteristics of the style control content according to the characteristic extraction model to obtain a style coding vector, wherein the style coding vector comprises the following components:
extracting the style coding vector from the text content according to a text encoder included in the feature extraction model when the style control content includes the text content;
extracting the style coding vector from the image content according to an image encoder included in the feature extraction model when the style control content includes the image content;
in the case that the style control content comprises the voice content, converting the voice content into corresponding text content according to a voice converter included in the feature extraction model; and extracting the style coding vector from the text content corresponding to the voice content according to the text encoder.
4. A method according to claim 3, wherein the feature extraction model is a pre-trained neural network model.
5. The method according to claim 1, wherein the acquiring a position code of a spatial point corresponding to the target image comprises:
taking a camera origin as a center in a three-dimensional space, acquiring H×W rays, wherein H×W corresponds to the size of the target image, and H and W are both integers greater than 0;
determining a first coordinate and a second coordinate of each ray in the H×W rays;
sampling S spatial points in a line segment between the first coordinate and the second coordinate of each ray to obtain H×W×S spatial points, wherein S is an integer greater than 0;
and encoding the H×W×S spatial points respectively according to the coordinates of each of the H×W×S spatial points to obtain the position code.
6. The method of claim 1, wherein generating n first feature domains according to the position code and the n pairs of third shape codes and third appearance codes comprises:
inputting the position code and the n pairs of third shape codes and third appearance codes into a third neural network, and generating, by the third neural network, a corresponding first feature domain according to each pair of third shape code and third appearance code together with the position code, to obtain n first feature domains.
7. The method of claim 1, wherein the second neural network comprises a combining operator, a three-dimensional volume rendering neural network, and a predictive neural network; the processing the n first feature domains according to the second neural network to obtain the target image includes:
processing the n first feature domains according to the combining operator to obtain a second feature domain;
rendering the second feature domain according to the three-dimensional volume rendering neural network to obtain a feature map;
and predicting the feature map according to the prediction neural network to obtain the target image.
8. The method of claim 7, wherein each spatial point in each of the n first feature domains has a corresponding spatial density and a corresponding target feature; and wherein processing the n first feature domains according to the combining operator to obtain the second feature domain comprises:
inputting the n first feature domains into the combining operator, and calculating, by the combining operator, an average spatial density and an average target feature of the n first feature domains at each spatial point according to the following formula:
μ_avg = (μ_1 + μ_2 + ... + μ_n) / n,  f_avg = (f_1 + f_2 + ... + f_n) / n,
wherein C(x, d) is a spatial point, x is the position of the spatial point, d is the direction of the spatial point, μ_i is the spatial density of the i-th first feature domain at C(x, d), and f_i is the target feature of the i-th first feature domain at C(x, d); and
obtaining the second feature domain according to the average spatial density and the average target feature at each spatial point.
9. An image generating apparatus, comprising:
the acquisition module is used for acquiring style control content of the target image;
the extraction module is used for extracting n pairs of first shape codes and first appearance codes from the style control content according to a first neural network, wherein n is an integer greater than or equal to 1;
the acquisition module is further used for acquiring n pairs of second shape codes and second appearance codes, wherein the n pairs of second shape codes and second appearance codes are obtained by randomly sampling n times from a normal distribution fitted to the sample data distribution;
a generating module, configured to generate n pairs of third shape codes and third appearance codes according to the n pairs of first shape codes and first appearance codes, and the n pairs of second shape codes and second appearance codes;
the acquisition module is also used for acquiring the position codes of the space points corresponding to the target image;
the generating module is further configured to generate n first feature domains according to the position code and the n pairs of third shape codes and third appearance codes, wherein each first feature domain corresponds to one third shape code and its corresponding third appearance code;
and the processing module is used for processing the n first feature domains according to a second neural network to obtain the target image.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the image generation method of any of claims 1-8 via execution of the executable instructions.
11. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the image generation method of any one of claims 1 to 8.
CN202210770690.3A 2022-06-30 2022-06-30 Image generation method, device, electronic equipment and computer readable storage medium Active CN115063536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210770690.3A CN115063536B (en) 2022-06-30 2022-06-30 Image generation method, device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN115063536A CN115063536A (en) 2022-09-16
CN115063536B true CN115063536B (en) 2023-10-10

Family

ID=83204581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210770690.3A Active CN115063536B (en) 2022-06-30 2022-06-30 Image generation method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115063536B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10467504B1 (en) * 2019-02-08 2019-11-05 Adhark, Inc. Systems, methods, and storage media for evaluating digital images
US20200302176A1 (en) * 2019-03-18 2020-09-24 Nvidia Corporation Image identification using neural networks
EP4058987A1 (en) * 2019-11-15 2022-09-21 Snap Inc. Image generation using surface-based neural synthesis
US11354846B2 (en) * 2020-05-04 2022-06-07 Microsoft Technology Licensing, Llc Computing photorealistic versions of synthetic images
US11709885B2 (en) * 2020-09-18 2023-07-25 Adobe Inc. Determining fine-grain visual style similarities for digital images by extracting style embeddings disentangled from image content

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deformable Shape Completion with Graph Convolutional Autoencoders; Or Litany et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; pp. 1886-1895 *
Research on Deep Learning Algorithms for 3D Content Generation Based on Scene Graph Understanding; Xiao Xu; China Master's Theses Full-text Database; Information Science and Technology, I138-874 *
Image Feature Classification Method with Multi-Agent Deep Reinforcement Learning; Zhang Zewei et al.; Computer Engineering and Applications; pp. 1-9 *

Also Published As

Publication number Publication date
CN115063536A (en) 2022-09-16

Similar Documents

Publication Publication Date Title
US11151765B2 (en) Method and apparatus for generating information
US10380996B2 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
CN116597087A (en) Three-dimensional model generation method and device, storage medium and electronic equipment
CN115019237B (en) Multi-mode emotion analysis method and device, electronic equipment and storage medium
CN112270200B (en) Text information translation method and device, electronic equipment and storage medium
CN111915480A (en) Method, apparatus, device and computer readable medium for generating feature extraction network
WO2023232056A1 (en) Image processing method and apparatus, and storage medium and electronic device
CN116578684A (en) Knowledge graph-based question and answer method and device and related equipment
CN110807111A (en) Three-dimensional graph processing method and device, storage medium and electronic equipment
CN115063536B (en) Image generation method, device, electronic equipment and computer readable storage medium
CN112435197A (en) Image beautifying method and device, electronic equipment and storage medium
CN112836040A (en) Multi-language abstract generation method and device, electronic equipment and computer readable medium
JP2023169230A (en) Computer program, server device, terminal device, learned model, program generation method, and method
CN115439610B (en) Training method and training device for model, electronic equipment and readable storage medium
CN114339190B (en) Communication method, device, equipment and storage medium
CN115345279B (en) Multi-index anomaly detection method and device, electronic equipment and storage medium
CN115794494A (en) Data backup method, system, device, equipment and medium based on dynamic strategy
CN115880400A (en) Cartoon digital human image generation method and device, electronic equipment and medium
CN115860013A (en) Method, device, system, equipment and medium for processing conversation message
CN111552871A (en) Information pushing method and device based on application use record and related equipment
CN112464654A (en) Keyword generation method and device, electronic equipment and computer readable medium
CN117252787B (en) Image re-illumination method, model training method, device, equipment and medium
CN116562232A (en) Word vector processing method and device, storage medium and electronic equipment
CN117688559A (en) Virus program detection method and device, electronic equipment and storage medium
US20230239481A1 (en) Method, electronic device, and computer program product for processing data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant