CN116957921A - Image rendering method, device, equipment and storage medium - Google Patents

Image rendering method, device, equipment and storage medium

Info

Publication number
CN116957921A
Authority
CN
China
Prior art keywords
image
noise
sample
rendering
rendered
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310923451.1A
Other languages
Chinese (zh)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310923451.1A priority Critical patent/CN116957921A/en
Publication of CN116957921A publication Critical patent/CN116957921A/en
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application discloses an image rendering method, device, equipment, and storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring a plurality of training samples, each comprising a sample original image, a rendered text, a random noise image, an all-0 image corresponding to the sample original image, and a sample rendered image; for each training sample, acquiring a noise recognition result between the sample original image and the sample rendered image; generating, through an image editing model, a predicted rendered image corresponding to the training sample according to the sample original image, the rendered text, the random noise image, the all-0 image, and the sample rendered image; processing the sample rendered image, the sample original image, and the predicted rendered image according to the noise recognition result to obtain a training loss corresponding to the sample original image; and adjusting parameters of the image editing model according to the training loss corresponding to each training sample. The application can improve the image rendering accuracy of the image editing model.

Description

Image rendering method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to an image rendering method, an image rendering device, image rendering equipment and a storage medium.
Background
With the research and progress of artificial intelligence technology, image generation based on the Diffusion Model has achieved increasingly good results. Within the field of image generation there is an important branch, image editing: an input image and a corresponding rendered text are fed into an image editing model (constructed based on a diffusion model), and an output image is obtained whose content is related to the rendered text (such as a temporal atmosphere).
In the related art, since it is difficult to collect pre- and post-rendering image data from natural scenes, the image editing model is trained with large-scale generated data, that is, with sample images and rendered images (sample images with a rendering effect) constructed based on those sample images, so that the image editing model acquires image rendering capability, such as rendering an image with a spring atmosphere into an image with an autumn atmosphere.
However, there may be deviations (e.g., changes in layout or structure) between a rendered image and its sample image, so the training set may contain relatively noisy samples, and thus the image rendering accuracy of the trained image editing model is not high.
Disclosure of Invention
The embodiment of the application provides an image rendering method, an image rendering device, image rendering equipment and a storage medium, which can improve the image rendering accuracy of an image editing model.
According to an aspect of an embodiment of the present application, there is provided an image rendering method including:
acquiring a plurality of training samples, wherein the training samples comprise sample original images, rendered texts, random noise images, all 0 images corresponding to the sample original images, and sample rendered images generated based on the sample original images under the constraint of the rendered texts;
for each training sample, acquiring a noise identification result between the sample original image and the sample rendered image, wherein the noise identification result is used for indicating whether a difference exists between the sample original image and the sample rendered image except for content related to the rendered text;
generating a prediction rendering image corresponding to the training sample according to the sample original image, the rendering text, the random noise image, the all 0 image and the sample rendering image through an image editing model;
according to the noise identification result, processing the sample rendering image, the sample original image and the prediction rendering image to obtain training loss corresponding to the sample original image, wherein the training loss is used for representing the rendering capacity of the image editing model on the image;
And adjusting parameters of the image editing model according to the training loss corresponding to each training sample respectively to obtain a trained image editing model, wherein the trained image editing model is used for rendering the image according to the rendering text.
According to an aspect of an embodiment of the present application, there is provided an image rendering method including:
acquiring an input image, and an input text, a random noise image and an all-0 image corresponding to the input image;
generating a predictive rendering image corresponding to the input image based on the input image, a random noise image corresponding to the input image and an all-0 image under the constraint of the input text through an image editing network in an image editing model;
acquiring a first noise sub-recognition result between the input image and a predicted rendered image corresponding to the input image through a first noise recognition network in the image editing model;
acquiring a second noise sub-recognition result between the input image and a predicted rendered image corresponding to the input image through a second noise recognition network in the image editing model; wherein the first noise identification network and the second noise identification network are different;
And screening the predicted rendering image corresponding to the input image according to the first noise sub-recognition result and the second noise sub-recognition result to obtain a screening result, wherein the screening result is used for indicating whether the predicted rendering image corresponding to the input image is qualified or not.
According to an aspect of an embodiment of the present application, there is provided an image rendering apparatus including:
a training sample acquisition module, configured to acquire a plurality of training samples, where the training samples include a sample original image, a rendered text, a random noise image, an all-0 image corresponding to the sample original image, and a sample rendered image generated based on the sample original image under the constraint of the rendered text;
a noise result obtaining module, configured to obtain, for each of the training samples, a noise recognition result between the sample original image and the sample rendered image, where the noise recognition result is used to indicate whether there is a difference between the sample original image and the sample rendered image, except for content related to the rendered text;
the prediction image generation module is used for generating a prediction rendering image corresponding to the training sample according to the sample original image, the rendering text, the random noise image, the all 0 image and the sample rendering image through an image editing model;
The training loss acquisition module is used for processing the sample rendering image, the sample original image and the prediction rendering image according to the noise identification result to obtain training loss corresponding to the sample original image, wherein the training loss is used for representing the rendering capacity of the image editing model on the image;
the editing model training module is used for adjusting parameters of the image editing model according to the training loss corresponding to each training sample to obtain a trained image editing model, and the trained image editing model is used for rendering the image according to the rendering text.
According to an aspect of an embodiment of the present application, there is provided an image rendering apparatus including:
the input image acquisition module is used for acquiring an input image, and an input text, a random noise image and an all-0 image corresponding to the input image;
the prediction image generation module is used for generating a prediction rendering image corresponding to the input image based on the input image, a random noise image corresponding to the input image and an all-0 image under the constraint of the input text through an image editing network in an image editing model;
The noise result acquisition module is used for acquiring a first noise sub-recognition result between the input image and the predicted rendering image corresponding to the input image through a first noise recognition network in the image editing model;
the noise result obtaining module is further configured to obtain a second noise sub-recognition result between the input image and the predicted rendered image corresponding to the input image through a second noise recognition network in the image editing model; wherein the first noise identification network and the second noise identification network are different;
the screening result obtaining module is used for screening the prediction rendering image corresponding to the input image according to the first noise sub-recognition result and the second noise sub-recognition result to obtain a screening result, and the screening result is used for indicating whether the prediction rendering image corresponding to the input image is qualified or not.
According to an aspect of an embodiment of the present application, there is provided a computer apparatus including a processor and a memory in which a computer program is stored, the computer program being loaded and executed by the processor to implement the above-described image rendering method.
According to an aspect of an embodiment of the present application, there is provided a computer-readable storage medium having stored therein a computer program loaded and executed by a processor to implement the above-described image rendering method.
According to an aspect of an embodiment of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium. A processor of a computer device reads the computer program from a computer-readable storage medium, and the processor executes the computer program so that the computer device performs the above-described image rendering method.
The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects.
According to the method, the training loss of the image editing model is built from the sample rendered image, the sample original image, and the predicted rendered image in accordance with the noise recognition result, which indicates whether there is any difference between the sample original image and the sample rendered image other than the content related to the rendered text, rather than being built directly from the difference between the sample rendered image and the predicted rendered image. This effectively avoids the influence on the image editing model of noise that may exist between the sample rendered image and the sample original image, and at the same time introduces the sample original image for supervision, thereby effectively improving the image rendering accuracy of the image editing model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment for an embodiment of the present application;
FIG. 2 is a schematic diagram of a first noise identification network provided by one embodiment of the present application;
FIG. 3 is a schematic diagram of a second noise identification network provided by one embodiment of the application;
FIG. 4 is a schematic diagram of an image editing network provided by one embodiment of the present application;
FIG. 5 is a flow chart of an image rendering method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a noise recognition result provided by an embodiment of the present application;
FIG. 7 is a flowchart of a method for obtaining a noise identification result according to an embodiment of the present application;
FIG. 8 is a flow chart of a method for obtaining a predictive rendered image provided by one embodiment of the application;
FIG. 9 is a flow chart of an image rendering method according to another embodiment of the present application;
FIG. 10 is a schematic comparison of rendering effects between the related art and the present application, provided by one embodiment of the present application;
FIG. 11 is a block diagram of an image rendering apparatus provided by one embodiment of the present application;
fig. 12 is a block diagram of an image rendering apparatus provided in another embodiment of the present application;
fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Before describing embodiments of the present application, related terms referred to in the present application will be first described.
1. Diffusion Model: a generative model based on a diffusion process. A target text is input into the diffusion model, which performs a series of operations on a random noise image x and, under cross-attention with the target text, generates a predicted rendered image Y related to the target text; this may also be called text-to-image generation.
2. Random noise image: an image constructed randomly from image noise. The image noise may appear as isolated pixel points or blocks of isolated pixels that produce a relatively strong visual effect.
3. Scene temporal-atmosphere rendering: editing the scene of an image with respect to time, the four seasons, morning and evening, and so on. For example, an original image with a daytime atmosphere may be rendered into a night atmosphere, or a spring atmosphere may be rendered into an autumn atmosphere; the layout and structure of the image are unchanged before and after rendering, and only the content related to the temporal atmosphere is changed.
4. Usable-image rate of a generative model: the proportion of images that are adopted among the image results produced by the generative model. Conventional generative models often have a low usable-image rate because problems such as deformed persons and missing objects easily occur.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition, tracking, and measurement on targets, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, video processing, OCR (Optical Character Recognition), video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and map construction, among others.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The technical solution provided by the embodiments of the present application relates to artificial intelligence computer vision and machine learning technologies. Computer vision technology is used to extract features from the sample original image, the random noise image, the all-0 image, and the sample rendered image in a training sample; under the constraint of the rendered text, a predicted rendered image is generated based on the extracted features, and a noise recognition result between the sample original image and the sample rendered image is also obtained based on the extracted features. Machine learning technology then uses the noise recognition result, the sample rendered image, the sample original image, and the predicted rendered image to construct the training loss of the image editing model, so as to train the image editing model and obtain an image editing model with image rendering capability.
In the method provided by the embodiments of the present application, the execution body of each step may be a computer device, which refers to an electronic device with data computing, processing, and storage capabilities. The computer device may be a terminal such as a PC (Personal Computer), tablet, smartphone, wearable device, or intelligent robot, or it may be a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
The technical solution provided by the embodiments of the present application is applicable to any image rendering scene, such as scene temporal-atmosphere rendering, image editing, image generation (text-to-image), image denoising, image compression, image super-resolution, and the like. The technical solution provided by the embodiments of the present application can improve the image rendering accuracy of the image editing model.
The following describes the implementation environment of the solution and the model architecture of the image editing model provided by the embodiments of the present application.
Referring to fig. 1, a schematic diagram of an implementation environment of an embodiment of the present application is shown. The scenario implementation environment may include a model training apparatus 10 and a model using apparatus 20.
The model training device 10 may be an electronic device such as a mobile phone, a desktop computer, a tablet computer, a notebook computer, a vehicle-mounted terminal, a server, an intelligent robot, an intelligent television, a multimedia playing device, or some other electronic device with a relatively high computing power, which is not limited by the embodiment of the present application. Model training apparatus 10 is used to train image editing model 30. Alternatively, the model training apparatus 10 may train the image editing model 30 in a machine learning manner so that it has a better performance.
The image editing model 30 is a neural network model for image rendering. The image rendering in the embodiment of the application may refer to a process of changing a scene, atmosphere, tone, etc. of an image, in which the content, layout, structure, etc. of the image are not changed, and only the rendering effect of the content, such as the temporal atmosphere, weather atmosphere, light and shadow effect (such as brightness, darkness, softness, etc.), is changed. Alternatively, image rendering may also be referred to as image editing.
Alternatively, the image editing model 30 may perform image rendering based on rendering text, which may refer to image editing instructions in the image rendering process, to instruct the image editing model 30 what rendering to perform on the image. For example, the rendered text may be constructed based on the rendered scene. For example, if the rendered text is "make it rain," the image editing model 30 renders the image into a rainy scene. The embodiment of the application does not limit the style of the rendered text, and can be a Chinese style, an English style and a custom character string style.
For a certain training sample, the model training device 10 first obtains, through the image editing model 30, a noise recognition result between the sample original image and the sample rendered image in the training sample, as well as a predicted rendered image of the sample original image under the constraint of the rendered text. According to the noise recognition result, the model training device 10 then constructs the training loss of the image editing model 30 based on the sample original image, the sample rendered image, and the predicted rendered image. Finally, the model training device 10 adjusts the parameters of the image editing model 30 according to this training loss to obtain a trained image editing model 30.
Optionally, the training process is an iterative process; for example, a plurality of training samples are used to iteratively adjust the parameters of the image editing model 30 to obtain the trained image editing model 30. Optionally, the termination condition of the iteration may include at least one of: the training loss of the image editing model 30 is minimized, the number of iterations is greater than or equal to a threshold, the training loss of the image editing model 30 is less than or equal to a threshold, and so on; this embodiment is not limited in this regard.
The image editing model 30 trained as described above can be deployed in the model using apparatus 20 to provide an image rendering service. The model using device 20 may be an electronic device such as a mobile phone, a desktop computer, a tablet computer, a notebook computer, a vehicle-mounted terminal, a server, an intelligent robot, an intelligent television, a multimedia playing device, or some other electronic device with a relatively high computing power, which is not limited by the embodiment of the present application.
In some embodiments, referring to fig. 1, the image editing model 30 described above includes a first noise identification network 31, a second noise identification network 32, and an image editing network 33.
The first noise recognition network 31 is a neural network for image noise recognition. Its inputs are the encoding (or word embedding, feature vector, etc.) of the sample original image and the encoding of the sample rendered image, and its output is the noise recognition result between the sample original image and the sample rendered image. The noise recognition result indicates whether there is a difference between the sample original image and the sample rendered image, such as a difference in content, layout, or structure; if there is a difference, noise exists between the sample original image and the sample rendered image, and if there is no difference, it can be said that no noise exists between them. The noise recognition result may be a classification result, where 1 indicates the presence of noise and 0 indicates the absence of noise. Optionally, the noise recognition result output by the first noise recognition network 31 may be recorded as the first noise sub-recognition result.
Illustratively, referring to fig. 2, the first noise identification network 31 includes a first feature extraction network 311 and a second feature extraction network 312, each of the first feature extraction network 311 and the second feature extraction network 312 including m feature extraction layers, m being an integer greater than 1. Optionally, the network architecture of the first feature extraction network 311 and the network architecture of the second feature extraction network 312 are the same, and the network parameters of the first feature extraction network 311 and the network parameters of the second feature extraction network 312 are also the same, that is, the first feature extraction network 311 and the second feature extraction network 312 have the same m feature extraction layers.
Both the first feature extraction network 311 and the second feature extraction network 312 are used for feature extraction. Illustratively, the first feature extraction network 311 extracts features from the sample original image to obtain the multi-level stitching feature corresponding to the sample original image, and the second feature extraction network 312 extracts features from the sample rendered image to obtain the multi-level stitching feature corresponding to the sample rendered image, where a multi-level stitching feature includes deep features and shallow features. The deep features may be output features obtained through more feature extraction layers, which pay more attention to local information of the image, and the shallow features may be output features obtained through fewer feature extraction layers, which pay more attention to global information of the image.
Alternatively, the first feature extraction network 311 and the second feature extraction network 312 (i.e., the m feature extraction layers described above) may be constructed using techniques such as Resnet101 (a depth residual network with 101 layers), Resnet50, Inception V4 (a neural network with residual connections and reduction blocks), CNN (Convolutional Neural Network), MLP (Multilayer Perceptron), and the like.
The number m of feature extraction layers is not limited and can be set and adjusted according to actual use requirements. For example, referring to Table 1 below, a pre-trained Resnet101 is taken as an example for building 5 feature extraction layers.
TABLE 1
Because noise in the layout, structure, and the like of an image shows up mainly as differences in the texture details of the image, and noise characteristics may exist in both the deep features and the shallow features of the image, the embodiment of the application differs from a general classification model by adopting a multi-level feature stitching method: the deep features and the shallow features of the image are stitched to obtain the multi-level stitching feature of the image, and noise recognition is then performed based on the multi-level stitching feature, which can improve noise recognition accuracy.
Referring to fig. 2, in order to implement multi-level feature stitching, the first feature extraction network 311 and the second feature extraction network 312 further include the same feature alignment network, where the feature alignment network is used to align output features of different feature extraction layers to the same dimension, and then stitch the output features to obtain multi-level stitching features. The feature alignment network comprises m-1 feature alignment layers, wherein the m-1 feature alignment layers are in one-to-one correspondence with the 2 nd feature extraction layer to the m th feature extraction layer. Each feature alignment layer may include a pooling unit for pooling the output features and an alignment unit for converting the output features of the pooling unit to specified dimensions, which may be set and adjusted according to actual use requirements.
For example, taking feature extraction layers 2-5 in Table 1 as an example, 4 feature alignment layers in a feature alignment network may be constructed as in Table 2 below.
TABLE 2
Layer name | Output feature dimension | Module
Pooling unit 1 (Pool 1) | 1*256 | Max pool
Alignment unit 1 (Alignment 1) | 128 | FC+Relu
Pooling unit 2 (Pool 2) | 1*512 | Max pool
Alignment unit 2 (Alignment 2) | 128 | FC+Relu
Pooling unit 3 (Pool 3) | 1*1024 | Max pool
Alignment unit 3 (Alignment 3) | 128 | FC+Relu
Pooling unit 4 (Pool 4) | 1*2048 | Max pool
Alignment unit 4 (Alignment 4) | 128 | FC+Relu
Each feature alignment layer may include a Max pool, an FC, and a Relu: Max pool is a max pooling layer used for pooling; FC (Fully Connected layer) is a fully connected layer that acts as a classifier; and Relu is a rectification layer used to rectify the features, e.g., by linear rectification.
Illustratively, based on tables 1 and 2 above, the implementation of the first feature extraction network 311 or the second feature extraction network 312 may be as follows:
The output features of feature extraction layer 2 are passed on to feature extraction layer 3 as usual, and also pass through feature alignment layer 1 (i.e., Pool 1 + Alignment 1) to obtain a first alignment vector with dimension 1x128; the output features of feature extraction layer 3 are passed on to feature extraction layer 4 as usual, and also pass through feature alignment layer 2 (i.e., Pool 2 + Alignment 2) to obtain a second alignment vector with dimension 1x128; the output features of feature extraction layer 4 are passed on to feature extraction layer 5 as usual, and also pass through feature alignment layer 3 (i.e., Pool 3 + Alignment 3) to obtain a third alignment vector with dimension 1x128; the output features of feature extraction layer 5 pass through feature alignment layer 4 (i.e., Pool 4 + Alignment 4) to obtain a fourth alignment vector with dimension 1x128. The first, second, third, and fourth alignment vectors are then stitched end to end in sequence to obtain the multi-level stitching feature.
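By way of a non-limiting illustration, the following is a minimal PyTorch-style sketch of the multi-level feature stitching described above. It assumes a torchvision Resnet101 backbone whose four residual stages play the role of feature extraction layers 2-5, with one 128-dimensional alignment vector per stage as in Table 2; the class names (FeatureAlignLayer, MultiLevelStitchExtractor) and all other details are illustrative assumptions, not the patented implementation.

```python
# Hypothetical sketch of multi-level feature stitching (illustration only).
import torch
import torch.nn as nn
from torchvision.models import resnet101

class FeatureAlignLayer(nn.Module):
    """Pooling unit (global max pool) + alignment unit (FC + Relu), as in Table 2."""
    def __init__(self, in_channels, out_dim=128):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)        # Max pool -> 1*C
        self.fc = nn.Linear(in_channels, out_dim)  # FC
        self.relu = nn.ReLU(inplace=True)          # Relu

    def forward(self, x):
        x = self.pool(x).flatten(1)                # (B, C)
        return self.relu(self.fc(x))               # (B, 128) alignment vector

class MultiLevelStitchExtractor(nn.Module):
    """Extracts shallow-to-deep features and stitches the aligned vectors end to end."""
    def __init__(self):
        super().__init__()
        backbone = resnet101(weights="IMAGENET1K_V1")   # pre-trained, assumed backbone
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.aligns = nn.ModuleList(
            [FeatureAlignLayer(c) for c in (256, 512, 1024, 2048)])

    def forward(self, img):
        x = self.stem(img)
        aligned = []
        for stage, align in zip(self.stages, self.aligns):
            x = stage(x)                  # output feature of this extraction layer
            aligned.append(align(x))      # 1x128 alignment vector
        return torch.cat(aligned, dim=1)  # multi-level stitching feature, (B, 512)
```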
Optionally, referring to fig. 2, the first noise recognition network 31 further includes a prediction layer, where the prediction layer takes as input a multi-level stitching feature corresponding to the sample original image and a multi-level stitching feature corresponding to the sample rendered image, and takes as output a noise recognition result between the sample original image and the sample rendered image.
Illustratively, referring to table 3 below, the prediction layer includes a rectifying unit for rectifying the feature and a noise prediction unit for performing a noise prediction operation.
TABLE 3 Table 3
Layer name | Output feature dimension | Module
Rectifying unit | 1*512 | FC+Relu
Noise prediction unit | 1*2 | Noise prediction operation
Alternatively, the noise prediction operation may be constructed using cosine similarity; that is, the noise recognition result may be expressed as: 1 - a*Similarity(x1, x2). The parameter a is a parameter to be learned, or it may be a fixed value, for example 1, which is not limited in the embodiment of the present application. x1 and x2 correspond to the output of the sample original image under the rectifying unit and the output of the sample rendered image under the rectifying unit, respectively. Similarity() represents the similarity between two features; when noise exists between the sample original image and the sample rendered image, the similarity is smaller, i.e., 1 - a*Similarity(x1, x2) is larger.
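As a non-limiting illustration, the noise prediction operation above can be sketched as follows; treating the coefficient a as a learnable scalar and using cosine similarity follow the description, while the module name NoisePredictionUnit is hypothetical.

```python
# Hypothetical sketch of the noise prediction operation 1 - a * Similarity(x1, x2).
# x1 / x2: rectified features of the sample original image and the sample rendered image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisePredictionUnit(nn.Module):
    def __init__(self, learnable_a=True):
        super().__init__()
        # 'a' may be learned or fixed (e.g., 1); learnable here is an assumption.
        self.a = nn.Parameter(torch.ones(1)) if learnable_a else 1.0

    def forward(self, x1, x2):
        sim = F.cosine_similarity(x1, x2, dim=-1)  # Similarity(x1, x2)
        return 1.0 - self.a * sim                  # larger score -> more likely noisy
```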
The second noise recognition network 32 is also a kind of neural network for image noise recognition, but it is different from the first noise recognition network 31. Illustratively, the input of the second noise recognition network 32 may be the encoding of the differential image between the sample original image and the sample rendered image, and the output may be the noise recognition result between the sample original image and the sample rendered image, alternatively, the noise recognition result output by the second noise recognition network 32 may be noted as the second noise sub-recognition result.
Illustratively, referring to fig. 3, the second noise recognition network 32 includes a feature extraction layer 321 and a prediction layer, the feature extraction layer 321 is used for extracting features of the differential image, and the prediction layer is used for classifying based on the output of the feature extraction layer 321 to obtain a second noise sub-recognition result.
Alternatively, the feature extraction layer 321 may be constructed using techniques such as Resnet101, Resnet50, Inception V4, CNN, MLP, and the like. The prediction layer in the second noise recognition network 32 may be constructed using FC. Optionally, the feature extraction layer 321 may be the same as the first feature extraction network 311 or the second feature extraction network 312, which is not limited in the embodiment of the present application.
For example, the feature extraction layer 321 may be constructed as in Table 1 (or Table 1 + Table 2) above, and the remaining layers of the second noise recognition network 32 may be constructed as in Table 4 below.
TABLE 4 Table 4
Layer name | Output feature dimension | Module
Pooling unit | 1*2048 | Max pool
Rectifying unit 1 | 1*512 | FC+Relu
Rectifying unit 2 | 1*128 | FC+Relu
Noise prediction unit | 1*2 | FC
Noise in the differential image is concentrated in changes of the detail texture of the image, so the second noise recognition network 32 can judge whether noise exists between the sample original image and the sample rendered image from differences in the detail texture of the image, thereby improving noise recognition accuracy and helping to improve the image rendering accuracy of the image editing model.
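A minimal sketch of such a differential-image classifier is given below, assuming a Resnet101 backbone as the feature extraction layer 321 and the head layout of Table 4; it is illustrative only and not the patented implementation.

```python
# Hypothetical sketch of the second noise recognition network: a backbone encodes the
# differential image, and a small head (as in Table 4) outputs a two-way classification.
import torch
import torch.nn as nn
from torchvision.models import resnet101

class DifferentialNoiseClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet101(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # conv stages only
        self.pool = nn.AdaptiveMaxPool2d(1)                # Pooling unit, 1*2048
        self.head = nn.Sequential(
            nn.Linear(2048, 512), nn.ReLU(inplace=True),   # Rectifying unit 1
            nn.Linear(512, 128), nn.ReLU(inplace=True),    # Rectifying unit 2
            nn.Linear(128, 2),                             # Noise prediction unit (FC)
        )

    def forward(self, diff_img):
        x = self.pool(self.features(diff_img)).flatten(1)
        return self.head(x)  # logits: [no noise, noise]
```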
In some possible embodiments, 3 or more noise recognition networks may be deployed in the image editing model 30, where each noise recognition network is different, so as to obtain multiple noise recognition results between the sample original image and the sample rendered image, and further adjust parameters of the image editing model 30 according to the multiple noise recognition results.
Alternatively, the noise recognition network may be disposed in the image editing model 30, or may be disposed separately outside the image editing model 30 to assist in training of the image editing model 30, which is not limited by the embodiment of the present application.
In one example, the first noise recognition network 31 and the second noise recognition network 32 may be pre-trained. Illustratively, the first noise recognition network 31 and the second noise recognition network 32 may be trained using the SGD (Stochastic Gradient Descent) method. For example, the initial learning rate is set to 0.01, the learning rate is multiplied by 0.1 after every 10 epochs (each epoch represents one pass over the full training data), and 60 epochs are trained in total. The learning rate, the full data, the number of iterations, and the like can be set and adjusted according to actual use requirements.
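By way of illustration, a pre-training loop with this schedule might look like the following PyTorch sketch; the momentum value and the names pretrain_noise_network and train_loader are assumptions, not part of the original description.

```python
# Hypothetical sketch: SGD, initial lr 0.01, lr x0.1 every 10 epochs, 60 epochs in total.
import torch

def pretrain_noise_network(model, train_loader, device="cuda"):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum assumed
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    model.to(device).train()
    for epoch in range(60):
        for images, labels in train_loader:   # labels: 1 = noisy sample, 0 = clean sample
            logits = model(images.to(device))
            loss = criterion(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                      # decays the learning rate every 10 epochs
```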
Alternatively, the parameters of the feature extraction network are initialized with the open-source Resnet101 parameters (as in Table 1 above) and do not participate in the update, while the other parameters of the noise recognition networks are initialized with a random Gaussian distribution and updated during training (as in Tables 2, 3, and 4 above).
Optionally, the model training apparatus 10 trains the image editing model 30 with open-source training samples, such as the training samples corresponding to InstructPix2Pix. However, since these samples contain a large amount of noise, the embodiment of the present application constructs noise recognition networks to recognize the noise so as to ensure the image rendering accuracy of the image editing model 30.
Meanwhile, considering the cost, noise labeling cannot be performed on all training samples, so the embodiment of the present application trains the noise recognition networks under limited labeling to judge whether a training sample is a noise sample (i.e., whether noise exists). For example, 1/10 of the training samples of the image editing model 30 (about 40,000) are randomly sampled, and the sampled training samples are noise-labeled to pre-train the noise recognition networks.
The image editing network 33 refers to a neural network for acquiring the predicted rendered image. Its inputs are the training sample, including the sample original image, the sample rendered image, the all-0 image, the random noise image, and the rendered text, and its output is the predicted rendered image corresponding to the sample original image. Alternatively, the image editing network 33 may be constructed using, for example, a Diffusion model, an LDM (Latent Diffusion Model), a Stable Diffusion model, or the like, which is not limited by the embodiment of the present application.
Illustratively, referring to FIG. 4, the image editing network 33 includes a backward noise adding network 331 (which implements the diffusion process) and a forward denoising network 332 (which may be constructed based on a denoising U-Net). The backward noise adding network 331 performs the diffusion operation in the hidden space (i.e., the latent representation space): Gaussian noise is added at each step to obtain the hidden-space feature Z_T at time T, where the hidden-space feature Z_T carries noise characteristics and T is a positive integer.
The forward denoising network 332 is used for denoising in the hidden space and may be a conditional denoising autoencoder (Conditional Denoising Auto-encoder) that uses the rendered text as the K/V information and the image as the Q information.
Optionally, the image editing network 33 further comprises a self-encoding network (e.g., a VAE, Variational Auto-Encoder) comprising an encoder (ε) for encoding the training sample and a decoder (D) for decoding to obtain the predicted rendered image.
For example, referring to FIG. 4, the image editing network 33 may be implemented as follows. The sample original image, the all-0 image, the random noise image, and the sample rendered image are stitched to obtain the input, and the input is encoded (using the VAE, i.e., the images are mapped into the hidden space through the VAE). The hidden-space feature at time T is then obtained through the diffusion process. Through T denoising U-Net operations, the hidden-space feature at time T is restored to the feature of the image (i.e., the original image feature without noise); rendering-feature control is performed after each U-Net, and the rendering-feature control operation generates the input of the U-Net at the next time step. Finally, the restored feature goes through the VAE decoding process to obtain the predicted rendered image.
For the rendered text (e.g., text such as "make it dim" to darken the original image), a CLIP (Contrastive Language-Image Pre-Training) model may be used to obtain the word embedding of the rendered text, which then exerts control through the QKV (cross-attention) mechanism of the U-Net. The backward noise adding network 331 (diffusion sampling) maps the output of the image editing network 33 to the hidden-space feature at time T, and the forward denoising network 332 learns to fit the noise feature, thereby eliminating the noise feature to obtain the image feature actually required; the predicted rendered image is then obtained through the decoder D.
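The overall flow can be summarized by the following highly simplified sketch; vae, unet, clip_text_encoder, and scheduler are placeholder components standing in for the modules described above, and their call signatures are illustrative assumptions rather than an actual library API.

```python
# Highly simplified, hypothetical sketch of the image editing network's forward pass.
import torch

def edit_image(original, all_zero, rand_noise, rendered_text,
               vae, unet, clip_text_encoder, scheduler, T=50):
    # Map the stitched image inputs into the latent (hidden) space with the VAE encoder.
    latents = vae.encode(torch.cat([original, all_zero, rand_noise], dim=1))
    # Word embedding of the rendered text, used as K/V in the U-Net cross-attention.
    text_emb = clip_text_encoder(rendered_text)
    # Backward noise adding (diffusion) up to time T.
    z_t = scheduler.add_noise(latents, torch.randn_like(latents), T)
    # T forward denoising U-Net steps under rendering-feature control.
    for t in reversed(range(T)):
        noise_pred = unet(z_t, t, text_emb)        # predicts the noise component
        z_t = scheduler.step(noise_pred, t, z_t)   # input of the U-Net at the next step
    # Decode the restored latent feature into the predicted rendered image.
    return vae.decode(z_t)
```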
The embodiment of the present application does not limit the model architecture of the image editing model; the image editing model in the above embodiment is only exemplary and explanatory, and any model capable of realizing the image rendering described herein should fall within the protection scope of the embodiments of the present application.
The following is an embodiment of the method according to the present application, through which the training process of the image editing model is described, and for details not disclosed in the embodiment of the method according to the present application, please refer to the above-mentioned embodiment.
Referring to fig. 5, a flowchart of an image rendering method according to an embodiment of the application is shown. The subject of execution of the steps of the method may be the model training apparatus described above. The method may comprise the following steps (501-505).
Step 501, a plurality of training samples are acquired, wherein the training samples comprise a sample original image, a rendering text, a random noise image, an all 0 image corresponding to the sample original image, and a sample rendering image generated based on the sample original image under the constraint of the rendering text.
The sample original image may be any image, such as a photograph, a video frame, or a self-drawn picture. The rendered text may be text indicating how to render the sample original image; for example, it may indicate scene temporal-atmosphere rendering of the sample original image, and it can be set and adjusted according to actual use requirements. The random noise image may be a noise image generated from a random seed i; a different random number is used as the seed each time a random noise image is generated, and the seed may be used to determine pixel points or pixel blocks and their corresponding pixel values.
The all-0 image refers to an image in which the pixel values of all pixels are 0, so the image is entirely black. Optionally, the pixel values of all pixel points in the sample original image are set to 0 to obtain the all-0 image corresponding to the sample original image.
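As a simple illustration, the two auxiliary images might be constructed as follows with NumPy; the 8-bit value range and the helper name make_auxiliary_images are assumptions.

```python
# Minimal sketch (assumption): building the random noise image and the all-0 image
# with the same shape as the sample original image.
import numpy as np

def make_auxiliary_images(original, seed=0):
    rng = np.random.default_rng(seed)                                   # random seed i
    noise = rng.integers(0, 256, size=original.shape, dtype=np.uint8)   # random noise image
    all_zero = np.zeros_like(original)                                  # all-0 image (black)
    return noise, all_zero
```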
The sample rendered image corresponds to the sample original image and can serve as the supervision data (i.e., label data) for the sample original image. For example, a diffusion model may be used to generate the sample rendered image based on the sample original image under the constraint of the rendered text. Noise may or may not exist between the sample rendered image and the sample original image. For example, referring to FIG. 6, for the sample rendered image 602 corresponding to the sample original image 601 (winter atmosphere), the sample rendered image 602 has a spring atmosphere, but its layout and structure have changed; for instance, flowers appear in an upper area where, according to the sample original image 601, there should be a tree rather than flowers, and the river in the distance has also changed. Therefore, noise exists between the sample original image 601 and the sample rendered image 602, and this is a noise sample.
For the sample rendered image 604 corresponding to the sample original image 603 (spring atmosphere), the sample rendered image 604 has a winter atmosphere while its layout and structure are unchanged (some of the flowers and plants in the sample original image 603 are covered by snow and thus not visible), so no noise exists between the sample original image 603 and the sample rendered image 604, and this is a non-noise sample.
Optionally, the sample original image, the all-0 image, the random noise image, and the sample rendered image have the same size.
Step 502, for each training sample, obtaining a noise recognition result between the sample original image and the sample rendered image, where the noise recognition result is used to indicate whether there is a difference between the sample original image and the sample rendered image, except for content related to the rendered text.
In the embodiment of the present application, because the memory resources of the model training device are limited, the full sample set cannot be fed into the image editing model for training all at once, so the full sample set is randomly divided into a plurality of batches and the image editing model is trained batch by batch; when all batches have completed their training task, one round of iteration is recorded as completed. The image editing model is iterated for N (e.g., 100) rounds using the full data to obtain the trained image editing model. Illustratively, for the training of a certain batch, bs training samples are first randomly drawn from the full sample set, where bs is a positive integer, as illustrated in the sketch below.
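A minimal sketch of this batch division is given below for illustration; the helper name and the yielding interface are assumptions.

```python
# Illustrative sketch (assumption): randomly splitting the full sample set into batches
# of size bs for one round (epoch) of iteration.
import random

def iterate_in_batches(training_samples, bs):
    indices = list(range(len(training_samples)))
    random.shuffle(indices)
    for start in range(0, len(indices), bs):
        batch = [training_samples[i] for i in indices[start:start + bs]]
        yield batch  # each batch is used for one parameter update of the image editing model
```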
The training samples are the training samples in a certain batch. Because the training process of each batch is the same, the embodiment of the application is illustrated by the training process of a certain batch.
The noise recognition result is used to indicate whether there is noise (such as a difference in layout, structure, etc.) between the sample original image and the sample rendered image, which is the same as that described in the above embodiment, and will not be repeated here.
In one example, the first noise recognition network and the second noise recognition network in the image editing model described above are employed to obtain a noise recognition result between the sample raw image and the sample rendered image. Referring to fig. 7, step 502 may further include the following sub-steps:
step 502a, obtaining multi-level stitching features corresponding to a sample original image and a sample rendered image respectively through a first noise identification network, wherein the multi-level stitching features comprise shallow features and deep features of the image.
Optionally, the first noise identification network comprises a first feature extraction network and a second feature extraction network, each of the first feature extraction network and the second feature extraction network comprises m feature extraction layers, m being an integer greater than 1. The multi-level splicing feature can be obtained according to the output features of the m feature extraction layers.
Illustratively, the process of acquiring the multi-level stitching feature corresponding to the original sample image may be as follows:
1. inputting the original image of the sample into a first feature extraction network, and obtaining a first feature sequence according to the output features from the 2 nd feature extraction layer to the m th feature extraction layer of the first feature extraction network.
And carrying out feature extraction on the codes of the original images of the samples by adopting a first feature extraction layer to obtain the output features of the first feature extraction layer, carrying out feature extraction on the output features of the first feature extraction layer by adopting a second feature extraction layer to obtain the output features of the second feature extraction layer, and analogizing to obtain the output features respectively corresponding to the m feature extraction layers.
And sequentially sequencing from the output features of the second feature extraction layer to the output features of the mth feature extraction layer to obtain a first feature sequence.
2. And carrying out dimension alignment on each output feature in the first feature sequence to obtain an aligned first feature sequence.
And aligning each output feature in the first feature sequence to the same dimension through a feature alignment network in the first feature extraction network to obtain an aligned first feature sequence.
3. And splicing all the characteristics in the aligned first characteristic sequence to obtain multi-level splicing characteristics corresponding to the original image of the sample.
All the features in the aligned first feature sequence are stitched end to end in sequence to obtain the multi-level stitching feature corresponding to the sample original image; this multi-level stitching feature can reflect the features of the sample original image at all levels (from shallow to deep).
The process of obtaining the multi-level stitching features corresponding to the sample rendered image may be as follows:
1. and inputting the sample rendering image into a second feature extraction network, and obtaining a second feature sequence according to the output features from the 2 nd feature extraction layer to the m th feature extraction layer of the second feature extraction network.
And sequentially obtaining output features respectively corresponding to the m feature extraction layers through m feature extraction layers in the second feature extraction network according to the codes of the sample rendering images, and further sequentially sequencing the output features from the 2 nd feature extraction layer to the m feature extraction layer of the second feature extraction network to obtain a second feature sequence.
2. And carrying out dimension alignment on each output feature in the second feature sequence to obtain an aligned second feature sequence.
And aligning each output feature in the second feature sequence to the same dimension through a feature alignment network in the second feature extraction network to obtain an aligned second feature sequence, wherein the dimension of the feature in the aligned second feature sequence is the same as that of the feature in the aligned first feature sequence.
3. And splicing all the features in the aligned second feature sequence to obtain multi-level splicing features corresponding to the sample rendering image.
All the features in the aligned second feature sequence are stitched end to end in sequence to obtain the multi-level stitching feature corresponding to the sample rendered image; this multi-level stitching feature can reflect the features of the sample rendered image at all levels (from shallow to deep).
Step 502b, obtaining a first noise sub-recognition result through a first noise recognition network, wherein the first noise sub-recognition result is a noise recognition result obtained according to the similarity between the multi-level stitching feature corresponding to the original image of the sample and the multi-level stitching feature corresponding to the rendered image of the sample.
And processing the multi-level splicing characteristics corresponding to the original image of the sample and the multi-level splicing characteristics corresponding to the rendered image of the sample through a prediction layer in the first noise identification network to obtain a first noise sub-identification result.
Alternatively, the first noise sub-recognition result may be expressed as follows: 1 - a*Similarity(x1, x2);
where x1 is the multi-level stitching feature corresponding to the sample original image, x2 is the multi-level stitching feature corresponding to the sample rendered image, and Similarity(x1, x2) represents the similarity between the multi-level stitching feature corresponding to the sample original image and the multi-level stitching feature corresponding to the sample rendered image; that is, the higher the similarity, the lower the possibility of noise.
For example, if the first noise sub-recognition result is 1, noise exists between the sample original image and the sample rendered image; if the first noise sub-identification result is 0, no noise exists between the sample original image and the sample rendered image.
Step 502c, obtaining a differential image between the sample original image and the sample rendered image.
Optionally, for each pixel point in the sample original image, subtracting the pixel value of the pixel point from the pixel value of the pixel point at the corresponding position in the sample rendered image to obtain a differential value corresponding to the pixel point, and replacing the pixel value of the pixel point at the corresponding position in the sample original image according to the differential value of each pixel point to obtain the differential image. The difference image may highlight the difference that exists between the sample original image and the sample rendered image.
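A minimal sketch of this per-pixel subtraction is shown below; taking the absolute value of the difference is an assumption for illustration, as the description only specifies pixel-wise subtraction.

```python
# Minimal sketch (assumption): the differential image obtained by subtracting the sample
# rendered image from the sample original image pixel by pixel (uint8 images assumed).
import numpy as np

def differential_image(original, rendered):
    diff = original.astype(np.int16) - rendered.astype(np.int16)
    return np.abs(diff).astype(np.uint8)  # highlights where the two images differ
```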
Step 502d, obtaining a second noise sub-recognition result through the second noise recognition network, wherein the second noise sub-recognition result is a noise recognition result obtained according to the differential image.
And extracting the characteristics of the codes of the differential images through a characteristic extraction layer in the second noise identification network to obtain characteristic representation of the differential images, and calculating to obtain a second noise sub-identification result by adopting a prediction layer in the second noise identification network according to the characteristic representation of the differential images. The second noise sub-recognition result is also used to indicate whether noise is present between the sample original image and the sample rendered image.
Step 502e, obtaining a noise identification result according to the first noise sub-identification result and the second noise sub-identification result.
Optionally, the first noise sub-recognition result and the second noise sub-recognition result are combined to obtain a noise recognition result.
Or, the noise recognition result is determined to be noise-free in the case that both the first noise sub-recognition result and the second noise sub-recognition result indicate that no noise exists between the sample original image and the sample rendered image; and the noise recognition result is determined to be noisy in the case that at least one of the first noise sub-recognition result and the second noise sub-recognition result indicates that noise exists between the sample original image and the sample rendered image.
Step 503, generating a prediction rendering image corresponding to the training sample according to the sample original image, the rendering text, the random noise image, the all 0 image and the sample rendering image through the image editing model.
Optionally, generating, by an image editing network in the image editing model, a predictive rendered image corresponding to the training sample according to the sample original image, the rendered text, the random noise image, the all 0 image and the sample rendered image.
In the embodiment of the application, before the training sample is input into the image editing network, the images in the training sample are stitched to construct the input of the image editing network.
Illustratively, the sample original image and the random noise image are stitched to obtain a first stitched image. For example, the random noise image is stitched directly after the sample original image, resulting in a first stitched image. And carrying out pixel value superposition on the random noise image and the sample rendering image to obtain an intermediate image. For example, the intermediate image can be obtained by adding the pixel value of the pixel point in the random noise image to the pixel value of the pixel point at the corresponding position in the sample rendering image. And stitching the intermediate image and the sample original image to obtain a second stitched image. For example, the intermediate image is stitched directly after the sample original image, resulting in a second stitched image. And splicing the intermediate image and the all 0 images to obtain a third spliced image. For example, the intermediate image is stitched directly after the all 0 images, and a third stitched image can be obtained.
Alternatively, for the first stitched image, the intermediate image may also be stitched directly after the sample original image, so as to obtain the first stitched image.
And taking the first spliced image, the second spliced image and the third spliced image as input of an image editing network, namely generating a prediction rendering image corresponding to the training sample according to the first spliced image, the second spliced image and the third spliced image through an image editing model.
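A minimal sketch of this input construction, assuming that "stitching" means concatenation along the channel axis and that pixel-value superposition is element-wise addition (the function and variable names are hypothetical):

```python
import numpy as np

def build_editing_inputs(sample_original, sample_rendered, random_noise, all_zero):
    # All four images are assumed to share one shape (H, W, C).
    intermediate = random_noise + sample_rendered                   # pixel-value superposition
    s0 = np.concatenate([sample_original, random_noise], axis=-1)   # first stitched image
    s1 = np.concatenate([sample_original, intermediate], axis=-1)   # second stitched image
    s2 = np.concatenate([all_zero, intermediate], axis=-1)          # third stitched image
    return s0, s1, s2

h, w = 64, 64
orig = np.random.rand(h, w, 3)
rend = np.random.rand(h, w, 3)
noise = np.random.randn(h, w, 3)
zeros = np.zeros((h, w, 3))
print([x.shape for x in build_editing_inputs(orig, rend, noise, zeros)])  # three (64, 64, 6) arrays
```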
In one example, referring to fig. 8, the acquisition process of the predictive rendered image may further include the following:
in step 503a, T times of backward noise adding is performed on the first stitched image, the second stitched image and the third stitched image through the image editing model, so as to obtain hidden space features of the first stitched image, the second stitched image and the third stitched image at T moments, where the hidden space features at T moments have noise features, and T is a positive integer.
Optionally, codes corresponding to the first stitched image, the second stitched image and the third stitched image are obtained through an encoder in the image editing model, then a backward noise adding network in the image editing network is adopted to add backward noise to the codes corresponding to the first stitched image, the second stitched image and the third stitched image in the hidden space for T times, so that hidden space characteristics of the first stitched image, the second stitched image and the third stitched image at T time are obtained. The T-th backward noise corresponds to the hidden space characteristic at the T moment.
Illustratively, the hidden spatial features of the first stitched image, the second stitched image, and the third stitched image at the T-time point, respectively, may be represented as follows:
First stitched image S0: hidden space feature Z_T of the random noise image (or the intermediate image) at time T + hidden space feature of the sample original image at time T.
Second stitched image S1: hidden space feature Z_T of the intermediate image at time T + hidden space feature of the sample original image at time T.
Third stitched image S2: hidden space feature Z_T of the intermediate image at time T + hidden space feature of the all-0 image at time T.
Wherein the hidden space feature Z_T carries a noise feature.
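A minimal sketch of the backward noise adding in hidden space, assuming the standard diffusion forward process with a linear beta schedule (the encoder is omitted and a random latent stands in for the encoded stitched image):

```python
import torch

def add_backward_noise(z0: torch.Tensor, T: int = 1000) -> torch.Tensor:
    # After T noising steps the latent is sqrt(alpha_bar_T) * z0
    # + sqrt(1 - alpha_bar_T) * eps, i.e. the hidden space feature at time T
    # carrying a noise feature. The linear beta schedule is an assumption.
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    eps = torch.randn_like(z0)
    return alpha_bar[-1].sqrt() * z0 + (1.0 - alpha_bar[-1]).sqrt() * eps

# usage: z0 stands in for the encoded stitched image
print(add_backward_noise(torch.randn(1, 4, 32, 32)).shape)
```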
And step 503b, performing T times of forward denoising on hidden space features of the first spliced image, the second spliced image and the third spliced image at T time respectively according to the rendered text through an image editing model to obtain denoised hidden space features corresponding to the original image of the sample.
Optionally, through a forward denoising network in the image editing network, the rendered text is used as KV information, and forward denoising (namely Denoising U-Net) is performed T times according to the hidden space features of the first stitched image, the second stitched image and the third stitched image at time T respectively, so that the noise feature in the hidden space feature Z_T is eliminated and the denoised hidden space feature Z' corresponding to the sample original image is obtained. The denoised hidden space feature Z' may be used to characterize the predicted rendered image under the constraint of the rendered text.
In one example, after each forward denoising, rendering feature control is also performed to obtain the input of the U-Net at the next step. The embodiment of the application adopts directional rendering feature control, that is, the feature combination parameters (namely the weighting weights of the features) are fixed during model training and application, which reduces the parameter search in the training process, improves the convergence efficiency of the image editing model and reduces the training amount of the model. For the n-th forward denoising among the T forward denoising steps and the corresponding rendering feature control, n being a positive integer, the process may include the following:
1. and acquiring a first difference value between the denoising hidden space characteristic of the first spliced image at the T-n time and the denoising hidden space characteristic of the second spliced image at the T-n time, wherein n is a positive integer.
In the case of n=1, the input of the U-Net is the hidden space features of the first stitched image, the second stitched image and the third stitched image at time T, namely S0, S1 and S2, each of which contains Z_T; that is, the denoised hidden space feature at time T is the hidden space feature (of the intermediate image) at time T.
Under the constraint of the rendered text, the U-Net performs forward denoising on S0, S1 and S2 respectively, so that the denoised hidden space features of the first stitched image, the second stitched image and the third stitched image at time T-1 are obtained, denoted here as Z_{T-1}(S0), Z_{T-1}(S1) and Z_{T-1}(S2). The denoised hidden space feature at time T-1 is the hidden space feature after the first denoising.
Optionally, the first difference is expressed as: Z_{T-1}(S0) - Z_{T-1}(S1).
In the case that n is greater than 1 and less than or equal to T, the input of the U-Net is constructed based on the output of the U-Net at the (n-1)-th forward denoising, that is, based on the denoised hidden space features of the first stitched image, the second stitched image and the third stitched image at time T-n+1, denoted here as S0', S1' and S2'. Each of S0', S1' and S2' contains Z_{T-n+1}, where Z_{T-n+1} is the hidden space feature of the intermediate image at time T-n+1.
Under the constraint of the rendered text, the U-Net performs forward denoising on S0', S1' and S2' respectively, obtaining the denoised hidden space features of the first stitched image, the second stitched image and the third stitched image at time T-n, denoted as Z_{T-n}(S0), Z_{T-n}(S1) and Z_{T-n}(S2).
Optionally, the first difference is expressed as: Z_{T-n}(S0) - Z_{T-n}(S1).
2. and acquiring a second difference value between the denoising hidden space characteristic of the second spliced image at the T-n time and the denoising hidden space characteristic of the third spliced image at the T-n time.
Optionally, the second difference is expressed as: Z_{T-n}(S1) - Z_{T-n}(S2).
3. and carrying out weighted summation on the denoising hidden space feature of the third spliced image at the T-n time, the first difference value and the second difference value to obtain the hidden space feature of the intermediate image at the T-n time.
Alternatively, the hidden space feature of the intermediate image at time T-n may be expressed as follows: Z_{T-n} = 1 × Z_{T-n}(S2) + 7.5 × first difference + 1.5 × second difference;
wherein the weight parameters corresponding to the weighted summation are fixed, being 1, 7.5 and 1.5 respectively. The weight parameters can be set and adjusted according to actual use requirements. (A sketch of this weighted summation and the replacement in step 4 is given at the end of this procedure.)
4. And replacing, with the hidden space feature of the intermediate image at time T-n, the hidden space features corresponding to the intermediate image in the denoised hidden space features of the first stitched image, the second stitched image and the third stitched image at time T-n respectively, so as to obtain the input of the (n+1)-th forward denoising.
Alternatively, the hidden space features corresponding to the intermediate image in Z_{T-n}(S0), Z_{T-n}(S1) and Z_{T-n}(S2) (including the part corresponding to the random noise image in S0) are replaced by Z_{T-n}, so that the input of the (n+1)-th forward denoising is obtained.
For example, the input of the (n+1)-th forward denoising may be recorded as:
First stitched image: Z_{T-n} + denoised hidden space feature of the sample original image at time T-n.
Second stitched image: Z_{T-n} + denoised hidden space feature of the sample original image at time T-n.
Third stitched image: Z_{T-n} + denoised hidden space feature of the all-0 image at time T-n.
5. And according to the n+1th forward denoising input, denoising hidden space features of the first spliced image, the second spliced image and the third spliced image at the time of T-n-1 respectively are obtained.
The U-Net processes the input of the (n+1)-th forward denoising (namely the inputs in which the hidden space features corresponding to the intermediate image have been replaced) under the constraint of the rendered text, and obtains the denoised hidden space features of the first stitched image, the second stitched image and the third stitched image at time T-n-1 respectively.
Optionally, the hidden space feature of the intermediate image at time 0 is the denoised hidden space feature corresponding to the sample original image, denoted as Z_0; the random noise image and the noise added in the backward noise adding process have been filtered out of Z_0.
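Steps 3 and 4 above can be sketched as follows. This is only an illustration: the stitched latents are assumed to be split into an intermediate-image half and an image half along the channel axis, the tensor shapes and names are hypothetical, and only the fixed weights 1, 7.5 and 1.5 come from the text.

```python
import torch

def rendering_feature_control(z_s0, z_s1, z_s2, z_orig, z_zero):
    # z_s0 / z_s1 / z_s2: the intermediate-image parts of the denoised hidden
    # space features of the first / second / third stitched images at time T-n
    # (for S0 this is the part corresponding to the random noise image).
    # z_orig / z_zero: denoised hidden space features of the sample original
    # image and of the all-0 image at time T-n.
    first_diff = z_s0 - z_s1
    second_diff = z_s1 - z_s2
    z_mid = 1.0 * z_s2 + 7.5 * first_diff + 1.5 * second_diff  # hidden feature of the intermediate image
    # replace the intermediate-image part of each input for the (n+1)-th forward denoising
    next_s0 = torch.cat([z_mid, z_orig], dim=1)
    next_s1 = torch.cat([z_mid, z_orig], dim=1)
    next_s2 = torch.cat([z_mid, z_zero], dim=1)
    return next_s0, next_s1, next_s2

parts = [torch.randn(1, 4, 32, 32) for _ in range(5)]
print([t.shape for t in rendering_feature_control(*parts)])  # three (1, 8, 32, 32) tensors
```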
And 503c, decoding the denoised hidden space features corresponding to the training samples to generate a prediction rendering image corresponding to the training samples.
Optionally, a decoder in the image editing network is adopted to decode the denoised hidden space features corresponding to the training samples, and then the predicted rendering image corresponding to the training samples can be obtained.
And 504, processing the sample rendering image, the sample original image and the prediction rendering image according to the noise identification result to obtain a training loss corresponding to the sample original image, wherein the training loss is used for representing the rendering capability of the image editing model on the image.
Because noise may exist between the sample rendered image and the sample original image, directly constructing the training loss from the sample rendered image and the sample original image may cause the image editing model to learn wrong knowledge, so that the image rendering accuracy of the image editing model is not high.
In one example, the training loss building process may be as follows:
1. and assigning a training loss corresponding to the sample original image to 0 when the first noise sub-recognition result and the second noise sub-recognition result indicate that a difference exists between the sample original image and the sample rendered image except for content related to the rendered text.
That is, in the case that the first noise sub-recognition result and the second noise sub-recognition result both indicate that noise exists between the sample original image and the sample rendered image, the training sample does not participate in training of the image editing model, so that the image editing model is prevented from learning wrong knowledge.
2. Randomly extracting a first number of pixel points from the sample rendered image in the case that one and only one of the first and second noise sub-recognition results indicates a difference between the sample original image and the sample rendered image, except for content related to the rendered text; determining a first sub-loss according to the values of the first number of pixels and the values of the pixels of the corresponding positions of the first number of pixels in the predicted rendered image; randomly extracting a second number of pixels except the first number of pixels from the original sample image; determining a second sub-loss according to the values of the second number of pixels and the values of the pixels of the second number of pixels at corresponding positions in the predicted rendered image; and obtaining training loss corresponding to the original image of the sample according to the first sub-loss and the second sub-loss.
That is, in the case that noise may exist between the sample original image and the sample rendered image, part of the content of the predicted rendered image is used to construct the training loss. On the one hand, this can reduce the influence of the noisy sample on the whole model, thereby improving the image rendering accuracy of the image editing model; on the other hand, since some pixel points are extracted from the sample original image as supervision, the image editing model is guaranteed to retain a certain image rendering capability, and the output predicted rendered image keeps a certain consistency (such as structure and layout) with the sample original image.
The first number and the second number may be set and adjusted according to an empirical value, which is not limited in the embodiment of the present application. Illustratively, in the case that noise may exist between the sample original image and the sample rendered image, 10% of pixels are randomly extracted from the sample rendered image, and 10% of pixels corresponding to positions in the predicted rendered image are extracted, and then the first sub-loss is constructed according to a difference between pixel values of 10% of pixels corresponding to the sample rendered image and pixel values of 10% of pixels corresponding to the predicted rendered image.
Alternatively, the training loss may be calculated using a mean square error loss function, a cross entropy loss function, a focus loss function, a mean square difference loss function, or the like.
For example, taking the mean square error loss function as an example, the first sub-loss can be expressed as follows:
first sub-loss = (1/n) × Σ_{i=1}^{n} (y_i - ŷ_i)²;
wherein y_i is the pixel value of the i-th pixel point among the 10% of pixel points corresponding to the sample rendered image, ŷ_i is the pixel value of the i-th pixel point among the 10% of pixel points at the corresponding positions in the predicted rendered image, and n is the first number (namely the number corresponding to the 10% of pixel points).
Then, 5% of pixel points are randomly extracted from the positions other than the above 10% (that is, from the remaining 90% of positions), the pixel values of these 5% of pixel points are taken from the sample original image, and the pixel values of the pixel points at the corresponding positions are taken from the predicted rendered image; the second sub-loss is then constructed according to the difference between the two, using the mean square error loss function described above.
And finally, the first sub-loss and the second sub-loss are added to obtain the training loss corresponding to the sample original image. In this way, only 15% of the pixel points of the predicted rendered image participate in the loss; compared with using 100% of the pixel points, the loss scale is 15% of that of a normal sample, so the influence of noise can be reduced. (A sketch covering all three loss cases is given after item 3 below.)
3. And under the condition that the first noise sub-recognition result and the second noise sub-recognition result indicate that no difference exists between the sample original image and the sample rendered image except for the content related to the rendered text, obtaining training loss corresponding to the sample original image according to the values of all the pixel points in the sample rendered image and the values of all the pixel points in the predicted rendered image.
That is, in the case that no noise exists between the sample original image and the sample rendered image, the training loss corresponding to the sample original image may be constructed by using the full-pel corresponding to the sample rendered image and the full-pel corresponding to the prediction rendered image.
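The three cases above can be sketched together as follows; booleans stand in for the two noise sub-recognition results (True meaning that sub-result indicates noise), the mean square error is used for every sub-loss, and the 10% / 5% ratios follow the example above. The function and parameter names are hypothetical.

```python
import numpy as np

def training_loss(sample_rendered, sample_original, predicted,
                  first_noise: bool, second_noise: bool,
                  p1: float = 0.10, p2: float = 0.05, seed: int = 0) -> float:
    flat_r = sample_rendered.reshape(-1)
    flat_o = sample_original.reshape(-1)
    flat_p = predicted.reshape(-1)
    if first_noise and second_noise:
        return 0.0                                    # noisy sample: does not participate in training
    if first_noise or second_noise:                   # noise may exist: partial pixel supervision
        rng = np.random.default_rng(seed)
        perm = rng.permutation(flat_r.size)
        idx1 = perm[: int(p1 * flat_r.size)]                               # first number of pixels
        idx2 = perm[int(p1 * flat_r.size): int((p1 + p2) * flat_r.size)]   # disjoint second number
        first_sub = float(np.mean((flat_r[idx1] - flat_p[idx1]) ** 2))     # supervised by sample rendered image
        second_sub = float(np.mean((flat_o[idx2] - flat_p[idx2]) ** 2))    # supervised by sample original image
        return first_sub + second_sub
    return float(np.mean((flat_r - flat_p) ** 2))     # clean sample: all pixel points

a, b, c = (np.random.rand(32, 32, 3) for _ in range(3))
print(training_loss(a, b, c, first_noise=True, second_noise=False))
```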
Optionally, in the case of having a plurality of noise sub-recognition results, the number of pixels extracted from the sample rendered image is adjusted according to the number of noise sub-recognition results indicating that noise exists, so as to construct a first sub-loss, for example, the first number has a negative correlation with the number of noise sub-recognition results indicating that noise exists, and the larger the number of noise sub-recognition results indicating that noise exists, the smaller the value of the first number is, so as to reduce the influence of noise.
Step 505, according to the training loss corresponding to each training sample, adjusting the parameters of the image editing model to obtain a trained image editing model, wherein the trained image editing model is used for rendering the image according to the rendering text.
Optionally, summing the training losses corresponding to the training samples respectively to obtain a total loss corresponding to the image editing model; and adjusting parameters of the image editing model with the aim of minimizing total loss to obtain the trained image editing model.
For example, the SGD (stochastic gradient descent) method is adopted: the total loss is propagated back through the image editing model to obtain the gradient of each model parameter, and each model parameter is updated in the direction of gradient descent, thereby adjusting the parameters of the image editing model.
In one example, the image editing model includes an image editing network, where the image editing network is used to generate a predicted rendered image, and in a case where parameters corresponding to the image editing model except parameters of the image editing network are all constructed by using a pre-trained neural network, parameters of the image editing network may be adjusted with a goal of minimizing total loss, to obtain a trained image editing model; the parameters except the parameters of the image editing network corresponding to the image editing model are kept unchanged, so that the training amount of the image editing model can be further reduced, and the convergence efficiency of the image editing model is improved.
Referring to fig. 1-4, parameters of the first noise identification network and the second noise identification network need not be adjusted when parameters of the image editing network are adjusted.
In one example, the image editing network further comprises a U-Net, and when parameters except the parameters of the U-Net corresponding to the image editing model are all constructed by a pre-trained neural network, the parameters of the U-Net can be adjusted with the aim of minimizing total loss to obtain a trained image editing model; the parameters except the parameters of the U-Net corresponding to the image editing model are kept unchanged, so that the training amount of the image editing model can be further reduced, and the convergence efficiency of the image editing model is improved.
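A minimal sketch of this training setup: only the editing-network (U-Net) parameters are updated while the other parameters stay frozen, and the SGD method mentioned above drives the update. The tiny modules below are placeholders for illustration, not the actual architecture.

```python
import torch
import torch.nn as nn

unet = nn.Conv2d(4, 4, 3, padding=1)            # stands in for the Denoising U-Net
frozen_encoder = nn.Conv2d(3, 4, 3, padding=1)  # stands in for the pre-trained encoder
for p in frozen_encoder.parameters():
    p.requires_grad = False                     # parameters kept unchanged

optimizer = torch.optim.SGD(unet.parameters(), lr=4e-4)

# one parameter update driven by the total loss
latent = frozen_encoder(torch.rand(2, 3, 32, 32))
total_loss = unet(latent).pow(2).mean()         # placeholder for the summed training losses
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
print(total_loss.item())
```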
Optionally, after training on all n×batch training samples is completed, one iteration ends; the n×batch samples continue to be used for the next iteration of the image editing model until the number of iterations meets the threshold, at which point training of the image editing model stops and the trained image editing model is obtained.
The trained image editing model can be deployed in a model using device to provide an image rendering service, and a specific using method will be described in detail below, which will not be repeated here.
Optionally, before the first batch of training of the first iteration, the parameters of the image editing network are initialized: the image editing network, together with its corresponding backward noise adding network, forward denoising network, self-encoding network and the like, adopts open-source pre-trained model parameters (such as the parameters of Instruct Pix2Pix). The initial learning rate is 0.0004; after every 5 rounds of iterative learning, the learning rate is changed to 0.1 times the original, and there are 10 rounds of iterative learning in total. The learning rate and the number of iterations can be set and adjusted according to actual use requirements, which is not limited by the embodiment of the application.
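The learning-rate schedule described above (initial rate 0.0004, multiplied by 0.1 after every 5 rounds, 10 rounds in total) can be sketched as follows; using PyTorch's StepLR is an assumption, and only the numbers come from the text.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                                    # placeholder for the image editing network
opt = torch.optim.SGD(model.parameters(), lr=0.0004)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.1)

for round_idx in range(10):
    print(round_idx + 1, opt.param_groups[0]["lr"])        # 0.0004 for rounds 1-5, 0.00004 for rounds 6-10
    # ... train on all batches for this round ...
    opt.step()                                             # placeholder parameter update
    sched.step()
```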
In summary, according to the technical solution provided in the embodiments of the present application, according to the noise recognition result used to indicate whether there is a difference between the sample original image and the sample rendered image, the training loss of the image editing model is constructed based on the sample rendered image, the sample original image and the prediction rendered image, instead of directly constructing the training loss of the image editing model based on the difference between the sample rendered image and the prediction rendered image, which can effectively avoid the noise that may exist between the sample rendered image and the sample original image, affect the image editing model, and introduce the sample original image to monitor, so as to effectively improve the image rendering accuracy of the image editing model.
In addition, by adopting the first noise recognition network based on the multi-level splicing features and the second noise recognition network based on the differential image, the noise recognition result is obtained from the image texture details as well as the deep and shallow features, which improves the accuracy of the noise recognition result; further, the training loss is constructed according to the accurate noise recognition result, which prevents the image editing model from learning wrong knowledge and improves the image rendering accuracy of the image editing model.
In addition, according to the noise recognition result, it is judged whether the sample original image and the sample rendered image constitute a noisy sample: when they are a noisy sample, the training sample is discarded; when they may be a noisy sample, the training loss is constructed from partial pixel points; and when they are not a noisy sample, the training loss is constructed from all pixel points. On the one hand, this can reduce the influence of noisy samples on the whole model, thereby improving the image rendering accuracy of the image editing model; on the other hand, it ensures that the image editing model has a certain image rendering capability and that the output predicted rendered image keeps a certain consistency with the sample original image, thereby improving the image rendering capability of the image editing model.
Fig. 9 is a flowchart illustrating an image rendering method according to another embodiment of the application. The subject of execution of the steps of the method may be the model-using device described above. The method may include the following steps (901-905).
Step 901, an input image is acquired, and an input text, a random noise image and an all 0 image corresponding to the input image are acquired.
The input image refers to the image on which image rendering is to be performed. The input text is used to indicate how the input image is to be rendered. The input image and the input text can be selected and set according to actual use requirements. The pixel value of each pixel point in the input image can be directly assigned to 0 to obtain the all-0 image corresponding to the input image. The random noise image may be a noise map generated from a random seed i.
Optionally, the input image, the all 0 image, and the random noise image are the same scale. In one example, the dimensions of the input image, all 0 images, and random noise images are the same as the dimensions of the sample original image described above.
Step 902, generating a predictive rendered image corresponding to the input image based on the input image, and the random noise image and the all 0 images corresponding to the input image under the constraint of the input text through an image editing network in the image editing model.
The image editing model in the embodiment of the present application is trained, for example, the image editing model may refer to the image editing model that is trained in the above embodiment.
Illustratively, the acquisition process of the predicted rendered image corresponding to the input image may be as follows:
1. and splicing the input image and the random noise image to obtain a first spliced image.
For example, a first stitched image is obtained by stitching a random noise image directly after an input image.
2. And splicing the all 0 images and the random noise images to obtain a second spliced image.
For example, the random noise image is stitched directly after the all 0 image, resulting in the second stitched image.
3. And generating a predictive rendering image corresponding to the input image based on the two first spliced images and the second spliced image under the constraint of the input text through the image editing network.
The image editing network takes an input text as KV information and two first spliced images and two second spliced images as input, renders the input images, and obtains predicted rendered images corresponding to the input images. The specific implementation process of the image editing network is the same as that of the above embodiment, and will not be described here again.
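A minimal sketch of the inference-time input construction, again assuming channel-axis concatenation for stitching; the seeded standard-normal noise and the function name are assumptions.

```python
import numpy as np

def build_inference_inputs(input_image: np.ndarray, seed: int = 0):
    rng = np.random.default_rng(seed)
    random_noise = rng.standard_normal(input_image.shape)          # noise map from seed i
    all_zero = np.zeros_like(input_image)                          # all-0 image, same scale
    first = np.concatenate([input_image, random_noise], axis=-1)   # input image + random noise image
    second = np.concatenate([all_zero, random_noise], axis=-1)     # all-0 image + random noise image
    return [first, first, second]   # two first stitched images and one second stitched image

inputs = build_inference_inputs(np.random.rand(64, 64, 3))
print([x.shape for x in inputs])
```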
Step 903, obtaining a first noise sub-recognition result between the input image and the predicted rendered image corresponding to the input image through a first noise recognition network in the image editing model.
Optionally, the first noise sub-recognition result is used to indicate whether noise is present between the input image and the predicted rendered image corresponding to the input image.
Illustratively, the process of obtaining the first noise sub-identification result may be as follows:
1. and acquiring multi-level splicing characteristics corresponding to the input image and the predicted rendered image respectively through a first noise identification network, wherein the multi-level splicing characteristics comprise shallow layer characteristics and deep layer characteristics of the image.
Optionally, feature extraction is performed on the codes of the input images through a first feature extraction network in the first noise recognition network to obtain multi-level splicing features corresponding to the input images, and feature extraction is performed on the codes of the prediction rendering images through a second feature extraction network in the first noise recognition network to obtain multi-level splicing features corresponding to the prediction rendering images. The implementation process of the first noise identification network may refer to the above embodiment, and will not be described herein.
2. And acquiring a first noise sub-recognition result according to the similarity between the multi-level stitching characteristic corresponding to the input image and the multi-level stitching characteristic corresponding to the predicted rendering image through the first noise recognition network.
Through the prediction layer in the first noise recognition network, the formula 1 - similarity(x1, x2) is adopted to calculate the first noise sub-recognition result according to the similarity between the multi-level splicing feature corresponding to the input image and the multi-level splicing feature corresponding to the predicted rendered image.
Step 904, obtaining a second noise sub-recognition result between the input image and the predicted rendered image corresponding to the input image through a second noise recognition network in the image editing model; wherein the first noise identification network and the second noise identification network are different.
The second noise sub-recognition result is also used to indicate whether noise exists between the input image and the predicted rendered image corresponding to the input image. Optionally, the second noise-recognition network comprises a feature extraction layer and a prediction layer.
Illustratively, the second noise sub-identification result may be obtained as follows:
1. a difference image between the input image and the predictive rendered image is acquired.
And for each pixel point in the input image, subtracting the pixel value of the pixel point from the pixel value of the pixel point at the corresponding position in the prediction rendering image, so as to obtain a difference image between the input image and the prediction rendering image.
2. And acquiring a second noise sub-recognition result according to the differential image through a second noise recognition network.
And extracting the characteristics of the differential image through a characteristic extraction network in the second noise identification network to obtain the characteristic representation of the differential image, and calculating the characteristic representation of the differential image through a prediction layer in the second noise identification network to obtain a second noise sub-identification result.
Step 905, screening the predicted rendered image corresponding to the input image according to the first noise sub-recognition result and the second noise sub-recognition result to obtain a screening result, where the screening result is used to indicate whether the predicted rendered image corresponding to the input image is qualified or not.
Whether the predicted rendered image is qualified refers to whether noise exists between the predicted rendered image and the input image: if no noise exists between the predicted rendered image and the input image, the predicted rendered image can be determined to be qualified; if noise exists between the predicted rendered image and the input image, the predicted rendered image can be determined to be unqualified.
Optionally, in the case that the predicted rendered image is qualified, the model-using device may provide the predicted rendered image to the user. In the case that the predicted rendered image is not qualified, the model-using device does not provide the predicted rendered image to the user, and displays a prompt message to prompt the user to replace the input text or the input image. For example, in the case that the predicted rendered image is not qualified, the model-using device displays the prompt message "unable to generate, please provide other input text" to prompt the user to replace the input text, which is beneficial to improving the image editing model's drawing rate.
In one example, the screening result may be obtained as follows:
1. in the case that the presence of the noise sub-recognition result in the first noise sub-recognition result and the second noise sub-recognition result indicates that there is a difference between the input image and the predictive rendered image in addition to the content related to the input text, it is determined that the predictive rendered image is not qualified.
That is, in a case where a noise sub-recognition result indicates that noise exists between the input image and the predicted rendered image (e.g., including both cases where noise may exist and where noise exists), it is possible to determine that the predicted rendered image is not acceptable.
2. And determining that the predictive rendered image is qualified in the case that the first noise sub-recognition result and the second noise sub-recognition result both indicate that there is no difference between the input image and the predictive rendered image except for the content related to the input text.
That is, the first noise sub-recognition result and the second noise sub-recognition result indicate that the predicted rendered image is qualified only when no noise exists between the input image and the predicted rendered image, so that the predicted rendered image with low quality can be effectively filtered, and the image editing model drawing rate can be improved.
In some examples, the prediction rendered image may also be determined to be acceptable in the case where the presence of the noise sub-recognition result in the first noise sub-recognition result and the second noise sub-recognition result indicates that no noise exists between the input image and the prediction rendered image, which is not limited by the embodiment of the present application.
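The screening rule can be sketched as follows, under the assumption that each noise sub-recognition result has already been reduced to a boolean flag (True = noise detected between the input image and the predicted rendered image); the function name is hypothetical.

```python
def is_predicted_image_qualified(first_noise: bool, second_noise: bool) -> bool:
    # Qualified only when neither noise sub-recognition result indicates noise.
    return not (first_noise or second_noise)

# usage: unqualified results trigger a prompt to replace the input text or input image
print(is_predicted_image_qualified(False, False))  # True  -> provide the predicted rendered image
print(is_predicted_image_qualified(True, False))   # False -> display the prompt message instead
```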
Optionally, under the condition of the image editing model obtained by training by adopting the technical scheme provided by the embodiment of the application, the prediction rendering image can be directly provided for the user without screening because the image rendering effect is better.
In some embodiments, reference is made to fig. 10, which shows a schematic diagram of a comparison of rendering effects between the related art provided by one embodiment of the present application and the present application.
The first line in fig. 10 is a predicted rendering diagram obtained by rendering an original 1001 using a related technique: original under a sunset atmosphere, original under a bright tone, and original under a dark tone.
The second line in fig. 10 is a predicted rendering diagram obtained by rendering the original image 1001 by adopting the technical scheme provided by the application: original under a sunset atmosphere, original under a bright tone, and original under a dark tone.
Comparing the first line with the second line, it is clear that the sunset-atmosphere effect of the present application is much softer than that of the related art. The upper right corner of the original image 1001 contains a light source (not shown in the figure); the related art renders a strong bright spot in that corner and the whole image is overly yellow, whereas in a real sunset atmosphere the sunlight should not be so glaring even as the scene turns yellow. The effects of the present application in the bright tone and the dark tone are also clearer and better than those of the related art.
In summary, according to the technical scheme provided by the embodiment of the application, the quality of the predicted rendered image provided for the user can be improved by screening the predicted rendered image according to the first noise sub-recognition result and the second noise sub-recognition result, so that the image editing model drawing rate can be improved.
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
Referring to fig. 11, a block diagram of an image rendering apparatus according to an embodiment of the present application is shown. The device can be used for realizing the image rendering method. The apparatus 1100 may include: a training sample acquisition module 1101, a noise result acquisition module 1102, a predicted image generation module 1103, a training loss acquisition module 1104, and an editing model training module 1105.
A training sample obtaining module 1101, configured to obtain a plurality of training samples, where the training samples include a sample original image, a rendered text, a random noise image, an all-0 image corresponding to the sample original image, and a sample rendered image generated based on the sample original image under the constraint of the rendered text.
A noise result obtaining module 1102, configured to obtain, for each training sample, a noise identification result between the sample original image and the sample rendered image, where the noise identification result is used to indicate whether there is a difference between the sample original image and the sample rendered image, except for content related to the rendered text.
The predicted image generation module 1103 is configured to generate, according to an image editing model, a predicted rendered image corresponding to the training sample according to the sample original image, the rendered text, the random noise image, the all 0 image, and the sample rendered image.
The training loss obtaining module 1104 is configured to process the sample rendered image, the sample original image, and the predicted rendered image according to the noise recognition result, so as to obtain a training loss corresponding to the sample original image, where the training loss is used to represent the rendering capability of the image editing model on the image.
The editing model training module 1105 is configured to adjust parameters of the image editing model according to training losses corresponding to the training samples, so as to obtain a trained image editing model, where the trained image editing model is used for rendering an image according to the rendering text.
In some embodiments, the image editing model includes a first noise identification network and a second noise identification network; the noise result obtaining module 1102 is configured to:
acquiring multi-level splicing features corresponding to the sample original image and the sample rendering image respectively through the first noise identification network, wherein the multi-level splicing features comprise shallow features and deep features of the image;
acquiring a first noise sub-recognition result through the first noise recognition network, wherein the first noise sub-recognition result is a noise recognition result obtained according to the similarity between the multi-level stitching characteristic corresponding to the sample original image and the multi-level stitching characteristic corresponding to the sample rendered image;
acquiring a differential image between the sample original image and the sample rendered image;
acquiring a second noise sub-recognition result through the second noise recognition network, wherein the second noise sub-recognition result is a noise recognition result obtained according to the differential image;
and acquiring the noise identification result according to the first noise identification result and the second noise identification result.
In some embodiments, the first noise identification network comprises a first feature extraction network and a second feature extraction network, each comprising m feature extraction layers, m being an integer greater than 1; the noise result obtaining module 1102 is further configured to:
Inputting the original sample image into the first feature extraction network, and obtaining a first feature sequence according to the output features from the 2 nd feature extraction layer to the m th feature extraction layer of the first feature extraction network;
performing dimension alignment on each output feature in the first feature sequence to obtain an aligned first feature sequence;
splicing all the characteristics in the aligned first characteristic sequence to obtain multi-level splicing characteristics corresponding to the original image of the sample;
inputting the sample rendering image into the second feature extraction network, and obtaining a second feature sequence according to output features from a 2 nd feature extraction layer to an m th feature extraction layer of the second feature extraction network;
performing dimension alignment on each output feature in the second feature sequence to obtain an aligned second feature sequence;
and splicing all the characteristics in the aligned second characteristic sequence to obtain multi-level splicing characteristics corresponding to the sample rendering image.
In some embodiments, the training loss acquisition module 1104 is configured to:
assigning a training loss corresponding to the sample original image to 0 when the first noise sub-recognition result and the second noise sub-recognition result both indicate that there is a difference between the sample original image and the sample rendered image, except for content related to the rendered text;
Or, in the case that one and only one of the first and second noise sub-recognition results indicate that there is a difference between the sample original image and the sample rendered image except for the content related to the rendered text, randomly extracting a first number of pixel points from the sample rendered image; determining a first sub-loss according to the value of the first number of pixel points and the value of the pixel points of the corresponding positions of the first number of pixel points in the prediction rendering image; randomly extracting a second number of pixel points except the first number of pixel points from the sample original image; determining a second sub-loss according to the value of the second number of pixels and the value of the pixel of the corresponding position of the second number of pixels in the predictive rendered image; obtaining training loss corresponding to the original image of the sample according to the first sub-loss and the second sub-loss;
or under the condition that the first noise sub-recognition result and the second noise sub-recognition result indicate that no difference exists between the sample original image and the sample rendered image except for the content related to the rendered text, obtaining training loss corresponding to the sample original image according to the value of each pixel point in the sample rendered image and the value of each pixel point in the prediction rendered image.
In some embodiments, the editing model training module 1105 is configured to:
summing the training losses corresponding to the training samples respectively to obtain total losses corresponding to the image editing model;
and adjusting parameters of the image editing model with the aim of minimizing the total loss to obtain the trained image editing model.
In some embodiments, the image editing model comprises an image editing network for generating the predictive rendered image; the editing model training module 1105 is further configured to adjust parameters of the image editing network with the objective of minimizing the total loss, to obtain the trained image editing model; wherein parameters corresponding to the image editing model other than parameters of the image editing network remain unchanged.
In some embodiments, the prediction image generation module 1103 is configured to:
splicing the sample original image and the random noise image to obtain a first spliced image;
performing pixel value superposition on the random noise image and the sample rendering image to obtain an intermediate image;
splicing the intermediate image and the sample original image to obtain a second spliced image;
Splicing the intermediate image and the all 0 images to obtain a third spliced image;
and generating a prediction rendering image corresponding to the training sample according to the first spliced image, the second spliced image and the third spliced image through the image editing model.
In some embodiments, the predicted image generation module 1103 is further configured to:
carrying out T times of backward noise addition on the first spliced image, the second spliced image and the third spliced image through the image editing model to obtain hidden space features of the first spliced image, the second spliced image and the third spliced image at T moments, wherein the hidden space features at the T moments have noise features, and T is a positive integer;
performing T times of forward denoising on hidden space features of the first spliced image, the second spliced image and the third spliced image at T moments respectively according to the rendering text through the image editing model to obtain denoised hidden space features corresponding to the original sample image;
and decoding the denoised hidden space features corresponding to the training samples to generate a prediction rendering image corresponding to the training samples.
In some embodiments, the predicted image generation module 1103 is further configured to:
for n-th forward denoising, acquiring a first difference value between the denoising hidden space characteristic of the first spliced image at the T-n time and the denoising hidden space characteristic of the second spliced image at the T-n time, wherein n is a positive integer;
acquiring a second difference value between the denoising hidden space characteristic of the second spliced image at the T-n time and the denoising hidden space characteristic of the third spliced image at the T-n time;
carrying out weighted summation on the denoising hidden space feature of the third spliced image at the T-n moment, the first difference value and the second difference value to obtain the hidden space feature of the intermediate image at the T-n moment;
replacing hidden space features of the intermediate image under the T-n time, which correspond to the intermediate image, in the denoising hidden space features of the first spliced image, the second spliced image and the third spliced image under the T-n time respectively to obtain n+1th forward denoising input;
according to the n+1th forward denoising input, denoising hidden space features of the first spliced image, the second spliced image and the third spliced image at the time of T-n-1 respectively are obtained;
Wherein the weight parameters corresponding to the weighted summation are fixed; the denoised hidden space feature at time T is the hidden space feature at time T; and the hidden space feature of the intermediate image at time 0 is the denoised hidden space feature corresponding to the sample original image.
In summary, according to the technical solution provided in the embodiments of the present application, according to the noise recognition result used to indicate whether there is a difference between the sample original image and the sample rendered image, the training loss of the image editing model is constructed based on the sample rendered image, the sample original image and the prediction rendered image, instead of directly constructing the training loss of the image editing model based on the difference between the sample rendered image and the prediction rendered image, which can effectively avoid the noise that may exist between the sample rendered image and the sample original image, affect the image editing model, and introduce the sample original image to monitor, so as to effectively improve the image rendering accuracy of the image editing model.
Referring to fig. 12, a block diagram of an image rendering apparatus according to another embodiment of the present application is shown. The device can be used for realizing the image rendering method. The apparatus 1200 may include: an input image acquisition module 1201, a predicted image generation module 1202, a noise result acquisition module 1203, and a screening result acquisition module 1204.
An input image acquisition module 1201 is configured to acquire an input image, and an input text, a random noise image, and an all 0 image corresponding to the input image.
A predicted image generation module 1202, configured to generate, under the constraint of the input text, a predicted rendered image corresponding to the input image based on the input image, and a random noise image and an all-0 image corresponding to the input image, through an image editing network in an image editing model.
The noise result obtaining module 1203 is configured to obtain, through a first noise recognition network in the image editing model, a first noise sub-recognition result between the input image and a predicted rendered image corresponding to the input image.
The noise result obtaining module 1203 is further configured to obtain, through a second noise recognition network in the image editing model, a second noise sub-recognition result between the input image and the predicted rendered image corresponding to the input image; wherein the first noise identification network and the second noise identification network are different.
And a screening result obtaining module 1204, configured to screen the predicted rendered image corresponding to the input image according to the first noise sub-recognition result and the second noise sub-recognition result, to obtain a screening result, where the screening result is used to indicate whether the predicted rendered image corresponding to the input image is qualified.
In some embodiments, the noise result obtaining module 1203 is further configured to:
acquiring multi-level splicing features corresponding to the input image and the predicted rendered image respectively through the first noise identification network, wherein the multi-level splicing features comprise shallow features and deep features of the image;
and acquiring the first noise sub-recognition result according to the similarity between the multi-level stitching characteristic corresponding to the input image and the multi-level stitching characteristic corresponding to the predicted rendering image through the first noise recognition network.
In some embodiments, the noise result obtaining module 1203 is further configured to:
acquiring a difference image between the input image and the predicted rendered image;
and acquiring a second noise sub-recognition result according to the differential image through the second noise recognition network.
In some embodiments, the screening result obtaining module 1204 is configured to:
determining that the predicted rendered image is not acceptable in the case where the presence of a noise sub-recognition result in the first noise sub-recognition result and the second noise sub-recognition result indicates that there is a difference between the input image and the predicted rendered image in addition to content related to the input text;
Alternatively, the prediction-rendered image is determined to be acceptable in the case where the first noise sub-recognition result and the second noise sub-recognition result both indicate that there is no difference between the input image and the prediction-rendered image other than the content related to the input text.
In some embodiments, the predictive image generation module 1202 is configured to:
splicing the input image and the random noise image to obtain a first spliced image;
splicing the all 0 images and the random noise images to obtain a second spliced image;
and generating a predictive rendering image corresponding to the input image based on the two first spliced images and the second spliced image under the constraint of the input text through the image editing network.
In summary, according to the technical solution provided in the embodiments of the present application, according to the noise recognition result used to indicate whether there is a difference between the sample original image and the sample rendered image, the training loss of the image editing model is constructed based on the sample rendered image, the sample original image and the prediction rendered image, instead of directly constructing the training loss of the image editing model based on the difference between the sample rendered image and the prediction rendered image, which can effectively avoid the noise that may exist between the sample rendered image and the sample original image, affect the image editing model, and introduce the sample original image to monitor, so as to effectively improve the image rendering accuracy of the image editing model.
It should be noted that, in the apparatus provided in the foregoing embodiment, when implementing the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Referring to fig. 13, a schematic structural diagram of a computer device according to an embodiment of the application is shown. The computer device may be any electronic device having data computing, processing and storage functions that may be implemented as model training device 10 or model using device 20 in the implementation environment of the solution shown in fig. 1. Specifically, the following may be included.
The computer apparatus 1300 includes a central processing unit (such as a CPU (Central Processing Unit, central processing unit), a GPU (Graphics Processing Unit, graphics processor), an FPGA (Field Programmable Gate Array ), etc.) 1301, a system Memory 1304 including a RAM (Random-Access Memory) 1302 and a ROM (Read-Only Memory) 1303, and a system bus 1305 connecting the system Memory 1304 and the central processing unit 1301. The computer device 1300 also includes a basic input/output system (Input Output System, I/O system) 1306 to facilitate the transfer of information between the various devices within the server, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
In some embodiments, the basic input/output system 1306 includes a display 1308 for displaying information, and an input device 1309, such as a mouse, keyboard, or the like, for a user to input information. Wherein the display 1308 and the input device 1309 are connected to the central processing unit 1301 through an input/output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a keyboard, mouse, or electronic stylus, among a plurality of other devices. Similarly, the input/output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, the computer readable medium may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc, high density digital video disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the ones described above. The system memory 1304 and mass storage device 1307 described above may be referred to collectively as memory.
The computer device 1300 may also be connected, through a network such as the Internet, to a remote computer on the network for operation, according to embodiments of the present application. That is, the computer device 1300 may be connected to the network 1312 through a network interface unit 1311 connected to the system bus 1305, or the network interface unit 1311 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further stores a computer program, and the computer program is configured to be executed by one or more processors to implement the image rendering method described above.
In some embodiments, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the above-described image rendering method.
Alternatively, the computer-readable storage medium may include a ROM (Read-Only Memory), a RAM (Random-Access Memory), an SSD (Solid State Drive), an optical disc, or the like. The random-access memory may include a ReRAM (Resistive Random-Access Memory) and a DRAM (Dynamic Random-Access Memory).
In some embodiments, a computer program product is also provided, the computer program product comprising a computer program stored in a computer readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device executes the image rendering method described above.
It should be noted that, in the embodiments of the present application, before and while collecting user-related data, a prompt interface or pop-up window may be displayed, or voice prompt information may be output, to inform the user that the relevant data is currently being collected. The step of obtaining the user-related data is performed only after the user's confirmation of the prompt interface or pop-up window is obtained; otherwise (i.e., when no confirmation of the prompt interface or pop-up window is obtained by the user), the step ends and the user-related data is not obtained. In other words, all user data collected by the method is processed in strict accordance with the requirements of the relevant national laws and regulations: informed consent or separate consent of the personal-information subject is obtained, and the subsequent use and processing of the data are carried out within the scope of the laws, regulations and the authorization of the personal-information subject; the collection, use and processing of the relevant user data must comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the training samples (e.g., the sample original images and sample rendered images), the input images, and the like referred to in the present application are all obtained with sufficient authorization.
It should be understood that "a plurality of" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, the step numbers described herein merely illustrate one possible execution order of the steps; in some other embodiments, the steps may be executed out of the numbered order, for example two differently numbered steps may be executed simultaneously or in the reverse of the order shown, which is not limited in the embodiments of the present application.
The foregoing description of the exemplary embodiments of the application is not intended to limit the application to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the application.

Claims (19)

1. An image rendering method, the method comprising:
acquiring a plurality of training samples, wherein each training sample comprises a sample original image, a rendered text, a random noise image, an all 0 image corresponding to the sample original image, and a sample rendered image generated based on the sample original image under the constraint of the rendered text;
For each training sample, acquiring a noise identification result between the sample original image and the sample rendered image, wherein the noise identification result is used for indicating whether a difference exists between the sample original image and the sample rendered image except for content related to the rendered text;
generating a prediction rendering image corresponding to the training sample according to the sample original image, the rendered text, the random noise image, the all 0 image and the sample rendered image through an image editing model;
according to the noise identification result, processing the sample rendering image, the sample original image and the prediction rendering image to obtain training loss corresponding to the sample original image, wherein the training loss is used for representing the rendering capacity of the image editing model on the image;
and adjusting parameters of the image editing model according to the training loss corresponding to each training sample respectively to obtain a trained image editing model, wherein the trained image editing model is used for rendering the image according to the rendering text.
2. The method of claim 1, wherein the image editing model comprises a first noise identification network and a second noise identification network;
the obtaining of the noise identification result between the sample original image and the sample rendered image comprises:
acquiring, through the first noise identification network, multi-level splicing features respectively corresponding to the sample original image and the sample rendered image, wherein the multi-level splicing features comprise shallow features and deep features of an image;
acquiring a first noise sub-recognition result through the first noise identification network, wherein the first noise sub-recognition result is a noise recognition result obtained according to a similarity between the multi-level splicing features corresponding to the sample original image and the multi-level splicing features corresponding to the sample rendered image;
acquiring a differential image between the sample original image and the sample rendered image;
acquiring a second noise sub-recognition result through the second noise identification network, wherein the second noise sub-recognition result is a noise recognition result obtained according to the differential image;
and acquiring the noise identification result according to the first noise sub-recognition result and the second noise sub-recognition result.
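As a non-limiting illustration of claim 2, the following Python sketch shows one way the two noise sub-recognition results could be produced and fused; the cosine-similarity threshold, the binary classifier over the differential image, and the OR-style fusion rule are assumptions made for readability, not part of the claimed method.

    # Illustrative sketch only (not part of the claims).
    import torch
    import torch.nn.functional as F

    def noise_identification(orig_feat, rend_feat, diff_classifier, diff_image,
                             sim_threshold=0.9):
        """Return (first_sub_result, second_sub_result, overall_result); True means a
        difference beyond the rendered-text content (i.e. noise) is suspected."""
        # First sub-result: similarity between the two multi-level splicing features.
        sim = F.cosine_similarity(orig_feat.flatten(1), rend_feat.flatten(1), dim=1)
        first_noisy = sim < sim_threshold            # low similarity -> noise suspected

        # Second sub-result: a binary classifier applied to the differential image.
        logits = diff_classifier(diff_image)         # assumed output shape (B, 2)
        second_noisy = logits.argmax(dim=1).bool()   # class 1 -> noise suspected

        # Overall result: assumed fusion rule, flag noise if either branch flags it.
        return first_noisy, second_noisy, first_noisy | second_noisy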
3. The method of claim 2, wherein the first noise identification network comprises a first feature extraction network and a second feature extraction network, each comprising m feature extraction layers, m being an integer greater than 1;
the obtaining, through the first noise identification network, the multi-level splicing features respectively corresponding to the sample original image and the sample rendered image comprises:
inputting the sample original image into the first feature extraction network, and obtaining a first feature sequence according to the output features of the 2nd feature extraction layer to the m-th feature extraction layer of the first feature extraction network;
performing dimension alignment on each output feature in the first feature sequence to obtain an aligned first feature sequence;
splicing the features in the aligned first feature sequence to obtain the multi-level splicing features corresponding to the sample original image;
inputting the sample rendered image into the second feature extraction network, and obtaining a second feature sequence according to the output features of the 2nd feature extraction layer to the m-th feature extraction layer of the second feature extraction network;
performing dimension alignment on each output feature in the second feature sequence to obtain an aligned second feature sequence;
and splicing the features in the aligned second feature sequence to obtain the multi-level splicing features corresponding to the sample rendered image.
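To make the feature construction of claim 3 concrete, a minimal sketch follows; the backbone layers are generic convolutional stages, and global average pooling is assumed as the dimension-alignment step (the claim does not fix a particular alignment operation).

    # Illustrative sketch only; assumes 4-D convolutional feature maps.
    import torch
    import torch.nn as nn

    class MultiLevelSplicingFeature(nn.Module):
        def __init__(self, layers: nn.ModuleList):
            super().__init__()
            self.layers = layers                          # the m feature extraction layers

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            aligned = []
            for i, layer in enumerate(self.layers):
                x = layer(x)
                if i >= 1:                                # keep outputs of layers 2..m
                    aligned.append(x.mean(dim=(2, 3)))    # align each map to shape (B, C_i)
            return torch.cat(aligned, dim=1)              # splice shallow and deep features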
4. The method of claim 2, wherein the processing the sample rendered image, the sample original image and the prediction rendering image according to the noise identification result, to obtain the training loss corresponding to the sample original image, comprises:
setting the training loss corresponding to the sample original image to 0 in the case that the first noise sub-recognition result and the second noise sub-recognition result both indicate that there is a difference, other than content related to the rendered text, between the sample original image and the sample rendered image;
or alternatively,
in the case that one and only one of the first noise sub-recognition result and the second noise sub-recognition result indicates that there is a difference, other than content related to the rendered text, between the sample original image and the sample rendered image: randomly extracting a first number of pixel points from the sample rendered image; determining a first sub-loss according to the values of the first number of pixel points and the values of the pixel points at the corresponding positions in the prediction rendering image; randomly extracting, from the sample original image, a second number of pixel points other than the first number of pixel points; determining a second sub-loss according to the values of the second number of pixel points and the values of the pixel points at the corresponding positions in the prediction rendering image; and obtaining the training loss corresponding to the sample original image according to the first sub-loss and the second sub-loss;
or alternatively,
and in the case that the first noise sub-recognition result and the second noise sub-recognition result both indicate that there is no difference, other than content related to the rendered text, between the sample original image and the sample rendered image, obtaining the training loss corresponding to the sample original image according to the values of the pixel points in the sample rendered image and the values of the pixel points in the prediction rendering image.
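A sketch of the three branches of claim 4 follows; the numbers of sampled pixel points and the use of mean squared error are illustrative assumptions.

    # Illustrative sketch only. first_noisy / second_noisy are Python booleans taken
    # from the two noise sub-recognition results.
    import torch
    import torch.nn.functional as F

    def per_sample_training_loss(sample_orig, sample_rend, pred_rend,
                                 first_noisy, second_noisy,
                                 n_first=1024, n_second=1024):
        if first_noisy and second_noisy:
            # Both branches report extra noise: drop the sample (zero loss).
            return torch.zeros((), device=pred_rend.device)

        if first_noisy or second_noisy:
            # Exactly one branch reports noise: supervise part of the pixels with the
            # sample rendered image and the rest with the sample original image.
            flat_rend = sample_rend.flatten(1)
            flat_orig = sample_orig.flatten(1)
            flat_pred = pred_rend.flatten(1)
            idx = torch.randperm(flat_rend.shape[1])
            first_idx = idx[:n_first]
            second_idx = idx[n_first:n_first + n_second]
            first_sub_loss = F.mse_loss(flat_pred[:, first_idx], flat_rend[:, first_idx])
            second_sub_loss = F.mse_loss(flat_pred[:, second_idx], flat_orig[:, second_idx])
            return first_sub_loss + second_sub_loss

        # Neither branch reports noise: full supervision with the sample rendered image.
        return F.mse_loss(pred_rend, sample_rend)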
5. The method according to claim 1, wherein the adjusting the parameters of the image editing model according to the training loss corresponding to each training sample to obtain a trained image editing model includes:
summing the training losses corresponding to the training samples respectively to obtain total losses corresponding to the image editing model;
and adjusting parameters of the image editing model with the aim of minimizing the total loss to obtain the trained image editing model.
6. The method of claim 5, wherein the image editing model comprises an image editing network for generating the prediction rendering image;
The step of adjusting parameters of the image editing model with the aim of minimizing the total loss to obtain the trained image editing model comprises the following steps:
adjusting parameters of the image editing network with the aim of minimizing the total loss to obtain the trained image editing model;
wherein parameters corresponding to the image editing model other than parameters of the image editing network remain unchanged.
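Claims 5 and 6 amount to summing the per-sample losses and updating only the image editing network while the remaining parameters stay frozen; a minimal sketch is given below, in which the `editing_network` attribute, the optimizer choice and the learning rate are assumptions.

    # Illustrative sketch only: only the editing-network parameters are optimized.
    import torch

    def build_optimizer(image_editing_model, lr=1e-5):
        for p in image_editing_model.parameters():
            p.requires_grad_(False)                        # freeze everything ...
        for p in image_editing_model.editing_network.parameters():
            p.requires_grad_(True)                         # ... except the editing network
        return torch.optim.AdamW(image_editing_model.editing_network.parameters(), lr=lr)

    def train_step(per_sample_losses, optimizer):
        total_loss = torch.stack(per_sample_losses).sum()  # total loss over the batch
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        return total_loss.item()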
7. The method of claim 1, wherein the generating a prediction rendering image corresponding to the training sample according to the sample original image, the rendered text, the random noise image, the all 0 image and the sample rendered image through the image editing model comprises:
splicing the sample original image and the random noise image to obtain a first spliced image;
performing pixel value superposition on the random noise image and the sample rendered image to obtain an intermediate image;
splicing the intermediate image and the sample original image to obtain a second spliced image;
splicing the intermediate image and the all 0 images to obtain a third spliced image;
and generating the prediction rendering image corresponding to the training sample according to the first spliced image, the second spliced image and the third spliced image through the image editing model.
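For readability, the following sketch spells out the three spliced inputs of claim 7, assuming that "splicing" is channel-wise concatenation and that "pixel value superposition" is element-wise addition; both are assumptions, not a definitive reading of the claim.

    # Illustrative sketch only.
    import torch

    def build_spliced_inputs(sample_orig, sample_rend, noise_img, zeros_img):
        first = torch.cat([sample_orig, noise_img], dim=1)       # first spliced image
        intermediate = sample_rend + noise_img                   # pixel value superposition
        second = torch.cat([intermediate, sample_orig], dim=1)   # second spliced image
        third = torch.cat([intermediate, zeros_img], dim=1)      # third spliced image
        return first, second, third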
8. The method of claim 7, wherein the generating the prediction rendering image corresponding to the training sample according to the first spliced image, the second spliced image and the third spliced image through the image editing model comprises:
carrying out T times of backward noise addition on the first spliced image, the second spliced image and the third spliced image respectively through the image editing model, to obtain hidden space features of the first spliced image, the second spliced image and the third spliced image at time T, wherein the hidden space features at time T carry noise, and T is a positive integer;
performing T times of forward denoising on the hidden space features of the first spliced image, the second spliced image and the third spliced image at time T respectively, according to the rendered text and through the image editing model, to obtain denoised hidden space features corresponding to the sample original image;
and decoding the denoised hidden space features corresponding to the sample original image, to generate the prediction rendering image corresponding to the training sample.
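The backward-noising / forward-denoising loop of claim 8 can be sketched as follows; the closed-form DDPM-style noising and the placeholder callables (encode, denoise_step, decode) are assumptions standing in for the model's own components, and the combination performed inside each denoising step is shown after claim 9.

    # Illustrative sketch only.
    import torch

    def add_noise(z0, t, alphas_cumprod):
        """Closed-form noising of a clean latent z0 up to step t (assumed schedule)."""
        noise = torch.randn_like(z0)
        a_bar = alphas_cumprod[t]
        return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

    def predict_rendering(encode, denoise_step, decode, spliced_images, rendered_text,
                          T, alphas_cumprod):
        # Backward noise addition: hidden space features at time T.
        latents = [add_noise(encode(x), T - 1, alphas_cumprod) for x in spliced_images]
        # Forward denoising: T steps, from time T back to time 0.
        for n in range(1, T + 1):
            latents = denoise_step(latents, rendered_text, t=T - n + 1)
        # Decode the denoised features corresponding to the sample original image.
        return decode(latents)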
9. The method of claim 8, wherein the performing, according to the rendered text and through the image editing model, T times of forward denoising on the hidden space features of the first spliced image, the second spliced image and the third spliced image at time T, to obtain the denoised hidden space features corresponding to the sample original image, comprises:
for the n-th forward denoising, acquiring a first difference between the denoised hidden space feature of the first spliced image at time T-n and the denoised hidden space feature of the second spliced image at time T-n, wherein n is a positive integer;
acquiring a second difference between the denoised hidden space feature of the second spliced image at time T-n and the denoised hidden space feature of the third spliced image at time T-n;
carrying out weighted summation on the denoised hidden space feature of the third spliced image at time T-n, the first difference and the second difference, to obtain a hidden space feature of the intermediate image at time T-n;
replacing the hidden space features corresponding to the intermediate image in the denoised hidden space features of the first spliced image, the second spliced image and the third spliced image at time T-n with the hidden space feature of the intermediate image at time T-n, to obtain an input of the (n+1)-th forward denoising;
obtaining, according to the input of the (n+1)-th forward denoising, the denoised hidden space features of the first spliced image, the second spliced image and the third spliced image at time T-n-1 respectively;
wherein the weighting parameters corresponding to the weighted summation are fixed, the denoised hidden space features at time T are the hidden space features at time T, and the hidden space feature of the intermediate image at time 0 is the denoised hidden space features corresponding to the sample original image.
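The weighted combination in claim 9 plays the role of a guidance step; a minimal sketch is below, with placeholder weights (the claim only requires the weighting parameters to be fixed).

    # Illustrative sketch only; w0, w1, w2 are fixed placeholder weights.
    def combine_latents(z1, z2, z3, w0=1.0, w1=1.5, w2=7.5):
        """z1, z2, z3: denoised hidden space features of the three spliced images at time T-n."""
        first_diff = z1 - z2                                          # first difference
        second_diff = z2 - z3                                         # second difference
        z_intermediate = w0 * z3 + w1 * first_diff + w2 * second_diff # weighted summation
        # z_intermediate then replaces the slots corresponding to the intermediate image
        # in the three latents, forming the input of the (n+1)-th forward denoising.
        return z_intermediate

Read this way, w1 and w2 behave like guidance scales blending image-conditioned and text-conditioned predictions, in the spirit of classifier-free-guidance-style samplers; that interpretation is an assumption rather than a statement of the claimed method.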
10. An image rendering method, the method comprising:
acquiring an input image, and an input text, a random noise image and an all-0 image corresponding to the input image;
generating, through an image editing network in an image editing model, a predicted rendered image corresponding to the input image based on the input image, the random noise image and the all-0 image corresponding to the input image, under the constraint of the input text;
acquiring a first noise sub-recognition result between the input image and a predicted rendered image corresponding to the input image through a first noise recognition network in the image editing model;
acquiring a second noise sub-recognition result between the input image and a predicted rendered image corresponding to the input image through a second noise recognition network in the image editing model; wherein the first noise identification network and the second noise identification network are different;
and screening the predicted rendered image corresponding to the input image according to the first noise sub-recognition result and the second noise sub-recognition result, to obtain a screening result, wherein the screening result is used for indicating whether the predicted rendered image corresponding to the input image is qualified.
11. The method of claim 10, wherein the obtaining, by the first noise recognition network in the image editing model, a first noise sub-recognition result between the input image and a predicted rendered image corresponding to the input image, comprises:
acquiring, through the first noise recognition network, multi-level splicing features respectively corresponding to the input image and the predicted rendered image, wherein the multi-level splicing features comprise shallow features and deep features of an image;
and acquiring, through the first noise recognition network, the first noise sub-recognition result according to a similarity between the multi-level splicing features corresponding to the input image and the multi-level splicing features corresponding to the predicted rendered image.
12. The method of claim 10, wherein the obtaining, by the second noise recognition network in the image editing model, a second noise sub-recognition result between the input image and the predicted rendered image corresponding to the input image, comprises:
acquiring a differential image between the input image and the predicted rendered image;
and acquiring a second noise sub-recognition result according to the differential image through the second noise recognition network.
13. The method of claim 10, wherein the screening the predicted rendered image corresponding to the input image according to the first noise sub-recognition result and the second noise sub-recognition result to obtain the screening result comprises:
determining that the predicted rendered image is unqualified in the case that at least one of the first noise sub-recognition result and the second noise sub-recognition result indicates that there is a difference, other than content related to the input text, between the input image and the predicted rendered image;
or alternatively,
determining that the predicted rendered image is qualified in the case that the first noise sub-recognition result and the second noise sub-recognition result both indicate that there is no difference, other than content related to the input text, between the input image and the predicted rendered image.
14. The method of claim 10, wherein the generating, through the image editing network in the image editing model, the predicted rendered image corresponding to the input image based on the input image, the random noise image and the all-0 image corresponding to the input image, under the constraint of the input text, comprises:
splicing the input image and the random noise image to obtain a first spliced image;
splicing the all-0 image and the random noise image to obtain a second spliced image;
and generating, through the image editing network and under the constraint of the input text, the predicted rendered image corresponding to the input image based on two copies of the first spliced image and the second spliced image.
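Putting claims 10 to 14 together, the inference path can be sketched as follows; the component names and the boolean outputs of the two noise recognition networks are assumptions made for illustration.

    # Illustrative sketch only. editing_network, first_noise_net and second_noise_net
    # are placeholder callables; the noise nets are assumed to return Python booleans
    # (True = extra noise detected).
    import torch

    def render_and_screen(editing_network, first_noise_net, second_noise_net,
                          input_image, input_text):
        noise_img = torch.randn_like(input_image)                # random noise image
        zeros_img = torch.zeros_like(input_image)                # all-0 image
        first = torch.cat([input_image, noise_img], dim=1)       # first spliced image
        second = torch.cat([zeros_img, noise_img], dim=1)        # second spliced image
        # Two copies of the first spliced image plus the second one, as recited in claim 14.
        pred = editing_network([first, first, second], input_text)
        first_noisy = first_noise_net(input_image, pred)         # feature-similarity branch
        second_noisy = second_noise_net(input_image - pred)      # differential-image branch
        qualified = not (first_noisy or second_noisy)            # claim 13's screening rule
        return pred, qualified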
15. An image rendering apparatus, the apparatus comprising:
a training sample acquisition module, configured to acquire a plurality of training samples, where the training samples include a sample original image, a rendered text, a random noise image, an all-0 image corresponding to the sample original image, and a sample rendered image generated based on the sample original image under the constraint of the rendered text;
a noise result obtaining module, configured to obtain, for each of the training samples, a noise recognition result between the sample original image and the sample rendered image, where the noise recognition result is used to indicate whether there is a difference between the sample original image and the sample rendered image, except for content related to the rendered text;
The prediction image generation module is used for generating a prediction rendering image corresponding to the training sample according to the sample original image, the rendering text, the random noise image, the all 0 image and the sample rendering image through an image editing model;
the training loss acquisition module is used for processing the sample rendering image, the sample original image and the prediction rendering image according to the noise identification result to obtain training loss corresponding to the sample original image, wherein the training loss is used for representing the rendering capacity of the image editing model on the image;
the editing model training module is used for adjusting parameters of the image editing model according to the training loss corresponding to each training sample to obtain a trained image editing model, and the trained image editing model is used for rendering the image according to the rendering text.
16. An image rendering apparatus, the apparatus comprising:
the input image acquisition module is used for acquiring an input image, and an input text, a random noise image and an all-0 image corresponding to the input image;
the prediction image generation module is used for generating a prediction rendering image corresponding to the input image based on the input image, a random noise image corresponding to the input image and an all-0 image under the constraint of the input text through an image editing network in an image editing model;
The noise result acquisition module is used for acquiring a first noise sub-recognition result between the input image and the predicted rendering image corresponding to the input image through a first noise recognition network in the image editing model;
the noise result obtaining module is further configured to obtain a second noise sub-recognition result between the input image and the predicted rendered image corresponding to the input image through a second noise recognition network in the image editing model; wherein the first noise identification network and the second noise identification network are different;
the screening result obtaining module is used for screening the prediction rendering image corresponding to the input image according to the first noise sub-recognition result and the second noise sub-recognition result to obtain a screening result, and the screening result is used for indicating whether the prediction rendering image corresponding to the input image is qualified or not.
17. A computer device comprising a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the image rendering method of any one of claims 1 to 9 or to implement the image rendering method of any one of claims 10 to 14.
18. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, which is loaded and executed by a processor to implement the image rendering method of any one of claims 1 to 9 or to implement the image rendering method of any one of claims 10 to 14.
19. A computer program product, characterized in that it comprises a computer program stored in a computer readable storage medium, from which a processor reads and executes the computer program to implement the image rendering method according to any one of claims 1 to 9 or to implement the image rendering method according to any one of claims 10 to 14.
CN202310923451.1A 2023-07-25 2023-07-25 Image rendering method, device, equipment and storage medium Pending CN116957921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310923451.1A CN116957921A (en) 2023-07-25 2023-07-25 Image rendering method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116957921A (en) 2023-10-27

Family

ID=88447249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310923451.1A Pending CN116957921A (en) 2023-07-25 2023-07-25 Image rendering method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116957921A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117575894A (en) * 2024-01-16 2024-02-20 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and computer readable storage medium
CN117575894B (en) * 2024-01-16 2024-04-30 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN113240580B (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN111783705B (en) Character recognition method and system based on attention mechanism
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN111161200A (en) Human body posture migration method based on attention mechanism
CN113780149A (en) Method for efficiently extracting building target of remote sensing image based on attention mechanism
CN113657388A (en) Image semantic segmentation method fusing image super-resolution reconstruction
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN111626134B (en) Dense crowd counting method, system and terminal based on hidden density distribution
Chen et al. MICU: Image super-resolution via multi-level information compensation and U-net
CN110648331A (en) Detection method for medical image segmentation, medical image segmentation method and device
CN112580458A (en) Facial expression recognition method, device, equipment and storage medium
CN116957921A (en) Image rendering method, device, equipment and storage medium
CN112070040A (en) Text line detection method for video subtitles
CN114638768B (en) Image rain removing method, system and equipment based on dynamic association learning network
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
CN114676776A (en) Fine-grained image classification method based on Transformer
CN109492610A (en) A kind of pedestrian recognition methods, device and readable storage medium storing program for executing again
CN116310339A (en) Remote sensing image segmentation method based on matrix decomposition enhanced global features
CN115577768A (en) Semi-supervised model training method and device
Zhang et al. Dense haze removal based on dynamic collaborative inference learning for remote sensing images
CN114550014A (en) Road segmentation method and computer device
CN114529793A (en) Depth image restoration system and method based on gating cycle feature fusion
CN116958324A (en) Training method, device, equipment and storage medium of image generation model
CN116385265B (en) Training method and device for image super-resolution network
CN116975347A (en) Image generation model training method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40099934

Country of ref document: HK