WO2024017093A1 - Image generation method, model training method, related apparatus, and electronic device - Google Patents
Image generation method, model training method, related apparatus, and electronic device
- Publication number
- WO2024017093A1 (PCT/CN2023/106800; CN2023106800W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- feature
- target
- feature vector
- style
- Prior art date
Links
- 238000012549 training Methods 0.000 title claims abstract description 149
- 238000000034 method Methods 0.000 title claims abstract description 113
- 239000013598 vector Substances 0.000 claims abstract description 415
- 238000012545 processing Methods 0.000 claims abstract description 92
- 238000010276 construction Methods 0.000 claims abstract description 36
- 230000007246 mechanism Effects 0.000 claims description 56
- 238000000605 extraction Methods 0.000 claims description 45
- 230000015572 biosynthetic process Effects 0.000 claims description 12
- 238000003786 synthesis reaction Methods 0.000 claims description 12
- 238000013473 artificial intelligence Methods 0.000 abstract description 3
- 230000008569 process Effects 0.000 description 30
- 238000010586 diagram Methods 0.000 description 21
- 230000006870 function Effects 0.000 description 20
- 238000006243 chemical reaction Methods 0.000 description 14
- 230000008859 change Effects 0.000 description 13
- 238000011176 pooling Methods 0.000 description 8
- 238000013136 deep learning model Methods 0.000 description 6
- 230000000694 effects Effects 0.000 description 6
- 230000004913 activation Effects 0.000 description 5
- 238000004891 communication Methods 0.000 description 5
- 238000012935 Averaging Methods 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 238000001308 synthesis method Methods 0.000 description 2
- 241000699670 Mus sp. Species 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000007599 discharging Methods 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 230000035772 mutation Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/04—Context-preserving transformations, e.g. by using an importance map
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- This application belongs to the field of artificial intelligence technology, and specifically relates to an image generation method, a model training method, related devices and electronic equipment.
- Cycle Generative Adversarial Network (CycleGAN) models can be used to convert image styles, generating images whose style differs from that of the input image.
- For example, the CycleGAN model can be used to convert a summer-style landscape image into a winter-style landscape image.
- When the CycleGAN model is used to convert image styles, however, the quality of the generated images is relatively poor.
- the purpose of the embodiments of the present application is to provide an image generation method, a model training method, related devices and electronic equipment, which can solve the problem of relatively poor quality of generated images when using related models to convert image styles.
- embodiments of the present application provide an image generation method, which method includes:
- a splicing operation is performed on the first feature vector and the second feature vector to obtain a first target feature vector.
- the second feature vector is determined based on the second image of the second style.
- the second feature vector is used to characterize the image style of the second image;
- Image construction is performed based on the first target feature vector to obtain a third image.
- embodiments of the present application provide a model training method, which method includes:
- the training sample data includes a first sample image, and a fourth feature vector used to characterize the first sample style
- the target model training is completed when a first preset condition is met, and the first preset condition includes: the first network loss value is less than a first preset threshold, and/or the number of training iterations of the target model is greater than a second preset threshold.
- an image generation device which includes:
- the first acquisition module is used to acquire the first image whose image style is the first style, and the second image whose image style is the second style;
- a first feature processing module configured to perform first feature processing on the first image based on the target model to obtain a first feature vector, where the first feature vector is used to characterize the image content of the first image;
- a feature splicing module configured to splice the first feature vector and the second feature vector to obtain a first target feature vector, where the second feature vector is determined based on the second image of the second style and is used to characterize the image style of the second image;
- An image construction module configured to perform image construction based on the first target feature vector to obtain a third image.
- a model training device which includes:
- the third acquisition module is used to acquire training sample data, where the training sample data includes a first sample image and a fourth feature vector used to characterize the first sample style;
- a first feature processing module configured to perform first feature processing on the first sample image to obtain a fifth feature vector, where the fifth feature vector is used to characterize the image content of the first sample image;
- a feature splicing module configured to splice the fifth feature vector and the fourth feature vector to obtain a second target feature vector
- An image construction module configured to perform image construction based on the second target feature vector to obtain a first output image
- a first determination module configured to determine a first network loss value of the target model based on the first output image and the fifth feature vector;
- a first update module configured to update the network parameters of the target model based on the first network loss value
- the target model training is completed when a first preset condition is met, and the first preset condition includes: the first network loss value is less than a first preset threshold, and/or the number of training iterations of the target model is greater than a second preset threshold.
- inventions of the present application provide an electronic device.
- the electronic device includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor.
- when the program or instructions are executed by the processor, the steps of the image generation method described in the first aspect, or the steps of the model training method described in the second aspect, are implemented.
- embodiments of the present application provide a readable storage medium.
- Programs or instructions are stored on the readable storage medium.
- When the programs or instructions are executed by a processor, the steps of the image generation method described in the first aspect, or the steps of the model training method described in the second aspect, are implemented.
- embodiments of the present application provide a chip.
- the chip includes a processor and a communication interface.
- the communication interface is coupled to the processor.
- the processor is used to run programs or instructions to implement the steps of the image generation method described in the first aspect, or the steps of the model training method described in the second aspect.
- In the embodiments of the present application, a first image whose image style is a first style and a second image whose image style is a second style are obtained; first feature processing is performed on the first image based on the target model to obtain a first feature vector, where the first feature vector is used to characterize the image content of the first image; a splicing operation is performed on the first feature vector and the second feature vector to obtain a first target feature vector, where the second feature vector is determined based on the second image of the second style and is used to characterize the image style of the second image; and image construction is performed based on the first target feature vector to obtain a third image.
- In this way, the image style can be converted from the first style to the second style based on the target model, and the image content of the generated third image can be kept the same as the image content of the input first image, thereby improving the quality of the generated image.
- Figure 1 is a flow chart of an image generation method provided by an embodiment of the present application.
- Figure 2 is a schematic structural diagram of the convolution module
- FIG. 3 is a schematic structural diagram of CBAM
- Figure 4 is a schematic structural diagram of an example of the first model
- Figure 5 is a schematic structural diagram of an example target model
- Figure 6 is a schematic structural diagram of the DeConvBlock module
- Figure 7 is a schematic structural diagram of the ResBlock module
- Figure 8 is a flow chart of the image synthesis method provided by the embodiment of the present application.
- Figure 9 is a schematic diagram of brightness adjustment of the sixth image
- Figure 10 is a flow chart of the model training method provided by the embodiment of the present application.
- Figure 11 is a structural diagram of an image generation device provided by an embodiment of the present application.
- Figure 12 is a structural diagram of a model training device provided by an embodiment of the present application.
- Figure 13 is a structural diagram of an electronic device provided by an embodiment of the present application.
- Figure 14 is a schematic diagram of the hardware structure of an electronic device that implements an embodiment of the present application.
- The terms "first", "second", etc. in the description and claims of this application are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments of the present application can be practiced in orders other than those illustrated or described herein. Objects distinguished by "first", "second", etc. are usually of one type, and the number of such objects is not limited; for example, the first object may be one or more than one.
- "And/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates that the related objects are in an "or" relationship.
- Figure 1 is a flow chart of an image generation method provided by an embodiment of the present application. As shown in Figure 1, it includes the following steps:
- Step 101 Obtain a first image whose image style is a first style, and a second image whose image style is a second style.
- the first image can be any image, such as a portrait image, a landscape image, etc., and the first style can be used to represent the time corresponding to the first image.
- the second image can also be any image, such as a portrait image, a landscape image, etc., and the second style can be used to represent the time corresponding to the second image.
- the first image may be a landscape image
- the first style may be a time of four seasons, such as spring time, or a time of day and night, such as sunrise time.
- the second style can be the same as the first style, or it can be different.
- the second style may be different from the first style, so that the first image can be converted into an image of another style, thereby realizing image style conversion.
- the number of second images may be one, two or even multiple, and is not specifically limited here.
- the number of second styles may also be one, two or even multiple, and is not specifically limited here.
- the first image may be acquired in a variety of ways.
- a pre-stored image may be acquired as the first image, the first image may be captured in real time by a camera, or an image sent by other electronic devices may be received as the first image.
- the second image may be acquired in a variety of ways, and the acquiring method may be similar to the first image, which will not be described again here.
- the acquisition timing of the first image may be before, at the same time or after the acquisition timing of the second image.
- the acquisition timing of the first image may be after the acquisition timing of the second image.
- the second image may be acquired first, and then the second feature vector that can characterize the image style of the second image is extracted.
- the first image is acquired, and image generation is performed based on the first image and the second feature vector.
- the second feature vector can be reused for different images to perform image style conversion, thereby improving the efficiency of image generation.
- the second feature vector can be associated with the style information of the second style; accordingly, by obtaining the style information of the second style, the second feature vector determined based on the second image can be obtained.
- The style information can be a time mapping mode, and the time mapping mode can represent the second style; for example, the time mapping mode can include a four-season change mode, a day-and-night change mode, etc., and correspondingly the second style can include spring time, summer time, etc.
- Step 102 Perform first feature processing on the first image based on the target model to obtain a first feature vector, where the first feature vector is used to characterize the image content of the first image.
- a target model may be used to perform image generation based on the first image, and the target model may be used to generate an image that has the same image content as the first image and whose image style is the second style.
- the target model may include a first model, which may be called an encoder.
- the encoder may separate the content of the image and encode it to obtain a feature vector used to characterize the image content of the image.
- the encoder can perform first feature processing on the first image to obtain a first feature vector.
- the first feature processing may include feature extraction to extract a first feature vector that can characterize the image content of the first image.
- Step 103 Perform a splicing operation on the first feature vector and the second feature vector to obtain a first target feature vector.
- the second feature vector is determined based on the second image of the second style.
- the second feature vector is used to characterize the image style of the second image.
- the second feature vector is used to characterize the image style of the second image, and the second feature vector is determined based on the second image of the second style.
- the second feature vector may be a third feature vector, or may be obtained by averaging multiple third feature vectors, where the third feature vector may be a feature vector used to characterize the image style of the second image.
- the second feature vector can be obtained by performing second feature processing on the second image based on a deep learning model, and each second feature vector corresponds to one image style; in this way, the second feature vector corresponding to the second style can be obtained based on the second style.
- the deep learning model may be the same as the first model, or may be different from the first model.
- the first feature processing and the second feature processing may be completely different or partially the same.
- the first feature processing and the second feature processing may be partially the same.
- the aforementioned feature extraction may be the same.
- different feature extraction can be performed based on the same feature image to respectively obtain a feature vector used to characterize the image content and a feature vector used to characterize the image style, realizing the decoupling of image content and image style, so that the content feature vector and the style feature vector of an image can be separated by one model.
- the target model may include a splicing module, and the first target feature vector may be obtained by splicing two feature vectors through the splicing module.
- For example, the scale of the first feature vector is (1, 1, 256), i.e. a vector of size 1*256, and the scale of the second feature vector is also (1, 1, 256); the scale of the first target feature vector obtained by splicing is then (1, 1, 512), and subsequent image construction can be performed based on the first target feature vector to generate the corresponding image.
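- As an illustration of this splicing step, the minimal sketch below concatenates a content vector and a style vector of the sizes quoted above; the variable names are placeholders rather than identifiers from this application.

```python
import numpy as np

# Illustrative sketch of the splicing (concatenation) step described above.
# The (1, 1, 256) shapes follow the example in the text.
content_vec = np.random.rand(1, 1, 256)   # first feature vector fc (image content)
style_vec = np.random.rand(1, 1, 256)     # second feature vector fs (image style)

# Splicing along the last dimension yields the first target feature vector.
target_vec = np.concatenate([content_vec, style_vec], axis=-1)
print(target_vec.shape)  # (1, 1, 512)
```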
- the target model needs to be pre-trained before use, and the training process will be described in detail in the following embodiments.
- Step 104 Perform image construction based on the first target feature vector to obtain a third image.
- the third image is an image whose image style is the second style and has the same image content as the first image.
- the target model may also include a second model, which may be called a decoder.
- the decoder can decode the input feature vector to obtain an image whose image content and image style are those represented by the input feature vector. Since the image content represented by the first target feature vector is the image content of the first image and the represented image style is the second style, the image output by the decoder, that is, the third image, has the same image content as the first image, and its image style is the second style.
- the first feature vector is used to characterize the image content of the first image; perform a splicing operation on the first feature vector and the second feature vector to obtain a first target feature vector, and the second feature vector is based on the The second image of the second style is determined, and the second feature vector is used to characterize the image style of the second image; image construction is performed based on the first target feature vector to obtain a third image.
- In this way, the image style can be converted from the first style to the second style based on the target model, and the image content of the generated third image can be kept the same as the image content of the input first image, thereby improving the quality of the generated image.
- performing first feature processing on the first image to obtain a first feature vector includes:
- the first feature processing may include first feature coding and second feature coding.
- the first feature coding is used to extract the first feature image of the first image.
- the first feature image can be an image feature of the first image, which can include color features, texture features, shape features, spatial relationship features, etc. of the first image; the second feature encoding is used to extract, based on the first feature image, a first feature vector used to characterize the image content of the first image.
- In this way, the first feature vector used to characterize the image content of the first image can be extracted, so that the content feature vector of the first image can be separated from the first image.
- performing first feature encoding on the first image to obtain a first feature image of the first image includes:
- Based on the target attention mechanism, the attention vector of the second feature image in the dimension corresponding to the target attention mechanism is extracted.
- The target attention mechanism includes at least one of an attention mechanism in the channel dimension and an attention mechanism in the spatial dimension;
- the first feature encoding includes the feature extraction and the extraction of the attention vector.
- the first feature encoding may include using an attention mechanism to perform feature extraction on the first image to improve the feature expression capability of the network.
- a convolution module can be used to extract features from the first image to obtain a second feature image of the first image.
- the second feature image can also be an image feature of the first image, which can include color features, texture features, shape features, spatial relationship features, etc. of the first image.
- Figure 2 is a schematic structural diagram of the convolution module. As shown in Figure 2, the convolution module consists, in order of connection, of convolution layer 201, batch normalization (Batch Normalization, BN) processing 202, ReLU activation function 203, convolution layer 204, and BN processing 205.
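- As a rough sketch, the module described above could be written as follows; the channel counts, kernel sizes and strides in the example instantiation follow the first network module described later in the text, while the padding values are assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the convolution module (ConvBlock) of Figure 2:
# Conv -> BN -> ReLU -> Conv -> BN.
class ConvBlock(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1)
        self.bn2 = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = self.relu(self.bn1(self.conv1(x)))
        return self.bn2(self.conv2(x))

# e.g. the first network module: 256*256*3 -> 128*128*16 -> 128*128*32
block = ConvBlock(in_ch=3, mid_ch=16, out_ch=32)
print(block(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 32, 128, 128])
```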
- the scales of the second feature image and the first feature image can be the same or different, and are not specifically limited here. In an optional implementation, the scales of the second feature image and the first feature image can be different: different convolution modules connected in series continuously perform feature extraction, and the scale of the feature map can be continuously reduced, thereby fully extracting the image features of the first image.
- the first model may include an attention module, which may adjust image features based on a target attention mechanism to improve the expressive ability of image features.
- the target attention mechanism may include at least one of an attention mechanism in the channel dimension and an attention mechanism in the spatial dimension.
- the attention vector of the second feature image in the dimension corresponding to the attention mechanism can be extracted, and the attention vector can be multiplied with the second feature image to obtain the third feature image.
- the processing of different attention mechanisms can be implemented in series.
- Specifically, in the channel attention mechanism, the channel attention is obtained through a global max pooling operation and a global average pooling operation; the results are then passed through a shared multilayer perceptron (MLP) to obtain the attention vector on each channel, the elements are added, and the attention vector in the channel dimension is obtained through the sigmoid activation function. This attention vector is multiplied with the second feature image to output a feature image.
- In the spatial attention mechanism, based on the feature image output by the channel attention mechanism, an average pooling operation and a max pooling operation are applied along the channel axis, and the results are concatenated to obtain the attention vector in the spatial dimension. This attention vector is multiplied with the feature image output by the channel attention mechanism to obtain the third feature image, where the scale of the third feature image is the same as that of the second feature image.
- the attention module may be a Convolutional Block Attention Module (CBAM) structure.
- Figure 3 is a schematic structural diagram of CBAM. As shown in Figure 3, CBAM can include channel attention mechanism and spatial attention mechanism, and realize the processing of different attention mechanisms in series. The second feature image is input and processed by different attention mechanisms. Finally, the third feature image can be output.
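- The sketch below is a rough, illustrative implementation of this channel-then-spatial attention structure; the reduction ratio and the 7*7 convolution in the spatial branch are common CBAM defaults, not values stated in this application.

```python
import torch
import torch.nn as nn

# Rough sketch of the CBAM-style attention described above: channel attention
# followed by spatial attention, applied in series.
class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))   # global average pooling
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))    # global max pooling
        return x * torch.sigmoid(avg + mx)             # channel attention vector * input

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)       # average pooling along channel axis
        mx, _ = torch.max(x, dim=1, keepdim=True)      # max pooling along channel axis
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                                # spatial attention map * input

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):  # output (third feature image) has the same scale as the input
        return self.sa(self.ca(x))
```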
- the third feature image can be determined as the first feature image.
- another convolution module can be used to continue feature extraction on the third feature image to obtain the first feature image.
- feature extraction of the first image can be achieved through feature extraction, and by using an attention mechanism for feature extraction, the feature expression ability of the network can be improved, thereby improving the accuracy of feature extraction.
- the method further includes:
- the M third feature vectors are averaged to obtain the second feature vector.
- the second feature processing may include third feature extraction and fourth feature extraction.
- the third feature extraction is used to extract a feature image of the second image.
- The feature image can be an image feature of the second image, which can include color features, texture features, shape features, spatial relationship features, etc. of the second image.
- The fourth feature extraction is used to extract, based on the feature image, a third feature vector used to characterize the image style of the second image.
- the third feature extraction method may be the same as the first feature extraction method.
- the first feature extraction and the third feature extraction may be implemented through the same modules.
- the second feature processing can be performed on each second image separately through the deep learning model to obtain M third feature vectors.
- each second image can be input to the deep learning model separately.
- for each input image, the deep learning model can output one third feature vector; executing this M times accordingly, M third feature vectors can be obtained.
- In an optional implementation, both the first feature processing and the second feature processing can be implemented through the first model, and the first feature extraction and the third feature extraction can be implemented by sharing some modules, while the second feature extraction and the fourth feature extraction are different, that is, they can be implemented through different modules. That is to say, when the first model performs feature processing, it can perform feature extraction on the input image to obtain a feature image, and then perform different feature extractions based on the feature image to respectively obtain a feature vector used to characterize the image content and a feature vector used to characterize the image style, realizing the decoupling of image content and image style, so that the content feature vector and the style feature vector of an image can be separated by one model.
- Figure 4 is a schematic structural diagram of an example of the first model.
- the input of the first model can be an RGB image of size 256*256*3, and the output is two vectors of size 1*256: a content feature vector denoted fc and a style feature vector denoted fs.
- the first model can include 7 network modules.
- the details of the 7 network modules are as follows:
- the first network module 401 is the convolution module ConvBlock.
- the internal structure is shown in Figure 2.
- the subsequent convolution modules (such as the second network module 402, the third network module 403, and the fifth network module 405 to the seventh network module 407) can have the same or similar structure as the first network module 401.
- the structure of the first network module 401 is: the first convolution layer Conv is a convolution with a kernel size of 3*3 and a stride of 2, the input image size is 256*256*3, and the output image size is 128*128*16.
- the second convolutional layer is a convolution with a kernel size of 1*1 and a stride of 1.
- the input image size is 128*128*16 and the output image size is 128*128*32.
- the fourth network module 404 is a CBAM structure. Its internal structure is shown in Figure 3. It is used to improve the feature expression ability of the network. Its input image is the feature image output by the third network module 403, as shown in Figure 3. It contains two modules: channel attention mechanism and spatial attention mechanism.
- In the channel attention mechanism, the channel attention vector is obtained through a global max pooling operation and a global average pooling operation; after that, the attention vector on each channel is obtained through a shared MLP, followed by element-wise addition, and the attention vector in the channel dimension is obtained through the sigmoid activation function.
- The attention vector is multiplied with the feature image output by the third network module through the Multiply fusion layer, and a feature image is output.
- In the spatial attention mechanism, based on the feature image output by the channel attention mechanism, an average pooling operation and a max pooling operation are applied along the channel axis, and the results are concatenated to obtain the attention vector in the spatial dimension.
- This attention vector is multiplied with the feature image output by the channel attention mechanism to obtain another feature image.
- the fifth network module 405 is the convolution module.
- the input image size of the first convolutional layer is 32*32*96, and the output image size is 16*16*128.
- the input image size of the second convolutional layer is 16*16*128, and the output image size is 16*16*128.
- the sixth network module 406 is a convolution module, which outputs a content feature vector.
- the input image is the output of the fifth network module 405.
- the output image size is 4*4*32.
- the output is then converted into a one-dimensional vector of 1*256 through a reshape operation.
- the seventh network module 407 is a convolution module, which outputs a style feature vector.
- the input image is also the output of the fifth network module 405, and then the output is converted into a 1*256 one-dimensional vector through the reshape operation.
- the M third eigenvectors can be averaged to obtain the second eigenvector, as shown in the following formula (1): fs_avg = (1/M) * Σ_{i=1..M} fs(i)
- where fs_avg is the second eigenvector and fs(i) is the i-th third eigenvector.
- In the embodiments of the present application, M third feature vectors are obtained by performing second feature processing on each of the second images respectively, where one third feature vector corresponds to one second image and is used to characterize the image style of that second image; the M third feature vectors are then averaged to obtain the second feature vector.
- In this way, the style feature vector can be separated from the second image in advance to obtain the second feature vector used to characterize the second style, and by averaging the third feature vectors corresponding to the plurality of second images, the obtained second feature vector represents the average style of the second style, which improves the representation ability of the style feature vector.
- step 104 specifically includes:
- image construction may include first feature decoding, second feature decoding, and third feature decoding.
- the first feature decoding is used to perform feature decoding on the first target feature vector to obtain a fourth feature image; through the first feature decoding, the feature vector is decoded into a feature image.
- the second feature decoding is used to perform second feature decoding on the fourth feature image to obtain a fifth feature image, where the size of the fifth feature image is the same as the size of the first feature image.
- the operation corresponding to the second feature decoding can correspond to the operation corresponding to the first feature extraction; that is, if feature extraction is implemented through a downsampling operation, feature decoding can be implemented through the corresponding upsampling operation, and the network layer corresponding to the second feature decoding corresponds to the network layer corresponding to the first feature extraction, so that the size of the fifth feature image can be the same as the size of the first feature image.
- the third feature decoding is used to implement feature decoding of the sixth feature image to obtain a third image.
- the sixth feature image is obtained by splicing the first feature image and the fifth feature image. In this way, the loss of image semantic information during network processing can be avoided, and the invariance of image content during image style conversion can be ensured.
- Specifically, the corresponding network layers of the encoder and the decoder are connected, and a concat operation in the channel dimension is used to splice the feature images output by the corresponding layers to obtain the sixth feature image.
- the first feature decoding can include at least one decoding operation.
- the feature decoding of the first target feature vector may be gradually implemented in a cascade form.
- the second feature decoding may also include at least one decoding operation.
- the feature decoding of the fourth feature image may also be gradually implemented in a cascade manner.
- the first feature decoding, the second feature decoding and the third feature decoding all use upsampling operations to expand the scale of the features so that they can be decoded into the third image.
- the scale of the third image can be the same as the scale of the first image, such as 256*256*3 size.
- the decoder in the target model can include at least one branch network; for example, it can include two branch networks, and each branch network can convert the image content to one image style through the above image construction.
- In this way, multi-target style conversion can be implemented through the target model, that is, the input image can be converted into multiple styles to obtain images of multiple styles.
- the decoder includes at least two branch networks
- different branch networks in the decoder can perform style conversion independently.
- style conversion can also be performed collaboratively, so that multi-objective tasks can promote each other and optimize together, which can better meet the performance and effect requirements in temporal image generation.
- the second style includes the first target style and the second target style; performing first feature decoding on the first target feature vector to obtain a fourth feature image includes:
- splicing the seventh feature image and the eighth feature image to obtain a ninth feature image, where the eighth feature image is obtained by performing the first decoding operation on the first target feature vector corresponding to the second target style;
- the decoder can include at least two branch networks, and each branch network can realize the conversion of the image content of the first image to the second style. Taking two branch networks as an example, the number of second styles is 2.
- a first decoding operation can be performed on the first target feature vector corresponding to the first target style through a branch network to obtain a seventh feature image.
- another branch network can be used to perform a first decoding operation on the first target feature vector corresponding to the second target style to obtain an eighth feature image.
- the first decoding operation may include an upsampling operation to achieve feature decoding.
- the seventh feature image and the eighth feature image can be spliced to obtain the ninth feature image.
- the inputs between the corresponding network layers of the two branch networks can be concatenated with each other; since the semantic information decoded by the two decoders from the same content input should be consistent, the interconnected cascade can encourage the two decoders to keep their decoded semantic information similar for the same input, playing a role of joint optimization and thereby improving the accuracy of feature decoding.
- a second decoding operation can be performed on the ninth feature image to obtain a fourth feature image.
- In this way, the first feature decoding of the first target feature vector can be achieved, and the interconnected cascade can encourage the two decoders to keep their decoded semantic information similar for the same content input, playing a role of joint optimization and thereby improving the accuracy of feature decoding.
- Figure 5 is a schematic structural diagram of an example target model.
- the target model may include a first model, which is an encoder 51, and a second model, which is a decoder.
- the second model can include a first decoder 52 and a second decoder 53; the structure of the encoder 51 is shown in Figure 4.
- the structures of the first decoder and the second decoder are the same, but the network weights are different.
- the decoder can include the decoding network DeConvBlock module and the residual network ResBlock module.
- the structural diagram of the DeConvBlock module is shown in Figure 6. Its components are upsampling module, convolution layer, BN processing and Relu activation function. First, an upsampling operation is used to expand the input to twice the size, and the number of channels remains unchanged. Then a convolution operation is used, the kernel size is set to 3*3, stride is 1, and then conventional BN processing and Relu operations are added.
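- A minimal sketch of this DeConvBlock structure is shown below; the nearest-neighbor interpolation mode is an assumption, since the upsampling method is not specified above.

```python
import torch
import torch.nn as nn

# Sketch of the DeConvBlock module of Figure 6: 2x upsampling (channel count
# unchanged), then a 3*3 convolution with stride 1, followed by BN and ReLU.
class DeConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")   # interpolation mode assumed
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(self.up(x))))

print(DeConvBlock(512, 256)(torch.randn(1, 512, 4, 4)).shape)  # torch.Size([1, 256, 8, 8])
```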
- the structural diagram of the ResBlock module is shown in Figure 7. Its components are convolution layer, BN processing, Relu activation function, convolution layer, BN processing, and network layer addition processing.
- the first convolutional layer is a convolution with a kernel size of 3*3 and a stride of 1.
- the output channel is the same as the input. After that, regular BN and Relu operations are added.
- the second convolutional layer has a kernel size of 1*1 and a stride of 1, and the number of channels is the set output channel, followed by a BN operation; the Add process is used to add the input features and the output features of the ResBlock module before outputting.
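- A corresponding sketch of the ResBlock structure is shown below; the output channel count is assumed to equal the input channel count so that the element-wise Add is valid.

```python
import torch
import torch.nn as nn

# Sketch of the ResBlock module of Figure 7: Conv(3*3) -> BN -> ReLU ->
# Conv(1*1) -> BN, then an Add of the block input and output.
class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=1, stride=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return x + y   # Add: residual connection

print(ResBlock(128)(torch.randn(1, 128, 16, 16)).shape)  # torch.Size([1, 128, 16, 16])
```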
- the decoder can include 8 modules.
- the modules arranged at 1, 2, 5, 6, 7, and 8 can be DeConvBlock modules, and the modules arranged at 3 and 4 can be ResBlock modules.
- the input and output sizes of each module are shown in Table 1 below.
- multiple network layers can be included to avoid the loss of image semantic information during network processing, such as the connection between the corresponding network layers of the encoder and the decoder, and the connection between the two decoders.
- For example, the inputs of modules 2 to 4 of the two decoders are interconnected.
- the above target model can be used to generate images representing different time styles for an image such as a landscape image, and multiple generated images can be used for image synthesis to obtain dynamic images or videos that change according to time.
- image synthesis provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios.
- Figure 8 is a flow chart of an image synthesis method provided by an embodiment of the present application. As shown in Figure 8, it includes the following steps:
- Step 801 perform style conversion on the first image through the target model to generate N third images
- Step 802 Obtain a fourth image whose synthesis position is between two target images, where the first pixel information of the fourth image with respect to the first color space is determined based on the second pixel information of the two target images with respect to the first color space, and the two target images are two adjacent images among the N third images;
- Step 803 Based on N pieces of third pixel information of the N third images with respect to the second color space, adjust the fourth pixel information of the fourth image with respect to the second color space to obtain a fifth image;
- Step 804 synthesize N third images and fifth images.
- the purpose of step 801 is to generate the required images representing different time styles based on the target model.
- the user can input a source image, that is, the first image, and time transformation modes corresponding to N second styles, such as four seasons change mode, day and night change mode, etc.
- the target model performs image style conversion based on the input information to obtain the third images.
- the number of second styles corresponding to the time change mode can be set; for example, the number of second styles is 4.
- For the four-season change mode, the four time styles are spring, summer, autumn and winter; for the day-and-night change mode, the four time styles can be set to sunrise, midday, sunset and late night.
- the process of performing style conversion on the first image through the target model to generate N third images is similar to the process of the above image generation method embodiment, and will not be described again here. It should be noted that when the decoder in the target model only includes two branch networks and images of four different time styles need to be output, the target model can perform two image generation operations; that is, through two inferences, the required 4 frames of light images can be obtained. A resize operation can then be used to enlarge the 4 frames of light images to 1080*1080*3.
- In step 802, the time image sequence is expanded by inserting frames, for example from 4 frames to 10 frames; one or more frame images can be added between two adjacent frames, for example, two frames of images can be added between every two adjacent images.
- the first pixel information, with respect to the first color space, of the frame image that needs to be inserted can be calculated from the two adjacent frame images.
- the frame image that needs to be inserted is the fourth image. This method can be suitable for inserting frames of landscape images in which the scene position is not moving.
- the first color space can be the RGB color space.
- the color value of the pixel in the frame image to be inserted can correspond to the weighted sum of the color values of the same pixel position in the previous and later light images.
- ori_1 and ori_2 are two adjacent time images, and mid1 and mid2 are the two frames before and after that need to be inserted.
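- The sketch below illustrates this weighted-sum frame insertion in RGB space. The exact weights of the application's interpolation formulas are not reproduced above, so the 2/3 and 1/3 weights used here are assumptions for illustration only.

```python
import numpy as np

# Illustrative frame-insertion step in RGB space: each inserted frame is a
# pixel-wise weighted blend of the two adjacent time images. The weights are
# assumptions, not the values from the patent's formulas.
def insert_frames(ori_1: np.ndarray, ori_2: np.ndarray):
    mid1 = (2.0 / 3.0) * ori_1 + (1.0 / 3.0) * ori_2   # closer to ori_1
    mid2 = (1.0 / 3.0) * ori_1 + (2.0 / 3.0) * ori_2   # closer to ori_2
    return mid1.astype(ori_1.dtype), mid2.astype(ori_1.dtype)

ori_1 = np.zeros((1080, 1080, 3), dtype=np.float32)
ori_2 = np.ones((1080, 1080, 3), dtype=np.float32) * 255
mid1, mid2 = insert_frames(ori_1, ori_2)
```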
- In step 803, after the 10 frames of light images are obtained, in order to make the synthesized dynamic light images more consistent with real changes in the second color space, such as brightness changes and color changes, the N pieces of third pixel information, with respect to the second color space, of the light images generated by the target model can be used to adjust the fourth pixel information of the frame images to be inserted with respect to the second color space, to obtain the fifth images.
- the second color space may be Lab color space.
- L represents brightness, with a value range of [0,100], from pure black to pure white; a represents the range from red to green, with a value range of [127,-128]; b represents the range from yellow to blue, with a value range of [127,-128].
- N third images and fifth images can be synthesized to obtain dynamic images or videos.
- In the embodiments of the present application, the first image is style-converted through the target model to generate N third images; a fourth image whose synthesis position is between two target images is obtained, where the first pixel information of the fourth image with respect to the first color space is determined based on the second pixel information of the two target images with respect to the first color space, and the two target images are two adjacent images among the N third images; based on N pieces of third pixel information of the N third images with respect to the second color space, the fourth pixel information of the fourth image with respect to the second color space is adjusted to obtain a fifth image; and the N third images and the fifth image are synthesized.
- In this way, the synthesized dynamic light image can be more consistent with real changes in the second color space, such as brightness changes and color changes, improving the effect of image synthesis.
- In an optional implementation, the second color space includes three components, and step 803 specifically includes:
- for at least one of the components, adjusting the pixel value related to the component in the fourth pixel information based on the pixel values related to the component in the N pieces of third pixel information, to obtain a fifth image.
- the second color space may be a Lab color space, and its components may include three components, namely brightness, color component a, and color component b.
- the pixel value of the component can be adjusted, so that each component of the synthesized dynamic light image in the second color space conforms to real changes.
- For a mode in which the brightness does not change over time, such as the four-season change mode, the pixel value of the brightness may not be adjusted.
- In an optional implementation, the three components include a brightness component, and adjusting the pixel value related to the component in the fourth pixel information based on the pixel values related to the component in the N pieces of third pixel information to obtain the fifth image includes:
- obtaining N first brightness values of the N third images with respect to the brightness component, and obtaining a second brightness value of the fourth image with respect to the brightness component based on the pixel values related to the brightness component in the fourth pixel information;
- Specifically, the brightness values of the pixels in each third image can be averaged to obtain the N first brightness values corresponding to the N third images, and the brightness values of the pixels in the fourth image can likewise be averaged to obtain the second brightness value corresponding to the fourth image.
- the image can be converted from the RGB color space to the LAB color space, and the average brightness value of the image can be obtained by averaging the L channels.
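- A minimal sketch of this brightness statistic is shown below, assuming scikit-image for the RGB-to-Lab conversion; the application itself does not name a particular library.

```python
import numpy as np
from skimage import color

# Per-image brightness statistic: convert RGB to Lab and average the L channel
# (range [0, 100]).
def mean_brightness(rgb_image: np.ndarray) -> float:
    lab = color.rgb2lab(rgb_image)      # expects an RGB image, floats in [0, 1]
    return float(lab[..., 0].mean())    # average over the L (lightness) channel

img = np.random.rand(256, 256, 3)       # placeholder image
print(mean_brightness(img))
```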
- the first curve can be fitted using the formula shown in the following equation (4).
- the first curve describes the change of the light intensity, i.e. the brightness value, over time.
- where x is the time and y is the brightness; 6 corresponds to the sunrise time, 12 to midday, 18 to sunset, and 0 to late night.
- the N first brightness values can be used as the y data and the N corresponding time values as the x data, and the least squares method is used to determine the coefficients in the above formula (4), that is, the coefficients k and b.
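- The snippet below shows how such a least-squares fit could be set up. Since formula (4) is not reproduced above, the sinusoidal form y = k*sin(pi*(x-6)/12) + b is an assumption chosen so that brightness peaks at midday (12) and is lowest at late night (0); only the idea of fitting k and b by least squares follows the text.

```python
import numpy as np

# Assumed form of the first curve: y = k * sin(pi * (x - 6) / 12) + b.
# k and b are found by linear least squares from (time, brightness) pairs.
times = np.array([6.0, 12.0, 18.0, 0.0])           # sunrise, midday, sunset, late night
brightness = np.array([55.0, 80.0, 50.0, 20.0])    # example first brightness values (y data)

basis = np.sin(np.pi * (times - 6.0) / 12.0)
A = np.stack([basis, np.ones_like(basis)], axis=1)  # design matrix [sin(...), 1]
(k, b), *_ = np.linalg.lstsq(A, brightness, rcond=None)
print(k, b)
```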
- the second curve can be fitted using the formula shown in the following equation (5).
- the parameters of the second curve, namely a, b and c, can be determined from the three points (0,0), (100,100) and (q,q').
- the brightness value of each pixel point before adjustment in the fourth image can be used as x, and the brightness value after adjustment of each pixel point, that is, the fourth brightness value, can be calculated based on the second curve.
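- A sketch of this adjustment is given below, assuming equation (5) is a quadratic y = a*x^2 + b*x + c (consistent with its three parameters a, b, c and the three constraint points); the values of q and q' are placeholders.

```python
import numpy as np

# Assumed form of the second curve: y = a*x^2 + b*x + c, solved from the three
# points (0, 0), (100, 100) and (q, q'), then applied to every pixel's
# pre-adjustment brightness value.
def fit_second_curve(q: float, q_target: float):
    xs = np.array([0.0, 100.0, q])
    ys = np.array([0.0, 100.0, q_target])
    A = np.stack([xs ** 2, xs, np.ones_like(xs)], axis=1)
    a, b, c = np.linalg.solve(A, ys)
    return a, b, c

def adjust_brightness(L_channel: np.ndarray, a, b, c) -> np.ndarray:
    # x: brightness before adjustment, y: brightness after adjustment
    return a * L_channel ** 2 + b * L_channel + c

a, b, c = fit_second_curve(q=40.0, q_target=55.0)   # q and q' are example values
L_adj = adjust_brightness(np.random.rand(256, 256) * 100.0, a, b, c)
```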
- the brightness adjustment diagram of the fourth image is shown in Figure 9.
- the straight line is the brightness curve of the pixel points in the fourth image before adjustment
- the curve is the brightness curve of the pixel points in the fourth image after adjustment.
- The adjustment of the color components is similar to the adjustment of the brightness channel and will not be described again; the difference is that the formula shown in the following equation (6) is used to fit the first curve.
- Figure 10 is a flow chart of the model training method provided by the embodiment of the present application. As shown in Figure 10, it includes the following steps:
- Step 1001 Obtain training sample data, where the training sample data includes a first sample image and a fourth feature vector used to characterize the first sample style;
- Step 1002 Perform first feature processing on the first sample image to obtain a fifth feature vector.
- the fifth feature vector is used to characterize the image content of the first sample image;
- Step 1003 perform a splicing operation on the fifth feature vector and the fourth feature vector to obtain a second target feature vector
- Step 1004 perform image construction based on the second target feature vector to obtain a first output image
- Step 1005 determine the first network loss value of the target model based on the first output image and the fifth feature vector
- Step 1006 Update the network parameters of the target model based on the first network loss value.
- the target model training is completed when a first preset condition is met, and the first preset condition includes: the first network loss value is less than a first preset threshold, and/or the number of training iterations of the target model is greater than a second preset threshold.
- the training sample data may include at least one first sample image, and at least one fourth feature vector corresponding to the first sample style.
- the first sample image can be any image, such as a landscape image, and its acquisition method can be similar to the first image.
- the fourth feature vector used to characterize the style of the first sample can be obtained through the first model in the target model, and its acquisition method can also be similar to the second feature vector, which will not be described again here.
- the number of fourth feature vectors can be the same as the number of branch networks of the decoder in the target model.
- the number of branch networks of the decoder is 2, that is, two image style conversions can be achieved at the same time, then the number of fourth feature vectors is 2.
- the training sample data may also include K second sample images.
- the K second sample images may be used to train the first model, and K is an integer greater than 2.
- In an optional implementation, the training sample data may also include a third sample image, where the third sample image has the same image content as the first sample image and the image style of the third sample image is the first sample style.
- The third sample image can be combined with the first sample image and the fourth feature vector to adjust the network parameters of the target model.
- the above steps 1002, 1003 and 1004 are the process of image generation based on the target model. Specifically, the first sample image and the fourth feature vector can be input to the target model. The target model can accordingly execute the above steps 1002, 1003 and 1004. Step 1004, wherein the processes of the above-mentioned steps 1002, 1003 and 1004 are similar to the processes of the above-mentioned image generation method embodiment, and will not be described again here.
- a first network loss value of the target model may be determined based on the first output image and the fifth feature vector.
- CE is the cross entropy loss function
- fc(out1) and fc(out2) are the content feature vectors of output image 1 and output image 2
- fc(x) is the content feature vector of the input image
- fs(out1) and fs(out2) are the style feature vectors of output image 1 and output image 2
- Loss1 is the first network loss value.
- the first line of Loss1 is used to ensure that the content of the two generated images is the same and consistent with the input image content.
- the second line is used to ensure that the image style generated by decoder 1 is the same as the input image style.
- the third line is used to ensure that the image style generated by decoder 2 is the same as the input image style.
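- The snippet below sketches a loss consistent with this description. The application's exact formula is not reproduced; CE is implemented here as a soft-target cross entropy between feature vectors, and fs_ref1 / fs_ref2 are placeholder names for the style vectors that decoder 1 and decoder 2 are expected to reproduce.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a loss with the three terms described above.
def ce(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # soft-target cross entropy between two feature vectors
    return -(F.softmax(target, dim=-1) * F.log_softmax(pred, dim=-1)).sum(dim=-1).mean()

def loss1(fc_out1, fc_out2, fc_x, fs_out1, fs_out2, fs_ref1, fs_ref2):
    content_term = ce(fc_out1, fc_x) + ce(fc_out2, fc_x)   # line 1: content consistency
    style_term1 = ce(fs_out1, fs_ref1)                      # line 2: decoder 1 style
    style_term2 = ce(fs_out2, fs_ref2)                      # line 3: decoder 2 style
    return content_term + style_term1 + style_term2
```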
- the network parameters of the target model may be updated based on the first network loss value.
- Specifically, the gradient descent method can be used to update the network parameters of the target model, and the network parameters can be updated iteratively in a loop until the first network loss value is less than the first preset threshold and convergence is reached, and/or the number of training iterations of the target model is greater than the second preset threshold, at which point the training of the target model is completed.
- the first preset threshold and the second preset threshold can be set according to the actual situation; usually the first preset threshold can be set relatively small and the second preset threshold relatively large, so as to ensure sufficient training of the target model and guarantee the training effect.
- the training phase of the target model may only include one phase.
- the third sample image, the first sample image and the fourth feature vector may be used as inputs of the target model.
- When the target model is updated, the network parameters of the first model and the second model are updated simultaneously based on the third sample image, the first output image and the fifth feature vector.
- the training phase of the target model may also include at least two phases.
- the at least two phases may include a first phase and a second phase.
- the second phase is located after the first phase.
- the first phase may be called the pre-training stage, and the second phase may be called the fine-tuning stage.
- When the training stage of the target model is in the first stage, the first sample image and the fourth feature vector can be used as inputs of the target model.
- When the target model is updated, the network parameters of the second model can be updated based on the first output image, the fourth feature vector and the fifth feature vector, while in the first stage the network parameters of the first model are fixed.
- When the training stage of the target model is in the second stage, the third sample image, the first sample image and the fourth feature vector can be used as inputs of the target model; when the target model is updated, the network parameters of the first model and the second model are updated simultaneously based on the third sample image, the first output image and the fifth feature vector, so as to further adjust the network parameters of the target model. In this way, the training method of pre-training combined with fine-tuning can improve the training speed of the target model.
- In the embodiments of the present application, the training sample data includes a first sample image and a fourth feature vector used to characterize the first sample style; first feature processing is performed on the first sample image to obtain a fifth feature vector, which is used to characterize the image content of the first sample image; a splicing operation is performed on the fifth feature vector and the fourth feature vector to obtain a second target feature vector; image construction is performed based on the second target feature vector to obtain a first output image; a first network loss value of the target model is determined based on the first output image and the fifth feature vector; and the network parameters of the target model are updated based on the first network loss value. The target model training is completed when a first preset condition is met, and the first preset condition includes: the first network loss value is less than a first preset threshold, and/or the number of training iterations of the target model is greater than a second preset threshold. In this way, the training of the target model can be achieved, so that the target model can subsequently be used to generate images.
- the target model includes a first model and a second model.
- the first model is used to perform first feature processing on the first sample image to obtain a fifth feature vector.
- the second model is used for: performing a splicing operation on the fifth feature vector and the fourth feature vector to obtain a second target feature vector; and performing image construction based on the second target feature vector to obtain the first output image;
- the training phase of the target model includes a first phase and a second phase, and the second phase is located after the first phase; the step 1006 specifically includes any of the following:
- when the training phase of the target model is in the first stage, the network parameters of the second model are updated based on the first network loss value, wherein the network parameters of the first model are fixed;
- when the training phase of the target model is in the second stage, the network parameters of the first model and the second model are updated based on the first network loss value;
- wherein, when a second preset condition is met, the training phase of the target model is in the first stage, and the second preset condition includes: the first network loss value is greater than or equal to a third preset threshold, and/or the number of training iterations of the target model is less than or equal to a fourth preset threshold, the third preset threshold being greater than the first preset threshold and the fourth preset threshold being less than the second preset threshold.
- The training phase of the target model may also include at least two phases. These at least two phases may include a first stage and a second stage, the second stage being located after the first stage; the first stage may be called the pre-training stage, and the second stage may be called the fine-tuning stage.
- There are three differences between the pre-training stage and the fine-tuning stage. The first point is that the inputs are different: the inputs of the pre-training stage are the first sample image and the fourth feature vector, while the inputs of the fine-tuning stage are the third sample image, the first sample image and the fourth feature vector.
- the second point is that the method of determining the first network loss value is different.
- the method of determining the first network loss value in the pre-training stage is to determine the first network loss value based on the first output image, the fourth feature vector and the fifth feature vector.
- the first network loss value in the fine-tuning stage is determined based on the first output image, the third sample image and the fifth feature vector.
- the third point is that the network parameters of the target model are updated in different ways.
- In the pre-training stage, the network parameters of the first model are fixed and only the network parameters of the second model are updated, while in the fine-tuning stage, the network parameters of the first model and the second model are updated simultaneously.
- In the pre-training stage, the network parameters of the first model can be fixed and, based on the first network loss value, only the network parameters of the second model in the target model are updated, which can simplify the training of the model.
- In the fine-tuning stage, the network parameters of the first model and the second model can be updated simultaneously, so as to further fine-tune the network parameters of the target model on the basis of the pre-training stage.
- When a second preset condition is met, the training stage of the target model is in the first stage. The second preset condition can be set according to the actual situation, and may include that the first network loss value is greater than or equal to the third preset threshold, and/or that the number of training iterations of the target model is less than or equal to the fourth preset threshold. Both the third preset threshold and the fourth preset threshold can be set according to actual conditions; the third preset threshold is greater than the first preset threshold, and the fourth preset threshold is less than the second preset threshold.
- the ratio of the number of iterations in the pre-training phase to the number of iterations in the fine-tuning phase during the training process may be 10:1, and the second preset threshold and the fourth preset threshold may be set according to the ratio of the number of iterations.
- Correspondingly, when the second preset condition is no longer met, the training phase may naturally progress from the pre-training phase to the fine-tuning phase.
- the first model can be trained first before training the target model.
- the training sample data also includes K second sample images; the K second sample images include sample images with the same image content but different image styles, and sample images with the same image style but different image content, where K is an integer greater than 2; before step 1006, the method further includes:
- Target feature processing is performed on the K second sample images based on the first model to obtain K sixth feature vectors and K seventh feature vectors.
- The sixth feature vector is used to characterize the image content of the second sample image, the seventh feature vector is used to characterize the image style of the second sample image, and the target feature processing includes the first feature processing;
- a second network loss value of the first model is determined based on the K sixth feature vectors and the K seventh feature vectors;
- the network parameters of the first model are updated based on the second network loss value, wherein, when the second network loss value is less than a fifth preset threshold, the training of the first model is completed.
- the K second sample images may be paired data, that is, paired sample images with the same image content but different image styles, and paired sample images with the same image style but different image content.
- the CycleGAN model can be adopted to generate paired sample images.
- Target feature processing may include first feature processing and second feature processing.
- Each second sample image may be input into the first model for target feature processing, to obtain the content feature vector of each second sample image, i.e. the sixth feature vector, and the style feature vector, i.e. the seventh feature vector.
- the structure of the first model can be shown in Figure 4.
- GT_c is an image containing the same image content as I, but a different image style
- GT_s is an image containing different image content but the same image style as I.
- The content feature vector output by the first model, i.e. the encoder, is denoted fc(x), and the style feature vector is denoted fs(x).
- the loss function used by the first model during the training process is shown in the following equation (8).
- Loss2 = k*CE(fc(I), fc(GT_c)) - CE(fs(I), fs(GT_c)) + k*CE(fs(I), fs(GT_s)) - CE(fc(I), fc(GT_s))   (8)
- In the above equation (8), k = 100,
- CE is the cross-entropy loss function
- Loss2 is the second network loss value.
- This loss function encourages images with the same image content to be encoded into similar content feature vectors by the encoder, and images with the same image style to be encoded into similar style feature vectors, while the content feature vectors of two images with different image contents differ significantly and the style feature vectors of two images with different image styles differ significantly.
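- As a hedged illustration of equation (8), the sketch below computes Loss2 from the encoder's content and style vectors. Treating CE between two feature vectors as cross-entropy against the soft-maxed counterpart is an assumption, since the text does not spell this detail out:
```python
import torch
import torch.nn.functional as F

def ce(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between two feature vectors, with softmax(q) as the target distribution."""
    return -(F.softmax(q, dim=-1) * F.log_softmax(p, dim=-1)).sum(dim=-1).mean()

def loss2(fc_i, fs_i, fc_gt_c, fs_gt_c, fc_gt_s, fs_gt_s, k: float = 100.0) -> torch.Tensor:
    # Same content (I vs GT_c): pull content vectors together, push style vectors apart.
    # Same style (I vs GT_s): pull style vectors together, push content vectors apart.
    return (k * ce(fc_i, fc_gt_c) - ce(fs_i, fs_gt_c)
            + k * ce(fs_i, fs_gt_s) - ce(fc_i, fc_gt_s))
```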
- the network parameters of the first model are updated based on the second network loss value.
- the updating method is similar to the method of updating the network parameters of the target model based on the first network loss value, which will not be described again here.
- the fifth preset threshold can be set according to the actual situation, and is usually set relatively small, and is not specifically limited here.
- the first model can be trained in advance, and after the training is completed, the first model can assist in training the target model, which can simplify the model training process.
- When the training phase of the target model is in the first stage, step 1005 specifically includes:
- in the case where the training of the first model has been completed, performing target feature processing on the first output image based on the first model to obtain an eighth feature vector and a ninth feature vector, where the eighth feature vector is used to characterize the image content of the first output image and the ninth feature vector is used to characterize the image style of the first output image;
- comparing the eighth feature vector with the fifth feature vector to determine a first loss value, and comparing the ninth feature vector with the fourth feature vector to obtain a second loss value;
- aggregating the first loss value and the second loss value to obtain the first network loss value.
- After the training of the first model is completed, it can assist the training of the target model. Specifically, target feature processing can be performed on the first output image based on the first model to obtain the content feature vector of the first output image, i.e. the eighth feature vector, and the style feature vector, i.e. the ninth feature vector.
- the loss function shown in the above equation (7) can be used to determine the first network loss value.
- On the one hand, the invariance constraint on image content ensures that the contents of the two generated images are the same and consistent with the content of the input image; on the other hand, the invariance constraint on image style ensures that the image style generated by the decoder is the same as the input image style.
- When the training phase of the target model is in the second stage, step 1005 specifically includes:
- a first network loss value of the target model is determined based on the first output image, the fifth feature vector and the third sample image.
- the first output images are out1 and out2 respectively, the first sample image is x, and the third sample image is denoted gt.
- the loss function shown in the following formula (9) can be used to determine the first network loss value based on the first output image, the third sample image and the fifth feature vector.
- L1 represents the mean absolute error function.
- the first line of Loss3 is used to prompt the image generated by the target model to be the same as the image gt.
- The second line ensures that the content of the generated image is the same as the content of the image gt and the same as that of the input image x, and the third line ensures that the style of the generated image is the same as the style of the image gt.
- In this way, by adjusting the network parameters of the first model and the second model in the fine-tuning stage, the accuracy of model training can be improved.
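- Formula (9) itself is not reproduced in the text above, so the following is only a sketch consistent with the three lines just described; the exact terms and weights are assumptions:
```python
import torch
import torch.nn.functional as F

def ce(p, q):
    return -(F.softmax(q, dim=-1) * F.log_softmax(p, dim=-1)).sum(dim=-1).mean()

def loss3(out1, out2, gt, x, fc, fs):
    """out1/out2: decoder outputs; gt: third sample image; x: first sample image;
    fc/fs: content and style branches of the first model (encoder)."""
    # Line 1: the generated images should match the image gt (mean absolute error, L1).
    l_pixel = F.l1_loss(out1, gt) + F.l1_loss(out2, gt)
    # Line 2: generated content should match the content of gt and of the input x.
    l_content = (ce(fc(out1), fc(gt)) + ce(fc(out2), fc(gt))
                 + ce(fc(out1), fc(x)) + ce(fc(out2), fc(x)))
    # Line 3: generated style should match the style of gt.
    l_style = ce(fs(out1), fs(gt)) + ce(fs(out2), fs(gt))
    return l_pixel + l_content + l_style
```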
- the execution subject may be an image generation device, or a control module in the image generation device for executing the image generation method.
- an image generation device executing an image generation method is used as an example to describe the image generation device provided by the embodiments of the present application.
- FIG 11 is a structural diagram of an image generation device provided by an embodiment of the present application. As shown in Figure 11, the image generation device 1100 includes:
- the first acquisition module 1101 is used to acquire a first image whose image style is a first style, and a second image whose image style is a second style;
- the first feature processing module 1102 is configured to perform first feature processing on the first image based on the target model to obtain a first feature vector, where the first feature vector is used to characterize the image content of the first image;
- the feature splicing module 1103 is used to splice the first feature vector and the second feature vector to obtain a first target feature vector.
- the second feature vector is determined based on the second image of the second style.
- the second feature vector is used to characterize the image style of the second image;
- the image construction module 1104 is configured to perform image construction based on the first target feature vector to obtain a third image.
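- Taken together, the modules above implement the generation flow sketched below (the encoder/decoder stand in for the first and second models; the 1x256 and 1x512 vector shapes follow the sizes mentioned elsewhere in the description):
```python
import torch

def generate_third_image(encoder, decoder, first_image: torch.Tensor,
                         second_feature_vector: torch.Tensor) -> torch.Tensor:
    first_feature_vector = encoder(first_image)                        # image content, e.g. (1, 256)
    first_target_vector = torch.cat([first_feature_vector,
                                     second_feature_vector], dim=-1)   # spliced vector, e.g. (1, 512)
    return decoder(first_target_vector)                                # third image
```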
- the first feature processing module 1102 includes:
- a first feature encoding unit configured to perform first feature encoding on the first image to obtain a first feature image of the first image
- the second feature encoding unit is used to perform second feature encoding on the first feature image to obtain the first feature vector.
- the first feature encoding unit is specifically used for:
- performing feature extraction on the first image to obtain a second feature image of the first image;
- extracting, based on a target attention mechanism, an attention vector of the second feature image in the dimension corresponding to the target attention mechanism, the target attention mechanism including at least one of an attention mechanism in the channel dimension and an attention mechanism in the spatial dimension;
- multiplying the attention vector and the second feature image to obtain a third feature image;
- determining the first feature image based on the third feature image;
- wherein the first feature encoding includes the feature extraction and the extraction of the attention vector.
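- A minimal channel-plus-spatial attention block in the spirit of the target attention mechanism above is sketched here (an illustration only, not the exact CBAM configuration of the disclosure):
```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: second feature image, [B, C, H, W]
        b, c, _, _ = x.shape
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) +
                           self.mlp(x.amax(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca                                          # apply the channel attention vector
        sa = torch.sigmoid(self.spatial(torch.cat([x.mean(dim=1, keepdim=True),
                                                   x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa                                       # third feature image
```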
- the image building module 1104 includes:
- a first feature decoding unit configured to perform first feature decoding on the first target feature vector to obtain a fourth feature image
- a second feature decoding unit is configured to perform second feature decoding on the fourth feature image to obtain a fifth feature image, where the size of the fifth feature image is the same as the size of the first feature image;
- a splicing operation unit configured to perform a splicing operation on the first feature image and the fifth feature image to obtain a sixth feature image
- a third feature decoding unit is configured to perform third feature decoding on the sixth feature image to obtain the third image.
- the second style includes a first target style and a second target style; the first feature decoding unit is specifically used for:
- performing a first decoding operation on the first target feature vector corresponding to the first target style to obtain a seventh feature image;
- performing a splicing operation on the seventh feature image and an eighth feature image to obtain a ninth feature image, the eighth feature image being obtained by performing the first decoding operation on the first target feature vector corresponding to the second target style;
- performing a second decoding operation on the ninth feature image to obtain the fourth feature image.
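- The cross-connection between the two decoder branches can be sketched as follows (layer types and channel sizes are assumptions; the inputs are the two first target feature vectors reshaped to [B, C, 1, 1]):
```python
import torch
import torch.nn as nn

class CrossLinkedFirstDecoding(nn.Module):
    def __init__(self, channels: int = 512):
        super().__init__()
        self.first_decode_a = nn.ConvTranspose2d(channels, channels // 2, 4, stride=2, padding=1)
        self.first_decode_b = nn.ConvTranspose2d(channels, channels // 2, 4, stride=2, padding=1)
        self.second_decode = nn.ConvTranspose2d(channels, channels // 2, 4, stride=2, padding=1)

    def forward(self, target_vec_style1: torch.Tensor, target_vec_style2: torch.Tensor):
        seventh = self.first_decode_a(target_vec_style1)   # branch for the first target style
        eighth = self.first_decode_b(target_vec_style2)    # branch for the second target style
        ninth = torch.cat([seventh, eighth], dim=1)        # splice along the channel dimension
        return self.second_decode(ninth)                   # fourth feature image
```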
- the number of second images is M, where M is a positive integer, and the device further includes:
- a second feature processing module, configured to perform second feature processing on each of the second images to obtain M third feature vectors, where one third feature vector corresponds to one second image and the third feature vector is used to characterize the image style of the corresponding second image;
- An average processing module is used to average the M third feature vectors to obtain the second feature vector.
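- The averaging step can be written as follows (a trivial sketch; the vectors are assumed to be (1, 256) tensors):
```python
import torch

def average_style_vectors(third_feature_vectors: list) -> torch.Tensor:
    """Average the M style (third) feature vectors to obtain the second feature vector."""
    return torch.stack(third_feature_vectors, dim=0).mean(dim=0)
```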
- the number of third images is N, where N is an integer greater than 1, and the device includes:
- a second acquisition module, used to acquire a fourth image whose synthesis position is between two target images, where the first pixel information of the fourth image with respect to a first color space is determined based on the second pixel information of the two target images with respect to the first color space, and the two target images are two adjacent images among the N third images;
- a pixel adjustment module, used to adjust the fourth pixel information of the fourth image with respect to a second color space based on N pieces of third pixel information of the N third images with respect to the second color space, to obtain a fifth image;
- a synthesis module configured to synthesize N third images and the fifth image.
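- The acquisition of the fourth image amounts to a weighted blend of the two neighbouring third images in the first (RGB) color space; the 2/3 and 1/3 weights below follow equations (2) and (3) of the description:
```python
import numpy as np

def insert_fourth_images(ori_1: np.ndarray, ori_2: np.ndarray):
    """Two frames inserted between adjacent generated frames ori_1 and ori_2 (RGB arrays)."""
    mid1 = (2.0 / 3.0) * ori_1 + (1.0 / 3.0) * ori_2
    mid2 = (1.0 / 3.0) * ori_1 + (2.0 / 3.0) * ori_2
    return mid1, mid2
```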
- In this embodiment, a first image whose image style is a first style and a second image whose image style is a second style are acquired; first feature processing is performed on the first image based on the target model to obtain a first feature vector, which is used to characterize the image content of the first image; a splicing operation is performed on the first feature vector and a second feature vector to obtain a first target feature vector, where the second feature vector is determined based on the second image of the second style and is used to characterize the image style of the second image; and image construction is performed based on the first target feature vector to obtain a third image.
- In this way, the image style can be converted from the first style to the second style based on the target model, and the image content of the generated third image can be kept the same as the image content of the input first image, thereby improving the quality of the generated image.
- the image generating device in the embodiment of the present application may be a device, or may be a component, integrated circuit, or chip in an electronic device.
- the device may be a mobile electronic device or a non-mobile electronic device.
- the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a handheld computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), etc.
- non-mobile electronic devices can be servers, network attached storage (Network Attached Storage, NAS), personal computers (personal computer, PC), televisions (television, TV), teller machines or self-service machines, etc., which are not specifically limited in the embodiments of this application.
- the image generation device in the embodiment of the present application may be a device with an operating system.
- the operating system can be an Android operating system, an ios operating system, or other possible operating systems, which are not specifically limited in the embodiments of this application.
- the image generation device provided by the embodiment of the present application can implement each process implemented by the method embodiment in Figure 1. To avoid repetition, details will not be described here.
- the execution subject may be a model training device, or a control module in the model training device for executing the model training method.
- a model training device executing a model training method is used as an example to illustrate the model training device provided by the embodiment of the present application.
- Figure 12 is a structural diagram of a model training device provided by an embodiment of the present application. As shown in Figure 12, the model training device 1200 includes:
- the third acquisition module 1201 is used to acquire training sample data, where the training sample data includes a first sample image and a fourth feature vector used to characterize the first sample style;
- the first feature processing module 1202 is configured to perform first feature processing on the first sample image to obtain a fifth feature vector, where the fifth feature vector is used to characterize the image content of the first sample image;
- the feature splicing module 1203 is used to perform a splicing operation on the fifth feature vector and the fourth feature vector to obtain a second target feature vector;
- the image construction module 1204 is configured to perform image construction based on the second target feature vector to obtain a first output image;
- the first determination module 1205 is used to determine the first network loss value of the target model based on the first output image and the fifth feature vector;
- the first update module 1206 is used to update the network parameters of the target model based on the first network loss value
- the target model training is completed when a first preset condition is met, and the first preset condition includes: the first network loss value is less than a first preset threshold, and/or the target The number of training iterations of the model is greater than the second preset threshold.
- the target model includes a first model and a second model.
- the first model is used to perform first feature processing on the first sample image to obtain a fifth feature vector.
- the second model is used for: performing a splicing operation on the fifth feature vector and the fourth feature vector to obtain a second target feature vector; and performing image construction based on the second target feature vector to obtain the first output image;
- the training phase of the target model includes a first phase and a second phase, the second phase being located after the first phase;
- the first update module 1206 is specifically used for:
- when the training phase of the target model is in the first stage, the network parameters of the second model are updated based on the first network loss value, wherein the network parameters of the first model are fixed;
- when the training phase of the target model is in the second stage, the network parameters of the first model and the second model are updated based on the first network loss value;
- wherein, when a second preset condition is met, the training phase of the target model is in the first stage, and the second preset condition includes: the first network loss value is greater than or equal to a third preset threshold, and/or the number of training iterations of the target model is less than or equal to a fourth preset threshold, the third preset threshold being greater than the first preset threshold and the fourth preset threshold being less than the second preset threshold.
- the training sample data also includes K second sample images; the K second sample images include sample images with the same image content but different image styles, and sample images with the same image style but different image content, where K is an integer greater than 2; the device also includes:
- a target feature processing module configured to perform target feature processing on the K second sample images based on the first model to obtain K sixth feature vectors and K seventh feature vectors, where the sixth feature vector is used Characterizing the image content of the second sample image, the seventh feature vector is used to characterize the image style of the second sample image, and the target feature processing includes the first feature processing;
- a second determination module configured to determine the second network loss value of the first model based on the K sixth feature vectors and the K seventh feature vectors;
- a second update module configured to update the network parameters of the first model based on the second network loss value, wherein when the second network loss value is less than a fifth preset threshold, the first Model training is completed.
- the first determination module 1205 is specifically used to:
- target feature processing is performed on the first output image based on the first model to obtain an eighth feature vector and a ninth feature vector, where the eighth feature vector is used to characterize the image content of the first output image and the ninth feature vector is used to characterize the image style of the first output image;
- the eighth feature vector is compared with the fifth feature vector to determine a first loss value, and the ninth feature vector is compared with the fourth feature vector to obtain a second loss value;
- the first loss value and the second loss value are aggregated to obtain the first network loss value.
- In this way, training sample data is acquired, where the training sample data includes a first sample image and a fourth feature vector used to characterize a first sample style; first feature processing is performed on the first sample image to obtain a fifth feature vector, which is used to characterize the image content of the first sample image; a splicing operation is performed on the fifth feature vector and the fourth feature vector to obtain a second target feature vector; image construction is performed based on the second target feature vector to obtain a first output image; a first network loss value of the target model is determined based on the first output image and the fifth feature vector; and the network parameters of the target model are updated based on the first network loss value. When a first preset condition is met, the training of the target model is completed, and the first preset condition includes: the first network loss value is less than a first preset threshold, and/or the number of training iterations of the target model is greater than a second preset threshold. In this way, the training of the target model can be achieved, so that the target model can be used for image style conversion and the quality of the generated image can be improved.
- the model training device in the embodiment of the present application may be a device, or may be a component, integrated circuit, or chip in an electronic device.
- the device may be a mobile electronic device or a non-mobile electronic device.
- the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a handheld computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), etc.
- non-mobile electronic devices can be servers, network attached storage (Network Attached Storage, NAS), personal computers (personal computer, PC), televisions (television, TV), teller machines or self-service machines, etc., which are not specifically limited in the embodiments of this application.
- the model training device in the embodiment of the present application may be a device with an operating system.
- the operating system can be an Android operating system, an ios operating system, or other possible operating systems, which are not specifically limited in the embodiments of this application.
- model training device provided by the embodiment of the present application can implement each process implemented by the method embodiment in Figure 10. To avoid duplication, details will not be described here.
- this embodiment of the present application also provides an electronic device 1300, including a processor 1301, a memory 1302, and programs or instructions stored on the memory 1302 and executable on the processor 1301.
- When the program or instruction is executed by the processor 1301, each process of the above image generation method embodiment or of the above model training method embodiment is implemented, and the same technical effect can be achieved. To avoid duplication, details are not described here again.
- the electronic devices in the embodiments of the present application include the above-mentioned mobile electronic devices and non-mobile electronic devices.
- Figure 14 is a schematic diagram of the hardware structure of an electronic device that implements an embodiment of the present application.
- the electronic device 1400 includes, but is not limited to: a radio frequency unit 1401, a network module 1402, an audio output unit 1403, an input unit 1404, a sensor 1405, a display unit 1406, a user input unit 1407, an interface unit 1408, a memory 1409, a processor 1410 and other components.
- the electronic device 1400 may also include a power supply (such as a battery) that supplies power to the various components.
- the power supply may be logically connected to the processor 1410 through a power management system, so that charging, discharging, power consumption management and other functions are implemented through the power management system.
- the structure of the electronic device shown in Figure 14 does not constitute a limitation of the electronic device.
- the electronic device may include more or fewer components than shown in the figure, or combine certain components, or use a different arrangement of components, which will not be described again here.
- the electronic device can be used to perform the image generation method, wherein the processor 1410 is used to:
- acquire a first image whose image style is a first style, and a second image whose image style is a second style;
- perform first feature processing on the first image based on the target model to obtain a first feature vector, where the first feature vector is used to characterize the image content of the first image;
- perform a splicing operation on the first feature vector and a second feature vector to obtain a first target feature vector, where the second feature vector is determined based on the second image of the second style and is used to characterize the image style of the second image;
- perform image construction based on the first target feature vector to obtain a third image.
- In this embodiment, a first image whose image style is a first style and a second image whose image style is a second style are acquired; first feature processing is performed on the first image based on the target model to obtain a first feature vector, which is used to characterize the image content of the first image; a splicing operation is performed on the first feature vector and a second feature vector to obtain a first target feature vector, where the second feature vector is determined based on the second image of the second style and is used to characterize the image style of the second image; and image construction is performed based on the first target feature vector to obtain a third image.
- In this way, the image style can be converted from the first style to the second style based on the target model, and the image content of the generated third image can be kept the same as the image content of the input first image, thereby improving the quality of the generated image.
- the processor 1410 is also used to: perform first feature encoding on the first image to obtain a first feature image of the first image; and perform second feature encoding on the first feature image to obtain the first feature vector.
- the processor 1410 is also used to:
- perform feature extraction on the first image to obtain a second feature image of the first image;
- extract, based on a target attention mechanism, an attention vector of the second feature image in the dimension corresponding to the target attention mechanism, the target attention mechanism including at least one of an attention mechanism in the channel dimension and an attention mechanism in the spatial dimension;
- multiply the attention vector and the second feature image to obtain a third feature image;
- determine the first feature image based on the third feature image;
- wherein the first feature encoding includes the feature extraction and the extraction of the attention vector.
- the processor 1410 is also used to: perform first feature decoding on the first target feature vector to obtain a fourth feature image; perform second feature decoding on the fourth feature image to obtain a fifth feature image, the size of the fifth feature image being the same as the size of the first feature image; perform a splicing operation on the first feature image and the fifth feature image to obtain a sixth feature image; and perform third feature decoding on the sixth feature image to obtain the third image.
- the second style includes a first target style and a second target style; the processor 1410 is also used to:
- perform a first decoding operation on the first target feature vector corresponding to the first target style to obtain a seventh feature image;
- perform a splicing operation on the seventh feature image and an eighth feature image to obtain a ninth feature image, the eighth feature image being obtained by performing the first decoding operation on the first target feature vector corresponding to the second target style;
- perform a second decoding operation on the ninth feature image to obtain the fourth feature image.
- the number of second images is M, and M is a positive integer.
- the processor 1410 is also used to:
- perform second feature processing on each of the second images to obtain M third feature vectors, where one third feature vector corresponds to one second image and is used to characterize the image style of the second image;
- average the M third feature vectors to obtain the second feature vector.
- the number of third images is N, where N is an integer greater than 1.
- the processor 1410 is also used to:
- acquire a fourth image whose synthesis position is between two target images, where the first pixel information of the fourth image with respect to a first color space is determined based on the second pixel information of the two target images with respect to the first color space, and the two target images are two adjacent images among the N third images;
- adjust the fourth pixel information of the fourth image with respect to a second color space based on N pieces of third pixel information of the N third images with respect to the second color space, to obtain a fifth image;
- synthesize the N third images and the fifth image.
- the electronic device can be used to perform a model training method, wherein the processor 1410 is used to:
- acquire training sample data, where the training sample data includes a first sample image and a fourth feature vector used to characterize a first sample style; perform first feature processing on the first sample image to obtain a fifth feature vector used to characterize the image content of the first sample image; perform a splicing operation on the fifth feature vector and the fourth feature vector to obtain a second target feature vector; perform image construction based on the second target feature vector to obtain a first output image; determine a first network loss value of the target model based on the first output image and the fifth feature vector; and update the network parameters of the target model based on the first network loss value;
- the target model training is completed when a first preset condition is met, and the first preset condition includes: the first network loss value is less than a first preset threshold, and/or the target The number of training iterations of the model is greater than the second preset threshold.
- the target model includes a first model and a second model.
- the first model is used to perform first feature processing on the first sample image to obtain a fifth feature vector.
- the second model is used for: performing a splicing operation on the fifth feature vector and the fourth feature vector to obtain a second target feature vector; and performing image construction based on the second target feature vector to obtain the first output image;
- the training phase of the target model includes a first phase and a second phase, the second phase being located after the first phase;
- the processor 1410 is also used for:
- when the training phase of the target model is in the first stage, updating the network parameters of the second model based on the first network loss value, wherein the network parameters of the first model are fixed;
- when the training phase of the target model is in the second stage, updating the network parameters of the first model and the second model based on the first network loss value;
- wherein, when a second preset condition is met, the training phase of the target model is in the first stage, and the second preset condition includes: the first network loss value is greater than or equal to a third preset threshold, and/or the number of training iterations of the target model is less than or equal to a fourth preset threshold, the third preset threshold being greater than the first preset threshold and the fourth preset threshold being less than the second preset threshold.
- the training sample data also includes K second sample images; the K second sample images include sample images with the same image content but different image styles, and sample images with the same image style but different image content, where K is an integer greater than 2; the processor 1410 is also used for:
- Target feature processing is performed on the K second sample images based on the first model to obtain K sixth feature vectors and K seventh feature vectors.
- The sixth feature vector is used to characterize the image content of the second sample image, the seventh feature vector is used to characterize the image style of the second sample image, and the target feature processing includes the first feature processing;
- a second network loss value of the first model is determined based on the K sixth feature vectors and the K seventh feature vectors;
- the network parameters of the first model are updated based on the second network loss value, wherein, when the second network loss value is less than a fifth preset threshold, the training of the first model is completed.
- the processor 1410 is also used to:
- target feature processing is performed on the first output image based on the first model to obtain an eighth feature vector and a ninth feature vector, where the eighth feature vector is used to characterize The image content of the first output image, the ninth feature vector is used to characterize the image style of the first output image;
- the eighth feature vector is compared with the fifth feature vector to determine a first loss value, and the ninth feature vector is compared with the fourth feature vector to obtain a second loss value;
- the first loss value and the second loss value are aggregated to obtain the first network loss value.
- the input unit 1404 may include a graphics processing unit (GPU) 14041 and a microphone 14042.
- the graphics processor 14041 processes image data of still pictures or videos obtained by an image capture device (such as a camera) in video capture mode or image capture mode.
- the display unit 1406 may include a display panel 14061, which may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like.
- the user input unit 1407 includes a touch panel 14071 and other input devices 14072. Touch panel 14071, also known as touch screen.
- the touch panel 14071 may include two parts: a touch detection device and a touch controller.
- Other input devices 14072 may include but are not limited to physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, and joysticks, which will not be described again here.
- Memory 1409 may be used to store software programs as well as various data, including but not limited to application programs and operating systems.
- the processor 1410 can integrate an application processor and a modem processor, where the application processor mainly processes operating systems, user interfaces, application programs, etc., and the modem processor mainly processes wireless communications. It can be understood that the above modem processor may not be integrated into the processor 1410.
- Embodiments of the present application also provide a readable storage medium.
- Programs or instructions are stored on the readable storage medium.
- When the program or instructions are executed by a processor, each process of the above image generation method embodiment or of the above model training method embodiment is implemented, and the same technical effect can be achieved. To avoid repetition, details are not described here again.
- the processor is the processor in the electronic device described in the above embodiment.
- the readable storage media includes computer-readable storage media, such as computer read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disks or optical disks, etc.
- An embodiment of the present application further provides a chip.
- the chip includes a processor and a communication interface.
- the communication interface is coupled to the processor.
- the processor is used to run programs or instructions to implement each process of the above image generation method embodiment or each process of the above model training method embodiment, and the same technical effect can be achieved. To avoid duplication, details are not described here again.
- The chip mentioned in the embodiments of this application may also be called a system-level chip, a system chip, a chip system or a system-on-chip, etc.
- the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform; of course, they can also be implemented by hardware, but in many cases the former is the better implementation.
- Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a computer software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disk), and includes several instructions to cause an electronic device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the various embodiments of this application.
Abstract
本申请公开了一种图像生成方法、模型训练方法、相关装置及电子设备,属于人工智能技术领域。该方法包括:获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;基于所述第一目标特征向量进行图像构建,得到第三图像。
Description
相关申请的交叉引用
本申请主张在2022年07月18日在中国提交的中国专利申请No.202210840608.X的优先权,其全部内容通过引用包含于此。
本申请属于人工智能技术领域,具体涉及一种图像生成方法、模型训练方法、相关装置及电子设备。
随着人工智能的高速发展,可以采用深度学习模型如循环对抗生成网络(Cycle Generative Adversarial Network,CycleGAN)模型,进行图像风格的转换,以生成与输入图像不同风格的图像,比如,可以通过CycleGAN模型将夏天风格的风景图像转换为冬天风格的风景图像。
目前,采用CycleGAN模型进行图像风格的转换,所生成的图像质量比较差。
发明内容
本申请实施例的目的是提供一种图像生成方法、模型训练方法、相关装置及电子设备,能够解决采用相关模型进行图像风格的转换,所生成的图像质量比较差的问题。
第一方面,本申请实施例提供了一种图像生成方法,该方法包括:
获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;
基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;
对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;
基于所述第一目标特征向量进行图像构建,得到第三图像。
第二方面,本申请实施例提供了一种模型训练方法,该方法包括:
获取训练样本数据,所述训练样本数据包括第一样本图像,以及用于表征第一样本风格的第四特征向量;
对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第五特征向量用于表征所述第一样本图像的图像内容;对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;基于所述第二目标特征向量进行图像构建,得到第一输出图像;
基于所述第一输出图像和所述第五特征向量,确定目标模型的第一网络损失值;
基于所述第一网络损失值,更新所述目标模型的网络参数;
其中,在满足第一预设条件的情况下,所述目标模型训练完成,所述第一预设条件包括:所述第一网络损失值小于第一预设阈值,和/或,所述目标模型的训练迭代次数大于第二预设阈值。
第三方面,本申请实施例提供了一种图像生成装置,该装置包括:
第一获取模块,用于获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;
第一特征处理模块,用于基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;
特征拼接模块,用于对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;
图像构建模块,用于基于所述第一目标特征向量进行图像构建,得到第三图像。
第四方面,本申请实施例提供了一种模型训练装置,该装置包括:
第三获取模块,用于获取训练样本数据,所述训练样本数据包括第一样本图像,以及用于表征第一样本风格的第四特征向量;
第一特征处理模块,用于对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第五特征向量用于表征所述第一样本图像的图像内容;
特征拼接模块,用于对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;
图像构建模块,用于基于所述第二目标特征向量进行图像构建,得到第一输出图像;
第一确定模块,用于基于所述第一输出图像和所述第五特征向量,确定目标模型的第
一网络损失值;
第一更新模块,用于基于所述第一网络损失值,更新所述目标模型的网络参数;
其中,在满足第一预设条件的情况下,所述目标模型训练完成,所述第一预设条件包括:所述第一网络损失值小于第一预设阈值,和/或,所述目标模型的训练迭代次数大于第二预设阈值。
第五方面,本申请实施例提供了一种电子设备,该电子设备包括处理器、存储器及存储在所述存储器上并可在所述处理器上运行的程序或指令,所述程序或指令被所述处理器执行时实现如第一方面所述的图像生成方法的步骤,或者如第二方面所述的模型训练方法的步骤。
第六方面,本申请实施例提供了一种可读存储介质,所述可读存储介质上存储程序或指令,所述程序或指令被处理器执行时实现如第一方面所述的图像生成方法的步骤,或者如第二方面所述的模型训练方法的步骤。
第七方面,本申请实施例提供了一种芯片,所述芯片包括处理器和通信接口,所述通信接口和所述处理器耦合,所述处理器用于运行程序或指令,实现如第一方面所述的图像生成方法的步骤,或者如第二方面所述的模型训练方法的步骤。
在本申请实施例中,通过获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;基于所述第一目标特征向量进行图像构建,得到第三图像。如此,可以基于目标模型实现图像的图像风格从第一风格到第二风格的转换,并可以保持所生成的第三图像的图像内容与所输入的第一图像的图像内容相同,从而可以提高所生成的图像质量。
图1是本申请实施例提供的图像生成方法的流程图;
图2是卷积模块的结构示意图;
图3是CBAM的结构示意图;
图4是一示例的第一模型的结构示意图;
图5是一示例的目标模型的结构示意图;
图6是DeConvBlock模块的结构示意图;
图7是ResBlock模块的结构示意图;
图8是本申请实施例提供的图像合成方法的流程图；
图9是第六图像的亮度调整示意图;
图10是本申请实施例提供的模型训练方法的流程图;
图11是本申请实施例提供的图像生成装置的结构图;
图12是本申请实施例提供的模型训练装置的结构图;
图13是本申请实施例提供的电子设备的结构图;
图14为实现本申请实施例的一种电子设备的硬件结构示意图。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书中的术语“第一”、“第二”等是用于区别类似的对象,而不用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便本申请的实施例能够以除了在这里图示或描述的那些以外的顺序实施,且“第一”、“第二”等所区分的对象通常为一类,并不限定对象的个数,例如第一对象可以是一个,也可以是多个。此外,说明书以及权利要求中“和/或”表示所连接对象的至少其中之一,字符“/”,一般表示前后关联对象是一种“或”的关系。
下面结合附图,通过具体的实施例及其应用场景对本申请实施例提供的图像生成进行详细地说明。
图1是本申请实施例提供的图像生成方法的流程图,如图1所示,包括以下步骤:
步骤101,获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像。
该步骤中,第一图像可以为任一图像,比如人像图像、风景图像等,第一风格可以用于表征第一图像所对应的时光。第二图像也可以为任一图像,比如,人像图像、风景图像等,第二风格可以用于表征第二图像所对应的时光。
在一可选实施方式中,第一图像可以为风景图像,第一风格可以为四季时光中的时光,如春天时光,也可以为昼夜时长中的时光,如日出时光。
第二风格可以与第一风格相同,也可以不同。在一可选实施方式中,第二风格可以与第一风格不同,这样可以将第一图像转换成另一风格的图像,从而实现图像风格转换。
第二图像的数量可以为一个、两个甚至是多个,这里不进行具体限定。第二风格的数量也可以为一个、两个甚至是多个,这里不进行具体限定。
第一图像的获取方式可以包括多种,比如,可以获取预先存储的图像作为第一图像,可以通过相机实时拍摄得到第一图像,还可以接收其他电子设备发送的图像作为第一图像。第二图像的获取方式也可以包括多种,其获取方式可以与第一图像的获取方式类似,这里不进行赘述。
第一图像的获取时机可以在第二图像的获取时机之前、同时或之后。在一可选实施方式中,第一图像的获取时机可以在第二图像的获取时机之后,比如,可以首先获取第二图像,在提取出可表征第二图像的图像风格的第二特征向量的情况下,再获取第一图像,并基于第一图像和第二特征向量进行图像生成,这样,可以针对不同的图像,可以重复利用第二特征向量进行图像风格转换,从而可以提高图像生成的效率。
在一可选实施方式中,可以将第二特征向量与第二风格的风格信息匹配,相应的,可以获取第二风格的风格信息,即可以获取基于第二图像确定的第二特征向量。其中,风格信息可以为时光图模式,时光图模式可以表征第二风格,时光图模式可以包括四季变化模式、昼夜变换模式等,比如,时光图模式为四季变换模式时,第二风格可以包括春天时光、夏天时光等。
步骤102,基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容。
可以采用目标模型,基于第一图像进行图像生成,该目标模型可以用于生成与第一图像的图像内容、且图像风格为第二风格的图像。
该目标模型可以包括第一模型,第一模型可以称之为编码器,该编码器可以分离出图像的内容,编码得到用于表征图像的图像内容的特征向量。
该编码器可以对第一图像进行第一特征处理,得到第一特征向量。其中,第一特征处理可以包括特征提取,以提取出可表征第一图像的图像内容的第一特征向量。
步骤103,对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格。
该步骤中,第二特征向量用于表征第二图像的图像风格,第二特征向量基于第二风格的第二图像确定。
第二特征向量可以为第三特征向量,也可以对多个第三特征向量进行平均处理得到,其中,第三特征向量可以为用于表征第二图像的图像风格的特征向量。
第二特征向量可以基于深度学习模型,对第二图像进行第二特征处理得到,且每个第二特征向量与一个图像风格对应,这样,可以基于第二风格,获取第二风格对应的第二特征向量。
该深度学习模型可以与第一模型相同,也可以与第一模型不同。
在使用第一模型对第二图像进行第二特征处理得到第二特征向量的情况下,第一特征处理和第二特征处理可以完全不同,可以部分相同。
在一可选实施方式中,第一特征处理和第二特征处理可以部分相同,如前述的特征提取相同,后续可以基于相同的特征图像进行不同的特征提取,以分别得到用于表征图像内容的特征向量和用于表征图像风格的特征向量,实现图像内容和图像风格的解耦,这样通过一个模型即可分离出图像的内容特征向量和风格特征向量。
在内容特征向量和风格特征向量分离的情况下,可以将第一特征向量(其为内容特征向量)和第二特征向量(其为风格特征向量)进行拼接操作。具体的,目标模型可以包括拼接模块,可以通过拼接模块将两个特征向量进行拼接得到第一目标特征向量。
比如,第一特征向量的尺度为(1,1,256),即1*256大小的向量,第二特征向量的尺度为(1,1,256),即拼接得到的第一目标特征向量的尺度为(1,1,512),后续可以基于第一目标特征向量进行图像构建,生成相应的图像。
需要说明的是,目标模型在使用之前,需要预先训练,其训练过程将在下述实施例中进行详细说明。
步骤104,基于所述第一目标特征向量进行图像构建,得到第三图像。
其中,所述第三图像为图像风格为所述第二风格,且与所述第一图像具有相同图像内容的图像。
目标模型还可以包括第二模型,第二模型可以称之为解码器,该解码器可以基于所输入的特征向量,解码得到与所输入的特征向量所表征的图像内容和图像风格相同的图像。由于第一目标特征向量所表征的图像内容为第一图像的图像内容,且所表征的图像风格为第二风格,因此,该解码器所输出的图像即第三图像与所述第一图像具有相同图像内容,且图像风格为第二风格。
本实施例中,通过获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;基于所述第一目标特征向量进行图像构建,得到第三图像。如此,可以基于目标模型实现图像的图像风格从第一风格到第二风格的转换,并可以保持所生成的第三图像的图像内容与所输入的第一图像的图像内容相同,从而可以提高所生成的图像质量。
可选的,所述对所述第一图像进行第一特征处理,得到第一特征向量,包括:
对所述第一图像进行第一特征编码,得到所述第一图像的第一特征图像;
对所述第一特征图像进行第二特征编码,得到所述第一特征向量。
本实施方式中,第一特征处理可以包括第一特征编码和第二特征编码,第一特征编码用于提取第一图像的第一特征图像,第一特征图像可以为第一图像的图像特征,其可以包
括第一图像的颜色特征、纹理特征、形状特征和空间关系特征等,第二特征编码用于基于第一特征图像提取出用于表征第一图像的图像内容的第一特征向量。
本实施方式中,通过基于第一图像进行不同阶段的特征编码,可以提取得到用于表征第一图像的图像内容的第一特征向量,从而可以实现从第一图像中分离出第一图像的内容特征向量。
可选的,所述对所述第一图像进行第一特征编码,得到所述第一图像的第一特征图像,包括:
对所述第一图像进行特征提取,得到所述第一图像的第二特征图像;
基于目标注意力机制,提取所述第二特征图像在所述目标注意力机制对应维度上的注意力向量,所述目标注意力机制包括在通道维度上的注意力机制、在空间维度上的注意力机制中的至少一项;
将所述注意力向量和所述第二特征图像进行相乘处理,得到第三特征图像;
基于所述第三特征图像,确定所述第一特征图像;
其中,所述第一特征编码包括所述特征提取和所述注意力向量的提取。
本实施方式中,第一特征编码可以包括采用注意力机制,对第一图像进行特征提取的过程,以提高网络的特征表达能力。
具体的,可以采用一卷积模块,对第一图像进行特征提取,得到第一图像的第二特征图像,第二特征图像也可以为第一图像的图像特征,其可以包括第一图像的颜色特征、纹理特征、形状特征和空间关系特征等。
图2是卷积模块的结构示意图,如图2所示,卷积模块按照连接顺序分别为分别卷积层201、归一化(Batch Normlization,BN)处理202、Relu激活函数203、卷积层204和BN处理205。
第二特征图像与第一特征图像的尺度可以相同,也可以不同,这里不进行具体限定。在一可选实施方式中,第二特征图像与第一特征图像的尺度可以不同,通过串联连接的不同卷积模块不断执行特征提取,可以不断缩小特征图的尺度,从而充分提取出第一图像的图像特征。
第一模型可以包括注意力模块,该注意力模块可以基于目标注意力机制调整图像特征,以提高图像特征的表达能力。其中,目标注意力机制可以包括在通道维度上的注意力机制、在空间维度上的注意力机制中的至少一项。
在目标注意力机制仅包括一种注意力机制的情况下,可以提取第二特征图像在该注意力机制对应维度上的注意力向量,并将注意力向量和第二特征图像进行相乘处理,得到第三特征图像。
在目标注意力机制包括两种注意力机制的情况下,可以串联实现不同注意力机制的处理。
比如,可以通过通道注意力机制通过全局最大池化操作和全局池化操作获得通道注意
力向量,之后经过一个共享的多层感知器(Multilayer Perception,MLP)得到各自通道上的注意力向量后进行元素加法,并通过sigmoid激活函数得到通道维度上的注意力向量,将该注意力向量与第二特征图像进行相乘处理,输出一特征图像。之后,通过空间注意力机制根据通道注意力机制输出的特征图像,沿着信道轴应用平均池化操作和最大池化操作,并将它们连接起来,得到空间维度上的注意力向量,将该注意力向量与通道注意力机制输出的特征图像进行相乘处理,得到第三特征图像。其中,第三特征图像与第二特征图像的尺度相同。
在一可选实施方式中,注意力模块可以为卷积块注意模块(Convolutional Block Attention Module,CBAM)结构。图3是CBAM的结构示意图,如图3所示,CBAM可以包括通道注意力机制和空间注意力机制,并串联实现不同注意力机制的处理,输入第二特征图像,经过不同注意力机制的处理后,可以输出第三特征图像。
可以将第三特征图像确定为第一特征图像,为了充分提取第一图像的图像特征,可以采用另一卷积模块,继续对第三特征图像进行特征提取,以得到第一特征图像。
本实施方式中,通过特征提取可以实现第一图像的特征提取,且通过采用注意力机制进行特征提取,可以提高网络的特征表达能力,从而提高特征提取的准确性。
可选的,所述第二图像的数量为M,M为正整数,所述步骤103之前,所述方法还包括:
分别对每个所述第二图像进行第二特征处理,得到M个第三特征向量,一个所述第三特征向量与一个所述第二图像对应,所述第三特征向量用于表征所述第二图像的图像风格;
对所述M个第三特征向量进行平均处理,得到所述第二特征向量。
本实施方式中,第二特征处理可以包括第三特征提取和第四特征提取,第三特征提取用于提取第二图像的特征图像,该特征图像可以为第二图像的图像特征,其可以包括第二图像的颜色特征、纹理特征、形状特征和空间关系特征等,第四特征提取用于基于该特征图像提取出用于表征第二图像的图像内容的第三特征向量。
第三特征提取的方式可以与第一特征提取的方式相同,在一可选实施方式中,第一特征提取和第三特征提取可以通过相同的一些模块实现。
可以通过深度学习模型分别对每个第二图像进行第二特征处理,得到M个第三特征向量,具体可以分别将每个第二图像输入至深度学习模型,深度学习模型针对每个输入图像,可以输出一个第三特征向量,相应执行M次,即可以得到M个第三特征向量。
在一可选实施方式中,第一特征处理和第二特征处理均可以通过第一模型来实现,且第一特征提取和第三特征提取可以共用一些模块来实现,第二特征提取与第四特征提取不同,即可以分别通过不同的模块来实现第二特征提取和第四特征提取。也就是说,第一模型在进行特征处理时,可以对输入图像进行特征提取,得到特征图像,之后可以基于该特征图像进行不同的特征提取,以分别得到用于表征图像内容的特征向量和用于表征图像风
格的特征向量,实现图像内容和图像风格的解耦,这样通过一个模型即可分离出图像的内容特征向量和风格特征向量。
图4是一示例的第一模型的结构示意图,如图4所示,第一模型的输入可以是一个256*256*3大小的RGB图像,输出是两个1*256大小的向量,分别是内容特征向量(用fc表示)和风格特征向量(用fs表示)。
第一模型可以包括7个网络模块,7个网络模块详细介绍如下:
第一个网络模块401是卷积模块ConvBlock,内部结构如图2所示,后续的卷积模块(如第二个网络模块402、第三个网络模块403、第五个网络模块405至第七个网络模块407)可以与第一网络模块401的结构相同或类似。其中,第一个网络模块401的结构为:第一个卷积层Conv是内核kernel大小为3*3,步长stride为2的卷积,输入图像大小为256*256*3,输出图像大小为128*128*16。第二个卷积层是kernel大小为1*1,stride为1的卷积,输入图像大小为128*128*16,输出图像大小为128*128*32。
第四个网络模块404是CBAM结构,其内部结构如图3所示,用来提高网络的特征表达能力,其输入图像为第三个网络模块403输出的特征图像,如图3所示,其内含通道注意力机制和空间注意力机制两个模块。可以通过通道注意力机制通过全局最大池化操作和全局池化操作获得通道注意力向量,之后经过一个共享的MLP得到各自通道上的注意力向量后进行元素加法,并通过sigmoid激活函数得到通道维度上的注意力向量,通过Multipy相乘融合层将该注意力向量与第三个网络模块输出的特征图像进行相乘处理,输出一特征图像。之后,通过空间注意力机制根据通道注意力机制输出的特征图像,沿着信道轴应用平均池化操作和最大池化操作,并将它们连接起来,得到空间维度上的注意力向量,将该注意力向量与通道注意力机制输出的特征图像进行相乘处理,得到另一图像特征。
第五个网络模块405是卷积模块。第一个卷积层输入图像大小为32*32*96,输出图像大小为16*16*128。第二个卷积层输入图像大小为16*16*128,输出图像大小为16*16*128。
第六个网络模块406是卷积模块,输出的是内容特征向量。输入图像是第五个网络模块405的输出,输出图像大小为4*4*32,之后通过变换reshape操作将输出转换成1*256的一维向量。
第七个网络模块407是卷积模块,输出的是风格特征向量,输入图像同样是第五个网络模块405的输出,之后同样通过reshape操作将输出转换成1*256的一维向量。
在得到M个第三特征向量的情况下,可以对M个第三特征向量进行平均处理,得到第二特征向量,其计算公式如下式(1)所示。
其中,上式(1)中,fsavg为第二特征向量,fs(i)为第i个第三特征向量。
本实施方式中,通过分别对每个所述第二图像进行第二特征处理,得到M个第三特征向量,一个所述第三特征向量与一个所述第二图像对应,所述第三特征向量用于表征所
述第二图像的图像风格;对所述M个第三特征向量进行平均处理,得到所述第二特征向量。如此,可以从第二图像中分离出风格特征向量,以预先获取用于表征第二风格的第二特征向量,且通过对多个第二风格对应的第三特征向量进行平均处理,可以得到用于表征第二风格的平均风格的第二特征向量,如此,可以提高风格特征向量的表征能力。
可选的,所述步骤104具体包括:
对所述第一目标特征向量进行第一特征解码,得到第四特征图像;
对所述第四特征图像进行第二特征解码,得到第五特征图像,所述第五特征图像的尺寸与所述第一特征图像的尺寸相同;
将所述第一特征图像和所述第五特征图像进行拼接操作,得到第六特征图像;
对所述第六特征图像进行第三特征解码,得到所述第三图像。
本实施方式中,图像构建可以包括第一特征解码、第二特征解码和第三特征解码,第一特征解码用于对第一目标特征向量进行特征解码,得到第四特征图像,可以通过第一特征解码实现将特征向量解码为特征图像。
第二特征解码用于对第四特征图像进行第二特征解码,得到第五特征图像,所述第五特征图像的尺寸与所述第一特征图像的尺寸相同。在一可选实施方式中,第二特征解码相应的操作可以与第一特征提取相应的操作对应,即若通过上采样操作实现特征提取,即可以采用与该上采样操作对应的下采样操作来实现特征解码,且第二特征解码对应的网络层与第一特征提取对应的网络层对应,这样可以使得第五特征图像的尺寸与第一特征图像的尺寸相同。
第三特征解码用于实现对第六特征图像进行特征解码,得到第三图像,第六特征图像是基于第一特征图像和第五特征图像进行拼接得到的。这样,可以避免在网络处理过程中图像语义信息的丢失,保证图像风格转换过程中图像内容的不变性,在具体实现过程中,将编码器与解码器的对应网络层之间连接,并通过通道维度上的concat操作实现将对应层输出的特征图像拼接得到第六特征图像。
其中,第一特征解码可以包括至少一个编码操作,在第一特征解码包括多个解码操作的情况下,可以通过级联形式实现逐步实现对第一目标特征向量的特征解码。并且,第二特征解码也可以包括至少一个解码操作,在第二特征解码包括多个解码操作的情况下,也可以通过级联形式实现逐步实现对第四特征图像的特征解码。
第一特征解码、第二特征解码和第三特征解码均是通过下采样操作,来扩大特征的尺度,以可以解码到第三图像,第三图像的尺度可以与第一图像的尺度相同,如256*256*3大小。
需要说明的是,目标模型中的解码器可以包括至少一个分支网络,如可以包括两个分支网络,每个分支网络可以通过上述图像构建实现图像内容针对一个图像风格的转换,相应的,可以通过目标模型实现多目标风格转换,即可以将输入图像转换到多个风格,得到多个风格的图像。
在解码器包括至少两个分支网络的情况下,解码器中的不同分支网络可以独立进行风格转换。在一可选实施方式中,也可以协同进行风格转换,使得多目标任务间可以相互促进,共同优化,更能满足时光图像生成中性能和效果的要求,可选的,所述第二风格包括第一目标风格和第二目标风格;所述对所述第一目标特征向量进行第一特征解码,得到第四特征图像,包括:
对所述第一目标风格对应的所述第一目标特征向量进行第一解码操作,得到第七特征图像;
将所述第七特征图像和第八特征图像进行拼接操作,得到第九特征图像,所述第八特征图像是对所述第二目标风格对应的所述第一目标特征向量进行所述第一解码操作得到的;
对所述第九特征图像进行第二解码操作,得到所述第四特征图像。
本实施方式中,解码器可以包括至少两个分支网络,每个分支网络可以实现第一图像的图像内容针对第二风格的转换,以两个分支网络为例,即第二风格的数量为2。
可以通过一个分支网络对第一目标风格对应的第一目标特征向量进行第一解码操作,得到第七特征图像。相应的,可以通过另一个分支网络对第二目标风格对应的第一目标特征向量进行第一解码操作,得到第八特征图像。其中,第一解码操作可以包括上采样操作,以实现特征解码。
之后,可以将第七特征图像和第八特征图像进行拼接操作,得到第九特征图像,具体可以将两个分支网络对应网络层之间的输入相互进行concat操作,由于两个解码器解码出来的语义信息应该是一致的,因此,互联级联可以促进两个解码器对相同内容的输入保持解码出语义信息的相近,起到联合优化的作用,从而提高特征解码的准确性。
之后,可以对第九特征图像进行第二解码操作,得到第四特征图像,如此可以实现对第一目标特征向量进行第一特征解码,且通过互联级联可以促进两个解码器对相同内容的输入保持解码出语义信息的相近,起到联合优化的作用,从而提高特征解码的准确性。
图5是一示例的目标模型的结构示意图,如图5所示,该目标模型可以包括第一模型即编码器51和第二模型即解码器,第二模型可以包括第一解码器52和第二解码器53,编码器51的结构如图4所示,第一解码器和第二解码器的结构相同,但是网络权重不同。
解码器中可以包括解码网络DeConvBlock模块和残差网络ResBlock模块,DeConvBlock模块的结构示意图如图6所示,其组成分别为上采样模块、卷积层、BN处理和Relu激活函数。首先,采用上采样操作将输入扩大到两倍大小,通道数保持不变,之后采用卷积操作,设置kernel大小为3*3,stride为1,之后加入常规的BN处理和Relu操作。
ResBlock模块的结构示意图如图7所示,其组成分别为卷积层、BN处理、Relu激活函数、卷积层、BN处理、网络层相加Add处理。第一个卷积层是kernel大小为3*3,stride为1的卷积,输出通道与输入相同,之后加入常规的BN和Relu操作,第二个卷积层的
kernel大小为1*1,stride为1,通道数为设置的输出通道,再加一个BN操作,而Add处理用于将ResBlock模块的输入特征与输出特征相加再输出。
如图5所示,解码器可以包括8个模块,排列在第1、2、5、6、7、8的模块可以为DeConvBlock模块,排列在第3、4的模块可以为ResBlock模块。各个模块的输入和输出大小如下表1所示。
表1解码器输入输出大小
如图5所示,可以包括多个网络层级联,可以避免在网络处理过程中图像语义信息的丢失,如编码器和解码器对应网络层之间的连接,又如两个解码器之间第2至4个模块之间的输入互相连接。
可以通过上述目标模型针对一个图像如风景图像,实现表征不同时光风格的图像生成,并可以利用所生成的多个图像进行图像合成,以得到按照时光变换的动态图或视频。下面结合附图,通过具体的实施例及其应用场景对本申请实施例提供的图像合成进行详细地说明。
图8是本申请实施例提供的图像合成方法的流程图,如图8所示,包括以下步骤:
步骤801,通过目标模型将第一图像进行风格转换,生成N个第三图像;
步骤802,获取合成位置位于两个目标图像之间的第四图像,所述第四图像关于第一颜色空间的第一像素信息是基于所述两个目标图像关于所述第一颜色空间的第二像素信息确定的,所述两个目标图像为N个所述第三图像中相邻的两个图像;
步骤803,基于N个所述第三图像关于第二颜色空间的N个第三像素信息,对所述第四图像关于所述第二颜色空间的第四像素信息进行调整,得到第五图像;
步骤804,合成N个所述第三图像和所述第五图像。
该步骤801的目的是基于目标模型生成所需的表征不同时光风格的图像。用户可以输入一张源图像即第一图像,以及对应N个第二风格的时光变换模式,如四季变化模式、昼夜变换模式等,相应的,目标模型针对所输入的信息进行图像风格转换,得到第三图像。
其中,时光变换模式对应的第二风格的数量可以设置,如第二风格的数量为4,在四季变化模式中,分别包括春天、夏天、秋天和冬天这4个不同时光的风格,而昼夜变换模式可以设置为日出、日中、日落和深夜这4个不同时光的风格。
通过目标模型将第一图像进行风格转换,生成N个第三图像的过程与上述图像生成方法实施例的过程类似,这里不进行赘述。需要说明的是,在目标模型中解码器仅包括两个分支网络、而需要输出四种不同时光风格的图像的情况下,目标模型可以执行两次图像生成操作,即通过两次推理,便可以得到所需的4帧时光图像。可以通过尺寸调整resize,将4帧时光图像的尺寸全部放大到1080*1080*3大小。
在步骤802中,为了解决图像合成过程中图像突变的问题,采用插帧的方式将时光图像序列进行扩展,如从4帧扩展到10帧,两帧相邻图像之间可以增加一帧或多帧图像,如每两帧相邻图像之间可以增加两帧图像。
可以基于相邻两帧图像关于第一颜色空间的第二像素信息均匀变化的条件,计算出所需要插入的图像关于第一颜色空间的第一像素信息,以得到在该相邻两帧图像之间所需要插入的帧图像即第四图像,该方式可以适用于景物位置不动的风景图像的插帧。
其中,第一颜色空间可以为RGB颜色空间,对于所需要插入的帧图像中像素点的颜色值可以对应前后时光图像中同一像素位置的颜色值的加权和,计算公式如下式(2)和下式(3)所示。
mid1=2/3*ori_1+1/3*ori_2 (2)
mid2=1/3*ori_1+2/3*ori_2 (3)
其中,ori_1和ori_2分别相邻的两个时光图像,mid1和mid2分别所需要插入的前后两帧。
在步骤803中,得到10帧时光图像之后,为了使合成的动态时光图更符合真实的第二颜色空间上的变化,如亮度变化和色彩变化,可以基于目标模型生成的4帧时光图像关于第二颜色空间的N个第三像素信息,对所需要插入的帧图像关于第二颜色空间的第四像素信息进行调整,得到第五图像。
第二颜色空间可以为Lab颜色空间。其中,L代表亮度,取值范围是[0,100],表示从纯黑到纯白;a表示从红色到绿色的范围,取值范围是[127,-128];b表示从黄色到蓝色的范围,取值范围是[127,-128]。
之后,可以合成N个第三图像和第五图像,得到动态图像或视频。
本实施例中,通过目标模型将第一图像进行风格转换,生成N个第三图像;获取合成位置位于两个目标图像之间的第四图像,所述第四图像关于第一颜色空间的第一像素信息是基于所述两个目标图像关于所述第一颜色空间的第二像素信息确定的,所述两个目标图像为N个所述第三图像中相邻的两个图像;基于N个所述第三图像关于第二颜色空间的N个第三像素信息,对所述第四图像关于所述第二颜色空间的第四像素信息进行调整,得到第五图像;合成N个所述第三图像和所述第五图像。如此,可以使得合成的动态时光图
更符合真实的第二颜色空间上的变化,如亮度变化和色彩变化,提高图像合成的效果。
可选的,所述第二颜色空间包括三个分量,所述步骤803具体包括:
针对每个分量,基于所述N个第三像素信息中关于所述分量的像素值,对所述第四像素信息中关于所述分量的像素值进行调整,得到第五图像。
本实施方式中,第二颜色空间可以为Lab颜色空间,其分量可以包括三个,分别为亮度、颜色分量a和颜色分量b。
可以针对每个分量,进行该分量的像素值调整,可以使得合成的动态时光图在第二颜色空间上的各个分量均符合真实的变化。
在一可选实施方式中,对于时辰不变的模式如四季变化模式可以不调整亮度的像素值。
可选的,所述三个分量包括亮度分量,所述基于所述N个第三像素信息中关于所述分量的像素值,对所述第四像素信息中关于所述分量的像素值进行调整,得到第五图像,包括:
基于所述N个第三像素信息中关于所述亮度分量的像素值,获取所述N个第三图像关于所述亮度分量的N个第一亮度值;以及基于所述第四像素信息中关于所述亮度分量的像素值,获取所述第四图像关于所述亮度分量的第二亮度值;
基于所述N个第一亮度值和所述N个第三图像对应的N个第一时光,对用于表征时光相对于亮度值变化的第一曲线进行拟合;
基于所述第一曲线,计算所述第四图像对应的第二时光的第三亮度值;
基于所述第二亮度值和所述第三亮度值,对用于表征调整前的亮度值相对于调整后的亮度值变化的第二曲线进行拟合;
将所述第四像素信息中关于所述亮度分量的像素值调整为第四亮度值,所述第四亮度值基于所述第二曲线和所述第四像素信息中关于所述亮度分量的像素值计算得到。
本实施方式中,可以分别对每个第三图像中像素点的亮度值进行平均统计,获得N个第三图像对应的N个第一亮度值,并可以对第四图像中像素点的亮度值进行平均统计,获得第四图像对应的第二亮度值。在一可选实施方式中,可以将图像从RGB颜色空间转换成LAB颜色空间,L通道求平均即可得到图像的平均亮度值。
可以使用如下式(4)所示的公式来拟合第一曲线,第一曲线为时光如时刻相对于亮度值变化的曲线。
其中,上式(4)中,x为时刻,y为亮度,且以6作为日出时刻,12为日中,18为日落,0为深夜。
曲线拟合过程中可以使用N个第一亮度值作为y数据,而将N个第一时光作为x数据,使用最小二乘法确定上式(4)中的系数,即k和b系数。
记第四图像中调整前的平均亮度即第二亮度值为q,将第四图像对应的第二时光作为
x,基于第一曲线计算y即第三亮度值(记为q’)作为第四图像中调整后的平均亮度。可以使用如下式(5)所示的公式来拟合第二曲线,第二曲线为调整前的亮度值相对于调整后的亮度值变化的曲线。
y=ax²+bx+c  (5)
可以通过(0,0)、(100,100)、(q,q’)三点确定第二曲线的参数,即a、b和c。
相应的,可以将第四图像中每个像素点调整前的亮度值作为x,基于第二曲线计算每个像素点调整后的亮度值即第四亮度值。
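A sketch of the second-curve fitting described above: formula (5) is a quadratic passed exactly through (0, 0), (100, 100) and (q, q'), and the fitted curve is then applied to each pixel's pre-adjustment brightness value. The q/q' values in the usage line are illustrative only.
```python
import numpy as np

def fit_second_curve(q: float, q_prime: float):
    """Fit y = a*x^2 + b*x + c through (0, 0), (100, 100) and (q, q')."""
    xs = np.array([0.0, 100.0, q])
    ys = np.array([0.0, 100.0, q_prime])
    a, b, c = np.polyfit(xs, ys, 2)          # exact for three points and degree 2
    return lambda l: a * l ** 2 + b * l + c

remap = fit_second_curve(q=55.0, q_prime=62.0)
# adjusted_L = remap(pre_adjustment_L_values_of_the_fourth_image)
```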
第四图像的亮度调整示意图如图9所示,其中,直线为第四图像中像素点调整前的亮度曲线,而曲线为第四图像中像素点调整后的亮度曲线,通过对第四图像进行亮度调整,可以模拟真实世界的亮度变化,使得合成后的动态图中帧与帧之间的变化更平滑。
相应的,对于颜色分量a和b,可以采用如亮度通道对应的调整方式进行色彩调整,不再赘述。不同的是,是使用如下式(6)所示的公式来拟合第一曲线。
通过对第四图像进行色彩调整,可以模拟真实世界的色彩变化,使得合成后的动态图中帧与帧之间的变化更平滑。
需要说明的是,上述目标模型在使用之前,需要预先训练,下面结合附图,通过具体的实施例及其应用场景对本申请实施例提供的模型训练进行详细地说明。
图10是本申请实施例提供的模型训练方法的流程图,如图10所示,包括以下步骤:
步骤1001,获取训练样本数据,所述训练样本数据包括第一样本图像,以及用于表征第一样本风格的第四特征向量;
步骤1002,对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第五特征向量用于表征所述第一样本图像的图像内容;
步骤1003,对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;
步骤1004,基于所述第二目标特征向量进行图像构建,得到第一输出图像;
步骤1005,基于所述第一输出图像和所述第五特征向量,确定目标模型的第一网络损失值;
步骤1006,基于所述第一网络损失值,更新所述目标模型的网络参数。
其中,在满足第一预设条件的情况下,所述目标模型训练完成,所述第一预设条件包括:所述第一网络损失值小于第一预设阈值,和/或,所述目标模型的训练迭代次数大于第二预设阈值。
在步骤1001中,训练样本数据可以包括至少一个第一样本图像,以及包括至少一个第一样本风格对应的第四特征向量。
第一样本图像可以为任一图像,如可以为风景图像,其获取方式可以与第一图像类似,
用于表征第一样本风格的第四特征向量可以通过目标模型中的第一模型获取,其获取方式也可以与第二特征向量类似,这里不进行赘述。
第四特征向量的数量可以与目标模型中解码器的分支网络的数量相同,如解码器的分支网络的数量为2,即可以同时实现两种图像风格转换,则第四特征向量的数量即为2。
训练样本数据还可以包括K个第二样本图像,这K个第二样本图像可以用于对第一模型进行训练,K为大于2的整数。训练样本数据还可以包括第三样本图像,所述第三样本图像与所述第一样本图像具有相同图像内容,所述第三样本图像的图像风格为所述第一样本风格,第三样本图像可以结合第一样本图像和第四特征向量进行目标模型的网络参数调整,以下再对这两种情况进行说明。
上述步骤1002、步骤1003和步骤1004是基于目标模型进行图像生成的过程,具体可以将第一样本图像和第四特征向量输入至目标模型,该目标模型相应可以执行上述步骤1002、步骤1003和步骤1004,其中,上述步骤1002、步骤1003和步骤1004的过程与上述图像生成方法实施例的过程类似,这里不进行赘述。
在步骤1005中,可以基于所述第一输出图像和所述第五特征向量,确定目标模型的第一网络损失值。
在一可选实施方式中,若解码器的分支网络的数量为2,其输出的图像分别为out1和out2,两个第四特征向量分别为s1和s2,输入图像即第一样本图像为x,其损失函数可以如下式(7)所示。
其中,上式(7)中,CE是交叉熵损失函数,fc(out1)和fc(out2)为输出图像1和输出图像2的内容特征向量,fc(x)为输入图像的内容特征向量,fs(out1)和fs(out2)为输出图像1和输出图像2的风格特征向量,Loss1为第一网络损失值。
Loss1的第1行是用来保证生成的两个图像内容相同且与输入图像内容保持一致,第2行是用来保证解码器1生成的图像风格与输入的图像风格相同,第3行是用来保证解码器2生成的图像风格与输入的图像风格相同。
在步骤1006中,可以基于第一网络损失值,更新目标模型的网络参数。
可以采用梯度下降法更新目标模型的网络参数,且可以采用循环迭代的方式,不断更新目标模型的网络参数,直至第一网络损失值小于第一预设阈值且达到收敛,和/或,目标模型的训练迭代次数大于第二预设阈值,此时目标模型可以训练完成。其中,第一预设阈值和第二预设阈值可以根据实际情况进行设置,通常第一预设阈值可以设置的比较小,第二预设阈值设置的比较大,以保证目标模型的充分训练,保证目标模型的训练效果。
需要说明的是,目标模型的训练阶段可以仅包括一个阶段,该阶段中,可以将第三样本图像、第一样本图像和第四特征向量作为目标模型的输入,在目标模型更新时,结合第
三样本图像、第一输出图像和第五特征向量同时更新第一模型和第二模型的网络参数。
目标模型的训练阶段也可以包括至少两个阶段,这至少两个阶段可以包括第一阶段和第二阶段,所述第二阶段位于所述第一阶段之后,第一阶段可以称之为预训练阶段,第二阶段可以称之为微调阶段。在目标模型的训练阶段处于第一阶段的情况下,可以将第一样本图像和第四特征向量作为目标模型的输入,在目标模型更新时,结合第一输出图像、第四特征向量和第五特征向量更新第二模型的网络参数,而在第一阶段时,第一模型的网络参数固定不定。在目标模型的训练阶段处于第二阶段的情况下,可以将第三样本图像、第一样本图像和第四特征向量作为目标模型的输入,在目标模型更新时,结合第三样本图像、第一输出图像和第五特征向量同时更新第一模型和第二模型的网络参数,以进一步调整目标模型的网络参数。这样,通过预训练结合微调的训练方式可以提高目标模型的训练速度。
本实施例中,通过获取训练样本数据,所述训练样本数据包括第一样本图像,以及用于表征第一样本风格的第四特征向量;对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第五特征向量用于表征所述第一样本图像的图像内容;对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;基于所述第二目标特征向量进行图像构建,得到第一输出图像;基于所述第一输出图像和所述第五特征向量,确定目标模型的第一网络损失值;基于所述第一网络损失值,更新所述目标模型的网络参数;其中,在满足第一预设条件的情况下,所述目标模型训练完成,所述第一预设条件包括:所述第一网络损失值小于第一预设阈值,和/或,所述目标模型的训练迭代次数大于第二预设阈值。如此,可以实现目标模型的训练,使得该目标模型可以用于图像风格转换,提高所生成的图像质量。
可选的,所述目标模型包括第一模型和第二模型,所述第一模型用于:对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第二模型用于:对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;基于所述第二目标特征向量进行图像构建,得到所述第一输出图像;
所述目标模型的训练阶段包括第一阶段和第二阶段,所述第二阶段位于所述第一阶段之后;所述步骤1006具体包括以下任一项:
在所述目标模型的训练阶段位于所述第一阶段的情况下,基于所述第一网络损失值,更新所述第二模型的网络参数,其中,所述第一模型的网络参数固定不变;
在所述目标模型的训练阶段位于所述第二阶段的情况下,基于所述第一网络损失值,更新所述第一模型和所述第二模型的网络参数;
其中,在满足第二预设条件的情况下,所述目标模型的训练阶段位于所述第一阶段,所述第二预设条件包括:所述第一网络损失值大于或等于第三预设阈值,和/或,所述目标模型的训练迭代次数小于或等于第四预设阈值,所述第三预设阈值大于所述第一预设阈值,所述第四预设阈值小于所述第二预设阈值。
本实施方式中,目标模型的训练阶段也可以包括至少两个阶段,这至少两个阶段可以
包括第一阶段和第二阶段,所述第二阶段位于所述第一阶段之后,第一阶段可以称之为预训练阶段,第二阶段可以称之为微调阶段。
预训练阶段和微调阶段在训练过程中存在三点不同,第一点为输入不同,预训练阶段的输入为第一样本图像和第四特征向量,微调阶段的输入为第三样本图像、第一样本图像和第四特征向量。
第二点为第一网络损失值的确定方式不同,预训练阶段的第一网络损失值的确定方式为基于第一输出图像、第四特征向量和第五特征向量,确定第一网络损失值,微调阶段的第一网络损失值的确定方式为基于第一输出图像、第三样本图像和第五特征向量,确定第一网络损失值。
第三点为目标模型的网络参数的更新方式不同,预训练阶段是第一模型的网络参数固定不变,仅更新第二模型的网络参数,而微调阶段是同时更新第一模型和第二模型的网络参数。
在预训练阶段,可以固定第一模型的网络参数,并基于第一网络损失值,仅更新目标模型中第二模型的网络参数,这样可以简化模型的训练。
而在微调阶段,可以同时更新第一模型和第二模型的网络参数,以在预训练阶段的基础上,进一步微调目标模型的网络参数。
其中,在满足第二预设条件的情况下,所述目标模型的训练阶段位于第一阶段,第二预设条件可以根据实际情况进行设置,其可以包括所述第一网络损失值大于或等于第三预设阈值,和/或,所述目标模型的训练迭代次数小于或等于第四预设阈值。第三预设阈值和第四预设阈值均可以根据实际情况进行设置,第三预设阈值大于第一预设阈值,第四预设阈值小于第二预设阈值。
在一可选实施方式中,训练过程中预训练阶段的迭代次数与微调阶段的迭代次数比例可以为10:1,可以根据该迭代次数比例设置第二预设阈值和第四预设阈值。
相应的,当不满足第二预设条件时,训练阶段可以从预训练阶段自然过程到微调阶段。
可选的,为了进一步提高目标模型的训练速度,可以在目标模型训练之前优先训练第一模型。所述训练样本数据还包括:K个第二样本图像,所述K个第二样本图像包括:具有相同图像内容,但图像风格不同的样本图像,以及具有相同图像风格,但图像内容不同的样本图像,K为大于2的整数;所述步骤1006之前,所述方法还包括:
基于所述第一模型对所述K个第二样本图像进行目标特征处理,得到K个第六特征向量和K个第七特征向量,所述第六特征向量用于表征所述第二样本图像的图像内容,所述第七特征向量用于表征所述第二样本图像的图像风格,所述目标特征处理包括所述第一特征处理;
基于所述K个第六特征向量和所述K个第七特征向量,确定所述第一模型的第二网络损失值;
基于所述第二网络损失值,更新所述第一模型的网络参数,其中,在所述第二网络损
失值小于第五预设阈值的情况下,所述第一模型训练完成。
本实施方式中,K个第二样本图像可以为成对数据,即具有相同图像内容,但图像风格不同的成对样本图像,以及具有相同图像风格,但图像内容不同的成对样本图像。
可以采用CycleGAN模型来生成成对样本图像。
目标特征处理可以包括第一特征处理和第二特征处理,可以分别将每个第二样本图像输入至第一模型进行目标特征处理,得到每个第二样本图像的内容特征向量即第六特征向量和风格特征向量即第七特征向量。第一模型的结构可以如图4所示。
训练过程中,对于每个第二样本图像(用I表示),对应有两个成对样本图像(用GT表示),其中,GT_c是与I包含相同图像内容,但是不同图像风格的图像,而GT_s是与I包含不同图像内容,但图像风格相同的图像。将第一模型即编码器的内容特征向量记为fc(x),风格特征向量记为fs(x),第一模型在训练过程中采用的损失函数如下式(8)所示。
Loss2=k*CE(fc(I),fc(GT_c))-CE(fs(I),fs(GT_c))
+k*CE(fs(I),fs(GT_s))-CE(fc(I),fc(GT_s)) (8)
其中,上式(8)中,k=100,CE是交叉熵损失函数,Loss2为第二网络损失值。该损失函数能够使得拥有相同图像内容的图像经过编码器后编码出相似的内容特征向量,拥有相同图像风格的图像经过编码器后编码出相似的风格特征向量,而两个不同图像内容的图像经过编码器编码的内容特征向量具有较大差异,两个不同图像风格的图像经过编码器编码的风格特征向量具有较大差异。
之后,基于第二网络损失值更新第一模型的网络参数,其更新方式与基于第一网络损失值更新目标模型的网络参数的方式类似,这里不进行赘述。其中,第五预设阈值可以根据实际情况进行设置,通常设置的比较小,这里不进行具体限定。
本实施方式可以预先对第一模型进行训练,第一模型在训练完成之后,可以辅助进行目标模型的训练,这样可以简化模型训练的过程。
可选的,在所述目标模型的训练阶段位于所述第一阶段的情况下,所述步骤1005具体包括:
在所述第一模型训练完成的情况下,基于所述第一模型对所述第一输出图像进行目标特征处理,得到第八特征向量和第九特征向量,所述第八特征向量用于表征所述第一输出图像的图像内容,所述第九特征向量用于表征所述第一输出图像的图像风格;
将所述第八特征向量与所述第五特征向量进行比对,确定第一损失值;以及将所述第九特征向量和所述第四特征向量进行比对,得到第二损失值;
将所述第一损失值和所述第二损失值进行聚合,得到所述第一网络损失值。
本实施方式中,在第一模型训练完成的情况下,可以辅助目标模型的训练,具体可以基于第一模型对第一输出图像进行目标特征处理,得到第一输入图像的内容特征向量即第八特征向量和风格特征向量即第九特征向量。
相应的,可以采用如上式(7)所示的损失函数确定第一网络损失值,一方面进行图像内容的不变约束,保证生成的两个图像内容相同且与输入图像内容保持一致,另一方面,进行图像风格的不变约束,保证解码器生成的图像风格与输入的图像风格相同。
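预训练阶段第一网络损失值的计算可示意如下(其中特征向量的比对方式以L1距离为例、聚合方式以相加为例,均为示意性假设,并非对式(7)的限定):

```python
import torch.nn.functional as F

def loss1_pretrain(encoder, out_image, f5_content, f4_style):
    """示意:利用训练完成的第一模型约束第一输出图像的内容与风格。"""
    f8_content, f9_style = encoder(out_image)         # 第八、第九特征向量
    content_loss = F.l1_loss(f8_content, f5_content)  # 第一损失值:图像内容不变约束
    style_loss = F.l1_loss(f9_style, f4_style)        # 第二损失值:图像风格不变约束
    return content_loss + style_loss                  # 聚合得到第一网络损失值
```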
可选的,在所述目标模型的训练阶段位于所述第二阶段的情况下,所述步骤1005具体包括:
基于第一输出图像、第五特征向量和第三样本图像,确定所述目标模型的第一网络损失值。
本实施方式中,第一输出图像分别为out1和out2,第一样本图像为x,第三样本图像记为gt,可以采用如下式(9)所示的损失函数,基于第一输出图像、第三样本图像和第五特征向量,确定第一网络损失值。
其中,上式(9)中,L1表示平均绝对误差函数,Loss3的第一行是用来促使目标模型生成的图像与图像gt相同,第二行保证生成图像内容与图像gt内容相同,且与输入图像x相同,第三行保证生成图像风格与图像gt风格相同。
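结合上述对式(9)各行的文字描述,其一种可能的实现可示意如下(各项的具体距离度量与权重系数k均为本示例的假设):

```python
import torch.nn.functional as F

def loss3(encoder, out1, out2, gt, x, k=1.0):
    """示意:微调阶段第一网络损失值的一种可能形式,encoder为第一模型。"""
    fc_out1, fs_out1 = encoder(out1)
    fc_out2, fs_out2 = encoder(out2)
    fc_gt, fs_gt = encoder(gt)
    fc_x, _ = encoder(x)
    # 第一行:促使目标模型生成的图像与图像gt相同(L1为平均绝对误差)
    recon = F.l1_loss(out1, gt) + F.l1_loss(out2, gt)
    # 第二行:生成图像内容与gt内容相同,且与输入图像x的内容相同
    content = (F.l1_loss(fc_out1, fc_gt) + F.l1_loss(fc_out2, fc_gt)
               + F.l1_loss(fc_out1, fc_x) + F.l1_loss(fc_out2, fc_x))
    # 第三行:生成图像风格与gt风格相同
    style = F.l1_loss(fs_out1, fs_gt) + F.l1_loss(fs_out2, fs_gt)
    return recon + k * (content + style)
```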
本实施方式中,通过微调阶段调整第一模型和第二模型的网络参数,可以提高模型训练的精度。
需要说明的是,本申请实施例提供的图像生成方法,执行主体可以为图像生成装置,或者图像生成装置中的用于执行图像生成方法的控制模块。本申请实施例中以图像生成装置执行图像生成方法为例,说明本申请实施例提供的图像生成装置。
参见图11,图11是本申请实施例提供的图像生成装置的结构图,如图11所示,图像生成装置1100包括:
第一获取模块1101,用于获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;
第一特征处理模块1102,用于基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;
特征拼接模块1103,用于对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;
图像构建模块1104,用于基于所述第一目标特征向量进行图像构建,得到第三图像。
可选的,所述第一特征处理模块1102包括:
第一特征编码单元,用于对所述第一图像进行第一特征编码,得到所述第一图像的第一特征图像;
第二特征编码单元,用于对所述第一特征图像进行第二特征编码,得到所述第一特征向量。
可选的,所述第一特征编码单元,具体用于:
对所述第一图像进行特征提取,得到所述第一图像的第二特征图像;
基于目标注意力机制,提取所述第二特征图像在所述目标注意力机制对应维度上的注意力向量,所述目标注意力机制包括在通道维度上的注意力机制、在空间维度上的注意力机制中的至少一项;
将所述注意力向量和所述第二特征图像进行相乘处理,得到第三特征图像;
基于所述第三特征图像,确定所述第一特征图像;
其中,所述第一特征编码包括所述特征提取和所述注意力向量的提取。
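上述第一特征编码中注意力向量的提取与相乘过程可示意如下(此处以通道维度注意力为例,网络结构与通道数均为示意性假设):

```python
import torch.nn as nn

class AttentionEncodeBlock(nn.Module):
    """示意:先进行特征提取得到第二特征图像,再提取通道维度注意力向量并与之相乘。"""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.extract = nn.Sequential(                      # 特征提取
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.channel_attn = nn.Sequential(                 # 通道维度上的注意力机制
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid())

    def forward(self, x):
        feat2 = self.extract(x)                            # 第二特征图像
        attn = self.channel_attn(feat2)                    # 注意力向量,形状为(B, C, 1, 1)
        feat3 = feat2 * attn                               # 相乘得到第三特征图像
        return feat3                                       # 可在此基础上确定第一特征图像
```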
可选的,所述图像构建模块1104包括:
第一特征解码单元,用于对所述第一目标特征向量进行第一特征解码,得到第四特征图像;
第二特征解码单元,用于对所述第四特征图像进行第二特征解码,得到第五特征图像,所述第五特征图像的尺寸与所述第一特征图像的尺寸相同;
拼接操作单元,用于将所述第一特征图像和所述第五特征图像进行拼接操作,得到第六特征图像;
第三特征解码单元,用于对所述第六特征图像进行第三特征解码,得到所述第三图像。
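图像构建模块中逐级解码并与第一特征图像拼接的过程可示意如下(各层结构、通道数及特征图尺寸均为示意性假设):

```python
import torch
import torch.nn as nn

class BuildImageDecoder(nn.Module):
    """示意:对第一目标特征向量逐级解码,并与第一特征图像拼接后重建第三图像。
    假设第一特征图像first_feat的形状为(B, mid_ch, 32, 32)。"""
    def __init__(self, vec_dim=512, mid_ch=64, out_ch=3):
        super().__init__()
        self.first_decode = nn.Linear(vec_dim, mid_ch * 8 * 8)              # 第一特征解码
        self.second_decode = nn.Sequential(                                  # 第二特征解码:上采样到第一特征图像尺寸
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.third_decode = nn.Conv2d(mid_ch * 2, out_ch, 3, padding=1)      # 第三特征解码

    def forward(self, target_vec, first_feat):
        b = target_vec.size(0)
        fourth = self.first_decode(target_vec).view(b, -1, 8, 8)             # 第四特征图像
        fifth = self.second_decode(fourth)                                   # 第五特征图像
        sixth = torch.cat([first_feat, fifth], dim=1)                        # 拼接得到第六特征图像
        return self.third_decode(sixth)                                      # 第三图像
```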
可选的,所述第二风格包括第一目标风格和第二目标风格;所述第一特征解码单元,具体用于:
对所述第一目标风格对应的所述第一目标特征向量进行第一解码操作,得到第七特征图像;
将所述第七特征图像和第八特征图像进行拼接操作,得到第九特征图像,所述第八特征图像是对所述第二目标风格对应的所述第一目标特征向量进行所述第一解码操作得到的;
对所述第九特征图像进行第二解码操作,得到所述第四特征图像。
可选的,所述第二图像的数量为M,M为正整数,所述装置还包括:
第二特征处理模块,用于分别对每个所述第二图像进行第二特征处理,得到M个第三特征向量,一个所述第三特征向量与一个所述第二图像对应,所述第三特征向量用于表征所述第二图像的图像风格;
平均处理模块,用于对所述M个第三特征向量进行平均处理,得到所述第二特征向量。
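对M个第三特征向量进行平均处理的操作可示意如下(其中encode_style表示假设的第二特征处理函数):

```python
import torch

def average_style_vector(style_images, encode_style):
    """示意:分别提取每个第二图像的第三特征向量,再取平均得到第二特征向量。"""
    style_vectors = [encode_style(img) for img in style_images]   # M个第三特征向量
    return torch.stack(style_vectors, dim=0).mean(dim=0)          # 平均处理得到第二特征向量
```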
可选的,所述第三图像的数量包括N个,N为大于1的整数,所述装置包括:
第二获取模块,用于获取合成位置位于两个目标图像之间的第四图像,所述第四图像关于第一颜色空间的第一像素信息是基于所述两个目标图像关于所述第一颜色空间的第二像素信息确定的,所述两个目标图像为N个所述第三图像中相邻的两个图像;
像素调整模块,用于基于N个所述第三图像关于第二颜色空间的N个第三像素信息,对所述第四图像关于所述第二颜色空间的第四像素信息进行调整,得到第五图像;
合成模块,用于合成N个所述第三图像和所述第五图像。
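上述获取第四图像、调整得到第五图像并进行合成的流程可简化示意如下(其中以相邻两帧取平均得到第四图像、以整体亮度近似第二颜色空间的像素信息,均为示意性假设):

```python
import numpy as np

def synthesize_sequence(frames):
    """示意:在相邻的第三图像之间插入调整后的中间帧(第五图像)后合成序列。
    frames为N张第三图像,假设均为形状(H, W, 3)的RGB数组。"""
    out = []
    mean_brightness = np.mean([f.mean() for f in frames])        # 基于N个第三像素信息
    for i in range(len(frames) - 1):
        out.append(frames[i])
        fourth = (frames[i].astype(np.float32) + frames[i + 1]) / 2.0   # 第四图像:相邻两帧平均
        scale = mean_brightness / max(fourth.mean(), 1e-6)              # 按整体亮度调整(示意)
        fifth = np.clip(fourth * scale, 0, 255).astype(frames[i].dtype) # 第五图像
        out.append(fifth)
    out.append(frames[-1])
    return out
```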
本实施例中,通过获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;基于所述第一目标特征向量进行图像构建,得到第三图像。如此,可以基于目标模型实现图像的图像风格从第一风格到第二风格的转换,并可以保持所生成的第三图像的图像内容与所输入的第一图像的图像内容相同,从而可以提高所生成的图像质量。
本申请实施例中的图像生成装置可以是装置,也可以是电子设备中的部件、集成电路、或芯片。该装置可以是移动电子设备,也可以为非移动电子设备。示例性的,移动电子设备可以为手机、平板电脑、笔记本电脑、掌上电脑、车载电子设备、可穿戴设备、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本或者个人数字助理(personal digital assistant,PDA)等,非移动电子设备可以为服务器、网络附属存储器(Network Attached Storage,NAS)、个人计算机(personal computer,PC)、电视机(television,TV)、柜员机或者自助机等,本申请实施例不作具体限定。
本申请实施例中的图像生成装置可以为具有操作系统的装置。该操作系统可以为安卓(Android)操作系统,可以为iOS操作系统,还可以为其他可能的操作系统,本申请实施例不作具体限定。
本申请实施例提供的图像生成装置能够实现图1的方法实施例实现的各个过程,为避免重复,这里不再赘述。
需要说明的是,本申请实施例提供的模型训练方法,执行主体可以为模型训练装置,或者模型训练装置中的用于执行模型训练方法的控制模块。本申请实施例中以模型训练装置执行模型训练方法为例,说明本申请实施例提供的模型训练装置。
参见图12,图12是本申请实施例提供的模型训练装置的结构图,如图12所示,模型训练装置1200包括:
第三获取模块1201,用于获取训练样本数据,所述训练样本数据包括第一样本图像,以及用于表征第一样本风格的第四特征向量;
第一特征处理模块1202,用于对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第五特征向量用于表征所述第一样本图像的图像内容;
特征拼接模块1203,用于对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;
图像构建模块1204,用于基于所述第二目标特征向量进行图像构建,得到第一输出图像;
第一确定模块1205,用于基于所述第一输出图像和所述第五特征向量,确定目标模型的第一网络损失值;
第一更新模块1206,用于基于所述第一网络损失值,更新所述目标模型的网络参数;
其中,在满足第一预设条件的情况下,所述目标模型训练完成,所述第一预设条件包括:所述第一网络损失值小于第一预设阈值,和/或,所述目标模型的训练迭代次数大于第二预设阈值。
可选的,所述目标模型包括第一模型和第二模型,所述第一模型用于:对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第二模型用于:对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;基于所述第二目标特征向量进行图像构建,得到所述第一输出图像;所述目标模型的训练阶段包括第一阶段和第二阶段,所述第二阶段位于所述第一阶段之后;
所述第一更新模块1206,具体用于:
在所述目标模型的训练阶段位于所述第一阶段的情况下,基于所述第一网络损失值,更新所述第二模型的网络参数,其中,所述第一模型的网络参数固定不变;
在所述目标模型的训练阶段位于所述第二阶段的情况下,基于所述第一网络损失值,更新所述第一模型和所述第二模型的网络参数;
其中,在满足第二预设条件的情况下,所述目标模型的训练阶段位于所述第一阶段,所述第二预设条件包括:所述第一网络损失值大于或等于第三预设阈值,和/或,所述目标模型的训练迭代次数小于或等于第四预设阈值,所述第三预设阈值大于所述第一预设阈值,所述第四预设阈值小于所述第二预设阈值。
可选的,所述训练样本数据还包括:K个第二样本图像,所述K个第二样本图像包括:具有相同图像内容,但图像风格不同的样本图像,以及具有相同图像风格,但图像内容不同的样本图像,K为大于2的整数;所述装置还包括:
目标特征处理模块,用于基于所述第一模型对所述K个第二样本图像进行目标特征处理,得到K个第六特征向量和K个第七特征向量,所述第六特征向量用于表征所述第二样本图像的图像内容,所述第七特征向量用于表征所述第二样本图像的图像风格,所述目标特征处理包括所述第一特征处理;
第二确定模块,用于基于所述K个第六特征向量和所述K个第七特征向量,确定所述第一模型的第二网络损失值;
第二更新模块,用于基于所述第二网络损失值,更新所述第一模型的网络参数,其中,在所述第二网络损失值小于第五预设阈值的情况下,所述第一模型训练完成。
可选的,在所述目标模型的训练阶段位于所述第一阶段的情况下,所述第一确定模块1205,具体用于:
在所述第一模型训练完成的情况下,基于所述第一模型对所述第一输出图像进行目标特征处理,得到第八特征向量和第九特征向量,所述第八特征向量用于表征所述第一输出图像的图像内容,所述第九特征向量用于表征所述第一输出图像的图像风格;
将所述第八特征向量与所述第五特征向量进行比对,确定第一损失值;以及将所述第九特征向量和所述第四特征向量进行比对,得到第二损失值;
将所述第一损失值和所述第二损失值进行聚合,得到所述第一网络损失值。
本实施例中,通过获取训练样本数据,所述训练样本数据包括第一样本图像,以及用于表征第一样本风格的第四特征向量;对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第五特征向量用于表征所述第一样本图像的图像内容;对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;基于所述第二目标特征向量进行图像构建,得到第一输出图像;基于所述第一输出图像和所述第五特征向量,确定目标模型的第一网络损失值;基于所述第一网络损失值,更新所述目标模型的网络参数;其中,在满足第一预设条件的情况下,所述目标模型训练完成,所述第一预设条件包括:所述第一网络损失值小于第一预设阈值,和/或,所述目标模型的训练迭代次数大于第二预设阈值。如此,可以实现目标模型的训练,使得该目标模型可以用于图像风格转换,提高所生成的图像质量。
本申请实施例中的模型训练装置可以是装置,也可以是电子设备中的部件、集成电路、或芯片。该装置可以是移动电子设备,也可以为非移动电子设备。示例性的,移动电子设备可以为手机、平板电脑、笔记本电脑、掌上电脑、车载电子设备、可穿戴设备、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本或者个人数字助理(personal digital assistant,PDA)等,非移动电子设备可以为服务器、网络附属存储器(Network Attached Storage,NAS)、个人计算机(personal computer,PC)、电视机(television,TV)、柜员机或者自助机等,本申请实施例不作具体限定。
本申请实施例中的模型训练装置可以为具有操作系统的装置。该操作系统可以为安卓(Android)操作系统,可以为iOS操作系统,还可以为其他可能的操作系统,本申请实施例不作具体限定。
本申请实施例提供的模型训练装置能够实现图10的方法实施例实现的各个过程,为避免重复,这里不再赘述。
可选地,如图13所示,本申请实施例还提供一种电子设备1300,包括处理器1301,存储器1302,存储在存储器1302上并可在所述处理器1301上运行的程序或指令,该程序或指令被处理器1301执行时实现上述图像生成方法实施例的各个过程,或者实现上述模型训练方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。
需要说明的是,本申请实施例中的电子设备包括上述所述的移动电子设备和非移动电子设备。
图14为实现本申请实施例的一种电子设备的硬件结构示意图。
该电子设备1400包括但不限于:射频单元1401、网络模块1402、音频输出单元1403、输入单元1404、传感器1405、显示单元1406、用户输入单元1407、接口单元1408、存储器1409、以及处理器1410等部件。
本领域技术人员可以理解,电子设备1400还可以包括给各个部件供电的电源(比如电池),电源可以通过电源管理系统与处理器1410逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。图14中示出的电子设备结构并不构成对电子设备的限定,电子设备可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置,在此不再赘述。
该电子设备可以用于执行图像生成方法,其中,处理器1410,用于:
获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;
基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;
对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;
基于所述第一目标特征向量进行图像构建,得到第三图像。
本实施例中,通过获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;基于所述第一目标特征向量进行图像构建,得到第三图像。如此,可以基于目标模型实现图像的图像风格从第一风格到第二风格的转换,并可以保持所生成的第三图像的图像内容与所输入的第一图像的图像内容相同,从而可以提高所生成的图像质量。
可选的,处理器1410,还用于:
对所述第一图像进行第一特征编码,得到所述第一图像的第一特征图像;
对所述第一特征图像进行第二特征编码,得到所述第一特征向量。
可选的,处理器1410,还用于:
对所述第一图像进行特征提取,得到所述第一图像的第二特征图像;
基于目标注意力机制,提取所述第二特征图像在所述目标注意力机制对应维度上的注意力向量,所述目标注意力机制包括在通道维度上的注意力机制、在空间维度上的注意力机制中的至少一项;
将所述注意力向量和所述第二特征图像进行相乘处理,得到第三特征图像;
基于所述第三特征图像,确定所述第一特征图像;
其中,所述第一特征编码包括所述特征提取和所述注意力向量的提取。
可选的,处理器1410,还用于:
对所述第一目标特征向量进行第一特征解码,得到第四特征图像;
对所述第四特征图像进行第二特征解码,得到第五特征图像,所述第五特征图像的尺寸与所述第一特征图像的尺寸相同;
将所述第一特征图像和所述第五特征图像进行拼接操作,得到第六特征图像;
对所述第六特征图像进行第三特征解码,得到所述第三图像。
可选的,所述第二风格包括第一目标风格和第二目标风格;处理器1410,还用于:
对所述第一目标风格对应的所述第一目标特征向量进行第一解码操作,得到第七特征图像;
将所述第七特征图像和第八特征图像进行拼接操作,得到第九特征图像,所述第八特征图像是对所述第二目标风格对应的所述第一目标特征向量进行所述第一解码操作得到的;
对所述第九特征图像进行第二解码操作,得到所述第四特征图像。
可选的,所述第二图像的数量为M,M为正整数,处理器1410,还用于:
分别对每个所述第二图像进行第二特征处理,得到M个第三特征向量,一个所述第三特征向量与一个所述第二图像对应,所述第三特征向量用于表征所述第二图像的图像风格;
对所述M个第三特征向量进行平均处理,得到所述第二特征向量。
可选的,所述第三图像的数量包括N个,N为大于1的整数,处理器1410,还用于:
获取合成位置位于两个目标图像之间的第四图像,所述第四图像关于第一颜色空间的第一像素信息是基于所述两个目标图像关于所述第一颜色空间的第二像素信息确定的,所述两个目标图像为N个所述第三图像中相邻的两个图像;
基于N个所述第三图像关于第二颜色空间的N个第三像素信息,对所述第四图像关于所述第二颜色空间的第四像素信息进行调整,得到第五图像;
合成N个所述第三图像和所述第五图像。
在一实施例中,该电子设备可以用于执行模型训练方法,其中,处理器1410,用于:
获取训练样本数据,所述训练样本数据包括第一样本图像,以及用于表征第一样本风格的第四特征向量;
对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第五特征向量用于表征所述第一样本图像的图像内容;对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;基于所述第二目标特征向量进行图像构建,得到第一输出图像;
基于所述第一输出图像和所述第五特征向量,确定目标模型的第一网络损失值;
基于所述第一网络损失值,更新所述目标模型的网络参数;
其中,在满足第一预设条件的情况下,所述目标模型训练完成,所述第一预设条件包括:所述第一网络损失值小于第一预设阈值,和/或,所述目标模型的训练迭代次数大于第二预设阈值。
可选的,所述目标模型包括第一模型和第二模型,所述第一模型用于:对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第二模型用于:对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;基于所述第二目标特征向量进行图像构建,得到所述第一输出图像;所述目标模型的训练阶段包括第一阶段和第二阶段,所述第二阶段位于所述第一阶段之后;
处理器1410,还用于:
在所述目标模型的训练阶段位于所述第一阶段的情况下,基于所述第一网络损失值,更新所述第二模型的网络参数,其中,所述第一模型的网络参数固定不变;
在所述目标模型的训练阶段位于所述第二阶段的情况下,基于所述第一网络损失值,更新所述第一模型和所述第二模型的网络参数;
其中,在满足第二预设条件的情况下,所述目标模型的训练阶段位于所述第一阶段,所述第二预设条件包括:所述第一网络损失值大于或等于第三预设阈值,和/或,所述目标模型的训练迭代次数小于或等于第四预设阈值,所述第三预设阈值大于所述第一预设阈值,所述第四预设阈值小于所述第二预设阈值。
可选的,所述训练样本数据还包括:K个第二样本图像,所述K个第二样本图像包括:具有相同图像内容,但图像风格不同的样本图像,以及具有相同图像风格,但图像内容不同的样本图像,K为大于2的整数;处理器1410,还用于:
基于所述第一模型对所述K个第二样本图像进行目标特征处理,得到K个第六特征向量和K个第七特征向量,所述第六特征向量用于表征所述第二样本图像的图像内容,所述第七特征向量用于表征所述第二样本图像的图像风格,所述目标特征处理包括所述第一特征处理;
基于所述K个第六特征向量和所述K个第七特征向量,确定所述第一模型的第二网络损失值;
基于所述第二网络损失值,更新所述第一模型的网络参数,其中,在所述第二网络损失值小于第五预设阈值的情况下,所述第一模型训练完成。
可选的,在所述目标模型的训练阶段位于所述第一阶段的情况下,处理器1410,还用于:
在所述第一模型训练完成的情况下,基于所述第一模型对所述第一输出图像进行目标特征处理,得到第八特征向量和第九特征向量,所述第八特征向量用于表征所述第一输出图像的图像内容,所述第九特征向量用于表征所述第一输出图像的图像风格;
将所述第八特征向量与所述第五特征向量进行比对,确定第一损失值;以及将所述第九特征向量和所述第四特征向量进行比对,得到第二损失值;
将所述第一损失值和所述第二损失值进行聚合,得到所述第一网络损失值。
应理解的是,本申请实施例中,输入单元1404可以包括图形处理器(Graphics Processing Unit,GPU)14041和麦克风14042,图形处理器14041对在视频捕获模式或图像捕获模式中由图像捕获装置(如摄像头)获得的静态图片或视频的图像数据进行处理。显示单元1406可包括显示面板14061,可以采用液晶显示器、有机发光二极管等形式来配置显示面板14061。用户输入单元1407包括触控面板14071以及其他输入设备14072。触控面板14071,也称为触摸屏。触控面板14071可包括触摸检测装置和触摸控制器两个部分。其他输入设备14072可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆,在此不再赘述。存储器1409可用于存储软件程序以及各种数据,包括但不限于应用程序和操作系统。处理器1410可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器1410中。
本申请实施例还提供一种可读存储介质,所述可读存储介质上存储有程序或指令,该程序或指令被处理器执行时实现上述图像生成方法实施例的各个过程,或者实现上述模型训练方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。
其中,所述处理器为上述实施例中所述的电子设备中的处理器。所述可读存储介质,包括计算机可读存储介质,如计算机只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等。
本申请实施例另提供了一种芯片,所述芯片包括处理器和通信接口,所述通信接口和所述处理器耦合,所述处理器用于运行程序或指令,实现上述图像生成方法实施例的各个过程,或者实现上述模型训练方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。
应理解,本申请实施例提到的芯片还可以称为系统级芯片、系统芯片、芯片系统或片上系统芯片等。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。此外,需要指出的是,本申请实施方式中的方法和装置的范围不限按示出或讨论的顺序来执行功能,还可包括根据所涉及的功能按基本同时的方式或按相反的顺序来执行功能,例如,可以按不同于所描述的次序来执行所描述的方法,并且还可以添加、省去、或组合各种步骤。另外,参照某些示例所描述的特征可在其他示例中被组合。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以计算机软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台电子设备(可以是手机,计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
上面结合附图对本申请的实施例进行了描述,但是本申请并不局限于上述的具体实施方式,上述的具体实施方式仅仅是示意性的,而不是限制性的,本领域的普通技术人员在本申请的启示下,在不脱离本申请宗旨和权利要求所保护的范围情况下,还可做出很多形式,均属于本申请的保护之内。
Claims (21)
- 一种图像生成方法,所述方法包括:获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;基于所述第一目标特征向量进行图像构建,得到第三图像。
- 根据权利要求1所述的方法,其中,所述对所述第一图像进行第一特征处理,得到第一特征向量,包括:对所述第一图像进行第一特征编码,得到所述第一图像的第一特征图像;对所述第一特征图像进行第二特征编码,得到所述第一特征向量。
- 根据权利要求2所述的方法,其中,所述第一特征编码包括特征提取和注意力向量的提取,所述对所述第一图像进行第一特征编码,得到所述第一图像的第一特征图像,包括:对所述第一图像进行特征提取,得到所述第一图像的第二特征图像;基于目标注意力机制,提取所述第二特征图像在所述目标注意力机制对应维度上的注意力向量,所述目标注意力机制包括在通道维度上的注意力机制、在空间维度上的注意力机制中的至少一项;将所述注意力向量和所述第二特征图像进行相乘处理,得到第三特征图像;基于所述第三特征图像,确定所述第一特征图像。
- 根据权利要求2所述的方法,其中,所述基于所述第一目标特征向量进行图像构建,得到第三图像,包括:对所述第一目标特征向量进行第一特征解码,得到第四特征图像;对所述第四特征图像进行第二特征解码,得到第五特征图像,所述第五特征图像的尺寸与所述第一特征图像的尺寸相同;将所述第一特征图像和所述第五特征图像进行拼接操作,得到第六特征图像;对所述第六特征图像进行第三特征解码,得到所述第三图像。
- 根据权利要求4所述的方法,其中,所述第二风格包括第一目标风格和第二目标风格;所述对所述第一目标特征向量进行第一特征解码,得到第四特征图像,包括:对所述第一目标风格对应的所述第一目标特征向量进行第一解码操作,得到第七特征图像;将所述第七特征图像和第八特征图像进行拼接操作,得到第九特征图像,所述第八特征图像是对所述第二目标风格对应的所述第一目标特征向量进行所述第一解码操作得到的;对所述第九特征图像进行第二解码操作,得到所述第四特征图像。
- 根据权利要求1所述的方法,其中,所述第二图像的数量为M,M为正整数,所述对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量之前,所述方法还包括:分别对每个所述第二图像进行第二特征处理,得到M个第三特征向量,一个所述第三特征向量与一个所述第二图像对应,所述第三特征向量用于表征所述第二图像的图像风格;对所述M个第三特征向量进行平均处理,得到所述第二特征向量。
- 根据权利要求1所述的方法,其中,所述第三图像的数量包括N个,N为大于1的整数,所述基于所述第一目标特征向量进行图像构建,得到第三图像之后,所述方法还包括:获取合成位置位于两个目标图像之间的第四图像,所述第四图像关于第一颜色空间的第一像素信息是基于所述两个目标图像关于所述第一颜色空间的第二像素信息确定的,所述两个目标图像为N个所述第三图像中相邻的两个图像;基于N个所述第三图像关于第二颜色空间的N个第三像素信息,对所述第四图像关于所述第二颜色空间的第四像素信息进行调整,得到第五图像;合成N个所述第三图像和所述第五图像。
- 一种模型训练方法,其中,所述方法包括:获取训练样本数据,所述训练样本数据包括第一样本图像,以及用于表征第一样本风格的第四特征向量;对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第五特征向量用于表征所述第一样本图像的图像内容;对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;基于所述第二目标特征向量进行图像构建,得到第一输出图像;基于所述第一输出图像和所述第五特征向量,确定目标模型的第一网络损失值;基于所述第一网络损失值,更新所述目标模型的网络参数;其中,在满足第一预设条件的情况下,所述目标模型训练完成,所述第一预设条件包括:所述第一网络损失值小于第一预设阈值,和/或,所述目标模型的训练迭代次数大于第二预设阈值。
- 根据权利要求8所述的方法,其中,所述目标模型包括第一模型和第二模型,所述第一模型用于:对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第二模型用于:对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;基于所述第二目标特征向量进行图像构建,得到所述第一输出图像;所述目标模型的训练阶段包括第一阶段和第二阶段,所述第二阶段位于所述第一阶段之后;所述基于所述第一网络损失值,更新所述目标模型的网络参数,包括以下任一项:在所述目标模型的训练阶段位于所述第一阶段的情况下,基于所述第一网络损失值,更新所述第二模型的网络参数,其中,所述第一模型的网络参数固定不变;在所述目标模型的训练阶段位于所述第二阶段的情况下,基于所述第一网络损失值,更新所述第一模型和所述第二模型的网络参数;其中,在满足第二预设条件的情况下,所述目标模型的训练阶段位于所述第一阶段,所述第二预设条件包括:所述第一网络损失值大于或等于第三预设阈值,和/或,所述目标模型的训练迭代次数小于或等于第四预设阈值,所述第三预设阈值大于所述第一预设阈值,所述第四预设阈值小于所述第二预设阈值。
- 根据权利要求9所述的方法,其中,所述训练样本数据还包括:K个第二样本图像,所述K个第二样本图像包括:具有相同图像内容,但图像风格不同的样本图像,以及具有相同图像风格,但图像内容不同的样本图像,K为大于2的整数;所述基于所述第一网络损失值,更新所述目标模型的网络参数之前,所述方法还包括:基于所述第一模型对所述K个第二样本图像进行目标特征处理,得到K个第六特征向量和K个第七特征向量,所述第六特征向量用于表征所述第二样本图像的图像内容,所述第七特征向量用于表征所述第二样本图像的图像风格,所述目标特征处理包括所述第一特征处理;基于所述K个第六特征向量和所述K个第七特征向量,确定所述第一模型的第二网络损失值;基于所述第二网络损失值,更新所述第一模型的网络参数,其中,在所述第二网络损失值小于第五预设阈值的情况下,所述第一模型训练完成。
- 根据权利要求10所述的方法,其中,在所述目标模型的训练阶段位于所述第一阶段的情况下,所述基于所述第一输出图像和所述第五特征向量,确定所述目标模型的第一网络损失值,包括:在所述第一模型训练完成的情况下,基于所述第一模型对所述第一输出图像进行目标特征处理,得到第八特征向量和第九特征向量,所述第八特征向量用于表征所述第一输出图像的图像内容,所述第九特征向量用于表征所述第一输出图像的图像风格;将所述第八特征向量与所述第五特征向量进行比对,确定第一损失值;以及将所述第九特征向量和所述第四特征向量进行比对,得到第二损失值;将所述第一损失值和所述第二损失值进行聚合,得到所述第一网络损失值。
- 一种图像生成装置,所述装置包括:第一获取模块,用于获取图像风格为第一风格的第一图像,以及图像风格为第二风格的第二图像;第一特征处理模块,用于基于目标模型对所述第一图像进行第一特征处理,得到第一特征向量,所述第一特征向量用于表征所述第一图像的图像内容;特征拼接模块,用于对所述第一特征向量和第二特征向量进行拼接操作,得到第一目标特征向量,所述第二特征向量基于所述第二风格的第二图像确定,所述第二特征向量用于表征所述第二图像的图像风格;图像构建模块,用于基于所述第一目标特征向量进行图像构建,得到第三图像。
- 根据权利要求12所述的装置,其中,所述第一特征处理模块包括:第一特征编码单元,用于对所述第一图像进行第一特征编码,得到所述第一图像的第一特征图像;第二特征编码单元,用于对所述第一特征图像进行第二特征编码,得到所述第一特征向量。
- 根据权利要求13所述的装置,其中,所述第一特征编码单元,具体用于:对所述第一图像进行特征提取,得到所述第一图像的第二特征图像;基于目标注意力机制,提取所述第二特征图像在所述目标注意力机制对应维度上的注意力向量,所述目标注意力机制包括在通道维度上的注意力机制、在空间维度上的注意力机制中的至少一项;将所述注意力向量和所述第二特征图像进行相乘处理,得到第三特征图像;基于所述第三特征图像,确定所述第一特征图像;其中,所述第一特征编码包括所述特征提取和所述注意力向量的提取。
- 根据权利要求13所述的装置,其中,所述图像构建模块包括:第一特征解码单元,用于对所述第一目标特征向量进行第一特征解码,得到第四特征图像;第二特征解码单元,用于对所述第四特征图像进行第二特征解码,得到第五特征图像,所述第五特征图像的尺寸与所述第一特征图像的尺寸相同;拼接操作单元,用于将所述第一特征图像和所述第五特征图像进行拼接操作,得到第六特征图像;第三特征解码单元,用于对所述第六特征图像进行第三特征解码,得到所述第三图像。
- 根据权利要求15所述的装置,其中,所述第二风格包括第一目标风格和第二目标风格;所述第一特征解码单元,具体用于:对所述第一目标风格对应的所述第一目标特征向量进行第一解码操作,得到第七特征图像;将所述第七特征图像和第八特征图像进行拼接操作,得到第九特征图像,所述第八特征图像是对所述第二目标风格对应的所述第一目标特征向量进行所述第一解码操作得到的;对所述第九特征图像进行第二解码操作,得到所述第四特征图像。
- 根据权利要求12所述的装置,其中,所述第二图像的数量为M,M为正整数,所述装置还包括:第二特征处理模块,用于分别对每个所述第二图像进行第二特征处理,得到M个第三特征向量,一个所述第三特征向量与一个所述第二图像对应,所述第三特征向量用于表征所述第二图像的图像风格;平均处理模块,用于对所述M个第三特征向量进行平均处理,得到所述第二特征向量。
- 根据权利要求12所述的装置,其中,所述第三图像的数量包括N个,N为大于1的整数,所述装置包括:第二获取模块,用于获取合成位置位于两个目标图像之间的第四图像,所述第四图像关于第一颜色空间的第一像素信息是基于所述两个目标图像关于所述第一颜色空间的第二像素信息确定的,所述两个目标图像为N个所述第三图像中相邻的两个图像;像素调整模块,用于基于N个所述第三图像关于第二颜色空间的N个第三像素信息,对所述第四图像关于所述第二颜色空间的第四像素信息进行调整,得到第五图像;合成模块,用于合成N个所述第三图像和所述第五图像。
- 一种模型训练装置,所述装置包括:第三获取模块,用于获取训练样本数据,所述训练样本数据包括第一样本图像,以及用于表征第一样本风格的第四特征向量;第一特征处理模块,用于对所述第一样本图像进行第一特征处理,得到第五特征向量,所述第五特征向量用于表征所述第一样本图像的图像内容;特征拼接模块,用于对所述第五特征向量和所述第四特征向量进行拼接操作,得到第二目标特征向量;图像构建模块,用于基于所述第二目标特征向量进行图像构建,得到第一输出图像;第一确定模块,用于基于所述第一输出图像和所述第五特征向量,确定目标模型的第一网络损失值;第一更新模块,用于基于所述第一网络损失值,更新所述目标模型的网络参数;其中,在满足第一预设条件的情况下,所述目标模型训练完成,所述第一预设条件包括:所述第一网络损失值小于第一预设阈值,和/或,所述目标模型的训练迭代次数大于第二预设阈值。
- 一种电子设备,包括处理器,存储器及存储在所述存储器上并可在所述处理器上运行的程序或指令,所述程序或指令被所述处理器执行时实现如权利要求1-7任一项所述的图像生成方法的步骤,或者,如权利要求8-11任一项所述的模型训练方法的步骤。
- 一种可读存储介质,所述可读存储介质上存储程序或指令,所述程序或指令被处理器执行时实现如权利要求1-7任一项所述的图像生成方法的步骤,或者,如权利要求8-11任一项所述的模型训练方法的步骤。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210840608.XA CN115222581A (zh) | 2022-07-18 | 2022-07-18 | 图像生成方法、模型训练方法、相关装置及电子设备 |
CN202210840608.X | 2022-07-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024017093A1 true WO2024017093A1 (zh) | 2024-01-25 |
Family
ID=83612811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/106800 WO2024017093A1 (zh) | 2022-07-18 | 2023-07-11 | 图像生成方法、模型训练方法、相关装置及电子设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115222581A (zh) |
WO (1) | WO2024017093A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115222581A (zh) * | 2022-07-18 | 2022-10-21 | 维沃移动通信有限公司 | 图像生成方法、模型训练方法、相关装置及电子设备 |
CN115512006B (zh) * | 2022-11-23 | 2023-04-07 | 有米科技股份有限公司 | 基于多图像元素的图像智能合成方法及装置 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111784566A (zh) * | 2020-07-01 | 2020-10-16 | 北京字节跳动网络技术有限公司 | 图像处理方法、迁移模型训练方法、装置、介质及设备 |
US20210365710A1 (en) * | 2019-02-19 | 2021-11-25 | Boe Technology Group Co., Ltd. | Image processing method, apparatus, equipment, and storage medium |
CN114581341A (zh) * | 2022-03-28 | 2022-06-03 | 杭州师范大学 | 一种基于深度学习的图像风格迁移方法及系统 |
CN114612289A (zh) * | 2022-03-03 | 2022-06-10 | 广州虎牙科技有限公司 | 风格化图像生成方法、装置及图像处理设备 |
CN115222581A (zh) * | 2022-07-18 | 2022-10-21 | 维沃移动通信有限公司 | 图像生成方法、模型训练方法、相关装置及电子设备 |
Also Published As
Publication number | Publication date |
---|---|
CN115222581A (zh) | 2022-10-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23842158; Country of ref document: EP; Kind code of ref document: A1 |