CN115424013A - Model training method, image processing apparatus, and medium - Google Patents

Model training method, image processing apparatus, and medium

Info

Publication number
CN115424013A
Authority
CN
China
Prior art keywords
vector
image
processing
network model
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210820156.9A
Other languages
Chinese (zh)
Inventor
司世景
王健宗
吴建汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210820156.9A
Publication of CN115424013A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The embodiments provide a model training method, an image processing method, an image processing device, and a medium, which belong to the technical field of artificial intelligence and comprise the following steps: segmenting a semantic image to be processed to obtain semantic image blocks; performing data preprocessing on the semantic image blocks to obtain feature vectors; inputting the feature vectors into an encoder in a generator of an initial network model for data processing to obtain an output vector; inputting the output vector into a multilayer perceptron in the generator for data mapping to obtain target image blocks; performing image reassembly on the target image blocks in the generator to obtain a target image; inputting a preset comparison image and the target image into a discriminator of the initial network model for discrimination to obtain an image discrimination value; and training the initial network model according to a preset loss function and the image discrimination value to obtain a generative adversarial network model. The trained generative adversarial network model can be applied to conditional generation tasks, effectively improving the performance of the model.

Description

Model training method, image processing apparatus, and medium
Technical Field
The embodiments of the application relate to the technical field of artificial intelligence, and in particular to a model training method, an image processing method, an image processing device, and a medium.
Background
With the development of artificial intelligence technology, generative adversarial network models are used increasingly often. In the related art, generative adversarial network models are usually applied to unconditional generation tasks; in the field of image processing, however, the tasks are often conditional generation tasks, which current generative adversarial network models are difficult to apply to, resulting in poor model performance.
Disclosure of Invention
The main purpose of the embodiments of the application is to provide a model training method, an image processing method, an image processing device, and a medium, such that the trained generative adversarial network model can be applied to conditional generation tasks, effectively improving the performance of the model.
In order to achieve the above object, a first aspect of an embodiment of the present application provides a training method for a model, where the training method includes:
obtaining a semantic image to be processed, and performing segmentation processing on the semantic image to be processed to obtain a plurality of semantic image blocks;
performing data preprocessing on the semantic image blocks to obtain a feature vector corresponding to each semantic image block;
inputting the feature vector into an encoder in a generator of an initial network model for data processing to obtain an output vector, wherein a fast Fourier transform module is arranged in the encoder and is used for extracting features of the feature vector;
inputting the output vector into a multilayer perceptron in the generator of the initial network model for data mapping processing to obtain a target image block;
performing image reassembly processing on the target image blocks in the generator of the initial network model to obtain a target image;
inputting a preset comparison image and the target image into a discriminator of the initial network model for discrimination processing to obtain an image discrimination value;
and training the initial network model according to a preset loss function and the image discrimination value to obtain a generative adversarial network model.
In some embodiments, the encoder includes a first encoding processing module and a second encoding processing module, and the fast Fourier transform module is disposed in each of the first encoding processing module and the second encoding processing module;
the inputting the feature vector into an encoder in a generator of the initial network model to perform data processing to obtain an output vector includes:
inputting the feature vector into the first coding processing module for feature coding to obtain a first vector, wherein the fast Fourier transform module in the first coding processing module is used for feature extraction of the feature vector;
and inputting the first vector into the second coding processing module for feature coding to obtain an output vector, wherein the fast Fourier transform module in the second coding processing module is used for performing feature extraction on the first vector.
In some embodiments, a multi-head self-attention module is further arranged in the first encoding processing module;
the inputting the feature vector into the first encoding processing module for feature encoding to obtain a first vector includes:
inputting the feature vector into the multi-head self-attention module of the first coding processing module for feature processing to obtain a second vector;
inputting the feature vector into the fast Fourier transform module of the first coding processing module for feature extraction to obtain a third vector;
and carrying out residual sum processing and normalization processing on the feature vector, the second vector and the third vector to obtain a first vector.
In some embodiments, a full connection layer is further disposed in the second encoding processing module;
the inputting the first vector into the second encoding processing module for feature encoding to obtain an output vector includes:
inputting the first vector into the full-connection layer of the second coding processing module for classification processing to obtain a fourth vector;
inputting the first vector into the fast Fourier transform module of the second encoding processing module for feature extraction to obtain a fifth vector;
and carrying out residual sum processing and normalization processing on the first vector, the fourth vector and the fifth vector to obtain an output vector.
In some embodiments, the fast Fourier transform module of the second encoding processing module comprises a first Fourier transform unit, a first convolution unit, a first activation layer, a second convolution unit, and a second Fourier transform unit;
inputting the first vector into the fast Fourier transform module of the second encoding processing module to perform feature extraction, so as to obtain a fifth vector, where the method includes:
inputting the first vector into the first Fourier transform unit for feature extraction to obtain first vector feature data;
sequentially carrying out convolution processing on the first vector characteristic data through the first convolution unit, activation processing on the first activation layer and convolution processing on the second convolution unit to obtain first target vector data;
and inputting the first target vector data into the second Fourier transform unit for feature extraction to obtain a fifth vector.
In some embodiments, training the initial network model according to a preset loss function and the image discrimination value to obtain a generative adversarial network model includes:
updating parameters corresponding to the generator and the discriminator in the initial network model according to the preset loss function to obtain an updated image discrimination value;
and when the updated image discrimination value is greater than or equal to a preset value, obtaining the generative adversarial network model through training.
In some embodiments, the performing data preprocessing on the semantic image blocks to obtain a feature vector corresponding to each semantic image block includes:
performing position coding processing on the semantic image blocks to obtain position coding data corresponding to each semantic image block;
and inputting the position encoding data into a preset linear layer for linear processing to obtain a feature vector corresponding to each semantic image block.
A second aspect of the embodiments of the present application provides an image processing method, including:
acquiring an original semantic segmentation image;
inputting the original semantic segmentation image into a generative adversarial network model for image processing to obtain a target semantic image, wherein the generative adversarial network model is obtained by training according to the training method of any one of the embodiments of the first aspect of the embodiments of the present application.
A third aspect of embodiments of the present application proposes a computer device, which includes a memory and a processor, where the memory stores a program, and when the program is executed by the processor, the processor is configured to perform the method according to any one of the embodiments of the first aspect of embodiments of the present application or the method according to any one of the embodiments of the second aspect of embodiments of the present application.
A fourth aspect of an embodiment of the present application provides a storage medium, which is a computer-readable storage medium, and the storage medium stores computer-executable instructions for causing a computer to perform the method according to any one of the embodiments of the first aspect of the embodiment of the present application or the method according to any one of the embodiments of the second aspect of the embodiment of the present application.
According to the model training method, the image processing method, the device, and the medium provided by the embodiments of the application, a semantic image to be processed is obtained and segmented to obtain a plurality of semantic image blocks; data preprocessing is performed on the semantic image blocks to obtain a feature vector corresponding to each semantic image block; the feature vectors are input into an encoder in a generator of an initial network model for data processing to obtain an output vector, wherein the encoder is provided with a fast Fourier transform module used for feature extraction on the feature vectors; the output vector is input into a multilayer perceptron in the generator of the initial network model for data mapping to obtain target image blocks; image reassembly is performed on the target image blocks in the generator of the initial network model to obtain a target image; a preset comparison image and the target image are input into a discriminator of the initial network model for discrimination to obtain an image discrimination value; and the initial network model is trained according to a preset loss function and the image discrimination value to obtain a generative adversarial network model. By obtaining the semantic image to be processed and using it to train the initial network model, the trained generative adversarial network model can be applied to conditional generation tasks, effectively improving the performance of the model.
Drawings
FIG. 1 is a schematic flow chart of a method for training a model provided in an embodiment of the present application;
FIG. 2 is a schematic sub-flow chart of step S200 in FIG. 1;
FIG. 3 is a schematic sub-flow diagram of step S300 in FIG. 1;
FIG. 4 is a schematic sub-flowchart of step S310 in FIG. 3;
FIG. 5 is a schematic sub-flowchart of step S320 in FIG. 3;
FIG. 6 is a schematic sub-flow chart of step S322 in FIG. 5;
FIG. 7 is a schematic sub-flow diagram of step S700 in FIG. 1;
FIG. 8 is a diagram of a semantic image block provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an encoder in a generator provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of a fast Fourier transform module provided by an embodiment of the present application;
FIG. 11 is a flowchart illustrating an image processing method according to an embodiment of the present application;
FIG. 12 is a block diagram of a model training apparatus provided in an embodiment of the present application;
FIG. 13 is a block diagram of an image processing apparatus provided in an embodiment of the present application;
FIG. 14 is a hardware structure diagram of a computer device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are illustrated as being partitioned in a schematic diagram of an apparatus and logical order is illustrated in a flowchart, in some cases, the steps illustrated or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow diagrams depicted in the figures are merely exemplary in nature and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution order may be changed according to the actual situation.
First, several terms referred to in the present application are explained:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science, which attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): a discipline that studies the language problems of human interaction with computers. According to the difficulty of the technical implementation, such systems can be divided into three types: simple matching, fuzzy matching, and paragraph understanding. A simple-matching tutoring and answering system mainly matches the questions submitted by students with the relevant answer items in an answer library through simple keyword matching, thereby automatically answering questions or providing related tutoring. A fuzzy-matching tutoring and answering system additionally matches synonyms and antonyms, so that even if a student's original keyword does not directly match an answer in the answer library, a relevant answer item can still be found when words synonymous or antonymous with the keyword can be matched. A paragraph-understanding tutoring and answering system is the most ideal and truly intelligent tutoring and answering system (strictly speaking, the simple-matching and fuzzy-matching types can only be called "automatic" rather than "intelligent" tutoring and answering systems).
Generative Adversarial Networks (GAN): a deep generative model based on adversarial learning. A generative adversarial network comprises (at least) two components in its framework, a generator and a discriminator, which learn by playing a game against each other to produce good output; that is, the generator and the discriminator are trained simultaneously and compete in a minimax game. This adversarial setup avoids some of the difficulties traditional generative models face in practice, cleverly approximates otherwise intractable loss functions through adversarial learning, and is widely applied to the generation of data such as images, videos, natural language, and music.
Transformer: an attention-based encoder-decoder architecture that has not only transformed the field of natural language processing but has also produced pioneering work in computer vision. Compared with convolutional neural networks (CNNs), vision Transformers have stronger modeling capability and excellent performance. A Transformer consists of an encoding component, a decoding component, and the connections between them; the encoding component is composed of a stack of encoders, and the decoding component is composed of the same number of decoders.
Encoder-Decoder: a common model framework in deep learning. Many common applications are designed with an encoding-decoding framework; the Encoder and Decoder parts can handle arbitrary text, speech, image, or video data, and a wide variety of models can be designed based on the Encoder-Decoder structure.
Encoding (Encoder): coding is to convert the input sequence into a vector of fixed length.
Decoding (Decoder): converts the previously generated fixed-length vector into an output sequence. The input sequence can be text, speech, images, or video; the output sequence can be text or images.
Loss Function: a function that maps the value of a random event, or of its associated random variables, to a non-negative real number representing the "risk" or "loss" of that event. In applications, the loss function is usually associated with an optimization problem as a learning criterion, i.e. the model is solved and evaluated by minimizing the loss function. It is used, for example, for parameter estimation of models in statistics and machine learning, for risk management and decision making in macroeconomics, and in optimal control theory.
PIL (Python Imaging Library): a Python image processing library that contains many packages and offers rich functionality; it is one of the most commonly used image processing libraries in Python.
Multilayer perceptron (MLP): a feed-forward neural network. Each MLP comprises at least three layers of nodes. Except for the input nodes, each node is a neuron using a nonlinear activation function.
With the development of artificial intelligence technology, generative adversarial network models are used increasingly often. In the related art, a generative adversarial network model is generally applied to unconditional generation tasks; in the field of image processing, however, the tasks are often conditional generation tasks, which current generative adversarial network models are difficult to apply to, resulting in poor model performance.
Based on this, the embodiments of the application provide a model training method, an image processing method, a device, and a medium, in which a semantic image to be processed is obtained and segmented to obtain a plurality of semantic image blocks; data preprocessing is performed on the semantic image blocks to obtain a feature vector corresponding to each semantic image block; the feature vectors are input into an encoder in a generator of an initial network model for data processing to obtain an output vector, wherein the encoder is provided with a fast Fourier transform module used for feature extraction on the feature vectors; the output vector is input into a multilayer perceptron in the generator of the initial network model for data mapping to obtain target image blocks; image reassembly is performed on the target image blocks in the generator of the initial network model to obtain a target image; a preset comparison image and the target image are input into a discriminator of the initial network model for discrimination to obtain an image discrimination value; and the initial network model is trained according to the preset loss function and the image discrimination value to obtain a generative adversarial network model. By training the initial network model on the obtained semantic images to be processed, the trained generative adversarial network model can be applied to conditional generation tasks, and the performance of the model is effectively improved.
The embodiments of the present application provide a model training method, an image processing method and apparatus, a computer device, and a storage medium, which are described in detail through the following embodiments. First, the model training method in the embodiments of the present application is described.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. The artificial intelligence is a theory, a method, a technology and an application system which simulate, extend and expand human intelligence by using a digital computer or a machine controlled by the digital computer, sense the environment, acquire knowledge and obtain the best result by using the knowledge.
The artificial intelligence base technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiments of the application provide a model training method relating to the technical field of artificial intelligence. The model training method provided by the embodiments of the application can be applied to a terminal, to a server side, or to software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, smart watch, or the like; the server side can be configured as an independent physical server, as a server cluster or distributed system composed of a plurality of physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (content distribution network), and big data and artificial intelligence platforms; the software may be an application that implements the model training method, among others, but is not limited to the above forms.
Embodiments of the application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Referring to fig. 1, a training method of a model according to an embodiment of the first aspect of the present application includes, but is not limited to, steps S100 to S700.
Step S100, obtaining a semantic image to be processed, and performing segmentation processing on the semantic image to be processed to obtain a plurality of semantic image blocks;
step S200, carrying out data preprocessing on semantic image blocks to obtain a feature vector corresponding to each semantic image block;
step S300, inputting the characteristic vector into an encoder in a generator of the initial network model for data processing to obtain an output vector, wherein the encoder is provided with a fast Fourier transform module for extracting the characteristic of the characteristic vector;
step S400, inputting the output vector into a multilayer perceptron in a generator of the initial network model to perform data mapping processing to obtain a target image block;
step S500, carrying out image reforming processing on the target image block in a generator of the initial network model to obtain a target image;
step S600, inputting a preset comparison image and a target image into a discriminator of an initial network model for discrimination processing to obtain an image discrimination value;
and step S700, training the initial network model according to a preset loss function and an image discrimination value to obtain a generative adversarial network model.
In step S100 of some embodiments, the semantic image to be processed is segmented to obtain a plurality of semantic image blocks; the segmentation is a block-cutting operation performed on the semantic image to be processed and belongs to the data preprocessing stage. The segmentation may be implemented with Python packages, for example with PIL. Illustratively, the size of the acquired semantic image to be processed should be consistent with the size of the output target image. In some embodiments, step S100 may also be performed within the framework of the GAN model.
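As a concrete illustration of step S100, the following Python sketch cuts a semantic image into a regular grid of blocks using PIL. It is only a minimal example under assumed settings (an evenly divisible image and a 7 × 7 grid); the patent leaves the segmentation ratio configurable.

```python
# Minimal sketch of step S100: cutting a semantic image into blocks with PIL.
# The grid size is an illustrative assumption, not a value fixed by the patent.
from PIL import Image

def split_into_patches(path, grid=7):
    image = Image.open(path)
    width, height = image.size
    pw, ph = width // grid, height // grid
    patches = []
    for row in range(grid):
        for col in range(grid):
            box = (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)
            patches.append(image.crop(box))
    return patches  # grid * grid semantic image blocks
```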
In step S200 of some embodiments, since the semantic image blocks cannot be identified by the initial network model, data preprocessing needs to be performed on the semantic image blocks to obtain a feature vector corresponding to each semantic image block. The feature vector can be recognized by a computer and/or an initial network model to meet the requirement of model training, so that before the segmented semantic image blocks are input into the initial network model, data preprocessing needs to be performed on the semantic image blocks to obtain the feature vector meeting the requirement of model training. Illustratively, each semantic image block corresponds to one feature vector.
In step S300 of some embodiments, the generator of the initial network model includes an encoder, and all feature vectors are input into the encoder in the generator of the initial network model for data processing to obtain the output vector; step S300 is implemented by the encoder in the generator of the initial network model. Illustratively, the generator comprises a plurality of encoders, each provided with a fast Fourier transform module, and all the feature vectors are processed by each encoder in turn to obtain the final output vector. The fast Fourier transform module performs feature extraction on the feature vectors, which makes it easier to obtain the output vector, to attend to more global information, and to extract more numerous and more accurate features, thereby improving the accuracy of the output target image.
In step S400 of some embodiments, the generator of the initial network model includes a multilayer perceptron, and the multilayer perceptron performs data mapping on the output vector to obtain the target image blocks. The multilayer perceptron maps the input output vectors onto the output target image blocks, and data that is not linearly separable can be handled through the multilayer perceptron. Illustratively, a target image block is the realistic image block obtained after the semantic image to be processed has been processed.
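A minimal sketch of the data mapping in step S400, assuming PyTorch: a multilayer perceptron that maps one encoder output vector to the pixels of one target image block. The layer sizes and the 16 × 16 block size are illustrative assumptions, not values fixed by the application.

```python
# Sketch of step S400: an MLP mapping each output vector to one image block.
# All dimensions below are illustrative assumptions.
import torch.nn as nn

patch_mlp = nn.Sequential(
    nn.Linear(256, 512),          # encoder output dimension -> hidden layer
    nn.GELU(),
    nn.Linear(512, 16 * 16 * 3),  # hidden layer -> one 16x16 RGB target image block
)
```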
In step S500 of some embodiments, in the generator of the initial network model, an image reassembly process (the inverse of the segmentation performed during data preprocessing) is performed on the target image blocks to obtain a complete target image. Accordingly, steps S300 to S500 constitute the processing flow of the generator of the initial network model.
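A minimal sketch of the image reassembly in step S500 under the same assumptions (a square grid of fixed-size blocks); the shapes are illustrative.

```python
# Sketch of step S500: reassembling grid*grid predicted image blocks into one image.
import torch

def reassemble(patches, grid=7, patch=16, channels=3):
    # patches: (batch, grid*grid, channels*patch*patch)
    b = patches.size(0)
    x = patches.view(b, grid, grid, channels, patch, patch)
    x = x.permute(0, 3, 1, 4, 2, 5)                 # (b, c, grid, patch, grid, patch)
    return x.reshape(b, channels, grid * patch, grid * patch)
```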
Step S600 of some embodiments is the processing flow of the discriminator of the initial network model: a preset comparison image and the target image are input into the discriminator of the initial network model for discrimination processing to obtain an image discrimination value. The purpose of the discriminator is to distinguish between the target image generated by the generator and the comparison image, i.e. the provided real image. That is, the discriminator has two inputs, one being the target image and the other being the comparison image. The structure of the discriminator is similar to that of the generator in that it also includes a plurality of encoders; unlike the generator, whose output is an image, the output of the discriminator is a single value, the image discrimination value, which represents the degree of realism of the target image relative to the comparison image. In other words, the discriminator of the embodiment of the application compares the generated target image with the preset comparison image in order to discriminate the authenticity of the generated target image, so that it can better guide the generator to generate a more ideal target image. It should be noted that each target image corresponds to an image discrimination value; the larger the image discrimination value, the better and more realistic the generated target image. Illustratively, the image discrimination value lies in the range [0, 1]. It should be noted that steps S100 to S600 constitute the forward process.
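The following sketch illustrates a discriminator of this kind, assuming PyTorch: patch tokens pass through a stack of Transformer encoders and are reduced to a single score in [0, 1]. For brevity, the class feature described below is approximated here by mean pooling over the tokens; all layer sizes are illustrative assumptions.

```python
# Sketch of step S600: a Transformer-style discriminator producing one realism score.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Linear(16 * 16 * 3, dim)      # flatten and embed each image block
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, patch_tokens):                  # (batch, num_patches, 768)
        x = self.encoder(self.embed(patch_tokens))
        return self.head(x.mean(dim=1))               # image discrimination value in [0, 1]
```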
In step S700 of some embodiments, during the training of the initial network model, the parameters corresponding to the generator and the discriminator are updated through a backward process, and the parameter updating is implemented through a plurality of preset loss functions. Illustratively, the purpose of training is achieved by updating the parameters of the model so as to minimize the preset loss functions, i.e. the generative adversarial network model is obtained through training.
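A minimal sketch of this backward process, assuming PyTorch: the discriminator and generator parameters are updated alternately. Binary cross-entropy is used here as a stand-in for the full set of preset loss functions (the L1, cross-entropy and perceptual losses detailed later); the names are illustrative.

```python
# Sketch of step S700: alternating generator/discriminator updates.
import torch
import torch.nn.functional as F

def training_step(generator, discriminator, g_opt, d_opt,
                  semantic_patches, comparison_image):
    # Discriminator update: the comparison (real) image should score 1,
    # the generated target image should score 0.
    d_opt.zero_grad()
    target_image = generator(semantic_patches).detach()
    real_score = discriminator(comparison_image)
    fake_score = discriminator(target_image)
    d_loss = F.binary_cross_entropy(real_score, torch.ones_like(real_score)) + \
             F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score))
    d_loss.backward()
    d_opt.step()

    # Generator update: make the discriminator score the generated image as real.
    g_opt.zero_grad()
    score = discriminator(generator(semantic_patches))
    g_loss = F.binary_cross_entropy(score, torch.ones_like(score))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```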
It should be noted that, for the discriminator, the embodiment of the present application designs a Transformer decoder. Specifically, a class feature is introduced as the output of the discriminator (i.e. the image discrimination value); similar to the class token of the BERT model, this class feature is used to score the realism of the generated target image, so as to obtain the discriminator's realism score for the generated target image. The discriminator of the embodiment of the present application not only discriminates the sharpness of the target image but also the authenticity of the generated target image: if the generated target image is not the desired image, generating a clearer image is meaningless.
In the related art, Transformer-based generative adversarial network models are mainly designed around the model itself: the input is assumed to be random vectors (such as Gaussian noise) that can essentially be regarded as sequence information as in natural language processing, so a conventional Transformer encoder is transferred directly, and the generated image fed to the discriminator is judged against the real image using a ViT (Vision Transformer) structure. However, such work only addresses the traditional unconditional generation task. In practical applications, for example in the field of image processing, a generative adversarial network model may have requirements on its input, such as a semantic image as input, which suits more downstream tasks (such as image editing and style transfer). In contrast, with the model training method of the embodiment of the application, the trained generative adversarial network model can be applied to conditional generation tasks, effectively improving the performance of the model. Specifically, by designing a generative adversarial network model suitable for semantic image input (comprising a generator, a discriminator, and preset loss functions), the model not only retains the function of a conditional generative adversarial network but can also exploit the unique advantages of the Transformer.
As shown in fig. 2, it can be understood that the step S200 is to perform data preprocessing on the semantic image blocks to obtain the feature vector corresponding to each semantic image block, and the steps specifically include, but are not limited to, the steps S210 to S220:
step S210, carrying out position coding processing on semantic image blocks to obtain position coded data corresponding to each semantic image block;
step S220, inputting the position encoded data into a preset linear layer for linear processing, so as to obtain a feature vector corresponding to each semantic image block.
Since the semantic image blocks obtained through the segmentation processing cannot be directly recognized by the initial network model, data preprocessing needs to be performed on them. Specifically, each semantic image block obtained by segmentation is subjected to position encoding. Illustratively, the position encoding adds position information that can be recognized by a computer to each semantic image block, i.e. the position encoded data corresponding to each semantic image block. Each piece of position encoded data is then preprocessed by linear processing: the position encoded data is input into a preset linear layer to obtain the feature vector corresponding to each semantic image block. The feature vectors can be recognized by the computer and/or the initial network model and thus meet the requirements of model training. In the linear layer, each neuron is connected to all neurons of the previous layer, so as to realize a linear combination and linear transformation of the previous layer's output.
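A minimal sketch of steps S210 and S220, assuming PyTorch: sinusoidal position encoded data is added to each flattened semantic image block, which is then projected by a preset linear layer into a feature vector. The sinusoidal form and all dimensions are illustrative assumptions.

```python
# Sketch of steps S210-S220: position encoding followed by a preset linear layer.
import torch
import torch.nn as nn

def sinusoidal_positions(num_patches, dim):
    pos = torch.arange(num_patches).unsqueeze(1).float()
    i = torch.arange(0, dim, 2).float()
    angles = pos / torch.pow(10000.0, i / dim)
    enc = torch.zeros(num_patches, dim)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

class PatchPreprocessor(nn.Module):
    def __init__(self, patch_pixels=16 * 16 * 3, dim=256, num_patches=49):
        super().__init__()
        self.linear = nn.Linear(patch_pixels, dim)   # preset linear layer
        self.register_buffer("pos", sinusoidal_positions(num_patches, patch_pixels))

    def forward(self, patches):                      # (batch, num_patches, patch_pixels)
        return self.linear(patches + self.pos)       # one feature vector per image block
```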
As shown in fig. 3, it can be understood that the encoder includes a first encoding processing module and a second encoding processing module, and the first encoding processing module and the second encoding processing module are both provided with a fast Fourier transform module; step S300 is to input the feature vector into an encoder in the generator of the initial network model to perform data processing, so as to obtain an output vector, which specifically includes, but is not limited to, steps S310 to S320:
step S310, inputting the feature vector into a first coding processing module for feature coding to obtain a first vector, wherein a fast Fourier transform module in the first coding processing module is used for feature extraction of the feature vector;
step S320, inputting the first vector into the second encoding processing module for feature encoding to obtain an output vector, where the fast Fourier transform module in the second encoding processing module is used to perform feature extraction on the first vector.
Specifically, the encoder comprises a first encoding processing module and a second encoding processing module, and more global information can be acquired through the feature encoding of these two modules. The fast Fourier transform module in the first encoding processing module performs feature extraction on the feature vector, and the fast Fourier transform module in the second encoding processing module performs feature extraction on the first vector, so that more global information can be attended to through the fast Fourier transform modules. Illustratively, the fast Fourier transform module maps the features of a semantic image block carrying position encoded data (i.e. the feature vector or the first vector) into the frequency domain, where both the real part and the imaginary part are attended to, enriching the information contained. The embodiment of the application provides the fast Fourier transform module to obtain more global information, which helps the model extract richer image representations and improves the sharpness of the generated target image.
As shown in fig. 4, it can be understood that a multi-head self-attention module is further disposed in the first encoding processing module; step S310, inputting the feature vector into the first encoding processing module for feature encoding to obtain a first vector, which specifically includes, but is not limited to, step S311 to step S313:
step S311, inputting the feature vector into a multi-head self-attention module of the first coding processing module for feature processing to obtain a second vector;
step S312, inputting the feature vector into a fast Fourier transform module of the first coding processing module for feature extraction to obtain a third vector;
step S313, performing residual summation processing and normalization processing on the feature vector, the second vector and the third vector to obtain a first vector.
In order to acquire more global information, the attention mechanism of the Transformer is redesigned as a brand-new residual Fourier self-attention block. Specifically, the first encoding processing module is provided with a multi-head self-attention module and a fast Fourier transform module, and the second vector is obtained by inputting the feature vector into the multi-head self-attention module for feature processing. Illustratively, attention mechanisms are widely used in deep learning network structures and can improve the learning effect of a model. The multi-head self-attention module of the embodiment of the application divides each attention operation into groups (heads) so as to extract feature information along multiple dimensions. The feature vector is then input into the fast Fourier transform module for feature extraction to obtain the third vector; the fast Fourier transform module can attend to more global information, further improving the precision of the output target image. The first vector is obtained by performing residual summation and normalization on the feature vector, the second vector and the third vector; the residual summation helps prevent gradient explosion and network degradation, which is key to deepening the network.
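A minimal sketch of the first encoding processing module (steps S311 to S313), assuming PyTorch. The fast Fourier transform module is passed in as a submodule; one possible form of it is sketched after the description of steps S324 to S326 below. Dimensions are illustrative assumptions.

```python
# Sketch of steps S311-S313: self-attention branch + Fourier branch, then
# residual summation and normalization.
import torch.nn as nn

class FirstEncodingModule(nn.Module):
    def __init__(self, fourier_module, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fourier = fourier_module   # fast Fourier transform module (see later sketch)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feature_vector):  # (batch, num_patches, dim)
        second_vector, _ = self.attn(feature_vector, feature_vector, feature_vector)
        third_vector = self.fourier(feature_vector)
        # residual summation of the input and both branch outputs, then normalization
        return self.norm(feature_vector + second_vector + third_vector)
```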
As shown in fig. 5, it can be understood that a full connection layer is further provided in the second encoding processing module; step S320 is to input the first vector into the second encoding processing module for feature encoding to obtain an output vector, which includes, but is not limited to, steps S321 to S323:
step S321, inputting the first vector into a full connection layer of the second encoding processing module for classification processing to obtain a fourth vector;
step S322, inputting the first vector into a fast Fourier transform module of a second coding processing module for feature extraction to obtain a fifth vector;
and step S323, carrying out residual sum processing and normalization processing on the first vector, the fourth vector and the fifth vector to obtain an output vector.
Specifically, the second encoding processing module of the embodiment of the application is provided with a fully connected layer and a fast Fourier transform module, and the fourth vector is obtained by inputting the first vector into the fully connected layer of the second encoding processing module for classification processing. Illustratively, in a fully connected layer each node is connected to all nodes in the previous layer so as to integrate the features extracted by the previous layer (i.e. the first vector). Because of this fully connected nature, a fully connected layer typically also has the most parameters. Similar to an MLP, each neuron in a fully connected layer is fully connected to all neurons in the layer before it, and the fully connected layer can integrate local information with category distinctiveness. In order to improve the performance of the network, a ReLU (Rectified Linear Unit) function is generally adopted as the activation function of each neuron of the fully connected layer, and the output value of the last fully connected layer is passed on as the output, i.e. the fourth vector. The first vector is then input into the fast Fourier transform module of the second encoding processing module for feature extraction to obtain the fifth vector; the fast Fourier transform module can attend to more global information, further improving the precision of the output target image. Residual summation and normalization are performed on the first vector, the fourth vector and the fifth vector to obtain the output vector; the residual summation helps prevent gradient explosion and network degradation, which is key to deepening the network.
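A minimal sketch of the second encoding processing module (steps S321 to S323) under the same assumptions: a fully connected branch, a fast Fourier transform branch, and residual summation with normalization. Dimensions are illustrative.

```python
# Sketch of steps S321-S323: fully connected branch + Fourier branch, then
# residual summation and normalization.
import torch.nn as nn

class SecondEncodingModule(nn.Module):
    def __init__(self, fourier_module, dim=256, hidden=1024):
        super().__init__()
        self.fc = nn.Sequential(             # fully connected layer with ReLU activation
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )
        self.fourier = fourier_module        # fast Fourier transform module
        self.norm = nn.LayerNorm(dim)

    def forward(self, first_vector):         # (batch, num_patches, dim)
        fourth_vector = self.fc(first_vector)
        fifth_vector = self.fourier(first_vector)
        return self.norm(first_vector + fourth_vector + fifth_vector)
```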
As shown in fig. 6, it can be understood that the fast Fourier transform module of the second encoding processing module includes a first Fourier transform unit, a first convolution unit, a first activation layer, a second convolution unit, and a second Fourier transform unit; step S322 is to input the first vector into the fast Fourier transform module of the second encoding processing module for feature extraction, so as to obtain a fifth vector, which specifically includes but is not limited to step S324 to step S326:
step S324, inputting the first vector into a first Fourier transform unit for feature extraction to obtain first vector feature data;
step S325, the first vector characteristic data is subjected to convolution processing of a first convolution unit, activation processing of a first activation layer and convolution processing of a second convolution unit in sequence to obtain first target vector data;
step S326, inputting the first target vector data into the second fourier transform unit for feature extraction, so as to obtain a fifth vector.
In the embodiment of the application, the fast Fourier transform module arranged in the second encoding processing module allows more global information to be attended to, improving the precision of the output target image. Specifically, the fast Fourier transform module of the second encoding processing module includes a first Fourier transform unit, a first convolution unit, a first activation layer, a second convolution unit, and a second Fourier transform unit. Both the first Fourier transform unit and the second Fourier transform unit perform feature extraction, which makes it easier to attend to more global information. Illustratively, by mapping the first vector (and the first target vector data) into the frequency domain, both the real part and the imaginary part are attended to, enriching the information contained, helping the model extract image features that are richer in content and thereby improving the sharpness of the generated target image.
In step S325 of some embodiments, the first vector feature data is sequentially subjected to convolution by the first convolution unit, activation by the first activation layer, and convolution by the second convolution unit to obtain the first target vector data. The convolution of the first convolution unit and the convolution of the second convolution unit are both used for feature extraction on the first vector feature data; the convolution units are the core of the module, and the convolution operation serves two important purposes, dimension reduction and feature extraction. The first activation layer applies a nonlinear activation function to the linear output of the previous layer, so that arbitrary functions can be approximated, enhancing the representational capability of the network.
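A minimal PyTorch sketch of such a fast Fourier transform module (and, by the symmetry noted below, of the one in the first encoding processing module): a Fourier transform over the token dimension, a 1 × 1 convolution, an activation, a second 1 × 1 convolution, and an inverse Fourier transform, with the real and imaginary parts handled as channels. This is an illustrative reading of FIG. 10, not the patent's reference implementation.

```python
# Sketch of steps S324-S326: FFT -> 1x1 conv -> activation -> 1x1 conv -> inverse FFT.
import torch
import torch.nn as nn

class FourierBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.conv1 = nn.Conv1d(2 * dim, 2 * dim, kernel_size=1)  # first convolution unit
        self.act = nn.ReLU()                                      # first activation layer
        self.conv2 = nn.Conv1d(2 * dim, 2 * dim, kernel_size=1)  # second convolution unit

    def forward(self, x):                      # x: (batch, num_patches, dim)
        n = x.size(1)
        spec = torch.fft.rfft(x, dim=1)        # first Fourier transform unit
        feat = torch.cat([spec.real, spec.imag], dim=2)          # real/imag as channels
        feat = feat.transpose(1, 2)                              # channels first for Conv1d
        feat = self.conv2(self.act(self.conv1(feat)))            # first target vector data
        feat = feat.transpose(1, 2)
        real, imag = feat.chunk(2, dim=2)
        spec = torch.complex(real.contiguous(), imag.contiguous())
        return torch.fft.irfft(spec, n=n, dim=1)                 # second Fourier transform unit
```

Under these sketches, one encoder layer could, for example, be assembled as FirstEncodingModule(FourierBlock(256)) followed by SecondEncodingModule(FourierBlock(256)).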
It can be understood that the fast Fourier transform module of the first encoding processing module has the same structure as the fast Fourier transform module of the second encoding processing module. Specifically, the fast Fourier transform module of the first encoding processing module includes a third Fourier transform unit, a third convolution unit, a second activation layer, a fourth convolution unit, and a fourth Fourier transform unit; step S312, inputting the feature vector into the fast Fourier transform module of the first encoding processing module for feature extraction to obtain a third vector, specifically includes, but is not limited to, the following steps:
inputting the feature vector into a third Fourier transform unit for feature extraction to obtain second vector feature data; the second vector characteristic data are subjected to convolution processing of a third convolution unit, activation processing of a second activation layer and convolution processing of a fourth convolution unit in sequence to obtain second target vector data; and inputting the second target vector data into a fourth Fourier transform unit for feature extraction to obtain a third vector.
In the embodiment of the application, the fast Fourier transform module arranged in the first encoding processing module allows more global information to be attended to, improving the accuracy of the output target image. Specifically, the third Fourier transform unit and the fourth Fourier transform unit perform feature extraction, so that more global information is attended to. Illustratively, by mapping the feature vector (and the second target vector data) into the frequency domain, both the real part and the imaginary part are attended to, enriching the information contained, helping the model extract richer image representations and thereby improving the sharpness of the generated target image. The convolution of the third convolution unit and the convolution of the fourth convolution unit are both used for feature extraction on the second vector feature data; the convolution units are the core of the module, and the convolution operation serves two important purposes, dimension reduction and feature extraction. The second activation layer applies a nonlinear activation function to the linear output of the previous layer, so that arbitrary functions can be approximated, enhancing the representational capability of the network.
Illustratively, referring to fig. 9, an encoder in a generator of an embodiment of the present application is shown, the encoder including a first encoding processing module and a second encoding processing module. It will be appreciated that the illustration only shows one encoder, whereas in an actual network (i.e. generator) a plurality of encoders are included.
Illustratively, referring to FIG. 10, a fast Fourier transform module of an embodiment of the present application is shown. Specifically, in FIG. 10: the real-imaginary Fourier transform corresponds to the first Fourier transform unit (third Fourier transform unit), the first 1 × 1 convolution from top to bottom corresponds to the first convolution unit (third convolution unit), the activation layer corresponds to the first activation layer (second activation layer), the second 1 × 1 convolution from top to bottom corresponds to the second convolution unit (fourth convolution unit), and the real fast Fourier transform corresponds to the second Fourier transform unit (fourth Fourier transform unit).
As shown in FIG. 7, it can be understood that step S700 is to train the initial network model according to the preset loss function and the image discrimination value to obtain a generative adversarial network model, which specifically includes, but is not limited to, steps S710 to S720:
step S710, updating corresponding parameters of a generator and a discriminator in the initial network model according to a preset loss function to obtain an updated image discrimination value;
and step S720, when the updated image discrimination value is greater than or equal to the preset value, obtaining the generative adversarial network model through training.
Illustratively, an L1 loss function, a cross-entropy loss function, and a perceptual loss function are used as the preset loss functions in the embodiments of the present application, and the initial network model is trained using these preset loss functions and the image discrimination value. Specifically, the parameters corresponding to the generator and the discriminator of the initial network model are updated through the preset loss functions, and training is complete when the preset loss functions converge, yielding the generative adversarial network model. That is, the parameters corresponding to the generator and the discriminator in the initial network model are updated according to the preset loss functions to obtain an updated image discrimination value, and when the updated image discrimination value is greater than or equal to the preset value, the generative adversarial network model is obtained through training.
Illustratively, the L1 loss function is calculated as follows: L1 = |Pt - Pg|.
The calculation formulas of the cross-entropy loss function and the perceptual loss function appear in the original publication only as embedded formula images.
Here, Pt denotes the comparison image, Pg denotes the generated target image, N denotes the number of semantic image blocks, and φi denotes the features after the i-th ReLU layer of the VGG-19 network used in training.
The loss value is calculated using the comparison image (i.e. the real image); the preset loss functions, namely the L1 loss function, the cross-entropy loss function, and the perceptual loss function, constrain the generated target image to be as similar as possible to the comparison image, so that the initial network model is trained until convergence.
The L1 loss function and the cross entropy loss function adopted by the embodiment of the application make the generated target image as sharp as possible, while the perceptual loss function enforces consistency at the feature level, making the generated target image more realistic and better aligned with human perception.
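As one hedged illustration of the perceptual term, the sketch below compares features taken after several ReLU layers of a pre-trained VGG-19 and averages their L1 differences. The specific layer indices and the equal weighting of the layers are assumptions for illustration, not values stated in the present application.

import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    # L1 distance between VGG-19 features of the generated target image and the comparison image.
    def __init__(self, layer_ids=(3, 8, 17, 26)):            # indices just after selected ReLU layers (assumed)
        super().__init__()
        self.vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)                           # VGG-19 only serves as a fixed feature extractor
        self.layer_ids = set(layer_ids)

    def forward(self, generated, comparison):
        loss, x, y = 0.0, generated, comparison
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + torch.mean(torch.abs(x - y))    # feature-level consistency
        return loss / len(self.layer_ids)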
It can be understood that, in step S100, segmenting the semantic image to be processed to obtain a plurality of semantic image blocks includes, but is not limited to, the following step: performing segmentation processing on the semantic image to be processed according to a preset segmentation proportion to obtain a plurality of semantic image blocks.
Firstly, the input semantic image to be processed is segmented, and the preset segmentation proportion can be chosen freely. Referring to fig. 8, an input semantic image to be processed is exemplarily segmented into 7 × 7 semantic image blocks (patches). In some embodiments, the segmentation proportion can follow that of the Swin Transformer and can be adjusted through experiments.
Then, data preprocessing is performed on all the semantic image blocks to obtain a feature vector corresponding to each semantic image block. Exemplarily, all semantic image blocks are fed, as an input sequence, into an encoder similar to that of the Swin Transformer for position encoding; the resulting feature vectors serve as the characterizing information of the image. Illustratively, the position encoding can be expressed with sin and cos functions, which keeps the dimension of each semantic image block consistent.
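The segmentation and preprocessing steps can be sketched as follows. The 7 × 7 grid, the embedding dimension, and the sin/cos position code are illustrative, and the exact form and ordering of the position coding and the preset linear layer in the present application may differ.

import math
import torch
import torch.nn as nn

def split_into_patches(image, grid=7):
    # image: (batch, channels, H, W) -> (batch, grid*grid, patch_dim) semantic image blocks
    b, c, h, w = image.shape
    ph, pw = h // grid, w // grid
    patches = image.unfold(2, ph, ph).unfold(3, pw, pw)            # (b, c, grid, grid, ph, pw)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, grid * grid, c * ph * pw)

def sinusoidal_position_code(num_tokens, dim):
    # classic sin/cos position code, one row per semantic image block
    pos = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_tokens, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)[:, : dim // 2]
    return pe

class PatchPreprocessor(nn.Module):
    # data preprocessing: position coding of the semantic image blocks, then a preset linear layer
    def __init__(self, patch_dim, embed_dim=256, grid=7):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(patch_dim, embed_dim)                # preset linear layer

    def forward(self, image):
        patches = split_into_patches(image, self.grid)
        patches = patches + sinusoidal_position_code(patches.shape[1], patches.shape[-1])
        return self.proj(patches)                                  # feature vector per semantic image block

# illustrative call: features = PatchPreprocessor(3 * 32 * 32)(torch.randn(1, 3, 224, 224))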
In some embodiments, the present application may use a Swin Transformer-like architecture as a generator of an initial network model or a generative countermeasure network model.
According to the model training method provided by the embodiment of the application, a semantic image to be processed is obtained and is segmented to obtain a plurality of semantic image blocks; carrying out data preprocessing on the semantic image blocks to obtain a feature vector corresponding to each semantic image block; inputting the characteristic vector into an encoder in a generator of the initial network model for data processing to obtain an output vector, wherein the encoder is provided with a fast Fourier transform module which is used for extracting the characteristic of the characteristic vector; inputting the output vector into a multilayer perceptron in a generator of the initial network model to perform data mapping processing to obtain a target image block; performing image reformation on the target image block in a generator of the initial network model to obtain a target image; inputting a preset comparison image and a target image into a discriminator of an initial network model for discrimination processing to obtain an image discrimination value; and training the initial network model according to a preset loss function and an image discrimination value to obtain a generating type confrontation network model. According to the method and the device, the initial network model is trained by obtaining the semantic image to be processed, so that the trained generative confrontation network model can be applied to a conditional generation task, and the performance of the model is effectively improved.
The embodiment of the application provides a conditional generative confrontation network model based on a Transformer, that is, a generative confrontation network model whose input is a conditional image (such as a semantic image to be processed); the embodiment of the application also performs segmentation processing on the semantic image to be processed to realize image serialization. In addition, a brand-new residual Fourier self-attention block is designed, in which a fast Fourier transform module is added to the traditional self-attention block to obtain more global information; this helps the model extract richer image representations and improves the clarity of the generated target image.
In the related art, with the rapid development of the Transformer, it has spread from the NLP field to almost all fields. Its self-attention-based module extracts the features of interest, which are then used to complete various downstream tasks; because self-attention handles sequence information well, the Transformer appears in nearly every NLP task. Building on its excellent performance in NLP, many researchers have worked on applying Transformers to the visual field, from ViT, initially used for classification, to the Swin Transformer, which can be applied to downstream tasks such as semantic segmentation and target detection, in many cases surpassing traditional convolutional neural network models. However, few Transformers are currently applied to generation tasks. Although a ViT GAN model for generation tasks exists, it has only been applied to the GAN setting with an effect merely comparable to that of traditional CNN-based models, and current work mostly concerns unconditional generation tasks, whereas a conditional GAN model can be better applied to most downstream tasks, such as image style transfer and image editing.
Based on this, the embodiment of the application provides a conditional GAN model based on a Transformer, which improves the existing GAN model and further raises its performance while preserving conditional generation.
Referring to fig. 11, the embodiment of the present application further proposes an image processing method, including but not limited to steps S800 to S900:
step S800, obtaining an original semantic segmentation image;
step S900, inputting the original semantic segmentation image into a generative confrontation network model for image processing to obtain a target semantic image, where the generative confrontation network model is obtained by training according to the training method as any one of the embodiments of the first aspect of the embodiments of the present application.
For the generative confrontation network model, the input of the embodiment of the present application is the original semantic segmentation image, and the output is the target semantic image corresponding to the original semantic segmentation image. The generative confrontation network model in the embodiment of the application is a conditional model based on a Transformer, which not only retains the ability to generate a target semantic image, but also enables image editing on the basis of various prior images (such as an original semantic segmentation image). Compared with the related art, the embodiment of the application not only reuses a strong Transformer architecture from the related art, namely the encoder part of the Swin Transformer, to design the generative confrontation network model, but also redesigns the self-attention module of that architecture so that it attends to global information more strongly; the obtained features therefore carry richer information, which is conducive to generating a clearer target semantic image.
The embodiment of the application aims to generate a target semantic image meeting given requirements. Unlike the ViT GAN model in the related art, which takes a set of noise and generates a series of different images that are not necessarily the ones finally required, the embodiment of the present application generates a target semantic image meeting the requirements by setting a certain condition (such as a parsing map or a semantic segmentation map), thereby controlling the generative confrontation network model to produce the desired image. It can be understood that the clarity of the generated images is also greatly improved. In practical applications, the generative confrontation network model of the embodiment of the present application can be used to generate a desired image, namely the target semantic image, rather than a random image that merely conforms to a certain distribution. The input image may be a semantic segmentation map or a hand drawing of the desired image, and the output image corresponds to a real image. The embodiment of the application can be applied to scenarios such as drawing and image restoration.
It should be noted that, after the generative confrontation network model is obtained by training through the training method according to any one of the embodiments of the first aspect of the embodiments of the present application, that is, after the generator and the discriminator are trained, the testing, verification, or application stage follows. During these stages, no discriminator is needed; only the generator of the generative confrontation network model is used to generate the target semantic image.
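As a minimal usage sketch (the function and variable names below are ours, not the application's), the application stage then reduces to a single forward pass through the trained generator:

import torch

@torch.no_grad()
def generate_target_image(generator, original_semantic_segmentation_image):
    # only the trained generator is needed at the testing / verification / application stage
    generator.eval()
    return generator(original_semantic_segmentation_image)

# illustrative call: target = generate_target_image(trained_generator, torch.randn(1, 3, 224, 224))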
An embodiment of the present application further provides a training apparatus for a model, and referring to fig. 12, the training apparatus can implement the training method for the model, and includes: the system comprises a segmentation processing module 100, a pre-processing module 200, a data processing module 300, a mapping processing module 400, a reforming processing module 500, a discrimination processing module 600 and a training processing module 700.
Specifically, the segmentation processing module 100 is configured to obtain a semantic image to be processed, and perform segmentation processing on the semantic image to be processed to obtain a plurality of semantic image blocks; the preprocessing module 200 is configured to perform data preprocessing on the semantic image blocks to obtain a feature vector corresponding to each semantic image block; the data processing module 300 is configured to input the feature vector into an encoder in a generator of the initial network model to perform data processing, so as to obtain an output vector, where the encoder is provided with a fast fourier transform module, and the fast fourier transform module is configured to perform feature extraction on the feature vector; the mapping processing module 400 is configured to input the output vector to a multilayer perceptron in a generator of the initial network model to perform data mapping processing, so as to obtain a target image block; the reformation processing module 500 is configured to perform image reformation processing on the target image block in the generator of the initial network model to obtain a target image; the discrimination processing module 600 is configured to input a preset comparison image and a target image into a discriminator of the initial network model for discrimination processing to obtain an image discrimination value; the training processing module 700 is configured to perform training processing on the initial network model according to a preset loss function and an image discrimination value, so as to obtain a generative confrontation network model.
The training device of the model in the embodiment of the present application is used for executing the training method of the model in the above embodiment, and the specific processing procedure is the same as that of the training method of the model in the above embodiment, and is not described in detail here.
An embodiment of the present application further provides an image processing apparatus, and with reference to fig. 13, the image processing apparatus may implement the image processing method described above, and includes: an image acquisition module 800 and an image processing module 900. Specifically, the image obtaining module 800 is configured to obtain an original semantic segmentation image; the image processing module 900 is configured to input the original semantic segmentation image into the generative confrontation network model for image processing, so as to obtain a target semantic image, where the generative confrontation network model is obtained by training according to the training method in the embodiment of the first aspect of the present disclosure.
The image processing apparatus according to the embodiment of the present application is configured to execute the image processing method according to the embodiment, and a specific processing procedure of the image processing apparatus is the same as that of the image processing method according to the embodiment, which is not described in detail herein.
An embodiment of the present application further provides a computer device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions that are executable by the at least one processor to cause the at least one processor, when executing the instructions, to implement a training method as in an embodiment of the first aspect of an embodiment of the present application or an image processing method as in an embodiment of the second aspect of an embodiment of the present application.
The hardware configuration of the computer apparatus is described in detail below with reference to fig. 14. The computer device includes: a processor 510, a memory 520, an input/output interface 530, a communication interface 540, and a bus 550.
The processor 510 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided in the embodiments of the present application;
the memory 520 may be implemented in the form of a ROM (Read Only Memory), a static storage device, a dynamic storage device, or a RAM (Random Access Memory). The memory 520 may store an operating system and other application programs. When the technical solution provided by the embodiments of the present application is implemented by software or firmware, the relevant program codes are stored in the memory 520 and called by the processor 510 to execute the model training method or the image processing method of the embodiments of the present application;
an input/output interface 530 for implementing information input and output;
the communication interface 540 is used for realizing communication interaction between the device and other devices, and may realize communication in a wired manner (e.g., USB, network cable, etc.) or in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.); and
a bus 550 that transfers information between various components of the device, such as the processor 510, memory 520, input/output interfaces 530, and communication interfaces 540;
wherein processor 510, memory 520, input/output interface 530, and communication interface 540 are communicatively coupled to each other within the device via bus 550.
The embodiment of the present application further provides a storage medium, which is a computer-readable storage medium, and the computer-readable storage medium stores computer-executable instructions, which are used to enable a computer to execute the training method of the model of the embodiment of the present application or execute the image processing method of the embodiment of the present application.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the model training method, the image processing device and the medium, the semantic image to be processed is obtained, and is subjected to segmentation processing to obtain a plurality of semantic image blocks; performing data preprocessing on the semantic image blocks to obtain a feature vector corresponding to each semantic image block; inputting the characteristic vector into an encoder in a generator of the initial network model for data processing to obtain an output vector, wherein the encoder is provided with a fast Fourier transform module which is used for extracting the characteristic of the characteristic vector; inputting the output vector into a multilayer perceptron in a generator of the initial network model for data mapping processing to obtain a target image block; performing image reformation processing on a target image block in a generator of the initial network model to obtain a target image; inputting a preset comparison image and a target image into a discriminator of an initial network model for discrimination processing to obtain an image discrimination value; and training the initial network model according to a preset loss function and the image discrimination value to obtain a generating type confrontation network model. According to the method and the device, the initial network model is trained by obtaining the semantic image to be processed, so that the trained generative confrontation network model can be applied to conditional generative tasks, and the performance of the model is effectively improved.
The embodiments described in the embodiments of the present application are for more clearly illustrating the technical solutions in the embodiments of the present application, and do not constitute a limitation to the technical solutions provided in the embodiments of the present application, and it is obvious to those skilled in the art that the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems with the evolution of technology and the emergence of new application scenarios.
It will be understood by those skilled in the art that the technical solutions shown in fig. 1 to 7 and 11 do not limit the embodiments of the present application, and may include more or less steps than those shown in the figures, or combine some steps, or different steps.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicates that there may be three relationships, for example, "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the contextual objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b and c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the above-described units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, in essence or part of the technical solutions contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product, which is stored in a storage medium and includes multiple instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a usb disk, a portable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of training a model, the method comprising:
obtaining a semantic image to be processed, and performing segmentation processing on the semantic image to be processed to obtain a plurality of semantic image blocks;
performing data preprocessing on the semantic image blocks to obtain a feature vector corresponding to each semantic image block;
inputting the feature vector into an encoder in a generator of an initial network model for data processing to obtain an output vector, wherein a fast Fourier transform module is arranged in the encoder and used for extracting features of the feature vector;
inputting the output vector into a multilayer perceptron in the generator of the initial network model for data mapping processing to obtain a target image block;
performing image reforming processing on the target image block in the generator of the initial network model to obtain a target image;
inputting a preset comparison image and the target image into a discriminator of the initial network model for discrimination processing to obtain an image discrimination value;
and training the initial network model according to a preset loss function and the image discrimination value to obtain a generating type confrontation network model.
2. The training method according to claim 1, wherein the encoder includes a first encoding processing module and a second encoding processing module, and the fast fourier transform module is disposed in each of the first encoding processing module and the second encoding processing module;
the inputting the feature vector into an encoder in a generator of the initial network model to perform data processing to obtain an output vector includes:
inputting the feature vector into the first coding processing module for feature coding to obtain a first vector, wherein the fast fourier transform module in the first coding processing module is used for feature extraction of the feature vector;
and inputting the first vector into the second coding processing module for feature coding to obtain an output vector, wherein the fast Fourier transform module in the second coding processing module is used for performing feature extraction on the first vector.
3. The training method according to claim 2, wherein a multi-head self-attention module is further provided in the first encoding processing module;
the inputting the feature vector into the first encoding processing module for feature encoding to obtain a first vector includes:
inputting the feature vector into the multi-head self-attention module of the first coding processing module for feature processing to obtain a second vector;
inputting the feature vector into the fast Fourier transform module of the first coding processing module for feature extraction to obtain a third vector;
and performing residual summation processing and normalization processing on the feature vector, the second vector and the third vector to obtain a first vector.
4. The training method according to claim 2, wherein a full connection layer is further provided in the second encoding processing module;
the inputting the first vector into the second encoding processing module for feature encoding to obtain an output vector includes:
inputting the first vector into the full connection layer of the second coding processing module for classification processing to obtain a fourth vector;
inputting the first vector into the fast Fourier transform module of the second encoding processing module for feature extraction to obtain a fifth vector;
and performing residual sum processing and normalization processing on the first vector, the fourth vector and the fifth vector to obtain an output vector.
5. The training method according to claim 4, wherein the fast Fourier transform module of the second encoding processing module includes a first Fourier transform unit, a first convolution unit, a first active layer, a second convolution unit, and a second Fourier transform unit;
inputting the first vector into the fast fourier transform module of the second encoding processing module for feature extraction to obtain a fifth vector, including:
inputting the first vector into the first Fourier transform unit for feature extraction to obtain first vector feature data;
sequentially carrying out convolution processing on the first vector characteristic data through the first convolution unit, activation processing on the first activation layer and convolution processing on the second convolution unit to obtain first target vector data;
and inputting the first target vector data into the second Fourier transform unit for feature extraction to obtain a fifth vector.
6. The training method according to any one of claims 1 to 5, wherein the training of the initial network model according to a preset loss function and the image discrimination value to obtain a generative confrontation network model comprises:
updating parameters corresponding to the generator and the discriminator in the initial network model according to the preset loss function so as to obtain an updated image discrimination value;
and when the updated image discrimination value is greater than or equal to a preset value, training to obtain the generative confrontation network model.
7. The training method according to any one of claims 1 to 5, wherein the performing data preprocessing on the semantic image blocks to obtain a feature vector corresponding to each semantic image block includes:
performing position coding processing on the semantic image blocks to obtain position coding data corresponding to each semantic image block;
and inputting the position coding data into a preset linear layer for linear processing to obtain a feature vector corresponding to each semantic image block.
8. An image processing method, characterized by comprising:
acquiring an original semantic segmentation image;
inputting the original semantic segmentation image into a generative confrontation network model for image processing to obtain a target semantic image, wherein the generative confrontation network model is obtained by training according to the training method of any one of claims 1 to 7.
9. A computer device, characterized in that the computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor is configured to perform, when executing the computer program:
the training method of any one of claims 1 to 7; or
The image processing method according to claim 8.
10. A storage medium that is a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a computer, causes the computer to perform:
the training method of any one of claims 1 to 7; or
The image processing method according to claim 8.
CN202210820156.9A 2022-07-13 2022-07-13 Model training method, image processing apparatus, and medium Pending CN115424013A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210820156.9A CN115424013A (en) 2022-07-13 2022-07-13 Model training method, image processing apparatus, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210820156.9A CN115424013A (en) 2022-07-13 2022-07-13 Model training method, image processing apparatus, and medium

Publications (1)

Publication Number Publication Date
CN115424013A true CN115424013A (en) 2022-12-02

Family

ID=84197105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210820156.9A Pending CN115424013A (en) 2022-07-13 2022-07-13 Model training method, image processing apparatus, and medium

Country Status (1)

Country Link
CN (1) CN115424013A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115936979A (en) * 2023-01-09 2023-04-07 珠海金智维信息科技有限公司 Method and system for synthesizing and identifying end-to-end text image
CN116051848A (en) * 2023-02-10 2023-05-02 阿里巴巴(中国)有限公司 Image feature extraction method, network model, device and equipment
CN116051848B (en) * 2023-02-10 2024-01-09 阿里巴巴(中国)有限公司 Image feature extraction method, network model, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination