CN116091857B - Training method of image processing model, image processing method and device - Google Patents

Training method of image processing model, image processing method and device

Info

Publication number: CN116091857B
Authority: CN (China)
Prior art keywords: image, style, feature, target, direction vector
Legal status: Active (granted)
Application number: CN202211270036.2A
Other languages: Chinese (zh)
Other versions: CN116091857A
Inventors: 李甫 (Li Fu), 吕月明 (Lyu Yueming), 林天威 (Lin Tianwei), 何栋梁 (He Dongliang), 丁二锐 (Ding Errui), 王井东 (Wang Jingdong)
Current and Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by: Beijing Baidu Netcom Science and Technology Co Ltd
Priority: CN202211270036.2A
Publications: CN116091857A (application), CN116091857B (grant)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/94: Hardware or software architectures specially adapted for image or video understanding
    • G06V 10/95: Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; combining figures or text
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Image Processing (AREA)

Abstract

The disclosure provides a training method of an image processing model, an image processing method and an image processing device, relates to the technical field of artificial intelligence, in particular to the technical fields of image processing, computer vision, deep learning and the like, and can be applied to scenes such as smart cities and intelligent traffic. The specific implementation scheme is as follows: determining a first image coding feature and a first style coding feature of a first image; determining a second image coding feature and a second style coding feature of a second image; determining a first differential direction vector according to the first image coding feature and the second image coding feature; inputting the first image coding feature, the first differential direction vector and the first style coding feature into a differential mapper in an image processing model to obtain an editing direction vector; determining a second differential direction vector according to the first style coding feature and the second style coding feature; and adjusting parameters of the image processing model according to the degree of difference between the second differential direction vector and the editing direction vector.

Description

Training method of image processing model, image processing method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of image processing, computer vision, deep learning and the like, and can be applied to scenes such as smart cities and intelligent traffic. More particularly, it relates to a training method of an image processing model, an image processing method, and an image processing device.
Background
With the advent of the "Internet+" era, demand for image generation and editing technology has grown steadily in social entertainment, film production, visual effects, and the like. In recent years, with the development of AI (artificial intelligence) technologies such as deep learning, multi-modal-driven image editing, such as text-driven image editing, has further lowered the threshold of image editing, promoted the advancement of interactive image editing technology, and shown great practical value. In deep-learning-based text-driven image editing, how to model the relationship between text and image more accurately, and how to generate high-quality editing effects more flexibly, are two key technical points.
Disclosure of Invention
The present disclosure provides a training method of an image processing model, an image processing method, an apparatus, a device, a storage medium, and a program product.
According to an aspect of the present disclosure, there is provided a training method of an image processing model, including: determining a first image coding feature and a first style coding feature of a first image; determining a second image coding feature and a second style coding feature of a second image; determining a first differential direction vector according to the first image coding feature and the second image coding feature; inputting the first image coding feature, the first differential direction vector and the first style coding feature into a differential mapper in an image processing model to obtain an editing direction vector; determining a second differential direction vector according to the first style coding feature and the second style coding feature; and adjusting parameters of the image processing model according to the degree of difference between the second differential direction vector and the editing direction vector.
According to another aspect of the present disclosure, there is provided an image processing method including: acquiring a source image, a source text and a target text, wherein the source text is used for describing an element to be edited in the source image, and the target text is used for describing an editing mode aiming at the element to be edited; and inputting the source image, the source text and the target text into an image processing model to obtain a processing result for the source image, wherein the image processing model is trained according to the method disclosed by the embodiment of the disclosure.
According to another aspect of the present disclosure, there is provided a training apparatus of an image processing model, including: a first feature determination module for determining a first image coding feature and a first style coding feature of a first image; a second feature determination module for determining a second image encoding feature and a second style encoding feature of a second image; a first differential direction vector determining module, configured to determine a first differential direction vector according to the first image coding feature and the second image coding feature; the editing direction vector determining module is used for inputting the first image coding feature, the first differential direction vector and the first style coding feature into a differential mapper in an image processing model to obtain an editing direction vector; the second differential direction vector determining module is used for determining a second differential direction vector according to the first style coding feature and the second style coding feature; and the adjusting module is used for adjusting parameters of the image processing model according to the difference degree between the second differential direction vector and the editing direction vector.
According to another aspect of the present disclosure, there is provided an image processing apparatus including: the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring a source image, a source text and a target text, wherein the source text is used for describing an element to be edited in the source image, and the target text is used for describing an editing mode aiming at the element to be edited; and an input module, configured to input the source image, the source text, and the target text into an image processing model, to obtain a processing result for the source image, where the image processing model is trained according to the method according to the embodiment of the disclosure.
Another aspect of the present disclosure provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods shown in the embodiments of the present disclosure.
According to another aspect of the disclosed embodiments, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the methods shown in the disclosed embodiments.
According to another aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program/instruction, characterized in that the computer program/instruction, when executed by a processor, implements the steps of the method shown in the disclosed embodiments.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a training method of an image processing model according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow chart of an image processing method according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a schematic diagram of a training method of an image processing model according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of a differential mapper according to an embodiment of the present disclosure;
fig. 6 schematically illustrates a schematic diagram of an image processing method according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a training apparatus of an image processing model according to an embodiment of the present disclosure;
fig. 8 schematically shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure; and
FIG. 9 schematically illustrates a block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Text-driven image editing techniques aim to edit the content of an image from a user-provided natural language description while ensuring that text-independent content is not altered. This task has attracted extensive research interest both in academia and industry in view of its importance for practical applications.
In the deep learning era, the related art trains conditional generative adversarial networks (GANs) from scratch using a large number of manually annotated image-text pairs. This process, however, incurs expensive labeling costs, which hampers training flexibility. The related art also generalizes poorly to unseen driving text.
According to embodiments of the present disclosure, a first image encoding feature and a first style encoding feature of a first image may be determined. A second image encoding feature and a second style encoding feature of the second image are determined. A first differential direction vector is determined based on the first image encoding feature and the second image encoding feature. And inputting the first image coding feature, the first differential direction vector and the first style coding feature into a differential mapper in an image processing model to obtain an editing direction vector. And determining a second differential direction vector according to the first style coding feature and the second style coding feature. And then adjusting parameters of the image processing model according to the difference degree between the second differential direction vector and the editing direction vector.
According to the training method of the image processing model, in the process of training the image processing model, no text participates, a large number of image-text pairs do not need to be collected, manual labeling is not needed, and therefore manpower and material resources are saved, and in addition, the training flexibility and accuracy are higher.
An exemplary system architecture to which the training method, image processing method, and apparatus of the present disclosure may be applied will be described below in conjunction with fig. 1.
Fig. 1 schematically illustrates an exemplary system architecture 100 according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as, merely by way of example, an image processing application, a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, and social platform software.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server that provides various services, such as a server that provides an image processing service. The user can upload images and text using the terminal devices 101, 102, 103. After receiving the image and the text, the server 105 may edit the image according to the text, and the like, to obtain a processing result, and feed back the processing result to the terminal device.
It should be noted that, the training method and the image processing method of the image processing model provided in the embodiments of the present disclosure may be generally executed by the server 105. Accordingly, the training apparatus and the image processing apparatus of the image processing model provided by the embodiments of the present disclosure may be generally provided in the server 105. The training method and the image processing method of the image processing model provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the training apparatus and the image processing apparatus for an image processing model provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure, application and other processing of the personal information of the user all comply with the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
The training method of the image processing model provided by the present disclosure will be described below with reference to fig. 2.
Fig. 2 schematically illustrates a flowchart of a training method of an image processing model according to an embodiment of the present disclosure.
As shown in fig. 2, the training method 200 of the image processing model includes determining a first image coding feature and a first style coding feature of a first image in operation S210.
Then, in operation S220, a second image coding feature and a second style coding feature of the second image are determined.
According to an embodiment of the present disclosure, the first image and the second image may be any two images, such as a face image, an animal image, a natural scene, and the like. Illustratively, in the present embodiment, a training image data set may be acquired in advance, wherein the training image data set includes a plurality of images. Two images may be randomly selected from the training image dataset, one as the first image and the other as the second image.
According to embodiments of the present disclosure, image coding features may be used to represent image features of an image, and style coding features may be used to represent style features of an image. In this embodiment, for example, the first image coding feature of the first image and the second image coding feature of the second image may be extracted by an image encoder, and a style encoder may be used to extract the first style coding feature of the first image and the second style coding feature of the second image. The image encoder may include, for example, a CLIP model. The style encoder may include, for example, a StyleGAN model, which may be pre-trained. The CLIP (Contrastive Language-Image Pre-training) model is a large-scale pre-trained visual language model. After training the CLIP model with a large number of image-text pairs, the CLIP model can embed real-world images and text into a semantically consistent feature space. Based on this, in this embodiment, the CLIP model and the pre-trained GAN model may be combined for text-driven image processing operations.
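As a concrete illustration, the feature extraction described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes the OpenAI clip package for the image encoder, while `style_encoder` with its `to_w_plus` and `w_plus_to_s` methods is a hypothetical placeholder for a pre-trained StyleGAN-based style encoder, whose exact API depends on the chosen implementation.

```python
# Minimal sketch of the feature extraction step. `style_encoder` and its
# methods are hypothetical placeholders for a pre-trained StyleGAN encoder.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def encode_image_features(image_batch: torch.Tensor) -> torch.Tensor:
    """Image coding features i, from the CLIP image encoder.

    `image_batch` is assumed to be already preprocessed to CLIP's input
    resolution (e.g. with the `preprocess` transform above).
    """
    with torch.no_grad():
        feats = clip_model.encode_image(image_batch.to(device))
    # CLIP features are commonly L2-normalized before computing differences
    return feats / feats.norm(dim=-1, keepdim=True)

def encode_style_features(image_batch: torch.Tensor, style_encoder) -> torch.Tensor:
    """Style coding features s: invert the image to W+ and map W+ to S space."""
    with torch.no_grad():
        w_plus = style_encoder.to_w_plus(image_batch)  # hypothetical inversion call
        return style_encoder.w_plus_to_s(w_plus)       # hypothetical W+ -> S mapping
```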
In operation S230, a first differential direction vector is determined according to the first image encoding feature and the second image encoding feature.
According to an embodiment of the present disclosure, the first differential direction vector may be used to represent a differential direction between the first image encoding feature and the second image encoding feature. In this embodiment, for example, a difference between the first image encoding feature and the second image encoding feature may be calculated, resulting in a first differential direction vector.
In operation S240, the first image coding feature, the first differential direction vector, and the first style coding feature are input to a differential mapper in an image processing model to obtain an edit direction vector.
According to an embodiment of the present disclosure, the differential mapper may be configured to determine an edit direction vector based on the first image coding feature, the first differential direction vector, and the first style coding feature. The edit direction vector may be used to represent a target edit direction.
In operation S250, a second differential direction vector is determined based on the first and second style-coding features.
According to embodiments of the present disclosure, a second differential direction vector may be used to represent a differential direction between the first style-encoded feature and the second style-encoded feature. In this embodiment, for example, a difference between the first style encoding feature and the second style encoding feature may be calculated to obtain the second differential direction vector.
In operation S260, parameters of the image processing model are adjusted according to the degree of difference between the second differential direction vector and the edit direction vector.
According to an embodiment of the present disclosure, the degree of difference between the second differential direction vector and the edit direction vector may be calculated, for example, from a loss function. The loss function may include, for example, a Euclidean distance (L2 distance) reconstruction loss and a cosine similarity loss.
Illustratively, in the present embodiment, an L2 distance between the second differential direction vector and the editing direction vector may be calculated, along with a cosine distance between the two vectors. The degree of difference is then determined from the L2 distance and the cosine distance.
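A minimal sketch of such a difference degree follows; equal weighting of the two terms is an assumption, since the text does not fix the weights.

```python
# Difference degree between the target direction Δs and the predicted
# direction Δs′: an L2 reconstruction term plus a cosine-distance term.
import torch
import torch.nn.functional as F

def difference_degree(delta_s: torch.Tensor, delta_s_pred: torch.Tensor) -> torch.Tensor:
    l2 = F.mse_loss(delta_s_pred, delta_s)  # L2 (squared-distance) reconstruction term
    cosine = 1.0 - F.cosine_similarity(delta_s_pred, delta_s, dim=-1).mean()
    return l2 + cosine
```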
According to embodiments of the present disclosure, adjusting parameters of the image processing model may include, for example, adjusting parameters of a differential mapper.
According to an embodiment of the present disclosure, the above operations S210 to S260 may be repeatedly performed until the degree of difference satisfies a predetermined condition. The predetermined condition can be set according to actual needs; for example, it may be that the degree of difference converges.
After the image processing model is trained, image processing may be performed using the image processing model. For example, the image may be edited from the description of the text using an image processing model.
The related art uses manually annotated image-text pairs to train an image processing model, which requires expensive annotation costs and impedes training flexibility.
According to the training method of the image processing model, in the process of training the image processing model, no text participates, a large number of image-text pairs do not need to be collected, manual labeling is not needed, and therefore manpower and material resources are saved, and in addition, the training flexibility and accuracy are higher.
According to other embodiments of the present disclosure, after the image processing model is trained, the image processing model may also be tested; that is, a test image is processed using the image processing model to obtain a corresponding processing result. It is then determined whether the processing result corresponding to the test image meets the requirements, and if not, the image processing model is retrained.
The image processing method provided by the present disclosure will be described below with reference to fig. 3.
Fig. 3 schematically shows a flowchart of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 3, the image processing method 300 includes acquiring a source image, a source text, and a target text in operation S310.
According to an embodiment of the present disclosure, the source image may be an image to be processed. The source image may be, for example, an animal image, a plant image, or a face image, which is not limited in this disclosure. The source text may be used to describe the element to be edited in the source image, and the target text is used to describe the editing mode for the element to be edited. For example, the source text may refer to body organs, specific parts of objects, and the like, while the target text may describe facial organ characteristics of the edited face image or the emotion of the person in the edited face image.
In operation S320, the source image, the source text, and the target text are input into the image processing model, and a processing result for the source image is obtained.
According to embodiments of the present disclosure, for example, a first text encoding feature of a source text and a second text encoding feature of a target text may be determined. A text differential direction vector is determined based on the first text encoding feature and the second text encoding feature. In this embodiment, for example, a difference between the first text encoding feature and the second text encoding feature may be calculated to obtain a text differential direction vector.
In addition, image coding features and style coding features of the source image may be determined. And then inputting the text differential direction vector, the image coding feature and the style coding feature into a differential mapper in the image processing model to obtain an editing direction vector. And then generating a processing result aiming at the source image according to the editing direction vector and the style coding characteristics.
For example, the source image may be a face image, the source text may include "face" indicating that the element to be edited is a face, and the target text may include "face with eyeglasses" indicating that the editing mode is to add glasses to the face. The processing result of the source image processed by the image processing model may include a face image with glasses added to the face.
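For illustration, the text side of this example can be sketched as follows. The snippet assumes the OpenAI clip package and only shows how a text differential direction vector could be built from the two prompts above.

```python
# Sketch: text differential direction vector Δt from source and target text.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

tokens = clip.tokenize(["face", "face with eyeglasses"]).to(device)
with torch.no_grad():
    t = clip_model.encode_text(tokens)
t = t / t.norm(dim=-1, keepdim=True)  # normalize, as with image features
delta_t = t[1] - t[0]                 # Δt: drives "add glasses to the face"
```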
The training method of the image processing model shown above is further described with reference to fig. 4 in conjunction with a specific embodiment. Those skilled in the art will appreciate that the following example embodiments are merely for the understanding of the present disclosure, and the present disclosure is not limited thereto.
Fig. 4 schematically illustrates a schematic diagram of a training method of an image processing model according to an embodiment of the present disclosure.
As shown in fig. 4, an image processing model may include, for example, an image encoder, a style encoder, and a differential mapper, according to an embodiment of the present disclosure. The image encoder may include, for example, a CLIP model. The style encoder may include, for example, a StyleGAN (style-based generative adversarial network).
According to an embodiment of the present disclosure, the input of the image encoder may comprise an image and the output may comprise image coding features of the image. The input of the style encoder may comprise an image, the output may comprise intermediate spatial features of the image, and the intermediate spatial features may be further mapped to a target space, resulting in style coding features. The intermediate space features may be, for example, features of the W+ space, and the target space may be, for example, the S space. By mapping the intermediate spatial features to the target space, the decoupling of image processing can be improved.
According to an embodiment of the present disclosure, the input of the differential mapper may comprise a differential direction vector between image encoding features of the source image and the target image, and the output may comprise an edit direction vector.
According to embodiments of the present disclosure, for example, a first image may be input to an image encoder in an image processing model, resulting in a first image encoding feature. In addition, the first image may be input to a style encoder in the image processing model to obtain a first intermediate spatial feature. And then mapping the first intermediate space characteristic to a target space to obtain a first style coding characteristic. Illustratively, in this embodiment, the first intermediate space feature may be, for example, a feature of the w+ space. The target space may be, for example, an S space. The first intermediate spatial feature may be mapped from w+ space to S space, resulting in a first style encoding feature.
According to embodiments of the present disclosure, the second image may be input to an image encoder in an image processing model, for example, resulting in a second image encoding feature. In addition, the second image may be input to a style encoder in the image processing model to obtain a second intermediate spatial feature. The second intermediate spatial feature may then be mapped to the target space resulting in a second style encoding feature. Illustratively, in this embodiment, the second intermediate space feature may be, for example, a feature of the w+ space. The target space may be, for example, an S space. The second intermediate spatial feature may be mapped from the w+ space to the S space resulting in a second style encoding feature.
For example, two images may be randomly selected from the training image dataset, one as the source image I₁ and the other as the target image I₂. The image coding feature i₁ of the source image I₁ and the image coding feature i₂ of the target image I₂ are then extracted with the image encoder. A pre-trained style encoder is then used to extract the W+ space features of the two images. Because features of the S space are more decoupled than features of the W+ space, in this embodiment the extracted W+ space features may be further mapped to the S space, obtaining the style coding feature s₁ of the source image I₁ and the style coding feature s₂ of the target image I₂. Using the extracted features described above, the first differential direction vector Δi = i₂ − i₁ and the second differential direction vector Δs = s₂ − s₁ can be calculated. Further, the editing direction between the two images can be predicted using the differential mapper Delta Mapper, which can be expressed as the following formula:
Δs′ = DeltaMapper(Δi, s₁, i₁)
where DeltaMapper is the differential mapper, Δs′ is the edit direction vector predicted by the differential mapper, and Δi, s₁ and i₁ are the inputs to the differential mapper.
Next, the degree of difference L may be calculated from Δs' and Δs.
Illustratively, in this embodiment, the image encoder may include a CLIP model and the style encoder may include a StyleGAN model. The feature i₁ extracted by the image encoder and the feature s₁ extracted by the style encoder can be used as inputs to the Delta Mapper to provide condition information about the source image, which helps the Delta Mapper understand the position of the source image in CLIP space and in S space, so that the generated edit direction vector can indicate the editing direction more specifically and accurately.
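Putting the pieces above together, one training iteration could look like the following sketch. It reuses `encode_image_features`, `encode_style_features` and `difference_degree` from the earlier sketches, and assumes `delta_mapper` takes (Δi, s₁, i₁) as described; only the mapper's parameters are updated.

```python
# One training step of the differential mapper (sketch; helper functions are
# the ones sketched earlier in this document).
import torch

def train_step(img1, img2, delta_mapper, style_encoder, optimizer):
    i1 = encode_image_features(img1)                  # first image coding feature
    i2 = encode_image_features(img2)                  # second image coding feature
    s1 = encode_style_features(img1, style_encoder)   # first style coding feature
    s2 = encode_style_features(img2, style_encoder)   # second style coding feature

    delta_i = i2 - i1   # first differential direction vector Δi
    delta_s = s2 - s1   # second differential direction vector Δs (supervision target)

    delta_s_pred = delta_mapper(delta_i, s1, i1)      # predicted edit direction Δs′

    loss = difference_degree(delta_s, delta_s_pred)   # L2 + cosine difference degree
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # adjusts only the mapper's parameters
    return loss.item()
```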
The differential mapper shown above is further described in connection with a specific embodiment with reference to fig. 5. Those skilled in the art will appreciate that the following example embodiments are merely for the understanding of the present disclosure, and the present disclosure is not limited thereto.
Fig. 5 schematically illustrates a schematic diagram of a differential mapper according to an embodiment of the present disclosure.
As shown in fig. 5, the differential mapper may include, for example, a style coding module (style module), a condition coding module (condition module), and a fusion module (fusion module).
According to an embodiment of the present disclosure, coarse-granularity style characteristics, medium-granularity style characteristics, and fine-granularity style characteristics corresponding to a first style coding characteristic are determined using a style coding module. Coarse-granularity condition features, medium-granularity condition features, and fine-granularity condition features corresponding to the first image coding features and the first differential direction vector are determined using a condition coding module. And then, fusing the coarse-grain style characteristic and the coarse-grain condition characteristic by utilizing a fusion module to obtain a first fusion characteristic, fusing the medium-grain style characteristic and the medium-grain condition characteristic to obtain a second fusion characteristic, fusing the fine-grain style characteristic and the fine-grain condition characteristic to obtain a third fusion characteristic, and determining an editing direction vector according to the first fusion characteristic, the second fusion characteristic and the third fusion characteristic.
According to embodiments of the present disclosure, the style encoding module may include StyleGAN. Since StyleGAN has the characteristic that different layers correspond to different semantic levels, its layers can be divided into different semantic levels, and feature extraction from coarse granularity to fine granularity can be realized at each level. Based on this, in this embodiment the style encoding module may include three encoding layers, M_c^s, M_m^s and M_f^s, used to extract the coarse-granularity, medium-granularity and fine-granularity style features respectively, where each encoding layer may comprise a plurality of fully connected layers. The style coding feature s is fed into the three encoding layers as three inputs at different levels, yielding e_c^s, e_m^s and e_f^s, where e_c^s is the coarse-granularity style feature, e_m^s is the medium-granularity style feature, and e_f^s is the fine-granularity style feature. The subscripts c, m and f denote the coarse-to-fine levels, and the superscript s denotes the output of the style encoding module.
According to an embodiment of the present disclosure, the condition encoding module may include three hierarchical encoding layers, M_c^c, M_m^c and M_f^c, used to extract the coarse-granularity, medium-granularity and fine-granularity condition features respectively. Illustratively, in this embodiment, the structure of the three encoding layers in the condition encoding module may be the same as that of the three encoding layers in the style encoding module. The first differential direction vector Δi and the first image coding feature i₁ may first be combined as the input of the condition encoding module; the three encoding layers then learn the coarse-granularity condition feature e_c^c, the medium-granularity condition feature e_m^c and the fine-granularity condition feature e_f^c respectively, where e_c^c, e_m^c and e_f^c have the same dimensions as e_c^s, e_m^s and e_f^s.
According to embodiments of the present disclosure, the fusion module may include three hierarchical fusion layers: a coarse-granularity fusion layer M_c^fuse, a medium-granularity fusion layer M_m^fuse, and a fine-granularity fusion layer M_f^fuse, each of which may include a fully connected network layer. The style features and condition features generated from coarse to fine are fed into the three fusion layers, and the editing direction vector is predicted. This process can be expressed as the following formula:
Δs′ = [M_c^fuse(e_c^s, e_c^c), M_m^fuse(e_m^s, e_m^c), M_f^fuse(e_f^s, e_f^c)]
where Δs′ is the edit direction vector and [·, ·, ·] denotes combining the outputs of the three levels. That is, in the coarse-granularity fusion layer M_c^fuse, the coarse-granularity style feature e_c^s and the coarse-granularity condition feature e_c^c are fused; in the medium-granularity fusion layer M_m^fuse, the medium-granularity style feature e_m^s and the medium-granularity condition feature e_m^c are fused; and in the fine-granularity fusion layer M_f^fuse, the fine-granularity style feature e_f^s and the fine-granularity condition feature e_f^c are fused. The edit direction vector Δs′ is then obtained from the fused features.
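A structural sketch of such a differential mapper is given below. The three-branch wiring follows the description above; the layer widths, depths, activation function, the 512-dimensional CLIP features, and the way the S code is split into coarse/medium/fine chunks are all assumptions of this sketch.

```python
# Sketch of the differential mapper: per level, a style branch, a condition
# branch, and a fusion layer; the three fused outputs are concatenated into Δs′.
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 512, depth: int = 4) -> nn.Sequential:
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.LeakyReLU(0.2)]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class DeltaMapper(nn.Module):
    def __init__(self, s_dims=(2048, 2048, 2048), clip_dim: int = 512):
        super().__init__()
        self.s_dims = list(s_dims)  # assumed coarse/medium/fine split of the S code
        self.style = nn.ModuleList([mlp(d, 512) for d in self.s_dims])            # M^s
        self.cond = nn.ModuleList([mlp(2 * clip_dim, 512) for _ in self.s_dims])  # M^c
        self.fuse = nn.ModuleList([mlp(1024, d) for d in self.s_dims])            # M^fuse

    def forward(self, delta_i: torch.Tensor, s1: torch.Tensor, i1: torch.Tensor) -> torch.Tensor:
        cond_in = torch.cat([delta_i, i1], dim=-1)     # combine Δi and i1 as condition input
        chunks = torch.split(s1, self.s_dims, dim=-1)  # coarse / medium / fine chunks of s1
        fused = []
        for level in range(len(self.s_dims)):
            e_style = self.style[level](chunks[level])  # e^s at this level
            e_cond = self.cond[level](cond_in)          # e^c at this level
            fused.append(self.fuse[level](torch.cat([e_style, e_cond], dim=-1)))
        return torch.cat(fused, dim=-1)                 # edit direction vector Δs′
```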
According to an embodiment of the present disclosure, to train the differential mapper Delta Mapper, the objective function may contain two loss functions, which may be expressed as:
L = L_rec + L_sim
where L is the objective function, L_rec is the L2 distance reconstruction loss, and L_sim is the cosine similarity loss. The L2 distance reconstruction loss measures the L2 distance in S space between the predicted direction Δs′ and the target editing direction Δs. By introducing the L2 distance reconstruction loss in S space into the objective function, supervision of the learned editing direction Δs′ can be strengthened. In addition, by introducing the cosine similarity loss, the differential mapper can be explicitly directed to minimize the cosine distance between the prediction Δs′ and Δs.
Accordingly, in accordance with an embodiment of the present disclosure, when image processing is performed using the image processing model, coarse-granularity style features, medium-granularity style features, and fine-granularity style features corresponding to the style coding features may be determined using the style encoding module, and coarse-granularity condition features, medium-granularity condition features and fine-granularity condition features corresponding to the image coding features and the text differential direction vector may be determined using the condition encoding module. The fusion module is then used to fuse the coarse-granularity style feature and the coarse-granularity condition feature to obtain a first fusion feature, fuse the medium-granularity style feature and the medium-granularity condition feature to obtain a second fusion feature, and fuse the fine-granularity style feature and the fine-granularity condition feature to obtain a third fusion feature; the editing direction vector is obtained according to the first, second and third fusion features.
According to embodiments of the present disclosure, a target vector may be determined from the edit direction vector and the style coding feature. And then inputting the target vector into an image generator in the image processing model to obtain a target image serving as a processing result of the source image.
According to embodiments of the present disclosure, for example, the editing direction vector and the style coding feature may be added to obtain the target vector.
The image processing method shown above is further described with reference to fig. 6 in conjunction with the specific embodiment. Those skilled in the art will appreciate that the following example embodiments are merely for the understanding of the present disclosure, and the present disclosure is not limited thereto.
Fig. 6 schematically illustrates a schematic diagram of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 6, in performing image processing, the image processing model may include a text encoder, an image encoder, a style encoder, a differential mapper, and an image generator.
According to embodiments of the present disclosure, the input of the text encoder may include text and the output may include text encoding features of the text. In this embodiment, a text encoder may be used to extract the text encoding features of the source text and the target text. The text encoder may include, for example, a CLIP model.
According to embodiments of the present disclosure, the input of the image generator may include the image encoding features and the editing direction features of the source image, and the output may include the edited target image, i.e., the processing result. The image generator may comprise, for example, a StyleGAN.
According to embodiments of the present disclosure, the image encoder, the style encoder, and the differential mapper may be those described above, and are not repeated here.
According to embodiments of the present disclosure, a source text may be input to a text encoder in an image processing model, for example, resulting in a first text encoding feature. Inputting the target text into a text encoder in the image processing model to obtain a second text encoding feature.
In addition, the source image may be input into an image feature encoding model in the image processing model to obtain image encoding features. And inputting the source image into a style coding model in the image processing model to obtain the target intermediate space characteristics. And then mapping the target intermediate space features to a target space to obtain style coding features. The target intermediate spatial feature may be, for example, a feature of the w+ space. The target space may be, for example, an S space.
The target coarse-granularity style feature, the target medium-granularity style feature, and the target fine-granularity style feature corresponding to the style coding features may then be determined, for example, using the style encoding module of the differential mapper, and the target coarse-granularity condition feature, the target medium-granularity condition feature and the target fine-granularity condition feature corresponding to the image coding features and the text differential direction vector may be determined using the condition encoding module of the differential mapper. The fusion module of the differential mapper is then used to fuse the target coarse-granularity style feature and the target coarse-granularity condition feature to obtain a first target fusion feature, fuse the target medium-granularity style feature and the target medium-granularity condition feature to obtain a second target fusion feature, and fuse the target fine-granularity style feature and the target fine-granularity condition feature to obtain a third target fusion feature; the target editing direction vector is obtained according to the first, second and third target fusion features.
According to embodiments of the present disclosure, a target vector may be determined from a target edit direction vector and style coding features. And then inputting the target vector into an image generator in the image processing model to obtain a target image serving as a processing result of the source image.
According to embodiments of the present disclosure, for example, the editing direction vector and the style coding feature may be added to obtain the target vector.
For example, given a source image I, the image coding feature i of I is first extracted by the image encoder, and the style coding feature s of I is extracted by the style encoder. In addition, the source text "face" and the target text "face with yellow hair" can be fed into the text encoder to construct a text differential direction vector Δt, based on which the differential mapper Delta Mapper can be used to predict the edit direction vector Δs′ under text-differential control as
Δs′ = DeltaMapper(Δt, s, i)
The target vector s′ = s + Δs′ may then be generated from the editing direction vector. Finally, the target image I′ may be generated from s′ with the image generator as the processing result.
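An end-to-end inference sketch corresponding to this walkthrough is given below. It reuses the encoding helpers sketched earlier; `generator.synthesize_from_s` is a hypothetical call standing in for a StyleGAN synthesis network driven by S-space codes.

```python
# Sketch of text-driven editing at inference time.
import torch
import clip

@torch.no_grad()
def edit_image(src_image, source_text, target_text,
               clip_model, style_encoder, delta_mapper, generator, device="cuda"):
    i = encode_image_features(src_image)                  # image coding feature i
    s = encode_style_features(src_image, style_encoder)   # style coding feature s

    tokens = clip.tokenize([source_text, target_text]).to(device)
    t = clip_model.encode_text(tokens)
    t = t / t.norm(dim=-1, keepdim=True)
    delta_t = (t[1] - t[0]).unsqueeze(0)                  # text differential direction Δt

    delta_s = delta_mapper(delta_t, s, i)                 # edit direction Δs′
    s_prime = s + delta_s                                 # target vector s′ = s + Δs′
    return generator.synthesize_from_s(s_prime)           # hypothetical S-space synthesis
```

For the example above, `edit_image(I, "face", "face with yellow hair", ...)` would return the hair-recolored result I′.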
According to another embodiment of the present disclosure, a correlation matrix may be obtained, for example, wherein the correlation matrix is used to represent correlations between image features and corresponding text features. The correlation matrix may then be multiplied by the target edit direction vector to obtain a new target edit direction vector. The new target edit direction vector and style coding feature may then be added to obtain a target vector. The new target editing direction vector is obtained by multiplying the correlation matrix and the target editing direction vector, and the target vector is determined based on the new target editing direction vector, so that the decoupling performance of image processing can be improved, and the effect of image processing can be improved.
For example, the obtained target edit direction vector Δs′ may be further processed with a pre-computed correlation matrix R_s, where the matrix R_s records how the corresponding condition features in CLIP space change when each dimension in S space is modified. Based on this, Δs″ can be calculated channel by channel according to the following rule: for each channel, if the corresponding product computed from the correlation matrix R_s is greater than or equal to β, then Δs″ takes the value of Δs′ in that channel; otherwise, Δs″ is set to 0 in that channel. Here β is a threshold value selected for controlling the level of decoupling, which may be set according to actual needs.
According to the disclosed embodiment, in this way, channels in the S space that have low correlation with the editing target text can be set to zero, thereby improving the decoupling capability of text-controlled editing.
In this way, the final target image I′ can be generated from the target vector s′ = s + Δs″.
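One way to realize this channel-zeroing is sketched below; how the per-channel relevance scores are derived from R_s, and the value of β, are assumptions here.

```python
# Sketch: zero out S-space channels whose relevance to the edit is below β.
import torch

def disentangle(delta_s_prime: torch.Tensor, relevance: torch.Tensor,
                beta: float = 0.03) -> torch.Tensor:
    """`relevance` holds per-channel scores computed from the correlation matrix R_s."""
    mask = (relevance >= beta).to(delta_s_prime.dtype)
    return delta_s_prime * mask   # Δs″: low-relevance channels set to zero
```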
The training apparatus of the image processing model provided by the present disclosure will be described below with reference to fig. 7.
Fig. 7 schematically illustrates a block diagram of a training apparatus of an image processing model according to an embodiment of the present disclosure.
As shown in fig. 7, the training apparatus 700 of the image processing model includes a first feature determination module 710, a second feature determination module 720, a first differential direction vector determination module 730, an edit direction vector determination module 740, a second differential direction vector determination module 750, and an adjustment module 760.
The first feature determination module 710 is configured to determine a first image coding feature and a first style coding feature of a first image.
A second feature determination module 720 for determining a second image coding feature and a second style coding feature of the second image.
The first differential direction vector determining module 730 is configured to determine a first differential direction vector according to the first image coding feature and the second image coding feature.
The edit direction vector determination module 740 is configured to input the first image coding feature, the first differential direction vector, and the first style coding feature into a differential mapper in the image processing model to obtain an edit direction vector.
A second differential direction vector determination module 750 is configured to determine a second differential direction vector based on the first style encoding feature and the second style encoding feature.
The adjustment module 760 is configured to adjust parameters of the image processing model according to the degree of difference between the second differential direction vector and the editing direction vector.
According to a disclosed embodiment, the first feature determination module may include: the first input sub-module is used for inputting the first image into an image encoder in the image processing model to obtain a first image coding characteristic; the second input sub-module is used for inputting the first image into a style encoder in the image processing model to obtain a first intermediate space feature; and the first mapping sub-module is used for mapping the first intermediate space characteristic to the target space to obtain a first style coding characteristic.
According to a disclosed embodiment, the second feature determination module may include: the third input sub-module is used for inputting the second image into the image encoder in the image processing model to obtain second image coding characteristics; the fourth input sub-module is used for inputting the second image into a style encoder in the image processing model to obtain a second intermediate space feature; and the second mapping sub-module is used for mapping the second intermediate space characteristic to the target space to obtain a second style coding characteristic.
According to disclosed embodiments, the differential mapper may include a style encoding module, a condition encoding module, and a fusion module; the edit direction vector determination module may include: the style coding sub-module is used for determining coarse granularity style characteristics, medium granularity style characteristics and fine granularity style characteristics corresponding to the first style coding characteristics by utilizing the style coding module; the condition coding sub-module is used for determining coarse granularity condition features, medium granularity condition features and fine granularity condition features corresponding to the first image coding features and the first differential direction vectors by utilizing the condition coding module; and a fusion sub-module, configured to fuse the coarse-grain style feature and the coarse-grain condition feature by using the fusion module to obtain a first fusion feature, fuse the medium-grain style feature and the medium-grain condition feature to obtain a second fusion feature, fuse the fine-grain style feature and the fine-grain condition feature to obtain a third fusion feature, and determine an editing direction vector according to the first fusion feature, the second fusion feature and the third fusion feature.
According to a disclosed embodiment, the training device of the image processing model may further include: the first calculation module is used for calculating the Euclidean distance between the second differential direction vector and the editing direction vector; the second calculation module is used for calculating the cosine distance between the second differential direction vector and the editing direction vector; and the difference degree determining module is used for determining the difference degree according to the Euclidean distance and the cosine distance.
According to a disclosed embodiment, the first differential direction vector determination module may include: and the third computing sub-module is used for computing the difference between the first image coding feature and the second image coding feature to obtain a first differential direction vector.
According to a disclosed embodiment, the second differential direction vector determination module may include: and the fourth computing sub-module is used for computing the difference between the first style coding characteristic and the second style coding characteristic to obtain a second differential direction vector.
The image processing apparatus provided by the present disclosure will be described below with reference to fig. 8.
Fig. 8 schematically shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the image processing apparatus 800 includes an acquisition module 810 and an input module 820.
The obtaining module 810 is configured to obtain a source image, a source text, and a target text, where the source text is used to describe an element to be edited in the source image, and the target text is used to describe an editing mode for the element to be edited.
And an input module 820, configured to input the source image, the source text, and the target text into an image processing model, to obtain a processing result for the source image, where the image processing model is trained according to the training method of the image processing model shown in the embodiments of the present disclosure.
According to a disclosed embodiment, the input module may include: a text encoding sub-module for determining a first text encoding feature of the source text and a second text encoding feature of the target text; a text differential sub-module for determining a text differential direction vector according to the first text encoding feature and the second text encoding feature; a feature determining sub-module for determining image coding features and style coding features of the source image; a differential mapping sub-module for inputting the text differential direction vector, the image coding features and the style coding features into the differential mapper in the image processing model to obtain a target editing direction vector; and a generation sub-module for generating a processing result for the source image according to the target editing direction vector and the style coding features.
According to a disclosed embodiment, the text encoding submodule may include: a first input unit for inputting the source text into a text encoder in the image processing model to obtain a first text encoding feature; and a second input unit for inputting the target text into the text encoder in the image processing model to obtain a second text encoding feature.
According to a disclosed embodiment, the feature determination submodule may include: the third input unit is used for inputting the source image into the image feature coding model in the image processing model to obtain image coding features; the fourth input unit is used for inputting the source image into a style coding model in the image processing model to obtain target intermediate space characteristics; and the mapping unit is used for mapping the target intermediate space characteristic to the target space to obtain the style coding characteristic.
According to disclosed embodiments, the differential mapper may include a style encoding module, a condition encoding module, and a fusion module; the differential mapping sub-module includes: a first feature determining unit for determining the target coarse-granularity style feature, the target medium-granularity style feature and the target fine-granularity style feature corresponding to the style coding features using the style encoding module; a second feature determining unit for determining the target coarse-granularity condition feature, the target medium-granularity condition feature and the target fine-granularity condition feature corresponding to the image coding features and the text differential direction vector using the condition encoding module; and a fusion unit for fusing, with the fusion module, the target coarse-granularity style feature and the target coarse-granularity condition feature to obtain a first target fusion feature, fusing the target medium-granularity style feature and the target medium-granularity condition feature to obtain a second target fusion feature, fusing the target fine-granularity style feature and the target fine-granularity condition feature to obtain a third target fusion feature, and obtaining the target editing direction vector according to the first, second and third target fusion features.
According to a disclosed embodiment, the generation sub-module may include: a target vector determining unit for determining a target vector according to the target editing direction vector and the style coding feature; and a fifth input unit for inputting the target vector into an image generator in the image processing model to obtain a target image as the processing result for the source image.
According to a disclosed embodiment, the target vector determination unit may include: an acquisition subunit, configured to acquire a correlation matrix, where the correlation matrix is used to represent a correlation between an image feature and a corresponding text feature; a multiplication subunit, configured to multiply the correlation matrix with the target editing direction vector to obtain a new target editing direction vector; and an adding subunit, configured to add the new target editing direction vector and the style coding feature to obtain a target vector.
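In code, this unit's arithmetic is a single matrix-vector product followed by an addition. The identity matrix below is only a placeholder for the acquired correlation matrix, which this sketch does not compute; dimensions are assumptions.

```python
import torch

d = 512
R = torch.eye(d)                 # placeholder for the acquired correlation matrix
delta_w = torch.randn(d)         # target editing direction vector
w = torch.randn(d)               # style coding feature

delta_w_new = R @ delta_w        # reweight the direction by the correlation matrix
target_vector = delta_w_new + w  # add the style code to obtain the target vector
```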
According to a disclosed embodiment, the text differential sub-module may include: a calculating unit for calculating a difference between the first text encoding feature and the second text encoding feature to obtain the text differential direction vector.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 schematically illustrates a block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as the training method of the image processing model and the image processing method. For example, in some embodiments, the training method of the image processing model and the image processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the training method of the image processing model and the image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the training method of the image processing model and the image processing method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server ("VPS") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (32)

1. A method of training an image processing model, comprising:
determining a first image coding feature and a first style coding feature of a first image;
determining a second image coding feature and a second style coding feature of a second image;
determining a first differential direction vector according to the first image coding feature and the second image coding feature;
inputting the first image coding feature, the first differential direction vector and the first style coding feature into a differential mapper in an image processing model to obtain an editing direction vector;
determining a second differential direction vector according to the first style coding feature and the second style coding feature; and
adjusting parameters of the image processing model according to the degree of difference between the second differential direction vector and the editing direction vector;
the differential mapper comprises a style coding module, a condition coding module and a fusion module; the inputting the first image coding feature, the first differential direction vector and the first style coding feature into the differential mapper in the image processing model to obtain the editing direction vector comprises:
determining, by using the style coding module, style features from coarse granularity to fine granularity corresponding to the first style coding feature;
determining, by using the condition coding module, condition features from coarse granularity to fine granularity corresponding to the first image coding feature and the first differential direction vector; and
inputting the style features from coarse granularity to fine granularity and the condition features from coarse granularity to fine granularity into three coding layers in the fusion module for fusion, so as to determine the editing direction vector.
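Read procedurally, claim 1 amounts to the following single training step, sketched here with generic PyTorch modules. The module names, the pairing of (x1, x2), and the exact loss weighting are editorial assumptions; claim 5 below names the Euclidean and cosine components of the degree of difference.

```python
import torch
import torch.nn.functional as F

def training_step(x1, x2, image_encoder, style_encoder, mapper, optimizer):
    f1, w1 = image_encoder(x1), style_encoder(x1)  # first image/style coding features
    f2, w2 = image_encoder(x2), style_encoder(x2)  # second image/style coding features
    delta_img = f2 - f1                            # first differential direction vector
    pred_dir = mapper(f1, delta_img, w1)           # editing direction from the mapper
    delta_sty = w2 - w1                            # second differential direction vector
    # Stand-in for the degree of difference: a squared-Euclidean term
    # plus a cosine term (see claim 5); equal weighting is an assumption
    loss = F.mse_loss(pred_dir, delta_sty) \
         + (1 - F.cosine_similarity(pred_dir, delta_sty, dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```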
2. The method of claim 1, wherein the determining the first image coding feature and the first style coding feature of the first image comprises:
inputting the first image into an image encoder in the image processing model to obtain the first image coding feature;
inputting the first image into a style encoder in the image processing model to obtain a first intermediate space feature; and
mapping the first intermediate space feature to a target space to obtain the first style coding feature.
3. The method of claim 1, wherein the determining the second image coding feature and the second style coding feature of the second image comprises:
inputting the second image into an image encoder in the image processing model to obtain the second image coding feature;
inputting the second image into a style encoder in the image processing model to obtain a second intermediate space feature; and
mapping the second intermediate space feature to a target space to obtain the second style coding feature.
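Claims 2 and 3 describe the same style-code path applied to each image: a style encoder yields an intermediate space feature, which a learned mapping lifts into the target space. Below is a sketch under the assumption that the intermediate code is a single vector and the target space stacks one code per generator layer (as in W-to-W+ style setups); both spaces and all shapes are assumptions for illustration.

```python
import torch.nn as nn

class StyleCodePath(nn.Module):
    def __init__(self, style_encoder, dim=512, num_layers=18):
        super().__init__()
        self.style_encoder = style_encoder                 # intermediate space feature
        self.num_layers = num_layers
        self.to_target = nn.Linear(dim, dim * num_layers)  # learned map to target space

    def forward(self, image):
        w = self.style_encoder(image)                      # (batch, dim)
        w_plus = self.to_target(w)                         # (batch, dim * num_layers)
        return w_plus.view(w.size(0), self.num_layers, -1)  # style coding feature
```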
4. The method of claim 1, wherein the inputting the first image coding feature, the first differential direction vector and the first style coding feature into the differential mapper in the image processing model to obtain the editing direction vector comprises:
determining a coarse-granularity style feature, a medium-granularity style feature and a fine-granularity style feature corresponding to the first style coding feature by using the style coding module;
determining a coarse-granularity condition feature, a medium-granularity condition feature and a fine-granularity condition feature corresponding to the first image coding feature and the first differential direction vector by using the condition coding module; and
fusing the coarse-granularity style feature and the coarse-granularity condition feature by using the fusion module to obtain a first fusion feature, fusing the medium-granularity style feature and the medium-granularity condition feature to obtain a second fusion feature, fusing the fine-granularity style feature and the fine-granularity condition feature to obtain a third fusion feature, and determining the editing direction vector according to the first fusion feature, the second fusion feature and the third fusion feature.
5. The method of claim 1, further comprising:
calculating the Euclidean distance between the second differential direction vector and the editing direction vector;
calculating a cosine distance between the second differential direction vector and the editing direction vector; and
determining the degree of difference according to the Euclidean distance and the cosine distance.
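Claim 5 translates directly into code; the equal weighting of the two distance terms below is an assumption, as the claim does not fix how the distances are combined.

```python
import torch
import torch.nn.functional as F

def difference_degree(second_diff_dir, edit_dir, alpha=1.0, beta=1.0):
    # Euclidean distance between the two direction vectors
    euclidean = torch.norm(second_diff_dir - edit_dir, p=2, dim=-1)
    # Cosine distance = 1 - cosine similarity
    cosine = 1.0 - F.cosine_similarity(second_diff_dir, edit_dir, dim=-1)
    return (alpha * euclidean + beta * cosine).mean()
```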
6. The method of claim 1, wherein the determining a first differential direction vector from the first image encoding feature and the second image encoding feature comprises:
calculating a difference between the first image coding feature and the second image coding feature to obtain the first differential direction vector.
7. The method of claim 1, wherein the determining a second differential direction vector from the first and second style-coded features comprises:
calculating a difference between the first style coding feature and the second style coding feature to obtain the second differential direction vector.
8. An image processing method, comprising:
acquiring a source image, a source text and a target text, wherein the source text is used for describing an element to be edited in the source image, and the target text is used for describing an editing mode for the element to be edited; and
inputting the source image, the source text and the target text into an image processing model to obtain a processing result aiming at the source image,
wherein the image processing model is trained according to the method of any one of claims 1-7.
9. The method of claim 8, wherein the inputting the source image, the source text and the target text into the image processing model to obtain a processing result for the source image comprises:
determining a first text encoding feature of the source text and a second text encoding feature of the target text;
determining a text differential direction vector according to the first text coding feature and the second text coding feature;
determining image coding features and style coding features of the source image;
inputting the text differential direction vector, the image coding feature and the style coding feature into a differential mapper in the image processing model to obtain a target editing direction vector; and
generating a processing result for the source image according to the target editing direction vector and the style coding feature.
10. The method of claim 9, wherein the determining the first text encoding feature of the source text and the second text encoding feature of the target text comprises:
inputting the source text into a text encoder in the image processing model to obtain the first text encoding feature; and
inputting the target text into the text encoder in the image processing model to obtain the second text encoding feature.
11. The method of claim 9, wherein the determining image coding features and style coding features of the source image comprises:
inputting the source image into an image feature coding model in the image processing model to obtain the image coding feature;
inputting the source image into a style coding model in the image processing model to obtain a target intermediate space feature; and
mapping the target intermediate space feature to a target space to obtain the style coding feature.
12. The method of claim 9, wherein the inputting the text differential direction vector, the image coding feature and the style coding feature into the differential mapper in the image processing model to obtain the target editing direction vector comprises:
determining a target coarse-granularity style feature, a target medium-granularity style feature and a target fine-granularity style feature corresponding to the style coding feature by using the style coding module;
determining a target coarse-granularity condition feature, a target medium-granularity condition feature and a target fine-granularity condition feature corresponding to the image coding feature and the text differential direction vector by using the condition coding module; and
fusing the target coarse-granularity style feature and the target coarse-granularity condition feature by using the fusion module to obtain a first target fusion feature, fusing the target medium-granularity style feature and the target medium-granularity condition feature to obtain a second target fusion feature, fusing the target fine-granularity style feature and the target fine-granularity condition feature to obtain a third target fusion feature, and obtaining the target editing direction vector according to the first target fusion feature, the second target fusion feature and the third target fusion feature.
13. The method of claim 9, wherein the generating a processing result for the source image according to the target editing direction vector and the style coding feature comprises:
determining a target vector according to the target editing direction vector and the style coding feature; and
inputting the target vector into an image generator in the image processing model to obtain a target image as the processing result for the source image.
14. The method of claim 13, wherein the determining the target vector according to the target editing direction vector and the style coding feature comprises:
acquiring a correlation matrix, wherein the correlation matrix is used for representing the correlation between image features and corresponding text features;
multiplying the correlation matrix by the target editing direction vector to obtain a new target editing direction vector; and
adding the new target editing direction vector and the style coding feature to obtain the target vector.
15. The method of claim 9, wherein the determining a text differential direction vector from the first text encoding feature and the second text encoding feature comprises:
calculating a difference between the first text encoding feature and the second text encoding feature to obtain the text differential direction vector.
16. A training apparatus for an image processing model, comprising:
a first feature determination module for determining a first image coding feature and a first style coding feature of a first image;
a second feature determination module for determining a second image coding feature and a second style coding feature of a second image;
a first differential direction vector determining module, configured to determine a first differential direction vector according to the first image coding feature and the second image coding feature;
an editing direction vector determining module for inputting the first image coding feature, the first differential direction vector and the first style coding feature into a differential mapper in an image processing model to obtain an editing direction vector;
a second differential direction vector determining module for determining a second differential direction vector according to the first style coding feature and the second style coding feature; and
an adjusting module for adjusting parameters of the image processing model according to a degree of difference between the second differential direction vector and the editing direction vector;
the differential mapper comprises a style coding module, a condition coding module and a fusion module; the editing direction vector determining module includes:
a style encoding sub-module for determining, by using the style coding module, style features from coarse granularity to fine granularity corresponding to the first style coding feature;
a condition encoding sub-module for determining, by using the condition coding module, condition features from coarse granularity to fine granularity corresponding to the first image coding feature and the first differential direction vector; and
a fusion sub-module for inputting the style features from coarse granularity to fine granularity and the condition features from coarse granularity to fine granularity into three coding layers in the fusion module for fusion, so as to determine the editing direction vector.
17. The apparatus of claim 16, wherein the first feature determination module comprises:
the first input sub-module is used for inputting the first image into an image encoder in the image processing model to obtain the first image coding feature;
the second input sub-module is used for inputting the first image into a style encoder in the image processing model to obtain a first intermediate space feature; and
and the first mapping sub-module is used for mapping the first intermediate space characteristic to a target space to obtain the first style coding characteristic.
18. The apparatus of claim 16, wherein the second feature determination module comprises:
a third input sub-module, configured to input the second image into an image encoder in the image processing model, to obtain the second image coding feature;
a fourth input sub-module, configured to input the second image into a style encoder in the image processing model, to obtain a second intermediate spatial feature; and
a second mapping sub-module for mapping the second intermediate space feature to a target space to obtain the second style coding feature.
19. The apparatus of claim 16, wherein the differential mapper comprises a style coding module, a condition coding module and a fusion module; the editing direction vector determining module comprises:
a style coding sub-module for determining a coarse-granularity style feature, a medium-granularity style feature and a fine-granularity style feature corresponding to the first style coding feature by using the style coding module;
a condition coding sub-module for determining a coarse-granularity condition feature, a medium-granularity condition feature and a fine-granularity condition feature corresponding to the first image coding feature and the first differential direction vector by using the condition coding module; and
a fusion sub-module for fusing the coarse-granularity style feature and the coarse-granularity condition feature by using the fusion module to obtain a first fusion feature, fusing the medium-granularity style feature and the medium-granularity condition feature to obtain a second fusion feature, fusing the fine-granularity style feature and the fine-granularity condition feature to obtain a third fusion feature, and determining the editing direction vector according to the first fusion feature, the second fusion feature and the third fusion feature.
20. The apparatus of claim 16, further comprising:
the first calculating module is used for calculating the Euclidean distance between the second differential direction vector and the editing direction vector;
The second calculating module is used for calculating the cosine distance between the second differential direction vector and the editing direction vector; and
and the difference degree determining module is used for determining the difference degree according to the Euclidean distance and the cosine distance.
21. The apparatus of claim 16, wherein the first differential direction vector determination module comprises:
a third computing sub-module, configured to compute a difference between the first image coding feature and the second image coding feature, to obtain the first differential direction vector.
22. The apparatus of claim 16, wherein the second differential direction vector determination module comprises:
a fourth computing sub-module, configured to compute a difference between the first style coding feature and the second style coding feature, to obtain the second differential direction vector.
23. An image processing apparatus comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring a source image, a source text and a target text, wherein the source text is used for describing an element to be edited in the source image, and the target text is used for describing an editing mode aiming at the element to be edited; and
an input module for inputting the source image, the source text and the target text into an image processing model to obtain a processing result for the source image,
Wherein the image processing model is trained according to the method of any one of claims 1-7.
24. The apparatus of claim 23, wherein the input module comprises:
a text encoding sub-module for determining a first text encoding feature of the source text and a second text encoding feature of the target text;
a text differential sub-module for determining a text differential direction vector according to the first text encoding feature and the second text encoding feature;
a feature determining sub-module for determining an image coding feature and a style coding feature of the source image;
a differential mapping sub-module for inputting the text differential direction vector, the image coding feature and the style coding feature into a differential mapper in the image processing model to obtain a target editing direction vector; and
a generation sub-module for generating a processing result for the source image according to the target editing direction vector and the style coding feature.
25. The apparatus of claim 24, wherein the text encoding submodule comprises:
a first input unit, configured to input the source text into a text encoder in the image processing model, to obtain the first text encoding feature; and
a second input unit for inputting the target text into the text encoder in the image processing model to obtain the second text encoding feature.
26. The apparatus of claim 24, wherein the feature determination submodule comprises:
a third input unit for inputting the source image into an image feature coding model in the image processing model to obtain the image coding feature;
a fourth input unit for inputting the source image into a style coding model in the image processing model to obtain a target intermediate space feature; and
a mapping unit for mapping the target intermediate space feature to a target space to obtain the style coding feature.
27. The apparatus of claim 24, wherein the differential mapper comprises a style coding module, a condition coding module and a fusion module; the differential mapping sub-module comprises:
a first feature determining unit for determining a target coarse-granularity style feature, a target medium-granularity style feature and a target fine-granularity style feature corresponding to the style coding feature by using the style coding module;
a second feature determining unit for determining, by using the condition coding module, a target coarse-granularity condition feature, a target medium-granularity condition feature and a target fine-granularity condition feature corresponding to the image coding feature and the text differential direction vector; and
a fusion unit for fusing the target coarse-granularity style feature and the target coarse-granularity condition feature by using the fusion module to obtain a first target fusion feature, fusing the target medium-granularity style feature and the target medium-granularity condition feature to obtain a second target fusion feature, fusing the target fine-granularity style feature and the target fine-granularity condition feature to obtain a third target fusion feature, and obtaining the target editing direction vector according to the first target fusion feature, the second target fusion feature and the third target fusion feature.
28. The apparatus of claim 24, wherein the generating submodule comprises:
a target vector determining unit, configured to determine a target vector based on the target editing direction vector and the style coding feature; and
a fifth input unit, configured to input the target vector into an image generator in the image processing model to obtain a target image as the processing result for the source image.
29. The apparatus of claim 28, wherein the target vector determination unit comprises:
an acquisition subunit, configured to acquire a correlation matrix, wherein the correlation matrix is used to represent a correlation between an image feature and a corresponding text feature;
a multiplication subunit, configured to multiply the correlation matrix with the target editing direction vector to obtain a new target editing direction vector; and
an adding subunit, configured to add the new target editing direction vector and the style coding feature to obtain the target vector.
30. The apparatus of claim 24, wherein the text differential sub-module comprises:
a calculating unit, configured to calculate a difference between the first text encoding feature and the second text encoding feature, to obtain the text differential direction vector.
31. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
32. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-15.
CN202211270036.2A 2022-10-17 2022-10-17 Training method of image processing model, image processing method and device Active CN116091857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211270036.2A CN116091857B (en) 2022-10-17 2022-10-17 Training method of image processing model, image processing method and device

Publications (2)

Publication Number Publication Date
CN116091857A CN116091857A (en) 2023-05-09
CN116091857B (en) 2023-10-20

Family

ID=86199800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211270036.2A Active CN116091857B (en) 2022-10-17 2022-10-17 Training method of image processing model, image processing method and device

Country Status (1)

Country Link
CN (1) CN116091857B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11257272B2 (en) * 2019-04-25 2022-02-22 Lucid VR, Inc. Generating synthetic image data for machine learning
US20220012596A1 (en) * 2020-07-09 2022-01-13 Nvidia Corporation Attribute-aware image generation using neural networks

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414595A (en) * 2019-07-25 2019-11-05 广西科技大学 The orientation estimate method of texture image with orientation consistency
WO2021129642A1 (en) * 2019-12-23 2021-07-01 Oppo广东移动通信有限公司 Image processing method, apparatus, computer device, and storage medium
WO2022022043A1 (en) * 2020-07-27 2022-02-03 平安科技(深圳)有限公司 Head image generation method, apparatus, server, and storage medium
CN113963087A (en) * 2021-10-12 2022-01-21 北京百度网讯科技有限公司 Image processing method, image processing model training device and storage medium
CN114266840A (en) * 2021-12-21 2022-04-01 北京达佳互联信息技术有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN114612290A (en) * 2022-03-11 2022-06-10 北京百度网讯科技有限公司 Training method of image editing model and image editing method
CN115034957A (en) * 2022-05-06 2022-09-09 西安电子科技大学 Human face sketch portrait editing method based on text description
CN115147261A (en) * 2022-05-17 2022-10-04 腾讯科技(深圳)有限公司 Image processing method, device, storage medium, equipment and product

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery; Or Patashnik et al.; 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2085-2094 *
StyleGAN-NADA: CLIP-guided domain adaptation of image generators; Rinon Gal et al.; ACM Transactions on Graphics; vol. 41, no. 4; 1-13 *
StyleSwap: Style-Based Generator Empowers Robust Face Swapping; Zhiliang Xu et al.; arXiv:2209.13514v1; 1-21 *
A survey of generative adversarial networks and their text-to-image synthesis; Wang Wei et al.; Computer Engineering and Applications; vol. 58, no. 19; 14-36 *
Research on recognition algorithms for the Facial Action Coding System; Hu Xiaorui; China Masters' Theses Full-text Database, Information Science and Technology; I138-370 *

Also Published As

Publication number Publication date
CN116091857A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
JP7301922B2 (en) Semantic retrieval method, device, electronic device, storage medium and computer program
CN114612759B (en) Video processing method, video query method, model training method and model training device
KR20220122566A (en) Text recognition model training method, text recognition method, and apparatus
CN111581926B (en) Document generation method, device, equipment and computer readable storage medium
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN112749300B (en) Method, apparatus, device, storage medium and program product for video classification
CN112580733B (en) Classification model training method, device, equipment and storage medium
CN114693934B (en) Training method of semantic segmentation model, video semantic segmentation method and device
CN115565177B (en) Character recognition model training, character recognition method, device, equipment and medium
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
CN117315334A (en) Image classification method, training device, training equipment and training medium for model
CN116975349A (en) Image retrieval method, device, electronic equipment and storage medium
CN118015144A (en) Image generation method and training method and device of image generation model
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN111444335B (en) Method and device for extracting central word
CN114266937A (en) Model training method, image processing method, device, equipment and storage medium
CN116091857B (en) Training method of image processing model, image processing method and device
CN115481285B (en) Cross-modal video text matching method and device, electronic equipment and storage medium
CN116229095A (en) Model training method, visual task processing method, device and equipment
CN113239215B (en) Classification method and device for multimedia resources, electronic equipment and storage medium
CN112287159B (en) Retrieval method, electronic device and computer readable medium
CN114926322A (en) Image generation method and device, electronic equipment and storage medium
CN113806541A (en) Emotion classification method and emotion classification model training method and device
Nurhasanah et al. Fine-grained object recognition using a combination model of navigator–teacher–scrutinizer and spinal networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant