CN117275068A - Method and system for face forgery detection with uncertainty-guided test-time training - Google Patents
Method and system for face forgery detection with uncertainty-guided test-time training
- Publication number: CN117275068A
- Application number: CN202311224982.8A
- Authority
- CN
- China
- Prior art keywords
- features
- image
- attention
- frequency domain
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/40—Spoof detection, e.g. liveness detection
- G06V40/45—Detection of the body part being alive
Abstract
The invention discloses a face forgery detection method and system with uncertainty-guided test-time training, belonging to the technical fields of deep learning and computer vision. The method comprises the following steps: acquiring an image to be discriminated as an initial input image; acquiring a high-frequency information image of the initial input image; extracting RGB features and frequency domain attention features of different scales from the high-frequency information image, and fusing the RGB features with the frequency domain attention features; performing cross attention calculation on the fused RGB features and frequency domain features to obtain fusion features; and, based on the fusion features, adaptively selecting a fusion mode according to the input image and the task requirements to obtain discrimination features, and performing a classification task based on the discrimination features. The invention makes full use of the effective information in the frequency domain and the RGB domain to mine forgery traces, optimizes the uncertainty in the network with an uncertainty-guided test-time training strategy, and improves generalization performance.
Description
Technical Field
The invention belongs to the technical fields of deep learning and computer vision, and particularly relates to a face forgery detection method and system with uncertainty-guided test-time training.
Background
Face forgery detection is a key technology for detecting faces forged by various means. With the popularization of face recognition technology, face forgery is also on the rise, threatening personal privacy, social security and legal fairness in applications such as face recognition, financial transactions and healthcare. In face recognition applications, face forgery can produce false security records and fraudulent access authorization; in financial transactions, it threatens account security and can cause loss of funds; in healthcare, it can lead to leakage and alteration of medical records and infringement of patient privacy. Concerned about these negative effects, researchers have explored various means of addressing face forgery, and many face forgery detectors have been proposed, based for example on texture features, facial motion features, or deep learning techniques. These technologies play a positive role in face forgery detection and related applications, and promise to help protect personal privacy, social security and legal fairness while making everyday life more convenient and secure.
Early work treated face forgery detection as a binary classification problem, aiming to learn the decision boundary between genuine and forged faces. With the continued development of forgery technology, however, such methods gradually lost their effectiveness. Much recent work has therefore shifted to finding forgery clues in the frequency domain and discriminating the authenticity of a face from these fine clues. Some researchers have proposed similarity models over frequency features to improve performance in unseen domains; others assume that the high-frequency noise of an image suppresses color texture and exposes forgery traces, and use image noise to improve generalization. Non-negligible problems remain: frequency cues are not always sufficiently effective or adaptable across different forgery techniques, and a network trained on a common dataset cannot effectively quantify its own uncertainty.
Disclosure of Invention
The invention aims to provide a face forgery detection method with uncertainty-guided test-time training, which can further mine forgery traces in the frequency domain, fuse frequency domain and RGB information of different qualities, optimize the uncertainty in the network, and thereby improve the generalization performance of face forgery detection.
To achieve the above object, the present invention provides a face forgery detection method with uncertainty-guided test-time training, comprising the following steps:
step S100, an image to be distinguished is obtained as an initial input image;
step S200, acquiring a high-frequency information image of the initial input image, wherein the high-frequency information image is image information positioned in a high-frequency band in the initial input image;
step S300, extracting RGB features and frequency domain attention features of different scales in the high-frequency information image, and fusing the RGB features and the frequency domain attention features to obtain fused RGB features and frequency domain features;
step S400, performing cross attention calculation on the fused RGB features and the frequency domain features to obtain fusion features, where cross attention (cross-attention) refers to an attention mechanism in which each position in one feature sequence performs attention calculation over all positions in another feature sequence;
step S500, based on the fusion features, adaptively selecting a fusion mode according to the input image and the task requirements to obtain discrimination features, and performing a classification task based on the discrimination features.
Further, in the step S200, the initial input image is converted from the spatial domain to the frequency domain by using discrete cosine transform, and the high-frequency information image is screened out.
Further, in the step S300, RGB features and frequency domain attention features are extracted based on a self-attention mechanism and global information of an input sequence in the high frequency information image.
Further, in the step S400, a dual attention mechanism based on channel attention and spatial attention is used to interact information between the fused RGB features and the frequency domain features.
Further, in the step S500, the fusion manner includes a dynamic weighted average fusion manner with uncertainty factors, and the weights of the feature graphs in the fusion features are adaptively adjusted by the dynamic weighted average fusion manner.
The invention also provides a face forgery detection system with uncertainty-guided test-time training, which comprises:
an image acquisition unit for acquiring an image to be discriminated as an initial input image;
a high-frequency conversion unit, configured to obtain a high-frequency information image of the initial input image, where the high-frequency information image is image information located in a high-frequency band in the initial input image;
the feature extraction and fusion unit is used for extracting RGB features and frequency domain attention features of different scales in the high-frequency information image, and fusing the RGB features and the frequency domain attention features to obtain fused RGB features and frequency domain features;
the feature calculation unit is configured to perform cross attention calculation on the fused RGB features and the frequency domain features to obtain fusion features, where cross attention (cross-attention) refers to an attention mechanism in which each position in one feature sequence performs attention calculation over all positions in another feature sequence;
and the feature discrimination unit is configured to adaptively select a fusion mode based on the fusion features according to the input image and the task requirements, obtain discrimination features, and perform a classification task based on the discrimination features.
Further, the high-frequency conversion unit comprises a plurality of converter modules, the plurality of converter modules are divided into three groups, and the three groups of converter modules are sequentially connected in series from high to low according to the corresponding spatial resolution.
Further, the feature extraction and fusion unit further comprises an image feature enhancement processing module, which extracts a frequency domain image attention map from the frequency domain initial input image through serially connected convolutional encoders, combines the attention map with the RGB initial input image, and sends the result to a multi-stage feature conversion extractor; the operation is repeated after each feature conversion extractor to obtain frequency domain image features and frequency-enhanced RGB image features.
The present invention also provides an electronic device, comprising: one or more processors; and a storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the methods described above.
The present invention also provides a computer readable medium having stored thereon a computer program which when executed by a processor implements the method described above.
Compared with the prior art, the face forgery detection method and system with uncertainty-guided test-time training fuse the RGB features with the frequency domain attention features and perform cross attention calculation on the fused RGB features and frequency domain features to obtain fusion features, so that frequency domain and RGB information of different qualities can be fused. Based on the fusion features, a fusion mode is adaptively selected according to the input image and the task requirements to obtain discrimination features, and a classification task is performed on those features. The method can further mine forgery traces in the frequency domain, optimize the uncertainty in the network with an uncertainty-guided test-time training strategy, and thereby improve generalization performance.
Drawings
FIG. 1 is a flowchart of a face forgery detection method with uncertainty-guided test-time training in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature extraction fusion unit according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a frequency domain feature enhancement network based on a converter module in an embodiment of the invention;
FIG. 4 is a schematic diagram of a frequency domain attention computation step in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the cross domain attention calculation step in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a dynamic fusion module in an embodiment of the invention;
FIG. 7 is a schematic diagram of the uncertainty-guided test-time training strategy in one embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein.
It should be understood that, in various embodiments of the present invention, the sequence number of each process does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
It should be understood that in the present invention, "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements that are expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present invention, "plurality" means two or more. "And/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship. "Comprising A, B and C" means all three of A, B and C are comprised; "comprising A, B or C" means one of A, B and C is comprised; and "comprising A, B and/or C" means any one, any two, or all three of A, B and C are comprised.
It should be understood that in the present invention, "B corresponding to A" means that B is associated with A, and B can be determined from A. Determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information. A and B match when the similarity of A and B is greater than or equal to a preset threshold.
As used herein, "if" may be interpreted as "when", "upon", "in response to determining" or "in response to detecting", depending on the context.
The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 is a flowchart of a face forgery detection method with uncertainty-guided test-time training, and fig. 7 is a schematic diagram of the uncertainty-guided test-time training strategy, according to embodiments of the present invention. A face forgery detection method with uncertainty-guided test-time training according to a preferred embodiment of the present invention comprises the following steps:
step S100, an image to be distinguished is obtained as an initial input image;
step S200, acquiring a high-frequency information image of the initial input image, where the high-frequency information image is the image information located in the high-frequency band of the initial input image: the initial input image is converted from the spatial domain to the frequency domain by a discrete cosine transform (DCT), and the image information of the high-frequency band is screened out as the high-frequency information image;
step S300, extracting RGB features and frequency domain attention features of different scales from the high-frequency information image (fig. 4 shows the frequency domain attention calculation step in an embodiment of the invention), and fusing the RGB features with the frequency domain attention features to obtain fused RGB features and frequency domain features;
step S400, performing cross attention calculation on the fused RGB features and the frequency domain features to obtain fusion features, where cross attention (cross-attention) refers to an attention mechanism in which each position in one feature sequence performs attention calculation over all positions in another feature sequence; more accurate fusion features are obtained through the cross attention calculation;
step S500, based on the fusion features, adaptively selecting a fusion mode according to the input image and the task requirements to obtain discrimination features, and performing a classification task based on the discrimination features.
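The cross attention calculation of step S400 can be sketched as follows. This is a minimal single-head NumPy example; the learned query/key/value projection matrices of a full attention layer are omitted for brevity, so the features stand in directly for queries, keys and values:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    # queries: (n_q, d_k) positions of one feature sequence (e.g. fused RGB)
    # context: (n_kv, d_k) all positions of the other sequence (e.g. frequency domain)
    scores = queries @ context.T / np.sqrt(d_k)  # (n_q, n_kv) scaled similarities
    weights = softmax(scores, axis=-1)           # each query attends over ALL context positions
    return weights @ context                     # (n_q, d_k) attended fusion features
```

Each row of `weights` sums to 1, so every position of one modality aggregates information from the entire other modality, which is the interaction the step describes.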
In one embodiment of the present invention, in step S200, the initial input image is transformed from the spatial domain to the frequency domain using the discrete cosine transform, and the high frequency information image is screened out.
In an embodiment of the present invention, in step S300, RGB features and frequency domain attention features are extracted based on the self-attention mechanism and global information of the input sequence in the high frequency information image.
In an embodiment of the present invention, in step S400, a dual attention mechanism based on channel attention and spatial attention is used to interact information between the fused RGB features and the frequency domain features, so as to further improve the expression capability of the fused features.
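The dual attention mechanism described above can be illustrated with a parameter-free NumPy sketch. The real module would place learned layers after the pooling operations, but the gating structure is the same: channel gates computed from spatially squeezed statistics, spatial gates from channel-wise statistics:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (h, w, c); squeeze spatial dims, produce one gate per channel
    gap = feat.mean(axis=(0, 1))   # global average pooling -> (c,)
    gmp = feat.max(axis=(0, 1))    # global max pooling     -> (c,)
    return sigmoid(gap + gmp)      # (c,) gates in (0, 1); a learned MLP would sit here

def spatial_attention(feat):
    # feat: (h, w, c); squeeze channels, produce one gate per spatial position
    avg = feat.mean(axis=-1, keepdims=True)  # (h, w, 1)
    mx = feat.max(axis=-1, keepdims=True)    # (h, w, 1)
    return sigmoid(avg + mx)                 # (h, w, 1) gates; a learned conv would sit here

def dual_attention(feat):
    feat = feat * channel_attention(feat)    # channel-wise reweighting
    feat = feat * spatial_attention(feat)    # position-wise reweighting
    return feat
```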
In an embodiment of the present invention, in step S500, the fusion method includes a dynamic weighted average fusion method with uncertainty factors, and the weights of the feature graphs in the fusion features are adaptively adjusted by the dynamic weighted average fusion method.
The invention also provides a training face counterfeiting detection system containing uncertainty guidance in the test stage, which comprises the following steps:
an image acquisition unit for acquiring an image to be discriminated as an initial input image;
the high-frequency conversion unit is used for acquiring a high-frequency information image of the initial input image, wherein the high-frequency information image is image information positioned in a high-frequency band in the initial input image;
the feature extraction and fusion unit (fig. 2 shows its structure in an embodiment of the present invention), configured to extract RGB features and frequency domain attention features of different scales from the high-frequency information image, and to fuse the RGB features with the frequency domain attention features to obtain fused RGB features and frequency domain features; the feature extraction and fusion unit comprises a frequency domain attention module, which adopts window-based self-attention calculation and thereby realizes and exploits information interaction between frequency domain features;
the feature calculation unit is used for carrying out cross attention calculation on the fused RGB features and the frequency domain features to obtain fused features, wherein the cross attention calculation is also called cross-attention, and refers to that in an attention mechanism, attention calculation is carried out on a certain position in one feature sequence and all positions in the other feature sequence;
the feature discrimination unit, configured to adaptively select a fusion mode based on the fusion features according to the input image and the task requirements, obtain discrimination features, and perform a classification task based on the discrimination features. The feature discrimination unit comprises a dynamic fusion module that performs a dynamic weighted average with uncertainty factors, adaptively adjusting the weight of each feature map in the fusion features and thereby further improving the image discrimination capability. At test time, the uncertainty factors in the dynamic fusion module are used to compute a loss on the output discrimination features and fine-tune the feature discrimination unit; in this unsupervised training mode, only the parameters of the dynamic fusion module are updated.
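The test-time behaviour just described, an unsupervised loss computed on the output with only the dynamic fusion parameters updated, can be sketched as follows. The predictive-entropy loss and the finite-difference gradient here are illustrative stand-ins, not the patent's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy_of_fused(z, branch_logits):
    # z: (k,) fusion logits -- the ONLY parameters trained at test time
    # branch_logits: (k, n_classes) class logits produced by each frozen branch
    w = softmax(z)                          # per-branch fusion weights
    p = softmax(w @ branch_logits)          # fused class distribution
    return -(p * np.log(p + 1e-12)).sum()   # predictive entropy (unsupervised loss)

def ttt_step(z, branch_logits, lr=0.1, eps=1e-4):
    # One unsupervised update of the fusion logits via central finite differences;
    # all other network parameters stay frozen.
    grad = np.zeros_like(z)
    for i in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[i] += eps
        zm[i] -= eps
        grad[i] = (entropy_of_fused(zp, branch_logits)
                   - entropy_of_fused(zm, branch_logits)) / (2 * eps)
    return z - lr * grad
```

Repeated `ttt_step` calls shift weight toward the most confident branch, lowering the predictive entropy of the fused output, which is the sense in which uncertainty in the network is optimized at test time.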
In one embodiment of the present invention, as shown in fig. 6, in order to obtain the corresponding weight of each branch, the dynamic fusion module first integrates the features of the three branches using two linear layers $W_1$ and $W_2$, a global average pooling layer GAP and the Gelu activation function $\delta$, which can be expressed as:

$z = \delta\big(W_2\,\delta\big(W_1\,\mathrm{GAP}([F_1;\, F_2;\, F_3])\big)\big)$

where $F_1$, $F_2$ and $F_3$ are the features of the three branches. Three linear layers $W^{q}_1$, $W^{q}_2$ and $W^{q}_3$ and a softmax function then generate a quality weight $q_i$ for each branch, which can be expressed as:

$[q_1,\, q_2,\, q_3] = \mathrm{softmax}\big([W^{q}_1 z,\; W^{q}_2 z,\; W^{q}_3 z]\big)$

where $q_i$ represents the quality of each branch. Because different branches contribute differently to mining forgery cues, the fused features are weighted according to quality, and two linear layers recover the channel dimension of the dynamic fusion feature. The output $F_{out}$ can be expressed as:

$F_{out} = W_{out}\big([q_1 F_1;\; q_2 F_2;\; q_3 F_3]\big)$
in one embodiment of the present invention, the high frequency conversion unit includes a plurality of converter modules, the plurality of converter modules are divided into three groups, and the three groups of converter modules are serially connected in sequence from high to low according to the corresponding spatial resolution. The three groups of converter modules respectively and correspondingly process the image features with different spatial resolutions, and combine the RGB features with the frequency domain attention features to obtain updated RGB features. The combination process of the RGB features and the frequency domain attention features utilizes the frequency domain attention module to model the interdependence relationship between RGB and high-frequency information images, so that the model can extract the image features more accurately.
In one embodiment of the invention, the input image passes through the high-frequency conversion unit to obtain the high-frequency image input. In the discrete cosine transform (DCT) spectrum, the low-frequency band is the first 1/16 of the spectrum, the middle band lies between 1/16 and 1/8 of the spectrum, and the high-frequency band is the last 7/8. To enhance fine high-frequency artifacts, the low- and middle-frequency information is filtered out by setting it to 0.
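This band split can be sketched with an orthonormal DCT-II: transform the image, zero every coefficient whose band index falls in the first 1/8 of the spectrum (covering both the low 1/16 and the middle 1/16 to 1/8 bands), and transform back. The diagonal band indexing over the 2-D spectrum is an assumption of this sketch; the text does not specify how the spectrum is partitioned:

```python
import numpy as np

def dct_matrix(n):
    # orthonormal DCT-II basis; C @ x applies a 1-D DCT to a length-n vector
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0] /= np.sqrt(2.0)
    return c

def high_pass_dct(img, cutoff=1 / 8):
    # img: (n, n) grayscale patch; zero the low + middle DCT bands, invert
    n = img.shape[0]
    C = dct_matrix(n)
    spec = C @ img @ C.T                      # 2-D DCT spectrum
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    band = (i + j) / (2 * n)                  # 0 = DC corner, ~1 = highest frequencies
    spec[band < cutoff] = 0.0                 # drop the first 1/8 (low + middle bands)
    return C.T @ spec @ C                     # back to the spatial domain
```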
In an embodiment of the present invention, the feature extraction and fusion unit further comprises an image feature enhancement processing module, shown in fig. 3. The module extracts a frequency domain image attention map from the frequency domain initial input image through serially connected convolutional encoders, combines the attention map with the RGB initial input image, and sends the result to a multi-stage feature conversion extractor; the operation is repeated after each feature conversion extractor to obtain frequency domain image features and frequency-enhanced RGB image features. Each convolutional encoder comprises convolutional layers and nonlinear activation layers: it maps the frequency domain image input to a high-dimensional feature domain and extracts shallow frequency domain feature representations through groups of basis modules consisting of a convolutional layer and a nonlinear activation layer. The frequency domain image attention map is obtained by feeding the frequency domain image features into a group of convolutional decoders, likewise consisting of convolutional layers and nonlinear activation layers.
In an embodiment of the present invention, the image feature enhancement processing module processes the input shallow image features as follows: the RGB image x_rgb and the frequency-domain image x_freq are taken as inputs to the Transformer network, and RGB features F_rgb and frequency-domain features F_freq are extracted via Transformer blocks. A frequency attention map is then obtained using the frequency-domain attention module, in order to guide the mining of forgery traces in the RGB modality from the frequency perspective. The frequency-domain attention module is shown in fig. 4, and its calculation process is as follows:
A_freq = σ(Conv7×7(CAT(GAP(F_freq), GMP(F_freq))))

wherein F_freq represents the frequency features after feature extraction, σ represents the Sigmoid function, and GAP and GMP represent global average pooling and global maximum pooling, respectively. CAT denotes concatenating features in the depth direction. We finally select a 7 × 7 convolution kernel to extract the forgery traces in the frequency domain because it detects edge information better and covers a larger area than three 3 × 3 convolution kernels. The attention map A_freq contains subtle forgery traces in the frequency domain that are difficult to mine from the RGB features. Therefore, we apply A_freq to the RGB features F_rgb in order to further mine the forgery traces; the calculation process is as follows:
F'_rgb = F_rgb + A_freq ⊙ F_rgb

wherein + represents element-wise summation and ⊙ represents element-wise multiplication. In addition, the feature extraction process has three stages: low, medium and high. The low-level features represent textural forgery information, while the high-level features capture more of the overall forgery traces. The RGB features and the frequency features therefore interact at multiple levels to obtain a more comprehensive representation of forgery features. Specifically, the frequency-domain output F_freq^i of the i-th stage is used, together with the RGB input F_rgb^(i+1), as the input of the (i+1)-th stage, and the RGB features pre-guided in the frequency domain can be expressed as:

F_rgb^(i+1) = F_rgb^i + A_freq^i ⊙ F_rgb^i
the final stage is then output with featuresAnd->Input into a dynamic fusion module to mine more discriminative information, whereinh、wAndcis the dimension of the output feature.
In an embodiment of the present invention, the feature conversion extractors of all stages in the multi-stage feature conversion extractor share network weights; the output of the feature extractor of the previous stage is used as the input of the feature extractor of the current stage, so that the image features loop through the multi-stage feature conversion extractor a plurality of times. The feature conversion extractor of each stage is configured to pass the input RGB and frequency-domain image features through a window self-attention calculation unit and a corresponding forward-propagation calculation unit, so as to realize information interaction between image features at different positions.
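The shared-weight, recurrent application of one stage can be sketched as follows; `stage_fn` stands in for the window self-attention plus forward-propagation unit, which is not reproduced here:

```python
def multi_stage_extract(x_rgb, x_freq, stage_fn, n_stages=3):
    # Weight sharing across stages is equivalent to applying the same
    # stage function recurrently: stage i's output feeds stage i+1.
    for _ in range(n_stages):
        x_rgb, x_freq = stage_fn(x_rgb, x_freq)
    return x_rgb, x_freq
```

With a toy stage function the recurrence is easy to trace; the design choice (one set of weights reused) keeps the parameter count of the multi-stage extractor at that of a single stage.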
In an embodiment of the present invention, the feature extraction and fusion unit further includes a convolutional decoder, configured to feed the extracted deep image feature expression into a series of stacked basic convolutional layers and upsampling layers, and to perform a pixel-level addition between the skip-connected shallow image features and the deep image features, so as to obtain frequency-domain-enhanced RGB image features and, further, an enhanced RGB image.
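A toy sketch of the decoder's skip-connected, pixel-level addition; nearest-neighbour upsampling is an assumption here, and the patent's convolutional layers between stages are omitted:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decode_with_skips(deep, skips):
    # Progressively upsample the deep features and add the skip-connected
    # shallow features pixel-wise (skips ordered coarse-to-fine).
    x = deep
    for shallow in skips:
        x = upsample2x(x) + shallow
    return x
```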
In some preferred embodiments, the multi-stage feature conversion extractor further comprises a cross-attention calculation module. The feature conversion extractor of each stage combines the feature extraction processes of the self-attention calculation unit and the forward-propagation calculation unit, and employs a cross-layer connection to connect the input and output of the feature extractor by pixel-level addition. The cross-attention calculation module is shown in fig. 5. Specifically, the frequency modality serves as an auxiliary component: given the RGB features F_rgb and the frequency features F_freq, we use the query-key-value mechanism to initially fuse them into a unified representation, which can be expressed as:
F_fuse = Softmax(Q(F_rgb) · K(F_freq)^T / √c) · V(F_freq)

wherein Q, K and V denote three different matrix transforms used for projective transformation of the input features, and h, w and c denote the dimensions of the output features.
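A hedged sketch of this query-key-value fusion, with RGB tokens (features flattened to shape (h·w, c)) supplying the queries and the auxiliary frequency tokens supplying keys and values; the residual connection and the 1/√c scaling are standard attention conventions assumed here, not details confirmed by the text:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f_rgb, f_freq, Wq, Wk, Wv):
    # RGB supplies queries; the auxiliary frequency modality supplies
    # keys and values. Inputs are (N, c) token matrices, N = h * w.
    Q, K, V = f_rgb @ Wq, f_freq @ Wk, f_freq @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # attention over frequency positions
    return f_rgb + A @ V                          # residual fusion into one representation
```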
The embodiment of the invention adds a dynamic fusion module with uncertainty factors. The dynamic fusion module applies uncertainty factors, such as a Gumbel-Softmax function with random perturbation, to perturb the judgement of the relative quality of the images and dynamically select a prediction result as the model output, thereby fine-tuning the network and increasing its ability to fit unknown data.
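The Gumbel-Softmax perturbation used for this dynamic selection can be sketched as follows; the branch contents and the "quality logits" are illustrative placeholders, not quantities defined by the patent:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    # Softmax over logits perturbed with Gumbel(0, 1) noise: a differentiable,
    # stochastic relaxation of discrete selection.
    rng = np.random.default_rng(0) if rng is None else rng
    g = -np.log(-np.log(rng.uniform(1e-12, 1.0, np.shape(logits))))
    y = (np.asarray(logits, dtype=float) + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

def dynamic_fusion(branches, quality_logits, tau=0.5, rng=None):
    # Weight candidate fusion branches by perturbed 'relative quality' scores;
    # low tau pushes the weights toward a near one-hot selection.
    w = gumbel_softmax(quality_logits, tau, rng)
    return sum(wi * b for wi, b in zip(w, branches))
```

Lowering `tau` makes the selection harder (closer to picking one branch), while the injected noise provides the random disturbance that the test-stage training strategy exploits.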
The invention also provides an electronic device, characterized by comprising: one or more processors; a storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the methods described above.
The present invention also provides a computer-readable medium having stored thereon a computer program characterized in that: the program is executed by a processor to implement the method described above.
According to the uncertainty-guided test-stage-training face forgery detection method and system, the RGB features and the frequency-domain attention features are fused, and cross-attention calculation is performed on the fused RGB features and frequency-domain features to obtain fusion features, so that frequency-domain and RGB information of different qualities can be fused. Based on the fusion features, a fusion mode is adaptively selected according to different input images and task requirements to obtain discrimination features, and the classification task is performed based on the discrimination features. The method can further mine forgery traces in the frequency domain, optimize the uncertainty in the network by means of an uncertainty-guided test-stage training strategy, and thereby improve the generalization performance of the method.
It can be understood that the uncertainty-guided test-stage-training face forgery detection system provided in the foregoing embodiment is only illustrated by the division of the functional modules described above. In practical applications, the functions may be allocated to different functional modules as needed; that is, the modules or steps in the foregoing embodiment of the present invention may be further decomposed or combined. For example, the modules in the foregoing embodiment may be combined into one module, or further split into a plurality of sub-modules, to complete all or part of the functions described above.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing descriptions of specific exemplary embodiments of the present invention are presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application to thereby enable one skilled in the art to make and utilize the invention in various exemplary embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.
Claims (10)
1. An uncertainty-guided test-stage-training face forgery detection method, characterized by comprising the following steps:
step S100, an image to be distinguished is obtained as an initial input image;
step S200, acquiring a high-frequency information image of the initial input image, wherein the high-frequency information image is image information positioned in a high-frequency band in the initial input image;
step S300, extracting RGB features and frequency domain attention features of different scales in the high-frequency information image, and fusing the RGB features and the frequency domain attention features to obtain fused RGB features and frequency domain features;
step S400, performing cross attention calculation on the fused RGB features and the frequency domain features to obtain fused features, where the cross attention calculation is also called cross-attention, and refers to performing attention calculation on a certain position in one feature sequence and all positions in another feature sequence in an attention mechanism;
step S500, based on the fusion characteristics, adaptively selecting a fusion mode according to different input images and task requirements to obtain discrimination characteristics, and classifying tasks based on the discrimination characteristics.
2. The method according to claim 1, wherein in step S200, the initial input image is converted from the spatial domain to the frequency domain by discrete cosine transform, and the high-frequency information image is obtained by filtering.
3. The uncertainty-guided test phase-training face-forgery detection method according to claim 1, wherein in the step S300, RGB features and frequency-domain attention features are extracted based on a self-attention mechanism and global information of an input sequence in the high-frequency information image.
4. The method according to claim 1, wherein in step S400, a dual attention mechanism based on channel attention and spatial attention is used to interact information between the fused RGB features and the frequency domain features.
5. The method according to claim 1, wherein in step S500, the fusion method includes a dynamic weighted average fusion method with uncertainty factors, and the weights of the feature maps in the fusion features are adaptively adjusted by the dynamic weighted average fusion method.
6. A training face-forgery detection system for a test phase with uncertainty guidance, comprising:
an image acquisition unit for acquiring an image to be discriminated as an initial input image;
a high-frequency conversion unit, configured to obtain a high-frequency information image of the initial input image, where the high-frequency information image is image information located in a high-frequency band in the initial input image;
the feature extraction and fusion unit is used for extracting RGB features and frequency domain attention features of different scales in the high-frequency information image, and fusing the RGB features and the frequency domain attention features to obtain fused RGB features and frequency domain features;
the feature calculation unit is used for carrying out cross attention calculation on the fused RGB features and the frequency domain features to obtain fused features, wherein the cross attention calculation is also called cross-attention, and refers to that in an attention mechanism, attention calculation is carried out on a certain position in one feature sequence and all positions in the other feature sequence;
and the feature judging unit is used for adaptively selecting a fusion mode based on the fusion features according to different input images and task requirements to obtain judging features and classifying tasks based on the judging features.
7. The uncertainty-guided testing phase-training face-forgery detection system of claim 6, wherein the high-frequency conversion unit includes a number of converter modules that are divided into three groups that are serially connected in sequence from high to low according to the corresponding spatial resolution.
8. The system according to claim 6, wherein the feature extraction and fusion unit further comprises an image feature enhancement processing module, the image feature enhancement processing module is configured to extract a frequency domain image attention map from a frequency domain initial input image through a convolutional encoder connected in series, combine the frequency domain image attention map with an RGB initial input image, send the frequency domain image attention map to a multi-stage feature conversion extractor, and obtain frequency domain image features and frequency domain enhanced RGB image features after the processing by the feature conversion extractor.
9. An electronic device, comprising: one or more processors; a storage means for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
10. A computer readable medium having a computer program stored thereon, characterized by: the program, when executed by a processor, implements the method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311224982.8A CN117275068B (en) | 2023-09-21 | 2023-09-21 | Method and system for training human face fake detection in test stage containing uncertainty guidance |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311224982.8A CN117275068B (en) | 2023-09-21 | 2023-09-21 | Method and system for training human face fake detection in test stage containing uncertainty guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117275068A true CN117275068A (en) | 2023-12-22 |
CN117275068B CN117275068B (en) | 2024-05-17 |
Family
ID=89202101
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311224982.8A Active CN117275068B (en) | 2023-09-21 | 2023-09-21 | Method and system for training human face fake detection in test stage containing uncertainty guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117275068B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200402223A1 (en) * | 2019-06-24 | 2020-12-24 | Insurance Services Office, Inc. | Machine Learning Systems and Methods for Improved Localization of Image Forgery |
CN113536990A (en) * | 2021-06-29 | 2021-10-22 | 复旦大学 | Deep fake face data identification method |
CN114495245A (en) * | 2022-04-08 | 2022-05-13 | 北京中科闻歌科技股份有限公司 | Face counterfeit image identification method, device, equipment and medium |
CN114898432A (en) * | 2022-05-17 | 2022-08-12 | 中南大学 | Fake face video detection method and system based on multi-feature fusion |
CN115147895A (en) * | 2022-06-16 | 2022-10-04 | 北京百度网讯科技有限公司 | Face counterfeit discrimination method and device and computer program product |
CN115393760A (en) * | 2022-08-16 | 2022-11-25 | 公安部物证鉴定中心 | Method, system and equipment for detecting Deepfake composite video |
CN115601820A (en) * | 2022-12-01 | 2023-01-13 | 思腾合力(天津)科技有限公司(Cn) | Face fake image detection method, device, terminal and storage medium |
US20230081645A1 (en) * | 2021-01-28 | 2023-03-16 | Tencent Technology (Shenzhen) Company Limited | Detecting forged facial images using frequency domain information and local correlation |
CN115880749A (en) * | 2022-11-08 | 2023-03-31 | 杭州中科睿鉴科技有限公司 | Face deep false detection method based on multi-mode feature fusion |
CN115909445A (en) * | 2022-11-11 | 2023-04-04 | 中国人民解放军国防科技大学 | Face image counterfeiting detection method and related equipment |
CN116434351A (en) * | 2023-04-23 | 2023-07-14 | 厦门大学 | Fake face detection method, medium and equipment based on frequency attention feature fusion |
Non-Patent Citations (1)
Title |
---|
CAO, Shenhao et al.: "A Survey of Face Forgery and Detection Techniques", Journal of Image and Graphics (中国图像图形学报), vol. 27, no. 4, 30 April 2022 (2022-04-30), pages 1023-1038 * |
Also Published As
Publication number | Publication date |
---|---|
CN117275068B (en) | 2024-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Rao et al. | Deep learning local descriptor for image splicing detection and localization | |
Guo et al. | Fake face detection via adaptive manipulation traces extraction network | |
Rao et al. | Multi-semantic CRF-based attention model for image forgery detection and localization | |
Yu et al. | A multi-purpose image counter-anti-forensic method using convolutional neural networks | |
De La Croix et al. | Toward secret data location via fuzzy logic and convolutional neural network | |
Wei et al. | Controlling neural learning network with multiple scales for image splicing forgery detection | |
Yang et al. | Design of cyber-physical-social systems with forensic-awareness based on deep learning | |
Wei et al. | Universal deep network for steganalysis of color image based on channel representation | |
Liu et al. | Image deblocking detection based on a convolutional neural network | |
Liang et al. | Image resampling detection based on convolutional neural network | |
Chen et al. | Image splicing localization using residual image and residual-based fully convolutional network | |
CN116958637A (en) | Training method, device, equipment and storage medium of image detection model | |
CN114677372A (en) | Depth forged image detection method and system integrating noise perception | |
Kumar et al. | A hybrid method for the removal of RVIN using self organizing migration with adaptive dual threshold median filter | |
Tripathi et al. | Image splicing detection system using intensity-level multi-fractal dimension feature engineering and twin support vector machine based classifier | |
Meena et al. | Image splicing forgery detection techniques: A review | |
Sonam et al. | Secure digital image watermarking using memristor-based hyperchaotic circuit | |
Zarrabi et al. | BlessMark: a blind diagnostically-lossless watermarking framework for medical applications based on deep neural networks | |
CN117275068B (en) | Method and system for training human face fake detection in test stage containing uncertainty guidance | |
Singh et al. | StegGAN: hiding image within image using conditional generative adversarial networks | |
Li | Saliency prediction based on multi-channel models of visual processing | |
CN116522326A (en) | Data enhancement model and method suitable for power grid information attack detection | |
Mazumdar et al. | Siamese convolutional neural network‐based approach towards universal image forensics | |
CN115358952A (en) | Image enhancement method, system, equipment and storage medium based on meta-learning | |
Xu et al. | Steganography algorithms recognition based on match image and deep features verification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||