CN114581690B - Image pair difference description method based on encoding-decoding end - Google Patents

Image pair difference description method based on encoding-decoding end

Info

Publication number
CN114581690B
CN114581690B (Application CN202210248468.7A)
Authority
CN
China
Prior art keywords
image
change
difference
images
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210248468.7A
Other languages
Chinese (zh)
Other versions
CN114581690A (en)
Inventor
高盛祥
岳圣斌
余正涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210248468.7A priority Critical patent/CN114581690B/en
Publication of CN114581690A publication Critical patent/CN114581690A/en
Application granted granted Critical
Publication of CN114581690B publication Critical patent/CN114581690B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding

Abstract

The invention relates to a method for describing differences in image pairs based on an encoding-decoding end. The invention comprises the following steps: 1) firstly, visual features of the images are extracted with a pre-trained feature extractor; 2) the interaction and positional relations among the feature semantics are then modeled to obtain fine-grained information of each image; 3) the differences between the images are accurately represented by a hierarchical interactive matching module between the images, eliminating the interference caused by viewing-angle/illumination changes; 4) finally, the visual and textual features are aligned with a top-down LSTM, and sentences describing the differences between the images are decoded. The method has strong robustness and can accurately describe the difference between two images in the presence of interference factors such as viewing angle and illumination; in experiments on the public datasets provided in this field, the evaluation indexes of the method exceed the current state-of-the-art models and reach the international leading level.

Description

Image pair difference description method based on encoding-decoding end
Technical Field
The invention relates to an image pair difference description method based on an encoding-decoding end, and belongs to the multi-modal technical field spanning natural language processing and computer vision.
Background
We live in a world of constant change, and changes in things are ubiquitous in daily life. As humans, we can infer underlying information from changes detected in a dynamic task environment. For example, a good neurologist, in addition to locating a lesion, can better judge the progression of a patient's condition by comparing CT images captured at different times. For a computer, however, it is very difficult to understand the images and automatically generate a report when a discrepancy is detected. Therefore, in many applications such as damage detection, video surveillance, aerial photography, medical imaging and satellite imaging, how to accurately find the differences in an image pair and automatically generate a report is a critical problem to be solved urgently.
In recent years, cross-modal research combining images and text has attracted increasing attention from researchers in the fields of natural language processing and machine vision. Mainstream tasks include image description generation, visual question answering, visual dialog generation, visual reasoning, text-to-image generation, and the like. Describing image content in natural language (image description generation) is a popular field in artificial intelligence research, and many methods for image difference description have been proposed. However, current image difference analysis and understanding techniques can only analyze and identify specific, limited image-pair information, can only describe simple differences, and cannot accurately describe image differences in the presence of interference factors (illumination/viewing-angle changes). Therefore, a new difference description technique is needed that enables a computer to accurately identify complex semantic information in an image, find the differences between image pairs, and generate a sentence-level text description that better conforms to human language habits using natural language processing techniques.
Disclosure of Invention
The invention provides an image pair difference description method based on an encoding-decoding end, which is used for solving problems such as inaccurate difference localization and erroneous description in the presence of interference factors such as viewing-angle/illumination changes, and for improving the robustness of the model.
The technical scheme of the invention is as follows: the image pair difference description method based on the encoding-decoding end comprises the following specific steps:
step1, utilizing a pre-trained convolutional neural network as a feature extractor, and sending the images before/after the change into the feature extractor to obtain the visual features of the two images;
step2, modeling the semantic interaction and positional relations within each image through a semantic-position purifier, so as to deeply understand the fine-grained information of the image, which forms the basis for an accurate difference representation;
step3, obtaining the characterization of the differences between the images: on the premise of the fine-grained understanding of the images obtained above, distinguishing real changes from viewing-angle/illumination changes by utilizing a hierarchical matching mechanism, capturing the fine change process, and obtaining an accurate difference representation; the hierarchical matching mechanism comprises two parts: a semantic matching module and a checking-and-re-matching module;
step4, sending the difference representation into a decoder, and decoding a natural language sentence describing the difference between the two images;
step5, comprehensively and objectively evaluating the performance of the model of the invention using five evaluation indexes.
As a further aspect of the present invention, Step1 includes: to obtain the visual features, a ResNet-101 pre-trained on ImageNet is used as the feature extractor to extract the grid features of the images, which are average-pooled to a 14 × 14 grid size.
As a further aspect of the present invention, in Step2: the absolute positions and relative positional relations in the image are first encoded. The positions between adjacent objects do not change with the viewpoint, which can be regarded as prior knowledge for distinguishing real changes from viewing-angle changes. Unlike traditional position embeddings, the relative position encoding of the invention is dynamic and is learned automatically, driven by the required interaction between features; the relative position between features is obtained as a 4-dimensional relative position coordinate by modeling the coordinates of the relative upper-left and lower-right corners in the image. By injecting absolute position information into the original image features, changes of objects can be sensitively discerned: an ordered fixed token is assigned to each feature to represent the absolute positional relations, specifically encoded using sine and cosine functions of different frequencies;
The specific steps of Step2 are as follows:
step2.1, encoding the relative positions between features in the picture: encoding the coordinates of the relative upper-left and lower-right corners in the image to obtain the relative position coordinates of the features;
step2.2, sensitively discerning changes of objects by injecting absolute position information into the original image features: assigning an ordered fixed value to each feature in the image to represent its absolute position;
step2.3, based on the self-attention mechanism, integrating the positional and semantic relations to obtain fine-grained information, which can serve as prior knowledge for distinguishing real changes from viewing-angle/illumination changes.
As a further aspect of the invention, in step2.2 the absolute position is specifically encoded using sine and cosine functions of different frequencies.
As a further scheme of the invention, Step3 comprises the following specific steps:
step3.1, finding the differences by first finding the common features between the images: the semantic matching module first coarsely matches the common features of the images before and after the change, i.e., the after/before image is scanned with the before/after image to obtain the common features;
Distinguishing real changes from viewing-angle/illumination changes is critical to this task, and capturing the subtle changes is even more challenging when the magnitude of the viewing-angle change exceeds that of the real changes. In addition, directly finding the differences between pictures is not practical, so the method of the present invention adopts the strategy of first finding the common features and then converting them into the differences. With the fine-grained interaction knowledge obtained above, the semantic matching module coarsely matches the common features of the images before and after the change, i.e., the after/before image is scanned with the before/after image to obtain the common features;
step3.2, using the checking-and-re-matching module, the image before/after the change is regarded as the reference source, and the common features are refined to make the tiny changes prominent.
If the motion of an object is too slight, the minor changes will be overwhelmed by the largely unchanged portions. In this case, the model may misinterpret the two images as a perfect match. In fact, such minor changes are masked by the common features. To capture such minor changes during the interaction, an effective re-check is required to reveal the difference signals from the common features and to help the model describe the exact changes. The checking-and-re-matching mechanism takes the image before/after the change as the reference source and makes the tiny changes prominent by refining the common features;
as a further scheme of the invention, the Step4 comprises the following specific steps:
step4.1, spatially attentively locating the difference between the images before and after the change, and sending the output to an LSTM sentence decoder based on top down to generate a natural language capable of describing the change;
step4.2, jointly train the encoder and decoder by minimizing the negative likelihood of the resulting word sequence.
The Step5 comprises the following steps: the evaluation index includes BIEU-4, METEOR, CIDER, ROUGE-L and SPICE. These scores will be higher if the semantic recognition is correct and the sentence structure is more consistent with the visual features.
The invention has the beneficial effects that:
the image difference description method based on the encoding-decoding end has strong robustness, can accurately describe the difference between two images under the condition that interference factors such as visual angles and illumination exist, can solve the problem of automatically generating difference description reports in the fields of damage detection, video monitoring, aerial photography, medical images, satellite images and the like, reduces the consumption of human resources, and greatly saves time and personnel cost.
The invention makes a first attempt at exploring dynamic modeling of the geometric-semantic exchange relation in difference description. By integrating the positional and semantic interactions driven by the difference-representation learning process, a new approach is explored for image understanding in the case of misalignment between images caused by viewpoint changes. The image pair difference description method based on the encoding-decoding end can capture tiny changes, is immune to the interference caused by viewpoint/illumination changes, and generates captions with the expected content and order. Extensive experiments show that all evaluation indexes of the model exceed the current state-of-the-art models and reach the international leading level.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a flow diagram of a semantic-position refiner of the present invention;
FIG. 3 is a flow chart of the hierarchical interaction matching mechanism of the present invention;
FIG. 4 is a graph comparing the effect of the present invention with a baseline effect;
fig. 5 is a diagram illustrating the effect of the present invention.
Detailed Description
Example 1: as shown in figs. 1 to 5, the image pair difference description method based on the encoding-decoding end comprises the following specific steps:
step1, utilizing a pre-trained convolutional neural network as a feature extractor, and sending the images before/after the change into the feature extractor to obtain the visual features of the two images;
step1.1, to ensure a fair comparison, the experimental datasets of the invention are the difference description datasets CLEVR-Change and Spot-the-Diff provided in this field, where CLEVR-Change is a huge dataset containing 79,606 complex scenes and 493,735 descriptive sentences, and the Spot-the-Diff dataset consists of 13,192 image pairs extracted from surveillance videos of different time periods;
step1.2, to obtain the visual features, a ResNet-101 pre-trained on ImageNet is used as the feature extractor to extract the grid features of the images, which are average-pooled to a 14 × 14 grid size.
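For illustration only, a minimal PyTorch sketch of this feature-extraction step is given below. The torchvision weight identifier, the layer at which the backbone is truncated, and the function names are assumptions of this sketch rather than part of the original disclosure; the 1024 × 14 × 14 shape matches the dimensions reported in the experimental settings.

```python
# Sketch of Step 1: grid features from an ImageNet-pre-trained ResNet-101,
# average-pooled to a 14 x 14 grid (assumes torchvision >= 0.13).
import torch
import torchvision

backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
# Keep layers up to conv4_x (output 1024 channels at stride 16).
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-3])
feature_extractor.eval()

def extract_grid_features(image_batch: torch.Tensor) -> torch.Tensor:
    """image_batch: (B, 3, H, W) normalized images -> (B, 1024, 14, 14) grid features."""
    with torch.no_grad():
        fmap = feature_extractor(image_batch)                     # (B, 1024, h, w)
        grid = torch.nn.functional.adaptive_avg_pool2d(fmap, (14, 14))
    return grid

x_bef = extract_grid_features(torch.randn(2, 3, 224, 224))   # before-change images
x_aft = extract_grid_features(torch.randn(2, 3, 224, 224))   # after-change images
```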
Step2, modeling the semantic interaction and positional relations within each image through a semantic-position purifier, so as to deeply understand the fine-grained information of the image, which forms the basis for an accurate difference representation;
the specific steps of Step2 are as follows:
step2.1, encoding the relative position between features in the picture: coding the coordinates of the relative upper left corner and lower right corner of the image to obtain the relative position coordinates of the features; the method comprises the following specific steps:
The relative positional relation of features i and j in the picture is calculated from their relative coordinates: for each feature i, equation (1) encodes the relative coordinates of its corners into a 4-dimensional relative position coordinate.
From the two-dimensional relative coordinates of the upper-left corner (x_i^tl, y_i^tl) and the lower-right corner (x_i^br, y_i^br), the relative width and height (w_i, h_i) of feature i are computed, and the relative position φ(i, j) between features i and j is then calculated according to formula (2), where (x_i^tl, y_i^tl) is the relative coordinate of the upper-left corner and (x_i^br, y_i^br) is the relative coordinate of the lower-right corner in the image.
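Because formulas (1)-(2) are reproduced only as images in the published text, the sketch below shows one common way such a 4-dimensional relative position φ(i, j) can be computed for 14 × 14 grid features. The log-ratio box-geometry form of φ and all function names are assumptions of this sketch, not the patent's exact equations.

```python
# Sketch of Step 2.1 under stated assumptions: each grid cell is treated as a
# box with corners normalized to [0, 1]; phi(i, j) uses a log-ratio geometry.
import torch

def relative_coordinates(grid_h: int = 14, grid_w: int = 14) -> torch.Tensor:
    """Return (N, 4) relative corners (x_tl, y_tl, x_br, y_br) for N = H*W cells."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    x_tl, y_tl = xs.flatten() / grid_w, ys.flatten() / grid_h
    x_br, y_br = (xs.flatten() + 1) / grid_w, (ys.flatten() + 1) / grid_h
    return torch.stack([x_tl, y_tl, x_br, y_br], dim=-1)

def relative_position(coords: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Assumed phi(i, j): (N, N, 4) log-ratio geometry between cells i and j."""
    w = coords[:, 2] - coords[:, 0]
    h = coords[:, 3] - coords[:, 1]
    cx = (coords[:, 0] + coords[:, 2]) / 2
    cy = (coords[:, 1] + coords[:, 3]) / 2
    dx = torch.log((cx[:, None] - cx[None, :]).abs() / w[:, None] + eps)
    dy = torch.log((cy[:, None] - cy[None, :]).abs() / h[:, None] + eps)
    dw = torch.log(w[:, None] / w[None, :])
    dh = torch.log(h[:, None] / h[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)    # (N, N, 4)

phi = relative_position(relative_coordinates())      # (196, 196, 4)
```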
Step2.2, sensitively discerning changes of objects by injecting absolute position information into the original image features: an ordered fixed value is assigned to each feature in the image to represent its absolute position; specifically, as shown in equations (3)-(4), the absolute position encoding AGE(r, c) is built from sine and cosine functions of different frequencies:
AGE(r, c) = [GE_r; GE_c],  (3)
GE_(pos, 2k) = sin(pos / 10000^(2k/d)),  GE_(pos, 2k+1) = cos(pos / 10000^(2k/d)),  (4)
where pos and d denote the position and dimension of each feature, and r, c denote the row and column indices;
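A minimal sketch of such a grid position encoding is given below. It assumes the standard Transformer-style sinusoid for equation (4); the embedding dimension and function names are illustrative assumptions of the sketch.

```python
# Sketch of Step 2.2: absolute grid position encoding AGE(r, c) = [GE_r; GE_c]
# from sine/cosine functions of different frequencies.
import math
import torch

def sinusoid(pos: torch.Tensor, dim: int) -> torch.Tensor:
    """pos: (N,) integer positions -> (N, dim) sinusoidal encoding."""
    k = torch.arange(0, dim, 2, dtype=torch.float32)
    freq = torch.exp(-math.log(10000.0) * k / dim)          # (dim/2,)
    angles = pos[:, None].float() * freq[None, :]            # (N, dim/2)
    enc = torch.zeros(pos.numel(), dim)
    enc[:, 0::2], enc[:, 1::2] = torch.sin(angles), torch.cos(angles)
    return enc

def absolute_grid_encoding(grid_h: int = 14, grid_w: int = 14, d_model: int = 512) -> torch.Tensor:
    """AGE(r, c): (H*W, d_model) with row and column encodings concatenated."""
    rows, cols = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    ge_r = sinusoid(rows.flatten(), d_model // 2)            # GE_r
    ge_c = sinusoid(cols.flatten(), d_model // 2)            # GE_c
    return torch.cat([ge_r, ge_c], dim=-1)                   # [GE_r; GE_c]

age = absolute_grid_encoding()     # added to the grid features X_i of each image
```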
step2.3, based on the self-attention mechanism of formula (5), the positional and semantic relations are integrated to obtain the fine-grained information G_i, i ∈ {bef, aft}; this information can serve as prior knowledge for distinguishing real changes from viewing-angle/illumination changes:
GSR(Q, K, V) = softmax(Υ_W(Q, K, V))V,  (5)
where Υ_W, defined in formula (6), combines the semantic attention scores with the relative position encoding, and
G_i = GSR(X_i', X_i', X_i'),  (7)
where X_i' denotes the image features with the absolute position information injected, Υ_W automatically adjusts the weighting between positions and semantics as required by difference-representation learning, and W_i^Q, W_i^K, W_i^V are learnable parameter matrices;
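The sketch below illustrates one plausible form of such a geometric-semantic refiner. Because formula (6) appears only as an image, the way the positional term is combined with the dot-product scores (a learned projection of φ added with a learnable weight) is an assumption of this sketch, as are the class and variable names.

```python
# Sketch of a GSR-style position-aware self-attention (Step 2.3), assuming
# Upsilon_W adds a learned projection of phi(i, j) to scaled dot-product scores.
import torch
import torch.nn as nn

class GSR(nn.Module):
    def __init__(self, d_model: int = 512, d_geo: int = 4):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)    # W^Q
        self.wk = nn.Linear(d_model, d_model)    # W^K
        self.wv = nn.Linear(d_model, d_model)    # W^V
        self.geo = nn.Linear(d_geo, 1)           # projects phi(i, j) to a scalar bias
        self.alpha = nn.Parameter(torch.tensor(1.0))  # learnable position/semantic weight

    def forward(self, x: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
        """x: (B, N, d_model) features with absolute positions injected; phi: (N, N, 4)."""
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        semantic = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)    # (B, N, N)
        positional = self.geo(phi).squeeze(-1)                      # (N, N)
        attn = torch.softmax(semantic + self.alpha * positional, dim=-1)
        return attn @ v                                             # fine-grained G_i

gsr = GSR()
g_bef = gsr(torch.randn(2, 196, 512), torch.randn(196, 196, 4))  # phi from Step 2.1
```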
step3, obtaining the characterization of the differences between the images: on the premise of the fine-grained understanding of the images obtained above, distinguishing real changes from viewing-angle/illumination changes by utilizing a hierarchical matching mechanism, capturing the fine change process, and obtaining an accurate difference representation; the hierarchical matching mechanism comprises two parts: a semantic matching module and a checking-and-re-matching module;
as a further scheme of the invention, the Step3 comprises the following specific steps:
step3.1, finding the differences by first finding the common features between the images: the semantic matching module first coarsely matches the common features of the images before and after the change, i.e., the after/before image is scanned with the before/after image to obtain the common features;
Distinguishing real changes from viewing-angle/illumination changes is critical to this task, and capturing the subtle changes is even more challenging when the magnitude of the viewing-angle change exceeds that of the real changes. In addition, directly finding the differences between pictures is not practical, so the method of the present invention adopts the strategy of first finding the common features and then converting them into the differences. Using the fine-grained information G_i of the two pictures obtained above, the semantic matching module coarsely matches the common features of the before-change image G_bef and the after-change image G_aft according to formula (8), i.e., the after/before image is scanned with the before/after image to obtain the common features, where ch_aft is the number of channels and softmax is applied to normalize the similarity Ψ_sim.
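Since formula (8) appears only as an image, the sketch below shows one plausible realization of the semantic matching just described: a scaled dot-product similarity normalized with softmax over the ch_aft channels, used to pool common features from the other image. The learnable projection and all names are assumptions of the sketch.

```python
# Sketch of the semantic matching module (Step 3.1) under stated assumptions.
import torch
import torch.nn as nn

class SemanticMatch(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)   # learnable projection (assumed)

    def forward(self, g_src: torch.Tensor, g_ref: torch.Tensor) -> torch.Tensor:
        """Scan g_ref with g_src; g_src, g_ref: (B, N, d_model), e.g. G_bef and G_aft."""
        ch = g_ref.size(-1)
        psi_sim = torch.softmax(self.proj(g_src) @ g_ref.transpose(-2, -1) / ch ** 0.5, dim=-1)
        return psi_sim @ g_ref                      # coarse common features

g_bef, g_aft = torch.randn(2, 196, 512), torch.randn(2, 196, 512)
match = SemanticMatch()
common_bef = match(g_bef, g_aft)   # common features seen from the before image
common_aft = match(g_aft, g_bef)   # common features seen from the after image
```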
Step3.2, using the checking-and-re-matching module, the image before/after the change is regarded as the reference source, and the common features are refined to make the tiny changes prominent.
If the motion of an object is too slight, the minor changes will be overwhelmed by the largely unchanged portions. In this case, the model may misinterpret the two images as a perfect match. In fact, it is noted that the state-of-the-art methods do not perform satisfactorily in capturing small changes in the presence of viewing-angle changes, because such changes are masked by the common features that dominate the semantic matching process. To capture such minor changes during the interaction, an effective re-check is required to reveal the difference signals from the common features and to help the model describe the exact changes. The checking-and-re-matching mechanism (CA) takes the image before/after the change as the reference source and makes the tiny changes prominent by refining the common features. Taking the before-change feature G_bef as the source as an example, the difference representation is calculated according to formulas (9)-(11), whose core operation is
CA(G_src, G_amp) = G_sim = FC(G_src ⊙ sigmoid(G_amp)),  (10)
where W_s and W_a are learnable parameter matrices and FC is a fully connected layer.
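The sketch below implements the gating of equation (10) as stated; how G_amp is assembled from the coarse common features (formula (9)) and how the final difference representation is derived (formula (11)) are not reproduced in the published text, so the subtraction used at the end is an illustrative assumption.

```python
# Sketch of the checking-and-re-matching module, following equation (10):
# CA(G_src, G_amp) = FC(G_src * sigmoid(G_amp)).
import torch
import torch.nn as nn

class CheckReMatch(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.fc = nn.Linear(d_model, d_model)                 # FC in equation (10)

    def forward(self, g_src: torch.Tensor, g_amp: torch.Tensor) -> torch.Tensor:
        """Refine common features; g_src is the reference source image's features."""
        return self.fc(g_src * torch.sigmoid(g_amp))          # element-wise gating

ca = CheckReMatch()
g_bef, common_bef = torch.randn(2, 196, 512), torch.randn(2, 196, 512)
g_sim = ca(g_bef, common_bef)      # refined common features
g_diff = g_bef - g_sim             # assumed difference representation (cf. formula (11))
```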
Step4, sending the difference representation into a decoder and decoding a natural language sentence describing the difference between the two images; the change language decoder first needs to obtain the change features required for decoding and to attend to which of the three types of features (before-change, difference and after-change) is related to the ground-truth word. The language decoder consists of two parts, namely spatial attention and an LSTM sentence decoder;
as a further scheme of the invention, the Step4 comprises the following specific steps:
step4.1, locating the differences between the images before and after the change with spatial attention, and sending the output to a top-down LSTM sentence decoder to generate natural language describing the change;
a difference characterization G is constructed diff . Spatial attention tells the model the location of the difference features in the two original representations. We first compute the spatial attention map and then by applying G i Applications S nav To locate the variation feature d i ,i∈(bef,aft):
Figure BDA0003545838790000072
/>
Wherein [;]representing a dimension splice, f 2 ,f 1 Is the operation of convolution.
The sentence decoder is based on a top-down two-layer LSTM structure. At each time step t, it first finds, among the three features d_bef, d_diff and d_aft, the feature most relevant to the current word, selected through attention weights; the selected feature and the previous word w_(t-1) (the ground-truth word during training, the predicted word during inference) are then fed to LSTM_w to predict the next word:
d_all = ReLU(FC[d_bef; d_diff; d_aft]),  (13)
with the attention weights and the word prediction given by formulas (14)-(15), where W_1, W_2, b_1, b_2 are learnable parameters, the hidden states of the LSTM_s module and of LSTM_w are maintained at each time step, and Embed is the one-hot encoding of the word w_(t-1);
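The sketch below arranges one decoding step in the usual top-down fashion. Formula (13) is implemented as stated; because formulas (14)-(15) appear only as images, the way the attention LSTM_s, the attention scoring (W_1, W_2, b_1, b_2 folded into a single linear layer here) and the word LSTM_w are wired together is an assumption of the sketch.

```python
# Sketch of one step of the top-down two-layer LSTM sentence decoder (Step 4.1).
import torch
import torch.nn as nn

class ChangeDecoderStep(nn.Module):
    def __init__(self, d_model: int = 512, vocab: int = 1000, emb: int = 300):
        super().__init__()
        self.fc_all = nn.Linear(3 * d_model, d_model)                  # FC in eq. (13)
        self.lstm_s = nn.LSTMCell(d_model + d_model + emb, d_model)    # attention LSTM_s
        self.att = nn.Linear(d_model + d_model, 1)                     # attention scoring
        self.lstm_w = nn.LSTMCell(d_model + d_model, d_model)          # word LSTM_w
        self.embed = nn.Embedding(vocab, emb)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats, w_prev, state_s, state_w):
        """feats: (B, 3, d) stacked [d_bef, d_diff, d_aft]; w_prev: (B,) token ids."""
        d_all = torch.relu(self.fc_all(feats.flatten(1)))              # eq. (13)
        h_s, c_s = self.lstm_s(torch.cat([d_all, state_w[0], self.embed(w_prev)], -1), state_s)
        scores = self.att(torch.cat([feats, h_s[:, None].expand(-1, 3, -1)], -1))  # (B, 3, 1)
        d_hat = (torch.softmax(scores, dim=1) * feats).sum(1)          # selected feature
        h_w, c_w = self.lstm_w(torch.cat([d_hat, h_s], -1), state_w)
        return self.out(h_w), (h_s, c_s), (h_w, c_w)                   # next-word logits

dec = ChangeDecoderStep()
zeros = (torch.zeros(2, 512), torch.zeros(2, 512))
logits, s_s, s_w = dec(torch.randn(2, 3, 512), torch.zeros(2, dtype=torch.long), zeros, zeros)
```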
step4.2, jointly training the encoder and decoder by minimizing the negative log-likelihood of the resulting word sequence, using the cross-entropy loss of formula (16) to optimize training:
L = -Σ_(t=1)^m log p(w_t | w_1, ..., w_(t-1)),  (16)
where m is the length of the sentence.
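For reference, the per-sentence negative log-likelihood with teacher forcing reduces to a token-level cross-entropy; the short sketch below illustrates this with assumed tensor shapes.

```python
# Sketch of the training objective of Step 4.2: token-level cross-entropy
# over the m words of the ground-truth sentence (teacher forcing).
import torch
import torch.nn.functional as F

def sequence_nll(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, m, vocab) step-wise predictions; targets: (B, m) word ids."""
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

loss = sequence_nll(torch.randn(2, 12, 1000), torch.randint(0, 1000, (2, 12)))
```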
Step5, optimizing the objective function of the model with the Adam optimizer, and evaluating with the indexes BLEU-4, METEOR, CIDEr, ROUGE-L and SPICE. These scores are higher when the semantics are recognized correctly and the sentence structure is more consistent with the visual features.
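The patent does not name a scoring implementation; assuming the commonly used pycocoevalcap toolkit, the five metrics can be computed as sketched below, where the example sentences are purely illustrative.

```python
# Sketch of metric computation with the pycocoevalcap toolkit (an assumption).
# gts/res map an image-pair id to reference and generated sentences.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

gts = {"0": ["the small rubber sphere changed its color"]}   # reference caption(s)
res = {"0": ["the small rubber ball changed color"]}         # generated caption

scorers = [(Bleu(4), "BLEU-4"), (Meteor(), "METEOR"), (Rouge(), "ROUGE-L"),
           (Cider(), "CIDEr"), (Spice(), "SPICE")]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    print(name, score[-1] if isinstance(score, list) else score)  # Bleu returns a list
```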
Description of the data set:
the CLEVR-Change dataset is a large-scale dataset consisting of geometric objects, comprising 79606 image pairs and 493735 descriptions. The change types can be classified into six cases, i.e., "color", "texture", "addition", "deletion", "movement", and "disturbance factor (e.g., viewpoint change)".
The Spot-the-Diff dataset consists of 13,192 image pairs, which are extracted from surveillance video from different time periods.
Setting experimental parameters:
to extract visual features, pre-trained ResNet-101 was used on ImageNet, taking advantage of grid features and pooling them on average to 14X 14 grid size. Features of dimensions 1024 × 14 × 14 are embedded into a low-dimensional embedding of dimension 512. The LSTM used in the decoder has a hidden state dimension of 512 and the number of attention heads is 4. Furthermore, each word is represented by a 300-dim vector. For the 38epoch training phase, the model was trained by the Adam Optimizer with a learning rate of 0.001/0.0003 and a batch size of 128/60 on the CLEVR-Change/Spot-the-Diff dataset. Both training and reasoning were implemented using PyTorch on a Titan XP GPU.
Comparative experiments on the CLEVR-Change data set:
to demonstrate the superiority of the model of the invention, extensive experiments were performed in order to compare it with the most advanced methods on the test set. The present invention gives conclusions from four angles: (a) Overall performance (including scene and no scene change); (b) scene change only performance; (c) performance of the model in the absence of scene changes; (d) Performance in some representative scene changes, such as color/texture changes and add/delete/move objects; (e) specification of the SOTA process.
(a) Overall performance. It can be observed from Table 1 that the model of the invention is clearly better than all of the above SOTA models in all indexes.
TABLE 1 Comparative experiment on overall performance
(b) Performance under scene changes only. In the case of a scene change, the image pair contains not only the scene change but also illumination or viewpoint changes. As shown in Table 2, where the best performance is indicated in bold, the method of the present invention surpasses all SOTA methods by a large margin, in particular on CIDEr (from 114.2 to 117.0) and SPICE (from 30.8 to 32.1).
Table 2 Performance comparison experiment in scene change
(c) Performance without scene changes. In this case, the image pairs contain only non-scene changes, i.e., illumination or viewpoint changes. Since M-VAM+RAF did not report all results, the invention is only compared with the results it provides. The METEOR and CIDEr scores of M-VAM+RAF are higher than those of the invention; the invention considers that this is probably due to the introduction of reinforcement learning, and the results of M-VAM (without RAF) can verify this idea. This shows that reinforcement learning does significantly improve the performance of the model in this case, but it increases both training time and computational complexity. Furthermore, since SRDRL+AVS introduces external knowledge for decoding, its CIDEr score exceeds that of the invention. The model of the invention reduces the complexity of model training while understanding the fine-grained information of semantics and positions.
TABLE 3 Performance comparison experiment without scene changes
(d) Performance on representative scene changes ("Color" (C), "Texture" (T), "Add" (A), "Drop" (D) and "Move" (M)). From Table 4 it can be seen that, compared with the SOTA methods, the model of the invention achieves competitive results across the various types of change. The model proposed by the invention achieves impressive results on the "COLOR", "ADD", "DROP" and "TEXTURE" change types. In addition, "TEXTURE" and "MOVE" are the difficult cases of this task under viewpoint change. It can be observed that (1) the invention achieves a good effect on changes of object texture thanks to its fine understanding of semantic and positional interactions; (2) the invention does not perform as well as IFDC on "MOVE" in terms of CIDEr and METEOR, because IFDC uses separate feature extractors to extract the attributes of objects and the image information. This shows that the model of the invention is able to accurately describe object changes while reducing the pre-processing of the data.
Table 4 performance comparison experiments in representative scene changes
(e) Description of the SOTA methods: DUDA was proposed by Park et al. at ICCV 2019; DUDA+AT was proposed by Hosseinzadeh et al. at CVPR 2021; M-VAM and M-VAM+RAF were proposed by Shi et al. at ECCV 2020; VACC was proposed by Kim et al. at ICCV 2021; IFDC was proposed by Huang et al. in IEEE Transactions on Multimedia; SRDRL+AVS was proposed by Tu et al. at ACL 2021.
Comparative experiments on the Spot-the-Diff dataset:
the present invention is compared to the SOTA method in a fully aligned picture without changing the viewing angle. From Table 5, it can be observed that the method of the present invention achieves the best performance on BLUE-4, METEOR, ROUGE-L and SPICE without reinforcement learning. Since the data set has no perspective change, the advantage is mainly that the module can enhance the fine-grained representation and interaction of the object features.
TABLE 5 Comparative experiments on the Spot-the-Diff dataset
And (3) qualitative analysis:
examples of several variant descriptions from the CLEVR-Change dataset test set are illustrated in fig. 4, including human-generated sentences (labels) and sentences generated by the present model. In the context of perspective changes, the model of the present invention not only accurately describes the change process, but also emphasizes features such as relative position and object properties. For example, "small" and "rubber" in the first example emphasize the size and properties of the sphere, and "behind" in the last example emphasizes the relative position of the cube. Furthermore, the middle example shows an accurate description without scene changes and disturbing factors. As these examples demonstrate, the model of the present invention can capture detailed information to produce more accurate and descriptive headings.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (4)

1. The image pair difference description method based on the encoding-decoding end is characterized in that: the method comprises the following specific steps:
step1, utilizing a pre-trained convolutional neural network as a feature extractor, and sending the images before/after change into the feature extractor to obtain visual features of the two images;
step2, modeling semantic interaction and position relation in each image through a semantic-position purifier, so as to deeply understand fine-grained information of the image;
step3, characterization of differences among the acquired images: distinguishing real change or visual/illumination change by utilizing a hierarchical matching mechanism, capturing a fine change process, and obtaining accurate difference representation;
step4, sending the difference representation into a decoder, and decoding a natural language sentence capable of describing the difference between the two images;
the specific steps of Step2 are as follows:
step2.1, encoding the relative position between features in the picture: coding the coordinates of the relative upper left corner and lower right corner of the image to obtain the relative position coordinates of the features;
step2.2, by injecting absolute position information into the original image characteristics, the change of the object is sensitively distinguished; assigning a fixed value in order to each feature in the image to represent the absolute position of each feature;
step2.3, integrating position and semantic relation to obtain fine-grained information based on a self-attention mechanism, wherein the information can become prior knowledge for distinguishing real change and vision/illumination change;
the specific steps of Step3 are as follows:
step3.1, firstly matching the common characteristics of the images before and after change, namely scanning the back/front image through the front/back image to obtain the common characteristics;
step3.2, using a checking and re-matching module to regard the images before/after the change as a reference source, and refining the common features to make the tiny changes prominent.
2. The method for describing difference of image pair based on encoding-decoding end as claimed in claim 1, wherein: the Step1 comprises the following steps: to obtain visual features, a pre-trained ResNet-101 is used as a feature extractor to obtain mesh features of an image.
3. The method for describing difference of image pair based on encoding-decoding end as claimed in claim 1, wherein: in step2.2, the absolute position is specifically encoded using sine and cosine functions of different frequencies.
4. The method for describing difference of image pair based on encoding-decoding end as claimed in claim 1, wherein: the specific steps of Step4 are as follows:
step4.1, spatially attentively locating the difference between the images before and after the change, and sending the output to an LSTM sentence decoder based on top down to generate a natural language capable of describing the change;
step4.2, jointly train the encoder and decoder by minimizing the negative likelihood of the resulting word sequence.
CN202210248468.7A 2022-03-14 2022-03-14 Image pair difference description method based on encoding-decoding end Active CN114581690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210248468.7A CN114581690B (en) 2022-03-14 2022-03-14 Image pair difference description method based on encoding-decoding end

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210248468.7A CN114581690B (en) 2022-03-14 2022-03-14 Image pair difference description method based on encoding-decoding end

Publications (2)

Publication Number Publication Date
CN114581690A CN114581690A (en) 2022-06-03
CN114581690B true CN114581690B (en) 2023-03-24

Family

ID=81774766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210248468.7A Active CN114581690B (en) 2022-03-14 2022-03-14 Image pair difference description method based on encoding-decoding end

Country Status (1)

Country Link
CN (1) CN114581690B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255317B (en) * 2018-08-31 2021-06-11 西北工业大学 Aerial image difference detection method based on double networks
US11361470B2 (en) * 2019-05-09 2022-06-14 Sri International Semantically-aware image-based visual localization

Also Published As

Publication number Publication date
CN114581690A (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
Wang et al. A robust and efficient video representation for action recognition
CA2934514C (en) System and method for identifying faces in unconstrained media
CN111709409A (en) Face living body detection method, device, equipment and medium
CN111770299B (en) Method and system for real-time face abstract service of intelligent video conference terminal
CN111967533B (en) Sketch image translation method based on scene recognition
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
Oluwasammi et al. Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
Cui et al. Face recognition using total loss function on face database with ID photos
CN115018999A (en) Multi-robot-cooperation dense point cloud map construction method and device
CN113298018A (en) False face video detection method and device based on optical flow field and facial muscle movement
Zhou et al. Modeling perspective effects in photographic composition
CN114581690B (en) Image pair difference description method based on encoding-decoding end
CN109165551B (en) Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics
CN116244464A (en) Hand-drawing image real-time retrieval method based on multi-mode data fusion
CN115471901A (en) Multi-pose face frontization method and system based on generation of confrontation network
Zhang et al. Facial expression recognition by analyzing features of conceptual regions
WO2024099026A1 (en) Image processing method and apparatus, device, storage medium and program product
Qu Towards Theoretical and Practical Image Inpainting with Deep Neural Networks
Montserrat Machine Learning-Based Multimedia Analytics
Bhattacharjee Feature Extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant