CN114581690B - Image pair difference description method based on encoding-decoding end - Google Patents

Image pair difference description method based on encoding-decoding end

Info

Publication number
CN114581690B
CN114581690B (Application CN202210248468.7A)
Authority
CN
China
Prior art keywords
image
change
difference
images
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210248468.7A
Other languages
Chinese (zh)
Other versions
CN114581690A (en)
Inventor
高盛祥
岳圣斌
余正涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210248468.7A priority Critical patent/CN114581690B/en
Publication of CN114581690A publication Critical patent/CN114581690A/en
Application granted granted Critical
Publication of CN114581690B publication Critical patent/CN114581690B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding

Abstract

The invention relates to a method for describing differences in image pairs based on an encoding-decoding end. The invention comprises the following steps: 1) firstly, visual features of the images are extracted with a pre-trained feature extractor; 2) the interaction and positional relations among the feature semantics are then modeled to obtain fine-grained information of each image; 3) the differences between the images are accurately represented by a hierarchical interactive matching module between the images, eliminating the interference caused by viewing-angle/illumination changes; 4) finally, the visual and textual features are aligned with a top-down LSTM, and sentences describing the differences between the images are decoded. The method has strong robustness and can accurately describe the difference between two images in the presence of interference factors such as viewing angle and illumination; in experiments on the public datasets provided in this field, the evaluation indexes of the method exceed the current state-of-the-art models and reach the international leading level.

Description

Image pair difference description method based on encoding-decoding end
Technical Field
The invention relates to an image pair difference description method based on an encoding-decoding end, and belongs to the multi-modal technical field spanning natural language processing and computer vision.
Background
We live in a world of constant change, and changes in things are ubiquitous in daily life. As humans, we can infer underlying information from changes detected in a dynamic task environment. For example, a good neurologist, in addition to locating a lesion, can better judge the progression of a patient's condition by comparing CT images captured at different times. For a computer, however, it is very difficult to understand the images and automatically generate a report when a discrepancy is detected. Therefore, in many applications such as damage detection, video surveillance, aerial photography, medical imaging and satellite imaging, how to accurately find the differences in an image pair and automatically generate a report is a critical problem to be solved urgently.
In recent years, cross-modal research combining images and text has attracted increasing attention from researchers in the fields of natural language processing and machine vision. Mainstream tasks include image description generation, visual question answering, visual dialog generation, visual reasoning, text-to-image generation, and the like. Describing image content in natural language (image description generation) is a popular field in artificial intelligence research, and many methods for image difference description have been proposed. However, current image difference analysis and understanding techniques can only analyze and identify specific, limited image-pair information, can only describe simple differences, and cannot accurately describe image differences in the presence of interference factors (illumination/viewing-angle changes). Therefore, a new difference description technique is needed that enables a computer to accurately identify complex semantic information in an image, find the differences between image pairs, and generate a sentence-level text description that better conforms to human language habits using natural language processing techniques.
Disclosure of Invention
The invention provides an image pair difference description method based on an encoding-decoding end, which is used for solving problems such as inaccurate difference localization and erroneous description in the presence of interference factors such as viewing-angle/illumination changes, and for improving the robustness of the model.
The technical scheme of the invention is as follows: the image pair difference description method based on the encoding-decoding end comprises the following specific steps:
step1, utilizing a pre-trained convolutional neural network as a feature extractor, and sending the images before/after the change into the feature extractor to obtain the visual features of the two images;
step2, modeling the semantic interaction and positional relations within each image through a semantic-position purifier, so as to deeply understand the fine-grained information of the image, which forms the basis for an accurate difference representation;
step3, obtaining the characterization of the differences between the images: on the premise of the fine-grained understanding of the images obtained above, distinguishing real changes from viewing-angle/illumination changes by utilizing a hierarchical matching mechanism, capturing the fine change process, and obtaining an accurate difference representation; the hierarchical matching mechanism comprises two parts: a semantic matching module and a checking-and-re-matching module;
step4, sending the difference representation into a decoder, and decoding a natural language sentence describing the difference between the two images;
step5, comprehensively and objectively evaluating the performance of the model of the invention using five evaluation indexes.
As a further aspect of the present invention, Step1 includes: to obtain the visual features, a ResNet-101 pre-trained on ImageNet is used as the feature extractor to extract the grid features of the images, which are average-pooled to a 14 × 14 grid size.
As a further aspect of the present invention, in Step2: the absolute positions and relative positional relations in the image are first encoded. The positions between adjacent objects do not change with the viewpoint, which can be regarded as prior knowledge for distinguishing real changes from viewing-angle changes. Unlike traditional position embeddings, the relative position encoding of the invention is dynamic and is learned automatically, driven by the required interaction between features; the relative position between features is obtained as a 4-dimensional relative position coordinate by modeling the coordinates of the relative upper-left and lower-right corners in the image. By injecting absolute position information into the original image features, changes of objects can be sensitively discerned: an ordered fixed token is assigned to each feature to represent the absolute positional relations, specifically encoded using sine and cosine functions of different frequencies;
The specific steps of Step2 are as follows:
step2.1, encoding the relative positions between features in the picture: encoding the coordinates of the relative upper-left and lower-right corners in the image to obtain the relative position coordinates of the features;
step2.2, sensitively discerning changes of objects by injecting absolute position information into the original image features: assigning an ordered fixed value to each feature in the image to represent its absolute position;
step2.3, based on the self-attention mechanism, integrating the positional and semantic relations to obtain fine-grained information, which can serve as prior knowledge for distinguishing real changes from viewing-angle/illumination changes.
As a further aspect of the invention, in step2.2 the absolute position is specifically encoded using sine and cosine functions of different frequencies.
As a further scheme of the invention, Step3 comprises the following specific steps:
step3.1, finding the differences by first finding the common features between the images: the semantic matching module first coarsely matches the common features of the images before and after the change, i.e., the after/before image is scanned with the before/after image to obtain the common features;
Distinguishing real changes from viewing-angle/illumination changes is critical to this task, and capturing the subtle changes is even more challenging when the magnitude of the viewing-angle change exceeds that of the real changes. In addition, directly finding the differences between pictures is not practical, so the method of the present invention adopts the strategy of first finding the common features and then converting them into the differences. With the fine-grained interaction knowledge obtained above, the semantic matching module coarsely matches the common features of the images before and after the change, i.e., the after/before image is scanned with the before/after image to obtain the common features;
step3.2, using the checking-and-re-matching module, the image before/after the change is regarded as the reference source, and the common features are refined to make the tiny changes prominent.
If the motion of an object is too slight, the minor changes will be overwhelmed by the largely unchanged portions. In this case, the model may misinterpret the two images as a perfect match. In fact, such minor changes are masked by the common features. To capture such minor changes during the interaction, an effective re-check is required to reveal the difference signals from the common features and to help the model describe the exact changes. The checking-and-re-matching mechanism takes the image before/after the change as the reference source and makes the tiny changes prominent by refining the common features;
as a further scheme of the invention, the Step4 comprises the following specific steps:
step4.1, spatially attentively locating the difference between the images before and after the change, and sending the output to an LSTM sentence decoder based on top down to generate a natural language capable of describing the change;
step4.2, jointly train the encoder and decoder by minimizing the negative likelihood of the resulting word sequence.
The Step5 comprises the following steps: the evaluation index includes BIEU-4, METEOR, CIDER, ROUGE-L and SPICE. These scores will be higher if the semantic recognition is correct and the sentence structure is more consistent with the visual features.
The invention has the beneficial effects that:
the image difference description method based on the encoding-decoding end has strong robustness, can accurately describe the difference between two images under the condition that interference factors such as visual angles and illumination exist, can solve the problem of automatically generating difference description reports in the fields of damage detection, video monitoring, aerial photography, medical images, satellite images and the like, reduces the consumption of human resources, and greatly saves time and personnel cost.
The invention makes a first attempt at exploring dynamic modeling of the geometric-semantic exchange relation in difference description. By integrating the positional and semantic interactions driven by the difference-representation learning process, a new approach is explored for image understanding in the case of misalignment between images caused by viewpoint changes. The image pair difference description method based on the encoding-decoding end can capture tiny changes, is immune to the interference caused by viewpoint/illumination changes, and generates captions with the expected content and order. Extensive experiments show that all evaluation indexes of the model exceed the current state-of-the-art models and reach the international leading level.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a flow diagram of a semantic-position refiner of the present invention;
FIG. 3 is a flow chart of the hierarchical interaction matching mechanism of the present invention;
FIG. 4 is a graph comparing the effect of the present invention with a baseline effect;
fig. 5 is a diagram illustrating the effect of the present invention.
Detailed Description
Example 1: as shown in figs. 1 to 5, the image pair difference description method based on the encoding-decoding end comprises the following specific steps:
step1, utilizing a pre-trained convolutional neural network as a feature extractor, and sending the images before/after the change into the feature extractor to obtain the visual features of the two images;
step1.1, to ensure a fair comparison, the experimental datasets of the invention are the difference description datasets CLEVR-Change and Spot-the-Diff provided in this field, where CLEVR-Change is a huge dataset containing 79,606 complex scenes and 493,735 descriptive sentences, and the Spot-the-Diff dataset consists of 13,192 image pairs extracted from surveillance videos of different time periods;
step1.2, to obtain the visual features, a ResNet-101 pre-trained on ImageNet is used as the feature extractor to extract the grid features of the images, which are average-pooled to a 14 × 14 grid size.
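For illustration only, a minimal PyTorch sketch of this feature-extraction step is given below. The torchvision weight identifier, the layer at which the backbone is truncated, and the function names are assumptions of this sketch rather than part of the original disclosure; the 1024 × 14 × 14 shape matches the dimensions reported in the experimental settings.

```python
# Sketch of Step 1: grid features from an ImageNet-pre-trained ResNet-101,
# average-pooled to a 14 x 14 grid (assumes torchvision >= 0.13).
import torch
import torchvision

backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
# Keep layers up to conv4_x (output 1024 channels at stride 16).
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-3])
feature_extractor.eval()

def extract_grid_features(image_batch: torch.Tensor) -> torch.Tensor:
    """image_batch: (B, 3, H, W) normalized images -> (B, 1024, 14, 14) grid features."""
    with torch.no_grad():
        fmap = feature_extractor(image_batch)                     # (B, 1024, h, w)
        grid = torch.nn.functional.adaptive_avg_pool2d(fmap, (14, 14))
    return grid

x_bef = extract_grid_features(torch.randn(2, 3, 224, 224))   # before-change images
x_aft = extract_grid_features(torch.randn(2, 3, 224, 224))   # after-change images
```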
Step2, modeling the semantic interaction and positional relations within each image through a semantic-position purifier, so as to deeply understand the fine-grained information of the image, which forms the basis for an accurate difference representation;
the specific steps of Step2 are as follows:
step2.1, encoding the relative position between features in the picture: coding the coordinates of the relative upper left corner and lower right corner of the image to obtain the relative position coordinates of the features; the method comprises the following specific steps:
The relative positional relation of features i and j in the picture is calculated from their relative coordinates: for each feature i, equation (1) encodes the relative coordinates of its corners into a 4-dimensional relative position coordinate.
From the two-dimensional relative coordinates of the upper-left corner (x_i^tl, y_i^tl) and the lower-right corner (x_i^br, y_i^br), the relative width and height (w_i, h_i) of feature i are computed, and the relative position φ(i, j) between features i and j is then calculated according to formula (2), where (x_i^tl, y_i^tl) is the relative coordinate of the upper-left corner and (x_i^br, y_i^br) is the relative coordinate of the lower-right corner in the image.
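Because formulas (1)-(2) are reproduced only as images in the published text, the sketch below shows one common way such a 4-dimensional relative position φ(i, j) can be computed for 14 × 14 grid features. The log-ratio box-geometry form of φ and all function names are assumptions of this sketch, not the patent's exact equations.

```python
# Sketch of Step 2.1 under stated assumptions: each grid cell is treated as a
# box with corners normalized to [0, 1]; phi(i, j) uses a log-ratio geometry.
import torch

def relative_coordinates(grid_h: int = 14, grid_w: int = 14) -> torch.Tensor:
    """Return (N, 4) relative corners (x_tl, y_tl, x_br, y_br) for N = H*W cells."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    x_tl, y_tl = xs.flatten() / grid_w, ys.flatten() / grid_h
    x_br, y_br = (xs.flatten() + 1) / grid_w, (ys.flatten() + 1) / grid_h
    return torch.stack([x_tl, y_tl, x_br, y_br], dim=-1)

def relative_position(coords: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Assumed phi(i, j): (N, N, 4) log-ratio geometry between cells i and j."""
    w = coords[:, 2] - coords[:, 0]
    h = coords[:, 3] - coords[:, 1]
    cx = (coords[:, 0] + coords[:, 2]) / 2
    cy = (coords[:, 1] + coords[:, 3]) / 2
    dx = torch.log((cx[:, None] - cx[None, :]).abs() / w[:, None] + eps)
    dy = torch.log((cy[:, None] - cy[None, :]).abs() / h[:, None] + eps)
    dw = torch.log(w[:, None] / w[None, :])
    dh = torch.log(h[:, None] / h[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)    # (N, N, 4)

phi = relative_position(relative_coordinates())      # (196, 196, 4)
```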
Step2.2, sensitively discerning changes of objects by injecting absolute position information into the original image features: an ordered fixed value is assigned to each feature in the image to represent its absolute position; specifically, as shown in equations (3)-(4), the absolute position encoding AGE(r, c) is built from sine and cosine functions of different frequencies:
AGE(r, c) = [GE_r; GE_c],  (3)
GE_(pos, 2k) = sin(pos / 10000^(2k/d)),  GE_(pos, 2k+1) = cos(pos / 10000^(2k/d)),  (4)
where pos and d denote the position and dimension of each feature, and r, c denote the row and column indices;
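A minimal sketch of such a grid position encoding is given below. It assumes the standard Transformer-style sinusoid for equation (4); the embedding dimension and function names are illustrative assumptions of the sketch.

```python
# Sketch of Step 2.2: absolute grid position encoding AGE(r, c) = [GE_r; GE_c]
# from sine/cosine functions of different frequencies.
import math
import torch

def sinusoid(pos: torch.Tensor, dim: int) -> torch.Tensor:
    """pos: (N,) integer positions -> (N, dim) sinusoidal encoding."""
    k = torch.arange(0, dim, 2, dtype=torch.float32)
    freq = torch.exp(-math.log(10000.0) * k / dim)          # (dim/2,)
    angles = pos[:, None].float() * freq[None, :]            # (N, dim/2)
    enc = torch.zeros(pos.numel(), dim)
    enc[:, 0::2], enc[:, 1::2] = torch.sin(angles), torch.cos(angles)
    return enc

def absolute_grid_encoding(grid_h: int = 14, grid_w: int = 14, d_model: int = 512) -> torch.Tensor:
    """AGE(r, c): (H*W, d_model) with row and column encodings concatenated."""
    rows, cols = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    ge_r = sinusoid(rows.flatten(), d_model // 2)            # GE_r
    ge_c = sinusoid(cols.flatten(), d_model // 2)            # GE_c
    return torch.cat([ge_r, ge_c], dim=-1)                   # [GE_r; GE_c]

age = absolute_grid_encoding()     # added to the grid features X_i of each image
```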
step2.3, based on the self-attention mechanism of formula (5), the positional and semantic relations are integrated to obtain the fine-grained information G_i, i ∈ {bef, aft}; this information can serve as prior knowledge for distinguishing real changes from viewing-angle/illumination changes:
GSR(Q, K, V) = softmax(Υ_W(Q, K, V))V,  (5)
where Υ_W, defined in formula (6), combines the semantic attention scores with the relative position encoding, and
G_i = GSR(X_i', X_i', X_i'),  (7)
where X_i' denotes the image features with the absolute position information injected, Υ_W automatically adjusts the weighting between positions and semantics as required by difference-representation learning, and W_i^Q, W_i^K, W_i^V are learnable parameter matrices;
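The sketch below illustrates one plausible form of such a geometric-semantic refiner. Because formula (6) appears only as an image, the way the positional term is combined with the dot-product scores (a learned projection of φ added with a learnable weight) is an assumption of this sketch, as are the class and variable names.

```python
# Sketch of a GSR-style position-aware self-attention (Step 2.3), assuming
# Upsilon_W adds a learned projection of phi(i, j) to scaled dot-product scores.
import torch
import torch.nn as nn

class GSR(nn.Module):
    def __init__(self, d_model: int = 512, d_geo: int = 4):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)    # W^Q
        self.wk = nn.Linear(d_model, d_model)    # W^K
        self.wv = nn.Linear(d_model, d_model)    # W^V
        self.geo = nn.Linear(d_geo, 1)           # projects phi(i, j) to a scalar bias
        self.alpha = nn.Parameter(torch.tensor(1.0))  # learnable position/semantic weight

    def forward(self, x: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
        """x: (B, N, d_model) features with absolute positions injected; phi: (N, N, 4)."""
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        semantic = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)    # (B, N, N)
        positional = self.geo(phi).squeeze(-1)                      # (N, N)
        attn = torch.softmax(semantic + self.alpha * positional, dim=-1)
        return attn @ v                                             # fine-grained G_i

gsr = GSR()
g_bef = gsr(torch.randn(2, 196, 512), torch.randn(196, 196, 4))  # phi from Step 2.1
```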
step3, obtaining the characterization of the differences between the images: on the premise of the fine-grained understanding of the images obtained above, distinguishing real changes from viewing-angle/illumination changes by utilizing a hierarchical matching mechanism, capturing the fine change process, and obtaining an accurate difference representation; the hierarchical matching mechanism comprises two parts: a semantic matching module and a checking-and-re-matching module;
as a further scheme of the invention, the Step3 comprises the following specific steps:
step3.1, finding the differences by first finding the common features between the images: the semantic matching module first coarsely matches the common features of the images before and after the change, i.e., the after/before image is scanned with the before/after image to obtain the common features;
Distinguishing real changes from viewing-angle/illumination changes is critical to this task, and capturing the subtle changes is even more challenging when the magnitude of the viewing-angle change exceeds that of the real changes. In addition, directly finding the differences between pictures is not practical, so the method of the present invention adopts the strategy of first finding the common features and then converting them into the differences. Using the fine-grained information G_i of the two pictures obtained above, the semantic matching module coarsely matches the common features of the before-change image G_bef and the after-change image G_aft according to formula (8), i.e., the after/before image is scanned with the before/after image to obtain the common features, where ch_aft is the number of channels and softmax is applied to normalize the similarity Ψ_sim.
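Since formula (8) appears only as an image, the sketch below shows one plausible realization of the semantic matching just described: a scaled dot-product similarity normalized with softmax over the ch_aft channels, used to pool common features from the other image. The learnable projection and all names are assumptions of the sketch.

```python
# Sketch of the semantic matching module (Step 3.1) under stated assumptions.
import torch
import torch.nn as nn

class SemanticMatch(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)   # learnable projection (assumed)

    def forward(self, g_src: torch.Tensor, g_ref: torch.Tensor) -> torch.Tensor:
        """Scan g_ref with g_src; g_src, g_ref: (B, N, d_model), e.g. G_bef and G_aft."""
        ch = g_ref.size(-1)
        psi_sim = torch.softmax(self.proj(g_src) @ g_ref.transpose(-2, -1) / ch ** 0.5, dim=-1)
        return psi_sim @ g_ref                      # coarse common features

g_bef, g_aft = torch.randn(2, 196, 512), torch.randn(2, 196, 512)
match = SemanticMatch()
common_bef = match(g_bef, g_aft)   # common features seen from the before image
common_aft = match(g_aft, g_bef)   # common features seen from the after image
```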
Step3.2, using the checking-and-re-matching module, the image before/after the change is regarded as the reference source, and the common features are refined to make the tiny changes prominent.
If the motion of an object is too slight, the minor changes will be overwhelmed by the largely unchanged portions. In this case, the model may misinterpret the two images as a perfect match. In fact, it is noted that the state-of-the-art methods do not perform satisfactorily in capturing small changes in the presence of viewing-angle changes, because such changes are masked by the common features that dominate the semantic matching process. To capture such minor changes during the interaction, an effective re-check is required to reveal the difference signals from the common features and to help the model describe the exact changes. The checking-and-re-matching mechanism (CA) takes the image before/after the change as the reference source and makes the tiny changes prominent by refining the common features. Taking the before-change feature G_bef as the source as an example, the difference representation is calculated according to formulas (9)-(11), whose core operation is
CA(G_src, G_amp) = G_sim = FC(G_src ⊙ sigmoid(G_amp)),  (10)
where W_s and W_a are learnable parameter matrices and FC is a fully connected layer.
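The sketch below implements the gating of equation (10) as stated; how G_amp is assembled from the coarse common features (formula (9)) and how the final difference representation is derived (formula (11)) are not reproduced in the published text, so the subtraction used at the end is an illustrative assumption.

```python
# Sketch of the checking-and-re-matching module, following equation (10):
# CA(G_src, G_amp) = FC(G_src * sigmoid(G_amp)).
import torch
import torch.nn as nn

class CheckReMatch(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.fc = nn.Linear(d_model, d_model)                 # FC in equation (10)

    def forward(self, g_src: torch.Tensor, g_amp: torch.Tensor) -> torch.Tensor:
        """Refine common features; g_src is the reference source image's features."""
        return self.fc(g_src * torch.sigmoid(g_amp))          # element-wise gating

ca = CheckReMatch()
g_bef, common_bef = torch.randn(2, 196, 512), torch.randn(2, 196, 512)
g_sim = ca(g_bef, common_bef)      # refined common features
g_diff = g_bef - g_sim             # assumed difference representation (cf. formula (11))
```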
Step4, sending the difference representation into a decoder and decoding a natural language sentence describing the difference between the two images; the change language decoder first needs to obtain the change features required for decoding and to attend to which of the three types of features (before-change, difference and after-change) is related to the ground-truth word. The language decoder consists of two parts, namely spatial attention and an LSTM sentence decoder;
as a further scheme of the invention, the Step4 comprises the following specific steps:
step4.1, locating the differences between the images before and after the change with spatial attention, and sending the output to a top-down LSTM sentence decoder to generate natural language describing the change;
a difference characterization G is constructed diff . Spatial attention tells the model the location of the difference features in the two original representations. We first compute the spatial attention map and then by applying G i Applications S nav To locate the variation feature d i ,i∈(bef,aft):
Figure BDA0003545838790000072
/>
Wherein [;]representing a dimension splice, f 2 ,f 1 Is the operation of convolution.
The sentence decoder is based on a top-down two-layer LSTM structure. At each time step t, it first finds, among the three features d_bef, d_diff and d_aft, the feature most relevant to the current word, selected through attention weights; the selected feature and the previous word w_(t-1) (the ground-truth word during training, the predicted word during inference) are then fed to LSTM_w to predict the next word:
d_all = ReLU(FC[d_bef; d_diff; d_aft]),  (13)
with the attention weights and the word prediction given by formulas (14)-(15), where W_1, W_2, b_1, b_2 are learnable parameters, the hidden states of the LSTM_s module and of LSTM_w are maintained at each time step, and Embed is the one-hot encoding of the word w_(t-1);
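The sketch below arranges one decoding step in the usual top-down fashion. Formula (13) is implemented as stated; because formulas (14)-(15) appear only as images, the way the attention LSTM_s, the attention scoring (W_1, W_2, b_1, b_2 folded into a single linear layer here) and the word LSTM_w are wired together is an assumption of the sketch.

```python
# Sketch of one step of the top-down two-layer LSTM sentence decoder (Step 4.1).
import torch
import torch.nn as nn

class ChangeDecoderStep(nn.Module):
    def __init__(self, d_model: int = 512, vocab: int = 1000, emb: int = 300):
        super().__init__()
        self.fc_all = nn.Linear(3 * d_model, d_model)                  # FC in eq. (13)
        self.lstm_s = nn.LSTMCell(d_model + d_model + emb, d_model)    # attention LSTM_s
        self.att = nn.Linear(d_model + d_model, 1)                     # attention scoring
        self.lstm_w = nn.LSTMCell(d_model + d_model, d_model)          # word LSTM_w
        self.embed = nn.Embedding(vocab, emb)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats, w_prev, state_s, state_w):
        """feats: (B, 3, d) stacked [d_bef, d_diff, d_aft]; w_prev: (B,) token ids."""
        d_all = torch.relu(self.fc_all(feats.flatten(1)))              # eq. (13)
        h_s, c_s = self.lstm_s(torch.cat([d_all, state_w[0], self.embed(w_prev)], -1), state_s)
        scores = self.att(torch.cat([feats, h_s[:, None].expand(-1, 3, -1)], -1))  # (B, 3, 1)
        d_hat = (torch.softmax(scores, dim=1) * feats).sum(1)          # selected feature
        h_w, c_w = self.lstm_w(torch.cat([d_hat, h_s], -1), state_w)
        return self.out(h_w), (h_s, c_s), (h_w, c_w)                   # next-word logits

dec = ChangeDecoderStep()
zeros = (torch.zeros(2, 512), torch.zeros(2, 512))
logits, s_s, s_w = dec(torch.randn(2, 3, 512), torch.zeros(2, dtype=torch.long), zeros, zeros)
```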
step4.2, jointly training the encoder and decoder by minimizing the negative log-likelihood of the resulting word sequence, using the cross-entropy loss of formula (16) to optimize training:
L = -Σ_(t=1)^m log p(w_t | w_1, ..., w_(t-1)),  (16)
where m is the length of the sentence.
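For reference, the per-sentence negative log-likelihood with teacher forcing reduces to a token-level cross-entropy; the short sketch below illustrates this with assumed tensor shapes.

```python
# Sketch of the training objective of Step 4.2: token-level cross-entropy
# over the m words of the ground-truth sentence (teacher forcing).
import torch
import torch.nn.functional as F

def sequence_nll(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, m, vocab) step-wise predictions; targets: (B, m) word ids."""
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

loss = sequence_nll(torch.randn(2, 12, 1000), torch.randint(0, 1000, (2, 12)))
```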
Step5, optimizing the objective function of the model with the Adam optimizer, and evaluating with the indexes BLEU-4, METEOR, CIDEr, ROUGE-L and SPICE. These scores are higher when the semantics are recognized correctly and the sentence structure is more consistent with the visual features.
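The patent does not name a scoring implementation; assuming the commonly used pycocoevalcap toolkit, the five metrics can be computed as sketched below, where the example sentences are purely illustrative.

```python
# Sketch of metric computation with the pycocoevalcap toolkit (an assumption).
# gts/res map an image-pair id to reference and generated sentences.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

gts = {"0": ["the small rubber sphere changed its color"]}   # reference caption(s)
res = {"0": ["the small rubber ball changed color"]}         # generated caption

scorers = [(Bleu(4), "BLEU-4"), (Meteor(), "METEOR"), (Rouge(), "ROUGE-L"),
           (Cider(), "CIDEr"), (Spice(), "SPICE")]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    print(name, score[-1] if isinstance(score, list) else score)  # Bleu returns a list
```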
Description of the data set:
the CLEVR-Change dataset is a large-scale dataset consisting of geometric objects, comprising 79606 image pairs and 493735 descriptions. The change types can be classified into six cases, i.e., "color", "texture", "addition", "deletion", "movement", and "disturbance factor (e.g., viewpoint change)".
The Spot-the-Diff dataset consists of 13,192 image pairs, which are extracted from surveillance video from different time periods.
Setting experimental parameters:
to extract visual features, pre-trained ResNet-101 was used on ImageNet, taking advantage of grid features and pooling them on average to 14X 14 grid size. Features of dimensions 1024 × 14 × 14 are embedded into a low-dimensional embedding of dimension 512. The LSTM used in the decoder has a hidden state dimension of 512 and the number of attention heads is 4. Furthermore, each word is represented by a 300-dim vector. For the 38epoch training phase, the model was trained by the Adam Optimizer with a learning rate of 0.001/0.0003 and a batch size of 128/60 on the CLEVR-Change/Spot-the-Diff dataset. Both training and reasoning were implemented using PyTorch on a Titan XP GPU.
Comparative experiments on the CLEVR-Change data set:
to demonstrate the superiority of the model of the invention, extensive experiments were performed in order to compare it with the most advanced methods on the test set. The present invention gives conclusions from four angles: (a) Overall performance (including scene and no scene change); (b) scene change only performance; (c) performance of the model in the absence of scene changes; (d) Performance in some representative scene changes, such as color/texture changes and add/delete/move objects; (e) specification of the SOTA process.
(a) Overall performance. It can be observed from Table 1 that the model of the invention is clearly better than all of the above SOTA models in all indexes.
TABLE 1 Comparative experiment on overall performance
(b) Performance under scene changes only. In the case of a scene change, the image pair contains not only the scene change but also illumination or viewpoint changes. As shown in Table 2, where the best performance is indicated in bold, the method of the present invention surpasses all SOTA methods by a large margin, in particular on CIDEr (from 114.2 to 117.0) and SPICE (from 30.8 to 32.1).
Table 2 Performance comparison experiment in scene change
(c) Performance without scene changes. In this case, the image pairs contain only non-scene changes, i.e., illumination or viewpoint changes. Since M-VAM+RAF did not report all results, the invention is only compared with the results it provides. The METEOR and CIDEr scores of M-VAM+RAF are higher than those of the invention; the invention considers that this is probably due to the introduction of reinforcement learning, and the results of M-VAM (without RAF) can verify this idea. This shows that reinforcement learning does significantly improve the performance of the model in this case, but it increases both training time and computational complexity. Furthermore, since SRDRL+AVS introduces external knowledge for decoding, its CIDEr score exceeds that of the invention. The model of the invention reduces the complexity of model training while understanding the fine-grained information of semantics and positions.
TABLE 3 Performance comparison experiment without scene changes
(d) Performance on representative scene changes ("Color" (C), "Texture" (T), "Add" (A), "Drop" (D) and "Move" (M)). From Table 4 it can be seen that, compared with the SOTA methods, the model of the invention achieves competitive results across the various types of change. The model proposed by the invention achieves impressive results on the "COLOR", "ADD", "DROP" and "TEXTURE" change types. In addition, "TEXTURE" and "MOVE" are the difficult cases of this task under viewpoint change. It can be observed that (1) the invention achieves a good effect on changes of object texture thanks to its fine understanding of semantic and positional interactions; (2) the invention does not perform as well as IFDC on "MOVE" in terms of CIDEr and METEOR, because IFDC uses separate feature extractors to extract the attributes of objects and the image information. This shows that the model of the invention is able to accurately describe object changes while reducing the pre-processing of the data.
Table 4 performance comparison experiments in representative scene changes
(e) Description of the SOTA methods: DUDA was proposed by Park et al. at ICCV 2019; DUDA+AT was proposed by Hosseinzadeh et al. at CVPR 2021; M-VAM and M-VAM+RAF were proposed by Shi et al. at ECCV 2020; VACC was proposed by Kim et al. at ICCV 2021; IFDC was proposed by Huang et al. in IEEE Transactions on Multimedia; SRDRL+AVS was proposed by Tu et al. at ACL 2021.
Comparative experiments on the Spot-the-Diff dataset:
the present invention is compared to the SOTA method in a fully aligned picture without changing the viewing angle. From Table 5, it can be observed that the method of the present invention achieves the best performance on BLUE-4, METEOR, ROUGE-L and SPICE without reinforcement learning. Since the data set has no perspective change, the advantage is mainly that the module can enhance the fine-grained representation and interaction of the object features.
TABLE 5 Comparative experiments on the Spot-the-Diff dataset
And (3) qualitative analysis:
examples of several variant descriptions from the CLEVR-Change dataset test set are illustrated in fig. 4, including human-generated sentences (labels) and sentences generated by the present model. In the context of perspective changes, the model of the present invention not only accurately describes the change process, but also emphasizes features such as relative position and object properties. For example, "small" and "rubber" in the first example emphasize the size and properties of the sphere, and "behind" in the last example emphasizes the relative position of the cube. Furthermore, the middle example shows an accurate description without scene changes and disturbing factors. As these examples demonstrate, the model of the present invention can capture detailed information to produce more accurate and descriptive headings.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (4)

1. The image pair difference description method based on the encoding-decoding end is characterized in that: the method comprises the following specific steps:
step1, utilizing a pre-trained convolutional neural network as a feature extractor, and sending the images before/after change into the feature extractor to obtain visual features of the two images;
step2, modeling semantic interaction and position relation in each image through a semantic-position purifier, so as to deeply understand fine-grained information of the image;
step3, characterization of differences among the acquired images: distinguishing real change or visual/illumination change by utilizing a hierarchical matching mechanism, capturing a fine change process, and obtaining accurate difference representation;
step4, sending the difference representation into a decoder, and decoding a natural language sentence capable of describing the difference between the two images;
the specific steps of Step2 are as follows:
step2.1, encoding the relative position between features in the picture: coding the coordinates of the relative upper left corner and lower right corner of the image to obtain the relative position coordinates of the features;
step2.2, by injecting absolute position information into the original image characteristics, the change of the object is sensitively distinguished; assigning a fixed value in order to each feature in the image to represent the absolute position of each feature;
step2.3, integrating position and semantic relation to obtain fine-grained information based on a self-attention mechanism, wherein the information can become prior knowledge for distinguishing real change and vision/illumination change;
the specific steps of Step3 are as follows:
step3.1, firstly matching the common characteristics of the images before and after change, namely scanning the back/front image through the front/back image to obtain the common characteristics;
step3.2, using a checking and re-matching module to regard the images before/after the change as a reference source, and refining the common features to make the tiny changes prominent.
2. The method for describing difference of image pair based on encoding-decoding end as claimed in claim 1, wherein: the Step1 comprises the following steps: to obtain visual features, a pre-trained ResNet-101 is used as a feature extractor to obtain mesh features of an image.
3. The method for describing difference of image pair based on encoding-decoding end as claimed in claim 1, wherein: in step2.2, the absolute position is specifically encoded using sine and cosine functions of different frequencies.
4. The method for describing difference of image pair based on encoding-decoding end as claimed in claim 1, wherein: the specific steps of Step4 are as follows:
step4.1, spatially attentively locating the difference between the images before and after the change, and sending the output to an LSTM sentence decoder based on top down to generate a natural language capable of describing the change;
step4.2, jointly train the encoder and decoder by minimizing the negative likelihood of the resulting word sequence.
CN202210248468.7A 2022-03-14 2022-03-14 Image pair difference description method based on encoding-decoding end Active CN114581690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210248468.7A CN114581690B (en) 2022-03-14 2022-03-14 Image pair difference description method based on encoding-decoding end

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210248468.7A CN114581690B (en) 2022-03-14 2022-03-14 Image pair difference description method based on encoding-decoding end

Publications (2)

Publication Number Publication Date
CN114581690A CN114581690A (en) 2022-06-03
CN114581690B true CN114581690B (en) 2023-03-24

Family

ID=81774766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210248468.7A Active CN114581690B (en) 2022-03-14 2022-03-14 Image pair difference description method based on encoding-decoding end

Country Status (1)

Country Link
CN (1) CN114581690B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255317B (en) * 2018-08-31 2021-06-11 西北工业大学 Aerial image difference detection method based on double networks
US11361470B2 (en) * 2019-05-09 2022-06-14 Sri International Semantically-aware image-based visual localization

Also Published As

Publication number Publication date
CN114581690A (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
Wang et al. A robust and efficient video representation for action recognition
CA2934514C (en) System and method for identifying faces in unconstrained media
CN111709409A (en) Face living body detection method, device, equipment and medium
CN111770299B (en) Method and system for real-time face abstract service of intelligent video conference terminal
CN111967533B (en) Sketch image translation method based on scene recognition
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
Oluwasammi et al. Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
Cui et al. Face recognition using total loss function on face database with ID photos
CN115018999A (en) Multi-robot-cooperation dense point cloud map construction method and device
CN113298018A (en) False face video detection method and device based on optical flow field and facial muscle movement
Zhou et al. Modeling perspective effects in photographic composition
CN114581690B (en) Image pair difference description method based on encoding-decoding end
CN109165551B (en) Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics
CN116244464A (en) Hand-drawing image real-time retrieval method based on multi-mode data fusion
CN115471901A (en) Multi-pose face frontization method and system based on generation of confrontation network
Zhang et al. Facial expression recognition by analyzing features of conceptual regions
WO2024099026A1 (en) Image processing method and apparatus, device, storage medium and program product
Qu Towards Theoretical and Practical Image Inpainting with Deep Neural Networks
Montserrat Machine Learning-Based Multimedia Analytics
Bhattacharjee Feature Extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant