CN113313147A - Image matching method based on deep semantic alignment network model - Google Patents

Image matching method based on deep semantic alignment network model

Info

Publication number
CN113313147A
CN113313147A (application CN202110516741.5A)
Authority
CN
China
Prior art keywords
image
transformation
alignment
model
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110516741.5A
Other languages
Chinese (zh)
Other versions
CN113313147B (en)
Inventor
吕肖庆
瞿经纬
王天乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202110516741.5A
Publication of CN113313147A
Application granted
Publication of CN113313147B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components, by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image matching method based on a deep semantic alignment network model, which progressively estimates the alignment between two semantically similar images by establishing an object location-aware semantic alignment network model, OLASA. A triplet sampling strategy is adopted to train the network model OLASA, and three sub-networks, potential object co-location (POCL), affine transformation regression (ATR) and bidirectional thin-plate spline regression (TTPS), denoted N_tran, N_affi and N_ttps, respectively estimate the translation, the affine transformation and the spline transformation. The alignment relation between the images is then established and optimized in a hierarchical manner to obtain the image matching result. With the technical scheme provided by the invention, the alignment of images whose objects differ greatly in position can be improved and the accuracy of image matching is increased. The method can be applied to target tracking, semantic segmentation, multi-view three-dimensional reconstruction and other tasks in the field of computer vision.

Description

Image matching method based on deep semantic alignment network model
Technical Field
The invention belongs to the technical field of computer vision and digital image processing, relates to an image matching technology, and particularly relates to a method for establishing an accurate corresponding matching relation of main target objects in similar images based on an image depth semantic alignment network model.
Background
Image semantic alignment aims at establishing an accurate correspondence between similar target objects in different images, that is, a point-to-point feature matching relation between the similar objects. The typical scenario is that, on the premise that the image contents are the same or similar, the feature information of the images is used to analyse and quantify the similarity between features, and the matching relation of the feature points on similar objects in the images is then determined. This is a fundamental problem in computer vision and is widely used in fields such as target tracking, image semantic segmentation and multi-view three-dimensional reconstruction.
Semantic alignment has received much attention in recent years. Early studies included methods that find instance-level matches by defining and computing sparse or dense descriptors [5]. However, the instance-level descriptions of these methods lack the generalization capability needed at the category level. The purpose of category-level correspondence is to find dense correspondences between semantically similar images. Some methods use local descriptors and minimize the matching energy involved, but manually constructed descriptors can hardly embed high-level semantic features and are sensitive to image changes.
Inspired by the rich high-level semantics of convolutional neural network (CNN) features, recent solutions (references [2], [4], [6], [7], [8]) employ trainable CNN features and combine them to estimate dense flow fields that align the images. Furthermore, the methods in references [1], [3], [9], [10] estimate geometric transformations with trainable CNN features and express semantic correspondence as a geometric alignment problem. Thanks to the geometric transformations describing the dense correspondences, some of these methods outperform dense-flow-based methods and produce smoother matching results.
Despite the great advances made by existing approaches, the semantic alignment problem still faces challenges such as alignment difficulties caused by object variations (e.g., appearance, scale, shape, position) and complex backgrounds. Specifically, first, when the difference in target position is large, it is difficult to directly establish a dense correspondence between images and the results are poor (as shown in fig. 1); previous methods often fail to align such images because handling this situation has not been sufficiently studied. Second, data annotation is difficult: it is hard to collect a large number of training image pairs with ground-truth dense correspondences and significant appearance changes, and manually annotating such training data is very labor-intensive and somewhat subjective.
Reference documents:
[1] Ignacio Rocco, Relja Arandjelović, and Josef Sivic, "Convolutional neural network architecture for geometric matching," in CVPR, 2017.
[2] Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, and Jean Ponce, "SCNet: Learning semantic correspondence," in ICCV, 2017.
[3] Ignacio Rocco, Relja Arandjelović, and Josef Sivic, "End-to-end weakly-supervised semantic alignment," in CVPR, 2018.
[4] Junghyup Lee, Dohyung Kim, Jean Ponce, and Bumsub Ham, "SFNet: Learning object-aware semantic correspondence," in CVPR, 2019.
[5] David G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, 2004.
[6] Ce Liu, Jenny Yuen, and Antonio Torralba, "SIFT flow: Dense correspondence across scenes and its applications," IEEE TPAMI, vol. 33, no. 5, 2010.
[7] Bumsub Ham, Minsu Cho, Cordelia Schmid, and Jean Ponce, "Proposal flow: Semantic correspondences from object proposals," IEEE TPAMI, vol. 40, no. 7, 2017.
[8] Seungryong Kim, Dongbo Min, Bumsub Ham, Sangryul Jeon, Stephen Lin, and Kwanghoon Sohn, "FCSS: Fully convolutional self-similarity for dense semantic correspondence," in CVPR, 2017.
[9] Paul Hongsuck Seo, Jongmin Lee, Deunsol Jung, Bohyung Han, and Minsu Cho, "Attentive semantic alignment with offset-aware correlation kernels," in ECCV, 2018.
[10] Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, and Josef Sivic, "Neighbourhood consensus networks," in NIPS, 2018.
Disclosure of the Invention
In order to overcome the defects of the prior art, the invention provides an image matching method based on a deep semantic alignment network model. The method establishes an object location-aware semantic alignment network, trains it with a triplet sampling strategy, and establishes and optimizes the alignment relation between images in a hierarchical manner. It solves the technical problems that the prior art can hardly establish a dense correspondence between images directly and that annotating image data is time-consuming, labor-intensive and of limited accuracy, and it improves the accuracy of image matching.
The image semantic alignment technique in the invention is a sub-problem in the field of image matching or image feature matching and is mainly aimed at the following scenario: although the two images to be matched are different, they both contain a similar foreground object, that is, the high-level semantic information of the foreground objects, such as appearance, shape and posture, is similar and the objects basically belong to the same category, such as cars of different brands.
The technical scheme provided by the invention is as follows:
An image matching method based on a deep semantic alignment network model comprises gradually and robustly estimating the alignment between two semantically similar images by establishing an object location-aware semantic alignment network model (OLASA); meanwhile, a triplet sampling strategy is proposed for training the network, translation, affine transformation and spline transformation are estimated respectively by three sub-networks, potential object co-location (POCL), affine transformation regression (ATR) and bidirectional thin-plate spline regression (TTPS), and the alignment relation between the images is then established and optimized in a hierarchical manner to obtain the image matching result; the method comprises the following steps:
Step 1, extracting semantic features of the images;
In the method, an independent convolutional neural network (CNN) is adopted at the front end of each sub-network to extract the features of the image. In a specific implementation, the invention adopts a convolutional neural network (CNN) to extract the features of the two images; the network can be the most basic CNN or an improved or enhanced CNN. The feature extraction is expressed as formula (1):
F = f(I), F ∈ ℝ^(h×w×d), I ∈ ℝ^(H×W×D)    (1)
In formula (1), F is the feature extracted from an image; ℝ^(h×w×d) is the feature data space (real-valued), and h, w and d respectively represent its three dimensions, namely height, width and number of channels; f is the convolutional neural network (CNN); I is an image; ℝ^(H×W×D) is the data space (real-valued) of the image, and H, W and D respectively represent its three dimensions, namely height, width and number of channels.
A pair of images (I_s, I_t) is taken as the input of f, and two groups of features F_s = f(I_s) and F_t = f(I_t) are extracted respectively, where I_s and I_t are the source image and the target image respectively, and F_s and F_t are respectively the source image semantic features and the target image semantic features.
The invention establishes an object location-aware semantic alignment network model, OLASA, which gradually and robustly estimates the alignment between two semantically similar images. The architecture of OLASA takes POCL, ATR and TTPS as its main components, named N_tran, N_affi and N_ttps, which are used to estimate the offset, affine and TTPS transformations, denoted T_tran, T_affi and T_ttps respectively, as shown in fig. 3. Through the transformation models T_tran, T_affi and T_ttps, the transformation results I_s^1 and I_s^2 of the source image at each stage can be obtained; the successive transformation T_H, i.e. T_H = T_ttps ∘ T_affi ∘ T_tran, then yields the alignment result I_s^3 of the source image I_s with the target image I_t. The method for establishing the object location-aware semantic alignment network model comprises steps 2-4.
Step 2, adopting the potential object co-location sub-network N_tran to estimate the offset of the target object between images and eliminate the position deviation of the objects to be matched;
Since similar target objects in different images often show an obvious displacement, the N_tran sub-network uses potential-target position detection and estimation to predict an offset transformation model in advance and, by transforming the source image, eliminates the influence of this large displacement.
The potential object co-location sub-network is denoted N_tran. It takes the feature F_s of the source image and the feature F_t of the target image as input, estimates potential targets with a classification-based target detection method, and trains the co-location sub-network N_tran according to the positions of the potential targets, thereby realizing the estimation T_tran of the preliminary transformation between I_s and I_t, expressed as follows:
T_tran = N_tran(F_s, F_t),  (x', y') = T_tran(x, y)    (2)
In the formula, N_tran is the potential object co-location sub-network; T_tran is the estimate of the preliminary transformation (offset transformation) between I_s and I_t; the number of degrees of freedom of T_tran is 4; (x, y) represents the spatial coordinates of the target image and (x', y') represents the corresponding sampling coordinates in the source image.
To estimate the offset transformation model T_tran, the positions of the objects to be matched need to be estimated first; for this, the approximate positions of the objects to be matched can be estimated with an existing classification-based target detection technique, further locating the two main, semantically related objects in I_s and I_t. Another feature extraction module, for example a CNN, is then used to extract the corresponding feature descriptors {V_s^i} and {V_t^j}, where i and j respectively denote the indices of the feature points in the source image and the target image.
The descriptors {V_s^i} and {V_t^j} are stacked into two feature matrices, and the semantic similarity matrix Z_st is calculated by multiplying the two feature matrices.
The feature pairs with the highest similarity scores in Z_st are selected to form a group of similar feature pairs; the two corresponding regions given by the feature point groups in the source image and the target image respectively are computed and taken as the two main potential objects, and the corresponding bounding-box coordinates, i.e. their spatial positions, are calculated.
Finally, from the spatial coordinates of the bounding boxes of the two main potential objects obtained by this localization, the position offset transformation model T_tran is calculated, and the source image I_s is transformed by T_tran into the position-shifted image I_s^1.
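To make the offset step concrete, the sketch below derives a 4-degree-of-freedom mapping (per-axis scale plus translation) from the two located bounding boxes and applies it to the source image with a sampling grid; the specific 4-parameter form, the helper names and the example box coordinates are assumptions, since the disclosure only states that T_tran has 4 degrees of freedom and is computed from the two box positions.

# Hedged sketch: a 4-DOF offset transform T_tran from two located boxes
# (x1, y1, x2, y2 in normalized [-1, 1] coordinates), used to warp I_s into I_s^1.
import torch
import torch.nn.functional as F

def offset_transform_from_boxes(box_s, box_t):
    """Return a 2x3 matrix mapping target-image coords (x, y) to source sampling coords (x', y')."""
    sx1, sy1, sx2, sy2 = box_s
    tx1, ty1, tx2, ty2 = box_t
    ax = (sx2 - sx1) / (tx2 - tx1)           # per-axis scale
    ay = (sy2 - sy1) / (ty2 - ty1)
    bx = sx1 - ax * tx1                      # per-axis translation
    by = sy1 - ay * ty1
    return torch.tensor([[ax, 0.0, bx],
                         [0.0, ay, by]])

def warp(image, theta):
    # affine_grid/grid_sample sample source coords for every target position, matching formula (2).
    grid = F.affine_grid(theta.unsqueeze(0), image.shape, align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

I_s = torch.rand(1, 3, 240, 240)
T_tran = offset_transform_from_boxes(box_s=(-0.6, -0.5, 0.2, 0.4),
                                     box_t=(-0.2, -0.3, 0.6, 0.6))
I_s1 = warp(I_s, T_tran)                     # position-shifted image I_s^1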
Step 3, constructing the affine transformation regression sub-network N_affi, estimating the affine transformation model T_affi of the images to be matched with the affine transformation regression sub-network, and obtaining the affine transformation parameter estimate;
The ATR sub-network is used to estimate the affine transformation model between the offset-adjusted image and the target image. The ATR sub-network constructs pairs of image features, i.e. feature pairs, and calculates the correlation of these feature pairs, from which the parameters of the affine transformation model are estimated.
Specifically, the features F_s^1 and F_t of the position-shifted image I_s^1 and the target image I_t are formed into feature pairs, and the correlation of these feature pairs is calculated as the 4D correlation tensor C_affi; each element of the tensor records the inner product of two local feature vectors taken from F_s^1 and F_t. The feature vectors and the correlation tensor are L2 normalized. The correlation tensor C_affi is input into N_affi to estimate the affine transformation model T_affi between I_s^1 and I_t:
T_affi = N_affi(C_affi)    (3)
The affine transformation model T_affi has 6 degrees of freedom, i.e. 6 affine transformation parameters need to be estimated. The image I_s^1 can then be transformed by the model T_affi; the image I_s^2 obtained after the affine transformation completes the further alignment with I_t.
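As an illustration of the ATR computation just described, the sketch below builds the 4D correlation tensor from L2-normalized features with an einsum, normalizes it, and regresses the 6 affine parameters with a small convolutional head; the layer sizes and spatial dimensions of the regression head are assumptions, the disclosure only fixes the input (the correlation tensor) and the output (6 parameters).

# Hedged sketch of step 3: 4D correlation tensor + affine parameter regression.
import torch
import torch.nn.functional as F

def correlation_tensor(feat_a, feat_b):
    # feat_*: (B, d, h, w); each element is the inner product of two local feature vectors.
    feat_a = F.normalize(feat_a, dim=1)
    feat_b = F.normalize(feat_b, dim=1)
    corr = torch.einsum("bdij,bdkl->bijkl", feat_a, feat_b)    # (B, h, w, h, w)
    b, h, w, _, _ = corr.shape
    corr = corr.reshape(b, h * w, h, w)                        # fold to (B, h*w, h, w) for conv layers
    return F.normalize(corr, dim=1)                            # L2-normalize the correlation

class AffineRegressor(torch.nn.Module):
    def __init__(self, hw=15 * 15):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(hw, 128, kernel_size=7), torch.nn.BatchNorm2d(128), torch.nn.ReLU(),
            torch.nn.Conv2d(128, 64, kernel_size=5), torch.nn.BatchNorm2d(64), torch.nn.ReLU(),
        )
        self.fc = torch.nn.Linear(64 * 5 * 5, 6)               # 6 affine transformation parameters

    def forward(self, corr):
        x = self.conv(corr)
        return self.fc(x.flatten(1))                           # parameters of T_affi

F_s1 = torch.rand(1, 1024, 15, 15)
F_t = torch.rand(1, 1024, 15, 15)
C_affi = correlation_tensor(F_s1, F_t)
theta_affi = AffineRegressor()(C_affi)                          # shape (1, 6)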
Step 4, constructing the bidirectional thin-plate spline regression sub-network N_ttps and optimizing the alignment with it;
The TTPS sub-network N_ttps uses control points to estimate a regression transformation model between the affine-transformed image and the target image; this model further improves or enhances the semantic alignment of the images. TTPS adopts a bidirectional strategy to avoid excessive distortion of the image. Compared with the earlier thin-plate spline regression (TPS) method, this method adds control-point adjustment in the opposite direction, from the image I_t to the image I_s^2, which effectively removes excessive deformation and reduces matching distortion.
In a specific implementation, a grid of control points is placed on the images I_s^2 and I_t, and the TTPS sub-network takes as its computation object the correlation tensor of the feature pairs formed from the features F_s^2 and F_t of I_s^2 and I_t, i.e. the 4D correlation tensor C_ttps. The process of estimating the regression deformation model T_ttps with the TTPS regression sub-network N_ttps can be expressed as:
T_ttps = N_ttps(C_ttps)    (4)
where N_ttps computes the correlation of the control point pairs according to F_s^2 and F_t; the estimate of the regression deformation model T_ttps is obtained by calculating its 6 transformation weight parameters. According to the estimated regression transformation model T_ttps, I_s^2 is further transformed into I_s^3, thereby realizing the alignment of the source image with the target image I_t.
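The sketch below illustrates only the bidirectional control-point idea: a uniform control-point grid is placed on both images and a small head predicts control-point offsets in both directions from the correlation tensor. The head architecture and the 3 × 3 grid size (the example given later in the detailed description) are assumptions, the thin-plate spline interpolation that turns the moved control points into a dense warp is omitted, and note that the patent itself parameterizes T_ttps by 6 transformation weight parameters, so this per-point-offset form is illustrative rather than the claimed parameterization.

# Hedged sketch of the bidirectional control-point regression in step 4
# (the thin-plate spline interpolation itself is omitted).
import torch

def control_point_grid(n=3):
    # Uniform n x n grid of control points in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, n), torch.linspace(-1, 1, n), indexing="ij")
    return torch.stack([xs, ys], dim=-1).view(-1, 2)            # (n*n, 2)

class BidirectionalCPRegressor(torch.nn.Module):
    """Predict control-point offsets in both directions from the 4D correlation tensor."""
    def __init__(self, hw=15 * 15, n=3):
        super().__init__()
        self.head = torch.nn.Sequential(
            torch.nn.Conv2d(hw, 64, kernel_size=5), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(64, 2 * 2 * n * n),                 # forward + reverse offsets
        )
        self.n = n

    def forward(self, corr):
        off = self.head(corr).view(-1, 2, self.n * self.n, 2)
        grid = control_point_grid(self.n)
        cp_fwd = grid + off[:, 0]                               # control points moved for I_s^2 -> I_t
        cp_rev = grid + off[:, 1]                               # control points moved for I_t -> I_s^2
        return cp_fwd, cp_rev

C_ttps = torch.rand(1, 15 * 15, 15, 15)                          # correlation tensor of F_s^2 and F_t
cp_fwd, cp_rev = BidirectionalCPRegressor()(C_ttps)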
Step 5, jointly training an integral OLASA network model comprising three sub-networks;
in terms of the selection of training samples, the invention proposes a triplet method by introducing reference images, which can better capture geometrical and appearance variations in the training data.
Step 51, the present invention proposes a triplet sampling strategy to generate training data. Each triplet contains a source image I_s, a reference image I_r and a target image I_t, where the reference image I_r is generated from I_s by a random geometric transformation T̂_affi^{s→r} or T̂_ttps^{s→r}.
Step 52, designing three loss functions to realize the optimization of OLASA: the transitivity loss, the consistency loss and the alignment loss.
The transitivity loss function L_trans is used to measure the accuracy of the predicted transformation models. To check whether a predicted transformation model is accurate, the transformation result of the model can be compared with the result of another transformation path that achieves the same transformation goal; this other path can specifically be an indirect path obtained by composing two transformation models. The indirect path achieves the same transformation goal as the direct path that uses only one transformation model, namely the goal of the transformation model under test.
Take the check of an affine transformation T_affi^{s→r} as an example: if T_affi^{s→r} (the affine transformation from the source image to the reference image) is the predicted transformation model whose accuracy is to be checked, another indirect path achieving the same transformation goal is needed, such as the composition T_affi^{t→r} ∘ T_affi^{s→t}, which combines the two related successive transformations T_affi^{s→t} (the affine transformation from the source image to the target image) and T_affi^{t→r} (the affine transformation from the target image to the reference image) to simulate the T_affi^{s→r} transformation; the accuracy of the transformation from the source image to the reference image is then judged by comparing the difference between the transformation results obtained in these two ways. In a specific implementation this is done with a grid of selected pixel points in the image: for example, dividing the image into twenty parts in both the horizontal and vertical directions gives a grid G (|G| = 20 × 20) containing 400 points, a point of G being denoted (x, y), and the initial grid being constructed on I_s. As described above, judging the accuracy of T_affi^{s→r} by means of T_affi^{s→t} and T_affi^{t→r} then reduces to a concrete calculation on the grid points: theoretically, a point transformed by T_affi^{s→r} should coincide with the projected point obtained after the successive deformations T_affi^{t→r} ∘ T_affi^{s→t}; at the same time, the difference with respect to the real data, i.e. T̂_affi^{s→r} (the affine transformation from the source image to the reference image computed from the annotation data, not predicted), is taken as the error on which the learning method is based. The transitivity loss function L_trans^affi can therefore be expressed as:
L_trans^affi = Σ_{(x,y)∈G} ‖ T̂_affi^{s→r}(x, y) − (T_affi^{t→r} ∘ T_affi^{s→t})(x, y) ‖²
The transformations involve three types of objects, namely the source image, the target image and the reference image; these are therefore organized into a triplet for training the related models, i.e. one piece of training data consists of a source image, a target image and a reference image. This can also be seen as adding a corresponding reference image to the traditional training pair (source image and target image). The associated loss computation makes it possible to compute several transformations at the same time, each involving a pair of images of the triplet (the two images belonging to different image types); for example, the affine transformation regression sub-network N_affi estimates the affine transformation models of the pairs (I_s, I_t) (source and target images) and (I_t, I_r) (target and reference images) of a ternary sample.
Similarly, a transitivity loss L_trans^ttps is also used in this process for the difference analysis of the spline regression T_ttps.
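Assuming each transformation is represented as a function that maps a batch of grid points to new coordinates, the transitivity loss above can be sketched in a few lines; the squared-error form over the 20 × 20 grid follows the description, while the exact reduction (sum versus mean) and the toy parameter values are assumptions.

# Hedged sketch of the transitivity loss on the 20 x 20 grid G built on I_s.
import torch

def make_grid(n=20):
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, n), torch.linspace(-1, 1, n), indexing="ij")
    return torch.stack([xs, ys], dim=-1).view(-1, 2)            # |G| = n * n points

def transitivity_loss(T_sr_gt, T_st, T_tr, grid):
    """Compare the ground-truth warp s->r with the composition of the two predicted warps s->t and t->r."""
    direct = T_sr_gt(grid)                                      # \hat{T}^{s->r}(G)
    composed = T_tr(T_st(grid))                                 # (T^{t->r} o T^{s->t})(G)
    return ((direct - composed) ** 2).sum(dim=-1).mean()

# Toy usage with affine transforms given as 2x3 matrices (hypothetical values).
def affine_fn(theta):
    return lambda pts: pts @ theta[:, :2].T + theta[:, 2]

G = make_grid()
loss_tr_affi = transitivity_loss(
    affine_fn(torch.tensor([[1.0, 0.0, 0.1], [0.0, 1.0, -0.2]])),   # ground truth s->r
    affine_fn(torch.tensor([[0.9, 0.0, 0.2], [0.0, 1.1, 0.0]])),    # predicted s->t
    affine_fn(torch.tensor([[1.1, 0.0, -0.1], [0.0, 0.9, -0.2]])),  # predicted t->r
    G,
)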
The consistency loss function L_cons is defined as follows: in a ternary sample consisting of a source image, a target image and a reference image, the accumulation of the bidirectional reprojection errors of the points of the grid G over the different image pairs constitutes the consistency loss function L_cons, which is used to represent the difference for the affine transformation T_affi and for the spline regression T_ttps.
Given the affine transformation T_affi^{s→t} between the images I_s and I_t and the corresponding inverse transformation T_affi^{t→s}, a point (x, y) ∈ G should coincide with its original coordinates after the two-way transformation, i.e. (T_affi^{t→s} ∘ T_affi^{s→t})(x, y) ≈ (x, y). The reprojection error of the points of the grid G between the source image-target image pair can therefore be calculated as:
ε(I_s, I_t) = Σ_{(x,y)∈G} ‖ (T_affi^{t→s} ∘ T_affi^{s→t})(x, y) − (x, y) ‖²
where ε(I_s, I_t) is the reprojection error between the source-target image pair.
In the ternary sample composed of the source image, the target image and the reference image, the other image pairs also contain bidirectional reprojection errors; these are all accumulated to form the consistency loss function of the method:
L_cons^affi = Σ_{(I_a, I_b)} ε(I_a, I_b), the sum running over the image pairs of the triplet.
Similarly, a consistency loss L_cons^ttps is also used in the present method for the difference analysis of the spline regression T_ttps.
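Under the same point-mapping convention as the sketch given after the transitivity loss, the bidirectional reprojection error and its accumulation can be sketched as follows; which pairs of the triplet enter the sum is stated only loosely in the text, so the list of pairs in the usage comment is an assumption.

# Hedged sketch of the consistency loss: accumulate bidirectional reprojection errors over image pairs.
import torch

def reprojection_error(T_ab, T_ba, grid):
    """epsilon(I_a, I_b): points of G should return to their original coordinates after a round trip."""
    return ((T_ba(T_ab(grid)) - grid) ** 2).sum(dim=-1).mean()

def consistency_loss(pairs, grid):
    # pairs: list of (forward transform, inverse transform) for each image pair of the triplet.
    return sum(reprojection_error(T_ab, T_ba, grid) for T_ab, T_ba in pairs)

# Usage (hypothetical transforms, reusing affine_fn and make_grid from the previous sketch):
# pairs = [(T_affi_st, T_affi_ts), (T_affi_tr, T_affi_rt)]
# loss_cons_affi = consistency_loss(pairs, make_grid())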
The alignment loss function L_align is designed based on the soft-inlier count of reference [3] and a bidirectional measurement of alignment quality; it is used to evaluate T_affi and T_ttps and can further improve the alignment accuracy. L_align is computed from the two directional counts c_st and c_ts, where c_st is the soft-inlier count of aligning I_s to I_t and, similarly, c_ts denotes the soft-inlier count of aligning I_t to I_s.
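A soft-inlier count, in the spirit of reference [3], scores how much correlation mass lies close to the correspondences induced by an estimated transformation. The following is a rough, simplified sketch of that idea only: it uses a hard distance threshold instead of the smooth mask of [3], an arbitrary threshold value, and the final combination of the two directional counts into a loss (shown only as a comment) is likewise an assumption.

# Rough sketch of a soft-inlier-count style alignment score; simplified relative to reference [3].
import torch

def soft_inlier_count(corr, warp_fn, grid_hw, tau=0.1):
    """corr: (h*w, h*w) correlation between target positions (rows) and source positions (columns).
    warp_fn maps target coordinates to predicted source sampling coordinates."""
    h, w = grid_hw
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).view(-1, 2)            # positions of the h*w cells
    predicted = warp_fn(coords)                                    # where each target cell should land in the source
    # Keep correlations whose source position lies within tau of the predicted correspondence.
    dist = torch.cdist(predicted, coords)                          # (h*w target, h*w source)
    mask = (dist < tau).float()
    return (corr * mask).sum()

# c_st would use the correlation of (F_s, F_t) with the s->t transform and c_ts the reverse direction;
# the alignment loss decreases as both counts grow, e.g. L_align = -(c_st + c_ts).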
Step 6, obtaining the image matching result;
With the jointly trained overall network, an alignment result from the source image to the target image can be produced for any given pair of images to be matched.
Specifically, learning and training are carried out on the overall network comprising the three sub-networks; the method obtains an image alignment transformation model from the training data samples, yielding the trained object location-aware semantic alignment network model OLASA. In the testing stage, the matching result from the source image to the target image can then be computed for a given pair of images to be matched.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides an image matching method based on a deep semantic alignment network model, comprising the three sub-networks of potential object co-location (POCL), affine transformation regression (ATR) and bidirectional thin-plate spline regression (TTPS). POCL can effectively perceive the offset of a potential object, ATR learns the parameters of the geometric deformation, and TTPS improves the robustness of the deformation; the joint learning of the three sub-networks not only realizes the semantic alignment of the images but also achieves higher matching accuracy. With the technical scheme provided by the invention, the alignment of images whose objects differ greatly in position can be improved. At the same time, in scenarios with insufficient annotation data, the generation of reference images allows the geometric and appearance changes in the existing data to be mined and exploited more deeply, improving the accuracy of image matching. The method can be applied to various tasks in the field of computer vision, such as target tracking, semantic segmentation and multi-view three-dimensional reconstruction.
Drawings
FIG. 1 is a diagram of object location-aware semantic alignment of images in image matching, where I_s represents the source image and I_t represents the target image.
FIG. 2 is a flow chart of a method for performing image matching on an OLASA network model established by the present invention;
FIG. 3 is a structural block diagram of the OLASA network model established by the present invention, where I_s is the source image, I_t is the target image, and I_s^1, I_s^2 and I_s^3 are the results of the successive transformation stages; f is the feature extraction network; N_tran is the potential object co-location (POCL) sub-network and T_tran its corresponding transformation model; N_affi is the affine transformation regression (ATR) sub-network and T_affi its corresponding transformation model; N_ttps is the bidirectional thin-plate spline regression (TTPS) sub-network and T_ttps its corresponding transformation model.
FIG. 4 is a schematic diagram of a training triplet and the related transformation models of the OLASA network model in the present invention, where (a) shows the generation of a reference image I_r from a source image I_s, and (b) shows the various transformations within a triplet: T̂_affi^{s→r} and T̂_ttps^{s→r} respectively denote the affine transformation model and the bidirectional thin-plate spline regression model from the source image to the reference image, used as the ground truth for comparison; T_affi^{s→t} and T_ttps^{s→t} respectively denote the affine transformation model and the bidirectional thin-plate spline regression model from the source image to the target image, and T_affi^{t→s} and T_ttps^{t→s} the two models of the corresponding inverse transformations; T_affi^{t→r} and T_ttps^{t→r} respectively denote the affine transformation model and the bidirectional thin-plate spline regression model from the target image to the reference image, and T_affi^{r→t} and T_ttps^{r→t} the two models of the corresponding inverse transformations.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides an image matching method based on a deep semantic alignment network model, namely a deep neural network model, OLASA, for semantic alignment. The inputs of the method are a source image, a target image and a reference image. Through deep semantic analysis of the source image and the target image, OLASA estimates the deformation parameters of the source image according to their internal alignment relation; the deformed source image is the output of the method, and the target object it contains can be matched with the corresponding object in the target image. Internally, OLASA is realized through the joint learning of three sub-networks, potential object co-location (POCL), affine transformation regression (ATR) and bidirectional thin-plate spline regression (TTPS), which together achieve effective image matching.
Fig. 2 shows the flow of the image matching method performed with the OLASA network model established by the present invention. The method can be applied to any given image pair (i.e. a source image and a target image); the image pair can be obtained by shooting, network download and other means. Image data sets such as PF-WILLOW and PF-PASCAL (reference [11]: Bumsub Ham, Minsu Cho, Cordelia Schmid, and Jean Ponce, "Proposal flow," in CVPR, 2016.) and Caltech-101 (reference [12]: Li Fei-Fei, Rob Fergus, and Pietro Perona, "One-shot learning of object categories," IEEE TPAMI, vol. 28, no. 4, 2006.) are used. The reference image is computed from the source image during execution of the method. The method of the invention is realized through the following steps.
Step 1, extracting semantic features of the images
After receiving a source image, a target image and a reference image, OLASA first extracts the features of the source image, the target image and the reference image. That is, a pair of images to be matched (I_s, I_t) (the source and target images, which can be taken from an image library or shot directly) is taken as input, and the method adopts a convolutional neural network (CNN) to extract the features of the two images; the network can be the most basic CNN or an improved or enhanced CNN and, without loss of generality, is denoted f in this text. The two groups of extracted features are named F_s = f(I_s) and F_t = f(I_t); these semantic features are used in the learning process of the subsequent sub-networks.
Step 2, estimating the offset with the potential object co-location (POCL) sub-network
In practice, the objects to be matched in the source image and the target image are often located at different positions in their respective images, that is, the objects to be matched often show a large position difference. Existing methods mostly handle objects that are located at almost the same position, and special treatment of cases with a large position difference is rare, so many methods find it difficult to obtain an ideal matching effect in practical applications. For this problem, the invention adopts a preprocessing network, the potential object co-location (POCL) sub-network, to eliminate the position deviation of the objects to be matched in the first stage of image matching.
The sub-network is defined as N_tran. It takes the features F_s and F_t of the source image and the target image as input, estimates potential targets with a classification-based target detection technique, and trains the co-location sub-network N_tran based on the positions of the potential targets, thereby realizing the estimation of the preliminary transformation T_tran between I_s and I_t:
T_tran = N_tran(F_s, F_t),  (x', y') = T_tran(x, y)
where the number of degrees of freedom of T_tran is 4, (x, y) represents the spatial coordinates of the target image, and (x', y') represents the corresponding sampling coordinates in the source image.
To estimate the offset transformation model T_tran, it is first necessary to estimate the positions of the objects to be matched; existing classification-based target detection techniques can be used for this purpose, in particular common target detection networks such as Faster R-CNN [12]. It should be noted that the target detection technique is not used at this stage to obtain an accurate detection result, but to obtain the approximate position information of the potential objects so as to achieve co-location, so its use here is different. Specifically, based on the image features F_s and F_t, only two sets of potential object bounding boxes {B_s^i} and {B_t^j} (i = 1, …, n_s, j = 1, …, n_t) need to be predicted, where B_s^i and B_t^j respectively describe the i-th and j-th potential boxes in the images I_s and I_t, each of which can be recorded by the coordinates of its top-left and bottom-right corners, and n_s and n_t respectively denote the numbers of boxes in I_s and I_t. Again, these boxes need not be exact object boxes, but approximate or possible boxes; at this stage the aim is only to estimate their approximate locations.
To further locate the two main, semantically related objects in I_s and I_t, the potential objects are cropped from the images according to the potential bounding boxes {B_s^i} and {B_t^j} and resized to H × W. Another feature extraction module, for example a CNN, is then used to extract the feature descriptors {V_s^i} and {V_t^j} corresponding to the source and target images. After that, the descriptors {V_s^i} and {V_t^j} are stacked into two feature matrices, so that the semantic similarity matrix Z_st can be calculated by multiplying the two feature matrices.
Among the entries of Z_st, the entry with the highest similarity score determines the two corresponding boxes B_s^* and B_t^*, thereby locating the two main potential objects. Finally, the position offset transformation model T_tran is computed from the spatial coordinates of B_s^* and B_t^*, and the source image I_s is transformed by T_tran into the position-shifted image I_s^1. POCL is only used to capture the position deviation of the potential objects: when the position deviation is large, the corresponding position adjustment can be realized through the position offset transformation, but POCL alone cannot achieve accurate semantic alignment.
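As a sketch of the co-location operations just described, the potential boxes can be cropped and resized with torchvision's roi_align, encoded into one descriptor per box, and compared through a similarity matrix; using roi_align, global average pooling for the descriptor and a single best-scoring pair are implementation assumptions, since the text only requires cropping, resizing to H × W, a second feature extractor and selection by highest similarity.

# Hedged sketch of potential-object co-location: crop boxes, build descriptors, pick the best pair.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def box_descriptors(image, boxes, encoder, out_size=64):
    # boxes: (n, 4) float tensor in (x1, y1, x2, y2) pixel coordinates of this image.
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)    # prepend batch index 0
    crops = roi_align(image, rois, output_size=(out_size, out_size))
    feats = encoder(crops)                                           # (n, d, h', w')
    return F.normalize(feats.mean(dim=(2, 3)), dim=1)                # one L2-normalized descriptor per box

def best_box_pair(V_s, V_t):
    Z_st = V_s @ V_t.T                                               # semantic similarity matrix
    idx = Z_st.argmax()                                              # highest-scoring entry
    return divmod(int(idx), Z_st.shape[1])                           # indices of the two selected boxes

# Usage: i_star, j_star = best_box_pair(box_descriptors(I_s, boxes_s, encoder),
#                                       box_descriptors(I_t, boxes_t, encoder))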
Step 3, estimating the affine transformation model with the affine transformation regression (ATR) sub-network
OLASA estimates the affine transformation model T_affi of the images to be matched with the more accurate affine transformation regression (ATR) sub-network N_affi. Existing sub-networks with functions similar to ATR [11] may also be used to estimate the affine transformation. To achieve better results, the method, drawing on studies of the effectiveness of correlation tensors [1], [3], forms the features F_s^1 and F_t of the images I_s^1 and I_t into feature pairs and computes the correlation of these feature pairs, i.e. the 4D correlation tensor C_affi; each element of the tensor records the inner product of two local feature vectors taken from F_s^1 and F_t. The feature vectors and the correlation tensor are L2 normalized. The correlation tensor C_affi is input into N_affi to estimate the affine transformation model T_affi between I_s^1 and I_t:
T_affi = N_affi(C_affi)
The affine transformation model T_affi has 6 degrees of freedom, i.e. 6 affine transformation parameters need to be estimated. The image I_s^1 can then be transformed by the model T_affi; the image I_s^2 obtained after the affine transformation completes the further alignment with I_t.
Step 4, optimizing the alignment with the bidirectional thin-plate spline regression (TTPS) sub-network
The semantic alignment of the images can be further improved or enhanced with a network of control points. Specifically, the bidirectional thin-plate spline regression (TTPS) sub-network is used to estimate a regression transformation model T_ttps between the images I_s^2 and I_t, and this model is used to optimize the semantic alignment between I_s^2 and I_t. In existing thin-plate spline regression (TPS) methods such as [13], the TPS is unidirectional and the control points are fixed, which affects the result accordingly, e.g. excessive distortion occurs in some local areas and leads to larger object deformation. The TTPS designed in the present invention also performs regression prediction for a known set of corresponding control points; as a simple example, a uniform 3 × 3 grid of control points is placed on the images I_s^2 and I_t. Unlike TPS, however, TTPS is bidirectional: it adds control-point adjustment in the opposite direction, from the image I_t to the image I_s^2, and, compared with TPS, regards the control points in both images as movable, so that TTPS can effectively remove excessive deformation and reduce matching distortion. In a specific implementation, TTPS takes as its computation object the correlation tensor of the feature pairs formed from the features F_s^2 and F_t of the images I_s^2 and I_t, i.e. the 4D correlation tensor C_ttps, in order to capture the details of the object deformation more accurately. The process of estimating the regression deformation model T_ttps with the TTPS regression sub-network N_ttps can formally be expressed as:
T_ttps = N_ttps(C_ttps)
Similarly, N_ttps computes the correlation of the control point pairs from F_s^2 and F_t; the estimate of the regression deformation model T_ttps is obtained by calculating its 6 transformation weight parameters. According to the estimated regression transformation model T_ttps, I_s^2 is further transformed into I_s^3, thereby realizing the alignment of the source image with the target image I_t.
Step 5, joint training of the whole network
The successive transformations of the three sub-networks realize a coarse-to-fine matching principle, and the alignment of the image pair I_s and I_t can be completely described as the composition of the three transformation models, i.e.
T_H = T_ttps ∘ T_affi ∘ T_tran
where ∘ denotes the composition of geometric transformations. Moreover, as an overall network, the three sub-networks need to exchange information back and forth; in practice this is done by training them jointly as one network.
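The coarse-to-fine chaining T_H = T_ttps ∘ T_affi ∘ T_tran can be realized at inference time by warping the source image three times in succession (or, equivalently, by composing the coordinate mappings once and sampling once); the sketch below shows the simpler successive-warp form with each stage treated as an opaque warp function, and the function names are hypothetical.

# Hedged sketch of applying the chained transformation T_H at inference time.
import torch

def apply_olasa(I_s, warp_tran, warp_affi, warp_ttps):
    """Each warp_* is a function image -> warped image produced by the corresponding sub-network."""
    I_s1 = warp_tran(I_s)        # offset-corrected image (POCL stage)
    I_s2 = warp_affi(I_s1)       # affine-aligned image (ATR stage)
    I_s3 = warp_ttps(I_s2)       # spline-refined image (TTPS stage), aligned with I_t
    return I_s3

# Usage with identity placeholders standing in for the learned warps:
identity = lambda img: img
aligned = apply_olasa(torch.rand(1, 3, 240, 240), identity, identity, identity)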
Unlike conventional methods, which usually use image pairs as training samples when training the network, for the training of OLASA we propose a triplet sampling strategy that generates training data by introducing reference images. Each triplet contains a source image I_s, a reference image I_r and a target image I_t, where the reference image I_r is generated from I_s by a random geometric transformation T̂_affi^{s→r} or T̂_ttps^{s→r}, as shown in fig. 4(a). The image pairs used in traditional methods contain little appearance change, whereas the triplet sampling strategy can capture geometric and appearance changes in the training data at the same time; in other words, the triplet not only enriches the appearance variation of the original image pair but also provides strong transition information between a source image and a target image with a large difference. In the concrete computation, the introduction of triplets enriches the transformation models proposed above, which are represented separately by dashed and solid lines in fig. 4(b): 1) the models still to be estimated are similar to the original affine regression transformation or spline regression transformation, except that their number increases after subdivision; they are indicated by dashed arrows, e.g. T_affi^{s→t}, T_affi^{t→r}, etc.; 2) since the reference image is a generated result, the known transformations can be used as real labelled data; these related transformations are indicated by solid arrows, e.g. T̂_affi^{s→r} and T̂_ttps^{s→r}.
To achieve the training goal of the whole model, we design three loss functions to realize the optimization of OLASA, including the transitivity loss, the consistency loss and the alignment loss.
The transitivity loss. On the triplet samples we design the transitivity loss L_trans^affi according to the transitivity of geometric transformations, using a mean-squared-error criterion. Taking the affine transformation as an example, the two related successive transformations T_affi^{s→t} and T_affi^{t→r} can be combined to infer the transformation from the source image to the reference image, T_affi^{s→r}; that is, for a point (x, y) of the grid G constructed on I_s,
T_affi^{s→r}(x, y) = (T_affi^{t→r} ∘ T_affi^{s→t})(x, y),
and a point transformed by T̂_affi^{s→r} should coincide with the projected point obtained after these successive deformations. Here the affine transformation regression sub-network N_affi estimates the affine transformation models of the pairs (I_s, I_t) and (I_t, I_r) of the ternary sample, and the grid G (|G| = 20 × 20) can be transformed by T̂_affi^{s→r} and by T_affi^{t→r} ∘ T_affi^{s→t} respectively. We use the loss function L_trans^affi, i.e. the transitivity loss, to measure the difference between the two transformed grids:
L_trans^affi = Σ_{(x,y)∈G} ‖ T̂_affi^{s→r}(x, y) − (T_affi^{t→r} ∘ T_affi^{s→t})(x, y) ‖²
Similarly, a transitivity loss L_trans^ttps is also used in this process for the difference analysis of the spline regression T_ttps.
The consistency loss. The consistency of geometric transformations is an extension of the cycle consistency used in image-to-image translation and can serve as a complement to the transitivity loss. The consistency loss L_cons we design is defined as follows: given the affine transformation T_affi^{s→t} between the images I_s and I_t and the corresponding inverse transformation T_affi^{t→s}, a point (x, y) ∈ G should coincide with its original coordinates after the two-way transformation, i.e. (T_affi^{t→s} ∘ T_affi^{s→t})(x, y) ≈ (x, y). The reprojection error of the points of the grid G between the source image-target image pair can therefore be calculated as
ε(I_s, I_t) = Σ_{(x,y)∈G} ‖ (T_affi^{t→s} ∘ T_affi^{s→t})(x, y) − (x, y) ‖²
In the ternary sample, the other image pairs also contain bidirectional reprojection errors; these are all accumulated to form the consistency loss function of the method:
L_cons^affi = Σ_{(I_a, I_b)} ε(I_a, I_b), the sum running over the image pairs of the triplet.
Similarly, a consistency loss L_cons^ttps is also used in the present method for the difference analysis of the spline regression T_ttps.
The alignment loss. The alignment loss L_align is used to evaluate T_affi and T_ttps and can further improve the alignment accuracy. It is designed based on the soft-inlier count of reference [3] and a bidirectional measurement of alignment quality, and is computed from the two directional counts c_st and c_ts, where c_st is the soft-inlier count of aligning I_s to I_t and, similarly, c_ts denotes the soft-inlier count of aligning I_t to I_s.
Step 6, obtaining the matching result
Through the learning of the overall network comprising the three sub-networks, the method obtains an image alignment transformation model from the training data samples; in the testing stage, the matching result from the source image to the target image can be computed for a given pair of images to be matched.
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various alternatives and modifications are possible without departing from the spirit of the invention and the scope of the appended claims. Therefore, the invention should not be limited to the disclosed embodiments; the scope of protection of the invention is defined by the appended claims.

Claims (7)

1. An image matching method based on a deep semantic alignment network model comprises the steps of gradually estimating alignment between two semantic similar images by establishing an object position-aware semantic alignment network model OLASA; the method is characterized in that a triple sampling strategy is adopted to train a network model OLASA, and three sub-networks N of potential object co-location POCL, affine transformation regression ATR and bidirectional thin plate spline regression TTPS are adoptedtran,Naffi and NttpsRespectively estimating translation, affine transformation and spline transformation; then, establishing and optimizing an alignment relation between the images in a layering manner to obtain an image matching result; the method comprises the following steps:
step 1, extracting semantic features of an image: a pair of images Is
Figure FDA0003062597430000011
As convolutional neural networks
Figure FDA0003062597430000012
Respectively extracting two groups of characteristics Fs
Figure FDA0003062597430000013
Is,ItRespectively a source image and a target image; fs,FtRespectively representing source image semantic features and target image semantic features;
the front end of each sub-network adopts a convolutional neural network CNN to extract the characteristics of the image, and the characteristics are expressed as formula (1):
Figure FDA0003062597430000014
in formula (1), F is a feature extracted from an image; real number
Figure FDA0003062597430000015
A data space that is a feature; h. w and d respectively represent the height, width and channel number of the characteristic data space;
Figure FDA0003062597430000016
is a convolutional neural network CNN; i is an image; real number
Figure FDA0003062597430000017
A data space that is an image; H. w, D respectively represent the height, width, and number of channels of the image data space;
establishing an object location-aware semantic alignment network model OLASA comprising a sub-network Ntran,Naffi and NttpsFor estimating offset, affine and TTPS transformations, respectively, denoted Ttran,Taffi and Tttps(ii) a The source image is transformed to obtain the transformation result of each stage
Figure FDA0003062597430000018
And
Figure FDA0003062597430000019
finally by successive transformations THObtaining a source image IsWith the target image ItAlignment result of
Figure FDA00030625974300000110
wherein
Figure FDA00030625974300000111
The method for establishing the object position-aware semantic alignment network model OLASA comprises the following steps of 2-4:
step 2, constructing potential object co-locationSub-network NtranThe system is used for estimating the offset of a target object between images and eliminating the position deviation of an object to be matched;
a source image FsAnd features F of the target imagetCo-locating sub-networks as potential objects
Figure FDA00030625974300000112
Training the sub-network based on the potential target object location
Figure FDA00030625974300000113
And then realize Is and ItEstimation of preliminary transformation between
Figure FDA00030625974300000114
Is represented as follows:
Figure FDA00030625974300000115
Figure FDA00030625974300000116
in the formula ,
Figure FDA00030625974300000117
co-locating a sub-network for the potential object;
Figure FDA00030625974300000118
is Is and ItAn estimate of the preliminary transform in between, i.e. the offset transform; (x, y) represents the spatial coordinates of the target image, (x ', y') represents the corresponding sampling coordinates in the source image;
further location Is and ItTwo semantically related objects in the database, and extracting corresponding feature descriptors { V }s i} and {Vt jWherein i, j represent in the source image and the target image respectivelyThe serial number of the feature point of (1);
will describe the son { Vs i} and {Vt jStack as feature matrix
Figure FDA00030625974300000119
And
Figure FDA00030625974300000120
calculating a semantic similarity matrix by multiplying two feature matrices:
Figure FDA0003062597430000021
selection of ZstForming a similar feature pair group by a plurality of feature pairs with the highest similarity scores, calculating two corresponding regions by using feature point groups respectively represented in a source image and a target image, taking the two corresponding regions as two main potential objects, and calculating corresponding frame coordinates, namely space positions;
and then calculating a position offset transformation model by using the space coordinates of the corresponding frame of the potential object obtained by positioning
Figure FDA0003062597430000022
Source image IsBy passing
Figure FDA0003062597430000023
Transforming into position-shifted images
Figure FDA0003062597430000024
Step 3, constructing affine transformation regression sub-network NaffiEstimating affine transformation model of image to be matched using affine transformation regression subnetwork
Figure FDA0003062597430000025
Obtaining affine transformation parameter estimation; the method comprises the following steps:
affine transformation regression sub-network for image features
Figure FDA00030625974300000234
and FtConstructing feature pairs, calculating the correlation of the feature pairs
Figure FDA0003062597430000026
Estimating parameters of the affine transformation model;
degree of correlation of feature pairs
Figure FDA0003062597430000027
I.e. the 4D correlation tensor,
Figure FDA0003062597430000028
each element of the tensor
Figure FDA0003062597430000029
Recording two local feature vectors
Figure FDA00030625974300000235
Figure FDA00030625974300000210
Inner product of (2);
the feature vectors and the correlation tensor are L2-normalized; the correlation tensor Cst is input into Naffi, and the estimate of the affine transformation model Taff between Is^off and It is obtained according to formula (3):

Taff = Naffi(Cst)      (3)

the image Is^off is further transformed by the Taff of formula (3), and the affine-transformed image Is^aff is thereby further aligned with It;
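A minimal PyTorch sketch of the normalized 4D correlation tensor and of a small regression head standing in for Naffi (the layer sizes and head architecture are illustrative assumptions; only the correlate-and-normalize structure follows the description above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation_tensor(feat_a, feat_b):
    """feat_a, feat_b: (B, C, H, W) feature maps (e.g. Fs^off and Ft).
    Returns the correlation reshaped to (B, H*W, H, W) for a convolutional head."""
    b, c, h, w = feat_a.shape
    feat_a = F.normalize(feat_a, dim=1)      # L2-normalize local feature vectors
    feat_b = F.normalize(feat_b, dim=1)
    # Each element is the inner product of one local descriptor of feat_a
    # with one of feat_b: a (B, H, W, H, W) tensor, i.e. 4D per sample.
    corr = torch.einsum('bcij,bckl->bijkl', feat_a, feat_b)
    corr = corr.reshape(b, h * w, h, w)
    return F.normalize(corr, dim=1)          # L2-normalize the correlation tensor

class AffineRegressor(nn.Module):
    """Minimal stand-in for the affine regression sub-network Naffi."""
    def __init__(self, hw, feat_size):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(hw, 128, 7, padding=3), nn.ReLU(),
            nn.Conv2d(128, 64, 5, padding=2), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * feat_size * feat_size, 6)   # 6 affine parameters

    def forward(self, corr):
        return self.fc(self.conv(corr).flatten(1))           # (B, 6)
```

In use, the (B, 6) output would be reshaped to a (B, 2, 3) matrix and passed to affine_grid/grid_sample, as in the offset sketch above.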
Step 4, constructing the bidirectional thin-plate-spline regression sub-network Nttps; Nttps estimates the regression transformation model between the affine-transformed image and the target image by using control points, further refining the semantic alignment of the images;

specifically, control-point grids are arranged on the images Is^aff and It; taking the correlation tensor C'st of the feature pairs of Fs^aff and Ft as the computation object, Nttps is used to estimate the regression deformation model Ttps, expressed by formula (4):

Ttps = Nttps(C'st)      (4)

wherein C'st is computed from the control-point pairs in Fs^aff and Ft; the estimate of the regression deformation model Ttps is obtained by computing 6 transformation weight parameters; according to the regression transformation model Ttps, Is^aff is further transformed into Is^tps, realizing the alignment of the source image with the target image It;
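A minimal PyTorch sketch of the control-point side of this step, assuming a regular k x k control grid and a head that regresses 2D control-point displacements in both directions (the grid size, head layout and the omitted thin-plate-spline interpolation to a dense sampling grid are illustrative assumptions, not the claimed parameterization):

```python
import torch
import torch.nn as nn

def control_point_grid(k=5):
    """A regular k x k grid of control points in normalized [-1, 1] coordinates."""
    axis = torch.linspace(-1.0, 1.0, k)
    yy, xx = torch.meshgrid(axis, axis, indexing='ij')
    return torch.stack([xx, yy], dim=-1).reshape(-1, 2)      # (k*k, 2)

class BidirectionalTPSHead(nn.Module):
    """Regresses control-point displacements for both warp directions from the
    correlation tensor of Fs^aff and Ft (shape (B, H*W, H, W), as in step 3)."""
    def __init__(self, hw, feat_size, k=5):
        super().__init__()
        self.k = k
        self.conv = nn.Sequential(
            nn.Conv2d(hw, 128, 5, padding=2), nn.ReLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
        )
        # 2 directions x k*k control points x (dx, dy)
        self.fc = nn.Linear(64 * feat_size * feat_size, 2 * k * k * 2)

    def forward(self, corr):
        out = self.fc(self.conv(corr).flatten(1))
        fwd, bwd = out.chunk(2, dim=1)
        # A thin-plate-spline interpolant over these displacements would give the
        # dense sampling grid that is finally passed to grid_sample.
        return (fwd.reshape(-1, self.k * self.k, 2),
                bwd.reshape(-1, self.k * self.k, 2))
```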
Step 5, introducing a reference image, selecting training samples by a triplet method, and jointly training the network model OLASA;

Step 51, generating training data by a triplet sampling strategy; each triplet contains a source image Is, a reference image Ir and a target image It; the reference image Ir is generated from Is by a random geometric transformation (a random affine or a random thin-plate-spline transformation);
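A minimal PyTorch sketch of this sampling step, assuming the reference image is produced by warping the source with a small random affine perturbation (a random thin-plate-spline warp could be substituted; the jitter magnitude is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def make_triplet(img_s, img_t, jitter=0.15):
    """img_s, img_t: (1, C, H, W). Returns (Is, Ir, It) plus the known transform."""
    # Random affine theta = identity + small perturbation; this known transform
    # plays the role of the annotated source-to-reference transformation.
    identity = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
    theta_gt = identity + jitter * (2 * torch.rand(1, 2, 3) - 1)

    grid = F.affine_grid(theta_gt, img_s.shape, align_corners=False)
    img_r = F.grid_sample(img_s, grid, align_corners=False)   # reference image Ir
    return img_s, img_r, img_t, theta_gt
```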
Step 52, designing three loss functions to realize the optimization of OLASA, including the transitive loss functionNumber of
Figure FDA00030625974300000232
Consistency loss function
Figure FDA0003062597430000031
And alignment loss function
Figure FDA0003062597430000032
Transfer loss function
Figure FDA0003062597430000033
For measuring the accuracy of the predicted transform model; taking the difference between the transformation result of the transformation model and the transformation result of another transformation path for achieving the same transformation purpose as the transmission loss; another transformation approach specifically employs an indirect transformation approach through a combination of two transformation models; the indirect conversion way realizes the same conversion purpose as the direct conversion way which only adopts one conversion model, namely the conversion target of the conversion model to be detected is the same;
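A minimal PyTorch sketch of such a transitive comparison for the affine case, assuming the direct path is the annotated source-to-reference transform and the indirect path composes the predicted source-to-target and target-to-reference transforms (the mean L2 distance over grid points is an illustrative choice):

```python
import torch

def warp_points(theta, pts):
    """theta: (B, 2, 3) affine matrix; pts: (B, N, 2) points in normalized coords."""
    ones = torch.ones_like(pts[..., :1])
    return torch.cat([pts, ones], dim=-1) @ theta.transpose(1, 2)   # (B, N, 2)

def transitive_loss(theta_st, theta_tr, theta_sr_gt, grid_pts):
    """grid_pts: (B, N, 2) points of the grid G taken from the source image."""
    indirect = warp_points(theta_tr, warp_points(theta_st, grid_pts))  # s -> t -> r
    direct = warp_points(theta_sr_gt, grid_pts)                        # known s -> r
    return (indirect - direct).norm(dim=-1).mean()
```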
the consistency loss function Lcons is defined as follows: in a triplet sample consisting of the source image, the target image and the reference image, the accumulation of the bidirectional reprojection errors of the points of the grid G over the different image pairs constitutes the consistency loss function Lcons, which is used to characterize the deviations of the affine transformation Taff and of the spline regression Ttps;
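A minimal PyTorch sketch of the bidirectional reprojection idea, assuming forward and backward affine models are predicted for every pair of the triplet and the per-pair errors are summed (the cycle formulation and the choice of pairs are illustrative assumptions):

```python
import torch

def _warp(theta, pts):                      # same affine point warp as in the sketch above
    ones = torch.ones_like(pts[..., :1])
    return torch.cat([pts, ones], dim=-1) @ theta.transpose(1, 2)

def reprojection_error(theta_ab, theta_ba, grid_pts):
    """Bidirectional reprojection for one pair (Ia, Ib): map the grid G forward
    with theta_ab, back with theta_ba, and measure how far points drift."""
    cycled = _warp(theta_ba, _warp(theta_ab, grid_pts))
    return (cycled - grid_pts).norm(dim=-1).mean()

def consistency_loss(thetas, grid_pts):
    """thetas: {('s','t'): theta_st, ('t','s'): theta_ts, ...} for the triplet."""
    pairs = [('s', 't'), ('t', 'r'), ('s', 'r')]
    return sum(reprojection_error(thetas[(a, b)], thetas[(b, a)], grid_pts)
               for a, b in pairs)
```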
the alignment loss function Lalign is designed on the basis of soft inlier counting and a bidirectional measure of alignment quality, and is used to evaluate the alignment from Is to It and from It to Is, further optimizing the alignment accuracy; Lalign is calculated from the two soft inlier counts by formula (8), wherein cst is the soft inlier count obtained by aligning Is to It, and cts is the soft inlier count obtained by aligning It to Is;
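Since formula (8) is not reproduced above, the sketch below only illustrates one common form of such a loss, assuming each grid point contributes a Gaussian-weighted inlier score after warping and that corresponding grid positions in the other image are available from the estimated transforms; the kernel, bandwidth and combination are illustrative assumptions, not the claimed definition:

```python
import torch

def soft_inlier_count(warped_pts, ref_pts, sigma=0.1):
    """Soft count of grid points whose warped position lands near its reference
    position; each point contributes a score in (0, 1]."""
    dist2 = ((warped_pts - ref_pts) ** 2).sum(dim=-1)        # (B, N)
    return torch.exp(-dist2 / (2 * sigma ** 2)).sum(dim=-1)  # (B,)

def alignment_loss(warp_st, warp_ts, pts_s, pts_t):
    """warp_st maps source grid points into the target frame and vice versa;
    cst counts inliers when aligning Is to It, cts when aligning It to Is."""
    c_st = soft_inlier_count(warp_st(pts_s), pts_t)
    c_ts = soft_inlier_count(warp_ts(pts_t), pts_s)
    return -(c_st + c_ts).mean()           # higher counts give a lower loss
```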
Step 6, for a given image pair to be matched, the alignment result from the source image to the target image, i.e. the image matching result, is obtained with the object-position-aware semantic alignment network model OLASA trained in step 5;

through the above steps, image matching based on the deep semantic alignment network model is realized.
2. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 2, the position of the potential target object is obtained by estimating the potential targets with a classification-based object detection method.
3. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 2, when the two semantically related objects in Is and It are further located, the feature extraction module CNN is used to extract the corresponding feature maps {Vs_i} and {Vt_j}.
4. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 3, the affine transformation model Taff has 6 degrees of freedom, and 6 affine transformation parameters are estimated.
5. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 4, the bidirectional thin-plate-spline regression sub-network adopts a bidirectional strategy that adds control-point adjustment from the image It to the image Is^aff as well as in the opposite direction, which effectively avoids excessive distortion or warping of the image.
6. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 52, for the affine transformation Taff, a point of the grid G formed by the pixel points of the image Is is denoted by (x, y); based on the grid points, the result of the indirect path Ttr(Tst(x, y)) is compared with the result of the annotated direct transformation Tsr(x, y), and their discrepancy is taken as the error loss; the transitive loss Ltrans is expressed by formula (5):

Ltrans = Σ_{(x,y)∈G} ‖ Ttr(Tst(x, y)) − Tsr(x, y) ‖      (5)

wherein Tst is the affine transformation model between the source image and the target image; Ttr is the affine transformation model between the target image and the reference image; Tsr is the affine transformation model from the source image to the reference image obtained from the annotation data;

the transitive loss function can likewise be used for the difference analysis of the spline regression Ttps.
7. The image matching method based on the deep semantic alignment network model as claimed in claim 6, wherein in step 52, the consistency loss function Lcons is expressed by formula (7):

Lcons = ε(Is, It) + ε(It, Ir) + ε(Is, Ir)      (7)

wherein ε(Is, It), the bidirectional reprojection error of the points of the grid G between the source image-target image pair, is calculated by formula (6):

ε(Is, It) = Σ_{(x,y)∈G} ‖ Tts(Tst(x, y)) − (x, y) ‖      (6)

wherein ε(Is, It) is the reprojection error between the source image-target image pair; Tst is the affine transformation between the images Is and It; Tts is the corresponding inverse transformation of Tst;

the consistency loss Lcons can likewise be used for the difference analysis of the spline regression Ttps.
CN202110516741.5A 2021-05-12 2021-05-12 Image matching method based on depth semantic alignment network model Active CN113313147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110516741.5A CN113313147B (en) 2021-05-12 2021-05-12 Image matching method based on depth semantic alignment network model

Publications (2)

Publication Number Publication Date
CN113313147A true CN113313147A (en) 2021-08-27
CN113313147B CN113313147B (en) 2023-10-20

Family

ID=77373055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110516741.5A Active CN113313147B (en) 2021-05-12 2021-05-12 Image matching method based on depth semantic alignment network model

Country Status (1)

Country Link
CN (1) CN113313147B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862707A (en) * 2017-11-06 2018-03-30 深圳市唯特视科技有限公司 An image registration method based on Lucas-Kanade image alignment
US20190371080A1 (en) * 2018-06-05 2019-12-05 Cristian SMINCHISESCU Image processing method, system and device
CN110580715A (en) * 2019-08-06 2019-12-17 武汉大学 Image alignment method based on illumination constraint and grid deformation
CN110909778A (en) * 2019-11-12 2020-03-24 北京航空航天大学 Image semantic feature matching method based on geometric consistency
CN112102303A (en) * 2020-09-22 2020-12-18 中国科学技术大学 Semantic image analogy method for generating countermeasure network based on single image
CN112634341A (en) * 2020-12-24 2021-04-09 湖北工业大学 Method for constructing depth estimation model of multi-vision task cooperation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘岩; 吕肖庆; 秦叶阳; 汤帜; 徐剑波: "Scale and Color Invariant Image Feature Description", Journal of Chinese Computer Systems, no. 10, pages 187-192 *
廖明哲; 吴谨; 朱磊: "Remote Sensing Image Matching Based on ResNet and RF-Net", Chinese Journal of Liquid Crystals and Displays, no. 09, pages 91-99 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111621508A (en) * 2020-06-11 2020-09-04 云南中烟工业有限责任公司 Tobacco terpene synthase NtTPS7 gene and vector and application thereof
CN111621508B (en) * 2020-06-11 2022-07-01 云南中烟工业有限责任公司 Tobacco terpene synthase NtTPS7 gene and vector and application thereof
CN115861393A (en) * 2023-02-16 2023-03-28 中国科学技术大学 Image matching method, spacecraft landing point positioning method and related device
CN115861393B (en) * 2023-02-16 2023-06-16 中国科学技术大学 Image matching method, spacecraft landing point positioning method and related device
CN116977652A (en) * 2023-09-22 2023-10-31 之江实验室 Workpiece surface morphology generation method and device based on multi-mode image generation
CN116977652B (en) * 2023-09-22 2023-12-22 之江实验室 Workpiece surface morphology generation method and device based on multi-mode image generation

Also Published As

Publication number Publication date
CN113313147B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
US11763433B2 (en) Depth image generation method and device
Labbé et al. Cosypose: Consistent multi-view multi-object 6d pose estimation
CN113313147B (en) Image matching method based on depth semantic alignment network model
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
CN113160285B (en) Point cloud matching method based on local depth image criticality
CN103700099A (en) Rotation and dimension unchanged wide baseline stereo matching method
CN107766864B (en) Method and device for extracting features and method and device for object recognition
Yi et al. Motion keypoint trajectory and covariance descriptor for human action recognition
CN104517289A (en) Indoor scene positioning method based on hybrid camera
CN113361542A (en) Local feature extraction method based on deep learning
CN110969648A (en) 3D target tracking method and system based on point cloud sequence data
CN110544202A (en) parallax image splicing method and system based on template matching and feature clustering
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN116342648A (en) Twin network target tracking method based on mixed structure attention guidance
Shen et al. Semi-dense feature matching with transformers and its applications in multiple-view geometry
Lee et al. Learning to distill convolutional features into compact local descriptors
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN113988269A (en) Loop detection and optimization method based on improved twin network
Huang et al. Life: Lighting invariant flow estimation
Zhang et al. An automatic three-dimensional scene reconstruction system using crowdsourced Geo-tagged videos
CN110849380A (en) Map alignment method and system based on collaborative VSLAM
Xu et al. Local feature matching using deep learning: A survey
Xu et al. Improved HardNet and Stricter Outlier Filtering to Guide Reliable Matching.
CN115375746A (en) Stereo matching method based on double-space pooling pyramid
CN114155406A (en) Pose estimation method based on region-level feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant