CN113313147A - Image matching method based on deep semantic alignment network model - Google Patents

Image matching method based on deep semantic alignment network model

Info

Publication number
CN113313147A
CN113313147A (application CN202110516741.5A)
Authority
CN
China
Prior art keywords
image
transformation
alignment
model
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110516741.5A
Other languages
Chinese (zh)
Other versions
CN113313147B (en)
Inventor
吕肖庆
瞿经纬
王天乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202110516741.5A
Publication of CN113313147A
Application granted
Publication of CN113313147B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components, by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image matching method based on a deep semantic alignment network model, which progressively estimates the alignment between two semantically similar images by establishing an object location-aware semantic alignment network model, OLASA. A triplet sampling strategy is adopted to train the network model OLASA, and three sub-networks, potential object co-location (POCL), affine transformation regression (ATR) and bidirectional thin-plate spline regression (TTPS), denoted N_tran, N_affi and N_ttps, respectively estimate the translation, the affine transformation and the spline transformation. The alignment relation between the images is then established and optimized in a hierarchical manner to obtain the image matching result. With the technical scheme provided by the invention, the alignment of images whose objects differ greatly in position can be improved and the accuracy of image matching is increased. The method can be applied to target tracking, semantic segmentation, multi-view three-dimensional reconstruction and other tasks in the field of computer vision.

Description

Image matching method based on deep semantic alignment network model
Technical Field
The invention belongs to the technical field of computer vision and digital image processing, relates to an image matching technology, and particularly relates to a method for establishing an accurate corresponding matching relation of main target objects in similar images based on an image depth semantic alignment network model.
Background
Image semantic alignment aims at establishing an accurate correspondence between similar target objects in different images, that is, a point-to-point feature matching relation between the similar objects. The typical scenario is that, on the premise that the image contents are the same or similar, the feature information of the images is used to analyse and quantify the similarity between features, and the matching relation of the feature points on similar objects in the images is then determined. This is a fundamental problem in computer vision and is widely used in fields such as target tracking, image semantic segmentation and multi-view three-dimensional reconstruction.
Semantic alignment has received much attention in recent years. Early studies included methods that find instance-level matches by defining and computing sparse or dense descriptors [5]. However, the instance-level descriptions of these methods lack the generalization capability needed at the category level. The purpose of category-level correspondence is to find dense correspondences between semantically similar images. Some methods use local descriptors and minimize the matching energy involved, but manually constructed descriptors can hardly embed high-level semantic features and are sensitive to image changes.
Inspired by the rich high-level semantics of convolutional neural network (CNN) features, recent solutions (references [2], [4], [6], [7], [8]) employ trainable CNN features and combine them to estimate dense flow fields that align the images. Furthermore, the methods in references [1], [3], [9], [10] estimate geometric transformations with trainable CNN features and express semantic correspondence as a geometric alignment problem. Thanks to the geometric transformations describing the dense correspondences, some of these methods outperform dense-flow-based methods and produce smoother matching results.
Despite the great advances made by existing approaches, the semantic alignment problem still faces challenges such as alignment difficulties caused by object variations (e.g., appearance, scale, shape, position) and complex backgrounds. Specifically, first, when the difference in target position is large, it is difficult to directly establish a dense correspondence between images and the results are poor (as shown in fig. 1); previous methods often fail to align such images because handling this situation has not been sufficiently studied. Second, data annotation is difficult: it is hard to collect a large number of training image pairs with ground-truth dense correspondences and significant appearance changes, and manually annotating such training data is very labor-intensive and somewhat subjective.
Reference documents:
[1] Ignacio Rocco, Relja Arandjelović, and Josef Sivic, "Convolutional neural network architecture for geometric matching," in CVPR, 2017.
[2] Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, and Jean Ponce, "SCNet: Learning semantic correspondence," in ICCV, 2017.
[3] Ignacio Rocco, Relja Arandjelović, and Josef Sivic, "End-to-end weakly-supervised semantic alignment," in CVPR, 2018.
[4] Junghyup Lee, Dohyung Kim, Jean Ponce, and Bumsub Ham, "SFNet: Learning object-aware semantic correspondence," in CVPR, 2019.
[5] David G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, 2004.
[6] Ce Liu, Jenny Yuen, and Antonio Torralba, "SIFT flow: Dense correspondence across scenes and its applications," IEEE TPAMI, vol. 33, no. 5, 2010.
[7] Bumsub Ham, Minsu Cho, Cordelia Schmid, and Jean Ponce, "Proposal flow: Semantic correspondences from object proposals," IEEE TPAMI, vol. 40, no. 7, 2017.
[8] Seungryong Kim, Dongbo Min, Bumsub Ham, Sangryul Jeon, Stephen Lin, and Kwanghoon Sohn, "FCSS: Fully convolutional self-similarity for dense semantic correspondence," in CVPR, 2017.
[9] Paul Hongsuck Seo, Jongmin Lee, Deunsol Jung, Bohyung Han, and Minsu Cho, "Attentive semantic alignment with offset-aware correlation kernels," in ECCV, 2018.
[10] Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, and Josef Sivic, "Neighbourhood consensus networks," in NIPS, 2018.
Disclosure of the Invention
In order to overcome the defects of the prior art, the invention provides an image matching method based on a deep semantic alignment network model. The method establishes an object location-aware semantic alignment network, trains it with a triplet sampling strategy, and establishes and optimizes the alignment relation between images in a hierarchical manner. It solves the technical problems that the prior art can hardly establish a dense correspondence between images directly and that annotating image data is time-consuming, labor-intensive and of limited accuracy, and it improves the accuracy of image matching.
The image semantic alignment technique in the invention is a sub-problem in the field of image matching or image feature matching and is mainly aimed at the following scenario: although the two images to be matched are different, they both contain a similar foreground object, that is, the high-level semantic information of the foreground objects, such as appearance, shape and posture, is similar and the objects basically belong to the same category, such as cars of different brands.
The technical scheme provided by the invention is as follows:
An image matching method based on a deep semantic alignment network model comprises gradually and robustly estimating the alignment between two semantically similar images by establishing an object location-aware semantic alignment network model (OLASA); meanwhile, a triplet sampling strategy is proposed for training the network, translation, affine transformation and spline transformation are estimated respectively by three sub-networks, potential object co-location (POCL), affine transformation regression (ATR) and bidirectional thin-plate spline regression (TTPS), and the alignment relation between the images is then established and optimized in a hierarchical manner to obtain the image matching result; the method comprises the following steps:
Step 1, extracting semantic features of the images;
In the method, an independent convolutional neural network (CNN) is adopted at the front end of each sub-network to extract the features of the image. In a specific implementation, the invention adopts a convolutional neural network (CNN) to extract the features of the two images; the network can be the most basic CNN or an improved or enhanced CNN. The feature extraction is expressed as formula (1):
F = f(I), F ∈ ℝ^(h×w×d), I ∈ ℝ^(H×W×D)    (1)
In formula (1), F is the feature extracted from an image; ℝ^(h×w×d) is the feature data space (real-valued), and h, w and d respectively represent its three dimensions, namely height, width and number of channels; f is the convolutional neural network (CNN); I is an image; ℝ^(H×W×D) is the data space (real-valued) of the image, and H, W and D respectively represent its three dimensions, namely height, width and number of channels.
A pair of images (I_s, I_t) is taken as the input of f, and two groups of features F_s = f(I_s) and F_t = f(I_t) are extracted respectively, where I_s and I_t are the source image and the target image respectively, and F_s and F_t are respectively the source image semantic features and the target image semantic features.
The invention establishes an object location-aware semantic alignment network model, OLASA, which gradually and robustly estimates the alignment between two semantically similar images. The architecture of OLASA takes POCL, ATR and TTPS as its main components, named N_tran, N_affi and N_ttps, which are used to estimate the offset, affine and TTPS transformations, denoted T_tran, T_affi and T_ttps respectively, as shown in fig. 3. Through the transformation models T_tran, T_affi and T_ttps, the transformation results I_s^1 and I_s^2 of the source image at each stage can be obtained; the successive transformation T_H, i.e. T_H = T_ttps ∘ T_affi ∘ T_tran, then yields the alignment result I_s^3 of the source image I_s with the target image I_t. The method for establishing the object location-aware semantic alignment network model comprises steps 2-4.
Step 2, adopting the potential object co-location sub-network N_tran to estimate the offset of the target object between images and eliminate the position deviation of the objects to be matched;
Since similar target objects in different images often show an obvious displacement, the N_tran sub-network uses potential-target position detection and estimation to predict an offset transformation model in advance and, by transforming the source image, eliminates the influence of this large displacement.
The potential object co-location sub-network is denoted N_tran. It takes the feature F_s of the source image and the feature F_t of the target image as input, estimates potential targets with a classification-based target detection method, and trains the co-location sub-network N_tran according to the positions of the potential targets, thereby realizing the estimation T_tran of the preliminary transformation between I_s and I_t, expressed as follows:
T_tran = N_tran(F_s, F_t),  (x', y') = T_tran(x, y)    (2)
In the formula, N_tran is the potential object co-location sub-network; T_tran is the estimate of the preliminary transformation (offset transformation) between I_s and I_t; the number of degrees of freedom of T_tran is 4; (x, y) represents the spatial coordinates of the target image and (x', y') represents the corresponding sampling coordinates in the source image.
To estimate the offset transformation model T_tran, the positions of the objects to be matched need to be estimated first; for this, the approximate positions of the objects to be matched can be estimated with an existing classification-based target detection technique, further locating the two main, semantically related objects in I_s and I_t. Another feature extraction module, for example a CNN, is then used to extract the corresponding feature descriptors {V_s^i} and {V_t^j}, where i and j respectively denote the indices of the feature points in the source image and the target image.
The descriptors {V_s^i} and {V_t^j} are stacked into two feature matrices, and the semantic similarity matrix Z_st is calculated by multiplying the two feature matrices.
The feature pairs with the highest similarity scores in Z_st are selected to form a group of similar feature pairs; the two corresponding regions given by the feature point groups in the source image and the target image respectively are computed and taken as the two main potential objects, and the corresponding bounding-box coordinates, i.e. their spatial positions, are calculated.
Finally, from the spatial coordinates of the bounding boxes of the two main potential objects obtained by this localization, the position offset transformation model T_tran is calculated, and the source image I_s is transformed by T_tran into the position-shifted image I_s^1.
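To make the offset step concrete, the sketch below derives a 4-degree-of-freedom mapping (per-axis scale plus translation) from the two located bounding boxes and applies it to the source image with a sampling grid; the specific 4-parameter form, the helper names and the example box coordinates are assumptions, since the disclosure only states that T_tran has 4 degrees of freedom and is computed from the two box positions.

# Hedged sketch: a 4-DOF offset transform T_tran from two located boxes
# (x1, y1, x2, y2 in normalized [-1, 1] coordinates), used to warp I_s into I_s^1.
import torch
import torch.nn.functional as F

def offset_transform_from_boxes(box_s, box_t):
    """Return a 2x3 matrix mapping target-image coords (x, y) to source sampling coords (x', y')."""
    sx1, sy1, sx2, sy2 = box_s
    tx1, ty1, tx2, ty2 = box_t
    ax = (sx2 - sx1) / (tx2 - tx1)           # per-axis scale
    ay = (sy2 - sy1) / (ty2 - ty1)
    bx = sx1 - ax * tx1                      # per-axis translation
    by = sy1 - ay * ty1
    return torch.tensor([[ax, 0.0, bx],
                         [0.0, ay, by]])

def warp(image, theta):
    # affine_grid/grid_sample sample source coords for every target position, matching formula (2).
    grid = F.affine_grid(theta.unsqueeze(0), image.shape, align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

I_s = torch.rand(1, 3, 240, 240)
T_tran = offset_transform_from_boxes(box_s=(-0.6, -0.5, 0.2, 0.4),
                                     box_t=(-0.2, -0.3, 0.6, 0.6))
I_s1 = warp(I_s, T_tran)                     # position-shifted image I_s^1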
Step 3, constructing the affine transformation regression sub-network N_affi, estimating the affine transformation model T_affi of the images to be matched with the affine transformation regression sub-network, and obtaining the affine transformation parameter estimate;
The ATR sub-network is used to estimate the affine transformation model between the offset-adjusted image and the target image. The ATR sub-network constructs pairs of image features, i.e. feature pairs, and calculates the correlation of these feature pairs, from which the parameters of the affine transformation model are estimated.
Specifically, the features F_s^1 and F_t of the position-shifted image I_s^1 and the target image I_t are formed into feature pairs, and the correlation of these feature pairs is calculated as the 4D correlation tensor C_affi; each element of the tensor records the inner product of two local feature vectors taken from F_s^1 and F_t. The feature vectors and the correlation tensor are L2 normalized. The correlation tensor C_affi is input into N_affi to estimate the affine transformation model T_affi between I_s^1 and I_t:
T_affi = N_affi(C_affi)    (3)
The affine transformation model T_affi has 6 degrees of freedom, i.e. 6 affine transformation parameters need to be estimated. The image I_s^1 can then be transformed by the model T_affi; the image I_s^2 obtained after the affine transformation completes the further alignment with I_t.
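As an illustration of the ATR computation just described, the sketch below builds the 4D correlation tensor from L2-normalized features with an einsum, normalizes it, and regresses the 6 affine parameters with a small convolutional head; the layer sizes and spatial dimensions of the regression head are assumptions, the disclosure only fixes the input (the correlation tensor) and the output (6 parameters).

# Hedged sketch of step 3: 4D correlation tensor + affine parameter regression.
import torch
import torch.nn.functional as F

def correlation_tensor(feat_a, feat_b):
    # feat_*: (B, d, h, w); each element is the inner product of two local feature vectors.
    feat_a = F.normalize(feat_a, dim=1)
    feat_b = F.normalize(feat_b, dim=1)
    corr = torch.einsum("bdij,bdkl->bijkl", feat_a, feat_b)    # (B, h, w, h, w)
    b, h, w, _, _ = corr.shape
    corr = corr.reshape(b, h * w, h, w)                        # fold to (B, h*w, h, w) for conv layers
    return F.normalize(corr, dim=1)                            # L2-normalize the correlation

class AffineRegressor(torch.nn.Module):
    def __init__(self, hw=15 * 15):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(hw, 128, kernel_size=7), torch.nn.BatchNorm2d(128), torch.nn.ReLU(),
            torch.nn.Conv2d(128, 64, kernel_size=5), torch.nn.BatchNorm2d(64), torch.nn.ReLU(),
        )
        self.fc = torch.nn.Linear(64 * 5 * 5, 6)               # 6 affine transformation parameters

    def forward(self, corr):
        x = self.conv(corr)
        return self.fc(x.flatten(1))                           # parameters of T_affi

F_s1 = torch.rand(1, 1024, 15, 15)
F_t = torch.rand(1, 1024, 15, 15)
C_affi = correlation_tensor(F_s1, F_t)
theta_affi = AffineRegressor()(C_affi)                          # shape (1, 6)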
Step 4, constructing the bidirectional thin-plate spline regression sub-network N_ttps and optimizing the alignment with it;
The TTPS sub-network N_ttps uses control points to estimate a regression transformation model between the affine-transformed image and the target image; this model further improves or enhances the semantic alignment of the images. TTPS adopts a bidirectional strategy to avoid excessive distortion of the image. Compared with the earlier thin-plate spline regression (TPS) method, this method adds control-point adjustment in the opposite direction, from the image I_t to the image I_s^2, which effectively removes excessive deformation and reduces matching distortion.
In a specific implementation, a grid of control points is placed on the images I_s^2 and I_t, and the TTPS sub-network takes as its computation object the correlation tensor of the feature pairs formed from the features F_s^2 and F_t of I_s^2 and I_t, i.e. the 4D correlation tensor C_ttps. The process of estimating the regression deformation model T_ttps with the TTPS regression sub-network N_ttps can be expressed as:
T_ttps = N_ttps(C_ttps)    (4)
where N_ttps computes the correlation of the control point pairs according to F_s^2 and F_t; the estimate of the regression deformation model T_ttps is obtained by calculating its 6 transformation weight parameters. According to the estimated regression transformation model T_ttps, I_s^2 is further transformed into I_s^3, thereby realizing the alignment of the source image with the target image I_t.
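The sketch below illustrates only the bidirectional control-point idea: a uniform control-point grid is placed on both images and a small head predicts control-point offsets in both directions from the correlation tensor. The head architecture and the 3 × 3 grid size (the example given later in the detailed description) are assumptions, the thin-plate spline interpolation that turns the moved control points into a dense warp is omitted, and note that the patent itself parameterizes T_ttps by 6 transformation weight parameters, so this per-point-offset form is illustrative rather than the claimed parameterization.

# Hedged sketch of the bidirectional control-point regression in step 4
# (the thin-plate spline interpolation itself is omitted).
import torch

def control_point_grid(n=3):
    # Uniform n x n grid of control points in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, n), torch.linspace(-1, 1, n), indexing="ij")
    return torch.stack([xs, ys], dim=-1).view(-1, 2)            # (n*n, 2)

class BidirectionalCPRegressor(torch.nn.Module):
    """Predict control-point offsets in both directions from the 4D correlation tensor."""
    def __init__(self, hw=15 * 15, n=3):
        super().__init__()
        self.head = torch.nn.Sequential(
            torch.nn.Conv2d(hw, 64, kernel_size=5), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(64, 2 * 2 * n * n),                 # forward + reverse offsets
        )
        self.n = n

    def forward(self, corr):
        off = self.head(corr).view(-1, 2, self.n * self.n, 2)
        grid = control_point_grid(self.n)
        cp_fwd = grid + off[:, 0]                               # control points moved for I_s^2 -> I_t
        cp_rev = grid + off[:, 1]                               # control points moved for I_t -> I_s^2
        return cp_fwd, cp_rev

C_ttps = torch.rand(1, 15 * 15, 15, 15)                          # correlation tensor of F_s^2 and F_t
cp_fwd, cp_rev = BidirectionalCPRegressor()(C_ttps)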
Step 5, jointly training an integral OLASA network model comprising three sub-networks;
in terms of the selection of training samples, the invention proposes a triplet method by introducing reference images, which can better capture geometrical and appearance variations in the training data.
Step 51, the present invention proposes a triplet sampling strategy to generate training data. Each triplet contains a source image I_s, a reference image I_r and a target image I_t, where the reference image I_r is generated from I_s by a random geometric transformation T̂_affi^{s→r} or T̂_ttps^{s→r}.
Step 52, designing three loss functions to realize the optimization of OLASA: the transitivity loss, the consistency loss and the alignment loss.
The transitivity loss function L_trans is used to measure the accuracy of the predicted transformation models. To check whether a predicted transformation model is accurate, the transformation result of the model can be compared with the result of another transformation path that achieves the same transformation goal; this other path can specifically be an indirect path obtained by composing two transformation models. The indirect path achieves the same transformation goal as the direct path that uses only one transformation model, namely the goal of the transformation model under test.
Take the check of an affine transformation T_affi^{s→r} as an example: if T_affi^{s→r} (the affine transformation from the source image to the reference image) is the predicted transformation model whose accuracy is to be checked, another indirect path achieving the same transformation goal is needed, such as the composition T_affi^{t→r} ∘ T_affi^{s→t}, which combines the two related successive transformations T_affi^{s→t} (the affine transformation from the source image to the target image) and T_affi^{t→r} (the affine transformation from the target image to the reference image) to simulate the T_affi^{s→r} transformation; the accuracy of the transformation from the source image to the reference image is then judged by comparing the difference between the transformation results obtained in these two ways. In a specific implementation this is done with a grid of selected pixel points in the image: for example, dividing the image into twenty parts in both the horizontal and vertical directions gives a grid G (|G| = 20 × 20) containing 400 points, a point of G being denoted (x, y), and the initial grid being constructed on I_s. As described above, judging the accuracy of T_affi^{s→r} by means of T_affi^{s→t} and T_affi^{t→r} then reduces to a concrete calculation on the grid points: theoretically, a point transformed by T_affi^{s→r} should coincide with the projected point obtained after the successive deformations T_affi^{t→r} ∘ T_affi^{s→t}; at the same time, the difference with respect to the real data, i.e. T̂_affi^{s→r} (the affine transformation from the source image to the reference image computed from the annotation data, not predicted), is taken as the error on which the learning method is based. The transitivity loss function L_trans^affi can therefore be expressed as:
L_trans^affi = Σ_{(x,y)∈G} ‖ T̂_affi^{s→r}(x, y) − (T_affi^{t→r} ∘ T_affi^{s→t})(x, y) ‖²
The transformations involve three types of objects, namely the source image, the target image and the reference image; these are therefore organized into a triplet for training the related models, i.e. one piece of training data consists of a source image, a target image and a reference image. This can also be seen as adding a corresponding reference image to the traditional training pair (source image and target image). The associated loss computation makes it possible to compute several transformations at the same time, each involving a pair of images of the triplet (the two images belonging to different image types); for example, the affine transformation regression sub-network N_affi estimates the affine transformation models of the pairs (I_s, I_t) (source and target images) and (I_t, I_r) (target and reference images) of a ternary sample.
Similarly, a transitivity loss L_trans^ttps is also used in this process for the difference analysis of the spline regression T_ttps.
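Assuming each transformation is represented as a function that maps a batch of grid points to new coordinates, the transitivity loss above can be sketched in a few lines; the squared-error form over the 20 × 20 grid follows the description, while the exact reduction (sum versus mean) and the toy parameter values are assumptions.

# Hedged sketch of the transitivity loss on the 20 x 20 grid G built on I_s.
import torch

def make_grid(n=20):
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, n), torch.linspace(-1, 1, n), indexing="ij")
    return torch.stack([xs, ys], dim=-1).view(-1, 2)            # |G| = n * n points

def transitivity_loss(T_sr_gt, T_st, T_tr, grid):
    """Compare the ground-truth warp s->r with the composition of the two predicted warps s->t and t->r."""
    direct = T_sr_gt(grid)                                      # \hat{T}^{s->r}(G)
    composed = T_tr(T_st(grid))                                 # (T^{t->r} o T^{s->t})(G)
    return ((direct - composed) ** 2).sum(dim=-1).mean()

# Toy usage with affine transforms given as 2x3 matrices (hypothetical values).
def affine_fn(theta):
    return lambda pts: pts @ theta[:, :2].T + theta[:, 2]

G = make_grid()
loss_tr_affi = transitivity_loss(
    affine_fn(torch.tensor([[1.0, 0.0, 0.1], [0.0, 1.0, -0.2]])),   # ground truth s->r
    affine_fn(torch.tensor([[0.9, 0.0, 0.2], [0.0, 1.1, 0.0]])),    # predicted s->t
    affine_fn(torch.tensor([[1.1, 0.0, -0.1], [0.0, 0.9, -0.2]])),  # predicted t->r
    G,
)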
The consistency loss function L_cons is defined as follows: in a ternary sample consisting of a source image, a target image and a reference image, the accumulation of the bidirectional reprojection errors of the points of the grid G over the different image pairs constitutes the consistency loss function L_cons, which is used to represent the difference for the affine transformation T_affi and for the spline regression T_ttps.
Given the affine transformation T_affi^{s→t} between the images I_s and I_t and the corresponding inverse transformation T_affi^{t→s}, a point (x, y) ∈ G should coincide with its original coordinates after the two-way transformation, i.e. (T_affi^{t→s} ∘ T_affi^{s→t})(x, y) ≈ (x, y). The reprojection error of the points of the grid G between the source image-target image pair can therefore be calculated as:
ε(I_s, I_t) = Σ_{(x,y)∈G} ‖ (T_affi^{t→s} ∘ T_affi^{s→t})(x, y) − (x, y) ‖²
where ε(I_s, I_t) is the reprojection error between the source-target image pair.
In the ternary sample composed of the source image, the target image and the reference image, the other image pairs also contain bidirectional reprojection errors; these are all accumulated to form the consistency loss function of the method:
L_cons^affi = Σ_{(I_a, I_b)} ε(I_a, I_b), the sum running over the image pairs of the triplet.
Similarly, a consistency loss L_cons^ttps is also used in the present method for the difference analysis of the spline regression T_ttps.
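Under the same point-mapping convention as the sketch given after the transitivity loss, the bidirectional reprojection error and its accumulation can be sketched as follows; which pairs of the triplet enter the sum is stated only loosely in the text, so the list of pairs in the usage comment is an assumption.

# Hedged sketch of the consistency loss: accumulate bidirectional reprojection errors over image pairs.
import torch

def reprojection_error(T_ab, T_ba, grid):
    """epsilon(I_a, I_b): points of G should return to their original coordinates after a round trip."""
    return ((T_ba(T_ab(grid)) - grid) ** 2).sum(dim=-1).mean()

def consistency_loss(pairs, grid):
    # pairs: list of (forward transform, inverse transform) for each image pair of the triplet.
    return sum(reprojection_error(T_ab, T_ba, grid) for T_ab, T_ba in pairs)

# Usage (hypothetical transforms, reusing affine_fn and make_grid from the previous sketch):
# pairs = [(T_affi_st, T_affi_ts), (T_affi_tr, T_affi_rt)]
# loss_cons_affi = consistency_loss(pairs, make_grid())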
The alignment loss function L_align is designed based on the soft-inlier count of reference [3] and a bidirectional measurement of alignment quality; it is used to evaluate T_affi and T_ttps and can further improve the alignment accuracy. L_align is computed from the two directional counts c_st and c_ts, where c_st is the soft-inlier count of aligning I_s to I_t and, similarly, c_ts denotes the soft-inlier count of aligning I_t to I_s.
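A soft-inlier count, in the spirit of reference [3], scores how much correlation mass lies close to the correspondences induced by an estimated transformation. The following is a rough, simplified sketch of that idea only: it uses a hard distance threshold instead of the smooth mask of [3], an arbitrary threshold value, and the final combination of the two directional counts into a loss (shown only as a comment) is likewise an assumption.

# Rough sketch of a soft-inlier-count style alignment score; simplified relative to reference [3].
import torch

def soft_inlier_count(corr, warp_fn, grid_hw, tau=0.1):
    """corr: (h*w, h*w) correlation between target positions (rows) and source positions (columns).
    warp_fn maps target coordinates to predicted source sampling coordinates."""
    h, w = grid_hw
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).view(-1, 2)            # positions of the h*w cells
    predicted = warp_fn(coords)                                    # where each target cell should land in the source
    # Keep correlations whose source position lies within tau of the predicted correspondence.
    dist = torch.cdist(predicted, coords)                          # (h*w target, h*w source)
    mask = (dist < tau).float()
    return (corr * mask).sum()

# c_st would use the correlation of (F_s, F_t) with the s->t transform and c_ts the reverse direction;
# the alignment loss decreases as both counts grow, e.g. L_align = -(c_st + c_ts).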
Step 6, obtaining the image matching result;
With the jointly trained overall network, an alignment result from the source image to the target image can be produced for any given pair of images to be matched.
Specifically, learning and training are carried out on the overall network comprising the three sub-networks; the method obtains an image alignment transformation model from the training data samples, yielding the trained object location-aware semantic alignment network model OLASA. In the testing stage, the matching result from the source image to the target image can then be computed for a given pair of images to be matched.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides an image matching method based on a deep semantic alignment network model, comprising the three sub-networks of potential object co-location (POCL), affine transformation regression (ATR) and bidirectional thin-plate spline regression (TTPS). POCL can effectively perceive the offset of a potential object, ATR learns the parameters of the geometric deformation, and TTPS improves the robustness of the deformation; the joint learning of the three sub-networks not only realizes the semantic alignment of the images but also achieves higher matching accuracy. With the technical scheme provided by the invention, the alignment of images whose objects differ greatly in position can be improved. At the same time, in scenarios with insufficient annotation data, the generation of reference images allows the geometric and appearance changes in the existing data to be mined and exploited more deeply, improving the accuracy of image matching. The method can be applied to various tasks in the field of computer vision, such as target tracking, semantic segmentation and multi-view three-dimensional reconstruction.
Drawings
FIG. 1 is a diagram of object location-aware semantic alignment of images in image matching, where I_s represents the source image and I_t represents the target image.
FIG. 2 is a flow chart of a method for performing image matching on an OLASA network model established by the present invention;
FIG. 3 is a structural block diagram of the OLASA network model established by the present invention, where I_s is the source image, I_t is the target image, and I_s^1, I_s^2 and I_s^3 are the results of the successive transformation stages; f is the feature extraction network; N_tran is the potential object co-location (POCL) sub-network and T_tran its corresponding transformation model; N_affi is the affine transformation regression (ATR) sub-network and T_affi its corresponding transformation model; N_ttps is the bidirectional thin-plate spline regression (TTPS) sub-network and T_ttps its corresponding transformation model.
FIG. 4 is a schematic diagram of a training triplet and the related transformation models of the OLASA network model in the present invention, where (a) shows the generation of a reference image I_r from a source image I_s, and (b) shows the various transformations within a triplet: T̂_affi^{s→r} and T̂_ttps^{s→r} respectively denote the affine transformation model and the bidirectional thin-plate spline regression model from the source image to the reference image, used as the ground truth for comparison; T_affi^{s→t} and T_ttps^{s→t} respectively denote the affine transformation model and the bidirectional thin-plate spline regression model from the source image to the target image, and T_affi^{t→s} and T_ttps^{t→s} the two models of the corresponding inverse transformations; T_affi^{t→r} and T_ttps^{t→r} respectively denote the affine transformation model and the bidirectional thin-plate spline regression model from the target image to the reference image, and T_affi^{r→t} and T_ttps^{r→t} the two models of the corresponding inverse transformations.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides an image matching method based on a deep semantic alignment network model, namely a deep neural network model, OLASA, for semantic alignment. The inputs of the method are a source image, a target image and a reference image. Through deep semantic analysis of the source image and the target image, OLASA estimates the deformation parameters of the source image according to their internal alignment relation; the deformed source image is the output of the method, and the target object it contains can be matched with the corresponding object in the target image. Internally, OLASA is realized through the joint learning of three sub-networks, potential object co-location (POCL), affine transformation regression (ATR) and bidirectional thin-plate spline regression (TTPS), which together achieve effective image matching.
Fig. 2 shows the flow of the image matching method performed with the OLASA network model established by the present invention. The method can be applied to any given image pair (i.e. a source image and a target image); the image pair can be obtained by shooting, network download and other means. Image data sets such as PF-WILLOW and PF-PASCAL (reference [11]: Bumsub Ham, Minsu Cho, Cordelia Schmid, and Jean Ponce, "Proposal flow," in CVPR, 2016.) and Caltech-101 (reference [12]: Li Fei-Fei, Rob Fergus, and Pietro Perona, "One-shot learning of object categories," IEEE TPAMI, vol. 28, no. 4, 2006.) are used. The reference image is computed from the source image during execution of the method. The method of the invention is realized through the following steps.
Step 1, extracting semantic features of the images
After receiving a source image, a target image and a reference image, OLASA first extracts the features of the source image, the target image and the reference image. That is, a pair of images to be matched (I_s, I_t) (the source and target images, which can be taken from an image library or shot directly) is taken as input, and the method adopts a convolutional neural network (CNN) to extract the features of the two images; the network can be the most basic CNN or an improved or enhanced CNN and, without loss of generality, is denoted f in this text. The two groups of extracted features are named F_s = f(I_s) and F_t = f(I_t); these semantic features are used in the learning process of the subsequent sub-networks.
Step 2, estimating the offset with the potential object co-location (POCL) sub-network
In practice, the objects to be matched in the source image and the target image are often located at different positions in their respective images, that is, the objects to be matched often show a large position difference. Existing methods mostly handle objects that are located at almost the same position, and special treatment of cases with a large position difference is rare, so many methods find it difficult to obtain an ideal matching effect in practical applications. For this problem, the invention adopts a preprocessing network, the potential object co-location (POCL) sub-network, to eliminate the position deviation of the objects to be matched in the first stage of image matching.
The sub-network is defined as N_tran. It takes the features F_s and F_t of the source image and the target image as input, estimates potential targets with a classification-based target detection technique, and trains the co-location sub-network N_tran based on the positions of the potential targets, thereby realizing the estimation of the preliminary transformation T_tran between I_s and I_t:
T_tran = N_tran(F_s, F_t),  (x', y') = T_tran(x, y)
where the number of degrees of freedom of T_tran is 4, (x, y) represents the spatial coordinates of the target image, and (x', y') represents the corresponding sampling coordinates in the source image.
To estimate the offset transformation model T_tran, it is first necessary to estimate the positions of the objects to be matched; existing classification-based target detection techniques can be used for this purpose, in particular common target detection networks such as Faster R-CNN [12]. It should be noted that the target detection technique is not used at this stage to obtain an accurate detection result, but to obtain the approximate position information of the potential objects so as to achieve co-location, so its use here is different. Specifically, based on the image features F_s and F_t, only two sets of potential object bounding boxes {B_s^i} and {B_t^j} (i = 1, …, n_s, j = 1, …, n_t) need to be predicted, where B_s^i and B_t^j respectively describe the i-th and j-th potential boxes in the images I_s and I_t, each of which can be recorded by the coordinates of its top-left and bottom-right corners, and n_s and n_t respectively denote the numbers of boxes in I_s and I_t. Again, these boxes need not be exact object boxes, but approximate or possible boxes; at this stage the aim is only to estimate their approximate locations.
To further locate the two main, semantically related objects in I_s and I_t, the potential objects are cropped from the images according to the potential bounding boxes {B_s^i} and {B_t^j} and resized to H × W. Another feature extraction module, for example a CNN, is then used to extract the feature descriptors {V_s^i} and {V_t^j} corresponding to the source and target images. After that, the descriptors {V_s^i} and {V_t^j} are stacked into two feature matrices, so that the semantic similarity matrix Z_st can be calculated by multiplying the two feature matrices.
Among the entries of Z_st, the entry with the highest similarity score determines the two corresponding boxes B_s^* and B_t^*, thereby locating the two main potential objects. Finally, the position offset transformation model T_tran is computed from the spatial coordinates of B_s^* and B_t^*, and the source image I_s is transformed by T_tran into the position-shifted image I_s^1. POCL is only used to capture the position deviation of the potential objects: when the position deviation is large, the corresponding position adjustment can be realized through the position offset transformation, but POCL alone cannot achieve accurate semantic alignment.
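As a sketch of the co-location operations just described, the potential boxes can be cropped and resized with torchvision's roi_align, encoded into one descriptor per box, and compared through a similarity matrix; using roi_align, global average pooling for the descriptor and a single best-scoring pair are implementation assumptions, since the text only requires cropping, resizing to H × W, a second feature extractor and selection by highest similarity.

# Hedged sketch of potential-object co-location: crop boxes, build descriptors, pick the best pair.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def box_descriptors(image, boxes, encoder, out_size=64):
    # boxes: (n, 4) float tensor in (x1, y1, x2, y2) pixel coordinates of this image.
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)    # prepend batch index 0
    crops = roi_align(image, rois, output_size=(out_size, out_size))
    feats = encoder(crops)                                           # (n, d, h', w')
    return F.normalize(feats.mean(dim=(2, 3)), dim=1)                # one L2-normalized descriptor per box

def best_box_pair(V_s, V_t):
    Z_st = V_s @ V_t.T                                               # semantic similarity matrix
    idx = Z_st.argmax()                                              # highest-scoring entry
    return divmod(int(idx), Z_st.shape[1])                           # indices of the two selected boxes

# Usage: i_star, j_star = best_box_pair(box_descriptors(I_s, boxes_s, encoder),
#                                       box_descriptors(I_t, boxes_t, encoder))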
Step 3, estimating the affine transformation model with the affine transformation regression (ATR) sub-network
OLASA estimates the affine transformation model T_affi of the images to be matched with the more accurate affine transformation regression (ATR) sub-network N_affi. Existing sub-networks with functions similar to ATR [11] may also be used to estimate the affine transformation. To achieve better results, the method, drawing on studies of the effectiveness of correlation tensors [1], [3], forms the features F_s^1 and F_t of the images I_s^1 and I_t into feature pairs and computes the correlation of these feature pairs, i.e. the 4D correlation tensor C_affi; each element of the tensor records the inner product of two local feature vectors taken from F_s^1 and F_t. The feature vectors and the correlation tensor are L2 normalized. The correlation tensor C_affi is input into N_affi to estimate the affine transformation model T_affi between I_s^1 and I_t:
T_affi = N_affi(C_affi)
The affine transformation model T_affi has 6 degrees of freedom, i.e. 6 affine transformation parameters need to be estimated. The image I_s^1 can then be transformed by the model T_affi; the image I_s^2 obtained after the affine transformation completes the further alignment with I_t.
Step 4, optimizing the alignment with the bidirectional thin-plate spline regression (TTPS) sub-network
The semantic alignment of the images can be further improved or enhanced with a network of control points. Specifically, the bidirectional thin-plate spline regression (TTPS) sub-network is used to estimate a regression transformation model T_ttps between the images I_s^2 and I_t, and this model is used to optimize the semantic alignment between I_s^2 and I_t. In existing thin-plate spline regression (TPS) methods such as [13], the TPS is unidirectional and the control points are fixed, which affects the result accordingly, e.g. excessive distortion occurs in some local areas and leads to larger object deformation. The TTPS designed in the present invention also performs regression prediction for a known set of corresponding control points; as a simple example, a uniform 3 × 3 grid of control points is placed on the images I_s^2 and I_t. Unlike TPS, however, TTPS is bidirectional: it adds control-point adjustment in the opposite direction, from the image I_t to the image I_s^2, and, compared with TPS, regards the control points in both images as movable, so that TTPS can effectively remove excessive deformation and reduce matching distortion. In a specific implementation, TTPS takes as its computation object the correlation tensor of the feature pairs formed from the features F_s^2 and F_t of the images I_s^2 and I_t, i.e. the 4D correlation tensor C_ttps, in order to capture the details of the object deformation more accurately. The process of estimating the regression deformation model T_ttps with the TTPS regression sub-network N_ttps can formally be expressed as:
T_ttps = N_ttps(C_ttps)
Similarly, N_ttps computes the correlation of the control point pairs from F_s^2 and F_t; the estimate of the regression deformation model T_ttps is obtained by calculating its 6 transformation weight parameters. According to the estimated regression transformation model T_ttps, I_s^2 is further transformed into I_s^3, thereby realizing the alignment of the source image with the target image I_t.
Step 5, joint training of the whole network
The successive transformations of the three sub-networks realize a coarse-to-fine matching principle, and the alignment of the image pair I_s and I_t can be completely described as the composition of the three transformation models, i.e.
T_H = T_ttps ∘ T_affi ∘ T_tran
where ∘ denotes the composition of geometric transformations. Moreover, as an overall network, the three sub-networks need to exchange information back and forth; in practice this is done by training them jointly as one network.
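The coarse-to-fine chaining T_H = T_ttps ∘ T_affi ∘ T_tran can be realized at inference time by warping the source image three times in succession (or, equivalently, by composing the coordinate mappings once and sampling once); the sketch below shows the simpler successive-warp form with each stage treated as an opaque warp function, and the function names are hypothetical.

# Hedged sketch of applying the chained transformation T_H at inference time.
import torch

def apply_olasa(I_s, warp_tran, warp_affi, warp_ttps):
    """Each warp_* is a function image -> warped image produced by the corresponding sub-network."""
    I_s1 = warp_tran(I_s)        # offset-corrected image (POCL stage)
    I_s2 = warp_affi(I_s1)       # affine-aligned image (ATR stage)
    I_s3 = warp_ttps(I_s2)       # spline-refined image (TTPS stage), aligned with I_t
    return I_s3

# Usage with identity placeholders standing in for the learned warps:
identity = lambda img: img
aligned = apply_olasa(torch.rand(1, 3, 240, 240), identity, identity, identity)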
Unlike conventional methods, which usually use image pairs as training samples when training the network, for the training of OLASA we propose a triplet sampling strategy that generates training data by introducing reference images. Each triplet contains a source image I_s, a reference image I_r and a target image I_t, where the reference image I_r is generated from I_s by a random geometric transformation T̂_affi^{s→r} or T̂_ttps^{s→r}, as shown in fig. 4(a). The image pairs used in traditional methods contain little appearance change, whereas the triplet sampling strategy can capture geometric and appearance changes in the training data at the same time; in other words, the triplet not only enriches the appearance variation of the original image pair but also provides strong transition information between a source image and a target image with a large difference. In the concrete computation, the introduction of triplets enriches the transformation models proposed above, which are represented separately by dashed and solid lines in fig. 4(b): 1) the models still to be estimated are similar to the original affine regression transformation or spline regression transformation, except that their number increases after subdivision; they are indicated by dashed arrows, e.g. T_affi^{s→t}, T_affi^{t→r}, etc.; 2) since the reference image is a generated result, the known transformations can be used as real labelled data; these related transformations are indicated by solid arrows, e.g. T̂_affi^{s→r} and T̂_ttps^{s→r}.
To achieve the training goal of the whole model, we design three loss functions to realize the optimization of OLASA, including the transitivity loss, the consistency loss and the alignment loss.
The transitivity loss. On the triplet samples we design the transitivity loss L_trans^affi according to the transitivity of geometric transformations, using a mean-squared-error criterion. Taking the affine transformation as an example, the two related successive transformations T_affi^{s→t} and T_affi^{t→r} can be combined to infer the transformation from the source image to the reference image, T_affi^{s→r}; that is, for a point (x, y) of the grid G constructed on I_s,
T_affi^{s→r}(x, y) = (T_affi^{t→r} ∘ T_affi^{s→t})(x, y),
and a point transformed by T̂_affi^{s→r} should coincide with the projected point obtained after these successive deformations. Here the affine transformation regression sub-network N_affi estimates the affine transformation models of the pairs (I_s, I_t) and (I_t, I_r) of the ternary sample, and the grid G (|G| = 20 × 20) can be transformed by T̂_affi^{s→r} and by T_affi^{t→r} ∘ T_affi^{s→t} respectively. We use the loss function L_trans^affi, i.e. the transitivity loss, to measure the difference between the two transformed grids:
L_trans^affi = Σ_{(x,y)∈G} ‖ T̂_affi^{s→r}(x, y) − (T_affi^{t→r} ∘ T_affi^{s→t})(x, y) ‖²
Similarly, a transitivity loss L_trans^ttps is also used in this process for the difference analysis of the spline regression T_ttps.
The consistency loss. The consistency of geometric transformations is an extension of the cycle consistency used in image-to-image translation and can serve as a complement to the transitivity loss. The consistency loss L_cons we design is defined as follows: given the affine transformation T_affi^{s→t} between the images I_s and I_t and the corresponding inverse transformation T_affi^{t→s}, a point (x, y) ∈ G should coincide with its original coordinates after the two-way transformation, i.e. (T_affi^{t→s} ∘ T_affi^{s→t})(x, y) ≈ (x, y). The reprojection error of the points of the grid G between the source image-target image pair can therefore be calculated as
ε(I_s, I_t) = Σ_{(x,y)∈G} ‖ (T_affi^{t→s} ∘ T_affi^{s→t})(x, y) − (x, y) ‖²
In the ternary sample, the other image pairs also contain bidirectional reprojection errors; these are all accumulated to form the consistency loss function of the method:
L_cons^affi = Σ_{(I_a, I_b)} ε(I_a, I_b), the sum running over the image pairs of the triplet.
Similarly, a consistency loss L_cons^ttps is also used in the present method for the difference analysis of the spline regression T_ttps.
The alignment loss. The alignment loss L_align is used to evaluate T_affi and T_ttps and can further improve the alignment accuracy. It is designed based on the soft-inlier count of reference [3] and a bidirectional measurement of alignment quality, and is computed from the two directional counts c_st and c_ts, where c_st is the soft-inlier count of aligning I_s to I_t and, similarly, c_ts denotes the soft-inlier count of aligning I_t to I_s.
Step 6, obtaining the matching result
Through the learning of the overall network comprising the three sub-networks, the method obtains an image alignment transformation model from the training data samples; in the testing stage, the matching result from the source image to the target image can be computed for a given pair of images to be matched.
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various alternatives and modifications are possible without departing from the spirit of the invention and the scope of the appended claims. Therefore, the invention should not be limited to the disclosed embodiments; the scope of protection of the invention is defined by the appended claims.

Claims (7)

1. An image matching method based on a deep semantic alignment network model comprises the steps of gradually estimating alignment between two semantic similar images by establishing an object position-aware semantic alignment network model OLASA; the method is characterized in that a triple sampling strategy is adopted to train a network model OLASA, and three sub-networks N of potential object co-location POCL, affine transformation regression ATR and bidirectional thin plate spline regression TTPS are adoptedtran,Naffi and NttpsRespectively estimating translation, affine transformation and spline transformation; then, establishing and optimizing an alignment relation between the images in a layering manner to obtain an image matching result; the method comprises the following steps:
step 1, extracting semantic features of an image: a pair of images Is
Figure FDA0003062597430000011
As convolutional neural networks
Figure FDA0003062597430000012
Respectively extracting two groups of characteristics Fs
Figure FDA0003062597430000013
Is,ItRespectively a source image and a target image; fs,FtRespectively representing source image semantic features and target image semantic features;
the front end of each sub-network adopts a convolutional neural network CNN to extract the characteristics of the image, and the characteristics are expressed as formula (1):
Figure FDA0003062597430000014
in formula (1), F is a feature extracted from an image; real number
Figure FDA0003062597430000015
A data space that is a feature; h. w and d respectively represent the height, width and channel number of the characteristic data space;
Figure FDA0003062597430000016
is a convolutional neural network CNN; i is an image; real number
Figure FDA0003062597430000017
A data space that is an image; H. w, D respectively represent the height, width, and number of channels of the image data space;
establishing an object location-aware semantic alignment network model OLASA comprising a sub-network Ntran,Naffi and NttpsFor estimating offset, affine and TTPS transformations, respectively, denoted Ttran,Taffi and Tttps(ii) a The source image is transformed to obtain the transformation result of each stage
Figure FDA0003062597430000018
And
Figure FDA0003062597430000019
finally by successive transformations THObtaining a source image IsWith the target image ItAlignment result of
Figure FDA00030625974300000110
wherein
Figure FDA00030625974300000111
The method for establishing the object position-aware semantic alignment network model OLASA comprises the following steps of 2-4:
step 2, constructing potential object co-locationSub-network NtranThe system is used for estimating the offset of a target object between images and eliminating the position deviation of an object to be matched;
a source image FsAnd features F of the target imagetCo-locating sub-networks as potential objects
Figure FDA00030625974300000112
Training the sub-network based on the potential target object location
Figure FDA00030625974300000113
And then realize Is and ItEstimation of preliminary transformation between
Figure FDA00030625974300000114
Is represented as follows:
Figure FDA00030625974300000115
Figure FDA00030625974300000116
in the formula ,
Figure FDA00030625974300000117
co-locating a sub-network for the potential object;
Figure FDA00030625974300000118
is Is and ItAn estimate of the preliminary transform in between, i.e. the offset transform; (x, y) represents the spatial coordinates of the target image, (x ', y') represents the corresponding sampling coordinates in the source image;
further location Is and ItTwo semantically related objects in the database, and extracting corresponding feature descriptors { V }s i} and {Vt jWherein i, j represent in the source image and the target image respectivelyThe serial number of the feature point of (1);
will describe the son { Vs i} and {Vt jStack as feature matrix
Figure FDA00030625974300000119
And
Figure FDA00030625974300000120
calculating a semantic similarity matrix by multiplying two feature matrices:
Figure FDA0003062597430000021
selection of ZstForming a similar feature pair group by a plurality of feature pairs with the highest similarity scores, calculating two corresponding regions by using feature point groups respectively represented in a source image and a target image, taking the two corresponding regions as two main potential objects, and calculating corresponding frame coordinates, namely space positions;
and then calculating a position offset transformation model by using the space coordinates of the corresponding frame of the potential object obtained by positioning
Figure FDA0003062597430000022
Source image IsBy passing
Figure FDA0003062597430000023
Transforming into position-shifted images
Figure FDA0003062597430000024
Step 3, constructing affine transformation regression sub-network NaffiEstimating affine transformation model of image to be matched using affine transformation regression subnetwork
Figure FDA0003062597430000025
Obtaining affine transformation parameter estimation; the method comprises the following steps:
affine transformation regression sub-network for image features
Figure FDA00030625974300000234
and FtConstructing feature pairs, calculating the correlation of the feature pairs
Figure FDA0003062597430000026
Estimating parameters of the affine transformation model;
degree of correlation of feature pairs
Figure FDA0003062597430000027
I.e. the 4D correlation tensor,
Figure FDA0003062597430000028
each element of the tensor
Figure FDA0003062597430000029
Recording two local feature vectors
Figure FDA00030625974300000235
Figure FDA00030625974300000210
Inner product of (2);
the feature vectors and the correlation tensor are L2-normalized; the correlation tensor Cst is input into Naffi, and the estimate of the affine transformation model Taff between Is^off and It is obtained according to formula (3):

Taff = Naffi(Cst)      (3)

the image Is^off is further transformed by the Taff of formula (3), and the affine-transformed image Is^aff is thereby further aligned with It;
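A minimal PyTorch sketch of the normalized 4D correlation tensor and of a small regression head standing in for Naffi (the layer sizes and head architecture are illustrative assumptions; only the correlate-and-normalize structure follows the description above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation_tensor(feat_a, feat_b):
    """feat_a, feat_b: (B, C, H, W) feature maps (e.g. Fs^off and Ft).
    Returns the correlation reshaped to (B, H*W, H, W) for a convolutional head."""
    b, c, h, w = feat_a.shape
    feat_a = F.normalize(feat_a, dim=1)      # L2-normalize local feature vectors
    feat_b = F.normalize(feat_b, dim=1)
    # Each element is the inner product of one local descriptor of feat_a
    # with one of feat_b: a (B, H, W, H, W) tensor, i.e. 4D per sample.
    corr = torch.einsum('bcij,bckl->bijkl', feat_a, feat_b)
    corr = corr.reshape(b, h * w, h, w)
    return F.normalize(corr, dim=1)          # L2-normalize the correlation tensor

class AffineRegressor(nn.Module):
    """Minimal stand-in for the affine regression sub-network Naffi."""
    def __init__(self, hw, feat_size):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(hw, 128, 7, padding=3), nn.ReLU(),
            nn.Conv2d(128, 64, 5, padding=2), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * feat_size * feat_size, 6)   # 6 affine parameters

    def forward(self, corr):
        return self.fc(self.conv(corr).flatten(1))           # (B, 6)
```

In use, the (B, 6) output would be reshaped to a (B, 2, 3) matrix and passed to affine_grid/grid_sample, as in the offset sketch above.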
Step 4, constructing the bidirectional thin-plate-spline regression sub-network Nttps; Nttps estimates the regression transformation model between the affine-transformed image and the target image by using control points, further refining the semantic alignment of the images;

specifically, control-point grids are arranged on the images Is^aff and It; taking the correlation tensor C'st of the feature pairs of Fs^aff and Ft as the computation object, Nttps is used to estimate the regression deformation model Ttps, expressed by formula (4):

Ttps = Nttps(C'st)      (4)

wherein C'st is computed from the control-point pairs in Fs^aff and Ft; the estimate of the regression deformation model Ttps is obtained by computing 6 transformation weight parameters; according to the regression transformation model Ttps, Is^aff is further transformed into Is^tps, realizing the alignment of the source image with the target image It;
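A minimal PyTorch sketch of the control-point side of this step, assuming a regular k x k control grid and a head that regresses 2D control-point displacements in both directions (the grid size, head layout and the omitted thin-plate-spline interpolation to a dense sampling grid are illustrative assumptions, not the claimed parameterization):

```python
import torch
import torch.nn as nn

def control_point_grid(k=5):
    """A regular k x k grid of control points in normalized [-1, 1] coordinates."""
    axis = torch.linspace(-1.0, 1.0, k)
    yy, xx = torch.meshgrid(axis, axis, indexing='ij')
    return torch.stack([xx, yy], dim=-1).reshape(-1, 2)      # (k*k, 2)

class BidirectionalTPSHead(nn.Module):
    """Regresses control-point displacements for both warp directions from the
    correlation tensor of Fs^aff and Ft (shape (B, H*W, H, W), as in step 3)."""
    def __init__(self, hw, feat_size, k=5):
        super().__init__()
        self.k = k
        self.conv = nn.Sequential(
            nn.Conv2d(hw, 128, 5, padding=2), nn.ReLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
        )
        # 2 directions x k*k control points x (dx, dy)
        self.fc = nn.Linear(64 * feat_size * feat_size, 2 * k * k * 2)

    def forward(self, corr):
        out = self.fc(self.conv(corr).flatten(1))
        fwd, bwd = out.chunk(2, dim=1)
        # A thin-plate-spline interpolant over these displacements would give the
        # dense sampling grid that is finally passed to grid_sample.
        return (fwd.reshape(-1, self.k * self.k, 2),
                bwd.reshape(-1, self.k * self.k, 2))
```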
Step 5, introducing a reference image, selecting training samples by a triplet method, and jointly training the network model OLASA;

Step 51, generating training data by a triplet sampling strategy; each triplet contains a source image Is, a reference image Ir and a target image It; the reference image Ir is generated from Is by a random geometric transformation (a random affine or a random thin-plate-spline transformation);
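A minimal PyTorch sketch of this sampling step, assuming the reference image is produced by warping the source with a small random affine perturbation (a random thin-plate-spline warp could be substituted; the jitter magnitude is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def make_triplet(img_s, img_t, jitter=0.15):
    """img_s, img_t: (1, C, H, W). Returns (Is, Ir, It) plus the known transform."""
    # Random affine theta = identity + small perturbation; this known transform
    # plays the role of the annotated source-to-reference transformation.
    identity = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
    theta_gt = identity + jitter * (2 * torch.rand(1, 2, 3) - 1)

    grid = F.affine_grid(theta_gt, img_s.shape, align_corners=False)
    img_r = F.grid_sample(img_s, grid, align_corners=False)   # reference image Ir
    return img_s, img_r, img_t, theta_gt
```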
Step 52, designing three loss functions to realize the optimization of OLASA, including the transitive loss functionNumber of
Figure FDA00030625974300000232
Consistency loss function
Figure FDA0003062597430000031
And alignment loss function
Figure FDA0003062597430000032
Transfer loss function
Figure FDA0003062597430000033
For measuring the accuracy of the predicted transform model; taking the difference between the transformation result of the transformation model and the transformation result of another transformation path for achieving the same transformation purpose as the transmission loss; another transformation approach specifically employs an indirect transformation approach through a combination of two transformation models; the indirect conversion way realizes the same conversion purpose as the direct conversion way which only adopts one conversion model, namely the conversion target of the conversion model to be detected is the same;
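A minimal PyTorch sketch of such a transitive comparison for the affine case, assuming the direct path is the annotated source-to-reference transform and the indirect path composes the predicted source-to-target and target-to-reference transforms (the mean L2 distance over grid points is an illustrative choice):

```python
import torch

def warp_points(theta, pts):
    """theta: (B, 2, 3) affine matrix; pts: (B, N, 2) points in normalized coords."""
    ones = torch.ones_like(pts[..., :1])
    return torch.cat([pts, ones], dim=-1) @ theta.transpose(1, 2)   # (B, N, 2)

def transitive_loss(theta_st, theta_tr, theta_sr_gt, grid_pts):
    """grid_pts: (B, N, 2) points of the grid G taken from the source image."""
    indirect = warp_points(theta_tr, warp_points(theta_st, grid_pts))  # s -> t -> r
    direct = warp_points(theta_sr_gt, grid_pts)                        # known s -> r
    return (indirect - direct).norm(dim=-1).mean()
```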
the consistency loss function Lcons is defined as follows: in a triplet sample consisting of the source image, the target image and the reference image, the accumulation of the bidirectional reprojection errors of the points of the grid G over the different image pairs constitutes the consistency loss function Lcons, which is used to characterize the deviations of the affine transformation Taff and of the spline regression Ttps;
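A minimal PyTorch sketch of the bidirectional reprojection idea, assuming forward and backward affine models are predicted for every pair of the triplet and the per-pair errors are summed (the cycle formulation and the choice of pairs are illustrative assumptions):

```python
import torch

def _warp(theta, pts):                      # same affine point warp as in the sketch above
    ones = torch.ones_like(pts[..., :1])
    return torch.cat([pts, ones], dim=-1) @ theta.transpose(1, 2)

def reprojection_error(theta_ab, theta_ba, grid_pts):
    """Bidirectional reprojection for one pair (Ia, Ib): map the grid G forward
    with theta_ab, back with theta_ba, and measure how far points drift."""
    cycled = _warp(theta_ba, _warp(theta_ab, grid_pts))
    return (cycled - grid_pts).norm(dim=-1).mean()

def consistency_loss(thetas, grid_pts):
    """thetas: {('s','t'): theta_st, ('t','s'): theta_ts, ...} for the triplet."""
    pairs = [('s', 't'), ('t', 'r'), ('s', 'r')]
    return sum(reprojection_error(thetas[(a, b)], thetas[(b, a)], grid_pts)
               for a, b in pairs)
```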
the alignment loss function Lalign is designed on the basis of soft inlier counting and a bidirectional measure of alignment quality, and is used to evaluate the alignment from Is to It and from It to Is, further optimizing the alignment accuracy; Lalign is calculated from the two soft inlier counts by formula (8), wherein cst is the soft inlier count obtained by aligning Is to It, and cts is the soft inlier count obtained by aligning It to Is;
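Since formula (8) is not reproduced above, the sketch below only illustrates one common form of such a loss, assuming each grid point contributes a Gaussian-weighted inlier score after warping and that corresponding grid positions in the other image are available from the estimated transforms; the kernel, bandwidth and combination are illustrative assumptions, not the claimed definition:

```python
import torch

def soft_inlier_count(warped_pts, ref_pts, sigma=0.1):
    """Soft count of grid points whose warped position lands near its reference
    position; each point contributes a score in (0, 1]."""
    dist2 = ((warped_pts - ref_pts) ** 2).sum(dim=-1)        # (B, N)
    return torch.exp(-dist2 / (2 * sigma ** 2)).sum(dim=-1)  # (B,)

def alignment_loss(warp_st, warp_ts, pts_s, pts_t):
    """warp_st maps source grid points into the target frame and vice versa;
    cst counts inliers when aligning Is to It, cts when aligning It to Is."""
    c_st = soft_inlier_count(warp_st(pts_s), pts_t)
    c_ts = soft_inlier_count(warp_ts(pts_t), pts_s)
    return -(c_st + c_ts).mean()           # higher counts give a lower loss
```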
Step 6, for a given image pair to be matched, the alignment result from the source image to the target image, i.e. the image matching result, is obtained with the object-position-aware semantic alignment network model OLASA trained in step 5;

through the above steps, image matching based on the deep semantic alignment network model is realized.
2. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 2, the position of the potential target object is obtained by estimating the potential targets with a classification-based object detection method.
3. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 2, when the two semantically related objects in Is and It are further located, the feature extraction module CNN is used to extract the corresponding feature maps {Vs_i} and {Vt_j}.
4. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 3, the affine transformation model Taff has 6 degrees of freedom, and 6 affine transformation parameters are estimated.
5. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 4, the bidirectional thin-plate-spline regression sub-network adopts a bidirectional strategy that adds control-point adjustment from the image It to the image Is^aff as well as in the opposite direction, which effectively avoids excessive distortion or warping of the image.
6. The image matching method based on the deep semantic alignment network model as claimed in claim 1, wherein in step 52, for the affine transformation Taff, a point of the grid G formed by the pixel points of the image Is is denoted by (x, y); based on the grid points, the result of the indirect path Ttr(Tst(x, y)) is compared with the result of the annotated direct transformation Tsr(x, y), and their discrepancy is taken as the error loss; the transitive loss Ltrans is expressed by formula (5):

Ltrans = Σ_{(x,y)∈G} ‖ Ttr(Tst(x, y)) − Tsr(x, y) ‖      (5)

wherein Tst is the affine transformation model between the source image and the target image; Ttr is the affine transformation model between the target image and the reference image; Tsr is the affine transformation model from the source image to the reference image obtained from the annotation data;

the transitive loss function can likewise be used for the difference analysis of the spline regression Ttps.
7. The image matching method based on the deep semantic alignment network model as claimed in claim 6, wherein in step 52, the consistency loss function Lcons is expressed by formula (7):

Lcons = ε(Is, It) + ε(It, Ir) + ε(Is, Ir)      (7)

wherein ε(Is, It), the bidirectional reprojection error of the points of the grid G between the source image-target image pair, is calculated by formula (6):

ε(Is, It) = Σ_{(x,y)∈G} ‖ Tts(Tst(x, y)) − (x, y) ‖      (6)

wherein ε(Is, It) is the reprojection error between the source image-target image pair; Tst is the affine transformation between the images Is and It; Tts is the corresponding inverse transformation of Tst;

the consistency loss Lcons can likewise be used for the difference analysis of the spline regression Ttps.
CN202110516741.5A 2021-05-12 2021-05-12 Image matching method based on depth semantic alignment network model Active CN113313147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110516741.5A CN113313147B (en) 2021-05-12 2021-05-12 Image matching method based on depth semantic alignment network model

Publications (2)

Publication Number Publication Date
CN113313147A true CN113313147A (en) 2021-08-27
CN113313147B CN113313147B (en) 2023-10-20

Family

ID=77373055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110516741.5A Active CN113313147B (en) 2021-05-12 2021-05-12 Image matching method based on depth semantic alignment network model

Country Status (1)

Country Link
CN (1) CN113313147B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862707A (en) * 2017-11-06 2018-03-30 深圳市唯特视科技有限公司 An image registration method based on Lucas-Kanade image alignment
US20190371080A1 (en) * 2018-06-05 2019-12-05 Cristian SMINCHISESCU Image processing method, system and device
CN110580715A (en) * 2019-08-06 2019-12-17 武汉大学 Image alignment method based on illumination constraint and grid deformation
CN110909778A (en) * 2019-11-12 2020-03-24 北京航空航天大学 Image semantic feature matching method based on geometric consistency
CN112102303A (en) * 2020-09-22 2020-12-18 中国科学技术大学 Semantic image analogy method for generating countermeasure network based on single image
CN112634341A (en) * 2020-12-24 2021-04-09 湖北工业大学 Method for constructing depth estimation model of multi-vision task cooperation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘岩; 吕肖庆; 秦叶阳; 汤帜; 徐剑波: "Scale and Color Invariant Image Feature Description", Journal of Chinese Computer Systems, no. 10, pages 187-192 *
廖明哲; 吴谨; 朱磊: "Remote Sensing Image Matching Based on ResNet and RF-Net", Chinese Journal of Liquid Crystals and Displays, no. 09, pages 91-99 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111621508A (en) * 2020-06-11 2020-09-04 云南中烟工业有限责任公司 Tobacco terpene synthase NtTPS7 gene and vector and application thereof
CN111621508B (en) * 2020-06-11 2022-07-01 云南中烟工业有限责任公司 Tobacco terpene synthase NtTPS7 gene and vector and application thereof
CN115861393A (en) * 2023-02-16 2023-03-28 中国科学技术大学 Image matching method, spacecraft landing point positioning method and related device
CN115861393B (en) * 2023-02-16 2023-06-16 中国科学技术大学 Image matching method, spacecraft landing point positioning method and related device
CN116977652A (en) * 2023-09-22 2023-10-31 之江实验室 Workpiece surface morphology generation method and device based on multi-mode image generation
CN116977652B (en) * 2023-09-22 2023-12-22 之江实验室 Workpiece surface morphology generation method and device based on multi-mode image generation

Also Published As

Publication number Publication date
CN113313147B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
US11763433B2 (en) Depth image generation method and device
Labbé et al. Cosypose: Consistent multi-view multi-object 6d pose estimation
CN113313147B (en) Image matching method based on depth semantic alignment network model
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
CN113160285B (en) Point cloud matching method based on local depth image criticality
CN103700099A (en) Rotation and dimension unchanged wide baseline stereo matching method
CN107766864B (en) Method and device for extracting features and method and device for object recognition
Yi et al. Motion keypoint trajectory and covariance descriptor for human action recognition
CN104517289A (en) Indoor scene positioning method based on hybrid camera
CN113361542A (en) Local feature extraction method based on deep learning
CN110969648A (en) 3D target tracking method and system based on point cloud sequence data
CN110544202A (en) parallax image splicing method and system based on template matching and feature clustering
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN116342648A (en) Twin network target tracking method based on mixed structure attention guidance
Shen et al. Semi-dense feature matching with transformers and its applications in multiple-view geometry
Lee et al. Learning to distill convolutional features into compact local descriptors
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN113988269A (en) Loop detection and optimization method based on improved twin network
Huang et al. Life: Lighting invariant flow estimation
Zhang et al. An automatic three-dimensional scene reconstruction system using crowdsourced Geo-tagged videos
CN110849380A (en) Map alignment method and system based on collaborative VSLAM
Xu et al. Local feature matching using deep learning: A survey
Xu et al. Improved HardNet and Stricter Outlier Filtering to Guide Reliable Matching.
CN115375746A (en) Stereo matching method based on double-space pooling pyramid
CN114155406A (en) Pose estimation method based on region-level feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant