CN116137060B - Same-scene multi-style image matching method, device and application - Google Patents

Same-scene multi-style image matching method, device and application

Info

Publication number
CN116137060B
Authority
CN
China
Prior art keywords
image
style
coding
network
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310424272.3A
Other languages
Chinese (zh)
Other versions
CN116137060A (en)
Inventor
李开民
章东平
王杼涛
曹喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202310424272.3A priority Critical patent/CN116137060B/en
Publication of CN116137060A publication Critical patent/CN116137060A/en
Application granted granted Critical
Publication of CN116137060B publication Critical patent/CN116137060B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06T3/04
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The application provides a method, a device and an application for matching multi-style images of the same scene, comprising the following steps: constructing an image matching model, wherein the image matching model comprises a style encoder, a first style migration network, a second style migration network and a matching degree calculation network; after training with a plurality of sample images taken under different weather conditions, inputting two images into the style encoder to obtain coding information; if the coding information of the two images differs, generating a new image by using the first style migration network and the second style migration network, and judging the matching degree of the images in the matching degree calculation network; and if the matching degree exceeds a threshold value, the two images are considered to be matched. According to the scheme, image matching is carried out by judging the style coding information of the images and converting the style of the images, so that the accuracy of image matching is higher.

Description

Same-scene multi-style image matching method, device and application
Technical Field
The application relates to the field of deep learning, in particular to a method, a device and application for matching multi-style images in the same scene.
Background
With the development of the times, a large amount of garbage is generated in production and daily life. The rapid development of science and technology brought about by the industrial revolution has also caused garbage to grow exponentially, and the resulting garbage-disposal problem is a troublesome one that people have to face and solve. For the construction and development of environment-friendly cities in China, the timely disposal of garbage generated in daily life is very important to cities and to the lives of their residents.
Image matching refers to finding the same or similar parts in two or more images. It is an important problem in the field of computer vision and involves image feature extraction, similarity measurement, matching algorithms and other aspects; the applications of image matching are very wide, such as target tracking, image retrieval, image stitching, three-dimensional reconstruction and the like. The basic idea of image matching is to find the similarity between two images by comparing their features and then match them. In this process, feature points are a very important concept, because they are the key to distinguishing different images. In general, image matching can be divided into two steps: feature extraction and feature matching. In feature extraction, key points and descriptors are typically extracted using algorithms such as SIFT, SURF or ORB. In feature matching, commonly used algorithms include brute-force matching, FLANN matching, RANSAC-based matching and the like. The performance of image matching is often affected by many factors, such as image resolution, image noise and illumination variation, so in practical applications an appropriate algorithm needs to be selected according to the specific scene and optimized.
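By way of example and not limitation, the classical feature-extraction and feature-matching pipeline described above can be sketched with OpenCV's ORB detector and brute-force matcher as follows; this sketch illustrates the prior-art baseline only and is not the method of the present application:

```python
# Illustrative sketch of the classical pipeline: ORB keypoints + brute-force matching.
import cv2

def classical_match(path_a: str, path_b: str, max_matches: int = 50):
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=1000)                 # feature extraction
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)

    # Brute-force matching with Hamming distance (ORB descriptors are binary).
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    return matches[:max_matches]
```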
However, image matching has not been widely and deeply applied to garbage disposal in real life. In urban garbage treatment, inspection staff and garbage cleaners are often required to upload images of the same place for comparison, so as to judge whether the garbage at that place has been cleaned up; this first requires judging whether the images uploaded by the inspection staff and the garbage cleaners show the same scene, and because of differences in lighting and weather, it is not easy to judge whether two pictures show the same scene.
In view of the foregoing, there is a need for a method that can accurately determine whether two images are in the same scene under different light and weather conditions.
Disclosure of Invention
The embodiments of the application provide a method, a device and an application for matching multi-style images of the same scene; the styles of the images are converted through a style migration network, and images of different styles can be matched better by matching the style-converted images, so that the accuracy of image matching is improved.
In a first aspect, an embodiment of the present application provides a method for matching multi-style images of the same scene, where the method includes:
The method comprises the steps of constructing an image matching model, wherein the image matching model comprises a style coding judging network, a first style migration network, a second style migration network and a matching network, the first style migration network and the second style migration network have the same structure, the style coding judging network comprises two style encoders corresponding to a first input branch and a second input branch, the first style migration network and the second style migration network comprise a first coding branch, a second coding branch and a generator, the first coding branch consists of a linear projection layer, a semantic perception position coding layer and a style encoder, and the second coding branch consists of a parallel time sequence feature encoder, a depth estimation network and a coding input end; the input of the coding input end is style coding information corresponding to different image styles output by the trained style encoder, the trained time sequence feature encoder converts input features into corresponding time vectors, and the trained depth estimation network converts the input features into corresponding depth maps;
acquiring a first image and a second image which need to be matched, wherein the first image is input into a first input branch of the style coding judging network and is coded by a style coder to obtain a first coding result, the second image is input into a second input branch of the style coding judging network and is coded by the style coder to obtain a second coding result, and if the first coding result is the same as the second coding result, the matching result of the first image and the second image is judged by the matching network;
If the first coding result and the second coding result are different, inputting a first image into a first coding branch of a first style migration network to obtain a first semantic feature, inputting a second image and a second coding result into a second coding branch of the first style migration network to obtain a first feature set, inputting the second coding branch result and the first semantic feature into a generator to generate a first synthetic image, and calculating a first similarity of the first synthetic image and the second image;
inputting a second image into a first coding branch of a second style migration network to obtain a second semantic feature, inputting a first image and a first coding result into a second coding branch of the second style migration network to obtain a second feature set, inputting the second semantic feature and the second feature set into a generator to generate a second synthetic image, and calculating a second similarity of the second synthetic image and the first image;
and if the sum of the first similarity and the second similarity is larger than a set threshold value, judging that the first image and the second image are matched.
In a second aspect, an embodiment of the present application provides a same-scene multi-style image matching device, including:
The construction module comprises: the method comprises the steps of constructing an image matching model, wherein the image matching model comprises a style coding judging network, a first style migration network, a second style migration network and a matching network, the first style migration network and the second style migration network have the same structure, the style coding judging network comprises two style encoders corresponding to a first input branch and a second input branch, the first style migration network and the second style migration network comprise a first coding branch, a second coding branch and a generator, the first coding branch consists of a linear projection layer, a semantic perception position coding layer and a style encoder, and the second coding branch consists of a parallel time sequence feature encoder, a depth estimation network and a coding input end; the input of the coding input end is style coding information corresponding to different image styles output by the trained style encoder, the trained time sequence feature encoder converts input features into corresponding time vectors, and the trained depth estimation network converts the input features into corresponding depth maps;
the acquisition module is used for: acquiring a first image and a second image which need to be matched, wherein the first image is input into a first input branch of the style coding judging network and is coded by a style coder to obtain a first coding result, the second image is input into a second input branch of the style coding judging network and is coded by the style coder to obtain a second coding result, and if the first coding result is the same as the second coding result, the matching result of the first image and the second image is judged by the matching network;
A first calculation module: if the first coding result and the second coding result are different, inputting a first image into a first coding branch of a first style migration network to obtain a first semantic feature, inputting a second image and a second coding result into a second coding branch of the first style migration network to obtain a first feature set, inputting the second coding branch result and the first semantic feature into a generator to generate a first synthetic image, and calculating a first similarity of the first synthetic image and the second image;
a second calculation module: inputting a second image into a first coding branch of a second style migration network to obtain a second semantic feature, inputting a first image and a first coding result into a second coding branch of the second style migration network to obtain a second feature set, inputting the second semantic feature and the second feature set into a generator to generate a second synthetic image, and calculating a second similarity of the second synthetic image and the first image;
and a matching module: and if the sum of the first similarity and the second similarity is larger than a set threshold value, judging that the first image and the second image are matched.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to run the computer program to perform the same-scene multi-style image matching method.
In a fourth aspect, embodiments of the present application provide a readable storage medium, where a computer program is stored, the computer program including program code for controlling a process to execute a process, the process including the same-scene multi-style image matching method.
The main contributions and innovation points of the invention are as follows:
according to the embodiment of the application, the style coding information of the images is judged by constructing the style coder, the images of the same style coding information are directly compared, and the images of different style coding information are compared by adopting a conversion style comparison method, so that the accuracy of image matching can be improved; according to the scheme, semantic perception position coding is added in the input of the style encoder, so that the style encoder outputs semantic features of corresponding images, and then the semantic features are combined with time vectors, style coding information and image depths of other style images to finish conversion of image styles, two images to be compared are compared after style conversion, and therefore errors of image matching can be further reduced.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of a same-scene multi-style image matching method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a style encoder according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a time sequence feature encoder according to an embodiment of the present application;
FIG. 4 is a training schematic of a depth estimation network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a generator according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the structure of an instance normalization layer according to embodiments of the present application;
FIG. 7 is a schematic diagram of a matching network according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a training pattern for a style encoder and generator according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a discriminator according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a same-scene multi-style image matching method according to an embodiment of the application;
FIG. 11 is a block diagram of a same-scene multi-style image matching device according to an embodiment of the present application;
fig. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
Example 1
The embodiment of the application provides a method for matching multi-style images of the same scene; specifically, referring to fig. 1, the method comprises the following steps:
the method comprises the steps of constructing an image matching model, wherein the image matching model comprises a style coding judging network, a first style migration network, a second style migration network and a matching network, the first style migration network and the second style migration network have the same structure, the style coding judging network comprises two style encoders corresponding to a first input branch and a second input branch, the first style migration network and the second style migration network comprise a first coding branch, a second coding branch and a generator, the first coding branch consists of a linear projection layer, a semantic perception position coding layer and a style encoder, and the second coding branch consists of a parallel time sequence feature encoder, a depth estimation network and a coding input end; the input of the coding input end is style coding information corresponding to different image styles output by the trained style encoder, the trained time sequence feature encoder converts input features into corresponding time vectors, and the trained depth estimation network converts the input features into corresponding depth maps;
Acquiring a first image and a second image which need to be matched, wherein the first image is input into a first input branch of the style coding judging network and is coded by a style coder to obtain a first coding result, the second image is input into a second input branch of the style coding judging network and is coded by the style coder to obtain a second coding result, and if the first coding result is the same as the second coding result, the matching result of the first image and the second image is judged by the matching network;
if the first coding result and the second coding result are different, inputting a first image into a first coding branch of a first style migration network to obtain a first semantic feature, inputting a second image and a second coding result into a second coding branch of the first style migration network to obtain a first feature set, inputting the second coding branch result and the first semantic feature into a generator to generate a first synthetic image, and calculating a first similarity of the first synthetic image and the second image;
inputting a second image into a first coding branch of a second style migration network to obtain a second semantic feature, inputting a first image and a first coding result into a second coding branch of the second style migration network to obtain a second feature set, inputting the second semantic feature and the second feature set into a generator to generate a second synthetic image, and calculating a second similarity of the second synthetic image and the first image;
And if the sum of the first similarity and the second similarity is larger than a set threshold value, judging that the first image and the second image are matched.
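By way of example and not limitation, the overall control flow of the above steps can be sketched as follows; the names style_encoder, style_transfer_1, style_transfer_2 and matching_net are placeholders for the trained sub-networks and are not identifiers used in this application:

```python
def match_images(img_x, img_y, threshold,
                 style_encoder, style_transfer_1, style_transfer_2, matching_net):
    """Hypothetical top-level flow; the four callables stand for the trained sub-networks."""
    code_x = style_encoder(img_x)   # first coding result (style category of image X)
    code_y = style_encoder(img_y)   # second coding result (style category of image Y)

    if code_x == code_y:
        # Same style coding information: judge the match directly in the matching network.
        return matching_net(img_x, img_y) > threshold

    # Different styles: synthesise each image in the other image's style,
    # then compare each synthetic image with the image of that style.
    synth_x_as_y = style_transfer_1(content=img_x, style_code=code_y, reference=img_y)
    synth_y_as_x = style_transfer_2(content=img_y, style_code=code_x, reference=img_x)
    sim_1 = matching_net(synth_x_as_y, img_y)   # first similarity
    sim_2 = matching_net(synth_y_as_x, img_x)   # second similarity
    return (sim_1 + sim_2) > threshold
```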
In some embodiments, the style encoder is formed by sequentially connecting an attention module and a multi-layer perceptron in series; the input image is divided into a plurality of image blocks, each image block is projected by using linear projection to obtain image sequence features, and the style encoder obtains the encoding result from the image sequence features.
For example, there are a first image X and a second image Y; X and Y are respectively divided into Z image blocks, and two image sequences of size L×C are obtained by linear projection, where the image sequence features are XZ and YZ (L is the number of features of the image sequence and C is the dimension of the image sequence); XZ and YZ are respectively input into the style encoder to obtain a first encoding result X1 of the first image and a second encoding result X2 of the second image, where the first encoding result and the second encoding result represent the image style of the corresponding image.
Specifically, as shown in fig. 2, the attention module in the style encoder is a multi-head attention module, the input image sequence features are standardized and then input into the multi-head attention module to obtain a multi-head attention result, the multi-head attention result is spliced with the image sequence features to obtain a first splicing result, the first splicing result is standardized and then input into the multi-layer perceptron to obtain a multi-layer perception result, and the multi-layer perception result is spliced with the first splicing result to obtain the style coding information.
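By way of example and not limitation, the attention-plus-perceptron block described above may be sketched in PyTorch as follows; the embedding dimension, the number of heads and the MLP ratio are illustrative assumptions:

```python
import torch.nn as nn

class StyleEncoderBlock(nn.Module):
    """Sketch of the multi-head attention + multi-layer perceptron block; sizes are assumptions."""
    def __init__(self, dim: int = 256, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                       # x: (batch, L, C) image sequence features
        h = self.norm1(x)                       # standardize before attention
        attn_out, _ = self.attn(h, h, h)        # multi-head attention result
        x = x + attn_out                        # first splicing (residual connection)
        x = x + self.mlp(self.norm2(x))         # second splicing (residual connection)
        return x
```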
In some embodiments, in the step of converting the input features into corresponding time vectors by the trained time series feature encoder, the time series feature encoder is a fully connected layer, and the time series feature encoder extracts the time information in the input image and converts the time information into the time vectors.
Specifically, as shown in fig. 3, in the time sequence feature encoder, the time stamp in the upper left corner (or another position containing time information) of the input image is recognized by a pre-trained character recognition network, and a time vector is obtained through one-hot coding and the following time-vector conversion formula:
$$t2v(\tau)[i] = \begin{cases} \omega_i \tau + \varphi_i, & i = 0 \\ \sin(\omega_i \tau + \varphi_i), & 1 \le i \le k \end{cases}$$
where $k$ represents the dimension of the time vector, $\tau$ is the time sequence feature of the input image, $\sin(\cdot)$ is a periodic activation function, $t2v(\tau)[i]$ is the $i$-th element of the time vector, and $\omega_i$, $\varphi_i$ are learnable parameters whose purpose is to capture periodic and aperiodic information in the image.
Specifically, the character recognition network is formed by sequentially connecting a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer and a third convolution layer in series, wherein the input of the character recognition network is an image containing time information, and the output of the character recognition network is a time sequence of the image.
Specifically, the input of the time sequence feature encoder is in a standard time scale (week, day, hour and minute), the output is a time vector, and in the scheme, the input of the time sequence feature encoder is the time information of the image.
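By way of example and not limitation, a Time2Vec-style layer matching the formula above may be sketched as follows; the output dimension and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Sketch of the time-vector conversion: one linear (aperiodic) term plus k periodic terms."""
    def __init__(self, out_dim: int = 16):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(1))
        self.b0 = nn.Parameter(torch.randn(1))
        self.w = nn.Parameter(torch.randn(out_dim - 1))
        self.b = nn.Parameter(torch.randn(out_dim - 1))

    def forward(self, tau):                           # tau: (batch, 1) scalar timestamps
        linear = self.w0 * tau + self.b0              # aperiodic component (i = 0)
        periodic = torch.sin(self.w * tau + self.b)   # periodic components (i >= 1)
        return torch.cat([linear, periodic], dim=-1)  # time vector
```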
In some embodiments, in the step of converting the input features into the corresponding depth map by the trained depth estimation network, the depth estimation network downsamples the input image multiple times to obtain first image data, second image data, third image data and fourth image data; upsamples the first image data, the second image data, the third image data and the fourth image data respectively to obtain first upsampled image data, second upsampled image data, third upsampled image data and fourth upsampled image data; subtracts the image data of the corresponding size from the upsampled image data to obtain first contour information, second contour information, third contour information and fourth contour information; inputs the fourth image data into the decoder to obtain a fifth depth image and a fourth intermediate feature; upsamples the fifth depth image to obtain a fifth depth upsampled image; adds the fourth intermediate feature and the fifth depth upsampled image to obtain a fourth depth-contour feature; deconvolves the fourth depth-contour feature and adds it to the fourth contour information to obtain a fourth depth image; and adds the fourth depth image and the fourth depth-contour feature to obtain a fourth output result, repeating this process level by level until the final depth map is output.
Specifically, in the depth estimation network, the encoder adopts a pre-trained ResNext101 and the decoder adopts an ASPP layer (SPP + atrous convolution); the encoder is used for image classification, and the decoder is used for acquiring context information and recovering depth residuals.
Specifically, the depth estimation network is in a Laplacian pyramid structure.
The training schematic diagram of the depth estimation network is shown in fig. 4. The input image of the depth estimation network is downsampled by the encoder to obtain 4 levels of image data, recorded as first image data S1, second image data S2, third image data S3 and fourth image data S4; the image data of each level is upsampled again and subtracted from the corresponding-size image data to obtain 4 pieces of contour information, recorded as first contour information L1, second contour information L2, third contour information L3 and fourth contour information L4. The fourth image data S4 passes through the ASPP layer of the decoder and is then input into a convolution branch and an upsampling skip-connection branch to obtain a fifth depth image R5 and a fourth intermediate feature X4; the fourth intermediate feature X4 and the upsampled fifth depth image are added to obtain a fourth depth-contour feature H4; the fourth depth-contour feature is deconvolved and then added with the fourth contour information L4 to obtain a fourth depth image R4; and finally the fourth depth image R4 and the fourth depth-contour feature are added to obtain a fourth output result D4. The same operations are repeated level by level to obtain the final output depth map D1.
Specifically, the loss of the depth estimation network is calculated over all valid pixels, where $y_i$ represents the depth value predicted for the $i$-th pixel, $y_i^*$ represents the true depth value of the $i$-th pixel, and $n$ represents the total number of valid pixels.
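By way of example and not limitation, a single pyramid-level fusion step as described above may be sketched as follows; the channel counts, the interpolation mode and the exact layer shapes are assumptions made for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidLevelFusion(nn.Module):
    """Simplified sketch of one Laplacian-pyramid fusion level; shapes are assumptions,
    not the exact layer configuration of this application."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # Deconvolution applied to the depth-contour feature (e.g. H4 -> depth residual).
        self.deconv = nn.ConvTranspose2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, intermediate_feat, coarser_depth, contour):
        # intermediate_feat: (B, channels, H, W) intermediate feature (e.g. X4)
        # coarser_depth:     (B, 1, H/2, W/2) depth image from the coarser level (e.g. R5)
        # contour:           (B, 1, H, W) Laplacian contour information (e.g. L4)
        up_depth = F.interpolate(coarser_depth, scale_factor=2, mode="bilinear",
                                 align_corners=False)
        fused = intermediate_feat + up_depth      # depth-contour feature (e.g. H4)
        depth = self.deconv(fused) + contour      # depth image of this level (e.g. R4)
        return depth
```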
In some embodiments, the timing feature encoder and the depth estimation network are pre-trained.
In some embodiments, in the step of inputting a first image into a first coding branch of a first style migration network to obtain a first semantic feature, a semantic perception position coding layer in the first coding branch acquires a semantic perception position code of the first image, a linear projection layer in the first coding branch acquires an image sequence feature of the first image, and the semantic perception position code of the first image and the image sequence feature are input into a style encoder in the first coding branch to obtain the first semantic feature.
In some embodiments, in the step of inputting the second image and the second encoding result into the second encoding branch of the first style migration network to obtain the first feature set, the first feature set includes the second encoding result, a time vector of the second image, and a depth map of the second image.
In some embodiments, in the step of "input the second coding branch result and the first semantic feature into a generator to generate a first composite map", the generator employs an encoder-decoder structure, the encoder is composed of a first generator convolutional layer in series with a plurality of instance normalization layers, and the decoder is composed of a plurality of adaptive instance normalization layers in series with a second generator convolutional layer.
Specifically, the structure of the generator is shown in fig. 5; the first generator convolution layer and the second generator convolution layer are used to convolve the input content, and the instance normalization layer is used to normalize each training sample.
Specifically, the structural schematic diagram of the instance normalization layer is shown in fig. 6. The instance normalization layer is composed of a first branch and a second branch, wherein the first branch is composed of a first instance layer, a first instance normalization layer, a first instance convolution layer, a first average pooling layer, a second instance normalization layer and a second instance convolution layer which are sequentially connected in series, and the second branch is composed of a third instance convolution layer and a second average pooling layer which are sequentially connected in series; the input content passes through the first branch and the second branch respectively to obtain a first branch result and a second branch result, and the first branch result and the second branch result are spliced to obtain the output result of the instance normalization layer.
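By way of example and not limitation, the adaptive instance normalization operation commonly used in such decoders may be sketched as follows; whether the adaptive instance normalization layer of this application follows exactly this form is an assumption:

```python
import torch

def adaptive_instance_norm(content_feat: torch.Tensor,
                           style_mean: torch.Tensor,
                           style_std: torch.Tensor,
                           eps: float = 1e-5) -> torch.Tensor:
    """Sketch of AdaIN: re-normalise content features to given style statistics.
    content_feat: (B, C, H, W); style_mean / style_std: (B, C, 1, 1)."""
    mean = content_feat.mean(dim=(2, 3), keepdim=True)
    std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    normalized = (content_feat - mean) / std
    return normalized * style_std + style_mean
```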
In some embodiments, the method for generating the second composite image is the same as the method for generating the first composite image, and the disclosure of this embodiment is not repeated here.
Specifically, the first composite image is a composite image having first image semantic information and having style-coded information of a second image, and the second composite image is a composite image having second image semantic information and having style-coded information of the first image.
In some embodiments, in the step of calculating the first similarity between the first composite image and the second image, the matching network matches the key features of the first composite image with those of the second image and outputs the matching degree of the key features as the first similarity through a fully connected layer, as shown in fig. 7.
Similarly, the second similarity is calculated through a matching network, and if the sum of the first similarity and the second similarity is larger than a set threshold value, the first image and the second image are judged to be matched.
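By way of example and not limitation, the matching-degree head that compares two key-feature vectors through a fully connected layer may be sketched as follows; the feature dimension and the hidden size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Sketch of the matching-degree head: compares two key-feature vectors and outputs
    a scalar similarity through fully connected layers (dimensions are assumptions)."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim * 2, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        return self.fc(torch.cat([feat_a, feat_b], dim=-1)).squeeze(-1)

# Decision rule from the method: the images match if the two similarities sum above a threshold,
# i.e. is_match = (sim_1 + sim_2) > threshold.
```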
In some embodiments, as shown in fig. 8, the present scheme trains the style encoder and generator as follows:
Images of different styles of the same scene are obtained as training samples. The training samples are grouped into different styles according to day/night and weather conditions, giving 16 categories: X1 (daytime, sunny), X2 (daytime, cloudy), X3 (daytime, overcast), X4 (daytime, rainy), X5 (daytime, snowy), X6 (daytime, foggy), X7 (daytime, dust emission), X8 (daytime, dust), X9 (night, sunny), X10 (night, cloudy), X11 (night, overcast), X12 (night, rainy), X13 (night, snowy), X14 (night, foggy), X15 (night, dust emission) and X16 (night, dust); size normalization preprocessing is then performed on the images.
Specifically, public data and road environment image data from the KITTI dataset can be selected; the road environment image data are randomly cropped to W×H pixels and then horizontally flipped, and in addition the brightness, color and gamma value of each road environment image are randomly adjusted within the range of [0.9, 1.1].
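By way of example and not limitation, this preprocessing may be sketched as follows; apart from the [0.9, 1.1] range, the crop size and the augmentation parameters are illustrative assumptions:

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

W, H = 512, 256  # crop size placeholders; the application only specifies W x H

def augment(img):
    """Sketch of the preprocessing above; parameter choices beyond [0.9, 1.1] are assumptions."""
    img = T.RandomCrop((H, W), pad_if_needed=True)(img)   # random crop to W x H
    img = T.RandomHorizontalFlip(p=0.5)(img)              # horizontal flip
    img = TF.adjust_brightness(img, random.uniform(0.9, 1.1))
    img = TF.adjust_saturation(img, random.uniform(0.9, 1.1))   # "color"
    img = TF.adjust_gamma(img, random.uniform(0.9, 1.1))
    return img
```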
Training stage 1: a specified image with a specified style is obtained and input into the trained style encoder, the trained time sequence feature encoder and the trained depth estimation network to obtain the specified image style, the specified image time sequence feature and the specified image depth map; the semantic perception position code of any image of a target style other than the specified style is obtained and, combined with the corresponding image, sent into the style encoder to obtain the target semantic features; the target semantic features, the specified image style, the specified image time sequence feature and the specified image depth map are sent into the generator to obtain a generated image; a discriminator is constructed, and the adversarial loss, the reconstruction loss and the intra-class distance loss between the generated image and the specified image are computed by using the discriminator.
Specifically, the structure of the discriminator is shown in fig. 9; the discriminator is composed of a first discriminator convolution layer, a plurality of serially connected discriminator instance normalization layers, a first discriminator normalization layer, a second discriminator convolution layer, a second discriminator normalization layer, a discriminator dimension conversion layer and a discriminator output layer which are serially connected in sequence.
Specifically, an image $x_1$ of a sunny day in the daytime is obtained; $x_1$ is input into the trained style encoder, the trained time sequence feature encoder and the trained depth estimation network to obtain the specified image style $S_1$, the specified image time sequence feature $T_1$ and the specified image depth map $D_1$. An image $x_i$ of any style other than daytime sunny is obtained, its semantic perception position code is obtained and, combined with $x_i$, sent into the style encoder to obtain the target semantic feature $C_i$. The target semantic feature $C_i$, the specified image style $S_1$, the specified image time sequence feature $T_1$ and the specified image depth map $D_1$ are fed into the generator to obtain a generated image $x_1'$; $x_1'$ and $x_1$ are input into the discriminator to obtain the adversarial loss $L_{D1}$, the reconstruction loss $L_{sty1}$ and the intra-class distance loss $L_{intra1}$ between $x_1'$ and $x_1$, and so on for the other styles to obtain the corresponding losses.
Specifically, the adversarial loss $L_{D1}$, the reconstruction loss $L_{sty1}$ and the intra-class distance loss $L_{intra1}$ are defined over the generated image $x_i'$ and the specified image, where the target image set is taken from style $i$, the discriminator $D$ predicts the probability that $x_i$ comes from the true distribution, an image is generated according to the semantic feature $C_a$ and the style $S_b$, the style encoder extracts the style coding information and the semantic features of an image, the expectation is taken over the corresponding distribution, $\|\cdot\|_1$ denotes the L1 norm, and $\arg\min$ denotes the variable value at which the objective function is minimized; the losses $L_{Di}$, $L_{styi}$ and $L_{intrai}$ for the other styles are obtained in the same way.
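By way of example and not limitation, conventional forms of the adversarial loss and the style reconstruction loss may be sketched as follows; these are standard GAN-style formulations and are not necessarily the exact formulas of this application:

```python
import torch
import torch.nn.functional as F

def adversarial_d_loss(d_real, d_fake):
    """Standard GAN discriminator loss (a conventional form; the application's exact
    adversarial loss may differ)."""
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
            F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

def style_reconstruction_loss(style_encoder, generated_img, target_style_code):
    """L1 distance between the style code of the generated image and the target style code."""
    return F.l1_loss(style_encoder(generated_img), target_style_code)
```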
Training stage 2: the specified image and the generated image are input into the matching network to obtain the specified image key point features and the generated image key point features respectively, and the specified image key point features are subtracted from the generated image key point features to obtain the matching loss function.
Specifically, the specified image $x_1$ and the generated image $x_1'$ are obtained; $x_1$ and $x_1'$ are input into the matching network to obtain the specified image key point feature $K_1$ and the generated image key point feature $K_1'$ respectively, and $K_1$ is subtracted from $K_1'$ to obtain the matching loss function $L_{match1}$. The matching loss is computed over all key points, where $K_1^n$ represents the $n$-th key point of the specified image $x_1$ and $K_1'^n$ represents the $n$-th key point of the generated image; the matching losses $L_{matchi}$ for the other styles are obtained in the same way.
Training stage 3: the semantic perception position code of the generated image is obtained; the generated image and the corresponding semantic perception position code are input into the style encoder to obtain the generated image semantic features; the specified image style, the specified image time sequence feature, the specified image depth map and the generated image semantic features are input into the generator to obtain a second generated image; and the norm distance between the specified image and the second generated image is calculated to obtain the cyclic loss function.
Specifically, the semantic perception position code of the generated image $x_1'$ is obtained; the generated image $x_1'$ and the corresponding semantic perception position code are input into the style encoder to obtain the generated image semantic feature $C_i'$; the specified image style $S_i$, the specified image time sequence feature $T_i$, the specified image depth map $D_i$ and the generated image semantic feature $C_i'$ are input into the generator to obtain a second generated image $x_1''$; and the L1 norm distance between $x_1$ and $x_1''$ is calculated to obtain the cyclic loss $L_{cyc1}$, i.e. $L_{cyc1} = \|x_1 - x_1''\|_1$. The cyclic losses $L_{cyci}$ for the other styles are obtained in the same way.
Training stage 4: the specified semantic features of any specified image and any two different pieces of first style coding information and second style coding information are obtained; the first style coding information, the time sequence feature corresponding to the first style coding information, the depth map corresponding to the first style coding information and the specified semantic features are input into the generator to obtain a first generated image; the second style coding information, the time sequence feature corresponding to the second style coding information, the depth map corresponding to the second style coding information and the specified semantic features are input into the generator to obtain a second generated image; and the opposite number of the norm distance between the first generated image and the second generated image is calculated to obtain the generation loss.
Specifically, the specified semantic feature $C_i$ of any specified image and any two different pieces of style coding information $S_1$ and $S_2$ are obtained; $S_1$, the corresponding time sequence feature $T_1$, the corresponding depth map $D_1$ and the specified semantic feature $C_i$ are input into the generator to obtain a first generated image, and $S_2$, the corresponding time sequence feature $T_2$, the corresponding depth map $D_2$ and the specified semantic feature $C_i$ are input into the generator to obtain a second generated image; the opposite number of the L1 norm distance between the first generated image and the second generated image is calculated to obtain the generation loss $L_{ds13}$, i.e. $L_{ds13} = -\|G(S_1, T_1, D_1, C_i) - G(S_2, T_2, D_2, C_i)\|_1$, where $C_i$ denotes the semantic information of the specified image. The generation losses for the other style pairs are obtained in the same way.
Training stage 5: the total loss of the image matching model is the sum of the adversarial loss, the reconstruction loss, the intra-class distance loss, the matching loss, the cyclic loss and the generation loss; the parameters of the discriminator are kept unchanged and the parameters of the generator and the style encoder are adjusted to minimize the total loss, then the parameters of the generator and the style encoder are kept unchanged and the parameters of the discriminator are adjusted to maximize the total loss, until the total loss is stable, so that the trained image matching model is obtained.
Specifically, the total loss is formulated as
$$L_{total} = \lambda_1 \sum_i L_{Di} + \lambda_2 \sum_i L_{styi} + \lambda_3 \sum_i L_{intrai} + \lambda_4 \sum_i L_{matchi} + \lambda_5 \sum_i L_{cyci} + \lambda_6 \sum_i L_{dsi}$$
where $\lambda_1, \ldots, \lambda_6$ are constant coefficients corresponding to each loss term, and the training samples are traversed to iteratively train the model.
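By way of example and not limitation, the alternating optimization of training stage 5 may be sketched as follows; the optimizer choice, the learning rate and the total_loss callable are placeholders rather than the implementation of this application:

```python
import torch

def train_stage5(dataloader, generator, style_encoder, discriminator, total_loss, lr=1e-4):
    """Hypothetical alternating optimisation; total_loss and the sub-networks are placeholders."""
    opt_g = torch.optim.Adam(
        list(generator.parameters()) + list(style_encoder.parameters()), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)

    for batch in dataloader:
        # Discriminator parameters are not stepped here: only generator/encoder are updated
        # to minimise the total loss.
        loss = total_loss(batch, generator, style_encoder, discriminator)
        opt_g.zero_grad()
        loss.backward()
        opt_g.step()

        # Then only the discriminator is updated, to maximise the total loss
        # (i.e. minimise its negative).
        loss = -total_loss(batch, generator, style_encoder, discriminator)
        opt_d.zero_grad()
        loss.backward()
        opt_d.step()
```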
Specifically, a flowchart of the scheme when using the trained image matching model to perform image matching is shown in fig. 10.
Example two
Based on the same conception, referring to fig. 11, the present application further provides a same-scene multi-style image matching device, including:
the construction module comprises: the method comprises the steps of constructing an image matching model, wherein the image matching model comprises a style coding judging network, a first style migration network, a second style migration network and a matching network, the first style migration network and the second style migration network have the same structure, the style coding judging network comprises two style encoders corresponding to a first input branch and a second input branch, the first style migration network and the second style migration network comprise a first coding branch, a second coding branch and a generator, the first coding branch consists of a linear projection layer, a semantic perception position coding layer and a style encoder, and the second coding branch consists of a parallel time sequence feature encoder, a depth estimation network and a coding input end; the input of the coding input end is style coding information corresponding to different image styles output by the trained style encoder, the trained time sequence feature encoder converts input features into corresponding time vectors, and the trained depth estimation network converts the input features into corresponding depth maps;
The acquisition module is used for: acquiring a first image and a second image which need to be matched, wherein the first image is input into a first input branch of the style coding judging network and is coded by a style coder to obtain a first coding result, the second image is input into a second input branch of the style coding judging network and is coded by the style coder to obtain a second coding result, and if the first coding result is the same as the second coding result, the matching result of the first image and the second image is judged by the matching network;
a first calculation module: if the first coding result and the second coding result are different, inputting a first image into a first coding branch of a first style migration network to obtain a first semantic feature, inputting a second image and a second coding result into a second coding branch of the first style migration network to obtain a first feature set, inputting the second coding branch result and the first semantic feature into a generator to generate a first synthetic image, and calculating a first similarity of the first synthetic image and the second image;
a second calculation module: inputting a second image into a first coding branch of a second style migration network to obtain a second semantic feature, inputting a first image and a first coding result into a second coding branch of the second style migration network to obtain a second feature set, inputting the second semantic feature and the second feature set into a generator to generate a second synthetic image, and calculating a second similarity of the second synthetic image and the first image;
And a matching module: and if the sum of the first similarity and the second similarity is larger than a set threshold value, judging that the first image and the second image are matched.
Example III
This embodiment also provides an electronic device, referring to fig. 12, comprising a memory 404 and a processor 402, the memory 404 having stored therein a computer program, the processor 402 being arranged to run the computer program to perform the steps of any of the method embodiments described above.
In particular, the processor 402 may include a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
The memory 404 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 404 may comprise a Hard Disk Drive (HDD), floppy disk drive, Solid State Drive (SSD), flash memory, optical disk, magneto-optical disk, tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 404 may include removable or non-removable (or fixed) media, where appropriate. Memory 404 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 404 is a Non-Volatile memory. In particular embodiments, memory 404 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), an electrically rewritable ROM (EAROM) or FLASH memory (FLASH), or a combination of two or more of these. The RAM may be Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), where appropriate, and the DRAM may be Fast Page Mode Dynamic Random Access Memory (FPMDRAM), Extended Data Output Dynamic Random Access Memory (EDODRAM), Synchronous Dynamic Random Access Memory (SDRAM), or the like.
Memory 404 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions for execution by processor 402.
The processor 402 reads and executes the computer program instructions stored in the memory 404 to implement any one of the same-scene multi-style image matching methods of the above embodiments.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402 and the input/output device 408 is connected to the processor 402.
The transmission device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through the base station to communicate with the internet. In one example, the transmission device 406 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
The input-output device 408 is used to input or output information. In this embodiment, the input information may be an image to be matched, an image encoding style, or the like, and the output information may be a matching result of the image, or the like.
Alternatively, in the present embodiment, the above-mentioned processor 402 may be configured to execute the following steps by a computer program:
s101, constructing an image matching model, wherein the image matching model comprises a style coding judging network, a first style migration network, a second style migration network and a matching network, the first style migration network and the second style migration network have the same structure, the style coding judging network comprises two style encoders corresponding to a first input branch and a second input branch, the first style migration network and the second style migration network comprise a first coding branch, a second coding branch and a generator, the first coding branch consists of a linear projection layer, a semantic perception position coding layer and a style encoder, and the second coding branch consists of a parallel time sequence feature encoder, a depth estimation network and a coding input end; the input of the coding input end is style coding information corresponding to different image styles output by the trained style encoder, the trained time sequence feature encoder converts input features into corresponding time vectors, and the trained depth estimation network converts the input features into corresponding depth maps;
S102, acquiring a first image and a second image which need to be matched, wherein the first image is input into a first input branch of the style coding judging network and is coded by a style coder to obtain a first coding result, the second image is input into a second input branch of the style coding judging network and is coded by the style coder to obtain a second coding result, and if the first coding result and the second coding result are the same, the matching result of the first image and the second image is judged by the matching network;
s103, if the first coding result and the second coding result are different, inputting a first image into a first coding branch of a first style migration network to obtain a first semantic feature, inputting a second image and the second coding result into a second coding branch of the first style migration network to obtain a first feature set, inputting the second coding branch result and the first semantic feature into a generator to generate a first synthetic image, and calculating a first similarity of the first synthetic image and the second image;
s104, inputting a second image into a first coding branch of a second style migration network to obtain a second semantic feature, inputting a first image and a first coding result into a second coding branch of the second style migration network to obtain a second feature set, inputting the second semantic feature and the second feature set into a generator to generate a second synthetic image, and calculating a second similarity of the second synthetic image and the first image;
And S105, if the sum of the first similarity and the second similarity is larger than a set threshold value, judging that the first image and the second image are matched.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products) including software routines, applets, and/or macros can be stored in any apparatus-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may include one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. In this regard, it should also be noted that any block of the logic flow as in fig. 12 may represent a program step, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on a physical medium such as a memory chip or memory block implemented within a processor, a magnetic medium such as a hard disk or floppy disk, and an optical medium such as, for example, a DVD and its data variants, a CD, etc. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that the technical features of the above embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.
The foregoing examples merely represent several embodiments of the present application; their description is relatively specific and detailed, but should not therefore be construed as limiting the scope of the present application. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the present application, and such modifications and improvements fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A same-scene multi-grid image matching method, characterized by comprising the following steps:
the method comprises the steps of constructing an image matching model, wherein the image matching model comprises a style coding judging network, a first style migration network, a second style migration network and a matching network, the first style migration network and the second style migration network have the same structure, the style coding judging network comprises two style encoders corresponding to a first input branch and a second input branch, the first style migration network and the second style migration network comprise a first coding branch, a second coding branch and a generator, the first coding branch consists of a linear projection layer, a semantic perception position coding layer and a style encoder, and the second coding branch consists of a parallel time sequence feature encoder, a depth estimation network and a coding input end; the input of the coding input end is style coding information corresponding to different image styles output by the trained style encoder, the trained time sequence feature encoder converts input features into corresponding time vectors, and the trained depth estimation network converts the input features into corresponding depth maps;
Acquiring a first image and a second image to be matched, wherein the first image is input into a first input branch of the style coding judging network and is coded by a style encoder to obtain a first coding result, the second image is input into a second input branch of the style coding judging network and is coded by the corresponding style encoder to obtain a second coding result, and if the first coding result is the same as the second coding result, the matching result of the first image and the second image is determined by the matching network;
if the first coding result and the second coding result are different, inputting the first image into a first coding branch of a first style migration network to obtain a first semantic feature, inputting the second image and the second coding result into a second coding branch of the first style migration network to obtain a first feature set, inputting the first feature set and the first semantic feature into a generator to generate a first synthetic image, and calculating a first similarity between the first synthetic image and the second image;
inputting the second image into a first coding branch of a second style migration network to obtain a second semantic feature, inputting the first image and the first coding result into a second coding branch of the second style migration network to obtain a second feature set, inputting the second semantic feature and the second feature set into a generator to generate a second synthetic image, and calculating a second similarity between the second synthetic image and the first image;
And if the sum of the first similarity and the second similarity is larger than a set threshold value, judging that the first image and the second image are matched.
2. The same-scene multi-grid image matching method according to claim 1, wherein the style encoder is composed of an attention module and a multi-layer perceptron connected in series, the input of the style encoder is an image sequence feature, an input image is divided into a plurality of image blocks, each image block is projected by a linear projection to obtain the image sequence feature, and the output of the style encoder is style coding information corresponding to the input image.
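Purely as a non-limiting illustration of claim 2, the following sketch shows one plausible style encoder in which patches are linearly projected and passed through an attention module and a multi-layer perceptron; all dimensions and layer choices are assumptions, not values taken from the present application.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Minimal sketch of a style encoder as described in claim 2: the input
    image is split into patches, each patch is linearly projected into an image
    sequence feature, and an attention module followed by a multi-layer
    perceptron produces the style coding information. Dimensions are
    illustrative assumptions."""

    def __init__(self, patch=16, in_ch=3, dim=256, style_dim=64):
        super().__init__()
        # Patch-wise linear projection implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)   # attention module
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, style_dim))                      # multi-layer perceptron

    def forward(self, x):                                          # x: (B, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)    # (B, N, dim) image sequence features
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.mlp(attended.mean(dim=1))                      # (B, style_dim) style coding information

# Example usage with a dummy image:
# code = StyleEncoder()(torch.randn(1, 3, 224, 224))
```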
3. The method according to claim 1, wherein in the step of inputting a first image into a first coding branch of a first style migration network to obtain a first semantic feature, a semantic perception position coding layer in the first coding branch obtains a semantic perception position code of the first image, a linear projection layer in the first coding branch obtains an image sequence feature of the first image, and the semantic perception position code of the first image and the image sequence feature are input into a style encoder in the first coding branch to obtain the first semantic feature.
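As a non-limiting illustration of the first coding branch of claim 3, the sketch below combines a linear projection layer, a stand-in semantic perception position coding layer and a small encoder; how semantic information actually enters the position code is not specified here, so that layer is purely an assumption.

```python
import torch
import torch.nn as nn

class FirstCodingBranch(nn.Module):
    """Rough sketch of claim 3's first coding branch: a linear projection layer
    turns the image into sequence features, a hypothetical semantic-aware
    position coding layer produces per-token position codes, and their sum is
    fed to an encoder to obtain the semantic feature."""

    def __init__(self, patch=16, in_ch=3, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)      # linear projection layer
        self.sem_pos = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)   # stand-in semantic perception position coding layer
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)       # image sequence features
        pos = self.sem_pos(x).flatten(2).transpose(1, 2)       # semantic-aware position codes
        return self.encoder(tokens + pos).mean(dim=1)          # first semantic feature

# Example usage:
# feat = FirstCodingBranch()(torch.randn(1, 3, 224, 224))
```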
4. The method according to claim 1, wherein in the step of calculating the first similarity between the first synthetic image and the second image, the matching network performs key point feature matching between the first synthetic image and the second image, and outputs the matching result as the first similarity through a fully connected layer.
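A non-limiting sketch of the similarity computation of claim 4 follows; the cosine-similarity aggregation of key point descriptors and the layer sizes are assumptions, and only the overall pattern (key point matching followed by a fully connected layer producing a similarity) reflects the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointMatchHead(nn.Module):
    """Illustrative sketch of claim 4: key point features of the synthetic
    image and the target image are matched, and a fully connected head maps
    the aggregated matching scores to a similarity in [0, 1]."""

    def __init__(self, num_keypoints=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(num_keypoints, 64), nn.ReLU(),
                                nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, feats_a, feats_b):                 # (B, K, D) key point descriptors
        a = F.normalize(feats_a, dim=-1)
        b = F.normalize(feats_b, dim=-1)
        scores = torch.bmm(a, b.transpose(1, 2))         # (B, K, K) pairwise cosine similarities
        best = scores.max(dim=-1).values                 # best match score per key point
        return self.fc(best).squeeze(-1)                 # similarity per image pair

# Example usage:
# sim = KeypointMatchHead()(torch.randn(2, 128, 64), torch.randn(2, 128, 64))
```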
5. The same-scene multi-grid image matching method according to claim 1, wherein the training of the image matching model is divided into a training stage 1, a training stage 2, a training stage 3 and a training stage 4; an adversarial loss, a reconstruction loss and an intra-class distance loss of the image matching model are obtained through training stage 1, a matching loss of the image matching model is obtained through training stage 2, a cycle loss of the image matching model is obtained through training stage 3, and a generation loss of the image matching model is obtained through training stage 4; the total loss of the image matching model is the sum of the adversarial loss, the reconstruction loss, the intra-class distance loss, the matching loss, the cycle loss and the generation loss; during training, the parameters of a discriminator are kept unchanged while the parameters of the generator and the style encoder are adjusted to minimize the total loss, and then the parameters of the generator and the style encoder are kept unchanged while the parameters of the discriminator are adjusted to maximize the total loss, until the training stabilizes and the trained image matching model is obtained.
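The alternating optimisation of claim 5 can be illustrated, without limitation, by the following sketch; total_loss_fn is a hypothetical callable assumed to return the sum of the adversarial, reconstruction, intra-class distance, matching, cycle and generation losses as a scalar tensor.

```python
def training_step(total_loss_fn, batch, gen_opt, disc_opt,
                  generator, style_encoder, discriminator):
    """Sketch of the alternating optimisation in claim 5: first the
    discriminator is frozen and the generator/style-encoder parameters are
    updated to minimise the total loss; then the generator/style-encoder are
    frozen and the discriminator is updated to maximise it."""

    # Step 1: freeze discriminator, minimise total loss w.r.t. generator + style encoder.
    for p in discriminator.parameters():
        p.requires_grad_(False)
    gen_opt.zero_grad()
    loss = total_loss_fn(batch)
    loss.backward()
    gen_opt.step()

    # Step 2: freeze generator + style encoder, maximise total loss w.r.t. discriminator.
    for p in discriminator.parameters():
        p.requires_grad_(True)
    for m in (generator, style_encoder):
        for p in m.parameters():
            p.requires_grad_(False)
    disc_opt.zero_grad()
    neg_loss = -total_loss_fn(batch)        # maximising the loss = minimising its negative
    neg_loss.backward()
    disc_opt.step()

    # Restore gradient tracking for the next step.
    for m in (generator, style_encoder):
        for p in m.parameters():
            p.requires_grad_(True)
    return loss.detach()
```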
6. The method according to claim 1, wherein in the step of inputting the second image and the second encoding result into the second encoding branch of the first style migration network to obtain the first feature set, the first feature set includes the second encoding result, the time vector of the second image and the depth map of the second image.
7. The method according to claim 1, wherein in the step of converting the input feature into the corresponding depth map by the trained depth estimation network, the depth estimation network downsamples the input image multiple times to obtain first image data, second image data, third image data and fourth image data, upsamples the first image data, the second image data, the third image data and the fourth image data respectively to obtain first upsampled image data, second upsampled image data, third upsampled image data and fourth upsampled image data, and subtracts the image data of the corresponding size from the upsampled image data to obtain first contour information, second contour information, third contour information and fourth contour information; the fourth image data is input into a decoder to obtain a fifth depth image and a fourth intermediate feature, the fourth intermediate feature is feature-stitched with the fourth upsampled image data and the fourth contour information, the stitched features are added, and the result is output as a fourth output result.
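A non-limiting sketch of the contour-extraction part of claim 7 is given below, assuming a four-level pyramid built by average pooling; the decoder that fuses the intermediate features and contour information into the depth maps is omitted, and the pooling and upsampling operators are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def multiscale_contours(image):
    """Sketch of the contour extraction in claim 7: the input is downsampled
    several times, each downsampled level is upsampled back, and the image
    data of the corresponding size is subtracted from the upsampled data to
    yield contour (high-frequency) information at each scale."""
    levels = [image]
    for _ in range(4):                                   # first..fourth image data
        levels.append(F.avg_pool2d(levels[-1], kernel_size=2))

    contours = []
    for k in range(1, 5):
        up = F.interpolate(levels[k], scale_factor=2, mode="bilinear",
                           align_corners=False)          # first..fourth upsampled image data
        contours.append(up - levels[k - 1])              # first..fourth contour information
    return levels[1:], contours

# Example usage:
# feats, contours = multiscale_contours(torch.randn(1, 3, 256, 256))
```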
8. A same-scene multi-grid image matching device, characterized by comprising:
a construction module, configured to construct an image matching model, wherein the image matching model comprises a style coding judging network, a first style migration network, a second style migration network and a matching network, the first style migration network and the second style migration network have the same structure, the style coding judging network comprises two style encoders corresponding to a first input branch and a second input branch, the first style migration network and the second style migration network comprise a first coding branch, a second coding branch and a generator, the first coding branch consists of a linear projection layer, a semantic perception position coding layer and a style encoder, and the second coding branch consists of a parallel time sequence feature encoder, a depth estimation network and a coding input end; the input of the coding input end is style coding information corresponding to different image styles output by the trained style encoder, the trained time sequence feature encoder converts input features into corresponding time vectors, and the trained depth estimation network converts the input features into corresponding depth maps;
an acquisition module, configured to acquire a first image and a second image to be matched, wherein the first image is input into a first input branch of the style coding judging network and is coded by a style encoder to obtain a first coding result, the second image is input into a second input branch of the style coding judging network and is coded by the corresponding style encoder to obtain a second coding result, and if the first coding result is the same as the second coding result, the matching result of the first image and the second image is determined by the matching network;
a first calculation module: if the first coding result and the second coding result are different, inputting the first image into a first coding branch of a first style migration network to obtain a first semantic feature, inputting the second image and the second coding result into a second coding branch of the first style migration network to obtain a first feature set, inputting the first feature set and the first semantic feature into a generator to generate a first synthetic image, and calculating a first similarity between the first synthetic image and the second image;
a second calculation module: inputting the second image into a first coding branch of a second style migration network to obtain a second semantic feature, inputting the first image and the first coding result into a second coding branch of the second style migration network to obtain a second feature set, inputting the second semantic feature and the second feature set into a generator to generate a second synthetic image, and calculating a second similarity between the second synthetic image and the first image;
and a matching module: and if the sum of the first similarity and the second similarity is larger than a set threshold value, judging that the first image and the second image are matched.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is arranged to run the computer program to perform the same-scene multi-grid image matching method according to any one of claims 1 to 7.
10. A readable storage medium, characterized in that the readable storage medium has stored therein a computer program comprising program code for controlling a process to execute the same-scene multi-grid image matching method according to any one of claims 1 to 7.
CN202310424272.3A 2023-04-20 2023-04-20 Same-scene multi-grid image matching method, device and application Active CN116137060B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310424272.3A CN116137060B (en) 2023-04-20 2023-04-20 Same-scene multi-grid image matching method, device and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310424272.3A CN116137060B (en) 2023-04-20 2023-04-20 Same-scene multi-grid image matching method, device and application

Publications (2)

Publication Number Publication Date
CN116137060A CN116137060A (en) 2023-05-19
CN116137060B true CN116137060B (en) 2023-07-18

Family

ID=86333677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310424272.3A Active CN116137060B (en) 2023-04-20 2023-04-20 Same-scene multi-grid image matching method, device and application

Country Status (1)

Country Link
CN (1) CN116137060B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496160B (en) * 2023-12-29 2024-03-19 中国民用航空飞行学院 Indoor scene-oriented semantic segmentation method for low-illumination image shot by unmanned aerial vehicle

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266840A (en) * 2021-12-21 2022-04-01 北京达佳互联信息技术有限公司 Image processing method, image processing device, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108961350B (en) * 2018-07-17 2023-09-19 北京工业大学 Wind painting migration method based on saliency matching
CN110310221B (en) * 2019-06-14 2022-09-20 大连理工大学 Multi-domain image style migration method based on generation countermeasure network
US11348243B2 (en) * 2020-01-24 2022-05-31 GE Precision Healthcare LLC Systems and methods for medical image style transfer using deep neural networks
CN111784565B (en) * 2020-07-01 2021-10-29 北京字节跳动网络技术有限公司 Image processing method, migration model training method, device, medium and equipment
CN111784566B (en) * 2020-07-01 2022-02-08 北京字节跳动网络技术有限公司 Image processing method, migration model training method, device, medium and equipment
CN112669202B (en) * 2020-12-25 2023-08-08 北京达佳互联信息技术有限公司 Image processing method, apparatus, electronic device, and computer-readable storage medium
US11720994B2 (en) * 2021-05-14 2023-08-08 Lemon Inc. High-resolution portrait stylization frameworks using a hierarchical variational encoder
CN113902613A (en) * 2021-11-19 2022-01-07 江苏科技大学 Image style migration system and method based on three-branch clustering semantic segmentation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266840A (en) * 2021-12-21 2022-04-01 北京达佳互联信息技术有限公司 Image processing method, image processing device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116137060A (en) 2023-05-19

Similar Documents

Publication Publication Date Title
CN116137060B (en) Same-scene multi-grid image matching method, device and application
CN110008948B (en) Hyperspectral image target detection method based on variational self-coding network
CN111652049A (en) Face image processing model training method and device, electronic equipment and storage medium
CN114187450A (en) Remote sensing image semantic segmentation method based on deep learning
CN112036249B (en) Method, system, medium and terminal for end-to-end pedestrian detection and attribute identification
CN112462261B (en) Motor abnormality detection method and device, electronic equipment and storage medium
CN111986193B (en) Remote sensing image change detection method, electronic equipment and storage medium
CN113111716B (en) Remote sensing image semiautomatic labeling method and device based on deep learning
Delibasoglu et al. Improved U-Nets with inception blocks for building detection
CN111724443B (en) Unified scene visual positioning method based on generative confrontation network
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN113988147A (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN116402679A (en) Lightweight infrared super-resolution self-adaptive reconstruction method
CN114973136A (en) Scene image recognition method under extreme conditions
CN115272599A (en) Three-dimensional semantic map construction method oriented to city information model
US20220391693A1 (en) Training transformer neural networks to generate parameters of convolutional neural networks
CN114332616A (en) Building change detection method based on orthoimage and oblique photography data
CN116977747B (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN109697693B (en) Method for realizing operation based on big data space
CN117079005A (en) Optical cable fault monitoring method, system, device and readable storage medium
CN116206196A (en) Ocean low-light environment multi-target detection method and detection system thereof
CN115331112A (en) Infrared and visible light image fusion method and system based on multi-granularity word elements
JP2023126130A (en) Computer-implemented method, data processing apparatus and computer program for object detection
CN115985086A (en) Traffic data completion method, system, terminal and storage medium
CN113763539A (en) Implicit function three-dimensional reconstruction method based on image and three-dimensional input

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant