CN109101992B - Image matching method, device and computer readable storage medium - Google Patents

Image matching method, device and computer readable storage medium

Info

Publication number
CN109101992B
Authority
CN
China
Prior art keywords
image
feature data
training
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810728913.3A
Other languages
Chinese (zh)
Other versions
CN109101992A (en)
Inventor
杨磊
张行程
林达华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201810728913.3A
Publication of CN109101992A
Application granted
Publication of CN109101992B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Abstract

The application discloses an image matching method, an image matching device and a computer-readable storage medium. The method comprises the following steps: acquiring feature data of a first image; obtaining M groups of target feature data based on the feature data of the first image, wherein M is an integer greater than or equal to 2; and determining, from at least two second images, a target image that matches the first image based on the M groups of target feature data. A corresponding device is also provided. The method and device can improve the accuracy of image matching.

Description

Image matching method, device and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image matching method and apparatus, and a computer-readable storage medium.
Background
Cross-modality matching refers to matching data in one modality with data in another modality, for example, matching a sketch against photographs taken by a camera, or matching ordinary photographs against pictures taken by an infrared camera. In the first example, the sketch corresponds to a weak modality and the photograph corresponds to a strong modality. How to achieve cross-modal matching is a research hotspot in this field.
Disclosure of Invention
The application provides an image matching method, an image matching device and a computer readable storage medium, which can improve the accuracy of image matching.
In a first aspect, an embodiment of the present application provides an image matching method, including:
acquiring feature data of a first image;
obtaining M groups of target feature data based on the feature data of the first image, wherein M is an integer greater than or equal to 2;
determining a target image from at least two second images that matches the first image based on the M sets of target feature data.
In the embodiment of the application, the feature data of the first image is acquired, M groups of target feature data are obtained based on the feature data of the first image, and a target image matched with the first image is then determined from at least two second images based on the M groups of target feature data. By determining the target image based on the M groups of target feature data, the accuracy and precision of determining the target image can be improved.
In one possible implementation, a target image that matches the first image may be determined from the at least two second images based on the M sets of target feature data and the feature data of each of the at least two second images.
In one possible implementation, the dimension of the feature data of the second image is different from the dimension of the feature data of the first image, and the dimension of the target feature data is equal to the dimension of the feature data of the second image.
In one possible implementation, the dimension of the feature data of the second image is larger than the dimension of the feature data of the first image.
In one possible implementation, the dimensions of the target feature data are larger than the dimensions of the feature data of the first image.
In one possible implementation, the method further includes: feature data of each of the at least two second images is acquired.
In some examples, the feature extraction process may be performed on each of the at least two second images to obtain feature data of each second image.
In other examples, feature data for each of the at least two second images may be retrieved from the memory.
In one possible implementation, the determining, from at least two second images, a target image matching the first image based on the M sets of target feature data includes:
determining a similarity of the first image to each of the at least two second images based on the M sets of target feature data;
determining a target image from the at least two second images that matches the first image based on the similarity of the first image to each of the at least two second images.
In the embodiment of the application, after the M groups of target feature data are obtained, the similarity between the first image and each of the at least two second images is determined from the M groups of target feature data, which effectively improves the accuracy of the similarity determination and the efficiency of determining the target image.
In one possible implementation, the determining, based on the M sets of target feature data, a similarity between the first image and each of the at least two second images includes:
acquiring characteristic data of each second image in the at least two second images;
obtaining M similarities corresponding to each second image based on the M groups of target feature data and the feature data of each second image in the at least two second images;
and determining the similarity of the first image and each second image according to the M similarities corresponding to each second image in the at least two second images.
In the embodiment of the application, a method for determining similarity is provided, that is, M similarities corresponding to each second image are obtained based on M groups of target feature data, and then the similarity between the first image and each second image is determined according to the M similarities corresponding to each second image.
In a possible implementation manner, the determining, according to M similarities corresponding to each of the at least two second images, a similarity between the first image and each of the at least two second images includes:
and determining an average processing result of the M similarities corresponding to the second image as the similarity between the first image and the second image.
In this embodiment of the application, the average processing result of the M similarities corresponding to the second image may include: a weighted average of the M similarities corresponding to the second image, or an arithmetic mean of the M similarities corresponding to the second image.
In one possible implementation, the first image contains a smaller amount of information than the second image.
The image matching method provided by the embodiment of the application can be applied to matching between different modalities, for example between two modalities carrying comparable amounts of information, and also to the case where one modality carries significantly less information than the other. In the latter case, the embodiment of the application can effectively bridge the large gap in cross-modal matching between the first image and the second image caused by the first image carrying too little information, and effectively improves the accuracy of determining the target image.
In a possible implementation manner, the acquiring feature data of the first image includes:
feature data of the first image is retrieved from a memory.
In an embodiment of the application, the memory comprises a memory of the image matching device. That is, the feature data of the first image is obtained in advance and stored in the memory. This saves the time needed to acquire the feature data of the first image and improves acquisition efficiency.
In a possible implementation manner, the obtaining M groups of target feature data based on the feature data of the first image includes:
and obtaining the M groups of target feature data based on the feature data of the first image and the M random vectors.
In the embodiment of the application, different target feature data can be obtained by inputting different random vectors. That is, conditioned on the first image, a series of possible target feature data can be obtained by using the random vector as a variable, so that as many groups of target feature data as needed can be obtained, which improves the accuracy of the similarity calculation.
In a possible implementation manner, the obtaining M groups of target feature data based on the feature data of the first image includes:
and inputting the feature data of the first image into a feature generation network for processing to obtain the M groups of target feature data.
In this embodiment of the application, different random vectors may also be input to the feature generation network, conditioned on the feature data of the first image, so as to obtain the M groups of target feature data. For example, if the feature data of a weak-modality image and different random vectors are respectively input into the trained feature generation network, feature data of different pseudo strong-modality images can be obtained.
In a possible implementation manner, before the feature data of the first image is input to a feature generation network for processing, so as to obtain the M groups of target feature data, the method further includes:
inputting feature data of a first training sample into the feature generation network to obtain target training feature data, wherein the dimension of the target training feature data is larger than that of the feature data of the first training sample;
inputting the target training characteristic data into a discrimination network for processing to obtain a first discrimination result;
determining a first loss based on the first discrimination result;
training the feature generation network based on the first loss.
In the embodiment of the application, the image matching device outputs target training feature data through the feature generation network and inputs the target training feature data into the discrimination network, which discriminates it, for example by estimating the probability or score that the target training feature data belongs to the strong modality; the feature generation network is then trained based on the first loss determined from the first discrimination result output by the discrimination network. Training the feature generation network in this adversarial manner can effectively improve training efficiency and gradually improves the accuracy of the target training feature data.
In a possible implementation manner, the inputting the feature data of the first training sample to the feature generation network to obtain target training feature data includes:
inputting the feature data of the first training sample and a training random vector to the feature generation network to obtain the target training feature data;
determining a first loss based on the first discrimination result, including:
inputting the target training characteristic data into a classification network for processing to obtain a first classification result;
inputting the target training characteristic data into a random vector regression network for processing to obtain training regression characteristics, wherein the dimensionality of the training regression characteristics is equal to the dimensionality of the input training random vector;
determining the first loss based on the first discrimination result, the first classification result, and the training regression feature.
In the embodiment of the application, the target training feature data is input into a classification network and supervised by the classification network, which can further improve the accuracy of the target training feature data; the training regression feature is obtained by inputting the target training feature data into the random vector regression network, which effectively supervises the target training feature data and ensures that the random vector is actually used, avoiding the situation where the input random vector is not effectively utilized.
In a possible implementation manner, the acquiring feature data of the first image includes:
performing feature extraction on the first image by using a first feature extraction network to obtain feature data of the first image;
the method further comprises the following steps:
training the first feature extraction network according to the first loss.
In this embodiment, the first feature extraction network may be further configured to extract feature data of the first training sample.
In one possible implementation, the method further includes:
inputting the feature data of a second training sample into the discrimination network for processing to obtain a second discrimination result, wherein the information content contained in the second training sample is greater than the information content contained in the first training sample;
determining a second loss based on the second discrimination result and the training regression feature;
training the discrimination network, the classification network, and the random vector regression network based on the second loss.
In a possible implementation manner, the determining a second loss based on the second discrimination result and the training regression feature includes:
inputting the characteristic data of the second training sample into the classification network for processing to obtain a second classification result;
determining the second loss based on the second discrimination result, the second classification result, and the training regression feature.
In one possible implementation, the obtaining the feature data of each of the at least two second images includes:
performing feature extraction on each second image in the at least two second images by using a second feature extraction network to obtain feature data of each second image in the at least two second images;
the method further comprises the following steps:
training the second feature extraction network based on the second loss.
In a second aspect, an embodiment of the present application provides an image matching apparatus, including:
an acquisition unit configured to acquire feature data of a first image;
the first data processing unit is used for obtaining M groups of target feature data based on the feature data of the first image; wherein M is an integer greater than or equal to 2;
a first determining unit, configured to determine, based on the M groups of target feature data, a target image that matches the first image from among at least two second images.
In a possible implementation manner, the first determining unit may specifically determine, based on the M groups of target feature data and feature data of each of the at least two second images, a target image that matches the first image from the at least two second images.
In one possible implementation, the dimension of the feature data of the second image is different from the dimension of the feature data of the first image, and the dimension of the target feature data is equal to the dimension of the feature data of the second image.
In one possible implementation, the dimension of the feature data of the second image is larger than the dimension of the feature data of the first image.
In one possible implementation, the dimensions of the target feature data are larger than the dimensions of the feature data of the first image.
In a possible implementation manner, the obtaining unit is further configured to obtain feature data of each of the at least two second images.
In some examples, the obtaining unit may perform feature extraction processing on each of the at least two second images to obtain feature data of each of the at least two second images.
In other examples, the obtaining unit may obtain the feature data of each of the at least two second images from the memory.
In one possible implementation manner, the first determining unit includes:
a first determining subunit, configured to determine, based on the M sets of target feature data, a similarity between the first image and each of the at least two second images;
a second determining subunit, configured to determine, based on a similarity between the first image and each of the at least two second images, a target image that matches the first image from the at least two second images.
In a possible implementation manner, the first determining subunit is specifically configured to acquire feature data of each of the at least two second images; obtain M similarities corresponding to each second image based on the M groups of target feature data and the feature data of each second image in the at least two second images; and determine the similarity of the first image and each second image according to the M similarities corresponding to each second image in the at least two second images.
In a possible implementation manner, the first determining subunit is specifically configured to determine, as the similarity between the first image and the second image, an average processing result of M similarities corresponding to the second image.
In one possible implementation, the first image contains a smaller amount of information than the second image.
In a possible implementation manner, the obtaining unit is specifically configured to obtain the feature data of the first image from a memory.
In a possible implementation manner, the first data processing unit is specifically configured to obtain the M groups of target feature data based on the feature data of the first image and M random vectors.
In a possible implementation manner, the first data processing unit is specifically configured to input the feature data of the first image into a feature generation network for processing, so as to obtain the M groups of target feature data.
In one possible implementation, the apparatus further includes:
the second data processing unit is used for inputting the feature data of the first training sample into the feature generation network to obtain target training feature data;
the first discrimination unit is used for inputting the target training feature data into a discrimination network for processing to obtain a first discrimination result;
a second determination unit configured to determine a first loss based on the first discrimination result;
a first training unit to train the feature generation network based on the first loss.
In a possible implementation manner, the second data processing unit is specifically configured to input the feature data of the first training sample and a training random vector to the feature generation network, so as to obtain the target training feature data;
the second determination unit includes:
the classification processing subunit is used for inputting the target training feature data into a classification network for processing to obtain a first classification result;
the regression processing subunit is used for inputting the target training feature data into a random vector regression network for processing to obtain training regression features, wherein the dimensionality of the training regression features is equal to the dimensionality of the input training random vector;
a third determining subunit, configured to determine the first loss based on the first discrimination result, the first classification result, and the training regression feature.
In a possible implementation manner, the obtaining unit is specifically configured to perform feature extraction on the first image by using a first feature extraction network to obtain feature data of the first image;
the first training unit is further configured to train the first feature extraction network according to the first loss.
In one possible implementation, the apparatus further includes:
a second discrimination unit, configured to input feature data of a second training sample to the discrimination network for processing, so as to obtain a second discrimination result, where the amount of information contained in the second training sample is greater than the amount of information contained in the first training sample;
a third determining unit, configured to determine a second loss based on the second discrimination result and the training regression feature;
and the second training unit is used for training the discrimination network, the classification network and the random vector regression network based on the second loss.
In a possible implementation manner, the third determining unit is specifically configured to input the feature data of the second training sample into the classification network for processing, so as to obtain a second classification result; and determine the second loss based on the second discrimination result, the second classification result, and the training regression feature.
In a possible implementation manner, the obtaining unit is specifically configured to perform feature extraction on each of the at least two second images by using a second feature extraction network, so as to obtain feature data of each of the at least two second images;
the second training unit is further configured to train the second feature extraction network based on the second loss.
In a third aspect, an embodiment of the present application provides an image matching apparatus, including: a processor and a memory; the memory is coupled to the processor and stores the program instructions and data required by the image matching apparatus; the processor is configured to enable the image matching apparatus to perform the corresponding functions in the method of the first aspect.
In a possible implementation manner, the image matching apparatus may further include an input/output interface, which is used for supporting communication between the apparatus and other apparatuses.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored therein readable instructions, which, when executed on a computer, cause the computer to perform a method according to the above aspects.
In a fifth aspect, the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the described aspects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present application, the drawings required to be used in the embodiments or the background art of the present application will be described below.
Fig. 1 is a schematic flowchart of an image matching method provided in an embodiment of the present application;
fig. 2 is a schematic diagram of a specific scene of image matching provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a training method provided in an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of another training method provided by the embodiments of the present application;
FIG. 5 is a schematic flow chart diagram illustrating another training method provided by an embodiment of the present application;
fig. 6 is a schematic diagram of a specific scenario of a training method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image matching apparatus provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a first determining unit provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of another image matching apparatus provided in an embodiment of the present application;
fig. 10 is a schematic structural diagram of a second determining unit provided in an embodiment of the present application;
fig. 11 is a schematic structural diagram of another image matching apparatus provided in the embodiment of the present application;
fig. 12 is a schematic structural diagram of another image matching apparatus provided in the embodiment of the present application.
Detailed Description
The terms "first," "second," and the like in the description and claims of the present application and in the drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, or apparatus.
The present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic flowchart of an image matching method provided in an embodiment of the present application. The method is applicable to an image matching apparatus, which may be a server or a terminal device; the embodiment of the present application does not limit the specific type of device used as the image matching apparatus.
As shown in fig. 1, the image matching method includes:
and S101, acquiring characteristic data of the first image.
In this embodiment, the first image may be any image, such as a still image or a video frame image, and the first image is not limited in this embodiment.
In some possible embodiments, the first image is subjected to feature extraction by using a first feature extraction network, so as to obtain feature data of the first image.
Optionally, the first feature extraction network may adopt a deep learning network, for example, feature data of the first image may be acquired through a ResNet network. Further, the ResNet network may include N convolutional layers, and the first image is processed sequentially through the N convolutional layers to obtain the feature data of the first image.
It is understood that the deep learning network shown above is only an example, and should not be construed as limiting the embodiments of the present application.
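Purely as an illustration of this step, the following is a minimal sketch assuming a PyTorch-style ResNet backbone whose classification head is removed so that it outputs a feature vector; the helper names (build_feature_extractor, extract_features) are hypothetical and not part of the patented method.

```python
# Hypothetical sketch of a feature extraction network (S101), assuming a
# torchvision ResNet backbone; not the patent's actual implementation.
import torch
import torchvision.models as models

def build_feature_extractor():
    # Take a ResNet and drop its classification layer so that the network
    # outputs a feature vector for each input image.
    resnet = models.resnet18(weights=None)
    resnet.fc = torch.nn.Identity()
    return resnet

def extract_features(extractor, image_batch):
    # image_batch: tensor of shape (B, 3, H, W); returns (B, 512) feature data.
    extractor.eval()
    with torch.no_grad():
        return extractor(image_batch)
```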
Optionally, a specific implementation manner of acquiring the feature data of the first image may include: the image matching apparatus directly obtains the feature data of the first image through its own processor, that is, it computes the feature data in real time when a target image matching the first image needs to be determined from the at least two second images.
Alternatively, the specific implementation manner of acquiring the feature data of the first image may further include: the image matching apparatus obtains the feature data of the first image from its memory. That is, the feature data of the first image is obtained in advance and stored in the memory, and when the image matching apparatus needs to determine the target image matching the first image from the at least two second images, it can read the feature data directly from the memory. This effectively saves time and improves the speed and efficiency of matching the target image.
Alternatively, the image matching apparatus may obtain the feature data of the first image from another apparatus, that is, the feature data is computed by the other apparatus and acquired by the image matching apparatus when needed; for example, a server receives the feature data of the first image from a terminal device. It can be understood that the embodiment of the present application does not limit how the image matching apparatus acquires the feature data of the first image, nor, in the case where the feature data is acquired from another apparatus, how that apparatus obtains it. In addition, the embodiment of the present application does not limit the specific type of the other apparatus; for example, the other apparatus may be a terminal device, a server, or the like.
S102, obtaining M groups of target characteristic data based on the characteristic data of the first image, wherein M is an integer greater than or equal to 2.
In the embodiment of the present application, each of the M sets of target feature data corresponds to feature data of the first image.
Optionally, the dimension of the target feature data is larger than the dimension of the feature data of the first image.
In some possible implementations, obtaining M sets of target feature data based on the feature data of the first image includes:
and obtaining M groups of target feature data based on the feature data of the first image and the M random vectors.
In the embodiment of the application, different target feature data can be obtained based on different random vectors. That is, conditioned on the feature data of the first image, at least two groups of target feature data may be obtained with the random vector as a variable. Specifically, the random vector may be a Gaussian random vector, that is, a random vector sampled from a Gaussian distribution, or may be another type of random vector.
In the embodiment of the application, through the random vector, a plurality of possible target characteristic data can be obtained based on the characteristic data of the first image, so that the efficiency of obtaining the target characteristic data is improved.
Specifically, obtaining M groups of target feature data based on the feature data of the first image includes:
and inputting the feature data of the first image into a feature generation network for processing to obtain M groups of target feature data.
In an embodiment of the present application, the feature generation network may be configured to generate M sets of target feature data. Further, the feature data of the first image and the M random vectors may be input to the feature generation network for processing, so as to obtain M sets of target feature data.
For example, in a case where the amount of information contained in the first image is less than the amount of information contained in the second image, that is, the first image is a weak-modality image and the second image is a strong-modality image, the feature data of the weak-modality image and M random vectors (or M different random vectors) may be input to the feature generation network to obtain a series of pseudo strong-modality feature data. Pseudo strong-modality feature data can be understood as feature data generated by the feature generation network that has the same dimension as the feature data of a strong-modality image.
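As an illustration only, the following is a minimal sketch of how M groups of target feature data might be generated from the feature data of the first image and M random vectors; the generator architecture (a small multilayer perceptron), the dimensions, and all names are assumptions for illustration and are not the patent's actual network.

```python
# Hypothetical sketch of the feature generation network G (S102); the MLP
# architecture and the dimensions are illustrative assumptions only.
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    def __init__(self, weak_dim=512, z_dim=64, strong_dim=2048):
        super().__init__()
        # Maps (weak-modality feature, random vector) -> pseudo strong-modality feature.
        self.net = nn.Sequential(
            nn.Linear(weak_dim + z_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, strong_dim),
        )

    def forward(self, f_q, z):
        return self.net(torch.cat([f_q, z], dim=-1))

def generate_target_features(generator, f_q, M=10, z_dim=64):
    # f_q: (1, weak_dim) feature data of the first image.
    # Returns (M, strong_dim): M groups of target feature data, one per random vector.
    z = torch.randn(M, z_dim)              # M Gaussian random vectors z_j ~ N(0, I)
    return generator(f_q.expand(M, -1), z)
```

Conditioning on the same f_q while varying z_j is what yields a series of different pseudo strong-modality feature data from a single weak-modality image.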
In this embodiment, the feature generation network may be trained by the image matching apparatus itself, or may be trained by another device, such as a training device, and then sent to the image matching apparatus. That is, the image matching apparatus may receive a feature generation network trained by the training device, or may train the feature generation network itself. It can be understood that the embodiment of the present application does not limit how the training device trains the feature generation network.
S103, determining a target image matched with the first image from at least two second images based on the M groups of target characteristic data.
In the embodiment of the application, the target image is an image matched with the first image.
In the embodiment of the present application, the amount of information contained in the first image may be comparable to the amount of information contained in the second image, or it may be smaller. For example, the first image may be a sketch and the second image a photograph; the first image may be a grayscale image and the second image a color image; or the first image may be a low-resolution image and the second image a high-resolution image. Alternatively, the first image may contain a larger amount of information than the second image.
When the amount of information contained in the first image is smaller than the amount of information contained in the second image, that is, smaller than the amount of information contained in any one of the at least two second images, the first image may be referred to as a weak-modality image and the second image as a strong-modality image. Because a weak-modality image contains much less information than a strong-modality image, it may not provide enough information to determine a unique strong-modality image matching it; as shown in fig. 2, there may be a plurality of strong-modality images that match it.
In some possible implementations, determining a target image from the at least two second images that matches the first image based on the M sets of target feature data includes:
determining similarity of the first image and each of the at least two second images based on the M sets of target feature data;
a target image matching the first image is determined from the at least two second images based on the similarity of the first image to each of the at least two second images.
In the embodiment of the present application, the similarity between the first image and each second image may be determined according to the M groups of target feature data, so that the target image is determined from the at least two second images based on these similarities. For example, the image with the highest similarity to the first image among the at least two second images may be determined as the target image; or any image whose similarity to the first image exceeds a preset threshold may be determined as the target image; or the top several images among the candidate images whose similarity to the first image exceeds a preset threshold may be determined as target images. The embodiment of the present application does not limit this.
In some possible implementations, determining a similarity of the first image to each of the at least two second images based on the M sets of target feature data includes:
acquiring characteristic data of each of at least two second images;
obtaining M similarity corresponding to each second image based on the M groups of target feature data and the feature data of each second image in the at least two second images;
and determining the similarity of the first image and each second image according to the M similarities corresponding to each second image in the at least two second images.
In this embodiment of the application, the similarity between the M groups of target feature data and the feature data of the second image may be respectively calculated based on the M groups of target feature data to obtain M similarities corresponding to the second image, and the similarity between the first image and the second image is determined according to the M similarities corresponding to the second image.
The embodiment of the application further provides a method for acquiring feature data of a second image, which includes:
and performing feature extraction on each of the at least two second images by using a second feature extraction network to obtain feature data of each second image.
The second feature extraction network may adopt a deep learning network, for example, a ResNet network may be used to obtain feature data of each second image. Further, the ResNet network may include i convolutional layers, and the second image may sequentially undergo convolution processing of the i convolutional layers to obtain feature data of the second image.
It is understood that the weight parameters of the second feature extraction network may be the same as the weight parameters of the first feature extraction network, i.e. the first feature extraction network and the second feature extraction network share the same network parameters. For example, the image matching apparatus may use the same deep learning network to extract the feature data of the first image and of each second image. Alternatively, the weight parameters of the second feature extraction network may differ from those of the first feature extraction network, i.e. the image matching apparatus uses different deep learning networks for the first image and the second images. Using different deep learning networks for the first image and the second images can improve the accuracy of feature extraction.
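For illustration of this design choice only, the short sketch below shows the two options, reusing the hypothetical build_feature_extractor helper from the earlier sketch; whether to share weights is left to the implementer.

```python
# Hypothetical sketch: shared vs. separate feature extraction networks.

# Option 1: shared weights -- the same network serves as both Fw and Fs.
shared = build_feature_extractor()
Fw, Fs = shared, shared

# Option 2: separate weights -- each modality gets its own extractor, which,
# as noted above, can improve the accuracy of feature extraction.
Fw = build_feature_extractor()   # for first (e.g. weak-modality) images
Fs = build_feature_extractor()   # for second (e.g. strong-modality) images
```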
Optionally, a specific implementation manner of acquiring the feature data of each of the at least two second images may include: the image matching device obtains the feature data of each of the at least two second images through the processor of the image matching device, that is, when the image matching device needs to implement the embodiment of the application, the feature data of each second image can be obtained through the processor in real time.
Or, the specific implementation manner of acquiring the feature data of each of the at least two second images may further include: the image matching device obtains the characteristic data of each second image from the memory of the image matching device. That is, the feature data of the second image is stored in the memory in advance, and in the case where the image matching apparatus needs to determine the target image from the at least two second images, the image matching apparatus may acquire the feature data of the second image from the memory.
Alternatively, the image matching device may also acquire feature data of each second image from other devices. It can be understood that how the image matching apparatus obtains the feature data of each second image may also correspond to how the image matching apparatus obtains the feature data of the first image, and a detailed description thereof is omitted here.
In some possible implementations, determining the similarity between the first image and each of the at least two second images according to the M similarities corresponding to each of the at least two second images includes:
and determining an average processing result of the M similarities corresponding to the second image as the similarity between the first image and the second image.
In this embodiment of the application, the average processing result of the M similarities corresponding to the second image may include: a weighted average result of the M similarities corresponding to the second image, or an arithmetic average result of the M similarities corresponding to the second image, and so on.
In one embodiment, the M similarities corresponding to each of the at least two second images may be obtained from the M groups of target feature data and the feature data of each second image, and then averaged, as follows:

s_{i,j} = sim(g_j, f_i^g), j = 1, ..., M    (1)

S_i = ( Σ_{j=1}^{M} σ_j · s_{i,j} ) / ( Σ_{j=1}^{M} σ_j )    (2)

wherein f_i^g denotes the feature vector of the i-th second image, f_i^g = Fs(X_i^g), Fs denotes the second feature extraction network, X_i^g denotes the i-th of the at least two second images, and sim(·,·) denotes the similarity measure between two feature vectors.
f_q denotes the feature vector of the first image, f_q = Fw(X_q), where Fw denotes the first feature extraction network and Fw(X_q) is the feature vector of the first image X_q extracted by the first feature extraction network. g_j denotes the j-th group of target feature data, g_j = G(f_q, z_j), {g_j}_{j=1:M}, where z_j denotes the j-th random vector sampled from the multivariate Gaussian distribution N(0, I), and I in N(0, I) denotes the identity matrix. G(f_q, z_j) denotes the target feature data obtained by inputting the feature data of the first image X_q and the j-th random vector into the feature generation network. s_{i,j} denotes the similarity between the j-th group of target feature data and the i-th second image, and σ_j denotes the weight assigned to the j-th group of target feature data.
It can be understood that, in formula (2), the averaging method is determined by the values of σ_j: if σ_j is the same for all j, the arithmetic mean of the M similarities corresponding to the second image is determined as the similarity between the first image and the second image; if the values of σ_j differ across j, a weighted average of the M similarities corresponding to the second image is determined as the similarity between the first image and the second image.
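To make the computation concrete, here is a minimal sketch of formulas (1) and (2) followed by selection of the target image; using cosine similarity for sim(·,·) and equal weights σ_j are illustrative assumptions, not necessarily the exact choices of the patent.

```python
# Hypothetical sketch of S103: M similarities per second image (formula (1)),
# their sigma-weighted average (formula (2)), and selection of the target image.
import torch
import torch.nn.functional as F

def match(target_feats, gallery_feats, sigma=None):
    # target_feats:  (M, D) tensor -- M groups of target feature data g_j
    # gallery_feats: (N, D) tensor -- feature data f_i^g of the N second images
    g = F.normalize(target_feats, dim=-1)
    f = F.normalize(gallery_feats, dim=-1)
    s = g @ f.t()                            # (M, N): s_ij, formula (1) with cosine similarity
    if sigma is None:
        sigma = torch.ones(s.shape[0])       # equal sigma_j -> arithmetic mean
    w = sigma / sigma.sum()
    scores = (w.unsqueeze(1) * s).sum(dim=0) # (N,): S_i, formula (2)
    return scores.argmax().item(), scores    # index of the target image, all similarities
```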
In the embodiment of the application, the feature data of the first image is input into the feature generation network to obtain M groups of target feature data, and the target image is then determined according to the M groups of target feature data and the feature data of each of the at least two second images. This effectively bridges the large gap in cross-modal matching between the first image and the second image caused by the first image carrying too little information, and effectively improves the accuracy of determining the target image.
Further, the embodiment of the present application obtains the feature data of the first image and the feature data of each of the at least two second images, that is, the similarity is determined at the level of feature data rather than directly at the image level. On the one hand, high-level features (i.e., feature data) typically have lower dimensionality than low-level representations (i.e., raw images), so it is easier and cheaper to learn or train a feature generator on high-level features; on the other hand, high-level features are closer to the semantic space, where the gap between different modalities is smaller and therefore easier to bridge.
For a more visual understanding of the image matching method provided in the embodiments of the present application, the following description will be given by taking an example in which the amount of information included in the first image is less than the amount of information included in the second image, and in this case, the image matching method determines an image matching the weak modal image from the plurality of strong modal images, that is, performs cross-modal image matching.
Referring to fig. 2, fig. 2 is a schematic diagram of a specific image matching scene provided by an embodiment of the present application. As shown in fig. 2, the three leftmost images (sketches) represent first images, and the remaining images (photographs) represent second images; it can be seen that the information content of a first image is significantly less than that of a second image. The feature data of the first image is extracted and input into the feature generation network to obtain M groups of target feature data, and the target image is obtained after the similarity between the first image and each second image is determined based on the M groups of target feature data. It can be understood that the target image is the second image whose similarity to the first image is highest.
As shown in fig. 2, the target images are, in order, the second from the right in the first row, the third from the right in the second row, and the fourth from the right in the third row. That is, by implementing the embodiment of the present application, the photograph corresponding to (i.e., best matching) a sketch can be obtained quickly and efficiently from a large number of photographs. It is to be understood that the sketches and photographs shown in fig. 2 are only examples and should not be construed as limiting the embodiments of the present application.
Fig. 1 illustrates in detail how the embodiment of the present application determines a target image matching a first image from at least two second images. The feature generation network may be received from another device or may be trained by the image matching apparatus itself; the training method is therefore described in detail below, taking the case where the image matching apparatus trains the feature generation network itself as an example.
It can be understood that in the training methods shown in fig. 3 and 4, the amount of information contained in any one of the images in the first training sample may be equivalent to the amount of information contained in any one of the images in the second training sample; or, the amount of information contained in any image in the first training sample may also be less than the amount of information contained in any image in the second training sample; alternatively, the amount of information contained in any one of the images in the first training sample may also be greater than the amount of information contained in any one of the images in the second training sample.
However, which way is specifically selected may also correspond to the image matching method shown in fig. 2. For example, in the image matching method shown in fig. 2, the amount of information contained in the first image is less than the amount of information contained in the second image, and thus, during training, the amount of information contained in any one of the images in the first training sample needs to be less than the amount of information contained in any one of the images in the second training sample.
Therefore, the training methods shown in fig. 3 and 4 should not be construed as limiting the embodiments of the present application.
Referring to fig. 3, fig. 3 is a schematic flowchart of a training method provided in an embodiment of the present application, and as shown in fig. 3, the method includes:
S301, inputting the feature data of the first training sample into a feature generation network to obtain target training feature data.
In this embodiment, the image in the first training sample may be any image, such as a still image or a video frame image, and the image included in the first training sample is not limited in this embodiment. It can be appreciated that the dimensions of the target training feature data are larger than the dimensions of the feature data of the first training sample.
Specifically, the image matching apparatus may extract feature data of the first training sample through the first feature extraction network before S301.
It is understood that the weight parameters of the first feature extraction network may be the same as the weight parameters of the second feature extraction network, or the weight parameters of the first feature extraction network may also be different from the weight parameters of the second feature extraction network. That is, the feature extraction network that extracts the feature data of the first training sample (i.e., the first feature extraction network) may or may not share the weight parameter with the feature extraction network that extracts the feature data of the second training sample (i.e., the second feature extraction network). For the specific implementation of the first feature extraction network and the second feature extraction network, reference may also be made to the specific implementation shown in fig. 1, which is not described in detail here.
Specifically, inputting the feature data of the first training sample into the feature generation network to obtain the target training feature data, including:
inputting the feature data of the first training sample and the training random vector into a feature generation network to obtain target training feature data; wherein, the target training characteristic data obtained by different training random vectors are different.
In the embodiment of the application, inputting different training random vectors yields different target training feature data. The training random vector may specifically be a Gaussian random vector. For example, if the first training sample contains ten images, one image may be used as a condition and different Gaussian random vectors used as variables, each input to the feature generation network, so that at least two target training feature data may be obtained from one training sample (i.e., one image). Thus, at least twenty target training feature data may be obtained from ten first training samples. It is understood that a Gaussian random vector is a random vector sampled from a Gaussian distribution, and its specific value is not limited. Alternatively, the training random vector may be another type of random vector.
S302, inputting the target training characteristic data into a discrimination network for processing to obtain a first discrimination result.
In the embodiment of the application, the discrimination network may be used to distinguish the target training feature data from the feature data of the second training sample. The first discrimination result output by the discrimination network may be output as a probability, that is, the probability that the input feature data belongs to the strong modality; alternatively, it may be output as a score, that is, a score indicating that the input feature data belongs to the strong modality. Thus, the first discrimination result may represent the probability, or the score, that the target training feature data belongs to strong-modality features.
S303, based on the first determination result, a first loss is determined.
Specifically, determining the first loss based on the first discrimination result includes:
inputting the target training characteristic data into a classification network for processing to obtain a first classification result;
inputting target training characteristic data into a random vector regression network for processing to obtain training regression characteristics, wherein the dimensionality of the training regression characteristics is equal to the dimensionality of an input training random vector;
and determining a first loss based on the first discrimination result, the first classification result and the training regression feature.
In the embodiment of the application, the classification network can be used to supervise the target training feature data, and the random vector regression network can be used to ensure that the random vector is effectively utilized when training the feature generation network. It can be understood that when the target training feature data is input into the classification network for processing, the target training feature data and its corresponding class label may be input into the classification network, which outputs the probability corresponding to the class label (that is, the classification result of the target training feature data, i.e. the first classification result). The class label may be an identity, or it may be a category identifier. For example, cats and dogs are two categories; with finer-grained classification, huskies and corgis can also be treated as two categories. Therefore, the embodiment of the present application does not limit the granularity of the class labels.
In the embodiment of the present application, the first loss may be in the form of a function, and the like, and the embodiment of the present application is not limited.
S304, training the feature generation network based on the first loss.
It is understood that the training method described in the embodiment of the present application may be specifically understood as training the weight parameters of the corresponding networks, i.e., updating the weight parameters of the corresponding networks. For example, training the feature generation network based on the first loss may also be understood as updating the weight parameters of the feature generation network based on the first loss.
Specifically, when the feature generation network is trained using the first loss, it can be trained by back propagation, and the training stops when the first discrimination result output by the discrimination network meets a target result. Because the feature generation network and the discrimination network are trained adversarially, when the discrimination network can no longer tell whether the input feature data is strong-modality feature data or pseudo strong-modality feature data, the feature generation network has been trained, and the pseudo strong-modality feature data it outputs is as close as possible to strong-modality feature data. The target result is therefore defined in terms of the discrimination result: if the discrimination result is the probability that the input feature data belongs to the strong modality, the target result may be 1, or may be any value greater than 0.8, and so on. The embodiment of the present application does not limit the specific value of the target result.
In the embodiment of the application, the image matching device outputs the target training feature data through the feature generation network, the target training feature data is then input into the discrimination network, and the discrimination network discriminates it (i.e., estimates the probability or score that the target training feature data belongs to the strong modality). Training the feature generation network and the discrimination network against each other can effectively improve training efficiency and gradually improve the accuracy of the target training feature data that is output.
To describe the training method shown in fig. 3 more concretely, it is explained below with specific formulas.
For example, the formula for inputting the target training feature data into the discriminant network for processing to obtain the first discriminant result may be as follows:
D(G(Fw(X), z))    (3)
wherein D represents the discrimination network, G represents the feature generation network, and Fw represents the first feature extraction network. G(Fw(X), z) means that the first training sample X is input into the first feature extraction network to obtain the training feature data Fw(X) of the first training sample, and then the feature data of the first training sample and the training random vector z are input into the feature generation network to obtain the target training feature data G(Fw(X), z). D(G(Fw(X), z)) is the first discrimination result obtained by inputting the target training feature data, i.e., the output of the feature generation network, into the discrimination network.
For example, the formula for inputting the target training feature data into the classification network for processing to obtain the first classification result may be as follows:
Wc(G(Fw(X), z))    (4)
wherein Wc denotes a classification network.
For example, the target training feature data is input into a random vector regression network for processing, and a formula for obtaining the training regression features can be as follows:
E(G(Fw(X), z))    (5)
wherein E represents the random vector regression network. For formula (5), the feature data of the first training sample and the training random vector are input, target training feature data carrying the class label is output, and this target training feature data is then input into the random vector regression network to obtain an output result. For example, if Fw(X) is a vector of 1 × 10 and z is a vector of 1 × 12, then G(Fw(X), z) is a vector of 1 × 10 and E(G(Fw(X), z)) is a vector of 1 × 12.
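To make the dimension bookkeeping of this example concrete, the following sketch builds toy networks with exactly those sizes; every layer width and module definition here is an illustrative assumption rather than the structure prescribed by this application:

```python
import torch
import torch.nn as nn

feat_dim, z_dim = 10, 12                   # Fw(X) is 1 × 10 and z is 1 × 12 in the example above

Fw = nn.Linear(256, feat_dim)              # toy stand-in for the first feature extraction network
G = nn.Linear(feat_dim + z_dim, feat_dim)  # feature generation network: (Fw(X), z) -> 1 × 10
E = nn.Linear(feat_dim, z_dim)             # random vector regression network: 1 × 10 -> 1 × 12

x = torch.randn(1, 256)                    # stand-in for a weak-modality input
z = torch.randn(1, z_dim)                  # training random vector sampled from a Gaussian distribution

fwx = Fw(x)                                # Fw(X), shape 1 × 10
g = G(torch.cat([fwx, z], dim=1))          # G(Fw(X), z), target training feature data, shape 1 × 10
e = E(g)                                   # E(G(Fw(X), z)), training regression feature, shape 1 × 12

assert fwx.shape == (1, feat_dim) and g.shape == (1, feat_dim) and e.shape == (1, z_dim)
```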
For example, based on the first discrimination result, the first classification result, and the training regression feature, the formula for determining the first loss may be as follows:
L1 = Ladv(D(G(Fw(X), z))) + Lcls(Wc(G(Fw(X), z))) + Lreg(E(G(Fw(X), z)), z)    (6)
wherein Ladv, Lcls and Lreg denote the adversarial loss term, the classification loss term and the regression loss term, respectively; the specific forms of these terms are not limited.
it can be understood that the above describes how to determine the first loss by taking the loss function as an example.
Specifically, the method shown in fig. 3 further includes: and training the first feature extraction network according to the first loss.
As for equation (6), G and Fw may be differentiated, respectively, to update the feature generation network and the first feature extraction network.
That is to say, in the embodiment of the present application, the weight parameters of the first feature extraction network may also be updated according to the first loss.
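A minimal sketch of this generator-side update, assuming a binary cross-entropy adversarial term, a cross-entropy classification term and an L2 regression term with unit weights (the application fixes only which outputs the first loss depends on, not these concrete loss forms or their weighting):

```python
import torch
import torch.nn.functional as F

def generator_step(x, z, labels, Fw, G, D, Wc, E, optimizer_g):
    """One update of the feature generation network G and the first feature extraction network Fw."""
    fwx = Fw(x)                                   # Fw(X): feature data of the first training sample
    g = G(torch.cat([fwx, z], dim=1))             # G(Fw(X), z): target training feature data
    d_out = D(g)                                  # first discrimination result (as a logit)
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    cls = F.cross_entropy(Wc(g), labels)          # first classification result vs. the class labels
    reg = F.mse_loss(E(g), z)                     # training regression feature vs. the input random vector
    first_loss = adv + cls + reg                  # formula (6), with assumed unit weights
    optimizer_g.zero_grad()
    first_loss.backward()                         # gradients flow into both G and Fw
    optimizer_g.step()                            # optimizer_g is assumed to hold the parameters of G and Fw
    return first_loss.item()
```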
How to train the feature generation network is described in detail in the training method shown in fig. 3, and how to train the discriminant network is described in detail below. Referring to fig. 4, fig. 4 is a schematic flowchart of another training method provided in the embodiment of the present application, and as shown in fig. 4, the training method includes:
S401, inputting the feature data of the second training sample into a discrimination network for processing to obtain a second discrimination result, wherein the amount of information contained in the second training sample is greater than the amount of information contained in the first training sample.
In an embodiment of the present application, the first training sample may be a weak-modality image, and the second training sample may be a strong-modality image. It can be understood that, since the first training sample and the second training sample are used for training the feature generation network, their numbers may be related to the degree of training and the like; the numbers of first training samples and second training samples are not limited in the embodiment of the present application.
In the embodiment of the present application, the target training feature data and the feature data of the second training sample may be identical in dimensionality. That is, in order to further improve the accuracy of the discrimination performed by the discrimination network, when the feature data of the second training sample and the target training feature data are input into the discrimination network, the two may be guaranteed to have the same dimensionality. For example, the feature data of the second training sample is b1 × M and the target training feature data is b2 × M; that is, the number b1 of feature-data items of the second training sample may be different from, or the same as, the number b2 of items of target training feature data, which is not limited in this application, but the dimensionality M is the same.
In an embodiment of the present application, the dimension of the feature data of the second training sample is different from the dimension of the feature data of the first training sample. Optionally, the dimension of the feature data of the second training sample is larger than the dimension of the feature data of the first training sample.
Specifically, before the feature data of the second training sample is input to the discrimination network for processing, the feature data of the second training sample may be extracted by the second feature extraction network. It is understood that the specific implementation of the second feature extraction network can refer to the specific implementation shown in fig. 1 and 3, and detailed description thereof is omitted here.
S402, determining a second loss based on the second judgment result and the training regression feature.
Specifically, determining the second loss based on the second determination result and the training regression feature includes:
inputting the characteristic data of the second training sample into a classification network for processing to obtain a second classification result;
and determining a second loss based on the second judgment result, the second classification result and the training regression feature.
It can be understood that, in the embodiment of the present application, for a specific implementation manner of the classification network and the training regression feature, reference may be made to the implementation manner shown in fig. 3, which is not described herein again.
In the embodiment of the present application, the second loss may likewise take the form of a loss function or the like, which is not limited in the embodiment of the present application.
And S403, training a discriminant network, a classification network and a random vector regression network based on the second loss.
To more visually describe the training method shown in fig. 4, the following description is given with a specific formula.
For example, the feature data of the second training sample is input to the decision network for processing, and a formula for obtaining a second decision result is as follows:
D(Fs(X'))    (7)
wherein Fs represents the second feature extraction network and X' denotes the second training sample. It can be understood that, for the remaining parameters in formula (7), reference may be made to the formulas described in the foregoing embodiments.
For example, the feature data of the second training sample is input to the classification network for processing, and a formula for obtaining a second classification result is as follows:
Wc(Fs(X'))    (8)
in the embodiment of the present application, it can be understood that, for formula (4) and formula (8), the weight parameters of the classification network may be trained with the second training sample; the classification network can then be used to preserve the class label of the target training feature data, so as to avoid generating feature data whose class is obviously different.
For example, based on the second loss, the formulas for training the discriminant network, the classification network, and the random vector regression network are as follows:
L2 = Ladv(D(Fs(X')), D(G(Fw(X), z))) + Lcls(Wc(Fs(X'))) + Lreg(E(G(Fw(X), z)), z)    (9)
it can be understood that for equation (9), D, Wc, Fs, and E can be differentiated to update the discriminant network, the classification network, the second feature extraction network, and the random vector regression network, respectively.
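Correspondingly, a sketch of the update driven by the second loss could look as follows; both the use of the generated features as negatives for the discrimination network and the concrete loss forms are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

def discriminator_step(x_weak, z, x_strong, labels_strong, Fw, G, D, Wc, Fs, E, optimizer_d):
    """One update of the discrimination, classification, second feature extraction and regression networks."""
    with torch.no_grad():                          # the generator side is not updated in this stage
        g = G(torch.cat([Fw(x_weak), z], dim=1))   # pseudo-strong-modality target training feature data
    fsx = Fs(x_strong)                             # Fs(X'): feature data of the second training sample
    d_real, d_fake = D(fsx), D(g)
    adv = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
           F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    cls = F.cross_entropy(Wc(fsx), labels_strong)  # second classification result, formula (8)
    reg = F.mse_loss(E(g), z)                      # training regression feature term
    second_loss = adv + cls + reg                  # formula (9), with assumed unit weights
    optimizer_d.zero_grad()
    second_loss.backward()                         # gradients flow into D, Wc, Fs and E
    optimizer_d.step()                             # optimizer_d is assumed to hold their parameters
    return second_loss.item()
```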
Referring to fig. 5, fig. 5 is a schematic flowchart of another training method provided in the embodiment of the present application, and as shown in fig. 5, the method includes:
S501, inputting the feature data of the first training sample into a feature generation network to obtain target training feature data.
S502, inputting the target training characteristic data into a discrimination network for processing to obtain a first discrimination result, and inputting the target training characteristic data into a classification network for processing to obtain a first classification result.
S503, inputting the characteristic data of the second training sample into the discrimination network for processing to obtain a second discrimination result, and inputting the characteristic data of the second training sample into the classification network for processing to obtain a second classification result.
S504, inputting the target training characteristic data into a random vector regression network for processing to obtain training regression characteristics.
And S505, determining a first loss based on the first discrimination result, the first classification result and the training regression feature.
S506, determining a second loss based on the second judgment result, the second classification result and the training regression feature.
And S507, training a feature generation network and a first feature extraction network based on the first loss.
And S508, training a discrimination network, a classification network, a random vector regression network and a second feature extraction network based on the second loss.
It can be understood that, in the embodiment of the present application, when training the feature generation network and the first feature extraction network and training the discrimination network, the classification network, the random vector regression network, and the second feature extraction network, the training may be performed in two stages: for example, the discrimination network, the classification network, the random vector regression network, and the second feature extraction network are trained first, and then the feature generation network and the first feature extraction network are trained; or the feature generation network and the first feature extraction network are trained first, and then the discrimination network, the classification network, the random vector regression network, and the second feature extraction network are trained. With the training method provided by the embodiment of the application, the two stages can be alternated continuously until the expected effect is achieved.
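A hypothetical driver that alternates the two stages could reuse the generator_step and discriminator_step sketches above; the data loaders, optimizers, random-vector dimension and epoch count below are all assumed:

```python
import torch

def train(weak_loader, strong_loader, Fw, G, D, Wc, Fs, E,
          optimizer_g, optimizer_d, z_dim=12, num_epochs=10):
    """Alternate the two training stages until the expected effect is achieved (epoch count assumed)."""
    for _ in range(num_epochs):
        for (x_weak, labels_weak), (x_strong, labels_strong) in zip(weak_loader, strong_loader):
            z = torch.randn(x_weak.size(0), z_dim)      # one training random vector per sample
            # Stage 1: train the discrimination, classification, random vector regression
            # and second feature extraction networks based on the second loss.
            discriminator_step(x_weak, z, x_strong, labels_strong,
                               Fw, G, D, Wc, Fs, E, optimizer_d)
            # Stage 2: train the feature generation and first feature extraction networks
            # based on the first loss.
            generator_step(x_weak, z, labels_weak, Fw, G, D, Wc, E, optimizer_g)
```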
For the training method shown in fig. 5, reference may be made to the methods shown in fig. 3 and 4, and detailed description thereof is omitted.
For a visual understanding of the training method shown in fig. 5, refer to fig. 6, which is a specific scene diagram of a training method provided in an embodiment of the present application. As shown in fig. 6, the dotted-line parts in fig. 6 indicate that the parameters of the corresponding network are not updated, and the solid-line parts indicate that the parameters of the corresponding network are updated.
As shown in fig. 6, the output of the first feature extraction network may be input to the feature generation network, and the output of the feature generation network may be input to the random vector regression network, the classification network, and the discrimination network, respectively. And the output of the second feature extraction network may be input to the discrimination network and the classification network.
It can be understood that, in the diagram shown on the left side of fig. 6, the feature generation network may receive supervisory signals from both the discrimination network and the classification network through back propagation, so that the target training feature data preserves the original identity information while the distribution of the generated feature space is kept consistent with the feature space distribution of the second training sample. In the diagram shown on the right side of fig. 6, the discrimination network may receive the target training feature data output by the feature generation network and the feature data of the second training sample output by the second feature extraction network, respectively, so as to learn the difference in distribution between the two.
As can be seen from the above implementation, one input of the feature generation network shown in fig. 6 may be the feature data of the first training sample extracted by the first feature extraction network, and another may be the discrimination result output by the discrimination network. That is, the feature generation network may continuously generate the target training feature data from the feature data of the first training sample, and the feature generation network may be updated according to the discrimination result output by the discrimination network.
Specifically, the discrimination result output by the discrimination network can be used not only to train the feature generation network but also to update the discrimination network. More specifically, the purpose of the feature generation network is to make the obtained pseudo-strong-modality feature data as realistic as possible, while the purpose of the discrimination network is to discriminate as well as possible whether the input feature data belongs to the strong modality or is pseudo-strong-modality feature data. The discrimination network and the feature generation network therefore form an adversarial pair, and the two are trained against each other.
It can be understood that, in order to further improve training efficiency and avoid confusing the data, the weight parameters of the discrimination network can be fixed while the first training sample is input into the feature generation network to obtain the target training feature data; the weight parameters of the feature generation network can be fixed while the feature data of the second training sample is input into the discrimination network; and the weight parameters of the discrimination network can be fixed while the discrimination result is fed back to the feature generation network. This effectively avoids disordered feature-data input or reduced training efficiency caused by different networks changing at the same time.
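In a framework such as PyTorch, fixing the weight parameters of one network while another is being updated can be sketched with a small helper; the helper itself is an assumption and not part of this application:

```python
import torch.nn as nn

def set_frozen(net: nn.Module, frozen: bool) -> None:
    """Fix (or release) a network's weight parameters so that it is (or is not) updated."""
    for p in net.parameters():
        p.requires_grad_(not frozen)

# For example, fix the discrimination network while the feature generation network is trained:
# set_frozen(D, True); generator_step(...); set_frozen(D, False)
```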
In the embodiment of the application, a two-stage training method is used to train the feature generation network and the discrimination network: on the one hand, the feature generation network can be trained according to the discrimination result output by the discrimination network; on the other hand, the feature generation network can be trained according to the classification network, which ensures the consistency of class information between the target training feature data and the feature data of the first training sample, further improving training efficiency and thereby improving the accuracy of the target image.
It can be understood that the methods shown in fig. 1 and figs. 3 to 6 each have their own emphasis; therefore, for an implementation manner not described in detail in one embodiment, reference may be made to the other embodiments.
The method of the embodiments of the present application is elaborated and the apparatus of the embodiments of the present application is provided below.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an image matching apparatus provided in an embodiment of the present application, which may be used to perform the methods shown in fig. 1, fig. 3 to fig. 5, and as shown in fig. 7, the image matching apparatus may include:
an acquiring unit 701 configured to acquire feature data of a first image;
a first data processing unit 702, configured to obtain M groups of target feature data based on feature data of the first image; wherein M is an integer greater than or equal to 2;
a first determining unit 703, configured to determine a target image matching the first image from the at least two second images based on the M groups of target feature data.
By implementing the embodiment of the application, the target image matched with the first image is determined based on the M groups of target feature data, which can further improve the accuracy and precision of determining the target image and avoids the low accuracy that results from determining the target image based on only one group of target feature data.
In a possible implementation manner, the first determining unit 703 may specifically determine, based on the M groups of target feature data and feature data of each of the at least two second images, a target image that matches the first image from the at least two second images.
In one possible implementation, the dimension of the feature data of the second image is different from the dimension of the feature data of the first image, and the dimension of the target feature data is larger than the dimension of the feature data of the second image.
In one possible implementation, the dimension of the feature data of the second image is larger than the dimension of the feature data of the first image.
In a possible implementation manner, the obtaining unit 701 is further configured to obtain feature data of each of the at least two second images.
In some examples, the obtaining unit may perform feature extraction processing on each of the at least two second images to obtain feature data of each of the at least two second images.
In other examples, the obtaining unit may obtain the feature data of each of the at least two second images from the memory.
Optionally, as shown in fig. 8, the first determining unit 703 includes:
a first determining subunit 7031, configured to determine, based on the M groups of target feature data, a similarity between the first image and each of the at least two second images;
a second determining subunit 7032, configured to determine, from the at least two second images, a target image that matches the first image based on the similarity between the first image and each of the at least two second images.
Specifically, the first determining subunit 7031 is specifically configured to obtain feature data of each of the at least two second images; obtain M similarities corresponding to each second image based on the M groups of target feature data and the feature data of each second image in the at least two second images; and determine the similarity of the first image and each second image according to the M similarities corresponding to each second image in the at least two second images.
Specifically, the first determining subunit 7031 is specifically configured to determine an average processing result of the M similarities corresponding to the second image as the similarity between the first image and the second image.
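As a sketch of what the first determining subunit computes, assuming cosine similarity as the similarity measure (this application does not fix the measure) and averaging the M similarities per second image:

```python
import torch
import torch.nn.functional as F

def match(target_feats: torch.Tensor, gallery_feats: torch.Tensor) -> int:
    """target_feats: M x d target feature data obtained for the first image;
    gallery_feats: N x d feature data of the N second images.
    Returns the index of the second image determined as the target image."""
    sims = F.cosine_similarity(target_feats.unsqueeze(1),   # M x 1 x d
                               gallery_feats.unsqueeze(0),  # 1 x N x d
                               dim=-1)                      # M x N similarity matrix
    per_image = sims.mean(dim=0)                            # average the M similarities of each second image
    return int(per_image.argmax().item())                   # the most similar second image
```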
Specifically, the amount of information contained in the first image is less than the amount of information contained in the second image.
By implementing the embodiment of the application, the large cross-modality matching gap between the first image and the second image caused by the small amount of information in the first image can be effectively bridged, and the accuracy of determining the target image is effectively improved.
Further, the first data processing unit 702 is specifically configured to obtain M groups of target feature data based on the feature data of the first image and the M random vectors.
Specifically, the first data processing unit 702 is specifically configured to input the feature data of the first image into the feature generation network for processing, so as to obtain M groups of target feature data.
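A rough sketch of how the first data processing unit could obtain the M groups of target feature data, assuming the M random vectors are sampled from a Gaussian distribution as in the training description:

```python
import torch
import torch.nn as nn

def generate_target_features(first_image_feat: torch.Tensor, G: nn.Module,
                             M: int = 4, z_dim: int = 12) -> torch.Tensor:
    """first_image_feat: 1 x d feature data of the first image.
    Returns M x d' target feature data, one group per sampled random vector."""
    feats = first_image_feat.expand(M, -1)     # repeat the feature data of the first image M times
    z = torch.randn(M, z_dim)                  # M random vectors sampled from a Gaussian distribution
    return G(torch.cat([feats, z], dim=1))     # M groups of target feature data
```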
Further, as shown in fig. 9, the image matching apparatus further includes:
the second data processing unit 704 is configured to input the feature data of the first training sample to the feature generation network, so as to obtain target training feature data;
wherein the dimensionality of the target training feature data is greater than the dimensionality of the feature data of the first training sample.
A first judging unit 705, configured to input the target training feature data into a judging network for processing, so as to obtain a first judging result;
a second determining unit 706 configured to determine a first loss based on the first discrimination result;
a first training unit 707 for training the feature generation network based on the first loss.
Optionally, as shown in fig. 10, the second determining unit 706 includes:
a classification processing subunit 7061, configured to input the target training feature data into a classification network for processing, so as to obtain a first classification result;
a regression processing subunit 7062, configured to input the target training feature data into a random vector regression network for processing, so as to obtain a training regression feature, where a dimension of the training regression feature is equal to a dimension of the input training random vector;
a third determining subunit 7063, configured to determine the first loss based on the first discrimination result, the first classification result, and the training regression feature.
Specifically, the obtaining unit 701 is specifically configured to perform feature extraction on the first image by using a first feature extraction network to obtain feature data of the first image;
the first training unit 707 is further configured to train the first feature extraction network according to the first loss.
Further, as shown in fig. 9, the image matching apparatus further includes:
a second judging unit 708, configured to input feature data of a second training sample to the judging network for processing, so as to obtain a second judging result, where an information amount included in the second training sample is greater than an information amount included in the first training sample;
a third determining unit 709, configured to determine a second loss based on the second determination result and the training regression feature;
and a second training unit 710 for training the discriminant network, the classification network and the random vector regression network based on the second loss.
In a possible implementation manner, the third determining unit 709 is specifically configured to input the feature data of the second training sample into a classification network for processing, so as to obtain a second classification result; and determining a second loss based on the second discrimination result, the second classification result and the training regression feature.
In a possible implementation manner, the obtaining unit 701 is specifically configured to perform feature extraction on each of the at least two second images by using a second feature extraction network, so as to obtain feature data of each of the at least two second images;
and the second training unit 710 is further configured to train a second feature extraction network based on the second loss.
It is understood that the implementation of the respective units may also correspond to the respective description of the method embodiments illustrated with reference to fig. 1, 3, 4 and 5.
Referring to fig. 11, fig. 11 is a schematic structural diagram of another image matching apparatus provided in an embodiment of the present application, where the image matching apparatus may be used to perform the methods shown in fig. 1 and fig. 3 to fig. 5, and as shown in fig. 11, the apparatus includes:
a first feature extraction module 1101, which may extract feature data of the first image based on the first feature extraction network, and may also extract feature data of the first training sample based on the first feature extraction network. That is, the input of the module may be the first image and the output may be the feature data (i.e., the feature vector) of the first image, and the input of the module may be the first training sample and the output may be the feature data (i.e., the feature vector) of the first training sample. It is understood that this module may also be referred to as a weak modal feature extraction module.
A second feature extraction module 1102, which can extract feature data of the second image based on the second feature extraction network, and can also extract feature data of the second training sample based on the second feature extraction network, i.e. the module inputs the second image and outputs feature data (feature vector) of the second image, and the module inputs the second training sample and outputs feature data of the second training sample. It is understood that this module may also be referred to as a strong modal feature extraction module.
The feature generation module 1103 may take as inputs the feature data of the first image extracted by the first feature extraction module and a random vector sampled from a Gaussian distribution, and output target feature data. The inputs of the module may also be the feature data of a first training sample extracted by the first feature extraction module and a random vector sampled from the Gaussian distribution, with target training feature data as the output. The module may perform the above operations based on the feature generation network.
And a judging module 1104, whose inputs are the feature data of the second image and the target feature data, and whose output is a score of belonging to the strong modality. That is, the input to the module may be the feature data extracted by the second feature extraction module and the target feature data obtained from the feature generation module; or the input may be the feature data of the second training sample and the target training feature data, and the output is a score of belonging to the strong modality.
The classifying module 1105 takes as inputs the category label and the feature data extracted by the first feature extraction module, and outputs the probability corresponding to the category label. The module can convert the feature vector into category responses, and the category responses are then converted into the probability corresponding to the sampled category label through the normalized exponential function (softmax).
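A hypothetical sketch of this conversion, with an assumed linear layer producing the category responses:

```python
import torch
import torch.nn as nn

class ClassificationModule(nn.Module):
    """Toy classifying module: feature vector -> category responses -> probability of the given label."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.responses = nn.Linear(feat_dim, num_classes)   # convert feature vectors into category responses

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.responses(feats), dim=1)  # normalized exponential function (softmax)
        return probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # probability corresponding to each label
```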
A random vector regression module 1106, whose input may be the pseudo-strong-modality feature data obtained from the feature generation module, and whose output is a feature vector whose dimensionality is equal to that of the random vector. The module can be supervised with a regression loss function, so that the feature generation module can effectively utilize the random vector sampled from the Gaussian distribution. It can be understood that this random vector is the random vector input to the feature generation module.
It can be understood that the image matching apparatus shown in fig. 11 is illustrated for the case in which the amount of information contained in the first image is less than the amount of information contained in the second image; optionally, the image matching apparatus may also be applied to other cases, which are not described in detail here.
It can be understood that the specific implementation of the image matching apparatus shown in fig. 11 can refer to the foregoing embodiments, and detailed description thereof is omitted here.
Referring to fig. 12, fig. 12 is a schematic structural diagram of another image matching apparatus provided in the embodiment of the present application. The image matching apparatus includes a processor 1201, and may further include an input interface 1202, an output interface 1203, and a memory 1204. The input interface 1202, the output interface 1203, the memory 1204, and the processor 1201 are connected to each other via a bus.
The memory includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a portable read-only memory (CD-ROM), which is used for storing instructions and data.
The input interface is used for inputting data and/or signals, and the output interface is used for outputting data and/or signals. The output interface and the input interface may be separate devices or may be an integral device.
The processor may include one or more processors, for example, one or more Central Processing Units (CPUs), and in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The memory is used for storing program codes and data of the image matching device.
The processor is used to call the program code and data in the memory and execute the steps in the method embodiment.
For example, in one embodiment, the processor may be configured to perform the implementations shown in S101 to S103; the processor may also be configured to perform the implementations shown in S301 to S303, and so on.
As another example, in an embodiment, the processor may be further configured to execute the methods performed by the obtaining unit 701, the first data processing unit 702, the first determining unit 703, and so on.
For a specific implementation of the processor, reference may be made to the description in the method embodiment, and details are not described here.
It will be appreciated that fig. 12 only shows a simplified design of the image matching apparatus. In practical applications, the image matching apparatus may further include other necessary components, including but not limited to any number of input/output interfaces, processors, controllers, memories, etc., and all image matching apparatuses that can implement the embodiments of the present application are within the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the division of the unit is only one logical function division, and other division may be implemented in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. The shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a ROM or RAM, a magnetic medium such as a floppy disk, a hard disk, a magnetic tape, or a magnetic disk, an optical medium such as a Digital Versatile Disc (DVD), or a semiconductor medium such as a Solid State Disk (SSD).

Claims (20)

1. An image matching method, comprising:
acquiring characteristic data of a first image;
inputting the feature data of the first image into a feature generation network for processing to obtain M groups of target feature data, wherein M is an integer greater than or equal to 2, the information content of the first image is less than that of the second image, and the dimensionality of the target feature data is greater than that of the feature data of the first image; the feature generation network is obtained by training in an antagonistic manner with a discrimination network, the discrimination network is used for distinguishing the mode to which the input feature data belongs, the mode comprises a strong mode or a weak mode, and the information content of the image corresponding to the strong mode is greater than that of the image corresponding to the weak mode;
determining a target image from at least two second images that matches the first image based on the M sets of target feature data.
2. The method of claim 1, wherein determining the target image from at least two second images that matches the first image based on the M sets of target feature data comprises:
determining a similarity of the first image to each of the at least two second images based on the M sets of target feature data;
determining a target image from the at least two second images that matches the first image based on the similarity of the first image to each of the at least two second images.
3. The method of claim 2, wherein determining the similarity of the first image to each of the at least two second images based on the M sets of target feature data comprises:
acquiring characteristic data of each second image in the at least two second images;
obtaining M similarities corresponding to each second image based on the M groups of target feature data and the feature data of each second image in the at least two second images;
and determining the similarity of the first image and each second image according to the M similarities corresponding to each second image in the at least two second images.
4. The method according to claim 3, wherein determining the similarity between the first image and each of the at least two second images according to the M similarities corresponding to each of the at least two second images comprises:
and determining the average processing result of the M similarities corresponding to the second image as the similarity of the first image and the second image.
5. The method according to any one of claims 1 to 4, wherein the inputting the feature data of the first image into a feature generation network for processing to obtain M groups of target feature data comprises:
and inputting the feature data of the first image and M random vectors into the feature generation network for processing to obtain the M groups of target feature data.
6. The method of claim 1, wherein before inputting the feature data of the first image into a feature generation network for processing, obtaining M sets of target feature data, the method further comprises:
inputting the feature data of the first training sample into the feature generation network to obtain target training feature data;
inputting the target training characteristic data into the discrimination network for processing to obtain a first discrimination result;
determining a first loss based on the first discrimination result;
training the feature generation network based on the first loss.
7. The method of claim 6, wherein inputting the feature data of the first training sample into the feature generation network to obtain target training feature data comprises:
inputting the feature data of the first training sample and a training random vector to the feature generation network to obtain the target training feature data;
the determining a first loss based on the first discrimination result includes:
inputting the target training characteristic data into a classification network for processing to obtain a first classification result;
inputting the target training characteristic data into a random vector regression network for processing to obtain training regression characteristics, wherein the dimensionality of the training regression characteristics is equal to the dimensionality of the training random vector;
determining the first loss based on the first discrimination result, the first classification result, and the training regression feature.
8. The method of claim 6, wherein the obtaining feature data for the first image comprises:
performing feature extraction on the first image by using a first feature extraction network to obtain feature data of the first image;
the method further comprises the following steps:
training the first feature extraction network according to the first loss.
9. The method according to any one of claims 6 to 8, further comprising:
inputting the feature data of a second training sample into the discrimination network for processing to obtain a second discrimination result, wherein the information content contained in the second training sample is greater than the information content contained in the first training sample;
inputting the characteristic data of the second training sample into a classification network for processing to obtain a second classification result;
determining a second loss based on the second discrimination result, the training regression feature, and the second classification result;
training the discrimination network, the classification network, and a random vector regression network based on the second loss.
10. An image matching apparatus, characterized by comprising:
an acquisition unit configured to acquire feature data of a first image;
a first data processing unit, configured to input feature data of the first image into a feature generation network for processing, so as to obtain M groups of target feature data, where M is an integer greater than or equal to 2, an information amount included in the first image is less than an information amount included in a second image, and a dimensionality of the target feature data is greater than a dimensionality of the feature data of the first image; the feature generation network is obtained by training in an antagonistic manner with a discrimination network, the discrimination network is used for distinguishing the mode to which the input feature data belongs, the mode comprises a strong mode or a weak mode, and the information content of the image corresponding to the strong mode is greater than that of the image corresponding to the weak mode;
a first determining unit, configured to determine, based on the M groups of target feature data, a target image that matches the first image from among at least two second images.
11. The apparatus of claim 10, wherein the first determining unit comprises:
a first determining subunit, configured to determine, based on the M sets of target feature data, a similarity between the first image and each of the at least two second images;
a second determining subunit, configured to determine, based on a similarity between the first image and each of the at least two second images, a target image that matches the first image from the at least two second images.
12. The apparatus of claim 11,
the first determining subunit is specifically configured to acquire feature data of each of the at least two second images; obtain M similarities corresponding to each second image based on the M groups of target feature data and the feature data of each second image in the at least two second images; and determine the similarity of the first image and each second image according to the M similarities corresponding to each second image in the at least two second images.
13. The apparatus of claim 12,
the first determining subunit is specifically configured to determine, as the similarity between the first image and the second image, an average processing result of the M similarities corresponding to the second image.
14. The apparatus of any one of claims 10 to 13,
the first data processing unit is specifically configured to input the feature data of the first image and M random vectors to the feature generation network for processing, so as to obtain the M groups of target feature data.
15. The apparatus of claim 10, further comprising:
the second data processing unit is used for inputting the feature data of the first training sample into the feature generation network to obtain target training feature data;
the first judging unit is used for inputting the target training characteristic data into the judging network for processing to obtain a first judging result;
a second determination unit configured to determine a first loss based on the first discrimination result;
a first training unit to train the feature generation network based on the first loss.
16. The apparatus of claim 15,
the second data processing unit is specifically configured to input the feature data of the first training sample and the training random vector to the feature generation network, so as to obtain the target training feature data;
the second determination unit includes:
the classification processing subunit is used for inputting the target training characteristic data into a classification network for processing to obtain a first classification result;
the regression processing subunit is used for inputting the target training feature data into a random vector regression network for processing to obtain training regression features, wherein the dimensionality of the training regression features is equal to the dimensionality of the training random vector;
a third determining subunit, configured to determine the first loss based on the first discrimination result, the first classification result, and the training regression feature.
17. The apparatus of claim 15,
the acquiring unit is specifically configured to perform feature extraction on the first image by using a first feature extraction network to obtain feature data of the first image;
the first training unit is further configured to train the first feature extraction network according to the first loss.
18. The apparatus of any one of claims 15 to 17, further comprising:
a second judging unit, configured to input feature data of a second training sample to the judging network for processing, so as to obtain a second judging result, where an information amount included in the second training sample is greater than an information amount included in the first training sample;
the third determining unit is used for inputting the feature data of the second training sample into a classification network for processing to obtain a second classification result;
the third determining unit is further configured to determine a second loss based on the second determination result, the training regression feature, and the second classification result;
and the second training unit is used for training the discrimination network, the classification network and the random vector regression network based on the second loss.
19. An image matching apparatus comprising a processor and a memory, the memory storing computer readable instructions which, when executed by the processor, cause the processor to carry out the method of any one of claims 1 to 9.
20. A computer-readable storage medium having computer-readable instructions stored therein, which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 9.
CN201810728913.3A 2018-07-04 2018-07-04 Image matching method, device and computer readable storage medium Active CN109101992B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810728913.3A CN109101992B (en) 2018-07-04 2018-07-04 Image matching method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810728913.3A CN109101992B (en) 2018-07-04 2018-07-04 Image matching method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109101992A CN109101992A (en) 2018-12-28
CN109101992B true CN109101992B (en) 2022-02-22

Family

ID=64845718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810728913.3A Active CN109101992B (en) 2018-07-04 2018-07-04 Image matching method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109101992B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143601A (en) * 2019-12-31 2020-05-12 深圳市芭田生态工程股份有限公司 Image processing method


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122826A (en) * 2017-05-08 2017-09-01 京东方科技集团股份有限公司 Processing method and system and storage medium for convolutional neural networks
CN107563509A (en) * 2017-07-17 2018-01-09 华南理工大学 A kind of dynamic adjustment algorithm for the condition DCGAN models that feature based returns
CN107451619A (en) * 2017-08-11 2017-12-08 深圳市唯特视科技有限公司 A kind of small target detecting method that confrontation network is generated based on perception
CN107609637A (en) * 2017-09-27 2018-01-19 北京师范大学 A kind of combination data represent the method with the raising pattern-recognition precision of pseudo- reversal learning self-encoding encoder

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ruihua Wang et al., "An Effective Image Denoising Method for UAV Images via Improved Generative Adversarial Networks", Sensors, 2018-06-21, entire document *
Jianan Li et al., "Perceptual Generative Adversarial Networks for Small Object Detection", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017-11-09, entire document *

Also Published As

Publication number Publication date
CN109101992A (en) 2018-12-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant