CN111931791A - Method for realizing image turnover invariance - Google Patents

Method for realizing image turnover invariance

Info

Publication number
CN111931791A
CN111931791A (application CN202010802264.4A; granted as CN111931791B)
Authority
CN
China
Prior art keywords
images
feature descriptors
image
visual
invariance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010802264.4A
Other languages
Chinese (zh)
Other versions
CN111931791B (en)
Inventor
夏瑞阳
李国权
黄正文
刘一麟
林金朝
庞宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202010802264.4A priority Critical patent/CN111931791B/en
Publication of CN111931791A publication Critical patent/CN111931791A/en
Application granted granted Critical
Publication of CN111931791B publication Critical patent/CN111931791B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/60Rotation of whole images or parts thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for realizing image turnover (flip) invariance, which belongs to the field of image processing and comprises the following steps: S1: clustering all feature descriptors in the image and generating visual vocabularies; S2: establishing an index structure model to store the feature descriptors according to the visual vocabularies; S3: analyzing the distribution of the main key points by using the visual vocabulary containing the most feature descriptors; S4: determining whether the object is flipped and performing matching. The method has high matching efficiency and high matching precision.

Description

Method for realizing image turnover invariance
Technical Field
The invention belongs to the field of image processing, and relates to a method for realizing image turnover invariance.
Background
Feature descriptors, particularly local feature descriptors, are important tools for describing image features and are widely applied in image processing tasks such as object classification, object detection and image search. Although there are many types of feature descriptors, the common goal is to obtain feature descriptors with strong scale, translation, rotation and flip invariance.
The application range of the feature descriptors is very wide, for example, in order to realize a quick and correct object retrieval system in a website, the establishment of an inverted index is indispensable, and the basis of the method is to establish a bag-of-words model based on the feature descriptors; in addition, gesture recognition, texture recognition, three-dimensional reconstruction and the like can be realized by combining different image feature descriptors with other applications.
Image matching based on local feature descriptors generally consists of two stages: feature point generation and matching. Feature point generation in turn comprises feature point detection and feature point description: reliable interest points in the image are first extracted as feature points, and these feature points are then described. The resulting feature descriptors should be robust to photometric transformations (such as brightness and highlights) and invariant to geometric transformations (such as rotation, scaling, viewpoint change and reflection).
Although there are many different types of feature descriptors, their design processes are similar. First, key points in the image are generated and located; then some noise points are removed by taking various factors into account, so that the remaining true key points carry invariant features; finally, these key points are described by constructing a decimal or binary string, which turns them into feature descriptors. For the first step, there are many ways to generate key points, such as the pixel range or gradient values of a region, the Harris corner function, the Laplacian of Gaussian (LoG), or the determinant of the Hessian (DoH). In the second step, deleting edge points with non-maximum suppression (NMS) or a Hessian matrix is an effective way to reduce noise points; the orientation vectors of the region around each key point are then accumulated to obtain the dominant direction of the feature descriptor, which ensures rotation invariance, while scale invariance can be achieved by building a Gaussian pyramid. In the third step, a string is used to name each key point: if the string is too short or purely binary, the feature descriptor has little discriminative power, whereas if it is too long or decimal, the descriptor will contain too much noise. The key points named by these strings become the feature descriptors.
Compared with many classical algorithms, the SIFT algorithm generates feature descriptors with better performance: on one hand, each descriptor carries 128 dimensions of information; on the other hand, SIFT also generates many key points in an image. Even recently, many more advanced descriptors have been designed on the basis of SIFT, such as RootSIFT and DSP-SIFT. Although many scholars have recently proposed learned descriptors, hand-crafted descriptors still perform well in many applications. However, because of the structure of the descriptor, its performance under flipping remains poor, while feature descriptors with flip invariance are required in many application domains. For example, many pirated files use a flipping operation to spoof identification systems, since this operation does not damage the content of the document. Many algorithms have been proposed to handle the case of two identical images of which one is flipped; however, that exact case rarely occurs in everyday applications.
Disclosure of Invention
In view of the above, the present invention provides a method for implementing image flipping invariance.
In order to achieve the purpose, the invention provides the following technical scheme:
a method of implementing image flip invariance, comprising the steps of:
s1: clustering all feature descriptors in the image and generating visual vocabularies;
s2: establishing an index structure model to store the feature descriptors according to the visual vocabularies;
s3: analyzing the distribution of the main key points by using the visual vocabulary containing the most feature descriptors;
s4: determining whether the object is flipped and performing matching.
Further, in step S1, the feature descriptors are generated by the SIFT algorithm, which comprises the following steps (a brief code sketch is given after step D):
A. detecting a scale space extreme value:
detecting key points by using the difference of Gaussians (DoG), establishing a Gaussian pyramid to ensure scale invariance when two images are matched, and acquiring key points from details to contours;
B. filter out the correct key points:
deleting some erroneous key points by using a Taylor expansion of the scale space function, and filtering out some edge points by using a Hessian matrix, wherein the second-order Taylor expansion of the scale space function and the Hessian matrix are written as follows:

D(x') = D + (∂D^T/∂x)·x' + (1/2)·x'^T·(∂²D/∂x²)·x'

H = [Dxx, Dxy; Dxy, Dyy]

where x represents the actual extreme point, x' represents the increment between the actual extreme point and the detected key point, D^T represents the transpose of D(x), ∂D^T/∂x represents the derivative of D^T with respect to x, and σ represents the Gaussian blur coefficient;
C. direction distribution:
a Gaussian blur function centered on the key point is adopted to eliminate the influence of distant pixels; the area around the key point is divided into 16 sub-blocks, and statistics in eight directions are calculated for each block by trilinear interpolation, giving 128-dimensional data; the number of pixels in each area differs for different values of σ, and each dimension is a decimal number.
D. Feature descriptors are generated.
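The following minimal sketch illustrates steps A to D using OpenCV's built-in SIFT implementation (assuming opencv-python 4.4 or later, where SIFT is available in the main module); the image path is a placeholder.

```python
import cv2

# Placeholder path; any grayscale-readable image works.
img = cv2.imread('original.jpg', cv2.IMREAD_GRAYSCALE)

# OpenCV's SIFT internally performs steps A-D: DoG extrema detection,
# Taylor/Hessian filtering of key points, orientation assignment,
# and generation of the 128-dimensional descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints), descriptors.shape)  # N key points, (N, 128) descriptor array
```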
Further, in step S1, generating a visual vocabulary by k-Means for representing all feature descriptors specifically includes:
dividing all feature descriptors into k cluster points, where the loss function to be minimized is:

E = Σ_{i=1}^{k} Σ_{x∈Ci} ||x − μi||²

where i denotes the i-th cluster point, x denotes one feature descriptor of one picture, Ci denotes the set of feature descriptors belonging to the i-th visual word, and μi denotes the i-th visual word;
the value of E is minimized by iteration, which means that the feature descriptors have high similarity within one cluster and high dissimilarity across different clusters; each visual word is calculated by:

μi = (1/|Ci|) Σ_{x∈Ci} x
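A sketch of this clustering step, assuming scikit-learn's KMeans and the `descriptors` array produced by SIFT above; k is a free parameter here (the embodiment later uses k = 15).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors: np.ndarray, k: int = 15):
    """Cluster all feature descriptors of an image into k visual words.

    KMeans iteratively minimizes E = sum_i sum_{x in Ci} ||x - mu_i||^2;
    each centroid mu_i is one visual word.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)
    return km.cluster_centers_, km.labels_  # visual words, word index per descriptor
```

The returned labels can be used to group the descriptors of the original image under their visual words before the index structure of step S2 is built.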
further, in step S2, the establishing an index structure specifically includes the following steps:
before two images are matched, one image is defined as an original image, and the other image is defined as a matched image;
establishing an index structure model by using the original images, namely defining visual vocabularies generated by the feature descriptors as reference and finding out the same or opposite distribution between the two images;
for matching images, feature descriptors follow the reference and are incorporated into the visual vocabulary.
Further, step S3 specifically includes:
sorting the visual words according to the number of key points they contain in both images, and selecting the visual word that contains the most key points between the two images.
Further, step S4 specifically includes: determining a distribution between the two images, focusing on the distribution of the keypoints over the main objects in the images;
obtaining a center through partial key points;
when two images are matched, a kNN algorithm is adopted to accelerate the matching speed, and the principle of the kNN algorithm is as follows:
m0.distance<α*m1.distance
each pair of feature descriptors is compared using the Euclidean distance, the formula being as follows:

d(p, q) = sqrt( Σ_i (p_i − q_i)² )

where p and q denote the two feature descriptors being compared and i indexes their dimensions;
when the Euclidean distance of the matching pair is smaller, the probability that the matching pair is the same key point is larger, and therefore a matching result is obtained.
Further, the obtaining of the center through a part of the key points specifically includes:
using 10 key points, of which 5 are the leftmost key points and 5 are the rightmost key points, to obtain the center; if the number of key points on the left of the center is greater than on the right, the image label is set to 1, otherwise it is set to 0;
computing the exclusive-or of the two labels of the two images: if the result is 0, the objects in the two images have the same orientation; if the result is 1, the matched image is flipped again to ensure the number of matched pairs.
The invention has the following beneficial effects: it provides a clustered SIFT method that realizes flip invariance by constructing an index structure model and analyzing the distribution of the main key points, with high matching efficiency and high matching precision.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a short process diagram of SIFT, SURF, ORB to generate feature descriptors;
FIG. 2 is a diagram of the matching result of a classic matching algorithm on a flipped object;
FIG. 3 is a schematic diagram of an index model structure;
FIG. 4 is a diagram illustrating a matching result of the present embodiment;
FIG. 5 is a graph of the effect of the number of visual words on accuracy and average processing time;
FIG. 6 is a schematic diagram of the general steps of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
In order to cluster the feature descriptors, the feature descriptors need to be generated first, and currently popular feature descriptor generation algorithms include SIFT, SURF, ORB, PCA-SIFT, and the like.
SIFT is one of the most popular algorithms in machine vision and can be divided into four steps. The first step is the detection of extreme values in the scale space. The difference of Gaussians (DoG) is used to detect key points because the LoG function produces more stable image key points than other candidate functions; however, LoG leads to long computation times, and replacing LoG with DoG gives similar results while saving time. Formulas (1) and (2) show the relationship between DoG and LoG.
σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ)    (1)

DoG = G(x, y, kσ) − G(x, y, σ) = (k − 1)·LoG    (2)
Clearly, they differ only by a coefficient. A Gaussian pyramid is then built to ensure scale invariance when matching two images and to obtain key points from detail to contour.
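As a sketch of this relationship, one octave of a difference-of-Gaussians stack can be built as follows; the base σ = 1.6, k = 2^(1/3) and the number of levels are illustrative assumptions, not values fixed by this description.

```python
import cv2
import numpy as np

def dog_octave(gray, sigma=1.6, k=2 ** (1 / 3), levels=5):
    """Return the DoG images G(x,y,k^(i+1)σ) - G(x,y,k^iσ) for one octave;
    by formula (2), each difference approximates (k-1)*LoG."""
    gray = gray.astype(np.float32)
    blurred = [cv2.GaussianBlur(gray, (0, 0), sigma * (k ** i)) for i in range(levels)]
    return [blurred[i + 1] - blurred[i] for i in range(levels - 1)]
```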
The second step is to filter out the correct key points. Some erroneous key points are deleted using a Taylor expansion of the scale space function, and some edge points are filtered out using a Hessian matrix. The second-order Taylor expansion of the scale space function and the Hessian matrix are written as:

D(x') = D + (∂D^T/∂x)·x' + (1/2)·x'^T·(∂²D/∂x²)·x'    (3)

H = [Dxx, Dxy; Dxy, Dyy]    (4)

where x represents the actual extreme point and x' represents the increment between the actual extreme point and the detected key point.
The third step is orientation assignment. The magnitude and direction of the gradients around a key point are computed at the scale σ, and the different directions are collected to obtain the dominant direction, thereby achieving rotation invariance. Considering that pixels around a key point lie at different distances, a Gaussian blur function centred on the key point is adopted to eliminate the influence of distant pixels. To obtain the feature descriptor, the region around the key point is divided into 16 sub-blocks; under different σ, the number of pixels in each region differs. Each block yields statistics in eight directions, computed by trilinear interpolation, so the final data has 128 dimensions, each of which is a decimal number.
The PCA-SIFT feature descriptor, based on SIFT, improves the efficiency of the SIFT algorithm by reducing the dimension of the SIFT descriptor vector from 128 to 36. In addition, the purpose of SURF is to obtain a fast, short, yet stable feature descriptor. Unlike SIFT, a box filter is used to approximate the original Gaussian filter in order to speed up key point detection. The box filter is related to the original Gaussian filter as follows:
Det(H) = Lxx·Lyy − Lxy·Lxy ≈ Dxx·Dyy − (w·Dxy·Dxy)    (5)
where w equals 0.9. Unlike SIFT, which changes the Gaussian kernel and the image size to detect key points, SURF uses Hessian matrices of different scales without changing the image scale. Moreover, an integral image is used so that convolving a kernel with the image requires only additions and subtractions, reducing the time overhead. The dominant orientation is then derived from a sliding 60-degree sector around each key point. A region is selected around each key point and divided into 4×4 sub-blocks, and the Haar wavelet responses of each sub-block are computed to obtain a 4-component vector consisting of Σdx, Σ|dx|, Σdy, Σ|dy|. Finally, each feature descriptor is a 64-dimensional vector, and each dimension is also a decimal number.
Unlike SIFT and SURF, ORB generates binary feature descriptors with 256 dimensions each. On one hand, this reduces the influence of noise; on the other hand, its biggest advantage is a very high extraction speed, far higher than SIFT and SURF, while scale invariance and rotation invariance are still retained. The algorithm is based on FAST and rBRIEF: the idea of FAST is used to detect key points, which are then described by rBRIEF. Fig. 1 shows a brief view of how SIFT, SURF and ORB generate feature descriptors, where (a) shows the difference in detecting key points for SIFT (top), SURF (middle) and ORB (bottom), (b) shows the difference in obtaining the dominant direction to ensure rotation invariance, and (c) shows the difference in the generated feature descriptors. In SIFT, each feature descriptor has 128 decimal numbers; in SURF, each feature descriptor has 64 decimal numbers; in ORB, 256 point pairs are compared, so each feature descriptor has 256 binary numbers.
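The difference between the three descriptor families can be seen directly from their output shapes; the sketch below assumes OpenCV 4.4 or later, with SURF guarded because it is shipped only in opencv-contrib builds, and the image path is a placeholder.

```python
import cv2

img = cv2.imread('object.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path

sift = cv2.SIFT_create()              # 128 decimal (float32) dimensions per descriptor
orb = cv2.ORB_create(nfeatures=500)   # 256 binary bits, stored as 32 bytes

kp_s, des_s = sift.detectAndCompute(img, None)
kp_o, des_o = orb.detectAndCompute(img, None)
print('SIFT:', des_s.shape)   # (N, 128)
print('ORB :', des_o.shape)   # (M, 32) uint8 = 256 bits

try:
    # SURF: 64 decimal dimensions; available only with opencv-contrib.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp_f, des_f = surf.detectAndCompute(img, None)
    print('SURF:', des_f.shape)  # (K, 64)
except AttributeError:
    print('SURF not available in this OpenCV build')
```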
Although these classical algorithms perform well under geometric transformations and have attracted many researchers, the property of flip invariance is often not considered. In recent years, many advanced algorithms based on SIFT have addressed this problem, and they can be classified into two categories. One is to guarantee flip invariance before generating the feature descriptor: inspired by Stokes' theorem, the F-SIFT descriptor guarantees flip invariance of the local feature by forcing the direction of a local region to follow a predetermined direction. The function used to determine whether to change a local region is shown in equation (6):
Figure BDA0002627841320000061
If the sign of C changes, the local region should be changed accordingly. However, the algorithm computes C for all local regions, many of which do not actually need it, so computing the value of C takes too much time. Another feature descriptor with flip invariance is FIND, which changes the order of the original SIFT descriptor sequence so that flip invariance can be achieved; however, the judgment made by FIND has been found to be highly parameter dependent. Like FIND, the MBR-SIFT descriptor not only changes the array structure but also replaces the decimal numbers with binary ones, and its matching process consists of coarse matching and fine matching. The MAX-SIFT descriptor considers different cases and takes the maximum value of each dimension; however, this descriptor reduces the matching accuracy.
In most experiments, only two identical images are used, one of which is flipped. As a result, most methods can be highly tuned because only this case is considered. In practice, however, the flipped image is usually a different image of the same object, further processed by geometric transformations such as changes of angle or distance, and different images of the same object introduce more uncertainty. With these methods it is difficult to guarantee good performance beyond the case of two identical images. The present solution therefore assumes that it is more useful to focus on changes of the object in the image than on changes of the local feature descriptors, since the same object in different images still shares features despite the transformation.
In the present embodiment, UKBench is used as the test data set. Fig. 2 shows the matching results of classical matching algorithms on a flipped object, where (a) is the picture pair to be matched, (b) is the SURF algorithm, (c) is the SIFT algorithm, and (d) is the ORB algorithm. It can be observed that the matching effect is not ideal, because the flipping operation changes the order of the feature descriptor sequence symmetrically. Inspired by image retrieval systems, a model based on the k-Means algorithm is therefore established before matching.
Therefore, the method of the present embodiment is divided into four steps. The first step is to cluster all feature descriptors in the image and generate the visual vocabulary. The second step builds an index structure model based on these visual words to hold the feature descriptors. The third step analyzes the distribution of the main key points using the visual word that contains the most feature descriptors. Finally, it is determined whether the object is flipped, and matching is performed.
First, after a number of feature descriptors have been generated from an image using SIFT, all feature descriptors are represented by a smaller set of representative feature descriptors, called the visual vocabulary. On the one hand, different images have different numbers of feature descriptors, so it is difficult to guarantee that two images have the same number of feature descriptors; on the other hand, these feature descriptors are distributed in no particular order. All feature descriptors are therefore used to generate a representative visual vocabulary.
In the method of the present embodiment, the visual vocabulary is generated using k-Means because it is simple and converges quickly. The k-Means algorithm divides all feature descriptors into k cluster points; in this embodiment k is set to 15, which means the two images have the same number of visual words. Equation (7) shows the loss function minimized by the algorithm.
E = Σ_{i=1}^{k} Σ_{x∈Ci} ||x − μi||²    (7)

μi = (1/|Ci|) Σ_{x∈Ci} x    (8)
This embodiment minimizes the value of E by iteration, i.e. the feature descriptors have high similarity within one cluster and high dissimilarity across different clusters. Equation (8) gives the formula for calculating each visual word.
Second, an index structure model is constructed after the visual vocabulary has been built. The BoW index structure model, built from the SIFT feature descriptors and the clustering algorithm, is used to analyze the distribution of key points and to judge whether the image has been flipped.
Before matching two images, a relationship should be established between the two images, thus defining one of them as an original image and the other as a matching image. The original image is used to build an index structure model, i.e. the visual vocabulary generated by the feature descriptors is defined as a reference, and the same or opposite distribution between the two images is found. For matching images, feature descriptors should follow these references and be incorporated into these visual words.
First, feature descriptors are generated in the original image by the SIFT algorithm. Then the visual vocabulary is generated from these descriptors, and each descriptor is added to its corresponding visual word. Based on these visual words, the corresponding coordinates of the matched image's feature descriptors are also merged into them. Fig. 3 shows the proposed structure of the index model.
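A sketch of the index structure model as a plain dictionary keyed by visual-word index; `vocabulary` is the array of k-Means centroids from the sketch above, and assigning the matched image's descriptors to the reference words by nearest centroid is an assumption of this illustration.

```python
import numpy as np

def build_index(vocabulary, orig_kp, orig_des, match_kp, match_des):
    """Each visual word stores the key point coordinates of the original
    image and of the matched image whose descriptors fall into it."""
    index = {i: {'original': [], 'matched': []} for i in range(len(vocabulary))}

    def nearest_word(d):
        return int(np.argmin(np.linalg.norm(vocabulary - d, axis=1)))

    for kp, d in zip(orig_kp, orig_des):
        index[nearest_word(d)]['original'].append(kp.pt)   # (x, y) of cv2.KeyPoint
    for kp, d in zip(match_kp, match_des):
        index[nearest_word(d)]['matched'].append(kp.pt)
    return index
```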
Third, after the index structure model is established, difference analysis can only be carried out once the distribution of the key points has been obtained. Not all key points are used, because on the one hand too many key points reduce the apparent difference between the two images, and on the other hand they would take too much time. In this embodiment, only one visual word is used and the distribution of all key points contained in that visual word is analyzed. However, randomly selecting a visual word leads to an uncertain analysis of the two images and a less accurate decision on whether an image is flipped. Therefore, before a visual word is selected, the visual words are sorted according to the number of key points they contain in both images, and the visual word containing the greatest number of key points between the two images is selected.
Fourth, after the visual word has been selected, not only should the distribution between the two images be determined, but the distribution of the key points over the main objects in the images should also be considered. The object therefore has to be found first. When the background noise is low, these key points are distributed around the object and can thus be used to analyze the object in the image.
Some of the key points are then used to obtain a center. In the present embodiment 10 key points are used, of which 5 are the leftmost key points and 5 are the rightmost. If the number of key points on the left of the center is greater than on the right, the image label is set to 1, otherwise it is 0. Finally, the exclusive-or of the two labels of the two images is computed: if the result is 0, the objects in the two images have the same orientation; if the result is 1, the matched image is flipped again to preserve as many matching pairs as possible.
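A sketch of this flip test, operating on the key point coordinates stored in the selected visual word; taking the centre as the mean x-coordinate of the 5 leftmost and 5 rightmost key points is an assumption of this sketch, since the exact centre computation is not spelled out here.

```python
import numpy as np

def flip_label(points):
    """points: (x, y) key point coordinates from the selected visual word."""
    xs = np.sort(np.asarray([p[0] for p in points], dtype=float))
    centre = np.mean(np.concatenate([xs[:5], xs[-5:]]))  # from 10 boundary key points
    return 1 if np.sum(xs < centre) > np.sum(xs > centre) else 0

def needs_reflip(orig_points, match_points):
    """XOR of the two labels: 0 -> same orientation, 1 -> flip the matched image back."""
    return flip_label(orig_points) ^ flip_label(match_points)

# Usage: if needs_reflip(...) == 1, apply cv2.flip(matched_image, 1) before matching.
```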
When the two images are matched, the kNN algorithm is adopted to accelerate matching. In this embodiment, k is set to 2, and equation (9) shows the principle of the kNN algorithm.
m0.distance<α*m1.distance (9)
A matching pair is considered good if the distance of the first match is smaller than the distance of the second match multiplied by a coefficient α. Each pair of feature descriptors p and q is compared using the Euclidean distance shown in equation (10): the smaller the Euclidean distance of a matched pair, the more likely it represents the same key point. When the matching process is completed, the final matching pairs are obtained. Fig. 4 shows the matching result of the present embodiment.
d(p, q) = sqrt( Σ_i (p_i − q_i)² )    (10)
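A sketch of the kNN matching step with the ratio test of equation (9), using OpenCV's brute-force matcher with the L2 (Euclidean) norm of equation (10); α = 0.8 follows the value given for this embodiment.

```python
import cv2

def ratio_match(des1, des2, alpha=0.8):
    """k = 2 nearest-neighbour matching; keep a pair when
    m0.distance < alpha * m1.distance (equation (9))."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)          # Euclidean distance, equation (10)
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < alpha * p[1].distance]
    return good
```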
Finally, the clustered SIFT method proposed in this example was evaluated on a common data set UKBench, which contains 10200 images from 2550 subjects. Each object is composed of four images taken from different perspectives or under different imaging conditions.
Comprehensive comparison is performed with classical feature matching methods, such as comparing the number of matching pairs, the number of generated feature descriptors, and the accuracy of the match. In order to evaluate the accuracy of the clustered SIFT, five methods of ORB, SURF, SIFT, PCA-SIFT and clustered SIFT are adopted in the matching experiment. Images were randomly selected and subjected to matching experiments. The matching results are shown in table 1.
TABLE 1
[Table 1: number of matching pairs, number of generated feature descriptors, and matching accuracy for ORB, SURF, SIFT, PCA-SIFT and the clustered SIFT method]
The results show that, compared with the other methods, the method of this embodiment not only yields a larger number of matching pairs but also a clearly higher accuracy. This means that the method can effectively handle the flipped case, because the two images are already brought into the best matching direction before matching; it is assumed that the two images have the largest number of matching pairs in a certain direction even though one of them is flipped. In equation (9), α is set to 0.8 in this embodiment.
Next, 120 image pairs were randomly tested to investigate the effect of the number of visual words on the accuracy and the average processing time. The results are shown in Fig. 5, where (a) is the average accuracy of determining whether an image is flipped and (b) is the average processing time per image. From (a) it can be seen that the accuracy first increases with the number of visual words but starts to decrease once the number exceeds 15; from (b) it can be seen that the time needed to decide whether an image is flipped keeps increasing with the number of visual words. In other words, if the number of visual words is small, each visual word contains too many key points, which reduces the variance of the distribution; if the number of visual words is too large, each visual word contains too few key points, which also reduces the variance of the distribution and therefore the accuracy of the decision.
Fig. 6 shows the general steps of the present invention.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (7)

1. A method for realizing image turnover invariance, characterized in that the method comprises the following steps:
s1: clustering all feature descriptors in the image and generating visual vocabularies;
s2: establishing an index structure model to store the feature descriptors according to the visual vocabularies;
s3: analyzing the distribution of the main key points by using the visual vocabulary containing the most feature descriptors;
s4: determining whether the object is flipped and performing matching.
2. The method for realizing image flipping invariance of claim 1, wherein: in step S1, the feature descriptor is generated by a SIFT algorithm, and includes the following steps:
A. detecting a scale space extreme value:
detecting key points by using the difference of Gaussians DoG, establishing a Gaussian pyramid to ensure scale invariance when two images are matched, and acquiring key points from details to contours;
B. filter out the correct key points:
deleting some erroneous key points by using a Taylor expansion of the scale space function, and filtering out some edge points by using a Hessian matrix, wherein the second-order Taylor expansion of the scale space function and the Hessian matrix are written as follows:

D(x') = D + (∂D^T/∂x)·x' + (1/2)·x'^T·(∂²D/∂x²)·x'

H = [Dxx, Dxy; Dxy, Dyy]

where x represents the actual extreme point, x' represents the increment between the actual extreme point and the detected key point, D^T represents the transpose of D(x), ∂D^T/∂x represents the derivative of D^T with respect to x, and σ represents the Gaussian blur coefficient;
C. direction distribution:
a Gaussian blur function centered on the key point is adopted to eliminate the influence of distant pixels; the area around the key point is divided into 16 sub-blocks, and statistics in eight directions are calculated for each block by trilinear interpolation, giving 128-dimensional data; the number of pixels in each area differs for different values of σ, and each dimension is a decimal number.
D. Feature descriptors are generated.
3. The method for realizing image flipping invariance of claim 2, wherein: in step S1, a visual vocabulary is generated by k-Means to represent all feature descriptors, which specifically includes:
dividing all feature descriptors into k cluster points, where the loss function to be minimized is:

E = Σ_{i=1}^{k} Σ_{x∈Ci} ||x − μi||²

where i denotes the i-th cluster point, x denotes one feature descriptor of one picture, Ci denotes the set of feature descriptors belonging to the i-th visual word, and μi denotes the i-th visual word;
the value of E is minimized by iteration, which means that the feature descriptors have high similarity within one cluster and high dissimilarity across different clusters; each visual word is calculated by:

μi = (1/|Ci|) Σ_{x∈Ci} x
4. the method for realizing image flipping invariance of claim 1, wherein: in step S2, the establishing an index structure specifically includes the following steps:
before two images are matched, one image is defined as an original image, and the other image is defined as a matched image;
establishing an index structure model by using the original images, namely defining visual vocabularies generated by the feature descriptors as reference and finding out the same or opposite distribution between the two images;
for matching images, feature descriptors follow the reference and are incorporated into the visual vocabulary.
5. The method for realizing image flipping invariance of claim 1, wherein: step S3 specifically includes:
sorting the visual words according to the number of key points they contain in both images, and selecting the visual word that contains the most key points between the two images.
6. The method for realizing image flipping invariance of claim 1, wherein: step S4 specifically includes: determining a distribution between the two images, focusing on the distribution of the keypoints over the main objects in the images;
obtaining a center through partial key points;
when two images are matched, a kNN algorithm is adopted to accelerate the matching speed, and the principle of the kNN algorithm is as follows:
m0.distance<α*m1.distance
each pair of feature descriptors is compared using the Euclidean distance, the formula being as follows:

d(p, q) = sqrt( Σ_i (p_i − q_i)² )

where p and q denote the two feature descriptors being compared and i indexes their dimensions;
when the Euclidean distance of the matching pair is smaller, the probability that the matching pair is the same key point is larger, and therefore a matching result is obtained.
7. The method of claim 6, wherein: the obtaining of the center through part of the key points specifically comprises:
using 10 key points, of which 5 are the leftmost key points and 5 are the rightmost key points, to obtain the center; if the number of key points on the left of the center is greater than on the right, the image label is set to 1, otherwise it is set to 0;
computing the exclusive-or of the two labels of the two images: if the result is 0, the objects in the two images have the same orientation; if the result is 1, the matched image is flipped again to ensure the number of matched pairs.
CN202010802264.4A 2020-08-11 2020-08-11 Method for realizing image turnover invariance Active CN111931791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010802264.4A CN111931791B (en) 2020-08-11 2020-08-11 Method for realizing image turnover invariance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010802264.4A CN111931791B (en) 2020-08-11 2020-08-11 Method for realizing image turnover invariance

Publications (2)

Publication Number Publication Date
CN111931791A true CN111931791A (en) 2020-11-13
CN111931791B CN111931791B (en) 2022-10-11

Family

ID=73310711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010802264.4A Active CN111931791B (en) 2020-08-11 2020-08-11 Method for realizing image turnover invariance

Country Status (1)

Country Link
CN (1) CN111931791B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661618A (en) * 2009-06-05 2010-03-03 天津大学 Method for extracting and describing image characteristics with turnover invariance
CN105678349A (en) * 2016-01-04 2016-06-15 杭州电子科技大学 Method for generating context descriptors of visual vocabulary
CN111368126A (en) * 2017-02-13 2020-07-03 哈尔滨理工大学 Image retrieval-oriented generation method

Also Published As

Publication number Publication date
CN111931791B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
EP1594078B1 (en) Multi-image feature matching using multi-scale oriented patches
Liu et al. Evaluation of LBP and deep texture descriptors with a new robustness benchmark
Tabia et al. Compact vectors of locally aggregated tensors for 3D shape retrieval
Cao et al. Similarity based leaf image retrieval using multiscale R-angle description
KR20140102038A (en) Video matching device and video matching method
CN108509925B (en) Pedestrian re-identification method based on visual bag-of-words model
CN111400528B (en) Image compression method, device, server and storage medium
US9747521B2 (en) Frequency domain interest point descriptor
CN111753119A (en) Image searching method and device, electronic equipment and storage medium
CN113592030B (en) Image retrieval method and system based on complex value singular spectrum analysis
CN111582142B (en) Image matching method and device
CN110704667B (en) Rapid similarity graph detection method based on semantic information
CN110705569A (en) Image local feature descriptor extraction method based on texture features
Arjun et al. An efficient image retrieval system based on multi-scale shape features
CN111931791B (en) Method for realizing image turnover invariance
Ramesh et al. Multiple object cues for high performance vector quantization
Kavitha et al. A robust script identification system for historical Indian document images
CN112445926A (en) Image retrieval method and device
Winter et al. Differential feature distribution maps for image segmentation and region queries in image databases
Bandara et al. Nature inspired dimensional reduction technique for fast and invariant visual feature extraction
Van Beusekom et al. Example-based logical labeling of document title page images
Amelio Approximate matching in ACSM dissimilarity measure
Tang et al. A GMS-guided approach for 2D feature correspondence selection
Agarwal et al. Haar-like local ternary pattern for image retrieval
CN108334884B (en) Handwritten document retrieval method based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant