CN110688976A - Store comparison method based on image identification - Google Patents

Store comparison method based on image identification

Info

Publication number
CN110688976A
Authority
CN
China
Prior art keywords
store
pictures
features
global
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910953557.XA
Other languages
Chinese (zh)
Inventor
张发恩
杨麒弘
秦永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innovation Qizhi (beijing) Technology Co Ltd
Original Assignee
Innovation Qizhi (beijing) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innovation Qizhi (beijing) Technology Co Ltd
Priority to CN201910953557.XA
Publication of CN110688976A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/10: Terrestrial scenes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The invention discloses a store comparison method based on image recognition, comprising a training stage: 1) forming mini-batches of m × k store pictures; 2) passing the store pictures through the last convolutional layer of a neural network to obtain a feature map of size (N, C, H, W); applying global pooling to obtain the global features of the store pictures, constructing triplets with a tri-hard strategy, and training with a triplet loss function; 3) obtaining local regions of interest, computing the corresponding regions on the feature map, applying global pooling to obtain the local features of the store pictures, and training with the local features and the triplet loss function; and an inference stage: combining the global and local features of a store picture to obtain its final feature representation; computing the distance between the final features of two store pictures and, if the distance is below a specified threshold, judging them to be pictures of the same store, otherwise not. The invention has the advantage of improving store recognition.

Description

Store comparison method based on image identification
Technical Field
The invention relates to the technical field of store identification, and in particular to a store comparison method based on image recognition.
Background
Most of the prior art is based on deep learning: an original picture is scaled to different sizes, a neural network learns a global feature with a softmax loss function, and that feature is used for subsequent retrieval or comparison tasks. In an actual scene, however, the angle and lighting of store photos taken by users vary greatly, which increases the difficulty of recognition. Some photos even contain a large amount of non-store background, such as roads, neighboring stores and sky, so a global feature alone can hardly describe the store well, which degrades the final recognition result.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a store comparison method based on image recognition that improves store recognition.
In order to achieve the purpose, the invention adopts the following technical scheme:
an image recognition-based store comparison method comprises the following steps:
a training stage:
1) selecting m stores each time and k pictures per store, forming a mini-batch of m × k store pictures; the batch size is the number of pictures fed to the neural network at once for training, here m × k, so m × k store pictures are fed in one training iteration;
2) obtaining a feature map of size (N, C, H, W) after the store pictures pass through the last convolutional layer of the neural network, where N is the batch size of store pictures, C the number of channels, H the height and W the width;
obtaining the global features of the store pictures through global pooling (global average pooling or global max pooling), constructing triplets with a tri-hard strategy, and training with a triplet loss function; in the tri-hard strategy, the m × k pictures from step 1) are propagated forward through the neural network to give the features of the m × k samples as m × k output vectors, the pairwise Euclidean distances between these vectors are computed, and from them the hardest positive pair (two samples of the same store whose features are farthest apart) and the hardest negative pair (two samples of different stores whose features are closest) are found; the triplet loss function is max(0, distance of the hardest positive pair - distance of the hardest negative pair + margin), where margin is an adjustable parameter;
3) using the ORB algorithm on the triplets obtained in step 2) to find feature points on the original pictures, taking a neighborhood around each feature point as a local region of interest, computing the corresponding region on the feature map, mapping that region to a fixed-size feature map with RoI Align, obtaining the local features of the store pictures through global pooling (global average pooling or global max pooling), and training with the local features and the triplet loss function;
an inference stage:
inputting two store pictures and propagating them forward through the neural network to obtain their global features and the feature map after the last convolution; obtaining the feature points and local regions of interest of the two store pictures with the ORB algorithm, and obtaining their local features by the same operations as step 3); combining the global and local features of each store picture to obtain its final feature representation; and determining whether the two pictures show the same store using these final features: the distance between the two final features is computed, and if it is below a specified threshold the two pictures are judged to show the same store, otherwise not.
The distance is the Euclidean distance between the two final feature vectors, and the specified threshold is 1.8-2.5.
The neural network in step 1) adopts one of the following structures: an existing network structure with the last fully-connected layer removed; or the same trunk with, after the last convolutional layer, a global average pooling layer followed by a fully-connected layer of dimension 512 and a batchnorm layer.
The invention has the following beneficial effects:
in the training process, when the number of classes is large, the triplet loss function has no final fully-connected classification layer, unlike the softmax loss function, which removes a large number of parameters and saves a large amount of storage space. In addition, the features learned with the triplet loss are more discriminative: as can be seen from the loss function, features of the same class are constrained so that intra-class feature distances shrink while inter-class distances grow, which improves the discriminative power of the features. The local features and the global features share one neural network and are obtained in a single forward pass, adding little extra computational overhead. Together, the global and local features describe and represent the picture better and improve the accuracy of comparison.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic framework diagram of the store comparison method based on image recognition according to the present invention.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
The drawings are for illustration only, are not drawn to actual form or scale, and are not to be construed as limiting the present patent; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
The same or similar reference numerals in the drawings of the embodiments correspond to the same or similar components. In the description of the present invention, terms such as "upper", "lower", "left", "right", "inner" and "outer" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplicity of description, do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation, and are not to be construed as limiting the present patent. The specific meanings of such terms may be understood by those skilled in the art according to the specific situation.
The invention relates to an image identification-based store comparison method, which comprises the following steps:
firstly, extracting the global features of store pictures with a neural network;
Neural network refers broadly to a convolutional neural network, such as vgg16, vgg19, resnet, inception or densenet. Two structures are mainly used: (1) an existing network structure used directly, with the last fully-connected layer removed, for feature extraction; (2) a network whose basic skeleton is one of these structures but with the last few layers modified; specifically, the last convolutional layer is followed by a global average pooling layer, then a fully-connected layer of dimension 512 and a batchnorm layer.
Secondly, detecting feature points in the store pictures, taking a neighborhood around each feature point as a region of interest, and extracting the features of these local regions through the neural network as local features;
Thirdly, combining the local features and the global features to obtain the final features expressing the store pictures;
Fourthly, using the final features of the store pictures to determine whether two pictures show the same store: the distance between the features is computed, and if it is below a specified threshold the pictures are judged to show the same store, otherwise not. The features are the neural network's output together with the local-feature results. Taking resnet50 as an example: with structure (1) above (the last fully-connected layer removed), each feature is a 2048-dimensional vector, so global plus local features give a 4096-dimensional vector; with structure (2), the output is a 512-dimensional feature vector, so global plus local features give a 1024-dimensional vector. The distance is the Euclidean distance between the two vectors; when it is below 1.8-2.5, the pictures are judged to show the same store.
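By way of an illustrative PyTorch sketch of the two structures just described (assuming a torchvision resnet50 trunk; the class and parameter names are illustrative, not from the disclosure):

```python
import torch.nn as nn
import torchvision

class StoreFeatureNet(nn.Module):
    """Sketch of the two backbone variants described above.
    Variant (1): an off-the-shelf network with its final fully-connected
    layer removed (2048-d features for resnet50). Variant (2): the same
    trunk followed by global average pooling, a 512-d fully-connected
    layer and a batchnorm layer (512-d features)."""

    def __init__(self, embed_dim=512, use_embedding_head=True):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # keep everything up to the last conv block; drop avgpool + fc
        # so the (N, C, H, W) feature map stays accessible for step 3)
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])
        self.use_embedding_head = use_embedding_head
        if use_embedding_head:  # variant (2)
            self.fc = nn.Linear(2048, embed_dim)
            self.bn = nn.BatchNorm1d(embed_dim)

    def forward(self, x):
        fmap = self.trunk(x)              # (N, 2048, H, W) feature map
        feat = fmap.mean(dim=(2, 3))      # global average pooling -> (N, 2048)
        if self.use_embedding_head:
            feat = self.bn(self.fc(feat)) # -> (N, 512)
        return feat, fmap
```

Returning the feature map alongside the pooled feature lets the later local-feature step reuse the same single forward pass, as the description requires.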
In detail, the store comparison method based on image recognition comprises the following steps:
a training stage:
1) selecting m stores each time and k pictures per store, forming a mini-batch of size m × k. The batch size is the number of pictures fed to the neural network at once for training: if 16 stores are selected with 4 pictures per store, the batch size is 16 × 4 = 64, i.e. 64 pictures are fed in one training iteration.
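A minimal sketch of this sampling scheme (assuming `labels` is a plain Python list mapping picture index to store id; the names are illustrative):

```python
import random
from collections import defaultdict

def sample_mini_batch(labels, m=16, k=4):
    """Pick m stores at random and k picture indices per store,
    giving a mini-batch of m * k samples as described above."""
    by_store = defaultdict(list)
    for idx, store_id in enumerate(labels):
        by_store[store_id].append(idx)
    stores = random.sample(list(by_store), m)
    batch = []
    for s in stores:
        # sample with replacement in case a store has fewer than k pictures
        batch.extend(random.choices(by_store[s], k=k))
    return batch  # len(batch) == m * k, e.g. 16 * 4 = 64
```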
2) Obtaining a feature map of size (N, C, H, W) after the last convolutional layer of the neural network. N is the number of store pictures, i.e. the batch size, the number of pictures input at a time (N = 1 for one picture, N = 8 for eight). C is the number of channels: an RGB input image has 3 channels, while here C equals the number of convolution kernels in the last convolutional layer, e.g. 2048 kernels give a channel dimension of 2048. H and W are the height and width of the feature map. The (N, C, H, W) feature map is the computation result obtained after the convolution operations on the input pictures. The neural network follows the traditional classification structure: convolutional layers followed by a fully-connected layer for classification.
The global features of the store pictures are obtained through global pooling, i.e. global average pooling or global max pooling, and a tri-hard strategy is adopted to construct triplets. The triplets are mainly used to build the loss function and train the neural network; training uses the triplet loss function, and after training the model's parameters are obtained, such as the weights and bias terms of the convolution kernels in its convolutional layers. "tri-hard" is an industry term; a direct translation is "hard triplet" strategy.
The triplets are constructed as follows. The input is m × k store photos, e.g. m = 16 and k = 4, i.e. 64 input pictures. When training a neural network, pictures are usually fed in batches; here the batch size is 64: 16 stores are randomly selected and 4 samples are randomly chosen per store, so each batch has size 64. After forward propagation through the neural network, the features of the 64 samples are obtained as a 64 × 512 matrix, i.e. for the input batch size of 64 the output is 64 feature vectors of dimension 512. The pairwise Euclidean distances between the 64 vectors are then computed, and the hardest positive and hardest negative pairs are found from these distances; how to find them is described in reference [1], "In Defense of the Triplet Loss for Person Re-Identification" (https://arxiv.org/abs/1703.07737);
the hardest positive sample pair refers to a pair of samples with the largest distance of the features in the same store, namely d (a, p) in the corresponding formula, and the hardest negative sample pair refers to a pair of samples with the closest distance of the features in different stores, namely d (a, n) in the corresponding formula. Training a triple loss function: using the above-found pairs of the most difficult positive samples and the most difficult negative samples, the loss function is trained with max (0, the distance of the most difficult positive sample pair-the distance of the most difficult negative sample pair + margin), margin being an adjustable parameter, such as 0.4. Specifically, the triplet loss function L is max (d (a, p) -d (a, n) + margin,0), and the input is a triplet < a, p, n >, a: anchor sample, p: positive, a sample of the same class as a, n: negative, which is a different class of samples from a, d (a, p) represents the distance between pairs of positive samples, d (a, n) represents the distance between pairs of negative samples. The pooling is a layer which is very commonly used in deep learning, and the global average pooling is to average the (N, C, H, W) feature map obtained in the above two last dimensions to obtain the (N, C, 1, 1) feature map. Since the average is obtained in the H and W dimensions as a whole, and the information of the whole image is sufficiently considered, this feature map of (N, C, 1, 1) is generally referred to as a global feature.
3) From the triplets obtained in step 2) (with an input batch size of 64 there are 64 triplets), the ORB algorithm finds the feature points of each triplet on the original picture, i.e. the picture input to the neural network. A specified area around each feature point is taken as a region of interest; the corresponding region on the feature map is then computed, mapped to a fixed-size feature map with RoI Align, and pooled to give the local feature expression, which is trained together with the triplet loss function. Specifically, after the key points, i.e. feature points, in a picture are obtained, the region around each key point, i.e. the region of interest, is extracted by expanding 150 pixels up, down, left and right from the key point. During forward propagation the picture is downsampled and reduced in size: for example, with an original size of 640x640 and a resnet50 backbone, the picture is downsampled 32 times and the feature map becomes 20x20. The region of interest therefore has to be shrunk correspondingly, mapping it from the original picture onto the small feature map to obtain the region of interest in feature-map coordinates.
The feature points can be understood simply as the more salient points in a picture, such as contour points, bright points in darker areas or dark points in lighter areas. ORB uses a modified FAST algorithm to detect feature points and a modified BRIEF algorithm to compute feature descriptors. Briefly, FAST inspects a circle of pixel values around a candidate point: if enough pixels in the surrounding area differ sufficiently in gray value from the candidate, the candidate is taken as a feature point. The modification improves scale invariance by extracting FAST feature points from the image at different scales and combining them. BRIEF computes a binary-string descriptor for each feature point: it selects n pairs of pixels pi, qi (i = 1, 2, ..., n) in the neighborhood of the feature point and compares the gray values of each pair; if I(pi) > I(qi), a 1 is generated in the binary string, otherwise a 0. Comparing all pairs yields a binary string of length n, where n is typically 128, 256 or 512, with 256 the usual choice. ORB's modification of BRIEF mainly guarantees rotation invariance.
The ORB algorithm is a classic computer-vision algorithm: "ORB: an efficient alternative to SIFT or SURF" is a paper on feature point extraction and matching published by Rublee et al. at ICCV 2011, used mainly to extract and match feature points between two images and obtain the points that correspond across them.
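A sketch of the key-point-to-region step using OpenCV's ORB detector (the 150-pixel padding and the stride of 32 follow the example above; the function and variable names are illustrative):

```python
import cv2
import numpy as np

def orb_regions(image_bgr, pad=150, stride=32, max_kp=8):
    """Detect ORB key points, expand each by `pad` pixels on every side
    to form a region of interest, then rescale the boxes by the network
    stride (32 for a resnet50 trunk) into feature-map coordinates."""
    orb = cv2.ORB_create(nfeatures=max_kp)
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    keypoints = orb.detect(gray, None)
    if not keypoints:
        return np.zeros((0, 4), dtype=np.float32)
    h, w = gray.shape
    boxes = []
    for kp in keypoints:
        x, y = kp.pt
        x1, y1 = max(x - pad, 0), max(y - pad, 0)
        x2, y2 = min(x + pad, w - 1), min(y + pad, h - 1)
        boxes.append([x1 / stride, y1 / stride, x2 / stride, y2 / stride])
    return np.array(boxes, dtype=np.float32)  # (num_kp, 4), feature-map coords
```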
RoI Align is an industry term; the translation is "align the region of interest", i.e. map a region on the feature map to a fixed size, for example turning the region of interest into a 7 × 7 grid.
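A minimal sketch of this step with torchvision's roi_align, assuming the regions are already rescaled to feature-map coordinates as in the sketch above and using the 7 × 7 output grid just mentioned:

```python
import torch
from torchvision.ops import roi_align

def local_features(fmap, boxes):
    """Crop each region of interest out of the conv feature map,
    resample it to a fixed 7x7 grid with RoI Align, then global-
    average-pool it into one local feature vector per region.
    fmap: (1, C, H, W) feature map; boxes: (num_regions, 4) float
    tensor of (x1, y1, x2, y2) in feature-map coordinates."""
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index
    crops = roi_align(fmap, rois, output_size=(7, 7))             # (num_regions, C, 7, 7)
    return crops.mean(dim=(2, 3))                                  # (num_regions, C)
```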
An inference stage:
Two store pictures are input and propagated forward through the neural network to obtain their global features and the feature map after the last convolution; the ORB algorithm yields the feature points and local regions of interest of the two pictures, and their local features are obtained by the same operations as in step 3). The global and local features of each store picture are then combined, i.e. concatenated, to obtain its final feature representation, a vector: for example, a 512-dimensional global feature and a 512-dimensional local feature concatenate into a 1024-dimensional final feature;
Whether the two store pictures show the same store is determined with their final features: the distance between the two final features is computed, i.e. the Euclidean distance between the two concatenated vectors (each picture contributing two 512-dimensional features, hence two 1024-dimensional final vectors). If the distance is below a specified threshold the pictures are judged to show the same store, otherwise not.
The distance is the Euclidean distance between the two final feature vectors, and the specified threshold is 1.8-2.5: the pictures are judged to show the same store when the distance falls below it.
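Putting the inference stage together, a minimal sketch of the final comparison (the 2.0 threshold is one arbitrary choice inside the 1.8-2.5 range given above; names are illustrative):

```python
import torch

def same_store(feat_a, feat_b, threshold=2.0):
    """Compare two store pictures. feat_a and feat_b are each a pair
    (global_feature, local_feature); the pair is concatenated into one
    final vector (e.g. 512 + 512 -> 1024) and the Euclidean distance
    between the two final vectors is thresholded."""
    final_a = torch.cat(feat_a)
    final_b = torch.cat(feat_b)
    return torch.dist(final_a, final_b).item() < threshold
```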
The neural network in step 1) adopts one of the following structures. Either the last fully-connected layer is removed directly from the convolutional neural network structure, because the last layer of a classical classification network exists only for classification, which this invention does not need; or, after the last convolutional layer, a global average pooling layer (GAP) is attached to fuse global information, followed by a fully-connected layer of dimension 512 (the feature dimension) and a batchnorm layer, which yields the final feature. Batchnorm is a structure commonly used in deep learning for normalization that speeds up training convergence. The overall structure is: input picture -> convolutional layers -> GAP -> fully-connected layer -> batchnorm layer.
In summary, the store comparison method based on image recognition first extracts the global features of stores with a neural network, then detects feature points in the picture with the ORB algorithm, takes a neighborhood around each feature point as a region of interest, extracts the features of these local regions through the neural network as local features, and finally combines the local and global features into the final feature expression. This feature is used to determine whether two pictures show the same store.
When the number of categories is large during training, the triplet loss has no final fully-connected classification layer, unlike a softmax loss, which removes a large number of parameters and saves a large amount of storage space. In addition, the features learned with the triplet loss are more discriminative. The local and global features share one neural network and are obtained in a single forward pass without much extra computational overhead. The global and local features together describe and represent the picture better and improve the accuracy of comparison.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and the technical principles used, and any changes or substitutions which can be easily conceived by those skilled in the art within the technical scope of the present invention disclosed herein should be covered within the protective scope of the present invention.

Claims (3)

1. An image recognition-based store comparison method is characterized by comprising the following steps:
a training stage:
1) selecting m stores each time and k pictures per store, forming a mini-batch of m × k store pictures, where the batch size is the number of pictures fed to the neural network at once for training, here m × k, so that m × k store pictures are fed in one training iteration ("mini" is merely part of the label);
2) after the store pictures pass through the last convolutional layer of the neural network, obtaining a feature map of size (N, C, H, W), i.e. the computation result of the convolutions, where N is the batch size of store pictures, C the number of channels, i.e. the number of convolution kernels in the last convolutional layer of the neural network, H the height and W the width, the feature map being the features obtained after the convolution operations on the input pictures;
obtaining the global features of the store pictures through global pooling (global average pooling or global max pooling) and constructing triplets with a tri-hard strategy, the triplets mainly serving to build the loss function and train the neural network, training being performed with the triplet loss function, and the parameters of the neural network model being obtained after training; in the tri-hard strategy, the m × k store pictures input in step 1) are propagated forward through the neural network to obtain the features of the m × k samples as m × k output vectors; the pairwise Euclidean distances between the m × k vectors are computed, and the hardest positive pair and the hardest negative pair are found from these distances, the hardest positive pair being two samples of the same store whose features are farthest apart, and the hardest negative pair being two samples of different stores whose features are closest; the triplet loss function is max(0, distance of the hardest positive pair - distance of the hardest negative pair + margin), margin being an adjustable parameter;
3) using the ORB algorithm on the triplets obtained in step 2) to find the feature points of each triplet on the original pictures, i.e. the pictures input to the neural network, taking a neighborhood around each feature point as a local region of interest, then computing the corresponding region on the feature map, mapping that region to a fixed-size feature map with RoI Align, obtaining the local features of the store pictures through global pooling (global average pooling or global max pooling), and likewise training with the local features and the triplet loss function;
an inference stage:
inputting two store pictures and propagating them forward through the neural network to obtain their global features and the feature map after the last convolution; obtaining the feature points and local regions of interest of the two store pictures with the ORB algorithm, and obtaining their local features by the same operations as step 3); combining, i.e. concatenating, the global and local features of each store picture to obtain its final feature representation;
determining whether the two store pictures show the same store using their final features: computing the distance between the two final features and, if the distance is below a specified threshold, judging the two pictures to show the same store, otherwise not, the distance being the Euclidean distance between the two final feature vectors.
2. The image recognition-based store comparison method according to claim 1, wherein the predetermined threshold is 1.8-2.5.
3. The image recognition-based store comparison method according to claim 1, wherein the neural network of step 1) adopts the following structure: the last fully-connected layer is removed directly from the convolutional neural network structure; or, after the last convolutional layer, a global average pooling layer is attached to fuse global information, followed by a fully-connected layer of dimension 512 and a batchnorm layer, which yields the final feature.
CN201910953557.XA 2019-10-09 2019-10-09 Store comparison method based on image identification Pending CN110688976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910953557.XA CN110688976A (en) 2019-10-09 2019-10-09 Store comparison method based on image identification


Publications (1)

Publication Number Publication Date
CN110688976A (en) 2020-01-14

Family

ID=69111715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910953557.XA Pending CN110688976A (en) 2019-10-09 2019-10-09 Store comparison method based on image identification

Country Status (1)

Country Link
CN (1) CN110688976A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778604A (en) * 2015-12-15 2017-05-31 西安电子科技大学 Pedestrian's recognition methods again based on matching convolutional neural networks
CN106778527A (en) * 2016-11-28 2017-05-31 中通服公众信息产业股份有限公司 A kind of improved neutral net pedestrian recognition methods again based on triple losses
CN106845499A (en) * 2017-01-19 2017-06-13 清华大学 A kind of image object detection method semantic based on natural language
CN106897390A (en) * 2017-01-24 2017-06-27 北京大学 Target precise search method based on depth measure study
CN108197538A (en) * 2017-12-21 2018-06-22 浙江银江研究院有限公司 A kind of bayonet vehicle searching system and method based on local feature and deep learning
CN109034044A (en) * 2018-06-14 2018-12-18 天津师范大学 A kind of pedestrian's recognition methods again based on fusion convolutional neural networks
CN109145766A (en) * 2018-07-27 2019-01-04 北京旷视科技有限公司 Model training method, device, recognition methods, electronic equipment and storage medium
CN109558823A (en) * 2018-11-22 2019-04-02 北京市首都公路发展集团有限公司 A kind of vehicle identification method and system to scheme to search figure
CN109635695A (en) * 2018-11-28 2019-04-16 西安理工大学 Pedestrian based on triple convolutional neural networks recognition methods again
CN109784166A (en) * 2018-12-13 2019-05-21 北京飞搜科技有限公司 The method and device that pedestrian identifies again
CN109800794A (en) * 2018-12-27 2019-05-24 上海交通大学 A kind of appearance similar purpose identifies fusion method and system across camera again
CN109934197A (en) * 2019-03-21 2019-06-25 深圳力维智联技术有限公司 Training method, device and the computer readable storage medium of human face recognition model
CN110008861A (en) * 2019-03-21 2019-07-12 华南理工大学 A kind of recognition methods again of the pedestrian based on global and local feature learning
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914642A (en) * 2020-06-30 2020-11-10 浪潮电子信息产业股份有限公司 Pedestrian re-identification method, device, equipment and medium
CN111914642B (en) * 2020-06-30 2023-09-01 浪潮电子信息产业股份有限公司 Pedestrian re-identification method, device, equipment and medium
CN114863138A (en) * 2022-07-08 2022-08-05 腾讯科技(深圳)有限公司 Image processing method, image processing apparatus, storage medium, and device
CN114863138B (en) * 2022-07-08 2022-09-06 腾讯科技(深圳)有限公司 Image processing method, device, storage medium and equipment
CN115357742A (en) * 2022-08-02 2022-11-18 广州市玄武无线科技股份有限公司 Store image duplicate checking method, system, terminal device and storage medium

Similar Documents

Publication Publication Date Title
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN112966691B (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN107239730B (en) Quaternion deep neural network model method for intelligent automobile traffic sign recognition
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN111144376B (en) Video target detection feature extraction method
US20020051578A1 (en) Method and apparatus for object recognition
CN110688976A (en) Store comparison method based on image identification
CN112668648B (en) Infrared and visible light fusion recognition method based on symmetrical fusion network
CN111126412B (en) Image key point detection method based on characteristic pyramid network
CN107169417B (en) RGBD image collaborative saliency detection method based on multi-core enhancement and saliency fusion
CN115496928B (en) Multi-modal image feature matching method based on multi-feature matching
CN112052876A (en) Improved RA-CNN-based fine-grained image detection method and system
CN113744153B (en) Double-branch image restoration forgery detection method, system, equipment and storage medium
CN113962281A (en) Unmanned aerial vehicle target tracking method based on Siamese-RFB
CN113393434A (en) RGB-D significance detection method based on asymmetric double-current network architecture
CN112767440A (en) Target tracking method based on SIAM-FC network
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
CN114743045B (en) Small sample target detection method based on double-branch area suggestion network
CN116188361A (en) Deep learning-based aluminum profile surface defect classification method and device
CN115393714A (en) Power transmission line bolt missing pin detection method based on fusion graph theory reasoning
CN111931767A (en) Multi-model target detection method, device and system based on picture information degree and storage medium
CN112560824A (en) Facial expression recognition method based on multi-feature adaptive fusion
CN113963150B (en) Pedestrian re-identification method based on multi-scale twin cascade network
CN115294371B (en) Complementary feature reliable description and matching method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200114)