CN108875076B - Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network - Google Patents
- Publication number
- CN108875076B (application CN201810750096.1A / CN201810750096A)
- Authority
- CN
- China
- Prior art keywords
- attention
- trademark
- network
- training
- mac
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a rapid trademark image retrieval method based on an Attention mechanism and a convolutional neural network, which comprises the steps of constructing a Caffe deep learning open source framework and training an open source VGG16 network model; designing an Attention network comprising two convolutional layers based on a VGG16 network model, and adding the Attention network into a trained VGG16 network model; training the VGG16 network model added with the Attention network by using a training set in a FlickrLogos-32 data set; generating an Attention-MAC trademark feature extraction model based on a trained VGG16 network model added with an Attention network; and retrieving the trademark image to be queried based on the Attention-MAC trademark feature extraction model, and generating a retrieval result. The invention avoids using the redundant parameters of the full connection layer, achieves the purpose of simplifying the model, improves the training and searching speed and reduces the false detection rate.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a rapid trademark image retrieval method based on an Attention mechanism and a convolutional neural network.
Background
In recent years, with rapid economic development, the number of registered trademarks of all kinds has increased year by year. Trademark retrieval is therefore of great significance to trademark registration, management, and protection. For an applicant, obstacles to a trademark registration application can be discovered in time through trademark retrieval, such as whether the trademark to be registered has already been claimed by others, or whether it is similar in some respect to an existing trademark. For the trademark office, the similarity between a trademark to be registered and the existing trademarks in the database can be obtained automatically through trademark retrieval, reducing the workload of manual comparison and improving working efficiency.
Traditional trademark detection and identification systems include database image retrieval based on manual labeling and comparison, retrieval based on trademark graphic-element codes, image retrieval based on binarized image features, keyword-based retrieval, and the like. These traditional methods consume a large amount of time and labor for labeling and comparing images, which increases the cost of trademark image retrieval. With the gradual maturation of convolutional neural networks, retrieving and classifying trademark images with a convolutional neural network has become a new approach for trademark detection and identification systems: a model is first trained on the training images of a dataset to adjust the weights of each network layer, the model is then tested on the test-set images, and finally the tested model is deployed in the system. Although more intelligent, the traditional region-based convolutional neural network has a complex structure and redundant fully-connected-layer parameters, so the training process is slow and consumes a large amount of time and resources; moreover, this method requires the dataset to be fully annotated with category and localization information, but for the trademark office the image annotations in the trademark database are incomplete, making the method difficult to apply in training.
Disclosure of Invention
In view of the above, the invention provides a fast trademark image retrieval method based on an Attention mechanism and a convolutional neural network, which combines the Attention mechanism with a convolutional neural network, removes the fully connected layers on this basis, and uses the output of an intermediate convolutional layer as the feature representation of an image, thereby providing a fast trademark detection method.
In order to achieve the above object, the present invention provides a fast trademark image retrieval method based on an Attention mechanism and a convolutional neural network, the method comprising the steps of:
s1, building a Caffe deep learning open source framework, and training an open source VGG16 network model;
s2, designing an Attention network comprising two convolutional layers based on a VGG16 network model, and adding the Attention network into the trained VGG16 network model;
s3, training the VGG16 network model added with the Attention network by using a training set in a FlickrLogos-32 data set;
s4, generating an Attention-MAC trademark feature extraction model based on the trained VGG16 network model added with the Attention network;
s5, retrieving the trademark image to be queried based on the Attention-MAC trademark feature extraction model, and generating a retrieval result.
Preferably, the step S1 includes the steps of:
s1-1, building a Caffe deep learning open source framework, and pre-training a VGG16 network model by using an ImageNet data set;
and S1-2, carrying out transfer learning training on the VGG16 network model obtained by pre-training by using a training set in a FlickrLogos-32 data set.
Preferably, the step S2 includes the steps of:
s2-1, designing an Attention network comprising two convolutional layers based on model parameters of a VGG16 network model;
s2-2, adding a designed Attention network between the output of the last layer of the trained pooling layer of the VGG16 network model and a full connection layer.
Preferably, the step S3 includes the steps of:
s3-1, fixing the network weight of the feature extraction part in the VGG16 network model added with the Attention network;
s3-2, training the Attention network in the VGG16 network model added with the Attention network by using a training set in a FlickrLogos-32 data set.
Preferably, the step S4 includes the steps of:
s4-1, removing a trained full connection layer in the VGG16 network model added with the Attention network;
s4-2, adding a pooling layer behind the Attention network to generate an Attention-MAC trademark feature extraction model.
Preferably, the step S5 includes the steps of:
S5-1, performing dimensionality reduction on the image feature representation output by the Attention-MAC trademark feature extraction model through a principal feature analysis method to obtain an Attention-MAC feature vector;
s5-2, taking trademark images in FlickrLogos-32 data sets as input images, sequentially inputting the input images into an Attention-MAC trademark feature extraction model, generating Attention-MAC feature vectors of the trademark data set images, and constructing a feature library of the trademark data set images;
s5-3, inputting the trademark image to be retrieved into an Attention-MAC trademark feature extraction model as an input image, and generating an Attention-MAC feature vector of the trademark image to be retrieved;
s5-4, calculating cosine similarity of the Attention-MAC feature vector of the trademark image to be retrieved and the Attention-MAC feature vector of the FlickrLogos-32 data set image to obtain initial sequencing of the trademark data set image based on the cosine similarity;
s5-5, rearranging FlickrLogos-32 data set images through expanding query to obtain the final sequence of the similarity between the trademark data set images and the trademark images to be retrieved and reporting the trademark retrieval result.
Preferably, the step S1-2 includes the steps of:
s1-2-1, fine-tuning the network weight of the VGG16 network model obtained through pre-training by using a training set in a FlickrLogos-32 data set;
s1-2-2, in the training of transfer learning, a standard cross entropy loss function is used for carrying out classification training on the VGG16 network model obtained through pre-training.
Preferably, the step S5-1 includes the steps of:
S5-1-1, performing L2 regularization on the feature representation output by the Attention-MAC trademark feature extraction model;
S5-1-2, performing feature selection on the processed feature representation through a principal feature analysis method to obtain a feature vector after dimension reduction;
S5-1-3, performing L2 regularization again on the reduced feature vector to obtain the Attention-MAC feature vector.
Preferably, the pooling layer added in step S4-2 processes the feature representation output by the Attention-MAC trademark feature extraction model by using a region mean pooling method, wherein the region mean pooling method specifically includes the following steps:
C1, the input image is passed through the trained Attention-MAC trademark feature extraction model, which outputs a W × H × K spatial matrix;
C2, this three-dimensional matrix is regarded as a set of two-dimensional feature response matrices x = {x_i}, i = 1, 2, …, K, where K is the total number of channels of the output two-dimensional feature maps and x_i is the two-dimensional feature response matrix output by the i-th feature channel;
C3, let Ω denote all possible positions in the W_i × H_i two-dimensional feature response matrix output by the i-th feature channel, and let x_i(p) denote the response of x_i at position p; assume
p_j = argmax_{p ∈ Ω} x_i(p).
Based on the continuity of features, the mean of the 3 × 3 region centered at x_i(p_j) is computed as the pooled output of x_i, i.e.
f_i = (1/9) Σ_{l=1}^{9} x_i(p_l),
where p_l, l = 1, 2, …, 9, are all the positions in this 3 × 3 region; since position x_i(p_j) is the center of the region, x_i(p_5) = x_i(p_j) when l = 5. By computing this region mean pooling over the feature response matrix W_i × H_i output by each channel, a K-dimensional feature representation of the image is obtained:
f_Ω = [f_1, f_2, …, f_K].
The feature vector f_Ω is the output of the Attention-MAC trademark feature extraction model.
In summary, the invention discloses a rapid trademark image retrieval method based on an Attention mechanism and a convolutional neural network, which comprises the steps of firstly, building a Caffe deep learning open source framework and training an open source VGG16 network model; designing an Attention network comprising two convolutional layers based on a VGG16 network model, and adding the Attention network into the trained VGG16 network model; then training the VGG16 network model added with the Attention network by using a training set in a FlickrLogos-32 data set; generating an Attention-MAC trademark feature extraction model based on the trained VGG16 network model added with the Attention network; and finally, retrieving the trademark image to be queried based on the Attention-MAC trademark feature extraction model, and generating a retrieval result. The invention combines the Attention mechanism with the convolutional neural network, and the provided Attention-MAC trademark feature extraction model does not comprise any full connection layer, but processes the feature graph output by the convolution of the middle layer to obtain the feature representation of the original image, thereby avoiding using the redundant parameters of the full connection layer, achieving the purpose of simplifying the model, simultaneously improving the training and retrieval speed and reducing the false detection rate.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a basic flow chart of a fast trademark image retrieval method based on an Attention mechanism and a convolutional neural network in a preferred embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a network structure of a pre-trained VGG16 network model in a preferred embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a network structure after an Attention network proposed by the present invention is added in a preferred embodiment disclosed in the present invention;
FIG. 4 is a schematic network structure of the Attention-MAC trademark image retrieval model proposed by the present invention in a preferred embodiment disclosed in the present invention;
FIG. 5 is a search flow chart for trademark detection using the Attention-MAC trademark image search model in a preferred embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and limited, it is to be noted that the terms "mounted," "connected," and "connected" are to be interpreted broadly, and may be, for example, a mechanical connection or an electrical connection, a communication between two elements, a direct connection, or an indirect connection via an intermediate medium, and specific meanings of the terms may be understood by those skilled in the art according to specific situations.
The invention provides a rapid trademark image retrieval method based on an Attention mechanism and a convolutional neural network, which comprises the following steps as shown in figure 1:
s1, building a Caffe deep learning open source framework, and training an open source VGG16 network model;
s2, designing an Attention network comprising two convolutional layers based on a VGG16 network model, and adding the Attention network into the trained VGG16 network model;
s3, training the VGG16 network model added with the Attention network by using a training set in a FlickrLogos-32 data set;
s4, generating an Attention-MAC trademark feature extraction model based on the trained VGG16 network model added with the Attention network;
s5, retrieving the trademark image to be queried based on the Attention-MAC trademark feature extraction model, and generating a retrieval result.
FIG. 2 is a schematic diagram of the network structure of the pre-trained VGG16 network model in a preferred embodiment of the present disclosure; the selected VGG16 network model includes 13 convolutional layers and 3 fully connected layers, and, trained on the ImageNet dataset, outputs 1000 classification categories.
The following describes the training process of the Attention-MAC trademark feature extraction model proposed by the present invention with reference to FIG. 3, which specifically includes the following steps:
sa, building a Caffe deep learning open source framework, and pre-training the VGG16 network model by using an ImageNet data set;
sb, fine tuning the VGG16 network model obtained by pre-training by using a training image set in a FlickrLogos-32 data set, retraining the network weight, completing transfer learning of the model under a new data set, and further improving the accuracy of feature extraction;
Sc, for the trained VGG16 network model, the Attention network proposed by the invention is added between the output of the last pooling layer and the fully connected layer; the network weights of the feature extraction part are fixed, and the Attention network is trained using the training images of the FlickrLogos-32 dataset.
Sd, removing a full connection layer from the trained model, and keeping other parts; and adding a pooling layer behind the Attention network to obtain the Attention-MAC trademark feature extraction model provided by the invention.
The above steps are explained in more detail below:
the transfer learning described in step Sb includes the following steps:
A1, selection of datasets. The FlickrLogos-32 dataset is divided into three parts, P1, P2 and P3, which serve as the training set, validation set, and test set respectively. Each part contains the same 32 brand classes; P1 contains 320 pictures, P2 contains 3960 pictures, and P3 contains 3960 pictures, and the pictures of the three parts are disjoint.
A2, adjustment of the loss function. In the transfer learning training, the model is trained for classification using the standard cross-entropy loss function:
L(y, y*) = −(y*)ᵀ log(y) − (1 − y*)ᵀ log(1 − y),
where y denotes the classification output by the model during training; y* denotes the true classification of the image, a 0-1 vector (the position of the true class is 1 and the rest are 0); and 1 is an all-ones vector.
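As an illustration, the standard cross-entropy classification loss used in this transfer-learning step can be sketched in NumPy; the elementwise two-term form below (using the all-ones vector mentioned in the description) is an assumption, and `cross_entropy_loss`, `y`, and `y_star` are illustrative names:

```python
import numpy as np

def cross_entropy_loss(y, y_star, eps=1e-12):
    """Cross-entropy between predicted class probabilities y and a
    0-1 ground-truth vector y_star (1 at the true class, 0 elsewhere)."""
    y = np.clip(y, eps, 1.0 - eps)      # numerical stability for log
    ones = np.ones_like(y_star)         # the all-ones vector "1"
    return float(-(y_star @ np.log(y)) - ((ones - y_star) @ np.log(ones - y)))

# Example: a 4-class output where the true class is index 2
y = np.array([0.1, 0.2, 0.6, 0.1])
y_star = np.array([0.0, 0.0, 1.0, 0.0])
loss = cross_entropy_loss(y, y_star)
```

A more confident prediction on the true class yields a lower loss, which is what drives the fine-tuning of the network weights.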
The design and training of the Attention network described in step Sc comprises the following steps:
B1, according to the model parameters of the VGG16 selected by the invention, the output of convolutional layer Conv5_3 is pooled to produce a W × H × K spatial matrix; this output is called Origin_feature_map.
B2, the Attention network proposed by the invention comprises 2 convolutional layers (referred to as Attention_Conv_1 and Attention_Conv_2, respectively), using the Softplus function as the activation function.
B3, Attention_Conv_1 uses 1 × 1 convolution kernels, k of them, equal to the number of channels of the spatial matrix; Attention_Conv_2 uses a single 1 × 1 convolution kernel. The convolution-kernel parameters θ are randomly initialized at the start of training.
B4, the parameters θ of the Attention network are trained using the standard cross-entropy loss function and back-propagation. The purpose of training is to learn, through the Attention network, the importance of each feature in Origin_feature_map. To this end, the output function φ(f_i; θ) of the Attention network is defined to express the score (i.e., weight) of feature f_i in Origin_feature_map, where f_i ∈ R^k, i = 1, 2, …, k.
B5, after the Attention network is trained, the score φ(f_i; θ) from step B4 is used as a weight and multiplied with the corresponding feature vector f_i in Origin_feature_map, yielding a weight-adjusted W × H × K spatial matrix; this output is called Attention_feature_map.
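The attention scoring and weighting of steps B2–B5 can be sketched in NumPy as follows. Treating each 1 × 1 convolution as a per-position matrix multiplication is standard, but the placement of Softplus after both layers, the initialization scale, and all names (`attention_weight`, `theta1`, `theta2`) are illustrative assumptions — a real implementation would be expressed as Caffe layers:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def attention_weight(origin_feature_map, theta1, theta2):
    """Score each spatial position with two 1x1 convolutions (per-position
    matmuls) plus Softplus, then multiply the scalar scores back into the
    feature map to obtain the weight-adjusted Attention_feature_map.

    origin_feature_map: (W, H, K) array
    theta1: (K, K) kernels of Attention_Conv_1
    theta2: (K, 1) kernel of Attention_Conv_2
    """
    hidden = softplus(origin_feature_map @ theta1)   # (W, H, K)
    scores = softplus(hidden @ theta2)               # (W, H, 1): one weight per position
    return origin_feature_map * scores               # broadcast multiply

rng = np.random.default_rng(0)
W, H, K = 7, 7, 512                                  # Conv5_3 output for a 224x224 VGG16 input
fmap = rng.random((W, H, K))
theta1 = rng.standard_normal((K, K)) * 0.01          # random initialization of θ
theta2 = rng.standard_normal((K, 1)) * 0.01
att_map = attention_weight(fmap, theta1, theta2)
```

Because Softplus is strictly positive, every spatial position keeps a nonzero weight; training θ then sharpens the contrast between important and unimportant positions.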
For the pooling layer described in step Sd, the present invention proposes a new region mean pooling method. The specific calculation steps are as follows:
C1, the input image is passed through the trained Attention-MAC trademark feature extraction model, which outputs a W × H × K spatial matrix.
C2, this three-dimensional matrix is regarded as a set of two-dimensional feature response matrices x = {x_i}, i = 1, 2, …, K, where K is the total number of channels of the output two-dimensional feature maps and x_i is the two-dimensional feature response matrix output by the i-th feature channel.
C3, let Ω denote all possible positions in the W_i × H_i two-dimensional feature response matrix output by the i-th feature channel, and let x_i(p) denote the response of x_i at position p; assume
p_j = argmax_{p ∈ Ω} x_i(p).
Based on the continuity of features, the mean of the 3 × 3 region centered at x_i(p_j) is computed as the pooled output of x_i, i.e.
f_i = (1/9) Σ_{l=1}^{9} x_i(p_l),
where p_l, l = 1, 2, …, 9, are all the positions in this 3 × 3 region; since position x_i(p_j) is the center of the region, x_i(p_5) = x_i(p_j) when l = 5. By computing this region mean pooling over the feature response matrix W_i × H_i output by each channel, a K-dimensional feature representation of the image is obtained:
f_Ω = [f_1, f_2, …, f_K].
The feature vector f_Ω is the output of the Attention-MAC trademark feature extraction model.
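The region mean pooling of steps C1–C3 can be sketched in NumPy as follows. Centering the 3 × 3 window on each channel's maximum response (consistent with the "MAC" naming, i.e., maximum activations of convolutions) and clipping the window at feature-map borders are assumptions; `region_mean_pooling` is an illustrative name:

```python
import numpy as np

def region_mean_pooling(feature_maps):
    """For each channel, take the mean of the 3x3 region centered at the
    maximum response. feature_maps: (K, W, H) -> K-dim vector f_Omega.
    Windows falling off the border are clipped (an assumption)."""
    K, W, H = feature_maps.shape
    f_omega = np.empty(K)
    for i in range(K):
        x_i = feature_maps[i]
        pj = np.unravel_index(np.argmax(x_i), x_i.shape)  # max-response position p_j
        r0, r1 = max(pj[0] - 1, 0), min(pj[0] + 2, W)
        c0, c1 = max(pj[1] - 1, 0), min(pj[1] + 2, H)
        f_omega[i] = x_i[r0:r1, c0:c1].mean()             # mean of the 3x3 region
    return f_omega

fmap = np.zeros((2, 5, 5))
fmap[0, 2, 2] = 9.0        # interior max: full 3x3 window sums to 9 -> mean 1.0
fmap[1, 0, 0] = 4.0        # corner max: clipped 2x2 window -> mean 1.0
f = region_mean_pooling(fmap)
```

Compared with plain max pooling, averaging the neighborhood of the peak trades a little sharpness for robustness to single-pixel noise in the response maps.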
The calculation process of the Attention-MAC feature vector proposed in the present invention is described below with reference to FIG. 4.
For the image feature representation output by the Attention-MAC trademark feature extraction model, dimensionality reduction is performed through a principal feature analysis method to obtain the Attention-MAC feature vector proposed by the invention. The specific calculation is as follows:
D1, process the K-dimensional feature representation output by the Attention-MAC trademark feature extraction model in step Sd. To prevent overfitting, f_Ω is first L2-regularized.
D2, for the processed K-dimensional feature representation, features are selected with a principal feature analysis method: among the K features, the l features with high correlation or high mutual information are computed and selected, reducing the data from K dimensions to l dimensions.
D3, the l-dimensional vector computed by the principal feature analysis method is L2-regularized again.
D4, the image feature vector obtained by the above steps is the Attention-MAC feature vector of the present invention.
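The D1–D4 pipeline can be sketched in NumPy as follows. A plain PCA projection stands in for the principal feature analysis step (an assumption — the correlation/mutual-information selection criterion is not fully specified here), and the function names and the choice l = 128 are illustrative:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def fit_pca(features, l):
    """Fit a rank-l PCA projection on an (n, K) matrix of dataset features."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:l]                      # top-l principal directions

def attention_mac_vector(f_omega, mean, components):
    f = l2_normalize(f_omega)                # D1: first L2 normalization
    f = components @ (f - mean)              # D2: reduce K -> l dimensions
    return l2_normalize(f)                   # D3: second L2 normalization

rng = np.random.default_rng(1)
dataset = rng.random((200, 512))             # 200 dataset images, K = 512 (toy data)
normed = np.array([l2_normalize(v) for v in dataset])
mean, comps = fit_pca(normed, l=128)
q = attention_mac_vector(dataset[0], mean, comps)
```

The second normalization makes cosine similarity between two Attention-MAC vectors reduce to a plain dot product, which simplifies the retrieval stage.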
The following describes a search process for trademark detection by using the Attention-MAC trademark image search model in accordance with the present invention with reference to fig. 5.
E1, the trademark images in the FlickrLogos-32 dataset are used as input images and fed in turn into the Attention-MAC trademark image retrieval model, obtaining the one-dimensional Attention-MAC feature vectors f_i, i = 1, 2, …, n, representing the FlickrLogos-32 dataset images, where n is the number of trademark images in the FlickrLogos-32 dataset.
E2, the trademark image to be retrieved is input into the Attention-MAC trademark feature extraction model, obtaining the Attention-MAC feature vector q_A of the trademark image to be retrieved.
E3, the similarity between the image to be retrieved and the images in the FlickrLogos-32 dataset is computed in turn using cosine similarity:
s_i = (q_A · f_i) / (‖q_A‖ ‖f_i‖),
where s_i denotes the similarity score between the i-th image in the FlickrLogos-32 dataset and the image to be retrieved; the FlickrLogos-32 dataset images are initially ranked according to the scores s_i.
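The cosine-similarity ranking of step E3 can be sketched as follows (`initial_ranking` and the toy vectors are illustrative):

```python
import numpy as np

def cosine_similarity(q, f):
    return float(np.dot(q, f) / (np.linalg.norm(q) * np.linalg.norm(f)))

def initial_ranking(q_a, dataset_vectors):
    """Rank dataset images by cosine similarity to the query vector q_A.
    Returns indices from most to least similar, plus the scores s_i."""
    scores = np.array([cosine_similarity(q_a, f) for f in dataset_vectors])
    order = np.argsort(-scores)              # descending similarity
    return order, scores

q_a = np.array([1.0, 0.0])
feats = np.array([[0.0, 1.0],                # orthogonal to the query
                  [1.0, 1.0],                # 45 degrees from the query
                  [2.0, 0.0]])               # same direction as the query
order, scores = initial_ranking(q_a, feats)
```

Because cosine similarity ignores vector magnitude, the third vector ranks first even though it is not equal to the query.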
E4, the top 5 images in the initial ranking are selected, their Attention-MAC feature vectors are summed together with the Attention-MAC feature vector of the trademark image to be retrieved, and the mean is computed:
q_re = (1/6) (q_A + Σ_{i=1}^{5} f_(i)),
where f_(i) denotes the Attention-MAC feature vector of the image ranked i-th. The obtained q_re is used as the new query vector; the top 100 images in the initial ranking are selected, their similarity to q_re is computed, and these 100 images are re-ranked by the new scores to obtain the final ranking result.
It should be noted that the system structures or method flows shown in fig. 1 to fig. 5 of the present invention are only some preferred embodiments of the present invention, and the illustration is only for the convenience of understanding the present invention and is not to be construed as a limitation of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Claims (7)
1. A rapid trademark image retrieval method based on an Attention mechanism and a convolutional neural network is characterized by comprising the following steps:
s1, building a Caffe deep learning open source framework, and training an open source VGG16 network model;
s2, designing an Attention network comprising two convolutional layers based on a VGG16 network model, and adding the Attention network into the trained VGG16 network model;
s3, training the VGG16 network model added with the Attention network by using a training set in a FlickrLogos-32 data set;
s4, generating an Attention-MAC trademark feature extraction model based on the trained VGG16 network model added with the Attention network;
s4-1, removing a trained full connection layer in the VGG16 network model added with the Attention network;
s4-2, adding a pooling layer behind the Attention network to generate the Attention-MAC trademark feature extraction model; the pooling layer processes the feature maps output by the Attention network with a region mean pooling method, which specifically comprises the following steps:
c1, inputting the image into the trained Attention-MAC trademark feature extraction model, which outputs a W×H×K feature volume;
c2, regarding this three-dimensional volume as a set of two-dimensional feature response matrices χ = {x_i}, i = 1, 2, ..., K, where K is the total number of channels of the output feature maps and x_i is the two-dimensional feature response matrix output by the i-th feature channel;
c3, letting Ω denote all possible positions in the W_i×H_i two-dimensional feature response matrix output by the i-th feature channel, and x_i(p) denote the response of x_i at position p, assume:
p_j = argmax_{p∈Ω} x_i(p)
Based on the local continuity of the features, the mean of the 3×3 region centered at x_i(p_j) is taken as the pooled output of x_i, i.e.
f_i = (1/9) · Σ_{l=1}^{9} x_i(p_l)
where p_l, l = 1, 2, ..., 9, ranges over all positions in this 3×3 region; since position x_i(p_j) is the center of the 3×3 region, x_i(p_5) = x_i(p_j) when l = 5. Computing the region mean pooling over the W_i×H_i feature response matrix output by each channel yields a K-dimensional feature representation of the image:
f_Ω = [f_1, f_2, ..., f_K]^T
The feature vector f_Ω is the output of the Attention-MAC trademark feature extraction model;
s5, retrieving the trademark image to be queried based on the Attention-MAC trademark feature extraction model, and generating a retrieval result.
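The region mean pooling of steps c1-c3 can be sketched as follows. This is a minimal numpy sketch: the zero padding used to keep the 3×3 window inside the map at the borders is an assumption, since the claim does not specify boundary handling.

```python
import numpy as np

def region_mean_pool(feature_maps):
    """Region mean pooling (steps c1-c3): for each of the K channels, find
    the position p_j of the maximum response and return the mean of the
    3x3 region centered there, giving a K-dimensional vector f_Omega."""
    K, H, W = feature_maps.shape
    # Zero-pad by 1 so a 3x3 window centered anywhere stays in bounds
    # (boundary handling is an assumption, not specified by the claim).
    padded = np.pad(feature_maps, ((0, 0), (1, 1), (1, 1)))
    f = np.empty(K)
    for i in range(K):
        j = np.argmax(feature_maps[i])       # flattened index of max response
        r, c = divmod(j, W)                  # p_j = (r, c)
        # 3x3 region centered at p_j; indices shifted by the pad of 1.
        f[i] = padded[i, r:r + 3, c:c + 3].mean()
    return f
```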
2. The Attention mechanism and convolutional neural network based fast trademark image retrieval method of claim 1, wherein the step S1 comprises the steps of:
s1-1, building a Caffe deep learning open source framework, and pre-training a VGG16 network model by using an ImageNet data set;
and S1-2, carrying out transfer learning training on the VGG16 network model obtained by pre-training by using a training set in a FlickrLogos-32 data set.
3. The Attention mechanism and convolutional neural network based fast trademark image retrieval method of claim 1, wherein the step S2 comprises the steps of:
s2-1, designing an Attention network comprising two convolutional layers based on model parameters of a VGG16 network model;
s2-2, adding a designed Attention network between the output of the last layer of the trained pooling layer of the VGG16 network model and a full connection layer.
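The two-convolutional-layer Attention branch of steps S2-1/S2-2 can be sketched as below. This is only an illustrative numpy sketch under stated assumptions: the patent does not disclose kernel sizes or activations, so 1×1 convolutions, a ReLU hidden layer, a softplus score, and sum-to-one spatial normalization are all assumptions.

```python
import numpy as np

def attention_block(feats, w1, w2):
    """Two-layer attention branch sketch: score every spatial position of a
    K x H x W feature map and reweight the features by the attention map."""
    K, H, W = feats.shape
    x = feats.reshape(K, -1)                  # (K, H*W): one column per position
    hidden = np.maximum(0.0, w1 @ x)          # first 1x1 conv + ReLU (assumption)
    scores = np.logaddexp(0.0, w2 @ hidden)   # second 1x1 conv + softplus -> positive scores
    attn = scores / scores.sum()              # normalized spatial attention map
    return (x * attn).reshape(K, H, W)        # attention-reweighted features
```

The reweighted maps would then be passed to the region mean pooling layer in place of the raw convolutional output.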
4. The Attention mechanism and convolutional neural network based fast trademark image retrieval method of claim 1, wherein the step S3 comprises the steps of:
s3-1, fixing the network weight of the feature extraction part in the VGG16 network model added with the Attention network;
s3-2, training the Attention network in the VGG16 network model added with the Attention network by using a training set in a FlickrLogos-32 data set.
5. The Attention mechanism and convolutional neural network based fast trademark image retrieval method of claim 1, wherein the step S5 comprises the steps of:
s5-1, performing dimensionality reduction on the image feature representation output by the Attention-MAC trademark feature extraction model through a principal feature analysis method to obtain the Attention-MAC feature vector;
s5-2, taking trademark images in FlickrLogos-32 data sets as input images, sequentially inputting the input images into an Attention-MAC trademark feature extraction model, generating Attention-MAC feature vectors of the trademark data set images, and constructing a feature library of the trademark data set images;
s5-3, inputting the trademark image to be retrieved into an Attention-MAC trademark feature extraction model as an input image, and generating an Attention-MAC feature vector of the trademark image to be retrieved;
s5-4, calculating cosine similarity of the Attention-MAC feature vector of the trademark image to be retrieved and the Attention-MAC feature vector of the FlickrLogos-32 data set image to obtain initial sequencing of the trademark data set image based on the cosine similarity;
s5-5, rearranging FlickrLogos-32 data set images through expanding query to obtain the final sequence of the similarity between the trademark data set images and the trademark images to be retrieved and reporting the trademark retrieval result.
6. The Attention mechanism and convolutional neural network based fast trademark image retrieval method of claim 2, wherein the step S1-2 comprises the steps of:
s1-2-1, fine-tuning the network weight of the VGG16 network model obtained through pre-training by using a training set in a FlickrLogos-32 data set;
s1-2-2, in the training of transfer learning, a standard cross entropy loss function is used for carrying out classification training on the VGG16 network model obtained through pre-training.
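The standard cross-entropy loss used for the classification training in step S1-2-2 is sketched below; this is the textbook softmax cross-entropy, not code from the patent.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Standard softmax cross-entropy over class logits.
    logits: (N, C) array of raw scores; labels: (N,) class indices."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean negative log-probability of the correct class.
    return -log_probs[np.arange(len(labels)), labels].mean()
```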
7. The Attention mechanism and convolutional neural network-based fast trademark image retrieval method of claim 5, wherein the step S5-1 comprises the steps of:
s5-1-1, performing L2 regularization on the feature representation output by the Attention-MAC trademark feature extraction model;
s5-1-2, performing feature selection on the processed feature representation through a principal feature analysis method to obtain a dimension-reduced feature vector;
s5-1-3, performing L2 regularization again on the dimension-reduced feature vector to obtain the Attention-MAC feature vector.
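The post-processing pipeline of steps S5-1-1 to S5-1-3 (L2-normalize, reduce dimension, L2-normalize again) can be sketched as follows. A minimal numpy sketch: the claim's "principal feature analysis" is assumed here to behave like SVD-based PCA, and the output dimension `n_components` is an illustrative parameter.

```python
import numpy as np

def postprocess(feats, n_components=128):
    """S5-1 sketch: L2-normalize each feature vector, project onto the top
    principal components (PCA via SVD, an assumption), L2-normalize again."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # first L2 norm
    mean = feats.mean(axis=0)
    centered = feats - mean                                       # center for PCA
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T                      # dimension reduction
    reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)     # second L2 norm
    return reduced
```

In a retrieval system the mean and projection basis would be fitted once on the dataset feature library and reused to project each query vector.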
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810750096.1A CN108875076B (en) | 2018-07-10 | 2018-07-10 | Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108875076A CN108875076A (en) | 2018-11-23 |
CN108875076B true CN108875076B (en) | 2021-07-20 |
Family
ID=64300452
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810750096.1A Active CN108875076B (en) | 2018-07-10 | 2018-07-10 | Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108875076B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697257A (en) * | 2018-12-18 | 2019-04-30 | 天罡网(北京)安全科技有限公司 | A noise-robust network information retrieval method based on pre-classification and feature learning |
CN109857897B (en) * | 2019-02-14 | 2021-06-29 | 厦门一品威客网络科技股份有限公司 | Trademark image retrieval method and device, computer equipment and storage medium |
CN110334226B (en) * | 2019-04-25 | 2022-04-05 | 吉林大学 | Depth image retrieval method fusing feature distribution entropy |
CN110599459A (en) * | 2019-08-14 | 2019-12-20 | 深圳市勘察研究院有限公司 | Underground pipe network risk assessment cloud system based on deep learning |
CN111694974A (en) * | 2020-06-12 | 2020-09-22 | 桂林电子科技大学 | Depth hash vehicle image retrieval method integrating attention mechanism |
CN111694977A (en) * | 2020-06-12 | 2020-09-22 | 桂林电子科技大学 | Vehicle image retrieval method based on data enhancement |
CN111985161B (en) * | 2020-08-21 | 2024-06-14 | 广东电网有限责任公司清远供电局 | Reconstruction method of three-dimensional model of transformer substation |
CN113127661B (en) * | 2021-04-06 | 2023-09-12 | 中国科学院计算技术研究所 | Multi-supervision medical image retrieval method and system based on cyclic query expansion |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103226826A (en) * | 2013-03-20 | 2013-07-31 | 西安电子科技大学 | Method for detecting changes of remote sensing image of visual attention model based on local entropy |
CN108038519A (en) * | 2018-01-30 | 2018-05-15 | 浙江大学 | A cervical image processing method and device based on a dense feature pyramid network |
WO2018106783A1 (en) * | 2016-12-06 | 2018-06-14 | Siemens Energy, Inc. | Weakly supervised anomaly detection and segmentation in images |
CN108171141A (en) * | 2017-12-25 | 2018-06-15 | 淮阴工学院 | Video target tracking method based on attention-model cascaded multi-mode fusion |
CN108229267A (en) * | 2016-12-29 | 2018-06-29 | 北京市商汤科技开发有限公司 | Object properties detection, neural metwork training, method for detecting area and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10558750B2 (en) * | 2016-11-18 | 2020-02-11 | Salesforce.Com, Inc. | Spatial attention model for image captioning |
Non-Patent Citations (2)
Title |
---|
"Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network"; Le Wang, Jinliang Zang, Qilin Zhang, Zhenxing Niu, Gang Hua; Sensors; 2018-06-21; full text *
"Fast face image retrieval method based on deep features" (基于深度特征的快速人脸图像检索方法); Li Zhendong, Zhong Yong, Chen Man, Cao Dongping; Acta Optica Sinica (光学学报); 2018-05-30; Vol. 38, No. 10; full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108875076B (en) | Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network | |
CN110866140B (en) | Image feature extraction model training method, image searching method and computer equipment | |
CN103345645B (en) | Commodity image class prediction method towards net purchase platform | |
EP4002161A1 (en) | Image retrieval method and apparatus, storage medium, and device | |
CN107683469A (en) | A kind of product classification method and device based on deep learning | |
CN108921198A (en) | commodity image classification method, server and system based on deep learning | |
CN111898703B (en) | Multi-label video classification method, model training method, device and medium | |
CN105354593B (en) | A kind of threedimensional model sorting technique based on NMF | |
CN108154156B (en) | Image set classification method and device based on neural topic model | |
CN116580257A (en) | Feature fusion model training and sample retrieval method and device and computer equipment | |
CN113378938B (en) | Edge transform graph neural network-based small sample image classification method and system | |
CN114510594A (en) | Traditional pattern subgraph retrieval method based on self-attention mechanism | |
CN114332889A (en) | Text box ordering method and text box ordering device for text image | |
CN113806580A (en) | Cross-modal Hash retrieval method based on hierarchical semantic structure | |
CN113032613A (en) | Three-dimensional model retrieval method based on interactive attention convolution neural network | |
Nezamabadi-pour et al. | Concept learning by fuzzy k-NN classification and relevance feedback for efficient image retrieval | |
CN113378962A (en) | Clothing attribute identification method and system based on graph attention network | |
CN112800262A (en) | Image self-organizing clustering visualization method and device and storage medium | |
CN105069136A (en) | Image recognition method in big data environment | |
CN110532409B (en) | Image retrieval method based on heterogeneous bilinear attention network | |
CN106874927A (en) | The construction method and system of a kind of random strong classifier | |
Adnan et al. | Automated image annotation with novel features based on deep ResNet50-SLT | |
CN111768214A (en) | Product attribute prediction method, system, device and storage medium | |
CN113378934A (en) | Small sample image classification method and system based on semantic perception map neural network | |
CN113011506A (en) | Texture image classification method based on depth re-fractal spectrum network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||