CN108108657B

CN108108657B - Method for correcting locality sensitive Hash vehicle retrieval based on multitask deep learning

Info

Publication number: CN108108657B
Application number: CN201711135951.XA
Authority: CN
Inventors: 何霞; 汤一平; 陈朋; 王丽冉; 袁公萍; 金宇杰
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2017-11-16
Filing date: 2017-11-16
Publication date: 2020-10-30
Anticipated expiration: 2037-11-16
Also published as: CN108108657A

Abstract

A retrieval method of a modified local sensitive Hash vehicle based on multitask deep learning is characterized in that a multitask end-to-end convolutional neural network is adopted to recognize vehicle types, vehicle series, vehicle logos, colors and license plates of the vehicle at the same time in a segmented and parallel mode, a network module for extracting vehicle image example features based on a feature pyramid is used, an algorithm for sorting vehicle features in a database is used for sorting the vehicle features by using a modified local sensitive Hash sorting algorithm, and a cross-modal text retrieval method for retrieving vehicle images cannot be obtained. The invention provides a multi-task end-to-end convolutional neural network and a modified local sensitive Hash vehicle retrieval method, which effectively improve the automation and intelligence level of vehicle retrieval, and meet the image retrieval requirement in a big data era with less storage space and higher retrieval speed.

Description

Method for correcting locality sensitive Hash vehicle retrieval based on multitask deep learning

Technical Field

The invention relates to application of computer vision, pattern recognition, information retrieval, multi-task learning, similarity measurement, a deep self-coding convolutional neural network and a deep learning technology in the field of image retrieval, in particular to a modified local sensitive Hash vehicle retrieval method based on multi-task deep learning.

Background

With the rapid development of social economy, motor vehicles increasingly become essential vehicles for people in daily life, and also become essential tools for criminals and terrorists to engage in illegal activities. The vehicles at present are generally based on license plate recognition technology, and once suspected vehicles use fake cards, no cards and a mode of continuously replacing license plates, tracking and recognition of the vehicles by the existing gates can be avoided.

Vehicle feature recognition based on images relates to the related technical fields of image processing, pattern recognition, computer vision and the like, and the current research on the technology at home and abroad can be roughly divided into three directions: (1) based on vehicle type recognition of license plates, the method only recognizes license plate information from images, does not directly analyze and obtain the types of vehicles, has coarse classification granularity, and cannot distinguish fake-licensed vehicles; (2) based on vehicle type recognition of the vehicle logo, in practical application, due to the existence of objective factors such as undersize of the vehicle logo, light rays and shielding, an ideal effect cannot be achieved; and (3) vehicle type identification based on appearance characteristics, compared with the former two methods, the technology has better robustness, the identification type is more detailed, and the vehicle type identification can be accurate to the brand, series, model, annual payment and the like of the vehicle.

The vehicle feature identification technology based on the appearance features is mainly completed by the following three steps: vehicle segmentation, feature extraction of the vehicle, and classification of the vehicle. The traditional vehicle type identification method mainly comprises the following steps: a template matching method, a statistical pattern recognition method, a neural network recognition method, a recognition method of a bionic pattern (topological pattern), and a method of a support vector machine. However, these methods have their own drawbacks, and cannot satisfy the two most important indexes of speed and accuracy of vehicle type classification at the same time. However, the most important factors influencing the two indexes are the extracted vehicle characteristics and the vehicle can be quickly positioned, so that the characteristic extraction and the target quick positioning are the key of the whole identification process. The extraction of the vehicle features is influenced by many factors, such as the vehicle types are many but there is no obvious distinguishing feature, the movement of the vehicle and the height and angle of the camera cause the vehicle type to have large feature difference, the influence of weather, the influence of illumination and the like.

The development of the deep learning technology promotes the picture structuring and feature extraction capability. The bayonet system of early construction, intelligent analysis ability is weak, and picture quality and license plate discernment rate of accuracy are lower, often will pass car picture or video according to vehicle self intrinsic information such as brand model colour, from magnanimity, the artifical target vehicle that seeks, because factors such as a ray of police strength is limited, intensity of labour is big, the motorcycle type is various, light angle is uncertain, can't guarantee the accuracy and the ageing of seeking, especially emergent incident, often delay the best opportunity of handling. By using the vehicle characteristic deep learning system, the characteristic structural analysis and identification are carried out on the vehicle passing picture obtained by the front-end bayonet or the simple bayonet, valuable information in a mass of bayonet vehicle passing pictures is fully excavated, the accuracy of the vehicle type of the license plate can be improved, the identification information of the vehicle characteristic is increased, the identification and detection functions of the sub-brand of the vehicle, the color of the vehicle body, no safety belt fastening, the call receiving and making of the driver, the state of the sun visor and the like are realized, the vehicle passing data is finely corrected, the single traditional means of analyzing and researching by only depending on the license plate is eliminated, more practical vehicle prevention and control application is provided for bayonet electric police, the effective early warning and control on high-risk vehicles can be realized, the police deployment is optimized for carrying out targeted vehicle investigation, suspected vehicles can be effectively locked in a large number of vehicle-related driving cases, and the criminal investigation efficiency is improved, the public security prevention and control means is changed from passive investigation after the fact to active early warning in advance.

The invention discloses a vehicle retrieval method based on similarity learning, which is developed by the Chinese patent application No. CN201510744990.4, wherein after a vehicle region is given, SIFT feature points are obtained and described, and then a clustering algorithm is used for discretizing SIFT features. In order to make up for the defect that the SIFT features lack position information, neighborhood features are further generated by using discrete SIFT feature distribution in the neighborhood and serve as final feature point description, each vehicle picture is represented by a batch of features, the features of a pair of similar vehicle pictures form a positive sample, and the features of a pair of different vehicle pictures form a negative sample. After a large number of positive and negative samples are collected, the similarity learning is carried out by using a random forest method, and the obtained classifier can be used for judging whether two vehicles are similar or not so as to achieve the purpose of similar vehicle retrieval. The technology cannot sufficiently extract vehicle features by using the SIFT features.

The invention of China patent application No. CN201610711333.4 discloses a vehicle retrieval method and a device based on big data, the method comprises the following steps: extracting each vehicle inspection mark of the target vehicle from the target vehicle image; fusing the vehicle inspection marks according to the position relation among the vehicle inspection marks to obtain a plurality of fusion areas, wherein the fusion areas comprise at least one vehicle inspection mark; determining the shape and color of each vehicle inspection mark contained in each fusion area; and searching the target vehicle in the plurality of vehicle images to be searched layer by layer according to the number of the vehicle inspection marks, the number of the fusion areas and the number, the shape and the color of the vehicle inspection marks contained in each fusion area. The technique uses only a single feature to retrieve the vehicle.

The invention discloses a fake-licensed vehicle retrieval and identification system based on machine vision, which is developed by the Chinese patent application No. CN 201710451957.1. The system mainly comprises a vehicle image acquisition system, a database system and a retrieval system, and the invention provides a fake-licensed vehicle retrieval and identification system based on machine vision, which is used for retrieving a suspected vehicle by means of characteristics of vehicle-mounted ornaments of the suspected vehicle, such as vehicle ornaments, annual inspection labels and the like, and solving the problem of searching a target vehicle from a mass of traffic scene images by performing characteristic acquisition on a vehicle-mounted ornament area image and performing vehicle retrieval by adopting a vehicle-mounted ornament area image sparse coding method, so that the fake-licensed vehicle is accurately identified and discovered. This technique is time-complex in large databases.

In summary, the image searching technique using the deep self-coding convolutional neural network and the modified locality-sensitive hash reordering method has several problems as follows: 1) how to accurately segment the whole image of the detected vehicle from the complex background; how to adopt few label image data to learn and train and obtain the characteristic data of the vehicle type as far as possible; 2) how to classify vehicle types in more fine categories and identify more information such as the brand, series and body color of the vehicle. On the other hand, how to perform parallel processing on the vehicle type, the license plate and the vehicle logo in the same deep convolutional neural network, namely, the multi-task parallel computation of deep learning is realized so as to improve the vehicle identity recognition level; 3) how to design a method for extracting example characteristics from a vehicle image for searching similar types of vehicle models; 4) how to use the extracted features to establish hierarchical depth search so as to obtain more accurate retrieval results; 5) how to reduce the problems of large storage space consumption, low retrieval speed and the like of an image retrieval system under the background of a big data era.

Disclosure of Invention

The invention provides a vehicle image retrieval method through hierarchical depth search end-to-end based on a depth self-coding convolutional neural network, aiming at the problems that the existing vehicle retrieval technology is low in automation and intelligence level, lack of deep learning, difficult to obtain accurate retrieval results, large in storage space consumption of the retrieval technology, low in retrieval speed, and difficult to meet the image retrieval requirements in the big data era.

In order to solve the technical problems, the invention provides the following technical scheme:

a modified locality sensitive Hash vehicle retrieval method based on multitask deep learning comprises the following steps:

1) constructing a multi-task end-to-end convolutional neural network for deep learning and training recognition, and deeply learning various attribute information of the vehicle, including vehicle type, vehicle system, vehicle logo, color and license plate, by training data and a network structure which progresses layer by layer;

2) constructing a vehicle attribute hash code by using the multitask convolutional neural network in the step 1) and adopting a segmented parallel learning and coding strategy;

3) constructing a characteristic pyramid module by utilizing a pyramid pooling layer and a vector compression layer so as to adapt to the input of convolution characteristic graphs of different sizes and extract example characteristics of the vehicle;

4) constructing a local sensitive reordering algorithm by using the example characteristics obtained in the step 3);

5) a cross-mode retrieval method under the condition that a retrieval vehicle image cannot be obtained is constructed, and vehicle retrieval is realized.

Furthermore, the multi-task end-to-end convolutional neural network for deep learning and training identification comprises a shared convolution module, an interesting region coordinate regression and identification module, a multi-task learning module and an example feature extraction module;

a shared convolution module: the shared network consists of 5 convolution modules, where the last layer of conv2_ x through conv5_ x is {4 }, respectively²,8²,16²,16²As the output size of the feature map, conv1 contains only a single convolutional layer as the input layer;

the method comprises the following steps that a region-of-interest coordinate regression and identification module is connected behind a shared convolution module, the module can take an image with any size as input, outputs a set of rectangular prediction frames of a target region, and comprises position coordinates of each prediction frame and probability scores of categories in a data set, in order to generate a region suggestion frame, firstly, the input image generates a feature map through a convolution sharing layer, and then, multi-scale convolution operation is carried out on the feature map, and the implementation process is as follows: using 3 scales and 3 length-width ratios at the position of each sliding window, taking the center of the current sliding window as the center and corresponding to one scale and length-width ratio, and then mapping to obtain 9 candidate regions with different scales on an original image; if the shared convolution characteristic diagram with the size of w × h is used, w × h × 9 candidate regions are totally obtained; finally, the classification layer outputs scores of w × h × 9 × 2 candidate regions, namely, the estimation probability that each region is a target/non-target, and the regression layer outputs w × h × 9 × 4 parameters, namely, coordinate parameters of the candidate regions;

when the RPN is trained, each candidate region is allocated with a binary label so as to mark whether the region is an object target, and the operation is as follows: 1) IoU (Intersection-over-Union ratio) with the highest overlap candidate area with a real target area (GT); 2) candidate regions with IoU overlap with any GT bounding box by more than 0.7. Assigning negative labels to candidate regions for which the IoU ratio for all GT bounding boxes is below 0.3; 3) between the two.

With these definitions, the objective function is minimized. The loss function for an image is defined as:

where i is the index of the ith candidate region, P_iIs the probability that the candidate region is of the ith class. If the label of the candidate region is positive,

is 1, if the candidate area label is 0,

is simply 0. t is t_iIs a vector, representing the 4 parameterized coordinates of the predicted bounding box,

is the coordinate vector of the corresponding GT bounding box. N is a radical of_clsAnd N_regRespectively, the normalized coefficients of the classification loss function and the position regression loss function, lambda is the weight parameter between the two, and the classification loss function L_clsIs a log loss of two classes, target and non-target:

regression loss function L for position_regDefined by the following function:

where R is a robust loss function (smooth L1).

However, training a multitask deep learning network is not an easy process to implement, because information at different task levels has different learning difficulties and convergence rates, and therefore, it is important to design a good multitask objective function. The multitask joint training process is as follows: assuming that the total number of tasks is T, the training data for the T-th task is recorded as

Wherein T belongs to (1, T), i belongs to (1, N), N is the total training sample number,

respectively, the feature vector and the label of the ith sample, the multitask objective function is expressed as:

in the formula

Is an input feature vector

And a weight parameter w^tL (-) is a loss function, phi (w)^t) Is a regularization value of a weight parameter;

and for the loss function, training the characteristics of the last layer by utilizing softmax in cooperation with a log-likelihood cost function to realize image classification. The softmax loss function is defined as follows:

in the formula, x_iIs the ith depth feature, W_jJ column of weights in the last fully-connected layer, b is the offset term, m, n are the number and class of processed samples, respectivelyCounting;

the convolutional neural network training is a back propagation process, is similar to a BP algorithm, performs back propagation through an error function, and performs optimization adjustment on convolutional parameters and bias by using a random gradient descent method until the network converges or reaches the maximum iteration times;

the neural network training is a back propagation process, the convolution parameters and the bias are optimized and adjusted by a random gradient descent method through back propagation of an error function until the network is converged or the maximum iteration times are reached;

the back propagation needs to compare the training samples with labels, adopt a square error cost function to identify multiple classes of the c classes and the N training samples, calculate the error by the formula (7) according to the final output error function of the network,

in the formula, E^NIn order to be a function of the squared error cost,

for the kth dimension of the label for the nth sample,

corresponding to the k output of the network prediction for the n sample;

when the error function is reversely propagated, a calculation method similar to the traditional BP algorithm is adopted, the specific formula form is shown as a formula (8),

^l＝(W^l+1)^T ^l+1×f'(u^l) (u^l＝W^lx^l-1+b^l) (8)

in the formula (I), the compound is shown in the specification,^lrepresenting the error function of the current layer,^l+1representing the error function of the previous layer, W^l+1For the previous layer of the mapping matrix, f' represents the inverse of the activation function, i.e. upsampling, u^lOutput, x, representing the layer above the failed activation function^l-1Denotes the input of the next layer, W^lThe weight matrix is mapped for this layer.

Furthermore, the relevance of the tasks exists in the multi-task learning process, namely information sharing exists among the tasks, and when a plurality of tasks are trained simultaneously, the network enhances the induction bias capability of the system and the generalization capability of the classifier by using the shared information among the tasks; the multitask network is divided into five subtasks by adding five full-connection layers behind an interested region module, each full-connection post-connection softmax activation function normalizes a threshold value between [0 and 1], then sends the normalized value to a segmentation function to promote the output of binary codes, and reduces the redundancy among the hash codes through a segmentation learning and coding strategy so as to enhance the robustness of the learned characteristics;

dividing the multi-task learning network into T tasks, wherein each task contains c^tOne class, full connected layer one-dimensional vector output per task uses m^tRepresents; first, the output of the fully-connected layer is normalized to [0,1] by using a softmax activation function]The formula is embodied as follows:

wherein θ represents a random hyperplane; and sending the normalized value into a threshold segmentation function for binarization to obtain binary output of the full-link layer, wherein the formula is specifically expressed as follows:

and finally, H obtained by the formula (10) is subjected to H in order to obtain the vehicle attribute hash code which is obtained by the segmented parallel learning of the multitask convolutional network^tThe vectors are fused again in a certain proportion, using the vector f_AThe formula is expressed in the following concrete form:

f_A＝[α¹H¹；α²H²；...；α^tH^t](11)

in formula (11)Alpha of (A)^tThe concrete expression form is as follows:

at each H^tPreviously multiplying by a penalty factor alpha^tThe error caused by different classification numbers among different tasks is compensated.

Furthermore, the manually designed functional age uses a large amount of visualized image pyramids, so that object detectors like DPM require high density sampling to obtain good results (e.g. 10 scales per octave); for the recognition task, the engineering features have been largely replaced by features computed by deep convolutional networks. In addition to representing higher-level semantics, the deep convolutional network has stronger scale variability, so that the deep convolutional network is beneficial to being identified from features calculated on a single input scale; however, even with this robustness, the pyramid still needs to get the most accurate results; all of the most important recent entries in ImageNet and COCO detection challenges use multi-scale testing of the characterized image pyramid; the main advantage of characterizing each level of the image pyramid is that it produces a multi-level feature representation, where all levels are semantically strong, including high resolution levels;

simultaneously creating a characteristic pyramid with strong semantics on all scales by utilizing the pyramid shape of the convolution characteristic hierarchical structure; to achieve this goal, low resolution, semantically strong features are combined with high resolution, semantically weak features by top-down paths and transverse connections, and can be constructed quickly from a single input image scale, which can be used to replace a characterized image pyramid without sacrificing representative features, speed or memory; in order to obtain example features of the vehicle image and adapt to the input of a convolution feature map with any size, the last layer of each unit of the sharing modules conv2_ x to conv5_ x is selected and combined with the output of the region-of-interest module, and then a pyramid pooling layer and a vector compression layer are added to compress three-dimensional features into a one-dimensional feature vector, so that the selection is that the feature map information obtained by a feature pyramid can be enriched, and the deepest layer of each stage has the strongest feature representation function;

with the last layer of each module as input to the feature pyramid, {4 ] is selected in turn for the last layer of the networks conv2_ x through conv5_ x defined above²,8²,16²,16²The size of an input feature map of the feature pyramid is used as the size of the feature pyramid; the input image is represented by I, the length and width of the input image are represented by letters h and w, the shared convolution module of the x-th layer is represented by convx _ x, the input image is activated into a three-dimensional feature vector T after being input, the dimension h 'multiplied by w' multiplied by d is a set of a series of two-dimensional feature maps, the length and width of the two-dimensional feature maps are h 'multiplied by w', the T contains d two-dimensional feature maps, and the set S is S { S ═ S { (S) }_nIs represented by ∈ (1, d), S_nCorresponding to the nth channel characteristic diagram; then, the three-dimensional feature vector T is sent into a feature pyramid, and is subjected to convolution by a plurality of scale convolution kernels to obtain a three-dimensional feature vector T ', the dimension of which is l × l × d, and the three-dimensional feature vector T ' also comprises a group of two-dimensional feature maps, wherein S ' can be used as S ' { S '_nIs represented by ∈ (1, d), wherein S_n' corresponding to the nth channel feature map, each feature map is l × l in size, and the total number of the feature maps is d; then, a sliding window with the size of k multiplied by k and the maximum pooling are selected to carry out logistic regression on the feature maps to obtain a group of feature maps with the size of l/k multiplied by l/k, and then S of each channel is carried out_n' conducting fusion to obtain a one-dimensional vector, conducting the same operation on d channels in sequence, and finally obtaining an individual characteristic vector f_BThe size is (1, l/k × d). The final search feature vector f is shown in equation (13):

f＝[f_A；f_B](13)。

the basic idea of the locality sensitive hashing algorithm is as follows: after two adjacent data points in the original data space are subjected to the same mapping or projection transformation, the probability that the two data points are still adjacent in the new data space is very high, and the probability that non-adjacent data points are mapped to the same bucket is very low. That is, if we have some hash mapping on the original data, we want two data that were originally adjacent to each other to be able to be hashed into the same bucket, having the same bucket number. After all the data in the original data set are subjected to hash mapping, a hash table is obtained, the original data sets are dispersed into buckets of the hash table, each bucket can fall into some original data, the data belonging to the same bucket are probably adjacent, and certainly, the non-adjacent data are hashed in the same bucket. Therefore, if some hash functions can be found, after the hash mapping transformation of the hash functions, the adjacent data in the original space fall into the same bucket, neighbor searching in the data set becomes easy, and the data adjacent to the query data can be found only by performing hash mapping on the query data to obtain the bucket number of the query data, then taking out all the data in the bucket corresponding to the bucket number, and then performing linear matching. In other words, the original data set is divided into a plurality of subsets through the mapping transformation operation of the hash function, the data in each subset are adjacent, and the number of elements in each subset is small, so that the problem of searching for adjacent elements in a super-large set is converted into the problem of searching for adjacent elements in a small set, and the searching calculation amount can be greatly reduced by the algorithm;

the hash function that two adjacent data points fall into the same bucket after hash transformation originally needs to satisfy the following two conditions:

if d (x, y) ≦ d1, the probability of h (x) ≦ h (y) is at least p 1;

if d (x, y) ≧ d2, the probability of h (x) ≧ h (y) is at most p 2;

where d (x, y) represents the distance between x and y, d1< d2, h (x) and h (y) represent the hash of x and y, respectively.

Hash functions that satisfy the above two conditions are called (d1, d2, p1, p2) -sensitive. And the process of hashing the raw set of data by one or more (d1, d2, p1, p2) -sensitive hash functions to generate one or more hash tables is referred to as locality-sensitive hashing.

The process of using locality sensitive hashing to index mass data, namely, a hash table and performing approximate nearest neighbor lookup through indexing is as follows:

off-line index building

(1) Selecting a hash function satisfying (d1, d2, p1, p2) -sensitive locality sensitive hashes;

(2) determining the number L of hash tables, the number K of hash functions in each hash table and parameters related to the hash functions of the locality sensitive hashes according to the accuracy of the search results, namely the probability of the adjacent data being searched;

(3) hashing all data into corresponding buckets through a hash function of locality sensitive hashing to form one or more hash tables;

on-line lookup

(1) Hashing the query data by a hash function of locality sensitive hashing to obtain a corresponding barrel number;

(2) taking out corresponding data in the barrel number; in order to ensure the searching speed, only the first 2L data are taken out;

(3) calculating the similarity or distance between the query data and the 2L data, and returning the nearest neighbor data;

the locality sensitive hash online lookup time consists of two parts: calculating a hash value, namely calculating the time of a barrel number, by using a hash function of a locality sensitive hash; and secondly, comparing the query data with the data in the bucket for calculating time. Thus, the lookup time for locality sensitive hashes is at least a sub-linear time. This is because the matching speed is increased by indexing the attributes in the bucket, and the second part of the time consumption is changed from O (n) to O (logn) or O (1), which greatly reduces the amount of calculation.

One key of the locality sensitive hashing is as follows: mapping similar samples to the same bucket with high probability; the hash function h () of the locality sensitive hash satisfies the following condition:

s{h(f_Aq)＝h(f_A)}＝sim(f_Aq,f_A) (14)

in the formula, sim (f)_Aq,f_A) Denotes f_AqAnd f_ASimilarity of (c), h (f)_A) Denotes f_AHash function of h (f)_Aq) Denotes f_AThe hash function of (a), wherein the similarity measure is directly associated with a distance function σ, such as:

a typical classification of a locality-sensitive hash function is given by a random projection and a threshold, as shown in equation (16),

h(f_A)＝sign(Wf_A+b) (16)

where W is a random hyperplane vector and b is a random intercept.

The locality sensitive hash is composed of a preprocessing algorithm and a nearest neighbor search algorithm, and the search image features are represented into a string of binary codes with fixed length through the processing of the two algorithms;

the preprocessing algorithm comprises the following processes:

inputting a set of extracted image features p and a hash table number l₁Mapping image features by using a random hash function g (.) to obtain a point p_jStore to hash table T_iCorresponding barrel number g_i(p_j) Performing the following steps; output hash table T_i,i＝1,…,l₁；

The nearest neighbor search algorithm comprises the following processes:

inputting a search image feature q, accessing a hash table T generated by a preprocessing algorithm_i,i＝1,…,l₁The number K of the nearest neighbors returns K nearest neighbor data of the retrieval point q in the data set S;

if ═ I₁,I₂,…,I_nIs a data set composed of n images, each image corresponding to a binary code of_H＝{H₁,H₂,…,H_n}，H_i∈{0,1}^h(ii) a Given search image I_qAnd binary code H_qIs prepared from H_qAnd H_i∈_HThe Hamming distance between is less than the threshold value T_HAre put into the candidate pool P,

are candidate images.

Constructing a local sensitive reordering algorithm by using the example characteristics; in the traditional locality sensitive hashing algorithm, returned images which are close in distance are mainly used, namely the similarity between a retrieval image and an image in a candidate pool is close to 1; the vehicle attribute hash codes are mapped to obtain vehicles of the same model, but the vehicles of the same model are still difficult to distinguish, and the vehicle attribute hash codes are obviously distinguished from subjective judgment of people, but the differences cannot be effectively distinguished only through the vehicle attribute hash codes; in order to find out vehicles in the candidate pool which have the same individual characteristics as the retrieval pictures, after the retrieval images are mapped into each barrel through the vehicle attribute hash codes, the images in the barrels are sorted again by using the acquired image example characteristics to reduce the errors in the classes, and the representation form of a re-sorting formula is as follows:

in equation (17), k represents the kth image in the bucket selected by the vehicle attribute hash code mapping,

represents a penalty factor and

cos represents a cosine distance formula for measuring the characteristics of the image instance; to exclude the vehicle attribute hash code from being mapped incorrectly, y represents the pre-mapping search image f_AqAnd images in bucket

Is equal, if equal, y is 1, otherwise is 0;

in further ranking, H has already been assigned_qAnd H_i∈_HThe Hamming distance between is less than the threshold value T_HAre put into a candidate pool P in order to obtain more precisionAccording to the search result, a re-ordering method is further adopted on the basis of the candidate pool;

reordering method, given search image I_qAnd a candidate pool P, using the instance features to determine top k ranked images from the images in the candidate pool P; the degree of similarity between them is calculated using equation (17),

further, with respect to the re-ranking evaluation, a ranking-based criterion is used for evaluation; for a given search image I_qAnd a similarity measure, one ranking for each dataset image; here, a search image I is represented by evaluating the top k ranked images_qThe retrieval accuracy of (1) is expressed by the formula (18);

wherein Rel (I) represents the search image I_qThe real correlation between the ith ranking image and the ith ranking image, wherein k represents the number of the ranking images and Precision @ k searching Precision; when real correlation is calculated, only a classification label part, Rel (i) is belonged to {0,1}, if the search image and the ith ranking image have the same label setting Rel (i) ═ 1, otherwise, Rel (i) ═ 0, and the search precision can be obtained by traversing the first k ranking images in the candidate pool P;

in the step 5), when the retrieved image cannot be obtained, a text retrieval mode is adopted for auxiliary retrieval, and the retrieval characteristics obtained through the text and the text characteristics obtained through the convolutional network can share one set of retrieval mode under the condition of not adding extra training, wherein the text acquisition characteristic method comprises the following steps:

initialization: the text file is analyzed into a term vector; removing small words and repeated words; checking the entry to ensure the correctness of the analysis;

5.1) extracting the randomly combined participle minimum vector R (R) from the input text O₁,r₂,...,r_n)；

5.2) pairing R with f_AIntegrating the sequence and the vehicle attribute Hash codes to obtain text attribute characteristics

At this time f_ATxtA dimension less than R;

5.3) using a locality sensitive reordering Hash algorithm for retrieval;

5.4) returning the similar image group I.

To implement the above summary, several core problems must be solved: 1) aiming at the problem of difficult image feature extraction, the strong feature characterization capability of a depth self-coding convolutional neural network is utilized to realize feature self-adaptive extraction; 2) Aiming at the problem of low retrieval speed of large-scale images, a multi-task layering method is designed, and a query image is used for being rapidly compared with an image in a database; 3) designing a method for extracting example characteristics from a vehicle image for searching similar types of vehicle models; 4) designing a modified locality sensitive hash reordering code to increase the difference between the images of vehicles in the class; 5) by utilizing the advantages of an end-to-end deep network, an end-to-end deep self-coding convolutional neural network is designed, and detection, identification and feature extraction are fused into one network.

The invention discloses a modified locality sensitive hash reordering vehicle retrieval method based on multitask deep learning, which comprises the following processes: 1) sending the image into a depth self-coding convolutional neural network, performing logistic regression on the characteristic graph, and performing position and category segmentation and prediction on an interested region on the retrieval image; 2) extracting vehicle attribute hash codes learned by image segmentation and parallel learning by using a multi-task deep self-coding convolutional neural network; 3) extracting example features of each vehicle by using the pyramid shape of the convolution feature hierarchical structure; 4) retrieving the extracted features by using a modified locality sensitive hashing method; 5) adopting cross-modal retrieval under the condition that the vehicle image cannot be obtained;

the invention has the following beneficial effects:

1) a multi-task end-to-end convolutional neural network is provided for identifying the vehicle type, the vehicle series, the vehicle logo, the color and the license plate of the vehicle;

2) the strong characteristic representation capability of the deep convolutional neural network is utilized to realize the self-adaptive extraction of the characteristics;

3) constructing a characteristic college retrieval extracted from the modified locality sensitive Hash reordering code to the convolutional network;

4) the design gives consideration to universality and specificity, and meets the requirements of various users in the aspects of universality, retrieval speed, precision, practicability and the like; and in the aspect of specificity, a user makes a special data set according to the specific requirement of the user and finely adjusts network parameters to realize a vehicle retrieval system oriented to specific application.

Drawings

Fig. 1 is a flowchart of the overall search.

FIG. 2 is a flow chart of an overall training network.

Fig. 3 is an RPN network development diagram.

Fig. 4 is a schematic diagram of the vehicle attribute hash code being unable to distinguish the vehicle.

FIG. 5 is a diagram of text feature vector generation.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

Referring to fig. 1 to 5, a modified locality sensitive hash vehicle retrieval method based on multitask deep learning, the overall flow chart is as shown in fig. 1, firstly, a multitask end-to-end convolutional neural network for deep learning and training identification is sent to pictures in a database, and various attribute information of a vehicle, including a vehicle type, a vehicle family, a vehicle logo, a color and a license plate, is deeply learned through a massive training and layer-by-layer progressive network structure; then, extracting vehicle attribute hash codes which are obtained by segmenting and parallel learning of vehicle images by utilizing the convolutional network, and extracting example features of the vehicle from the constructed feature pyramid module; and comparing the retrieved vehicle image with the image in the database by using a modified locality sensitive hash reordering method.

The multi-task end-to-end convolutional neural network for deep learning and training identification comprises a shared convolution module, an interesting region coordinate regression and identification module, a multi-task learning module and an example feature extraction module, wherein the overall flow chart is shown in FIG. 2 and comprises 4 shared convolution modules and 4 layers of feature pyramid modules; the two-dot chain line in fig. 2 is a vehicle example feature extracted by the compression layer; the dashed line part in fig. 2 is framed by the proposed partitioning and encoding module, which learns the compact features of the vehicle through different tasks, and finally fuses the two extracted feature vectors;

the invention comprises the following steps:

1) a shared convolution module: the shared network consists of 5 convolution modules, where the last layer of conv2_ x through conv5_ x is {4 }, respectively²,8²,16²,16²As the output size of the feature map, conv1 contains only a single convolutional layer as the input layer;

the method comprises the following steps that a region-of-interest coordinate regression and identification module is connected behind a shared convolution module, the module can take an image with any size as input, outputs a set of rectangular prediction frames of a target region, and comprises position coordinates of each prediction frame and probability scores of categories in a data set, in order to generate a region suggestion frame, firstly, the input image generates a feature map through a convolution sharing layer, and then, multi-scale convolution operation is carried out on the feature map, and the method is specifically realized as follows: using 3 scales and 3 length-width ratios at the position of each sliding window, taking the center of the current sliding window as the center and corresponding to one scale and length-width ratio, and then mapping to obtain 9 candidate regions with different scales on an original image; for a shared convolution signature of size w × h, there are a total of w × h × 9 candidate regions. Finally, the classification layer outputs scores of w × h × 9 × 2 candidate regions, that is, the estimated probability that each region is a target/non-target, and the regression layer outputs w × h × 9 × 4 parameters, that is, coordinate parameters of the candidate regions, and the specific form is shown in fig. 3;

when the RPN network is trained, each candidate region is assigned with a binary label so as to mark whether the region is an object target. The specific operation is as follows: 1) IoU (Intersection-over-Union ratio) with the highest overlap candidate area with a real target area (GT); 2) candidate regions with IoU overlap with any GT bounding box by more than 0.7. Assigning negative labels to candidate regions for which the IoU ratio for all GT bounding boxes is below 0.3; 3) between the two.

is 1, if the candidate area label is 0,

is the coordinate vector of the corresponding GT bounding box. N is a radical of_clsAnd N_regThe normalized coefficients are respectively a classification loss function and a position regression loss function, and lambda is a weight parameter between the two. Classification loss function L_clsIs the log loss of two classes (target vs. non-target):

regression loss function L for position_regDefined by the following function:

where R is a robust loss function (smooth L1).

However, training a multitask deep learning network is not a goalThe process is easy to implement because the information of different task levels has different learning difficulty and convergence speed respectively. Therefore, it is crucial to design a good multitask objective function. The multitask joint training process is as follows: assuming that the total number of tasks is T, the training data for the T-th task is recorded as

Wherein T belongs to (1, T), i belongs to (1, N), and N is the total training samples.

Respectively, the feature vector and the label of the ith sample. Then the multitasking objective function can be expressed as:

in the formula

Is an input feature vector

And a weight parameter w^tL (-) is a loss function, phi (w)^t) Is a regularization value of the weight parameter.

in the formula, x_iIs the ith depth feature, W_jB is a bias term, m and n are the number of processed samples and the category number respectively;

in the formula, E^NIn order to be a function of the squared error cost,

for the kth dimension of the label for the nth sample,

corresponding to the k output of the network prediction for the n sample;

^l＝(W^l+1)^T ^l+1×f'(u^l) (u^l＝W^lx^l-1+b^l) (8)

in the formula (I), the compound is shown in the specification,^lrepresenting the error function of the current layer,^l+1representing the error function of the previous layer, W^l+1For the previous layer of the mapping matrix, f' represents the inverse of the activation function, i.e. upsampling, u^lOutput, x, representing the layer above the failed activation function^l-1Denotes the input of the next layer, W^lMapping a weight matrix for the layer;

2) the relevance of tasks exists in the multi-task learning process, namely information sharing exists among the tasks, and when a plurality of tasks are trained simultaneously, the network enhances the induction bias capability of the system and the generalization capability of the classifier by using the shared information among the tasks; the multitask network is divided into five subtasks by adding five full-connection layers behind an interested region module, each full-connection post-connection softmax activation function normalizes a threshold value between [0 and 1], then sends the normalized value to a segmentation function to promote the output of binary codes, and reduces the redundancy among the hash codes through a segmentation learning and coding strategy so as to enhance the robustness of the learned characteristics;

f_A＝[α¹H¹；α²H²；...；α^tH^t](11)

α in formula (11)^tThe concrete expression form is as follows:

at each H^tPreviously multiplying by a penalty factor alpha^tThe error caused by different classification numbers among different tasks is made up;

3) simultaneously creating a characteristic pyramid with strong semantics on all scales by utilizing the pyramid shape of the convolution characteristic hierarchical structure; to achieve this goal, low resolution, semantically strong features are combined with high resolution, semantically weak features by top-down paths and transverse connections, and can be constructed quickly from a single input image scale, which can be used to replace a characterized image pyramid without sacrificing representative features, speed or memory; in order to obtain example features of the vehicle image and adapt to the input of a convolution feature map with any size, the last layer of each unit of the sharing modules conv2_ x to conv5_ x is selected and combined with the output of the region-of-interest module, and then a pyramid pooling layer and a vector compression layer are added to compress three-dimensional features into a one-dimensional feature vector, so that the selection is that the feature map information obtained by a feature pyramid can be enriched, and the deepest layer of each stage has the strongest feature representation function;

with the last layer of each module as input to the feature pyramid, {4 ] is selected in turn for the last layer of the networks conv2_ x through conv5_ x defined above²,8²,16²,16²The size of an input feature map of the feature pyramid is used as the size of the feature pyramid; the input image is represented by I, the length and width of the input image are represented by letters h and w, the shared convolution module of the x-th layer is represented by convx _ x, the input image is activated into a three-dimensional feature vector T after being input, the dimension h 'multiplied by w' multiplied by d is a set of a series of two-dimensional feature maps, the length and width of the two-dimensional feature maps are h 'multiplied by w', the T contains d two-dimensional feature maps, and the set S is S { S ═ S { (S) }_nIs represented by ∈ (1, d), S_nCorresponding to the nth channel characteristic diagram; then, the three-dimensional feature vector T is sent into a feature pyramid, and is subjected to convolution by a plurality of scale convolution kernels to obtain a three-dimensional feature vector T ', the dimension of which is l × l × d, and the three-dimensional feature vector T ' also comprises a group of two-dimensional feature maps, wherein S ' can be used as S ' { S '_nIs represented by ∈ (1, d), wherein S_n' corresponds to the n-th channelEach feature map is l multiplied by l in size and comprises d in number; then, a sliding window with the size of k multiplied by k and the maximum pooling are selected to carry out logistic regression on the feature maps to obtain a group of feature maps with the size of l/k multiplied by l/k, and then S of each channel is carried out_n' conducting fusion to obtain a one-dimensional vector, conducting the same operation on d channels in sequence, and finally obtaining an individual characteristic vector f_BThe size is (1, l/k × d). The final search feature vector f is shown in equation (13):

f＝[f_A；f_B](13)

if d (x, y) ≦ d1, the probability of h (x) ≦ h (y) is at least p 1;

if d (x, y) ≧ d2, the probability of h (x) ≧ h (y) is at most p 2;

off-line index building

on-line lookup

the locality sensitive hash online lookup time consists of two parts: calculating a hash value, namely calculating the time of a barrel number, by using a hash function of a locality sensitive hash; and secondly, comparing the query data with the data in the bucket for calculating time. Thus, the lookup time for locality sensitive hashes is at least a sub-linear time. This is because the matching speed is increased by indexing the part belonging to the bucket, and the time consumption of the second part is changed from O (N) to O (logN) or O (1), thereby greatly reducing the calculation amount;

s{h(f_Aq)＝h(f_A)}＝sim(f_Aq,f_A) (14)

h(f_A)＝sign(Wf_A+b) (16)

where W is a random hyperplane vector and b is a random intercept.

the preprocessing algorithm comprises the following processes:

The nearest neighbor search algorithm comprises the following processes:

are candidate images.

4) Constructing a local sensitive reordering algorithm by using the example characteristics; in the traditional locality sensitive hashing algorithm, returned images which are close in distance are mainly used, namely the similarity between a retrieval image and an image in a candidate pool is close to 1; the vehicle attribute hash codes are mapped through the low-dimensional vehicle attribute hash codes to obtain vehicles of the same model, but the vehicles of the same model are still difficult to distinguish, and are obviously distinguished from subjective judgment of people, but the differences cannot be effectively distinguished through the vehicle attribute hash codes, as shown in fig. 4; in order to find out vehicles in the candidate pool which have the same individual characteristics as the retrieval pictures, after the retrieval images are mapped into each barrel through the vehicle attribute hash codes, the images in the barrels are sorted again by using the acquired image example characteristics to reduce the errors in the classes, and the representation form of a re-sorting formula is as follows:

represents a penalty factor and

Is equal, if equal, y is 1, otherwise is 0;

in further ranking, H has already been assigned_qAnd H_i∈_HThe Hamming distance between is less than the threshold value T_HThe images are put into a candidate pool P, and in order to obtain a more accurate search result, a re-ordering method is further adopted on the basis of the candidate pool;

wherein Rel (I) represents a search image I_qThe real correlation between the ith ranking image and the ith ranking image, wherein k represents the number of the ranking images and Precision @ k searching Precision; in computing true correlations, only the part of the class label, Rel (i) e, is considered{0,1}, if the search image and the ith ranking image have the same label setting rel (i) ═ 1, otherwise, setting rel (i) ═ 0, traversing the first k ranking images in the candidate pool P to obtain the search precision;

5) when the retrieved image cannot be obtained, a text retrieval mode is adopted for auxiliary retrieval, so that the retrieval characteristics obtained through the text and the text characteristics obtained through the convolution network can share one set of retrieval mode under the condition of not adding extra training, and if a certain text contains a vehicle description information identification marker, as shown in fig. 5, the text acquisition characteristic method comprises the following steps:

the initialization process is as follows: the text file is analyzed into a term vector; removing small words and repeated words; checking the entry to ensure the correctness of the analysis;

At this time f_ATxtA dimension less than R;

5.3) using a locality sensitive reordering Hash algorithm for retrieval;

5.4) returning the similar image group I;

the above description is only exemplary of the preferred embodiments of the present invention, and is not intended to limit the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A modified locality sensitive Hash vehicle retrieval method based on multitask deep learning is characterized by comprising the following steps:

5) constructing a cross-mode retrieval method under the condition that a retrieval vehicle image cannot be obtained, and realizing vehicle retrieval;

the multi-task end-to-end convolutional neural network for deep learning and training identification comprises a shared convolution module, an interesting region coordinate regression and identification module and a multi-task learning module;

the method comprises the following steps that a region-of-interest coordinate regression and identification module is connected behind a shared convolution module, the module takes an image with any size as input, outputs a set of rectangular prediction frames of a target region, and comprises position coordinates of each prediction frame and probability scores of categories in a data set, in order to generate a region suggestion frame, firstly, the input image generates a feature map through a convolution sharing layer, and then, multi-scale convolution operation is carried out on the feature map, and the implementation process is as follows: using 3 scales and 3 length-width ratios at the position of each sliding window, taking the center of the current sliding window as the center and corresponding to one scale and length-width ratio, and then mapping to obtain 9 candidate regions with different scales on an original image; if the shared convolution characteristic diagram with the size of w × h is used, w × h × 9 candidate regions are totally obtained; finally, the classification layer outputs scores of w × h × 9 × 2 candidate regions, namely, the estimation probability that each region is a target/non-target, and the regression layer outputs w × h × 9 × 4 parameters, namely, coordinate parameters of the candidate regions;

when the RPN is trained, each candidate region is allocated with a binary label so as to mark whether the region is an object target, and the operation is as follows: 1) IoU overlap candidate regions highest with a certain real target region GT; 2) candidate regions with IoU overlap of greater than 0.7 with any GT bounding box, assigning negative labels to candidate regions with IoU ratios to all GT bounding boxes below 0.3; 3) discard between the two;

with these definitions, the objective function is minimized, and the loss function for an image is defined as:

where i is the index of the ith candidate region, p_iIs the probability that the candidate region is of the ith class, if the label of the candidate region is positive,

is 1, if the candidate area label is 0,

is 0, t_iIs a vector, representing the 4 parameterized coordinates of the predicted bounding box,

is the coordinate vector of the corresponding GT bounding box, N_clsAnd N_regRespectively, the normalized coefficients of the classification loss function and the position regression loss function, lambda is the weight parameter between the two, and the classification loss function L_clsIs a log loss of two classes, target and non-target:

regression loss function L for position_regDefined by the following function:

wherein R is a robust loss function smooth_L1；

However, training a multitask deep learning network is not an easy process to implement, because information at different task levels has different learning difficulties and convergence rates, and the multitask joint training process is as follows: assuming that the total number of tasks is T, the training data for the T-th task is recorded as

in the formula

Is an input feature vector

for a loss function, training the characteristics of the last layer by utilizing softmax matched with a log-likelihood cost function to realize image classification, wherein the softmax loss function is defined as follows:

the convolutional neural network training is a back propagation process, and the convolutional parameters and the bias are optimized and adjusted by using a random gradient descent method through back propagation of an error function until the network is converged or the maximum iteration times is reached;

in the formula, E^NIn order to be a function of the squared error cost,

for the kth dimension of the label for the nth sample,

corresponding to the k output of the network prediction for the n sample;

^l＝(W^l+1)^T ^l+1×f'(u^l) (u^l＝W^lx^l-1+b^l) (8)

2. The revised locality-sensitive hashing vehicle retrieval method based on multitask deep learning as claimed in claim 1, wherein: the multi-task has relevance between tasks in the learning process, namely information sharing exists between the tasks,

when a plurality of tasks are trained simultaneously, the network utilizes shared information among the tasks to enhance the induction bias capability of the system and the generalization capability of the classifier; the multitask network is divided into five subtasks by adding five full-connection layers behind an interested region module, each full-connection post-connection softmax activation function normalizes a threshold value between [0 and 1], then sends the normalized value to a segmentation function to promote the output of binary codes, and reduces the redundancy among the hash codes through a segmentation learning and coding strategy so as to enhance the robustness of the learned characteristics;

dividing the multi-task learning network into T tasks, wherein each task contains c^tOne class, full connected layer one-dimensional vector output per task uses m^tIndicating that the output of the fully-connected layer is first normalized to [0,1] using the softmax activation function]The formula is embodied as follows:

f_A＝[α¹H¹；α²H²；...；α^tH^t](11)

α in formula (11)^tThe concrete expression form is as follows:

3. The revised locality-sensitive hashing vehicle retrieval method based on multitask deep learning as claimed in claim 2, wherein: simultaneously creating a characteristic pyramid with strong semantics on all scales by utilizing the pyramid shape of the convolution characteristic hierarchical structure; to achieve this goal, low resolution, semantically strong features are combined with high resolution, semantically weak features by top-down paths and transverse connections by means of a structure; the feature pyramid has rich semantics on all levels, can be quickly constructed from a single input image scale, can be used for replacing a characterized image pyramid without sacrificing representative features, speed or memory, selects the last layer of each unit of sharing modules conv2_ x to conv5_ x and combines the output of an interested region module in order to obtain example features of a vehicle image and adapt to the input of a convolution feature map with any size, and then adds a pyramid pooling layer and a vector compression layer to compress three-dimensional features into a one-dimensional feature vector, so that the feature information obtained by the feature pyramid can be enriched, and the deepest layer of each stage has the strongest feature representation function;

most of each moduleThe latter layer is used as input for the feature pyramid, and {4 } is selected in turn for the last layer of the networks conv2_ x through conv5_ x defined above²,8²,16²,16²The size of an input feature map of the feature pyramid is used as the size of the feature pyramid; the input image is represented by I, the length and width of the input image are represented by letters h and w, the shared convolution module of the x-th layer is represented by convx _ x, the input image is activated into a three-dimensional feature vector T after being input, the dimension h 'multiplied by w' multiplied by d is a set of a series of two-dimensional feature maps, the length and width of the two-dimensional feature maps are h 'multiplied by w', the T contains d two-dimensional feature maps, and the set S is S { S ═ S { (S) }_nIs represented by ∈ (1, d), S_nCorresponding to the nth channel characteristic diagram; then, the three-dimensional feature vector T is sent into a feature pyramid, and is subjected to convolution by a plurality of scale convolution kernels to obtain a three-dimensional feature vector T ', the dimension of which is l × l × d, and the three-dimensional feature vector T ' also comprises a group of two-dimensional feature maps, wherein S ' can be used as S ' { S '_nDenotes n ∈ (1, d), wherein S'_nCorresponding to the nth channel characteristic diagram, wherein the size of each characteristic diagram is l multiplied by l, and the total number of the characteristic diagrams is d; then, performing logistic regression on the feature maps by using a k × k sliding window and selecting the maximum pooling to obtain a group of l/k × l/k feature maps, and then performing S 'on each channel'_nFusing to obtain a one-dimensional vector, sequentially performing the same operation on the d channels, and finally obtaining an individual feature vector f_BThe size is (1, l/k × d), and the final search feature vector f is shown in formula (13):

f＝[f_A；f_B](13)。

4. the revised locality-sensitive hashing vehicle retrieval method based on multitask deep learning as claimed in claim 3, wherein: performing fast image comparison on the compact binary code by using a hash method and a Hamming distance; the Hash method adopts a locality sensitive Hash algorithm, namely, a Hash bit is constructed by adopting random projection transformation;

s{h(f_Aq)＝h(f_A)}＝sim(f_Aq,f_A) (14)

in the formula (f)_AqRepresenting the pre-map search image, sim (f)_Aq,f_A) Denotes f_AqAnd f_ASimilarity of (c), h (f)_A) Denotes f_AHash function of h (f)_Aq) Denotes f_AqThe hash function of (3), wherein the similarity measure is directly associated with a distance function σ, such as:

h(f_A)＝sign(Wf_A+b) (16)

where W is a random hyperplane vector and b is a random intercept.

5. The revised locality-sensitive hashing vehicle retrieval method based on multitask deep learning as claimed in claim 4, wherein: the locality sensitive hash is composed of a preprocessing algorithm and a nearest neighbor search algorithm, and the search image features are represented into a string of binary codes with fixed length through the processing of the two algorithms;

the preprocessing algorithm comprises the following processes:

The nearest neighbor search algorithm comprises the following processes:

are candidate images.