CN111695508B - Multi-scale Retinex and gesture image retrieval method based on improved VGGNet network - Google Patents


Info

Publication number
CN111695508B
CN111695508B (application CN202010532767.4A)
Authority
CN
China
Prior art keywords: image, gesture, features, feature, gesture image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010532767.4A
Other languages
Chinese (zh)
Other versions
CN111695508A (en)
Inventor
谢武
贾清玉
刘满意
强保华
崔梦银
瞿元昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202010532767.4A priority Critical patent/CN111695508B/en
Publication of CN111695508A publication Critical patent/CN111695508A/en
Application granted granted Critical
Publication of CN111695508B publication Critical patent/CN111695508B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a gesture image retrieval method based on multi-scale Retinex and an improved VGGNet network. After the gesture image retrieval model is trained, the features extracted by the last FC layer are used as the image representation for the gesture image retrieval task. A hash layer is introduced into the improved multi-branch VGGNet structure: the model takes a gesture image and a category label as input, the category label serves as supervision information for learning image features, and each branch learns different label information. The features learned by the first two branches are fused through a fully connected layer to obtain a nonlinear combined feature, a low-dimensional hash feature is then obtained through the hash layer and binarized into a hash code, and finally the binary hash code is used as the feature vector for gesture retrieval. The method improves the efficiency of gesture retrieval while maintaining accuracy.

Description

Multi-scale Retinex and gesture image retrieval method based on improved VGGNet network
Technical Field
The invention relates to a gesture image retrieval method, in particular to a gesture image retrieval method based on multi-scale Retinex and an improved VGGNet network.
Background
In recent years, deep learning has achieved results in computer vision that surpass traditional methods, and it has become one of the most popular research directions. When high-dimensional feature vectors are used for gesture feature similarity calculation, the penultimate fully connected layer of VGGNet yields features of up to 4096 dimensions, stored as floating-point numbers; this greatly increases similarity-matching time during gesture image retrieval and leads to a poor user experience.
Disclosure of Invention
The invention aims to provide a gesture image retrieval method based on multi-scale Retinex and an improved VGGNet network, addressing the slow similarity calculation of high-dimensional gesture feature vectors in the prior art.
The technical scheme for realizing the purpose of the invention is as follows:
a gesture image retrieval method based on multi-scale Retinex and improved VGGNet network comprises the following steps:
(1) picture preprocessing: performing dim-light enhancement on the gesture image with a multi-scale Retinex algorithm, normalizing the enhanced gesture image, and converting it into the data input format required by the CNN (convolutional neural network) model;
(2) feature extraction: performing a series of convolution, pooling and fully connected operations on the gesture image with a trained CNN model to extract its features; preprocessing the gesture data set and labeling and integrating the labels; constructing, initializing and training a VGGNet-based network structure; taking the features extracted by the last FC layer of the trained gesture model as the image representation for the gesture image retrieval task; introducing a hash layer, fusing the features through the fully connected layer to obtain nonlinear combined features, then obtaining a binary hash code through the hash layer; performing gesture retrieval with the binary hash code as the feature vector and constructing a feature database;
(3) similarity matching: and acquiring an image list from the feature database, and matching features similar to the query picture.
The method by which the gesture picture is normalized into the data input format required by the CNN model in step (1) comprises the following steps:
1) inputting an original image I (x, y);
2) estimating the noise of each position and removing it; assuming that the image I seen by the human eye is the product of the image illumination component L and the reflectance component R, as expressed in formula 1:
I(x,y)=R(x,y)·L(x,y) (1)
3) separating the three color channel components and converting them to the logarithmic domain; the illumination L is estimated from the captured picture I, preserving the intrinsic object attribute R and eliminating interference from uneven illumination; taking the logarithm of both sides of formula 1 and letting i(x, y) = log(I(x, y)), r(x, y) = log(R(x, y)) and l(x, y) = log(L(x, y)) gives formula 2:
i(x,y)=r(x,y)+l(x,y) (2)
4) setting the number and size of Gaussian function scales;
5) the Gaussian function filters the three channels of the image; the filtered result is the illumination component, from which the image r(x, y) is obtained, and the reflection component is computed as follows:
r_i(x, y) = i_i(x, y) − i_i(x, y) * G(x, y)   (3)

G(x, y) = (1/(2πσ²)) · exp(−(x² + y²)/(2σ²))   (4)

where i_i(x, y) denotes the original image of the i-th channel, G(x, y) is the Gaussian filter function, r_i(x, y) denotes the reflection component of the i-th channel, * denotes convolution, and σ is the scale parameter.
The data enhancement of the gesture image in the step (1) is carried out, and the adopted method comprises the following steps:
1) for a gesture image, filtering its three channels with Gaussian filter functions of several scales and taking the weighted average of the reflection components at each scale as the final output, so that formula (3) becomes:
r_i(x, y) = Σ_{k=1}^{N} w_k [ i_i(x, y) − i_i(x, y) * G_k(x, y) ]   (5)
where G_k(x, y) denotes the k-th Gaussian filter function and N the number of Gaussian filter functions; experiments show that the gesture image data is enhanced most effectively when N = 3; w_k is the weight of the k-th scale, and the weights of the N Gaussian filter functions satisfy the constraint:
Σ_{k=1}^{N} w_k = 1   (6)
2) converting R (x, y) from a logarithmic domain to a real domain to obtain R (x, y);
3) performing linear correction on R(x, y) (since its range does not lie within 0-255); the corrected result is the enhanced gesture image.
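The enhancement steps above can be sketched as follows. This is a minimal NumPy/SciPy sketch, not the patent's exact implementation: the scale values (15, 80, 250) and the equal weights w_k = 1/N are illustrative assumptions, and the final linear correction is implemented as a simple min-max stretch into 0-255.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multi_scale_retinex(img, sigmas=(15, 80, 250), weights=None):
    """Multi-scale Retinex dim-light enhancement sketch.

    img: H x W x 3 array of intensities.
    sigmas: illustrative Gaussian scales (N = 3, as the text suggests).
    weights: per-scale weights w_k; must satisfy sum(w_k) = 1.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    if weights is None:
        weights = np.full(len(sigmas), 1.0 / len(sigmas))
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0)          # constraint (6): sum w_k = 1

    img = img.astype(np.float64) + 1.0             # avoid log(0)
    log_img = np.log(img)                          # i(x, y) = log I(x, y)

    r = np.zeros_like(log_img)
    for w, sigma in zip(weights, sigmas):
        for c in range(img.shape[2]):              # filter each color channel
            # illumination estimate l = log(I * G_k); reflection r = i - l
            blurred = gaussian_filter(img[..., c], sigma)
            r[..., c] += w * (log_img[..., c] - np.log(blurred))

    # linear correction: stretch r back into the displayable 0-255 range
    r_min, r_max = r.min(), r.max()
    out = (r - r_min) / (r_max - r_min + 1e-12) * 255.0
    return out.astype(np.uint8)
```

A dark gesture photo passed through this function comes back with its reflectance detail stretched over the full intensity range, which is what makes the subsequent CNN features more stable under uneven lighting.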
The feature extraction of the gesture image in step (2) covers two aspects: extracting features of the query picture uploaded by the user, and extracting features of the picture database to construct an image feature database. The feature extraction method comprises the following steps:
(1) data preprocessing: preprocessing the gesture data set and labeling and integrating the labels; the preprocessing includes data enhancement, data normalization, etc.;
(2) constructing a VGGNet-based network structure: training with the VGGNet16 network model; defining and initializing the VGGNet16 structure and setting the learning rate lr, the batch size, the number of epochs, etc.;
(3) training a model: training and verifying the model alternately;
(4) taking the features extracted from the last FC layer of the gesture model trained in step (3) as the image representation for the gesture image retrieval task; the input is a gesture image and a category label, the category label serves as supervision information for learning image features, and each branch learns different label information; the features learned by the first two branches are fused through a fully connected layer to obtain a nonlinear combined feature, a low-dimensional hash feature is obtained through the hash layer and binarized into a hash code, and finally the binary hash code is used as the feature vector for gesture retrieval;
(5) saving the model file;
(6) randomly selecting 100 pictures from the test set as query pictures, using the remaining pictures as the image database, selecting the model with the best classification performance as the feature extractor, and constructing the feature database.
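Building the feature database from hash-layer outputs can be sketched as follows. This is a hypothetical NumPy sketch: `fake_hash_features` stands in for the trained model's hash-layer output, the 48-bit code length is an illustrative assumption, and thresholding the activations at 0.5 is one common binarization choice the patent text does not fix.

```python
import numpy as np

rng = np.random.default_rng(42)

def binarize(hash_activations, threshold=0.5):
    """Turn real-valued hash-layer outputs in (0, 1) into a 0/1 code."""
    return (np.asarray(hash_activations) >= threshold).astype(np.uint8)

def build_feature_database(images, extract_hash_features):
    """Map image id -> binary hash code for every database picture."""
    return {i: binarize(extract_hash_features(img))
            for i, img in enumerate(images)}

def fake_hash_features(img):
    # stand-in for the trained model's 48-unit hash layer (assumption)
    return rng.random(48)

database = build_feature_database([None] * 5, fake_hash_features)
```

At query time the same extractor-plus-binarize path is applied to the uploaded picture, so query and database codes live in the same Hamming space.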
The invention has the following beneficial effects: after the gesture image data is enhanced with the multi-scale Retinex method, the model is trained with deep learning; once the gesture image retrieval model is trained, the features extracted from the last FC layer serve as the image representation for the gesture image retrieval task, and a hash layer is introduced into the improved multi-branch VGGNet structure, so that gesture retrieval efficiency is improved while accuracy is maintained.
Drawings
Fig. 1 is a flow chart of an improved VGGNet network according to an embodiment of the present invention;
FIG. 2 is a flow chart of the calculation of the reflection component according to the embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The embodiment is as follows:
a gesture image retrieval method based on multi-scale Retinex and an improved VGGNet network comprises the following steps:
1. a user uploads a picture needing gesture inquiry;
2. picture preprocessing: dim-light enhancement is performed on the gesture image with the multi-scale Retinex algorithm, and the enhanced gesture image is normalized and converted into the data input format required by the CNN (convolutional neural network) model; the specific steps are as follows:
(1) inputting an original image I (x, y);
(2) estimating the noise of each position and removing it. Assuming that the image I seen by the human eye is the product of the image illumination component L and the reflectance component R, as expressed in formula 1:
I(x,y)=R(x,y)·L(x,y) (1)
(3) separating the three color channel components and converting them to the logarithmic domain. The illumination L is estimated from the captured picture I, preserving the intrinsic object attribute R, eliminating interference from uneven illumination, and improving the perceptual quality of the image. For convenience of calculation, the logarithm of both sides of formula 1 is taken, and letting i(x, y) = log(I(x, y)), r(x, y) = log(R(x, y)) and l(x, y) = log(L(x, y)) gives formula 2:
i(x,y)=r(x,y)+l(x,y) (2)
the calculation process of the reflection component is shown in fig. 2;
(4) setting the number and size of Gaussian function scales;
(5) the Gaussian function filters the three channels of the image; the filtered result is the illumination component, from which the image r(x, y) is obtained, and the reflection component is computed as:
r_i(x, y) = i_i(x, y) − i_i(x, y) * G(x, y)   (3)

G(x, y) = (1/(2πσ²)) · exp(−(x² + y²)/(2σ²))   (4)

where i_i(x, y) denotes the original image of the i-th channel, G(x, y) is the Gaussian filter function, r_i(x, y) denotes the reflection component of the i-th channel, * denotes convolution, and σ is the scale parameter.
(6) data enhancement is applied to the gesture image with the multi-scale Retinex algorithm. The specific process is: for a gesture image, filter its three channels with Gaussian filter functions of several scales and take the weighted average of the reflection components at each scale as the final output, so that formula (3) becomes:
r_i(x, y) = Σ_{k=1}^{N} w_k [ i_i(x, y) − i_i(x, y) * G_k(x, y) ]   (5)
where G_k(x, y) denotes the k-th Gaussian filter function and N the number of Gaussian filter functions; experiments show that the gesture image data is enhanced most effectively when N = 3. w_k is the weight of the k-th scale, and the weights of the N Gaussian filter functions satisfy the constraint:
Σ_{k=1}^{N} w_k = 1   (6)
(7) converting R (x, y) from logarithmic domain to real domain to obtain R (x, y)
(8) performing linear correction on R(x, y) (since its range does not lie within 0-255); the corrected result is the enhanced gesture image.
3. Feature extraction: a trained CNN model performs a series of convolution, pooling and fully connected operations on the gesture image to extract its features. The feature extraction covers two aspects: extracting features of the query picture uploaded by the user, and extracting features of the picture database to construct an image feature database. The implementation steps are as follows:
(1) data preprocessing: and preprocessing the gesture data set and labeling and integrating the label, wherein the preprocessing comprises data enhancement, data normalization and the like.
(2) constructing a VGGNet-based network structure: training with the VGGNet16 network model; defining and initializing the VGGNet16 structure and setting the learning rate lr, the batch size, the number of epochs, etc.;
(3) training the model: training and validation alternate;
(4) the features extracted from the last FC layer of the trained gesture model serve as the image representation for the gesture image retrieval task. Since the penultimate fully connected layer of VGGNet yields features of up to 4096 dimensions, similarity matching during gesture image retrieval would otherwise be slow; an improved multi-branch VGGNet structure with a hash layer is therefore proposed, which improves gesture retrieval efficiency while maintaining accuracy. The overall network model is shown in fig. 1: the input is a gesture image and a category label, the category label serves as supervision information for learning image features, and each branch learns different label information; the features learned by the first two branches are fused through a fully connected layer to obtain nonlinear combined features, a low-dimensional hash feature is then obtained through the hash layer and binarized into a hash code, and finally the binary hash code is used as the feature vector for gesture retrieval;
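The fusion-plus-hash step described above can be sketched numerically as follows. This is a toy forward pass, not the trained network: the weight matrices are random stand-ins for the learned fully connected and hash layers, the tanh/sigmoid nonlinearities and the 4096/512/48 dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_and_hash(branch_a, branch_b, W_fc, b_fc, W_hash, b_hash):
    """Fuse two branch feature vectors and produce a binary hash code.

    1. Concatenate the features learned by the two branches.
    2. A fully connected layer yields the nonlinear combined feature.
    3. The hash layer maps it to a low-dimensional value in (0, 1).
    4. Thresholding at 0.5 gives the binary hash code.
    """
    fused = np.concatenate([branch_a, branch_b])
    combined = np.tanh(W_fc @ fused + b_fc)        # nonlinear combined feature
    h = sigmoid(W_hash @ combined + b_hash)        # low-dimensional hash feature
    return (h >= 0.5).astype(np.uint8)             # binary hash code

rng = np.random.default_rng(0)
a, b = rng.random(4096), rng.random(4096)          # two branch FC outputs
W_fc, b_fc = rng.standard_normal((512, 8192)) * 0.01, np.zeros(512)
W_hash, b_hash = rng.standard_normal((48, 512)) * 0.1, np.zeros(48)
code = fuse_and_hash(a, b, W_fc, b_fc, W_hash, b_hash)
```

The payoff of the design is visible in the shapes alone: a 4096-dimensional float representation per branch is compressed into a 48-bit code, which is what keeps similarity matching cheap.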
(5) saving the model file;
(6) randomly selecting 100 pictures from the test set as query pictures, using the remaining pictures as the image database, selecting the model with the best classification performance as the feature extractor, and constructing the feature database;
4. similarity matching: matching out features similar to the query picture from the feature database;
5. returning the result of the query: and acquiring the sorted image list from the image database and presenting the image list to the user.
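With binary hash codes as feature vectors, the similarity matching of steps 4-5 reduces to Hamming distance and a sort. A sketch with illustrative 8-bit codes (real codes would be longer; the exact length is not fixed by the text):

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database entries by Hamming distance to the query code.

    query_code: (B,) array of 0/1 bits.
    db_codes: (M, B) array of 0/1 bits, one row per database image.
    Returns database indices sorted from most to least similar.
    """
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")

query = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
db = np.array([
    [1, 0, 1, 1, 0, 0, 1, 0],   # identical code: distance 0
    [1, 1, 1, 1, 0, 0, 1, 0],   # one bit flipped: distance 1
    [0, 1, 0, 0, 1, 1, 0, 1],   # all bits flipped: distance 8
], dtype=np.uint8)
ranking = hamming_rank(query, db)   # most similar first: [0, 1, 2]
```

The sorted index list is exactly the "sorted image list" returned to the user, with the closest codes first.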

Claims (2)

1. A gesture image retrieval method based on multi-scale Retinex and an improved VGGNet network, characterized in that the method comprises the following steps:
(1) picture preprocessing: performing dim-light enhancement on the gesture image with a multi-scale Retinex algorithm, normalizing the enhanced gesture image, and converting it into the data input format required by the CNN model; the method by which the gesture picture is normalized into the data input format required by the CNN model comprises the following steps:
1) inputting an original image I(x, y);
2) estimating the noise of each position and removing it; assuming that the original image I seen by the human eye is the product of the image illumination component L and the reflectance component R, as expressed in formula 1:
I(x, y) = R(x, y) · L(x, y)   (1)
3) separating the three color channel components and converting them to the logarithmic domain; the illumination is estimated from the captured picture, preserving the intrinsic object attribute R and eliminating interference from uneven illumination; taking the logarithm of both sides of formula 1 and letting i(x, y) = log(I(x, y)), r(x, y) = log(R(x, y)) and l(x, y) = log(L(x, y)) gives formula 2:
i(x, y) = r(x, y) + l(x, y)   (2)
4) setting the number and size of the Gaussian function scales;
5) the Gaussian function filters the three channels of the image; the filtered result is the illumination component, and the reflection component is computed as:
r_i(x, y) = i_i(x, y) − i_i(x, y) * G(x, y)   (3)
where i_i(x, y) denotes the original image of the i-th channel, G(x, y) is the Gaussian filter function, r_i(x, y) denotes the reflection component of the i-th channel, * denotes convolution, and σ is the scale parameter;
the data enhancement of the gesture image uses the following method:
1) for a gesture image, filtering its three channels with Gaussian filter functions of several scales and taking the weighted average of the reflection components at each scale as the final output, so that formula (3) becomes:
r_i(x, y) = Σ_{k=1}^{N} w_k [ i_i(x, y) − i_i(x, y) * G_k(x, y) ]
where G_k(x, y) denotes the k-th Gaussian filter function, N the number of Gaussian filter functions, and w_k the weight of the k-th scale; the weights of the N Gaussian filter functions satisfy the constraint:
Σ_{k=1}^{N} w_k = 1
2) converting r(x, y) from the logarithmic domain to the real domain to obtain R(x, y);
3) performing linear correction on R(x, y); the corrected result is the enhanced gesture image;
(2) feature extraction: performing a series of convolution, pooling and fully connected operations on the gesture image with a trained CNN model to extract its features; preprocessing the gesture data set and labeling and integrating the labels; constructing, initializing and training a VGGNet-based network structure; taking the features extracted by the last FC layer of the trained gesture model as the image representation for the gesture image retrieval task; introducing a hash layer, fusing the features through the fully connected layer to obtain nonlinear combined features, then obtaining a binary hash code through the hash layer; performing gesture retrieval with the binary hash code as the feature vector and constructing a feature database;
(3) similarity matching: and acquiring an image list from the feature database, and matching features similar to the query picture.
2. The gesture image retrieval method according to claim 1, characterized in that the feature extraction of the gesture image in step (2) covers two aspects: extracting features of the query picture uploaded by the user, and extracting features of the picture database to construct an image feature database; the feature extraction method comprises the following steps:
(1) data preprocessing: preprocessing a gesture data set and labeling and integrating labels, wherein the preprocessing comprises data enhancement and data normalization;
(2) constructing a network structure based on VGGNet: training by adopting a VGGNet16 network model, defining and initializing a network structure of VGGNet16, setting a learning rate lr, a batch size batch and iteration rounds epochs;
(3) training a model: training and verifying the model alternately;
(4) taking the features extracted from the last FC layer of the gesture model trained in step (3) as the image representation for the gesture image retrieval task; the input is a gesture image and a category label, the category label serves as supervision information for learning image features, and each branch learns different label information; the features learned by the first two branches are fused through a fully connected layer to obtain a nonlinear combined feature, a low-dimensional hash feature is obtained through the hash layer and binarized into a hash code, and finally the binary hash code is used as the feature vector for gesture retrieval;
(5) saving the model file;
(6) randomly selecting 100 pictures from the test set as query pictures, using the remaining pictures as the image database, selecting the model with the best classification performance as the feature extractor, and constructing the feature database.
CN202010532767.4A 2020-06-12 2020-06-12 Multi-scale Retinex and gesture image retrieval method based on improved VGGNet network Active CN111695508B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010532767.4A CN111695508B (en) 2020-06-12 2020-06-12 Multi-scale Retinex and gesture image retrieval method based on improved VGGNet network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010532767.4A CN111695508B (en) 2020-06-12 2020-06-12 Multi-scale Retinex and gesture image retrieval method based on improved VGGNet network

Publications (2)

Publication Number Publication Date
CN111695508A CN111695508A (en) 2020-09-22
CN111695508B true CN111695508B (en) 2022-07-19

Family

ID=72480517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010532767.4A Active CN111695508B (en) 2020-06-12 2020-06-12 Multi-scale Retinex and gesture image retrieval method based on improved VGGNet network

Country Status (1)

Country Link
CN (1) CN111695508B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
CN109241313A (en) * 2018-08-14 2019-01-18 大连大学 A kind of image search method based on the study of high-order depth Hash
CN109815920A (en) * 2019-01-29 2019-05-28 南京信息工程大学 Gesture identification method based on convolutional neural networks and confrontation convolutional neural networks
CN109947963A (en) * 2019-03-27 2019-06-28 山东大学 A kind of multiple dimensioned Hash search method based on deep learning
CN110427509A (en) * 2019-08-05 2019-11-08 山东浪潮人工智能研究院有限公司 A kind of multi-scale feature fusion image Hash search method and system based on deep learning
CN110784253A (en) * 2018-07-31 2020-02-11 深圳市白麓嵩天科技有限责任公司 Information interaction method based on gesture recognition and Beidou satellite


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Haonan Chen, Yaowu Chen, "A Cascade Face Spoofing Detector Based on Face Anti-Spoofing R-CNN and Improved Retinex LBP", IEEE Access, vol. 7, pp. 170116-170133, 2019. *
Jin Kyu Kang, Toan Minh Hoang, "Person Re-Identification Between Visible and Thermal Camera Images Based on Deep Residual CNN Using Single Input", IEEE Access, vol. 7, 2019. *

Also Published As

Publication number Publication date
CN111695508A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN107122809B (en) Neural network feature learning method based on image self-coding
Radenovic et al. Deep shape matching
CN109543502B (en) Semantic segmentation method based on deep multi-scale neural network
CN108510012B (en) Target rapid detection method based on multi-scale feature map
CN106372581B (en) Method for constructing and training face recognition feature extraction network
CN108898620B (en) Target tracking method based on multiple twin neural networks and regional neural network
WO2020228525A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
CN108304826A (en) Facial expression recognizing method based on convolutional neural networks
CN111985581B (en) Sample-level attention network-based few-sample learning method
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN111340814A (en) Multi-mode adaptive convolution-based RGB-D image semantic segmentation method
CN112036288B (en) Facial expression recognition method based on cross-connection multi-feature fusion convolutional neural network
CN112580590A (en) Finger vein identification method based on multi-semantic feature fusion network
CN110634170B (en) Photo-level image generation method based on semantic content and rapid image retrieval
CN109740679B (en) Target identification method based on convolutional neural network and naive Bayes
CN109033978B (en) Error correction strategy-based CNN-SVM hybrid model gesture recognition method
CN111814611B (en) Multi-scale face age estimation method and system embedded with high-order information
CN114299559A (en) Finger vein identification method based on lightweight fusion global and local feature network
CN111079514A (en) Face recognition method based on CLBP and convolutional neural network
CN110363156A (en) A kind of Facial action unit recognition methods that posture is unrelated
CN110610138A (en) Facial emotion analysis method based on convolutional neural network
CN111694977A (en) Vehicle image retrieval method based on data enhancement
CN115049814B (en) Intelligent eye protection lamp adjusting method adopting neural network model
CN115393225A (en) Low-illumination image enhancement method based on multilevel feature extraction and fusion
Lata et al. Data augmentation using generative adversarial network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant