CN107679462B - Depth multi-feature fusion classification method based on wavelets - Google Patents

Depth multi-feature fusion classification method based on wavelets

Info

Publication number
CN107679462B
Authority
CN
China
Prior art keywords
feature
channel
frequency components
layer
training
Prior art date
Legal status
Active
Application number
CN201710823051.8A
Other languages
Chinese (zh)
Other versions
CN107679462A (en)
Inventor
于刚
李艇
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN201710823051.8A
Publication of CN107679462A
Application granted
Publication of CN107679462B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Abstract

The invention provides a depth multi-feature fusion classification method based on wavelets, which comprises an offline training stage and an online identification stage. In the offline training stage, a convolutional neural network is constructed and trained on samples of n label classes; a discrete wavelet transform is added to the convolutional layer and the fully connected layer at the end of the model to decompose the deep multi-feature maps, and the resulting high- and low-frequency components are fused linearly to obtain the optimal weights. In the online identification stage, the convolutional neural network works together with a support vector machine to identify and classify the actions in images and videos. The beneficial effect of the invention is that the accuracy of image and video classification and identification is improved.

Description

Depth multi-feature fusion classification method based on wavelets
Technical Field
The invention relates to robot visual image processing, in particular to a depth multi-feature fusion classification method based on wavelets.
Background
In recent years, deep learning has become one of the hottest terms in the technology world. It has gradually overturned algorithm design in numerous fields such as speech recognition, image classification and text understanding, and has gradually formed a new paradigm that starts from the training data, passes through an end-to-end model, and directly outputs the final result. With the arrival of the big-data era and the development of ever more powerful computing devices such as GPUs, deep learning has been given wings: it can make full use of massive data, learn abstract knowledge representations completely automatically, and condense raw data into knowledge. Among deep learning frameworks, the convolutional neural network is again the most commonly used.
As the convolutional neural network framework keeps expanding and the number of network layers keeps increasing, the feature maps extracted by each module grow steadily. Simply flattening the convolutional layers into a vector and then fully connecting it leads to a huge amount of computation and to feature blurring, which harms the accuracy of image and video classification and identification.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a depth multi-feature fusion classification method based on wavelets, which improves the accuracy of image and video classification and identification.
The invention provides a depth multi-feature fusion classification method based on wavelets, which comprises an offline training stage and an online identification stage. In the offline training stage, a convolutional neural network is constructed and trained on samples of n label classes; a discrete wavelet transform is added to the convolutional layer and the fully connected layer at the end of the model to decompose the deep multi-feature maps, and the resulting high- and low-frequency components are fused linearly to obtain the optimal weights. In the online identification stage, the convolutional neural network works together with a support vector machine to identify and classify the actions in images and videos.
As a further improvement of the present invention, the offline training phase comprises the steps of:
the method comprises the following steps: firstly, constructing a convolutional neural network for training;
step two: setting up 3 channels in the first layer, namely 1 grayscale channel and 2 optical flow channels, wherein the grayscale channel contains the grayscale image group of a video clip, and the optical flow channels contain the motion relation information between two frames of the video clip;
step three: constructing a multi-module convolutional neural network;
step four: extracting high-frequency and low-frequency components from the feature maps of the fully connected layer of each module by discrete wavelet transform, and fusing the high-frequency components across the three modules and the low-frequency components across the three modules respectively;
step five: connecting the fused high-frequency and low-frequency components in series through the merge layer and fully connecting the fused high-frequency and low-frequency components with the next layer to obtain a group of 128-dimensional feature maps;
step six: setting n output nodes corresponding to the n classification behaviors, wherein each node is fully connected with all feature maps of the previous layer;
step seven: adjusting the calculation parameters among the layers through a back-propagation algorithm to reduce the error between the output of each sample and its label; after the error meets the requirement and the training is finished, a label is set for each output vector according to the corresponding sample video behavior name.
As a further improvement of the invention, the on-line identification phase comprises the following steps:
step eight: inputting a video stream to be identified, preprocessing the video as in step one, loading the weights of the optimal model obtained in offline training, and extracting a feature vector from the video stream to be identified through the network layers of steps two to seven;
step nine: classifying the feature vectors obtained in step eight by a support vector machine, and finding out the label that best matches the feature vectors to obtain the optimal accuracy.
As a further improvement of the invention, the method comprises the following steps:
s1: acquiring a training sample image;
s2: preprocessing an image;
s3: constructing a gray scale and optical flow multi-channel network channel;
s4: respectively constructing a gray level, optical flow x and y channel network;
s5: performing discrete wavelet transform on the feature mapping of the full connection layer at the tail end of each channel;
s6: extracting high-frequency and low-frequency components, and carrying out feature fusion between channels;
s7: connecting the fused features in series through a merge layer;
s8: training and extracting the optimal weight;
s9: sending the video to a trained optimal model for feature extraction;
s10: online identification is performed using a support vector machine.
As a further improvement of the present invention, in step S1 a training sample and a sample label are obtained from the data set; in step S2 the resolutions of the video streams in the training sample set are unified by the Lanczos interpolation method, in which eight adjacent points along each of the x and y directions are interpolated, that is, a weighted sum is calculated; the window function of the Lanczos interpolation method is:
$$
L(x)=\begin{cases}\operatorname{sinc}(x)\,\operatorname{sinc}(x/4), & |x|<4\\ 0, & \text{otherwise}\end{cases},\qquad \operatorname{sinc}(x)=\frac{\sin(\pi x)}{\pi x}
$$
The two-dimensional form is then: L(x, y) = L(x)·L(y).
As a further improvement of the invention, in step S3 a grayscale channel is established by graying the video stream; the grayscale image retains the most basic information of the original image. Optical flow channels in the x and y directions are established for extracting the inter-frame motion information in the video stream; the optical flow information between frames is extracted by an improved L-K optical flow method in which a convolution kernel replaces the pyramid downsampling. First the partial derivatives f_x, f_y, f_t are obtained from f(x, y, t), with Prewitt filters as the convolution kernels, namely:
I_x = I * D_x,  I_y = I * D_y,  I_t = I * D_t
velocity estimation using the least squares method:
$$
\begin{bmatrix}u\\ v\end{bmatrix}
=\begin{bmatrix}\sum_i I_{x_i}^{2} & \sum_i I_{x_i}I_{y_i}\\ \sum_i I_{x_i}I_{y_i} & \sum_i I_{y_i}^{2}\end{bmatrix}^{-1}
\begin{bmatrix}-\sum_i I_{x_i}I_{t_i}\\ -\sum_i I_{y_i}I_{t_i}\end{bmatrix}
$$
as a further improvement of the present invention, in step S4, each channel is sampled, the picture size is changed to 150 × 100, 5 convolutional layers are constructed, 3 pooling layers are connected, and then one full-connected layer is connected, the convolutional kernel size of the first convolutional layer is 5 × 5, the convolutional kernel sizes of the subsequent convolutional layers are all 3 × 3, the step size is set to 1, 3D maxpoling is used for the pooling layers, the kernel selection of the pooling layers is two, 2 × 2 and 2 × 1, and the activation function selects relu.
As a further improvement of the invention, in step S5 high- and low-frequency components are extracted from the feature map of the fully connected layer at the end of each channel by discrete wavelet transform; the continuous wavelet function ψ_{a,b}(t) can be written as a discrete wavelet function:
$$
\psi_{j,k}(t)=a_0^{-j/2}\,\psi\!\left(a_0^{-j}t-k b_0\right),\qquad j,k\in\mathbb{Z}
$$
the discrete wavelet transform is obtained in the form:
$$
W_f(j,k)=\langle f,\psi_{j,k}\rangle=a_0^{-j/2}\int_{-\infty}^{+\infty} f(t)\,\psi^{*}\!\left(a_0^{-j}t-k b_0\right)\mathrm{d}t
$$
As a further improvement of the present invention, in step S6 the 512-dimensional feature maps of the fully connected layers of the grayscale channel and of the optical flow x and y channels are decomposed into 3 pairs of 128-dimensional feature maps containing high- and low-frequency components, and a vector product operation is then performed on the 128-dimensional feature maps of the channels to obtain two groups of 128-dimensional feature maps; in step S7, a merge layer with its mode set to concat is added to connect the fused high-frequency component and low-frequency component in series, and n output nodes corresponding to the n classification behaviors are set and fully connected with all feature maps of the previous layer.
As a further improvement of the present invention, in step S8, a training sample set is put into the network for training, a model with the minimum loss value is recalled, and the optimal weight is saved; in step S10, the input video stream is passed through a convolutional neural network to extract a 128-dimensional feature map, a kernel function is selected as a linear function, and a support vector machine is constructed for classification and identification.
The beneficial effects of the invention are as follows: through the above scheme, the classical convolutional neural network training process is improved; a discrete wavelet transform is added to decompose the deep features during training and extract multi-resolution features, and the corresponding multi-resolution features within the deep features are fused. This enhances the low-level information and strengthens the high-level information, reduces the complexity of the network computation, strengthens the robustness of network training, and improves the accuracy of image and video classification and identification.
Drawings
FIG. 1 is a flow chart of a depth multi-feature fusion classification method based on wavelets according to the present invention.
Fig. 2 is a diagram of a single channel network.
Fig. 3 is a general structure diagram of a convolutional neural network based on wavelet improvement.
Detailed Description
The invention is further described with reference to the following description and embodiments in conjunction with the accompanying drawings.
A depth multi-feature fusion classification method based on wavelets is divided into two stages: an offline training stage and an online recognition stage. In the offline training stage, a convolutional neural network is constructed and trained on samples of n label classes; a discrete wavelet transform is added to the convolutional layer and the fully connected layer at the end of the model to decompose the deep multi-feature maps, and the resulting high- and low-frequency components are fused linearly to obtain the optimal weights. The neural network then works together with a support vector machine to identify and classify the actions in images and videos.
(I) Off-line training phase
Step one: first a convolutional neural network is constructed for training; taking action recognition as an example, the behavior recognition data set HMDB51 is used as the training set, the video segments are preprocessed, and the video resolution is unified;
step two: setting up 3 channels in the first layer, namely 1 grayscale channel and 2 optical flow channels, wherein the grayscale channel contains the grayscale image group of a video clip, and the optical flow channels contain the motion relation information between two frames of the video clip;
step three: constructing a multi-module convolutional neural network;
Step four: extracting high-frequency and low-frequency components from feature maps of all the module full-connection layers by adopting discrete wavelet transform, and fusing the high-frequency and low-frequency components in the three modules respectively;
step five: connecting the fused high-frequency and low-frequency components in series through the merge layer and fully connecting the fused high-frequency and low-frequency components with the next layer to obtain a group of 128-dimensional feature maps;
step six: setting n output nodes corresponding to n classification behaviors (labels), wherein each node is fully connected with all feature maps on the previous layer;
step seven: adjusting the calculation parameters among all layers through a back propagation algorithm to reduce the error between the output of each sample and the label, and setting the label for each output vector according to the corresponding sample video behavior name after the error meets the requirement and the training is finished;
(II) On-line identification phase
Step eight: inputting a video stream to be identified, preprocessing the video in the first step, loading a weight through an optimal model obtained in offline training, and extracting a feature vector from the video stream to be identified through the network layers in the second step to the eighth step;
step nine: and (4) classifying the feature vectors in the step ten by adopting a support vector machine, and finding out the label which is most matched with the feature vectors to obtain the optimal accuracy.
The invention provides a depth multi-feature fusion classification method based on wavelets that improves the classical convolutional neural network training process: a discrete wavelet transform is added to decompose the deep features during training and extract multi-resolution features, and the corresponding multi-resolution features within the deep features are fused, which enhances the low-level information, strengthens the high-level information, reduces the complexity of the network computation, and strengthens the robustness of network training.
As shown in fig. 1, a depth multi-feature fusion classification method based on wavelets specifically includes the following steps:
s1: acquiring a training sample image:
training samples and sample labels are obtained from the HMDB51 dataset.
S2: image preprocessing:
The resolutions of the video streams in the training sample set are unified. During the resolution unification the image edges become blurred, which causes information loss, so the Lanczos interpolation method is used to unify the resolution: eight adjacent points along each of the x and y directions are interpolated, that is, a weighted sum is calculated, making it an 8 × 8 descriptor. Although the Lanczos interpolation method requires more computation than other interpolation methods, it runs on the GPU and therefore has little influence on the overall performance, while its effect is noticeably better than that of other interpolation methods. The window function is:
$$
L(x)=\begin{cases}\operatorname{sinc}(x)\,\operatorname{sinc}(x/4), & |x|<4\\ 0, & \text{otherwise}\end{cases},\qquad \operatorname{sinc}(x)=\frac{\sin(\pi x)}{\pi x}
$$
The two-dimensional form is then: L(x, y) = L(x)·L(y).
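For illustration, a minimal numerical sketch of this window follows. The half-width a = 4 (giving the eight taps per direction mentioned above) and the target resolution are assumptions, and in practice the frame resampling itself can be done with OpenCV's built-in Lanczos mode:

```python
import numpy as np
import cv2

def lanczos_window(x, a=4):
    """Lanczos window L(x) = sinc(x) * sinc(x / a) for |x| < a, else 0.
    a = 4 gives eight neighbouring taps per direction (the 8 x 8 descriptor)."""
    x = np.asarray(x, dtype=np.float64)
    w = np.sinc(x) * np.sinc(x / a)          # np.sinc is sin(pi*x) / (pi*x)
    return np.where(np.abs(x) < a, w, 0.0)

def lanczos_weight_2d(dx, dy, a=4):
    """Separable two-dimensional weight L(x, y) = L(x) * L(y)."""
    return lanczos_window(dx, a) * lanczos_window(dy, a)

# Resolution unification of one frame with OpenCV's Lanczos-4 interpolation
# (the 160 x 120 target size is only an illustrative assumption):
frame = np.zeros((240, 320), dtype=np.uint8)                      # stand-in grayscale frame
unified = cv2.resize(frame, (160, 120), interpolation=cv2.INTER_LANCZOS4)
```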
S3: constructing a gray scale and optical flow multi-channel network channel:
By establishing a grayscale channel from the grayed video stream, the grayscale map retains the most basic information of the original image, so the grayscale channel is essential. For the extraction of inter-frame motion information in the video stream, optical flow channels in the x and y directions are established. Optical flow is the instantaneous velocity of the pixel motion of a moving object in space on the observation imaging plane; it uses the temporal change of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame and thereby compute the motion information of objects between adjacent frames. Optical flow channels are therefore also essential for action recognition. An improved L-K optical flow method is adopted to extract the optical flow information between frames; a convolution kernel is used to replace the pyramid downsampling, which reduces the amount of computation and gives a better result. First the partial derivatives f_x, f_y, f_t are obtained from f(x, y, t), with Prewitt filters as the convolution kernels, namely:
I_x = I * D_x,  I_y = I * D_y,  I_t = I * D_t
velocity estimation using the least squares method:
$$
\begin{bmatrix}u\\ v\end{bmatrix}
=\begin{bmatrix}\sum_i I_{x_i}^{2} & \sum_i I_{x_i}I_{y_i}\\ \sum_i I_{x_i}I_{y_i} & \sum_i I_{y_i}^{2}\end{bmatrix}^{-1}
\begin{bmatrix}-\sum_i I_{x_i}I_{t_i}\\ -\sum_i I_{y_i}I_{t_i}\end{bmatrix}
$$
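A rough sketch of this derivative-plus-least-squares computation is given below; it assumes Prewitt-style derivative kernels, a simple frame difference for the temporal derivative and a 5 × 5 summation window, all of which are illustrative choices rather than the exact kernels of the embodiment:

```python
import numpy as np
from scipy.signal import convolve2d

def lk_flow_prewitt(frame1, frame2, win=5):
    """Lucas-Kanade-style flow: I_x = I*D_x, I_y = I*D_y, I_t from a frame
    difference, then a per-pixel least-squares solve over a win x win window."""
    Dx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float64) / 6.0  # Prewitt-style kernel
    Dy = Dx.T
    I1, I2 = frame1.astype(np.float64), frame2.astype(np.float64)
    Ix = convolve2d(I1, Dx, mode='same', boundary='symm')
    Iy = convolve2d(I1, Dy, mode='same', boundary='symm')
    It = I2 - I1

    ones = np.ones((win, win))
    Sxx = convolve2d(Ix * Ix, ones, mode='same')      # window sums of the structure terms
    Sxy = convolve2d(Ix * Iy, ones, mode='same')
    Syy = convolve2d(Iy * Iy, ones, mode='same')
    Sxt = convolve2d(Ix * It, ones, mode='same')
    Syt = convolve2d(Iy * It, ones, mode='same')

    det = Sxx * Syy - Sxy ** 2
    det = np.where(np.abs(det) < 1e-6, np.inf, det)   # guard against singular windows
    u = (-Syy * Sxt + Sxy * Syt) / det                # closed-form 2x2 least-squares solution
    v = ( Sxy * Sxt - Sxx * Syt) / det
    return u, v
```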
s4: respectively constructing a gray level, optical flow x and y channel network:
Fig. 2 shows the single-channel network structure. Each channel is downsampled so that the picture size becomes 150 × 100; 5 convolutional layers and 3 pooling layers are constructed, followed by a fully connected layer. The convolution kernels of the first layer have a size of 5 x 5, the subsequent convolution kernels have a size of 3 x 3, and the stride is set to 1. 3D max pooling is adopted in the pooling layers, and the pooling kernels are chosen from the two sizes 2 x 2 and 2 x 1, so that the time dimension is prevented from being reduced too quickly in the later layers. ReLU is selected as the activation function; it simulates the activation model of a brain neuron receiving a signal more accurately and, compared with the sigmoid function, has the characteristics of unilateral inhibition, a relatively wide excitation boundary and sparse activation.
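A minimal Keras sketch of one such channel is given below. The clip length (16 frames), the filter counts and the exact placement of the pooling layers are assumptions; only the layer counts (5 convolutions, 3 poolings, 1 fully connected layer), the 5 × 5 then 3 × 3 kernels, stride 1, 3D max pooling and the ReLU activation follow the description:

```python
from tensorflow.keras import layers, models

def build_channel(frames=16, height=150, width=100, name='gray'):
    """One channel (grayscale or one optical-flow direction): 5 conv layers,
    3 pooling layers, then a fully connected layer producing the feature map
    that is later decomposed by the discrete wavelet transform."""
    inp = layers.Input(shape=(frames, height, width, 1), name=f'{name}_input')
    x = layers.Conv3D(32, (3, 5, 5), strides=1, padding='same', activation='relu')(inp)  # first kernel: 5x5 spatially
    x = layers.MaxPooling3D(pool_size=(2, 2, 2))(x)
    x = layers.Conv3D(64, (3, 3, 3), strides=1, padding='same', activation='relu')(x)
    x = layers.Conv3D(64, (3, 3, 3), strides=1, padding='same', activation='relu')(x)
    x = layers.MaxPooling3D(pool_size=(2, 2, 2))(x)
    x = layers.Conv3D(128, (3, 3, 3), strides=1, padding='same', activation='relu')(x)
    x = layers.Conv3D(128, (3, 3, 3), strides=1, padding='same', activation='relu')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)   # keep the time axis here so it does not shrink too fast
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation='relu', name=f'{name}_fc')(x)  # 512-d feature map fed to the DWT step
    return models.Model(inp, x, name=f'{name}_channel')
```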
S5: performing discrete wavelet transform on the feature mapping of the full connection layer at the tail end of each channel:
High- and low-frequency components are extracted from the feature map of the fully connected layer at the end of each channel by discrete wavelet transform; the continuous wavelet function ψ_{a,b}(t) can be written as a discrete wavelet function:
$$
\psi_{j,k}(t)=a_0^{-j/2}\,\psi\!\left(a_0^{-j}t-k b_0\right),\qquad j,k\in\mathbb{Z}
$$
the discrete wavelet transform is obtained in the form:
$$
W_f(j,k)=\langle f,\psi_{j,k}\rangle=a_0^{-j/2}\int_{-\infty}^{+\infty} f(t)\,\psi^{*}\!\left(a_0^{-j}t-k b_0\right)\mathrm{d}t
$$
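A minimal PyWavelets sketch of this decomposition, applied to one channel's fully connected feature vector, is shown below. The Haar wavelet and the decomposition depth are assumptions: a single-level DWT halves a 512-dimensional vector into 256-dimensional low- and high-frequency bands, so one further level is assumed here to reach the 128-dimensional pairs used in step S6:

```python
import numpy as np
import pywt

fc_features = np.random.randn(512)        # stand-in for one channel's 512-d fully connected feature map

# Level 1: approximation (low-frequency) and detail (high-frequency) coefficients, 256-d each.
low, high = pywt.dwt(fc_features, 'haar')

# Level 2 (assumed): halve each band again to obtain the 128-d low/high pair per channel.
low_128, _ = pywt.dwt(low, 'haar')
_, high_128 = pywt.dwt(high, 'haar')
```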
s6, extracting high-frequency and low-frequency components, and carrying out feature fusion between channels:
In fig. 3, the DWT operation decomposes the 512-dimensional feature maps of the fully connected layers of the grayscale channel and of the optical flow x and y channels into 3 pairs of 128-dimensional feature maps containing high- and low-frequency components; a vector product operation is then performed on the 128-dimensional feature maps of the channels to obtain two groups of 128-dimensional feature maps.
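The "vector product" between channels is interpreted below as an element-wise (Hadamard) product, which keeps the fused vectors 128-dimensional; this interpretation, like the random stand-in data, is an assumption:

```python
import numpy as np

def fuse_channels(gray, flow_x, flow_y):
    """Each argument is a (low, high) pair of 128-d vectors from one channel.
    The three low-frequency bands are multiplied element-wise into one fused low
    band, and the three high-frequency bands into one fused high band."""
    low_fused = gray[0] * flow_x[0] * flow_y[0]
    high_fused = gray[1] * flow_x[1] * flow_y[1]
    return low_fused, high_fused

# Example with stand-in 128-d bands for the three channels:
bands = [(np.random.randn(128), np.random.randn(128)) for _ in range(3)]
low_fused, high_fused = fuse_channels(*bands)
```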
S7: connecting the fused features in series through the merge layer:
A merge layer is added with its mode set to concat, and the fused high-frequency component and low-frequency component are connected in series; n output nodes corresponding to the n classification behaviors (labels) are set and fully connected with all feature maps of the previous layer.
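The merge layer with mode concat corresponds to a plain concatenation; a minimal Keras sketch of the series connection and the n output nodes follows (layers.Concatenate is the current equivalent of the older merge layer, and n = 51 is only an example matching the HMDB51 labels):

```python
from tensorflow.keras import layers, models

high_fused = layers.Input(shape=(128,), name='high_fused')   # fused high-frequency band
low_fused = layers.Input(shape=(128,), name='low_fused')     # fused low-frequency band

n_classes = 51                                                # e.g. the 51 HMDB51 behaviour labels
merged = layers.Concatenate(name='merge')([high_fused, low_fused])              # series connection -> 256-d
outputs = layers.Dense(n_classes, activation='softmax', name='labels')(merged)  # n fully connected output nodes

head = models.Model([high_fused, low_fused], outputs, name='classification_head')
```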
S8: training and extracting optimal weight:
The training sample set is put into the network for training; the model with the minimum loss value is kept through a callback, and the optimal weights are stored.
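A sketch of this training step is shown below, using a Keras ModelCheckpoint callback to keep only the lowest-loss weights; the optimiser, epoch count and validation split are assumptions, and model, x_train and y_train stand for the assembled network and the preprocessed HMDB51 tensors:

```python
from tensorflow.keras.callbacks import ModelCheckpoint

def train_and_keep_best(model, x_train, y_train, weights_path='best_weights.h5'):
    """Train the assembled network and retain only the weights with the lowest
    validation loss, i.e. the 'optimal weights' of step S8."""
    checkpoint = ModelCheckpoint(weights_path, monitor='val_loss',
                                 save_best_only=True, save_weights_only=True)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=50, batch_size=16,
              validation_split=0.1, callbacks=[checkpoint])
    model.load_weights(weights_path)       # reload the best (minimum-loss) weights
    return model
```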
S9: and sending the video to a trained optimal model for feature extraction.
S10: online identification using a support vector machine:
A 128-dimensional feature map is extracted from the input video stream through the convolutional neural network, the kernel function is chosen as a linear function, and a support vector machine is constructed for classification and identification.
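A minimal scikit-learn sketch of this classification step follows; the feature matrices are random stand-ins for the 128-dimensional feature maps produced by the trained network:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-ins: 128-d features extracted by the trained network for labelled training
# clips, and for one clip to be recognised online.
train_features = np.random.randn(200, 128)
train_labels = np.random.randint(0, 51, size=200)
query_feature = np.random.randn(1, 128)

svm = SVC(kernel='linear')                 # kernel function chosen as a linear function
svm.fit(train_features, train_labels)
predicted_label = svm.predict(query_feature)[0]   # the best-matching behaviour label
```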
Compared with a convolutional neural network model without the wavelet-based deep feature fusion, the method of the invention achieves a better effect and reaches a higher accuracy in tests on public data sets. Moreover, the invention is not limited to the action recognition of the specific embodiment and can be widely used for image and video classification and identification.
In the depth multi-feature fusion classification method based on wavelets provided by the invention, the discrete wavelet transform is used to extract low-frequency components and high-frequency components from the feature maps, and the high-frequency components and the low-frequency components are fused separately, so that the low-level information is enhanced and the high-level information is strengthened, and the accuracy and robustness of network identification are improved.
The depth multi-feature fusion classification method based on wavelets is suitable for the technical field of robot vision image processing, and is particularly suitable for deep learning, feature extraction and video image processing.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all of them shall be considered as belonging to the protection scope of the invention.

Claims (7)

1. A depth multi-feature fusion classification method based on wavelets, characterized in that: the method comprises an offline training stage and an online identification stage, wherein in the offline training stage a convolutional neural network is constructed and trained on samples of n label classes, a discrete wavelet transform is added to a fully connected layer to decompose the deep multi-feature maps, and the obtained high- and low-frequency components are linearly fused to obtain the optimal weights; in the online identification stage, the convolutional neural network works together with a support vector machine to identify and classify the actions in images and videos;
the offline training phase comprises the following steps:
the method comprises the following steps: firstly, constructing a convolutional neural network for training;
step two: setting up 3 channels in the first layer, namely 1 grayscale channel and 2 optical flow channels, wherein the grayscale channel contains the grayscale image group of a video clip, and the optical flow channels contain the motion relation information between two frames of the video clip;
step three: constructing a multi-module convolutional neural network, wherein each module corresponds to one channel;
step four: extracting high-frequency and low-frequency components from the feature maps of the fully connected layers of all channels by discrete wavelet transform, fusing the high-frequency components across the channels, and fusing the low-frequency components across the channels;
step five: connecting the fused high-frequency and low-frequency components in series through the merge layer and fully connecting the fused high-frequency and low-frequency components with the next layer to obtain a group of 128-dimensional feature maps;
step six: setting n output nodes corresponding to the n classification behaviors, wherein each node is fully connected with all feature maps of the previous layer;
step seven: adjusting the calculation parameters among all layers through a back propagation algorithm to reduce the error between the output of each sample and the label, and setting the label for each output vector according to the corresponding sample video behavior name after the error meets the requirement and the training is finished;
the on-line identification phase comprises the following steps:
step eight: inputting a video stream to be identified, preprocessing the video, loading a weight through an optimal model obtained in offline training, and extracting a feature vector from the video stream to be identified through the network layers from the second step to the seventh step;
step nine: classifying the feature vectors in the step eight by adopting a support vector machine, and finding out the label which is most matched with the feature vectors to obtain the optimal accuracy;
the depth multi-feature fusion classification method based on the wavelet comprises the following steps:
s1: acquiring a training sample image;
s2: preprocessing an image;
s3: constructing a gray scale and optical flow multi-channel network channel;
s4: respectively constructing a gray level, optical flow x and y channel network;
s5: performing discrete wavelet transform on the feature mapping of the full connection layer at the tail end of each channel;
s6: extracting high-frequency and low-frequency components, fusing the high-frequency components of each channel, and fusing the low-frequency components of each channel;
s7: connecting the fused features in series through a merge layer;
s8: training and extracting the optimal weight;
s9: sending the video to a trained optimal model for feature extraction;
s10: online identification is performed using a support vector machine.
2. The wavelet-based depth multi-feature fusion classification method of claim 1, wherein in step S1, training samples and sample labels are obtained from a dataset; in step S2, unifying the resolutions of the video streams in the training sample set, unifying the resolutions by using a Lanczos interpolation method, and interpolating eight adjacent points in the interpolation process along the x and y directions, that is, calculating a weighted sum, where a window function of the Lanczos interpolation method is:
$$
L(x)=\begin{cases}\operatorname{sinc}(x)\,\operatorname{sinc}(x/4), & |x|<4\\ 0, & \text{otherwise}\end{cases},\qquad \operatorname{sinc}(x)=\frac{\sin(\pi x)}{\pi x}
$$
The two-dimensional form is then: L(x, y) = L(x)·L(y).
3. The wavelet-based depth multi-feature fusion classification method of claim 2, characterized in that in step S3 a grayscale channel is established by graying the video stream, the grayscale map retaining the most basic information of the original image; optical flow channels in the x and y directions are established for the extraction of the inter-frame motion information in the video stream; the improved L-K optical flow method is used to extract the inter-frame optical flow information, with a convolution kernel replacing the pyramid downsampling; first the partial derivatives f_x, f_y, f_t are obtained from f(x, y, t), with Prewitt filters as the convolution kernels, namely:
I_x = I * D_x,  I_y = I * D_y,  I_t = I * D_t
velocity estimation using the least squares method:
$$
\begin{bmatrix}u\\ v\end{bmatrix}
=\begin{bmatrix}\sum_i I_{x_i}^{2} & \sum_i I_{x_i}I_{y_i}\\ \sum_i I_{x_i}I_{y_i} & \sum_i I_{y_i}^{2}\end{bmatrix}^{-1}
\begin{bmatrix}-\sum_i I_{x_i}I_{t_i}\\ -\sum_i I_{y_i}I_{t_i}\end{bmatrix}
$$
4. The wavelet-based depth multi-feature fusion classification method of claim 3, characterized in that in step S4 each channel is downsampled so that the picture size becomes 150 × 100; 5 convolutional layers are constructed, connected with 3 pooling layers and then one fully connected layer; the convolution kernel size of the first convolutional layer is 5 × 5, the convolution kernel sizes of the subsequent convolutional layers are all 3 × 3, and the stride is set to 1; 3D max pooling is adopted for the pooling layers, the pooling kernels are chosen from the two sizes 2 × 2 and 2 × 1, and ReLU is selected as the activation function.
5. The wavelet-based depth multi-feature fusion classification method of claim 4, characterized in that in step S5 high- and low-frequency components are extracted from the feature map of the fully connected layer at each channel end by discrete wavelet transform; the continuous wavelet function ψ_{a,b}(t) can be written as a discrete wavelet function:
$$
\psi_{j,k}(t)=a_0^{-j/2}\,\psi\!\left(a_0^{-j}t-k b_0\right),\qquad j,k\in\mathbb{Z}
$$
the discrete wavelet transform is obtained in the form:
$$
W_f(j,k)=\langle f,\psi_{j,k}\rangle=a_0^{-j/2}\int_{-\infty}^{+\infty} f(t)\,\psi^{*}\!\left(a_0^{-j}t-k b_0\right)\mathrm{d}t
$$
6. The wavelet-based depth multi-feature fusion classification method of claim 5, characterized in that in step S6 the 512-dimensional feature maps of the fully connected layers of the grayscale channel and of the optical flow x and y channels are decomposed into 3 pairs of 128-dimensional feature maps containing high- and low-frequency components, and a vector product operation is then performed on the 128-dimensional feature maps of the channels to obtain two groups of 128-dimensional feature maps; in step S7, a merge layer with its mode set to concat is added to connect the fused high-frequency component and low-frequency component in series, and n output nodes corresponding to the n classification behaviors are set and connected with all feature maps of the upper layer.
7. The wavelet-based depth multi-feature fusion classification method of claim 6, wherein in step S8, a training sample set is put into a network for training, a model with the minimum loss value is recalled, and an optimal weight is saved; in step S10, the input video stream is passed through a convolutional neural network to extract a 256-dimensional feature map, a kernel function is selected as a linear function, and a support vector machine is constructed for classification and identification.
CN201710823051.8A 2017-09-13 2017-09-13 Depth multi-feature fusion classification method based on wavelets Active CN107679462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710823051.8A CN107679462B (en) 2017-09-13 2017-09-13 Depth multi-feature fusion classification method based on wavelets


Publications (2)

Publication Number Publication Date
CN107679462A CN107679462A (en) 2018-02-09
CN107679462B 2021-10-19

Family

ID=61136412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710823051.8A Active CN107679462B (en) 2017-09-13 2017-09-13 Depth multi-feature fusion classification method based on wavelets

Country Status (1)

Country Link
CN (1) CN107679462B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564326B (en) * 2018-04-19 2021-12-21 安吉汽车物流股份有限公司 Order prediction method and device, computer readable medium and logistics system
CN108830296B (en) * 2018-05-18 2021-08-10 河海大学 Improved high-resolution remote sensing image classification method based on deep learning
CN108830308B (en) * 2018-05-31 2021-12-14 西安电子科技大学 Signal-based traditional feature and depth feature fusion modulation identification method
CN108957173A (en) * 2018-06-08 2018-12-07 山东超越数控电子股份有限公司 A kind of detection method for avionics system state
CN109117711B (en) * 2018-06-26 2021-02-19 西安交通大学 Eye movement data-based concentration degree detection device and method based on hierarchical feature fusion
CN109214440A (en) * 2018-08-23 2019-01-15 华北电力大学(保定) A kind of multiple features data classification recognition methods based on clustering algorithm
CN109620244B (en) * 2018-12-07 2021-07-30 吉林大学 Infant abnormal behavior detection method based on condition generation countermeasure network and SVM
CN109741348A (en) * 2019-01-07 2019-05-10 哈尔滨理工大学 A kind of diabetic retina image partition method
CN110236518B (en) * 2019-04-02 2020-12-11 武汉大学 Electrocardio and heart-shock signal combined classification method and device based on neural network
CN112288345A (en) * 2019-07-25 2021-01-29 顺丰科技有限公司 Method and device for detecting loading and unloading port state, server and storage medium
CN110633735B (en) * 2019-08-23 2021-07-30 深圳大学 Progressive depth convolution network image identification method and device based on wavelet transformation
CN110852195A (en) * 2019-10-24 2020-02-28 杭州趣维科技有限公司 Video slice-based video type classification method
CN113658230A (en) * 2020-05-12 2021-11-16 武汉Tcl集团工业研究院有限公司 Optical flow estimation method, terminal and storage medium
CN112330650A (en) * 2020-11-12 2021-02-05 李庆春 Retrieval video quality evaluation method
CN112418168B (en) * 2020-12-10 2024-04-02 深圳云天励飞技术股份有限公司 Vehicle identification method, device, system, electronic equipment and storage medium
CN113408815A (en) * 2021-07-02 2021-09-17 湘潭大学 Deep learning-based traction load ultra-short-term prediction method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281853A (en) * 2014-09-02 2015-01-14 电子科技大学 Behavior identification method based on 3D convolution neural network
CN104866831A (en) * 2015-05-29 2015-08-26 福建省智慧物联网研究院有限责任公司 Feature weighted face identification algorithm
CN106228137A (en) * 2016-07-26 2016-12-14 广州市维安科技股份有限公司 A kind of ATM abnormal human face detection based on key point location
CN106251375A (en) * 2016-08-03 2016-12-21 广东技术师范学院 A kind of degree of depth study stacking-type automatic coding of general steganalysis
CN106529467A (en) * 2016-11-07 2017-03-22 南京邮电大学 Group behavior identification method based on multi-feature fusion


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
3D Convolutional Neural Networks for Human Action Recognition; Shuiwang Ji et al.; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2013-01-31; abstract, sections 2 and 4 *
Research on behavior recognition algorithms based on multi-feature fusion; 杨丽召; China Masters' Theses Full-text Database, Information Science and Technology; 2014-01-15; section 2.1 *

Also Published As

Publication number Publication date
CN107679462A (en) 2018-02-09

Similar Documents

Publication Publication Date Title
CN107679462B (en) Depth multi-feature fusion classification method based on wavelets
Xiao et al. Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection
CN108665496B (en) End-to-end semantic instant positioning and mapping method based on deep learning
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN111275713B (en) Cross-domain semantic segmentation method based on countermeasure self-integration network
CN111639692A (en) Shadow detection method based on attention mechanism
Yan et al. Combining the best of convolutional layers and recurrent layers: A hybrid network for semantic segmentation
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113642634A (en) Shadow detection method based on mixed attention
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN112101262B (en) Multi-feature fusion sign language recognition method and network model
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
McIntosh et al. Recurrent segmentation for variable computational budgets
Ma et al. Fusioncount: Efficient crowd counting via multiscale feature fusion
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN109871790B (en) Video decoloring method based on hybrid neural network model
CN109919215B (en) Target detection method for improving characteristic pyramid network based on clustering algorithm
Zeng et al. Self-attention learning network for face super-resolution
CN115049945A (en) Method and device for extracting lodging area of wheat based on unmanned aerial vehicle image
CN111027472A (en) Video identification method based on fusion of video optical flow and image space feature weight
CN110751271A (en) Image traceability feature characterization method based on deep neural network
CN111242003A (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN115393950A (en) Gesture segmentation network device and method based on multi-branch cascade Transformer
CN110853040B (en) Image collaborative segmentation method based on super-resolution reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant