CN113111711A - Pooling method based on bilinear pyramid and spatial pyramid - Google Patents


Publication number
CN113111711A
Authority
CN
China
Prior art keywords
pooling
feature
bilinear
feature map
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110265552.5A
Other languages
Chinese (zh)
Inventor
Shao Yiming (邵一鸣)
Bao Xiao'an (包晓安)
Bao Ziqun (包梓群)
Xu Mingyang (许铭洋)
Ma Yunlong (马云龙)
Ma Xuanjun (马铉钧)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Zhejiang Sci Tech University ZSTU
Zhejiang University of Science and Technology ZUST
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202110265552.5A
Publication of CN113111711A
Legal status: Pending


Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/464: Salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The invention discloses a pooling method based on bilinear and spatial pyramids, belonging to the fields of image processing and computer vision. The invention comprises the following steps: acquiring a video stream and intercepting the target image to be processed; extracting features of different levels or different categories from the target image; fusing the feature groups by a bilinear method to obtain a global feature map; performing pyramid pooling on the fused global feature map to reduce the dimensionality of the feature map; and normalizing the dimension-reduced feature map as the final feature of the target image, completing the pooling operation, the obtained final feature being used for subsequent classification to realize identification of the object to be detected. The method is suitable for behavior recognition in images and for pooling in target detection; it reduces the dimensionality of multi-feature fusion, improves recognition efficiency, and meets differing recognition requirements for multiple features.

Description

Pooling method based on bilinear pyramid and spatial pyramid
Technical Field
The invention relates to the field of image processing and computer vision, in particular to a pooling method based on bilinear and spatial pyramids.
Background
In the era of rapid development of intelligent science and technology, functions of intelligent monitoring such as behavior recognition and target detection are gradually being perfected and popularized. Pooling is often used in a convolutional neural network to reduce the dimension of the feature vector output by a convolutional layer and improve the result while minimally affecting the expression of the original image semantics. Images have a "stationarity" property: useful features can be shared and applied across different image regions, and, imitating the human visual system, pooling aggregates statistics of features at different positions.
Traditional pooling modes generally include average pooling, maximum pooling, stochastic pooling and the like. Average and maximum pooling take the mean or the maximum of the corresponding image region, respectively, while stochastic pooling selects an element at random with a probability that increases with the element's value; this preserves the dominance of large values while keeping the presence of the other elements, preventing excessive distortion.
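The three traditional modes above can be sketched on a small map with non-overlapping 2 × 2 windows. This is a minimal illustration in NumPy, not the patent's implementation; the function name is illustrative, and stochastic pooling assumes nonnegative activations so that values can be used as sampling weights:

```python
import numpy as np

def pool2x2(x, mode="max", rng=None):
    """Pool a (H, W) map with non-overlapping 2x2 windows.

    mode: "max", "avg", or "stochastic" (an element is sampled with
    probability proportional to its value; assumes x >= 0).
    """
    h, w = x.shape
    out = np.empty((h // 2, w // 2), dtype=float)
    rng = rng or np.random.default_rng(0)
    for i in range(h // 2):
        for j in range(w // 2):
            win = x[2 * i:2 * i + 2, 2 * j:2 * j + 2].ravel()
            if mode == "max":
                out[i, j] = win.max()
            elif mode == "avg":
                out[i, j] = win.mean()
            else:  # stochastic: larger values are more likely to be chosen
                p = win / win.sum() if win.sum() > 0 else np.full(4, 0.25)
                out[i, j] = rng.choice(win, p=p)
    return out

x = np.array([[1., 2., 0., 1.],
              [3., 4., 1., 0.],
              [0., 1., 2., 2.],
              [1., 1., 2., 2.]])
print(pool2x2(x, "max"))  # [[4. 1.] [1. 2.]]
print(pool2x2(x, "avg"))  # [[2.5 0.5] [0.75 2.]]
```

Stochastic pooling returns one of the window's own values, so its output stays within the range spanned by the maximum while still letting smaller elements through occasionally.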
Finally, weighing the respective advantages and disadvantages of the different pooling methods, the invention adopts bilinear pooling to fuse two features and then obtains the corresponding feature map through pyramid pooling, which reduces dimensionality and fixes the output dimension, thereby better supporting the accuracy of behavior recognition and target detection.
Disclosure of Invention
In order to overcome the defects of the conventional image pooling method aiming at behavior recognition, target detection and the like, the method combines bilinear pooling and pyramid pooling, firstly performs multi-feature extraction on an object in a target image, performs bilinear fusion on a feature group to obtain a fused global feature map, and then performs pyramid pooling on the corresponding position of the global feature map. The pooling method disclosed by the invention integrates more image characteristics, reduces data loss, lays a foundation for improving the subsequent classification accuracy, generates output with a fixed size aiming at image input with any size, can be suitable for various classifiers, and is wide in application. The technical scheme adopted by the invention for solving the technical problems is as follows:
a bilinear and spatial pyramid-based pooling method comprises the following steps:
s1: acquiring a video stream according to a time sequence recorded by a monitoring system, wherein the video stream comprises an object to be detected;
s2: preprocessing the intercepted video stream, wherein the preprocessing comprises video shot segmentation and key frame extraction, and the extracted key frame image is used as a target image;
s3: identifying an object in a target image, labeling a candidate frame, and performing multi-feature extraction on the object in the candidate frame to obtain multi-feature data;
s4: multiplying multiple features corresponding to the same position of a target image by a bilinear method to obtain a local feature map, and summing and pooling the local feature maps corresponding to all target positions in the image to obtain a fused global feature map;
s5: pyramid pooling is carried out on the fused global feature map, and the dimensionality of the feature map is reduced; and normalizing the feature map subjected to dimension reduction to be used as the final feature of the target image, finishing pooling operation, and using the obtained final feature for subsequent classification to realize identification of the object to be detected.
Compared with the prior art, the invention has the advantages that:
(1) The invention can realize the fusion of feature groups of different levels and various categories by means of a bilinear method; the feature groups may be related feature groups of different levels and different frequencies, or similar feature groups extracted by different extraction methods, where each individual feature retains its original dimensionality. Because the fused feature map contains features of different levels and different types, the obtained feature information is more comprehensive, laying a foundation for improving subsequent classification accuracy.
(2) The invention further adopts a pyramid pooling method to reduce the dimension of the feature map after bilinear fusion, performing pooling operations of different sizes on the feature map to obtain feature information at different resolutions, thereby effectively improving the network's recognition precision on the features. Compared with the traditional R-CNN dimension-reduction approach, the method divides the global feature map with windows of different scales, pools each feature region, sets weights for the different layers, and finally splices the results into features of uniform dimension; this dimension is far smaller than the feature dimension under the R-CNN computation mode, so the calculation efficiency is high.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of bilinear pooling employed in the present embodiment;
fig. 3 is a schematic diagram of spatial pyramid pooling employed in the present embodiment.
Detailed Description
The invention is further described by the following detailed description in conjunction with the accompanying drawings.
As shown in fig. 1, the pooling method based on bilinear pyramid and spatial pyramid provided by the present invention combines bilinear pooling with spatial pyramid pooling, and is used for multi-feature fusion and reduction to obtain uniform dimensionality, and includes the following steps:
step 1: and acquiring screening data to obtain video streams. Part of data of the invention is from an INRIA XMAX multi-view video library, and part of data is shot and recorded by a monitoring system.
Step 2: and preprocessing the intercepted video stream, wherein the preprocessing comprises video shot segmentation and key frame extraction, and the extracted key frame image is used as a target image.
And step 3: and identifying the object in the target image, labeling the candidate frame, performing multi-feature extraction on the object in the candidate frame, and acquiring multi-feature data.
And 4, step 4: as shown in fig. 2, the extracted multi-feature points are linearly fused by a bilinear pooling method, and a feature map after linear fusion is output.
And 5: as shown in fig. 3, the linear fused feature map is subjected to dimension reduction by using a spatial pyramid pooling method, and the output size is unified for further processing.
In a specific implementation of the present invention, the video stream obtained in step 1 includes shot images of different viewing angles of the object to be detected, and is set according to a specific application scenario. The method also comprises a step of dividing each key frame image into area blocks by adopting a block matching method between the step 2 and the step 3, wherein the similarity between the continuous frames is judged by comparing the corresponding area blocks, and the method utilizes the local characteristics of the images to inhibit noise.
In one embodiment of the present invention, the multiple features extracted in step 3 form a feature group, each type of feature has its own original dimension, and the feature group may be related feature groups of different levels and different frequencies; or similar feature groups extracted in different extraction manners. In this embodiment, a dynamic video feature extraction technology is used to perform multi-feature extraction.
As shown in fig. 2, the extracted multiple features are subjected to bilinear fusion processing, multiple features corresponding to the same position of the target image are multiplied to obtain a local feature map, and then the local feature maps corresponding to all target positions in the image are summed and pooled to obtain a fused global feature map; the method specifically comprises the following steps:
For the target image I in fig. 2, different features are extracted through two branches, and the features extracted at the same position are multiplied, with the calculation formula:

b(l, I, f_A, f_B) = f_A(l, I)^T · f_B(l, I)

f_A(l, I) ∈ R^(1×M), f_B(l, I) ∈ R^(1×N)

In the formula, b(l, I, f_A, f_B) ∈ R^(M×N) represents the local feature map at position l of the target image I after bilinear fusion, f_A(l, I) and f_B(l, I) are the two features extracted at the same location l of image I, M and N are the numbers of feature channels, and T denotes the transpose.
Summing and pooling the local feature maps corresponding to all target positions in the image gives the global feature map, with the calculation formula:

ξ(I) = Σ_l b(l, I, f_A, f_B)

where ξ(I) represents the global feature map of the target image, the final output in fig. 2.
In a specific implementation of the present invention, as shown in fig. 3, pyramid pooling of different scales is performed on the fused global feature map, so as to reduce the dimension of the feature map and obtain feature information with different resolutions; normalizing the feature map subjected to dimension reduction, and outputting the feature map with unified standard dimensions as final features of the target image; it should be noted here that the input data of fig. 3 is the final output in fig. 2, i.e., ξ (I) in the above description, and the picture given in fig. 3 is only for exemplary purposes. The method specifically comprises the following steps:
1) and dividing the global feature map by using windows with different scales, wherein each scale represents one layer of the pyramid, and the global feature map is divided into image blocks in each layer.
Image division calculation formulas:

win_size = ⌈a / n⌉, the pooling window size (rounded up);
str_size = ⌊a / n⌋, the pooling stride (rounded down);

where a × a is the size of the feature map input to the pyramid pooling layer, and n × n is the size of the feature map output by the pyramid pooling layer.
2) Performing unified pooling operation on each image block, setting the number of feature layers obtained by pooling in each layer as a weight, and extracting higher-level image feature information; the pooling operation described herein may be maximum pooling, average pooling, or random pooling.
3) And cascading the feature vectors of the corresponding dimensionality generated by each layer.
4) And carrying out normalization processing on the cascaded feature vectors to serve as final features of the target image.
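The window and stride formulas of step 1) amount to one ceiling and one floor division. A small helper makes this concrete (an illustrative sketch; the function name is not from the patent):

```python
import math

def spp_window_stride(a, n):
    """Window and stride for pooling an a x a map down to n x n:
    window = ceil(a / n) (rounded up), stride = floor(a / n) (rounded down)."""
    return math.ceil(a / n), math.floor(a / n)

# e.g. a 13 x 13 feature map pooled down to 4 x 4 uses
# 4 x 4 windows with stride 3
assert spp_window_stride(13, 4) == (4, 3)
assert spp_window_stride(8, 2) == (4, 4)
```

Rounding the window up and the stride down guarantees the n × n grid of windows covers the whole a × a input even when n does not divide a.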
In this embodiment, as shown in fig. 3, a three-layer pyramid is used for pooling; a dedicated resolution is specified for each layer, and the feature map of the corresponding layer is extracted.
Let the resolution of pooling layer 1 be a × a, that of pooling layer 2 be b × b, and that of pooling layer 3 be c × c, and take the numbers of feature layers x, y and z of the pooling layers as weights; the layers then produce a × x, b × y and c × z dimensional feature vectors respectively. The values of x, y and z can be chosen according to the recognition requirement: if the recognized object occupies only a small area of the whole picture, the weight of the pooling layer with small regions should be correspondingly large; if recognition depends on the association of all parts of the whole image and the object occupies a large area, the weight of the pooling layer with relatively large regions should be increased as appropriate. For example, when the method is applied to small targets such as traffic sign recognition, the signs are generally fine and occupy little area in the picture, but their semantic information is concentrated, so the value z of pooling layer 3 in fig. 3 is large and carries more of the total weight (x + y + z). Similarly, when the method is applied to pedestrian action recognition, the person occupies a large proportion of the whole picture, so the weight z of pooling layer 3 is reduced appropriately and the weights x and y of pooling layer 1 and pooling layer 2 are relatively increased. The weights x, y and z can be thought of as percentage shares, but each is an integer value that directly determines the number of semantic features of each dimension after linear concatenation.
Referring to fig. 3, in this embodiment, the pooling layer 1 uses a 1 × 1 window to divide the global feature map, that is, the global feature map is not divided into small regions, the number of feature map layers output by the pooling layer 1 is set to be x, and a 1 × 1 × x feature vector is output; the pooling layer 2 divides the global feature map by adopting a 2 × 2 window, namely, the global feature map is divided into 2 × 2 small areas, the number of feature layers output by the pooling layer 2 is set as y, and 2 × 2 × y feature vectors are output by the pooling layer; the pooling layer 3 divides the global feature map by using a 4 × 4 window, that is, the global feature map is divided into 4 × 4 small regions, the number of feature layers output by the pooling layer 3 is set to z, and a 4 × 4 × z feature vector is output. And finally, performing linear cascade feature fusion on the three feature vectors to obtain (a multiplied by x + b multiplied by y + c multiplied by z) dimensional feature vectors, wherein the feature vectors are fused with the semantic information of bilinear feature fusion and simultaneously contain feature information of different scales and different levels. And carrying out normalization processing on the cascaded feature vectors through a normalization function, wherein the normalized feature vectors serve as final features of the target image and are used for subsequent classifier classification.
In this embodiment, after the values of x, y and z are determined, the pooling layers with different resolutions yield different numbers of feature vectors of different dimensions: pooling layer 3 extracts z 16-dimensional feature vectors, pooling layer 2 extracts y 4-dimensional feature vectors, and pooling layer 1 extracts x 1-dimensional feature vectors. The feature vectors extracted by the three layers are then fused. Feature fusion is a conventional means in neural networks; this embodiment employs direct "early fusion", i.e. feature fusion by Concat, so the dimensionality of the linearly cascaded vector is the sum of the dimensions obtained by the three pooling layers. The fused vector is used for classification by a subsequent classifier, such as an SVM classifier, which judges the recognition result from semantic information of the different dimensions. The invention does not limit the choice of feature fusion method or classifier; they can be replaced according to actual requirements and the operating environment.
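The three-layer embodiment above (1 × 1, 2 × 2 and 4 × 4 grids, max-pooled and concatenated into a fixed-length vector, then normalized) can be sketched as follows. This is a simplified NumPy illustration, not the patent's implementation: it keeps the same channel count C at every level instead of the per-layer weights x, y, z, and uses max pooling for the uniform pooling operation of step 7.2):

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) map at each pyramid level and concatenate.

    Each level n splits the map into an n x n grid (n <= H, W assumed);
    every cell is max-pooled over its region, so the output length is
    C * sum(n * n for n in levels) regardless of H and W.
    """
    c, h, w = fmap.shape
    parts = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                r0, r1 = (i * h) // n, ((i + 1) * h) // n
                c0, c1 = (j * w) // n, ((j + 1) * w) // n
                parts.append(fmap[:, r0:r1, c0:c1].max(axis=(1, 2)))
    vec = np.concatenate(parts)
    # L2-normalize the cascaded vector, as in step 7.4)
    return vec / (np.linalg.norm(vec) + 1e-12)

f = np.random.default_rng(0).standard_normal((8, 13, 13))
v = spatial_pyramid_pool(f)
assert v.shape == (8 * (1 + 4 + 16),)  # fixed length 168
# a different input size yields the same output length
g = np.random.default_rng(1).standard_normal((8, 20, 17))
assert spatial_pyramid_pool(g).shape == v.shape
```

The fixed output length (here C · 21) is what lets the pooled vector feed a classifier with a fixed input dimension regardless of the original image size.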
According to the method, the bilinear pooling is used for fusing the feature groups, the pyramid pooling is used for dimensionality reduction, the corresponding feature graph is obtained in a fixed output dimensionality mode, the final feature graph comprises semantic information sampled under different levels, different features, different dimensionalities and resolution ratios, and the semantic information is used for subsequent classification, so that the accuracy of behavior identification and target detection can be effectively improved.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (9)

1. A pooling method based on bilinear and spatial pyramids, characterized by comprising the following steps:
s1: acquiring a video stream according to a time sequence recorded by a monitoring system, wherein the video stream contains an object to be detected;
s2: preprocessing the intercepted video stream, wherein the preprocessing comprises video shot segmentation and key frame extraction, and the extracted key frame image is used as a target image;
s3: identifying an object in a target image, labeling a candidate frame, and performing multi-feature extraction on the object in the candidate frame to obtain multi-feature data;
s4: multiplying multiple features corresponding to the same position of a target image by a bilinear method to obtain a local feature map, and summing and pooling the local feature maps corresponding to all target positions in the image to obtain a fused global feature map;
s5: pyramid pooling is carried out on the fused global feature map, and the dimensionality of the feature map is reduced; and normalizing the feature map subjected to dimension reduction to be used as the final feature of the target image, finishing pooling operation, and using the obtained final feature for subsequent classification to realize identification of the object to be detected.
2. The bilinear and spatial pyramid based pooling method of claim 1, wherein the acquired video stream includes different view angle captured images of the object to be detected.
3. The bilinear and spatial pyramid-based pooling method of claim 1, further comprising a step of dividing each key frame image into region blocks by using a block matching method between said step S2 and step S3, wherein the similarity between consecutive frames is determined by comparing the corresponding region blocks.
4. The bilinear and spatial pyramid based pooling method of claim 1, wherein said multi-feature data includes different levels or classes of features.
5. The bilinear and spatial pyramid based pooling method of claim 1, wherein said bilinear method is calculated by the formula:
b(l, I, f_A, f_B) = f_A(l, I)^T · f_B(l, I)

f_A(l, I) ∈ R^(1×M), f_B(l, I) ∈ R^(1×N)

In the formula, b(l, I, f_A, f_B) represents the local feature map at position l of the target image I after bilinear fusion, f_A(l, I) and f_B(l, I) are the two features extracted at the same location l of image I, M and N are the numbers of feature channels, and T denotes the transpose.
6. The bilinear and spatial pyramid-based pooling method of claim 5, wherein the local feature maps corresponding to all target positions in the image are summed and pooled to obtain a global feature map, and the calculation formula is:
ξ(I) = Σ_l b(l, I, f_A, f_B)

In the formula, ξ(I) represents the global feature map of the target image.
7. The bilinear and spatial pyramid-based pooling method of claim 1, wherein said pyramid pooling process comprises:
7.1) dividing the global feature map by using windows with different scales, wherein each scale represents one layer of a pyramid, and the global feature map is divided into image blocks in each layer;
7.2) carrying out uniform pooling operation on each image block, and setting the number of feature layers obtained by pooling of each layer as a weight;
7.3) cascading the feature vectors of the corresponding dimensionality generated by each layer;
and 7.4) carrying out normalization processing on the cascaded feature vectors to be used as the final features of the target image.
8. Bilinear and spatial pyramid based pooling method according to claim 7, characterized in that said pooling of step 7.2) is a maximal pooling, an average pooling or a random pooling.
9. The bilinear and spatial pyramid-based pooling method of claim 1, wherein the fused global feature map is pyramid-pooled and then output in a uniform standard dimension.
CN202110265552.5A 2021-03-11 2021-03-11 Pooling method based on bilinear pyramid and spatial pyramid Pending CN113111711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110265552.5A CN113111711A (en) 2021-03-11 2021-03-11 Pooling method based on bilinear pyramid and spatial pyramid


Publications (1)

Publication Number Publication Date
CN113111711A true CN113111711A (en) 2021-07-13

Family

ID=76711151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110265552.5A Pending CN113111711A (en) 2021-03-11 2021-03-11 Pooling method based on bilinear pyramid and spatial pyramid

Country Status (1)

Country Link
CN (1) CN113111711A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160104056A1 (en) * 2014-10-09 2016-04-14 Microsoft Technology Licensing, Llc Spatial pyramid pooling networks for image processing
WO2019222951A1 (en) * 2018-05-24 2019-11-28 Nokia Technologies Oy Method and apparatus for computer vision
CN109215034A (en) * 2018-07-06 2019-01-15 成都图必优科技有限公司 A kind of Weakly supervised image, semantic dividing method for covering pond based on spatial pyramid
CN109902693A (en) * 2019-02-16 2019-06-18 太原理工大学 One kind being based on more attention spatial pyramid characteristic image recognition methods
CN110633661A (en) * 2019-08-31 2019-12-31 南京理工大学 Semantic segmentation fused remote sensing image target detection method
CN112329808A (en) * 2020-09-25 2021-02-05 武汉光谷信息技术股份有限公司 Optimization method and system of Deeplab semantic segmentation algorithm
CN112418176A (en) * 2020-12-09 2021-02-26 江西师范大学 Remote sensing image semantic segmentation method based on pyramid pooling multilevel feature fusion network
CN112329747A (en) * 2021-01-04 2021-02-05 湖南大学 Vehicle parameter detection method based on video identification and deep learning and related device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Duan Xunda et al., "Pedestrian attribute recognition based on attention mechanism and spatial pyramid pooling", Journal of University of Jinan (Science and Technology), vol. 34, no. 4, pages 342-349 *
Ma Li et al., "Fine-grained image classification based on sparsified bilinear convolutional neural network", Pattern Recognition and Artificial Intelligence, vol. 32, no. 4, pages 336-344 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination