CN114972965A - Scene recognition method based on deep learning - Google Patents

Scene recognition method based on deep learning

Info

Publication number
CN114972965A
CN114972965A (application CN202210416206.7A)
Authority
CN
China
Prior art keywords
feature
network
deep learning
scene
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210416206.7A
Other languages
Chinese (zh)
Inventor
刘怀亮
梁玮麟
赵舰波
杨斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lezhi Future Technology Shenzhen Co ltd
Original Assignee
Lezhi Future Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lezhi Future Technology Shenzhen Co ltd filed Critical Lezhi Future Technology Shenzhen Co ltd
Priority to CN202210416206.7A priority Critical patent/CN114972965A/en
Publication of CN114972965A publication Critical patent/CN114972965A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 Arrangements using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a scene recognition method based on deep learning, which comprises the following steps: preprocessing an original picture to be recognized to obtain an image of uniform size and channel format; constructing a deep learning network and training it to obtain a trained deep learning network model, wherein the deep learning network comprises a target detection network unit, a scene recognition network unit, a first feature fusion unit, an attention network unit and a second feature fusion unit; and inputting the image to be recognized into the trained deep learning network model to obtain the scene recognition result of the image. By fusing low- and high-level features and incorporating an attention mechanism, the method enriches the detail information in the image features and can classify original pictures effectively.

Description

Scene recognition method based on deep learning
Technical Field
The invention belongs to the technical field of scene recognition, and particularly relates to a scene recognition method based on deep learning.
Background
In recent years, with the development of science and technology, image classification and target detection based on deep learning have achieved remarkable results in the field of computer vision. Scene recognition is one of the important research directions of computer vision and has broad application prospects in fields such as automatic navigation and unmanned aerial vehicles. Scene recognition refers to recognizing the scene in an image and assigning it to a predefined scene category; common scene categories are divided into natural scenes (forests, seas, deserts and the like), artificial scenes (airports, basketball courts and the like) and indoor scenes (classrooms, meeting rooms and the like). Scene concepts are complex and diverse, and improving the understanding of scene images is currently an important development direction in the field of computer vision. Compared with object recognition, scene categories exhibit small inter-class differences and large intra-class differences, and current computer systems cannot yet judge scene categories the way humans do.
Early scene recognition methods mainly used shallow image feature descriptors to represent scene pictures, such as the Scale-Invariant Feature Transform (SIFT), the Histogram of Oriented Gradients (HOG) and the Local Binary Pattern (LBP), which are commonly used to describe basic image features such as color, texture and shape. These features are simple in form and easy to obtain, but they have certain limitations. With the development of deep learning, Convolutional Neural Networks (CNNs) are increasingly applied to scene recognition: image features of different levels are usually obtained through multiple network architectures, features from multiple models are fused and used as the input of a training network, and the scene images are then classified.
CNN-based scene recognition methods obtain the final classification result by training on and analyzing the features of the entire image. In a scene image, however, not all features are effective information for judging the scene category, so non-effective scene image features may strongly interfere with the final classification result and reduce accuracy.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a scene recognition method based on deep learning. The technical problem to be solved by the invention is realized by the following technical scheme:
one aspect of the present invention provides a scene recognition method based on deep learning, including:
preprocessing an original picture to be recognized to obtain an image to be recognized of uniform size and channel format;
constructing a deep learning network, and training the deep learning network to obtain a trained deep learning network model, wherein the deep learning network comprises a target detection network unit, a scene recognition network unit, a first feature fusion unit, an attention network unit and a second feature fusion unit; the target detection network unit is used for obtaining a target feature vector of the image to be recognized; the scene recognition network unit is used for obtaining a scene feature vector of the image to be recognized; the first feature fusion unit is used for fusing the target feature vector and the scene feature vector to obtain a fused feature vector carrying both a target attribute and a scene attribute; the attention network unit is used for acquiring global feature information and local feature information of the image to be recognized according to the fused feature vector; and the second feature fusion unit is used for fusing the target feature vector, the scene feature vector, the global feature information and the local feature information and obtaining a scene classification result;
and inputting the image to be recognized into the trained deep learning network model to obtain a scene recognition result of the image.
In one embodiment of the present invention, the target detection network unit is a YOLO network, and the scene recognition network unit is a ResNet18 network with the last fully-connected layer removed.
In one embodiment of the invention, the attention network unit comprises a global feature network and a local feature network, wherein,
the global feature network comprises a first global average pooling layer, a first fully-connected layer, a second fully-connected layer, a sigmoid activation function layer and a second global average pooling layer connected in sequence, wherein the first global average pooling layer is used for globally averaging the feature map of each layer in the fused feature vector to obtain a feature value containing global context information, the first and second fully-connected layers are used for capturing the correlations among channels, the sigmoid activation function layer is used for learning the weight factor of each channel, and the second global average pooling layer is used for obtaining the global feature information of each channel;
the local feature network comprises an attention residual module and a third global average pooling layer, wherein the attention residual module is used for enhancing the local detail information of the fused feature vector and obtaining an attention map, and the third global average pooling layer is used for extracting the local feature information from the attention map.
In one embodiment of the present invention, training the deep learning network model includes:
forming an image training set from a large number of images carrying scene labels and labels of the targets in the scene;
and training the deep learning network model with the pictures in the image training set to obtain the trained deep learning network model.
In an embodiment of the present invention, the second feature fusion unit is a Hadamard fusion unit, configured to fuse the target feature vector extracted by the target detection network unit, the scene feature vector extracted by the scene recognition network unit, the global feature vector extracted by the global feature network and the local feature vector extracted by the local feature network, and to classify the fused result by softmax.
Another aspect of the present invention provides a storage medium, in which a computer program is stored, the computer program being configured to execute the steps of the scene recognition method based on deep learning according to any one of the foregoing embodiments.
Yet another aspect of the present invention provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the deep learning based scene recognition method according to any one of the above embodiments when calling the computer program in the memory.
Compared with the prior art, the invention has the beneficial effects that:
1. The scene recognition method based on deep learning fuses low- and high-level features and incorporates an attention mechanism, increasing the detail information in the image features; by setting the parameters in the deep learning network reasonably and effectively, effective scene recognition and classification of the original image can be achieved.
2. Aiming at the problem that traditional machine-learning image scene classification cannot extract image features quickly and effectively, leading to inaccurate classification results, the invention provides a multi-feature-fusion deep learning network whose final classification and recognition accuracy is significantly improved through the training and validation of the model.
The present invention will be described in further detail with reference to the drawings and examples.
Drawings
Fig. 1 is a flowchart of a scene recognition method based on deep learning according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a deep learning network according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a global feature network according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a local feature network according to an embodiment of the present invention;
FIG. 5 is a comparison of the accuracy of scene recognition on the MIT Indoor67 dataset using different recognition methods.
Detailed Description
In order to further explain the technical means and effects of the present invention adopted to achieve the predetermined invention purpose, a scene recognition method based on deep learning according to the present invention is described in detail below with reference to the accompanying drawings and the detailed description.
The foregoing and other technical matters, features and effects of the present invention will be apparent from the following detailed description of the embodiments, which is to be read in connection with the accompanying drawings. While the present invention has been described in connection with the preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
It should be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that an article or device that comprises a list of elements is not limited to those elements but may include other elements not expressly listed. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the article or device comprising the element.
Referring to fig. 1, fig. 1 is a flowchart of a scene recognition method based on deep learning according to an embodiment of the present invention. The scene recognition method comprises the following steps:
s1: and preprocessing the original picture to be recognized to obtain the image to be recognized with the consistent size and channel.
Specifically, in the deep learning network, to ensure that the input image dimensions are consistent, the original image data must be resized to fit the network input. In this embodiment the size transformation can be implemented with a function from the OpenCV library, as follows:
dst=cv.resize(src,dsize[,dst[,fx[,fy[,interpolation]]]])
where src represents the original picture, dsize the scaled image size, dst the target image, fx and fy the scaling factors in the x and y directions, and interpolation is an int specifying the interpolation mode.
Since different deep learning networks expect different image channel layouts when reading images, channel conversion is required according to the format of the deep learning network used. For example, TensorFlow has two data formats, NHWC and NCHW, selected by the parameter data_format: with "NHWC" the order is [batch, height, width, channels], with "NCHW" it is [batch, channels, height, width], and the default is NHWC. Through this size transformation and channel conversion, the original picture to be recognized is adjusted into a picture that meets the input requirements of the deep learning network.
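As an illustration, the two steps above can be combined into a single preprocessing routine. The following is a minimal sketch, not text from the patent: it assumes OpenCV and NumPy, and the 224 x 224 target size, the scaling to [0, 1] and the NCHW output layout are illustrative choices.

import cv2 as cv
import numpy as np

def preprocess(path, dsize=(224, 224), layout="NCHW"):
    # read the original picture; OpenCV returns an HWC array in BGR order
    src = cv.imread(path)
    # size transformation: scale to a uniform input size
    dst = cv.resize(src, dsize, interpolation=cv.INTER_LINEAR)
    img = dst.astype(np.float32) / 255.0
    # channel conversion: NHWC is the TensorFlow default, NCHW moves channels first
    if layout == "NCHW":
        img = np.transpose(img, (2, 0, 1))
    return img[np.newaxis, ...]  # add the batch dimension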
S2: and constructing a deep learning network, and training the deep learning network to obtain a trained deep learning network model.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a deep learning network according to an embodiment of the present invention. The deep learning network constructed in this embodiment comprises a target detection network unit, a scene recognition network unit, a first feature fusion unit, an attention network unit and a second feature fusion unit. The target detection network unit is used for obtaining a target feature vector of the image to be recognized; the scene recognition network unit is used for obtaining a scene feature vector of the image to be recognized; the first feature fusion unit is used for fusing the target feature vector and the scene feature vector to obtain a fused feature vector carrying both a target attribute and a scene attribute; the attention network unit is used for acquiring global feature information and local feature information of the image to be recognized according to the fused feature vector; and the second feature fusion unit is used for fusing the target feature vector, the scene feature vector, the global feature information and the local feature information and obtaining a scene classification result.
Specifically, the deep learning network is mainly composed of three parts: (1) the feature extraction part, which replaces the feature extraction part of the original backbone network with a combination of the target detection network unit and the scene recognition network unit, so that the image features carry both target attributes and scene attributes, which helps retain local detail features; (2) the global-local attention network part based on the attention mechanism, from which global features and local features are obtained respectively; (3) the feature fusion part, provided in this embodiment to increase the diversity of features, which fuses the features carrying target and scene attributes with the global-local features and finally classifies them by softmax.
Preferably, the target detection network unit in this embodiment is a YOLO network, and the scene recognition network unit is a ResNet18 network with the last fully-connected layer removed. Fusing features that carry both target and scene attributes helps retain local detail features; this network stage yields a feature map X of size C x H x W, where H is the height, W the width and C the number of channels.
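A sketch of the scene branch follows, assuming torchvision's resnet18 stands in for the ResNet18 unit (the patent names no implementation, and the YOLO detection branch is omitted here). The patent removes the last fully-connected layer; also dropping the average pooling, an assumption, is what preserves the spatial C x H x W map X described above.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class SceneFeatureNet(nn.Module):
    # ResNet18 truncated before global average pooling and the final fully-connected layer
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, x):
        return self.features(x)  # (N, 512, H/32, W/32): the C x H x W feature map X

X = SceneFeatureNet()(torch.randn(1, 3, 224, 224))  # torch.Size([1, 512, 7, 7])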
Further, please refer to fig. 3 and fig. 4; fig. 3 is a schematic structural diagram of a global feature network according to an embodiment of the present invention, and fig. 4 is a schematic structural diagram of a local feature network according to an embodiment of the present invention. The attention network unit of this embodiment includes a global feature network and a local feature network. The global feature network includes a first global average pooling layer, a first fully-connected layer, a second fully-connected layer, a sigmoid activation function layer and a second global average pooling layer connected in sequence, where the first global average pooling layer is used to globally average the feature map of each layer in the fused feature vector to obtain a feature value containing global context information, the first and second fully-connected layers are used to capture the correlations among channels, the sigmoid activation function layer is used to learn the weight factor of each channel, and the second global average pooling layer is used to obtain the global feature information of each channel.
The local feature network comprises an attention residual module and a third global average pooling layer, wherein the attention residual module is used for enhancing local detail information of the fused feature vector and obtaining an attention map, and the third global average pooling layer is used for extracting local feature information according to the attention map.
Specifically, the output of the first feature fusion unit serves as the input of the attention network unit: the fused features obtained by the first feature fusion unit are fed into the global feature network and the local feature network respectively.
The main task of the global feature network is global information extraction; the whole process is shown in fig. 3. First, global average pooling is performed by a global average pooling layer: all pixel values of each input feature map are summed and averaged, yielding one value per map that contains global context information and represents the corresponding feature map. Two fully-connected layers are then attached in order to capture the correlations between channels, and after them a sigmoid function is used to learn the weight factor of channel i, according to the following formula:
σ_i = sigmoid(w_{2,i} · ReLU(w_{1,i} · z_i))
where w_{1,i} and w_{2,i} respectively denote the weights of the two fully-connected layers for the i-th channel, ReLU and sigmoid are activation functions, and z_i denotes the output after global average pooling, i.e. the first-layer output of the global feature extraction network.
The global feature network can thus assign different weights to different channels, a higher weight meaning the channel is more important. The global feature s_i of each channel i is then obtained by global average pooling, and the resulting global feature is expressed as S_global = (s_1, s_2, ..., s_{i-1}, s_i).
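The channel weighting above follows a squeeze-and-excitation pattern, so the whole global branch might be sketched as follows. PyTorch is an assumption (the patent names no framework), as is the channel-reduction ratio r=16 between the two fully-connected layers.

import torch
import torch.nn as nn

class GlobalFeatureNet(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)  # first fully-connected layer
        self.fc2 = nn.Linear(channels // r, channels)  # second fully-connected layer

    def forward(self, x):  # x: fused feature map of shape (N, C, H, W)
        n, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))  # first global average pooling: one value per channel
        sigma = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # weight factor per channel
        weighted = x * sigma.view(n, c, 1, 1)  # re-weight channels by importance
        return weighted.mean(dim=(2, 3))  # second global average pooling -> S_global

s_global = GlobalFeatureNet(512)(torch.randn(1, 512, 7, 7))  # torch.Size([1, 512])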
Further, this embodiment can also learn more semantic information from the fused feature x obtained by the first feature fusion unit through the local feature network. This is mainly implemented by an attention residual module, which enhances local detail information, while the residual connection helps optimize the network learning process. The attention residual module takes the fused feature x output by the first feature fusion unit as input, and the calculation formula is as follows:
M_s = θ(F(x, {w}))
where F(x, {w}) = w · x + b is the convolution operation, b is the bias, θ is a nonlinear function, M_s is the attention map, and w denotes the connection weights between the input fused feature map x and the attention map M_s. M_s is normalized to [0, 1] to obtain the final normalized attention map (the normalization formula appears only as an equation image in the original), where L = {(i, j) | i = 1, ..., W; j = 1, ..., H}, W denotes the feature width and H the feature height.
Finally, the local feature information S_local is extracted through global average pooling.
Further, the second feature fusion unit of this embodiment is a Hadamard fusion unit, which fuses the target feature vector extracted by the target detection network unit, the scene feature vector extracted by the scene recognition network unit, the global feature vector extracted by the global feature network and the local feature vector extracted by the local feature network, and performs classification by softmax.
Specifically, the target detection network unit and the scene recognition network unit extract low-level features, namely the feature vector S_o extracted by the target detection network and the feature vector S_p extracted by the scene recognition network. The attention network unit extracts high-level features, namely the feature vector S_global extracted by the global feature network and the feature vector S_local extracted by the local feature network. In this embodiment, the feature vectors S_o, S_p, S_global and S_local are fused by Hadamard fusion and then classified by the softmax function, yielding the recognition result of the original picture, i.e. the scene type in the original picture.
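A sketch of this fusion step, assuming the four vectors S_o, S_p, S_global and S_local have already been brought to a common dimension (the patent does not specify projection layers): the Hadamard product is the element-wise product, followed by a softmax classifier.

import torch
import torch.nn as nn

class HadamardFusion(nn.Module):
    def __init__(self, dim, num_classes=67):  # 67 matches the MIT Indoor67 categories below
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, s_o, s_p, s_global, s_local):  # each of shape (N, dim)
        fused = s_o * s_p * s_global * s_local  # Hadamard (element-wise) fusion
        return torch.softmax(self.classifier(fused), dim=1)  # scene class probabilities

probs = HadamardFusion(512)(*[torch.randn(1, 512) for _ in range(4)])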
Further, after the deep learning network is constructed, the constructed deep learning network needs to be trained, and the specific training process includes:
first, a training set of images is composed with a large number of images having scene labels and target labels in the scene.
In this embodiment, the MIT Indoor67 data set is selected. The data set contains 67 indoor scene categories and about 15,620 images, with at least 100 images per category; the indoor scene categories include bedroom, dining room, office and bar. The deep learning network model proposed in this embodiment is trained, validated and tested on this data set. Specifically, the construction of the training data set comprises the following steps:
(I) Preprocessing pictures from the MIT Indoor67 dataset
(1) Size transformation: as with the original picture processing, the image data must be resized so that input dimensions are consistent across the deep learning network. This can again be implemented with a function from the OpenCV library, as follows:
dst=cv.resize(src,dsize[,dst[,fx[,fy[,interpolation]]]])
where src represents the original image, dsize the scaled image size, dst the target image, fx and fy the scaling factors in the x and y directions, and interpolation is an int specifying the interpolation method.
(2) Channel conversion: the present embodiment also requires channel conversion according to the format requirements of the deep learning network used.
(3) Data enhancement: to increase the amount of training data and improve the training effect of the model, this embodiment crops images of arbitrary size to 224 x 224 by random cropping.
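The three preprocessing steps can be expressed, for instance, as a torchvision transform pipeline; this is an assumption, and the intermediate resize to 256 before the random 224 x 224 crop is an illustrative choice.

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),      # (1) size transformation
    transforms.RandomCrop(224),  # (3) data enhancement: random 224 x 224 crop
    transforms.ToTensor(),       # (2) channel conversion: HWC uint8 -> CHW float in [0, 1]
])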
Then, the deep learning network model is trained with the pictures in the image training set to obtain the trained deep learning network model.
Specifically, pictures from the MIT Indoor67 dataset are processed with the picture-processing procedure described above to build the training data set, and the constructed deep learning network is then trained with the pictures in that set. The training process of this embodiment updates the weights with a mini-batch strategy, with batch_size set to 64. To reduce overfitting, the initial learning rate is set to 10^-3, dropout is set to 0.5, weight decay is set to 10^-5, and batch normalization is used at the corresponding layers. The initial learning rate lr of the network is set to 0.1, and the learning rate is multiplied by 0.1 every 10000 iterations.
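The stated hyperparameters map onto a training loop roughly as follows; this is a sketch under the same PyTorch assumption, with the model, loss and data replaced by placeholders, and SGD as an assumed optimizer since the patent does not name one.

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(512, 67)  # placeholder for the full deep learning network
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-5)  # lr=10^-3, decay=10^-5
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)  # x0.1 per 10000 iters

for step in range(3):  # stand-in for iterating over mini-batches of batch_size=64
    images, labels = torch.randn(64, 512), torch.randint(0, 67, (64,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()  # the schedule counts iterations, not epochs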
S3: and inputting the image to be recognized into the trained deep learning network model to obtain a scene recognition result of the image.
Specifically, the picture to be recognized, which is preprocessed in step S1, is input into the trained deep learning network model, so that the scene recognition result of the picture, that is, the scene type in the original picture, can be obtained.
Further, in order to illustrate the effect of the scene recognition method based on deep learning provided by the embodiment of the present invention, several different classification and recognition methods were evaluated on a test set built from the MIT Indoor67 data set. Referring to fig. 5, fig. 5 compares the accuracy of scene recognition on the MIT Indoor67 data set using the different recognition methods, where Places365-VGG denotes a scene recognition method based on the Visual Geometry Group network, VSAD denotes a scene recognition method based on the Vector of Semantically Aggregated Descriptors, and SDO denotes a scene recognition method based on Semantic Descriptors with Objectness. The comparison shows that the scene recognition method based on deep learning according to the embodiment of the present invention achieves the highest recognition accuracy.
In summary, the scene recognition method based on deep learning of this embodiment fuses low- and high-level features and incorporates an attention mechanism, increasing the detail information in the image features; by setting the parameters in the deep learning network reasonably and effectively, effective scene recognition and classification of the original image can be achieved. Aiming at the problem that traditional machine-learning image scene classification cannot extract image features quickly and effectively, leading to inaccurate classification results, this embodiment provides a multi-feature-fusion deep learning network whose final classification and recognition accuracy is significantly improved through the training and validation of the model.
Yet another embodiment of the present invention provides a storage medium storing a computer program for executing the steps of the deep-learning-based scene recognition method of the above embodiments. Yet another aspect of the present invention provides an electronic device comprising a memory and a processor; the memory stores a computer program, and when the processor calls the computer program in the memory it implements the steps of the deep-learning-based scene recognition method of the above embodiments. Specifically, an integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that enable an electronic device (which may be a personal computer, a server or a network device) or a processor to execute some of the steps of the method according to the embodiments of the present invention. The aforementioned storage media include various media capable of storing program code, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all of these shall be considered as belonging to the protection scope of the invention.

Claims (7)

1. A scene recognition method based on deep learning is characterized by comprising the following steps:
preprocessing an original picture to be recognized to obtain an image to be recognized of uniform size and channel format;
constructing a deep learning network, and training the deep learning network to obtain a trained deep learning network model, wherein the deep learning network comprises a target detection network unit, a scene recognition network unit, a first feature fusion unit, an attention network unit and a second feature fusion unit; the target detection network unit is used for obtaining a target feature vector of the image to be recognized; the scene recognition network unit is used for obtaining a scene feature vector of the image to be recognized; the first feature fusion unit is used for fusing the target feature vector and the scene feature vector to obtain a fused feature vector carrying both a target attribute and a scene attribute; the attention network unit is used for acquiring global feature information and local feature information of the image to be recognized according to the fused feature vector; and the second feature fusion unit is used for fusing the target feature vector, the scene feature vector, the global feature information and the local feature information and obtaining a scene classification result;
and inputting the image to be recognized into the trained deep learning network model to obtain a scene recognition result of the image.
2. The deep learning-based scene recognition method according to claim 1, wherein the target detection network unit is a YOLO network, and the scene recognition network unit is a ResNet18 network with the last fully-connected layer removed.
3. The deep learning-based scene recognition method of claim 1, wherein the attention network unit comprises a global feature network and a local feature network, wherein,
the global feature network comprises a first global average pooling layer, a first fully-connected layer, a second fully-connected layer, a sigmoid activation function layer and a second global average pooling layer connected in sequence, wherein the first global average pooling layer is used for globally averaging the feature map of each layer in the fused feature vector to obtain a feature value containing global context information, the first and second fully-connected layers are used for capturing the correlations among channels, the sigmoid activation function layer is used for learning the weight factor of each channel, and the second global average pooling layer is used for obtaining the global feature information of each channel;
the local feature network comprises an attention residual module and a third global average pooling layer, wherein the attention residual module is used for enhancing the local detail information of the fused feature vector and obtaining an attention map, and the third global average pooling layer is used for extracting the local feature information from the attention map.
4. The deep learning based scene recognition method of claim 3, wherein training the deep learning network model comprises:
forming an image training set from a large number of images carrying scene labels and labels of the targets in the scene;
and training the deep learning network model with the pictures in the image training set to obtain the trained deep learning network model.
5. The deep learning-based scene recognition method of claim 3, wherein the second feature fusion unit is a Hadamard fusion unit, configured to fuse the target feature vector extracted by the target detection network unit, the scene feature vector extracted by the scene recognition network unit, the global feature vector extracted by the global feature network and the local feature vector extracted by the local feature network, and to classify the fused result by softmax.
6. A storage medium, characterized in that the storage medium has stored therein a computer program for executing the steps of the deep learning based scene recognition method according to any one of claims 1 to 5.
7. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the deep learning based scene recognition method according to any one of claims 1 to 5 when the processor calls the computer program in the memory.
CN202210416206.7A 2022-04-20 2022-04-20 Scene recognition method based on deep learning Pending CN114972965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210416206.7A CN114972965A (en) 2022-04-20 2022-04-20 Scene recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210416206.7A CN114972965A (en) 2022-04-20 2022-04-20 Scene recognition method based on deep learning

Publications (1)

Publication Number Publication Date
CN114972965A true CN114972965A (en) 2022-08-30

Family

ID=82976503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210416206.7A Pending CN114972965A (en) 2022-04-20 2022-04-20 Scene recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN114972965A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723159A (en) * 2021-02-26 2021-11-30 腾讯科技(深圳)有限公司 Scene recognition model training method, scene recognition method and model training device
CN113408590A (en) * 2021-05-27 2021-09-17 华中科技大学 Scene recognition method, training method, device, electronic equipment and program product
CN113486981A (en) * 2021-07-30 2021-10-08 西安电子科技大学 RGB image classification method based on multi-scale feature attention fusion network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU JIANGLANG et al.: "Indoor scene recognition method combining object detection" (结合目标检测的室内场景识别方法), Journal of Computer Applications (计算机应用), 10 September 2021 (2021-09-10) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097566A (en) * 2024-04-23 2024-05-28 齐鲁工业大学(山东省科学院) Scene change detection method, device, medium and equipment based on deep learning

Similar Documents

Publication Publication Date Title
CN106919920B (en) Scene recognition method based on convolution characteristics and space vision bag-of-words model
Kao et al. Visual aesthetic quality assessment with a regression model
CN109241817B (en) Crop image recognition method shot by unmanned aerial vehicle
Bianco et al. Predicting image aesthetics with deep learning
CN109800817B (en) Image classification method based on fusion semantic neural network
WO2023206944A1 (en) Semantic segmentation method and apparatus, computer device, and storage medium
CN112836625A (en) Face living body detection method and device and electronic equipment
CN112132145B (en) Image classification method and system based on model extended convolutional neural network
CN111768415A (en) Image instance segmentation method without quantization pooling
CN111694959A (en) Network public opinion multi-mode emotion recognition method and system based on facial expressions and text information
CN110222718A (en) The method and device of image procossing
CN113011253B (en) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
CN113344000A (en) Certificate copying and recognizing method and device, computer equipment and storage medium
CN112329771B (en) Deep learning-based building material sample identification method
CN116385707A (en) Deep learning scene recognition method based on multi-scale features and feature enhancement
CN110929099A (en) Short video frame semantic extraction method and system based on multitask learning
CN111507288A (en) Image detection method, image detection device, computer equipment and storage medium
CN111414913A (en) Character recognition method and recognition device and electronic equipment
CN115482529A (en) Method, equipment, storage medium and device for recognizing fruit image in near scene
CN114972965A (en) Scene recognition method based on deep learning
CN113628181A (en) Image processing method, image processing device, electronic equipment and storage medium
CN110472639B (en) Target extraction method based on significance prior information
CN112016592A (en) Domain adaptive semantic segmentation method and device based on cross domain category perception
CN110020688B (en) Shielded pedestrian detection method based on deep learning
Sun et al. Flame Image Detection Algorithm Based on Computer Vision.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination