CN110852206A - Scene recognition method and device combining global features and local features - Google Patents

Scene recognition method and device combining global features and local features

Info

Publication number
CN110852206A
Authority
CN
China
Prior art keywords
features
daisy
image
histogram
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911033329.7A
Other languages
Chinese (zh)
Inventor
樊硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yingpu Technology Co Ltd
Original Assignee
Beijing Yingpu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yingpu Technology Co Ltd filed Critical Beijing Yingpu Technology Co Ltd
Priority to CN201911033329.7A
Publication of CN110852206A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/50 Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G06V10/507 Summing image-intensity values; Histogram projection analysis

Abstract

The application discloses a scene recognition method and device combining global features and local features, and relates to the field of computer vision. The method comprises the following steps: feature extraction: for each image of the training data set, DAISY features are extracted as local features and HOG features as global features of the image; image coding: each image is encoded as a histogram of visual words using the local DAISY features corresponding to each keypoint; constructing a pooling device: the DAISY histogram feature serves as the first layer, the L2-normalized global feature as the second layer, and the two layers are connected in series to form a hybrid feature; scene recognition: a classifier is trained with the hybrid features to form the final scene recognizer, which is used for scene recognition. The device comprises: a feature extraction module, an image coding module, a pooling device construction module and a scene recognition module. The method and the device can improve the accuracy of scene recognition.

Description

Scene recognition method and device combining global features and local features
Technical Field
The present application relates to the field of computer vision, and in particular, to a scene recognition method and apparatus combining global features and local features.
Background
Scene recognition is a hot problem in the field of computer vision. Its research goal is to process video or image information and automatically recognize the scene it contains. The technique has rich application fields, such as automatic monitoring, human-computer interaction, video indexing and image indexing.
Feature extraction methods in scene recognition fall into three categories: bottom-layer feature methods, middle-layer semantic methods and high-layer feature methods. Bottom-layer features are basic descriptors of image color, shape, texture and the like; they are simple in form and easy to obtain. They treat a scene as an object with structure and shape and represent its overall information by analyzing spectral information, so they suit outdoor scene recognition of low complexity. The middle-layer semantic method combines features to form new features and aims to bridge the semantic gap between features and semantics; it is generally realized with a visual bag-of-words model. Its main defect is that spatial information is ignored, and its recognition effect depends heavily on the performance of the selected features. High-layer features are more complex and closer to image semantics; they are generally constructed by combining bottom-layer features and are richer in expressive power, so they can handle scene classification with a large number of categories and carry more scene information that is closer to the real semantics of the image. However, they are generally of higher dimensionality and more complex to compute and extract.
Each of the three feature extraction methods therefore has its own advantages and disadvantages, and different methods can be adopted for different application requirements. Traditional scene recognition methods generally use bottom-layer or high-layer features, which are easy to understand and simple to implement. However, none of the three methods alone fully exploits the feature information of the image or represents richer image scene information, which reduces the accuracy of scene recognition.
Disclosure of Invention
It is an object of the present application to overcome the above problems or at least partially solve or mitigate them.
According to an aspect of the present application, there is provided a scene recognition method combining global features and local features, including:
feature extraction: for each image of a training data set, extracting DAISY features corresponding to key points detected from the image, taking the DAISY features as local features of the image, extracting standard HOG features corresponding to the whole image at different granularities, and taking the HOG features as global features of the image;
image coding: encoding each image as a histogram of visual words using local DAISY features corresponding to each keypoint;
constructing a double-layer pooling device: adopting the DAISY histogram feature as the first layer of the pooling device, adopting the L2-normalized global feature as the second layer, and connecting the L2-normalized global feature in series with the corresponding DAISY histogram feature to form a hybrid feature; the DAISY histogram feature is constructed as follows: a histogram representing the frequency of each visual word in each image is built from the DAISY features by selecting each key point in the image, determining the DAISY feature corresponding to that key point and looking up the cluster ID corresponding to the DAISY feature; the resulting histogram is L2-normalized to form the DAISY histogram feature;
scene recognition: training a classifier with the hybrid features to form the final scene recognizer, and performing scene recognition with the scene recognizer.
Optionally, the DAISY features are extracted using the scikit-image (skimage) library in Python.
Optionally, the encoding each image into a histogram of visual words using the local DAISY feature corresponding to each keypoint comprises:
quantizing the DAISY features into 'K' clusters with the Mini-Batch KMeans algorithm to form the 'visual words' of a vocabulary, where K denotes the vocabulary size;
forming a histogram with 'K' as its dimension using the vocabulary.
Optionally, the classifier is an SVM classifier.
Optionally, K has a value of 700.
According to another aspect of the present application, there is provided a scene recognition apparatus combining global features and local features, including:
a feature extraction module configured to extract, for each image of the training data set, DAISY features corresponding to key points detected from the image and use the DAISY features as local features of the image, extract standard HOG features corresponding to the entire image at different granularities, and use the HOG features as global features of the image;
an image encoding module configured to encode each image as a histogram of visual words using local DAISY features corresponding to each keypoint;
a pooling device building module configured to use the DAISY histogram feature as the first layer of the pooling device, use the L2-normalized global feature as the second layer, and connect the L2-normalized global feature in series with the corresponding DAISY histogram feature to form a hybrid feature; the DAISY histogram feature is constructed as follows: a histogram representing the frequency of each visual word in each image is built from the DAISY features by selecting each key point in the image, determining the DAISY feature corresponding to that key point and looking up the cluster ID corresponding to the DAISY feature; the resulting histogram is L2-normalized to form the DAISY histogram feature;
and a scene recognition module configured to train a classifier with the hybrid features to form the final scene recognizer and to perform scene recognition with the scene recognizer.
Optionally, the DAISY features are extracted using the scikit-image (skimage) library in Python.
Optionally, the encoding each image into a histogram of visual words using the local DAISY feature corresponding to each keypoint comprises:
quantizing the DAISY features into 'K' clusters with the Mini-Batch KMeans algorithm to form the 'visual words' of a vocabulary, where K denotes the vocabulary size;
forming a histogram with 'K' as its dimension using the vocabulary.
Optionally, the classifier is an SVM classifier.
Optionally, K has a value of 700.
According to the scene recognition method and device combining global features and local features, DAISY features are used as the local features of the images and clustered with the Mini-Batch KMeans algorithm to form a visual bag of words, while HOG features are used as the global features of the images and connected in series with the corresponding DAISY histogram features to form hybrid features that comprehensively represent the image. The accuracy of scene recognition can thereby be improved.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow chart diagram of a method for scene recognition combining global and local features according to one embodiment of the present application;
FIG. 2 is a schematic flow chart of the method shown in FIG. 1 for encoding each image as a histogram of visual words using local DAISY features corresponding to each keypoint;
FIG. 3 is a block diagram of a scene recognition apparatus that combines global features and local features according to an embodiment of the present application;
FIG. 4 is a block schematic diagram of a computing device of one embodiment of the present application;
fig. 5 is a schematic block diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Fig. 1 is a schematic flow chart of a scene recognition method combining global features and local features according to an embodiment of the present application. The method may generally include:
S1, feature extraction: for each image of a training data set, extracting DAISY features corresponding to key points detected from the image, taking the DAISY features as local features of the image, extracting standard HOG features corresponding to the whole image at different granularities, and taking the HOG features as global features of the image;
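A minimal Python sketch of this extraction step, assuming the scikit-image library; the resize target, grid step, radii and HOG cell sizes below are illustrative assumptions, as the application does not fix them:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import daisy, hog
from skimage.transform import resize

def extract_features(image_rgb):
    """Extract dense DAISY descriptors (local) and a HOG vector (global)."""
    # Resize so that the global feature has the same length for every image
    # (assumed convention; the application does not specify an image size).
    gray = resize(rgb2gray(image_rgb), (256, 256))
    # DAISY descriptors on a dense grid; each row is one keypoint descriptor.
    descs = daisy(gray, step=16, radius=15, rings=3,
                  histograms=8, orientations=8)
    local_feats = descs.reshape(-1, descs.shape[-1])
    # Standard HOG over the whole image at two granularities (cell sizes)
    # standing in for the "different granularities" of the method.
    hog_fine = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2), feature_vector=True)
    hog_coarse = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                     cells_per_block=(2, 2), feature_vector=True)
    global_feat = np.concatenate([hog_fine, hog_coarse])
    return local_feats, global_feat
```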
S2, image coding: each image is encoded as a histogram of visual words using the local DAISY features corresponding to each keypoint, as shown in fig. 2:
S21, the image coding uses the local DAISY features corresponding to each key point to encode each image as a histogram of visual words. Specifically, the Mini-Batch KMeans algorithm is applied to quantize the DAISY features extracted from all training images into 'K' clusters, which form the 'visual words' of a vocabulary, where K denotes the vocabulary size. The advantage of the Mini-Batch KMeans algorithm is that it greatly reduces computation time while preserving clustering accuracy as far as possible: it works on small-batch subsets of the data to cut computation time while still optimizing the objective function. The optimal number of visual words K was determined to be 700 by cross-validation and experience;
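A minimal sketch of this quantization step, assuming the MiniBatchKMeans implementation from scikit-learn (the batch size and random seed are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 700  # vocabulary size chosen in the application by cross-validation

def build_vocabulary(all_local_feats):
    """Cluster the DAISY descriptors of all training images into K visual words.

    all_local_feats: list of (n_i, d) descriptor arrays, one per training image.
    """
    stacked = np.vstack(all_local_feats)
    kmeans = MiniBatchKMeans(n_clusters=K, batch_size=1024, random_state=0)
    kmeans.fit(stacked)  # the cluster centers act as the visual words
    return kmeans
```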
S22, forming, for each image, a histogram with 'K' as its dimension using the vocabulary.
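Continuing the sketch, each image can be encoded by assigning every keypoint descriptor to its nearest visual word from the vocabulary built above and counting the assignments into a K-dimensional histogram:

```python
import numpy as np

def encode_image(local_feats, kmeans, K=700):
    """Encode one image as a K-dimensional histogram of visual-word counts."""
    cluster_ids = kmeans.predict(local_feats)  # one cluster ID per keypoint
    return np.bincount(cluster_ids, minlength=K).astype(np.float64)
```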
S3, constructing a double-layer pooling device: the DAISY histogram feature is adopted as the first layer of the pooling device and the L2-normalized global feature as the second layer, and the L2-normalized global feature is connected in series with the corresponding DAISY histogram feature to form a hybrid feature. The DAISY histogram feature is constructed as follows: each key point in the image is selected, the DAISY feature corresponding to that key point is determined, the cluster ID corresponding to the DAISY feature is looked up, and the histogram obtained in step S22 is L2-normalized using the cluster IDs to form the DAISY histogram feature;
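A minimal sketch of the double-layer pooling device under the same assumptions: the DAISY histogram forms the L2-normalized first layer, the HOG global feature the L2-normalized second layer, and the two are connected in series:

```python
import numpy as np

def l2_normalize(v, eps=1e-10):
    """Scale a vector to unit L2 norm; eps guards against all-zero vectors."""
    return v / (np.linalg.norm(v) + eps)

def hybrid_feature(daisy_hist, hog_global):
    """Concatenate the two L2-normalized layers into one hybrid feature."""
    return np.concatenate([l2_normalize(daisy_hist),
                           l2_normalize(hog_global)])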
S4, scene recognition: training a classifier with the hybrid features to form the final scene recognizer, and performing scene recognition with the scene recognizer:
Through the above three steps, a hybrid feature description of the training data set is obtained, and the classifier is trained with the hybrid features to form the final scene recognizer. An SVM classifier is used as the scene recognizer, and cross-validation is performed by randomly splitting the data set into a training set and a validation set. Experimental results show that the scene recognition method combining global features and local features comprehensively extracts the global and local information of the image and improves the accuracy of scene recognition.
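A minimal sketch of this training step, assuming scikit-learn's SVM; the linear kernel and the 80/20 split are illustrative assumptions, as the application only specifies an SVM classifier with cross-validation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_recognizer(hybrid_feats, labels):
    """Train the SVM scene recognizer and report held-out accuracy."""
    X_train, X_val, y_train, y_val = train_test_split(
        np.asarray(hybrid_feats), np.asarray(labels),
        test_size=0.2, random_state=0)
    clf = SVC(kernel='linear')  # kernel choice is an assumption
    clf.fit(X_train, y_train)
    print('validation accuracy:', clf.score(X_val, y_val))
    return clf
```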
Fig. 3 is a schematic structural block diagram of a scene recognition apparatus combining global features and local features according to an embodiment of the present application. The apparatus may generally include: a feature extraction module 1, an image coding module 2, a pooling device construction module 3 and a scene recognition module 4.
The feature extraction module 1 is configured to extract, for each image of a training data set, DAISY features corresponding to key points detected from the image and use them as the local features of the image, and to extract standard HOG features of the whole image at different granularities and use them as the global features of the image;
The DAISY features extracted from all training images are clustered with the Mini-Batch KMeans algorithm to form a visual bag of words. Mini-Batch KMeans is a clustering model that preserves clustering accuracy as far as possible while greatly reducing computation time: it works on small-batch subsets of the data to cut computation time while still optimizing the objective function. In this application the number of best visual words (K) is defined empirically by cross-validation, and K is set to 700.
The image encoding module 2 is configured to encode each image into a histogram of visual words using the local DAISY features corresponding to each keypoint, as follows:
The image coding uses the local DAISY features corresponding to each key point to encode each image as a histogram of visual words. Specifically, the Mini-Batch KMeans algorithm is applied to quantize the DAISY features extracted from all training images into 'K' clusters, which form the 'visual words' of a vocabulary, where K denotes the vocabulary size. The advantage of the Mini-Batch KMeans algorithm is that it greatly reduces computation time while preserving clustering accuracy as far as possible, and the optimal number of visual words K was determined to be 700 by cross-validation and experience.
A histogram with 'K' as its dimension is then formed for each image using the vocabulary.
The pooling device construction module 3 is configured to adopt the DAISY histogram feature as the first layer of the pooling device and the L2-normalized global feature as the second layer, and to connect the L2-normalized global feature in series with the corresponding DAISY histogram feature to form a hybrid feature. The DAISY histogram feature is constructed as follows: each key point in the image is selected, the DAISY feature corresponding to that key point is determined, the cluster ID corresponding to the DAISY feature is looked up, and the histogram obtained by the image coding module 2 is L2-normalized using the cluster IDs to form the DAISY histogram feature.
The scene recognition module 4 is configured to train the classifier with the hybrid features to form the final scene recognizer and to perform scene recognition with the scene recognizer:
Through the feature extraction module 1, the image coding module 2 and the pooling device construction module 3, a hybrid feature description of the training data set is obtained, and the classifier is trained with the hybrid features to form the final scene recognizer. In this embodiment an SVM classifier is used as the scene recognizer, and cross-validation is performed by randomly splitting the data set into a training set and a validation set. Experimental results show that the scene recognition method combining global features and local features comprehensively extracts the global and local information of the image and improves the accuracy of scene recognition.
Embodiments of the present application also provide a computing device. Referring to fig. 4, the computing device comprises a memory 1120, a processor 1110 and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, implements the method steps 1131 for performing any of the methods according to the present application.
The embodiment of the application also provides a computer-readable storage medium. Referring to fig. 5, the computer-readable storage medium comprises a storage unit for program code, which is provided with a program 1131' for performing the steps of the method according to the present application; the program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the present application.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that incorporates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program stored in a computer-readable storage medium, where the storage medium is a non-transitory medium such as a random access memory, read-only memory, flash memory, hard disk, solid state disk, magnetic tape, floppy disk, optical disk, or any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A scene recognition method combining global features and local features comprises the following steps:
feature extraction: for each image of a training data set, extracting DAISY features corresponding to key points detected from the image, taking the DAISY features as local features of the image, extracting standard HOG features corresponding to the whole image at different granularities, and taking the HOG features as global features of the image;
image coding: encoding each image as a histogram of visual words using local DAISY features corresponding to each keypoint;
constructing a double-layer pooling device: adopting the DAISY histogram feature as the first layer of the pooling device, adopting the L2-normalized global feature as the second layer, and connecting the L2-normalized global feature in series with the corresponding DAISY histogram feature to form a hybrid feature; the DAISY histogram feature is constructed as follows: a histogram representing the frequency of each visual word in each image is built from the DAISY features by selecting each key point in the image, determining the DAISY feature corresponding to that key point and looking up the cluster ID corresponding to the DAISY feature; the resulting histogram is L2-normalized to form the DAISY histogram feature;
scene recognition: training a classifier with the hybrid features to form the final scene recognizer, and performing scene recognition with the scene recognizer.
2. The method as claimed in claim 1, wherein the DAISY features are extracted using the scikit-image (skimage) library in Python.
3. The method of claim 1 or 2, wherein encoding each image as a histogram of visual words using local DAISY features corresponding to each keypoint comprises:
quantizing the DAISY features into 'K' clusters with the Mini-Batch KMeans algorithm to form the 'visual words' of a vocabulary, wherein K represents the vocabulary size;
forming a histogram with 'K' as its dimension using the vocabulary.
4. A method according to any one of claims 1-3, wherein the classifier is an SVM classifier.
5. The method of any one of claims 1-4, wherein K has a value of 700.
6. A scene recognition apparatus that combines global features and local features, comprising:
a feature extraction module configured to extract, for each image of the training data set, DAISY features corresponding to key points detected from the image and use the DAISY features as local features of the image, extract standard HOG features corresponding to the entire image at different granularities, and use the HOG features as global features of the image;
an image encoding module configured to encode each image as a histogram of visual words using local DAISY features corresponding to each keypoint;
a pooling device building module configured to use the DAISY histogram feature as the first layer of the pooling device, use the L2-normalized global feature as the second layer, and connect the L2-normalized global feature in series with the corresponding DAISY histogram feature to form a hybrid feature; the DAISY histogram feature is constructed as follows: a histogram representing the frequency of each visual word in each image is built from the DAISY features by selecting each key point in the image, determining the DAISY feature corresponding to that key point and looking up the cluster ID corresponding to the DAISY feature; the resulting histogram is L2-normalized to form the DAISY histogram feature;
and a scene recognition module configured to train a classifier with the hybrid features to form the final scene recognizer and to perform scene recognition with the scene recognizer.
7. The apparatus of claim 6, wherein the DAISY features are extracted using the scikit-image (skimage) library in Python.
8. The apparatus of claim 6 or 7, wherein encoding each image as a histogram of visual words using local DAISY features corresponding to each keypoint comprises:
quantizing the DAISY features into 'K' clusters with the Mini-Batch KMeans algorithm to form the 'visual words' of a vocabulary, wherein K represents the vocabulary size;
forming a histogram with 'K' as its dimension using the vocabulary.
9. The apparatus of any one of claims 6-8, wherein said classifier is an SVM classifier.
10. The apparatus of any one of claims 6-9, wherein K has a value of 700.
CN201911033329.7A 2019-10-28 2019-10-28 Scene recognition method and device combining global features and local features Pending CN110852206A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911033329.7A CN110852206A (en) 2019-10-28 2019-10-28 Scene recognition method and device combining global features and local features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911033329.7A CN110852206A (en) 2019-10-28 2019-10-28 Scene recognition method and device combining global features and local features

Publications (1)

Publication Number Publication Date
CN110852206A 2020-02-28

Family

ID=69598952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911033329.7A Pending CN110852206A (en) 2019-10-28 2019-10-28 Scene recognition method and device combining global features and local features

Country Status (1)

Country Link
CN (1) CN110852206A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699855A (en) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Image scene recognition method and device based on artificial intelligence and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596195A (en) * 2018-05-09 2018-09-28 福建亿榕信息技术有限公司 A kind of scene recognition method based on sparse coding feature extraction

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596195A (en) * 2018-05-09 2018-09-28 福建亿榕信息技术有限公司 A kind of scene recognition method based on sparse coding feature extraction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JOBIN WILSON et al., "Scene Recognition by Combining Local and Global Image Descriptors", arXiv *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699855A (en) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Image scene recognition method and device based on artificial intelligence and electronic equipment
CN112699855B (en) * 2021-03-23 2021-10-22 腾讯科技(深圳)有限公司 Image scene recognition method and device based on artificial intelligence and electronic equipment

Similar Documents

Publication Publication Date Title
Husain et al. REMAP: Multi-layer entropy-guided pooling of dense CNN features for image retrieval
US10489681B2 (en) Method of clustering digital images, corresponding system, apparatus and computer program product
CN110175249A (en) A kind of search method and system of similar pictures
CN116049412B (en) Text classification method, model training method, device and electronic equipment
CN112804558B (en) Video splitting method, device and equipment
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN113593661A (en) Clinical term standardization method, device, electronic equipment and storage medium
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
CN108090117B (en) A kind of image search method and device, electronic equipment
CN115062709A (en) Model optimization method, device, equipment, storage medium and program product
CN110688515A (en) Text image semantic conversion method and device, computing equipment and storage medium
CN116343190B (en) Natural scene character recognition method, system, equipment and storage medium
CN110852206A (en) Scene recognition method and device combining global features and local features
CN113704534A (en) Image processing method and device and computer equipment
CN116957036A (en) Training method, training device and computing equipment for fake multimedia detection model
CN115187910A (en) Video classification model training method and device, electronic equipment and storage medium
CN115222047A (en) Model training method, device, equipment and storage medium
CN115240647A (en) Sound event detection method and device, electronic equipment and storage medium
Hua et al. Cross-modal correlation learning with deep convolutional architecture
CN113408282A (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN115331062B (en) Image recognition method, image recognition device, electronic device and computer-readable storage medium
Zhang et al. Short video fingerprint extraction: from audio–visual fingerprint fusion to multi-index hashing
CN111625672B (en) Image processing method, image processing device, computer equipment and storage medium
CN116977888A (en) Video processing method, apparatus, device, storage medium, and computer program product
Tong et al. A compact discriminant hierarchical clustering approach for action recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200228