CN108596195B - Scene recognition method based on sparse coding feature extraction - Google Patents

Scene recognition method based on sparse coding feature extraction

Info

Publication number
CN108596195B
CN108596195B (application CN201810435125.5A)
Authority
CN
China
Prior art keywords
sample image
image set
expression vector
feature
scene
Prior art date
Legal status: Active
Application number
CN201810435125.5A
Other languages
Chinese (zh)
Other versions
CN108596195A (en
Inventor
曾伟波
苏江文
郑耀松
吕君玉
林吓强
陈铠
Current Assignee
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Priority date: 2018-05-09
Filing date: 2018-05-09
Publication date: 2022-08-19
Application filed by State Grid Information and Telecommunication Co Ltd and Fujian Yirong Information Technology Co Ltd
Priority to CN201810435125.5A
Publication of CN108596195A
Application granted
Publication of CN108596195B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/513 Sparse representations

Abstract

The invention relates to the technical field of image recognition, in particular to a scene recognition method based on sparse coding feature extraction. The method comprises the following steps: preprocessing a sample image set acquired in advance for training; extracting feature expression vectors of the sample image set; feeding the feature expression vectors and their corresponding class labels into a linear classifier to construct a linear scene classifier; preprocessing a sample image set to be identified; extracting feature expression vectors of the sample image set to be identified; and feeding the feature expression vectors of the sample image set to be identified into the linear scene classifier for identification, obtaining the class labels of the scene classes to which the images belong. By adopting sparse coding, the dimensionality of the image can be reduced while its main information is retained, and the representation is strongly robust to noise and occlusion.

Description

Scene recognition method based on sparse coding feature extraction
Technical Field
The invention relates to the technical field of image recognition, in particular to a scene recognition method based on sparse coding feature extraction.
Background
Scene recognition refers to identifying the scene in a picture from content shared by scene images, such as common color features; its aim is to automatically determine the scene to which an image belongs by mining the scene features in the image, simulating human perceptual ability. In scene recognition, the entire image is judged as a whole and no specific object is involved, because a specific object can serve only as one cue for the scene category and is not necessarily fully correlated with it. Scene recognition is a fundamental preprocessing step in computer vision and robotics, and plays an important role in computer intelligence fields such as image content retrieval, pattern recognition and machine learning.
In recent years, scene recognition research has made great progress, and many methods for modeling scene categories have emerged. Existing scene recognition methods fall into four categories according to how the scene categories are modeled:
(1) scene recognition method based on global features
Scene recognition methods based on global features mostly describe a scene through global visual features of the image, such as color, texture and shape, and have been applied successfully to outdoor scene recognition. Color features give good recognition results under changes of scene scale, viewing angle and image rotation, while texture and shape features correspond to the structural and directional information of the image, to which the human visual system is also very sensitive, so they agree well with human visual perception. However, global-feature methods generally need to traverse all pixels of the image and do not consider the spatial relationships among pixels, so their real-time performance and generality are poor.
(2) Scene recognition method based on target
Based on the principle that a specific place can be located accurately through a series of highly representative objects around it, most target-based scene recognition methods identify the scene of an image from the recognition results for the objects in it. Such methods therefore go through stages of image segmentation, multi-feature combination and object recognition. When the object to be recognized is far from the viewpoint, it is likely to be hidden in background information that lacks analytical value and be discarded in the segmentation stage, so object recognition fails. In addition, to simplify a specific scene, a group of objects that can represent the scene must be selected, and choosing these reliable and stable representative objects becomes another bottleneck of target-based scene recognition.
(3) Region-based scene recognition method
In view of the limitations of target-based scene recognition, some researchers use segmented regions instead of representative objects and combine features according to the structural relationships of the regions to form a scene signature. The key to such methods is obtaining a reliable region segmentation algorithm. The region information can be characterized in many ways, for example: by combining local and global cues, i.e., extracting global statistical features within each region; by extracting local invariant features within the regions; or according to a bag-of-words model.
(4) Scene recognition method based on bionic features
Given the need for real-time, efficient scene recognition, a considerable gap remains between the best current computer vision systems and the visual systems of humans and other animals. Inspired by the superior scene recognition ability of humans and animals, scene recognition methods based on bionic features have emerged, realizing scene recognition by simulating the processing mechanisms of the biological visual cortex. The basic idea is to study a particular biological visual mechanism or class of biological visual characteristics and, through careful analysis, establish an effective computational model that yields satisfactory results. For example, methods based on the human visual attention selection mechanism treat image regions that readily attract human attention as priority processing objects; this selectivity can greatly improve the efficiency with which visual information is processed, analyzed and recognized.
Scene recognition also faces various inherent difficulties: scenes change dynamically, pictures of the same scene vary widely, images of different classes may share many similarities, images of different scenes may overlap, and classification performance depends on the accuracy of the class labels of the training images. All of these lower the accuracy of scene classification and recognition.
Disclosure of Invention
Therefore, a scene recognition method based on sparse coding feature extraction needs to be provided to solve the problem of low accuracy in scene classification and recognition.
To achieve the above object, the inventor provides a scene recognition method based on sparse coding feature extraction; the specific technical scheme is as follows:
A scene recognition method based on sparse coding feature extraction comprises the following steps: preprocessing a sample image set acquired in advance for training; extracting the feature expression vectors of the preprocessed sample image set; feeding the feature expression vectors of the sample image set, together with their corresponding class labels, into a linear classifier, learning the parameters of the linear classifier to obtain its optimal parameters, and constructing a linear scene classifier from those optimal parameters; preprocessing a sample image set to be identified; extracting the feature expression vectors of the preprocessed sample image set to be identified; and feeding those feature expression vectors into the linear scene classifier for identification, obtaining the class labels of the scene classes to which the sample images belong.
Further, the preprocessing operation includes: image contrast normalization and Gamma correction.
Further, the extracting the feature expression vector of the preprocessed sample image set includes: extracting the bottom-layer features of the preprocessed sample image set with a multi-scale SIFT feature fusion method, namely taking neighborhoods of several scales around each pixel point and extracting SIFT keypoints of the image within each neighborhood, solving the sparse expression of the SIFT keypoints, and forming the feature expression vectors of the preprocessed sample image set with a spatial pyramid strategy and max-pooling.
Further, the step of "solving the sparse expression of the SIFT keypoints" includes: solving the sparse expression of the SIFT keypoints with locality-constrained linear coding.
Further, the step "forming a feature expression vector of the sample image set after the preprocessing operation by using the spatial pyramid strategy and max-posing" includes: dividing the image into local areas of 1 × 1, 1 × 4 and 4 × 1, forming feature expressions of the local areas by adopting histograms of max-pooling statistical coding features in the local areas, and connecting the feature expressions of all the areas to form a feature expression vector of the sample image set after the preprocessing operation.
Further, the step of "learning parameters of the linear classifier to obtain optimal parameters of the linear classifier" includes: and calculating by adopting a least square method to obtain weight parameters of the linear classifier, and obtaining optimal parameters of the linear classifier by adopting a cross verification method.
Further, the "image contrast normalization" includes the steps of: converting the image from the RGB color space to the YUV color space, and carrying out global and local contrast normalization processing on the YUV color space; the global normalization is to normalize the pixel value of the image to be near the mean value of the pixel of the image, and the local normalization is to strengthen the edge.
Further, the extracting the feature expression vector of the preprocessed sample image set to be identified includes: extracting the bottom-layer features of the preprocessed sample image set to be identified with the same multi-scale SIFT feature fusion method, namely taking neighborhoods of several scales around each pixel point and extracting SIFT keypoints within each neighborhood, solving the sparse expression of the SIFT keypoints, and forming the feature expression vectors of the preprocessed image set with the spatial pyramid strategy and max-pooling.
The invention has the beneficial effects that:
1. The method performs scene recognition based on global features: the whole scene image is judged as a whole, without involving any specific target. When extracting the bottom-layer features of the sample image set, multi-scale SIFT feature fusion increases the number of SIFT keypoints and also captures more local detail of the image.
2. Sparse coding reduces the dimensionality of the image while retaining its main information, and is strongly robust to noise and occlusion. Combining the bottom-layer sparse coding feature expression with max-pooling reduces the complexity of the upper-layer classifier model and speeds up classifier training. Moreover, sparse coding is a nonlinear feature mapping, and adopting it can effectively improve subsequent classification performance.
3. The preprocessing operation combines contrast normalization with Gamma correction, which significantly reduces the influence of local shadows and illumination changes in the image.
4. The sparse coding adopts locality-constrained linear coding, which yields the sparse expression of a signal analytically, i.e., in closed form, without iterative solving, improving the efficiency of sparse coding.
5. Using a linear classifier as the scene classifier reduces model complexity, speeds up classifier training, and reduces the possibility of overfitting.
Drawings
Fig. 1 is a flowchart of a scene recognition method based on sparse coding feature extraction according to an embodiment;
FIG. 2 is a schematic diagram illustrating the extraction of feature expression vectors of the sample image set after the preprocessing operation according to one embodiment;
FIG. 3 is a diagram illustrating solving sparse expressions of SIFT key points using sparse coding techniques according to an embodiment;
FIG. 4 is a diagram illustrating a process of a sparse representation calculation method according to an embodiment;
fig. 5 is a schematic diagram of dividing an image into a plurality of local regions by using a spatial pyramid strategy according to an embodiment.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
First, some explanations will be made on terms related to this embodiment:
SIFT: scale-invariant feature transform (SIFT), is a description used in the field of image processing. The description has scale invariance, can detect key points in the image and is a local feature descriptor.
Sparse Coding: an artificial neural network method simulating the simple-cell receptive fields of area V1, the primary visual cortex of the mammalian visual system. It possesses spatial locality, orientation selectivity and frequency-domain band-pass properties, and is an adaptive method for image statistics.
Referring to fig. 1, in this embodiment the sample image set for training satisfies at least the following conditions: 1. the training sample images of each scene class should contain as many different modalities as possible; 2. the training sample image sets of different scene classes should be kept as balanced as possible. The purpose is to learn the parameters of the linear scene classifier better, which improves scene classification accuracy.
Step S101: perform a preprocessing operation on the pre-acquired sample image set for training. The following may be used: the preprocessing operation comprises image contrast normalization and Gamma correction. "Image contrast normalization" comprises converting the image from the RGB color space to the YUV color space and carrying out global and local contrast normalization in the YUV color space; global normalization normalizes the pixel values to lie near the image's mean pixel value, and local normalization strengthens edges. Combining contrast normalization with Gamma correction markedly reduces the influence of local shadows and illumination changes in the image.
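A minimal sketch of how this preprocessing could look, assuming OpenCV and NumPy; the gamma value, local window size, epsilon and the order of the two normalizations are illustrative choices, not values specified by this embodiment:

```python
import cv2
import numpy as np

def preprocess(img_bgr, gamma=0.5, local_ksize=9, eps=1e-6):
    # Convert to the YUV colour space (OpenCV loads images as BGR).
    yuv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
    y = yuv[:, :, 0]

    # Global contrast normalization: centre pixel values around the image mean.
    y = (y - y.mean()) / (y.std() + eps)

    # Local contrast normalization: subtract a local mean and divide by a
    # local standard deviation, which strengthens edges.
    mu = cv2.blur(y, (local_ksize, local_ksize))
    sigma = np.sqrt(cv2.blur((y - mu) ** 2, (local_ksize, local_ksize)))
    y = (y - mu) / (sigma + eps)

    # Rescale to [0, 1] and apply Gamma correction to suppress local
    # shadows and illumination changes.
    y = (y - y.min()) / (y.max() - y.min() + eps)
    return np.power(y, gamma)
```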
Step S102: extract the feature expression vectors of the preprocessed sample image set. The following may be used: extract the bottom-layer features of the preprocessed sample image set with a multi-scale SIFT feature fusion method, with scale factors (4, 6, 8, 9, 10); that is, neighborhoods of several scales, such as 4 × 4 and 6 × 6, are taken around each pixel point of the sample image, and SIFT keypoints are extracted within each neighborhood. Using multi-scale SIFT feature fusion when extracting the bottom-layer features increases the number of SIFT keypoints, captures more image information, and adds local detail.
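A sketch of this multi-scale extraction under those scale factors, assuming OpenCV's SIFT implementation (cv2.SIFT_create, available in OpenCV 4.4 and later); the dense sampling stride is an illustrative choice:

```python
import cv2
import numpy as np

def dense_multiscale_sift(gray_u8, scales=(4, 6, 8, 9, 10), stride=4):
    sift = cv2.SIFT_create()
    h, w = gray_u8.shape
    descriptors = []
    for s in scales:
        # Place a keypoint of diameter s at every sampled grid position, so
        # each position is described over several neighbourhood sizes.
        kps = [cv2.KeyPoint(float(x), float(y), float(s))
               for y in range(s, h - s, stride)
               for x in range(s, w - s, stride)]
        _, desc = sift.compute(gray_u8, kps)
        descriptors.append(desc)
    # "Fusion" here simply pools the descriptors of all scales together,
    # increasing the number of keypoints and the local detail captured.
    return np.vstack(descriptors)
```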
Further, after the SIFT keypoints of the different regions are obtained, their sparse expression is solved; in this embodiment, locality-constrained linear coding is adopted. This coding yields the sparse expression of a signal analytically, i.e., in closed form, without iterative solving, which improves the efficiency of sparse coding. Constructing the feature expression of the image with sparse coding effectively reduces the complexity of the image while retaining its main information to the greatest extent, and is strongly robust to noise and occlusion. Sparse coding is also a nonlinear feature mapping, and adopting it can effectively improve subsequent classification performance.
After the sparse expression of the SIFT keypoints is solved, i.e., after sparse coding of the keypoints is complete, the image is divided into 1 × 1, 1 × 4 and 4 × 1 local regions; within each local region, max-pooling over the keypoint codes produces the histogram that forms the feature expression of that region, and the feature expressions of all regions are concatenated to form the feature expression vector of the preprocessed sample image set.
Step S103: feed the feature expression vectors of the sample image set, together with their corresponding class labels, into a linear classifier; learn the parameters of the linear classifier to obtain its optimal parameters, and construct the linear scene classifier from those optimal parameters. The following may be used: compute the weight parameters of the linear classifier by least squares, obtain the optimal parameters by cross-validation, and construct the linear scene classifier accordingly. Using a linear classifier as the scene classifier reduces model complexity, speeds up classifier training, and reduces the probability of overfitting.
Step S104: perform the preprocessing operation on the sample image set to be identified. The following may be used: the preprocessing operation comprises image contrast normalization and Gamma correction, performed exactly as in step S101: convert the image from the RGB color space to the YUV color space and carry out global and local contrast normalization; global normalization normalizes the pixel values to lie near the image's mean pixel value, and local normalization strengthens edges. Combining contrast normalization with Gamma correction markedly reduces the influence of local shadows and illumination changes.
Step S105: extract the feature expression vectors of the preprocessed sample image set to be identified. The following may be used: extract the bottom-layer features of the preprocessed image set with multi-scale SIFT feature fusion, with scale factors (4, 6, 8, 9, 10); that is, neighborhoods of several scales, such as 4 × 4 and 6 × 6, are taken around each pixel point, and SIFT keypoints are extracted within each neighborhood. This increases the number of SIFT keypoints, captures more image information, and adds local detail.
Further, after the SIFT keypoints of the different regions are obtained, their sparse expression is solved with locality-constrained linear coding, exactly as in step S102. After sparse coding of the keypoints is complete, the image is divided into 1 × 1, 1 × 4 and 4 × 1 local regions; max-pooling over the keypoint codes within each local region forms the feature expression of that region, and the feature expressions of all regions are concatenated to form the feature expression vector of the preprocessed sample image set to be identified.
Step S106: feed the feature expression vectors of the preprocessed sample image set to be identified into the linear scene classifier for identification, and obtain the class labels of the scene classes to which the images belong. That is: the feature expression vector of each sample image to be identified is input into the trained linear scene classifier model, and its class is determined from the classifier outputs as the class whose output has the highest value.
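A minimal sketch of this decision rule, assuming a trained weight matrix W and a class-name list produced by the training stage (both names are hypothetical):

```python
import numpy as np

def predict_scene(W, z, class_names):
    # One linear output per scene class; the predicted scene is the
    # class whose output is largest.
    scores = W @ z
    return class_names[int(np.argmax(scores))]
```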
The invention performs scene recognition based on global features: the whole scene image is judged as a whole, without involving any specific target. Multi-scale SIFT feature fusion during bottom-layer feature extraction increases the number of SIFT keypoints and the local detail captured. Sparse coding reduces the dimensionality of the image while retaining its main information, and is strongly robust to noise and occlusion. Combining the bottom-layer sparse coding feature expression with max-pooling reduces the complexity of the upper-layer classifier model and speeds up classifier training; and since sparse coding is a nonlinear feature mapping, it can effectively improve subsequent classification performance. The locality-constrained linear coding yields the sparse expression analytically, i.e., in closed form, without iterative solving, improving the efficiency of sparse coding. Finally, using a linear classifier as the scene classifier reduces model complexity, speeds up training, and reduces the probability of overfitting.
Referring to fig. 2 to 5, steps S102 and S105 are implemented as follows:
The sparse expression of the SIFT keypoints is solved by sparse coding. Let $x \in \mathbb{R}^n$ be the input signal (i.e., a SIFT keypoint descriptor) and $B = [b_1, b_2, \ldots, b_m] \in \mathbb{R}^{n \times m}$ be the dictionary. Sparse coding solves the following L1-norm problem:

$$\min_{c} \; \|x - Bc\|_2^2 + \lambda \|c\|_1,$$

thereby obtaining the sparse expression $c \in \mathbb{R}^m$ of the input signal.
Further, the calculation of the sparse representation is as follows. With the locality constraint, each input signal is projected onto its local coordinate system. For an input vector $x = [x_1, x_2, \ldots, x_n]^T$, the $K$ nearest neighbor vectors of $x$ are found in a local range and used to reconstruct $x$; weighting each dictionary atom achieves the selection of the $K$ nearest neighbors, giving the objective function

$$\min_{c} \; \|x - Bc\|_2^2 + \lambda \|d \odot c\|_2^2, \quad \text{s.t. } \mathbf{1}^T c = 1,$$

where $\lambda$ is the regularization coefficient, $\odot$ denotes element-wise multiplication, $d$ is the weight of each dictionary atom, and $\mathbf{1}$ is the all-ones vector. Let

$$d = \exp\!\left(\frac{\operatorname{dist}(x, B)}{\sigma}\right),$$

where $\operatorname{dist}(x, B) = [\operatorname{dist}(x, b_1), \ldots, \operatorname{dist}(x, b_m)]^T$, $\operatorname{dist}(x, b_j)$ is the Euclidean distance between $x$ and atom $b_j$, $j = 1, 2, \ldots, m$, and $\sigma$ controls the decay speed of the weights. Solving the objective function analytically yields the code

$$\tilde{c} = \left(C + \lambda \operatorname{diag}(d)\right)^{-1} \mathbf{1},$$

where $C = (B - x\mathbf{1}^T)^T (B - x\mathbf{1}^T)$ is the covariance matrix of the data; normalizing, $c = \tilde{c} / (\mathbf{1}^T \tilde{c})$, gives the final code $c$.
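A sketch of this coding step in NumPy, following the closed form above (as in Wang et al.'s locality-constrained linear coding algorithm); sigma, lambda and the rescaling of the adaptor are illustrative choices:

```python
import numpy as np

def llc_code(x, B, sigma=1.0, lam=1e-4):
    # x: (n,) input descriptor; B: (n, m) dictionary with atoms as columns.
    m = B.shape[1]
    # Locality adaptor: larger weight for atoms farther from x,
    # rescaled to (0, 1] for numerical stability.
    dist = np.linalg.norm(B - x[:, None], axis=0)
    d = np.exp((dist - dist.max()) / sigma)
    # Covariance of the dictionary shifted by the input signal.
    shifted = B - x[:, None]            # each column is b_j - x
    C = shifted.T @ shifted
    # Analytic solution, then enforce the constraint 1^T c = 1.
    c = np.linalg.solve(C + lam * np.diag(d), np.ones(m))
    return c / c.sum()
```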
The image is divided into several local regions with a spatial pyramid strategy, namely into 1 × 1, 1 × 4 and 4 × 1 grids (this region division aggregates all keypoints within each region, so as to obtain the local information of the image). Within each local region, max-pooling over the keypoint codes produces the histogram used as the feature expression of that region. For example, suppose region $i$ contains $N_i$ keypoints; the coding matrix of all its keypoints is

$$C = [c_1, c_2, \ldots, c_{N_i}] \in \mathbb{R}^{m \times N_i},$$

each column of which is the sparse representation of one keypoint. The feature expression of the region is $z_i \in \mathbb{R}^m$, with

$$z_{ij} = \max\{C_{j1}, C_{j2}, \ldots, C_{jN_i}\},$$

where $z_{ij}$ is the $j$-th element of $z_i$ and $C_{jk}$ is the element in row $j$, column $k$ of $C$. The feature expressions of all regions are concatenated to form the final feature expression vector of the image, $Z = [z_1, z_2, \ldots, z_9] \in \mathbb{R}^{9m}$.
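A sketch of this pyramid pooling in NumPy; the orientation assigned to the 1 × 4 and 4 × 1 strips and the row-per-keypoint layout of the code array are assumptions of this sketch:

```python
import numpy as np

def pyramid_pool(codes, xy, width, height):
    # codes: (N, m) array, one sparse code per keypoint (row layout);
    # xy: (N, 2) keypoint coordinates in pixels.
    regions = [(0.0, 0.0, 1.0, 1.0)]                                # 1 x 1
    regions += [(j / 4, 0.0, (j + 1) / 4, 1.0) for j in range(4)]   # 4 x 1
    regions += [(0.0, i / 4, 1.0, (i + 1) / 4) for i in range(4)]   # 1 x 4
    rx, ry = xy[:, 0] / width, xy[:, 1] / height
    pooled = []
    for x0, y0, x1, y1 in regions:
        inside = (rx >= x0) & (rx < x1) & (ry >= y0) & (ry < y1)
        # Max-pooling per code dimension over the keypoints in the region.
        z = (codes[inside].max(axis=0) if inside.any()
             else np.zeros(codes.shape[1]))
        pooled.append(z)
    return np.concatenate(pooled)    # final image feature in R^{9m}
```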
Further, step S103 is implemented as follows:
For a series of input training pairs $(z_i, t_i)$, $i = 1, 2, \ldots, N$ (where $t_i$ is the ground-truth label of training sample $i$), the objective function of the linear classifier is

$$\min_{W,\,\xi} \; \frac{1}{2}\|W\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2, \qquad \text{subject to: } W z_i = t_i - \xi_i, \; i = 1, \ldots, N.$$

Solving by the method of Lagrange multipliers gives the optimal model weights

$$W = T Z^T \left(Z Z^T + \frac{I}{C}\right)^{-1},$$

where $Z = [z_1, \ldots, z_N]$ and $T = [t_1, \ldots, t_N]$ stack the features and labels column-wise, $I$ is the identity matrix, and $C$ is the regularization coefficient, whose optimal value is obtained by tuning via cross-validation.
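A sketch of this training step in NumPy, using the closed form reconstructed above; the one-hot target encoding and the grid of C values are illustrative:

```python
import numpy as np

def train_linear_classifier(Z, T, c_reg):
    # Z: (d, N) feature vectors as columns; T: (K, N) one-hot class labels.
    d = Z.shape[0]
    # Closed form from the Lagrangian: W = T Z^T (Z Z^T + I / C)^{-1}.
    return T @ Z.T @ np.linalg.inv(Z @ Z.T + np.eye(d) / c_reg)

def cross_validate_c(Z, T, grid=(0.01, 0.1, 1.0, 10.0), k=5, seed=0):
    # k-fold cross-validation over the regularization constant C.
    rng = np.random.default_rng(seed)
    N = Z.shape[1]
    folds = np.array_split(rng.permutation(N), k)
    best_c, best_acc = None, -1.0
    for c_reg in grid:
        accs = []
        for fold in folds:
            train_idx = np.setdiff1d(np.arange(N), fold)
            W = train_linear_classifier(Z[:, train_idx], T[:, train_idx], c_reg)
            pred = (W @ Z[:, fold]).argmax(axis=0)
            accs.append((pred == T[:, fold].argmax(axis=0)).mean())
        if np.mean(accs) > best_acc:
            best_c, best_acc = c_reg, float(np.mean(accs))
    return best_c
```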
It should be noted that although the above embodiments have been described herein, the invention is not limited to them. Technical solutions obtained by changing or modifying the embodiments described herein based on the innovative concepts of the present invention, or by substituting equivalent structures or equivalent processes using the content of this specification and the attached drawings, whether applied directly or indirectly to other related technical fields, fall within the scope of the present invention.

Claims (6)

1. A scene recognition method based on sparse coding feature extraction is characterized by comprising the following steps:
carrying out preprocessing operation on a sample image set which is acquired in advance and used for training;
extracting a feature expression vector of the sample image set after the preprocessing operation;
adding the feature expression vector of the sample image set and the class label corresponding to the feature expression vector into a linear classifier, performing parameter learning on the linear classifier to obtain the optimal parameter of the linear classifier, and constructing a linear scene classifier according to the optimal parameter;
preprocessing a sample image set to be identified;
extracting a feature expression vector of the sample image set to be identified after the preprocessing operation;
sending the feature expression vector of the preprocessed sample image set to be identified into the linear scene classifier for identification, and obtaining the class label of the scene class to which the sample image set to be identified belongs;
the extracting the feature expression vector of the sample image set after the preprocessing operation includes: extracting bottom-layer features of the sample image set with a multi-scale SIFT feature fusion method, namely taking neighborhoods of several scales around each pixel point and extracting SIFT keypoints of the image within each neighborhood;
solving the sparse expression of the SIFT keypoints, and forming the feature expression vector of the preprocessed sample image set with a spatial pyramid strategy and max-pooling;
the calculation method of the sparse representation comprises the following steps:
with the locality constraint, each input signal is projected onto its local coordinate system; for an input vector $x = [x_1, x_2, \ldots, x_n]^T$, the $K$ nearest neighbor vectors of $x$ are found in a local range and used to reconstruct $x$, weighting each dictionary atom to achieve the selection of the $K$ nearest neighbors, which yields an objective function, and the objective function is solved analytically to obtain the code $c$;
the step of feeding the feature expression vector of the sample image set and its corresponding class label into a linear classifier and performing parameter learning on the linear classifier to obtain its optimal parameters further comprises:
for a series of input training pairs $(z_i, t_i)$, $i = 1, \ldots, N$, where $t_i$ is the ground-truth label of the training sample, the objective function of the linear classifier is

$$\min_{W,\,\xi} \; \frac{1}{2}\|W\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2, \qquad \text{subject to: } W z_i = t_i - \xi_i, \; i = 1, \ldots, N;$$

solving by the method of Lagrange multipliers gives the optimal model weights

$$W = T Z^T \left(Z Z^T + \frac{I}{C}\right)^{-1},$$

where $Z = [z_1, \ldots, z_N]$, $T = [t_1, \ldots, t_N]$, $C$ is the regularization coefficient, and the optimal parameter is obtained by tuning via cross-validation;
wherein $z_i \in \mathbb{R}^m$ is the feature expression of a local region;
the sample image set for training complies with the following conditions: the training sample image sets of the same type of scene need to contain different modalities, and the training sample image sets of different types of scenes need to be balanced.
2. The scene recognition method based on sparse coding feature extraction as claimed in claim 1,
the preprocessing operation comprises: image contrast normalization and Gamma correction.
3. The scene recognition method based on sparse coding feature extraction as claimed in claim 1,
the step of forming the feature expression vector of the preprocessed sample image set with a spatial pyramid strategy and max-pooling comprises the following steps:
dividing the image into 1 × 1, 1 × 4 and 4 × 1 local regions with a spatial pyramid strategy, forming the feature expression of each local region from the histogram of SIFT keypoint codes obtained by max-pooling within that region, and concatenating the feature expressions of all regions to form the feature expression vector of the preprocessed sample image set.
4. The scene recognition method based on sparse coding feature extraction as claimed in claim 1,
the step of learning parameters of the linear classifier to obtain the optimal parameters of the linear classifier includes:
and calculating to obtain weight parameters of the linear classifier by adopting a least square method, and obtaining optimal parameters of the linear classifier by adopting a cross verification method.
5. The scene recognition method based on sparse coding feature extraction as claimed in claim 2,
the "image contrast normalization" comprises the steps of: converting the image from the RGB color space to the YUV color space, and carrying out global and local contrast normalization processing on the YUV color space;
the global normalization is to normalize the pixel value of the image to be near the mean value of the pixel of the image, and the local normalization is to strengthen the edge.
6. The scene recognition method based on sparse coding feature extraction according to claim 1,
the extracting of the feature expression vector of the preprocessed sample image set to be identified includes: extracting bottom-layer features of the sample image set with a multi-scale SIFT feature fusion method, namely taking neighborhoods of several scales around each pixel point and extracting SIFT keypoints of the image within each neighborhood;
and solving the sparse expression of the SIFT keypoints, and forming the feature expression vector of the preprocessed sample image set with a spatial pyramid strategy and max-pooling.
CN201810435125.5A 2018-05-09 2018-05-09 Scene recognition method based on sparse coding feature extraction Active CN108596195B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810435125.5A CN108596195B (en) 2018-05-09 2018-05-09 Scene recognition method based on sparse coding feature extraction

Publications (2)

Publication Number Publication Date
CN108596195A CN108596195A (en) 2018-09-28
CN108596195B true CN108596195B (en) 2022-08-19

Family

ID=63635982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810435125.5A Active CN108596195B (en) 2018-05-09 2018-05-09 Scene recognition method based on sparse coding feature extraction

Country Status (1)

Country Link
CN (1) CN108596195B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657569A (en) * 2018-11-30 2019-04-19 贵州电网有限责任公司 More vegetation areas transmission of electricity corridor hidden danger point quick extraction method based on cloud analysis
CN109616104B (en) * 2019-01-31 2022-12-30 天津大学 Environment sound identification method based on key point coding and multi-pulse learning
CN110852206A (en) * 2019-10-28 2020-02-28 北京影谱科技股份有限公司 Scene recognition method and device combining global features and local features
CN111225231B (en) * 2020-02-25 2022-11-22 广州方硅信息技术有限公司 Virtual gift display method, device, equipment and storage medium
CN112086197B (en) * 2020-09-04 2022-05-10 厦门大学附属翔安医院 Breast nodule detection method and system based on ultrasonic medicine

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793600A (en) * 2014-01-16 2014-05-14 西安电子科技大学 Isolated component analysis and linear discriminant analysis combined cancer forecasting method
CN107451596A (en) * 2016-05-30 2017-12-08 清华大学 A kind of classified nodes method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778476B (en) * 2015-04-10 2018-02-09 电子科技大学 A kind of image classification method
CN105069481B (en) * 2015-08-19 2018-05-25 西安电子科技大学 Natural scene multiple labeling sorting technique based on spatial pyramid sparse coding
CN105678278A (en) * 2016-02-01 2016-06-15 国家电网公司 Scene recognition method based on single-hidden-layer neural network
CN106919920B (en) * 2017-03-06 2020-09-22 重庆邮电大学 Scene recognition method based on convolution characteristics and space vision bag-of-words model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793600A (en) * 2014-01-16 2014-05-14 西安电子科技大学 Isolated component analysis and linear discriminant analysis combined cancer forecasting method
CN107451596A (en) * 2016-05-30 2017-12-08 清华大学 A kind of classified nodes method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on sleep staging and sleep assessment methods based on EEG; Gao Qunxia; China Master's Theses Full-text Database (Medicine and Health Sciences), No. 12 (2015), 2015-12-15, E060-162 *

Also Published As

Publication number Publication date
CN108596195A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
Karlekar et al. SoyNet: Soybean leaf diseases classification
CN108596195B (en) Scene recognition method based on sparse coding feature extraction
CN108304873B (en) Target detection method and system based on high-resolution optical satellite remote sensing image
Paisitkriangkrai et al. Effective semantic pixel labelling with convolutional networks and conditional random fields
CN105930815B (en) Underwater organism detection method and system
Abbass et al. A survey on online learning for visual tracking
CN111052126A (en) Pedestrian attribute identification and positioning method and convolutional neural network system
CN110633708A (en) Deep network significance detection method based on global model and local optimization
Khan et al. Image scene geometry recognition using low-level features fusion at multi-layer deep CNN
CN112733614B (en) Pest image detection method with similar size enhanced identification
CN108230330B (en) Method for quickly segmenting highway pavement and positioning camera
CN106874862B (en) Crowd counting method based on sub-model technology and semi-supervised learning
Feng et al. A color image segmentation method based on region salient color and fuzzy c-means algorithm
CN115170805A (en) Image segmentation method combining super-pixel and multi-scale hierarchical feature recognition
CN112464983A (en) Small sample learning method for apple tree leaf disease image classification
Sabri et al. Nutrient deficiency detection in maize (Zea mays L.) leaves using image processing
Rodrigues et al. Evaluating cluster detection algorithms and feature extraction techniques in automatic classification of fish species
Zhang et al. Contour detection via stacking random forest learning
Lone et al. Object detection in hyperspectral images
Zhang et al. Spatial contextual superpixel model for natural roadside vegetation classification
CN112784722B (en) Behavior identification method based on YOLOv3 and bag-of-words model
Nga et al. Combining binary particle swarm optimization with support vector machine for enhancing rice varieties classification accuracy
CN107423771B (en) Two-time-phase remote sensing image change detection method
Padmanabhuni et al. An extensive study on classification based plant disease detection system
Poostchi et al. Feature selection for appearance-based vehicle tracking in geospatial video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant