CN112001399B - Image scene classification method and device based on local feature saliency - Google Patents

Image scene classification method and device based on local feature saliency

Info

Publication number
CN112001399B
CN112001399B (application number CN202010928765.7A / CN202010928765A)
Authority
CN
China
Prior art keywords
scene
local
features
feature
local feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010928765.7A
Other languages
Chinese (zh)
Other versions
CN112001399A (en)
Inventor
谢毓湘
张家辉
宫铨志
栾悉道
魏迎梅
康来
蒋杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010928765.7A priority Critical patent/CN112001399B/en
Publication of CN112001399A publication Critical patent/CN112001399A/en
Application granted granted Critical
Publication of CN112001399B publication Critical patent/CN112001399B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The application relates to an image scene classification method and device based on local feature saliency. The method comprises the following steps: the image scene data to be classified are divided to obtain image scene data blocks; scene local features and object local features are extracted from the data blocks through a preset scene feature extraction model and an object feature extraction model, and scene global features and object global features are extracted from the whole image through the same two models; enhanced scene local features and enhanced object local features are obtained by setting weights for the scene local features and the object local features; the enhanced scene local features, the enhanced object local features, the scene global features and the object global features are fused to obtain a fused feature of the image scene data; and the fused feature is input into a pre-trained classification model to obtain the scene classification of the image scene data. The method reduces the amount of computation and avoids structural redundancy.

Description

Image scene classification method and device based on local feature saliency
Technical Field
The application relates to the technical field of scene classification, in particular to an image scene classification method and device based on local feature saliency.
Background
With the development of Internet multimedia technology and the rapid growth of visual data, processing such massive data has become a major challenge. Scene classification, a key technology underlying image retrieval and image recognition, has become an important and challenging research topic in the field of computer vision. Scene classification is also widely applied in remote sensing image analysis, video surveillance, robot perception and other fields. Research on scene classification technology and improvement of a computer's scene recognition capability are therefore of great significance.
Image scene classification means that, for a given image, the scene to which it belongs is determined by recognizing the information and content the image contains, thereby achieving classification. In recent years deep learning has developed rapidly, gradually replacing traditional hand-crafted image features and bringing new progress to scene classification. However, deep learning requires a large number of training samples, which small-scale scene datasets cannot provide; in practical applications it cannot be guaranteed that every scene category offers a sufficient number of training images. Transfer learning provides ideas and solutions to this problem. Transfer learning is a machine learning method in which a pre-trained model is reused for another task: deep networks pre-trained on large-scale datasets are selected and selectively fine-tuned on the target dataset to fit the current task, and this approach is widely used in many deep learning problems. Networks pre-trained on different datasets often have very different parameter structures, and the features they extract on the task dataset reflect different properties of the data. Scene images are rich in content and complex in concept, and features extracted by only one type of pre-trained network are not sufficient to describe them, so a common practice is to fuse the features extracted by different networks into a more discriminative scene representation. Although features extracted by different pre-trained models reflect different aspects of a scene, they describe the scene with different accuracy; how to combine these features and extract their effective parts for fusion remains a difficult problem with no general solution at present. On the other hand, a convolutional neural network understands scene images differently at different scales, and features that cannot be extracted at one scale may be obtained at another, so combining scene image information at multiple scales can effectively enhance the scene description. However, the features extracted from multi-scale images do not always complement each other to yield a more accurate scene representation. For example, more detailed information can be extracted from a small-scale image, but the noise in the image is also amplified, so reasonably filtering and screening the features becomes a problem. At present, multi-scale images are usually obtained by densely sampling the original image: taking a 256×256-pixel image as an example, local images of different sizes can be sampled from it by setting the size of the new images and the sampling step. Dense sampling produces a relatively large number of local features, which usually have to be encoded with methods such as the Bag-of-Visual-Words (BoVW) model and finally aggregated into a new scene image description.
The multi-scale scene image description obtained by the method has the defects of large calculation amount, structural redundancy and the like.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image scene classification method and apparatus based on local feature saliency, which can solve the problems of large calculation amount and structural redundancy in multi-scale scene image description.
An image scene classification method based on local feature saliency, the method comprising:
dividing the image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting weights of each scene local feature and each object local feature;
fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
In one embodiment, the method further comprises: calculating the average value of the scene local features corresponding to all the image scene data blocks to determine a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain an initial scene local feature weight for each scene distance vector; adjusting the initial scene local feature weights according to a preset first hyper-parameter to obtain the scene local feature weights; and weighting the scene local features according to the scene local feature weights to obtain the enhanced scene local features.
In one embodiment, the method further comprises: taking the difference between the object global feature and each object local feature and taking the absolute value to obtain the local feature distance vector corresponding to each object local feature; normalizing the local feature distance vectors to obtain an initial object local feature weight for each object local feature; adjusting the initial object local feature weights according to a preset second hyper-parameter to obtain the object local feature weights; and weighting the object local features according to the object local feature weights to obtain the enhanced object local features.
In one embodiment, the method further comprises: and fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object in a splicing mode to obtain fused features of the image scene data.
In one embodiment, the method further comprises: inputting the fusion features into a pre-trained linear support vector machine to obtain scene classification of the image scene data.
In one embodiment, the method further comprises: normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of local images sampled from a scene image, and $n$ represents the number of images in the category (used when computing the feature center).
In one embodiment, the method further comprises: normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature, wherein the initial object local feature weight is as follows:
$$\hat{w}_{j}^{O}=\frac{1}{l-1}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{l}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$

wherein $\hat{w}_{j}^{O}$ represents the initial object local feature weight, $f_{j}^{O}$ represents an object local feature, $f_{G}^{O}$ represents the object global feature, and $\left|f_{j}^{O}-f_{G}^{O}\right|$ is the corresponding local feature distance vector.
An image scene classification device based on local feature saliency, the device comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the feature extraction module is used for respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
the saliency module is used for respectively obtaining the enhanced scene local features and the enhanced object local features by setting weights for each scene local feature and each object local feature;
the fusion module is used for fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fusion features of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
dividing the image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting weights of each scene local feature and each object local feature;
fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
dividing the image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting weights of each scene local feature and each object local feature;
fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
With the image scene classification method, device, computer equipment and storage medium based on local feature saliency, the scene local features and object local features, as well as the scene global features and object global features, are extracted through the preset scene feature extraction model and object feature extraction model. Different weights are then set for the scene local features and the object local features, which improves the pertinence of the features; the fused feature is obtained through feature fusion, and the scene corresponding to the fused feature can be classified through the classification model. Because the local features are made salient through the weights, the amount of computation is reduced and the problem of structural redundancy is solved.
Drawings
FIG. 1 is a flow diagram of an image scene classification method based on local feature saliency in one embodiment;
FIG. 2 is a schematic diagram of the scene image elements attended to by Places-CNN and ImageNet-CNN in one embodiment;
FIG. 3 is a frame diagram of a model in one embodiment;
FIG. 4 is a schematic diagram of feature extraction in one embodiment;
FIG. 5 shows class activation maps of Places-CNN and ImageNet-CNN for different scenes in one embodiment;
FIG. 6 is a schematic diagram of feature fusion in one embodiment;
FIG. 7 is a block diagram of an image scene classification device based on local feature saliency in one embodiment;
fig. 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided an image scene classification method based on local feature saliency, including the steps of:
and 102, dividing the image scene data to be classified to obtain image scene data blocks.
The image scene data may be an image acquired in a scene, and the image scene data block may be obtained by dividing the acquired image.
Specifically, the acquired original image can be resized by bilinear interpolation, for example to 224×224; the image mean of the ImageNet dataset is then subtracted and the result is divided by the standard deviation, so that the normalized data follow a common distribution and the generalization ability of the model is improved. For the multi-scale images, the image processed in the previous step is resized, for example to 448×448, and the four corners of the resized image are cropped to form 4 image scene data blocks. Each local image is 224×224 and serves as small-scale supplementary data for the original scene image. Unlike dense sampling, this simple sampling adds only 4 small-scale images to supplement the original image, reducing repetition and redundancy in the data.
It should be noted that the above dimensions and the number of cuts are examples, and other values may be used to achieve the technical effects of the present invention.
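By way of illustration only, a minimal sketch of this preprocessing (assuming Python with PIL and torchvision, the example sizes of 224×224 and 448×448, and the standard ImageNet normalization constants; the helper name preprocess is hypothetical and not part of the claimed method) could look like:

from PIL import Image
from torchvision import transforms

# Standard ImageNet statistics (assumed) for mean subtraction and std-deviation normalization
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

def preprocess(image):
    """Return the normalized 224x224 global image and four 224x224 corner crops."""
    # Bilinear resize of the original image to the global scale
    global_img = image.resize((224, 224), Image.BILINEAR)
    # Resize the processed image to the larger scale and cut its four corners
    large = global_img.resize((448, 448), Image.BILINEAR)
    boxes = [(0, 0, 224, 224), (224, 0, 448, 224),
             (0, 224, 224, 448), (224, 224, 448, 448)]
    local_imgs = [large.crop(b) for b in boxes]
    return to_tensor(global_img), [to_tensor(p) for p in local_imgs]

In this sketch the normalization is applied after cropping rather than before resizing, purely for simplicity of the code.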
Step 104, extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model respectively.
The scene feature extraction model and the object feature extraction model can be CNN networks, specifically, the scene feature extraction model can be Places-CNN and the object feature extraction model can be ImageNet-CNN.
Specifically, the deep network selected by the invention is denoted DenseNet; in particular, the network can be built in the PyTorch deep learning framework using the DenseNet161 proposed by Gao Huang as the base network. During feature extraction, DenseNet is set to test mode, in which Dropout (used for regularization) scales neuron outputs by probability values instead of dropping them randomly, and the final feature vector is the output of the last convolutional layer of DenseNet.
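As a non-limiting sketch of this setup (assuming the torchvision implementation of DenseNet161; the checkpoint path for a Places365-pretrained scene network is hypothetical, and the helper names are illustrative), the feature extractor could be built as follows:

import torch
import torch.nn.functional as F
from torchvision import models

def build_extractor(weights_path=None):
    """Build a DenseNet161 feature extractor and set it to test (eval) mode."""
    net = models.densenet161(pretrained=(weights_path is None))  # ImageNet weights by default
    if weights_path is not None:
        # e.g. a Places365-pretrained checkpoint for the scene network (hypothetical path)
        net.load_state_dict(torch.load(weights_path), strict=False)
    net.eval()  # test mode: Dropout scales activations instead of randomly dropping them
    return net

@torch.no_grad()
def extract_feature(net, img_tensor):
    """Return the output of the last convolutional stage, globally average pooled."""
    fmap = net.features(img_tensor.unsqueeze(0))              # 1 x 2208 x 7 x 7 for a 224x224 input
    feat = F.adaptive_avg_pool2d(F.relu(fmap), 1).flatten(1)  # 1 x 2208
    return feat

Features for the 4 image scene data blocks and for the global image are then obtained by calling extract_feature on each of the corresponding tensors.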
And 106, respectively obtaining the enhanced scene local feature and the enhanced object local feature by setting the weight of each scene local feature and each object local feature.
In this step, the scene local features and the object local features are made salient: for the scene local features the purpose is to highlight details, and for the object local features the purpose is to preserve the main body.
And step 108, fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of the image scene data.
There are various ways of feature fusion, such as addition operation, splicing operation, etc.
Step 110, inputting the fusion features into a pre-trained classification model to obtain scene classification of the image scene data.
The classification model may be a support vector machine, logistic regression, etc., and is not limited in this regard.
In the image scene classification method based on local feature saliency, the scene local features and object local features, as well as the scene global features and object global features, are extracted through the preset scene feature extraction model and object feature extraction model. Different weights are then set for the scene local features and the object local features, which improves the pertinence of the features; the fused feature is obtained through feature fusion, and the scene corresponding to the fused feature can be classified through the classification model. Because the local features are made salient through the weights, the amount of computation is reduced and the problem of structural redundancy is solved.
In one embodiment, the average value of the scene local features corresponding to all the image scene data blocks is calculated to determine a feature center; a scene distance vector of each scene local feature is determined according to the distance from the scene local feature to the feature center; the scene distance vectors are normalized to obtain an initial scene local feature weight for each scene distance vector; the initial scene local feature weights are adjusted according to a preset first hyper-parameter to obtain the scene local feature weights; and the scene local features are weighted according to the scene local feature weights to obtain the enhanced scene local features.
Specifically, the average value of the local features of the corresponding scenes of all the image scene data blocks is calculated, and the feature center is determined as follows:
$$\bar{f}^{S}=\frac{1}{n\,l}\sum_{i=1}^{n}\sum_{j=1}^{l}f_{i,j}^{S}$$

wherein $\bar{f}^{S}$ represents the feature center, $f_{i,j}^{S}$ represents a scene local feature, $l$ represents the number of local images sampled from each scene image, and $n$ is the number of images in the category.
According to the distance from each scene local feature to the feature center, the scene distance vector of that scene local feature is determined. Specifically, the difference between the feature center and each scene local feature is taken and its absolute value is used as the scene distance vector from the scene local feature to the feature center; this vector reflects the degree of dispersion of each feature dimension.
In a specific embodiment, the scene distance vectors are normalized, and the initial scene local feature weight of each scene distance vector is obtained as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of local images sampled from a scene image, and $n$ represents the number of images in the category.
In this embodiment, since a relatively accurate global feature already exists at the original scale, what is needed at the small scale are prominent scene details, so the values far from the feature center are strengthened to supplement the scene global feature. Through this normalization, the local details of the scene can be highlighted.
In addition, the initial scene local feature weights are adjusted according to a preset first hyper-parameter $\beta$, which controls the degree of influence of the initial weights, to obtain the scene local feature weights $w_{j}^{S}$.
Finally, the enhanced scene local feature $F^{S}$ is obtained as the sum of the products of the weights and the local features:

$$F^{S}=\sum_{j=1}^{l}w_{j}^{S}\odot f_{j}^{S}$$

where $\odot$ denotes element-wise multiplication.
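A minimal sketch of this scene-feature saliency step is given below. It assumes the reconstruction above (per-dimension distances to the feature center, normalization of those distances over the l local features, and an element-wise weighted sum); the multiplicative use of the first hyper-parameter, written here as beta, is an assumed form, and for self-containment the feature center is computed from the local features of a single image rather than over the whole category as described above:

import torch

def enhance_scene_local(local_feats, beta=1.0):
    """local_feats: l x d tensor of scene local features (l = 4 in the example).

    Returns the enhanced scene local feature (1 x d), emphasizing dimensions that
    lie far from the feature center. beta stands in for the first hyper-parameter;
    its exact use in the method is an assumption of this sketch.
    """
    center = local_feats.mean(dim=0, keepdim=True)                  # feature center (per image here)
    dist = (local_feats - center).abs()                             # scene distance vectors
    init_w = dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)  # initial weights, sum to 1 per dim
    w = beta * init_w                                               # hyper-parameter adjustment (assumed)
    return (w * local_feats).sum(dim=0, keepdim=True)               # weighted sum of local features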
In one embodiment, the global feature of the object and the local feature of the object are subjected to difference, and a local feature distance vector corresponding to the local feature of the object is obtained; normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature; according to a preset second super parameter, adjusting the local feature weight of the initial object to obtain the local feature weight of the object; and weighting the local characteristics of the object according to the local characteristic weights of the object to obtain the local characteristics of the enhanced object.
Specifically, corresponding to the scene local features are the object local features extracted by ImageNet-CNN. Because the image scene data contain rich object content, the features extracted from the small-scale images in particular contain a large amount of object detail; such detail features adversely affect the scene local features that already serve as a complement to the global features, so directly using the global features extracted by ImageNet-CNN as the complementary global features of the scene is not ideal. Therefore, the extracted global feature is used as the semantic center of the object features to guide and correct the small-scale local features, reducing the detailed parts of the object features and thereby obtaining more suitable object local features.
The method comprises the following specific steps:
A local feature distance vector is obtained by taking the absolute value of the difference between the object global feature and each object local feature. Because the global feature is needed to guide the local vectors, features closer to the global feature are given higher weights, in contrast to the scene local features, which highlight details. The initial object local feature weight $\hat{w}_{j}^{O}$ is calculated as follows:

$$\hat{w}_{j}^{O}=\frac{1}{3}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{4}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$
specifically, the process of local feature salification is to weight and add 4 local features, and the sum of the weights needs to be 1, which is also the purpose of normalization. Dividing by 3 can just make the sum of the 4 weights 1.
A preset second hyper-parameter $\lambda$ is then used to control the degree of influence of the object local feature weights $w_{j}^{O}$ obtained from the initial weights.
Finally, the object local feature weights are applied to the corrected object local features to obtain the enhanced object local feature:

$$F^{O}=\sum_{j=1}^{4}w_{j}^{O}\odot f_{j}^{O}$$
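Analogously, a sketch of the object-feature saliency step (with the division by l − 1 = 3 taken from the normalization described above, and with the second hyper-parameter lambda used multiplicatively as an assumption) might be:

import torch

def enhance_object_local(local_feats, global_feat, lam=1.0):
    """local_feats: l x d object local features; global_feat: 1 x d object global feature.

    Dimensions close to the global (semantic-center) feature receive higher weight;
    the complement weights are divided by l - 1 (= 3 for four crops) so they sum to 1.
    lam stands in for the second hyper-parameter; its exact use is an assumption.
    """
    l = local_feats.shape[0]
    dist = (local_feats - global_feat).abs()                           # local feature distance vectors
    norm_dist = dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)  # normalized distances
    init_w = (1.0 - norm_dist) / (l - 1)                               # initial weights, sum to 1 per dim
    w = lam * init_w                                                   # hyper-parameter adjustment (assumed)
    return (w * local_feats).sum(dim=0, keepdim=True)                  # weighted sum of corrected features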
in one embodiment, a stitching mode is adopted to fuse the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object, so as to obtain the fused features of the image scene data.
In this embodiment, considering that the scene local features and the object local features are two distinct kinds of features, adding them is not semantically meaningful, so the splicing (concatenation) manner is selected.
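A sketch of this splicing strategy, assuming four 1×2208 feature vectors as in the example above, is simply:

import torch

def fuse_features(scene_local, object_local, scene_global, object_global):
    """Concatenate the four 1 x 2208 feature vectors into a single 1 x 8832 fused feature."""
    return torch.cat([scene_local, object_local, scene_global, object_global], dim=1)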
In one embodiment, the fusion features are input into a pre-trained linear support vector machine to obtain scene classification of the image scene data.
In this embodiment, the linear support vector machine can maximize the inter-class margin and reduce overfitting while ensuring a certain training accuracy.
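For the classification stage, one plausible realization of the pre-trained linear support vector machine is scikit-learn's LinearSVC; the hyper-parameter values below are illustrative only:

import numpy as np
from sklearn.svm import LinearSVC

def train_and_predict(train_feats, train_labels, test_feats):
    """train_feats / test_feats: arrays of fused features (n_samples x 8832)."""
    clf = LinearSVC(C=1.0, max_iter=10000)  # C is an illustrative value
    clf.fit(np.asarray(train_feats), np.asarray(train_labels))
    return clf.predict(np.asarray(test_feats))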
In summary, the beneficial effects achieved by the invention are as follows:
1. Compared with the common approach of dense sampling and semantic codebook construction, the amount of computation is remarkably reduced.
2. A feature saliency optimization scheme tailored to the characteristics of features from different deep networks is used. Combining the optimization with the properties of the features effectively improves the scene description capability of the fused features, makes the optimized features more fully complementary, and improves scene classification accuracy.
The advantageous effects of the present invention will be described in the following with reference to a specific example.
FIG. 2 illustrates the different elements of a scene image that Places-CNN and ImageNet-CNN attend to in the model. The two networks focus on different things: scene images are rich in content and elements, the image features extracted by Places-CNN tend to be more holistic and spatial, while the features extracted by ImageNet-CNN pay more attention to details, especially the details of individual objects.
FIG. 3 is a general frame diagram of the present invention, which includes the following three steps:
and a first step, extracting features. And performing feature extraction on the scene image on two scales by using the constructed Places-CNN and the image Net-CNN.
And secondly, the characteristic is highlighted. The method comprises the steps of optimizing the extracted features under different types and different scales, and specifically comprises two parts, namely a scene local feature highlighting detail and an object local feature preserving main body.
And thirdly, feature fusion and classification. And performing dimensional splicing on the optimized features, and then completing classification by using a linear support vector machine.
Fig. 4 depicts the feature extraction process of the first step. In the feature extraction stage, the input image is propagated forward and the output of the last Dense Block is used as the feature extracted by the two types of convolutional neural networks. When the input image size is 224×224, the feature obtained after global average pooling has dimension 1×2208. The local features extracted with one network have dimension 4×2208 and the global feature has dimension 1×2208.
Fig. 5 shows class activation maps of the two types of networks for different scenes. Facing the same classification task, different types of depth features have obviously different activation areas and classification effects. FIG. 5 shows visualizations of some scene images activated by Places-CNN and ImageNet-CNN on the MIT Indoor 67 dataset; class activation mapping is used to visualize the key visual attention areas of different CNNs (brighter regions of the image indicate stronger discriminability), reflecting the different properties of scene features and object features. As can be seen from the figure, the activation area and color brightness of Places-CNN are obviously higher than those of ImageNet-CNN, which also explains why Places-CNN performs better than ImageNet-CNN on scene classification tasks. Unlike Places-CNN, which focuses more on scene features, ImageNet-CNN places its visual emphasis on some objects in the scene, such as the toilet and cupboard in a bathroom, or the tables and chairs in a fast food restaurant.
Fig. 6 shows the two different feature fusion strategies of the third step (feature fusion and classification): one is concatenation along the feature dimension, as shown in the left part of Fig. 6, and the other is element-wise addition, as shown in the right part of Fig. 6. Considering that the scene local features and the object local features are two distinct kinds of features, adding them is not semantically meaningful, so the fusion strategy selected by the present invention is the first one.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are also not necessarily performed in sequence, and may be performed in turn or alternately with at least a portion of the other steps or of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided an image scene classification apparatus based on local feature saliency, including: a segmentation module 702, a feature extraction module 704, a saliency module 706, a fusion module 708, and a classification module 710, wherein:
the segmentation module 702 is configured to segment image scene data to be classified to obtain image scene data blocks;
a feature extraction module 704, configured to extract a scene local feature and an object local feature in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extract a scene global feature and an object global feature in the image scene data through the scene feature extraction model and the object feature extraction model, respectively;
a saliency module 706, configured to obtain enhanced scene local features and enhanced object local features by setting weights for each scene local feature and each object local feature, respectively;
a fusion module 708, configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature to obtain a fused feature of the image scene data;
and the classification module 710 is configured to input the fusion feature into a pre-trained classification model to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to calculate the mean value of the scene local features corresponding to all the image scene data blocks and determine a feature center; determine a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalize the scene distance vectors to obtain an initial scene local feature weight for each scene distance vector; adjust the initial scene local feature weights according to a preset first hyper-parameter to obtain the scene local feature weights; and weight the scene local features according to the scene local feature weights to obtain the enhanced scene local features.
In one embodiment, the saliency module 706 is further configured to take the difference between the object global feature and each object local feature and take the absolute value to obtain the local feature distance vector corresponding to each object local feature; normalize the local feature distance vectors to obtain an initial object local feature weight for each object local feature; adjust the initial object local feature weights according to a preset second hyper-parameter to obtain the object local feature weights; and weight the object local features according to the object local feature weights to obtain the enhanced object local features.
In one embodiment, the fusion module 708 is further configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature in a stitching manner, so as to obtain a fused feature of the image scene data.
In one embodiment, the classification module 710 is further configured to input the fusion feature into a pre-trained linear support vector machine to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to normalize the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of samples in the scene local features, and $n$ represents the number of images in the category.
In one embodiment, the saliency module 706 is further configured to normalize the local feature distance vectors to obtain the initial object local feature weight corresponding to each object local feature as follows:

$$\hat{w}_{j}^{O}=\frac{1}{l-1}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{l}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$

wherein $\hat{w}_{j}^{O}$ represents the initial object local feature weight, $f_{j}^{O}$ represents an object local feature, $f_{G}^{O}$ represents the object global feature, and $\left|f_{j}^{O}-f_{G}^{O}\right|$ is the local feature distance vector.
For specific limitations of the image scene classification device based on local feature saliency, reference may be made to the above limitation of the image scene classification method based on local feature saliency, and details thereof are not repeated here. The above-described image scene classification apparatus based on local feature saliency may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for classifying image scenes based on local feature saliency. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method of the above embodiments when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of the above embodiments.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (6)

1. An image scene classification method based on local feature saliency, the method comprising:
dividing the image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and respectively extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting weights of each scene local feature and each object local feature;
the method for obtaining the enhanced scene local features by setting the weight of each scene local feature comprises the following steps: calculating the average value of the local scene features corresponding to all the image scene data blocks, and determining a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector; according to a preset first super parameter, adjusting the local feature weight of the initial scene to obtain the local feature weight of the scene; weighting the local scene features according to the local scene feature weights to obtain enhanced local scene features; normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector, wherein the initial scene local feature weights are as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of sampled pictures in the scene local features, and $n$ represents the number of images in the category;
the method for obtaining the enhanced object local feature by setting the weight of each object local feature comprises the following steps: the global feature of the object and the local feature of the object are subjected to difference, and the absolute value is taken to obtain a local feature distance vector corresponding to the local feature of the object; normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature; according to a preset second super parameter, adjusting the local feature weight of the initial object to obtain the local feature weight of the object; weighting the local object features according to the local object feature weights to obtain enhanced local object features; the local feature distance vector is normalized, and initial object local feature weights corresponding to the local features of each object are obtained as follows:
$$\hat{w}_{j}^{O}=\frac{1}{l-1}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{l}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$

wherein $\hat{w}_{j}^{O}$ represents the initial object local feature weight, $f_{j}^{O}$ represents an object local feature, and $f_{G}^{O}$ represents the object global feature;
fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
2. The method of claim 1, wherein fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature to obtain a fused feature of image scene data, comprises:
and fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object in a splicing mode to obtain fused features of the image scene data.
3. The method according to any one of claims 1 to 2, wherein inputting the fusion features into a pre-trained classification model results in a scene classification of the image scene data, comprising:
inputting the fusion features into a pre-trained linear support vector machine to obtain scene classification of the image scene data.
4. An image scene classification device based on local feature saliency, the device comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the feature extraction module is used for respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and respectively extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
the saliency module is used for respectively obtaining the enhanced scene local features and the enhanced object local features by setting weights for each scene local feature and each object local feature;
the method for obtaining the enhanced scene local features by setting the weight of each scene local feature comprises the following steps: calculating the average value of the local scene features corresponding to all the image scene data blocks, and determining a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector; according to a preset first super parameter, adjusting the local feature weight of the initial scene to obtain the local feature weight of the scene; weighting the local scene features according to the local scene feature weights to obtain enhanced local scene features; normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector, wherein the initial scene local feature weights are as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of sampled pictures in the scene local features, and $n$ represents the number of images in the category;
the method for obtaining the enhanced object local feature by setting the weight of each object local feature comprises the following steps: the global feature of the object and the local feature of the object are subjected to difference, and the absolute value is taken to obtain a local feature distance vector corresponding to the local feature of the object; normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature; according to a preset second super parameter, adjusting the local feature weight of the initial object to obtain the local feature weight of the object; weighting the local object features according to the local object feature weights to obtain enhanced local object features; the local feature distance vector is normalized, and initial object local feature weights corresponding to the local features of each object are obtained as follows:
$$\hat{w}_{j}^{O}=\frac{1}{l-1}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{l}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$

wherein $\hat{w}_{j}^{O}$ represents the initial object local feature weight, $f_{j}^{O}$ represents an object local feature, and $f_{G}^{O}$ represents the object global feature;
the fusion module is used for fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fusion features of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
5. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 3 when the computer program is executed.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 3.
CN202010928765.7A 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency Active CN112001399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010928765.7A CN112001399B (en) 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010928765.7A CN112001399B (en) 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency

Publications (2)

Publication Number Publication Date
CN112001399A CN112001399A (en) 2020-11-27
CN112001399B (en) 2023-06-09

Family

ID=73468773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010928765.7A Active CN112001399B (en) 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency

Country Status (1)

Country Link
CN (1) CN112001399B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699855B (en) * 2021-03-23 2021-10-22 腾讯科技(深圳)有限公司 Image scene recognition method and device based on artificial intelligence and electronic equipment
CN112907138B (en) * 2021-03-26 2023-08-01 国网陕西省电力公司电力科学研究院 Power grid scene early warning classification method and system from local to whole perception
CN113128527B (en) * 2021-06-21 2021-08-24 中国人民解放军国防科技大学 Image scene classification method based on converter model and convolutional neural network
CN113657462A (en) * 2021-07-28 2021-11-16 讯飞智元信息科技有限公司 Method for training vehicle recognition model, vehicle recognition method and computing device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555446A (en) * 2019-08-19 2019-12-10 北京工业大学 Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning
CN111079674A (en) * 2019-12-22 2020-04-28 东北师范大学 Target detection method based on global and local information fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8447119B2 (en) * 2010-03-16 2013-05-21 Nec Laboratories America, Inc. Method and system for image classification

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555446A (en) * 2019-08-19 2019-12-10 北京工业大学 Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning
CN111079674A (en) * 2019-12-22 2020-04-28 东北师范大学 Target detection method based on global and local information fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Indoor scene classification algorithm based on information enhancement of visually sensitive regions; Shi Jing, Zhu Hong, Wang Jing, Xue Shan; Pattern Recognition and Artificial Intelligence; Vol. 30, No. 6; pp. 520-529 *

Also Published As

Publication number Publication date
CN112001399A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN112001399B (en) Image scene classification method and device based on local feature saliency
WO2022001623A1 (en) Image processing method and apparatus based on artificial intelligence, and device and storage medium
CN111079632A (en) Training method and device of text detection model, computer equipment and storage medium
CN111612008A (en) Image segmentation method based on convolution network
CN112862824A (en) Novel coronavirus pneumonia focus detection method, system, device and storage medium
CN108154191B (en) Document image recognition method and system
CN113538441A (en) Image segmentation model processing method, image processing method and device
CN112528845B (en) Physical circuit diagram identification method based on deep learning and application thereof
CN111768415A (en) Image instance segmentation method without quantization pooling
CN113012169A (en) Full-automatic cutout method based on non-local attention mechanism
CN112084911A (en) Human face feature point positioning method and system based on global attention
Wei et al. EGA-Net: Edge feature enhancement and global information attention network for RGB-D salient object detection
CN114782355A (en) Gastric cancer digital pathological section detection method based on improved VGG16 network
CN114639102A (en) Cell segmentation method and device based on key point and size regression
Liu et al. Attentive semantic and perceptual faces completion using self-attention generative adversarial networks
Xiong et al. Single image super-resolution via image quality assessment-guided deep learning network
CN110889858A (en) Automobile part segmentation method and device based on point regression
CN116030341A (en) Plant leaf disease detection method based on deep learning, computer equipment and storage medium
CN110659724B (en) Target detection depth convolution neural network construction method based on target scale
CN114565626A (en) Lung CT image segmentation algorithm based on PSPNet improvement
Wang et al. Image Semantic Segmentation Algorithm Based on Self-learning Super-Pixel Feature Extraction
CN113763413B (en) Training method of image segmentation model, image segmentation method and storage medium
CN117523205B (en) Segmentation and identification method for few-sample ki67 multi-category cell nuclei
Cao et al. An improved defocusing adaptive style transfer method based on a stroke pyramid
CN114897779B (en) Cervical cytology image abnormal region positioning method and device based on fusion attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant