CN112001399A - Image scene classification method and device based on local feature saliency - Google Patents

Image scene classification method and device based on local feature saliency

Info

Publication number
CN112001399A
Authority
CN
China
Prior art keywords
scene
local
feature
features
local feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010928765.7A
Other languages
Chinese (zh)
Other versions
CN112001399B (en)
Inventor
谢毓湘
张家辉
宫铨志
栾悉道
魏迎梅
康来
蒋杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010928765.7A priority Critical patent/CN112001399B/en
Publication of CN112001399A publication Critical patent/CN112001399A/en
Application granted granted Critical
Publication of CN112001399B publication Critical patent/CN112001399B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an image scene classification method and device based on local feature saliency. The method comprises the following steps: image scene data to be classified are segmented to obtain image scene data blocks; scene local features and object local features are extracted from the data blocks through a preset scene feature extraction model and an object feature extraction model, respectively, and scene global features and object global features are extracted from the image scene data through the scene feature extraction model and the object feature extraction model; enhanced scene local features and enhanced object local features are obtained by setting weights for the scene local features and the object local features; the enhanced scene local features, the enhanced object local features, the scene global features and the object global features are fused to obtain the fusion features of the image scene data; and the fusion features are input into a pre-trained classification model to obtain the scene classification of the image scene data. By adopting the method, the amount of computation can be reduced and the problem of structural redundancy can be solved.

Description

Image scene classification method and device based on local feature saliency
Technical Field
The present application relates to the field of scene classification technologies, and in particular, to a method and an apparatus for classifying image scenes based on local feature saliency.
Background
With the development of Internet multimedia technology and the growth of visual data, how to process such massive data has become a major challenge of the new era. Scene classification, as a key technology for solving problems such as image retrieval and image recognition, has become a very important and challenging research topic in the field of computer vision. Meanwhile, scene classification is widely applied in fields such as remote sensing image analysis, video surveillance and robot perception. Research on scene classification technology and improvement of the scene recognition capability of computers are therefore of great significance.
Image scene classification determines the scene to which a given image belongs by recognizing and interpreting the information and content contained in the image, thereby achieving classification. In recent years deep learning has developed rapidly, gradually replacing traditional hand-crafted image features and bringing new progress to the field of scene classification. However, deep learning requires a large number of training samples, which some small-scale scene data sets cannot provide; in practical applications it cannot be guaranteed that every scene category offers a considerable number of training images. The emergence of transfer learning provides ideas and solutions for this kind of problem. Transfer learning is a machine learning method in which a pre-trained model is reused for another task: a deep network pre-trained on a large-scale data set is selected appropriately and selectively fine-tuned on the target data set to meet the requirements of the current task. This approach is widely used in deep learning problems.
Networks pre-trained on different data sets often differ greatly in their parameters and structure, and the features they extract from the task data set reflect different properties of the data. Scene images are rich in content and complex in concept, and features extracted with only one type of pre-trained network are not sufficient to describe them; a common approach is therefore to fuse the features extracted by different networks into a more discriminative scene feature representation. However, although the features extracted by different pre-trained models reflect different aspects of a scene, they differ in how accurately they describe it, and how to extract the effective parts of these features according to their respective characteristics is a difficult problem for which no general solution currently exists.
On the other hand, a convolutional neural network understands a scene image differently at different scales, and features that cannot be extracted at one scale may be obtained at another, so combining scene image information at multiple scales can effectively enhance the description of the scene image. However, the features extracted from multi-scale images do not always complement one another to yield a more accurate scene representation; for example, more detailed information can be extracted from a small-scale image, but the noise in the image is amplified as well, so how to filter and screen these features reasonably becomes a problem. At present, multi-scale images are usually obtained by dense sampling of the original image: taking a 256 × 256 pixel image as an example, local images of different sizes can be sampled from the original image by setting the size and the sampling step of the new images. The number of local features produced by dense sampling is large, so they are usually encoded with methods such as the Bag-of-Visual-Words (BoVW) model and finally aggregated into a new scene image description. Multi-scale scene image descriptions obtained in this way suffer from drawbacks such as a large amount of computation and structural redundancy.
Disclosure of Invention
Based on the above, it is necessary to provide an image scene classification method and apparatus based on local feature saliency, which can solve the problems of large calculation amount and structural redundancy in multi-scale scene image description.
A method of image scene classification based on local feature saliency, the method comprising:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
In one embodiment, the method further comprises the following steps: calculating the mean value of the scene local features corresponding to all the image scene data blocks, and determining a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector; adjusting the initial scene local feature weight according to a preset first hyper-parameter to obtain a scene local feature weight; and weighting the scene local features according to the scene local feature weight to obtain enhanced scene local features.
In one embodiment, the method further comprises the following steps: making a difference between the object global feature and the object local feature, and taking an absolute value to obtain a local feature distance vector corresponding to the object local feature; normalizing the local feature distance vectors to obtain an initial object local feature weight corresponding to each object local feature; adjusting the initial object local feature weight according to a preset second hyper-parameter to obtain an object local feature weight; and weighting the object local features according to the object local feature weights to obtain the enhanced object local features.
In one embodiment, the method further comprises the following steps: and fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features in a splicing mode to obtain fusion features of image scene data.
In one embodiment, the method further comprises the following steps: and inputting the fusion characteristics into a pre-trained linear support vector machine to obtain the scene classification of the image scene data.
In one embodiment, the method further comprises the following steps: normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector as follows:
\tilde{w}_{j}^{s} = \frac{\left| F_{j}^{s} - F_{c} \right|}{\sum_{k=1}^{l} \left| F_{k}^{s} - F_{c} \right|}
wherein \tilde{w}_{j}^{s} represents the initial scene local feature weight, F_{j}^{s} represents the scene local feature, F_{c} represents the feature center, l represents the number of sampled pictures in the scene local features, and n represents the number of images in the category.
In one embodiment, the method further comprises the following steps: normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature as follows:
\tilde{w}_{j}^{o} = \frac{1}{l-1} \left( 1 - \frac{d_{j}^{o}}{\sum_{k=1}^{l} d_{k}^{o}} \right)
wherein \tilde{w}_{j}^{o} represents the initial object local feature weight, F_{j}^{o} represents the object local feature, and d_{j}^{o} = \left| F^{go} - F_{j}^{o} \right| represents the local feature distance vector, F^{go} being the object global feature.
An apparatus for image scene classification based on local feature saliency, the apparatus comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the characteristic extraction module is used for respectively extracting scene local characteristics and object local characteristics in the image scene data block through a preset scene characteristic extraction model and an object characteristic extraction model, and extracting scene global characteristics and object global characteristics in the image scene data through the scene characteristic extraction model and the object characteristic extraction model;
the saliency module is used for respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
the fusion module is used for fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
According to the image scene classification method, apparatus, computer device and storage medium based on local feature saliency, the scene local features and object local features, as well as the scene global features and object global features, are extracted through the preset scene feature extraction model and object feature extraction model; different weights are then set for the scene local features and the object local features, which makes the features more targeted; the fusion features are obtained through feature fusion, and the scene corresponding to the fusion features can be classified by the classification model. Because the local features are made salient through the weights, the amount of computation can be reduced and the problem of structural redundancy can be solved.
Drawings
FIG. 1 is a flow diagram illustrating a method for image scene classification based on local feature saliency, according to an embodiment;
FIG. 2 is a diagram illustrating elements of a scene image of interest to Places-CNN and ImageNet-CNN in one embodiment;
FIG. 3 is a block diagram of a model in one embodiment;
FIG. 4 is a schematic diagram of feature extraction in one embodiment;
FIG. 5 is a class activation diagram of Places-CNN and ImageNet-CNN for different scenes in one embodiment;
FIG. 6 is a schematic illustration of feature fusion in one embodiment;
FIG. 7 is a block diagram of an image scene classification device based on local feature saliency in one embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided an image scene classification method based on local feature saliency, comprising the following steps:
step 102, segmenting image scene data to be classified to obtain image scene data blocks.
The image scene data may be an image captured in a scene, and the image scene data block may be obtained by segmenting the captured image.
Specifically, the collected original image may be resized by bilinear interpolation, for example to 224 × 224; the image mean of the ImageNet data set is then subtracted and the result is divided by the standard deviation. After normalization the data follow a common distribution, which improves the generalization ability of the model. For the multi-scale images, the image processed in the previous step is resized again, for example to 448 × 448, and the four corners of the resized image are cropped to form 4 image scene data blocks, each of size 224 × 224, which serve as small-scale supplementary data for the original scene image. Unlike dense sampling, this simplified sampling adds only 4 small-scale images to supplement the original image, reducing duplication and redundancy in the data.
It should be noted that the above dimensions and the number of cuts are examples, and other values can also be adopted to achieve the technical effects of the present invention.
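As an illustration only, the simplified sampling described above can be sketched roughly as follows; the function name, the use of PIL and torchvision transforms, and the ImageNet normalization statistics are assumptions made for this example and are not prescribed by the invention:

```python
import torch
from PIL import Image
from torchvision import transforms

# ImageNet mean/std are assumed here for the normalization step
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
to_tensor = transforms.ToTensor()

def simplified_sampling(image: Image.Image):
    """Return the 224x224 global view plus the 4 corner crops of the 448x448 view."""
    # Global view: bilinear resize to 224x224, then normalize
    global_view = normalize(to_tensor(image.resize((224, 224), Image.BILINEAR)))
    # Small-scale views: resize to 448x448 and cut out the four 224x224 corners
    large = image.resize((448, 448), Image.BILINEAR)
    corners = [(0, 0), (224, 0), (0, 224), (224, 224)]
    local_views = [normalize(to_tensor(large.crop((x, y, x + 224, y + 224))))
                   for x, y in corners]
    # Shapes: (3, 224, 224) and (4, 3, 224, 224)
    return global_view, torch.stack(local_views)
```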
And 104, respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model.
The scene feature extraction model and the object feature extraction model can be CNN networks, specifically, the scene feature extraction model can be Places-CNN and the object feature extraction model can be ImageNet-CNN.
Specifically, the deep network selected by the invention is DenseNet; the network can be built in the PyTorch deep learning framework with DenseNet161, proposed by Gao Huang, as the base network. During feature extraction, DenseNet is set to test mode, in which the Dropout layers used for regularization scale neuron outputs by the retention probability instead of randomly dropping them, and the final feature vector is the output of the last convolutional layer of DenseNet.
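A minimal sketch of this extraction step is given below; it assumes the torchvision implementation of DenseNet161 with ImageNet weights (a Places365-pretrained checkpoint would have to be loaded separately, which is not shown), and the helper name is illustrative:

```python
import torch
import torch.nn.functional as F
from torchvision import models

densenet = models.densenet161(pretrained=True)  # ImageNet-CNN; Places-CNN weights loaded analogously
densenet.eval()  # test mode, so Dropout no longer behaves stochastically

@torch.no_grad()
def extract_feature(batch: torch.Tensor) -> torch.Tensor:
    """Output of the last convolutional block, globally average pooled to a 2208-d vector."""
    fmap = densenet.features(batch)                    # (N, 2208, 7, 7) for 224x224 input
    fmap = F.relu(fmap)
    return F.adaptive_avg_pool2d(fmap, 1).flatten(1)   # (N, 2208)
```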
And 106, respectively obtaining the local features of the enhanced scene and the enhanced object by setting the weight of each scene local feature and each object local feature.
In this step, the scene local features and the object local features are made salient: for the scene local features, the purpose of the saliency is to highlight details, while for the object local features the purpose is to retain the subject.
And step 108, fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features to obtain fusion features of the image scene data.
There are various ways of feature fusion, such as addition operation, splicing operation, etc.
And step 110, inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
The classification model may be a support vector machine, logistic regression, etc., and is not limited herein.
According to the image scene classification method based on local feature saliency, the scene local features and object local features, as well as the scene global features and object global features, are extracted through the preset scene feature extraction model and object feature extraction model; different weights are then set for the scene local features and the object local features, which makes the features more targeted; the fusion features are obtained through feature fusion, and the scene corresponding to the fusion features can be classified by the classification model. Because the local features are made salient through the weights, the amount of computation can be reduced and the problem of structural redundancy can be solved.
In one embodiment, the mean value of the scene local features corresponding to all the image scene data blocks is calculated to determine a feature center; the scene distance vector of each scene local feature is determined according to the distance from the scene local feature to the feature center; the scene distance vectors are normalized to obtain the initial scene local feature weight of each scene distance vector; the initial scene local feature weight is adjusted according to a preset first hyper-parameter to obtain the scene local feature weight; and the scene local features are weighted according to the scene local feature weights to obtain the enhanced scene local features.
Specifically, the mean value of the scene local features corresponding to all the image scene data blocks is calculated, and the feature center is determined as follows:
F_{c} = \frac{1}{n\,l} \sum_{i=1}^{n} \sum_{j=1}^{l} F_{i,j}^{s}
wherein F_{c} represents the feature center, F_{i,j}^{s} represents the scene local feature of the j-th image scene data block of the i-th image, l represents the number of sampled pictures, and n represents the number of pictures in the category.
The scene distance vector of each scene local feature is determined according to the distance from the scene local feature to the feature center: the feature center is subtracted from each scene local feature and the absolute value is taken, giving the scene distance vector from that scene local feature to the feature center; this vector represents the degree of dispersion of each dimension of the feature.
In a specific embodiment, the scene distance vectors are normalized, and the initial scene local feature weight of each scene distance vector is obtained as follows:
\tilde{w}_{j}^{s} = \frac{\left| F_{j}^{s} - F_{c} \right|}{\sum_{k=1}^{l} \left| F_{k}^{s} - F_{c} \right|}
wherein \tilde{w}_{j}^{s} represents the initial scene local feature weight, F_{j}^{s} represents the scene local feature of the j-th image scene data block, F_{c} represents the feature center, l represents the number of sampled pictures in the scene local features, and n represents the number of images in the category; the operations are performed element-wise over the feature dimensions.
In this embodiment, a relatively accurate global feature is already available at the original scale, while prominent scene details are needed at the small scale; the values farther from the feature center are therefore reinforced so as to complement the scene global feature. With this normalization, the local details of the scene are highlighted.
In addition, according to a preset first hyper-parameter, the initial scene local feature weight is adjusted, and the obtained scene local feature weight is:
w_{j}^{s} = \rho \, \tilde{w}_{j}^{s}
where \rho represents the first hyper-parameter.
Finally, the sum of the products of the weights and the local features is taken as the enhanced scene local feature F^{es}:
F^{es} = \sum_{j=1}^{l} w_{j}^{s} \odot F_{j}^{s}
where \odot denotes element-wise multiplication.
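A minimal sketch of this scene-side saliency step is given below. It follows the reconstruction above (element-wise weights normalized over the l data blocks and a multiplicative hyper-parameter ρ), which is an assumption about the exact form; the tensor names are illustrative:

```python
import torch

def enhance_scene_local(scene_local: torch.Tensor,    # (l, d) scene local features of one image
                        feature_center: torch.Tensor,  # (d,) mean of the category's local features
                        rho: float = 1.0) -> torch.Tensor:
    """Emphasize the dimensions of the local features that lie far from the feature center."""
    dist = (scene_local - feature_center).abs()                     # (l, d) scene distance vectors
    init_w = dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)  # normalize over the l blocks
    w = rho * init_w                                                # first hyper-parameter
    return (w * scene_local).sum(dim=0)                             # (d,) enhanced scene local feature
```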
In one embodiment, the object global feature is subtracted from each object local feature and the absolute value is taken to obtain the local feature distance vector corresponding to that object local feature; the local feature distance vectors are normalized to obtain the initial object local feature weight corresponding to each object local feature; the initial object local feature weight is adjusted according to a preset second hyper-parameter to obtain the object local feature weight; and the object local features are weighted according to the object local feature weights to obtain the enhanced object local features.
Specifically, the object local features are the local features extracted by ImageNet-CNN. The image scene data contains rich object content; in particular, the features extracted from the small-scale images contain a large amount of object detail, and such detail is detrimental when the local features are meant to complement the global features. Directly using the local features extracted by ImageNet-CNN as a complement to the scene global feature is therefore not ideal. Instead, the object global feature is used as a semantic center of the object features to guide and correct the small-scale local features, so that the overly detailed parts of the object features are reduced and more appropriate object local features are obtained.
The method comprises the following specific steps:
The object global feature is subtracted from each object local feature and the absolute value is taken to obtain the local feature distance vector d_{j}^{o} = \left| F^{go} - F_{j}^{o} \right|. Since the global feature is needed to guide the local vectors, features closer to the global feature are given higher weight, in contrast to the scene local features, where details are highlighted. The initial object local feature weight \tilde{w}_{j}^{o} is calculated as follows:
\tilde{w}_{j}^{o} = \frac{1}{l-1} \left( 1 - \frac{d_{j}^{o}}{\sum_{k=1}^{l} d_{k}^{o}} \right)
Specifically, the local feature saliency process combines the l = 4 local features in a weighted sum, and the weights need to sum to 1, which is the purpose of the normalization; dividing by l − 1 = 3 makes the 4 weights sum exactly to 1.
The hyper-parameter λ (the second hyper-parameter) is used to control the degree of influence of the object local feature weights:
w_{j}^{o} = \lambda \, \tilde{w}_{j}^{o}
Finally, the object local feature weights are applied to the object local features to obtain the corrected enhanced object local feature:
F^{eo} = \sum_{j=1}^{l} w_{j}^{o} \odot F_{j}^{o}
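An analogous sketch of the object-side saliency step, under the same assumptions (the complement-based normalization inferred above and a multiplicative λ; tensor names are illustrative):

```python
import torch

def enhance_object_local(object_local: torch.Tensor,    # (l, d) object local features of one image
                         object_global: torch.Tensor,   # (d,) object global feature
                         lam: float = 1.0) -> torch.Tensor:
    """Keep the local object features that stay close to the object global feature."""
    dist = (object_global - object_local).abs()                     # (l, d) distance vectors
    ratio = dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)
    init_w = (1.0 - ratio) / (object_local.shape[0] - 1)            # weights sum to 1 over the l blocks
    w = lam * init_w                                                # second hyper-parameter
    return (w * object_local).sum(dim=0)                            # (d,) enhanced object local feature
```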
in one embodiment, the enhanced scene local features, the enhanced object local features, the scene global features and the object global features are fused in a splicing mode to obtain fusion features of image scene data.
In this embodiment, the splicing mode is selected because the scene features and the object features are two distinct kinds of features, and adding them semantically is meaningless.
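For illustration, with the 2208-dimensional DenseNet161 features used here the splicing produces a 4 × 2208 = 8832-dimensional descriptor per image; the variable names below are assumed:

```python
import torch

fused = torch.cat([enhanced_scene_local,    # (2208,)
                   scene_global,            # (2208,)
                   enhanced_object_local,   # (2208,)
                   object_global], dim=0)   # -> (8832,) fusion feature of the image scene data
```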
In one embodiment, the fusion features are input into a pre-trained linear support vector machine to obtain scene classification of image scene data.
In this embodiment, the linear support vector machine maximizes the margin between classes and reduces overfitting while maintaining training accuracy.
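A possible training and classification sketch with a linear support vector machine; scikit-learn's LinearSVC and the variable names are assumptions for this example, since the patent does not prescribe a particular implementation:

```python
from sklearn.svm import LinearSVC

clf = LinearSVC(C=1.0, max_iter=10000)                 # linear SVM: maximizes the inter-class margin
clf.fit(train_fused_features, train_labels)            # fusion features of the training images
predicted_scenes = clf.predict(test_fused_features)    # scene classification of the image scene data
```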
In summary, the invention achieves the following beneficial effects:
1. Compared with the common approach of dense sampling and semantic codebook construction, the simplified multi-scale scene image sampling scheme and the multi-scale feature generation method significantly reduce the amount of computation.
2. An optimization scheme for features from different deep networks, namely the feature saliency method, is used. Because the optimization is tailored to the characteristics of each type of feature, the scene description capability of the fused features is effectively improved, the complementarity of the optimized features is exploited more fully, and the scene classification accuracy is improved.
The following description will explain the advantageous effects of the present invention by using a specific example.
FIG. 2 shows the different elements of a scene image that Places-CNN and ImageNet-CNN attend to in the model. The two networks differ in what they focus on: scene images are rich in content and elements, the features extracted by Places-CNN tend to capture more holistic and spatial characteristics, while the features extracted by ImageNet-CNN pay more attention to details, especially the details of individual objects.
Fig. 3 is an overall framework diagram of the present invention, which includes the following three steps:
First, feature extraction. The constructed Places-CNN and ImageNet-CNN are used to extract features from the scene image at two scales.
Second, feature saliency. The extracted features of different types and different scales are optimized, which specifically comprises two parts: highlighting details in the scene local features and retaining the subject in the object local features.
Third, feature fusion and classification. The optimized features are spliced along the feature dimension, and classification is completed with a linear support vector machine.
Fig. 4 depicts the feature extraction process of the first step. In the feature extraction stage, the input image is propagated forward, and the output of the last Dense Block is taken as the feature extracted by each of the two types of convolutional neural networks. When the size of the input image is 224 × 224, the dimension of the feature obtained after global average pooling is 1 × 2208. The local feature dimension extracted with one network is therefore 4 × 2208, and the global feature dimension is 1 × 2208.
Fig. 5 shows the class activation maps of the two types of networks for different scenes. Facing the same classification task, the different types of depth features differ markedly in their activation regions and classification effects. FIG. 5 visualizes the activations of Places-CNN and ImageNet-CNN for some scene images of the MIT Indoor 67 dataset; class activation mapping is used to visualize the important visual attention regions of the different CNNs (the brighter a place in the image, the stronger its discriminative power), reflecting the different properties of scene features and object features. As can be seen from the figure, the activation regions and brightness of Places-CNN are obviously larger than those of ImageNet-CNN, which also explains why Places-CNN performs better than ImageNet-CNN on the scene classification task. Unlike Places-CNN, which focuses more on scene features, ImageNet-CNN places its visual emphasis on some objects in the scene, such as the toilet and cabinet in Bathroom, or the tables and chairs in Fastfood restaurant, and so on.
Fig. 6 shows the two feature fusion strategies considered in the third step of feature fusion and classification: one is dimensional splicing, shown on the left of Fig. 6, and the other is dimension-wise addition, shown on the right of Fig. 6. Considering that the scene features and the object features are two distinct kinds of features and adding them semantically is meaningless, the fusion strategy selected by the present invention is the first one.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of execution of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided an image scene classification apparatus based on local feature saliency, including: a segmentation module 702, a feature extraction module 704, a saliency module 706, a fusion module 708, and a classification module 710, wherein:
a segmentation module 702, configured to segment image scene data to be classified to obtain an image scene data block;
a feature extraction module 704, configured to respectively extract a scene local feature and an object local feature in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extract a scene global feature and an object global feature in the image scene data through the scene feature extraction model and the object feature extraction model;
a saliency module 706, configured to obtain enhanced scene local features and enhanced object local features by setting weights of each of the scene local features and the object local features;
a fusion module 708, configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature to obtain a fusion feature of image scene data;
and the classification module 710 is configured to input the fusion features into a pre-trained classification model to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to calculate the mean value of the scene local features corresponding to all the image scene data blocks and determine a feature center; determine a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalize the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector; adjust the initial scene local feature weight according to a preset first hyper-parameter to obtain a scene local feature weight; and weight the scene local features according to the scene local feature weight to obtain enhanced scene local features.
In one embodiment, the saliency module 706 is further configured to subtract the object local feature from the object global feature and take the absolute value to obtain a local feature distance vector corresponding to the object local feature; normalize the local feature distance vectors to obtain an initial object local feature weight corresponding to each object local feature; adjust the initial object local feature weight according to a preset second hyper-parameter to obtain an object local feature weight; and weight the object local features according to the object local feature weights to obtain the enhanced object local features.
In one embodiment, the fusion module 708 is further configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature in a splicing manner to obtain a fusion feature of the image scene data.
In one embodiment, the classification module 710 is further configured to input the fusion features into a pre-trained linear support vector machine to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to normalize the scene distance vectors, and obtain an initial scene local feature weight of each scene distance vector as:
\tilde{w}_{j}^{s} = \frac{\left| F_{j}^{s} - F_{c} \right|}{\sum_{k=1}^{l} \left| F_{k}^{s} - F_{c} \right|}
wherein \tilde{w}_{j}^{s} represents the initial scene local feature weight, F_{j}^{s} represents the scene local feature, F_{c} represents the feature center, l represents the number of sampled pictures in the scene local features, and n represents the number of images in the category.
In one embodiment, the saliency module 706 is further configured to normalize the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature as follows:
\tilde{w}_{j}^{o} = \frac{1}{l-1} \left( 1 - \frac{d_{j}^{o}}{\sum_{k=1}^{l} d_{k}^{o}} \right)
wherein \tilde{w}_{j}^{o} represents the initial object local feature weight, F_{j}^{o} represents the object local feature, and d_{j}^{o} = \left| F^{go} - F_{j}^{o} \right| represents the local feature distance vector, F^{go} being the object global feature.
For specific definition of the image scene classification device based on local feature saliency, refer to the above definition of the image scene classification method based on local feature saliency, and details thereof are not repeated here. The modules in the image scene classification device based on local feature saliency can be fully or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for image scene classification based on local feature saliency. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An image scene classification method based on local feature saliency, characterized in that the method comprises:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
2. The method of claim 1, wherein the obtaining of enhanced scene local features by setting a weight of each scene local feature comprises:
calculating the mean value of the scene local features corresponding to all the image scene data blocks, and determining a feature center;
determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center;
normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector;
adjusting the initial scene local feature weight according to a preset first hyper-parameter to obtain a scene local feature weight;
and weighting the scene local features according to the scene local feature weight to obtain enhanced scene local features.
3. The method of claim 1, wherein obtaining enhanced object local features by setting a weight of each of the object local features comprises:
making a difference between the object global feature and the object local feature, and taking an absolute value to obtain a local feature distance vector corresponding to the object local feature;
normalizing the local feature distance vectors to obtain an initial object local feature weight corresponding to each object local feature;
adjusting the initial object local feature weight according to a preset second hyper-parameter to obtain an object local feature weight;
and weighting the local features of the object according to the weight of the local features of the object to obtain the local features of the enhanced object.
4. The method according to any one of claims 1 to 3, wherein fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fused feature of image scene data comprises:
and fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features in a splicing mode to obtain fusion features of image scene data.
5. The method of any one of claims 1 to 3, wherein inputting the fused features into a pre-trained classification model to obtain a scene classification of the image scene data comprises:
and inputting the fusion characteristics into a pre-trained linear support vector machine to obtain the scene classification of the image scene data.
6. The method of claim 2, wherein normalizing the scene distance vectors to obtain an initial scene local feature weight for each scene distance vector comprises:
normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector as follows:
\tilde{w}_{j}^{s} = \frac{\left| F_{j}^{s} - F_{c} \right|}{\sum_{k=1}^{l} \left| F_{k}^{s} - F_{c} \right|}
wherein \tilde{w}_{j}^{s} represents the initial scene local feature weight, F_{j}^{s} represents the scene local feature, F_{c} represents the feature center, l represents the number of sampled pictures in the scene local features, and n represents the number of images in the category.
7. The method of claim 3, wherein normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature comprises:
normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature as follows:
\tilde{w}_{j}^{o} = \frac{1}{l-1} \left( 1 - \frac{d_{j}^{o}}{\sum_{k=1}^{l} d_{k}^{o}} \right)
wherein \tilde{w}_{j}^{o} represents the initial object local feature weight, F_{j}^{o} represents the object local feature, and d_{j}^{o} = \left| F^{go} - F_{j}^{o} \right| represents the local feature distance vector, F^{go} being the object global feature.
8. An apparatus for classifying an image scene based on local feature saliency, the apparatus comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the characteristic extraction module is used for respectively extracting scene local characteristics and object local characteristics in the image scene data block through a preset scene characteristic extraction model and an object characteristic extraction model, and extracting scene global characteristics and object global characteristics in the image scene data through the scene characteristic extraction model and the object characteristic extraction model;
the saliency module is used for respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
the fusion module is used for fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010928765.7A 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency Active CN112001399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010928765.7A CN112001399B (en) 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010928765.7A CN112001399B (en) 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency

Publications (2)

Publication Number Publication Date
CN112001399A true CN112001399A (en) 2020-11-27
CN112001399B CN112001399B (en) 2023-06-09

Family

ID=73468773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010928765.7A Active CN112001399B (en) 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency

Country Status (1)

Country Link
CN (1) CN112001399B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699855A (en) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Image scene recognition method and device based on artificial intelligence and electronic equipment
CN112907138A (en) * 2021-03-26 2021-06-04 国网陕西省电力公司电力科学研究院 Power grid scene early warning classification method and system from local perception to overall perception
CN113128527A (en) * 2021-06-21 2021-07-16 中国人民解放军国防科技大学 Image scene classification method based on converter model and convolutional neural network
CN113657462A (en) * 2021-07-28 2021-11-16 讯飞智元信息科技有限公司 Method for training vehicle recognition model, vehicle recognition method and computing device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110229045A1 (en) * 2010-03-16 2011-09-22 Nec Laboratories America, Inc. Method and system for image classification
CN110555446A (en) * 2019-08-19 2019-12-10 北京工业大学 Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning
CN111079674A (en) * 2019-12-22 2020-04-28 东北师范大学 Target detection method based on global and local information fusion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110229045A1 (en) * 2010-03-16 2011-09-22 Nec Laboratories America, Inc. Method and system for image classification
CN110555446A (en) * 2019-08-19 2019-12-10 北京工业大学 Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning
CN111079674A (en) * 2019-12-22 2020-04-28 东北师范大学 Target detection method based on global and local information fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
史静, 朱虹, 王婧, 薛杉: "Indoor Scene Classification Algorithm Based on Visually Sensitive Region Information Enhancement" (基于视觉敏感区域信息增强的室内场景分类算法), Pattern Recognition and Artificial Intelligence (模式识别与人工智能), vol. 30, no. 6, pages 520-529 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699855A (en) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Image scene recognition method and device based on artificial intelligence and electronic equipment
CN112907138A (en) * 2021-03-26 2021-06-04 国网陕西省电力公司电力科学研究院 Power grid scene early warning classification method and system from local perception to overall perception
CN112907138B (en) * 2021-03-26 2023-08-01 国网陕西省电力公司电力科学研究院 Power grid scene early warning classification method and system from local to whole perception
CN113128527A (en) * 2021-06-21 2021-07-16 中国人民解放军国防科技大学 Image scene classification method based on converter model and convolutional neural network
CN113657462A (en) * 2021-07-28 2021-11-16 讯飞智元信息科技有限公司 Method for training vehicle recognition model, vehicle recognition method and computing device

Also Published As

Publication number Publication date
CN112001399B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN112001399B (en) Image scene classification method and device based on local feature saliency
US20220189142A1 (en) Ai-based object classification method and apparatus, and medical imaging device and storage medium
US20220165053A1 (en) Image classification method, apparatus and training method, apparatus thereof, device and medium
CN111079632A (en) Training method and device of text detection model, computer equipment and storage medium
CN116168017B (en) Deep learning-based PCB element detection method, system and storage medium
CN114897779A (en) Cervical cytology image abnormal area positioning method and device based on fusion attention
CN112528845B (en) Physical circuit diagram identification method based on deep learning and application thereof
CN113538441A (en) Image segmentation model processing method, image processing method and device
CN111667459B (en) Medical sign detection method, system, terminal and storage medium based on 3D variable convolution and time sequence feature fusion
CN115858847B (en) Combined query image retrieval method based on cross-modal attention reservation
CN111666931A (en) Character and image recognition method, device and equipment based on mixed convolution and storage medium
CN111159450A (en) Picture classification method and device, computer equipment and storage medium
CN114821736A (en) Multi-modal face recognition method, device, equipment and medium based on contrast learning
CN110162689B (en) Information pushing method, device, computer equipment and storage medium
CN117636298A (en) Vehicle re-identification method, system and storage medium based on multi-scale feature learning
Niu et al. Bidirectional feature learning network for RGB-D salient object detection
CN112465847A (en) Edge detection method, device and equipment based on clear boundary prediction
WO2024011859A1 (en) Neural network-based face detection method and device
CN116704511A (en) Method and device for recognizing characters of equipment list
CN112116596A (en) Training method of image segmentation model, image segmentation method, medium, and terminal
CN116030341A (en) Plant leaf disease detection method based on deep learning, computer equipment and storage medium
CN113505247B (en) Content-based high-duration video pornography content detection method
EP4246375A1 (en) Model processing method and related device
WO2022222519A1 (en) Fault image generation method and apparatus
CN115862112A (en) Target detection model for facial image acne curative effect evaluation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant