CN112001399A - Image scene classification method and device based on local feature saliency - Google Patents
Image scene classification method and device based on local feature saliency
- Publication number
- CN112001399A CN112001399A CN202010928765.7A CN202010928765A CN112001399A CN 112001399 A CN112001399 A CN 112001399A CN 202010928765 A CN202010928765 A CN 202010928765A CN 112001399 A CN112001399 A CN 112001399A
- Authority
- CN
- China
- Prior art keywords
- scene
- local
- feature
- features
- local feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application relates to an image scene classification method and device based on local feature saliency. The method comprises the following steps: segmenting the image scene data to be classified to obtain image scene data blocks; extracting scene local features and object local features from the data blocks through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features from the image scene data through the same two models; obtaining enhanced scene local features and enhanced object local features by setting weights for the scene local features and the object local features; fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features to obtain a fusion feature of the image scene data; and inputting the fusion feature into a pre-trained classification model to obtain the scene classification of the image scene data. The method reduces the amount of calculation and solves the problem of structural redundancy.
Description
Technical Field
The present application relates to the field of scene classification technologies, and in particular, to a method and an apparatus for classifying image scenes based on local feature saliency.
Background
With the development of Internet multimedia technology and the rapid growth of visual data, processing these massive data has become a challenge of the new era. Scene classification, as a key technology for image retrieval and image recognition, has become an important and challenging research topic in computer vision, and it is widely applied in remote sensing image analysis, video surveillance, robot perception and other fields. Research on scene classification technology to improve the scene recognition capability of computers is therefore of great significance.
Image scene classification determines the scene to which a given image belongs by recognizing the information and content it contains. In recent years, deep learning has developed rapidly, gradually replacing traditional hand-crafted image features and bringing new progress to scene classification. However, deep learning requires large numbers of training samples, which small-scale scene datasets cannot provide; in practical applications it cannot be guaranteed that every scene category offers a sizable number of training images. Transfer learning provides a solution to this problem: a model pre-trained on a large-scale dataset is reused for another task and selectively fine-tuned on the target dataset to meet the requirements of the current task, an approach widely used in deep learning. The parameters of networks pre-trained on different datasets often differ greatly, and the features these networks extract from the task dataset reflect different properties of the data. Scene images are rich in content and complex in concept, so features extracted by a single type of pre-trained network are insufficient to describe them; a common remedy is to fuse features extracted by different networks into a more discriminative scene representation. Although features from different pre-trained models reflect different aspects of a scene, they describe the scene with different accuracy, and extracting their effective parts according to their characteristics remains a difficult problem with no general solution.
On the other hand, a convolutional neural network understands a scene image differently at different scales: features that cannot be extracted at one scale may be obtained at another, so combining scene information across multiple scales can effectively enrich the description of the image. However, the features extracted from multi-scale images do not always complement each other into a more accurate scene representation. For example, a small-scale image provides more detail, but it also amplifies the noise in the image, so the features must be filtered and screened reasonably. At present, multi-scale images are usually obtained by densely sampling the original image; taking a 256 × 256 pixel image as an example, local images of different sizes can be sampled from it by setting the size and sampling stride of the new images. Dense sampling produces a large number of local features, which are usually encoded with a method such as the Bag-of-Visual-Words (BoVW) model and finally aggregated into a new scene image description. Multi-scale scene descriptions obtained in this way suffer from heavy computation, structural redundancy and other drawbacks.
Disclosure of Invention
Based on the above, it is necessary to provide an image scene classification method and apparatus based on local feature saliency, which can solve the problems of large calculation amount and structural redundancy in multi-scale scene image description.
A method of image scene classification based on local feature saliency, the method comprising:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
In one embodiment, the method further comprises the following steps: calculating the mean value of the scene local features corresponding to all the image scene data blocks to determine a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain an initial scene local feature weight for each scene distance vector; adjusting the initial scene local feature weights according to a preset first hyper-parameter to obtain the scene local feature weights; and weighting the scene local features according to the scene local feature weights to obtain the enhanced scene local features.
In one embodiment, the method further comprises the following steps: taking the absolute difference between the object global feature and each object local feature to obtain a local feature distance vector corresponding to that object local feature; normalizing the local feature distance vectors to obtain an initial object local feature weight corresponding to each object local feature; adjusting the initial object local feature weights according to a preset second hyper-parameter to obtain the object local feature weights; and weighting the object local features according to the object local feature weights to obtain the enhanced object local features.
In one embodiment, the method further comprises the following steps: and fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features in a splicing mode to obtain fusion features of image scene data.
In one embodiment, the method further comprises the following steps: and inputting the fusion characteristics into a pre-trained linear support vector machine to obtain the scene classification of the image scene data.
In one embodiment, the method further comprises the following steps: normalizing the scene distance vectors to obtain the initial scene local feature weight of each scene distance vector, wherein the weight is determined by the scene local feature, the feature center, the number l of sampled sub-images among the scene local features, and the number n of images in the category.
In one embodiment, the method further comprises the following steps: normalizing the local feature distance vectors to obtain the initial object local feature weight corresponding to each object local feature, wherein the weight is determined by the object local feature and its local feature distance vector.
An apparatus for image scene classification based on local feature saliency, the apparatus comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the characteristic extraction module is used for respectively extracting scene local characteristics and object local characteristics in the image scene data block through a preset scene characteristic extraction model and an object characteristic extraction model, and extracting scene global characteristics and object global characteristics in the image scene data through the scene characteristic extraction model and the object characteristic extraction model;
the saliency module is used for respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
the fusion module is used for fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
According to the image scene classification method, device, computer equipment and storage medium based on local feature saliency, scene local features and object local features, as well as scene global features and object global features, are extracted by a preset scene feature extraction model and a preset object feature extraction model. Different weights are then assigned to the scene local features and the object local features, which improves the pertinence of the features; the weighted features are fused into a fusion feature, and the scene corresponding to the fusion feature is classified by the classification model. Because the local features are made salient through the weights, the amount of calculation is reduced and the problem of structural redundancy is solved.
Drawings
FIG. 1 is a flow diagram illustrating a method for image scene classification based on local feature saliency, according to an embodiment;
FIG. 2 is a diagram illustrating the elements of a scene image attended to by Places-CNN and ImageNet-CNN in one embodiment;
FIG. 3 is a block diagram of a model in one embodiment;
FIG. 4 is a schematic diagram of feature extraction in one embodiment;
FIG. 5 shows class activation maps of Places-CNN and ImageNet-CNN for different scenes in one embodiment;
FIG. 6 is a schematic illustration of feature fusion in one embodiment;
FIG. 7 is a block diagram of an image scene classification device based on local feature saliency in one embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided an image scene classification method based on local feature saliency, comprising the following steps:
Step 102, segmenting the image scene data to be classified to obtain image scene data blocks. The image scene data may be an image captured of a scene, and the image scene data blocks are obtained by segmenting the captured image.
Specifically, the collected original image may first be resized by bilinear interpolation, for example to 224 × 224, then normalized by subtracting the image mean of the ImageNet dataset and dividing by the standard deviation; after normalization the data follow a consistent distribution, which improves the generalization capability of the model. For the multi-scale images, the image processed in the previous step is resized again, for example to 448 × 448, and its four corners are cropped to form 4 image scene data blocks of size 224 × 224, which serve as small-scale supplementary data for the original scene image. Unlike dense sampling, this simplified sampling adds only 4 small-scale images to supplement the original image, reducing duplication and redundancy in the data.
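The following is a minimal sketch of this preprocessing step, not the patent's reference implementation: the bilinear resizing, ImageNet statistics, and the four corner coordinates are assumptions consistent with the sizes given in this embodiment.

```python
# Illustrative preprocessing sketch: a 224x224 global view plus four 224x224
# corner crops taken from a 448x448 resize of the same image.
import torch
from PIL import Image
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]   # ImageNet channel means (assumed)
IMAGENET_STD = [0.229, 0.224, 0.225]    # ImageNet channel standard deviations (assumed)

to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

def build_views(path: str):
    """Return the normalized global view (3, 224, 224) and four local views (4, 3, 224, 224)."""
    img = Image.open(path).convert("RGB")
    global_view = to_tensor(img.resize((224, 224), Image.BILINEAR))
    large = img.resize((448, 448), Image.BILINEAR)
    corners = [(0, 0), (224, 0), (0, 224), (224, 224)]          # top-left of each crop
    locals_ = [to_tensor(large.crop((x, y, x + 224, y + 224))) for x, y in corners]
    return global_view, torch.stack(locals_)
```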
It should be noted that the above dimensions and number of crops are merely examples; other values can also achieve the technical effects of the present invention.
Step 104, respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model.
The scene feature extraction model and the object feature extraction model may both be convolutional neural networks; specifically, the scene feature extraction model may be Places-CNN and the object feature extraction model may be ImageNet-CNN.
Specifically, the deep network selected by the invention is DenseNet; for example, DenseNet161 proposed by Gao Huang et al. can be used as the base network and built in the PyTorch deep learning framework. During feature extraction, DenseNet is set to test mode, in which the Dropout layers used for regularization scale neuron outputs by their retention probability, and the final feature vector is the output of the last convolutional layer of DenseNet.
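A minimal sketch of this feature extraction step, assuming torchvision's DenseNet-161 with ImageNet weights as a stand-in (a Places-CNN would load Places-pretrained weights into the same architecture, which torchvision does not bundle); the 2208-dimensional pooled output matches the dimensionality discussed in the embodiments below.

```python
# Illustrative feature extraction sketch: global-average-pooled output of the
# last convolutional block of DenseNet-161 (2208 dimensions per view).
import torch
import torch.nn.functional as F
from torchvision import models

backbone = models.densenet161(weights="IMAGENET1K_V1")
backbone.eval()  # test mode: Dropout and BatchNorm behave deterministically

@torch.no_grad()
def extract_features(views: torch.Tensor) -> torch.Tensor:
    """views: (N, 3, 224, 224) -> (N, 2208) feature vectors."""
    fmap = backbone.features(views)           # last Dense Block output, (N, 2208, 7, 7)
    fmap = F.relu(fmap)                       # torchvision applies ReLU before pooling
    return F.adaptive_avg_pool2d(fmap, 1).flatten(1)
```

Applied to the global view and to the stacked local views from the previous sketch, this yields the 1 × 2208 global feature and 4 × 2208 local features per network.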
Step 106, respectively obtaining the enhanced scene local features and the enhanced object local features by setting the weight of each scene local feature and each object local feature.
In this step, the scene local features and the object local features are made salient: for the scene local features the purpose is to highlight details, while for the object local features the purpose is to retain the subject.
Step 108, fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features to obtain the fusion feature of the image scene data.
Features can be fused in various ways, such as element-wise addition or concatenation (splicing).
Step 110, inputting the fusion feature into a pre-trained classification model to obtain the scene classification of the image scene data.
The classification model may be a support vector machine, logistic regression, etc., and is not limited herein.
According to the image scene classification method based on local feature saliency, scene local features and object local features, as well as scene global features and object global features, are extracted by a preset scene feature extraction model and a preset object feature extraction model. Different weights are then assigned to the scene local features and the object local features, which improves the pertinence of the features; the weighted features are fused into a fusion feature, and the scene corresponding to the fusion feature is classified by the classification model. Because local feature saliency is realized through the weights, the amount of calculation is reduced and the problem of structural redundancy is solved.
In one embodiment, the mean of the scene local features corresponding to all image scene data blocks is calculated to determine a feature center; a scene distance vector of each scene local feature is determined according to its distance to the feature center; the scene distance vectors are normalized to obtain an initial scene local feature weight for each scene distance vector; the initial scene local feature weights are adjusted according to a preset first hyper-parameter to obtain the scene local feature weights; and the scene local features are weighted according to the scene local feature weights to obtain the enhanced scene local features.
Specifically, the feature center is determined as the mean of the scene local features corresponding to all image scene data blocks, where l denotes the number of sampled sub-images and n denotes the number of images in the category.
The scene distance vector of each scene local feature is determined from its distance to the feature center. Specifically, the absolute difference between the feature center and each scene local feature is taken as the scene distance vector from that local feature to the feature center; this vector reflects the degree of dispersion of each dimension of the feature.
In a specific embodiment, the scene distance vectors are normalized to obtain the initial scene local feature weight of each scene distance vector, where the weight is determined by the scene local feature, the feature center, the number l of sampled sub-images among the scene local features, and the number n of images in the category.
In this embodiment, a relatively accurate global feature is already available at the original scale, while prominent scene details are needed at the small scale; therefore, values farther from the feature center are reinforced so as to complement the scene global feature. This normalization highlights the local details of the scene.
In addition, the initial scene local feature weights are adjusted according to a preset first hyper-parameter to obtain the scene local feature weights. Finally, the enhanced scene local feature is obtained as the sum of the products of the weights and the local features.
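The computation described above can be formalized as follows. This is a hedged reconstruction from the surrounding description: the symbols (x_i for the i-th scene local feature, c for the feature center, λ1 for the first hyper-parameter) are chosen for illustration, and the operations may be read element-wise over the feature dimensions.

```latex
% Hedged reconstruction of the scene-saliency step (symbols are illustrative).
\begin{aligned}
c &= \frac{1}{l\,n}\sum_{j=1}^{n}\sum_{i=1}^{l} x_{j,i}
  && \text{feature center: mean of the scene local features of the category}\\
d_i &= \lvert x_i - c \rvert
  && \text{scene distance vector of local feature } x_i\\
w_i^{0} &= \frac{d_i}{\sum_{k=1}^{l} d_k}
  && \text{normalized initial weight, larger for features far from the center}\\
w_i &= \lambda_1\, w_i^{0}, \qquad
\tilde{x} = \sum_{i=1}^{l} w_i \odot x_i
  && \text{first hyper-parameter adjustment and weighted sum}
\end{aligned}
```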
In one embodiment, the object global feature is subtracted from each object local feature and the absolute value is taken to obtain a local feature distance vector corresponding to that object local feature; the local feature distance vectors are normalized to obtain an initial object local feature weight corresponding to each object local feature; the initial object local feature weights are adjusted according to a preset second hyper-parameter to obtain the object local feature weights; and the object local features are weighted according to the object local feature weights to obtain the enhanced object local features.
Specifically, the object local features are those extracted by ImageNet-CNN. Image scene data contain rich object content, and the features extracted from small-scale images in particular contain a large number of object details. Such detail features adversely affect the role of the local features as a complement to the global features, so directly using the features extracted by ImageNet-CNN as a complement to the scene global features gives unsatisfactory results. The object global feature is therefore used as an object semantic center to guide and correct the small-scale local features, reducing the detail components in the object features and obtaining more appropriate object local features.
The specific steps are as follows: the absolute difference between the object global feature and each object local feature is taken to obtain the local feature distance vectors. Because the global feature is used to guide the local vectors, features closer to the global feature are given higher weights, in contrast to the scene local features, where details are highlighted. The object local feature weights are then calculated from these normalized distance vectors.
Specifically, the local feature saliency step forms a weighted sum of the 4 local features, and the weights must sum to 1, which is the purpose of the normalization; dividing by 3 makes the sum of the 4 weights exactly 1.
Finally, the object local feature weights are applied to the object local features to obtain the corrected enhanced object local features.
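Analogously, a hedged reconstruction of the object-saliency step: g denotes the object global feature, f_i the i-th object local feature (l = 4 corner crops here), and λ2 the second hyper-parameter, all illustrative. The division by l - 1 = 3 encodes the statement above that the four weights sum to 1.

```latex
% Hedged reconstruction of the object-saliency step (symbols are illustrative).
\begin{aligned}
d_i &= \lvert g - f_i \rvert, \qquad
\hat{d}_i = \frac{d_i}{\sum_{k=1}^{l} d_k}
  && \text{normalized distance of each local feature to the object global feature}\\
w_i^{0} &= \frac{1 - \hat{d}_i}{l - 1}
  && \text{closer features receive larger weights; for } l = 4 \text{ the weights sum to } 1\\
w_i &= \lambda_2\, w_i^{0}, \qquad
\tilde{f} = \sum_{i=1}^{l} w_i \odot f_i
  && \text{second hyper-parameter adjustment and weighted sum}
\end{aligned}
```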
in one embodiment, the enhanced scene local features, the enhanced object local features, the scene global features and the object global features are fused in a splicing mode to obtain fusion features of image scene data.
In this embodiment, concatenation is chosen because the scene local features and the object local features are two distinct kinds of features, and adding them semantically would be meaningless.
In one embodiment, the fusion features are input into a pre-trained linear support vector machine to obtain scene classification of image scene data.
In this embodiment, the linear support vector machine maximizes the margin between classes while ensuring a certain training accuracy, thereby reducing overfitting.
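A minimal sketch of the fusion and classification stage, assuming four 2208-dimensional descriptors per image concatenated into an 8832-dimensional fusion feature and fed to scikit-learn's LinearSVC; the random stand-in data and the C and max_iter values are placeholders for illustration.

```python
# Illustrative fusion-and-classification sketch: concatenate the four descriptors
# and train a linear SVM on the fused features.
import numpy as np
from sklearn.svm import LinearSVC

def fuse(scene_local, object_local, scene_global, object_global):
    """Concatenate enhanced local and global descriptors into one fusion feature."""
    return np.concatenate([scene_local, object_local, scene_global, object_global], axis=-1)

rng = np.random.default_rng(0)
feats = fuse(*[rng.standard_normal((100, 2208)) for _ in range(4)])   # (100, 8832) toy features
labels = rng.integers(0, 5, size=100)                                 # pretend 5 scene categories
clf = LinearSVC(C=1.0, max_iter=5000).fit(feats, labels)
print(clf.predict(feats[:3]))
```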
In summary, the invention achieves the following beneficial effects:
1. Compared with the common approach of dense sampling and semantic-code construction, the simplified multi-scale scene image sampling scheme and the multi-scale feature generation method significantly reduce the amount of computation.
2. A feature saliency method, i.e., an optimization scheme based on network features of different depths, is used. Because this optimization is combined with the characteristics of the features, the scene description capability of the fused features is effectively improved, the optimized features complement each other more fully, and the scene classification accuracy is improved.
The following description will explain the advantageous effects of the present invention by using a specific example.
FIG. 2 shows the different elements of a scene image attended to by Places-CNN and ImageNet-CNN in the model. The two networks focus on clearly different content: scene images are rich in content and elements, and the features extracted by Places-CNN tend to capture more holistic and spatial characteristics, whereas the features extracted by ImageNet-CNN pay more attention to details, especially the details of individual objects.
Fig. 3 is an overall framework diagram of the present invention, which includes the following three steps:
First, feature extraction. The constructed Places-CNN and ImageNet-CNN are used to extract features from the scene image at two scales.
Second, feature saliency. The extracted features of different types and scales are optimized; this includes two parts: highlighting details in the scene local features and retaining the subject in the object local features.
Third, feature fusion and classification. The optimized features are concatenated along the feature dimension, and classification is completed with a linear support vector machine.
Fig. 4 depicts the feature extraction process of the first step. In the feature extraction stage, the input image is propagated forward, and the output of the last Dense Block is taken as the feature extracted by each of the two types of convolutional neural networks. For an input image of size 224 × 224, the feature obtained after global average pooling has dimension 1 × 2208; the local features extracted by one network thus have dimension 4 × 2208, and the global feature has dimension 1 × 2208.
Fig. 5 shows class activation maps of the two types of networks for different scenes. Facing the same classification task, the different types of deep features have clearly different activation regions and classification effects. FIG. 5 visualizes the activations of Places-CNN and ImageNet-CNN on some scene images of the MIT Indoor 67 dataset; class activation mapping is used to visualize the important visual attention regions of the different CNNs (brighter areas indicate stronger discriminative power), reflecting the different properties of scene features and object features. As can be seen from the figure, the activation regions and brightness of Places-CNN are clearly larger than those of ImageNet-CNN, which also explains why Places-CNN performs better than ImageNet-CNN on the scene classification task. Unlike Places-CNN, which focuses more on scene features, ImageNet-CNN places its visual emphasis on certain scene objects, such as the toilet and cabinet in Bathroom and the tables and chairs in Fastfood restaurant.
Fig. 6 shows the two feature fusion strategies considered in the third step, feature fusion and classification: dimensional concatenation, shown on the left of Fig. 6, and dimensional addition, shown on the right. Considering that the scene local features and the object local features are two distinct kinds of features whose semantic addition is meaningless, the invention adopts the first strategy, concatenation.
It should be understood that, although the steps in the flowchart of Fig. 1 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in Fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided an image scene classification apparatus based on local feature saliency, including: a segmentation module 702, a feature extraction module 704, a saliency module 706, a fusion module 708, and a classification module 710, wherein:
a segmentation module 702, configured to segment image scene data to be classified to obtain an image scene data block;
a feature extraction module 704, configured to respectively extract a scene local feature and an object local feature in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extract a scene global feature and an object global feature in the image scene data through the scene feature extraction model and the object feature extraction model;
a saliency module 706, configured to obtain enhanced scene local features and enhanced object local features by setting weights of each of the scene local features and the object local features;
a fusion module 708, configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature to obtain a fusion feature of image scene data;
and the classification module 710 is configured to input the fusion features into a pre-trained classification model to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to calculate a mean value of all the image scene data blocks corresponding to the scene local features, and determine a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector; adjusting the initial scene local feature weight according to a preset first hyper-parameter to obtain a scene local feature weight; and weighting the scene local features according to the scene local feature weight to obtain enhanced scene local features.
In one embodiment, the saliency module 706 is further configured to perform a difference between the object global feature and the object local feature, and take an absolute value to obtain a local feature distance vector corresponding to the object local feature; normalizing the local feature distance vectors to obtain an initial object local feature weight corresponding to each object local feature; adjusting the initial object local characteristic weight according to a preset second hyperparameter to obtain an object local characteristic weight; and weighting the local features of the object according to the weight of the local features of the object to obtain the local features of the enhanced object.
In one embodiment, the fusion module 708 is further configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature in a splicing manner to obtain a fusion feature of the image scene data.
In one embodiment, the classification module 710 is further configured to input the fusion features into a pre-trained linear support vector machine to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to normalize the scene distance vectors, and obtain an initial scene local feature weight of each scene distance vector as:
wherein ,the weight of the local feature of the initial scene is represented,the local features of the scene are represented,the feature center is represented, l represents the number of sampled pictures in the scene local feature, and n represents the number of images in the category.
In one embodiment, the saliency module 706 is further configured to normalize the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature as follows:
wherein ,the local feature weights of the initial object are represented,the local features of the object are represented,representing a local feature distance vector.
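As a code-level complement to the embodiments above, the following PyTorch-style sketch shows the two weightings the saliency module performs; the function names, tensor shapes and the lambda1/lambda2 hyper-parameters are assumptions for illustration rather than the patent's reference implementation.

```python
# Illustrative saliency sketch: per-dimension weighting of the 4 small-scale features.
import torch

def enhance_scene_local(local_feats: torch.Tensor, center: torch.Tensor, lambda1: float = 1.0):
    """local_feats: (4, D) scene local features; center: (D,) category feature center."""
    dist = (local_feats - center).abs()                              # distance to the feature center
    weights = lambda1 * dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)
    return (weights * local_feats).sum(dim=0)                        # emphasize details far from the center

def enhance_object_local(local_feats: torch.Tensor, global_feat: torch.Tensor, lambda2: float = 1.0):
    """local_feats: (4, D) object local features; global_feat: (D,) object global feature."""
    dist = (global_feat - local_feats).abs()
    norm = dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)
    weights = lambda2 * (1.0 - norm) / (local_feats.shape[0] - 1)    # closer to the global feature => larger weight
    return (weights * local_feats).sum(dim=0)                        # keep the subject, suppress detail noise
```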
For specific definition of the image scene classification device based on local feature saliency, refer to the above definition of the image scene classification method based on local feature saliency, and details thereof are not repeated here. The modules in the image scene classification device based on local feature saliency can be fully or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for image scene classification based on local feature saliency. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. An image scene classification method based on local feature saliency, characterized in that the method comprises:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
2. The method of claim 1, wherein the obtaining of enhanced scene local features by setting a weight of each scene local feature comprises:
calculating the mean value of the scene local features corresponding to all the image scene data blocks, and determining a feature center;
determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center;
normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector;
adjusting the initial scene local feature weight according to a preset first hyper-parameter to obtain a scene local feature weight;
and weighting the scene local features according to the scene local feature weight to obtain enhanced scene local features.
3. The method of claim 1, wherein obtaining enhanced object local features by setting a weight of each of the object local features comprises:
making a difference between the object global feature and the object local feature, and taking an absolute value to obtain a local feature distance vector corresponding to the object local feature;
normalizing the local feature distance vectors to obtain an initial object local feature weight corresponding to each object local feature;
adjusting the initial object local characteristic weight according to a preset second hyperparameter to obtain an object local characteristic weight;
and weighting the local features of the object according to the weight of the local features of the object to obtain the local features of the enhanced object.
4. The method according to any one of claims 1 to 3, wherein fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fused feature of image scene data comprises:
and fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features in a splicing mode to obtain fusion features of image scene data.
5. The method of any one of claims 1 to 3, wherein inputting the fused features into a pre-trained classification model to obtain a scene classification of the image scene data comprises:
and inputting the fusion characteristics into a pre-trained linear support vector machine to obtain the scene classification of the image scene data.
6. The method of claim 2, wherein normalizing the scene distance vectors to obtain an initial scene local feature weight for each scene distance vector comprises:
normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector as follows:
7. The method of claim 3, wherein normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature comprises:
normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature as follows:
8. An apparatus for classifying an image scene based on local feature saliency, the apparatus comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the characteristic extraction module is used for respectively extracting scene local characteristics and object local characteristics in the image scene data block through a preset scene characteristic extraction model and an object characteristic extraction model, and extracting scene global characteristics and object global characteristics in the image scene data through the scene characteristic extraction model and the object characteristic extraction model;
the saliency module is used for respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
the fusion module is used for fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010928765.7A CN112001399B (en) | 2020-09-07 | 2020-09-07 | Image scene classification method and device based on local feature saliency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010928765.7A CN112001399B (en) | 2020-09-07 | 2020-09-07 | Image scene classification method and device based on local feature saliency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112001399A true CN112001399A (en) | 2020-11-27 |
CN112001399B CN112001399B (en) | 2023-06-09 |
Family
ID=73468773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010928765.7A Active CN112001399B (en) | 2020-09-07 | 2020-09-07 | Image scene classification method and device based on local feature saliency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112001399B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112699855A (en) * | 2021-03-23 | 2021-04-23 | 腾讯科技(深圳)有限公司 | Image scene recognition method and device based on artificial intelligence and electronic equipment |
CN112907138A (en) * | 2021-03-26 | 2021-06-04 | 国网陕西省电力公司电力科学研究院 | Power grid scene early warning classification method and system from local perception to overall perception |
CN113128527A (en) * | 2021-06-21 | 2021-07-16 | 中国人民解放军国防科技大学 | Image scene classification method based on converter model and convolutional neural network |
CN113657462A (en) * | 2021-07-28 | 2021-11-16 | 讯飞智元信息科技有限公司 | Method for training vehicle recognition model, vehicle recognition method and computing device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110229045A1 (en) * | 2010-03-16 | 2011-09-22 | Nec Laboratories America, Inc. | Method and system for image classification |
CN110555446A (en) * | 2019-08-19 | 2019-12-10 | 北京工业大学 | Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning |
CN111079674A (en) * | 2019-12-22 | 2020-04-28 | 东北师范大学 | Target detection method based on global and local information fusion |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110229045A1 (en) * | 2010-03-16 | 2011-09-22 | Nec Laboratories America, Inc. | Method and system for image classification |
CN110555446A (en) * | 2019-08-19 | 2019-12-10 | 北京工业大学 | Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning |
CN111079674A (en) * | 2019-12-22 | 2020-04-28 | 东北师范大学 | Target detection method based on global and local information fusion |
Non-Patent Citations (1)
Title |
---|
SHI Jing, ZHU Hong, WANG Jing, XUE Shan: "Indoor scene classification algorithm based on information enhancement of visually sensitive regions" (基于视觉敏感区域信息增强的室内场景分类算法), Pattern Recognition and Artificial Intelligence (模式识别与人工智能), vol. 30, no. 6, pages 520-529 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112699855A (en) * | 2021-03-23 | 2021-04-23 | 腾讯科技(深圳)有限公司 | Image scene recognition method and device based on artificial intelligence and electronic equipment |
CN112907138A (en) * | 2021-03-26 | 2021-06-04 | 国网陕西省电力公司电力科学研究院 | Power grid scene early warning classification method and system from local perception to overall perception |
CN112907138B (en) * | 2021-03-26 | 2023-08-01 | 国网陕西省电力公司电力科学研究院 | Power grid scene early warning classification method and system from local to whole perception |
CN113128527A (en) * | 2021-06-21 | 2021-07-16 | 中国人民解放军国防科技大学 | Image scene classification method based on converter model and convolutional neural network |
CN113657462A (en) * | 2021-07-28 | 2021-11-16 | 讯飞智元信息科技有限公司 | Method for training vehicle recognition model, vehicle recognition method and computing device |
Also Published As
Publication number | Publication date |
---|---|
CN112001399B (en) | 2023-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112001399B (en) | Image scene classification method and device based on local feature saliency | |
US20220189142A1 (en) | Ai-based object classification method and apparatus, and medical imaging device and storage medium | |
US20220165053A1 (en) | Image classification method, apparatus and training method, apparatus thereof, device and medium | |
CN111079632A (en) | Training method and device of text detection model, computer equipment and storage medium | |
CN116168017B (en) | Deep learning-based PCB element detection method, system and storage medium | |
CN114897779A (en) | Cervical cytology image abnormal area positioning method and device based on fusion attention | |
CN112528845B (en) | Physical circuit diagram identification method based on deep learning and application thereof | |
CN113538441A (en) | Image segmentation model processing method, image processing method and device | |
CN111667459B (en) | Medical sign detection method, system, terminal and storage medium based on 3D variable convolution and time sequence feature fusion | |
CN115858847B (en) | Combined query image retrieval method based on cross-modal attention reservation | |
CN111666931A (en) | Character and image recognition method, device and equipment based on mixed convolution and storage medium | |
CN111159450A (en) | Picture classification method and device, computer equipment and storage medium | |
CN114821736A (en) | Multi-modal face recognition method, device, equipment and medium based on contrast learning | |
CN110162689B (en) | Information pushing method, device, computer equipment and storage medium | |
CN117636298A (en) | Vehicle re-identification method, system and storage medium based on multi-scale feature learning | |
Niu et al. | Bidirectional feature learning network for RGB-D salient object detection | |
CN112465847A (en) | Edge detection method, device and equipment based on clear boundary prediction | |
WO2024011859A1 (en) | Neural network-based face detection method and device | |
CN116704511A (en) | Method and device for recognizing characters of equipment list | |
CN112116596A (en) | Training method of image segmentation model, image segmentation method, medium, and terminal | |
CN116030341A (en) | Plant leaf disease detection method based on deep learning, computer equipment and storage medium | |
CN113505247B (en) | Content-based high-duration video pornography content detection method | |
EP4246375A1 (en) | Model processing method and related device | |
WO2022222519A1 (en) | Fault image generation method and apparatus | |
CN115862112A (en) | Target detection model for facial image acne curative effect evaluation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||