CN112001399A - Image scene classification method and device based on local feature saliency - Google Patents
Image scene classification method and device based on local feature saliency
- Publication number
- CN112001399A CN112001399A CN202010928765.7A CN202010928765A CN112001399A CN 112001399 A CN112001399 A CN 112001399A CN 202010928765 A CN202010928765 A CN 202010928765A CN 112001399 A CN112001399 A CN 112001399A
- Authority
- CN
- China
- Prior art keywords
- scene
- local
- feature
- features
- local feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application relates to an image scene classification method and device based on local feature saliency. The method comprises the following steps: segmenting the image scene data to be classified to obtain image scene data blocks; extracting scene local features and object local features from the data blocks through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features from the image scene data through the same two models; obtaining enhanced scene local features and enhanced object local features by setting weights for the scene local features and the object local features; fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features to obtain a fusion feature of the image scene data; and inputting the fusion feature into a pre-trained classification model to obtain the scene classification of the image scene data. The method reduces the amount of calculation and solves the problem of structural redundancy.
Description
Technical Field
The present application relates to the field of scene classification technologies, and in particular, to a method and an apparatus for classifying image scenes based on local feature saliency.
Background
With the development of Internet multimedia technology and the rapid growth of visual data, processing these massive data has become a challenge of the new era. Scene classification, as a key technology for image retrieval and image recognition, has become an important and challenging research topic in computer vision, and it is widely applied in remote sensing image analysis, video surveillance, robot perception and other fields. Research on scene classification technology to improve the scene recognition capability of computers is therefore of great significance.
Image scene classification determines the scene to which a given image belongs by recognizing the information and content it contains. In recent years, deep learning has developed rapidly, gradually replacing traditional hand-crafted image features and bringing new progress to scene classification. However, deep learning requires large numbers of training samples, which small-scale scene datasets cannot provide; in practical applications it cannot be guaranteed that every scene category offers a sizable number of training images. Transfer learning provides a solution to this problem: a model pre-trained on a large-scale dataset is reused for another task and selectively fine-tuned on the target dataset to meet the requirements of the current task, an approach widely used in deep learning. The parameters of networks pre-trained on different datasets often differ greatly, and the features these networks extract from the task dataset reflect different properties of the data. Scene images are rich in content and complex in concept, so features extracted by a single type of pre-trained network are insufficient to describe them; a common remedy is to fuse features extracted by different networks into a more discriminative scene representation. Although features from different pre-trained models reflect different aspects of a scene, they describe the scene with different accuracy, and extracting their effective parts according to their characteristics remains a difficult problem with no general solution.
On the other hand, a convolutional neural network understands a scene image differently at different scales: features that cannot be extracted at one scale may be obtained at another, so combining scene information across multiple scales can effectively enrich the description of the image. However, the features extracted from multi-scale images do not always complement each other into a more accurate scene representation. For example, a small-scale image provides more detail, but it also amplifies the noise in the image, so the features must be filtered and screened reasonably. At present, multi-scale images are usually obtained by densely sampling the original image; taking a 256 × 256 pixel image as an example, local images of different sizes can be sampled from it by setting the size and sampling stride of the new images. Dense sampling produces a large number of local features, which are usually encoded with a method such as the Bag-of-Visual-Words (BoVW) model and finally aggregated into a new scene image description. Multi-scale scene descriptions obtained in this way suffer from heavy computation, structural redundancy and other drawbacks.
Disclosure of Invention
Based on the above, it is necessary to provide an image scene classification method and apparatus based on local feature saliency, which can solve the problems of large calculation amount and structural redundancy in multi-scale scene image description.
A method of image scene classification based on local feature saliency, the method comprising:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
In one embodiment, the method further comprises the following steps: calculating the mean value of the scene local features corresponding to all the image scene data blocks to determine a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain an initial scene local feature weight for each scene distance vector; adjusting the initial scene local feature weights according to a preset first hyper-parameter to obtain the scene local feature weights; and weighting the scene local features according to the scene local feature weights to obtain the enhanced scene local features.
In one embodiment, the method further comprises the following steps: taking the absolute difference between the object global feature and each object local feature to obtain a local feature distance vector corresponding to that object local feature; normalizing the local feature distance vectors to obtain an initial object local feature weight corresponding to each object local feature; adjusting the initial object local feature weights according to a preset second hyper-parameter to obtain the object local feature weights; and weighting the object local features according to the object local feature weights to obtain the enhanced object local features.
In one embodiment, the method further comprises the following steps: and fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features in a splicing mode to obtain fusion features of image scene data.
In one embodiment, the method further comprises the following steps: and inputting the fusion characteristics into a pre-trained linear support vector machine to obtain the scene classification of the image scene data.
In one embodiment, the method further comprises the following steps: normalizing the scene distance vectors to obtain the initial scene local feature weight of each scene distance vector, wherein the weight is determined by the scene local feature, the feature center, the number l of sampled sub-images among the scene local features, and the number n of images in the category.
In one embodiment, the method further comprises the following steps: normalizing the local feature distance vectors to obtain the initial object local feature weight corresponding to each object local feature, wherein the weight is determined by the object local feature and its local feature distance vector.
An apparatus for image scene classification based on local feature saliency, the apparatus comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the characteristic extraction module is used for respectively extracting scene local characteristics and object local characteristics in the image scene data block through a preset scene characteristic extraction model and an object characteristic extraction model, and extracting scene global characteristics and object global characteristics in the image scene data through the scene characteristic extraction model and the object characteristic extraction model;
the saliency module is used for respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
the fusion module is used for fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
According to the image scene classification method, device, computer equipment and storage medium based on local feature saliency, scene local features and object local features, as well as scene global features and object global features, are extracted by a preset scene feature extraction model and a preset object feature extraction model. Different weights are then assigned to the scene local features and the object local features, which improves the pertinence of the features; the weighted features are fused into a fusion feature, and the scene corresponding to the fusion feature is classified by the classification model. Because the local features are made salient through the weights, the amount of calculation is reduced and the problem of structural redundancy is solved.
Drawings
FIG. 1 is a flow diagram illustrating a method for image scene classification based on local feature saliency, according to an embodiment;
FIG. 2 is a diagram illustrating the elements of a scene image attended to by Places-CNN and ImageNet-CNN in one embodiment;
FIG. 3 is a block diagram of a model in one embodiment;
FIG. 4 is a schematic diagram of feature extraction in one embodiment;
FIG. 5 shows class activation maps of Places-CNN and ImageNet-CNN for different scenes in one embodiment;
FIG. 6 is a schematic illustration of feature fusion in one embodiment;
FIG. 7 is a block diagram of an image scene classification device based on local feature saliency in one embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided an image scene classification method based on local feature saliency, comprising the following steps:
Step 102, segmenting the image scene data to be classified to obtain image scene data blocks. The image scene data may be an image captured of a scene, and the image scene data blocks are obtained by segmenting the captured image.
Specifically, the collected original image may first be resized by bilinear interpolation, for example to 224 × 224, then normalized by subtracting the image mean of the ImageNet dataset and dividing by the standard deviation; after normalization the data follow a consistent distribution, which improves the generalization capability of the model. For the multi-scale images, the image processed in the previous step is resized again, for example to 448 × 448, and its four corners are cropped to form 4 image scene data blocks of size 224 × 224, which serve as small-scale supplementary data for the original scene image. Unlike dense sampling, this simplified sampling adds only 4 small-scale images to supplement the original image, reducing duplication and redundancy in the data.
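The following is a minimal sketch of this preprocessing step, not the patent's reference implementation: the bilinear resizing, ImageNet statistics, and the four corner coordinates are assumptions consistent with the sizes given in this embodiment.

```python
# Illustrative preprocessing sketch: a 224x224 global view plus four 224x224
# corner crops taken from a 448x448 resize of the same image.
import torch
from PIL import Image
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]   # ImageNet channel means (assumed)
IMAGENET_STD = [0.229, 0.224, 0.225]    # ImageNet channel standard deviations (assumed)

to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

def build_views(path: str):
    """Return the normalized global view (3, 224, 224) and four local views (4, 3, 224, 224)."""
    img = Image.open(path).convert("RGB")
    global_view = to_tensor(img.resize((224, 224), Image.BILINEAR))
    large = img.resize((448, 448), Image.BILINEAR)
    corners = [(0, 0), (224, 0), (0, 224), (224, 224)]          # top-left of each crop
    locals_ = [to_tensor(large.crop((x, y, x + 224, y + 224))) for x, y in corners]
    return global_view, torch.stack(locals_)
```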
It should be noted that the above dimensions and number of crops are merely examples; other values can also achieve the technical effects of the present invention.
Step 104, respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model.
The scene feature extraction model and the object feature extraction model may both be convolutional neural networks; specifically, the scene feature extraction model may be Places-CNN and the object feature extraction model may be ImageNet-CNN.
Specifically, the deep network selected by the invention is DenseNet; for example, DenseNet161 proposed by Gao Huang et al. can be used as the base network and built in the PyTorch deep learning framework. During feature extraction, DenseNet is set to test mode, in which the Dropout layers used for regularization scale neuron outputs by their retention probability, and the final feature vector is the output of the last convolutional layer of DenseNet.
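A minimal sketch of this feature extraction step, assuming torchvision's DenseNet-161 with ImageNet weights as a stand-in (a Places-CNN would load Places-pretrained weights into the same architecture, which torchvision does not bundle); the 2208-dimensional pooled output matches the dimensionality discussed in the embodiments below.

```python
# Illustrative feature extraction sketch: global-average-pooled output of the
# last convolutional block of DenseNet-161 (2208 dimensions per view).
import torch
import torch.nn.functional as F
from torchvision import models

backbone = models.densenet161(weights="IMAGENET1K_V1")
backbone.eval()  # test mode: Dropout and BatchNorm behave deterministically

@torch.no_grad()
def extract_features(views: torch.Tensor) -> torch.Tensor:
    """views: (N, 3, 224, 224) -> (N, 2208) feature vectors."""
    fmap = backbone.features(views)           # last Dense Block output, (N, 2208, 7, 7)
    fmap = F.relu(fmap)                       # torchvision applies ReLU before pooling
    return F.adaptive_avg_pool2d(fmap, 1).flatten(1)
```

Applied to the global view and to the stacked local views from the previous sketch, this yields the 1 × 2208 global feature and 4 × 2208 local features per network.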
Step 106, respectively obtaining the enhanced scene local features and the enhanced object local features by setting the weight of each scene local feature and each object local feature.
In this step, the scene local features and the object local features are made salient: for the scene local features the purpose is to highlight details, while for the object local features the purpose is to retain the subject.
Step 108, fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features to obtain the fusion feature of the image scene data.
Features can be fused in various ways, such as element-wise addition or concatenation (splicing).
Step 110, inputting the fusion feature into a pre-trained classification model to obtain the scene classification of the image scene data.
The classification model may be a support vector machine, logistic regression, etc., and is not limited herein.
According to the image scene classification method based on local feature saliency, scene local features and object local features, as well as scene global features and object global features, are extracted by a preset scene feature extraction model and a preset object feature extraction model. Different weights are then assigned to the scene local features and the object local features, which improves the pertinence of the features; the weighted features are fused into a fusion feature, and the scene corresponding to the fusion feature is classified by the classification model. Because local feature saliency is realized through the weights, the amount of calculation is reduced and the problem of structural redundancy is solved.
In one embodiment, the mean of the scene local features corresponding to all image scene data blocks is calculated to determine a feature center; a scene distance vector of each scene local feature is determined according to its distance to the feature center; the scene distance vectors are normalized to obtain an initial scene local feature weight for each scene distance vector; the initial scene local feature weights are adjusted according to a preset first hyper-parameter to obtain the scene local feature weights; and the scene local features are weighted according to the scene local feature weights to obtain the enhanced scene local features.
Specifically, the feature center is determined as the mean of the scene local features corresponding to all image scene data blocks, where l denotes the number of sampled sub-images and n denotes the number of images in the category.
The scene distance vector of each scene local feature is determined from its distance to the feature center. Specifically, the absolute difference between the feature center and each scene local feature is taken as the scene distance vector from that local feature to the feature center; this vector reflects the degree of dispersion of each dimension of the feature.
In a specific embodiment, the scene distance vectors are normalized to obtain the initial scene local feature weight of each scene distance vector, where the weight is determined by the scene local feature, the feature center, the number l of sampled sub-images among the scene local features, and the number n of images in the category.
In this embodiment, a relatively accurate global feature is already available at the original scale, while prominent scene details are needed at the small scale; therefore, values farther from the feature center are reinforced so as to complement the scene global feature. This normalization highlights the local details of the scene.
In addition, the initial scene local feature weights are adjusted according to a preset first hyper-parameter to obtain the scene local feature weights. Finally, the enhanced scene local feature is obtained as the sum of the products of the weights and the local features.
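The computation described above can be formalized as follows. This is a hedged reconstruction from the surrounding description: the symbols (x_i for the i-th scene local feature, c for the feature center, λ1 for the first hyper-parameter) are chosen for illustration, and the operations may be read element-wise over the feature dimensions.

```latex
% Hedged reconstruction of the scene-saliency step (symbols are illustrative).
\begin{aligned}
c &= \frac{1}{l\,n}\sum_{j=1}^{n}\sum_{i=1}^{l} x_{j,i}
  && \text{feature center: mean of the scene local features of the category}\\
d_i &= \lvert x_i - c \rvert
  && \text{scene distance vector of local feature } x_i\\
w_i^{0} &= \frac{d_i}{\sum_{k=1}^{l} d_k}
  && \text{normalized initial weight, larger for features far from the center}\\
w_i &= \lambda_1\, w_i^{0}, \qquad
\tilde{x} = \sum_{i=1}^{l} w_i \odot x_i
  && \text{first hyper-parameter adjustment and weighted sum}
\end{aligned}
```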
In one embodiment, the object global feature is subtracted from each object local feature and the absolute value is taken to obtain a local feature distance vector corresponding to that object local feature; the local feature distance vectors are normalized to obtain an initial object local feature weight corresponding to each object local feature; the initial object local feature weights are adjusted according to a preset second hyper-parameter to obtain the object local feature weights; and the object local features are weighted according to the object local feature weights to obtain the enhanced object local features.
Specifically, the object local features are those extracted by ImageNet-CNN. Image scene data contain rich object content, and the features extracted from small-scale images in particular contain a large number of object details. Such detail features adversely affect the role of the local features as a complement to the global features, so directly using the features extracted by ImageNet-CNN as a complement to the scene global features gives unsatisfactory results. The object global feature is therefore used as an object semantic center to guide and correct the small-scale local features, reducing the detail components in the object features and obtaining more appropriate object local features.
The specific steps are as follows: the absolute difference between the object global feature and each object local feature is taken to obtain the local feature distance vectors. Because the global feature is used to guide the local vectors, features closer to the global feature are given higher weights, in contrast to the scene local features, where details are highlighted. The object local feature weights are then calculated from these normalized distance vectors.
Specifically, the local feature saliency step forms a weighted sum of the 4 local features, and the weights must sum to 1, which is the purpose of the normalization; dividing by 3 makes the sum of the 4 weights exactly 1.
Finally, the object local feature weights are applied to the object local features to obtain the corrected enhanced object local features.
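Analogously, a hedged reconstruction of the object-saliency step: g denotes the object global feature, f_i the i-th object local feature (l = 4 corner crops here), and λ2 the second hyper-parameter, all illustrative. The division by l - 1 = 3 encodes the statement above that the four weights sum to 1.

```latex
% Hedged reconstruction of the object-saliency step (symbols are illustrative).
\begin{aligned}
d_i &= \lvert g - f_i \rvert, \qquad
\hat{d}_i = \frac{d_i}{\sum_{k=1}^{l} d_k}
  && \text{normalized distance of each local feature to the object global feature}\\
w_i^{0} &= \frac{1 - \hat{d}_i}{l - 1}
  && \text{closer features receive larger weights; for } l = 4 \text{ the weights sum to } 1\\
w_i &= \lambda_2\, w_i^{0}, \qquad
\tilde{f} = \sum_{i=1}^{l} w_i \odot f_i
  && \text{second hyper-parameter adjustment and weighted sum}
\end{aligned}
```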
in one embodiment, the enhanced scene local features, the enhanced object local features, the scene global features and the object global features are fused in a splicing mode to obtain fusion features of image scene data.
In this embodiment, concatenation is chosen because the scene local features and the object local features are two distinct kinds of features, and adding them semantically would be meaningless.
In one embodiment, the fusion features are input into a pre-trained linear support vector machine to obtain scene classification of image scene data.
In this embodiment, the linear support vector machine maximizes the margin between classes while ensuring a certain training accuracy, thereby reducing overfitting.
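A minimal sketch of the fusion and classification stage, assuming four 2208-dimensional descriptors per image concatenated into an 8832-dimensional fusion feature and fed to scikit-learn's LinearSVC; the random stand-in data and the C and max_iter values are placeholders for illustration.

```python
# Illustrative fusion-and-classification sketch: concatenate the four descriptors
# and train a linear SVM on the fused features.
import numpy as np
from sklearn.svm import LinearSVC

def fuse(scene_local, object_local, scene_global, object_global):
    """Concatenate enhanced local and global descriptors into one fusion feature."""
    return np.concatenate([scene_local, object_local, scene_global, object_global], axis=-1)

rng = np.random.default_rng(0)
feats = fuse(*[rng.standard_normal((100, 2208)) for _ in range(4)])   # (100, 8832) toy features
labels = rng.integers(0, 5, size=100)                                 # pretend 5 scene categories
clf = LinearSVC(C=1.0, max_iter=5000).fit(feats, labels)
print(clf.predict(feats[:3]))
```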
In summary, the invention achieves the following beneficial effects:
1. Compared with the common approach of dense sampling and semantic-code construction, the simplified multi-scale scene image sampling scheme and the multi-scale feature generation method significantly reduce the amount of computation.
2. A feature saliency method, i.e., an optimization scheme based on network features of different depths, is used. Because this optimization is combined with the characteristics of the features, the scene description capability of the fused features is effectively improved, the optimized features complement each other more fully, and the scene classification accuracy is improved.
The following description will explain the advantageous effects of the present invention by using a specific example.
FIG. 2 shows the different elements of a scene image attended to by Places-CNN and ImageNet-CNN in the model. The two networks focus on clearly different content: scene images are rich in content and elements, and the features extracted by Places-CNN tend to capture more holistic and spatial characteristics, whereas the features extracted by ImageNet-CNN pay more attention to details, especially the details of individual objects.
Fig. 3 is an overall framework diagram of the present invention, which includes the following three steps:
First, feature extraction. The constructed Places-CNN and ImageNet-CNN are used to extract features from the scene image at two scales.
Second, feature saliency. The extracted features of different types and scales are optimized; this includes two parts: highlighting details in the scene local features and retaining the subject in the object local features.
Third, feature fusion and classification. The optimized features are concatenated along the feature dimension, and classification is completed with a linear support vector machine.
Fig. 4 depicts the feature extraction process of the first step. In the feature extraction stage, the input image is propagated forward, and the output of the last Dense Block is taken as the feature extracted by each of the two types of convolutional neural networks. For an input image of size 224 × 224, the feature obtained after global average pooling has dimension 1 × 2208; the local features extracted by one network thus have dimension 4 × 2208, and the global feature has dimension 1 × 2208.
Fig. 5 shows class activation maps of the two types of networks for different scenes. Facing the same classification task, the different types of deep features have clearly different activation regions and classification effects. FIG. 5 visualizes the activations of Places-CNN and ImageNet-CNN on some scene images of the MIT Indoor 67 dataset; class activation mapping is used to visualize the important visual attention regions of the different CNNs (brighter areas indicate stronger discriminative power), reflecting the different properties of scene features and object features. As can be seen from the figure, the activation regions and brightness of Places-CNN are clearly larger than those of ImageNet-CNN, which also explains why Places-CNN performs better than ImageNet-CNN on the scene classification task. Unlike Places-CNN, which focuses more on scene features, ImageNet-CNN places its visual emphasis on certain scene objects, such as the toilet and cabinet in Bathroom and the tables and chairs in Fastfood restaurant.
Fig. 6 shows the two feature fusion strategies considered in the third step, feature fusion and classification: dimensional concatenation, shown on the left of Fig. 6, and dimensional addition, shown on the right. Considering that the scene local features and the object local features are two distinct kinds of features whose semantic addition is meaningless, the invention adopts the first strategy, concatenation.
It should be understood that, although the steps in the flowchart of Fig. 1 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in Fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided an image scene classification apparatus based on local feature saliency, including: a segmentation module 702, a feature extraction module 704, a saliency module 706, a fusion module 708, and a classification module 710, wherein:
a segmentation module 702, configured to segment image scene data to be classified to obtain an image scene data block;
a feature extraction module 704, configured to respectively extract a scene local feature and an object local feature in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extract a scene global feature and an object global feature in the image scene data through the scene feature extraction model and the object feature extraction model;
a saliency module 706, configured to obtain enhanced scene local features and enhanced object local features by setting weights of each of the scene local features and the object local features;
a fusion module 708, configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature to obtain a fusion feature of image scene data;
and the classification module 710 is configured to input the fusion features into a pre-trained classification model to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to calculate a mean value of all the image scene data blocks corresponding to the scene local features, and determine a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector; adjusting the initial scene local feature weight according to a preset first hyper-parameter to obtain a scene local feature weight; and weighting the scene local features according to the scene local feature weight to obtain enhanced scene local features.
In one embodiment, the saliency module 706 is further configured to perform a difference between the object global feature and the object local feature, and take an absolute value to obtain a local feature distance vector corresponding to the object local feature; normalizing the local feature distance vectors to obtain an initial object local feature weight corresponding to each object local feature; adjusting the initial object local characteristic weight according to a preset second hyperparameter to obtain an object local characteristic weight; and weighting the local features of the object according to the weight of the local features of the object to obtain the local features of the enhanced object.
In one embodiment, the fusion module 708 is further configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature in a splicing manner to obtain a fusion feature of the image scene data.
In one embodiment, the classification module 710 is further configured to input the fusion features into a pre-trained linear support vector machine to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to normalize the scene distance vectors, and obtain an initial scene local feature weight of each scene distance vector as:
wherein ,the weight of the local feature of the initial scene is represented,the local features of the scene are represented,the feature center is represented, l represents the number of sampled pictures in the scene local feature, and n represents the number of images in the category.
In one embodiment, the saliency module 706 is further configured to normalize the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature as follows:
wherein ,the local feature weights of the initial object are represented,the local features of the object are represented,representing a local feature distance vector.
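As a code-level complement to the embodiments above, the following PyTorch-style sketch shows the two weightings the saliency module performs; the function names, tensor shapes and the lambda1/lambda2 hyper-parameters are assumptions for illustration rather than the patent's reference implementation.

```python
# Illustrative saliency sketch: per-dimension weighting of the 4 small-scale features.
import torch

def enhance_scene_local(local_feats: torch.Tensor, center: torch.Tensor, lambda1: float = 1.0):
    """local_feats: (4, D) scene local features; center: (D,) category feature center."""
    dist = (local_feats - center).abs()                              # distance to the feature center
    weights = lambda1 * dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)
    return (weights * local_feats).sum(dim=0)                        # emphasize details far from the center

def enhance_object_local(local_feats: torch.Tensor, global_feat: torch.Tensor, lambda2: float = 1.0):
    """local_feats: (4, D) object local features; global_feat: (D,) object global feature."""
    dist = (global_feat - local_feats).abs()
    norm = dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)
    weights = lambda2 * (1.0 - norm) / (local_feats.shape[0] - 1)    # closer to the global feature => larger weight
    return (weights * local_feats).sum(dim=0)                        # keep the subject, suppress detail noise
```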
For specific definition of the image scene classification device based on local feature saliency, refer to the above definition of the image scene classification method based on local feature saliency, and details thereof are not repeated here. The modules in the image scene classification device based on local feature saliency can be fully or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for image scene classification based on local feature saliency. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. An image scene classification method based on local feature saliency, characterized in that the method comprises:
segmenting image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
2. The method of claim 1, wherein the obtaining of enhanced scene local features by setting a weight of each scene local feature comprises:
calculating the mean value of the scene local features corresponding to all the image scene data blocks, and determining a feature center;
determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center;
normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector;
adjusting the initial scene local feature weight according to a preset first hyper-parameter to obtain a scene local feature weight;
and weighting the scene local features according to the scene local feature weight to obtain enhanced scene local features.
3. The method of claim 1, wherein obtaining enhanced object local features by setting a weight of each of the object local features comprises:
making a difference between the object global feature and the object local feature, and taking an absolute value to obtain a local feature distance vector corresponding to the object local feature;
normalizing the local feature distance vectors to obtain an initial object local feature weight corresponding to each object local feature;
adjusting the initial object local characteristic weight according to a preset second hyperparameter to obtain an object local characteristic weight;
and weighting the local features of the object according to the weight of the local features of the object to obtain the local features of the enhanced object.
4. The method according to any one of claims 1 to 3, wherein fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fused feature of image scene data comprises:
and fusing the enhanced scene local features, the enhanced object local features, the scene global features and the object global features in a splicing mode to obtain fusion features of image scene data.
5. The method of any one of claims 1 to 3, wherein inputting the fused features into a pre-trained classification model to obtain a scene classification of the image scene data comprises:
and inputting the fusion characteristics into a pre-trained linear support vector machine to obtain the scene classification of the image scene data.
6. The method of claim 2, wherein normalizing the scene distance vectors to obtain an initial scene local feature weight for each scene distance vector comprises:
normalizing the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector as follows:
7. The method of claim 3, wherein normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature comprises:
normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature as follows:
8. An apparatus for classifying an image scene based on local feature saliency, the apparatus comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the characteristic extraction module is used for respectively extracting scene local characteristics and object local characteristics in the image scene data block through a preset scene characteristic extraction model and an object characteristic extraction model, and extracting scene global characteristics and object global characteristics in the image scene data through the scene characteristic extraction model and the object characteristic extraction model;
the saliency module is used for respectively obtaining enhanced scene local features and enhanced object local features by setting the weight of each scene local feature and each object local feature;
the fusion module is used for fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature and the object global feature to obtain a fusion feature of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain the scene classification of the image scene data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010928765.7A CN112001399B (en) | 2020-09-07 | 2020-09-07 | Image scene classification method and device based on local feature saliency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010928765.7A CN112001399B (en) | 2020-09-07 | 2020-09-07 | Image scene classification method and device based on local feature saliency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112001399A true CN112001399A (en) | 2020-11-27 |
CN112001399B CN112001399B (en) | 2023-06-09 |
Family
ID=73468773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010928765.7A Active CN112001399B (en) | 2020-09-07 | 2020-09-07 | Image scene classification method and device based on local feature saliency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112001399B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112699855A (en) * | 2021-03-23 | 2021-04-23 | 腾讯科技(深圳)有限公司 | Image scene recognition method and device based on artificial intelligence and electronic equipment |
CN112907138A (en) * | 2021-03-26 | 2021-06-04 | 国网陕西省电力公司电力科学研究院 | Power grid scene early warning classification method and system from local perception to overall perception |
CN113128527A (en) * | 2021-06-21 | 2021-07-16 | 中国人民解放军国防科技大学 | Image scene classification method based on converter model and convolutional neural network |
CN113657462A (en) * | 2021-07-28 | 2021-11-16 | 讯飞智元信息科技有限公司 | Method for training vehicle recognition model, vehicle recognition method and computing device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110229045A1 (en) * | 2010-03-16 | 2011-09-22 | Nec Laboratories America, Inc. | Method and system for image classification |
CN110555446A (en) * | 2019-08-19 | 2019-12-10 | 北京工业大学 | Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning |
CN111079674A (en) * | 2019-12-22 | 2020-04-28 | 东北师范大学 | Target detection method based on global and local information fusion |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110229045A1 (en) * | 2010-03-16 | 2011-09-22 | Nec Laboratories America, Inc. | Method and system for image classification |
CN110555446A (en) * | 2019-08-19 | 2019-12-10 | 北京工业大学 | Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning |
CN111079674A (en) * | 2019-12-22 | 2020-04-28 | 东北师范大学 | Target detection method based on global and local information fusion |
Non-Patent Citations (1)
Title |
---|
SHI Jing, ZHU Hong, WANG Jing, XUE Shan: "Indoor scene classification algorithm based on information enhancement of visually sensitive regions" (基于视觉敏感区域信息增强的室内场景分类算法), Pattern Recognition and Artificial Intelligence (模式识别与人工智能), vol. 30, no. 6, pages 520-529 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112699855A (en) * | 2021-03-23 | 2021-04-23 | 腾讯科技(深圳)有限公司 | Image scene recognition method and device based on artificial intelligence and electronic equipment |
CN112907138A (en) * | 2021-03-26 | 2021-06-04 | 国网陕西省电力公司电力科学研究院 | Power grid scene early warning classification method and system from local perception to overall perception |
CN112907138B (en) * | 2021-03-26 | 2023-08-01 | 国网陕西省电力公司电力科学研究院 | Power grid scene early warning classification method and system from local to whole perception |
CN113128527A (en) * | 2021-06-21 | 2021-07-16 | 中国人民解放军国防科技大学 | Image scene classification method based on converter model and convolutional neural network |
CN113657462A (en) * | 2021-07-28 | 2021-11-16 | 讯飞智元信息科技有限公司 | Method for training vehicle recognition model, vehicle recognition method and computing device |
Also Published As
Publication number | Publication date |
---|---|
CN112001399B (en) | 2023-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112001399B (en) | Image scene classification method and device based on local feature saliency | |
US20220189142A1 (en) | Ai-based object classification method and apparatus, and medical imaging device and storage medium | |
US20220165053A1 (en) | Image classification method, apparatus and training method, apparatus thereof, device and medium | |
CN111079632A (en) | Training method and device of text detection model, computer equipment and storage medium | |
CN116168017B (en) | Deep learning-based PCB element detection method, system and storage medium | |
CN114897779A (en) | Cervical cytology image abnormal area positioning method and device based on fusion attention | |
CN112528845B (en) | Physical circuit diagram identification method based on deep learning and application thereof | |
CN113538441A (en) | Image segmentation model processing method, image processing method and device | |
CN111667459B (en) | Medical sign detection method, system, terminal and storage medium based on 3D variable convolution and time sequence feature fusion | |
CN115858847B (en) | Combined query image retrieval method based on cross-modal attention reservation | |
CN111666931A (en) | Character and image recognition method, device and equipment based on mixed convolution and storage medium | |
CN111159450A (en) | Picture classification method and device, computer equipment and storage medium | |
CN114821736A (en) | Multi-modal face recognition method, device, equipment and medium based on contrast learning | |
CN110162689B (en) | Information pushing method, device, computer equipment and storage medium | |
CN117636298A (en) | Vehicle re-identification method, system and storage medium based on multi-scale feature learning | |
Niu et al. | Bidirectional feature learning network for RGB-D salient object detection | |
CN112465847A (en) | Edge detection method, device and equipment based on clear boundary prediction | |
WO2024011859A1 (en) | Neural network-based face detection method and device | |
CN116704511A (en) | Method and device for recognizing characters of equipment list | |
CN112116596A (en) | Training method of image segmentation model, image segmentation method, medium, and terminal | |
CN116030341A (en) | Plant leaf disease detection method based on deep learning, computer equipment and storage medium | |
CN113505247B (en) | Content-based high-duration video pornography content detection method | |
EP4246375A1 (en) | Model processing method and related device | |
WO2022222519A1 (en) | Fault image generation method and apparatus | |
CN115862112A (en) | Target detection model for facial image acne curative effect evaluation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||