CN112001399B - Image scene classification method and device based on local feature saliency - Google Patents

Image scene classification method and device based on local feature saliency

Info

Publication number
CN112001399B
CN112001399B (application number CN202010928765.7A / CN202010928765A)
Authority
CN
China
Prior art keywords
scene
local
features
feature
local feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010928765.7A
Other languages
Chinese (zh)
Other versions
CN112001399A (en)
Inventor
谢毓湘
张家辉
宫铨志
栾悉道
魏迎梅
康来
蒋杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010928765.7A priority Critical patent/CN112001399B/en
Publication of CN112001399A publication Critical patent/CN112001399A/en
Application granted granted Critical
Publication of CN112001399B publication Critical patent/CN112001399B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The application relates to an image scene classification method and device based on local feature saliency. The method comprises the following steps: the image scene data to be classified are divided to obtain image scene data blocks; scene local features and object local features are extracted from the data blocks through a preset scene feature extraction model and an object feature extraction model, and scene global features and object global features are extracted from the whole image through the same two models; enhanced scene local features and enhanced object local features are obtained by setting weights for the scene local features and the object local features; the enhanced scene local features, the enhanced object local features, the scene global features and the object global features are fused to obtain a fused feature of the image scene data; and the fused feature is input into a pre-trained classification model to obtain the scene classification of the image scene data. The method reduces the amount of computation and avoids structural redundancy.

Description

Image scene classification method and device based on local feature saliency
Technical Field
The application relates to the technical field of scene classification, in particular to an image scene classification method and device based on local feature saliency.
Background
With the development of Internet multimedia technology and the rapid growth of visual data, processing such massive data has become a major challenge. Scene classification, a key technology underlying image retrieval and image recognition, has become an important and challenging research topic in the field of computer vision. Scene classification is also widely applied in remote sensing image analysis, video surveillance, robot perception and other fields. Research on scene classification technology and improvement of a computer's scene recognition capability are therefore of great significance.
Image scene classification means that, for a given image, the scene to which it belongs is determined by recognizing the information and content the image contains, thereby achieving classification. In recent years deep learning has developed rapidly, gradually replacing traditional hand-crafted image features and bringing new progress to scene classification. However, deep learning requires a large number of training samples, which small-scale scene datasets cannot provide; in practical applications it cannot be guaranteed that every scene category offers a sufficient number of training images. Transfer learning provides ideas and solutions to this problem. Transfer learning is a machine learning method in which a pre-trained model is reused for another task: deep networks pre-trained on large-scale datasets are selected and selectively fine-tuned on the target dataset to fit the current task, and this approach is widely used in many deep learning problems. Networks pre-trained on different datasets often have very different parameter structures, and the features they extract on the task dataset reflect different properties of the data. Scene images are rich in content and complex in concept, and features extracted by only one type of pre-trained network are not sufficient to describe them, so a common practice is to fuse the features extracted by different networks into a more discriminative scene representation. Although features extracted by different pre-trained models reflect different aspects of a scene, they describe the scene with different accuracy; how to combine these features and extract their effective parts for fusion remains a difficult problem with no general solution at present. On the other hand, a convolutional neural network understands scene images differently at different scales, and features that cannot be extracted at one scale may be obtained at another, so combining scene image information at multiple scales can effectively enhance the scene description. However, the features extracted from multi-scale images do not always complement each other to yield a more accurate scene representation. For example, more detailed information can be extracted from a small-scale image, but the noise in the image is also amplified, so reasonably filtering and screening the features becomes a problem. At present, multi-scale images are usually obtained by densely sampling the original image: taking a 256×256-pixel image as an example, local images of different sizes can be sampled from it by setting the size of the new images and the sampling step. Dense sampling produces a relatively large number of local features, which usually have to be encoded with methods such as the Bag-of-Visual-Words (BoVW) model and finally aggregated into a new scene image description.
The multi-scale scene image description obtained by the method has the defects of large calculation amount, structural redundancy and the like.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image scene classification method and apparatus based on local feature saliency, which can solve the problems of large calculation amount and structural redundancy in multi-scale scene image description.
An image scene classification method based on local feature saliency, the method comprising:
dividing the image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting weights of each scene local feature and each object local feature;
fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
In one embodiment, the method further comprises: calculating the average value of the scene local features corresponding to all the image scene data blocks to determine a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain an initial scene local feature weight for each scene distance vector; adjusting the initial scene local feature weights according to a preset first hyper-parameter to obtain the scene local feature weights; and weighting the scene local features according to the scene local feature weights to obtain the enhanced scene local features.
In one embodiment, the method further comprises: taking the difference between the object global feature and each object local feature and taking the absolute value to obtain the local feature distance vector corresponding to each object local feature; normalizing the local feature distance vectors to obtain an initial object local feature weight for each object local feature; adjusting the initial object local feature weights according to a preset second hyper-parameter to obtain the object local feature weights; and weighting the object local features according to the object local feature weights to obtain the enhanced object local features.
In one embodiment, the method further comprises: and fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object in a splicing mode to obtain fused features of the image scene data.
In one embodiment, the method further comprises: inputting the fusion features into a pre-trained linear support vector machine to obtain scene classification of the image scene data.
In one embodiment, the method further comprises: normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of local images sampled from a scene image, and $n$ represents the number of images in the category (used when computing the feature center).
In one embodiment, the method further comprises: normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature, wherein the initial object local feature weight is as follows:
$$\hat{w}_{j}^{O}=\frac{1}{l-1}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{l}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$

wherein $\hat{w}_{j}^{O}$ represents the initial object local feature weight, $f_{j}^{O}$ represents an object local feature, $f_{G}^{O}$ represents the object global feature, and $\left|f_{j}^{O}-f_{G}^{O}\right|$ is the corresponding local feature distance vector.
An image scene classification device based on local feature saliency, the device comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the feature extraction module is used for respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
the saliency module is used for respectively obtaining the enhanced scene local features and the enhanced object local features by setting weights for each scene local feature and each object local feature;
the fusion module is used for fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fusion features of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
dividing the image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting weights of each scene local feature and each object local feature;
fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
dividing the image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting weights of each scene local feature and each object local feature;
fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
With the image scene classification method, device, computer equipment and storage medium based on local feature saliency, the scene local features and object local features, as well as the scene global features and object global features, are extracted through the preset scene feature extraction model and object feature extraction model. Different weights are then set for the scene local features and the object local features, which improves the pertinence of the features; the fused feature is obtained through feature fusion, and the scene corresponding to the fused feature can be classified through the classification model. Because the local features are made salient through the weights, the amount of computation is reduced and the problem of structural redundancy is solved.
Drawings
FIG. 1 is a flow diagram of an image scene classification method based on local feature saliency in one embodiment;
FIG. 2 is a schematic diagram of the scene image elements attended to by Places-CNN and ImageNet-CNN in one embodiment;
FIG. 3 is a frame diagram of a model in one embodiment;
FIG. 4 is a schematic diagram of feature extraction in one embodiment;
FIG. 5 shows class activation maps of Places-CNN and ImageNet-CNN for different scenes in one embodiment;
FIG. 6 is a schematic diagram of feature fusion in one embodiment;
FIG. 7 is a block diagram of an image scene classification device based on local feature saliency in one embodiment;
fig. 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided an image scene classification method based on local feature saliency, including the steps of:
and 102, dividing the image scene data to be classified to obtain image scene data blocks.
The image scene data may be an image acquired in a scene, and the image scene data block may be obtained by dividing the acquired image.
Specifically, the acquired original image can be resized by bilinear interpolation, for example to 224×224; the image mean of the ImageNet dataset is then subtracted and the result is divided by the standard deviation, so that the normalized data follow a common distribution and the generalization ability of the model is improved. For the multi-scale images, the image processed in the previous step is resized, for example to 448×448, and the four corners of the resized image are cropped to form 4 image scene data blocks. Each local image is 224×224 and serves as small-scale supplementary data for the original scene image. Unlike dense sampling, this simple sampling adds only 4 small-scale images to supplement the original image, reducing repetition and redundancy in the data.
It should be noted that the above dimensions and the number of cuts are examples, and other values may be used to achieve the technical effects of the present invention.
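By way of illustration only, a minimal sketch of this preprocessing (assuming Python with PIL and torchvision, the example sizes of 224×224 and 448×448, and the standard ImageNet normalization constants; the helper name preprocess is hypothetical and not part of the claimed method) could look like:

from PIL import Image
from torchvision import transforms

# Standard ImageNet statistics (assumed) for mean subtraction and std-deviation normalization
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

def preprocess(image):
    """Return the normalized 224x224 global image and four 224x224 corner crops."""
    # Bilinear resize of the original image to the global scale
    global_img = image.resize((224, 224), Image.BILINEAR)
    # Resize the processed image to the larger scale and cut its four corners
    large = global_img.resize((448, 448), Image.BILINEAR)
    boxes = [(0, 0, 224, 224), (224, 0, 448, 224),
             (0, 224, 224, 448), (224, 224, 448, 448)]
    local_imgs = [large.crop(b) for b in boxes]
    return to_tensor(global_img), [to_tensor(p) for p in local_imgs]

In this sketch the normalization is applied after cropping rather than before resizing, purely for simplicity of the code.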
Step 104, extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model respectively.
The scene feature extraction model and the object feature extraction model can be CNN networks, specifically, the scene feature extraction model can be Places-CNN and the object feature extraction model can be ImageNet-CNN.
Specifically, the deep network selected by the invention is denoted DenseNet; in particular, the network can be built in the PyTorch deep learning framework using the DenseNet161 proposed by Gao Huang as the base network. During feature extraction, DenseNet is set to test mode, in which Dropout (used for regularization) scales neuron outputs by probability values instead of dropping them randomly, and the final feature vector is the output of the last convolutional layer of DenseNet.
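As a non-limiting sketch of this setup (assuming the torchvision implementation of DenseNet161; the checkpoint path for a Places365-pretrained scene network is hypothetical, and the helper names are illustrative), the feature extractor could be built as follows:

import torch
import torch.nn.functional as F
from torchvision import models

def build_extractor(weights_path=None):
    """Build a DenseNet161 feature extractor and set it to test (eval) mode."""
    net = models.densenet161(pretrained=(weights_path is None))  # ImageNet weights by default
    if weights_path is not None:
        # e.g. a Places365-pretrained checkpoint for the scene network (hypothetical path)
        net.load_state_dict(torch.load(weights_path), strict=False)
    net.eval()  # test mode: Dropout scales activations instead of randomly dropping them
    return net

@torch.no_grad()
def extract_feature(net, img_tensor):
    """Return the output of the last convolutional stage, globally average pooled."""
    fmap = net.features(img_tensor.unsqueeze(0))              # 1 x 2208 x 7 x 7 for a 224x224 input
    feat = F.adaptive_avg_pool2d(F.relu(fmap), 1).flatten(1)  # 1 x 2208
    return feat

Features for the 4 image scene data blocks and for the global image are then obtained by calling extract_feature on each of the corresponding tensors.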
And 106, respectively obtaining the enhanced scene local feature and the enhanced object local feature by setting the weight of each scene local feature and each object local feature.
In this step, the scene local features and the object local features are made salient: for the scene local features the purpose is to highlight details, and for the object local features the purpose is to preserve the main body.
And step 108, fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of the image scene data.
There are various ways of feature fusion, such as addition operation, splicing operation, etc.
Step 110, inputting the fusion features into a pre-trained classification model to obtain scene classification of the image scene data.
The classification model may be a support vector machine, logistic regression, etc., and is not limited in this regard.
In the image scene classification method based on local feature saliency, the scene local features and object local features, as well as the scene global features and object global features, are extracted through the preset scene feature extraction model and object feature extraction model. Different weights are then set for the scene local features and the object local features, which improves the pertinence of the features; the fused feature is obtained through feature fusion, and the scene corresponding to the fused feature can be classified through the classification model. Because the local features are made salient through the weights, the amount of computation is reduced and the problem of structural redundancy is solved.
In one embodiment, the average value of the scene local features corresponding to all the image scene data blocks is calculated to determine a feature center; a scene distance vector of each scene local feature is determined according to the distance from the scene local feature to the feature center; the scene distance vectors are normalized to obtain an initial scene local feature weight for each scene distance vector; the initial scene local feature weights are adjusted according to a preset first hyper-parameter to obtain the scene local feature weights; and the scene local features are weighted according to the scene local feature weights to obtain the enhanced scene local features.
Specifically, the average value of the local features of the corresponding scenes of all the image scene data blocks is calculated, and the feature center is determined as follows:
$$\bar{f}^{S}=\frac{1}{n\,l}\sum_{i=1}^{n}\sum_{j=1}^{l}f_{i,j}^{S}$$

wherein $\bar{f}^{S}$ represents the feature center, $f_{i,j}^{S}$ represents a scene local feature, $l$ represents the number of local images sampled from each scene image, and $n$ is the number of images in the category.
According to the distance from each scene local feature to the feature center, the scene distance vector of that scene local feature is determined. Specifically, the difference between the feature center and each scene local feature is taken and its absolute value is used as the scene distance vector from the scene local feature to the feature center; this vector reflects the degree of dispersion of each feature dimension.
In a specific embodiment, the scene distance vectors are normalized, and the initial scene local feature weight of each scene distance vector is obtained as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of local images sampled from a scene image, and $n$ represents the number of images in the category.
In this embodiment, since a relatively accurate global feature already exists at the original scale, what is needed at the small scale are prominent scene details, so the values far from the feature center are strengthened to supplement the scene global feature. Through this normalization, the local details of the scene can be highlighted.
In addition, the initial scene local feature weights are adjusted according to a preset first hyper-parameter $\beta$, which controls the degree of influence of the initial weights, to obtain the scene local feature weights $w_{j}^{S}$.
Finally, the enhanced scene local feature $F^{S}$ is obtained as the sum of the products of the weights and the local features:

$$F^{S}=\sum_{j=1}^{l}w_{j}^{S}\odot f_{j}^{S}$$

where $\odot$ denotes element-wise multiplication.
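A minimal sketch of this scene-feature saliency step is given below. It assumes the reconstruction above (per-dimension distances to the feature center, normalization of those distances over the l local features, and an element-wise weighted sum); the multiplicative use of the first hyper-parameter, written here as beta, is an assumed form, and for self-containment the feature center is computed from the local features of a single image rather than over the whole category as described above:

import torch

def enhance_scene_local(local_feats, beta=1.0):
    """local_feats: l x d tensor of scene local features (l = 4 in the example).

    Returns the enhanced scene local feature (1 x d), emphasizing dimensions that
    lie far from the feature center. beta stands in for the first hyper-parameter;
    its exact use in the method is an assumption of this sketch.
    """
    center = local_feats.mean(dim=0, keepdim=True)                  # feature center (per image here)
    dist = (local_feats - center).abs()                             # scene distance vectors
    init_w = dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)  # initial weights, sum to 1 per dim
    w = beta * init_w                                               # hyper-parameter adjustment (assumed)
    return (w * local_feats).sum(dim=0, keepdim=True)               # weighted sum of local features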
In one embodiment, the global feature of the object and the local feature of the object are subjected to difference, and a local feature distance vector corresponding to the local feature of the object is obtained; normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature; according to a preset second super parameter, adjusting the local feature weight of the initial object to obtain the local feature weight of the object; and weighting the local characteristics of the object according to the local characteristic weights of the object to obtain the local characteristics of the enhanced object.
Specifically, corresponding to the scene local features are the object local features extracted by ImageNet-CNN. Because the image scene data contain rich object content, the features extracted from the small-scale images in particular contain a large amount of object detail; such detail features adversely affect the scene local features that already serve as a complement to the global features, so directly using the global features extracted by ImageNet-CNN as the complementary global features of the scene is not ideal. Therefore, the extracted global feature is used as the semantic center of the object features to guide and correct the small-scale local features, reducing the detailed parts of the object features and thereby obtaining more suitable object local features.
The method comprises the following specific steps:
A local feature distance vector is obtained by taking the absolute value of the difference between the object global feature and each object local feature. Because the global feature is needed to guide the local vectors, features closer to the global feature are given higher weights, in contrast to the scene local features, which highlight details. The initial object local feature weight $\hat{w}_{j}^{O}$ is calculated as follows:

$$\hat{w}_{j}^{O}=\frac{1}{3}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{4}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$
specifically, the process of local feature salification is to weight and add 4 local features, and the sum of the weights needs to be 1, which is also the purpose of normalization. Dividing by 3 can just make the sum of the 4 weights 1.
A preset second hyper-parameter $\lambda$ is then used to control the degree of influence of the object local feature weights $w_{j}^{O}$ obtained from the initial weights.
Finally, the object local feature weights are applied to the corrected object local features to obtain the enhanced object local feature:

$$F^{O}=\sum_{j=1}^{4}w_{j}^{O}\odot f_{j}^{O}$$
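Analogously, a sketch of the object-feature saliency step (with the division by l − 1 = 3 taken from the normalization described above, and with the second hyper-parameter lambda used multiplicatively as an assumption) might be:

import torch

def enhance_object_local(local_feats, global_feat, lam=1.0):
    """local_feats: l x d object local features; global_feat: 1 x d object global feature.

    Dimensions close to the global (semantic-center) feature receive higher weight;
    the complement weights are divided by l - 1 (= 3 for four crops) so they sum to 1.
    lam stands in for the second hyper-parameter; its exact use is an assumption.
    """
    l = local_feats.shape[0]
    dist = (local_feats - global_feat).abs()                           # local feature distance vectors
    norm_dist = dist / dist.sum(dim=0, keepdim=True).clamp_min(1e-12)  # normalized distances
    init_w = (1.0 - norm_dist) / (l - 1)                               # initial weights, sum to 1 per dim
    w = lam * init_w                                                   # hyper-parameter adjustment (assumed)
    return (w * local_feats).sum(dim=0, keepdim=True)                  # weighted sum of corrected features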
in one embodiment, a stitching mode is adopted to fuse the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object, so as to obtain the fused features of the image scene data.
In this embodiment, considering that the scene local features and the object local features are two distinct kinds of features, adding them is not semantically meaningful, so the splicing (concatenation) manner is selected.
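A sketch of this splicing strategy, assuming four 1×2208 feature vectors as in the example above, is simply:

import torch

def fuse_features(scene_local, object_local, scene_global, object_global):
    """Concatenate the four 1 x 2208 feature vectors into a single 1 x 8832 fused feature."""
    return torch.cat([scene_local, object_local, scene_global, object_global], dim=1)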
In one embodiment, the fusion features are input into a pre-trained linear support vector machine to obtain scene classification of the image scene data.
In this embodiment, the linear support vector machine can maximize the inter-class margin and reduce overfitting while ensuring a certain training accuracy.
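For the classification stage, one plausible realization of the pre-trained linear support vector machine is scikit-learn's LinearSVC; the hyper-parameter values below are illustrative only:

import numpy as np
from sklearn.svm import LinearSVC

def train_and_predict(train_feats, train_labels, test_feats):
    """train_feats / test_feats: arrays of fused features (n_samples x 8832)."""
    clf = LinearSVC(C=1.0, max_iter=10000)  # C is an illustrative value
    clf.fit(np.asarray(train_feats), np.asarray(train_labels))
    return clf.predict(np.asarray(test_feats))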
In summary, the beneficial effects achieved by the invention are as follows:
1. Compared with the common approach of dense sampling and semantic codebook construction, the amount of computation is remarkably reduced.
2. A feature saliency optimization scheme tailored to the characteristics of features from different deep networks is used. Combining the optimization with the properties of the features effectively improves the scene description capability of the fused features, makes the optimized features more fully complementary, and improves scene classification accuracy.
The advantageous effects of the present invention will be described in the following with reference to a specific example.
FIG. 2 illustrates the different elements of a scene image that Places-CNN and ImageNet-CNN attend to in the model. The two networks focus on different things: scene images are rich in content and elements, the image features extracted by Places-CNN tend to be more holistic and spatial, while the features extracted by ImageNet-CNN pay more attention to details, especially the details of individual objects.
FIG. 3 is a general frame diagram of the present invention, which includes the following three steps:
and a first step, extracting features. And performing feature extraction on the scene image on two scales by using the constructed Places-CNN and the image Net-CNN.
And secondly, the characteristic is highlighted. The method comprises the steps of optimizing the extracted features under different types and different scales, and specifically comprises two parts, namely a scene local feature highlighting detail and an object local feature preserving main body.
And thirdly, feature fusion and classification. And performing dimensional splicing on the optimized features, and then completing classification by using a linear support vector machine.
Fig. 4 depicts the feature extraction process of the first step. In the feature extraction stage, the input image is propagated forward and the output of the last Dense Block is used as the feature extracted by the two types of convolutional neural networks. When the input image size is 224×224, the feature obtained after global average pooling has dimension 1×2208. The local features extracted with one network have dimension 4×2208 and the global feature has dimension 1×2208.
Fig. 5 shows class activation maps of the two types of networks for different scenes. Facing the same classification task, different types of depth features have obviously different activation areas and classification effects. FIG. 5 shows visualizations of some scene images activated by Places-CNN and ImageNet-CNN on the MIT Indoor 67 dataset; class activation mapping is used to visualize the key visual attention areas of different CNNs (brighter regions of the image indicate stronger discriminability), reflecting the different properties of scene features and object features. As can be seen from the figure, the activation area and color brightness of Places-CNN are obviously higher than those of ImageNet-CNN, which also explains why Places-CNN performs better than ImageNet-CNN on scene classification tasks. Unlike Places-CNN, which focuses more on scene features, ImageNet-CNN places its visual emphasis on some objects in the scene, such as the toilet and cupboard in a bathroom, or the tables and chairs in a fast food restaurant.
Fig. 6 shows the two different feature fusion strategies of the third step (feature fusion and classification): one is concatenation along the feature dimension, as shown in the left part of Fig. 6, and the other is element-wise addition, as shown in the right part of Fig. 6. Considering that the scene local features and the object local features are two distinct kinds of features, adding them is not semantically meaningful, so the fusion strategy selected by the present invention is the first one.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are also not necessarily performed in sequence, and may be performed in turn or alternately with at least a portion of the other steps or of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided an image scene classification apparatus based on local feature saliency, including: a segmentation module 702, a feature extraction module 704, a saliency module 706, a fusion module 708, and a classification module 710, wherein:
the segmentation module 702 is configured to segment image scene data to be classified to obtain image scene data blocks;
a feature extraction module 704, configured to extract a scene local feature and an object local feature in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and extract a scene global feature and an object global feature in the image scene data through the scene feature extraction model and the object feature extraction model, respectively;
a saliency module 706, configured to obtain enhanced scene local features and enhanced object local features by setting weights for each scene local feature and each object local feature, respectively;
a fusion module 708, configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature to obtain a fused feature of the image scene data;
and the classification module 710 is configured to input the fusion feature into a pre-trained classification model to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to calculate the mean value of the scene local features corresponding to all the image scene data blocks and determine a feature center; determine a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalize the scene distance vectors to obtain an initial scene local feature weight for each scene distance vector; adjust the initial scene local feature weights according to a preset first hyper-parameter to obtain the scene local feature weights; and weight the scene local features according to the scene local feature weights to obtain the enhanced scene local features.
In one embodiment, the saliency module 706 is further configured to take the difference between the object global feature and each object local feature and take the absolute value to obtain the local feature distance vector corresponding to each object local feature; normalize the local feature distance vectors to obtain an initial object local feature weight for each object local feature; adjust the initial object local feature weights according to a preset second hyper-parameter to obtain the object local feature weights; and weight the object local features according to the object local feature weights to obtain the enhanced object local features.
In one embodiment, the fusion module 708 is further configured to fuse the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature in a stitching manner, so as to obtain a fused feature of the image scene data.
In one embodiment, the classification module 710 is further configured to input the fusion feature into a pre-trained linear support vector machine to obtain a scene classification of the image scene data.
In one embodiment, the saliency module 706 is further configured to normalize the scene distance vectors to obtain an initial scene local feature weight of each scene distance vector as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of samples in the scene local features, and $n$ represents the number of images in the category.
In one embodiment, the saliency module 706 is further configured to normalize the local feature distance vectors to obtain the initial object local feature weight corresponding to each object local feature as follows:

$$\hat{w}_{j}^{O}=\frac{1}{l-1}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{l}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$

wherein $\hat{w}_{j}^{O}$ represents the initial object local feature weight, $f_{j}^{O}$ represents an object local feature, $f_{G}^{O}$ represents the object global feature, and $\left|f_{j}^{O}-f_{G}^{O}\right|$ is the local feature distance vector.
For specific limitations of the image scene classification device based on local feature saliency, reference may be made to the above limitation of the image scene classification method based on local feature saliency, and details thereof are not repeated here. The above-described image scene classification apparatus based on local feature saliency may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for classifying image scenes based on local feature saliency. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method of the above embodiments when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of the above embodiments.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (6)

1. An image scene classification method based on local feature saliency, the method comprising:
dividing the image scene data to be classified to obtain image scene data blocks;
respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and respectively extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
respectively obtaining enhanced scene local features and enhanced object local features by setting weights of each scene local feature and each object local feature;
the method for obtaining the enhanced scene local features by setting the weight of each scene local feature comprises the following steps: calculating the average value of the local scene features corresponding to all the image scene data blocks, and determining a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector; according to a preset first super parameter, adjusting the local feature weight of the initial scene to obtain the local feature weight of the scene; weighting the local scene features according to the local scene feature weights to obtain enhanced local scene features; normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector, wherein the initial scene local feature weights are as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of sampled pictures in the scene local features, and $n$ represents the number of images in the category;
the method for obtaining the enhanced object local feature by setting the weight of each object local feature comprises the following steps: the global feature of the object and the local feature of the object are subjected to difference, and the absolute value is taken to obtain a local feature distance vector corresponding to the local feature of the object; normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature; according to a preset second super parameter, adjusting the local feature weight of the initial object to obtain the local feature weight of the object; weighting the local object features according to the local object feature weights to obtain enhanced local object features; the local feature distance vector is normalized, and initial object local feature weights corresponding to the local features of each object are obtained as follows:
$$\hat{w}_{j}^{O}=\frac{1}{l-1}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{l}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$

wherein $\hat{w}_{j}^{O}$ represents the initial object local feature weight, $f_{j}^{O}$ represents an object local feature, and $f_{G}^{O}$ represents the object global feature;
fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fused features of image scene data;
and inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
2. The method of claim 1, wherein fusing the enhanced scene local feature, the enhanced object local feature, the scene global feature, and the object global feature to obtain a fused feature of image scene data, comprises:
and fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object in a splicing mode to obtain fused features of the image scene data.
3. The method according to any one of claims 1 to 2, wherein inputting the fusion features into a pre-trained classification model results in a scene classification of the image scene data, comprising:
inputting the fusion features into a pre-trained linear support vector machine to obtain scene classification of the image scene data.
4. An image scene classification device based on local feature saliency, the device comprising:
the segmentation module is used for segmenting the image scene data to be classified to obtain image scene data blocks;
the feature extraction module is used for respectively extracting scene local features and object local features in the image scene data block through a preset scene feature extraction model and an object feature extraction model, and respectively extracting scene global features and object global features in the image scene data through the scene feature extraction model and the object feature extraction model;
the saliency module is used for respectively obtaining the enhanced scene local features and the enhanced object local features by setting weights for each scene local feature and each object local feature;
the method for obtaining the enhanced scene local features by setting the weight of each scene local feature comprises the following steps: calculating the average value of the local scene features corresponding to all the image scene data blocks, and determining a feature center; determining a scene distance vector of each scene local feature according to the distance from the scene local feature to the feature center; normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector; according to a preset first super parameter, adjusting the local feature weight of the initial scene to obtain the local feature weight of the scene; weighting the local scene features according to the local scene feature weights to obtain enhanced local scene features; normalizing the scene distance vectors to obtain initial scene local feature weights of each scene distance vector, wherein the initial scene local feature weights are as follows:
$$\hat{w}_{j}^{S}=\frac{\left|f_{j}^{S}-\bar{f}^{S}\right|}{\sum_{k=1}^{l}\left|f_{k}^{S}-\bar{f}^{S}\right|}$$

wherein $\hat{w}_{j}^{S}$ represents the initial scene local feature weight, $f_{j}^{S}$ represents a scene local feature, $\bar{f}^{S}$ represents the feature center, $l$ represents the number of sampled pictures in the scene local features, and $n$ represents the number of images in the category;
the method for obtaining the enhanced object local feature by setting the weight of each object local feature comprises the following steps: the global feature of the object and the local feature of the object are subjected to difference, and the absolute value is taken to obtain a local feature distance vector corresponding to the local feature of the object; normalizing the local feature distance vector to obtain an initial object local feature weight corresponding to each object local feature; according to a preset second super parameter, adjusting the local feature weight of the initial object to obtain the local feature weight of the object; weighting the local object features according to the local object feature weights to obtain enhanced local object features; the local feature distance vector is normalized, and initial object local feature weights corresponding to the local features of each object are obtained as follows:
$$\hat{w}_{j}^{O}=\frac{1}{l-1}\left(1-\frac{\left|f_{j}^{O}-f_{G}^{O}\right|}{\sum_{k=1}^{l}\left|f_{k}^{O}-f_{G}^{O}\right|}\right)$$

wherein $\hat{w}_{j}^{O}$ represents the initial object local feature weight, $f_{j}^{O}$ represents an object local feature, and $f_{G}^{O}$ represents the object global feature;
the fusion module is used for fusing the local features of the enhanced scene, the local features of the enhanced object, the global features of the scene and the global features of the object to obtain fusion features of image scene data;
and the classification module is used for inputting the fusion characteristics into a pre-trained classification model to obtain scene classification of the image scene data.
5. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 3 when the computer program is executed.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 3.
CN202010928765.7A 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency Active CN112001399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010928765.7A CN112001399B (en) 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010928765.7A CN112001399B (en) 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency

Publications (2)

Publication Number Publication Date
CN112001399A CN112001399A (en) 2020-11-27
CN112001399B (en) 2023-06-09

Family

ID=73468773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010928765.7A Active CN112001399B (en) 2020-09-07 2020-09-07 Image scene classification method and device based on local feature saliency

Country Status (1)

Country Link
CN (1) CN112001399B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699855B (en) * 2021-03-23 2021-10-22 腾讯科技(深圳)有限公司 Image scene recognition method and device based on artificial intelligence and electronic equipment
CN112907138B (en) * 2021-03-26 2023-08-01 国网陕西省电力公司电力科学研究院 Power grid scene early warning classification method and system from local to whole perception
CN113128527B (en) * 2021-06-21 2021-08-24 中国人民解放军国防科技大学 Image scene classification method based on converter model and convolutional neural network
CN113657462A (en) * 2021-07-28 2021-11-16 讯飞智元信息科技有限公司 Method for training vehicle recognition model, vehicle recognition method and computing device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555446A (en) * 2019-08-19 2019-12-10 北京工业大学 Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning
CN111079674A (en) * 2019-12-22 2020-04-28 东北师范大学 Target detection method based on global and local information fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8447119B2 (en) * 2010-03-16 2013-05-21 Nec Laboratories America, Inc. Method and system for image classification

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555446A (en) * 2019-08-19 2019-12-10 北京工业大学 Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning
CN111079674A (en) * 2019-12-22 2020-04-28 东北师范大学 Target detection method based on global and local information fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Indoor scene classification algorithm based on information enhancement of visually sensitive regions; Shi Jing, Zhu Hong, Wang Jing, Xue Shan; Pattern Recognition and Artificial Intelligence; Vol. 30, No. 6; pp. 520-529 *

Also Published As

Publication number Publication date
CN112001399A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN112001399B (en) Image scene classification method and device based on local feature saliency
WO2022001623A1 (en) Image processing method and apparatus based on artificial intelligence, and device and storage medium
CN111079632A (en) Training method and device of text detection model, computer equipment and storage medium
CN111612008A (en) Image segmentation method based on convolution network
CN112862824A (en) Novel coronavirus pneumonia focus detection method, system, device and storage medium
CN108154191B (en) Document image recognition method and system
CN113538441A (en) Image segmentation model processing method, image processing method and device
CN112528845B (en) Physical circuit diagram identification method based on deep learning and application thereof
CN111768415A (en) Image instance segmentation method without quantization pooling
CN113012169A (en) Full-automatic cutout method based on non-local attention mechanism
CN112084911A (en) Human face feature point positioning method and system based on global attention
Wei et al. EGA-Net: Edge feature enhancement and global information attention network for RGB-D salient object detection
CN114782355A (en) Gastric cancer digital pathological section detection method based on improved VGG16 network
CN114639102A (en) Cell segmentation method and device based on key point and size regression
Liu et al. Attentive semantic and perceptual faces completion using self-attention generative adversarial networks
Xiong et al. Single image super-resolution via image quality assessment-guided deep learning network
CN110889858A (en) Automobile part segmentation method and device based on point regression
CN116030341A (en) Plant leaf disease detection method based on deep learning, computer equipment and storage medium
CN110659724B (en) Target detection depth convolution neural network construction method based on target scale
CN114565626A (en) Lung CT image segmentation algorithm based on PSPNet improvement
Wang et al. Image Semantic Segmentation Algorithm Based on Self-learning Super-Pixel Feature Extraction
CN113763413B (en) Training method of image segmentation model, image segmentation method and storage medium
CN117523205B (en) Segmentation and identification method for few-sample ki67 multi-category cell nuclei
Cao et al. An improved defocusing adaptive style transfer method based on a stroke pyramid
CN114897779B (en) Cervical cytology image abnormal region positioning method and device based on fusion attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant