CN112329660A - Scene recognition method and device, intelligent equipment and storage medium

Scene recognition method and device, intelligent equipment and storage medium

Info

Publication number
CN112329660A
CN112329660A
Authority
CN
China
Prior art keywords
image
processed
feature
query
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011249944.4A
Other languages
Chinese (zh)
Other versions
CN112329660B (en)
Inventor
鲍虎军
章国锋
余海林
冯友计
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Shangtang Technology Development Co Ltd
Original Assignee
Zhejiang Shangtang Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Shangtang Technology Development Co Ltd filed Critical Zhejiang Shangtang Technology Development Co Ltd
Priority to CN202011249944.4A priority Critical patent/CN112329660B/en
Publication of CN112329660A publication Critical patent/CN112329660A/en
Priority to PCT/CN2021/106936 priority patent/WO2022100133A1/en
Priority to JP2022543759A priority patent/JP2023510945A/en
Application granted granted Critical
Publication of CN112329660B publication Critical patent/CN112329660B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/24553: Query execution of query operations
    • G06F 16/24554: Unary operations; Data partitioning operations
    • G06F 16/24556: Aggregation; Duplicate elimination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53: Querying
    • G06F 16/532: Query formulation, e.g. graphical querying
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G06F 18/232: Non-hierarchical techniques
    • G06F 18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image retrieval and provides a scene recognition method and device, an intelligent device, and a storage medium. The scene recognition method comprises the following steps: acquiring an image to be processed and a semantic mask map corresponding to the image to be processed, where the image to be processed comprises a query image and an image to be recognized, and the corresponding semantic mask map comprises a semantic mask map of the query image and a semantic mask map of the image to be recognized; performing feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed; and determining, from the images to be recognized, an image that matches the scene of the query image by using the feature vectors of the images to be processed. In this way, the semantic mask map reduces the interference of interference-factor features with feature recognition, thereby improving the robustness of scene recognition.

Description

Scene recognition method and device, intelligent equipment and storage medium
Technical Field
The invention relates to the technical field of image retrieval, in particular to a scene recognition method, a scene recognition device, intelligent equipment and a storage medium.
Background
Scene recognition has important applications in the field of computer vision, such as Simultaneous Localization and Mapping (SLAM), Structure from Motion (SFM), and Visual Localization (VL).
The main subject of research on the scene recognition problem is to recognize the scene corresponding to a given image: to give the name of the scene or its geographical location, or to select images with a similar scene from a database, in which case the task can also be regarded as an image retrieval problem. Two methods are commonly used at present: one directly computes a global description of the image, and the other uses feature aggregation. At present, research on scene recognition methods in the prior art is becoming more and more extensive.
Disclosure of Invention
The invention provides a scene recognition method, a scene recognition device, intelligent equipment and a storage medium, which are used for improving the robustness of scene recognition when an image contains interference factors.
In order to solve the above technical problems, a first technical solution provided by the present invention is: provided is a scene recognition method, including: acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises a query image and an image to be recognized, and the semantic mask image corresponding to the image to be processed comprises a semantic mask image of the query image and a semantic mask image of the image to be recognized; performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed; and determining an image matched with the scene of the query image from the images to be recognized by utilizing the feature vectors of the images to be processed. The features of the image to be processed are obtained by combining the semantic mask map with feature aggregation, so that the interference of interference factors can be reduced and the robustness of scene recognition improved.
The step of obtaining the image to be processed and the semantic mask image corresponding to the image to be processed comprises: performing semantic segmentation processing on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to the category; setting a weight for the category of each pixel according to set conditions; and obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the category and the weight corresponding to the category, wherein the semantic masks corresponding to all the pixels form a semantic mask image. Through this setting of weights, when the obtained semantic mask image is combined with feature aggregation to obtain the features of the image to be processed, the interference of interference factors can be reduced and the robustness of scene recognition improved.
Before setting the weight for the category of each pixel according to the setting condition, the method further comprises the following steps: performing attribute classification on all pixels to obtain one or more sub-categories; setting weight for each sub-category according to set conditions; and obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image. And each subcategory is provided with a weight, so that the interference of interference factors can be reduced, and the robustness of scene identification is improved.
Wherein the sub-categories include at least two of a fixed sub-category, an unfixed sub-category, a dynamic sub-category, and an unknown sub-category; the weight of the dynamic sub-category is less than the weights of the fixed sub-category, the unfixed sub-category, and the unknown sub-category. For example, a higher weight is set for the fixed sub-category and a smaller weight is set for the unfixed sub-category, so that the interference of unfixed features with feature recognition is suppressed and the robustness of scene recognition is improved.
Wherein the obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category comprises: calculating a semantic mask corresponding to the pixel by using the following formula (1):
m_i = p_i × w_i    (1)

where m_i represents the semantic mask corresponding to the i-th pixel (the map formed by all semantic masks is the semantic mask map), p_i represents the probability of the category or sub-category to which the i-th pixel belongs, and w_i represents the weight corresponding to that category or sub-category. Calculating the semantic mask map in this way reduces the interference of unfixed features with scene recognition.
Performing feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed includes: extracting features of the image to be processed to obtain a feature set; forming a plurality of clustering centers from the feature set; obtaining, from the plurality of clustering centers, the clustering center corresponding to each feature in each image to be processed; determining the value of each feature in the image to be processed in a first dimension, and determining the value of the clustering center corresponding to each feature in the first dimension; performing feature aggregation processing on the query image by using the clustering center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of its clustering center in the first dimension, in combination with the semantic mask map of the query image, to obtain the feature vector of the query image; and performing feature aggregation processing on the image to be recognized in the same way, in combination with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized. Because the semantic mask map assigns weights to unfixed features, obtaining the features of the image to be processed in this way reduces the interference of interference factors and improves the robustness of scene recognition.
Wherein said forming a plurality of cluster centers from said feature set comprises: processing the feature set by using a clustering algorithm to form a plurality of clustering centers; the obtaining of the clustering center corresponding to each feature in each image to be processed according to the plurality of clustering centers comprises: and taking the clustering center closest to each feature as the clustering center corresponding to each feature in the image to be processed.
Wherein the determining, from the images to be recognized, an image matching the scene of the query image by using the feature vector of the image to be processed comprises: determining an image matched with the query image scene from the image to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image. Because the calculation of the feature vector incorporates the semantic mask map, the interference of unfixed features is reduced, and an image to be recognized with higher similarity to the query image is obtained.
The step of determining an image matched with the query image scene from the image to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image comprises the following steps: and determining the image to be identified corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image. Thus, the image to be identified with higher similarity to the query image is obtained.
Wherein there may be a plurality of images among the images to be recognized that match the query image; after the step of determining the image to be recognized corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image, the method further comprises: ranking the images matched with the query image by a spatial consistency method so as to obtain the image most similar to the query image. In this way, the retrieved scene is more similar and the accuracy is higher.
In order to solve the above technical problems, a second technical solution provided by the present invention is: there is provided a scene recognition apparatus including: the acquisition module is used for acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises an inquiry image and an image to be identified; the feature aggregation module is used for performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed; and the image matching module is used for determining an image matched with the scene of the query image from the image to be identified by utilizing the characteristic vector of the image to be processed. The corresponding features of the image to be processed are obtained by combining the semantic mask map with a feature aggregation mode, so that the interference of interference factors can be reduced, and the robustness of scene recognition is improved.
In order to solve the above technical problems, a third technical solution provided by the present invention is: there is provided a smart device comprising: a processor and a memory coupled to each other, wherein the memory is used for storing program instructions for implementing the scene recognition method according to any one of the above items.
In order to solve the above technical problems, a fourth technical solution provided by the present invention is: there is provided a computer-readable storage medium storing a program file executable to implement the scene recognition method of any one of the above.
The invention has the beneficial effects that: different from the prior art, the scene recognition method provided by the invention acquires an image to be processed and its corresponding semantic mask map, performs feature aggregation processing on the image to be processed according to the semantic mask map to obtain its feature vector, and uses the feature vector to determine, from the images to be recognized, an image that matches the scene of the query image. High-level semantic information of the image is obtained through the semantic mask map, and combining the semantic mask map with feature aggregation eliminates the interference caused by interference factors in the image, thereby improving the robustness of scene recognition.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a scene recognition method according to the present invention;
FIG. 2 is a flowchart illustrating an embodiment of step S11 in FIG. 1;
FIG. 3 is a schematic flow chart diagram illustrating another embodiment of step S11 in FIG. 1;
FIG. 4 is a schematic structural diagram of an embodiment of a scene recognition apparatus according to the present invention;
FIG. 5 is a schematic block diagram of an embodiment of a smart device of the present invention;
fig. 6 is a schematic structural diagram of the computer-readable storage medium of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The present invention will be described in detail below with reference to the accompanying drawings and examples.
Referring to fig. 1, a flowchart of a scene recognition method according to a first embodiment of the present invention is shown, which includes:
step S11: acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises an inquiry image and an image to be identified.
Specifically, the image to be processed comprises a query image and an image to be recognized, and the semantic mask map corresponding to the image to be processed comprises the semantic mask map of the query image and the semantic mask map of the image to be recognized. Specifically, referring to fig. 2, obtaining the semantic mask map corresponding to the image to be recognized includes:
step S21: and performing semantic segmentation processing on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to the category.
The query image is an image specified by the user; it can be an image currently captured by the user or an image stored in advance. The image to be recognized is an image retrieved from the database according to the query image and matched against it. The database may reside on a server: a query image is input, and the server matches a plurality of images to be recognized with similar scenes against the query image. Semantic segmentation is performed on the image to be recognized and the query image to obtain the category of each pixel in the image and the probability corresponding to that category.
Step S22: a weight is set for each pixel type according to a setting condition.
After the categories of the pixels are acquired, a weight is set for the pixels of each category. In one embodiment, the categories obtained by semantic segmentation include four categories: a fixed sub-category (e.g., stable), an unfixed sub-category (e.g., volatile), a dynamic sub-category, and an unknown sub-category. To reduce the interference of dynamic features with scene recognition, in one embodiment the weight of the dynamic sub-category is set to be the lowest, smaller than the weights of the fixed, unfixed, and unknown sub-categories. In another embodiment, if it is desired to reduce the interference of unfixed features with scene recognition, the weight of the unfixed sub-category is set to be the lowest, smaller than the weights of the fixed, dynamic, and unknown sub-categories.
Step S23: and obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
Specifically, in an embodiment, the semantic mask corresponding to each pixel is calculated by using the following formula (1):
m_i = p_i × w_i    (1)

where m_i represents the semantic mask corresponding to the i-th pixel (the map formed by all semantic masks is the semantic mask map), p_i represents the probability of the category or sub-category to which the i-th pixel belongs, and w_i represents the weight corresponding to that category or sub-category.
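As an illustration only, the following is a minimal Python sketch of applying formula (1) to a whole image; the weight table CLASS_WEIGHTS, the function name semantic_mask_map, and the segmentation outputs labels/probs are hypothetical names chosen here, not part of the patent.

```python
import numpy as np

# Hypothetical per-category weights (the "set condition"): the dynamic
# sub-category gets the lowest weight so its features are suppressed.
CLASS_WEIGHTS = {"fixed": 1.0, "unfixed": 0.5, "unknown": 0.5, "dynamic": 0.1}

def semantic_mask_map(labels, probs, class_names):
    """Compute m_i = p_i * w_i for every pixel (formula (1)).

    labels: (H, W) int array, predicted category index of each pixel.
    probs:  (H, W) float array, probability of the predicted category.
    class_names: list mapping category index -> category name.
    """
    weights = np.array([CLASS_WEIGHTS[name] for name in class_names])
    return probs * weights[labels]  # (H, W) semantic mask map
```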
In another embodiment, if the category result after semantic segmentation does not include the four categories of stable (fixed), volatile (unfixed), dynamic, and unknown, please refer to fig. 3, in which step S31 is the same as step S21 in fig. 2 and is not described herein again. In this embodiment, the method proceeds as follows:
step S32: all pixels are attribute classified to obtain one or more sub-categories.
All pixels are classified by attribute to obtain one or more sub-categories; in one embodiment, the sub-categories include at least two of the following: a fixed sub-category, an unfixed sub-category, a dynamic sub-category, and an unknown sub-category.
Step S33: and setting weight for each sub-category according to set conditions.
Specifically, after the sub-categories of the pixels are acquired, a weight is set for the pixels of each sub-category. In one embodiment, the sub-categories obtained by attribute classification of the semantic segmentation result include four classes: a fixed sub-category, an unfixed sub-category, a dynamic sub-category, and an unknown sub-category. To reduce the interference of dynamic features with scene recognition, in one embodiment the weight of the dynamic sub-category is set to be the lowest, smaller than the weights of the fixed, unfixed, and unknown sub-categories. In another embodiment, if it is desired to reduce the interference of unfixed features with scene recognition, the weight of the unfixed sub-category is set to be the lowest, smaller than the weights of the fixed, dynamic, and unknown sub-categories.
Step S34: and obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
Specifically, in an embodiment, the semantic mask corresponding to each pixel is calculated by using the following formula (1):
m_i = p_i × w_i    (1)

where m_i represents the semantic mask corresponding to the i-th pixel (the map formed by all semantic masks is the semantic mask map), p_i represents the probability of the category or sub-category to which the i-th pixel belongs, and w_i represents the weight corresponding to that category or sub-category.
According to the method provided by this embodiment, different weights are set for the pixel categories obtained by semantic segmentation, so that the interference these categories cause in feature recognition is reduced, and the robustness of scene recognition is further improved.
Step S12: and performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed.
Specifically, an existing way of performing feature aggregation on the features to be processed to obtain a feature vector is VLAD coding. Obtaining the feature vector by means of VLAD coding includes: performing feature extraction on the image to be processed to obtain a feature set. In another embodiment, feature extraction may be performed on a preset set of images to obtain the feature set, where the preset image set may be the set of all images in the database or the server, a set of some of the images in the server, or a set of pictures collected by the user, without limitation. It can be understood that each image to be processed contains a plurality of features; that is, when feature extraction is performed, a plurality of features are extracted from each image to be processed. All extracted features form a feature set, and a clustering algorithm is then applied to this feature set to obtain K cluster centers. The K cluster centers are called the codebook, obtained as C = {c1, c2, …, cK}.
A plurality of features in one of the images to be processed form a feature set X = {x1, x2, …, xn}. In a specific embodiment, the feature set X can be aggregated into a feature vector of fixed length through the codebook C.
After the plurality of cluster centers are obtained, the cluster center corresponding to each feature x_i in each image to be processed is obtained from them. Specifically, for each feature x_i, the cluster center nearest to x_i is determined as its corresponding cluster center c_k. In one embodiment, after the cluster center c_k corresponding to the current feature x_i is determined, the value of c_k in the first dimension is determined. The dimensionality of c_k is the same as that of its corresponding feature x_i, so the value of c_k in the first dimension can be compared with the value of x_i in the first dimension; because the two have the same dimensionality, the cluster center c_k and its corresponding feature x_i are distinguished by the difference between them in each dimension. In the embodiments of the present disclosure, the first dimension may be dimension 1, dimension 2, dimension 3, and so on; since the cluster center and the feature are aggregated in the same dimension, the description below uses the first dimension throughout for brevity.
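A minimal sketch of the codebook construction and nearest-center assignment described above, assuming scikit-learn's KMeans as the clustering algorithm and a codebook size of K = 64; both choices, and the function names, are illustrative and not specified by the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_features, k=64):
    """Cluster the pooled feature set into K centers (the codebook C)."""
    return KMeans(n_clusters=k, n_init=10).fit(all_features).cluster_centers_

def assign_centers(features, codebook):
    """For each feature x_i, find the index of its nearest center c_k."""
    # Squared Euclidean distance between every feature and every center.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)  # (N,) index k of the center for each x_i
```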
In existing feature recognition approaches, the feature vector of the query image and of the image to be recognized is obtained from the cluster center c_k corresponding to each feature, together with the values of the feature and of its cluster center in the first dimension. In one embodiment, the prior art generally obtains the feature vector of the query image or the image to be recognized by the following formula (3):
v(k, j) = Σ_i α_k(x_i) × (x_i(j) - c_k(j))    (3)
where v (k, j) represents the feature vector of the query image or image to be identified, αk(xi) Representing a selection function, xiIs characterized in that when ckIs xiWhen clustering the center of (a)k(xi) Equal to 1, otherwise αk(xi) Is equal to 0, xi(j) Value corresponding to the jth dimension expressed as ith feature, ck(j) And representing the value corresponding to the j dimension of the k clustering center.
It will be appreciated that when it is desired to compute the feature vector of the query image, v (k, j) represents the feature vector of the query image, αk(xi) Representing a selection function, xiTo query for features of the image, when ck(cluster center) is xiCorresponding cluster center, αk(xi) Equal to 1, otherwise αk(xi) Equal to 0. x is the number ofi(j) Expressed as the ith on the query imageValue corresponding to the jth dimension of the feature, ck(j) And representing the value corresponding to the j dimension of the k clustering center of the query image.
It will be understood that when it is desired to calculate the feature vector of the image to be recognized, v (k, j) represents the feature vector of the image to be recognized, αk(xi) Representing a selection function, xiAs a feature of the image to be recognized, when ck(cluster center) is xiCorresponding cluster center, αk(xi) Equal to 1, otherwise αk(xi) Equal to 0. x is the number ofi(j) Expressed as the value corresponding to the jth dimension of the ith feature on the image to be recognized, ck(j) And representing the value corresponding to the j dimension of the k clustering center of the image to be identified.
In the technical solution of the present invention, in order to avoid the situation where the lack of high-level semantic information lets dynamic features influence feature-vector recognition and thus makes the recognition result inaccurate, feature aggregation processing is performed on the query image by using the cluster center c_k corresponding to each feature x_i in the image to be processed, the value of that cluster center in the first dimension, and the value of each feature x_i in the first dimension, in combination with the semantic mask map of the query image, to obtain the feature vector of the query image. Likewise, feature aggregation processing is performed on the image to be recognized by using the cluster center c_k corresponding to each feature x_i, the value of that cluster center in the first dimension, and the value of each feature x_i in the first dimension, in combination with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized.
Specifically, the feature vectors of the query image and the image to be identified are obtained through the following formula (2):
v(k, j)′ = Σ_i m_i × α_k(x_i) × (x_i(j) - c_k(j))    (2)
where v(k, j)′ represents the feature vector of the query image or the image to be recognized, α_k(x_i) represents the selection function (α_k(x_i) equals 1 when c_k is the cluster center of x_i, and 0 otherwise), x_i(j) represents the value of the i-th feature in the j-th dimension, c_k(j) represents the value of the k-th cluster center in the j-th dimension, and m_i represents the semantic mask value taken from the semantic mask map of the query image or the image to be recognized at the position of the i-th feature.
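Under the same assumptions as the sketches above, formula (2) might be implemented as follows; masked_vlad and its argument names are illustrative.

```python
import numpy as np

def masked_vlad(features, masks, codebook):
    """Formula (2): v(k, j)' = sum_i m_i * alpha_k(x_i) * (x_i(j) - c_k(j)).

    features: (N, D) local features x_i of one image.
    masks:    (N,)   semantic mask value m_i at each feature's position.
    codebook: (K, D) cluster centers c_k.
    """
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)          # alpha_k(x_i) as an index per feature
    v = np.zeros_like(codebook)          # (K, D) aggregated residuals
    for x, m, k in zip(features, masks, nearest):
        v[k] += m * (x - codebook[k])    # mask-weighted residual
    return v
```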
By using the method of the invention, for example, when the image contains a large number of dynamic objects, the weight of the dynamic objects can be reduced by weighting through the semantic mask, and the robustness of the feature recognition is improved.
Specifically, in an embodiment, when weighting is performed through the semantic masks, if a feature is a pixel-level feature, the semantic mask at the corresponding position can be taken directly according to the position of the feature in the image; if the feature is a sub-pixel-level feature, the semantic mask can be obtained by interpolating at the corresponding position on the semantic mask map.
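For illustration, a sketch of such sampling, assuming bilinear interpolation (the patent does not specify the interpolation scheme); sample_mask is a hypothetical helper.

```python
import numpy as np

def sample_mask(mask_map, xs, ys):
    """Sample m_i at (possibly sub-pixel) feature positions by bilinear
    interpolation of the semantic mask map."""
    h, w = mask_map.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    wx, wy = xs - x0, ys - y0
    top = (1 - wx) * mask_map[y0, x0] + wx * mask_map[y0, x1]
    bot = (1 - wx) * mask_map[y1, x0] + wx * mask_map[y1, x1]
    return (1 - wy) * top + wy * bot
```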
In an embodiment, after the feature vectors of the query image and the image to be recognized are obtained in the above manner, the feature vector may first be normalized within each of the K cluster blocks, and the whole vector may then be normalized.
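Continuing the sketches above, this per-cluster normalization followed by whole-vector normalization might look as follows; the use of L2 norms is an assumption, since the patent does not name the norm.

```python
import numpy as np

def normalize_vlad(v, eps=1e-12):
    """Normalize each of the K cluster blocks, then the flattened vector."""
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)  # per cluster
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + eps)                      # whole vector
```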
Step S13: and determining an image matched with the scene of the query image from the image to be identified by using the feature vector of the image to be processed.
After the feature vectors of the query image and the image to be recognized are obtained in the manner of step S12, an image matching the scene of the query image is determined from the images to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image.
It will be appreciated that the closer the distance between feature vectors, the higher the similarity of features, and the further the distance between feature vectors, the lower the similarity of features. Therefore, in an embodiment, the image to be identified corresponding to the feature vector closest to the feature vector of the query image is determined as the image matched with the query image.
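A sketch of this nearest-vector matching; rank_candidates is an illustrative helper, and Euclidean distance is assumed as the distance measure.

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Order the images to be recognized by distance to the query's feature
    vector; the closest candidate is taken as the scene match."""
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
    return np.argsort(dists)  # candidate indices, most similar first
```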
In an embodiment, if the number of the images matched with the query image in the to-be-identified images is multiple, in order to obtain the most similar images, the images matched with the query image are arranged by using a spatial consistency method to obtain the most similar images to the query image.
According to the scene recognition method provided by the invention, the semantic mask map is combined with the traditional feature aggregation approach, so that semantic mask weighting reduces the interference of dynamic features in the image with feature recognition and effectively avoids the negative impact of unstable objects on scene recognition. Meanwhile, the weighting also effectively mitigates the impact caused by the instability of semantic segmentation itself, further improving robustness. Moreover, the method of the invention remains robust when seasons change.
Fig. 4 is a schematic structural diagram of a scene recognition device according to an embodiment of the present invention. The method comprises the following steps: an acquisition module 41, a feature aggregation module 42, and an image matching module 43.
The obtaining module 41 is configured to obtain an image to be processed and a semantic mask map corresponding to the image to be processed; the image to be processed comprises a query image and an image to be recognized, and the semantic mask image corresponding to the image to be processed comprises a semantic mask image of the query image and a semantic mask image of the image to be recognized. Specifically, the obtaining module 41 is configured to obtain a query image, and obtain a plurality of images to be identified, which are matched with the query image, from a database according to the query image; performing semantic segmentation processing on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to the category; setting weight for each pixel type according to set conditions; and obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the category and the weight corresponding to the category, wherein the semantic masks corresponding to all the pixels form a semantic mask image. In an embodiment, the obtaining module 41 is further configured to perform attribute classification on all pixels to obtain one or more sub-categories; setting weight for each subcategory according to set conditions; and obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
The feature aggregation module 42 is configured to perform feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed. Specifically, the feature aggregation module 42 is configured to perform feature extraction on the image to be processed to obtain a feature set; forming a plurality of clustering centers according to the feature set; obtaining a clustering center corresponding to each feature in each image to be processed according to the plurality of clustering centers; determining a value corresponding to a first dimension of each feature in the image to be processed, and determining a value corresponding to the first dimension of a clustering center corresponding to each feature in the image to be processed;
and for performing feature aggregation processing on the query image by using the clustering center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of the clustering center corresponding to each feature in the first dimension, in combination with the semantic mask map of the query image, to obtain the feature vector of the query image; and for performing feature aggregation processing on the image to be recognized in the same way, in combination with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized.
The image matching module 43 is configured to determine, from the images to be identified, an image that matches the scene of the query image by using the feature vectors of the images to be processed. Specifically, the image matching module 43 is configured to determine an image matched with the query image scene from the image to be recognized according to a distance between a feature vector of the image to be recognized and a feature vector of the query image. In an embodiment, the image matching module 43 is configured to determine the image to be identified corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image. In an embodiment, the image matching module 43 is further configured to, when there are a plurality of images matching the query image in the to-be-identified images, arrange the images matching the query image by using a spatial consistency method to obtain an image most similar to the query image.
According to the scene recognition device provided by the invention, the semantic mask image is combined with the traditional feature aggregation mode, so that the interference of dynamic features in the image on feature recognition is reduced in a semantic mask weighting mode, and the robustness of the device is further improved.
Fig. 5 is a schematic structural diagram of an intelligent device according to the present invention. The smart device comprises a memory 52 and a processor 51 connected to each other.
The memory 52 is used to store program instructions implementing the scene recognition method of any one of the above.
The processor 51 is operative to execute program instructions stored in the memory 52.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip having signal processing capabilities. The processor 51 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 52 may be a memory bank, a TF card, etc., and can store all information in the smart device, including input raw data, computer programs, intermediate operation results, and final operation results. It stores and retrieves information at the locations specified by the controller. With the memory, the smart device has a memory function and can work normally. According to the purpose of use, the storage of a smart device can be classified into main storage (internal storage) and auxiliary storage (external storage). External storage is usually a magnetic medium, an optical disc, or the like, and can store information for a long time. Internal storage refers to the storage component on the mainboard, which holds the data and programs currently being executed; it is used only for temporary storage, and its contents are lost when the power is turned off.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a system server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method of the embodiments of the present application.
Please refer to fig. 6, which is a schematic structural diagram of a computer-readable storage medium according to the present invention. The computer readable storage medium of the present application stores a program file 61 capable of implementing all the above-mentioned scene recognition methods, wherein the program file 61 may be stored in the storage medium in the form of a software product, and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage device includes: various media capable of storing program codes, such as a usb disk, a mobile hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices, such as a computer, a server, a mobile phone, and a tablet.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (13)

1. A method for scene recognition, comprising:
acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises a query image and an image to be recognized, and the semantic mask image corresponding to the image to be processed comprises a semantic mask image of the query image and a semantic mask image of the image to be recognized;
performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed;
and determining an image matched with the scene of the query image from the images to be identified by utilizing the feature vector of the images to be processed.
2. The scene recognition method according to claim 1, wherein the obtaining of the to-be-processed image and the semantic mask map corresponding to the to-be-processed image includes:
performing semantic segmentation processing on the image to be identified and the query image to obtain the category of each pixel and the probability corresponding to the category;
setting weight for each pixel type according to set conditions;
and obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the category and the weight corresponding to the category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
3. The method according to claim 2, wherein before the setting of a weight for the category of each pixel according to the set condition, the method further comprises:
performing attribute classification on all pixels to obtain one or more sub-categories;
setting weight for each sub-category according to set conditions;
and obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
4. The method of claim 3, wherein the sub-categories include at least two of: a fixed sub-category, an unfixed sub-category, a dynamic sub-category, and an unknown sub-category;
the weight of the dynamic subcategory is less than the weight of the fixed subcategory, the non-fixed subcategory, and the unknown subcategory.
5. The method of claim 4, wherein the deriving the semantic mask for each pixel according to the probability and the weight corresponding to the sub-category comprises: calculating a semantic mask corresponding to the pixel by using the following formula (1):
m_i = p_i × w_i    (1)

wherein m_i represents the semantic mask corresponding to the i-th pixel, the map formed by all semantic masks being the semantic mask map, p_i represents the probability of the category or sub-category to which the i-th pixel belongs, and w_i represents the weight corresponding to the category or sub-category to which the i-th pixel belongs.
6. The method according to claim 1, wherein performing feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed comprises:
extracting the features of the image to be processed to obtain a feature set;
forming a plurality of clustering centers according to the feature set;
obtaining a clustering center corresponding to each feature in each image to be processed according to the plurality of clustering centers;
determining a value corresponding to a first dimension of each feature in the image to be processed, and determining a value corresponding to the first dimension of a clustering center corresponding to each feature in the image to be processed;
performing feature aggregation processing on the query image by using the clustering center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of the clustering center corresponding to each feature in the first dimension, in combination with the semantic mask image of the query image, to obtain the feature vector of the query image; and
performing feature aggregation processing on the image to be recognized by using the clustering center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of the clustering center corresponding to each feature in the first dimension, in combination with the semantic mask image of the image to be recognized, to obtain the feature vector of the image to be recognized.
7. The method of claim 6, wherein said forming a plurality of cluster centers from said feature set comprises:
processing the feature set by using a clustering algorithm to form a plurality of clustering centers;
the obtaining of the clustering center corresponding to each feature in each image to be processed according to the plurality of clustering centers comprises:
and taking the clustering center closest to each feature as the clustering center corresponding to each feature in the image to be processed.
8. The method according to any one of claims 1-7, wherein the determining, from the image to be identified, an image matching the scene of the query image by using the feature vector of the image to be processed comprises:
and determining an image matched with the query image scene from the image to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image.
9. The method according to claim 8, wherein the step of determining an image matching the query image scene from the images to be identified according to the distance between the feature vector of the image to be identified and the feature vector of the query image comprises:
and determining the image to be identified corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image.
10. The method according to claim 9, wherein the number of the images to be identified matching the query image is plural;
after the determining the image to be identified corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image, the method further comprises:
and arranging the images matched with the query image by adopting a space consistency method so as to obtain the image most similar to the query image.
11. A scene recognition apparatus, comprising:
the acquisition module is used for acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises a query image and an image to be recognized, and the semantic mask image corresponding to the image to be processed comprises a semantic mask image of the query image and a semantic mask image of the image to be recognized;
the feature aggregation module is used for performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed;
and the image matching module is used for determining an image matched with the scene of the query image from the image to be identified by utilizing the characteristic vector of the image to be processed.
12. A scene recognition device, comprising: a processor and a memory coupled to each other, wherein,
the memory is for storing program instructions for implementing the scene recognition method of any one of claims 1-10.
13. A computer-readable storage medium, characterized in that a program file is stored, which can be executed to implement the scene recognition method according to any one of claims 1 to 10.
CN202011249944.4A 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium Active CN112329660B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202011249944.4A CN112329660B (en) 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium
PCT/CN2021/106936 WO2022100133A1 (en) 2020-11-10 2021-07-16 Scene recognition method and apparatus, intelligent device, storage medium and computer program
JP2022543759A JP2023510945A (en) 2020-11-10 2021-07-16 Scene identification method and apparatus, intelligent device, storage medium and computer program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011249944.4A CN112329660B (en) 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112329660A (en) 2021-02-05
CN112329660B CN112329660B (en) 2024-05-24

Family

ID=74317739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011249944.4A Active CN112329660B (en) 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium

Country Status (3)

Country Link
JP (1) JP2023510945A (en)
CN (1) CN112329660B (en)
WO (1) WO2022100133A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393515A (en) * 2021-05-21 2021-09-14 杭州易现先进科技有限公司 Visual positioning method and system combined with scene labeling information
WO2022100133A1 (en) * 2020-11-10 2022-05-19 浙江商汤科技开发有限公司 Scene recognition method and apparatus, intelligent device, storage medium and computer program

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117009532B (en) * 2023-09-21 2023-12-19 腾讯科技(深圳)有限公司 Semantic type recognition method and device, computer readable medium and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239535A (en) * 2017-05-31 2017-10-10 北京小米移动软件有限公司 Similar pictures search method and device
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
WO2019095997A1 (en) * 2017-11-15 2019-05-23 深圳云天励飞技术有限公司 Image recognition method and device, computer device and computer-readable storage medium
CN111027493A (en) * 2019-12-13 2020-04-17 电子科技大学 Pedestrian detection method based on deep learning multi-network soft fusion
WO2020134674A1 (en) * 2018-12-29 2020-07-02 平安科技(深圳)有限公司 Palmprint identification method, apparatus, computer device, and storage medium
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model
US20210118144A1 (en) * 2018-09-15 2021-04-22 Beijing Sensetime Technology Development Co., Ltd. Image processing method, electronic device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8891908B2 (en) * 2012-11-14 2014-11-18 Nec Laboratories America, Inc. Semantic-aware co-indexing for near-duplicate image retrieval
CN105335757A (en) * 2015-11-03 2016-02-17 电子科技大学 Model identification method based on local characteristic aggregation descriptor
WO2018081537A1 (en) * 2016-10-31 2018-05-03 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN112329660B (en) * 2020-11-10 2024-05-24 浙江商汤科技开发有限公司 Scene recognition method and device, intelligent equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239535A (en) * 2017-05-31 2017-10-10 北京小米移动软件有限公司 Similar pictures search method and device
WO2019095997A1 (en) * 2017-11-15 2019-05-23 深圳云天励飞技术有限公司 Image recognition method and device, computer device and computer-readable storage medium
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
US20210118144A1 (en) * 2018-09-15 2021-04-22 Beijing Sensetime Technology Development Co., Ltd. Image processing method, electronic device, and storage medium
WO2020134674A1 (en) * 2018-12-29 2020-07-02 平安科技(深圳)有限公司 Palmprint identification method, apparatus, computer device, and storage medium
CN111027493A (en) * 2019-12-13 2020-04-17 电子科技大学 Pedestrian detection method based on deep learning multi-network soft fusion
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022100133A1 (en) * 2020-11-10 2022-05-19 浙江商汤科技开发有限公司 Scene recognition method and apparatus, intelligent device, storage medium and computer program
CN113393515A (en) * 2021-05-21 2021-09-14 杭州易现先进科技有限公司 Visual positioning method and system combined with scene labeling information
CN113393515B (en) * 2021-05-21 2023-09-19 杭州易现先进科技有限公司 Visual positioning method and system combining scene annotation information

Also Published As

Publication number Publication date
WO2022100133A1 (en) 2022-05-19
JP2023510945A (en) 2023-03-15
CN112329660B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN103207898B (en) A kind of similar face method for quickly retrieving based on local sensitivity Hash
US9330111B2 (en) Hierarchical ranking of facial attributes
CN112329660B (en) Scene recognition method and device, intelligent equipment and storage medium
CN109359725B (en) Training method, device and equipment of convolutional neural network model and computer readable storage medium
US9747305B2 (en) Image search device, image search method, program, and computer-readable storage medium
US20150039583A1 (en) Method and system for searching images
US11301509B2 (en) Image search system, image search method, and program
US9940366B2 (en) Image search device, image search method, program, and computer-readable storage medium
US20200410280A1 (en) Methods and apparatuses for updating databases, electronic devices and computer storage mediums
CN109033955B (en) Face tracking method and system
US20210073890A1 (en) Catalog-based image recommendations
CN110825894A (en) Data index establishing method, data index retrieving method, data index establishing device, data index retrieving device, data index establishing equipment and storage medium
CN107315984B (en) Pedestrian retrieval method and device
CN112561976A (en) Image dominant color feature extraction method, image retrieval method, storage medium and device
CN110083731B (en) Image retrieval method, device, computer equipment and storage medium
US20210042565A1 (en) Method and device for updating database, electronic device, and computer storage medium
Wang et al. Salient object detection based on multi-feature graphs and improved manifold ranking
CN112650869B (en) Image retrieval reordering method and device, electronic equipment and storage medium
CN114461837A (en) Image processing method and device and electronic equipment
CN113505257A (en) Image search method, trademark search method, electronic device, and storage medium
Peng et al. Multi-task person re-identification via attribute and part-based learning
Dimai Unsupervised extraction of salient region-descriptors for content based image retrieval
Mary et al. Content based image retrieval using colour, multi-dimensional texture and edge orientation
Ni et al. Research on image segmentation algorithm based on fuzzy clustering and spatial pyramid
CN116662589A (en) Image matching method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040124

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant