CN112329660B - Scene recognition method and device, intelligent equipment and storage medium

Scene recognition method and device, intelligent equipment and storage medium

Info

Publication number
CN112329660B
CN112329660B
Authority
CN
China
Prior art keywords
image
feature
processed
identified
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011249944.4A
Other languages
Chinese (zh)
Other versions
CN112329660A (en)
Inventor
鲍虎军
章国锋
余海林
冯友计
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Shangtang Technology Development Co Ltd
Original Assignee
Zhejiang Shangtang Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Shangtang Technology Development Co Ltd filed Critical Zhejiang Shangtang Technology Development Co Ltd
Priority to CN202011249944.4A priority Critical patent/CN112329660B/en
Publication of CN112329660A publication Critical patent/CN112329660A/en
Priority to PCT/CN2021/106936 priority patent/WO2022100133A1/en
Priority to JP2022543759A priority patent/JP2023510945A/en
Application granted granted Critical
Publication of CN112329660B publication Critical patent/CN112329660B/en

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24553 - Query execution of query operations
    • G06F16/24554 - Unary operations; Data partitioning operations
    • G06F16/24556 - Aggregation; Duplicate elimination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 - Querying
    • G06F16/532 - Query formulation, e.g. graphical querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image retrieval and provides a scene recognition method and device, an intelligent device, and a storage medium. The scene recognition method includes the following steps: acquiring images to be processed and semantic mask maps corresponding to the images to be processed, where the images to be processed include a query image and images to be identified, and the corresponding semantic mask maps include the semantic mask map of the query image and the semantic mask maps of the images to be identified; performing feature aggregation processing on the images to be processed according to the semantic mask maps to obtain feature vectors of the images to be processed; and determining, from the images to be identified, an image matching the scene of the query image using the feature vectors of the images to be processed. The semantic mask map thus suppresses the contribution of distracting features to feature matching, which improves the robustness of scene recognition.

Description

Scene recognition method and device, intelligent equipment and storage medium
Technical Field
The invention relates to the technical field of image retrieval, and in particular to a scene recognition method, a scene recognition device, an intelligent device, and a storage medium.
Background
Scene recognition has important applications in computer vision, such as simultaneous localization and mapping (Simultaneous Localization and Mapping, SLAM), structure from motion (Structure from Motion, SFM), and visual localization (Visual Localization, VL).
The core task of scene recognition is to identify the scene shown in a given image: to output the name or geographical location of the scene, or to select images of the same scene from a database; in the latter form it can be regarded as an image retrieval problem. Two approaches are in common use: directly computing a global description of the image, and aggregating local features into a global representation.
Disclosure of Invention
The invention provides a scene recognition method, a scene recognition device, an intelligent device, and a storage medium, which improve the robustness of scene recognition when an image contains interference factors.
In order to solve the above technical problem, the first technical scheme provided by the invention is a scene recognition method, including: acquiring images to be processed and semantic mask maps corresponding to the images to be processed, where the images to be processed include a query image and images to be identified, and the corresponding semantic mask maps include the semantic mask map of the query image and the semantic mask maps of the images to be identified; performing feature aggregation processing on the images to be processed according to the semantic mask maps to obtain feature vectors of the images to be processed; and determining, from the images to be identified, an image matching the scene of the query image using the feature vectors of the images to be processed. Because the features of each image to be processed are obtained by combining its semantic mask map with feature aggregation, the interference of distracting factors is reduced and the robustness of scene recognition is improved.
Acquiring the images to be processed and the semantic mask maps corresponding to them includes: performing semantic segmentation processing on the images to be identified and the query image to obtain the category of each pixel and the probability corresponding to that category; setting a weight for the category of each pixel according to set conditions; and obtaining the semantic mask corresponding to each pixel from the probability and the weight of its category, the semantic masks of all pixels forming the semantic mask map. With these weights, combining the resulting semantic mask map with feature aggregation to obtain the features of the images to be processed reduces the interference of distracting factors and improves the robustness of scene recognition.
Before the weight is set for the category of each pixel according to the set conditions, the method further includes: classifying the attributes of all pixels to obtain one or more sub-categories; setting a weight for each sub-category according to the set conditions; and obtaining the semantic mask corresponding to each pixel from the probability and the weight of its sub-category, the semantic masks of all pixels forming the semantic mask map. Setting a weight for each sub-category reduces the interference of distracting factors and improves the robustness of scene recognition.
The sub-categories include at least two of: fixed, non-fixed, dynamic, and unknown. The dynamic sub-category is given a smaller weight than the fixed, non-fixed, and unknown sub-categories. For example, setting a higher weight for the fixed sub-category and a smaller weight for the non-fixed sub-category suppresses the interference of non-fixed features on feature matching, improving the robustness of scene recognition.
Obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category includes: calculating the semantic mask corresponding to each pixel with the following formula (1):

m_i = p_i × w_i (1)

where m_i denotes the semantic mask corresponding to the i-th pixel (the map formed by all the m_i is the semantic mask map), p_i denotes the probability of the category or sub-category to which the i-th pixel belongs, and w_i denotes the weight corresponding to that category or sub-category. Computing the semantic mask map in this way reduces the interference of non-fixed features on scene recognition.
Performing feature aggregation processing on the images to be processed according to the semantic mask maps to obtain the feature vectors of the images to be processed includes: extracting features from the images to be processed to obtain a feature set; forming a plurality of cluster centers from the feature set; obtaining, from the plurality of cluster centers, the cluster center corresponding to each feature in each image to be processed; determining the value of each feature in the image to be processed in a first dimension, and determining the value, in the first dimension, of the cluster center corresponding to each feature; performing feature aggregation processing on the query image, using the cluster center corresponding to each feature, the value of each feature in the first dimension, and the value of the corresponding cluster center in the first dimension, combined with the semantic mask map of the query image, to obtain the feature vector of the query image; and performing feature aggregation processing on the image to be identified in the same way, combined with the semantic mask map of the image to be identified, to obtain the feature vector of the image to be identified. Because the non-fixed features are down-weighted in the semantic mask map, obtaining the features of the images to be processed with the mask reduces the interference of distracting factors and improves the robustness of scene recognition.
Forming a plurality of cluster centers from the feature set includes: processing the feature set with a clustering algorithm to form the plurality of cluster centers. Obtaining the cluster center corresponding to each feature in each image to be processed from the plurality of cluster centers includes: taking the cluster center closest to each feature as the cluster center corresponding to that feature.
Determining, from the images to be identified, an image matching the scene of the query image using the feature vectors of the images to be processed includes: determining the matching image according to the distance between the feature vector of each image to be identified and the feature vector of the query image. Because the computation of the feature vectors incorporates the semantic mask maps, the interference of non-fixed features is reduced and the images to be identified that are most similar to the query image are obtained.
Determining, from the images to be identified, an image matching the scene of the query image according to the distance between feature vectors includes: determining the image to be identified whose feature vector is nearest to the feature vector of the query image as the image matching the query image. An image to be identified with high similarity to the query image is thus obtained.
When several of the images to be identified match the query image, determining the nearest image further includes: ranking the images matching the query image with a spatial consistency method to obtain the image most similar to the query image. The retrieved scene is therefore more similar and the accuracy higher.
In order to solve the above technical problem, the second technical scheme provided by the invention is a scene recognition device, including: an acquisition module, configured to acquire the images to be processed and the semantic mask maps corresponding to the images to be processed, the images to be processed including a query image and images to be identified; a feature aggregation module, configured to perform feature aggregation processing on the images to be processed according to the semantic mask maps to obtain the feature vectors of the images to be processed; and an image matching module, configured to determine, from the images to be identified, an image matching the scene of the query image using the feature vectors of the images to be processed. Because the features of each image to be processed are obtained by combining its semantic mask map with feature aggregation, the interference of distracting factors is reduced and the robustness of scene recognition is improved.
In order to solve the above technical problem, the third technical scheme provided by the invention is an intelligent device, including: a processor and a memory coupled to each other, where the memory is used to store program instructions implementing any of the scene recognition methods described above.
In order to solve the above technical problem, the fourth technical scheme provided by the invention is a computer-readable storage medium storing a program file executable to implement any of the scene recognition methods described above.
The beneficial effects of the invention are as follows. Compared with the prior art, the scene recognition method provided by the invention acquires the images to be processed and the semantic mask maps corresponding to them, performs feature aggregation processing on the images to be processed according to the semantic mask maps to obtain their feature vectors, and then uses those feature vectors to determine, from the images to be identified, the image matching the scene of the query image. Acquiring the semantic mask map supplies high-level semantic information about the image, and combining it with feature aggregation removes the interference caused by distracting factors in the image, thereby improving the robustness of scene recognition.
Drawings
FIG. 1 is a flow chart of an embodiment of a scene recognition method of the present invention;
FIG. 2 is a flow chart of an embodiment of the step S11 in FIG. 1;
FIG. 3 is a flowchart illustrating another embodiment of step S11 in FIG. 1;
FIG. 4 is a schematic diagram illustrating a scene recognition device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a smart device of the present invention;
Fig. 6 is a schematic structural view of a computer-readable storage medium of the present invention.
Detailed Description
The following clearly and completely describes the embodiments of the present application with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the protection scope of the application.
The present invention will be described in detail with reference to the accompanying drawings and examples.
Referring to fig. 1, a flowchart of a first embodiment of a scene recognition method according to the present invention includes:
Step S11: acquire the images to be processed and the semantic mask maps corresponding to the images to be processed; the images to be processed include a query image and images to be identified.
Specifically, the images to be processed include a query image and images to be identified, and the semantic mask maps corresponding to the images to be processed include the semantic mask map of the query image and the semantic mask maps of the images to be identified. Specifically, referring to fig. 2, obtaining the semantic mask map of an image to be identified includes:
Step S21: perform semantic segmentation processing on the images to be identified and the query image to obtain the category of each pixel and the probability corresponding to the category.
The query image is specified by the user and may be an image currently captured by the user or an image stored in advance. The images to be identified are images retrieved from a database as candidate matches for the query image: the database resides on a server, the query image is submitted, and the server returns a number of images to be identified whose scenes are similar. Semantic segmentation is performed on the images to be identified and on the query image to obtain the category of each pixel in the image and the probability corresponding to that category.
Step S22: weights are set for the categories of each pixel according to the set conditions.
After the category of each pixel is acquired, a weight is set for the pixels of each category. In one embodiment, the categories obtained by semantic segmentation include fixed (e.g., stable), non-fixed (e.g., volatile), dynamic, and unknown sub-categories; to reduce the interference of dynamic features on scene recognition, the weight of the dynamic sub-category is set lowest, smaller than the weights of the fixed, non-fixed, and unknown sub-categories. In another embodiment, to reduce the interference of non-fixed features on scene recognition, the weight of the non-fixed sub-category is set lowest, smaller than the weights of the fixed, dynamic, and unknown sub-categories.
Step S23: obtain the semantic mask corresponding to each pixel from the probability corresponding to its category and the weight corresponding to its category; the semantic masks corresponding to all pixels form the semantic mask map.
Specifically, in one embodiment, the semantic mask corresponding to each pixel is calculated with the following formula (1):

m_i = p_i × w_i (1)

where m_i denotes the semantic mask corresponding to the i-th pixel (the map formed by all the m_i is the semantic mask map), p_i denotes the probability of the category or sub-category to which the i-th pixel belongs, and w_i denotes the weight corresponding to that category or sub-category.
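To make steps S21 to S23 concrete, the following sketch (not part of the patent: the segmentation backbone, the weight table, the class-to-sub-category mapping, and all names are illustrative assumptions) evaluates formula (1) at every pixel to build a semantic mask map from per-pixel class probabilities and category weights:

```python
# A minimal sketch of steps S21-S23, assuming an off-the-shelf segmentation
# network and an illustrative category-weight table.
import numpy as np
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Illustrative weights w_i per sub-category: fixed structures keep full
# weight, dynamic objects are strongly suppressed.
CATEGORY_WEIGHTS = {"fixed": 1.0, "non_fixed": 0.5, "dynamic": 0.1, "unknown": 0.7}
# Hypothetical mapping from the network's class indices to sub-categories.
CLASS_TO_SUBCATEGORY = {0: "unknown", 1: "fixed", 2: "dynamic"}

def semantic_mask_map(image_bchw: torch.Tensor) -> np.ndarray:
    """Return m_i = p_i * w_i for every pixel (formula (1)).

    image_bchw: a normalized float tensor of shape (1, 3, H, W).
    """
    model = deeplabv3_resnet50(weights="DEFAULT").eval()
    with torch.no_grad():
        logits = model(image_bchw)["out"][0]      # (C, H, W) class scores
    probs = torch.softmax(logits, dim=0)          # per-pixel class probabilities
    p, cls = probs.max(dim=0)                     # p_i and the winning class index
    weight_of = np.vectorize(
        lambda c: CATEGORY_WEIGHTS[CLASS_TO_SUBCATEGORY.get(int(c), "unknown")])
    return p.numpy() * weight_of(cls.numpy())     # the semantic mask map
```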
In another embodiment, if the categories produced by semantic segmentation do not include the four sub-categories of fixed, non-fixed, dynamic, and unknown, refer to fig. 3, where step S31 is the same as step S21 in fig. 2 and is not described again. In this embodiment the method further includes:
step S32: all pixels are attribute classified to obtain one or more sub-categories.
All pixels are attribute-classified to obtain one or more sub-categories. In one embodiment, the sub-categories include at least two of: fixed, non-fixed, dynamic, and unknown.
Step S33: set a weight for each sub-category according to the set conditions.
Specifically, after the sub-categories of the pixels are obtained, a weight is set for the pixels of each sub-category. In one embodiment, if the sub-categories obtained by attribute-classifying the semantic segmentation result include fixed, non-fixed, dynamic, and unknown, then to reduce the interference of dynamic features on scene recognition the weight of the dynamic sub-category is set lowest, smaller than the weights of the fixed, non-fixed, and unknown sub-categories. In another embodiment, to reduce the interference of non-fixed features on scene recognition, the weight of the non-fixed sub-category is set lowest, smaller than the weights of the fixed, dynamic, and unknown sub-categories.
Step S34: and obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask graph.
Specifically, in one embodiment, the semantic mask corresponding to each pixel is calculated with the following formula (1):

m_i = p_i × w_i (1)

where m_i denotes the semantic mask corresponding to the i-th pixel (the map formed by all the m_i is the semantic mask map), p_i denotes the probability of the sub-category to which the i-th pixel belongs, and w_i denotes the weight corresponding to the category or sub-category to which the i-th pixel belongs.
In the method provided by this embodiment, different weights are set for the pixel categories obtained by semantic segmentation, which reduces the interference those categories cause during feature matching and thereby improves the robustness of scene recognition.
Step S12: perform feature aggregation processing on the images to be processed according to the semantic mask maps to obtain the feature vectors of the images to be processed.
Specifically, an existing way to aggregate the features of an image into a feature vector is VLAD coding, which proceeds as follows. Features are extracted from the images to be processed to obtain a feature set. In another embodiment, features may instead be extracted from a preset image collection, which may be all of the images in the database or server, part of the images in the server, or a collection of pictures acquired by the user; this is not limited here. It will be appreciated that each image to be processed contains a plurality of features, i.e. feature extraction yields a plurality of features per image. All extracted features form the feature set, and a clustering algorithm is applied to this set to obtain K cluster centers. The K cluster centers are called a codebook, written C = {c_1, c_2, …, c_K}.
The features of a single image to be processed form a feature set X = {x_1, x_2, …, x_N}. In a specific embodiment, the feature set X is aggregated through the codebook C into a feature vector of fixed length.
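For instance, the codebook could be built with k-means over descriptors pooled from the preset image collection; a minimal sketch, assuming the local descriptors are already extracted and stacked row-wise (the names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors: np.ndarray, k: int = 64) -> np.ndarray:
    """Cluster the pooled feature set into a codebook C = {c_1, ..., c_K}."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)
    return kmeans.cluster_centers_  # shape (K, D): one row per cluster center
```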
After the plurality of cluster centers is obtained, the cluster center corresponding to each feature x_i in each image to be processed is determined: the position of feature x_i is considered, and the cluster center closest to x_i is taken as its cluster center c_k. In one embodiment, after the cluster center c_k of the current feature x_i is determined, the value of c_k in the first dimension is determined; the dimensionality of c_k is the same as that of the feature x_i assigned to it, so the value of c_k in the first dimension and the value of x_i in the first dimension can both be determined. Because c_k and its features share the same dimensions, to better characterise the features assigned to c_k, each dimension contributes the difference between the cluster center c_k and the corresponding feature x_i in that dimension. In the embodiments of the present disclosure, the first dimension may be dimension 1, dimension 2, dimension 3, and so on; it simply expresses that a cluster center and a feature are aggregated within the same dimension, so the phrase "in the first dimension" is not repeated everywhere below.
In the existing feature aggregation scheme, the feature vector of the query image or of an image to be identified is computed from the cluster center c_k corresponding to each feature and the per-dimension values of the features and cluster centers. In a specific embodiment, the prior art generally obtains the feature vector of the query image or the image to be identified by the following formula (3):

v(k, j) = Σ_i α_k(x_i) × (x_i(j) − c_k(j)) (3)

where v(k, j) denotes the feature vector of the query image or the image to be identified, α_k(x_i) denotes a selection function, x_i is a feature, α_k(x_i) equals 1 when c_k is the cluster center of x_i and 0 otherwise, x_i(j) denotes the value of the i-th feature in the j-th dimension, and c_k(j) denotes the value of the k-th cluster center in the j-th dimension.
It will be appreciated that when the feature vector of the query image is to be computed, x_i ranges over the features of the query image, and when the feature vector of an image to be identified is to be computed, x_i ranges over the features of that image. In both cases α_k(x_i) equals 1 when c_k is the cluster center corresponding to x_i and 0 otherwise, x_i(j) is the value of the i-th feature in the j-th dimension, and c_k(j) is the value of the k-th cluster center in the j-th dimension.
In the technical scheme of the invention, to avoid the influence of dynamic features on feature matching and the inaccuracy caused by the absence of high-level semantic information, the feature vector of the query image is obtained by performing feature aggregation processing on the query image, using the cluster center c_k corresponding to each feature x_i in the image to be processed, the value of each feature x_i in the first dimension, and the value of its cluster center c_k in the first dimension, combined with the semantic mask map of the query image. The feature vector of each image to be identified is obtained by performing feature aggregation processing on the image to be identified in the same way, combined with the semantic mask map of the image to be identified.
Specifically, the invention obtains the feature vectors of the query image and of the images to be identified through the following formula (2):

v(k, j)′ = Σ_i m_i × α_k(x_i) × (x_i(j) − c_k(j)) (2)

where v(k, j)′ denotes the feature vector of the query image or the image to be identified, α_k(x_i) denotes the selection function, equal to 1 when c_k is the cluster center of x_i and 0 otherwise, x_i(j) denotes the value of the i-th feature in the j-th dimension, c_k(j) denotes the value of the k-th cluster center in the j-th dimension, and m_i denotes the semantic mask taken from the semantic mask map of the query image or of the image to be identified at the location of feature x_i.
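Formula (2) is ordinary VLAD aggregation with every residual scaled by the mask value of its feature. A minimal numpy sketch, assuming features, mask values, and a codebook as above (all names are illustrative):

```python
import numpy as np

def masked_vlad(features: np.ndarray, masks: np.ndarray,
                codebook: np.ndarray) -> np.ndarray:
    """Formula (2): v(k, j)' = sum_i m_i * alpha_k(x_i) * (x_i(j) - c_k(j))."""
    num_clusters, dim = codebook.shape
    v = np.zeros((num_clusters, dim))
    # alpha_k(x_i) = 1 only for the cluster center nearest to x_i.
    assignments = np.argmin(
        ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1), axis=1)
    for x_i, m_i, k in zip(features, masks, assignments):
        v[k] += m_i * (x_i - codebook[k])  # mask-weighted residual
    return v.reshape(-1)  # fixed length K*D regardless of feature count
```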
With the method of the invention, when an image contains a large number of dynamic objects, for example, weighting by the semantic masks lowers the contribution of the dynamic objects and improves the robustness of feature matching.
Specifically, in an embodiment, when weighting by the semantic mask, if a feature lies at pixel level, the semantic mask of the corresponding position is read directly from the mask map according to the feature's position in the image; if the feature lies at sub-pixel level, its semantic mask is obtained by interpolating the semantic mask map at the corresponding position.
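For a sub-pixel location, the mask value can be obtained, for example, by bilinear interpolation of the four neighbouring pixels of the mask map (an illustrative sketch, not prescribed by the patent):

```python
import numpy as np

def mask_at(mask_map: np.ndarray, x: float, y: float) -> float:
    """Bilinearly interpolate the semantic mask map at sub-pixel (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, mask_map.shape[1] - 1)
    y1 = min(y0 + 1, mask_map.shape[0] - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * mask_map[y0, x0] + ax * mask_map[y0, x1]
    bottom = (1 - ax) * mask_map[y1, x0] + ax * mask_map[y1, x1]
    return float((1 - ay) * top + ay * bottom)
```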
In an embodiment, after the feature vectors of the query image and of the images to be identified are obtained in the above manner, the feature vector may first be normalized within each of the K cluster blocks separately, and the whole vector may then be normalized together.
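A sketch of that two-stage normalization, following the common intra-normalization-then-global-L2 recipe (the function name is illustrative):

```python
import numpy as np

def normalize_vlad(v: np.ndarray, num_clusters: int) -> np.ndarray:
    """Normalize each of the K cluster blocks, then the whole vector."""
    blocks = v.reshape(num_clusters, -1)
    block_norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    blocks = blocks / np.maximum(block_norms, 1e-12)  # per-cluster normalization
    flat = blocks.reshape(-1)
    return flat / max(np.linalg.norm(flat), 1e-12)    # global L2 normalization
```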
Step S13: and determining an image matched with the scene of the query image from the images to be identified by utilizing the feature vectors of the images to be processed.
After the feature vectors of the query image and of the images to be identified are obtained as in step S12, the image matching the scene of the query image is determined from the images to be identified according to the distance between the feature vector of each image to be identified and the feature vector of the query image.
It will be appreciated that the closer two feature vectors are, the higher the similarity of the underlying features, and the farther apart they are, the lower the similarity. In one embodiment, therefore, the image to be identified whose feature vector is closest to the feature vector of the query image is determined to be the image matching the query image.
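With the vectors normalized, the matching step reduces to a nearest-neighbour search; a minimal sketch (names are illustrative):

```python
import numpy as np

def rank_candidates(query_vec: np.ndarray, candidate_vecs: np.ndarray) -> np.ndarray:
    """Rank the images to be identified by distance to the query's feature vector."""
    dists = np.linalg.norm(candidate_vecs - query_vec[None, :], axis=1)
    return np.argsort(dists)  # index 0 is the closest, i.e. best, scene match
```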
In an embodiment, if several of the images to be identified match the query image, the matching images are ranked with a spatial consistency method in order to obtain the image most similar to the query image.
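The patent does not prescribe a specific spatial consistency method; one common instantiation is geometric verification, re-ranking the shortlist by the number of RANSAC inliers between matched local features. A sketch using OpenCV under that assumption (all names are illustrative):

```python
import cv2
import numpy as np

def spatial_consistency_score(kp_q, des_q, kp_c, des_c) -> int:
    """Count RANSAC homography inliers between query and candidate features."""
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_q, des_c, k=2)
    good = [m for m, n in (p for p in pairs if len(p) == 2)
            if m.distance < 0.75 * n.distance]  # Lowe ratio test
    if len(good) < 4:  # a homography needs at least four correspondences
        return 0
    src = np.float32([kp_q[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_c[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(inliers.sum()) if inliers is not None else 0
```

Candidates with more inliers are ranked higher, which filters out matches that agree only in appearance but not in geometry.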
In the scene recognition method provided by the invention, the semantic mask map is combined with the conventional feature aggregation scheme, so that semantic mask weighting reduces the interference of dynamic features in the image on feature matching and effectively avoids the negative influence of unstable objects on scene recognition. At the same time, because soft weighting is used, errors caused by the instability of semantic segmentation itself are also avoided, which further improves robustness. Furthermore, the method of the invention remains very robust under seasonal changes.
Fig. 4 is a schematic structural diagram of a scene recognition device according to an embodiment of the invention. The device includes an acquisition module 41, a feature aggregation module 42, and an image matching module 43.
The acquisition module 41 is configured to acquire the images to be processed and the semantic mask maps corresponding to them; the images to be processed include a query image and images to be identified, and the corresponding semantic mask maps include the semantic mask map of the query image and the semantic mask maps of the images to be identified. Specifically, the acquisition module 41 is configured to acquire the query image and retrieve from a database a number of images to be identified matching it; perform semantic segmentation processing on the images to be identified and the query image to obtain the category of each pixel and the probability corresponding to that category; set a weight for each pixel category according to the set conditions; and obtain the semantic mask corresponding to each pixel from the probability and the weight of its category, the semantic masks of all pixels forming the semantic mask map. In an embodiment, the acquisition module 41 is further configured to classify all pixels into one or more sub-categories, set a weight for each sub-category according to the set conditions, and obtain the semantic mask corresponding to each pixel from the probability and the weight of its sub-category, the semantic masks of all pixels forming the semantic mask map.
The feature aggregation module 42 is configured to perform feature aggregation processing on the images to be processed according to the semantic mask maps to obtain their feature vectors. Specifically, the feature aggregation module 42 is configured to extract features from the images to be processed to obtain a feature set; form a plurality of cluster centers from the feature set; obtain the cluster center corresponding to each feature in each image to be processed from the plurality of cluster centers; and determine the value of each feature in a first dimension and the value, in the first dimension, of the cluster center corresponding to each feature.
The module then performs feature aggregation processing on the query image, combining the per-dimension values of the features and their cluster centers with the semantic mask map of the query image, to obtain the feature vector of the query image, and performs feature aggregation processing on each image to be identified in the same way, combined with its own semantic mask map, to obtain the feature vector of the image to be identified.
The image matching module 43 is configured to determine, from the images to be identified, an image matching the scene of the query image using the feature vectors of the images to be processed. Specifically, the image matching module 43 determines the matching image according to the distance between the feature vector of each image to be identified and the feature vector of the query image. In an embodiment, the image to be identified whose feature vector is closest to the feature vector of the query image is determined to be the matching image. In an embodiment, when several images match the query image, the image matching module 43 ranks them with a spatial consistency method to obtain the image most similar to the query image.
In the scene recognition device provided by the invention, the semantic mask map is combined with the conventional feature aggregation scheme, so that semantic mask weighting reduces the interference of dynamic features in the image on feature matching and thereby improves the robustness of the scene recognition device.
Fig. 5 is a schematic structural diagram of an intelligent device according to the present invention. The intelligent device includes a memory 52 and a processor 51 connected to each other.
The memory 52 is used to store program instructions implementing the scene recognition method of any of the above.
The processor 51 is operative to execute program instructions stored in the memory 52.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capabilities. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
The memory 52 may be a memory module, a TF card, and so on, and may store all information in the intelligent device, including the input raw data, the computer program, intermediate results, and final results. It stores and retrieves information according to the locations specified by the controller. With the memory, the intelligent device has a storage function and can operate normally. The memories of intelligent devices can be classified by purpose into main memory (internal memory) and auxiliary memory (external memory). External memory is usually a magnetic medium, an optical disc, or the like, and can hold information for a long time. Internal memory refers to the storage components on the motherboard that hold the data and programs currently being executed; it is used only for temporary storage, and its contents are lost when the power is switched off.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions causing a computer device (which may be a personal computer, a system server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application.
Referring to fig. 6, a schematic structure of a computer-readable storage medium according to the present application is shown. The computer-readable storage medium of the present application stores a program file 61 capable of implementing all of the scene recognition methods described above. The program file 61 may be stored in the storage medium in the form of a software product and includes several instructions causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage device includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media capable of storing program code, or a terminal device such as a computer, a server, a mobile phone, or a tablet.
The foregoing is only embodiments of the present invention and does not limit the patent scope of the invention; any equivalent structure or equivalent process transformation made using the description and drawings of the invention, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the invention.

Claims (12)

1. A scene recognition method, comprising:
acquiring images to be processed and semantic mask maps corresponding to the images to be processed; the images to be processed comprise a query image and an image to be identified, and the semantic mask maps corresponding to the images to be processed comprise the semantic mask map of the query image and the semantic mask map of the image to be identified;
performing feature aggregation processing on the images to be processed according to the semantic mask maps to obtain feature vectors of the images to be processed;
determining, from the image to be identified, an image matching the scene of the query image using the feature vectors of the images to be processed;
wherein performing feature aggregation processing on the images to be processed according to the semantic mask maps to obtain the feature vectors of the images to be processed comprises:
extracting features from the images to be processed to obtain a feature set; forming a plurality of cluster centers from the feature set; obtaining, from the plurality of cluster centers, the cluster center corresponding to each feature in each image to be processed;
determining the value of each feature in the image to be processed in a first dimension, and determining the value, in the first dimension, of the cluster center corresponding to each feature in the image to be processed;
performing feature aggregation processing on the query image, using the cluster center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of the corresponding cluster center in the first dimension, combined with the semantic mask map of the query image, to obtain the feature vector of the query image; and
performing feature aggregation processing on the image to be identified, using the value of each feature in the image to be processed in the first dimension and the value, in the first dimension, of the cluster center corresponding to each feature, combined with the semantic mask map of the image to be identified, to obtain the feature vector of the image to be identified.
2. The scene recognition method according to claim 1, wherein acquiring the images to be processed and the semantic mask maps corresponding to the images to be processed comprises:
performing semantic segmentation processing on the image to be identified and the query image to obtain the category of each pixel and the probability corresponding to the category;
setting a weight for the category of each pixel according to set conditions; and
obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the category and the weight corresponding to the category, wherein the semantic masks corresponding to all the pixels form the semantic mask map.
3. The method of claim 2, wherein before the weight is set for the category of each pixel according to the set conditions, the method further comprises:
classifying the attributes of all pixels to obtain one or more sub-categories;
setting a weight for each sub-category according to the set conditions; and
obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form the semantic mask map.
4. The method of claim 3, wherein the sub-categories include at least two of a fixed sub-category, a non-fixed sub-category, a dynamic sub-category, and an unknown sub-category;
The dynamic subcategory is less weighted than the fixed subcategory, the non-fixed subcategory, and the unknown subcategory.
5. The method of claim 4, wherein obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category comprises: calculating the semantic mask corresponding to the pixel with the following formula (1):

m_i = p_i × w_i (1)

wherein m_i denotes the semantic mask corresponding to the i-th pixel, the map generated from all the m_i being the semantic mask map, p_i denotes the probability of the sub-category to which the i-th pixel belongs, and w_i denotes the weight corresponding to the category or sub-category to which the i-th pixel belongs.
6. The method of claim 1, wherein forming a plurality of cluster centers from the feature set comprises:
Processing the feature set by using a clustering algorithm to form a plurality of clustering centers;
the obtaining the clustering center corresponding to each feature in each image to be processed according to the plurality of clustering centers comprises the following steps:
and taking the cluster center closest to each feature as the cluster center corresponding to each feature in the image to be processed.
7. The method of any of claims 1-6, wherein determining an image from the images to be identified that matches the scene of the query image using the feature vectors of the images to be processed comprises:
and determining an image matched with the query image scene from the image to be identified according to the distance between the feature vector of the image to be identified and the feature vector of the query image.
8. The method of claim 7, wherein determining an image from the image to be identified that matches the query image scene based on the distance of the feature vector of the image to be identified from the feature vector of the query image comprises:
And determining the image to be identified corresponding to the feature vector nearest to the feature vector of the query image as the image matched with the query image.
9. The method of claim 8, wherein the images to be identified that match the query image are a plurality of images;
The determining the image to be identified corresponding to the feature vector nearest to the feature vector of the query image as the image matched with the query image further comprises:
and arranging the images matched with the query image by adopting a space consistency method so as to acquire the image most similar to the query image.
10. A scene recognition device, comprising:
an acquisition module, configured to acquire images to be processed and semantic mask maps corresponding to the images to be processed; the images to be processed comprise a query image and an image to be identified, and the semantic mask maps corresponding to the images to be processed comprise the semantic mask map of the query image and the semantic mask map of the image to be identified;
a feature aggregation module, configured to perform feature aggregation processing on the images to be processed according to the semantic mask maps to obtain feature vectors of the images to be processed; and
an image matching module, configured to determine, from the image to be identified, an image matching the scene of the query image using the feature vectors of the images to be processed;
wherein performing feature aggregation processing on the images to be processed according to the semantic mask maps to obtain the feature vectors of the images to be processed comprises:
extracting features from the images to be processed to obtain a feature set; forming a plurality of cluster centers from the feature set; obtaining, from the plurality of cluster centers, the cluster center corresponding to each feature in each image to be processed;
determining the value of each feature in the image to be processed in a first dimension, and determining the value, in the first dimension, of the cluster center corresponding to each feature in the image to be processed;
performing feature aggregation processing on the query image, using the cluster center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of the corresponding cluster center in the first dimension, combined with the semantic mask map of the query image, to obtain the feature vector of the query image; and
performing feature aggregation processing on the image to be identified, using the value of each feature in the image to be processed in the first dimension and the value, in the first dimension, of the cluster center corresponding to each feature, combined with the semantic mask map of the image to be identified, to obtain the feature vector of the image to be identified.
11. An intelligent device, characterized by comprising: a processor and a memory coupled to each other, wherein
The memory is configured to store program instructions for implementing the scene recognition method according to any one of claims 1-9.
12. A computer readable storage medium, characterized in that a program file is stored, which program file is executable to implement the scene recognition method according to any of claims 1-9.
CN202011249944.4A 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium Active CN112329660B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202011249944.4A CN112329660B (en) 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium
PCT/CN2021/106936 WO2022100133A1 (en) 2020-11-10 2021-07-16 Scene recognition method and apparatus, intelligent device, storage medium and computer program
JP2022543759A JP2023510945A (en) 2020-11-10 2021-07-16 Scene identification method and apparatus, intelligent device, storage medium and computer program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011249944.4A CN112329660B (en) 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112329660A CN112329660A (en) 2021-02-05
CN112329660B true CN112329660B (en) 2024-05-24

Family

ID=74317739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011249944.4A Active CN112329660B (en) 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium

Country Status (3)

Country Link
JP (1) JP2023510945A (en)
CN (1) CN112329660B (en)
WO (1) WO2022100133A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329660B (en) * 2020-11-10 2024-05-24 浙江商汤科技开发有限公司 Scene recognition method and device, intelligent equipment and storage medium
CN113393515B (en) * 2021-05-21 2023-09-19 杭州易现先进科技有限公司 Visual positioning method and system combining scene annotation information
CN117009532B (en) * 2023-09-21 2023-12-19 腾讯科技(深圳)有限公司 Semantic type recognition method and device, computer readable medium and electronic equipment
CN118660137B (en) * 2024-08-16 2024-10-18 杭州瀛诚科技有限公司 Intelligent building monitoring system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239535A (en) * 2017-05-31 2017-10-10 北京小米移动软件有限公司 Similar pictures search method and device
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
WO2019095997A1 (en) * 2017-11-15 2019-05-23 深圳云天励飞技术有限公司 Image recognition method and device, computer device and computer-readable storage medium
CN111027493A (en) * 2019-12-13 2020-04-17 电子科技大学 Pedestrian detection method based on deep learning multi-network soft fusion
WO2020134674A1 (en) * 2018-12-29 2020-07-02 平安科技(深圳)有限公司 Palmprint identification method, apparatus, computer device, and storage medium
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8891908B2 (en) * 2012-11-14 2014-11-18 Nec Laboratories America, Inc. Semantic-aware co-indexing for near-duplicate image retrieval
CN105335757A (en) * 2015-11-03 2016-02-17 电子科技大学 Model identification method based on local characteristic aggregation descriptor
JP6965343B2 (en) * 2016-10-31 2021-11-10 コニカ ミノルタ ラボラトリー ユー.エス.エー.,インコーポレイテッド Image segmentation methods and systems with control feedback
SG11202013059VA (en) * 2018-09-15 2021-02-25 Beijing Sensetime Technology Development Co Ltd Image processing method, electronic device, and storage medium
CN112329660B (en) * 2020-11-10 2024-05-24 浙江商汤科技开发有限公司 Scene recognition method and device, intelligent equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239535A (en) * 2017-05-31 2017-10-10 北京小米移动软件有限公司 Similar pictures search method and device
WO2019095997A1 (en) * 2017-11-15 2019-05-23 深圳云天励飞技术有限公司 Image recognition method and device, computer device and computer-readable storage medium
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
WO2020134674A1 (en) * 2018-12-29 2020-07-02 平安科技(深圳)有限公司 Palmprint identification method, apparatus, computer device, and storage medium
CN111027493A (en) * 2019-12-13 2020-04-17 电子科技大学 Pedestrian detection method based on deep learning multi-network soft fusion
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model

Also Published As

Publication number Publication date
CN112329660A (en) 2021-02-05
WO2022100133A1 (en) 2022-05-19
JP2023510945A (en) 2023-03-15

Similar Documents

Publication Publication Date Title
CN112329660B (en) Scene recognition method and device, intelligent equipment and storage medium
US9864928B2 (en) Compact and robust signature for large scale visual search, retrieval and classification
CN109359725B (en) Training method, device and equipment of convolutional neural network model and computer readable storage medium
US9600738B2 (en) Discriminative embedding of local color names for object retrieval and classification
CN103207898B (en) A kind of similar face method for quickly retrieving based on local sensitivity Hash
US20210073890A1 (en) Catalog-based image recommendations
US20210099310A1 (en) Image processing method, image matching method, device and storage medium
US20150066957A1 (en) Image search device, image search method, program, and computer-readable storage medium
US20240273134A1 (en) Image encoder training method and apparatus, device, and medium
CN110825894A (en) Data index establishing method, data index retrieving method, data index establishing device, data index retrieving device, data index establishing equipment and storage medium
WO2022127333A1 (en) Training method and apparatus for image segmentation model, image segmentation method and apparatus, and device
CN112561976A (en) Image dominant color feature extraction method, image retrieval method, storage medium and device
CN110083731B (en) Image retrieval method, device, computer equipment and storage medium
CN112445926B (en) Image retrieval method and device
CN109741380B (en) Textile picture fast matching method and device
WO2022001034A1 (en) Target re-identification method, network training method thereof, and related device
CN111444373B (en) Image retrieval method, device, medium and system thereof
Wang et al. Salient object detection based on multi-feature graphs and improved manifold ranking
Yan Accurate Image Retrieval Algorithm Based on Color and Texture Feature.
CN115984671A (en) Model online updating method and device, electronic equipment and readable storage medium
Peng et al. Multi-task person re-identification via attribute and part-based learning
CN110263637B (en) Intelligent clothes storage and identification method and system
CN113688708A (en) Face recognition method, system and storage medium based on probability characteristics
CN109993178B (en) Feature data generation and feature matching method and device
CN111666902A (en) Training method of pedestrian feature extraction model, pedestrian recognition method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040124

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant