CN112329660A - Scene recognition method and device, intelligent equipment and storage medium

Scene recognition method and device, intelligent equipment and storage medium

Info

Publication number
CN112329660A
CN112329660A
Authority
CN
China
Prior art keywords
image
processed
feature
query
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011249944.4A
Other languages
Chinese (zh)
Other versions
CN112329660B (en)
Inventor
鲍虎军
章国锋
余海林
冯友计
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Shangtang Technology Development Co Ltd
Original Assignee
Zhejiang Shangtang Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Shangtang Technology Development Co Ltd filed Critical Zhejiang Shangtang Technology Development Co Ltd
Priority to CN202011249944.4A priority Critical patent/CN112329660B/en
Publication of CN112329660A publication Critical patent/CN112329660A/en
Priority to PCT/CN2021/106936 priority patent/WO2022100133A1/en
Priority to JP2022543759A priority patent/JP2023510945A/en
Application granted granted Critical
Publication of CN112329660B publication Critical patent/CN112329660B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/24553: Query execution of query operations
    • G06F 16/24554: Unary operations; Data partitioning operations
    • G06F 16/24556: Aggregation; Duplicate elimination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53: Querying
    • G06F 16/532: Query formulation, e.g. graphical querying
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G06F 18/232: Non-hierarchical techniques
    • G06F 18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image retrieval and provides a scene recognition method and device, an intelligent device, and a storage medium. The scene recognition method comprises the following steps: acquiring an image to be processed and a semantic mask map corresponding to the image to be processed, where the image to be processed comprises a query image and an image to be recognized, and the corresponding semantic mask map comprises a semantic mask map of the query image and a semantic mask map of the image to be recognized; performing feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed; and determining, from the images to be recognized, an image that matches the scene of the query image by using the feature vectors of the images to be processed. In this way, the semantic mask map reduces the interference of interference-factor features with feature recognition, thereby improving the robustness of scene recognition.

Description

Scene recognition method and device, intelligent equipment and storage medium
Technical Field
The invention relates to the technical field of image retrieval, in particular to a scene recognition method, a scene recognition device, intelligent equipment and a storage medium.
Background
Scene recognition has important applications in the field of computer vision, such as Simultaneous Localization and Mapping (SLAM), Structure from Motion (SFM), and Visual Localization (VL).
The main subject of research on the scene recognition problem is to recognize the scene corresponding to a given image: to give the name of the scene or its geographical location, or to select images with a similar scene from a database, in which case the task can also be regarded as an image retrieval problem. Two methods are commonly used at present: one directly computes a global description of the image, and the other uses feature aggregation. At present, research on scene recognition methods in the prior art is becoming more and more extensive.
Disclosure of Invention
The invention provides a scene recognition method, a scene recognition device, intelligent equipment and a storage medium, which are used for improving the robustness of scene recognition when an image contains interference factors.
In order to solve the above technical problems, a first technical solution provided by the present invention is: provided is a scene recognition method, including: acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises a query image and an image to be recognized, and the semantic mask image corresponding to the image to be processed comprises a semantic mask image of the query image and a semantic mask image of the image to be recognized; performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed; and determining an image matched with the scene of the query image from the images to be recognized by utilizing the feature vectors of the images to be processed. The features of the image to be processed are obtained by combining the semantic mask map with feature aggregation, so that the interference of interference factors can be reduced and the robustness of scene recognition improved.
The step of obtaining the image to be processed and the semantic mask image corresponding to the image to be processed comprises: performing semantic segmentation processing on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to the category; setting a weight for the category of each pixel according to set conditions; and obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the category and the weight corresponding to the category, wherein the semantic masks corresponding to all the pixels form a semantic mask image. Through this setting of weights, when the obtained semantic mask image is combined with feature aggregation to obtain the features of the image to be processed, the interference of interference factors can be reduced and the robustness of scene recognition improved.
Before setting the weight for the category of each pixel according to the setting condition, the method further comprises the following steps: performing attribute classification on all pixels to obtain one or more sub-categories; setting weight for each sub-category according to set conditions; and obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image. And each subcategory is provided with a weight, so that the interference of interference factors can be reduced, and the robustness of scene identification is improved.
Wherein the sub-categories include at least two of a fixed sub-category, an unfixed sub-category, a dynamic sub-category, and an unknown sub-category; the weight of the dynamic sub-category is less than the weights of the fixed sub-category, the unfixed sub-category, and the unknown sub-category. For example, a higher weight is set for the fixed sub-category and a smaller weight is set for the unfixed sub-category, so that the interference of unfixed features with feature recognition is suppressed and the robustness of scene recognition is improved.
Wherein the obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category comprises: calculating a semantic mask corresponding to the pixel by using the following formula (1):
m_i = p_i × w_i    (1)

where m_i represents the semantic mask corresponding to the i-th pixel (the map formed by all semantic masks is the semantic mask map), p_i represents the probability of the category or sub-category to which the i-th pixel belongs, and w_i represents the weight corresponding to that category or sub-category. Calculating the semantic mask map in this way reduces the interference of unfixed features with scene recognition.
Performing feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed includes: extracting features of the image to be processed to obtain a feature set; forming a plurality of clustering centers from the feature set; obtaining, from the plurality of clustering centers, the clustering center corresponding to each feature in each image to be processed; determining the value of each feature in the image to be processed in a first dimension, and determining the value of the clustering center corresponding to each feature in the first dimension; performing feature aggregation processing on the query image by using the clustering center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of its clustering center in the first dimension, in combination with the semantic mask map of the query image, to obtain the feature vector of the query image; and performing feature aggregation processing on the image to be recognized in the same way, in combination with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized. Because the semantic mask map assigns weights to unfixed features, obtaining the features of the image to be processed in this way reduces the interference of interference factors and improves the robustness of scene recognition.
Wherein said forming a plurality of cluster centers from said feature set comprises: processing the feature set by using a clustering algorithm to form a plurality of clustering centers; the obtaining of the clustering center corresponding to each feature in each image to be processed according to the plurality of clustering centers comprises: and taking the clustering center closest to each feature as the clustering center corresponding to each feature in the image to be processed.
Wherein the determining, from the images to be recognized, an image matching the scene of the query image by using the feature vector of the image to be processed comprises: determining an image matched with the query image scene from the image to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image. Because the calculation of the feature vector incorporates the semantic mask map, the interference of unfixed features is reduced, and an image to be recognized with higher similarity to the query image is obtained.
The step of determining an image matched with the query image scene from the image to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image comprises the following steps: and determining the image to be identified corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image. Thus, the image to be identified with higher similarity to the query image is obtained.
Wherein there may be a plurality of images among the images to be recognized that match the query image; after the step of determining the image to be recognized corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image, the method further comprises: ranking the images matched with the query image by a spatial consistency method so as to obtain the image most similar to the query image. In this way, the retrieved scene is more similar and the accuracy is higher.
In order to solve the above technical problems, a second technical solution provided by the present invention is: there is provided a scene recognition apparatus including: the acquisition module is used for acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises an inquiry image and an image to be identified; the feature aggregation module is used for performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed; and the image matching module is used for determining an image matched with the scene of the query image from the image to be identified by utilizing the characteristic vector of the image to be processed. The corresponding features of the image to be processed are obtained by combining the semantic mask map with a feature aggregation mode, so that the interference of interference factors can be reduced, and the robustness of scene recognition is improved.
In order to solve the above technical problems, a third technical solution provided by the present invention is: there is provided a smart device comprising: a processor and a memory coupled to each other, wherein the memory is used for storing program instructions for implementing the scene recognition method according to any one of the above items.
In order to solve the above technical problems, a fourth technical solution provided by the present invention is: there is provided a computer-readable storage medium storing a program file executable to implement the scene recognition method of any one of the above.
The invention has the beneficial effects that: different from the prior art, the scene recognition method provided by the invention acquires an image to be processed and its corresponding semantic mask map, performs feature aggregation processing on the image to be processed according to the semantic mask map to obtain its feature vector, and uses the feature vector to determine, from the images to be recognized, an image that matches the scene of the query image. High-level semantic information of the image is obtained through the semantic mask map, and combining the semantic mask map with feature aggregation eliminates the interference caused by interference factors in the image, thereby improving the robustness of scene recognition.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a scene recognition method according to the present invention;
FIG. 2 is a flowchart illustrating an embodiment of step S11 in FIG. 1;
FIG. 3 is a schematic flow chart diagram illustrating another embodiment of step S11 in FIG. 1;
FIG. 4 is a schematic structural diagram of an embodiment of a scene recognition apparatus according to the present invention;
FIG. 5 is a schematic block diagram of an embodiment of a smart device of the present invention;
fig. 6 is a schematic structural diagram of the computer-readable storage medium of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The present invention will be described in detail below with reference to the accompanying drawings and examples.
Referring to fig. 1, a flowchart of a scene recognition method according to a first embodiment of the present invention is shown, which includes:
step S11: acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises an inquiry image and an image to be identified.
Specifically, the image to be processed comprises a query image and an image to be recognized, and the semantic mask map corresponding to the image to be processed comprises the semantic mask map of the query image and the semantic mask map of the image to be recognized. Specifically, referring to fig. 2, obtaining the semantic mask map corresponding to the image to be recognized includes:
step S21: and performing semantic segmentation processing on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to the category.
The query image is an image specified by the user; it can be an image currently captured by the user or an image stored in advance. The image to be recognized is an image retrieved from the database according to the query image and matched against it. The database may reside on a server: a query image is input, and the server matches a plurality of images to be recognized with similar scenes against the query image. Semantic segmentation is performed on the image to be recognized and the query image to obtain the category of each pixel in the image and the probability corresponding to that category.
Step S22: a weight is set for each pixel type according to a setting condition.
After the categories of the pixels are acquired, a weight is set for the pixels of each category. In one embodiment, the categories obtained by semantic segmentation include four categories: a fixed sub-category (e.g., stable), an unfixed sub-category (e.g., volatile), a dynamic sub-category, and an unknown sub-category. To reduce the interference of dynamic features with scene recognition, in one embodiment the weight of the dynamic sub-category is set to be the lowest, smaller than the weights of the fixed, unfixed, and unknown sub-categories. In another embodiment, if it is desired to reduce the interference of unfixed features with scene recognition, the weight of the unfixed sub-category is set to be the lowest, smaller than the weights of the fixed, dynamic, and unknown sub-categories.
Step S23: and obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
Specifically, in an embodiment, the semantic mask corresponding to each pixel is calculated by using the following formula (1):
m_i = p_i × w_i    (1)

where m_i represents the semantic mask corresponding to the i-th pixel (the map formed by all semantic masks is the semantic mask map), p_i represents the probability of the category or sub-category to which the i-th pixel belongs, and w_i represents the weight corresponding to that category or sub-category.
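As an illustration only, the following is a minimal Python sketch of applying formula (1) to a whole image; the weight table CLASS_WEIGHTS, the function name semantic_mask_map, and the segmentation outputs labels/probs are hypothetical names chosen here, not part of the patent.

```python
import numpy as np

# Hypothetical per-category weights (the "set condition"): the dynamic
# sub-category gets the lowest weight so its features are suppressed.
CLASS_WEIGHTS = {"fixed": 1.0, "unfixed": 0.5, "unknown": 0.5, "dynamic": 0.1}

def semantic_mask_map(labels, probs, class_names):
    """Compute m_i = p_i * w_i for every pixel (formula (1)).

    labels: (H, W) int array, predicted category index of each pixel.
    probs:  (H, W) float array, probability of the predicted category.
    class_names: list mapping category index -> category name.
    """
    weights = np.array([CLASS_WEIGHTS[name] for name in class_names])
    return probs * weights[labels]  # (H, W) semantic mask map
```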
In another embodiment, if the category result after semantic segmentation does not include the four categories of stable (fixed), volatile (unfixed), dynamic, and unknown, please refer to fig. 3, in which step S31 is the same as step S21 in fig. 2 and is not described herein again. In this embodiment, the method proceeds as follows:
step S32: all pixels are attribute classified to obtain one or more sub-categories.
All pixels are classified by attribute to obtain one or more sub-categories; in one embodiment, the sub-categories include at least two of the following: a fixed sub-category, an unfixed sub-category, a dynamic sub-category, and an unknown sub-category.
Step S33: and setting weight for each sub-category according to set conditions.
Specifically, after the sub-categories of the pixels are acquired, a weight is set for the pixels of each sub-category. In one embodiment, the sub-categories obtained by attribute classification of the semantic segmentation result include four classes: a fixed sub-category, an unfixed sub-category, a dynamic sub-category, and an unknown sub-category. To reduce the interference of dynamic features with scene recognition, in one embodiment the weight of the dynamic sub-category is set to be the lowest, smaller than the weights of the fixed, unfixed, and unknown sub-categories. In another embodiment, if it is desired to reduce the interference of unfixed features with scene recognition, the weight of the unfixed sub-category is set to be the lowest, smaller than the weights of the fixed, dynamic, and unknown sub-categories.
Step S34: and obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
Specifically, in an embodiment, the semantic mask corresponding to each pixel is calculated by using the following formula (1):
m_i = p_i × w_i    (1)

where m_i represents the semantic mask corresponding to the i-th pixel (the map formed by all semantic masks is the semantic mask map), p_i represents the probability of the category or sub-category to which the i-th pixel belongs, and w_i represents the weight corresponding to that category or sub-category.
According to the method provided by this embodiment, different weights are set for the pixel categories obtained by semantic segmentation, so that the interference these categories cause in feature recognition is reduced, and the robustness of scene recognition is further improved.
Step S12: and performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed.
Specifically, an existing way of performing feature aggregation on the features to be processed to obtain a feature vector is VLAD coding. Obtaining the feature vector by means of VLAD coding includes: performing feature extraction on the image to be processed to obtain a feature set. In another embodiment, feature extraction may be performed on a preset set of images to obtain the feature set, where the preset image set may be the set of all images in the database or the server, a set of some of the images in the server, or a set of pictures collected by the user, without limitation. It can be understood that each image to be processed contains a plurality of features; that is, when feature extraction is performed, a plurality of features are extracted from each image to be processed. All extracted features form a feature set, and a clustering algorithm is then applied to this feature set to obtain K cluster centers. The K cluster centers are called the codebook, obtained as C = {c1, c2, …, cK}.
A plurality of features in one of the images to be processed form a feature set X = {x1, x2, …, xn}. In a specific embodiment, the feature set X can be aggregated into a feature vector of fixed length through the codebook C.
After the plurality of cluster centers are obtained, the cluster center corresponding to each feature x_i in each image to be processed is obtained from them. Specifically, for each feature x_i, the cluster center nearest to x_i is determined as its corresponding cluster center c_k. In one embodiment, after the cluster center c_k corresponding to the current feature x_i is determined, the value of c_k in the first dimension is determined. The dimensionality of c_k is the same as that of its corresponding feature x_i, so the value of c_k in the first dimension can be compared with the value of x_i in the first dimension; because the two have the same dimensionality, the cluster center c_k and its corresponding feature x_i are distinguished by the difference between them in each dimension. In the embodiments of the present disclosure, the first dimension may be dimension 1, dimension 2, dimension 3, and so on; since the cluster center and the feature are aggregated in the same dimension, the description below uses the first dimension throughout for brevity.
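A minimal sketch of the codebook construction and nearest-center assignment described above, assuming scikit-learn's KMeans as the clustering algorithm and a codebook size of K = 64; both choices, and the function names, are illustrative and not specified by the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_features, k=64):
    """Cluster the pooled feature set into K centers (the codebook C)."""
    return KMeans(n_clusters=k, n_init=10).fit(all_features).cluster_centers_

def assign_centers(features, codebook):
    """For each feature x_i, find the index of its nearest center c_k."""
    # Squared Euclidean distance between every feature and every center.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)  # (N,) index k of the center for each x_i
```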
In existing feature recognition approaches, the feature vector of the query image and of the image to be recognized is obtained from the cluster center c_k corresponding to each feature, together with the values of the feature and of its cluster center in the first dimension. In one embodiment, the prior art generally obtains the feature vector of the query image or the image to be recognized by the following formula (3):
v(k, j) = Σ_i α_k(x_i) × (x_i(j) - c_k(j))    (3)
where v (k, j) represents the feature vector of the query image or image to be identified, αk(xi) Representing a selection function, xiIs characterized in that when ckIs xiWhen clustering the center of (a)k(xi) Equal to 1, otherwise αk(xi) Is equal to 0, xi(j) Value corresponding to the jth dimension expressed as ith feature, ck(j) And representing the value corresponding to the j dimension of the k clustering center.
It will be appreciated that when it is desired to compute the feature vector of the query image, v (k, j) represents the feature vector of the query image, αk(xi) Representing a selection function, xiTo query for features of the image, when ck(cluster center) is xiCorresponding cluster center, αk(xi) Equal to 1, otherwise αk(xi) Equal to 0. x is the number ofi(j) Expressed as the ith on the query imageValue corresponding to the jth dimension of the feature, ck(j) And representing the value corresponding to the j dimension of the k clustering center of the query image.
It will be understood that when it is desired to calculate the feature vector of the image to be recognized, v (k, j) represents the feature vector of the image to be recognized, αk(xi) Representing a selection function, xiAs a feature of the image to be recognized, when ck(cluster center) is xiCorresponding cluster center, αk(xi) Equal to 1, otherwise αk(xi) Equal to 0. x is the number ofi(j) Expressed as the value corresponding to the jth dimension of the ith feature on the image to be recognized, ck(j) And representing the value corresponding to the j dimension of the k clustering center of the image to be identified.
In the technical solution of the present invention, in order to avoid the situation where the lack of high-level semantic information lets dynamic features influence feature-vector recognition and thus makes the recognition result inaccurate, feature aggregation processing is performed on the query image by using the cluster center c_k corresponding to each feature x_i in the image to be processed, the value of that cluster center in the first dimension, and the value of each feature x_i in the first dimension, in combination with the semantic mask map of the query image, to obtain the feature vector of the query image. Likewise, feature aggregation processing is performed on the image to be recognized by using the cluster center c_k corresponding to each feature x_i, the value of that cluster center in the first dimension, and the value of each feature x_i in the first dimension, in combination with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized.
Specifically, the feature vectors of the query image and the image to be identified are obtained through the following formula (2):
v(k, j)′ = Σ_i m_i × α_k(x_i) × (x_i(j) - c_k(j))    (2)
where v(k, j)′ represents the feature vector of the query image or the image to be recognized, α_k(x_i) represents the selection function (α_k(x_i) equals 1 when c_k is the cluster center of x_i, and 0 otherwise), x_i(j) represents the value of the i-th feature in the j-th dimension, c_k(j) represents the value of the k-th cluster center in the j-th dimension, and m_i represents the semantic mask value taken from the semantic mask map of the query image or the image to be recognized at the position of the i-th feature.
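Under the same assumptions as the sketches above, formula (2) might be implemented as follows; masked_vlad and its argument names are illustrative.

```python
import numpy as np

def masked_vlad(features, masks, codebook):
    """Formula (2): v(k, j)' = sum_i m_i * alpha_k(x_i) * (x_i(j) - c_k(j)).

    features: (N, D) local features x_i of one image.
    masks:    (N,)   semantic mask value m_i at each feature's position.
    codebook: (K, D) cluster centers c_k.
    """
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)          # alpha_k(x_i) as an index per feature
    v = np.zeros_like(codebook)          # (K, D) aggregated residuals
    for x, m, k in zip(features, masks, nearest):
        v[k] += m * (x - codebook[k])    # mask-weighted residual
    return v
```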
By using the method of the invention, for example, when the image contains a large number of dynamic objects, the weight of the dynamic objects can be reduced by weighting through the semantic mask, and the robustness of the feature recognition is improved.
Specifically, in an embodiment, when weighting is performed through the semantic masks, if a feature is a pixel-level feature, the semantic mask at the corresponding position can be taken directly according to the position of the feature in the image; if the feature is a sub-pixel-level feature, the semantic mask can be obtained by interpolating at the corresponding position on the semantic mask map.
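For illustration, a sketch of such sampling, assuming bilinear interpolation (the patent does not specify the interpolation scheme); sample_mask is a hypothetical helper.

```python
import numpy as np

def sample_mask(mask_map, xs, ys):
    """Sample m_i at (possibly sub-pixel) feature positions by bilinear
    interpolation of the semantic mask map."""
    h, w = mask_map.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    wx, wy = xs - x0, ys - y0
    top = (1 - wx) * mask_map[y0, x0] + wx * mask_map[y0, x1]
    bot = (1 - wx) * mask_map[y1, x0] + wx * mask_map[y1, x1]
    return (1 - wy) * top + wy * bot
```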
In an embodiment, after the feature vectors of the query image and the image to be recognized are obtained in the above manner, the feature vector may first be normalized within each of the K cluster blocks, and the whole vector may then be normalized.
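Continuing the sketches above, this per-cluster normalization followed by whole-vector normalization might look as follows; the use of L2 norms is an assumption, since the patent does not name the norm.

```python
import numpy as np

def normalize_vlad(v, eps=1e-12):
    """Normalize each of the K cluster blocks, then the flattened vector."""
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)  # per cluster
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + eps)                      # whole vector
```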
Step S13: and determining an image matched with the scene of the query image from the image to be identified by using the feature vector of the image to be processed.
After the feature vectors of the query image and the image to be recognized are obtained in the manner of step S12, an image matching the scene of the query image is determined from the images to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image.
It will be appreciated that the closer the distance between feature vectors, the higher the similarity of features, and the further the distance between feature vectors, the lower the similarity of features. Therefore, in an embodiment, the image to be identified corresponding to the feature vector closest to the feature vector of the query image is determined as the image matched with the query image.
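A sketch of this nearest-vector matching; rank_candidates is an illustrative helper, and Euclidean distance is assumed as the distance measure.

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Order the images to be recognized by distance to the query's feature
    vector; the closest candidate is taken as the scene match."""
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
    return np.argsort(dists)  # candidate indices, most similar first
```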
In an embodiment, if the number of the images matched with the query image in the to-be-identified images is multiple, in order to obtain the most similar images, the images matched with the query image are arranged by using a spatial consistency method to obtain the most similar images to the query image.
According to the scene recognition method provided by the invention, the semantic mask map is combined with the traditional feature aggregation approach, so that semantic mask weighting reduces the interference of dynamic features in the image with feature recognition and effectively avoids the negative impact of unstable objects on scene recognition. Meanwhile, the weighting also effectively mitigates the impact caused by the instability of semantic segmentation itself, further improving robustness. Moreover, the method of the invention remains robust when seasons change.
Fig. 4 is a schematic structural diagram of a scene recognition device according to an embodiment of the present invention. The method comprises the following steps: an acquisition module 41, a feature aggregation module 42, and an image matching module 43.
The obtaining module 41 is configured to obtain an image to be processed and a semantic mask map corresponding to the image to be processed; the image to be processed comprises a query image and an image to be recognized, and the semantic mask image corresponding to the image to be processed comprises a semantic mask image of the query image and a semantic mask image of the image to be recognized. Specifically, the obtaining module 41 is configured to obtain a query image, and obtain a plurality of images to be identified, which are matched with the query image, from a database according to the query image; performing semantic segmentation processing on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to the category; setting weight for each pixel type according to set conditions; and obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the category and the weight corresponding to the category, wherein the semantic masks corresponding to all the pixels form a semantic mask image. In an embodiment, the obtaining module 41 is further configured to perform attribute classification on all pixels to obtain one or more sub-categories; setting weight for each subcategory according to set conditions; and obtaining the semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
The feature aggregation module 42 is configured to perform feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed. Specifically, the feature aggregation module 42 is configured to perform feature extraction on the image to be processed to obtain a feature set; forming a plurality of clustering centers according to the feature set; obtaining a clustering center corresponding to each feature in each image to be processed according to the plurality of clustering centers; determining a value corresponding to a first dimension of each feature in the image to be processed, and determining a value corresponding to the first dimension of a clustering center corresponding to each feature in the image to be processed;
and for performing feature aggregation processing on the query image by using the clustering center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of the clustering center corresponding to each feature in the first dimension, in combination with the semantic mask map of the query image, to obtain the feature vector of the query image; and for performing feature aggregation processing on the image to be recognized in the same way, in combination with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized.
The image matching module 43 is configured to determine, from the images to be identified, an image that matches the scene of the query image by using the feature vectors of the images to be processed. Specifically, the image matching module 43 is configured to determine an image matched with the query image scene from the image to be recognized according to a distance between a feature vector of the image to be recognized and a feature vector of the query image. In an embodiment, the image matching module 43 is configured to determine the image to be identified corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image. In an embodiment, the image matching module 43 is further configured to, when there are a plurality of images matching the query image in the to-be-identified images, arrange the images matching the query image by using a spatial consistency method to obtain an image most similar to the query image.
According to the scene recognition device provided by the invention, the semantic mask image is combined with the traditional feature aggregation mode, so that the interference of dynamic features in the image on feature recognition is reduced in a semantic mask weighting mode, and the robustness of the device is further improved.
Fig. 5 is a schematic structural diagram of an intelligent device according to the present invention. The smart device comprises a memory 52 and a processor 51 connected to each other.
The memory 52 is used to store program instructions implementing the scene recognition method of any one of the above.
The processor 51 is operative to execute program instructions stored in the memory 52.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip having signal processing capabilities. The processor 51 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 52 may be a memory bank, a TF card, etc., and can store all information in the smart device, including input raw data, computer programs, intermediate operation results, and final operation results. It stores and retrieves information at the locations specified by the controller. With the memory, the smart device has a memory function and can work normally. According to the purpose of use, the storage of a smart device can be classified into main storage (internal storage) and auxiliary storage (external storage). External storage is usually a magnetic medium, an optical disc, or the like, and can store information for a long time. Internal storage refers to the storage component on the mainboard, which holds the data and programs currently being executed; it is used only for temporary storage, and its contents are lost when the power is turned off.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a system server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method of the embodiments of the present application.
Please refer to fig. 6, which is a schematic structural diagram of a computer-readable storage medium according to the present invention. The computer readable storage medium of the present application stores a program file 61 capable of implementing all the above-mentioned scene recognition methods, wherein the program file 61 may be stored in the storage medium in the form of a software product, and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage device includes: various media capable of storing program codes, such as a usb disk, a mobile hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices, such as a computer, a server, a mobile phone, and a tablet.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (13)

1. A method for scene recognition, comprising:
acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises a query image and an image to be recognized, and the semantic mask image corresponding to the image to be processed comprises a semantic mask image of the query image and a semantic mask image of the image to be recognized;
performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed;
and determining an image matched with the scene of the query image from the images to be identified by utilizing the feature vector of the images to be processed.
2. The scene recognition method according to claim 1, wherein the obtaining of the to-be-processed image and the semantic mask map corresponding to the to-be-processed image includes:
performing semantic segmentation processing on the image to be identified and the query image to obtain the category of each pixel and the probability corresponding to the category;
setting weight for each pixel type according to set conditions;
and obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the category and the weight corresponding to the category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
3. The method according to claim 2, wherein before the setting of a weight for the category of each pixel according to the set condition, the method further comprises:
performing attribute classification on all pixels to obtain one or more sub-categories;
setting weight for each sub-category according to set conditions;
and obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic masks corresponding to all the pixels form a semantic mask image.
4. The method of claim 3, wherein the sub-categories include at least two of: a fixed sub-category, an unfixed sub-category, a dynamic sub-category, and an unknown sub-category;
the weight of the dynamic subcategory is less than the weight of the fixed subcategory, the non-fixed subcategory, and the unknown subcategory.
5. The method of claim 4, wherein the deriving the semantic mask for each pixel according to the probability and the weight corresponding to the sub-category comprises: calculating a semantic mask corresponding to the pixel by using the following formula (1):
m_i = p_i × w_i    (1)

wherein m_i represents the semantic mask corresponding to the i-th pixel, the map formed by all semantic masks being the semantic mask map, p_i represents the probability of the category or sub-category to which the i-th pixel belongs, and w_i represents the weight corresponding to the category or sub-category to which the i-th pixel belongs.
6. The method according to claim 1, wherein performing feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed comprises:
extracting the features of the image to be processed to obtain a feature set;
forming a plurality of clustering centers according to the feature set;
obtaining a clustering center corresponding to each feature in each image to be processed according to the plurality of clustering centers;
determining a value corresponding to a first dimension of each feature in the image to be processed, and determining a value corresponding to the first dimension of a clustering center corresponding to each feature in the image to be processed;
performing feature aggregation processing on the query image by using the clustering center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of the clustering center corresponding to each feature in the first dimension, in combination with the semantic mask image of the query image, to obtain the feature vector of the query image; and
performing feature aggregation processing on the image to be recognized by using the clustering center corresponding to each feature in the image to be processed, the value of each feature in the first dimension, and the value of the clustering center corresponding to each feature in the first dimension, in combination with the semantic mask image of the image to be recognized, to obtain the feature vector of the image to be recognized.
7. The method of claim 6, wherein said forming a plurality of cluster centers from said feature set comprises:
processing the feature set by using a clustering algorithm to form a plurality of clustering centers;
the obtaining of the clustering center corresponding to each feature in each image to be processed according to the plurality of clustering centers comprises:
and taking the clustering center closest to each feature as the clustering center corresponding to each feature in the image to be processed.
8. The method according to any one of claims 1-7, wherein the determining, from the image to be identified, an image matching the scene of the query image by using the feature vector of the image to be processed comprises:
and determining an image matched with the query image scene from the image to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image.
9. The method according to claim 8, wherein the step of determining an image matching the query image scene from the images to be identified according to the distance between the feature vector of the image to be identified and the feature vector of the query image comprises:
and determining the image to be identified corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image.
10. The method according to claim 9, wherein the number of the images to be identified matching the query image is plural;
after the determining the image to be identified corresponding to the feature vector closest to the feature vector of the query image as the image matched with the query image, the method further comprises:
and arranging the images matched with the query image by adopting a space consistency method so as to obtain the image most similar to the query image.
11. A scene recognition apparatus, comprising:
the acquisition module is used for acquiring an image to be processed and a semantic mask image corresponding to the image to be processed; the image to be processed comprises a query image and an image to be recognized, and the semantic mask image corresponding to the image to be processed comprises a semantic mask image of the query image and a semantic mask image of the image to be recognized;
the feature aggregation module is used for performing feature aggregation processing on the image to be processed according to the semantic mask image to obtain a feature vector of the image to be processed;
and the image matching module is used for determining an image matched with the scene of the query image from the image to be identified by utilizing the characteristic vector of the image to be processed.
12. A scene recognition device, comprising: a processor and a memory coupled to each other, wherein,
the memory is for storing program instructions for implementing the scene recognition method of any one of claims 1-10.
13. A computer-readable storage medium, characterized in that a program file is stored, which can be executed to implement the scene recognition method according to any one of claims 1 to 10.
CN202011249944.4A 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium Active CN112329660B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202011249944.4A CN112329660B (en) 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium
PCT/CN2021/106936 WO2022100133A1 (en) 2020-11-10 2021-07-16 Scene recognition method and apparatus, intelligent device, storage medium and computer program
JP2022543759A JP2023510945A (en) 2020-11-10 2021-07-16 Scene identification method and apparatus, intelligent device, storage medium and computer program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011249944.4A CN112329660B (en) 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112329660A (en) 2021-02-05
CN112329660B CN112329660B (en) 2024-05-24

Family

ID=74317739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011249944.4A Active CN112329660B (en) 2020-11-10 2020-11-10 Scene recognition method and device, intelligent equipment and storage medium

Country Status (3)

Country Link
JP (1) JP2023510945A (en)
CN (1) CN112329660B (en)
WO (1) WO2022100133A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393515A (en) * 2021-05-21 2021-09-14 杭州易现先进科技有限公司 Visual positioning method and system combined with scene labeling information
WO2022100133A1 (en) * 2020-11-10 2022-05-19 浙江商汤科技开发有限公司 Scene recognition method and apparatus, intelligent device, storage medium and computer program

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117009532B (en) * 2023-09-21 2023-12-19 腾讯科技(深圳)有限公司 Semantic type recognition method and device, computer readable medium and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239535A (en) * 2017-05-31 2017-10-10 北京小米移动软件有限公司 Similar pictures search method and device
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
WO2019095997A1 (en) * 2017-11-15 2019-05-23 深圳云天励飞技术有限公司 Image recognition method and device, computer device and computer-readable storage medium
CN111027493A (en) * 2019-12-13 2020-04-17 电子科技大学 Pedestrian detection method based on deep learning multi-network soft fusion
WO2020134674A1 (en) * 2018-12-29 2020-07-02 平安科技(深圳)有限公司 Palmprint identification method, apparatus, computer device, and storage medium
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model
US20210118144A1 (en) * 2018-09-15 2021-04-22 Beijing Sensetime Technology Development Co., Ltd. Image processing method, electronic device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8891908B2 (en) * 2012-11-14 2014-11-18 Nec Laboratories America, Inc. Semantic-aware co-indexing for near-duplicate image retrieval
CN105335757A (en) * 2015-11-03 2016-02-17 电子科技大学 Model identification method based on local characteristic aggregation descriptor
WO2018081537A1 (en) * 2016-10-31 2018-05-03 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN112329660B (en) * 2020-11-10 2024-05-24 浙江商汤科技开发有限公司 Scene recognition method and device, intelligent equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239535A (en) * 2017-05-31 2017-10-10 北京小米移动软件有限公司 Similar pictures search method and device
WO2019095997A1 (en) * 2017-11-15 2019-05-23 深圳云天励飞技术有限公司 Image recognition method and device, computer device and computer-readable storage medium
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
US20210118144A1 (en) * 2018-09-15 2021-04-22 Beijing Sensetime Technology Development Co., Ltd. Image processing method, electronic device, and storage medium
WO2020134674A1 (en) * 2018-12-29 2020-07-02 平安科技(深圳)有限公司 Palmprint identification method, apparatus, computer device, and storage medium
CN111027493A (en) * 2019-12-13 2020-04-17 电子科技大学 Pedestrian detection method based on deep learning multi-network soft fusion
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022100133A1 (en) * 2020-11-10 2022-05-19 浙江商汤科技开发有限公司 Scene recognition method and apparatus, intelligent device, storage medium and computer program
CN113393515A (en) * 2021-05-21 2021-09-14 杭州易现先进科技有限公司 Visual positioning method and system combined with scene labeling information
CN113393515B (en) * 2021-05-21 2023-09-19 杭州易现先进科技有限公司 Visual positioning method and system combining scene annotation information

Also Published As

Publication number Publication date
WO2022100133A1 (en) 2022-05-19
JP2023510945A (en) 2023-03-15
CN112329660B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN103207898B (en) A kind of similar face method for quickly retrieving based on local sensitivity Hash
US9330111B2 (en) Hierarchical ranking of facial attributes
CN112329660B (en) Scene recognition method and device, intelligent equipment and storage medium
CN109359725B (en) Training method, device and equipment of convolutional neural network model and computer readable storage medium
US9747305B2 (en) Image search device, image search method, program, and computer-readable storage medium
US20150039583A1 (en) Method and system for searching images
US11301509B2 (en) Image search system, image search method, and program
US9940366B2 (en) Image search device, image search method, program, and computer-readable storage medium
US20200410280A1 (en) Methods and apparatuses for updating databases, electronic devices and computer storage mediums
CN109033955B (en) Face tracking method and system
US20210073890A1 (en) Catalog-based image recommendations
CN110825894A (en) Data index establishing method, data index retrieving method, data index establishing device, data index retrieving device, data index establishing equipment and storage medium
CN107315984B (en) Pedestrian retrieval method and device
CN112561976A (en) Image dominant color feature extraction method, image retrieval method, storage medium and device
CN110083731B (en) Image retrieval method, device, computer equipment and storage medium
US20210042565A1 (en) Method and device for updating database, electronic device, and computer storage medium
Wang et al. Salient object detection based on multi-feature graphs and improved manifold ranking
CN112650869B (en) Image retrieval reordering method and device, electronic equipment and storage medium
CN114461837A (en) Image processing method and device and electronic equipment
CN113505257A (en) Image search method, trademark search method, electronic device, and storage medium
Peng et al. Multi-task person re-identification via attribute and part-based learning
Dimai Unsupervised extraction of salient region-descriptors for content based image retrieval
Mary et al. Content based image retrieval using colour, multi-dimensional texture and edge orientation
Ni et al. Research on image segmentation algorithm based on fuzzy clustering and spatial pyramid
CN116662589A (en) Image matching method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040124

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant