WO2022100133A1 - Scene recognition method and apparatus, smart device, storage medium and computer program - Google Patents

Scene recognition method and apparatus, smart device, storage medium and computer program

Info

Publication number
WO2022100133A1
WO2022100133A1 (PCT/CN2021/106936, CN2021106936W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
processed
semantic mask
semantic
Prior art date
Application number
PCT/CN2021/106936
Other languages
English (en)
French (fr)
Inventor
鲍虎军
章国锋
余海林
冯友计
Original Assignee
浙江商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江商汤科技开发有限公司 filed Critical 浙江商汤科技开发有限公司
Priority to JP2022543759A priority Critical patent/JP2023510945A/ja
Publication of WO2022100133A1 publication Critical patent/WO2022100133A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution
    • G06F 16/24553 Query execution of query operations
    • G06F 16/24554 Unary operations; Data partitioning operations
    • G06F 16/24556 Aggregation; Duplicate elimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53 Querying
    • G06F 16/532 Query formulation, e.g. graphical querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Definitions

  • The present application relates to the technical field of image retrieval, and in particular to a scene recognition method and apparatus, a smart device, a storage medium and a computer program.
  • Scene recognition has important applications in the field of computer vision, such as Simultaneous Localization and Mapping (SLAM), Structure from Motion (SFM) and Visual Localization (VL).
  • The main task of the scene recognition problem is to identify the corresponding scene from a given image, giving the name of the scene or its geographic location, or to select images similar to the scene from a database; it can also be regarded as an image retrieval problem.
  • Embodiments of the present application provide a scene recognition method, apparatus, smart device, storage medium, and computer program.
  • An embodiment of the present application provides a scene recognition method, including: acquiring an image to be processed and a semantic mask map corresponding to the image to be processed, where the image to be processed includes a query image and an image to be recognized, and the semantic mask map corresponding to the image to be processed includes the semantic mask map of the query image and the semantic mask map of the image to be recognized; performing feature aggregation on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed; and determining, from the images to be recognized, an image matching the scene of the query image by using the feature vector of the image to be processed.
  • the features corresponding to the images to be processed are obtained by combining the semantic mask map with the feature aggregation method, which can reduce the interference of interfering factors and improve the robustness of scene recognition.
  • In some embodiments, the step of acquiring the image to be processed and the semantic mask map corresponding to the image to be processed includes: performing semantic segmentation on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to the category; setting a weight for the category of each pixel according to set conditions; and obtaining the semantic mask corresponding to each pixel according to the probability and the weight corresponding to the category, where the semantic masks corresponding to all the pixels constitute a semantic mask map.
  • With these weights, the semantic mask map obtained in this way, once combined with feature aggregation to obtain the features of the image to be processed, reduces the interference of distracting factors and improves the robustness of scene recognition.
  • In some embodiments, before the weight is set for the category of each pixel according to the set conditions, the method further includes: classifying all the pixels by attribute to obtain one or more subcategories; setting a weight for each subcategory according to the set conditions; and obtaining the semantic mask corresponding to each pixel according to the probability and the weight corresponding to the subcategory, where the semantic masks corresponding to all the pixels constitute a semantic mask map. Setting a weight for each subcategory reduces the interference of distracting factors and improves the robustness of scene recognition.
  • In some embodiments, the subcategories include at least two of a fixed subcategory, a non-fixed subcategory, a dynamic subcategory and an unknown subcategory; the weight of the dynamic subcategory is smaller than the weights of the fixed, non-fixed and unknown subcategories. For example, a higher weight is set for the non-fixed subcategory and a smaller weight for the fixed subcategory, so as to eliminate the interference of non-fixed features on feature recognition and improve the robustness of scene recognition.
  • In some embodiments, obtaining the semantic mask corresponding to each pixel according to the probability and the weight corresponding to the subcategory includes: calculating the semantic mask corresponding to the pixel as m_i = p_i × w_i, where m_i denotes the semantic mask corresponding to the i-th pixel (the map generated from all m_i is the semantic mask map), p_i denotes the probability of the subcategory to which the i-th pixel belongs, and w_i denotes the weight corresponding to the category or subcategory to which the i-th pixel belongs. Computing the semantic mask map in this way reduces the interference of non-fixed features on scene recognition.
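  • As a purely illustrative sketch (not part of the claimed method; the function and variable names are hypothetical), the per-pixel computation m_i = p_i × w_i could be vectorized as follows, assuming the segmentation network outputs a per-pixel class id together with the probability of that class:

```python
import numpy as np

def semantic_mask_map(class_ids, class_probs, class_weights, default_w=0.3):
    """Compute m_i = p_i * w_i for every pixel.

    class_ids:     (H, W) int array with the predicted class of each pixel.
    class_probs:   (H, W) float array with the probability of that class.
    class_weights: dict mapping class id -> weight w_i; ids not listed fall
                   back to default_w (treated like an "unknown" class).
    """
    lut = np.full(int(class_ids.max()) + 1, default_w, dtype=np.float32)
    for cid, w in class_weights.items():
        if cid < lut.shape[0]:
            lut[cid] = w
    return class_probs.astype(np.float32) * lut[class_ids]
```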
  • In some embodiments, performing feature aggregation on the image to be processed according to the semantic mask map to obtain the feature vector of the image to be processed includes: performing feature extraction on the image to be processed to obtain a feature set; forming a plurality of cluster centers from the feature set; obtaining, from the plurality of cluster centers, the cluster center corresponding to each feature in each image to be processed; determining the value of each feature in the image to be processed in a first dimension, and the value, in the first dimension, of the cluster center corresponding to each feature; performing feature aggregation on the query image, through the cluster center corresponding to each feature in the image to be processed, the value of that cluster center in the first dimension and the value of each feature in the first dimension, combined with the semantic mask map of the query image, to obtain the feature vector of the query image; and likewise performing feature aggregation on the image to be recognized, combined with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized.
  • the forming a plurality of cluster centers according to the feature set includes: using a clustering algorithm to process the feature set to form a plurality of cluster centers;
  • Obtaining the cluster center corresponding to each feature in each of the to-be-processed images includes: taking the cluster center closest to each of the features as the cluster center corresponding to each of the features in the to-be-processed image.
  • In some embodiments, determining, from the images to be recognized, an image matching the scene of the query image by using the feature vectors of the images to be processed includes: determining the matching image from the images to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image. Since the computation of the feature vectors incorporates the semantic mask map, the interference of non-fixed features is reduced, and an image to be recognized that is more similar to the query image is obtained.
  • In some embodiments, the step of determining the image matching the scene of the query image from the images to be recognized includes: determining the image to be recognized whose feature vector is closest to the feature vector of the query image as the image matching the query image. In this way, an image to be recognized that is more similar to the query image is obtained.
  • In some embodiments, when multiple images to be recognized match the query image, the method further includes: ranking the images matching the query image with a spatial consistency method, so as to obtain the image most similar to the query image. The retrieved scene is thereby more similar and the result more accurate.
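  • For illustration only (an assumed sketch, not taken from the disclosure): once every image is represented by a fixed-length feature vector, the matching step can be as simple as ranking candidates by L2 distance to the query vector; the nearest candidate is taken as the match, and the top few could then be re-ranked by a spatial consistency check.

```python
import numpy as np

def rank_by_distance(query_vec, candidate_vecs):
    """Return candidate indices sorted by L2 distance to the query vector (nearest first)."""
    dists = np.linalg.norm(candidate_vecs - query_vec[None, :], axis=1)
    return np.argsort(dists)
```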
  • An embodiment of the present application provides a scene recognition device, including: an acquisition module configured to acquire an image to be processed and a semantic mask map corresponding to the image to be processed; wherein the image to be processed includes a query image and an image to be recognized; A feature aggregation module, configured to perform feature aggregation processing on the to-be-processed image according to the semantic mask map, to obtain a feature vector of the to-be-processed image; an image matching module, configured to use the feature vector of the to-be-processed image from An image matching the scene of the query image is determined in the to-be-identified image.
  • the features corresponding to the images to be processed are obtained by combining the semantic mask map with the feature aggregation method, which can reduce the interference of interfering factors and improve the robustness of scene recognition.
  • An embodiment of the present application provides a smart device, including: a processor and a memory coupled to each other, wherein the memory is used to store program instructions for implementing the scene recognition method described in any one of the above.
  • An embodiment of the present application provides a computer-readable storage medium storing a program file, where the program file can be executed to implement any one of the scene recognition methods described above.
  • An embodiment of the present application provides a computer program, including computer-readable code; when the computer-readable code runs in a smart device, a processor in the smart device executes a program for implementing any one of the scene recognition methods described above.
  • The embodiments of the present application provide a scene recognition method and apparatus, a smart device, a storage medium and a computer program: feature aggregation is performed on the image to be processed according to the semantic mask map to obtain the feature vector of the image to be processed, and the feature vector is then used to determine, from the images to be recognized, the image matching the scene of the query image. By obtaining the semantic mask map, the high-level semantic information of the image is obtained, and the combination of the semantic mask map with feature aggregation eliminates the interference brought by distracting factors in the image, thereby improving the robustness of scene recognition.
  • FIG. 1 is a schematic flowchart of an embodiment of a scene recognition method according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of an embodiment of step S11 in FIG. 1 according to the embodiment of the present application;
  • FIG. 3 is a schematic flowchart of another embodiment of step S11 in FIG. 1 according to the embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of an embodiment of a scene recognition apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an embodiment of a smart device according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
  • Scene recognition has an important application in the field of computer vision.
  • The main task of the scene recognition problem is to identify the corresponding scene from a given image, giving the name of the scene or its geographic location, or to pick images of similar scenes from a database; it can also be regarded as an image retrieval problem.
  • the core of this type of problem is to accurately describe the image or the scene in the image. There are two commonly used methods, one is to directly calculate the global description of the image, and the other is to use local feature aggregation.
  • In the method that directly computes a global description of the image, the input is a complete image and the output is the global descriptor of the image.
  • the simplest method is to stitch together all the pixel values of the image as the descriptor of the image, or to use the histogram to count the grayscale information or gradient information of the pixels, etc.
  • This method has extremely poor robustness.
  • In the local feature aggregation method, the input is the local features extracted from the image and the output is an encoded feature vector. This method only uses local features, lacks high-level semantic information, and is not robust to illumination changes and dynamic scenes.
  • Semantic information, as high-level visual information, provides good guidance for scene recognition.
  • the use of semantic information is also more in line with human cognition.
  • Based on this, an embodiment of the present application proposes a semantic-mask-based scene recognition method. The method uses the semantic segmentation results to apply different weights to different regions of the image, effectively handling the negative impact of dynamic, unstable objects on scene recognition.
  • Since a soft weighting method is used, the influence of the instability of semantic segmentation is effectively avoided; moreover, the method is also robust to seasonal changes.
  • FIG. 1 is a schematic flowchart of a first embodiment of a scene recognition method according to an embodiment of the present application.
  • the scene recognition method is executed by a smart device, and the method includes:
  • Step S11 acquiring the image to be processed and the semantic mask map corresponding to the image to be processed; wherein the image to be processed includes a query image and an image to be recognized.
  • the image to be processed includes a query image and an image to be recognized
  • the semantic mask map corresponding to the image to be processed includes a semantic mask map of the query image and a semantic mask map of the image to be recognized.
  • obtaining the semantic mask map corresponding to the image to be recognized includes:
  • Step S21 perform semantic segmentation processing on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to the category.
  • the query image is a user-defined image, which may be an image currently captured by the user, or an image stored in advance by the user.
  • the image to be recognized is an image that matches the query image and is searched from the database according to the query image.
  • the database is a server, and a query image is input, and the server matches a plurality of images to be recognized with similar scenes for the query image.
  • Semantic segmentation is performed on the image to be recognized and the query image to obtain the category of each pixel in the image and the corresponding probability of the category.
  • Step S22 Set a weight for each pixel category according to the set condition.
  • In one embodiment, to reduce the interference of dynamic features on scene recognition, the weight of the dynamic subcategory is set lowest, smaller than the weights of the fixed subcategory, the non-fixed subcategory and the unknown subcategory.
  • In another embodiment, to reduce the interference of non-fixed features on scene recognition, the weight of the non-fixed subcategory is set lowest, smaller than the weights of the fixed, dynamic and unknown subcategories.
  • Step S23 obtaining a semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic mask corresponding to all pixels constitutes a semantic mask map.
  • In one embodiment, the following formula (1) is used to calculate the semantic mask corresponding to each pixel:
  • m_i = p_i × w_i      (1)
  • where m_i denotes the semantic mask corresponding to the i-th pixel (the map generated from all m_i is the semantic mask map), p_i denotes the probability of the subcategory to which the i-th pixel belongs, and w_i denotes the weight corresponding to the category or subcategory to which the i-th pixel belongs.
  • In another embodiment, if the category results after semantic segmentation do not consist of the four categories of fixed, non-fixed, dynamic and unknown subcategories, refer to FIG. 3, in which step S31 is the same as in FIG. 2; in this case, the method further includes:
  • Step S32 Perform attribute classification on all pixels to obtain one or more sub-categories.
  • Attribute classification is performed on all pixels to obtain one or more sub-categories.
  • the sub-categories include at least two or at least one of fixed sub-categories, non-fixed sub-categories, dynamic sub-categories and unknown sub-categories.
  • Step S33 Set weights for each sub-category according to the set conditions.
  • weights are set for the pixels of each sub-category.
  • Here, if the subcategories obtained by classifying the semantic segmentation results by attribute include the four categories of fixed, non-fixed, dynamic and unknown subcategories, then, to reduce the interference of dynamic features on scene recognition, in one embodiment the weight of the dynamic features is set lowest, smaller than the weights of the fixed, non-fixed and unknown subcategories.
  • In another embodiment, if the interference of non-fixed features on scene recognition needs to be reduced, the weight of the non-fixed subcategory is set lowest, smaller than the weights of the fixed, dynamic and unknown subcategories.
  • Step S34 Obtain a semantic mask corresponding to each pixel according to the probability corresponding to the sub-category and the weight corresponding to the sub-category, wherein the semantic mask corresponding to all pixels constitutes a semantic mask map.
  • In one embodiment, the following formula (2) is used to calculate the semantic mask corresponding to each pixel:
  • m_i = p_i × w_i      (2)
  • where m_i denotes the semantic mask corresponding to the i-th pixel (the map generated from all m_i is the semantic mask map), p_i denotes the probability of the subcategory to which the i-th pixel belongs, and w_i denotes the weight corresponding to the category or subcategory to which the i-th pixel belongs.
  • In the method provided by this embodiment, different weights are set for the pixel categories obtained by semantic segmentation, so as to reduce the interference those categories cause in feature recognition, thereby improving the robustness of scene recognition.
  • Step S12 Perform feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed.
  • Here, existing ways of performing feature aggregation on the image to be processed to obtain a feature vector include obtaining the feature vector through VLAD encoding.
  • obtaining the feature vector by means of VLAD coding includes: performing feature extraction on the to-be-processed image to obtain a feature set.
  • In another embodiment, feature extraction may also be performed on a preset image set to obtain the feature set; the preset image set may be the set of all images in the database and the server, a set of some images in the server, or a collection of pictures gathered by the user, which is not limited here. It can be understood that each image to be processed contains multiple features, i.e., multiple features are extracted from each image to be processed; all extracted features form a feature set, on which a clustering algorithm is run to obtain K cluster centers, referred to as the codebook C = {c1, c2, ..., ck}.
  • The multiple features of one image to be processed form a feature set X = {x1, x2, ..., xk}; in some embodiments, the feature set X can then be aggregated into a feature vector of fixed length through the codebook C.
  • a cluster center corresponding to each feature xi in each image to be processed is obtained through the plurality of cluster centers.
  • the position of the feature xi is determined, and the cluster center closest to the feature xi is determined as the cluster center ck corresponding to the feature xi .
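  • A minimal sketch of the codebook construction and nearest-center assignment described above (assuming scikit-learn is available; the function names are hypothetical and not part of the disclosure):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_local_features, k=64):
    """Cluster local features from the image set into K centers, the codebook C = {c1, ..., ck}."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_local_features).cluster_centers_

def nearest_centers(features, codebook):
    """For each local feature x_i, return the index of its closest cluster center c_k."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```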
  • In one embodiment, after the cluster center c_k corresponding to the current feature x_i is determined, the value of the cluster center c_k in the first dimension is determined. In some embodiments, the dimensionality of the cluster center c_k is the same as that of the feature x_i corresponding to it; the value of c_k in the first dimension and the value of x_i in the first dimension are both determined, and, to better distinguish the cluster center c_k from its corresponding feature x_i, the distance between c_k and the corresponding feature x_i is accumulated on the dimensions of c_k. In the embodiments of the present application, the first dimension may be dimension 1, dimension 2, dimension 3, and so on; the term "first dimension" is used only to make clear that cluster centers and features are aggregated in the same dimension.
  • In the existing feature recognition approach, the feature vectors of the query image and of the image to be recognized are obtained through the cluster center c_k and the value, in the first dimension, of the cluster center c_k corresponding to each feature.
  • In some embodiments, the prior art generally obtains the feature vector of the query image or of the image to be recognized through the following formula (3):
  • v(k, j) = Σ_i α_k(x_i) × (x_i(j) - c_k(j))      (3)
  • where v(k, j) denotes the feature vector of the query image or of the image to be recognized, α_k(x_i) denotes the selection function, x_i is a feature and c_k is the cluster center of x_i; α_k(x_i) equals 1 when c_k is the cluster center of x_i and 0 otherwise; x_i(j) is the value of the j-th dimension of the i-th feature, and c_k(j) is the value of the j-th dimension of the k-th cluster center.
  • It can be understood that, when the feature vector of the query image is to be computed, v(k, j) denotes the feature vector of the query image, α_k(x_i) denotes the selection function, and x_i is a feature of the query image; when c_k (a cluster center) is the cluster center corresponding to x_i, α_k(x_i) equals 1, otherwise α_k(x_i) equals 0; x_i(j) denotes the value of the j-th dimension of the i-th feature of the query image, and c_k(j) denotes the value of the j-th dimension of the k-th cluster center of the query image.
  • It can likewise be understood that, when the feature vector of the image to be recognized is to be computed, v(k, j) denotes the feature vector of the image to be recognized, α_k(x_i) denotes the selection function, and x_i is a feature of the image to be recognized; when c_k (a cluster center) is the cluster center corresponding to x_i, α_k(x_i) equals 1, otherwise α_k(x_i) equals 0; x_i(j) denotes the value of the j-th dimension of the i-th feature of the image to be recognized, and c_k(j) denotes the value of the j-th dimension of the k-th cluster center of the image to be recognized.
  • In the technical solutions of the embodiments of the present application, to prevent the lack of high-level semantic information from letting dynamic features affect feature-vector recognition and thereby produce inaccurate results, feature aggregation is performed on the query image through the cluster center c_k corresponding to each feature x_i in the image to be processed, the value of that cluster center c_k in the first dimension and the value of each feature x_i in the first dimension, combined with the semantic mask map of the query image, to obtain the feature vector of the query image; likewise, feature aggregation is performed on the image to be recognized, combined with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized.
  • Here, the embodiments of the present application obtain the feature vectors of the query image and of the image to be recognized through the following formula (4):
  • v(k, j)' = Σ_i m_i × α_k(x_i) × (x_i(j) - c_k(j))      (4)
  • where v(k, j)' denotes the feature vector of the query image or of the image to be recognized, α_k(x_i) denotes the selection function, x_i is a feature and c_k is the cluster center of x_i; α_k(x_i) equals 1 when c_k is the cluster center of x_i and 0 otherwise; x_i(j) is the value of the j-th dimension of the i-th feature, c_k(j) is the value of the j-th dimension of the k-th cluster center, and m_i denotes the semantic mask from the semantic mask map of the query image or of the image to be recognized.
  • With the method of the embodiments of the present application, for example when an image contains a large number of dynamic objects, weighting can be performed using the semantic mask, thereby reducing the weight of dynamic objects and improving the robustness of feature recognition.
  • Here, in one embodiment, when weighting with the semantic mask, if a feature is a pixel-level feature, the semantic mask at the corresponding position can be read directly according to the position of the feature in the image; if the feature is a sub-pixel-level feature, it can be obtained by interpolating at the same position on the semantic mask map.
  • In one embodiment, after the feature vectors of the query image and of the image to be recognized are obtained in the above manner, the feature vectors may first be normalized within each of the K cluster centers, and then the entire vector may be normalized.
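  • The following sketch illustrates formula (4) together with the normalization just described; it is a hypothetical NumPy rendering (not the reference implementation), assuming `features` holds the local features of one image, `masks` the semantic mask value m_i looked up for each feature, and `codebook` the K cluster centers:

```python
import numpy as np

def masked_vlad(features, masks, codebook):
    """v(k, j)' = sum_i m_i * a_k(x_i) * (x_i(j) - c_k(j)), then intra- and global L2 normalization."""
    K, D = codebook.shape
    v = np.zeros((K, D), dtype=np.float64)
    # selection function a_k: index of the nearest cluster center of each feature
    assign = np.argmin(
        np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2), axis=1)
    for x, m, k in zip(features, masks, assign):
        v[k] += m * (x - codebook[k])                            # mask-weighted residual
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12        # normalize within each cluster
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)                       # normalize the whole vector
```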
  • Step S13 Determine an image matching the scene of the query image from the to-be-recognized image by using the feature vector of the to-be-processed image.
  • After the feature vectors of the query image and of the images to be recognized are obtained in step S12, the image matching the scene of the query image is determined from the images to be recognized by the distance between the feature vector of the image to be recognized and the feature vector of the query image.
  • the to-be-identified image corresponding to the feature vector closest to the feature vector of the query image is determined as the image matching the query image.
  • In one embodiment, when there are multiple images to be recognized that match the query image, the matching images are ranked using a spatial consistency method, so as to obtain the image most similar to the query image.
  • The scene recognition method provided by the embodiments of the present application combines the semantic mask map with a traditional feature aggregation method, so that semantic mask weighting reduces the interference of dynamic features in the image on feature recognition and effectively avoids the negative impact of unstable objects on scene recognition.
  • At the same time, the weighting effectively avoids the impact caused by the instability of semantic segmentation, thereby improving robustness.
  • Moreover, the method of the embodiments of the present application also remains robust when seasons change.
  • the embodiments of the present application further provide a scene recognition method.
  • This scene recognition method uses the semantic segmentation result to weight different regions of the image when generating the image's global feature vector, which ensures the robustness of the method when the scene contains a large number of dynamic objects or changes with the seasons.
  • the scene recognition method can be implemented in the following ways:
  • the input of the semantic segmentation is an image
  • the output is the result of the semantic segmentation
  • a semantic segmentation network may be used to perform semantic segmentation on the input image.
  • the result of semantic segmentation contains the class of each pixel and the probability of belonging to that class.
  • the semantic segmentation network can be any network, and the segmented categories can be customized and trained, or can be directly trained using categories defined on public datasets.
  • the results of the segmentation can be further divided into four categories: stable categories, volatile categories, dynamic categories, and unknown categories. If the above-mentioned segmentation results are the same as the four categories, the step of continuing segmentation is not performed; otherwise, the categories can be further divided according to the actual usage scenarios. For example, for an indoor environment, the floor, walls, ceiling can be considered as stable categories, beds, tables, chairs, etc. can be regarded as volatile categories, people, cats and dogs, etc. can be regarded as dynamic categories, etc. For outdoor scenes, buildings, roads, street lights, etc. can be regarded as stable categories, green plants, sky, etc. can be regarded as volatile categories, and pedestrians and vehicles can be regarded as dynamic categories. Of course, this classification can be adjusted differently according to the actual usage scenario, for example, in some indoor scenarios, the table can be regarded as a stable class.
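  • Purely as an illustration of such a regrouping (the label names and the split are hypothetical and would be adapted to the deployment scene), the mapping from a segmentation model's labels to the four groups could be expressed as a simple lookup table:

```python
# Hypothetical label-to-group mapping; adjust per scene (e.g. "table" may be
# treated as stable in some indoor settings).
INDOOR_GROUPS = {
    "floor": "stable", "wall": "stable", "ceiling": "stable",
    "bed": "volatile", "table": "volatile", "chair": "volatile",
    "person": "dynamic", "cat": "dynamic", "dog": "dynamic",
}
OUTDOOR_GROUPS = {
    "building": "stable", "road": "stable", "street_light": "stable",
    "vegetation": "volatile", "sky": "volatile",
    "pedestrian": "dynamic", "vehicle": "dynamic",
}
```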
  • the input of the semantic mask is the result of semantic segmentation
  • the output is the semantic mask map
  • the weights corresponding to the stable category, the volatile category, the dynamic category, and the unknown category are w 1 , w 2 , w 3 , and w 4 , respectively.
  • These weights can be set manually, e.g. 1.0, 0.5, 0.1 and 0.3 for the four categories respectively.
  • Then, for a pixel i in an image, two values p_i and w_i are obtained, where p_i is the probability of the category and w_i is the weight of the category; multiplying the category probability by the category weight, i.e. m_i = p_i × w_i, determines the semantic mask of that pixel. Here m_i is called the semantic mask corresponding to pixel i, and the image generated from all m_i is the semantic mask map.
  • In some embodiments, the generated semantic mask can be embedded into current local feature aggregation methods as well as into end-to-end deep learning methods; below, the VLAD method is taken as an example of how the semantic mask is used.
  • the input of the feature aggregation is the image and the corresponding semantic mask map
  • the output is the image feature vector
  • Using traditional VLAD encoding, the image feature vector can be expressed by the following formula (5):
  • v(k, j) = Σ_i α_k(x_i) × (x_i(j) - c_k(j))      (5)
  • where α_k(x_i) denotes the computation of the nearest cluster center of feature x_i, i.e. the selection function, equal to 1 at the position of the nearest cluster center and 0 otherwise; x_i(j) denotes the value of the j-th dimension of feature x_i, and c_k(j) denotes the value of the j-th dimension of the k-th cluster center.
  • Since this treats all features of the image uniformly, the semantic mask described above is introduced for weighting, giving v(k, j)' = Σ_i m_i × α_k(x_i) × (x_i(j) - c_k(j)), where m_i is the semantic mask corresponding to the i-th feature: if the feature is a pixel-level feature, the semantic mask at the corresponding position can be read directly at the feature's position in the image; if the feature is a sub-pixel-level feature, it can be obtained by interpolating at the same position on the semantic mask map.
  • the input of the scene recognition is the feature vector obtained from the image and the semantic mask, and the output is the most similar scene.
  • In the embodiments of the present application, feature vectors are extracted from all database images using the method of step (3) above to build an image feature database. A feature vector is likewise extracted for the image to be recognized, and the features of the query image are compared by distance with the image features in the database to find the few images with the smallest distance as the retrieval result; these retrieved images are then re-ranked to obtain the most similar scene image.
  • In some usage scenarios, for example autonomous driving, there are usually many vehicles on the road, while what really matters for recognition is the roadside buildings; the scene recognition method using the semantic mask provided by the embodiments of the present application can effectively handle these dynamic objects, and assigning them a lower weight effectively alleviates their interference with the image description.
  • At the same time, the scene recognition method using semantic masks in the embodiments of the present application can assign higher weights to highly discriminative categories, increasing their proportion in the image description and thereby effectively suppressing non-discriminative regions such as roads and floors.
  • In some embodiments, the usage scenarios of the scene recognition method provided by the embodiments of the present application may include the following. In a visual localization algorithm, an image-level description is usually used first to retrieve a similar scene, thereby narrowing the matching range of local features. If the target scene contains a large number of dynamic objects during mapping or localization, such as pedestrians coming and going in a shopping mall or vehicles on the road, using the images directly without processing will greatly degrade retrieval performance and reduce the retrieval success rate. For an outdoor environment, if mapping and localization take place in different seasons, outdoor vegetation takes on different appearances with the seasons, which also greatly affects the effect of scene recognition. If the methods proposed in the embodiments of the present application are adopted, these problems can be handled effectively. Of course, the scene recognition method provided in the embodiments of the present application also has other usage scenarios, which those skilled in the art can use according to actual needs.
  • FIG. 4 is a schematic structural diagram of an embodiment of a scene recognition apparatus according to an embodiment of the present application. It includes: acquisition module 41, feature aggregation module 42 and image matching module 43.
  • the acquisition module 41 is configured to acquire the image to be processed and the semantic mask map corresponding to the image to be processed; the image to be processed includes the query image and the image to be recognized, and the semantic mask map corresponding to the image to be processed includes the semantic mask map of the query image The code map and the semantic mask map of the image to be recognized.
  • the acquisition module 41 is configured to acquire a query image, acquire a plurality of images to be identified that match the query image from a database according to the query image; perform semantic segmentation processing on the image to be recognized and the query image to obtain the category and category of each pixel Corresponding probability; set the weight for the category of each pixel according to the set conditions; obtain the semantic mask corresponding to each pixel according to the probability corresponding to the category and the weight corresponding to the category, wherein the semantic mask corresponding to all pixels constitutes the semantic mask picture.
  • the acquisition module 41 is further configured to perform attribute classification on all pixels to obtain one or more sub-categories; set weights for each sub-category according to set conditions; The weight obtains a semantic mask corresponding to each of the pixels, wherein the semantic masks corresponding to all pixels constitute a semantic mask map.
  • the feature aggregation module 42 is configured to perform feature aggregation processing on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed.
  • The feature aggregation module 42 is configured to perform feature extraction on the image to be processed to obtain a feature set; form a plurality of cluster centers from the feature set; obtain, from the plurality of cluster centers, the cluster center corresponding to each feature in each image to be processed; and determine the value of each feature in the image to be processed in the first dimension as well as the value, in the first dimension, of the cluster center corresponding to each feature in the image to be processed.
  • Through the cluster center corresponding to each feature in the image to be processed, the value of that cluster center in the first dimension and the value of each feature in the first dimension, combined with the semantic mask map of the query image, the module performs feature aggregation on the query image to obtain the feature vector of the query image.
  • Likewise, combined with the semantic mask map of the image to be recognized, it performs feature aggregation on the image to be recognized to obtain the feature vector of the image to be recognized.
  • the image matching module 43 is configured to use the feature vector of the image to be processed to determine the image matching the scene of the query image from the image to be recognized.
  • the image matching module 43 is configured to determine, from the to-be-recognized image, an image that matches the query image scene according to the distance between the feature vector of the to-be-recognized image and the feature vector of the query image.
  • the image matching module 43 is configured to determine the to-be-identified image corresponding to the feature vector closest to the feature vector of the query image as an image matched by the query image.
  • the image matching module 43 is further configured to, when there are multiple images matching the query image in the to-be-recognized image, use a spatial consistency method to arrange the images matching the query image, to obtain the most similar image to the query image.
  • the scene recognition device provided by the embodiment of the present application combines the semantic mask map with the traditional feature aggregation method, so as to reduce the interference of dynamic features in the image on feature recognition by means of semantic mask weighting, thereby improving the robustness of the device .
  • FIG. 5 is a schematic structural diagram of a smart device according to an embodiment of the present application.
  • the smart device includes a memory 52 and a processor 51 that are interconnected.
  • the memory 52 is used to store program instructions for implementing any one of the above-mentioned scene recognition methods.
  • the processor 51 is used to execute program instructions stored in the memory 52 .
  • the processor 51 may also be referred to as a central processing unit (Central Processing Unit, CPU).
  • the processor 51 may be an integrated circuit chip with signal processing capability.
  • The processor 51 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • The memory 52 may be a memory stick, a flash memory card (Trans-flash, TF card), etc., and can store all the information in the smart device, including the input raw data, computer programs, intermediate results and final results. It stores and retrieves information according to the locations specified by the controller; with the memory, the smart device can remember data and work normally.
  • By purpose, the memory of a smart device can be divided into main memory (internal memory) and auxiliary memory (external memory); another classification is into external storage and internal storage. External storage is usually a magnetic medium or an optical disc and can hold information for a long time.
  • Internal memory refers to the storage components on the motherboard, used to hold the data and programs currently being executed; it only stores programs and data temporarily, and the data are lost when the power is turned off.
  • the disclosed method and apparatus may be implemented in other manners.
  • The apparatus implementations described above are only illustrative; for example, the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, which may be electrical, mechanical or other forms.
  • Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this implementation manner.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • FIG. 6 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
  • The computer-readable storage medium of the embodiment of the present application stores a program file 61 capable of implementing all the above-mentioned scene recognition methods, where the program file 61 may be stored in the above-mentioned storage medium in the form of a software product and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor execute all or part of the steps of the methods of the various embodiments of the present application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disks or optical discs and other media that can store program code, as well as terminal devices such as computers, servers, mobile phones and tablets.
  • the embodiments of the present application provide a computer program, including computer-readable code, when the computer-readable code is executed in a smart device, a processor in the smart device executes the above method.
  • Embodiments of the present application provide a scene recognition method, apparatus, smart device, storage medium and computer program. The scene recognition method includes: acquiring an image to be processed and a semantic mask map corresponding to the image to be processed, where the image to be processed includes a query image and an image to be recognized, and the semantic mask map corresponding to the image to be processed includes the semantic mask map of the query image and the semantic mask map of the image to be recognized; performing feature aggregation on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed; and determining, from the images to be recognized, an image matching the scene of the query image by using the feature vector of the image to be processed. In this way, the features of the image to be processed are obtained by combining the semantic mask map with feature aggregation, which reduces the interference of distracting factors and improves the robustness of scene recognition.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A scene recognition method and apparatus, a smart device and a storage medium. The scene recognition method includes: acquiring an image to be processed and a semantic mask map corresponding to the image to be processed, where the image to be processed includes a query image and an image to be recognized (S11), and the semantic mask map corresponding to the image to be processed includes the semantic mask map of the query image and the semantic mask map of the image to be recognized; performing feature aggregation on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed (S12); and determining, from the images to be recognized, an image matching the scene of the query image by using the feature vector of the image to be processed (S13). In this way, the semantic mask map reduces the interference of distracting features on feature recognition, thereby improving the robustness of scene recognition.

Description

Scene recognition method and apparatus, smart device, storage medium and computer program
Cross-reference to related applications
This application is filed on the basis of, and claims priority to, the Chinese patent application with application number 202011249944.4, filed on 10 November 2020 and entitled "Scene recognition method and apparatus, smart device and storage medium"; the entire contents of that Chinese patent application are incorporated into this application by reference.
Technical field
The present application relates to the technical field of image retrieval, and in particular to a scene recognition method and apparatus, a smart device, a storage medium and a computer program.
Background
Scene recognition has important applications in the field of computer vision, such as Simultaneous Localization and Mapping (SLAM), Structure from Motion (SFM) and Visual Localization (VL).
The main task of the scene recognition problem is to identify the corresponding scene from a given image, giving the name of the scene or its geographic location, or to pick images similar to the scene from a database; it can also be regarded as an image retrieval problem. Two methods are commonly used at present: one computes a global description of the image directly, and the other uses feature aggregation. More and more research on scene recognition methods is being carried out in the prior art.
Summary
Embodiments of the present application provide a scene recognition method and apparatus, a smart device, a storage medium and a computer program.
An embodiment of the present application provides a scene recognition method, including: acquiring an image to be processed and a semantic mask map corresponding to the image to be processed, where the image to be processed includes a query image and an image to be recognized, and the semantic mask map corresponding to the image to be processed includes the semantic mask map of the query image and the semantic mask map of the image to be recognized; performing feature aggregation on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed; and determining, from the images to be recognized, an image matching the scene of the query image by using the feature vector of the image to be processed. Obtaining the features of the image to be processed by combining the semantic mask map with feature aggregation reduces the interference of distracting factors and improves the robustness of scene recognition.
In some embodiments, the step of acquiring the image to be processed and its semantic mask map includes: performing semantic segmentation on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to the category; setting a weight for the category of each pixel according to set conditions; and obtaining the semantic mask corresponding to each pixel according to the probability and the weight corresponding to the category, where the semantic masks of all pixels constitute the semantic mask map. With these weights, the resulting semantic mask map, once combined with feature aggregation to obtain the features of the image to be processed, reduces the interference of distracting factors and improves the robustness of scene recognition.
In some embodiments, before setting a weight for the category of each pixel according to the set conditions, the method further includes: classifying all pixels by attribute to obtain one or more subcategories; setting a weight for each subcategory according to the set conditions; and obtaining the semantic mask corresponding to each pixel according to the probability and the weight corresponding to the subcategory, where the semantic masks of all pixels constitute the semantic mask map. Setting a weight for each subcategory reduces the interference of distracting factors and improves the robustness of scene recognition.
In some embodiments, the subcategories include at least two of a fixed subcategory, a non-fixed subcategory, a dynamic subcategory and an unknown subcategory, and the weight of the dynamic subcategory is smaller than the weights of the fixed, non-fixed and unknown subcategories. For example, a higher weight is set for the non-fixed subcategory and a smaller weight for the fixed subcategory, so as to eliminate the interference of non-fixed features on feature recognition and improve the robustness of scene recognition.
In some embodiments, obtaining the semantic mask corresponding to each pixel according to the probability and the weight corresponding to the subcategory includes: calculating the semantic mask of the pixel with the formula m_i = p_i × w_i,
where m_i denotes the semantic mask corresponding to the i-th pixel (the map generated from all m_i is the semantic mask map), p_i denotes the probability of the subcategory to which the i-th pixel belongs, and w_i denotes the weight corresponding to the category or subcategory to which the i-th pixel belongs. Computing the semantic mask map in this way reduces the interference of non-fixed features on scene recognition.
In some embodiments, performing feature aggregation on the image to be processed according to the semantic mask map to obtain the feature vector of the image to be processed includes: performing feature extraction on the image to be processed to obtain a feature set; forming a plurality of cluster centers from the feature set; obtaining, from the plurality of cluster centers, the cluster center corresponding to each feature in each image to be processed; determining the value of each feature in the image to be processed in a first dimension, and the value, in the first dimension, of the cluster center corresponding to each feature; performing feature aggregation on the query image, through the cluster center corresponding to each feature in the image to be processed, the value of that cluster center in the first dimension and the value of each feature in the first dimension, combined with the semantic mask map of the query image, to obtain the feature vector of the query image; and performing feature aggregation on the image to be recognized in the same way, combined with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized. Because non-fixed features are down-weighted in the semantic mask map, obtaining the features of the image to be processed with it reduces the interference of distracting factors and improves the robustness of scene recognition.
In some embodiments, forming a plurality of cluster centers from the feature set includes: processing the feature set with a clustering algorithm to form the plurality of cluster centers; and obtaining the cluster center corresponding to each feature in each image to be processed includes: taking the cluster center closest to each feature as the cluster center corresponding to that feature in the image to be processed.
In some embodiments, determining, from the images to be recognized, an image matching the scene of the query image by using the feature vectors of the images to be processed includes: determining the matching image from the images to be recognized according to the distance between the feature vector of the image to be recognized and the feature vector of the query image. Since the computation of the feature vectors incorporates the semantic mask map, the interference of non-fixed features is reduced and an image to be recognized that is more similar to the query image is obtained.
In some embodiments, the step of determining, from the images to be recognized, an image matching the scene of the query image according to the distance between feature vectors includes: determining the image to be recognized whose feature vector is closest to the feature vector of the query image as the image matching the query image. In this way, an image to be recognized that is more similar to the query image is obtained.
In some embodiments, there are multiple images to be recognized that match the query image; after the image to be recognized whose feature vector is closest to that of the query image is determined as the matching image, the method further includes: ranking the images matching the query image with a spatial consistency method, so as to obtain the image most similar to the query image. The retrieved scene is thereby more similar and the result more accurate.
An embodiment of the present application provides a scene recognition apparatus, including: an acquisition module configured to acquire an image to be processed and a semantic mask map corresponding to the image to be processed, where the image to be processed includes a query image and an image to be recognized; a feature aggregation module configured to perform feature aggregation on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed; and an image matching module configured to determine, from the images to be recognized, an image matching the scene of the query image by using the feature vector of the image to be processed. Obtaining the features of the image to be processed by combining the semantic mask map with feature aggregation reduces the interference of distracting factors and improves the robustness of scene recognition.
An embodiment of the present application provides a smart device, including a processor and a memory coupled to each other, where the memory is configured to store program instructions for implementing the scene recognition method of any one of the above.
An embodiment of the present application provides a computer-readable storage medium storing a program file, where the program file can be executed to implement the scene recognition method of any one of the above.
An embodiment of the present application provides a computer program, including computer-readable code; when the computer-readable code runs in a smart device, a processor in the smart device executes the scene recognition method of any one of the above.
Embodiments of the present application provide a scene recognition method and apparatus, a smart device, a storage medium and a computer program. By acquiring an image to be processed and its corresponding semantic mask map, performing feature aggregation on the image to be processed according to the semantic mask map to obtain its feature vector, and then using the feature vector to determine, from the images to be recognized, an image matching the scene of the query image, the high-level semantic information of the image is obtained through the semantic mask map, and the combination of the semantic mask map with feature aggregation eliminates the interference brought by distracting factors in the image, thereby improving the robustness of scene recognition.
Brief description of the drawings
FIG. 1 is a schematic flowchart of an embodiment of a scene recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of an embodiment of step S11 in FIG. 1 according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of another embodiment of step S11 in FIG. 1 according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a scene recognition apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an embodiment of a smart device according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed description of embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
Scene recognition has important applications in the field of computer vision. The main task of the scene recognition problem is to identify the corresponding scene from a given image, giving the name of the scene or its geographic location, or picking images of similar scenes from a database; it can also be regarded as an image retrieval problem. The core of this type of problem is to describe the image, or the scene in the image, accurately. Two methods are commonly used: one computes a global description of the image directly, and the other uses local feature aggregation.
In the method that directly computes a global description of the image, the input is a complete image and the output is a global descriptor of the image. The simplest approach is to concatenate all pixel values of the image as its descriptor, or to use histograms to count the grayscale or gradient information of the pixels; such methods have extremely poor robustness. In the local feature aggregation method, the input is the local features extracted from the image and the output is an encoded feature vector. This method only uses local features, lacks high-level semantic information, and is not robust to illumination changes and dynamic scenes.
Semantic information, as high-level visual information, provides good guidance for scene recognition; using semantic information is also more consistent with human cognition. On this basis, an embodiment of the present application proposes a semantic-mask-based scene recognition method. The method uses semantic segmentation results to apply different weights to different regions of the image, effectively handling the negative impact of dynamic, unstable objects on scene recognition. At the same time, because a soft weighting method is used, the influence of the instability of semantic segmentation is effectively avoided; moreover, the method is also robust to seasonal changes.
The present application is described in detail below with reference to the accompanying drawings and embodiments.
Referring to FIG. 1, which is a schematic flowchart of a first embodiment of a scene recognition method according to an embodiment of the present application, the scene recognition method is executed by a smart device and includes:
Step S11: acquire an image to be processed and a semantic mask map corresponding to the image to be processed, where the image to be processed includes a query image and an image to be recognized.
In some embodiments, the image to be processed includes a query image and an image to be recognized, and the semantic mask map corresponding to the image to be processed includes the semantic mask map of the query image and the semantic mask map of the image to be recognized. Referring to FIG. 2, obtaining the semantic mask map corresponding to the image to be recognized includes:
Step S21: perform semantic segmentation on the image to be recognized and the query image to obtain the category of each pixel and the probability corresponding to that category.
Here, the query image is a user-defined image, which may be an image currently captured by the user or an image stored by the user in advance. The image to be recognized is an image, matching the query image, retrieved from a database according to the query image. The database is a server: the query image is input, and the server matches a plurality of images to be recognized with similar scenes for the query image.
Semantic segmentation is performed on the image to be recognized and the query image to obtain the category to which each pixel of the image belongs and the probability corresponding to that category.
Step S22: set a weight for the category of each pixel according to set conditions.
After the categories of the pixels are obtained, a weight is set for each category of pixels. In one embodiment, if the categories obtained by semantic segmentation include the four categories of fixed subcategory (e.g. stable), non-fixed subcategory (e.g. volatile), dynamic and unknown, then, to reduce the interference of dynamic features on scene recognition, the weight of the dynamic subcategory is set lowest, smaller than the weights of the fixed, non-fixed and unknown subcategories. In another embodiment, if the interference of non-fixed features on scene recognition needs to be reduced, the weight of the non-fixed subcategory is set lowest, smaller than the weights of the fixed, dynamic and unknown subcategories.
Step S23: obtain the semantic mask corresponding to each pixel according to the probability and the weight corresponding to the subcategory, where the semantic masks of all pixels constitute the semantic mask map.
In one embodiment, the semantic mask corresponding to each pixel is calculated with the following formula (1):
m_i = p_i × w_i      (1)
where m_i denotes the semantic mask corresponding to the i-th pixel (the map generated from all m_i is the semantic mask map), p_i denotes the probability of the subcategory to which the i-th pixel belongs, and w_i denotes the weight corresponding to the category or subcategory to which the i-th pixel belongs.
In another embodiment, if the category results after semantic segmentation do not consist of the four categories of fixed subcategory, non-fixed subcategory, dynamic and unknown, refer to FIG. 3, in which step S31 is the same as in FIG. 2. In this embodiment, if the category results after semantic segmentation do not consist of the four categories of stable, volatile, dynamic and unknown, the method further includes:
Step S32: classify all pixels by attribute to obtain one or more subcategories.
Attribute classification is performed on all pixels to obtain one or more subcategories; in one embodiment, the subcategories include at least two, or at least one, of a fixed subcategory, a non-fixed subcategory, a dynamic subcategory and an unknown subcategory.
Step S33: set a weight for each subcategory according to the set conditions.
Here, after the subcategories of the pixels are obtained, a weight is set for the pixels of each subcategory. In one embodiment, if the subcategories obtained by classifying the semantic segmentation results by attribute include the four categories of fixed, non-fixed, dynamic and unknown subcategories, then, to reduce the interference of dynamic features on scene recognition, the weight of the dynamic features is set lowest, smaller than the weights of the fixed, non-fixed and unknown subcategories. In another embodiment, if the interference of non-fixed features on scene recognition needs to be reduced, the weight of the non-fixed subcategory is set lowest, smaller than the weights of the fixed, dynamic and unknown subcategories.
Step S34: obtain the semantic mask corresponding to each pixel according to the probability and the weight corresponding to the subcategory, where the semantic masks of all pixels constitute the semantic mask map.
Here, in one embodiment, the semantic mask corresponding to each pixel is calculated with the following formula (2):
m_i = p_i × w_i      (2)
where m_i denotes the semantic mask corresponding to the i-th pixel (the map generated from all m_i is the semantic mask map), p_i denotes the probability of the subcategory to which the i-th pixel belongs, and w_i denotes the weight corresponding to the category or subcategory to which the i-th pixel belongs.
In the method provided by this embodiment, different weights are set for the pixel categories obtained by semantic segmentation, so as to reduce the interference those categories cause in feature recognition, thereby improving the robustness of scene recognition.
Step S12: perform feature aggregation on the image to be processed according to the semantic mask map to obtain a feature vector of the image to be processed.
Here, existing ways of performing feature aggregation on the features to be processed to obtain a feature vector include obtaining the feature vector through VLAD encoding. In some embodiments, obtaining the feature vector through VLAD encoding includes: performing feature extraction on the image to be processed to obtain a feature set. In another embodiment, feature extraction may also be performed on a preset image set to obtain the feature set; the preset image set may be the set of all images in the database and the server, a set of some images in the server, or a collection of pictures gathered by the user, which is not limited here. It can be understood that each image to be processed contains multiple features, i.e., multiple features are extracted from each image to be processed. All extracted features form a feature set, and a clustering algorithm is run on it to obtain K cluster centers; the K cluster centers are called the codebook, C = {c1, c2, ..., ck}.
The multiple features of one image to be processed form a feature set X = {x1, x2, ..., xk}. In some embodiments, the feature set X can then be aggregated into a feature vector of fixed length through the codebook C.
After the plurality of cluster centers are obtained, the cluster center corresponding to each feature x_i in each image to be processed is obtained from them: the position of the feature x_i is determined, and the cluster center closest to x_i is taken as the cluster center c_k corresponding to x_i. In one embodiment, after the cluster center c_k corresponding to the current feature x_i is determined, the value of c_k in the first dimension is determined. In some embodiments, the dimensionality of c_k is the same as that of its corresponding feature x_i; the value of c_k in the first dimension and the value of x_i in the first dimension are determined, and, to better distinguish the cluster center c_k from its corresponding feature x_i, the distance between c_k and the corresponding feature x_i is accumulated on the dimensions of c_k. In the embodiments of the present application, the first dimension may be dimension 1, dimension 2, dimension 3, and so on; the term "first dimension" is used only to make clear that cluster centers and features are aggregated in the same dimension.
In the existing feature recognition approach, the feature vectors of the query image and of the image to be recognized are obtained through the cluster center c_k and the value, in the first dimension, of the cluster center c_k corresponding to each feature. In some embodiments, the prior art generally obtains the feature vector of the query image or of the image to be recognized through the following formula (3):
v(k, j) = Σ_i α_k(x_i) × (x_i(j) - c_k(j))      (3)
where v(k, j) denotes the feature vector of the query image or of the image to be recognized, α_k(x_i) denotes the selection function and x_i is a feature; α_k(x_i) equals 1 when c_k is the cluster center of x_i and 0 otherwise; x_i(j) denotes the value of the j-th dimension of the i-th feature, and c_k(j) denotes the value of the j-th dimension of the k-th cluster center.
It can be understood that, when the feature vector of the query image is to be computed, v(k, j) denotes the feature vector of the query image, α_k(x_i) denotes the selection function and x_i is a feature of the query image; when c_k (a cluster center) is the cluster center corresponding to x_i, α_k(x_i) equals 1, otherwise α_k(x_i) equals 0; x_i(j) denotes the value of the j-th dimension of the i-th feature of the query image, and c_k(j) denotes the value of the j-th dimension of the k-th cluster center of the query image.
It can likewise be understood that, when the feature vector of the image to be recognized is to be computed, v(k, j), α_k(x_i), x_i, x_i(j) and c_k(j) refer in the same way to the image to be recognized.
In the technical solutions of the embodiments of the present application, to prevent the lack of high-level semantic information from letting dynamic features affect feature-vector recognition and thereby produce inaccurate results, feature aggregation is performed on the query image through the cluster center c_k corresponding to each feature x_i in the image to be processed, the value of that cluster center c_k in the first dimension and the value of each feature x_i in the first dimension, combined with the semantic mask map of the query image, to obtain the feature vector of the query image; likewise, feature aggregation is performed on the image to be recognized, combined with the semantic mask map of the image to be recognized, to obtain the feature vector of the image to be recognized.
Here, the embodiments of the present application obtain the feature vectors of the query image and of the image to be recognized through the following formula (4):
v(k, j)' = Σ_i m_i × α_k(x_i) × (x_i(j) - c_k(j))      (4)
where v(k, j)' denotes the feature vector of the query image or of the image to be recognized, α_k(x_i) denotes the selection function and x_i is a feature; α_k(x_i) equals 1 when c_k is the cluster center of x_i and 0 otherwise; x_i(j) denotes the value of the j-th dimension of the i-th feature, c_k(j) denotes the value of the j-th dimension of the k-th cluster center, and m_i denotes the semantic mask map of the query image or of the image to be recognized.
With the method of the embodiments of the present application, for example when an image contains a large number of dynamic objects, weighting can be performed with the semantic mask, thereby reducing the weight of the dynamic objects and improving the robustness of feature recognition.
Here, in one embodiment, when weighting with the semantic mask, if a feature is a pixel-level feature, the semantic mask at the corresponding position can be read directly according to the feature's position in the image; if the feature is a sub-pixel-level feature, it can be obtained by interpolating at the corresponding position on the semantic mask map.
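A minimal sketch of the sub-pixel case, assuming the mask map is a 2-D array and the feature location (x, y) lies inside the image (the function name is hypothetical and not part of the disclosure):

```python
import numpy as np

def mask_at(mask_map, x, y):
    """Bilinearly interpolate the semantic mask map at a sub-pixel location (x, y)."""
    h, w = mask_map.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * mask_map[y0, x0] + fx * mask_map[y0, x1]
    bottom = (1 - fx) * mask_map[y1, x0] + fx * mask_map[y1, x1]
    return (1 - fy) * top + fy * bottom
```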
In one embodiment, after the feature vectors of the query image and of the image to be recognized are obtained in the above manner, the feature vectors may first be normalized within each of the K cluster centers, and then the entire vector may be normalized.
Step S13: determine, from the images to be recognized, an image matching the scene of the query image by using the feature vector of the image to be processed.
After the feature vectors of the query image and of the images to be recognized are obtained in step S12, the image matching the scene of the query image is determined from the images to be recognized by the distance between the feature vector of the image to be recognized and the feature vector of the query image.
It can be understood that the closer two feature vectors are, the more similar the features, and the farther apart they are, the less similar. Therefore, in one embodiment, the image to be recognized whose feature vector is closest to the feature vector of the query image is determined as the image matching the query image.
In one embodiment, when there are multiple images to be recognized that match the query image, in order to obtain the most similar image, the matching images are ranked with a spatial consistency method so as to obtain the image most similar to the query image.
In the scene recognition method provided by the embodiments of the present application, the semantic mask map is combined with a traditional feature aggregation method, so that semantic mask weighting reduces the interference of dynamic features in the image on feature recognition and effectively avoids the negative impact of unstable objects on scene recognition. At the same time, the weighting effectively avoids the impact caused by the instability of semantic segmentation, thereby improving robustness. Moreover, the method of the embodiments of the present application also remains robust when seasons change.
On the basis of the foregoing embodiments, the embodiments of the present application further provide a scene recognition method that uses the semantic segmentation result to weight different regions of the image when generating the image's global feature vector; this ensures the robustness of the method when the scene contains a large number of dynamic objects or changes with the seasons. The scene recognition method can be implemented as follows:
(1) Semantic segmentation;
Here, the input of the semantic segmentation is an image and the output is the semantic segmentation result.
In the embodiments of the present application, a semantic segmentation network may be used to perform semantic segmentation on the input image. The segmentation result contains the category of each pixel and the probability of belonging to that category. The semantic segmentation network can be any network, and the segmented categories can be custom-defined and trained, or categories defined on public datasets can be used directly for training.
In some embodiments, the segmentation results can be further divided into four groups: stable categories, volatile categories, dynamic categories and unknown categories. If the segmentation results already coincide with these four groups, this further division is not performed; otherwise, the categories can be further divided according to the actual usage scenario. For example, for an indoor environment, the floor, walls and ceiling can be regarded as stable categories; beds, tables, chairs and the like as volatile categories; and people, cats, dogs and the like as dynamic categories. For outdoor scenes, buildings, road surfaces, street lights and the like can be regarded as stable categories; green plants, the sky and the like as volatile categories; and pedestrians, vehicles and the like as dynamic categories. Of course, this grouping can be adjusted according to the actual usage scenario; for example, in some indoor scenes the table can be regarded as a stable category.
(2) Semantic mask;
Here, the input of the semantic mask step is the semantic segmentation result and the output is the semantic mask map.
In the embodiments of the present application, suppose the weights corresponding to the stable, volatile, dynamic and unknown categories are w_1, w_2, w_3 and w_4 respectively (these weights can be set manually, e.g. 1.0, 0.5, 0.1 and 0.3 for the four groups). Then, for a pixel i in an image, two values p_i and w_i are obtained, where p_i is the probability of the category and w_i is the weight of the category. The semantic mask of the pixel can therefore be determined by multiplying the category probability by the category weight, i.e. m_i = p_i × w_i, where m_i is called the semantic mask corresponding to pixel i, and the image generated from all m_i is the semantic mask map.
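As an illustrative sketch only (the group encoding, array layout and function name are assumptions, and the weights are just the example values above), the mask map for the four-group case could be computed as:

```python
import numpy as np

# Example weights from the description: stable 1.0, volatile 0.5, dynamic 0.1, unknown 0.3.
GROUP_WEIGHTS = np.array([1.0, 0.5, 0.1, 0.3], dtype=np.float32)

def mask_from_groups(group_ids, probs):
    """m_i = p_i * w_i, where group_ids holds 0..3 for stable/volatile/dynamic/unknown."""
    return probs.astype(np.float32) * GROUP_WEIGHTS[group_ids]
```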
In some embodiments, the generated semantic mask can be embedded into current local feature aggregation methods as well as into end-to-end deep learning methods. Below, the VLAD method is taken as an example of how the semantic mask is used.
(3)基于语义掩码的VLAD特征聚合;
这里,所述特征聚合的输入是图像和对应的语义掩码图,输出是图像特征向量。
本申请实施例中,可以对训练集中的所有图像抽取局部特征(这个局部特征可以是稀疏特征,也可以是稠密的局部特征)来构建局部特征集合,并对该局部特征集合执行聚类算法获得K个聚类中心,所述K个聚类中心称为码书C={c_1,c_2,…,c_K}。
进而,对于从单幅图像上抽取的局部特征集合X={x_1,x_2,…,x_N}中的每一个特征,找到其最近的聚类中心,然后在对应维度上累加特征到聚类中心的残差,最终生成K×D维的图像特征向量,其中K是码书的大小,D是局部特征的维度。使用传统的VLAD编码方式,所述图像特征向量可以用如下公式(5)表达:
v(k,j)=\sum_{x_i\in X}\alpha_k(x_i)\left(x_i(j)-c_k(j)\right)        (5)
其中,α_k(x_i)为选择函数,用于指示特征x_i最近的聚类中心:当c_k为特征x_i最近的聚类中心时取1,否则取0;x_i(j)表示特征x_i的第j个维度对应的值,c_k(j)表示第k个聚类中心的第j个维度对应的值。这种方法对图像上的所有特征统一对待,因此在图像中含有大量动态物体时,容易被动态物体所干扰。为此,在一些实施例中可以引入上述的语义掩码进行加权,则本申请实施例中的所述图像特征向量可以用如下公式(6)表示:
v(k,j)'=\sum_{x_i\in X}m_i\,\alpha_k(x_i)\left(x_i(j)-c_k(j)\right)        (6)
其中,m_i为第i个特征对应的语义掩码。如果特征为像素级特征,则可以根据特征在图像中的位置直接获取对应位置的语义掩码;如果特征为亚像素级特征,则可以在语义掩码图上的对应位置通过插值获得。
最后,对于生成的特征向量,先在K个类中分别做归一化,然后将整个向量一起做归一化。
(4)基于VLAD的场景识别;
这里,所述场景识别的输入是图像和语义掩码得到的特征向量,输出是最相似的场景。
本申请实施例中,按照上述步骤(3)中的方法对所有数据库图像抽取特征向量,构建图像特征数据库。然后,对查询图像同样抽取特征向量,使用查询图像的特征向量与数据库中的图像特征向量进行距离比较,找到距离最小的前几张图像作为检索结果,再采用空间一致性验证对检索到的这几张图像进行重新排序,获得最相似的场景图像。
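将以上步骤串联起来的端到端流程可以参考下面的示意代码。其中masked_vlad_encode与normalize_vlad假设为前文示例中定义的函数,输入的局部特征与掩码值假设已经准备好,top_n等参数为示例假设,空间一致性重排的细节此处从简:

```python
# 端到端示意:由语义掩码加权的 VLAD 特征向量完成场景检索,返回前 top_n 个候选。
import numpy as np

def retrieve_scene(query_feats, query_masks, db_feats_list, db_masks_list, codebook, top_n=5):
    def describe(feats, masks):
        v = masked_vlad_encode(feats, masks, codebook)          # 公式(6)的加权聚合
        return normalize_vlad(v, num_clusters=codebook.shape[0])

    # 1) 对所有数据库图像抽取特征向量,构建图像特征数据库
    db_vectors = np.stack([describe(f, m) for f, m in zip(db_feats_list, db_masks_list)])
    # 2) 对查询图像抽取特征向量,按 L2 距离取前 top_n 个候选
    q = describe(query_feats, query_masks)
    dists = np.linalg.norm(db_vectors - q, axis=1)
    return np.argsort(dists)[:top_n]    # 后续可对这 top_n 个候选做空间一致性重排
```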
如此,在一些使用场景下,例如在自动驾驶场景中,道路上通常会有很多的车辆,而真正对识别有意义的则是路边的建筑。此时本申请实施例提供的使用语义掩码的场景识别方法可以有效地处理这些动态物体,通过赋予一个较低的权重则可以有效地减轻其对图像描述的干扰。同时,本申请实施例中使用语义掩码的场景识别方法可以对判别性较强的类别赋予更高的权重,从而提高其在图像描述中的比例,进而有效地抑制了无判别性的区域如道路、地板等。
在一些实施例中,本申请实施例提供的场景识别方法的使用场景可以包括:在视觉定位算法中,通常会先用图像级描述检索一个相似场景,从而缩小局部特征的匹配范围。如果在建图或定位的时候,目标场景中含有大量的动态物体,例如商场中来来往往的行人、道路上的车辆等等,不加处理直接使用则会非常影响检索的性能,降低检索的成功率。对于室外环境,如果建图和定位处于不同的季节,室外的绿植因为季节的变化表现出不同的形态,也会极大地影响场景识别的效果。如果采用本申请实施例提出的方法,可以有效地处理这些问题。当然,本申请实施例提供的场景识别方法还包括其他的使用场景,本领域技术人员可以根据实际需要进行使用。
请参见图4,为本申请实施例场景识别装置的一实施例的结构示意图,包括:获取模块41、特征聚合模块42及图像匹配模块43。
其中,获取模块41配置为获取待处理图像以及待处理图像对应的语义掩码图;其中,待处理图像包括查询图像及待识别图像,待处理图像对应的语义掩码图包括查询图像的语义掩码图和待识别图像的语义掩码图。其中,获取模块41配置为获取查询图像,根据查询图像从数据库中获取与查询图像匹配的多个待识别图像;对待识别图像及所述查询图像进行语义分割处理,得到每一像素的类别及类别对应的概率;按照设定条件对每一像素的类别设置权重;根据类别对应的概率及类别对应的权重得到每一像素对应的语义掩码,其中,所有像素对应的语义掩码构成语义掩码图。在一实施例中,获取模块41还配置为对所有像素进行属性分类,以得到一个或多个子类别;按照设定条件对每一子类别设置权重;根据子类别对应的概率及子类别对应的权重得到每一所述像素对应的语义掩码,其中,所有像素对应的语义掩码构成语义掩码图。
其中,特征聚合模块42配置为根据语义掩码图对待处理图像进行特征聚合处理,得到待处理图像的特征向量。其中,特征聚合模块42配置为对所述待处理图像进行特征抽取,得到特征集合;依据所述特征集合形成多个聚类中心;根据多个所述聚类中心得到每一所述待处理图像中的每一特征对应的聚类中心;确定所述待处理图像中的每一特征在第一维度对应的值,以及,确定所述待处理图像中的所述每一特征对应的聚类中心在所述第一维度对应的值;
通过所述待处理图像中的每一特征对应的聚类中心,所述待处理图像中的每一特征对应的聚类中心在第一维度对应的值,以及,所述待处理图像中的所述每一特征在所述第一维度对应的值,结合所述查询图像的语义掩码图,对所述查询图像进行特征聚合处理,以得到所述查询图像的特征向量。以及通过所述待处理图像中的每一特征对应的聚类中心,所述待处理图像中的每一特征对应的聚类中心在第一维度对应的值,以及,所述待处理图像中的每一特征在所述第一维度对应的值,结合所述待识别图像的语义掩码图,对所述待识别图像进行特征聚合处理,以得到所述待识别图像的特征向量。
其中,图像匹配模块43配置为利用待处理图像的特征向量从待识别图像中确定与查询图像的场景匹配的图像。其中,图像匹配模块43配置为根据所述待识别图像的特征向量与所述查询图像的特征向量的距离,从所述待识别图像中确定与所述查询图像场景匹配的图像。在一实施例中,图像匹配模块43配置为将距离所述查询图像的特征向量最近的特征向量对应的所述待识别图像确定为所述查询图像匹配的图像。在一实施例中,图像匹配模块43还配置为在所述待识别图像中与所述查询图像匹配的图像为多个时,采用空间一致性方法将与所述查询图像匹配的图像进行排列,以获取到与所述查询图像最相似的图像。
本申请实施例提供的场景识别装置,通过将语义掩码图与传统的特征聚合方式进行结合,以语义掩码加权的方式降低图像中动态特征对特征识别的干扰,进而提高其鲁棒性。
请参见图5,为本申请实施例智能设备的结构示意图。智能设备包括相互连接的存储器52和处理器51。
存储器52用于存储实现上述任意一项的场景识别方法的程序指令。
处理器51用于执行存储器52存储的程序指令。
其中,处理器51还可以称为中央处理单元(Central Processing Unit,CPU)。处理器51可能是一种集成电路芯片,具有信号的处理能力。处理器51还可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器,或者该处理器也可以是任何常规的处理器等。
存储器52可以为内存条、快闪存储器卡(Trans-flash,简称TF卡)等,可以存储智能设备中的全部信息,包括输入的原始数据、计算机程序、中间运行结果和最终运行结果。存储器根据控制器指定的位置存入和取出信息。有了存储器,智能设备才有记忆功能,才能保证正常工作。智能设备的存储器按用途可分为主存储器(内存)和辅助存储器(外存),也有分为外部存储器和内部存储器的分类方法。外存通常是磁性介质或光盘等,能长期保存信息。内存指主板上的存储部件,用来存放当前正在执行的数据和程序,仅用于暂时存放程序和数据,关闭电源或断电后数据会丢失。
在本申请所提供的几个实施例中,应该理解到,所揭露的方法和装置,可以通过其它的方式实现。例如,以上所描述的装置实施方式仅仅是示意性的,例如,模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性、机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施方式方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,系统服务器,或者网络设备等)或处理器(Processor)执行本申请各个实施方式方法的全部或部分步骤。
请参阅图6,为本申请实施例计算机可读存储介质的结构示意图。本申请实施例的计算机可读存储介质存储有能够实现上述所有场景识别方法的程序文件61,其中,该程序文件61可以以软件产品的形式存储在上述存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器执行本申请各个实施方式方法的全部或部分步骤。而前述的存储装置包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质,或者是计算机、服务器、手机、平板等终端设备。
在一些实施例中,本申请实施例提供一种计算机程序,包括计算机可读代码,当所述计算机可读代码在智能设备中运行时,所述智能设备中的处理器执行以实现上述方法。
以上仅为本申请的实施方式,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。
工业实用性
本申请实施例提供了一种场景识别方法、装置、智能设备、存储介质和计算机程序,所述场景识别方法包括:获取待处理图像以及所述待处理图像对应的语义掩码图;其中,所述待处理图像包括查询图像及待识别图像,所述待处理图像对应的语义掩码图包括所述查询图像的语义掩码图和所述待识别图像的语义掩码图;根据所述语义掩码图对所述待处理图像进行特征聚合处理,得到所述待处理图像的特征向量;利用所述待处理图像的特征向量从所述待识别图像中确定与所述查询图像的场景匹配的图像。根据本申请实施例提供的场景识别方法能够通过语义掩码图结合特征聚合方式得到待处理图像对应的特征,以此能够降低干扰因素的干扰,提高场景识别的鲁棒性。

Claims (14)

  1. 一种场景识别方法,其中,所述方法由智能设备执行,所述方法包括:
    获取待处理图像以及所述待处理图像对应的语义掩码图;其中,所述待处理图像包括查询图像及待识别图像,所述待处理图像对应的语义掩码图包括所述查询图像的语义掩码图和所述待识别图像的语义掩码图;
    根据所述语义掩码图对所述待处理图像进行特征聚合处理,得到所述待处理图像的特征向量;
    利用所述待处理图像的特征向量从所述待识别图像中确定与所述查询图像的场景匹配的图像。
  2. 根据权利要求1所述的场景识别方法,其中,所述获取待处理图像以及所述待处理图像对应的语义掩码图包括:
    对所述待识别图像及所述查询图像进行语义分割处理,得到每一像素的类别及所述类别对应的概率;
    按照设定条件对每一像素的类别设置权重;
    根据所述类别对应的概率及所述类别对应的权重得到每一所述像素对应的语义掩码,其中,所有所述像素对应的语义掩码构成语义掩码图。
  3. 根据权利要求2所述的方法,其中,所述按照设定条件对每一像素的类别设置权重之前还包括:
    对所有像素进行属性分类,得到一个或多个子类别;
    按照设定条件对每一所述子类别设置权重;
    根据所述子类别对应的概率及所述子类别对应的权重得到每一所述像素对应的语义掩码,其中,所有所述像素对应的语义掩码构成语义掩码图。
  4. 根据权利要求3所述的方法,其中,所述子类别包括固定子类别、不固定子类别、动态子类别和未知子类别中的至少两种;
    所述动态子类别的权重小于所述固定子类别、所述不固定子类别及所述未知子类别的权重。
  5. 根据权利要求4所述的方法,其中,所述根据所述子类别对应的概率及所述子类别对应的权重得到每一所述像素对应的语义掩码包括:利用公式m_i=p_i×w_i计算所述像素对应的语义掩码;
    其中,m_i表示第i个像素对应的语义掩码,其生成的图为语义掩码图,p_i表示第i个像素所属的子类别的概率,w_i表示第i个像素所属的类别或子类别对应的权重。
  6. 根据权利要求1所述的方法,其中,所述根据所述语义掩码图对所述待处理图像进行特征聚合处理,得到所述待处理图像的特征向量包括:
    对所述待处理图像进行特征抽取,得到特征集合;
    依据所述特征集合形成多个聚类中心;
    根据多个所述聚类中心得到每一所述待处理图像中的每一特征对应的聚类中心;
    确定所述待处理图像中的每一特征在第一维度对应的值,以及确定所述待处理图像中的所述每一特征对应的聚类中心在所述第一维度对应的值;
    通过所述待处理图像中的每一特征对应的聚类中心,所述待处理图像中的每一特征对应的聚类中心在第一维度对应的值,以及,所述待处理图像中的所述每一特征在所述第一维度对应的值,结合所述查询图像的语义掩码图,对所述查询图像进行特征聚合处理,得到所述查询图像的特征向量;
    通过所述待处理图像中的每一特征对应的聚类中心,所述待处理图像中的每一特征对应的聚类中心在第一维度对应的值,以及,所述待处理图像中的每一特征在所述第一维度对应的值,结合所述待识别图像的语义掩码图,对所述待识别图像进行特征聚合处理,得到所述待识别图像的特征向量。
  7. 根据权利要求6所述的方法,其中,所述依据所述特征集合形成多个聚类中心包括:
    利用聚类算法对所述特征集合进行处理,以形成多个聚类中心;
    所述根据多个所述聚类中心得到每一所述待处理图像中的每一特征对应的聚类中心包括:
    将距离每一所述特征最近的聚类中心作为所述待处理图像中的每一特征对应的聚类中心。
  8. 根据权利要求1至7任一项所述的方法,其中,所述利用所述待处理图像的特征向量从所述待识别图像中确定与所述查询图像的场景匹配的图像包括:
    根据所述待识别图像的特征向量与所述查询图像的特征向量的距离,从所述待识别图像中确定与所述查询图像场景匹配的图像。
  9. 根据权利要求8所述的方法,其中,所述根据所述待识别图像的特征向量与所述查询图像的特征向量的距离,从所述待识别图像中确定与所述查询图像场景匹配的图像的步骤包括:
    将距离所述查询图像的特征向量最近的特征向量对应的所述待识别图像确定为所述查询图像匹配的图像。
  10. 根据权利要求9所述的方法,其中,所述待识别图像中与所述查询图像匹配的图像为多个;
    所述将距离所述查询图像的特征向量最近的特征向量对应的所述待识别图像确定为所述查询图像匹配的图像之后还包括:
    采用空间一致性方法将与所述查询图像匹配的图像进行排列,以获取到与所述查询图像最相似的图像。
  11. 一种场景识别装置,其中,包括:
    获取模块,配置为获取待处理图像以及所述待处理图像对应的语义掩码图;其中,所述待处理图像包括查询图像及待识别图像,所述待处理图像对应的语义掩码图包括所述查询图像的语义掩码图和所述待识别图像的语义掩码图;
    特征聚合模块,配置为根据所述语义掩码图对所述待处理图像进行特征聚合处理,得到所述待处理图像的特征向量;
    图像匹配模块,配置为利用所述待处理图像的特征向量从所述待识别图像中确定与所述查询图像的场景匹配的图像。
  12. 一种智能设备,其中,包括:相互耦接的处理器及存储器,其中,
    所述存储器用于存储实现如权利要求1-10任意一项所述的场景识别方法的程序指令。
  13. 一种计算机可读存储介质,其中,存储有程序文件,所述程序文件能够被执行以实现如权利要求1-10任意一项所述的场景识别方法。
  14. 一种计算机程序,其中,包括计算机可读代码,当所述计算机可读代码在智能设备中运行时,所述智能设备中的处理器执行用于实现如权利要求1-10任意一项所述的场景识别方法。
PCT/CN2021/106936 2020-11-10 2021-07-16 场景识别方法、装置、智能设备、存储介质和计算机程序 WO2022100133A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2022543759A JP2023510945A (ja) 2020-11-10 2021-07-16 シーン識別方法及びその装置、インテリジェントデバイス、記憶媒体並びにコンピュータプログラム

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011249944.4 2020-11-10
CN202011249944.4A CN112329660B (zh) 2020-11-10 2020-11-10 一种场景识别方法、装置、智能设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022100133A1 true WO2022100133A1 (zh) 2022-05-19

Family

ID=74317739

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/106936 WO2022100133A1 (zh) 2020-11-10 2021-07-16 场景识别方法、装置、智能设备、存储介质和计算机程序

Country Status (3)

Country Link
JP (1) JP2023510945A (zh)
CN (1) CN112329660B (zh)
WO (1) WO2022100133A1 (zh)

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN117009532A (zh) * 2023-09-21 2023-11-07 腾讯科技(深圳)有限公司 语义类型识别方法、装置、计算机可读介质及电子设备

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN112329660B (zh) * 2020-11-10 2024-05-24 浙江商汤科技开发有限公司 一种场景识别方法、装置、智能设备及存储介质
CN113393515B (zh) * 2021-05-21 2023-09-19 杭州易现先进科技有限公司 一种结合场景标注信息的视觉定位方法和系统

Citations (6)

Publication number Priority date Publication date Assignee Title
US20140133759A1 (en) * 2012-11-14 2014-05-15 Nec Laboratories America, Inc. Semantic-Aware Co-Indexing for Near-Duplicate Image Retrieval
CN105335757A (zh) * 2015-11-03 2016-02-17 电子科技大学 一种基于局部特征聚合描述符的车型识别方法
CN107239535A (zh) * 2017-05-31 2017-10-10 北京小米移动软件有限公司 相似图片检索方法及装置
US20190295260A1 (en) * 2016-10-31 2019-09-26 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN111027493A (zh) * 2019-12-13 2020-04-17 电子科技大学 一种基于深度学习多网络软融合的行人检测方法
CN112329660A (zh) * 2020-11-10 2021-02-05 浙江商汤科技开发有限公司 一种场景识别方法、装置、智能设备及存储介质

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN107871143B (zh) * 2017-11-15 2019-06-28 深圳云天励飞技术有限公司 图像识别方法及装置、计算机装置和计算机可读存储介质
CN108710847B (zh) * 2018-05-15 2020-11-27 北京旷视科技有限公司 场景识别方法、装置及电子设备
JP7026826B2 (ja) * 2018-09-15 2022-02-28 北京市商▲湯▼科技▲開▼▲發▼有限公司 画像処理方法、電子機器および記憶媒体
CN109829383B (zh) * 2018-12-29 2024-03-15 平安科技(深圳)有限公司 掌纹识别方法、装置和计算机设备
CN111709398A (zh) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 一种图像识别的方法、图像识别模型的训练方法及装置

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
US20140133759A1 (en) * 2012-11-14 2014-05-15 Nec Laboratories America, Inc. Semantic-Aware Co-Indexing for Near-Duplicate Image Retrieval
CN105335757A (zh) * 2015-11-03 2016-02-17 电子科技大学 一种基于局部特征聚合描述符的车型识别方法
US20190295260A1 (en) * 2016-10-31 2019-09-26 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN107239535A (zh) * 2017-05-31 2017-10-10 北京小米移动软件有限公司 相似图片检索方法及装置
CN111027493A (zh) * 2019-12-13 2020-04-17 电子科技大学 一种基于深度学习多网络软融合的行人检测方法
CN112329660A (zh) * 2020-11-10 2021-02-05 浙江商汤科技开发有限公司 一种场景识别方法、装置、智能设备及存储介质

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN117009532A (zh) * 2023-09-21 2023-11-07 腾讯科技(深圳)有限公司 语义类型识别方法、装置、计算机可读介质及电子设备
CN117009532B (zh) * 2023-09-21 2023-12-19 腾讯科技(深圳)有限公司 语义类型识别方法、装置、计算机可读介质及电子设备

Also Published As

Publication number Publication date
CN112329660B (zh) 2024-05-24
JP2023510945A (ja) 2023-03-15
CN112329660A (zh) 2021-02-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21890664

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022543759

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21890664

Country of ref document: EP

Kind code of ref document: A1
