CN112487927A - Indoor scene recognition implementation method and system based on object associated attention


Info

Publication number
CN112487927A
Authority
CN
China
Prior art keywords
feature
objects
expression
module
input image
Prior art date
Legal status
Granted
Application number
CN202011344887.8A
Other languages
Chinese (zh)
Other versions
CN112487927B (en)
Inventor
苗博
周立广
林天麟
徐扬生
Current Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Chinese University of Hong Kong CUHK
Original Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Chinese University of Hong Kong CUHK
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Artificial Intelligence and Robotics and The Chinese University of Hong Kong (CUHK)
Priority to CN202011344887.8A
Publication of CN112487927A
Application granted
Publication of CN112487927B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/35Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V20/36Indoor scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and system for indoor scene recognition based on object-associated attention. The method comprises the following steps: extracting a semantic feature vector for each spatial position in an input image through a backbone network; assembling the semantic feature vectors of all spatial positions into a feature map according to their spatial positions, and transmitting the feature map to a segmentation module to calculate the probability that each spatial position in the input image belongs to different objects; and calculating a feature vector for each object through an object feature aggregation module, which multiplies the feature vectors at all spatial positions of each object by the probability that those positions belong to the object and performs a weighted average, thereby obtaining the feature vector expression of each object. Because different scenes contain different objects, the method and system use the object feature aggregation module to detect the features of all objects in the input image, thereby better expressing the information contained in the image.

Description

Indoor scene recognition implementation method and system based on object associated attention
Technical Field
The invention relates to an intelligent recognition method and software system, and in particular to an improved method and system for indoor scene recognition based on object-associated attention features.
Background
The ability to perceive environmental information is indispensable for a robot: accurate perception of the surrounding scene helps the robot make correct judgments and take correct actions.
As technology and computing power have advanced, a number of deep-learning-based scene recognition algorithms have been proposed. Herranz et al. found that feature extraction needs to adapt to images at different scales, and performed multi-scale fusion of features obtained from models trained on different datasets to recognize scenes; see Scene Recognition with CNNs: Objects, Scales and Dataset Bias, CVPR 2016, pages 571-579 (CVPR: IEEE Conference on Computer Vision and Pattern Recognition).
However, the improvement achievable from global image information alone is limited: such methods are not only hard to interpret semantically, but are also easily disturbed by common objects that appear across scenes.
Therefore, some researchers have attempted to realize scene recognition by combining contextual information with local object associations. López-Cifuentes et al. obtain context information through semantic segmentation to help eliminate the ambiguity caused by objects common to different scenes; see Semantic-Aware Scene Recognition, Pattern Recognition, vol. 102, article 107256.
Wang et al. train PatchNets in a weakly supervised manner, use them to guide local feature extraction, and finally aggregate the local features according to semantic probabilities to realize scene recognition; see Weakly Supervised PatchNets: Describing and Aggregating Local Patches for Scene Recognition, IEEE Transactions on Image Processing, vol. 26, pages 2028-2041.
Meanwhile, many studies improve a model's scene understanding by combining multi-modal features. However, most prior-art indoor scene recognition methods combine manually designed features with global features; this is not only computationally expensive, it also fails to learn the relationships between objects effectively, and so cannot recognize scenes accurately.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
The invention aims to provide a method and system for indoor scene recognition based on object-associated attention: a fast and accurate object-association recognition scheme that addresses the inaccurate overall recognition and redundant network structures of the prior art.
The technical scheme of the invention is as follows:
an indoor scene recognition implementation method based on object associated attention comprises the following steps:
A. extracting a semantic feature vector for each spatial position in an input image through a backbone network;
B. assembling the semantic feature vectors of all spatial positions into a feature map according to their spatial positions, and transmitting the feature map to a segmentation module to calculate the probability that each spatial position in the input image belongs to different objects;
C. calculating a feature vector for each object through an object feature aggregation module, which multiplies the feature vectors at all spatial positions of each object by the probability that those positions belong to the object and performs a weighted average, thereby obtaining the feature vector expression of each object;
D. splicing the feature vectors of all the objects to form the object feature expression of the input image.
In the method for realizing indoor scene recognition based on object-related attention, the backbone network and the object feature aggregation module calculate the feature expressions of different objects from the hidden feature vectors at different spatial positions.
The method for realizing indoor scene recognition based on object-related attention further comprises, after the step D:
E. inputting the object feature expression into a lightweight object-associated attention module, wherein the lightweight object-associated attention module is implemented as a neural network and is used for calculating the relations between the objects.
In the method for realizing indoor scene recognition based on object-related attention, the step E comprises:
E1. calculating the relation feature vector expression between each object and all other objects based on the neural network and cosine similarity, and splicing the relation feature vector expression into the feature vector expression of the object.
The method for realizing indoor scene recognition based on object-related attention further comprises, after the step E:
F. inputting the feature vector expressions and the relation feature vector expressions of the objects into a global association aggregation module, so as to aggregate the relations among all the objects and form a common feature expression vector of all the objects.
The method for realizing indoor scene recognition based on object-related attention further comprises, after the step F:
G. inputting the common feature expression vector into a classification module formed by a neural network fully connected layer to identify the scene to which the input image belongs.
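For orientation, the following condensed, self-contained PyTorch sketch chains steps A-G end to end. Everything in it is an illustrative stand-in: the modules are single convolutions and linear layers, all dimensions (1024-dimensional features, 150 object classes, 10 scene classes) are assumed rather than fixed by the invention, and the aggregation in steps C-D uses the membership probabilities alone; the exact weighting is given in the detailed description below.

```python
# Condensed stand-in pipeline for steps A-G (all dimensions assumed).
import torch
import torch.nn as nn

C, K, SCENES = 1024, 150, 10          # feature dim, object classes, scene classes

backbone = nn.Conv2d(3, C, 3, stride=4, padding=1)   # step A: per-position semantic features
seg_head = nn.Conv2d(C, K, 1)                        # step B: segmentation module
relation = nn.Linear(C, C)                           # step E: stand-in attention module
agg_fc   = nn.Linear(2 * C, C)                       # step F: stand-in global aggregation
classify = nn.Linear(C, SCENES)                      # step G: classification module

img = torch.randn(1, 3, 224, 224)
feat = backbone(img)                                 # (1, C, 56, 56) feature map
prob = seg_head(feat).softmax(dim=1)                 # (1, K, 56, 56) object probabilities
# steps C-D: probability-weighted average of position features, one vector per object
w = prob.flatten(2)                                  # (1, K, N)
obj = torch.einsum('bkn,bcn->bkc', w, feat.flatten(2)) / (w.sum(-1, keepdim=True) + 1e-6)
rel = relation(obj)                                  # step E: relation features
common = agg_fc(torch.cat([obj, rel], -1)).mean(1)   # step F: common feature vector
logits = classify(common)                            # step G: scene scores
print(logits.shape)                                  # torch.Size([1, 10])
```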
An indoor scene recognition implementation system based on object-associated attention, comprising:
a backbone network for extracting semantic feature vectors of each spatial position from the input image;
a segmentation module for assembling the semantic feature vectors of all the spatial positions into a feature map according to their spatial positions, and for calculating the probability that each spatial position in the input image belongs to different objects;
an object feature aggregation module, configured to calculate a feature vector for each object by multiplying the feature vectors at all spatial positions of each object by the probability that those positions belong to the object and performing a weighted average, thereby obtaining the feature vector expression of each object;
and the object feature aggregation module is also used for splicing the feature vectors of all the objects to form the object feature expression of the input image.
The system for realizing indoor scene recognition based on object-related attention further comprises: a lightweight object-associated attention module, which calculates the relation feature vector expression between each object and all other objects based on a neural network and cosine similarity, and splices the relation feature vector expression into the feature vector expression of the object.
The system for realizing indoor scene recognition based on object-related attention further comprises: a global association aggregation module, which takes the feature vector expressions and relation feature vector expressions of the objects as input, and aggregates the relations among all the objects to form a common feature expression vector of all the objects.
The system for realizing indoor scene recognition based on object-related attention further comprises: a classification module, which takes the common feature expression vector as input and identifies the scene to which the input image belongs.
According to the method and system for indoor scene recognition based on object-associated attention provided by the invention, because different scenes contain different objects, the object feature aggregation module detects the features of all objects in the input image so as to better express the information the image contains. Meanwhile, because coexisting objects are distributed differently in different scenes, the lightweight object-associated attention module and the global association aggregation module learn and aggregate the relations between objects, finally generating a common feature expression vector that allows the subsequent classification module to distinguish different scenes. The method is efficient, recognizes accurately, and is suitable for recognizing and judging different indoor scenes.
Drawings
Fig. 1 is a module and flow diagram of the method and system for indoor scene recognition based on object-associated attention according to a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of the object feature aggregation module in a preferred embodiment of the method and system for indoor scene recognition based on object-associated attention.
Fig. 3 is a schematic diagram of the lightweight object-associated attention module in a preferred embodiment of the method and system for indoor scene recognition based on object-associated attention.
Fig. 4 is a schematic diagram of the global association aggregation module in a preferred embodiment of the method and system for indoor scene recognition based on object-associated attention.
Detailed Description
The following describes in detail preferred embodiments of the present invention.
In the method and system for indoor scene recognition based on object-associated attention, analysis of the neural-network recognition process shows that the distribution of coexisting objects differs between scenes, so indoor scene recognition performance can be improved by learning object relationships. The invention therefore provides an object feature aggregation module that detects and extracts the features of all objects in a picture, learns the relationships between objects through the proposed lightweight object-associated attention module, aggregates the object features and object relationships through a global association aggregation module, and realizes indoor scene recognition through a fully connected layer. This realizes scene recognition from a brand-new angle and is more effective than prior-art methods.
As shown in fig. 1, in the preferred embodiment of the method and system for indoor scene recognition based on object-associated attention, an input image is first obtained; it may be a still image captured by a camera or a single frame extracted from a video. The backbone network extracts a semantic feature vector for each spatial position of the input image, the feature vectors of all spatial positions are assembled into a feature map according to their positions, and the feature map is transmitted to a segmentation module to calculate the probability that each spatial position in the input image belongs to different objects.
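By way of illustration, a minimal PyTorch sketch of this first stage might look as follows; the truncated ResNet-50 backbone, 1024-channel features and 150 object classes are assumptions chosen to match the dimensions mentioned later in this description, not details fixed by the invention.

```python
# Minimal sketch: backbone feature map F plus segmentation probability map S.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FeatureAndSegmentation(nn.Module):
    def __init__(self, feat_dim=1024, num_objects=150):
        super().__init__()
        base = resnet50(weights=None)
        # keep ResNet-50 up to its 1024-channel stage as the backbone (assumption)
        self.backbone = nn.Sequential(*list(base.children())[:-3])
        self.seg_head = nn.Conv2d(feat_dim, num_objects, kernel_size=1)

    def forward(self, img):
        feat = self.backbone(img)              # (B, 1024, H/16, W/16): per-position features
        prob = self.seg_head(feat).softmax(1)  # (B, 150, H/16, W/16): object probabilities
        return feat, prob

F, S = FeatureAndSegmentation()(torch.randn(1, 3, 224, 224))
print(F.shape, S.shape)  # torch.Size([1, 1024, 14, 14]) torch.Size([1, 150, 14, 14])
```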
Based on the feature map computed by the backbone network and the object-membership probability map computed by the segmentation module, the newly proposed object feature aggregation module then calculates the feature vector of each object: the feature vectors at all spatial positions of each object are multiplied by the probability that those positions belong to the object and averaged with those probabilities as weights, yielding the feature vector expression of each object. Finally, the feature vectors of all the objects are spliced to form the object feature expression of the picture.
The object feature expression is then input into the lightweight object-associated attention module newly proposed by the invention to calculate the relationships between objects: based on a neural network and cosine similarity, the module computes the relation feature vector expression between each object and all other objects and splices it onto the object's feature vector expression, thereby enriching the object features.
The invention further inputs the object feature vector expressions and the object relation feature vector expressions into a newly proposed global association aggregation module, which aggregates the relations among all objects into a common feature expression vector for all objects in the input image. Finally, this feature expression vector is input into a classification module formed by a neural network fully connected layer to identify which scene the picture belongs to.
Specifically, after a camera acquires an input image of an indoor scene, object features are analyzed at all spatial positions of the input image so that the scene can be judged from all object features the image contains. The judgment does not rely simply on local object features; it also considers the relationships among all objects in the input image, so that indoor scenes such as a kitchen, bedroom, living room or dining room can be distinguished more accurately and effectively. Recognizing the relation features among objects prevents interference from object features that commonly appear across different scenes, making scene recognition more accurate.
Fig. 2 shows a preferred implementation of the object feature aggregation module in the method and system for indoor scene recognition based on object-associated attention. To extract the object features in the input image effectively, the invention proposes the scheme of fig. 2: first, a spatial-position feature map F and an object-membership probability map S are computed from the input image by the scene-segmentation backbone network; then, for each object, the feature vectors at all spatial positions are summed, weighted by the membership probability of that object at the corresponding positions, to obtain the feature vector expression O of the object.
Finally, the feature vector expressions of all objects are spliced to obtain the object feature expression of the input image: an absent object is represented by an all-zero vector, while each present object has its own feature vector, as illustrated in fig. 2; the final feature dimension is 1024×150×1.
The object feature aggregation module computes the feature vector of each object by the formula below, where O_j denotes the feature vector of object j, B_ij indicates whether object j has the maximum probability at the i-th pixel position, S_ij is the probability that the i-th pixel position belongs to object j, and F_i is the feature vector at the i-th pixel position. The feature vector expression of each object is determined as:
$$O_j = \frac{\sum_i B_{ij}\, S_{ij}\, F_i}{\sum_i B_{ij}\, S_{ij}}$$
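As a concrete illustration of this formula, the following PyTorch sketch computes O from the feature map F and probability map S (shapes as in the backbone sketch above); the small epsilon is an added assumption so that an absent object yields an all-zero vector instead of a division by zero.

```python
# Sketch of the aggregation: O_j = sum_i(B_ij*S_ij*F_i) / sum_i(B_ij*S_ij).
import torch

def aggregate_object_features(F, S, eps=1e-6):
    """F: (B, C, H, W) position features; S: (B, K, H, W) object probabilities."""
    K = S.shape[1]
    Fi = F.flatten(2)                                   # (B, C, N) with N = H*W
    Si = S.flatten(2)                                   # (B, K, N)
    # B_ij = 1 where object j has the maximum probability at position i
    Bij = (Si.argmax(1, keepdim=True) ==
           torch.arange(K, device=S.device).view(1, K, 1)).float()
    w = Bij * Si                                        # weights B_ij * S_ij
    num = torch.einsum('bkn,bcn->bkc', w, Fi)           # sum_i B_ij S_ij F_i
    den = w.sum(2, keepdim=True)                        # sum_i B_ij S_ij
    return num / (den + eps)                            # (B, K, C); absent objects stay zero

O = aggregate_object_features(torch.randn(1, 1024, 14, 14),
                              torch.rand(1, 150, 14, 14).softmax(1))
print(O.shape)  # torch.Size([1, 150, 1024])
```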
therefore, the calculation of the expression of the object feature vector of each divided region of the input image can realize the judgment of different objects, but the coexistence relationship of the objects is difficult to express only through the object features.
Thus, the present invention further provides a lightweight object-associated attention module for calculating the coexistence relationships between objects, as shown in fig. 3.
To effectively transfer the object features from scene segmentation into scene recognition and to learn the latent relationships between objects, the invention further provides a lightweight object-associated attention module. As shown in fig. 3, the module is a cascade of one or more lightweight object-associated attention blocks implemented as a neural network. Compared with existing methods, computing K and V from Q reduces the amount of computation by 50%, and the dimensions of Q, K, V and the output features can all be controlled simply by adjusting the value of α. The relation expression of each object is obtained from K and V through matrix multiplication; finally, the object relations are spliced with the original object features and output to the next module, so that both the object relations and the object features are expressed as feature vectors.
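The following is a speculative sketch of one such block. The description fixes only that K and V are derived from Q, that object relations use cosine similarity combined by matrix multiplication, that α scales the Q, K, V and output widths, and that the relation features are spliced onto the original features; the single shared projection and the softmax normalization are illustrative choices of this sketch, not the patented structure.

```python
# Speculative sketch of a lightweight object-association attention block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightObjectAttention(nn.Module):
    def __init__(self, feat_dim=1024, alpha=0.25):
        super().__init__()
        d = int(feat_dim * alpha)             # alpha controls Q/K/V and output width
        self.to_q = nn.Linear(feat_dim, d)
        self.q_to_kv = nn.Linear(d, 2 * d)    # K and V are computed from Q, not the input

    def forward(self, O):
        """O: (B, K_objects, C) object features -> (B, K_objects, C + d)."""
        q = self.to_q(O)
        k, v = self.q_to_kv(q).chunk(2, dim=-1)
        # cosine similarity between every pair of objects
        sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2)
        rel = sim.softmax(dim=-1) @ v         # relation expression of each object
        return torch.cat([O, rel], dim=-1)    # splice relations onto original features

out = LightweightObjectAttention()(torch.randn(1, 150, 1024))
print(out.shape)  # torch.Size([1, 150, 1280])
```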
In order to aggregate the object features and relations into a hidden vector expression with as few parameters and as little computation as possible (excessive parameters and computation reduce processing efficiency and make key information hard to extract), the invention provides a global association aggregation module, shown in fig. 4. The module uses strip-shaped depthwise convolution; unlike the block convolution of conventional depthwise convolution, strip depthwise convolution can model object features that have no positional relationship. The module first aggregates the information of all objects in each channel with a 150×1 strip depthwise convolution applied to each channel of the feature.
However, at this point no information flows between channels, so a 1×1 pointwise convolution is used to aggregate information across all channels and generate a high-level semantic feature expression vector describing the scene. Finally, this scene expression vector, i.e. the aggregated object feature vector expression, is passed to a standard fully connected layer, i.e. the classification module, to obtain the final scene recognition result for the input image or picture.
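A minimal sketch of this module, assuming the object features arrive as a (batch, objects, channels) tensor matching the attention sketch above; the channel width and the scene count here are placeholder values.

```python
# Sketch: 150x1 strip depthwise conv -> 1x1 pointwise conv -> fully connected classifier.
import torch
import torch.nn as nn

class GlobalAssociationAggregation(nn.Module):
    def __init__(self, num_objects=150, channels=1280, num_scenes=67):
        super().__init__()
        # one (num_objects x 1) kernel per channel: aggregates all objects in a channel
        # without assuming any spatial relationship between them
        self.strip_dw = nn.Conv2d(channels, channels,
                                  kernel_size=(num_objects, 1), groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)  # mix channels
        self.classifier = nn.Linear(channels, num_scenes)              # scene logits

    def forward(self, O):
        x = O.transpose(1, 2).unsqueeze(-1)   # (B, C, K, 1): objects laid out as a strip
        x = self.strip_dw(x)                  # (B, C, 1, 1): per-channel aggregation
        x = self.pointwise(x)                 # (B, C, 1, 1): cross-channel aggregation
        return self.classifier(x.flatten(1))  # (B, num_scenes)

logits = GlobalAssociationAggregation()(torch.randn(1, 150, 1280))
print(logits.shape)  # torch.Size([1, 67])
```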
In the method and system for indoor scene recognition based on object-associated attention, the object feature aggregation module adopts a brand-new aggregation scheme: a spatial-position feature map F and an object-membership probability map S are computed by a scene segmentation algorithm, and the feature vectors at all spatial positions of each object are summed, weighted by the object-membership probabilities at the corresponding positions, to obtain the feature vector expression O of the object. Finally, the feature vectors of all the objects are spliced to obtain the object feature expression of the image.
Secondly, the invention provides an object-associated attention module with a brand-new lightweight network structure. Compared with a conventional attention module, the lightweight object-associated attention module learns the object relationships with less computation, and the number of output feature channels can be controlled freely.
The method and system for indoor scene recognition based on object-associated attention further provide the global association aggregation module, which adopts strip depthwise convolution. Unlike the block convolution of conventional depthwise convolution, strip depthwise convolution aggregates the features and relations of all objects; even when its input carries no spatial position information, it can aggregate the input into the final feature vector expression of all objects.
By adopting the object feature aggregation and lightweight associated-attention modules described above, the method and system for indoor scene recognition based on object-associated attention achieve computational efficiency and accuracy that meet practical requirements and simplify the recognition and judgment of indoor scenes from an input image.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (10)

1. An indoor scene recognition implementation method based on object associated attention comprises the following steps:
A. extracting a semantic feature vector for each spatial position in an input image through a backbone network;
B. assembling the semantic feature vectors of all spatial positions into a feature map according to their spatial positions, and transmitting the feature map to a segmentation module to calculate the probability that each spatial position in the input image belongs to different objects;
C. calculating a feature vector for each object through an object feature aggregation module, which multiplies the feature vectors at all spatial positions of each object by the probability that those positions belong to the object and performs a weighted average, thereby obtaining the feature vector expression of each object;
D. splicing the feature vectors of all the objects to form the object feature expression of the input image.
2. The method of claim 1, wherein the backbone network and the object feature aggregation module are configured to compute the feature expressions of different objects based on the hidden feature vectors at different spatial positions.
3. The method for realizing object-related attention-based indoor scene recognition according to claim 2, further comprising, after the step D:
E. inputting the object feature expression into a lightweight object-associated attention module, wherein the lightweight object-associated attention module is implemented as a neural network and is used for calculating the relations between the objects.
4. The method for realizing indoor scene recognition based on object-related attention according to claim 3, wherein the step E further comprises:
E1. calculating the relation feature vector expression between each object and all other objects based on the neural network and cosine similarity, and splicing the relation feature vector expression into the feature vector expression of the object.
5. The method for realizing indoor scene recognition based on object-related attention according to claim 4, wherein the step E is further followed by:
F. inputting the feature vector expressions and the relation feature vector expressions of the objects into a global association aggregation module, so as to aggregate the relations among all the objects and form a common feature expression vector of all the objects.
6. The method for realizing object-related attention-based indoor scene recognition according to claim 5, wherein the step F is further followed by:
G. inputting the common feature expression vector into a classification module formed by a neural network fully connected layer to identify the scene to which the input image belongs.
7. An indoor scene recognition implementation system based on object-associated attention, comprising:
a backbone network for extracting semantic feature vectors of each spatial position from the input image;
a segmentation module for assembling the semantic feature vectors of all the spatial positions into a feature map according to their spatial positions, and for calculating the probability that each spatial position in the input image belongs to different objects;
an object feature aggregation module, configured to calculate a feature vector for each object by multiplying the feature vectors at all spatial positions of each object by the probability that those positions belong to the object and performing a weighted average, thereby obtaining the feature vector expression of each object;
and the object feature aggregation module is also used for splicing the feature vectors of all the objects to form the object feature expression of the input image.
8. The system of claim 7, further comprising: a lightweight object-associated attention module, which calculates the relation feature vector expression between each object and all other objects based on a neural network and cosine similarity, and splices the relation feature vector expression into the feature vector expression of the object.
9. The system of claim 8, further comprising: a global association aggregation module, which takes the feature vector expressions and relation feature vector expressions of the objects as input, and aggregates the relations among all the objects to form a common feature expression vector of all the objects.
10. The system of claim 9, further comprising: a classification module, which takes the common feature expression vector as input and identifies the scene to which the input image belongs.
CN202011344887.8A 2020-11-26 2020-11-26 Method and system for realizing indoor scene recognition based on object associated attention Active CN112487927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011344887.8A CN112487927B (en) 2020-11-26 2020-11-26 Method and system for realizing indoor scene recognition based on object associated attention


Publications (2)

Publication Number Publication Date
CN112487927A (en) 2021-03-12
CN112487927B (en) 2024-02-13

Family

ID=74934952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011344887.8A Active CN112487927B (en) 2020-11-26 2020-11-26 Method and system for realizing indoor scene recognition based on object associated attention

Country Status (1)

Country Link
CN (1) CN112487927B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084128A (en) * 2019-03-29 2019-08-02 安徽艾睿思智能科技有限公司 Scene chart generation method based on semantic space constraint and attention mechanism
CN110245665A (en) * 2019-05-13 2019-09-17 天津大学 Image, semantic dividing method based on attention mechanism
CN111932553A (en) * 2020-07-27 2020-11-13 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470048A (en) * 2021-07-06 2021-10-01 北京深睿博联科技有限责任公司 Scene segmentation method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112487927B (en) 2024-02-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant