CN112487927B - Method and system for realizing indoor scene recognition based on object associated attention


Info

Publication number
CN112487927B
Authority
CN
China
Prior art keywords
feature
feature vector
expression
objects
input image
Prior art date
Legal status
Active
Application number
CN202011344887.8A
Other languages
Chinese (zh)
Other versions
CN112487927A (en)
Inventor
Miao Bo (苗博)
Zhou Liguang (周立广)
Lin Tianlin (林天麟)
Xu Yangsheng (徐扬生)
Current Assignee
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Original Assignee
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Priority date
Filing date
Publication date
Application filed by Chinese University of Hong Kong Shenzhen, Shenzhen Institute of Artificial Intelligence and Robotics filed Critical Chinese University of Hong Kong Shenzhen
Priority to CN202011344887.8A
Publication of CN112487927A
Application granted
Publication of CN112487927B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/35 - Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V20/36 - Indoor scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The invention discloses a method and system for realizing indoor scene recognition based on object-associated attention, wherein the method comprises the following steps: extracting a semantic feature vector for each spatial position in an input image through a backbone network; forming the semantic feature vectors of all spatial positions into a feature map according to their spatial positions and transmitting it to a segmentation module to calculate the probability that each spatial position in the input image belongs to different objects; and calculating the feature vector of each object through an object feature aggregation module, which multiplies the feature vectors of all spatial positions of each object by the probability that those positions belong to the object and takes a weighted average, thereby obtaining the feature vector expression of each object. Because different scenes contain different objects, the object feature aggregation module detects all object features in the input image, so that the information contained in the image is better expressed.

Description

Method and system for realizing indoor scene recognition based on object associated attention
Technical Field
The invention relates to an intelligent recognition method and software system, and in particular to an improved object-associated attention feature recognition method and system for indoor scene recognition.
Background
The ability to perceive environmental information is indispensable for a robot: accurate perception of the surrounding scene helps the robot make correct judgments and take correct actions.
With advances in technology and computing power, many scene recognition algorithms based on deep learning have been proposed. Herranz et al. found that feature extraction must adapt to different image scales, and recognized scenes by multi-scale fusion of features obtained from models trained on different datasets; see Scene Recognition with CNNs: Objects, Scales and Dataset Bias, CVPR 2016, pages 571-579 (CVPR is the IEEE Conference on Computer Vision and Pattern Recognition).
However, the improvement attainable from global picture information alone is limited: such methods are not only difficult to interpret semantically, but are also easily disturbed by common objects that appear across scenes.
Thus, some researchers have attempted to combine context information with local object associations to achieve scene recognition. López-Cifuentes et al. obtain context information through semantic segmentation to help eliminate the ambiguity caused by objects common to different scenes; see Semantic-Aware Scene Recognition, Pattern Recognition, vol. 102, article 107256.
Wang et al. train PatchNet under weak supervision, use it to guide local feature extraction, and finally aggregate the local features according to semantic probabilities to achieve scene recognition; see Weakly Supervised PatchNets: Describing and Aggregating Local Patches for Scene Recognition, IEEE Transactions on Image Processing, vol. 26, pages 2028-2041.
At the same time, many studies enhance the scene understanding capability of models by combining multi-modal features. However, most indoor scene recognition methods in the prior art combine hand-crafted features with global features; they require a large amount of computation and cannot effectively learn the relationships between objects, and therefore cannot recognize scenes accurately.
Accordingly, the prior art is still in need of improvement and development.
Disclosure of Invention
The invention aims to provide a method and system for realizing indoor scene recognition based on object-associated attention, offering rapid and accurate object-association recognition in response to the inaccurate recognition and redundant network structures of the prior art.
The technical scheme of the invention is as follows:
an indoor scene recognition implementation method based on object associated attention comprises the following steps:
A. extracting semantic feature vectors of each spatial position in an input image through a backbone network;
B. the semantic feature vectors of all the spatial positions form a feature map according to the spatial positions and are transmitted to a segmentation module to calculate the probability that each spatial position in the input image belongs to different objects;
C. calculating the feature vector of each object through an object feature aggregation module, multiplying the feature vector of all the spatial positions of each object by the probability that the spatial positions belong to the object, and carrying out weighted average so as to obtain the feature vector expression of each object;
D. splicing the feature vectors of all the objects to form the object feature expression of the input image.
According to the method for realizing the indoor scene recognition based on the object-associated attention, the backbone network and the object feature aggregation module can calculate feature expressions of different objects based on feature hidden vectors of different spatial positions.
The method for realizing the indoor scene recognition based on the object associated attention, wherein the method further comprises the following steps after the step D:
E. the object feature expressions are input to a lightweight object-associated attention module, which is implemented using neural networks to calculate relationships between objects.
The method for realizing the indoor scene recognition based on the object associated attention, wherein the step E specifically further comprises the following steps:
E1. calculating, by the lightweight object-associated attention module, the relation feature vector expression of each object with all other objects based on a neural network and cosine similarity, and splicing the relation feature vector expression into the feature vector expression of the object.
The method for realizing the indoor scene recognition based on the object associated attention, wherein the step E further comprises the following steps:
F. the feature vector expression of the object and the relation feature vector expression are input to a global association aggregation module to aggregate the relation among all objects and form a common feature expression vector of all objects.
The method for realizing the indoor scene recognition based on the object associated attention, wherein the step F further comprises the following steps:
G. inputting the common feature expression vector into a classification and identification module of the neural network fully connected layer, thereby identifying the scene to which the input image belongs.
An indoor scene recognition implementation system based on object-associated attention, comprising:
a backbone network for extracting semantic feature vectors of each spatial location for the input image;
the segmentation module is used for forming a feature map of semantic feature vectors of all the spatial positions according to the spatial positions of the semantic feature vectors, and calculating the probability that each spatial position in the input image belongs to different objects;
the object feature aggregation module is used for calculating the feature vector of each object, multiplying the feature vector of all the spatial positions of each object by the probability that the spatial positions belong to the object and carrying out weighted average so as to obtain the feature vector expression of each object;
the object feature aggregation module is also used for splicing feature vectors of all objects to form object feature expression of the input image.
The system for realizing indoor scene recognition based on object-associated attention further comprises: a lightweight object-associated attention module for calculating the relation feature vector expression of each object with all other objects based on a neural network and cosine similarity, and splicing the relation feature vector expression into the feature vector expression of the object.
The system for realizing indoor scene recognition based on object-associated attention further comprises: a global association aggregation module for taking the feature vector expression of the object and the relation feature vector expression as input, and aggregating the relations among all objects to form a common feature expression vector of all objects.
The system for realizing indoor scene recognition based on object-associated attention further comprises: a classification and identification module for taking the common feature expression vector as input and identifying the scene to which the input image belongs.
According to the method and system for realizing indoor scene recognition based on object-associated attention provided by the invention, because different scenes contain different objects, all object features in the input image are detected using the object feature aggregation module, so that the information contained in the image is better expressed. Meanwhile, because the distributions of co-occurring objects differ across scenes, the lightweight object-associated attention module and the global association aggregation module are used to learn and aggregate the relations between objects, finally generating a common feature expression vector that allows the subsequent classification module to distinguish different scenes. The method is efficient and accurate, and is suitable for recognizing and judging different indoor scenes.
Drawings
Fig. 1 is a block diagram and a flowchart illustrating a preferred embodiment of a method and a system for implementing indoor scene recognition based on object-related attention according to the present invention.
Fig. 2 is a schematic diagram illustrating an object feature aggregation module processing example of a preferred embodiment of the method and system for implementing indoor scene recognition based on object-related attention according to the present invention.
Fig. 3 is a schematic diagram of an example of a light-weight object-associated attention module according to a preferred embodiment of the method and system for implementing object-associated attention-based indoor scene recognition.
Fig. 4 is a schematic diagram of a global association aggregation module according to a preferred embodiment of the method and system for implementing indoor scene recognition based on object associated attention.
Detailed Description
The preferred embodiments of the present invention are described in detail below.
According to the method and system for realizing indoor scene recognition based on object-associated attention, analysis shows that the distributions of co-occurring objects differ between scenes, so indoor scene recognition performance can be improved by learning object relationships. The invention therefore provides an object feature aggregation module for detecting and extracting the features of all objects in the picture, learns the relations among objects through the proposed lightweight object-associated attention module, and finally aggregates the object features and object relations through the global association aggregation module, realizing indoor scene recognition through a fully connected layer. The invention realizes scene recognition from a brand-new angle and is more effective than prior-art methods.
In the preferred embodiment of the method and system for indoor scene recognition based on object-associated attention, as shown in fig. 1, the input image may be a still image acquired by a camera or a single frame captured from a video. High-level semantic feature vectors for each position of the input image are extracted through a backbone network capable of extracting feature hidden vectors of different spatial positions; the semantic feature vectors of all spatial positions are then formed into a feature map and transmitted to the segmentation module to calculate the probability that each spatial position of the input image belongs to different objects.
Based on the feature map calculated by the backbone network and the object attribution probability map calculated by the segmentation module, the newly proposed object feature aggregation module is then used to calculate the feature vector of each object: the feature vectors of all spatial positions of each object are multiplied by the probability that those positions belong to the object and then weighted-averaged, obtaining the feature vector expression of each object. Finally, the feature vectors of all objects are spliced to form the object feature expression of the picture.
The object feature expression is then input into the lightweight object-associated attention module newly proposed by the invention for calculating the relations between objects; this module calculates the relation feature vector expression of each object with all other objects based on a neural network and cosine similarity, and splices it into the feature vector expression of the object, thereby enriching the object features.
The invention further inputs the object feature vector expressions and the object relation feature vector expressions into a newly proposed global association aggregation module for aggregating the relations among all objects, thereby forming a common feature expression vector of all objects in the input image. Finally, this feature expression vector is input into a classification recognition module formed by a fully connected neural network layer to recognize which scene the picture belongs to.
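By way of illustration, the forward pass just described could be composed as in the following minimal PyTorch sketch; the class and function names, the softmax over the segmentation logits, and the scene count are assumptions made for this example rather than part of the patent disclosure (the three helper modules are sketched after the corresponding sections below).

```python
import torch.nn as nn

class SceneRecognizer(nn.Module):
    # Hypothetical composition of the five stages; 150 objects and 1024
    # channels follow the example dimensions given in the description.
    def __init__(self, backbone, seg_head, num_objects=150, channels=1024,
                 num_scenes=10):  # num_scenes is an arbitrary placeholder
        super().__init__()
        self.backbone = backbone      # feature map F: (B, C, H, W)
        self.seg_head = seg_head      # object attribution logits: (B, N, H, W)
        self.attention = LightweightObjectAttention(channels)
        self.aggregator = GlobalAssociationAggregation(num_objects, channels)
        self.classifier = nn.Linear(channels, num_scenes)

    def forward(self, image):
        F = self.backbone(image)             # semantic feature per position
        S = self.seg_head(F).softmax(dim=1)  # probability per object, per position
        O = aggregate_object_features(F, S)  # (B, N, C) object feature expressions
        R = self.attention(O)                # splice in object-relation features
        z = self.aggregator(R)               # common feature expression vector
        return self.classifier(z)            # scene logits (fully connected layer)
```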
Specifically, after the camera acquires an input image of an indoor scene, object feature analysis is performed on all spatial positions of the input image, so that the scene is judged from all object features contained in the image. The judging process does not depend on local object features alone but considers all object relationships in the input image simultaneously, so indoor scenes such as a kitchen, a bedroom, a living room, or a restaurant can be judged more accurately and effectively; by identifying the relationship features among objects, interference from object features that commonly appear across different scenes is prevented, making scene recognition more accurate.
Fig. 2 shows a preferred implementation of the object feature aggregation module in the method and system for realizing indoor scene recognition based on object-associated attention according to the present invention. In order to effectively extract the object features in the input image, the invention provides the implementation scheme of the object feature aggregation module in fig. 2. First, a spatial position feature map F and an object attribution probability map S are calculated for the input image by a scene segmentation backbone network; then, for each object, the feature vectors of all spatial positions are weighted and summed with the object attribution probabilities of the corresponding positions to obtain the feature vector expression O of the object.
Finally, the feature vector expressions of all objects are spliced to obtain the object feature expression of the input image, wherein objects that are not present are represented by all-zero vectors and objects that are present each have a distinct feature vector; the final feature dimension is 1024x150x1, as shown in the example in fig. 2.
The calculation of each object feature vector in the object feature aggregation module is given by the formula below, wherein O_j represents the feature vector of object j, B_ij indicates whether the ith pixel position belongs to object j with the highest probability, S_ij represents the probability of the ith pixel position belonging to object j, and F_i represents the feature vector of the ith pixel position. The feature vector expression of each object is determined by the following weighted average:

O_j = ( Σ_i B_ij · S_ij · F_i ) / ( Σ_i B_ij · S_ij )
In this way, an object feature vector expression is calculated for each segmented region of the input image, enabling different objects to be distinguished; however, it is difficult to express the coexistence relations of the objects through the object features alone.
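Before turning to relation modelling, the aggregation step just described can be sketched as follows; this is a minimal interpretation of the formula above, assuming the tensors are laid out as in the 1024x150x1 example, and is not the authoritative implementation.

```python
import torch

def aggregate_object_features(F, S):
    """Weighted-average object features per the formula above (a sketch).

    F: (B, C, H, W) spatial position feature map from the backbone.
    S: (B, N, H, W) probability of each position belonging to each of N objects.
    Returns O: (B, N, C), one feature vector expression per object; objects
    that win no pixel come out as all-zero vectors, as in the description.
    """
    F = F.flatten(2)  # (B, C, HW): feature vector F_i for each position i
    S = S.flatten(2)  # (B, N, HW): probability S_ij
    # B_ij: 1 where object j has the highest probability at position i
    B_mask = torch.zeros_like(S).scatter_(1, S.argmax(dim=1, keepdim=True), 1.0)
    w = B_mask * S                                     # B_ij * S_ij
    num = torch.einsum('bnp,bcp->bnc', w, F)           # sum_i B_ij S_ij F_i
    den = w.sum(dim=2, keepdim=True).clamp_min(1e-6)   # sum_i B_ij S_ij
    return num / den                                   # O_j
```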
Accordingly, the present invention further provides a lightweight object-related attention module for calculating coexistence relationships between objects, as shown in fig. 3.
In order to effectively transfer object features from scene segmentation to scene recognition and to learn the potential relationships between objects, the invention further provides a lightweight object-associated attention module. The module consists of a cascade of one or more lightweight object-associated attention blocks, as shown in fig. 3, implemented as a neural network. The projection Q refines object features with higher semantic information while reducing the data dimension. Compared with existing methods, computing K and V from Q not only reduces the amount of computation by 50%, but also allows the dimensions of Q, K, V and the output features to be controlled simultaneously by adjusting only the value of alpha. The relation expression of each object is obtained from K and V through matrix multiplication; finally, the object relations are spliced with the original object features and output to the next module, yielding a feature vector expression that combines the object relations with the object features.
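One way such a block might look in code is sketched below; the interpretation of alpha as a channel-reduction ratio, the softmax over the cosine-similarity scores, and the output projection back to the input width are assumptions for illustration (the patent specifies only that K and V are computed from Q and that relations are scored by cosine similarity and spliced onto the object features).

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class LightweightObjectAttention(nn.Module):
    """Sketch of one lightweight object-associated attention block."""
    def __init__(self, channels, alpha=0.5):
        super().__init__()
        d = int(channels * alpha)      # alpha controls the Q/K/V and output widths
        self.q = nn.Linear(channels, d)
        self.k = nn.Linear(d, d)       # K computed from Q, not from the input,
        self.v = nn.Linear(d, d)       # which is where the computation saving lies
        self.out = nn.Linear(channels + d, channels)  # assumed projection back to C

    def forward(self, O):              # O: (B, N, C) object feature expressions
        Q = self.q(O)
        K, V = self.k(Q), self.v(Q)
        # pairwise cosine similarity between objects via L2-normalized dot products
        attn = Fn.normalize(Q, dim=-1) @ Fn.normalize(K, dim=-1).transpose(1, 2)
        rel = attn.softmax(dim=-1) @ V  # relation feature expression per object
        # splice relations onto the original object features, as described
        return self.out(torch.cat([O, rel], dim=-1))
```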
In order to aggregate the object features and relations into a hidden vector expression with as few parameters and as little computation as possible (overly complex parameters and computation would make processing inefficient and critical information hard to extract), the invention proposes a global association aggregation module, as shown in fig. 4, in which a strip depthwise convolution is used; unlike the block convolution of a conventional depthwise convolution, it models object features that carry no positional relationship. The module first aggregates the information of all objects within each channel by applying a 150x1 strip depthwise convolution to each channel of the features.
At this point, however, information still does not flow between channels, so a 1x1 point convolution is used to aggregate the information across all channels, generating a high-level semantic feature expression vector that expresses the scene information. Finally, this final scene expression vector is passed to an ordinary fully connected layer, namely the classification recognition module, to obtain the final scene recognition result for the input image.
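The two convolutions just described might be realized as follows; treating the spliced object features as a (channels, objects, 1) "image" and inserting a ReLU between the strip and point convolutions are assumptions consistent with, but not dictated by, the description.

```python
import torch
import torch.nn as nn

class GlobalAssociationAggregation(nn.Module):
    """Sketch: a 150x1 strip depthwise convolution aggregates all objects
    within each channel, then a 1x1 point convolution mixes the channels."""
    def __init__(self, num_objects=150, channels=1024):
        super().__init__()
        # one strip filter per channel (depthwise: groups=channels)
        self.strip = nn.Conv2d(channels, channels,
                               kernel_size=(num_objects, 1), groups=channels)
        # 1x1 point convolution lets information flow between channels
        self.point = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, N, C) spliced features
        x = x.transpose(1, 2).unsqueeze(-1)     # -> (B, C, N, 1)
        x = self.point(torch.relu(self.strip(x)))  # objects, then channels
        return x.flatten(1)                     # (B, C) common feature vector
```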
In the method and system for realizing indoor scene recognition based on object-associated attention, a brand new object feature aggregation method is adopted in the object feature aggregation module: after the spatial position feature map F and the object attribution probability map S are calculated by the scene segmentation algorithm, the feature vector expression O of each object is obtained by weighting and summing the feature vectors of all spatial positions of the object with the object attribution probabilities of the corresponding positions. Finally, the feature vectors of all objects are spliced to obtain the object feature expression of the image.
Secondly, the invention further provides an object-associated attention module; compared with a traditional attention module, the lightweight object-associated attention module is lighter when learning object relationships, and the number of output feature channels can be controlled at will.
In the method and system for realizing indoor scene recognition based on object-associated attention, the invention further provides a global association aggregation module that adopts a strip depthwise convolution in place of the block convolution of a conventional depthwise convolution; the strip depthwise convolution aggregates the features and relations of all objects, so that even though its input carries no spatial position information, it can aggregate them into the final feature vector expression of all objects.
By adopting the object feature aggregation and lightweight associated attention processing methods and modules, the method and system achieve computational efficiency and accuracy that meet practical requirements, facilitating the recognition and judgment of indoor scenes in input images.
It will be understood that modifications and variations will be apparent to those skilled in the art from the foregoing description, and it is intended that all such modifications and variations be included within the scope of the following claims.

Claims (3)

1. An indoor scene recognition implementation method based on object associated attention comprises the following steps:
A. extracting semantic feature vectors of each spatial position in an input image through a backbone network;
B. the semantic feature vectors of all the spatial positions form a feature map according to the spatial positions and are transmitted to a segmentation module to calculate the probability that each spatial position in the input image belongs to different objects;
C. calculating the feature vector expression of each object through an object feature aggregation module, multiplying the semantic feature vector of all the spatial positions of each object by the probability that the spatial positions belong to the object, and carrying out weighted average so as to obtain the feature vector expression of each object;
C1. the feature vector expression of each object in the object feature aggregation module is calculated as

O_j = ( Σ_i B_ij · S_ij · F_i ) / ( Σ_i B_ij · S_ij )

wherein O_j represents the feature vector expression of object j, B_ij indicates whether the ith pixel position belongs to object j with the highest probability, S_ij represents the probability of the ith pixel position belonging to object j, and F_i represents the semantic feature vector of the ith pixel position, so that the object feature vector expression of each segmented region of the input image is calculated and the judgment of different objects is realized;
D. splicing the feature vector expressions of all the objects to form the object feature expression of the input image;
E. inputting the object feature expression into a light-weight object associated attention module, wherein the light-weight object associated attention module is realized by a neural network and is used for calculating the relation between objects;
E1. the lightweight object-associated attention module calculates the relation feature vector expression of each object with all other objects based on a neural network and cosine similarity, and splices the relation feature vector expression into the feature vector expression of the object;
F. inputting the feature vector expression of the object and the relation feature vector expression into a global association aggregation module to aggregate the relation among all the objects and form a common feature expression vector of all the objects;
G. inputting the common feature expression vector into a classification and identification module of the neural network fully connected layer, thereby identifying the scene to which the input image belongs.
2. The method according to claim 1, wherein the backbone network and the object feature aggregation module calculate the feature vector expressions of different objects based on the feature hidden vectors of different spatial positions.
3. An indoor scene recognition implementation system based on object-associated attention, comprising:
a backbone network for extracting semantic feature vectors of each spatial location for the input image;
the segmentation module is used for forming a feature map of semantic feature vectors of all the spatial positions according to the spatial positions of the semantic feature vectors, and calculating the probability of each spatial position in the input image corresponding to different objects;
the object feature aggregation module is used for calculating the feature vector expression of each object, multiplying the semantic feature vectors of all the spatial positions of each object by the probability that the spatial positions are the object and carrying out weighted average so as to obtain the feature vector expression of each object;
the calculation method of each object feature vector expression in the object feature aggregation module specifically comprises the following steps:
wherein Oj represents the feature vector expression of the object j, bij represents whether the ith pixel position belongs to the object j with the highest probability, sij represents the probability of the ith pixel position belonging to the object j, fi represents the semantic feature vector of the ith pixel position, so that the object feature vector expression of each segmentation area of an input image is calculated, and the judgment of different objects is realized;
the object feature aggregation module is also used for splicing the feature vector expressions of all objects to form the object feature expression of the input image;
the light-weight object associated attention module is used for calculating the relation feature vector expression of each object and all other objects based on the neural network and cosine similarity, and splicing the relation feature vector expression into the feature vector expression of the object;
the global association aggregation module takes the feature vector expression of the object and the relation feature vector expression as input, and aggregates the relation among all the objects to form a common feature expression vector of all the objects;
and the classification and identification module is used for taking the common feature expression vector as input and identifying the scene to which the input image belongs.
CN202011344887.8A 2020-11-26 2020-11-26 Method and system for realizing indoor scene recognition based on object associated attention Active CN112487927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011344887.8A CN112487927B (en) 2020-11-26 2020-11-26 Method and system for realizing indoor scene recognition based on object associated attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011344887.8A CN112487927B (en) 2020-11-26 2020-11-26 Method and system for realizing indoor scene recognition based on object associated attention

Publications (2)

Publication Number Publication Date
CN112487927A CN112487927A (en) 2021-03-12
CN112487927B 2024-02-13

Family

ID=74934952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011344887.8A Active CN112487927B (en) 2020-11-26 2020-11-26 Method and system for realizing indoor scene recognition based on object associated attention

Country Status (1)

Country Link
CN (1) CN112487927B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470048B (en) * 2021-07-06 2023-04-25 北京深睿博联科技有限责任公司 Scene segmentation method, device, equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084128A (en) * 2019-03-29 2019-08-02 安徽艾睿思智能科技有限公司 Scene chart generation method based on semantic space constraint and attention mechanism
CN110245665A (en) * 2019-05-13 2019-09-17 天津大学 Image, semantic dividing method based on attention mechanism
CN111932553A (en) * 2020-07-27 2020-11-13 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084128A (en) * 2019-03-29 2019-08-02 安徽艾睿思智能科技有限公司 Scene chart generation method based on semantic space constraint and attention mechanism
CN110245665A (en) * 2019-05-13 2019-09-17 天津大学 Image, semantic dividing method based on attention mechanism
CN111932553A (en) * 2020-07-27 2020-11-13 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Also Published As

Publication number Publication date
CN112487927A (en) 2021-03-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant