CN114241290A - Indoor scene understanding method, equipment, medium and robot for edge computing - Google Patents

Indoor scene understanding method, equipment, medium and robot for edge computing

Info

Publication number
CN114241290A
CN114241290A
Authority
CN
China
Prior art keywords
indoor scene
graph
objects
semantic
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111564642.0A
Other languages
Chinese (zh)
Inventor
虞玲华
姚明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
First Hospital of Jiaxing
Original Assignee
First Hospital of Jiaxing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by First Hospital of Jiaxing filed Critical First Hospital of Jiaxing
Priority to CN202111564642.0A priority Critical patent/CN114241290A/en
Publication of CN114241290A publication Critical patent/CN114241290A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an indoor scene understanding method, equipment, medium and robot for edge computing, and belongs to the field of artificial intelligence. On the basis of target detection, the invention integrates spatial relationship information and uses Bayesian inference to reduce the computation and memory requirements, so that simple indoor scene understanding can be achieved on edge computing devices. The invention expresses the visual relationships of common objects in indoor scenes as a knowledge graph and uses it as prior knowledge, applies Bayesian inference to obtain semantic understanding of the indoor scene, and can update the common-sense knowledge base with the Bayesian inference results, which matches the human cognition pattern. The invention can be applied to a service robot, giving the robot human-like scene understanding capability through edge computing.

Description

Indoor scene understanding method, equipment, medium and robot for edge computing
Technical Field
The invention relates to the field of artificial intelligence, and in particular to an indoor scene understanding method for edge computing.
Background
A service robot is an intelligent machine, developed in recent years, that provides services for human beings; its applications are wide-ranging and include maintenance, repair, transportation, cleaning, security, rescue and monitoring. Taking a hospital as an example, service robots that take over heavy and repetitive labor can greatly relieve the shortage of medical resources. An important problem that must be solved urgently before robots can be deployed in hospitals at scale is that the robots need human-like semantic understanding of their environment. With this capability, a robot can communicate naturally with humans and make judgments consistent with human interests.
Scene Understanding builds on the perception of environmental data and combines visual analysis and image recognition so that a computer vision system gains human-like visual perception: it can perceive, analyze and understand the surrounding scene and produce a semantic description of the scene that conforms to human habits and common sense. With scene understanding capability, the robot can answer questions such as: (1) What is around me? (2) What is happening? (3) What may happen next?
The basis of scene understanding is image Semantic Segmentation. When humans understand a scene, they can detect each entity at pixel-level granularity and mark precise boundaries. Image semantic segmentation mimics this human ability, understanding an image at the pixel level by grouping together the pixels that belong to the same object. There are various implementations of image semantic segmentation, such as segmentation based on classical features and Conditional Random Fields (CRF), and deep-learning-based image semantic segmentation such as FCN and U-Net. The former is limited by feature extraction and suffers from heavy workload, complex computation and low accuracy; deep-learning-based methods allow end-to-end training and can directly predict multiple classes of targets, and are currently the mainstream approach to image semantic segmentation.
The industry's latest research direction for scene understanding is image scene graph generation, which aims to make a computer automatically generate a semantic graph structure of an image, called a scene graph, as the semantic description of the image. Scene graph generation builds on the result of an Object Detection algorithm: objects in the image (labeled with bounding boxes) correspond to the nodes of the scene graph, and the visual relationships between objects correspond to the edges of the scene graph. Nodes and edges of the scene graph can be abstractly represented as {target-visual relationship-target} triples, such as {patient-sitting-wheelchair}, where the targets are obtained by the target detection algorithm and the scene graph generation algorithm must infer the visual relationships and generate the semantic graph. The visual relationships include spatial relationships and semantic relationships (actions, dependencies, comparisons, etc.). Compared with vector representations, the structured representation of the scene graph is more intuitive and can be regarded as a small knowledge graph, so it can be widely applied to knowledge management, reasoning, retrieval, recommendation and the like.
However, deep learning models are memory-intensive and computation-intensive, and the number of network layers and model parameters grows as the prediction accuracy of the model improves. The training and prediction of deep-learning-based image semantic segmentation models therefore usually depend on large amounts of memory and high-end GPUs, and such models are not suitable for deployment on edge computing devices with limited computing resources. The currently mainstream image scene graph generation algorithms generally comprise the following steps: (1) a target detection algorithm obtains the bounding boxes of the targets; (2) a visual relationship detection algorithm (based on a deep convolutional network) judges the relationships between the targets; (3) a Graph Neural Network (GNN) algorithm generates the scene graph. Each step involves a deep neural network, which is computationally expensive and likewise unsuitable for deployment on edge computing devices.
Therefore, how to realize indoor scene understanding on an edge computing device with limited computing resources is a technical problem that urgently needs to be solved.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an indoor scene understanding method, equipment, medium and robot for edge computing, which can run on edge computing devices with limited computing resources and which, on the basis of target detection and a knowledge graph, integrate the spatial information provided by the depth image to realize scene understanding of low-dynamic indoor environments.
In order to achieve the purpose of the invention, the technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides an indoor scene understanding method for edge computing, comprising:
s1, carrying out target detection on the RGB three-channel color image in the depth image aiming at the depth image corresponding to the indoor scene to be understood to obtain semantic information of each target object existing in the indoor scene and plane position information of each target object on a two-dimensional plane;
s2, based on the target objects and the plane position information thereof existing in the indoor scene, combining the depth values of the areas where the target objects are located provided by the depth map in the depth image, and reasoning to obtain a spatial relationship map among the target objects in the indoor scene; the space relation graph is a space relation among objects represented in a graph form, nodes in the graph represent target objects, and connection among the nodes represents the space relation among the objects and the probability of the space relation;
s3, based on semantic information of each target object in the indoor scene and a spatial relationship graph between the target objects, using an indoor scene common sense library as prior knowledge of Bayesian inference, inferring to obtain the semantic relationship graph between the target objects in the indoor scene, thereby realizing understanding of the indoor scene; the semantic relation graph refers to semantic relations among objects in a scene represented in a graph mode, nodes in the graph represent target objects, and connection among the nodes represents the semantic relations among the objects and the probability of the semantic relations.
Preferably, in S2, the inference of the spatial relationship graph is implemented by using a trained Bayesian classification model or a support vector machine as the spatial relationship model.
Preferably, in S3, the inference of the semantic relationship graph is implemented by using a trained Bayesian classification model as the semantic relationship model.
Preferably, the indoor scene common-sense library is a knowledge graph describing the interrelations among objects in indoor scenes; nodes in the knowledge graph represent objects, and the connections between nodes represent the visual relationships between objects and their probabilities; the visual relationships include both spatial and semantic relationships between objects.
As a preference of the first aspect, in S3, the indoor scene common-sense library is first queried based on the spatial relationship graph among the target objects in the indoor scene to obtain the knowledge graph corresponding to the indoor scene; then, with the queried knowledge graph as prior knowledge, the semantic relationships among the target objects in the spatial relationship graph are obtained through Bayesian inference from the semantic information of each target object in the indoor scene and the spatial relationship graph among the target objects, forming the semantic relationship graph.
Preferably, in S3, the inference result is further used to update the indoor scene common-sense library.
Preferably, the indoor scene is an indoor scene inside a hospital, including corridors, clinic rooms, operating rooms, nursing operation rooms, wards, warehouses, buffer rooms, classrooms, rest rooms and dressing rooms.
In a second aspect, the present invention provides an edge computing device for indoor scene understanding, comprising:
the target detection module is used for performing, on the depth image corresponding to the indoor scene to be understood, target detection on the RGB three-channel color image in the depth image to obtain semantic information of each target object present in the indoor scene and the planar position information of each target object on the two-dimensional plane;
the spatial relationship reasoning module is used for inferring a spatial relationship graph among the target objects in the indoor scene based on the target objects present in the indoor scene and their planar position information, combined with the depth values of the regions where the target objects are located, provided by the depth map in the depth image, wherein the spatial relationship graph represents the spatial relationships among objects in graph form, nodes in the graph represent target objects, and the connections between nodes represent the spatial relationships between objects and their probabilities;
the semantic relationship reasoning module is used for inferring a semantic relationship graph among the target objects in the indoor scene based on the semantic information of each target object in the indoor scene and the spatial relationship graph among the target objects, with an indoor scene common-sense library as the prior knowledge for Bayesian inference, thereby realizing understanding of the indoor scene, wherein the semantic relationship graph represents the semantic relationships among objects in the scene in graph form, nodes in the graph represent target objects, and the connections between nodes represent the semantic relationships between objects and their probabilities.
In a third aspect, the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the indoor scene understanding method for edge computing according to any one of the schemes of the first aspect.
In a fourth aspect, the invention provides a service robot, comprising a robot walking part, and a depth camera and an edge computing device mounted on the robot walking part;
the robot walking part is used for moving and walking in an indoor scene according to a planned route;
the depth camera is used for acquiring a depth image corresponding to the current indoor scene in real time in the walking process and sending the depth image to the edge computing equipment;
the edge computing device is configured to obtain an understanding result of the indoor scene from the received depth image according to the indoor scene understanding method of any one of the above schemes.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention obtains semantic understanding of the indoor scene based on a target detection model, a knowledge graph and Bayesian inference, which reduces the computation and memory requirements and makes simple indoor scene understanding achievable on edge computing devices.
(2) The invention expresses common-sense knowledge (the visual relationships of common objects in indoor scenes) as a knowledge graph and uses it as prior knowledge, applies Bayesian inference to obtain semantic understanding of the indoor scene, and updates the common-sense knowledge base with the Bayesian inference results; this matches the human cognition pattern and gives the robot human-like scene understanding.
Drawings
FIG. 1 is a block diagram of the flow of the indoor scene understanding method for edge computing;
fig. 2 is a schematic two-stage flow diagram of an indoor scene understanding method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a knowledge-graph in an embodiment of the invention;
FIG. 4 shows the object detection result on the RGB color image (left) and the depth map (right) of an example scene depth image in an embodiment of the present invention;
FIG. 5 is an exemplary scene knowledge graph obtained via Bayesian inference in an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. The technical characteristics in the embodiments of the present invention can be combined correspondingly without mutual conflict.
Edge computing mainly relies on the Internet of Things: computing, network and storage capabilities are migrated from the cloud to the network edge and data are processed close to where they are generated, so edge computing can be used to provide intelligent services in a service robot, improving processing efficiency, enabling fast response and protecting private data. However, the mainstream image scene graph generation algorithms in the prior art involve multiple deep neural networks, are computationally heavy, and are not suitable for deployment on edge computing devices.
Bayesian Inference infers uncertain knowledge from existing knowledge (the prior probability) combined with observed data, and can be expressed as:
P(Θ | X) = P(X | Θ) · P(Θ) / P(X)
where P(Θ) is the prior knowledge (prior probability), reflecting the knowledge about the parameter Θ to be estimated before the data X are observed; P(X | Θ) is a description of the observed data; and P(Θ | X) is the posterior probability, reflecting the new knowledge about the parameter Θ to be estimated given the observed data X. Bayesian inference can iteratively update the posterior probability with new data, and the posterior probability then serves as the prior probability of the next round of inference, which matches the human reasoning process.
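By way of illustration only, a minimal Python sketch of this update rule is given below; the three candidate hypotheses and the numerical values are assumptions made for the example and are not taken from the invention.

import numpy as np

def bayes_update(prior, likelihood):
    # posterior(theta) is proportional to P(X | theta) * P(theta), normalised over theta
    unnormalised = likelihood * prior
    return unnormalised / unnormalised.sum()

prior = np.array([1 / 3, 1 / 3, 1 / 3])        # P(theta): uniform prior over 3 hypotheses
likelihood = np.array([0.7, 0.2, 0.1])          # P(X | theta) for the observed data X

posterior = bayes_update(prior, likelihood)     # P(theta | X)
prior = posterior                               # the posterior becomes the prior of the next round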
In addition, a Knowledge Graph is a semantic network that exposes the relationships between entities, describing in graph form the entities and concepts that exist in the real world and the relationships between them. Similar to the scene graph structure, a knowledge graph also describes knowledge as {subject-predicate-object} triples. Data-driven methods typified by deep learning have the following problems: (1) lack of common sense; (2) lack of semantic understanding; (3) lack of interpretability; (4) reliance on large amounts of sample data. A knowledge graph can make up for these deficiencies and provide common sense for cognitive intelligence. Because the knowledge graph and the scene graph have similar structures, the common objects in indoor scenes and the interrelations between them can be modeled as a knowledge graph and used as the prior knowledge for Bayesian inference.
Service robots used in places such as hospitals usually operate in low-dynamic indoor environments, that is, the indoor scenes are relatively fixed and change little. Furthermore, the visual input of the robot is usually a depth image (RGB-D) carrying depth information, which, compared with conventional scene graph generation, makes it easier to obtain spatial relationship data. Therefore, the invention integrates spatial relationship information on the basis of target detection and uses Bayesian inference to reduce the computation and memory requirements, so as to achieve simple indoor scene understanding on edge computing devices. Specific implementations of the invention are described in detail below.
As a preferred implementation of the present invention, an indoor scene understanding method for edge computing is provided, whose basic flow is shown in FIG. 1 and which comprises the following steps:
s1, carrying out target detection on the RGB three-channel color image in the Depth image aiming at the Depth image (namely RGB-D containing RGB three-channel color image data and Depth Map) corresponding to the indoor scene to be understood, and obtaining semantic information of each target object existing in the indoor scene and plane position information of each target object on a two-dimensional plane.
In the present invention, a depth image corresponding to an indoor scene to be understood may be acquired by a depth camera.
In the invention, the target detection algorithm can in principle be any algorithm or network model capable of target detection; the detection result is a target object, which may also be called an object or a target.
As a preferred implementation, the target detection model can adopt YOLO + MobileNet or SSD + ShuffleNet; after training, it can be used for the target object detection in this step.
In addition, the semantic information of the target object in this step refers to the category of the target object, and the plane position information of the target object on the two-dimensional plane may be labeled with a bounding box.
S2, based on the target objects present in the indoor scene and their planar position information, combined with the depth values of the regions where the target objects are located, provided by the Depth Map in the depth image, inferring a spatial relationship graph (spatial graph) among the target objects in the indoor scene; the spatial relationship graph represents the spatial relationships among objects in Graph form, nodes in the graph represent target objects, and the connections between nodes represent the spatial relationships between objects and their probabilities.
In the invention, the connection between nodes is the edge between the nodes, and the two meanings are the same.
In the present invention, the spatial relationship graph can be constructed in the form of {target-spatial relationship-target} triples. The spatial relationship refers to positional relationships between objects such as front, back, left, right, up, down, containing and belonging to. In practical applications, such spatial relationships may be represented by logical symbols, such as "in front of", "behind" and "to the left of".
In the invention, the inference of the spatial relationship can adopt any feasible relational inference algorithm, and the essence of the inference algorithm is a classification model of the spatial relationship.
As a preferred implementation, the inference of the spatial relationship graph in step S2 can be implemented by using a trained Bayesian classification model or a support vector machine as the spatial relationship model. The training of the Bayesian classification model or the support vector machine belongs to the prior art and is not described in detail.
S3, based on the semantic information of each target object in the indoor scene and the spatial relationship graph among the target objects, and with the indoor scene common-sense library as the prior knowledge for Bayesian inference, inferring the semantic relationship graph among the target objects in the indoor scene, thereby realizing understanding of the indoor scene; the semantic relationship graph represents the semantic relationships among objects in the scene in Graph form, nodes in the graph represent target objects, and the connections between nodes represent the semantic relationships between objects and their probabilities.
In the present invention, the semantic relationship graph can be constructed in the form of {target-semantic relationship-target} triples. The semantic relationship refers to the semantic relationships between objects, including interaction (i.e. action) relationships between objects, and dependency and comparison relationships between objects. In practical applications, such semantic relationships may be represented by logical symbols such as "sit", "push", "walk" and "talk". A semantic relationship in the semantic relationship graph is in fact a {subject-predicate-object} triple. From the semantic relationship graph, the robot can understand the indoor scene and answer questions such as: (1) What is around? (2) What is happening? (3) What may happen next?
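For illustration, the sketch below stores a semantic relationship graph as {target-semantic relationship-target} triples with probabilities and answers the question "what is happening?" for one object; the object names, relations and probabilities are assumptions made for the example.

# Semantic relationship graph as (subject, predicate, object, probability) triples.
semantic_graph = [
    ("patient", "sitting on", "wheelchair", 0.9),
    ("nurse", "pushing", "wheelchair", 0.8),
    ("doctor", "talking to", "nurse", 0.6),
]

def what_is_happening(graph, subject):
    """List the outgoing relations of `subject`, i.e. answer 'what is X doing?'."""
    return [(pred, obj, prob) for subj, pred, obj, prob in graph if subj == subject]

print(what_is_happening(semantic_graph, "nurse"))   # [('pushing', 'wheelchair', 0.8)]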
In the invention, the indoor scene common-sense library is a knowledge graph describing the interrelations among the objects in indoor scenes; the knowledge graph is also a Graph structure in which nodes represent objects and the connections between nodes represent the visual relationships between objects and their probabilities, the visual relationships including both spatial and semantic relationships between objects.
In order to ensure that the service robot can recognize and understand indoor scenes, the indoor scene common-sense library should cover all the indoor scenes that the service robot may subsequently pass through.
Therefore, as a preferred implementation, when performing the Bayesian inference of S3, the indoor scene common-sense library is first queried based on the spatial relationship graph among the target objects in the indoor scene (denoted scene A) to obtain, from all indoor scenes, the knowledge graph corresponding to this indoor scene (i.e. scene A); then, with the queried knowledge graph as prior knowledge, the semantic relationships among the target objects in the spatial relationship graph are obtained through Bayesian inference from the semantic information of each target object in the indoor scene (i.e. scene A) and the spatial relationship graph among the target objects, thereby forming the semantic relationship graph.
As a preferred implementation, in the aforementioned S3, the inference result may also be constructed as a knowledge graph and used to update the indoor scene common-sense library.
Theoretically, the indoor scene understanding method can be applied to indoor scenes in any kind of place.
As a preferred implementation, the indoor scene understanding method of the present invention is recommended for indoor scenes inside a hospital, including corridors, clinic rooms, operating rooms, nursing operation rooms, wards, warehouses, buffer rooms, classrooms, rest rooms, dressing rooms and the like, because hospital indoor scenes are typical low-dynamic indoor environments (the indoor scenes are relatively fixed and change little) and there is a strong demand there for the deployment of service robots.
In another preferred embodiment of the present invention, an edge computing device for indoor scene understanding is also provided, corresponding to the indoor scene understanding method for edge computing described in S1-S3 above and comprising the following modules:
the target detection module is used for performing, on the depth image corresponding to the indoor scene to be understood, target detection on the RGB three-channel color image in the depth image to obtain semantic information of each target object present in the indoor scene and the planar position information of each target object on the two-dimensional plane;
the spatial relationship reasoning module is used for inferring a spatial relationship graph among the target objects in the indoor scene based on the target objects present in the indoor scene and their planar position information, combined with the depth values of the regions where the target objects are located, provided by the depth map in the depth image, wherein the spatial relationship graph represents the spatial relationships among objects in graph form, nodes in the graph represent target objects, and the connections between nodes represent the spatial relationships between objects and their probabilities;
the semantic relationship reasoning module is used for inferring a semantic relationship graph among the target objects in the indoor scene based on the semantic information of each target object in the indoor scene and the spatial relationship graph among the target objects, with an indoor scene common-sense library as the prior knowledge for Bayesian inference, thereby realizing understanding of the indoor scene, wherein the semantic relationship graph represents the semantic relationships among objects in the scene in graph form, nodes in the graph represent target objects, and the connections between nodes represent the semantic relationships between objects and their probabilities.
It should be noted that the functions of the target detection module, the spatial relationship reasoning module and the semantic relationship reasoning module in the edge computing device may be implemented as computer software program modules containing the program code that executes the corresponding method. Each module can also be realized by several sub-modules and is not limited to being a single complete module; that is, the modules can be split and combined as required. In addition, besides being implemented in software, some of the computer program modules mentioned above may also be implemented in hardware, for example as circuits realizing similar data processing functions, without limitation thereto.
In addition, in the above modules, specific implementation manners may also refer to the practice in the indoor scene understanding method described in the foregoing S1 to S3, and details thereof are not repeated.
In another preferred embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored; when executed by a processor, the computer program can perform the indoor scene understanding method for edge computing set forth in S1-S3 above.
The computer-readable medium may be included in a component of the service robot, may exist alone, may be a single storage medium, or may be a combination of a plurality of storage media.
It is understood that the storage medium may include a Random Access Memory (RAM) and a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
In another preferred embodiment of the present invention, a service robot is also provided, which comprises a robot walking part, and a depth camera and an edge computing device mounted on the robot walking part;
the robot walking part is used for moving and walking in an indoor scene according to a planned route;
the depth camera is used for acquiring a depth image corresponding to the current indoor scene in real time in the walking process and sending the depth image to the edge computing equipment;
the edge computing device is configured to obtain an understanding result of the indoor scene from the received depth image according to the indoor scene understanding method for edge computing described in S1 to S3 above, and to use the understanding result as a decision basis for providing services.
In the present invention, the robot walking part may be any component capable of driving the robot to walk on the ground, such as a roller, a crawler, etc., which are in the prior art.
As a preferred implementation, the edge computing device may be an Nvidia Jetson Nano or a Google Coral Edge TPU, which are well suited to high-performance image processing.
To further facilitate understanding, a specific implementation of the indoor scene understanding method described in the above-mentioned S1-S3 in one embodiment will be given below, which describes in detail the implementation of the training and reasoning phases of the models. However, the following embodiment is only one implementation and is not intended to limit the present invention.
Examples
In this embodiment, a mobile robot carrying a depth camera and an edge computing device is used to realize understanding of hospital indoor scenes. The implementation is divided into two stages, a training stage and an inference stage, whose overall flow is shown in FIG. 2; the two stages are described in detail below.
I. Training phase
The training stage produces a target detection model, a spatial relationship model, a semantic relationship model and an indoor scene common-sense library that can run on the edge computing device; the computation of the training stage can be completed on a workstation or in the cloud. The training stage comprises the following steps (I-1), (I-2), (I-3), (I-4) and (I-5):
(I-1) obtaining a depth image of an indoor environment
The robot carrying the depth camera can be controlled to move back and forth in the target scenes several times while recording depth images of the indoor scenes; the data should cover, as far as possible, all the indoor scenes that may appear. The control can be done by manually driving the robot carrying the depth camera back and forth, or by moving a handheld device with the depth camera back and forth at the height of the robot camera.
(I-2) training target detection model of RGB three-channel color image
The target bounding box obtained by the target detection model in the RGB image represents the two-dimensional position of the target in the image, and its semantic label represents the category of the target object. The framework of the target detection model can be YOLO + MobileNet or SSD + ShuffleNet.
In this embodiment, the step (I-2) may train the target detection model according to the following sub-steps:
(I-2-1) separating the RGB three-channel color image from all the depth images, respectively.
(I-2-2) labeling the target objects in all RGB three-channel color images with annotation software, including corridors, doors and windows, furniture, pedestrians and the like, to obtain a first image annotation dataset. The annotation software can be Boundingbox tools, LabelImg or Image Labeler.
(I-2-3) dividing the first image annotation dataset into a training set, a test set and a validation set, and then training YOLO + MobileNet or SSD + ShuffleNet to obtain the target detection model.
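By way of illustration only, the following sketch shows how such a detector could be trained; it uses the Ultralytics YOLO interface and a YOLOv8 nano model purely as a stand-in for the YOLO + MobileNet or SSD + ShuffleNet frameworks named above, and the dataset configuration file name is a hypothetical placeholder.

from ultralytics import YOLO

# "hospital_rgb.yaml" is a hypothetical config describing the train/test/validation
# splits of the first image annotation dataset in YOLO format.
model = YOLO("yolov8n.pt")                                    # lightweight stand-in backbone
model.train(data="hospital_rgb.yaml", epochs=100, imgsz=640)  # train the detector
metrics = model.val()                                         # evaluate on the validation split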
(I-3) training spatial relationship model
The spatial relationship model fuses the depth map information with the two-dimensional positions of the targets in the image and infers the spatial relationships of the objects. A spatial relationship graph (spatial graph) is constructed with the recognized objects as nodes and the spatial relationships between objects as edges. The framework of the spatial relationship model may be a Bayesian classification model or a Support Vector Machine (SVM).
In this embodiment, step (I-3) may train the spatial relationship model according to the following sub-steps:
(I-3-1) constructing a spatial relationship graph in the form of {target-spatial relationship-target} triples, with the targets as nodes, based on the first image annotation dataset obtained in step (I-2). Specifically, for each image, the two-dimensional position of each target is represented by its bounding box, the RGB-D depth map information is fused, and the spatial relationships and their probabilities are labeled as the edges of the spatial relationship graph. After the spatial relationship labeling is finished, a second image annotation dataset is formed.
(I-3-2) dividing the second image annotation dataset into a training set, a test set and a validation set, and then training a Bayesian classification model or a support vector machine to obtain the spatial relationship model.
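A minimal sketch of such a spatial relationship classifier is given below; the choice of pairwise geometric features (centre offsets dx, dy, relative depth dz and relative box size dw, dh) and the toy training samples are assumptions made for the example.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each row describes an ordered object pair (A, B): dx, dy, dz, dw, dh.
X = np.array([
    [-120.0,   5.0,  0.1,  10.0,  4.0],   # dx < 0: B lies to the left of A
    [ 115.0,  -8.0, -0.2,  -6.0,  2.0],   # dx > 0: B lies to the right of A
    [   4.0,   2.0,  1.5,   0.0,  1.0],   # dz > 0: B lies behind A
])
y = np.array(["left of", "right of", "behind"])

spatial_model = GaussianNB().fit(X, y)
# predict_proba() later supplies the probability attached to each edge
# of the spatial relationship graph.
print(spatial_model.predict_proba([[110.0, -3.0, -0.1, -5.0, 1.0]]))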
(I-4) training semantic relationship model
The indoor scene common-sense library is queried with the spatial relationship graph information to obtain the corresponding scene semantic relationship graph; the semantic relationship model, taking the spatial relationship graph as input and the knowledge graph as prior knowledge, infers the semantic relationships between objects so as to achieve scene understanding. The framework of the semantic relationship model can be a Bayesian classification model.
In this embodiment, the semantic relation model may be trained in step (I-4) according to the following substeps:
(I-4-1) based on the second image annotation dataset, for each image, continuing to label the semantic relationships and their probabilities on top of the labeled spatial relationship graph to form the corresponding scene semantic relationship graph; after the semantic relationship labeling is finished, a third image annotation dataset is obtained.
(I-4-2) dividing the third image annotation dataset into a training set, a test set and a validation set, and then training a Bayesian classification model to obtain the semantic relationship model.
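A comparable sketch for the semantic relationship classifier is given below, here over integer-encoded categorical inputs (subject class, object class, spatial relation); the encoding and the toy samples are assumptions made for the example.

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Columns: subject-class id, object-class id, spatial-relation id (illustrative encoding).
X = np.array([
    [0, 1, 2],   # patient, wheelchair, "on"
    [3, 1, 1],   # nurse,   wheelchair, "behind"
    [0, 4, 2],   # patient, bed,        "on"
])
y = np.array(["sitting on", "pushing", "lying on"])

semantic_model = CategoricalNB().fit(X, y)
probs = semantic_model.predict_proba([[3, 1, 1]])
print(dict(zip(semantic_model.classes_, probs[0].round(3))))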
(I-5) describing the interrelations among objects in indoor scenes with a knowledge graph as the indoor scene common-sense library
As shown in FIG. 3, the knowledge graph is a layer of abstraction over the environment knowledge, representing the relationships between objects in Graph form: nodes represent objects, and the connections between nodes represent the visual relationships between objects and their probabilities. The visual relationships between objects include spatial relationships, such as front, back, left, right, up and down, and also semantic relationships, such as the action of one object on another and the corresponding reaction. The semantic relationships among objects and their occurrence probabilities are labeled according to the target detection results on the RGB images and the spatial relationship graphs, and the indoor scene common-sense library is thus constructed.
In this embodiment, in step (I-5), the indoor scene common-sense library may be constructed according to the following sub-steps:
(I-5-1) the indoor scene common-sense library is a collection of multi-relation knowledge graphs of indoor scenes, which describe an indoor scene in the form of {subject-predicate-object} triples. For an RGB image representing an indoor scene, a multi-relation knowledge graph is constructed according to the following steps:
1) sequentially listing objects in the scene according to the spatial inclusion relationship, and taking the objects as nodes of the graph;
2) constructing connections between nodes to represent the spatial relationships between objects, expressed with logical symbols such as "in front of", "behind", "to the left of", "partially" and "fixed", and labeling the probability of each spatial relationship occurring;
3) constructing connections between nodes to represent the possible semantic relationships between objects, expressed with logical symbols such as "sitting", "pushing", "walking" and "talking", and labeling the probability of each semantic relationship occurring;
(I-5-2) repeating step (I-5-1) for each indoor scene to obtain a series of multi-relation knowledge graphs for the different indoor scenes, which together form the indoor scene common-sense library.
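A minimal sketch of one such multi-relation knowledge graph, built with the networkx library, is given below; the objects, relations and probabilities are assumptions made for the example and do not reproduce the library of the embodiment.

import networkx as nx

corridor = nx.MultiDiGraph(scene="corridor")
for subj, rel, obj, kind, prob in [
    ("patient", "sitting on", "wheelchair", "semantic", 0.9),
    ("nurse",   "behind",     "wheelchair", "spatial",  0.7),
    ("nurse",   "pushing",    "wheelchair", "semantic", 0.8),
]:
    corridor.add_edge(subj, obj, relation=rel, kind=kind, prob=prob)

# The common-sense library is a collection of such scene graphs, queried by scene.
common_sense_library = {"corridor": corridor}
for s, o, data in common_sense_library["corridor"].edges(data=True):
    print(s, data["relation"], o, data["prob"])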
II. Inference phase
In the inference stage, according to the depth image acquired by the depth camera, the semantic information and spatial relationships of the target objects are obtained after target detection and fusion of the depth map information, and the semantic understanding of the indoor scene is obtained by Bayesian inference with the indoor scene common-sense library as prior knowledge. All computations of the inference stage run on the edge computing device, which may be an Nvidia Jetson Nano or a Google Coral Edge TPU. The inference stage comprises the following steps (II-1), (II-2), (II-3), (II-4) and (II-5):
(II-1) obtaining depth images of indoor environments
The depth camera of the robot acquires a depth image of the indoor environment and transmits it to the edge computing device.
(II-2) target detection of RGB three-channel color image by target detection model
As shown in FIG. 4, the RGB three-channel color image is separated from the depth image, and the trained target detection model is used to perform target detection; if the classification probability of a prediction box is greater than the threshold, the prediction box is taken as the bounding box of a target and the predicted semantics as the category of the target object.
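As a small illustration of this filtering step, the sketch below keeps only detections whose classification score exceeds the threshold; the detection tuples and the 0.5 threshold are assumptions made for the example.

def filter_detections(detections, threshold=0.5):
    """detections: list of (bbox, class_name, score); keep boxes above the threshold."""
    return [(box, cls) for box, cls, score in detections if score > threshold]

targets = filter_detections([
    ((120, 80, 260, 400), "nurse", 0.91),
    ((300, 150, 420, 380), "wheelchair", 0.84),
    ((10, 10, 40, 60), "chair", 0.32),      # discarded: score below threshold
])
print(targets)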
(II-3) inferring spatial relationships
The RGB image and the depth map have a one-to-one pixel correspondence. Applying the target detection model trained in step (I-2), the two-dimensional position of each target (represented by its bounding box) is detected in the RGB image; the depth values of the corresponding regions in the RGB-D depth map are fused, and the trained spatial relationship model is applied to infer a spatial relationship graph in the {target-spatial relationship-target} triple form. The spatial relationship refers to positional relationships between objects such as front, back, left, right, up, down, containing and belonging to.
In this embodiment, the step (II-3) obtains the spatial relationship between the objects according to the following steps:
(II-3-1) detecting the target object from the RGB image by applying the target detection model in the training stage step (I-2) to obtain semantic information (namely category) of the target object and a target boundary box thereof.
(II-3-2) obtaining the coordinate values (x, y) of each object in the RGB image plane from the target object bounding box, i.e. {target object, (x, y)}, in a rectangular coordinate system whose origin is the center point of the RGB image.
(II-3-3) as shown in FIG. 4, obtaining the depth value, i.e. the z-coordinate, of the corresponding region in the depth map from the bounding box of the target object, and fusing it with the target detection result to obtain the spatial coordinates {target object, (x, y, z)} of each object in a three-dimensional rectangular coordinate system whose origin is the center of the RGB image, whose X and Y axes lie in the RGB image plane and whose Z axis is the normal of the RGB image plane.
(II-3-4) applying the trained spatial relationship model to the spatial coordinates of the objects to infer a spatial relationship graph in the {target-spatial relationship-target} triple form, where the spatial relationship refers to positional relationships between objects such as front, back, left, right, up, down, containing and belonging to.
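By way of illustration, the sketch below derives the spatial coordinates of an object from its bounding box and the aligned depth map and builds the pairwise feature passed to the trained spatial relationship model; the (dx, dy, dz, dw, dh) feature convention matches the training sketch above and is an assumption of the example.

import numpy as np

def object_coordinates(box, depth):
    """box = (x1, y1, x2, y2) in pixels; depth is the HxW depth map aligned with the RGB image.
    Returns (x, y, z) in the coordinate system whose origin is the image centre."""
    x1, y1, x2, y2 = box
    h, w = depth.shape
    x = (x1 + x2) / 2 - w / 2
    y = (y1 + y2) / 2 - h / 2
    z = float(np.median(depth[y1:y2, x1:x2]))   # robust depth of the box region
    return np.array([x, y, z])

def pair_feature(box_a, box_b, depth):
    pa, pb = object_coordinates(box_a, depth), object_coordinates(box_b, depth)
    dw = (box_b[2] - box_b[0]) - (box_a[2] - box_a[0])
    dh = (box_b[3] - box_b[1]) - (box_a[3] - box_a[1])
    return np.concatenate([pb - pa, [dw, dh]])  # dx, dy, dz, dw, dh

# relation_probs = spatial_model.predict_proba([pair_feature(box_a, box_b, depth)])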
(II-4) according to the target objects identified in step (II-2) and the spatial relationships between objects obtained in step (II-3), and using the indoor scene common-sense library constructed in training-stage step (I-5) as prior knowledge, the trained semantic relationship model is applied to infer the semantic relationships between the objects, expressed in the {target-semantic relationship-target} triple form, so as to achieve scene understanding.
In this embodiment, in step (II-4), the semantic relationship between the objects is obtained according to the following steps:
(II-4-1) querying the indoor scene common-sense library constructed from the knowledge graph in training-stage step (I-5), according to the spatial relationship graph obtained in step (II-3), to obtain the prior probability distribution P(Θ_i) over the candidate scene semantic graphs Θ_i.
(II-4-2) integrating the target detection result of step (II-2), the spatial relationship graph obtained in step (II-3) and the prior probability distribution P(Θ_i) of the candidate scene semantic graphs, and reasoning with the trained semantic relationship model to obtain the semantic relationships among the objects:

P(Θ_i | X = χ) = P(X = χ | Θ_i) · P(Θ_i) / P(X = χ)

where χ is a specific value of the observed data X. The candidate with the highest posterior probability, argmax_i P(Θ_i | X = χ), is taken as the identified scene.
(II-4-3) the visual relationships between objects are expressed as {target-visual relationship-target} triples, so as to achieve scene understanding.
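For illustration, the sketch below combines the prior read from the queried knowledge graph with the semantic relationship model's output to obtain the posterior over candidate relations for one object pair; the relation names and numerical values are assumptions made for the example.

import numpy as np

relations = np.array(["sitting on", "pushing", "lying on"])
prior = np.array([0.2, 0.7, 0.1])        # P(relation) from the queried knowledge graph
likelihood = np.array([0.3, 0.6, 0.1])   # e.g. semantic_model.predict_proba(...) for this pair

posterior = prior * likelihood
posterior /= posterior.sum()

best = relations[int(np.argmax(posterior))]   # relation written onto the semantic graph edge
print(best, dict(zip(relations, posterior.round(3))))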
As shown in FIG. 5, an example scene knowledge graph is obtained through Bayesian inference after the indoor scene common-sense library has been constructed with the knowledge graph and combined with the detected target objects and their spatial relationships. Based on this knowledge graph, questions such as the following can be answered: 1. What is in the corridor, and with what probability? 2. What is the nurse behind the wheelchair doing, and how likely is it? 3. What is the positional relationship between the doctor and the nurse?
(II-5) updating the indoor scene common-sense library with the knowledge graph obtained by inference.
Specific data sets and model verification results of the indoor scene understanding method described in the above two stages are shown below:
For the indoor scene understanding method described in the above two stages of this embodiment, 4000 RGB-D samples were collected from various hospital environments covering scenes such as corridors, clinic rooms, operating rooms, nursing operation rooms, wards, warehouses, buffer rooms, classrooms, rest rooms and dressing rooms. The data were manually annotated through the processes of the first, second and third image annotation datasets, finally forming an RGB-D dataset of the hospital indoor environment for the three models, used for training and testing the target detection module, the spatial relationship reasoning module and the semantic relationship reasoning module.
In order to verify the scene understanding performance of the invention, the test index Recall@X is adopted:

Recall@X = (1/Y) · Σ_{y=1..Y} ( P_y / G_y )

where Recall is the recall rate, X denotes that only the first X predictions (top X) are considered, and Y is the number of test samples. For the y-th RGB-D sample there are G_y ground-truth semantic relation data, of which the model successfully predicts P_y within its top-X predictions.
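A short sketch of how Recall@X can be computed over the test set is given below; the counts are illustrative values, not the results of the embodiment.

def recall_at_x(predicted_counts, ground_truth_counts):
    """predicted_counts[y]: relations recovered in the top-X predictions for sample y;
    ground_truth_counts[y]: ground-truth relations of sample y."""
    ratios = [p / g for p, g in zip(predicted_counts, ground_truth_counts)]
    return sum(ratios) / len(ratios)

print(recall_at_x([3, 4, 2], [5, 4, 4]))   # -> 0.7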
Meanwhile, in order to compare the performance of the indoor scene understanding method of the present invention, this embodiment also takes several pre-trained models of classical scene recognition algorithms, fine-tunes them on the RGB subset of the hospital indoor environment RGB-D dataset, and tests the R@50 index; the results are shown in Table 1.
TABLE 1. R@50 of several classical scene recognition algorithms (the table is provided as an image in the original publication; its values are not reproduced here)
It can be seen that, compared with other scene recognition algorithms of the prior art, the method is more suitable for running on edge computing devices, and its recall rate is obviously higher than that of the other scene recognition algorithms.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. An indoor scene understanding method for edge computing, comprising:
s1, carrying out target detection on the RGB three-channel color image in the depth image aiming at the depth image corresponding to the indoor scene to be understood to obtain semantic information of each target object existing in the indoor scene and plane position information of each target object on a two-dimensional plane;
s2, based on the target objects and the plane position information thereof existing in the indoor scene, combining the depth values of the areas where the target objects are located provided by the depth map in the depth image, and reasoning to obtain a spatial relationship map among the target objects in the indoor scene; the space relation graph is a space relation among objects represented in a graph form, nodes in the graph represent target objects, and connection among the nodes represents the space relation among the objects and the probability of the space relation;
s3, based on semantic information of each target object in the indoor scene and a spatial relationship graph between the target objects, using an indoor scene common sense library as prior knowledge of Bayesian inference, inferring to obtain the semantic relationship graph between the target objects in the indoor scene, thereby realizing understanding of the indoor scene; the semantic relation graph refers to semantic relations among objects in a scene represented in a graph mode, nodes in the graph represent target objects, and connection among the nodes represents the semantic relations among the objects and the probability of the semantic relations.
2. The indoor scene understanding method for edge computing according to claim 1, wherein in S2 the inference of the spatial relationship graph is implemented by using a trained Bayesian classification model or a support vector machine as the spatial relationship model.
3. The indoor scene understanding method for edge computing according to claim 1, wherein in S3 the inference of the semantic relationship graph is implemented by using a trained Bayesian classification model as the semantic relationship model.
4. The indoor scene understanding method for edge computing according to claim 1, wherein the indoor scene common-sense library is a knowledge graph describing the interrelations among objects in indoor scenes, nodes in the knowledge graph represent objects, and the connections between nodes represent the visual relationships between objects and their probabilities; the visual relationships include both spatial and semantic relationships between objects.
5. The indoor scene understanding method for edge computing according to claim 1, wherein in S3 the indoor scene common-sense library is first queried based on the spatial relationship graph among the target objects in the indoor scene to obtain the knowledge graph corresponding to the indoor scene; then, with the queried knowledge graph as prior knowledge, the semantic relationships among the target objects in the spatial relationship graph are obtained through Bayesian inference from the semantic information of each target object in the indoor scene and the spatial relationship graph among the target objects, so as to form the semantic relationship graph.
6. The indoor scene understanding method for edge computing according to claim 1, wherein in S3 the inference result is used to update the indoor scene common-sense library.
7. The indoor scene understanding method for edge computing according to claim 1, wherein the indoor scene is an indoor scene inside a hospital, including corridors, clinic rooms, operating rooms, nursing operation rooms, wards, warehouses, buffer rooms, classrooms, rest rooms and dressing rooms.
8. An edge computing device for indoor scene understanding, comprising:
the target detection module is used for performing, on the depth image corresponding to the indoor scene to be understood, target detection on the RGB three-channel color image in the depth image to obtain semantic information of each target object present in the indoor scene and the planar position information of each target object on the two-dimensional plane;
the spatial relationship reasoning module is used for inferring a spatial relationship graph among the target objects in the indoor scene based on the target objects present in the indoor scene and their planar position information, combined with the depth values of the regions where the target objects are located, provided by the depth map in the depth image, wherein the spatial relationship graph represents the spatial relationships among objects in graph form, nodes in the graph represent target objects, and the connections between nodes represent the spatial relationships between objects and their probabilities;
the semantic relationship reasoning module is used for inferring a semantic relationship graph among the target objects in the indoor scene based on the semantic information of each target object in the indoor scene and the spatial relationship graph among the target objects, with an indoor scene common-sense library as the prior knowledge for Bayesian inference, thereby realizing understanding of the indoor scene, wherein the semantic relationship graph represents the semantic relationships among objects in the scene in graph form, nodes in the graph represent target objects, and the connections between nodes represent the semantic relationships between objects and their probabilities.
9. A computer-readable storage medium, wherein a computer program is stored on the storage medium, and when executed by a processor the computer program implements the indoor scene understanding method for edge computing according to any one of claims 1 to 7.
10. A service robot, comprising a robot walking part, and a depth camera and an edge computing device mounted on the robot walking part;
the robot walking part is used for moving and walking in an indoor scene according to a planned route;
the depth camera is used for acquiring a depth image corresponding to the current indoor scene in real time in the walking process and sending the depth image to the edge computing equipment;
the edge computing device is used for obtaining an understanding result of the indoor scene from the received depth image according to the indoor scene understanding method of any one of claims 1 to 7.
CN202111564642.0A 2021-12-20 2021-12-20 Indoor scene understanding method, equipment, medium and robot for edge calculation Pending CN114241290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111564642.0A CN114241290A (en) 2021-12-20 2021-12-20 Indoor scene understanding method, equipment, medium and robot for edge calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111564642.0A CN114241290A (en) 2021-12-20 2021-12-20 Indoor scene understanding method, equipment, medium and robot for edge calculation

Publications (1)

Publication Number Publication Date
CN114241290A true CN114241290A (en) 2022-03-25

Family

ID=80759524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111564642.0A Pending CN114241290A (en) 2021-12-20 2021-12-20 Indoor scene understanding method, equipment, medium and robot for edge calculation

Country Status (1)

Country Link
CN (1) CN114241290A (en)

Citations (8)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160350930A1 (en) * 2015-05-28 2016-12-01 Adobe Systems Incorporated Joint Depth Estimation and Semantic Segmentation from a Single Image
KR20170103081A (en) * 2016-03-02 2017-09-13 경기대학교 산학협력단 3 dimension object labeling device and method
CN108334830A (en) * 2018-01-25 2018-07-27 南京邮电大学 A kind of scene recognition method based on target semanteme and appearance of depth Fusion Features
CN110880162A (en) * 2019-11-22 2020-03-13 中国科学技术大学 Snapshot spectrum depth combined imaging method and system based on deep learning
CN111340939A (en) * 2020-02-21 2020-06-26 广东工业大学 Indoor three-dimensional semantic map construction method
CN111444858A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Mobile robot scene understanding method
WO2021249575A1 (en) * 2020-06-09 2021-12-16 全球能源互联网研究院有限公司 Area semantic learning and map point identification method for power transformation operation scene
US20210383231A1 (en) * 2020-08-20 2021-12-09 Chang'an University Target cross-domain detection and understanding method, system and equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DANIEL SEICHTER, ET AL.: "Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis", COMPUTER VISION AND PATTERN RECOGNITION, 7 April 2021 (2021-04-07), pages 1 - 7 *
ZHANG WEN ET AL.: "Real-time scene classification for indoor robots based on semantic mapping", TRANSDUCER AND MICROSYSTEM TECHNOLOGIES (传感器与微系统), vol. 36, no. 08, 31 August 2017 (2017-08-31), pages 18 - 21 *
LI XIANGPAN ET AL.: "Indoor scene understanding based on multi-view RGB-D image frame data fusion", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT (计算机研究与发展), vol. 57, no. 06, 30 June 2020 (2020-06-30), pages 1218 - 1226 *

Similar Documents

Publication Publication Date Title
CN109829433B (en) Face image recognition method and device, electronic equipment and storage medium
US11003923B2 (en) Spatial and temporal information for semantic segmentation
CN109858390B (en) Human skeleton behavior identification method based on end-to-end space-time diagram learning neural network
US11507099B2 (en) Systems and methods for graph-based AI training
Kumar et al. A New Vehicle Tracking System with R-CNN and Random Forest Classifier for Disaster Management Platform to Improve Performance
Yue et al. Collaborative semantic understanding and mapping framework for autonomous systems
Wang et al. Pointloc: Deep pose regressor for lidar point cloud localization
CN112802204B (en) Target semantic navigation method and system for three-dimensional space scene prior in unknown environment
Li et al. Incremental scene understanding on dense SLAM
Kostavelis et al. Semantic maps from multiple visual cues
CN112784873A (en) Semantic map construction method and equipment
Ghatak et al. An improved surveillance video synopsis framework: a HSATLBO optimization approach
JP2023533907A (en) Image processing using self-attention-based neural networks
CN117157678A (en) Method and system for graph-based panorama segmentation
Luo et al. Hierarchical semantic mapping using convolutional neural networks for intelligent service robotics
KR20190050639A (en) Apparatus and Method for classifing Gesture based on CNN
Chakravarthy et al. Dronesegnet: Robust aerial semantic segmentation for uav-based iot applications
US20230095533A1 (en) Enriched and discriminative convolutional neural network features for pedestrian re-identification and trajectory modeling
Gosala et al. Skyeye: Self-supervised bird's-eye-view semantic mapping using monocular frontal view images
CN115018039A (en) Neural network distillation method, target detection method and device
Ding et al. Global relational reasoning with spatial temporal graph interaction networks for skeleton-based action recognition
Liu et al. Towards vehicle-to-everything autonomous driving: A survey on collaborative perception
Borna et al. Towards a vector agent modelling approach for remote sensing image classification
CN114241290A (en) Indoor scene understanding method, equipment, medium and robot for edge calculation
Puck et al. Distributed active learning for semantic segmentation on walking robots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination