CN113439275A - Identification method of plane semantic category and image data processing device

Info

Publication number: CN113439275A
Application number: CN202080001308.1A
Authority: CN (China)
Prior art keywords: plane, semantic, image data, processed, dense
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 马超群, 陈平, 方晓鑫
Original and current assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Publication of CN113439275A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition


Abstract

The application provides a method for recognizing plane semantic categories and an image data processing device, relating to the technical field of image processing and used for accurately determining the plane categories of planes. The method comprises the following steps: acquiring image data to be processed, where the image data to be processed comprises N pixel points; determining a semantic segmentation result of the image data to be processed, where the semantic segmentation result comprises a target plane category corresponding to at least part of the N pixel points; obtaining a first dense semantic map according to the semantic segmentation result, where the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one of the at least part of pixel points; and performing plane semantic category identification according to the first dense semantic map to obtain the plane semantic categories of one or more planes included in the image data to be processed. The method can improve the accuracy of plane semantic recognition.

Description

Identification method of plane semantic category and image data processing device

Technical Field
Embodiments of the present application relate to the technical field of image processing, and in particular to a method for identifying plane semantic categories and an image data processing device.
Background
Augmented Reality (AR) is a technology that calculates the position and angle of the camera image in real time and overlays corresponding images, videos, and 3D models, with the aim of fitting the virtual world onto the real world on a screen and interacting with it. Plane detection is an important function in augmented reality: it provides perception of the basic three-dimensional environment of the real world, so that a developer can place a virtual object according to the detected plane to achieve the augmented reality effect. Three-dimensional spatial plane detection is an important and fundamental capability, because once a plane is detected the anchor point of an object can be determined and the object can be rendered at that anchor point.
At present, a number of three-dimensional points on a plane can be acquired with a laser device, a plane equation can be fitted statistically from these three-dimensional points, and the position of the plane can be determined from the plane equation. However, most current augmented reality algorithms only provide the position of a detected plane and cannot identify its plane class, whereas identifying the plane class would help developers improve the realism and interest of augmented reality applications.
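As a concrete illustration of fitting a plane equation from a set of three-dimensional points by statistics, the following minimal sketch fits n·x + d = 0 by least squares; the function name and the use of NumPy are illustrative assumptions and are not part of the laser-based schemes described above.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit: returns unit normal n and offset d with n.x + d = 0.

    points: (K, 3) array of 3D points assumed to lie roughly on one plane.
    """
    centroid = points.mean(axis=0)
    # The singular vector with the smallest singular value is the plane normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    d = -float(normal @ centroid)
    return normal, d
```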
On this basis, Red-Green-Blue (RGB) image data or Red-Green-Blue-Depth (RGBD) image data can currently be semantically segmented by a neural network, a semantic map can be built from the semantic segmentation result, and plane semantic categories can then be generated from the semantic map. However, because this scheme builds the semantic map directly from the raw semantic segmentation result, which may contain mis-segmented and unsegmented regions, the accuracy of semantic category identification is reduced.
Disclosure of Invention
The embodiment of the application provides a method for identifying plane semantic categories and an image data processing device, which are used for improving the accuracy of identifying the plane semantic categories.
In order to achieve the above purpose, the embodiments of the present application provide the following technical solutions:
in a first aspect, an embodiment of the present application provides a method for identifying a plane semantic category, including: the image data processing device acquires image data to be processed comprising N pixel points, wherein N is a positive integer. The image data processing device determines a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result comprises target plane categories corresponding to at least part of pixel points in the N pixel points. And the image data processing device obtains a first dense semantic map according to the semantic segmentation result, wherein the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point in the at least part of pixel points. And the image data processing device performs plane semantic category identification according to the first dense semantic map to obtain plane semantic categories of one or more planes included in the image data to be processed.
The embodiments of the present application provide a method for identifying plane semantic categories. By acquiring a semantic segmentation result of the image data to be processed, where the semantic segmentation result includes the target plane category to which each of at least part of the N pixel points included in the image data to be processed belongs, the subsequent plane semantic recognition based on this segmentation can be made more accurate. In addition, according to the method provided by the embodiments of the present application, the image data processing device obtains the first dense semantic map from the semantic segmentation result and then identifies the plane semantic categories through the first dense semantic map, so that the plane semantic categories of the image data to be processed can be obtained and the accuracy of plane semantic recognition can be enhanced.
In one possible implementation, the image data processing apparatus obtains a first dense semantic map according to the semantic segmentation result, and includes: and the image data processing device obtains a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed. The image data processing apparatus takes the second dense semantic map as the first dense semantic map.
In one possible implementation, the image data processing apparatus obtains a first dense semantic map according to the semantic segmentation result, and includes: and the image data processing device obtains a second dense semantic map according to the semantic segmentation result. The image data processing device updates the historical dense semantic map with one or more second three-dimensional points in a second three-dimensional point cloud in a second dense semantic map to obtain a first dense semantic map.
In one possible implementation, the image data processing apparatus determining whether a current state of the image data processing apparatus is a motion state includes: the image data processing device acquires second image data, which is different from the image data to be processed. And the image data processing device judges whether the state of the image data processing device is a motion state or not according to the first equipment pose corresponding to the image data to be processed and the second equipment pose corresponding to the second image data. For example, the second image data is adjacent to the image data to be processed and located in the previous frame of the image data to be processed.
In one possible implementation, the image data processing apparatus determining that the current state is the motion state includes: when the difference between the first device pose and the second device pose is smaller than or equal to a first threshold, determining that the current state is a motion state.
In one possible implementation, the image data processing apparatus determining that the current state is the motion state includes: the image data processing apparatus acquires second image data captured by the camera, where the second image data is the frame adjacent to and immediately preceding the image data to be processed; and the image data processing apparatus determines that its state is a motion state according to the first device pose corresponding to the image data to be processed, the second device pose corresponding to the second image data, and the inter-frame difference between the second image data and the image data to be processed.
In a possible implementation, determining that the state of the image data processing apparatus is a motion state according to the first device pose corresponding to the image data to be processed, the second device pose corresponding to the second image data, and the inter-frame difference between the second image data and the image data to be processed includes: when the difference between the first device pose and the second device pose is smaller than or equal to the first threshold and the inter-frame difference between the second image data and the image data to be processed is larger than a second threshold, the state of the image data processing apparatus is a motion state.
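To make the two conditions above concrete, the following is a minimal sketch, assuming the device pose is given as a 4x4 transform and using the mean absolute pixel difference as the inter-frame difference; the threshold values and helper names are illustrative assumptions, not values defined by this application.

```python
import numpy as np

def is_motion_state(pose_curr, pose_prev, frame_curr, frame_prev,
                    pose_thresh=0.05, frame_thresh=8.0):
    """Return True if the device is judged to be in a motion state.

    pose_curr, pose_prev: 4x4 camera-to-world transforms (first/second device pose).
    frame_curr, frame_prev: grayscale images of the current and previous frame.
    """
    # Pose difference: translation distance between the two device poses.
    pose_diff = np.linalg.norm(pose_curr[:3, 3] - pose_prev[:3, 3])
    # Inter-frame difference: mean absolute intensity change between adjacent frames.
    frame_diff = np.mean(np.abs(frame_curr.astype(np.float32) -
                                frame_prev.astype(np.float32)))
    # Per the implementation described above: pose difference <= first threshold
    # and inter-frame difference > second threshold.
    return pose_diff <= pose_thresh and frame_diff > frame_thresh
```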
In a possible implementation, after the image data processing apparatus determines the semantic segmentation result of the image data to be processed, the method provided in the embodiments of the present application further includes: the image data processing apparatus performs an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information included in the depth image corresponding to the image data to be processed, where the optimization operation is used to correct noisy and erroneous parts of the semantic segmentation result. This makes the subsequent semantic recognition more accurate.
In one possible implementation, the image data processing apparatus determining a semantic segmentation result of the image data to be processed includes: the image data processing apparatus determines the probability of each of one or more plane classes corresponding to any one of the at least part of the pixel points, and takes the plane class with the highest probability among the one or more plane classes corresponding to the any one pixel point as the target plane class corresponding to that pixel point, so as to obtain the semantic segmentation result of the image data to be processed. That is, the probability of the target plane class corresponding to any one pixel point is the greatest among the probabilities of the one or more plane classes corresponding to that pixel point. This improves the accuracy of semantic recognition.
In a possible implementation manner, the determining, by the image data processing apparatus, a probability of each of one or more plane categories corresponding to any one of at least part of the pixel points includes: the image data processing device carries out semantic segmentation on the image data to be processed according to the neural network to obtain the probability of each plane category in one or more plane categories corresponding to any pixel point in at least part of pixel points.
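A minimal sketch of turning the per-pixel class probabilities produced by the neural network into the semantic segmentation result is given below; the (H, W, C) probability layout is an assumed convention, not one specified by this application.

```python
import numpy as np

def segmentation_from_probabilities(probs):
    """probs: (H, W, C) array, probs[y, x, c] = probability that pixel (y, x)
    belongs to plane category c. Returns an (H, W) map of target plane categories."""
    # The target plane category of a pixel is the category with the highest probability.
    return np.argmax(probs, axis=-1)
```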
In one possible implementation manner, the image data processing apparatus performs plane semantic category identification according to the first dense semantic map to obtain plane semantic categories of one or more planes included in the image data to be processed, including: the image data processing apparatus determines a plane equation for each of the one or more planes based on the image data to be processed. The image data processing device performs the following steps on any one of the one or more planes to obtain a plane semantic category of the one or more planes: the image data processing device determines one or more target plane classes corresponding to the any plane and the confidence of the one or more target plane classes according to the plane equation of the any plane and the first dense semantic map; and selecting the object plane class with the highest confidence from the one or more object plane classes as the semantic plane class of any one plane. That is, the semantic plane category of any plane is the target plane category with the highest confidence in the one or more target plane categories corresponding to any plane, and the target plane category with the highest confidence is selected as the semantic plane category of any plane, so that the accuracy of plane semantic recognition can be enhanced.
In one possible implementation, the orientation of the one or more object plane categories corresponding to any plane is consistent with the orientation of any plane. I.e. the orientation of the one or more object plane classes to which each plane corresponds coincides with its respective orientation. Therefore, the target plane category inconsistent with the plane orientation can be filtered out, and the accuracy of plane semantic recognition is enhanced.
In one possible implementation, the image data processing apparatus determining, according to the plane equation of the any one plane and the first dense semantic map, one or more target plane categories corresponding to the any one plane and the confidence of the one or more target plane categories includes: the image data processing apparatus determines M first three-dimensional points from the first dense semantic map according to the plane equation of the any one plane, where the distance between the M first three-dimensional points and the any one plane is smaller than a third threshold and M is a positive integer; determines one or more target plane categories corresponding to the M first three-dimensional points as the one or more target plane categories corresponding to the any one plane, where the orientation of the one or more target plane categories is consistent with the orientation of the any one plane; and counts the proportion, among the M first three-dimensional points, of the number of three-dimensional points corresponding to each of the one or more target plane categories to obtain the confidence of the one or more target plane categories. For example, the target plane category corresponding to each first three-dimensional point is the target plane category of the two-dimensional pixel point corresponding to that first three-dimensional point, so that the one or more target plane categories of all M first three-dimensional points can be obtained.
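The confidence computation described above can be sketched as follows, assuming the first dense semantic map is given as a point array with a per-point category array, plus a per-category orientation table; these data layouts and the threshold values are illustrative assumptions.

```python
import numpy as np

def plane_class_confidence(points, point_classes, class_normals,
                           plane_normal, plane_d,
                           dist_thresh=0.05, orient_thresh=0.9):
    """Candidate target plane categories and their confidences for one plane.

    points:        (K, 3) 3D points of the first dense semantic map.
    point_classes: (K,) target plane category of each 3D point.
    class_normals: dict mapping category -> representative unit orientation
                   (an assumed convention, e.g. "up" for floor-like classes).
    plane_normal, plane_d: plane equation n.x + d = 0 of the plane in question.
    """
    n = plane_normal / np.linalg.norm(plane_normal)
    # The M first three-dimensional points whose distance to the plane is below the threshold.
    near = np.abs(points @ n + plane_d) < dist_thresh
    m = int(near.sum())
    if m == 0:
        return {}
    conf = {}
    classes, counts = np.unique(point_classes[near], return_counts=True)
    for cls, cnt in zip(classes, counts):
        # Keep only categories whose orientation is consistent with the plane orientation.
        if abs(np.dot(class_normals[int(cls)], n)) >= orient_thresh:
            conf[int(cls)] = cnt / m   # proportion among the M near points
    return conf
```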
In a possible implementation, after the image data processing apparatus counts the proportion, among the M first three-dimensional points, of the number of three-dimensional points corresponding to each of the one or more target plane categories to obtain the confidence of the one or more target plane categories, the method provided in the embodiments of the present application further includes: the image data processing apparatus updates the confidence of the one or more target plane categories corresponding to the any one plane according to at least one of Bayes' theorem or a voting mechanism. Updating the confidence of the one or more plane categories corresponding to the any one plane over the video sequence based on Bayes' theorem and the voting mechanism makes the plane semantic category result finally obtained for each plane more stable.
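One way the confidence update over a video sequence could be realised is sketched below; treating the per-frame proportions as likelihoods and renormalising is only an assumed form of the Bayes-style update, not the formula defined by this application.

```python
def update_confidence(prior, observation, floor=1e-3):
    """Bayes-style update of per-category plane confidences across frames.

    prior:       dict {category: confidence} accumulated from earlier frames.
    observation: dict {category: confidence} measured on the current frame.
    floor:       small value used for categories missing from one of the dicts.
    """
    posterior = {}
    for cls in set(prior) | set(observation):
        posterior[cls] = prior.get(cls, floor) * observation.get(cls, floor)
    total = sum(posterior.values())
    # Renormalise so the confidences again sum to one.
    return {cls: p / total for cls, p in posterior.items()}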
In a possible implementation manner, the method provided in the embodiment of the present application further includes: the image data processing device judges whether the current state of the image data processing device is a motion state, and under the condition that the current state is the motion state, the image data processing device obtains a first dense semantic map according to the semantic segmentation result. By judging whether the motion state is the motion state or not, the first dense semantic map is obtained according to the semantic segmentation result in the motion state, so that the data volume calculated by the image data processing device can be reduced, the calculation resources can be reduced, and the performance of a semantic map generation algorithm can be improved.
In one possible implementation, the image data to be processed is image data that has been aligned (corrected according to the device pose).
In a possible implementation manner, before the image data processing apparatus acquires image data to be processed, the method provided in the embodiment of the present application further includes: an image data processing apparatus acquires first image data captured by a camera. And the image data processing device corrects the first image data according to the equipment pose corresponding to the first image data to obtain the image data to be processed.
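A minimal sketch of this correction step is given below, under the assumption that "correcting according to the device pose" means rotating the raw image by the roll angle derived from the pose so that the processed image is upright; the use of OpenCV and the world-up convention are illustrative assumptions, not the application's defined procedure.

```python
import numpy as np
import cv2

def rectify_image(image, rotation_cam_to_world):
    """Rotate the raw image so its content is upright, using the device pose.

    rotation_cam_to_world: 3x3 rotation of the camera in the world frame.
    Assumption: the roll angle is estimated from where the world up-axis
    projects into the image plane.
    """
    # World up-axis expressed in camera coordinates.
    up_cam = rotation_cam_to_world.T @ np.array([0.0, 1.0, 0.0])
    # Roll angle (degrees) between the image vertical and the projected up-axis.
    roll_deg = np.degrees(np.arctan2(up_cam[0], up_cam[1]))
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), roll_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h))
```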
In a second aspect, an embodiment of the present application provides an image data processing apparatus, including: the semantic segmentation module is used for acquiring image data to be processed which is provided by a camera and comprises N pixel points, wherein N is a positive integer. The semantic segmentation module is further used for determining a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result comprises a target plane category corresponding to at least part of the N pixel points. And the semantic map module is used for obtaining a first dense semantic map according to a semantic segmentation result, wherein the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point in at least part of pixel points. And the semantic clustering module is used for carrying out plane semantic category identification according to the first dense semantic map to obtain one or more plane semantic categories of the image data to be processed.
The embodiments of the present application provide an image data processing device. By acquiring a semantic segmentation result of the image data to be processed, where the semantic segmentation result includes the target plane category to which each of the N pixel points included in the image data to be processed belongs, the subsequent processing based on the semantic segmentation can identify plane semantics more accurately. In addition, the image data processing device obtains the first dense semantic map from the semantic segmentation result and then identifies the plane semantic categories through the first dense semantic map, so that the plane semantic categories of the image data to be processed can be obtained and the accuracy of plane semantic recognition can be enhanced.
In one possible implementation, the semantic map module being configured to obtain a first dense semantic map according to the semantic segmentation result includes: the semantic map module is configured to obtain a second dense semantic map according to the semantic segmentation result, and to use the second dense semantic map as the first dense semantic map.
In one possible implementation, the semantic map module is configured to obtain a first dense semantic map according to the semantic segmentation result, and includes: and the semantic map module is used for obtaining a second dense semantic map according to the semantic segmentation result. The semantic map module is used for updating the historical dense semantic map by utilizing one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.
In one possible implementation, the image data processing apparatus further includes a simultaneous localization and mapping (SLAM) module configured to calculate a device pose (for example, a camera pose) of the image data. The semantic map module being configured to determine whether the current state of the image data processing apparatus is a motion state includes: the semantic map module is configured to acquire second image data provided by the camera, where the second image data is different from the image data to be processed, and to determine whether the state of the image data processing apparatus is a motion state according to the first device pose corresponding to the image data to be processed provided by the SLAM module and the second device pose corresponding to the second image data provided by the SLAM module. For example, the second image data is the frame adjacent to and immediately preceding the image data to be processed.
In one possible implementation, the semantic map module for determining that the current state is a motion state includes: when the difference value between the first device pose and the second device pose is smaller than or equal to a first threshold value, the semantic map module is used for determining that the current state is a motion state;
In one possible implementation, the semantic map module being configured to determine that the current state is a motion state includes: the semantic map module is configured to acquire second image data captured by the camera, where the second image data is the frame adjacent to and immediately preceding the image data to be processed, and to determine that the current state of the image data processing apparatus is a motion state according to the first device pose corresponding to the image data to be processed provided by the SLAM module, the second device pose corresponding to the second image data provided by the SLAM module, and the inter-frame difference between the second image data and the image data to be processed.
In a possible implementation manner, the semantic map module is configured to determine, according to a first device pose corresponding to the image data to be processed, a second device pose corresponding to the second image data, and an inter-frame difference between the second image data and the image data to be processed, that a current state of the image data processing apparatus is a motion state, and the semantic map module includes: and under the condition that the difference value between the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data is smaller than or equal to a first threshold value, and the inter-frame difference between the second image data and the image data to be processed is larger than a second threshold value, the semantic map module is used for determining that the current state of the image data processing device is a motion state.
In a possible implementation manner, the semantic segmentation module is further configured to perform an optimization operation on the semantic segmentation result according to the to-be-processed image data and depth information included in a depth image corresponding to the to-be-processed image data, where the optimization operation is used to correct noise and an error portion in the semantic segmentation result.
In a possible implementation manner, the semantic segmentation module is configured to determine a semantic segmentation result of the image data to be processed, and includes a function of determining one or more plane categories corresponding to any one pixel point in the at least part of pixel points and a probability of each of the one or more plane categories, and a function of taking a plane category with a highest probability among the one or more plane categories corresponding to the any one pixel point as a target plane category corresponding to the any one pixel point, so as to obtain the semantic segmentation result of the image data to be processed. That is, the probability of the target plane category corresponding to any pixel point in at least part of pixel points included in the semantic segmentation result of the image data to be processed is the maximum in the probability of one or more plane categories corresponding to any pixel point.
In a possible implementation manner, the semantic segmentation module is configured to perform semantic segmentation on the image data to be processed according to the neural network, so as to obtain a probability of each of one or more plane categories corresponding to any one of the at least part of the pixel points.
In a possible implementation manner, the semantic clustering module is configured to perform plane semantic category identification according to the first dense semantic map to obtain plane semantic categories of one or more planes included in the image data to be processed, and includes: and the semantic clustering module is used for determining a plane equation of each plane in one or more planes according to the image data to be processed. The semantic clustering module is further configured to perform the following steps on any one of the one or more planes to obtain a plane semantic category of the one or more planes: a semantic clustering module, configured to determine, according to the plane equation of any one of the planes and the first dense semantic map, one or more object plane categories corresponding to any one of the planes and a confidence of the one or more object plane categories; and the semantic clustering module is used for selecting the object plane category with the highest confidence from the one or more object plane categories as the semantic plane category of any one plane.
In one possible implementation, the orientation of the one or more object plane classes corresponding to each plane is consistent with the respective orientation of each plane. That is, the orientation of one or more object plane categories corresponding to any plane is consistent with the orientation of any plane.
In one possible implementation, the semantic clustering module is configured to determine, according to the plane equation of the any one plane and the first dense semantic map, one or more object plane classes corresponding to the any one plane and the confidence of the one or more object plane classes, and includes: the semantic clustering module is used for determining M first three-dimensional points from the first dense semantic map according to a plane equation of any one plane, the distance between the M first three-dimensional points and the any one plane is smaller than a third threshold, the orientation of a target plane category corresponding to the M first three-dimensional points is consistent with the orientation of the any one plane, M is a positive integer, and the M first three-dimensional points correspond to the one or more plane categories; and counting the proportion of the number of the three-dimensional points corresponding to each plane category in the one or more plane categories in the M first three-dimensional points to obtain the confidence of the one or more plane categories.
In a possible implementation manner, the semantic clustering module is configured to count a ratio of the number of three-dimensional points corresponding to each of the one or more plane categories to the M first three-dimensional points, and after obtaining the confidence of the one or more plane categories, the semantic clustering module is further configured to update the confidence of the one or more target plane categories corresponding to any one of the planes according to at least one of bayes theorem or a voting mechanism.
In one possible implementation, the semantic map module is configured to determine whether a current state of the image data processing apparatus is a motion state. And when the current state is determined to be the motion state, the semantic map module is used for obtaining a first dense semantic map according to the semantic segmentation result.
In one possible implementation, the image data to be processed is post-aligned image data.
In a possible implementation manner, before the semantic segmentation module is used to acquire the image data to be processed, the semantic segmentation module is further used to acquire the first image data captured by the camera. The semantic segmentation module is used for correcting the first image data according to the equipment pose corresponding to the first image data provided by the SLAM module to obtain the image data to be processed.
In one possible implementation, the SLAM module, the semantic clustering module, and the semantic map module run on the central processing unit (CPU), the part of the semantic segmentation module that performs semantic segmentation may run on the NPU, and the remaining parts of the semantic segmentation module run on the CPU.
In a third aspect, the present application provides a computer-readable storage medium, in which instructions are stored, and when executed, the method described in any aspect of the first aspect is implemented.
In a fourth aspect, an embodiment of the present application provides an image data processing apparatus, including: the image processing device comprises a first processor and a second processor, wherein the first processor is used for acquiring to-be-processed image data comprising N pixel points, and N is a positive integer. The second processor is used for determining a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result comprises a target plane category corresponding to at least part of pixel points in the N pixel points; the first processor is used for obtaining a first dense semantic map according to the semantic segmentation result, wherein the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point in at least part of pixel points; and the first processor is used for carrying out plane semantic category identification according to the first dense semantic map to obtain the plane semantic categories of one or more planes included in the image data to be processed.
In a possible implementation manner, the first processor is specifically configured to obtain a second dense semantic map according to the semantic segmentation result and a depth image corresponding to the image data to be processed; the first processor is specifically configured to use the second dense semantic map as the first dense semantic map, or the first processor is specifically configured to update the historical dense semantic map with one or more second three-dimensional points in a second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.
In a possible implementation, the second processor is further configured to perform an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information included in the depth image corresponding to the image data to be processed, where the optimization operation is used to correct noisy and erroneous parts of the semantic segmentation result.
In a possible implementation, before determining the semantic segmentation result of the image data to be processed, the second processor is further configured to determine the probability of each of one or more plane categories corresponding to any one of the at least part of the pixel points, and to take the plane category with the highest probability among the one or more plane categories corresponding to the any one pixel point as the target plane category corresponding to that pixel point, so as to obtain the semantic segmentation result of the image data to be processed. That is, the probability of the target plane category corresponding to any one pixel point is the greatest among the probabilities of the one or more plane categories corresponding to that pixel point. This improves the accuracy of semantic recognition.
In a possible implementation manner, the second processor is configured to perform semantic segmentation on the image data to be processed according to a neural network, so as to obtain a probability of each of one or more plane categories corresponding to any one of the at least part of the pixel points.
In one possible implementation, a first processor to determine a plane equation for each of the one or more planes; a first processor, further configured to perform the following steps for any of the one or more planes to obtain a plane semantic category of the one or more planes: a first processor further configured to determine one or more object plane classes corresponding to the any plane and a confidence of the one or more object plane classes based on the plane equation for the any plane and the first dense semantic map; the first processor is further used for selecting the object plane category with the highest confidence from the one or more object plane categories as the semantic plane category of any one plane. That is, the semantic plane class of any plane is the highest-confidence object plane class in the one or more object plane classes corresponding to any plane.
In one possible implementation, the orientation of the one or more object plane categories corresponding to any plane is consistent with the orientation of the any plane.
In a possible implementation manner, the first processor is specifically configured to determine, according to a plane equation of any one of the planes, M first three-dimensional points from the first dense semantic map, where a distance between the M first three-dimensional points and any one of the planes is smaller than a third threshold, and M is a positive integer; determining one or more target plane categories corresponding to the M first three-dimensional points as the one or more target plane categories corresponding to any one plane, wherein the orientation of the one or more target plane categories is consistent with the orientation of any one plane, and counting the proportion of the number of the three-dimensional points corresponding to each target plane category in the one or more target plane categories in the M first three-dimensional points to obtain the confidence of the one or more target plane categories.
In a possible implementation manner, the first processor is specifically configured to count a ratio of the number of three-dimensional points corresponding to each of the one or more target plane categories to the M first three-dimensional points, and after obtaining the confidence of the one or more target plane categories, the first processor is further configured to update the confidence of the one or more target plane categories corresponding to any one of the planes according to at least one of bayes theorem or a voting mechanism.
In a possible implementation, the first processor is configured to determine whether the current state is a motion state, and to obtain the first dense semantic map according to the semantic segmentation result when the current state is determined to be the motion state.
In one possible implementation, the first processor may be a CPU or a DSP. The second processor may be an NPU.
In a fifth aspect, an embodiment of the present application provides an image data processing apparatus, including: one or more processors, wherein the one or more processors are configured to execute instructions stored in the memory to perform the method as described in any of the first aspects.
In a sixth aspect, an embodiment of the present application provides a computer program product comprising instructions that, when executed, implement the method described in any implementation of the first aspect.
Drawings
Fig. 1 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a software architecture applicable to a method for identifying a plane semantic category according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for identifying a plane semantic category according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of another method for identifying a plane semantic category according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of first image data acquired by an image data processing apparatus according to an embodiment of the present application before and after processing;
FIG. 6 is a diagram illustrating semantic segmentation results provided in an embodiment of the present application;
fig. 7 is a schematic diagram of a coordinate mapping provided in an embodiment of the present application;
fig. 8 is a schematic view illustrating a motion state determination process according to an embodiment of the present disclosure;
FIG. 9 is a flowchart of a plane confidence calculation according to an embodiment of the present disclosure;
FIG. 10 is a flow chart illustrating filtering performed on semantic segmentation results according to an embodiment of the present disclosure;
FIG. 11 is a flow chart illustrating another exemplary filtering performed on semantic segmentation results according to an embodiment of the present disclosure;
FIG. 12 is a diagram illustrating a plane semantic result provided by an embodiment of the present application;
fig. 13 is a schematic structural diagram of an image data processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, the present application will be further described in detail with reference to the accompanying drawings. In the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
The method for identifying plane semantic categories provided by the embodiments of the present application can be applied to various image data processing devices equipped with a time-of-flight (TOF) sensor, and the image data processing device may be an electronic device. Electronic devices may include, but are not limited to, personal computers, server computers, handheld or laptop devices, mobile devices (such as cell phones, tablets, personal digital assistants, media players, etc.), consumer electronics, minicomputers, mainframe computers, mobile robots, drones, and the like. For example, the electronic device in the embodiments of the present application may be a device with an AR function, such as AR glasses, and may be applied to scenarios such as AR automatic measurement, AR decoration, and AR interaction.
When the image data processing apparatus needs to identify the plane category of each of one or more planes included in the image data to be processed, one possible implementation is that the image data processing apparatus itself obtains the plane category recognition result of the image data to be processed using the plane semantic category identification method provided in the embodiments of the present application. In another possible implementation, the image data processing apparatus may send the image data to be processed to another device, such as a server or a terminal device, that implements the plane semantic category recognition; the recognition process is then executed by that server or terminal device, and the image data processing apparatus receives the plane semantic category recognition result from the other device.
In the following embodiments, an image data processing apparatus is taken as an example of an electronic device, and a method for identifying a plane semantic category provided in the embodiments of the present application is described. The method for identifying the plane semantic category is applicable to the electronic device shown in fig. 1, and the specific structure of the electronic device is briefly introduced below.
Fig. 1 is a schematic diagram of a hardware structure of an electronic device applied in the embodiment of the present application. As shown in fig. 1, electronic device 100 may include a display device 110, a processor 120, and a memory 130. The memory 130 may be used for storing software programs and data, and the processor 120 may execute various functional applications and data processing of the electronic device 100 by operating the software programs and data stored in the memory 130.
The memory 130 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program (such as an image capturing function) required by at least one function, and the like; the storage data area may store data (such as audio data, text information, image data) created according to the use of the electronic apparatus 100, and the like. Further, the memory 130 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 120 is a control center of the electronic device 100, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device 100 and processes data by running or executing software programs and/or data stored in the memory 130, thereby performing overall monitoring of the electronic device. Processor 120 may include one or more processing units, such as: the processor 120 may include a Central Processing Unit (CPU), an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a Neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The NPU is used as a neural-network (NN) computing processor, and can rapidly process input information by referring to a biological neural network structure, for example, by referring to a transfer mode between neurons of a human brain, and can also continuously learn by self. Applications such as intelligent recognition of the electronic device 100 can be realized through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
In some embodiments, processor 120 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, processor 120 may include multiple sets of I2C buses. The processor 120 may be coupled to the touch sensor, the charger, the flash, the image capture device 160, etc. via different I2C bus interfaces, respectively. For example: the processor 120 may be coupled to the touch sensor via an I2C interface, such that the processor 120 and the touch sensor communicate via an I2C bus interface to implement touch functionality of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, processor 120 may include multiple sets of I2S buses. The processor 120 may be coupled to the audio module via an I2S bus to enable communication between the processor 120 and the audio module. In some embodiments, the audio module may transmit audio signals to the WiFi module 190 through the I2S interface, so as to implement the function of answering a call through a bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, the audio module and WiFi module 190 may be coupled through a PCM bus interface. In some embodiments, the audio module may also transmit the audio signal to the WiFi module 190 through the PCM interface, so as to implement the function of answering a call through the bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 120 with the WiFi module 190. For example: the processor 120 communicates with the bluetooth module in the WiFi module 190 through the UART interface to implement the bluetooth function. In some embodiments, the audio module may transmit the audio signal to the WiFi module 190 through the UART interface, so as to realize the function of playing music through the bluetooth headset.
A MIPI interface may be used to connect processor 120 with peripheral devices such as display device 110, image capture device 160, and the like. The MIPI interface includes a Camera Serial Interface (CSI) of the image capture device 160, a display screen serial interface (DSI), and the like. In some embodiments, processor 120 and image capture device 160 communicate via a CSI interface to implement the capture functionality of electronic device 100. The processor 120 and the display screen communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 120 with the image capture device 160, the display device 110, the WiFi module 190, the audio module, the sensor module, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface is an interface which accords with the USB standard specification, and specifically can be a Mini USB interface, a Micro USB interface, a USB Type C interface and the like. The USB interface may be used to connect a charger to charge the electronic device 100, and may also be used to transmit data between the electronic device 100 and a peripheral device. And the earphone can also be used for connecting an earphone and playing audio through the earphone. The interface may also be used to connect other electronic devices, such as AR devices and the like.
It should be understood that the connection relationship between the modules according to the embodiment of the present invention is only illustrative, and is not limited to the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
Also included in the electronic device 100 is an image acquisition device 160 for capturing images or video. The image acquisition device 160 includes one or more cameras for acquiring image data and a TOF camera for acquiring depth images. For example, a camera is used to capture Video Graphics Array (VGA) or other image data and send it to the CPU and the GPU. The camera may be an ordinary camera or a focusing camera.
The electronic device 100 may further include an input device 140 for receiving input numerical information, character information, or contact touch operation/non-contact gesture, and generating signal input related to user setting and function control of the electronic device 100, and the like.
The display device 110 includes a display panel 111 for displaying information input by or provided to the user and the various menu interfaces of the electronic device 100; in the embodiments of the present application it is mainly used for displaying the image data to be processed acquired by a camera or a sensor in the electronic device 100. Optionally, the display panel 111 may be a liquid crystal display (LCD) or an organic light-emitting diode (OLED) panel.
The electronic device 100 may also include one or more sensors 170, such as image sensors, infrared sensors, laser sensors, pressure sensors, gyroscope sensors, barometric pressure sensors, magnetic sensors, acceleration sensors, distance sensors, proximity light sensors, ambient light sensors, fingerprint sensors, touch sensors, temperature sensors, bone conduction sensors, inertial measurement units (IMUs), and the like, where the image sensor may be a time-of-flight (TOF) sensor, a structured light sensor, and so on. Specifically, an inertial measurement unit is a device for measuring the three-axis attitude angle (or angular velocity) and acceleration of an object. Typically, an IMU contains three single-axis accelerometers and three single-axis gyroscopes: the accelerometers detect the acceleration of the object along the three independent axes of the carrier coordinate system, and the gyroscopes detect the angular velocity of the carrier relative to the navigation coordinate system; by measuring the angular velocity and acceleration of the object in three-dimensional space, its attitude can be calculated. Further, the image sensor may be a device within the image acquisition device 160 or a separate device for acquiring image data.
In addition, the electronic device 100 may also include a power supply 150 for powering other modules. The electronic device 100 may further include a Radio Frequency (RF) circuit 180 for performing network communication with a wireless network device, and a WiFi module 190 for performing WiFi communication with other devices, for example, for acquiring images or data transmitted by other devices. Although not shown in fig. 1, the electronic device 100 may further include a flash, a bluetooth module, an external interface, a button, a motor, and other possible functional modules, which are not described in detail herein.
As shown in fig. 2, fig. 2 illustrates a software architecture applicable to the plane semantic category identification method provided in the embodiments of the present application. The software architecture is applied to the electronic device 100 shown in fig. 1 and includes a semantic segmentation module 202, a semantic map module 203, and a semantic clustering module 204. Optionally, the software architecture may further include a simultaneous localization and mapping (SLAM) module 201. The SLAM module 201, the semantic map module 203, and the semantic clustering module 204 run on the CPU of the electronic device described in fig. 1. Alternatively, part of the functions of the SLAM module 201 may be deployed on a digital signal processor (DSP), part of the functions of the semantic segmentation module 202 may run on the NPU of the electronic device described in fig. 1, and the remaining functions of the semantic segmentation module 202 may run on the CPU. The functions running on the NPU include those described later.
Among other things, the SLAM module 201 is configured to calculate a device pose (for example, when the device is a camera, the device pose may refer to the camera pose), that is, the rotation and translation of the camera relative to the first frame, and to detect planes and output the device pose as well as the normal parameters and boundary points of each plane, based on a video sequence consisting of one or more frames of image data provided by the camera (corresponding to the image acquisition device 160 of the electronic device depicted in fig. 1), the depth information or depth image provided by TOF for that image data, the IMU data, the correlation between frames of image data, and visual geometry principles. For example, the IMU data includes accelerometer and gyroscope readings. The depth information includes the distance between each pixel point in the image data and the camera that captured the image data.
The semantic segmentation module 202 is used for semantic segmentation enhanced by the SLAM technique and is divided into pre-processing, AI processing, and post-processing. The input to pre-processing is the raw image data (for example, an RGB image) provided by the camera and the device pose obtained from the SLAM module 201; the raw image data is corrected according to the device pose, and the output is the corrected image data. Compared with adding rotated data during training, this reduces the demand on the semantic segmentation model for rotation invariance and improves the recognition rate.
The AI process is based on semantic segmentation by a neural network, operates on the NPU, inputs the aligned image data, and outputs a probability distribution (i.e., a probability that each pixel belongs to one or more plane categories) that each pixel included in the aligned image data belongs to each of the one or more plane categories. If the plane category with the maximum probability is selected, the semantic segmentation result of the pixel level can be obtained. For example, the neural network may be a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), or a Recurrent Neural Network (RNN).
The input of the post-processing is the original image data and depth information provided by the camera and the semantic segmentation result output by the AI processing; the semantic segmentation result is filtered mainly according to the original image data and the depth information, and the output is an optimized semantic segmentation result. After post-processing, the accuracy and the boundary quality of the segmentation are better. It is to be understood that post-processing is not a necessary technique of an embodiment and may be omitted. Alternatively, the pre-processing and the post-processing may be executed on a CPU or another processor instead of the NPU, which is not limited in this embodiment.
The input of the semantic map module 203 is the optimized or non-optimized semantic segmentation result, the device pose provided by the SLAM module 201, the depth information provided by the TOF sensor, and the raw image data provided by the camera. Based on the SLAM technology, the semantic map module 203 generates a dense semantic map (Dense Semantic Map); the process includes converting the two-dimensional raw image data into a three-dimensional dense semantic map. Through this conversion, the two-dimensional RGB pixels of the two-dimensional raw image data are converted into three-dimensional points in three-dimensional space, so that each point carries depth information in addition to RGB information. The two-dimensional to three-dimensional conversion process can refer to the description in the prior art and is not described here again. During the conversion, the target plane category of each pixel point is taken as the target plane category of the three-dimensional point corresponding to that pixel point, so that the target plane categories of the pixel points become the target plane categories of the three-dimensional points. Thus, the dense semantic map includes the target plane categories of a plurality of three-dimensional points, and the target plane category of any three-dimensional point corresponds to the target plane category of the two-dimensional pixel point corresponding to that three-dimensional point.
It can be understood that, when the post-processing is performed in the embodiment, the input of the semantic map module 203 is the optimized semantic segmentation result; when no post-processing is performed in an embodiment, the input of the semantic map module 203 is a semantic segmentation result that is not optimized.
The semantic clustering module 204 performs plane semantic recognition based on the dense semantic map. Based on the above introduction, the present application provides a method for recognizing plane semantic categories and an image data processing apparatus, and the method enables the image data processing apparatus to detect one or more planes included in image data. In the embodiment of the present application, the method and the image data processing apparatus are based on the same inventive concept; because the principles by which the method and the image data processing apparatus solve the problem are similar, their implementations may be referred to each other, and repeated parts are not described again.
As shown in fig. 3, fig. 3 illustrates a method for identifying a plane semantic category provided by an embodiment of the present application, where the method is applied to an image data processing apparatus, and the method includes: step 301, the semantic segmentation module 202 obtains image data to be processed, where the image data to be processed includes N pixel points, and N is a positive integer. It should be understood that the image data to be processed may be captured by a camera of the image data processing apparatus and provided to the semantic segmentation module 202, or may be obtained by the semantic segmentation module 202 from a gallery in the image data processing apparatus for storing the image data, or may be sent by other devices, which is not limited in this embodiment of the present application. For example, the image data to be processed may be a two-dimensional image. The image data to be processed may be a color photograph or a black-and-white photograph, which is not limited in this embodiment of the application.
It should be noted that the N pixel points may be all pixel points in the image data to be processed, or may be partial pixel points in the image data to be processed. When the N pixel points are partial pixel points in the image data to be processed, the N pixel points may be pixel points belonging to a planar category in the image data to be processed, but do not include pixel points belonging to a non-planar category. It can be understood that the non-planar pixel point refers to a pixel point that does not belong to any identified planar category, and at this time, the pixel point is considered not to belong to a pixel point on any plane.
Step 302, the semantic segmentation module 202 determines a semantic segmentation result of the image data to be processed. The semantic segmentation result comprises target plane categories corresponding to at least part of pixel points in N pixel points included in the image data to be processed. Optionally, the at least part of the pixel points may be pixel points of one or more planes included in the image data to be processed. The target plane category corresponding to at least part of the pixel points in the N pixel points may refer to a target plane category corresponding to part of the pixel points in the N pixel points, or may refer to a target plane category corresponding to all the pixel points in the N pixel points.
On one hand, in the embodiment of the present application, the image data processing apparatus may autonomously determine the semantic segmentation result of the image data to be processed, and at this time, the image data processing apparatus may have a module (e.g., NPU) therein for determining the semantic segmentation result of the image data to be processed.
On the other hand, the image data processing apparatus in the embodiment of the present application may further send the image data to be processed to a device having a function of determining a semantic segmentation result of the image data to be processed, so that the device having the function of determining the semantic segmentation result of the image data to be processed determines the semantic segmentation result of the image data to be processed. Then, the image data processing apparatus acquires a semantic segmentation result of the image data to be processed from a device having a function of determining a semantic segmentation result of the image data to be processed. In the embodiment of the application, the image data processing device can detect one or more planes included in the image data to be processed by determining the semantic segmentation result of the image data to be processed.
Step 303, the semantic map module 203 obtains a first dense semantic map according to the semantic segmentation result, where the first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point in the at least part of pixel points. The purpose of step 303 in this embodiment of the application is that the semantic map module 203 uses the plane category of a pixel point in two-dimensional space to update the plane category, that is, the target plane category, of the three-dimensional point corresponding to that pixel point in three-dimensional space.
In one possible implementation, the semantic map module 203 may use at least one target plane category corresponding to the three-dimensional point cloud corresponding to all the pixel points in the semantic segmentation result as the first dense semantic map. The performance of the semantic map generation algorithm may be improved by step 303. Step 304, the semantic clustering module 204 performs plane semantic category identification according to the first dense semantic map to obtain plane semantic categories of one or more planes included in the image data to be processed.
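For orientation, the flow of steps 301 to 304 can be summarized with a short sketch in Python; the callables below are hypothetical stand-ins for the modules of fig. 2 and are not an implementation taken from the application.

```python
def identify_plane_semantics(image, depth, pose, segmenter, mapper, clusterer):
    """Minimal sketch of steps 301-304, assuming each module is exposed as a
    callable (segmenter, mapper, clusterer are hypothetical names)."""
    seg_result = segmenter(image)                # step 302: per-pixel target plane categories
    dense_map = mapper(seg_result, depth, pose)  # step 303: first dense semantic map
    return clusterer(dense_map)                  # step 304: plane semantic categories of the planes
```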
The embodiment of the application provides a method for identifying plane semantic categories. By acquiring the semantic segmentation result of the image data to be processed, where the semantic segmentation result includes the target plane category to which each of the N pixel points included in the image data to be processed belongs, the accuracy of plane semantic identification can be improved. In addition, according to the method provided by the embodiment of the application, the image data processing device obtains the first dense semantic map according to the semantic segmentation result and then identifies plane semantic categories through the first dense semantic map, so that the plane semantic categories of the image data to be processed can be obtained, and the accuracy and stability of plane semantic identification can be enhanced.
In a possible implementation manner, step 303 in the embodiment of the present application may be implemented as follows: the semantic map module 203 determines whether the current state of the image data processing apparatus is a motion state, and when the current state is determined to be the motion state, obtains the first dense semantic map according to the semantic segmentation result. By judging whether the apparatus is in a motion state and obtaining the first dense semantic map from the semantic segmentation result only in the motion state, the amount of calculation can be reduced.
In one possible implementation, when the current state of the image data processing apparatus is a non-moving state, that is, a still state, the image data processing apparatus uses the history dense semantic map as the first dense semantic map.
As a possible implementation manner, the image data to be processed in the embodiment of the present application is image data that has been aligned. The semantic segmentation module 202 aligns the image data to be processed before performing semantic segmentation on it, or directly adopts already aligned image data, so that the rotation-invariance constraint on the semantic segmentation model can be reduced and the recognition rate improved.
It should be noted that, in this embodiment of the application, if the first image data acquired by the semantic segmentation module 202 is not aligned, as shown in fig. 4, the method provided in this embodiment of the application may further include, before step 301: step 305, the semantic segmentation module 202 acquires first image data shot by the first device.
Alternatively, the image data processing apparatus may control the first device to capture the first image data and send the captured first image data to the semantic segmentation module 202. Of course, the first image data may be obtained by the semantic segmentation module 202 from a memory pre-stored in the image data processing apparatus, or the semantic segmentation module 202 may obtain the first image data captured by the first device from another device (e.g., a single lens reflex or a DV).
Illustratively, the first device may be a camera of the image data processing apparatus itself, or a photographing apparatus connected to the image data processing apparatus. Accordingly, step 301 may be implemented by step 3011: 3011, the semantic segmentation module 202 corrects the first image data according to the first device pose of the first device corresponding to the first image data, so as to obtain image data to be processed. It should be noted that, in the embodiment of the present application, each image data may correspond to one device pose.
In this embodiment of the application, if the semantic segmentation module 202 determines that the first image data is not aligned, the first image data may be aligned according to the device pose corresponding to the capture of the first image data. The semantic segmentation module 202 may autonomously determine that the first image data is not aligned; alternatively, in a case where the image data processing apparatus receives an operation instruction for the first image data, input by a user, instructing that the first image data be aligned, the image data processing apparatus may determine that the first image data is not aligned, and then align the first image data through the semantic segmentation module 202.
The device pose corresponding to the image data in the embodiment of the application refers to the pose of the device shooting the image data when shooting the image data. The same device may correspond to different device poses at different times. It is to be understood that if the first image data is the aligned image data, the process of aligning the image to be processed may be omitted.
As shown in fig. 5, (a) in fig. 5 shows the first image data acquired by the image data processing apparatus. As can be seen from (a) in fig. 5, the first image data is not aligned, so the image data processing apparatus can align the first image data according to the device pose of the device that captured the first image data; the aligned image data is shown in (b) in fig. 5.
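As an illustration of this pre-processing step, the following is a minimal sketch in Python, assuming the device pose supplies a roll angle about the optical axis and that alignment is approximated by rotating the image by the nearest multiple of 90 degrees; a full implementation would warp by the exact angle.

```python
import numpy as np

def align_image(image: np.ndarray, roll_deg: float) -> np.ndarray:
    """Rotate `image` so that the gravity direction points downward.

    `roll_deg` is the device roll about the optical axis taken from the
    device pose.  For simplicity the image is rotated by the nearest
    multiple of 90 degrees (np.rot90), which is an assumption made here.
    """
    # Number of 90-degree turns that best undoes the roll.
    k = int(np.round(roll_deg / 90.0)) % 4
    return np.rot90(image, k)

# Usage: an image captured with the device rotated roughly 90 degrees.
img = np.zeros((480, 640, 3), dtype=np.uint8)
aligned = align_image(img, roll_deg=92.0)
print(aligned.shape)  # (640, 480, 3) after one 90-degree turn
```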
As another possible embodiment, in conjunction with fig. 4, step 302 of the method provided in this embodiment of the present application may be implemented by the following steps 3021 and 3022:
step 3021, the semantic segmentation module 202 determines one or more plane categories corresponding to any pixel point of at least some pixel points and a probability of each plane category of the one or more plane categories. As a possible implementation manner, step 3021 in the embodiment of the present application may be specifically implemented in the following manner: the semantic segmentation module 202 performs semantic segmentation on the image data to be processed according to the neural network, and obtains one or more plane categories corresponding to any pixel point in at least part of the pixel points and the probability of each plane category in the one or more plane categories. The training and predicting process of the neural network may refer to the prior art, which is not limited in this embodiment.
Step 3022, the semantic segmentation module 202 uses a plane category with the highest probability in the one or more plane categories corresponding to any one of the pixel points as a target plane category corresponding to the any one of the pixel points, so as to obtain a semantic segmentation result of the image data to be processed. Wherein the probability of the target plane class corresponding to any one pixel point is the greatest among the probabilities of the one or more plane classes corresponding to any one pixel point.
It can be understood that any pixel point in the embodiment of the present application may correspond to one or more plane categories, and any pixel point may correspond to a probability that belongs to each plane category in the one or more plane categories. The sum of the probabilities of one or more plane classes corresponding to any one pixel point is equal to 1.
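A short sketch of steps 3021 and 3022 follows; the class ordering and probability values are illustrative only (they mimic pixel point 3 of Table 1 below) and are not taken from the application.

```python
import numpy as np

# Illustrative plane categories (hypothetical ordering).
PLANE_CLASSES = ["ground", "chair", "table", "wall"]

def target_plane_categories(probs: np.ndarray) -> np.ndarray:
    """probs has shape (H, W, C): per-pixel probability of each plane
    category, summing to 1 over the last axis.  Returns an (H, W) array of
    target plane category indices (the most probable category per pixel)."""
    assert np.allclose(probs.sum(axis=-1), 1.0), "probabilities must sum to 1"
    return np.argmax(probs, axis=-1)

# Usage with a single pixel resembling pixel point 3 of Table 1 below.
probs = np.array([[[0.10, 0.20, 0.70, 0.00]]])   # shape (1, 1, 4)
idx = target_plane_categories(probs)
print(PLANE_CLASSES[idx[0, 0]])                   # -> "table"
```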
After the semantic segmentation module 202 acquires the image data to be processed, in order to allow the semantic segmentation module 202 to identify a plane class to which each region of the image data to be processed belongs, semantic segmentation processing may be performed on the image data to be processed. It will be appreciated that the goal of semantic segmentation is to assign a class of labels to each pixel in the image data to be processed.
The image data to be processed is composed of a plurality of pixels, and semantic segmentation groups these pixels according to the different semantics they express in the image. Semantic segmentation, that is, segmenting the image data to be processed into regions with different semantics and noting the plane category to which each region belongs, such as cars, trees or faces, combines segmentation and object recognition techniques to segment an image into regions with high-level semantic content. For example, through semantic segmentation, a piece of image data may be segmented into three regions having the different semantics "cattle", "grassland" and "sky". As shown in fig. 6 (a) and fig. 6 (b), fig. 6 (a) shows image data to be processed provided by the embodiment of the present application, and fig. 6 (b) shows a schematic diagram of the image data to be processed after semantic segmentation processing. As can be seen from (b) in fig. 6, the image data to be processed is divided into four regions having different semantics, namely "ground", "table", "wall" and "chair".
In this embodiment, the semantic segmentation module 202 may determine the probability of one or more plane categories to which each of the N pixel points belongs by using a semantic segmentation model. As a possible implementation manner, in this embodiment of the present application, each pixel point may correspond to one or more plane categories, and a sum of probabilities of all the plane categories corresponding to each pixel point is equal to 1. The probability of the target plane category corresponding to any pixel point in the N pixel points is the maximum probability in the probabilities of one or more plane categories corresponding to the any pixel point.
Taking (a) in fig. 6 as an example, if the plane categories of one or more planes included in the image data to be processed are ground, table, chair, wall surface, etc., the image data processing apparatus can obtain the target plane categories to which the pixel points 1 to 4 belong through step 302, as shown in table 1:
TABLE 1 Semantic segmentation results

Pixel point | Probability of belonging to the ground | Probability of belonging to the chair | Probability of belonging to the table | Probability of belonging to the wall | Target plane category
Pixel point 1 | 1% | 98% | 1% | 0% | Chair
Pixel point 2 | 1% | 88% | 1% | 10% | Chair
Pixel point 3 | 10% | 20% | 70% | 0% | Table
Pixel point 4 | 98% | 0.5% | 1% | 0.5% | Ground
As a possible implementation manner, in the embodiment of the present application, the semantic segmentation model may adopt MobileNetV2 as the encoding network, or may be implemented by Mask R-CNN or the like. It should be understood that any other model capable of performing semantic segmentation may also be used in the embodiment of the present application to obtain the semantic segmentation result. In the embodiment of the present application, semantic segmentation using MobileNetV2 as the encoding network is taken as an example, which does not limit the semantic segmentation method and is not repeated later. In addition, the MobileNetV2 model has the advantages of small size, high speed and high precision, meets the requirements of a mobile phone platform, and enables semantic segmentation to reach a frame rate of more than 5 fps.
The probability of the plane category corresponding to each pixel point in the image data to be processed in the two-dimensional space can be obtained by performing semantic segmentation on the image data to be processed. Accordingly, as shown in fig. 4, step 302 in the embodiment of the present application may be implemented by: the semantic segmentation module 202 determines a semantic segmentation result of the image data to be processed according to the probability of one or more plane categories corresponding to each pixel point in at least part of the N pixel points. That is, the semantic segmentation module 202 determines the plane class with the highest probability corresponding to each of at least some pixel points as the respective target plane class of each pixel point, so as to obtain the semantic segmentation result of the image data to be processed.
In a possible embodiment, in order to improve the accuracy of semantic segmentation, as shown in fig. 4, the method provided in this embodiment of the present application may further include, after step 302 and before step 303: step 306, the semantic segmentation module 202 performs an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information included in the depth image corresponding to the image data to be processed, where the optimization operation is used to correct noise in the semantic segmentation result and errors introduced by the segmentation process. For example, a pixel point A in the image data may correspond to a ground region close to a table; the semantic segmentation result may give pixel point A the target plane category "table" although its target plane category should actually be "ground", in which case the target plane category of pixel point A may be corrected from table to ground. Or, if a certain pixel point B has not been segmented, the target plane category of pixel point B can be determined by performing the optimization operation. The specific algorithm of the optimization operation may refer to the prior art and is not described in detail in this embodiment.
The depth information in the embodiment of the present application includes the distance between each pixel point and the device that captured the image data to be processed. The purpose of performing the optimization operation on the semantic segmentation result in the embodiment of the application is to optimize and repair the semantic segmentation result. The semantic segmentation result can be filtered and corrected through the depth information, avoiding mis-segmentation and missed segmentation in the semantic segmentation result. For the detailed process of performing the optimization operation on the semantic segmentation result, reference may be made to the following description of fig. 10 and fig. 11, which is not repeated here.
As a possible implementation manner, the semantic map module 203 in this embodiment may determine whether the current state of the image data processing apparatus is a motion state (as part of step 303) as follows: the semantic map module 203 acquires second image data captured by the camera, and determines whether the current state of the image data processing apparatus is a motion state according to the difference between a first device pose corresponding to the image data to be processed and a second device pose corresponding to the second image data, and the inter-frame difference between the second image data and the image data to be processed.
Specifically, as shown in fig. 8, in a case that the difference between the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data is less than or equal to a first threshold, and the inter-frame difference between the second image data and the image data to be processed is greater than a second threshold, the semantic map module 203 determines that the current state of the image data processing apparatus is a motion state. The second image data is the frame adjacent to and immediately preceding the image data to be processed. The specific process can be referred to fig. 8.
Further, as shown in fig. 8, in the case where the difference between the first apparatus pose corresponding to the image data to be processed and the second apparatus pose corresponding to the second image data is less than or equal to the first threshold value and the inter-frame difference between the second image data and the image data to be processed is less than or equal to the second threshold value, the image data processing apparatus determines that the current state of the image data processing apparatus is a still state. In the case where the current state is the still state, the image data processing apparatus may directly use the history dense semantic map as the first dense semantic map and perform the subsequent processing.
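A minimal sketch of this motion/still decision is given below; the pose-difference and inter-frame-difference measures (Frobenius norm of the 4x4 pose difference and mean absolute pixel difference) and the threshold values are assumptions made for illustration.

```python
import numpy as np

def is_moving(pose1: np.ndarray, pose2: np.ndarray,
              frame1: np.ndarray, frame2: np.ndarray,
              pose_thresh: float, frame_thresh: float) -> bool:
    """Decide whether the apparatus is in the motion state.

    pose1/pose2 are 4x4 device poses of the image to be processed and of the
    previous frame; frame1/frame2 are the corresponding images.  Following
    the rule above, the state is 'motion' when the pose difference is small
    (<= first threshold) but the inter-frame difference is large
    (> second threshold); otherwise the still state applies.
    """
    pose_diff = np.linalg.norm(pose1 - pose2)
    frame_diff = np.mean(np.abs(frame1.astype(np.float32) - frame2.astype(np.float32)))
    return pose_diff <= pose_thresh and frame_diff > frame_thresh

# Usage (hypothetical thresholds):
# moving = is_moving(pose_t, pose_prev, frame_t, frame_prev, 0.05, 3.0)
```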
The historical dense semantic map in the embodiment of the present application may be stored in the image data processing apparatus, or may be obtained by the image data processing apparatus from another device, which is not limited in the embodiment of the present application. The historical dense semantic map is the semantic map result generated and stored historically, and it is updated after each new frame of image data arrives. Optionally, the historical dense semantic map is the dense semantic map corresponding to the frame preceding the current frame of image data, or a combination of the dense semantic maps corresponding to several preceding frames.
As a possible implementation manner, step 303 of the embodiment of the present application may be implemented as follows: the semantic map module 203 obtains a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed, and directly takes the second dense semantic map as the first dense semantic map. That is, each time a second dense semantic map is computed, it is used directly in the subsequent computation.
In the embodiment of the present application, the depth image corresponding to the image data to be processed refers to an image that has the same size as the image data to be processed and whose element values are the depth values of the scene points corresponding to the image points in the image data to be processed. Specifically, the image data to be processed is acquired by the image acquisition device shown in fig. 2, and the depth image corresponding to the image data to be processed is acquired by the TOF shown in the figure.
In the embodiment of the application, the depth information can be acquired by adopting a TOF camera, structured light, laser scanning and other modes, so that a depth image is obtained. It should be understood that any other manner (or camera) that can obtain a depth image may be used in the embodiments of the present application to obtain a depth image. Hereinafter, only the depth image obtained by using the TOF camera is taken as an example for explanation, but this does not cause limitation on the way of obtaining the depth image, and is not described in detail later.
Although the point cloud is a three-dimensional concept and the pixel point in the depth image is a two-dimensional concept, when the depth value of a certain point in the two-dimensional image is known, the image coordinate of the point can be converted into world coordinates in a three-dimensional space, and therefore, the point cloud in the three-dimensional space can be recovered from the depth image. For example, converting image coordinates to world coordinates may be accomplished using principles of visual geometry. According to the principle of visual geometry, the process of mapping the three-dimensional point M (Xw, Yw, Zw) in the world coordinate system to the point M (u, v) on the image is shown in fig. 7, where the Xc axis of the dotted line in fig. 7 is obtained by translating the Xc axis based on the solid line, and the Yc axis of the dotted line is obtained by translating the Yc axis based on the solid line.
Fig. 7 satisfies the following mathematical relationship:
$$ Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/dx & 0 & u_0 \\ 0 & f/dy & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} $$

where u and v are arbitrary coordinates in the image coordinate system, f is the focal length of the camera, dx and dy are the pixel sizes in the x and y directions, respectively, and u_0 and v_0 are the coordinates of the image center. X_w, Y_w, Z_w represent a three-dimensional coordinate point in the world coordinate system. Z_c represents the Z-axis value of the camera coordinates, i.e., the object-to-camera distance. R and T are respectively the 3x3 rotation matrix and the 3x1 translation matrix of the extrinsic matrix.
Firstly, the depth map can be restored to a point cloud based on a camera coordinate system, namely, a rotation matrix R is used as a unit matrix, a translation vector T is 0, and the following can be obtained:
$$ Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/dx & 0 & u_0 \\ 0 & f/dy & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} $$
wherein Xc, Yc and Zc are three-dimensional point coordinates in a camera coordinate system.
The following can be derived from the above formula:
$$ X_c = \frac{(u - u_0)\, dx \cdot Z_c}{f}, \qquad Y_c = \frac{(v - v_0)\, dy \cdot Z_c}{f} $$
Z_c represents a value on the depth map, and the depth unit acquired by the TOF is millimeter (mm), so the coordinates of a three-dimensional point in the camera coordinate system can be calculated, and then the point cloud data converted to the world coordinate system can be obtained according to the device pose (R, T) calculated by the SLAM module. Specifically, the following formula is used:
$$ \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} = R^{-1} \left( \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} - T \right) $$
when the device pose calculated by the SLAM module and the depth data acquired by the TOF are accurate, a better point cloud registration result can be obtained. The three-dimensional point in this embodiment is a three-dimensional pixel point, that is, the two-dimensional pixel point involved in steps 301 and 302 is converted into a three-dimensional pixel point.
As another possible implementation manner, step 303 of the embodiment of the present application may also be implemented as follows: the semantic map module 203 obtains a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed (for how the plurality of two-dimensional pixel points and the depth image are combined to obtain the plurality of three-dimensional points included in the second dense semantic map, reference may be made to the prior art). The semantic map module 203 then updates the historical dense semantic map with one or more second three-dimensional points in the second three-dimensional point cloud of the second dense semantic map to obtain the first dense semantic map. In this case, instead of directly using the second dense semantic map as the first dense semantic map, only a portion of the three-dimensional points in the second three-dimensional point cloud may be used for the update: the probabilities of the target plane categories of the corresponding three-dimensional points of the historical dense semantic map are simply replaced by the probabilities of the target plane categories of some of the three-dimensional points in the second dense semantic map, so the update may apply to only a portion of the dense semantic map.
Specifically, the semantic map module 203 uses the one or more second three-dimensional points in the second three-dimensional point cloud of the second dense semantic map to update the probabilities, or the probabilities of the target plane categories, of the corresponding three-dimensional points in the historical dense semantic map, thereby obtaining the first dense semantic map. It should be understood that this update means that, for a three-dimensional point A, the probability of the target plane category of point A in the historical dense semantic map is replaced by the probability of the target plane category of point A in the second dense semantic map.
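A minimal sketch of this partial update follows, assuming the dense semantic maps are represented as dictionaries keyed by a quantized three-dimensional position; the data structure is an assumption made for illustration.

```python
import numpy as np

def update_dense_semantic_map(history: dict, second: dict) -> dict:
    """Sketch of updating the historical dense semantic map with one or more
    second three-dimensional points.

    Both maps are represented, for illustration only, as dictionaries keyed
    by a quantized 3-D position whose values are per-category probability
    vectors.  For every point present in the second dense semantic map, its
    probability vector replaces the vector stored for the corresponding
    point of the historical map, yielding the first dense semantic map.
    """
    first = dict(history)                 # keep points not observed this frame
    for key, probs in second.items():
        first[key] = probs                # replace with this frame's probabilities
    return first

# Usage: the point (1, 2, 3) is re-observed and its probabilities replaced.
history = {(1, 2, 3): np.array([0.7, 0.2, 0.1])}
second = {(1, 2, 3): np.array([0.1, 0.8, 0.1]), (4, 5, 6): np.array([0.3, 0.3, 0.4])}
first = update_dense_semantic_map(history, second)
```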
As a possible implementation manner, as shown in fig. 4, step 304 in the embodiment of the present application may be specifically implemented by the following manner: step 3041, the semantic clustering module 204 determines a plane equation for each of the one or more planes. For example, the semantic clustering module 204 performs plane fitting on the three-dimensional point cloud data of each pixel point to obtain a plane equation.
Specifically, the semantic clustering module 204 may perform plane fitting on the three-dimensional point cloud data of each pixel point by using a RANSAC method or an SVD equation solving method to obtain a plane equation.
It is to be understood that, after the image data processing apparatus in the embodiment of the present application obtains the plane equation of each plane, the area of each plane and the orientation of each plane can be determined. Taking the plane equation AX + BY + CZ + D = 0 as an example, the normal vector of the plane is n = (A, B, C). The normal vector is used to indicate the orientation of the plane. The orientation of the plane in the embodiments of the present application may also be expressed as the direction of the plane.
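The SVD-based fitting option mentioned in step 3041 can be sketched as follows; the least-squares formulation below is a generic plane fit, not the application's exact implementation.

```python
import numpy as np

def fit_plane(points: np.ndarray):
    """Least-squares plane fit AX + BY + CZ + D = 0 through a set of
    three-dimensional points using SVD (the RANSAC variant is sketched later
    for the ground).  Returns the unit normal n = (A, B, C) and offset D."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value of the
    # centred points is the direction of least variance, i.e. the normal.
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]
    d = -np.dot(n, centroid)
    return n, d

# Usage: points scattered on the plane z = 2 give a normal close to (0, 0, 1).
rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(100, 2))
pts = np.column_stack([xy, np.full(100, 2.0)])
n, d = fit_plane(pts)
print(np.round(n, 3), round(d, 3))   # ~ [0. 0. 1.] and -2.0 (up to sign)
```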
Semantic clustering module 204 performs the following steps 3042 and 3043 on any of the one or more planes to obtain a plane semantic category for the one or more planes: step 3042, the semantic clustering module 204 determines one or more object plane categories corresponding to the any plane and the confidence of the one or more object plane categories according to the plane equation of the any plane and the first dense semantic map.
In a possible implementation manner, step 3042 in this embodiment of the present application may be implemented as follows: the semantic clustering module 204 determines M first three-dimensional points from the first dense semantic map according to the plane equation of the any one plane, where the distance between each of the M first three-dimensional points and the any one plane is smaller than a third threshold, and M is a positive integer; the semantic clustering module 204 determines one or more target plane categories corresponding to the M first three-dimensional points, whose orientation is consistent with the orientation of the any one plane, as the one or more target plane categories corresponding to the any one plane, and counts the proportion of the number of three-dimensional points corresponding to each of the one or more target plane categories among the M first three-dimensional points to obtain the confidence of the one or more target plane categories. The specific value of the third threshold is not limited in the embodiment of the application and can be set as required in practice.
In the embodiment of the present application, the M first three-dimensional points determined from the first dense semantic map may be regarded as three-dimensional points belonging to any plane. Because the plane class to which each three-dimensional point of the M first three-dimensional points belongs can be determined, and the plane classes to which different three-dimensional points belong can be the same or different, for example, the plane class to which the three-dimensional point a of the M first three-dimensional points belongs is "ground", and the plane class to which the three-dimensional point B of the M first three-dimensional points belongs is "table", one or more plane classes corresponding to the M first three-dimensional points can be obtained according to the plane class to which each three-dimensional point of the M first three-dimensional points belongs. Since the M first three-dimensional points determined from the first dense semantic map are considered to be three-dimensional points belonging to any plane, it can be determined that any plane also corresponds to the one or more plane classes. The plane category of each of the M first three-dimensional points may be a target plane category of the two-dimensional pixel point corresponding to the three-dimensional point mentioned in the previous embodiment. For example, the step 3022 may be adopted to obtain a target plane category of each pixel point and use the target plane category as the plane category of the three-dimensional point corresponding to each pixel point, so that one or more target plane categories corresponding to the M first three-dimensional points may be obtained.
For example, suppose that among the M first three-dimensional points, N1 three-dimensional points belong to the plane category "ground", that is, the number of three-dimensional points whose plane category is "ground" is N1; N2 three-dimensional points belong to the plane category "table", that is, the number of three-dimensional points whose plane category is "table" is N2; and N3 three-dimensional points belong to the plane category "wall", that is, the number of three-dimensional points whose plane category is "wall" is N3, where N1 + N2 + N3 is less than or equal to M, and N1, N2 and N3 are positive integers. Then the proportion of three-dimensional points with the plane category "ground" among the M first three-dimensional points is N1/M, the proportion with the plane category "table" is N2/M, and the proportion with the plane category "wall" is N3/M. The confidences of the one or more plane categories of the any one plane are therefore N1/M, N2/M and N3/M. If N2/M > N1/M and N2/M > N3/M, the semantic plane category of the any one plane is "table".
Step 3043, the semantic clustering module 204 selects the object plane category with the highest confidence from the one or more object plane categories as the semantic plane category of any one of the planes.
For example, the confidence that the plane a corresponds to the ground is P1, the confidence that the plane a corresponds to the table is P2, the confidence that the plane a corresponds to the wall is P3, and P1 > P2 > P3, so the semantic clustering module 204 may determine that the semantic plane class of the plane a is the ground.
Any plane may correspond to one or more target plane categories, but not all of the target plane categories corresponding to the any plane necessarily have an orientation consistent with the orientation of the any plane; that is, a plane may correspond both to target plane categories whose orientation is consistent with the orientation of the plane and to target plane categories whose orientation is not, and a target plane category whose orientation is inconsistent with the orientation of the plane is less likely to be the semantic plane category of the plane than one whose orientation is consistent. Based on this, in order to simplify the subsequent calculation and reduce calculation errors, in one possible implementation manner, the orientation of the one or more target plane categories corresponding to any plane in the embodiment of the present application is consistent with the orientation of the any plane. That is, the one or more target plane categories are the plane categories, selected by the image data processing apparatus from all target plane categories corresponding to the any plane, whose orientation is consistent with the orientation of the any plane. The one or more target plane categories may be all or only some of the target plane categories corresponding to the any plane, which is not limited in this application. In this embodiment, all the target plane categories corresponding to the any plane may be regarded as all the target plane categories corresponding to the M first three-dimensional points.
For example, the orientation of plane a is downward, the orientation of plane class (ground) is upward, the orientation of plane class (desk) is downward, and the orientation of plane class (ceiling) is downward, so in calculating the confidence that plane a belongs to one or more plane classes, the confidence that plane a is determined to belong to a plane class (ground) can be culled. This can reduce the computational burden on the image data processing apparatus and improve the computational accuracy.
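The counting of steps 3042 and 3043, together with the orientation filtering just described, can be illustrated with the following sketch; the orientation test via a cosine threshold and the per-category reference normals are assumptions made for illustration.

```python
import numpy as np

def plane_semantic_category(points, point_classes, class_normals,
                            n, d, dist_thresh, cos_thresh=0.9):
    """Sketch of steps 3042/3043 for a plane with unit normal n and offset d
    (plane equation n·x + d = 0).

    points        : (M0, 3) first three-dimensional points of the dense map.
    points_classes: (M0,) target plane category of each point.
    class_normals : dict mapping a category id to a reference unit normal
                    (an assumption used for the orientation consistency test).
    Returns the semantic plane category and the per-category confidences.
    """
    # Keep the M points whose distance to the plane is below the threshold.
    dist = np.abs(points @ n + d)
    classes_near = point_classes[dist < dist_thresh]
    if classes_near.size == 0:
        return None, {}

    confidences = {}
    for c in np.unique(classes_near):
        # Discard categories whose orientation is inconsistent with the plane.
        if abs(np.dot(class_normals[int(c)], n)) < cos_thresh:
            continue
        # Confidence = share of the M points carrying this category.
        confidences[int(c)] = np.sum(classes_near == c) / classes_near.size
    if not confidences:
        return None, {}
    best = max(confidences, key=confidences.get)
    return best, confidences
```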
In a possible implementation manner, after the semantic clustering module 204 counts a ratio of the number of three-dimensional points corresponding to each of the one or more target plane categories in the M first three-dimensional points to obtain a confidence of the one or more target plane categories, the method provided in the embodiment of the present application further includes: the semantic clustering module 204 updates the confidence of the one or more object plane classes corresponding to any one of the planes according to at least one of bayesian theorem or voting mechanism.
Specifically, the semantic clustering module 204 performs plane fitting on the three-dimensional point cloud data to obtain a plane equation of the form AX + BY + CZ + D = 0, where A, B, C and D are the plane equation parameters to be solved, and the optimal parameters are solved from a plurality of points; the specific fitting scheme can be referred to the prior art. The outermost points of all point sets participating in the calculation are taken as the boundary points of the plane. The normal vector of the plane, n = (A, B, C), can be taken as the direction vector of the plane, and the area of the plane is defined as the area of the minimum bounding rectangle of the boundary points of the plane.
Semantic clustering module 204 then counts and screens M first three-dimensional points from the first dense semantic map that correspond to one or more target plane categories for which the distance to the plane is less than a third threshold based on the plane equation, orientation, and area of the detected plane. The semantic clustering module 204 normalizes the three-dimensional points of each plane category in one or more target plane categories as the confidence of the plane category, i.e. counts the ratio of the number of three-dimensional points included in each target plane category to the number of all three-dimensional points (M first three-dimensional points). Based on Bayesian theorem and voting mechanism, the confidence of various plane categories recorded last time is updated, and the plane category with the highest confidence at present is selected as the plane category of the plane semantics, so that the accuracy and stability of plane semantics identification can be enhanced.
The semantic clustering module 204 specifically uses bayes theorem and voting mechanism to count the confidence that a plane calculated before the current time belongs to multiple plane categories, so as to modify and update the confidence that the plane calculated at the current time belongs to multiple plane categories according to the obtained confidence.
For example, suppose the maximum number of votes under the voting mechanism is MAX_VOTE_COUNT and the initial number of votes is 0. If the plane category to which a certain three-dimensional point C in the current frame belongs is consistent with the plane category to which the three-dimensional point C belongs in the frame preceding the current frame, the number of votes (VOTE) corresponding to the three-dimensional point C is increased by 1, and the plane category probability prob of the three-dimensional point C is updated so that its value slides between the mean and the maximum of the two values. For example,

$$ prob_c \leftarrow \alpha \cdot \max(prob_c, prob_p) + (1 - \alpha) \cdot \frac{prob_c + prob_p}{2} $$

where prob_c represents the probability distribution of the plane category to which the three-dimensional point C of the current frame belongs, prob_p represents the probability distribution of the plane category to which the three-dimensional point C of the frame preceding the current frame belongs, and alpha is VOTE/MAX_VOTE_COUNT.

If the plane category to which a certain three-dimensional point C in the current frame belongs is not consistent with the plane category to which the three-dimensional point C belongs in the previous frame, the number of votes VOTE is decreased by 1, and the plane category probability prob is updated to 80% of its value.
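The voting update can be illustrated with the following sketch; MAX_VOTE_COUNT and the exact blending formula are assumptions consistent with the description above (the value slides from the mean towards the maximum as the vote count grows).

```python
MAX_VOTE_COUNT = 10  # assumed maximum number of votes

def update_point(prob_c, prob_p, vote, same_class):
    """Sketch of the per-point voting update described above.

    prob_c / prob_p: plane-category probabilities of point C in the current
    and previous frame; vote: its current vote count.  When the categories
    agree, the vote count grows and the probability slides between the mean
    and the maximum of the two values with alpha = VOTE / MAX_VOTE_COUNT;
    when they disagree, the vote count drops and the probability is damped
    to 80 % of its value.
    """
    if same_class:
        vote = min(vote + 1, MAX_VOTE_COUNT)
        alpha = vote / MAX_VOTE_COUNT
        prob = alpha * max(prob_c, prob_p) + (1 - alpha) * (prob_c + prob_p) / 2
    else:
        vote = max(vote - 1, 0)
        prob = 0.8 * prob_c
    return prob, vote

print(update_point(0.7, 0.9, vote=4, same_class=True))  # -> (0.85, 5)
```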
Specifically, step 304 may be implemented as described in fig. 9: in step 901, the semantic clustering module 204 performs a plane detection step to obtain one or more planes included in the image data to be processed. Since the semantic clustering module 204 calculates the semantic plane category of each of the one or more planes in the same manner and on the same principle, the following steps are illustrated by the process in which the image data processing apparatus calculates the semantic plane category of a first plane, which is not limiting.
Step 902, the semantic clustering module 204 obtains a plane equation of the first plane. Step 903, the semantic clustering module 204 calculates the area of the first plane. Step 904, the semantic clustering module 204 calculates the orientation of the first plane. For specific implementations of step 903 and step 904, reference may be made to processes in the prior art for calculating the area and orientation of a plane, which are not described here again. Step 905, the semantic clustering module 204 counts the M three-dimensional points, among the three-dimensional points of the various plane categories in the first dense semantic map, whose distance from the first plane is smaller than the third threshold. Step 906, the semantic clustering module 204 determines whether the orientation of each of the one or more target plane categories corresponding to the M three-dimensional points is consistent with the orientation of the first plane.
In step 907, if the orientations of the various target plane categories are consistent with the orientation of the first plane, the semantic clustering module 204 determines, according to the area of the first plane, whether the number of three-dimensional points of each target plane category per unit area meets a threshold.

Step 908, if it is determined according to the plane area that the number of three-dimensional points of each target plane category per unit area satisfies the threshold, the semantic clustering module 204 normalizes the number of three-dimensional points of each target plane category, that is, calculates the ratio of the number of three-dimensional points of each target plane category to the M first three-dimensional points, so as to obtain the confidence that the first plane belongs to the one or more target plane categories. In step 909, the semantic clustering module 204 performs a Bayesian probability update combining the currently calculated confidences that the first plane belongs to the one or more target plane categories with the previously calculated confidences that the first plane belongs to the various target plane categories. Step 910, the semantic clustering module 204 takes the target plane category with the highest current confidence of the first plane as the plane category of the first plane.

It should be noted that, if the orientation of each target plane category is not consistent with the orientation of the first plane, the semantic clustering module 204 determines that the procedure is stopped. Further, if the semantic clustering module 204 judges, according to the area of the first plane, that the number of three-dimensional points of each target plane category per unit area does not satisfy the threshold, the image data processing apparatus determines that the procedure is stopped.
In the embodiment of the present application, the specific steps of the semantic segmentation module 202 performing the optimization operation on the semantic segmentation result according to the image data to be processed and the depth information of the image data to be processed include a Random Sample Consensus (RANSAC) ground equation estimation process as described in fig. 10 and a semantic seed point region growing process as shown in fig. 11.
(1) RANSAC ground equation estimation
The ground (Floor) is an important component of a scene and has the following characteristics: the floor occupies a large planar area; the floor is an important reference for SLAM initialization; the ground is easier to detect and recognize than other semantic targets; many objects in the scene are located above the ground; and the height of objects in the scene is defined relative to the ground. Therefore, it is necessary to preferentially segment the ground and solve its plane equation.
The RANSAC algorithm, also called the random sample consensus estimation method, is a highly robust estimation method that is well suited to plane estimation in regions such as the ground. It relies on the result of the deep neural network semantic segmentation to extract the ground semantic pixels, that is, to extract the FLOOR pixels (FLOOR three-dimensional points) from the plurality of three-dimensional points and to acquire the point cloud data formed from their depth information, so as to estimate the ground equation based on RANSAC. The specific steps are shown in fig. 10:
as a possible implementation manner, in the embodiment of the present application, when the plane type is the ground, the ground equation may also be estimated by using an AI manner.
Step 1011: the semantic segmentation module 202 obtains, through the semantic segmentation processing, the P three-dimensional points belonging to the ground. Let the number of iterations of the RANSAC algorithm be M; if M is greater than 0, the image data processing apparatus randomly selects l (for example, l = 3) three-dimensional points from the P three-dimensional points as sampling points; otherwise, execution jumps to step 1016.
Step 1012: the semantic segmentation module 202 substitutes the three-dimensional coordinates of the l three-dimensional points into the plane equation Ax + By + Cz = 1, and solves the parameter n = (A, B, C) of the plane equation using singular value decomposition (SVD).
Step 1013: the semantic segmentation module 202 substitutes the three-dimensional coordinates q of each of the P three-dimensional points into the estimated plane equation and calculates the scalar distance d from each point to the plane; if d is smaller than a preset threshold η, the point is regarded as an interior point, and the number k of interior points is counted. Wherein,

$$ d = \frac{|n^{\top} q - 1|}{\| n \|} $$
step 1014: the semantic segmentation module 202 determines the number K of the iteration interior points and the optimal number K of the iteration interior points, if K is less than K, the semantic segmentation module 202 decreases the iteration number M of the RANSAC algorithm by 1, and jumps to step 1011 to execute, otherwise, continues to execute.
Step 1015: the semantic segmentation module 202 assigns the number k of interior points of this iteration to the optimal number of interior points K, stores the indices of the optimal interior points, calculates the proportion of interior points v = K/P, and modifies the iteration number M according to the formula

$$ M = \frac{\log(1 - w)}{\log(1 - v^{n})} $$

where w = 0.99 and n = 3.
Step 1016: the semantic segmentation module 202 re-estimates the plane equation using the K optimal interior points, that is, establishes an over-determined system of K equations and obtains a globally optimal plane equation using SVD.
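The loop of steps 1011 to 1016 can be condensed into the following sketch; the inlier threshold default and the use of a least-squares solve (itself SVD-based) for both the sample and the final re-estimation are assumptions made for illustration.

```python
import numpy as np

def ransac_ground_plane(points: np.ndarray, eta: float = 0.02,
                        w: float = 0.99, l: int = 3, max_iter: int = 200):
    """Sketch of the RANSAC ground-equation estimation for the model
    Ax + By + Cz = 1.  `points` are the P ground three-dimensional points;
    eta is the inlier distance threshold (value assumed).  Returns the
    plane parameters n = (A, B, C) refined from the best inlier set."""
    P = len(points)
    best_idx, M, i = None, max_iter, 0
    while i < M:
        sample = points[np.random.choice(P, l, replace=False)]
        # Solve sample @ n = 1 in the least-squares sense.
        n, *_ = np.linalg.lstsq(sample, np.ones(l), rcond=None)
        # Point-to-plane distance for all P points.
        d = np.abs(points @ n - 1.0) / np.linalg.norm(n)
        inliers = np.flatnonzero(d < eta)
        if best_idx is None or len(inliers) > len(best_idx):
            best_idx = inliers
            v = len(inliers) / P                       # inlier ratio
            if 0 < v < 1:
                # Standard RANSAC update of the iteration count.
                M = min(M, int(np.log(1 - w) / np.log(1 - v ** l)) + 1)
        i += 1
    # Re-estimate the plane from all best inliers (over-determined system).
    n, *_ = np.linalg.lstsq(points[best_idx], np.ones(len(best_idx)), rcond=None)
    return n
```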
(2) Semantic seed point region growing
Aiming at the problem of under-segmentation or over-segmentation in the result of the neural network semantic segmentation, the semantic seeds are grown as regions in combination with depth information to expand the segmented regions and correct the segmentation result. The number of pixels of each semantic segmentation category is used as the index of region-growing priority, so that categories with a larger number of pixels are grown first; however, the ground has the highest priority, that is, the ground is grown first and the other categories afterwards.
The region growing algorithm is based on the degree of similarity between the seed point and its neighborhood points: adjacent points with high similarity are merged and growth continues outwards until no neighborhood point satisfies the similarity condition. A typical 8-neighborhood is selected for region growing, and the similarity condition is expressed using depth and color information at the same time, so that under-segmented regions can be well corrected. The so-called seed point is the initial point of region growth, and the growth diffuses outwards using a breadth-first search (BFS). The specific steps are shown in fig. 11:
step 1101: the semantic segmentation module 202 traverses the semantic segmentation class priority list, and pushes the plane class with high priority (push to the seed point stack) first for region growing.Illustratively, let the seed point of the current push category stack as
Figure PCTCN2020074040-APPB-000008
That is, the method includes K seed points, and the coordinates of the two-dimensional pixel point corresponding to each seed point are (i, j). The priority list is a statistical division result, and is created according to the number of plane classes.
Step 1102: if the seed point stack is not empty, the semantic segmentation module 202 pops the last seed point s_K(i, j) from the stack and deletes it, and judges whether the category of the neighborhood point p(i + m, j + n) is OTHER; if so, execution continues with the next step; otherwise, execution jumps to step 1101.
Step 1103: the semantic segmentation module 202 computes the similarity distance d between the seed point s_K and the neighborhood point p; if the similarity distance d is smaller than a given threshold η, execution continues with the next step; otherwise, execution jumps to step 1101. The similarity distance d is expressed in terms of both the depth difference and the color difference between the seed point and the neighborhood point.
step 1104: the semantic segmentation module 202 pushes the neighborhood point p satisfying the similarity condition into the seed point heap
Figure PCTCN2020074040-APPB-000010
Then the process jumps to step 1101 for execution. Then, a semantic map is established according to the method described above, and plane detection and recognition are completed, so that a stable and accurate plane semantic result can be obtained, as shown in fig. 12 below.
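A compact sketch of the region growing of steps 1101 to 1104 is given below; the concrete similarity distance (absolute depth difference plus Euclidean colour distance) and the use of a double-ended queue as the seed container are assumptions made for illustration.

```python
from collections import deque
import numpy as np

OTHER = -1  # label of pixels that have not received any plane category

def grow_region(labels, depth, color, seeds, category, eta):
    """Sketch of the semantic seed-point region growing.

    labels : (H, W) int array of per-pixel categories, OTHER where unlabelled.
    depth, color : (H, W) and (H, W, 3) arrays supplying the similarity cues.
    seeds : iterable of (i, j) seed pixels of the category being grown.
    """
    h, w = labels.shape
    stack = deque(seeds)                                   # seed point stack
    while stack:
        i, j = stack.pop()                                 # pop the last seed point (step 1102)
        for m, n in ((-1, -1), (-1, 0), (-1, 1), (0, -1),  # 8-neighbourhood
                     (0, 1), (1, -1), (1, 0), (1, 1)):
            pi, pj = i + m, j + n
            if not (0 <= pi < h and 0 <= pj < w) or labels[pi, pj] != OTHER:
                continue
            d = abs(float(depth[i, j]) - float(depth[pi, pj])) + float(
                np.linalg.norm(color[i, j].astype(np.float32)
                               - color[pi, pj].astype(np.float32)))
            if d < eta:                                    # similarity condition (step 1103)
                labels[pi, pj] = category                  # grow the region (step 1104)
                stack.append((pi, pj))
    return labels
```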
The above description has mainly described the aspects of the embodiments of the present application from the perspective of an image data processing apparatus. It is to be understood that the image data processing apparatus and the like include hardware structures and/or software modules corresponding to the respective functions for realizing the above-described functions. Those of skill in the art would readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The image data processing apparatus according to the embodiments of the present application may divide the functional units according to the above method; for example, one functional unit may be provided for each function, or two or more functions may be integrated into one processing unit. The integrated unit can be implemented in the form of hardware, or in the form of a software functional unit. It should be noted that the division of units in the embodiments of the present application is schematic and is only a division by logical function; there may be other division manners in actual implementation.
In the case of dividing each functional module by corresponding functions, fig. 2 shows a schematic diagram of a possible structure of the image data processing apparatus according to the above-described embodiment, and the image data processing apparatus includes: a semantic segmentation module 202, a semantic map module 203, and a semantic clustering module 204. The semantic segmentation module 202 is configured to support the image data processing apparatus to perform step 301 and step 302 in the above embodiments. The semantic map module 203 is used to support the image data processing apparatus to execute step 303 in the above embodiments. The semantic clustering module 204 is used to support the image data processing apparatus to execute step 304 in the above embodiments.
In a possible embodiment, the semantic segmentation module 202 is further configured to support the image data processing apparatus to perform step 305 in the above-described embodiment. The semantic segmentation module 202 is used to support the image data processing apparatus to execute step 3011 in the above embodiment. The semantic segmentation module 202 is configured to support the image data processing apparatus to execute step 306, step 3021, and step 3022 in the above embodiment. In a possible embodiment, the semantic clustering module 204 is used to support the image data processing apparatus to perform the steps 3041, 3042 and 3043 in the above embodiments. Furthermore, the semantic clustering module 204 is also used to support the image data processing apparatus to execute steps 901 to 910 in the above embodiments. The present apparatus may be implemented in software, for example, and stored in a storage medium.
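Purely as an illustrative sketch of this functional division (not taken from the patent; the class and method names are assumptions introduced for the example), the three modules might be composed as follows:

```python
class ImageDataProcessor:
    """Illustrative composition of the functional modules described above."""

    def __init__(self, segmentation_module, map_module, clustering_module):
        self.segmentation_module = segmentation_module  # semantic segmentation module 202
        self.map_module = map_module                    # semantic map module 203
        self.clustering_module = clustering_module      # semantic clustering module 204

    def process(self, image, depth_image):
        seg_result = self.segmentation_module.segment(image)           # steps 301/302
        dense_map = self.map_module.build(seg_result, depth_image)     # step 303
        return self.clustering_module.identify_planes(dense_map)       # step 304
```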
An image data processing apparatus in the present embodiment has been described above from the perspective of modular functional entities, and is described below from the perspective of hardware processing. Fig. 13 is a schematic diagram of a possible hardware configuration of the image data processing apparatus according to the above embodiments. The image data processing apparatus includes a first processor 1301 and a second processor 1302. Optionally, the image data processing apparatus may further include a communication interface 1303, a memory 1304, and a bus 1305, where the communication interface 1303 may include an input interface 13031 and an output interface 13032. When the image data processing apparatus is an electronic device, the first processor 1301 and the second processor 1302 may be the processor 120 shown in fig. 1; for example, the first processor 1301 may be a DSP or a CPU, and the second processor 1302 may be an NPU. The communication interface 1303 may be the input device 140 in fig. 1. The memory 1304 is configured to store program code and data used by the image data processing apparatus, and corresponds to the memory 130 in fig. 1. The bus 1305 may be internal to the processor 120 of fig. 1.
In this configuration, the first processor 1301 and the second processor 1302 each perform a part of the image data processing method described above. For example, the first processor 1301 is configured to support the image data processing apparatus in executing step 301 of the above embodiments, the second processor 1302 is configured to support the apparatus in executing step 302, and the first processor 1301 is further configured to support the apparatus in executing steps 303 and 304.
In a possible embodiment, the first processor 1301 is further configured to support the image data processing apparatus to perform step 305, step 3011, step 3041, step 3042, and step 3043 in the foregoing embodiment. The second processor 1302 is further configured to support the image data processing apparatus to execute step 306, step 3021, and step 3022 in the foregoing embodiment. Optionally, the first processor 1301 is further configured to support the image data processing apparatus to perform steps 901 to 910 in the above embodiments.
In some possible embodiments, the first processor 1301 or the second processor 1302 may be a single-processor structure, a multi-processor structure, a single-threaded processor, a multi-threaded processor, or the like. In some possible embodiments, the first processor 1301 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The second processor 1302 may be a neural network processor, which may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. A processor may also be a combination of devices implementing computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
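As a hypothetical sketch of such a division of labor between the two processors (the idea of overlapping the second processor's segmentation of the next frame with the first processor's map building for the current frame, as well as all function names, are assumptions rather than the patent's stated implementation), the steps could be pipelined as follows:

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(frames, segment_on_second_processor, build_map_and_identify_planes):
    """Hypothetical pipelining: while the second processor segments frame k+1
    (step 302), the first processor builds the dense semantic map and identifies
    plane semantic categories for frame k (steps 303 and 304)."""
    results = []
    if not frames:
        return results
    with ThreadPoolExecutor(max_workers=1) as npu_pool:
        pending = npu_pool.submit(segment_on_second_processor, frames[0])
        for next_frame in frames[1:]:
            seg_result = pending.result()                                  # wait for the accelerator
            pending = npu_pool.submit(segment_on_second_processor, next_frame)
            results.append(build_map_and_identify_planes(seg_result))      # host-side steps 303/304
        results.append(build_map_and_identify_planes(pending.result()))
    return results
```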
Output interface 13032: the output interface is used for outputting the processing result in the image data processing method, and in some feasible embodiments, the processing result can be directly output by a processor or can be stored in a memory first and then output through the memory; in some possible embodiments, there may be only one output interface or there may be multiple output interfaces. In some possible embodiments, the processing result output by the output interface may be sent to a memory for storage, or may be sent to another processing flow for further processing, or sent to a display device for display, sent to a player terminal for playing, and the like.
The memory 1304: the memory 1304 may store the image data to be processed, the related instructions for configuring the first processor or the second processor, and the like. In some possible embodiments, there may be one memory or multiple memories. The memory may be a floppy disk; a hard disk such as a built-in hard disk or a removable hard disk; a magnetic disk; an optical disk or magneto-optical disk such as a CD-ROM or a DVD-ROM; a memory device such as a RAM, a ROM, a PROM, an EPROM, an EEPROM, or a flash memory; or any other form of storage medium known in the art.
Bus 1305: the bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 13, but this does not mean that there is only one bus or one type of bus.
The components of the image data processing apparatus provided in the embodiments of the present application are respectively configured to implement the functions of the corresponding steps of the image data processing method; since these steps have been described in detail in the embodiments of the image data processing method, no further description is given here.
An embodiment of the present application further provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a device (for example, a single chip, a computer, or the like), the device is caused to perform one or more of steps 301 to 3011 of the above image data processing method. If the respective constituent modules of the image data processing apparatus are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium.
Based on such an understanding, the embodiments of the present application also provide a computer program product containing instructions. The technical solution of the present application, or the part thereof that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, and the computer program product contains instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor therein to execute all or part of the steps of the methods described in the embodiments of the present application.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network appliance, a user device, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium such as a floppy disk, a hard disk, or a magnetic tape; an optical medium such as a digital video disc (DVD); or a semiconductor medium such as a solid state drive (SSD).
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the present application has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and figures are merely exemplary of the present application as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include such modifications and variations.

Claims (27)

  1. A method for identifying a plane semantic category, comprising:
    acquiring to-be-processed image data, wherein the to-be-processed image data comprises N pixel points, and N is a positive integer;
    determining a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result comprises a target plane category corresponding to at least part of pixel points in the N pixel points;
    obtaining a first dense semantic map according to the semantic segmentation result, wherein the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point in the at least part of pixel points;
    and performing plane semantic category identification according to the first dense semantic map to obtain plane semantic categories of one or more planes included in the image data to be processed.
  2. The method of claim 1, wherein the obtaining a first dense semantic map according to the semantic segmentation result comprises:
    obtaining a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed;
    using the second dense semantic map as the first dense semantic map, or,
    and updating a historical dense semantic map by using one or more second three-dimensional points in a second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.
  3. The method according to claim 1 or 2, wherein after determining the result of semantic segmentation of the image data to be processed, the method further comprises:
    and according to the image data to be processed and the depth information included in the depth image corresponding to the image data to be processed, performing optimization operation on the semantic segmentation result, wherein the optimization operation is used for correcting the noise and error part in the semantic segmentation result.
  4. The method according to any one of claims 1-3, wherein the determining a semantic segmentation result of the image data to be processed comprises:
    determining one or more plane categories corresponding to any pixel point in at least part of pixel points and the probability of each plane category in the one or more plane categories;
    and taking the plane class with the highest probability in the one or more plane classes corresponding to any pixel point as a target plane class corresponding to any pixel point to obtain a semantic segmentation result of the image data to be processed.
  5. The method of claim 4, wherein the determining the probability of each plane class of the one or more plane classes corresponding to any pixel point of the at least some pixel points comprises:
    and performing semantic segmentation on the image data to be processed according to a neural network to obtain the probability of each plane category in one or more plane categories corresponding to any pixel point in at least part of pixel points.
  6. The method according to any one of claims 1 to 5, wherein the performing plane semantic category recognition according to the first dense semantic map to obtain plane semantic categories of one or more planes included in the image data to be processed includes:
    determining a plane equation for each of the one or more planes from the image data to be processed;
    performing the following steps on any plane in the one or more planes to obtain a plane semantic category of the any plane:
    determining one or more target plane classes corresponding to the any plane and the confidence of the one or more target plane classes according to the plane equation of the any plane and the first dense semantic map;
    and selecting the object plane class with the highest confidence from the one or more object plane classes as the semantic plane class of any one plane.
  7. The method of claim 6, wherein the orientation of the one or more object plane categories corresponding to the any one plane is coincident with the orientation of the any one plane.
  8. The method according to claim 6 or 7, wherein the determining, from the plane equation for the any one plane and the first dense semantic map, one or more object plane classes to which the any one plane corresponds and a confidence of the one or more object plane classes comprises:
    determining M first three-dimensional points from the first dense semantic map according to a plane equation of any one plane, wherein the distance between the M first three-dimensional points and the any one plane is smaller than a third threshold value, and M is a positive integer;
    determining one or more object plane classes corresponding to the M first three-dimensional points as the one or more object plane classes corresponding to the any plane, wherein the orientation of the one or more object plane classes is consistent with the orientation of the any plane,
    and counting the proportion of the number of the three-dimensional points corresponding to each target plane category in the one or more target plane categories in the M first three-dimensional points to obtain the confidence of the one or more target plane categories.
  9. The method of claim 8, further comprising:
    and updating the confidence of one or more target plane categories corresponding to any one plane according to at least one of Bayesian theorem or voting mechanism.
  10. The method according to any one of claims 1-9, wherein the obtaining a first dense semantic map according to the semantic segmentation result comprises:
    judging whether the current state is a motion state;
    and under the condition that the current state is a motion state, obtaining a first dense semantic map according to the semantic segmentation result.
  11. The method according to any one of claims 1 to 10, wherein the image data to be processed is post-aligned image data.
  12. An image data processing apparatus characterized by comprising:
    the semantic segmentation module is used for acquiring to-be-processed image data comprising N pixel points and determining a semantic segmentation result of the to-be-processed image data, wherein the semantic segmentation result comprises target plane categories corresponding to at least part of the N pixel points, and N is a positive integer;
    the semantic map module is used for obtaining a first dense semantic map according to the semantic segmentation result, wherein the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point in at least part of pixel points;
    and the semantic clustering module is used for carrying out plane semantic category identification according to the first dense semantic map to obtain the plane semantic categories of one or more planes included in the image data to be processed.
  13. The apparatus of claim 12, wherein the semantic map module is specifically configured to:
    obtaining a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed;
    using the second dense semantic map as the first dense semantic map, or,
    the method further includes updating a historical dense semantic map with one or more second three-dimensional points in a second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.
  14. The apparatus according to claim 12 or 13, wherein after determining the semantic segmentation result of the image data to be processed, the semantic segmentation module is further configured to perform an optimization operation on the semantic segmentation result according to the image data to be processed and depth information included in a depth image corresponding to the image data to be processed, where the optimization operation is used to correct noise and error portions in the semantic segmentation result.
  15. The apparatus according to any one of claims 12 to 14, wherein the semantic segmentation module is specifically configured to determine one or more plane classes corresponding to any one of the at least some pixel points and a probability of each of the one or more plane classes;
    and the plane class with the highest probability in the one or more plane classes corresponding to any pixel point is used as a target plane class corresponding to any pixel point, so as to obtain a semantic segmentation result of the image data to be processed.
  16. The apparatus according to claim 15, wherein the semantic segmentation module is configured to perform semantic segmentation on the image data to be processed according to a neural network, so as to obtain a probability of each of one or more plane categories corresponding to any one of the at least part of the pixel points.
  17. The apparatus according to any one of claims 12-16, wherein the semantic clustering module is configured to:
    determining a plane equation for each of the one or more planes from the image data to be processed;
    performing the following steps on any plane in the one or more planes to obtain a plane semantic category of the any plane:
    determining one or more target plane classes corresponding to the any plane and the confidence of the one or more target plane classes according to the plane equation of the any plane and the first dense semantic map;
    and selecting the object plane class with the highest confidence from the one or more object plane classes as the semantic plane class of any one plane.
  18. The apparatus of claim 17, wherein the orientation of the one or more object plane categories corresponding to the any one plane is coincident with the orientation of the any one plane.
  19. The apparatus according to claim 17 or 18, wherein the semantic clustering module is specifically configured to:
    determining M first three-dimensional points from the first dense semantic map according to a plane equation of any one plane, wherein the distance between the M first three-dimensional points and the any one plane is smaller than a third threshold value, and M is a positive integer;
    determining one or more object plane classes corresponding to the M first three-dimensional points as the one or more object plane classes corresponding to the any plane, wherein the orientation of the one or more object plane classes is consistent with the orientation of the any plane,
    and counting the proportion of the number of the three-dimensional points corresponding to each target plane category in the one or more target plane categories in the M first three-dimensional points to obtain the confidence of the one or more target plane categories.
  20. The apparatus of claim 19, wherein the semantic clustering module, after counting a ratio of the number of three-dimensional points corresponding to each object plane category in the one or more object plane categories to the M first three-dimensional points to obtain a confidence of the one or more object plane categories, is further configured to update the confidence of the one or more object plane categories corresponding to the any one plane according to at least one of bayesian theorem or voting mechanism.
  21. The apparatus according to any one of claims 12 to 20, wherein the semantic map module is specifically configured to determine whether a current state is a motion state, and obtain a first dense semantic map according to the semantic segmentation result when it is determined that the current state is the motion state.
  22. The apparatus according to any one of claims 12 to 21, wherein the image data to be processed is post-aligned image data.
  23. A computer-readable storage medium having stored thereon instructions which, when executed, implement the method of any one of claims 1 to 11.
  24. A processing device, characterized by comprising a first processor and a second processor, wherein the first processor is configured to acquire image data to be processed, the image data to be processed comprises N pixel points, and N is a positive integer;
    the second processor is configured to determine a semantic segmentation result of the image data to be processed, where the semantic segmentation result includes a target plane category corresponding to at least some pixel points in the N pixel points;
    the first processor is further configured to obtain a first dense semantic map according to the semantic segmentation result, where the first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point in the at least part of pixel points; and performing plane semantic category identification according to the first dense semantic map to obtain plane semantic categories of one or more planes included in the image data to be processed.
  25. The processing device according to claim 24, wherein the second processor is specifically configured to determine one or more plane categories corresponding to any of the at least some of the pixel points and a probability for each of the one or more plane categories;
    and taking the plane class with the highest probability in the one or more plane classes corresponding to any pixel point as a target plane class corresponding to any pixel point to obtain a semantic segmentation result of the image data to be processed.
  26. The processing device according to claim 25, wherein the second processor is specifically configured to perform semantic segmentation on the image data to be processed according to a neural network, so as to obtain a probability of each of one or more plane classes corresponding to any one of the at least part of the pixel points.
  27. A processing device, comprising: one or more processors configured to execute instructions stored in a memory to perform the method of any of claims 1-11.
CN202080001308.1A 2020-01-23 2020-01-23 Identification method of plane semantic category and image data processing device Pending CN113439275A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/074040 WO2021147113A1 (en) 2020-01-23 2020-01-23 Plane semantic category identification method and image data processing apparatus

Publications (1)

Publication Number Publication Date
CN113439275A true CN113439275A (en) 2021-09-24

Family

ID=76992013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080001308.1A Pending CN113439275A (en) 2020-01-23 2020-01-23 Identification method of plane semantic category and image data processing device

Country Status (2)

Country Link
CN (1) CN113439275A (en)
WO (1) WO2021147113A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4138390A1 (en) * 2021-08-20 2023-02-22 Beijing Xiaomi Mobile Software Co., Ltd. Method for camera control, image signal processor and device with temporal control of image acquisition parameters
CN115908792A (en) * 2021-09-30 2023-04-04 北京字跳网络技术有限公司 Image area processing method and equipment
CN116151358A (en) * 2021-11-16 2023-05-23 华为技术有限公司 Neural network model training method, vectorization three-dimensional model building method and vectorization three-dimensional model building equipment
CN114119891B (en) * 2021-11-26 2024-06-21 江苏科技大学 Robot monocular semi-dense map three-dimensional reconstruction method and reconstruction system
CN114882176B (en) * 2022-05-13 2024-08-16 武汉大学 Semantic point cloud acquisition and semantic octree map construction method based on "branch repair

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10650531B2 (en) * 2018-03-16 2020-05-12 Honda Motor Co., Ltd. Lidar noise removal using image pixel clusterings
CN109145747B (en) * 2018-07-20 2021-10-08 华中科技大学 Semantic segmentation method for water surface panoramic image

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108427951A (en) * 2018-02-08 2018-08-21 腾讯科技(深圳)有限公司 Image processing method, device, storage medium and computer equipment
CN110633617A (en) * 2018-06-25 2019-12-31 苹果公司 Plane detection using semantic segmentation
CN110458805A (en) * 2019-03-26 2019-11-15 华为技术有限公司 Plane detection method, computing device and circuit system
CN110378349A (en) * 2019-07-16 2019-10-25 北京航空航天大学青岛研究院 The mobile terminal Android indoor scene three-dimensional reconstruction and semantic segmentation method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119998A (en) * 2021-12-01 2022-03-01 成都理工大学 Vehicle-mounted point cloud ground point extraction method and storage medium
CN114119998B (en) * 2021-12-01 2023-04-18 成都理工大学 Vehicle-mounted point cloud ground point extraction method and storage medium
CN115527028A (en) * 2022-08-16 2022-12-27 北京百度网讯科技有限公司 Map data processing method and device

Also Published As

Publication number Publication date
WO2021147113A1 (en) 2021-07-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination