WO2021147113A1

WO2021147113A1 - Plane semantic category identification method and image data processing apparatus

Info

Publication number: WO2021147113A1
Application number: PCT/CN2020/074040
Authority: WO
Inventors: 马超群; 陈平; 方晓鑫
Original assignee: 华为技术有限公司
Priority date: 2020-01-23
Filing date: 2020-01-23
Publication date: 2021-07-29
Also published as: CN113439275A

Abstract

Disclosed are a plane semantic category identification method and an image data processing apparatus, wherein same relate to the technical field of image processing, and are used for accurately determining a plane semantic category. The method comprises: acquiring image data to be processed, wherein the image data to be processed comprises N pixel points; determining a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result comprises target plane categories corresponding to at least some of the N pixel points; according to the semantic segmentation result, obtaining a first dense semantic map, wherein the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point in the at least some pixel points; and performing plane semantic category identification according to the first dense semantic map in order to obtain plane semantic categories of one or more planes comprised in the image data to be processed. The method can improve the accuracy of plane semantic recognition.

Description

Recognition method of plane semantic category and image data processing device

Technical field

The embodiments of the present application relate to the field of image processing technology, and in particular, to a method for recognizing planar semantic categories and an image data processing device.

Background technique

Augmented reality (AR) is a technology that calculates the position and angle of the camera image in real time and adds corresponding images, videos, and 3D models. The goal of this technology is to put the virtual world in the real world on the screen And interact. Among them, plane detection, as a more important function in augmented reality, provides the perception of the basic three-dimensional environment of the real world, so that developers can place virtual objects according to the detected plane to achieve the effect of augmented reality. Three-dimensional space plane detection is an important and basic function, because after the plane is detected, the anchor point of the object can be further determined, so as to render the object at the determined anchor point, so it is an important function in augmented reality , Provides the perception of the basic three-dimensional environment of the real world, enabling developers to place virtual objects according to the detected plane to achieve the effect of augmented reality.

At present, it is possible to obtain multiple 3D points on a plane based on laser equipment, calculate the plane equation of the plane through 3D point statistics, and determine the position information of the plane through the plane equation. However, the plane detected by most augmented reality algorithms only provides location information and cannot identify the plane category of the plane. The plane category of the recognition plane can help developers improve the authenticity and interest of augmented reality applications.

Based on the above, it is currently possible to perform semantic segmentation on red-green-blue (Red-Green-Blue, RGB) image data or red-green-blue-depth (Red-Green-Blue-Depth, RGBD) image data through neural networks, and perform semantic segmentation according to the semantic segmentation results Build a semantic map. Then use semantic maps to generate plane semantic categories. However, because this solution directly uses the semantic segmentation result to build a semantic map, it may cause the wrong segmentation and unsegmented parts in the semantic segmentation result, so that the accuracy of semantic category recognition is reduced.

Summary of the invention

The embodiments of the present application provide a method for recognizing planar semantic categories and an image data processing device to improve the accuracy of planar semantic category recognition.

In order to achieve the foregoing objectives, the embodiments of the present application provide the following technical solutions:

In the first aspect, an embodiment of the present application provides a method for recognizing planar semantic categories, including: an image data processing device obtains image data to be processed including N pixels, where N is a positive integer. The image data processing device determines a semantic segmentation result of the image data to be processed, where the semantic segmentation result includes the target plane category corresponding to at least some of the N pixels. The image data processing device obtains a first dense semantic map according to the semantic segmentation result, the first dense semantic map including at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud, and the at least one first three-dimensional point cloud A three-dimensional point corresponds to at least one pixel point in the at least part of the pixel points. The image data processing device performs plane semantic category recognition according to the first dense semantic map, and obtains the plane semantic category of one or more planes included in the image data to be processed.

The embodiment of the present application provides a method for recognizing planar semantic categories. The method obtains the result of semantic segmentation of image data to be processed. Since the result of semantic segmentation includes the target to which each pixel of the N pixels included in the image data to be processed belongs For plane categories, subsequent semantic segmentation can improve the accuracy of plane semantic recognition. In addition, in the method provided by the embodiments of the present application, the image data processing device obtains the first dense semantic map according to the semantic segmentation result, and then, uses the first dense semantic map to recognize the planar semantic category to obtain the planar semantic category of the image data to be processed It can enhance the accuracy of plane semantic recognition.

In a possible implementation manner, the image data processing device obtains the first dense semantic map according to the semantic segmentation result, including: the image data processing device obtains the second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed Semantic map. The image data processing device uses the second dense semantic map as the first dense semantic map.

In a possible implementation manner, the image data processing device obtains the first dense semantic map according to the semantic segmentation result, including: the image data processing device obtains the second dense semantic map according to the semantic segmentation result. The image data processing device uses one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to update the historical dense semantic map to obtain the first dense semantic map.

In a possible implementation manner, the image data processing apparatus judging whether the current state of the image data processing apparatus is a motion state includes: the image data processing apparatus obtains second image data, which is different from the image data to be processed. The image data processing device determines whether the state of the image data processing device is a motion state according to the pose of the first device corresponding to the image data to be processed and the pose of the second device corresponding to the second image data. For example, the second image data is adjacent to the image data to be processed and is located in the previous frame of the image data to be processed.

In a possible implementation, the image data processing apparatus determining that the current state is the motion state includes: when the difference between the pose of the first device and the pose of the second device is less than or equal to a first threshold, determining that the current state is Motion state

In a possible implementation, the image data processing device determining that the current state is a motion state includes: the image data processing device acquires second image data shot by the camera; wherein the second image data is adjacent to the image data to be processed and is located The last frame of the image data to be processed; the image data processing device according to the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data, and between the second image data and the image data to be processed The difference between frames to determine the state of the image data processing device is a motion state.

In a possible implementation manner, the image data processing apparatus according to the pose of the first device corresponding to the image data to be processed and the pose of the second device corresponding to the second image data, and the relationship between the second image data and the image data to be processed The difference between frames to determine the state of the image data processing device as a motion state includes: the difference between the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data is less than or equal to In the case where the first threshold value and the frame difference between the second image data and the image data to be processed are greater than the second threshold value, the state of the image data processing device is a motion state.

In a possible implementation, after the image data processing apparatus determines the semantic segmentation result of the image data to be processed, the method provided in the embodiment of the present application further includes: the image data processing apparatus according to the image data to be processed and the image data to be processed For the depth information included in the depth image corresponding to the image data, an optimization operation is performed on the semantic segmentation result, and the optimization operation is used to correct noise and error parts in the semantic segmentation result. This can make subsequent semantic recognition more accurate.

In a possible implementation manner, the image data processing apparatus determining the semantic segmentation result of the image data to be processed includes: the image data processing apparatus determining each plane in one or more plane categories corresponding to any one of at least some of the pixels The probability of the category. The image data processing device uses the plane category with the highest probability among the one or more plane categories corresponding to any one pixel as the target plane category corresponding to any one pixel to obtain the semantic segmentation result of the image data to be processed . That is, the probability of the target plane category corresponding to any pixel is the largest among the probabilities of one or more plane categories corresponding to any pixel. This can provide the accuracy of semantic recognition.

In a possible implementation manner, the image data processing device determines the probability of each plane category in one or more plane categories corresponding to any one of at least some of the pixels, including: the image data processing device processes according to the neural network Semantic segmentation of the image data is performed to obtain the probability of each plane category in one or more plane categories corresponding to any one of at least some pixels.

In a possible implementation manner, the image data processing device recognizes the plane semantic category according to the first dense semantic map, and obtains the plane semantic category of one or more planes included in the image data to be processed, including: the image data processing device recognizes the plane semantic category according to the to-be-processed image data. Process the image data to determine the plane equation for each of one or more planes. The image data processing device performs the following steps on any one of the one or more planes to obtain the plane semantic category of the one or more planes: the image data processing device according to the plane equation of any one of the planes, and the first dense Semantic map, determining one or more target plane categories corresponding to any one of the planes and the confidence of the one or more target plane categories; selecting the target plane category with the highest confidence from the one or more target plane categories As the semantic plane category of any one of the planes. That is, the semantic plane category of any plane is the target plane category with the highest confidence among the one or more target plane categories corresponding to the any plane, and the target plane category with the highest confidence is selected as the semantic plane category of any plane. It can enhance the accuracy of plane semantic recognition.

In a possible implementation manner, the orientation of one or more target plane categories corresponding to any plane is consistent with the orientation of any plane. That is, the orientation of one or more target plane categories corresponding to each plane is consistent with the respective orientation of each plane. In this way, the target plane categories that are inconsistent with the plane orientation can be filtered out, and the accuracy of plane semantic recognition can be enhanced.

In a possible implementation manner, the image data processing device determines one or more target plane categories corresponding to any one plane and the first dense semantic map according to the plane equation of any one plane and the first dense semantic map. The confidence of one or more target plane categories includes: the image data processing device determines M first three-dimensional points from the first dense semantic map according to the plane equation of any one of the planes, and the M first three-dimensional points The distance between the point and any one of the planes is less than a third threshold, and M is a positive integer; the one or more target plane categories corresponding to the M first three-dimensional points are determined as the one corresponding to the any one of the planes One or more target plane categories, the orientation of the one or more target plane categories is consistent with the orientation of any one of the planes, and the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories is counted The ratio among the M first three-dimensional points obtains the confidence of the one or more target plane categories. E.g. The target plane category corresponding to each first three-dimensional point is the target plane category of the two-dimensional pixel points corresponding to the first three-dimensional point, and one or more target plane categories of all M first three-dimensional points can be obtained.

In a possible implementation manner, the image data processing device counts the proportion of the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories among the M first three-dimensional points to obtain After the confidence of one or more target plane categories, the method provided in the embodiment of the present application further includes: the image data processing device updates one or more corresponding to any one of the planes according to at least one of Bayes' theorem or voting mechanism The confidence of the target plane category. The video sequence based on Bayes' theorem and voting mechanism updates the confidence of one or more plane categories corresponding to any plane, so that the final result of the plane semantic category of each plane is more stable.

In a possible implementation manner, the method provided in the embodiment of the present application further includes: the image data processing device determines whether the current state of the image data processing device is a motion state, and when the current state is a motion state, the image data processing device According to the semantic segmentation result, the first dense semantic map is obtained. By judging whether it is in a motion state, in the motion state, according to the semantic segmentation result, the first dense semantic map can be obtained, which can reduce the amount of data calculated by the image data processing device, thereby reducing computing resources and improving the semantic map generation algorithm. performance.

In a possible implementation manner, the image data to be processed is image data after correction.

In a possible implementation manner, before the image data processing apparatus obtains the image data to be processed, the method provided in the embodiment of the present application further includes: the image data processing apparatus obtains the first image data taken by the camera. The image data processing device corrects the first image data according to the device pose corresponding to the first image data to obtain the image data to be processed.

In a second aspect, an embodiment of the present application provides an image data processing device. The image data processing device includes a semantic segmentation module, a semantic map module, and a semantic clustering module. The semantic segmentation module is used to obtain information provided by the camera including N Image data to be processed of pixels, N is a positive integer. The semantic segmentation module is also used to determine the semantic segmentation result of the image data to be processed, where the semantic segmentation result includes the target plane category corresponding to at least some of the N pixels. The semantic map module is configured to obtain a first dense semantic map according to the semantic segmentation result, the first dense semantic map including at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud, the at least One first three-dimensional point corresponds to at least one pixel point in the at least part of the pixel points. The semantic clustering module is used to recognize the plane semantic category according to the first dense semantic map to obtain the plane semantic category of one or more planes included in the image data to be processed.

The embodiment of the present application provides an image data processing device, which obtains the semantic segmentation result of the image data to be processed. Because the semantic segmentation result includes the N pixels included in the image data to be processed, each pixel belongs to The target plane category can improve the accuracy of plane semantic recognition through semantic segmentation. In addition, in the method provided by the embodiments of the present application, the image data processing device obtains the first dense semantic map according to the semantic segmentation result, and then, uses the first dense semantic map to recognize the planar semantic category to obtain the planar semantic category of the image data to be processed It can enhance the accuracy of plane semantic recognition. In a possible implementation manner, the semantic map module is used to obtain the first dense semantic map according to the semantic segmentation result, including: the semantic map module is used to obtain the second dense semantic map according to the semantic segmentation result. The semantic map module is used to use the second dense semantic map as the first dense semantic map.

In a possible implementation manner, the semantic map module is used to obtain the first dense semantic map according to the semantic segmentation result, including: the semantic map module is used to obtain the second dense semantic map according to the semantic segmentation result. The semantic map module is used to update the historical dense semantic map by using one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.

In a possible implementation, the image data processing device further includes: an instant localization and map construction (simultaneous localization and mapping, SLAM) module, which is used to calculate the device pose (such as the camera pose) of the image data, and the semantic map The module is used to determine whether the current state of the image data processing device is a motion state, and includes: a semantic map module is used to obtain second image data provided by the camera, and the second image data is different from the image data to be processed. The semantic map module is used to determine whether the state of the image data processing device is a motion state according to the first device pose corresponding to the image data to be processed provided by the SLAM module and the second device pose corresponding to the second image data provided by the SLAM module. For example, the second image data is adjacent to the image data to be processed and is located in the previous frame of the image data to be processed.

In a possible implementation, the semantic map module used to determine that the current state is a motion state includes: when the difference between the pose of the first device and the pose of the second device is less than or equal to the first threshold, the semantic map module Used to determine that the current state is an exercise state;

In a possible implementation, the semantic map module used to determine that the current state is a motion state includes: the semantic map module is used to obtain second image data taken by the camera; wherein the second image data is adjacent to the image data to be processed, And is located in the previous frame of the image data to be processed; the semantic map module is used for the first device pose corresponding to the image data to be processed provided by the SLAM module and the second device pose corresponding to the second image data provided by the SLAM module, and The frame difference between the second image data and the image data to be processed determines that the current state of the image data processing device is a motion state.

In a possible implementation, the semantic map module is used to determine the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data, and the difference between the second image data and the image data to be processed. Determine the current state of the image data processing device as a motion state, including: the difference between the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data is less than When it is equal to the first threshold and the frame difference between the second image data and the image data to be processed is greater than the second threshold, the semantic map module is used to determine that the current state of the image data processing device is a motion state.

In a possible implementation, the semantic segmentation module is further configured to perform an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information included in the depth image corresponding to the image data to be processed, and the optimization operation is used to modify the image data to be processed. Describe the noise and error in the semantic segmentation results.

In a possible implementation manner, the semantic segmentation module is used to determine the semantic segmentation result of the image data to be processed, including the method used to determine one or more plane categories corresponding to any one of the at least partial pixels and the The probability of each plane category in one or more plane categories, and the plane category with the highest probability among the one or more plane categories corresponding to any one pixel is used as the target plane category corresponding to any one pixel , In order to obtain the semantic segmentation result of the image data to be processed. That is, the probability of the target plane category corresponding to any one of at least some pixels included in the semantic segmentation result of the image data to be processed is the largest among the probabilities of one or more plane categories corresponding to any one pixel.

In a possible implementation, the semantic segmentation module is used to perform semantic segmentation on the image data to be processed according to the neural network to obtain at least part of the pixel points corresponding to any one of the one or more plane categories of each plane category Probability.

In a possible implementation, the semantic clustering module is used to recognize the plane semantic category according to the first dense semantic map to obtain the plane semantic category of one or more planes included in the image data to be processed, including: a semantic clustering module , Used to determine the plane equation of each of one or more planes according to the image data to be processed. The semantic clustering module is also used to perform the following steps on any one of the one or more planes to obtain the plane semantic category of the one or more planes: the semantic clustering module is used to perform the following steps according to any one of the planes. And the first dense semantic map to determine one or more target plane categories corresponding to any one of the planes and the confidence of the one or more target plane categories; the semantic clustering module is used in all Among the one or more target plane categories, the target plane category with the highest confidence is selected as the semantic plane category of any one of the planes.

In a possible implementation, the orientation of one or more target plane categories corresponding to each plane is consistent with the respective orientation of each plane. That is, the orientation of one or more target plane categories corresponding to any plane is consistent with the orientation of the any plane.

In a possible implementation manner, the semantic clustering module is used to determine one or more target plane categories corresponding to any plane according to the plane equation of any plane and the first dense semantic map, and The confidence of the one or more target plane categories includes: the semantic clustering module is configured to determine M first three-dimensional points from the first dense semantic map according to the plane equation of any one of the planes, and the M The distance between the first three-dimensional points and any one of the planes is less than a third threshold, and the orientation of the target plane category corresponding to the M first three-dimensional points is consistent with the orientation of any one of the planes, and M is a positive integer , The M first three-dimensional points correspond to the one or more plane categories; and counting the number of three-dimensional points corresponding to each plane category in the one or more plane categories among the M first three-dimensional points Ratio to obtain the confidence of the one or more plane categories.

In a possible implementation manner, the semantic clustering module is used to count the proportion of the number of three-dimensional points corresponding to each plane category in the one or more plane categories among the M first three-dimensional points to obtain the After the confidence of one or more plane categories, the semantic clustering module is further used to update the confidence of one or more target plane categories corresponding to any one of the planes according to at least one of Bayes' theorem or voting mechanism.

In a possible implementation, the semantic map module is used to determine whether the current state of the image data processing device is a motion state. When it is determined that the current state is the motion state, the semantic map module is used to obtain the first dense semantic map according to the semantic segmentation result.

In a possible implementation manner, before the semantic segmentation module is used to obtain the image data to be processed, the semantic segmentation module is also used to obtain the first image data taken by the camera. The semantic segmentation module is used to correct the first image data according to the device pose corresponding to the first image data provided by the SLAM module to obtain the image data to be processed.

In a possible implementation, the SLAM module, the semantic clustering module, and the semantic map module run on the central processing unit CPU, and the semantic segmentation part of the semantic segmentation module can run on the NPU. The semantic segmentation module excludes semantics. The other part of the function of the division runs on the central processing unit (CPU).

In a third aspect, embodiments of the present application provide a computer-readable storage medium that stores instructions in the readable storage medium, and when the instructions are executed, the method described in any aspect of the first aspect is implemented.

In a fourth aspect, an embodiment of the present application provides an image data processing device, including: a first processor and a second processor, where the first processor is configured to obtain image data to be processed including N pixels, N is a positive integer. The second processor is configured to determine the semantic segmentation result of the image data to be processed, wherein the semantic segmentation result includes the target plane category corresponding to at least some of the N pixels; the first processor is configured to segment the image data according to the semantic As a result, a first dense semantic map is obtained, the first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to the at least part of the At least one pixel in the pixel; the first processor is configured to recognize the plane semantic category according to the first dense semantic map to obtain the plane semantic category of one or more planes included in the image data to be processed.

In a possible implementation, the first processor is specifically configured to obtain a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed; the first processor specifically uses The second dense semantic map is used as the first dense semantic map, or, the first processor is specifically configured to use one or more second three-dimensional point clouds in the second dense semantic map. The historical dense semantic map is updated with two and three-dimensional points to obtain the first dense semantic map.

In a possible implementation manner, the second processor is used to determine the semantic segmentation result of the image data to be processed, including depth information included in the image data to be processed and the depth image corresponding to the image data to be processed , Performing an optimization operation on the semantic segmentation result, and the optimization operation is used to correct noise and error parts in the semantic segmentation result.

In a possible implementation, before the second processor is configured to determine the semantic segmentation result of the to-be-processed image data, the second processor is also configured to determine the pixel corresponding to any one of the at least some pixels. The probability of each plane category in one or more plane categories; and the plane category with the highest probability among the one or more plane categories corresponding to any one pixel is used as the target plane category corresponding to any one pixel , In order to obtain the semantic segmentation result of the image data to be processed. That is, the probability of the target plane category corresponding to any pixel is the largest among the probabilities of one or more plane categories corresponding to any pixel. This can provide the accuracy of semantic recognition.

In a possible implementation manner, the second processor is configured to perform semantic segmentation on the to-be-processed image data according to the neural network to obtain one or more plane categories corresponding to any one of the at least some pixels The probability of each plane category in.

In a possible implementation manner, the first processor is used to determine the plane equation of each of the one or more planes; the first processor is also used to determine the plane equation of any one of the one or more planes. A plane executes the following steps to obtain the plane semantic category of the one or more planes: the first processor is further configured to determine the plane semantic category according to the plane equation of any plane and the first dense semantic map One or more target plane categories corresponding to any one plane and the confidence of the one or more target plane categories; the first processor is further configured to select the target with the highest confidence among the one or more target plane categories The plane category is used as the semantic plane category of any one of the planes. That is, the semantic plane category of any plane is the highest-confidence target plane category among the one or more target plane categories corresponding to the any plane.

In a possible implementation manner, the orientation of one or more target plane categories corresponding to any plane is consistent with the orientation of the any plane.

In a possible implementation manner, the first processor is specifically configured to determine M first three-dimensional points from the first dense semantic map according to the plane equation of any one of the planes, and the M first three-dimensional points are The distance between the three-dimensional point and any one of the planes is less than a third threshold, and M is a positive integer; the one or more target plane categories corresponding to the M first three-dimensional points are determined as the all corresponding to the any one of the planes. The one or more target plane categories, the orientation of the one or more target plane categories is consistent with the orientation of any one of the planes, and the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories is counted According to the ratio of the target in the M first three-dimensional points, the confidence of the one or more target plane categories is obtained.

In a possible implementation manner, the first processor is specifically configured to count the proportion of the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories among the M first three-dimensional points After obtaining the confidence of the one or more target plane categories, the first processor is further configured to update one or more targets corresponding to any one of the planes according to at least one of Bayes' theorem or a voting mechanism The confidence of the plane category.

In a possible implementation manner, the first processor is configured to determine whether the current state is the motion state; and when it is determined that the current state is the motion state, obtain the first dense semantic map according to the semantic segmentation result.

In a possible implementation manner, the first processor may be a CPU or a DSP. The second processor may be an NPU.

In a fifth aspect, an embodiment of the present application provides an image data processing device, including: one or more processors, wherein the one or more processors are configured to execute instructions stored in a memory to execute instructions as described in any aspect of the first aspect Methods.

In a sixth aspect, a computer program product including instructions is provided. The computer program product includes instructions. When the instructions are executed, the method as described in any aspect of the first aspect is implemented.

Description of the drawings

FIG. 1 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the application;

2 is a schematic diagram of a software architecture applicable to a method for identifying planar semantic categories provided by an embodiment of the application;

FIG. 3 is a schematic flowchart of a method for recognizing planar semantic categories according to an embodiment of this application;

FIG. 4 is a schematic flowchart of another method for recognizing planar semantic categories according to an embodiment of this application;

5 is a schematic diagram of the first image data before and after processing obtained by the image data processing device provided by the embodiment of the application;

FIG. 6 is a schematic diagram of a semantic segmentation result provided by an embodiment of this application;

FIG. 7 is a schematic diagram of a coordinate mapping provided by an embodiment of this application;

FIG. 8 is a schematic diagram of a flow state determination process provided by an embodiment of the application;

FIG. 9 is a calculation flow of plane confidence provided by an embodiment of the application;

FIG. 10 is a schematic flow chart of performing filtering on semantic segmentation results according to an embodiment of this application;

FIG. 11 is a schematic diagram of another process of performing filtering on semantic segmentation results according to an embodiment of the application;

FIG. 12 is a schematic diagram of a planar semantic result provided by an embodiment of this application;

FIG. 13 is a schematic structural diagram of an image data processing device provided by an embodiment of the application.

Detailed ways

In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application will be further described in detail below with reference to the accompanying drawings. In this application, "at least one" refers to one or more, and "multiple" refers to two or more. "And/or" describes the association relationship of the associated objects, indicating that there can be three relationships, for example, A and/or B, which can mean: A alone exists, A and B exist at the same time, and B exists alone, where A, B can be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "The following at least one item (a)" or similar expressions refers to any combination of these items, including any combination of a single item (a) or a plurality of items (a). For example, at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple .

The method for identifying planar semantic categories provided in the embodiments of the present application can be applied to various image data processing apparatuses with TOF, and the image data processing apparatus may be an electronic device. Among them, electronic devices may include, but are not limited to, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, mobile phones, tablet computers, personal digital assistants, media players, etc.), consumer electronic devices, Small computers, large computers, mobile robots, drones, etc. For example, the electronic device in the embodiment of the present application may be a device with AR function, for example, a device with AR glasses function, which can be applied to scenarios such as AR automatic measurement, AR decoration, and AR interaction.

When the image data processing device needs to identify the plane category of each of the one or more planes included in the image data to be processed, in a possible implementation manner, the image data processing device can use the plane semantic category provided in this embodiment of the application. The recognition method to obtain the plane category recognition result of the image data to be processed. In another possible implementation manner, the image data processing device may send the image data to be processed to other devices that have the realization of the recognition process of the flat semantic category, such as a server or a terminal device, and the server or the terminal device performs the recognition of the flat semantic category. Process, and then the image data processing device receives plane category recognition results from other equipment.

In the following embodiments, taking the image data processing apparatus as an electronic device as an example, a method for recognizing planar semantic categories provided in the embodiments of the present application is introduced. The method for identifying flat semantic categories provided by the example of this application is applicable to the electronic device as shown in FIG. 1. The specific structure of the electronic device will be briefly introduced below.

Referring to FIG. 1, it is a schematic diagram of the hardware structure of an electronic device applied in an embodiment of this application. As shown in FIG. 1, the electronic device 100 may include a display device 110, a processor 120 and a memory 130. The memory 130 may be used to store software programs and data, and the processor 120 may execute various functional applications and data processing of the electronic device 100 by running the software programs and data stored in the memory 130.

The memory 130 may mainly include a storage program area and a storage data area. The storage program area may store an operating system, an application program (such as an image capture function, etc.) required by at least one function; the storage data area may store information according to the electronic device 100 Use the created data (such as audio data, text information, image data), etc. In addition, the memory 130 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.

The processor 120 is the control center of the electronic device 100, which uses various interfaces and lines to connect various parts of the entire electronic device, and executes various functions of the electronic device 100 by running or executing software programs and/or data stored in the memory 130 And process data to monitor the electronic equipment as a whole. The processor 120 may include one or more processing units. For example, the processor 120 may include a central processing unit (CPU), an application processor (AP), a modem processor, and a graphics processor ( graphics processing unit (GPU), image signal processor (ISP), controller, memory, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural network Processor (Neural-network Processing Unit, NPU), etc. Among them, the different processing units may be independent devices or integrated in one or more processors.

Among them, the NPU is used as a neural-network (NN) computing processor. By learning from the structure of biological neural networks, such as the transfer mode between human brain neurons, it can quickly process input information, and it can also continuously self-learn. Through the NPU, applications such as intelligent cognition of the electronic device 100 can be realized, such as image recognition, face recognition, voice recognition, text understanding, and so on.

In some embodiments, the processor 120 may include one or more interfaces. Interfaces can include integrated circuit (I2C) interfaces, integrated circuits built-in audio (inter-integrated circuitsound, I2S) interfaces, pulse code modulation (PCM) interfaces, universal asynchronous receivers /transmitter, UART) interface, mobile industry processor interface (MIPI), general-purpose input/output (GPIO) interface, subscriber identity module (SIM) interface, and/or Universal serial bus (USB) interface, etc.

The I2C interface is a bidirectional synchronous serial bus, which includes a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 120 may include multiple sets of I2C buses. The processor 120 may be coupled to the touch sensor, charger, flashlight, image acquisition device 160, etc., respectively through different I2C bus interfaces. For example, the processor 120 may couple the touch sensor through an I2C interface, so that the processor 120 communicates with the touch sensor through the I2C bus interface, so as to realize the touch function of the electronic device 100.

The I2S interface can be used for audio communication. In some embodiments, the processor 120 may include multiple sets of I2S buses. The processor 120 may be coupled with the audio module through an I2S bus to implement communication between the processor 120 and the audio module. In some embodiments, the audio module can transmit audio signals to the WiFi module 190 through the I2S interface, so as to realize the function of answering calls through the Bluetooth headset.

The PCM interface can also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module and the WiFi module 190 may be coupled through a PCM bus interface. In some embodiments, the audio module may also transmit audio signals to the WiFi module 190 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.

UART interface is a universal serial data bus used for asynchronous communication. The bus can be a two-way communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, the UART interface is usually used to connect the processor 120 and the WiFi module 190. For example, the processor 120 communicates with the Bluetooth module in the WiFi module 190 through the UART interface to realize the Bluetooth function. In some embodiments, the audio module can transmit audio signals to the WiFi module 190 through the UART interface, so as to realize the function of playing music through the Bluetooth headset.

The MIPI interface can be used to connect the processor 120 with peripheral devices such as the display device 110 and the image acquisition device 160. The MIPI interface includes an image acquisition device 160 serial interface (camera serial interface, CSI), a display serial interface (display serial interface, DSI), and so on. In some embodiments, the processor 120 and the image acquisition device 160 communicate through a CSI interface to implement the shooting function of the electronic device 100. The processor 120 communicates with the display screen through the DSI interface to realize the display function of the electronic device 100.

The GPIO interface can be configured through software. The GPIO interface can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface can be used to connect the processor 120 with the image capture device 160, the display device 110, the WiFi module 190, the audio module, the sensor module, and so on. The GPIO interface can also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, etc.

The USB interface is an interface that complies with the USB standard specifications, and can be a Mini USB interface, a Micro USB interface, and a USB Type C interface. The USB interface can be used to connect a charger to charge the electronic device 100, and can also be used to transfer data between the electronic device 100 and peripheral devices. It can also be used to connect earphones and play audio through earphones. This interface can also be used to connect other electronic devices, such as AR devices.

It can be understood that the interface connection relationship between the modules illustrated in the embodiment of the present invention is merely a schematic description, and does not constitute a structural limitation of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.

The electronic device 100 also includes an image capture device 160 for capturing images or videos. The image capturing device 160 includes one or more cameras for capturing image data and a TOF camera for capturing depth images. For example, a camera is used to collect video graphics sequence (Video Graphics Array, VGA) or image data and send it to the CPU and GPU. The camera can be an ordinary camera or a focusing camera.

The electronic device 100 may also include an input device 140 for receiving inputted digital information, character information, or contact touch operations/non-contact gestures, and generating signal inputs related to user settings and function control of the electronic device 100.

The display device 110 includes a display panel 111 that is used to display information input by the user or information provided to the user, and various menu interfaces of the electronic device 100. In the embodiment of the present application, it is mainly used to display the camera in the electronic device 100 Or the image data to be processed obtained by the sensor. Optionally, the display panel may be configured with the display panel 111 in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED).

The electronic device 100 may also include one or more sensors 170, such as an image sensor, an infrared sensor, a laser sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, an ambient light sensor, Fingerprint sensors, touch sensors, temperature sensors, bone conduction sensors, inertial measurement units (IMU), etc., where the image sensors may be time of flight (TOF) sensors, structured light sensors, and the like. Specifically, the inertial measurement unit is a device that measures the three-axis attitude angle (or angular velocity) and acceleration of an object. Generally, an IMU contains three single-axis accelerometers and three single-axis gyroscopes. The accelerometer detects the acceleration signal of the object in the independent three-axis of the carrier coordinate system. The gyroscope detects the angular velocity signal of the carrier relative to the navigation coordinate system, measures the angular velocity and acceleration of the object in three-dimensional space, and calculates the posture of the object. In addition, the image sensor may be a device in the image acquisition device 160 or an independent device for acquiring image data.

In addition, the electronic device 100 may also include a power supply 150 for supplying power to other modules. The electronic device 100 may also include a radio frequency (RF) circuit 180 for network communication with wireless network devices, and may also include a WiFi module 190 for WiFi communication with other devices, for example, for acquiring other devices The transmitted image or data, etc. Although not shown in FIG. 1, the electronic device 100 may also include other possible functional modules such as a flashlight, a Bluetooth module, an external interface, a button, a motor, etc., which will not be described here.

As shown in FIG. 2, FIG. 2 shows a software architecture applicable to a planar semantic category recognition method provided by an embodiment of the present application. The software architecture is applied to the electronic device 100 shown in FIG. 1, and the architecture includes: semantic segmentation Module 202, semantic map module 203, and semantic clustering module 204. Optionally, the software architecture may also include an instant localization and mapping (simultaneous localization and mapping, SLAM) module 201. Among them, the SLAM module 201, the semantic map module 203, and the semantic clustering module 204 run on the CPU of the electronic device as described in FIG. 1. Or alternatively, part of the functions in the SLAM module 201 can be deployed on digital signal processing (Digital Signal Processing, DSP), and part of the functions in the semantic segmentation module 202 runs on the NPU of the electronic device as described in FIG. The functions of the segmentation module 202 other than the functions running on the NPU of the electronic device as described in FIG. 1 are running on the CPU. Refer to the subsequent descriptions for the specific functions that run on the NPU.

Among them, the SLAM module 201 is used to provide a video graphics sequence including one or more image data provided by a camera (also called a camera corresponding to the image acquisition device 160 of the electronic device described in FIG. 1), and the depth of the image data provided by the TOF (Depth) information or depth image, and IMU is used to provide IMU data, as well as the correlation between image data frames, combined with the principle of visual geometry, the device pose can be calculated (for example, if the device is a camera, the device position The pose can refer to the camera pose), that is, the rotation and translation of the camera relative to the first frame, and the plane is detected, and the device pose, normal parameters, and boundary points of the plane are output. For example, IMU data includes accelerometers and gyroscopes. The depth information includes the distance between each pixel in the image data and the camera that captured the image data.

The semantic segmentation module 202 implements semantic segmentation data enhancement based on SLAM technology, and is divided into pre-processing, AI processing, and post-processing. The input of the pre-processing is the original image data (for example, RGB image) provided by the camera and the device pose obtained by the SLAM module 201. The original image data is corrected according to the device pose and output as the corrected image data. Compared with increasing rotation data during training, the semantic segmentation model's constraints on rotation invariance can be reduced, and the recognition rate can be improved.

AI processing is based on neural network for semantic segmentation, running on the NPU, the input is the image data after the correction, and the output is the image data after the correction. Each pixel included in the image data belongs to one or more plane categories. Probability distribution (that is, the probability that each pixel belongs to one or more plane categories). If the plane category with the highest probability is selected, the pixel-level semantic segmentation result can be obtained. For example, the aforementioned neural network may be a convolutional neural network (convolutional neural network, CNN), a deep neural network (deep neural network, DNN), and a recurrent neural network (recurrent neural network, RNN).

The input of post-processing is the original image data provided by the camera, depth information, and the semantic segmentation result of AI processing input. The semantic segmentation result is filtered mainly according to the original image data and depth information, and the output is the optimized semantic segmentation result. The accuracy and edges of the segmentation after post-processing are better. It can be understood that post-processing is not a necessary technique in the embodiment and may not be executed. Optionally, the pre-processing and post-processing may run on the CPU or other processors instead of the NPU, which is not limited in this embodiment.

The input of the semantic map module 203 is the optimized semantic segmentation result or the unoptimized semantic segmentation result, the device pose provided by the SLAM module 201, the depth information provided by the TOF, and the original image data provided by the camera. The semantic map module 203 mainly Dense Semantic Map is generated based on SLAM technology. Regarding the semantic map module 203 based on SLAM technology, the optimized semantic segmentation result or the unoptimized semantic segmentation result, as well as the device pose provided by the SLAM module 201, the depth information provided by the TOF, and the original image data provided by the camera generate dense semantic ground. , The process includes converting the original two-dimensional image data into a three-dimensional dense semantic map. After conversion, the two-dimensional RGB pixels in the two-dimensional original image data are converted into three-dimensional points in the three-dimensional space, so that each pixel includes depth information in addition to the RGB information. The process of this two-dimensional to three-dimensional conversion can refer to the description in the prior art, which will not be repeated here. Through conversion, the target plane category of each pixel is used as the target plane category of the three-dimensional point corresponding to the pixel, so that the target plane category of multiple pixels is transformed into the target plane category of multiple three-dimensional points. Therefore, the dense semantic map includes multiple three-dimensional point target plane categories. The target plane category of any three-dimensional point corresponds to the target plane category of the two-dimensional pixel point corresponding to the three-dimensional point.

It can be understood that when the post-processing is performed in the embodiment, the input of the semantic map module 203 is the optimized semantic segmentation result; when the post-processing is not performed in the embodiment, the input of the semantic map module 203 is the unoptimized semantic segmentation result .

The semantic clustering module 204 performs planar semantic recognition based on the dense semantic map. Based on the above introduction, this application provides a method for identifying plane semantic categories and an image data processing device, wherein the method can enable the image data processing device to detect more than one plane included in the image data. In the embodiments of the present application, the method and the image data processing device are based on the same inventive concept. Since the method and the computing device have similar principles for solving the problem, the implementation of the image data processing device and the method can be referred to each other, and the repetition will not be repeated.

As shown in Figure 3, Figure 3 shows a planar semantic category recognition method provided by an embodiment of the present application. The method is applied to an image data processing device. The method includes: step 301. The semantic segmentation module 202 obtains the to-be-processed Image data. The image data to be processed includes N pixels, and N is a positive integer. It should be understood that the image data to be processed may be taken by the camera of the image data processing device and provided to the semantic segmentation module 202, or it may be obtained by the semantic segmentation module 202 from the image library used for storing image data in the image data processing device. It may also be sent by other devices, which is not limited in the embodiment of the present application. For example, the image data to be processed may be a two-dimensional image. The image data to be processed may be a color photo or a black and white photo, which is not limited in the embodiment of the present application.

It should be noted that the N pixels may be all pixels in the image data to be processed, or may be some pixels in the image data to be processed. When the N pixels are part of the pixels in the image data to be processed, the N pixels may be pixels belonging to the plane category in the image data to be processed, excluding the pixels of the non-planar category. It can be understood that a pixel of a non-planar category refers to a pixel that does not belong to any recognized plane category, and at this time, the pixel is considered to not belong to a pixel on any plane.

Step 302: The semantic segmentation module 202 determines the semantic segmentation result of the image data to be processed. The semantic segmentation result includes the target plane category corresponding to at least some of the N pixels included in the image data to be processed. Optionally, the at least part of the pixels may be pixels of one or more planes included in the image data to be processed. Wherein, the target plane category corresponding to at least some of the N pixels may refer to the target plane category corresponding to some of the N pixels, or may refer to the target plane category corresponding to all of the N pixels.

On the one hand, the image data processing device in the embodiment of the present application can independently determine the semantic segmentation result of the image data to be processed, and at this time, the image data processing device may have a module (for example, NPU) that determines the semantic segmentation result of the image data to be processed.

On the other hand, the image data processing device in the embodiment of the present application can also send the image data to be processed to the device with the function of determining the semantic segmentation result of the image data to be processed, so that the device with the function of determining the semantic segmentation result of the image data to be processed The device determines the semantic segmentation result of the image data to be processed. Then, the image data processing apparatus obtains the semantic segmentation result of the image data to be processed from the device having the function of determining the semantic segmentation result of the image data to be processed. In the embodiment of the present application, the image data processing apparatus can detect one or more planes included in the image data to be processed by determining the semantic segmentation result of the image data to be processed.

Step 303: The semantic map module 203 obtains a first dense semantic map according to the semantic segmentation result. The first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud. One first three-dimensional point corresponds to at least one pixel point in the at least part of the pixel points. The purpose of step 303 in the embodiment of this application is that the semantic map module 203 uses the plane category of each pixel in the two-dimensional space to update the plane category of the three-dimensional point corresponding to the pixel in the three-dimensional space, that is, as the three-dimensional point The target plane category.

In a possible implementation manner, the semantic map module 203 may use at least one target plane category corresponding to the three-dimensional point cloud corresponding to all pixels in the semantic segmentation result as the first dense semantic map. Through step 303, the performance of the semantic map generation algorithm can be improved. Step 304: The semantic clustering module 204 performs plane semantic category recognition according to the first dense semantic map, and obtains the plane semantic category of one or more planes included in the image data to be processed.

The embodiment of the present application provides a method for recognizing planar semantic categories. The method obtains the result of semantic segmentation of image data to be processed. Since the result of semantic segmentation includes the target to which each pixel of the N pixels included in the image data to be processed belongs For plane categories, subsequent semantic segmentation can improve the accuracy of plane semantic recognition. In addition, in the method provided by the embodiments of the present application, the image data processing device obtains the first dense semantic map according to the semantic segmentation result, and then, uses the first dense semantic map to recognize the planar semantic category to obtain the planar semantic category of the image data to be processed It can enhance the accuracy and stability of planar semantic recognition.

In a possible implementation manner, step 303 in the embodiment of the present application can be implemented in the following manner: the semantic map module 203 determines whether the current state of the image data processing device is a motion state. When it is determined that the current state is a motion state, the first dense semantic map is obtained according to the semantic segmentation result. By judging whether it is in the motion state, the first dense semantic map is obtained according to the semantic segmentation result in the motion state, which can reduce the amount of calculation.

In a possible implementation manner, when the current state of the image data processing device is not in motion, that is, when it is in a static state, the image data processing device uses the historical dense semantic map as the first dense semantic map.

As a possible implementation manner, the image data to be processed in the embodiment of the present application is image data after correction. The semantic segmentation module 202 corrects the image data to be processed or adopts the image data to be processed before performing semantic segmentation on the image data to be processed, which can reduce the constraint of the semantic segmentation model on rotation invariance and improve the recognition rate.

It should be noted that if the first image data acquired by the semantic segmentation module 202 is not set in the embodiment of the present application, as shown in FIG. 4, the method provided in the embodiment of the present application may further include before step 301: step 305 , The semantic segmentation module 202 obtains the first image data shot by the first device.

Optionally, the image data processing apparatus may control the first device to capture the first image data, and send the captured first image data to the semantic segmentation module 202. Of course, the first image data can also be obtained by the semantic segmentation module 202 from a memory pre-stored in the image data processing apparatus, or the semantic segmentation module 202 can obtain the first image taken by the first device from other devices (for example, SLR or DV). Image data.

Exemplarily, the first device may be a camera built in the image data processing device, or a photographing device connected to the image data processing device. Correspondingly, step 301 can be implemented by the following step 3011: step 3011, the semantic segmentation module 202 corrects the first image data according to the first device pose of the first device corresponding to the first image data to obtain the image data to be processed. It should be noted that each image data in the embodiment of the present application may correspond to a device pose.

In the embodiment of the present application, if the semantic segmentation module 202 determines that the first image data is not set, it may set the first image data according to the pose of the device corresponding to the shooting of the first image data. The semantic segmentation module 202 can independently determine that the first image data is not set. Of course, in the case that the image data processing apparatus receives an operation instruction for the first image data input by the user for indicating the first image data to be set, In this way, the image data processing device can determine that the first image data is not set, and then the image data processing device uses the semantic segmentation module 202 to set the first image data.

The pose of the device corresponding to the image data in the embodiment of the present application refers to the pose of the device that captured the image data when the image data was captured. The same device may correspond to different device poses at different times. It can be understood that, if the first image data is image data that is to be corrected, the process of correcting the image to be processed can be omitted.

As shown in FIG. 5, (a) in FIG. 5 shows the first image data obtained by the image data processing device. As can be seen from (a) in FIG. 5, the first image data is not set to be positive. The image data processing device may correct the first image data according to the device pose of the device that took the first image data, and the image data after the correction is as shown in (b) of FIG. 5.

As another possible embodiment, with reference to FIG. 4, the method provided in this embodiment of the present application may be implemented in step 302 through the following steps 3021 and 3022:

Step 3021, the semantic segmentation module 202 determines one or more plane categories corresponding to any one pixel in at least some of the pixels and the probability of each plane category in the one or more plane categories. As a possible implementation, step 3021 in the embodiment of the present application can be specifically implemented in the following manner: the semantic segmentation module 202 performs semantic segmentation on the image data to be processed according to the neural network, and obtains at least part of the pixel points corresponding to any one of the pixels. One or more plane categories and the probability of each plane category in the one or more plane categories. The training and prediction process of the neural network can be specifically referred to the prior art, which is not limited in this embodiment.

Step 3022, the semantic segmentation module 202 uses the plane category with the highest probability among the one or more plane categories corresponding to any one pixel as the target plane category corresponding to any one pixel to obtain the image data to be processed. Semantic segmentation results. Among them, the probability of the target plane category corresponding to any one pixel is the largest among the probabilities of one or more plane categories corresponding to any one pixel.

It is understandable that any pixel in the embodiment of the present application may correspond to one or more plane categories, and any pixel may correspond to the probability of belonging to each plane category of the one or more plane categories. The sum of the probabilities of one or more plane categories corresponding to any pixel is equal to 1.

After the semantic segmentation module 202 obtains the image data to be processed, in order to enable the semantic segmentation module 202 to identify the plane category to which each region of the image data to be processed belongs, semantic segmentation processing may be performed on the image data to be processed. It is understandable that the purpose of semantic segmentation is to assign a category label to each pixel in the image data to be processed.

The image data to be processed is composed of many pixels, and semantic segmentation is to group the pixels according to the different semantic meanings expressed in the image. That is, the so-called semantic segmentation is to segment the image data to be processed into regions with different semantics, and mark the plane category to which each region belongs, such as cars, trees, or faces. Semantic segmentation combines the two technologies of segmentation and target recognition, and can segment the image into regions with advanced semantic content. For example, through semantic segmentation, an image data can be segmented into three different semantic regions of "cow", "grass" and "sky". As shown in Figure 6 (a) and Figure 6 (b), Figure 6 (a) shows a to-be-processed image data provided by an embodiment of the present application, and Figure 6 (b) shows The schematic diagram of the image data to be processed after semantic segmentation processing is shown. From (b) in Figure 6, it can be concluded that the image data to be processed is divided into four different semantic regions, such as "ground", "table", wall", and "chair".

In the embodiment of the present application, the semantic segmentation module 202 may use a semantic segmentation model to determine the probability of one or more plane categories to which each pixel of the N pixels belongs. As a possible implementation manner, in the embodiment of the present application, each pixel point may correspond to one or more plane categories, and the sum of the probabilities of all plane categories corresponding to each pixel point is equal to 1. The probability of the target plane category corresponding to any one of the N pixels is the highest probability among the probabilities of one or more plane categories corresponding to the any one pixel.

Taking (a) in FIG. 6 as an example, the plane category of one or more planes included in the image data to be processed is ground, table, chair, wall, etc., then the image data processing device can obtain through step 302 The target plane category to which pixel 1 to pixel 4 belong is shown in Table 1:

Table 1 Semantic segmentation results

pixel

Of the ground

Of the chair

Belonging to the table

Of the wall

Target plane

To	概率Probability	概率Probability	概率Probability	概率Probability	类别category
像素点1Pixel 1	1％1%	98％98%	1％1%	0％0%	椅子Chair
像素点2Pixel 2	1％1%	88％88%	1％1%	10％10%	椅子Chair
像素点3Pixel 3	10％10%	20％20%	70％70%	0％0%	桌子table
像素点4Pixel 4	98％98%	0.5％0.5%	1％1%	0.5％0.5%	地面ground

As a possible implementation manner, the semantic segmentation model in the embodiment of the present application may use mobileNet v2 as the coding network, or may be implemented by MaskRCNN. It should be understood that in the embodiments of the present application, any other model that can perform semantic segmentation may also be used to obtain the semantic segmentation result. The embodiment of the present application uses mobileNet v2 as an encoding network for semantic segmentation as an example for description, but this does not cause a restriction on the semantic segmentation method, and will not be repeated in the following. In addition, the mobileNet v2 model has the advantages of small size, fast speed, and high accuracy, which meets the requirements of mobile phone platforms and enables semantic segmentation to reach a frame rate of more than 5fps.

By performing semantic segmentation on the image data to be processed, the probability of the plane category corresponding to each pixel in the image data to be processed in the two-dimensional space can be obtained. Correspondingly, as shown in FIG. 4, step 302 in the embodiment of the present application can be implemented in the following manner: the semantic segmentation module 202 according to one or more plane categories corresponding to each of at least some of the N pixels The probability of determining the semantic segmentation result of the image data to be processed. That is, the semantic segmentation module 202 determines the plane category with the highest probability in each pixel of at least some of the pixels as the respective target plane category of each pixel to obtain the semantic segmentation result of the image data to be processed.

In a possible embodiment, in order to improve the accuracy of semantic segmentation, as shown in FIG. 4, the method provided in this embodiment of the present application after step 302 and before step 303 may further include: step 306, the semantic segmentation module 202 according to The image data to be processed and the depth information included in the depth image corresponding to the image data to be processed perform an optimization operation on the semantic segmentation result, and the optimization operation is used to correct the noise in the semantic segmentation result and the error part caused by the segmentation process. For example, there may be a pixel A corresponding to the ground in an image data close to the table, but the target plane category of the pixel A in the semantic segmentation result is a table, but in fact the target plane category of the pixel A should be the ground , So the target plane category of pixel A can be changed from table to ground. Or, a certain pixel point B is not segmented, and the target plane category of the pixel point B can be determined by performing an optimization operation. For the specific algorithm implementation of the optimization operation, reference may be made to the prior art for details, and details are not described in this embodiment.

The depth information in the embodiment of the present application includes the distance between each pixel and the device that captures the image data to be processed. The purpose of the optimization operation on the semantic segmentation result in the embodiment of the present application is to optimize and repair the semantic segmentation result. The depth information can be used to filter and modify the semantic segmentation results, avoiding wrong segmentation and unsegmentation in the semantic segmentation results. For the detailed process of optimizing the semantic segmentation result, please refer to the description of FIG. 10 and FIG. 11 below, which will not be repeated here.

As a possible implementation, the semantic map module 303 in this embodiment of the application determines whether the current state of the image data processing device is in a motion state (that is, step 303), which can be implemented in the following manner: the semantic map module 303 obtains the second image taken by the camera. Image data. The semantic map module 303 is based on the difference between the pose of the first device corresponding to the image data to be processed and the pose of the second device corresponding to the second image data, and the frame difference between the second image data and the image data to be processed Determine whether the current state of the image data processing device is a motion state.

Specifically, as shown in FIG. 8, the difference between the pose of the first device corresponding to the image data to be processed and the pose of the second device corresponding to the second image data is less than or equal to the first threshold, and the second image data When the frame difference between the image data and the image data to be processed is greater than the second threshold, the semantic map module 303 determines that the current state of the image data processing device is a motion state. Wherein, the second image data is adjacent to the image data to be processed and is located in the previous frame of the image data to be processed. Refer to Figure 8 for the specific process.

In addition, as shown in FIG. 8, the difference between the pose of the first device corresponding to the image data to be processed and the pose of the second device corresponding to the second image data is less than or equal to the first threshold and the second image data is less than or equal to the When the frame difference between processed image data is less than or equal to the second threshold, the image data processing device determines that the current state of the image data processing device is a static state. When the current state is a static state, the image data processing device can directly use the historical dense semantic map as the first dense semantic map, and perform subsequent processing.

The historical dense semantic map in the embodiment of the present application may be stored in the image data processing device, or of course, may also be obtained from other equipment by the image data processing device, which is not limited in the embodiment of the present application. Among them, the historical dense semantic map is the semantic image result generated and saved in history. After each frame of new image data arrives, the historical dense semantic map will be updated. Optionally, the historical dense semantic map is a dense semantic map corresponding to the previous frame of the dense semantic map corresponding to one frame of image data, or a synthesis of the dense semantic maps corresponding to the previous few frames of images.

As a possible implementation manner, step 304 of the embodiment of the present application can be implemented in the following manner: the semantic map module 303 obtains a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed. The semantic map module 303 directly uses the second dense semantic map as the first dense semantic map. That is, every time the second dense semantic map is calculated, the second dense semantic map is directly used as the subsequent calculation.

The depth image corresponding to the image data to be processed in the embodiment of the present application refers to an image that has the same size as the image data to be processed and whose element value is the depth value of the scene point corresponding to the image point in the image data to be processed. Specifically, the image data to be processed is acquired by the image acquisition device shown in FIG. 2, and the depth image corresponding to the processed image data is acquired by the TOF shown in the figure.

In the embodiments of the present application, methods such as TOF camera, structured light, laser scanning, etc. may be used to obtain depth information, thereby obtaining a depth image. It should be understood that, in the embodiments of the present application, any other method (or camera) for obtaining a depth image may also be used to obtain a depth image. In the following, only the use of the TOF camera to obtain the depth image is used as an example for description, but this does not cause a restriction on the way of obtaining the depth image, and will not be repeated in the following.

It should be noted that although the point cloud is a three-dimensional concept and the pixels in the depth image are a two-dimensional concept, when the depth value of a point in the two-dimensional image is known, the image coordinates of the point can be converted into three-dimensional The world coordinates in the space, so the point cloud in the three-dimensional space can be recovered according to the depth image. For example, the principle of visual geometry can be used to convert image coordinates into world coordinates. According to the principle of visual geometry, the process of mapping a three-dimensional point M (Xw, Yw, Zw) in a world coordinate system to a point m (u, v) on the image is shown in Figure 7. The Xc axis of the dashed line in Figure 7 is based on The Xc axis of the solid line is obtained after translation, and the Yc axis of the dashed line is obtained after the Yc axis is translated based on the solid line.

Figure 7 satisfies the following mathematical relationship:

Among them, u, v are arbitrary coordinate points in the image coordinate system. f is the focal length of the camera, dx and dy are the pixel sizes in the x and y directions, _{respectively, and u 0} and v ₀ are the center coordinates of the image, respectively. Xw, Yw, Zw represent the three-dimensional coordinate points in the world coordinate system. Zc represents the Z-axis value of the camera coordinates, that is, the distance from the target to the camera. R and T are the 3x3 rotation matrix and 3x1 translation matrix of the external parameter matrix, respectively.

First, the depth map can be restored to a point cloud based on the camera coordinate system, that is, the rotation matrix R takes the identity matrix, and the translation vector T is 0, and we can get:

Among them, Xc, Yc, Zc are the three-dimensional point coordinates in the camera coordinate system.

It can be derived from the above formula:

Zc represents the value on the depth map. The current depth unit obtained by TOF is millimeter (mm), so that the coordinates of the three-dimensional point in the camera coordinate system can be calculated, and then the device pose R and T calculated by the SLAM module can be calculated. The point cloud data converted to the world coordinate system can be obtained. Specifically, as shown in the following formula:

When the device pose calculated by the SLAM module and the depth data obtained by the TOF are accurate, a better point cloud registration result can be obtained. The three-dimensional points in this embodiment are three-dimensional pixels, that is, the two-dimensional pixels involved in steps 301 and 302 are converted into three-dimensional pixels.

As another possible implementation manner, step 304 of the embodiment of the present application can be implemented in the following manner: the semantic map module 303 obtains the second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed. The multiple three-dimensional points included in the second dense semantic map obtained by combining multiple two-dimensional pixel points and depth images can refer to the prior art. The semantic map module 303 uses one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to update the historical dense semantic map to obtain the first dense semantic map. Different from directly using the second dense semantic map as the first dense semantic map, a part of all the three-dimensional points in the second three-dimensional point cloud can be used for the update. Therefore, the update may not be for all three-dimensional points in the second dense semantic map, but only replace the target plane category of the corresponding three-dimensional point of the historical dense semantic map with the probability of the target plane category of some three-dimensional points in the second dense semantic map. Probability. Therefore, the update may be an update to a part of the dense semantic map, instead of directly using the second dense semantic map as the first dense semantic map.

Specifically, the semantic map module 303 uses one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to update the three-dimensional points corresponding to the one or more second three-dimensional points in the historical dense semantic map. The probability, or the probability of the target plane category of the three-dimensional point, obtains the first dense semantic map. It should be understood that the semantic map module 303 uses one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to update the three-dimensional points corresponding to the one or more second three-dimensional points in the historical dense semantic map. The probability is the probability of replacing the target plane category of the three-dimensional point A in the historical dense semantic map with the probability of the three-dimensional point A being in the target plane category of the second dense semantic map.

As a possible implementation manner, as shown in FIG. 4, step 304 in the embodiment of the present application can be specifically implemented in the following manner: step 3041, the semantic clustering module 204 determines the plane equation of each of the one or more planes . For example, the semantic clustering module 204 performs plane fitting on the three-dimensional point cloud data of each pixel to obtain a plane equation.

Specifically, the semantic clustering module 204 can use the RANSAC method or the SVD equation solving method to perform plane fitting on the three-dimensional point cloud data of each pixel to obtain a plane equation.

It is understandable that after obtaining the plane equation of each plane, the image data processing device in the embodiment of the present application can determine the respective area of each plane and the orientation of each plane. Taking the plane equation as AX+BY+CZ+D=0 as an example, the normal vector of the plane is n=(A, B, C). The normal vector is used to indicate the orientation of the plane. The orientation of the plane in the embodiment of the present application can also be replaced by the orientation of the plane in expression.

The semantic clustering module 204 performs the following steps 3042 and 3043 on any one of the one or more planes to obtain the plane semantic category of one or more planes: Step 3042, the semantic clustering module 204 according to the plane semantic category of any one of the planes The plane equation and the first dense semantic map determine one or more target plane categories corresponding to any one of the planes and the confidence of the one or more target plane categories.

In a possible implementation manner, step 3042 in the embodiment of the present application can be implemented in the following manner: the semantic clustering module 204 determines M from the first dense semantic map according to the plane equation of any one of the planes. For the first three-dimensional point, the distance between the M first three-dimensional points and any one of the planes is less than a third threshold, and M is a positive integer; the semantic clustering module 204 assigns one of the M first three-dimensional points to one The one or more target plane categories are determined as the one or more target plane categories corresponding to the any one plane, the orientation of the one or more target plane categories is consistent with the orientation of the any one plane, and the statistics of the one The ratio of the number of three-dimensional points corresponding to each target plane category in the multiple target plane categories among the M first three-dimensional points is used to obtain the confidence of the one or more target plane categories. The embodiment of the present application does not limit the specific value of the third threshold, and it can be set as needed in the actual process.

In the embodiment of the present application, the M first three-dimensional points determined from the first dense semantic map can be regarded as three-dimensional points belonging to any plane. Since the plane category to which each of the M first three-dimensional points belongs can be determined, the plane categories to which different three-dimensional points belong may be the same or different. For example, the plane category of three-dimensional point A among the M first three-dimensional points is "Ground", and the plane category of the three-dimensional point B among the M first three-dimensional points is "table", so the M first three-dimensional points can be obtained according to the plane category to which each of the M first three-dimensional points belongs Corresponding one or more plane categories. Since the M first three-dimensional points determined from the first dense semantic map are regarded as three-dimensional points belonging to the any plane, it can be determined that the any plane also corresponds to the one or more plane categories. The plane category of each three-dimensional point in the M first three-dimensional points may be the target plane category of the two-dimensional pixel points corresponding to the three-dimensional point mentioned in the previous embodiment. For example, step 3022 can be used to obtain the target plane category of each pixel and use it as the plane category of the three-dimensional point corresponding to each pixel, so that one or more target plane categories corresponding to the M first three-dimensional points can be obtained.

For example, if the plane category of N1 three-dimensional points among the M first three-dimensional points is "ground", that is, the number of three-dimensional points whose plane category is "ground" is N1, and the plane category of N2 three-dimensional points is "table" "That is to say, the number of 3D points whose plane type is "table" is N2, and the plane type to which N3 3D points belong is "wall", that is, the number of 3D points whose plane type is "wall" is N3, where N1+ N2+N3 is less than or equal to M, N1, N2, and N3 are positive integers, and the probability that the number of 3D points included in the plane category "ground" is among the M first 3D points is: N1/M. The probability of the number of three-dimensional points included in the plane category "table" in the M first three-dimensional points is: N2/M. The probability that the number of three-dimensional points included in the plane category "wall" is among the M first three-dimensional points is: N3/M. Then the confidences of one or more plane categories of any plane are: N1/M, N2/M, and N3/M. If N2/M>N1/M, and N2/M is greater than N3/M, then the semantic plane category of any plane is "ground".

Step 3043: The semantic clustering module 204 selects the target plane category with the highest confidence among the one or more target plane categories as the semantic plane category of any one of the planes.

For example, the confidence that plane A corresponds to the ground is P1, the confidence that plane A corresponds to the table is P2, the confidence that plane A corresponds to the wall is P3, and P1>P2>P3, so the semantic clustering module 204 can determine the semantics of plane A The plane category is ground.

Since any plane may correspond to one or multiple target plane categories, but not all target plane categories in the one or more target plane categories corresponding to any one plane have the same orientation as that of any plane, that is, a plane It may correspond to the target plane category that is consistent with the orientation of the plane, or it may correspond to the target plane category that is inconsistent with the orientation of the plane, and the target plane category that is inconsistent with the orientation of the plane has a lower probability of belonging to the semantic plane category of the plane. The probability of the plane category. Based on this, in order to simplify the subsequent calculation process and reduce the calculation error, in a possible implementation manner, the orientation of one or more target plane categories corresponding to any plane in the embodiment of the present application is consistent with the orientation of any plane. That is, the one or more target plane categories are plane categories selected by the image data processing device from all target plane categories corresponding to any one plane and consistent with the orientation of the any one plane. The one or more target plane categories may be all plane categories of all target plane categories corresponding to any one plane, or may be part of the plane categories, which is not limited in the embodiment of the present application. All target plane categories corresponding to any plane in the embodiment of the present application can be regarded as all target plane categories corresponding to the M first three-dimensional points.

For example, the plane a is facing downwards, the plane category (ground) is facing upwards, the plane category (table) is facing downwards, and the plane category (ceiling) is facing downwards. Therefore, in the calculation plane a belongs to one or more planes When the confidence of the category, the confidence that the plane a is determined to belong to the plane category (ground) can be eliminated. This not only reduces the calculation burden of the image data processing device, but also improves the calculation accuracy.

In a possible implementation, the semantic clustering module 204 counts the proportion of the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories among the M first three-dimensional points, and obtains all the three-dimensional points. After describing the confidence of one or more target plane categories, the method provided in the embodiment of the present application further includes: the semantic clustering module 204 updates one or more corresponding planes according to at least one of the Bayes theorem or the voting mechanism. The confidence of each target plane category.

Specifically, the semantic clustering module 204 performs plane fitting on the three-dimensional point cloud data of each three-dimensional point to obtain a plane equation. The formula of the plane equation is as follows: AX+BY+CZ+D=0. Among them, A, B, C, D are the plane equation parameters that need to be solved, and the optimal plane equation parameters are solved through multiple points. The specific fitting scheme can refer to the prior art. At the same time, the outermost point of all the points involved in the calculation will be the boundary point of the plane. The normal vector of the plane, that is, n=(A, B, C) can be used as the direction vector of the plane, and the area of the plane is defined as the area of the smallest enclosing rectangle of the boundary points of the plane.

Then, the semantic clustering module 204 counts and filters the M first three-dimensional points whose distance from the plane is less than the third threshold from the first dense semantic map based on the plane equation, orientation and area of the detected plane. The first three-dimensional point corresponds to one or more target plane categories. The semantic clustering module 204 regularizes the three-dimensional points of various plane categories in one or more target plane categories as the confidence of the plane category, that is, counts the number of three-dimensional points included in various target plane categories in all three-dimensional points (M first The ratio of the number of three-dimensional points. Based on Bayes' theorem and voting mechanism and the confidence of the last recorded various plane categories are updated, and the plane category with the highest current confidence is selected as the plane category of plane semantics, which can enhance the accuracy and stability of plane semantic recognition.

The semantic clustering module 204 specifically uses Bayes' theorem and voting mechanism to count the confidence that a plane calculated before the current moment belongs to multiple plane categories, so as to determine whether a plane calculated at the current moment belongs to multiple plane categories according to the obtained confidence. The confidence of each plane category is revised and updated.

For example, suppose the maximum number of votes under the voting mechanism is MAX_VOTE_COUNT, and the initial number of votes is 0. If the plane category of a 3D point C in the current frame is consistent with the plane category of the 3D point C in the previous frame of the current frame, then the 3D The number of votes corresponding to point C is increased by 1, and the plane category probability prob to which the three-dimensional point C belongs is updated to slide between the average value and the maximum value of the two. E.g,

Among them, prob _c represents the probability distribution of the plane category to which the three-dimensional point C of the current frame belongs, and prob _p represents the probability distribution of the plane category to which the three-dimensional point C of the frame before the current frame belongs. alpha=vote/MAX_VOTE_COUNT.

If the plane category to which a certain 3D point C belongs in the current frame is inconsistent with the plane category to which the 3D point C in the previous frame of the current frame belongs, the number of votes is reduced by 1 and the plane category probability prob is updated to take 80% of its value.

Specifically, step 304 can be specifically implemented as described in FIG. 9: step 901, the semantic clustering module 204 executes a plane detection step to obtain one or more planes included in the image data to be processed. Since the semantic clustering module 204 calculates the plane category of the plane semantics of each plane of one or more planes in the same manner and principle, the following steps take the process of calculating the plane category of the plane semantics of the first plane by the image data processing device as an example. It does not have an indicative meaning.

Step 902: The semantic clustering module 204 obtains the plane equation of the first plane. Step 903: The semantic clustering module 204 calculates the area of the first plane. Step 904: The semantic clustering module 204 calculates the orientation of the first plane. For the specific implementation of step 903 and step 904, reference may be made to the process of calculating the area and orientation of a plane in the prior art, which will not be repeated here. Step 905: The semantic clustering module 204 counts the M three-dimensional points of the three-dimensional points of various plane categories in the first dense semantic map whose distance between the first plane is less than the third threshold. Step 906: The semantic clustering module 204 determines whether the orientation of each target plane category in the one or more target plane categories corresponding to the M three-dimensional points is consistent or the same as the orientation of the first plane.

Step 907: If the orientation of the various target plane categories is consistent with the orientation of the first plane, the semantic clustering module 204 determines whether the number of three-dimensional points included in each target plane category in the unit plane meets the threshold according to the area of the first plane.

Step 908: If it is determined according to the plane area that the number of 3D points included in each target plane category in the unit plane meets the threshold, the semantic clustering module 204 performs regularization processing on the number of 3D points included in each target plane category, that is The proportion of the total number of three-dimensional points included in each target plane category in the M first three-dimensional points is calculated to obtain the confidence that the first plane belongs to one or more target plane categories. Step 909: The semantic clustering module 204 updates the Bayesian probability of the currently calculated confidence that the first plane belongs to one or more target plane categories and the currently calculated confidence that the first plane belongs to various target plane categories. Step 910: The semantic clustering module 204 uses the target plane category with the highest current confidence of the first plane as the plane category of the first plane.

It should be noted that if the orientation of each target plane category is inconsistent with the orientation of the first plane, the semantic clustering module 204 determines that the process stops. In addition, if the semantic clustering module 204 determines that the number of three-dimensional points included in the various target plane categories in the unit plane does not meet the threshold value according to the area of the first plane, the image data processing apparatus determines that the process stops.

In the embodiment of the present application, the semantic segmentation module 202 performs an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information of the image data to be processed, including the random sample consensus (RANSAC) as described in FIG. 10 The ground equation estimation process and the semantic seed point area growth process shown in Figure 11.

(1) RANSAC ground equation estimation

Floor, as an important part of the scene, has the following characteristics: Floor is a plane with a large area; Floor is an important reference for SLAM initialization; the ground is easier to detect and recognize than other semantic targets; objects in the scene Mostly located on the ground; the height of the objects in the scene is mostly relative to the ground. Therefore, it is very necessary to divide the ground first and obtain the plane equation.

The RANSAC algorithm is also known as the random sampling consensus estimation method. It is a robust estimation method, which is more suitable for the estimation of the plane with a large area such as the ground. The result of semantic segmentation of the deep neural network is relied on here. The ground semantic pixels are extracted from multiple three-dimensional points (FLOOR three-dimensional points) and obtain the point cloud data composed of the depth information to realize the ground equation estimation based on RANSAC. The specific steps are shown in Figure 10:

As a possible implementation manner, in the embodiment of the present application, when the plane type is ground, the ground equation can also be estimated by using AI.

Step 1011: The semantic segmentation module 202 obtains P three-dimensional points included in the ground by performing semantic segmentation processing on the ground. The number of iterations of the RANSAC algorithm is M. If M>0, the image data processing device randomly selects l (for example, l is 3) three-dimensional points from the P three-dimensional points as sampling points. Otherwise, skip to step 1016 for execution.

Step 1012: The semantic segmentation module 202 brings the three-dimensional coordinates of l three-dimensional points into the plane equation Ax+By+Cz=1, and uses Singular Value Decomposition (SVD) to solve the plane equation parameters n=[A B C].

Step 1013: The semantic segmentation module 202 brings the three-dimensional coordinates q of the P three-dimensional points into the estimated plane equation respectively, and obtains the scalar distance d from the P three-dimensional points to the plane equation. If d is less than the preset threshold η, Then P three-dimensional points are considered as interior points, and the number k of interior points is counted. in,

Step 1014: The semantic segmentation module 202 discriminates the size of the number k of interior points in this iteration and the optimal number of interior points K. If k<K, the semantic segmentation module 202 reduces the number of iterations M of the RANSAC algorithm by 1, and jumps to step 1011 execute, otherwise continue to execute.

Step 1015: The semantic segmentation module 202 assigns the number k of interior points in this iteration to the optimal number of interior points K, the semantic segmentation module 202 saves the index of the optimal interior points, and calculates the percentage of the number of interior points v=K/P, and formula

Modify the number of iterations M, where w=0.99 and n=3.

Step 1016: The semantic segmentation module 202 uses K optimal interior points to re-estimate the plane equation, that is, establishes an overdetermined equation composed of K equations, and uses SVD to find the global optimal plane equation.

(2) Semantic seed point area growth

Aiming at the problem of under-segmentation or over-segmentation in the result of neural network semantic segmentation, the semantic seed is combined with depth information to grow the segmentation area and modify the segmentation result. Here, the number of pixels in the semantic segmentation category is used as an indicator of the priority of region growth, to give priority to the category with a larger number of pixels in the semantic segmentation category for region growth, but the priority of the ground is the highest, that is, the ground is grown first before the region is grown. Regional growth for other categories.

The region growing algorithm relies on the degree of similarity between the seed points and their domain points to merge adjacent points with higher similarity and continue to grow outward until the neighboring points that do not meet the similarity condition are merged in. Here A typical 8-neighborhood is selected for region growth, and the similarity condition is also selected for depth and color information to express, so that the under-segmented region can be better corrected. The so-called seed point is the initial point of region growth, and region growth uses a method similar to Breadth-First-Search (BFS) to spread out and grow. The specific steps are shown in Figure 11:

Step 1101: The semantic segmentation module 202 traverses the priority list of semantic segmentation categories, and pushes the plane category with high priority onto the stack (push into the seed point stack) for region growth. Exemplarily, suppose the seed point stack of the current push category is

That is, there are K seed points, and the coordinates of the two-dimensional pixel points corresponding to each seed point are (i, j). The so-called priority list is the statistical segmentation result, which is established from more to less according to the number of each plane category.

Step 1102: If the seed point stack is not empty, the semantic segmentation module 202 _{pops the last seed point s K} (i,j) from the stack and deletes it from the stack, and determines its neighboring point p(i+m,j+n ) Whether the category is other (OTHER), if it is to continue to execute, otherwise skip to step 1101 for execution.

Step 1103: The semantic segmentation module 202 compares _{the similarity distance d between the seed point s K} and the neighboring point p. If the similarity distance d is less than the given threshold η, then continue to execute, otherwise jump to step 1101 for execution, The expression of similarity distance d is as follows:

Step 1104: The semantic segmentation module 202 pushes the neighborhood point p satisfying the similarity condition into the seed point stack

Then jump to step 1101 for execution. Then build a semantic map according to the method described above, and complete the plane detection and recognition, and a stable and accurate plane semantic result can be obtained, as shown in Figure 12 below.

The foregoing mainly introduces the solution of the embodiment of the present application from the perspective of the image data processing device. It can be understood that, in order to realize the above-mentioned functions, the image data processing apparatus and the like include hardware structures and/or software modules corresponding to the respective functions. Those skilled in the art should easily realize that in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.

The embodiment of the present application may divide the functional units according to the foregoing method example image data processing apparatus. For example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit. It should be noted that the division of units in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.

In the case of dividing each function module corresponding to each function, FIG. 2 shows a possible structural schematic diagram of the image data processing device involved in the foregoing embodiment. The image data processing device includes: a semantic segmentation module 202, The semantic map module 203 and the semantic clustering module 204. Among them, the semantic segmentation module 202 is used to support the image data processing apparatus to execute steps 301 and 302 in the above-mentioned embodiment. The semantic map module 203 is used to support the image data processing device to execute step 303 in the foregoing embodiment. The semantic clustering module 204 is used to support the image data processing apparatus to perform step 304 in the above-mentioned embodiment.

In a possible embodiment, the semantic segmentation module 202 is further configured to support the image data processing apparatus to execute step 305 in the foregoing embodiment. The semantic segmentation module 202 is used to support the image data processing device to execute step 3011 in the foregoing embodiment. The semantic segmentation module 202 is used to support the image data processing device to execute step 306, step 3021, and step 3022 in the above-mentioned embodiment. In a possible embodiment, the semantic clustering module 204 is used to support the image data processing apparatus to perform step 3041, step 3042, and step 3043 in the foregoing embodiment. In addition, the semantic clustering module 204 is also used to support the image data processing device to execute steps 901 to 910 in the foregoing embodiment. The device can be implemented in the form of software and stored in a storage medium.

The above describes an image data processing device in the embodiment of the present application from the perspective of a modular functional entity, and the following describes an image data processing device in the embodiment of the present application from the perspective of hardware processing. As shown in FIG. 13, FIG. 13 shows a schematic diagram of a possible hardware structure of the image data processing device involved in the above-mentioned embodiment. The image data processing device includes: a first processor 1301 and a second processor 1302. Optionally, the image data processing apparatus may further include a communication interface 1303, a memory 1304, and a bus 1305. The communication interface 1303 may include an input interface 13031 and an output interface 13032. Correspondingly, when the image data processing apparatus is an electronic device, the first processor 1301 and the second processor 1302 may be the processor 120 shown in FIG. 1. For example, the first processor 1301 may be a DSP or a CPU. The second processor 1302 may be an NPU. The communication interface 1303 may be the input device 140 in FIG. 1. The memory 1304 is used to store program codes and data of the image data processing device, and corresponds to the memory 130 in FIG. 1. The bus 1305 may be built in the processor 120 shown in FIG. 1.

In this case, the first processor 1301 and the second processor 1302 are configured to perform part of the functions in the image data processing method described above. For example, the first processor 1301 is configured to support the image data processing apparatus to execute step 301 of the foregoing embodiment. The second processor 1302 is used for the image data processing apparatus to execute step 302 of the foregoing embodiment. The first processor 1301 is used for the image data processing apparatus to execute step 303 and step 304 of the foregoing embodiment.

In a possible embodiment, the first processor 1301 is further configured to support the image data processing apparatus to execute step 305, step 3011, step 3041, step 3042, step 3043 in the foregoing embodiment. The second processor 1302 is also configured to support the image data processing apparatus to execute step 306, step 3021, and step 3022 in the foregoing embodiment. Optionally, the first processor 1301 is further configured to support the image data processing apparatus to execute steps 901 to 910 in the foregoing embodiment.

In some feasible embodiments, the first processor 1301 or the second processor 1302 may be a single-processor structure, a multi-processor structure, a single-threaded processor, a multi-threaded processor, etc.; in some feasible embodiments The first processor 1301 may be a central processing unit, a general-purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. The second processor 1302 may be a neural network processor, which may implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of this application. The processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and so on.

Output interface 13032: This output interface is used to output the processing result in the above-mentioned image data processing method. In some feasible embodiments, the processing result can be directly output by the processor, or it can be stored in the memory first, and then passed through the memory. Output; in some feasible embodiments, there may be only one output interface, or there may be multiple output interfaces. In some feasible embodiments, the processing result output by the output interface can be sent to the memory for storage, or sent to another processing flow to continue processing, or sent to the display device for display, or sent to the player terminal for playback. Wait.

Memory 1301: The memory 1301 can store the aforementioned image data to be processed and related instructions for configuring the first processor or the second processor. In some feasible embodiments, there may be one memory or multiple memories; the memory may be a floppy disk, a hard disk such as a built-in hard disk and a mobile hard disk, a magnetic disk, an optical disk, a magneto-optical disk such as CD_ROM, DCD_ROM, non-volatile storage Devices such as RAM, ROM, PROM, EPROM, EEPROM, flash memory, or any other form of storage medium known in the technical field.

Bus 1304: This bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 13, but it does not mean that there is only one bus or one type of bus.

The components of the above-mentioned image data processing device provided by the embodiments of this application are respectively used to realize the functions of the corresponding steps of the aforementioned feature extraction method, because in the foregoing image data processing method embodiments, each step has been performed. The detailed description will not be repeated here.

The embodiment of the present application also provides a computer-readable storage medium that stores instructions in the computer-readable storage medium, and when it runs on a device (for example, the device may be a single-chip microcomputer, a chip, a computer, etc.), The device executes one or more of steps 301 to 3011 of the above-mentioned image data processing method. If each component module of the above-mentioned image data processing device is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in the computer readable storage medium.

Based on this understanding, the embodiments of the present application also provide a computer program product containing instructions. The technical solution of the present application is essentially or a part that contributes to the prior art, or all or part of the technical solution can be a software product. The computer software product is stored in a storage medium and includes a number of instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) or a processor therein execute the various embodiments of the present application All or part of the steps of the method.

In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, it can be implemented in the form of a computer program product in whole or in part. The computer program product includes one or more computer programs or instructions. When the computer program or instruction is loaded and executed on the computer, the process or function described in the embodiment of the present application is executed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, network equipment, user equipment, or other programmable devices. The computer program or instruction may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer program or instruction may be downloaded from a website, computer, The server or data center transmits to another website site, computer, server or data center through wired or wireless means. The computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that integrates one or more available media. The usable medium may be a magnetic medium, such as a floppy disk, a hard disk, and a magnetic tape; it may also be an optical medium, such as a digital video disc (digital video disc, DVD); and it may also be a semiconductor medium, such as a solid state drive (solid state drive). , SSD).

Although the present application is described in conjunction with various embodiments, in the process of implementing the claimed application, those skilled in the art can understand and realize the disclosure by looking at the drawings, the disclosure, and the appended claims. Other changes to the embodiment. In the claims, the word "comprising" does not exclude other components or steps, and "a" does not exclude multiple. A single processor or other unit may implement several functions listed in the claims. Certain measures are described in mutually different dependent claims, but this does not mean that these measures cannot be combined to produce good results.

Although the application has been described in combination with specific features and embodiments, it is obvious that various modifications and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary descriptions of the application as defined by the appended claims, and are deemed to cover any and all modifications, changes, combinations or equivalents within the scope of the application. Obviously, those skilled in the art can make various changes and modifications to the application without departing from the spirit and scope of the application. In this way, if these modifications and variations of this application fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include these modifications and variations.

Claims

A method for identifying plane semantic categories, which is characterized in that it includes:

Acquiring image data to be processed, where the image data to be processed includes N pixels, where N is a positive integer;

Determining a semantic segmentation result of the to-be-processed image data, where the semantic segmentation result includes a target plane category corresponding to at least some of the N pixels;

According to the semantic segmentation result, a first dense semantic map is obtained. The first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud. The point corresponds to at least one pixel point in the at least part of the pixel points;

Perform plane semantic category recognition according to the first dense semantic map, and obtain plane semantic categories of one or more planes included in the image data to be processed.
The method according to claim 1, wherein the obtaining a first dense semantic map according to the semantic segmentation result comprises:

Obtaining a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed;

Use the second dense semantic map as the first dense semantic map, or,

The historical dense semantic map is updated by using one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.
The method according to claim 1 or 2, wherein after the determination of the semantic segmentation result of the image data to be processed, the method further comprises:

Perform an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information included in the depth image corresponding to the image data to be processed, and the optimization operation is used to correct noise and errors in the semantic segmentation result part.
The method according to any one of claims 1 to 3, wherein the determining the semantic segmentation result of the image data to be processed comprises:

Determining one or more plane categories corresponding to any one of the at least part of the pixel points and the probability of each plane category in the one or more plane categories;

Taking the plane category with the highest probability among the one or more plane categories corresponding to any one pixel as the target plane category corresponding to any one pixel to obtain a semantic segmentation result of the image data to be processed.
The method according to claim 4, wherein the determining the probability of each of the one or more plane categories corresponding to any one of the at least part of the pixels comprises:

Perform semantic segmentation on the to-be-processed image data according to the neural network to obtain the probability of each plane category in one or more plane categories corresponding to any one of the at least some pixels.
The method according to any one of claims 1 to 5, wherein the plane semantic category recognition is performed according to the first dense semantic map to obtain the plane of one or more planes included in the image data to be processed Semantic categories, including:

Determine the plane equation of each of the one or more planes according to the image data to be processed;

Perform the following steps on any one of the one or more planes to obtain the plane semantic category of any one of the planes:

Determine one or more target plane categories corresponding to any one of the planes and the confidence of the one or more target plane categories according to the plane equation of any one of the planes and the first dense semantic map;

Select the target plane category with the highest confidence among the one or more target plane categories as the semantic plane category of any one of the planes.
The method according to claim 6, wherein the orientation of the one or more target plane categories corresponding to the any plane is consistent with the orientation of the any plane.
The method according to claim 6 or 7, wherein the determining one or more target planes corresponding to the any plane according to the plane equation of the any plane and the first dense semantic map The category and the confidence of the one or more target plane categories include:

Determining M first three-dimensional points from the first dense semantic map according to the plane equation of any one of the planes, and the distance between the M first three-dimensional points and any one of the planes is less than a third threshold, M is a positive integer;

The one or more target plane categories corresponding to the M first three-dimensional points are determined as the one or more target plane categories corresponding to any one of the planes, and the orientation of the one or more target plane categories is different from the direction of the target plane. The orientation of any one of the planes is the same,

The ratio of the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories among the M first three-dimensional points is calculated to obtain the confidence of the one or more target plane categories.
The method according to claim 8, wherein the method further comprises:

Update the confidence of one or more target plane categories corresponding to any one of the planes according to at least one of Bayes' theorem or voting mechanism.
The method according to any one of claims 1-9, wherein the obtaining a first dense semantic map according to the semantic segmentation result comprises:

Determine whether the current state is an exercise state;

In the case that the current state is a motion state, a first dense semantic map is obtained according to the semantic segmentation result.
The method according to any one of claims 1-10, wherein the image data to be processed is image data after correction.
An image data processing device, characterized in that it comprises:

A semantic segmentation module for acquiring image data to be processed including N pixels, and determining a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result includes at least some of the N pixels Corresponding target plane category, N is a positive integer;

The semantic map module is configured to obtain a first dense semantic map according to the semantic segmentation result, and the first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud, so The at least one first three-dimensional point corresponds to at least one pixel point in the at least part of the pixel points;

The semantic clustering module is configured to recognize the plane semantic category according to the first dense semantic map, and obtain the plane semantic category of one or more planes included in the image data to be processed.
The device according to claim 12, wherein the semantic map module is specifically configured to:

Obtaining a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed;

Use the second dense semantic map as the first dense semantic map, or,

It is used to update the historical dense semantic map by using one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.
The device according to claim 12 or 13, characterized in that, after the semantic segmentation result of the image data to be processed is determined, the semantic segmentation module is further configured to perform according to the image data to be processed and the image data to be processed. For the depth information included in the depth image corresponding to the image data, an optimization operation is performed on the semantic segmentation result, and the optimization operation is used to correct noise and error parts in the semantic segmentation result.
The device according to any one of claims 12-14, wherein the semantic segmentation module is specifically configured to determine one or more plane categories corresponding to any one of the at least partial pixels and the The probability of each plane category in one or more plane categories;

And it is used to set the plane category with the highest probability among the one or more plane categories corresponding to any one pixel as the target plane category corresponding to any one pixel to obtain the semantic segmentation result of the image data to be processed.
The device according to claim 15, wherein the semantic segmentation module is configured to perform semantic segmentation on the image data to be processed according to a neural network to obtain a pixel corresponding to any one of the at least some pixels. Or the probability of each plane category in multiple plane categories.
The device according to any one of claims 12-16, wherein the semantic clustering module is configured to:

Determine the plane equation of each of the one or more planes according to the image data to be processed;

Perform the following steps on any one of the one or more planes to obtain the plane semantic category of the one or more planes:

Determining one or more target plane categories corresponding to any one of the planes and the confidence of the one or more target plane categories according to the plane equation of any one of the planes and the first dense semantic map;

Select the target plane category with the highest confidence among the one or more target plane categories as the semantic plane category of any one of the planes.
The device according to claim 17, wherein the orientation of the one or more target plane categories corresponding to the any one plane is consistent with the orientation of the any one plane.
The device according to claim 17 or 18, wherein the semantic clustering module is specifically configured to:

Determining M first three-dimensional points from the first dense semantic map according to the plane equation of any one of the planes, and the distance between the M first three-dimensional points and any one of the planes is less than a third threshold, M is a positive integer;

The one or more target plane categories corresponding to the M first three-dimensional points are determined as the one or more target plane categories corresponding to any one of the planes, and the orientation of the one or more target plane categories is different from the direction of the target plane. The orientation of any one of the planes is the same,

The ratio of the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories among the M first three-dimensional points is calculated to obtain the confidence of the one or more target plane categories.
The device according to claim 19, wherein the semantic clustering module counts the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories in the M first three-dimensional points After obtaining the confidence of the one or more target plane categories, the semantic clustering module is also used to update one or the corresponding one of the planes according to at least one of Bayes' theorem or voting mechanism. Confidence of multiple target plane categories.
The device according to any one of claims 12 to 20, wherein the semantic map module is specifically configured to determine whether the current state is a motion state, and when determining that the current state is the motion state, according to the According to the semantic segmentation result, the first dense semantic map is obtained.
The device according to any one of claims 12-21, wherein the image data to be processed is image data after correction.
A computer-readable storage medium, characterized in that instructions are stored in the readable storage medium, and when the instructions are executed, the method according to any one of claims 1 to 11 is implemented.
A processing device, characterized by comprising a first processor and a second processor, wherein the first processor is used to obtain image data to be processed, and the image data to be processed includes N pixels, where N is positive Integer

The second processor is configured to determine a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result includes a target plane category corresponding to at least some of the N pixels;

The first processor is further configured to obtain a first dense semantic map according to the semantic segmentation result, where the first dense semantic map includes at least one target corresponding to at least one first three-dimensional point in the first three-dimensional point cloud Plane category, where the at least one first three-dimensional point corresponds to at least one pixel in the at least part of the pixel points; performing planar semantic category recognition according to the first dense semantic map to obtain one of the image data to be processed Or multiple plane semantic categories.
The processing device according to claim 24, wherein the second processor is specifically configured to determine one or more plane categories corresponding to any one of the at least part of the pixels and the one or more The probability of each plane category in the plane category;

Taking the plane category with the highest probability among the one or more plane categories corresponding to any one pixel as the target plane category corresponding to any one pixel to obtain a semantic segmentation result of the image data to be processed.
The processing device according to claim 25, wherein the second processor is specifically configured to perform semantic segmentation on the to-be-processed image data according to a neural network to obtain a pixel point corresponding to any one of the at least some pixels. The probability of each plane category in one or more plane categories.
A processing device, characterized by comprising: one or more processors, wherein the one or more processors are used to execute instructions stored in a memory to execute the method according to any one of claims 1 to 11 .