WO2021147113A1 - Plane semantic category identification method and image data processing apparatus - Google Patents


Publication number
WO2021147113A1
Authority
WO
WIPO (PCT)
Prior art keywords
plane
semantic
image data
category
categories
Prior art date
Application number
PCT/CN2020/074040
Other languages
French (fr)
Chinese (zh)
Inventor
马超群 (Ma Chaoqun)
陈平 (Chen Ping)
方晓鑫 (Fang Xiaoxin)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to CN202080001308.1A (published as CN113439275A)
Priority to PCT/CN2020/074040 (published as WO2021147113A1)
Publication of WO2021147113A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • the embodiments of the present application relate to the field of image processing technology, and in particular, to a method for recognizing planar semantic categories and an image data processing device.
  • Augmented reality is a technology that calculates the position and orientation of the camera image in real time and overlays corresponding images, videos, and 3D models.
  • The goal of this technology is to place the virtual world over the real world on the screen and allow the two to interact.
  • Plane detection, an important function in augmented reality, provides perception of the basic three-dimensional environment of the real world, so that developers can place virtual objects on detected planes to achieve the augmented reality effect.
  • Three-dimensional plane detection is an important and basic capability because, once a plane is detected, the anchor point of an object can be determined and the object rendered at that anchor point.
  • However, the planes detected by most augmented reality algorithms provide only location information; the plane category of the plane cannot be identified.
  • Recognizing the plane category of a plane can help developers improve the realism and appeal of augmented reality applications.
  • the embodiments of the present application provide a method for recognizing planar semantic categories and an image data processing device to improve the accuracy of planar semantic category recognition.
  • an embodiment of the present application provides a method for recognizing planar semantic categories, including: an image data processing device obtains image data to be processed including N pixels, where N is a positive integer.
  • the image data processing device determines a semantic segmentation result of the image data to be processed, where the semantic segmentation result includes the target plane category corresponding to at least some of the N pixels.
  • the image data processing device obtains a first dense semantic map according to the semantic segmentation result, the first dense semantic map including at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point of the at least some pixels.
  • the image data processing device performs plane semantic category recognition according to the first dense semantic map, and obtains the plane semantic category of one or more planes included in the image data to be processed.
  • the embodiment of the present application provides a method for recognizing planar semantic categories.
  • The method obtains the semantic segmentation result of the image data to be processed. Because the semantic segmentation result includes the target plane category to which each of the N pixels included in the image data to be processed belongs, subsequent processing based on semantic segmentation can improve the accuracy of plane semantic recognition.
  • The image data processing device obtains the first dense semantic map according to the semantic segmentation result, and then uses the first dense semantic map to recognize the plane semantic categories of the image data to be processed, which can enhance the accuracy of plane semantic recognition.
  • the image data processing device obtaining the first dense semantic map according to the semantic segmentation result includes: the image data processing device obtains the second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed.
  • the image data processing device uses the second dense semantic map as the first dense semantic map.
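As an illustrative sketch (not the patent's stated implementation), a dense semantic map can be built by back-projecting every labelled pixel into 3-D through a pinhole camera model using the depth image; the function and parameter names (`build_dense_semantic_map`, `fx`, `fy`, `cx`, `cy`) are assumptions:

```python
import numpy as np

def build_dense_semantic_map(labels, depth, fx, fy, cx, cy):
    """Back-project every pixel with a valid depth into a 3-D point and
    attach its target plane category, yielding (points, categories).

    labels: (H, W) integer plane category per pixel (segmentation result)
    depth:  (H, W) depth in metres, 0 where depth is missing
    """
    h, w = labels.shape
    v, u = np.mgrid[0:h, 0:w]            # pixel coordinates
    valid = depth > 0                    # skip pixels with no depth
    z = depth[valid]
    x = (u[valid] - cx) * z / fx         # pinhole camera model
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)   # (M, 3) first 3-D points
    return points, labels[valid]            # one category per 3-D point

# usage: a 2x2 frame with one missing-depth pixel
labels = np.array([[0, 1], [0, 1]])
depth = np.array([[1.0, 2.0], [0.0, 1.0]])
pts, cats = build_dense_semantic_map(labels, depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```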
  • the image data processing device obtains the first dense semantic map according to the semantic segmentation result, including: the image data processing device obtains the second dense semantic map according to the semantic segmentation result.
  • the image data processing device uses one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to update the historical dense semantic map to obtain the first dense semantic map.
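One simple way to realise this update of a historical dense semantic map with a new frame's second three-dimensional points is to key the map by a voxel grid and let newer observations overwrite older ones; the voxel size, the function name, and the overwrite policy below are assumptions, not the patent's method:

```python
import numpy as np

def update_semantic_map(history, points, categories, voxel=0.05):
    """Merge a new frame's labelled 3-D points into a historical dense
    semantic map. `history` maps a voxel index (i, j, k) to a plane
    category; newer observations overwrite stale ones (simplest policy).
    """
    keys = np.floor(np.asarray(points) / voxel).astype(int)
    for key, cat in zip(map(tuple, keys), categories):
        history[key] = cat
    return history

# usage: two nearby points fall into the same voxel -> one map entry
history = update_semantic_map({}, np.array([[0.0, 0.0, 1.02],
                                            [0.0, 0.0, 1.03]]), [2, 2])
```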
  • the image data processing apparatus judging whether the current state of the image data processing apparatus is a motion state includes: the image data processing apparatus obtains second image data, which is different from the image data to be processed.
  • the image data processing device determines whether the state of the image data processing device is a motion state according to the pose of the first device corresponding to the image data to be processed and the pose of the second device corresponding to the second image data.
  • the second image data is the frame immediately preceding the image data to be processed.
  • the image data processing apparatus determining that the current state is the motion state includes: when the difference between the pose of the first device and the pose of the second device is less than or equal to a first threshold, determining that the current state is the motion state.
  • the image data processing device determining that the current state is a motion state includes: the image data processing device acquires second image data shot by the camera, where the second image data is the frame immediately preceding the image data to be processed; and the image data processing device determines that the state of the image data processing device is a motion state according to the first device pose corresponding to the image data to be processed, the second device pose corresponding to the second image data, and the frame difference between the second image data and the image data to be processed.
  • determining, according to the pose of the first device corresponding to the image data to be processed, the pose of the second device corresponding to the second image data, and the frame difference between the second image data and the image data to be processed, that the state of the image data processing device is a motion state includes: when the difference between the first device pose and the second device pose is less than or equal to the first threshold and the frame difference between the second image data and the image data to be processed is greater than the second threshold, the state of the image data processing device is a motion state.
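The motion-state test just described can be sketched as follows, taking the condition exactly as stated (pose difference at most the first threshold and frame difference greater than the second threshold); how the pose difference is measured and the threshold values are application-specific assumptions:

```python
def is_motion_state(pose_diff, frame_gap, pose_thresh, gap_thresh):
    """Motion-state test as stated above: the device is in the motion
    state when the pose difference between the frame to be processed and
    the second image data is at most `pose_thresh` (the first threshold)
    while their frame gap exceeds `gap_thresh` (the second threshold).
    `pose_diff` could be e.g. the Euclidean distance between the two
    camera positions (an assumption).
    """
    return pose_diff <= pose_thresh and frame_gap > gap_thresh

# usage: small pose change across a large frame gap -> motion state
moving = is_motion_state(pose_diff=0.01, frame_gap=5,
                         pose_thresh=0.05, gap_thresh=2)
```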
  • the method provided in the embodiment of the present application further includes: the image data processing apparatus performs an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information included in the depth image corresponding to the image data to be processed; the optimization operation is used to correct noise and error parts in the semantic segmentation result. This can make subsequent semantic recognition more accurate.
  • the image data processing apparatus determining the semantic segmentation result of the image data to be processed includes: the image data processing apparatus determines, for any one of the at least some pixels, the probability of each plane category among one or more plane categories corresponding to that pixel.
  • the image data processing device uses the plane category with the highest probability among the one or more plane categories corresponding to any one pixel as the target plane category corresponding to that pixel, to obtain the semantic segmentation result of the image data to be processed. That is, the probability of the target plane category corresponding to any pixel is the largest among the probabilities of the one or more plane categories corresponding to that pixel. This can improve the accuracy of semantic recognition.
  • the image data processing device determining the probability of each plane category among one or more plane categories corresponding to any one of the at least some pixels includes: the image data processing device performs semantic segmentation on the image data to be processed according to a neural network to obtain the probability of each plane category among the one or more plane categories corresponding to any one of the at least some pixels.
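The per-pixel selection of the highest-probability plane category is a simple arg-max over the network's class probabilities; a minimal sketch, assuming the segmentation network outputs an (H, W, C) probability map:

```python
import numpy as np

def segmentation_result(class_probs):
    """Pick, for every pixel, the plane category with the highest
    probability. `class_probs` is an (H, W, C) array of per-pixel
    probabilities over C plane categories (e.g. a softmax output)."""
    return np.argmax(class_probs, axis=-1)   # (H, W) target plane category map

# usage with a 1x2 image and three plane categories
probs = np.array([[[0.1, 0.7, 0.2],    # pixel 0 -> category 1
                   [0.6, 0.3, 0.1]]])  # pixel 1 -> category 0
seg = segmentation_result(probs)
```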
  • the image data processing device recognizing the plane semantic category according to the first dense semantic map and obtaining the plane semantic category of one or more planes included in the image data to be processed includes: the image data processing device determines the plane equation of each of the one or more planes according to the image data to be processed.
  • the image data processing device performs the following steps on any one of the one or more planes to obtain the plane semantic categories of the one or more planes: the image data processing device determines, according to the plane equation of that plane and the first dense semantic map, one or more target plane categories corresponding to that plane and the confidence of each; and selects the target plane category with the highest confidence as the semantic plane category of that plane. That is, the semantic plane category of any plane is the target plane category with the highest confidence among those corresponding to it; selecting it can enhance the accuracy of plane semantic recognition.
  • the orientation of the one or more target plane categories corresponding to any plane is consistent with the orientation of that plane. In this way, target plane categories inconsistent with the plane orientation can be filtered out, enhancing the accuracy of plane semantic recognition.
  • the image data processing device determining, according to the plane equation of any one plane and the first dense semantic map, one or more target plane categories corresponding to that plane and their confidences includes: the image data processing device determines M first three-dimensional points from the first dense semantic map according to the plane equation of the plane, where the distance between each of the M first three-dimensional points and the plane is less than a third threshold and M is a positive integer; the one or more target plane categories corresponding to the M first three-dimensional points are determined as the one or more target plane categories corresponding to the plane, the orientation of these target plane categories being consistent with the orientation of the plane; and the device counts, for each target plane category, the proportion of its corresponding three-dimensional points among the M first three-dimensional points to obtain the confidence of the one or more target plane categories.
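A hedged sketch of this confidence computation: points within the third threshold of the plane are taken as the M inliers, categories whose assumed orientation disagrees with the plane normal are filtered out, and each remaining category's confidence is its share of the inliers. The category orientations and threshold values are illustrative assumptions:

```python
import numpy as np

def plane_category_confidence(plane, points, categories, category_normals,
                              dist_thresh=0.05, orient_thresh=0.8):
    """Confidence of each target plane category for one plane.

    plane: (a, b, c, d) with unit normal (a, b, c); a point p lies on the
           plane when a*px + b*py + c*pz + d == 0.
    category_normals: expected orientation per category (an assumption,
           e.g. a floor faces +z), used to drop orientation-inconsistent
           categories.  Returns {category: ratio among the M inliers}.
    """
    n, d = np.asarray(plane[:3]), plane[3]
    dist = np.abs(points @ n + d)           # point-to-plane distance
    inlier = dist < dist_thresh             # the M first 3-D points
    m = inlier.sum()
    conf = {}
    for cat in np.unique(categories[inlier]):
        # keep only categories whose orientation matches the plane normal
        if abs(np.dot(n, category_normals[cat])) >= orient_thresh:
            conf[cat] = np.count_nonzero(categories[inlier] == cat) / m
    return conf

# usage: a horizontal plane z = 1 with three inliers and one outlier
plane = np.array([0.0, 0.0, 1.0, -1.0])
points = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0], [0.0, 0.0, 5.0]])
categories = np.array([0, 0, 1, 1])
normals = {0: np.array([0.0, 0.0, 1.0]),   # assumed: both categories face up
           1: np.array([0.0, 0.0, 1.0])}
conf = plane_category_confidence(plane, points, categories, normals)
```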
  • the method provided in the embodiment of the present application further includes: the image data processing device updates the confidence of the one or more target plane categories corresponding to any one plane according to at least one of Bayes' theorem or a voting mechanism.
  • Updating the confidence of the one or more plane categories corresponding to any plane over the video sequence based on Bayes' theorem and the voting mechanism makes the final plane semantic category of each plane more stable.
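A recursive Bayes update over the video sequence can be sketched as multiplying the running confidence vector by each frame's observed confidences and renormalising; this minimal form (which omits an explicit observation model) is an assumption, not the patent's exact update:

```python
import numpy as np

def bayes_update(prior, observation):
    """Fuse the category confidences observed in the current frame with
    the running estimate via Bayes' theorem: posterior is proportional to
    prior * likelihood, renormalised. Both arguments are probability
    vectors over the same candidate plane categories."""
    posterior = np.asarray(prior) * np.asarray(observation)
    return posterior / posterior.sum()

# usage: two noisy frames agreeing on category 0 sharpen the estimate
p = bayes_update([0.5, 0.5], [0.7, 0.3])   # first frame
p = bayes_update(p, [0.7, 0.3])            # second frame reinforces category 0
```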
  • the method provided in the embodiment of the present application further includes: the image data processing device determines whether the current state of the image data processing device is a motion state, and obtains the first dense semantic map according to the semantic segmentation result only when the current state is the motion state. Obtaining the first dense semantic map only in the motion state reduces the amount of data the image data processing device computes, thereby saving computing resources and improving the performance of the semantic map generation algorithm.
  • the image data to be processed is image data after correction.
  • the method provided in the embodiment of the present application further includes: the image data processing apparatus obtains the first image data taken by the camera.
  • the image data processing device corrects the first image data according to the device pose corresponding to the first image data to obtain the image data to be processed.
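The patent does not spell out this correction; one plausible reading, offered purely as an illustrative assumption, is rotating the captured frame to the nearest upright orientation using the roll angle taken from the device pose, so that the segmentation network always sees upright images:

```python
import numpy as np

def correct_image(image, roll_deg):
    """Rotate a captured frame by the nearest multiple of 90 degrees so
    it is roughly upright before segmentation. `roll_deg` is the camera
    roll taken from the device pose (how it is obtained is device-
    specific and assumed here)."""
    k = int(round(roll_deg / 90.0)) % 4     # quarter turns to undo
    return np.rot90(image, k=k)

# usage: a frame captured rolled by 90 degrees is rotated back upright
img = np.arange(6).reshape(2, 3)
upright = correct_image(img, 90)    # height and width swap
```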
  • an embodiment of the present application provides an image data processing device.
  • the image data processing device includes a semantic segmentation module, a semantic map module, and a semantic clustering module.
  • the semantic segmentation module is used to obtain, from the camera, image data to be processed including N pixels, where N is a positive integer.
  • the semantic segmentation module is also used to determine the semantic segmentation result of the image data to be processed, where the semantic segmentation result includes the target plane category corresponding to at least some of the N pixels.
  • the semantic map module is configured to obtain a first dense semantic map according to the semantic segmentation result, the first dense semantic map including at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud, and the at least one first three-dimensional point corresponding to at least one pixel point of the at least some pixels.
  • the semantic clustering module is used to recognize the plane semantic category according to the first dense semantic map to obtain the plane semantic category of one or more planes included in the image data to be processed.
  • The embodiment of the present application provides an image data processing device, which obtains the semantic segmentation result of the image data to be processed. Because the semantic segmentation result includes the target plane category to which each of the N pixels included in the image data to be processed belongs, semantic segmentation can improve the accuracy of plane semantic recognition.
  • The image data processing device obtains the first dense semantic map according to the semantic segmentation result, and then uses the first dense semantic map to recognize the plane semantic categories of the image data to be processed, which can enhance the accuracy of plane semantic recognition.
  • the semantic map module being used to obtain the first dense semantic map according to the semantic segmentation result includes: the semantic map module is used to obtain the second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed.
  • the semantic map module is used to take the second dense semantic map as the first dense semantic map.
  • the semantic map module is used to obtain the first dense semantic map according to the semantic segmentation result, including: the semantic map module is used to obtain the second dense semantic map according to the semantic segmentation result.
  • the semantic map module is used to update the historical dense semantic map by using one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.
  • the image data processing device further includes a simultaneous localization and mapping (SLAM) module, which is used to calculate the device pose (such as the camera pose) corresponding to the image data.
  • the semantic map module being used to determine whether the current state of the image data processing device is a motion state includes: the semantic map module is used to obtain second image data provided by the camera, where the second image data is different from the image data to be processed.
  • the semantic map module is used to determine whether the state of the image data processing device is a motion state according to the first device pose corresponding to the image data to be processed provided by the SLAM module and the second device pose corresponding to the second image data provided by the SLAM module.
  • the second image data is the frame immediately preceding the image data to be processed.
  • the semantic map module being used to determine that the current state is a motion state includes: when the difference between the pose of the first device and the pose of the second device is less than or equal to the first threshold, the semantic map module is used to determine that the current state is a motion state;
  • the semantic map module being used to determine that the current state is a motion state includes: the semantic map module is used to obtain second image data taken by the camera, where the second image data is the frame immediately preceding the image data to be processed; and the semantic map module determines that the current state of the image data processing device is a motion state according to the first device pose corresponding to the image data to be processed provided by the SLAM module, the second device pose corresponding to the second image data provided by the SLAM module, and the frame difference between the second image data and the image data to be processed.
  • the semantic map module determining that the current state of the image data processing device is a motion state according to the first device pose corresponding to the image data to be processed, the second device pose corresponding to the second image data, and the frame difference between the second image data and the image data to be processed includes: when the difference between the first device pose and the second device pose is less than or equal to the first threshold and the frame difference is greater than the second threshold, the semantic map module determines that the current state of the image data processing device is a motion state.
  • the semantic segmentation module is further configured to perform an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information included in the depth image corresponding to the image data to be processed; the optimization operation is used to correct noise and error parts in the semantic segmentation result.
  • the semantic segmentation module being used to determine the semantic segmentation result of the image data to be processed includes: the module is used to determine, for any one of the at least some pixels, the probability of each plane category among one or more plane categories corresponding to that pixel, and to take the plane category with the highest probability as the target plane category corresponding to that pixel, so as to obtain the semantic segmentation result of the image data to be processed. That is, for any one of the at least some pixels, the probability of its target plane category is the largest among the probabilities of the one or more plane categories corresponding to it.
  • the semantic segmentation module is used to perform semantic segmentation on the image data to be processed according to a neural network to obtain the probability of each plane category among the one or more plane categories corresponding to any one of the at least some pixels.
  • the semantic clustering module being used to recognize the plane semantic category according to the first dense semantic map to obtain the plane semantic category of one or more planes included in the image data to be processed includes: the semantic clustering module is used to determine the plane equation of each of the one or more planes according to the image data to be processed.
  • the semantic clustering module is also used to perform the following steps on any one of the one or more planes to obtain the plane semantic categories of the one or more planes: the semantic clustering module is used to determine, according to the plane equation of that plane and the first dense semantic map, one or more target plane categories corresponding to that plane and their confidences, and to select the target plane category with the highest confidence among them as the semantic plane category of that plane.
  • the orientation of the one or more target plane categories corresponding to each plane is consistent with the orientation of that plane.
  • the semantic clustering module determining, according to the plane equation of any one plane and the first dense semantic map, one or more target plane categories corresponding to that plane and their confidences includes: the semantic clustering module is configured to determine M first three-dimensional points from the first dense semantic map according to the plane equation of the plane, where the distance between each of the M first three-dimensional points and the plane is less than a third threshold, the orientation of the target plane categories corresponding to the M first three-dimensional points is consistent with the orientation of the plane, and M is a positive integer; the M first three-dimensional points correspond to the one or more plane categories; and the module counts, for each plane category, the proportion of its corresponding three-dimensional points among the M first three-dimensional points to obtain the confidence of the one or more plane categories.
  • after the semantic clustering module counts the proportion of the number of three-dimensional points corresponding to each plane category among the M first three-dimensional points to obtain the confidence of the one or more plane categories, the semantic clustering module is further used to update the confidence of the one or more target plane categories corresponding to any one plane according to at least one of Bayes' theorem or a voting mechanism.
  • the semantic map module is used to determine whether the current state of the image data processing device is a motion state. When it is determined that the current state is the motion state, the semantic map module is used to obtain the first dense semantic map according to the semantic segmentation result.
  • the image data to be processed is image data after correction.
  • before the semantic segmentation module is used to obtain the image data to be processed, the semantic segmentation module is also used to obtain the first image data taken by the camera.
  • the semantic segmentation module is used to correct the first image data according to the device pose corresponding to the first image data provided by the SLAM module to obtain the image data to be processed.
  • the SLAM module, the semantic clustering module, and the semantic map module run on the central processing unit (CPU), and the semantic segmentation part of the semantic segmentation module can run on the NPU; the remaining functions of the semantic segmentation module, other than semantic segmentation itself, run on the CPU.
  • embodiments of the present application provide a computer-readable storage medium that stores instructions; when the instructions are executed, the method described in any implementation of the first aspect is implemented.
  • an embodiment of the present application provides an image data processing device, including: a first processor and a second processor, where the first processor is configured to obtain image data to be processed including N pixels, N is a positive integer.
  • the second processor is configured to determine the semantic segmentation result of the image data to be processed, wherein the semantic segmentation result includes the target plane category corresponding to at least some of the N pixels;
  • the first processor is configured to obtain a first dense semantic map according to the semantic segmentation result, the first dense semantic map including at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud, and the at least one first three-dimensional point corresponding to at least one pixel of the at least some pixels;
  • the first processor is configured to recognize the plane semantic category according to the first dense semantic map to obtain the plane semantic category of one or more planes included in the image data to be processed.
  • the first processor is specifically configured to obtain a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed; the first processor is specifically configured to use the second dense semantic map as the first dense semantic map, or the first processor is specifically configured to update the historical dense semantic map using one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.
  • the second processor being used to determine the semantic segmentation result of the image data to be processed includes: performing an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information included in the depth image corresponding to the image data to be processed, the optimization operation being used to correct noise and error parts in the semantic segmentation result.
  • before the second processor is configured to determine the semantic segmentation result of the image data to be processed, the second processor is also configured to determine, for any one of the at least some pixels, the probability of each plane category among one or more plane categories corresponding to that pixel, and to take the plane category with the highest probability as the target plane category corresponding to that pixel, so as to obtain the semantic segmentation result of the image data to be processed. That is, the probability of the target plane category corresponding to any pixel is the largest among the probabilities of the one or more plane categories corresponding to that pixel, which can improve the accuracy of semantic recognition.
  • the second processor is configured to perform semantic segmentation on the to-be-processed image data according to the neural network to obtain one or more plane categories corresponding to any one of the at least some pixels The probability of each plane category in.
  • the first processor is used to determine the plane equation of each of the one or more planes; the first processor is also used to perform the following steps on any one of the one or more planes to obtain the plane semantic categories of the one or more planes: the first processor is further configured to determine, according to the plane equation of that plane and the first dense semantic map, one or more target plane categories corresponding to that plane and their confidences; and the first processor is further configured to select the target plane category with the highest confidence among them as the semantic plane category of that plane. That is, the semantic plane category of any plane is the highest-confidence target plane category among the one or more target plane categories corresponding to it.
  • the orientation of the one or more target plane categories corresponding to any plane is consistent with the orientation of that plane.
  • the first processor is specifically configured to determine M first three-dimensional points from the first dense semantic map according to the plane equation of any one plane, where the distance between each of the M first three-dimensional points and the plane is less than a third threshold and M is a positive integer; to determine the one or more target plane categories corresponding to the M first three-dimensional points as the one or more target plane categories corresponding to the plane, the orientation of these target plane categories being consistent with the orientation of the plane; and to count, for each target plane category, the proportion of its corresponding three-dimensional points among the M first three-dimensional points to obtain the confidence of the one or more target plane categories.
  • after counting the proportion, among the M first three-dimensional points, of the number of three-dimensional points corresponding to each of the one or more target plane categories to obtain the confidence of the one or more target plane categories, the first processor is further configured to update the confidence of the one or more target plane categories corresponding to the plane according to at least one of Bayes' theorem or a voting mechanism.
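A minimal sketch of the Bayes-theorem variant of this update might look like the following. Treating each new frame's per-category confidences as an independent likelihood and the running confidences as the prior is an assumption, as is the small probability floor.

```python
def bayes_update(prior, likelihood):
    """Fuse per-category confidences from a new observation into the running
    confidences using Bayes' theorem (frames assumed independent).

    Both arguments map category -> probability; missing categories get a
    small floor so a single observation cannot zero out a hypothesis.
    """
    eps = 1e-3
    cats = set(prior) | set(likelihood)
    post = {c: prior.get(c, eps) * likelihood.get(c, eps) for c in cats}
    total = sum(post.values())
    return {c: p / total for c, p in post.items()}  # normalize to a distribution
```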
  • the first processor is configured to determine whether the current state is a motion state, and, when it is determined that the current state is a motion state, obtain the first dense semantic map according to the semantic segmentation result.
  • the first processor may be a CPU or a DSP.
  • the second processor may be an NPU.
  • an embodiment of the present application provides an image data processing device, including one or more processors, where the one or more processors are configured to execute instructions stored in a memory to perform the method according to any implementation of the first aspect.
  • an embodiment of the present application provides a computer program product including instructions; when the instructions are executed, the method according to any implementation of the first aspect is implemented.
  • FIG. 1 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the application.
  • FIG. 2 is a schematic diagram of a software architecture applicable to a method for identifying plane semantic categories provided by an embodiment of the application.
  • FIG. 3 is a schematic flowchart of a method for recognizing plane semantic categories according to an embodiment of this application.
  • FIG. 4 is a schematic flowchart of another method for recognizing plane semantic categories according to an embodiment of this application.
  • FIG. 5 is a schematic diagram of the first image data, before and after processing, obtained by the image data processing device provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of a semantic segmentation result provided by an embodiment of this application.
  • FIG. 7 is a schematic diagram of a coordinate mapping provided by an embodiment of this application.
  • FIG. 8 is a schematic diagram of a motion state determination process provided by an embodiment of the application.
  • FIG. 9 is a schematic flowchart of plane confidence calculation provided by an embodiment of the application.
  • FIG. 10 is a schematic flowchart of performing filtering on semantic segmentation results according to an embodiment of this application.
  • FIG. 11 is a schematic diagram of another process of performing filtering on semantic segmentation results according to an embodiment of the application.
  • FIG. 12 is a schematic diagram of a plane semantic result provided by an embodiment of this application.
  • FIG. 13 is a schematic structural diagram of an image data processing device provided by an embodiment of the application.
  • "At least one of the following items" refers to any combination of these items, including a single item or any combination of a plurality of items.
  • for example, at least one of a, b, or c can mean: a, b, c, a and b, a and c, b and c, or a, b and c, where each of a, b, and c can be singular or plural.
  • the method for identifying planar semantic categories provided in the embodiments of the present application can be applied to various image data processing apparatuses with TOF, and the image data processing apparatus may be an electronic device.
  • electronic devices may include, but are not limited to, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, tablet computers, personal digital assistants, media players, etc.), consumer electronic devices, minicomputers, mainframe computers, mobile robots, drones, etc.
  • the electronic device in the embodiment of the present application may be a device with AR function, for example, a device with AR glasses function, which can be applied to scenarios such as AR automatic measurement, AR decoration, and AR interaction.
  • the image data processing device can use the plane semantic category recognition method provided in this embodiment of the application to obtain the plane category recognition result of the image data to be processed.
  • alternatively, the image data processing device may send the image data to be processed to another device that implements the plane semantic category recognition process, such as a server or a terminal device; the server or the terminal device performs the plane semantic category recognition process, and then the image data processing device receives the plane category recognition result from that device.
  • the electronic device 100 may include a display device 110, a processor 120 and a memory 130.
  • the memory 130 may be used to store software programs and data, and the processor 120 may execute various functional applications and data processing of the electronic device 100 by running the software programs and data stored in the memory 130.
  • the memory 130 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system and an application program required by at least one function (such as an image capture function); the storage data area may store data created according to the use of the electronic device 100 (such as audio data, text information, and image data), etc.
  • the memory 130 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the processor 120 is the control center of the electronic device 100. It uses various interfaces and lines to connect the various parts of the entire electronic device, executes the various functions of the electronic device 100 and processes data by running or executing the software programs and/or data stored in the memory 130, and thereby monitors the electronic device as a whole.
  • the processor 120 may include one or more processing units.
  • the processor 120 may include a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the NPU is a neural-network (NN) computing processor. By learning from the structure of biological neural networks, such as the transfer mode between human brain neurons, it can quickly process input information and can also continuously self-learn. Through the NPU, applications such as intelligent cognition of the electronic device 100 can be realized, for example image recognition, face recognition, speech recognition, and text understanding.
  • the processor 120 may include one or more interfaces.
  • interfaces can include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus, which includes a serial data line (SDA) and a serial clock line (SCL).
  • the processor 120 may include multiple sets of I2C buses.
  • the processor 120 may be coupled to the touch sensor, charger, flashlight, image acquisition device 160, etc., respectively through different I2C bus interfaces.
  • the processor 120 may couple the touch sensor through an I2C interface, so that the processor 120 communicates with the touch sensor through the I2C bus interface, so as to realize the touch function of the electronic device 100.
  • the I2S interface can be used for audio communication.
  • the processor 120 may include multiple sets of I2S buses.
  • the processor 120 may be coupled with the audio module through an I2S bus to implement communication between the processor 120 and the audio module.
  • the audio module can transmit audio signals to the WiFi module 190 through the I2S interface, so as to realize the function of answering calls through the Bluetooth headset.
  • the PCM interface can also be used for audio communication to sample, quantize and encode analog signals.
  • the audio module and the WiFi module 190 may be coupled through a PCM bus interface.
  • the audio module may also transmit audio signals to the WiFi module 190 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus can be a two-way communication bus that converts the data to be transmitted between serial and parallel communication.
  • the UART interface is usually used to connect the processor 120 and the WiFi module 190.
  • the processor 120 communicates with the Bluetooth module in the WiFi module 190 through the UART interface to realize the Bluetooth function.
  • the audio module can transmit audio signals to the WiFi module 190 through the UART interface, so as to realize the function of playing music through the Bluetooth headset.
  • the MIPI interface can be used to connect the processor 120 with peripheral devices such as the display device 110 and the image acquisition device 160.
  • the MIPI interface includes an image acquisition device 160 serial interface (camera serial interface, CSI), a display serial interface (display serial interface, DSI), and so on.
  • the processor 120 and the image acquisition device 160 communicate through a CSI interface to implement the shooting function of the electronic device 100.
  • the processor 120 communicates with the display screen through the DSI interface to realize the display function of the electronic device 100.
  • the GPIO interface can be configured through software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface can be used to connect the processor 120 with the image capture device 160, the display device 110, the WiFi module 190, the audio module, the sensor module, and so on.
  • the GPIO interface can also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, etc.
  • the USB interface is an interface that complies with the USB standard specifications, and can be a Mini USB interface, a Micro USB interface, or a USB Type-C interface.
  • the USB interface can be used to connect a charger to charge the electronic device 100, and can also be used to transfer data between the electronic device 100 and peripheral devices. It can also be used to connect earphones and play audio through earphones. This interface can also be used to connect other electronic devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is merely a schematic description, and does not constitute a structural limitation of the electronic device 100.
  • the electronic device 100 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the electronic device 100 also includes an image capture device 160 for capturing images or videos.
  • the image capturing device 160 includes one or more cameras for capturing image data and a TOF camera for capturing depth images.
  • a camera is used to collect video graphics array (VGA) sequences or image data and send them to the CPU and GPU.
  • the camera can be an ordinary camera or a focusing camera.
  • the electronic device 100 may also include an input device 140 for receiving inputted digital information, character information, or contact touch operations/non-contact gestures, and generating signal inputs related to user settings and function control of the electronic device 100.
  • the display device 110 includes a display panel 111 that is used to display information input by the user or information provided to the user, and various menu interfaces of the electronic device 100. In the embodiment of the present application, it is mainly used to display the camera in the electronic device 100 Or the image data to be processed obtained by the sensor.
  • the display panel may be configured with the display panel 111 in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED).
  • the electronic device 100 may also include one or more sensors 170, such as an image sensor, an infrared sensor, a laser sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, an ambient light sensor, a fingerprint sensor, a touch sensor, a temperature sensor, a bone conduction sensor, an inertial measurement unit (IMU), etc., where the image sensor may be a time of flight (TOF) sensor, a structured light sensor, and the like.
  • the inertial measurement unit is a device that measures the three-axis attitude angle (or angular velocity) and acceleration of an object.
  • an IMU contains three single-axis accelerometers and three single-axis gyroscopes.
  • the accelerometer detects the acceleration signal of the object in the independent three-axis of the carrier coordinate system.
  • the gyroscope detects the angular velocity signal of the carrier relative to the navigation coordinate system, measures the angular velocity and acceleration of the object in three-dimensional space, and calculates the posture of the object.
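A toy illustration of how gyroscope and accelerometer readings combine into a posture estimate is the single-axis complementary filter below. This is a generic IMU technique sketched under simplifying assumptions (one axis, near-static accelerometer), not the fusion actually used by this application.

```python
import math

def complementary_filter(gyro_rates, accels, dt, alpha=0.98):
    """Minimal single-axis attitude sketch: integrate the gyroscope's angular
    velocity for a roll estimate and correct the drift with the roll angle
    implied by the accelerometer's gravity reading (device assumed near-static).

    gyro_rates: roll rate in rad/s per sample; accels: (ay, az) per sample.
    """
    roll = 0.0
    for rate, (ay, az) in zip(gyro_rates, accels):
        gyro_roll = roll + rate * dt              # dead-reckoned from the gyro
        accel_roll = math.atan2(ay, az)           # tilt from the gravity vector
        roll = alpha * gyro_roll + (1 - alpha) * accel_roll
    return roll
```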
  • the image sensor may be a device in the image acquisition device 160 or an independent device for acquiring image data.
  • the electronic device 100 may also include a power supply 150 for supplying power to other modules.
  • the electronic device 100 may also include a radio frequency (RF) circuit 180 for network communication with wireless network devices, and may also include a WiFi module 190 for WiFi communication with other devices, for example, for acquiring other devices The transmitted image or data, etc.
  • the electronic device 100 may also include other possible functional modules such as a flashlight, a Bluetooth module, an external interface, a button, a motor, etc., which will not be described here.
  • FIG. 2 shows a software architecture applicable to a planar semantic category recognition method provided by an embodiment of the present application.
  • the software architecture is applied to the electronic device 100 shown in FIG. 1, and the architecture includes: a semantic segmentation module 202, a semantic map module 203, and a semantic clustering module 204.
  • the software architecture may also include a simultaneous localization and mapping (SLAM) module 201.
  • the SLAM module 201, the semantic map module 203, and the semantic clustering module 204 run on the CPU of the electronic device as described in FIG. 1.
  • part of the functions in the SLAM module 201 can be deployed on a digital signal processor (DSP), and part of the functions in the semantic segmentation module 202 runs on the NPU of the electronic device as described in FIG. 1.
  • the functions of the semantic segmentation module 202, other than the functions running on the NPU of the electronic device as described in FIG. 1, run on the CPU. Refer to the subsequent descriptions for the specific functions that run on the NPU.
  • the SLAM module 201 uses the video graphics sequence including one or more frames of image data provided by the camera (that is, the camera corresponding to the image acquisition device 160 of the electronic device described in FIG. 1), the depth information or depth image provided by the TOF camera, the IMU data provided by the IMU, and the correlation between image data frames, combined with the principle of visual geometry, to calculate the device pose (for example, if the device is a camera, the device pose may refer to the camera pose), that is, the rotation and translation of the camera relative to the first frame; it also detects planes and outputs the device pose and the normal parameters and boundary points of each plane.
  • the IMU data includes accelerometer data and gyroscope data.
  • the depth information includes the distance between each pixel in the image data and the camera that captured the image data.
  • the semantic segmentation module 202 implements semantic segmentation data enhancement based on SLAM technology, and is divided into pre-processing, AI processing, and post-processing.
  • the input of the pre-processing is the original image data (for example, RGB image) provided by the camera and the device pose obtained by the SLAM module 201.
  • in pre-processing, the original image data is corrected according to the device pose, and the corrected image data is output.
  • in this way, the semantic segmentation model's constraint on rotation invariance can be reduced, and the recognition rate can be improved.
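As a toy illustration of pose-based correction, the sketch below assumes the correction can be reduced to rotating the image by the multiple of 90 degrees nearest the device's roll angle. The sign convention and the 90-degree quantization are simplifying assumptions, not details from this application.

```python
import numpy as np

def correct_image(image, roll_deg):
    """Rotate a captured image so its content is upright, approximating
    pose-based correction by the nearest multiple of 90 degrees of the
    device's roll angle (an assumed simplification).
    """
    k = int(round(roll_deg / 90.0)) % 4   # number of 90-degree steps
    return np.rot90(image, k=k)
```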
  • AI processing performs semantic segmentation based on a neural network and runs on the NPU; the input is the corrected image data, and the output is the probability distribution of each pixel included in the corrected image data over one or more plane categories (that is, the probability that each pixel belongs to each of one or more plane categories). If the plane category with the highest probability is selected for each pixel, the pixel-level semantic segmentation result is obtained.
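Selecting the highest-probability plane category per pixel can be sketched as an argmax over the network's per-pixel probability map. The (H, W, C) array shape and the class names below are illustrative assumptions.

```python
import numpy as np

def segmentation_from_probs(prob_map, class_names):
    """Collapse a per-pixel class probability map of shape (H, W, C) into a
    pixel-level segmentation by taking, for each pixel, the plane category
    with the highest probability.
    """
    idx = np.argmax(prob_map, axis=-1)               # (H, W) class indices
    return np.array(class_names, dtype=object)[idx]  # (H, W) class labels
```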
  • the aforementioned neural network may be a convolutional neural network (CNN), a deep neural network (DNN), or a recurrent neural network (RNN).
  • the input of post-processing is the original image data provided by the camera, the depth information, and the semantic segmentation result output by AI processing.
  • the semantic segmentation result is filtered mainly according to the original image data and depth information, and the output is the optimized semantic segmentation result.
  • the accuracy and edges of the segmentation after post-processing are better.
  • post-processing is not a necessary step in this embodiment and may be skipped.
  • the pre-processing and post-processing may run on the CPU or other processors instead of the NPU, which is not limited in this embodiment.
  • the input of the semantic map module 203 is the optimized semantic segmentation result or the unoptimized semantic segmentation result, the device pose provided by the SLAM module 201, the depth information provided by the TOF, and the original image data provided by the camera.
  • the semantic map module 203 mainly generates a dense semantic map based on SLAM technology.
  • the dense semantic map is generated from the optimized or unoptimized semantic segmentation result, the device pose provided by the SLAM module 201, the depth information provided by the TOF camera, and the original image data provided by the camera.
  • the process includes converting the original two-dimensional image data into a three-dimensional dense semantic map.
  • the two-dimensional RGB pixels in the two-dimensional original image data are converted into three-dimensional points in the three-dimensional space, so that each pixel includes depth information in addition to the RGB information.
  • the process of this two-dimensional to three-dimensional conversion can refer to the description in the prior art, which will not be repeated here.
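The core of the two-dimensional to three-dimensional conversion is back-projecting each pixel with its depth through the standard pinhole camera model. A minimal sketch, assuming known camera intrinsics (fx, fy, cx, cy from calibration):

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into a 3D point in the
    camera frame using the pinhole model: x = (u - cx) * z / fx, etc.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])
```

In a dense map, this is applied to every pixel with valid depth, and each resulting 3D point inherits the pixel's RGB value and target plane category.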
  • the target plane category of each pixel is used as the target plane category of the three-dimensional point corresponding to that pixel, so that the target plane categories of multiple pixels are transformed into the target plane categories of multiple three-dimensional points. Therefore, the dense semantic map includes the target plane categories of multiple three-dimensional points.
  • the target plane category of any three-dimensional point corresponds to the target plane category of the two-dimensional pixel point corresponding to the three-dimensional point.
  • the input of the semantic map module 203 is the optimized semantic segmentation result; when the post-processing is not performed in the embodiment, the input of the semantic map module 203 is the unoptimized semantic segmentation result .
  • the semantic clustering module 204 performs planar semantic recognition based on the dense semantic map.
  • this application provides a method for identifying plane semantic categories and an image data processing device, wherein the method enables the image data processing device to detect one or more planes included in image data.
  • the method and the image data processing device are based on the same inventive concept. Since the method and the device have similar principles for solving the problem, the implementations of the image data processing device and the method can refer to each other, and repetitions will not be described again.
  • Figure 3 shows a planar semantic category recognition method provided by an embodiment of the present application.
  • the method is applied to an image data processing device.
  • the method includes: Step 301: The semantic segmentation module 202 obtains the image data to be processed.
  • the image data to be processed includes N pixels, and N is a positive integer.
  • the image data to be processed may be captured by the camera of the image data processing device and provided to the semantic segmentation module 202, obtained by the semantic segmentation module 202 from an image library used for storing image data in the image data processing device, or sent by another device; this is not limited in the embodiment of the present application.
  • the image data to be processed may be a two-dimensional image.
  • the image data to be processed may be a color photo or a black and white photo, which is not limited in the embodiment of the present application.
  • the N pixels may be all pixels in the image data to be processed, or may be some pixels in the image data to be processed.
  • the N pixels may be pixels belonging to the plane category in the image data to be processed, excluding the pixels of the non-planar category.
  • a pixel of a non-planar category refers to a pixel that does not belong to any recognized plane category, and at this time, the pixel is considered to not belong to a pixel on any plane.
  • Step 302 The semantic segmentation module 202 determines the semantic segmentation result of the image data to be processed.
  • the semantic segmentation result includes the target plane category corresponding to at least some of the N pixels included in the image data to be processed.
  • the at least part of the pixels may be pixels of one or more planes included in the image data to be processed.
  • the target plane category corresponding to at least some of the N pixels may refer to the target plane category corresponding to some of the N pixels, or may refer to the target plane category corresponding to all of the N pixels.
  • the image data processing device in the embodiment of the present application can independently determine the semantic segmentation result of the image data to be processed, and at this time, the image data processing device may have a module (for example, NPU) that determines the semantic segmentation result of the image data to be processed.
  • a module for example, NPU
  • the image data processing device in the embodiment of the present application can also send the image data to be processed to the device with the function of determining the semantic segmentation result of the image data to be processed, so that the device with the function of determining the semantic segmentation result of the image data to be processed The device determines the semantic segmentation result of the image data to be processed. Then, the image data processing apparatus obtains the semantic segmentation result of the image data to be processed from the device having the function of determining the semantic segmentation result of the image data to be processed. In the embodiment of the present application, the image data processing apparatus can detect one or more planes included in the image data to be processed by determining the semantic segmentation result of the image data to be processed.
  • Step 303 The semantic map module 203 obtains a first dense semantic map according to the semantic segmentation result.
  • the first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in the first three-dimensional point cloud.
  • One first three-dimensional point corresponds to at least one pixel point in the at least part of the pixel points.
  • the purpose of step 303 in the embodiment of this application is that the semantic map module 203 uses the plane category of each pixel in the two-dimensional space to update the plane category of the three-dimensional point corresponding to that pixel in the three-dimensional space, that is, uses it as the target plane category of the three-dimensional point.
  • the semantic map module 203 may use at least one target plane category corresponding to the three-dimensional point cloud corresponding to all pixels in the semantic segmentation result as the first dense semantic map. Through step 303, the performance of the semantic map generation algorithm can be improved.
  • Step 304: The semantic clustering module 204 performs plane semantic category recognition according to the first dense semantic map, and obtains the plane semantic category of one or more planes included in the image data to be processed.
  • the embodiment of the present application provides a method for recognizing planar semantic categories.
  • the method obtains the semantic segmentation result of the image data to be processed. Since the semantic segmentation result includes the target plane category to which each of the N pixels included in the image data to be processed belongs, subsequent processing based on it can improve the accuracy of plane semantic recognition.
  • the image data processing device obtains the first dense semantic map according to the semantic segmentation result, and then uses the first dense semantic map to recognize plane semantic categories to obtain the plane semantic categories of the image data to be processed, which can enhance the accuracy and stability of plane semantic recognition.
  • step 303 in the embodiment of the present application can be implemented in the following manner: the semantic map module 203 determines whether the current state of the image data processing device is a motion state; when it is determined that the current state is a motion state, the first dense semantic map is obtained according to the semantic segmentation result. By judging the motion state and obtaining the first dense semantic map from the semantic segmentation result only while in motion, the amount of calculation can be reduced.
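One plausible way to make this motion-state decision is to threshold the translation and rotation between consecutive SLAM poses. The 4x4 pose representation and both thresholds below are illustrative assumptions, not values from this application.

```python
import numpy as np

def is_moving(poses, trans_threshold=0.01, rot_threshold_deg=1.0):
    """Decide whether the device is in a motion state from consecutive SLAM
    poses (each assumed to be a 4x4 camera-to-world matrix). Motion is
    declared when translation or rotation between the last two poses
    exceeds its threshold.
    """
    if len(poses) < 2:
        return False
    prev, curr = np.asarray(poses[-2]), np.asarray(poses[-1])
    translation = np.linalg.norm(curr[:3, 3] - prev[:3, 3])
    # Rotation angle between the two orientations from the relative rotation's trace.
    rel = prev[:3, :3].T @ curr[:3, :3]
    cos_angle = np.clip((np.trace(rel) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    return bool(translation > trans_threshold or angle_deg > rot_threshold_deg)
```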
  • otherwise, the image data processing device uses the historical dense semantic map as the first dense semantic map.
  • the image data to be processed in the embodiment of the present application is image data after correction.
  • the semantic segmentation module 202 corrects the image data to be processed before performing semantic segmentation on it, or adopts already-corrected image data as the image data to be processed, which can reduce the constraint of the semantic segmentation model on rotation invariance and improve the recognition rate.
  • the method provided in the embodiment of the present application may further include, before step 301: Step 305: The semantic segmentation module 202 obtains the first image data shot by the first device.
  • the image data processing apparatus may control the first device to capture the first image data, and send the captured first image data to the semantic segmentation module 202.
  • the first image data can also be obtained by the semantic segmentation module 202 from a memory of the image data processing apparatus in which it is pre-stored, or the semantic segmentation module 202 can obtain, from another device (for example, an SLR camera or a DV), the first image data taken by the first device.
  • the first device may be a camera built in the image data processing device, or a photographing device connected to the image data processing device.
  • step 301 can be implemented by the following step 3011: step 3011, the semantic segmentation module 202 corrects the first image data according to the first device pose of the first device corresponding to the first image data to obtain the image data to be processed.
  • each image data in the embodiment of the present application may correspond to a device pose.
  • the semantic segmentation module 202 may correct the first image data according to the device pose corresponding to the shooting of the first image data.
  • the semantic segmentation module 202 can independently determine that the first image data has not been corrected.
  • alternatively, the image data processing apparatus receives an operation instruction, input by the user for the first image data, indicating that the first image data is to be corrected. In this way, the image data processing device can determine that the first image data has not been corrected, and then the image data processing device uses the semantic segmentation module 202 to correct the first image data.
  • the pose of the device corresponding to the image data in the embodiment of the present application refers to the pose of the device that captured the image data when the image data was captured.
  • the same device may correspond to different device poses at different times. It can be understood that if the first image data has already been corrected, the process of correcting the image to be processed can be omitted.
  • as shown in (a) of FIG. 5, the first image data obtained by the image data processing device has not been corrected.
  • the image data processing device may correct the first image data according to the device pose of the device that took the first image data, and the image data after the correction is as shown in (b) of FIG. 5.
  • in the method provided in this embodiment of the present application, step 302 may be implemented through the following steps 3021 and 3022:
  • Step 3021 the semantic segmentation module 202 determines one or more plane categories corresponding to any one pixel in at least some of the pixels and the probability of each plane category in the one or more plane categories.
• step 3021 in the embodiment of the present application can be specifically implemented in the following manner: the semantic segmentation module 202 performs semantic segmentation on the image data to be processed according to a neural network, and obtains, for any one pixel in at least some of the pixels, one or more plane categories and the probability of each plane category in the one or more plane categories. For the specific process, reference may be made to the prior art, which is not limited in this embodiment.
• Step 3022: the semantic segmentation module 202 uses the plane category with the highest probability among the one or more plane categories corresponding to any one pixel as the target plane category corresponding to that pixel, to obtain the semantic segmentation result of the image data to be processed. That is, the probability of the target plane category corresponding to any one pixel is the largest among the probabilities of the one or more plane categories corresponding to that pixel.
  • any pixel in the embodiment of the present application may correspond to one or more plane categories, and any pixel may correspond to the probability of belonging to each plane category of the one or more plane categories.
  • the sum of the probabilities of one or more plane categories corresponding to any pixel is equal to 1.
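• The per-pixel argmax of step 3022 can be sketched as follows (a minimal illustration, not the patent's implementation; the 2×2 probability map and the three plane categories are hypothetical):

```python
import numpy as np

# Hypothetical class probabilities for a 2x2 image and 3 plane
# categories (e.g. 0=ground, 1=table, 2=wall); the probabilities of
# the plane categories corresponding to each pixel sum to 1.
probs = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
])

# Step 3022: the plane category with the highest probability becomes
# the target plane category of each pixel.
target_category = np.argmax(probs, axis=-1)   # shape (2, 2)
target_prob = np.max(probs, axis=-1)          # probability of that category

print(target_category)   # pixel-wise target categories: [[0 1], [2 0]]
```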
  • semantic segmentation processing may be performed on the image data to be processed. It is understandable that the purpose of semantic segmentation is to assign a category label to each pixel in the image data to be processed.
• the image data to be processed is composed of many pixels, and semantic segmentation groups these pixels according to the different semantic meanings they express in the image. That is, semantic segmentation divides the image data to be processed into regions with different semantics and marks the plane category to which each region belongs, such as cars, trees, or faces. Semantic segmentation combines the two technologies of segmentation and target recognition, and can segment the image into regions with high-level semantic content. For example, through semantic segmentation, one piece of image data can be segmented into three different semantic regions: "cow", "grass", and "sky".
• (a) of Figure 6 shows a to-be-processed image data provided by an embodiment of the present application, and (b) of Figure 6 shows a schematic diagram of the image data to be processed after semantic segmentation processing. From (b) in Figure 6, it can be seen that the image data to be processed is divided into four different semantic regions: "ground", "table", "wall", and "chair".
  • the semantic segmentation module 202 may use a semantic segmentation model to determine the probability of one or more plane categories to which each pixel of the N pixels belongs.
  • each pixel point may correspond to one or more plane categories, and the sum of the probabilities of all plane categories corresponding to each pixel point is equal to 1.
  • the probability of the target plane category corresponding to any one of the N pixels is the highest probability among the probabilities of one or more plane categories corresponding to the any one pixel.
  • the plane category of one or more planes included in the image data to be processed is ground, table, chair, wall, etc.
• through step 302, the image data processing device can obtain the target plane category to which each of pixel 1 to pixel 4 belongs, as shown in Table 1:
  • the semantic segmentation model in the embodiment of the present application may use mobileNet v2 as the coding network, or may be implemented by MaskRCNN. It should be understood that in the embodiments of the present application, any other model that can perform semantic segmentation may also be used to obtain the semantic segmentation result.
• the embodiment of the present application uses mobileNet v2 as an encoding network for semantic segmentation as an example for description, but this does not constitute a limitation on the semantic segmentation method, and will not be repeated in the following.
  • the mobileNet v2 model has the advantages of small size, fast speed, and high accuracy, which meets the requirements of mobile phone platforms and enables semantic segmentation to reach a frame rate of more than 5fps.
  • step 302 in the embodiment of the present application can be implemented in the following manner: the semantic segmentation module 202 according to one or more plane categories corresponding to each of at least some of the N pixels The probability of determining the semantic segmentation result of the image data to be processed. That is, the semantic segmentation module 202 determines the plane category with the highest probability in each pixel of at least some of the pixels as the respective target plane category of each pixel to obtain the semantic segmentation result of the image data to be processed.
  • the method provided in this embodiment of the present application after step 302 and before step 303 may further include: step 306, the semantic segmentation module 202 according to The image data to be processed and the depth information included in the depth image corresponding to the image data to be processed perform an optimization operation on the semantic segmentation result, and the optimization operation is used to correct the noise in the semantic segmentation result and the error part caused by the segmentation process.
• the target plane category of pixel A in the semantic segmentation result is a table, but in fact the target plane category of pixel A should be the ground, so the target plane category of pixel A can be changed from table to ground.
• for another example, a certain pixel point B is not segmented, and the target plane category of pixel point B can be determined by performing the optimization operation. For the specific algorithm implementation of the optimization operation, reference may be made to the prior art, and details are not described in this embodiment.
  • the depth information in the embodiment of the present application includes the distance between each pixel and the device that captures the image data to be processed.
  • the purpose of the optimization operation on the semantic segmentation result in the embodiment of the present application is to optimize and repair the semantic segmentation result.
  • the depth information can be used to filter and modify the semantic segmentation results, avoiding wrong segmentation and unsegmentation in the semantic segmentation results.
  • FIG. 10 and FIG. 11 For the detailed process of optimizing the semantic segmentation result, please refer to the description of FIG. 10 and FIG. 11 below, which will not be repeated here.
  • the semantic map module 303 in this embodiment of the application determines whether the current state of the image data processing device is in a motion state (that is, step 303), which can be implemented in the following manner: the semantic map module 303 obtains the second image taken by the camera. Image data. The semantic map module 303 is based on the difference between the pose of the first device corresponding to the image data to be processed and the pose of the second device corresponding to the second image data, and the frame difference between the second image data and the image data to be processed Determine whether the current state of the image data processing device is a motion state.
• if the difference between the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data is less than or equal to the first threshold, and the frame difference between the second image data and the image data to be processed is less than or equal to a second threshold, the semantic map module 303 determines that the current state of the image data processing device is a motion state.
• the second image data is adjacent to the image data to be processed and is the previous frame of the image data to be processed. Refer to Figure 8 for the specific process.
• otherwise, the image data processing device determines that the current state of the image data processing device is a static state.
  • the image data processing device can directly use the historical dense semantic map as the first dense semantic map, and perform subsequent processing.
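• As a rough sketch of this state check (the thresholds, the exact comparison directions, and the frame-difference metric are all assumptions; the embodiment does not fix them here, and the device is treated as static when both the pose change and the frame difference are small, in which case the historical dense semantic map is reused directly):

```python
import numpy as np

def is_static(pose_prev, pose_curr, frame_prev, frame_curr,
              pose_thresh=0.01, frame_thresh=5.0):
    """Hypothetical check: treat the device as static when both the
    pose change and the mean absolute frame difference fall below
    their thresholds (threshold values are illustrative)."""
    pose_diff = np.linalg.norm(np.asarray(pose_curr, float) - np.asarray(pose_prev, float))
    frame_diff = np.mean(np.abs(np.asarray(frame_curr, float) - np.asarray(frame_prev, float)))
    return bool(pose_diff <= pose_thresh and frame_diff <= frame_thresh)

frame = np.zeros((4, 4))
print(is_static([0, 0, 0], [0, 0, 0], frame, frame))  # True: nothing changed
```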
  • the historical dense semantic map in the embodiment of the present application may be stored in the image data processing device, or of course, may also be obtained from other equipment by the image data processing device, which is not limited in the embodiment of the present application.
  • the historical dense semantic map is the semantic image result generated and saved in history. After each frame of new image data arrives, the historical dense semantic map will be updated.
• the historical dense semantic map is the dense semantic map corresponding to the frame preceding the current frame of image data, or a synthesis of the dense semantic maps corresponding to the previous several frames of images.
  • step 304 of the embodiment of the present application can be implemented in the following manner: the semantic map module 303 obtains a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed.
• the semantic map module 303 directly uses the second dense semantic map as the first dense semantic map. That is, every time the second dense semantic map is calculated, it is directly used as the first dense semantic map for subsequent calculation.
  • the depth image corresponding to the image data to be processed in the embodiment of the present application refers to an image that has the same size as the image data to be processed and whose element value is the depth value of the scene point corresponding to the image point in the image data to be processed.
• the image data to be processed is acquired by the image acquisition device shown in FIG. 2, and the depth image corresponding to the image data to be processed is acquired by the TOF camera shown in the figure.
  • methods such as TOF camera, structured light, laser scanning, etc. may be used to obtain depth information, thereby obtaining a depth image.
  • any other method (or camera) for obtaining a depth image may also be used to obtain a depth image.
• in the embodiment of the present application, using a TOF camera to obtain the depth image is taken as an example for description, but this does not constitute a limitation on the way of obtaining the depth image, and will not be repeated in the following.
• the point cloud is a three-dimensional concept, while the pixels in the depth image are a two-dimensional concept. The image coordinates of a point can be converted into three-dimensional world coordinates in space, so the point cloud in the three-dimensional space can be recovered according to the depth image.
  • the principle of visual geometry can be used to convert image coordinates into world coordinates. According to the principle of visual geometry, the process of mapping a three-dimensional point M (Xw, Yw, Zw) in a world coordinate system to a point m (u, v) on the image is shown in Figure 7.
• the Xc axis drawn with a dashed line in Figure 7 is obtained by translating the Xc axis drawn with a solid line, and the dashed Yc axis is obtained by translating the solid Yc axis.
• (u, v) are the coordinates of an arbitrary point in the image coordinate system.
  • f is the focal length of the camera
  • dx and dy are the pixel sizes in the x and y directions, respectively
• u0 and v0 are the center coordinates of the image, respectively.
  • Xw, Yw, Zw represent the three-dimensional coordinate points in the world coordinate system.
  • Zc represents the Z-axis value of the camera coordinates, that is, the distance from the target to the camera.
  • R and T are the 3x3 rotation matrix and 3x1 translation matrix of the external parameter matrix, respectively.
• the depth map can be restored to a point cloud based on the camera coordinate system, that is, the rotation matrix R takes the identity matrix and the translation vector T is 0, and we can get: Xc = (u - u0) * Zc * dx / f, Yc = (v - v0) * Zc * dy / f, with Zc taken directly from the depth map.
  • Xc, Yc, Zc are the three-dimensional point coordinates in the camera coordinate system.
  • Zc represents the value on the depth map.
• the current depth unit obtained by TOF is millimeter (mm), so the coordinates of the three-dimensional point in the camera coordinate system can be calculated; then, using the device pose R and T calculated by the SLAM module, the point cloud data converted to the world coordinate system can be obtained. Specifically, as shown in the following formula: (Xw, Yw, Zw) = R^(-1) * ((Xc, Yc, Zc) - T).
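• The back-projection described above can be sketched as follows, assuming the standard pinhole relations with fx = f/dx and fy = f/dy, and assuming the extrinsics map world coordinates to camera coordinates (Pc = R·Pw + T); the function name and the sample intrinsics are illustrative:

```python
import numpy as np

def depth_pixel_to_world(u, v, depth_mm, fx, fy, u0, v0, R, T):
    """Back-project one depth pixel to world coordinates.
    Assumes the pinhole model above (fx = f/dx, fy = f/dy) and that the
    extrinsics map world to camera coordinates: Pc = R @ Pw + T."""
    zc = depth_mm / 1000.0              # TOF depth in mm -> metres
    xc = (u - u0) * zc / fx             # camera-coordinate X
    yc = (v - v0) * zc / fy             # camera-coordinate Y
    pc = np.array([xc, yc, zc])
    # Invert Pc = R @ Pw + T (R is orthonormal, so R^-1 = R^T).
    return R.T @ (pc - T)

# With identity pose the world point equals the camera point.
pw = depth_pixel_to_world(320, 240, 1000, fx=500.0, fy=500.0,
                          u0=320.0, v0=240.0,
                          R=np.eye(3), T=np.zeros(3))
print(pw)  # principal-point pixel at 1 m depth -> [0. 0. 1.]
```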
  • the three-dimensional points in this embodiment are three-dimensional pixels, that is, the two-dimensional pixels involved in steps 301 and 302 are converted into three-dimensional pixels.
  • step 304 of the embodiment of the present application can be implemented in the following manner: the semantic map module 303 obtains the second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed.
• for the manner in which the multiple three-dimensional points included in the second dense semantic map are obtained by combining the multiple two-dimensional pixel points with the depth image, reference may be made to the prior art.
  • the semantic map module 303 uses one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to update the historical dense semantic map to obtain the first dense semantic map. Different from directly using the second dense semantic map as the first dense semantic map, a part of all the three-dimensional points in the second three-dimensional point cloud can be used for the update.
  • the update may not be for all three-dimensional points in the second dense semantic map, but only replace the target plane category of the corresponding three-dimensional point of the historical dense semantic map with the probability of the target plane category of some three-dimensional points in the second dense semantic map. Probability. Therefore, the update may be an update to a part of the dense semantic map, instead of directly using the second dense semantic map as the first dense semantic map.
• the semantic map module 303 uses one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to update the probabilities of the target plane categories of the three-dimensional points corresponding to the one or more second three-dimensional points in the historical dense semantic map, to obtain the first dense semantic map.
• for example, the semantic map module 303 uses one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to update the historical dense semantic map; that is, the probability of the target plane category of three-dimensional point A in the historical dense semantic map is replaced with the probability of the target plane category of three-dimensional point A in the second dense semantic map.
  • step 304 in the embodiment of the present application can be specifically implemented in the following manner: step 3041, the semantic clustering module 204 determines the plane equation of each of the one or more planes .
  • the semantic clustering module 204 performs plane fitting on the three-dimensional point cloud data of each pixel to obtain a plane equation.
  • the semantic clustering module 204 can use the RANSAC method or the SVD equation solving method to perform plane fitting on the three-dimensional point cloud data of each pixel to obtain a plane equation.
  • the image data processing device in the embodiment of the present application can determine the respective area of each plane and the orientation of each plane.
  • the normal vector is used to indicate the orientation of the plane.
• in the embodiment of the present application, the orientation of the plane can also be expressed as the direction of the normal vector of the plane.
• the semantic clustering module 204 performs the following steps 3042 and 3043 on any one of the one or more planes to obtain the plane semantic category of the one or more planes. Step 3042: the semantic clustering module 204 determines, according to the plane equation of the any one plane and the first dense semantic map, one or more target plane categories corresponding to the any one plane and the confidence of the one or more target plane categories.
• step 3042 in the embodiment of the present application can be implemented in the following manner: the semantic clustering module 204 determines M first three-dimensional points from the first dense semantic map according to the plane equation of the any one plane, where the distance between each of the M first three-dimensional points and the any one plane is less than a third threshold, and M is a positive integer; the semantic clustering module 204 determines the one or more plane categories to which the M first three-dimensional points belong as the one or more target plane categories corresponding to the any one plane, where the orientation of the one or more target plane categories is consistent with the orientation of the any one plane; and the semantic clustering module 204 counts the ratio of the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories among the M first three-dimensional points, to obtain the confidence of the one or more target plane categories.
  • the embodiment of the present application does not limit the specific value of the third threshold, and it can be set as needed in the actual process.
• the M first three-dimensional points determined from the first dense semantic map can be regarded as three-dimensional points belonging to the any one plane. The plane category to which each of the M first three-dimensional points belongs can be determined, and the plane categories to which different three-dimensional points belong may be the same or different. For example, the plane category of three-dimensional point A among the M first three-dimensional points is "ground", and the plane category of three-dimensional point B among the M first three-dimensional points is "table". Therefore, the one or more plane categories corresponding to the M first three-dimensional points can be obtained according to the plane category to which each of the M first three-dimensional points belongs.
  • the plane category of each three-dimensional point in the M first three-dimensional points may be the target plane category of the two-dimensional pixel points corresponding to the three-dimensional point mentioned in the previous embodiment.
  • step 3022 can be used to obtain the target plane category of each pixel and use it as the plane category of the three-dimensional point corresponding to each pixel, so that one or more target plane categories corresponding to the M first three-dimensional points can be obtained.
• suppose the plane category of N1 three-dimensional points among the M first three-dimensional points is "ground", that is, the number of three-dimensional points whose plane category is "ground" is N1; the plane category of N2 three-dimensional points is "table", that is, the number of three-dimensional points whose plane category is "table" is N2; and the plane category of N3 three-dimensional points is "wall", that is, the number of three-dimensional points whose plane category is "wall" is N3, where N1 + N2 + N3 is less than or equal to M, and N1, N2, and N3 are positive integers.
• then the proportion of three-dimensional points whose plane category is "ground" among the M first three-dimensional points is N1/M, the proportion of three-dimensional points whose plane category is "table" is N2/M, and the proportion of three-dimensional points whose plane category is "wall" is N3/M.
• the confidences of the one or more plane categories of the any one plane are: N1/M, N2/M, and N3/M. If N2/M > N1/M and N2/M > N3/M, then the plane semantic category of the any one plane is "table".
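• The ratio computation in this example can be sketched as follows (the category names and point counts are illustrative):

```python
from collections import Counter

# Hypothetical plane categories of the M first three-dimensional points
# that lie within the third threshold of the candidate plane.
point_categories = ["ground"] * 3 + ["table"] * 5 + ["wall"] * 2
M = len(point_categories)                      # M = 10

counts = Counter(point_categories)             # N1, N2, N3 ...
confidence = {cat: n / M for cat, n in counts.items()}
plane_semantic_category = max(confidence, key=confidence.get)

print(confidence)                # {'ground': 0.3, 'table': 0.5, 'wall': 0.2}
print(plane_semantic_category)   # 'table' has the highest confidence
```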
  • Step 3043 The semantic clustering module 204 selects the target plane category with the highest confidence among the one or more target plane categories as the semantic plane category of any one of the planes.
• for example, the semantic clustering module 204 can determine that the plane semantic category of plane A is ground.
• any plane may correspond to one or more target plane categories, but not all target plane categories in the one or more target plane categories corresponding to the any one plane have the same orientation as the any one plane. That is, a plane may correspond to target plane categories that are consistent with its orientation, and may also correspond to target plane categories that are inconsistent with its orientation, and a target plane category whose orientation is inconsistent with that of the plane has a low probability of being the plane semantic category of the plane. Based on this, in order to simplify the subsequent calculation process and reduce calculation errors, in a possible implementation manner, the orientation of the one or more target plane categories corresponding to any plane in the embodiment of the present application is consistent with the orientation of the plane.
  • the one or more target plane categories are plane categories selected by the image data processing device from all target plane categories corresponding to any one plane and consistent with the orientation of the any one plane.
  • the one or more target plane categories may be all plane categories of all target plane categories corresponding to any one plane, or may be part of the plane categories, which is not limited in the embodiment of the present application. All target plane categories corresponding to any plane in the embodiment of the present application can be regarded as all target plane categories corresponding to the M first three-dimensional points.
• for example, suppose plane a faces downwards, the plane category "ground" faces upwards, the plane category "table" faces downwards, and the plane category "ceiling" faces downwards. Then, when calculating the confidences of the one or more plane categories to which plane a belongs, the confidence that plane a belongs to the plane category "ground" can be eliminated. This not only reduces the calculation burden of the image data processing device, but also improves the calculation accuracy.
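• The orientation-based elimination can be sketched as follows, mirroring the example above; the canonical orientations assigned to each category and the function name are assumptions:

```python
# Hypothetical canonical orientations of some plane categories, mirroring
# the example: "up" for upward-facing surfaces, "down" for downward-facing.
CATEGORY_ORIENTATION = {"ground": "up", "table": "down", "ceiling": "down"}

def filter_by_orientation(candidates, plane_orientation):
    """Keep only candidate categories whose canonical orientation is
    consistent with the detected plane's orientation."""
    return [c for c in candidates
            if CATEGORY_ORIENTATION.get(c) == plane_orientation]

# Plane a faces downwards, so "ground" (upward-facing) is eliminated
# before the confidences are computed.
kept = filter_by_orientation(["ground", "table", "ceiling"], "down")
print(kept)  # ['table', 'ceiling']
```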
• the semantic clustering module 204 counts the proportion of the number of three-dimensional points corresponding to each target plane category in the one or more target plane categories among the M first three-dimensional points, and thereby obtains the confidence of each target plane category.
• the method provided in the embodiment of the present application further includes: the semantic clustering module 204 updates the confidence of each target plane category corresponding to the one or more planes according to at least one of Bayes' theorem or a voting mechanism.
• the semantic clustering module 204 performs plane fitting on the three-dimensional point cloud data to obtain a plane equation of the form Ax + By + Cz + D = 0, where A, B, C, D are the plane equation parameters that need to be solved, and the optimal plane equation parameters are solved through multiple points.
  • the specific fitting scheme can refer to the prior art.
• the outermost points among all the points involved in the calculation serve as the boundary points of the plane.
• the semantic clustering module 204 counts and filters, from the first dense semantic map, the M first three-dimensional points whose distance from the plane is less than the third threshold, based on the plane equation, orientation, and area of the detected plane.
  • the first three-dimensional point corresponds to one or more target plane categories.
• the semantic clustering module 204 normalizes the numbers of three-dimensional points of the various plane categories in the one or more target plane categories to obtain the confidence of each plane category, that is, counts the ratio of the number of three-dimensional points included in each target plane category to the total number of three-dimensional points (the M first three-dimensional points). The confidences are then updated based on Bayes' theorem and the voting mechanism together with the last recorded confidences of the various plane categories, and the plane category with the highest current confidence is selected as the plane semantic category, which can enhance the accuracy and stability of plane semantic recognition.
• specifically, the semantic clustering module 204 uses Bayes' theorem and the voting mechanism to count the confidences that a plane calculated before the current moment belongs to multiple plane categories, so as to revise and update, according to the obtained confidences, the confidence that the plane calculated at the current moment belongs to each plane category.
• regarding the voting mechanism, for example, suppose the maximum number of votes under the voting mechanism is MAX_VOTE_COUNT, and the initial number of votes is 0. If the plane category of a three-dimensional point C in the current frame is consistent with the plane category of the three-dimensional point C in the previous frame, then the number of votes corresponding to the three-dimensional point C is increased by 1, and the plane category probability prob to which the three-dimensional point C belongs is updated to slide between the average value and the maximum value of the two probabilities.
• otherwise, the number of votes is reduced by 1, and the plane category probability prob is updated to take 80% of its value.
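• One plausible reading of this voting update is sketched below; the cap value, the exact "slide between the average and the maximum" rule, and the function signature are all assumptions:

```python
MAX_VOTE_COUNT = 10  # illustrative cap on the number of votes

def update_vote(votes, prob, prev_category, curr_category, curr_prob):
    """Sketch of the voting update: if the category of a 3D point is
    consistent between frames, add a vote (capped) and slide prob
    between the average and the maximum of the old and new
    probabilities; otherwise remove a vote and keep 80% of prob."""
    if curr_category == prev_category:
        votes = min(votes + 1, MAX_VOTE_COUNT)
        avg = (prob + curr_prob) / 2.0
        prob = (avg + max(prob, curr_prob)) / 2.0   # assumed "slide" rule
    else:
        votes = max(votes - 1, 0)
        prob = prob * 0.8
    return votes, prob

votes, prob = update_vote(0, 0.6, "ground", "ground", 0.8)
print(votes, prob)   # consistent: 1 vote, prob between 0.7 and 0.8
votes, prob = update_vote(votes, prob, "ground", "table", 0.9)
print(votes, prob)   # inconsistent: back to 0 votes, prob reduced to 80%
```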
• step 304 can be specifically implemented as shown in FIG. 9. Step 901: the semantic clustering module 204 executes a plane detection step to obtain one or more planes included in the image data to be processed. Since the semantic clustering module 204 calculates the plane semantic category of each of the one or more planes in the same manner and on the same principle, the following steps take the process of calculating the plane semantic category of the first plane by the image data processing device as an example; the choice of the first plane is not limiting.
  • Step 902 The semantic clustering module 204 obtains the plane equation of the first plane.
  • Step 903 The semantic clustering module 204 calculates the area of the first plane.
  • Step 904 The semantic clustering module 204 calculates the orientation of the first plane.
• Step 905: The semantic clustering module 204 counts the M three-dimensional points, of various plane categories in the first dense semantic map, whose distance to the first plane is less than the third threshold.
  • Step 906 The semantic clustering module 204 determines whether the orientation of each target plane category in the one or more target plane categories corresponding to the M three-dimensional points is consistent or the same as the orientation of the first plane.
• Step 907: If the orientation of each target plane category is consistent with the orientation of the first plane, the semantic clustering module 204 determines, according to the area of the first plane, whether the number of three-dimensional points included in each target plane category per unit area meets the threshold.
• Step 908: If it is determined according to the plane area that the number of three-dimensional points included in each target plane category per unit area meets the threshold, the semantic clustering module 204 performs normalization processing on the number of three-dimensional points included in each target plane category, that is, calculates the proportion of the number of three-dimensional points included in each target plane category among the M first three-dimensional points, to obtain the confidence that the first plane belongs to the one or more target plane categories.
• Step 909: The semantic clustering module 204 performs a Bayesian probability update between the previously recorded confidences that the first plane belongs to the one or more target plane categories and the currently calculated confidences that the first plane belongs to the various target plane categories.
  • Step 910 The semantic clustering module 204 uses the target plane category with the highest current confidence of the first plane as the plane category of the first plane.
• if the orientation of a target plane category is inconsistent with the orientation of the first plane, the semantic clustering module 204 determines that the process stops. In addition, if the semantic clustering module 204 determines, according to the area of the first plane, that the number of three-dimensional points included in the various target plane categories per unit area does not meet the threshold, the image data processing apparatus determines that the process stops.
• the semantic segmentation module 202 performs an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information of the image data to be processed, including the random sample consensus (RANSAC) processing described in FIG. 10.
• the floor, as an important part of the scene, has the following characteristics: the floor is a plane with a large area; the floor is an important reference for SLAM initialization; the ground is easier to detect and recognize than other semantic targets; objects in the scene are mostly located on the ground; and the heights of objects in the scene are mostly measured relative to the ground. Therefore, it is very necessary to segment the ground first and obtain its plane equation.
• the RANSAC algorithm is also known as the random sampling consensus estimation method. It is a robust estimation method, which is suitable for the estimation of planes with a large area, such as the ground.
• here, the semantic segmentation result of the deep neural network is relied on: the pixels with ground semantics (FLOOR three-dimensional points) are extracted, and the point cloud data composed of their depth information is obtained, so as to realize the ground equation estimation based on RANSAC. The specific steps are shown in Figure 10:
  • the ground equation can also be estimated by using AI.
  • Step 1011 The semantic segmentation module 202 obtains P three-dimensional points included in the ground by performing semantic segmentation processing on the ground.
• the semantic segmentation module 202 checks the number of remaining iterations M of the RANSAC algorithm. If M > 0, it randomly selects l (for example, l is 3) three-dimensional points from the P three-dimensional points as sampling points and estimates a plane equation from the sampling points by singular value decomposition (SVD); otherwise, it skips to step 1016 for execution.
• Step 1013: The semantic segmentation module 202 brings the three-dimensional coordinates q of each of the P three-dimensional points into the estimated plane equation, and obtains the scalar distance d from each three-dimensional point to the plane, where d = |A·xq + B·yq + C·zq + D| / sqrt(A^2 + B^2 + C^2). If d is less than a preset threshold, the three-dimensional point is considered an interior point, and the number k of interior points is counted.
• Step 1014: The semantic segmentation module 202 compares the number k of interior points in this iteration with the optimal number of interior points K. If k is less than or equal to K, the semantic segmentation module 202 reduces the number of iterations M of the RANSAC algorithm by 1 and jumps to step 1011 for execution; otherwise, it updates the optimal interior points and continues to execute.
  • Step 1016: The semantic segmentation module 202 re-estimates the plane equation using the K optimal interior points, that is, it establishes an overdetermined system composed of K equations and uses SVD to find the globally optimal plane equation.
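Steps 1011 to 1016 above can be sketched as follows. This is an illustrative sketch only: the function names, the fixed iteration budget, and the inlier threshold `eps` are assumptions made for demonstration and are not specified by the embodiment.

```python
import numpy as np

def fit_plane_svd(points):
    """Least-squares plane fit n.x + d = 0 via singular value decomposition."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, -normal.dot(centroid)

def ransac_plane(points, iters=100, eps=0.02, sample_size=3):
    """RANSAC estimate of the dominant plane among P three-dimensional points."""
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):                      # iteration budget M
        idx = rng.choice(len(points), sample_size, replace=False)
        normal, d = fit_plane_svd(points[idx])  # plane from l sampled points
        dist = np.abs(points @ normal + d)      # scalar distance of all P points
        inliers = dist < eps                    # interior points (k of them)
        if inliers.sum() > best_inliers.sum():  # keep the optimal inlier set K
            best_inliers = inliers
    # Re-estimate from the K optimal interior points, i.e. solve the
    # overdetermined system with SVD to obtain the globally optimal plane.
    return fit_plane_svd(points[best_inliers])
```

In practice the ground points fed to `ransac_plane` would be the FLOOR three-dimensional points extracted from the semantic segmentation result.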
  • The semantic seeds are combined with depth information to grow the segmentation regions and refine the segmentation result.
  • The number of pixels in each semantic segmentation category is used as an indicator of region-growing priority, so that categories with more pixels are grown first; however, the ground has the highest priority, that is, the ground region is grown before the regions of other categories.
  • The region-growing algorithm relies on the degree of similarity between seed points and their neighborhood points: adjacent points with higher similarity are merged, and growth continues outward until no neighboring point satisfies the similarity condition.
  • A typical 8-neighborhood is selected for region growing, and the similarity condition is expressed using both depth and color information, so that under-segmented regions can be better corrected.
  • The so-called seed point is the initial point of region growing.
  • Region growing spreads outward in a manner similar to breadth-first search (BFS). The specific steps are shown in Figure 11:
  • Step 1101: The semantic segmentation module 202 traverses the priority list of semantic segmentation categories and pushes the plane category with the highest priority onto the seed point stack for region growing.
  • The seed point stack of the currently pushed category contains K seed points, and the two-dimensional pixel coordinates corresponding to each seed point are (i, j).
  • The so-called priority list is established from the statistics of the segmentation result, ordered from the largest to the smallest number of pixels in each plane category.
  • Step 1102: If the seed point stack is not empty, the semantic segmentation module 202 pops the last seed point s_K(i, j) from the stack, deletes it from the stack, and determines whether the category of its neighboring point p(i+m, j+n) is OTHER. If so, execution continues; otherwise, it jumps to step 1101.
  • Step 1103: The semantic segmentation module 202 computes the similarity distance d between the seed point s_K and the neighboring point p. If the similarity distance d is less than the given threshold, execution continues; otherwise, it jumps to step 1101.
  • The expression of the similarity distance d is as follows:
  • Step 1104: The semantic segmentation module 202 pushes the neighborhood point p that satisfies the similarity condition onto the seed point stack.
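Steps 1101 to 1104 can be sketched as follows. The function and parameter names are illustrative, and because the exact expression of the similarity distance d is not reproduced above, a simple combination of depth and color differences is assumed in its place.

```python
from collections import deque
import numpy as np

def region_grow(labels, depth, color, priority, delta=5.0, other=0):
    """Grow labeled regions into OTHER pixels, highest-priority category first.

    labels:   HxW integer category map (value `other` = OTHER / unlabeled)
    depth:    HxW depth map
    color:    HxWx3 color image
    priority: category ids ordered by pixel count, ground first
    delta:    threshold on the (illustrative) similarity distance d
    """
    h, w = labels.shape
    # 8-neighborhood offsets (m, n)
    nbrs = [(m, n) for m in (-1, 0, 1) for n in (-1, 0, 1) if (m, n) != (0, 0)]
    for cat in priority:                       # step 1101: traverse priority list
        stack = deque((i, j) for i, j in zip(*np.nonzero(labels == cat)))
        while stack:
            i, j = stack.pop()                 # step 1102: pop last seed s_K(i, j)
            for m, n in nbrs:
                p, q = i + m, j + n
                if not (0 <= p < h and 0 <= q < w) or labels[p, q] != other:
                    continue                   # only grow into OTHER pixels
                # Step 1103: illustrative similarity distance from depth + color.
                d = abs(depth[p, q] - depth[i, j]) + \
                    np.abs(color[p, q].astype(float) - color[i, j]).sum() / 3
                if d < delta:
                    labels[p, q] = cat         # step 1104: merge and push new seed
                    stack.append((p, q))
    return labels
```

The seed stack is consumed last-in-first-out as in step 1102; the overall traversal order is what the text describes as BFS-like outward spreading.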
  • the image data processing apparatus and the like include hardware structures and/or software modules corresponding to the respective functions.
  • The present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
  • the embodiment of the present application may divide the functional units according to the foregoing method example image data processing apparatus.
  • each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit. It should be noted that the division of units in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • FIG. 2 shows a possible structural schematic diagram of the image data processing device involved in the foregoing embodiment.
  • The image data processing device includes: a semantic segmentation module 202, a semantic map module 203, and a semantic clustering module 204.
  • the semantic segmentation module 202 is used to support the image data processing apparatus to execute steps 301 and 302 in the above-mentioned embodiment.
  • the semantic map module 203 is used to support the image data processing device to execute step 303 in the foregoing embodiment.
  • the semantic clustering module 204 is used to support the image data processing apparatus to perform step 304 in the above-mentioned embodiment.
  • the semantic segmentation module 202 is further configured to support the image data processing apparatus to execute step 305 in the foregoing embodiment.
  • the semantic segmentation module 202 is used to support the image data processing device to execute step 3011 in the foregoing embodiment.
  • the semantic segmentation module 202 is used to support the image data processing device to execute step 306, step 3021, and step 3022 in the above-mentioned embodiment.
  • the semantic clustering module 204 is used to support the image data processing apparatus to perform step 3041, step 3042, and step 3043 in the foregoing embodiment.
  • the semantic clustering module 204 is also used to support the image data processing device to execute steps 901 to 910 in the foregoing embodiment.
  • the device can be implemented in the form of software and stored in a storage medium.
  • FIG. 13 shows a schematic diagram of a possible hardware structure of the image data processing device involved in the above-mentioned embodiment.
  • the image data processing device includes: a first processor 1301 and a second processor 1302.
  • the image data processing apparatus may further include a communication interface 1303, a memory 1304, and a bus 1305.
  • the communication interface 1303 may include an input interface 13031 and an output interface 13032.
  • the first processor 1301 and the second processor 1302 may be the processor 120 shown in FIG. 1.
  • the first processor 1301 may be a DSP or a CPU.
  • the second processor 1302 may be an NPU.
  • the communication interface 1303 may be the input device 140 in FIG. 1.
  • the memory 1304 is used to store program codes and data of the image data processing device, and corresponds to the memory 130 in FIG. 1.
  • the bus 1305 may be built in the processor 120 shown in FIG. 1.
  • the first processor 1301 and the second processor 1302 are configured to perform part of the functions in the image data processing method described above.
  • the first processor 1301 is configured to support the image data processing apparatus to execute step 301 of the foregoing embodiment.
  • the second processor 1302 is used for the image data processing apparatus to execute step 302 of the foregoing embodiment.
  • the first processor 1301 is used for the image data processing apparatus to execute step 303 and step 304 of the foregoing embodiment.
  • the first processor 1301 is further configured to support the image data processing apparatus to execute step 305, step 3011, step 3041, step 3042, step 3043 in the foregoing embodiment.
  • the second processor 1302 is also configured to support the image data processing apparatus to execute step 306, step 3021, and step 3022 in the foregoing embodiment.
  • the first processor 1301 is further configured to support the image data processing apparatus to execute steps 901 to 910 in the foregoing embodiment.
  • The first processor 1301 or the second processor 1302 may have a single-processor structure or a multi-processor structure, and may be a single-threaded processor, a multi-threaded processor, and so on.
  • In some feasible embodiments, the first processor 1301 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
  • the second processor 1302 may be a neural network processor, which may implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of this application.
  • the processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and so on.
  • Output interface 13032: this output interface is used to output the processing result of the above-mentioned image data processing method.
  • The processing result can be output directly by the processor, or it can first be stored in the memory and then output through the memory. In some feasible embodiments, there may be only one output interface, or there may be multiple output interfaces.
  • The processing result output by the output interface can be sent to the memory for storage, sent to another processing flow for further processing, sent to a display device for display, sent to a player terminal for playback, and so on.
  • The memory 1304 can store the aforementioned image data to be processed and related instructions for configuring the first processor or the second processor.
  • The memory may be a floppy disk; a hard disk such as a built-in hard disk or a removable hard disk; a magnetic disk; an optical or magneto-optical disk such as a CD-ROM or DVD-ROM; a non-volatile storage device such as RAM, ROM, PROM, EPROM, EEPROM, or flash memory; or any other form of storage medium known in the technical field.
  • Bus 1305: this bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 13, but this does not mean that there is only one bus or only one type of bus.
  • The embodiment of the present application also provides a computer-readable storage medium that stores instructions. When the instructions run on a device (for example, a single-chip microcomputer, a chip, or a computer), the device executes one or more of steps 301 to 3011 of the above-mentioned image data processing method. If each component module of the above-mentioned image data processing device is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in the computer-readable storage medium.
  • the embodiments of the present application also provide a computer program product containing instructions.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which can be a personal computer, a server, a network device, or the like) or a processor therein to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the computer program product includes one or more computer programs or instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, network equipment, user equipment, or other programmable devices.
  • the computer program or instruction may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • The computer program or instruction may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that integrates one or more available media.
  • The usable medium may be a magnetic medium, such as a floppy disk, a hard disk, or a magnetic tape; an optical medium, such as a digital video disc (DVD); or a semiconductor medium, such as a solid state drive (SSD).

Abstract

Disclosed are a plane semantic category identification method and an image data processing apparatus, which relate to the technical field of image processing and are used for accurately determining a plane semantic category. The method comprises: acquiring image data to be processed, wherein the image data to be processed comprises N pixel points; determining a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result comprises target plane categories corresponding to at least some of the N pixel points; according to the semantic segmentation result, obtaining a first dense semantic map, wherein the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel point in the at least some pixel points; and performing plane semantic category identification according to the first dense semantic map in order to obtain plane semantic categories of one or more planes comprised in the image data to be processed. The method can improve the accuracy of plane semantic recognition.

Description

Plane semantic category recognition method and image data processing device

Technical field
The embodiments of the present application relate to the field of image processing technology, and in particular, to a plane semantic category recognition method and an image data processing device.
Background
Augmented reality (AR) is a technology that calculates the position and angle of a camera image in real time and superimposes corresponding images, videos, and 3D models; its goal is to overlay the virtual world on the real world on a screen and enable interaction between them. Plane detection, an important function in augmented reality, provides perception of the basic three-dimensional environment of the real world, enabling developers to place virtual objects according to detected planes to achieve the augmented reality effect. Three-dimensional plane detection is an important and fundamental capability: only after a plane is detected can an object's anchor point be determined and the object be rendered at that anchor point.
At present, multiple three-dimensional points on a plane can be obtained with a laser device, the plane equation of the plane can be computed statistically from these three-dimensional points, and the position information of the plane can be determined from the plane equation. However, the planes detected by most current augmented reality algorithms provide only position information and cannot identify the plane category. Identifying a plane's category can help developers improve the realism and appeal of augmented reality applications.
Based on the above, semantic segmentation can currently be performed on red-green-blue (RGB) image data or red-green-blue-depth (RGBD) image data by a neural network, and a semantic map can be built from the segmentation result. The semantic map is then used to generate plane semantic categories. However, because this solution builds the semantic map directly from the semantic segmentation result, mis-segmented and unsegmented parts in the result may reduce the accuracy of semantic category recognition.
Summary of the invention
The embodiments of the present application provide a plane semantic category recognition method and an image data processing device, so as to improve the accuracy of plane semantic category recognition.
To achieve the foregoing objectives, the embodiments of the present application provide the following technical solutions:
In a first aspect, an embodiment of the present application provides a plane semantic category recognition method, including: an image data processing device obtains to-be-processed image data including N pixels, where N is a positive integer; the image data processing device determines a semantic segmentation result of the to-be-processed image data, where the semantic segmentation result includes target plane categories corresponding to at least some of the N pixels; the image data processing device obtains a first dense semantic map according to the semantic segmentation result, where the first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one of the at least some pixels; and the image data processing device performs plane semantic category recognition according to the first dense semantic map to obtain the plane semantic categories of one or more planes included in the to-be-processed image data.
An embodiment of the present application provides a plane semantic category recognition method in which the semantic segmentation result of the to-be-processed image data is obtained; because the semantic segmentation result includes the target plane category of each of the N pixels included in the to-be-processed image data, the semantic segmentation can subsequently improve the accuracy of plane semantic recognition. In addition, in the method provided by the embodiments of the present application, the image data processing device obtains the first dense semantic map according to the semantic segmentation result and then recognizes plane semantic categories through the first dense semantic map; obtaining the plane semantic categories of the to-be-processed image data in this way can enhance the accuracy of plane semantic recognition.
In a possible implementation, the image data processing device obtaining the first dense semantic map according to the semantic segmentation result includes: the image data processing device obtains a second dense semantic map according to the semantic segmentation result and a depth image corresponding to the to-be-processed image data, and uses the second dense semantic map as the first dense semantic map.
In a possible implementation, the image data processing device obtaining the first dense semantic map according to the semantic segmentation result includes: the image data processing device obtains a second dense semantic map according to the semantic segmentation result, and updates a historical dense semantic map with one or more second three-dimensional points in a second three-dimensional point cloud of the second dense semantic map to obtain the first dense semantic map.
In a possible implementation, the image data processing device judging whether its current state is a motion state includes: the image data processing device obtains second image data different from the to-be-processed image data, and judges whether its state is a motion state according to a first device pose corresponding to the to-be-processed image data and a second device pose corresponding to the second image data. For example, the second image data is adjacent to the to-be-processed image data and is its previous frame.
In a possible implementation, the image data processing device determining that the current state is a motion state includes: when the difference between the first device pose and the second device pose is less than or equal to a first threshold, determining that the current state is a motion state.
In a possible implementation, the image data processing device determining that the current state is a motion state includes: the image data processing device obtains second image data captured by a camera, where the second image data is adjacent to the to-be-processed image data and is its previous frame; and the image data processing device judges that its state is a motion state according to the first device pose corresponding to the to-be-processed image data, the second device pose corresponding to the second image data, and the inter-frame difference between the second image data and the to-be-processed image data.
In a possible implementation, the judging that the state of the image data processing device is a motion state according to the first device pose, the second device pose, and the inter-frame difference includes: when the difference between the first device pose corresponding to the to-be-processed image data and the second device pose corresponding to the second image data is less than or equal to the first threshold, and the inter-frame difference between the second image data and the to-be-processed image data is greater than a second threshold, the state of the image data processing device is a motion state.
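A minimal sketch of this motion-state judgment follows. The representation of a device pose as a parameter vector and the use of a vector norm for the pose difference are assumptions for illustration; the embodiment does not fix these representations.

```python
import numpy as np

def is_motion_state(pose_a, pose_b, frame_diff, pose_thresh, frame_thresh):
    """Judge the motion state from two device poses and the inter-frame difference.

    pose_a, pose_b: device pose parameter vectors (illustrative representation)
    frame_diff:     inter-frame difference between the two images
    """
    pose_delta = np.linalg.norm(np.asarray(pose_a) - np.asarray(pose_b))
    # Motion state: pose difference within the first threshold AND
    # inter-frame difference above the second threshold.
    return bool(pose_delta <= pose_thresh and frame_diff > frame_thresh)
```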
In a possible implementation, after the image data processing device determines the semantic segmentation result of the to-be-processed image data, the method provided in this embodiment of the present application further includes: the image data processing device performs an optimization operation on the semantic segmentation result according to the to-be-processed image data and the depth information included in the depth image corresponding to the to-be-processed image data, where the optimization operation is used to correct noise and error parts in the semantic segmentation result. This can make subsequent semantic recognition more accurate.
In a possible implementation, the image data processing device determining the semantic segmentation result of the to-be-processed image data includes: the image data processing device determines the probability of each of one or more plane categories corresponding to any one of the at least some pixels, and uses the plane category with the highest probability among the one or more plane categories corresponding to that pixel as its target plane category, to obtain the semantic segmentation result of the to-be-processed image data. That is, the probability of the target plane category of any pixel is the largest among the probabilities of the one or more plane categories corresponding to that pixel. This can improve the accuracy of semantic recognition.
In a possible implementation, the image data processing device determining the probability of each of the one or more plane categories corresponding to any one of the at least some pixels includes: the image data processing device performs semantic segmentation on the to-be-processed image data with a neural network to obtain the probability of each of the one or more plane categories corresponding to any one of the at least some pixels.
In a possible implementation, the image data processing device performing plane semantic category recognition according to the first dense semantic map to obtain the plane semantic categories of the one or more planes included in the to-be-processed image data includes: the image data processing device determines, according to the to-be-processed image data, the plane equation of each of the one or more planes, and performs the following steps on any one of the planes to obtain its plane semantic category: determining, according to the plane equation of that plane and the first dense semantic map, one or more target plane categories corresponding to that plane and the confidences of the one or more target plane categories; and selecting the target plane category with the highest confidence among them as the semantic plane category of that plane. That is, the semantic plane category of any plane is the target plane category with the highest confidence among the one or more target plane categories corresponding to that plane; selecting it in this way can enhance the accuracy of plane semantic recognition.
In a possible implementation, the orientation of the one or more target plane categories corresponding to any plane is consistent with the orientation of that plane; that is, the orientation of the target plane categories corresponding to each plane is consistent with that plane's own orientation. In this way, target plane categories inconsistent with the plane orientation can be filtered out, enhancing the accuracy of plane semantic recognition.
In a possible implementation, the image data processing device determining, according to the plane equation of any one plane and the first dense semantic map, the one or more target plane categories corresponding to that plane and their confidences includes: the image data processing device determines, from the first dense semantic map according to the plane equation, M first three-dimensional points whose distance to the plane is less than a third threshold, where M is a positive integer; determines the one or more target plane categories corresponding to the M first three-dimensional points as the one or more target plane categories corresponding to the plane, where the orientation of the one or more target plane categories is consistent with the orientation of the plane; and counts the proportion, among the M first three-dimensional points, of the number of three-dimensional points corresponding to each target plane category to obtain the confidences of the one or more target plane categories. For example, the target plane category of each first three-dimensional point is the target plane category of the two-dimensional pixel corresponding to that point, so the one or more target plane categories of all M first three-dimensional points can be obtained.
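This confidence computation can be sketched as follows; the array layout, the function name, and the default value of the third threshold are assumptions for illustration (the orientation-consistency filtering of categories is omitted from the sketch).

```python
import numpy as np
from collections import Counter

def plane_category_confidence(points, categories, normal, d, thresh=0.05):
    """Confidence of candidate plane categories for one detected plane.

    points:      Px3 array of three-dimensional points of the dense semantic map
    categories:  length-P array, target plane category of each point
    (normal, d): plane equation n.x + d = 0
    thresh:      the "third threshold" on point-to-plane distance
    """
    dist = np.abs(points @ normal + d)
    near = dist < thresh            # the M first 3-D points supporting the plane
    m = near.sum()
    if m == 0:
        return {}
    counts = Counter(categories[near].tolist())
    # Confidence = proportion of each category among the M supporting points.
    return {cat: c / m for cat, c in counts.items()}
```

The plane's semantic category is then the key with the maximum confidence, e.g. `max(conf, key=conf.get)`.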
在一种可能的实现方式中，图像数据处理装置在统计所述一个或多个目标平面类别中每个目标平面类别对应的三维点数目在所述M个第一三维点中的比例，得到所述一个或多个目标平面类别的置信之后，本申请实施例提供的方法还包括：图像数据处理装置根据贝叶斯定理或投票机制中至少一项更新所述任一个平面对应的一个或多个目标平面类别的置信。基于贝叶斯定理和投票机制的视频序列更新任一个平面对应的一个或多个平面类别的置信，使最终得到的每个平面的平面语义类别结果更稳定。In a possible implementation manner, after the image data processing device counts the proportion of the number of three-dimensional points corresponding to each of the one or more target plane categories among the M first three-dimensional points to obtain the confidence of the one or more target plane categories, the method provided in this embodiment of the present application further includes: the image data processing device updates the confidence of the one or more target plane categories corresponding to the any one plane according to at least one of Bayes' theorem or a voting mechanism. Updating the confidence of the one or more plane categories corresponding to any plane over the video sequence based on Bayes' theorem and the voting mechanism makes the final plane semantic category result of each plane more stable.
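One simple way to realize the Bayes-theorem-based update over a video sequence is to treat each frame's per-category confidences as likelihoods and fold them into a running posterior. The specific fusion rule below is an assumption for illustration; the application only names Bayes' theorem and a voting mechanism without giving formulas.

```python
def bayes_update(prior, likelihood, eps=1e-6):
    """Fuse per-frame category confidences into a running posterior.

    prior:      dict mapping category -> current probability (previous frames)
    likelihood: dict mapping category -> confidence observed in the new frame
    """
    posterior = {c: prior.get(c, eps) * likelihood.get(c, eps)
                 for c in set(prior) | set(likelihood)}
    total = sum(posterior.values())
    return {c: p / total for c, p in posterior.items()}  # renormalize
```

Repeatedly applying this per frame makes a category that is consistently observed dominate, which is the stabilizing effect the paragraph describes.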
在一种可能的实现方式中，本申请实施例提供的方法还包括：图像数据处理装置判断图像数据处理装置的当前状态是否为运动状态，在当前状态为运动状态的情况下，图像数据处理装置根据语义分割结果，得到第一稠密语义地图。通过判断是否为运动状态，在运动状态时，根据语义分割结果，得到第一稠密语义地图，可以使得图像数据处理装置计算的数据量降低，从而可以降低计算资源，还可以提高语义地图生成算法的性能。In a possible implementation manner, the method provided in this embodiment of the present application further includes: the image data processing device determines whether the current state of the image data processing device is a motion state, and when the current state is a motion state, the image data processing device obtains the first dense semantic map according to the semantic segmentation result. By determining whether the device is in a motion state and obtaining the first dense semantic map according to the semantic segmentation result only in the motion state, the amount of data computed by the image data processing device can be reduced, thereby lowering the required computing resources and improving the performance of the semantic map generation algorithm.
在一种可能的实现方式中,待处理图像数据为置正后的图像数据。In a possible implementation manner, the image data to be processed is image data after correction.
在一种可能的实现方式中,图像数据处理装置获取待处理图像数据之前,本申请实施例提供的方法还包括:图像数据处理装置获取相机拍摄的第一图像数据。图像数据处理装置根据第一图像数据对应的设备位姿,将第一图像数据置正,得到待处理图像数据。In a possible implementation manner, before the image data processing apparatus obtains the image data to be processed, the method provided in the embodiment of the present application further includes: the image data processing apparatus obtains the first image data taken by the camera. The image data processing device corrects the first image data according to the device pose corresponding to the first image data to obtain the image data to be processed.
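The application does not specify how the "correction" (置正) of the first image data is performed. A common approach, sketched here as an assumption, rotates the captured image by a multiple of 90° chosen from the roll angle of the device pose so that the image content becomes upright before segmentation.

```python
import numpy as np

def upright_rotation_steps(device_roll_deg):
    """Choose how many 90-degree rotations bring the image upright,
    given the camera roll angle from the device pose (degrees)."""
    return int(round(device_roll_deg / 90.0)) % 4

def rectify(image, device_roll_deg):
    """Rotate the captured image so its content is upright ('corrected')."""
    k = upright_rotation_steps(device_roll_deg)
    return np.rot90(image, k=k)
```

Restricting the correction to 90° steps keeps the pixel grid intact, so no resampling is needed before the semantic segmentation network runs.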
第二方面，本申请实施例提供一种图像数据处理装置，该图像数据处理装置包括：语义分割模块、语义地图模块以及语义聚类模块，其中，该语义分割模块用于获取相机提供的包括N个像素点的待处理图像数据，N为正整数。语义分割模块还用于确定待处理图像数据的语义分割结果，其中，语义分割结果包括N个像素点中至少部分像素点对应的目标平面类别。语义地图模块，用于根据语义分割结果，得到第一稠密语义地图，所述第一稠密语义地图包括第一三维点云中的至少一个第一三维点对应的至少一个目标平面类别，所述至少一个第一三维点对应于所述至少部分像素点中的至少一个像素点。语义聚类模块，用于根据第一稠密语义地图进行平面语义类别识别，得到待处理图像数据包括的一个或多个平面的平面语义类别。In a second aspect, an embodiment of the present application provides an image data processing device. The image data processing device includes a semantic segmentation module, a semantic map module, and a semantic clustering module. The semantic segmentation module is configured to obtain to-be-processed image data provided by a camera, where the to-be-processed image data includes N pixels and N is a positive integer. The semantic segmentation module is further configured to determine a semantic segmentation result of the to-be-processed image data, where the semantic segmentation result includes target plane categories corresponding to at least some of the N pixels. The semantic map module is configured to obtain a first dense semantic map according to the semantic segmentation result, where the first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel among the at least some pixels. The semantic clustering module is configured to perform plane semantic category recognition according to the first dense semantic map to obtain plane semantic categories of one or more planes included in the to-be-processed image data.
本申请实施例提供一种图像数据处理装置，该图像数据处理装置通过获取待处理图像数据的语义分割结果，由于语义分割结果包括待处理图像数据包括的N个像素点中每个像素点所属的目标平面类别，通过语义分割后续可以提高平面语义识别的准确率。此外，本申请实施例提供的方法，图像数据处理装置根据语义分割结果，得到第一稠密语义地图，之后，通过第一稠密语义地图进行平面语义类别的识别，得到待处理图像数据的平面语义类别可以增强平面语义识别的准确性。在一种可能的实现方式中，语义地图模块用于根据语义分割结果，得到第一稠密语义地图，包括：语义地图模块用于根据语义分割结果，得到第二稠密语义地图。语义地图模块用于将第二稠密语义地图作为第一稠密语义地图。An embodiment of the present application provides an image data processing device. The image data processing device obtains the semantic segmentation result of the to-be-processed image data; because the semantic segmentation result includes the target plane category to which each of the N pixels of the to-be-processed image data belongs, semantic segmentation can subsequently improve the accuracy of plane semantic recognition. In addition, in the method provided in the embodiments of the present application, the image data processing device obtains the first dense semantic map according to the semantic segmentation result, and then performs plane semantic category recognition through the first dense semantic map to obtain the plane semantic categories of the to-be-processed image data, which can enhance the accuracy of plane semantic recognition. In a possible implementation manner, the semantic map module being configured to obtain the first dense semantic map according to the semantic segmentation result includes: the semantic map module is configured to obtain a second dense semantic map according to the semantic segmentation result, and to use the second dense semantic map as the first dense semantic map.
在一种可能的实现方式中,语义地图模块用于根据语义分割结果,得到第一稠密语义地图,包括:语义地图模块用于根据语义分割结果,得到第二稠密语义地图。语义地图模块用于利用第二稠密语义地图中的第二三维点云中的一个或多个第二三维点更新历史稠密语义地图,以得到第一稠密语义地图。In a possible implementation manner, the semantic map module is used to obtain the first dense semantic map according to the semantic segmentation result, including: the semantic map module is used to obtain the second dense semantic map according to the semantic segmentation result. The semantic map module is used to update the historical dense semantic map by using one or more second three-dimensional points in the second three-dimensional point cloud in the second dense semantic map to obtain the first dense semantic map.
在一种可能的实现方式中，该图像数据处理装置还包括：即时定位与地图构建（simultaneous localization and mapping，SLAM）模块，用于计算图像数据的设备位姿（例如相机位姿），语义地图模块用于判断图像数据处理装置的当前状态是否为运动状态，包括：语义地图模块用于获取相机提供的第二图像数据，该第二图像数据不同于待处理图像数据。语义地图模块用于根据SLAM模块提供的待处理图像数据对应的第一设备位姿和SLAM模块提供的第二图像数据对应的第二设备位姿，判断图像数据处理装置的状态是否为运动状态。例如，第二图像数据与待处理图像数据相邻，且位于待处理图像数据的上一帧。In a possible implementation manner, the image data processing device further includes a simultaneous localization and mapping (SLAM) module, configured to calculate the device pose (for example, the camera pose) for image data. The semantic map module being configured to determine whether the current state of the image data processing device is a motion state includes: the semantic map module is configured to obtain second image data provided by the camera, where the second image data is different from the to-be-processed image data; and the semantic map module is configured to determine whether the state of the image data processing device is a motion state according to a first device pose, provided by the SLAM module, corresponding to the to-be-processed image data and a second device pose, provided by the SLAM module, corresponding to the second image data. For example, the second image data is adjacent to the to-be-processed image data and is the frame immediately preceding it.
在一种可能的实现方式中，语义地图模块用于确定当前状态为运动状态包括：在第一设备位姿和第二设备位姿之间的差值小于或等于第一阈值时，语义地图模块用于确定当前状态为运动状态；In a possible implementation manner, the semantic map module being configured to determine that the current state is a motion state includes: when the difference between the first device pose and the second device pose is less than or equal to a first threshold, the semantic map module is configured to determine that the current state is a motion state;
在一种可能的实现方式中，语义地图模块用于确定当前状态为运动状态包括：语义地图模块用于获取相机拍摄的第二图像数据；其中，第二图像数据与待处理图像数据相邻，且位于待处理图像数据的上一帧；语义地图模块用于根据SLAM模块提供的待处理图像数据对应的第一设备位姿和SLAM模块提供的第二图像数据对应的第二设备位姿，以及第二图像数据和待处理图像数据之间的帧间差，判断图像数据处理装置的当前状态为运动状态。In a possible implementation manner, the semantic map module being configured to determine that the current state is a motion state includes: the semantic map module is configured to obtain second image data captured by the camera, where the second image data is adjacent to the to-be-processed image data and is the frame immediately preceding it; and the semantic map module is configured to determine that the current state of the image data processing device is a motion state according to the first device pose, provided by the SLAM module, corresponding to the to-be-processed image data, the second device pose, provided by the SLAM module, corresponding to the second image data, and the inter-frame difference between the second image data and the to-be-processed image data.
在一种可能的实现方式中，语义地图模块用于根据待处理图像数据对应的第一设备位姿和第二图像数据对应的第二设备位姿，以及第二图像数据和待处理图像数据之间的帧间差，判断图像数据处理装置的当前状态为运动状态，包括：在待处理图像数据对应的第一设备位姿和第二图像数据对应的第二设备位姿之间的差值小于或等于第一阈值，且第二图像数据和待处理图像数据之间的帧间差大于第二阈值的情况下，语义地图模块用于确定图像数据处理装置的当前状态为运动状态。In a possible implementation manner, the semantic map module being configured to determine that the current state of the image data processing device is a motion state according to the first device pose corresponding to the to-be-processed image data, the second device pose corresponding to the second image data, and the inter-frame difference between the second image data and the to-be-processed image data includes: when the difference between the first device pose corresponding to the to-be-processed image data and the second device pose corresponding to the second image data is less than or equal to the first threshold, and the inter-frame difference between the second image data and the to-be-processed image data is greater than a second threshold, the semantic map module is configured to determine that the current state of the image data processing device is a motion state.
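The two-threshold motion-state rule above can be sketched as follows. The pose-difference and frame-difference metrics and the threshold values are assumptions; the application only states that the pose difference must be less than or equal to a first threshold while the inter-frame difference exceeds a second threshold.

```python
import numpy as np

def is_motion_state(pose_curr, pose_prev, frame_curr, frame_prev,
                    pose_thresh=0.02, frame_thresh=8.0):
    """Judge the motion state from two consecutive device poses and frames:
    pose difference <= first threshold AND inter-frame difference > second
    threshold, following the rule in the text."""
    pose_diff = np.linalg.norm(np.asarray(pose_curr) - np.asarray(pose_prev))
    frame_diff = np.mean(np.abs(np.asarray(frame_curr, dtype=float)
                                - np.asarray(frame_prev, dtype=float)))
    return pose_diff <= pose_thresh and frame_diff > frame_thresh
```

Only when this returns true would the semantic map module build the first dense semantic map, which is what reduces the computation described earlier.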
在一种可能的实现方式中，语义分割模块还用于根据待处理图像数据以及待处理图像数据对应的深度图像包括的深度信息，对语义分割结果执行优化操作，所述优化操作用于修正所述语义分割结果中的噪声和误差部分。In a possible implementation manner, the semantic segmentation module is further configured to perform an optimization operation on the semantic segmentation result according to the to-be-processed image data and the depth information included in the depth image corresponding to the to-be-processed image data, where the optimization operation is used to correct the noise and error parts in the semantic segmentation result.
在一种可能的实现方式中，语义分割模块用于确定待处理图像数据的语义分割结果，包括用于确定所述至少部分像素点中任一个像素点对应的一个或多个平面类别和所述一个或多个平面类别中每个平面类别的概率，以及用于将所述任一个像素点对应的一个或多个平面类别中概率最大的平面类别作为所述任一个像素点对应的目标平面类别，以得到所述待处理图像数据的语义分割结果。也即待处理图像数据的语义分割结果包括的至少部分像素点中任一个像素点对应的目标平面类别的概率在任一个像素点对应的一个或多个平面类别的概率中最大。In a possible implementation manner, the semantic segmentation module being configured to determine the semantic segmentation result of the to-be-processed image data includes being configured to determine, for any one of the at least some pixels, one or more plane categories corresponding to that pixel and the probability of each of the one or more plane categories, and being configured to take the plane category with the highest probability among the one or more plane categories corresponding to the any one pixel as the target plane category corresponding to that pixel, so as to obtain the semantic segmentation result of the to-be-processed image data. That is, for any one of the at least some pixels included in the semantic segmentation result, the probability of the target plane category corresponding to that pixel is the highest among the probabilities of the one or more plane categories corresponding to that pixel.
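The per-pixel selection described above is an argmax over the class-probability map produced by the segmentation network. A minimal sketch, assuming the network outputs an (H, W, C) probability tensor:

```python
import numpy as np

def target_plane_categories(prob_map):
    """Take a per-pixel class-probability map of shape (H, W, C) and return,
    for each pixel, the index of the highest-probability plane category."""
    return np.argmax(prob_map, axis=-1)  # (H, W) map of target plane categories
```

The returned (H, W) label map is exactly the semantic segmentation result the paragraph refers to: each pixel carries its single highest-probability plane category.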
在一种可能的实现方式中，语义分割模块，用于根据神经网络对待处理图像数据进行语义分割，得到至少部分像素点中任一个像素点对应的一个或多个平面类别中每个平面类别的概率。In a possible implementation manner, the semantic segmentation module is configured to perform semantic segmentation on the to-be-processed image data according to a neural network, to obtain the probability of each of the one or more plane categories corresponding to any one of the at least some pixels.
在一种可能的实现方式中，语义聚类模块用于根据第一稠密语义地图进行平面语义类别识别，得到待处理图像数据包括的一个或多个平面的平面语义类别，包括：语义聚类模块，用于根据待处理图像数据，确定一个或多个平面中每个平面的平面方程。语义聚类模块还用于对所述一个或多个平面中任一个平面执行下述步骤以得到所述一个或多个平面的平面语义类别：语义聚类模块，用于根据所述任一个平面的平面方程，以及所述第一稠密语义地图，确定所述任一个平面对应的一个或多个目标平面类别以及所述一个或多个目标平面类别的置信；语义聚类模块，用于在所述一个或多个目标平面类别中选取具有最高置信的目标平面类别作为所述任一个平面的语义平面类别。In a possible implementation manner, the semantic clustering module being configured to perform plane semantic category recognition according to the first dense semantic map to obtain the plane semantic categories of the one or more planes included in the to-be-processed image data includes: the semantic clustering module is configured to determine, according to the to-be-processed image data, a plane equation of each of the one or more planes; the semantic clustering module is further configured to perform the following steps on any one of the one or more planes to obtain the plane semantic categories of the one or more planes: the semantic clustering module is configured to determine, according to the plane equation of the any one plane and the first dense semantic map, one or more target plane categories corresponding to the any one plane and the confidence of the one or more target plane categories; and the semantic clustering module is configured to select, among the one or more target plane categories, the target plane category with the highest confidence as the semantic plane category of the any one plane.
在一种可能的实现方式中，每个平面对应的一个或多个目标平面类别的朝向与每个平面各自的朝向一致。即任一个平面对应的一个或多个目标平面类别的朝向与所述任一个平面的朝向一致。In a possible implementation manner, the orientation of the one or more target plane categories corresponding to each plane is consistent with that plane's own orientation. That is, the orientation of the one or more target plane categories corresponding to any plane is consistent with the orientation of the any one plane.
在一种可能的实现方式中，语义聚类模块用于根据所述任一个平面的平面方程，以及所述第一稠密语义地图，确定所述任一个平面对应的一个或多个目标平面类别以及所述一个或多个目标平面类别的置信，包括：语义聚类模块用于根据所述任一个平面的平面方程，从所述第一稠密语义地图中确定M个第一三维点，所述M个第一三维点与所述任一个平面之间的距离小于第三阈值，且所述M个第一三维点对应的目标平面类别的朝向与所述任一个平面的朝向一致，M为正整数，所述M个第一三维点对应所述一个或多个平面类别；以及统计所述一个或多个平面类别中每个平面类别对应的三维点数目在所述M个第一三维点中的比例，得到所述一个或多个平面类别的置信。In a possible implementation manner, the semantic clustering module being configured to determine, according to the plane equation of the any one plane and the first dense semantic map, the one or more target plane categories corresponding to the any one plane and the confidence of the one or more target plane categories includes: the semantic clustering module is configured to determine M first three-dimensional points from the first dense semantic map according to the plane equation of the any one plane, where the distance between each of the M first three-dimensional points and the any one plane is less than a third threshold, the orientation of the target plane categories corresponding to the M first three-dimensional points is consistent with the orientation of the any one plane, M is a positive integer, and the M first three-dimensional points correspond to the one or more plane categories; and is configured to count the proportion of the number of three-dimensional points corresponding to each of the one or more plane categories among the M first three-dimensional points, to obtain the confidence of the one or more plane categories.
在一种可能的实现方式中，语义聚类模块用于统计所述一个或多个平面类别中每个平面类别对应的三维点数目在所述M个第一三维点中的比例，得到所述一个或多个平面类别的置信之后，语义聚类模块还用于根据贝叶斯定理或投票机制中至少一项更新所述任一个平面对应的一个或多个目标平面类别的置信。In a possible implementation manner, after the semantic clustering module counts the proportion of the number of three-dimensional points corresponding to each of the one or more plane categories among the M first three-dimensional points to obtain the confidence of the one or more plane categories, the semantic clustering module is further configured to update the confidence of the one or more target plane categories corresponding to the any one plane according to at least one of Bayes' theorem or a voting mechanism.
在一种可能的实现方式中,语义地图模块,用于判断图像数据处理装置的当前状态是否为运动状态。在确定当前状态为运动状态时,语义地图模块用于根据语义分割结果,得到第一稠密语义地图。In a possible implementation, the semantic map module is used to determine whether the current state of the image data processing device is a motion state. When it is determined that the current state is the motion state, the semantic map module is used to obtain the first dense semantic map according to the semantic segmentation result.
在一种可能的实现方式中,待处理图像数据为置正后的图像数据。In a possible implementation manner, the image data to be processed is image data after correction.
在一种可能的实现方式中,语义分割模块用于获取待处理图像数据之前,语义分割模块还用于获取相机拍摄的第一图像数据。语义分割模块用于根据SLAM模块提供的第一图像数据对应的设备位姿,将第一图像数据置正,得到待处理图像数据。In a possible implementation manner, before the semantic segmentation module is used to obtain the image data to be processed, the semantic segmentation module is also used to obtain the first image data taken by the camera. The semantic segmentation module is used to correct the first image data according to the device pose corresponding to the first image data provided by the SLAM module to obtain the image data to be processed.
在一种可能的实现方式中，SLAM模块、语义聚类模块以及语义地图模块运行在中央处理器CPU上，而语义分割模块中执行语义分割的部分可以运行在NPU上，语义分割模块中除语义分割的功能外的其他部分运行在中央处理器CPU上。In a possible implementation manner, the SLAM module, the semantic clustering module, and the semantic map module run on a central processing unit (CPU), while the part of the semantic segmentation module that performs semantic segmentation may run on an NPU, and the parts of the semantic segmentation module other than the semantic segmentation function run on the CPU.
第三方面，本申请实施例提供一种计算机可读存储介质，该可读存储介质中存储有指令，当指令被执行时，实现如第一方面任一方面描述的方法。In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing instructions; when the instructions are executed, the method described in any implementation of the first aspect is implemented.
第四方面，本申请实施例提供一种图像数据处理装置，包括：第一处理器、以及第二处理器，其中，第一处理器，用于获取包括N个像素点的待处理图像数据，N为正整数。第二处理器，用于确定待处理图像数据的语义分割结果，其中，语义分割结果包括N个像素点中至少部分像素点对应的目标平面类别；第一处理器，用于根据所述语义分割结果，得到第一稠密语义地图，第一稠密语义地图包括第一三维点云中的至少一个第一三维点对应的至少一个目标平面类别，所述至少一个第一三维点对应于所述至少部分像素点中的至少一个像素点；第一处理器，用于根据所述第一稠密语义地图进行平面语义类别识别，得到所述待处理图像数据包括的一个或多个平面的平面语义类别。In a fourth aspect, an embodiment of the present application provides an image data processing device, including a first processor and a second processor. The first processor is configured to obtain to-be-processed image data including N pixels, where N is a positive integer. The second processor is configured to determine a semantic segmentation result of the to-be-processed image data, where the semantic segmentation result includes target plane categories corresponding to at least some of the N pixels. The first processor is configured to obtain a first dense semantic map according to the semantic segmentation result, where the first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, and the at least one first three-dimensional point corresponds to at least one pixel among the at least some pixels. The first processor is configured to perform plane semantic category recognition according to the first dense semantic map to obtain plane semantic categories of one or more planes included in the to-be-processed image data.
在一种可能的实现方式中，第一处理器，具体用于根据所述语义分割结果，和所述待处理图像数据对应的深度图像，得到第二稠密语义地图；第一处理器，具体用于将所述第二稠密语义地图作为所述第一稠密语义地图，或，第一处理器，具体用于利用所述第二稠密语义地图中的第二三维点云中的一个或多个第二三维点更新历史稠密语义地图，以得到所述第一稠密语义地图。In a possible implementation manner, the first processor is specifically configured to obtain a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the to-be-processed image data; and the first processor is specifically configured to use the second dense semantic map as the first dense semantic map, or the first processor is specifically configured to update a historical dense semantic map by using one or more second three-dimensional points in a second three-dimensional point cloud in the second dense semantic map, to obtain the first dense semantic map.
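The application does not give the geometry used to build the dense semantic map from the segmentation result and the depth image. A standard pinhole back-projection, sketched below under assumed camera intrinsics and a SLAM-provided pose, produces one labeled world-frame 3D point per pixel, which is the kind of point cloud the paragraphs above describe.

```python
import numpy as np

def backproject_semantic_map(depth, labels, K, T_wc):
    """Back-project every labeled pixel into a world-frame 3D point, yielding
    a dense semantic point cloud (one (x, y, z) point plus category per pixel).

    depth:  (H, W) depth image in meters
    labels: (H, W) per-pixel target plane categories
    K:      3x3 camera intrinsic matrix (assumed known)
    T_wc:   4x4 camera-to-world pose, e.g. from the SLAM module
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    pix = np.stack([u.ravel() * z, v.ravel() * z, z])       # homogeneous pixels * depth
    cam = np.linalg.inv(K) @ pix                            # camera-frame points
    world = (T_wc @ np.vstack([cam, np.ones(z.size)]))[:3]  # world-frame points
    return world.T, labels.ravel()
```

Merging these per-frame labeled points into the running (historical) map would then give the updated first dense semantic map.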
在一种可能的实现方式中，在第二处理器，用于确定所述待处理图像数据的语义分割结果，包括用于根据待处理图像数据以及待处理图像数据对应的深度图像包括的深度信息，对语义分割结果执行优化操作，所述优化操作用于修正所述语义分割结果中的噪声和误差部分。In a possible implementation manner, the second processor being configured to determine the semantic segmentation result of the to-be-processed image data includes being configured to perform an optimization operation on the semantic segmentation result according to the to-be-processed image data and the depth information included in the depth image corresponding to the to-be-processed image data, where the optimization operation is used to correct the noise and error parts in the semantic segmentation result.
在一种可能的实现方式中，第二处理器，用于确定所述待处理图像数据的语义分割结果之前，第二处理器还用于确定所述至少部分像素点中任一个像素点对应的一个或多个平面类别中每个平面类别的概率；以及用于将所述任一个像素点对应的一个或多个平面类别中概率最大的平面类别作为所述任一个像素点对应的目标平面类别，以得到所述待处理图像数据的语义分割结果。也即任一个像素点对应的目标平面类别的概率在任一个像素点对应的一个或多个平面类别的概率中最大。这样可以提高语义识别的准确性。In a possible implementation manner, before the second processor determines the semantic segmentation result of the to-be-processed image data, the second processor is further configured to determine the probability of each of the one or more plane categories corresponding to any one of the at least some pixels, and to take the plane category with the highest probability among the one or more plane categories corresponding to the any one pixel as the target plane category corresponding to that pixel, so as to obtain the semantic segmentation result of the to-be-processed image data. That is, the probability of the target plane category corresponding to any one pixel is the highest among the probabilities of the one or more plane categories corresponding to that pixel. This can improve the accuracy of semantic recognition.
在一种可能的实现方式中，第二处理器，用于根据神经网络对所述待处理图像数据进行语义分割，得到所述至少部分像素点中任一个像素点对应的一个或多个平面类别中每个平面类别的概率。In a possible implementation manner, the second processor is configured to perform semantic segmentation on the to-be-processed image data according to a neural network, to obtain the probability of each of the one or more plane categories corresponding to any one of the at least some pixels.
在一种可能的实现方式中，第一处理器，用于确定所述一个或多个平面中每个平面的平面方程；第一处理器，还用于对所述一个或多个平面中任一个平面执行下述步骤以得到所述一个或多个平面的平面语义类别：第一处理器，还用于根据所述任一个平面的平面方程，以及所述第一稠密语义地图，确定所述任一个平面对应的一个或多个目标平面类别以及所述一个或多个目标平面类别的置信；第一处理器，还用于在所述一个或多个目标平面类别中选取具有最高置信的目标平面类别作为所述任一个平面的语义平面类别。也即任一个平面的语义平面类别为该任一个平面对应的所述一个或多个目标平面类别中最高置信的目标平面类别。In a possible implementation manner, the first processor is configured to determine a plane equation of each of the one or more planes; the first processor is further configured to perform the following steps on any one of the one or more planes to obtain the plane semantic categories of the one or more planes: the first processor is further configured to determine, according to the plane equation of the any one plane and the first dense semantic map, one or more target plane categories corresponding to the any one plane and the confidence of the one or more target plane categories; and the first processor is further configured to select, among the one or more target plane categories, the target plane category with the highest confidence as the semantic plane category of the any one plane. That is, the semantic plane category of any plane is the target plane category with the highest confidence among the one or more target plane categories corresponding to that plane.
在一种可能的实现方式中,任一个平面对应的一个或多个目标平面类别的朝向与所述任一个平面的朝向一致。In a possible implementation manner, the orientation of one or more target plane categories corresponding to any plane is consistent with the orientation of the any plane.
在一种可能的实现方式中，第一处理器，具体用于根据所述任一个平面的平面方程，从所述第一稠密语义地图中确定M个第一三维点，所述M个第一三维点与所述任一个平面之间的距离小于第三阈值，M为正整数；将所述M个第一三维点对应的一个或多个目标平面类别确定为所述任一个平面对应的所述一个或多个目标平面类别，所述一个或多个目标平面类别的朝向与所述任一个平面的朝向一致，统计所述一个或多个目标平面类别中每个目标平面类别对应的三维点数目在所述M个第一三维点中的比例，得到所述一个或多个目标平面类别的置信。In a possible implementation manner, the first processor is specifically configured to determine M first three-dimensional points from the first dense semantic map according to the plane equation of the any one plane, where the distance between each of the M first three-dimensional points and the any one plane is less than a third threshold and M is a positive integer; to determine the one or more target plane categories corresponding to the M first three-dimensional points as the one or more target plane categories corresponding to the any one plane, where the orientation of the one or more target plane categories is consistent with the orientation of the any one plane; and to count the proportion of the number of three-dimensional points corresponding to each of the one or more target plane categories among the M first three-dimensional points, to obtain the confidence of the one or more target plane categories.
在一种可能的实现方式中，第一处理器，具体用于统计所述一个或多个目标平面类别中每个目标平面类别对应的三维点数目在所述M个第一三维点中的比例，得到所述一个或多个目标平面类别的置信之后，所述第一处理器，还用于根据贝叶斯定理或投票机制中至少一项更新所述任一个平面对应的一个或多个目标平面类别的置信。In a possible implementation manner, after the first processor counts the proportion of the number of three-dimensional points corresponding to each of the one or more target plane categories among the M first three-dimensional points to obtain the confidence of the one or more target plane categories, the first processor is further configured to update the confidence of the one or more target plane categories corresponding to the any one plane according to at least one of Bayes' theorem or a voting mechanism.
在一种可能的实现方式中,第一处理器,用于判断当前状态是否为运动状态;以及用于在确定当前状态为所述运动状态时,根据语义分割结果,得到第一稠密语义地图。In a possible implementation manner, the first processor is configured to determine whether the current state is the motion state; and when it is determined that the current state is the motion state, obtain the first dense semantic map according to the semantic segmentation result.
在一种可能的实现方式中,第一处理器可以为CPU或者DSP。第二处理器可以为NPU。In a possible implementation manner, the first processor may be a CPU or a DSP. The second processor may be an NPU.
第五方面，本申请实施例提供一种图像数据处理装置，包括：一个或多个处理器，其中，一个或多个处理器用于运行存储器中存储的指令以执行如第一方面任一方面描述的方法。In a fifth aspect, an embodiment of the present application provides an image data processing device, including one or more processors, where the one or more processors are configured to run instructions stored in a memory to perform the method described in any implementation of the first aspect.
第六方面，提供一种包括指令的计算机程序产品，计算机程序产品中包括指令，当指令被运行时，实现如第一方面任一方面描述的方法。In a sixth aspect, a computer program product including instructions is provided; when the instructions are run, the method described in any implementation of the first aspect is implemented.
附图说明Description of the drawings
图1为本申请实施例提供的一种电子设备的硬件结构示意图;FIG. 1 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the application;
图2为本申请实施例提供的一种平面语义类别的识别方法适用的软件架构示意图;2 is a schematic diagram of a software architecture applicable to a method for identifying planar semantic categories provided by an embodiment of the application;
图3为本申请实施例提供的一种平面语义类别的识别方法的流程示意图;FIG. 3 is a schematic flowchart of a method for recognizing planar semantic categories according to an embodiment of this application;
图4为本申请实施例提供的另一种平面语义类别的识别方法的流程示意图;FIG. 4 is a schematic flowchart of another method for recognizing planar semantic categories according to an embodiment of this application;
图5为本申请实施例提供的图像数据处理装置获取到的第一图像数据处理前和处理后的示意图;5 is a schematic diagram of the first image data before and after processing obtained by the image data processing device provided by the embodiment of the application;
图6为本申请实施例提供的语义分割结果示意图;FIG. 6 is a schematic diagram of a semantic segmentation result provided by an embodiment of this application;
图7为本申请实施例提供的一种坐标映射示意图;FIG. 7 is a schematic diagram of a coordinate mapping provided by an embodiment of this application;
图8为本申请实施例提供的一种运动状态的判断流程示意图;FIG. 8 is a schematic diagram of a flow state determination process provided by an embodiment of the application;
图9为本申请实施例提供的一种平面置信的计算流程;FIG. 9 is a calculation flow of plane confidence provided by an embodiment of the application;
图10为本申请实施例提供的一种语义分割结果执行滤波的流程示意图;FIG. 10 is a schematic flow chart of performing filtering on semantic segmentation results according to an embodiment of this application;
图11为本申请实施例提供的另一种语义分割结果执行滤波的流程示意图;FIG. 11 is a schematic diagram of another process of performing filtering on semantic segmentation results according to an embodiment of the application;
图12为本申请实施例提供的平面语义结果的示意图;FIG. 12 is a schematic diagram of a planar semantic result provided by an embodiment of this application;
图13为本申请实施例提供的一种图像数据处理装置的结构示意图。FIG. 13 is a schematic structural diagram of an image data processing device provided by an embodiment of the application.
具体实施方式 Detailed Description of the Embodiments
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述。本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application will be further described in detail below with reference to the accompanying drawings. In this application, "at least one" refers to one or more, and "multiple" refers to two or more. "And/or" describes the association relationship of the associated objects, indicating that there can be three relationships, for example, A and/or B, which can mean: A alone exists, A and B exist at the same time, and B exists alone, where A, B can be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "The following at least one item (a)" or similar expressions refers to any combination of these items, including any combination of a single item (a) or a plurality of items (a). For example, at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple .
本申请实施例提供的平面语义类别的识别方法可应用于各种带TOF的图像数据处理装置,该图像数据处理装置可以是电子设备。其中,电子设备可以包括但不限于个人计算机、服务器计算机、手持式或膝上型设备、移动设备(比如手机、移动电话、平板电脑、个人数字助理、媒体播放器等)、消费型电子设备、小型计算机、大型计算机、移动机器人、无人机等。举例说明,本申请实施例中的电子设备可以为具有AR功能的设备,例如,具有AR眼镜功能的设备,可以应用于AR自动测量,AR装修,AR交互等场景。The method for identifying planar semantic categories provided in the embodiments of the present application can be applied to various image data processing apparatuses with TOF, and the image data processing apparatus may be an electronic device. Among them, electronic devices may include, but are not limited to, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, mobile phones, tablet computers, personal digital assistants, media players, etc.), consumer electronic devices, Small computers, large computers, mobile robots, drones, etc. For example, the electronic device in the embodiment of the present application may be a device with AR function, for example, a device with AR glasses function, which can be applied to scenarios such as AR automatic measurement, AR decoration, and AR interaction.
当图像数据处理装置需要识别待处理图像数据包括的一个或多个平面中每个平面的平面类别时，一种可能的实现方式中，图像数据处理装置可以采用本申请实施例提供的平面语义类别的识别方法，得到待处理图像数据的平面类别识别结果。另一种可能的实现方式中，图像数据处理装置可以将待处理图像数据发送给具有实现平面语义类别识别过程的其它设备，比如服务器或者终端设备，由该服务器或者终端设备执行平面语义类别的识别过程，然后该图像数据处理装置接收来自其它设备的平面类别识别结果。When the image data processing device needs to identify the plane category of each of the one or more planes included in the to-be-processed image data, in one possible implementation the image data processing device can use the plane semantic category recognition method provided in the embodiments of this application to obtain the plane category recognition result of the to-be-processed image data. In another possible implementation, the image data processing device can send the to-be-processed image data to another device capable of performing the plane semantic category recognition process, such as a server or a terminal device; that server or terminal device performs the plane semantic category recognition process, and the image data processing device then receives the plane category recognition result from the other device.
以下实施例中,以图像数据处理装置为电子设备为例,对本申请实施例中提供的一种平面语义类别的识别方法进行介绍。本申请实例提供的一种平面语义类别的识别方法,适用于如图1所示的电子设备,下面先简单介绍电子设备的具体结构。In the following embodiments, taking the image data processing apparatus as an electronic device as an example, a method for recognizing planar semantic categories provided in the embodiments of the present application is introduced. The method for identifying flat semantic categories provided by the example of this application is applicable to the electronic device as shown in FIG. 1. The specific structure of the electronic device will be briefly introduced below.
Referring to FIG. 1, it is a schematic diagram of the hardware structure of an electronic device to which an embodiment of the present application is applied. As shown in FIG. 1, the electronic device 100 may include a display device 110, a processor 120, and a memory 130. The memory 130 may be used to store software programs and data, and the processor 120 performs various functional applications and data processing of the electronic device 100 by running the software programs and data stored in the memory 130.
The memory 130 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and application programs required by at least one function (such as an image capture function); the data storage area may store data created during the use of the electronic device 100 (such as audio data, text information, and image data). In addition, the memory 130 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device or flash memory device, or another solid-state storage device.
The processor 120 is the control center of the electronic device 100. It connects the various parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device 100 and processes data by running or executing the software programs and/or data stored in the memory 130, thereby monitoring the electronic device as a whole. The processor 120 may include one or more processing units; for example, it may include a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent devices or may be integrated into one or more processors.
As a computing processor for neural networks (NN), the NPU processes input information quickly by drawing on the structure of biological neural networks, for example the transfer mode between neurons in the human brain, and can also learn continuously by itself. Applications involving intelligent cognition of the electronic device 100, such as image recognition, face recognition, speech recognition, and text understanding, can be realized through the NPU.
In some embodiments, the processor 120 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface.
The I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 120 may include multiple groups of I2C buses, and may be coupled to the touch sensor, the charger, the flash, the image capture device 160, and so on through different I2C bus interfaces. For example, the processor 120 may be coupled to the touch sensor through an I2C interface, so that the processor 120 communicates with the touch sensor over the I2C bus to realize the touch function of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, the processor 120 may include multiple groups of I2S buses, and may be coupled to the audio module through an I2S bus to implement communication between the processor 120 and the audio module. In some embodiments, the audio module may transmit audio signals to the WiFi module 190 through the I2S interface, to realize the function of answering calls through a Bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing, and encoding analog signals. In some embodiments, the audio module and the WiFi module 190 may be coupled through a PCM bus interface. In some embodiments, the audio module may also transmit audio signals to the WiFi module 190 through the PCM interface, to realize the function of answering calls through a Bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communication. The bus may be a bidirectional communication bus; it converts the data to be transmitted between serial and parallel communication. In some embodiments, the UART interface is typically used to connect the processor 120 and the WiFi module 190. For example, the processor 120 communicates with the Bluetooth module in the WiFi module 190 through the UART interface to realize the Bluetooth function. In some embodiments, the audio module may transmit audio signals to the WiFi module 190 through the UART interface, to realize the function of playing music through a Bluetooth headset.
The MIPI interface may be used to connect the processor 120 with peripheral devices such as the display device 110 and the image capture device 160. The MIPI interface includes a camera serial interface (CSI) for the image capture device 160, a display serial interface (DSI), and so on. In some embodiments, the processor 120 and the image capture device 160 communicate through the CSI interface to implement the shooting function of the electronic device 100, and the processor 120 and the display screen communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface can be configured through software, either as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 120 with the image capture device 160, the display device 110, the WiFi module 190, the audio module, the sensor module, and so on. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and so on.
The USB interface is an interface that complies with the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface may be used to connect a charger to charge the electronic device 100, to transfer data between the electronic device 100 and peripheral devices, or to connect earphones and play audio through them. The interface may also be used to connect other electronic devices, such as AR devices.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of the present invention are merely schematic and do not constitute a structural limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt interface connection modes different from those in the foregoing embodiments, or a combination of multiple interface connection modes.
The electronic device 100 further includes an image capture device 160 for shooting images or videos. The image capture device 160 includes one or more cameras for capturing image data, as well as a TOF camera for capturing depth images. For example, a camera collects a video graphics array (VGA) sequence or image data and sends it to the CPU and GPU. The camera may be an ordinary camera or a focusing camera.
The electronic device 100 may further include an input device 140 for receiving input digital information, character information, or contact touch operations/contactless gestures, and for generating signal inputs related to the user settings and function control of the electronic device 100.
The display device 110 includes a display panel 111, which is used to display information input by the user or provided to the user, as well as the various menu interfaces of the electronic device 100. In the embodiments of the present application, it is mainly used to display the image data to be processed that is acquired by the camera or the sensors of the electronic device 100. Optionally, the display panel 111 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The electronic device 100 may further include one or more sensors 170, such as an image sensor, an infrared sensor, a laser sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, an ambient light sensor, a fingerprint sensor, a touch sensor, a temperature sensor, a bone conduction sensor, and an inertial measurement unit (IMU); the image sensor may be a time-of-flight (TOF) sensor, a structured-light sensor, or the like. Specifically, the inertial measurement unit is a device that measures the three-axis attitude angles (or angular rates) and the acceleration of an object. In general, an IMU contains three single-axis accelerometers and three single-axis gyroscopes. The accelerometers detect the acceleration signals of the object along the three independent axes of the carrier coordinate system, while the gyroscopes detect the angular velocity signals of the carrier relative to the navigation coordinate system; the IMU thus measures the angular velocity and acceleration of the object in three-dimensional space and calculates the attitude of the object from them. In addition, the image sensor may be a component of the image capture device 160 or an independent component for capturing image data.
In addition, the electronic device 100 may further include a power supply 150 for supplying power to the other modules. The electronic device 100 may further include a radio frequency (RF) circuit 180 for network communication with wireless network devices, and a WiFi module 190 for WiFi communication with other devices, for example for acquiring images or data transmitted by other devices. Although not shown in FIG. 1, the electronic device 100 may also include other possible functional modules such as a flash, a Bluetooth module, an external interface, buttons, and a motor, which are not described again here.
As shown in FIG. 2, FIG. 2 illustrates a software architecture to which the plane semantic category identification method provided by an embodiment of the present application is applicable. The software architecture runs on the electronic device 100 shown in FIG. 1 and includes a semantic segmentation module 202, a semantic map module 203, and a semantic clustering module 204. Optionally, the software architecture may further include a simultaneous localization and mapping (SLAM) module 201. The SLAM module 201, the semantic map module 203, and the semantic clustering module 204 run on the CPU of the electronic device described in FIG. 1. Alternatively, part of the functions in the SLAM module 201 may be deployed on a digital signal processor (DSP); part of the functions in the semantic segmentation module 202 runs on the NPU of the electronic device described in FIG. 1, and the remaining functions of the semantic segmentation module 202 run on the CPU. Which functions run on the NPU is described in the subsequent description.
The SLAM module 201 takes as input the video graphics sequence, including one or more frames of image data, provided by the camera (that is, the image capture device 160 of the electronic device described in FIG. 1), the depth information or depth images of the image data provided by the TOF sensor, and the IMU data provided by the IMU. Using the correlation between frames of image data, combined with the principles of visual geometry, it computes the device pose (for example, when the device is a camera, the device pose may refer to the camera pose), that is, the rotation and translation of the camera relative to the first frame. It also detects planes and outputs the device pose and the normal parameters and boundary points of each plane. The IMU data includes accelerometer and gyroscope readings; the depth information includes the distance between each pixel in the image data and the camera that captured the image data.
The semantic segmentation module 202 implements SLAM-based data enhancement for semantic segmentation and is divided into pre-processing, AI processing, and post-processing. The input of the pre-processing is the original image data provided by the camera (for example, an RGB image) and the device pose obtained by the SLAM module 201; the original image data is rotated upright according to the device pose, and the output is the uprighted image data. Compared with adding rotated data during training, this reduces the rotation-invariance constraint on the semantic segmentation model and improves the recognition rate.
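As a concrete illustration of this pre-processing step, the sketch below uprights an image by rotating it by the multiple of 90 degrees closest to the camera roll angle taken from the device pose. The function names and the use of a plain nested-list image are illustrative assumptions, not part of the embodiment itself.

```python
def upright_steps(roll_deg):
    """Number of counter-clockwise 90-degree rotations that best cancels
    the camera roll angle reported by the SLAM device pose."""
    return round(roll_deg / 90.0) % 4

def rot90_ccw(img):
    """Rotate a row-major nested-list image 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def upright(img, roll_deg):
    """Rotate the raw camera image so the scene appears upright before
    it is fed to the semantic segmentation network."""
    for _ in range(upright_steps(roll_deg)):
        img = rot90_ccw(img)
    return img
```

Restricting the correction to 90-degree steps keeps the pixel grid intact; a production system could equally resample at an arbitrary roll angle.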
The AI processing performs semantic segmentation based on a neural network and runs on the NPU. Its input is the uprighted image data, and its output is, for each pixel included in the uprighted image data, the probability distribution over one or more plane categories (that is, the probability that each pixel belongs to each of the one or more plane categories). If the plane category with the highest probability is selected for each pixel, a pixel-level semantic segmentation result is obtained. For example, the neural network may be a convolutional neural network (CNN), a deep neural network (DNN), or a recurrent neural network (RNN).
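The selection of the highest-probability category described above can be sketched as a per-pixel argmax over the network output; the category names here are invented for illustration only.

```python
def segment(prob_map, categories):
    """prob_map[i][j] is the probability distribution over plane
    categories for pixel (i, j); pick the most probable category
    for each pixel to obtain a pixel-level segmentation result."""
    return [[categories[max(range(len(p)), key=p.__getitem__)]
             for p in row]
            for row in prob_map]
```

For a single-row image whose two pixels score highest on "floor" and "wall" respectively, the result is one row of those two labels.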
The input of the post-processing is the original image data provided by the camera, the depth information, and the semantic segmentation result output by the AI processing. The post-processing mainly filters the semantic segmentation result according to the original image data and the depth information, and its output is the optimized semantic segmentation result; the segmentation after post-processing has better accuracy and edges. It can be understood that post-processing is not an essential technique of the embodiments and may be skipped. Optionally, the pre-processing and the post-processing may run on the CPU or another processor rather than the NPU, which is not limited in this embodiment.
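The embodiment does not fix a particular filter; one simple form such label filtering can take is a neighborhood majority vote that counts only neighbors whose depth is close to the center pixel, so that labels do not bleed across depth discontinuities. This is only an illustrative stand-in under that assumption.

```python
from collections import Counter

def filter_labels(labels, depth, max_depth_gap=0.1):
    """3x3 majority vote over per-pixel labels, counting only neighbors
    whose depth differs from the center pixel by at most max_depth_gap."""
    h, w = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for i in range(h):
        for j in range(w):
            votes = Counter()
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w and \
                       abs(depth[ni][nj] - depth[i][j]) <= max_depth_gap:
                        votes[labels[ni][nj]] += 1
            out[i][j] = votes.most_common(1)[0][0]
    return out
```

An isolated mislabeled pixel surrounded by a consistent region is overwritten by the majority label, which is the edge-cleaning effect the post-processing aims for.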
The inputs of the semantic map module 203 are the optimized semantic segmentation result (or the unoptimized semantic segmentation result), the device pose provided by the SLAM module 201, the depth information provided by the TOF sensor, and the original image data provided by the camera. Based on SLAM technology, the semantic map module 203 mainly generates a dense semantic map from these inputs. The process includes converting the original two-dimensional image data into a three-dimensional dense semantic map: through the conversion, the two-dimensional RGB pixels in the original image data are converted into three-dimensional points in three-dimensional space, so that each pixel carries depth information in addition to its RGB information. For this 2D-to-3D conversion, reference may be made to the description in the prior art, which is not repeated here. Through the conversion, the target plane category of each pixel is taken as the target plane category of the three-dimensional point corresponding to that pixel, so that the target plane categories of multiple pixels become the target plane categories of multiple three-dimensional points. The dense semantic map therefore includes the target plane categories of multiple three-dimensional points, and the target plane category of any three-dimensional point corresponds to the target plane category of the two-dimensional pixel corresponding to that three-dimensional point.
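The 2D-to-3D conversion referred to above can be sketched with the standard pinhole back-projection: a pixel (u, v) with TOF depth d is lifted to a camera-frame point using the camera intrinsics (fx, fy, cx, cy), and the device pose from the SLAM module (rotation R, translation t) then maps it into the world frame. The intrinsics and pose values below are placeholders, not values from the embodiment.

```python
def backproject(u, v, d, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth d (from the TOF sensor) to a 3D
    point in the camera coordinate frame of a pinhole camera."""
    return ((u - cx) * d / fx, (v - cy) * d / fy, d)

def to_world(p, R, t):
    """Apply the device pose (3x3 rotation R, translation t) to a
    camera-frame point, giving the world-frame 3D point that is stored
    in the dense semantic map with the pixel's target plane category."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i]
                 for i in range(3))
```

A pixel at the principal point maps straight down the optical axis, and the pose then places that point consistently with all other frames, which is what lets per-frame labels accumulate into one map.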
It can be understood that when post-processing is performed in an embodiment, the input of the semantic map module 203 is the optimized semantic segmentation result; when post-processing is not performed, the input of the semantic map module 203 is the unoptimized semantic segmentation result.
The semantic clustering module 204 performs plane semantic recognition based on the dense semantic map. Based on the above introduction, this application provides a plane semantic category identification method and an image data processing apparatus, where the method enables the image data processing apparatus to detect more than one plane included in the image data. In the embodiments of the present application, the method and the image data processing apparatus are based on the same inventive concept; since the principles by which the method and the apparatus solve the problem are similar, the implementations of the image data processing apparatus and the method may refer to each other, and repetition is not described again.
As shown in FIG. 3, FIG. 3 illustrates a plane semantic category identification method provided by an embodiment of the present application. The method is applied to an image data processing apparatus and includes the following steps. Step 301: the semantic segmentation module 202 acquires image data to be processed, where the image data to be processed includes N pixels and N is a positive integer. It should be understood that the image data to be processed may be captured by the camera of the image data processing apparatus and provided to the semantic segmentation module 202, may be obtained by the semantic segmentation module 202 from a gallery used for storing image data in the image data processing apparatus, or may be sent by another device; the embodiments of the present application do not limit this. For example, the image data to be processed may be a two-dimensional image, and may be a color photo or a black-and-white photo, which is likewise not limited by the embodiments of the present application.
It should be noted that the N pixels may be all of the pixels in the image data to be processed, or only some of them. When the N pixels are only some of the pixels in the image data to be processed, they may be the pixels belonging to plane categories, excluding the pixels of the non-plane category. It can be understood that a pixel of the non-plane category is a pixel that does not belong to any recognized plane category; such a pixel is considered not to lie on any plane.
Step 302: the semantic segmentation module 202 determines the semantic segmentation result of the image data to be processed. The semantic segmentation result includes the target plane categories corresponding to at least some of the N pixels included in the image data to be processed. Optionally, the at least some pixels may be the pixels of one or more planes included in the image data to be processed. Here, the target plane categories corresponding to at least some of the N pixels may refer to the target plane categories corresponding to some of the N pixels, or to the target plane categories corresponding to all of the N pixels.
On the one hand, the image data processing apparatus in the embodiments of the present application may determine the semantic segmentation result of the image data to be processed by itself; in this case, the image data processing apparatus may include a module (for example, an NPU) that determines the semantic segmentation result of the image data to be processed.
On the other hand, the image data processing apparatus in the embodiments of the present application may also send the image data to be processed to a device with the function of determining the semantic segmentation result of the image data to be processed, so that that device determines the semantic segmentation result; the image data processing apparatus then obtains the semantic segmentation result from that device. In the embodiments of the present application, by determining the semantic segmentation result of the image data to be processed, the image data processing apparatus can detect the one or more planes included in the image data to be processed.
Step 303: the semantic map module 203 obtains a first dense semantic map according to the semantic segmentation result. The first dense semantic map includes at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, where the at least one first three-dimensional point corresponds to at least one of the at least some pixels. The purpose of step 303 in the embodiments of the present application is that the semantic map module 203 uses the plane category of each pixel in the two-dimensional space to update the plane category of the three-dimensional point corresponding to that pixel in the three-dimensional space, that is, to use it as the target plane category of the three-dimensional point.
In one possible implementation, the semantic map module 203 may take the at least one target plane category corresponding to the three-dimensional point cloud that corresponds to all of the pixels in the semantic segmentation result as the first dense semantic map. Step 303 can improve the performance of the semantic map generation algorithm. Step 304: the semantic clustering module 204 performs plane semantic category recognition according to the first dense semantic map, and obtains the plane semantic categories of the one or more planes included in the image data to be processed.
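The recognition in step 304 can be illustrated as follows: given the target plane categories of the 3D points in the dense semantic map and a per-point assignment to one of the geometric planes detected by the SLAM module, the plane semantic category of each plane may be taken as the majority category among the points that fall on it. The data layout used here is an assumption for illustration.

```python
from collections import Counter, defaultdict

def plane_semantics(point_plane_ids, point_categories):
    """For each detected plane id, return the plane semantic category
    voted by the 3D points of the dense semantic map assigned to it."""
    votes = defaultdict(Counter)
    for plane_id, category in zip(point_plane_ids, point_categories):
        votes[plane_id][category] += 1
    return {pid: c.most_common(1)[0][0] for pid, c in votes.items()}
```

Voting over many 3D points accumulated across frames is what gives the map-based recognition its stability compared with a single-frame decision.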
本申请实施例提供一种平面语义类别的识别方法,该方法通过获取待处理图像数据的语义分割结果,由于语义分割结果包括待处理图像数据包括的N个像素点中每个 像素点所属的目标平面类别,通过语义分割后续可以提高平面语义识别的准确率。此外,本申请实施例提供的方法,图像数据处理装置根据语义分割结果,得到第一稠密语义地图,之后,通过第一稠密语义地图进行平面语义类别的识别,得到待处理图像数据的平面语义类别可以增强平面语义识别的准确性和稳定性。The embodiment of the present application provides a method for recognizing planar semantic categories. The method obtains the result of semantic segmentation of image data to be processed. Since the result of semantic segmentation includes the target to which each pixel of the N pixels included in the image data to be processed belongs For plane categories, subsequent semantic segmentation can improve the accuracy of plane semantic recognition. In addition, in the method provided by the embodiments of the present application, the image data processing device obtains the first dense semantic map according to the semantic segmentation result, and then, uses the first dense semantic map to recognize the planar semantic category to obtain the planar semantic category of the image data to be processed It can enhance the accuracy and stability of planar semantic recognition.
在一种可能的实现方式中,本申请实施例中的步骤303可以通过以下方式实现:语义地图模块203判断图像数据处理装置的当前状态是否为运动状态。在确定当前状态为运动状态时,根据语义分割结果,得到第一稠密语义地图。通过判断是否为运动状态,在运动状态时根据语义分割结果,得到第一稠密语义地图,这样可以降低计算量。In a possible implementation manner, step 303 in the embodiment of the present application can be implemented in the following manner: the semantic map module 203 determines whether the current state of the image data processing device is a motion state. When it is determined that the current state is a motion state, the first dense semantic map is obtained according to the semantic segmentation result. By judging whether it is in the motion state, the first dense semantic map is obtained according to the semantic segmentation result in the motion state, which can reduce the amount of calculation.
在一种可能的实现方式中,图像数据处理装置的当前状态非运动状态时,也即为静止状态时,图像数据处理装置使用历史稠密语义地图作为第一稠密语义地图。In a possible implementation manner, when the current state of the image data processing device is not in motion, that is, when it is in a static state, the image data processing device uses the historical dense semantic map as the first dense semantic map.
As a possible implementation, the image data to be processed in the embodiment of this application is rectified (upright) image data. Having the semantic segmentation module 202 rectify the image data to be processed before performing semantic segmentation on it, or use image data that is already rectified, relaxes the rotation-invariance requirement on the semantic segmentation model and improves the recognition rate.
It should be noted that if the first image data acquired by the semantic segmentation module 202 is not rectified, then as shown in FIG. 4, the method provided in this embodiment may further include, before step 301: step 305, the semantic segmentation module 202 acquires the first image data captured by the first device.
Optionally, the image data processing apparatus may control the first device to capture the first image data and send the captured first image data to the semantic segmentation module 202. The first image data may also be obtained by the semantic segmentation module 202 from a memory of the image data processing apparatus in which it was stored in advance, or the semantic segmentation module 202 may obtain, from another device (for example, a DSLR camera or a DV camcorder), the first image data captured by the first device.
Exemplarily, the first device may be a camera built into the image data processing apparatus, or a photographing device connected to the image data processing apparatus. Correspondingly, step 301 can be implemented by the following step 3011: step 3011, the semantic segmentation module 202 rectifies the first image data according to the first device pose of the first device corresponding to the first image data, obtaining the image data to be processed. It should be noted that in this embodiment each piece of image data may correspond to one device pose.
In this embodiment, if the semantic segmentation module 202 determines that the first image data is not rectified, it may rectify the first image data according to the device pose at which the first image data was captured. The semantic segmentation module 202 may determine autonomously that the first image data is not rectified; alternatively, when the image data processing apparatus receives a user-input operation instruction for the first image data indicating that it should be rectified, the apparatus thereby determines that the first image data is not rectified, and then rectifies it through the semantic segmentation module 202.
In this embodiment, the device pose corresponding to a piece of image data refers to the pose of the device that captured the image data at the moment of capture. The same device may correspond to different device poses at different times. It can be understood that if the first image data is already rectified, the step of rectifying the image to be processed can be omitted.
As shown in FIG. 5, part (a) of FIG. 5 shows the first image data acquired by the image data processing apparatus. As can be seen from part (a), the first image data is not rectified, so the image data processing apparatus can rectify it according to the device pose of the device that captured it; the rectified image data is shown in part (b) of FIG. 5.
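As an illustrative sketch (not the patent's rectification algorithm), an image can be set upright from the device pose by snapping the camera's roll angle to the nearest multiple of 90° and rotating the pixel array losslessly; the function name and the sign convention assumed here (np.rot90's counter-clockwise quarter-turns undo a clockwise roll) are assumptions:

```python
import numpy as np

def rectify_image(image: np.ndarray, roll_deg: float) -> np.ndarray:
    """Rotate `image` upright given the camera roll angle in degrees.

    The roll is snapped to the nearest multiple of 90 degrees and undone
    with np.rot90, which is lossless (no interpolation)."""
    quarter_turns = int(round(roll_deg / 90.0)) % 4  # clockwise quarter-turns of roll
    return np.rot90(image, k=quarter_turns)          # undo with counter-clockwise turns
```

For example, a frame captured with the device rolled by 90° is restored with one quarter-turn, while a roll of 360° leaves the image unchanged.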
As another possible embodiment, with reference to FIG. 4, step 302 of the method provided in this embodiment can be implemented through the following steps 3021 and 3022:
Step 3021: the semantic segmentation module 202 determines, for any pixel among at least some of the pixels, one or more plane categories corresponding to that pixel and the probability of each of those plane categories. As a possible implementation, step 3021 can be implemented as follows: the semantic segmentation module 202 performs semantic segmentation on the image data to be processed using a neural network, obtaining, for any pixel among at least some of the pixels, one or more corresponding plane categories and the probability of each. For the training and inference procedures of the neural network, reference may be made to the prior art; this embodiment does not limit them.
Step 3022: the semantic segmentation module 202 takes the plane category with the highest probability among the one or more plane categories corresponding to the pixel as the target plane category of that pixel, thereby obtaining the semantic segmentation result of the image data to be processed. That is, the probability of a pixel's target plane category is the largest among the probabilities of the one or more plane categories corresponding to that pixel.
It can be understood that in this embodiment any pixel may correspond to one or more plane categories, together with the probability of belonging to each of those plane categories. The probabilities of the one or more plane categories corresponding to a pixel sum to 1.
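Steps 3021 and 3022 amount to a per-pixel argmax over the network's class-probability output. A minimal sketch, assuming the segmentation network yields an H×W×C probability tensor with softmax already applied (the function name and array layout are illustrative, not from the patent):

```python
import numpy as np

def target_plane_categories(probs: np.ndarray, categories: list) -> np.ndarray:
    """probs: H x W x C array; probs[y, x] sums to 1 over the C plane categories.

    Returns an H x W array of category names: the per-pixel target plane
    category, i.e. the category with the highest probability (step 3022)."""
    assert np.allclose(probs.sum(axis=-1), 1.0), "per-pixel probabilities must sum to 1"
    idx = probs.argmax(axis=-1)                     # index of the most probable category
    return np.asarray(categories, dtype=object)[idx]
```

For a 1×2 image with categories ["ground", "table", "wall"] and probabilities [0.1, 0.7, 0.2] and [0.8, 0.1, 0.1], the result is "table" for the first pixel and "ground" for the second.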
After the semantic segmentation module 202 obtains the image data to be processed, semantic segmentation may be performed on it so that the module can identify the plane category to which each region of the image belongs. It can be understood that the purpose of semantic segmentation is precisely to assign a category label to each pixel of the image data to be processed.
The image data to be processed consists of many pixels, and semantic segmentation groups these pixels according to the semantic meaning they express in the image. That is, semantic segmentation divides the image data to be processed into regions with different semantics and labels the plane category to which each region belongs, such as car, tree, or face. Semantic segmentation combines the techniques of segmentation and object recognition, and can divide an image into regions with high-level semantic content. For example, through semantic segmentation an image can be divided into three regions with the distinct semantics "cow", "grass", and "sky". As shown in parts (a) and (b) of FIG. 6, part (a) shows image data to be processed as provided by an embodiment of this application, and part (b) is a schematic diagram of the same image data after semantic segmentation. From part (b) it can be seen that the image data has been divided into four regions with the distinct semantics "ground", "table", "wall", and "chair".
In this embodiment, the semantic segmentation module 202 may use a semantic segmentation model to determine the probability of the one or more plane categories to which each of the N pixels belongs. As a possible implementation, each pixel may correspond to one or more plane categories, and the probabilities of all plane categories corresponding to a pixel sum to 1. The probability of the target plane category of any one of the N pixels is the largest among the probabilities of the one or more plane categories corresponding to that pixel.
Taking part (a) of FIG. 6 as an example, the plane categories of the one or more planes included in the image data to be processed are ground, table, chair, wall, and so on; through step 302 the image data processing apparatus can then obtain the target plane categories of pixels 1 to 4, as shown in Table 1:
Table 1. Semantic segmentation results

| Pixel   | P(ground) | P(chair) | P(table) | P(wall) | Target plane category |
|---------|-----------|----------|----------|---------|-----------------------|
| Pixel 1 | 1%        | 98%      | 1%       | 0%      | Chair                 |
| Pixel 2 | 1%        | 88%      | 1%       | 10%     | Chair                 |
| Pixel 3 | 10%       | 20%      | 70%      | 0%      | Table                 |
| Pixel 4 | 98%       | 0.5%     | 1%       | 0.5%    | Ground                |
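As a quick check, the highest-probability rule of step 3022 reproduces the last column of Table 1 (probabilities copied from the table; the dictionary layout is purely illustrative):

```python
# Per-pixel plane-category probabilities from Table 1.
probs = {
    "pixel 1": {"ground": 0.01, "chair": 0.98, "table": 0.01, "wall": 0.00},
    "pixel 2": {"ground": 0.01, "chair": 0.88, "table": 0.01, "wall": 0.10},
    "pixel 3": {"ground": 0.10, "chair": 0.20, "table": 0.70, "wall": 0.00},
    "pixel 4": {"ground": 0.98, "chair": 0.005, "table": 0.01, "wall": 0.005},
}
# Step 3022: the target plane category is the one with the highest probability.
target = {pixel: max(cats, key=cats.get) for pixel, cats in probs.items()}
```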
As a possible implementation, the semantic segmentation model in this embodiment may use MobileNetV2 as the encoder network, or may be implemented with Mask R-CNN or similar. It should be understood that any other model capable of semantic segmentation may also be used to obtain the segmentation result. The embodiments of this application take semantic segmentation with MobileNetV2 as the encoder network as an example for description; this does not restrict the semantic segmentation approach, and will not be repeated below. Moreover, the MobileNetV2 model has the advantages of small size, high speed, and high accuracy, which meets the requirements of mobile-phone platforms and allows semantic segmentation to reach a frame rate above 5 fps.
By performing semantic segmentation on the image data to be processed, the probability of the plane category corresponding to each pixel of the image data in two-dimensional space can be obtained. Correspondingly, as shown in FIG. 4, step 302 of this embodiment can be implemented as follows: the semantic segmentation module 202 determines the semantic segmentation result of the image data to be processed according to the probabilities of the one or more plane categories corresponding to each of at least some of the N pixels. That is, the semantic segmentation module 202 takes, for each such pixel, the plane category with the highest probability as that pixel's target plane category, thereby obtaining the semantic segmentation result of the image data to be processed.
In a possible embodiment, to improve the accuracy of semantic segmentation, as shown in FIG. 4, the method provided in this embodiment may further include, after step 302 and before step 303: step 306, the semantic segmentation module 202 performs an optimization operation on the semantic segmentation result according to the image data to be processed and the depth information contained in the depth image corresponding to it; the optimization operation corrects noise in the segmentation result and errors introduced by the segmentation process. For example, a pixel A that actually belongs to the ground may lie close to a table in the image, and the segmentation result may assign pixel A the target plane category "table" although it should be "ground"; the optimization can then change pixel A's target plane category from table to ground. Or, if a pixel B was left unsegmented, the optimization operation can determine pixel B's target plane category. For the concrete algorithm of the optimization operation, reference may be made to the prior art; it is not detailed in this embodiment.
In this embodiment, the depth information includes the distance between each pixel and the device that captured the image data to be processed. The purpose of optimizing the semantic segmentation result is to refine and repair it: with the depth information the segmentation result can be filtered and corrected, avoiding mis-segmented and unsegmented regions in the result. For the detailed process of optimizing the semantic segmentation result, refer to the descriptions of FIG. 10 and FIG. 11 below, which are not repeated here.
As a possible implementation, the semantic map module 203 determining whether the current state of the image data processing apparatus is a motion state (i.e., step 303) can be implemented as follows: the semantic map module 203 obtains second image data captured by the camera, and determines whether the current state of the image data processing apparatus is a motion state according to the difference between the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data, together with the inter-frame difference between the second image data and the image data to be processed.
Specifically, as shown in FIG. 8, when the difference between the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data is less than or equal to a first threshold, and the inter-frame difference between the second image data and the image data to be processed is greater than a second threshold, the semantic map module 203 determines that the current state of the image data processing apparatus is a motion state. Here the second image data is adjacent to the image data to be processed and is its previous frame. Refer to FIG. 8 for the specific process.
In addition, as shown in FIG. 8, when the difference between the first device pose corresponding to the image data to be processed and the second device pose corresponding to the second image data is less than or equal to the first threshold, and the inter-frame difference between the second image data and the image data to be processed is less than or equal to the second threshold, the image data processing apparatus determines that its current state is a static state. When the current state is static, the image data processing apparatus can directly use the historical dense semantic map as the first dense semantic map and proceed with subsequent processing.
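The two decision branches described for FIG. 8 can be sketched as a single function; the parameter names, the units of the differences, and the "undetermined" fallback for the case not covered by this passage are assumptions:

```python
def current_state(pose_diff: float, frame_diff: float,
                  first_threshold: float, second_threshold: float) -> str:
    """Classify the device state per the conditions described for FIG. 8."""
    if pose_diff <= first_threshold and frame_diff > second_threshold:
        return "motion"      # build the first dense semantic map from the segmentation result
    if pose_diff <= first_threshold and frame_diff <= second_threshold:
        return "static"      # reuse the historical dense semantic map
    return "undetermined"    # case not covered by the cited passage
```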
The historical dense semantic map in this embodiment may be stored inside the image data processing apparatus, or may of course be obtained by the image data processing apparatus from another device; this embodiment does not limit this. The historical dense semantic map is the semantic map result generated and saved historically, and it is updated whenever a new frame of image data arrives. Optionally, the historical dense semantic map is the dense semantic map corresponding to the frame preceding the current frame's image data, or a fusion of the dense semantic maps corresponding to several preceding frames.
As a possible implementation, step 304 of this embodiment can be implemented as follows: the semantic map module 203 obtains a second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed, and directly uses the second dense semantic map as the first dense semantic map. That is, each time the second dense semantic map is computed, it is used directly in the subsequent computation.
In this embodiment, the depth image corresponding to the image data to be processed is an image of the same size as the image data to be processed whose element values are the depth values of the scene points corresponding to the image points in the image data to be processed. Specifically, the image data to be processed is acquired by the image acquisition device shown in FIG. 2, and the corresponding depth image is acquired by the TOF sensor shown in the same figure.
In the embodiments of this application, depth information may be acquired by a TOF camera, structured light, laser scanning, or other means, thereby obtaining the depth image. It should be understood that any other means (or camera) capable of obtaining a depth image may also be used. In the following, acquiring the depth image with a TOF camera is taken as an example; this does not restrict the way depth images are obtained, and will not be repeated below.
It should be noted that although a point cloud is a three-dimensional concept while the pixels of a depth image are two-dimensional, when the depth value of a point in the two-dimensional image is known, the image coordinates of that point can be converted into world coordinates in three-dimensional space; thus a point cloud in three-dimensional space can be recovered from the depth image. For example, the principles of projective geometry can be used to convert image coordinates into world coordinates. According to these principles, the process of mapping a three-dimensional point M(Xw, Yw, Zw) in the world coordinate system to a point m(u, v) on the image is shown in FIG. 7; in FIG. 7 the dashed Xc axis is obtained by translating the solid Xc axis, and the dashed Yc axis by translating the solid Yc axis.
FIG. 7 satisfies the following mathematical relationship:

$$ Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} $$

where u and v are arbitrary coordinates in the image coordinate system; f is the focal length of the camera; d_x and d_y are the pixel sizes in the x and y directions, respectively; u_0 and v_0 are the center coordinates of the image; X_w, Y_w, Z_w are the coordinates of the three-dimensional point in the world coordinate system; Z_c is the Z-axis value in camera coordinates, i.e., the distance from the target to the camera; and R and T are the 3×3 rotation matrix and the 3×1 translation vector of the extrinsic matrix, respectively.

First, the depth map can be restored to a point cloud referenced to the camera coordinate system, i.e., the rotation matrix R is taken as the identity matrix and the translation vector T as 0, which gives:

$$ Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} $$

where X_c, Y_c, Z_c are the coordinates of the three-dimensional point in the camera coordinate system.

From the above formula it can be derived that:

$$ X_c = \frac{(u - u_0)\, Z_c\, d_x}{f}, \qquad Y_c = \frac{(v - v_0)\, Z_c\, d_y}{f} $$

Z_c is the value on the depth map; the depth currently obtained by TOF is in millimeters (mm). The coordinates of the three-dimensional point in the camera coordinate system can thus be computed, and then, using the device pose R and T computed by the SLAM module, the point cloud data can be converted into the world coordinate system. Specifically, with the pose convention P_c = R·P_w + T:

$$ \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} = R^{-1} \left( \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} - T \right) $$

When the device pose computed by the SLAM module and the depth data obtained by TOF are accurate, a good point-cloud registration result is obtained. The three-dimensional points of this embodiment are three-dimensional pixel points, i.e., the two-dimensional pixels involved in steps 301 and 302 converted into three dimensions.
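The back-projection described above can be sketched as follows, writing the intrinsics as fx = f/dx and fy = f/dy and assuming the pose convention Pc = R·Pw + T for the SLAM-estimated R and T (the function name is illustrative):

```python
import numpy as np

def depth_pixel_to_world(u, v, depth_mm, fx, fy, u0, v0, R, T):
    """Back-project an image pixel (u, v) with TOF depth (millimeters) to a
    world-coordinate point.

    First recover the camera coordinates (Xc, Yc, Zc), then invert the pose
    Pc = R @ Pw + T computed by the SLAM module."""
    Zc = float(depth_mm)               # depth map value is Zc directly
    Xc = (u - u0) * Zc / fx            # Xc = (u - u0) * Zc * dx / f
    Yc = (v - v0) * Zc / fy            # Yc = (v - v0) * Zc * dy / f
    Pc = np.array([Xc, Yc, Zc])
    return np.linalg.inv(R) @ (Pc - T)  # world coordinates (Xw, Yw, Zw)
```

With the identity pose, the principal-point pixel maps straight ahead to (0, 0, depth), and a pixel 100 columns to its right, with fx = 500 and depth 1000 mm, maps to Xc = 200 mm.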
As another possible implementation, step 304 of this embodiment can be implemented as follows: the semantic map module 203 obtains the second dense semantic map according to the semantic segmentation result and the depth image corresponding to the image data to be processed (for how to combine multiple two-dimensional pixels with the depth image to obtain the multiple three-dimensional points of the second dense semantic map, reference may be made to the prior art). The semantic map module 203 then updates the historical dense semantic map with one or more second three-dimensional points of the second three-dimensional point cloud in the second dense semantic map, obtaining the first dense semantic map. Unlike directly using the second dense semantic map as the first dense semantic map, only a subset of all the three-dimensional points in the second three-dimensional point cloud may be used for the update. Therefore, the update need not cover all three-dimensional points of the second dense semantic map; it merely replaces the target-plane-category probabilities of the corresponding three-dimensional points in the historical dense semantic map with the target-plane-category probabilities of some three-dimensional points of the second dense semantic map. The update can thus be a partial update of the dense semantic map, rather than directly using the second dense semantic map as the first dense semantic map.
Specifically, the semantic map module 203 uses one or more second three-dimensional points of the second three-dimensional point cloud in the second dense semantic map to update the probabilities of the corresponding three-dimensional points in the historical dense semantic map, that is, the probabilities of the target plane categories of those three-dimensional points, obtaining the first dense semantic map. It should be understood that this update means replacing, for a three-dimensional point A, the probability of the target plane category that point A already has in the historical dense semantic map with the probability of point A's target plane category in the second dense semantic map.
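A minimal sketch of this partial update, assuming the dense semantic map is stored as a dictionary keyed by a (voxelized) 3-D point with a (target plane category, probability) value; the keying scheme and function name are assumptions, not from the patent:

```python
def update_dense_map(historical, second_map, points_to_update):
    """Replace, for the selected 3-D points only, the (category, probability)
    stored in the historical dense semantic map with the values from the
    second dense semantic map; all other points keep their historical values."""
    first_map = dict(historical)           # start from the historical map
    for p in points_to_update:             # a subset of the second point cloud
        if p in second_map:
            first_map[p] = second_map[p]   # (target plane category, probability)
    return first_map
```

Passing every point of the second map in `points_to_update` degenerates to the other implementation, where the second dense semantic map is used directly.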
As a possible implementation, as shown in FIG. 4, step 304 of this embodiment can be concretely implemented as follows: step 3041, the semantic clustering module 204 determines the plane equation of each of the one or more planes. For example, the semantic clustering module 204 performs plane fitting on the three-dimensional point cloud data of each pixel to obtain the plane equation.
Specifically, the semantic clustering module 204 may use the RANSAC method, or the SVD equation-solving method, to perform plane fitting on the three-dimensional point cloud data of each pixel and obtain the plane equation.
It can be understood that once the image data processing apparatus in this embodiment has obtained the plane equation of each plane, it can determine the area and the orientation of each plane. Taking the plane equation AX + BY + CZ + D = 0 as an example, the normal vector of the plane is n = (A, B, C). The normal vector indicates the orientation of the plane. In this embodiment, the "orientation" of a plane may also be expressed as the direction of the plane.
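The SVD variant of the plane fitting mentioned above can be sketched as follows (a least-squares fit, with illustrative function name); it returns the coefficients (A, B, C, D) of AX + BY + CZ + D = 0, whose normal n = (A, B, C) gives the plane's orientation:

```python
import numpy as np

def fit_plane_svd(points: np.ndarray):
    """Fit AX + BY + CZ + D = 0 to an (N, 3) point cloud by SVD.

    The unit normal is the right-singular vector of the centered points with
    the smallest singular value (the direction of least variance)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                 # n = (A, B, C), up to sign
    d = -normal @ centroid          # D such that the plane passes through the centroid
    return normal[0], normal[1], normal[2], d
```

For points lying exactly on the plane z = 2, the fit recovers a normal of ±(0, 0, 1) and every point satisfies the returned equation.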
The semantic clustering module 204 performs the following steps 3042 and 3043 on any one of the one or more planes to obtain the plane semantic categories of the one or more planes. Step 3042: the semantic clustering module 204 determines, according to the plane equation of the plane and the first dense semantic map, the one or more target plane categories corresponding to the plane and the confidence of each of those target plane categories.
In a possible implementation, step 3042 of this embodiment can be implemented as follows: the semantic clustering module 204 determines, from the first dense semantic map according to the plane equation of the plane, M first three-dimensional points whose distance to the plane is less than a third threshold, M being a positive integer. The semantic clustering module 204 takes the one or more target plane categories corresponding to the M first three-dimensional points as the one or more target plane categories corresponding to the plane, the orientation of the one or more target plane categories being consistent with the orientation of the plane, and counts, for each of those target plane categories, the proportion of the M first three-dimensional points carrying that category, obtaining the confidence of each of the one or more target plane categories. This embodiment does not limit the specific value of the third threshold; it can be set as needed in practice.
In this embodiment, the M first three-dimensional points determined from the first dense semantic map can be regarded as the three-dimensional points belonging to the plane. The plane category to which each of the M first three-dimensional points belongs can be determined, and different points may or may not share a plane category; for example, three-dimensional point A among the M points may belong to the category "ground" while three-dimensional point B belongs to the category "table". The one or more plane categories corresponding to the M first three-dimensional points can therefore be obtained from the plane category of each of those points. Since the M first three-dimensional points determined from the first dense semantic map are regarded as belonging to the plane, the plane can be determined to correspond to the same one or more plane categories. The plane category of each of the M first three-dimensional points may be the target plane category of the two-dimensional pixel corresponding to that three-dimensional point, as mentioned in the previous embodiments. For example, step 3022 can be used to obtain the target plane category of each pixel and take it as the plane category of the corresponding three-dimensional point, thereby obtaining the one or more target plane categories corresponding to the M first three-dimensional points.
For example, suppose that among the M first three-dimensional points, N1 points belong to the category "floor" (that is, the number of points whose category is "floor" is N1), N2 points belong to the category "table" (that is, the number of points whose category is "table" is N2), and N3 points belong to the category "wall" (that is, the number of points whose category is "wall" is N3), where N1+N2+N3 is less than or equal to M and N1, N2, and N3 are positive integers. Then the proportion of "floor" points among the M first three-dimensional points is N1/M, the proportion of "table" points is N2/M, and the proportion of "wall" points is N3/M. The confidences of the one or more plane categories of the plane are therefore N1/M, N2/M, and N3/M. If N2/M > N1/M and N2/M > N3/M, the semantic plane category of the plane is "table".
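As an illustration of the counting described above, the following sketch (with hypothetical helper names, not part of the patent) derives the per-category confidences from the labels of the M near-plane points and picks the winner as in step 3043:

```python
from collections import Counter

def category_confidences(point_labels):
    """Per-category confidence = (points with that label) / M.

    point_labels: plane-category labels of the M first 3D points near the
    plane, e.g. ["floor", "table", ...].
    Returns (confidences dict, category with the highest confidence).
    """
    m = len(point_labels)
    counts = Counter(point_labels)
    conf = {cat: cnt / m for cat, cnt in counts.items()}
    best = max(conf, key=conf.get)  # step 3043: highest confidence wins
    return conf, best
```

With N1 = 2 "floor" points, N2 = 3 "table" points, and N3 = 1 "wall" point (M = 6), the confidences are 1/3, 1/2, and 1/6, and "table" is selected.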
Step 3043: The semantic clustering module 204 selects, from the one or more target plane categories, the target plane category with the highest confidence as the semantic plane category of the plane.
For example, if the confidence that plane A corresponds to the floor is P1, the confidence that plane A corresponds to a table is P2, the confidence that plane A corresponds to a wall is P3, and P1 > P2 > P3, the semantic clustering module 204 can determine that the semantic plane category of plane A is "floor".
A plane may correspond to one or more target plane categories, but not all of the target plane categories corresponding to the plane necessarily have the same orientation as the plane. That is, a plane may correspond both to target plane categories whose orientation is consistent with the plane and to target plane categories whose orientation is not, and a category whose orientation is inconsistent with the plane is less likely to be the semantic plane category of the plane than one whose orientation is consistent. Based on this, to simplify the subsequent computation and reduce computational error, in a possible implementation of this embodiment the orientation of the one or more target plane categories corresponding to a plane is consistent with the orientation of that plane. In other words, the one or more target plane categories are the plane categories selected by the image data processing apparatus, from all target plane categories corresponding to the plane, whose orientation is consistent with the orientation of the plane. The one or more target plane categories may be all or only some of the target plane categories corresponding to the plane, which is not limited in this embodiment. All target plane categories corresponding to a plane in this embodiment can be regarded as all target plane categories corresponding to the M first three-dimensional points.
For example, suppose plane a faces downward, the category "floor" faces upward, the category "table" faces downward, and the category "ceiling" faces downward. Then, when computing the confidences of the one or more plane categories to which plane a may belong, the confidence that plane a belongs to the category "floor" can be excluded. This not only reduces the computational burden of the image data processing apparatus but also improves computational accuracy.
In a possible implementation, after the semantic clustering module 204 counts the proportion of the three-dimensional points of each of the one or more target plane categories among the M first three-dimensional points to obtain the confidences of the one or more target plane categories, the method provided in this embodiment further includes: the semantic clustering module 204 updates the confidences of the one or more target plane categories corresponding to the plane according to at least one of Bayes' theorem or a voting mechanism.
Specifically, the semantic clustering module 204 performs plane fitting on the three-dimensional point cloud data to obtain a plane equation of the form AX+BY+CZ+D=0, where A, B, C, and D are the plane equation parameters to be solved; the optimal parameters are solved from multiple points, and the specific fitting scheme can follow the prior art. The outermost points of the point set participating in the computation serve as the boundary points of the plane. The normal vector of the plane, n=(A, B, C), can be used as the direction vector of the plane, and the area of the plane is defined as the area of the minimum bounding rectangle of the plane's boundary points.
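As an illustration of the fitting described above, a minimal SVD-based least-squares plane fit under the stated plane model can be written as follows; this is a generic sketch, not the patent's specific fitting scheme:

```python
import numpy as np

def fit_plane(points):
    """Least-squares fit of AX + BY + CZ + D = 0 via SVD.

    points: (N, 3) array of 3D points, N >= 3 and not all collinear.
    Returns (A, B, C, D) with (A, B, C) a unit normal vector.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The plane normal is the right singular vector of the centered points
    # with the smallest singular value.
    _, _, vt = np.linalg.svd(pts - centroid)
    normal = vt[-1]                # unit normal (A, B, C)
    d = -normal.dot(centroid)      # D so the plane passes through the centroid
    return (normal[0], normal[1], normal[2], d)

def point_plane_distance(point, plane):
    """Perpendicular distance from a 3D point to the fitted plane."""
    a, b, c, d = plane
    n = np.array([a, b, c])
    return abs(n.dot(point) + d) / np.linalg.norm(n)
```

The returned distance function is what steps 3042 and 905 use to select the M near-plane points against the third threshold.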
Then, based on the plane equation, orientation, and area of the detected plane, the semantic clustering module 204 counts and filters, from the first dense semantic map, the M first three-dimensional points whose distance to the plane is less than the third threshold; the M first three-dimensional points correspond to one or more target plane categories. The semantic clustering module 204 normalizes the number of three-dimensional points of each of the one or more target plane categories to obtain the confidence of that category, that is, it counts the proportion of the three-dimensional points of each target plane category among all of the M first three-dimensional points. The confidences are then updated against the previously recorded confidences of the categories based on Bayes' theorem and a voting mechanism, and the category with the highest current confidence is selected as the semantic plane category, which enhances the accuracy and stability of plane semantic recognition.
Specifically, the semantic clustering module 204 uses Bayes' theorem and the voting mechanism to aggregate the confidences, computed before the current moment, that a plane belongs to multiple plane categories, and then revises and updates the confidences computed at the current moment according to the aggregated confidences.
For example, let the maximum number of votes under the voting mechanism be MAX_VOTE_COUNT, with an initial vote count of 0. If the plane category of a three-dimensional point C in the current frame is consistent with the plane category of the same point C in the frame before the current frame, the vote count of point C is increased by 1, and the plane category probability prob of point C is updated so that its value slides between the mean and the maximum of the two, for example according to the formula shown in image PCTCN2020074040-appb-000005, where prob_c represents the probability distribution of the plane category of point C in the current frame, prob_p represents the probability distribution of the plane category of point C in the frame before the current frame, and alpha = vote/MAX_VOTE_COUNT.
If the plane category of a three-dimensional point C in the current frame is inconsistent with the plane category of the same point C in the frame before the current frame, the vote count is decreased by 1 and the plane category probability prob is updated to 80% of its value.
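The voting update can be sketched as below. The patent gives the exact interpolation only as a formula image, so this sketch assumes that "sliding between the mean and the maximum" is a linear blend controlled by alpha = vote/MAX_VOTE_COUNT; the function name and signature are illustrative:

```python
def update_prob(prob_c, prob_p, vote, max_vote_count, same_category):
    """Voting-based temporal smoothing of a per-point category probability.

    prob_c: category probability from the current frame;
    prob_p: category probability from the previous frame.
    Returns (updated probability, updated vote count).
    Assumption: the blend between mean and max is linear in alpha.
    """
    if same_category:
        vote = min(vote + 1, max_vote_count)
        alpha = vote / max_vote_count
        mean = (prob_c + prob_p) / 2.0
        # As alpha grows, prob slides from the mean toward the maximum.
        prob = (1.0 - alpha) * mean + alpha * max(prob_c, prob_p)
    else:
        vote = max(vote - 1, 0)
        prob = 0.8 * prob_c  # categories disagree: keep 80% of the value
    return prob, vote
```

With prob_c = 0.6, prob_p = 0.8, vote = 0, and MAX_VOTE_COUNT = 4, a consistent observation yields alpha = 0.25 and prob = 0.75·0.7 + 0.25·0.8 = 0.725.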
Specifically, step 304 can be implemented as described in FIG. 9. Step 901: The semantic clustering module 204 performs the plane detection step to obtain the one or more planes included in the image data to be processed. Since the semantic clustering module 204 computes the semantic plane category of each of the one or more planes in the same manner and on the same principle, the following steps take the process of computing the semantic plane category of a first plane as an example; the example is not limiting.
Step 902: The semantic clustering module 204 obtains the plane equation of the first plane. Step 903: The semantic clustering module 204 calculates the area of the first plane. Step 904: The semantic clustering module 204 calculates the orientation of the first plane. For the specific implementation of steps 903 and 904, reference may be made to prior-art processes for calculating the area and orientation of a plane, which are not repeated here. Step 905: The semantic clustering module 204 counts, among the three-dimensional points of the various plane categories in the first dense semantic map, the M three-dimensional points whose distance to the first plane is less than the third threshold. Step 906: The semantic clustering module 204 determines whether the orientation of each of the one or more target plane categories corresponding to the M three-dimensional points is consistent with the orientation of the first plane.
Step 907: If the orientations of the various target plane categories are consistent with the orientation of the first plane, the semantic clustering module 204 determines, according to the area of the first plane, whether the number of three-dimensional points per unit area of each target plane category meets a threshold.
Step 908: If, according to the plane area, the number of three-dimensional points per unit area of each target plane category meets the threshold, the semantic clustering module 204 regularizes the number of three-dimensional points of each target plane category, that is, it computes the proportion of the total number of three-dimensional points of each target plane category among the M first three-dimensional points, to obtain the confidences that the first plane belongs to the one or more target plane categories. Step 909: The semantic clustering module 204 performs a Bayesian probability update between the currently computed confidences that the first plane belongs to the one or more target plane categories and the previously computed confidences that the first plane belongs to the various target plane categories. Step 910: The semantic clustering module 204 takes the target plane category with the highest current confidence as the plane category of the first plane.
It should be noted that if the orientation of each target plane category is inconsistent with the orientation of the first plane, the semantic clustering module 204 determines that the flow stops. In addition, if the semantic clustering module 204 determines, according to the area of the first plane, that the number of three-dimensional points per unit area of the various target plane categories does not meet the threshold, the image data processing apparatus determines that the flow stops.
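Steps 905 to 910 can be sketched end to end as follows. The orientation table and the simplified confidence normalization (by the number of near-plane points, omitting the per-unit-area check of step 907 and the Bayesian update of step 909) are assumptions for illustration only:

```python
import numpy as np
from collections import Counter

# Assumed orientation convention per category (illustrative, not from the patent).
CATEGORY_ORIENTATION = {"floor": "up", "table": "up", "ceiling": "down"}

def classify_plane(plane, plane_orientation, points, point_labels, dist_thresh):
    """Sketch of steps 905-910 for one detected plane.

    plane: (A, B, C, D) of AX + BY + CZ + D = 0; points: (N, 3) map points;
    point_labels: per-point plane-category strings; dist_thresh: the third
    threshold. Returns (best_category, confidences), or (None, {}) when the
    flow stops because no orientation-consistent category remains.
    """
    pts = np.asarray(points, dtype=float)
    n = np.asarray(plane[:3], dtype=float)
    # Step 905: keep the M points whose distance to the plane is below the threshold.
    dist = np.abs(pts @ n + plane[3]) / np.linalg.norm(n)
    near_labels = [l for l, keep in zip(point_labels, dist < dist_thresh) if keep]
    m = len(near_labels)
    if m == 0:
        return None, {}
    # Steps 906/908: keep orientation-consistent categories; confidence = count / M.
    conf = {cat: cnt / m for cat, cnt in Counter(near_labels).items()
            if CATEGORY_ORIENTATION.get(cat) == plane_orientation}
    if not conf:
        return None, {}
    # Step 910: the category with the highest current confidence wins.
    return max(conf, key=conf.get), conf
```

For an upward-facing plane near z = 0, nearby "floor" points vote for the plane while "ceiling" points are excluded by the orientation check of step 906.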
In this embodiment of the application, the specific steps by which the semantic segmentation module 202 performs an optimization operation on the semantic segmentation result according to the image data to be processed and its depth information include the random sample consensus (RANSAC) ground equation estimation process described in FIG. 10 and the semantic seed point region growing process shown in FIG. 11.
(1) RANSAC ground equation estimation
The floor, as an important component of a scene, has the following characteristics: the floor appears as a plane with a very large area; the floor is an important reference for SLAM initialization; the floor is easier to detect and recognize than other semantic targets; most objects in the scene are located on the floor; and the heights of objects in the scene are mostly measured relative to the floor. It is therefore very necessary to segment the floor first and obtain its plane equation.
The RANSAC algorithm, also known as random sample consensus estimation, is a robust estimation method well suited to estimating large planes such as the floor. Here, relying on the semantic segmentation result of the deep neural network, the floor semantic pixels are extracted, that is, the FLOOR pixels (FLOOR three-dimensional points) are extracted from the multiple three-dimensional points and the point cloud data composed of their depth information is obtained, to implement RANSAC-based ground equation estimation. The specific steps are shown in FIG. 10:
As a possible implementation, in this embodiment of the application, when the plane category is floor, the ground equation can also be estimated using an AI-based method.
Step 1011: The semantic segmentation module 202 obtains the P three-dimensional points of the ground by performing semantic segmentation processing on the ground. The number of iterations of the RANSAC algorithm is M. If M > 0, the image data processing apparatus randomly selects l (for example, l = 3) three-dimensional points from the P three-dimensional points as sampling points; otherwise, the flow jumps to step 1016.
Step 1012: The semantic segmentation module 202 substitutes the three-dimensional coordinates of the l three-dimensional points into the plane equation Ax+By+Cz=1, and solves for the plane equation parameters n=[A B C] using singular value decomposition (SVD).
Step 1013: The semantic segmentation module 202 substitutes the three-dimensional coordinates q of each of the P three-dimensional points into the estimated plane equation and computes the scalar distance d from the point to the plane. If d is less than a preset threshold η, the point is regarded as an inlier, and the number k of inliers is counted, where, consistently with the plane equation n·q = 1, d = |n·q − 1|/‖n‖ (formula image PCTCN2020074040-appb-000006).
Step 1014: The semantic segmentation module 202 compares the inlier count k of this iteration with the optimal inlier count K. If k < K, the semantic segmentation module 202 decreases the number of RANSAC iterations M by 1 and jumps back to step 1011; otherwise, it continues downward.
Step 1015: The semantic segmentation module 202 assigns the inlier count k of this iteration to the optimal inlier count K, saves the indices of the optimal inliers, computes the inlier percentage v = K/P, and modifies the number of iterations M according to the standard RANSAC formula M = log(1 − w)/log(1 − v^n) (formula image PCTCN2020074040-appb-000007), where w = 0.99 and n = 3.
Step 1016: The semantic segmentation module 202 re-estimates the plane equation using the K optimal inliers, that is, it establishes an overdetermined system composed of K equations and solves for the globally optimal plane equation using SVD.
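The RANSAC procedure of steps 1011 to 1016 can be sketched as below; numpy's least-squares solver stands in for the SVD solve, and the iteration-count update uses the standard RANSAC formula with w = 0.99. Parameter names and the fixed random seed are illustrative:

```python
import numpy as np

def ransac_ground_plane(points, eta=0.02, w=0.99, max_iter=200, rng=None):
    """Sketch of the FIG. 10 RANSAC ground-equation estimation.

    points: (P, 3) FLOOR points; eta: inlier distance threshold;
    w: desired confidence (0.99 in the patent). Returns n = [A, B, C]
    for the plane Ax + By + Cz = 1 refit on the best inlier set.
    """
    rng = rng or np.random.default_rng(0)
    pts = np.asarray(points, dtype=float)
    p = len(pts)
    best_inliers = np.zeros(p, dtype=bool)
    m, n_sample = max_iter, 3
    while m > 0:
        sample = pts[rng.choice(p, n_sample, replace=False)]
        # Step 1012: solve sample @ n = 1 by least squares (SVD-based).
        n, *_ = np.linalg.lstsq(sample, np.ones(n_sample), rcond=None)
        # Step 1013: scalar distance d = |n . q - 1| / ||n||.
        d = np.abs(pts @ n - 1.0) / np.linalg.norm(n)
        inliers = d < eta
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
            v = best_inliers.sum() / p
            # Step 1015: standard RANSAC iteration-count update.
            if 0 < v < 1:
                m = min(m, int(np.ceil(np.log(1 - w) / np.log(1 - v ** n_sample))))
        m -= 1
    # Step 1016: refit on all optimal inliers (overdetermined system).
    n, *_ = np.linalg.lstsq(pts[best_inliers], np.ones(best_inliers.sum()), rcond=None)
    return n
```

For floor points lying on z = 1 with one outlier, the recovered parameters are close to n = [0, 0, 1].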
(2) Semantic seed point region growing
To address under-segmentation and over-segmentation in the neural-network semantic segmentation result, region growing is performed on semantic seeds in combination with depth information, to enlarge the segmented regions and correct the segmentation result. Here, the number of pixels of each semantic segmentation category is used as the indicator of region-growing priority, so that categories with more pixels are grown first; the floor, however, has the highest priority, that is, region growing is performed on the floor before the other categories.
The region growing algorithm relies on the degree of similarity between a seed point and its neighboring points: neighboring points with high similarity are merged and growth continues outward until no neighboring point satisfying the similarity condition remains to be merged. Here, a typical 8-neighborhood is chosen for region growing, and the similarity condition is expressed using both depth and color information, so that under-segmented regions can be better corrected. The so-called seed points are the initial points of region growing, and the region grows outward using a method similar to breadth-first search (BFS). The specific steps are shown in FIG. 11:
Step 1101: The semantic segmentation module 202 traverses the priority list of semantic segmentation categories and pushes the seed points of the highest-priority plane category onto the seed point stack first for region growing. Exemplarily, let the seed point stack of the currently pushed category be as shown in image PCTCN2020074040-appb-000008, that is, it contains K seed points, where (i, j) are the coordinates of the two-dimensional pixel corresponding to each seed point. The so-called priority list is built from the statistics of the segmentation result, ordered from the category with the most pixels to the category with the fewest.
Step 1102: If the seed point stack is not empty, the semantic segmentation module 202 pops the last seed point s_K(i, j) off the stack and deletes it from the stack, and determines whether the category of its neighboring point p(i+m, j+n) is OTHER. If so, it continues downward; otherwise it jumps to step 1101.
Step 1103: The semantic segmentation module 202 compares the similarity distance d between the seed point s_K and the neighboring point p. If the similarity distance d is less than a given threshold η, it continues downward; otherwise it jumps to step 1101. The similarity distance d, which combines the depth and color differences between the two points, is given by the formula in image PCTCN2020074040-appb-000009.
Step 1104: The semantic segmentation module 202 pushes the neighboring point p that satisfies the similarity condition onto the seed point stack (image PCTCN2020074040-appb-000010) and jumps back to step 1101. A semantic map is then built according to the method described above, and plane detection and recognition are completed, yielding a stable and accurate plane semantic result, as shown in FIG. 12.
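The stack-driven growth of steps 1101 to 1104 can be sketched as follows. The similarity distance here is a simple sum of absolute depth and color differences, an assumption made because the patent's exact expression is given only as a formula image:

```python
def region_grow(labels, depth, color, seeds, category, eta):
    """Sketch of the FIG. 11 seed-point region growing over an 8-neighborhood.

    labels: 2D grid of category strings ("OTHER" = not yet assigned);
    depth, color: 2D grids of per-pixel values; seeds: (i, j) seed pixels of
    `category`; eta: similarity threshold. Assumption: the similarity
    distance is |depth difference| + |color difference|.
    """
    h, w = len(labels), len(labels[0])
    stack = list(seeds)  # step 1102 pops the last seed, i.e. stack order
    neighborhood = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                    (0, 1), (1, -1), (1, 0), (1, 1)]
    while stack:
        i, j = stack.pop()
        for di, dj in neighborhood:
            ni, nj = i + di, j + dj
            if not (0 <= ni < h and 0 <= nj < w):
                continue
            if labels[ni][nj] != "OTHER":  # only grow into unassigned pixels
                continue
            d = abs(depth[ni][nj] - depth[i][j]) + abs(color[ni][nj] - color[i][j])
            if d < eta:                    # step 1103: similarity condition
                labels[ni][nj] = category  # merge the neighbor
                stack.append((ni, nj))     # step 1104: push onto the stack
    return labels
```

Starting from a single floor seed, growth spreads across pixels of similar depth and color and halts at a depth discontinuity, which is how under-segmented floor regions get filled in.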
The foregoing mainly introduces the solutions of the embodiments of the present application from the perspective of the image data processing apparatus. It can be understood that, to implement the foregoing functions, the image data processing apparatus includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should easily realize that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
In the embodiments of the present application, the image data processing apparatus may be divided into functional units according to the foregoing method examples; for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of the present application is illustrative and is merely a division by logical function; there may be other division manners in actual implementation.
In the case where each functional module is divided corresponding to each function, FIG. 2 shows a possible schematic structural diagram of the image data processing apparatus involved in the foregoing embodiments. The image data processing apparatus includes: a semantic segmentation module 202, a semantic map module 203, and a semantic clustering module 204. The semantic segmentation module 202 is configured to support the image data processing apparatus in performing steps 301 and 302 in the foregoing embodiments. The semantic map module 203 is configured to support the image data processing apparatus in performing step 303 in the foregoing embodiments. The semantic clustering module 204 is configured to support the image data processing apparatus in performing step 304 in the foregoing embodiments.
In a possible embodiment, the semantic segmentation module 202 is further configured to support the image data processing apparatus in performing step 305 in the foregoing embodiments, as well as step 3011, step 306, step 3021, and step 3022. In a possible embodiment, the semantic clustering module 204 is configured to support the image data processing apparatus in performing step 3041, step 3042, and step 3043 in the foregoing embodiments. In addition, the semantic clustering module 204 is further configured to support the image data processing apparatus in performing steps 901 to 910 in the foregoing embodiments. The apparatus may be implemented in the form of software and stored in a storage medium.
The foregoing describes an image data processing apparatus in the embodiments of the present application from the perspective of modular functional entities; the following describes an image data processing apparatus in the embodiments of the present application from the perspective of hardware processing. As shown in FIG. 13, FIG. 13 shows a possible schematic diagram of the hardware structure of the image data processing apparatus involved in the foregoing embodiments. The image data processing apparatus includes: a first processor 1301 and a second processor 1302. Optionally, the image data processing apparatus may further include a communication interface 1303, a memory 1304, and a bus 1305. The communication interface 1303 may include an input interface 13031 and an output interface 13032. Correspondingly, when the image data processing apparatus is an electronic device, the first processor 1301 and the second processor 1302 may be the processor 120 shown in FIG. 1. For example, the first processor 1301 may be a DSP or a CPU, and the second processor 1302 may be an NPU. The communication interface 1303 may be the input device 140 in FIG. 1. The memory 1304 is configured to store program code and data of the image data processing apparatus, and corresponds to the memory 130 in FIG. 1. The bus 1305 may be built into the processor 120 shown in FIG. 1.
In this case, the first processor 1301 and the second processor 1302 are configured to perform part of the functions in the foregoing image data processing method. For example, the first processor 1301 is configured to support the image data processing apparatus in performing step 301 of the foregoing embodiments; the second processor 1302 is configured to support the apparatus in performing step 302; and the first processor 1301 is configured to support the apparatus in performing steps 303 and 304 of the foregoing embodiments.
In a possible embodiment, the first processor 1301 is further configured to support the image data processing apparatus in performing step 305, step 3011, step 3041, step 3042, and step 3043 in the foregoing embodiments. The second processor 1302 is further configured to support the apparatus in performing step 306, step 3021, and step 3022 in the foregoing embodiments. Optionally, the first processor 1301 is further configured to support the apparatus in performing steps 901 to 910 in the foregoing embodiments.
In some feasible embodiments, the first processor 1301 or the second processor 1302 may have a single-processor or multi-processor structure, or be a single-threaded or multi-threaded processor. In some feasible embodiments, the first processor 1301 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The second processor 1302 may be a neural network processor, which may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. A processor may also be a combination that implements a computing function, for example, a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
Output interface 13032: this output interface is used to output the processing result of the above image data processing method. In some feasible embodiments, the processing result may be output directly by the processor, or may first be stored in the memory and then output from the memory. In some feasible embodiments, there may be only one output interface, or there may be multiple output interfaces. In some feasible embodiments, the processing result produced at the output interface may be sent to the memory for storage, passed to another processing flow for further processing, sent to a display device for display, or sent to a player terminal for playback.
Memory 1301: the memory 1301 may store the aforementioned image data to be processed and the instructions for configuring the first processor or the second processor. In some feasible embodiments, there may be one memory or multiple memories. The memory may be a floppy disk; a hard disk, such as a built-in hard disk or a removable hard disk; a magnetic disk; an optical disc; a magneto-optical disc, such as a CD-ROM or DVD-ROM; a non-volatile storage device, such as RAM, ROM, PROM, EPROM, EEPROM, or flash memory; or any other form of storage medium known in the art.
Bus 1304: the bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, the bus is represented by only one thick line in FIG. 13, but this does not mean that there is only one bus or only one type of bus.
The components of the image data processing apparatus provided in the embodiments of this application are respectively used to implement the functions of the corresponding steps of the foregoing image data processing method. Since each step has already been described in detail in the foregoing method embodiments, the details are not repeated here.
An embodiment of this application further provides a computer-readable storage medium storing instructions. When the instructions run on a device (for example, a single-chip microcomputer, a chip, or a computer), the device is caused to execute one or more of steps 301 to 3011 of the above image data processing method. If the component modules of the above image data processing apparatus are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium.
Based on this understanding, an embodiment of this application further provides a computer program product containing instructions. The technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor therein to execute all or some of the steps of the methods described in the embodiments of this application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are executed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium, such as a floppy disk, a hard disk, or a magnetic tape; an optical medium, such as a digital video disc (DVD); or a semiconductor medium, such as a solid state drive (SSD).
Although this application is described herein in conjunction with various embodiments, in the course of implementing the claimed application, those skilled in the art can, by studying the drawings, the disclosure, and the appended claims, understand and achieve other variations of the disclosed embodiments. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill several functions recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although this application is described in conjunction with specific features and embodiments thereof, it is evident that various modifications and combinations can be made without departing from the spirit and scope of this application. Accordingly, the specification and drawings are merely illustrative of this application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Evidently, those skilled in the art can make various changes and modifications to this application without departing from its spirit and scope. This application is thus also intended to cover such changes and modifications, provided that they fall within the scope of the claims of this application and their equivalent technologies.

Claims (27)

  1. A method for identifying plane semantic categories, comprising:
    acquiring image data to be processed, wherein the image data to be processed comprises N pixels, N being a positive integer;
    determining a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result comprises a target plane category corresponding to at least some of the N pixels;
    obtaining a first dense semantic map according to the semantic segmentation result, wherein the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, the at least one first three-dimensional point corresponding to at least one pixel among the at least some pixels; and
    performing plane semantic category identification according to the first dense semantic map, to obtain a plane semantic category of each of one or more planes comprised in the image data to be processed.
  2. The method according to claim 1, wherein the obtaining a first dense semantic map according to the semantic segmentation result comprises:
    obtaining a second dense semantic map according to the semantic segmentation result and a depth image corresponding to the image data to be processed; and
    using the second dense semantic map as the first dense semantic map, or
    updating a historical dense semantic map by using one or more second three-dimensional points in a second three-dimensional point cloud in the second dense semantic map, to obtain the first dense semantic map.
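The map-update branch above can be pictured as merging newly labeled three-dimensional points into a running map. The sketch below is purely illustrative and is not the implementation of the application: the choice of a dictionary keyed by quantized coordinates, the voxel size, and the overwrite policy are all assumptions made for the example.

```python
def update_dense_map(history, new_points, voxel=0.05):
    """Merge ((x, y, z), category) pairs into a historical dense semantic map.

    history: dict mapping a quantized (x, y, z) key to a plane category.
    New observations overwrite older ones at the same quantized location,
    so nearby points collapse into one cell and the map stays bounded.
    """
    for (x, y, z), cat in new_points:
        key = (round(x / voxel), round(y / voxel), round(z / voxel))
        history[key] = cat
    return history

# Two nearby "floor" observations fall into the same 5 cm cell.
m = update_dense_map({}, [((0.0, 0.0, 0.0), "floor"), ((0.01, 0.0, 0.0), "floor")])
print(len(m))  # 1
```

In a real system the quantization step and the merge policy (overwrite vs. keep a category histogram per cell) would be tuned to the sensor noise; the application itself leaves these details open.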
  3. The method according to claim 1 or 2, wherein after the determining a semantic segmentation result of the image data to be processed, the method further comprises:
    performing an optimization operation on the semantic segmentation result according to the image data to be processed and depth information comprised in the depth image corresponding to the image data to be processed, wherein the optimization operation is used to correct noise and error parts in the semantic segmentation result.
  4. The method according to any one of claims 1 to 3, wherein the determining a semantic segmentation result of the image data to be processed comprises:
    determining one or more plane categories corresponding to any one of the at least some pixels and a probability of each of the one or more plane categories; and
    using the plane category with the highest probability among the one or more plane categories corresponding to the pixel as the target plane category corresponding to the pixel, to obtain the semantic segmentation result of the image data to be processed.
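The per-pixel selection step above is an argmax over the candidate category probabilities. A minimal sketch, assuming the probabilities are available as a nested list (the data layout and function names are illustrative, not part of the claim):

```python
def target_plane_category(category_probs):
    """Return the index of the plane category with the highest probability
    for a single pixel, given its list of per-category probabilities."""
    return max(range(len(category_probs)), key=category_probs.__getitem__)

def segmentation_result(prob_image):
    """Apply the per-pixel selection to a whole image, represented here as
    rows of per-pixel probability lists (an assumed layout)."""
    return [[target_plane_category(p) for p in row] for row in prob_image]

# One row of two pixels, three candidate plane categories each.
probs = [[[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]]
print(segmentation_result(probs))  # [[1, 0]]
```

In practice this argmax would typically be done as one vectorized operation over the network's output tensor rather than pixel by pixel.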
  5. The method according to claim 4, wherein the determining a probability of each of the one or more plane categories corresponding to any one of the at least some pixels comprises:
    performing semantic segmentation on the image data to be processed by using a neural network, to obtain the probability of each of the one or more plane categories corresponding to the pixel.
  6. The method according to any one of claims 1 to 5, wherein the performing plane semantic category identification according to the first dense semantic map, to obtain a plane semantic category of each of one or more planes comprised in the image data to be processed, comprises:
    determining a plane equation of each of the one or more planes according to the image data to be processed; and
    performing the following steps on any one of the one or more planes to obtain the plane semantic category of the plane:
    determining one or more target plane categories corresponding to the plane and confidences of the one or more target plane categories according to the plane equation of the plane and the first dense semantic map; and
    selecting, among the one or more target plane categories, the target plane category with the highest confidence as the plane semantic category of the plane.
  7. The method according to claim 6, wherein an orientation of the one or more target plane categories corresponding to the plane is consistent with an orientation of the plane.
  8. The method according to claim 6 or 7, wherein the determining one or more target plane categories corresponding to the plane and confidences of the one or more target plane categories according to the plane equation of the plane and the first dense semantic map comprises:
    determining M first three-dimensional points from the first dense semantic map according to the plane equation of the plane, wherein a distance between each of the M first three-dimensional points and the plane is less than a third threshold, M being a positive integer;
    determining one or more target plane categories corresponding to the M first three-dimensional points as the one or more target plane categories corresponding to the plane, wherein an orientation of the one or more target plane categories is consistent with an orientation of the plane; and
    counting, for each of the one or more target plane categories, a proportion of the number of three-dimensional points corresponding to the target plane category among the M first three-dimensional points, to obtain the confidences of the one or more target plane categories.
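The confidence computation above — keep the map points within a distance threshold of the plane, then take per-category frequencies — can be sketched as follows. The representation of the plane as coefficients (a, b, c, d) of ax + by + cz + d = 0 with a unit normal, and of the map as (point, category) pairs, are assumptions made for the example, not the representation used in the application.

```python
from collections import Counter

def plane_category_confidences(plane, semantic_points, threshold):
    """plane: (a, b, c, d) with unit normal (a, b, c);
    semantic_points: iterable of ((x, y, z), category) pairs from the
    dense semantic map.

    Keeps the M points whose point-to-plane distance is below the
    threshold, then returns {category: count / M}.
    """
    a, b, c, d = plane
    near = [cat for (x, y, z), cat in semantic_points
            if abs(a * x + b * y + c * z + d) < threshold]
    m = len(near)
    return {cat: n / m for cat, n in Counter(near).items()} if m else {}

# Horizontal plane z = 0; three nearby points, one far away.
pts = [((0, 0, 0.01), "floor"), ((1, 0, -0.02), "floor"),
       ((0, 1, 0.03), "table"), ((0, 0, 2.0), "wall")]
conf = plane_category_confidences((0, 0, 1, 0), pts, threshold=0.1)
```

Here `conf` comes out as floor ≈ 2/3 and table ≈ 1/3, while the distant "wall" point is excluded by the threshold.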
  9. The method according to claim 8, further comprising:
    updating the confidences of the one or more target plane categories corresponding to the plane according to at least one of Bayes' theorem or a voting mechanism.
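The claim leaves the update rule open. One common choice, shown here purely as an illustration, is a recursive Bayesian fusion of per-frame confidences: treat the new frame's confidences as a likelihood, multiply into the running belief, and renormalize. The smoothing floor and the renormalization scheme are assumptions of this sketch, not details from the application.

```python
def bayes_update(prior, observation):
    """Fuse a new per-frame confidence distribution into the running one.

    prior, observation: dicts mapping category -> confidence.
    Categories absent from either dict get a small floor value so a
    category unseen in one frame is not zeroed out permanently.
    """
    floor = 1e-3  # illustrative smoothing constant
    cats = set(prior) | set(observation)
    post = {c: prior.get(c, floor) * observation.get(c, floor) for c in cats}
    total = sum(post.values())
    return {c: p / total for c, p in post.items()}

belief = {"floor": 0.6, "table": 0.4}
belief = bayes_update(belief, {"floor": 0.9, "table": 0.1})
print(max(belief, key=belief.get))  # floor
```

A voting mechanism, the other option named in the claim, would instead simply accumulate per-frame winners and keep the category with the most votes.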
  10. The method according to any one of claims 1 to 9, wherein the obtaining a first dense semantic map according to the semantic segmentation result comprises:
    determining whether a current state is a motion state; and
    obtaining the first dense semantic map according to the semantic segmentation result when the current state is the motion state.
  11. The method according to any one of claims 1 to 10, wherein the image data to be processed is orientation-corrected image data.
  12. An image data processing apparatus, comprising:
    a semantic segmentation module, configured to acquire image data to be processed comprising N pixels and to determine a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result comprises a target plane category corresponding to at least some of the N pixels, N being a positive integer;
    a semantic map module, configured to obtain a first dense semantic map according to the semantic segmentation result, wherein the first dense semantic map comprises at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, the at least one first three-dimensional point corresponding to at least one pixel among the at least some pixels; and
    a semantic clustering module, configured to perform plane semantic category identification according to the first dense semantic map, to obtain a plane semantic category of each of one or more planes comprised in the image data to be processed.
  13. The apparatus according to claim 12, wherein the semantic map module is specifically configured to:
    obtain a second dense semantic map according to the semantic segmentation result and a depth image corresponding to the image data to be processed; and
    use the second dense semantic map as the first dense semantic map, or
    update a historical dense semantic map by using one or more second three-dimensional points in a second three-dimensional point cloud in the second dense semantic map, to obtain the first dense semantic map.
  14. The apparatus according to claim 12 or 13, wherein after determining the semantic segmentation result of the image data to be processed, the semantic segmentation module is further configured to perform an optimization operation on the semantic segmentation result according to the image data to be processed and depth information comprised in the depth image corresponding to the image data to be processed, the optimization operation being used to correct noise and error parts in the semantic segmentation result.
  15. The apparatus according to any one of claims 12 to 14, wherein the semantic segmentation module is specifically configured to determine one or more plane categories corresponding to any one of the at least some pixels and a probability of each of the one or more plane categories;
    and to use the plane category with the highest probability among the one or more plane categories corresponding to the pixel as the target plane category corresponding to the pixel, to obtain the semantic segmentation result of the image data to be processed.
  16. The apparatus according to claim 15, wherein the semantic segmentation module is configured to perform semantic segmentation on the image data to be processed by using a neural network, to obtain the probability of each of the one or more plane categories corresponding to any one of the at least some pixels.
  17. The apparatus according to any one of claims 12 to 16, wherein the semantic clustering module is configured to:
    determine a plane equation of each of the one or more planes according to the image data to be processed; and
    perform the following steps on any one of the one or more planes to obtain the plane semantic category of the plane:
    determining one or more target plane categories corresponding to the plane and confidences of the one or more target plane categories according to the plane equation of the plane and the first dense semantic map; and
    selecting, among the one or more target plane categories, the target plane category with the highest confidence as the plane semantic category of the plane.
  18. The apparatus according to claim 17, wherein an orientation of the one or more target plane categories corresponding to the plane is consistent with an orientation of the plane.
  19. The apparatus according to claim 17 or 18, wherein the semantic clustering module is specifically configured to:
    determine M first three-dimensional points from the first dense semantic map according to the plane equation of the plane, wherein a distance between each of the M first three-dimensional points and the plane is less than a third threshold, M being a positive integer;
    determine one or more target plane categories corresponding to the M first three-dimensional points as the one or more target plane categories corresponding to the plane, wherein an orientation of the one or more target plane categories is consistent with an orientation of the plane; and
    count, for each of the one or more target plane categories, a proportion of the number of three-dimensional points corresponding to the target plane category among the M first three-dimensional points, to obtain the confidences of the one or more target plane categories.
  20. The apparatus according to claim 19, wherein after obtaining the confidences of the one or more target plane categories by counting the proportion of the number of three-dimensional points corresponding to each of the one or more target plane categories among the M first three-dimensional points, the semantic clustering module is further configured to update the confidences of the one or more target plane categories corresponding to the plane according to at least one of Bayes' theorem or a voting mechanism.
  21. The apparatus according to any one of claims 12 to 20, wherein the semantic map module is specifically configured to determine whether a current state is a motion state, and to obtain the first dense semantic map according to the semantic segmentation result when the current state is determined to be the motion state.
  22. The apparatus according to any one of claims 12 to 21, wherein the image data to be processed is orientation-corrected image data.
  23. A computer-readable storage medium, wherein the storage medium stores instructions which, when executed, implement the method according to any one of claims 1 to 11.
  24. A processing device, comprising a first processor and a second processor, wherein:
    the first processor is configured to acquire image data to be processed, the image data to be processed comprising N pixels, N being a positive integer;
    the second processor is configured to determine a semantic segmentation result of the image data to be processed, wherein the semantic segmentation result comprises a target plane category corresponding to at least some of the N pixels; and
    the first processor is further configured to obtain a first dense semantic map according to the semantic segmentation result, the first dense semantic map comprising at least one target plane category corresponding to at least one first three-dimensional point in a first three-dimensional point cloud, the at least one first three-dimensional point corresponding to at least one pixel among the at least some pixels, and to perform plane semantic category identification according to the first dense semantic map, to obtain a plane semantic category of each of one or more planes comprised in the image data to be processed.
  25. The processing device according to claim 24, wherein the second processor is specifically configured to determine one or more plane categories corresponding to any one of the at least some pixels and a probability of each of the one or more plane categories; and
    to use the plane category with the highest probability among the one or more plane categories corresponding to the pixel as the target plane category corresponding to the pixel, to obtain the semantic segmentation result of the image data to be processed.
  26. The processing device according to claim 25, wherein the second processor is specifically configured to perform semantic segmentation on the image data to be processed by using a neural network, to obtain the probability of each of the one or more plane categories corresponding to any one of the at least some pixels.
  27. A processing device, comprising one or more processors, wherein the one or more processors are configured to execute instructions stored in a memory to perform the method according to any one of claims 1 to 11.
PCT/CN2020/074040 2020-01-23 2020-01-23 Plane semantic category identification method and image data processing apparatus WO2021147113A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080001308.1A CN113439275A (en) 2020-01-23 2020-01-23 Identification method of plane semantic category and image data processing device
PCT/CN2020/074040 WO2021147113A1 (en) 2020-01-23 2020-01-23 Plane semantic category identification method and image data processing apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/074040 WO2021147113A1 (en) 2020-01-23 2020-01-23 Plane semantic category identification method and image data processing apparatus

Publications (1)

Publication Number Publication Date
WO2021147113A1 true WO2021147113A1 (en) 2021-07-29

Family

ID=76992013

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/074040 WO2021147113A1 (en) 2020-01-23 2020-01-23 Plane semantic category identification method and image data processing apparatus

Country Status (2)

Country Link
CN (1) CN113439275A (en)
WO (1) WO2021147113A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4138390A1 (en) * 2021-08-20 2023-02-22 Beijing Xiaomi Mobile Software Co., Ltd. Method for camera control, image signal processor and device with temporal control of image acquisition parameters
WO2023051362A1 (en) * 2021-09-30 2023-04-06 北京字跳网络技术有限公司 Image area processing method and device
WO2023088177A1 (en) * 2021-11-16 2023-05-25 华为技术有限公司 Neural network model training method, and vectorized three-dimensional model establishment method and device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119998B (en) * 2021-12-01 2023-04-18 成都理工大学 Vehicle-mounted point cloud ground point extraction method and storage medium
CN115527028A (en) * 2022-08-16 2022-12-27 北京百度网讯科技有限公司 Map data processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145747A (en) * 2018-07-20 2019-01-04 华中科技大学 A kind of water surface panoramic picture semantic segmentation method
US20190287254A1 (en) * 2018-03-16 2019-09-19 Honda Motor Co., Ltd. Lidar noise removal using image pixel clusterings
CN110378349A (en) * 2019-07-16 2019-10-25 北京航空航天大学青岛研究院 The mobile terminal Android indoor scene three-dimensional reconstruction and semantic segmentation method
CN110458805A (en) * 2019-03-26 2019-11-15 华为技术有限公司 Plane detection method, computing device and circuit system
CN110633617A (en) * 2018-06-25 2019-12-31 苹果公司 Plane detection using semantic segmentation

Also Published As

Publication number Publication date
CN113439275A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
WO2021147113A1 (en) Plane semantic category identification method and image data processing apparatus
US10198823B1 (en) Segmentation of object image data from background image data
US11481923B2 (en) Relocalization method and apparatus in camera pose tracking process, device, and storage medium
US11373393B2 (en) Image based object detection
US9965865B1 (en) Image data segmentation using depth data
US10719759B2 (en) System for building a map and subsequent localization
CN112189335B (en) CMOS assisted inside-out dynamic vision sensor tracking for low power mobile platforms
US9406137B2 (en) Robust tracking using point and line features
CN109934065B (en) Method and device for gesture recognition
US20170013195A1 (en) Wearable information system having at least one camera
US20200117936A1 (en) Combinatorial shape regression for face alignment in images
WO2018049801A1 (en) Depth map-based heuristic finger detection method
CN112889068A (en) Neural network object recognition for image processing
US20240104744A1 (en) Real-time multi-view detection of objects in multi-camera environments
US10827125B2 (en) Electronic device for playing video based on movement information and operating method thereof
CN109493349B (en) Image feature processing module, augmented reality equipment and corner detection method
US11688094B1 (en) Method and system for map target tracking
US20240177329A1 (en) Scaling for depth estimation
US20230377182A1 (en) Augmented reality device for obtaining depth information and method of operating the same
US20230162375A1 (en) Method and system for improving target detection performance through dynamic learning
CN116576866B (en) Navigation method and device
US20240153245A1 (en) Hybrid system for feature detection and descriptor generation
WO2023102873A1 (en) Enhanced techniques for real-time multi-person three-dimensional pose tracking using a single camera
WO2021179905A1 (en) Motion blur robust image feature descriptor
WO2024112458A1 (en) Scaling for depth estimation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20915567

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20915567

Country of ref document: EP

Kind code of ref document: A1