CN113302654A - Method and device for measuring spatial dimension of object in image

Method and device for measuring spatial dimension of object in image

Info

Publication number
CN113302654A
Authority
CN
China
Prior art keywords
dimensional
objects
pixel
reference plane
image
Prior art date
Legal status
Pending
Application number
CN201980006529.5A
Other languages
Chinese (zh)
Inventor
邓清珊
陈平
马超群
方晓鑫
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN113302654A


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/60 — Analysis of geometric attributes
    • G06T 7/62 — Analysis of geometric attributes of area, perimeter, diameter or volume

Abstract

A method and an apparatus for measuring the spatial dimension of an object in an image, used to measure the spatial dimension of an object in an image automatically. The method includes: recognizing a first image to obtain N objects, and converting the N objects into N three-dimensional objects corresponding to the N objects, where each three-dimensional object includes a three-dimensional point cloud and is at least part of an object in the three-dimensional environment space where the first image is located; and then determining the spatial dimensions of the N objects according to a first reference plane in the three-dimensional environment space and the N three-dimensional objects, where the spatial dimension of each of the N objects includes at least one of the following: a distance from at least one surface of the object to the first reference plane, or a three-dimensional size of the object.

Description

Method and device for measuring spatial dimension of object in image
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a method and an apparatus for measuring a spatial dimension of an object in an image.
Background
At present, the following methods are mainly used to measure the spatial dimensions of an object:
1) Manual measurement with conventional measuring tools such as a ruler or a tape measure. If the measured object is large, several people are needed for the measurement and the process is cumbersome. When the measured object is tall (for example, 10 meters), it cannot be measured directly, and the measurement can only be completed with the help of tools such as a ladder. The measurement process is therefore inconvenient and carries safety risks.
2) Measurement with an infrared or laser measuring tool. Signals are sent to the measured object from different angles, the signals reflected by the measured object are received, and the elapsed time between sending and reflection is calculated several times, so that the length, width and height of the measured object, and thus its spatial dimensions, are obtained. This approach requires multiple measurements from different angles before the spatial dimensions of the object are obtained, so the measurement efficiency is low.
3) Measurement with an augmented reality (AR) measuring tool. Three-dimensional information of the three-dimensional environment space is established through the AR technology, and the spatial dimensions of the measured object are obtained with the help of manual interaction; for example, the user selects the start position and end position of the measurement of the measured object from different angles to obtain a plurality of minimum bounding boxes. This approach requires the participation of the user, is inconvenient to operate, and also requires multiple measurements from different angles before the spatial dimensions of the object are obtained, so the measurement efficiency is low.
Therefore, measuring the spatial dimensions of an object currently requires the participation of the user, the operation is inconvenient, the spatial dimensions of the object can only be obtained by measuring several times from different angles, and the measurement efficiency is low.
Disclosure of Invention
The embodiment of the application provides a method and a device for measuring the spatial dimension of an object in an image, which are used for automatically measuring the spatial dimension of the object in the image.
In a first aspect, an embodiment of the present application provides a method for measuring a spatial dimension of an object in an image, where the method includes: the first image is identified to obtain N objects in the first image, each object in the N objects comprises a pixel point set, each pixel point set comprises a plurality of pixel points, and N is an integer greater than or equal to 1. And converting the N objects into N three-dimensional objects corresponding to the N objects, wherein each three-dimensional object comprises a three-dimensional point cloud and is at least one part of an object in a three-dimensional environment space where the first image is located. Then, according to a first reference plane and N three-dimensional objects in the three-dimensional environment space, determining spatial dimensions of N objects, the spatial dimension of each object in the N objects including at least one of: a distance of at least one surface of the object to a first reference plane, or a three-dimensional dimension of the object, wherein the at least one surface is parallel to the first reference plane.
Compared with prior-art approaches in which the measurement of an object can only be completed with the participation of a user, the solution provided in the embodiment of the present application converts the N objects obtained by recognizing the first image into N corresponding three-dimensional objects, where each three-dimensional object includes a three-dimensional point cloud and is at least part of an object in the three-dimensional environment space where the first image is located, and then determines the spatial dimensions of the N three-dimensional objects with the first reference plane as a reference, thereby obtaining the spatial dimensions of the N objects. With this solution, the spatial dimensions of an object can be measured automatically, measurement tasks that are difficult for a user to complete can be performed without the user's participation, and the solution is applicable to a variety of measurement environments. For example, the height of a ceiling above the ground can be measured, as can the length, width and height of a bulky object. Compared with prior-art methods that require multiple measurements from different angles, the solution provided in this application can determine the spatial dimensions of an object in an image directly, which is convenient to operate and can improve measurement efficiency.
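The summary does not tie the conversion from recognized pixel sets to three-dimensional objects to a particular technique; a common way to perform it, assuming a depth image aligned with the first image and known camera intrinsics, is pinhole back-projection. The following Python sketch (function and parameter names are illustrative assumptions, not the patent's implementation) shows the idea:

```python
import numpy as np

def pixels_to_point_cloud(pixels, depth, fx, fy, cx, cy):
    """Back-project one object's pixel set into a 3D point cloud (pinhole model).

    pixels : (M, 2) array of (u, v) image coordinates of one recognized object.
    depth  : (H, W) depth image aligned with the first image, in metres.
    fx, fy, cx, cy : camera intrinsics (assumed known from calibration).
    Returns an (M, 3) array of 3D points in the camera coordinate frame.
    """
    u = pixels[:, 0].astype(int)
    v = pixels[:, 1].astype(int)
    z = depth[v, u]                 # depth value of each pixel point
    x = (u - cx) * z / fx           # standard pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Example: six pixels of one object, all 2 m in front of the camera.
pixels = np.array([[10, 10], [11, 10], [12, 10], [10, 11], [11, 11], [12, 11]])
depth = np.full((480, 640), 2.0)
cloud = pixels_to_point_cloud(pixels, depth, fx=500, fy=500, cx=320, cy=240)
print(cloud.shape)  # (6, 3)
```

Lifting every recognized object's pixel set in this way yields the three-dimensional point cloud of its corresponding three-dimensional object.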
In one possible design, the N objects include a first object corresponding to a first three-dimensional object including a first point cloud of three-dimensional points, the first three-dimensional object being a first object in the three-dimensional environment space; the determining spatial dimensions of the N objects from the first reference plane and the N three-dimensional objects in the three-dimensional environment space comprises: projecting the first three-dimensional point cloud onto the first reference plane to obtain a first projection area of the first three-dimensional point cloud on the first reference plane; determining a plurality of first distances from a plurality of first three-dimensional points in the first three-dimensional point cloud to the first reference plane; determining a three-dimensional size of the first object based on the first projection area and the plurality of first distances.
In the design, a first projection area of the first three-dimensional point cloud on the first reference plane can be obtained by projecting the first three-dimensional point cloud on the first reference plane, then a plurality of first distances from a plurality of first three-dimensional points in the first three-dimensional point cloud to the first reference plane are determined, and then the length, the width and the height of a first object corresponding to the first three-dimensional point cloud can be determined according to the first projection area and the plurality of first distances. In the process, the reference plane does not need to be manually selected by a user, and the length, the width and the height of the first object in the first image can be directly obtained according to the first projection area of the first three-dimensional point cloud corresponding to the first object on the first reference plane and the plurality of first distances, so that the user experience can be improved.
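As a concrete illustration of this design, the sketch below is a simplification made under stated assumptions, not the patent's implementation: it measures the projection footprint in an arbitrary in-plane basis (a minimum-area rectangle would be tighter) and takes the largest point-to-plane distance as the height, assuming the object rests on the reference plane.

```python
import numpy as np

def plane_basis(normal):
    """Unit normal plus two orthonormal in-plane axes of the reference plane."""
    n = normal / np.linalg.norm(normal)
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(n, helper)
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(n, e1)
    return n, e1, e2

def size_from_cloud(points, plane_point, plane_normal):
    """Length and width from the projection footprint, height from point-to-plane distances."""
    n, e1, e2 = plane_basis(plane_normal)
    rel = points - plane_point
    dist = rel @ n                                  # the plurality of first distances
    height = dist.max()                             # assumes the object rests on the plane
    uv = np.stack([rel @ e1, rel @ e2], axis=1)     # coordinates inside the projection area
    extents = np.sort(uv.max(axis=0) - uv.min(axis=0))[::-1]
    length, width = extents
    return float(length), float(width), float(height)

# Example: a 2 m x 1 m x 0.5 m box of points standing on the ground plane z = 0.
box = np.array(np.meshgrid(np.linspace(0, 2, 5),
                           np.linspace(0, 1, 5),
                           np.linspace(0, 0.5, 5))).reshape(3, -1).T
print(size_from_cloud(box, plane_point=np.zeros(3), plane_normal=np.array([0.0, 0.0, 1.0])))
# (2.0, 1.0, 0.5)
```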
In one possible design, the N objects include a second object corresponding to a second three-dimensional object including a second three-dimensional point cloud, the second three-dimensional object being a second object in the three-dimensional environment space; the determining spatial dimensions of the N objects from the first reference plane and the N three-dimensional objects in the three-dimensional environment space comprises: determining a third three-dimensional point cloud corresponding to a first surface of the second object in the second three-dimensional point cloud, the first surface being parallel to the first reference plane; determining a plurality of second distances from a plurality of second three-dimensional points in the third three-dimensional point cloud to the first reference plane; determining a distance from the first surface to the first reference plane based on the plurality of second distances.
In the above design, a third three-dimensional point cloud corresponding to the first surface of the second object is determined from the second three-dimensional point cloud, the first surface is parallel to the first reference plane, then a plurality of second distances from a plurality of second three-dimensional points in the third three-dimensional point cloud to the first reference plane are determined, and then the distance from the first surface to the first reference plane can be determined according to the plurality of second distances. The reference plane is not selected by a user, and the distance between the first surface of the second object and the first reference plane can be directly obtained according to the plurality of second distances, so that the user experience can be improved.
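A minimal sketch of this design, assuming the first surface corresponds to the points with the largest signed distances and that averaging the second distances is an acceptable way to combine them:

```python
import numpy as np

def surface_to_plane_distance(points, plane_point, plane_normal, band=0.01):
    """Distance from a surface parallel to the reference plane (e.g. a table top).

    Keeps the points whose signed distance lies within `band` of the largest
    distance -- a stand-in for the third three-dimensional point cloud of the
    first surface -- and averages their second distances to suppress depth noise.
    """
    n = plane_normal / np.linalg.norm(plane_normal)
    dist = (points - plane_point) @ n
    surface = dist[dist >= dist.max() - band]       # points of the parallel first surface
    return float(surface.mean())

# Example: a noisy table top about 0.75 m above the ground plane z = 0.
rng = np.random.default_rng(0)
top = np.column_stack([rng.uniform(0, 1, 200),
                       rng.uniform(0, 1, 200),
                       0.75 + rng.normal(0, 0.002, 200)])
print(round(surface_to_plane_distance(top, np.zeros(3), np.array([0.0, 0.0, 1.0])), 2))  # 0.75
```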
In one possible design, the N objects include a third object corresponding to a third three-dimensional object including a fourth three-dimensional point cloud, the third three-dimensional object being part of a third object in the three-dimensional environment space; the determining spatial dimensions of the N objects from the first reference plane and the N three-dimensional objects in the three-dimensional environment space comprises: obtaining a semantic map, wherein the semantic map is a three-dimensional image comprising the three-dimensional environment space; determining a fifth three-dimensional point cloud corresponding to the third object according to the semantic map and the fourth three-dimensional point cloud; projecting the fifth three-dimensional point cloud onto the first reference plane to obtain a second projection area of the fifth three-dimensional point cloud on the first reference plane; determining a plurality of third distances from a plurality of third three-dimensional points in the fifth three-dimensional point cloud to the first reference plane; determining a three-dimensional size of the third object according to the second projection area and the plurality of third distances.
In the above design, when the third three-dimensional object corresponding to the third object is part of a third object in the three-dimensional environment space, a fifth three-dimensional point cloud of the third object is determined from the semantic map and the fourth three-dimensional point cloud included in the third three-dimensional object, and the three-dimensional size of the third object is then obtained from the second projection area where the fifth three-dimensional point cloud is projected onto the first reference plane and from a plurality of third distances from a plurality of third three-dimensional points in the fifth three-dimensional point cloud to the first reference plane. Although the first image includes only part of the third object, the fifth three-dimensional point cloud corresponding to the third object can be obtained through the three-dimensional image of the three-dimensional environment space and the fourth three-dimensional point cloud, and the length, width and height of the third object can then be measured automatically according to the fifth three-dimensional point cloud and the first reference plane, thereby completing the measurement of the spatial dimension of the object in the first image.
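A possible sketch of this design, under the assumption that the semantic map stores labeled three-dimensional points and that a simple distance test is enough to associate map points with the partially observed object:

```python
import numpy as np

def complete_object_cloud(partial_cloud, map_points, map_labels, label, radius=0.05):
    """Recover the full (fifth) point cloud of a partly visible object from the semantic map.

    partial_cloud : (P, 3) fourth three-dimensional point cloud observed in the first image.
    map_points    : (M, 3) points of the semantic map of the three-dimensional environment space.
    map_labels    : (M,) per-point semantic labels of the map.
    label         : semantic label of the third object (e.g. 'table').
    Map points carrying the object's label and lying within `radius` of the
    observed points are merged with the partial cloud.
    """
    candidates = map_points[map_labels == label]
    d = np.min(np.linalg.norm(candidates[:, None, :] - partial_cloud[None, :, :], axis=2), axis=1)
    return np.vstack([partial_cloud, candidates[d <= radius]])

# Example: part of a table observed; nearby map points carry the same label.
seen = np.array([[0.00, 0.0, 0.7], [0.10, 0.0, 0.7]])
map_pts = np.array([[0.14, 0.0, 0.7], [0.15, 0.0, 0.7], [5.0, 5.0, 0.0]])
map_lab = np.array(['table', 'table', 'floor'])
print(len(complete_object_cloud(seen, map_pts, map_lab, 'table')))  # 4
```

A single distance test like this only attaches map points adjacent to what was observed; a practical system would either iterate the test (region growing in three dimensions) or store per-instance identifiers in the semantic map so that all points of the third object can be retrieved directly.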
In one possible design, the recognizing the first image to obtain N objects in the first image includes: performing semantic segmentation on the first image to obtain N candidate pixel point sets aiming at the N objects and a first pixel point set not aiming at a specific object; adding at least one second pixel point in the plurality of second pixel points in each candidate pixel point set according to the first information of the plurality of first pixel points in each candidate pixel point set in the N candidate pixel point sets and the first information of the plurality of second pixel points in the first pixel point set to obtain one object in the N objects, wherein the pixel point set included in the object comprises the candidate pixel point set of the object and the at least one second pixel point; wherein the first information comprises at least one of: depth information or color information.
In the above design, a semantic segmentation result can be obtained by performing semantic segmentation on the first image, where the semantic segmentation result includes N candidate pixel point sets for N objects and a first pixel point set not for a specific object, and then at least one second pixel point of the plurality of second pixel points is added to each candidate pixel point set according to first information of the plurality of first pixel points in each candidate pixel point set and first information of the plurality of second pixel points in the first pixel point set. Therefore, the semantic segmentation result can be optimized, so that at least one second pixel which is not originally specific to a specific object is re-segmented into a pixel which is specific to one object of the N objects, namely, the pixel which is not successfully identified in the first image is identified again, the semantic segmentation precision is improved, and the accuracy of the space dimension of subsequent measurement can be improved.
In a possible design, a similarity distance between each second pixel point of the at least one second pixel point and at least one first pixel point of the candidate pixel point set of the object is less than or equal to a first preset threshold, and the similarity distance between any second pixel point and any first pixel point is obtained from the first information of any second pixel point and the first information of any first pixel point.
In the above design, the at least one second pixel point is added to each candidate pixel point set, and the similarity distance between each second pixel point in the at least one second pixel point and at least one pixel point in the candidate pixel point set of the object is less than or equal to a first preset threshold. The similarity distance is obtained from the depth information and/or the color information, that is, the second pixel point in the first pixel point set is added to the candidate pixel point set where the first pixel point with the color similar to the second pixel point and/or the depth similar to the second pixel point is located. This means that the second pixel point in the first pixel point set which is not originally directed to the specific object is added to one candidate pixel point set in the N candidate pixel point sets directed to the N objects through the depth information and/or the color information, so that the semantic segmentation accuracy can be improved, and further the spatial dimension of the subsequent measurement can be more accurate.
In a possible design, a distance between a position of each second pixel point in the at least one second pixel point in the first image and a position of at least one first pixel point in the candidate pixel point set of the object in the first image is less than or equal to a second preset threshold.
In the above design, the at least one second pixel point is added to each candidate pixel point set, and a distance between a position of each second pixel point in the at least one second pixel point in the first image and a position of at least one first pixel point in the candidate pixel point set of the object in the first image is less than or equal to a second preset threshold. Therefore, the situation that pixel points originally belonging to one object are wrongly segmented to another object due to similar colors and/or similar depth values can be avoided, the semantic segmentation accuracy is improved, and further the space dimensionality of subsequent measurement can be more accurate.
In one possible design, the first reference plane is the ground. In this design, the ground covers a large area and is easy to identify, and most objects in the three-dimensional environment space are located above the ground, so that the height of most objects is the distance from one of their surfaces to the ground. Measuring the spatial dimensions of an object with the ground as the reference plane therefore allows the distance from a surface of the object to the ground, that is, the height of the object, to be measured accurately, without the user selecting a reference plane through manual interaction, which is convenient to operate and can improve the user experience.
In a second aspect, an embodiment of the present application provides an apparatus for measuring a spatial dimension of an object in an image, where the apparatus includes an identification unit, a conversion unit, and a processing unit; the identification unit is configured to identify a first image to obtain N objects in the first image, where N is an integer greater than or equal to 1, and each object in the N objects includes a pixel point set, and the pixel point set includes a plurality of pixel points; the conversion unit is used for converting the N objects into N three-dimensional objects corresponding to the N objects, wherein each three-dimensional object comprises a three-dimensional point cloud and is at least one part of an object in a three-dimensional environment space where the first image is located; the processing unit is configured to determine spatial dimensions of N objects according to a first reference plane in the three-dimensional environment space and the N three-dimensional objects, where the spatial dimension of each of the N objects includes at least one of: a distance of at least one surface of the object to the first reference plane, or a three-dimensional dimension of the object, the at least one surface being parallel to the first reference plane.
In one possible design, the N objects include a first object corresponding to a first three-dimensional object including a first three-dimensional point cloud, the first three-dimensional object being a first object in the three-dimensional environment space; the processing unit is specifically configured to: projecting the first three-dimensional point cloud onto the first reference plane to obtain a first projection area of the first three-dimensional point cloud on the first reference plane; determining a plurality of first distances from a plurality of first three-dimensional points in the first three-dimensional point cloud to the first reference plane; determining a three-dimensional size of the first object based on the first projection area and the plurality of first distances.
In one possible design, the N objects include a second object corresponding to a second three-dimensional object including a second three-dimensional point cloud, the second three-dimensional object being a second object in the three-dimensional environment space; the processing unit is specifically configured to: determining a third three-dimensional point cloud corresponding to a first surface of the second object in the second three-dimensional point cloud, the first surface being parallel to the first reference plane; determining a plurality of second distances from a plurality of second three-dimensional points in the third three-dimensional point cloud to the first reference plane; determining a distance from the first surface to the first reference plane based on the plurality of second distances.
In one possible design, the N objects include a third object corresponding to a third three-dimensional object including a fourth three-dimensional point cloud, the third three-dimensional object being part of a third object in the three-dimensional environment space; the processing unit is specifically configured to: obtaining a semantic map, wherein the semantic map is a three-dimensional image comprising the three-dimensional environment space; determining a fifth three-dimensional point cloud corresponding to the third object according to the semantic map and the fourth three-dimensional point cloud; projecting the fifth three-dimensional point cloud onto the first reference plane to obtain a second projection area of the fifth three-dimensional point cloud on the first reference plane; determining a plurality of third distances from a plurality of third three-dimensional points in the fifth three-dimensional point cloud to the first reference plane; determining a three-dimensional size of the third object according to the second projection area and the plurality of third distances.
In a possible design, the identification unit is specifically configured to: performing semantic segmentation on the first image to obtain N candidate pixel point sets aiming at the N objects and a first pixel point set not aiming at a specific object; adding at least one second pixel point of the plurality of second pixel points in each candidate pixel point set of the N candidate pixel point sets according to the first information of the plurality of first pixel points in each candidate pixel point set and the first information of the plurality of second pixel points in the first pixel point set so as to obtain one object of the N objects, wherein the pixel point set of the object comprises the candidate pixel point set of the object and the at least one second pixel point; wherein the first information comprises at least one of: depth information or color information.
In a possible design, a similarity distance between each second pixel point of the at least one second pixel point and at least one first pixel point of the candidate pixel point set of the object is less than or equal to a first preset threshold, and the similarity distance between any second pixel point and any first pixel point is obtained from the first information of any second pixel point and the first information of any first pixel point.
In a possible design, a distance between a position of each second pixel point in the at least one second pixel point in the first image and a position of at least one first pixel point in the candidate pixel point set of the object in the first image is less than or equal to a second preset threshold.
In one possible design, the first reference plane is the ground.
In a third aspect, embodiments of the present application provide yet another apparatus for measuring a spatial dimension of an object in an image, the apparatus including at least one processor; the at least one processor is configured to execute a computer program or instructions to cause the apparatus to perform the method described in the first aspect.
In one possible design, the at least one processor, when executing the computer program or instructions, performs the steps of: identifying a first image to obtain N objects in the first image, wherein N is an integer greater than or equal to 1, each object in the N objects comprises a pixel point set, and the pixel point set comprises a plurality of pixel points; converting the N objects into N three-dimensional objects corresponding to the N objects, wherein each three-dimensional object comprises a three-dimensional point cloud and is at least one part of an object in a three-dimensional environment space where the first image is located; determining spatial dimensions of N objects according to a first reference plane and the N three-dimensional objects in the three-dimensional environment space, the spatial dimensions of each of the N objects including at least one of: a distance of at least one surface of the object to the first reference plane, or a three-dimensional dimension of the object, the at least one surface being parallel to the first reference plane.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium for storing computer instructions, which, when executed on a computer, cause the computer to perform the method of the first aspect or any one of the possible designs.
In a fifth aspect, embodiments of the present application provide a computer program product for storing computer instructions, which when executed on a computer, cause the computer to perform the method of the first aspect or any one of the possible designs.
In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor and may further include a memory, and is configured to implement the method in the first aspect or any one of the possible designs. The chip system may be formed by a chip, and may also include a chip and other discrete devices.
For the advantageous effects of the second to sixth aspects and their implementations, reference may be made to the description of the advantageous effects of the method of the first aspect and its implementations.
Drawings
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present disclosure;
fig. 2 is a schematic data flow diagram of a method for measuring a spatial dimension of an object in an image according to an embodiment of the present disclosure;
FIG. 3 is a data flow diagram illustrating a region growing method according to an embodiment of the present disclosure;
fig. 4 is a data flow diagram illustrating a method for determining a spatial dimension of an object according to an embodiment of the present disclosure;
fig. 5 is a data flow diagram of another method for determining a spatial dimension of an object according to an embodiment of the present application;
fig. 6 is a data flow diagram illustrating a method for determining a spatial dimension of an object according to an embodiment of the present application;
FIG. 7a is a schematic diagram of a measurement result of a spatial dimension of an object in an image according to an embodiment of the present disclosure;
FIG. 7b is a schematic diagram of another example of a spatial dimension measurement of an object in an image according to the present disclosure;
FIG. 7c is a schematic diagram of a measurement result of a spatial dimension of an object in an image according to an embodiment of the present application;
fig. 8 is a structural diagram of an apparatus for measuring a spatial dimension of an object in an image according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a plurality of pixel points according to an embodiment of the present disclosure;
fig. 10 is another schematic diagram of a plurality of pixel points according to an embodiment of the present application;
fig. 11 is a further schematic diagram of a plurality of pixel points according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a three-dimensional point cloud projected onto a first reference plane according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a second three-dimensional point cloud provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of a third three-dimensional point cloud provided by an embodiment of the present application;
FIG. 15 is a schematic diagram of a fourth three-dimensional point cloud provided by an embodiment of the present application;
fig. 16 is a schematic diagram of a fifth three-dimensional point cloud provided in the embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings. To facilitate understanding, technical terms related to the embodiments of the present application are described first.
In the embodiments of the present application, "a plurality" means two or more, and in view of this, the "plurality" may also be understood as "at least two". "at least one" is to be understood as meaning one or more, for example one, two or more. For example, including at least one means including one, two, or more, and does not limit which ones are included, for example, including at least one of A, B and C, then including may be A, B, C, A and B, A and C, B and C, or a and B and C. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" generally indicates that the preceding and following related objects are in an "or" relationship, unless otherwise specified. Unless stated to the contrary, the embodiments of the present application refer to the ordinal numbers "first", "second", etc., for distinguishing between a plurality of objects, and do not limit the sequence, timing, priority, or importance of the plurality of objects.
Next, the technical features of the embodiments of the present application are described. In the prior art, measurement with an AR measuring tool requires the participation of a user, is inconvenient to operate, and results in a poor user experience. In view of this, the present application provides a method for measuring the spatial dimension of an object in an image. With the first reference plane as a reference, the method can measure the spatial dimensions of an object automatically and without the participation of a user; it is convenient to operate, has high measurement efficiency, is suitable for measuring the spatial dimensions of objects in various measurement environments, and can improve the user experience.
The solution for measuring the spatial dimension of an object in an image provided in the embodiments of the present application can be executed by various computing devices, which may be electronic devices. These include, but are not limited to, personal computers, server computers, hand-held or laptop devices, mobile devices (such as mobile phones, tablets, personal digital assistants and media players), consumer electronics, minicomputers, mainframe computers, mobile robots, drones, and the like.
In the following embodiments, taking an example that a computing device is an electronic device, a method for measuring a spatial dimension of an object in an image provided in the embodiments of the present application is described. The method for measuring the spatial dimension of the object in the image, provided by the embodiment of the application, is suitable for the electronic device shown in fig. 1, and the specific structure of the electronic device is briefly introduced below. Fig. 1 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure. As shown in fig. 1, the electronic device 100 may include a processor 110 and an acquisition apparatus 120. The processor 110 processes the data acquired by the acquisition device 120.
The processor 110 is the control center of the electronic device 100. It connects the parts of the entire electronic device using various interfaces and lines, and performs the functions of the electronic device 100 and processes data by running or executing software programs and/or data stored in the memory. The processor 110 may include one or more processing units, such as a Central Processing Unit (CPU), an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and a neural-Network Processing Unit (NPU). The different processing units may be separate devices or may be integrated into one or more processors. The NPU is a neural-network (NN) computing processor that processes input information rapidly by drawing on the structure of biological neural networks, for example the transfer mode between neurons of the human brain, and that can also learn continuously. Applications such as intelligent recognition of the electronic device 100, for example image recognition, face recognition, speech recognition and text understanding, can be implemented through the NPU.
The capturing device 120 may include a camera 121 for capturing images or video. The camera 121 may be a general camera or a focusing camera. Further, the camera 121 may be used to capture RGB images. The acquisition device 120 may also include one or more sensors 122, such as one or more of an image sensor, an infrared sensor, a laser sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a velocity sensor, a distance sensor, a proximity light sensor, an ambient light sensor, a fingerprint sensor, a touch sensor, a temperature sensor, or a bone conduction sensor. The image sensor is, for example, a time of flight (TOF) sensor or a structured light sensor. The acceleration sensor and the speed sensor may form an Inertial Measurement Unit (IMU), and the IMU may measure the three-axis attitude angle (or angular rate) and acceleration of the object. In the embodiment of the present application, the IMU is mainly used to measure the pose of the electronic device 100 to determine whether the electronic device 100 is in a stationary state or in a moving state.
The electronic device may also include memory 130. The memory 130 may be used for storing software programs and data, and the processor 110 may execute various functional applications and data processing of the electronic device 100 by operating the software programs and data stored in the memory 130. The memory 130 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as an image capturing function, an image recognition function, etc.), and the like; the storage data area may store data (such as audio data, text information, image data, semantic maps, etc.) created according to the use of the electronic apparatus 100, and the like. Further, the memory 130 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The electronic device may also include a display device 140, the display device 140 including a display panel 141 for displaying one or more of information input by the user, information provided to the user, or various menu interfaces of the electronic device 100. In the embodiment of the present application, the display device 140 is mainly used for displaying an image acquired by the camera 121 or the sensor 122 in the electronic device 100. Alternatively, the display panel 141 may include a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), or the like.
The electronic device 100 may further include an input device 150 for receiving input numerical information, character information, or contact touch operation/non-contact gesture, and generating signal input related to user setting and function control of the electronic device 100, and the like.
In some embodiments, processor 110 may include one or more interfaces. The interface may include a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The MIPI interface may be used to connect the processor 110 with peripheral devices such as the display device 140 and the camera 121. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), and the like. In some embodiments, the processor 110 and the camera 121 communicate via the CSI interface to implement the capture function of the electronic device 100, and the processor 110 and the display device 140 communicate via the DSI interface to implement the display function of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 121, the display device 140, the sensor 122, and the like.
The USB interface is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface may be used to connect a charger to charge the electronic device 100, and may also be used to transmit data between the electronic device 100 and a peripheral device. The interface may also be used to connect other electronic devices, such as augmented reality (AR) devices.
It should be understood that the connection relationships between the modules illustrated in this embodiment of the present application are merely examples and do not constitute a limitation on the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt an interface connection manner different from those in the above embodiments, or a combination of multiple interface connection manners.
Although not shown in fig. 1, the electronic device 100 may further include a Radio Frequency (RF) circuit, a power supply, a flashlight, an external interface, a key, a motor, and other possible functional modules, which are not described in detail herein.
Based on the above description, the embodiments of the present application provide a method and an apparatus for measuring a spatial dimension of an object in an image, where the method can measure the spatial dimension of the object in the image. In the embodiment of the application, the method and the device are based on the same inventive concept, and because the principles of solving the problems of the method and the device are similar, the device and the method can be referred to each other, and repeated parts are not described again.
In the embodiments of the present application, the description takes the electronic device 100 as an example, but the method may also be executed by other types of computing devices. Fig. 2 is a flowchart of a method for measuring a spatial dimension of an object in an image according to an embodiment of the present application; the method may be executed by the electronic device 100 shown in fig. 1, for example by the processor 110 in the electronic device 100.
S201: the processor 110 acquires a first image. The processor 110 may acquire a first image through the camera 121. For example, when the user can press a shooting key when shooting an image, the electronic apparatus 100 equivalently receives a shooting instruction. In response to the shooting instruction, the camera 121 can obtain a first image. After obtaining the first image, the camera 121 may send the first image to the processor 110. The processor 110 may further image process or image optimize the first image, such as noise cancellation, white balance, color alignment or sharpening.
S202: the processor 110 identifies the first image to obtain N objects in the first image. Each of the N objects may include a set of pixel points, and the N objects include N sets of pixel points in total. Each of the N sets of pixels may include a plurality of pixels. The identified first image may be an image that has been subjected to image processing or image optimization by a processor.
The first image may include N objects, where N is an integer greater than or equal to 1, and an object of the N objects may be a table, a chair, a tree, or the like. The three-dimensional object corresponding to one of the N objects may be at least a part of an object in the three-dimensional environment space where the first image is located. For example, if the camera 121 captures only part of a table, the three-dimensional object corresponding to the (two-dimensional) object representing that (three-dimensional) table in the first image is the captured part of the table rather than the whole table. As another example, if the camera 121 captures the whole table, the three-dimensional object corresponding to the table in the first image is the whole table.
For example, the processor 110 may perform semantic segmentation on the first image using a deep neural network model to obtain the N objects in the first image. Specifically, the processor 110 performs semantic segmentation on the first image using the deep neural network to obtain an initial semantic segmentation result of the first image, where the initial semantic segmentation result includes N candidate pixel point sets for the N objects and a first pixel point set not for a specific object. Each candidate pixel point set is for one object and includes a plurality of first pixel points, and the first pixel point set includes a plurality of second pixel points.
If a candidate pixel point set is for one object, the semantic labels of the first pixel points included in that candidate pixel point set are that object; for example, if a candidate pixel point set is for a table, the semantic labels of the first pixel points it includes are all "table". That the first pixel point set is not for a specific object means that the second pixel points it includes could not be segmented effectively, that is, for any second pixel point, the semantic label of that second pixel point is none of the N objects.
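To make the structure of the initial semantic segmentation result concrete, the sketch below (an illustrative assumption; the patent does not prescribe this representation) groups a per-pixel label map into the N candidate pixel point sets and the first pixel point set of unlabeled pixels:

```python
import numpy as np

def label_map_to_sets(label_map, unlabeled=0):
    """Group a per-pixel label map into N candidate pixel point sets plus the first set.

    label_map : (H, W) integer array produced by a segmentation network, where the
    value `unlabeled` marks pixels that could not be assigned to a specific object.
    Returns ({label: (K, 2) array of (u, v) first pixel points},
             (L, 2) array of (u, v) second pixel points).
    """
    candidate_sets = {}
    first_set = np.empty((0, 2), dtype=int)
    for label in np.unique(label_map):
        vs, us = np.nonzero(label_map == label)
        pixels = np.stack([us, vs], axis=1)
        if label == unlabeled:
            first_set = pixels
        else:
            candidate_sets[label] = pixels
    return candidate_sets, first_set

# Example: a 4x4 label map with one object (label 1) and unlabeled pixels (label 0).
lm = np.array([[0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]])
sets, first = label_map_to_sets(lm)
print(len(sets[1]), len(first))  # 4 12
```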
The accuracy of the initial semantic segmentation result obtained by the processor 110 is related to the degree of convergence of the deep neural network model, the number of learning samples, and so on. The higher the degree of convergence, the higher the accuracy of the initial semantic segmentation result of the first image, and correspondingly the larger the amount of calculation; likewise, the more learning samples, the higher the accuracy of the initial semantic segmentation result of the first image, and correspondingly the larger the amount of calculation.
It should be noted that the deep neural network model may be a deep residual network (ResNet) model, a visual geometry group network (VGG) model, or a convolutional neural network model such as AlexNet, which is not limited in this embodiment of the present invention.
In practical applications, a larger amount of calculation means higher requirements on the hardware and a correspondingly higher manufacturing cost. Owing to factors such as hardware conditions and manufacturing cost, the degree of convergence of the deep neural network model cannot be raised without limit and the number of learning samples cannot be increased without limit, which means that the initial semantic segmentation result obtained by the deep neural network model may contain wrongly segmented and/or unsegmented pixels. Wrong segmentation means that pixels originally belonging to a first candidate pixel point set are segmented into a second candidate pixel point set, where the first and second candidate pixel point sets are any two of the N candidate pixel point sets; non-segmentation means that pixels originally belonging to the N candidate pixel point sets are segmented into the first pixel point set.
Further, because the processor 110 is limited by the degree of convergence of the deep neural network and the number of learning samples, the initial semantic segmentation result may contain wrongly segmented and/or unsegmented pixels. After obtaining the initial semantic segmentation result of the first image, the processor 110 may therefore optimize it to obtain an optimized semantic segmentation result in which the number of wrongly segmented and/or unsegmented pixels is reduced, which increases the accuracy of the semantic segmentation result and improves the accuracy of the spatial dimensions of the objects obtained by the subsequent measurement.
In one example, the processor 110 may optimize the initial semantic segmentation result by filtering. The processor 110 filters the initial semantic segmentation result to obtain a filtered semantic segmentation result; after filtering, obvious noise and abnormal pixel points have been removed, that is, the number of wrongly segmented pixel points in the filtered result is reduced. An abnormal pixel point is a pixel point in the semantically segmented first image whose semantic label is a first category but which lies among a plurality of pixel points whose semantic label is a second category. For example, the first image includes a first object corresponding to a table and a second object corresponding to a chair; after semantic segmentation, a pixel point labeled as a chair lies among a plurality of pixel points labeled as a table. Filtering the candidate pixel point set corresponding to the table and the candidate pixel point set corresponding to the chair re-segments this pixel point into the candidate pixel point set corresponding to the table, that is, its semantic label is changed from chair to table. It should be noted that filtering the initial semantic segmentation result may be implemented using existing techniques, and is not described here again.
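The patent leaves the filtering to existing techniques; one simple possibility, shown below as an assumption, is a majority (mode) filter over the label map, which re-segments isolated abnormal pixel points such as the single "chair" pixel in the example above:

```python
import numpy as np

def majority_filter(label_map):
    """Relabel each pixel with the majority label of its 3x3 neighbourhood.

    An isolated 'chair' pixel surrounded by 'table' pixels is re-segmented as
    'table', removing abnormal pixel points from the initial result.
    """
    h, w = label_map.shape
    padded = np.pad(label_map, 1, mode='edge')
    out = label_map.copy()
    for v in range(h):
        for u in range(w):
            window = padded[v:v + 3, u:u + 3].ravel()
            labels, counts = np.unique(window, return_counts=True)
            out[v, u] = labels[np.argmax(counts)]
    return out

# Example: one mislabeled pixel (2 = chair) inside a table region (1 = table).
lm = np.ones((5, 5), dtype=int)
lm[2, 2] = 2
print(majority_filter(lm)[2, 2])  # 1
```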
In another example, the processor 110 may optimize the initial semantic segmentation result by using a region growing method, that is, the processor 110 may perform region growing on N candidate pixel point sets in the initial semantic segmentation result, and add at least one second pixel point of a plurality of second pixel points in the first pixel point set to each candidate pixel point set in the N candidate pixel point sets after the region growing.
The process of performing region growing on the N candidate pixel point sets is described in detail below with reference to fig. 3. S301: the processor 110 obtains the j-th first pixel point in the i-th candidate pixel point set, where i is an integer from 1 to N, j is an integer from 1 to Mi, and Mi is the number of first pixel points in the i-th candidate pixel point set.
Optionally, the processor 110 may rank the N candidate pixel point sets by priority, where a higher priority means an earlier position in the order, so that the candidate pixel point set with the highest priority is the first of the N candidate pixel point sets on which region growing is performed. For example, the processor 110 may determine the priority according to the number of first pixel points in each candidate pixel point set: the more first pixel points, the higher the priority of the corresponding candidate pixel point set among the N candidate pixel point sets, and the fewer first pixel points, the lower its priority.
S302: the processor 110 obtains a plurality of pixel points, where a distance between a position in the first image and a position of the jth pixel point in the first image is less than or equal to a second preset threshold. For example, when the second preset threshold takes 1, it indicates that the acquired pixel point is adjacent to the jth pixel point. For example, the processor 110 may obtain 8 pixels located at the upper, lower, left, right, upper left corner, upper right corner, lower left corner, and lower right corner of the jth pixel, as shown in fig. 9, the jth pixel is a pixel a, and a plurality of pixels, whose distances between the position in the first image and the position of the pixel a in the first image are less than or equal to a second preset threshold, are a pixel B1, a pixel B2, a pixel B3, a pixel B4, a pixel B5, a pixel B6, a pixel B7, and a pixel B8, respectively. For another example, the processor 110 may obtain 4 pixels located at the upper, lower, left, and right of the jth pixel in the first image, as shown in fig. 10, where the jth pixel is a pixel a, and a plurality of pixels whose distances between the position in the first image and the position of the pixel a in the first image are less than or equal to a second preset threshold are a pixel B2, a pixel B4, a pixel B6, and a pixel B8, respectively.
S303: the processor 110 determines whether at least one of the plurality of pixels is a second pixel. If at least one of the plurality of pixel points is the second pixel point, executing S304; if there is no second pixel point in the plurality of pixel points, S307 is executed.
The processor 110 determines whether at least one of the plurality of pixel points is a second pixel point, in other words, the processor 110 determines whether at least one of the plurality of pixel points belongs to the first pixel point set. If at least one of the plurality of pixel points is the second pixel point, S306 is executed. If there is no second pixel point among the plurality of pixel points, S305 is performed.
Hereinafter, the description takes as an example the case in which one of the plurality of pixel points (denoted as the k-th pixel point) is a second pixel point, where k is an integer from 1 to M1, and M1 is the number of second pixel points in the first pixel point set.
It should be understood that, when at least two of the plurality of pixel points belong to the first pixel point set, the processor 110 may determine the similarity distance between each of the at least two pixel points and the j-th pixel point, and then perform the step shown in S305 for each of them. For example, if the pixel point at the upper left corner and the pixel point at the lower right corner of the j-th pixel point both belong to the first pixel point set, the processor 110 determines the similarity distance between the j-th pixel point and the pixel point at the upper left corner and the similarity distance between the j-th pixel point and the pixel point at the lower right corner, and then performs the step shown in S305 according to the two determined similarity distances. As another example, as shown in fig. 11, the plurality of pixel points include pixel point B1, pixel point B2, pixel point B3, pixel point B4, pixel point B5, pixel point B6, pixel point B7 and pixel point B8; among them, pixel points B2, B3, B4, B6, B7 and B8 are first pixel points, and pixel points B1 and B5 are second pixel points. The processor 110 may determine the similarity distance between pixel point A and pixel point B1 and the similarity distance between pixel point A and pixel point B5, and then perform the step shown in S305 according to the two determined similarity distances.
In S302 and S303, the processor 110 determines, based on the position of the j-th pixel point in the first image, whether at least one of the plurality of pixel points whose distance to that position is less than or equal to the second preset threshold belongs to the first pixel point set. This avoids pixel points that originally belong to one object being wrongly segmented to another object because of similar colors and/or similar depth values, which improves the semantic segmentation accuracy and makes the subsequently measured spatial dimensions more accurate.
S304: the processor 110 determines the similarity distance between the jth pixel point and the kth pixel point. The similarity distance can be used for indicating the color difference between the two pixel points, or the similarity distance can be used for indicating the depth value difference between the two pixel points, or the similarity distance can be used for indicating the color difference and the depth value difference between the two pixel points. The depth value is used to indicate a distance between the electronic device 100 and the photographed object.
For example, the processor 110 may determine the similarity distance between the jth pixel point and the kth pixel point according to the first information of the jth pixel point and the first information of the kth pixel point. Wherein the first information comprises depth information, or the first information comprises color information, or the first information comprises depth information and color information. The color information is used to indicate the color of the object, and the processor 110 may obtain the color information of each pixel point in the first image from the RGB image collected by the camera 121. The depth information is used to indicate a distance between the electronic device 100 and the photographed object, and the processor 110 may acquire the depth information of each pixel point in the first image through the TOF sensor.
In this embodiment of the application, the processor 110 may obtain depth information of each pixel point in the first image through a TOF sensor, a structured light sensor, a laser sensor, and the like, and further may obtain a depth image corresponding to the first image. It should be understood that, in the embodiment of the present application, any other manner (or camera) that can obtain depth information may also be used to obtain depth information, and the embodiment of the present application is not limited to this.
In one example, the similarity distance between the jth pixel point and the kth pixel point satisfies the following formula:
D = α·∑_{c=x,y,z} abs(p_j(c) − p_k(c)) + (1 − α)·∑_{c=r,g,b} abs(p_j(c) − p_k(c))  (formula 1)
Where D represents the similarity distance, α is a constant, p_j(c) and p_k(c) represent component c of the jth pixel point and of the kth pixel point respectively, ∑_{c=x,y,z}(·) represents the cumulative sum over the spatial components, ∑_{c=r,g,b}(·) represents the cumulative sum over the color components, and abs(·) represents the absolute value operation.
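As a non-authoritative illustration of formula 1, the following Python sketch computes the similarity distance from per-pixel spatial coordinates (x, y, z) and colors (r, g, b). The array layout, the helper name similarity_distance, and the value of α are assumptions made for the example, not details taken from the embodiment.

```python
import numpy as np

def similarity_distance(p_j, p_k, alpha=0.5):
    """Similarity distance of formula 1 (sketch).

    p_j, p_k: length-6 arrays [x, y, z, r, g, b] for the jth and kth pixel
    points, where (x, y, z) comes from the depth information and (r, g, b)
    from the color information; alpha is an assumed weighting constant.
    """
    p_j, p_k = np.asarray(p_j, float), np.asarray(p_k, float)
    spatial = np.abs(p_j[:3] - p_k[:3]).sum()   # cumulative sum over x, y, z
    color = np.abs(p_j[3:] - p_k[3:]).sum()     # cumulative sum over r, g, b
    return alpha * spatial + (1 - alpha) * color
```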
S305: the processor 110 determines whether the similarity distance between the jth pixel point and the kth pixel point is less than or equal to a first preset threshold. If the similarity distance between the jth pixel point and the kth pixel point is less than or equal to a first preset threshold, executing S306; if the similarity distance between the jth pixel point and the kth pixel point is greater than the first preset threshold, S307 is executed.
S306: the processor 110 adds the kth pixel point to the ith candidate pixel point set. In S304 to S306, the processor 110 may determine a similarity distance between the jth pixel point and the kth pixel point according to the first information of the jth pixel point and the first information of the kth pixel point, and add the kth pixel point to the ith candidate pixel point set when it is determined that the similarity distance is less than or equal to a first preset threshold, which means that a second pixel point in the first pixel point set that is not originally directed to the specific object is added to one of the N candidate pixel point sets directed to the N objects through the depth information and/or the color information, so as to improve the accuracy of semantic segmentation, and further enable the spatial dimension of subsequent measurement to be more accurate.
S307: processor 110 assigns j to (j + 1). S308: the processor determines whether the assigned j is greater than Mi. If the assigned j is greater than MiIf yes, go to S309; if assigned j is less than or equal to MiThen S301 is executed. S309: processor 110 assigns i to (i + 1). S310: the processor 110 determines whether the assigned i is greater than N. If the assigned i is larger than N, the process is ended; if the assigned i is less than or equal to N, S301 is performed.
In the process shown in fig. 3, the processor 110 obtains one object of the N objects according to the first information of a plurality of first pixel points in each candidate pixel point set of the N candidate pixel point sets and at least one second pixel point in the first pixel point set, where the pixel point set included in the object comprises the candidate pixel point set of the object and the at least one second pixel point. In this way, region growing can be performed on the N candidate pixel point sets obtained from the initial semantic segmentation result, so that at least one second pixel point that originally was not for any specific object is re-segmented into the pixel points of one of the N objects; that is, pixel points that were not successfully identified in the first image are identified again. This can improve the accuracy of the semantic segmentation result, and in turn the accuracy of the spatial dimension of the object obtained by subsequent measurement.
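For illustration only, the region-growing loop of S301 to S310 could be organized as in the sketch below. The data structures (per-object candidate index sets, a set of unassigned indices, per-pixel features and positions), the neighbourhood test, and the thresholds are assumptions of this sketch, not the exact implementation of the embodiment.

```python
import numpy as np

def region_grow(candidate_sets, unassigned, features, pos,
                first_thresh, second_thresh, alpha=0.5):
    """Grow N candidate pixel point sets with nearby unassigned pixels (sketch).

    candidate_sets: list of N sets of pixel indices (initial segmentation result).
    unassigned: set of pixel indices belonging to the first pixel point set.
    features: dict index -> np.array([x, y, z, r, g, b]) used by formula 1.
    pos: dict index -> (row, col) position of the pixel in the first image.
    """
    def sim(a, b):  # formula 1
        fa, fb = features[a], features[b]
        return (alpha * np.abs(fa[:3] - fb[:3]).sum()
                + (1 - alpha) * np.abs(fa[3:] - fb[3:]).sum())

    for cand in candidate_sets:                      # i = 1 .. N
        for j in list(cand):                         # j = 1 .. Mi
            for k in list(unassigned):               # candidate second pixel points
                too_far = np.hypot(pos[j][0] - pos[k][0],
                                   pos[j][1] - pos[k][1]) > second_thresh
                if too_far:
                    continue                         # S302 / S303
                if sim(j, k) <= first_thresh:        # S304 / S305
                    cand.add(k)                      # S306
                    unassigned.discard(k)
    return candidate_sets
```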
It should be noted that the processor 110 may optimize the initial semantic segmentation result by filtering, by region growing, or by a combination of filtering and region growing, which is not limited in this embodiment of the present application.
As an example, after obtaining the initial semantic segmentation result of the first image, the processor 110 may first filter N candidate pixel point sets in the initial semantic segmentation result to obtain the filtered N candidate pixel point sets, and then perform region growing on the filtered N candidate pixel point sets.
As an example, after obtaining the initial semantic segmentation result of the first image, the processor 110 may first perform region growing on N candidate pixel point sets in the initial semantic segmentation result to obtain N candidate pixel point sets after the region growing, and then perform filtering on the N candidate pixel point sets after the region growing.
As another example, after obtaining the initial semantic segmentation result of the first image, the processor 110 may optimize the initial semantic segmentation result by using filtering and region growing simultaneously. For example, the processor 110 may optimize a part of pixel point sets in the N candidate pixel point sets in the initial semantic segmentation result by using a filtering method, and meanwhile, the processor 110 may also optimize the remaining part of pixel point sets in the N candidate pixel point sets in the initial semantic segmentation result by using a region growing method.
S203: the processor 110 converts the N objects into N three-dimensional objects corresponding to the N objects. Wherein each three-dimensional object comprises a three-dimensional point cloud, and each three-dimensional object is at least one part of an object in the three-dimensional environment space where the first image is located. The processor 110 may convert the N objects into N three-dimensional objects corresponding to the N objects according to the depth information of each pixel point in the N candidate pixel point sets of the N objects.
It should be noted that, although the three-dimensional point cloud is a three-dimensional concept, and the pixel points of the N objects in the first image are two-dimensional concepts, when the depth value of a certain pixel point in the two-dimensional image is known, the two-dimensional coordinate of the pixel point can be converted into world coordinates (i.e., three-dimensional coordinates) in a three-dimensional space, so that the N three-dimensional point clouds corresponding to the N objects in the first image can be obtained according to the depth information. For example, the processor 110 may perform the conversion of the two-dimensional coordinates of the image into world coordinates by using a multi-view geometric algorithm, and the specific conversion manner and process are not limited.
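As one possible realization of the conversion described above, a pixel (u, v) with known depth can be back-projected through the camera intrinsics and the camera pose, as sketched below. The intrinsic matrix K and the pose (R, t) are assumed inputs; the multi-view geometric algorithm actually used may differ.

```python
import numpy as np

def pixel_to_world(u, v, depth, K, R, t):
    """Back-project an image pixel with known depth to world coordinates (sketch).

    K: 3x3 camera intrinsic matrix (assumed known from calibration).
    R, t: camera-to-world rotation (3x3) and translation (3,) of the camera pose.
    depth: depth value of the pixel along the camera's Z axis.
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # normalized camera ray
    p_cam = ray * depth                              # point in camera coordinates
    return R @ p_cam + t                             # point in world coordinates
```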
S204: the processor 110 determines spatial dimensions of N objects according to a first reference plane and N three-dimensional objects in a three-dimensional environment space, the spatial dimensions of each of the N objects including at least one of: a distance of at least one surface of the object from the first reference plane, or a three-dimensional dimension of the object, the at least one surface being parallel to the first reference plane. The processor 110 determines the spatial dimensions of N objects corresponding to the N objects in the first image with reference to the first reference plane. Wherein the processor 110 may obtain the plane equation corresponding to the first reference plane from the memory 130.
As an example, the first reference plane may be a plane in which the ground is located. The ground area is large and easy to identify, and most objects in the three-dimensional environment space are located above the ground, which means that the heights of most objects can be determined by the distance between one surface of the object and the ground, so that the ground is used as a reference plane to measure the spatial dimension of the object, the distance between one surface of the object and the ground, namely the height of the object, can be accurately measured, a user does not need to select the reference plane through manual interaction, the operation is convenient and fast, and the user experience can be improved.
Taking the first reference plane as the ground as an example, the processor 110 obtains a second image including the ground before obtaining the first image, performs semantic segmentation on the second image to obtain a pixel point set of the ground, obtains a three-dimensional point cloud corresponding to the ground based on the depth information, and obtains a plane equation corresponding to the ground based on a random sampling consistency estimation method. It should be noted that obtaining a plane equation from an image can be implemented by using the prior art, and the implementation method and process thereof are not described herein again.
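As a hedged illustration of the random sampling consistency estimation mentioned above, the sketch below fits a plane of the form Ax + By + Cz = 1 to the ground point cloud with a basic RANSAC loop; the iteration count and inlier threshold are illustrative values rather than parameters of the embodiment.

```python
import numpy as np

def ransac_ground_plane(points, iters=200, inlier_thresh=0.02):
    """Fit a plane A*x + B*y + C*z = 1 to a 3D point cloud with RANSAC (sketch).

    points: (M, 3) array of ground points obtained from the segmented ground pixels.
    Returns the coefficient vector n = [A, B, C].
    """
    rng = np.random.default_rng(0)
    best_n, best_count = None, -1
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        try:
            n = np.linalg.solve(sample, np.ones(3))   # plane through the 3 sampled points
        except np.linalg.LinAlgError:
            continue                                  # degenerate (collinear) sample
        dist = np.abs(points @ n - 1.0) / np.linalg.norm(n)
        count = int((dist < inlier_thresh).sum())
        if count > best_count:
            best_n, best_count = n, count
    return best_n
```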
In the following description, the first reference plane is described as satisfying Ax + By + Cz = 1, where A, B and C are known constants. The processor 110 determines a spatial dimension of one of the N objects, the spatial dimension of each object including at least one of: a three-dimensional dimension of the object, or a distance of at least one surface of the object to the first reference plane. How to determine the spatial dimension of an object in an image is described below with reference to fig. 4, 5, and 6.
Example 1: the N objects include a first object corresponding to a first three-dimensional object including a first three-dimensional point cloud, the first three-dimensional object being a first object in a three-dimensional environmental space. Please refer to fig. 4 for the process of determining the spatial dimension of the first object.
S41: the processor 110 projects the first three-dimensional point cloud onto the first reference plane to obtain a first projection area of the first three-dimensional point cloud on the first reference plane.
As an example, the first three-dimensional point cloud includes two first three-dimensional points, and after the processor 110 projects the two first three-dimensional points onto the first reference plane, a first projection area of the two first three-dimensional points in the first reference plane is obtained, as shown in fig. 12, where a cube represents one three-dimensional point.
Taking a first three-dimensional point in the first three-dimensional point cloud as an example, the processor 110 determines a first distance d from the first three-dimensional point (denoted as p) to the first reference plane. Alternatively, the first distance may satisfy the following formula:
d = (n·p − 1)/‖n‖  (formula 2)
where d denotes the first distance, n denotes the normal vector of the first reference plane, i.e., n = [A B C], and p denotes the three-dimensional coordinates of the first three-dimensional point in the world coordinate system.
The processor 110 may determine a projection point (denoted as p1) of the first three-dimensional point p on the first reference plane according to the first distance from the first three-dimensional point p to the first reference plane. Optionally, the projection point p1 may satisfy the following formula:
p1 = p − d·n/‖n‖  (formula 3)
where p1 represents the three-dimensional coordinates, in the world coordinate system, of the projection point of the first three-dimensional point p on the first reference plane, d represents the first distance, n represents the normal vector of the first reference plane, i.e., n = [A B C], and p represents the three-dimensional coordinates of the first three-dimensional point in the world coordinate system.
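Formulas 2 and 3 could be written out directly as below, assuming the plane is stored in the Ax + By + Cz = 1 form described above; the signed-distance convention is an assumption of this sketch.

```python
import numpy as np

def distance_and_projection(p, n):
    """Distance from point p to the plane n·x = 1 and its projection point (sketch).

    p: (3,) world coordinates of a first three-dimensional point.
    n: (3,) plane coefficients [A, B, C] of the first reference plane.
    """
    d = (p @ n - 1.0) / np.linalg.norm(n)      # formula 2 (signed first distance)
    p1 = p - d * n / np.linalg.norm(n)         # formula 3 (projection point on the plane)
    return d, p1
```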
The processor 110 performs dimension reduction processing on the three-dimensional first reference plane to obtain a two-dimensional plane. Taking as an example that a three-dimensional first reference plane is converted into a two-dimensional plane, the two-dimensional plane is a plane formed by an X axis and a Y axis (referred to as an X0Y plane), wherein the X0Y plane represents a two-dimensional plane perpendicular to the Z axis.
Specifically, the processor 110 may determine a rotation vector and a rotation angle of the three-dimensional first reference plane and the X0Y plane according to a normal vector of the three-dimensional first reference plane and a normal vector of the X0Y plane.
Optionally, the rotation vector of the three-dimensional first reference plane and the X0Y plane may satisfy the following formula:
n_r = (n × n_z)/‖n × n_z‖  (formula 4)
where n_r represents the rotation vector, n represents the normal vector of the first reference plane, i.e., n = [A B C], and n_z represents the normal vector of the X0Y plane, i.e., n_z = [0 0 1]. Equivalently, since n × n_z = (n·n_y)·n_x − (n·n_x)·n_y, the rotation vector can be expressed through n_x = [1 0 0], the normal vector of the Y0Z plane, and n_y = [0 1 0], the normal vector of the X0Z plane, where the X0Z plane is the plane formed by the X axis and the Z axis and is perpendicular to the Y axis, and the Y0Z plane is the plane formed by the Y axis and the Z axis and is perpendicular to the X axis.
Alternatively, the rotation angle of the three-dimensional first reference plane and the X0Y plane may satisfy the following formula:
θ = arccos( (n·n_z)/(‖n‖·‖n_z‖) )  (formula 5)
where θ represents the rotation angle, n_z represents the normal vector of the X0Y plane, i.e., n_z = [0 0 1], and n represents the normal vector of the first reference plane, i.e., n = [A B C].
Further, the processor 110 may determine a transformation matrix for transforming the three-dimensional first reference plane into the X0Y plane according to the rotation vector and the rotation angle of the three-dimensional first reference plane and the X0Y plane.
Optionally, the transformation matrix may satisfy the following formula:
H = cos θ·I + C1·n_r·n_rᵀ + sin θ·[n_r]_×  (formula 6)
wherein
C1 = 1 − cos θ
H denotes the transformation matrix (a rotation matrix built from the rotation vector n_r and the rotation angle θ in the manner of Rodrigues' rotation formula), I denotes the 3×3 identity matrix, [n_r]_× denotes the skew-symmetric cross-product matrix of n_r, and C1 represents a constant.
After obtaining the transformation matrix, the processor 110 may determine a transformation point (denoted as p2) of the projection point on the X0Y plane according to the transformation matrix.
Optionally, the transformation point may satisfy the following formula:
p2 = H·p1  (formula 7)
where p2 represents the transformation point, H represents the transformation matrix, and p1 represents the projection point.
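One way to realize formulas 4 to 7 is the Rodrigues-style rotation sketched below, which builds a rotation H mapping the first reference plane onto the X0Y plane and then applies it to the projection points. The helper names and the handling of the degenerate (already parallel) case are assumptions of this sketch, not the patent's exact construction.

```python
import numpy as np

def plane_to_xy_rotation(n):
    """Rotation matrix H that maps the plane with normal n onto the X0Y plane (sketch)."""
    n_z = np.array([0.0, 0.0, 1.0])                # normal vector of the X0Y plane
    n_unit = n / np.linalg.norm(n)
    axis = np.cross(n_unit, n_z)                   # rotation vector (formula 4, unnormalized)
    s = np.linalg.norm(axis)                       # sin(theta)
    c = float(n_unit @ n_z)                        # cos(theta), cf. formula 5
    if s < 1e-9:                                   # plane already parallel to X0Y
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    k = axis / s                                   # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])             # skew-symmetric matrix of the axis
    return np.eye(3) + s * K + (1 - c) * (K @ K)   # Rodrigues form, cf. formula 6

def to_xy_plane(projection_points, n):
    """Transformation points of formula 7: p2 = H @ p1 for every projection point."""
    H = plane_to_xy_rotation(n)
    return (H @ np.asarray(projection_points).T).T
```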
The processor 110 may determine a plurality of transformation points of a plurality of first three-dimensional points in the first three-dimensional point cloud on the X0Y plane according to formulas 2 to 7, determine a first minimum bounding rectangle containing the plurality of transformation points by using a minimum area bounding rectangle (minAreaRect) function, and denote the four vertices of the first minimum bounding rectangle as q2(1), q2(2), q2(3) and q2(4) (collectively, q2).
the processor 110 may determine four vertices of a second minimum bounding rectangle containing the first projection region according to the four vertices of the first minimum bounding rectangle.
Optionally, the vertices of the second circumscribed rectangle may satisfy the following formula:
q1 = H⁻¹·q2  (formula 8)
where H⁻¹ represents the inverse of the transformation matrix, q1 represents a vertex of the second minimum bounding rectangle, and q2 represents the corresponding vertex of the first minimum bounding rectangle.
S42: the processor 110 determines a plurality of first distances from a plurality of first three-dimensional points in the first three-dimensional point cloud to the first reference plane. Specifically, the processor 110 may determine a plurality of first distances from a plurality of first three-dimensional points in the first three-dimensional point cloud to the first reference plane according to equation 2.
S43: the processor 110 determines a three-dimensional size of the first object according to the first projection area and the plurality of first distances.
Specifically, the processor 110 determines a maximum distance among the plurality of first distances, determines a first minimum bounding box containing the first three-dimensional point cloud according to the maximum distance and the second minimum bounding rectangle containing the first projection region, and then determines the length, width, height, volume, and the like of the first object according to the first minimum bounding box.
The processor 110 may determine the four vertices of the top surface of the first minimum bounding box based on the four vertices of the second minimum bounding rectangle and the maximum distance. The top surface is the upper surface of the first minimum bounding box and is parallel to the first reference plane.
Optionally, the vertex of the first minimum bounding box top surface may satisfy the following formula:
q = q1 + d_max·n/‖n‖  (formula 9)
where q represents a vertex of the top surface of the first minimum bounding box, d_max denotes the maximum distance, q1 represents a vertex of the second minimum bounding rectangle, and n represents the normal vector of the first reference plane, i.e., n = [A B C].
In the above embodiment 1, the processor 110 may obtain a first projection area of the first three-dimensional point cloud on the first reference plane by projecting the first three-dimensional point cloud onto the first reference plane, determine a plurality of first distances from a plurality of first three-dimensional points in the first three-dimensional point cloud to the first reference plane, and determine a length, a width, and a height of the first object corresponding to the first three-dimensional point cloud according to the first projection area and the plurality of first distances. The reference plane may be acquired from the memory 130, and the processor 110 may obtain the length, the width, and the height of the first object in the first image according to the first projection area of the first three-dimensional point cloud corresponding to the first object on the first reference plane and the plurality of first distances without manual selection by the user, so that the measurement efficiency may be improved, and the user experience may be improved.
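Putting the pieces of embodiment 1 together, a sketch of the length/width/height computation could look as follows; it reuses plane_to_xy_rotation from the previous sketch and OpenCV's minAreaRect for the first minimum bounding rectangle, and the function name object_dimensions is illustrative.

```python
import numpy as np
import cv2

def object_dimensions(points, n):
    """Length, width and height of an object point cloud relative to the plane n·x = 1 (sketch)."""
    norm_n = np.linalg.norm(n)
    dists = (points @ n - 1.0) / norm_n                 # formula 2 for every first 3D point
    proj = points - dists[:, None] * (n / norm_n)       # formula 3: projection points
    H = plane_to_xy_rotation(n)                         # helper from the previous sketch
    rot = (H @ proj.T).T                                # formula 7 applied to all projection points
    flat = rot[:, :2].astype(np.float32)                # 2D coordinates in the X0Y plane
    z_plane = float(rot[:, 2].mean())                   # rotated reference plane has constant z
    rect = cv2.minAreaRect(flat)                        # first minimum bounding rectangle
    side_a, side_b = rect[1]
    height = float(dists.max())                         # maximum first distance
    q2 = np.c_[cv2.boxPoints(rect), np.full(4, z_plane)]
    q1 = (np.linalg.inv(H) @ q2.T).T                    # formula 8: vertices of the second rectangle
    q_top = q1 + height * n / norm_n                    # formula 9: vertices of the box top surface
    return max(side_a, side_b), min(side_a, side_b), height, q1, q_top
```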
Example 2: the N objects include a second object corresponding to a second three-dimensional object including a second three-dimensional point cloud, the second three-dimensional object being a second object in the three-dimensional environment space. Please refer to fig. 5 for the process of determining the spatial dimension of the second object.
S51: the processor 110 determines a third three-dimensional point cloud in the second three-dimensional point cloud corresponding to a first surface of the second object, the first surface being parallel to the first reference plane.
After obtaining the second three-dimensional point cloud corresponding to the second object, the processor 110 determines a third three-dimensional point cloud corresponding to the first surface of the second object from the second three-dimensional point cloud, that is, the processor 110 may identify a planar point cloud from the second three-dimensional point cloud.
As an example, the second object is a cuboid, and the processor 110 performs three-dimensional point cloud conversion on the second object to obtain a second three-dimensional point cloud corresponding to the second object, as shown in fig. 13, where one cube is a three-dimensional point, and the second three-dimensional point cloud includes 18 second three-dimensional points; the first surface is an upper surface of the second object parallel to the first reference plane, and the processor 110 identifies a third three-dimensional point cloud corresponding to the first surface from the second three-dimensional point cloud, as shown in fig. 14, where the third three-dimensional point cloud is parallel to the first reference plane, and the third three-dimensional point cloud includes 6 second three-dimensional points.
It should be noted that, identifying a planar point cloud from a three-dimensional point cloud may be implemented by using the prior art, which is not limited in the embodiment of the present application.
S52: the processor 110 determines a plurality of second distances from a plurality of second three-dimensional points in the third three-dimensional point cloud to the first reference plane. Specifically, the processor 110 may determine a plurality of second distances from a plurality of second three-dimensional points in the third three-dimensional point cloud to the first reference plane by formula 2.
S53: the processor 110 determines a distance of the first surface from the first reference plane based on the plurality of second distances. The processor 110 may perform an arithmetic mean calculation on the plurality of second distances to obtain a mean value of the plurality of second distances, where the mean value is a distance from the first surface to the first reference plane; alternatively, the processor 110 may perform a weighted average on the plurality of second distances to obtain a weighted average of the plurality of second distances, where the weighted average is a distance from the first surface to the first reference plane; the embodiments of the present application do not limit this.
In the above embodiment 2, the processor 110 determines the distance from the first surface to the first reference plane by determining the third three-dimensional point cloud corresponding to the first surface of the second object from the second three-dimensional point cloud, where the first surface is parallel to the first reference plane, then determining a plurality of second distances from a plurality of second three-dimensional points in the third three-dimensional point cloud to the first reference plane, and then determining the distance from the first surface to the first reference plane according to the plurality of second distances. The reference plane may be obtained from the memory 130, and the processor 110 may derive the distance between the first surface of the second object and the first reference plane according to the plurality of second distances without user's selection, so as to improve user experience.
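A minimal sketch of embodiment 2, assuming the plane-parallel surface's point cloud has already been identified, reduces to averaging the per-point distances of formula 2:

```python
import numpy as np

def surface_to_plane_distance(surface_points, n, weights=None):
    """Distance from a plane-parallel surface to the reference plane n·x = 1 (sketch)."""
    d = np.abs(surface_points @ n - 1.0) / np.linalg.norm(n)   # formula 2 per second 3D point
    return float(np.average(d, weights=weights))               # arithmetic or weighted mean
```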
Example 3: the N objects include a third object corresponding to a third three-dimensional object including a fourth three-dimensional point cloud, the third three-dimensional object being a portion of a third object in the three-dimensional environmental space. Please refer to fig. 6 for the process of determining the spatial dimension of the third object.
S61: the processor 110 obtains a semantic map that is a three-dimensional image that includes a three-dimensional environmental space. The processor 110 may obtain and store a semantic map of a three-dimensional environment space according to a semantic instant positioning and mapping (SLAM) technique.
As an example, the processor 110 may obtain a semantic map corresponding to the three-dimensional environment space in which the first image is located from the memory 130. It should be noted that, the processor 110 may obtain the semantic map according to the semantic SLAM technology, or may obtain the semantic map by using other existing technologies, which is not limited in the embodiment of the present application.
S62: the processor 110 determines a fifth three-dimensional point cloud corresponding to the third object according to the semantic map and the fourth three-dimensional point cloud. At S62, the processor 110 may aggregate the semantic relevance to determine a fifth three-dimensional point cloud corresponding to the third object in the semantic map according to the semantics of the fourth three-dimensional point cloud. For example, the processor 110 may determine the fifth three-dimensional point cloud corresponding to the third object in the semantic map according to the semantic of the fourth three-dimensional point cloud by using a semantic clustering method, where a specific implementation process of the semantic clustering method may be implemented by using the prior art, and the embodiment of the present application does not limit this. As an example, the third object is a cube in shape, the camera 140 acquires only a part of the third object, the part is marked as a third three-dimensional object, and the processor 110 performs three-dimensional point cloud conversion on the object in the first image to obtain a fourth three-dimensional point cloud corresponding to the third three-dimensional object, as shown in fig. 15, a cube represents a third three-dimensional point, the fourth three-dimensional point cloud includes 18 third three-dimensional points, and the shape of the fourth three-dimensional point cloud is a cuboid; then, the processor 110 obtains a fifth three-dimensional point cloud corresponding to the third object from the semantic map according to the fourth three-dimensional point cloud, as shown in fig. 16, the fifth three-dimensional point cloud includes 27 third three-dimensional points, and the shape of the fifth three-dimensional point cloud is a cube.
S63: the processor 110 projects the fifth three-dimensional point cloud to the first reference plane to obtain a second projection area of the fifth three-dimensional point cloud on the first reference plane. The processor 110 may obtain the second projection area of the fifth three-dimensional point cloud on the first reference plane through formulas 2 to 7, and for a specific embodiment, refer to the embodiment of obtaining the first projection area in embodiment 1, which is not described herein again.
S64: the processor 110 determines a plurality of third distances from a plurality of third three-dimensional points in the fifth three-dimensional point cloud to the first reference plane. The processor 110 may determine a plurality of third distances from the plurality of third three-dimensional point clouds in the fifth three-dimensional point cloud to the first reference plane by equation 2.
S65: the processor 110 determines a three-dimensional size of a third object according to the second projection area and the plurality of third distances. In the above embodiment 3, when the third three-dimensional object corresponding to the third object is a part of a third object in the three-dimensional environment space, the processor 110 determines a fifth three-dimensional point cloud of the third object through the semantic map and a fourth three-dimensional point cloud included in the third three-dimensional object, and then obtains the three-dimensional size of the third object according to the second projection area of the fifth three-dimensional point cloud projected onto the first reference plane and the third distances from the third three-dimensional points in the fifth three-dimensional point cloud to the first reference plane. Although the first image only comprises partial content of the third object, the fifth three-dimensional point cloud corresponding to the third object can be obtained through the three-dimensional image of the three-dimensional environment space and the fourth three-dimensional point cloud, and then the length, width and height of the third object can be automatically measured according to the fifth three-dimensional point cloud and the first reference plane, so that the measurement of the space dimension of the object in the first image is completed.
S205: the processor 110 sends the spatial dimensions of the N objects to the display device 140 to cause the display device 140 to display the spatial dimensions of the N objects in the display panel 141. The processor 110 may send the spatial dimensions of the N objects to the display device 140 to cause the display device 140 to display the spatial dimensions of the N objects in the display panel 141; alternatively, the processor 110 may send the spatial dimension of one of the N objects to the display device 140, so that the display device 140 displays the spatial dimension of the object in the display panel 141; alternatively, in response to a focus instruction sent by the user, the focus instruction being used to instruct to display only the spatial dimension of the focus object in the first image, the processor 110 may send the spatial dimension of the focus object to the display device 140 to cause the display device 140 to display the spatial dimension of the focus object in the display panel 141.
Taking the example that the display panel 141 only displays the spatial dimension of one object of the N objects, the first reference plane is the ground, and the N objects include chairs, the processor 110 may determine the three-dimensional size of the chair according to the method flow described in the above embodiment 1, or the processor 110 may determine the three-dimensional size of the chair according to the method flow shown in fig. 4, that is, the height is 0.45 m, the length is 0.76 m, and the width is 0.56 m, and the specific implementation manner may refer to the flow shown in the foregoing embodiment 1 or fig. 4, which is not described again; the processor 110 then sends the determined three-dimensional size of the chair to the display device 140; the display device 140 displays the three-dimensional size of the chair on the display panel 141 as shown in fig. 7 a.
Taking the example that the display panel 141 only displays the spatial dimension of one object of the N objects, the first reference plane is the ground, and the N objects include the ceiling, the processor 110 may determine that the distance from the lower surface of the ceiling parallel to the ground is 3.22 meters according to the method flow shown in the above embodiment 2, or the processor 110 may determine that the distance from the lower surface of the ceiling parallel to the ground is 3.22 meters according to the method flow shown in fig. 5, and the specific implementation manner may refer to the flow shown in the above embodiment 2 or fig. 5, which is not described again; the processor 110 then sends the determined distance to the display device 140; the display device 140 displays the distance from the lower surface of the ceiling to the floor on the display panel 141 as shown in fig. 7 b.
Taking the example that the display panel 141 only displays the spatial dimension of one object of the N objects, the first reference plane is the ground, and the N objects include chairs, the three-dimensional object corresponding to the first image is a part of a chair, the processor 110 may determine the three-dimensional dimensions of the chair, that is, the height is 0.45 m, the length is 0.76 m, and the width is 0.56 m, according to the method flow shown in fig. 6, and the specific implementation manner may refer to the flow shown in the foregoing embodiment 3 or fig. 6, which is not described again; the processor 110 then sends the determined three-dimensional size of the chair to the display device 140; the display device 140 displays the three-dimensional size of the chair on the display panel 141 as shown in fig. 7 c.
In a possible implementation, after acquiring the first image, the processor 110 may determine a difference between a first pose at which the camera 121 acquires the first image and a second pose at which the camera 121 acquires a third image; if the difference is greater than or equal to a fourth preset threshold, the processor 110 determines that the electronic device 100 is in a motion state; if the difference is less than the fourth preset threshold, the processor 110 determines that the electronic device 100 is in a stationary state. The third image is the frame previous to the first image, and the pose may be determined by the SLAM technique or by the sensor 122, which is not limited in this embodiment of the application.
Further, when the electronic device 100 is in a static state, the processor 110 executes the process shown in fig. 2 to obtain spatial dimensions of N objects corresponding to the N objects in the first image. While the electronic device 100 is in motion, the processor 110 performs semantic segmentation on the first image and then reconstructs a semantic map according to the result of the semantic segmentation. It should be understood that the semantic map reconstruction may be implemented by using the prior art, and the embodiment of the present application is not limited thereto.
In the above embodiment, when the electronic device 100 is in the static state, the processor 110 only determines the spatial dimension of the object in the first image, and since the quality of the image acquired in the static state is better than that of the image in the moving state, the accuracy rate of measuring the spatial dimension of the object in the image in the static state is higher. For example, images acquired in a motion state may have problems of smearing and blurring, which may reduce the accuracy of semantic segmentation results, thereby reducing the accuracy of measuring the spatial dimension of an object. And the three-dimensional environment space where the first image is located is not changed in the static state, so that semantic map reconstruction can be carried out without repetition, and the calculation amount can be reduced. When the electronic device 100 is in a moving state, the processor 110 performs semantic map reconstruction only according to the first image, so as to prepare for measurement of the spatial dimension of the object in a static state, and improve measurement efficiency and measurement accuracy.
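The static-versus-motion decision described above could be sketched as a pose-difference check; how translation and rotation are combined into a single difference, and the threshold value, are assumptions of this sketch.

```python
import numpy as np

def is_static(pose_prev, pose_curr, fourth_thresh=0.01):
    """Decide whether the device is static by comparing two camera poses (sketch).

    pose_prev, pose_curr: 4x4 homogeneous camera poses for the third and first images.
    fourth_thresh: illustrative value for the fourth preset threshold.
    """
    delta = np.linalg.inv(pose_prev) @ pose_curr
    trans = np.linalg.norm(delta[:3, 3])                  # translation change
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1) / 2, -1.0, 1.0)
    angle = float(np.arccos(cos_angle))                   # rotation change (radians)
    return (trans + angle) < fourth_thresh
```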
In the above embodiments of the application, N objects are obtained by identifying the first image, the N objects are then converted into N three-dimensional objects, where each three-dimensional object is at least a part of an object in the three-dimensional environment space where the first image is located, and the spatial dimensions of the N objects are then measured with the first reference plane as a reference. Compared with prior-art approaches in which object measurement can be completed only with user participation, the method and the device for measuring the spatial dimension of an object can identify the N objects from the first image and automatically measure, with the first reference plane as a reference, the spatial dimensions of the N objects corresponding to the N objects in the first image, so that some measurement tasks that are difficult for a user to complete can be finished without user participation. In addition, the spatial dimension of an object in an image can be determined from the acquired image; compared with prior-art schemes in which measurement needs to be performed multiple times from different angles, measurement efficiency can be improved.
For the above method flow, the embodiment of the present application further provides a device for measuring a spatial dimension of an object in an image, and specific implementation of the device can be referred to the above method flow. Based on the same inventive concept, the embodiment of the present application further provides a device for measuring a spatial dimension of an object in an image, which may be the processor 110 shown in fig. 1, and which may be configured to perform the processes shown in fig. 2 to fig. 6. Referring to fig. 8, the apparatus 800 includes an identification unit 801, a conversion unit 802, and a processing unit 803.
The identification unit 801 is configured to identify a first image to obtain N objects in the first image, where N is an integer greater than or equal to 1, and each object in the N objects includes a pixel point set, and the pixel point set includes a plurality of pixel points.
The conversion unit 802 is configured to convert the N objects into N three-dimensional objects corresponding to the N objects, where each three-dimensional object includes a three-dimensional point cloud and is at least a part of an object in a three-dimensional environment space where the first image is located;
the processing unit 803 is configured to determine spatial dimensions of N objects according to the first reference plane in the three-dimensional environment space and the N three-dimensional objects, where the spatial dimension of each of the N objects includes at least one of: a distance of at least one surface of the object to the first reference plane, or a three-dimensional dimension of the object, the at least one surface being parallel to the first reference plane.
In one possible design, the N objects include a first object corresponding to a first three-dimensional object including a first three-dimensional point cloud, the first three-dimensional object being a first object in the three-dimensional environment space; the processing unit 803 is specifically configured to: projecting the first three-dimensional point cloud onto the first reference plane to obtain a first projection area of the first three-dimensional point cloud on the first reference plane; determining a plurality of first distances from a plurality of first three-dimensional points in the first three-dimensional point cloud to the first reference plane; determining a three-dimensional size of the first object based on the first projection area and the plurality of first distances.
In one possible design, the N objects include a second object corresponding to a second three-dimensional object including a second three-dimensional point cloud, the second three-dimensional object being a second object in the three-dimensional environment space; the processing unit 803 is specifically configured to: determining a third three-dimensional point cloud corresponding to a first surface of the second object in the second three-dimensional point cloud, the first surface being parallel to the first reference plane; determining a plurality of second distances from a plurality of second three-dimensional points in the third three-dimensional point cloud to the first reference plane; determining a distance from the first surface to the first reference plane based on the plurality of second distances.
In one possible design, the N objects include a third object corresponding to a third three-dimensional object including a fourth three-dimensional point cloud, the third three-dimensional object being part of a third object in the three-dimensional environment space; the processing unit 803 is specifically configured to: obtaining a semantic map, wherein the semantic map is a three-dimensional image comprising the three-dimensional environment space; determining a fifth three-dimensional point cloud corresponding to the third object according to the semantic map and the fourth three-dimensional point cloud; projecting the fifth three-dimensional point cloud onto the first reference plane to obtain a second projection area of the fifth three-dimensional point cloud on the first reference plane; determining a plurality of third distances from a plurality of third three-dimensional points in the fifth three-dimensional point cloud to the first reference plane; determining a three-dimensional size of the third object according to the second projection area and the plurality of third distances.
In one possible design, the identification unit 801 is specifically configured to: performing semantic segmentation on the first image to obtain N candidate pixel point sets aiming at the N objects and a first pixel point set not aiming at a specific object; adding at least one second pixel point in the plurality of second pixel points in each candidate pixel point set according to the first information of the plurality of first pixel points in each candidate pixel point set in the N candidate pixel point sets and the first information of the plurality of second pixel points in the first pixel point set to obtain one object in the N objects, wherein the pixel point set included in the object comprises the candidate pixel point set of the object and the at least one second pixel point; wherein the first information comprises at least one of: depth information or color information.
In a possible design, a similarity distance between each second pixel point of the at least one second pixel point and at least one first pixel point of the candidate pixel point set of the object is less than or equal to a first preset threshold, and the similarity distance between any second pixel point and any first pixel point is obtained from the first information of any second pixel point and the first information of any first pixel point.
In a possible design, a distance between a position of each second pixel point in the at least one second pixel point in the first image and a position of at least one first pixel point in the candidate pixel point set of the object in the first image is less than or equal to a second preset threshold.
In one possible design, the first reference plane is the ground.
It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation. The functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, a form of a software functional unit, or a form of a combination of software and hardware.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the specific implementation of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (19)

  1. A method for measuring the spatial dimension of an object in an image is characterized by comprising the following steps:
    identifying a first image to obtain N objects in the first image, wherein N is an integer greater than or equal to 1, each object in the N objects comprises a pixel point set, and the pixel point set comprises a plurality of pixel points;
    converting the N objects into N three-dimensional objects corresponding to the N objects, wherein each three-dimensional object comprises a three-dimensional point cloud and is at least one part of an object in a three-dimensional environment space where the first image is located;
    determining spatial dimensions of N objects according to a first reference plane and the N three-dimensional objects in the three-dimensional environment space, the spatial dimensions of each of the N objects including at least one of: a distance of at least one surface of the object to the first reference plane, or a three-dimensional dimension of the object, the at least one surface being parallel to the first reference plane.
  2. The method of claim 1, wherein the N objects comprise a first object corresponding to a first three-dimensional object comprising a first three-dimensional point cloud, the first three-dimensional object being a first object in the three-dimensional environmental space;
    the determining spatial dimensions of the N objects from the first reference plane and the N three-dimensional objects in the three-dimensional environment space comprises:
    projecting the first three-dimensional point cloud onto the first reference plane to obtain a first projection area of the first three-dimensional point cloud on the first reference plane;
    determining a plurality of first distances from a plurality of first three-dimensional points in the first three-dimensional point cloud to the first reference plane;
    determining a three-dimensional size of the first object based on the first projection area and the plurality of first distances.
  3. The method of claim 1 or 2, wherein the N objects comprise a second object corresponding to a second three-dimensional object comprising a second three-dimensional point cloud, the second three-dimensional object being a second object in the three-dimensional environment space;
    the determining spatial dimensions of the N objects from the first reference plane and the N three-dimensional objects in the three-dimensional environment space comprises:
    determining a third three-dimensional point cloud corresponding to a first surface of the second object in the second three-dimensional point cloud, the first surface being parallel to the first reference plane;
    determining a plurality of second distances from a plurality of second three-dimensional points in the third three-dimensional point cloud to the first reference plane;
    determining a distance from the first surface to the first reference plane based on the plurality of second distances.
  4. The method of any one of claims 1-3, wherein the N objects comprise a third object corresponding to a third three-dimensional object comprising a fourth three-dimensional point cloud, the third three-dimensional object being part of a third object in the three-dimensional environment space;
    the determining spatial dimensions of the N objects from the first reference plane and the N three-dimensional objects in the three-dimensional environment space comprises:
    obtaining a semantic map, wherein the semantic map is a three-dimensional image comprising the three-dimensional environment space;
    determining a fifth three-dimensional point cloud corresponding to the third object according to the semantic map and the fourth three-dimensional point cloud;
    projecting the fifth three-dimensional point cloud onto the first reference plane to obtain a second projection area of the fifth three-dimensional point cloud on the first reference plane;
    determining a plurality of third distances from a plurality of third three-dimensional points in the fifth three-dimensional point cloud to the first reference plane;
    determining a three-dimensional size of the third object according to the second projection area and the plurality of third distances.
  5. The method according to any one of claims 1-4, wherein the identifying the first image to obtain N objects in the first image comprises:
    performing semantic segmentation on the first image to obtain N candidate pixel point sets aiming at the N objects and a first pixel point set not aiming at a specific object;
    adding at least one second pixel point in the plurality of second pixel points in each candidate pixel point set according to the first information of the plurality of first pixel points in each candidate pixel point set in the N candidate pixel point sets and the first information of the plurality of second pixel points in the first pixel point set to obtain one object in the N objects, wherein the pixel point set included in the object comprises the candidate pixel point set of the object and the at least one second pixel point;
    wherein the first information comprises at least one of: depth information or color information.
  6. The method according to claim 5, wherein a similarity distance between each of the at least one second pixel and at least one first pixel in the candidate pixel set of the object is less than or equal to a first preset threshold, and the similarity distance between any one second pixel and any one first pixel is obtained from the first information of any one second pixel and the first information of any one first pixel.
  7. The method according to claim 5 or 6, wherein the distance between the position of each of the at least one second pixel points in the first image and the position of at least one first pixel point in the set of candidate pixel points of the object in the first image is less than or equal to a second preset threshold.
  8. The method of any one of claims 1-7, wherein the first reference plane is the ground.
  9. The device for measuring the spatial dimension of the object in the image is characterized by comprising an identification unit, a conversion unit and a processing unit;
    the identification unit is configured to identify a first image to obtain N objects in the first image, where N is an integer greater than or equal to 1, and each object in the N objects includes a pixel point set, and the pixel point set includes a plurality of pixel points;
    the conversion unit is used for converting the N objects into N three-dimensional objects corresponding to the N objects, wherein each three-dimensional object comprises a three-dimensional point cloud and is at least one part of an object in a three-dimensional environment space where the first image is located;
    the processing unit is configured to determine spatial dimensions of N objects according to a first reference plane in the three-dimensional environment space and the N three-dimensional objects, where the spatial dimension of each of the N objects includes at least one of: a distance of at least one surface of the object to the first reference plane, or a three-dimensional dimension of the object, the at least one surface being parallel to the first reference plane.
  10. The apparatus of claim 9, wherein the N objects comprise a first object corresponding to a first three-dimensional object comprising a first three-dimensional point cloud, the first three-dimensional object being a first object in the three-dimensional environment space;
    the processing unit is specifically configured to:
    projecting the first three-dimensional point cloud onto the first reference plane to obtain a first projection area of the first three-dimensional point cloud on the first reference plane;
    determining a plurality of first distances from a plurality of first three-dimensional points in the first three-dimensional point cloud to the first reference plane;
    determining a three-dimensional size of the first object based on the first projection area and the plurality of first distances.
  11. The apparatus of claim 9 or 10, wherein the N objects comprise a second object corresponding to a second three-dimensional object comprising a second three-dimensional point cloud, the second three-dimensional object being a second object in the three-dimensional environment space;
    the processing unit is specifically configured to:
    determining a third three-dimensional point cloud corresponding to a first surface of the second object in the second three-dimensional point cloud, the first surface being parallel to the first reference plane;
    determining a plurality of second distances from a plurality of second three-dimensional points in the third three-dimensional point cloud to the first reference plane;
    determining a distance from the first surface to the first reference plane based on the plurality of second distances.
  12. The apparatus of any one of claims 9-11, wherein the N objects include a third object that corresponds to a third three-dimensional object that includes a fourth three-dimensional point cloud, the third three-dimensional object being part of a third object in the three-dimensional environment space;
    the processing unit is specifically configured to:
    obtaining a semantic map, wherein the semantic map is a three-dimensional image comprising the three-dimensional environment space;
    determining a fifth three-dimensional point cloud corresponding to the third object according to the semantic map and the fourth three-dimensional point cloud;
    projecting the fifth three-dimensional point cloud onto the first reference plane to obtain a second projection area of the fifth three-dimensional point cloud on the first reference plane;
    determining a plurality of third distances from a plurality of third three-dimensional points in the fifth three-dimensional point cloud to the first reference plane;
    determining a three-dimensional size of the third object according to the second projection area and the plurality of third distances.
  13. The apparatus according to any of claims 9-12, wherein the identification unit is specifically configured to:
    performing semantic segmentation on the first image to obtain N candidate pixel point sets aiming at the N objects and a first pixel point set not aiming at a specific object;
    adding at least one second pixel point in the plurality of second pixel points in each candidate pixel point set according to the first information of the plurality of first pixel points in each candidate pixel point set in the N candidate pixel point sets and the first information of the plurality of second pixel points in the first pixel point set to obtain one object in the N objects, wherein the pixel point set included in the object comprises the candidate pixel point set of the object and the at least one second pixel point;
    wherein the first information comprises at least one of: depth information or color information.
  14. The apparatus according to claim 13, wherein a similarity distance between each of the at least one second pixel and at least one first pixel in the candidate pixel set of the object is smaller than or equal to a first preset threshold, and the similarity distance between any one second pixel and any one first pixel is obtained from the first information of any one second pixel and the first information of any one first pixel.
  15. The apparatus according to claim 13 or 14, wherein a distance between a position of each of the at least one second pixel points in the first image and a position of at least one first pixel point in the set of candidate pixel points of the object in the first image is smaller than or equal to a second preset threshold.
  16. The apparatus of any one of claims 9-15, wherein the first reference plane is the ground.
  17. An apparatus for measuring a spatial dimension of an object in an image, the apparatus comprising a memory and a processor;
    wherein the memory is used for storing a software program;
    a processor for reading the software program in the memory and performing the method of any one of claims 1 to 8.
  18. A computer storage medium, characterized in that the storage medium has stored therein a software program that, when read and executed by one or more processors, implements the method of any one of claims 1 to 8.
  19. A computer program product comprising program code which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 8.
CN201980006529.5A 2019-12-23 2019-12-23 Method and device for measuring spatial dimension of object in image Pending CN113302654A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/127677 WO2021127947A1 (en) 2019-12-23 2019-12-23 Method and apparatus for measuring spatial dimension of object in image

Publications (1)

Publication Number Publication Date
CN113302654A true CN113302654A (en) 2021-08-24

Family

ID=76573518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980006529.5A Pending CN113302654A (en) 2019-12-23 2019-12-23 Method and device for measuring spatial dimension of object in image

Country Status (2)

Country Link
CN (1) CN113302654A (en)
WO (1) WO2021127947A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279380A1 (en) * 2011-03-04 2019-09-12 General Electric Company Method and device for measuring features on or near an object
CN107289855A (en) * 2016-04-13 2017-10-24 西克股份有限公司 For the method and system for the size for measuring destination object
CN108335325A (en) * 2018-01-30 2018-07-27 上海数迹智能科技有限公司 A kind of cube method for fast measuring based on depth camera data
CN108416804A (en) * 2018-02-11 2018-08-17 深圳市优博讯科技股份有限公司 Obtain method, apparatus, terminal device and the storage medium of target object volume
CN109272547A (en) * 2018-09-17 2019-01-25 南京阿凡达机器人科技有限公司 A kind of package circumscribed volume measurement method, system, storage medium and mobile terminal
CN109785379A (en) * 2018-12-17 2019-05-21 中国科学院长春光学精密机械与物理研究所 The measurement method and measuring system of a kind of symmetric objects size and weight
CN110095062A (en) * 2019-04-17 2019-08-06 北京华捷艾米科技有限公司 A kind of object volume measurement method of parameters, device and equipment
CN110276317A (en) * 2019-06-26 2019-09-24 Oppo广东移动通信有限公司 A kind of dimension of object detection method, dimension of object detection device and mobile terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU, Kai et al.: "Left ventricular segmentation method for echocardiography based on adaptive mean shift", Journal of Biomedical Engineering, vol. 35, no. 2, pages 273-279 *

Also Published As

Publication number Publication date
WO2021127947A1 (en) 2021-07-01

Similar Documents

Publication Publication Date Title
CN111328396B (en) Pose estimation and model retrieval for objects in images
Memo et al. Head-mounted gesture controlled interface for human-computer interaction
CN110458805B (en) Plane detection method, computing device and circuit system
Ren et al. Depth camera based hand gesture recognition and its applications in human-computer-interaction
EP4006847A1 (en) Virtual object processing method and apparatus, and storage medium and electronic device
US11308655B2 (en) Image synthesis method and apparatus
CN111563502A (en) Image text recognition method and device, electronic equipment and computer storage medium
US11704357B2 (en) Shape-based graphics search
CN109074497A (en) Use the activity in depth information identification sequence of video images
CN112657176A (en) Binocular projection man-machine interaction method combined with portrait behavior information
US20230351724A1 (en) Systems and Methods for Object Detection Including Pose and Size Estimation
US9569661B2 (en) Apparatus and method for neck and shoulder landmark detection
CN111460937B (en) Facial feature point positioning method and device, terminal equipment and storage medium
CN112197708B (en) Measuring method and device, electronic device and storage medium
CN110852132B (en) Two-dimensional code space position confirmation method and device
CN113302654A (en) Method and device for measuring spatial dimension of object in image
CN114494857A (en) Indoor target object identification and distance measurement method based on machine vision
US10796435B2 (en) Image processing method and image processing apparatus
CN114693865A (en) Data processing method and related device
CN114719759B (en) Object surface perimeter and area measurement method based on SLAM algorithm and image instance segmentation technology
CN112150527B (en) Measurement method and device, electronic equipment and storage medium
CN115713808A (en) Gesture recognition system based on deep learning
CN113947635A (en) Image positioning method and device, electronic equipment and storage medium
CN114943819A (en) PnP-based 3D object labeling method and tool
TW202203156A (en) Device and method for depth calculation in augmented reality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination