CN115482359A - Method for measuring size of object, electronic device and medium thereof - Google Patents


Info

Publication number
CN115482359A
Authority
CN
China
Prior art keywords
plane
image
mobile phone
dimensional
person
Prior art date
Legal status
Pending
Application number
CN202110611252.8A
Other languages
Chinese (zh)
Inventor
王诗槐
陈金昌
邹维韬
李龙华
张文洋
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110611252.8A
Publication of CN115482359A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 - Manipulating 3D models or images for computer graphics
    • G06T19/006 - Mixed reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/136 - Segmentation; Edge detection involving thresholding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/40 - Analysis of texture
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/90 - Determination of colour characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The application relates to the technical fields of computer vision and artificial intelligence, and provides a method for measuring the size of a target, which includes the following steps: acquiring a two-dimensional information image and a three-dimensional information image that include the real plane where the target to be measured is located in a real scene, where the pixels in the two-dimensional information image and the pixels in the three-dimensional information image have a one-to-one positional correspondence; acquiring position information of the plane according to the two-dimensional information image, and acquiring the three-dimensional pixel information of the pixels corresponding to that position information in the three-dimensional information image; converting the obtained three-dimensional pixel information into a point cloud, and generating, in a virtual space, a virtual plane corresponding to the real plane based on the converted point cloud; and measuring the size of the target to be measured with the virtual plane as a reference plane. This method removes the step of the user manually selecting a measurement starting point or a measurement reference plane, makes the size measurement more intelligent, and avoids deviations in the measured size caused by an insufficiently accurate user-selected measurement reference plane.

Description

Method for measuring size of object, electronic device and medium thereof
Technical Field
The application belongs to the technical field of computer vision and artificial intelligence, and particularly relates to a target size measuring method, electronic equipment and a medium thereof.
Background
Augmented Reality (AR) technology combines image recognition and positioning technology, computer graphics technology, and visualization technology to generate virtual objects and accurately fuse them into the real environment, presenting the user with a realistic sensory experience. AR technology has been widely used in many fields, such as industrial maintenance, medicine, restoration of ancient monuments, digital cultural heritage protection, and mobile device applications. In the field of mobile device applications, AR technology may be used, among other things, to measure the length and height of an object, such as the height of a person or an animal.
However, at present, when a user measures the length, height, or other dimensions of a target with AR technology, the user is always required to manually select a measurement starting point or a measurement reference plane. This makes the AR measurement procedure cumbersome, can make the measurement result inaccurate, and degrades the user experience.
Disclosure of Invention
To solve this technical problem, in the method of the present application, a mobile phone fits, in the AR scene, the plane on which the target to be measured is located according to a depth image of that plane in the real scene, and directly measures the height of the target using this plane as the reference plane for size measurement. This removes the step of the user manually selecting a measurement starting point or a measurement reference plane, simplifies the size measurement operation, makes the size measurement more intelligent, and avoids the deviation in the measured size of the target caused by an insufficiently accurate user-selected measurement reference plane.
In a first aspect, an embodiment of the present application provides a method for measuring the size of a target, including: acquiring a two-dimensional information image and a three-dimensional information image that include the real plane where the target to be measured is located in a real scene, where the pixels in the two-dimensional information image and the pixels in the three-dimensional information image have a one-to-one positional correspondence; acquiring position information of the plane according to the two-dimensional information image, and acquiring the three-dimensional pixel information of the pixels corresponding to that position information in the three-dimensional information image; converting the obtained three-dimensional pixel information into a point cloud, and generating, in a virtual space, a virtual plane corresponding to the real plane based on the converted point cloud; and measuring the size of the target to be measured with the virtual plane as a reference plane.
Optionally, the two-dimensional information image represents two-dimensional features of the target, where the two-dimensional features include one or more of color features, grayscale features, and texture features, and the three-dimensional information image represents three-dimensional features of the target, where the three-dimensional features include a spatial depth value of the target, and the spatial depth value of the target refers to a distance between any part (or any point) of the target and a camera that captures the target.
Because the pixels of the two-dimensional information image and of its corresponding three-dimensional information image have a one-to-one positional correspondence, the three-dimensional pixel information corresponding to the plane in the three-dimensional information image can be determined from the real plane identified in the two-dimensional information image. The three-dimensional pixel information corresponding to the real plane is converted into a corresponding point cloud through the intrinsic matrix of the camera (the camera in this application) and a mapping formula; finally, a virtual plane corresponding to the real plane in the virtual space is fitted from the point cloud of the real plane, and the virtual plane is used as the reference plane for measuring the size of the target to be measured.
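As an illustration of the mapping from depth pixels to a point cloud, the following is a minimal sketch assuming a pinhole camera model with intrinsics fx, fy, cx, cy; the function and variable names are illustrative and are not taken from the patent.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, mask=None):
    """Back-project a depth image (values in meters) into camera-frame 3-D points.

    depth : H x W array of depth values.
    mask  : optional H x W boolean array selecting e.g. the ground pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    if mask is None:
        mask = depth > 0                             # keep only valid depth readings
    z = depth[mask]
    x = (u[mask] - cx) * z / fx                      # X = (u - cx) * Z / fx
    y = (v[mask] - cy) * z / fy                      # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1)              # N x 3 point cloud
```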
Optionally, in a possible embodiment, the target to be measured may be a person, an animal, a building, or another object standing perpendicular to the ground; accordingly, the size of the target to be measured may be the height of the person, the height of the animal, or the height or width of the building. It should be understood that if the target to be measured is an object with one end fixed to a wall surface, and the distance between the end fixed to the wall and the end far away from the wall is to be measured, the measurement method is the same as that for an object perpendicular to the ground: in this case, the electronic device fits a "wall surface" in the virtual space corresponding to the real wall surface according to the three-dimensional pixel information of the wall surface in the real scene and the point cloud converted from it, and measures the size of the object to be measured with this virtual "wall surface" as the reference plane.
Optionally, in one possible implementation, the virtual space includes an augmented reality AR scene, the two-dimensional information image includes a color space image, the color space image includes an RGB image or a YUV image, and the three-dimensional information image includes a depth image.
Taking the case where the virtual space is an AR scene, the two-dimensional information image is an RGB image, and the three-dimensional information image is a depth image as an example: with the above approach, when an electronic device such as a mobile phone measures the size of the target to be measured, a virtual "ground" in the AR scene can be generated automatically from the RGB image and the depth image of the plane (for example, the ground) obtained by the camera, and the size of the target to be measured is then measured against this virtual "ground". In this process, the user does not need to manually select a measurement starting point or a measurement reference plane, which simplifies the measurement operation and makes the whole process more intelligent, and at the same time avoids the error that occurs when the measurement starting point or starting plane manually selected by the user is not accurate enough.
Optionally, the two-dimensional information image and the three-dimensional information image, both including the real plane where the object to be measured is located, may be obtained simultaneously by a depth-sensing camera of the electronic device. The depth-sensing camera of the electronic device includes a time-of-flight (TOF) camera.
In a possible implementation of the first aspect, the electronic device may further identify the plane in the two-dimensional information image using a semantic segmentation model and then acquire the position information of the plane, where acquiring the position information of the plane may mean determining the area of the plane or the contour of the plane. For example, if the position information of the ground in the real scene is to be obtained, the electronic device identifies the ground area or the ground contour in the two-dimensional information image using the semantic segmentation model, and then determines the three-dimensional pixel information corresponding to the ground in the three-dimensional information image according to that area or contour.
In one possible implementation of the first aspect, the semantic segmentation model is a fully convolutional network (FCN) model. Optionally, the semantic segmentation model may also be another neural network model; the application does not limit the type of neural network model on which the semantic segmentation model is based.
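For illustration, the sketch below shows how such a segmentation output could be turned into a ground mask that selects the depth pixels to back-project; the segmentation function, class index, and threshold are assumptions for the example, not the patent's model.

```python
import numpy as np

GROUND_CLASS = 3   # hypothetical index of the "ground" class in the model's label map

def ground_mask_from_segmentation(rgb, segment_fn, threshold=0.5):
    """Run a segmentation model (FCN-style, passed in as segment_fn) on an RGB
    frame and return a boolean mask of the pixels classified as ground."""
    probs = segment_fn(rgb)                        # H x W x num_classes score map
    return probs[..., GROUND_CLASS] > threshold    # threshold the ground channel

# The mask then selects the depth pixels to convert into the ground point cloud,
# e.g. depth_to_point_cloud(depth, fx, fy, cx, cy, mask=ground_mask).
```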
In a possible implementation of the first aspect, converting the obtained three-dimensional pixel information into a point cloud and generating a virtual plane corresponding to the real plane in the virtual space based on the converted point cloud includes: generating, based on the converted point cloud, a plurality of sub-virtual planes corresponding to the real plane in the virtual space; determining a plane confidence of each of the plurality of sub-virtual planes; and generating the virtual plane based on some of the plurality of sub-virtual planes, where the plane confidence represents the ratio between the number of points of the point cloud in each sub-virtual plane and the number of points of the point cloud corresponding to the real plane.
Since the real plane is not necessarily perfectly flat, when the corresponding virtual plane in the virtual space is generated from the point cloud of the real plane, one or more sub-virtual planes corresponding to the real plane may be generated; these sub-virtual planes are then fitted together according to their plane confidences to produce a larger virtual plane, and the largest resulting plane is taken as the virtual plane corresponding to the real plane. For the fitting process, reference may be made to the detailed description of plane fitting in the embodiments below, which is not repeated here.
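As a concrete, non-limiting illustration of fitting one such sub-plane and computing its plane confidence, the sketch below uses RANSAC followed by a least-squares refit; RANSAC is named here as an example technique, since the patent text does not prescribe a specific fitting algorithm, and the thresholds are assumed values.

```python
import numpy as np

def refit_least_squares(pts):
    """Least-squares plane through pts: returns unit normal n and offset d (n.p + d = 0)."""
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    n = vt[-1]                                        # smallest singular vector = plane normal
    return n, -np.dot(n, centroid)

def fit_plane_ransac(points, iters=200, dist_thresh=0.02):
    """Fit one sub-plane to an N x 3 point cloud and report its plane confidence."""
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(n) < 1e-9:                  # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        d = -np.dot(n, p1)
        inliers = np.abs(points @ n + d) < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    n, d = refit_least_squares(points[best_inliers])  # refine on the inlier set
    confidence = best_inliers.sum() / len(points)     # sub-plane points / all plane points
    return n, d, confidence
```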
When generating a virtual plane in the virtual space and measuring the size of the target to be measured with the virtual plane as a reference plane, the measurement process may specifically be as follows:
In a possible implementation, taking the case where the target to be measured is a person as an example, the measurement method further includes: acquiring an image of the person's head; determining the vertex of the head from the head image; and taking the distance between the head vertex and the virtual plane as the height of the person.
In another possible implementation, in order to improve measurement accuracy, the vertex of the head of the person to be measured may also be determined by a three-dimensional face recognition method, that is: the head image is a three-dimensional information image of the head; facial feature points are identified from the three-dimensional information image of the person's head by the three-dimensional face recognition method; the head vertex is then determined based on the facial feature points; and finally, the distance between the head vertex and the virtual plane is taken as the person's height.
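Once the virtual plane and the head vertex are available, the height is the perpendicular point-to-plane distance; a minimal sketch, assuming the plane is given as a unit normal and offset as in the fitting example above:

```python
import numpy as np

def height_above_plane(head_vertex, plane_normal, plane_d):
    """Perpendicular distance from the head vertex to the virtual plane n.p + d = 0
    (plane_normal assumed to be unit length)."""
    return abs(np.dot(plane_normal, head_vertex) + plane_d)

# e.g. height = height_above_plane(head_vertex, n, d)  # same units as the point cloud
```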
In a second aspect, the present application provides a computer-readable medium, where instructions are stored on the computer-readable medium, and when executed on an electronic device, the instructions cause the electronic device to perform the method for measuring a dimension of an object according to any one of the first aspect.
In a third aspect, an embodiment of the present application provides an electronic device, including: the camera is connected with the inertial measurement unit IMU; one or more processors; one or more memories; a module installed with a plurality of applications; the memory stores one or more programs, the one or more programs comprising instructions, which when executed by the electronic device, cause the electronic device to perform the method of dimensional measurement of an object as in any of the first aspects.
In a fourth aspect, embodiments of the present application provide a computer program product, which, when run on an electronic device, causes a processor to perform a method of dimensional measurement of an object as in any one of the first aspect.
It is understood that the beneficial effects of the second to fourth aspects can be seen from the description of the first aspect, and are not described herein again.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below obviously show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of a User Interface (UI) for performing AR measurement by using an AR measurement APP (application) on a mobile phone according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an example height calculation method according to an embodiment of the present application;
fig. 3 is a schematic diagram of UI changes when measuring height through an AR measurement APP on a mobile phone according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an example UI provided by embodiments of the present application;
FIG. 5 is a schematic diagram of an example UI provided by embodiments of the present application;
FIG. 6 is a schematic diagram of an example UI provided by an embodiment of the application;
FIG. 7 is a schematic diagram of an example UI provided by embodiments of the present application;
FIG. 8 is a schematic diagram of an example UI provided by embodiments of the application;
FIG. 9 is a schematic diagram of an example UI provided by embodiments of the present application;
fig. 10 (a) is a schematic diagram of an exemplary hardware structure of a mobile phone according to an embodiment of the present application;
fig. 10 (b) is a schematic diagram of an example of a mobile phone having multiple cameras according to an embodiment of the present application;
FIG. 11 (a) is a schematic flow chart of a method for measuring height of a person according to an embodiment of the present application;
FIG. 11 (b) is a schematic process diagram of an example of plane fitting provided in the embodiments of the present application;
FIG. 12 is a schematic diagram of an example height calculation method according to an embodiment of the present application;
FIG. 13 is a schematic diagram illustrating an example of recognizing a head contour of a person by using an image recognition method according to an embodiment of the present application;
FIG. 14 (a) is a flowchart of a method for determining a vertex of a human head and calculating a height of the human body by using a three-dimensional face recognition technology according to an embodiment of the present application;
fig. 14 (b) is a schematic flowchart of generating a face mesh by using a three-dimensional face recognition technology according to an embodiment of the present application;
fig. 15 is a schematic diagram of facial feature points of an example provided in the embodiment of the present application;
fig. 16 is a schematic software framework diagram of an example of a mobile phone according to an embodiment of the present application.
Detailed Description
The terminology used in the following embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in the specification of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the listed items.
The technical solution of the present application will be described in detail below with reference to the accompanying drawings.
In the field of mobile device applications, mobile AR technology can be used to measure the length and height of an object, such as the height of a person or an animal. However, regardless of which related application program is used, when the user performs AR measurement with a mobile phone, the user is required to manually select a measurement starting point before the height can be measured, which is cumbersome. Specifically, fig. 1 (a) to (d) show UI diagrams of AR measurement using an AR measurement APP on a mobile phone.
As shown in FIG. 1 (a), the user may select the "height measurement" button 102 in the "measurement mode option bar" 101, and in response to user operation, the cell phone 10 enters the height measurement mode. The display interface UI 100 of the AR measurement APP then prompts the user to move the handset 10 to find the ground in order to identify the ground and fit the ground in the AR scene.
As shown in FIG. 1 (b), when the AR scene is successfully established, the UI 100 may display "ground recognition successful" to prompt the user that a height measurement may be taken at this time. At this time, in order to measure the height, the user may first select a measurement starting point P3' on the sole of the person to be measured.
Then, as shown in fig. 1 (c), the mobile phone 10 constructs a measurement reference plane β from the measurement starting point selected by the user, and prompts the user to move the mobile phone 10 upward to recognize a human face.
Finally, as shown in fig. 1 (d), the mobile phone 10 identifies the face to be measured, determines the head vertex of the person to be measured, and displays the height of the person to be measured, that is, the distance "1.8m" between the head vertex and the measurement reference plane β.
Correspondingly, the process of calculating the height of the person to be measured specifically as shown in fig. 2 includes:
the mobile phone 10 fits the measurement reference plane β with the starting point P3 'manually selected by the user as a reference, then calculates the distance from the reference plane β to the vertex P1 of the head of the person to be measured, i.e., P2' P1, and takes the distance as the height of the person to be measured.
However, it is obvious that the height of the person to be measured should be the distance from the head vertex P1 to the ground α in the AR scene fitted by the mobile phone 10 scanning the ground, i.e. P1P2.
Therefore, if the height is measured against a reference plane fitted from a measurement starting point manually selected by the user, the reference plane may not lie in the same horizontal plane as the actual ground in the AR scene, which introduces an error into the measured height of the person.
Moreover, if the plane fitted by the mobile phone 10 from the user-selected measurement starting point and the two points around it is inclined, the height measurement error is even larger. For example, as shown in fig. 2 (b), if the measurement starting point selected by the user is P3″, there is an included angle between the reference plane ω fitted by the mobile phone 10 from the measurement starting point P3″ and the actual ground α in the AR scene, and the height P1P2″ of the person to be measured obtained from the reference plane ω is far from the person's actual height P1P2.
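To make the effect of such a tilt concrete (a simple geometric estimate, not part of the patent text, under the assumption that the tilted plane ω passes through the point P2 directly below the head vertex): the measured value is approximately P1P2″ ≈ P1P2 · cos θ, where θ is the angle between ω and the actual ground α. A tilt of only 10° would already shorten a true height of 1.80 m to about 1.77 m, and the error grows rapidly as θ increases.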
In order to solve the above technical problem, an embodiment of the present application provides a method for measuring a size of an object. In the method for measuring the size of the target, a user does not need to manually select a measurement starting point of the target to be measured, when the user slowly moves the mobile phone 10 to search for the ground, the mobile phone 10 can accurately identify the ground in a real scene through a trained semantic segmentation model, the ground in an AR scene is fitted according to a depth image of the ground, and then the mobile phone 10 directly measures the size of the target to be measured by taking the ground in the AR scene as a reference surface.
In addition, when the target to be measured is a person and the person's height is measured, 3D (three-dimensional) face recognition technology is used to recognize the face and generate a 3D face mesh model. Unlike a 2D image, from which only the contour of a person can be identified, the 3D face mesh is a mask-like model that fits closely onto the real face, and in the 3D face mesh the tip of the nose is the most prominent point. The nose tip is therefore used as a feature point to determine the head vertex of the person; the distance from the head vertex to the ground in the AR scene is then calculated and taken as the height of the person to be measured, so that the height is measured accurately.
The dimensional measurement method of the object of the present application will be explained in further detail below with reference to fig. 3 to 9.
For convenience of explanation, the embodiment of the present application will be described by taking the measurement of the height of the person to be measured by using the mobile phone 10 as an example. However, it should be understood that the solution of the present application may be applied not only to the mobile phone 10, but also to electronic devices such as a tablet computer, a wearable device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), and the like, and the embodiments of the present application do not set any limit to specific types of electronic devices capable of implementing the solution of the present application.
Fig. 3 (a) to (c) are schematic diagrams illustrating UI changes when measuring height through an AR measurement APP on a mobile phone according to an embodiment of the present application.
As shown in fig. 3 (a), the user may open the display interface UI 300 of the AR measurement APP by clicking the icon of the AR measurement APP on the desktop of the mobile phone 10, so as to measure the target to be measured. When the user clicks the "AR measurement" 301, the UI 300 displays a start interface of the "AR measurement" as illustrated in (b) of fig. 3.
When the start interface finishes, the user enters the home interface of the AR measurement application. As shown in fig. 3 (c), the "measurement mode option bar" 302 of the UI 300 displays a "length measurement" mode 302a, an "area measurement" mode 302b, a "volume measurement" mode 302c, and a "height measurement" mode 302d, and the function of each measurement mode is displayed above the corresponding mode. Taking the "length measurement" mode 302a as an example, when the user long-presses the "length measurement" 302a button, the function corresponding to this mode, "measure object length" 303a, is displayed above it in a floating window, and the desktop length "1.8m" measured in this mode is displayed in area 303b, so as to show the user the functions and effects that each measurement mode can achieve.
Optionally, the start interface may be displayed once when the user first installs and opens the "AR measurement" application program, or may be displayed each time the user opens the "AR measurement" application program, which is not limited in this application.
When the user clicks the "height measurement" 302d button, the UI 300 jumps to UI 400, the interface corresponding to the height measurement mode. Specifically, as shown in fig. 4 (a), when the user selects the "height measurement mode", the area 401 of the UI 400 displays "automatic measurement mode is on", and the mobile phone 10 prompts the user to "move the device slowly to find the ground";
after the ground is successfully recognized, as shown in fig. 4 (b), the area 401 of the UI 400 displays "ground recognition successful", and the mobile phone 10 prompts the user to "move the device slowly to find a face". The user can then move the mobile phone 10 upward so that it can recognize the face of the person to be measured (refer to fig. 4 (c)); when the mobile phone 10 recognizes the face, the height "1.8m" of the person to be measured is displayed in the area 405 of the UI 400.
In one possible implementation, when the user wants to re-select the measurement start point, the measurement start point may be re-selected in the function option bar 404 by a pull-back operation and a add-on operation.
Alternatively, the height of the person to be tested may be displayed in any position of the UI 400 that can be seen by the user, which is not limited in this application.
Alternatively, the height unit of the person to be measured may be set by the user clicking the setting button 402, and the height unit of the person to be measured may be metric units such as m and cm, and may also be english units. This is not limited by the present application.
In another possible implementation manner, the user may also capture a screen of the current interface displaying the height of the person to be tested through the photographing function in the function option bar 404 (refer to fig. 4 (b)), and store the captured screen as an image.
It can be seen that the height measurement operation process of the present application is different from the height measurement operation process in the conventional scheme. In the height measurement process shown in fig. 4, after the user selects the height measurement function and enters the height measurement mode, the user only needs to move the mobile phone 10 to enable the mobile phone 10 to establish an AR scene and recognize a human face, and the user does not need to select a measurement starting point, so that the height measurement process is more intelligent, and the user experience is improved.
In some embodiments of the present application, the mobile phone 10 can also measure the heights of multiple people at the same time. Specifically, when the user clicks the "height measurement" mode 302d button to enter the height measurement mode (see fig. 4 (a)), the area 401 of the UI 400 displays "ground recognition successful" once the mobile phone 10 has successfully recognized the ground, as shown in fig. 4 (b). At the same time, the mobile phone 10 prompts the user to "move the device slowly to find a face". The user slowly moves the mobile phone 10 following the prompt and, as shown in fig. 5 (a), the mobile phone 10 recognizes 3 faces. Thereafter, as shown in fig. 5 (b), the UI 300 simultaneously displays the heights "1.8m", "1.9m", and "1.75m" of the three people to be measured in the area 501.
Optionally, the mobile phone 10 may also display the height difference between the three people to be measured in the area 501, so as to increase the interest of height measurement. For example, as shown in fig. 5 (c), the region 501 may also display a height difference "0.3m" or "0.35m" between the three test persons and the highest-height test person.
Optionally, the height difference may be a difference between the height of each person to be measured and the height of the highest person to be measured among all the heights of the persons to be measured, or may be a difference between the height of each person to be measured and the height of the lowest person to be measured among all the heights of the persons to be measured, or may be a height difference between adjacent persons to be measured, or the like. The present application does not set any limit to the calculation method of the height difference.
In further embodiments of the present application, when multiple people are simultaneously present in the current interface of the mobile phone 10, the user can select a specific one or more people among the multiple people to perform the height measurement, so as to increase the intelligence of the height measurement.
Specifically, when the user clicks the "height measurement" mode 302d button to enter the height measurement mode (see fig. 4 (a)), the area 401 of the UI 400 displays "ground recognition successful" once the mobile phone 10 has successfully recognized the ground, as shown in fig. 4 (b). Meanwhile, the mobile phone 10 prompts the user to "move the device slowly to find a face". The user slowly moves the mobile phone 10 following the prompt and, as shown in fig. 5 (a), the mobile phone 10 recognizes 3 faces; the user can then select, as required, the person in the middle as the person whose height is to be measured.
As shown in fig. 6 (a), the user can move the "selection box" 601 up, down, left, and right to select the person whose height is to be measured.
Optionally, in some embodiments of the present application, a "selection box" 601 may also be displayed when the user selects a specific object to be measured, so as to prompt the user of the currently selected object to be measured.
As shown in fig. 6 (b), when the user selects the person in the middle, the mobile phone 10 prompts the user to "move the device slowly to find the face", and the user follows the guidance so that the mobile phone 10 recognizes the face of the person selected by the user.
Finally, as shown in FIG. 6 (c), the height of the person to be tested is "1.9m" displayed on the UI 600.
In other embodiments of the present application, when the distance between the ceiling and the ground is fixed, for example indoors, the mobile phone 10 may also determine the height of the person to be measured by recognizing the ceiling and the person's face and then using the known distance between the ceiling and the ground. Specifically, when the user clicks the "height measurement" mode 302d button to enter the height measurement mode (refer to fig. 4 (a)), as shown in fig. 7 (a), "automatic measurement mode is on" is displayed on the interface UI 700 of the mobile phone 10 and the user is prompted to "move the device slowly to find the ceiling"; the user then follows the guidance of the mobile phone 10 so that the phone can recognize the ceiling;
when the mobile phone 10 successfully recognizes the ceiling, the area 701 of the UI 700 shows "ceiling recognition successful". As shown in fig. 7 (b), the mobile phone 10 then prompts the user to "move the device slowly to find a face", and the user follows the guidance so that the mobile phone can recognize the face of the person to be measured.
Alternatively, in order to identify the face of the user more clearly, the camera 193 of the mobile phone 10 increases the focal length and reduces the field of view to enlarge the face of the person to be tested, as shown in fig. 7 (c), it can be seen that the face of the person to be tested displayed on the UI 700 is larger than the faces of the person to be tested displayed on the UIs 700 in fig. 7 (a) and (b).
It should be understood that the field of view may be expressed in terms of a field angle, which is inversely related to the focal length of the camera 193 of the mobile phone 10. The larger the focal length of the camera, the smaller the field angle and the smaller the range of objects that can be displayed; the smaller the focal length, the larger the field angle, the wider the field of view, and the larger the range of objects that can be displayed.
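As a concrete illustration (standard pinhole-camera geometry, not a formula quoted from the patent): for a sensor of width w and focal length f, the horizontal field angle is approximately FOV = 2·arctan(w / (2f)), so increasing f shrinks the field angle, which is why zooming in on the face reduces the visible range of the scene.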
When the mobile phone 10 successfully recognizes the face of the person to be tested, the mobile phone displays the height "1.8m" of the person to be tested in the area 702 of the UI 700.
In addition, when several people who are far apart are in the space where the mobile phone 10 is located, the display screen 194 of the mobile phone 10 may not be able to show all of them at the same time. Alternatively, the mobile phone 10 may already be displaying the height of one person to be measured when another person walks into the picture, and the user does not want to repeat the steps in fig. 4 (a) to 4 (c). In these cases, the user can directly move the mobile phone 10 in the current mode to recognize the other person's face, so that the mobile phone 10 continues to measure and display that person's height.
Specifically, fig. 8 is a UI diagram provided by some embodiments of the present application.
As shown in fig. 8 (a), at this time, the mobile phone 10 has already measured the height of the person to be measured Tom, and displays the height of "1.8m" of the person to be measured Tom.
However, another person to be measured, Joey, is in the room at the same time. Since Joey is far away from Tom, the current interface of the mobile phone 10 cannot show both people at once. As shown in fig. 8 (a), the user moves the mobile phone 10 relatively far away from Tom and moves it to the left to look for Joey, so that the mobile phone 10 can recognize Joey's face, until Tom and Joey can both fit into the mobile phone 10's picture at the same time (refer to fig. 8 (b));
then, as shown in fig. 8 (c), after the mobile phone 10 recognizes Joey's face, Joey's height is displayed as "1.9m".
Alternatively, to help the user see the heights of Tom and Joey at a glance, as shown in fig. 8 (d), the mobile phone 10 may record the heights of Tom and Joey in panoramic mode, where "panorama" is displayed in area 801 of the UI 800 and the heights of both people are displayed in the panoramic image. The user may also slide the UI 800 left and right to display other details in the panoramic photograph (e.g., objects such as tables and doors recorded in the panoramic image).
It should be understood that the panoramic mode is a mode in which more objects are put into the lens by moving or rotating the mobile phone 10 during shooting, the camera automatically shoots a plurality of photos during the moving or rotating process, and then the photos are spliced into a panoramic image through the software built in the mobile phone 10.
Optionally, when there are many people to be measured and the mobile phone 10 cannot display all of them in portrait orientation, as shown in fig. 9, the user may rotate the mobile phone 10 to landscape orientation and measure the heights of the people to be measured. The specific operation of the height measurement is the same as that described above for FIGS. 3-8 and is not repeated here.
In the above embodiments, for convenience of describing how the size measurement of an object is operated in the present application, a person is taken as the object to be measured. It should be understood that the objects to which the present application applies include, but are not limited to, people; the size measurement of the present application may also be applied to other objects that have feature points, such as animals, plants, or buildings. Feature points are points, or sets of points, that can be identified by a neural network model and that can represent the shape, color, height, or other attributes of an object. For example, the corners of a building in an image may represent the shape and edges of the building, while the nose tip and head vertex may represent the height of a person.
In order to understand the implementation process of the solution of the present application, the following still takes the target to be measured as a person and measures the height of the person as an example, and further describes the solution of the present application with reference to fig. 10 to 16.
Fig. 10 is a schematic diagram of a hardware structure of a mobile phone 10 according to an embodiment of the present application. As shown in fig. 10 (a), the mobile phone 10 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like.
The sensor module 180 may include a pressure sensor 180A, an inertial measurement unit 180B (including a gyroscope sensor 1801B and an acceleration sensor 1802B), a magnetic sensor 180C, an air pressure sensor 180D, a distance sensor 180E, a proximity light sensor 180F, a fingerprint sensor 180G, a temperature sensor 180H, a touch sensor 180J, an ambient light sensor 180K, a bone conduction sensor 180L, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the mobile phone 10. In other embodiments of the present application, the handset 10 may include more or fewer components than shown, or some components may be combined, some components may be separated, or a different arrangement of components may be used.
The illustrated components may be implemented in hardware, software, or a combination of software and hardware. Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system. In some embodiments of the present application, the processor 110 is responsible for creating an AR scene from depth images of the ground acquired by the handset 10.
The mobile phone 10 implements the display function through the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information. In some embodiments of the present application, a measurement reference plane in an AR scene may be rendered using a Graphics rendering tool (Graphics kit).
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may adopt a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the cell phone 10 may include 1 or N display screens 194, N being a positive integer greater than 1.
In this application embodiment, the display screen 194 can be used to display the aforementioned various APPs and function options, and in response to the operation of the user, height measurement is performed on the person to be measured, and the measured height value is displayed on the display screen 194. The specific measurement manner may refer to the related description of the subsequent embodiments, and is not described herein again.
The mobile phone 10 may implement a shooting (or photographing) function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like. In some embodiments of the present application, the user may capture and save the screen of the interface displaying the height of the person to be measured through the photographing function. As will be described in detail below.
The ISP is used to process the data fed back by the camera 193. For example, when taking a picture, the shutter is opened, light is transmitted to the camera 193 photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera 193 photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to the naked eye. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193. The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the handset 10 may include 1 or N cameras 193, N being a positive integer greater than 1.
For example, in some embodiments of the present application, as shown in fig. 10 (b), the mobile phone 10 may include 5 rear cameras 193: a multi-reflection periscope telephoto camera, an ultra-sensing camera, a telephoto camera, a movie camera, and a depth-sensing camera. In some embodiments of the present application, the mobile phone 10 obtains a depth image of the ground of the space where the mobile phone 10 is located through the depth-sensing camera 193, establishes the AR scene, and fits the ground in the AR scene according to the obtained depth image. The specific manner of doing so is described in detail below.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can implement applications such as smart recognition of the mobile phone 10, for example: image recognition, face recognition, speech recognition, text understanding, and the like. In some embodiments of the present application, the mobile phone 10 identifies the face of the person to be measured through the NPU, so as to determine the vertex of the head of the person to be measured. In other embodiments of the present application, the mobile phone 10 trains the semantic segmentation model by using the NPU based on the semantic segmentation database, so that the mobile phone 10 can identify the ground more accurately in the process of being slowly moved by the user, and construct a more accurate AR scene based on the above, so that the mobile phone 10 can fit the ground in an accurate measurement scene. As will be described in detail below.
The pressure sensor 180A is used to sense a pressure signal and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are many types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may comprise at least two parallel plates of electrically conductive material. When a force acts on the pressure sensor 180A, the capacitance between the electrodes changes, and the mobile phone 10 determines the intensity of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the mobile phone 10 detects the intensity of the touch operation through the pressure sensor 180A, and can also calculate the touched position based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations applied to the same touch position but with different intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the SMS application icon, an instruction to view the message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the SMS application icon, an instruction to create a new message is executed. In other embodiments of the present application, the mobile phone 10 may execute corresponding instructions in response to specific operations of the user on the display screen 194. For example, if the user clicks the "height measurement" button, the mobile phone 10 displays the current page as the height measurement interface shown in FIGS. 3-9. This will be described in detail below.
The gyro sensor 1801B may be used to determine the motion pose of the handset 10. In some embodiments, the angular velocity of the handset 10 about three axes (i.e., x, y, and z axes) may be determined by the gyro sensors 1801B. The gyro sensor 1801B may be used for photographing anti-shake.
Illustratively, when the shutter is pressed, the gyroscope sensor 1801B detects the shake angle of the mobile phone 10 and calculates the distance that the lens module needs to compensate according to the shake angle, allowing the lens to counteract the shake of the mobile phone 10 through a reverse movement, thereby achieving anti-shake. The gyroscope sensor 1801B may also be used for navigation and motion-sensing game scenarios.
The acceleration sensor 1802B detects the magnitude of the acceleration of the mobile phone 10 in various directions (typically along three axes). The magnitude and direction of gravity can be detected when the mobile phone 10 is stationary. It can also be used to recognize the attitude of the mobile phone 10, and is applied to landscape/portrait switching, pedometers, and other applications.
In some embodiments of the present application, the gyroscope sensor 1801B and the acceleration sensor 1802B constitute an Inertial Measurement Unit (IMU) 180B. The IMU 180B is configured to obtain the acceleration and angular velocity of the camera 193 of the mobile phone 10 at each moment, so as to determine the pose information of the camera 193 at each moment, which allows the mobile phone 10 to perform plane fitting according to that pose information. The pose information includes the three-dimensional coordinates and the orientation of the camera 193 at each moment, and indicates the position changes of the camera 193. The mobile phone 10 fits the ground in the AR scene according to the depth images of the space where the person to be measured is located, obtained by the camera 193 at different positions, so that the ground in the AR scene is closer to the ground in the real scene. This will be described in detail below.
It should be understood that, unless otherwise specified herein, the coordinates and coordinate systems mentioned in this application refer to world coordinates and the world coordinate system.
A specific implementation of the method for measuring the size of an object of the present application is described below in conjunction with the UI diagrams of figs. 3-9 above.
Generally, when the mobile phone 10 fits a plane in an AR scene, it needs to obtain a depth image of the plane in the real scene, convert the coordinates in the image coordinate system corresponding to each pixel of the depth image into coordinates in the world coordinate system to obtain a point cloud corresponding to the depth image, and then fit the plane in the AR scene according to that point cloud.
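A minimal sketch of the image-to-world step, assuming the camera pose (rotation R_wc, translation t_wc) at each moment is available from the device's tracking/IMU and that camera-frame points come from a back-projection such as the depth_to_point_cloud sketch above; the names are illustrative only.

```python
import numpy as np

def camera_to_world(points_cam, R_wc, t_wc):
    """Transform N x 3 camera-frame points into the world frame using the camera
    pose at the moment the depth image was captured (p_world = R_wc @ p_cam + t_wc)."""
    return points_cam @ R_wc.T + t_wc

# Accumulating ground points from frames captured at different poses (illustrative):
# cloud = np.concatenate([
#     camera_to_world(depth_to_point_cloud(d, fx, fy, cx, cy, m), R, t)
#     for (d, m, R, t) in frames
# ])
```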
As described above, in the scenario where AR is used to measure the height of a person, the mobile phone 10 does not fit the ground in the AR scene entirely from the depth image of the real ground. Specifically, when the user moves the mobile phone 10 with the "AR measurement" APP to search for the ground, the plane fitted by the mobile phone 10 is not the ground of the AR scene. Therefore, when measuring the height of the person to be measured, the mobile phone 10 also requires the user to select a measurement starting point at the foot of the person (refer to fig. 1 (b)); the mobile phone then performs plane fitting from the user-selected starting point and uses the resulting plane as the measurement reference plane, i.e., the ground in the AR scene, for measuring the person's height. Thus, when the measurement starting point selected by the user does not actually belong to the ground in the real scene, the ground in the AR scene fitted by the mobile phone 10 from that starting point is inaccurate, and the height measurement error described above may occur.
To avoid this situation, in the method for measuring the size of an object of the present application, the mobile phone 10 introduces a semantic segmentation technique and trains a semantic segmentation model for identifying plane types in a real scene. Then, when the mobile phone 10 is "moved slowly by the user to search for the ground" (refer to fig. 4 (a)), the mobile phone 10 can continuously acquire, through the camera 193, color (RGB) images corresponding to the camera 193 in different poses together with the depth images corresponding to those RGB images, identify the acquired RGB images with the trained semantic segmentation model to determine the ground in the RGB images, and then directly fit the plane in the AR scene from the point cloud of the ground obtained from the ground regions of the RGB images and the corresponding depth images.
In this way, the mobile phone 10 can directly use this plane as the ground in the AR scene, that is, as the height measurement reference plane of the person to be measured, and measure the person's height. This makes the AR measurement operation simpler and more intelligent, and at the same time avoids the height measurement error of the prior art described above.
FIG. 11 (a) is a flow chart illustrating an exemplary method for measuring a height of a person according to some embodiments of the present disclosure.
As shown in fig. 11 (a), method 1100 includes:
1101: the mobile phone 10 acquires a two-dimensional information image of a space where a person is located in a real scene and a three-dimensional information image corresponding to the two-dimensional information image through the depth perception camera 193.
Wherein the two-dimensional information image reflects two-dimensional characteristics of objects in the image, and the three-dimensional information image reflects three-dimensional characteristics of objects in the image.
In one possible implementation manner of the present application, the two-dimensional information image is an RGB (red, green, blue) image, that is, a common three-channel color image. The RGB image reflects two-dimensional characteristics of objects in the image, such as color, gray scale, brightness, and sharpness, and each pixel of the RGB image holds RGB values ranging from 0 to 255. Alternatively, the two-dimensional information image may also be an image in another color space, such as a YUV image, where "Y" represents the luma (brightness) of the image, while "U" and "V" represent the chrominance (hue and saturation) components.
The three-dimensional information image is a depth image (depth image), which is an image in which the distance from the image capturing device to each point in the scene is used as a pixel value (i.e., three-dimensional pixel information), for example, the distance from the depth sensing camera 193 to each point in the scene where the person to be measured is located is used as a pixel value in the present application. The depth image reflects the geometry of the visible surface of the object.
In this application, the mobile phone 10 acquires an RGB image and a depth image, i.e., an RGB-D image, through the depth-sensing camera 193. It should be understood that the RGB image acquired by the depth-sensing camera 193 and its corresponding depth image are registered, so there is a one-to-one correspondence between their pixels. Taking the ground RGB image in the present application as an example, the pixel values of the depth image corresponding to the ground RGB image are the distances from the ground to the depth-sensing camera 193.
Optionally, in a possible implementation, the RGB image may also be obtained by using a common camera 193, and the depth image is obtained by using a depth sensor, such as a laser radar, and then the RGB image and the depth image are subjected to image registration, so that pixels of the RGB image and the depth image correspond to each other one by one, that is, a final RGB-D image is obtained. It should be understood that the foregoing image registration may be performed by respectively performing normalization processing on pixel values of the RGB image and the depth image, and then performing image alignment processing using feature points in the RGB image, such as color feature points, texture feature points, and feature points in the depth image. For example, taking a face image as an example, the feature points in the RGB face image are still feature points in the depth image, such as the tip of the nose, the wing of the nose, etc., so the RGB face image and the depth image can be aligned according to the feature points in the RGB face image which coincide with the feature points in the depth image of the face, so as to obtain an RGB-D image of the face.
Optionally, in some embodiments of the present application, the depth-sensing camera 193 of the mobile phone 10 may acquire an RGB-D image stream of the space where the person to be measured is located in the real scene while the user, following the guidance of the mobile phone 10, "slowly moves the device to recognize the ground" (refer to fig. 4 (a)). It is to be understood that an RGB-D image stream is a video stream composed of a plurality of consecutive RGB-D images.
1102: the mobile phone 10 identifies the ground in the two-dimensional information image, then obtains the point cloud of the ground in the corresponding three-dimensional information image according to the coordinates corresponding to the ground in the two-dimensional information image, and fits the ground where the person is located in the AR scene with the obtained point cloud of the ground.
After acquiring the RGB-D image of the space where the person to be measured is located in the real scene, the mobile phone 10 identifies the ground in the RGB image by using the trained semantic segmentation model, and then fits the ground in the AR scene with the point cloud data of the ground in the depth image corresponding to the RGB image.
It can be understood that both the AR scene and the VR scene belong to specific applications in the technical field of vision, and both involve fitting of a virtual plane and construction of a virtual space, so that the height measurement method can also be applied to the VR scene.
It is understood that the semantic segmentation model in the present application can identify the types of the planes included in the RGB image in step 1101, for example, the mobile phone 10 identifies the types of the planes of the wall, the ground, the desktop, and the like in the RGB image by using the semantic segmentation model. The specific functions and training process of the semantic segmentation model will be described in detail below.
It can also be understood that a point cloud (point cloud) refers to a set of points obtained after scanning a three-dimensional scene by using an apparatus such as a three-dimensional coordinate measuring machine, a photographic scanner (i.e., the camera 193 in this application), or a laser scanner, and each point in the set has a determined three-dimensional coordinate and corresponds to the position of the point in the depth image. The distribution of the point cloud may characterize the shape and location of the object.
In the present application, the mobile phone 10 may determine the point cloud corresponding to the depth image in the RGB-D image by combining the depth image in the RGB-D image obtained by the depth sensing camera 193 with the camera internal reference matrix of the mobile phone 10. Then, the mobile phone 10 performs plane fitting on the position where the point cloud distribution is dense according to the point cloud density or the point cloud sparsity in the point cloud to obtain a plane in the AR scene. The manner in which the handset 10 fits a plane in an AR scene from a point cloud will be described in detail below.
1103: the mobile phone 10 measures the height of a person on the ground in an AR scene as a measurement reference surface for measuring the height of the person.
After the mobile phone 10 fits the ground in the AR scene by the method in step 1102, the mobile phone 10 takes this ground as the measurement reference surface for height measurement and measures the height of the person. Specifically, the mobile phone 10 recognizes the head vertex P1 of the person, calculates the distance from the head vertex P1 to the measurement reference plane, and displays this distance as the height of the person on the UI of the mobile phone 10 (as shown in fig. 4 (d)). FIG. 12 is a schematic diagram of a height calculation approach provided by some embodiments of the present application.
As shown in fig. 12, the mobile phone 10 uses the plane α as the height measurement reference plane, calculates the distance P1P2 from the recognized head vertex P1 of the person to the measurement reference plane, and displays the distance P1P2 on the mobile phone 10 as the height of the person.
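For illustration only, the following is a minimal Python sketch (not the actual implementation of the mobile phone 10) of the point-to-plane distance underlying this calculation: given a reference plane expressed as n·x + d = 0 and the head vertex P1 in world coordinates, the height is the perpendicular distance from P1 to the plane. The plane parameters and the coordinates of P1 below are hypothetical.

```python
import numpy as np

def point_to_plane_distance(point, plane_normal, plane_d):
    """Perpendicular distance from a 3D point to the plane n.x + d = 0."""
    n = np.asarray(plane_normal, dtype=float)
    return abs(np.dot(n, np.asarray(point, dtype=float)) + plane_d) / np.linalg.norm(n)

# Hypothetical values: head vertex P1 in world coordinates (metres) and a
# horizontal reference plane alpha at z = 0 with its normal pointing up.
p1 = [0.12, 0.34, 1.74]
height = point_to_plane_distance(p1, plane_normal=[0.0, 0.0, 1.0], plane_d=0.0)
print(f"estimated height: {height:.2f} m")   # 1.74 m in this toy example
```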
It can be seen that this height measurement method differs from the conventional one. As shown in fig. 2 (a), the conventional method requires the user to select a measurement starting point P3'; the mobile phone 10 then establishes a plane β from the measurement starting point P3' selected by the user and calculates the distance P1P2' from the head vertex P1 of the person to the plane β. Because the plane β is not the ground where the person is located in the AR scene, this manner of measuring height has an error. In the height measurement method of the present application, by contrast, the user does not need to select a measurement starting point: the mobile phone 10 directly uses the fitted plane α as the measurement reference surface to measure the height of the person. This avoids the height measurement error caused by the plane β, established from a user-selected starting point, not coinciding with the actual ground in the AR scene, and improves the accuracy of height measurement.
The method for identifying the head vertex P1 of the person to be detected by the mobile phone 10 may be that the mobile phone 10 identifies the head vertex P1 of the person to be detected by a two-dimensional image identification technology, or that the mobile phone 10 identifies the head feature point of the person to be detected by a three-dimensional face identification technology. Details will be described below.
By means of steps 1101 to 1103, the user can implement the height measuring method of the person described in fig. 4-9 on the mobile phone 10.
In order to facilitate a more intuitive understanding of the implementation processes of steps 1101-1103, specific implementation details of steps 1101-1103 will be described below with reference to fig. 11 (b).
As shown in FIG. 11 (b), corresponding to the above steps 1101-1102, the steps 1101-1102 may be divided into two parts, namely, plane generation 1110b and semantic segmentation 1120b.
The plane generation 1110b includes: acquiring a depth image 1111b; determining a point cloud 1113b corresponding to the depth image according to the acquired depth image; acquiring the acceleration and angular velocity 1112b of the mobile phone 10; determining pose information 1114b of the camera 193 from the acceleration and angular velocity of the mobile phone 10 by using a SLAM algorithm; and then fitting the plane 1115b by combining the point cloud 1113b of the depth image with the pose information of the camera 193.
Meanwhile, in the semantic segmentation 1120b, the RGB images 1121b corresponding to the depth image pixels one to one are also acquired through the camera 193, and then the acquired RGB images are recognized by using the trained semantic segmentation model 1122b to determine the type of each plane in the RGB images, so as to generate semantic labels 1123b, where the semantic labels 1123b refer to the types of each plane.
Then, the mobile phone 10 adds the generated semantic tag 1123b to the fitted plane, and finally generates a plane with the semantic tag 1123b. Optionally, in the height measuring method in the present application, the plane with semantic label 1123b is the ground in the AR scene with label "ground".
It can be understood that the above-mentioned acquisition of the depth image 1111b corresponds to step 1101 in the method 1100, and the specific way of acquiring the depth image may refer to the description of step 1101; the way of generating the point cloud 1113b corresponding to the depth image according to the acquired depth image is consistent with the way of determining the point cloud according to the depth image in step 1102, which is not repeated here.
The following describes the process of training the semantic segmentation model with reference to step 1102 and semantic segmentation 1120b.
Specifically, the training process of the semantic segmentation model in step 1102 is as follows:
the mobile phone 10 trains the semantic segmentation model based on images with pixel-level labels in an existing semantic segmentation data set, so that the trained semantic segmentation model can recognize the ground in the RGB images acquired by the depth-sensing camera 193. An image with pixel-level labels means that each pixel in the image has a type label corresponding to it, for example, some pixels correspond to the ground, some to a wall, some to a human body, and some to a sofa.
Specifically, the mobile phone 10 takes an image in the semantic data set with the preset pixel level category label as target data; then, the mobile phone 10 inputs the target data into the semantic segmentation model to be trained to obtain a semantic segmentation result of the target data, and calculates a loss function of the semantic segmentation model according to the semantic segmentation result of the target data.
Alternatively, the formula for the calculation of the loss function may be
L_Seg = -y(t) · log F(x_t)

wherein F(x_t) is the result output by the semantic segmentation model for the target data x_t, y(t) is the preset pixel-level class label of the target data, and L_Seg is the loss function of the semantic segmentation model.
Then, the mobile phone 10 adjusts parameters in the semantic segmentation model according to the result of the loss function, for example, a weight value of each layer of neural network in the neural network used by the semantic segmentation model, to reduce the result of the loss function, so that an output result of the semantic segmentation model is the same as or similar to an input result (i.e., a preset pixel level class label), and when the output result of the semantic segmentation model is the same as or similar to the input result, the semantic segmentation model is considered to be trained completely.
In some embodiments of the present application, the result of the loss function may be compared to a preset threshold to determine whether the semantic segmentation training is complete. And when the result of the loss function is smaller than or equal to the preset threshold, the output result of the semantic segmentation model is considered to be the same as or similar to the input result, namely the training of the semantic segmentation model is finished. The setting of the preset threshold is related to the adopted neural network model and the loss function, the performance of the neural network model is better, the preset threshold can be set to be lower, and the setting mode of the preset threshold is not limited.
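As an illustrative aid only, the following is a minimal PyTorch-style sketch of the pixel-level cross-entropy training loop described above, including the preset-threshold stopping rule. The model, data loader, learning rate, and threshold are placeholders and not the actual training setup of the application.

```python
import torch
import torch.nn as nn

def train_segmentation(model, loader, epochs=10, loss_threshold=0.05, lr=1e-3):
    """Pixel-wise cross-entropy training; stops once the loss falls below a
    preset threshold, mirroring the stopping rule described above."""
    criterion = nn.CrossEntropyLoss()          # per-pixel -y(t) * log F(x_t)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, pixel_labels in loader:    # pixel_labels: (N, H, W) class ids (ground, wall, ...)
            optimizer.zero_grad()
            logits = model(images)             # (N, C, H, W) per-pixel class scores
            loss = criterion(logits, pixel_labels)
            loss.backward()
            optimizer.step()
            if loss.item() <= loss_threshold:  # preset threshold reached: training is considered done
                return model
    return model
```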
Alternatively, the semantic segmentation data set used for training the semantic segmentation model may be a 2D data set such as VOC (visual object classes) or MS COCO (Microsoft common objects in context); a 2.5D data set such as NYU-D V2, SUN-3D, or SUN RGB-D; or a 3D data set such as Stanford 2D-3D or ShapeNet Core. The data set adopted for training the semantic segmentation model is not limited in the present application.
Optionally, the semantic segmentation model may be based on a neural network architecture such as fully convolutional networks (FCNs), SegNet, U-Net, or DeepLab V1-V3. The present application does not limit the type of neural network architecture used for semantic segmentation training.
It can be understood that, in the semantic segmentation training process, segmentation training is mainly performed on RGB images, so in the training process, the mobile phone 10 may also acquire RGB images for training through the ordinary RGB camera 193.
Optionally, the above training process of the semantic segmentation model may be performed on the mobile phone 10, or may be performed on other electronic devices. When other electronic devices train the semantic segmentation model, after the training of the semantic segmentation model is completed, the semantic segmentation model is packaged and encapsulated by other electronic devices into a Software Development Kit (SDK) file, the SDK file is sent to the mobile phone 10 and installed on the mobile phone 10, and when the mobile phone 10 needs the semantic segmentation model, the mobile phone 10 uses the semantic segmentation model by calling a function corresponding to the semantic segmentation model. The other electronic devices may be electronic devices with model training capabilities, such as a notebook computer, a desktop computer, a cloud computer, and a tablet computer, and the form of the other electronic devices is not limited in the present application.
The process of plane fitting is described below in connection with step 1102 and the plane generation 1110b.
Specifically, the process of plane fitting includes:
1) The mobile phone 10 converts the depth image of the space where the object to be measured is located into a corresponding point cloud 1113b.
As can be seen from the above, the pixel value of each point in the depth image of the space where the object to be measured is the distance value from the point to the camera 193 of the mobile phone 10, the Z coordinate of each point in the point cloud corresponding to the depth image can be calculated according to the pixel value of each point in the depth image, and then the X coordinate and the Y coordinate of the point are calculated according to the Z coordinate of each point in the point cloud.
Specifically, the mobile phone 10 converts the depth image of the space where the object to be measured is located into the corresponding point cloud 1113b by using the camera intrinsic parameters and a mapping formula. Optionally, the specific mapping formula may be:
Pz = I(i, j) / scale

Px = (j - cx) · Pz / fx

Py = (i - cy) · Pz / fy
wherein Pz is the Z coordinate of each point in the point cloud, px is the X coordinate of each point in the point cloud, py is the Y coordinate of each point in the point cloud;
I(i, j) refers to the pixel value of the point in the ith row and the jth column of the depth image of the space where the object to be measured is located, i.e., the distance from that point to the camera 193 of the mobile phone 10;
scale represents the ratio of the pixel value of a certain point in the depth image of the space where the object to be measured is located to the actual physical distance (millimeter) between the point and the camera;
fx denotes the actual physical length (mm) represented by each pixel in the horizontal direction of the image captured by the camera 193, and fy the actual physical length represented by each pixel in the vertical direction; cx denotes the number of horizontal pixels between the center point coordinate of the depth image and the origin coordinate of the depth image, and cy denotes the number of vertical pixels between the center point coordinate of the depth image and the origin coordinate of the depth image.
It should be understood that the intrinsic parameters of different cameras are different. In some embodiments of the present application, the intrinsic parameters of the camera of the cell phone 10 may be camera.scale = 1.0; camera.cx = 313.259979; camera.cy = 270.867126; camera.fx = 563.343384; camera.fy = 563.343384.
Then, the mobile phone 10 calculates coordinates (px, py, pz) of each point in the point cloud corresponding to the depth image of the space where the object to be measured is located by using the mapping formula.
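For illustration, a minimal sketch of the depth-image-to-point-cloud conversion under the mapping relations reconstructed above, using the example intrinsic parameters quoted in the previous paragraph; the function name, variable names, and units are assumptions, not the actual implementation.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, scale=1.0):
    """Back-project a depth image (pixel value = distance to the camera) into
    an (H*W, 3) array of camera-frame points (Px, Py, Pz)."""
    h, w = depth.shape
    j, i = np.meshgrid(np.arange(w), np.arange(h))   # j: column index, i: row index
    pz = depth / scale                               # Pz = I(i, j) / scale
    px = (j - cx) * pz / fx                          # Px = (j - cx) * Pz / fx
    py = (i - cy) * pz / fy                          # Py = (i - cy) * Pz / fy
    return np.stack([px, py, pz], axis=-1).reshape(-1, 3)

# Example intrinsics quoted above (values illustrative; depth in millimetres when scale = 1.0).
intrinsics = dict(fx=563.343384, fy=563.343384, cx=313.259979, cy=270.867126, scale=1.0)
```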
2) The mobile phone 10 determines a ground area in the depth image corresponding to the RGB image according to the ground area in the RGB image, and converts pixel data of the depth image corresponding to the ground area in the depth image into ground point cloud.
Specifically, the mobile phone 10 identifies a ground area in the RGB image of the space where the object to be detected is located by using the trained semantic segmentation model; then, the mobile phone 10 determines a ground area in the depth image corresponding to the RGB image according to the ground area in the RGB image; finally, the mobile phone 10 determines the ground point cloud in the point cloud 1113b corresponding to the depth image according to the ground area in the depth image.
3) The handset 10 fits one or more planes (i.e., sub-virtual planes) using the above-described ground point cloud.
It should be understood that the ground point cloud corresponds to the ground area in the depth image, so for perfectly flat ground the Z coordinates of the corresponding ground point cloud would all be the same; however, even the ground in a real scene is not necessarily a perfectly flat plane. For example, the floor may be covered with a carpet, or it may have depressions or bumps. The actual ground point cloud is therefore a collection of discrete points that reflects the general shape of the ground, i.e., the Z coordinates of the points in the ground point cloud are not necessarily equal.
Therefore, the mobile phone 10 obtains the ground point cloud corresponding to the ground area in the space where the object to be measured is located through the above steps 1) to 2), and in the process of fitting the ground of the AR scene according to the coordinates (px, py, pz) (mainly Z coordinates, pz) of each point in the ground point cloud, it is possible to fit one or more planes, and then further plane fitting is performed on the fitted planes according to the distance between each plane and the camera 193 or the distance difference between each plane to form a larger plane.
For example, assume that there are 100 points in the ground point cloud: 20 points P1, P2, P3, ..., P20; another 20 points P21, P22, P23, ..., P40; another 20 points P41, P42, P43, ..., P60; 30 points P61, P62, P63, ..., P90; and the remaining 10 points P91, P92, P93, ..., P100, which all have different Z coordinates.

Then the mobile phone 10 finally fits 4 planes according to the Z coordinates of these points: it fits P1, P2, P3, ..., P20, which share the Z coordinate -8, into plane 1; fits P21, P22, P23, ..., P40, which share the Z coordinate -10, into plane 2; fits P41, P42, P43, ..., P60, which share the Z coordinate -9, into plane 3; fits P61, P62, P63, ..., P90, which share the Z coordinate -20, into plane 4; and discards the points P91, P92, P93, ..., P100, whose Z coordinates all differ. It will be appreciated that if the Z coordinates of all 100 points were the same, the mobile phone 10 could fit a single plane from the 100 points and use that plane as the ground in the AR scene. A simplified sketch of this grouping step is given below.
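The sketch below groups points by quantized Z coordinate and discards groups with too few points, mirroring the example above; the tolerance and the minimum group size of 20 are assumed parameters for illustration only.

```python
import numpy as np
from collections import defaultdict

def group_points_by_z(points, tolerance=0.5, min_points=20):
    """Group point-cloud points whose Z coordinates agree within a tolerance;
    each sufficiently large group is treated as one fitted sub-plane, and
    isolated points (groups smaller than min_points) are discarded."""
    groups = defaultdict(list)
    for p in points:
        key = int(round(p[2] / tolerance))     # quantize Z so near-equal values share a bin
        groups[key].append(p)
    return [np.array(g) for g in groups.values() if len(g) >= min_points]
```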
4) The handset 10 calculates the distances from the planes to the camera 193.
Specifically, after the planes 1, 2, 3, and 4 are fitted, the mobile phone 10 calculates the distances from each of the planes 1, 2, 3, and 4 to the camera 193 of the mobile phone 10. It can be understood that, in the calculation formula of step 1), the coordinates of each point in the point cloud are relative to the camera 193, so that the distance from the plane fitted by each point in the ground point cloud to the camera 193 is equal to the Z coordinate of a certain point in each plane.
Therefore, the distances from the plane 1, the plane 2, the plane 3, and the plane 4 to the camera 193 are 8 cm, 10 cm, 9 cm, and 20 cm, respectively.
5) The mobile phone 10 fits a plane having a distance difference to the camera 193 within a preset distance in each plane until a maximum plane is obtained, and the mobile phone 10 uses the maximum plane as the ground in the AR scene.
For example, plane 1 is 8 cm from the camera 193, plane 2 is 10 cm, plane 3 is 9 cm, and plane 4 is 20 cm. The mobile phone 10 fits plane 1 and plane 3 into a new plane 5, whose distance to the camera 193 is the average of the distances of plane 1 and plane 3, i.e., 8.5 cm; at the same time, it fits plane 2 and plane 3 into a new plane 6, whose distance to the camera 193 is the average of the distances of plane 2 and plane 3, i.e., 9.5 cm. The differences between the distance of plane 4 to the camera 193 and those of planes 1, 2, and 3 are all larger than 1 cm, so plane 4 is discarded in the ground fitting process.

The mobile phone 10 then repeats the above process until the largest plane is obtained. For example, since the distance from plane 5 to the camera 193 is 8.5 cm and the distance from plane 6 to the camera 193 is 9.5 cm, the mobile phone 10 continues to fit plane 5 and plane 6 into a new plane 7 whose distance to the camera 193 is 9 cm. Plane 7 is the largest plane at this point, so plane 7 is taken as the ground in the AR scene.
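A simplified sketch of this merging step: planes whose distances to the camera form a chain of differences no larger than the preset distance are fitted into one plane at the mean distance. The merge order differs slightly from the stepwise numbers above, but the end result for the example (a plane at 9 cm, with the 20 cm plane left out) is the same. The function name and the greedy chaining strategy are assumptions.

```python
def merge_planes(plane_distances, preset=1.0):
    """Chain-merge fitted planes: any run of planes whose consecutive camera
    distances differ by no more than `preset` is fitted into one plane placed
    at the mean distance; planes outside every run stay separate."""
    if not plane_distances:
        return []
    distances = sorted(plane_distances)
    merged, current = [], [distances[0]]
    for d in distances[1:]:
        if d - current[-1] <= preset:
            current.append(d)
        else:
            merged.append(sum(current) / len(current))
            current = [d]
    merged.append(sum(current) / len(current))
    return merged

# Planes at 8, 10, 9 and 20 cm (the example above): the first three collapse
# to a plane at 9 cm and the 20 cm plane stays separate.
print(merge_planes([8.0, 10.0, 9.0, 20.0]))   # -> [9.0, 20.0]
```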
It should be understood that when fitting the ground in the AR scene, the Z coordinate of each point in the point cloud is used as the basis for fitting. If another plane in the AR scene, such as a wall, is fitted, the X coordinate of each point in the point cloud is used as the basis for fitting the wall; and if a plane of another object in the AR scene is fitted, the distance from each point in the point cloud to the camera 193 is considered comprehensively. The specific plane fitting process is similar in principle to steps 1) to 5) above and is not described here again.
Alternatively, the mobile phone 10 may also recognize the ground area in the RGB image by using a semantic segmentation model, determine the ground area in the depth image corresponding to the RGB image according to the ground area in the RGB image, convert the ground area in the depth image into a ground point cloud by using the method of step 1) by the mobile phone 10, and fit the ground in the AR scene according to the methods of steps 2) to 5). Compared with the mobile phone 10 which converts the whole depth image into the corresponding point cloud, the mobile phone 10 only converts the ground area in the depth image into the ground point cloud, and the efficiency of the mobile phone 10 in converting the depth image into the corresponding point cloud is improved.
In addition, when the mobile phone 10 identifies the ground in the RGB image using the semantic segmentation model, the identified ground area may include some edge points, such as points at the boundary between the ground and a wall surface or between the ground and a desktop. These edge points are often discrete relative to the other, more densely distributed points of the ground point cloud, and corresponding points also exist in the depth image corresponding to the RGB image. As a result, when the mobile phone 10 fits the ground in the AR scene from the ground point cloud, it must separately determine whether these points can be fitted together with other points into a plane, which affects the efficiency of plane fitting. Therefore, in a possible implementation manner of the present application, the plane fitting process may further be as follows:
first, the mobile phone 10 divides the ground point cloud into a plurality of space blocks, such as space block 1, space block 2, and space block 3, according to the coordinates of each point in the ground point cloud, where the distance from the center of each space block to the mobile phone 10 may be a preset distance, for example 5 cm, 6 cm, or 7 cm;

then, the mobile phone 10 determines whether each space block contains points and whether the number of points it contains satisfies a preset condition, for example at least 10 points. When a certain space block, for example space block 1, contains no points, or contains fewer than 10 points, space block 1 is discarded and planes are fitted only from the points in space block 2 and space block 3. Specifically, the planes in space block 2, such as plane 8, plane 9, and plane 10, are fitted from the points in space block 2, and the planes in space block 3, such as plane 11 and plane 12, are fitted from the points in space block 3; finally, the mobile phone 10 determines, according to the distances from planes 8, 9, and 10 and planes 11 and 12 to the camera 193 of the mobile phone 10, whether to fit them into a larger plane (a simplified sketch of this block partition follows). The process of fitting planes within a space block is the same as the process of directly fitting planes from points in step 3) above and is not repeated here.
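A simplified sketch of the space-block partition described above, assuming cubic blocks of a fixed side length and a minimum point count per block; both parameters are placeholders rather than values from the application.

```python
import numpy as np
from collections import defaultdict

def partition_into_blocks(points, block_size=0.05, min_points=10):
    """Divide a point cloud into cubic space blocks of side `block_size`
    (metres) and keep only blocks containing at least `min_points` points,
    so sparse edge points are dropped before plane fitting."""
    blocks = defaultdict(list)
    for p in points:
        key = tuple((np.asarray(p) // block_size).astype(int))  # block index along each axis
        blocks[key].append(p)
    return {k: np.array(v) for k, v in blocks.items() if len(v) >= min_points}
```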
In addition, since the position of the mobile phone 10 changes as it approaches or moves away from the person to be measured during the process of "moving the device and finding the ground" (as shown in fig. 4 (a)), the three-dimensional coordinates of the depth-sensing camera 193 change in synchronization with the position of the mobile phone 10 (assuming the orientation of the depth-sensing camera 193 does not change). Accordingly, the display range of the depth image, acquired by the mobile phone 10 through the depth-sensing camera 193, of the ground where the person to be measured is located in the real scene also changes.
For example, when the mobile phone 10 is close to the person to be measured, the range covered by the depth image of the ground acquired through the depth-sensing camera 193 becomes smaller while the captured ground detail becomes richer; when the mobile phone 10 is far away from the person to be measured, the range covered by the acquired depth image becomes larger while the captured ground detail becomes coarser.
It can be understood that, in the above process, if the mobile phone 10 only uses the depth image corresponding to a single pose of the depth-sensing camera 193, it cannot fit a ground in the AR scene that is consistent with the ground in the real scene. For example, if the mobile phone 10 fits the ground in the AR scene using only the depth image acquired when the depth-sensing camera 193 is close to the person to be measured, the fitted ground range in the AR scene will be smaller than the actual ground range around the person in the real scene.
Therefore, in some embodiments of the present application, the mobile phone 10 fits the ground of the AR scene by combining the depth images corresponding to the depth-sensing camera 193 in different poses. In one possible implementation, the mobile phone 10 obtains, through the IMU unit 180B, the acceleration and angular velocity 1112b of the mobile phone 10 at each time while the user slowly moves the mobile phone 10, and then determines the current pose information 1114b of the camera 193, that is, the three-dimensional coordinates and orientation angle of the camera 193, by using a simultaneous localization and mapping (SLAM) algorithm. It should be understood that since the camera 193 is mounted on the mobile phone 10, the acceleration and angular velocity of the mobile phone 10 while it is moved by the user can be regarded as the acceleration and angular velocity of the camera 193. Alternatively, the SLAM algorithm may be developed into an SDK file in advance and preset in the mobile phone 10, so that after the IMU unit 180B obtains the acceleration and angular velocity of the mobile phone 10 at each time, these values are input to the SLAM algorithm software, which computes the three-dimensional coordinates and orientation angle of the camera 193.
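The following is only a toy illustration of how IMU readings (acceleration and angular velocity per time step) can be integrated into a rough pose estimate. It is a naive dead-reckoning sketch, not the SLAM algorithm referred to above, which additionally fuses visual features and is far more robust.

```python
import numpy as np

def dead_reckon(accels, gyros, dt):
    """Toy pose propagation: integrate z-axis angular velocity into a yaw
    angle and double-integrate acceleration into a position."""
    position = np.zeros(3)
    velocity = np.zeros(3)
    yaw = 0.0
    poses = []
    for a, w in zip(accels, gyros):
        yaw += w[2] * dt                              # angular velocity -> orientation (yaw only)
        velocity += np.asarray(a, dtype=float) * dt   # acceleration -> velocity
        position += velocity * dt                     # velocity -> position
        poses.append((position.copy(), yaw))
    return poses
```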
Meanwhile, while the user "slowly moves the device to search for the ground" (refer to fig. 4 (a)), the depth-sensing camera 193 of the mobile phone 10 continuously acquires RGB-D images of the space where the person to be measured is located. The mobile phone 10 fits a plurality of planes in the AR scene from the acquired depth images according to steps 1) to 5) above, and at the same time performs semantic segmentation 1120b on the RGB images corresponding to the depth images using the trained semantic segmentation model, so as to determine the type of each plane in the RGB images, that is, the semantic labels 1123b.
As described above, the RGB image and the depth image in an RGB-D image correspond pixel by pixel, and each point of the depth image corresponds one to one to a point in the point cloud obtained from the depth image, so each point in the RGB image also has a one-to-one correspondence with a point in the point cloud. Thus, the mobile phone 10 can map the plane types identified in the RGB image onto the planes fitted in the AR scene from the point cloud; that is, the mobile phone 10 adds semantic tags 1123b representing the plane types to the fitted planes in the AR scene according to the plane types in the RGB image, and then generates the planes 1116b with semantic tags 1123b.
For example, the mobile phone 10 determines, through the trained semantic segmentation model, planes such as a ground plane and a wall plane in the RGB image of the space where the person to be measured is located in the real scene, and then the mobile phone 10 marks the "ground plane" on the plane fitted according to the depth image corresponding to the ground plane in the real scene in the AR scene, and marks the "wall plane" on the plane fitted according to the depth image corresponding to the wall plane in the real scene in the AR scene.
As described above, since the pose of the depth-sensing camera 193 is changed all the time when the mobile phone 10 is moved by the user, for example, when the mobile phone 10 is close to or away from the person to be measured, the range of the ground included in the RGB image acquired by the depth-sensing camera 193 is different, and accordingly, the point cloud of the ground in the depth image corresponding to the RGB image is also different. Therefore, the mobile phone 10 calculates plane confidence of a plurality of planes fitted by the point clouds corresponding to the depth images acquired by the depth perception camera 193 in different poses. And the plane confidence coefficient is the ratio of the number of points contained in the fitted plane to the number of points in the point cloud corresponding to the plane in the depth image. The plane confidence calculation process will be described below.
Then, the mobile phone 10 takes the plane type in the RGB image and the plane confidence of the fitted plane as labels of multiple planes in the fitted AR scene, and represents the labels as Tag { type, confidence }.
Finally, the mobile phone 10 fits the ground in the AR scene based on each plane in the AR scene fitted by the RGB-D image of the space where the object to be measured is acquired by the camera 193 in each pose and the Tag content corresponding to each plane, so that the ground in the AR scene is closer to the ground in the real scene.
For example, assume that while being slowly moved by the user to find the ground, the mobile phone 10 obtains multiple RGB-D images in different poses and, using the method of steps 1) to 5), fits 10 planes whose type is ground with a confidence of 90% and 5 planes whose type is ground with a confidence of only 30%. When fitting the ground in the AR scene from these planes, the mobile phone 10 may select the 10 higher-confidence ground planes for plane fitting, and either discard the 5 low-confidence ground planes or use them as references with reduced weight (a simplified selection sketch follows this paragraph). It can be understood that the more RGB-D images of the ground in the real scene the camera 193 acquires in different poses, the more planes labelled as ground the mobile phone 10 can fit, and the closer the finally fitted ground in the AR scene is to the ground in the real scene.
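A minimal sketch of the confidence-based selection described in this example, assuming each fitted plane carries a Tag{type, confidence} record; the threshold of 0.5 is an arbitrary placeholder, and down-weighting instead of discarding would be an equally valid variant.

```python
def select_ground_planes(tagged_planes, min_confidence=0.5):
    """Keep planes labelled 'ground' whose confidence passes the threshold;
    lower-confidence ground planes could instead be kept with reduced weight."""
    return [p for p in tagged_planes
            if p["type"] == "ground" and p["confidence"] >= min_confidence]

planes = [{"type": "ground", "confidence": 0.9}] * 10 + \
         [{"type": "ground", "confidence": 0.3}] * 5
print(len(select_ground_planes(planes)))   # -> 10, matching the example above
```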
Optionally, each plane with Tag { type, confidence } may be used as a data set for semantic segmentation training to train the semantic segmentation model again, so as to further optimize the semantic segmentation model, so that the type of the plane identified by the semantic segmentation model is more accurate.
For example, if the mobile phone 10 fits 5 planes whose type is ground with a confidence of 90% and 5 planes whose type is wall with a confidence of 80% using the method described above, the mobile phone 10 may use the 5 ground planes with 90% confidence as ground samples in the data set (or the 5 wall planes with 80% confidence as wall samples), input them as target data for semantic segmentation training, and train the semantic segmentation model until the confidence of the planes output by the model is the same as or close to that of the input planes, at which point the semantic segmentation model is considered trained. The semantic segmentation model is continuously optimized in this way.
Corresponding to the above plane confidence, the following briefly introduces the calculation method of the plane confidence:
in the above plane fitting process, the camera 193 of the mobile phone 10 continuously scans the fitted planes to obtain their distances to the camera 193, and then determines, according to these distances, whether some of the fitted planes should be further fitted into a larger plane. For example, suppose a space contains three fitted planes L, M, and N, whose distances to the camera 193 of the mobile phone 10 are 10 cm, 9 cm, and 30 cm respectively. The mobile phone 10 then determines whether these planes can be fitted into one plane according to whether the differences between their distances to the camera 193 exceed a preset distance. Taking a preset distance of 1 cm as an example: the distances of the L plane and the M plane to the camera 193 are 10 cm and 9 cm, and their difference equals the preset distance of 1 cm, so the L plane and the M plane can be fitted into one plane; the distance differences between the N plane and the other two planes are greater than 1 cm, so the N plane is discarded during plane fitting. The mobile phone 10 then calculates the average of the distances from the L plane and the M plane to the camera 193, i.e., (10 + 9)/2 = 9.5 cm, and fits a new plane O at a distance of 9.5 cm from the camera 193 by combining the point clouds of the L plane and the M plane. Assuming the L plane contains 100 point-cloud points, the M plane 200 points, and the N plane 200 points, the plane confidence of the new plane O is (100 + 200)/(100 + 200 + 200) = 60%. It can be understood that the plane confidence may also be calculated as the ratio of the number of point-cloud points used in the fitted plane to the number of all point-cloud points in the space; the calculation method of the plane confidence is not limited in the present application.
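The worked confidence figure above can be reproduced with a one-line ratio; the helper below is purely illustrative and its name is not from the application.

```python
def plane_confidence(points_in_fitted_plane, points_in_candidate_planes):
    """Ratio of point-cloud points contained in the fitted plane to the points
    of all candidate planes considered for the fit."""
    return points_in_fitted_plane / points_in_candidate_planes

# Worked example above: plane O is fitted from the L (100 points) and M (200 points)
# planes, while the discarded N plane contributes another 200 candidate points.
print(plane_confidence(100 + 200, 100 + 200 + 200))   # -> 0.6, i.e. 60%
```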
It should be understood that the object size measurement method of the present application can measure not only the size of an object to be measured perpendicular to the ground, such as the height of a person, but also the size of an object to be measured perpendicular to a wall surface. In that case, the mobile phone 10 identifies the wall surface from the acquired RGB image, determines the depth data of the wall surface in the corresponding depth image according to the position information of the wall surface, converts the depth data of the wall surface into a corresponding point cloud, fits the wall surface in the AR scene based on that point cloud, and measures the size of the object perpendicular to the wall surface with the wall surface as the measurement reference plane. The specific measurement principle is consistent with that of measuring the height of a person, and reference may be made to the related description above, which is not repeated here.
In another possible implementation manner of the present application, in step 1103 the mobile phone 10 may identify the head vertex P1 of the person by acquiring an image of the person through the camera 193, identifying the head contour of the person using a two-dimensional image recognition method, taking the top point of the head contour as the head vertex, and then calculating the distance between the head vertex and the measurement reference plane.
Fig. 13 is a schematic diagram illustrating an example of recognizing a head contour of a person by using an image recognition method according to an embodiment of the present application.
As shown in fig. 13, the mobile phone 10 obtains a two-dimensional image A including the person's head through the camera 193, determines the contour L of the person's head using an image recognition method based on a convolutional neural network model, and takes the highest point P1 of the contour as the head vertex of the person. The distance from P1 to the measurement reference plane is then calculated and displayed on the mobile phone 10 as the height of the person.
Since the above method of identifying the head vertex P1 is based on a two-dimensional image, the coordinates of the head vertex P1 at this stage belong to the image coordinate system, that is, they are two-dimensional coordinates with only X and Y values and no Z-axis coordinate that could represent the depth of the head vertex P1.
Since the measurement reference plane belongs to the world coordinate system, the head vertex P1 must be converted into world coordinates (Xp1, Yp1, Zp1) when calculating the distance from P1 to the measurement reference plane.
It can be understood that the image coordinate system is two-dimensional while the world coordinate system is three-dimensional, and the two-dimensional coordinate system lacks depth data, that is, a point in the two-dimensional coordinate system has no corresponding depth. Therefore, when determining the three-dimensional coordinates corresponding to a point in the two-dimensional coordinate system, the distance from the point to the camera 193 is first determined from the point's coordinates in multiple two-dimensional images, and a series of rotation and translation transformations is then applied to obtain the three-dimensional coordinates of the point. Because the poses in which the camera 193 acquires the multiple two-dimensional images may differ, and the two-dimensional images selected for different points may also differ, unavoidable errors arise in converting from the two-dimensional image coordinate system to the three-dimensional coordinate system, that is, in converting the coordinates of the head vertex P1 from the image coordinate system to the world coordinate system, which in turn limits the accuracy of the height measurement.
Therefore, in order to further improve the accuracy of height measurement, in another possible implementation manner of the present application, the mobile phone 10 may determine the head vertex of the person as follows: the mobile phone 10 obtains a depth image of the person's face, obtains a first feature point of the face based on the face depth image, and then determines, based on the first feature point of the face, the distance from the head vertex of the person to the measurement reference plane. FIG. 14 (a) is a flow chart of a method for determining a human head vertex and calculating a person's height using three-dimensional face recognition techniques according to some embodiments of the present application.
As shown in fig. 14 (a), the method 1400 includes:
1401: the handset 10 acquires a face depth image of a person.
In some embodiments of the present application, the cell phone 10 obtains a facial depth image of a person through the depth-sensing camera 193.
1402: the mobile phone 10 determines a first feature point of the face of the person based on the face depth image.
In one possible implementation manner of the present application, the mobile phone 10 identifies the depth image of the face in step 1401 by using a three-dimensional face recognition technology, and determines a first feature point of the face of the person.
The three-dimensional face recognition technology mainly uses a face detection algorithm, a feature point calibration algorithm, a three-dimensional greedy mapping surface reconstruction algorithm, a shortest path algorithm, an equidistant mapping algorithm, and matrix K-order moments in deep learning to extract overall three-dimensional face features, and recognizes the face based on the extracted three-dimensional face features. In the present application, the mobile phone 10 may use three-dimensional face recognition to identify feature points of a three-dimensional face, that is, the first feature points of the face. Alternatively, a first facial feature point may be a relatively prominent part such as the nose tip, the forehead, the cheeks, the eyebrows, or the bridge of the nose, or a sharp corner such as the corner of an eye, a nose wing, or a corner of the mouth. The present application does not limit the specific form of the first feature point of the face.
Optionally, the three-dimensional face recognition technology may be integrated into the face recognition SDK of the mobile phone 10, and when the mobile phone 10 needs to recognize a face, the face recognition SDK is called. The Face SDK may be integrated into a Face (Face) AR recognition module of the mobile phone 10.
1403: the mobile phone 10 determines the head vertex of the person according to the first feature point of the face, determines the distance from the head vertex to the ground in the AR scene, and takes the distance from the head vertex to the ground in the AR scene as the height of the person.
Since the nose tip is the most prominent of the facial feature points, it is the easiest to recognize in a depth image of a person's face. Taking the nose tip as the first facial feature point identified by the mobile phone 10, in one possible implementation the mobile phone 10 obtains the nose tip coordinates of the person to be measured, further determines the distance from the person's head vertex to the nose tip based on the nose tip coordinates and the proportions of the person's face, and then takes the sum of this distance and the distance from the nose tip to the ground as the height of the person to be measured.
FIG. 14 (b) is a flow chart of a face model generated using three-dimensional face recognition technology in combination with a neural network model according to some embodiments of the present application.
As shown in fig. 14 (b), the system includes a facial depth image acquisition module 1401b, a 3d face detection module 1402b, a mark generation module 1403b, and a face mesh generation module 1404b.
The facial depth image acquisition module 1401b is used to implement step 1401 above, and the 3D face detection module 1402b and the mark generation module 1403b cooperate to implement steps 1402-1403; that is, the 3D face detection module 1402b identifies the face in the face depth image, the mark generation module 1403b marks all the facial feature points, and finally the face mesh generation module 1404b generates a face mesh carrying the facial feature point marks. It should be understood that the specific implementation of each module corresponds to the steps of the method 1400, and reference may be made to the description of the method 1400, which is not repeated here.
After the face mesh is generated by the process shown in fig. 14 (b), the mobile phone 10 determines the position of the vertex of the human head according to the face proportion of the human face in the face mesh. Specifically, fig. 15 is a schematic diagram of an example of determining the vertex position of a human head according to some embodiments of the present disclosure.
As shown in fig. 15, the mobile phone 10 can determine the coordinates of the nose tip 29 in the world coordinate system according to the face depth image of the person, and then determine the coordinates of the head vertex 63 (i.e., P1) of the person to be measured according to the face proportion of the person. For example, assuming that the coordinates of the nose tip 29 are (x 1, y1, z 1) and the coordinates of the point 62 in the middle of the eyebrow are (x 2, y2, z 2), the distance from the head vertex 63 to the point 62 in the middle of the eyebrow is equal to the distance from the nose tip 29 to the point 62 in the middle of the eyebrow and is also equal to the distance from the nose tip 29 to the chin 54 in the face proportion of the person.
Therefore, the height of the person is equal to: the distance from the nose tip 29 to the ground plus the distance from the nose tip 29 to the head vertex 63, i.e., the distance from the nose tip 29 to the ground plus 2 times the distance from the nose tip 29 to the chin 54, or equivalently the distance from the nose tip 29 to the ground plus 2 times the distance from the nose tip 29 to the point 62 in the middle of the eyebrows.
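For illustration, a minimal sketch of this face-proportion height rule, assuming the nose tip and chin are available as world coordinates and the ground is the plane z = 0; all coordinate values below are hypothetical.

```python
import numpy as np

def height_from_face_points(nose_tip, chin, ground_normal=(0.0, 0.0, 1.0), ground_d=0.0):
    """Face-proportion height rule described above:
    height = distance(nose tip, ground) + 2 * distance(nose tip, chin)."""
    n = np.asarray(ground_normal, dtype=float)
    nose = np.asarray(nose_tip, dtype=float)
    nose_to_ground = abs(np.dot(n, nose) + ground_d) / np.linalg.norm(n)
    nose_to_chin = np.linalg.norm(nose - np.asarray(chin, dtype=float))
    return nose_to_ground + 2.0 * nose_to_chin

# Hypothetical world coordinates (metres): nose tip 29 at a height of 1.58 m,
# chin 54 about 8 cm below and slightly behind the nose tip.
nose = [0.00, 0.05, 1.58]
chin = [0.00, 0.03, 1.50]
print(f"{height_from_face_points(nose, chin):.2f} m")   # ~1.74 m in this toy example
```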
After determining the height of the person, the mobile phone 10 displays the height of the person on the UI of the mobile phone 10 according to the method for displaying the height of the person in fig. 4 to 9. Reference may be made to the descriptions in fig. 4 to 9, which are not repeated herein.
Fig. 16 is a software configuration diagram of the mobile phone 10 according to the embodiment of the present application. Taking the operating system of the mobile phone 10 as an Android system as an example, in some embodiments, the Android system is divided into four layers, which are an application layer 1601, an application Framework (FWK) 1602, a Hardware Abstraction Layer (HAL) 1603, and a kernel layer 1604.
As shown in fig. 16, the application layer comprises a series of application packages, which may include applications such as camera 1601a, calendar 1601b, short message 1601c, AR measurement 1601d, and navigation 1601e; not all of them are listed in the embodiments of the present application.
The application framework layer 1602 provides an Application Programming Interface (API) and a programming framework for the applications of the application layer. The application framework layer 1602 includes some predefined function interfaces, such as an event listening function for receiving events sent by the application layer.
In some embodiments of the present application, the event listening function of the application framework layer 1602 receives the "AR measurement" APP started by the user and the touch operation in the "AR measurement" APP, so as to call the corresponding hardware service according to the specific touch operation of the user.
The HAL 1603 is an interface layer located between the operating system kernel layer 1604 and the hardware circuitry of the physical layer 1605, and is intended to abstract the hardware. Downwards, the HAL 1603 shields the implementation details of the hardware in the mobile phone 10; upwards, it provides an abstract interface through which the application layer 1601 can call hardware services in the mobile phone 10.
In a possible implementation manner of the present application, the event listening function of the application framework layer 1602 receives the user's launch of the "AR measurement" APP and the touch operations within the "AR measurement" APP, and in response the application framework layer 1602 calls the services of specific hardware in the physical layer 1605 through the ServiceManager 1602a interface shown in fig. 16. For example, when the user opens AR measurement 1601d in the application layer 1601 to measure the height of the person to be measured and the mobile phone 10 searches for the ground, in response to the user selecting the "height measurement" mode (as shown in fig. 4 (a) and 4 (b)), the "AR measurement" application calls, through hw_get_module 1603a, the camera driver 1604a and the sensor driver 1604b in the kernel layer 1604 to drive the camera 193 (as shown in fig. 10 (a)) to obtain RGB-D images of the ground in the space where the person to be measured is located, and simultaneously obtains the pose information of the camera 193 through the IMU unit 180B of the mobile phone 10 (as shown in fig. 10 (a)), so that the mobile phone 10 can fit the plane in the AR scene by combining the depth images of the ground with the pose information of the camera 193.
The kernel layer 1604 is a layer between hardware and software. The core layer 1604 includes at least camera drivers 1604a, sensor drivers 1604b, and may also include display drivers, microphone drivers, and the like.
It should be understood that the above-mentioned software architecture layer of the mobile phone 10 is only exemplary and should not be construed as limiting the present application, and in some other implementations, the software architecture layer of the mobile phone 10 may be further divided into more or less layers according to the layering principle, for example, the software architecture layer of the mobile phone 10 may further include a system layer, a physical layer, or the software architecture layer of the mobile phone 10 may not include the HAL 1603. This is not limited by the present application.
All or part of the flow in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium may not be an electrical carrier signal or a telecommunications signal.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are merely illustrative; for instance, the division into modules or units is only one logical division, and other divisions are possible in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In the description above, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and are not intended to indicate or imply relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather mean "one or more but not all embodiments," unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless otherwise specifically stated.
The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (12)

1. A method for measuring a size of an object, applied to an electronic device, characterized in that the method comprises:
acquiring a two-dimensional information image and a three-dimensional information image of a real plane where an object to be measured is located in a real scene, wherein pixels in the two-dimensional information image and pixels in the three-dimensional information image have a one-to-one positional correspondence;
acquiring position information of the plane according to the two-dimensional information image, and acquiring three-dimensional pixel information of pixels corresponding to the position information in the three-dimensional information image;
converting the obtained three-dimensional pixel information into a point cloud, and generating a virtual plane corresponding to the real plane in a virtual space based on the converted point cloud; and
measuring the size of the object to be measured by taking the virtual plane as a reference plane.
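For readability only, the following Python sketch walks through the claim-1 pipeline under simplifying assumptions: the plane pixels are assumed to be already available as a boolean mask, a pinhole camera model with intrinsics (fx, fy, cx, cy) is assumed, and the plane fit is a plain least-squares fit. None of the function or variable names come from the patent.

```python
import numpy as np


def depth_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project the masked depth pixels (the three-dimensional pixel information)
    into an N x 3 point cloud in camera coordinates."""
    v, u = np.nonzero(mask)            # pixel rows/cols belonging to the plane region
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def fit_virtual_plane(points):
    """Least-squares plane through the point cloud: returns (normal, offset d)
    for the plane equation normal . p + d = 0."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                    # direction of least variance = plane normal
    return normal, -normal.dot(centroid)


def distance_to_plane(point, normal, d):
    """Distance from a 3-D point to the fitted reference plane (used for measurement)."""
    return float(abs(normal.dot(point) + d) / np.linalg.norm(normal))
```

In this reading, the virtual plane is simply the fitted (normal, offset) pair, and any measurement relative to the reference plane reduces to point-to-plane distances.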
2. The method according to claim 1, wherein the acquiring of the two-dimensional information image and the three-dimensional information image of the real plane where the object to be measured is located in the real scene comprises:
simultaneously acquiring the two-dimensional information image and the three-dimensional information image of the real plane where the object to be measured is located through a depth-sensing camera of the electronic device.
3. The method according to claim 1 or 2, characterized in that the two-dimensional information image represents two-dimensional features of the object to be measured, the two-dimensional features comprising one or more of color features, grayscale features, and texture features, and the three-dimensional information image represents three-dimensional features of the object to be measured, the three-dimensional features comprising spatial depth values of the object to be measured.
4. The method of any of claims 1-3, wherein the virtual space comprises an Augmented Reality (AR) scene, wherein the two-dimensional information image comprises a color space image, wherein the color space image comprises an RGB image or a YUV image, and wherein the three-dimensional information image comprises a depth image.
5. The method according to claim 1, wherein the plane in the two-dimensional information image is identified by using a semantic segmentation model to acquire the position information of the plane.
6. The method according to claim 5, wherein the semantic segmentation model is a fully convolutional network (FCN) model.
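As an illustration of claims 5 and 6 only, the sketch below shows one way a fully convolutional segmentation network could supply the plane's position information. It assumes a recent torchvision; the pretrained weights and the GROUND_CLASS_ID index are placeholders, since in practice a model actually trained with a ground/floor class would be required.

```python
import torch
from torchvision.models.segmentation import fcn_resnet50

GROUND_CLASS_ID = 1  # hypothetical index of the ground/floor class in the label set

# Pretrained weights here only illustrate the call pattern; they are not ground-specific.
model = fcn_resnet50(weights="DEFAULT").eval()


def plane_mask(rgb_tensor):
    """rgb_tensor: 1 x 3 x H x W normalized image -> boolean H x W mask of plane pixels."""
    with torch.no_grad():
        logits = model(rgb_tensor)["out"]   # 1 x C x H x W per-class scores
    labels = logits.argmax(dim=1)[0]        # H x W per-pixel class ids
    return labels == GROUND_CLASS_ID        # position information of the plane
```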
7. The method of claim 1, wherein converting the obtained three-dimensional pixel information into a point cloud and generating a virtual plane in a virtual space corresponding to the real plane based on the converted point cloud comprises:
generating a plurality of sub-virtual planes corresponding to the real plane in the virtual space based on the converted point cloud; and
determining a plane confidence coefficient of each of the plurality of sub-virtual planes, and generating the virtual plane based on a part of the plurality of sub-virtual planes, wherein the plane confidence coefficient represents a ratio between the number of points of the point cloud in each sub-virtual plane and the number of points of the point cloud corresponding to the real plane.
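The following sketch illustrates the plane-confidence idea of claim 7 under assumptions of our own: each sub-virtual plane is scored by the fraction of the real plane's point cloud it contains, sub-planes below a hypothetical threshold are discarded, and the surviving points are refit into the final virtual plane. The threshold value and helper names are not taken from the patent.

```python
import numpy as np


def plane_confidence(sub_plane_points, total_points):
    """Ratio between the points inside one sub-virtual plane and all points of the real plane."""
    return len(sub_plane_points) / total_points


def select_sub_planes(sub_planes, total_points, min_confidence=0.2):
    """Keep only the sub-virtual planes whose confidence coefficient clears the threshold."""
    return [p for p in sub_planes if plane_confidence(p, total_points) >= min_confidence]


def merge_to_virtual_plane(kept_sub_planes):
    """Refit a single virtual plane from the points of the retained sub-planes;
    returns the plane normal and a point on the plane."""
    points = np.vstack(kept_sub_planes)
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return vt[-1], centroid
```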
8. The method according to claim 1, wherein the object to be measured comprises a person, and the method further comprises:
acquiring a head image of the person;
determining a head vertex of the person from the head image; and
taking the distance between the head vertex and the virtual plane as the height of the person.
9. The method according to claim 8, wherein the head image is a three-dimensional information image of the head, and
the determining the head vertex of the person from the head image comprises:
recognizing facial feature points from the three-dimensional information image of the head of the person by a three-dimensional face recognition method; and
determining the head vertex based on the facial feature points.
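To illustrate claims 8 and 9, the sketch below takes a set of facial feature points, picks a head vertex with a deliberately simplistic rule (the landmark farthest along an assumed "up" direction), and returns the point-to-plane distance as the person's height. The patent's actual determination of the head vertex from facial feature points is not specified here; this is only a stand-in.

```python
import numpy as np


def head_vertex_from_landmarks(landmarks, up_direction):
    """Simplistic stand-in: pick the landmark farthest along the 'up' direction.
    landmarks: N x 3 facial feature points; up_direction: unit 3-vector."""
    heights = landmarks @ up_direction
    return landmarks[int(np.argmax(heights))]


def person_height(head_vertex, plane_normal, plane_d):
    """Height of the person = distance from the head vertex to the fitted ground plane."""
    return float(abs(plane_normal.dot(head_vertex) + plane_d) / np.linalg.norm(plane_normal))
```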
10. A computer-readable medium having stored thereon instructions that, when executed on an electronic device, cause the electronic device to perform the method for measuring a size of an object according to any one of claims 1 to 9.
11. An electronic device, characterized by comprising:
a camera connected with an inertial measurement unit (IMU);
one or more processors;
one or more memories; and
a module in which a plurality of applications are installed;
wherein the memory stores one or more programs, the one or more programs comprising instructions which, when executed by the electronic device, cause the electronic device to perform the method for measuring a size of an object according to any one of claims 1 to 9.
12. A computer program product comprising instructions which, when the computer program product is run on an electronic device, cause a processor to perform the method for measuring a size of an object according to any one of claims 1 to 9.
CN202110611252.8A 2021-05-31 2021-05-31 Method for measuring size of object, electronic device and medium thereof Pending CN115482359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110611252.8A CN115482359A (en) 2021-05-31 2021-05-31 Method for measuring size of object, electronic device and medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110611252.8A CN115482359A (en) 2021-05-31 2021-05-31 Method for measuring size of object, electronic device and medium thereof

Publications (1)

Publication Number Publication Date
CN115482359A true CN115482359A (en) 2022-12-16

Family

ID=84419288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110611252.8A Pending CN115482359A (en) 2021-05-31 2021-05-31 Method for measuring size of object, electronic device and medium thereof

Country Status (1)

Country Link
CN (1) CN115482359A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115810100A (en) * 2023-02-06 2023-03-17 阿里巴巴(中国)有限公司 Method, apparatus, storage medium and program product for determining object placement plane
CN115810100B (en) * 2023-02-06 2023-05-05 阿里巴巴(中国)有限公司 Method, device and storage medium for determining object placement plane
CN117011365A (en) * 2023-10-07 2023-11-07 宁德时代新能源科技股份有限公司 Dimension measuring method, dimension measuring device, computer equipment and storage medium
CN117011365B (en) * 2023-10-07 2024-03-15 宁德时代新能源科技股份有限公司 Dimension measuring method, dimension measuring device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US11989350B2 (en) Hand key point recognition model training method, hand key point recognition method and device
US11546505B2 (en) Touchless photo capture in response to detected hand gestures
CN110647865B (en) Face gesture recognition method, device, equipment and storage medium
KR101979669B1 (en) Method for correcting user’s gaze direction in image, machine-readable storage medium and communication terminal
US9779512B2 (en) Automatic generation of virtual materials from real-world materials
US20140254939A1 (en) Apparatus and method for outputting information on facial expression
US20240144611A1 (en) Augmented reality eyewear with speech bubbles and translation
CN112771856B (en) Separable distortion parallax determination
JP2021520577A (en) Image processing methods and devices, electronic devices and storage media
US11195341B1 (en) Augmented reality eyewear with 3D costumes
CN115482359A (en) Method for measuring size of object, electronic device and medium thereof
US11461883B1 (en) Dirty lens image correction
CN112581571B (en) Control method and device for virtual image model, electronic equipment and storage medium
CN116048244B (en) Gaze point estimation method and related equipment
CN110675413B (en) Three-dimensional face model construction method and device, computer equipment and storage medium
CN115393962A (en) Motion recognition method, head-mounted display device, and storage medium
JP4623320B2 (en) Three-dimensional shape estimation system and image generation system
CN116109828B (en) Image processing method and electronic device
CN115880348B (en) Face depth determining method, electronic equipment and storage medium
US20240169568A1 (en) Method, device, and computer program product for room layout
WO2024104170A1 (en) Image rendering method, and medium, product and electronic device
CN112866559B (en) Image acquisition method, device, system and storage medium
CN114387388A (en) Close-range three-dimensional face reconstruction device
CN118118644A (en) Separable distortion parallax determination
CN116863460A (en) Gesture recognition and training method, device, equipment and medium for gesture recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination