WO2022151507A1 - Movable platform and method and apparatus for controlling same, and machine-readable storage medium - Google Patents


Info

Publication number
WO2022151507A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
area
target area
depth map
movable platform
Prior art date
Application number
PCT/CN2021/072581
Other languages
French (fr)
Chinese (zh)
Inventor
施泽浩
封旭阳
聂谷洪
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司
Priority to PCT/CN2021/072581
Publication of WO2022151507A1



Classifications

    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05D — SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 — Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 — Image analysis

Abstract

A method for controlling a movable platform, comprising: acquiring a depth map, collected by the movable platform, of the scene where a target object is located (S201); acquiring a first target area on the depth map that covers the target object (S202); adjusting the first target area so that the proportion of the area corresponding to the target object within the first target area increases, thereby obtaining a tracking area (S203); and controlling, according to depth information of the tracking area, the movable platform to move relative to the target object (S204).

Description

Movable Platform, Control Method and Control Apparatus Therefor, and Machine-Readable Storage Medium
Technical Field
The present application relates to the field of intelligent perception, and in particular to a control method for a movable platform, a movable platform, a control apparatus, and a machine-readable storage medium.
Background Art
Using a movable platform, such as an unmanned aerial vehicle (UAV), a smart car, or an intelligent robot, to intelligently perceive a target object and obtain the distance between the target object and the movable platform, so as to control the movable platform accordingly to perform actions such as following, obstacle avoidance, and information interaction, is currently a research hotspot in the field of intelligent perception.
In the related art, the distance between the target object and the movable platform is usually obtained by acquiring a depth map containing the target object, projecting a target frame onto the depth map, and averaging the depth values within the target frame. However, the distance information obtained in this way is often inaccurate, which in turn makes the following, obstacle-avoidance, and information-interaction operations performed under the movable platform's control inaccurate, and may even create safety hazards.
Summary of the Invention
To overcome the problem in the related art that inaccurate distance information between the target object and the movable platform makes the following, obstacle-avoidance, and information-interaction operations performed under the movable platform's control inaccurate, and may even create safety hazards, the present application provides a movable platform, a control method therefor, a control apparatus, and a machine-readable storage medium.
According to a first aspect of the embodiments of the present application, a control method for a movable platform is provided. The method includes: acquiring a depth map, collected by the movable platform, of the scene where a target object is located; acquiring a first target area on the depth map that includes the target object; adjusting the first target area so that the proportion of the area corresponding to the target object within the first target area increases, to obtain a tracking area; and controlling, according to depth information of the tracking area, the movable platform to move relative to the target object.
According to a second aspect of the embodiments of the present application, a movable platform is provided. The movable platform includes an image acquisition device, a memory, and a processor. The image acquisition device is configured to acquire a depth map of the scene where a target object is located; the memory is configured to store program code; and the processor invokes the program code and, when the program code is executed, performs the following operations: acquiring the depth map and a first target area on the depth map that includes the target object; adjusting the first target area so that the proportion of the area corresponding to the target object within the first target area increases, to obtain a tracking area; and controlling, according to depth information of the tracking area, the movable platform to move relative to the target object.
According to a third aspect of the embodiments of the present application, a control apparatus is provided. The control apparatus includes a memory and a processor; the memory is configured to store program code; and the processor invokes the program code and, when the program code is executed, performs the following operations: acquiring a depth map, collected by a movable platform, of the scene where a target object is located; acquiring a first target area on the depth map that includes the target object; adjusting the first target area so that the proportion of the area corresponding to the target object within the first target area increases, to obtain a tracking area; and controlling, according to depth information of the tracking area, the movable platform to move relative to the target object.
According to a fourth aspect of the embodiments of the present application, a movable platform is provided. The movable platform includes an image acquisition device, a memory, and a processor. The image acquisition device is configured to acquire a depth map of the scene where a target object is located; the memory is configured to store program code; and the processor invokes the program code and, when the program code is executed, performs the following operations: acquiring the depth map and a first target area on the depth map that includes the target object; deleting all or part of the other areas in the first target area that do not correspond to the target object, to obtain a tracking area; and controlling, according to depth information of the tracking area, the movable platform to move relative to the target object.
According to a fifth aspect of the embodiments of the present application, a control apparatus is provided. The control apparatus includes a memory and a processor; the memory is configured to store program code; and the processor invokes the program code and, when the program code is executed, performs the following operations: acquiring a depth map, collected by the movable platform, of the scene where a target object is located; acquiring a first target area on the depth map that includes the target object; deleting all or part of the other areas in the first target area that do not correspond to the target object, to obtain a tracking area; and controlling, according to depth information of the tracking area, the movable platform to move relative to the target object.
According to a sixth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed, the method described in any of the above embodiments is implemented.
The technical solutions provided by the embodiments of the present application may include the following beneficial effects:
In the embodiments of the present application, a target area that includes the target object is acquired on the depth map of the scene where the target object is located, and the target area is adjusted so that the proportion of the area corresponding to the target object within it increases, yielding a tracking area; the movable platform is then controlled, according to the depth information of the tracking area, to move relative to the target object. Because the adjustment increases the proportion of the area corresponding to the target object within the target area, more accurate distance information between the movable platform and the target object can be obtained from the depth information of the tracking area, so the movement of the movable platform relative to the target object can be controlled more accurately, avoiding safety hazards and improving the motion performance of the movable platform.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a depth map that can be used to obtain distance information of a target object, according to an exemplary embodiment of the present application.
Fig. 2 is a flowchart of a control method for a movable platform, according to an exemplary embodiment of the present application.
Fig. 3 is a flowchart of acquiring a first target area, including a target object, on a depth map, according to an exemplary embodiment of the present application.
Fig. 4 shows the alignment, in the buffers, of the image frames captured by the main camera and the binocular camera of a movable platform in an ideal case, according to an exemplary embodiment of the present application.
Fig. 5A is a color image containing a target object acquired by a movable platform, according to an exemplary embodiment of the present application.
Fig. 5B is a depth map containing a target object acquired by a movable platform, according to an exemplary embodiment of the present application.
Fig. 6 shows the alignment, in the buffers, of the image frames captured by the main camera and the binocular camera of a movable platform in a practical case, according to an exemplary embodiment of the present application.
Fig. 7 is a flowchart of correcting the first target area of a depth map by a feature-matching-based method, according to an exemplary embodiment of the present application.
Fig. 8 is an effect diagram of adjusting the first target area by a method based on image semantic segmentation, according to an exemplary embodiment of the present application.
Fig. 9 is an effect diagram of semantic segmentation performed on an image by an image semantic segmentation method, according to an exemplary embodiment of the present application.
Fig. 10 is a flowchart of adjusting the first target area based on the semantic segmentation result of a deep learning model, according to an exemplary embodiment of the present application.
Fig. 11A is a schematic diagram of a deep learning model for image semantic segmentation, according to an exemplary embodiment of the present application.
Fig. 11B is a feature response map of a deep learning model for a target object, according to an exemplary embodiment of the present application.
Fig. 12 is an effect diagram of the erosion processing of a morphological operation, according to an exemplary embodiment of the present application.
Fig. 13 is a schematic structural diagram of a movable platform, according to an exemplary embodiment of the present application.
Fig. 14 is a schematic structural diagram of a control apparatus, according to an exemplary embodiment of the present application.
Fig. 15 is a flowchart of another control method for a movable platform, according to an exemplary embodiment of the present application.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application, as detailed in the appended claims.
The terms used in the present application are for the purpose of describing particular embodiments only and are not intended to limit the application. The singular forms "a", "said", and "the" used in the specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present application to describe various pieces of information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information and, similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
At present, in the field of intelligent perception, it is a research hotspot to use an acquisition device belonging to, or mounted on, a movable platform such as a UAV, a smart car, or an intelligent robot to obtain the distance between a target object and the movable platform, and to control the movable platform accordingly to perform actions such as following, obstacle avoidance, and information interaction. However, in the related art, the obtained distance between the target object and the movable platform is often inaccurate, which in turn makes those following, obstacle-avoidance, and information-interaction operations inaccurate and may even create safety hazards.
The following takes the case where the movable platform is a UAV and the application scenario is a UAV intelligently following a target object as an illustrative example.
In existing UAV intelligent-following technology, an important part of the UAV's observation of the followed target is monitoring the target distance, so as to control the distance between the UAV and the followed target. It is therefore particularly important that the UAV can accurately obtain the distance to the followed target.
In the related art, the distance between the UAV and the followed target is obtained as follows. The UAV typically carries a main camera and a binocular camera: the main camera captures a color image of the scene containing the followed target, and the binocular camera captures a binocular depth map of the same scene. A target frame containing the target object is extracted from the color image captured by the main camera and projected onto the binocular depth map, and the average of the depth values within the projected target frame is taken as the distance between the target object and the UAV.
However, while the UAV follows the target, the target object may be occluded by other objects. In that case, the depth information obtained from the followed target frame's area on the binocular depth map contains background noise, causing inaccurate ranging and unstable following.
As shown in Fig. 1, the target object inside the target frame (the gray rectangle) is occluded by grass. Because the occluding grass is relatively close to the UAV, the distance computed from the average depth within the target frame is shorter than the real distance. Such inaccurate depth estimation causes the UAV to fail to catch up quickly once the target moves away.
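The occlusion bias described above is easy to reproduce numerically. Below is a minimal NumPy sketch; the box size and depth values are illustrative, not taken from the patent:

```python
import numpy as np

# Synthetic 8x8 depth patch (metres): the target stands at ~10 m,
# but the lower half of the target frame is covered by grass at ~2 m.
depth_box = np.full((8, 8), 10.0)
depth_box[4:, :] = 2.0  # foreground occluder

# Related-art estimate: plain mean over the whole target frame.
naive_distance = float(depth_box.mean())
print(naive_distance)  # 6.0 -- far closer than the true 10 m target
```

Half of the box averages at the occluder's depth, so the estimate is pulled well below the real target distance, exactly the failure mode the patent sets out to fix.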
Of course, those skilled in the art should understand that the above application scenario is only an illustrative example. Besides the target object being occluded by other objects, the obtained distance between the target object and the movable platform may also be inaccurate for other reasons, such as the target object being a moving object or the projected frame being inaccurate, which the present application does not limit. The target frame may be a rectangular frame or a target frame of another shape, which the present application likewise does not limit.
In order to overcome the defects in the related art that arise when the movable platform is controlled to perform distance-related movement using inaccurate distance information between the target object and the movable platform, the present application provides a control method for a movable platform. Referring to Fig. 2, which is a flowchart of the control method, the control method may include the following steps:
Step S201, acquiring a depth map, collected by the movable platform, of the scene where the target object is located;
Step S202, acquiring a first target area on the depth map that includes the target object;
Step S203, adjusting the first target area so that the proportion of the area corresponding to the target object within the first target area increases, to obtain a tracking area;
Step S204, controlling the movable platform to move relative to the target object according to depth information of the tracking area.
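The four steps can be outlined as a small pipeline. This is only a sketch: the embodiments below adjust the region with richer methods such as semantic segmentation, while here step S203 is approximated by keeping only pixels near the median depth of the region, a hypothetical simplification:

```python
import numpy as np

def track_distance(depth_map, box, tol=1.5):
    """Sketch of steps S201-S204.

    depth_map : HxW array of scene depths (S201)
    box       : (top, bottom, left, right) first target area (S202)
    tol       : keep pixels within `tol` m of the median depth (S203,
                one hypothetical way to grow the target's share)
    """
    t, b, l, r = box
    region = depth_map[t:b, l:r]                        # first target area
    median = np.median(region)
    tracking = region[np.abs(region - median) <= tol]   # tracking area
    return float(tracking.mean())                       # depth info (S204)

depth = np.full((8, 8), 10.0)
depth[6:, :] = 2.0                   # occluder in the bottom rows
d = track_distance(depth, (0, 8, 0, 8))
print(d)  # 10.0 -- occluder pixels rejected by the adjustment
```

The controller would then feed the returned distance into whatever motion law the platform uses (following, obstacle avoidance, etc.).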
In some embodiments, the movable platform may be an unmanned aerial vehicle, a smart car, an intelligent robot, an unmanned ship, etc.; the present application does not limit the specific type of the movable platform.
In some embodiments, in step S201, the depth map of the scene where the target object is located, collected by the movable platform, may be acquired by a method based on binocular stereo vision: based on the parallax principle, imaging devices at different positions capture two images of the measured object, and the three-dimensional position information of the object is obtained by computing the positional deviation of corresponding points between the images, yielding the depth map of the scene where the target object is located.
In some embodiments, a binocular camera may be mounted on the movable platform and used to capture images of the scene where the target object is located. Because the two cameras of the binocular camera are at different positions, the positions of corresponding points differ between the images each camera collects; based on this deviation, the three-dimensional position information of objects in the images can be extracted to obtain the depth map of the scene where the target object is located.
In some embodiments, obtaining the depth map of the scene from two images with parallax can be done by the triangulation principle: the positions of the binocular cameras and the target object form a triangle, and once the positional relationship between the two cameras of the binocular camera is known, the three-dimensional size of objects within the common field of view of the two cameras and the three-dimensional coordinates of spatial feature points can be obtained. Of course, obtaining the depth map from two images with parallax can also be implemented by methods such as deep learning, which the present application does not limit.
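For a rectified binocular pair, the triangulation described above reduces to the standard relation Z = f·B/d between depth, focal length (in pixels), baseline, and disparity. A sketch with illustrative numbers (not from the patent):

```python
def stereo_depth(disparity_px, focal_px, baseline_m):
    """Depth of a matched point under rectified stereo: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: 700 px focal length, 12 cm baseline, 8.4 px disparity.
z = stereo_depth(8.4, 700.0, 0.12)
print(round(z, 3))  # 10.0 (metres)
```

Repeating this for every matched pixel pair yields the dense depth map used in the rest of the method.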
Of course, those skilled in the art should understand that the depth map of the scene where the target object is located can be acquired not only by a method based on binocular stereo vision, but also by means capable of obtaining the three-dimensional coordinates of the target object, such as lidar or ultrasonic ranging; the present application does not limit the specific way in which the movable platform acquires the depth map of the scene where the target object is located.
In some embodiments, in the above control method of a movable platform of the present application, step S202, acquiring a first target area on the depth map that includes the target object, can be implemented as shown in Fig. 3:
Step S301, acquiring a second target area containing the target object, the second target area being located on a color image collected by the movable platform;
Step S302, projecting the second target area onto the depth map to obtain the first target area on the depth map that includes the target object.
In some embodiments, the movable platform may carry a main camera capable of capturing color images of the space around the platform's position. Taking a consumer UAV as an example, the main camera may be a camera mounted on a gimbal under or at the front of the UAV body, used to collect images of the scene within its field of view so as to observe the environment around the UAV.
After the movable platform captures, with its main camera, a color image containing the target object, the second target area containing the target object can be acquired on the color image.
In some embodiments, the second target area containing the target object on the color image may be a target area manually circled by the user, a target area automatically circled to satisfy conditions entered by the user in advance, or a target area containing the target object automatically circled based on various deep learning models; the present application does not limit how the second target area is acquired.
After the second target area on the color image collected by the movable platform is acquired, the second target area on the color image is projected onto the depth map to obtain the first target area on the depth map that includes the target object.
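As an illustration of the projection in step S302, suppose the mapping between the two calibrated views can be approximated by a 3×3 homography (a simplifying assumption; the patent does not specify the projection model, and real systems reproject through both cameras' intrinsics and extrinsics). The corners of the second target area can then be mapped as follows:

```python
import numpy as np

def project_box(box, H):
    """Map a (top, bottom, left, right) region from the colour view to the
    depth view using a 3x3 homography H (illustrative simplification)."""
    t, b, l, r = box
    corners = np.array([[l, t, 1.0], [r, t, 1.0],
                        [l, b, 1.0], [r, b, 1.0]]).T
    p = H @ corners
    p = p[:2] / p[2]            # perspective divide
    x, y = p
    return (int(round(y.min())), int(round(y.max())),
            int(round(x.min())), int(round(x.max())))

# Example: depth map at half the colour image's resolution.
H = np.diag([0.5, 0.5, 1.0])
print(project_box((100, 300, 40, 200), H))  # (50, 150, 20, 100)
```

The axis-aligned bounds of the mapped corners give the first target area on the depth map.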
In some embodiments, projecting the acquired second target area on the color image including the target object onto the depth map may be implemented as follows: the second target area on the color image generated at time T is projected onto the depth map generated at time T.
Referring to Fig. 4, still taking the UAV following scenario as an example, the color image is collected and generated by the main camera, and the depth map is collected and generated by the binocular camera; the main camera and the binocular camera capture images of the target object simultaneously, yielding a color image 401 and a depth map 402 of the scene where the target object is located, where T0 to T5 denote the color image frames and depth image frames captured at six different moments. Ideally, the color image captured by the main camera and the depth map acquired by the binocular camera at the same moment contain exactly the same image content. Therefore, when the target area on a color image generated at a given moment is projected onto the depth map generated at that same moment, the image content contained in the target area is identical, so the projection of the color image's second target area onto the depth map can be used directly as the first target area. In this case, "the first target area" adjusted in step S203 refers to the first target area obtained by direct projection.
Figs. 5A and 5B show, respectively, the color image containing the target object collected by the main camera and the depth map containing the target object collected by the binocular camera, where the area framed by the black rectangle is the target area containing the target object. Because the color image contains more information, the second target area containing the target object can be determined more accurately based on the color image. Because the depth map contains depth information, projecting the color image's second target area containing the target object onto the depth map yields a third target area containing the target object on the depth map, and the distance between the movable platform and the target object can then be obtained from the distance information extracted from the depth map.
The reason the third target area containing the target object on the depth map is obtained by projection, based on the second target area containing the target object on the color image collected by the movable platform, is that the color image contains more useful information and is more intuitive, so the target area containing the target object can be determined faster and more accurately. The depth map, by contrast, is a grayscale image; directly determining the target area containing the target object from the grayscale image is cumbersome and slow to process, and accuracy suffers to some extent.
Of course, those skilled in the art should understand that, based on the control method of the movable platform described in this application, the first target area containing the target object on the depth map may also be obtained directly from the depth map acquired by the movable platform, using methods such as feature recognition, contour recognition, or deep learning; this application places no limitation on this. In that case, the "first target area" that is adjusted in step S203 refers to the first target area obtained directly from the depth map acquired by the movable platform.
As described above, projecting the second target area containing the target object in the color image onto the depth map to obtain the first target area assumes the ideal case in which the color image captured by the main camera at a certain moment and the depth map captured by the binocular camera at the corresponding moment contain exactly the same image content. In practice, however, several factors can cause the image content of the color image captured by the main camera at a certain moment and that of the depth map captured by the binocular camera at the corresponding moment to differ at the same pixel positions.
For example, because the images captured by the main camera and the binocular camera undergo different amounts of image processing, a color image and a depth map obtained at the same time may actually correspond to different acquisition times. Referring to FIG. 6, the color images 601 and depth maps 602 acquired within one acquisition period T0 to T5 are buffered in their respective buffers. The color images and the depth maps undergo different processing times, and the buffers are of limited size. As a result, the image frame captured by the main camera at time T0 may be discarded first; then, for the depth map acquired at time T0, only the color image at time T1, whose timestamp is closest, can be used for the target-box projection. Since the color image at time T1 and the depth map at time T0 contain different image content at the same pixel positions, projecting the color image acquired at time T1 onto the depth map acquired at time T0 means that the second target area containing the target object in the color image and the first target area containing the target object in the depth map will differ in content, making the distance information obtained from the first target area in the depth map inaccurate.
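The timestamp fallback just described can be sketched as follows. This is only an illustrative Python fragment with hypothetical names, not the patent's implementation: given the timestamps of the color frames still in the buffer, the frame closest in time to a depth frame is selected for the box projection.

```python
# Illustrative sketch: pick the buffered color frame whose timestamp is
# closest to the depth frame's timestamp (names are hypothetical).
def nearest_color_frame(depth_ts, color_ts_list):
    """Return the index of the buffered color frame closest in time."""
    return min(range(len(color_ts_list)),
               key=lambda i: abs(color_ts_list[i] - depth_ts))

# Example: the T0 color frame was already evicted from the buffer, so the
# T0 depth frame falls back to the T1 color frame.
color_buffer_ts = [1.0, 2.0, 3.0, 4.0, 5.0]   # T1..T5 (T0 dropped)
depth_ts = 0.0                                 # T0 depth frame
idx = nearest_color_frame(depth_ts, color_buffer_ts)
```

As the example shows, the nearest available frame (T1) is not the true acquisition moment (T0), which is exactly the mismatch the correction below addresses.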
In another case, the target object may be moving. Even if the images captured by the main camera and the binocular camera undergo the same processing time, the motion of the captured target object can cause it to appear at different pixel positions in the color image and the depth map. When the first target area containing the target object on the depth map is determined by projecting the second target area of the color image onto the depth map, the first target area and the second target area may then contain different content. Correspondingly, the distance between the movable platform and the target object obtained from the first target area on the depth map is inaccurate.
Of course, other situations may also cause the image content of the second target area in the color image and of the first target area on the depth map to differ, and this application places no limitation on this. In short, because of hardware limitations on image processing and image acquisition, it is difficult to guarantee that the projection takes place between a color image and a depth map that were truly captured at the same moment.
In some embodiments, in either case, the first target area obtained by projecting the second target area containing the target object on the color image onto the depth map can be corrected by the method shown in FIG. 7, which includes:
Step S701: extract first feature points from the color image and second feature points from the depth map;
Step S702: perform feature matching between the first feature points and the second feature points;
Step S703: correct the obtained first target area of the depth map based on the result of the feature matching.
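The steps above can be sketched in Python. This is only an illustrative fragment under the simplifying assumption that the feature points have already been extracted and paired (steps S701 and S702); the projected box is then corrected by the mean displacement between the matched points, which is one possible correction strategy rather than the one the patent mandates.

```python
import numpy as np

def correct_box(box, pts_color, pts_depth):
    """Shift the projected target box by the mean displacement between
    matched feature points in the color image and the depth map (S703)."""
    offset = (pts_depth - pts_color).mean(axis=0)
    x, y, w, h = box
    return (x + offset[0], y + offset[1], w, h)

# Hypothetical matched pairs: depth-map points sit 3 px right and
# 2 px down from their color-image counterparts.
pts_color = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
pts_depth = pts_color + np.array([3.0, 2.0])
box = correct_box((100.0, 100.0, 50.0, 80.0), pts_color, pts_depth)
```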
By extracting feature points from the color image and the depth map, matching the corresponding feature points, and adjusting the first target area containing the target object on the depth map accordingly, it is possible to avoid the situation where hardware processing, target motion, or other factors make the determined first target area on the depth map inaccurate, which would in turn make the obtained distance information between the movable platform and the target object inaccurate.
For the specific implementation of feature point extraction, reference may be made to the related art. Feature points may be extracted using conventional image processing methods based on, for example, the texture information or pixel intensity information of the image, or using deep learning networks; of course, other methods may also be used, and this application places no limitation on this.
Correcting the obtained first target area of the depth map based on the feature matching result may mean aligning the first feature points on the color image exactly with the second feature points on the depth map. For example, if the target object is a person, the extracted first and second feature points may correspond to several characteristic parts such as the nose and the mouth; the extracted first feature points and second feature points are then aligned one to one, that is, the nose, mouth, and so on in the color image are aligned with the nose, mouth, and so on in the depth map, respectively, to obtain the corrected first target area of the depth map. Of course, other correction methods are also possible, for example aligning the center of the first feature points with the center of the second feature points, and so on; this application places no limitation on the specific correction method.
In some embodiments, in the control method of the movable platform described in this application, step S203 of adjusting the first target area so that the proportion of the area corresponding to the target object in the first target area increases includes: adjusting the first target area so that the boundary of the first target area shrinks toward the target object.
This is explained with reference to FIG. 5B and FIG. 8. FIG. 5B is the depth map of the scene containing the target object (here, a person) acquired by the movable platform. Based on the method described above, the first target area containing the target object on the depth map can be obtained (taking the area framed by the black rectangle as an example of the first target area). The first target area in FIG. 5B is adjusted so that the proportion of the area corresponding to the target object in the first target area increases, yielding the tracking area shown in FIG. 8, where the tracking area is the region enclosed by the black closed curve of the human silhouette.
Comparing the areas containing the target object in FIG. 5B and FIG. 8, it is clear that the first target area in FIG. 5B contains not only the target object (the person) but also the ground. If the depth values within the first target area shown in FIG. 5B are averaged and the average is taken as the distance between the movable platform and the target object, the result is not the true distance but the average distance from the movable platform to both the target object (the person) and the background (the ground) within the first target area. The tracking area in FIG. 8, by contrast, contains almost only the target object (the person), so the distance information computed from the tracking area is closer to the true distance.
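The effect described above can be illustrated numerically. In this hypothetical NumPy sketch (the values are invented for illustration), averaging the whole first target area mixes person and ground depths, while averaging only the person pixels recovers the true distance:

```python
import numpy as np

# Toy 4x4 depth patch: the person occupies the two centre columns at 5 m,
# the surrounding ground/background is at 9 m.
depth = np.full((4, 4), 9.0)
depth[:, 1:3] = 5.0
person_mask = np.zeros((4, 4), dtype=bool)
person_mask[:, 1:3] = True

box_mean = depth.mean()                 # averages person AND background
mask_mean = depth[person_mask].mean()   # averages person pixels only
```

Here `box_mean` overestimates the person's distance because the ground pixels are mixed in, while `mask_mean` equals the person's true depth.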
Of course, those skilled in the art should understand that, in general, when the target object is located at the center of the first target area on the depth map, adjusting the first target area so that the proportion of the area corresponding to the target object increases can be achieved by shrinking the boundary of the first target area toward the center of the first target area. When the target object is not located at the center of the first target area, the same adjustment can be achieved by shrinking the boundary of the first target area toward the center of the target object. Of course, those skilled in the art may decide according to other practical situations, and this application places no limitation on this.
To increase the proportion of the area corresponding to the target object in the first target area, besides the method described in the above embodiments, in one embodiment all or part of the regions in the first target area that do not correspond to the target object may be deleted. Unlike the previous embodiments, in this case the boundary of the first target area need not be adjusted; instead, part or all of the regions other than the target object within the first target area are deleted directly, which also increases the proportion of the area corresponding to the target object in the first target area.
It can be seen from the above embodiments that, because this application adjusts the first target area so that the proportion of the area corresponding to the target object increases, the depth information of the resulting tracking area yields more accurate distance information between the movable platform and the target object. The motion of the movable platform relative to the target object can therefore be controlled more accurately, avoiding safety hazards and improving the motion performance of the movable platform.
Whether the proportion of the area corresponding to the target object is increased by shrinking the boundary of the first target area toward the target object, or by deleting all or part of the regions in the first target area that do not correspond to the target object, the first step is to distinguish the region corresponding to the target object in the first target area from the other regions. This can be done by semantic segmentation, which separates target from non-target within the first target area. Based on the semantic segmentation result, the region corresponding to the target object in the first target area can be determined; the first target area can then be further adjusted so that its boundary shrinks toward the target object, or all or part of the regions not corresponding to the target object can be deleted to increase the proportion of the area corresponding to the target object. Of course, the possibility of using both means together is not excluded. The following describes how to adjust the first target area in combination with semantic segmentation so that the boundary of the first target area shrinks toward the target object.
In some embodiments, adjusting the first target area so that its boundary shrinks toward the target object may include: performing semantic segmentation on the first target area containing the target object in the depth map (that is, applying image semantic segmentation technology), and adjusting the first target area based on the semantic segmentation result.
Image semantic segmentation refers to using computer vision and image processing techniques to label objects of different categories in an image, marking the pixel positions occupied by each category. As shown in FIG. 9, given a photo of a person riding a motorcycle (left), image semantic segmentation can partition the image according to the different semantic meanings it expresses (right), yielding the pixel positions of the different object categories (person and motorcycle) shown on the right.
Image semantic segmentation is a cornerstone of image understanding and an important part of computer vision. Many image semantic segmentation techniques exist. Traditional techniques include pixel-level thresholding methods, clustering-based segmentation methods, and graph partitioning segmentation methods. With the rapid development of deep learning, many deep-learning-based techniques have also emerged, for example semantic segmentation based on fully convolutional networks or on dilated convolutions. The image semantic segmentation of the first target area described in this application may use any technique capable of image semantic segmentation, and this application places no limitation on this.
It can be seen from the above embodiments that performing image semantic segmentation on the first target area containing the target object on the depth map, and adjusting the first target area based on the segmentation result, increases the proportion of the area corresponding to the target object in the target area accurately and effectively. Therefore, from the depth information of the adjusted tracking area, more accurate distance information between the movable platform and the target object can be obtained; the motion of the movable platform relative to the target object can then be controlled more accurately, avoiding safety hazards and improving the motion performance of the movable platform.
In some embodiments, the image semantic segmentation of the first target area may use a deep-learning-based technique. In that case, performing semantic segmentation on the first target area and adjusting the first target area based on the segmentation result may, as shown in FIG. 10, include:
Step S1001: perform image semantic segmentation on the depth map based on a pre-trained deep learning model;
Step S1002: adjust the first target area according to the semantic segmentation result.
When the pre-trained deep learning model is used for image semantic segmentation, the depth map to be segmented is input into the model, and the model outputs the semantically segmented depth map, that is, a depth map in which the pixel positions of objects of different categories are marked, similar to the result shown in FIG. 9.
After image semantic segmentation has been performed on the depth map, the first target area containing the target object on the depth map is adjusted so that the proportion of the area corresponding to the target object increases; for example, the first target area containing the target object shown in FIG. 5B can be adjusted to the area containing the target object shown in FIG. 8.
The above describes how to adjust the first target area in combination with semantic segmentation so that its boundary shrinks toward the target object, thereby increasing the proportion of the area corresponding to the target object. It will be appreciated that semantic segmentation can likewise be combined with deleting all or part of the regions in the first target area that do not correspond to the target object. The implementation is similar to the process described above; the only difference is that, after the semantic segmentation result has separated target from non-target within the first target area, the regions corresponding to non-target objects in the first target area are deleted. Since the principle is similar to the previous embodiments, it is not elaborated here.
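One simple way to realize the boundary-shrinking adjustment, sketched here with hypothetical NumPy code (the patent does not prescribe this exact computation), is to shrink the first target area to the tight bounding box of the pixels the segmentation labels as the target:

```python
import numpy as np

def shrink_to_mask(mask):
    """Tight bounding box (x0, y0, x1, y1) around the segmented target,
    i.e. the first target area shrunk toward the target object."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

mask = np.zeros((10, 10), dtype=bool)
mask[2:7, 3:6] = True      # pixels the segmentation labels as the person
bbox = shrink_to_mask(mask)
```

The deletion variant would instead keep the original box but set the depth pixels outside `mask` aside before averaging.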
In some embodiments, the training process of the deep learning model includes: obtaining semantic labels corresponding to the pixels of a training image output by the deep learning model, computing a first loss function based on the semantic labels, and training the deep learning model based on the first loss function.
The deep learning model may be designed as needed and may include at least one of a convolutional layer, a batch normalization layer, a nonlinear activation layer, and the like; an existing deep learning model for image semantic segmentation may also be used, and this application places no limitation on this. The initial parameters of the deep learning model may be determined by prior training or from empirical values, and this application places no limitation on this either.
To train the deep learning model, training samples with semantic labels are input into the model to be trained, and the model is trained based on the predefined first loss function until the first loss function converges or falls below a specified threshold.
The training image may be a depth map represented as a grayscale image.
The first loss function may be designed according to the training needs, or a loss function commonly used in deep-learning-based image semantic segmentation may be used; this application places no limitation on this.
In some embodiments, the first loss function may be a cross-entropy loss function.
Optionally, the first loss function may be: L = ∑ [-y·log(y') - (1-y)·log(1-y')];
where y is the semantic segmentation label at each pixel, and y' is the predicted value at that pixel, that is, the output of the deep learning network at that pixel.
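A minimal NumPy sketch of this per-pixel cross-entropy follows; the clamping of predictions away from 0 and 1 is a standard numerical safeguard added here, not something stated in the text, and all names are illustrative.

```python
import numpy as np

def pixelwise_bce(y, y_pred, eps=1e-7):
    """L = sum over pixels of -y*log(y') - (1-y)*log(1-y')."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return float(np.sum(-y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)))

# Toy 2x2 label map and predicted target probabilities.
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
preds = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = pixelwise_bce(labels, preds)
```

The loss shrinks as the per-pixel predictions approach the labels, which is what drives the back-propagation training described above.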
It can be seen from the above embodiments that performing image semantic segmentation on the first target area containing the target object on the depth map using a deep-learning-based technique, and adjusting the first target area based on the segmentation result, increases the proportion of the area corresponding to the target object in the target area. Because the segmentation is based on a deep learning model, the model can learn more information; therefore, from the depth information of the adjusted tracking area, more accurate distance information between the movable platform and the target object can be obtained, the motion of the movable platform relative to the target object can be controlled more accurately, safety hazards are avoided, and the motion performance of the movable platform is improved.
Because the depth map is usually represented as a grayscale image, it contains much less information than a color image. To make the deep learning model better suited to tasks on grayscale images, in some embodiments the model may be trained twice. That is, before training the deep learning model based on the first loss function, the training process further includes: obtaining category labels corresponding to training images output by the deep learning model, computing a second loss function based on the category labels, training the deep learning model based on the second loss function, and thereby determining the initial parameters of the deep learning model before it is trained based on the first loss function.
Below, with reference to FIG. 11A and FIG. 11B, the two-stage training of the deep learning model is described by way of example.
FIG. 11A shows supervised classification learning of the deep learning model on a grayscale image classification task, which gives the model high sensitivity to semantic objects. A specific implementation of an exemplary embodiment is as follows:
The deep learning model is trained with single-channel grayscale images and the category label corresponding to each image, the categories being selected as the target categories of the target object. In the drone-following application scenario, the categories are the target categories expected to be followed, for example people, cars, and boats. In FIG. 11A, the three categories of people, cars, and boats are given as illustrative examples.
The training data is input into the deep learning model, which outputs a vector P of length N, where N is the number of categories. The N elements of the vector P correspond to the predicted probabilities p_i of the targets of each class i. Using the second loss function, back-propagation is performed and the deep learning model is trained until the second loss function converges or falls below a threshold.
In some embodiments, the training data are grayscale images. Besides the depth maps collected by the movable platform, the grayscale images may also include grayscale images converted from the color images, which serve as pseudo depth maps to increase the amount of training data.
In some embodiments, the second loss function is a cross-entropy loss function.
Optionally, the second loss function may be:

L = -(1/N) · ∑_{i=1}^{N} ∑_{c=1}^{M} y_ic · log(p_ic)

where N is the batch size of the training data, M is the number of categories, y_ic is the target category label of the i-th input image for class c, and p_ic is the predicted probability that the i-th input image belongs to class c.
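A minimal NumPy sketch of this batch classification cross-entropy follows (batch size N, M classes, one-hot labels y_ic, predicted probabilities p_ic); the class names and numbers are illustrative only.

```python
import numpy as np

def classification_ce(y_onehot, p):
    """Loss = -(1/N) * sum_i sum_c y_ic * log(p_ic)."""
    n = y_onehot.shape[0]
    return float(-np.sum(y_onehot * np.log(p)) / n)

# Batch of N=2 images over M=3 classes (e.g. person, car, boat).
y = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
loss = classification_ce(y, p)
```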
Based on the second loss function, the network undergoes supervised classification learning on the grayscale image classification task. The trained network shows a high response on the feature map for a given input target, as shown in FIG. 11B.
After the supervised classification training on the grayscale image classification task is complete, the deep learning model is further trained on a grayscale image semantic segmentation task to obtain the final image semantic segmentation model. A specific implementation of an exemplary embodiment is:
Using transfer learning, the network parameters obtained by training the deep learning model on the grayscale image classification task described above are used as pre-trained parameters to initialize the deep learning model. The training data for this stage consist of single-channel grayscale images and the semantic segmentation labels corresponding to the specific target categories of each image, such as people, cars, and boats. For each pixel position of the training data, back-propagation is performed on the deep learning network based on the first loss function.
The deep learning network determined by the above two training stages, one based on the classification task and one based on the semantic segmentation task, is better adapted to tasks on grayscale images and overcomes the drawback that a network trained only on the limited information contained in grayscale images does not perform well enough.
Inputting the depth map containing the target object into the deep learning model determined by the above two training stages allows the target object to be segmented more accurately, and therefore the first target area to be determined more accurately. Averaging the depth values within the first target area determined from the semantic segmentation result yields a more accurate estimate of the true distance between the target object and the movable platform.
In some cases, for example under low illumination, the boundary between the target and the background is blurred in the depth map collected by the movable platform. The first target area obtained by semantically segmenting the depth map with the deep learning model may then be larger than the area actually occupied by the target object. As shown in FIG. 12, the left image is the depth map processed by the deep learning model, where the white human silhouette is the semantic segmentation result determined by the model; the white outline in the right image is the first target area where the actual target object (the person) is located.
基于上述情况,可以对所述深度学习模型所输出的语义分割结果进行形态学操作中的腐蚀处理。在一些实施例中,可以对所述深度学习模型输出的语义分割结果进行卷积,以缩小所述语义分割结果中表征素数目标物的分割区域,并基于卷积后的分割结果,调整所述第一目标区域。Based on the above situation, corrosion processing in the morphological operation may be performed on the semantic segmentation result output by the deep learning model. In some embodiments, the semantic segmentation result output by the deep learning model may be convolved to reduce the segmentation area representing the prime number target in the semantic segmentation result, and based on the convolved segmentation result, the first target area.
其中,所述卷积操作的具体尺寸,可以根据实际情况确定,例如可以采用一个5*5大小的卷积核对所述深度学习模型所输出的语义分割结果进行卷积。The specific size of the convolution operation can be determined according to the actual situation, for example, a 5*5 size convolution kernel can be used to convolve the semantic segmentation result output by the deep learning model.
在一些实施例中,对所述深度学习模型输出的语义分割结果进行卷积,包括:每次卷积时,将卷积核所在区域的最小像素值作为卷积核所在区域的中心点位置的像素值。In some embodiments, performing convolution on the semantic segmentation result output by the deep learning model includes: during each convolution, taking the minimum pixel value of the area where the convolution kernel is located as the difference between the center point of the area where the convolution kernel is located Pixel values.
当然,本领域技术人员应当理解,每次卷积时,也可以将卷积核所在区域的像素值进行加权,作为卷积核所在区域的中心点位置的像素值;也可以是其他方式确定卷积核所在区域的中心点位置的像素值,本申请对此不作限制。Of course, those skilled in the art should understand that during each convolution, the pixel value of the area where the convolution kernel is located can also be weighted as the pixel value of the center point of the area where the convolution kernel is located; it is also possible to determine the volume in other ways. The pixel value of the center point of the region where the product kernel is located, which is not limited in this application.
通过上述实施例可以看到,对于图像语义分割结果,继续采取形态操作中的腐蚀处理,调整所述第一目标区域,能够进一步增加所述目标区域中对应所述目标物的区域的占比。因此,根据调整后所得跟踪区域的深度信息,能够获得更加准确的可移动平台与目标物的距离信息,进而能够更加准确地控制所述可移动平台相对于所述目标物的运动,避免安全隐患,提供可移动平台的运动性能。It can be seen from the above embodiment that, for the image semantic segmentation result, continuing to take the corrosion process in the morphological operation and adjusting the first target area can further increase the proportion of the area corresponding to the target object in the target area. Therefore, according to the depth information of the tracking area obtained after adjustment, more accurate distance information between the movable platform and the target can be obtained, and then the movement of the movable platform relative to the target can be more accurately controlled to avoid potential safety hazards. , which provides the motion performance of a movable platform.
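The minimum-valued convolution described above (equivalent to grayscale erosion) can be sketched in plain NumPy; this is an illustrative version of the 5*5 operation, and the edge-padding border handling is an assumption:

```python
import numpy as np

def min_filter(seg, k=5):
    """Erode a segmentation map: each output pixel becomes the minimum
    of the k*k window centered on it, shrinking foreground regions."""
    pad = k // 2
    padded = np.pad(seg, pad, mode="edge")  # replicate border values
    out = np.empty_like(seg)
    h, w = seg.shape
    for i in range(h):
        for j in range(w):
            # Window of the padded image centered on original pixel (i, j).
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out
```

Applied to a binary segmentation mask, this removes a border of foreground pixels on every pass, which is exactly the shrinking effect used to tighten the first target area.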
In some embodiments of the control method for a movable platform described in this application, in step S302, adjusting the first target area so that the proportion of the area corresponding to the target within the first target area increases can be achieved not only with the semantic segmentation technique based on a deep learning model, but also with a moving-object segmentation technique based on frame-difference statistics.
The moving-object segmentation technique based on frame-difference statistics is a method that obtains the contour of a moving target by performing a difference operation on two consecutive frames of a video image sequence. When the target in the monitored scene is a moving object, there are differences between two adjacent frames: the two frames are subtracted, the absolute value of the pixel-value difference at each corresponding position is obtained, and whether it exceeds a certain threshold is judged, so as to analyze the motion characteristics of the target in the video or image sequence. Based on the obtained absolute values of the pixel-value differences, the contour of the target can be obtained, and the first target area containing the target in the depth map can then be determined.
In some embodiments, the moving-object segmentation technique based on frame-difference statistics can be implemented as follows: a difference operation is performed on the depth map and its adjacent-frame depth map, the contour of the target on the depth map is determined based on the difference result, and the area enclosed by the contour is taken as the adjusted area.
An illustrative description is given below with reference to FIG. 5B. FIG. 5B is a depth map, acquired by the movable platform, of the scene where the target is located. When the target is a moving person, the target (the person) moves between the depth maps collected at different moments, so a difference operation is performed on the depth map acquired at a certain moment and the adjacent depth map acquired at the previous (or next) moment, and the absolute values of the pixel-value differences at corresponding positions of the two image frames are obtained. Since the target (the person) is a moving object, its contour can be obtained from these absolute values, and based on the contour a first target area containing only the target can be re-determined. Calculating the distance between the target and the movable platform based on the distance information of this first target area is therefore more accurate and closer to reality.
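The frame-difference step above can be sketched as follows (a minimal illustration; the threshold value is an assumption chosen for the example):

```python
import numpy as np

def moving_object_mask(frame_prev, frame_curr, thresh=0.5):
    """Mark pixels whose value changed between two adjacent frames.

    Subtracts the two frames, takes the absolute difference at each
    corresponding position, and keeps the pixels where it exceeds the
    threshold; the resulting mask outlines the moving target.
    """
    diff = np.abs(frame_curr.astype(float) - frame_prev.astype(float))
    return diff > thresh
```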
In some embodiments of the control method for a movable platform described in this application, in step S304, controlling the movable platform to move relative to the target according to the depth information of the tracking area includes: controlling the movable platform to follow the movement of the target according to the depth information of the tracking area.
When the application scenario is an unmanned aerial vehicle following a target, the first target area containing the target in the depth map, as determined by the aforementioned semantic segmentation technique based on a deep learning model and/or the erosion processing from morphological operations and/or the moving-object segmentation technique based on frame-difference statistics, is largely occupied by the area where the target is located. Based on this first target area, the distance information between the target and the movable platform is obtained by averaging the distances within the area, and based on this distance information the following movement of the movable platform with respect to the target can be controlled more accurately.
Of course, those skilled in the art should understand that, in addition to the above application scenario, the control method for a movable platform of this application can also be used in many other scenarios, such as obstacle avoidance for smart cars and distance measurement for intelligent robots. Accordingly, controlling the movable platform to move relative to the target may also mean controlling the running speed and running direction of the smart car, or the forward direction and forward speed of the intelligent robot, and so on, which is not limited in this application.
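As a rough sketch of how the estimated distance could drive the follow motion (purely illustrative; the proportional control law, the gain, and the desired following distance are assumptions and are not specified by the application):

```python
def follow_speed(measured_distance, desired_distance=5.0, gain=0.8,
                 max_speed=3.0):
    """Command a forward speed proportional to the distance error.

    A positive output moves the platform toward the target (it is too
    far away); a negative output backs it off (it is too close). The
    output is clamped to the platform's speed limit.
    """
    error = measured_distance - desired_distance
    speed = gain * error
    return max(-max_speed, min(max_speed, speed))
```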
An embodiment of this application further provides a control method for a movable platform. Referring to FIG. 15, FIG. 15 shows another control method for a movable platform according to an exemplary embodiment of this application, including the following steps:
S1501: acquiring a depth map, collected by the movable platform, of the scene where the target is located;
S1502: acquiring a first target area on the depth map that contains the target;
S1503: deleting all or part of the other areas in the first target area that do not correspond to the target, to obtain a tracking area;
S1504: controlling the movable platform to move relative to the target according to the depth information of the tracking area.
Exemplarily, acquiring a first target area on the depth map that contains the target includes: acquiring a second target area containing the target, the second target area being located on a color image collected by the movable platform; and projecting the second target area onto the depth map to obtain the first target area.
Exemplarily, projecting the second target area onto the depth map includes: projecting the second target area on the color image generated at time T onto the depth map generated at time T.
Exemplarily, after projecting the second target area onto the depth map, the method further includes: extracting first feature points on the color image and second feature points on the depth map respectively; performing feature matching on the first feature points and the second feature points; and correcting the obtained first target area of the depth map based on the result of the feature matching.
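When the color camera and the depth sensor produce aligned images of different resolutions, the projection of the second target area can be sketched as a simple coordinate rescaling. This is an illustrative simplification; a real device also needs the extrinsic calibration between the two sensors, which is omitted here:

```python
def project_box(box, color_size, depth_size):
    """Rescale a bounding box from color-image coordinates to
    depth-map coordinates, assuming the two images are aligned.

    box: (x0, y0, x1, y1) pixel coordinates on the color image
    color_size, depth_size: (width, height) of each image
    """
    sx = depth_size[0] / color_size[0]
    sy = depth_size[1] / color_size[1]
    x0, y0, x1, y1 = box
    return (round(x0 * sx), round(y0 * sy),
            round(x1 * sx), round(y1 * sy))
```

The feature-matching correction described above would then refine this projected box against point correspondences between the two images.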
Exemplarily, deleting all or part of the other areas in the first target area that do not correspond to the target includes: performing semantic segmentation on the first target area, and deleting, based on the semantic segmentation result, all or part of the other areas in the first target area that do not correspond to the target.
Exemplarily, performing semantic segmentation on the first target area and deleting, based on the semantic segmentation result, all or part of the other areas in the first target area that do not correspond to the target includes: performing semantic segmentation on the depth map based on a pre-trained deep learning model; and deleting, according to the semantic segmentation result, all or part of the other areas in the first target area that do not correspond to the target.
Exemplarily, the training process of the deep learning model includes: acquiring semantic labels corresponding to the pixels in a training image output by the deep learning model, calculating a first loss function based on the semantic segmentation labels, and training the deep learning model based on the first loss function.
Exemplarily, before the deep learning model is trained based on the semantic labels, the training process further includes: acquiring a category label corresponding to the training image output by the deep learning model, calculating a second loss function based on the category label, training the deep learning model based on the second loss function, and thereby determining the initial parameters of the deep learning model before it is trained based on the first loss function.
Exemplarily, the first loss function and/or the second loss function is a cross-entropy loss function.
Exemplarily, the training image includes a grayscale image.
Exemplarily, deleting, according to the semantic segmentation result, all or part of the other areas in the first target area that do not correspond to the target further includes: convolving the semantic segmentation result output by the deep learning model with a preset convolution kernel to shrink the segmented area representing the target in the semantic segmentation result, and adjusting the first target area based on the convolved semantic segmentation result.
Exemplarily, convolving the semantic segmentation result output by the deep learning model includes: in each convolution step, taking the minimum pixel value within the area covered by the convolution kernel as the pixel value at the center point of that area.
Exemplarily, deleting all or part of the other areas in the first target area that do not correspond to the target includes: performing a difference operation on the depth map and its adjacent-frame depth map, determining the contour of the target on the depth map based on the difference result, and deleting the areas other than the area enclosed by the contour.
Exemplarily, controlling the movable platform to move relative to the target according to the depth information of the tracking area includes: controlling the movable platform to follow the movement of the target according to the depth information of the tracking area.
For detailed descriptions of the examples given above, reference may be made to the foregoing description, which is not repeated here.
Correspondingly, in order to overcome the defects existing in the related art, this application further provides a movable platform. As shown in FIG. 13, the movable platform includes: an image collection device 1301, a memory 1302, and a processor 1303.
The image collection device is configured to acquire a depth map of the scene where the target is located.
The memory 1302 is configured to store program code.
The processor 1303 invokes the program code, and when the program code is executed, performs the following operations:
acquiring the depth map and a first target area on the depth map that contains the target;
adjusting the first target area so that the proportion of the area corresponding to the target within the first target area increases, to obtain a tracking area;
controlling the movable platform to move relative to the target according to the depth information of the tracking area.
The processor 1303 is further configured to perform the following operations:
acquiring the depth map and a first target area on the depth map that contains the target; deleting all or part of the other areas in the first target area that do not correspond to the target, to obtain a tracking area; and controlling the movable platform to move relative to the target according to the depth information of the tracking area.
In some embodiments, the movable platform may be an unmanned aerial vehicle, a smart car, an intelligent robot, and so on, which is not limited in this application.
With the movable platform provided by this application, a target area containing the target is acquired on the depth map of the scene where the target is located, and the target area is adjusted so that the proportion of the area corresponding to the target within the target area increases, to obtain a tracking area; the movable platform is then controlled to move relative to the target according to the depth information of the tracking area. Because this application adjusts the target area so that the proportion of the area corresponding to the target within it increases, more accurate distance information between the movable platform and the target can be obtained from the depth information of the tracking area, so that the movement of the movable platform relative to the target can be controlled more accurately, avoiding potential safety hazards and improving the motion performance of the movable platform.
In addition, in view of the technical problems existing in the related art, this application further provides a control apparatus. As shown in FIG. 14, the control apparatus includes: a memory 1401 and a processor 1402.
The memory 1401 is configured to store program code.
The processor 1402 invokes the program code, and when the program code is executed, performs the following operations:
acquiring a depth map, collected by the movable platform, of the scene where the target is located;
acquiring a target area on the depth map that contains the target;
adjusting the target area so that the proportion of the area corresponding to the target within the target area increases, to obtain a tracking area;
controlling the movable platform to move relative to the target according to the depth information of the tracking area.
The processor 1402 is further configured to perform the following operations:
acquiring the depth map, collected by the movable platform, of the scene where the target is located; acquiring a first target area on the depth map that contains the target; deleting all or part of the other areas in the first target area that do not correspond to the target, to obtain a tracking area; and controlling the movable platform to move relative to the target according to the depth information of the tracking area.
With the control apparatus provided by this application, a target area containing the target is acquired on the depth map of the scene where the target is located, and the target area is adjusted so that the proportion of the area corresponding to the target within the target area increases, to obtain a tracking area; the movable platform is then controlled to move relative to the target according to the depth information of the tracking area. Because this application adjusts the target area so that the proportion of the area corresponding to the target within it increases, more accurate distance information between the movable platform and the target can be obtained from the depth information of the tracking area, so that the movement of the movable platform relative to the target can be controlled more accurately, avoiding potential safety hazards and improving the motion performance of the movable platform.
The embodiments of this application further provide a machine-readable storage medium storing a computer program which, when run on a computer, implements any embodiment of the above methods of this application; details are not repeated here.
The machine-readable storage medium may be an internal storage unit of the device described in any of the foregoing embodiments, such as a hard disk or a memory of the device. The machine-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the device. Further, the machine-readable storage medium may include both an internal storage unit of the device and an external storage device. The machine-readable storage medium is used to store the computer program and other programs and data required by the device, and may also be used to temporarily store data that has been or will be output.
As for the apparatus embodiments, since they basically correspond to the method embodiments, reference may be made to the relevant descriptions of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments, which those of ordinary skill in the art can understand and implement without creative effort.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a machine-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing describes specific embodiments of this application. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of this application that follow its general principles and include common general knowledge or customary technical means in the technical field not disclosed in this specification. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this application being indicated by the following claims.
It should be understood that this application is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of this application is limited only by the appended claims.
The above are only preferred embodiments of this application and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall fall within its scope of protection.

Claims (88)

  1. A control method for a movable platform, characterized in that the method comprises:
    acquiring a depth map, collected by the movable platform, of the scene where a target is located;
    acquiring a first target area on the depth map that contains the target;
    adjusting the first target area so that the proportion of the area corresponding to the target within the first target area increases, to obtain a tracking area;
    controlling the movable platform to move relative to the target according to depth information of the tracking area.
  2. The method according to claim 1, characterized in that acquiring a first target area on the depth map that contains the target comprises:
    acquiring a second target area containing the target, the second target area being located on a color image collected by the movable platform;
    projecting the second target area onto the depth map to obtain the first target area.
  3. The method according to claim 2, characterized in that projecting the second target area onto the depth map comprises:
    projecting the second target area on the color image generated at time T onto the depth map generated at time T.
  4. The method according to claim 2, characterized in that, after projecting the second target area onto the depth map, the method further comprises:
    extracting first feature points on the color image and second feature points on the depth map respectively;
    performing feature matching on the first feature points and the second feature points;
    correcting the obtained first target area of the depth map based on a result of the feature matching.
  5. The method according to claim 1, characterized in that adjusting the first target area so that the proportion of the area corresponding to the target within the first target area increases comprises:
    adjusting the first target area so that a boundary of the first target area shrinks toward the target;
    and/or,
    deleting all or part of the other areas in the first target area that do not correspond to the target.
  6. The method according to claim 5, characterized in that adjusting the first target area so that the boundary of the first target area shrinks toward the target comprises:
    performing semantic segmentation on the first target area, and adjusting the first target area based on a semantic segmentation result.
  7. The method according to claim 6, characterized in that performing semantic segmentation on the first target area and adjusting the first target area based on the semantic segmentation result comprises:
    performing semantic segmentation on the depth map based on a pre-trained deep learning model;
    adjusting the first target area according to the semantic segmentation result.
  8. 根据权利要求7所述的方法,其特征在于,所述深度学习模型的训练过程包括:The method according to claim 7, wherein the training process of the deep learning model comprises:
    获取所述深度学习模型输出的训练图像中像素点对应的语义标签,基于所述语义分割标签计算第一损失函数,基于所述第一损失函数训练所述深度学习模型。The semantic labels corresponding to the pixels in the training image output by the deep learning model are acquired, a first loss function is calculated based on the semantic segmentation labels, and the deep learning model is trained based on the first loss function.
  9. 根据权利要求8所述的方法,其特征在于,在基于所述语义标签进行训练所述深度学习模型之前,所述训练过程还包括:The method according to claim 8, wherein before training the deep learning model based on the semantic labels, the training process further comprises:
    获取所述深度学习模型输出的训练图像对应的类别标签,基于所述类别标签计算第二损失函数,基于所述第二损失函数训练所述深度学习模型,确定在基于所述第一损失函数训练所述深度学习模型之前,所述深度学习模型的初始参数。Obtain the class label corresponding to the training image output by the deep learning model, calculate a second loss function based on the class label, train the deep learning model based on the second loss function, and determine whether to train based on the first loss function Before the deep learning model, the initial parameters of the deep learning model.
  10. 根据权利要求9所述的方法,其特征在于,所述第一损失函数和/或所述第二损失函数为交叉熵损失函数。The method according to claim 9, wherein the first loss function and/or the second loss function is a cross-entropy loss function.
  11. 根据权利要求9所述的方法,其特征在于,所述训练图像包括灰度图像。The method of claim 9, wherein the training image comprises a grayscale image.
  12. 根据权利要求7所述的方法,其特征在于,所述根据语义分割结 果,调整所述第一目标区域,还包括:The method according to claim 7, wherein, adjusting the first target area according to the semantic segmentation result, further comprising:
    对所述深度学习模型输出的语义分割结果使用预设的卷积核进行卷积,以缩小所述语义分割结果中表征所述目标物的分割区域,并基于卷积后的语义分割结果,调整所述第一目标区域。The semantic segmentation result output by the deep learning model is convolved with a preset convolution kernel to reduce the segmentation area representing the target in the semantic segmentation result, and based on the semantic segmentation result after the convolution, adjust the first target area.
  13. 根据权利要求12所述的方法,其特征在于,对所述深度学习模型输出的语义分割结果进行卷积,包括:The method according to claim 12, wherein performing convolution on the semantic segmentation result output by the deep learning model comprises:
    每次卷积时,将卷积核所在区域的最小像素值作为卷积核所在区域的中心点位置的像素值。In each convolution, the minimum pixel value of the area where the convolution kernel is located is taken as the pixel value of the center point of the area where the convolution kernel is located.
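The convolution of claims 12-13 is in effect a minimum filter (morphological erosion): writing each window's minimum to the window centre shrinks the segmented region, trimming border pixels that may belong to the background. A minimal sketch on a hypothetical 5x5 binary mask:

```python
import numpy as np

def min_filter(mask, k=3):
    """Slide a k x k window over the mask and write the window minimum to
    the centre pixel (claim 13).  On a binary segmentation mask this
    erodes the region labelled as target."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="edge")   # replicate border values
    out = np.empty_like(mask)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out

seg = np.zeros((5, 5), dtype=np.uint8)
seg[1:4, 1:4] = 1          # 3x3 block labelled "target"
eroded = min_filter(seg)   # only the centre pixel keeps an all-ones window
```

With a 3x3 kernel, only pixels whose entire neighbourhood lies inside the segmented region survive, so the target area shrinks by one pixel per pass.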
  14. 根据权利要求1所述的方法,其特征在于,调整所述第一目标区域,以使所述第一目标区域中对应所述目标物的区域的占比增加,包括:The method according to claim 1, wherein adjusting the first target area so that the proportion of the area corresponding to the target object in the first target area increases comprises:
    对所述深度图及其相邻帧深度图进行差分运算,并基于差分结果确定所述深度图上所述目标物的轮廓,将所述轮廓所包含的区域作为调整后的区域。The difference operation is performed on the depth map and the depth maps of adjacent frames, and the contour of the target object on the depth map is determined based on the difference result, and the area included in the contour is used as the adjusted area.
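Claim 14's frame differencing can be sketched as follows; the depth values and threshold are illustrative. Pixels whose depth changes between adjacent frames outline the moving target, and the region they enclose becomes the adjusted area:

```python
import numpy as np

# Two consecutive depth frames (hypothetical values, metres).  The target
# moves against a static background, so the per-pixel difference is large
# only where the target now is.
prev = np.full((4, 4), 10.0)
curr = prev.copy()
curr[1:3, 1:3] = 2.0                    # target occupies this block in frame T

diff = np.abs(curr - prev)              # difference operation (claim 14)
moving = diff > 0.5                     # illustrative threshold
ys, xs = np.where(moving)
# The bounding contour of the changed pixels gives the adjusted area.
adjusted = (ys.min(), xs.min(), ys.max(), xs.max())
```
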
  15. 根据权利要求1所述的方法,其特征在于,根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动,包括:The method according to claim 1, wherein, according to the depth information of the tracking area, controlling the movable platform to move relative to the target comprises:
    根据所述跟踪区域的深度信息,控制所述可移动平台跟随所述目标物运动。According to the depth information of the tracking area, the movable platform is controlled to follow the movement of the target.
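One simple realization of the depth-based following in claim 15, offered as a sketch rather than the disclosed implementation, is a proportional controller that derives the platform's forward speed from the error between the tracking-area depth and a desired following distance. The gain, setpoint, and speed limit below are illustrative:

```python
def follow_speed(tracking_depth, desired_depth=5.0, kp=0.8, v_max=2.0):
    """Proportional follow control: command a forward speed proportional
    to the error between the measured tracking-area depth and the desired
    following distance, clamped to the platform's speed limit."""
    error = tracking_depth - desired_depth   # >0: target too far ahead
    v = kp * error
    return max(-v_max, min(v_max, v))

# Target has drifted to 6.5 m; the platform speeds up to close the gap.
cmd = follow_speed(6.5)
```

Because the tracking area excludes most background pixels, its depth statistic tracks the target itself, which keeps such a controller from chasing background depth.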
  16. 一种可移动平台的控制方法,其特征在于,所述方法包括:A control method for a movable platform, characterized in that the method comprises:
    获取所述可移动平台采集得到的目标物所在场景的深度图;acquiring a depth map, collected by the movable platform, of the scene where the target object is located;
    获取所述深度图上包括所述目标物的第一目标区域;acquiring a first target area including the target on the depth map;
    删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,得到跟踪区域;Deleting all or part of other areas in the first target area that do not correspond to the area of the target to obtain a tracking area;
    根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动。According to the depth information of the tracking area, the movable platform is controlled to move relative to the target.
  17. 根据权利要求16所述的方法,其特征在于,获取所述深度图上包括所述目标物的第一目标区域,包括:The method according to claim 16, wherein acquiring the first target area including the target on the depth map comprises:
    获取包含所述目标物的第二目标区域,所述第二目标区域位于所述可移动平台采集到的彩色图像上;acquiring a second target area containing the target, the second target area being located on the color image collected by the movable platform;
    将所述第二目标区域投影到所述深度图上,得到所述第一目标区域。Projecting the second target area onto the depth map to obtain the first target area.
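Claim 17's projection can be illustrated under a strong simplifying assumption, namely that the color image and the depth map are already aligned and differ only in resolution, so a per-axis scale maps the second target area onto the first. A real system would instead apply the calibrated intrinsics and extrinsics of the two sensors:

```python
def project_box(box, color_size, depth_size):
    """Map a bounding box (x0, y0, x1, y1) from color-image coordinates
    to depth-map coordinates.  Simplifying assumption: aligned views,
    so only the resolutions differ."""
    cw, ch = color_size
    dw, dh = depth_size
    sx, sy = dw / cw, dh / ch
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

# Second target area on a 1920x1080 color image -> first target area
# on a 640x360 depth map (hypothetical resolutions).
first_area = project_box((960, 540, 1440, 810), (1920, 1080), (640, 360))
```

Claim 18 additionally requires that the color image and depth map carry the same timestamp T, so the box is projected between frames captured at the same instant.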
  18. 根据权利要求17所述的方法,其特征在于,将所述第二目标区域投影到所述深度图上,包括:The method according to claim 17, wherein projecting the second target region onto the depth map comprises:
    将T时刻生成的所述彩色图像上的第二目标区域投影到T时刻生成的所述深度图上。Projecting the second target area on the color image generated at time T onto the depth map generated at time T.
  19. 根据权利要求17所述的方法,其特征在于,将所述第二目标区域投影到所述深度图上,之后还包括:The method according to claim 17, wherein after projecting the second target area onto the depth map, the method further comprises:
    分别提取所述彩色图像上的第一特征点以及所述深度图上的第二特征点;respectively extracting the first feature point on the color image and the second feature point on the depth map;
    对所述第一特征点和所述第二特征点进行特征匹配;performing feature matching on the first feature point and the second feature point;
    基于特征匹配的结果,修正所得到的所述深度图的第一目标区域。Based on the result of the feature matching, the obtained first target area of the depth map is modified.
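One plausible reading of claim 19 is that matched feature points reveal a residual misalignment between the color image and the depth map, which is then used to correct the projected box. The sketch below estimates a mean offset from hypothetical matched points; actual matching would use descriptors (e.g. ORB) with outlier rejection:

```python
import numpy as np

def correct_box(box, pts_color, pts_depth):
    """Refine the projected first target area: estimate the mean offset
    between matched feature points on the color image and on the depth
    map, then shift the box by that offset."""
    offset = np.mean(np.asarray(pts_depth, dtype=float)
                     - np.asarray(pts_color, dtype=float), axis=0)
    dx, dy = offset
    x0, y0, x1, y1 = box
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)

# Hypothetical matches, consistently shifted by (+2, -1) pixels.
pts_color = [(10, 10), (20, 15), (30, 25)]
pts_depth = [(12, 9), (22, 14), (32, 24)]
corrected = correct_box((5, 5, 35, 30), pts_color, pts_depth)
```
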
  20. 根据权利要求16所述的方法,其特征在于,所述删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,包括:The method according to claim 16, wherein the deleting all or part of other regions in the first target region that does not correspond to the region of the target comprises:
    对所述第一目标区域进行语义分割,基于语义分割结果删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域。Semantic segmentation is performed on the first target area, and all or part of other areas in the first target area that do not correspond to the area of the target object are deleted based on the semantic segmentation result.
  21. 根据权利要求20所述的方法,其特征在于,所述对所述第一目标区域进行语义分割,基于语义分割结果删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,包括:The method according to claim 20, wherein performing semantic segmentation on the first target area and deleting, based on the semantic segmentation result, all or part of the other areas in the first target area that do not correspond to the target object comprises:
    基于预先训练的深度学习模型,对所述深度图进行语义分割;performing semantic segmentation on the depth map based on a pre-trained deep learning model;
    根据语义分割结果,删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域。According to the semantic segmentation result, delete all or part of other regions in the first target region that do not correspond to the region of the target object.
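The effect of claims 20-21 can be shown on a small numeric example: pixels of the first target area that the segmentation marks as background are deleted, so the depth statistic of the remaining tracking area reflects only the target. All values are hypothetical:

```python
import numpy as np

# Depth values inside the first target area (metres); the segmentation
# mask marks which pixels belong to the target (hypothetical output of
# the pre-trained model).
depth = np.array([[5.0, 5.1, 9.0],
                  [5.2, 5.0, 9.2],
                  [9.1, 9.3, 9.0]])
mask = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [0, 0, 0]], dtype=bool)

# Delete the non-target pixels and take the depth statistic over what
# remains: the ~9 m background no longer biases the tracking distance.
tracking_depth = depth[mask].mean()
```

Without the deletion, the mean over the whole box would be pulled toward the background depth and the platform would misjudge its distance to the target.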
  22. 根据权利要求21所述的方法,其特征在于,所述深度学习模型的训练过程包括:The method according to claim 21, wherein the training process of the deep learning model comprises:
    获取所述深度学习模型输出的训练图像中像素点对应的语义标签,基于所述语义分割标签计算第一损失函数,基于所述第一损失函数训练所述深度学习模型。The semantic labels corresponding to the pixels in the training image output by the deep learning model are acquired, a first loss function is calculated based on the semantic segmentation labels, and the deep learning model is trained based on the first loss function.
  23. 根据权利要求22所述的方法,其特征在于,在基于所述语义标签进行训练所述深度学习模型之前,所述训练过程还包括:The method according to claim 22, wherein before training the deep learning model based on the semantic labels, the training process further comprises:
    获取所述深度学习模型输出的训练图像对应的类别标签,基于所述类别标签计算第二损失函数,基于所述第二损失函数训练所述深度学习模型,确定在基于所述第一损失函数训练所述深度学习模型之前,所述深度学习模型的初始参数。Obtain the class label corresponding to the training image output by the deep learning model, calculate a second loss function based on the class label, and train the deep learning model based on the second loss function to determine the initial parameters of the deep learning model before the deep learning model is trained based on the first loss function.
  24. 根据权利要求23所述的方法,其特征在于,所述第一损失函数和/或所述第二损失函数为交叉熵损失函数。The method according to claim 23, wherein the first loss function and/or the second loss function is a cross-entropy loss function.
  25. 根据权利要求23所述的方法,其特征在于,所述训练图像包括灰度图像。The method of claim 23, wherein the training image comprises a grayscale image.
  26. 根据权利要求21所述的方法,其特征在于,所述根据语义分割结果,删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,还包括:The method according to claim 21, wherein, according to the semantic segmentation result, deleting all or part of other regions in the first target region that does not correspond to the region of the target further comprises:
    对所述深度学习模型输出的语义分割结果使用预设的卷积核进行卷积,以缩小所述语义分割结果中表征所述目标物的分割区域,并基于卷积后的语义分割结果,调整所述第一目标区域。The semantic segmentation result output by the deep learning model is convolved with a preset convolution kernel to reduce the segmentation area representing the target in the semantic segmentation result, and based on the semantic segmentation result after the convolution, adjust the first target area.
  27. 根据权利要求26所述的方法,其特征在于,对所述深度学习模型输出的语义分割结果进行卷积,包括:The method according to claim 26, wherein performing convolution on the semantic segmentation result output by the deep learning model comprises:
    每次卷积时,将卷积核所在区域的最小像素值作为卷积核所在区域的中心点位置的像素值。In each convolution, the minimum pixel value of the area where the convolution kernel is located is taken as the pixel value of the center point of the area where the convolution kernel is located.
  28. 根据权利要求16所述的方法,其特征在于,所述删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,包括:The method according to claim 16, wherein the deleting all or part of other regions in the first target region that does not correspond to the region of the target comprises:
    对所述深度图及其相邻帧深度图进行差分运算,并基于差分结果确定所述深度图上所述目标物的轮廓,删除所述轮廓所包含的区域以外的其他区域。A difference operation is performed on the depth map and the depth maps of adjacent frames, and the contour of the target object on the depth map is determined based on the difference result, and other areas except the area included in the contour are deleted.
  29. 根据权利要求16所述的方法,其特征在于,根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动,包括:The method according to claim 16, wherein, according to the depth information of the tracking area, controlling the movable platform to move relative to the target comprises:
    根据所述跟踪区域的深度信息,控制所述可移动平台跟随所述目标物运动。According to the depth information of the tracking area, the movable platform is controlled to follow the movement of the target.
  30. 一种可移动平台,其特征在于,所述可移动平台包括:图像采集装置、存储器和处理器;A movable platform, characterized in that, the movable platform comprises: an image acquisition device, a memory and a processor;
    所述图像采集装置,用于获取目标物所在场景的深度图;The image acquisition device is used to acquire a depth map of the scene where the target object is located;
    所述存储器用于存储程序代码;the memory is used to store program codes;
    所述处理器调用所述程序代码,当程序代码被执行时,用于执行以下操作:the processor invokes the program code, and when the program code is executed, the processor performs the following operations:
    获取所述深度图以及所述深度图上包括所述目标物的第一目标区域;acquiring the depth map and the first target area on the depth map including the target;
    调整所述第一目标区域,以使所述第一目标区域中对应所述目标物的区域的占比增加,得到跟踪区域;Adjusting the first target area to increase the proportion of the area corresponding to the target in the first target area to obtain a tracking area;
    根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动。According to the depth information of the tracking area, the movable platform is controlled to move relative to the target.
  31. 根据权利要求30所述的可移动平台,其特征在于,获取所述深度图上包括所述目标物的第一目标区域,包括:The movable platform according to claim 30, wherein acquiring the first target area including the target on the depth map comprises:
    获取包含所述目标物的第二目标区域,所述第二目标区域位于所述可移动平台采集到的彩色图像上;acquiring a second target area containing the target, the second target area being located on the color image collected by the movable platform;
    将所述第二目标区域投影到所述深度图上,得到所述第一目标区域。Projecting the second target area onto the depth map to obtain the first target area.
  32. 根据权利要求31所述的可移动平台,其特征在于,将所述第二目标区域投影到所述深度图上,包括:The movable platform of claim 31, wherein projecting the second target area onto the depth map comprises:
    将T时刻生成的所述彩色图像上的第二目标区域投影到T时刻生成的所述深度图上。Projecting the second target area on the color image generated at time T onto the depth map generated at time T.
  33. 根据权利要求31所述的可移动平台,其特征在于,将所述第二目标区域投影到所述深度图上,之后还包括:The movable platform of claim 31, wherein projecting the second target area onto the depth map, further comprising:
    分别提取所述彩色图像上的第一特征点以及所述深度图上的第二特征点;respectively extracting the first feature point on the color image and the second feature point on the depth map;
    对所述第一特征点和所述第二特征点进行特征匹配;performing feature matching on the first feature point and the second feature point;
    基于特征匹配的结果,修正所得到的所述深度图的第一目标区域。Based on the result of the feature matching, the obtained first target area of the depth map is modified.
  34. 根据权利要求30所述的可移动平台,其特征在于,调整所述第一目标区域,以使所述第一目标区域中对应所述目标物的区域的占比增加,包括:The movable platform according to claim 30, wherein adjusting the first target area so that the proportion of the area corresponding to the target object in the first target area increases, comprising:
    调整所述第一目标区域,以使所述第一目标区域的边界向靠近所述目标物的方向收缩;和/或Adjusting the first target area so that the boundary of the first target area shrinks toward the target; and/or
    删除所述第一目标区域中未对应所述目标物的区域的其他区域中的全部或者部分区域。Deleting all or part of other areas in the first target area that do not correspond to the area of the target.
  35. 根据权利要求34所述的可移动平台,其特征在于,所述调整所述第一目标区域,以使所述第一目标区域的边界向靠近所述目标物的方向收缩,包括:The movable platform according to claim 34, wherein the adjusting the first target area so that the boundary of the first target area shrinks toward the target object comprises:
    对所述第一目标区域进行语义分割,基于语义分割结果调整所述第一目标区域。Semantic segmentation is performed on the first target area, and the first target area is adjusted based on the result of the semantic segmentation.
  36. 根据权利要求35所述的可移动平台,其特征在于,所述对所述第一目标区域进行语义分割,基于语义分割结果调整所述第一目标区域,包括:The movable platform according to claim 35, wherein the performing semantic segmentation on the first target area and adjusting the first target area based on a result of the semantic segmentation includes:
    基于预先训练的深度学习模型,对所述深度图进行语义分割;performing semantic segmentation on the depth map based on a pre-trained deep learning model;
    根据语义分割结果,调整所述第一目标区域。According to the semantic segmentation result, the first target area is adjusted.
  37. 根据权利要求36所述的可移动平台,其特征在于,所述深度学习模型的训练过程包括:The movable platform according to claim 36, wherein the training process of the deep learning model comprises:
    获取所述深度学习模型输出的训练图像中像素点对应的语义标签,基于所述语义分割标签计算第一损失函数,基于所述第一损失函数训练所述深度学习模型。A semantic label corresponding to a pixel in a training image output by the deep learning model is acquired, a first loss function is calculated based on the semantic segmentation label, and the deep learning model is trained based on the first loss function.
  38. 根据权利要求37所述的可移动平台,其特征在于,在基于所述语义标签进行训练所述深度学习模型之前,所述训练过程还包括:The movable platform according to claim 37, wherein before training the deep learning model based on the semantic labels, the training process further comprises:
    获取所述深度学习模型输出的训练图像对应的类别标签,基于所述类别标签计算第二损失函数,基于所述第二损失函数训练所述深度学习模型,确定在基于所述第一损失函数训练所述深度学习模型之前,所述深度学习模型的初始参数。Obtain the class label corresponding to the training image output by the deep learning model, calculate a second loss function based on the class label, and train the deep learning model based on the second loss function to determine the initial parameters of the deep learning model before the deep learning model is trained based on the first loss function.
  39. 根据权利要求38所述的可移动平台,其特征在于,所述第一损失函数和/或所述第二损失函数为交叉熵损失函数。The movable platform according to claim 38, wherein the first loss function and/or the second loss function is a cross-entropy loss function.
  40. 根据权利要求38所述的可移动平台,其特征在于,所述训练图像包括灰度图像。The movable platform of claim 38, wherein the training image comprises a grayscale image.
  41. 根据权利要求36所述的可移动平台,其特征在于,所述根据语义分割结果,调整所述第一目标区域,还包括:The movable platform according to claim 36, wherein the adjusting the first target area according to the semantic segmentation result further comprises:
    对所述深度学习模型输出的语义分割结果使用预设的卷积核进行卷积,以缩小所述语义分割结果中表征所述目标物的分割区域,并基于卷积后的语义分割结果,调整所述第一目标区域。The semantic segmentation result output by the deep learning model is convolved with a preset convolution kernel to reduce the segmentation area representing the target in the semantic segmentation result, and based on the semantic segmentation result after the convolution, adjust the first target area.
  42. 根据权利要求41所述的可移动平台,其特征在于,对所述深度学习模型输出的语义分割结果进行卷积,包括:The movable platform according to claim 41, wherein the convolution of the semantic segmentation result output by the deep learning model comprises:
    每次卷积时,将卷积核所在区域的最小像素值作为卷积核所在区域的中心点位置的像素值。In each convolution, the minimum pixel value of the area where the convolution kernel is located is taken as the pixel value of the center point of the area where the convolution kernel is located.
  43. 根据权利要求30所述的可移动平台,其特征在于,调整所述第一目标区域,以使所述第一目标区域中对应所述目标物的区域的占比增加,包括:The movable platform according to claim 30, wherein adjusting the first target area so that the proportion of the area corresponding to the target object in the first target area increases, comprising:
    对所述深度图及其相邻帧深度图进行差分运算,并基于差分结果确定所述深度图上所述目标物的轮廓,将所述轮廓所包含的区域作为调整后的区域。The difference operation is performed on the depth map and the depth maps of adjacent frames, and the contour of the target object on the depth map is determined based on the difference result, and the area included in the contour is used as the adjusted area.
  44. 根据权利要求30所述的可移动平台,其特征在于,根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动,包括:The movable platform according to claim 30, wherein, according to the depth information of the tracking area, controlling the movable platform to move relative to the target comprises:
    根据所述跟踪区域的深度信息,控制所述可移动平台跟随所述目标物运动。According to the depth information of the tracking area, the movable platform is controlled to follow the movement of the target.
  45. 一种控制装置,其特征在于,所述控制装置包括:存储器和处理器;A control device, characterized in that the control device comprises: a memory and a processor;
    所述存储器用于存储程序代码;the memory is used to store program codes;
    所述处理器调用所述程序代码,当程序代码被执行时,用于执行以下操作:the processor invokes the program code, and when the program code is executed, the processor performs the following operations:
    获取可移动平台采集得到的目标物所在场景的深度图;acquiring a depth map, collected by a movable platform, of the scene where the target object is located;
    获取所述深度图上包括所述目标物的目标区域;acquiring a target area including the target on the depth map;
    调整所述目标区域,以使所述目标区域中对应所述目标物的区域的占比增加,得到跟踪区域;Adjusting the target area so that the proportion of the area corresponding to the target object in the target area increases to obtain a tracking area;
    根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动。According to the depth information of the tracking area, the movable platform is controlled to move relative to the target.
  46. 根据权利要求45所述的控制装置,其特征在于,获取所述深度图上包括所述目标物的第一目标区域,包括:The control device according to claim 45, wherein acquiring the first target area including the target on the depth map comprises:
    获取包含所述目标物的第二目标区域,所述第二目标区域位于所述可移动平台采集到的彩色图像上;acquiring a second target area containing the target, the second target area being located on the color image collected by the movable platform;
    将所述第二目标区域投影到所述深度图上,得到所述第一目标区域。Projecting the second target area onto the depth map to obtain the first target area.
  47. 根据权利要求46所述的控制装置,其特征在于,将所述第二目标区域投影到所述深度图上,包括:The control device according to claim 46, wherein projecting the second target area onto the depth map comprises:
    将T时刻生成的所述彩色图像上的第二目标区域投影到T时刻生成的所述深度图上。Projecting the second target area on the color image generated at time T onto the depth map generated at time T.
  48. 根据权利要求46所述的控制装置,其特征在于,将所述第二目标区域投影到所述深度图上,之后还包括:The control device according to claim 46, wherein the second target area is projected onto the depth map, further comprising:
    分别提取所述彩色图像上的第一特征点以及所述深度图上的第二特征点;respectively extracting the first feature point on the color image and the second feature point on the depth map;
    对所述第一特征点和所述第二特征点进行特征匹配;performing feature matching on the first feature point and the second feature point;
    基于特征匹配的结果,修正所得到的所述深度图的第一目标区域。Based on the result of the feature matching, the obtained first target area of the depth map is modified.
  49. 根据权利要求45所述的控制装置,其特征在于,调整所述第一目标区域,以使所述第一目标区域中对应所述目标物的区域的占比增加,包括:The control device according to claim 45, wherein adjusting the first target area so that the proportion of the area corresponding to the target object in the first target area increases, comprising:
    调整所述第一目标区域,以使所述第一目标区域的边界向靠近所述目标物的方向收缩;和/或,Adjusting the first target area so that the boundary of the first target area shrinks toward the target object; and/or,
    删除所述第一目标区域中未对应所述目标物的区域的其他区域中的全部或者部分区域。Deleting all or part of other areas in the first target area that do not correspond to the area of the target.
  50. 根据权利要求49所述的控制装置,其特征在于,所述调整所述第一目标区域,以使所述第一目标区域的边界向靠近所述目标物的方向收缩,包括:The control device according to claim 49, wherein the adjusting the first target area so that the boundary of the first target area shrinks toward the target object comprises:
    对所述第一目标区域进行语义分割,基于语义分割结果调整所述第一目标区域。Semantic segmentation is performed on the first target area, and the first target area is adjusted based on the result of the semantic segmentation.
  51. 根据权利要求50所述的控制装置,其特征在于,对所述第一目标区域进行语义分割,基于语义分割结果调整所述第一目标区域,包括:The control device according to claim 50, wherein, performing semantic segmentation on the first target area, and adjusting the first target area based on the semantic segmentation result, comprising:
    基于预先训练的深度学习模型,对所述深度图进行语义分割;performing semantic segmentation on the depth map based on a pre-trained deep learning model;
    根据语义分割结果,调整所述第一目标区域。According to the semantic segmentation result, the first target area is adjusted.
  52. 根据权利要求51所述的控制装置,其特征在于,所述深度学习模型的训练过程包括:The control device according to claim 51, wherein the training process of the deep learning model comprises:
    获取所述深度学习模型输出的训练图像中像素点对应的语义标签,基于所述语义分割标签计算第一损失函数,基于所述第一损失函数训练所述深度学习模型。A semantic label corresponding to a pixel in a training image output by the deep learning model is acquired, a first loss function is calculated based on the semantic segmentation label, and the deep learning model is trained based on the first loss function.
  53. 根据权利要求52所述的控制装置,其特征在于,在基于所述语义标签进行训练所述深度学习模型之前,所述训练过程还包括:The control device according to claim 52, wherein before training the deep learning model based on the semantic label, the training process further comprises:
    获取所述深度学习模型输出的训练图像对应的类别标签,基于所述类别标签计算第二损失函数,基于所述第二损失函数训练所述深度学习模型,确定在基于所述第一损失函数训练所述深度学习模型之前,所述深度学习模型的初始参数。Obtain the class label corresponding to the training image output by the deep learning model, calculate a second loss function based on the class label, and train the deep learning model based on the second loss function to determine the initial parameters of the deep learning model before the deep learning model is trained based on the first loss function.
  54. 根据权利要求53所述的控制装置,其特征在于,所述第一损失函数和/或所述第二损失函数为交叉熵损失函数。The control device according to claim 53, wherein the first loss function and/or the second loss function is a cross-entropy loss function.
  55. 根据权利要求53所述的控制装置,其特征在于,所述训练图像包括灰度图像。The control device of claim 53, wherein the training image comprises a grayscale image.
  56. 根据权利要求51所述的控制装置,其特征在于,所述根据语义分割结果,调整所述第一目标区域,还包括:The control device according to claim 51, wherein the adjusting the first target area according to the semantic segmentation result further comprises:
    对所述深度学习模型输出的语义分割结果使用预设的卷积核进行卷积,以缩小所述语义分割结果中表征所述目标物的分割区域,并基于卷积后的语义分割结果,调整所述第一目标区域。The semantic segmentation result output by the deep learning model is convolved with a preset convolution kernel to reduce the segmentation area representing the target in the semantic segmentation result, and based on the semantic segmentation result after the convolution, adjust the first target area.
  57. 根据权利要求56所述的控制装置,其特征在于,对所述深度学习模型输出的语义分割结果进行卷积,包括:The control device according to claim 56, wherein the convolution of the semantic segmentation result output by the deep learning model comprises:
    每次卷积时,将卷积核所在区域的最小像素值作为卷积核所在区域的中心点位置的像素值。In each convolution, the minimum pixel value of the area where the convolution kernel is located is taken as the pixel value of the center point of the area where the convolution kernel is located.
  58. 根据权利要求45所述的控制装置,其特征在于,调整所述第一目标区域,以使所述第一目标区域中对应所述目标物的区域的占比增加,包括:The control device according to claim 45, wherein adjusting the first target area so that the proportion of the area corresponding to the target object in the first target area increases, comprising:
    对所述深度图及其相邻帧深度图进行差分运算,并基于差分结果确定所述深度图上所述目标物的轮廓,将所述轮廓所包含的区域作为调整后的区域。The difference operation is performed on the depth map and the depth maps of adjacent frames, and the contour of the target object on the depth map is determined based on the difference result, and the area included in the contour is used as the adjusted area.
  59. 根据权利要求45所述的控制装置,其特征在于,根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动,包括:The control device according to claim 45, wherein, according to the depth information of the tracking area, controlling the movable platform to move relative to the target comprises:
    根据所述跟踪区域的深度信息,控制所述可移动平台跟随所述目标物运动。According to the depth information of the tracking area, the movable platform is controlled to follow the movement of the target.
  60. 一种可移动平台,其特征在于,所述可移动平台包括:图像采集装置、存储器和处理器;A movable platform, characterized in that, the movable platform comprises: an image acquisition device, a memory and a processor;
    所述图像采集装置,用于获取目标物所在场景的深度图;The image acquisition device is used for acquiring the depth map of the scene where the target object is located;
    所述存储器用于存储程序代码;the memory is used to store program codes;
    所述处理器调用所述程序代码,当程序代码被执行时,用于执行以下操作:the processor invokes the program code, and when the program code is executed, the processor performs the following operations:
    获取所述深度图以及所述深度图上包括所述目标物的第一目标区域;acquiring the depth map and the first target area on the depth map including the target;
    删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,得到跟踪区域;Deleting all or part of other areas in the first target area that do not correspond to the area of the target to obtain a tracking area;
    根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动。According to the depth information of the tracking area, the movable platform is controlled to move relative to the target.
  61. 根据权利要求60所述的可移动平台,其特征在于,获取所述深度图上包括所述目标物的第一目标区域,包括:The movable platform according to claim 60, wherein acquiring the first target area including the target on the depth map comprises:
    获取包含所述目标物的第二目标区域,所述第二目标区域位于所述可移动平台采集到的彩色图像上;acquiring a second target area containing the target, the second target area being located on the color image collected by the movable platform;
    将所述第二目标区域投影到所述深度图上,得到所述第一目标区域。Projecting the second target area onto the depth map to obtain the first target area.
  62. 根据权利要求61所述的可移动平台,其特征在于,将所述第二目标区域投影到所述深度图上,包括:The movable platform of claim 61, wherein projecting the second target area onto the depth map comprises:
    将T时刻生成的所述彩色图像上的第二目标区域投影到T时刻生成的所述深度图上。Projecting the second target area on the color image generated at time T onto the depth map generated at time T.
  63. 根据权利要求61所述的可移动平台,其特征在于,将所述第二目标区域投影到所述深度图上,之后还包括:The movable platform of claim 61, wherein projecting the second target area onto the depth map, further comprising:
    分别提取所述彩色图像上的第一特征点以及所述深度图上的第二特征点;respectively extracting the first feature point on the color image and the second feature point on the depth map;
    对所述第一特征点和所述第二特征点进行特征匹配;performing feature matching on the first feature point and the second feature point;
    基于特征匹配的结果,修正所得到的所述深度图的第一目标区域。Based on the result of the feature matching, the obtained first target area of the depth map is modified.
  64. 根据权利要求60所述的可移动平台,其特征在于,所述删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,包括:The movable platform according to claim 60, wherein the deleting all or part of other areas in the first target area that does not correspond to the area of the target includes:
    对所述第一目标区域进行语义分割,基于语义分割结果删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域。Semantic segmentation is performed on the first target area, and based on the result of the semantic segmentation, all or part of other areas in the first target area that are not corresponding to the area of the target object are deleted.
  65. 根据权利要求64所述的可移动平台,其特征在于,所述对所述第一目标区域进行语义分割,基于语义分割结果删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,包括:The movable platform according to claim 64, wherein performing semantic segmentation on the first target area and deleting, based on the semantic segmentation result, all or part of the other areas in the first target area that do not correspond to the target object comprises:
    基于预先训练的深度学习模型,对所述深度图进行语义分割;performing semantic segmentation on the depth map based on a pre-trained deep learning model;
    根据语义分割结果,删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域。According to the semantic segmentation result, delete all or part of other regions in the first target region that do not correspond to the region of the target object.
  66. 根据权利要求65所述的可移动平台,其特征在于,所述深度学习模型的训练过程包括:The movable platform according to claim 65, wherein the training process of the deep learning model comprises:
    获取所述深度学习模型输出的训练图像中像素点对应的语义标签,基于所述语义分割标签计算第一损失函数,基于所述第一损失函数训练所述深度学习模型。A semantic label corresponding to a pixel in a training image output by the deep learning model is acquired, a first loss function is calculated based on the semantic segmentation label, and the deep learning model is trained based on the first loss function.
  67. 根据权利要求66所述的可移动平台,其特征在于,在基于所述语义标签进行训练所述深度学习模型之前,所述训练过程还包括:The mobile platform according to claim 66, wherein before training the deep learning model based on the semantic labels, the training process further comprises:
    获取所述深度学习模型输出的训练图像对应的类别标签,基于所述类别标签计算第二损失函数,基于所述第二损失函数训练所述深度学习模型,确定在基于所述第一损失函数训练所述深度学习模型之前,所述深度学习模型的初始参数。Obtain the class label corresponding to the training image output by the deep learning model, calculate a second loss function based on the class label, and train the deep learning model based on the second loss function to determine the initial parameters of the deep learning model before the deep learning model is trained based on the first loss function.
  68. 根据权利要求67所述的可移动平台,其特征在于,所述第一损失函数和/或所述第二损失函数为交叉熵损失函数。The movable platform according to claim 67, wherein the first loss function and/or the second loss function is a cross-entropy loss function.
  69. 根据权利要求67所述的可移动平台,其特征在于,所述训练图像包括灰度图像。The movable platform of claim 67, wherein the training image comprises a grayscale image.
  70. 根据权利要求65所述的可移动平台,其特征在于,所述根据语义分割结果,删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,还包括:The movable platform according to claim 65, wherein, according to the semantic segmentation result, deleting all or part of the other regions in the first target region that does not correspond to the region of the target further comprises:
    对所述深度学习模型输出的语义分割结果使用预设的卷积核进行卷积,以缩小所述语义分割结果中表征所述目标物的分割区域,并基于卷积后的语义分割结果,调整所述第一目标区域。The semantic segmentation result output by the deep learning model is convolved with a preset convolution kernel to reduce the segmentation area representing the target in the semantic segmentation result, and based on the semantic segmentation result after the convolution, adjust the first target area.
  71. 根据权利要求70所述的可移动平台,其特征在于,对所述深度学习模型输出的语义分割结果进行卷积,包括:The movable platform according to claim 70, wherein the convolution of the semantic segmentation result output by the deep learning model comprises:
    每次卷积时,将卷积核所在区域的最小像素值作为卷积核所在区域的中心点位置的像素值。In each convolution, the minimum pixel value of the area where the convolution kernel is located is taken as the pixel value of the center point of the area where the convolution kernel is located.
  72. 根据权利要求60所述的可移动平台,其特征在于,所述删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,包括:The movable platform according to claim 60, wherein the deleting all or part of other areas in the first target area that does not correspond to the area of the target includes:
    对所述深度图及其相邻帧深度图进行差分运算,并基于差分结果确定所述深度图上所述目标物的轮廓,删除所述轮廓所包含的区域以外的其他区域。A difference operation is performed on the depth map and the depth maps of adjacent frames, and the contour of the target object on the depth map is determined based on the difference result, and other areas other than the area included in the contour are deleted.
  73. 根据权利要求60所述的可移动平台,其特征在于,根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动,包括:The movable platform according to claim 60, wherein, according to the depth information of the tracking area, controlling the movable platform to move relative to the target comprises:
    根据所述跟踪区域的深度信息,控制所述可移动平台跟随所述目标物运动。According to the depth information of the tracking area, the movable platform is controlled to follow the movement of the target.
  74. 一种控制装置,其特征在于,所述控制装置包括:存储器和处理器;A control device, characterized in that the control device comprises: a memory and a processor;
    所述存储器用于存储程序代码;the memory is used to store program codes;
    所述处理器调用所述程序代码,当程序代码被执行时,用于执行以下操作:the processor invokes the program code, and when the program code is executed, the processor performs the following operations:
    获取所述可移动平台采集得到的目标物所在场景的深度图;acquiring a depth map, collected by the movable platform, of the scene where the target object is located;
    获取所述深度图上包括所述目标物的第一目标区域;acquiring a first target area including the target on the depth map;
    删除所述第一目标区域中未对应所述目标物的区域的其他区域的全部或部分区域,得到跟踪区域;Deleting all or part of other areas in the first target area that do not correspond to the area of the target to obtain a tracking area;
    根据所述跟踪区域的深度信息,控制所述可移动平台相对于所述目标物运动。The movable platform is controlled to move relative to the target according to the depth information of the tracking area.
  75. 根据权利要求74所述的装置,其特征在于,获取所述深度图上包括所述目标物的第一目标区域,包括:The apparatus according to claim 74, wherein acquiring the first target area including the target on the depth map comprises:
    获取包含所述目标物的第二目标区域,所述第二目标区域位于所述可移动平台采集到的彩色图像上;acquiring a second target area containing the target, the second target area being located on the color image collected by the movable platform;
    将所述第二目标区域投影到所述深度图上,得到所述第一目标区域。Projecting the second target area onto the depth map to obtain the first target area.
  76. The apparatus according to claim 75, wherein projecting the second target area onto the depth map comprises:
    projecting the second target area on the color image generated at time T onto the depth map generated at time T.
  77. The apparatus according to claim 75, wherein after projecting the second target area onto the depth map, the operations further comprise:
    extracting first feature points on the color image and second feature points on the depth map, respectively;
    performing feature matching between the first feature points and the second feature points;
    correcting the obtained first target area on the depth map based on the feature matching result.
  78. The apparatus according to claim 74, wherein deleting all or part of the areas in the first target area other than the area corresponding to the target object comprises:
    performing semantic segmentation on the first target area, and deleting, based on the semantic segmentation result, all or part of the areas in the first target area other than the area corresponding to the target object.
  79. The apparatus according to claim 78, wherein performing semantic segmentation on the first target area and deleting, based on the semantic segmentation result, all or part of the areas in the first target area other than the area corresponding to the target object comprises:
    performing semantic segmentation on the depth map based on a pre-trained deep learning model;
    deleting, according to the semantic segmentation result, all or part of the areas in the first target area other than the area corresponding to the target object.
  80. The apparatus according to claim 79, wherein the training process of the deep learning model comprises:
    acquiring semantic labels corresponding to pixels in a training image output by the deep learning model, calculating a first loss function based on the semantic labels, and training the deep learning model based on the first loss function.
  81. The apparatus according to claim 80, wherein before the deep learning model is trained based on the semantic labels, the training process further comprises:
    acquiring a category label corresponding to the training image output by the deep learning model, calculating a second loss function based on the category label, and training the deep learning model based on the second loss function to determine the initial parameters of the deep learning model before it is trained based on the first loss function.
  82. The apparatus according to claim 81, wherein the first loss function and/or the second loss function is a cross-entropy loss function.
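For reference, the cross-entropy loss named in claim 82 can be written out directly. This is a minimal numpy sketch of the standard formula, not the disclosed training code; real training would use a framework's built-in implementation.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy between predicted class probabilities
    (N x C array, rows summing to 1) and integer class labels (N,)."""
    eps = 1e-12                                     # guard against log(0)
    picked = probs[np.arange(len(labels)), labels]  # probability of true class
    return float(-np.mean(np.log(picked + eps)))

# Confident, correct predictions give (near-)zero loss:
loss = cross_entropy(np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0, 1]))
```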
  83. The apparatus according to claim 81, wherein the training image comprises a grayscale image.
  84. The apparatus according to claim 79, wherein deleting, according to the semantic segmentation result, all or part of the areas in the first target area other than the area corresponding to the target object further comprises:
    convolving the semantic segmentation result output by the deep learning model with a preset convolution kernel to shrink the segmented area representing the target object in the semantic segmentation result, and adjusting the first target area based on the convolved semantic segmentation result.
  85. The apparatus according to claim 84, wherein convolving the semantic segmentation result output by the deep learning model comprises:
    at each convolution step, taking the minimum pixel value in the area covered by the convolution kernel as the pixel value at the center point of that area.
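The kernel operation of claim 85 — writing the minimum value under the kernel to the center pixel — is a minimum filter (morphological erosion): on a binary segmentation mask it peels boundary pixels off the segmented region so that only confidently interior pixels remain. A plain numpy sketch for a 3x3 kernel (borders are left unchanged for brevity):

```python
import numpy as np

def min_filter3(mask):
    """Replace each interior pixel with the minimum of its 3x3
    neighborhood. On a binary mask this erodes the segmented
    region by one pixel on every side."""
    out = mask.copy()
    h, w = mask.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y, x] = mask[y - 1:y + 2, x - 1:x + 2].min()
    return out
```

A 3x3 block of ones in a 5x5 mask shrinks to its single center pixel after one pass.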
  86. The apparatus according to claim 74, wherein deleting all or part of the areas in the first target area other than the area corresponding to the target object comprises:
    performing a difference operation on the depth map and depth maps of its adjacent frames, determining the contour of the target object on the depth map based on the difference result, and deleting areas other than the area enclosed by the contour.
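The frame-difference alternative of claim 86 can be sketched as thresholding the absolute difference between consecutive depth maps: pixels whose depth changed are candidate target pixels, and everything outside that changed region is discarded. The threshold value and function names below are illustrative assumptions, not from the disclosure.

```python
import numpy as np

def moving_target_mask(depth_t, depth_prev, thresh=0.5):
    """Binary mask of pixels whose depth changed by more than `thresh`
    between consecutive frames (candidate moving-target pixels)."""
    return (np.abs(depth_t - depth_prev) > thresh).astype(np.uint8)

def keep_target_only(depth_t, mask, fill=0.0):
    """Delete (zero out) every area outside the detected region."""
    return np.where(mask.astype(bool), depth_t, fill)
```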
  87. The apparatus according to claim 74, wherein controlling the movable platform to move relative to the target object according to the depth information of the tracking area comprises:
    controlling the movable platform to follow the movement of the target object according to the depth information of the tracking area.
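One way the following behavior of claims 73 and 87 could be realized is a proportional controller: estimate the target distance as the median depth over the tracking area and command a forward speed proportional to the error from a desired following distance. The gain and distance values are made-up illustration values, not taken from the disclosure.

```python
import numpy as np

def follow_command(tracking_depths, desired=5.0, gain=0.8):
    """Forward-velocity command from the tracking area's depth values.
    Positive output moves the platform toward the target; the median
    makes the distance estimate robust to stray background pixels."""
    distance = float(np.median(tracking_depths))
    return gain * (distance - desired)
```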
  88. A machine-readable storage medium, characterized in that a computer program is stored thereon, and when the computer program is executed, the method according to any one of claims 1 to 29 is implemented.
PCT/CN2021/072581 2021-01-18 2021-01-18 Movable platform and method and apparatus for controlling same, and machine-readable storage medium WO2022151507A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/072581 WO2022151507A1 (en) 2021-01-18 2021-01-18 Movable platform and method and apparatus for controlling same, and machine-readable storage medium

Publications (1)

Publication Number Publication Date
WO2022151507A1 true WO2022151507A1 (en) 2022-07-21

Family

ID=82446822

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/072581 WO2022151507A1 (en) 2021-01-18 2021-01-18 Movable platform and method and apparatus for controlling same, and machine-readable storage medium

Country Status (1)

Country Link
WO (1) WO2022151507A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971380A (en) * 2014-05-05 2014-08-06 中国民航大学 Pedestrian trailing detection method based on RGB-D
CN104751491A (en) * 2015-04-10 2015-07-01 中国科学院宁波材料技术与工程研究所 Method and device for tracking crowds and counting pedestrian flow
US20160110610A1 (en) * 2014-10-15 2016-04-21 Sony Computer Entertainment Inc. Image processor, image processing method, and computer program
CN110400338A (en) * 2019-07-11 2019-11-01 Oppo广东移动通信有限公司 Depth map processing method, device and electronic equipment
CN111582155A (en) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 Living body detection method, living body detection device, computer equipment and storage medium
CN112223278A (en) * 2020-09-09 2021-01-15 山东省科学院自动化研究所 Detection robot following method and system based on depth visual information

Similar Documents

Publication Publication Date Title
CN110163904B (en) Object labeling method, movement control method, device, equipment and storage medium
US11645765B2 (en) Real-time visual object tracking for unmanned aerial vehicles (UAVs)
US11710243B2 (en) Method for predicting direction of movement of target object, vehicle control method, and device
US9990736B2 (en) Robust anytime tracking combining 3D shape, color, and motion with annealed dynamic histograms
Huang et al. Robust inter-vehicle distance estimation method based on monocular vision
US11669972B2 (en) Geometry-aware instance segmentation in stereo image capture processes
US20210103299A1 (en) Obstacle avoidance method and device and movable platform
US20210237774A1 (en) Self-supervised 3d keypoint learning for monocular visual odometry
JP7135665B2 (en) VEHICLE CONTROL SYSTEM, VEHICLE CONTROL METHOD AND COMPUTER PROGRAM
CN110969064B (en) Image detection method and device based on monocular vision and storage equipment
CN111738033B (en) Vehicle driving information determination method and device based on plane segmentation and vehicle-mounted terminal
CN112654998B (en) Lane line detection method and device
WO2020010620A1 (en) Wave identification method and apparatus, computer-readable storage medium, and unmanned aerial vehicle
Gupta et al. 3D Bounding Boxes for Road Vehicles: A One-Stage, Localization Prioritized Approach using Single Monocular Images.
WO2024001617A1 (en) Method and apparatus for identifying behavior of playing with mobile phone
WO2022151507A1 (en) Movable platform and method and apparatus for controlling same, and machine-readable storage medium
CN112733678A (en) Ranging method, ranging device, computer equipment and storage medium
CN112016394A (en) Obstacle information acquisition method, obstacle avoidance method, mobile device, and computer-readable storage medium
Pinard et al. End-to-end depth from motion with stabilized monocular videos
CN111126170A (en) Video dynamic object detection method based on target detection and tracking
CN116259043A (en) Automatic driving 3D target detection method and related device
Onkarappa et al. On-board monocular vision system pose estimation through a dense optical flow
US20230419522A1 (en) Method for obtaining depth images, electronic device, and storage medium
CN112654997B (en) Lane line detection method and device
US20240029283A1 (en) Image depth prediction method, electronic device, and non-transitory storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21918707

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21918707

Country of ref document: EP

Kind code of ref document: A1