CN112241662B - Method and device for detecting drivable area - Google Patents


Info

Publication number: CN112241662B
Application number: CN201910648165.2A
Authority: CN (China)
Prior art keywords: video image, pixel point, physical point, determining
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112241662A
Inventor: 赵俊
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910648165.2A

Classifications

    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06T 7/55: Depth or shape recovery from multiple images
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30252: Vehicle exterior; Vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method and a device for detecting a drivable area, and belongs to the field of image processing. The method comprises the following steps: acquiring a video image captured by a monocular camera installed on a device; inputting the video image into a depth detection model, the depth detection model being used for detecting depth information of each pixel point in the video image; acquiring the depth information of each pixel point in the video image output by the depth detection model; determining position information of the physical point corresponding to each pixel point according to the depth information of each pixel point and the calibrated installation parameters of the monocular camera; and determining the drivable area according to the position information of each physical point. The method and the device can improve the accuracy of detecting the drivable area.

Description

Method and device for detecting drivable area
Technical Field
The present disclosure relates to the field of image processing, and in particular, to a method and apparatus for detecting a drivable area.
Background
In recent years, as autonomous driving technology has entered the public view, research on sensing three-dimensional information with autonomous devices, such as self-driving vehicles or robots, has become a hot topic. An autonomous device needs to detect the drivable area of the road so that it can drive automatically based on the detected drivable area.
One existing method of detecting a drivable area works as follows: a monocular camera is installed on an unmanned device to capture road surface images; the road surface images are input into a pre-trained deep learning model, which detects the drivable-region image and the non-drivable-region image contained in each road surface image as well as the depth information of each pixel point in the image; the drivable area is then determined according to the depth information of the pixel points in the drivable-region image.
The inventors have found that in the process of implementing the present application, at least the following drawbacks exist in the above manner:
Because the deep learning model must detect both the drivable-region image and the non-drivable-region image in the road surface image, the drivable and non-drivable regions have to be manually annotated in the training samples when the model is trained. This annotation work is very time-consuming and its accuracy is hard to guarantee, so the trained deep learning model cannot reliably separate the drivable-region image from the non-drivable-region image in the road surface image, and the accuracy of the detected drivable area may be poor.
Disclosure of Invention
The embodiment of the application provides a method and a device for detecting a drivable area, so as to improve the accuracy of detecting the drivable area. The technical scheme is as follows:
in one aspect, the present application provides a method of detecting a drivable region, the method comprising:
acquiring a video image acquired by a monocular camera installed on equipment;
inputting the video image into a depth detection model, wherein the depth detection model is used for detecting the depth information of each pixel point in the video image and acquiring the depth information of each pixel point in the video image output by the depth detection model;
determining the position information of the physical point corresponding to each pixel point according to the depth information of each pixel point and the calibrated installation parameters of the monocular camera;
and determining a travelable area according to the position information of each physical point.
As an example, the determining the drivable area according to the location information of each physical point includes:
constructing a grid map according to position information of a landing place of the installation position of the monocular camera on the equipment on a road surface, wherein the grid map comprises a plurality of grids;
Determining the number of physical points of each grid falling into the grid map according to the position information of each physical point;
and acquiring grids with the number of the physical points falling below a preset number threshold value, and determining a travelable area according to the acquired grids.
As one example, the location information of a physical point includes an abscissa, an ordinate, and an altitude of the physical point; the determining the drivable area according to the position information of each physical point comprises the following steps:
acquiring physical points with heights in a height range from each physical point, wherein the height range comprises the road surface height, and the interval length of the height range is a preset length threshold;
a travelable region is determined from physical points located in the altitude range.
As an example, the determining the drivable area from the physical points located within the altitude range includes:
clustering each physical point in the height range according to the position information of each physical point in the height range to obtain at least one physical point set, wherein the distance between any two adjacent physical points in the same physical point set does not exceed a preset distance threshold;
a minimum area including each physical point in the set of physical points is determined as a travelable area.
As an example, before the video image is input into the depth detection model, the method further comprises:
Acquiring M frames of video images acquired by the monocular camera when the equipment moves, wherein M is an integer greater than or equal to 2;
and training a first deep learning network according to the M frames of video images to obtain the depth detection model.
As an example, the training a first deep learning network according to the M-frame video image to obtain the depth detection model includes:
inputting a first video image into a first deep learning network, wherein the first deep learning network is used for determining depth information of each pixel point in the first video image, acquiring the depth information of each pixel point in the first video image output by the first deep learning network, and the first video image is any frame of video image in the M frames of video images;
acquiring a pose relationship between the first video image and a second video image, wherein the second video image is one frame of video image among the M frames of video images except the first video image;
generating a synthetic image according to the depth information of each pixel point in the first video image, the calibrated installation parameters of the monocular camera and the pose relation;
And adjusting network parameters of the first deep learning network according to the synthesized image and the second video image to obtain the depth detection model.
As an example, the acquiring the pose relationship between the first video image and the second video image includes:
acquiring a plurality of pixel point pairs, wherein the pixel point pairs comprise one pixel point in the first video image and one pixel point in the second video image, and the physical points corresponding to each pixel point included in the pixel point pairs are the same;
and determining the pose relation between the first video image and the second video image according to the pixel point pairs.
As an example, the acquiring the pose relationship between the first video image and the second video image includes:
inputting the first video image and the second video image into a second deep learning network, wherein the second deep learning network is used for determining the pose relation between the first video image and the second video image and obtaining the pose relation output by the second deep learning network;
after generating the composite image according to the depth information of each pixel point in the first video image, the calibrated installation parameters of the monocular camera and the pose relationship, the method further comprises:
And adjusting network parameters of the second deep learning network according to the composite image and the second video image.
As an example, the generating a composite image according to the depth information of each pixel point in the first video image, the calibrated installation parameters of the monocular camera, and the pose relationship includes:
determining the position information of a physical point corresponding to each pixel point in the first video image according to the depth information of each pixel point in the first video image and the calibration installation parameters of the monocular camera;
and acquiring each pixel point in the composite image according to each pixel point in the first video image, the position information of the physical point corresponding to each pixel point in the first video image and the pose relation.
In another aspect, the present application provides an apparatus for detecting a drivable region, the apparatus comprising:
the first acquisition module is used for acquiring video images acquired by a monocular camera arranged on the equipment;
the second acquisition module is used for inputting the video image into a depth detection model, wherein the depth detection model is used for detecting the depth information of each pixel point in the video image and acquiring the depth information of each pixel point in the video image output by the depth detection model;
The first determining module is used for determining the position information of the physical point corresponding to each pixel point according to the depth information of each pixel point and the calibrated installation parameters of the monocular camera;
and the second determining module is used for determining the drivable area according to the position information of each physical point.
As an example, the second determining module includes:
a construction unit for constructing a grid map according to position information of a landing place of an installation position of the monocular camera on the device on a road surface, the grid map including a plurality of grids;
a first determining unit configured to determine the number of physical points of each grid falling into the grid map according to the position information of each physical point;
the first acquisition unit is used for acquiring grids with the number of the falling physical points smaller than a preset number threshold value and determining a travelable area according to the acquired grids.
As one example, the location information of a physical point includes an abscissa, an ordinate, and an altitude of the physical point; the second determining module includes:
a second obtaining unit, configured to obtain, from each physical point, a physical point whose height is within a height range, where the height range includes a road surface height, and an interval length of the height range is a preset length threshold;
And the second determining unit is used for determining the drivable area according to the physical points in the height range.
As an example, the second determining unit is configured to:
clustering each physical point in the height range according to the position information of each physical point in the height range to obtain at least one physical point set, wherein the distance between any two adjacent physical points in the same physical point set does not exceed a preset distance threshold;
a minimum area including each physical point in the set of physical points is determined as a travelable area.
As an example, the apparatus further comprises
The third acquisition module is used for acquiring M frames of video images acquired by the monocular camera when the equipment moves, wherein M is an integer greater than or equal to 2;
and the training module is used for training a first deep learning network according to the M frames of video images to obtain the depth detection model.
As an example, the training module includes:
a third obtaining unit, configured to input a first video image to a first deep learning network, where the first deep learning network is configured to determine depth information of each pixel in the first video image, obtain depth information of each pixel in the first video image output by the first deep learning network, and the first video image is any one of the M frames of video images;
A fourth obtaining unit, configured to obtain a pose relationship between the first video image and a second video image, where the second video image is one frame of video image in the M frames of video images except for the first video image;
the generating unit is used for generating a synthetic image according to the depth information of each pixel point in the first video image, the calibrated installation parameters of the monocular camera and the pose relation;
and the first adjusting unit is used for adjusting network parameters of the first deep learning network according to the composite image and the second video image so as to obtain the depth detection model.
As an example, the fourth obtaining unit is configured to:
acquiring a plurality of pixel point pairs, wherein the pixel point pairs comprise one pixel point in the first video image and one pixel point in the second video image, and the physical points corresponding to each pixel point included in the pixel point pairs are the same;
and determining the pose relation between the first video image and the second video image according to the pixel point pairs.
As an example, the fourth obtaining unit is configured to:
inputting the first video image and the second video image into a second deep learning network, wherein the second deep learning network is used for determining the pose relation between the first video image and the second video image and obtaining the pose relation output by the second deep learning network;
The training module further comprises:
and the second adjusting unit is used for adjusting network parameters of the second deep learning network according to the composite image and the second video image.
As an example, the generating unit is configured to:
determining the position information of a physical point corresponding to each pixel point in the first video image according to the depth information of each pixel point in the first video image and the calibration installation parameters of the monocular camera;
and acquiring each pixel point in the composite image according to each pixel point in the first video image, the position information of the physical point corresponding to each pixel point in the first video image and the pose relation.
In another aspect, the present application provides an electronic device, including:
a processor;
a memory for storing executable instructions of the processor;
the processor is configured to execute the executable instructions to implement the above-described method for detecting a travelable region.
In another aspect, the present application provides a computer readable storage medium storing a computer program loaded and executed by a processor to implement instructions of the above-described method of detecting a travelable region.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
A video image is acquired by a monocular camera installed on the device; the video image is input into a depth detection model, which detects the depth information of each pixel point in the video image, and the depth information of each pixel point output by the depth detection model is acquired; the position information of the physical point corresponding to each pixel point is determined according to the depth information of each pixel point and the calibrated installation parameters of the monocular camera; and the drivable area is determined according to the position information of each physical point. Because the depth detection model only detects the depth information of each pixel point in the video image, it does not need to detect drivable-region and non-drivable-region images, and drivable and non-drivable regions do not need to be annotated in the training samples. The accuracy of detecting the drivable area is therefore not limited by annotated samples, which improves the accuracy of detecting the drivable area.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram of a monocular camera mounted on a device according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a detection device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an image processing module according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for training a depth detection model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a relationship between a first video image and a second video image provided in an embodiment of the present application;
FIG. 6 is a flow chart of a method for detecting a travelable region according to an embodiment of the present application;
FIG. 7 is a schematic diagram of projecting physical points in a grid map provided by an embodiment of the present application;
fig. 8 is a schematic structural diagram of an apparatus for detecting a travelable region according to an embodiment of the present application;
fig. 9 is a schematic diagram of a terminal structure according to an embodiment of the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
In recent years, autonomous driving technology has come into the public view. Referring to fig. 1, an autonomous driving system can mount a monocular camera 2 on a device 1 and detect a drivable area from the images captured by the monocular camera 2, and the device 1 then drives automatically based on the detected drivable area.
The monocular camera 2 is mounted on the apparatus 1. The device 1 may be an automobile or a mobile robot, and when the device 1 is an automobile, the monocular camera 2 may be mounted at a front bumper position of the automobile, a roof of the automobile, or an inside of a cab of the automobile, or the like. The monocular camera 2 shoots toward the front of the car.
When the monocular camera 2 is installed in the automobile cab, it can be mounted between the front windshield and the rearview mirror, and may be attached to the front windshield. In this way the monocular camera 2 is hidden behind the rearview mirror and does not affect the appearance of the cab.
The drivable area mainly refers to a safe road-surface region that the device can reach within a short time without risk of collision or other danger, including obstacle-free areas in the same-direction lane, intersections, and ramp surfaces in a potentially drivable direction.
Referring to fig. 2, the apparatus 1 may include a detection device including a monocular camera 2, an image processing module 3, and a control module 4.
The monocular camera 2 captures a video image, which is input to the image processing module 3.
The image processing module 3 comprises a depth detection model, and according to the video image, depth information of each pixel point in the video image is acquired based on the depth detection model; and detecting a travelable region according to the depth information of each pixel point.
The image processing module 3 runs on an embedded platform processor, and obtains the depth information corresponding to each pixel point in the video image by processing the video image.
The control module 4 controls the apparatus 1 to travel according to the drivable area.
Referring to fig. 3, the image processing module 3 may include a monocular depth estimation unit 31, a three-dimensional point cloud processing unit 32, and a travelable region detection unit 33.
The monocular depth estimation unit 31 includes a depth detection model to which the video image captured by the monocular camera 2 is input, and acquires depth information of each pixel point in the video image output by the depth detection model.
The three-dimensional point cloud processing unit 32 determines positional information of physical points around the apparatus 1 from the depth information of each pixel point in the video image.
The travelable region detection unit 33 detects a travelable region on the basis of the positional information of each physical point.
The depth detection model is obtained through training, and referring to fig. 4, an embodiment of the present application provides a method for training a depth detection model, including:
step 201: and acquiring M frames of video images acquired by the monocular camera when the equipment moves, wherein M is an integer greater than or equal to 2.
In this step, the device is driven normally on the road, video images are captured frame by frame by the monocular camera mounted on the device, M consecutively captured video images are formed into a training sample set, and the training sample set is then used to train the first deep learning network.
The first deep learning network may be a convolutional neural network or the like. The first deep learning network is configured to detect depth information of each pixel in the video image, where the depth information of the pixel may include a depth value of the pixel.
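By way of illustration only, the first deep learning network could be a small encoder-decoder convolutional network that maps an RGB frame to a per-pixel depth map. The sketch below assumes a PyTorch-style implementation; the layer sizes and the max_depth scaling are illustrative assumptions rather than the network actually used in this application.

```python
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Minimal encoder-decoder that predicts one depth value per pixel (illustrative)."""
    def __init__(self, max_depth=80.0):
        super().__init__()
        self.max_depth = max_depth  # assumed upper bound of the metric depth range
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),  # normalized depth in (0, 1)
        )

    def forward(self, image):             # image: (N, 3, H, W)
        depth = self.decoder(self.encoder(image))
        return depth * self.max_depth     # per-pixel depth map, (N, 1, H, W)
```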
Step 202: and inputting the first video image into a first deep learning network, and acquiring depth information of each pixel point in the first video image output by the first deep learning network, wherein the first video image is any one of the M frames of video images.
The first video image is input to the first deep learning network, which detects depth information for each pixel in the first video image. At this stage the depth information detected by the first deep learning network may have a large error, and the network parameters of the first deep learning network need to be adjusted through the following steps to gradually improve the accuracy of the detected depth information.
The first deep learning network outputs the depth information of each pixel point in the first video image after detecting the depth information of each pixel point in the first video image. Correspondingly, the depth information of each pixel point in the first video image output by the first deep learning network is obtained.
Step 203: and acquiring a pose relationship between the first video image and a second video image, wherein the second video image is one frame of video image except the first video image in the M frames of video images.
The pose relationship between the first video image and the second video image includes the attitude parameters and the optical-center position of the camera when it captured the first video image, and the attitude parameters and the optical-center position of the camera when it captured the second video image. The attitude parameters of the camera include its pitch angle, yaw angle and roll angle. The position of the camera's optical center may be given as coordinates in a coordinate system whose origin is the point on the road surface directly below the camera's installation position on the device.
This step can be implemented in either of the following two ways:
in a first mode, a plurality of pixel point pairs are acquired, wherein for any pixel point pair, the pixel point pair comprises one pixel point in a first video image and one pixel point in a second video image, and physical points corresponding to each pixel point included in the pixel point pair are the same; and determining the pose relation between the first video image and the second video image according to the plurality of pixel point pairs.
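By way of illustration of the first mode, the relative pose can be recovered from matched pixel point pairs through essential-matrix estimation and decomposition. The sketch below uses OpenCV and assumes the matched pairs and the camera intrinsic matrix K are already available; the patent does not prescribe this particular algorithm, and a monocular translation is recovered only up to scale.

```python
import cv2
import numpy as np

def pose_from_pixel_pairs(pts1, pts2, K):
    """Estimate the rotation R and unit-scale translation t between the camera poses
    of the first and second video images from matched pixel point pairs.

    pts1, pts2: (N, 2) arrays; pts1[i] and pts2[i] image the same physical point.
    K:          (3, 3) camera intrinsic matrix.
    """
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t
```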
In a second mode, the first video image and the second video image are input to a second deep learning network, the second deep learning network is used for determining the pose relation between the first video image and the second video image, and the pose relation between the first video image and the second video image output by the second deep learning network is obtained.
The second deep learning network may be a convolutional neural network or the like, and is used for determining the pose relationship between two video images. The first video image and the second video image are input to the second deep learning network, which determines the pose relationship between them. At this stage the pose relationship determined by the second deep learning network may have a large error, and the network parameters of the second deep learning network need to be adjusted through the following steps to gradually improve the accuracy of the determined pose relationship.
The second deep learning network outputs the pose relationship between the first video image and the second video image after determining the pose relationship between the first video image and the second video image. Correspondingly, the pose relation between the first video image and the second video image output by the second deep learning network is obtained.
Step 204: and generating a composite image according to the depth information of each pixel point in the first video image and the pose relation.
The calibrated installation parameters of the monocular camera include at least one of the position information and the shooting angle of the monocular camera. In this step, a coordinate system is first established; its origin may be the point on the road surface directly below the monocular camera's installation position on the device. The position information of the monocular camera is then its position in this coordinate system, including an abscissa, an ordinate and a height.
Each pixel point in the first video image corresponds to one physical point in the actual physical space. For each pixel point in the first video image, the depth information of the pixel is the distance between its corresponding physical point and the lens of the monocular camera, or a value on a 0-255 depth scale that represents this distance.
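If the depth information is stored on the 0-255 scale mentioned above, it must be mapped back to a metric distance before it can be used geometrically. A minimal sketch of one possible linear mapping follows; the maximum range max_depth is an assumed calibration constant, not a value given in this application.

```python
import numpy as np

def depth_code_to_distance(depth_code, max_depth=80.0):
    """Map an 8-bit depth code (0-255) to a distance in metres, assuming a linear scale."""
    return np.asarray(depth_code, dtype=np.float32) / 255.0 * max_depth
```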
This step can be achieved by operations of 2041 to 2043, which are respectively:
2041: and calculating the distance between the physical point corresponding to each pixel point in the first video image and the monocular camera according to the depth information of each pixel point in the first video image.
The distance between each physical point and the monocular camera is the distance between the monocular camera and each physical point when the first video image is taken.
2042: and calculating the position of each physical point at the corresponding pixel point of the second video image according to the distance between each physical point and the monocular camera and the pose relation between the first video image and the second video image.
Referring to fig. 5, the first video image and the second video image are two images of the same actual physical space captured by the monocular camera within a short time interval. A physical point P in the actual physical space therefore corresponds to one pixel point P1 in the first video image and one pixel point P2 in the second video image.
Typically, the pixel value of P1 in the first video image is equal to the pixel value of P2 in the second video image, since both image the same physical point.
2043: A blank image with the same size as the first video image is created. For each physical point, its pixel value is stored in the blank image at the position of the pixel point corresponding to that physical point in the second video image, yielding the composite image.
The resulting composite image may have a difference from the second video image due to errors in the depth information of each pixel in the first video image detected by the first deep learning network. If the pose relationship between the first video image and the second video image is obtained by adopting the second mode, the difference can also be caused by errors in the pose relationship between the first video image and the second video image determined by the second deep learning network.
Step 205: and acquiring the difference information between the synthesized image and the second video image, executing step 206 when the difference information exceeds a preset difference threshold, and determining a depth detection model by the first deep learning network when the difference information does not exceed the preset difference threshold, and ending.
Pixel values of two pixel points located at the same position can be obtained from the composite image and the second video image, and a pixel difference value between the two pixel points can be calculated. The pixel difference values of the two other pixel points at the same position can be obtained in the mode. The difference information between the composite image and the second video image includes an average value of the acquired pixel differences.
After the pose information is obtained in this step, the physical point P can be projected into the second video image. All the pixel points in the first video image can be projected to the second video image plane to form a composite image. The composite image differs from the actually photographed reference frame, and a loss function in the deep learning training process is constructed using the difference. When the difference is smaller, the position of the P point is more accurate, so that the depth information of the pixel point is more accurate. In this way, the depth detection model can be trained independent of the additional depth sensor.
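A sketch of the difference information used as the training signal is given below: the mean absolute pixel difference between the composite image and the second video image, computed here only over pixels that actually received a projected value. The exact form of the loss is an assumption; the application only requires some measure of the difference.

```python
import numpy as np

def photometric_difference(synth, img2):
    """Mean absolute pixel difference between the composite image and the second video image."""
    filled = synth.sum(axis=-1) > 0   # pixels of the composite image that received a value
    diff = np.abs(synth.astype(np.float32) - img2.astype(np.float32))
    return float(diff[filled].mean()) if filled.any() else 0.0
```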
Step 206: network parameters of the first deep learning network are adjusted, and step 202 is executed back.
After the network parameters of the first deep learning network are adjusted, the operations of steps 202 to 206 are executed again until a deep detection model is obtained.
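The iteration over steps 202 to 206 can be summarized by the following skeleton. Here depth_net and pose_net stand for the first and second deep learning networks, warp_fn is a differentiable view-synthesis function such as the one sketched above, and the optimizer choice, learning rate and stopping threshold are illustrative assumptions.

```python
import torch

def train_depth_model(frames, depth_net, pose_net, warp_fn, K,
                      num_epochs=20, diff_threshold=0.01):
    """Skeleton of steps 202-206: adjust the first deep learning network until the
    composite image is close enough to the second video image.

    frames: list of M consecutive video frames as (1, 3, H, W) tensors.
    """
    params = list(depth_net.parameters()) + list(pose_net.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)

    for _ in range(num_epochs):
        for img1, img2 in zip(frames[:-1], frames[1:]):
            depth1 = depth_net(img1)                 # step 202: per-pixel depth
            pose = pose_net(img1, img2)              # step 203 (second mode)
            synth = warp_fn(img1, depth1, pose, K)   # step 204: composite image
            loss = (synth - img2).abs().mean()       # step 205: difference information
            if loss.item() <= diff_threshold:        # small enough: take depth_net as the model
                return depth_net
            optimizer.zero_grad()                    # step 206: adjust network parameters
            loss.backward()
            optimizer.step()
    return depth_net
```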
In the embodiment of the application, this monocular depth-estimation approach does not depend on any other depth sensor: the first deep learning network can be trained solely from M frames of video captured while the device is moving. This reduces the complexity of training the depth detection model and avoids errors, such as calibration errors, that arise when a depth sensor is used to construct the supervision signal.
After training the depth detection model, the depth detection model may be used to detect the travelable region. Referring to fig. 6, an embodiment of the present application provides a method for detecting a drivable area, including:
Step 401: and acquiring video images acquired by a monocular camera installed on the equipment.
The device can normally run on the road surface, and a monocular camera arranged on the device is used for collecting video images frame by frame.
Each time the monocular camera captures a frame of video, that frame may be acquired. Alternatively, frames may be acquired with at least one frame skipped between the currently acquired video image and the previously acquired one.
Step 402: and inputting the video image into a depth detection model, wherein the depth detection model is used for detecting the depth information of each pixel point in the video image, and acquiring the depth information of each pixel point in the video image output by the depth detection model.
Optionally, a depth sensor may be further installed on the device, where the depth sensor may collect depth information of each physical point in the space where the device is located.
The depth information of each physical point acquired by the depth sensor and the depth information of each pixel point in the video image can be fused. The implementation process can be as follows:
For the depth information of any physical point collected by the depth sensor, the pixel point corresponding to that physical point in the video image can be determined according to a preset conversion matrix; the average of the depth information of that pixel point and the depth information of the physical point is calculated, and the depth information of the pixel point is replaced with the average. This improves the accuracy of the depth information of the pixel point.
The preset conversion matrix is determined in advance from the installation position and acquisition direction of the depth sensor and the installation position and shooting direction of the monocular camera, and represents the transformation between the coordinate system of the monocular camera and the coordinate system of the depth sensor. The coordinate system of the monocular camera is established at the point on the road surface directly below the monocular camera's installation position on the device, and the coordinate system of the depth sensor is established at the point on the road surface directly below the depth sensor's installation position on the device.
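A sketch of this fusion is shown below. It assumes the preset conversion matrix is available as a 4x4 homogeneous transform T_cam_from_sensor from the depth sensor's coordinate system into the camera's optical frame, and that K is the camera intrinsic matrix; these names follow the description above but are otherwise illustrative.

```python
import numpy as np

def fuse_sensor_depth(depth_map, sensor_points, T_cam_from_sensor, K):
    """Replace the model depth of a pixel by the average of model depth and sensor depth.

    depth_map:     (H, W) depth predicted by the depth detection model (modified in place)
    sensor_points: (N, 3) physical points measured in the depth sensor's coordinate system
    """
    H, W = depth_map.shape
    pts_h = np.hstack([sensor_points, np.ones((len(sensor_points), 1))])   # homogeneous
    pts_cam = (T_cam_from_sensor @ pts_h.T)[:3]                            # camera frame
    proj = K @ pts_cam
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    ok = (proj[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth_map[v[ok], u[ok]] = 0.5 * (depth_map[v[ok], u[ok]] + pts_cam[2][ok])
    return depth_map
```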
In this step, the depth information of the video image captured by the monocular camera is obtained through the depth detection model, without an additional binocular camera or depth sensor; the depth information is then processed to obtain the drivable and non-drivable areas. This reduces system complexity and makes the method suitable for a variety of simple monocular lens systems.
Step 403: and determining the position information of the physical point corresponding to each pixel point according to the depth information of each pixel point and the calibrated installation parameters of the monocular camera.
The positional information of the physical point corresponding to each pixel point is positional information in the coordinate system of the monocular camera.
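Step 403 can be sketched as follows, assuming a pinhole intrinsic matrix K and a 4x4 transform T_ground_from_cam built from the calibrated installation parameters (camera height and shooting angles), which maps points from the camera's optical frame into the coordinate system whose origin is the point on the road surface below the camera. These names and the exact parametrization are assumptions for illustration.

```python
import numpy as np

def pixels_to_physical_points(depth_map, K, T_ground_from_cam):
    """Convert the per-pixel depth of a video image into 3D physical points expressed
    in the monocular camera's ground coordinate system (step 403)."""
    H, W = depth_map.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T      # 3 x HW
    pts_cam = np.linalg.inv(K) @ pix * depth_map.reshape(1, -1)            # optical frame
    pts_h = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])           # homogeneous
    pts_ground = (T_ground_from_cam @ pts_h)[:3].T                         # (HW, 3): x, y, height
    return pts_ground
```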
Step 404: and determining a travelable area according to the position information of each physical point.
This step can be implemented in either of the following two ways:
In a first mode, a grid map is constructed according to the position of the point on the road surface directly below the monocular camera's installation position on the device; the grid map includes a plurality of grids. The number of physical points falling into each grid of the grid map is determined according to the position information of each physical point. Grids into which fewer physical points fall than a preset number threshold are obtained, and the drivable area is determined according to the obtained grids.
Referring to fig. 7, the grid map is a plane coincident with the road surface, constructed from the position of the point on the road surface directly below the monocular camera's installation position on the device. If there is an obstacle over a certain grid, there are more physical points above that grid, so the number of physical points falling into it is greater than for a grid without an obstacle. A grid into which fewer physical points fall than the preset number threshold is likely free of obstacles, while a grid whose count is greater than or equal to the threshold may contain an obstacle. Grids whose physical-point count is smaller than the preset number threshold are therefore determined as the drivable area.
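A sketch of the first mode follows, assuming the physical points are given as (x, y, height) in the ground coordinate system; the grid resolution, map extent and count threshold are illustrative parameters, not values specified in this application.

```python
import numpy as np

def drivable_grid_mask(points, cell=0.2, extent=20.0, count_threshold=5):
    """Mark a grid of the road plane as drivable where fewer physical points than the
    threshold fall into a cell (first mode).

    points: (N, 3) physical points (x, y, height) in the ground coordinate system.
    Returns a boolean grid; True means the cell is considered drivable.
    """
    n = int(2 * extent / cell)
    counts = np.zeros((n, n), dtype=int)
    ix = ((points[:, 0] + extent) / cell).astype(int)
    iy = ((points[:, 1] + extent) / cell).astype(int)
    inside = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)
    np.add.at(counts, (iy[inside], ix[inside]), 1)   # number of physical points per grid cell
    return counts < count_threshold
```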
In a second mode, the physical points whose height lies within a height range are obtained from all the physical points; the height range includes the road surface height, and its interval length is a preset length threshold. The drivable area is then determined from the physical points within that height range.
The mean of this height range may be equal to the road surface height. The road surface height may be a preset value, typically 0. The height of each point on the road surface fluctuates around 0 and generally stays within the height range, so a physical point within the height range can be taken as a point on the road surface, and the minimum area containing the physical points within the height range can be determined as the drivable area.
As an example, clustering each physical point located in the height range according to the position information of each physical point located in the height range to obtain at least one physical point set, wherein the distance between any two adjacent physical points in the same physical point set does not exceed a preset distance threshold; a minimum area including each physical point in the set of physical points is determined as a travelable area.
For example, when a vehicle drives on an overpass or viaduct and two forked roads appear ahead of it, the physical points on one road form one physical-point set and the physical points on the other road form another set; two drivable areas, namely the two roads, can then be determined based on the two physical-point sets.
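A sketch of the second mode follows: keep the physical points whose height lies within a band around the road surface height, then group them with a simple distance-threshold clustering. Any clustering method could be substituted; the band width and distance threshold below are illustrative.

```python
import numpy as np

def road_point_clusters(points, band=0.3, dist_threshold=0.5):
    """Second mode: select physical points near the road surface height and cluster them.

    points: (N, 3) physical points (x, y, height); the road surface height is assumed to be 0.
    Returns a list of clusters, each an (M, 3) array; a drivable area is the minimum
    region covering one cluster.
    """
    road = points[np.abs(points[:, 2]) <= band / 2.0]      # points within the height range
    unvisited = set(range(len(road)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, members = [seed], [seed]
        while queue:
            i = queue.pop()
            near = [j for j in unvisited
                    if np.linalg.norm(road[i, :2] - road[j, :2]) <= dist_threshold]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        clusters.append(road[members])
    return clusters
```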
Alternatively, the drivable areas determined in the two modes may be fused, i.e. the grid map and the height information are processed jointly so that the drivable area is judged more accurately. To make the detected drivable area more accurate, the drivable areas determined in the two ways may be intersected; alternatively, to obtain a larger drivable area, the areas determined in the two modes may be united.
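If both modes are rasterized onto the same grid, this fusion reduces to combining two boolean masks, as in the brief sketch below (an assumption about the representation; the application does not fix it).

```python
import numpy as np

def fuse_drivable_masks(mask_grid, mask_height, mode="intersect"):
    """Fuse the drivable areas obtained from the grid-map mode and the height-range mode."""
    if mode == "intersect":
        return np.logical_and(mask_grid, mask_height)   # higher accuracy
    return np.logical_or(mask_grid, mask_height)        # larger drivable range
```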
In this scheme, per-pixel depth estimation is performed on the captured video image, the three-dimensional point cloud, i.e. the position information of each physical point, is recovered using the calibration parameters, and three-dimensional obstacle detection (the first and second modes above) is performed on the point cloud, so that the drivable area and the non-drivable area are distinguished.
In this monocular-camera scheme, a depth detection model is trained; the model detects the depth information of each pixel point in the video image, the position information of the physical points is obtained from that depth information, and the drivable area is obtained from the position information of the physical points. The depth detection model does not detect drivable-region and non-drivable-region images in the video image. Therefore, training the depth detection model does not require time-consuming annotation of drivable and non-drivable regions on a large number of samples, annotation errors are avoided, and the method is not limited, as annotation-based detection is, to detecting only the specific features that were labelled.
In this embodiment, when the depth detection model is trained, the first deep learning network is trained with the help of the second deep learning network, or with a plurality of pixel point pairs, where each pair consists of the two pixel points that correspond to the same physical point in the two video images. Samples therefore do not need to be manually annotated when the depth detection model is trained, which improves the accuracy of the trained model. After the depth detection model is trained, a video image is acquired by the monocular camera installed on the device; the video image is input into the depth detection model, which detects the depth information of each pixel point in the video image, and the depth information output by the model is acquired; the position information of the physical point corresponding to each pixel point is determined according to the depth information of each pixel point and the calibrated installation parameters of the monocular camera; and the drivable area is determined according to the position information of each physical point. Because the depth detection model only detects depth information, it does not need to detect drivable-region and non-drivable-region images, and such regions do not need to be annotated in the training samples, so the accuracy of detecting the drivable area is not limited by annotated samples and is therefore improved.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 8, an embodiment of the present application provides an apparatus 700 for detecting a drivable area, the apparatus 700 comprising:
a first acquisition module 701, configured to acquire a video image acquired by a monocular camera installed on the device;
a second obtaining module 702, configured to input the video image into a depth detection model, where the depth detection model is configured to detect depth information of each pixel in the video image, and obtain the depth information of each pixel in the video image output by the depth detection model;
a first determining module 703, configured to determine location information of a physical point corresponding to each pixel according to the depth information of each pixel and the calibrated installation parameter of the monocular camera;
a second determining module 704, configured to determine a drivable area according to the location information of each physical point.
As an example, the second determining module 704 includes:
a construction unit for constructing a grid map according to position information of a landing place of an installation position of the monocular camera on the device on a road surface, the grid map including a plurality of grids;
A first determining unit configured to determine the number of physical points of each grid falling into the grid map according to the position information of each physical point;
the first acquisition unit is used for acquiring grids with the number of the falling physical points smaller than a preset number threshold value and determining a travelable area according to the acquired grids.
As one example, the location information of a physical point includes an abscissa, an ordinate, and an altitude of the physical point; the second determining module 704 includes:
a second obtaining unit, configured to obtain, from each physical point, a physical point whose height is within a height range, where the height range includes a road surface height, and an interval length of the height range is a preset length threshold;
and the second determining unit is used for determining the drivable area according to the physical points in the height range.
As an example, the second determining unit is configured to:
clustering each physical point in the height range according to the position information of each physical point in the height range to obtain at least one physical point set, wherein the distance between any two adjacent physical points in the same physical point set does not exceed a preset distance threshold;
a minimum area including each physical point in the set of physical points is determined as a travelable area.
As an example, the apparatus 700 further comprises
The third acquisition module is used for acquiring M frames of video images acquired by the monocular camera when the equipment moves, wherein M is an integer greater than or equal to 2;
and the training module is used for training a first deep learning network according to the M frames of video images to obtain the depth detection model.
As an example, the training module includes:
a third obtaining unit, configured to input a first video image to a first deep learning network, where the first deep learning network is configured to determine depth information of each pixel in the first video image, obtain depth information of each pixel in the first video image output by the first deep learning network, and the first video image is any one of the M frames of video images;
a fourth obtaining unit, configured to obtain a pose relationship between the first video image and a second video image, where the second video image is one frame of video image in the M frames of video images except for the first video image;
the generating unit is used for generating a synthetic image according to the depth information of each pixel point in the first video image, the calibrated installation parameters of the monocular camera and the pose relation;
And the first adjusting unit is used for adjusting network parameters of the first deep learning network according to the composite image and the second video image so as to obtain the depth detection model.
As an example, the fourth obtaining unit is configured to:
acquiring a plurality of pixel point pairs, wherein the pixel point pairs comprise one pixel point in the first video image and one pixel point in the second video image, and the physical points corresponding to each pixel point included in the pixel point pairs are the same;
and determining the pose relation between the first video image and the second video image according to the pixel point pairs.
As an example, the fourth obtaining unit is configured to:
inputting the first video image and the second video image into a second deep learning network, wherein the second deep learning network is used for determining the pose relation between the first video image and the second video image and obtaining the pose relation output by the second deep learning network;
the training module further comprises:
and the second adjusting unit is used for adjusting network parameters of the second deep learning network according to the composite image and the second video image.
As an example, the generating unit is configured to:
determining the position information of a physical point corresponding to each pixel point in the first video image according to the depth information of each pixel point in the first video image and the calibration installation parameters of the monocular camera;
and acquiring each pixel point in the composite image according to each pixel point in the first video image, the position information of the physical point corresponding to each pixel point in the first video image and the pose relation.
In this embodiment, when the training module trains the depth detection model, it trains the first deep learning network with the help of the second deep learning network, or with a plurality of pixel point pairs, where each pair consists of the two pixel points that correspond to the same physical point in the two video images. Samples therefore do not need to be manually annotated when the depth detection model is trained, which improves the accuracy of the trained model. After the depth detection model is trained, the first acquisition module acquires a video image through the monocular camera installed on the device; the second acquisition module inputs the video image into the depth detection model, which detects the depth information of each pixel point in the video image, and acquires the depth information output by the model; the first determining module determines the position information of the physical point corresponding to each pixel point according to the depth information of each pixel point and the calibrated installation parameters of the monocular camera; and the second determining module determines the drivable area according to the position information of each physical point. Because the depth detection model only detects depth information, it does not need to detect drivable-region and non-drivable-region images, and such regions do not need to be annotated in the training samples, so the accuracy of detecting the drivable area is not limited by annotated samples and is therefore improved.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 9 shows a block diagram of a terminal 800 according to an exemplary embodiment of the present invention. The terminal 800 may be a portable mobile terminal such as: smart phones, tablet computers, vehicle terminals, etc.
In general, the terminal 800 includes: a processor 801 and a memory 802.
Processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array) or a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in an awake state, and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may integrate a GPU (Graphics Processing Unit) for rendering the content that needs to be shown on the display screen. In some embodiments, the processor 801 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 802 is used to store at least one instruction, which is executed by the processor 801 to implement the method of detecting a travelable region provided by the method embodiments herein.
In some embodiments, the terminal 800 may further optionally include: a peripheral interface 803, and at least one peripheral. The processor 801, the memory 802, and the peripheral interface 803 may be connected by a bus or signal line. Individual peripheral devices may be connected to the peripheral device interface 803 by buses, signal lines, or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 804, a touch display 805, a camera 806, audio circuitry 807, a positioning component 808, and a power supply 809.
Peripheral interface 803 may be used to connect at least one Input/Output (I/O) related peripheral to the processor 801 and the memory 802. In some embodiments, the processor 801, the memory 802, and the peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 804 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 804 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 804 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may also include NFC (Near Field Communication) related circuitry, which is not limited in this application.
The display 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to collect touch signals at or above its surface. The touch signal may be input to the processor 801 as a control signal for processing. In this case, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 805, disposed on the front panel of the terminal 800; in other embodiments, there may be at least two displays 805, disposed on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display disposed on a curved or folded surface of the terminal 800. The display 805 may even be arranged in an irregular, non-rectangular pattern, i.e., an irregularly shaped screen. The display 805 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 806 is used to capture images or video. Optionally, the camera assembly 806 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so as to implement a background blurring function by fusing the main camera with the depth-of-field camera, and to implement panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions by fusing the main camera with the wide-angle camera. In some embodiments, the camera assembly 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
Audio circuitry 807 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert them into electrical signals, and input the electrical signals to the processor 801 for processing or to the radio frequency circuit 804 for voice communication. For stereo acquisition or noise reduction purposes, a plurality of microphones may be disposed at different portions of the terminal 800. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for purposes such as ranging. In some embodiments, audio circuitry 807 may also include a headphone jack.
The location component 808 is used to locate the current geographic location of the terminal 800 to enable navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the Beidou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
A power supply 809 is used to supply power to the various components in the terminal 800. The power supply 809 may use alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 809 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery charged through a wired line, and a wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast-charge technology.
In some embodiments, the terminal 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyroscope sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815, and proximity sensor 816.
The acceleration sensor 811 can detect the magnitude of acceleration on each of the three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 801 may control the touch display screen 805 to display the user interface in a landscape or portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 811. The acceleration sensor 811 may also be used to collect motion data of a game or of the user.
The gyroscope sensor 812 can detect the body direction and rotation angle of the terminal 800, and may cooperate with the acceleration sensor 811 to collect the user's 3D actions on the terminal 800. Based on the data collected by the gyroscope sensor 812, the processor 801 may implement the following functions: motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 813 may be disposed at a side frame of the terminal 800 and/or at a lower layer of the touch display 805. When the pressure sensor 813 is disposed on a side frame of the terminal 800, a grip signal of the terminal 800 by a user may be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the grip signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at the lower layer of the touch display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 805. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 814 is used to collect a user's fingerprint, and the processor 801 identifies the user's identity based on the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the user's identity based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 814 may be provided on the front, back, or side of the terminal 800. When a physical key or vendor Logo is provided on the terminal 800, the fingerprint sensor 814 may be integrated with the physical key or vendor Logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch display screen 805 based on the intensity of ambient light collected by the optical sensor 815. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 805 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 805 is turned down. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera module 806 based on the ambient light intensity collected by the optical sensor 815.
A proximity sensor 816, also referred to as a distance sensor, is typically provided on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually decreases, the processor 801 controls the touch display 805 to switch from the bright screen state to the off screen state; when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually increases, the processor 801 controls the touch display 805 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in Fig. 9 is not limiting, and that more or fewer components than shown may be included, certain components may be combined, or a different arrangement of components may be employed.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. A method of detecting a travelable region, the method comprising:
acquiring M frames of video images acquired by a monocular camera when the equipment moves, wherein M is an integer greater than or equal to 2;
training a first deep learning network according to the M frames of video images to obtain a depth detection model;
acquiring video images acquired by a monocular camera mounted on the equipment;
inputting the video image into a depth detection model, wherein the depth detection model is used for detecting the depth information of each pixel point in the video image and acquiring the depth information of each pixel point in the video image output by the depth detection model;
determining the position information of the physical point corresponding to each pixel point according to the depth information of each pixel point and the calibrated installation parameters of the monocular camera;
determining a drivable area according to the position information of each physical point;
wherein the training of the first deep learning network according to the M frames of video images to obtain the depth detection model includes:
inputting a first video image into a first deep learning network, wherein the first deep learning network is used for determining depth information of each pixel point in the first video image, acquiring the depth information of each pixel point in the first video image output by the first deep learning network, and the first video image is any frame of video image in the M frames of video images;
acquiring a pose relationship between the first video image and a second video image, wherein the second video image is one frame of video image among the M frames of video images except the first video image;
generating a synthetic image according to the depth information of each pixel point in the first video image, the calibrated installation parameters of the monocular camera, and the pose relation;
and adjusting network parameters of the first deep learning network according to the synthesized image and the second video image to obtain the depth detection model.
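A minimal training-loop sketch corresponding to claim 1 is given below, under assumptions that go beyond the claim: PyTorch is used, the pose relationship for each pair of frames is already available (for example, obtained as in claim 5 or claim 6), synthesize is a differentiable view-synthesis function of the kind claim 7 describes, and a plain L1 photometric difference between the composite image and the second video image serves as the loss. All function and parameter names are illustrative.

```python
# Illustrative sketch only; names, loss, and optimizer settings are assumptions.
import torch

def train_depth_model(depth_net, frames, poses, K, synthesize, epochs=10, lr=1e-4):
    """frames: list of M frames (each a B x 3 x H x W tensor);
    poses[i]: 4 x 4 pose relationship between frames[i] and frames[i + 1];
    synthesize(first, depth, pose, K) must be differentiable (e.g. bilinear
    sampling) so that the gradient reaches depth_net."""
    opt = torch.optim.Adam(depth_net.parameters(), lr=lr)
    for _ in range(epochs):
        for i in range(len(frames) - 1):
            first, second = frames[i], frames[i + 1]
            depth = depth_net(first)                   # depth of each pixel of the first frame
            composite = synthesize(first, depth, poses[i], K)
            loss = (composite - second).abs().mean()   # composite image vs. second video image
            opt.zero_grad()
            loss.backward()
            opt.step()
    return depth_net                                   # the trained depth detection model
```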
2. The method of claim 1, wherein the determining the drivable region based on the location information of each physical point comprises:
constructing a grid map according to position information of a landing place of the installation position of the monocular camera on the equipment on a road surface, wherein the grid map comprises a plurality of grids;
determining the number of physical points of each grid falling into the grid map according to the position information of each physical point;
and acquiring grids in which the number of physical points falling into the grid is smaller than a preset number threshold, and determining a travelable area according to the acquired grids.
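A compact sketch of the grid-map test in claim 2 follows. The grid extent, cell size, and count threshold are illustrative values, numpy is assumed, and the question of which physical points are accumulated into the grid (all of them, or only those lying above the road surface) is left to the embodiment; the sketch simply counts the points it is given.

```python
# Illustrative sketch only; extent, cell size, and threshold are assumptions.
import numpy as np

def drivable_cells(points, extent=20.0, cell=0.2, count_threshold=3):
    """points: N x 3 array of physical points relative to the camera's ground
    point (x lateral, y height, z forward). Returns a boolean grid where True
    marks cells into which fewer than `count_threshold` points fall."""
    n = int(extent / cell)
    counts = np.zeros((n, n), dtype=int)
    ix = ((points[:, 0] + extent / 2.0) / cell).astype(int)   # lateral cell index
    iz = (points[:, 2] / cell).astype(int)                    # forward cell index
    ok = (ix >= 0) & (ix < n) & (iz >= 0) & (iz < n)
    np.add.at(counts, (iz[ok], ix[ok]), 1)
    return counts < count_threshold
```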
3. The method of claim 1, wherein the position information of a physical point comprises an abscissa, an ordinate, and a height of the physical point; and the determining of the drivable area according to the position information of each physical point comprises the following steps:
acquiring physical points with heights in a height range from each physical point, wherein the height range comprises the road surface height, and the interval length of the height range is a preset length threshold;
and determining a travelable region according to the physical points located within the height range.
4. The method of claim 3, wherein the determining of a travelable region according to the physical points located within the height range comprises:
clustering each physical point in the height range according to the position information of each physical point in the height range to obtain at least one physical point set, wherein the distance between any two adjacent physical points in the same physical point set does not exceed a preset distance threshold;
and determining a minimum area including each physical point in the physical point set as a travelable area.
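The sketch below covers the height filtering of claim 3 and the clustering of claim 4. scikit-learn's DBSCAN with min_samples=1 (so points chain together whenever neighbours lie within the distance threshold) stands in for whatever clustering the embodiment uses, y is assumed to be the height coordinate, and an axis-aligned bounding rectangle stands in for the minimum area enclosing a cluster; all thresholds are illustrative.

```python
# Illustrative sketch only; the clustering backend and thresholds are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def travelable_regions(points, road_height=0.0, half_band=0.15, dist=0.5):
    """points: N x 3 physical points (x lateral, y height, z forward)."""
    # Claim 3: keep points whose height lies inside a band around the road surface.
    band = np.abs(points[:, 1] - road_height) <= half_band
    ground = points[band]
    if len(ground) == 0:
        return []
    # Claim 4: group points so that neighbours are no farther apart than `dist`.
    labels = DBSCAN(eps=dist, min_samples=1).fit_predict(ground[:, [0, 2]])
    regions = []
    for lab in np.unique(labels):
        cluster = ground[labels == lab]
        x, z = cluster[:, 0], cluster[:, 2]
        regions.append((x.min(), z.min(), x.max(), z.max()))   # bounding rectangle
    return regions
```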
5. The method of claim 1, wherein the acquiring the pose relationship between the first video image and the second video image comprises:
acquiring a plurality of pixel point pairs, wherein the pixel point pairs comprise one pixel point in the first video image and one pixel point in the second video image, and the physical points corresponding to each pixel point included in the pixel point pairs are the same;
and determining the pose relation between the first video image and the second video image according to the pixel point pairs.
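One way the pixel point pairs and the resulting pose relationship of claim 5 could be obtained is sketched below with OpenCV. ORB feature matching is an assumption (the claim does not say how the pairs are found), the frames are assumed to be 8-bit grayscale, and the translation recovered from the essential matrix is known only up to scale.

```python
# Illustrative sketch only; the feature detector and matcher are assumptions.
import cv2
import numpy as np

def pose_from_pixel_pairs(img1, img2, K):
    """img1, img2: 8-bit grayscale frames; K: 3 x 3 intrinsic matrix."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    # Each match is a pixel point pair: one pixel in each frame, same physical point.
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t   # rotation and (scale-ambiguous) translation between the two frames
```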
6. The method of claim 1, wherein the acquiring the pose relationship between the first video image and the second video image comprises:
inputting the first video image and the second video image into a second deep learning network, wherein the second deep learning network is used for determining the pose relation between the first video image and the second video image and obtaining the pose relation output by the second deep learning network;
after generating the composite image according to the depth information of each pixel point in the first video image, the calibrated installation parameters of the monocular camera and the pose relationship, the method further comprises:
and adjusting network parameters of the second deep learning network according to the composite image and the second video image.
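A minimal sketch of what the second deep learning network of claim 6 could look like: a small PyTorch CNN that takes the two video images stacked along the channel dimension and regresses a 6-parameter pose, trained jointly with the depth network through the same composite-image comparison. The architecture and the pose encoding are assumptions, not details taken from the embodiment.

```python
# Illustrative sketch only; layer sizes and the 6-DoF encoding are assumptions.
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, 6)   # 3 rotation + 3 translation parameters

    def forward(self, first, second):
        x = torch.cat([first, second], dim=1)   # stack the two frames channel-wise
        return self.fc(self.features(x).flatten(1))
```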
7. The method of claim 1, wherein the generating a composite image from the depth information of each pixel in the first video image, the calibrated installation parameters of the monocular camera, and the pose relationship comprises:
determining the position information of a physical point corresponding to each pixel point in the first video image according to the depth information of each pixel point in the first video image and the calibrated installation parameters of the monocular camera;
and acquiring each pixel point in the composite image according to each pixel point in the first video image, the position information of the physical point corresponding to each pixel point in the first video image, and the pose relationship.
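A sketch of the composite-image generation in claim 7: each pixel of the first video image is back-projected with its depth to its physical point, moved by the pose relationship, re-projected into the second view, and contributes its colour to the composite image at the resulting location. A pinhole intrinsic matrix K stands in for the calibrated installation parameters, and a nearest-pixel splat stands in for the differentiable sampling that a practical training pipeline would use; both are assumptions. Replacing the rounded assignment with bilinear weights would yield the differentiable variant needed for the loss of claim 1.

```python
# Illustrative sketch only; K as the sole installation parameter and the
# nearest-pixel splat are assumptions.
import numpy as np

def generate_composite(first_img, depth1, K, T_21):
    """first_img: H x W x 3, depth1: H x W, K: 3 x 3, T_21: 4 x 4 pose taking
    first-camera coordinates into the second camera's frame."""
    h, w = depth1.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])   # homogeneous pixel coords
    # Step 1: position of the physical point behind every first-frame pixel.
    cam1 = (np.linalg.inv(K) @ pix) * depth1.ravel()         # 3 x HW, camera-1 frame
    cam1_h = np.vstack([cam1, np.ones(h * w)])
    # Step 2: move the points by the pose relationship and re-project them.
    cam2 = (T_21 @ cam1_h)[:3]
    uv = (K @ cam2)[:2] / np.clip(cam2[2], 1e-6, None)
    u2 = np.round(uv[0]).astype(int)
    v2 = np.round(uv[1]).astype(int)
    # Step 3: each first-frame pixel contributes its colour to the composite image.
    composite = np.zeros_like(first_img)
    ok = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h) & (cam2[2] > 0)
    composite[v2[ok], u2[ok]] = first_img.reshape(h * w, -1)[ok]
    return composite
```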
8. An apparatus for detecting a travelable region, the apparatus comprising:
the third acquisition module is used for acquiring M frames of video images acquired by the monocular camera when the equipment moves, wherein M is an integer greater than or equal to 2;
the training module is used for training the first deep learning network according to the M frames of video images to obtain a depth detection model;
the first acquisition module is used for acquiring video images acquired by a monocular camera arranged on the equipment;
the second acquisition module is used for inputting the video image into a depth detection model, wherein the depth detection model is used for detecting the depth information of each pixel point in the video image and acquiring the depth information of each pixel point in the video image output by the depth detection model;
the first determining module is used for determining the position information of the physical point corresponding to each pixel point according to the depth information of each pixel point and the calibrated installation parameters of the monocular camera;
the second determining module is used for determining a drivable area according to the position information of each physical point;
wherein the training module includes:
a third obtaining unit, configured to input a first video image to a first deep learning network, where the first deep learning network is configured to determine depth information of each pixel in the first video image, obtain depth information of each pixel in the first video image output by the first deep learning network, and the first video image is any one of the M frames of video images;
a fourth obtaining unit, configured to obtain a pose relationship between the first video image and a second video image, where the second video image is one frame of video image in the M frames of video images except for the first video image;
the generating unit is used for generating a synthetic image according to the depth information of each pixel point in the first video image, the calibrated installation parameters of the monocular camera and the pose relation;
and the first adjusting unit is used for adjusting network parameters of the first deep learning network according to the composite image and the second video image so as to obtain the depth detection model.
9. The apparatus of claim 8, wherein the second determination module comprises:
a construction unit for constructing a grid map according to position information of a landing place of an installation position of the monocular camera on the device on a road surface, the grid map including a plurality of grids;
a first determining unit configured to determine the number of physical points of each grid falling into the grid map according to the position information of each physical point;
the first acquisition unit is used for acquiring grids in which the number of physical points falling into the grid is smaller than a preset number threshold, and determining a travelable area according to the acquired grids.
10. The apparatus of claim 8, wherein the position information of a physical point comprises an abscissa, an ordinate, and a height of the physical point; and the second determining module includes:
a second obtaining unit, configured to obtain, from each physical point, a physical point whose height is within a height range, where the height range includes a road surface height, and an interval length of the height range is a preset length threshold;
and the second determining unit is used for determining the drivable area according to the physical points in the height range.
11. The apparatus of claim 10, wherein the second determining unit is configured to:
clustering each physical point in the height range according to the position information of each physical point in the height range to obtain at least one physical point set, wherein the distance between any two adjacent physical points in the same physical point set does not exceed a preset distance threshold;
and determining a minimum area including each physical point in the physical point set as a travelable area.
12. The apparatus of claim 8, wherein the fourth acquisition unit is configured to:
acquiring a plurality of pixel point pairs, wherein the pixel point pairs comprise one pixel point in the first video image and one pixel point in the second video image, and the physical points corresponding to each pixel point included in the pixel point pairs are the same;
and determining the pose relation between the first video image and the second video image according to the pixel point pairs.
13. The apparatus of claim 8, wherein the fourth acquisition unit is configured to:
inputting the first video image and the second video image into a second deep learning network, wherein the second deep learning network is used for determining the pose relation between the first video image and the second video image and obtaining the pose relation output by the second deep learning network;
The training module further comprises:
and the second adjusting unit is used for adjusting network parameters of the second deep learning network according to the composite image and the second video image.
14. The apparatus of claim 8, wherein the generating unit is to:
determining the position information of a physical point corresponding to each pixel point in the first video image according to the depth information of each pixel point in the first video image and the calibrated installation parameters of the monocular camera;
and acquiring each pixel point in the composite image according to each pixel point in the first video image, the position information of the physical point corresponding to each pixel point in the first video image, and the pose relation.
CN201910648165.2A 2019-07-17 2019-07-17 Method and device for detecting drivable area Active CN112241662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910648165.2A CN112241662B (en) 2019-07-17 2019-07-17 Method and device for detecting drivable area

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910648165.2A CN112241662B (en) 2019-07-17 2019-07-17 Method and device for detecting drivable area

Publications (2)

Publication Number Publication Date
CN112241662A CN112241662A (en) 2021-01-19
CN112241662B true CN112241662B (en) 2024-03-19

Family

ID=74167746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910648165.2A Active CN112241662B (en) 2019-07-17 2019-07-17 Method and device for detecting drivable area

Country Status (1)

Country Link
CN (1) CN112241662B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117691A (en) * 2017-06-23 2019-01-01 百度在线网络技术(北京)有限公司 Drivable region detection method, device, equipment and storage medium
CN109543600A (en) * 2018-11-21 2019-03-29 成都信息工程大学 A kind of realization drivable region detection method and system and application

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4233723B2 (en) * 2000-02-28 2009-03-04 本田技研工業株式会社 Obstacle detection device, obstacle detection method, and recording medium recording an obstacle detection program
JP6160634B2 (en) * 2015-02-09 2017-07-12 トヨタ自動車株式会社 Traveling road surface detection device and traveling road surface detection method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117691A (en) * 2017-06-23 2019-01-01 百度在线网络技术(北京)有限公司 Drivable region detection method, device, equipment and storage medium
CN109543600A (en) * 2018-11-21 2019-03-29 成都信息工程大学 A kind of realization drivable region detection method and system and application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Co-Point Mapping-Based Approach to Drivable Area Detection for Self-Driving Cars; Liu Ziyi et al.; Engineering; pp. 479-490 *
Research on vision-based road recognition and obstacle detection methods for intelligent vehicles; Shi Jinjin; Engineering Science and Technology II; full text *

Also Published As

Publication number Publication date
CN112241662A (en) 2021-01-19

Similar Documents

Publication Publication Date Title
CN110967011B (en) Positioning method, device, equipment and storage medium
CN111126182B (en) Lane line detection method, lane line detection device, electronic device, and storage medium
CN110967024A (en) Method, device, equipment and storage medium for detecting travelable area
CN109492566B (en) Lane position information acquisition method, device and storage medium
CN112270718B (en) Camera calibration method, device, system and storage medium
CN110986930A (en) Equipment positioning method and device, electronic equipment and storage medium
CN111126276B (en) Lane line detection method, lane line detection device, computer equipment and storage medium
CN111104893B (en) Target detection method, target detection device, computer equipment and storage medium
CN112406707B (en) Vehicle early warning method, vehicle, device, terminal and storage medium
CN111010537B (en) Vehicle control method, device, terminal and storage medium
CN111127541B (en) Method and device for determining vehicle size and storage medium
CN114299468A (en) Method, device, terminal, storage medium and product for detecting convergence of lane
CN110775056B (en) Vehicle driving method, device, terminal and medium based on radar detection
CN111179628B (en) Positioning method and device for automatic driving vehicle, electronic equipment and storage medium
CN111538009B (en) Radar point marking method and device
CN111444749B (en) Method and device for identifying road surface guide mark and storage medium
CN111754564B (en) Video display method, device, equipment and storage medium
CN112241662B (en) Method and device for detecting drivable area
CN112734346B (en) Method, device and equipment for determining lane coverage and readable storage medium
CN111717205B (en) Vehicle control method, device, electronic equipment and computer readable storage medium
CN113326800A (en) Lane line position determination method and device, vehicle-mounted terminal and storage medium
CN114623836A (en) Vehicle pose determining method and device and vehicle
CN113689484B (en) Method and device for determining depth information, terminal and storage medium
CN114506383B (en) Steering wheel alignment control method, device, terminal, storage medium and product
CN116170694A (en) Method, device and storage medium for displaying content

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant