WO2022155899A1

WO2022155899A1 - Target detection method and apparatus, movable platform, and storage medium

Info

Publication number: WO2022155899A1
Application number: PCT/CN2021/073334
Authority: WO
Inventors: 蒋卓键; 陈靖宇; 陈超
Original assignee: 深圳市大疆创新科技有限公司
Priority date: 2021-01-22
Filing date: 2021-01-22
Publication date: 2022-07-28

Abstract

The present invention provides a target detection method and apparatus, a movable platform, and a storage medium. The target detection method comprises: acquiring a first image collected by a first camera and a second image collected by a second camera, the first camera and the second camera being provided on a movable platform, and the first camera and the second camera each being used to capture the environment in different distance ranges in front of the movable platform; merging the first image and the second image into a third image; and identifying a road target included in the third image. By means of merging images collected by different cameras into an image having global semantic information and identifying a road target in the merged image, the efficiency of identification can be increased and an accurate identification result can be obtained.

Description

Object detection method, device, removable platform and storage medium

technical field

The present invention relates to the field of artificial intelligence, and in particular, to a target detection method, device, movable platform and storage medium.

Background technique

In the field of autonomous driving, in order to ensure the safe driving of autonomous vehicles, it is necessary for the vehicle to perceive the surrounding environment in a timely and accurate manner so as to make correct driving decisions. There are various objects that need to be perceived in the road traffic environment, such as lane lines. Accurate and timely detection of lane lines can allow vehicles to switch between the current lane and other lanes accurately.

Taking lane line detection as an example, at present, one way to perform lane line detection is to set up multiple cameras on the vehicle to collect images of different fields of view, and perform lane line detection for the images collected by each camera, and finally The detection results corresponding to the multiple cameras are fused together. This detection method has a large amount of calculation, low efficiency, and is prone to missed detection at the edge, and the accuracy is poor.

SUMMARY OF THE INVENTION

The invention provides a target detection method, device, movable platform and storage medium, which can realize efficient and accurate detection of road targets.

A first aspect of the present invention provides a target detection method, which is applied to a movable platform, and the target detection method includes:

Acquiring a first image collected by a first camera and a second image collected by a second camera; wherein the first camera and the second camera are arranged on the movable platform, the first camera and the second camera are The two cameras are respectively used to photograph the environment in different distance ranges in front of the movable platform;

fusing the first image and the second image into a third image;

Road objects contained in the third image are identified.

A second aspect of the present invention provides a target detection device, which is set on a movable platform, and the target detection device includes: a memory and a processor; wherein, executable codes are stored on the memory, and when the executable codes are When the processor executes, the processor is caused to implement:

Obtain a first image collected by a first camera and a second image collected by a second camera, wherein the first camera and the second camera are arranged on the movable platform, the first camera and the second camera are The two cameras are respectively used to photograph the environment in different distance ranges in front of the movable platform;

fusing the first image and the second image into a third image;

Road objects contained in the third image are identified.

A third aspect of the present invention provides a movable platform, comprising:

case;

The first camera and the second camera are arranged inside or outside the casing, and are respectively used for photographing the environment in front of the movable platform with different distance ranges;

a processor, located inside the casing, coupled to the first camera and the second camera, and configured to acquire a first image captured by the first camera and a second image captured by the second camera; Fusing the first image and the second image into a third image; identifying road objects contained in the third image.

A fourth aspect of the present invention provides a computer-readable storage medium, where executable codes are stored in the computer-readable storage medium, and the executable codes are used to implement the target detection method described in the first aspect.

In the target detection solution provided by the present invention, in order to allow the movable platform (such as a vehicle, etc.) to perceive the surrounding road conditions, a first camera and a second camera are set on the movable platform, and the first camera and the second camera are respectively It is used for photographing environments in different distance ranges in front of the movable platform, so as to determine road objects (such as lane lines, etc.) in front of the movable platform based on the first image captured by the first camera and the second image captured by the second camera.

In order to complete the detection of road targets more efficiently, the first image collected by the first camera and the second image collected by the second camera are fused into a third image, and then road targets are identified on the third image to identify the first image. Three road targets included in the images. The third image contains the global information of the first image and the second image, and it is more efficient to perform road target detection on only one third image.

Description of drawings

The drawings described herein are used to provide further understanding of the present application and constitute a part of the present application. The schematic embodiments and descriptions of the present application are used to explain the present application and do not constitute an improper limitation of the present application. In the attached image:

1 is a schematic diagram of a road traffic scene provided by an embodiment of the present invention;

2 is a schematic flowchart of a target detection method according to an embodiment of the present invention;

3 is a schematic flowchart of an image fusion process according to an embodiment of the present invention;

4 is a schematic diagram of the principle of an image recognition process provided by an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a target detection device according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a movable platform according to an embodiment of the present invention.

Detailed ways

In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments These are some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terms used in the description of the present invention are for the purpose of describing specific embodiments only, and are not intended to limit the present invention.

The target detection method provided by the embodiment of the present invention can be applied to the road traffic scene as shown in FIG. 1 , so as to detect the road target existing in front of the vehicle. As shown in FIG. 1 , a first camera 101 and a second camera 102 are provided at different positions on the autonomous vehicle, and the two cameras have different shooting ranges respectively, so as to shoot the environment in front of the vehicle within different distance ranges (including The ground ahead and various objects on the ground, such as other vehicles, guardrails, etc.), in order to combine the images collected by these two cameras to perceive some road targets that exist ahead.

For example, in the process of driving an autonomous vehicle, it is necessary to pay attention to various marking lines on the ground in order to make accurate control decisions in time. At this time, the first camera 101 and the second camera 102 can be used to photograph the front of the vehicle, respectively. On the ground at different distances, the autonomous vehicle combines the images captured by the two cameras to identify road markings on the ground at different distances ahead, and make corresponding driving control decisions.

For example, the first camera 101 is used for shooting a relatively close environment, such as 0-40 meters, and the second camera 102 is used for shooting a relatively long-distance environment, such as 40-150 meters. The image can identify the lane lines that exist at far and near distances. Based on the recognition results of the short-range lane lines, it is possible to avoid the vehicle running on the line or to give accurate guidance when switching lanes;

It is worth noting that the two cameras mentioned here are used to shoot the environment in front of the vehicle at different distances. It is not limited that the shooting angle of the camera is limited to the front of the vehicle. It can have a wider range of setting angles. Angle a1 and angle a2 illustrated in 1.

In practical applications, road targets to be identified may include not only road markings, but also other targets, such as pedestrians and vehicles, where road markings include but are not limited to: lane lines, zebra crossings, garages or parking spaces on the roadside.

The execution process of the target detection method provided by the present invention will be described in detail below with reference to the following embodiments. The target detection method provided by the embodiment of the present invention may be executed by a movable platform, and specifically, may be executed by a processor provided in the movable platform. In practical applications, the movable platform includes but is not limited to various types of vehicles driving on the road.

FIG. 2 is a schematic flowchart of a target detection method provided by an embodiment of the present invention. As shown in FIG. 2 , the target detection method may include the following steps:

201. Acquire a first image collected by a first camera and a second image collected by a second camera; wherein the first camera and the second camera are arranged on the movable platform, and the first camera and the second camera are respectively used to shoot the movable platform Environments at different distances in front of the platform.

In practical applications, in order to ensure that the movable platform can perceive the environment in different distances in front of it, multiple cameras can be set on the movable platform for sensing the environment in different distances in front. The multiple cameras can be two cameras or more than two cameras. When, for example, three or more cameras are set, the execution principle is the same, so only two cameras are used as an example for description in the embodiment of the present invention.

For example, the first camera and the second camera are set on the movable platform. The first camera is used to photograph the environment within a range of 0 to 40 meters in front of the movable platform, and the second camera is used to photograph 40 to 150 meters in front of the movable platform. environment within the range.

It is worth noting that, in order for the movable platform to perceive the environment in the range of 0 to 150 meters ahead, the shooting distances of the above two cameras should be "seamless". For example, if the first camera is set to shoot the environment in the range of 0-30 meters in front of the movable platform, and the second camera is used to shoot the environment in the range of 40-150 meters in front of the movable platform, then the front 30-40 meters Scope is left out, and gaps appear. Of course, in order to ensure "seamless", the shooting distances of the two cameras can partially overlap. For example, the first camera is set to shoot the environment within a range of 0 to 40 meters in front of the movable platform, and the second camera is used to shoot the movable platform. The environment within a range of 35 to 150 meters in front of the platform.

In addition, the first camera and the second camera synchronize image acquisition.

202. Fusion of the first image and the second image into a third image.

After the first camera collects the first image corresponding to a certain distance range (such as 0-40 meters), and the second camera collects the second image corresponding to a certain distance range (such as 40-150 meters), the two Each camera sends the images collected by each of them to the processor set in the movable platform, and the processor completes the fusion of the first image and the second image, that is, the first image and the second image are spliced into a third image, the third image contains the global information of the first image and the second image, in short, the third image will include all feature information of the environment within a distance range of 0 to 150 meters.

As mentioned above, the road targets that the movable platform needs to identify can be various road marking lines on the ground. Fusion of the first image and the second image from the perspective of And the second top view is merged into the third image.

Among them, the first image and the second image are actually front views, and when the recognition task is to identify road markings on the ground, they can be converted into top views, and the top views can be fused. In addition, when performing image fusion, an image captured by one camera can be converted to the perspective of another camera, and two top views corresponding to the perspective of the same camera can be fused. In this embodiment, it is assumed that the second camera is projected to the perspective of the top view of the first camera, that is, the second top view corresponding to the second image in the perspective of the first camera needs to be obtained. The first top view corresponding to the above-mentioned first image is the top view corresponding to the first image under the viewing angle of the first camera.

An optional implementation manner for obtaining the above-mentioned first top view and second top view will be described in detail below. It is assumed here that the first top view and the second top view have been obtained, and then the first top view and the second top view are merged to obtain the third image. Wherein, optionally, a weighted sum operation may be performed on the first top view and the second top view to obtain the third image.

Specifically, assuming that the first top view is represented by Imgbv1, the second top view is represented by Imgbv2, and the third image is represented by Imgbv _stitch , then Imgbv _stitch =a·Imgbv2+(1-a)·Imgbv1. Among them, a is the preset weight, 0<a<1.

The fused third image has the common field of view of the two cameras and can be correlated with each other, which is equivalent to having global information and richer semantic information.

203. Identify the road target included in the third image.

After the fused third image is obtained, the road target is identified on the third image, and the road target contained in the third image can be accurately identified based on the rich semantic information contained in the third image.

Among them, identifying the road target included in the third image is actually determining which pixels in the third image correspond to the road target. For example, assuming that the third image includes target 1 and target 2, the purpose of recognition is to identify which pixels in the third image correspond to target 1 and which pixels correspond to target 2.

Based on the above recognition results, and further combining the conversion relationship between the image coordinate system and the world coordinate system, the relative positional relationship between the above target 1 and target 2 and the movable platform in the actual physical scene can be obtained.

As can be seen from the above introduction, the above recognition task is actually a semantic segmentation task (determining the category corresponding to the pixels in the image). Therefore, optionally, the third image can be input into a preset semantic segmentation model, so as to identify the road target contained in the third image through the semantic segmentation model.

In practical applications, the semantic segmentation model can be implemented as a neural network model, such as a Convolutional Neural Network (CNN) model; a Residual Network (ResNet) model, such as ResNet-18, DLA -34 models, etc.

From the perspective of constituent units, the semantic segmentation model may include a feature extraction layer and an output layer. In practical applications, the feature extraction layer may include convolution layers, downsampling layers, activation functions, etc., and the output layer may include one or more a convolutional layer. As the name suggests, the feature extraction layer is used to extract the semantic features of the input image to obtain a semantic feature map (usually it can also be referred to as a feature map), and the output layer is used to parse the semantic feature map to input the category recognition result corresponding to each pixel. : Whether it corresponds to a road target.

In summary, the images collected by different cameras are fused to obtain an image containing global semantic information. After that, the road target recognition is only performed on the fused image, without the need for road target recognition on the images collected by each camera. , which can improve the recognition efficiency. At the same time, since the fused image contains global semantic information, accurate recognition results can be guaranteed.

The fusion process of the first image and the second image above will be exemplarily described below with reference to the embodiment shown in FIG. 3 .

FIG. 3 is a schematic flowchart of an image fusion process provided by an embodiment of the present invention. As shown in FIG. 3 , the fusion process may include the following steps:

301. Determine a top view projection matrix corresponding to the first camera.

302. Perform a plan view projection on the first image according to the plan view projection matrix, so as to obtain a first plan view corresponding to the first image.

303. Perform a top view projection on the second image according to the camera extrinsic parameter matrix corresponding to the first camera and the second camera, and the top view projection matrix, to obtain a second top view corresponding to the second image from the perspective of the first camera.

304. Perform a weighted sum operation on the first top view and the second top view to obtain a third image.

As mentioned above, the first image captured by the first camera is actually a front view. To convert the front view into a top view, it is necessary to first determine the projection matrix used to convert the front view into a top view, which is called the top view projection matrix. .

The top view projection matrix corresponding to the first camera is determined by two matrices: a homography matrix (homography matrix) of the first camera and a perspective transformation matrix (PerspectiveTransform matrix) of the first camera relative to the ground. The perspective transformation matrix is used to project the coordinates of the image captured by the camera in the image coordinate system to the world coordinate system.

Specifically, if the top view projection matrix corresponding to the first camera is represented as tranMFront2Top, the homography matrix of the first camera is represented as Hg2im, and the above perspective transformation matrix is represented as PerspectiveTransform.

Then, tranMFront2Top=Hg2im PerspectiveTransform

That is, the homography matrix of the first camera is multiplied by the perspective transformation matrix to obtain the top view projection matrix corresponding to the first camera.

Wherein, the homography matrix corresponding to the first camera can be determined in the following manner:

The homography matrix is determined according to the camera internal parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.

Among them, the rotation matrix of the first camera relative to the ground includes the following three matrices: the rotation matrix of the first camera's pitch angle to the ground (Pitch), denoted as RPitch1; the rotation matrix of the first camera's ground angle (Yaw), denoted as RYaw1; the rotation matrix of the roll angle (Roll) of the first camera to the ground, expressed as RRoll1.

The translation matrix of the first camera relative to the ground refers to the translation matrix of the height of the first camera relative to the ground, which is represented as Th1. The camera intrinsic parameter matrix of the first camera is represented as: P1.

The above camera internal parameter matrix, translation matrix, and rotation matrix are all predetermined.

Based on the above assumptions, Hg2im=P1·RRoll1·RYaw1·RPitch1·Th1

The perspective transformation matrix of the first camera relative to the ground can be obtained as follows:

obtaining the first coordinates corresponding to the plurality of reference pixels in the first image in the image coordinate system;

acquiring the respective second coordinates corresponding to the plurality of reference pixels in the world coordinate system after being projected to the top view;

According to the first coordinates and the second coordinates, a perspective transformation matrix of the first camera relative to the ground is determined.

In practical applications, optionally, the above-mentioned multiple reference pixels may be four vertex pixels of the first image. For example, assuming that the first image is a 720*1280 image, the first coordinate corresponding to the upper-left vertex pixel in the image coordinate system is (0,0), and the upper-right vertex pixel corresponding to the first coordinate in the image coordinate system is (720,0), the first coordinate corresponding to the vertex pixel in the lower left corner in the image coordinate system is (0,1280), and the first coordinate corresponding to the vertex pixel in the upper right corner in the image coordinate system is (720,1280).

Then, the second coordinates corresponding to the above-mentioned four vertex pixels in the top view are determined, and the second coordinates represent the position of the top view in the world coordinate system.

Simply put, the perspective transformation matrix can be obtained by giving four pairs of pixel coordinates corresponding to the perspective transformation. The coordinates of the four pairs of pixel points are the coordinates corresponding to the above four vertices in the front view and the top view respectively.

After the top view transformation matrix corresponding to the first camera is obtained in the above manner, the first top view corresponding to the first image can be obtained by using the top view transformation matrix to project the top view of the first image. Assuming that the first top view is represented as Imgbv1, then:

Imgbv1 = warpPerspective(Imgfv1, tranMFront2Top)

Wherein, Imgfv1 represents the front view captured by the first camera, that is, the first image. warpPerspective represents the top view projection function, which can be a preset function that can realize top view projection. The above formula means that the first image and the top view transformation matrix tranMFront2Top are used as the input of the top view projection function, so as to realize the top view projection of the first image.

For the second image collected by the second camera, the corresponding second top view Imgbv2 may be obtained in the following manner.

Among them: warpPerspective is the above-mentioned top view projection function, Imgfv2 represents the front view collected by the second camera, that is, the second image,

represents the camera extrinsic parameter matrix corresponding to the first camera and the second camera.

Afterwards, the first top view Imgbv1 and the second top view Imgbv2 can be fused according to the following method to obtain the third image Imgbv _stitch :

Imgbv _stitch =a·Imgbv2+(1−a)·Imgbv1. Among them, a is the preset weight, 0<a<1.

An implementation manner of image fusion has been introduced above. In fact, image fusion (or image stitching) may also have other implementation manners, which are not limited to the above examples.

As mentioned above, after the fused third image is obtained, the semantic segmentation model can be used to identify the road objects contained in the third image. In order to further improve the accuracy of the road target recognition result, an image recognition scheme as shown in FIG. 4 is also provided in the embodiment of the present invention.

As shown in Figure 4, the semantic segmentation model can include multiple cascaded feature extraction layers and an output layer. The plurality of feature extraction layers illustrated in FIG. 4 include feature extraction layer 1 , feature extraction layer 2 and feature extraction layer 3 .

The third image is first input to the feature extraction layer 1, and the feature extraction layer 1 outputs the semantic feature map Feature1. After that, on the one hand, the semantic feature map Featurer1 is stored, and on the other hand, the semantic feature map Featurer1 is input to the feature extraction layer 2. Assuming that the feature extraction layer 2 outputs the semantic feature map Featuer2, similarly, on the one hand, the semantic feature map Featuer2 is stored, and on the other hand, the semantic feature map Featurer2 is input to the feature extraction layer 3. It is assumed that the feature extraction layer 3 outputs the semantic feature map Featuer3, and stores the semantic feature map Featuer3.

After obtaining multiple semantic feature maps output by the above multiple feature extraction layers, splicing multiple semantic feature maps, and inputting the semantic feature map Featuer obtained by splicing into the output layer to obtain the semantic segmentation result output by the output layer. The segmentation result indicates the pixel corresponding to the road target in the third image, that is, the classification result of each pixel in the third image is obtained.

It can be understood that, among the above feature extraction layers, the later feature extraction layer extracts higher-level semantic information, and the earlier feature extraction layer extracts lower-level semantic information. By splicing the semantic feature maps of different scales extracted by different feature extraction layers, a feature map with rich semantics can be obtained, which helps to improve the accuracy of the recognition results.

In the embodiment of the present invention, since the purpose of setting different cameras on the movable platform is to perceive road targets existing in different distances ahead, in this scenario, the semantic feature map can reflect different distances and semantic feature vectors In other words, the semantic feature map includes semantic feature vectors corresponding to different distances. Therefore, in the process of splicing multiple semantic feature maps, the distance factor can be combined to splicing.

Therefore, optionally, the splicing process of multiple semantic feature maps can be implemented as:

For the target semantic feature map, set the weight of the semantic feature vector corresponding to the preset target distance range in the target semantic feature map as the first weight, and set the weight of the semantic feature vector corresponding to other distance ranges in the target semantic feature map as the second weight, the first weight is greater than the second weight, the target semantic feature map is any one of multiple semantic feature maps, and the preset target distance ranges corresponding to each of the multiple semantic feature maps are different;

According to the set weights, multiple semantic feature maps are spliced.

For ease of understanding, for example, it is assumed that the first weight is denoted w1, the second weight is denoted w2, and that the plurality of semantic feature maps are the semantic feature map Featuer1, the semantic feature map Featuer2, and the semantic feature map Featuer3 illustrated in FIG. 4 . , and it is assumed that the first camera and the second camera can photograph a range of 0 to 150 meters ahead in total. In addition, it is assumed that the preset target distance range corresponding to the semantic feature map Featurer1 is 0 to 30 meters, the preset target distance range corresponding to the semantic feature map Featurer2 is 30 to 60 meters, and the preset target distance corresponding to the semantic feature map Featurer3. The range is 60 to 150 meters.

Based on the above assumptions, the weight setting results of the above three semantic feature maps are as follows:

For the semantic feature map Featuer1, the weight of the semantic feature vector C11 corresponding to the preset target distance range of 0 to 30 meters in the semantic feature map Featuer1 is set to w1, and to other distance ranges of 30 to 60 meters and 60 to 150 meters. The weights of the semantic feature vector C12 and the semantic feature vector C13 are both set to w2;

For the semantic feature map Featuer2, the weight of the semantic feature vector C22 corresponding to the preset target distance range of 30 to 60 meters in the semantic feature map Featuer2 is set to w1, and the semantics corresponding to other distance ranges of 0 to 30 meters and 60 to 150 meters respectively. The weights of feature vector C21 and semantic feature vector C23 are both set to w2;

For the semantic feature map Featuer3, the weight of the semantic feature vector C33 corresponding to the preset target distance range of 60 to 150 meters in the semantic feature map Featuer3 is set to w1, and the semantics corresponding to other distance ranges of 0 to 30 meters and 30 to 60 meters respectively. The weights of the feature vector C31 and the semantic feature vector C32 are both set to w2.

Based on the above weight setting results, the three semantic feature maps are spliced. Suppose that the semantic feature vectors corresponding to the distance ranges of 0-30 meters, 30-60 meters, and 60-150 meters in the semantic feature map Featuer obtained after splicing are expressed as : semantic feature vector C1, semantic feature vector C2 and semantic feature vector C3, there are:

Semantic feature vector C1=w1*semantic feature vector C11+w2*semantic feature vector C21+w2*semantic feature vector C31;

Semantic feature vector C2=w2*semantic feature vector C12+w1*semantic feature vector C22+w2*semantic feature vector C32;

Semantic feature vector C3=w2*semantic feature vector C13+w2*semantic feature vector C23+w1*semantic feature vector C33.

Among them, 0≤w2<w1≤1. Optionally, w1=1, w2=0 can be set.

It can be seen from the above examples that, according to the output order of the plurality of semantic feature maps, the preset target distance ranges corresponding to each of the plurality of semantic feature maps sequentially become farther. As in the above example, the output order of the above three semantic feature maps is Semantic Feature Map Featurer1, Semantic Feature Map Featurer2, Semantic Feature Map Featurer3, then the preset target distance ranges corresponding to these three semantic feature maps are: 0～ 30 meters, 30 to 60 meters, and 60 to 150 meters, showing a trend of increasing distance in turn.

The reason for this trend is that the semantic feature map output later contains higher-level semantic information, and the higher-level semantic information often corresponds to the farther the environment, that is to say, the farther away. The more road objects require higher-level semantic information.

To sum up, on the basis that the input image (ie, the input image of the semantic segmentation model) is an image obtained by fusing the images collected by different cameras, in the feature extraction layer, multiple semantic feature maps are further spliced. The input image contains rich semantics, and on the other hand, the spliced semantic feature map can strengthen the semantic features of different distances, which can reduce the amount of calculation and improve the recognition efficiency, and at the same time, better recognition results can be obtained.

FIG. 5 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present invention. The target detection apparatus is set on a movable platform. As shown in FIG. 5 , the target detection apparatus includes: a memory 11 and a processor 12 . Wherein, the executable code is stored on the memory 11, and when the executable code is executed by the processor 12, the processor 12 is made to realize:

fusing the first image and the second image into a third image;

Road objects contained in the third image are identified.

Optionally, in the process of fusing the first image and the second image into a third image, the processor 12 is specifically configured to:

Obtain a first top view corresponding to the first image and a second top view corresponding to the second image from the perspective of the first camera; and fuse the first top view and the second top view into the first top view Three images.

Wherein, optionally, the processor 12 is specifically configured to: determine a top view projection matrix corresponding to the first camera; perform top view projection on the first image according to the top view projection matrix to obtain the first image The corresponding first top view; according to the camera extrinsic parameter matrix corresponding to the first camera and the second camera, and the top view projection matrix, perform top view projection on the second image to obtain the second image A corresponding second top view under the viewing angle of the first camera.

Wherein, optionally, the processor 12 is specifically configured to: perform a weighted sum operation on the first top view and the second top view to obtain the third image.

Wherein, optionally, in the process of determining the top view projection matrix corresponding to the first camera, the processor 12 is specifically configured to: determine the homography matrix corresponding to the first camera; acquire the first image the first coordinates corresponding to the multiple reference pixels in the image coordinate system; obtain the second coordinates corresponding to the multiple reference pixels in the world coordinate system after being projected to the top view; according to the first coordinates and The second coordinate determines the perspective transformation matrix of the first camera relative to the ground;

The top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.

Optionally, the processor 12 is specifically configured to: determine the single camera according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground. Responsiveness Matrix.

Optionally, in the process of recognizing the road target contained in the third image, the processor 12 is specifically configured to: input the third image into a preset semantic segmentation model, so as to pass the semantic The segmentation model identifies road objects contained in the third image.

Optionally, the semantic segmentation model includes: cascaded multiple feature extraction layers and output layers. Based on this, the processor 12 is specifically configured to: acquire multiple semantic feature maps output by the multiple feature extraction layers; splicing the multiple semantic feature maps; input the spliced semantic feature maps to the output layer , so as to obtain the semantic segmentation result output by the output layer, the semantic segmentation result indicating the pixels corresponding to the road target in the third image.

Wherein, optionally, in the process of splicing the plurality of semantic feature maps, the processor 12 is specifically configured to: for the target semantic feature map, combine the target semantic feature map corresponding to the preset target distance range The weight of the semantic feature vector is set as the first weight, the weight of the semantic feature vector corresponding to other distance ranges in the target semantic feature map is set as the second weight, the first weight is greater than the second weight, the The target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to each of the multiple semantic feature maps are different; the multiple semantic feature maps are spliced according to the set weights.

Optionally, the first weight is 1, and the second weight is 0.

Optionally, according to the output order of the plurality of semantic feature maps, the preset target distance ranges corresponding to each of the plurality of semantic feature maps sequentially become farther.

Optionally, the road target includes any one of the following: lane lines, parking space lines, and zebra crossings.

For the specific execution process of the target detection device shown in FIG. 5 in the target detection process, reference may be made to the relevant descriptions in the other embodiments described above, and details are not described here.

FIG. 6 is a schematic structural diagram of a movable platform according to an embodiment of the present invention. As shown in FIG. 6 , the movable platform includes:

shell 21;

The first camera 22 and the second camera 23 are arranged inside or outside the casing 21, and are respectively used for photographing the environment in front of the movable platform with different distance ranges;

The processor 24 is arranged inside the casing 21 and is coupled to the first camera 22 and the second camera 23 for acquiring the first image captured by the first camera 22 and the second camera 23 the collected second image; fuse the first image and the second image into a third image; identify the road target contained in the third image.

Optionally, in the process of fusing the first image and the second image into a third image, the processor 24 is specifically configured to:

Wherein, optionally, the processor 24 is specifically configured to: determine a top view projection matrix corresponding to the first camera; perform top view projection on the first image according to the top view projection matrix to obtain the first image The corresponding first top view; according to the camera extrinsic parameter matrix corresponding to the first camera and the second camera, and the top view projection matrix, perform top view projection on the second image to obtain the second image A corresponding second top view under the viewing angle of the first camera.

Wherein, optionally, the processor 24 is specifically configured to: perform a weighted sum operation on the first top view and the second top view to obtain the third image.

Wherein, optionally, in the process of determining the top view projection matrix corresponding to the first camera, the processor 24 is specifically configured to: determine the homography matrix corresponding to the first camera; acquire the first image the first coordinates corresponding to the multiple reference pixels in the image coordinate system; obtain the second coordinates corresponding to the multiple reference pixels in the world coordinate system after being projected to the top view; according to the first coordinates and The second coordinate determines the perspective transformation matrix of the first camera relative to the ground;

Optionally, the processor 24 is specifically configured to: determine the single camera according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground. Responsiveness Matrix.

Optionally, in the process of recognizing the road target contained in the third image, the processor 24 is specifically configured to: input the third image into a preset semantic segmentation model, so as to pass the semantic The segmentation model identifies road objects contained in the third image.

Optionally, the semantic segmentation model includes: cascaded multiple feature extraction layers and output layers. Based on this, the processor 24 is specifically configured to: acquire multiple semantic feature maps output by the multiple feature extraction layers; splicing the multiple semantic feature maps; input the spliced semantic feature maps to the output layer , so as to obtain the semantic segmentation result output by the output layer, the semantic segmentation result indicating the pixels corresponding to the road target in the third image.

Wherein, optionally, in the process of splicing the plurality of semantic feature maps, the processor 24 is specifically configured to: for the target semantic feature map, combine the target semantic feature map corresponding to the preset target distance range The weight of the semantic feature vector is set as the first weight, the weight of the semantic feature vector corresponding to other distance ranges in the target semantic feature map is set as the second weight, the first weight is greater than the second weight, the The target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to each of the multiple semantic feature maps are different; the multiple semantic feature maps are spliced according to the set weights.

Optionally, the first weight is 1, and the second weight is 0.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where executable codes are stored in the computer-readable storage medium, and the executable codes are used to implement the target detection methods provided by the foregoing embodiments.

The technical solutions and technical features in each of the above embodiments can be used alone or in combination if they do not conflict with each other, as long as they do not exceed the cognitive scope of those skilled in the art, they all belong to equivalent embodiments within the protection scope of the present application.

The above descriptions are only the embodiments of the present invention, and are not intended to limit the scope of the present invention. Any equivalent structure or equivalent process transformation made by using the contents of the description and drawings of the present invention, or directly or indirectly applied to other related technologies Fields are similarly included in the scope of patent protection of the present invention.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: The technical solutions described in the foregoing embodiments can still be modified, or some or all of the technical features thereof can be equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present invention. scope.

Claims

A target detection method, characterized in that, applied to a movable platform, the method comprising:

Acquiring a first image collected by a first camera and a second image collected by a second camera; wherein the first camera and the second camera are arranged on the movable platform, the first camera and the second camera are The two cameras are respectively used to photograph the environment in different distance ranges in front of the movable platform;

fusing the first image and the second image into a third image;

Road objects contained in the third image are identified.
The method according to claim 1, wherein the fusion of the first image and the second image into a third image comprises:

acquiring a first top view corresponding to the first image, and a second top view corresponding to the second image under the viewing angle of the first camera;

The first top view and the second top view are merged into the third image.
The method according to claim 2, wherein the acquiring a first top view corresponding to the first image and a second top view corresponding to the second image under the viewing angle of the first camera comprises:

determining a top view projection matrix corresponding to the first camera;

Perform a top view projection on the first image according to the top view projection matrix to obtain a first top view corresponding to the first image;

According to the camera extrinsic parameter matrix corresponding to the first camera and the second camera, and the top view projection matrix, perform top view projection on the second image, so as to obtain the second image on the first camera The corresponding second top view from the viewing angle.
The method according to claim 2, wherein the combining the first top view and the second top view into the third image comprises:

A weighted sum operation is performed on the first top view and the second top view to obtain the third image.
The method according to claim 3, wherein the determining the top view projection matrix corresponding to the first camera comprises:

determining a homography matrix corresponding to the first camera;

Acquiring respective first coordinates in the image coordinate system of a plurality of reference pixels in the first image;

acquiring the respective second coordinates corresponding to the plurality of reference pixels in the world coordinate system after being projected to the top view;

According to the first coordinate and the second coordinate, determine the perspective transformation matrix of the first camera relative to the ground;

The top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
The method according to claim 5, wherein the determining the homography matrix corresponding to the first camera comprises:

The homography matrix is determined according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
The method according to any one of claims 1 to 6, wherein the identifying a road target included in the third image comprises:

The third image is input into a preset semantic segmentation model, so as to identify the road target contained in the third image through the semantic segmentation model.
The method according to claim 7, wherein the semantic segmentation model comprises: a plurality of cascaded feature extraction layers and output layers;

The identifying the road target contained in the third image by the semantic segmentation model includes:

obtaining multiple semantic feature maps output by the multiple feature extraction layers;

splicing the plurality of semantic feature maps;

The spliced semantic feature map is input to the output layer to obtain a semantic segmentation result output by the output layer, and the semantic segmentation result indicates the pixels corresponding to the road target in the third image.
The method according to claim 8, wherein the splicing the plurality of semantic feature maps comprises:

For the target semantic feature map, the weight of the semantic feature vector corresponding to the preset target distance range in the target semantic feature map is set as the first weight, and the semantic feature vector corresponding to other distance ranges in the target semantic feature map is set as the first weight. The weight of the target is set to the second weight, the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and each of the multiple semantic feature maps corresponds to the preset Set the target distance range is different;

According to the set weights, the multiple semantic feature maps are spliced.
The method according to claim 9, wherein the first weight is 1, and the second weight is 0.
The method according to claim 9, wherein, according to the output order of the plurality of semantic feature maps, the preset target distance ranges corresponding to each of the plurality of semantic feature maps sequentially become farther.
The method according to claim 1, wherein the road target comprises any one of the following:

Lane lines, parking space lines, zebra crossings.
A target detection device, characterized in that it is set on a movable platform, and the device comprises: a memory and a processor; wherein, executable code is stored on the memory, and when the executable code is executed by the processor , make the processor implement:

Obtain a first image collected by a first camera and a second image collected by a second camera, wherein the first camera and the second camera are arranged on the movable platform, the first camera and the second camera are The two cameras are respectively used to photograph the environment in different distance ranges in front of the movable platform;

fusing the first image and the second image into a third image;

Road objects contained in the third image are identified.
The apparatus according to claim 13, wherein in the process of fusing the first image and the second image into a third image, the processor is specifically configured to:

acquiring a first top view corresponding to the first image, and a second top view corresponding to the second image under the viewing angle of the first camera;

The first top view and the second top view are merged into the third image.
The apparatus according to claim 14, wherein the processor is specifically configured to:

determining a top view projection matrix corresponding to the first camera;

Perform a top view projection on the first image according to the top view projection matrix to obtain a first top view corresponding to the first image;

According to the camera extrinsic parameter matrix corresponding to the first camera and the second camera, and the top view projection matrix, perform top view projection on the second image, so as to obtain the second image on the first camera The corresponding second top view from the viewing angle.
The apparatus according to claim 14, wherein the processor is specifically configured to:

A weighted sum operation is performed on the first top view and the second top view to obtain the third image.
The device according to claim 15, wherein, in the process of determining the top view projection matrix corresponding to the first camera, the processor is specifically configured to:

determining a homography matrix corresponding to the first camera;

Acquiring respective first coordinates in the image coordinate system of a plurality of reference pixels in the first image;

acquiring the respective second coordinates corresponding to the plurality of reference pixels in the world coordinate system after being projected to the top view;

According to the first coordinate and the second coordinate, determine the perspective transformation matrix of the first camera relative to the ground;

The top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
The apparatus according to claim 17, wherein the processor is specifically configured to:

The homography matrix is determined according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
The device according to any one of claims 13 to 18, wherein in the process of recognizing the road target included in the third image, the processor is specifically configured to:

The third image is input into a preset semantic segmentation model, so as to identify the road target contained in the third image through the semantic segmentation model.
The device according to claim 19, wherein the semantic segmentation model comprises: cascaded multiple feature extraction layers and output layers;

The processor is specifically used for:

obtaining multiple semantic feature maps output by the multiple feature extraction layers;

splicing the plurality of semantic feature maps;

The spliced semantic feature map is input to the output layer to obtain a semantic segmentation result output by the output layer, and the semantic segmentation result indicates the pixels corresponding to the road target in the third image.
The device according to claim 20, wherein, in the process of splicing the plurality of semantic feature maps, the processor is specifically configured to:

For the target semantic feature map, the weight of the semantic feature vector corresponding to the preset target distance range in the target semantic feature map is set as the first weight, and the semantic feature vector corresponding to other distance ranges in the target semantic feature map is set as the first weight. The weight of the target is set to the second weight, the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and each of the multiple semantic feature maps corresponds to the preset Set the target distance range is different;

According to the set weights, the multiple semantic feature maps are spliced.
The apparatus according to claim 21, wherein the first weight is 1, and the second weight is 0.
The device according to claim 21, wherein, according to the output order of the plurality of semantic feature maps, the preset target distance ranges corresponding to each of the plurality of semantic feature maps sequentially become farther.
The device according to claim 13, wherein the road target comprises any one of the following:

Lane lines, parking space lines, zebra crossings.
A movable platform, characterized in that, comprising:

case;

The first camera and the second camera are arranged inside or outside the casing, and are respectively used for photographing the environment in front of the movable platform with different distance ranges;

a processor, located inside the casing, coupled to the first camera and the second camera, and configured to acquire a first image captured by the first camera and a second image captured by the second camera; Fusing the first image and the second image into a third image; identifying road objects contained in the third image.
The movable platform according to claim 25, wherein in the process of fusing the first image and the second image into a third image, the processor is specifically configured to:

acquiring a first top view corresponding to the first image, and a second top view corresponding to the second image under the viewing angle of the first camera;

The first top view and the second top view are merged into the third image.
The movable platform according to claim 26, wherein the processor is specifically configured to:

determining a top view projection matrix corresponding to the first camera;

Perform a top view projection on the first image according to the top view projection matrix to obtain a first top view corresponding to the first image;

According to the camera extrinsic parameter matrix corresponding to the first camera and the second camera, and the top view projection matrix, perform top view projection on the second image, so as to obtain the second image on the first camera The corresponding second top view from the viewing angle.
The movable platform according to claim 26, wherein the processor is specifically configured to:

A weighted sum operation is performed on the first top view and the second top view to obtain the third image.
The movable platform according to claim 27, wherein, in the process of determining the top view projection matrix corresponding to the first camera, the processor is specifically configured to:

determining a homography matrix corresponding to the first camera;

Acquiring respective first coordinates in the image coordinate system of a plurality of reference pixels in the first image;

acquiring the respective second coordinates corresponding to the plurality of reference pixels in the world coordinate system after being projected to the top view;

According to the first coordinate and the second coordinate, determine the perspective transformation matrix of the first camera relative to the ground;

The top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
The movable platform according to claim 29, wherein the processor is specifically configured to:

The homography matrix is determined according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
The movable platform according to any one of claims 25 to 30, wherein in the process of recognizing the road target included in the third image, the processor is specifically configured to:

The third image is input into a preset semantic segmentation model, so as to identify the road target contained in the third image through the semantic segmentation model.
The movable platform according to claim 31, wherein the semantic segmentation model comprises: a plurality of cascaded feature extraction layers and output layers;

The processor is specifically used for:

obtaining multiple semantic feature maps output by the multiple feature extraction layers;

splicing the plurality of semantic feature maps;

The spliced semantic feature map is input to the output layer to obtain a semantic segmentation result output by the output layer, and the semantic segmentation result indicates the pixels corresponding to the road target in the third image.
The movable platform according to claim 32, wherein, in the process of splicing the plurality of semantic feature maps, the processor is specifically configured to:

For the target semantic feature map, the weight of the semantic feature vector corresponding to the preset target distance range in the target semantic feature map is set as the first weight, and the semantic feature vector corresponding to other distance ranges in the target semantic feature map is set as the first weight. The weight of the target is set to the second weight, the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and each of the multiple semantic feature maps corresponds to the preset Set the target distance range is different;

According to the set weights, the multiple semantic feature maps are spliced.
The movable platform of claim 33, wherein the first weight is 1 and the second weight is 0.
The movable platform according to claim 33, wherein, according to the output order of the plurality of semantic feature maps, the respective preset target distance ranges corresponding to the plurality of semantic feature maps gradually become farther.
The movable platform of claim 25, wherein the road target comprises any of the following:

Lane lines, parking space lines, zebra crossings.
A computer-readable storage medium, wherein executable codes are stored in the computer-readable storage medium, and the executable codes are used to implement the target detection method according to any one of claims 1 to 12.