WO2022155899A1 - Target detection method and apparatus, movable platform, and storage medium - Google Patents

Target detection method and apparatus, movable platform, and storage medium

Info

Publication number
WO2022155899A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
camera
top view
semantic feature
target
Prior art date
Application number
PCT/CN2021/073334
Other languages
French (fr)
Chinese (zh)
Inventor
蒋卓键
陈靖宇
陈超
Original Assignee
深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to PCT/CN2021/073334
Publication of WO2022155899A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis

Definitions

  • the present invention relates to the field of artificial intelligence, and in particular, to a target detection method, device, movable platform and storage medium.
  • one way to perform lane line detection is to set up multiple cameras on the vehicle to collect images covering different fields of view, perform lane line detection separately on the images collected by each camera, and finally fuse the detection results corresponding to the multiple cameras.
  • This detection method is computationally expensive and inefficient, is prone to missed detections at image edges, and has poor accuracy.
  • the invention provides a target detection method, device, movable platform and storage medium, which can realize efficient and accurate detection of road targets.
  • a first aspect of the present invention provides a target detection method, which is applied to a movable platform, and the target detection method includes:
  • a second aspect of the present invention provides a target detection apparatus, which is provided on a movable platform. The target detection apparatus includes a memory and a processor, wherein executable code is stored in the memory, and when the executable code is executed by the processor, the processor is caused to implement:
  • acquiring a first image collected by a first camera and a second image collected by a second camera, wherein the first camera and the second camera are arranged on the movable platform and are respectively used to photograph the environment in different distance ranges in front of the movable platform;
  • a third aspect of the present invention provides a movable platform, comprising:
  • the first camera and the second camera are arranged inside or outside the casing, and are respectively used to photograph the environment in different distance ranges in front of the movable platform;
  • a processor, located inside the casing and coupled to the first camera and the second camera, is configured to acquire a first image captured by the first camera and a second image captured by the second camera; fuse the first image and the second image into a third image; and identify road targets contained in the third image.
  • a fourth aspect of the present invention provides a computer-readable storage medium, where executable code is stored in the computer-readable storage medium, and the executable code is used to implement the target detection method described in the first aspect.
  • a first camera and a second camera are set on the movable platform, and the first camera and the second camera are respectively used to photograph environments in different distance ranges in front of the movable platform, so that road targets (such as lane lines) in front of the movable platform can be determined based on the first image captured by the first camera and the second image captured by the second camera.
  • the first image collected by the first camera and the second image collected by the second camera are fused into a third image, and road target recognition is then performed on the third image to identify the road targets it contains.
  • the third image contains the global information of the first image and the second image, and it is more efficient to perform road target detection on only a single third image.
  • FIG. 1 is a schematic diagram of a road traffic scene provided by an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a target detection method according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of an image fusion process according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the principle of an image recognition process provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a target detection device according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a movable platform according to an embodiment of the present invention.
  • the target detection method provided by the embodiment of the present invention can be applied to the road traffic scene shown in FIG. 1, so as to detect road targets existing in front of the vehicle.
  • a first camera 101 and a second camera 102 are provided at different positions on the autonomous vehicle, and the two cameras have different shooting ranges, so as to photograph the environment within different distance ranges in front of the vehicle (including the ground ahead and various objects on the ground, such as other vehicles, guardrails, etc.), so that the images collected by these two cameras can be combined to perceive certain road targets existing ahead.
  • the first camera 101 and the second camera 102 can be used to photograph the ground at different distances in front of the vehicle.
  • the autonomous vehicle combines the images captured by the two cameras to identify road markings on the ground at different distances ahead, and makes corresponding driving control decisions.
  • the first camera 101 is used for shooting a relatively close environment, such as 0-40 meters
  • the second camera 102 is used for shooting a relatively long-distance environment, such as 40-150 meters.
  • by combining the images captured by the two cameras, lane lines at both near and far distances can be identified. Based on the recognition results for short-range lane lines, the vehicle can avoid straddling a lane line and receive accurate guidance when switching lanes; based on the recognition results for long-range lane lines, going-straight or turning control can be made in a timely and accurate manner.
  • road targets to be identified may include not only road markings but also other targets, such as pedestrians and vehicles, where road markings include but are not limited to: lane lines, zebra crossings, and parking space lines in garages or at the roadside.
  • the execution process of the target detection method provided by the present invention will be described in detail below with reference to the following embodiments.
  • the target detection method provided by the embodiment of the present invention may be executed by a movable platform, and specifically, may be executed by a processor provided in the movable platform.
  • the movable platform includes but is not limited to various types of vehicles driving on the road.
  • FIG. 2 is a schematic flowchart of a target detection method provided by an embodiment of the present invention. As shown in FIG. 2, the target detection method may include the following steps:
  • 201: Acquire a first image collected by a first camera and a second image collected by a second camera, where the first camera and the second camera are arranged on the movable platform and are respectively used to photograph the environment in different distance ranges in front of the movable platform.
  • multiple cameras can be set on the movable platform for sensing the environment in different distances in front.
  • the multiple cameras can be two cameras or more than two cameras. When, for example, three or more cameras are set, the execution principle is the same, so only two cameras are used as an example for description in the embodiment of the present invention.
  • the first camera and the second camera are set on the movable platform.
  • the first camera is used to photograph the environment within a range of 0 to 40 meters in front of the movable platform, and the second camera is used to photograph the environment within a range of 40 to 150 meters in front of the movable platform.
  • the shooting distances of the above two cameras should be "seamless". For example, if the first camera is set to shoot the environment in the range of 0-30 meters in front of the movable platform, and the second camera is used to shoot the environment in the range of 40-150 meters in front of the movable platform, then the range of 30-40 meters in front is left out and a gap appears.
  • the shooting distances of the two cameras can partially overlap.
  • for example, the first camera is set to shoot the environment within a range of 0 to 40 meters in front of the movable platform, and the second camera is used to shoot the environment within a range of 35 to 150 meters in front of the platform.
  • the first camera and the second camera acquire images synchronously.
  • each of the two cameras sends its collected image to the processor provided in the movable platform, and the processor completes the fusion of the first image and the second image, that is, splices the first image and the second image into a third image. The third image contains the global information of the first image and the second image; in short, the third image includes all feature information of the environment within the distance range of 0 to 150 meters.
  • the road targets that the movable platform needs to identify can be various road marking lines on the ground. To make these marking lines more prominent and to weaken the interference of other objects on the road, the first image and the second image can be fused from a top-view perspective: a first top view corresponding to the first image and a second top view corresponding to the second image under the viewing angle of the first camera are obtained, and the first top view and the second top view are then merged into the third image.
  • the first image and the second image are actually front views, and when the recognition task is to identify road markings on the ground, they can be converted into top views, and the top views can be fused.
  • an image captured by one camera can be converted to the perspective of another camera, and two top views corresponding to the perspective of the same camera can be fused.
  • the second camera is projected to the perspective of the top view of the first camera, that is, the second top view corresponding to the second image in the perspective of the first camera needs to be obtained.
  • the first top view corresponding to the above-mentioned first image is the top view corresponding to the first image under the viewing angle of the first camera.
  • an optional implementation for obtaining the above-mentioned first top view and second top view will be described in detail below. It is assumed here that the first top view and the second top view have been obtained; the first top view and the second top view are then merged to obtain the third image. Optionally, a weighted sum operation may be performed on the first top view and the second top view to obtain the third image:
  • Imgbv_stitch = a·Imgbv2 + (1-a)·Imgbv1,
  • where a is the preset weight, 0 < a < 1.
  • the fused third image has the common field of view of the two cameras and can be correlated with each other, which is equivalent to having global information and richer semantic information.
  • the road target is identified on the third image, and the road target contained in the third image can be accurately identified based on the rich semantic information contained in the third image.
  • identifying the road target included in the third image is actually determining which pixels in the third image correspond to the road target. For example, assuming that the third image includes target 1 and target 2, the purpose of recognition is to identify which pixels in the third image correspond to target 1 and which pixels correspond to target 2.
  • the above recognition task is actually a semantic segmentation task (determining the category corresponding to the pixels in the image). Therefore, optionally, the third image can be input into a preset semantic segmentation model, so as to identify the road target contained in the third image through the semantic segmentation model.
  • the semantic segmentation model can be implemented as a neural network model, such as a Convolutional Neural Network (CNN) model, a Residual Network (ResNet) model such as ResNet-18, the DLA-34 model, etc.
  • the semantic segmentation model may include a feature extraction layer and an output layer.
  • the feature extraction layer may include convolution layers, downsampling layers, activation functions, etc.
  • the output layer may include one or more convolutional layers.
  • the feature extraction layer is used to extract the semantic features of the input image to obtain a semantic feature map (usually simply referred to as a feature map), and the output layer is used to parse the semantic feature map to output the category recognition result corresponding to each pixel: whether it corresponds to a certain road target.
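  • for illustration, a minimal sketch of a model with this feature-extraction/output-layer structure is given below, written with PyTorch; the layer sizes, class count, and final upsampling step are assumptions made for the sketch, not values specified in this document.

```python
# Minimal sketch of the structure described above: a feature extraction part
# (convolution + downsampling + activation) followed by a convolutional output
# layer that scores every pixel. Hyperparameters are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F

class RoadTargetSegNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Feature extraction layer(s): convolutions, downsampling, activations.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # downsampling
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Output layer: one or more convolutional layers producing per-pixel scores.
        self.output = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        fmap = self.features(x)      # semantic feature map
        logits = self.output(fmap)   # per-pixel category scores
        # Upsample back to the input size so each input pixel gets a category.
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)
```

  • per-pixel class predictions would then be read off with logits.argmax(dim=1), which yields the pixel-level classification described above.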
  • the images collected by different cameras are fused to obtain an image containing global semantic information.
  • road target recognition is performed only on the fused image, without the need to perform road target recognition on the image collected by each camera, which improves recognition efficiency.
  • the fused image contains global semantic information, accurate recognition results can be guaranteed.
  • FIG. 3 is a schematic flowchart of an image fusion process provided by an embodiment of the present invention. As shown in FIG. 3, the fusion process may include the following steps:
  • the first image captured by the first camera is actually a front view.
  • to convert the front view into a top view, it is necessary to first determine the projection matrix used for the conversion, which is called the top view projection matrix.
  • the top view projection matrix corresponding to the first camera is determined by two matrices: the homography matrix of the first camera, and the perspective transformation matrix (PerspectiveTransform matrix) of the first camera relative to the ground.
  • the perspective transformation matrix is used to project the coordinates of the image captured by the camera in the image coordinate system to the world coordinate system.
  • the top view projection matrix corresponding to the first camera is represented as tranMFront2Top
  • the homography matrix of the first camera is represented as Hg2im
  • the above perspective transformation matrix is represented as PerspectiveTransform.
  • the homography matrix of the first camera is multiplied by the perspective transformation matrix to obtain the top view projection matrix corresponding to the first camera, that is, tranMFront2Top = Hg2im·PerspectiveTransform.
  • the homography matrix corresponding to the first camera can be determined in the following manner:
  • the homography matrix is determined according to the camera internal parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
  • the rotation matrix of the first camera relative to the ground includes the following three matrices: the rotation matrix of the first camera's pitch angle relative to the ground (Pitch), denoted RPitch1; the rotation matrix of the first camera's yaw angle relative to the ground (Yaw), denoted RYaw1; and the rotation matrix of the first camera's roll angle relative to the ground (Roll), denoted RRoll1.
  • the translation matrix of the first camera relative to the ground refers to the translation matrix of the height of the first camera relative to the ground, which is represented as Th1.
  • the camera intrinsic parameter matrix of the first camera is represented as: P1.
  • the above camera internal parameter matrix, translation matrix, and rotation matrix are all predetermined.
  • Hg2im = P1·RRoll1·RYaw1·RPitch1·Th1
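  • as an illustration of this composition, the sketch below assembles Hg2im with NumPy; the rotation-axis conventions and the way the ground-plane constraint is applied are assumptions made for the sketch, since the document only names the factor matrices.

```python
# Illustrative sketch of Hg2im = P1·RRoll1·RYaw1·RPitch1·Th1: build 4x4 rigid
# transforms, project with the intrinsics, and drop the Z column because
# ground-plane points satisfy Z = 0. Axis conventions are assumptions.
import numpy as np

def rot_x(a):  # pitch about the x-axis (assumed convention)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):  # yaw about the y-axis (assumed convention)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):  # roll about the z-axis (assumed convention)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def ground_homography(P1, pitch, yaw, roll, height):
    """Map homogeneous ground-plane points (X, Y, 1) to image pixels."""
    Th1 = np.eye(4)
    Th1[2, 3] = -height                    # translation by the camera height
    R = np.eye(4)
    R[:3, :3] = rot_z(roll) @ rot_y(yaw) @ rot_x(pitch)  # RRoll1·RYaw1·RPitch1
    P = np.hstack([P1, np.zeros((3, 1))])  # 3x4 projection with intrinsics P1
    M = P @ R @ Th1                        # maps ground points (X, Y, 0, 1)
    return M[:, [0, 1, 3]]                 # drop the Z column: 3x3 homography

# Made-up intrinsics and pose, for illustration only:
P1 = np.array([[800.0, 0.0, 360.0], [0.0, 800.0, 640.0], [0.0, 0.0, 1.0]])
Hg2im = ground_homography(P1, pitch=0.02, yaw=0.0, roll=0.0, height=1.5)
```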
  • the perspective transformation matrix of the first camera relative to the ground can be obtained as follows:
  • first coordinates corresponding to multiple reference pixels in the first image in the image coordinate system are acquired, second coordinates corresponding to the multiple reference pixels after projection to the top view are acquired, and the perspective transformation matrix of the first camera relative to the ground is determined according to the first coordinates and the second coordinates.
  • the above-mentioned multiple reference pixels may be four vertex pixels of the first image.
  • the first image is a 720*1280 image
  • the first coordinate corresponding to the upper-left vertex pixel in the image coordinate system is (0,0)
  • the first coordinate corresponding to the upper-right vertex pixel in the image coordinate system is (720,0)
  • the first coordinate corresponding to the lower-left vertex pixel in the image coordinate system is (0,1280)
  • the first coordinate corresponding to the lower-right vertex pixel in the image coordinate system is (720,1280).
  • the second coordinates corresponding to the above-mentioned four vertex pixels in the top view are determined, and the second coordinates represent the position of the top view in the world coordinate system.
  • the perspective transformation matrix can be obtained by giving four pairs of pixel coordinates corresponding to the perspective transformation.
  • the coordinates of the four pairs of pixel points are the coordinates corresponding to the above four vertices in the front view and the top view respectively.
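  • since four point correspondences determine a perspective transformation, the matrix can be computed directly, for example with OpenCV's getPerspectiveTransform, as sketched below; the four image corners follow the 720*1280 example above, while the top-view target coordinates are made-up values for illustration.

```python
# Sketch: derive the PerspectiveTransform matrix from four point pairs, then
# compose the top view projection matrix. The top-view coordinates in dst are
# illustrative assumptions, not values given in this document.
import cv2
import numpy as np

# First coordinates: the four vertex pixels of the 720*1280 first image.
src = np.float32([[0, 0], [720, 0], [0, 1280], [720, 1280]])
# Second coordinates: assumed positions of those vertices in the top view
# (distant ground at the top of the front view compresses horizontally).
dst = np.float32([[200, 0], [520, 0], [0, 1280], [720, 1280]])

perspective_transform = cv2.getPerspectiveTransform(src, dst)

# As stated earlier, tranMFront2Top = Hg2im·PerspectiveTransform:
# tranMFront2Top = Hg2im @ perspective_transform
```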
  • the first top view corresponding to the first image can be obtained by using the top view projection matrix to perform top-view projection on the first image. Assuming that the first top view is denoted Imgbv1, then:
  • Imgbv1 = warpPerspective(Imgfv1, tranMFront2Top)
  • Imgfv1 represents the front view captured by the first camera, that is, the first image.
  • warpPerspective represents the top view projection function, which can be a preset function that can realize top view projection.
  • the above formula means that the first image and the top view transformation matrix tranMFront2Top are used as the input of the top view projection function, so as to realize the top view projection of the first image.
  • for the second image, the corresponding second top view Imgbv2 may be obtained in a similar manner, by applying the top view projection function warpPerspective to the second image.
  • warpPerspective is the above-mentioned top view projection function, Imgfv2 represents the front view collected by the second camera (that is, the second image), and the camera extrinsic parameter matrix corresponding to the first camera and the second camera is additionally used as an input.
  • the first top view Imgbv1 and the second top view Imgbv2 can then be fused according to the following method to obtain the third image Imgbv_stitch:
  • Imgbv_stitch = a·Imgbv2 + (1-a)·Imgbv1,
  • where a is the preset weight, 0 < a < 1.
  • image fusion (or image stitching) may also have other implementation manners, which are not limited to the above examples.
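  • putting the above steps together, one possible implementation of the fusion is sketched below; the helper name H_cam2_to_cam1 is a hypothetical stand-in for the camera extrinsic parameter matrix between the two cameras, here assumed to be expressed as a 3x3 homography on the image plane.

```python
# Sketch of the overall fusion: project both front views to the first camera's
# top-view perspective with warpPerspective, then blend them with the weighted
# sum Imgbv_stitch = a·Imgbv2 + (1-a)·Imgbv1. Names and the form of the
# extrinsic matrix are illustrative assumptions.
import cv2

def fuse_top_views(img_fv1, img_fv2, tranMFront2Top, H_cam2_to_cam1,
                   a=0.5, out_size=(720, 1280)):
    # First top view: top-view projection of the first image.
    img_bv1 = cv2.warpPerspective(img_fv1, tranMFront2Top, out_size)
    # Second top view: first bring the second camera's view into the first
    # camera's frame, then apply the same top-view projection.
    img_bv2 = cv2.warpPerspective(img_fv2, tranMFront2Top @ H_cam2_to_cam1,
                                  out_size)
    # Weighted sum fusion with preset weight a, 0 < a < 1.
    return cv2.addWeighted(img_bv2, a, img_bv1, 1.0 - a, 0.0)
```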
  • the semantic segmentation model can be used to identify the road objects contained in the third image.
  • an image recognition scheme as shown in FIG. 4 is also provided in the embodiment of the present invention.
  • the semantic segmentation model can include multiple cascaded feature extraction layers and an output layer.
  • the plurality of feature extraction layers illustrated in FIG. 4 include feature extraction layer 1 , feature extraction layer 2 and feature extraction layer 3 .
  • the third image is first input to feature extraction layer 1, and feature extraction layer 1 outputs the semantic feature map Feature1.
  • the semantic feature map Feature1 is stored on the one hand, and input to feature extraction layer 2 on the other.
  • feature extraction layer 2 outputs the semantic feature map Feature2
  • the semantic feature map Feature2 is stored on the one hand, and input to feature extraction layer 3 on the other.
  • feature extraction layer 3 outputs the semantic feature map Feature3, and the semantic feature map Feature3 is stored.
  • the stored semantic feature maps are spliced and input to the output layer, which outputs the semantic segmentation result; the segmentation result indicates the pixels corresponding to the road target in the third image, that is, the classification result of each pixel in the third image is obtained.
  • the later feature extraction layer extracts higher-level semantic information
  • the earlier feature extraction layer extracts lower-level semantic information.
  • the semantic feature map can reflect the correspondence between different distances and semantic feature vectors; that is, the semantic feature map includes semantic feature vectors corresponding to different distance ranges. Therefore, in the process of splicing multiple semantic feature maps, the distance factor can be taken into account.
  • the splicing process of multiple semantic feature maps can be implemented as:
  • for the target semantic feature map, the weight of the semantic feature vector corresponding to the preset target distance range in the target semantic feature map is set as the first weight, and the weight of the semantic feature vectors corresponding to other distance ranges in the target semantic feature map is set as the second weight, where the first weight is greater than the second weight; the target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to the multiple semantic feature maps are different from one another;
  • according to the set weights, the multiple semantic feature maps are spliced.
  • the first weight is denoted w1
  • the second weight is denoted w2
  • the plurality of semantic feature maps are the semantic feature map Feature1, the semantic feature map Feature2, and the semantic feature map Feature3 illustrated in FIG. 4.
  • the first camera and the second camera can photograph a range of 0 to 150 meters ahead in total.
  • the preset target distance range corresponding to the semantic feature map Feature1 is 0 to 30 meters
  • the preset target distance range corresponding to the semantic feature map Feature2 is 30 to 60 meters
  • the preset target distance range corresponding to the semantic feature map Feature3 is 60 to 150 meters.
  • in the semantic feature map Feature1, the weight of the semantic feature vector C11 corresponding to the preset target distance range of 0 to 30 meters is set to w1, and the weights of the semantic feature vectors C12 and C13 corresponding to the other distance ranges of 30 to 60 meters and 60 to 150 meters are both set to w2;
  • in the semantic feature map Feature2, the weight of the semantic feature vector C22 corresponding to the preset target distance range of 30 to 60 meters is set to w1, and the weights of the semantic feature vectors C21 and C23 corresponding to the other distance ranges of 0 to 30 meters and 60 to 150 meters are both set to w2;
  • in the semantic feature map Feature3, the weight of the semantic feature vector C33 corresponding to the preset target distance range of 60 to 150 meters is set to w1, and the weights of the semantic feature vectors C31 and C32 corresponding to the other distance ranges of 0 to 30 meters and 30 to 60 meters are both set to w2.
  • the spliced semantic feature map then contains the semantic feature vectors C1, C2 and C3, obtained as:
  • semantic feature vector C1 = w1·C11 + w2·C21 + w2·C31;
  • semantic feature vector C2 = w2·C12 + w1·C22 + w2·C32;
  • semantic feature vector C3 = w2·C13 + w2·C23 + w1·C33.
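  • a sketch of this distance-weighted splicing with NumPy follows; it assumes the feature maps have been brought to a common spatial size and that image rows correspond to distance bands, with the band boundaries and array shapes being illustrative assumptions.

```python
# Sketch of splicing multiple semantic feature maps with distance-dependent
# weights: each map's feature vectors get weight w1 inside the map's own
# preset target distance range and weight w2 elsewhere, and the weighted maps
# are summed. Shapes and distance bands are illustrative assumptions.
import numpy as np

def splice_feature_maps(feature_maps, bands, w1=0.8, w2=0.1):
    """feature_maps: arrays of shape (rows, cols, channels), already resized
    to a common shape; bands: per-map row slices covering each map's preset
    target distance range."""
    spliced = np.zeros_like(feature_maps[0])
    for fmap, band in zip(feature_maps, bands):
        weights = np.full(fmap.shape[0], w2, dtype=fmap.dtype)  # second weight
        weights[band] = w1                       # first weight in own range
        spliced += weights[:, None, None] * fmap
    return spliced

# Example: rows 0-19 ~ 0-30 m, rows 20-39 ~ 30-60 m, rows 40-99 ~ 60-150 m.
maps = [np.random.rand(100, 64, 32).astype(np.float32) for _ in range(3)]
bands = [slice(0, 20), slice(20, 40), slice(40, 100)]
spliced = splice_feature_maps(maps, bands)
```

  • with w1 = 1 and w2 = 0, this reduces to keeping each map only inside its own preset target distance range, which matches the special case mentioned below.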
  • the preset target distance ranges corresponding to the plurality of semantic feature maps become progressively farther in the order in which the maps are output.
  • for example, if the output order of the above three semantic feature maps is Feature1, Feature2, Feature3, then the preset target distance ranges corresponding to these three semantic feature maps are 0 to 30 meters, 30 to 60 meters, and 60 to 150 meters respectively, increasing in distance in turn.
  • on the one hand, the input image of the semantic segmentation model is an image obtained by fusing the images collected by different cameras and therefore contains rich semantics; on the other hand, the spliced semantic feature map strengthens the semantic features at different distances. This reduces the amount of calculation, improves recognition efficiency, and at the same time yields better recognition results.
  • FIG. 5 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present invention.
  • the target detection apparatus is set on a movable platform.
  • the target detection apparatus includes: a memory 11 and a processor 12.
  • executable code is stored on the memory 11, and when the executable code is executed by the processor 12, the processor 12 is caused to implement:
  • the first camera and the second camera are arranged on the movable platform and are respectively used to photograph the environment in different distance ranges in front of the movable platform;
  • the processor 12 is specifically configured to:
  • the processor 12 is specifically configured to: determine a top view projection matrix corresponding to the first camera; perform top-view projection on the first image according to the top view projection matrix to obtain the first top view corresponding to the first image; and perform top-view projection on the second image according to the camera extrinsic parameter matrix corresponding to the first camera and the second camera, and the top view projection matrix, to obtain the second top view corresponding to the second image under the viewing angle of the first camera.
  • the processor 12 is specifically configured to: perform a weighted sum operation on the first top view and the second top view to obtain the third image.
  • the processor 12 is specifically configured to: determine the homography matrix corresponding to the first camera; acquire first coordinates corresponding to multiple reference pixels in the first image in the image coordinate system; acquire second coordinates corresponding to the multiple reference pixels in the world coordinate system after projection to the top view; and determine the perspective transformation matrix of the first camera relative to the ground according to the first coordinates and the second coordinates;
  • the top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
  • the processor 12 is specifically configured to: determine the homography matrix according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
  • the processor 12 is specifically configured to: input the third image into a preset semantic segmentation model, so as to identify the road targets contained in the third image through the semantic segmentation model.
  • the semantic segmentation model includes: multiple cascaded feature extraction layers and an output layer.
  • the processor 12 is specifically configured to: acquire multiple semantic feature maps output by the multiple feature extraction layers; splice the multiple semantic feature maps; and input the spliced semantic feature map to the output layer, so as to obtain the semantic segmentation result output by the output layer, the semantic segmentation result indicating the pixels corresponding to the road target in the third image.
  • the processor 12 is specifically configured to: for the target semantic feature map, set the weight of the semantic feature vector corresponding to the preset target distance range in the target semantic feature map as the first weight, and set the weight of the semantic feature vectors corresponding to other distance ranges in the target semantic feature map as the second weight, where the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to the multiple semantic feature maps are different from one another; and splice the multiple semantic feature maps according to the set weights.
  • the first weight is 1, and the second weight is 0.
  • the preset target distance ranges corresponding to the plurality of semantic feature maps become progressively farther in the order in which the maps are output.
  • the road target includes any one of the following: lane lines, parking space lines, and zebra crossings.
  • FIG. 6 is a schematic structural diagram of a movable platform according to an embodiment of the present invention. As shown in FIG. 6, the movable platform includes:
  • the first camera 22 and the second camera 23 are arranged inside or outside the casing 21, and are respectively used to photograph the environment in different distance ranges in front of the movable platform;
  • the processor 24 is arranged inside the casing 21 and is coupled to the first camera 22 and the second camera 23, and is configured to acquire the first image captured by the first camera 22 and the second image captured by the second camera 23; fuse the first image and the second image into a third image; and identify the road target contained in the third image.
  • the processor 24 is specifically configured to:
  • the processor 24 is specifically configured to: determine a top view projection matrix corresponding to the first camera; perform top-view projection on the first image according to the top view projection matrix to obtain the first top view corresponding to the first image; and perform top-view projection on the second image according to the camera extrinsic parameter matrix corresponding to the first camera and the second camera, and the top view projection matrix, to obtain the second top view corresponding to the second image under the viewing angle of the first camera.
  • the processor 24 is specifically configured to: perform a weighted sum operation on the first top view and the second top view to obtain the third image.
  • the processor 24 is specifically configured to: determine the homography matrix corresponding to the first camera; acquire first coordinates corresponding to multiple reference pixels in the first image in the image coordinate system; acquire second coordinates corresponding to the multiple reference pixels in the world coordinate system after projection to the top view; and determine the perspective transformation matrix of the first camera relative to the ground according to the first coordinates and the second coordinates;
  • the top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
  • the processor 24 is specifically configured to: determine the homography matrix according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
  • the processor 24 is specifically configured to: input the third image into a preset semantic segmentation model, so as to identify the road targets contained in the third image through the semantic segmentation model.
  • the semantic segmentation model includes: multiple cascaded feature extraction layers and an output layer.
  • the processor 24 is specifically configured to: acquire multiple semantic feature maps output by the multiple feature extraction layers; splice the multiple semantic feature maps; and input the spliced semantic feature map to the output layer, so as to obtain the semantic segmentation result output by the output layer, the semantic segmentation result indicating the pixels corresponding to the road target in the third image.
  • the processor 24 is specifically configured to: for the target semantic feature map, set the weight of the semantic feature vector corresponding to the preset target distance range in the target semantic feature map as the first weight, and set the weight of the semantic feature vectors corresponding to other distance ranges in the target semantic feature map as the second weight, where the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to the multiple semantic feature maps are different from one another; and splice the multiple semantic feature maps according to the set weights.
  • the first weight is 1, and the second weight is 0.
  • the preset target distance ranges corresponding to the plurality of semantic feature maps become progressively farther in the order in which the maps are output.
  • the road target includes any one of the following: lane lines, parking space lines, and zebra crossings.
  • an embodiment of the present invention further provides a computer-readable storage medium, where executable code is stored in the computer-readable storage medium, and the executable code is used to implement the target detection methods provided by the foregoing embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Traffic Control Systems (AREA)

Abstract

The present invention provides a target detection method and apparatus, a movable platform, and a storage medium. The target detection method comprises: acquiring a first image collected by a first camera and a second image collected by a second camera, the first camera and the second camera being provided on a movable platform, and the first camera and the second camera each being used to capture the environment in different distance ranges in front of the movable platform; merging the first image and the second image into a third image; and identifying a road target included in the third image. By means of merging images collected by different cameras into an image having global semantic information and identifying a road target in the merged image, the efficiency of identification can be increased and an accurate identification result can be obtained.

Description

Target detection method, apparatus, movable platform and storage medium

Technical Field

The present invention relates to the field of artificial intelligence, and in particular, to a target detection method, apparatus, movable platform and storage medium.

Background Art

In the field of autonomous driving, in order to ensure the safe driving of autonomous vehicles, it is necessary for the vehicle to perceive the surrounding environment in a timely and accurate manner so as to make correct driving decisions. There are various targets that need to be perceived in the road traffic environment, such as lane lines; accurate and timely detection of lane lines allows the vehicle to switch accurately between the current lane and other lanes.

Taking lane line detection as an example, one current way to perform lane line detection is to set up multiple cameras on the vehicle to collect images covering different fields of view, perform lane line detection separately on the images collected by each camera, and finally fuse the detection results corresponding to the multiple cameras. This detection method is computationally expensive and inefficient, is prone to missed detections at image edges, and has poor accuracy.
Summary of the Invention

The present invention provides a target detection method, apparatus, movable platform and storage medium, which can realize efficient and accurate detection of road targets.

A first aspect of the present invention provides a target detection method, applied to a movable platform. The target detection method includes:

acquiring a first image collected by a first camera and a second image collected by a second camera, wherein the first camera and the second camera are arranged on the movable platform and are respectively used to photograph the environment in different distance ranges in front of the movable platform;

fusing the first image and the second image into a third image;

identifying road targets contained in the third image.

A second aspect of the present invention provides a target detection apparatus, provided on a movable platform. The target detection apparatus includes a memory and a processor, wherein executable code is stored in the memory, and when the executable code is executed by the processor, the processor is caused to implement:

acquiring a first image collected by a first camera and a second image collected by a second camera, wherein the first camera and the second camera are arranged on the movable platform and are respectively used to photograph the environment in different distance ranges in front of the movable platform;

fusing the first image and the second image into a third image;

identifying road targets contained in the third image.

A third aspect of the present invention provides a movable platform, comprising:

a casing;

a first camera and a second camera, arranged inside or outside the casing and respectively used to photograph the environment in different distance ranges in front of the movable platform;

a processor, located inside the casing and coupled to the first camera and the second camera, configured to acquire a first image captured by the first camera and a second image captured by the second camera; fuse the first image and the second image into a third image; and identify road targets contained in the third image.

A fourth aspect of the present invention provides a computer-readable storage medium storing executable code, the executable code being used to implement the target detection method described in the first aspect.
In the target detection solution provided by the present invention, in order to allow the movable platform (such as a vehicle) to perceive the surrounding road conditions, a first camera and a second camera are provided on the movable platform and are respectively used to photograph the environment in different distance ranges in front of the movable platform, so that road targets (such as lane lines) in front of the movable platform can be determined based on the first image captured by the first camera and the second image captured by the second camera.

In order to complete the detection of road targets more efficiently, the first image collected by the first camera and the second image collected by the second camera are fused into a third image, and road target recognition is then performed on the third image to identify the road targets it contains. The third image contains the global information of the first image and the second image, and performing road target detection on only a single third image is more efficient.
Brief Description of the Drawings

The drawings described herein are used to provide a further understanding of the present application and constitute a part of the present application. The schematic embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:

FIG. 1 is a schematic diagram of a road traffic scene provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a target detection method provided by an embodiment of the present invention;

FIG. 3 is a schematic flowchart of an image fusion process provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of the principle of an image recognition process provided by an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a movable platform provided by an embodiment of the present invention.
Detailed Description of the Embodiments

In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present invention. The terms used in the description of the present invention are for the purpose of describing specific embodiments only and are not intended to limit the present invention.

The target detection method provided by the embodiments of the present invention can be applied to the road traffic scene shown in FIG. 1, so as to detect road targets existing in front of a vehicle. As shown in FIG. 1, a first camera 101 and a second camera 102 are provided at different positions on an autonomous vehicle. The two cameras have different shooting ranges, so as to photograph the environment within different distance ranges in front of the vehicle (including the ground ahead and various objects on the ground, such as other vehicles, guardrails, etc.), so that the images collected by the two cameras can be combined to perceive certain road targets existing ahead.

For example, during the driving of an autonomous vehicle, attention needs to be paid to the various marking lines on the ground in order to make accurate control decisions in time. In this case, the first camera 101 and the second camera 102 can be used to photograph the ground at different distances in front of the vehicle, and the autonomous vehicle combines the images captured by the two cameras to identify the road marking lines on the ground at different distances ahead and to make corresponding driving control decisions.

For example, the first camera 101 is used to photograph the nearby environment, such as 0-40 meters, and the second camera 102 is used to photograph the more distant environment, such as 40-150 meters. In this way, by combining the images captured by the two cameras, lane lines at both near and far distances can be identified. Based on the recognition results for short-range lane lines, the vehicle can avoid straddling a lane line and receive accurate guidance when switching lanes; based on the recognition results for long-range lane lines, going-straight or turning control can be made in a timely and accurate manner.

It is worth noting that although the two cameras are respectively used to photograph the environment in different distance ranges in front of the vehicle, this does not limit the shooting angle of the cameras to directly ahead of the vehicle; a wider range of set angles is possible, such as the angles a1 and a2 illustrated in FIG. 1.

In practical applications, the road targets to be identified may include not only road marking lines but also other targets, such as pedestrians and vehicles, where the road marking lines include but are not limited to lane lines, zebra crossings, and parking space lines in garages or at the roadside.

The execution process of the target detection method provided by the present invention will be described in detail below with reference to the following embodiments. The target detection method provided by the embodiments of the present invention may be executed by a movable platform, and specifically by a processor provided in the movable platform. In practical applications, the movable platform includes but is not limited to various types of vehicles driving on a road.
FIG. 2 is a schematic flowchart of a target detection method provided by an embodiment of the present invention. As shown in FIG. 2, the target detection method may include the following steps:

201. Acquire a first image collected by a first camera and a second image collected by a second camera, where the first camera and the second camera are arranged on the movable platform and are respectively used to photograph the environment in different distance ranges in front of the movable platform.

In practical applications, to ensure that the movable platform can perceive the environment within different distance ranges in front of it, multiple cameras for sensing the environment at different distances ahead can be provided on the movable platform. The multiple cameras can be two cameras or more than two. When three or more cameras are provided, the execution principle is the same, so only two cameras are used as an example in the embodiments of the present invention.

For example, the above first camera and second camera are provided on the movable platform, the first camera being used to photograph the environment within a range of 0 to 40 meters in front of the movable platform, and the second camera being used to photograph the environment within a range of 40 to 150 meters in front of the movable platform.

It is worth noting that, in order for the movable platform to perceive the environment within the range of 0 to 150 meters ahead, the shooting distances of the two cameras should be "seamless". For example, if the first camera is set to photograph the environment within a range of 0 to 30 meters in front of the movable platform, and the second camera is used to photograph the environment within a range of 40 to 150 meters in front of the movable platform, then the range of 30 to 40 meters ahead is left out and a gap appears. Of course, to ensure seamlessness, the shooting distances of the two cameras may partially overlap; for example, the first camera can be set to photograph the environment within a range of 0 to 40 meters in front of the movable platform, and the second camera to photograph the environment within a range of 35 to 150 meters in front of the platform.

In addition, the first camera and the second camera acquire images synchronously.
202. Fuse the first image and the second image into a third image.

After the first camera collects the first image corresponding to a certain distance range (such as 0-40 meters) and the second camera collects the second image corresponding to a certain distance range (such as 40-150 meters), the two cameras send their respective collected images to the processor provided in the movable platform, and the processor completes the fusion of the first image and the second image, that is, splices the first image and the second image into a third image. The third image contains the global information of the first image and the second image; in short, the third image includes all feature information of the environment within the distance range of 0 to 150 meters.

As mentioned above, the road targets that the movable platform needs to identify can be various road marking lines on the ground. In order to make these marking lines more prominent and weaken the interference of other objects on the road, optionally, the first image and the second image can be fused from a top-view perspective. In this case, a first top view corresponding to the first image and a second top view corresponding to the second image under the viewing angle of the first camera can be acquired, and the first top view and the second top view are then merged into the third image.

The first image and the second image are actually front views; when the recognition task is to identify road marking lines on the ground, they can be converted into top views, and the top views can be fused. In addition, when performing image fusion, the image captured by one camera can be converted to the viewing angle of the other camera, and the two top views corresponding to the viewing angle of the same camera can be fused. In this embodiment, it is assumed that the second camera is projected to the top-view perspective of the first camera, that is, the second top view corresponding to the second image under the viewing angle of the first camera needs to be obtained. The first top view corresponding to the first image is the top view corresponding to the first image under the viewing angle of the first camera.

An optional implementation for obtaining the above first top view and second top view will be described in detail below. It is assumed here that the first top view and the second top view have been obtained; the first top view and the second top view are then merged to obtain the third image. Optionally, a weighted sum operation may be performed on the first top view and the second top view to obtain the third image.

Specifically, assuming that the first top view is denoted Imgbv1, the second top view Imgbv2, and the third image Imgbv_stitch, then Imgbv_stitch = a·Imgbv2 + (1-a)·Imgbv1, where a is a preset weight, 0 < a < 1.
融合后的第三图像具备两个相机的共同视野,并且能相互关联,相当于拥有全局信息,语义信息更丰富。The fused third image has the common field of view of the two cameras and can be correlated with each other, which is equivalent to having global information and richer semantic information.
203、识别第三图像中包含的道路目标。203. Identify the road target included in the third image.
在得到融合后的第三图像后,对该第三图像进行道路目标的识别,基于第三图像中包含的丰富语义信息,可以准确识别出其中包含的道路目标。After the fused third image is obtained, the road target is identified on the third image, and the road target contained in the third image can be accurately identified based on the rich semantic information contained in the third image.
其中,识别第三图像中包含的道路目标,其实就是确定第三图像中哪些像素是与道路目标对应的。举例来说,假设第三图像中包括目标1和目标2,识别的目的就是识别出第三图像中哪些像素对应于目标1,哪些像素对应于目标2。Among them, identifying the road target included in the third image is actually determining which pixels in the third image correspond to the road target. For example, assuming that the third image includes target 1 and target 2, the purpose of recognition is to identify which pixels in the third image correspond to target 1 and which pixels correspond to target 2.
基于上述识别结果,进一步结合图像坐标系与世界坐标系的转换关系,可以得到在实际物理场景中,上述目标1、目标2分别与可移动平台之间的相对位置关系。Based on the above recognition results, and further combining the conversion relationship between the image coordinate system and the world coordinate system, the relative positional relationship between the above target 1 and target 2 and the movable platform in the actual physical scene can be obtained.
由上述介绍可知,上述识别任务实际上是一种语义分割任务(确定图像中像素所对应的类别)。因此,可选地,可以将第三图像输入到预设的语义 分割模型中,以通过该语义分割模型识别出第三图像中包含的道路目标。As can be seen from the above introduction, the above recognition task is actually a semantic segmentation task (determining the category corresponding to the pixels in the image). Therefore, optionally, the third image can be input into a preset semantic segmentation model, so as to identify the road target contained in the third image through the semantic segmentation model.
实际应用中,该语义分割模型可以实现为一种神经网络模型,比如卷积神经网络(Convolutional Neural Network,简称CNN)模型;残差网络(Residual Network,简称ResNet)模型,如ResNet-18,DLA-34模型,等等。In practical applications, the semantic segmentation model can be implemented as a neural network model, such as a Convolutional Neural Network (CNN) model; a Residual Network (ResNet) model, such as ResNet-18, DLA -34 models, etc.
从组成单元的角度上说,该语义分割模型可以包括特征提取层和输出层,实际应用中,特征提取层可以包括诸如卷积层、下采样层、激活函数等,输出层可以包括一个或多个卷积层。顾名思义,特征提取层用于提取输入图像的语义特征,以得到语义特征图(通常也可以简称为特征图),输出层用于对语义特征图进行解析,以输入每个像素对应的类别识别结果:是否对应于某个道路目标。From the perspective of constituent units, the semantic segmentation model may include a feature extraction layer and an output layer. In practical applications, the feature extraction layer may include convolution layers, downsampling layers, activation functions, etc., and the output layer may include one or more a convolutional layer. As the name suggests, the feature extraction layer is used to extract the semantic features of the input image to obtain a semantic feature map (usually it can also be referred to as a feature map), and the output layer is used to parse the semantic feature map to input the category recognition result corresponding to each pixel. : Whether it corresponds to a road target.
综上，将不同相机采集的图像进行融合，得到包含全局语义信息的图像，之后，仅针对融合后的图像进行道路目标的识别，而不需针对每个相机采集的图像都进行道路目标的识别，可以提高识别效率。同时，由于融合后的图像中包含全局的语义信息，可以保证得到准确的识别结果。In summary, the images collected by different cameras are fused to obtain an image containing global semantic information; road target recognition is then performed only on the fused image, rather than on each camera's image separately, which improves recognition efficiency. At the same time, since the fused image contains global semantic information, accurate recognition results can be guaranteed.
下面结合图3所示实施例对上文中第一图像和第二图像的融合过程进行示例性说明。The fusion process of the first image and the second image above will be exemplarily described below with reference to the embodiment shown in FIG. 3 .
图3为本发明实施例提供的一种图像融合过程的流程示意图,如图3所示,该融合过程可以包括如下步骤:FIG. 3 is a schematic flowchart of an image fusion process provided by an embodiment of the present invention. As shown in FIG. 3 , the fusion process may include the following steps:
301、确定第一相机对应的俯视图投影矩阵。301. Determine a top view projection matrix corresponding to the first camera.
302、根据俯视图投影矩阵对第一图像进行俯视图投影,以得到第一图像对应的第一俯视图。302. Perform a plan view projection on the first image according to the plan view projection matrix, so as to obtain a first plan view corresponding to the first image.
303、根据对应于第一相机和第二相机的相机外参矩阵,以及俯视图投影矩阵,对第二图像进行俯视图投影,以得到第二图像在第一相机的视角下对应的第二俯视图。303. Perform a top view projection on the second image according to the camera extrinsic parameter matrix corresponding to the first camera and the second camera, and the top view projection matrix, to obtain a second top view corresponding to the second image from the perspective of the first camera.
304、对第一俯视图和第二俯视图进行加权求和运算,以得到第三图像。304. Perform a weighted sum operation on the first top view and the second top view to obtain a third image.
如前文所述，第一相机采集的第一图像实际上是前视图，要想将该前视图转换为俯视图，需要先确定由该前视图转换为俯视图所用到的投影矩阵，称为俯视图投影矩阵。As mentioned above, the first image captured by the first camera is actually a front view. To convert this front view into a top view, the projection matrix used for the conversion must first be determined; it is called the top view projection matrix.
第一相机对应的俯视图投影矩阵由两个矩阵确定：第一相机的单应性矩阵（homography矩阵）以及第一相机相对地面的透视变换矩阵（PerspectiveTransform矩阵）。其中，透视变换矩阵用于将相机采集的图像在图像坐标系下的坐标投影到世界坐标系下。The top view projection matrix corresponding to the first camera is determined by two matrices: the homography matrix of the first camera and the perspective transformation matrix (PerspectiveTransform matrix) of the first camera relative to the ground. The perspective transformation matrix is used to project the coordinates of the captured image from the image coordinate system into the world coordinate system.
具体地,如果将第一相机对应的俯视图投影矩阵表示为tranMFront2Top,将第一相机的单应性矩阵表示为Hg2im,将上述透视变换矩阵表示为PerspectiveTransform。Specifically, if the top view projection matrix corresponding to the first camera is represented as tranMFront2Top, the homography matrix of the first camera is represented as Hg2im, and the above perspective transformation matrix is represented as PerspectiveTransform.
那么，tranMFront2Top=Hg2im·PerspectiveTransform Then, tranMFront2Top = Hg2im·PerspectiveTransform
即第一相机的单应性矩阵与透视变换矩阵相乘,得到第一相机对应的俯视图投影矩阵。That is, the homography matrix of the first camera is multiplied by the perspective transformation matrix to obtain the top view projection matrix corresponding to the first camera.
其中,可以通过如下方式确定第一相机对应的单应性矩阵:Wherein, the homography matrix corresponding to the first camera can be determined in the following manner:
根据第一相机的相机内参矩阵、第一相机相对地面的平移矩阵、第一相机相对地面的旋转矩阵,确定单应性矩阵。The homography matrix is determined according to the camera internal parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
其中，第一相机相对地面的旋转矩阵包括如下三种矩阵：第一相机对地俯仰角（Pitch）的旋转矩阵，表示为RPitch1；第一相机对地航向角（Yaw）的旋转矩阵，表示为RYaw1；第一相机对地横滚角（Roll）的旋转矩阵，表示为RRoll1。The rotation matrix of the first camera relative to the ground consists of the following three matrices: the rotation matrix for the pitch angle of the first camera relative to the ground, denoted RPitch1; the rotation matrix for the yaw angle of the first camera relative to the ground, denoted RYaw1; and the rotation matrix for the roll angle of the first camera relative to the ground, denoted RRoll1.
其中,第一相机相对地面的平移矩阵,是指第一相机对地高度的平移矩阵,表示为Th1。第一相机的相机内参矩阵表示为:P1。The translation matrix of the first camera relative to the ground refers to the translation matrix of the height of the first camera relative to the ground, which is represented as Th1. The camera intrinsic parameter matrix of the first camera is represented as: P1.
以上相机内参矩阵、平移矩阵、旋转矩阵都是预先确定好的。The above camera internal parameter matrix, translation matrix, and rotation matrix are all predetermined.
基于上述假设,Hg2im=P1·RRoll1·RYaw1·RPitch1·Th1Based on the above assumptions, Hg2im=P1·RRoll1·RYaw1·RPitch1·Th1
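The composition above can be sketched in Python as follows. The rotation-axis conventions and the reading of Th1 as a 3x3 matrix that lifts a ground-plane point to the camera height are assumptions made for illustration; the text does not fix these conventions.

```python
import numpy as np

def homography_g2im(P1, pitch, yaw, roll, h):
    """Hg2im = P1 . RRoll1 . RYaw1 . RPitch1 . Th1 (all 3x3 matrices)."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cr, sr = np.cos(roll), np.sin(roll)
    R_pitch = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])  # about x
    R_yaw   = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])  # about y
    R_roll  = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])  # about z
    # Th1: lift a ground-plane point (X, Y, 1) to (X, Y, h), placing it at
    # the camera height h. This is one possible reading of the "translation
    # matrix of the camera height relative to the ground".
    T_h = np.diag([1.0, 1.0, float(h)])
    return P1 @ R_roll @ R_yaw @ R_pitch @ T_h
```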
可以通过如下方式得到第一相机相对地面的透视变换矩阵:The perspective transformation matrix of the first camera relative to the ground can be obtained as follows:
获取第一图像中多个参考像素点在图像坐标系中各自对应的第一坐标;obtaining the first coordinates corresponding to the plurality of reference pixels in the first image in the image coordinate system;
获取所述多个参考像素点在投影到俯视图后在世界坐标系中各自对应的第二坐标;acquiring the respective second coordinates corresponding to the plurality of reference pixels in the world coordinate system after being projected to the top view;
根据所述第一坐标和所述第二坐标，确定第一相机相对地面的透视变换矩阵。According to the first coordinates and the second coordinates, a perspective transformation matrix of the first camera relative to the ground is determined.
实际应用中，可选地，上述多个参考像素点可以是第一图像的四个顶点像素。举例来说，假设第一图像是720*1280的图像，左上角顶点像素在图像坐标系中对应的第一坐标为(0,0)，右上角顶点像素在图像坐标系中对应的第一坐标为(720,0)，左下角顶点像素在图像坐标系中对应的第一坐标为(0,1280)，右下角顶点像素在图像坐标系中对应的第一坐标为(720,1280)。In practical applications, optionally, the multiple reference pixels may be the four corner pixels of the first image. For example, assuming the first image is a 720*1280 image, the first coordinate of the top-left corner pixel in the image coordinate system is (0,0), that of the top-right corner pixel is (720,0), that of the bottom-left corner pixel is (0,1280), and that of the bottom-right corner pixel is (720,1280).
之后标定出上述四个顶点像素在俯视图中对应的第二坐标,该第二坐标表示俯视图在世界坐标系下看的位置。Then, the second coordinates corresponding to the above-mentioned four vertex pixels in the top view are determined, and the second coordinates represent the position of the top view in the world coordinate system.
简单来说,就是给定透视变换对应的四对像素点坐标,即可以求得透视变换矩阵。这四对像素点坐标即为上述四个顶点在前视图和俯视图中分别对应的坐标。Simply put, the perspective transformation matrix can be obtained by giving four pairs of pixel coordinates corresponding to the perspective transformation. The coordinates of the four pairs of pixel points are the coordinates corresponding to the above four vertices in the front view and the top view respectively.
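With OpenCV, the four point pairs yield the matrix directly, as in the sketch below; the world-coordinate values are placeholders, since the real second coordinates come from calibration of the specific camera installation.

```python
import cv2
import numpy as np

# First coordinates: the four corner pixels of the 720*1280 front view.
src = np.float32([[0, 0], [720, 0], [0, 1280], [720, 1280]])
# Second coordinates: calibrated positions of the same four points on the
# ground plane in the world coordinate system (illustrative values only).
dst = np.float32([[-20.0, 60.0], [20.0, 60.0], [-2.5, 5.0], [2.5, 5.0]])

perspective_transform = cv2.getPerspectiveTransform(src, dst)  # 3x3 matrix
```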
在通过上述方式得到第一相机对应的俯视图投影矩阵后，使用该矩阵对第一图像进行俯视图投影，便可以得到第一图像对应的第一俯视图。假设将第一俯视图表示为Imgbv1，那么：After the top view projection matrix corresponding to the first camera is obtained in the above manner, performing top view projection on the first image with this matrix yields the first top view corresponding to the first image. Denoting the first top view as Imgbv1:
Imgbv1=warpPerspective(Imgfv1,tranMFront2Top)Imgbv1 = warpPerspective(Imgfv1, tranMFront2Top)
其中,Imgfv1表示第一相机采集的前视图,亦即第一图像。warpPerspective表示俯视图投影函数,可以是预设的某种可以实现俯视图投影的函数。上述公式意味着,以第一图像和俯视图变换矩阵tranMFront2Top作为该俯视图投影函数的输入,以实现对第一图像的俯视图投影。Wherein, Imgfv1 represents the front view captured by the first camera, that is, the first image. warpPerspective represents the top view projection function, which can be a preset function that can realize top view projection. The above formula means that the first image and the top view transformation matrix tranMFront2Top are used as the input of the top view projection function, so as to realize the top view projection of the first image.
针对第二相机采集的第二图像,可以通过如下方式得到对应的第二俯视图Imgbv2。For the second image collected by the second camera, the corresponding second top view Imgbv2 may be obtained in the following manner.
Imgbv2=warpPerspective(Imgfv2, tranMFront2Top·M21)
其中：warpPerspective为上述俯视图投影函数，Imgfv2表示第二相机采集的前视图，亦即第二图像，M21表示对应于第一相机和第二相机的相机外参矩阵（原文中以公式图片给出，此处以M21代称）。Here warpPerspective is the top view projection function described above, Imgfv2 denotes the front view captured by the second camera, i.e., the second image, and M21 denotes the camera extrinsic matrix corresponding to the first camera and the second camera (given as a formula image in the original; M21 is used here as a stand-in name).
之后，可以根据如下方式融合第一俯视图Imgbv1和第二俯视图Imgbv2，以得到第三图像Imgbv_stitch：Afterwards, the first top view Imgbv1 and the second top view Imgbv2 can be fused as follows to obtain the third image Imgbv_stitch:
Imgbv_stitch=a·Imgbv2+(1-a)·Imgbv1。其中，a为预设权重，0&lt;a&lt;1。 Imgbv_stitch = a·Imgbv2 + (1-a)·Imgbv1, where a is a preset weight with 0&lt;a&lt;1.
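Putting steps 302 to 304 together, a hedged OpenCV sketch might read as follows; M21 stands for the inter-camera extrinsic factor that the original gives only as a formula image, so its exact form is an assumption here.

```python
import cv2

def fuse_top_views(img_fv1, img_fv2, tranM_front2top, M21,
                   a=0.5, out_size=(800, 800)):
    # Step 302: project the first front view into the top view.
    imgbv1 = cv2.warpPerspective(img_fv1, tranM_front2top, out_size)
    # Step 303: project the second front view into the first camera's
    # top view via the extrinsic factor M21 (assumed 3x3).
    imgbv2 = cv2.warpPerspective(img_fv2, tranM_front2top @ M21, out_size)
    # Step 304: Imgbv_stitch = a*Imgbv2 + (1-a)*Imgbv1, with 0 < a < 1.
    return cv2.addWeighted(imgbv2, a, imgbv1, 1.0 - a, 0.0)
```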
以上介绍了一种图像融合的实现方式,实际上,图像融合(或者说图像拼接)还可以有其他实现方式,不以上述举例为限。An implementation manner of image fusion has been introduced above. In fact, image fusion (or image stitching) may also have other implementation manners, which are not limited to the above examples.
上文中提到,在得到融合后的第三图像后,可以使用语义分割模型来识别第三图像中包含的道路目标。为进一步提高道路目标识别结果的准确性,本发明实施例中还提供了如图4所示的图像识别方案。As mentioned above, after the fused third image is obtained, the semantic segmentation model can be used to identify the road objects contained in the third image. In order to further improve the accuracy of the road target recognition result, an image recognition scheme as shown in FIG. 4 is also provided in the embodiment of the present invention.
如图4所示,语义分割模型中可以包括级联的多个特征提取层,以及输出层。在图4中示意的多个特征提取层包括特征提取层1、特征提取层2和特征提取层3。As shown in Figure 4, the semantic segmentation model can include multiple cascaded feature extraction layers and an output layer. The plurality of feature extraction layers illustrated in FIG. 4 include feature extraction layer 1 , feature extraction layer 2 and feature extraction layer 3 .
第三图像先输入到特征提取层1，由特征提取层1输出语义特征图Feature1。之后，一方面将语义特征图Feature1存储下来，另一方面，将语义特征图Feature1输入到特征提取层2。假设特征提取层2输出语义特征图Feature2，同样地，一方面将语义特征图Feature2存储下来，另一方面，将语义特征图Feature2输入到特征提取层3。假设特征提取层3输出语义特征图Feature3，存储语义特征图Feature3。The third image is first input to feature extraction layer 1, which outputs the semantic feature map Feature1. Feature1 is stored and, at the same time, passed on to feature extraction layer 2. Suppose feature extraction layer 2 outputs the semantic feature map Feature2; likewise, Feature2 is stored and passed on to feature extraction layer 3. Suppose feature extraction layer 3 outputs the semantic feature map Feature3, which is also stored.
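The cascade-and-store pattern of FIG. 4 can be sketched as follows; the convolution sizes are illustrative only.

```python
import torch.nn as nn

class CascadedExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.layer3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.layer1(x)    # Feature1: stored and passed on
        f2 = self.layer2(f1)   # Feature2: stored and passed on
        f3 = self.layer3(f2)   # Feature3: stored
        return [f1, f2, f3]    # all three kept for the splicing step
```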
在得到上述多个特征提取层输出的多个语义特征图后，拼接多个语义特征图，将拼接后得到的语义特征图Feature输入到输出层，以获取输出层输出的语义分割结果，该语义分割结果指示出第三图像中对应于道路目标的像素，即得到第三图像中各像素的类别识别结果。After the multiple semantic feature maps output by the above feature extraction layers are obtained, they are spliced, and the spliced semantic feature map Feature is input to the output layer to obtain the semantic segmentation result output by the output layer. The semantic segmentation result indicates the pixels in the third image that correspond to road targets, i.e., the category recognition result for each pixel of the third image.
可以理解的是,上述多个特征提取层中,越靠后的特征提取层所提取到的是越高层的语义信息,越靠前的特征提取层所提取到的是越低层的语义信息,将不同特征提取层提取到的不同尺度的语义特征图进行拼接,可以得到语义丰富的特征图,有助于提高识别结果的准确性。It can be understood that, among the above feature extraction layers, the later feature extraction layer extracts higher-level semantic information, and the earlier feature extraction layer extracts lower-level semantic information. By splicing the semantic feature maps of different scales extracted by different feature extraction layers, a feature map with rich semantics can be obtained, which helps to improve the accuracy of the recognition results.
在本发明实施例中，由于在可移动平台上设置不同相机的目的是感知前方不同距离范围内存在的道路目标，因此，在该场景下，语义特征图中可以反映出不同距离与语义特征向量间的对应关系，也就是说，语义特征图中包括了不同距离所对应的语义特征向量。因此，在多个语义特征图拼接的过程中，可以结合距离因素来拼接。In the embodiment of the present invention, since the purpose of arranging different cameras on the movable platform is to perceive road targets within different distance ranges ahead, the semantic feature map in this scenario can reflect the correspondence between distances and semantic feature vectors; that is, the semantic feature map includes semantic feature vectors corresponding to different distances. Therefore, the distance factor can be taken into account when splicing the multiple semantic feature maps.
因此,可选地,多个语义特征图的拼接过程可以实现为:Therefore, optionally, the splicing process of multiple semantic feature maps can be implemented as:
对于目标语义特征图,将目标语义特征图中与预设目标距离范围对应的语义特征向量的权重设为第一权重,将目标语义特征图中与其他距离范围对应的语义特征向量的权重设为第二权重,第一权重大于第二权重,目标语义特征图是多个语义特征图中的任一个,多个语义特征图各自对应的预设目标距离范围不同;For the target semantic feature map, set the weight of the semantic feature vector corresponding to the preset target distance range in the target semantic feature map as the first weight, and set the weight of the semantic feature vector corresponding to other distance ranges in the target semantic feature map as the second weight, the first weight is greater than the second weight, the target semantic feature map is any one of multiple semantic feature maps, and the preset target distance ranges corresponding to each of the multiple semantic feature maps are different;
根据设置的权重,对多个语义特征图进行拼接。According to the set weights, multiple semantic feature maps are spliced.
为便于理解，举例来说，假设第一权重表示为w1，第二权重表示为w2，并假设多个语义特征图是图4中示意的语义特征图Feature1、语义特征图Feature2和语义特征图Feature3，以及假设第一相机和第二相机一共能够拍摄前方0~150米的范围。另外，假设与语义特征图Feature1对应的预设目标距离范围为0~30米，与语义特征图Feature2对应的预设目标距离范围为30~60米，与语义特征图Feature3对应的预设目标距离范围为60~150米。For ease of understanding, suppose the first weight is denoted w1 and the second weight w2, that the multiple semantic feature maps are the maps Feature1, Feature2 and Feature3 illustrated in FIG. 4, and that the first and second cameras together can cover a range of 0 to 150 meters ahead. In addition, suppose the preset target distance range corresponding to Feature1 is 0-30 meters, that corresponding to Feature2 is 30-60 meters, and that corresponding to Feature3 is 60-150 meters.
基于上述假设,上述三个语义特征图的权重设置结果如下:Based on the above assumptions, the weight setting results of the above three semantic feature maps are as follows:
对于语义特征图Feature1，语义特征图Feature1中与预设目标距离范围0~30米对应的语义特征向量C11的权重设为w1，与其他距离范围30~60米以及60~150米分别对应的语义特征向量C12和语义特征向量C13的权重均设为w2；For the semantic feature map Feature1, the weight of the semantic feature vector C11 corresponding to the preset target distance range of 0-30 meters is set to w1, while the weights of the semantic feature vectors C12 and C13, corresponding to the other distance ranges of 30-60 meters and 60-150 meters respectively, are both set to w2;
对于语义特征图Feature2，语义特征图Feature2中与预设目标距离范围30~60米对应的语义特征向量C22的权重设为w1，与其他距离范围0~30米以及60~150米分别对应的语义特征向量C21和语义特征向量C23的权重均设为w2；For the semantic feature map Feature2, the weight of the semantic feature vector C22 corresponding to the preset target distance range of 30-60 meters is set to w1, while the weights of the semantic feature vectors C21 and C23, corresponding to the other distance ranges of 0-30 meters and 60-150 meters respectively, are both set to w2;
对于语义特征图Feature3，语义特征图Feature3中与预设目标距离范围60~150米对应的语义特征向量C33的权重设为w1，与其他距离范围0~30米以及30~60米分别对应的语义特征向量C31和语义特征向量C32的权重均设为w2。For the semantic feature map Feature3, the weight of the semantic feature vector C33 corresponding to the preset target distance range of 60-150 meters is set to w1, while the weights of the semantic feature vectors C31 and C32, corresponding to the other distance ranges of 0-30 meters and 30-60 meters respectively, are both set to w2.
基于上述权重设置结果，对三个语义特征图进行拼接，假设拼接后得到的语义特征图Feature中与距离范围0~30米、30~60米以及60~150米分别对应的语义特征向量表示为：语义特征向量C1、语义特征向量C2和语义特征向量C3，则有：Based on the above weight settings, the three semantic feature maps are spliced. Suppose the semantic feature vectors in the spliced semantic feature map Feature corresponding to the distance ranges of 0-30 meters, 30-60 meters and 60-150 meters are denoted C1, C2 and C3 respectively; then:
语义特征向量C1=w1*语义特征向量C11+w2*语义特征向量C21+w2*语义特征向量C31;Semantic feature vector C1=w1*semantic feature vector C11+w2*semantic feature vector C21+w2*semantic feature vector C31;
语义特征向量C2=w2*语义特征向量C12+w1*语义特征向量C22+w2*语义特征向量C32;Semantic feature vector C2=w2*semantic feature vector C12+w1*semantic feature vector C22+w2*semantic feature vector C32;
语义特征向量C3=w2*语义特征向量C13+w2*语义特征向量C23+w1*语义特征向量C33。Semantic feature vector C3=w2*semantic feature vector C13+w2*semantic feature vector C23+w1*semantic feature vector C33.
其中,0≤w2<w1≤1。可选地,可以设置w1=1,w2=0。Among them, 0≤w2<w1≤1. Optionally, w1=1, w2=0 can be set.
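A sketch of this weighted splicing is given below. It assumes the three feature maps share a common (N, C, H, W) shape and that each preset target distance range corresponds to a band of rows of the top-view feature map; both are illustrative assumptions.

```python
import torch

def splice(feature_maps, bands, w1=1.0, w2=0.0):
    """feature_maps: [Feature1, Feature2, Feature3], each (N, C, H, W).
    bands: one row slice per preset target distance range, e.g.
    [slice(0, 40), slice(40, 80), slice(80, 200)]."""
    out = torch.zeros_like(feature_maps[0])
    for b, rows in enumerate(bands):            # distance band b
        for m, f in enumerate(feature_maps):    # feature map m
            w = w1 if m == b else w2            # a map's own band gets w1
            out[:, :, rows, :] += w * f[:, :, rows, :]
    return out
```

With w1=1 and w2=0 this reduces to taking each distance band from the feature map whose preset range matches it, as in the example above.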
由上述举例可知，按照多个语义特征图的输出顺序，多个语义特征图各自对应的预设目标距离范围依次变远。如上述举例中，上述三个语义特征图的输出顺序依次是语义特征图Feature1、语义特征图Feature2、语义特征图Feature3，那么这三个语义特征图对应的预设目标距离范围分别是：0~30米、30~60米、60~150米，呈现依次变远的趋势。As can be seen from the above example, in the output order of the multiple semantic feature maps, their corresponding preset target distance ranges become successively farther. In the example, the three semantic feature maps are output in the order Feature1, Feature2, Feature3, and their corresponding preset target distance ranges are 0-30 meters, 30-60 meters and 60-150 meters respectively, i.e., successively farther.
之所以呈现这样的趋势是因为越靠后输出的语义特征图中包含的是越高层次的语义信息，而越高层次的语义信息往往对应于越远距离的环境，也就是说越远处的道路目标越需要更高层次的语义信息。This trend arises because semantic feature maps output later contain higher-level semantic information, and higher-level semantic information tends to correspond to the environment at greater distances; in other words, the farther away a road target is, the more it requires higher-level semantic information.
综上，在输入图像（即语义分割模型的输入图像）是对不同相机各自采集的图像进行融合得到的图像的基础上，在特征提取层，又进一步进行多个语义特征图的拼接，一方面使得输入图像中包含丰富的语义，另一方面使得拼接后的语义特征图中能够强化不同距离的语义特征，在降低计算量、提高识别效率的同时，可以获得更佳的识别结果。In summary, on the basis that the input image (i.e., the input to the semantic segmentation model) is obtained by fusing the images captured by different cameras, multiple semantic feature maps are further spliced at the feature extraction layers. On the one hand, this makes the input image semantically rich; on the other hand, the spliced semantic feature map reinforces the semantic features at different distances. This reduces the amount of computation and improves recognition efficiency while yielding better recognition results.
图5为本发明实施例提供的一种目标检测装置的结构示意图，该目标检测装置设于可移动平台，如图5所示，该目标检测装置包括：存储器11、处理器12。其中，存储器11上存储有可执行代码，当所述可执行代码被处理器12执行时，使处理器12实现：FIG. 5 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present invention. The target detection apparatus is disposed on a movable platform. As shown in FIG. 5, the target detection apparatus includes a memory 11 and a processor 12, where executable code is stored on the memory 11 and, when executed by the processor 12, causes the processor 12 to implement:
获取第一相机采集的第一图像以及第二相机采集的第二图像，其中，所述第一相机和所述第二相机设置于所述可移动平台上，所述第一相机和所述第二相机分别用于拍摄所述可移动平台前方不同距离范围的环境；Acquiring a first image captured by a first camera and a second image captured by a second camera, where the first camera and the second camera are disposed on the movable platform and are respectively used to photograph the environment at different distance ranges in front of the movable platform;
将所述第一图像和所述第二图像融合为第三图像;fusing the first image and the second image into a third image;
识别所述第三图像中包含的道路目标。Road objects contained in the third image are identified.
可选地,在将所述第一图像和所述第二图像融合为第三图像的过程中,所述处理器12具体用于:Optionally, in the process of fusing the first image and the second image into a third image, the processor 12 is specifically configured to:
获取所述第一图像对应的第一俯视图，以及所述第二图像在所述第一相机的视角下对应的第二俯视图；将所述第一俯视图和所述第二俯视图融合为所述第三图像。Acquire a first top view corresponding to the first image and a second top view corresponding to the second image under the viewing angle of the first camera, and fuse the first top view and the second top view into the third image.
其中，可选地，所述处理器12具体用于：确定所述第一相机对应的俯视图投影矩阵；根据所述俯视图投影矩阵对所述第一图像进行俯视图投影，以得到所述第一图像对应的第一俯视图；根据对应于所述第一相机和所述第二相机的相机外参矩阵，以及所述俯视图投影矩阵，对所述第二图像进行俯视图投影，以得到所述第二图像在所述第一相机的视角下对应的第二俯视图。Optionally, the processor 12 is specifically configured to: determine a top view projection matrix corresponding to the first camera; perform top view projection on the first image according to the top view projection matrix to obtain a first top view corresponding to the first image; and perform top view projection on the second image according to the camera extrinsic matrix corresponding to the first camera and the second camera and the top view projection matrix, to obtain a second top view corresponding to the second image under the viewing angle of the first camera.
其中,可选地,所述处理器12具体用于:对所述第一俯视图和所述第二俯视图进行加权求和运算,以得到所述第三图像。Wherein, optionally, the processor 12 is specifically configured to: perform a weighted sum operation on the first top view and the second top view to obtain the third image.
其中，可选地，在确定所述第一相机对应的俯视图投影矩阵的过程中，所述处理器12具体用于：确定所述第一相机对应的单应性矩阵；获取所述第一图像中多个参考像素点在图像坐标系中各自对应的第一坐标；获取所述多个参考像素点在投影到俯视图后在世界坐标系中各自对应的第二坐标；根据所述第一坐标和所述第二坐标，确定所述第一相机相对地面的透视变换矩阵；Optionally, in determining the top view projection matrix corresponding to the first camera, the processor 12 is specifically configured to: determine a homography matrix corresponding to the first camera; acquire first coordinates, in the image coordinate system, of multiple reference pixels in the first image; acquire second coordinates, in the world coordinate system, of the multiple reference pixels after projection to the top view; and determine a perspective transformation matrix of the first camera relative to the ground according to the first coordinates and the second coordinates;
根据所述单应性矩阵和所述透视变换矩阵,确定所述俯视图投影矩阵。The top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
可选地，所述处理器12具体用于：根据所述第一相机的相机内参矩阵、所述第一相机相对地面的平移矩阵、所述第一相机相对地面的旋转矩阵，确定所述单应性矩阵。Optionally, the processor 12 is specifically configured to determine the homography matrix according to the camera intrinsic matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
可选地，在识别所述第三图像中包含的道路目标的过程中，所述处理器12具体用于：将所述第三图像输入到预设的语义分割模型中，以通过所述语义分割模型识别出所述第三图像中包含的道路目标。Optionally, in identifying the road target contained in the third image, the processor 12 is specifically configured to input the third image into a preset semantic segmentation model, so that the road target contained in the third image is identified by the semantic segmentation model.
可选地，所述语义分割模型中包括：级联的多个特征提取层和输出层。基于此，所述处理器12具体用于：获取所述多个特征提取层输出的多个语义特征图；拼接所述多个语义特征图；将拼接后的语义特征图输入到所述输出层，以获取所述输出层输出的语义分割结果，所述语义分割结果指示出所述第三图像中对应于道路目标的像素。Optionally, the semantic segmentation model includes multiple cascaded feature extraction layers and an output layer. Based on this, the processor 12 is specifically configured to: acquire multiple semantic feature maps output by the multiple feature extraction layers; splice the multiple semantic feature maps; and input the spliced semantic feature map to the output layer to obtain the semantic segmentation result output by the output layer, the semantic segmentation result indicating the pixels corresponding to the road target in the third image.
其中，可选地，在拼接所述多个语义特征图的过程中，所述处理器12具体用于：对于目标语义特征图，将所述目标语义特征图中与预设目标距离范围对应的语义特征向量的权重设为第一权重，将所述目标语义特征图中与其他距离范围对应的语义特征向量的权重设为第二权重，所述第一权重大于所述第二权重，所述目标语义特征图是所述多个语义特征图中的任一个，所述多个语义特征图各自对应的预设目标距离范围不同；根据设置的权重，对所述多个语义特征图进行拼接。Optionally, in splicing the multiple semantic feature maps, the processor 12 is specifically configured to: for a target semantic feature map, set the weight of the semantic feature vector corresponding to a preset target distance range in the target semantic feature map as a first weight, and set the weights of the semantic feature vectors corresponding to other distance ranges in the target semantic feature map as a second weight, where the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to the multiple semantic feature maps are different; and splice the multiple semantic feature maps according to the set weights.
可选地,所述第一权重为1,所述第二权重为0。Optionally, the first weight is 1, and the second weight is 0.
可选地,按照所述多个语义特征图的输出顺序,所述多个语义特征图各自对应的预设目标距离范围依次变远。Optionally, according to the output order of the plurality of semantic feature maps, the preset target distance ranges corresponding to each of the plurality of semantic feature maps sequentially become farther.
可选地,所述道路目标包括如下任一种:车道线、车位线、斑马线。Optionally, the road target includes any one of the following: lane lines, parking space lines, and zebra crossings.
图5所示目标检测装置在目标检测过程中的具体执行过程,可以参考前述其他实施例中的相关说明,在此不赘述。For the specific execution process of the target detection device shown in FIG. 5 in the target detection process, reference may be made to the relevant descriptions in the other embodiments described above, and details are not described here.
图6为本发明实施例提供的一种可移动平台的结构示意图,如图6所示,该可移动平台包括:FIG. 6 is a schematic structural diagram of a movable platform according to an embodiment of the present invention. As shown in FIG. 6 , the movable platform includes:
壳体21; shell 21;
第一相机22和第二相机23,设于所述壳体21内部或外部,分别用于拍摄所述可移动平台前方不同距离范围的环境;The first camera 22 and the second camera 23 are arranged inside or outside the casing 21, and are respectively used for photographing the environment in front of the movable platform with different distance ranges;
处理器24，设于所述壳体21内部，与所述第一相机22和所述第二相机23耦合，用于获取所述第一相机22采集的第一图像以及所述第二相机23采集的第二图像；将所述第一图像和所述第二图像融合为第三图像；识别所述第三图像中包含的道路目标。The processor 24 is disposed inside the casing 21 and coupled to the first camera 22 and the second camera 23, and is configured to acquire the first image captured by the first camera 22 and the second image captured by the second camera 23, fuse the first image and the second image into a third image, and identify the road target contained in the third image.
可选地,在将所述第一图像和所述第二图像融合为第三图像的过程中,所述处理器24具体用于:Optionally, in the process of fusing the first image and the second image into a third image, the processor 24 is specifically configured to:
获取所述第一图像对应的第一俯视图，以及所述第二图像在所述第一相机的视角下对应的第二俯视图；将所述第一俯视图和所述第二俯视图融合为所述第三图像。Acquire a first top view corresponding to the first image and a second top view corresponding to the second image under the viewing angle of the first camera, and fuse the first top view and the second top view into the third image.
其中，可选地，所述处理器24具体用于：确定所述第一相机对应的俯视图投影矩阵；根据所述俯视图投影矩阵对所述第一图像进行俯视图投影，以得到所述第一图像对应的第一俯视图；根据对应于所述第一相机和所述第二相机的相机外参矩阵，以及所述俯视图投影矩阵，对所述第二图像进行俯视图投影，以得到所述第二图像在所述第一相机的视角下对应的第二俯视图。Optionally, the processor 24 is specifically configured to: determine a top view projection matrix corresponding to the first camera; perform top view projection on the first image according to the top view projection matrix to obtain a first top view corresponding to the first image; and perform top view projection on the second image according to the camera extrinsic matrix corresponding to the first camera and the second camera and the top view projection matrix, to obtain a second top view corresponding to the second image under the viewing angle of the first camera.
其中,可选地,所述处理器24具体用于:对所述第一俯视图和所述第二俯视图进行加权求和运算,以得到所述第三图像。Wherein, optionally, the processor 24 is specifically configured to: perform a weighted sum operation on the first top view and the second top view to obtain the third image.
其中，可选地，在确定所述第一相机对应的俯视图投影矩阵的过程中，所述处理器24具体用于：确定所述第一相机对应的单应性矩阵；获取所述第一图像中多个参考像素点在图像坐标系中各自对应的第一坐标；获取所述多个参考像素点在投影到俯视图后在世界坐标系中各自对应的第二坐标；根据所述第一坐标和所述第二坐标，确定所述第一相机相对地面的透视变换矩阵；Optionally, in determining the top view projection matrix corresponding to the first camera, the processor 24 is specifically configured to: determine a homography matrix corresponding to the first camera; acquire first coordinates, in the image coordinate system, of multiple reference pixels in the first image; acquire second coordinates, in the world coordinate system, of the multiple reference pixels after projection to the top view; and determine a perspective transformation matrix of the first camera relative to the ground according to the first coordinates and the second coordinates;
根据所述单应性矩阵和所述透视变换矩阵,确定所述俯视图投影矩阵。The top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
可选地，所述处理器24具体用于：根据所述第一相机的相机内参矩阵、所述第一相机相对地面的平移矩阵、所述第一相机相对地面的旋转矩阵，确定所述单应性矩阵。Optionally, the processor 24 is specifically configured to determine the homography matrix according to the camera intrinsic matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
可选地，在识别所述第三图像中包含的道路目标的过程中，所述处理器24具体用于：将所述第三图像输入到预设的语义分割模型中，以通过所述语义分割模型识别出所述第三图像中包含的道路目标。Optionally, in identifying the road target contained in the third image, the processor 24 is specifically configured to input the third image into a preset semantic segmentation model, so that the road target contained in the third image is identified by the semantic segmentation model.
可选地，所述语义分割模型中包括：级联的多个特征提取层和输出层。基于此，所述处理器24具体用于：获取所述多个特征提取层输出的多个语义特征图；拼接所述多个语义特征图；将拼接后的语义特征图输入到所述输出层，以获取所述输出层输出的语义分割结果，所述语义分割结果指示出所述第三图像中对应于道路目标的像素。Optionally, the semantic segmentation model includes multiple cascaded feature extraction layers and an output layer. Based on this, the processor 24 is specifically configured to: acquire multiple semantic feature maps output by the multiple feature extraction layers; splice the multiple semantic feature maps; and input the spliced semantic feature map to the output layer to obtain the semantic segmentation result output by the output layer, the semantic segmentation result indicating the pixels corresponding to the road target in the third image.
其中，可选地，在拼接所述多个语义特征图的过程中，所述处理器24具体用于：对于目标语义特征图，将所述目标语义特征图中与预设目标距离范围对应的语义特征向量的权重设为第一权重，将所述目标语义特征图中与其他距离范围对应的语义特征向量的权重设为第二权重，所述第一权重大于所述第二权重，所述目标语义特征图是所述多个语义特征图中的任一个，所述多个语义特征图各自对应的预设目标距离范围不同；根据设置的权重，对所述多个语义特征图进行拼接。Optionally, in splicing the multiple semantic feature maps, the processor 24 is specifically configured to: for a target semantic feature map, set the weight of the semantic feature vector corresponding to a preset target distance range in the target semantic feature map as a first weight, and set the weights of the semantic feature vectors corresponding to other distance ranges in the target semantic feature map as a second weight, where the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to the multiple semantic feature maps are different; and splice the multiple semantic feature maps according to the set weights.
可选地,所述第一权重为1,所述第二权重为0。Optionally, the first weight is 1, and the second weight is 0.
可选地,按照所述多个语义特征图的输出顺序,所述多个语义特征图各自对应的预设目标距离范围依次变远。Optionally, according to the output order of the plurality of semantic feature maps, the preset target distance ranges corresponding to each of the plurality of semantic feature maps sequentially become farther.
可选地,所述道路目标包括如下任一种:车道线、车位线、斑马线。Optionally, the road target includes any one of the following: lane lines, parking space lines, and zebra crossings.
另外,本发明实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有可执行代码,所述可执行代码用于实现如前述各实施例提供的目标检测方法。In addition, an embodiment of the present invention further provides a computer-readable storage medium, where executable codes are stored in the computer-readable storage medium, and the executable codes are used to implement the target detection methods provided by the foregoing embodiments.
以上各个实施例中的技术方案、技术特征在不相冲突的情况下均可以单独或者组合使用，只要未超出本领域技术人员的认知范围，均属于本申请保护范围内的等同实施例。The technical solutions and technical features in the above embodiments may be used alone or in combination provided they do not conflict with one another; as long as they do not go beyond the knowledge of those skilled in the art, they all belong to equivalent embodiments within the protection scope of the present application.
以上所述仅为本发明的实施例，并非因此限制本发明的专利范围，凡是利用本发明说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本发明的专利保护范围内。The above descriptions are merely embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.
最后应说明的是：以上各实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述各实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分或者全部技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的范围。Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (37)

  1. 一种目标检测方法,其特征在于,应用于可移动平台,所述方法包括:A target detection method, characterized in that, applied to a movable platform, the method comprising:
获取第一相机采集的第一图像以及第二相机采集的第二图像；其中，所述第一相机和所述第二相机设置于所述可移动平台上，所述第一相机和所述第二相机分别用于拍摄所述可移动平台前方不同距离范围的环境；Acquiring a first image captured by a first camera and a second image captured by a second camera, where the first camera and the second camera are disposed on the movable platform and are respectively used to photograph the environment at different distance ranges in front of the movable platform;
    将所述第一图像和所述第二图像融合为第三图像;fusing the first image and the second image into a third image;
    识别所述第三图像中包含的道路目标。Road objects contained in the third image are identified.
  2. 根据权利要求1所述的方法,其特征在于,所述将所述第一图像和所述第二图像融合为第三图像,包括:The method according to claim 1, wherein the fusion of the first image and the second image into a third image comprises:
    获取所述第一图像对应的第一俯视图,以及所述第二图像在所述第一相机的视角下对应的第二俯视图;acquiring a first top view corresponding to the first image, and a second top view corresponding to the second image under the viewing angle of the first camera;
    将所述第一俯视图和所述第二俯视图融合为所述第三图像。The first top view and the second top view are merged into the third image.
  3. 根据权利要求2所述的方法,其特征在于,所述获取所述第一图像对应的第一俯视图,以及所述第二图像在所述第一相机的视角下对应的第二俯视图,包括:The method according to claim 2, wherein the acquiring a first top view corresponding to the first image and a second top view corresponding to the second image under the viewing angle of the first camera comprises:
    确定所述第一相机对应的俯视图投影矩阵;determining a top view projection matrix corresponding to the first camera;
    根据所述俯视图投影矩阵对所述第一图像进行俯视图投影,以得到所述第一图像对应的第一俯视图;Perform a top view projection on the first image according to the top view projection matrix to obtain a first top view corresponding to the first image;
根据对应于所述第一相机和所述第二相机的相机外参矩阵，以及所述俯视图投影矩阵，对所述第二图像进行俯视图投影，以得到所述第二图像在所述第一相机的视角下对应的第二俯视图。Performing top view projection on the second image according to the camera extrinsic matrix corresponding to the first camera and the second camera and the top view projection matrix, to obtain a second top view corresponding to the second image under the viewing angle of the first camera.
  4. 根据权利要求2所述的方法,其特征在于,所述将所述第一俯视图和所述第二俯视图融合为所述第三图像,包括:The method according to claim 2, wherein the combining the first top view and the second top view into the third image comprises:
    对所述第一俯视图和所述第二俯视图进行加权求和运算,以得到所述第三图像。A weighted sum operation is performed on the first top view and the second top view to obtain the third image.
  5. 根据权利要求3所述的方法,其特征在于,所述确定所述第一相机对应的俯视图投影矩阵,包括:The method according to claim 3, wherein the determining the top view projection matrix corresponding to the first camera comprises:
    确定所述第一相机对应的单应性矩阵;determining a homography matrix corresponding to the first camera;
    获取所述第一图像中多个参考像素点在图像坐标系中各自对应的第一坐标;Acquiring respective first coordinates in the image coordinate system of a plurality of reference pixels in the first image;
    获取所述多个参考像素点在投影到俯视图后在世界坐标系中各自对应的第二坐标;acquiring the respective second coordinates corresponding to the plurality of reference pixels in the world coordinate system after being projected to the top view;
    根据所述第一坐标和所述第二坐标,确定所述第一相机相对地面的透视变换矩阵;According to the first coordinate and the second coordinate, determine the perspective transformation matrix of the first camera relative to the ground;
    根据所述单应性矩阵和所述透视变换矩阵,确定所述俯视图投影矩阵。The top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
  6. 根据权利要求5所述的方法,其特征在于,所述确定所述第一相机对应的单应性矩阵,包括:The method according to claim 5, wherein the determining the homography matrix corresponding to the first camera comprises:
    根据所述第一相机的相机内参矩阵、所述第一相机相对地面的平移矩阵、所述第一相机相对地面的旋转矩阵,确定所述单应性矩阵。The homography matrix is determined according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述识别所述第三图像中包含的道路目标,包括:The method according to any one of claims 1 to 6, wherein the identifying a road target included in the third image comprises:
    将所述第三图像输入到预设的语义分割模型中,以通过所述语义分割模型识别出所述第三图像中包含的道路目标。The third image is input into a preset semantic segmentation model, so as to identify the road target contained in the third image through the semantic segmentation model.
  8. 根据权利要求7所述的方法,其特征在于,所述语义分割模型中包括:级联的多个特征提取层和输出层;The method according to claim 7, wherein the semantic segmentation model comprises: a plurality of cascaded feature extraction layers and output layers;
    所述通过所述语义分割模型识别出所述第三图像中包含的道路目标,包括:The identifying the road target contained in the third image by the semantic segmentation model includes:
    获取所述多个特征提取层输出的多个语义特征图;obtaining multiple semantic feature maps output by the multiple feature extraction layers;
    拼接所述多个语义特征图;splicing the plurality of semantic feature maps;
    将拼接后的语义特征图输入到所述输出层,以获取所述输出层输出的语义分割结果,所述语义分割结果指示出所述第三图像中对应于道路目标的像素。The spliced semantic feature map is input to the output layer to obtain a semantic segmentation result output by the output layer, and the semantic segmentation result indicates the pixels corresponding to the road target in the third image.
  9. 根据权利要求8所述的方法,其特征在于,所述拼接所述多个语义特征图,包括:The method according to claim 8, wherein the splicing the plurality of semantic feature maps comprises:
对于目标语义特征图，将所述目标语义特征图中与预设目标距离范围对应的语义特征向量的权重设为第一权重，将所述目标语义特征图中与其他距离范围对应的语义特征向量的权重设为第二权重，所述第一权重大于所述第二权重，所述目标语义特征图是所述多个语义特征图中的任一个，所述多个语义特征图各自对应的预设目标距离范围不同；For a target semantic feature map, setting the weight of the semantic feature vector corresponding to a preset target distance range in the target semantic feature map as a first weight, and setting the weights of the semantic feature vectors corresponding to other distance ranges in the target semantic feature map as a second weight, where the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to the multiple semantic feature maps are different;
    根据设置的权重,对所述多个语义特征图进行拼接。According to the set weights, the multiple semantic feature maps are spliced.
  10. 根据权利要求9所述的方法,其特征在于,所述第一权重为1,所述第二权重为0。The method according to claim 9, wherein the first weight is 1, and the second weight is 0.
  11. 根据权利要求9所述的方法,其特征在于,按照所述多个语义特征图的输出顺序,所述多个语义特征图各自对应的预设目标距离范围依次变远。The method according to claim 9, wherein, according to the output order of the plurality of semantic feature maps, the preset target distance ranges corresponding to each of the plurality of semantic feature maps sequentially become farther.
  12. 根据权利要求1所述的方法,其特征在于,所述道路目标包括如下任一种:The method according to claim 1, wherein the road target comprises any one of the following:
    车道线、车位线、斑马线。Lane lines, parking space lines, zebra crossings.
13. 一种目标检测装置，其特征在于，设于可移动平台，所述装置包括：存储器、处理器；其中，所述存储器上存储有可执行代码，当所述可执行代码被所述处理器执行时，使所述处理器实现：A target detection apparatus, disposed on a movable platform, the apparatus comprising: a memory and a processor, where executable code is stored on the memory and, when executed by the processor, causes the processor to implement:
获取第一相机采集的第一图像以及第二相机采集的第二图像，其中，所述第一相机和所述第二相机设置于所述可移动平台上，所述第一相机和所述第二相机分别用于拍摄所述可移动平台前方不同距离范围的环境；Acquiring a first image captured by a first camera and a second image captured by a second camera, where the first camera and the second camera are disposed on the movable platform and are respectively used to photograph the environment at different distance ranges in front of the movable platform;
    将所述第一图像和所述第二图像融合为第三图像;fusing the first image and the second image into a third image;
    识别所述第三图像中包含的道路目标。Road objects contained in the third image are identified.
  14. 根据权利要求13所述的装置,其特征在于,在将所述第一图像和所述第二图像融合为第三图像的过程中,所述处理器具体用于:The apparatus according to claim 13, wherein in the process of fusing the first image and the second image into a third image, the processor is specifically configured to:
    获取所述第一图像对应的第一俯视图,以及所述第二图像在所述第一相机的视角下对应的第二俯视图;acquiring a first top view corresponding to the first image, and a second top view corresponding to the second image under the viewing angle of the first camera;
    将所述第一俯视图和所述第二俯视图融合为所述第三图像。The first top view and the second top view are merged into the third image.
  15. 根据权利要求14所述的装置,其特征在于,所述处理器具体用于:The apparatus according to claim 14, wherein the processor is specifically configured to:
    确定所述第一相机对应的俯视图投影矩阵;determining a top view projection matrix corresponding to the first camera;
    根据所述俯视图投影矩阵对所述第一图像进行俯视图投影,以得到所述第一图像对应的第一俯视图;Perform a top view projection on the first image according to the top view projection matrix to obtain a first top view corresponding to the first image;
根据对应于所述第一相机和所述第二相机的相机外参矩阵，以及所述俯视图投影矩阵，对所述第二图像进行俯视图投影，以得到所述第二图像在所述第一相机的视角下对应的第二俯视图。Performing top view projection on the second image according to the camera extrinsic matrix corresponding to the first camera and the second camera and the top view projection matrix, to obtain a second top view corresponding to the second image under the viewing angle of the first camera.
  16. 根据权利要求14所述的装置,其特征在于,所述处理器具体用于:The apparatus according to claim 14, wherein the processor is specifically configured to:
    对所述第一俯视图和所述第二俯视图进行加权求和运算,以得到所述第三图像。A weighted sum operation is performed on the first top view and the second top view to obtain the third image.
  17. 根据权利要求15所述的装置,其特征在于,在确定所述第一相机对应的俯视图投影矩阵的过程中,所述处理器具体用于:The device according to claim 15, wherein, in the process of determining the top view projection matrix corresponding to the first camera, the processor is specifically configured to:
    确定所述第一相机对应的单应性矩阵;determining a homography matrix corresponding to the first camera;
    获取所述第一图像中多个参考像素点在图像坐标系中各自对应的第一坐标;Acquiring respective first coordinates in the image coordinate system of a plurality of reference pixels in the first image;
    获取所述多个参考像素点在投影到俯视图后在世界坐标系中各自对应的第二坐标;acquiring the respective second coordinates corresponding to the plurality of reference pixels in the world coordinate system after being projected to the top view;
    根据所述第一坐标和所述第二坐标,确定所述第一相机相对地面的透视变换矩阵;According to the first coordinate and the second coordinate, determine the perspective transformation matrix of the first camera relative to the ground;
    根据所述单应性矩阵和所述透视变换矩阵,确定所述俯视图投影矩阵。The top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
  18. 根据权利要求17所述的装置,其特征在于,所述处理器具体用于:The apparatus according to claim 17, wherein the processor is specifically configured to:
    根据所述第一相机的相机内参矩阵、所述第一相机相对地面的平移矩阵、所述第一相机相对地面的旋转矩阵,确定所述单应性矩阵。The homography matrix is determined according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
  19. 根据权利要求13至18中任一项所述的装置,其特征在于,在识别所述第三图像中包含的道路目标的过程中,所述处理器具体用于:The device according to any one of claims 13 to 18, wherein in the process of recognizing the road target included in the third image, the processor is specifically configured to:
    将所述第三图像输入到预设的语义分割模型中,以通过所述语义分割模型识别出所述第三图像中包含的道路目标。The third image is input into a preset semantic segmentation model, so as to identify the road target contained in the third image through the semantic segmentation model.
20. 根据权利要求19所述的装置，其特征在于，所述语义分割模型中包括：级联的多个特征提取层和输出层；The apparatus according to claim 19, wherein the semantic segmentation model comprises: a plurality of cascaded feature extraction layers and an output layer;
    所述处理器具体用于:The processor is specifically used for:
    获取所述多个特征提取层输出的多个语义特征图;obtaining multiple semantic feature maps output by the multiple feature extraction layers;
    拼接所述多个语义特征图;splicing the plurality of semantic feature maps;
    将拼接后的语义特征图输入到所述输出层,以获取所述输出层输出的语义分割结果,所述语义分割结果指示出所述第三图像中对应于道路目标的像素。The spliced semantic feature map is input to the output layer to obtain a semantic segmentation result output by the output layer, and the semantic segmentation result indicates the pixels corresponding to the road target in the third image.
  21. 根据权利要求20所述的装置,其特征在于,在拼接所述多个语义特征图的过程中,所述处理器具体用于:The device according to claim 20, wherein, in the process of splicing the plurality of semantic feature maps, the processor is specifically configured to:
对于目标语义特征图，将所述目标语义特征图中与预设目标距离范围对应的语义特征向量的权重设为第一权重，将所述目标语义特征图中与其他距离范围对应的语义特征向量的权重设为第二权重，所述第一权重大于所述第二权重，所述目标语义特征图是所述多个语义特征图中的任一个，所述多个语义特征图各自对应的预设目标距离范围不同；For a target semantic feature map, setting the weight of the semantic feature vector corresponding to a preset target distance range in the target semantic feature map as a first weight, and setting the weights of the semantic feature vectors corresponding to other distance ranges in the target semantic feature map as a second weight, where the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to the multiple semantic feature maps are different;
    根据设置的权重,对所述多个语义特征图进行拼接。According to the set weights, the multiple semantic feature maps are spliced.
  22. 根据权利要求21所述的装置,其特征在于,所述第一权重为1,所述第二权重为0。The apparatus according to claim 21, wherein the first weight is 1, and the second weight is 0.
  23. 根据权利要求21所述的装置,其特征在于,按照所述多个语义特征图的输出顺序,所述多个语义特征图各自对应的预设目标距离范围依次变远。The device according to claim 21, wherein, according to the output order of the plurality of semantic feature maps, the preset target distance ranges corresponding to each of the plurality of semantic feature maps sequentially become farther.
  24. 根据权利要求13所述的装置,其特征在于,所述道路目标包括如下任一种:The device according to claim 13, wherein the road target comprises any one of the following:
    车道线、车位线、斑马线。Lane lines, parking space lines, zebra crossings.
  25. 一种可移动平台,其特征在于,包括:A movable platform, characterized in that, comprising:
    壳体;case;
    第一相机和第二相机,设于所述壳体内部或外部,分别用于拍摄所述可移动平台前方不同距离范围的环境;The first camera and the second camera are arranged inside or outside the casing, and are respectively used for photographing the environment in front of the movable platform with different distance ranges;
处理器，设于所述壳体内部，与所述第一相机和所述第二相机耦合，用于获取所述第一相机采集的第一图像以及所述第二相机采集的第二图像；将所述第一图像和所述第二图像融合为第三图像；识别所述第三图像中包含的道路目标。a processor, disposed inside the casing and coupled to the first camera and the second camera, configured to acquire a first image captured by the first camera and a second image captured by the second camera, fuse the first image and the second image into a third image, and identify a road target contained in the third image.
  26. 根据权利要求25所述的可移动平台,其特征在于,在将所述第一图像和所述第二图像融合为第三图像的过程中,所述处理器具体用于:The movable platform according to claim 25, wherein in the process of fusing the first image and the second image into a third image, the processor is specifically configured to:
    获取所述第一图像对应的第一俯视图,以及所述第二图像在所述第一相机的视角下对应的第二俯视图;acquiring a first top view corresponding to the first image, and a second top view corresponding to the second image under the viewing angle of the first camera;
    将所述第一俯视图和所述第二俯视图融合为所述第三图像。The first top view and the second top view are merged into the third image.
  27. 根据权利要求26所述的可移动平台,其特征在于,所述处理器具体用于:The movable platform according to claim 26, wherein the processor is specifically configured to:
    确定所述第一相机对应的俯视图投影矩阵;determining a top view projection matrix corresponding to the first camera;
    根据所述俯视图投影矩阵对所述第一图像进行俯视图投影,以得到所述第一图像对应的第一俯视图;Perform a top view projection on the first image according to the top view projection matrix to obtain a first top view corresponding to the first image;
根据对应于所述第一相机和所述第二相机的相机外参矩阵，以及所述俯视图投影矩阵，对所述第二图像进行俯视图投影，以得到所述第二图像在所述第一相机的视角下对应的第二俯视图。Performing top view projection on the second image according to the camera extrinsic matrix corresponding to the first camera and the second camera and the top view projection matrix, to obtain a second top view corresponding to the second image under the viewing angle of the first camera.
  28. 根据权利要求26所述的可移动平台,其特征在于,所述处理器具体用于:The movable platform according to claim 26, wherein the processor is specifically configured to:
    对所述第一俯视图和所述第二俯视图进行加权求和运算,以得到所述第三图像。A weighted sum operation is performed on the first top view and the second top view to obtain the third image.
  29. 根据权利要求27所述的可移动平台,其特征在于,在确定所述第一相机对应的俯视图投影矩阵的过程中,所述处理器具体用于:The movable platform according to claim 27, wherein, in the process of determining the top view projection matrix corresponding to the first camera, the processor is specifically configured to:
    确定所述第一相机对应的单应性矩阵;determining a homography matrix corresponding to the first camera;
    获取所述第一图像中多个参考像素点在图像坐标系中各自对应的第一坐标;Acquiring respective first coordinates in the image coordinate system of a plurality of reference pixels in the first image;
    获取所述多个参考像素点在投影到俯视图后在世界坐标系中各自对应的第二坐标;acquiring the respective second coordinates corresponding to the plurality of reference pixels in the world coordinate system after being projected to the top view;
    根据所述第一坐标和所述第二坐标,确定所述第一相机相对地面的透视变换矩阵;According to the first coordinate and the second coordinate, determine the perspective transformation matrix of the first camera relative to the ground;
    根据所述单应性矩阵和所述透视变换矩阵,确定所述俯视图投影矩阵。The top view projection matrix is determined according to the homography matrix and the perspective transformation matrix.
  30. 根据权利要求29所述的可移动平台,其特征在于,所述处理器具体用于:The movable platform according to claim 29, wherein the processor is specifically configured to:
    根据所述第一相机的相机内参矩阵、所述第一相机相对地面的平移矩阵、所述第一相机相对地面的旋转矩阵,确定所述单应性矩阵。The homography matrix is determined according to the camera intrinsic parameter matrix of the first camera, the translation matrix of the first camera relative to the ground, and the rotation matrix of the first camera relative to the ground.
  31. 根据权利要求25至30中任一项所述的可移动平台,其特征在于,在识别所述第三图像中包含的道路目标的过程中,所述处理器具体用于:The movable platform according to any one of claims 25 to 30, wherein in the process of recognizing the road target included in the third image, the processor is specifically configured to:
    将所述第三图像输入到预设的语义分割模型中,以通过所述语义分割模型识别出所述第三图像中包含的道路目标。The third image is input into a preset semantic segmentation model, so as to identify the road target contained in the third image through the semantic segmentation model.
  32. 根据权利要求31所述的可移动平台,其特征在于,所述语义分割模型中包括:级联的多个特征提取层和输出层;The movable platform according to claim 31, wherein the semantic segmentation model comprises: a plurality of cascaded feature extraction layers and output layers;
    所述处理器具体用于:The processor is specifically used for:
    获取所述多个特征提取层输出的多个语义特征图;obtaining multiple semantic feature maps output by the multiple feature extraction layers;
    拼接所述多个语义特征图;splicing the plurality of semantic feature maps;
    将拼接后的语义特征图输入到所述输出层,以获取所述输出层输出的语义分割结果,所述语义分割结果指示出所述第三图像中对应于道路目标的像素。The spliced semantic feature map is input to the output layer to obtain a semantic segmentation result output by the output layer, and the semantic segmentation result indicates the pixels corresponding to the road target in the third image.
  33. 根据权利要求32所述的可移动平台,其特征在于,在拼接所述多个语义特征图的过程中,所述处理器具体用于:The movable platform according to claim 32, wherein, in the process of splicing the plurality of semantic feature maps, the processor is specifically configured to:
对于目标语义特征图，将所述目标语义特征图中与预设目标距离范围对应的语义特征向量的权重设为第一权重，将所述目标语义特征图中与其他距离范围对应的语义特征向量的权重设为第二权重，所述第一权重大于所述第二权重，所述目标语义特征图是所述多个语义特征图中的任一个，所述多个语义特征图各自对应的预设目标距离范围不同；For a target semantic feature map, setting the weight of the semantic feature vector corresponding to a preset target distance range in the target semantic feature map as a first weight, and setting the weights of the semantic feature vectors corresponding to other distance ranges in the target semantic feature map as a second weight, where the first weight is greater than the second weight, the target semantic feature map is any one of the multiple semantic feature maps, and the preset target distance ranges corresponding to the multiple semantic feature maps are different;
    根据设置的权重,对所述多个语义特征图进行拼接。According to the set weights, the multiple semantic feature maps are spliced.
  34. 根据权利要求33所述的可移动平台,其特征在于,所述第一权重为1,所述第二权重为0。The movable platform of claim 33, wherein the first weight is 1 and the second weight is 0.
  35. 根据权利要求33所述的可移动平台,其特征在于,按照所述多个语义特征图的输出顺序,所述多个语义特征图各自对应的预设目标距离范围依次变远。The movable platform according to claim 33, wherein, according to the output order of the plurality of semantic feature maps, the respective preset target distance ranges corresponding to the plurality of semantic feature maps gradually become farther.
  36. 根据权利要求25所述的可移动平台,其特征在于,所述道路目标包括如下任一种:The movable platform of claim 25, wherein the road target comprises any of the following:
    车道线、车位线、斑马线。Lane lines, parking space lines, zebra crossings.
  37. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有可执行代码,所述可执行代码用于实现权利要求1至12中任一项所述的目标检测方法。A computer-readable storage medium, wherein executable codes are stored in the computer-readable storage medium, and the executable codes are used to implement the target detection method according to any one of claims 1 to 12.
PCT/CN2021/073334 2021-01-22 2021-01-22 Target detection method and apparatus, movable platform, and storage medium WO2022155899A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/073334 WO2022155899A1 (en) 2021-01-22 2021-01-22 Target detection method and apparatus, movable platform, and storage medium

Publications (1)

Publication Number Publication Date
WO2022155899A1 true WO2022155899A1 (en) 2022-07-28

Family

ID=82549176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073334 WO2022155899A1 (en) 2021-01-22 2021-01-22 Target detection method and apparatus, movable platform, and storage medium

Country Status (1)

Country Link
WO (1) WO2022155899A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103985254A (en) * 2014-05-29 2014-08-13 四川川大智胜软件股份有限公司 Multi-view video fusion and traffic parameter collecting method for large-scale scene traffic monitoring
CN108230254A (en) * 2017-08-31 2018-06-29 北京同方软件股份有限公司 A kind of full lane line automatic testing method of the high-speed transit of adaptive scene switching
CN110969592A (en) * 2018-09-29 2020-04-07 北京嘀嘀无限科技发展有限公司 Image fusion method, automatic driving control method, device and equipment
CN111144330A (en) * 2019-12-29 2020-05-12 浪潮(北京)电子信息产业有限公司 Deep learning-based lane line detection method, device and equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909255A (en) * 2023-01-05 2023-04-04 北京百度网讯科技有限公司 Image generation method, image segmentation method, image generation device, image segmentation device, vehicle-mounted terminal and medium
CN117350926A (en) * 2023-12-04 2024-01-05 北京航空航天大学合肥创新研究院 Multi-mode data enhancement method based on target weight
CN117350926B (en) * 2023-12-04 2024-02-13 北京航空航天大学合肥创新研究院 Multi-mode data enhancement method based on target weight

Similar Documents

Publication Title
JP6900081B2 (en) Vehicle travel route planning programs, devices, systems, media and devices
WO2020042349A1 (en) Positioning initialization method applied to vehicle positioning and vehicle-mounted terminal
CN112444242B (en) Pose optimization method and device
CN103358993B (en) A system and method for recognizing a parking space line marking for a vehicle
JP6781711B2 (en) Methods and systems for automatically recognizing parking zones
US11670087B2 (en) Training data generating method for image processing, image processing method, and devices thereof
WO2022155899A1 (en) Target detection method and apparatus, movable platform, and storage medium
CN112585659A (en) Navigation method, device and system
WO2021134325A1 (en) Obstacle detection method and apparatus based on driverless technology and computer device
CN112154445A (en) Method and device for determining lane line in high-precision map
CN113129339B (en) Target tracking method and device, electronic equipment and storage medium
WO2023123837A1 (en) Map generation method and apparatus, electronic device, and storage medium
CN111860352A (en) Multi-lens vehicle track full-tracking system and method
Batista et al. Lane detection and estimation using perspective image
CN110084754A (en) A kind of image superimposing method based on improvement SIFT feature point matching algorithm
CN114543819A (en) Vehicle positioning method and device, electronic equipment and storage medium
CN110930437B (en) Target tracking method and device
JP5435294B2 (en) Image processing apparatus and image processing program
CN111460854A (en) Remote target detection method, device and system
CN115790568A (en) Map generation method based on semantic information and related equipment
CN111612812A (en) Target detection method, target detection device and electronic equipment
Tang Development of a multiple-camera tracking system for accurate traffic performance measurements at intersections
Huang et al. 360vot: A new benchmark dataset for omnidirectional visual object tracking
Tsukada et al. Road structure based scene understanding for intelligent vehicle systems
CN112818866A (en) Vehicle positioning method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 21920297
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 21920297
    Country of ref document: EP
    Kind code of ref document: A1