CN116246235A - Target detection method and device based on traveling and parking integration, electronic equipment and medium - Google Patents

Target detection method and device based on traveling and parking integration, electronic equipment and medium

Info

Publication number
CN116246235A
CN116246235A (application number CN202310017712.3A)
Authority
CN
China
Prior art keywords
information
feature
images
target
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310017712.3A
Other languages
Chinese (zh)
Other versions
CN116246235B (en)
Inventor
张兵
左佳琪
王贺
韦松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jika Intelligent Robot Co ltd
Original Assignee
Jika Intelligent Robot Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jika Intelligent Robot Co ltd filed Critical Jika Intelligent Robot Co ltd
Priority to CN202310017712.3A priority Critical patent/CN116246235B/en
Publication of CN116246235A publication Critical patent/CN116246235A/en
Application granted granted Critical
Publication of CN116246235B publication Critical patent/CN116246235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Traffic Control Systems (AREA)

Abstract

The disclosure relates to a target detection method and device, an electronic device, and a medium based on traveling and parking integration. The method comprises the following steps: determining feature information for each image in a set of images, wherein the images include one or more targets to be detected located around the host vehicle; predicting depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information; performing feature extraction on the target space three-dimensional information to obtain first-time target space features for the one or more targets to be detected; fusing the first-time target space features with second-time target space features; and outputting one or more pieces of target detection information based on the information obtained by fusing the first-time and second-time target space features. In this way, the current image and the historical image are fused, so that the same perception strategy can be used in various driving states, target occlusion and missed detection can be effectively avoided, and the perception accuracy is improved.

Description

Target detection method and device based on traveling and parking integration, electronic equipment and medium
Technical Field
The present disclosure relates generally to the field of automatic driving technology, and in particular, to a method, apparatus, electronic device, and computer-readable storage medium for target detection based on traveling and parking integration.
Background
In the traditional separated architecture, driving and parking are two independent systems: the driving function can only call the driving chip and sensors, such as a front-view camera and a millimeter-wave radar, while the parking function can only call the parking chip and sensors, such as fisheye cameras and ultrasonic radar. During parking, attention is paid to perceiving obstacles in the short-distance range around the vehicle body, especially in the driver's blind spots, to avoid collisions with intruding pedestrians and animals.
Multi-camera image fusion is generally performed using non-maximum suppression (NMS). However, this approach is difficult to apply after projection into a particular space (e.g., the bird's eye view (BEV) space). This is because, after targets are predicted in this way, the original information needs to be converted into spatial information by an inverse perspective mapping (IPM) function, but IPM requires prior assumptions, such as a flat ground and an unchanged camera pitch angle, which are difficult to satisfy in practical engineering. Moreover, in multi-camera image fusion, the pictures captured by multiple cameras may have overlapping areas that need to be deduplicated during post-processing; the existing technique, which suppresses redundant prediction boxes based on the IoU values between them, performs poorly when an object lies in the overlapping area of two pictures.
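For context, the following is a minimal sketch of the IoU-based non-maximum suppression discussed above (function names and the 0.5 threshold are illustrative assumptions, not part of the disclosed method). When two cameras each produce a box for the same object in their overlap region, the two boxes may overlap so little that neither is suppressed, which is the limitation described here.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box and drop any box overlapping it above iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        remaining = [j for j in order[1:] if iou(boxes[i], boxes[j]) < iou_thresh]
        order = np.array(remaining, dtype=int)
    return keep
```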
In automatic driving, the driving and parking functions are integrated into one system and deployed on one chip. Intelligent driving with such an integrated driving-parking system technically requires a focus on fusion and adaptation to different scenarios, which obviously cannot be achieved with the traditional approach.
Therefore, a target detection scheme is needed that, when driving, parking, and other scenarios are integrated into the same system, can efficiently and accurately perceive the targets in the vehicle's surroundings and obtain a stable perception effect in intelligent driving with integrated traveling and parking.
Disclosure of Invention
According to example embodiments of the present disclosure, a target detection scheme based on traveling and parking integration is provided to at least partially solve the problems existing in the prior art.
In a first aspect of the present disclosure, a target detection method based on traveling and parking integration is provided. The method comprises: determining feature information for each image in a set of images, wherein the set of images is obtained at a first time and at least some of the images in the set include one or more targets to be detected located around the host vehicle; predicting depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information; performing feature extraction on the target space three-dimensional information to obtain first-time target space features for the one or more targets to be detected; fusing the first-time target space features with second-time target space features, wherein the second-time target space features are obtained at a second time and the second time is earlier than the first time; and outputting one or more pieces of target detection information corresponding to the one or more targets to be detected based on the information obtained by fusing the first-time and second-time target space features.
In some embodiments, predicting depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information comprises: performing depth estimation based on the camera intrinsic and extrinsic parameters and the feature information; and performing depth supervision on the initial depth information obtained by the depth estimation to obtain corrected depth information.
In some embodiments, the first time and the second time are consecutive.
In some embodiments, performing feature extraction on the target space three-dimensional information to obtain the first-time target space features for the one or more targets to be detected preferably includes: fusing the pixel-level visual data associated with the feature information and the lidar point cloud associated with the depth information to obtain the target space three-dimensional information from which features are to be extracted.
In some embodiments, the second-time target space features are obtained via: determining feature information for each image in a set of images, wherein the set of images is obtained at the second time and at least some of the images in the set include one or more targets to be detected located around the host vehicle; predicting depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information; and performing feature extraction on the target space three-dimensional information to obtain the second-time target space features for the one or more targets to be detected.
In some embodiments, the method further comprises: acquiring the set of images from a plurality of cameras arranged on the host vehicle; and performing image preprocessing on each image in the set of images.
In some embodiments, the feature information includes NxCxHxW type information, where N represents the number of images in a set of images, C represents the number of channels, and H and W represent feature map sizes of each image in the set of images after feature extraction; the depth information comprises NxDxHxW type information, wherein N represents the number of images in a group of images, D represents the depth distribution probability of feature points, and H and W represent the feature map size of each image in the group of images after feature extraction; the target space three-dimensional information includes bird's-eye view BEV space three-dimensional information, and any one of the first-time target space feature and the second-time target space feature includes bird's-eye view BEV space feature; and the one or more target detection information includes one or more of object classification detection information or lane line shape detection information.
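To make the tensor conventions above concrete, the following sketch shows how the NxCxHxW feature information, the NxDxHxW depth information, and a CxHxW BEV-space feature relate in shape only; all concrete sizes (5 cameras, 64 channels, 59 depth bins, a 128x128 BEV grid) are illustrative assumptions, not values stated in the disclosure.

```python
import torch

N, C, D = 5, 64, 59            # cameras, feature channels, depth bins (illustrative)
H, W = 16, 44                  # per-image feature map size after feature extraction (illustrative)
bev_h, bev_w = 128, 128        # BEV grid size (illustrative)

image_features = torch.randn(N, C, H, W)                 # NxCxHxW feature information
depth_probs = torch.randn(N, D, H, W).softmax(dim=1)     # NxDxHxW depth distribution per feature point
bev_feature = torch.randn(C, bev_h, bev_w)               # CxHxW target (BEV) space feature at one time
print(image_features.shape, depth_probs.shape, bev_feature.shape)
```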
In a second aspect of the present disclosure, an object detection device based on traveling and parking integration is provided. The device comprises: a feature information determination module configured to determine feature information for each image in a set of images, wherein the set of images is obtained at a first time and at least a portion of the set of images includes one or more targets to be detected located around the host vehicle; a target space three-dimensional information determination module configured to predict depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information; a target space feature acquisition module configured to perform feature extraction on the target space three-dimensional information to obtain first-time target space features for the one or more targets to be detected; a target space feature fusion module configured to fuse the first-time target space features with second-time target space features, wherein the second-time target space features are obtained at a second time and the second time is earlier than the first time; and a target detection information output module configured to output one or more pieces of target detection information corresponding to the one or more targets to be detected based on the information obtained by fusing the first-time and second-time target space features.
In some embodiments, the target space three-dimensional information determination module may be further configured to perform depth estimation based on the camera intrinsic and extrinsic parameters and the feature information, and to perform depth supervision on the initial depth information obtained by the depth estimation to obtain corrected depth information.
In some embodiments, the target space feature acquisition module may be further configured to fuse the pixel-level visual data associated with the feature information and the lidar point cloud associated with the depth information to obtain the target space three-dimensional information from which features are to be extracted.
In some embodiments, the apparatus may further include a second-time target space feature acquisition module, which may be further configured to determine feature information for each image in a set of images, wherein the set of images is obtained at the second time and at least a portion of the set of images includes one or more targets to be detected located around the host vehicle; to predict depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information; and to perform feature extraction on the target space three-dimensional information to obtain the second-time target space features for the one or more targets to be detected.
In some embodiments, the apparatus may be further configured to acquire the set of images from a plurality of cameras arranged on the host vehicle and to perform image preprocessing on each image in the set of images.
In a third aspect of the present disclosure, an electronic device is provided. The apparatus includes: one or more processors; and storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The medium has stored thereon a computer program which, when executed by a processor, implements a method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, a computer program product is provided. The computer program product comprises a computer program/instructions which, when executed by a processor, implement a method according to the first aspect of the disclosure.
It should be understood that what is described in this summary is not intended to limit the critical or essential features of the embodiments of the disclosure nor to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements. The accompanying drawings are included to provide a better understanding of the present disclosure, and are not to be construed as limiting the disclosure, wherein:
FIG. 1 illustrates a flowchart of an example method of target detection based on traveling and parking integration in accordance with some embodiments of the present disclosure;
FIG. 2 illustrates a schematic flow diagram of target spatial feature extraction in accordance with some embodiments of the present disclosure;
FIG. 3 illustrates a schematic flow diagram for target spatial feature fusion in accordance with some embodiments of the present disclosure;
FIG. 4 illustrates a schematic block diagram of an object detection device based on traveling and parking integration in accordance with some embodiments of the present disclosure; and
FIG. 5 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above, in current driving and parking scenarios, the algorithms must focus on fusion and adaptation to different scenarios, but some of the prior knowledge required for target prediction in current techniques is difficult to obtain, and current non-maximum suppression techniques fuse poorly when a target lies in the overlapping area of pictures during multi-camera image fusion. On this basis, the present disclosure combines the semantic information provided by images, the depth information of images, and the time-series information at different moments to perform three-dimensional target detection in a space-time four-dimensional space: the target space features at the current moment are aligned with the target space features at the previous moment in a world coordinate system according to the motion information of the host vehicle, target detection is performed on the temporally fused target space features, targets within different distance ranges around the vehicle are perceived, and a stable perception effect is obtained in intelligent driving under multiple scenarios such as integrated traveling and parking.
Exemplary embodiments of the present disclosure will be described below in conjunction with fig. 1-5.
Fig. 1 illustrates a flowchart of an example method 100 of target detection based on traveling and parking integration in accordance with some embodiments of the present disclosure. Referring to fig. 1, overall, at block 110, feature information is determined for each image in a set of images, wherein the set of images is obtained at a first time and at least a portion of the set of images includes one or more targets to be detected located around a host vehicle. At block 120, depth information for feature points of each image in the set of images is predicted to determine target space three-dimensional information based on the feature information and the depth information. At block 130, feature extraction is performed on the target space three-dimensional information to obtain first-time target space features for the one or more targets to be detected. At block 140, the first-time target space features and the second-time target space features are fused, the second-time target space features being obtained at a second time and the second time being earlier than the first time. At block 150, one or more pieces of target detection information corresponding to the one or more targets to be detected are output based on the information obtained by fusing the first-time and second-time target space features.
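As a reading aid, the flow of blocks 110-150 can be summarized with the following high-level sketch; the module names, signatures, and batched tensor shapes are illustrative assumptions rather than the implementation of the disclosure, and each stage is described in detail further below.

```python
import torch

def detect_targets(images_t, bev_feat_prev, model):
    """Hypothetical end-to-end pass at time T. `bev_feat_prev` is the (B, C, H, W) target space
    feature already computed for the earlier time T-1, and `model` bundles the sub-networks
    described in the text (all attribute names are assumptions)."""
    feats_t = model.image_encoder(images_t)                  # Block 110: NxCxHxW feature information
    depth_t = model.depth_head(feats_t)                      # Block 120: NxDxHxW depth information
    bev_3d_t = model.view_transform(feats_t, depth_t)        # Block 120: target space 3D information
    bev_feat_t = model.bev_encoder(bev_3d_t)                 # Block 130: first-time target space feature
    fused = torch.cat([bev_feat_t, bev_feat_prev], dim=1)    # Block 140: temporal fusion, 2C channels
    return model.task_heads(fused)                           # Block 150: target detection information
```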
Exemplary embodiments of the various operations of the method 100 of fig. 1 will be described in detail below in conjunction with figs. 2-3. Fig. 2 illustrates a schematic diagram of a target space feature extraction environment 200 according to some embodiments of the present disclosure, and fig. 3 illustrates a schematic flow diagram 300 of target space feature fusion according to some embodiments of the present disclosure. It should be understood that the environment 200 shown in fig. 2 and the target space feature fusion flow diagram 300 shown in fig. 3 are merely exemplary and should not be construed as limiting the functionality and scope of the implementations described in this disclosure.
As shown in fig. 2, the environment 200 includes a vehicle 205 traveling on a roadway. In the example of fig. 2, the vehicle 205 may be any type of vehicle that can carry people and/or things and is moved by a power system such as an engine, including, but not limited to, a car, truck, bus, electric car, motorcycle, caravan, train, and the like. In some embodiments, the host vehicle 205 (also referred to as the vehicle 205) in the environment 200 may be a vehicle with certain autonomous driving capabilities; such vehicles are also referred to as unmanned vehicles. In some embodiments, the vehicle 205 may also be a vehicle that does not have autonomous driving capability.
The vehicle 205 may be communicatively coupled to a computing device 210. Although shown as a separate entity, the computing device 210 may be embedded in the vehicle 205. The computing device 210 may also be an entity external to the vehicle 205 and may communicate with the vehicle 205 via a wireless network. The computing device 210 may be any device with computing capability. As non-limiting examples, the computing device 210 may be any type of fixed, mobile, or portable computing device, including but not limited to a desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, multimedia computer, mobile phone, etc.; all or a portion of the components of the computing device 210 may be distributed in the cloud. The computing device 210 contains at least a processor, memory, and other components typically found in general-purpose computers to perform computing, storage, communication, control, and other functions.
In some embodiments, the vehicle 205 may acquire images or videos of the physical environment in which it is located through its own sensors or cameras as the set of images used in the method 100 of fig. 1. For example, in a scenario such as L3-level automatic driving, images or video may be acquired as a set of images via one forward wide-angle camera and four fisheye cameras. Preferably, the vehicle 205 may acquire a set of images of its physical environment through all of its own cameras. The set of images may also be acquired in any other suitable manner, which is not limiting to the present disclosure.
Alternatively, in some embodiments, the vehicle 205 may also acquire images or videos as a set of images via other devices in the intelligent transportation system. For example, the vehicle 205 may communicate with roadside cameras around the vehicle to acquire images and videos including the vehicle 205 itself as a set of images.
In some embodiments, computing device 210 may be used to at least partially perform various operations in method 100 shown in fig. 1. For example, computing device 210 may implement method 100 through the operations of the dashed box in fig. 2 and the operations in fig. 3. This will be described in detail below.
In some embodiments, as shown in fig. 2, a set of images obtained by the respective cameras of the vehicle 205 at a certain time instant (also referred to as a "first time instant") may include one or more targets to be detected located around the vehicle 205. The corresponding target detection information may be, for example, one or more of object classification detection information or lane line shape detection information. As shown in fig. 3, the object classification detection information may indicate, for example according to the size of the object, that the target is a car, truck, van, motorcycle, bicycle, pedestrian, pet, or the like. The lane line shape detection information may indicate, for example, a lane line shape such as a solid line, broken line, fishbone line, or double solid line. It should be appreciated that, after calculation by the computing device 210, the one or more targets to be detected will be output as specific information on the target objects and lane lines in the environment surrounding the vehicle 205.
In some embodiments, as shown in fig. 2, after the computing device 210 receives a set of images from the vehicle 205, image preprocessing may be performed on each image in the set of images. After each image in the set is preprocessed, preprocessed feature information of each image can be obtained; for example, an image input of size Nx3xHxW is obtained for image feature extraction, where N represents the N cameras and 3xHxW represents the information of each picture.
In some embodiments, the preprocessed feature information may then be input to an image feature extraction network to obtain feature information for each image (i.e., each camera). The image feature extraction network is used for encoding the input images into high-level features and mainly comprises a backbone for high-level feature extraction and a neck for multi-resolution feature fusion. The backbone may use, for example, ResNet, Swin Transformer, DenseNet, or HRNet, and the neck may use, for example, FPN-LSS, PAFPN-LSS, or NAS-FPN-LSS. In one embodiment, the image feature extraction network is preferably a shared backbone network composed of a ResNet network and an FPN network.
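A minimal sketch of such an image feature extraction network is given below, assuming a torchvision ResNet-50 backbone and a single-level FPN-style lateral fusion; the exact backbone/neck configuration, channel count, and input resolution used by the disclosure may differ, so treat all concrete values as assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):
    """Backbone (ResNet-50) + simple FPN-style neck producing one NxCxHxW feature map."""
    def __init__(self, out_channels=64):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Stem + first two stages; deepest stages kept separate for lateral fusion
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                                  resnet.layer1, resnet.layer2)
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4
        self.lat4 = nn.Conv2d(1024, out_channels, 1)   # lateral conv on C4 (1024 channels)
        self.lat5 = nn.Conv2d(2048, out_channels, 1)   # lateral conv on C5 (2048 channels)
        self.fuse = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, x):                  # x: (N, 3, H_img, W_img), N = number of cameras
        c3 = self.stem(x)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        p5 = self.lat5(c5)
        p4 = self.lat4(c4) + nn.functional.interpolate(p5, size=c4.shape[-2:], mode="nearest")
        return self.fuse(p4)               # (N, C, H, W) feature information

feats = ImageEncoder()(torch.randn(5, 3, 256, 704))   # e.g. 5 cameras -> (5, 64, 16, 44)
```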
In some embodiments, the feature information may include, for example, NxCxHxW type information, where N represents the number of images in the set of images, C represents the number of channels, and H and W represent the feature map size of each image in the set of images after feature extraction. It is to be understood that this feature information is visual data information at the pixel level.
After the feature information of each image is obtained, depth information prediction may be performed for the feature points of each image. In one embodiment, the depth estimation module shown in fig. 2 may obtain the depth information of each feature point in an image in combination with the camera intrinsic and extrinsic parameters of the set of images. In one embodiment, the Lift-Splat-Shoot (LSS) technique may be employed, for example, to predict depth information for each feature point in the feature map. Thus, each picture carries depth information.
In one embodiment, with continued reference to fig. 2, after each feature point in the image has depth information, in order to ensure the accuracy of the depth information, the depth estimation module may make full use of the accurate depth information provided by the lidar point cloud and constrain the depth estimation network with the accurate depths provided by the point cloud during the network training stage. With each pixel having an estimated depth over D bins, the N input cameras may generate a camera feature point cloud of size NxHxWxD, where H and W represent the size of the camera feature maps and D is the depth information. During training, the depth information predicted by the network is supervised with the lidar depth information, so that the network corrects the predicted depth information and obtains corrected depth information. After training is completed, the network can estimate the depth of a detected target at inference time. The lidar information may be obtained, for example, by a lidar sensor arranged on the vehicle 205 or by any other suitable means, which is not limiting to the present disclosure. Through this depth supervision process, accurate depth information can be obtained, ensuring the accuracy of the subsequent conversion to the target space.
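A minimal sketch of the depth prediction and point-cloud depth supervision described above follows, under the assumption that the lidar depths have already been projected into each camera's feature-map grid and discretized into bin indices (that projection is omitted); the module names, layer sizes, and number of depth bins are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthHead(nn.Module):
    """Predicts a discrete depth distribution (D bins) for every feature point, LSS-style."""
    def __init__(self, in_channels=64, depth_bins=59):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, depth_bins, 1))

    def forward(self, feats):                   # feats: (N, C, H, W)
        return self.net(feats).softmax(dim=1)   # (N, D, H, W) depth distribution probability

def depth_supervision_loss(pred_depth, lidar_depth_bins, valid_mask):
    """Cross-entropy between the predicted depth distribution and lidar-derived bin labels.
    lidar_depth_bins: (N, H, W) long tensor of bin indices per feature point;
    valid_mask: (N, H, W) float tensor marking points actually hit by a lidar return."""
    logp = torch.log(pred_depth.clamp(min=1e-6))               # (N, D, H, W)
    loss = F.nll_loss(logp, lidar_depth_bins, reduction="none")  # (N, H, W)
    return (loss * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```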
After the feature information and the depth information of each feature point are obtained, the image features with depth information can be mapped to the target space through a target space mapping table, thereby obtaining the target space three-dimensional information. This operation corresponds to block 120 in fig. 1. In one embodiment, referring to fig. 2, the target space three-dimensional information may be obtained by directly operating on the picture information carrying depth information. For example, the NxDxHxW depth information and the NxCxHxW feature information may be combined to obtain Nx(C+D)xHxW feature information, which may serve as the target space three-dimensional information. Here N represents the number of images in the set of images, C represents the number of channels, H and W represent the size of the feature map of each image after feature extraction, and D represents the depth distribution probability of the feature points.
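The following sketch illustrates combining the NxCxHxW feature information with the NxDxHxW depth information and scattering the result into a BEV grid through a precomputed mapping table; constructing that table from the camera intrinsic and extrinsic parameters is assumed to happen elsewhere, and the function name, accumulation scheme, and grid size are illustrative assumptions rather than the disclosed mapping itself.

```python
import torch

def lift_to_bev(feats, depth, bev_index, bev_hw=(128, 128)):
    """feats: (N, C, H, W); depth: (N, D, H, W); bev_index: (N*H*W,) precomputed flat BEV cell
    index for each feature point (derived from camera intrinsics/extrinsics).
    Returns a (C+D, H_bev, W_bev) target space tensor."""
    n, c, h, w = feats.shape
    d = depth.shape[1]
    combined = torch.cat([feats, depth], dim=1)               # (N, C+D, H, W)
    flat = combined.permute(0, 2, 3, 1).reshape(-1, c + d)    # (N*H*W, C+D)
    bev = torch.zeros(bev_hw[0] * bev_hw[1], c + d, dtype=flat.dtype)
    bev.index_add_(0, bev_index, flat)                        # accumulate features per BEV cell
    return bev.t().reshape(c + d, *bev_hw)

# Illustrative usage with random inputs and a random mapping table
feats = torch.randn(5, 64, 16, 44)
depth = torch.randn(5, 59, 16, 44).softmax(dim=1)
bev_index = torch.randint(0, 128 * 128, (5 * 16 * 44,))
bev_3d = lift_to_bev(feats, depth, bev_index)                 # (123, 128, 128)
```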
In one embodiment, the target space may be the bird's eye view (BEV) space. BEV is a viewing angle or (three-dimensional) coordinate system used to describe the perceived world, and BEV also refers to an end-to-end technique in computer vision that converts visual information from image space to BEV space by a neural network. Correspondingly, the target space three-dimensional information may be BEV space three-dimensional information. It should be appreciated that any other suitable target space may be employed for the above operations, and this disclosure is not limited in this regard.
Then, feature extraction is performed on the target space three-dimensional information to obtain the first-time target space features for the one or more targets to be detected. This operation corresponds to block 130 shown in fig. 1. In one embodiment, the target space features may be extracted using a BEV encoder, whose backbone and neck may be, for example, ResNet50-for-BEV and FPN-for-BEV, e.g., using a ResNet with residual modules as the backbone and FPN-LSS as the neck, to extract features from the target space three-dimensional information (i.e., the BEV space three-dimensional information). By feature extraction in the BEV space, important signals such as scale, rotation, and speed can be perceived with high precision. Specifically, C channels are obtained through feature extraction in the BEV space, and the inferred scale, rotation, speed, and other information are represented in those C channels respectively.
In one embodiment, the first-time target space feature obtained by feature extraction from the target space three-dimensional information may be, for example, a BEV space feature of size CxHxW, where C represents the number of channels and H and W represent the feature size after feature extraction. In this embodiment, the pixel-level visual data associated with the feature information and the lidar point cloud associated with the depth information may be fused to obtain the target space three-dimensional information from which features are to be extracted. When the model is trained, the real depth information detected by the lidar is used to supervise the depth information predicted by the neural network, so that the model learns to predict depth information.
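A minimal sketch of a BEV encoder with a small residual backbone producing the CxHxW first-time target space feature; the block depth and channel counts are illustrative assumptions (123 input channels matches the C+D output of the earlier lifting sketch), not the configuration actually used by the disclosure.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class BEVEncoder(nn.Module):
    """Encodes target-space three-dimensional information into a CxHxW BEV feature."""
    def __init__(self, in_channels=123, out_channels=64):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.blocks = nn.Sequential(ResidualBlock(out_channels), ResidualBlock(out_channels))

    def forward(self, bev_3d):              # (C+D, H_bev, W_bev) or batched (B, C+D, H_bev, W_bev)
        if bev_3d.dim() == 3:
            bev_3d = bev_3d.unsqueeze(0)
        return self.blocks(self.stem(bev_3d))   # (B, C, H_bev, W_bev) BEV space feature

bev_feat_t = BEVEncoder()(torch.randn(123, 128, 128))   # -> (1, 64, 128, 128)
```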
In one embodiment, the above operations may be iterated to obtain the target space features for each time instant. For example, blocks 110 through 130 of method 100 may be performed iteratively to obtain the second-time target space features. Specifically, feature information may be determined for each image in a set of images, wherein the set of images is obtained at the second time and at least some of the images include one or more targets to be detected located around the host vehicle; depth information for the feature points of each image is predicted to determine target space three-dimensional information based on the feature information and the depth information; and feature extraction is performed on the target space three-dimensional information to obtain the second-time target space features for the one or more targets to be detected.
It should be appreciated that the embodiments suitable for obtaining the first-time target space features are also suitable for obtaining the second-time target space features. In an automatic driving scenario, the data, features, results, etc. of a history frame are often readily available, and since there is usually substantial feature overlap between a history frame and the current frame, such history frame information can effectively assist the perception of the current frame.
Then, the first time target spatial feature and the second time target spatial feature are fused, the second time target spatial feature being obtained at a second time and the second time being earlier than the first time. In one embodiment, the second time instant and the first time instant may be consecutive, that is, the images captured at the first time instant and the second time instant, respectively, may be consecutive frames. In other embodiments, the first and second moments in time may also be discontinuous, that is to say several frames may be spaced between the images captured at the first and second moments in time, respectively.
In one embodiment, as shown in fig. 3, taking the case in which the first time and the second time are consecutive as an example, the time-T BEV feature extracted at time T (e.g., the CxHxW output in fig. 2) and the time-(T-1) BEV feature extracted at time T-1 (e.g., the CxHxW output in fig. 2) may be fused by a concatenate (concat) operation to form perception information of a higher space-time dimension, for example 4D space-time perception information (e.g., the 2CxHxW shown in fig. 3). In the concatenate operation, the BEV features at time T and the BEV features extracted at time T-1 are spliced, and the resulting higher space-time dimension target space perception information contains both time-dimension information and three-dimensional spatial information.
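A minimal sketch of this temporal fusion follows: the T-1 BEV feature is first aligned to the current pose (the disclosure describes aligning according to the host vehicle's motion in a world coordinate system; the simple 2D affine warp used here is an illustrative assumption) and then concatenated with the time-T feature along the channel dimension to form the 2CxHxW space-time feature.

```python
import torch
import torch.nn.functional as F

def align_and_fuse(bev_t, bev_prev, ego_motion):
    """bev_t, bev_prev: (B, C, H, W) BEV features at times T and T-1.
    ego_motion: (B, 2, 3) affine matrix mapping current BEV grid coordinates into the previous
    frame, derived from the host vehicle's motion between the two instants (assumed given)."""
    grid = F.affine_grid(ego_motion, bev_t.shape, align_corners=False)
    bev_prev_aligned = F.grid_sample(bev_prev, grid, align_corners=False)
    return torch.cat([bev_t, bev_prev_aligned], dim=1)   # (B, 2C, H, W) space-time feature

# Illustrative usage: identity motion (vehicle assumed stationary between the two frames)
bev_t = torch.randn(1, 64, 128, 128)
bev_prev = torch.randn(1, 64, 128, 128)
identity = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
fused = align_and_fuse(bev_t, bev_prev, identity)         # -> (1, 128, 128, 128)
```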
Then, one or more pieces of target detection information corresponding to the one or more targets to be detected are output based on the information obtained from the fused first-time and second-time target space features. In one embodiment, the information derived from the fused first-time and second-time target space features may be the higher space-time dimension perception information described above. The one or more pieces of target detection information may be, for example, one or more of the object classification detection information or lane line shape detection information described above.
In one embodiment, with continued reference to fig. 3, the higher space-time dimension perception information is input to a multi-task head for multi-task output. The multi-task output may include static semantic maps (lane lines, parking slot detection), dynamic detection (pedestrians, vehicles), motion prediction (speed), etc., for use by the downstream planning and control module. In one embodiment, the specific task head can use the first stage of the CenterHead in the CenterPoint network to perform target detection and classify according to object size, so as to detect cars, trucks, vans, motorcycles, bicycles, pedestrians, pets, and the like. In another embodiment, the upstream output features may also be used for lane line detection, lane line segmentation, etc. For example, classification detection may be performed according to differences in lane line shape, yielding solid line, broken line, fishbone line, and double solid line classes. The accuracy of CenterHead target detection may be controlled via a classification loss (e.g., a Gaussian focal loss function) and a regression loss (e.g., an L1 loss function), respectively, and the accuracy of lane line segmentation may be controlled via a classification loss (e.g., a Gaussian focal loss function) and a localization loss (e.g., a cross-entropy loss function), respectively.
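A minimal sketch of a CenterHead-style multi-task head is given below, pairing a Gaussian focal loss on the classification heatmap with an L1 loss on the box regression as described above; the concrete layer structure, class counts, and box encoding are simplified illustrative assumptions rather than the CenterPoint implementation itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    def __init__(self, in_channels=128, num_classes=7, box_dims=8, num_lane_types=4):
        super().__init__()
        self.heatmap = nn.Conv2d(in_channels, num_classes, 3, padding=1)      # car, truck, pedestrian, ...
        self.box_reg = nn.Conv2d(in_channels, box_dims, 3, padding=1)         # offsets, size, rotation, speed
        self.lane_seg = nn.Conv2d(in_channels, num_lane_types, 3, padding=1)  # solid, broken, fishbone, double solid

    def forward(self, fused_bev):   # fused_bev: (B, 2C, H, W) space-time feature
        return (self.heatmap(fused_bev).sigmoid(),
                self.box_reg(fused_bev),
                self.lane_seg(fused_bev))

def gaussian_focal_loss(pred, gt, alpha=2.0, beta=4.0):
    """CenterNet/CenterPoint-style focal loss on a Gaussian-smoothed ground-truth heatmap."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred.clamp(min=1e-6)) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log((1 - pred).clamp(min=1e-6)) * neg
    return (pos_loss + neg_loss).sum() / pos.sum().clamp(min=1)

def detection_loss(pred_hm, pred_box, gt_hm, gt_box, box_mask):
    cls = gaussian_focal_loss(pred_hm, gt_hm)
    reg = (F.l1_loss(pred_box, gt_box, reduction="none") * box_mask).sum() / box_mask.sum().clamp(min=1)
    return cls + reg
```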
Therefore, on the basis of convolutional-neural-network target space feature fusion, depth supervision provided by the lidar point cloud and fusion of front and rear frame features are added to improve the perception of the vehicle's surroundings, and explicit viewing-angle transformation is performed using the camera intrinsic and extrinsic parameters together with feature encoding in the target space. Because the features are explicitly encoded in three-dimensional space, multi-sensor fusion, multi-task prediction, and temporal fusion are easy to perform, the same perception algorithm can be used for environment perception in both the parking and driving states, problems such as occlusion and missed detection are effectively avoided, and the perception accuracy is improved.
Fig. 4 illustrates a schematic block diagram of an object detection device 400 based on traveling and parking integration in accordance with some embodiments of the present disclosure. In fig. 4, the apparatus 400 includes a feature information determination module 410, a target space three-dimensional information determination module 420, a target space feature acquisition module 430, a target space feature fusion module 440, and a target detection information output module 450.
The feature information determination module 410 is configured to determine feature information for each image in a set of images, wherein the set of images is obtained at a first time and at least some of the images include one or more targets to be detected located around the host vehicle.
The target space three-dimensional information determination module 420 is configured to predict depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information.
The target space feature acquisition module 430 is configured to perform feature extraction on the target space three-dimensional information to obtain first-time target space features for the one or more targets to be detected.
The target space feature fusion module 440 is configured to fuse the first-time target space features with second-time target space features, wherein the second-time target space features are obtained at a second time and the second time is earlier than the first time.
The target detection information output module 450 is configured to output one or more pieces of target detection information corresponding to the one or more targets to be detected based on the information obtained by fusing the first-time and second-time target space features.
In some embodiments, the target space three-dimensional information determination module 420 may be further configured to perform depth estimation based on the camera intrinsic and extrinsic parameters and the feature information, and to perform depth supervision on the initial depth information obtained by the depth estimation to obtain corrected depth information.
In some embodiments, the target space feature acquisition module 430 may be further configured to fuse the pixel-level visual data associated with the feature information with the lidar point cloud associated with the depth information to obtain the target space three-dimensional information from which features are to be extracted.
In some embodiments, the apparatus 400 may further include a second-time target space feature acquisition module, which may be further configured to determine feature information for each image in a set of images, wherein the set of images is obtained at the second time and at least a portion of the set of images includes one or more targets to be detected located around the host vehicle; to predict depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information; and to perform feature extraction on the target space three-dimensional information to obtain the second-time target space features for the one or more targets to be detected.
In some embodiments, the apparatus 400 may be further configured to acquire the set of images from a plurality of cameras arranged on the host vehicle and to perform image preprocessing on each image in the set of images.
Fig. 5 illustrates a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure. The electronic device 500 may be used, for example, to implement the operations in the method 100 shown in fig. 1 or to at least partially implement the computing device 210 shown in fig. 2. Electronic device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the apparatus 500 includes a computing unit 501 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as method 100. For example, in some embodiments, the method 100 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of method 100 described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the method 100 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual Private Server" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A target detection method based on traveling and parking integration, the method comprising:
determining feature information for each image in a set of images, wherein the set of images is obtained at a first time and at least some of the images in the set of images include one or more targets to be detected located around the host vehicle;
predicting depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information;
extracting features from the target space three-dimensional information to obtain first-time target space features for the one or more targets to be detected;
fusing the first-time target space features and second-time target space features, wherein the second-time target space features are obtained at a second time and the second time is earlier than the first time; and
outputting one or more pieces of target detection information corresponding to the one or more targets to be detected based on the information obtained by fusing the first-time target space features and the second-time target space features.
2. The method of claim 1, wherein predicting depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information comprises:
performing depth estimation based on the camera intrinsic and extrinsic parameters and the feature information; and
performing depth supervision on the initial depth information obtained by the depth estimation to obtain the corrected depth information.
3. The method of claim 1, wherein the first time and the second time are consecutive.
4. The method according to claim 1, wherein extracting features from the target space three-dimensional information to obtain the first-time target space features for the one or more targets to be detected preferably comprises:
fusing the pixel-level visual data associated with the feature information and the lidar point cloud associated with the depth information to obtain the target space three-dimensional information from which features are to be extracted.
5. The method of claim 1, wherein the second-time target space features are obtained via:
determining feature information for each image in a set of images, wherein the set of images is obtained at a second time and at least some of the images in the set of images include one or more targets to be detected located around the host vehicle;
predicting depth information for feature points of each image in the set of images to determine target space three-dimensional information based on the feature information and the depth information; and
extracting features from the target space three-dimensional information to obtain the second-time target space features for the one or more targets to be detected.
6. The method according to claim 1, wherein the method further comprises:
acquiring the set of images from a plurality of cameras disposed on the host vehicle; and
performing image preprocessing on each image in the set of images.
7. The method according to any one of claims 1 to 6, wherein:
the feature information comprises NxCxHxW type information, wherein N represents the number of images in the group of images, C represents the number of channels, and H and W represent the size of a feature map of each image in the group of images after feature extraction;
the depth information comprises NxDxHxW type information, wherein N represents the number of images in the group of images, D represents the depth distribution probability of feature points, and H and W represent the size of a feature map of each image in the group of images after feature extraction;
the target space three-dimensional information includes bird's-eye view BEV space three-dimensional information, and any one of the first-time target space feature and the second-time target space feature includes bird's-eye view BEV space feature; and
the one or more target detection information includes one or more of object classification detection information or lane line shape detection information.
8. An object detection device based on traveling and parking integration, characterized by comprising:
a feature information determination module configured to determine feature information for each image in a set of images, wherein the set of images is obtained at a first time and at least some of the images in the set include one or more targets to be detected located around the host vehicle;
a target space three-dimensional information determination module configured to predict depth information for feature points of each image in the set of images, so as to determine target space three-dimensional information based on the feature information and the depth information;
a target spatial feature acquisition module configured to perform feature extraction on the target space three-dimensional information to obtain first-time target spatial features for the one or more targets to be detected;
a target spatial feature fusion module configured to fuse the first-time target spatial features and second-time target spatial features, wherein the second-time target spatial features are obtained at a second time and the second time is earlier than the first time; and
a target detection information output module configured to output one or more pieces of target detection information corresponding to the one or more targets to be detected based on information obtained by fusing the first-time target spatial features and the second-time target spatial features.
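Read as software, the five claimed modules can be wired together roughly as below; the class and attribute names simply mirror the claim wording and are not a published API.

```python
# Hypothetical wiring of the five claimed modules as plain callables.
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence, Tuple

@dataclass
class TargetDetectionDevice:
    determine_feature_info: Callable[[Sequence[Any]], Any]      # feature information determination module
    determine_space_3d: Callable[[Any], Any]                    # target space 3-D information determination module
    extract_space_feature: Callable[[Any], Any]                 # target spatial feature acquisition module
    fuse_space_features: Callable[[Any, Any], Any]              # target spatial feature fusion module
    output_detections: Callable[[Any], Any]                     # target detection information output module

    def run(self, images_t, feature_prev: Optional[Any] = None) -> Tuple[Any, Any]:
        feats = self.determine_feature_info(images_t)            # first-time feature information
        space_3d = self.determine_space_3d(feats)                # target space three-dimensional information
        feature_t = self.extract_space_feature(space_3d)         # first-time target spatial features
        fused = (self.fuse_space_features(feature_t, feature_prev)
                 if feature_prev is not None else feature_t)     # fuse with earlier (second-time) features
        return self.output_detections(fused), feature_t          # detections + feature cached for the next cycle
```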
9. An electronic device, the device comprising:
one or more processors; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202310017712.3A 2023-01-06 2023-01-06 Target detection method and device based on traveling and parking integration, electronic equipment and medium Active CN116246235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310017712.3A CN116246235B (en) 2023-01-06 2023-01-06 Target detection method and device based on traveling and parking integration, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310017712.3A CN116246235B (en) 2023-01-06 2023-01-06 Target detection method and device based on traveling and parking integration, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN116246235A (en) 2023-06-09
CN116246235B CN116246235B (en) 2024-06-11

Family

ID=86634213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310017712.3A Active CN116246235B (en) 2023-01-06 2023-01-06 Target detection method and device based on traveling and parking integration, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116246235B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200025935A1 (en) * 2018-03-14 2020-01-23 Uber Technologies, Inc. Three-Dimensional Object Detection
US20200160559A1 (en) * 2018-11-16 2020-05-21 Uatc, Llc Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
CN111213153A (en) * 2019-01-30 2020-05-29 深圳市大疆创新科技有限公司 Target object motion state detection method, device and storage medium
WO2020237611A1 (en) * 2019-05-31 2020-12-03 深圳市大疆创新科技有限公司 Image processing method and apparatus, control terminal and mobile device
CN112132829A (en) * 2020-10-23 2020-12-25 北京百度网讯科技有限公司 Vehicle information detection method and device, electronic equipment and storage medium
CN112418084A (en) * 2020-11-23 2021-02-26 同济大学 Three-dimensional target detection method based on point cloud time sequence information fusion
CN112836734A (en) * 2021-01-27 2021-05-25 深圳市华汉伟业科技有限公司 Heterogeneous data fusion method and device and storage medium
CN114419519A (en) * 2022-03-25 2022-04-29 北京百度网讯科技有限公司 Target object detection method and device, electronic equipment and storage medium
CN114581870A (en) * 2022-03-07 2022-06-03 上海人工智能创新中心 Trajectory planning method, apparatus, device and computer-readable storage medium
CN114708583A (en) * 2022-02-24 2022-07-05 广州文远知行科技有限公司 Target object detection method, device, equipment and storage medium
CN114723955A (en) * 2022-03-30 2022-07-08 上海人工智能创新中心 Image processing method, device, equipment and computer readable storage medium
CN114792414A (en) * 2022-03-31 2022-07-26 北京鉴智科技有限公司 Target variable detection method and system for carrier
CN114898314A (en) * 2022-04-29 2022-08-12 广州文远知行科技有限公司 Target detection method, device and equipment for driving scene and storage medium
CN114998856A (en) * 2022-06-17 2022-09-02 苏州浪潮智能科技有限公司 3D target detection method, device, equipment and medium of multi-camera image
CN115115713A (en) * 2022-07-18 2022-09-27 浙江大学 Unified space-time fusion all-around aerial view perception method
CN115187776A (en) * 2022-05-31 2022-10-14 北京迈格威科技有限公司 Multitask target detection method, device, equipment and medium
CN115511779A (en) * 2022-07-20 2022-12-23 北京百度网讯科技有限公司 Image detection method, device, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIANG M et al.: "Multi-Task Multi-Sensor Fusion for 3D Object Detection", IEEE, 1 January 2019 (2019-01-01) *
薛培林; 吴愿; 殷国栋; 刘帅鹏; 林乙蘅; 黄文涵; 张云: "Real-time target recognition for urban autonomous vehicles based on information fusion", Journal of Mechanical Engineering, no. 12, 31 December 2020 (2020-12-31) *
裴嘉欣; 孙韶媛; 王宇岚; 李大威; 黄荣: "Night-time environment perception for unmanned vehicles based on an improved YOLOv3 network", Journal of Applied Optics, no. 03, 15 May 2019 (2019-05-15) *
陆峰; 徐友春; 李永乐; 王德宇; 谢德胜: "Obstacle detection method for intelligent vehicles based on information fusion", Journal of Computer Applications, no. 2, 20 December 2017 (2017-12-20) *

Also Published As

Publication number Publication date
CN116246235B (en) 2024-06-11

Similar Documents

Publication Publication Date Title
US11062167B2 (en) Object detection using recurrent neural network and concatenated feature map
WO2020052540A1 (en) Object labeling method and apparatus, movement control method and apparatus, device, and storage medium
CN113264066B (en) Obstacle track prediction method and device, automatic driving vehicle and road side equipment
Fu et al. Lidar and monocular camera fusion: On-road depth completion for autonomous driving
US11527077B2 (en) Advanced driver assist system, method of calibrating the same, and method of detecting object in the same
JP2019096072A (en) Object detection device, object detection method and program
JP2018530825A (en) System and method for non-obstacle area detection
EP3839888B1 (en) Compute device and method for detection of occlusions on a camera
US11436839B2 (en) Systems and methods of detecting moving obstacles
EP4058984A1 (en) Geometry-aware instance segmentation in stereo image capture processes
CN115147809B (en) Obstacle detection method, device, equipment and storage medium
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
WO2024055551A1 (en) Point cloud feature extraction network model training method, point cloud feature extraction method, apparatus, and driverless vehicle
CN115879060B (en) Multi-mode-based automatic driving perception method, device, equipment and medium
Chen et al. Multitarget vehicle tracking and motion state estimation using a novel driving environment perception system of intelligent vehicles
JP2024515761A (en) Data-driven dynamically reconstructed disparity maps
WO2021233154A1 (en) Drivable region detection method and apparatus, and device, and storage medium
Palvanov et al. DHCNN for visibility estimation in foggy weather conditions
Su et al. 3D AVM system for automotive applications
Dev et al. Steering angle estimation for autonomous vehicle
Panda Road boundary detection using 3d-to-2d transformation of lidar data and conditional generative adversarial networks
Nambi et al. FarSight: a smartphone-based vehicle ranging system
CN116246235B (en) Target detection method and device based on traveling and parking integration, electronic equipment and medium
CN116994225A (en) Target detection method, device, computer equipment and storage medium
Xiong et al. Fast and robust approaches for lane detection using multi‐camera fusion in complex scenes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant