CN115797455B - Target detection method, device, electronic equipment and storage medium

Publication number: CN115797455B (granted); earlier publication: CN115797455A
Application number: CN202310080216.2A
Authority: CN (China)
Legal status: Active
Inventor: 叶晓青
Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Other languages: Chinese (zh)
Prior art keywords: feature, spatial, locations, image, images

Abstract

The disclosure provides a target detection method, a target detection device, electronic equipment and a storage medium, relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, 3D vision, deep learning and the like, and can be applied to scenes such as automatic driving, smart cities and the like. The implementation scheme is as follows: obtaining a plurality of images shot for a target space where a target object is located from the position where the target object is located, the plurality of images corresponding to a plurality of view angles, the target space including a plurality of positions; obtaining a first spatial feature for each of the plurality of locations by mapping an image feature for each of the plurality of images to a corresponding location of the plurality of locations; obtaining a second spatial feature of each of the plurality of locations by projecting the location onto a respective one of the plurality of images and extracting a respective feature in the respective image; and obtaining a target detection result of the target space based on the first spatial feature and the second spatial feature of each of the plurality of locations.

Description

Target detection method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and in particular to the technical fields of computer vision, 3D vision, deep learning, and the like, and more particularly to a target detection method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
Artificial intelligence is the discipline of studying how to make a computer mimic certain mental processes and intelligent behaviors of a person (e.g., learning, reasoning, thinking, planning, etc.), and it involves both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
In artificial-intelligence-based image processing, two-dimensional images are processed to identify instances in the three-dimensional space corresponding to those images, thereby realizing 3D target detection, which has been widely applied in various fields.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a target detection method, apparatus, electronic device, computer-readable storage medium, and computer program product.
According to an aspect of the present disclosure, there is provided a target detection method including: obtaining a plurality of images shot from a position of a target object for a target space of the target object, wherein the plurality of images correspond to a plurality of view angles, and the target space comprises a plurality of positions; obtaining a first spatial feature for each of the plurality of locations by mapping an image feature for each of the plurality of images to a corresponding location of the plurality of locations; obtaining a second spatial feature of each of the plurality of locations by projecting the location onto a respective one of the plurality of images and extracting a respective feature in the respective image; and obtaining a target detection result of the target space based on the first spatial feature and the second spatial feature of each of the plurality of locations, the target detection result indicating a target instance located in the target space.
According to another aspect of the present disclosure, there is provided an object detection apparatus including: an image acquisition unit configured to acquire a plurality of images taken for a target space in which a target object is located from a position in which the target object is located, the plurality of images corresponding to a plurality of angles of view, the target space including a plurality of positions; a first spatial feature acquisition unit configured to acquire a first spatial feature of each of the plurality of positions by mapping an image feature of each of the plurality of images to a corresponding position of the plurality of positions; a second spatial feature acquisition unit configured to acquire a second spatial feature of each of the plurality of positions by projecting the position onto a corresponding one of the plurality of images and extracting a corresponding feature in the corresponding image; and a detection result acquisition unit configured to acquire a target detection result of the target space based on the first spatial feature and the second spatial feature of each of the plurality of positions, the target detection result indicating a target instance located in the target space.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method according to an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a method according to embodiments of the present disclosure.
According to one or more embodiments of the present disclosure, the accuracy of the obtained target detection result may be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a target detection method according to an embodiment of the present disclosure;
FIG. 3 illustrates a flowchart of a process for obtaining a first spatial feature for each of the plurality of locations by mapping an image feature for each of the plurality of images to a corresponding location of the plurality of locations in a target detection method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a process of mapping image features of each of the plurality of images to corresponding ones of the plurality of locations in a target detection method according to an embodiment of the present disclosure;
FIG. 5 illustrates a flow chart of a process in a target detection method according to an embodiment of the present disclosure for obtaining a second spatial feature of each of the plurality of locations by projecting the location onto a respective one of the plurality of images and extracting a respective feature in the respective image;
FIG. 6 is a schematic diagram illustrating a process of projecting each of the plurality of locations onto a respective image of the plurality of images in a target detection method according to an embodiment of the disclosure;
FIG. 7 illustrates a flowchart of a process for obtaining a target detection result for the target space based on a first spatial feature and a second spatial feature for each of the plurality of locations in a target detection method according to an embodiment of the present disclosure;
fig. 8 shows a flowchart of a process of obtaining the target detection result based on the first and second bird's-eye view features in a target detection method according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a process of a target detection method according to an embodiment of the disclosure;
FIG. 10 shows a block diagram of a structure of an object detection device according to an embodiment of the present disclosure;
fig. 11 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another element. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the object detection method according to the present disclosure.
In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use the client devices 101, 102, 103, 104, 105, and/or 106 to receive the target detection results obtained according to the target detection methods of the present disclosure. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system and addresses the drawbacks of high management difficulty and weak service scalability in traditional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
According to an aspect of the present disclosure, there is provided a target detection method. As shown in fig. 2, an object detection method 200 according to some embodiments of the present disclosure includes:
step S210: obtaining a plurality of images shot from a position of a target object for a target space of the target object, wherein the plurality of images correspond to a plurality of view angles, and the target space comprises a plurality of positions;
step S220: obtaining a first spatial feature for each of the plurality of locations by mapping an image feature for each of the plurality of images to a corresponding location of the plurality of locations;
step S230: obtaining a second spatial feature of each of the plurality of locations by projecting the location onto a respective one of the plurality of images and extracting a respective feature in the respective image; and
step S240: a target detection result of the target space is obtained based on the first spatial feature and the second spatial feature of each of the plurality of locations, the target detection result being indicative of a target instance located in the target space.
In the related art, the image features of two-dimensional images are converted into a three-dimensional space, and a target detection result is obtained based on the spatial features obtained after the conversion. The conversion is performed by projecting points of the space onto images of different viewing angles and sampling the corresponding image features, so the receptive field is insufficient and the accuracy of the obtained target detection result is not high.
In the embodiments of the present disclosure, the first spatial features and the second spatial features are obtained by performing the two-dimensional-to-three-dimensional feature transformation on the obtained plurality of images in different manners, and the target detection result is obtained based on both. The spatial features obtained by multiple kinds of two-dimensional-to-three-dimensional transformation are therefore all taken into account when obtaining the target detection result, which improves the accuracy of the obtained spatial features and thus the accuracy of the detection result. Moreover, in converting the two-dimensional image features into the target space, the first spatial features are obtained by forward-mapping the plurality of images, while the second spatial features are obtained by back-projecting the spatial positions and then extracting features; considering the three-dimensional spatial features obtained both by the forward-projection method and by the backward feature-extraction method enlarges the receptive field of the obtained three-dimensional spatial features and improves the accuracy of the detection result.
In some embodiments, the target object may be any object, such as a vehicle, a person, or a tree, among others.
In some embodiments, the target object includes a vehicle, and the plurality of images include a plurality of images respectively acquired by a plurality of image capturing devices on the vehicle, wherein the plurality of image capturing devices are respectively installed at different positions of the vehicle, and the plurality of images may be a plurality of images respectively acquired by the plurality of image capturing devices at the same time. The object detection method according to the present disclosure can thus be applied to the detection and recognition of obstacles in the space where a vehicle is located during automatic driving.
In some embodiments, the target object comprises an imaging device, and the plurality of images comprises a plurality of images acquired by the imaging device in a plurality of directions, respectively. For example, the image pickup device is rotated about a central axis so that a plurality of images are picked up by the image pickup device during the rotation.
In some embodiments, the target space is the three-dimensional space in which the target object is located. It may be a preset space with the target object as a base point. For example, the nearest and farthest ranges seen in each image are predefined, and the three-dimensional space is obtained from that range.
In some embodiments, the plurality of locations of the target space may be a plurality of locations corresponding to a plurality of voxels obtained after the target space is divided into the plurality of voxels.
For example, the farthest range seen in each image is predefined as a distance Dmax from the target object, and on this basis the target space is divided into L voxels along the depth direction. In the case of a uniform distribution, the side length of each voxel is Dmax/L; in the case of a non-linear distribution, the depth value of each voxel is given by formula (1) (shown as an image in the original publication), in which the index of a voxel in the depth distribution determines its corresponding depth value.
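Since formula (1) itself is not reproduced in this text, the sketch below is only a hedged illustration in Python/NumPy: it shows a uniform division of the depth range into L bins alongside one commonly used spacing-increasing (non-linear) scheme. The non-linear form is an assumption standing in for the patent's formula (1), not a reproduction of it, and the concrete numbers are illustrative.

```python
import numpy as np

def uniform_depth_bins(d_max: float, num_bins: int) -> np.ndarray:
    """Uniform discretization: every voxel along the ray has depth step d_max / L."""
    i = np.arange(1, num_bins + 1)
    return d_max / num_bins * i

def nonlinear_depth_bins(d_max: float, num_bins: int, d_min: float = 0.0) -> np.ndarray:
    """Spacing-increasing discretization (an assumed example of a non-linear scheme):
    the bin width grows with the bin index, so depths near the target object are
    sampled more densely than far-away ones."""
    i = np.arange(1, num_bins + 1)
    return d_min + (d_max - d_min) * i * (i + 1) / (num_bins * (num_bins + 1))

# Example: 64 depth bins up to 60 m in front of the target object.
print(uniform_depth_bins(60.0, 64)[:5])    # evenly spaced depths
print(nonlinear_depth_bins(60.0, 64)[:5])  # denser near the camera
```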
In some embodiments, the target space is a three-dimensional space obtained by uniform sampling in the height direction based on a bird's-eye-view (BEV) space corresponding to the target object. For example, the BEV space is a grid whose two dimensions are the length and width of the BEV space; within each cell, Z points are uniformly sampled over the height range [-5 m, 5 m], yielding a 3D point set in real space.
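A minimal sketch of this sampling step is shown below, assuming an X-by-Y BEV grid; the cell size and the example grid dimensions are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def bev_point_set(x_cells: int, y_cells: int, z_samples: int,
                  cell_size: float = 0.5, z_range=(-5.0, 5.0)) -> np.ndarray:
    """Uniformly sample z_samples heights in z_range within every BEV cell,
    yielding an (x_cells, y_cells, z_samples, 3) array of 3D points."""
    xs = (np.arange(x_cells) + 0.5) * cell_size          # cell centers along the length
    ys = (np.arange(y_cells) + 0.5) * cell_size          # cell centers along the width
    zs = np.linspace(z_range[0], z_range[1], z_samples)  # uniform samples in height
    gx, gy, gz = np.meshgrid(xs, ys, zs, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1)

points = bev_point_set(128, 128, 8)   # -> shape (128, 128, 8, 3)
```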
In some embodiments, as shown in fig. 3, step S220, obtaining the first spatial feature of each of the plurality of positions by mapping the image feature of each of the plurality of images to a corresponding position of the plurality of positions, includes:
step S310: for each of the plurality of images, obtaining an image feature of the image, obtaining a distribution probability of each of a plurality of depths of each pixel on the image along the viewing angle direction to which the image corresponds, and obtaining a first spatial sub-feature of each of the plurality of positions corresponding to the image based on the distribution probability of each of the plurality of depths of each pixel on the image and the image feature of the image; and
step S320: for each of the plurality of locations, a first spatial feature of the location is obtained based on a plurality of first spatial sub-features of the location corresponding to the plurality of images.
The depth distribution probability corresponding to each pixel of the image is predicted, and based on this depth distribution probability the image features are transformed from two dimensions to three dimensions to obtain the first spatial features. The depth information contained in the image is thus taken into account when obtaining the first spatial features, making the obtained first spatial features more accurate.
In some embodiments, in step S310, feature extraction is performed on the plurality of images through a backbone network to obtain a feature map, which denotes the image features of the respective images concatenated along the channel dimension; [H, W] is the resolution of each image, nCal is the number of images, C is the number of channels of the feature map, and 32 denotes that the feature map output by the feature extractor is 1/32 of the original size (i.e., its spatial size is H/32 × W/32).
In some embodiments, in step S310, the feature map is further taken as an input, and the depth distribution probability of each pixel on each image is predicted. The depth distribution probability indicates the distribution probability of each of a plurality of depths of the pixel along the viewing angle direction to which the image corresponds.
In some embodiments, in step S310, for each image, a transformation of the image features of the image from two dimensions to three dimensions is achieved by multiplying the image features of the image with the corresponding depth distribution probabilities.
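A hedged sketch of this multiplication for a single image is given below, assuming per-pixel image features of shape (C, H, W) and a per-pixel depth distribution of shape (D, H, W); the variable names and sizes are illustrative.

```python
import numpy as np

def lift_image_features(feat: np.ndarray, depth_prob: np.ndarray) -> np.ndarray:
    """2D -> 3D lift for one view: weight the per-pixel feature vector by the
    probability of every depth bin, giving a (C, D, H, W) frustum feature."""
    # feat: (C, H, W), depth_prob: (D, H, W); broadcasting forms the outer product.
    return feat[:, None, :, :] * depth_prob[None, :, :, :]

C, D, H, W = 64, 32, 16, 44
feat = np.random.rand(C, H, W).astype(np.float32)
depth_prob = np.random.rand(D, H, W).astype(np.float32)
depth_prob /= depth_prob.sum(axis=0, keepdims=True)  # normalize over the depth bins
frustum = lift_image_features(feat, depth_prob)      # -> (64, 32, 16, 44)
```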
In some embodiments, step S310 further includes converting each of the plurality of depths along the viewing angle direction corresponding to each image into a position in the target space. Specifically, for each depth, the corresponding position in the target space is calculated from the depth, the pixel coordinates on the image, and the camera intrinsic matrix K and extrinsic matrix T corresponding to the image; the coordinates in the target space corresponding to each depth are calculated by formula (2) (shown as an image in the original publication), which back-projects the homogeneous image coordinates together with the depth D through the intrinsic and extrinsic matrices into coordinates in the target space.
After the position in the target space corresponding to each depth is obtained, the feature obtained from the two-dimensional-to-three-dimensional transformation of the image feature for that depth is mapped to the corresponding position in the three-dimensional space, i.e., to the position to which the first spatial sub-feature of the image corresponds.
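Because formula (2) appears only as an image in the original, the sketch below shows a standard pinhole back-projection consistent with the surrounding description; it assumes the extrinsic matrix T maps target-space (world) coordinates into the camera frame, which is an assumption about the convention used, and the numeric values are placeholders.

```python
import numpy as np

def backproject_pixel(u: float, v: float, depth: float,
                      K: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Map homogeneous image coordinates (u, v, 1) at a given depth to a 3D point
    in the target space, using intrinsics K (3x3) and extrinsics T (4x4).
    Assumes T transforms target-space points into the camera frame."""
    cam_pt = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))   # camera coordinates
    cam_h = np.append(cam_pt, 1.0)                                # homogeneous form
    world_h = np.linalg.inv(T) @ cam_h                            # back to target space
    return world_h[:3]

K = np.array([[1000.0, 0.0, 480.0],
              [0.0, 1000.0, 270.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)                                     # placeholder extrinsics
p = backproject_pixel(320.0, 200.0, 12.5, K, T)   # 3D location for this depth bin
```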
Referring to fig. 4, a schematic diagram is shown illustrating the process of mapping the image features of one image to corresponding positions among the plurality of positions of the target three-dimensional space in a target detection method according to some embodiments of the present disclosure. By calculating the depth distribution of the individual pixels in the image, the features of each pixel in the image (as shown at 410) are mapped to the corresponding locations in the target space (as shown at 420).
The above-described process is carried out for each image, thereby obtaining, for each position, a plurality of first spatial sub-features corresponding to the plurality of images.
In some embodiments, for each of the plurality of locations in the target space, the average of the plurality of first spatial sub-features of that location corresponding to the plurality of images is taken as the first spatial feature of the location.
In some embodiments, as shown in fig. 5, step S230, obtaining the second spatial feature of each of the plurality of locations by projecting the location onto a respective one of the plurality of images and extracting a respective feature in the respective image includes:
step S510: obtaining a camera pose for each of the plurality of images;
step S520: for each of the plurality of locations, projecting each of the plurality of images based on a camera pose of the image, responsive to determining that the image has a projected location corresponding to the location, determining that the image is a corresponding image of the location, and extracting features of the corresponding image corresponding to the projected location as second spatial sub-features of the location corresponding to the image; and
Step S530: for each of the plurality of locations, in response to obtaining one or more second spatial sub-features of the location corresponding to one or more respective images of the plurality of images, obtaining a second spatial feature of the location based on the one or more spatial sub-features.
By projecting each position in the target space into the images and extracting the features at the corresponding image positions, the transformation of image features into the three-dimensional space is performed in a way that involves only the extraction of image features, so the amount of data processing is small.
In some embodiments, the camera pose indicates the camera's intrinsic matrix K and extrinsic matrix T, and each of the plurality of locations in the target space is projected into the respective image through the intrinsic matrix and the extrinsic matrix.
Referring to fig. 6, a schematic diagram is shown of the process, in an object detection method according to some embodiments of the present disclosure, of projecting each of the plurality of locations onto a respective one of the plurality of images and extracting the respective feature in that image. Each position in the target space (as shown at 610) is projected, based on the intrinsic matrix K and the extrinsic matrix T, into the plurality of images corresponding to the plurality of viewing angles (as shown at 620).
It will be appreciated that some locations may be projected into multiple images, some may be projected into only one image, and some may not be projected into any image. In the specific calculation, for a position that can be projected into an image, the image feature at the corresponding position in that image is extracted and stored at the position as a second spatial sub-feature of the position corresponding to that image.
In some embodiments, for each of the plurality of locations, the second spatial feature of the location is a mean of one or more second spatial sub-features corresponding to the location.
For each of the plurality of positions, taking the average of the one or more second spatial sub-features corresponding to the position as its second spatial feature improves the accuracy of the obtained second spatial features.
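A hedged sketch of the projection-and-sampling step for one 3D location across several views is given below; nearest-pixel sampling is used for brevity, and the camera convention follows the same assumption as the back-projection sketch above.

```python
import numpy as np

def second_spatial_feature(point_xyz, feats, Ks, Ts):
    """Project one target-space point into every view; for each view whose image
    plane it falls on (and that sees it in front of the camera), read the feature
    at the projected pixel, then average the collected sub-features."""
    subfeats = []
    for feat, K, T in zip(feats, Ks, Ts):          # feat: (C, H, W) per view
        cam = T @ np.append(point_xyz, 1.0)        # target space -> camera frame
        if cam[2] <= 0:                            # behind the camera
            continue
        uv = K @ (cam[:3] / cam[2])                # perspective projection
        u, v = int(round(uv[0])), int(round(uv[1]))
        C, H, W = feat.shape
        if 0 <= u < W and 0 <= v < H:              # projected location lies on the image
            subfeats.append(feat[:, v, u])         # second spatial sub-feature
    if not subfeats:
        return np.zeros(feats[0].shape[0], dtype=feats[0].dtype)
    return np.mean(subfeats, axis=0)               # mean over the contributing views
```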
In some embodiments, in step S240, the target detection result is obtained directly based on the first spatial feature and the second spatial feature of each of the plurality of locations. For example, the first spatial feature and the second spatial feature of each of the plurality of locations are input to a multi-layer perceptron (MLP) and a task head for the specific target detection task, and the target attributes corresponding to the respective locations are predicted, thereby obtaining the target detection result.
In some embodiments, as shown in fig. 7, step S240, obtaining the target detection result of the target space based on the first spatial feature and the second spatial feature of each of the plurality of positions includes:
step S710: obtaining a first aerial view feature based on a first spatial feature of each of the plurality of locations, the first aerial view feature comprising a first feature of each of a plurality of first locations of the plurality of locations, the plurality of first locations being located on a horizontal plane in which the target object is located, the first feature being obtained based on a plurality of first spatial features of a plurality of second locations of the plurality of locations, the plurality of second locations being located in a vertical direction in which the first locations are located;
step S720: obtaining a second aerial view feature based on a second spatial feature of each of the plurality of locations, the second aerial view feature comprising a second feature of each of the plurality of first locations, the second feature being obtained based on a plurality of second spatial features of the plurality of second locations; and
step S730: and obtaining the target detection result based on the first aerial view characteristic and the second aerial view characteristic.
After the first aerial view feature and the second aerial view feature are obtained based on the first spatial feature and the second spatial feature of each of the plurality of positions, respectively, the target detection result is obtained based on the first aerial view feature and the second aerial view feature. Because the aerial view features express the distribution of the image features of each image over the plane of the target space in which the target object is located, and express that distribution accurately, the accuracy of the obtained target detection result is improved.
In some embodiments, the first feature of each of the plurality of first locations comprises a mean of a plurality of first spatial features of the plurality of second locations; or alternatively
The second feature of each of the plurality of first locations comprises a mean of the plurality of second spatial features of the plurality of second locations.
The first aerial view feature or the second aerial view feature is obtained by mean-value fusion of the first spatial features or the second spatial features of the positions lying in the same vertical direction, so the accuracy of the obtained first aerial view feature and second aerial view feature is high.
For example, in step S710, as shown in fig. 4, the average value of the plurality of first spatial features corresponding to the plurality of second positions located in the same vertical direction (same column) among the plurality of positions of the target space is taken as the first feature of the corresponding first position, thereby obtaining the first aerial view feature (as shown at 430).
For another example, in step S720, as shown in fig. 6, the average value of the plurality of second spatial features corresponding to the plurality of second positions located in the same vertical direction (same column) among the plurality of positions of the target space is taken as the second feature of the corresponding first position, thereby obtaining the second bird's-eye view feature (as shown at 630).
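A minimal sketch of this column-wise mean pooling is shown below, assuming voxel features of shape (X, Y, Z, C) where Z indexes the vertical direction; the grid sizes are illustrative.

```python
import numpy as np

def collapse_to_bev(voxel_feat: np.ndarray) -> np.ndarray:
    """Average the spatial features of all positions in the same vertical column,
    turning an (X, Y, Z, C) volume into an (X, Y, C) bird's-eye-view feature."""
    return voxel_feat.mean(axis=2)

voxels = np.random.rand(128, 128, 8, 64).astype(np.float32)
bev = collapse_to_bev(voxels)   # -> (128, 128, 64); applied to both feature branches
```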
In some embodiments, in step S730, the respective corresponding target detection results are obtained by inputting the first bird 'S-eye view feature and the second bird' S-eye view feature into the MLP and the detection head, respectively, and the final detection result is obtained based on the respective corresponding target detection results.
In some embodiments, as shown in fig. 8, in step S730, the obtaining the target detection result based on the first aerial view feature and the second aerial view feature includes:
step S810: obtaining a first attention weight corresponding to the first aerial view feature and a second attention weight corresponding to the second aerial view feature, wherein the first attention weight indicates the importance degree of the first aerial view feature, and the second attention weight indicates the importance degree of the second aerial view feature;
Step S820: based on the first attention weight and the second attention weight, performing feature fusion on the first aerial view feature and the second aerial view feature to obtain fusion features; and
step S830: and obtaining the target detection result based on the fusion characteristic.
Weights corresponding to the first aerial view feature and the second aerial view feature are obtained respectively, the two features are fused, and the target detection result is obtained based on the fusion feature, so that the obtained target detection result takes the importance of the first aerial view feature and the second aerial view feature into account, which improves its accuracy.
In some embodiments, the first attention weight and the second attention weight are weights corresponding to the first aerial view feature and the second aerial view feature, respectively, and the fusion feature is obtained by multiplying the first aerial view feature and the second aerial view feature by the corresponding weights, respectively, and then adding the first aerial view feature and the second aerial view feature.
In some embodiments, the first attention weight or the second attention weight respectively includes a weight corresponding to each of the plurality of first locations, the weight indicating a degree of importance of a feature corresponding to the first location.
By obtaining, for each of the plurality of first positions, a weight value indicating the importance of the features at that position, and fusing the first aerial view feature and the second aerial view feature based on these per-position weight values, feature fusion of the two aerial view features is realized with a point-by-point attention mechanism, which further improves the accuracy of the obtained fusion feature and thus of the target detection result.
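A hedged sketch of one possible point-by-point attention fusion follows; how the per-location weights are produced (here, logits from some small network, normalized with a softmax over the two branches) is an assumption, since the patent does not fix that architecture in this passage.

```python
import numpy as np

def fuse_bev_features(bev1: np.ndarray, bev2: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Point-by-point attention fusion of two (X, Y, C) BEV features.
    w has shape (X, Y, 2): per-location logits for the two branches."""
    attn = np.exp(w - w.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)       # softmax over the 2 branches
    # The first weight scales the forward-mapping branch, the second the
    # back-projection branch; the weights differ per BEV location.
    return attn[..., :1] * bev1 + attn[..., 1:] * bev2

X, Y, C = 128, 128, 64
bev1 = np.random.rand(X, Y, C).astype(np.float32)
bev2 = np.random.rand(X, Y, C).astype(np.float32)
logits = np.random.rand(X, Y, 2).astype(np.float32)      # would come from a small network
fused = fuse_bev_features(bev1, bev2, logits)            # -> (128, 128, 64)
```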
In some embodiments, the target detection result is obtained by inputting the fusion feature into the MLP and the detection head; the target detection result may indicate, for example, a detection box, a center point, a category, or an orientation angle for each of a plurality of instances in the target space, which is not limited herein.
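As a rough illustration only (the patent does not specify the head architecture here), a per-location head could map each fused BEV feature to such attributes; all layer sizes and the output encoding below are assumptions.

```python
import numpy as np

def simple_detection_head(fused: np.ndarray, w1, b1, w2, b2) -> np.ndarray:
    """Per-location MLP over a fused (X, Y, C) BEV feature. The output channels
    could encode, e.g., center offset, box size, orientation angle and class
    scores for the instance (if any) at that BEV location."""
    h = np.maximum(fused @ w1 + b1, 0.0)   # hidden layer with ReLU
    return h @ w2 + b2                     # raw per-location predictions

X, Y, C, hidden, out = 128, 128, 64, 128, 10
rng = np.random.default_rng(0)
fused = rng.standard_normal((X, Y, C)).astype(np.float32)
w1 = rng.standard_normal((C, hidden)).astype(np.float32)
b1 = np.zeros(hidden, dtype=np.float32)
w2 = rng.standard_normal((hidden, out)).astype(np.float32)
b2 = np.zeros(out, dtype=np.float32)
preds = simple_detection_head(fused, w1, b1, w2, b2)   # -> (128, 128, 10)
```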
Referring to fig. 9, a schematic diagram illustrating a process of a target detection method according to an embodiment of the present disclosure is shown, in which a plurality of images 910 corresponding to a plurality of viewing angles are input to a backbone network 901 to obtain image features 920 corresponding to the plurality of images, two-dimensional-to-three-dimensional conversion is performed on the image features 920 to obtain a first aerial view feature 930A and a second aerial view feature 930B, respectively, the first aerial view feature 930A and the second aerial view feature 930B are fused based on an attention mechanism (fusion 902) to obtain a fused feature 940, and the fused feature is input to a prediction head 903 to obtain a target detection result 950.
According to another aspect of the present disclosure, there is also provided an object detection apparatus, as shown in fig. 10, an apparatus 1000 including: an image acquisition unit 1010 configured to obtain a plurality of images taken for a target space in which a target object is located from a position in which the target object is located, the plurality of images corresponding to a plurality of perspectives, the target space including a plurality of positions; a first spatial feature acquisition unit 1020 configured to obtain a first spatial feature of each of the plurality of positions by mapping an image feature of each of the plurality of images to a corresponding position of the plurality of positions; a second spatial feature acquisition unit 1030 configured to acquire a second spatial feature of each of the plurality of positions by projecting the position onto a corresponding one of the plurality of images and extracting a corresponding feature in the corresponding image; and a detection result acquisition unit 1040 configured to obtain a target detection result of the target space, which indicates a target instance located in the target space, based on the first spatial feature and the second spatial feature of each of the plurality of positions.
In some embodiments, the first spatial feature acquisition unit 1020 includes: a first spatial sub-feature obtaining unit configured to obtain, for each of the plurality of images, an image feature of the image and a distribution probability of each of a plurality of depths of each pixel on the image along the viewing angle direction to which the image corresponds, and to obtain, based on those distribution probabilities and the image feature of the image, a first spatial sub-feature of each of the plurality of positions corresponding to the image; and a first spatial feature acquisition subunit configured to, for each of the plurality of positions, obtain a first spatial feature of the position based on a plurality of first spatial sub-features of the position corresponding to the plurality of images.
In some embodiments, the second spatial feature acquisition unit 1030 includes: a camera pose acquisition unit configured to acquire a camera pose of each of the plurality of images; a second spatial sub-feature acquisition unit configured to project, for each of the plurality of positions, the position based on the camera pose of each of the plurality of images, determine that the image is a corresponding image of the position in response to determining that the image has a projection position corresponding to the position, and extract, as a second spatial sub-feature of the position corresponding to the image, a feature of the corresponding image at the projection position; and a second spatial feature acquisition subunit configured to, for each of the plurality of locations, in response to obtaining one or more second spatial sub-features of the location corresponding to one or more respective ones of the plurality of images, obtain a second spatial feature of the location based on the one or more second spatial sub-features.
In some embodiments, for each of the plurality of locations, the second spatial feature of the location is a mean of one or more second spatial sub-features corresponding to the location.
In some embodiments, the detection result obtaining unit 1040 includes: a first bird's-eye view feature obtaining unit configured to obtain a first bird's-eye view feature based on a first spatial feature of each of the plurality of positions, the first bird's-eye view feature including a first feature of each of a plurality of first positions of the plurality of positions, the plurality of first positions being located on a horizontal plane in which the target object is located, the first feature being obtained based on a plurality of first spatial features of a plurality of second positions of the plurality of positions, the plurality of second positions being located in a vertical direction in which the first position is located; a second bird's-eye view feature acquisition unit configured to obtain a second bird's-eye view feature based on a second spatial feature of each of the plurality of positions, the second bird's-eye view feature including a second feature of each of the plurality of first positions, the second feature being obtained based on a plurality of second spatial features of the plurality of second positions; and a detection result acquisition subunit configured to obtain the target detection result based on the first bird's-eye view feature and the second bird's-eye view feature.
In some embodiments, the first feature of each of the plurality of first locations comprises a mean of a plurality of first spatial features of the plurality of second locations; or the second feature of each of the plurality of first locations comprises a mean of the plurality of second spatial features of the plurality of second locations.
In some embodiments, the detection result acquisition subunit includes: an attention weight acquisition unit configured to acquire a first attention weight corresponding to the first bird's-eye view feature and a second attention weight corresponding to the second bird's-eye view feature, the first attention weight indicating a degree of importance of the first bird's-eye view feature, the second attention weight indicating a degree of importance of the second bird's-eye view feature; a feature fusion unit configured to perform feature fusion on the first aerial view feature and the second aerial view feature based on the first attention weight and the second attention weight, so as to obtain a fusion feature; and a first acquisition subunit configured to obtain the target detection result based on the fusion feature.
In some embodiments, the first attention weight or the second attention weight respectively includes a weight corresponding to each of the plurality of first locations, the weight indicating a degree of importance of a feature corresponding to the first location.
In some embodiments, the target object comprises a vehicle and the plurality of images comprises a plurality of images acquired by a plurality of cameras on the vehicle, respectively.
In some embodiments, the target object comprises an imaging device, and the plurality of images comprises a plurality of images acquired by the imaging device in a plurality of directions, respectively.
According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
Referring to fig. 11, a block diagram of an electronic device 1100, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the electronic device 1100 includes a computing unit 1101 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the electronic device 1100 can also be stored. The computing unit 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
A number of components in the electronic device 1100 are connected to the I/O interface 1105, including: an input unit 1106, an output unit 1107, a storage unit 1108, and a communication unit 1109. The input unit 1106 may be any type of device capable of inputting information to the electronic device 1100, the input unit 1106 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 1107 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. Storage unit 1108 may include, but is not limited to, magnetic disks, optical disks. The communication unit 1109 allows the electronic device 1100 to exchange information/data with other devices through computer networks such as the internet and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as bluetooth (TM) devices, 802.11 devices, wiFi devices, wiMax devices, cellular communication devices, and/or the like.
The computing unit 1101 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1101 performs the various methods and processes described above, such as method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1108. In some embodiments, some or all of the computer programs may be loaded and/or installed onto electronic device 1100 via ROM 1102 and/or communication unit 1109. One or more of the steps of the method 200 described above may be performed when a computer program is loaded into the RAM 1103 and executed by the computing unit 1101. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the method 200 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the aspects of the present disclosure are achieved, which is not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but is defined only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (20)

1. A target detection method comprising:
obtaining a plurality of images shot, from a position where a target object is located, of a target space in which the target object is located, wherein the plurality of images correspond to a plurality of view angles, and the target space comprises a plurality of locations;
mapping an image feature of each of the plurality of images to a corresponding one of the plurality of locations by forward mapping, to obtain a first spatial feature of each of the plurality of locations;
projecting each of the plurality of locations onto a respective one of the plurality of images by back projection and extracting a respective feature in the respective image to obtain a second spatial feature of the location; and
obtaining a target detection result of the target space based on the first spatial feature and the second spatial feature of each of the plurality of locations, the target detection result indicating a target instance located in the target space;
wherein said mapping the image feature of each of the plurality of images to a corresponding one of the plurality of locations by forward mapping to obtain a first spatial feature of each of the plurality of locations comprises:
for each of the plurality of images,
obtaining an image feature of the image,
obtaining a distribution probability of each pixel on the image along each of a plurality of depths in a viewing angle direction corresponding to the image, and
obtaining a first spatial sub-feature of each of the plurality of locations corresponding to the image based on the distribution probability of each pixel on the image along each of the plurality of depths and the image feature of the image; and
for each of the plurality of locations, obtaining a first spatial feature of the location based on a plurality of first spatial sub-features of the location corresponding to the plurality of images.
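For illustration, a minimal PyTorch sketch of the forward-mapping step recited in claim 1 follows. It assumes the per-pixel depth distribution has already been predicted and that a precomputed integer index maps every (depth, pixel) sample to a grid location; the tensor shapes, helper names, and sum-based accumulation are assumptions of this sketch, not the claimed implementation.

```python
# Illustrative sketch (assumptions, not the claimed implementation): lift per-view
# image features into a shared grid of locations using a per-pixel depth distribution.
import torch

def forward_map_view(img_feat, depth_prob, voxel_index):
    """img_feat:    (C, H, W)  image feature of one view
    depth_prob:  (D, H, W)  probability of each pixel lying at each of D depths
    voxel_index: (D, H, W)  long tensor; index of the grid location that each
                            (depth, pixel) sample falls into
    returns:     (N, C)     first spatial sub-features for the N grid locations"""
    C, H, W = img_feat.shape
    # Weight every pixel's feature by its probability at every depth: (D, C, H, W).
    frustum = depth_prob.unsqueeze(1) * img_feat.unsqueeze(0)
    frustum = frustum.permute(0, 2, 3, 1).reshape(-1, C)   # (D*H*W, C)
    flat_idx = voxel_index.reshape(-1)                      # (D*H*W,)
    n_locations = int(flat_idx.max()) + 1
    # Accumulate ("splat") all samples that land in the same location.
    loc_feat = torch.zeros(n_locations, C)
    loc_feat.index_add_(0, flat_idx, frustum)
    return loc_feat

# The first spatial feature of a location can then be obtained, e.g., as the mean of
# its sub-features over all views:
# first_spatial = torch.stack([forward_map_view(f, p, idx) for f, p in views]).mean(0)
```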
2. The method of claim 1, wherein the projecting each of the plurality of locations onto a respective one of the plurality of images by backprojection and extracting a respective feature in the respective image to obtain a second spatial feature for the location comprises:
obtaining a camera pose for each of the plurality of images;
for each of the plurality of locations,
projecting the location based on the camera pose of each of the plurality of images,
in response to determining that an image has a projection position corresponding to the location, determining that the image is a corresponding image of the location, and
extracting a feature corresponding to the projection position in the corresponding image as a second spatial sub-feature of the location corresponding to the image; and
for each of the plurality of locations, in response to obtaining one or more second spatial sub-features of the location corresponding to one or more respective images of the plurality of images, obtaining a second spatial feature of the location based on the one or more second spatial sub-features.
3. The method of claim 2, wherein, for each of the plurality of locations, the second spatial feature of that location is a mean of one or more second spatial sub-features corresponding to that location.
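By way of comparison, the following sketch illustrates the back-projection of claims 2-3: each 3D location is projected into every view using its camera pose, the image feature at the projected position is sampled, and the second spatial feature is taken as the mean over the views with a valid projection. The tensor layout, the bilinear sampling, and the validity test are assumptions made for illustration.

```python
# Hedged sketch of back-projection (claims 2-3); shapes and validity rule are assumed.
import torch
import torch.nn.functional as F

def back_project(points, img_feats, intrinsics, extrinsics):
    """points:     (N, 3)       3D locations in the target space
    img_feats:  (V, C, H, W)  feature maps of the V views
    intrinsics: (V, 3, 3)     camera intrinsic matrices
    extrinsics: (V, 3, 4)     world-to-camera poses [R | t]
    returns:    (N, C)        mean of the valid per-view second spatial sub-features"""
    V, C, H, W = img_feats.shape
    N = points.shape[0]
    homo = torch.cat([points, torch.ones(N, 1)], dim=1)      # (N, 4) homogeneous coords
    feat_sum = torch.zeros(N, C)
    valid_cnt = torch.zeros(N, 1)
    for v in range(V):
        cam = (extrinsics[v] @ homo.T).T                     # (N, 3) camera coordinates
        in_front = cam[:, 2] > 1e-3                          # only points in front of the camera
        pix = (intrinsics[v] @ cam.T).T
        pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-3)       # (N, 2) projection positions
        # Normalize to [-1, 1] for grid_sample; projections outside the image are invalid.
        grid = torch.stack([pix[:, 0] / (W - 1), pix[:, 1] / (H - 1)], dim=1) * 2 - 1
        valid = (in_front & (grid.abs() <= 1).all(dim=1)).float().unsqueeze(1)
        sampled = F.grid_sample(img_feats[v:v + 1], grid.view(1, N, 1, 2),
                                align_corners=True).view(C, N).T   # (N, C)
        feat_sum += sampled * valid
        valid_cnt += valid
    return feat_sum / valid_cnt.clamp(min=1)                 # mean over the valid views
```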
4. The method of claim 1, wherein the obtaining the target detection result for the target space based on the first spatial feature and the second spatial feature for each of the plurality of locations comprises:
obtaining a first bird's-eye view feature based on a first spatial feature of each of the plurality of locations, the first bird's-eye view feature comprising a first feature of each of a plurality of first locations of the plurality of locations, the plurality of first locations being located on a horizontal plane in which the target object is located, the first feature being obtained based on a plurality of first spatial features of a plurality of second locations of the plurality of locations, the plurality of second locations being located in a vertical direction in which the first location is located;
obtaining a second bird's-eye view feature based on a second spatial feature of each of the plurality of locations, the second bird's-eye view feature comprising a second feature of each of the plurality of first locations, the second feature being obtained based on a plurality of second spatial features of the plurality of second locations; and
obtaining the target detection result based on the first bird's-eye view feature and the second bird's-eye view feature.
5. The method of claim 4, wherein the first feature of each of the plurality of first locations comprises a mean of the plurality of first spatial features of the plurality of second locations; or
the second feature of each of the plurality of first locations comprises a mean of the plurality of second spatial features of the plurality of second locations.
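Claims 4-5 collapse the two spatial feature volumes into bird's-eye view features by aggregating, for each first location on the horizontal plane, the features of the second locations stacked above it in the vertical direction. A minimal sketch of the mean-based variant, assuming an (X, Y, Z, C) voxel layout, is:

```python
# Minimal sketch (layout is an assumption): collapse a voxel feature volume to BEV.
import torch

def to_bev(voxel_feat: torch.Tensor) -> torch.Tensor:
    """voxel_feat: (X, Y, Z, C) spatial features over X*Y first locations, each with
    Z second locations in the vertical direction; returns an (X, Y, C) BEV feature."""
    return voxel_feat.mean(dim=2)   # mean over the vertical (Z) direction

# first_bev  = to_bev(first_spatial_volume)    # from the forward-mapped features
# second_bev = to_bev(second_spatial_volume)   # from the back-projected features
```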
6. The method of claim 4, wherein the obtaining the target detection result based on the first bird's-eye view feature and the second bird's-eye view feature comprises:
obtaining a first attention weight corresponding to the first bird's-eye view feature and a second attention weight corresponding to the second bird's-eye view feature, wherein the first attention weight indicates a degree of importance of the first bird's-eye view feature, and the second attention weight indicates a degree of importance of the second bird's-eye view feature;
performing feature fusion on the first bird's-eye view feature and the second bird's-eye view feature based on the first attention weight and the second attention weight, to obtain a fused feature; and
obtaining the target detection result based on the fused feature.
7. The method of claim 6, wherein each of the first attention weight and the second attention weight comprises a weight corresponding to each of the plurality of first locations, the weight indicating a degree of importance of the feature corresponding to that first location.
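A hedged sketch of the attention-weighted fusion of claims 6-7 follows; the sigmoid gate, the 1x1 convolution used to predict the per-location weights, and the layer sizes are illustrative assumptions rather than part of the claims.

```python
# Illustrative sketch (assumptions): per-location attention weights gate the two BEV
# features before they are fused and passed to the detection head.
import torch
import torch.nn as nn

class BevFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict one weight per first location for each of the two BEV features.
        self.attn = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, first_bev: torch.Tensor, second_bev: torch.Tensor) -> torch.Tensor:
        """first_bev, second_bev: (B, C, X, Y) -> fused feature of shape (B, C, X, Y)."""
        weights = torch.sigmoid(self.attn(torch.cat([first_bev, second_bev], dim=1)))
        w1, w2 = weights[:, 0:1], weights[:, 1:2]     # (B, 1, X, Y) each
        return w1 * first_bev + w2 * second_bev       # per-location weighted fusion

# fusion = BevFusion(channels=64)
# fused = fusion(first_bev, second_bev)   # the detection head then predicts targets from `fused`
```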
8. The method of claim 1, wherein the target object comprises a vehicle and the plurality of images comprises a plurality of images acquired by a plurality of cameras on the vehicle, respectively.
9. The method of claim 1, wherein the target object comprises an imaging device, the plurality of images comprising a plurality of images acquired by the imaging device in a plurality of directions, respectively.
10. An object detection apparatus comprising:
an image acquisition unit configured to acquire a plurality of images taken for a target space in which a target object is located from a position in which the target object is located, the plurality of images corresponding to a plurality of angles of view, the target space including a plurality of positions;
a first spatial feature acquisition unit configured to map an image feature of each of the plurality of images to a corresponding one of the plurality of positions by forward mapping, to obtain a first spatial feature of each of the plurality of positions;
a second spatial feature acquisition unit configured to project each of the plurality of positions onto a corresponding one of the plurality of images by back projection and extract a corresponding feature in the corresponding image, to obtain a second spatial feature of the position; and
a detection result acquisition unit configured to obtain a target detection result of the target space based on the first spatial feature and the second spatial feature of each of the plurality of positions, the target detection result indicating a target instance located in the target space;
wherein the first spatial feature acquisition unit includes:
a first spatial sub-feature obtaining unit configured to, for each of the plurality of images, obtain an image feature of the image, obtain a distribution probability of each pixel on the image along each of a plurality of depths in a viewing angle direction corresponding to the image, and obtain, based on the distribution probability of each pixel on the image along each of the plurality of depths and the image feature of the image, a first spatial sub-feature of each of the plurality of positions corresponding to the image; and
a first spatial feature acquisition subunit configured to, for each of the plurality of positions, obtain a first spatial feature of the position based on a plurality of first spatial sub-features of the position corresponding to the plurality of images.
11. The apparatus of claim 10, wherein the second spatial feature acquisition unit comprises:
a camera pose acquisition unit configured to acquire a camera pose of each of the plurality of images;
a second spatial sub-feature acquisition unit configured to project, for each of the plurality of positions, the position based on a camera pose of each of the plurality of images, determine that the image is a corresponding image of the position in response to determining that the image has a projection position corresponding to the position, and extract, as a second spatial sub-feature of the position corresponding to the image, a feature of the corresponding image corresponding to the projection position; and
a second spatial feature acquisition subunit configured to, for each of the plurality of positions, in response to obtaining one or more second spatial sub-features of the position corresponding to one or more respective ones of the plurality of images, obtain a second spatial feature of the position based on the one or more second spatial sub-features.
12. The apparatus of claim 11, wherein, for each of the plurality of positions, the second spatial feature of that position is a mean of one or more second spatial sub-features corresponding to that position.
13. The apparatus of claim 10, wherein the detection result acquisition unit comprises:
a first bird's-eye view feature obtaining unit configured to obtain a first bird's-eye view feature based on a first spatial feature of each of the plurality of positions, the first bird's-eye view feature including a first feature of each of a plurality of first positions of the plurality of positions, the plurality of first positions being located on a horizontal plane in which the target object is located, the first feature being obtained based on a plurality of first spatial features of a plurality of second positions of the plurality of positions, the plurality of second positions being located in a vertical direction in which the first position is located;
a second bird's-eye view feature acquisition unit configured to obtain a second bird's-eye view feature based on a second spatial feature of each of the plurality of positions, the second bird's-eye view feature including a second feature of each of the plurality of first positions, the second feature being obtained based on a plurality of second spatial features of the plurality of second positions; and
a detection result acquisition subunit configured to obtain the target detection result based on the first bird's-eye view feature and the second bird's-eye view feature.
14. The apparatus of claim 13, wherein the first feature of each of the plurality of first locations comprises a mean of the plurality of first spatial features of the plurality of second locations; or
the second feature of each of the plurality of first locations comprises a mean of the plurality of second spatial features of the plurality of second locations.
15. The apparatus of claim 13, wherein the detection result acquisition subunit comprises:
an attention weight acquisition unit configured to acquire a first attention weight corresponding to the first bird's-eye view feature and a second attention weight corresponding to the second bird's-eye view feature, the first attention weight indicating a degree of importance of the first bird's-eye view feature, and the second attention weight indicating a degree of importance of the second bird's-eye view feature;
a feature fusion unit configured to perform feature fusion on the first bird's-eye view feature and the second bird's-eye view feature based on the first attention weight and the second attention weight, to obtain a fused feature; and
a first acquisition subunit configured to obtain the target detection result based on the fused feature.
16. The apparatus of claim 15, wherein each of the first attention weight and the second attention weight comprises a weight corresponding to each of the plurality of first locations, the weight indicating a degree of importance of the feature corresponding to that first location.
17. The apparatus of claim 10, wherein the target object comprises a vehicle and the plurality of images comprises a plurality of images acquired by a plurality of camera devices on the vehicle, respectively.
18. The apparatus of claim 10, wherein the target object comprises an imaging device, the plurality of images comprising a plurality of images acquired by the imaging device in a plurality of directions, respectively.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
CN202310080216.2A 2023-01-18 2023-01-18 Target detection method, device, electronic equipment and storage medium Active CN115797455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310080216.2A CN115797455B (en) 2023-01-18 2023-01-18 Target detection method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115797455A (en) 2023-03-14
CN115797455B (en) 2023-05-02

Family

ID=85430415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310080216.2A Active CN115797455B (en) 2023-01-18 2023-01-18 Target detection method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115797455B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860074B (en) * 2019-04-30 2024-04-12 北京市商汤科技开发有限公司 Target object detection method and device, and driving control method and device
CN112132829A (en) * 2020-10-23 2020-12-25 北京百度网讯科技有限公司 Vehicle information detection method and device, electronic equipment and storage medium
CN113554643B (en) * 2021-08-13 2022-12-06 上海高德威智能交通系统有限公司 Target detection method and device, electronic equipment and storage medium
CN114913506A (en) * 2022-05-18 2022-08-16 北京地平线机器人技术研发有限公司 3D target detection method and device based on multi-view fusion
CN115578702B (en) * 2022-09-26 2023-12-05 北京百度网讯科技有限公司 Road element extraction method and device, electronic equipment, storage medium and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant