CN117541816A - Target detection method and device and electronic equipment

Target detection method and device and electronic equipment

Info

Publication number
CN117541816A
CN117541816A
Authority
CN
China
Prior art keywords
dimensional
feature map
scale feature
offset
object query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311436010.5A
Other languages
Chinese (zh)
Inventor
谭资昌
毛永强
杜金浩
叶晓青
王井东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311436010.5A
Publication of CN117541816A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The disclosure provides a target detection method, a target detection device, and electronic equipment, and relates to artificial intelligence technology, in particular to the technical fields of computer vision, deep learning and the like. The specific implementation scheme is as follows: acquiring a multi-scale feature map of a multi-view image; determining three-dimensional coordinate information of each scale feature map in space, wherein the three-dimensional coordinate information of each scale feature map comprises the three-dimensional coordinates, in space, of a plurality of points corresponding to each pixel point in the scale feature map; coding each scale feature map and the three-dimensional coordinate information of the scale feature map, and determining three-dimensional position sensing features of the multi-scale feature map according to the coding results; and decoding based on the three-dimensional position sensing features of the multi-scale feature map to obtain the object category and position information of objects in the multi-view image. The accuracy of 3D target detection is thereby improved.

Description

Target detection method and device and electronic equipment
Technical Field
The disclosure relates to the technical fields of computer vision, deep learning and the like in the field of artificial intelligence, and in particular to a target detection method, a target detection device and electronic equipment, which can be applied to scenes such as automatic driving, virtual reality, industrial automation and security protection.
Background
In many scenarios, three-dimensional (3D) object detection is required. For example, in an automatic driving scenario, 3D object detection is needed to accurately identify and locate objects such as pedestrians, vehicles and obstacles on the road, and the detection results are then provided to the automatic driving system for decision making and control. As another example, in industrial automation scenarios, industrial robots need to be precisely positioned and controlled through 3D object detection, and products on the production line need to be quality-inspected and classified through 3D detection. In the security monitoring field, information such as pedestrian flow and traffic flow in public places, airports, railway stations and the like needs to be monitored through 3D target detection so as to provide real-time security alarms and early warnings. In a virtual reality scene, 3D object detection is needed to identify and detect the positions of virtual objects in the scene.
Objects in the environment can be detected using multi-view two-dimensional images; however, when 3D objects are detected using multi-view two-dimensional images, the accuracy and efficiency of 3D object detection still need to be improved.
Disclosure of Invention
The disclosure provides a target detection method, a target detection device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a target detection method comprising: acquiring a multi-scale feature map of a multi-view image; determining three-dimensional coordinate information of each scale feature map in space, wherein the three-dimensional coordinate information of each scale feature map comprises three-dimensional coordinates of a plurality of points corresponding to each pixel point in the scale feature map in space; coding each scale feature map and three-dimensional coordinate information of the scale feature map, and determining three-dimensional position sensing features of the multi-scale feature map according to coding results; decoding based on the three-dimensional position sensing features of the multi-scale feature map to obtain object category and position information of objects in the multi-view image.
According to a second aspect of the present disclosure, there is provided an object detection apparatus comprising: the acquisition unit acquires a multi-scale feature map of the multi-view image; the determining unit is used for determining three-dimensional coordinate information of each scale feature map in space, wherein the three-dimensional coordinate information of each scale feature map comprises three-dimensional coordinates of a plurality of points corresponding to each pixel point in the scale feature map in space; the coding unit is used for coding each scale feature map and the three-dimensional coordinate information of the scale feature map, and determining the three-dimensional position sensing features of the multi-scale feature map according to the coding result; and the decoding unit is used for decoding based on the three-dimensional position sensing characteristics of the multi-scale characteristic map to obtain object category and position information of the objects in the multi-view image.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method provided in the first aspect.
According to a fourth aspect of the present disclosure there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method provided by the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising: a computer program stored in a readable storage medium, from which it can be read by at least one processor of an electronic device, the at least one processor executing the computer program causing the electronic device to perform the method of the first aspect.
According to the scheme of the disclosure, a multi-scale feature map of a multi-view image is acquired; determining three-dimensional coordinate information of each scale feature map in space, wherein the three-dimensional coordinate information of each scale feature map comprises three-dimensional coordinates of a plurality of points corresponding to each pixel point in the scale feature map in space; coding each scale feature map and three-dimensional coordinate information of the scale feature map, and determining three-dimensional position sensing features of the multi-scale feature map according to coding results; decoding based on the three-dimensional position sensing features of the multi-scale feature map to obtain object category and position information of objects in the multi-view image. By constructing 3D position coordinate information respectively corresponding to different scale features, 3D position coding is expanded to multi-scale characterization so as to realize implicit 2D-to-3D position relation learning, more image features can be provided for object query, and the accuracy and efficiency of 3D target detection can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a scene diagram in which embodiments of the present disclosure may be implemented;
FIG. 2 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a network architecture for implementing the target detection method;
FIG. 6 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Referring to fig. 1, a schematic diagram of a network structure for implementing 3D object detection is shown. As shown in fig. 1, multiple images looking 360° around the environment can be acquired, for example six surround-view images 11, 12, 13, 14, 15 and 16, and these 6 images are then input into a backbone network (backbone) to extract multi-view two-dimensional feature maps. These feature maps are then input to a Feature Pyramid Network (FPN) for enhancement, which outputs enhanced feature maps. Each pixel in a feature map represents a feature.
The multi-view two-dimensional feature maps are input into a 3D coordinate generator, which discretizes the camera view cone space shared by all views into a 3D grid to obtain the 3D coordinates of a plurality of points on the ray starting from each pixel point in the two-dimensional feature maps. A plurality of 3D coordinates corresponding to each pixel in the two-dimensional feature maps are thus obtained.
The features in the feature maps and the plurality of 3D coordinates corresponding to each pixel of the feature maps are input to an encoder to generate 3D position sensing features.
The 3D location-aware features are input to a decoder, and the decoder decodes the 3D location-aware features together with object query (object query) vectors to obtain the object category and object location of each object in the multi-view images. Each object query vector may be a representation of one target in the detection task.
The object query vectors may include query location (query_pos) vectors corresponding respectively to positions in three-dimensional space. In the decoder, self-attention mechanism processing is first performed among all the object query vectors, followed by cross-attention mechanism processing with the 3D position sensing features. During the cross-attention processing, the object query vector of a 3D reference point needs to interact with the 3D position sensing feature of the corresponding point in the 2D image so as to update the object query vector corresponding to that 3D reference point, so that the 3D position sensing features of the two-dimensional images are integrated into the object query vectors. The object query vectors processed by the cross-attention mechanism are input to a feed-forward network (FFN) and a prediction network, and the prediction network outputs the object class and the position of each object.
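As a rough illustration of this decoder flow, the following PyTorch sketch shows one decoder layer that applies self-attention over the object query vectors and then cross-attention against the flattened 3D position-aware features. This is a minimal sketch under assumed dimensions, not the implementation of the present disclosure; names such as DecoderLayer are illustrative only.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: self-attention over object queries, then
    cross-attention against the flattened 3D position-aware features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, pos_aware_feats):
        # queries:         (B, Nq, C) object query vectors (content plus query_pos)
        # pos_aware_feats: (B, N_tokens, C) flattened 3D position-aware features
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, pos_aware_feats, pos_aware_feats)[0])
        return self.norm3(q + self.ffn(q))
```

A prediction network would then run on the returned queries to output the object classes and positions, as described above.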
In the implementation shown in fig. 1, implicit 2D-to-3D position embedding is achieved by fusion-encoding the 3D coordinates of each pixel of the feature map with the image features of that pixel into 3D position-aware features. When target detection is performed on environment images according to this scheme, the computation amount is relatively small and the target detection results are relatively good.
However, in the scheme shown in fig. 1, since objects of different sizes may exist in a scene, the scheme shown in fig. 1 cannot accurately detect objects of all sizes. In addition, when performing the cross-attention mechanism processing, although the 2D-to-3D implicit embedding scheme is adopted, the image features provided to the object query vectors by the above scheme are insufficient, and thus the accuracy and precision of the target detection results obtained by the above scheme have yet to be improved.
The disclosure provides a target detection method, a target detection device and electronic equipment, which are applied to the technical fields of computer vision, deep learning and the like in the artificial intelligence field so as to achieve the purpose of improving the accuracy and precision of a target detection result.
Referring to fig. 2, fig. 2 is a schematic diagram of a first embodiment of the disclosure, and as shown in fig. 2, the target detection method provided in the present embodiment includes the following steps:
S201, acquiring a multi-scale feature map of the multi-view image.
The execution body of the embodiment may be a terminal device or a server. In this embodiment, the execution body is introduced as a terminal device.
The multi-view image includes a plurality of images photographed at different photographing angles. Each image may be a two-dimensional (2D) image.
The multi-view image may be a plurality of images acquired by the same image acquisition device at different viewing angles, or may be a plurality of images acquired by different image acquisition devices at different viewing angles at the same time.
In one example, the acquiring the multi-scale feature map of the multi-view image may include acquiring the multi-scale feature map of each image in the multi-view image, and then stitching each scale feature map of each view image to obtain the multi-scale feature map.
In one example, the multi-view images may be integrated into one scene image, and then the image data of the scene image may be subjected to multi-scale feature extraction. The feature map of an image is the result of convolution operations performed on the image data of the image by a series of convolution kernels, each pixel point in the feature map representing a number of specific features.
The multi-scale feature map may include feature maps of different scales obtained by downsampling the multi-view image data using different sampling multiples. The downsampling factors may include, for example, 4, 8, 16, 32, etc. For an image of the original size W×H, the image is downsampled by the above sampling multiples to obtain images of the sizes (W/4)×(H/4), (W/8)×(H/8), (W/16)×(H/16), and (W/32)×(H/32). Each feature map obtained from each sampling multiple in the multi-scale feature map may correspond to a level of 2D features.
In one example, the image data of the above-described W×H image may be input into a plurality of convolution kernels, each of which may downsample the image data input to it. For example, an S×S convolution kernel can downsample an S×S block of the input image data into one pixel, where S is an integer greater than 1.
As the number of convolution layers increases, the downsampling multiple increases. Depending on which downsampling multiple is required, it can be determined which convolution layer outputs the feature map of that downsampling multiple.
Illustratively, the multi-scale feature map includes 1/4, 1/8, 1/16, etc. scale feature maps of the original image size.
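As a rough sketch of how such a multi-scale feature map can be produced, the toy PyTorch module below downsamples a multi-view image batch to 1/4, 1/8 and 1/16 scale feature maps with strided convolutions; in practice a backbone plus feature pyramid network is used, and all names and sizes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyMultiScaleExtractor(nn.Module):
    """Toy stand-in for backbone+FPN: returns 1/4, 1/8 and 1/16 scale feature maps."""
    def __init__(self, channels=256):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=4, stride=4)           # 1/4 scale
        self.down8 = nn.Conv2d(channels, channels, kernel_size=2, stride=2)   # 1/8 scale
        self.down16 = nn.Conv2d(channels, channels, kernel_size=2, stride=2)  # 1/16 scale

    def forward(self, images):
        # images: (N_views, 3, H, W) multi-view image batch
        c4 = self.stem(images)     # (N, C, H/4, W/4)   level 1
        c8 = self.down8(c4)        # (N, C, H/8, W/8)   level 2
        c16 = self.down16(c8)      # (N, C, H/16, W/16) level 3
        return [c4, c8, c16]

feats = TinyMultiScaleExtractor()(torch.randn(6, 3, 256, 704))  # 6 surround views (assumed size)
print([f.shape for f in feats])
```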
Objects of different sizes may be included in the scene subject to target detection. As the depth of the neural network increases, the downsampling multiple increases, the receptive field grows, and semantic information increases, which is suitable for detecting large-size objects; however, small-size objects may be submerged, so when image features obtained with a higher sampling multiple are used for target detection, small-size objects may not be accurately identified. Therefore, it is necessary to construct image features for objects of different sizes using a multi-scale feature map, so as to achieve accurate detection results when performing target detection in a scene containing objects of different sizes.
S202, determining three-dimensional coordinate information of each scale feature map in space, wherein the three-dimensional coordinate information of each scale feature map comprises three-dimensional coordinates of a plurality of points corresponding to each pixel point in the scale feature map in space.
Coordinates of the pixel points in the feature map in three-dimensional space may be determined in various ways.
In this embodiment, the image data of the multi-view image may include camera parameters (including internal parameters and external parameters) of a camera capturing the image. Coordinates of the pixels in the feature map in the three-dimensional space can be calculated according to the pixels in the feature map and a camera parameter matrix of the camera.
As an implementation manner, the step S202 may include: and determining three-dimensional coordinate information of each scale feature map in the view cone space corresponding to the multiple views.
In order to establish the connection between the 2D images and the 3D space, a camera view cone space shared by the multiple views is constructed, and the view cone space is discretized into a 3D grid. In the above process, for any pixel point in each feature map, the ray starting from that pixel point may include a plurality of three-dimensional space coordinates. Thus, each pixel point may correspond to a plurality of three-dimensional coordinates in space, for example, D three-dimensional coordinates, where D is an integer greater than or equal to 2.
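A minimal sketch of this construction, assuming a pinhole camera model and D uniformly spaced depth values per ray, is shown below; the helper name frustum_points_3d, the depth range and the uniform spacing are assumptions, not the exact discretization of the disclosure.

```python
import torch

def frustum_points_3d(H, W, D, intrinsics, cam_to_world, d_min=1.0, d_max=60.0):
    """For each pixel of an HxW feature map, return D 3D points in space along its ray.

    intrinsics:   (3, 3) camera intrinsic matrix, scaled to the feature-map resolution
    cam_to_world: (4, 4) camera extrinsic matrix (camera frame -> shared 3D space)
    returns:      (H, W, D, 3) 3D coordinates
    """
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    depths = torch.linspace(d_min, d_max, D)                        # (D,) assumed uniform bins
    # homogeneous pixel coordinates scaled by depth: (H, W, D, 3)
    pix = torch.stack([xs[..., None] * depths,
                       ys[..., None] * depths,
                       depths.expand(H, W, D)], dim=-1)
    cam = pix @ torch.linalg.inv(intrinsics).T                      # back-project to camera frame
    cam_h = torch.cat([cam, torch.ones(H, W, D, 1)], dim=-1)        # homogeneous coordinates
    world = cam_h @ cam_to_world.T                                  # transform to shared 3D space
    return world[..., :3]
```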
S203, coding each scale feature map and three-dimensional coordinate information of the scale feature map, and determining three-dimensional position sensing features of the multi-scale feature map according to coding results.
Assume that the multi-scale feature map (2D image features) of the multi-view images is characterized by the following formula:

F^2d = {F^2d_l ∈ R^(N×C×H_l×W_l)}, l = 1, …, L (1)

where l indexes the level of the 2D image features (e.g., the levels formed by the 1/4, 1/8, 1/16, etc. scale feature maps of the original size; illustratively, the 1/4-scale feature map is level 1, the 1/8-scale feature map is level 2, and the 1/16-scale feature map is level 3), N represents the number of views in the multi-view images, C is the number of feature channels, and W_l and H_l are the width and height of the l-th level features, respectively. L is the largest level number; L is an integer greater than 1.
The 3D location mapping P^3d of the multi-scale feature maps can be characterized by the following formula:

P^3d = {P^3d_l = PE(K^(-1) p_l)}, l = 1, …, L (2)

where p_l denotes the points of the l-th level image feature space in the 3D grid, with D representing the number of three-dimensional coordinates in the view cone space corresponding to each pixel point; K^(-1) p_l is the 3D coordinates after the inverse 3D projection; K denotes the camera internal and external parameters; and PE() is the encoding algorithm.
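As a rough sketch, PE() can be thought of as a small MLP over the D three-dimensional coordinates of each pixel, whose output is fused with the corresponding feature map of that scale to yield the position sensing features; the MLP shape and the element-wise addition used for fusion below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Scale3DPositionEncoder(nn.Module):
    """Encode the D 3D points of every pixel into a position embedding of the
    same channel width as the feature map, then fuse it with the features."""
    def __init__(self, num_depth=64, channels=256):
        super().__init__()
        self.pe = nn.Sequential(                       # PE(): a small MLP over flattened 3D coords
            nn.Linear(num_depth * 3, 4 * channels),
            nn.ReLU(),
            nn.Linear(4 * channels, channels),
        )

    def forward(self, feat, coords3d):
        # feat:     (N, C, H_l, W_l) level-l feature map
        # coords3d: (N, H_l, W_l, D, 3) 3D coordinates per pixel
        pos = self.pe(coords3d.flatten(start_dim=3))   # (N, H_l, W_l, C)
        pos = pos.permute(0, 3, 1, 2)                  # (N, C, H_l, W_l)
        return feat + pos                              # 3D position sensing features of this scale
```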
By constructing 3D position codes respectively corresponding to the different scale feature maps, the 3D position coding is expanded to multi-scale characterization to realize implicit 2D-to-3D position relation learning, more image features can be provided for the object query vectors during decoding, and the accuracy of 3D target detection of objects can be improved.
S204, decoding is carried out based on the three-dimensional position sensing characteristics of the multi-scale characteristic map, and object category and position information of objects in the multi-view image are obtained.
The step S204 includes: decoding is performed based on three-dimensional position sensing features of the multi-scale feature map and a preset plurality of object query vectors, wherein the object query vectors comprise coding vectors of positions in a three-dimensional space.
Because the three-dimensional position sensing features comprise three-dimensional position information and image feature information, the three-dimensional position sensing features respectively corresponding to the multi-scale feature images are decoded, and therefore accurate target detection results can be obtained.
The object query (object query) vectors may include query location (query_pos) vectors and type query (query) vectors, which correspond respectively to a plurality of positions in three-dimensional space. For each object query vector, the query position corresponding to the object query vector may be mapped to Bird's Eye View (BEV) space to obtain a three-dimensional reference point in the BEV space corresponding to the object query vector. The three-dimensional reference point to which the object query vector corresponds is a learnable reference point.
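A minimal sketch of such learnable reference points, assuming a fixed number of queries and an embedding-based parameterization, might look as follows (the number 900 and the small MLP are illustrative assumptions):

```python
import torch
import torch.nn as nn

num_queries, dim = 900, 256
query_feat = nn.Embedding(num_queries, dim)        # content part of the object queries
ref_points = nn.Embedding(num_queries, 3)          # learnable 3D reference points (normalized)
query_pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

ref = ref_points.weight.sigmoid()                  # (Nq, 3) in [0, 1]^3, later mapped to the BEV range
query_pos = query_pos_mlp(ref)                     # query_pos vectors encoding 3D positions
queries = query_feat.weight + query_pos            # object query vectors fed to the decoder
```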
The three-dimensional position sensing features respectively corresponding to the multi-scale feature images and the plurality of object query vectors can be interacted, and the object query vectors are updated according to interaction results. The object type and the object position are predicted from the updated object query vector.
In the embodiment, a multi-scale feature map of a multi-view image is acquired; determining three-dimensional coordinate information of each scale feature map in a view cone space, wherein the three-dimensional coordinate information of each feature map comprises three-dimensional coordinates of a plurality of points of each pixel point in the feature map in the view cone space; coding each scale feature map and three-dimensional coordinate information of the scale feature map to obtain three-dimensional position sensing features respectively corresponding to the multi-scale feature maps; decoding is carried out based on three-dimensional position sensing features respectively corresponding to the multi-scale feature images to obtain object types and position information of objects in the multi-view images, 3D position codes are expanded to multi-scale characterization by constructing 3D position coordinate information respectively corresponding to different scale image features so as to realize implicit 2D-3D position relation learning, more image features can be provided for object query vectors, and the accuracy of 3D target detection of the objects is improved.
Referring to fig. 3, fig. 3 is a schematic diagram of a second embodiment of the disclosure, and as shown in fig. 3, the target detection method provided in the present embodiment includes the following steps:
S301, acquiring a multi-scale feature map of the multi-view image.
S302, determining three-dimensional coordinate information of each scale feature map in space, wherein the three-dimensional coordinate information of each scale feature map comprises three-dimensional coordinates of a plurality of points corresponding to each pixel point in the scale feature map in space.
S303, coding each scale feature map and three-dimensional coordinate information of the scale feature map, and determining three-dimensional position sensing features of the multi-scale feature map according to coding results.
The specific implementation of steps S301 to S303 may refer to the relevant parts of the embodiment shown in fig. 2, which is not repeated here.
In this embodiment, the execution body of the target detection method may be a terminal device, or may be a server that provides services for the terminal device.
S304, for each object query vector, performing bidirectional offset according to a reference point of the object query vector in a three-dimensional space to obtain a plurality of two-dimensional offset points of the reference point in an image space; the two-way offset comprises the steps of performing first offset on the reference point in a three-dimensional space, and performing second offset on the three-dimensional offset point obtained by the first offset and the projection point of the reference point in the two-dimensional space respectively.
In this embodiment, in order to further solve the problem of insufficient image features provided for the object query vectors, bidirectional shifting is performed so as to determine a plurality of two-dimensional offset points in the image for the reference point of each object query vector, and then the image features related to the two-dimensional offset points are embedded into the object query vector through a cross-attention mechanism.
The bidirectional offset may include performing a first offset on a reference point corresponding to the object query vector in a three-dimensional space, and performing a second offset on a three-dimensional offset point obtained by the first offset and a projection point of the reference point in the two-dimensional space. A greater number of two-dimensional offset points associated with the reference point of the object query vector may be obtained by the bi-directional offset described above. By fusing the image features corresponding to the two-dimensional offset points into the object query vector, the object query vector can have more image information.
Specifically, the step S304 may include the following steps:
first, a reference point of the object query vector in a three-dimensional space is subjected to first offset in the three-dimensional space, and at least one three-dimensional offset point is determined.
For each object query vector, determining at least one three-dimensional offset point from a reference point of the object query vector in three-dimensional space.
At least one three-dimensional offset point may be determined in a linear manner based on coordinates of a reference point corresponding to the object query vector.
Secondly, the projection of each of the at least one three-dimensional offset point and the reference point in the multi-view image space is determined, and a second offset is carried out on each projection in the multi-view image space, so that a plurality of two-dimensional offset points are obtained.
The three-dimensional offset points and the reference points of the object query vector can be projected into the multi-view image space by using camera parameters to obtain projection points in the multi-view image. For each projection point, performing second offset to obtain a plurality of two-dimensional offset points. For each two-dimensional offset point, the pixels associated with the two-dimensional offset point are found in the feature maps. And fusing the features corresponding to the pixels related to the two-dimensional offset points in each feature map.
Since the object query vector contains rich 3D information (e.g., location, size, and direction), a 3D offset Δ_3d ∈ R^(N_q×N_3d×3) can be directly generated by applying a linear method to the object query vector itself, where N_q is the number of object query vectors and N_3d is the number of learnable 3D offset points. This can be characterized by the following formula:

Δ_3d = Linear(Q) (3)

where Q ∈ R^(N_q×C) is the query vector (object query vector) and Linear() is a linear function.
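A minimal sketch of formula (3), assuming N_3d = 4 offset points per query, is:

```python
import torch
import torch.nn as nn

num_queries, dim, n_3d = 900, 256, 4                        # assumed sizes
offset_3d_head = nn.Linear(dim, n_3d * 3)                   # the Linear() of formula (3)

Q = torch.randn(num_queries, dim)                           # object query vectors
delta_3d = offset_3d_head(Q).view(num_queries, n_3d, 3)     # (N_q, N_3d, 3) learnable 3D offsets
```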
Because of the lack of two-dimensional information in three-dimensional space, it is difficult to generate reliable two-dimensional offsets based solely on object query vector features. To solve this problem, two-way offset is used, first, the reference point of the object query vector is offset in three-dimensional space to obtain at least one three-dimensional offset point, and the at least one three-dimensional offset point and the reference point are projected into multi-view image space (two-dimensional space) to obtain a plurality of two-dimensional projection points in the multi-view image space. For each projection point, shifting is performed in the multi-view image space, resulting in a plurality of two-dimensional shift points. And then fusing the image features corresponding to the projection of the reference points in the multi-view image space with the object query vector, and fusing the image features of the two-dimensional offset points with the object query vector.
Given a reference point P_3d of an object query vector, the reference point is first projected into the image space according to the camera parameters to obtain a two-dimensional projection point K(P_3d). The features at the corresponding locations are then sampled and fused together across all feature layers. For simplicity, this process is expressed as the following formula:

X_img = Sampling(F^2d, K(P_3d)) (4)

where Sampling() is a grid sampling operation and F^2d represents the multi-scale two-dimensional image features. After the sampling feature X_img is obtained, the two-dimensional offset Δ_2d can be generated by a linear method as shown in the following formula:

Δ_2d = Linear(Q + X_img) (5)

Based on the bidirectional offset (including the 3D offset and the 2D offset), the final sampling points can be expressed as follows:

P_2d = K(P_3d + Δ_3d) + Δ_2d (6)

where K denotes the 3D-to-2D projection matrix determined by the internal and external parameters of the camera.
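The following sketch strings formulas (4)-(6) together for a single camera: the reference points are projected with K, multi-scale features are grid-sampled and fused into X_img, a 2D offset is predicted from Q + X_img, and the final sampling points are formed. The shapes, the averaging fusion, and the use of one 2D offset per query are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def project(K, pts3d):
    """3D -> 2D pinhole projection, the K(.) of formula (6); K: (3, 4), pts3d: (..., 3)."""
    homo = torch.cat([pts3d, torch.ones_like(pts3d[..., :1])], dim=-1)
    uvw = homo @ K.T
    return uvw[..., :2] / uvw[..., 2:].clamp(min=1e-5)

def sample_multiscale(feats, pts2d, img_hw):
    """Formula (4): grid-sample every feature level at pts2d and average the results.
    feats: list of (1, C, H_l, W_l); pts2d: (Nq, 2) in pixel coords of the original image."""
    H, W = img_hw
    grid = pts2d / torch.tensor([W, H]) * 2 - 1               # normalize to [-1, 1]
    grid = grid.view(1, -1, 1, 2)
    sampled = [F.grid_sample(f, grid, align_corners=False) for f in feats]
    return torch.stack(sampled).mean(0).squeeze(-1).squeeze(0).T   # (Nq, C) fused X_img

# --- illustrative shapes and values (assumptions) ---
num_queries, dim, n_3d = 900, 256, 4
K = torch.randn(3, 4)                                         # camera projection matrix
feats = [torch.randn(1, dim, 64, 176), torch.randn(1, dim, 32, 88)]
Q = torch.randn(num_queries, dim)
P3d = torch.randn(num_queries, 3)                             # reference points
delta_3d = torch.randn(num_queries, n_3d, 3)                  # from formula (3)

P2d_ref = project(K, P3d)                                     # projected reference points
X_img = sample_multiscale(feats, P2d_ref, img_hw=(256, 704))  # formula (4)
delta_2d = nn.Linear(dim, 2)(Q + X_img)                       # formula (5)
P2d = project(K, P3d.unsqueeze(1) + delta_3d) + delta_2d.unsqueeze(1)   # formula (6)
```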
S305, determining the associated image characteristics of each two-dimensional offset point.
Specifically, the step S305 includes the steps of:
first, for each two-dimensional offset point, feature information corresponding to the two-dimensional offset point is determined in the multi-scale feature map.
And secondly, fusing the characteristic information corresponding to each multi-scale characteristic map to obtain the associated image characteristics of the two-dimensional offset point.
That is, for each sampling point (which can be regarded as a two-dimensional offset point), the corresponding features in each scale feature map are fused to obtain the associated image feature of the sampling point. Such that the associated image features may include fusion features of the respective 3D location-aware features in the multi-scale feature map. The image characteristics of the object query vector can be increased through the above-mentioned associated image characteristics, and the object query vector can also be made to learn position information. And the multi-scale characteristics are fused, so that the calculated amount can be reduced.
S306, interacting with the object query vector by using the associated image features of the two-dimensional offset points and the projected associated image features of the reference points, and updating the object query vector.
The associated image features of the plurality of two-dimensional offset points and the projected associated image features of the reference point may be interacted with the object query vector using a cross-attention algorithm to update the object query vector.
S307, predicting the object category and the position information according to the updated object query vector.
The above object position information includes position information indicating a position in a three-dimensional space, including three-dimensional coordinates, orientations, and the like.
After updating the object query vector, object category and location information predictions may be made from the updated object query vector.
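As an illustrative sketch (the head sizes and the 7-parameter box encoding are assumptions), the prediction step can be realized as two small heads on the updated query vectors:

```python
import torch
import torch.nn as nn

dim, num_classes = 256, 10
cls_head = nn.Linear(dim, num_classes)                        # object category scores
box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                         nn.Linear(dim, 7))                   # e.g. (x, y, z, w, l, h, yaw)

updated_queries = torch.randn(900, dim)                       # decoder output
class_logits = cls_head(updated_queries)                      # (Nq, num_classes)
boxes_3d = box_head(updated_queries)                          # (Nq, 7) 3D position, size, orientation
```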
Compared with the embodiment shown in fig. 2, the embodiment describes that by using bidirectional offset in decoding, the offset of the reference points can be simultaneously learned in the 2D and 3D spaces through the interaction attention learned by the bidirectional offset, thereby reducing the negative influence caused by the inconsistency of the occupied area of the object between the three-dimensional space and the two-dimensional image space; the accurate and effective interaction between the object query vector and the image features is promoted; local attention aggregation is achieved through learning of the offset of the 3D reference points in the 2D image, and feature changes of objects of different distances and different sizes are accommodated by learning the offset of the reference points in the 2D and 3D spaces simultaneously. The method and the device are beneficial to improving the precision and accuracy of the target detection result determined according to the object query vector.
Referring to fig. 4, fig. 4 is a schematic diagram of a third embodiment of the disclosure, and as shown in fig. 4, the target detection method provided in the present embodiment includes the following steps:
s401: the multi-view image is input into a backbone network and a feature pyramid network, and a multi-scale feature map is output by the feature pyramid network.
The backbone network may be various neural networks for extracting image features, such as convolutional neural networks, etc. The feature pyramid network may enhance the feature map output from the backbone network.
S402: and determining three-dimensional coordinate information of each scale feature map in space, wherein the three-dimensional coordinate information of each scale feature map comprises three-dimensional coordinates of a plurality of points corresponding to each pixel point in the feature map in space.
The implementation of step S402 may refer to step S202 in the embodiment shown in fig. 2.
As an implementation manner, the multi-scale feature map and the camera parameters may be input to a 3D position generating network, and the 3D position generating network generates three-dimensional coordinates of a plurality of points corresponding to each pixel in each scale feature map.
S403: and inputting each scale feature map and the three-dimensional coordinate information corresponding to the scale feature map to an encoder, encoding the three-dimensional coordinate information corresponding to each scale feature map and the scale feature map by the encoder, and determining the three-dimensional position sensing features of the multi-scale feature map according to the encoding result.
And for each scale feature map, using an encoder to encode the feature of each pixel point in the feature map and the three-dimensional coordinate information of the pixel point to obtain the three-dimensional position sensing feature of the pixel.
As one implementation, for each scale feature map, an encoder is used to encode the scale feature map and three-dimensional coordinate information of the scale feature map, so as to obtain three-dimensional position sensing features corresponding to the scale feature map.
As an implementation manner, the position sensing features corresponding to the scale feature images can be fused to obtain the three-dimensional position sensing features of the fused multi-scale feature images.
S404: inputting three-dimensional position-aware features of the multi-scale feature map and object query vectors to a decoder, determining associated image features of a plurality of two-dimensional offset points for each object query vector based on bi-directional offset by a deformable cross-attention layer in the decoder; and performing cross attention calculation on the associated image features of the two-dimensional offset points and the projected associated image features of the reference points and the object query vector by the deformable cross attention layer, and updating the object query vector according to a cross attention calculation result.
Specifically, the process of determining a plurality of two-dimensional offset points of the query object based on the bidirectional offset in the deformable cross-attention layer, and determining the associated image features of the two-dimensional offset points may refer to the relevant portions of the embodiment shown in fig. 3, which is not described herein.
The plurality of object query vectors may include query location vectors that respectively encode a plurality of locations in space. The Q matrix derived from the object query vectors may be considered to pose the question "what is there at a certain location?".
The 3D position-aware features output by the encoder are used to generate key (key) vectors and value (value) vectors, and the plurality of key vectors and value vectors respectively constitute a K matrix and a V matrix. The K matrix provides key information for answering the question posed by the Q matrix, and can be considered to contain key features of the object at a certain position.
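A minimal sketch of this key/value generation, assuming simple linear projections over the flattened position-aware features, is:

```python
import torch
import torch.nn as nn

dim = 256
k_proj = nn.Linear(dim, dim)                      # key projection over the 3D position-aware features
v_proj = nn.Linear(dim, dim)                      # value projection

pos_aware = torch.randn(1, 6 * 64 * 176, dim)     # assumed: 6 views of one level, flattened to tokens
K_mat = k_proj(pos_aware)                         # keys: "what is at this position"
V_mat = v_proj(pos_aware)                         # values handed to matching object queries
```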
Referring to fig. 5, fig. 5 is a schematic diagram of a network structure for implementing the object detection of the present disclosure. As shown in fig. 5, the network structure includes a backbone network, a feature pyramid network, a 3D generation network, an encoder, and a decoder. The 3D generation network may generate a 3D location map corresponding to the multi-scale feature map. And the encoder encodes the multi-scale feature and the 3D position mapping corresponding to the multi-scale feature map to obtain the multi-scale 3D position sensing feature. The decoder may include a deformable self-attention layer, a deformable cross-attention layer, a feed forward network, and a hybrid query combining network. The plurality of object query vectors are input to the decoder and then fed to the deformable self-attention layer, which performs self-attention mechanism processing on the plurality of object query vectors. A global context may be established between the object query vectors through the self-attention mechanism processing of the object query vectors described above. In the self-attention mechanism process, each object query vector interacts with all other object query vectors to obtain a more accurate object query vector representation globally. The 3D position-aware features input into the decoder generate corresponding key vectors and value vectors at the deformable cross-attention. The key vector and the value vector are processed by a cross-attention mechanism with the object query vector.
At the deformable cross-attention layer, a plurality of related 3D position sensing features are acquired through the bidirectional offset process, their corresponding key vectors and value vectors are generated, and cross-attention mechanism processing is performed between these key and value vectors and the object query vectors so as to update the object query vectors.
The most relevant points of the object query vector in the image can be captured by the bidirectional offset process, and the image features corresponding to the most relevant points are sampled to update the query features through the attention mechanism. Let q index a mixed query with query features Q_q ∈ R^(1×C) and a reference point p_q in image space. The deformable interactive attention employed may be represented by the following formula:

DeformAttn(Q_q, p_q, F^2d) = Σ_{m=1..M} W_m [ Σ_{l=1..L} Σ_{k=1..K} A_mlqk · W'_m F^2d_l(P_2d,lqk) ] (7)

where m indexes the attention head; k indexes the sampling points, and K is an integer greater than or equal to 2; l indexes the feature level; P_2d,lqk is the k-th sampling point of formula (6) for query q on the l-th feature level; A_mlqk represents the attention weight of the k-th sampling point and the m-th attention head on the l-th feature level, and the range of the attention weight A_mlqk is [0, 1]; W_m and W'_m are learnable parameters; and F^2d, through implicit 2D-to-3D relational modeling, represents the multi-scale two-dimensional image features.
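A single-head sketch of formula (7) is given below; it assumes the features at the sampling points P_2d of formula (6) have already been gathered, and it predicts the attention weights from the query with a linear layer followed by softmax. The class name and the single-head simplification are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableInteractiveAttention(nn.Module):
    """Single-head sketch of formula (7): per-query attention weights over the
    L*K bidirectionally-offset sampling points, applied to sampled value features."""
    def __init__(self, dim=256, num_levels=3, num_points=4):
        super().__init__()
        self.attn = nn.Linear(dim, num_levels * num_points)    # predicts A_mlqk before softmax
        self.value_proj = nn.Linear(dim, dim)                   # plays the role of W'_m
        self.out_proj = nn.Linear(dim, dim)                     # plays the role of W_m

    def forward(self, query, sampled_feats):
        # query:         (Nq, C) mixed query features Q_q
        # sampled_feats: (Nq, L*K, C) features sampled at the points P_2d of formula (6)
        A = F.softmax(self.attn(query), dim=-1)                 # (Nq, L*K), weights in [0, 1]
        V = self.value_proj(sampled_feats)                      # (Nq, L*K, C)
        out = (A.unsqueeze(-1) * V).sum(dim=1)                  # weighted aggregation over points
        return self.out_proj(out)
```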
The updated query object vector can be fused with 3D location-aware features of the projection points in the image space corresponding to the reference points and related image features of the related two-dimensional offset points, so that the updated query object includes rich image information.
Image data for the multiple views (see the 360° surround-view images 11, 12, 13, 14, 15 and 16 shown in fig. 1) is input to the backbone network and feature pyramid network of fig. 5, and the feature pyramid network outputs a multi-scale feature map. The 3D generation network then performs 3D mapping on the multi-scale features to obtain the 3D position information corresponding to each scale feature map. The encoder encodes the multi-scale feature maps together with their corresponding 3D position information to obtain 3D position sensing features, which are input into the decoder. The deformable cross-attention layer in the decoder generates key vectors and value vectors according to the 3D position features and uses them to perform cross-attention mechanism processing on the object query vectors, thereby updating the object query vectors. The updated object query vectors carry rich image information, so the accuracy and precision of the object detection results in the multi-view images are high.
The network provided in fig. 5 contains two important modules, namely a multi-scale 3D location mapping module and a deformable interactive attention layer module; the latter implements bidirectional offset learning for the reference points of the object query vectors, and together they are intended to realize both implicit 2D-to-3D and explicit 3D-to-2D position relationship learning. The network embeds both 2D-to-3D implicit relation mining and 3D-to-2D explicit relation mining, achieving complementary modeling of the two modes. Different from traditional implicit 2D-image-to-3D-space position relation modeling, the multi-scale 3D location mapping performs view cone space modeling on the multi-scale features extracted by the backbone network and the FPN and generates the 3D position relation, so that the multi-scale value vectors and key vectors can be accurately encoded during interactive attention. In the related art, explicit 3D-space-to-2D-image position relation modeling focuses on reference point offset learning in a single space, whereas the interactive attention based on bidirectional offset learning in the present disclosure can learn the offsets of reference points in the 2D and 3D spaces at the same time, greatly reducing the negative effects caused by the inconsistency of object occupation areas between the three-dimensional space and the two-dimensional image space. Therefore, the method has higher detection efficiency in target detection.
S405: in the decoder, object class and position information of objects in the multi-view image are determined from the updated object query.
The updated object query vectors are input into a feedforward network and a hybrid query merging network, through which the object category and position information corresponding to each object query vector are determined.
The above-mentioned position information includes 3D position information, direction information, and the like.
In this embodiment, the backbone network and the feature pyramid network compute the multi-scale feature map of the multi-view images, the encoder encodes each scale feature map together with its corresponding 3D position information, and the decoder, which includes the deformable cross-attention layer, decodes the resulting 3D position sensing features together with the object query vectors and outputs the object information and 3D position information of the objects in the multi-view images. In this way, 3D target detection results with higher accuracy can be provided with higher efficiency.
Fig. 6 is a schematic diagram of a fourth embodiment of the present disclosure, and as shown in fig. 6, an object detection apparatus 600 provided in the present embodiment includes:
an acquisition unit 601 that acquires a multi-scale feature map of a multi-view image;
a determining unit 602, configured to determine three-dimensional coordinate information of each scale feature map in space, where the three-dimensional coordinate information of each scale feature map includes three-dimensional coordinates of a plurality of points corresponding to each pixel point in the scale feature map in space;
The encoding unit 603 is configured to encode each scale feature map and three-dimensional coordinate information of the scale feature map, and determine three-dimensional position sensing features of the multi-scale feature map according to an encoding result;
the decoding unit 604 is configured to decode based on the three-dimensional position sensing feature of the multi-scale feature map, so as to obtain object category and position information of the object in the multi-view image.
In some embodiments, the decoding unit includes a first decoding module 6041, the first decoding module 6041 for:
decoding is performed based on three-dimensional position sensing features of the multi-scale feature map and a preset plurality of object query vectors, wherein the object query vectors comprise coding vectors of positions in a three-dimensional space.
In some embodiments, the first decoding module 6041 includes an offset submodule 6042, a determination submodule 6043, an interaction submodule 6044, and a prediction submodule 6045, wherein,
an offset submodule 6042, configured to, for each object query vector, perform bidirectional offset according to a reference point of the object query vector in a three-dimensional space, to obtain a plurality of two-dimensional offset points of the reference point in an image space; the two-way offset comprises the steps of performing first offset on the reference point in a three-dimensional space, and performing second offset on the three-dimensional offset point obtained by the first offset and the projection point of the reference point in the two-dimensional space respectively;
A determination submodule 6043 for determining associated image features for each two-dimensional offset point;
an interaction sub-module 6044 for interacting with the object query vector using the associated image features of the plurality of two-dimensional offset points and the projected associated image features of the reference points, to update the object query vector;
and a prediction submodule 6045, configured to predict the object category and the position information according to the updated object query vector.
In some embodiments, the offset submodule 6042 is further configured to:
performing first offset on a reference point of the object query vector in a three-dimensional space in the three-dimensional space, and determining at least one three-dimensional offset point;
determining projections of the at least one three-dimensional offset point and the reference point in the multi-view image space respectively, and performing second offset on each projection in the multi-view image space to obtain a plurality of two-dimensional offset points. The determination submodule 6043 is further configured to:
for each two-dimensional offset point, determining feature information corresponding to the two-dimensional offset point in the multi-scale feature map;
and fusing the feature information corresponding to each multi-scale feature map to obtain the associated image features of the two-dimensional offset points.
In some embodiments, the decoding unit 604 includes a second decoding module 6046, the second decoding module 6046 for:
Inputting three-dimensional position-aware features of the multi-scale feature map and object query vectors to a decoder, determining associated image features of a plurality of two-dimensional offset points for each object query vector based on bi-directional offset by a deformable cross-attention layer in the decoder; performing cross attention calculation on the associated image features of a plurality of two-dimensional offset points and the projected associated image features of the reference points of the object query vector and the object query vector by using a deformable cross attention layer, and updating the object query vector according to a cross attention calculation result;
in the decoder, object class and position information of objects in the multi-view image are determined from the updated object query vector.
In some embodiments, the acquisition unit 601 includes a first acquisition module 6011, the first acquisition module 6011 configured to:
the multi-view image is input into a backbone network and a feature pyramid network, and a multi-scale feature map is output by the feature pyramid network.
In some embodiments, the determining unit 602 includes a first determining module 6021, the first determining module 6021 for:
and determining three-dimensional coordinate information of each scale feature map in the view cone space corresponding to the multiple views.
In some embodiments, the encoding unit 603 includes a first encoding module 6031, the first encoding module 6031 for:
And inputting each scale feature map and the three-dimensional coordinate information corresponding to the scale feature map to an encoder, encoding the three-dimensional coordinate information corresponding to each scale feature map and the scale feature map by the encoder, and determining the three-dimensional position sensing features of the multi-scale feature map according to the encoding result.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the solution provided by any one of the above embodiments.
According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising: a computer program stored in a readable storage medium, from which at least one processor of an electronic device can read, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any one of the embodiments described above.
FIG. 7 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. The electronic device 700 may be a terminal device or a server. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, the target detection method. For example, in some embodiments, the target detection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the target detection method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the target detection method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability existing in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (19)

1. A target detection method comprising:
acquiring a multi-scale feature map of a multi-view image;
determining three-dimensional coordinate information of each scale feature map in space, wherein the three-dimensional coordinate information of each scale feature map comprises three-dimensional coordinates of a plurality of points corresponding to each pixel point in the scale feature map in space;
encoding each scale feature map and the three-dimensional coordinate information of the scale feature map, and determining three-dimensional position-aware features of the multi-scale feature map according to an encoding result; and
decoding based on the three-dimensional position-aware features of the multi-scale feature map to obtain object category and position information of objects in the multi-view image.
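For illustration only, and not as part of the claimed subject matter, the following Python sketch strings the four steps of claim 1 together; every callable name (backbone, encoder, decoder, coord_fn) is an assumed placeholder rather than anything named in the specification.

```python
def detect_objects(multi_view_images, backbone, encoder, decoder, coord_fn):
    """Illustrative data flow of claim 1; all callables are hypothetical placeholders."""
    # Step 1: multi-scale feature maps of the multi-view images.
    multi_scale_feats = backbone(multi_view_images)            # list of (N_views, C, H_l, W_l)
    # Step 2: 3D coordinates in space for the pixels of each scale feature map.
    coords_3d = [coord_fn(f.shape[-2], f.shape[-1]) for f in multi_scale_feats]
    # Step 3: encode each scale feature map together with its 3D coordinate information.
    pos_aware_feats = [encoder(f, c) for f, c in zip(multi_scale_feats, coords_3d)]
    # Step 4: decode the position-aware features into object categories and positions.
    classes, boxes = decoder(pos_aware_feats)
    return classes, boxes
```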
2. The method of claim 1, wherein the decoding based on the three-dimensional position-aware features of the multi-scale feature map comprises:
decoding based on the three-dimensional position-aware features of the multi-scale feature map and a plurality of preset object query vectors, wherein each object query vector comprises an encoding vector of a position in a three-dimensional space.
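One way to realize such preset object query vectors is sketched below under the assumption that each query is derived from a learnable 3D reference point; the class name ObjectQueries and the MLP design are illustrative and do not appear in the specification.

```python
import torch.nn as nn

class ObjectQueries(nn.Module):
    """Object queries as encodings of positions in 3D space (an illustrative sketch)."""

    def __init__(self, num_queries=900, embed_dim=256):
        super().__init__()
        # One learnable reference point in normalized 3D space per query.
        self.reference_points = nn.Embedding(num_queries, 3)
        # Small MLP that encodes a 3D position into a query vector.
        self.position_encoder = nn.Sequential(
            nn.Linear(3, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self):
        ref = self.reference_points.weight.sigmoid()     # (num_queries, 3) in [0, 1]
        return self.position_encoder(ref), ref           # query vectors and reference points
```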
3. The method of claim 2, wherein the decoding based on the three-dimensional position-aware features of the multi-scale feature map comprises:
for each object query vector, performing bidirectional offset according to a reference point of the object query vector in a three-dimensional space to obtain a plurality of two-dimensional offset points of the reference point in an image space, wherein the bidirectional offset comprises performing a first offset on the reference point in the three-dimensional space, and performing a second offset, in a two-dimensional space, on a three-dimensional offset point obtained by the first offset and on a projection point of the reference point, respectively;
determining associated image features of each two-dimensional offset point;
interacting the object query vector with the associated image features of the plurality of two-dimensional offset points and the projected associated image features of the reference points to update the object query vector;
and predicting the object category and the position information according to the updated object query vector.
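As a compact, purely illustrative sketch of one decoding step for a single query following the bidirectional-offset idea of claim 3: the offset heads, projection function, sampling function, and cross-attention module below are all assumed helpers, not elements defined by the claim.

```python
import torch

def query_update_step(query, ref_3d, offset3d_head, offset2d_head,
                      project_fn, sample_fn, cross_attn):
    """One query update (illustrative only).

    query: (1, D) object query vector; ref_3d: (1, 3) reference point in 3D space.
    offset3d_head outputs 3*K values, offset2d_head outputs 2*(K+1) values.
    project_fn maps 3D points to 2D image coordinates; sample_fn returns the
    associated image feature of a 2D point; cross_attn is e.g. nn.MultiheadAttention.
    """
    # First offset: move the reference point inside 3D space.
    pts_3d = ref_3d + offset3d_head(query).view(-1, 3)                 # (K, 3)
    # Project the 3D offset points and the reference point into image space.
    proj = project_fn(torch.cat([pts_3d, ref_3d], dim=0))              # (K + 1, 2)
    # Second offset: move each projection inside 2D image space.
    pts_2d = proj + offset2d_head(query).view(-1, 2)                   # (K + 1, 2)
    # Gather the associated image features of the 2D offset points and of the
    # projection of the reference point, then interact via cross-attention.
    keys = torch.stack([sample_fn(p) for p in pts_2d] + [sample_fn(proj[-1])])
    updated, _ = cross_attn(query.unsqueeze(0), keys.unsqueeze(0), keys.unsqueeze(0))
    return updated.squeeze(0)                                          # updated object query
```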
4. The method according to claim 3, wherein the performing, for each object query vector, bidirectional offset according to a reference point of the object query vector in the three-dimensional space to obtain a plurality of two-dimensional offset points of the reference point in the image space comprises:
performing a first offset, in the three-dimensional space, on the reference point of the object query vector in the three-dimensional space, and determining at least one three-dimensional offset point;
determining projections of the at least one three-dimensional offset point and the reference point in a multi-view image space respectively, and performing second offset on each projection in the multi-view image space to obtain a plurality of two-dimensional offset points; and
the determining the associated image features for each two-dimensional offset point comprises:
for each two-dimensional offset point, determining feature information corresponding to the two-dimensional offset point in the multi-scale feature map;
and fusing the feature information corresponding to each scale feature map to obtain the associated image feature of the two-dimensional offset point.
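The sampling-and-fusion step of claim 4 could look like the sketch below, which bilinearly samples one 2D offset point from every scale and combines the results with per-scale weights; the source of the weights (e.g., a linear head on the object query) is an assumption.

```python
import torch
import torch.nn.functional as F

def sample_and_fuse(feature_maps, point_2d, scale_weights):
    """Associated image feature of one 2D offset point (illustrative sketch).

    feature_maps:  list of (1, C, H_l, W_l) tensors, one per scale
    point_2d:      (2,) coordinates normalized to [0, 1]
    scale_weights: (num_scales,) fusion weights for the scales
    """
    grid = (point_2d * 2 - 1).view(1, 1, 1, 2)                    # grid_sample expects [-1, 1]
    per_scale = [
        F.grid_sample(fm, grid, align_corners=False).view(-1)     # (C,) feature at the point
        for fm in feature_maps
    ]
    # Weighted fusion across scales yields the associated image feature of the point.
    return (torch.stack(per_scale) * scale_weights.view(-1, 1)).sum(dim=0)
```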
5. The method of claim 1, wherein the decoding based on the three-dimensional position-aware features of the multi-scale feature map comprises:
inputting the three-dimensional position-aware features of the multi-scale feature map and the object query vectors to a decoder, and determining, by a deformable cross-attention layer in the decoder, associated image features of a plurality of two-dimensional offset points for each object query vector based on bidirectional offset; performing, by the deformable cross-attention layer, cross-attention calculation between the object query vector and the associated image features of the plurality of two-dimensional offset points together with the projected associated image features of the reference point of the object query vector, and updating the object query vector according to a cross-attention calculation result;
determining, in the decoder, object category and position information of objects in the multi-view image from the updated object query vector.
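As a purely illustrative reading of claim 5, a decoder layer could be organized as below; the deformable cross-attention module is passed in as an assumed component implementing the bidirectional-offset sampling, and the surrounding structure (query self-attention plus feed-forward network) follows common transformer-decoder convention rather than anything recited in the claim.

```python
import torch.nn as nn

class SketchDecoderLayer(nn.Module):
    """One decoder layer: query self-attention, deformable cross-attention, FFN (a sketch)."""

    def __init__(self, deform_cross_attn, embed_dim=256, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.deform_cross_attn = deform_cross_attn   # assumed module doing bidirectional offsets
        self.ffn = nn.Sequential(nn.Linear(embed_dim, 4 * embed_dim), nn.ReLU(),
                                 nn.Linear(4 * embed_dim, embed_dim))

    def forward(self, queries, ref_points, pos_aware_feats):
        q = queries + self.self_attn(queries, queries, queries)[0]
        q = q + self.deform_cross_attn(q, ref_points, pos_aware_feats)
        return q + self.ffn(q)
```

Classification and box heads (for example, small nn.Linear stacks) applied to the updated queries would then yield the object category and position information mentioned at the end of the claim.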
6. The method of claim 1, wherein the acquiring a multi-scale feature map of a multi-view image comprises:
inputting the multi-view image into a backbone network and a feature pyramid network, and outputting a multi-scale feature map by the feature pyramid network.
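One readily available way to obtain such multi-scale feature maps is shown below, given only as an example: the claims do not name any particular backbone, and the snippet assumes torchvision >= 0.13 for the weights keyword.

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-50 backbone with a feature pyramid network on top (an illustrative choice).
backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)
images = torch.randn(6, 3, 256, 704)          # e.g. six surround-view camera images
multi_scale_feats = backbone(images)          # dict of feature maps at several scales
for name, fmap in multi_scale_feats.items():
    print(name, tuple(fmap.shape))            # smaller spatial size at coarser scales
```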
7. The method of claim 1, wherein determining three-dimensional coordinate information of each scale feature map in space comprises:
determining three-dimensional coordinate information of each scale feature map in a view cone space corresponding to the multiple views.
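For intuition only, a view-cone (camera frustum) coordinate grid could be built as below; the depth discretization and the intrinsics/extrinsics handling are assumptions, since the claim only states that the coordinates lie in the view cone space of the views.

```python
import torch

def frustum_points(height, width, depth_bins, intrinsics_inv, cam_to_world):
    """3D coordinates of D points per feature-map pixel in the view cone (a sketch).

    intrinsics_inv: (3, 3) inverse camera intrinsics; cam_to_world: (4, 4) extrinsics.
    Returns an (H, W, D, 3) tensor of 3D coordinates in a shared space.
    """
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    pixels = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()       # (H, W, 3)
    rays = pixels @ intrinsics_inv.T                                          # camera-space rays
    depths = torch.as_tensor(depth_bins, dtype=torch.float32).view(1, 1, -1, 1)
    cam_pts = rays.unsqueeze(2) * depths                                      # (H, W, D, 3)
    cam_h = torch.cat([cam_pts, torch.ones_like(cam_pts[..., :1])], dim=-1)   # homogeneous
    return (cam_h @ cam_to_world.T)[..., :3]                                  # shared-space coords
```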
8. The method according to claim 1, wherein the encoding each scale feature map and the three-dimensional coordinate information of the scale feature map, and determining the three-dimensional position-aware features of the multi-scale feature map according to the encoding result, comprises:
inputting each scale feature map and the three-dimensional coordinate information corresponding to the scale feature map to an encoder, encoding, by the encoder, each scale feature map and the corresponding three-dimensional coordinate information, and determining the three-dimensional position-aware features of the multi-scale feature map according to an encoding result.
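A minimal encoder sketch, under the assumption that the per-pixel 3D coordinates are flattened and passed through an MLP whose output is added to a projected feature map; the module and layer sizes are illustrative and not taken from the specification.

```python
import torch.nn as nn

class PositionAwareEncoder(nn.Module):
    """Encodes a feature map together with its per-pixel 3D coordinates (a sketch)."""

    def __init__(self, in_channels=256, depth_bins=64, embed_dim=256):
        super().__init__()
        # Flattened (D x 3) coordinates per pixel -> a 3D position embedding.
        self.coord_mlp = nn.Sequential(
            nn.Linear(depth_bins * 3, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        self.feat_proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, feature_map, coords_3d):
        # feature_map: (N, C, H, W); coords_3d: (H, W, D, 3)
        h, w = coords_3d.shape[:2]
        pos = self.coord_mlp(coords_3d.reshape(h, w, -1))      # (H, W, E)
        pos = pos.permute(2, 0, 1).unsqueeze(0)                # (1, E, H, W)
        return self.feat_proj(feature_map) + pos               # 3D position-aware features
```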
9. A target detection apparatus, comprising:
the acquisition unit is used for acquiring a multi-scale feature map of the multi-view image;
the determining unit is used for determining three-dimensional coordinate information of each scale feature map in space, wherein the three-dimensional coordinate information of each scale feature map comprises three-dimensional coordinates of a plurality of points corresponding to each pixel point in the scale feature map in space;
the encoding unit is used for encoding each scale feature map and the three-dimensional coordinate information of the scale feature map, and determining the three-dimensional position-aware features of the multi-scale feature map according to the encoding result;
and the decoding unit is used for decoding based on the three-dimensional position-aware features of the multi-scale feature map to obtain object category and position information of objects in the multi-view image.
10. The apparatus of claim 9, wherein the decoding unit comprises a first decoding module configured to:
decoding based on the three-dimensional position-aware features of the multi-scale feature map and a plurality of preset object query vectors, wherein each object query vector comprises an encoding vector of a position in a three-dimensional space.
11. The apparatus of claim 10, wherein the first decoding module comprises an offset sub-module, a determination sub-module, an interaction sub-module, and a prediction sub-module, wherein,
the offset sub-module is used for performing, for each object query vector, bidirectional offset according to a reference point of the object query vector in a three-dimensional space to obtain a plurality of two-dimensional offset points of the reference point in an image space, wherein the bidirectional offset comprises performing a first offset on the reference point in the three-dimensional space, and performing a second offset, in a two-dimensional space, on a three-dimensional offset point obtained by the first offset and on a projection point of the reference point, respectively;
the determination sub-module is used for determining the associated image features of each two-dimensional offset point;
the interaction sub-module is used for interacting the object query vector with the associated image features of the plurality of two-dimensional offset points and the projected associated image features of the reference points to update the object query vector;
and the prediction sub-module is used for predicting the object category and the position information according to the updated object query vector.
12. The apparatus of claim 11, wherein the offset sub-module is further configured to:
performing a first offset, in the three-dimensional space, on the reference point of the object query vector in the three-dimensional space, and determining at least one three-dimensional offset point;
determining projections of the at least one three-dimensional offset point and the reference point in a multi-view image space respectively, and performing second offset on each projection in the multi-view image space to obtain a plurality of two-dimensional offset points; and
the determination sub-module is further configured to:
for each two-dimensional offset point, determining feature information corresponding to the two-dimensional offset point in the multi-scale feature map;
and fusing the feature information corresponding to each scale feature map to obtain the associated image feature of the two-dimensional offset point.
13. The apparatus of claim 9, wherein the decoding unit comprises a second decoding module configured to:
inputting the three-dimensional position-aware features of the multi-scale feature map and the object query vectors to a decoder, and determining, by a deformable cross-attention layer in the decoder, associated image features of a plurality of two-dimensional offset points for each object query vector based on bidirectional offset; performing, by the deformable cross-attention layer, cross-attention calculation between the object query vector and the associated image features of the plurality of two-dimensional offset points together with the projected associated image features of the reference point of the object query vector, and updating the object query vector according to a cross-attention calculation result;
determining, in the decoder, object category and position information of objects in the multi-view image from the updated object query vector.
14. The apparatus of claim 9, wherein the acquisition unit comprises a first acquisition module to:
inputting the multi-view image into a backbone network and a feature pyramid network, and outputting a multi-scale feature map by the feature pyramid network.
15. The apparatus of claim 9, wherein the determining unit comprises a first determining module configured to:
determining three-dimensional coordinate information of each scale feature map in a view cone space corresponding to the multiple views.
16. The apparatus of claim 9, wherein the encoding unit comprises a first encoding module to:
inputting each scale feature map and the three-dimensional coordinate information corresponding to the scale feature map to an encoder, encoding, by the encoder, each scale feature map and the corresponding three-dimensional coordinate information, and determining the three-dimensional position-aware features of the multi-scale feature map according to an encoding result.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of claims 1-8.
CN202311436010.5A 2023-10-31 2023-10-31 Target detection method and device and electronic equipment Pending CN117541816A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311436010.5A CN117541816A (en) 2023-10-31 2023-10-31 Target detection method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311436010.5A CN117541816A (en) 2023-10-31 2023-10-31 Target detection method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117541816A true CN117541816A (en) 2024-02-09

Family

ID=89787178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311436010.5A Pending CN117541816A (en) 2023-10-31 2023-10-31 Target detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117541816A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination