CN115909253A - Target detection and model training method, device, equipment and storage medium

Publication number: CN115909253A
Application number: CN202211675436.1A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 崇智禹, 董嘉蓉
Applicant and current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Legal status: Pending
Prior art keywords: BEV, image, target, scene, feature

Abstract

The disclosure provides a target detection method, a model training method, and corresponding devices, equipment and storage media, and relates to the field of artificial intelligence, in particular to the field of automatic driving. The specific implementation scheme is as follows: obtaining images acquired by image acquisition devices with different acquisition view angles deployed on a target vehicle; extracting two-dimensional image features of the obtained images; performing view-angle transformation on the extracted two-dimensional image features to obtain scene BEV features corresponding to the bird's-eye view (BEV) scene in which the target vehicle is located; predicting a heat map corresponding to the BEV scene based on the obtained scene BEV features, where the response value at each position in the heat map characterizes the probability that a target exists at that position in the BEV scene; determining an initial detection frame for target detection in the BEV scene according to the heat map; and performing target detection based on the determined initial detection frame and the obtained images. Applying the scheme provided by the embodiments of the disclosure can improve the efficiency of target detection.

Description

Target detection and model training method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to the field of automated driving technology.
Background
Target detection technology is now mature and widely applied across industries. In the field of automatic driving, for example, targets such as vehicles, obstacles and pedestrians may be detected from images captured by onboard cameras, so that downstream tasks such as lane changing and obstacle avoidance can be performed based on the detected targets.
In existing target detection schemes, an initial detection frame for detecting a target in an image is generally selected at random and then adjusted and optimized repeatedly until a preset standard is reached; image features are then extracted based on the optimized initial detection frame, and target detection is performed based on the extracted features.
Disclosure of Invention
The disclosure provides a target detection method, a model training method, a device, equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a target detection method including:
obtaining images acquired by image acquisition devices with different acquisition view angles deployed on a target vehicle;
extracting two-dimensional image features of the obtained images;
performing view-angle transformation on the extracted two-dimensional image features to obtain scene BEV features corresponding to the bird's-eye view (BEV) scene in which the target vehicle is located;
predicting a heat map corresponding to the BEV scene based on the obtained scene BEV features, where the response value at each position in the heat map characterizes the probability that a target exists at that position in the BEV scene;
determining an initial detection frame for target detection in the BEV scene according to the heat map;
and performing target detection based on the determined initial detection frame and the obtained images.
According to another aspect of the present disclosure, there is provided a model training method, including:
obtaining sample images acquired by image acquisition devices with different acquisition view angles deployed on a vehicle, and annotation boxes of sample targets in the obtained sample images;
obtaining sample BEV features, corresponding to the BEV scene in which the vehicle is located, of the acquired sample images;
inputting the sample BEV features into a preset neural network model for target detection to obtain a target detection result output by the neural network model, where the neural network model includes a heat-map generation network layer for generating a heat map, the response value at each position in the heat map characterizes the probability that a sample target exists at that position in the BEV scene, the target detection result is obtained according to sample initial detection frames, and each sample initial detection frame is determined based on the heat map;
obtaining a first difference between each sample initial detection frame and the annotation boxes in the obtained sample images, and a second difference between the heat map and a sample heat map generated from the annotation boxes in the obtained sample images;
and adjusting network parameters of the neural network model based on the first difference and the second difference to obtain a target detection model.
According to another aspect of the present disclosure, there is provided a target detection apparatus, including:
an image obtaining module, configured to obtain images acquired by image acquisition devices with different acquisition view angles deployed on a target vehicle;
a two-dimensional image feature extraction module, configured to extract two-dimensional image features of the obtained images;
a scene BEV feature obtaining module, configured to perform view-angle transformation on the extracted two-dimensional image features to obtain scene BEV features, corresponding to the BEV scene in which the target vehicle is located, of the extracted two-dimensional image features;
a heat map prediction module, configured to predict a heat map corresponding to the BEV scene based on the obtained scene BEV features, where the response value at each position in the heat map characterizes the probability that a target exists at that position in the BEV scene;
an initial detection frame determining module, configured to determine an initial detection frame for target detection in the BEV scene according to the heat map;
and a target detection module, configured to perform target detection based on the determined initial detection frame and the obtained images.
According to another aspect of the present disclosure, there is provided a model training apparatus including:
a sample image obtaining module, configured to obtain sample images acquired by image acquisition devices with different acquisition view angles deployed on a vehicle, and annotation boxes of sample targets in the sample images;
a sample BEV feature obtaining module, configured to obtain sample BEV features, corresponding to the BEV scene in which the vehicle is located, of the acquired sample images;
a sample input module, configured to input the sample BEV features into a preset neural network model for target detection to obtain a target detection result output by the neural network model, where the neural network model includes a network layer for generating a heat map, the response value at each position in the heat map characterizes the probability that a sample target exists at that position in the BEV scene, the target detection result is obtained according to sample initial detection frames, and each sample initial detection frame is determined based on the heat map;
a difference obtaining module, configured to obtain a first difference between each sample initial detection frame and the annotation boxes in the obtained sample images, and a second difference between the heat map and a sample heat map generated from the annotation boxes in the obtained sample images;
and a parameter adjusting module, configured to adjust network parameters of the neural network model based on the first difference and the second difference to obtain a target detection model.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described target detection method or model training method.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described target detection method or model training method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above target detection method or model training method.
As can be seen from the above, when the scheme provided by the embodiments of the present disclosure is applied to target detection, two-dimensional image features of images acquired by image acquisition devices with different acquisition view angles deployed on a target vehicle may be extracted, and view-angle transformation may be performed on the extracted two-dimensional image features to obtain scene BEV features corresponding to the BEV scene in which the target vehicle is located. A heat map corresponding to the BEV scene may then be predicted based on the obtained scene BEV features, an initial detection frame may be determined according to the heat map, and target detection may be performed based on the determined initial detection frame and the obtained images.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a first target detection method provided by an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a second target detection method provided by an embodiment of the disclosure;
fig. 3 is a schematic flowchart of a third target detection method provided by an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a model training method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a first model training process provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a second model training process provided by an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a target detection method or a model training method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The execution subject of the scheme provided by the embodiments of the present disclosure is explained first.
The execution subject may be any electronic device with data processing, data storage and similar capabilities.
The following describes in detail the target detection scheme provided by the embodiments of the present disclosure.
Referring to fig. 1, a schematic flowchart of a first target detection method provided in an embodiment of the present disclosure is shown, where the method includes the following steps S101 to S106.
Step S101: images acquired by image acquisition devices of different acquisition perspectives deployed in a target vehicle are obtained.
The target vehicle may be a vehicle in an autonomous driving state.
Image acquisition devices with different acquisition view angles can be deployed on the target vehicle in advance.
For example, the image pickup devices may be disposed on the front and rear sides and the left and right sides of the vehicle body, respectively.
Therefore, the image acquisition equipment can acquire images in the respective view field ranges when the vehicle is in an automatic driving state.
Step S102: two-dimensional image features of the obtained image are extracted.
The embodiment of the present disclosure does not limit a specific manner of extracting the two-dimensional image features of the obtained image, and for example, the two-dimensional image features may be extracted based on a two-dimensional image feature extraction algorithm such as an edge extraction operator, a texture feature extraction algorithm, a convolutional neural network algorithm, and the like.
In one embodiment, two-dimensional image features of the obtained image may be extracted based on a pre-trained neural network model.
The embodiments of the present disclosure do not limit the specific training mode of the model and the specific architecture of the model. For example, the neural network model may be a Resnet50 network model, a Resnet101 network model, or the like.
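By way of illustration only, the following is a minimal sketch of this kind of two-dimensional feature extraction, assuming a PyTorch environment with torchvision and a ResNet-50 backbone truncated before its pooling head; the input shapes are hypothetical:

```python
import torch
import torchvision

# Sketch: extract 2D feature maps from multi-view images with a ResNet-50
# backbone, keeping everything up to the last residual stage and dropping
# the pooling/classification head. Input sizes are hypothetical.
backbone = torchvision.models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

images = torch.randn(6, 3, 224, 448)  # e.g. 6 surround-view cameras (assumed)
with torch.no_grad():
    img_feats = feature_extractor(images)  # (6, 2048, 7, 14) 2D image features
print(img_feats.shape)
```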
Step S103: performing view-angle transformation on the extracted two-dimensional image features to obtain scene BEV features, corresponding to the bird's-eye view (BEV) scene in which the target vehicle is located, of the extracted two-dimensional image features.
Here, BEV (Bird's Eye View) may also be referred to as the top view.
As described in the preceding steps, the acquisition view angles of the image acquisition devices differ, so the acquired images are images from different acquisition view angles, and the two-dimensional image features extracted from them are image features corresponding to those different acquisition view angles.
In this step, the image features corresponding to the different acquisition view angles are subjected to view-angle transformation to obtain the scene BEV features corresponding to the BEV scene.
For example, if the image acquisition devices include devices disposed on the front, rear, left and right sides of the vehicle body, the combined acquisition view angle may be referred to as a surround view, and the two-dimensional image features may be referred to as image features corresponding to the surround view. In this step, the image features corresponding to the surround view need to be converted into scene BEV features corresponding to the bird's-eye view BEV scene.
Specifically, the above-described scene BEV feature may be obtained in the following manner.
In one embodiment, three-dimensional spatial features may be generated based on the two-dimensional image features, and BEV features may then be generated from the resulting three-dimensional spatial features; for example, reducing the height dimension of the three-dimensional spatial features (collapsing it to a single plane) yields the BEV features.
The embodiments of the present disclosure do not limit the specific manner of generating the three-dimensional spatial features from the two-dimensional image features; for example, an LSS (Lift, Splat, Shoot) algorithm, a Spconv+ algorithm or other algorithms may be used to generate the three-dimensional spatial features from the two-dimensional image features.
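As an illustrative sketch only (not the disclosure's prescribed procedure), the "collapse the height dimension" idea can be realized by pooling a lifted voxel volume over its height axis; PyTorch and all shapes here are assumptions:

```python
import torch

# Toy sketch: a lifting step such as LSS produces a 3D voxel feature volume
# (C, Z, X, Y); collapsing the height dimension Z (here by mean pooling)
# yields BEV features (C, X, Y). All sizes are hypothetical.
voxel_feats = torch.randn(64, 8, 128, 128)  # (C, Z, X, Y)
bev_feats = voxel_feats.mean(dim=1)         # (C, X, Y) scene BEV features
print(bev_feats.shape)                      # torch.Size([64, 128, 128])
```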
In another embodiment, the scene BEV features may be obtained based on a spatial detection frame preset in the BEV scene.
For example, image areas corresponding to the spatial detection frames preset in the BEV scene in the obtained images are determined, then a first BEV feature corresponding to the spatial detection frame is obtained based on the two-dimensional image features of the obtained images, and further the scene BEV feature is obtained based on the first BEV feature. The detailed description will be given in the following step S203-step S206 in the embodiment shown in fig. 2, and will not be described in detail here.
In an embodiment of the present disclosure, step S103 may be implemented through steps A and B, which are described in detail in the following embodiments.
Step S104: predicting a heat map corresponding to the BEV scene based on the obtained scene BEV features.
The response value at each position in the heat map characterizes the probability that a target exists at that position in the BEV scene: a position with a higher response value indicates a higher probability that a target exists there in the BEV scene, and conversely, a position with a lower response value indicates a lower probability.
The targets may be targets common in any type of autonomous driving scenario, such as pedestrians, vehicles and obstacles.
Specifically, the heat map corresponding to the BEV scene may be predicted in the following manners.
In one embodiment, the scene BEV features may be divided into sub-BEV features, each corresponding to a sub-region of the BEV scene, and the heat map corresponding to the BEV scene may then be predicted based on the similarity between each sub-BEV feature and a preset feature. For example, the response value at the position of a sub-region in the heat map may be set to that similarity. The preset feature may be set in advance by a worker, for example as a feature of a common target in a BEV scene.
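A minimal sketch of this similarity-based prediction, assuming PyTorch, cosine similarity as the similarity measure, and a randomly initialized placeholder for the preset feature:

```python
import torch
import torch.nn.functional as F

# Sketch: treat each BEV grid cell's feature vector as a sub-BEV feature and
# score it by cosine similarity to a preset target feature; the similarity
# becomes the response value at that cell. The preset feature is a random
# placeholder here.
C, X, Y = 64, 128, 128
scene_bev = torch.randn(C, X, Y)
preset_feat = torch.randn(C)

sub_feats = scene_bev.reshape(C, -1).t()                 # (X*Y, C)
sims = F.cosine_similarity(sub_feats, preset_feat.unsqueeze(0), dim=1)
heatmap = sims.reshape(X, Y).clamp(min=0)                # (X, Y) heat map
```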
In another embodiment, the scene BEV features may be input into a pre-trained heat-map generation model to obtain the heat map, corresponding to the BEV scene, output by that model.
The heat-map generation model is a model for generating heat maps, obtained by training a preset neural network model with sample BEV features as input information and annotation boxes of detection targets in sample images as training labels, where the sample BEV features are the BEV features of the sample images corresponding to the BEV scene. The heat map corresponding to the BEV scene can thus be obtained quickly through this model.
Step S105: determining an initial detection frame for target detection in the BEV scene according to the heat map.
Specifically, the initial detection frame for target detection in the BEV scene may be determined in the following manners.
In one embodiment, target positions in the heat map whose response values are greater than a preset threshold may be determined, and the initial detection frame for target detection in the BEV scene may be determined based on the determined target positions.
In another embodiment, a preset number of target positions with the highest response values in the heat map may be determined, and the initial detection frame for target detection in the BEV scene may be determined based on the determined target positions.
In this way, the initial detection frame is determined from the preset number of positions with the highest response values, that is, from the regions in the BEV scene where a target is most likely to exist, which helps improve the accuracy of the determined initial detection frame.
The preset threshold value and the preset number can be set by workers according to experience.
The manner of determining the initial detection frame based on the determined target positions is described below.
Specifically, a region in the BEV scene corresponding to each determined position may be determined as an initial detection frame.
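A brief sketch of the second manner above (top-K response positions), assuming PyTorch and a hypothetical fixed initial frame size:

```python
import torch

# Sketch: pick the K positions with the highest response values in the heat
# map and place an initial detection frame of a fixed, assumed size at each.
K = 100
heatmap = torch.rand(128, 128)                        # predicted heat map (X, Y)
scores, flat_idx = heatmap.flatten().topk(K)
xs = torch.div(flat_idx, 128, rounding_mode="floor")  # grid row of each peak
ys = flat_idx % 128                                   # grid column of each peak

init_w = torch.full((K,), 4.0)                        # assumed frame width (cells)
init_l = torch.full((K,), 4.0)                        # assumed frame length (cells)
init_frames = torch.stack([xs.float(), ys.float(), init_w, init_l], dim=1)
# (K, 4): center_x, center_y, width, length of each initial detection frame
```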
Step S106: performing target detection based on the determined initial detection frame and the obtained images.
Specifically, target detection may be performed based on the determined initial detection frame and the obtained images in the following manners.
In one embodiment, the spatial position corresponding to the initial detection frame in the BEV scene may be obtained directly based on the device internal parameters of each image acquisition device; a second BEV feature corresponding to the obtained spatial position may then be determined based on the scene BEV features, so that the detection target in the acquired images is detected based on the second BEV feature and the obtained images.
After the second BEV feature corresponding to the initial detection frame is obtained, the image area corresponding to the spatial position may be determined, and depth-dimension information may be added to the second BEV feature according to the two-dimensional image features of the determined image area, so that the spatial position of the target is determined based on the second BEV feature with the added depth dimension, that is, the target in the obtained images is detected.
In an embodiment of the present disclosure, on the basis of the foregoing embodiment, after the spatial position of the detection target is determined, the category of the detection target may also be determined based on the second BEV feature.
The category of the detection target may be determined by calculating similarity between the second BEV feature and each preset category feature, and the like, which is not described in detail herein.
In another embodiment, after obtaining the second BEV feature, the detection target in the acquired image may be detected in different manners according to whether the second BEV feature is a BEV feature characterizing the target. The detailed implementation manner is shown in the following steps S306-S310 in the embodiment shown in fig. 3, and will not be described in detail here.
As can be seen from the above, when the scheme provided by the embodiments of the present disclosure is applied to target detection, two-dimensional image features of images acquired by image acquisition devices with different acquisition view angles deployed on a target vehicle may be extracted, and view-angle transformation may be performed on the extracted two-dimensional image features to obtain scene BEV features corresponding to the BEV scene in which the target vehicle is located. A heat map corresponding to the BEV scene may then be predicted based on the obtained scene BEV features, an initial detection frame may be determined according to the heat map, and target detection may be performed based on the determined initial detection frame and the obtained images.
In addition, in the solution provided by the embodiments of the present disclosure, the initial detection frame is determined based on a heat map, where the response value at each position characterizes the probability that a target exists at that position in the BEV scene. The determined initial detection frame is therefore related to the probability that a target exists at each position in the BEV scene. Compared with a randomly selected initial detection frame, determining the initial detection frame from these potential target probabilities avoids the drawbacks of a completely random choice, reduces blindness in selection, and increases the probability that the determined initial detection frame covers a target, which allows target detection to be performed quickly and accurately based on the initial detection frame and thus improves the efficiency and accuracy of target detection.
In addition, because the scheme provided by the embodiments of the present disclosure increases the probability that the initial detection frame covers a target, the magnitude and difficulty of the subsequent optimization of the initial detection frame are reduced compared with a randomly selected one, which reduces the amount of computation required for target detection and further improves its efficiency.
Another embodiment of the aforementioned step S103 is explained below through the following steps A and B.
Step A: for the image corresponding to each acquisition view angle, performing scale transformation on the two-dimensional image features of the image to obtain multi-scale two-dimensional features.
Specifically, for the image corresponding to each acquisition view angle, a Feature Pyramid Network (FPN) may be used to perform scale transformation on the two-dimensional image features of the image to obtain the multi-scale two-dimensional features.
Step B: performing view-angle transformation on the obtained multi-scale two-dimensional image features to obtain scene BEV features, corresponding to the BEV scene in which the target vehicle is located, of the extracted multi-scale image features.
For example, if the multi-scale two-dimensional image features are denoted F_img ∈ R^(N×C×h×w), the scene BEV features obtained from the multi-scale two-dimensional image features can be denoted F_bev ∈ R^(C×X×Y).
Here, C denotes the number of channels used when extracting the image features, h and w denote the height and width of the image respectively, and X and Y denote the horizontal and vertical coordinates in the real three-dimensional coordinate system established with the vehicle as the center.
This step can be obtained on the basis of the foregoing step S103, and the difference is only that the image feature types are different, which is not described herein again.
Because multi-scale image features can represent more comprehensive information, more accurate scene BEV features can be obtained based on them.
On the basis of the embodiment shown in fig. 1, when determining the scene BEV features corresponding to the BEV scene based on the two-dimensional image features, the image areas corresponding, in each obtained image, to the spatial detection frames preset in the BEV scene may first be determined; a first BEV feature corresponding to each spatial detection frame may then be obtained based on the two-dimensional image features of the obtained images, and the scene BEV features may further be obtained based on the first BEV features. In view of this, the embodiments of the present disclosure provide a second target detection method.
Referring to fig. 2, a flowchart of a second target detection method provided in the embodiment of the present disclosure includes the following steps S201 to S209.
Step S201: images acquired by image acquisition devices of different acquisition perspectives deployed in a target vehicle are obtained.
Step S202: two-dimensional image features of the obtained image are extracted.
Step S201 and step S202 are the same as step S101 and step S102 in the embodiment shown in fig. 1, and are not repeated here.
Step S203: determining, based on the device internal parameters of each image acquisition device, the image areas corresponding, in each obtained image, to the spatial detection frames preset in the BEV scene in which the target vehicle is located.
The preset spatial detection frames may be understood as the sub-regions obtained by dividing a spatial region, where the spatial region may be the spatial region corresponding to the BEV scene, specifically a spatial region in the real three-dimensional coordinate system established with the target vehicle as the center; for example, it may be a cuboid region 60 meters in length, width and height.
Specifically, the image area corresponding to the spatial detection frame may be determined in the following manner.
In one embodiment, a spatial detection frame preset in the BEV scene may be mapped, based on the device internal parameters of each image acquisition device, to an image area in each obtained image; that image area is the image area corresponding to the spatial detection frame.
In another embodiment, for each obtained image, a first image area corresponding, in the image, to a spatial detection frame preset in the BEV scene, and a second image area within a first preset range of the first image area, may be determined according to the device internal parameters of the image acquisition device that captured the image.
The first preset range may be set by a worker according to experience. Moreover, the second image area only needs to be within a first preset range of the first image area, and the number of the second image areas is not limited in the embodiment of the disclosure.
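For illustration, a sketch of mapping a spatial detection frame into an image area by projecting its 3D corners through camera parameters; the intrinsic matrix, the ego-to-camera transform and the corner coordinates are all placeholders, and NumPy is assumed:

```python
import numpy as np

# Sketch: project the 3D corners of one spatial detection frame through an
# (assumed) ego-to-camera transform and intrinsic matrix, then take the
# bounding rectangle of the projected points as the corresponding image area.
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])        # intrinsic matrix (placeholder values)
T_cam_from_ego = np.eye(4)             # ego -> camera transform (placeholder)

corners_ego = np.array([[x, y, z, 1.0]           # 8 corners of one frame, meters
                        for x in (-1.0, 1.0)
                        for y in (-1.0, 1.0)
                        for z in (10.0, 12.0)])  # placed in front of the camera
corners_cam = (T_cam_from_ego @ corners_ego.T)[:3]  # (3, 8) in camera frame
uv = K @ corners_cam
uv = uv[:2] / uv[2:3]                               # perspective division
u_min, v_min = uv.min(axis=1)
u_max, v_max = uv.max(axis=1)
print("image area:", (u_min, v_min, u_max, v_max))
```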
Step S204: obtaining two-dimensional area features corresponding to the determined image areas based on the two-dimensional image features of the obtained images.
The two-dimensional image features of the images have been obtained in the preceding steps; since the two-dimensional image features are associated with the image areas in the obtained images, the two-dimensional area features corresponding to the determined image areas can be obtained from them.
Specifically, depending on the determined image areas, the corresponding two-dimensional area features may be obtained in the following manners.
In one embodiment, if a single image area is determined, the two-dimensional area feature corresponding to the determined image area can be obtained directly.
In another embodiment, if the determined image areas include the first image area and the second image area, then for each obtained image, the first area sub-feature of the first image area and the second area sub-feature of the second image area may be determined according to the two-dimensional image features of the image, and the determined first and second area sub-features may be weighted and fused to obtain the two-dimensional area feature.
The weights corresponding to the first area sub-feature and the second area sub-feature may be preset by a worker.
In this way, after the first image area and the second image area corresponding to a spatial detection frame are determined in each obtained image, the first area sub-feature corresponding to the first image area and the second area sub-feature corresponding to the second image area can be weighted and fused into the two-dimensional area feature, so that it comprehensively represents the area features of multiple image areas, improving the comprehensiveness and accuracy of the obtained two-dimensional area feature.
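A toy sketch of this weighted fusion, with the feature dimension, the number of second image areas and the weights all assumed:

```python
import torch

# Toy sketch: weighted fusion of the first image area's sub-feature with the
# sub-features of nearby second image areas. Dimension, count and weights
# are all assumed.
first_sub = torch.randn(256)                        # first-area sub-feature
second_subs = [torch.randn(256) for _ in range(2)]  # second-area sub-features

w_first, w_second = 0.6, 0.2                        # preset weights (assumed)
region_feat = w_first * first_sub + sum(w_second * s for s in second_subs)
```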
Step S205: obtaining a first BEV feature corresponding to the spatial detection frame according to the obtained two-dimensional area features.
Specifically, the obtained two-dimensional area feature may be used as the first BEV feature corresponding to the spatial detection frame.
Step S206: obtaining, based on the first BEV features, the scene BEV features corresponding to the two-dimensional image features in the BEV scene.
Each first BEV feature is the BEV feature corresponding to one spatial detection frame; collecting the first BEV features obtained for all the spatial detection frames therefore yields the BEV features corresponding to the spatial region formed by those frames, that is, the scene BEV features corresponding to the BEV scene.
Step S207: predicting a heat map corresponding to the BEV scene based on the obtained scene BEV features.
Step S208: determining an initial detection frame for target detection in the BEV scene according to the heat map.
Step S209: performing target detection based on the determined initial detection frame and the obtained images.
Steps S207 to S209 are the same as steps S104 to S106 in the embodiment shown in fig. 1, and are not described again here.
As can be seen from the above, when the scheme provided by the embodiments of the present disclosure is applied to target detection, the image areas corresponding, in the obtained images, to the spatial detection frames preset in the BEV scene may be determined based on the device internal parameters of each image acquisition device; the first BEV feature corresponding to each spatial detection frame is then obtained from the two-dimensional area features of the determined image areas, and the scene BEV features corresponding to the two-dimensional image features in the BEV scene are obtained from the first BEV features. In this way, the first BEV feature of each spatial detection frame can be determined accurately according to the correspondence between the spatial detection frame and the image areas in the obtained images, so that the scene BEV features can be obtained accurately and conveniently based on the first BEV features of the spatial detection frames.
On the basis of the embodiment shown in fig. 1, when performing target detection based on the determined initial detection frame and the obtained images, the spatial position corresponding to each determined initial detection frame in the BEV scene may be obtained based on the device internal parameters of each image acquisition device; a second BEV feature corresponding to the obtained spatial position may then be determined based on the scene BEV features, and the detection target in the acquired images may be detected in different manners according to whether the second BEV feature is a BEV feature characterizing a target. In view of this, the embodiments of the present disclosure provide a third target detection method.
Referring to fig. 3, a flowchart of a third target detection method provided in the embodiment of the present disclosure is shown, where the method includes the following steps S301 to S310.
Step S301: images acquired by image acquisition devices of different acquisition perspectives deployed in a target vehicle are obtained.
Step S302: two-dimensional image features of the obtained image are extracted.
Step S303: performing view-angle transformation on the extracted two-dimensional image features to obtain scene BEV features, corresponding to the BEV scene in which the target vehicle is located, of the extracted two-dimensional image features.
Step S304: predicting a heat map corresponding to the BEV scene based on the obtained scene BEV features.
Step S305: determining an initial detection frame for target detection in the BEV scene according to the heat map.
The steps S301 to S305 are the same as the steps S101 to S105 in the embodiment shown in fig. 1, and are not described again here.
Step S306: obtaining the spatial position corresponding to the initial detection frame in the BEV scene based on the device internal parameters of each image acquisition device.
Specifically, the corresponding spatial position of the initial detection frame in the BEV scene may be obtained in the following manner.
In one embodiment, the initial detection frame may be mapped into the BEV scene directly based on the device internal parameters of each image acquisition device, so as to obtain the corresponding spatial position of the initial detection frame in the BEV scene.
In another embodiment, a first spatial position corresponding to the initial detection frame in the BEV scene, and a second spatial position within a second preset range of that first spatial position, may be obtained based on the device internal parameters of each image acquisition device.
Wherein the second preset range can be set by a worker according to experience. Moreover, the second spatial position only needs to be within a second preset range of the first spatial position, and the number of the second spatial position is not limited in the embodiment of the disclosure.
Step S307: determining a second BEV feature corresponding to the obtained spatial position based on the scene BEV features.
The scene BEV features have been obtained in the preceding steps; since the scene BEV features correspond to spatial positions, the second BEV feature corresponding to the obtained spatial position can be determined.
Specifically, depending on the determined spatial positions, the second BEV feature corresponding to the obtained spatial position may be determined in the following manners.
In one embodiment, if only one spatial position is determined, the second BEV feature corresponding to that spatial position may be obtained directly.
In another embodiment, if the determined spatial positions include the first spatial position and the second spatial position, a first sub-BEV feature corresponding to the first spatial position and a second sub-BEV feature corresponding to the second spatial position may be determined based on the scene BEV features, and the determined first and second sub-BEV features may be weighted and fused to obtain the second BEV feature.
The weights corresponding to the first sub-BEV feature and the second sub-BEV feature may be preset by a worker.
In this way, after the first spatial position and the second spatial position corresponding to an initial detection frame in the BEV scene are obtained, the first sub-BEV feature corresponding to the first spatial position and the second sub-BEV feature corresponding to the second spatial position can be weighted and fused into the second BEV feature, so that it comprehensively represents the BEV features of multiple spatial positions, improving the comprehensiveness and accuracy of the obtained second BEV feature.
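For illustration, one way to read a BEV feature at a (possibly fractional) spatial position is bilinear sampling of the scene BEV feature map; PyTorch's grid_sample and all shapes here are assumptions, not the disclosure's prescribed mechanism:

```python
import torch
import torch.nn.functional as F

# Sketch: bilinearly sample the scene BEV feature map at (possibly
# fractional) BEV coordinates to read the feature at each spatial position.
scene_bev = torch.randn(1, 64, 128, 128)                # (B, C, X, Y)
positions = torch.tensor([[40.5, 70.2], [10.0, 90.0]])  # BEV (x, y) cells

grid = positions / 127.0 * 2 - 1         # normalize to grid_sample's [-1, 1]
grid = grid.flip(-1).view(1, -1, 1, 2)   # grid_sample wants (col, row) order
sampled = F.grid_sample(scene_bev, grid, align_corners=True)
second_bev = sampled.squeeze(-1).squeeze(0).t()         # (num_positions, C)
```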
Step S308: judging whether the second BEV feature is a BEV feature characterizing a target; if so, executing step S310, and if not, executing step S309.
In this step, what qualifies as a BEV feature characterizing a target may be set empirically by a worker.
Step S309: updating the initial detection frame based on the second BEV feature, and returning to step S306.
The second BEV feature is the one corresponding to the initial detection frame; if it is not a BEV feature characterizing a target, the initial detection frame does not cover a target. The initial detection frame therefore needs to be updated based on the second BEV feature, and the process returns to step S306 to obtain the spatial position corresponding to the updated initial detection frame in the BEV scene.
Step S310: detecting the detection target in the acquired images based on the second BEV feature and the obtained images.
The second BEV feature is the one corresponding to the initial detection frame; if it is a BEV feature characterizing a target, the initial detection frame already covers a target, so the detection target in the acquired images can be detected directly based on the second BEV feature and the obtained images.
The specific detection method can be obtained on the basis of step S106 in the foregoing embodiment of fig. 1, and the difference is only that the feature names are different, which is not described herein again.
As can be seen from the above, when the scheme provided by the embodiments of the present disclosure is applied to target detection, the detection target in the acquired images may be detected in different manners according to whether the second BEV feature is a BEV feature characterizing a target. Specifically, when the second BEV feature is not a BEV feature characterizing a target, the initial detection frame may be updated until the second BEV feature obtained from it is one, so that continuously updating and tuning the initial detection frame further improves the accuracy of target detection based on the second BEV feature corresponding to the initial detection frame.
Corresponding to the target detection method, the embodiment of the disclosure further provides a model training method.
Referring to fig. 4, a schematic flowchart of a model training method provided in an embodiment of the present disclosure is shown, where the method includes the following steps S401 to S405.
Step S401: obtaining sample images acquired by image acquisition devices with different acquisition view angles deployed on a vehicle, and annotation boxes of sample targets in the obtained sample images.
Step S402: obtaining sample BEV features, corresponding to the BEV scene in which the vehicle is located, of the acquired sample images.
The meaning and the manner of obtaining the sample BEV features can be derived from the foregoing target detection method embodiments; the only difference is the feature name, which is not repeated here.
Step S403: inputting the sample BEV features into a preset neural network model for target detection to obtain the target detection result output by the neural network model.
The neural network model includes a heat-map generation network layer; the response value at each position in the generated heat map characterizes the probability that a sample target exists at that position in the BEV scene, the target detection result is obtained according to the sample initial detection frames, and each sample initial detection frame is determined based on the heat map.
The manner of determining each sample initial detection frame based on the heat map can be derived from the foregoing target detection method embodiments; the only difference is the name of the detection frame, which is not repeated here.
In an embodiment of the present disclosure, the annotation boxes may first be blurred to a certain degree, and the sample initial detection frames may then be determined based on the blurred annotation boxes. This can stabilize training to a certain extent.
Step S404: obtaining a first difference between each sample initial detection frame and the annotation boxes in the obtained sample images, and a second difference between the heat map and a sample heat map generated from the annotation boxes in the obtained sample images.
Because the annotation boxes represent the targets that really exist in the sample images, a sample heat map characterizing the probability that a target exists at each position in the BEV scene can be generated from the annotation boxes in the sample images.
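A sketch of one common way to render such a sample heat map from annotated centers, using Gaussian peaks in the CenterNet style; the grid size and the fixed radius are assumptions, as the disclosure does not fix the rendering method:

```python
import torch

# Sketch: render the sample (ground-truth) heat map by placing a Gaussian
# peak at each annotated target's BEV center, so responses decay smoothly
# around the true positions. A fixed radius is assumed; CenterNet-style
# methods instead derive it from the box size.
def gaussian_heatmap(centers, size=128, sigma=2.0):
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    hm = torch.zeros(size, size)
    for cx, cy in centers:
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        hm = torch.maximum(hm, g)  # keep the strongest response per cell
    return hm

sample_hm = gaussian_heatmap([(40, 70), (90, 25)])  # annotated BEV centers
```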
The embodiments of the present disclosure do not limit the manner of obtaining the first difference; for example, the target annotation box corresponding to each sample initial detection frame may be determined based on the Hungarian matching algorithm, and the first difference between each sample initial detection frame and its corresponding target annotation box may then be obtained.
In one embodiment, the first difference may be obtained in the following manner.
For each sample initial detection frame, the annotation box closest to it is determined as its target annotation box, and the first difference between each sample initial detection frame and its corresponding target annotation box is obtained.
Because the sample initial detection frames in the embodiments of the present disclosure are determined based on the heat map, the probability that they cover the sample targets is relatively high. Compared with the Hungarian matching algorithm, matching each sample initial detection frame to its nearest annotation box therefore helps reduce noise and reduce the offset produced during the model's target detection.
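A minimal sketch of this nearest-distance matching, assuming PyTorch and center distance as the matching metric; all tensors are placeholders:

```python
import torch

# Sketch: assign each sample initial detection frame the annotation box whose
# center is closest to it, then measure the first difference as an L1 error.
pred_centers = torch.rand(100, 2) * 128   # sample initial frames (K, x, y)
gt_boxes = torch.rand(7, 4) * 128         # annotation boxes (M, cx, cy, w, l)

dists = torch.cdist(pred_centers, gt_boxes[:, :2])  # (K, M) center distances
nearest = dists.argmin(dim=1)                       # nearest annotation box
first_diff = (pred_centers - gt_boxes[nearest, :2]).abs().mean()
```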
Step S405: adjusting the network parameters of the neural network model based on the first difference and the second difference to obtain the target detection model.
Specifically, a loss value produced when the neural network model performs target detection may be calculated from the first difference and the second difference; the network parameters of the neural network model are adjusted using the loss value, iterative training continues with the adjusted parameters, and training is completed once a preset training end condition is met, yielding the trained target detection model. The preset training end condition may be that the loss value is smaller than a preset value, that a preset number of training iterations has been reached, or the like.
The loss value may be calculated using a loss function, for example a mean square error loss function or a cross-entropy loss function; the embodiments of the present disclosure do not limit the specific manner of calculating the loss value.
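As a sketch under stated assumptions (L1 for the box term, mean square error for the heat-map term, and a weighting coefficient), the two differences might be combined as follows; the disclosure leaves the concrete loss function open:

```python
import torch
import torch.nn.functional as F

# Sketch: combine the two differences into a single training loss.
def detection_loss(pred_boxes, gt_boxes, pred_hm, gt_hm, lambda_hm=1.0):
    first_diff = F.l1_loss(pred_boxes, gt_boxes)  # detection-frame term
    second_diff = F.mse_loss(pred_hm, gt_hm)      # heat-map term
    return first_diff + lambda_hm * second_diff

pred_boxes = torch.rand(100, 4, requires_grad=True)
pred_hm = torch.rand(128, 128, requires_grad=True)
loss = detection_loss(pred_boxes, torch.rand(100, 4), pred_hm, torch.rand(128, 128))
loss.backward()  # gradients would drive the parameter adjustment above
```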
As can be seen from the above, when the scheme provided by the embodiments of the present disclosure is applied to model training, sample images and annotation boxes of the sample targets in the obtained sample images are obtained, along with sample BEV features, corresponding to the BEV scene in which the vehicle is located, of the acquired sample images. The sample BEV features can be input into a preset neural network model for target detection to obtain the target detection result output by the model, and the network parameters of the model can then be adjusted according to the first difference between each sample initial detection frame and the annotation boxes in the obtained sample images and the second difference between the heat map and the sample heat map generated from those annotation boxes, yielding the target detection model. Training the neural network model with the annotation boxes of the sample targets as training labels thus lets the model learn the relationship between BEV features and sample targets, producing a model capable of target detection.
In addition, during training the sample initial detection frames are determined according to a heat map, where the response value at each position characterizes the probability that a sample target exists at that position in the BEV scene. The determined initial detection frames are therefore related to those probabilities, which avoids the drawbacks of completely random initialization, increases the probability that the determined sample initial detection frames cover the sample targets, allows target detection to be performed quickly and accurately based on them, and improves the model's convergence speed and training efficiency when the network parameters are subsequently adjusted based on the detection results.
A model training procedure provided in the embodiments of the present disclosure is generally described below with reference to fig. 5.
Referring to fig. 5, a schematic diagram of a model training process according to an embodiment of the present disclosure is provided.
The model training flow shown in fig. 5 is explained below through steps 1 to 6.
Step 1: feature extraction.
Feature extraction is performed on the sample images to obtain imgfeats (image features).
The annotation boxes of the targets are labeled in the sample images in advance.
The imgfeats are the two-dimensional image features described above, or may be multi-scale two-dimensional image features obtained by multi-scale transformation of the two-dimensional image features.
Step 2: feature view-angle conversion.
As shown in the figure, the feature view-angle conversion can be completed based on the preset spatial detection frames, yielding the sample BEV features in the BEV scene.
Step 3: heat map generation.
A heat map may be generated based on the sample BEV features.
Step 4: obtaining the second difference.
That is, the second difference between the heat map and the sample heat map generated from the annotation boxes in the sample images.
Step 5: determining the initial detection frames and performing target detection.
Step 6: obtaining the first difference.
That is, the first difference between each sample initial detection frame and the annotation boxes in the obtained sample images.
The embodiments of the above steps can be obtained on the basis of the embodiments of the target detection method, and therefore, only a brief description is given here.
As can be seen, the first difference and the second difference can be obtained through the above steps, so that the network parameters of the neural network model can be adjusted based on them and the target detection model obtained through training; the information output by the target detection model may include a 3D box representing the spatial position of the target and a class (target category).
Next, with reference to fig. 6, a second model training process is described, focusing on the manner in which the initial detection frames are determined based on the heat map during model training.
Referring to fig. 6, a schematic diagram of a second model training process according to an embodiment of the present disclosure is provided.
The model training process is described in 3 stages below.
Stage 1: generating a heat map based on the sample BEV features, and determining the Top-K initial detection frames with reference to the annotation boxes (gt boxes).
The specific value of Top K may be set by a worker.
Stage 2: a Linear Layer determines Q, K and V in the Transformer manner, based on the initial detection frames and imgfeats, where Q (query) represents the initial detection frames, K (key) represents the annotation boxes, and V (value) represents the image features corresponding to the annotation boxes.
Stage 3: optimizing the initial detection frames based on a multi-head self-attention mechanism and a multi-head cross-attention mechanism, and detecting the target position (3D box) and target class based on an FFN (feed-forward network).
The multi-head cross-attention is used to extract the BEV features corresponding to each initial detection frame, and the multi-head self-attention is used to adjust the weight of the influence of each initial detection frame's BEV features on the detection result.
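A compact sketch of one such decoder refinement layer, assuming PyTorch; the feature dimension, the 7-parameter 3D box encoding and the head structure are assumptions rather than the disclosure's exact architecture:

```python
import torch
import torch.nn as nn

# Sketch: one decoder refinement layer. Queries derived from the initial
# detection frames attend to each other (self-attention), then to the image
# features (cross-attention); an FFN and two heads regress 3D box parameters
# and class logits.
class DecoderLayer(nn.Module):
    def __init__(self, d=256, heads=8, num_classes=10):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.box_head = nn.Linear(d, 7)   # assumed encoding: x, y, z, w, l, h, yaw
        self.cls_head = nn.Linear(d, num_classes)

    def forward(self, queries, feats):
        q, _ = self.self_attn(queries, queries, queries)  # weigh queries jointly
        q, _ = self.cross_attn(q, feats, feats)           # pull per-frame features
        q = q + self.ffn(q)
        return self.box_head(q), self.cls_head(q)

layer = DecoderLayer()
boxes, classes = layer(torch.randn(1, 100, 256), torch.randn(1, 1200, 256))
```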
As can be seen from the foregoing embodiments, applying the scheme provided by the embodiments of the present disclosure helps increase the probability that the determined sample initial detection frames cover the sample targets, so that when the initial detection frames are optimized, the number of decoder layers can be reduced, reducing both the amount of computation and the difficulty of regression.
Corresponding to the target detection method, the embodiment of the disclosure also provides a target detection device.
Referring to fig. 7, a schematic structural diagram of a target detection apparatus provided by an embodiment of the present disclosure is shown; the apparatus includes the following modules 701 to 706:
an image obtaining module 701, configured to obtain images acquired by image acquisition devices with different acquisition view angles deployed on a target vehicle;
a two-dimensional image feature extraction module 702, configured to extract two-dimensional image features of the obtained images;
a scene BEV feature obtaining module 703, configured to perform view-angle transformation on the extracted two-dimensional image features to obtain scene BEV features, corresponding to the BEV scene in which the target vehicle is located, of the extracted two-dimensional image features;
a heat map prediction module 704, configured to predict a heat map corresponding to the BEV scene based on the obtained scene BEV features, where the response value at each position in the heat map characterizes the probability that a target exists at that position in the BEV scene;
an initial detection frame determining module 705, configured to determine an initial detection frame for target detection in the BEV scene according to the heat map;
and a target detection module 706, configured to perform target detection based on the determined initial detection frame and the obtained images.
As can be seen from the above, when the scheme provided by the embodiment of the present disclosure is applied to target detection, two-dimensional image features can be extracted from the images acquired by the image acquisition devices deployed in the target vehicle at different acquisition view angles, and view angle transformation can be performed on the extracted two-dimensional image features to obtain the scene BEV features corresponding to the BEV scene where the target vehicle is located. A thermodynamic diagram corresponding to the BEV scene can then be predicted based on the obtained scene BEV features, an initial detection frame can be determined according to the thermodynamic diagram, and target detection can be performed based on the determined initial detection frame and the obtained images.
In addition, in the solution provided by the embodiments of the present disclosure, the initial detection frame is determined based on a thermodynamic diagram, where the response value of each position in the thermodynamic diagram characterizes the probability that a target exists at that position in the BEV scene. Compared with randomly selecting the initial detection frame, determining it from the probability that a target exists at each position avoids the drawbacks of a completely random choice, reduces blindness in the selection process, and improves the probability that the determined initial detection frame covers the target, so that target detection can be performed quickly and accurately based on the initial detection frame, improving both the efficiency and the accuracy of target detection.
In addition, because the scheme provided by the embodiment of the present disclosure improves the probability that the initial detection frame covers the target, the optimization amplitude and difficulty required when the initial detection frame is subsequently optimized are smaller than for a randomly selected initial detection frame, which is beneficial to reducing the amount of calculation required for target detection and further improving its efficiency.
In an embodiment of the present disclosure, the initial detection frame determining module 705 is specifically configured to determine a preset number of target positions with the highest response values in the thermodynamic diagram; based on the determined target location, an initial detection box for target detection in the BEV scene is determined.
In this way, the initial detection frame for target detection in the BEV scene is determined based on the preset number of target positions with the highest response values, that is, according to the regions of the BEV scene where a target is most likely to exist, which is beneficial to improving the accuracy of the determined initial detection frame.
In an embodiment of the present disclosure, the scene BEV feature obtaining module 703 includes:
an image area determining submodule, configured to determine, based on the device internal parameters of each image acquisition device, the image areas in the obtained images corresponding to the spatial detection frames preset in the BEV scene where the target vehicle is located;
the two-dimensional area characteristic obtaining submodule is used for obtaining two-dimensional area characteristics corresponding to the determined image area based on the two-dimensional image characteristics of the obtained image;
the first BEV characteristic obtaining submodule is used for obtaining a first BEV characteristic corresponding to the space detection frame according to the obtained two-dimensional area characteristic;
and the scene BEV characteristic obtaining submodule is used for obtaining a scene BEV characteristic corresponding to the two-dimensional image characteristic in the BEV scene based on the first BEV characteristic.
As can be seen from the above, when the scheme provided by the embodiment of the present disclosure is applied to target detection, the image areas corresponding to the spatial detection frames preset in the BEV scene can be determined in the obtained images based on the device internal parameters of each image acquisition device, the first BEV feature corresponding to each spatial detection frame can be obtained based on the two-dimensional area features of the determined image areas, and the scene BEV features corresponding to the two-dimensional image features in the BEV scene can then be obtained from the first BEV features. In this way, the first BEV feature corresponding to a spatial detection frame can be accurately determined from the correspondence between the spatial detection frame and the image areas in the obtained images, so that the scene BEV features can be obtained accurately and conveniently from the first BEV features of the spatial detection frames. A sketch of this projection-and-sampling step is given below.
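A minimal sketch, assuming a pinhole camera model, a single camera, and a spatial detection frame whose centre is already expressed in the camera coordinate system (extrinsics applied); sampling a single point rather than the whole box region is an illustrative simplification, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def first_bev_feature(feat2d: torch.Tensor, K: torch.Tensor, pt_cam: torch.Tensor,
                      stride: float = 1.0) -> torch.Tensor:
    """feat2d: (1, C, Hf, Wf) 2D image features; K: (3, 3) camera intrinsics;
    pt_cam: (3,) frame centre in camera coordinates (z > 0);
    stride: downsampling factor between the image and feat2d."""
    uvw = K @ pt_cam                      # pinhole projection: (u*z, v*z, z)
    u = uvw[0] / uvw[2] / stride          # pixel column -> feature-map column
    v = uvw[1] / uvw[2] / stride          # pixel row -> feature-map row
    _, _, hf, wf = feat2d.shape
    # grid_sample expects coordinates normalised to [-1, 1].
    grid = torch.stack([2 * u / (wf - 1) - 1, 2 * v / (hf - 1) - 1]).view(1, 1, 1, 2)
    return F.grid_sample(feat2d, grid, align_corners=True).view(-1)   # (C,)
```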
In an embodiment of the present disclosure, the image area determining sub-module is specifically configured to determine, for each obtained image, according to an apparatus internal reference of an image acquiring apparatus that acquires the image, a first image area in the image, which corresponds to a spatial detection frame preset in a BEV scene where the target vehicle is located, and a second image area in a first preset range of the first image area;
the two-dimensional region feature obtaining sub-module is specifically configured to determine, for each obtained image, according to a two-dimensional image feature of the image, a first region sub-feature of a first image region and a second region sub-feature of a second image region corresponding to the image, and perform weighted fusion on the determined first region sub-feature and second region sub-feature to obtain a two-dimensional region feature.
In this way, after the first image area and the second image area corresponding to the spatial detection frame are determined in each obtained image, the first region sub-feature of the first image area and the second region sub-feature of the second image area can be weighted and fused to obtain the two-dimensional region feature, so that the two-dimensional region feature comprehensively represents the features of multiple image areas, improving the comprehensiveness and accuracy of the obtained two-dimensional region feature. The fusion itself can be sketched as below.
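A toy sketch of the weighted fusion; the feature dimensionality and the fixed weights are assumptions, and the weights could equally well be learned parameters.

```python
import torch

first_sub = torch.randn(256)    # first-region sub-feature (projected area)
second_sub = torch.randn(256)   # second-region sub-feature (surrounding area)
w_first, w_second = 0.7, 0.3    # hypothetical fusion weights
two_dim_region_feature = w_first * first_sub + w_second * second_sub
```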
In an embodiment of the present disclosure, the target detection module 706 includes:
sub-modules that, for each determined initial detection frame, perform target detection as follows:
the spatial position obtaining submodule is used for obtaining the corresponding spatial position of the initial detection frame in the BEV scene based on the equipment internal parameters of each image acquisition equipment;
a second BEV feature determination sub-module configured to determine, based on the scene BEV feature, a second BEV feature corresponding to the obtained spatial position;
an initial detection frame updating sub-module, configured to update the initial detection frame based on the second BEV feature and trigger the second BEV feature determining sub-module if the second BEV feature is not the BEV feature of the characterization target;
and the target detection sub-module is used for detecting a detection target in the acquired image based on the second BEV characteristic and the obtained image if the second BEV characteristic is a BEV characteristic representing a target.
As can be seen from the above, when the scheme provided by the embodiment of the present disclosure is applied to target detection, the detection target in the acquired images can be detected in different ways depending on whether the second BEV feature is a BEV feature characterizing a target. Specifically, when the second BEV feature is not a BEV feature characterizing a target, the initial detection frame can be updated until the second BEV feature obtained based on it is one that characterizes a target, so that continuously updating and tuning the initial detection frame further improves the accuracy of target detection based on the corresponding second BEV feature. This update loop is sketched below.
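A hedged sketch of the loop, assuming a learned score decides whether the sampled second BEV feature characterizes a target and a learned delta updates the frame; the iteration cap, the threshold, and all names are assumptions.

```python
import torch
import torch.nn as nn

class FrameRefiner(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.score = nn.Linear(d, 1)   # does the feature characterize a target?
        self.delta = nn.Linear(d, 7)   # update for x, y, z, w, l, h, yaw

    def forward(self, frame, sample_feature, n_iters: int = 6, thresh: float = 0.5):
        """frame: (7,) initial detection frame; sample_feature: callable that
        returns the (d,) second BEV feature at the frame's spatial position."""
        for _ in range(n_iters):                      # bounded, e.g. by decoder depth
            feat = sample_feature(frame)
            if torch.sigmoid(self.score(feat)) > thresh:
                break                                 # feature characterizes a target
            frame = frame + self.delta(feat)          # update the frame and re-sample
        return frame, feat
```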
In an embodiment of the present disclosure, the spatial position obtaining sub-module is specifically configured to obtain, based on an apparatus internal parameter of each image acquisition apparatus, a first spatial position corresponding to the initial detection frame in the BEV scene and a second spatial position within a second preset range of the first spatial position; the second BEV feature determination sub-module is specifically configured to determine, based on the scene BEV feature, a first sub-BEV feature corresponding to the first spatial position and a second sub-BEV feature corresponding to the second spatial position, and perform weighted fusion on the determined first sub-BEV feature and the second sub-BEV feature to obtain a second BEV feature.
Therefore, after the first spatial position and the second spatial position corresponding to the spatial detection frame in the BEV scene are obtained for each spatial detection frame, the first sub-BEV feature corresponding to the first spatial position and the second sub-BEV feature corresponding to the second spatial position may be subjected to weighted fusion to obtain the second BEV feature, so that the second BEV feature may comprehensively represent BEV features of a plurality of spatial positions, and the comprehensiveness and accuracy of the obtained second BEV feature are improved.
Corresponding to the model training method, the embodiment of the disclosure also provides a model training device.
Referring to fig. 8, a schematic structural diagram of a model training apparatus provided in an embodiment of the present disclosure is shown, where the apparatus includes the following modules 801 to 805:
a sample image obtaining module 801, configured to obtain sample images collected by image collecting devices deployed in a vehicle and having different collecting view angles, and an annotation frame of a sample target in the obtained sample image;
a sample BEV feature obtaining module 802, configured to obtain the sample BEV features corresponding to the acquired sample images in the BEV scene where the vehicle is located;
a sample input module 803, configured to input the sample BEV characteristics into a preset neural network model for target detection, so as to obtain a target detection result output by the neural network model, where the neural network model includes: a thermodynamic diagram generation network layer for generating a thermodynamic diagram, wherein a response value of each position in the thermodynamic diagram represents a probability that a sample target exists at each position in the BEV scene, the target detection result is obtained according to each sample initial detection frame, and each sample initial detection frame is determined based on the thermodynamic diagram;
a difference obtaining module 804, configured to obtain a first difference between each sample initial detection frame and an annotation frame in the obtained sample image, and a second difference between the thermodynamic diagram and a sample thermodynamic diagram generated based on the annotation frame in the obtained sample image;
a parameter adjusting module 805, configured to perform network parameter adjustment on the neural network model based on the first difference and the second difference, so as to obtain a target detection model.
As can be seen from the above, when the scheme provided by the embodiment of the present disclosure is applied to model training, the sample images and the annotation frames of the sample targets in the obtained sample images are obtained, and the sample BEV features corresponding to the acquired sample images in the BEV scene where the vehicle is located are obtained, so that the sample BEV features can be input into a preset neural network model for target detection to obtain the target detection result output by the neural network model. The network parameters of the neural network model can then be adjusted according to the first difference between each sample initial detection frame and the annotation frames in the obtained sample images and the second difference between the thermodynamic diagram and a sample thermodynamic diagram generated based on those annotation frames, so as to obtain the target detection model. In this way, the annotation frames of the sample targets serve as training labels for the neural network model, so that the model can learn the relation between the BEV features and the sample targets, and a model capable of target detection is obtained through training.
In addition, during training the sample initial detection frames are determined according to a thermodynamic diagram, where the response value of each position in the thermodynamic diagram characterizes the probability that a sample target exists at that position in the BEV scene. The determined initial detection frames are therefore related to the probability of a sample target at each position, which avoids the drawbacks of completely random initialization, improves the probability that the determined sample initial detection frames cover the sample targets, helps target detection proceed quickly and accurately based on the initial detection frames, and improves the convergence speed and training efficiency when the network parameters of the model are subsequently adjusted based on the detection results. A sketch of generating the sample thermodynamic diagram and combining the two differences into one parameter update is given below.
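A sketch of how such a sample thermodynamic diagram could be rendered (a CenterNet-style Gaussian splat at each annotation-frame centre) and how the two differences could drive one parameter update; the Gaussian radius, the L1/MSE loss choices, and the equal weighting are assumptions, not necessarily those of the present disclosure.

```python
import torch
import torch.nn.functional as F

def draw_sample_heatmap(centers: torch.Tensor, h: int, w: int, sigma: float = 2.0):
    """centers: (N, 2) annotation-frame centres in BEV grid coordinates."""
    ys, xs = torch.meshgrid(torch.arange(h).float(), torch.arange(w).float(),
                            indexing="ij")
    heat = torch.zeros(h, w)
    for cx, cy in centers:
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heat = torch.maximum(heat, g)     # keep the strongest response per cell
    return heat

def training_step(optimizer, pred_boxes, gt_boxes, pred_heat, sample_heat):
    """pred_boxes/gt_boxes: (K, 7); pred_heat/sample_heat: (H, W)."""
    first_diff = F.l1_loss(pred_boxes, gt_boxes)       # first difference (boxes)
    second_diff = F.mse_loss(pred_heat, sample_heat)   # second difference (heatmaps)
    loss = first_diff + second_diff
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```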
In an embodiment of the present disclosure, the difference obtaining module 804 is specifically configured to determine, for each sample initial detection frame, the target annotation frame closest to it, and to obtain the first difference between each initial detection frame and its corresponding target annotation frame as well as the second difference between the thermodynamic diagram and a sample thermodynamic diagram generated based on the annotation frames in the obtained sample images.
Because the sample initial detection frames in the embodiment of the present disclosure are determined based on the thermodynamic diagram and therefore cover the sample targets with higher probability, determining the first difference based on the target annotation frame closest to each sample initial detection frame, rather than on the Hungarian matching algorithm, is beneficial to reducing noise and reducing the offset produced in the model's target detection. The nearest matching can be sketched as below.
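A minimal sketch of this nearest matching; the centre-distance metric and the shapes are assumptions.

```python
import torch

def nearest_annotation_match(init_centers: torch.Tensor, gt_centers: torch.Tensor):
    """init_centers: (K, 2) centres of the sample initial detection frames;
    gt_centers: (M, 2) centres of the annotation frames, both in BEV coordinates."""
    dists = torch.cdist(init_centers, gt_centers)   # (K, M) pairwise distances
    return dists.argmin(dim=1)                      # index of the closest gt per frame

# The first difference is then taken against the matched frames, e.g.:
#   idx = nearest_annotation_match(init_centers, gt_centers)
#   first_diff = (init_boxes - gt_boxes[idx]).abs().mean()
```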
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described object detection method or model training method.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described object detection method or model training method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above object detection method or model training method.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901 which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the various methods and processes described above, such as the target detection method or the model training method. For example, in some embodiments, the target detection method or the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the target detection method or the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the target detection method or the model training method by any other suitable means (e.g., by way of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of target detection, comprising:
acquiring images acquired by image acquisition equipment of different acquisition visual angles deployed in a target vehicle;
extracting two-dimensional image features of the obtained image;
performing visual angle transformation on the extracted two-dimensional image features to obtain scene BEV features corresponding to the extracted two-dimensional image features in the aerial view BEV scene where the target vehicle is located;
predicting a thermodynamic diagram corresponding to the BEV scene based on the obtained scene BEV characteristics, wherein the response value of each position in the thermodynamic diagram represents the probability of the existence of the target at each position in the BEV scene;
determining an initial detection frame for target detection in the BEV scene according to the thermodynamic diagram;
and performing target detection based on the determined initial detection frame and the obtained image.
2. The method of claim 1, wherein the determining an initial detection box for target detection in the BEV scene from the thermodynamic diagram comprises:
determining a preset number of target positions with highest response values in the thermodynamic diagram;
based on the determined target location, an initial detection box for target detection in the BEV scene is determined.
3. The method according to claim 1, wherein the performing perspective transformation on the extracted two-dimensional image features to obtain scene BEV features corresponding to the two-dimensional image features at a bird's eye view BEV scene where the target vehicle is located comprises:
determining image areas corresponding to the preset space detection frames in the BEV scene where the target vehicle is located in the obtained images based on the equipment internal parameters of the image acquisition equipment;
obtaining two-dimensional area characteristics corresponding to the determined image area based on the two-dimensional image characteristics of the obtained image;
according to the obtained two-dimensional region characteristics, first BEV characteristics corresponding to the space detection frame are obtained;
and obtaining scene BEV characteristics corresponding to the two-dimensional image characteristics in the BEV scene based on the first BEV characteristics.
4. The method of claim 3, wherein,
the determining, based on the device internal reference of each image acquisition device, an image area corresponding to a preset spatial detection frame in the BEV scene where the target vehicle is located in each obtained image includes:
for each obtained image, according to the device internal parameters of the image acquisition device acquiring the image, determining a first image area corresponding to a spatial detection frame preset in a BEV scene where the target vehicle is located in the image and a second image area within a first preset range of the first image area;
the obtaining of the two-dimensional region feature corresponding to the determined image region based on the two-dimensional image feature of the obtained image includes:
and for each obtained image, determining a first area sub-feature of a first image area and a second area sub-feature of a second image area corresponding to the image according to the two-dimensional image feature of the image, and performing weighted fusion on the determined first area sub-feature and second area sub-feature to obtain the two-dimensional area feature.
5. The method of any of claims 1-4, wherein the performing target detection based on the determined initial detection frame and the obtained image comprises:
and aiming at each determined initial detection frame, carrying out target detection in the following way:
acquiring a corresponding spatial position of the initial detection frame in the BEV scene based on equipment internal parameters of each image acquisition equipment;
determining a second BEV feature corresponding to the obtained spatial position based on the scene BEV feature;
if the second BEV feature is not the BEV feature of the characterization target, updating the initial detection frame based on the second BEV feature, and returning to the step of obtaining the corresponding spatial position of the initial detection frame in the BEV scene;
detecting a detection target in the captured image based on the second BEV feature and the obtained image if the second BEV feature is a BEV feature characterizing the target.
6. The method of claim 5, wherein,
the obtaining of the corresponding spatial position of the initial detection frame in the BEV scene based on the device internal parameters of each image acquisition device includes:
obtaining a first spatial position corresponding to the initial detection frame in the BEV scene and a second spatial position within a second preset range of the first spatial position based on equipment internal parameters of each image acquisition equipment;
the determining, based on the scene BEV features, second BEV features corresponding to the obtained spatial locations includes:
and determining a first sub BEV feature corresponding to the first space position and a second sub BEV feature corresponding to the second space position based on the scene BEV features, and performing weighted fusion on the determined first sub BEV feature and the second sub BEV feature to obtain a second BEV feature.
7. A model training method, comprising:
acquiring sample images acquired by image acquisition equipment at different acquisition visual angles deployed in a vehicle and an annotation frame of a sample target in the acquired sample images;
obtaining sample BEV characteristics corresponding to the BEV scene where the vehicle is located of the collected sample images;
inputting the sample BEV characteristics into a preset neural network model for target detection to obtain a target detection result output by the neural network model, wherein the neural network model comprises: a network layer for generating a thermodynamic diagram, wherein a response value of each position in the thermodynamic diagram represents a probability that a sample target exists at each position in the BEV scene, the target detection result is obtained according to each sample initial detection frame, and each sample initial detection frame is determined based on the thermodynamic diagram;
obtaining a first difference between each sample initialization detection frame and an annotation frame in the obtained sample image and a second difference between the thermodynamic diagram and a sample thermodynamic diagram generated based on the annotation frame in the obtained sample image;
and adjusting network parameters of the neural network model based on the first difference and the second difference to obtain a target detection model.
8. The method of claim 7, wherein obtaining a first difference between each sample initialization detection box and an annotation box in the obtained sample image comprises:
determining, for each initial detection frame, the target annotation frame closest to it;
and obtaining a first difference between each initial detection frame and the corresponding target annotation frame.
9. An object detection device comprising:
the image acquisition module is used for acquiring images acquired by image acquisition equipment with different acquisition visual angles deployed in the target vehicle;
the two-dimensional image feature extraction module is used for extracting two-dimensional image features of the obtained image;
the scene BEV characteristic obtaining module is used for carrying out visual angle transformation on the extracted two-dimensional image characteristics to obtain scene BEV characteristics corresponding to the BEV scene of the target vehicle of the extracted two-dimensional image characteristics;
the thermodynamic diagram prediction module is used for predicting a thermodynamic diagram corresponding to the BEV scene based on the obtained BEV characteristics, wherein the response value of each position in the thermodynamic diagram represents the probability of the existence of the target at each position in the BEV scene;
the initial detection frame determining module is used for determining an initial detection frame for target detection in the BEV scene according to the thermodynamic diagram;
and the target detection module is used for carrying out target detection based on the determined initial detection frame and the obtained image.
10. The apparatus of claim 9, wherein,
the initial detection frame determining module is specifically configured to determine a preset number of target positions with the highest response values in the thermodynamic diagram; based on the determined target location, an initial detection box for target detection in the BEV scene is determined.
11. The apparatus of claim 9, wherein the scene BEV feature obtaining module comprises:
the image area determining submodule is used for determining image areas corresponding to the preset space detection frames in the obtained images in the BEV scene where the target vehicle is located based on the device internal parameters of the image acquisition devices;
the two-dimensional region characteristic obtaining submodule is used for obtaining two-dimensional region characteristics corresponding to the determined image region based on the two-dimensional image characteristics of the obtained image;
the first BEV feature obtaining submodule is used for obtaining a first BEV feature corresponding to the space detection frame according to the obtained two-dimensional area feature;
and the scene BEV characteristic obtaining submodule is used for obtaining a scene BEV characteristic corresponding to the two-dimensional image characteristic in the BEV scene based on the first BEV characteristic.
12. The apparatus of claim 11, wherein,
the image area determining submodule is specifically configured to determine, for each obtained image, a first image area corresponding to a spatial detection frame preset in a BEV scene where the target vehicle is located in the image and a second image area within a first preset range of the first image area according to an equipment internal parameter of an image acquisition device acquiring the image;
the two-dimensional region feature obtaining sub-module is specifically configured to determine, for each obtained image, a first region sub-feature of a first image region and a second region sub-feature of a second image region corresponding to the image according to the two-dimensional image feature of the image, and perform weighted fusion on the determined first region sub-feature and second region sub-feature to obtain a two-dimensional region feature.
13. The apparatus of any of claims 9-12, wherein the object detection module comprises:
and aiming at each determined initial detection frame, carrying out target detection according to the following sub-modules:
the spatial position obtaining sub-module is used for obtaining the corresponding spatial position of the initial detection frame in the BEV scene based on the equipment internal parameters of each image acquisition equipment;
a second BEV feature determination submodule, configured to determine, based on the scene BEV feature, a second BEV feature corresponding to the obtained spatial position;
an initial detection frame updating sub-module, configured to update the initial detection frame based on the second BEV feature and trigger the second BEV feature determining sub-module if the second BEV feature is not the BEV feature of the characterization target;
a target detection sub-module to detect a detection target in the captured image based on the second BEV feature and the obtained image if the second BEV feature is a BEV feature characterizing the target.
14. The apparatus of claim 13, wherein,
the spatial position obtaining submodule is specifically configured to obtain, based on device internal parameters of each image acquisition device, a first spatial position corresponding to the initial detection frame in the BEV scene and a second spatial position within a second preset range of the first spatial position;
the second BEV feature determination sub-module is specifically configured to determine, based on the scene BEV feature, a first sub-BEV feature corresponding to the first spatial position and a second sub-BEV feature corresponding to the second spatial position, and perform weighted fusion on the determined first sub-BEV feature and the second sub-BEV feature to obtain a second BEV feature.
15. A model training apparatus comprising:
a sample image obtaining module, configured to obtain sample images collected by image acquisition devices with different acquisition view angles deployed in a vehicle, and annotation frames of sample targets in the obtained sample images;
the sample BEV characteristic obtaining module is used for obtaining sample BEV characteristics corresponding to the BEV scene where the collected sample images are located in the vehicle;
a sample input module, configured to input the sample BEV characteristics into a preset neural network model for target detection, so as to obtain a target detection result output by the neural network model, where the neural network model includes: a thermodynamic diagram generation network layer for generating a thermodynamic diagram, wherein a response value of each position in the thermodynamic diagram represents a probability that a sample target exists at each position in the BEV scene, the target detection result is obtained according to each sample initial detection frame, and each sample initial detection frame is determined based on the thermodynamic diagram;
the difference obtaining module is used for obtaining a first difference between each sample initialization detection frame and an annotation frame in the obtained sample image and a second difference between the thermodynamic diagram and a sample thermodynamic diagram generated based on the annotation frame in the obtained sample image;
and the parameter adjusting module is used for adjusting network parameters of the neural network model based on the first difference and the second difference to obtain a target detection model.
16. The apparatus of claim 15, wherein,
the difference obtaining module is specifically configured to determine, for each initial detection frame, the target annotation frame closest to it, and to obtain a first difference between each initial detection frame and the corresponding target annotation frame and a second difference between the thermodynamic diagram and a sample thermodynamic diagram generated based on the annotation frames in the obtained sample images.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6 or 7-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-6 or 7-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6 or 7-8.
CN202211675436.1A 2022-12-26 2022-12-26 Target detection and model training method, device, equipment and storage medium Pending CN115909253A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211675436.1A CN115909253A (en) 2022-12-26 2022-12-26 Target detection and model training method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115909253A true CN115909253A (en) 2023-04-04

Family

ID=86472762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211675436.1A Pending CN115909253A (en) 2022-12-26 2022-12-26 Target detection and model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115909253A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173014A (en) * 2023-07-31 2023-12-05 零束科技有限公司 Method and device for synthesizing 3D target in BEV image


Similar Documents

Publication Publication Date Title
US11037305B2 (en) Method and apparatus for processing point cloud data
CN113902897B (en) Training of target detection model, target detection method, device, equipment and medium
CN113936256A (en) Image target detection method, device, equipment and storage medium
CN110632608B (en) Target detection method and device based on laser point cloud
CN113378760A (en) Training target detection model and method and device for detecting target
CN111402161B (en) Denoising method, device, equipment and storage medium for point cloud obstacle
CN113096181B (en) Method and device for determining equipment pose, storage medium and electronic device
CN111402160A (en) Point cloud data denoising method, device, equipment and storage medium
CN112487979A (en) Target detection method, model training method, device, electronic device and medium
CN111354022A (en) Target tracking method and system based on kernel correlation filtering
CN114037966A (en) High-precision map feature extraction method, device, medium and electronic equipment
CN115909253A (en) Target detection and model training method, device, equipment and storage medium
CN113313765B (en) Positioning method, positioning device, electronic equipment and storage medium
CN116052097A (en) Map element detection method and device, electronic equipment and storage medium
CN115995075A (en) Vehicle self-adaptive navigation method and device, electronic equipment and storage medium
CN113920273B (en) Image processing method, device, electronic equipment and storage medium
CN115719436A (en) Model training method, target detection method, device, equipment and storage medium
CN115147809A (en) Obstacle detection method, device, equipment and storage medium
CN115439692A (en) Image processing method and device, electronic equipment and medium
CN113569912A (en) Vehicle identification method and device, electronic equipment and storage medium
CN113313764B (en) Positioning method, positioning device, electronic equipment and storage medium
CN115049895B (en) Image attribute identification method, attribute identification model training method and device
CN117746417A (en) Target detection model construction method, target detection method and related device
US20230142243A1 (en) Device environment identification method and apparatus, electronic device, and autonomous vehicle
CN117611670A (en) Building height estimation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination