CN113205515A - Target detection method, device and computer storage medium - Google Patents

Target detection method, device and computer storage medium

Info

Publication number
CN113205515A
Authority
CN
China
Prior art keywords: features, point cloud, feature, target, image
Prior art date
Legal status
Granted
Application number
CN202110585764.1A
Other languages
Chinese (zh)
Other versions
CN113205515B (en)
Inventor
张泽瀚
罗兵华
Current Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Original Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Goldway Intelligent Transportation System Co Ltd filed Critical Shanghai Goldway Intelligent Transportation System Co Ltd
Priority to CN202110585764.1A
Publication of CN113205515A
Application granted
Publication of CN113205515B
Legal status: Active

Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • G06N 3/045: Neural networks; combinations of networks
    • G06T 3/067: Reshaping or unfolding 3D tree structures onto 2D planes
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/10032: Image acquisition modality; satellite or aerial image; remote sensing
    • G06T 2207/20221: Image combination; image fusion; image merging
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The embodiments of this application disclose a target detection method, a target detection device, and a computer storage medium, belonging to the technical field of artificial intelligence. In the embodiments, a point cloud semantic thermodynamic diagram that indicates the approximate distribution of the targets to be detected is predicted from point cloud data collected for a detection area. Feature fusion is then performed on the point cloud semantic thermodynamic diagram and the point cloud features to obtain adaptive fusion features, and targets are detected from the adaptive fusion features. In other words, a coarse prediction of the targets to be detected is first made from the point cloud data to obtain prior knowledge about those targets, and the targets are then detected based on the predicted prior knowledge together with the precise point cloud features. The detection process thus performs a coarse detection of the targets first and an accurate detection afterwards, which improves the accuracy of the detected targets.

Description

Target detection method, device and computer storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a target detection method, a target detection device and a computer storage medium.
Background
At present, conventional 2D (two-dimensional) target detection, such as image recognition, can only detect the specific category of a target and its position in the current field of view. As a result, in artificial-intelligence fields such as intelligent transportation, 2D target detection cannot provide all the information required to perceive the surrounding environment, and 3D target detection has therefore received more and more attention. 3D target detection refers to detecting not only the category of a target and its position in the current field of view, but also information such as the target's length, width, and rotation angle in three-dimensional space. The current difficulty of 3D target detection lies in how to improve the accuracy of the detected targets, that is, how to make the information of the detected targets match the targets in the actual environment as closely as possible.
Disclosure of Invention
The embodiments of this application provide a target detection method, a target detection device, and a computer storage medium, which can improve the accuracy of targets detected in 3D target detection and thus provide strong data support for technologies such as intelligent transportation. The technical solution is as follows:
in one aspect, a target detection method is provided, and the method includes:
determining features in a bird's-eye view of a detection area based on point cloud data collected for the detection area to obtain point cloud features, wherein the bird's-eye view indicates an image obtained by projecting the three-dimensional environment indicated by the point cloud data into a two-dimensional space, and the point cloud data comprises three-dimensional position information of each position point in the detection area onto which a laser beam emitted by a laser radar is projected;
predicting a point cloud semantic thermodynamic diagram based on the point cloud features, the point cloud semantic thermodynamic diagram indicating distribution of suspected objects in a three-dimensional environment indicated by the point cloud data;
determining an adaptive fusion feature based on the point cloud feature and the point cloud semantic thermodynamic diagram;
and carrying out target detection based on the self-adaptive fusion characteristics to acquire the information of the target in the detection area.
In one possible implementation, the method further includes:
determining features in the camera image based on the camera image acquired for the detection area, resulting in image features;
predicting an image semantic thermodynamic diagram based on the image features, the image semantic thermodynamic diagram indicating a distribution of suspected objects in the camera image;
determining an adaptive fusion feature based on one or both of the point cloud semantic thermodynamic diagram and the point cloud features, and one or both of the image semantic thermodynamic diagram and the image features;
and carrying out target detection based on the self-adaptive fusion characteristics to acquire the information of the target in the detection area.
In one possible implementation, the determining an adaptive fusion feature based on the point cloud semantic thermodynamic diagram and the point cloud feature, and the image semantic thermodynamic diagram and the image feature, comprises:
taking the features in the point cloud semantic thermodynamic diagram, the features in the image semantic thermodynamic diagram, and the point cloud features each as one channel feature to obtain three channel features, and cascading the three channel features to obtain an initial feature;
obtaining a global context feature based on the initial feature, wherein the global context feature indicates the relevance between different channel features in the three channel features;
and superposing the global context feature and the initial feature to obtain the self-adaptive fusion feature.
In a possible implementation manner, the obtaining a global context feature based on the initial feature includes:
acquiring attention weight of each channel feature in the initial features, wherein the attention weight indicates the importance degree of each channel feature in the process of detecting the target;
and multiplying the attention weight of each channel feature in the initial features by the corresponding channel feature to obtain the global context feature.
In one possible implementation, before the overlaying the global context feature and the initial feature, the method further includes:
performing feature conversion on the global context features to extract depth features in the global context features to obtain converted global context features;
the obtaining the adaptive fusion feature by superposing the global context feature and the initial feature includes:
and superposing the converted global context characteristic and the initial characteristic to obtain the self-adaptive fusion characteristic.
In one possible implementation, the predicting a point cloud semantic thermodynamic diagram based on the point cloud features includes:
determining the point cloud semantic thermodynamic diagram through a first thermodynamic diagram prediction model based on the point cloud features.
In one possible implementation, the method further includes:
obtaining a plurality of sample aerial views and label information for each of the plurality of sample aerial views, the label information of each sample aerial view indicating position information of targets in the respective sample aerial view;
obtaining features in each of the plurality of sample aerial views;
training a first initialization model based on the features in each sample aerial view in the plurality of sample aerial views and the label information of each sample aerial view to obtain the first thermodynamic diagram prediction model.
In one possible implementation, the determining an image semantic thermodynamic diagram based on the image features includes:
determining the image semantic thermodynamic diagram through a second thermodynamic prediction model based on the image features.
In one possible implementation, the method further includes:
obtaining a plurality of sample camera images and label information for each of the plurality of sample camera images, the label information of each sample camera image indicating position information of targets in the respective sample camera image;
obtaining features in each of the plurality of sample camera images;
training a second initialization model based on features in each sample camera image in the plurality of sample camera images and the label information of each sample camera image to obtain the second thermodynamic diagram prediction model.
In another aspect, an object detecting apparatus is provided, the apparatus including:
a first determining module, configured to determine features in a bird's-eye view of a detection area based on point cloud data collected for the detection area to obtain point cloud features, wherein the bird's-eye view indicates an image obtained by projecting the three-dimensional environment indicated by the point cloud data into a two-dimensional space, and the point cloud data comprises three-dimensional position information of each position point in the detection area onto which a laser beam emitted by a laser radar is projected;
a first prediction module, configured to predict a point cloud semantic thermodynamic diagram based on the point cloud features, where the point cloud semantic thermodynamic diagram indicates a distribution of suspected objects in a three-dimensional environment indicated by the point cloud data;
a second determination module for determining an adaptive fusion feature based on the point cloud feature and the point cloud semantic thermodynamic diagram;
and the detection module is used for carrying out target detection based on the self-adaptive fusion characteristics so as to acquire the information of the target in the detection area.
In one possible implementation, the apparatus further includes:
a third determining module, configured to determine, based on a camera image acquired for the detection area, a feature in the camera image, so as to obtain an image feature;
a second prediction module, configured to predict an image semantic thermodynamic diagram based on the image features, where the image semantic thermodynamic diagram indicates a distribution of suspected objects in the camera image;
a fourth determining module, configured to determine an adaptive fusion feature based on one or both of the point cloud semantic thermodynamic diagram and the point cloud features, and one or both of the image semantic thermodynamic diagram and the image features;
the detection module is further configured to perform target detection based on the adaptive fusion features to obtain information of a target in the detection area.
In one possible implementation manner, the fourth determining module is configured to:
taking the features in the point cloud semantic thermodynamic diagram, the features in the image semantic thermodynamic diagram, and the point cloud features each as one channel feature to obtain three channel features, and cascading the three channel features to obtain an initial feature;
obtaining a global context feature based on the initial feature, wherein the global context feature indicates the relevance between different channel features in the three channel features;
and superposing the global context feature and the initial feature to obtain the self-adaptive fusion feature.
In one possible implementation manner, the fourth determining module is configured to:
acquiring attention weight of each channel feature in the initial features, wherein the attention weight indicates the importance degree of each channel feature in the process of detecting the target;
and multiplying the attention weight of each channel feature in the initial features by the corresponding channel feature to obtain the global context feature.
In one possible implementation manner, the fourth determining module is configured to:
performing feature conversion on the global context features to extract depth features in the global context features to obtain converted global context features;
and superposing the converted global context characteristic and the initial characteristic to obtain the self-adaptive fusion characteristic.
In one possible implementation, the first prediction module is configured to:
determining the point cloud semantic thermodynamic diagram through a first thermodynamic diagram prediction model based on the point cloud features.
In one possible implementation, the apparatus further includes:
a first acquisition module, configured to acquire a plurality of sample aerial views and label information for each of the plurality of sample aerial views, the label information of each sample aerial view indicating position information of targets in the respective sample aerial view;
the first acquisition module is further used for acquiring features in each sample aerial view in the plurality of sample aerial views;
and the first training module is used for training the first initialization model based on the features in each sample aerial view in the plurality of sample aerial views and the label information of each sample aerial view to obtain the first thermodynamic prediction model.
In one possible implementation, the second prediction module is configured to:
determining the image semantic thermodynamic diagram through a second thermodynamic prediction model based on the image features.
In one possible implementation, the apparatus further includes:
a second acquisition module, configured to acquire a plurality of sample camera images and label information for each of the plurality of sample camera images, the label information of each sample camera image indicating position information of targets in the respective sample camera image;
the second acquisition module is further configured to acquire features in each of the plurality of sample camera images;
and the second training module is used for training a second initialization model based on the features in each sample camera image in the plurality of sample camera images and the label information of each sample camera image to obtain the second thermodynamic diagram prediction model.
In another aspect, an object detecting apparatus is provided, the apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform any of the above-described object detection methods.
In another aspect, a computer-readable storage medium is provided, having instructions stored thereon, which when executed by a processor, implement any of the above-mentioned object detection methods.
In another aspect, a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the steps of the object detection method described above is provided.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
In the embodiments of this application, a point cloud semantic thermodynamic diagram that indicates the approximate distribution of the targets to be detected is predicted from point cloud data collected for a detection area. Feature fusion is then performed on the point cloud semantic thermodynamic diagram and the point cloud features to obtain adaptive fusion features, and targets are detected from the adaptive fusion features. In other words, a coarse prediction of the targets to be detected is first made from the point cloud data to obtain prior knowledge about those targets, and the targets are then detected based on the predicted prior knowledge together with the precise point cloud features. The detection process thus performs a coarse detection of the targets first and an accurate detection afterwards, which improves the accuracy of the detected targets.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of an object detection system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an architecture of another object detection system provided in an embodiment of the present application;
fig. 3 is a flowchart of a target detection method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a method for obtaining point cloud features according to an embodiment of the present disclosure;
fig. 5 is a flowchart of a target detection method provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a second convolutional network provided in an embodiment of the present application;
fig. 7 is a schematic flow chart of a fused feature for example four provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
It should be understood that reference herein to "a plurality" means two or more. In the description of this application, "/" indicates an "or" relationship; for example, A/B may indicate A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, to describe the technical solutions of the embodiments clearly, terms such as "first" and "second" are used to distinguish between identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that the terms "first", "second", and the like do not limit quantity or order of execution, nor do they indicate any difference in importance.
Before explaining the embodiments of the present application in detail, an application scenario related to the embodiments of the present application will be explained.
At present, in artificial-intelligence fields such as intelligent transportation, the environment around an unmanned vehicle needs to be perceived in advance in order to accurately plan its driving route. Perceiving the environment around the unmanned vehicle specifically means determining the concrete three-dimensional spatial information of each obstacle around the vehicle, such as buildings and green belts, so that a virtual space identical to the actual environment can be simulated based on that information, and the driving route of the unmanned vehicle can then be planned in the virtual space.
Obviously, in the field of intelligent transportation, in order to improve the driving safety of the unmanned vehicle, it is necessary to ensure that the simulated virtual space is as identical as possible to the actual environment. In order to achieve that the simulated virtual space is as identical as possible to the actual environment, accurate 3D detection of the target of the environment around the unmanned vehicle is required. The target related to the embodiment of the present application refers to an object that needs to be focused when performing 3D target detection, where the object that needs to be focused changes with business requirements, and the embodiment of the present application does not limit what the target specifically refers to. For example, in the field of intelligent transportation, the targets may include pedestrians, vehicles, and surrounding buildings on the road, and so on.
It should be noted that the foregoing intelligent transportation field is only an application scenario of an example of the target detection method provided in the embodiment of the present application, and the embodiment of the present application does not limit a specific application scenario of the target detection method. For example, the present invention can be applied to virtual character detection in a virtual scene of a game, and the like.
In order to implement the target detection method provided by the embodiment of the present application, the embodiment of the present application further provides a target detection system, and for convenience of description in the following, the target detection system is explained first. Fig. 1 is a schematic structural diagram of an object detection system according to an embodiment of the present disclosure. As shown in fig. 1, the object detection system 100 includes: the system comprises a point cloud feature encoder 101, an image feature encoder 102, a point cloud semantic thermodynamic prediction module 103, an image semantic thermodynamic prediction module 104, an adaptive feature fusion module 105 and a 3D prediction module 106.
The point cloud feature encoder 101 is configured to obtain features in the point cloud data based on the point cloud data, that is, obtain point cloud features. The point cloud semantic thermodynamic diagram predicting module 103 is used for obtaining a point cloud semantic thermodynamic diagram based on point cloud feature prediction. The image feature encoder 102 is used to acquire features in the camera image based on the camera image, i.e., camera image features. The image semantic thermodynamic diagram prediction module 104 is used for obtaining an image semantic thermodynamic diagram based on the camera image feature prediction. The adaptive feature fusion module 105 is used to determine the adaptive fusion features. The 3D prediction module 106 is configured to predict the target based on the adaptive fusion features.
For ease of the following description, semantic thermodynamic diagrams are explained first. A semantic thermodynamic diagram (that is, a semantic heat map) marks objects of interest in an image with a particular color. For example, in current map applications, regions of the displayed map where people gather may be circled in a particular color: a darker color for a region indicates that the corresponding region is more crowded, and a lighter color indicates that it is less crowded. Such a semantic thermodynamic diagram lets a user see the flow of people in each area at a glance and plan a trip accordingly. In the embodiments of this application, a semantic thermodynamic diagram specifically refers to marking the distribution of suspected targets in an image with a particular color.
In addition, the point cloud data refers to point cloud data collected for a detection area, and the point cloud data can be acquired by a laser radar. The camera image refers to a camera image captured for the detection area, which can be acquired by a video camera.
It should be noted that the target detection system 100 shown in fig. 1 is a software system, and the modules included in the system are also software modules. Only the functions of each software module are explained here, and detailed implementation of each software module will be explained in the following method embodiments, and will not be explained here first.
In addition, the object detection system shown in fig. 1 may be centrally deployed on a hardware device, such as a terminal or a server. Optionally, each software module in the target detection system shown in fig. 1 may also be distributed and deployed on different hardware devices, which is not limited in this embodiment of the present application.
In order to realize the functions of the target detection system shown in fig. 1, another target detection system is provided in the embodiment of the present application. Fig. 2 is a schematic structural diagram of another object detection system according to an embodiment of the present application. As shown in fig. 2, the object detection system 200 includes a laser radar 201, a camera 202, and an object detection device 203.
Wherein, the laser radar 201 and the camera 202 are respectively connected with the target detection device 203 for communication. The laser radar 201 is used for collecting point cloud data, the video camera 202 is used for collecting camera images, and the target detection device 203 is used for realizing the target detection method provided by the embodiment of the application based on the point cloud data and the camera images.
It should be noted that the target detection device 203 may be a terminal, a server, or a server group integrated with multiple servers, which is not limited in this embodiment of the present application.
The following describes in detail an object detection method according to an embodiment of the present application based on the object detection system shown in fig. 1 and 2. The method may be applied in particular to the object detection device shown in fig. 2. Fig. 3 is a flowchart of a target detection method according to an embodiment of the present application. As shown in fig. 3, the method includes the following steps.
Step 301: based on the point cloud data collected for the detection area, the target detection device determines the features in the bird's-eye view of the detection area to obtain the point cloud features. The bird's-eye view indicates an image obtained by projecting the three-dimensional environment indicated by the point cloud data into a two-dimensional space, and the point cloud data includes the three-dimensional position information of each position point in the detection area onto which a laser beam emitted by the laser radar is projected.
In particular, step 301 may be implemented by a software module point cloud feature encoder in fig. 1.
In a possible implementation manner, in order to enable the obtained point cloud features to indicate more features in the point cloud data, the point cloud feature encoder may determine a bird's-eye view of the detection area based on the point cloud data, and then extract features in the bird's-eye view based on the first convolution network, where the obtained features are the point cloud features.
The bird's-eye view of the detection area can be determined from the point cloud data using any bird's-eye-view technique, which is not limited in the embodiments of this application. For example, for the collected point cloud data, a fully connected network may be used to learn high-dimensional point features in the point cloud data, where the high-dimensional point features of a position point include the features of all point cloud data above that position point in the direction perpendicular to the ground. The high-dimensional point features at the position points are then grouped along the x and y axes parallel to the ground according to the voxel sizes vx and vy, that is, the detection area is meshed into grids of size vx × vy. For any grid, the high-dimensional point features of the position points falling in that grid are combined into the high-dimensional point feature of a single position point by a max-pooling operation; this combined feature serves as the high-dimensional point feature of the detection area at that grid and can also be referred to as the pillar feature at that grid. Finally, the encoded pillar features are scattered back to their original pillar positions (that is, the positions of the corresponding grids) to obtain a pseudo image, and this pseudo image is the bird's-eye view of the detection area.
Fig. 4 is a schematic diagram of acquiring a point cloud feature according to an embodiment of the present application. As shown in fig. 4, after the point cloud data is processed as described above, the bird's-eye view shown in fig. 4 can be obtained, where the bird's-eye view includes a plurality of grids, and each grid corresponds to a pillar feature.
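For illustration, the following is a minimal PyTorch-style sketch of the pillar-feature encoding described above. The class name PillarEncoder, the voxel sizes, the detection range and the per-grid max-pooling loop are illustrative assumptions rather than the actual implementation of this embodiment; the sketch assumes the point cloud has already been cropped to the detection area (see the cropping step below).

import torch
import torch.nn as nn

class PillarEncoder(nn.Module):
    def __init__(self, in_dim=4, feat_dim=64, voxel_size=(0.16, 0.16),
                 x_range=(0.0, 70.4), y_range=(-40.0, 40.0)):
        super().__init__()
        # Fully connected network that learns high-dimensional point features.
        self.point_net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.vx, self.vy = voxel_size
        self.x_min, self.y_min = x_range[0], y_range[0]
        self.nx = int((x_range[1] - x_range[0]) / self.vx)
        self.ny = int((y_range[1] - y_range[0]) / self.vy)
        self.feat_dim = feat_dim

    def forward(self, points):                               # points: (N, 4) = x, y, z, intensity
        feats = self.point_net(points)                       # (N, feat_dim) high-dimensional point features
        ix = ((points[:, 0] - self.x_min) / self.vx).long()  # grid index along x
        iy = ((points[:, 1] - self.y_min) / self.vy).long()  # grid index along y
        cell = iy * self.nx + ix                             # flattened grid id of each point
        canvas = feats.new_zeros(self.ny * self.nx, self.feat_dim)
        for c in torch.unique(cell):
            # Max-pool the features of the points falling in the same grid into one pillar feature.
            canvas[c] = feats[cell == c].max(dim=0).values
        # Scatter the pillar features back to their grid positions: the (feat_dim, ny, nx) pseudo image.
        return canvas.t().reshape(self.feat_dim, self.ny, self.nx)

points = torch.rand(5000, 4) * torch.tensor([70.0, 80.0, 3.0, 1.0]) + torch.tensor([0.0, -40.0, 0.0, 0.0])
bev = PillarEncoder()(points)
print(bev.shape)  # torch.Size([64, 500, 440])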
In addition, the point cloud data acquired by the target detection device is usually not limited to the detection area, so the target detection device may also crop the acquired point cloud data to obtain the point cloud data collected for the detection area. Specifically, according to the three-dimensional position range L, W, H corresponding to the detection area (L is the length of the detection area, W is its width, and H is its height), the point cloud data located within this three-dimensional position range is selected from the acquired point cloud data, which completes the cropping. The bird's-eye view is then determined from the cropped point cloud data.
The three-dimensional position range L, W, H corresponding to the detection area may be specified by a user based on a service requirement, which is not limited in this embodiment of the present application.
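A minimal sketch of this cropping step is given below; the exact orientation of the L, W, H range relative to the laser radar is not specified in this embodiment, so the bounds used here are illustrative assumptions.

import torch

def crop_to_detection_area(points, L=70.4, W=80.0, H=4.0):
    # Keep only the points whose coordinates fall inside the L x W x H detection range.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    mask = (x >= 0) & (x < L) & (y >= -W / 2) & (y < W / 2) & (z >= -1.0) & (z < H - 1.0)
    return points[mask]

cropped = crop_to_detection_area(torch.randn(10000, 4) * 30.0)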
In addition, the first convolution network may be a deep 2D convolution network. In this case, extracting the features in the bird's-eye view based on the first convolution network may be implemented as follows: the features in the bird's-eye view are further extracted using the deep 2D convolution network, whose structure is shown in fig. 4. The input bird's-eye view is passed through three successive convolutions with a stride of 2 to obtain three features of different scales; the three features are then deconvolved back to the same scale, and the three feature maps are cascaded to obtain the cascaded feature, which is the desired point cloud feature.
It should be noted that the deep 2D convolution network in fig. 4 is only an example of the first convolution network provided in the embodiments of this application. The embodiments do not limit the specific structure of the first convolution network, as long as the first convolution network can extract the depth features in the bird's-eye view.
In addition, a convolution network takes the original image as input and can effectively learn the corresponding features from a large number of samples, which avoids a complex hand-crafted feature extraction process. Because a convolution network can process the two-dimensional image directly, more abstract features are extracted from the original image through a simple nonlinear model, and only a small amount of manual work is needed in the whole process. Therefore, in the embodiments of this application, the first convolution network is used to extract the features in the bird's-eye view. Optionally, other feature extraction methods may also be adopted to extract the features in the bird's-eye view, which are not enumerated here.
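The following is a minimal sketch of a backbone in the spirit of the deep 2D convolution network described above: three stride-2 convolution blocks produce features at three scales, which are deconvolved back to a common scale and cascaded. The channel counts and block depths are illustrative assumptions.

import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class BEVBackbone(nn.Module):
    def __init__(self, cin=64):
        super().__init__()
        self.down1 = conv_block(cin, 64, stride=2)       # 1/2 scale
        self.down2 = conv_block(64, 128, stride=2)       # 1/4 scale
        self.down3 = conv_block(128, 256, stride=2)      # 1/8 scale
        # Deconvolutions restore the three features to the same (1/2) scale before cascading.
        self.up1 = nn.ConvTranspose2d(64, 128, 1, stride=1)
        self.up2 = nn.ConvTranspose2d(128, 128, 2, stride=2)
        self.up3 = nn.ConvTranspose2d(256, 128, 4, stride=4)

    def forward(self, bev):                              # bev: (B, cin, H, W) pseudo image
        f1 = self.down1(bev)
        f2 = self.down2(f1)
        f3 = self.down3(f2)
        # The cascaded multi-scale feature is the point cloud feature used in later steps.
        return torch.cat([self.up1(f1), self.up2(f2), self.up3(f3)], dim=1)

feat = BEVBackbone()(torch.randn(1, 64, 512, 448))
print(feat.shape)  # torch.Size([1, 384, 256, 224])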
Step 302: based on the point cloud features, the target detection device predicts a point cloud semantic thermodynamic diagram indicating the distribution of suspected targets in the three-dimensional environment indicated by the point cloud data.
Specifically, step 302 may be implemented by a software module point cloud semantic thermodynamic prediction module in fig. 1.
In a possible implementation manner, in order to accurately predict the distribution of suspected targets in the three-dimensional environment indicated by the point cloud data, a first thermodynamic diagram prediction model may be trained in advance, and the first thermodynamic diagram prediction model is used for identifying the approximate distribution of the targets to be detected based on the point cloud data. In this scenario, the implementation process of step 302 specifically includes: and determining a point cloud semantic thermodynamic diagram through the first thermodynamic diagram prediction model based on the point cloud characteristics obtained in the step 301. That is, the point cloud features obtained in step 301 are input into the first thermodynamic diagram prediction model, and after the point cloud features are subjected to a series of processing by the first thermodynamic diagram prediction model, a point cloud semantic thermodynamic diagram can be obtained.
The first thermodynamic diagram prediction model is obtained by training in advance. In one possible implementation, the process of training the first thermodynamic diagram prediction model may be as follows: obtain a plurality of sample aerial views and the label information of each sample aerial view, where the label information of each sample aerial view indicates the position information of the targets in that sample aerial view; obtain the features in each of the plurality of sample aerial views; and train a first initialization model based on the features in each sample aerial view and the label information of each sample aerial view to obtain the first thermodynamic diagram prediction model.
The sample aerial views and the features in the sample aerial views can be determined in the same manner as the point cloud features in step 301, which is not repeated here. The label information of each sample aerial view is marked manually in advance by the user, and its main function is to continuously adjust the parameters of the first initialization model during training, so that the distribution of targets predicted by the trained first thermodynamic diagram prediction model from the features of a sample aerial view matches the position information of the targets indicated by the label information of that sample aerial view as closely as possible.
It should be noted that the above process of training the first thermodynamic diagram prediction model is only an exemplary training process, and the embodiment of the present application is not limited to how to train the process of obtaining the first thermodynamic diagram prediction model.
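For illustration, the following sketch shows a possible thermodynamic diagram (heat map) prediction head over the point cloud features together with one supervised training step. The head structure, the use of a binary cross-entropy loss and the form of the target map are assumptions; the embodiment does not fix these details.

import torch
import torch.nn as nn

class HeatmapHead(nn.Module):
    def __init__(self, cin=384):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(cin, 64, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(64, 1, 1))    # one "suspected target" channel

    def forward(self, feats):                            # feats: (B, cin, H, W) point cloud features
        return self.net(feats)                           # logits: (B, 1, H, W)

def train_step(model, optimizer, feats, target_heatmap):
    # One supervised step: target_heatmap is close to 1 near labelled target positions, 0 elsewhere.
    logits = model(feats)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, target_heatmap)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = HeatmapHead()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = train_step(model, opt, torch.randn(2, 384, 128, 112), torch.rand(2, 1, 128, 112))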
In addition, the point cloud features and the point cloud semantic thermodynamic diagram to be predicted may have a scale mismatch. In such a scenario, after the point cloud features are obtained, the point cloud features can be mapped to the same scale as the point cloud semantic thermodynamic diagram to be predicted by using a deconvolution network. And after point cloud features with the same scale are obtained, predicting the point cloud semantic thermodynamic diagram by using the first thermodynamic diagram prediction model.
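A small sketch of such a scale mapping with a deconvolution is shown below; the channel count, kernel size and stride are illustrative assumptions.

import torch
import torch.nn as nn

# Map the point cloud feature to the scale of the semantic thermodynamic diagram to be predicted.
upscale = nn.ConvTranspose2d(in_channels=384, out_channels=384, kernel_size=2, stride=2)
pc_feat = torch.randn(1, 384, 128, 112)
pc_feat_rescaled = upscale(pc_feat)   # (1, 384, 256, 224), matching the thermodynamic diagram scale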
Step 303: based on the point cloud feature and the point cloud semantic thermodynamic diagram, the target detection device determines an adaptive fusion feature.
Specifically, the step 303 can be implemented by the adaptive feature fusion module in fig. 1.
In a possible implementation manner, in order to implement effective fusion of the point cloud feature and the point cloud semantic thermodynamic diagram, rather than simply directly superimposing the two types of features to obtain a fusion feature, the implementation process of step 303 may be: and respectively taking the point cloud characteristics and the characteristics in the point cloud semantic thermodynamic diagram as one channel characteristic to obtain two channel characteristics, and cascading the two channel characteristics to obtain the initial characteristic. Based on the initial feature, a global context feature is obtained that indicates an association between different ones of the two channel features. And superposing the global context feature and the initial feature to obtain the self-adaptive fusion feature.
That is, the global context feature and the initial feature, which take the relevance among the channels into consideration, are superimposed to serve as the adaptive fusion feature, so that the adaptive fusion feature can represent more features of the detection region, and the efficiency of performing the 3D target detection based on the adaptive fusion feature is improved.
Based on the initial feature, the implementation process of obtaining the global context feature may be: acquiring attention weight of each channel feature in the initial features, wherein the attention weight indicates the importance degree of each channel feature in the process of detecting the target; and multiplying the attention weight of each channel feature in the initial features by the corresponding channel feature to obtain the global context feature.
Wherein the attention weight of each channel feature in the initial feature can be determined by a 1 × 1 convolution network and a Softmax function (normalization function). Optionally, the attention weight of each channel feature in the initial feature may also be obtained in other manners, which is not limited in this embodiment of the application.
Attention weights are explained here. An attention weight can simply be understood as a proportion. For example, suppose three people complete a task together, with the first person contributing 40%, the second 30%, and the third 30%; the percentage contributions in this example can be understood as attention weights. Correspondingly, in the embodiments of this application, the point cloud features, the image semantic thermodynamic diagram, and the point cloud semantic thermodynamic diagram, for example, contribute differently to target detection, so attention weights can be used to measure the contribution of each of the three features. It should be noted that the attention weights in the embodiments of this application are not assigned manually; the device learns them autonomously from the actual data (for example, through the 1 × 1 convolution network and the Softmax function described above), which is why this approach is referred to as adaptive feature fusion.
In addition, in order to enable the fused adaptive fusion features to represent more information, before the global context features and the initial features are superposed, feature conversion can be performed on the global context features to extract depth features in the global context features, so that the converted global context features are obtained; and then, overlapping the converted global context features and the initial features to obtain the self-adaptive fusion features.
For example, the global context feature may be transformed by a 1x1 convolution, layer normalization, a ReLU (rectified linear unit) nonlinear function, and another 1x1 convolution, so as to extract the depth features in the global context feature. The first 1x1 convolution reduces the number of feature channels and thus reduces model inference time. Layer normalization is used to prevent overfitting. The ReLU is a nonlinear function used to learn a better fitting function. The second 1x1 convolution converts the reduced number of feature channels back to the original scale so that the result can be combined with the features at the original scale.
In addition, the processes of the 1 × 1 convolution, layer normalization, and ReLU nonlinear function can refer to corresponding technologies, which are not described in detail in this embodiment of the present application.
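Putting these steps together, the following is a minimal sketch of the adaptive feature fusion described in this step: the channel features are cascaded into the initial feature, a 1 x 1 convolution and a Softmax produce one attention weight per channel, the weighted feature serves as the global context feature, and the converted context (1 x 1 convolution, layer normalization, ReLU, 1 x 1 convolution) is superposed on the initial feature. The exact form of the attention computation and the bottleneck ratio are assumptions, since the embodiment only names the operations involved; the module accepts any number of channel features, so it also covers the three-feature case of the later embodiment.

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # 1x1 convolution whose Softmax-normalized output is read as one attention weight
        # per channel; the exact weighting scheme is an assumption.
        self.attn = nn.Conv2d(channels, channels, kernel_size=1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),      # first 1x1: fewer channels
            nn.GroupNorm(1, channels // reduction),             # layer normalization (single-group GroupNorm)
            nn.ReLU(inplace=True),                              # nonlinear function
            nn.Conv2d(channels // reduction, channels, 1))      # second 1x1: back to the original channel count

    def forward(self, *channel_features):
        # Cascade the inputs (e.g. the point cloud features and the point cloud semantic
        # thermodynamic diagram; more channel features can be appended in the same way).
        initial = torch.cat(channel_features, dim=1)            # (B, C, H, W) initial feature
        pooled = initial.mean(dim=(2, 3), keepdim=True)         # squeeze the spatial dimensions
        weights = torch.softmax(self.attn(pooled), dim=1)       # one attention weight per channel
        context = initial * weights                             # global context feature
        return initial + self.transform(context)                # superpose the converted context

fused = AdaptiveFusion(channels=384 + 1)(
    torch.randn(2, 384, 128, 112),   # point cloud features
    torch.randn(2, 1, 128, 112))     # point cloud semantic thermodynamic diagram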
Step 304: the target detection device performs target detection based on the adaptive fusion feature to acquire information of a target in the detection area.
After the adaptive fusion feature is obtained in step 303, target detection may be performed based on the adaptive fusion feature to obtain information of a target in the detection area, thereby completing 3D target detection.
The information of a target in the detection area includes the category to which the target belongs and the region box in which the target is located, and for the region box, three-dimensional position information such as the coordinates of its center point and its length, width, and height can be marked.
For example, in the field of intelligent transportation, the category to which a target belongs may be motor vehicle, non-motor vehicle, pedestrian, and so on.
Further, in step 304, object detection may be performed by 1 × 1 convolution. That is, the adaptive fusion features are input to a 1 × 1 convolutional network through which target detection is achieved.
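A minimal sketch of such a 1 x 1 convolution detection head is shown below. The set of regression outputs (center coordinates, length, width, height and rotation angle) and the number of categories are illustrative assumptions based on the information listed above.

import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, cin=385, num_classes=3):
        super().__init__()
        self.cls = nn.Conv2d(cin, num_classes, 1)   # per-cell class scores (e.g. motor vehicle, non-motor vehicle, pedestrian)
        self.box = nn.Conv2d(cin, 7, 1)             # per-cell box: x, y, z center, length, width, height, rotation angle

    def forward(self, fused):                        # fused: (B, cin, H, W) adaptive fusion feature
        return self.cls(fused), self.box(fused)

cls_map, box_map = DetectionHead()(torch.randn(2, 385, 128, 112))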
In the embodiments of this application, a point cloud semantic thermodynamic diagram that indicates the approximate distribution of the targets to be detected is predicted from point cloud data collected for a detection area. Feature fusion is then performed on the point cloud semantic thermodynamic diagram and the point cloud features to obtain adaptive fusion features, and targets are detected from the adaptive fusion features. In other words, a coarse prediction of the targets to be detected is first made from the point cloud data to obtain prior knowledge about those targets, and the targets are then detected based on the predicted prior knowledge together with the precise point cloud features. The detection process thus performs a coarse detection of the targets first and an accurate detection afterwards, which improves the accuracy of the detected targets.
In the embodiment shown in fig. 3, the point cloud feature and the point cloud semantic thermodynamic diagram are fused to obtain an adaptive fusion feature. Optionally, in the embodiment of the present application, the adaptive fusion feature may also be determined in combination with other data sources besides the point cloud data, such as a camera image, so that the adaptive fusion feature can characterize more information of the detection area, thereby improving the accuracy of the detected target. Based on the above, the embodiment of the present application further provides another target detection method, in which the point cloud data and the camera image are referred to simultaneously to determine the adaptive fusion feature. The method is explained in detail below.
Fig. 5 is a flowchart of a target detection method according to an embodiment of the present application. As shown in fig. 5, the method includes the following steps.
Step 501: based on the point cloud data collected for the detection area, the target detection device determines the features in the bird's-eye view of the detection area to obtain the point cloud features; the bird's-eye view indicates an image obtained by projecting the three-dimensional environment indicated by the point cloud data into a two-dimensional space, and the point cloud data comprises the three-dimensional position information of each position point in the detection area onto which a laser beam emitted by the laser radar is projected.
Step 502: based on the point cloud features, the target detection device predicts a point cloud semantic thermodynamic diagram indicating the distribution of suspected targets in the three-dimensional environment indicated by the point cloud data.
Step 501 and step 502 have already been described in detail in the embodiment shown in fig. 3, and are not described herein again.
Step 503: based on the camera image acquired for the detection area, the target detection device determines features in the camera image, resulting in image features.
In particular, step 503 may be implemented by the software module image feature encoder in fig. 1.
In one possible implementation, to obtain depth features in the camera image, features in the camera image may be extracted based on a second convolution network, resulting in image features. That is, the camera image is input to the second convolution network, and the output of the second convolution network is the image characteristic of the camera image.
For example, the second convolution network may be the image network HRNet (High-Resolution Network) shown in fig. 6. HRNet maintains a high-resolution representation throughout the image feature extraction process, so image features are not lost, which facilitates predicting the image semantic thermodynamic diagram in the actual space. As shown in fig. 6, HRNet starts with a high-resolution branch in the first stage. At each subsequent stage, a new branch is added in parallel to the current branches, with a resolution that is 1/2 of the lowest resolution among the current branches. As the network goes through more stages, it has more parallel branches with different resolutions, and the resolutions of the earlier stages are preserved in the later stages. Finally, the features of the multiple branches are restored to the same scale and cascaded to obtain the final image feature.
It should be noted that fig. 6 is only an exemplary network structure of the second convolutional network, and the embodiment of the present application does not limit the specific structure of the second convolutional network.
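The following is a drastically simplified two-branch sketch in the spirit of the HRNet idea described above: a full-resolution branch is kept throughout, one parallel branch at half resolution is added, and both are restored to the same scale and cascaded. The real HRNet has more stages, more branches and repeated cross-branch fusion; all sizes here are illustrative assumptions.

import torch
import torch.nn as nn

class TinyHRBackbone(nn.Module):
    def __init__(self, cin=3, c=32):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(cin, c, 3, padding=1), nn.ReLU(inplace=True))
        self.high = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True))
        self.low = nn.Sequential(nn.Conv2d(c, 2 * c, 3, stride=2, padding=1),  # new 1/2-resolution branch
                                 nn.ReLU(inplace=True))
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, image):                         # image: (B, 3, H, W) camera image
        x = self.stem(image)
        high = self.high(x)                           # the full-resolution branch is preserved
        low = self.up(self.low(x))                    # half-resolution branch, restored to full scale
        return torch.cat([high, low], dim=1)          # cascaded image feature

img_feat = TinyHRBackbone()(torch.randn(1, 3, 256, 256))
print(img_feat.shape)  # torch.Size([1, 96, 256, 256])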
Based on the discussion in step 301, a convolution network takes the original image as input and can effectively learn the corresponding features from a large number of samples, which avoids a complex hand-crafted feature extraction process. Because a convolution network can process the two-dimensional image directly, more abstract features are extracted from the original image through a simple nonlinear model, and only a small amount of manual work is needed. Therefore, in the embodiments of this application, the second convolution network is used to extract the features in the camera image. It should be noted that the names "second convolution network" and "first convolution network" in step 301 have no special meaning and are only used to distinguish two different convolution networks.
Optionally, in this embodiment of the present application, the image feature of the camera image may be obtained by another feature extraction method, which is not limited to the implementation based on the second convolution network, and this is also not limited in this embodiment of the present application.
Step 504: based on the image features, the object detection device determines an image semantic thermodynamic diagram indicating a distribution of suspected objects in the camera image.
In particular, step 504 may be implemented by the software module image semantic thermodynamic diagram prediction module in fig. 1.
In a possible implementation manner, in order to accurately predict the distribution of suspected targets in the camera image, a second thermodynamic diagram prediction model may be trained in advance, and the second thermodynamic diagram prediction model is used for identifying the approximate distribution of the targets to be detected based on the camera image. In this scenario, the implementation process of step 504 specifically includes: and determining an image semantic thermodynamic diagram through a second thermodynamic prediction model based on the image features obtained in the step 503. That is, the image features obtained in step 503 are input to the second thermodynamic prediction model, and the second thermodynamic prediction model performs a series of processing on the image features to obtain an image semantic thermodynamic diagram.
The second thermodynamic diagram prediction model is obtained by training in advance. In one possible implementation, the process of training the second thermodynamic diagram prediction model may be: obtaining a plurality of sample camera images and the label information of each of the plurality of sample camera images, the label information of each sample camera image indicating the position information of the targets in the respective sample camera image; obtaining the features in each of the plurality of sample camera images; and training a second initialization model based on the features in each sample camera image and the label information of each sample camera image to obtain the second thermodynamic diagram prediction model.
The features in the sample camera images can be determined in the same manner as the image features in step 503, which is not repeated here. The label information of each sample camera image is marked manually in advance by the user, and its main function is to continuously adjust the parameters of the second initialization model during training, so that the distribution of targets predicted by the trained second thermodynamic diagram prediction model from the features of a sample camera image matches the position information of the targets indicated by the label information of that sample camera image as closely as possible.
It should be noted that the above process of training the second thermodynamic diagram prediction model is only an exemplary training process, and the embodiment of the present application is not limited to how to train the second thermodynamic diagram prediction model.
In addition, there may be a scale mismatch between the image features and the image semantic thermodynamic diagram to be predicted. In this scenario, after the image features are obtained, a deconvolution network may also be used to map the image features to the same scale as the semantic thermodynamic diagram of the image to be predicted. And after the image features with the same scale are obtained, predicting the image semantic thermodynamic diagram by using the second thermodynamic diagram prediction model.
In addition, because point cloud features generally contain richer information than image features, the distribution of targets in the point cloud semantic thermodynamic diagram predicted from the point cloud features is more accurate than the distribution of targets in the image semantic thermodynamic diagram predicted from the image features. That is, the image semantic thermodynamic diagram can only roughly predict the approximate distribution of the targets. Therefore, in order to speed up the training of the second thermodynamic diagram prediction model, the conditions that the training needs to satisfy can be set less strictly. For example, when training the first thermodynamic diagram prediction model, the error between the predicted value and the true value may be required to be within a first error before training is considered complete, whereas when training the second thermodynamic diagram prediction model, training can be considered complete as long as the error between the predicted value and the true value is within a second error that is larger than the first error. In other words, when training the second thermodynamic diagram prediction model, Gaussian blur is applied to the ground-truth values used for supervision.
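For illustration, the following sketch builds such a softened ground-truth map for the image-side model: a Gaussian peak is placed at each labelled target center, so the supervision tolerates coarse predictions. The Gaussian radius is an illustrative assumption.

import torch

def gaussian_target(centers, height, width, sigma=4.0):
    # centers: list of (row, col) labelled target centers in thermodynamic-diagram coordinates.
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    heatmap = torch.zeros(height, width)
    for (cy, cx) in centers:
        g = torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        heatmap = torch.maximum(heatmap, g)   # keep the strongest response per cell
    return heatmap

target = gaussian_target([(20, 30), (64, 100)], height=128, width=112)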
Step 505: based on one or both of the point cloud semantic thermodynamic diagram and the point cloud features, and one or both of the image semantic thermodynamic diagram and the image features, the target detection device determines adaptive fusion features.
Specifically, step 505 may include the following several examples.
Example one: and based on the point cloud semantic thermodynamic diagram and the image semantic thermodynamic diagram, the target detection equipment determines the self-adaptive fusion characteristics.
In example one, a point cloud semantic thermodynamic diagram capable of indicating an approximate distribution of objects to be detected is predicted from point cloud data acquired for a detection area. Meanwhile, an image semantic thermodynamic diagram is predicted according to a camera image acquired aiming at the detection area, and the image semantic thermodynamic diagram can also indicate the approximate distribution condition of the target to be detected. And then, performing feature fusion based on the two semantic thermodynamic diagrams to obtain a self-adaptive fusion feature, and further detecting the target according to the self-adaptive fusion feature. That is, in the first example, firstly, a fuzzy prediction is performed on the target to be detected according to different data sources, so as to obtain the prior knowledge of the target to be detected under different data sources, and then the target is further detected based on the prior knowledge under different data sources. Because the data source according to which the target is detected not only comprises point cloud data but also comprises a camera image, the information according to which the target is detected is more comprehensive, and the detected target can be more accurate. In addition, the target is further detected based on the prior knowledge of the target to be detected under different data sources, so that the target can be firstly subjected to fuzzy detection in the target detection process, accurate detection can be further performed, and the accuracy of the detected target can also be improved.
The specific implementation manner of the first example may refer to step 303 in the embodiment of fig. 3, and the difference is that the first example is to respectively use the features in the point cloud semantic thermodynamic diagram and the features in the image semantic thermodynamic diagram as one channel feature to obtain two channel features, and then fuse the two channel features to obtain the adaptive fusion feature. Accordingly, the specific implementation of example one is not described in detail herein.
Example two: and based on the point cloud semantic thermodynamic diagram and the image characteristics, the target detection equipment determines self-adaptive fusion characteristics.
In example two, a point cloud semantic thermodynamic diagram capable of indicating an approximate distribution of the target to be detected is predicted from the point cloud data acquired for the detection area. Image features are also determined from the camera image acquired for the detection area. And then carrying out feature fusion based on the point cloud semantic thermodynamic diagram and the image features to obtain self-adaptive fusion features, and further detecting the target according to the self-adaptive fusion features. That is, in the second example, firstly, a fuzzy prediction is performed on the target to be detected according to the point cloud data to obtain the prior knowledge of the target to be detected under the point cloud data, and then the image features are fused to further detect the target. Because the data source according to which the target is detected not only comprises point cloud data but also comprises a camera image, the information according to which the target is detected is more comprehensive, and the detected target can be more accurate. In addition, because the image features are further fused on the basis of the prior knowledge of the target to be detected under the point cloud data, the target can first be subjected to fuzzy detection and then to accurate detection in the target detection process, which also improves the accuracy of the detected target.
The specific implementation manner of the second example can refer to step 303 in the embodiment of fig. 3, and the difference is that the second example is to obtain two channel features by using the feature and the image feature in the point cloud semantic thermodynamic diagram as one channel feature respectively, and then fuse the two channel features to obtain the adaptive fusion feature. Therefore, the specific implementation of example two will not be described in detail herein.
Example three: based on the image semantic thermodynamic diagram and the point cloud features, the target detection device determines adaptive fusion features.
In example three, from a camera image acquired for the detection area, an image semantic thermodynamic diagram is predicted, which can indicate the approximate distribution of the objects to be detected. And meanwhile, determining point cloud features according to point cloud data acquired aiming at the detection area. And then carrying out feature fusion based on the image semantic thermodynamic diagram and the point cloud features to obtain self-adaptive fusion features, and further detecting the target according to the self-adaptive fusion features. That is, in the third example, firstly, a fuzzy prediction is performed on the target to be detected according to the camera image, so as to obtain the prior knowledge of the target to be detected under the camera image, and then the point cloud features are fused to further detect the target. Because the data source according to which the target is detected not only comprises point cloud data but also comprises a camera image, the information according to which the target is detected is more comprehensive, and the detected target can be more accurate. In addition, because the point cloud features are further fused on the basis of the prior knowledge of the target to be detected under the camera image, the target can first be subjected to fuzzy detection and then to accurate detection in the target detection process, which also improves the accuracy of the detected target.
The specific implementation manner of the third example can refer to step 303 in the embodiment of fig. 3, and the difference is that the third example is to respectively use the feature in the image semantic thermodynamic diagram and the point cloud feature as one channel feature to obtain two channel features, and then fuse the two channel features to obtain the adaptive fusion feature. Accordingly, the specific implementation of example three is not described in detail herein.
Example four: and determining self-adaptive fusion characteristics by the target detection equipment based on the point cloud semantic thermodynamic diagram, the image semantic thermodynamic diagram and the point cloud characteristics.
In example four, a point cloud semantic thermodynamic diagram capable of indicating an approximate distribution of the target to be detected is predicted from the point cloud data acquired for the detection area. Meanwhile, an image semantic thermodynamic diagram is predicted according to a camera image acquired aiming at the detection area, and the image semantic thermodynamic diagram can also indicate the approximate distribution condition of the target to be detected. And then performing feature fusion on the two semantic thermodynamic diagrams and the point cloud features under the point cloud data to obtain self-adaptive fusion features, and further detecting the target according to the self-adaptive fusion features. That is, in the fourth example, firstly, a fuzzy prediction is performed on the target to be detected according to different data sources, so as to obtain the prior knowledge of the target to be detected under different data sources, and then the point cloud features are fused based on the prior knowledge under different data sources, so as to further detect the target. Because the data source according to which the target is detected not only comprises point cloud data but also comprises a camera image, the information according to which the target is detected is more comprehensive, and the detected target can be more accurate. In addition, because the fine point cloud features are further fused on the basis of the prior knowledge of the target to be detected under different data sources, the target can first be subjected to fuzzy detection and then to accurate detection based on the fused point cloud features in the target detection process, which also improves the accuracy of the detected target.
The specific implementation manner of the example four may refer to step 303 in the embodiment of fig. 3, and the difference is that the example four is that the features in the point cloud semantic thermodynamic diagram, the features in the image semantic thermodynamic diagram, and the point cloud features are respectively used as one channel feature to obtain three channel features, and then the three channel features are fused to obtain the adaptive fusion feature.
Example five: based on the point cloud semantic thermodynamic diagram, the image semantic thermodynamic diagram and the image features, the target detection device determines self-adaptive fusion features.
In example five, from the point cloud data collected for the detection area, a point cloud semantic thermodynamic diagram is predicted, which can indicate an approximate distribution of the target to be detected. Meanwhile, an image semantic thermodynamic diagram is predicted according to a camera image acquired aiming at the detection area, and the image semantic thermodynamic diagram can also indicate the approximate distribution condition of the target to be detected. And then performing feature fusion based on the two semantic thermodynamic diagrams and the image features under the camera image to obtain a self-adaptive fusion feature, and further detecting the target according to the self-adaptive fusion feature. That is, in the fifth example, firstly, a fuzzy prediction is performed on the target to be detected according to different data sources, so as to obtain the prior knowledge of the target to be detected under different data sources, and then the image features are fused based on the prior knowledge under different data sources, so as to further detect the target. Because the data source according to which the target is detected not only comprises point cloud data but also comprises a camera image, the information according to which the target is detected is more comprehensive, and the detected target can be more accurate. In addition, because the image features in the camera image are further fused on the basis of the prior knowledge of the target to be detected under different data sources, the target can first be subjected to fuzzy detection and then to accurate detection based on the fused image features in the target detection process, which also improves the accuracy of the detected target.
The specific implementation manner of the fifth example can refer to step 303 in the embodiment of fig. 3, and the difference is that the fifth example is to respectively use the features in the point cloud semantic thermodynamic diagram, the features in the image semantic thermodynamic diagram, and the image features as one channel feature to obtain three channel features, and then fuse the three channel features to obtain the adaptive fusion feature.
Example six: based on the point cloud semantic thermodynamic diagram, the image features and the point cloud features, the target detection device determines self-adaptive fusion features.
In example six, from the point cloud data collected for the detection area, a point cloud semantic thermodynamic diagram is predicted, which can indicate an approximate distribution of the target to be detected. And then carrying out feature fusion on the basis of the point cloud semantic thermodynamic diagram, the image features under the camera image and the point cloud features to obtain self-adaptive fusion features, and further detecting the target according to the self-adaptive fusion features. That is, in the sixth example, firstly, a fuzzy prediction is performed on the target to be detected according to the point cloud data, and then the image features and the point cloud features from the different data sources are fused with this fuzzy prediction to further detect the target. Because the data source according to which the target is detected not only comprises point cloud data but also comprises a camera image, the information according to which the target is detected is more comprehensive, and the detected target can be more accurate. In addition, because the image features of the camera image and the point cloud features are fused with the prior knowledge of the target to be detected under the point cloud data, the target can first be subjected to fuzzy detection and then to accurate detection based on the fused image features and point cloud features in the target detection process, which also improves the accuracy of the detected target.
The specific implementation manner of the sixth example can refer to step 303 in the embodiment of fig. 3, and the difference is that the sixth example is to respectively use the features, the image features, and the point cloud features in the point cloud semantic thermodynamic diagram as one channel feature to obtain three channel features, and then fuse the three channel features to obtain the adaptive fusion feature.
Example seven: based on the image semantic thermodynamic diagram, the image features and the point cloud features, the target detection device determines adaptive fusion features.
In example seven, from the camera image acquired for the detection area, an image semantic thermodynamic diagram is predicted, which can indicate the approximate distribution of the objects to be detected. And then carrying out feature fusion on the basis of the image semantic thermodynamic diagram, the image features under the camera image and the point cloud features to obtain self-adaptive fusion features, and further detecting the target according to the self-adaptive fusion features. That is, in the seventh example, firstly, a fuzzy prediction is performed on the target to be detected according to the camera image, and then the image features and the point cloud features from the different data sources are fused with this fuzzy prediction to further detect the target. Because the data source according to which the target is detected not only comprises point cloud data but also comprises a camera image, the information according to which the target is detected is more comprehensive, and the detected target can be more accurate. In addition, because the image features of the camera image and the point cloud features are fused with the prior knowledge of the target to be detected under the camera image, the target can first be subjected to fuzzy detection and then to accurate detection based on the fused image features and point cloud features in the target detection process, which also improves the accuracy of the detected target.
The specific implementation manner of the seventh example can refer to step 303 in the embodiment of fig. 3, and the difference is that the seventh example is to respectively use the features, the image features, and the point cloud features in the image semantic thermodynamic diagram as one channel feature to obtain three channel features, and then fuse the three channel features to obtain the adaptive fusion feature.
Example eight: based on the point cloud semantic thermodynamic diagram, the image semantic thermodynamic diagram, the image features and the point cloud features, the target detection device determines the self-adaptive fusion features.
In example eight, from the point cloud data collected for the detection area, a point cloud semantic thermodynamic diagram is predicted, which can indicate an approximate distribution of the target to be detected. Meanwhile, an image semantic thermodynamic diagram is predicted according to a camera image acquired aiming at the detection area, and the image semantic thermodynamic diagram can also indicate the approximate distribution condition of the target to be detected. And then performing feature fusion on the two semantic thermodynamic diagrams, the point cloud features under the point cloud data and the image features of the camera image to obtain self-adaptive fusion features, and further detecting the target according to the self-adaptive fusion features. That is, in the eighth example, firstly, a fuzzy prediction is performed on the target to be detected according to different data sources, so as to obtain the prior knowledge of the target to be detected under different data sources, and then the point cloud features and the image features are fused based on the prior knowledge under different data sources, so as to further detect the target. Because the data source according to which the target is detected not only comprises point cloud data but also comprises a camera image, the information according to which the target is detected is more comprehensive, and the detected target can be more accurate. In addition, because the fine point cloud features and the image features are further fused on the basis of the prior knowledge of the target to be detected under different data sources, the target can first be subjected to fuzzy detection and then to accurate detection based on the fused point cloud features and image features in the target detection process, which also improves the accuracy of the detected target.
The specific implementation manner of the example eight may refer to step 303 in the embodiment of fig. 3, and the difference is that the example eight is to respectively use the features in the point cloud semantic thermodynamic diagram, the features in the image semantic thermodynamic diagram, the point cloud features, and the image features as one channel feature to obtain four channel features, and then fuse the four channel features to obtain the adaptive fusion feature.
It should be noted that, in the above step 505, besides the above eight examples, besides the fusion of the point cloud semantic thermodynamic diagram with the point cloud features, and besides the fusion of the image semantic thermodynamic diagram with the image features, it is also possible to fuse only the image features and the point cloud features. However, in that scenario no fuzzy prediction is performed before the fine prediction, so the subsequently detected target may not be very accurate.
It should be noted that, among the above examples, example four not only considers the fuzzy predictions under different data sources at the same time, but also performs accurate prediction on the fused fine point cloud features, so the target subsequently predicted in example four is more accurate than in the embodiment shown in fig. 3 and in the two-channel fusion of examples one to three. In addition, the three-channel fusion in example four is more efficient than the fusion of four channel features. Therefore, when the embodiment of the present application is applied, determining the adaptive fusion feature in the manner of example four may be preferred.
Fig. 7 is a schematic flow chart of the feature fusion of example four provided in an embodiment of the present application. As shown in fig. 7, after the 3 features (the point cloud semantic thermodynamic diagram, the image semantic thermodynamic diagram and the point cloud features) are obtained, they need to be fused effectively so that the algorithm performs better. The function of the adaptive feature fusion module in fig. 1 is to fuse these 3 features effectively: the module regards the 3 features as different channels and assigns different weights to the different channels, thereby achieving effective fusion.
The method comprises the following specific steps:
a) firstly, cascading the 3 features, namely taking the different features as different channels, to form the initial features;
b) global attention pool: taking the initial features as input, obtaining the attention weight of each channel by adopting convolution of 1x1 and a Softmax function, and multiplying the attention weight of each channel by each channel feature in the initial features to obtain global context features;
c) performing feature conversion on the global context features through 1x1 convolution, layer normalization, a ReLU nonlinear function and 1x1 convolution to obtain converted global context features;
d) adding the converted global context feature and the initial feature element by element to obtain the self-adaptive fusion feature. Note that element-by-element addition here means that the elements of the corresponding channels of the two features are added one by one.
The above steps can be specifically realized by the following formula:
Zi = Xi + δ( Σj αj·Xj ),  where  αj = exp(Wk·Xj) / Σm exp(Wk·Xm)

wherein Zi represents the adaptive fusion feature, Xi represents the initial feature, αj represents the attention weight of the j-th channel obtained through the 1x1 convolution Wk and the Softmax function, Σj αj·Xj represents the global context feature, and δ(·), the feature conversion consisting of a 1x1 convolution, layer normalization, a ReLU nonlinear function and another 1x1 convolution, yields the converted global context feature.
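The steps a) to d) above can be illustrated with a PyTorch-style sketch. The attention layout below (a 1x1 convolution followed by spatial averaging and a channel-wise Softmax) and the use of GroupNorm to play the role of layer normalization are assumptions made where the description leaves details open; module and parameter names are illustrative.

```python
from typing import List

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureFusion(nn.Module):
    """Sketch of steps a)-d): cascade the input features as channels, weight the
    channels with a global attention pool, transform the pooled context, and add
    it back to the initial feature element by element."""

    def __init__(self, channels: int, reduction: int = 4):
        # `channels` must equal the total number of channels after cascading.
        super().__init__()
        self.attn = nn.Conv2d(channels, channels, kernel_size=1)   # b) 1x1 convolution
        hidden = max(channels // reduction, 1)
        self.transform = nn.Sequential(                            # c) feature conversion
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.GroupNorm(1, hidden),                               # stands in for layer normalization
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, features: List[torch.Tensor]) -> torch.Tensor:
        # a) cascade: each input (heatmap or feature map) becomes a group of channels
        x = torch.cat(features, dim=1)                             # initial feature
        # b) global attention pool: per-channel weights via 1x1 conv + Softmax
        scores = self.attn(x).mean(dim=(2, 3))                     # (B, C)
        weights = F.softmax(scores, dim=1).unsqueeze(-1).unsqueeze(-1)
        context = weights * x                                      # global context feature
        # c) transform the global context feature
        context = self.transform(context)                          # converted context feature
        # d) element-wise addition with the initial feature
        return x + context                                         # adaptive fusion feature
```

For example four, the input list would hold the point cloud semantic thermodynamic diagram, the image semantic thermodynamic diagram and the point cloud features as the three channel groups; for example eight, the image features would be appended as a fourth group.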
Step 506: the target detection device performs target detection based on the adaptive fusion features to acquire information of a target in the detection area.
The implementation of step 506 may refer to step 304 in the embodiment of fig. 3, and is not described herein again.
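Purely as a hedged sketch of what target detection on the adaptive fusion feature could look like, the snippet below uses a center-heatmap plus box-regression head in the style of common 3D detectors; this head, its parameters and the decode routine are assumptions chosen for illustration and not the detection procedure fixed by this application.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Illustrative head: predicts a per-class center heatmap and 3D box
    parameters (x, y, z, w, l, h, yaw) from the adaptive fusion feature."""
    def __init__(self, in_channels: int, num_classes: int = 3, box_dims: int = 7):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.box = nn.Conv2d(in_channels, box_dims, kernel_size=1)

    def forward(self, fused: torch.Tensor):
        return self.cls(fused).sigmoid(), self.box(fused)

# Illustrative use: threshold heatmap peaks to recover targets in the detection area.
def decode(cls_map: torch.Tensor, box_map: torch.Tensor, score_thr: float = 0.3):
    scores, classes = cls_map.max(dim=1)                   # best class per location
    keep = scores > score_thr
    return scores[keep], classes[keep], box_map.permute(0, 2, 3, 1)[keep]
```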
In the embodiment of the present application, on the basis of the embodiment shown in fig. 3, the camera image from another data source is also considered. Because the data source according to which the target is detected not only comprises point cloud data but also comprises a camera image, the information according to which the target is detected is more comprehensive, and the detected target can be more accurate. In addition, other features are fused with the prior knowledge of the target to be detected under at least one of the different data sources to further detect the target, so that the target can first be subjected to fuzzy detection and then to accurate detection based on the fused features in the target detection process, which also improves the accuracy of the detected target.
All the above optional technical solutions can be combined arbitrarily to form an optional embodiment of the present application, and the present application embodiment is not described in detail again.
In summary, based on the embodiment shown in fig. 3, the embodiment of the present application implements a point cloud semantic thermodynamic diagram and point cloud feature adaptive fusion 3D detection method. Based on the embodiment shown in fig. 5, the embodiment of the application realizes accurate prediction of the point cloud semantic graph and fuzzy prediction of the image semantic graph, and utilizes global context information to perform adaptive fusion on features under different data sources, and finally performs accurate 3D target prediction.
That is, the embodiment of the application designs a self-adaptive feature fusion module, so that the 3D detection algorithm can learn the importance of various different source features by itself, and provides a favorable technical support for the cooperative sensing of multiple sensors. In addition, a point cloud semantic thermodynamic diagram and point cloud feature fusion technology is designed, so that the 3D detection can effectively obtain spatial semantic information. In addition, a point cloud semantic thermodynamic diagram prediction module and an image semantic thermodynamic diagram prediction module are designed, and semantic features are obtained from data of different sources, so that sufficient semantic information is effectively obtained.
Fig. 8 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application. As shown in fig. 8, the apparatus 800 includes several modules as follows.
A first determining module 801, configured to determine, based on point cloud data acquired for a detection area, a feature in a bird's eye view of the detection area to obtain a point cloud feature, where the bird's eye view indicates an image obtained by projecting a three-dimensional environment indicated by the point cloud data to a two-dimensional space, and the point cloud data includes three-dimensional position information of each position point in the detection area, to which a laser beam emitted by a laser radar is projected;
the first prediction module 802 is configured to predict a point cloud semantic thermodynamic diagram based on point cloud features, where the point cloud semantic thermodynamic diagram indicates a distribution situation of a suspected target in a three-dimensional environment indicated by point cloud data;
a second determining module 803, configured to determine an adaptive fusion feature based on the point cloud feature and the point cloud semantic thermodynamic diagram;
a detection module 804, configured to perform target detection based on the adaptive fusion feature to obtain information of a target in the detection area.
In one possible implementation, the apparatus further includes:
the third determining module is used for determining the characteristics in the camera image based on the camera image acquired aiming at the detection area to obtain the image characteristics;
the second prediction module is used for predicting image semantic thermodynamic diagrams based on the image features, and the image semantic thermodynamic diagrams indicate the distribution situation of suspected targets in the camera images;
a fourth determination module for determining an adaptive fusion feature based on one or both of the point cloud semantic thermodynamic diagram and the point cloud feature, and one or both of the image semantic thermodynamic diagram and the image feature;
and the detection module is also used for carrying out target detection based on the self-adaptive fusion characteristics so as to acquire the information of the target in the detection area.
In one possible implementation manner, the fourth determining module is configured to:
respectively taking the respective features in the point cloud semantic thermodynamic diagram and the image semantic thermodynamic diagram and the point cloud features as one channel feature to obtain three channel features, and cascading the three channel features to obtain an initial feature;
obtaining a global context feature based on the initial feature, wherein the global context feature indicates the relevance between different channel features in the three channel features;
and superposing the global context characteristic and the initial characteristic to obtain the self-adaptive fusion characteristic.
In one possible implementation manner, the fourth determining module is configured to:
acquiring attention weight of each channel feature in the initial features, wherein the attention weight indicates the importance degree of each channel feature in the process of detecting the target;
and multiplying the attention weight of each channel feature in the initial features by the corresponding channel feature to obtain the global context feature.
In one possible implementation manner, the fourth determining module is configured to:
performing feature conversion on the global context features to extract depth features in the global context features to obtain converted global context features;
and superposing the converted global context characteristic and the initial characteristic to obtain the self-adaptive fusion characteristic.
In one possible implementation, the first prediction module is configured to:
and determining a point cloud semantic thermodynamic diagram through the first thermodynamic diagram prediction model based on the point cloud characteristics.
In one possible implementation, the apparatus further includes:
a first acquisition module for acquiring a plurality of sample aerial views and tag information for each of the plurality of sample aerial views, the tag information for each sample aerial view indicating location information of an object in the respective sample aerial view;
a first acquisition module further configured to acquire a feature in each of the plurality of sample aerial views;
and the first training module is used for training the first initialization model based on the features in each sample aerial view in the plurality of sample aerial views and the mark information of each sample aerial view to obtain a first thermodynamic diagram prediction model.
In one possible implementation, the second prediction module is configured to:
and determining an image semantic thermodynamic diagram through a second thermodynamic diagram prediction model based on the image features.
In one possible implementation, the apparatus further includes:
a second acquisition module to acquire a plurality of sample camera images and marker information for each of the plurality of sample camera images, the marker information for each sample camera image indicating position information of a target in the respective sample camera image;
a second acquisition module further configured to acquire features in each of the plurality of sample camera images;
and the second training module is used for training the second initialization model based on the features in each sample camera image in the plurality of sample camera images and the label information of each sample camera image to obtain a second thermodynamic diagram prediction model.
In the embodiment of the application, a point cloud semantic thermodynamic diagram capable of indicating the approximate distribution condition of the target to be detected is predicted according to point cloud data acquired aiming at a detection area. And then carrying out feature fusion based on the point cloud semantic thermodynamic diagram and the point cloud features to obtain self-adaptive fusion features, and further detecting the target according to the self-adaptive fusion features. That is, in the embodiment of the present application, a fuzzy prediction is performed on a target to be detected according to point cloud data to obtain prior knowledge of the target to be detected, and then the target is further detected based on the predicted prior knowledge and the accurate point cloud characteristics. Therefore, the fuzzy detection of the target can be firstly carried out in the target detection process, and then the accurate detection is carried out, so that the accuracy of the detected target is improved.
It should be noted that: in the target detection apparatus provided in the above embodiment, when performing target detection, only the division of the functional modules is illustrated, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the target detection apparatus and the target detection method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Fig. 9 is a schematic structural diagram of a terminal 900 according to an embodiment of the present application. The object detection apparatus in the foregoing embodiments can be implemented by the embodiment shown in fig. 9. The terminal 900 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion video Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion video Experts compression standard Audio Layer 4), a notebook computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminals, laptop terminals, desktop terminals, and the like.
In general, terminal 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement a target detection method provided by method embodiments herein.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 905 may be one, providing the front panel of the terminal 900; in other embodiments, the number of the display panels 905 may be at least two, and each of the display panels is disposed on a different surface of the terminal 900 or is in a foldable design; in other embodiments, the display 905 may be a flexible display disposed on a curved surface or a folded surface of the terminal 900. Even more, the display screen 905 may be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The Display panel 905 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. For stereo sound acquisition or noise reduction purposes, the microphones may be multiple and disposed at different locations of the terminal 900. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic location of the terminal 900 for navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 909 is used to provide power to the various components in terminal 900. The power source 909 may be alternating current, direct current, disposable or rechargeable. When power source 909 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 912 may detect a body direction and a rotation angle of the terminal 900, and the gyro sensor 912 may cooperate with the acceleration sensor 911 to acquire a 3D motion of the user on the terminal 900. The processor 901 can implement the following functions according to the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side bezel of the terminal 900 and/or underneath the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the user's holding signal of the terminal 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the display screen 905, the processor 901 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 901 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor Logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor Logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
Proximity sensor 916, also known as a distance sensor, is typically disposed on the front panel of terminal 900. The proximity sensor 916 is used to collect the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually decreases, the processor 901 controls the display 905 to switch from the screen-on state to the screen-off state; when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually increases, the processor 901 controls the display 905 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of terminal 900, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
The embodiments of the present application also provide a non-transitory computer-readable storage medium, and when instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to execute the target detection method provided in the above embodiments.
The embodiment of the present application further provides a computer program product containing instructions, which when run on a terminal, causes the terminal to execute the target detection method provided by the foregoing embodiment.
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The object detection apparatus in the foregoing embodiments can be implemented by the embodiment shown in fig. 10. The server may be a server in a cluster of background servers. Specifically, the method comprises the following steps:
the server 1000 includes a Central Processing Unit (CPU)1001, a system memory 1004 including a Random Access Memory (RAM)1002 and a Read Only Memory (ROM)1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The server 1000 also includes a basic input/output system (I/O system) 1006, which facilitates the transfer of information between devices within the computer, and a mass storage device 1007, which stores an operating system 1013, application programs 1014, and other program modules 1015.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse, keyboard, etc., for user input of information. Wherein a display 1008 and an input device 1009 are connected to the central processing unit 1001 via an input-output controller 1010 connected to the system bus 1005. The basic input/output system 1006 may also include an input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 1010 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the server 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1004 and mass storage device 1007 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1000 may also operate as a remote computer connected to a network through a network, such as the Internet. That is, the server 1000 may be connected to the network 1012 through a network interface unit 1011 connected to the system bus 1005, or the network interface unit 1011 may be used to connect to another type of network or a remote computer system (not shown).
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU. The one or more programs include instructions for performing the object detection methods provided by embodiments of the present application.
Embodiments of the present application further provide a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a server, enable the server to perform the object detection method provided in the foregoing embodiments.
Embodiments of the present application further provide a computer program product containing instructions, which when run on a server, cause the server to execute the object detection method provided by the foregoing embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only a preferred embodiment of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method of object detection, the method comprising:
determining features in a bird's-eye view of a detection area based on point cloud data acquired aiming at the detection area to obtain point cloud features, wherein the bird's-eye view indicates an image obtained by projecting a three-dimensional environment indicated by the point cloud data to a two-dimensional space, and the point cloud data comprises three-dimensional position information of each position point in the detection area, projected by a laser beam emitted by a laser radar;
predicting a point cloud semantic thermodynamic diagram based on the point cloud features, the point cloud semantic thermodynamic diagram indicating distribution of suspected objects in a three-dimensional environment indicated by the point cloud data;
determining an adaptive fusion feature based on the point cloud feature and the point cloud semantic thermodynamic diagram;
and carrying out target detection based on the self-adaptive fusion characteristics to acquire the information of the target in the detection area.
2. The method of claim 1, wherein the method further comprises:
determining features in the camera image based on the camera image acquired for the detection area, resulting in image features;
predicting an image semantic thermodynamic diagram based on the image features, the image semantic thermodynamic diagram indicating a distribution of suspected objects in the camera image;
determining an adaptive fusion feature based on the point cloud semantic thermodynamic diagram and one or both of the point cloud features and one or both of the image semantic thermodynamic diagram and the image features;
and carrying out target detection based on the self-adaptive fusion characteristics to acquire the information of the target in the detection area.
3. The method of claim 2, wherein the determining adaptive fusion features based on the point cloud semantic thermodynamic diagram and one or both of the point cloud features and the image semantic thermodynamic diagram and one or both of the image features comprises:
respectively taking the respective features in the point cloud semantic thermodynamic diagram and the image semantic thermodynamic diagram and the point cloud features as a channel feature to obtain three channel features, and cascading the three channel features to obtain an initial feature;
obtaining a global context feature based on the initial feature, wherein the global context feature indicates the relevance between different channel features in the three channel features;
and superposing the global context feature and the initial feature to obtain the self-adaptive fusion feature.
4. The method of claim 3, wherein said obtaining global context features based on said initial features comprises:
acquiring attention weight of each channel feature in the initial features, wherein the attention weight indicates the importance degree of each channel feature in the process of detecting the target;
and multiplying the attention weight of each channel feature in the initial features by the corresponding channel feature to obtain the global context feature.
5. The method of claim 3, wherein prior to said overlaying said global context feature and said initial feature, said method further comprises:
performing feature conversion on the global context features to extract depth features in the global context features to obtain converted global context features;
the obtaining the adaptive fusion feature by superposing the global context feature and the initial feature includes:
and superposing the converted global context characteristic and the initial characteristic to obtain the self-adaptive fusion characteristic.
6. The method of claim 1, wherein the predicting a point cloud semantic thermodynamic diagram based on the point cloud features comprises:
determining the point cloud semantic thermodynamic diagram through a first thermodynamic diagram prediction model based on the point cloud features.
7. The method of claim 6, wherein the method further comprises:
obtaining a plurality of sample aerial views and tag information for each of the plurality of sample aerial views, the tag information for each sample aerial view indicating location information for an object in the respective sample aerial view;
obtaining features in each of the plurality of sample aerial views;
training a first initialization model based on the features in each sample aerial view in the plurality of sample aerial views and the label information of each sample aerial view to obtain the first thermodynamic diagram prediction model.
8. The method of claim 2, wherein the predicting an image semantic thermodynamic diagram based on the image features comprises:
determining the image semantic thermodynamic diagram through a second thermodynamic diagram prediction model based on the image features.
9. The method of claim 8, wherein the method further comprises:
obtaining a plurality of sample camera images and marker information for each of the plurality of sample camera images, the marker information for each sample camera image indicating location information of a target in the respective sample camera image;
obtaining features in each of the plurality of sample camera images;
training a second initialization model based on features in each sample camera image in the plurality of sample camera images and the label information of each sample camera image to obtain the second thermodynamic diagram prediction model.
10. An object detection apparatus, characterized in that the apparatus comprises:
the device comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for determining a feature in a bird's eye view of a detection area based on point cloud data collected aiming at the detection area to obtain the point cloud feature, the bird's eye view indicates an image obtained by projecting a three-dimensional environment indicated by the point cloud data to a two-dimensional space, and the point cloud data comprises three-dimensional position information of each position point in the detection area, to which a laser beam emitted by a laser radar is projected;
a first prediction module, configured to predict a point cloud semantic thermodynamic diagram based on the point cloud features, where the point cloud semantic thermodynamic diagram indicates a distribution of suspected objects in a three-dimensional environment indicated by the point cloud data;
a second determining module, configured to determine an adaptive fusion feature based on the point cloud feature and the point cloud semantic thermodynamic diagram;
and a detection module, configured to perform target detection based on the adaptive fusion feature, so as to acquire information of the target in the detection area.
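The first determining module of claim 10 relies on projecting the lidar point cloud into a bird's eye view of the detection area. The sketch below shows one common way to do this; the detection-area extent, the 0.1 m grid resolution and the max-height encoding are illustrative assumptions that the claim does not prescribe.

```python
# Minimal NumPy sketch of projecting lidar points onto a bird's eye view grid.
# Grid extents, resolution and the max-height encoding are assumptions for
# illustration; the claim does not prescribe them.
import numpy as np

def point_cloud_to_bev(points: np.ndarray,
                       x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                       resolution=0.1) -> np.ndarray:
    """points: (N, 3) array of (x, y, z) lidar returns in metres.
    Returns a single-channel BEV map holding the maximum height per cell."""
    width = int((x_range[1] - x_range[0]) / resolution)
    height = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((height, width), dtype=np.float32)  # empty cells stay at 0

    # Keep only the points that fall inside the detection area.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    cols = ((pts[:, 0] - x_range[0]) / resolution).astype(np.int64)
    rows = ((pts[:, 1] - y_range[0]) / resolution).astype(np.int64)
    # Record the highest point observed in each grid cell.
    np.maximum.at(bev, (rows, cols), pts[:, 2].astype(np.float32))
    return bev
```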
11. A target detection apparatus, characterized in that the apparatus comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of the method of any one of claims 1 to 9.
12. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 9.
CN202110585764.1A 2021-05-27 2021-05-27 Target detection method, device and computer storage medium Active CN113205515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110585764.1A CN113205515B (en) 2021-05-27 2021-05-27 Target detection method, device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110585764.1A CN113205515B (en) 2021-05-27 2021-05-27 Target detection method, device and computer storage medium

Publications (2)

Publication Number Publication Date
CN113205515A true CN113205515A (en) 2021-08-03
CN113205515B CN113205515B (en) 2023-04-18

Family

ID=77023396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110585764.1A Active CN113205515B (en) 2021-05-27 2021-05-27 Target detection method, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN113205515B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679046B1 (en) * 2016-11-29 2020-06-09 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. Machine learning systems and methods of estimating body shape from images
CN110765894A (en) * 2019-09-30 2020-02-07 杭州飞步科技有限公司 Target detection method, device, equipment and computer readable storage medium
US20210110200A1 (en) * 2019-10-11 2021-04-15 Cargo Spectre Method, system, and apparatus for damage assessment and classification
CN112668596A (en) * 2019-10-15 2021-04-16 北京地平线机器人技术研发有限公司 Three-dimensional object recognition method and device and recognition model training method and device
CN110929692A (en) * 2019-12-11 2020-03-27 中国科学院长春光学精密机械与物理研究所 Three-dimensional target detection method and device based on multi-sensor information fusion
CN111914819A (en) * 2020-09-30 2020-11-10 杭州未名信科科技有限公司 Multi-camera fusion crowd density prediction method and device, storage medium and terminal
CN112418084A (en) * 2020-11-23 2021-02-26 同济大学 Three-dimensional target detection method based on point cloud time sequence information fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUNZHOU GE, et al.: "AFDet: Anchor Free One Stage 3D Object Detection", arXiv *
LIANG Yuexiang, et al.: "A water-surface small target detection algorithm for intelligent ships", Journal of Dalian University of Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965928A (en) * 2023-03-16 2023-04-14 安徽蔚来智驾科技有限公司 Point cloud feature enhancement method, target detection method, device, medium and vehicle
CN116150298A (en) * 2023-04-19 2023-05-23 山东盛途互联网科技有限公司 Data acquisition method and system based on Internet of things and readable storage medium
CN117572470A (en) * 2024-01-15 2024-02-20 广东邦盛北斗科技股份公司 Beidou system positioning updating method and system applied to artificial intelligence
CN117572470B (en) * 2024-01-15 2024-04-19 广东邦盛北斗科技股份公司 Beidou system positioning updating method and system applied to artificial intelligence

Also Published As

Publication number Publication date
CN113205515B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN111126182B (en) Lane line detection method, lane line detection device, electronic device, and storage medium
CN113205515B (en) Target detection method, device and computer storage medium
CN112307642B (en) Data processing method, device, system, computer equipment and storage medium
CN110544272A (en) face tracking method and device, computer equipment and storage medium
CN111062981A (en) Image processing method, device and storage medium
CN110570460A (en) Target tracking method and device, computer equipment and computer readable storage medium
CN112749613B (en) Video data processing method, device, computer equipment and storage medium
CN111192341A (en) Method and device for generating high-precision map, automatic driving equipment and storage medium
CN112036331A (en) Training method, device and equipment of living body detection model and storage medium
CN112150560B (en) Method, device and computer storage medium for determining vanishing point
CN110991457B (en) Two-dimensional code processing method and device, electronic equipment and storage medium
CN113378705B (en) Lane line detection method, device, equipment and storage medium
CN110920631A (en) Method and device for controlling vehicle, electronic equipment and readable storage medium
CN111325701B (en) Image processing method, device and storage medium
CN111179628B (en) Positioning method and device for automatic driving vehicle, electronic equipment and storage medium
CN113379705A (en) Image processing method, image processing device, computer equipment and storage medium
CN113821658A (en) Method, device and equipment for training encoder and storage medium
CN111444749B (en) Method and device for identifying road surface guide mark and storage medium
CN111753813A (en) Image processing method, device, equipment and storage medium
CN111754564B (en) Video display method, device, equipment and storage medium
CN111353513B (en) Target crowd screening method, device, terminal and storage medium
CN115965936A (en) Edge position marking method and equipment
CN114283395A (en) Method, device and equipment for detecting lane line and computer readable storage medium
CN113298040A (en) Key point detection method and device, electronic equipment and computer-readable storage medium
CN114898282A (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant