CN117765525A - Cross-modal distillation 3D target detection method and system based on monocular camera


Info

Publication number
CN117765525A
Authority
CN
China
Prior art keywords
network
distillation
target
student
teacher
Prior art date
Legal status
Pending
Application number
CN202311786873.5A
Other languages
Chinese (zh)
Inventor
杨勐
丁瑞
郑南宁
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202311786873.5A priority Critical patent/CN117765525A/en
Publication of CN117765525A publication Critical patent/CN117765525A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal distillation 3D target detection method and system based on a monocular camera: laser radar data are used to train a teacher network, camera data are used to train a student network, and the depth uncertainty of each target of the teacher network and the student network is calculated; the trained teacher network, the trained student network and the distillation module are combined into a distillation network, and the weight of each target is calculated for the weighted feature distillation and the weighted relation distillation; the loss functions of the weighted feature distillation and the weighted relation distillation are calculated based on these weights, the gradients of the neural network are back-propagated, and the parameters of the neural network are updated; when the updated neural network reaches the maximum number of iterations or the termination condition is met, the student network is retained for use in real scenes. The invention facilitates the application of camera-based scene perception algorithms in the automatic driving industry and the rapid deployment and development of related industries.

Description

Cross-modal distillation 3D target detection method and system based on monocular camera
Technical Field
The invention belongs to the technical field of computer vision and safety auxiliary driving, and particularly relates to a cross-modal distillation 3D target detection method and system based on a monocular camera.
Background
Three-dimensional (3D) object detection is a key component of scene perception for autonomous vehicles and robots. Currently, the mainstream solutions for 3D object detection usually depend on lidar sensors. However, the high cost of lidar sensors limits their use in practical settings. In contrast, monocular 3D detection provides a more convenient and cost-effective solution and is a research hotspot in academia and industry. However, there is still a large performance gap between monocular 3D detectors and lidar-based 3D detectors due to the lack of accurate depth information in a single image.
Existing monocular 3D detection schemes are limited by the ill-posed nature of monocular depth estimation, and their detection performance is seriously affected by the low accuracy of depth estimation. Given that the lidar sensor can acquire accurate depth information of the scene, pseudo-lidar methods attempt to use lidar data to provide monocular 3D detection with the depth information it lacks. However, these methods do not fully utilize the information in the lidar data and do not enable end-to-end training.
Cross-modal knowledge distillation provides an innovative solution for achieving more effective 3D target detection with the help of lidar data. In this approach, cross-modal distillation can greatly improve the accuracy of monocular 3D detection without adding any inference cost. However, because the laser radar data and the image data used in cross-modal distillation differ greatly across modalities, a serious negative migration problem arises, which limits further performance improvement. Specifically, negative migration can be divided into two problems: architectural inconsistency and feature over-fitting. The former means that the large architectural differences between laser radar 3D detectors and image 3D detectors make the distilled features hard to align; the latter means that the image 3D detector has no depth information as input at the inference stage, so the features fitted during training fail. Both seriously affect the accuracy of cross-modal distillation and constitute the bottleneck of current monocular 3D detection performance improvement.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a cross-modal distillation 3D target detection method and system based on a monocular camera, which solve the technical problem of negative migration caused by modal differences in cross-modal distillation, reduce the performance gap between camera-based 3D target detection methods and laser radar sensors, and facilitate both the application of camera-based scene perception algorithms in the automatic driving industry and the rapid deployment and development of related industries.
The invention adopts the following technical scheme:
a cross-modal distillation 3D target detection method based on a monocular camera comprises the following steps:
training a teacher network by using laser radar data, training a student network by using camera data, and calculating the depth uncertainty of each target of the teacher network and the student network;
combining the trained teacher network, the trained student network and the distillation module to form a distillation network, and calculating the weights of the targets by the weighted feature distillation and the weighted relation distillation;
calculating the loss functions of the weighted feature distillation and the weighted relation distillation based on the weights of the targets, back-propagating the gradients of the neural network, and updating the parameters of the neural network;
When the parameters of the updated neural network reach the maximum number of iterations or the termination condition is met, the student network is retained for the real scene; when facing a real scene, camera data are used as input, and the position, size and orientation of each target are obtained by inference from the network parameters trained by the student network, completing the three-dimensional localization of the targets.
Preferably, the teacher network takes as input depth maps converted from lidar data and the student network takes as input monocular images, and the teacher network's learning rate, data enhancement and optimizer settings remain consistent with the student network and parameters freeze after training is completed.
More preferably, the depth uncertainty of each target of the teacher network and the student network is calculated as follows:
where L_dep denotes the loss function of depth prediction, Z and Z* denote the predicted depth value and the ground-truth depth value respectively, and σ denotes the predicted depth uncertainty.
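The equation image itself is not reproduced in this text. A plausible reconstruction, assuming the Laplacian aleatoric-uncertainty form commonly used for depth estimation and consistent with the variable definitions above, is:

L_{dep} = \frac{\sqrt{2}}{\sigma}\,\lvert Z - Z^{*}\rvert + \log\sigma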
Preferably, the weighting of each target by weighted feature distillation and weighted relation distillation is calculated as follows:
where θ = T and θ = S indicate that the depth uncertainty is generated by the teacher network and the student network respectively, and σ_{θ,i} and ω_i denote the depth uncertainty and the weighting of the i-th target respectively.
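The weighting formula is likewise omitted from this text. One plausible form, assuming that lower depth uncertainty should yield a higher weight as the surrounding description requires, is:

\omega_{i} = \exp\!\left(-\sigma_{\theta,i}\right),\qquad \theta\in\{T,S\}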
Preferably, the loss function of the weighted feature distillation is calculated as follows:
where T_i^(l) and S_i^(l) denote the features of the i-th target at the l-th layer of the teacher network and the student network respectively, H_i^(l) and W_i^(l) denote the height and width of the feature map of the corresponding target, M_i^(l) denotes the 2D detection-box mask of the corresponding target, ω_i denotes the weighting of the corresponding target, L denotes the number of layers used for intermediate-layer feature distillation, N denotes the total number of targets, and the function F(·) computes the difference between the corresponding targets of the teacher network and the student network.
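The loss formula image is not reproduced here. A sketch consistent with the variable definitions above (a masked, size-normalised, per-target weighted difference), with the exact normalisation assumed rather than taken from the patent, is:

L_{wfd} = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N}\frac{\omega_{i}}{H_{i}^{(l)}W_{i}^{(l)}}\, F\!\left(M_{i}^{(l)}\odot T_{i}^{(l)},\; M_{i}^{(l)}\odot S_{i}^{(l)}\right)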
Preferably, the introduced depth uncertainty measures the importance of the relation between two targets of the teacher network and the student network, and after the weighted relation between two targets of the teacher network and the student network is obtained, the difference between the corresponding target relations of the teacher network and the student network is further calculated.
More preferably, the two target relationships of the teacher network and the student network are calculated as follows:
where D_{T[i,j]} and D_{S[i,j]} denote the weighted relations between the i-th and j-th targets of the teacher network and the student network respectively, σ_{T,i}^(l) and σ_{S,i}^(l) denote the depth uncertainties predicted by the teacher network and the student network for the i-th target at the l-th layer, T_i^(l) and S_i^(l) denote the features of the i-th target at the l-th layer of the teacher network and the student network, L denotes the number of layers used for intermediate-layer feature distillation, and the function R(·) is the basic formula for computing the relation between two targets.
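The relation formula image is omitted from this text. One plausible reconstruction, assuming the per-pair weighting multiplies the two targets' uncertainty-derived weights (shown for the teacher network; D_{S[i,j]} is defined analogously from the student features and uncertainties), is:

D_{T[i,j]} = \sum_{l=1}^{L}\exp\!\left(-\sigma_{T,i}^{(l)}\right)\exp\!\left(-\sigma_{T,j}^{(l)}\right)\, R\!\left(T_{i}^{(l)},\, T_{j}^{(l)}\right)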
More preferably, the difference in the target relationship corresponding to the teacher network and the student network is calculated as follows:
where N denotes the total number of targets, and the function G(·) computes the difference between the corresponding target relations of the teacher network and the student network.
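The corresponding loss formula is likewise omitted; a sketch consistent with the description (a difference G accumulated over all target pairs) is:

L_{wrd} = \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N} G\!\left(D_{T[i,j]},\, D_{S[i,j]}\right)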
In a second aspect, an embodiment of the present invention provides a cross-modal distillation 3D target detection system based on a monocular camera, including:
the training module is used for training a teacher network by using laser radar data, training a student network by using camera data and calculating the depth uncertainty of each target of the teacher network and the student network;
the weight module combines the trained teacher network, the trained student network and the distillation module to form a distillation network, and calculates the weight of each target by weighted feature distillation and weighted relation distillation;
the function module is used for calculating the loss functions of the weighted feature distillation and the weighted relation distillation based on the weight of each target, back-propagating the gradient of the neural network and updating the parameters of the neural network;
and the output module is used for reserving the student network for a real scene when the parameter of the updated neural network reaches the maximum iteration number or meets the termination condition.
Preferably, in the weight module, depth uncertainty is introduced to measure the importance of the relation between two targets of the teacher network and the student network, and after the weighted relation between two targets of the teacher network and the student network is obtained, the difference between the corresponding target relations of the teacher network and the student network is further calculated;
The two target relationships of the teacher network and the student network are calculated as follows:
where D_{T[i,j]} and D_{S[i,j]} denote the weighted relations between the i-th and j-th targets of the teacher network and the student network respectively, σ_{T,i}^(l) and σ_{S,i}^(l) denote the depth uncertainties predicted by the teacher network and the student network for the i-th target at the l-th layer, T_i^(l) and S_i^(l) denote the features of the i-th target at the l-th layer of the teacher network and the student network, L denotes the number of layers used for intermediate-layer feature distillation, and the function R(·) is the basic formula for computing the relation between two targets;
the difference of the target relationship corresponding to the teacher network and the student network is calculated as follows:
where N denotes the total number of targets, and the function G(·) computes the difference between the corresponding target relations of the teacher network and the student network.
Compared with the prior art, the invention has at least the following beneficial effects:
A cross-modal distillation 3D target detection method based on a monocular camera comprises, in the neural network training stage, a teacher network, a student network and a distillation module, wherein the student network learns the features of the teacher network through the distillation loss functions during training and thereby supplements the depth information lacking in monocular images; in the test stage, only the monocular image is used as input to the student network, ensuring that the network achieves more accurate 3D target detection without adding any inference cost.
Further, the loss function of the depth estimation can ensure that the uncertainty of the prediction accurately measures the quality of the depth prediction, and when the depth estimation accuracy is higher, the uncertainty of the predicted depth is lower, and vice versa. Meanwhile, the loss function ensures that targets with higher estimation precision have higher weights, ensures the attention of the network to main targets, and avoids the excessive fitting to difficult targets.
Further, the weights in the weighted feature distillation and the weighted relation distillation distinguish the importance of different targets in distillation: positively migrating targets are given higher weights and negatively migrating targets are given lower weights, ensuring that the distillation process is driven mainly by positive migration and avoiding interference from negative migration.
Further, classical feature distillation loss functions treat all targets as equally important, but in practice different targets play different roles in distillation, and some targets may even reduce distillation accuracy, i.e. the negative migration problem. For this purpose, the designed weighted distillation distinguishes the targets and ensures that the distillation process is dominated by positive migration.
Further, classical relational distillation considers the relation between any two targets equally important, but the relation between two positively migrating targets is clearly far more important than that between two negatively migrating targets. To this end, further weighting the relational distillation of different targets helps the network focus primarily on the relations between positively migrating targets.
Further, the difference between target pairs of the teacher network and the student network is calculated so that the student network learns the inter-target relations of the teacher network; since the importance of different targets has already been distinguished above, the relations between different target pairs are also distinguished here, helping the student network concentrate on the relations between positively migrating targets.
It will be appreciated that the advantages of the second aspect may be found in the relevant description of the first aspect, and will not be described in detail herein.
In summary, the laser radar data are used to train the teacher network and the camera data are used to train the student network; in the training stage, the weighted feature distillation and weighted relation distillation loss functions distinguish the importance of different targets and ensure that the student network transfers only beneficial information from the teacher network; in the inference stage, only a monocular image is used as input to the student network for inference, ensuring more accurate 3D target detection without adding any inference cost; meanwhile, the scheme can be applied to various basic networks.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a diagram of a neural network framework employed in the present invention;
FIG. 3 is a graph showing the comparison of the effects of the distillation module of the present invention after use in three advanced basic models;
FIG. 4 is a graph of test results according to the present invention;
FIG. 5 is a schematic diagram of a computer device according to an embodiment of the present invention;
fig. 6 is a block diagram of a chip according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description of the present invention, it will be understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In the present invention, the character "/" generally indicates that the front and rear related objects are an or relationship.
It should be understood that although the terms first, second, third, etc. may be used to describe the preset ranges, etc. in the embodiments of the present invention, these preset ranges should not be limited to these terms. These terms are only used to distinguish one preset range from another. For example, a first preset range may also be referred to as a second preset range, and similarly, a second preset range may also be referred to as a first preset range without departing from the scope of embodiments of the present invention.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to detection". Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event)", depending on the context.
Various structural schematic diagrams according to the disclosed embodiments of the present invention are shown in the accompanying drawings. The figures are not drawn to scale, wherein certain details are exaggerated for clarity of presentation and may have been omitted. The shapes of the various regions, layers and their relative sizes, positional relationships shown in the drawings are merely exemplary, may in practice deviate due to manufacturing tolerances or technical limitations, and one skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions as actually required.
The invention provides a cross-modal distillation 3D target detection method based on a monocular camera, which systematically researches and relieves the problem of negative migration of cross-modal distillation in monocular 3D detection, including the problem of inconsistent architecture and the problem of excessive fitting of features; the method is used for solving the problem of negative migration caused by modal differences in cross-modal distillation, and is beneficial to further reducing the performance gap between a 3D target detection method based on a camera and a laser radar sensor.
Referring to fig. 2, the method for detecting a cross-modal distillation 3D target based on a monocular camera is divided into a teacher network using laser radar sensor data as input, a student network using camera data as input, and a distillation module between the two; the student network uses a single image as input, the teacher network uses a similar architecture but a different input, which can be LiDAR data or a fusion of LiDAR and images, and the two new distillation modules comprise a depth-related selective feature distillation module and a depth-related selective relation distillation module.
Referring to fig. 1, the method for detecting a cross-modal distillation 3D target based on a monocular camera of the present invention includes the following steps:
s1, independently training student network
Any convolution network-based method can be directly used as the student network without any modification, and the detection precision of the basic model can be greatly improved without increasing the reasoning cost. If the base model already has a pre-trained model, this step can be skipped and the cross-modal distillation can be performed directly. Otherwise, the student network can be trained independently, and the training setting is identical to the setting of the basic model.
S2, training teacher network alone
In order to reduce as far as possible the feature-misalignment problem caused by inconsistent teacher and student network architectures, the teacher network and the student network are kept completely consistent in network architecture and differ only in their inputs. Specifically, the teacher network takes as input depth maps converted from lidar data, while the student network takes as input monocular images. The learning rate, data enhancement and optimizer settings of the teacher network are also consistent with those of the student network, and the parameters are frozen after training is completed.
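For illustration, one common way to obtain such a depth map is to project the lidar point cloud into the image plane using the camera intrinsics and extrinsics; the following sketch is an assumption about this conversion, which the text does not spell out:

import numpy as np

def lidar_to_depth_map(points, T_cam_lidar, K, height, width):
    # Project lidar points (N, 3) into the image plane to build a sparse depth map.
    # T_cam_lidar: (4, 4) lidar-to-camera extrinsic matrix; K: (3, 3) camera intrinsics.
    # Illustrative sketch only; not necessarily the exact procedure of the invention.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # homogeneous coordinates
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                   # points in the camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                         # keep points in front of the camera
    uvz = (K @ pts_cam.T).T
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    z = pts_cam[:, 2]
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(-z[valid])                                # farthest first, so nearer points overwrite
    depth[v[valid][order], u[valid][order]] = z[valid][order]
    return depth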
S3, degree of depth uncertainty
The depth uncertainty represents the degree of confidence that the system has in the accuracy of the predicted depth information. In autonomous driving, introducing depth uncertainty helps ensure safe and reliable decisions in complex and dynamic environments. In cross-modal distillation, what the lidar sensor primarily transfers to the student network is depth information; however, excessive reliance on the teacher network's depth information can lead to serious over-fitting problems. Selectively learning depth information and thereby avoiding over-fitting to teacher network features is the core idea of the invention, so the invention adopts depth uncertainty as the reference index for selectively learning teacher network features.
The formula for calculating the depth uncertainty of the present invention is as follows:
where L_dep denotes the loss function of depth prediction, Z and Z* denote the predicted depth value and the ground-truth depth value respectively, and σ denotes the predicted depth uncertainty. Under this formula, targets whose depth is predicted with low accuracy receive higher depth uncertainty, and targets whose depth is predicted with high accuracy receive lower depth uncertainty.
The formula for obtaining the weighted weight is as follows, according to whether the adopted depth uncertainty is generated by a teacher network or a student network:
where θ = T and θ = S indicate that the depth uncertainty is generated by the teacher network and the student network respectively, and σ_{θ,i} and ω_i denote the depth uncertainty and the weighting of the i-th target respectively.
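As a concrete illustration of this step, the following sketch computes the depth-uncertainty loss and the per-target distillation weights; the Laplacian loss form and the exp(-σ) weighting are assumptions consistent with the description above, not formulas quoted from the patent:

import torch

def depth_uncertainty_loss(pred_depth, gt_depth, log_sigma):
    # Assumed Laplacian aleatoric-uncertainty depth loss: large depth errors push the
    # predicted uncertainty sigma up, small errors push it down.
    sigma = torch.exp(log_sigma)
    return ((2.0 ** 0.5) / sigma * torch.abs(pred_depth - gt_depth) + log_sigma).mean()

def distillation_weights(sigma):
    # Assumed mapping from per-target depth uncertainty to distillation weight:
    # lower uncertainty (better depth) gives a higher weight.
    return torch.exp(-sigma)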
S4, depth-dependent selective characteristic distillation module
After the teacher network and the student network are trained independently, they are combined with the distillation module to form a distillation network, in which the network parameters of the teacher network are frozen and are not updated during training. The student network can be any of a variety of basic models and can be plugged into the framework of the invention without any modification. Beyond this, the core of the invention lies in the two distillation modules used to alleviate the negative migration problem.
In traditional feature knowledge distillation, the student network is forced to directly mimic the feature maps of the teacher network, but in cross-modal distillation the effect of such distillation can be severely limited. Therefore, the invention improves the traditional feature distillation module by introducing depth uncertainty, alleviating the feature over-fitting problem in cross-modal distillation. The loss function L_wfd of the weighted feature distillation is calculated as follows:
where T_i^(l) and S_i^(l) denote the features of the i-th target at the l-th layer of the teacher network and the student network respectively, H_i^(l) and W_i^(l) denote the height and width of the feature map of the corresponding target, M_i^(l) denotes the 2D detection-box mask of the corresponding target, ω_i denotes the weighting of the corresponding target, L denotes the number of layers used for intermediate-layer feature distillation, N denotes the total number of targets, and the function F(·) computes the difference between the corresponding targets of the teacher network and the student network; the L2 function is adopted in the invention.
Compared with the traditional characteristic distillation module, the invention has two main improvements. In one aspect, the invention uses a 2D detection frame to distinguish the foreground from the background, and can effectively filter noise in the background, which is consistent with the main goal of 3D detection being to detect foreground objects. On the other hand, an effective measurement standard is used for distinguishing the importance of the object, so that the method is allowed to selectively learn the effective characteristics of the object from a teacher network, interference of other characteristics is avoided, and the problem of characteristic overfitting in cross-modal distillation is effectively relieved.
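A minimal sketch of the depth-related selective feature distillation described above, assuming that per-target feature crops, 2D-box masks and weights are already available; the helper structure and the normalisation are illustrative, not taken from the patent:

import torch
import torch.nn.functional as F

def weighted_feature_distill_loss(teacher_feats, student_feats, masks, weights):
    # teacher_feats, student_feats: lists of per-target feature crops, each (C, H_i, W_i).
    # masks: list of per-target 2D detection-box masks, each (H_i, W_i), 1 inside the box.
    # weights: iterable of per-target weights derived from the depth uncertainty.
    loss = 0.0
    for t, s, m, w in zip(teacher_feats, student_feats, masks, weights):
        diff = F.mse_loss(s * m, t * m, reduction="sum") / m.sum().clamp(min=1)  # masked L2, size-normalised
        loss = loss + w * diff                                                   # selective per-target weighting
    return loss / max(len(teacher_feats), 1)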
S5, depth-related selective relation distillation module
Relational distillation conveys structural knowledge by capturing the relations between objects. Such structural knowledge is only weakly correlated with the input modality of the network and is therefore not easily affected by modal differences, which makes relational distillation suitable for cross-modal distillation 3D target detection. Conventional relational distillation methods generally treat every foreground object equally. However, in cross-modal distillation, the importance of different targets is clearly different. For example, targets that can be accurately predicted generally convey more effective information than other targets, and the relations between these targets are clearly more important than the relations between other targets. Therefore, selectively migrating the structural knowledge represented by inter-target relations should focus on the relations between targets that bring positive migration effects.
Based on the observation, the invention designs a depth-related selective relational distillation module, and the importance degree of two target relations is measured by introducing depth uncertainty, so that selective relational distillation is realized. The calculation formula for calculating the two target relations is as follows:
where D_{T[i,j]} and D_{S[i,j]} denote the weighted relations between the i-th and j-th targets of the teacher network and the student network respectively, σ_{T,i}^(l) and σ_{S,i}^(l) denote the depth uncertainties predicted by the teacher network and the student network for the i-th target at the l-th layer, T_i^(l) and S_i^(l) denote the features of the i-th target at the l-th layer of the teacher network and the student network, L denotes the number of layers used for intermediate-layer feature distillation, and the function R(·) is the basic formula for computing the relation between two targets; the L1 function is adopted in the invention.
After obtaining the weighted relation between the teacher network and the student network, the difference of the target relation corresponding to the teacher network and the student network needs to be further calculated, and the calculation formula is as follows:
where N denotes the total number of targets, and the function G(·) computes the difference between the corresponding target relations of the teacher network and the student network; the L1 function is adopted in the invention.
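A minimal sketch of the depth-related selective relation distillation, assuming pooled per-target feature vectors and per-target depth uncertainties; the pairwise L1 relation R and the exp(-σ_i)·exp(-σ_j) pair weighting are assumptions consistent with the description, while the difference G between the teacher and student relation matrices is taken as L1 as stated in the text:

import torch

def weighted_relation_distill_loss(teacher_vecs, student_vecs, teacher_sigma, student_sigma):
    # teacher_vecs, student_vecs: (N, C) pooled per-target features.
    # teacher_sigma, student_sigma: (N,) per-target depth uncertainties.
    def relation_matrix(vecs, sigma):
        rel = torch.cdist(vecs, vecs, p=1)          # R(., .) taken as a pairwise L1 distance
        w = torch.exp(-sigma)
        return w[:, None] * w[None, :] * rel        # weight each pair by both targets' confidence
    d_t = relation_matrix(teacher_vecs, teacher_sigma)
    d_s = relation_matrix(student_vecs, student_sigma)
    n = teacher_vecs.shape[0]
    return torch.abs(d_t - d_s).sum() / max(n * n, 1)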
S6, applying to the real scene
The training stage adopts a teacher network for laser radar data input and a student network for image data input, and the reasoning stage only uses the image data as input and only retains the student network. The method ensures that the precision of the basic model is greatly improved on the premise of not increasing any reasoning cost in the reasoning stage.
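The following sketch illustrates one training step and the inference-stage usage under these assumptions; the loss callables, their signatures and the lambda weights are placeholders rather than names or values taken from the patent:

import torch

def distillation_train_step(teacher, student, optimizer, image, depth_map, targets,
                            detection_loss, wfd_loss, wrd_loss,
                            lambda_wfd=1.0, lambda_wrd=1.0):
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)              # teacher parameters stay frozen
    with torch.no_grad():
        t_out = teacher(depth_map)           # teacher consumes the lidar-derived depth map
    s_out = student(image)                   # student consumes the monocular image
    loss = (detection_loss(s_out, targets)
            + lambda_wfd * wfd_loss(t_out, s_out)
            + lambda_wrd * wrd_loss(t_out, s_out))
    optimizer.zero_grad()
    loss.backward()                          # back-propagate the gradients
    optimizer.step()                         # only the student parameters are updated
    return loss.item()

# Inference stage: the teacher and the distillation modules are discarded; only the
# trained student network with a camera image as input is retained.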
In still another embodiment of the present invention, a cross-modal distillation 3D target detection system based on a monocular camera is provided, and the system can be used to implement the above method for detecting a cross-modal distillation 3D target based on a monocular camera; specifically, the cross-modal distillation 3D target detection system based on a monocular camera includes a training module, a weight module, a function module and an output module.
The training module trains a teacher network by using laser radar data, trains a student network by using camera data, and calculates the depth uncertainty of each target of the teacher network and the student network;
the weight module combines the trained teacher network, the trained student network and the distillation module to form a distillation network, and calculates the weight of each target by weighted feature distillation and weighted relation distillation;
preferably, in the weight module, the degree of depth uncertainty is introduced to measure the importance degree of the two target relationships of the teacher network and the student network, and after the weighted relationship between the two targets of the teacher network and the student network is obtained, the difference of the target relationships corresponding to the teacher network and the student network needs to be further calculated;
the two target relationships of the teacher network and the student network are calculated as follows:
where D_{T[i,j]} and D_{S[i,j]} denote the weighted relations between the i-th and j-th targets of the teacher network and the student network respectively, σ_{T,i}^(l) and σ_{S,i}^(l) denote the depth uncertainties predicted by the teacher network and the student network for the i-th target at the l-th layer, T_i^(l) and S_i^(l) denote the features of the i-th target at the l-th layer of the teacher network and the student network, L denotes the number of layers used for intermediate-layer feature distillation, and the function R(·) is the basic formula for computing the relation between two targets;
The difference of the target relationship corresponding to the teacher network and the student network is calculated as follows:
where N denotes the total number of targets, and the function G(·) computes the difference between the corresponding target relations of the teacher network and the student network.
The function module is used for calculating the loss functions of the weighted feature distillation and the weighted relation distillation based on the weight of each target, back-propagating the gradient of the neural network and updating the parameters of the neural network;
and the output module is used for reserving the student network for a real scene when the parameters of the updated neural network reach the maximum iteration times or meet the termination condition, using camera data as input when the student network faces the real scene, and reasoning according to the network parameters trained by the student network to obtain the position, the size and the orientation of each target so as to finish the three-dimensional positioning of the targets.
In yet another embodiment of the present invention, a terminal device is provided, the terminal device including a processor and a memory, the memory for storing a computer program, the computer program including program instructions, the processor for executing the program instructions stored by the computer storage medium. The processor may be a central processing unit (Central Processing Unit, CPU), but may also be another general-purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., which are the computational core and control core of the terminal adapted to implement one or more instructions, in particular to load and execute one or more instructions to implement the corresponding method flow or corresponding functions; the processor according to the embodiment of the invention can be used for the operation of a cross-modal distillation 3D target detection method based on a monocular camera, and comprises the following steps:
Training a teacher network by using laser radar data, training a student network by using camera data, and calculating the depth uncertainty of each target of the teacher network and the student network; combining the trained teacher network, the trained student network and the distillation module to form a distillation network, and calculating the weights of the targets by the weighted feature distillation and the weighted relation distillation; calculating the loss functions of the weighted feature distillation and the weighted relation distillation based on the weights of the targets, back-propagating the gradients of the neural network, and updating the parameters of the neural network; when the parameters of the updated neural network reach the maximum number of iterations or the termination condition is met, the student network is retained for a real scene; when facing the real scene, camera data are used as input, and the position, size and orientation of each target are obtained by inference from the network parameters trained by the student network, completing the three-dimensional localization of the targets.
Referring to fig. 5, the terminal device is a computer device, and the computer device 60 of this embodiment includes: a processor 61, a memory 62, and a computer program 63 stored in the memory 62 and executable on the processor 61, the computer program 63 when executed by the processor 61 implements the monocular camera-based cross-modal distillation 3D target detection method of the embodiment, and is not described in detail herein to avoid repetition. Alternatively, the computer program 63 when executed by the processor 61 implements the functionality of the models/units of the embodiment of the monocular camera-based cross-modal distillation 3D object detection system, and is not described in detail herein to avoid repetition.
The computer device 60 may be a desktop computer, a notebook computer, a palm top computer, a cloud server, or the like. Computer device 60 may include, but is not limited to, a processor 61, a memory 62. It will be appreciated by those skilled in the art that fig. 5 is merely an example of a computer device 60 and is not intended to limit the computer device 60, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., a computer device may also include an input-output device, a network access device, a bus, etc.
The processor 61 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or memory of the computer device 60. The memory 62 may also be an external storage device of the computer device 60, such as a plug-in hard disk provided on the computer device 60, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like.
Further, the memory 62 may also include both internal storage units and external storage devices of the computer device 60. The memory 62 is used to store computer programs and other programs and data required by the computer device. The memory 62 may also be used to temporarily store data that has been output or is to be output.
Referring to fig. 6, the terminal device is a chip, and the chip 600 of this embodiment includes a processor 622, which may be one or more in number, and a memory 632 for storing a computer program executable by the processor 622. The computer program stored in memory 632 may include one or more modules each corresponding to a set of instructions. Further, the processor 622 may be configured to execute the computer program to perform the above-described method of cross-modality distillation 3D target detection based on a monocular camera.
In addition, chip 600 may further include a power supply component 626 and a communication component 650, where power supply component 626 may be configured to perform power management of chip 600, and communication component 650 may be configured to enable communication of chip 600, e.g., wired or wireless communication. In addition, the chip 600 may also include an input/output (I/O) interface 658. Chip 600 may operate based on an operating system stored in memory 632.
In a further embodiment of the present invention, the present invention also provides a storage medium, in particular, a computer readable storage medium (Memory), which is a Memory device in a terminal device, for storing programs and data. It will be appreciated that the computer readable storage medium herein may include both a built-in storage medium in the terminal device and an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium may be a high-speed RAM Memory or a Non-Volatile Memory (Non-Volatile Memory), such as at least one magnetic disk Memory.
One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the respective steps of the above-described embodiments with respect to a method for cross-modal distillation 3D target detection based on a monocular camera; one or more instructions in a computer-readable storage medium are loaded by a processor and perform the steps of:
Training a teacher network by using laser radar data, training a student network by using camera data, and calculating the depth uncertainty of each target of the teacher network and the student network; combining the trained teacher network, the trained student network and the distillation module to form a distillation network, and calculating the weights of the targets by the weighted feature distillation and the weighted relation distillation; calculating the loss functions of the weighted feature distillation and the weighted relation distillation based on the weights of the targets, back-propagating the gradients of the neural network, and updating the parameters of the neural network; when the parameters of the updated neural network reach the maximum number of iterations or the termination condition is met, the student network is retained for a real scene; when facing the real scene, camera data are used as input, and the position, size and orientation of each target are obtained by inference from the network parameters trained by the student network, completing the three-dimensional localization of the targets.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To verify the effectiveness of the present invention, experiments were performed on the KITTI test set.
The KITTI data set, jointly created by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago, is currently the most widely adopted computer vision algorithm evaluation data set for automatic driving scenes. KITTI comprises real image data acquired in a plurality of scenes such as urban areas, villages and highways, and covers occlusion and truncation of various degrees. Each image contains at most 15 vehicles and 30 pedestrians. The data set contains 7481 images for training and 7518 images for testing.
The present invention divides the training data into a training set containing 3712 images and a validation set containing 3769 images according to a previously well-recognized division method.
The data set is mainly used for detecting cars, pedestrians and cyclists, and targets are divided into three difficulty levels, easy, moderate and hard, according to detection difficulty. The standard evaluation method is to compare the average precision of the methods under two different views, the 3D view and the bird's-eye view (BEV).
Referring to fig. 3, the method of the present invention can be applied to a plurality of basic models in a plug-and-play manner, and three representative high-precision monocular 3D detection models are selected as the basic models for verifying the effectiveness of the method. Fig. 3 shows that the method designed by the invention can significantly improve the precision of the basic model and is not limited by the performance of the basic model.
Further, the present embodiment is compared on the test set with the best currently published methods, and the comparison results are shown in Table 1. The results show that the method of the invention achieves the best performance under the two different views, 3D and BEV, which effectively demonstrates the effectiveness of the module of the invention.
TABLE 1
Referring to fig. 4, in order to show the detection effect of the present invention more intuitively, the detected 3D boxes are projected onto the monocular image and the projected box outlines are drawn. The method accurately frames the corresponding targets and achieves good detection of occluded and distant objects.
In summary, the method and the system for detecting the cross-modal distillation 3D target based on the monocular camera can be applied to various basic models without modification, can greatly improve the detection precision of the basic models without increasing the reasoning cost, and have better universality. According to the invention, the negative migration phenomenon in the cross-modal distillation 3D target detection is analyzed and relieved for the first time, the problems of inconsistent architecture and characteristic overfitting are solved, and the method has strong generalization.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal and method may be implemented in other manners. For example, the apparatus/terminal embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a usb disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a Random-Access Memory (RAM), an electrical carrier wave signal, a telecommunications signal, a software distribution medium, etc., it should be noted that the content of the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in jurisdictions, such as in some jurisdictions, according to the legislation and patent practice, the computer readable medium does not include electrical carrier wave signals and telecommunications signals.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. The method for detecting the cross-modal distillation 3D target based on the monocular camera is characterized by comprising the following steps of:
training a teacher network by using laser radar data, training a student network by using camera data, and calculating the depth uncertainty of each target of the teacher network and the student network;
combining the trained teacher network, the trained student network and the distillation module to form a distillation network, and calculating the weights of the targets by the weighted feature distillation and the weighted relation distillation;
Calculating a loss function of the weighted feature distillation and the weighted relation distillation based on the weights of the targets of the weighted feature distillation and the weighted relation distillation, back-propagating the gradient of the neural network, and updating parameters of the neural network;
when the parameters of the updated neural network reach the maximum iteration times or the termination condition is met, the student network is reserved for a real scene, when the real scene is faced, camera data are used as input, and the position, the size and the orientation of each target are obtained by reasoning according to the network parameters trained by the student network, so that the three-dimensional positioning of the targets is completed.
2. The method for detecting a 3D target by cross-modal distillation based on a monocular camera according to claim 1, wherein the teacher network takes as input a depth map converted from lidar data and the student network takes as input monocular images, and learning rate, data enhancement and optimizer settings of the teacher network are consistent with those of the student network and parameters are frozen after training is completed.
3. The method for detecting the target of the cross-modal distillation 3D based on the monocular camera according to claim 2, wherein the depth uncertainty of each target of a teacher network and a student network is calculated as follows:
wherein L_dep represents the loss function of depth prediction, Z and Z* represent the predicted depth value and the ground-truth depth value respectively, and σ represents the predicted depth uncertainty.
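For illustration only: the equation image is not reproduced here, but an uncertainty-aware depth loss of this kind is commonly instantiated as a Laplacian negative log-likelihood (as in GUPNet-style monocular heads). The concrete form below is an assumption and may differ from the patented L_dep.

```python
import torch

def depth_uncertainty_loss(z_pred, z_gt, log_sigma):
    """Laplacian aleatoric-uncertainty depth loss: the |Z - Z*| error is scaled by 1/sigma
    and a log(sigma) penalty discourages the network from inflating its uncertainty.
    This is one common form of L_dep, not necessarily the patented one."""
    sigma = torch.exp(log_sigma)                    # log(sigma) is predicted for numerical stability
    return ((2.0 ** 0.5) / sigma * torch.abs(z_pred - z_gt) + log_sigma).mean()
```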
4. The cross-modal distillation 3D target detection method based on a monocular camera according to claim 1, wherein the weight of each target for weighted feature distillation and weighted relation distillation is calculated as follows:
wherein θ = T and θ = S indicate that the depth uncertainty σ is generated by the teacher network or the student network respectively, and σ_θ,i and ω_i represent the depth uncertainty and the weighted weight of the i-th target, respectively.
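The weighting formula itself appears as an image in the original and is not reproduced here. As a purely hypothetical stand-in, the per-target uncertainties could be mapped to weights so that confident targets contribute more, for example via a softmax over negative uncertainty:

```python
import torch

def target_weights(sigma, temperature=1.0):
    """Hypothetical mapping from per-target depth uncertainties sigma_{theta,i} (shape (N,))
    to distillation weights omega_i: smaller uncertainty -> larger weight.
    The patented weighting formula may be entirely different."""
    w = torch.softmax(-sigma / temperature, dim=0)  # normalized weights, sum to 1
    return w * sigma.numel()                        # rescale so the average weight is 1
```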
5. The cross-modal distillation 3D target detection method based on a monocular camera according to claim 1, wherein the loss function of the weighted feature distillation is calculated as follows:
wherein T_i^(l) and S_i^(l) represent the features of the i-th target at the l-th layer of the teacher network and the student network respectively, H_i^(l) and W_i^(l) represent the length and width of the feature map of the corresponding target respectively, M_i represents the 2D detection-box mask of the corresponding target, ω_i represents the weighted weight of the corresponding target, L represents the number of layers used for intermediate-layer feature distillation, N represents the total number of targets, and the function F(·) represents the difference function between the corresponding targets of the teacher network and the student network.
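An illustrative reading of this weighted feature distillation term, with F(·) assumed to be a masked mean-squared error over each target's 2D box region; the per-target feature crops, masks and weights are placeholder inputs:

```python
import torch

def weighted_feature_distill(t_feats, s_feats, masks, weights):
    """Weighted feature distillation over L intermediate layers and N targets.
    t_feats[l][i] / s_feats[l][i]: teacher / student feature crops of the i-th target
    at the l-th layer, shape (C, H, W); masks[l][i]: 2D detection-box mask of the same
    spatial size; weights[i]: uncertainty-derived weight omega_i.
    F(.) is taken here to be a masked mean-squared error, which is only one plausible choice."""
    num_layers = len(t_feats)
    num_targets = len(weights)
    loss = 0.0
    for l in range(num_layers):
        for i in range(num_targets):
            t, s, m = t_feats[l][i], s_feats[l][i], masks[l][i]
            norm = m.sum().clamp(min=1.0) * t.shape[0]          # normalize by C * masked area
            loss = loss + weights[i] * (((t - s) ** 2) * m).sum() / norm
    return loss / (num_layers * max(num_targets, 1))
```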
6. The cross-modal distillation 3D target detection method based on a monocular camera according to claim 1, wherein the depth uncertainty is introduced to measure the importance of the relation between two targets in the teacher network and the student network, and after the weighted pairwise target relations of the teacher network and the student network are obtained, the difference between the corresponding target relations of the teacher network and the student network is further calculated.
7. The cross-modal distillation 3D target detection method based on a monocular camera according to claim 6, wherein the pairwise target relations of the teacher network and the student network are calculated as follows:
wherein D_T[i,j] and D_S[i,j] represent the weighted relation between the i-th and j-th targets in the teacher network and the student network respectively, σ_T,i^(l) and σ_S,i^(l) represent the depth uncertainties predicted by the teacher network and the student network for the i-th target at the l-th layer respectively, T_i^(l) and S_i^(l) represent the features of the i-th target at the l-th layer of the teacher network and the student network respectively, L represents the number of layers used for intermediate-layer feature distillation, and the function R(·) represents the basic formula for calculating the relation between two targets.
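An illustrative sketch of an uncertainty-weighted pairwise relation matrix; using cosine similarity for R(·) and an exp(-σ) confidence weighting are assumptions, not the patented formula:

```python
import torch
import torch.nn.functional as F

def pairwise_relations(feats, sigma):
    """Uncertainty-weighted relation matrix D[i, j] for one layer.
    feats: (N, C) per-target feature vectors; sigma: (N,) predicted depth uncertainties.
    R(.) is assumed to be cosine similarity; the pair weight combines the confidence
    exp(-sigma) of both targets so that uncertain targets contribute less."""
    f = F.normalize(feats, dim=1)
    rel = f @ f.t()                                 # R(i, j): cosine similarity between targets
    conf = torch.exp(-sigma)                        # per-target confidence
    weight = conf[:, None] * conf[None, :]          # pairwise importance
    return weight * rel                             # weighted relation D[i, j]
```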
8. The cross-modal distillation 3D target detection method based on a monocular camera according to claim 6, wherein the difference between the corresponding target relations of the teacher network and the student network is calculated as follows:
wherein N represents the total number of targets, and the function G(·) represents the difference function between the corresponding target relations of the teacher network and the student network.
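An illustrative relation-distillation loss, assuming G(·) is an elementwise L1 difference between the teacher and student relation matrices averaged over all N × N pairs:

```python
import torch

def relation_distill_loss(d_teacher, d_student):
    """Difference G(.) between the weighted relation matrices of the teacher and the student.
    An L1 penalty averaged over all N*N pairs is assumed here."""
    n = d_teacher.shape[0]
    return torch.abs(d_teacher - d_student).sum() / max(n * n, 1)
```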
9. A cross-modal distillation 3D target detection system based on a monocular camera, characterized by comprising:
a training module, used for training a teacher network by using laser radar data, training a student network by using camera data, and calculating the depth uncertainty of each target of the teacher network and the student network;
a weight module, used for combining the trained teacher network, the trained student network and the distillation module to form a distillation network, and calculating the weight of each target for weighted feature distillation and weighted relation distillation;
a function module, used for calculating a loss function of the weighted feature distillation and the weighted relation distillation based on the weight of each target, back-propagating the gradient of the neural network, and updating the parameters of the neural network;
and an output module, used for retaining the student network for the real scene when the update of the neural network parameters reaches the maximum number of iterations or a termination condition is met, using camera data as input when facing the real scene, and inferring the position, size and orientation of each target according to the trained student network parameters, thereby completing the three-dimensional localization of the targets.
10. The cross-modal distillation 3D target detection system based on a monocular camera according to claim 9, wherein the weight module introduces the depth uncertainty to measure the importance of the relation between two targets in the teacher network and the student network, and after the weighted pairwise target relations of the teacher network and the student network are obtained, the difference between the corresponding target relations of the teacher network and the student network is further calculated;
the pairwise target relations of the teacher network and the student network are calculated as follows:
wherein D_T[i,j] and D_S[i,j] represent the weighted relation between the i-th and j-th targets in the teacher network and the student network respectively, σ_T,i^(l) and σ_S,i^(l) represent the depth uncertainties predicted by the teacher network and the student network for the i-th target at the l-th layer respectively, T_i^(l) and S_i^(l) represent the features of the i-th target at the l-th layer of the teacher network and the student network respectively, L represents the number of layers used for intermediate-layer feature distillation, and the function R(·) represents the basic formula for calculating the relation between two targets;
the difference between the corresponding target relations of the teacher network and the student network is calculated as follows:
wherein N represents the total number of targets, and the function G(·) represents the difference function between the corresponding target relations of the teacher network and the student network.

