CN116524329B - Network model construction method, device, equipment and medium for low-computational-power platform - Google Patents
- Publication number
- CN116524329B (application CN202310808005.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- dimensional
- dimensional feature
- module
- feature information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present disclosure relates to a network model construction method, apparatus, device, and medium for low-computational-power platforms. The method comprises the following steps: constructing a backbone network module that performs two-dimensional feature extraction and downsampling on a received image to output two-dimensional feature information; constructing a lightweight neck extraction module that performs feature extraction on a single-scale feature layer and receives and enhances the two-dimensional feature information; constructing an encoding module to obtain three-dimensional feature information; constructing a cross-modal feature distillation module that generates point cloud features for at least one image based on the point cloud information corresponding to the at least one image and obtains corrected three-dimensional feature information based on the point cloud features; and constructing a decoupled anchor-free detection head module that obtains target feature information for the at least one image based on the corrected three-dimensional feature information. In this way, target detection accuracy can be improved, model inference can be accelerated, and deployment of the model on a low-computational-power platform is facilitated.
Description
Technical Field
The present disclosure relates generally to the field of computer and image processing technology, and in particular to a network model building method, apparatus, electronic device, and computer-readable storage medium for a low-computational-power platform.
Background
The structural design of multi-image BEV (bird's eye view) algorithms is currently widely used for image-based detection in automatic driving. Such a structure is basically divided into three modules: the first module extracts 2D feature information from the images using a 2D convolutional neural network; the second module converts the 2D features into 3D space through a depth-transformation-based or Transformer-based encoding method to form BEV features; and the third module performs feature extraction on the BEV features to predict the 3D information of objects.
Cross-modal knowledge distillation is mainly used to extract richer semantic information from LiDAR point clouds through distillation so as to assist the learning of image features. At present, applying cross-modal knowledge distillation to multi-image BEV methods significantly improves detection accuracy; for example, TiG-BEV uses a depth-transformation method to generate image BEV features, uses a LiDAR detector to extract foreground depth information to guide image depth prediction, and uses the BEV features extracted by the LiDAR detector to guide image BEV feature learning. However, this approach is based on depth transformation and typically requires voxel pooling operations that are unfriendly to acceleration, making it difficult to deploy on resource-constrained low-computational-power platforms.
The Fast-BEV algorithm proposes an efficient BEV encoder to accelerate on-vehicle inference and deployment, but this method lacks the learning of depth information and performs regression in an anchor-based manner. The anchor size settings of this method affect the effectiveness of network learning and require manual tuning, and the bbox head of the network adopts a shared convolutional layer and is excessively coupled, which increases the difficulty of network learning.
Disclosure of Invention
According to example embodiments of the present disclosure, a network model building scheme for a low-computational-power platform is provided to at least partially solve the problems existing in the prior art.
In a first aspect of the present disclosure, a network model building method for a low-computational-power platform is provided. The method comprises the following steps: constructing a backbone network module configured to perform two-dimensional feature extraction and downsampling on at least one received image to output two-dimensional feature information for the at least one image; constructing a neck extraction module that is lightweight, performs two-dimensional feature extraction on a single-scale feature layer, and is configured to receive and enhance the two-dimensional feature information; constructing an encoding module configured to perform three-dimensional feature encoding on the enhanced two-dimensional feature information to obtain three-dimensional feature information for the at least one image; constructing a cross-modal feature distillation module configured to generate point cloud features for the at least one image based on the point cloud information corresponding to the at least one image and to perform feature distillation on the three-dimensional feature information based on the point cloud features to obtain corrected three-dimensional feature information; and constructing an anchor-free detection head module that is decoupled and configured to derive target feature information for the at least one image based on the corrected three-dimensional feature information. The target feature information may be the predicted three-dimensional information of the object in the image.
In some embodiments, the neck extraction module is further configured to: perform a 1×1 convolution and a 3×3 convolution on the two-dimensional feature information; feed the convolved two-dimensional feature information into the following branches: a global adaptive pooling branch; a series branch of a 1×1 convolution, a 3×3 convolution with a dilation rate of 2, and a 1×1 convolution; and a shortcut branch; and add the outputs of the global adaptive pooling branch, the series branch, and the shortcut branch to obtain the enhanced two-dimensional feature information.
In some embodiments, the encoding module is further configured to: directly project and fuse the enhanced two-dimensional feature information into a three-dimensional space based on the intrinsic and extrinsic parameters of the camera capturing the at least one image, so as to obtain three-dimensional feature information for the at least one image.
In some embodiments, the cross-modal feature distillation module is further configured to: perform feature extraction on the point cloud information corresponding to the at least one image using a LiDAR BEV feature extractor to obtain the point cloud features for the at least one image; and guide the three-dimensional feature information using the point cloud features to obtain the corrected three-dimensional feature information.
In some embodiments, the anchor-free detection head module includes a center detection head (CenterHead) module, which includes a 3×3 shared convolutional layer and a set of decoupled heads and is configured to: classify using a heatmap; predict a center point offset using two 1×1 convolutions and add the center point offset to the coordinates of the point with the strongest response in the heatmap to obtain the center coordinates of the target in the at least one image; predict the height of the target in the at least one image using two 1×1 convolutions; predict the actual length, width, and height of the target using two 1×1 convolutions and combine them with the height to obtain the target distance; predict the rotation angle of the target using two 1×1 convolutions; and predict the velocity of the target using two 1×1 convolutions.
In some embodiments, the backbone network module is further configured to perform two-dimensional feature extraction on the received at least one image using a ResNet network, and the downsampling is by a factor of 32.
In some embodiments, directly projecting and fusing the enhanced two-dimensional feature information into a three-dimensional space based on the intrinsic and extrinsic parameters of the camera capturing the at least one image to obtain three-dimensional feature information for the at least one image comprises: defining a grid of preset size in the three-dimensional space using a BEV encoder, the grid of preset size comprising feature points; converting the feature points from a LiDAR coordinate system to a camera coordinate system and then to an image coordinate system based on the camera intrinsic and extrinsic parameters; determining whether the converted feature points lie inside the at least one image; and in response to determining that a feature point lies inside the at least one image, filling the image feature corresponding to the feature point into the position of the feature point in the grid of preset size.
In a second aspect of the present disclosure, a network model apparatus for a low-computational-power platform is provided. The apparatus comprises: a backbone network module configured to perform two-dimensional feature extraction and downsampling on at least one received image to output two-dimensional feature information for the at least one image; a neck extraction module that is lightweight, performs two-dimensional feature extraction on a single-scale feature layer, and is configured to receive and enhance the two-dimensional feature information; an encoding module configured to perform three-dimensional feature encoding on the enhanced two-dimensional feature information to obtain three-dimensional feature information for the at least one image; a cross-modal feature distillation module configured to generate point cloud features for the at least one image based on the point cloud information corresponding to the at least one image and to perform feature distillation on the three-dimensional feature information based on the point cloud features to obtain corrected three-dimensional feature information; and an anchor-free detection head module that is decoupled and configured to derive target feature information for the at least one image based on the corrected three-dimensional feature information.
In a third aspect of the present disclosure, an electronic device is provided. The apparatus includes: one or more processors; and storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The medium has stored thereon a computer program which, when executed by a processor, implements a method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, a computer program product is provided. The product comprises a computer program/instructions which, when executed by a processor, implement the method according to the first aspect of the present disclosure.
It should be understood that what is described in this summary is not intended to limit the critical or essential features of the embodiments of the disclosure nor to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements. The accompanying drawings are included to provide a better understanding of the present disclosure, and are not to be construed as limiting the disclosure, wherein:
FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a schematic flow diagram of a network model building method for a low-computational-power platform, according to some embodiments of the present disclosure;
fig. 3 illustrates a schematic view of a neck extraction module structure according to some embodiments of the present disclosure;
FIG. 4 illustrates a BEV encoder schematic diagram in accordance with some embodiments of the present disclosure;
FIG. 5 illustrates a CenterHead architecture diagram according to some embodiments of the present disclosure;
FIG. 6 illustrates a schematic block diagram of a network model apparatus for a low-power platform, according to some embodiments of the present disclosure; and
FIG. 7 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As mentioned above, current image detection algorithms typically require voxel pooling operations that are unfriendly to acceleration, making them difficult to deploy on resource-constrained low-computational-power platforms; the detection head adopts a shared convolutional layer and is excessively coupled, so that network learning is difficult; and the overall network structure does not fully consider the overall relationship among the modules, which in turn leads to problems such as poor target detection accuracy and low inference speed.
To address these problems, embodiments of the present disclosure extract 2D image features from the pictures captured by multiple cameras through a 2D convolutional neural network backbone module, enlarge the network receptive field by adopting a lightweight neck extraction module structure, and realize image target detection by integrally combining an efficient encoding module, a cross-modal feature distillation method, and a decoupled anchor-free detection head module. In this way, the lightweight neck extraction module, the encoding module, and the decoupled detection head module form a complete feature extraction chain, with cross-modal distillation integrated as a whole; the overall cooperation among the module structures is fully considered, the learning capability of the network structure is significantly improved, detection accuracy is improved, model inference is accelerated, the network converges fully, and the difficulty of deployment on low-computational-power vehicle-end hardware platforms is effectively alleviated.
Exemplary embodiments of the present disclosure will be described below in conjunction with fig. 1-7.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure may be implemented.
As shown in fig. 1, the environment 100 may include a vehicle 101, a computing device 103, and a network model structure 105, which network model structure 105 may be suitable for use with low-computing-power platforms.
In the example of fig. 1, vehicle 101 may be any type of vehicle that may carry a person and/or object and that is moved by a power system such as an engine, including, but not limited to, a car, truck, bus, electric car, motorcycle, caravan, train, and the like. In some embodiments, the vehicle 101 in the environment 100 may be a vehicle having some autonomous capability, such a vehicle also being referred to as an unmanned vehicle or an autonomous vehicle. In some embodiments, vehicle 101 may also be a vehicle with semi-autonomous driving capabilities.
In some embodiments, the vehicle-end hardware platform of the vehicle 101 has low computational power and therefore requires an image target detection model structure suitable for a low-computational-power hardware platform, so that both accuracy and speed can meet the usage requirements.
As shown in fig. 1, the computing device 103 may be communicatively coupled to the vehicle 101. Although shown as a separate entity, the computing device 103 may be embedded in the vehicle 101. The computing device 103 may also be an entity external to the vehicle 101 and may communicate with the vehicle 101 via a wireless network. Computing device 103 may be any device having computing capabilities.
As non-limiting examples, computing device 103 may be any type of fixed, mobile, or portable computing device including, but not limited to, a desktop computer, a laptop computer, a notebook computer, a netbook computer, a tablet computer, a multimedia computer, a mobile phone, and the like; all or a portion of the components of computing device 103 may be distributed across the cloud. The computing device 103 includes at least a processor, memory, and other components typically found in general purpose computers to perform computing, storage, communication, control, etc. functions.
With continued reference to fig. 1, the network model structure 105 may be at least partially deployed in the computing device 103. Specific modules of the network model structure 105 will be described in detail below in conjunction with fig. 2-5.
Fig. 2 illustrates a schematic flow diagram of a network model building method 200 for a low-computing-power platform, the method 200 may be implemented, for example, by the computing device 103 shown in fig. 1, and may generate, for example, the network model structure 105 shown in fig. 1, according to some embodiments of the present disclosure.
At block 201, a backbone network module is constructed, which is configured to perform two-dimensional feature extraction and downsampling of the received at least one image to output two-dimensional feature information for the at least one image.
In one embodiment, referring to FIG. 1, the constructed backbone network module may receive the images of a plurality of cameras as input. For example, a ResNet may be used as the two-dimensional feature extractor while the image is downsampled (e.g., by a factor of 32), and the two-dimensional feature information C5 is then output. It should be appreciated that the ResNet described above is merely exemplary; other suitable two-dimensional feature extractors may be employed, and the image may be downsampled by any other suitable factor, as this disclosure is not limited in this regard.
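As an illustration only, the backbone step can be sketched as follows; the use of PyTorch and a torchvision ResNet-50, the batch of six surround-view images, and the 352×640 input resolution are assumptions of the example and are not fixed by the present disclosure.

```python
import torch
import torch.nn as nn
import torchvision

# ResNet-50 trunk as the 2D feature extractor: dropping the average-pooling
# and fully-connected layers leaves the C5 feature map, downsampled by a
# factor of 32 relative to the input image.
resnet = torchvision.models.resnet50()
backbone = nn.Sequential(*list(resnet.children())[:-2])

images = torch.randn(6, 3, 352, 640)   # e.g. six surround-view camera images
c5 = backbone(images)                  # (6, 2048, 11, 20): 352/32 = 11, 640/32 = 20
```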
At block 203, a neck extraction module is constructed; the neck extraction module is lightweight, performs two-dimensional feature extraction on a single-scale feature layer, and is configured to receive and enhance the two-dimensional feature information.
In one embodiment, referring to FIG. 1, the neck extraction module is lightweight so as to expand the receptive field of the network and extract higher-dimensional two-dimensional image features; it may be, for example, a dilated-neck structure. This structure does not adopt a multi-level detection structure, but instead performs feature extraction with dilated convolutions on a single-scale feature layer to obtain high-dimensional 2D image features. In the embodiment shown in FIG. 1, the neck extraction module receives the two-dimensional feature information C5, expands the network receptive field, enhances the two-dimensional feature information, and outputs F_2D for subsequent encoding.
FIG. 3 illustrates a schematic view of a neck extraction module structure according to some embodiments of the present disclosure. As shown in FIG. 3, the C5 feature is first passed through a 1×1 convolution and a 3×3 convolution and then fed into a multi-branch module: one branch performs global adaptive pooling, another branch is a series branch of a 1×1 convolution, a 3×3 convolution with a dilation rate of 2, and a 1×1 convolution, and the last branch is a shortcut connection; the branches are added to output the feature F_2D, as sketched below. This module can achieve the accuracy of multi-stage prediction while greatly reducing the number of model parameters.
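A minimal sketch of such a neck structure is given below, assuming a PyTorch implementation; the class name DilatedNeck, the channel sizes, and the omission of normalization and activation layers are assumptions of the example, since FIG. 3 specifies only the 1×1/3×3 stem, the three branches (with the series branch following the ordering given above), and their summation.

```python
import torch
import torch.nn as nn

class DilatedNeck(nn.Module):
    """Single-scale neck: 1x1/3x3 stem followed by three summed branches."""

    def __init__(self, in_channels: int = 2048, mid_channels: int = 256):
        super().__init__()
        # Stem: 1x1 convolution followed by 3x3 convolution on the C5 feature.
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
        )
        # Branch 1: global adaptive pooling (broadcast back to the map size).
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        # Branch 2: 1x1 conv -> 3x3 conv with dilation rate 2 -> 1x1 conv in series.
        self.dilated = nn.Sequential(
            nn.Conv2d(mid_channels, mid_channels, kernel_size=1),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=2, dilation=2),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=1),
        )
        # Branch 3 is the identity shortcut and needs no parameters.

    def forward(self, c5: torch.Tensor) -> torch.Tensor:
        x = self.stem(c5)
        pooled = self.global_pool(x).expand_as(x)     # (B, C, 1, 1) -> (B, C, H, W)
        return pooled + self.dilated(x) + x           # sum of the three branches = F_2D

f_2d = DilatedNeck()(torch.randn(1, 2048, 11, 20))    # C5 from a 32x-downsampled input
```

Because the dilated branch enlarges the receptive field without further downsampling, the single-scale output can stand in for a multi-level feature pyramid at a much lower parameter count.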
At block 205, an encoding module is constructed, the encoding module configured to three-dimensionally feature encode the enhanced two-dimensional feature information to obtain three-dimensional feature information for at least one image.
In one embodiment, referring to FIG. 1, the encoding module may be an efficient BEV encoder module. Through calculation with the camera intrinsic and extrinsic parameters, this module directly projects and fuses the two-dimensional feature information into the three-dimensional space to obtain the BEV feature F_BEV of the three-dimensional space. In one embodiment, the feature F_2D output by the neck extraction module may be encoded to obtain the feature F_BEV.
FIG. 4 illustrates a BEV encoder schematic diagram in accordance with some embodiments of the present disclosure. Referring to FIG. 4, I is a point in the BEV space defined under the world coordinate system. To find the feature corresponding to this point, the point I is first transformed by projective transformation from the LiDAR coordinate system to the camera coordinate system, and then from the camera coordinate system to the image coordinate system, where P_i^cam2img is the transformation matrix from the camera coordinate system to the image coordinate system, called the camera intrinsic matrix, and P_i^lidar2cam is the transformation matrix from the LiDAR coordinate system to the camera coordinate system, called the camera extrinsic matrix. It is then determined whether each transformed point I lies inside the image: points that do not lie inside the image are discarded, while for points inside the image the corresponding image feature is taken and filled into the BEV space; if a point appears in multiple images, the features are superimposed. The features of the initial BEV space are preset to 0.
In one example, the BEV encoder may define a 200×200×4 grid in the three-dimensional space, the grid being predefined as the unit cells of the BEV space, with the cell sizes defined as (0.5 m, 0.5 m, 1.5 m) respectively, so as to represent a BEV grid range of size (100 m, 100 m, 6 m) in real space. Using the coordinates (X, Y, Z) of each point in the grid, together with the projection parameters P_i^lidar2cam from the LiDAR to the i-th camera and the i-th camera intrinsic parameters P_i^cam2img, the coordinate point (u, v, d) of each grid point in the i-th camera image is obtained. It is then determined whether the coordinates of the point lie within the image, and points within the image range are retained, i.e. points satisfying that u is greater than 0 and smaller than the image width, v is greater than 0 and smaller than the image height, and d (depth) is greater than 0. The feature at the corresponding point coordinates is extracted from the high-dimensional feature F_2D as the feature of the corresponding point in the BEV spatial grid, thereby forming a complete BEV feature converted from 2D to 3D, as sketched below.
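A hedged sketch of this 2D-to-3D projection for a single camera is shown below; the NumPy implementation, the BEV origin, the nearest-neighbour feature lookup, and the use of the feature-map resolution as the image size are assumptions of the example, while the 200×200×4 grid, the cell sizes, and the u/v/d validity conditions follow the description above.

```python
import numpy as np

def build_bev_feature(f2d, lidar2cam, cam2img,
                      bev_shape=(200, 200, 4),
                      voxel_size=(0.5, 0.5, 1.5),
                      bev_origin=(-50.0, -50.0, -3.0)):
    """Project image features into a BEV grid for one camera.

    f2d:       (C, Hf, Wf) image feature map F_2D (feature-map resolution is
               used as the image size here for simplicity).
    lidar2cam: 4x4 homogeneous extrinsic matrix P_lidar2cam.
    cam2img:   3x3 intrinsic matrix P_cam2img.
    """
    C, Hf, Wf = f2d.shape
    bev = np.zeros((C,) + bev_shape, dtype=f2d.dtype)   # initial BEV features preset to 0

    # Grid indices -> metric coordinates (X, Y, Z) in the LiDAR frame.
    xs, ys, zs = np.meshgrid(*[np.arange(n) for n in bev_shape], indexing="ij")
    pts = np.stack([xs, ys, zs], -1).reshape(-1, 3) * np.array(voxel_size) + np.array(bev_origin)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)

    cam = (lidar2cam @ pts_h.T).T[:, :3]                 # LiDAR -> camera (extrinsics)
    uvd = (cam2img @ cam.T).T                            # camera -> image plane (intrinsics)
    d = uvd[:, 2]
    u = uvd[:, 0] / np.clip(d, 1e-6, None)
    v = uvd[:, 1] / np.clip(d, 1e-6, None)

    # Keep only grid points that project inside the image with positive depth.
    keep = (u > 0) & (u < Wf) & (v > 0) & (v < Hf) & (d > 0)
    idx = np.flatnonzero(keep)
    bev.reshape(C, -1)[:, idx] = f2d[:, v[keep].astype(int), u[keep].astype(int)]
    return bev
```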
At block 207, a cross-modal feature distillation module is constructed, configured to generate point cloud features for the at least one image based on the point cloud information corresponding to the at least one image, and configured to perform feature distillation on the three-dimensional feature information based on the point cloud features to obtain modified three-dimensional feature information.
In one embodiment, referring to FIG. 1, the point cloud information corresponding to the images captured at the same moment may also be fed into a LiDAR BEV feature extractor for feature extraction, where the feature extractor may use a laser point cloud encoding method (e.g. PointPillars). The point cloud BEV features obtained after feature extraction contain richer depth and localization information and can be used to guide F_BEV, compensating for the lack of depth information caused by using direct projection.
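The disclosure does not fix a particular distillation loss; the sketch below assumes a simple per-cell mean-squared-error guidance term between the image BEV feature and the detached LiDAR BEV feature, with both tensors already aligned to the same grid and channel count, and with a hypothetical weight lambda_distill added to the detection loss during training.

```python
import torch
import torch.nn.functional as F

def bev_distillation_loss(f_bev_img: torch.Tensor, f_bev_lidar: torch.Tensor) -> torch.Tensor:
    """MSE guidance of the image BEV feature by the LiDAR BEV feature.

    The LiDAR branch acts as the teacher, so its feature is detached and
    receives no gradient; only the image branch is corrected.
    """
    return F.mse_loss(f_bev_img, f_bev_lidar.detach())

# During training the guidance term is simply added to the detection loss:
# total_loss = detection_loss + lambda_distill * bev_distillation_loss(f_bev, f_lidar_bev)
```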
At block 209, an anchor-free detection head module is constructed, the anchor-free detection head module being decoupled and configured to derive target feature information for at least one image based on the modified three-dimensional feature information.
In one embodiment, referring to FIG. 1, the guided (corrected) F_BEV may then be passed through the anchor-free detection head module for classification and regression of targets. In particular, the detection head module may be a CenterHead module. This module may be a heatmap-based anchor-free target detection module: it predicts the category of the target using the heatmap, and uses decoupled convolutional layers to separately predict the offset of the target center point in the image, the height of the target in the image, the actual length, width, and height of the target, the rotation angle of the target, and the velocity of the target.
FIG. 5 illustrates a CenterHead architecture diagram according to some embodiments of the present disclosure. As shown in FIG. 5, in one embodiment, the internal structure of the CenterHead consists of a 3×3 shared convolutional layer and six decoupled heads; no anchors need to be set, and each target is considered to be represented by its center point. In the prediction of the three-dimensional feature information of a target, classification is first performed using the heatmap: the probability values of the N categories are computed at each point of the heatmap, and the category with the highest probability value is taken as the category of that point. The center point offset (delta x, delta y) is predicted using two 1×1 convolutions and added to the coordinates of the point with the strongest response in the heatmap to obtain the center coordinates (x, y) of the target in the image. The height h of the target in the image is predicted using two 1×1 convolutions. The actual length, width, and height (L, W, H) of the target are predicted using two 1×1 convolutions; combined with the height of the target in the image, the distance Z of the target can be calculated according to the formula Z = (f × H / h), where f is the camera focal length. The orientation (rotation angle) of the target is predicted using two 1×1 convolutions, and the velocity (Vx, Vy) of the target is predicted using two 1×1 convolutions. In this way, the use of decoupled convolutional layers allows the features to be better distinguished and learned, facilitating network convergence.
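A hedged sketch of how the decoupled head outputs could be combined at decoding time is shown below; it assumes single-sample (C, H, W) tensors and keeps only the single strongest heatmap peak (a real decoder would keep the top-k peaks per class), and it applies the distance formula Z = f × H / h from the description. The tensor layouts and names are assumptions of the example.

```python
import torch

def decode_center_head(heatmap, offset, height, dims, focal_length):
    """Combine the decoupled CenterHead outputs for the strongest response.

    heatmap: (N, H, W) class heatmap; offset: (2, H, W) centre offsets;
    height:  (1, H, W) target height in the image; dims: (3, H, W) actual
    length/width/height in metres.
    """
    num_classes, h, w = heatmap.shape
    flat = heatmap.reshape(num_classes, -1)
    flat_idx = flat.max(dim=0).values.argmax()        # location with the strongest response
    cls = flat[:, flat_idx].argmax()                  # class with the highest probability there
    cy, cx = divmod(int(flat_idx), w)

    dx, dy = offset[:, cy, cx]                        # centre-offset branch
    x, y = cx + float(dx), cy + float(dy)             # centre coordinates in the heatmap
    h_img = float(height[0, cy, cx])                  # target height in the image
    L, W_real, H_real = dims[:, cy, cx].tolist()      # actual size branch
    z = focal_length * H_real / max(h_img, 1e-6)      # Z = f * H / h

    return {"class": int(cls), "center": (x, y), "size": (L, W_real, H_real), "distance": z}
```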
Fig. 6 illustrates a schematic block diagram of a network model apparatus 600 for a low-power platform, according to some embodiments of the present disclosure. The apparatus 600 may be deployed, for example, in the computing device 103 as shown in fig. 1 or implemented as the computing device 103.
As shown in FIG. 6, the apparatus 600 includes a backbone network module 601, a neck extraction module 603, an encoding module 605, a cross-modal feature distillation module 607, and an anchor-free detection head module 609. The backbone network module 601 is configured to perform two-dimensional feature extraction and downsampling on at least one received image to output two-dimensional feature information for the at least one image. The neck extraction module 603 is lightweight, performs two-dimensional feature extraction on a single-scale feature layer, and is configured to receive and enhance the two-dimensional feature information. The encoding module 605 is configured to perform three-dimensional feature encoding on the enhanced two-dimensional feature information to obtain three-dimensional feature information for the at least one image. The cross-modal feature distillation module 607 is configured to generate point cloud features for the at least one image based on the point cloud information corresponding to the at least one image and to perform feature distillation on the three-dimensional feature information based on the point cloud features to obtain corrected three-dimensional feature information. The anchor-free detection head module 609 is decoupled and configured to derive target feature information for the at least one image based on the corrected three-dimensional feature information.
In some embodiments, the neck extraction module 603 is further configured to perform a 1×1 convolution and a 3×3 convolution on the two-dimensional feature information; feed the convolved two-dimensional feature information into the following branches: a global adaptive pooling branch; a series branch of a 1×1 convolution, a 3×3 convolution with a dilation rate of 2, and a 1×1 convolution; and a shortcut branch; and add the outputs of the global adaptive pooling branch, the series branch, and the shortcut branch to obtain the enhanced two-dimensional feature information.
In some embodiments, the encoding module 605 is further configured to directly project and fuse the enhanced two-dimensional feature information into a three-dimensional space based on the intrinsic and extrinsic parameters of the camera capturing the at least one image, so as to obtain three-dimensional feature information for the at least one image.
In some embodiments, the cross-modal feature distillation module 607 is further configured to perform feature extraction on the point cloud information corresponding to the at least one image using a LiDAR BEV feature extractor to obtain point cloud features for the at least one image, and to guide the three-dimensional feature information using the point cloud features to obtain the corrected three-dimensional feature information.
In some embodiments, the anchor-free detection head module 609 includes a center detection head (CenterHead) module, which includes a 3×3 shared convolutional layer and a set of decoupled heads and is configured to classify using a heatmap; predict a center point offset using two 1×1 convolutions and add the center point offset to the coordinates of the point with the strongest response in the heatmap to obtain the center coordinates of the target in the at least one image; predict the height of the target in the at least one image using two 1×1 convolutions; predict the actual length, width, and height of the target using two 1×1 convolutions and combine them with the height to obtain the target distance; predict the rotation angle of the target using two 1×1 convolutions; and predict the velocity of the target using two 1×1 convolutions.
In some embodiments, the backbone network module 601 is further configured to perform two-dimensional feature extraction on the received at least one image using a ResNet network and to downsample it by a factor of 32.
In some embodiments, the encoding module 605 is further configured to define a grid of preset size in the three-dimensional space using a BEV encoder, the grid of preset size comprising feature points; convert the feature points from the LiDAR coordinate system to the camera coordinate system and then to the image coordinate system based on the camera intrinsic and extrinsic parameters; determine whether the converted feature points lie inside the at least one image; and, in response to determining that a feature point lies inside the at least one image, fill the image feature corresponding to the feature point into the position of the feature point in the grid of preset size.
It should be understood that the modules recited in the apparatus 600 correspond respectively to the steps of the method 200 described with reference to FIG. 2. Accordingly, the operations and features described above in connection with FIG. 2 are equally applicable to the apparatus 600 and the modules included therein and have the same effects; specific details are not repeated here.
Fig. 7 illustrates a block diagram of a computing device 700 capable of implementing various embodiments of the present disclosure. Device 700 may be used, for example, to implement computing device 103 of fig. 1.
As shown in fig. 7, the device 700 includes a computing unit 701 that can perform various suitable actions and processes according to computer program instructions stored in a Read Only Memory (ROM) 702 or loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. One or more steps of the method 200 described above may be performed when the computer program is loaded into the RAM 703 and executed by the computing unit 701. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method 200 by any other suitable means (e.g., by means of firmware).
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGA), Application Specific Integrated Circuits (ASIC), Application Specific Standard Products (ASSP), Systems on a Chip (SOC), Complex Programmable Logic Devices (CPLD), and the like.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.
Claims (10)
1. A method of network model construction for a low-computational-power platform, the method comprising:
constructing a backbone network module configured to perform two-dimensional feature extraction and downsampling on at least one received image to output two-dimensional feature information for the at least one image;
constructing a neck extraction module that is lightweight, performs two-dimensional feature extraction on a single-scale feature layer, and is configured to receive and enhance the two-dimensional feature information;
constructing an encoding module configured to perform three-dimensional feature encoding on the enhanced two-dimensional feature information to obtain three-dimensional feature information for the at least one image;
constructing a cross-modal feature distillation module configured to generate a point cloud feature for the at least one image based on the point cloud information corresponding to the at least one image and configured to perform feature distillation on the three-dimensional feature information based on the point cloud feature to obtain corrected three-dimensional feature information; and
constructing an anchor-free detection head module that is decoupled and configured to derive target feature information for the at least one image based on the corrected three-dimensional feature information.
2. The method of claim 1, wherein the neck extraction module is further configured to:
perform a 1×1 convolution and a 3×3 convolution on the two-dimensional feature information;
feed the convolved two-dimensional feature information into the following branches: a global adaptive pooling branch; a series branch of a 1×1 convolution, a 3×3 convolution with a dilation rate of 2, and a 1×1 convolution; and a shortcut branch; and
add the outputs of the global adaptive pooling branch, the series branch, and the shortcut branch to obtain the enhanced two-dimensional feature information.
3. The method of claim 1, wherein the encoding module is further configured to:
directly project and fuse the enhanced two-dimensional feature information into a three-dimensional space based on the intrinsic and extrinsic parameters of the camera capturing the at least one image, so as to obtain three-dimensional feature information for the at least one image.
4. The method of claim 1, wherein the cross-modal feature distillation module is further configured to:
perform feature extraction on the point cloud information corresponding to the at least one image using a LiDAR BEV feature extractor to obtain the point cloud features for the at least one image; and
guide the three-dimensional feature information using the point cloud features to obtain the corrected three-dimensional feature information.
5. The method of claim 1, wherein the anchor-free detection head module comprises a center detection head (CenterHead) module, the CenterHead module comprising a 3×3 shared convolutional layer and a set of decoupled heads and being configured to:
classify using a heatmap;
predict a center point offset using two 1×1 convolutions and add the center point offset to the coordinates of the point with the strongest response in the heatmap to obtain the center coordinates of the target in the at least one image;
predict the height of the target in the at least one image using two 1×1 convolutions;
predict the actual length, width, and height of the target using two 1×1 convolutions, and combine them with the height to obtain the target distance;
predict the rotation angle of the target using two 1×1 convolutions; and
predict the velocity of the target using two 1×1 convolutions.
6. The method of claim 1, wherein the backbone network module is further configured to perform two-dimensional feature extraction on the received at least one image using a ResNet network, and the downsampling is by a factor of 32.
7. The method of claim 3, wherein directly projecting and fusing the enhanced two-dimensional feature information into a three-dimensional space based on the intrinsic and extrinsic parameters of the camera capturing the at least one image to obtain three-dimensional feature information for the at least one image comprises:
defining a grid of preset size in the three-dimensional space using a BEV encoder, the grid of preset size comprising feature points;
converting the feature points from a LiDAR coordinate system to a camera coordinate system and then to an image coordinate system based on the camera intrinsic and extrinsic parameters;
determining whether the converted feature points lie inside the at least one image; and
in response to determining that a feature point lies inside the at least one image, filling the image feature corresponding to the feature point into the position of the feature point in the grid of preset size.
8. A network model apparatus for a low-computational-power platform, comprising:
a backbone network module configured to perform two-dimensional feature extraction and downsampling on at least one received image to output two-dimensional feature information for the at least one image;
a neck extraction module that is lightweight, performs two-dimensional feature extraction on a single-scale feature layer, and is configured to receive and enhance the two-dimensional feature information;
an encoding module configured to perform three-dimensional feature encoding on the enhanced two-dimensional feature information to obtain three-dimensional feature information for the at least one image;
a cross-modal feature distillation module configured to generate a point cloud feature for the at least one image based on the point cloud information corresponding to the at least one image, and configured to perform feature distillation on the three-dimensional feature information based on the point cloud feature to obtain corrected three-dimensional feature information; and
an anchor-free detection head module that is decoupled and configured to derive target feature information for the at least one image based on the corrected three-dimensional feature information.
9. An electronic device, the device comprising:
one or more processors; and
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310808005.6A CN116524329B (en) | 2023-07-04 | 2023-07-04 | Network model construction method, device, equipment and medium for low-computational-power platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310808005.6A CN116524329B (en) | 2023-07-04 | 2023-07-04 | Network model construction method, device, equipment and medium for low-computational-power platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116524329A CN116524329A (en) | 2023-08-01 |
CN116524329B true CN116524329B (en) | 2023-08-29 |
Family
ID=87401577
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310808005.6A Active CN116524329B (en) | 2023-07-04 | 2023-07-04 | Network model construction method, device, equipment and medium for low-computational-power platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524329B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115223117A (en) * | 2022-05-30 | 2022-10-21 | 九识智行(北京)科技有限公司 | Training and using method, device, medium and equipment of three-dimensional target detection model |
CN115238758A (en) * | 2022-04-12 | 2022-10-25 | 华南理工大学 | Multi-task three-dimensional target detection method based on point cloud feature enhancement |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11410546B2 (en) * | 2020-05-18 | 2022-08-09 | Toyota Research Institute, Inc. | Bird's eye view based velocity estimation |
- 2023-07-04: CN application CN202310808005.6A filed; granted as patent CN116524329B (status: Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115238758A (en) * | 2022-04-12 | 2022-10-25 | 华南理工大学 | Multi-task three-dimensional target detection method based on point cloud feature enhancement |
CN115223117A (en) * | 2022-05-30 | 2022-10-21 | 九识智行(北京)科技有限公司 | Training and using method, device, medium and equipment of three-dimensional target detection model |
Also Published As
Publication number | Publication date |
---|---|
CN116524329A (en) | 2023-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220130156A1 (en) | Three-dimensional object detection and intelligent driving | |
CN113706480B (en) | Point cloud 3D target detection method based on key point multi-scale feature fusion | |
CN113284163B (en) | Three-dimensional target self-adaptive detection method and system based on vehicle-mounted laser radar point cloud | |
JP6979228B2 (en) | A learning method and learning device that integrates the space detection results of other autonomous vehicles acquired by V2V communication with the space detection results of its own autonomous vehicle, and a test method and test device that uses this {LEARNING METHOD AND LEARNING DEVICE FOR INTEGRATING OBJECT DETECTION INFORMATION ACQUIRED THROUGH V2V COMMUNICATION FROM OTHER AUTONOMOUS VEHICLE WITH OBJECT DETECTION INFORMATION GENERATED BY PRESENT AUTONOMOUS VEHICLE, AND TESTING METHOD AND TESTING DEVICE USING THE SAME} | |
CN113267761B (en) | Laser radar target detection and identification method, system and computer readable storage medium | |
WO2024055551A1 (en) | Point cloud feature extraction network model training method, point cloud feature extraction method, apparatus, and driverless vehicle | |
CN112699806A (en) | Three-dimensional point cloud target detection method and device based on three-dimensional heat map | |
CN115512132A (en) | 3D target detection method based on point cloud data and multi-view image data fusion | |
WO2024001093A1 (en) | Semantic segmentation method, environment perception method, apparatus, and unmanned vehicle | |
CN117422629B (en) | Instance-aware monocular semantic scene completion method, medium and device | |
CN112883790A (en) | 3D object detection method based on monocular camera | |
CN113762003A (en) | Target object detection method, device, equipment and storage medium | |
CN116246119A (en) | 3D target detection method, electronic device and storage medium | |
CN115035296B (en) | Flying car 3D semantic segmentation method and system based on aerial view projection | |
CN113344115A (en) | Target detection method based on lightweight model | |
US12079970B2 (en) | Methods and systems for semantic scene completion for sparse 3D data | |
CN115147798A (en) | Method, model and device for predicting travelable area and vehicle | |
CN117407694B (en) | Multi-mode information processing method, device, equipment and storage medium | |
CN114118247A (en) | Anchor-frame-free 3D target detection method based on multi-sensor fusion | |
CN116524329B (en) | Network model construction method, device, equipment and medium for low-computational-power platform | |
CN117037141A (en) | 3D target detection method and device and electronic equipment | |
CN115082902B (en) | Vehicle target detection method based on laser radar point cloud | |
JP7556142B2 (en) | Efficient 3D object detection from point clouds | |
CN116630975A (en) | Semantic scene completion method based on feature representation decomposition and bird's eye view fusion | |
WO2022017129A1 (en) | Target object detection method and apparatus, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |