CN114782786A - Feature fusion processing method and device for point cloud and image data

Feature fusion processing method and device for point cloud and image data

Info

Publication number
CN114782786A
CN114782786A (application CN202210536129.9A)
Authority
CN
China
Prior art keywords
feature
tensor
point cloud
image data
generate
Prior art date
Legal status
Pending
Application number
CN202210536129.9A
Other languages
Chinese (zh)
Inventor
张雨 (Zhang Yu)
Current Assignee
Suzhou Qingyu Technology Co Ltd
Original Assignee
Suzhou Qingyu Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Qingyu Technology Co Ltd
Priority to CN202210536129.9A
Publication of CN114782786A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G06F 18/24 - Classification techniques
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Abstract

An embodiment of the invention relates to a feature fusion processing method and device for point cloud and image data, wherein the method comprises: acquiring first point cloud data and first image data; performing bird's-eye-view feature extraction on the first point cloud data to generate a first feature map tensor; performing bird's-eye-view feature extraction on the first image data to generate a second feature map tensor; concatenating the first and second feature map tensors to generate a third feature map tensor; calculating a first position encoding tensor corresponding to the third feature map tensor according to the position encoding rule of a Transformer model; inputting the third feature map tensor and the first position encoding tensor into the Transformer model for self-attention operation; and taking the output result of the model operation as the corresponding fused feature tensor. The invention achieves multi-sensor bird's-eye-view feature fusion without maintaining an additional fusion model, thereby reducing development and maintenance cost.

Description

Feature fusion processing method and device for point cloud and image data
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for feature fusion processing of point cloud and image data.
Background
The perception module of an unmanned driving system performs multi-target tracking with Bird's Eye View (BEV) image features as reference, which can further improve tracking efficiency. Conventionally, the perception module obtains bird's-eye-view features either from image data captured by a camera or from point cloud data scanned by a lidar; bird's-eye-view features are rarely obtained by fusing the two sources, because image-based and point-cloud-based bird's-eye-view feature extraction models each carry a large computational and maintenance burden, and additionally building a model to fuse the two would inevitably incur further resource cost.
Disclosure of Invention
The invention aims to provide a feature fusion processing method and device for point cloud and image data, an electronic device, and a computer-readable storage medium, so as to overcome the defects of the prior art. With the bird's-eye-view feature fusion processing mechanism provided by the invention, multi-sensor bird's-eye-view feature fusion is achieved, and development and maintenance cost is reduced because no additional fusion model needs to be maintained.
In order to achieve the above object, a first aspect of the embodiments of the present invention provides a method for feature fusion processing of point cloud and image data, where the method includes:
acquiring first point cloud data and first image data;
performing bird's-eye-view feature extraction processing on the first point cloud data to generate a corresponding first feature map tensor;
performing bird's-eye-view feature extraction processing on the first image data to generate a corresponding second feature map tensor;
concatenating the first and second feature map tensors to generate a corresponding third feature map tensor;
calculating a position encoding tensor corresponding to the third feature map tensor according to the position encoding rule of a Transformer model to obtain a corresponding first position encoding tensor; inputting the third feature map tensor and the corresponding first position encoding tensor into the Transformer model for self-attention operation; and taking the output result of the model operation as the corresponding fused feature tensor.
Preferably, the performing bird's-eye-view feature extraction processing on the first point cloud data to generate a corresponding first feature map tensor specifically includes:
performing bird's-eye-view plane pseudo-image conversion processing on the first point cloud data based on a PointPillars model, and performing two-dimensional image feature extraction processing on the converted bird's-eye-view plane pseudo-image to generate the first feature map tensor.
Preferably, the performing bird's-eye-view feature extraction processing on the first image data to generate a corresponding second feature map tensor specifically includes:
inputting the first image data into a BevFormer model for two-dimensional image bird's-eye-view feature extraction to generate the second feature map tensor.
Preferably, the first feature map tensor has shape H1*W1*C1, where H1 is the feature map height, W1 is the feature map width, and C1 is the total number of data channels;
the second feature map tensor has shape H2*W2*C2, where H2 is the feature height with H2 = H1, W2 is the feature width with W2 = W1, and C2 is the total number of data channels;
the third feature map tensor has shape H3*W3*C3, where H3 is the feature height with H3 = H2 = H1, W3 is the feature width with W3 = W2 = W1, and C3 is the total number of data channels with C3 = (C1 + C2);
the fused feature tensor has shape H4*W4*C4, where H4 is the feature height with H4 = H3 = H2 = H1, W4 is the feature width with W4 = W3 = W2 = W1, and C4 is the total number of data channels with C4 = C3 = (C1 + C2).
A second aspect of the embodiments of the present invention provides an apparatus for implementing the feature fusion processing method for point cloud and image data according to the first aspect, the apparatus comprising: an acquisition module, a point cloud bird's-eye-view feature processing module, an image bird's-eye-view feature processing module, and a feature fusion processing module;
the acquisition module is used for acquiring first point cloud data and first image data;
the point cloud bird's-eye-view feature processing module is used for performing bird's-eye-view feature extraction processing on the first point cloud data to generate a corresponding first feature map tensor;
the image bird's-eye-view feature processing module is used for performing bird's-eye-view feature extraction processing on the first image data to generate a corresponding second feature map tensor;
the feature fusion processing module is used for concatenating the first and second feature map tensors to generate a corresponding third feature map tensor; calculating a position encoding tensor corresponding to the third feature map tensor according to the position encoding rule of a Transformer model to obtain a corresponding first position encoding tensor; inputting the third feature map tensor and the corresponding first position encoding tensor into the Transformer model for self-attention operation; and taking the output result of the model operation as the corresponding fused feature tensor.
A third aspect of an embodiment of the present invention provides an electronic device, including: a memory, a processor, and a transceiver;
the processor is configured to be coupled to the memory, and to read and execute the instructions in the memory so as to implement the method steps of the first aspect;
the transceiver is coupled to the processor, and the processor controls the transceiver to transmit and receive messages.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing computer instructions that, when executed by a computer, cause the computer to perform the method according to the first aspect.
The embodiments of the invention provide a feature fusion processing method and device for point cloud and image data, an electronic device, and a computer-readable storage medium. The bird's-eye-view feature fusion processing mechanism provided by the invention realizes multi-sensor bird's-eye-view feature fusion without adding a fusion model, thereby reducing development and maintenance cost.
Drawings
Fig. 1 is a schematic diagram of a feature fusion processing method for point cloud and image data according to an embodiment of the present invention;
Fig. 2 is a block diagram of a feature fusion processing apparatus for point cloud and image data according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a part, rather than all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for feature fusion processing of point cloud and image data; Fig. 1 is a schematic diagram of the method, which mainly includes the following steps:
step 1, first point cloud data and first image data are obtained.
The first point cloud data is point cloud data generated by a vehicle-mounted lidar, from which the perception module of the vehicle's unmanned driving system acquires it; the first image data is captured by a vehicle-mounted camera, from which the perception module acquires it. In the embodiment of the invention, the generation times of the first point cloud data and the first image data are assumed by default to match each other, and their corresponding spatial ranges also match.
Step 2, performing bird's-eye-view feature extraction processing on the first point cloud data to generate a corresponding first feature map tensor;
the method specifically comprises the following steps: performing bird's-eye-view plane pseudo-image conversion processing on the first point cloud data based on a PointPillars model, and performing two-dimensional image feature extraction processing on the converted bird's-eye-view plane pseudo-image to generate the first feature map tensor;
wherein the first feature map tensor has shape H1*W1*C1, where H1 is the feature map height, W1 is the feature map width, and C1 is the total number of data channels.
Here, the embodiment of the present invention may extract the bird's-eye-view features of the first point cloud data based on any of several mature models capable of extracting bird's-eye-view features from point cloud data, thereby obtaining the corresponding bird's-eye-view feature tensor, that is, the first feature map tensor; the PointPillars model is used by default. For the implementation of the PointPillars model, refer to the paper "PointPillars: Fast Encoders for Object Detection from Point Clouds", which is not further described herein. As described in that paper, the PointPillars model consists of a point cloud pillar feature extraction network (Pillar Feature Net), a two-dimensional feature extraction backbone network (Backbone, a 2D CNN), and a target detection head (Detection Head, SSD). The pillar feature extraction network clusters the input point cloud into pillars, projects the pillars onto the bird's-eye-view plane, and outputs the projection result as a bird's-eye-view plane pseudo-image (Pseudo Image); the two-dimensional feature extraction backbone performs two-dimensional image feature extraction on the pseudo-image using a conventional multi-stage down-sampling convolutional network; and the detection head classifies the extracted bird's-eye-view features and maps the classification results back onto the original point cloud data, adding semantic features to each point. When the bird's-eye-view plane pseudo-image conversion processing is performed on the first point cloud data based on the PointPillars model, the pillar feature extraction network of the PointPillars model converts the first point cloud data into the corresponding bird's-eye-view plane pseudo-image tensor, and the two-dimensional feature extraction backbone then performs two-dimensional image feature extraction on that pseudo-image tensor to generate the corresponding first feature map tensor. From the output tensor structure of the backbone, the first feature map tensor is a three-dimensional tensor of shape H1*W1*C1, where H1 is the feature height, W1 the feature width, and C1 the total number of data channels; that is, the first feature map tensor can be understood as a two-dimensional image of H1*W1 pixels, each pixel consisting of C1 feature data.
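For illustration, a minimal Python/PyTorch sketch of this step follows. It is not the patent's or the paper's implementation: a per-pillar mean feature stands in for PointPillars' learned pillar encoder, and the grid ranges, resolution, and single-stage backbone are illustrative assumptions.

```python
# Minimal sketch of the BEV-plane pseudo-image step (illustrative only).
# Assumption: per-pillar mean features replace the learned Pillar Feature Net.
import torch
import torch.nn as nn

def points_to_pseudo_image(points: torch.Tensor,
                           x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                           resolution=0.16) -> torch.Tensor:
    """points: (N, 4) rows of (x, y, z, intensity) -> (4, H, W) pseudo-image."""
    num_feats = points.shape[1]
    W = int(round((x_range[1] - x_range[0]) / resolution))  # grid columns
    H = int(round((y_range[1] - y_range[0]) / resolution))  # grid rows
    # Keep only points that fall inside the BEV grid.
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[keep]
    col = ((pts[:, 0] - x_range[0]) / resolution).long()
    row = ((pts[:, 1] - y_range[0]) / resolution).long()
    flat = row * W + col                          # pillar index of each point
    canvas = torch.zeros(num_feats, H * W)
    count = torch.zeros(H * W).index_add_(0, flat, torch.ones(pts.shape[0]))
    for c in range(num_feats):                    # mean feature per pillar
        canvas[c].index_add_(0, flat, pts[:, c])
    return (canvas / count.clamp(min=1)).view(num_feats, H, W)

# Stand-in for the 2D feature extraction backbone (one conv stage here; the
# real backbone is a multi-stage down-sampling CNN).
backbone = nn.Sequential(nn.Conv2d(4, 64, kernel_size=3, stride=2, padding=1),
                         nn.ReLU())
pts = torch.rand(1000, 4) * torch.tensor([69.12, 79.36, 3.0, 1.0]) \
      + torch.tensor([0.0, -39.68, 0.0, 0.0])
feat1 = backbone(points_to_pseudo_image(pts).unsqueeze(0))  # (1, C1, H1, W1)
```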
Step 3, performing bird's-eye-view feature extraction processing on the first image data to generate a corresponding second feature map tensor;
the method specifically comprises the following steps: inputting the first image data into a BevFormer model for two-dimensional image bird's-eye-view feature extraction to generate the second feature map tensor;
wherein the second feature map tensor has shape H2*W2*C2, where H2 is the feature height with H2 = H1, W2 is the feature width with W2 = W1, and C2 is the total number of data channels.
Here, the embodiment of the present invention may extract the bird's-eye-view features of the first image data based on any of several mature models capable of extracting bird's-eye-view features from image data, thereby obtaining the corresponding bird's-eye-view feature tensor, that is, the second feature map tensor; the BevFormer model is used by default. For the implementation of the BevFormer model, refer to the paper "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers", which is not further described herein. In the embodiment of the invention, after the first image data is input into the BevFormer model for two-dimensional image bird's-eye-view feature extraction, the model obtains the historical bird's-eye-view temporal features of the first image data through queries, obtains the real-time image features of the first image data through its feature extraction network, and then performs spatio-temporal feature aggregation on the two to obtain the corresponding second feature map tensor. The second feature map tensor is likewise a three-dimensional tensor of shape H2*W2*C2, where H2 is the feature map height, W2 the feature map width, and C2 the total number of data channels; that is, the second feature map tensor can be understood as a two-dimensional image of H2*W2 pixels, each pixel consisting of C2 feature data. To facilitate the subsequent feature concatenation, the embodiment of the present invention sets the sizes of the feature tensors output by the PointPillars model and the BevFormer model to be the same, that is, H2 = H1 and W2 = W1 are ensured by setting the model parameters.
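For a sense of the camera-to-BEV mapping, here is a drastically simplified stand-in in the same Python/PyTorch style: learnable BEV queries cross-attend to flattened image features. BevFormer itself uses deformable spatial cross-attention plus temporal self-attention over historical BEV queries; the class name, default sizes, and single attention layer below are assumptions, not the BevFormer architecture.

```python
# Drastically simplified camera-to-BEV stand-in (illustrative only): one
# standard cross-attention layer in place of BevFormer's deformable spatial
# cross-attention and temporal self-attention.
import torch
import torch.nn as nn

class TinyCameraToBEV(nn.Module):
    def __init__(self, h_bev: int = 64, w_bev: int = 64, c: int = 64):
        super().__init__()
        self.h, self.w = h_bev, w_bev
        self.bev_queries = nn.Parameter(torch.randn(h_bev * w_bev, c))
        self.attn = nn.MultiheadAttention(c, num_heads=4, batch_first=True)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        """img_feats: (B, C, Hi, Wi) image features -> (B, C, H_bev, W_bev)."""
        b, c, _, _ = img_feats.shape
        kv = img_feats.flatten(2).transpose(1, 2)      # (B, Hi*Wi, C)
        q = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.attn(q, kv, kv)                  # BEV queries attend to image
        return bev.transpose(1, 2).reshape(b, c, self.h, self.w)

feat2 = TinyCameraToBEV()(torch.randn(1, 64, 32, 88))  # (1, C2, H2, W2) = (1, 64, 64, 64)
```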
Step 4, concatenating the first feature map tensor and the second feature map tensor to generate a corresponding third feature map tensor;
wherein the third feature map tensor has shape H3*W3*C3, where H3 is the feature height with H3 = H2 = H1, W3 is the feature width with W3 = W2 = W1, and C3 is the total number of data channels with C3 = (C1 + C2); that is, the two feature maps are concatenated along the channel dimension.
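In code, this step is a single channel-wise concatenation; a minimal sketch with assumed shapes (feat1 and feat2 here are placeholders for the two feature map tensors):

```python
# Channel-wise concatenation of the two BEV feature maps (step 4).
import torch

feat1 = torch.randn(1, 64, 64, 64)        # (B, C1, H1, W1), point cloud branch
feat2 = torch.randn(1, 64, 64, 64)        # (B, C2, H2, W2), image branch; H2=H1, W2=W1
feat3 = torch.cat([feat1, feat2], dim=1)  # (B, C1+C2, H, W), third feature map tensor
assert feat3.shape == (1, 128, 64, 64)
```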
Step 5, calculating the position encoding tensor corresponding to the third feature map tensor according to the position encoding rule of the Transformer model to obtain the corresponding first position encoding tensor; inputting the third feature map tensor and the corresponding first position encoding tensor into the Transformer model for self-attention operation; and taking the output result of the model operation as the corresponding fused feature tensor;
wherein the fused feature tensor has shape H4*W4*C4, where H4 is the feature height with H4 = H3 = H2 = H1, W4 is the feature width with W4 = W3 = W2 = W1, and C4 is the total number of data channels with C4 = C3 = (C1 + C2).
Here, for the implementation of the Transformer model, refer to the paper "Attention Is All You Need", which is not further described herein. As described in that paper, the input of the Transformer model includes two parts: the input feature tensor and its corresponding position encoding tensor. The calculation of the position encoding tensor is determined by the position encoding rule of the Transformer model, which includes a sine encoding rule and a cosine encoding rule; the embodiment of the invention adopts the sine encoding rule by default. The paper also shows that the Transformer model comprises an encoder and a decoder: the third feature map tensor and the corresponding first position encoding tensor are input into the encoder for stage-by-stage encoding, and the decoder then decodes stage by stage to obtain the final model operation output, namely the fused feature tensor. From the input/output structure of the Transformer model, the shape of the output fused feature tensor matches the shape of the input third feature map tensor, so the shape H4*W4*C4 of the fused feature tensor is in fact H3*W3*C3.
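The following Python/PyTorch sketch illustrates step 5 under stated assumptions: the H*W grid is flattened into a sequence, the sinusoidal position encoding of "Attention Is All You Need" is added as the first position encoding tensor, and an untrained, encoder-only Transformer performs the self-attention operation. The patent describes an encoder-decoder Transformer; the layer and head counts here are assumptions.

```python
# Sketch of step 5: sinusoidal position encoding + Transformer self-attention.
# Assumptions: encoder-only (the patent describes encoder + decoder), untrained
# weights, and illustrative layer/head counts.
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sine/cosine position encoding; assumes d_model is even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even channels: sine rule
    pe[:, 1::2] = torch.cos(pos * div)  # odd channels: cosine rule
    return pe

def fuse(feat3: torch.Tensor, num_layers: int = 2, num_heads: int = 4) -> torch.Tensor:
    """feat3: (B, C3, H, W) -> fused feature tensor of the same shape."""
    b, c, h, w = feat3.shape
    seq = feat3.flatten(2).transpose(1, 2)          # (B, H*W, C3)
    seq = seq + sinusoidal_encoding(h * w, c)       # first position encoding tensor
    layer = nn.TransformerEncoderLayer(d_model=c, nhead=num_heads, batch_first=True)
    out = nn.TransformerEncoder(layer, num_layers=num_layers)(seq)  # self-attention
    return out.transpose(1, 2).reshape(b, c, h, w)  # (B, C4, H4, W4) = input shape

fused = fuse(torch.randn(1, 128, 64, 64))
assert fused.shape == (1, 128, 64, 64)
```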
The fused feature tensor obtained through steps 1-5 contains both the bird's-eye-view features of the point cloud and the bird's-eye-view features of the image, and the perception module can subsequently perform multi-target tracking with this fused feature tensor as reference.
Fig. 2 is a block diagram of a feature fusion processing apparatus for point cloud and image data according to a second embodiment of the present invention. The apparatus may be the terminal device or server implementing the foregoing method embodiment, or an apparatus that enables such a terminal device or server to implement it, for example an apparatus or chip system of that terminal device or server. As shown in Fig. 2, the apparatus includes: an acquisition module 201, a point cloud bird's-eye-view feature processing module 202, an image bird's-eye-view feature processing module 203, and a feature fusion processing module 204.
The obtaining module 201 is configured to obtain first point cloud data and first image data.
The point cloud bird's-eye-view feature processing module 202 is configured to perform bird's-eye-view feature extraction processing on the first point cloud data to generate a corresponding first feature map tensor.
The image bird's-eye-view feature processing module 203 is configured to perform bird's-eye-view feature extraction processing on the first image data to generate a corresponding second feature map tensor.
The feature fusion processing module 204 is configured to concatenate the first and second feature map tensors to generate a corresponding third feature map tensor; calculate the position encoding tensor corresponding to the third feature map tensor according to the position encoding rule of the Transformer model to obtain the corresponding first position encoding tensor; input the third feature map tensor and the corresponding first position encoding tensor into the Transformer model for self-attention operation; and take the output result of the model operation as the corresponding fused feature tensor.
The feature fusion processing device for point cloud and image data provided by the embodiment of the invention can execute the method steps in the method embodiment, and the implementation principle and the technical effect are similar, and are not repeated herein.
It should be noted that the division of the apparatus into the above modules is only a logical division; in an actual implementation, all or part of them may be integrated into one physical entity or kept physically separate. These modules may all be implemented in the form of software invoked by a processing element, all in the form of hardware, or partly as software invoked by a processing element and partly as hardware. For example, the acquisition module may be a separately established processing element, may be integrated into a chip of the apparatus, or may be stored in the memory of the apparatus in the form of program code that a processing element of the apparatus calls to execute the module's function; the other modules are implemented similarly. In addition, the modules may be wholly or partly integrated together or implemented independently. The processing element described herein may be an integrated circuit with signal processing capability. In implementation, each step of the above method or each of the above modules may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can call program code. As another example, these modules may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the foregoing method embodiments are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, Bluetooth, microwave) means. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., Solid State Disk (SSD)).
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. The electronic device may be the terminal device or the server, or may be a terminal device or a server connected to the terminal device or the server and implementing the method according to the embodiment of the present invention. As shown in Fig. 3, the electronic device may include: a processor 301 (e.g., CPU), memory 302, transceiver 303; the transceiver 303 is coupled to the processor 301, and the processor 301 controls the transceiving operation of the transceiver 303. Various instructions may be stored in memory 302 for performing various processing functions and implementing the processing steps described in the foregoing method embodiments. Preferably, the electronic device according to an embodiment of the present invention further includes: a power supply 304, a system bus 305, and a communication port 306. The system bus 305 is used to implement communication connections between the elements. The communication port 306 is used for connection communication between the electronic device and other peripheral devices.
The system bus 305 mentioned in Fig. 3 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 3, but this does not mean there is only one bus or one type of bus. The communication interface is used for realizing communication between the database access device and other equipment (such as a client, a read-write library, and a read-only library). The memory may include a Random Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a Graphics Processing Unit (GPU), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
It should be noted that the embodiment of the present invention also provides a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to execute the method and the processing procedure provided in the above-mentioned embodiment.
The embodiment of the present invention further provides a chip for executing the instruction, where the chip is configured to execute the processing steps described in the foregoing method embodiment.
The embodiment of the invention provides a feature fusion processing method and device for point cloud and image data, an electronic device and a computer readable storage medium. The bird's-eye view feature fusion processing mechanism provided by the invention not only realizes the bird's-eye view feature fusion of multiple sensors, but also reduces the development and maintenance cost without additionally adding a fusion model.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A feature fusion processing method of point cloud and image data is characterized by comprising the following steps:
acquiring first point cloud data and first image data;
performing bird's-eye-view feature extraction processing on the first point cloud data to generate a corresponding first feature map tensor;
performing bird's-eye-view feature extraction processing on the first image data to generate a corresponding second feature map tensor;
concatenating the first and second feature map tensors to generate a corresponding third feature map tensor;
calculating a position encoding tensor corresponding to the third feature map tensor according to a position encoding rule of a Transformer model to obtain a corresponding first position encoding tensor; inputting the third feature map tensor and the corresponding first position encoding tensor into the Transformer model for self-attention operation; and taking the output result of the model operation as the corresponding fused feature tensor.
2. The feature fusion processing method for point cloud and image data according to claim 1, wherein the performing bird's-eye-view feature extraction processing on the first point cloud data to generate a corresponding first feature map tensor specifically comprises:
performing bird's-eye-view plane pseudo-image conversion processing on the first point cloud data based on a PointPillars model, and performing two-dimensional image feature extraction processing on the converted bird's-eye-view plane pseudo-image to generate the first feature map tensor.
3. The feature fusion processing method for point cloud and image data according to claim 1, wherein the performing bird's-eye-view feature extraction processing on the first image data to generate a corresponding second feature map tensor specifically comprises:
inputting the first image data into a BevFormer model for two-dimensional image bird's-eye-view feature extraction to generate the second feature map tensor.
4. The feature fusion processing method for point cloud and image data according to claim 1, wherein
the first feature map tensor has shape H1*W1*C1, where H1 is the feature height, W1 is the feature width, and C1 is the total number of data channels;
the second feature map tensor has shape H2*W2*C2, where H2 is the feature height with H2 = H1, W2 is the feature width with W2 = W1, and C2 is the total number of data channels;
the third feature map tensor has shape H3*W3*C3, where H3 is the feature height with H3 = H2 = H1, W3 is the feature width with W3 = W2 = W1, and C3 is the total number of data channels with C3 = (C1 + C2);
the fused feature tensor has shape H4*W4*C4, where H4 is the feature height with H4 = H3 = H2 = H1, W4 is the feature width with W4 = W3 = W2 = W1, and C4 is the total number of data channels with C4 = C3 = (C1 + C2).
5. An apparatus for implementing the feature fusion processing method for point cloud and image data according to any one of claims 1 to 4, the apparatus comprising: an acquisition module, a point cloud bird's-eye-view feature processing module, an image bird's-eye-view feature processing module, and a feature fusion processing module;
the acquisition module is used for acquiring first point cloud data and first image data;
the point cloud bird's-eye-view feature processing module is used for performing bird's-eye-view feature extraction processing on the first point cloud data to generate a corresponding first feature map tensor;
the image bird's-eye-view feature processing module is used for performing bird's-eye-view feature extraction processing on the first image data to generate a corresponding second feature map tensor;
the feature fusion processing module is used for concatenating the first and second feature map tensors to generate a corresponding third feature map tensor; calculating a position encoding tensor corresponding to the third feature map tensor according to a position encoding rule of a Transformer model to obtain a corresponding first position encoding tensor; inputting the third feature map tensor and the corresponding first position encoding tensor into the Transformer model for self-attention operation; and taking the output result of the model operation as the corresponding fused feature tensor.
6. An electronic device, comprising: a memory, a processor, and a transceiver;
the processor is configured to be coupled to the memory, and to read and execute the instructions in the memory so as to implement the method steps of any one of claims 1-4;
the transceiver is coupled to the processor, and the processor controls the transceiver to transmit and receive messages.
7. A computer-readable storage medium having computer instructions stored thereon which, when executed by a computer, cause the computer to perform the method of any of claims 1-4.
CN202210536129.9A (priority date 2022-05-17, filing date 2022-05-17): Feature fusion processing method and device for point cloud and image data. Status: Pending. Publication: CN114782786A.

Priority Applications (1)

CN202210536129.9A: Feature fusion processing method and device for point cloud and image data

Publications (1)

CN114782786A, published 2022-07-22

Family

ID=82437852

Family Applications (1)

CN202210536129.9A (pending): Feature fusion processing method and device for point cloud and image data

Country Status (1)

CN: CN114782786A

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination