CN117830776A - Feature fusion method and system for vehicle-mounted sensor data and satellite image data - Google Patents

Feature fusion method and system for vehicle-mounted sensor data and satellite image data

Info

Publication number
CN117830776A
CN117830776A (application CN202311765226.6A)
Authority
CN
China
Prior art keywords
feature
attention
vehicle-mounted sensor
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311765226.6A
Other languages
Chinese (zh)
Inventor
尹玉成
蔡晨
石涤文
王一鹏
张志军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Heading Data Intelligence Co Ltd
Original Assignee
Heading Data Intelligence Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Heading Data Intelligence Co Ltd filed Critical Heading Data Intelligence Co Ltd
Priority to CN202311765226.6A priority Critical patent/CN117830776A/en
Publication of CN117830776A publication Critical patent/CN117830776A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention provides a method and a system for fusing features of vehicle-mounted sensor data and satellite image data, wherein the method comprises the following steps: acquiring a satellite image of the acquisition area of a vehicle-mounted sensor; extracting vehicle-mounted sensor image features through a multi-layer perceptron and inverse perspective transformation to obtain a first feature, and extracting satellite image features through a UNet network and an FPN network to obtain a second feature; adjusting the fusion weights of different positions in the image based on position embedding, and obtaining the Q, K and V values of the attention network through linear projection learning; taking the segmentation mask and the distance mask in the second feature as attention masks to filter interference information in the satellite image data, and extracting the first feature based on a masked cross-attention mechanism to obtain a third feature; and aligning the second feature with the third feature, and fusing the aligned second and third features. According to this scheme, feature fusion of the vehicle-mounted sensor image and the satellite image can be achieved, improving the efficiency and quality of high-precision map production.

Description

Feature fusion method and system for vehicle-mounted sensor data and satellite image data
Technical Field
The invention belongs to the field of high-precision maps, and particularly relates to a method and a system for fusing vehicle-mounted sensor data and satellite image data characteristics.
Background
Vehicle-mounted sensor data and satellite image data both play an important role in producing crowdsourced high-precision maps and can provide much of the key data for map making. At present, high-precision maps are mostly produced using one kind of data alone or the two kinds of data separately, for example, building the map from vehicle-mounted sensor data and then adjusting it with satellite images. However, both kinds of data have certain defects: the vehicle-mounted sensor struggles with complex intersections and scenes occluded by obstacles ahead, while satellite images are occluded by tall vegetation or building shadows, so actual high-precision map production is inefficient and error-prone.
Disclosure of Invention
In view of the above, the embodiments of the invention provide a method and a system for fusing features of vehicle-mounted sensor data and satellite image data, which are used for solving the problems that existing high-precision map production is inefficient and error-prone.
In a first aspect of the embodiments of the present invention, a method for feature fusion of vehicle-mounted sensor data and satellite image data is provided, including:
acquiring the satellite image data of the acquisition area of the vehicle-mounted sensor;
extracting vehicle-mounted sensor image features through a multi-layer perceptron and inverse perspective transformation to obtain a first feature, and extracting satellite image data features through a UNet network and an FPN network to obtain a second feature;
adjusting the fusion weights of different positions in the image based on position embedding, learning the first feature through linear projection to obtain the attention network Q value, and learning the second feature to obtain the attention network K and V values;
taking the segmentation mask and the distance mask in the second feature as attention masks, filtering interference information in the satellite image data based on the attention masks, and extracting the first feature based on a masked cross-attention mechanism to obtain a third feature;
and aligning the second feature with the third feature, and fusing the aligned second feature and third feature.
In a second aspect of the embodiments of the present invention, there is provided a system for feature fusion of vehicle-mounted sensor data and satellite image data, including:
the data acquisition module, used for acquiring the satellite image data of the acquisition area of the vehicle-mounted sensor;
the feature extraction module, used for extracting vehicle-mounted sensor image features through the multi-layer perceptron and inverse perspective transformation to obtain a first feature, and extracting satellite image data features through the UNet network and the FPN network to obtain a second feature;
the feature learning module, used for adjusting the fusion weights of different positions in the image based on position embedding, learning the first feature through linear projection to obtain the attention network Q value, and learning the second feature to obtain the attention network K and V values;
the filtering and extraction module, used for taking the segmentation mask and the distance mask in the second feature as attention masks, filtering interference information in the satellite image data based on the attention masks, and extracting the first feature based on a masked cross-attention mechanism to obtain a third feature;
and the alignment and fusion module, used for aligning the second feature with the third feature and fusing the aligned second and third features.
In a third aspect of the embodiments of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect of the embodiments of the present invention when executing the computer program.
In a fourth aspect of the embodiments of the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method provided by the first aspect of the embodiments of the present invention.
In the embodiments of the invention, features are extracted from the vehicle-mounted sensor image and the satellite image respectively, the fusion weights of different positions in the image are adjusted based on position embedding, and the corresponding Q, K and V values of the attention network are obtained through linear projection learning; interference information in the satellite image data is filtered based on the attention mask, sensor image features are extracted based on a masked cross-attention mechanism and aligned with the satellite image features, and the aligned features are fused. BEV (bird's-eye view)-level fusion of the vehicle-mounted sensor image and the satellite image is thereby realized, which improves high-precision map production efficiency and can guarantee the accuracy and reliability of high-precision map production.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flow chart of a method for feature fusion of vehicle-mounted sensor data and satellite image data according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a feature fusion system for vehicle-mounted sensor data and satellite image data according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more comprehensible, the technical solutions in the embodiments of the present invention are described in detail below with reference to the accompanying drawings, and it is apparent that the embodiments described below are only some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the term "comprising" and similar terms in the description, claims and drawings of the invention are intended to cover a non-exclusive inclusion, such as a process, method, system or apparatus comprising a series of steps or elements without being limited to the listed steps or elements. Furthermore, "first" and "second" are used to distinguish between different objects and not to describe a particular order.
Referring to fig. 1, a flow chart of a method for feature fusion of vehicle-mounted sensor data and satellite image data according to an embodiment of the present invention includes:
S101, acquiring the satellite image data of the acquisition area of the vehicle-mounted sensor;
the vehicle-mounted sensor can comprise a vehicle-mounted camera and a laser radar, and a defensive film image corresponding to a data acquisition area of the vehicle-mounted sensor needs to be acquired. The satellite image data is a satellite photo image, namely a satellite image, and is an image for scanning ground features.
The conversion relation between the vehicle-mounted sensor data and the toilet images needs to be determined so as to convert ground object targets acquired by the vehicle-mounted sensor.
Optionally, constructing a transformation matrix between the satellite image coordinate system and the global coordinate system of the vehicle-mounted sensor by a key point alignment method; and carrying out coordinate transformation on the vehicle-mounted sensor data and the vehicle position based on the transformation matrix.
The key points are generally positions with obvious features on the map, such as intersections, traffic signs and landmarks. The conversion relationship between the vehicle-mounted sensor and the satellite images can be established based on the key point positions, the sensor acquisition positions, the target marks and the like, and the ground object targets acquired by the vehicle-mounted sensor can be converted according to this conversion relationship (i.e., the transformation matrix), as in the sketch below.
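As an illustration only (the disclosure does not prescribe an implementation), such a transformation matrix between the sensor's global coordinates and the satellite image coordinates can be fitted from matched key points by least squares; the function names and the choice of a 2D affine model here are assumptions:

```python
import numpy as np

def fit_affine_transform(src_pts, dst_pts):
    """Fit a 2D affine matrix mapping sensor global coordinates (src_pts)
    to satellite image coordinates (dst_pts) by least squares.
    Inputs are (N, 2) arrays of matched key points (N >= 3), e.g.
    intersections, traffic signs, landmarks."""
    n = src_pts.shape[0]
    # Design matrix for x' = a*x + b*y + c and y' = d*x + e*y + f
    A = np.hstack([src_pts, np.ones((n, 1))])             # (N, 3)
    params, *_ = np.linalg.lstsq(A, dst_pts, rcond=None)  # (3, 2)
    T = np.eye(3)
    T[:2, :] = params.T                                   # 2x3 affine block
    return T

def apply_transform(T, pts):
    """Apply the homogeneous transform T to (N, 2) points."""
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return (homo @ T.T)[:, :2]
```

With the matrix fitted once, sensor detections and the vehicle position could then be projected into the satellite image coordinate system via apply_transform.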
The corresponding satellite map area is acquired according to the sampling position collected by the vehicle-mounted sensor and the inverse transformation information, which comprises the transformation matrix between the satellite image coordinate system and the vehicle-mounted sensor coordinate system.
S102, extracting vehicle-mounted sensor image features through a multi-layer perceptron and inverse perspective transformation to obtain a first feature, and extracting satellite image data features through a UNet network and an FPN network to obtain a second feature;
A multi-layer perceptron (MLP) is a feed-forward neural network that maps input feature vectors to output feature vectors. After the vehicle-mounted sensor image is processed by the multi-layer perceptron, the features in the vehicle-mounted sensor image, i.e., the first feature, are obtained through inverse perspective transformation.
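A minimal sketch of this step is given below, assuming PyTorch, flattened per-pixel camera features, and a precomputed inverse-perspective-mapping (IPM) sampling grid derived from the camera calibration; the class name and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SensorBEVExtractor(nn.Module):
    """Illustrative sketch: an MLP refines per-pixel camera features,
    then a fixed IPM grid resamples them into the bird's-eye view
    (the 'first feature'). The grid would be precomputed from the
    camera intrinsics/extrinsics; here it is a constructor argument."""
    def __init__(self, in_dim, hid_dim, out_dim, ipm_grid):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, out_dim),
        )
        # ipm_grid: (1, H_bev, W_bev, 2) sampling locations in [-1, 1]
        self.register_buffer("ipm_grid", ipm_grid)

    def forward(self, feats):                      # feats: (B, C, H, W)
        B, C, H, W = feats.shape
        x = feats.flatten(2).transpose(1, 2)       # (B, H*W, C)
        x = self.mlp(x)                            # (B, H*W, out_dim)
        x = x.transpose(1, 2).reshape(B, -1, H, W)
        # Inverse perspective transform: resample image-plane features
        # onto the ground plane using the precomputed grid.
        grid = self.ipm_grid.expand(B, -1, -1, -1)
        return F.grid_sample(x, grid, align_corners=False)  # (B, out, Hb, Wb)
```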
The UNet network is a pixel-level classification network whose structure comprises two parts, an encoding part and a decoding part: the first half performs feature extraction and the second half performs up-sampling, and it can distinguish whether a pixel belongs to the foreground or the background. The FPN (Feature Pyramid Network) is an object detection network able to predict objects at different scales. In this embodiment, the UNet and FPN structures are combined to identify and extract the features in the satellite image, which can effectively improve the accuracy of multi-scale target detection and identification.
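A compact sketch of such a combined UNet/FPN satellite-image branch follows; the channel widths, depth, and the choice of the finest pyramid level as the second feature are assumptions for illustration, not details fixed by the disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
    )

class UNetFPN(nn.Module):
    """UNet-style encoder whose multi-scale maps are merged by FPN
    lateral 1x1 convolutions plus a top-down pathway; the finest level
    serves as the 'second feature'."""
    def __init__(self, in_ch=3, widths=(32, 64, 128, 256), fpn_ch=64):
        super().__init__()
        chans = [in_ch] + list(widths)
        self.stages = nn.ModuleList(
            conv_block(chans[i], chans[i + 1]) for i in range(len(widths)))
        self.pool = nn.MaxPool2d(2)
        self.lateral = nn.ModuleList(nn.Conv2d(w, fpn_ch, 1) for w in widths)
        self.smooth = nn.Conv2d(fpn_ch, fpn_ch, 3, padding=1)

    def forward(self, x):                          # x: (B, 3, H, W)
        feats = []
        for i, stage in enumerate(self.stages):
            x = stage(x if i == 0 else self.pool(x))
            feats.append(x)                        # scales 1, 1/2, 1/4, ...
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        p = laterals[-1]                           # coarsest level
        for lower in reversed(laterals[:-1]):      # top-down merging
            p = lower + F.interpolate(p, size=lower.shape[-2:], mode="nearest")
        return self.smooth(p)                      # (B, fpn_ch, H, W)
```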
S103, adjusting the fusion weights of different positions in the image based on position embedding, learning the first feature through linear projection to obtain the attention network Q value, and learning the second feature to obtain the attention network K and V values;
Position embedding is a technique that encodes the information of each position in a sequence into a fixed-length vector; it provides the model with information about positional order, and the model automatically adjusts the fusion weights during learning.
The input feature sequence is linearly projected: the input features are multiplied by three trainable parameter matrices respectively to obtain the corresponding Q, K and V, i.e., Query, Key and Value of the attention network, as sketched below. The linear projection can be a bias-free linear layer operating on multi-dimensional data features.
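As an illustrative sketch only (dimensions and names are assumptions), the projection step can look as follows, with learnable position embeddings added before the bias-free projections:

```python
import torch
import torch.nn as nn

class QKVProjection(nn.Module):
    """Minimal sketch: learnable position embeddings are added so the
    model can reweight different spatial positions, then bias-free
    linear layers project the first feature to queries (Q) and the
    second feature to keys (K) and values (V)."""
    def __init__(self, dim, num_pos_q, num_pos_kv):
        super().__init__()
        self.pos_emb_q = nn.Parameter(torch.zeros(1, num_pos_q, dim))
        self.pos_emb_kv = nn.Parameter(torch.zeros(1, num_pos_kv, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, first_feat, second_feat):
        # first_feat: (B, Nq, dim), second_feat: (B, Nk, dim) flattened maps
        q = self.to_q(first_feat + self.pos_emb_q)
        k = self.to_k(second_feat + self.pos_emb_kv)
        v = self.to_v(second_feat)          # values carry the content
        return q, k, v
```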
S104, taking the segmentation mask and the distance mask in the second feature as attention masks, filtering interference information in the satellite image data based on the attention masks, and extracting the first feature based on a masked cross-attention mechanism to obtain a third feature;
The segmentation mask is derived from a segmentation model, and the distance mask may be determined according to the following equation:

$$M_{\mathrm{dist}}(x,y)=\begin{cases}0, & \mathrm{Eul}(x,y) < D\\ -\infty, & \mathrm{Eul}(x,y)\ge D\end{cases}$$

where Eul(x, y) denotes the Euclidean distance of position (x, y). That is, when the distance is greater than or equal to the fixed threshold D, the target is filtered out by assigning the value -inf; when Eul(x, y) is smaller than D, the corresponding target is retained.
The attention mask is a combination of the segmentation mask and the distance mask, which may be obtained by logically ANDing the segmentation mask and the distance mask, as in the sketch below.
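A minimal sketch of the combined mask, assuming flattened BEV positions and an additive 0/-inf convention so the mask can simply be added to the attention scores; names and shapes are illustrative assumptions:

```python
import torch

def build_attention_mask(seg_mask, coords, vehicle_xy, D):
    """Combine the segmentation mask with the distance mask by logical AND.
    seg_mask: boolean (B, N) foreground mask from the segmentation model;
    coords: (N, 2) BEV positions; vehicle_xy: ego position (2,);
    D: distance threshold from the equation above. Valid positions get 0,
    masked positions get -inf."""
    eul = torch.linalg.norm(coords - vehicle_xy, dim=-1)   # Eul(x, y), (N,)
    dist_keep = (eul < D).unsqueeze(0)                     # (1, N) bool
    keep = seg_mask & dist_keep                            # logical AND
    mask = torch.zeros_like(keep, dtype=torch.float32)
    mask[~keep] = float("-inf")
    return mask                                            # (B, N)
```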
The masked cross-attention mechanism adjusts the attention over the encoder output through a masked cross-attention layer, thereby obtaining the encoder information for the decoding position. The first feature can be further refined by the masked cross-attention mechanism to obtain the third feature.
Specifically, a segmentation mask and a distance mask of the second feature are created; the query vector of the first feature and the key and value vectors of the second feature, learned after linear projection, are acquired; a dot product of the query vector of the first feature and the key vector of the second feature is computed and divided by the square root of the query dimension to obtain the attention score; based on the segmentation mask and the distance mask, the attention score is multiplied element by element by the segmentation mask of the second feature to mask padding positions, and multiplied element by element by the distance mask of the second feature to mask positions in the second feature that exceed a predetermined distance; softmax normalization is applied to the masked attention score to obtain the attention weights; and the attention weights are weighted-summed with the value vector of the second feature to obtain the third feature output by the cross attention.
More generally, for a source sequence and a target sequence: a segmentation mask and a distance mask are created, where the segmentation mask relates to padding and shields padded positions in the source and target sequences, and the distance mask shields remote positions in the target sequence; the source and target sequences are linearly projected to obtain the query (Q) vectors of the source sequence and the key (K) and value (V) vectors of the target sequence; the attention score is computed as the dot product of the source queries and the target keys divided by the square root of the query dimension; the segmentation mask is applied by multiplying the attention score element by element to shield padding positions, and the distance mask is applied likewise to shield remote positions in the target sequence; softmax normalization yields the attention weights; and the weighted sum of the attention weights with the target value vectors gives the cross-attention output.
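The steps above can be sketched as follows. Note one deliberate simplification: the disclosure describes multiplying the scores element by element with the masks, while this sketch uses the equivalent standard additive form, in which masked positions receive -inf before the softmax; all shapes are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v, attn_mask):
    """Masked cross-attention: scaled dot-product scores, additive
    0/-inf mask built from the segmentation and distance masks,
    softmax normalization, then a weighted sum over the values.
    q: (B, Nq, d) from the first feature; k, v: (B, Nk, d) from the
    second feature; attn_mask: (B, Nk) or (B, Nq, Nk)."""
    d = q.shape[-1]
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # (B, Nq, Nk)
    if attn_mask.dim() == 2:
        attn_mask = attn_mask.unsqueeze(1)       # broadcast over queries
    scores = scores + attn_mask                  # -inf removes masked keys
    weights = F.softmax(scores, dim=-1)          # attention weights
    return torch.matmul(weights, v)              # third feature: (B, Nq, d)
```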
S105, aligning the second feature with the third feature, and fusing the aligned second and third features.
The coordinate offset of each position of the second feature is predicted through neural-network convolution layers, and the position of the second feature is adjusted based on the coordinate offset; after feature alignment, the second feature and the third feature are spliced and fused, as in the sketch below.
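A minimal sketch of this alignment-and-fusion step, assuming both features have been reshaped to (B, C, H, W) BEV maps; the offset head's layer sizes and the use of grid_sample for resampling are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetAlign(nn.Module):
    """A small convolutional head predicts a per-position (dx, dy)
    offset (in normalized coordinates) for the second feature, which
    is resampled with grid_sample and concatenated with the third
    feature."""
    def __init__(self, ch):
        super().__init__()
        self.offset = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2, 3, padding=1),   # per-pixel (dx, dy)
        )

    def forward(self, second_feat, third_feat):   # both (B, C, H, W)
        B, _, H, W = second_feat.shape
        delta = self.offset(second_feat).permute(0, 2, 3, 1)   # (B, H, W, 2)
        # Base identity sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, H, device=second_feat.device)
        xs = torch.linspace(-1, 1, W, device=second_feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
        aligned = F.grid_sample(second_feat, base + delta, align_corners=True)
        return torch.cat([aligned, third_feat], dim=1)   # spliced fusion
```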
In this embodiment, vehicle-mounted sensor image features are extracted through a multi-layer perceptron and inverse perspective transformation, and satellite image data features are extracted through a UNet network and an FPN network; the fusion weights of different positions in the image are adjusted based on position embedding, and the attention network Q, K and V values are obtained through linear projection learning; interference information in the satellite image data is filtered based on the attention mask, and vehicle-mounted sensor image features are extracted based on a masked cross-attention mechanism; feature fusion is performed after feature alignment. BEV-level fusion of the vehicle-mounted sensor image and satellite image features is thereby realized, which can improve high-precision map production efficiency and guarantee the accuracy and reliability of feature extraction and fusion.
It should be understood that the sequence number of each step in the above embodiment does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not be construed as limiting the implementation process of the embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a feature fusion system for vehicle-mounted sensor data and satellite image data provided in an embodiment of the present invention, which includes:
the data acquisition module 210, configured to acquire the satellite image data of the acquisition area of the vehicle-mounted sensor;
wherein, the data acquisition module 210 further comprises:
the coordinate conversion module is used for constructing a transformation matrix between the satellite image coordinate system and the global coordinate system of the vehicle-mounted sensor through a key point alignment method; and carrying out coordinate transformation on the vehicle-mounted sensor data based on the transformation matrix.
The feature extraction module 220 is configured to extract vehicle-mounted sensor image features through the multi-layer perceptron and inverse perspective transformation to obtain a first feature, and extract satellite image data features through the UNet network and the FPN network to obtain a second feature;
the feature learning module 230 is configured to adjust fusion weights of different positions in the image based on position embedding, learn a first feature through linear projection to obtain an attention network Q value, and learn a second feature to obtain an attention network K value and a V value;
the filtering and extracting module 240 is configured to take the segmentation mask and the distance mask in the second feature as attention masks, filter interference information in the toilet sheet image data based on the attention masks, and extract the first feature based on a cross attention mechanism with masks to obtain a third feature;
specifically, a segmentation mask of the first feature and a distance mask of the second feature are created;
creating a segmentation mask and a distance mask for the second feature;
acquiring the query vector of the first feature and the key and value vectors of the second feature, learned after linear projection;
performing a dot product of the query vector of the first feature and the key vector of the second feature and dividing by the square root of the query dimension to calculate the attention score;
based on the segmentation mask and the distance mask, multiplying the attention score element by element by the segmentation mask of the second feature to mask padding positions, and multiplying the attention score element by element by the distance mask of the second feature to mask positions in the second feature that exceed a predetermined distance;
performing softmax normalization on the masked attention score to obtain the attention weights;
and performing a weighted summation of the attention weights and the value vector of the second feature to obtain the third feature output by the cross attention.
The alignment and fusion module 250 is configured to align the second feature with the third feature and fuse the aligned second and third features.
Wherein aligning the second feature with the third feature comprises:
and predicting the coordinate offset of each position of the second feature through the neural network convolution layer, and carrying out position adjustment on the second feature based on the coordinate offset.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the above-described system and module may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device is used for feature fusion of the vehicle-mounted sensor data and the satellite image data. As shown in fig. 3, the electronic device 3 of this embodiment includes: a memory 310, a processor 320 and a system bus 330, the memory 310 containing an executable program 3101 stored thereon. It will be understood by those skilled in the art that the electronic device structure shown in fig. 3 does not limit the electronic device, which may include more or fewer components than illustrated, combine certain components, or arrange components differently.
The following describes the respective constituent elements of the electronic device in detail with reference to fig. 3:
the memory 310 may be used to store software programs and modules, and the processor 320 may execute various functional applications and data processing of the electronic device by executing the software programs and modules stored in the memory 310. The memory 310 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the electronic device (such as cache data), and the like. In addition, memory 310 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The memory 310 contains an executable program 3101 of the above-described method. The executable program 3101 may be divided into one or more modules/units, which are stored in the memory 310 and executed by the processor 320 to perform the feature fusion of the vehicle-mounted sensor data and the satellite image data. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution of the executable program 3101 in the electronic device 3. For example, the executable program 3101 may be divided into functional modules such as a data acquisition module, a feature extraction module, a feature learning module, a filtering and extraction module, and an alignment and fusion module.
Processor 320 is a control center of the electronic device that utilizes various interfaces and lines to connect various portions of the overall electronic device, perform various functions of the electronic device and process data by running or executing software programs and/or modules stored in memory 310, and invoking data stored in memory 310, thereby performing overall condition monitoring of the electronic device. Optionally, processor 320 may include one or more processing units; preferably, the processor 320 may integrate an application processor that primarily handles operating systems, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 320.
The system bus 330 is used to connect the functional components of the computer and can transmit data, address and control information; the system bus may be, for example, a PCI bus, an ISA bus or a CAN bus. Instructions from the processor 320 are transferred to the memory 310 through the bus, the memory 310 feeds data back to the processor 320, and the system bus 330 is responsible for data and instruction interaction between the processor 320 and the memory 310. Of course, the system bus 330 may also connect other devices, such as a network interface and a display device.
In an embodiment of the present invention, the executable program executed by the processor 320 included in the electronic device includes:
acquiring the satellite image data of the acquisition area of the vehicle-mounted sensor;
extracting vehicle-mounted sensor image features through a multi-layer perceptron and inverse perspective transformation to obtain a first feature, and extracting satellite image data features through a UNet network and an FPN network to obtain a second feature;
adjusting the fusion weights of different positions in the image based on position embedding, learning the first feature through linear projection to obtain the attention network Q value, and learning the second feature to obtain the attention network K and V values;
taking the segmentation mask and the distance mask in the second feature as attention masks, filtering interference information in the satellite image data based on the attention masks, and extracting the first feature based on a masked cross-attention mechanism to obtain a third feature;
and aligning the second feature with the third feature, and fusing the aligned second feature and third feature.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and module may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
Each of the foregoing embodiments is described with its own emphasis; for parts not detailed or illustrated in one embodiment, reference may be made to the related descriptions of the other embodiments.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A feature fusion method for vehicle-mounted sensor data and satellite image data, characterized by comprising the following steps:
acquiring the satellite image data of the acquisition area of the vehicle-mounted sensor;
extracting vehicle-mounted sensor image features through a multi-layer perceptron and inverse perspective transformation to obtain a first feature, and extracting satellite image data features through a UNet network and an FPN network to obtain a second feature;
adjusting the fusion weights of different positions in the image based on position embedding, learning the first feature through linear projection to obtain the attention network Q value, and learning the second feature to obtain the attention network K and V values;
taking the segmentation mask and the distance mask in the second feature as attention masks, filtering interference information in the satellite image data based on the attention masks, and extracting the first feature based on a masked cross-attention mechanism to obtain a third feature;
and aligning the second feature with the third feature, and fusing the aligned second feature and third feature.
2. The method of claim 1, wherein the acquiring the satellite image data of the vehicle-mounted sensor acquisition area further comprises:
constructing a transformation matrix between a satellite image coordinate system and a global coordinate system of the vehicle-mounted sensor by a key point alignment method;
and carrying out coordinate transformation on the vehicle-mounted sensor data based on the transformation matrix.
3. The method of claim 1, wherein the taking the segmentation mask and the distance mask in the second feature as attention masks, filtering the interference information in the satellite image data based on the attention masks, and extracting the first feature based on the masked cross-attention mechanism to obtain the third feature comprises:
creating a segmentation mask and a distance mask for the second feature;
acquiring the query vector of the first feature and the key and value vectors of the second feature, learned after linear projection;
performing a dot product of the query vector of the first feature and the key vector of the second feature and dividing by the square root of the query dimension to calculate the attention score;
based on the segmentation mask and the distance mask, multiplying the attention score element by element by the segmentation mask of the second feature to mask padding positions, and multiplying the attention score element by element by the distance mask of the second feature to mask positions in the second feature that exceed a predetermined distance;
performing softmax normalization on the masked attention score to obtain the attention weights;
and performing a weighted summation of the attention weights and the value vector of the second feature to obtain the third feature output by the cross attention.
4. The method of claim 1, wherein the aligning the second feature with the third feature comprises:
and predicting the coordinate offset of each position of the second feature through the neural network convolution layer, and carrying out position adjustment on the second feature based on the coordinate offset.
5. A feature fusion system for vehicle-mounted sensor data and satellite image data, characterized by comprising:
a data acquisition module, configured to acquire the satellite image data of the acquisition area of the vehicle-mounted sensor;
a feature extraction module, configured to extract vehicle-mounted sensor image features through the multi-layer perceptron and inverse perspective transformation to obtain a first feature, and extract satellite image data features through the UNet network and the FPN network to obtain a second feature;
a feature learning module, configured to adjust the fusion weights of different positions in the image based on position embedding, learn the first feature through linear projection to obtain the attention network Q value, and learn the second feature to obtain the attention network K and V values;
a filtering and extraction module, configured to take the segmentation mask and the distance mask in the second feature as attention masks, filter interference information in the satellite image data based on the attention masks, and extract the first feature based on a masked cross-attention mechanism to obtain a third feature;
and an alignment and fusion module, configured to align the second feature with the third feature and fuse the aligned second and third features.
6. The system of claim 5, wherein the data acquisition module further comprises:
the coordinate conversion module is used for constructing a transformation matrix between the satellite image coordinate system and the global coordinate system of the vehicle-mounted sensor through a key point alignment method; and carrying out coordinate transformation on the vehicle-mounted sensor data based on the transformation matrix.
7. The system of claim 5, wherein the taking the segmentation mask and the distance mask in the second feature as attention masks, filtering the interference information in the satellite image data based on the attention masks, and extracting the first feature based on the masked cross-attention mechanism to obtain the third feature comprises:
creating a segmentation mask and a distance mask for the second feature;
acquiring the query vector of the first feature and the key and value vectors of the second feature, learned after linear projection;
performing a dot product of the query vector of the first feature and the key vector of the second feature and dividing by the square root of the query dimension to calculate the attention score;
based on the segmentation mask and the distance mask, multiplying the attention score element by element by the segmentation mask of the second feature to mask padding positions, and multiplying the attention score element by element by the distance mask of the second feature to mask positions in the second feature that exceed a predetermined distance;
performing softmax normalization on the masked attention score to obtain the attention weights;
and performing a weighted summation of the attention weights and the value vector of the second feature to obtain the third feature output by the cross attention.
8. The system of claim 5, wherein the aligning the second feature with the third feature comprises:
and predicting the coordinate offset of each position of the second feature through the neural network convolution layer, and carrying out position adjustment on the second feature based on the coordinate offset.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the steps of the method for feature fusion of vehicle-mounted sensor data and satellite image data according to any one of claims 1 to 4.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed, implements the steps of the method for feature fusion of vehicle-mounted sensor data and satellite image data according to any one of claims 1 to 4.
CN202311765226.6A 2023-12-20 2023-12-20 Feature fusion method and system for vehicle-mounted sensor data and satellite image data Pending CN117830776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311765226.6A CN117830776A (en) 2023-12-20 2023-12-20 Feature fusion method and system for vehicle-mounted sensor data and satellite image data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311765226.6A CN117830776A (en) 2023-12-20 2023-12-20 Feature fusion method and system for vehicle-mounted sensor data and satellite image data

Publications (1)

Publication Number Publication Date
CN117830776A (en) 2024-04-05

Family

ID=90520109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311765226.6A Pending CN117830776A (en) 2023-12-20 2023-12-20 Feature fusion method and system for vehicle-mounted sensor data and toilet sheet image data

Country Status (1)

Country Link
CN (1) CN117830776A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination