CN109118539B - Method, device and equipment for fusing point cloud and picture based on multi-scale features - Google Patents

Method, device and equipment for fusing point cloud and picture based on multi-scale features

Info

Publication number
CN109118539B
Authority
CN
China
Prior art keywords
features
fusion
point cloud
convolution
convolution operation
Prior art date
Legal status
Active
Application number
CN201810779366.1A
Other languages
Chinese (zh)
Other versions
CN109118539A (en)
Inventor
徐楷
冯良炳
姚杰
严亮
Current Assignee
Shenzhen Cosmosvision Intelligent Technology Co ltd
Original Assignee
Shenzhen Cosmosvision Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Cosmosvision Intelligent Technology Co ltd
Priority to CN201810779366.1A
Publication of CN109118539A
Application granted
Publication of CN109118539B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the invention provides a method, a device and equipment for fusing a point cloud and a picture based on multi-scale features, wherein the method comprises the following steps: obtaining at least two groups of point cloud features and picture features through a feature extraction network, and performing a first convolution operation on the features; grouping the features according to abstraction degree and performing element-by-element averaging fusion on each group; performing a jump connection between the outputs of the first convolution operation and the feature maps obtained after the grouped element-by-element averaging fusion, and performing a linear fusion operation; performing a second convolution operation on the feature maps obtained after the jump connection and linear fusion; performing element-by-element averaging fusion on the four types of features obtained by the second convolution operation; and performing a third convolution operation on the new fusion features obtained after the averaging fusion, the result serving as the final output features. The method can accurately locate the target object and predict its direction, thereby improving the accuracy of target localization and direction prediction.

Description

Method, device and equipment for fusing point cloud and picture based on multi-scale features
Technical Field
The invention relates to the field of computer vision, in particular to a method, a device and equipment for fusing point cloud and pictures based on multi-scale features.
Background
At present, attention to the safety of automatic driving has made 3D target detection in the field of automatic driving a research hotspot. Compared with 2D target detection, 3D target detection requires depth information that 2D target detection does not, and therefore point cloud data containing depth information obtained by a radar sensor has become one of the data sources for 3D target detection. However, because point cloud data are often sparse and cannot convey rich texture information, detection algorithms relying on them alone do not achieve the expected effect. Compared with point cloud data, image data cannot represent depth information but do convey rich texture information. Under such circumstances, designing an algorithm that can use both point cloud data and image data for 3D target detection with good results has become a problem to be solved urgently.
However, existing point cloud and image fusion methods usually rely on simple operations such as linear addition or averaging. Such processing is too simple and allows no interaction between the data, so existing methods suffer from poor localization and low prediction accuracy in 3D target localization and direction prediction.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for fusing a point cloud and an image based on multi-scale features, which can accurately locate a target object and predict its direction, so as to improve the accuracy of target localization and direction prediction.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides a point cloud and picture fusion method based on multi-scale features, which comprises the following steps:
obtaining at least two groups of point cloud features and picture features through a feature extraction network, and performing a first convolution operation on the obtained point cloud features and picture features respectively through a convolution layer;
grouping the result features output after the point cloud features and the picture features are subjected to the first convolution operation according to the abstraction degree, and then performing element-by-element averaging fusion on the two types of features in each group to obtain two types of fused features;
performing a jump connection between the outputs of the first convolution operation on the point cloud features and the picture features and the feature maps obtained after the grouping and element-by-element averaging fusion operations, and performing a linear fusion operation;
performing a second convolution operation, through a convolution layer, on each of the feature maps obtained after the jump connection and linear fusion;
performing element-by-element averaging fusion on the features obtained by the second convolution operation to obtain new fusion features;
and performing a third convolution operation on the new fusion features, the result being used as the final output features of data fusion.
In some embodiments, the step of obtaining at least two groups of point cloud features and picture features through a feature extraction network, and performing a first convolution operation on the obtained point cloud features and picture features respectively through a convolution layer, further includes:
simultaneously controlling the number of feature maps output by the convolution layer, where the corresponding mathematical formulas are:
im1' = σ(w1_im1^T · im1 + b1_im1)
pc1' = σ(w1_pc1^T · pc1 + b1_pc1)
im2' = σ(w1_im2^T · im2 + b1_im2)
pc2' = σ(w1_pc2^T · pc2 + b1_pc2)
wherein im1', pc1', im2' and pc2' represent the output results of the different convolutional layers;
w1_im1^T, w1_pc1^T, w1_im2^T and w1_pc2^T represent the weight parameters of the different convolutional layers; the weight parameters are obtained automatically through network learning;
im1, pc1, im2 and pc2 represent the input features of the different convolutional layers;
b1_im1, b1_pc1, b1_im2 and b1_pc2 represent the biases of the different convolutional layers; the bias parameters are obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
In some embodiments, the grouping of the result features output after the point cloud features and the picture features are subjected to the first convolution operation according to the abstraction degree, and the performing of element-by-element averaging fusion on the two types of features in each group to obtain two types of fused features, includes:
placing features with the same abstraction degree in the same group, so that the abstraction degrees differ between groups;
the corresponding mathematical formulas are:
impc1[b, h, w, i] = (im1'[b, h, w, i] + pc1'[b, h, w, i]) / 2
impc2[b, h, w, i] = (im2'[b, h, w, i] + pc2'[b, h, w, i]) / 2
wherein b, h, w and i are non-negative integers representing tensor subscript ordinals;
im1', pc1', im2' and pc2' represent the output results of the different convolutional layers.
In some embodiments, the performing of a jump connection between the outputs of the first convolution operation on the point cloud features and the picture features and the feature maps obtained after the grouping and element-by-element averaging fusion operations, together with the linear fusion (concatenation) operation, corresponds to the following mathematical formulas, in which the concatenated features are denoted x1, x2, x3 and x4:
x1[b, h, w, i] = im1'[b, h, w, i]
x1[b, h, w, m + n] = impc2[b, h, w, n]
x2[b, h, w, i] = im2'[b, h, w, i]
x2[b, h, w, m + n] = impc1[b, h, w, n]
x3[b, h, w, i] = pc1'[b, h, w, i]
x3[b, h, w, m + n] = impc2[b, h, w, n]
x4[b, h, w, i] = pc2'[b, h, w, i]
x4[b, h, w, m + n] = impc1[b, h, w, n]
wherein b, h, w and i are non-negative integers representing tensor subscript ordinals,
m, n and i are positive integers,
the ranges of b, h and w are the same across the different formulas, while the range of i differs;
x1, x2, x3 and x4 represent the concatenated features that serve as the input features of the subsequent convolutional layers;
im1', pc1', im2' and pc2' represent the output results of the different convolutional layers;
impc1 and impc2 represent the fused features obtained by element-by-element averaging fusion of the two types of features in each group.
In some embodiments, the performing of a second convolution operation, through a convolution layer, on each of the feature maps obtained after the jump connection and linear fusion further includes:
simultaneously controlling the number of feature maps output by the convolution layer, where the corresponding mathematical formulas are:
y1 = σ(w1^T · x1 + b1)
y2 = σ(w2^T · x2 + b2)
y3 = σ(w3^T · x3 + b3)
y4 = σ(w4^T · x4 + b4)
wherein y1, y2, y3 and y4 represent the output results of the different convolutional layers;
w1^T, w2^T, w3^T and w4^T represent the weight parameters of the different convolutional layers; the weight parameters are obtained automatically through network learning;
x1, x2, x3 and x4 represent the input features of the different convolutional layers;
b1, b2, b3 and b4 represent the biases of the different convolutional layers; the bias parameters are obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
In some embodiments, the features obtained by the second convolution operation are subjected to element-by-element averaging fusion to obtain a new fused feature, where the corresponding formula is:
y5[b, h, w, i] = (y1[b, h, w, i] + y2[b, h, w, i] + y3[b, h, w, i] + y4[b, h, w, i]) / 4
wherein b, h, w and i are non-negative integers representing tensor subscript ordinals;
y1, y2, y3 and y4 represent the output results of the different convolutional layers;
y5 represents the result of element-by-element averaging fusion of the features obtained by the second convolution operation, i.e., the input feature of the subsequent convolutional layer.
In some embodiments, the convolution kernel size of the convolutional layers is 1 × 1, the stride is 1, and the number of feature maps output by the convolutional layers is controlled to be 8.
In some embodiments, the method may further include: performing element-by-element averaging fusion on the features obtained by the second convolution operation to obtain new fusion features, performing a third convolution operation on the new fusion features, and taking the result as the final output features of the data fusion part, where the corresponding formula is:
y6 = σ(w6^T · y5 + b6)    (20)
wherein y6 represents the output of the convolutional layer;
w6^T represents the weight parameter of the convolutional layer; the weight parameter is obtained automatically through network learning;
y5 represents the input feature of the convolutional layer;
b6 represents the bias of the convolutional layer; the bias parameter is obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
The second aspect of the present invention further provides a device for fusing a point cloud and an image based on multi-scale features, which is applied to any one of the above methods for fusing a point cloud and an image based on multi-scale features, and the device includes:
the feature extraction module is used for obtaining point cloud features and picture features through a feature extraction network;
the first convolution module is used for performing first convolution operation on the point cloud characteristics and the image characteristics through a convolution layer respectively;
the grouping fusion module is used for grouping result characteristics output after the point cloud characteristics and the picture characteristics are subjected to the first convolution operation according to the abstraction degree, and then performing element-by-element averaging fusion on each group of two types of characteristics to obtain two types of fused characteristics;
the jump fusion module is used for performing jump connection on the point cloud characteristic and the image characteristic after the first convolution operation and the characteristic graph obtained after grouping and element-by-element averaging fusion operation, and performing linear fusion;
the linear fusion module is used for carrying out linear fusion operation on the feature graph after the jump connection;
the second convolution module is used for performing second convolution operation on the feature graphs obtained after jump connection and linear fusion respectively through convolution layers;
the average fusion module is used for carrying out element-by-element average fusion on the features obtained by the second convolution operation to obtain new fusion features;
and the third convolution module is used for performing element-by-element averaging fusion on the features obtained by the second convolution operation to obtain new fusion features, performing the third convolution operation on the new fusion features, and taking the result as the final output features of data fusion.
The third aspect of the present invention also provides a point cloud and picture fusion device based on multi-scale features, which includes a processor, a computer-readable storage medium, and a computer program stored on the computer-readable storage medium, wherein when the computer program is executed by the processor, the computer program implements the steps in the method according to any one of the above.
The method, device and equipment for fusing a point cloud and a picture based on multi-scale features provided by the embodiments of the invention can enhance the interaction between point cloud features and picture features while keeping the network features acquired from each single sensor independent; the method adopts a nonlinear fusion approach to enhance the expressiveness of the features; a flexible linear fusion mode is added within the nonlinear fusion framework, and quick jump connections improve the utilization of the features, so that the target object can be accurately located and its direction accurately predicted, thereby improving the accuracy of target localization and direction prediction.
Drawings
FIG. 1 is a visualization model diagram of a point cloud and picture fusion algorithm based on multi-scale features according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for fusing a point cloud and an image based on multi-scale features according to an embodiment of the present invention;
fig. 3 is a block diagram of a point cloud and picture fusion apparatus based on multi-scale features according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions and the advantageous effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art that point cloud and picture fusion methods are often processed by simple linear addition or averaging, that such processing allows no interaction between data, and that the localization effect and prediction accuracy in 3D target localization and direction prediction are therefore poor, the invention provides a method, a device and equipment for fusing a point cloud and a picture based on multi-scale features, which can accurately locate a target object and predict its direction, so as to improve the accuracy of target localization and direction prediction and further improve the safety of applying the technology in related fields.
Definitions and explanation of terms:
The convolutional layers mentioned in the embodiments of the present invention are 2D convolutions; each encapsulates a 2D convolutional layer and a ReLU activation layer in TensorFlow.
The initial parameters of the convolutional layers are drawn from a Gaussian distribution with mean 0 and variance 1.
The number of feature maps output by the convolutional layers takes into account the ratio of point cloud feature maps to picture feature maps in the fused data, and allows the ratio between the fused data and the single-sensor network feature data used for re-fusion to be controlled effectively. In this embodiment, the single sensor captures an image from which network feature data are extracted.
It should be noted that the point cloud features and the picture features are processed in the same way and both require abstraction. pc denotes point cloud features and im denotes picture features; different numbers denote different degrees of abstraction, and the same number denotes the same degree of abstraction. The degree of abstraction mainly refers to the number of convolution layers passed through: different degrees of abstraction mean different numbers of convolution layers, and the same degree of abstraction means the same number of convolution layers. For example, im1 has the same degree of abstraction as pc1; im2 has the same degree of abstraction as pc2; im1 and im2 have different degrees of abstraction; pc1 and pc2 have different degrees of abstraction.
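For illustration only, such a convolutional layer can be sketched in TensorFlow as follows; the helper name fuse_conv, its arguments and the use of the tf.keras API are assumptions of this sketch rather than the implementation of the original disclosure:

```python
import tensorflow as tf

def fuse_conv(x, filters=8, name=None):
    """1 x 1 convolution followed by ReLU, as described above.

    Weights are initialized from a Gaussian with mean 0 and variance 1.
    A fresh layer is created on each call; a real model would build the
    layers once and reuse them.
    """
    layer = tf.keras.layers.Conv2D(
        filters=filters,      # number of output feature maps (8 in the text)
        kernel_size=1,        # 1 x 1 convolution kernel
        strides=1,            # stride 1
        padding="same",
        activation="relu",    # sigma(x) = max{0, x}
        kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=1.0),
        name=name,
    )
    return layer(x)
```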
The first embodiment is as follows:
The invention provides a method for fusing a point cloud and a picture based on multi-scale features. Please refer to FIG. 1, which is a visualization model diagram of the multi-scale-feature-based point cloud and picture fusion algorithm provided by an embodiment of the invention, and to FIG. 2; the method specifically comprises the following steps:
S1: at least two groups of point cloud features and picture features are obtained through the feature extraction network, the obtained point cloud features and picture features are each subjected to a first convolution operation through a convolution layer, and the number of feature maps output by the convolution layer is controlled.
Specifically, point cloud features and picture features with different degrees of abstraction are obtained through the feature extraction network; in this embodiment they are im1, im2, pc1 and pc2 (pc denotes point cloud features and im denotes picture features; the same number indicates the same degree of abstraction of the two types of features, and different numbers indicate different degrees of abstraction). The two groups of point cloud features and picture features are each subjected to a convolution operation through a convolution layer to obtain four types of new features im1', im2', pc1' and pc2', and the number of feature maps output by the convolution layer is controlled at the same time; the corresponding mathematical formulas are:
im1' = σ(w1_im1^T · im1 + b1_im1)
pc1' = σ(w1_pc1^T · pc1 + b1_pc1)
im2' = σ(w1_im2^T · im2 + b1_im2)
pc2' = σ(w1_pc2^T · pc2 + b1_pc2)
wherein im1', pc1', im2' and pc2' represent the output results of the different convolutional layers;
w1_im1^T, w1_pc1^T, w1_im2^T and w1_pc2^T represent the weight parameters of the different convolutional layers; the weight parameters are obtained automatically through network learning;
im1, pc1, im2 and pc2 represent the input features of the different convolutional layers;
b1_im1, b1_pc1, b1_im2 and b1_pc2 represent the biases of the different convolutional layers; the bias parameters are obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
The convolution kernels of the convolutional layers are all 1 × 1 in size, the strides are all 1, and the number of feature maps output by each convolutional layer is 8.
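A minimal sketch of step S1 using the illustrative fuse_conv helper above; the dummy input tensors and their shapes are assumptions that only stand in for the real extraction-network outputs:

```python
# Dummy stand-ins for the extraction-network outputs (shapes are illustrative).
im1 = tf.random.normal([2, 32, 32, 16])   # picture features, abstraction level 1
pc1 = tf.random.normal([2, 32, 32, 16])   # point cloud features, abstraction level 1
im2 = tf.random.normal([2, 32, 32, 32])   # picture features, abstraction level 2
pc2 = tf.random.normal([2, 32, 32, 32])   # point cloud features, abstraction level 2

# S1: first 1 x 1 convolution on each input, 8 output feature maps each.
im1_c = fuse_conv(im1, filters=8, name="conv1_im1")   # im1'
pc1_c = fuse_conv(pc1, filters=8, name="conv1_pc1")   # pc1'
im2_c = fuse_conv(im2, filters=8, name="conv1_im2")   # im2'
pc2_c = fuse_conv(pc2, filters=8, name="conv1_pc2")   # pc2'
```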
S2: the result features output after the point cloud features and the picture features are subjected to the first convolution operation are grouped according to the degree of abstraction, and element-by-element averaging fusion is then performed on the two types of features in each group to obtain two types of fused features.
Specifically, the point cloud features and the picture features from S1 are grouped according to their degree of abstraction after the first convolution operation; that is, features with the same degree of abstraction are placed in the same group, and the degrees of abstraction differ between groups. Element-by-element averaging fusion is then carried out on the two types of features in each group to obtain two types of fused features, namely impc1 and impc2, which correspond to the two left-hand fused features in FIG. 1.
The corresponding mathematical formulas are:
impc1[b, h, w, i] = (im1'[b, h, w, i] + pc1'[b, h, w, i]) / 2
impc2[b, h, w, i] = (im2'[b, h, w, i] + pc2'[b, h, w, i]) / 2
where b, h, w and i are non-negative integers representing tensor subscript ordinals.
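Continuing the sketch above, step S2 reduces to an element-wise mean within each abstraction-level group (variable names follow the previous sketch and are illustrative):

```python
# S2: element-wise mean fusion within each abstraction-level group.
impc1 = (im1_c + pc1_c) / 2.0   # fuses the level-1 picture and point cloud features
impc2 = (im2_c + pc2_c) / 2.0   # fuses the level-2 picture and point cloud features
```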
S3: a jump connection is performed between the outputs of the first convolution operation on the point cloud features and the picture features and the feature maps obtained after the grouping and element-by-element averaging fusion operations, and a linear fusion operation is performed.
Specifically, im1' and impc2, im2' and impc1, pc1' and impc2, and pc2' and impc1 are connected by jump connections and subjected to linear fusion, i.e., a concatenation operation, corresponding to the concatenation operation in FIG. 1; the concatenated features are denoted here as x1, x2, x3 and x4.
The corresponding mathematical formulas are:
x1[b, h, w, i] = im1'[b, h, w, i]
x1[b, h, w, m + n] = impc2[b, h, w, n]
x2[b, h, w, i] = im2'[b, h, w, i]
x2[b, h, w, m + n] = impc1[b, h, w, n]
x3[b, h, w, i] = pc1'[b, h, w, i]
x3[b, h, w, m + n] = impc2[b, h, w, n]
x4[b, h, w, i] = pc2'[b, h, w, i]
x4[b, h, w, m + n] = impc1[b, h, w, n]
wherein b, h, w and i are non-negative integers representing tensor subscript ordinals;
b corresponds to a hyperparameter value set during network training (an integer set according to the actual situation);
h and w correspond to the length and width of the feature map, respectively, and can be set to appropriate integer values according to the actual situation;
i corresponds to the number of feature maps and can be set to an appropriate integer value according to the actual situation;
b, h, w and i have no explicit range limits; once the network structure is designed, their values are determined.
m, n and i are positive integers; the ranges of b, h and w are the same across the different formulas, while the range of i differs.
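Under the same assumptions, step S3 of the sketch pairs each convolved feature with the fused feature of the other abstraction level and concatenates along the channel axis:

```python
# S3: jump connection + linear fusion (channel-wise concatenation).
x1 = tf.concat([im1_c, impc2], axis=-1)   # im1' with impc2
x2 = tf.concat([im2_c, impc1], axis=-1)   # im2' with impc1
x3 = tf.concat([pc1_c, impc2], axis=-1)   # pc1' with impc2
x4 = tf.concat([pc2_c, impc1], axis=-1)   # pc2' with impc1
```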
S4: the feature maps obtained after the jump connection and linear fusion are each passed through a convolution layer for a second convolution operation.
Specifically, the feature maps obtained in S3 are respectively subjected to a second convolution operation by a convolution layer to obtain four new types of features, and the number of feature maps output by the convolution layer is controlled, where the corresponding mathematical formula is:
y1 = σ(w1^T · x1 + b1)
y2 = σ(w2^T · x2 + b2)
y3 = σ(w3^T · x3 + b3)
y4 = σ(w4^T · x4 + b4)
wherein y1, y2, y3 and y4 represent the output results of the different convolutional layers;
w1^T, w2^T, w3^T and w4^T represent the weight parameters of the different convolutional layers; the weight parameters are obtained automatically through network learning;
x1, x2, x3 and x4 represent the input features of the different convolutional layers;
b1, b2, b3 and b4 represent the biases of the different convolutional layers; the bias parameters are obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
In this embodiment, the convolution kernels of the convolutional layers are all 1 × 1 in size, the strides are all 1, and the number of feature maps output by each convolutional layer is controlled to be 8.
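Step S4, continued in the same sketch, applies one more 1 × 1 convolution per concatenated branch (layer names are illustrative):

```python
# S4: second 1 x 1 convolution on each concatenated branch.
y1 = fuse_conv(x1, filters=8, name="conv2_1")
y2 = fuse_conv(x2, filters=8, name="conv2_2")
y3 = fuse_conv(x3, filters=8, name="conv2_3")
y4 = fuse_conv(x4, filters=8, name="conv2_4")
```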
S5: element-by-element averaging fusion is performed on the four types of features obtained by the second convolution operation to obtain a new fusion feature.
Specifically, the four types of features obtained in S4 are subjected to element-by-element averaging fusion to obtain a new fusion feature y5, which corresponds to the rightmost feature in FIG. 1.
The corresponding mathematical formula is:
y5[b, h, w, i] = (y1[b, h, w, i] + y2[b, h, w, i] + y3[b, h, w, i] + y4[b, h, w, i]) / 4
where b, h, w and i are non-negative integers representing tensor subscript ordinals;
y1, y2, y3 and y4 represent the output results of the different convolutional layers;
y5 represents the result obtained by element-by-element averaging fusion of the features obtained by the second convolution operation.
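Step S5 in the sketch is a plain element-wise mean over the four branches:

```python
# S5: element-wise average of the four second-convolution outputs.
y5 = (y1 + y2 + y3 + y4) / 4.0
# Equivalently: y5 = tf.reduce_mean(tf.stack([y1, y2, y3, y4], axis=0), axis=0)
```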
S6: a third convolution operation is performed on the new fusion feature obtained by the element-by-element averaging fusion of the four types of features from the second convolution operation, and the result is used as the final output feature of data fusion.
Specifically, the new fusion feature obtained in S5 is passed through a convolution layer for the third convolution operation, and the result is used as the final output feature of the data fusion part; the corresponding formula is:
y6 = σ(w6^T · y5 + b6)    (20)
wherein y6 represents the output of the convolutional layer;
w6^T represents the weight parameter of the convolutional layer; the weight parameter is obtained automatically through network learning;
y5 represents the input feature of the convolutional layer;
b6 represents the bias of the convolutional layer; the bias parameter is obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
The convolution kernels of the convolutional layers are all 1 × 1 in size, the strides are all 1, and the number of feature maps output by each convolutional layer is 8.
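Step S6 closes the sketch with a final 1 × 1 convolution whose output is the fused feature handed to the downstream detection network:

```python
# S6: third 1 x 1 convolution; y6 is the final output of the fusion block.
y6 = fuse_conv(y5, filters=8, name="conv3")
```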
The model of the above steps S1-S6 is expressed by two formulas (given as images in the original document) that combine the layer-L and layer-(L-k) picture and point cloud features through the operators C and M, with the following notation:
Note: f[L+1] represents the features of the (L+1)-th network layer, with different subscripts representing different input feature sources (the input feature source is the content within the parentheses immediately following f);
f[L](im2) represents the L-th-layer network features of the picture, with input source im2;
f[L](pc2) represents the L-th-layer network features of the point cloud, with input source pc2;
the operator is either C (concatenate, the linear fusion operation) or M (element-wise mean);
L represents the number of convolution layers;
k is a positive integer less than L.
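Putting the pieces together, the whole fusion block of steps S1-S6 can be sketched as a single function; it reuses the illustrative fuse_conv helper defined earlier, and all names, shapes and the choice of the tf.keras API are assumptions rather than the patented implementation:

```python
def multiscale_fusion(im1, pc1, im2, pc2, filters=8):
    """Sketch of the fusion block (steps S1-S6) for feature maps shaped
    [batch, height, width, channels]; `filters` follows the text (8)."""
    # S1: first 1 x 1 convolution on each input feature map.
    im1_c = fuse_conv(im1, filters, name="conv1_im1")
    pc1_c = fuse_conv(pc1, filters, name="conv1_pc1")
    im2_c = fuse_conv(im2, filters, name="conv1_im2")
    pc2_c = fuse_conv(pc2, filters, name="conv1_pc2")

    # S2: element-wise mean fusion within each abstraction level.
    impc1 = (im1_c + pc1_c) / 2.0
    impc2 = (im2_c + pc2_c) / 2.0

    # S3: jump connection + channel-wise concatenation across levels.
    x1 = tf.concat([im1_c, impc2], axis=-1)
    x2 = tf.concat([im2_c, impc1], axis=-1)
    x3 = tf.concat([pc1_c, impc2], axis=-1)
    x4 = tf.concat([pc2_c, impc1], axis=-1)

    # S4: second 1 x 1 convolution on each branch.
    y1 = fuse_conv(x1, filters, name="conv2_1")
    y2 = fuse_conv(x2, filters, name="conv2_2")
    y3 = fuse_conv(x3, filters, name="conv2_3")
    y4 = fuse_conv(x4, filters, name="conv2_4")

    # S5: element-wise mean of the four branches.
    y5 = (y1 + y2 + y3 + y4) / 4.0

    # S6: third 1 x 1 convolution -> final fused output feature.
    return fuse_conv(y5, filters, name="conv3")
```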
The method provided by the embodiment of the invention improves on the problems that existing fusion algorithms are too simple and allow no interaction between data: it strengthens the interaction between point cloud and picture data while preserving the independence of the data, adopts a more expressive nonlinear fusion mode, uses a linear fusion mode for flexible feature concatenation, and integrates features through small-scale convolution kernels. Tests show that the multi-scale-feature-based point cloud and picture fusion method provided by the invention achieves better results than existing fusion methods in terms of 3D target localization accuracy and direction prediction accuracy.
The point cloud and picture fusion method based on multi-scale features provided by the embodiment of the invention can enhance the interaction between point cloud features and picture features while keeping the network features acquired from each single sensor independent; the method adopts a nonlinear fusion approach to enhance the expressiveness of the features; a flexible linear fusion mode is added within the nonlinear fusion framework, and quick jump connections improve the utilization of the features, so that the target object can be accurately located and its direction accurately predicted, thereby improving the accuracy of target localization and direction prediction.
Example two
The embodiment of the present invention further provides a device for fusing a point cloud and an image based on multi-scale features, please refer to fig. 3, wherein the device includes the following modules:
the system comprises a feature extraction module 10, a first convolution module 20, a grouping fusion module 30, a jump connection module 40, a linear fusion module 50, a second convolution module 60, an average fusion module 70 and a third convolution module 80.
The feature extraction module 10 is configured to obtain point cloud features and image features by extracting a feature network.
The first convolution module 20 is configured to perform a first convolution operation on the obtained point cloud features and the obtained image features through a convolution layer respectively, and control the number of feature maps output by the convolution layer.
Specifically, at least two groups of point cloud features and picture features are obtained through the feature extraction module 10; the obtained point cloud features and picture features (with different degrees of abstraction for the two types of features) are then each convolved by the first convolution module 20, and the number of feature maps output by the convolution layer is controlled. The point cloud features and picture features are im1, im2, pc1 and pc2 respectively (pc denotes point cloud features and im denotes picture features; the same number indicates the same degree of abstraction of the two types of features, and different numbers indicate different degrees of abstraction). The two groups of point cloud features and picture features are each subjected to a convolution operation through a convolution layer to obtain four types of new features im1', im2', pc1' and pc2', and the number of feature maps output by the convolution layer is controlled at the same time; the corresponding mathematical formulas are:
im1' = σ(w1_im1^T · im1 + b1_im1)
pc1' = σ(w1_pc1^T · pc1 + b1_pc1)
im2' = σ(w1_im2^T · im2 + b1_im2)
pc2' = σ(w1_pc2^T · pc2 + b1_pc2)
wherein im1', pc1', im2' and pc2' represent the output results of the different convolutional layers;
w1_im1^T, w1_pc1^T, w1_im2^T and w1_pc2^T represent the weight parameters of the different convolutional layers; the weight parameters are obtained automatically through network learning;
im1, pc1, im2 and pc2 represent the input features of the different convolutional layers;
b1_im1, b1_pc1, b1_im2 and b1_pc2 represent the biases of the different convolutional layers; the bias parameters are obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
The convolution kernels of the convolutional layers are all 1 × 1 in size, the strides are all 1, and the number of feature maps output by each convolutional layer is 8.
The grouping and fusing module 30 is configured to group result features output after the point cloud features and the picture features are subjected to the first convolution operation according to the abstraction degree, and perform element-by-element averaging and fusing on each group of two types of features respectively to obtain two types of fused features.
Specifically, the result features output after the point cloud features and the picture features are subjected to the first convolution operation are grouped according to the degree of abstraction; that is, features with the same degree of abstraction are placed in the same group, and the degrees of abstraction differ between groups. Element-by-element averaging fusion is then carried out on the two types of features in each group to obtain two types of fused features, namely impc1 and impc2, which correspond to the two left-hand fused features in FIG. 1.
The corresponding mathematical formulas are:
impc1[b, h, w, i] = (im1'[b, h, w, i] + pc1'[b, h, w, i]) / 2
impc2[b, h, w, i] = (im2'[b, h, w, i] + pc2'[b, h, w, i]) / 2
where b, h, w and i are non-negative integers representing tensor subscript ordinals.
And the jump fusion module 40 is used for performing jump connection on the output result of the point cloud characteristic and the image characteristic after the first convolution operation and the characteristic graph obtained after grouping and element-by-element averaging fusion operation, and performing linear fusion.
Specifically, im1' and impc2, im2' and impc1, pc1' and impc2, and pc2' and impc1 are connected by jump connections and subjected to linear fusion, i.e., a concatenation operation, corresponding to the concatenation operation in FIG. 1; the concatenated features are denoted here as x1, x2, x3 and x4.
The corresponding mathematical formulas are:
x1[b, h, w, i] = im1'[b, h, w, i]
x1[b, h, w, m + n] = impc2[b, h, w, n]
x2[b, h, w, i] = im2'[b, h, w, i]
x2[b, h, w, m + n] = impc1[b, h, w, n]
x3[b, h, w, i] = pc1'[b, h, w, i]
x3[b, h, w, m + n] = impc2[b, h, w, n]
x4[b, h, w, i] = pc2'[b, h, w, i]
x4[b, h, w, m + n] = impc1[b, h, w, n]
wherein b, h, w and i are non-negative integers representing tensor subscript ordinals;
b corresponds to a hyperparameter value set during network training (an integer set according to the actual situation);
h and w correspond to the length and width of the feature map, respectively, and can be set to appropriate integer values according to the actual situation;
i corresponds to the number of feature maps and can be set to an appropriate integer value according to the actual situation;
b, h, w and i have no explicit range limits; once the network structure is designed, their values are determined.
m, n and i are positive integers; the ranges of b, h and w are the same across the different formulas, while the range of i differs.
And the linear fusion module 50 is configured to perform a linear fusion operation on the feature map after the jump connection.
And the second convolution module 60 is configured to perform a second convolution operation on the feature maps obtained by performing jump connection and linear fusion respectively through convolution layers.
Specifically, the feature maps obtained in S3 are respectively subjected to a second convolution operation by a convolution layer to obtain four new types of features, and the number of feature maps output by the convolution layer is controlled, where the corresponding mathematical formula is:
y1 = σ(w1^T · x1 + b1)
y2 = σ(w2^T · x2 + b2)
y3 = σ(w3^T · x3 + b3)
y4 = σ(w4^T · x4 + b4)
wherein y1, y2, y3 and y4 represent the output results of the different convolutional layers;
w1^T, w2^T, w3^T and w4^T represent the weight parameters of the different convolutional layers; the weight parameters are obtained automatically through network learning;
x1, x2, x3 and x4 represent the input features of the different convolutional layers;
b1, b2, b3 and b4 represent the biases of the different convolutional layers; the bias parameters are obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
In this embodiment, the convolution kernels of the convolutional layers are all 1 × 1 in size, the strides are all 1, and the number of feature maps output by each convolutional layer is controlled to be 8.
The average fusion module 70 is configured to perform element-by-element average fusion on the four types of features obtained through the second convolution operation to obtain new fusion features.
Specifically, the four types of features obtained by the second convolution operation are subjected to element-by-element averaging fusion to obtain a new fusion feature y5, which corresponds to the rightmost feature in FIG. 1.
The corresponding mathematical formula is:
y5[b, h, w, i] = (y1[b, h, w, i] + y2[b, h, w, i] + y3[b, h, w, i] + y4[b, h, w, i]) / 4
where b, h, w and i are non-negative integers representing tensor subscript ordinals.
The third convolution module 80 is used for performing a third convolution operation on the new fusion feature obtained by element-by-element averaging fusion of the four types of features from the second convolution operation, the result being used as the final output feature of data fusion.
Specifically, the feature obtained in S5 is subjected to a convolution layer for the third convolution operation, and is used as the final output feature of the data fusion part, and the corresponding mathematical formula is:
y6 = σ(w6^T · y5 + b6)    (20)
wherein y6 represents the output of the convolutional layer;
w6^T represents the weight parameter of the convolutional layer; the weight parameter is obtained automatically through network learning;
y5 represents the input feature of the convolutional layer;
b6 represents the bias of the convolutional layer; the bias parameter is obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
The convolution kernels of the convolutional layers are all 1 × 1 in size, the strides are all 1, and the number of feature maps output by each convolutional layer is 8.
The point cloud and picture fusion device based on multi-scale features provided by the embodiment of the invention can enhance the interaction between point cloud features and picture features while keeping the network features acquired from each single sensor independent; in the embodiment of the invention, the first convolution module 20, the second convolution module 60, the third convolution module 80 and the jump connection module 40 enhance the expressiveness of the features; by adding a flexible linear fusion mode through the linear fusion module 50 within the nonlinear fusion framework and by using the quick jump connections of the jump connection module 40, the utilization of the features can be effectively improved, so that the target object can be accurately located and its direction predicted, thereby improving the accuracy of target localization and direction prediction.
Example three:
according to an embodiment of the present invention, the device includes a processor, a computer-readable storage medium, and a computer program stored on the computer-readable storage medium, where the computer program, when executed by the processor, implements the steps in the above method for fusing a point cloud and an image based on multi-scale features, and the specific steps are as described in the first embodiment, and are not described herein again.
The memory in the present embodiment may be used to store software programs as well as various data. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the mobile phone, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
According to an example of this embodiment, all or part of the processes in the methods of the embodiments described above may be implemented by a computer program to instruct related hardware, where the program may be stored in a computer-readable storage medium, and in this embodiment of the present invention, the program may be stored in the storage medium of a computer system and executed by at least one processor in the computer system, so as to implement the processes including the embodiments of the methods described above. The storage medium includes, but is not limited to, a magnetic disk, a flash disk, an optical disk, a Read-Only Memory (ROM), and the like.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and are not to be construed as limiting the scope of the invention. Those skilled in the art can implement the invention with various modifications, for example by using features from one embodiment in another embodiment to yield yet a further embodiment, without departing from the scope and spirit of the invention. Any modification, equivalent replacement or improvement made within the technical idea of the present invention shall fall within the scope of protection of the present invention.

Claims (9)

1. A point cloud and picture fusion method based on multi-scale features is characterized by comprising the following steps:
obtaining at least two groups of point cloud features and picture features through a feature extraction network, and performing a first convolution operation on the obtained point cloud features and the obtained picture features through a convolution layer respectively;
grouping result features output after the point cloud features and the picture features are subjected to the first convolution operation according to the abstraction degree, and then respectively carrying out element-by-element averaging fusion on each group of two types of features to obtain two types of fused features;
performing one-time jump connection on the output results of the point cloud characteristics and the image characteristics after the first convolution operation and the characteristic graphs obtained after grouping and element-by-element averaging fusion operation, and performing linear fusion operation;
respectively carrying out a second convolution operation on the feature graphs obtained after the jump connection and the linear fusion through a convolution layer;
performing element-by-element averaging fusion on the features obtained by the second convolution operation to obtain new fusion features;
performing element-by-element averaging fusion on the features obtained by the second convolution operation to obtain new fusion features, performing a third convolution operation on the new fusion features, and taking the result as the final output features of data fusion;
wherein the grouping of the result features output after the point cloud features and the picture features are subjected to the first convolution operation according to the abstraction degree, and the performing of element-by-element averaging fusion on the two types of features in each group to obtain two types of fused features, comprises:
placing features with the same abstraction degree in the same group, so that the abstraction degrees differ between groups;
the corresponding mathematical formulas are:
impc1[b, h, w, i] = (im1'[b, h, w, i] + pc1'[b, h, w, i]) / 2
impc2[b, h, w, i] = (im2'[b, h, w, i] + pc2'[b, h, w, i]) / 2
wherein b, h, w and i are non-negative integers representing tensor subscript ordinals;
im1', pc1', im2' and pc2' represent the output results of the different convolutional layers.
2. The method of claim 1, wherein the step of obtaining at least two groups of point cloud features and picture features through a feature extraction network, and performing a first convolution operation on the obtained point cloud features and picture features respectively through a convolution layer, further comprises:
simultaneously controlling the number of feature maps output by the convolution layer, where the corresponding mathematical formulas are:
im1' = σ(w1_im1^T · im1 + b1_im1)
pc1' = σ(w1_pc1^T · pc1 + b1_pc1)
im2' = σ(w1_im2^T · im2 + b1_im2)
pc2' = σ(w1_pc2^T · pc2 + b1_pc2)
wherein im1', pc1', im2' and pc2' represent the output results of the different convolutional layers;
w1_im1^T, w1_pc1^T, w1_im2^T and w1_pc2^T represent the weight parameters of the different convolutional layers; the weight parameters are obtained automatically through network learning;
im1, pc1, im2 and pc2 represent the input features of the different convolutional layers;
b1_im1, b1_pc1, b1_im2 and b1_pc2 represent the biases of the different convolutional layers; the bias parameters are obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
3. The method for fusing a point cloud and a picture based on multi-scale features of claim 1, wherein the performing of a jump connection between the outputs of the first convolution operation on the point cloud features and the picture features and the feature maps obtained after the grouping and element-by-element averaging fusion operations, together with the linear fusion (concatenation) operation, corresponds to the following mathematical formulas, in which the concatenated features are denoted x1, x2, x3 and x4:
x1[b, h, w, i] = im1'[b, h, w, i]
x1[b, h, w, m + n] = impc2[b, h, w, n]
x2[b, h, w, i] = im2'[b, h, w, i]
x2[b, h, w, m + n] = impc1[b, h, w, n]
x3[b, h, w, i] = pc1'[b, h, w, i]
x3[b, h, w, m + n] = impc2[b, h, w, n]
x4[b, h, w, i] = pc2'[b, h, w, i]
x4[b, h, w, m + n] = impc1[b, h, w, n]
wherein b, h, w and i are non-negative integers representing tensor subscript ordinals,
m, n and i are positive integers,
the ranges of b, h and w are the same across the different formulas, while the range of i differs;
x1, x2, x3 and x4 represent the concatenated features that serve as the input features of the subsequent convolutional layers;
im1', pc1', im2' and pc2' represent the output results of the different convolutional layers;
impc1 and impc2 represent the fused features obtained by element-by-element averaging fusion of the two types of features in each group.
4. The method for fusing a point cloud and a picture based on multi-scale features of claim 1, wherein performing the second convolution operation, through a convolution layer, on each of the feature maps obtained after the jump connection and linear fusion further comprises:
simultaneously controlling the number of feature maps output by the convolution layer, where the corresponding mathematical formulas are:
y1 = σ(w1^T · x1 + b1)
y2 = σ(w2^T · x2 + b2)
y3 = σ(w3^T · x3 + b3)
y4 = σ(w4^T · x4 + b4)
wherein y1, y2, y3 and y4 represent the output results of the different convolutional layers;
w1^T, w2^T, w3^T and w4^T represent the weight parameters of the different convolutional layers; the weight parameters are obtained automatically through network learning;
x1, x2, x3 and x4 represent the input features of the different convolutional layers;
b1, b2, b3 and b4 represent the biases of the different convolutional layers; the bias parameters are obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
5. The method for fusing a point cloud and a picture based on multi-scale features of claim 1, wherein the features obtained by the second convolution operation are subjected to element-by-element averaging fusion to obtain a new fused feature, where the corresponding formula is:
y5[b, h, w, i] = (y1[b, h, w, i] + y2[b, h, w, i] + y3[b, h, w, i] + y4[b, h, w, i]) / 4
wherein b, h, w and i are non-negative integers representing tensor subscript ordinals;
y1, y2, y3 and y4 represent the output results of the different convolutional layers;
y5 represents the result of element-by-element averaging fusion of the features obtained by the second convolution operation, i.e., the input feature of the subsequent convolutional layer.
6. The method of fusing point cloud and picture based on multi-scale features of claim 4 or 5, wherein the convolution kernel size is 1 x 1, the step size is 1, and the number of feature maps output by the convolution layer is controlled to be 8.
7. The method for fusing a point cloud and a picture based on multi-scale features of claim 1, wherein the features obtained by the second convolution operation are subjected to element-by-element averaging fusion to obtain new fusion features, the new fusion features are subjected to a third convolution operation, and the result is used as the final output features of the data fusion part, where the corresponding formula is:
y6 = σ(w6^T · y5 + b6)    (20)
wherein y6 represents the output of the convolutional layer;
w6^T represents the weight parameter of the convolutional layer; the weight parameter is obtained automatically through network learning;
y5 represents the input feature of the convolutional layer;
b6 represents the bias of the convolutional layer; the bias parameter is obtained automatically through network learning;
σ represents the activation function, corresponding to max{0, x}.
8. A multi-scale feature-based point cloud and picture fusion device applied to the multi-scale feature-based point cloud and picture fusion method of any one of claims 1 to 7, the device comprising:
the feature extraction module is used for obtaining point cloud features and picture features through a feature extraction network;
the first convolution module is used for performing first convolution operation on the point cloud characteristics and the image characteristics through a convolution layer respectively;
the grouping fusion module is used for grouping result characteristics output after the point cloud characteristics and the picture characteristics are subjected to the first convolution operation according to the abstraction degree, and then performing element-by-element averaging fusion on each group of two types of characteristics to obtain two types of fused characteristics;
the jump fusion module is used for performing jump connection on the point cloud characteristic and the image characteristic after the first convolution operation and the characteristic graph obtained after grouping and element-by-element averaging fusion operation, and performing linear fusion;
the linear fusion module is used for carrying out linear fusion operation on the feature graph after the jump connection;
the second convolution module is used for performing second convolution operation on the feature graphs obtained after jump connection and linear fusion respectively through convolution layers;
the average fusion module is used for carrying out element-by-element average fusion on the features obtained by the second convolution operation to obtain new fusion features;
and the third convolution module is used for performing element-by-element averaging fusion on the features obtained by the second convolution operation to obtain new fusion features, performing the third convolution operation on the new fusion features, and taking the result as the final output features of data fusion.
9. A multi-scale feature based point cloud and picture fusion device, comprising a processor, a computer readable storage medium, and a computer program stored on the computer readable storage medium, wherein the computer program, when executed by the processor, implements the steps of the method according to any one of claims 1 to 7.
CN201810779366.1A 2018-07-16 2018-07-16 Method, device and equipment for fusing point cloud and picture based on multi-scale features Active CN109118539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810779366.1A CN109118539B (en) 2018-07-16 2018-07-16 Method, device and equipment for fusing point cloud and picture based on multi-scale features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810779366.1A CN109118539B (en) 2018-07-16 2018-07-16 Method, device and equipment for fusing point cloud and picture based on multi-scale features

Publications (2)

Publication Number Publication Date
CN109118539A CN109118539A (en) 2019-01-01
CN109118539B true CN109118539B (en) 2020-10-09

Family

ID=64862857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810779366.1A Active CN109118539B (en) 2018-07-16 2018-07-16 Method, device and equipment for fusing point cloud and picture based on multi-scale features

Country Status (1)

Country Link
CN (1) CN109118539B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378398B (en) * 2019-06-27 2023-08-25 东南大学 Deep learning network improvement method based on multi-scale feature map jump fusion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268935A (en) * 2014-09-18 2015-01-07 华南理工大学 Feature-based airborne laser point cloud and image data fusion system and method
EP2833322A1 (en) * 2013-07-30 2015-02-04 The Boeing Company Stereo-motion method of three-dimensional (3-D) structure information extraction from a video for fusion with 3-D point cloud data
CN105931234A (en) * 2016-04-19 2016-09-07 东北林业大学 Ground three-dimensional laser scanning point cloud and image fusion and registration method
CN108053367A (en) * 2017-12-08 2018-05-18 北京信息科技大学 A kind of 3D point cloud splicing and fusion method based on RGB-D characteristic matchings
CN108229548A (en) * 2017-12-27 2018-06-29 华为技术有限公司 A kind of object detecting method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8275194B2 (en) * 2008-02-15 2012-09-25 Microsoft Corporation Site modeling using image data fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2833322A1 (en) * 2013-07-30 2015-02-04 The Boeing Company Stereo-motion method of three-dimensional (3-D) structure information extraction from a video for fusion with 3-D point cloud data
CN104268935A (en) * 2014-09-18 2015-01-07 华南理工大学 Feature-based airborne laser point cloud and image data fusion system and method
CN105931234A (en) * 2016-04-19 2016-09-07 东北林业大学 Ground three-dimensional laser scanning point cloud and image fusion and registration method
CN108053367A (en) * 2017-12-08 2018-05-18 北京信息科技大学 A kind of 3D point cloud splicing and fusion method based on RGB-D characteristic matchings
CN108229548A (en) * 2017-12-27 2018-06-29 华为技术有限公司 A kind of object detecting method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Door and Cabinet Recognition Using Convolutional Neural Nets and Real-Time Method Fusion for Handle Detection and Grasping; Adrian Llopart et al.; IEEE; 2017-06-08; pp. 144-149 *
Research on the fusion of laser imaging radar point cloud images and visible light images based on the SIFT algorithm; Li Zhida; China Master's Theses Full-text Database, Information Science and Technology; 2014-10-15; pp. 1-56 *

Also Published As

Publication number Publication date
CN109118539A (en) 2019-01-01

Similar Documents

Publication Publication Date Title
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
CN110226172B (en) Transforming a source domain image into a target domain image
CN108229479B (en) Training method and device of semantic segmentation model, electronic equipment and storage medium
CN110168560B (en) Method, system and medium for scene understanding and generation
EP3201881B1 (en) 3-dimensional model generation using edges
CN109816769A (en) Scene based on depth camera ground drawing generating method, device and equipment
JP7123133B2 (en) Binocular Image Depth Estimation Method and Apparatus, Equipment, Program and Medium
US20190301861A1 (en) Method and apparatus for binocular ranging
WO2020024585A1 (en) Method and apparatus for training object detection model, and device
EP3679521A1 (en) Segmenting objects by refining shape priors
US20190080462A1 (en) Method and apparatus for calculating depth map based on reliability
KR102140805B1 (en) Neural network learning method and apparatus for object detection of satellite images
CN110852349A (en) Image processing method, detection method, related equipment and storage medium
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
CN109919209A (en) A kind of domain-adaptive deep learning method and readable storage medium storing program for executing
CN112348828A (en) Example segmentation method and device based on neural network and storage medium
CN111709415B (en) Target detection method, device, computer equipment and storage medium
CN114627173A (en) Data enhancement for object detection by differential neural rendering
CN110807379A (en) Semantic recognition method and device and computer storage medium
CN109118539B (en) Method, device and equipment for fusing point cloud and picture based on multi-scale features
CN113657396B (en) Training method, translation display method, device, electronic equipment and storage medium
CN109035338B (en) Point cloud and picture fusion method, device and equipment based on single-scale features
CN108898557B (en) Image restoration method and apparatus, electronic device, computer program, and storage medium
CN108062761A (en) Image partition method, device and computing device based on adaptive tracing frame
CN111027670B (en) Feature map processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant