CN111652261A - Multi-modal perception fusion system - Google Patents

Multi-modal perception fusion system

Info

Publication number
CN111652261A
CN111652261A (application CN202010120330.XA)
Authority
CN
China
Prior art keywords
fusion system
modal
camera
laser radar
cameras
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010120330.XA
Other languages
Chinese (zh)
Inventor
王鸿鹏
韩霄
邵岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202010120330.XA priority Critical patent/CN111652261A/en
Publication of CN111652261A publication Critical patent/CN111652261A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Traffic Control Systems (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention provides a multi-modal perception fusion system for a full scene, which comprises an upper computer, a laser radar, a multi-view camera, an IMU, infrared depth cameras and a power supply, wherein the multi-view camera comprises two FLIR industrial network port cameras and two USB 3.0 cameras. The system is set up in the following steps: installing the hardware and software, acquiring data, and constructing a model. The multi-modal perception fusion system is compact and lightweight; it can be used for environment modeling on unmanned vehicles, in the medical industry and in unmanned military settings, and it is also suitable for various complex environments, both indoor and outdoor, laying a foundation for planning and navigation.

Description

Multi-modal perception fusion system
Technical Field
The invention belongs to the field of multi-modal perception fusion systems, and particularly relates to a multi-modal perception fusion system for a full scene.
Background
With the rapid development of sensor technology and the Internet, large volumes of data in many different modalities are emerging at an unprecedented rate. For a given subject to be described (an object, a scene, etc.), the coupled data samples collected through different methods or from different perspectives constitute multi-modal data, and each such method or perspective of collection is generally referred to as a modality.
Multi-modal information in the narrow sense generally refers to modalities with different sensing characteristics, while multi-modal fusion in the broad sense also covers multi-feature fusion within a single modality, data fusion across multiple sensors of the same type, and so on. The problem of multi-modal perception and learning is therefore closely related to multi-source fusion and multi-sensor fusion in signal processing, and to multi-view learning or multi-view fusion in machine learning. Multi-modal data yields more comprehensive and accurate information and enhances the reliability and fault tolerance of a system.
In multi-modal perception and learning, different modalities have completely different description forms and complex coupling correspondences, so the problems of perceptual representation and cognitive fusion across modalities must be solved in a unified way. Multi-modal perception and fusion brings two otherwise unrelated data samples in different formats into correspondence through an appropriate transformation or projection, and such fusion of heterogeneous data can often achieve unexpectedly good results.
At present, multi-modal data plays a large role in Internet information search, human-computer interaction, fault diagnosis in industrial environments, robotics, and other fields. Multi-modal learning between vision and language is currently the area of multi-modal fusion with the most concentrated research results, whereas robotics still faces many challenging problems that require further exploration. The present invention therefore develops a multi-modal perception system in which multi-view vision, laser, binocular infrared, depth, IMU, and other modalities are mounted facing different directions. By automatically perceiving, scanning, and modeling both large scenes and small workpieces, the system achieves full-scene perception, is suitable for indoor and outdoor use, and attaches depth and distance information to the RGB image information of the environment. The main difficulties are the heterogeneous multi-source sensors, the extraction of features, and the determination of the correlations between features, so that the fusion becomes more accurate and the environment can be perceived in real time.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a multi-modal perception fusion system for a full scene, so as to realize automatic perception, scanning, and modeling of large scenes and small workpieces. The multi-modal perception fusion system comprises an upper computer, a laser radar, a multi-view camera, an IMU, infrared depth cameras, and a power supply, wherein the multi-view camera comprises two FLIR industrial network port cameras and two USB 3.0 cameras. The system is set up through the following steps:
s1: installing hardware: connecting a laser radar to an upper computer in an Ethernet interface connection mode, connecting two FLIR industrial network port cameras to the upper computer in the Ethernet interface mode, respectively connecting two USB3.0 cameras, an IMU (inertial measurement unit) and an infrared depth camera to a USB3.0 interface of the upper computer, and connecting all the parts after being connected with each other through a data line and a power supply;
s2: installation of software and acquisition of data: opening a Linux Ubuntu System, installing and configuring a driver and software of each module, starting nodes of each mode by using a Robot Operating System, and displaying the acquired data of the point cloud of the laser radar, the RGB image of the multi-view camera, the information of an accelerometer and a gyroscope of the IMU and the depth-of-field image of the infrared depth camera by using RVIZ;
s3: constructing a model: and then, processing the acquired data by using an SLAM theoretical system, wherein the processing flow is divided into two steps, namely a front end and a rear end, the front end is responsible for feature extraction of each module and representation of correlation among features, the rear end is responsible for parameter optimization and three-dimensional reconstruction and positioning, and the rear end is responsible for iterative optimization of external parameters by using a Marquardt algorithm in modeling identification to obtain optimal estimation, so that a fused final model and effect graph are obtained.
Preferably, the operating system adopted by the multi-modal perception fusion system is Linux Ubuntu, the middleware is the Robot Operating System, and the programming languages used are C++ and Python.
Preferably, the laser radar is a LeiShen Intelligent C16-151B.
Preferably, there are two infrared depth cameras, and the infrared depth cameras and the IMU are Intel RealSense D435i units.
Preferably, the distance from the laser radar to the ground is 10 m, and a conical blind zone is formed below the laser radar where its beams do not reach; the working range of the infrared depth camera is 0.2-10 m, which compensates for the blind zone that the laser radar cannot cover.
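For intuition, the size of this blind zone can be estimated from the mounting height and the downward tilt of the lowest laser beam. The sketch below assumes a lowest beam 15 degrees below horizontal (a value typical of 16-beam lidars, assumed here and not stated in the patent) together with the 10 m height mentioned above.

```python
# Sketch: radius of the conical ground blind zone beneath a spinning lidar.
# Assumptions: height above ground h = 10 m, lowest beam tilted 15 deg below
# horizontal; the tilt value is an assumption, not specified in the patent.
import math

h = 10.0                  # height above ground, m
lowest_beam_deg = 15.0    # downward tilt of the lowest beam, deg (assumed)

blind_radius = h / math.tan(math.radians(lowest_beam_deg))
print(f"ground blind-zone radius is about {blind_radius:.1f} m")
# A depth camera with a 0.2-10 m working range aimed at this near-field region
# covers the area that the lidar beams never reach.
```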
Preferably, the laser radar, the multi-view camera, the IMU, and the infrared depth camera are each independent sensors.
Compared with the prior art, the invention has the following beneficial effects: the heterogeneous sensors are combined while remaining independent, so the system can be calibrated quickly; the collected information is matched and fused in three-dimensional space, a patch model is generated from the point cloud, and iterative optimization is applied again, finally yielding a three-dimensional reconstruction model of the required accuracy and the most accurate model and effect diagram. The fusion is thereby made more accurate, the environment can be perceived in real time, and accurate technical data are provided for later recognition and detection. The multi-modal perception fusion system is compact and lightweight; it can be used for environment modeling on unmanned vehicles, in the medical industry and in unmanned military settings, and it is also suitable for various complex environments, both indoor and outdoor, laying a foundation for planning and navigation.
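As an illustration of the matching and fusion in three-dimensional space described above, the sketch below colors lidar points with RGB values from a camera image by projecting each point through a calibrated extrinsic transform T and intrinsic matrix K. The matrices and data are placeholders for illustration, not values from the patent.

```python
# Sketch: fuse a lidar point cloud with an RGB image by projecting each 3D point
# through an extrinsic T (lidar -> camera) and intrinsics K, then sampling the
# image color at the projected pixel. T, K and the data below are placeholders.
import numpy as np

def colorize_cloud(points_lidar, image, T, K):
    """Return (point, color) pairs for lidar points that fall inside the image."""
    h, w, _ = image.shape
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T @ pts_h.T).T[:, :3]                 # into the camera frame
    in_front = pts_cam[:, 2] > 0.1                   # keep points in front of the camera
    proj = (K @ pts_cam[in_front].T).T
    uv = (proj[:, :2] / proj[:, 2:3]).astype(int)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    colors = image[uv[valid, 1], uv[valid, 0]]       # sample RGB at (row=v, col=u)
    return points_lidar[in_front][valid], colors

# Placeholder calibration and data
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4)                                        # identity extrinsic for the sketch
cloud = np.random.uniform(-2.0, 2.0, (1000, 3)) + np.array([0.0, 0.0, 4.0])
image = np.zeros((480, 640, 3), dtype=np.uint8)
pts, cols = colorize_cloud(cloud, image, T, K)
print(f"{len(pts)} of {len(cloud)} points project into the image")
```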
Drawings
FIG. 1 is an appearance diagram of the multi-modal perception fusion system for a full scene.
FIG. 2 is an architecture diagram of the multi-modal system for a full scene.
FIG. 3 is a diagram of the installation steps of the multi-modal perception fusion system for a full scene.
In the figure: 1-laser radar; 2-first FLIR industrial network port camera; 3-first USB 3.0 camera; 4-first infrared depth camera; 5-second infrared depth camera; 6-second FLIR industrial network port camera; 7-second USB 3.0 camera; 8-multi-view camera; 9-IMU.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.
The invention is further described below:
Embodiment:
As shown in FIG. 1, the operating system adopted by the multi-modal perception fusion system is Linux Ubuntu, the middleware is the Robot Operating System, and the programming languages used are C++ and Python. The multi-modal perception fusion system comprises an upper computer, a laser radar 1, a first FLIR industrial network port camera 2, a second FLIR industrial network port camera 6, a first USB 3.0 camera 3, a second USB 3.0 camera 7, a multi-view camera 8, an IMU 9, a first infrared depth camera 4, a second infrared depth camera 5, and a power supply;
specifically, as shown in fig. 3, the steps of forming the multi-modal perceptual fusion system are as follows:
s1: installing hardware: connecting a laser radar to an upper computer in an Ethernet interface connection mode, connecting a first FLIR industrial network port camera 2 and a second FLIR industrial network port camera 3 to the upper computer in the Ethernet interface mode, respectively connecting a first USB3.0 camera 3, a second USB3.0 camera 7, an IMU 9, a first infrared depth camera 4 and a second infrared depth camera 5 to a USB3.0 interface of the upper computer, and connecting all the parts after connection with a power supply through data lines;
s2: installation of software and acquisition of data: the Linux Ubuntu System is opened, drivers and software of all modules are installed and configured, a Robot Operating System is used for starting nodes of all modes, the acquired point cloud of the laser radar 1, RGB images of the multi-view camera 8, the first FLIR industrial portal camera 2, the second FLIR industrial portal camera 6, the first USB3.0 camera 3 and the second USB3.0 camera 7, accelerometer and gyroscope information of the IMU 9 and depth maps of the first infrared depth camera 4 and the second infrared depth camera 5 are displayed by using the RVIZ;
s3: constructing a model: and then, processing the acquired data by using a slam theoretical system, wherein the processing flow is divided into two steps, namely a front end and a rear end, the front end is responsible for feature extraction of each module and representation of correlation among features, the rear end is responsible for parameter optimization and three-dimensional reconstruction and positioning, the external parameters are iteratively optimized by using a Marquardt algorithm in modeling identification to obtain optimal estimation, and finally, a final model and an effect graph which are accurately fused are obtained.
Specifically, the laser radar 1 is a LeiShen Intelligent C16-151B.
Specifically, the first infrared depth camera 4, the second infrared depth camera 5, and the IMU 9 are all Intel RealSense D435i units.
Specifically, the distance from the laser radar 1 to the ground is 10 m, and a conical blind zone is formed below the laser radar 1 where its beams do not reach; the working range of the first infrared depth camera 4 and the second infrared depth camera 5 is 0.2-10 m, which compensates for the blind zone that the laser radar 1 cannot cover.
Specifically, the laser radar 1, the first FLIR industrial network port camera 2, the second FLIR industrial network port camera 6, the first USB 3.0 camera 3, the second USB 3.0 camera 7, the multi-view camera 8, the IMU 9, the first infrared depth camera 4, and the second infrared depth camera 5 are each independent sensors.
Referring to FIG. 2, a graphical structural representation of the multi-modal system is shown, in which the vertices represent the sensors (laser radar, cameras, IMU, and so on) and the edges represent the relative pose transformations inferred between the sensors.
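A minimal sketch of such a sensor graph is given below: the vertices are sensor frames, each edge carries a 4x4 homogeneous transform, and composing transforms along a path yields the relative pose between any two sensors. The frame names and transform values are illustrative only.

```python
# Sketch: sensor graph whose edges carry 4x4 homogeneous transforms.
# Frame names and numeric values are illustrative, not taken from the patent.
import numpy as np

def make_T(R=None, t=(0.0, 0.0, 0.0)):
    T = np.eye(4)
    T[:3, :3] = np.eye(3) if R is None else R
    T[:3, 3] = t
    return T

# Edges: (child frame, parent frame) -> pose of child expressed in parent
edges = {
    ("lidar", "base"): make_T(t=(0.00, 0.00, 0.30)),
    ("rgb_cam_1", "base"): make_T(t=(0.10, 0.05, 0.20)),
    ("imu", "base"): make_T(t=(0.00, 0.00, 0.10)),
    ("depth_cam_1", "base"): make_T(t=(0.10, -0.05, 0.20)),
}

def relative_pose(frame_a, frame_b):
    """Pose of frame_a expressed in frame_b, composed through the common 'base' vertex."""
    T_a_base = edges[(frame_a, "base")]
    T_b_base = edges[(frame_b, "base")]
    return np.linalg.inv(T_b_base) @ T_a_base

print(relative_pose("lidar", "rgb_cam_1"))   # lidar pose in the first RGB camera frame
```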
The workflow diagram is shown in FIG. 3.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A multi-modal perception fusion system for a full scene, characterized by comprising an upper computer, a laser radar, a multi-view camera, an IMU (inertial measurement unit), an infrared depth camera, and a power supply, wherein the multi-view camera comprises two FLIR industrial network port cameras and two USB (universal serial bus) 3.0 cameras, and the multi-modal perception fusion system is set up through the following steps:
s1: installing hardware: connecting a laser radar to an upper computer in an Ethernet interface connection mode, connecting two FLIR industrial network port cameras to the upper computer in the Ethernet interface mode, respectively connecting two USB3.0 cameras, an IMU (inertial measurement unit) and an infrared depth camera to a USB3.0 interface of the upper computer, and connecting all the parts after being connected with each other through a data line and a power supply;
s2: installation of software and acquisition of data: opening a Linux Ubuntu System, installing and configuring a driver and software of each module, starting nodes of each mode by using a Robot Operating System, and displaying the acquired data of the point cloud of the laser radar, the RGB image of the multi-view camera, the information of an accelerometer and a gyroscope of the IMU and the depth-of-field map of the infrared depth camera by using RVIZ;
s3: constructing a model: and then, processing the acquired data by using a slam theoretical system, wherein the processing flow is divided into two steps, namely a front end and a rear end, the front end is responsible for feature extraction of each module and representation of correlation among features, and the rear end is responsible for optimization, three-dimensional reconstruction and positioning of parameters, so that a fused final model and an effect graph are obtained finally.
2. The multi-modal perception fusion system for a full scene according to claim 1, wherein the operating system adopted by the multi-modal perception fusion system is Linux Ubuntu, the middleware is the Robot Operating System, and the programming languages used are C++ and Python.
3. The multi-modal perception fusion system for a full scene according to claim 1, wherein the laser radar is a LeiShen Intelligent C16-151B.
4. The multi-modal perception fusion system for a full scene according to claim 1, wherein there are two infrared depth cameras, and the infrared depth cameras and the IMU are Intel RealSense D435i units.
5. The multi-modal perception fusion system for a full scene according to claim 1, wherein the extrinsic parameters are iteratively optimized with the Marquardt algorithm in modeling and identification to obtain the optimal estimate, thereby obtaining the most accurate model and effect diagram.
6. The multi-modal perception fusion system according to claim 1, wherein the distance from the laser radar to the ground is 10 m, a conical blind zone is formed below the laser radar where its beams do not reach, and the working range of the infrared depth camera is 0.2-10 m, so that the blind zone that the laser radar cannot cover is compensated.
7. The multi-modal perception fusion system for a full scene according to claim 1, wherein the laser radar, the multi-view camera, the IMU, and the infrared depth camera each have independent sensors.
CN202010120330.XA 2020-02-26 2020-02-26 Multi-modal perception fusion system Pending CN111652261A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010120330.XA CN111652261A (en) 2020-02-26 2020-02-26 Multi-modal perception fusion system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010120330.XA CN111652261A (en) 2020-02-26 2020-02-26 Multi-modal perception fusion system

Publications (1)

Publication Number Publication Date
CN111652261A true CN111652261A (en) 2020-09-11

Family

ID=72346093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010120330.XA Pending CN111652261A (en) 2020-02-26 2020-02-26 Multi-modal perception fusion system

Country Status (1)

Country Link
CN (1) CN111652261A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327289A (en) * 2021-05-18 2021-08-31 中山方显科技有限公司 Method for simultaneously calibrating internal and external parameters of multi-source heterogeneous sensor

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107036594A (en) * 2017-05-07 2017-08-11 郑州大学 The positioning of intelligent Power Station inspection intelligent body and many granularity environment perception technologies
CN107085422A (en) * 2017-01-04 2017-08-22 北京航空航天大学 A kind of tele-control system of the multi-functional Hexapod Robot based on Xtion equipment
CN107390703A (en) * 2017-09-12 2017-11-24 北京创享高科科技有限公司 A kind of intelligent blind-guidance robot and its blind-guiding method
US20170371329A1 (en) * 2014-12-19 2017-12-28 United Technologies Corporation Multi-modal sensor data fusion for perception systems
CN108700939A (en) * 2016-02-05 2018-10-23 奇跃公司 System and method for augmented reality
CN108846867A (en) * 2018-08-29 2018-11-20 安徽云能天智能科技有限责任公司 A kind of SLAM system based on more mesh panorama inertial navigations
CN109828658A (en) * 2018-12-17 2019-05-31 彭晓东 A kind of man-machine co-melting long-range situation intelligent perception system
CN110174136A (en) * 2019-05-07 2019-08-27 武汉大学 A kind of underground piping intelligent measurement robot and intelligent detecting method
CN110261870A (en) * 2019-04-15 2019-09-20 浙江工业大学 It is a kind of to synchronize positioning for vision-inertia-laser fusion and build drawing method
CN110321000A (en) * 2019-04-25 2019-10-11 南开大学 A kind of dummy emulation system towards intelligence system complex task
US20190339081A1 (en) * 2018-05-03 2019-11-07 Orby, Inc. Unmanned aerial vehicle with enclosed propulsion system for 3-d data gathering and processing
CN110427022A (en) * 2019-07-08 2019-11-08 武汉科技大学 A kind of hidden fire-fighting danger detection robot and detection method based on deep learning

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170371329A1 (en) * 2014-12-19 2017-12-28 United Technologies Corporation Multi-modal sensor data fusion for perception systems
CN108700939A (en) * 2016-02-05 2018-10-23 奇跃公司 System and method for augmented reality
CN107085422A (en) * 2017-01-04 2017-08-22 北京航空航天大学 A kind of tele-control system of the multi-functional Hexapod Robot based on Xtion equipment
CN107036594A (en) * 2017-05-07 2017-08-11 郑州大学 The positioning of intelligent Power Station inspection intelligent body and many granularity environment perception technologies
CN107390703A (en) * 2017-09-12 2017-11-24 北京创享高科科技有限公司 A kind of intelligent blind-guidance robot and its blind-guiding method
US20190339081A1 (en) * 2018-05-03 2019-11-07 Orby, Inc. Unmanned aerial vehicle with enclosed propulsion system for 3-d data gathering and processing
CN108846867A (en) * 2018-08-29 2018-11-20 安徽云能天智能科技有限责任公司 A kind of SLAM system based on more mesh panorama inertial navigations
CN109828658A (en) * 2018-12-17 2019-05-31 彭晓东 A kind of man-machine co-melting long-range situation intelligent perception system
CN110261870A (en) * 2019-04-15 2019-09-20 浙江工业大学 It is a kind of to synchronize positioning for vision-inertia-laser fusion and build drawing method
CN110321000A (en) * 2019-04-25 2019-10-11 南开大学 A kind of dummy emulation system towards intelligence system complex task
CN110174136A (en) * 2019-05-07 2019-08-27 武汉大学 A kind of underground piping intelligent measurement robot and intelligent detecting method
CN110427022A (en) * 2019-07-08 2019-11-08 武汉科技大学 A kind of hidden fire-fighting danger detection robot and detection method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
何守印: "Research on Autonomous Obstacle Avoidance of UAVs Based on Multi-Sensor Fusion", China Doctoral Dissertations Full-text Database, Engineering Science and Technology II *
陈梦晓: "Localization and Mapping of Mobile Robots Based on Multi-Sensor Data", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327289A (en) * 2021-05-18 2021-08-31 中山方显科技有限公司 Method for simultaneously calibrating internal and external parameters of multi-source heterogeneous sensor

Similar Documents

Publication Publication Date Title
US20230260151A1 (en) Simultaneous Localization and Mapping Method, Device, System and Storage Medium
JP2021120844A (en) Method, device, electronic device and recording medium utilized for determining position of vehicle
CN109737981B (en) Unmanned vehicle target searching device and method based on multiple sensors
CN111813130A (en) Autonomous navigation obstacle avoidance system of intelligent patrol robot of power transmission and transformation station
EP4068206A1 (en) Object tracking in local and global maps systems and methods
WO2021036587A1 (en) Positioning method and system for electric power patrol scenario
WO2024087962A1 (en) Truck bed orientation recognition system and method, and electronic device and storage medium
US20230219221A1 (en) Error detection method and robot system based on a plurality of pose identifications
CN113947134A (en) Multi-sensor registration fusion system and method under complex terrain
Chellali A distributed multi robot SLAM system for environment learning
CN117152249A (en) Multi-unmanned aerial vehicle collaborative mapping and perception method and system based on semantic consistency
JP2021177144A (en) Information processing apparatus, information processing method, and program
Manivannan et al. Vision based intelligent vehicle steering control using single camera for automated highway system
Egodagamage et al. Distributed monocular SLAM for indoor map building
Kostavelis et al. SPARTAN system: Towards a low-cost and high-performance vision architecture for space exploratory rovers
CN111652261A (en) Multi-modal perception fusion system
Valente et al. Evidential SLAM fusing 2D laser scanner and stereo camera
Scheuermann et al. Mobile augmented reality based annotation system: A cyber-physical human system
Jensen et al. Laser range imaging using mobile robots: From pose estimation to 3D-models
Aravind et al. Real-Time Appearance Based Mapping using Visual Sensor for Unknown Environment
US20230219220A1 (en) Error detection method and robot system based on association identification
CN116443256A (en) Methods, systems, and computer program products for airborne fueling
Parra et al. A novel method to estimate the position of a mobile robot in underfloor environments using RGB-D point clouds
Liu et al. A multi-sensor fusion with automatic vision-LiDAR calibration based on Factor graph joint optimization for SLAM
Pal et al. Evolution of Simultaneous Localization and Mapping Framework for Autonomous Robotics—A Comprehensive Review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200911