WO2022231523A1 - Multi-camera system - Google Patents

Multi-camera system

Info

Publication number
WO2022231523A1
Authority
WO
WIPO (PCT)
Prior art keywords
monocular camera
sensor
feature
monocular
sensors
Prior art date
Application number
PCT/SG2022/050257
Other languages
French (fr)
Inventor
Huimin CHENG
Original Assignee
National University Of Singapore
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University Of Singapore filed Critical National University Of Singapore
Publication of WO2022231523A1 publication Critical patent/WO2022231523A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/817Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level by voting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/90Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums

Abstract

A multi-camera system comprising multiple monocular camera sensors wherein each monocular camera sensor comprises: a camera; and an IMU; wherein data captured by the camera and the IMU are used by the monocular camera sensor to: estimate a pose and velocity of the monocular camera sensor; and perform feature detection for features in an environment surrounding the system; wherein features detected by each monocular camera sensor are shared with other ones of the monocular camera sensors, and wherein each monocular camera sensor is configured, for each feature it detects, to: determine if there is a match with the features detected by the other ones of the monocular camera sensors; and if there is a match, store the feature in memory.

Description

Multi-camera System
Technical Field
The present invention relates, in general terms, to a multi-camera system, and more particularly to a multi-camera system comprising monocular cameras.
Background
Current state-of-the-art techniques recover camera poses and 3D landmark locations from a sequence of synchronized camera images and can run in real time on commodity hardware. These techniques can be extended and applied to installations of static cameras for tracking objects and volumes in 3D. This vision-based approach is potentially much more powerful and versatile than the 2D Lidar-based techniques that are currently common in the industry.
Traditional stereo or multi-camera visual simultaneous localization and mapping (SLAM) normally requires all sensor data to be gathered at a central computing unit. This is due to the way the algorithm is designed: the matching and optimization assume availability of all sensor data at the time of computation. The key drawbacks are that the central computer must have high-bandwidth I/O ports to take in precisely synchronized sensor data and high computational resources to process the data received in real time, and that it is a single point of failure for the system. This often gives rise to difficulties in integrating such sensing solutions. Underpowered embedded computers also introduce significant latency and synchronization issues, rendering the whole system unstable or even unusable. In addition, there is a lack of hardware and software redundancy in such implementations, which is critical for commercial applications.
It would be desirable to overcome all or at least one of the above-described problems.
Summary
Disclosed herein is a multi-camera system comprising multiple monocular camera sensors wherein each monocular camera sensor comprises: a camera; and an IMU; wherein data captured by the camera and the IMU are used by the monocular camera sensor to: estimate a pose and velocity of the monocular camera sensor; and perform feature detection for features in an environment surrounding the system; wherein features detected by each monocular camera sensor are shared with other ones of the monocular camera sensors, and wherein each monocular camera sensor is configured, for each feature it detects, to: determine if there is a match with the features detected by the other ones of the monocular camera sensors; and if there is a match, store the feature in memory.
In some embodiments, each monocular camera sensor is further configured to send, to the other ones of the monocular camera sensors, an indication of the features detected by the other ones of the monocular camera sensors that have matched with features detected by it.
In some embodiments, for each monocular camera, for each feature it has detected, storing the feature in memory comprises storing the feature in multi-view feature memory if the respective monocular camera sensor receives an indication from the other ones of the monocular camera sensors.
In some embodiments, for each monocular camera, for each feature, storing the feature in memory comprises storing the feature in external feature memory if the respective monocular camera sensor receives a matching feature from the other ones of the monocular camera sensors.
In some embodiments, the estimated pose and velocity for all monocular camera sensors are synchronized using network-based time synchronization. In some embodiments, in use, all the monocular camera sensors are attached to a common rigid body.
In some embodiments, for synchronization, an offset is calculated based on histories of the estimated position of the monocular camera sensors across the rigid body.
In some embodiments, for synchronization, an offset is calculated based on histories of estimated rotations of the monocular camera sensors across the rigid body.
In some embodiments, visual information and structured data are shared between the monocular camera sensors to provide perception of depth.
Brief description of the drawings
Embodiments of the present invention will now be described, by way of non-limiting example, with reference to the drawings in which:
Figure 1 illustrates a hardware architecture for a monocular camera sensor;
Figure 2 is a flow diagram of a fully distributed VIO algorithm;
Figure 3 is a flow diagram of a communication and synchronization algorithm;
Figure 4 is a flow diagram of a network level sensor fusion algorithm; and
Figure 5 is a schematic diagram showing components of an exemplary computer system for performing the methods described herein.
Detailed description
The present invention relates to a multi-camera system comprising multiple proprietary vision sensor modules with edge computing processors, designed with deployment scalability in mind. Existing technologies such as stereo or multi-camera visual SLAM require all sensor data to be gathered at a central computing unit. This is due to the way the algorithm is designed: the matching and optimization assume availability of all sensor data at the time of computation. Contrary to the existing SLAM-based technologies, the present disclosure proposes a fully modular and distributed hardware design with edge computing processor(s). In particular, each camera sensor is equipped with a high frequency inertial measurement unit (IMU) sensor within the same module, so as to form a minimal monocular visual inertial navigation unit.
The proposed distributed visual SLAM over edge computing cameras has the following advantages. First, the proposed sensor placement scheme avoids the need for high precision IMU-camera synchronization between different sensor modules. The proposed design ensures that high-bandwidth data is consumed within the same module, in particular processed by the edge computing processor in the module. The high-bandwidth data does not need to be transmitted between different modules or to a central computing unit. It will be appreciated that in the present disclosure, only compressed and/or structured data may need to be transmitted between modules. As a result, the present disclosure does not need a central computing unit having high-bandwidth I/O ports to take in precisely synchronized sensor data. Therefore, the complexity of the system hardware design can be reduced.
Second, the distributed visual SLAM over edge computing cameras avoids high computational requirements. Because the high-bandwidth data only needs to be consumed within each module, the proposed sensor placement scheme avoids the need for the central computing unit to have high computation resources to compute and make sense of the data received in real time.
Last but not least, the distributed visual SLAM can avoid the single point of failure (SPOF) problem. A SPOF is a part of a system that, if it fails, will stop the entire system from working. Existing SLAM technologies need the central computing unit to process the gathered sensor data. Therefore, once the central computing unit, or the communication between the central computing unit and the sensors, is down, the whole system stops working. On the contrary, in the present disclosure, each vision sensor module is able to operate independently at a reduced accuracy and to work collaboratively with other sensor modules when communication between the vision sensor modules is available and validated. Such a design is robust against hardware failures, communication failures, and algorithm and data glitches. The failure of any particular module will simply result in the omission of data available from that module, rather than catastrophic loss of system functionality.
The present disclosure proposes a distributed stereoscopic localization method. It will be appreciated that this method is a non-trivial redesign of existing visual-SLAM pipelines. The proposed distributed stereoscopic localization method is able to work over a network of monocular cameras with individual edge computing processors, and to allow visual inertial odometry (VIO) to run on the distributed sensor modules. Visual odometry is the process of determining the position and orientation of a robot by analyzing the associated camera images. VIO in the present disclosure refers to the use of an IMU within a visual odometry system; it is a technique to recover camera poses and 3D landmarks using a sequence of images and IMU readings. In the proposed distributed stereoscopic localization method, a feature extraction process as well as a front-end run independently on each sensor module, and the back-end of each sensor module is configured to handle more than one other sensor module based on the data transmitted over the network. Stereoscopic feature matching is done in a fully distributed manner over network communication. Such a design enables visual information and structured data to be shared between the cameras in a well-defined and consistent way to implement the stereoscopic algorithm, without the need for a central computing unit.
The present invention also relates to a relaxed sensor synchronization method to perform multi-camera visual SLAM. The aim of this method is to make the proposed distributed visual SLAM over edge computing cameras require less dedicated hardware and fewer communication channels (e.g. electrical pulses) between different camera modules than the existing SLAM-based technologies. The relaxed sensor synchronization method eliminates the need to run a physical synchronization signal across the different edge cameras, which are instead synchronized approximately by round-trip estimation over the network with sub-5 ms accuracy. The round-trip estimation can be further refined by commonly observed visual motion if needed. Also, the relaxed sensor synchronization method ensures that no dedicated hardware fabric is needed to decode a physical synchronization signal. Such a design simplifies both construction and overall system complexity. It will also be appreciated that, due to the relaxed sensor synchronization method, the proposed distributed visual SLAM scheme can easily be extended to situations where wireless communication is required, such as when there is no specific hardware for generating a synchronization pulse.
The present invention also proposes a new architecture for sensor fusion that uses the networked sensor modules with full redundancy. Sensor fusion here refers to the process of combining sensor data, or data derived from disparate sources, such that the resulting information has less uncertainty than would be possible if these sources were used individually. In particular, the proposed multi-camera system is able to operate in monocular mode if only one sensor is present, and can act as a multi-camera SLAM when more sensors are present. Sensor data are fused in a distributed manner when more than one sensor is present, and each sensor is configured to compute its own sensor fusion data. In the present disclosure, consensus on the fused state can be communicated over a network, and a mechanism such as voting is designed to reach an agreement on the final state. The proposed new architecture is also able to diagnose sensor inconsistency and failure, as well as VIO failure. This new structure enables the multi-camera system to recover from partial failures of some of the sensors within the network, including bad initialization, occlusion, data corruption and delays.
As mentioned before, the present invention relates to a multi-camera system comprising multiple monocular camera sensors. Figure 1 illustrates the hardware architecture for an example monocular camera sensor 100. The example monocular camera sensor 100 comprises a camera 102 and an IMU sensor 104. Other sensors 106 such as a height gauge and a speedometer may also be installed in the monocular camera sensor 100. The other sensors and their functions may depend on the particular application to which the system is being applied. The camera 102 in the present disclosure is configured to obtain high-bandwidth raw image data. As shown in Figure 1, the camera 102 transmits the obtained high-bandwidth raw image data 120 to the edge processor 108. In some embodiments, the camera 102 is equipped with a mobile industry processor interface (MIPI)-interfaced image sensor, with hardware trigger I/O. It will be appreciated that the edge processor 108 is also part of the monocular camera sensor 100. The edge processor 108 may be an ARM-based main CPU and GPU processor, although other processors may be suitable for particular applications. Such a design enables the high-bandwidth data to be consumed within the same monocular camera sensor 100 (i.e. edge computer). The IMU sensor 104 is configured to obtain high frequency inertial data. As shown in Figure 1, the high frequency inertial data 122 is then transmitted from the IMU sensor 104 to the real-time processor 110, which is also part of the monocular camera sensor 100. In some embodiments, the real-time processor 110 is an STM32 processor, running hardware triggering and sensor fusion algorithms. Such a design enables the high frequency inertial data to also be consumed within the same monocular camera sensor 100. In the present disclosure, data captured by the camera 102 and the IMU 104 are used by the monocular camera sensor to estimate a pose and velocity of the monocular camera sensor 100, and to perform feature detection for features in an environment surrounding the multi-camera system. In view of the present teachings, the skilled person will appreciate that the components forming part of the monocular camera sensor may be a single device, or may be multiple devices in sufficiently close communication that transmission over neighbouring networks is substantially avoided for many data processing tasks - e.g. tasks undertaken by the real-time processor 110 and other components mentioned above.
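To make the division of labour within one module concrete, the following minimal sketch outlines how raw data might stay inside the module while only structured results leave it. The class, method and field names are illustrative assumptions, not part of the disclosed design.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ImuSample:
    t: float                              # timestamp (seconds)
    accel: Tuple[float, float, float]     # m/s^2
    gyro: Tuple[float, float, float]      # rad/s

@dataclass
class MonocularCameraSensor:
    """One self-contained module: camera 102, IMU 104, edge processor 108, real-time processor 110."""
    imu_buffer: List[ImuSample] = field(default_factory=list)

    def on_image(self, t: float, image) -> dict:
        # High-bandwidth raw image data (120) is consumed locally on the edge processor;
        # only compact, structured results are handed to the network layer.
        features = self.detect_features(image)
        return {"t": t, "features": features}

    def on_imu(self, sample: ImuSample) -> None:
        # High frequency inertial data (122) is consumed by the real-time processor
        # within the same module and never leaves it in raw form.
        self.imu_buffer.append(sample)

    def detect_features(self, image) -> list:
        return []   # placeholder: any 2D feature detector may be used here
```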
The real-time processor 110 provides the monocular camera sensor 100 with built-in sensor fusion processing capability. As shown in Figure 1, the real-time processor 110 performs sensor fusion from the other sensors 106 (for example GPS, barometer, height gauge and speedometer, etc.), and sends the fused result 126 to the edge processor 108. In some embodiments, the real-time processor 110 is connected to the ARM-based edge processor 108 by a high-speed onboard communication interface (e.g. serial peripheral interface (SPI)). The edge processor 108 performs visual odometry. Visual odometry is the process of determining at least one, and preferably both, of the position and orientation of a robot (that comprises the monocular camera sensor 100) by analyzing the associated high-bandwidth raw image data 120. The edge processor 108 also performs VIO, to recover camera poses and 3D landmarks using a sequence of images 120 and IMU readings (i.e., the high frequency inertial data 122). The edge processor 108 then sends the VIO result 128 to the real-time processor 110. It will also be appreciated that the edge processor 108 and the real-time processor 110 may be general purpose, which means that adding sensor capabilities (for example adding more sensors 106) is possible.
In existing technologies, a multi-camera setup normally requires electrical synchronization signals. The synchronization signals need to be exposed outside of each unit. For example, the cameras in Intel RealSense are not designed to communicate with each other, especially with regard to visual and spatial information. Such a system would require an additional center node for coordination. In such cases, synchronization signals from the different cameras need to be sent out to the center node for processing. The cameras in Stereolabs ZED 2 do not have sizeable compute resources onboard. They are designed to simply stream high quality raw and synchronized data. The features for VIO/SLAM are achieved through an SDK running on the host machine. This is an even less scalable solution, as central computing units are often loaded with many other tasks.
In contrast, the presently disclosed systems can avoid electrical synchronization signals transmitted between a central computing unit and the monocular camera sensors, as well as those between different monocular camera sensors, and replace them with a standard network layer, where 'soft' synchronization can be performed. In particular, the synchronization signal is not exposed outside, which is different from the prior art. As shown in Figure 1, the processor 110 sends synchronization signals 124 back to the camera 102 and the IMU sensor 104. The monocular camera sensor 100 thus does not need high-bandwidth I/O ports to transmit precisely synchronized sensor data to the outside. Such a design reduces the complexity of the system hardware design. On the one hand, the high-bandwidth image data 120 as well as the high frequency inertial data 122 do not need to be transmitted between different monocular camera sensors. We would like to emphasize that in the present disclosure, only compressed and/or structured data needs to be transmitted between different monocular camera sensors. On the other hand, since the present invention does not need a central computing unit to gather and process data from the different monocular camera sensors, the high-bandwidth data 120 as well as the high frequency inertial data 122 also do not need to be transmitted to a central computing unit for further processing.
Another novelty in the hardware design of the monocular camera sensor 100 is dedicated camera-based spatial hardware that can communicate with other camera modules by design. As shown in Figure 1, the edge processor 108 is able to communicate with other monocular camera sensors through the network hardware 112, so a central computing unit for coordination is not needed. In some embodiments, the monocular camera sensor 100 is able to communicate with other monocular camera sensors through Ethernet, with a power over Ethernet (PoE) configuration. The distributed visual SLAM over edge computing cameras avoids high communication and computation resource requirements. Because the high-bandwidth data only needs to be consumed within each module, the proposed sensor placement scheme avoids the need for a central computing unit to have high computation resources to compute and make sense of the data received in real time. There is no single compute unit responsible for analyzing all data. Instead, analysis is distributed across the edge devices.
As mentioned, embodiments of the present invention involve the design of algorithms that allow VIO to run in a fully distributed manner. Figure 2 shows an example structure 200 of a fully distributed VIO algorithm. The left half of Figure 2 illustrates the implementation of a monocular VIO, and includes monocular pre-processing 202 and implementation of a monocular minimal system 204. The implementation of the monocular VIO follows the typical front-end and back-end architecture design. In the present disclosure, feature extraction and a front-end run independently on each monocular camera sensor, and a back-end handles more than one other monocular camera sensor. In particular, the monocular pre-processing step (step 202) achieves buffering of data streams and extraction of synchronized data. It achieves pre-processing of 2D image features, for example by using features from accelerated segment test (FAST), which is a corner detection method that can be used to extract feature points that are later used to track and map objects in computer vision tasks. Step 202 can also track the 2D image features, for example by using the Lucas-Kanade method, which is a widely used differential method for optical flow estimation. Step 202 also conducts pre-processing of IMU raw data, for example through IMU pre-integration calculation. It will be appreciated that step 202 provides step 204 with the minimal system's tracked features in the 2D image domain, as well as the IMU motion constraint. The monocular minimal system 204 performs joint optimization or filtering of the tracked features and IMU motion constraints received from step 202, to obtain estimated inertial odometry. It will be appreciated that step 204 itself only uses data from one monocular camera and one IMU.
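As an illustration of the monocular pre-processing step 202, the sketch below uses OpenCV's FAST detector and Lucas-Kanade tracker together with a deliberately simplified IMU pre-integration. The thresholds and the naive integration are assumptions for illustration and do not reflect the actual implementation.

```python
import cv2
import numpy as np

def preprocess_frame(prev_gray, gray, prev_pts):
    """Step 202 sketch: detect FAST corners and track them with Lucas-Kanade optical flow."""
    if prev_pts is None or len(prev_pts) < 50:          # re-detect when tracks run low (assumed policy)
        fast = cv2.FastFeatureDetector_create(threshold=25)
        kps = fast.detect(gray, None)
        prev_pts = np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)
        return prev_pts, prev_pts
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
    good = status.ravel() == 1
    return prev_pts[good], next_pts[good]               # tracked 2D feature pairs for step 204

def preintegrate_imu(imu_samples):
    """Very simplified IMU pre-integration: accumulate velocity and rotation increments."""
    dv, dtheta = np.zeros(3), np.zeros(3)
    for dt, accel, gyro in imu_samples:                 # (seconds, m/s^2, rad/s)
        dv += np.asarray(accel) * dt
        dtheta += np.asarray(gyro) * dt
    return dv, dtheta                                    # motion constraint passed to step 204
```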
It will be appreciated that the proposed multi-camera system is able to operate in monocular mode if only one monocular camera sensor is present, and can act as a multi-camera SLAM when more monocular camera sensors are present. One novelty of the present disclosure is that the structure 200 enables implementation of a monocular algorithm 201 (including monocular pre-processing 202 and implementation of a monocular minimal system 204 shown in Figure 2) even though the VIO systems are equipped with multiple sets of monocular camera sensors.
At the same time, the monocular algorithm 201 has the capability to expand to the multi-camera VIO algorithm (i.e., 200) without the need to change the monocular algorithm itself. The expansion is done by allowing external software modules (i.e., other monocular camera sensors) to inject 'in-view' information and 'out-of-view' information. The in-view information is information that can be viewed by the current monocular camera sensor. The in-view information includes three main components. The first component consists of 2D observations that correspond to the features (in an environment surrounding the multi-camera system) currently observed by the current monocular camera sensor. The second component consists of those 2D observations that are observed by other monocular camera sensors but also happen to be in view of the current monocular camera sensor. In particular, the second component includes those 2D observations that the current monocular camera sensor can view but that are not yet registered as features in the current monocular camera sensor's processing. The first and second components are illustrated by 206 in Figure 2. The third component consists of those 3D landmarks that are not registered by the current monocular camera sensor, but are observed in other monocular camera sensors and are then re-identified in the current monocular camera sensor. The third component is illustrated by 208 in Figure 2.
The out-of-view information refers to information that cannot be viewed by the current monocular camera sensor. It will be appreciated that the injection of the out-of-view information 210 is as important as the injection of the in-view information 206/208 in order to achieve a true omnidirectional VIO function. Injection of the out-of-view information 210 is done by explicitly allowing external factors 212. In the present disclosure, the formulation of the external factors 212 is very general. One implementation of the external factors 212 could be a selected set of landmarks observed by other monocular camera sensors that are not observable by the current monocular camera sensor, or of which only part is observable by the current monocular camera sensor. An alternative formulation could be to receive the motion estimation results (pose and velocity) from all other monocular camera sensors. In such a case, we assume that all monocular camera sensors are fixed mounted relative to each other, and use said motion estimation results as a factor to constrain the further optimization.
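One possible shape of an external factor 212, under the fixed-mounting assumption above, is a residual that penalises disagreement between the local pose estimate and the pose predicted from another sensor's estimate through the known extrinsics. The function below is a sketch with assumed names and a simplified error parameterisation, not the disclosed formulation.

```python
import numpy as np

def external_pose_factor(local_pose, other_pose, T_other_local, weight=1.0):
    """Sketch of an external factor (212).

    local_pose, other_pose: 4x4 homogeneous world poses of the two sensors.
    T_other_local: fixed extrinsic (pose of the local sensor in the other sensor's frame),
    assumed known from calibration."""
    predicted = other_pose @ T_other_local              # where the local sensor should be
    error = np.linalg.inv(predicted) @ local_pose       # relative pose error
    t_err = error[:3, 3]                                # translation residual
    R_err = error[:3, :3]
    angle = np.arccos(np.clip((np.trace(R_err) - 1.0) / 2.0, -1.0, 1.0))  # rotation residual
    return weight * np.concatenate([t_err, [angle]])    # fed to the back-end optimizer as a constraint
```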
In the present disclosure, features detected by each monocular camera sensor are shared with other ones of the monocular camera sensors. Each monocular camera sensor is configured, for each feature it detects, to: determine if there is a match with the features detected by the other ones of the monocular camera sensors; and if there is a match, store the feature in memory.
In the present disclosure, the features that each monocular camera sensor is configured to detect include 2D features. Figure 2 illustrates an example distributed 2D feature matching algorithm 214. The inputs of algorithm 214 are the features registered by the current monocular camera sensor, as well as features from other monocular camera sensors transmitted by the network layer. On the one hand, the internally registered features OF1 (i.e., the features registered by the current monocular camera sensor) are sent to the network layer for matching (see step 216). The internally registered features OF1 are broadcast to the network layer 218. Once the network layer 218 receives the broadcast regarding OF1 (see 220), 2D feature matching (such as stereoscopic feature matching) will be conducted (see 222). Stereoscopic feature matching is the process of finding the pixels in the multi-scope view that correspond to the same 2D/3D point in the scene. The camera of the current monocular camera sensor is a stereoscopic camera. It will be appreciated that 'stereoscopic camera' here refers to a camera that can perform stereoscopic feature matching. It is different from a stereo camera (a type of camera with two or more lenses with a separate image sensor or film frame for each lens).
The current monocular camera sensor will determine, for the registered features OF1, if there is a match with the features detected by the other ones of the monocular camera sensors. If the 2D feature matching is successful, a corresponding message will be sent back from the network layer to the current monocular camera sensor (see 224). In one example, if the 2D feature matching is successful, the other monocular camera sensors will reply to the broadcast regarding OF1 sent by the current monocular camera sensor through the network layer. In some embodiments, each monocular camera sensor is configured to send, to the other ones of the monocular camera sensors, an indication of the features detected by the other ones of the monocular camera sensors that have matched with features detected by it. The current monocular camera sensor will then receive the successful matching information (see 227), and the successful matches will form the multi-view observations for the same set of OF1 features (see 228). In some embodiments, the proposed multi-camera system, for each monocular camera, for each feature it has detected, stores the feature in memory by storing the feature in multi-view feature memory if the respective monocular camera sensor receives an indication from the other ones of the monocular camera sensors.
On the other hand, there may also be numerous matching requests from the other monocular camera sensors. In Figure 2, the internally registered features OFx (i.e., the features registered by each of the other monocular camera sensors) are also sent to the network layer for matching. In some embodiments, the internally registered features OFx are broadcast to the network layer 218. Once the network layer 218 receives the broadcast regarding OFx (see 220), 2D feature matching (such as stereoscopic feature matching) will be conducted (see 222). The camera of each of the other monocular camera sensors may comprise a stereoscopic camera.
Each of the other monocular camera sensors will determine, for the registered features OFx, if there is a match with the features detected by any different monocular camera sensor. If the 2D feature matching is successful, a corresponding message will be sent back from the network layer to each of the other monocular camera sensors (see 224), and the successful matches will be recorded as external OF features 226. In one example, if the 2D feature matching is successful, the reply to the broadcast regarding OFx sent by each of the other monocular camera sensors will be sent through the network layer. In some embodiments, each monocular camera sensor is configured to send, to the other ones of the monocular camera sensors, an indication of the features detected by the other ones of the monocular camera sensors that have matched with features detected by it. Each of the other monocular camera sensors will then receive the successful matching information (see 227), and the successful matches will form the multi-view observations for the same set of OFx features (see 228). The multi-view measurements of internal features 228 and the new external features 226 together help to augment the minimal monocular pre-processing 202 (see step 206). In some embodiments, the proposed multi-camera system, for each monocular camera, for each feature it has detected, stores the feature in memory by storing the feature in multi-view feature memory if the respective monocular camera sensor receives an indication from the other ones of the monocular camera sensors.
As a result, the matching process 214 is done in a distributed fashion. The data transmitted over the network are the requested feature patches to be matched, and their 2D locations. There is no need for any single computing unit to perform the optical-flow based matching for all co-visible monocular camera sensors' images, which is typical for conventional algorithms. The distributed visual SLAM thus also avoids high computational requirements.
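The request/reply exchange of the distributed 2D feature matching 214 could look like the sketch below. The message layout, descriptor type (binary descriptors compared by Hamming distance) and threshold are assumptions; the disclosure itself only requires that feature patches and 2D locations, not raw images, cross the network.

```python
import numpy as np

MATCH_THRESHOLD = 40   # assumed Hamming-distance threshold for binary descriptors

def build_match_request(sensor_id, features):
    """Broadcast (216): only descriptors/patches and 2D locations leave the module."""
    return {"from": sensor_id,
            "features": [{"id": f["id"], "xy": f["xy"], "desc": f["desc"]} for f in features]}

def answer_match_request(request, my_id, local_features):
    """On a receiving sensor (220/222): match broadcast features against local ones and reply (224).
    The 'desc' fields are assumed to be np.uint8 arrays (binary descriptors)."""
    matches = []
    for remote in request["features"]:
        for local in local_features:
            dist = int(np.unpackbits(np.bitwise_xor(remote["desc"], local["desc"])).sum())
            if dist < MATCH_THRESHOLD:
                matches.append({"remote_id": remote["id"], "local_id": local["id"]})
                break
    return {"from": my_id, "to": request["from"], "matches": matches}

def handle_reply(reply, multi_view_memory):
    """Back on the requesting sensor (227/228): record matches in multi-view feature memory."""
    for m in reply["matches"]:
        multi_view_memory.setdefault(m["remote_id"], []).append((reply["from"], m["local_id"]))
```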
In the present disclosure, the features that each monocular camera sensor is configured to detect also include 3D landmarks. Figure 2 illustrates an example distributed 3D landmark re-identification algorithm 230. The inputs of algorithm 230 are the 3D landmarks registered by the current monocular camera sensor, as well as 3D landmarks from other monocular camera sensors transmitted by the network layer. In the present disclosure, for a 3D landmark, there is both 2D and 3D information. At the first step, the 3D trajectory is sent over for matching. At the second step, the 2D image patch around the landmark is sent over, based on some heuristics. It will be appreciated that not all 2D image patches are needed over time. For the current monocular camera sensor, if the reply from another monocular camera sensor shows a match, the two landmarks registered in the two different monocular camera sensors should be associated as one. All the pre-processed 2D feature tracks can then be combined to give a more localized observation.
On the one hand, the internally registered landmarks LM1 (i.e., the landmarks registered by the current monocular camera sensor) are selected by the current monocular camera sensor (see step 232). The 3D trajectory and observations will then be sent to the network layer 218 for matching (see step 234). The internally registered landmarks LM1 are broadcast to the network layer 218. Once the network layer 218 receives the broadcast regarding LM1 (see 236), it will check if LM1 falls in-view (see 238). Whenever a landmark matching request is received, there is a need to detect if the landmark location is in or out of view. The in-view information is information that can be viewed by the current monocular camera sensor. The out-of-view information refers to information that cannot be viewed by the current monocular camera sensor.
If LM1 falls in-view, re-identification of the 2D feature of LM1 will be conducted (see 240). Other techniques that are more advanced than said 2D-based re-identification could also be performed. The current monocular camera sensor will determine, for the registered landmarks LM1, if there is a match with the landmarks detected by the other ones of the monocular camera sensors. If the 2D-based re-identification is successful, a corresponding message will be sent back from the network layer to the current monocular camera sensor (see 224), and the successful ones will result in added LM in the current monocular camera sensor (see 242). The added in-view LM will then be sent back to the current monocular camera sensor (see 243).
If LM1 falls out-of-view, then the landmark tracks can be stored in an external LM database 244 and then used as external factors 212, which are particularly useful if the extrinsics are estimated online. In some embodiments, the proposed multi-camera system, for each monocular camera, for each feature, stores the feature in memory by storing the feature in external feature memory if the respective monocular camera sensor receives a matching feature from the other ones of the monocular camera sensors.
On the other hand, there may also be numerous requests from other monocular camera sensors for 3D landmark re-identification. The internally registered landmarks LMx (i.e., the landmarks registered by each of the other monocular camera sensors) are selected. The 3D trajectory and observations will then be sent to the network layer 218 for matching (see step 234). The internally registered landmarks LMx are broadcast to the network layer 218. Once the network layer 218 receives the broadcast regarding LMx (see 236), it will check if LMx falls in-view (see 238).
If LMx falls in-view, re-identification of the 2D feature of LMx will be conducted (see 240). Other techniques that are more advanced than said 2D-based re-identification could also be performed. Each of the other monocular camera sensors will determine, for the registered landmarks LMx, if there is a match with the landmarks detected by the other ones of the monocular camera sensors. If the 2D-based re-identification is successful, a corresponding message will be sent back from the network layer to each of the monocular camera sensors (see 224), and the successful ones will result in added LM in the current monocular camera sensor (see 242). The added in-view LM will then be sent back to the current monocular camera sensor (see 243). If LMx falls out-of-view, then the landmark tracks can be stored in the external LM database 244 and then used as external factors 212.
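The in-view check 238 and the two branches that follow it (re-identification 240/242 versus the external LM database 244) can be sketched as below, assuming a pinhole camera model with known intrinsics K; the actual re-identification step is omitted and the function names are illustrative.

```python
import numpy as np

def landmark_in_view(lm_xyz, T_world_cam, K, image_size):
    """Step 238 sketch: project a 3D landmark into the camera and test image bounds and depth."""
    T_cam_world = np.linalg.inv(T_world_cam)
    p_cam = T_cam_world[:3, :3] @ np.asarray(lm_xyz) + T_cam_world[:3, 3]
    if p_cam[2] <= 0:                                   # behind the camera
        return False, None
    uv = K @ p_cam
    u, v = uv[0] / uv[2], uv[1] / uv[2]
    w, h = image_size
    return (0 <= u < w and 0 <= v < h), (u, v)

def handle_landmark_broadcast(lm, T_world_cam, K, image_size, local_landmarks, external_db):
    in_view, uv = landmark_in_view(lm["xyz"], T_world_cam, K, image_size)
    if in_view:
        # 240/242: attempt 2D re-identification around uv; on success the two landmarks
        # are associated as one (the re-identification itself is not shown here).
        local_landmarks[lm["id"]] = {"xyz": lm["xyz"], "projection": uv}
    else:
        # 244: keep the out-of-view track for later use as an external factor (212).
        external_db[lm["id"]] = lm
```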
The distributed visual SLAM (based on the distributed 2D feature matching algorithm 214 and the distributed 3D landmark re-identification algorithm 230) shown in Figure 2 can avoid the SPOF problem. Existing SLAM technologies such as Intel RealSense need the central computing unit to process the gathered sensor data. As mentioned, the central computing unit is often loaded with many other tasks. A computing unit operating at high load is at risk of collapse. Once the central computing unit, or the communication between the central computing unit and the monocular camera sensors, is down, the whole multi-camera system stops working. In this invention, each monocular camera sensor 100 is able to operate independently at a reduced accuracy and to work collaboratively with other monocular camera sensors when communication between the monocular camera sensors is available and validated. Such a design is robust against hardware failures, communication failures, and algorithm and data glitches.
The present invention now discusses a real-time communication and synchronization protocol for visual spatial information exchange. The proposed real-time communication and synchronization protocol is a dedicated network protocol designed to exchange both observable visual features and landmarks, incorporating time offset considerations. Such a protocol is able to minimize the bandwidth required compared to traditional map exchange mechanisms. It also uses a relaxed sensor synchronization method to perform multi-camera visual SLAM. The aim of this method is to make the proposed distributed visual SLAM over edge computing cameras require less dedicated hardware and fewer communication channels (e.g. electrical pulses) between different monocular camera sensors than the existing SLAM-based technologies.
Figure 3 shows an example communication and synchronization flow 300. The synchronization takes two steps, which in turn signal the mode of communication between the monocular camera sensors' algorithms as well. The first step 302 is a general network-based time synchronization. In particular, the estimated pose and velocity for all monocular camera sensors are synchronized using the network-based time synchronization (see 304). In some examples, network time protocol (NTP), precision time protocol (PTP), or Micro Air Vehicle Link (MAVLink) can be used for the network-based time synchronization. In particular, NTP is a network protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. PTP is a protocol used to synchronize clocks throughout a computer network. MAVLink is a protocol for communicating with the different monocular camera sensors. The network-based time synchronization used in the present disclosure is a time offset estimation, as a time jump in a real-time system is not considered. Once the network-based time synchronization 302 is done, a typical 1-5 ms offset accuracy can be achieved, and the monocular camera sensor can then move from the unstable synchronization state (see 305) to the approximated synchronization state (see 306). The VIO data transmission can then be started (see 308).
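The network-based offset estimation of step 302 can be illustrated with the classic round-trip calculation used by NTP-style protocols; the four timestamps and the example values below are purely illustrative.

```python
def estimate_clock_offset(t0, t1, t2, t3):
    """Step 302 sketch: NTP-style offset from one request/response exchange.

    t0: request sent (local clock)      t1: request received (remote clock)
    t2: response sent (remote clock)    t3: response received (local clock)"""
    offset = ((t1 - t0) + (t2 - t3)) / 2.0   # estimated remote-minus-local clock offset
    delay = (t3 - t0) - (t2 - t1)            # round-trip network delay
    return offset, delay

# Example: remote clock leads the local clock by ~2 ms, with a ~4 ms round trip.
print(estimate_clock_offset(0.000, 0.004, 0.0045, 0.0045))   # -> (0.002, 0.004)
```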
In the approximated synchronization state 306, each monocular camera sensor is still operating in the monocular mode (see 310), as the network-based time synchronization 302 is assumed to be inaccurate. The second step 312 of the communication and synchronization flow 300 is precision time synchronization. In particular, information about each monocular camera sensor's estimated motion is shared within the network by transmitting the VIO data (see 308). As shown in step 314, distinguishable motions across all VIO units need to be detected based on the current pose and velocity 316. A time offset between multiple motion tracks can then be estimated (see 318). The precise time offset can be obtained by solving a minimization problem 320, parametrized by the offset value obtained at step 318. Once the minimization yields high confidence (see 322), we can declare the synchronization to be precise (see 324). It will be appreciated that the synchronization quality should be better than 2 ms, as the IMU sampling rate can be very high in order to detect sudden rotation or motion.
The present disclosure considers only the case where all monocular camera sensors are fixed to the same rigid body when they are in use. In such a case, different monocular camera sensors should share the same motion profile, except for a fixed rotation and displacement. In some embodiments, for synchronization, an offset is calculated based on histories of the estimated positions of the monocular camera sensors across the rigid body, and an offset is calculated based on histories of the estimated rotations of the monocular camera sensors across the rigid body. In some examples, visual information and structured data are shared between the monocular camera sensors to provide perception of depth. In the present disclosure, the data used for precision time synchronization 304 are the histories of estimated egomotion data (such as position and orientation) of the respective monocular camera sensors. The reason why motion data can be used for synchronization is that, when there is a motion change (for example an acceleration or a rotational motion of the system), the rigid body enforces that all camera sensors observe the same motion profile/pattern in their history. This history can be matched in a sliding-window manner to obtain the precise data time offset.
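The precision step 312 can be illustrated by sliding one sensor's motion-magnitude history against another's and keeping the offset that minimises the discrepancy, relying on the shared rigid-body motion. The least-squares sliding-window search below is an assumed formulation of the minimization problem 320, not the disclosed one.

```python
import numpy as np

def estimate_time_offset(history_a, history_b, dt, max_offset_s=0.005):
    """Return the time offset (seconds) between two equally sampled motion-magnitude
    histories (e.g. angular speed) that minimises the mean squared difference."""
    a, b = np.asarray(history_a, float), np.asarray(history_b, float)
    max_shift = int(round(max_offset_s / dt))
    best_offset, best_cost = 0.0, np.inf
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            x, y = a[shift:], b[:len(b) - shift]
        else:
            x, y = a[:shift], b[-shift:]
        n = min(len(x), len(y))
        cost = float(np.mean((x[:n] - y[:n]) ** 2))
        if cost < best_cost:
            best_cost, best_offset = cost, shift * dt
    return best_offset, best_cost   # the residual cost doubles as a confidence proxy (322)
```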
The relaxed sensor synchronization method eliminates the need to run a physical synchronization signal across the different edge cameras, which are instead synchronized approximately by round-trip estimation over the network with sub-5 ms accuracy. The round-trip estimation can be further refined by commonly observed visual motion if needed. Also, the relaxed sensor synchronization method ensures that no dedicated hardware fabric is needed to decode a physical synchronization signal. Such a design simplifies both construction and overall system complexity. It will also be appreciated that, due to the relaxed sensor synchronization method, the proposed distributed visual SLAM scheme can easily be extended to situations where wireless communication is required, such as when there is no specific hardware for generating a synchronization pulse.
The present invention now discusses a new architecture for sensor fusion that uses the networked sensor modules with full redundancy. Sensor fusion here refers to the process of combining sensor data, or data derived from disparate sources, such that the resulting information has less uncertainty than would be possible if these sources were used individually. The usefulness of this part of the invention is that the proposed multi-camera system is fully redundant. To be more specific, the multi-camera system is able to operate in monocular mode if only one monocular camera sensor is present, and can act as a multi-camera SLAM when more sensors are available.
In the present disclosure, each VIO unit (i.e., each monocular camera sensor) in the multi-camera system has its own sensor fusion, in terms of fusing visual and inertial data. Beyond that, the present disclosure focuses on a system level sensor fusion.
Figure 4 shows an example system level sensor fusion architecture 400. The fusion logic basically ensures consensus among all the distributed sensing units (i.e., VIO units 402, 404 and 406). Sensor data are fused in a distributed manner when more than one sensor is present, and each VIO unit is configured to compute its own current position and velocity. For example, VIO unit 402 is configured to compute its own current position and velocity 408. VIO unit 404 is configured to compute its own current position and velocity 410. VIO unit 406 is configured to compute its own current position and velocity 412. In the present disclosure, consensus on the fused state can be communicated over a network, and a mechanism such as voting is designed to reach an agreement on the final state (see 414). In some embodiments, the voting mechanism may be consensus based, such as majority voting, to reject outliers in the odometry estimation. The outlier monocular system can then be reset. For the minority that fail to agree or have crashed, fault detection and re-initialization logic such as 416 can be implemented. This new structure enables the multi-camera system to recover from partial failures of some of the sensors within the network, including bad initialization, occlusion, data corruption and delays.
The proposed sensor fusion mechanism 400 is conceptually simple compared to existing mathematically driven, filter-based approaches. The proposed new architecture is able to diagnose sensor inconsistency and failure, as well as VIO failure. This approach is made possible as a direct consequence of forcing each VIO unit to run in essentially a monocular mode, thus producing its own independent pose and velocity estimate. This makes voting and consensus effective, as each VIO unit has distinct sensor input and slightly different received shared input from the other VIO units.
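A minimal sketch of the consensus step 414, using median-based majority voting over position estimates already expressed in a common body frame, is given below; the rejection threshold and the frame-alignment assumption are illustrative rather than part of the disclosure.

```python
import numpy as np

def vote_on_positions(positions, reject_threshold=0.5):
    """Step 414 sketch: agree on a fused position and flag outlier VIO units (416).

    `positions` maps VIO unit id -> 3D position, assumed already transformed into a
    common body frame using the fixed extrinsics. Units deviating from the median by
    more than `reject_threshold` metres are flagged for reset/re-initialization."""
    ids = list(positions)
    stacked = np.stack([np.asarray(positions[i], float) for i in ids])
    median = np.median(stacked, axis=0)
    errors = np.linalg.norm(stacked - median, axis=1)
    inliers = [i for i, e in zip(ids, errors) if e <= reject_threshold]
    outliers = [i for i, e in zip(ids, errors) if e > reject_threshold]
    consensus = stacked[[ids.index(i) for i in inliers]].mean(axis=0) if inliers else median
    return consensus, outliers
```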
Figure 5 is a block diagram showing an exemplary computer device 500, in which embodiments of the invention may be practiced. The computer device 500 may be a mobile computer device such as a smart phone, a wearable device, a palm-top computer, a multimedia Internet enabled cellular telephone, an on-board computing system or any other computing system, a mobile device such as an iPhoneTM manufactured by AppleTM, Inc or one manufactured by LGTM, HTCTM and SamsungTM, for example, or another device.
As shown, the mobile computer device 500 includes the following components in electronic communication via a bus 506:
(a) a display 502;
(b) non-volatile (non-transitory) memory 504;
(c) random access memory ("RAM") 508;
(d) N processing components 510;
(e) a transceiver component 512 that includes N transceivers; and
(f) user controls 514.
Although the components depicted in Figure 5 represent physical components, Figure 5 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 5 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be-developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 5.
The display 502 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
In general, the non-volatile data storage 504 (also referred to as non-volatile memory) functions to store (e.g., persistently store) data and executable code. The system architecture may be implemented in memory 504, or by instructions stored in memory 504.
In some embodiments for example, the non-volatile memory 504 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation components, well known to those of ordinary skill in the art, which are not depicted nor described for simplicity.
In many implementations, the non-volatile memory 504 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 504, the executable code in the non-volatile memory 504 is typically loaded into RAM 508 and executed by one or more of the N processing components 510.
The N processing components 510 in connection with RAM 508 generally operate to execute the instructions stored in non-volatile memory 504. As one of ordinary skill in the art will appreciate, the N processing components 510 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
The transceiver component 512 includes N transceiver chains, which may be used for communicating with external devices via wireless networks. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS networks), and other types of communication networks.
The system 500 of Figure 5 may be connected to any appliance, such as one or more cameras mounted to the vehicle, a speedometer, a weather service for updating local context, or an external database from which context can be acquired.
It should be recognized that Figure 5 is merely exemplary and in one or more exemplary embodiments, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code encoded on a non-transitory computer-readable medium 504. Non-transitory computer-readable medium 504 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavor to which this specification relates.

Claims

Claims:
1. A multi-camera system comprising multiple monocular camera sensors wherein each monocular camera sensor comprises: a camera; and an inertial measurement unit (IMU); wherein data captured by the camera and the IMU are used by the monocular camera sensor to: estimate a pose and velocity of the monocular camera sensor; and perform feature detection for features in an environment surrounding the system; wherein features detected by each monocular camera sensor are shared with other ones of the monocular camera sensors, and wherein each monocular camera sensor is configured, for each feature it detects, to: determine if there is a match with the features detected by the other ones of the monocular camera sensors; and if there is a match, store the feature in memory.
2. The system of claim 1, wherein each monocular camera sensor is further configured to send to the other ones of the monocular camera sensors, an indication of the features detected by the other ones of the monocular camera sensors that have matched with features detected by it.
3. The system of claim 2, wherein, for each monocular camera, for each feature it has detected, storing the feature in memory comprises storing the feature in multi-view feature memory if the respective monocular camera sensor receives an indication from the other ones of the monocular camera sensors.
4. The system of any one of claim 2 or 3, wherein, for each monocular camera, for each feature, storing the feature in memory comprises storing the feature in external feature memory if the respective monocular camera sensor receives a matching feature from the other ones of the monocular camera sensors.
5. The system of any one of claims 1 to 4, wherein the estimated pose and velocity for all monocular camera sensors are synchronized using network-based time synchronization.
6. The system of any one of claims 1 to 5, wherein, in use, all the monocular camera sensors are attached to a common rigid body.
7. The system of claim 6 when dependent on claim 5, wherein, for synchronization, an offset is calculated based on histories of the estimated position of the monocular camera sensors across the rigid body.
8. The system of claim 6 or 7, wherein, for synchronization, an offset is calculated based on histories of estimated rotations of the monocular camera sensors across the rigid body.
9. The system of any one of claims 1 to 8, wherein visual information and structured data are shared between the monocular camera sensors to provide perception of depth.
PCT/SG2022/050257 2021-04-29 2022-04-28 Multi-camera system WO2022231523A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202104457T 2021-04-29
SG10202104457T 2021-04-29

Publications (1)

Publication Number Publication Date
WO2022231523A1 true WO2022231523A1 (en) 2022-11-03

Family

ID=83848885

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2022/050257 WO2022231523A1 (en) 2021-04-29 2022-04-28 Multi-camera system

Country Status (1)

Country Link
WO (1) WO2022231523A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200300637A1 (en) * 2016-03-28 2020-09-24 Sri International Collaborative navigation and mapping
US20200109954A1 (en) * 2017-06-30 2020-04-09 SZ DJI Technology Co., Ltd. Map generation systems and methods
US20200234459A1 (en) * 2019-01-22 2020-07-23 Mapper.AI Generation of structured map data from vehicle sensors and camera arrays
CN110866953A (en) * 2019-10-31 2020-03-06 Oppo广东移动通信有限公司 Map construction method and device, and positioning method and device

Similar Documents

Publication Publication Date Title
US11145083B2 (en) Image-based localization
CN109993113B (en) Pose estimation method based on RGB-D and IMU information fusion
Heng et al. Self-calibration and visual slam with a multi-camera system on a micro aerial vehicle
CN110490900B (en) Binocular vision positioning method and system under dynamic environment
US10341633B2 (en) Systems and methods for correcting erroneous depth information
US9443350B2 (en) Real-time 3D reconstruction with power efficient depth sensor usage
US9406171B2 (en) Distributed aperture visual inertia navigation
US20230019960A1 (en) System and method for spatially mapping smart objects within augmented reality scenes
US20210183100A1 (en) Data processing method and apparatus
US20180276863A1 (en) System and method for merging maps
EP3400507A1 (en) System and method for fault detection and recovery for concurrent odometry and mapping
US11776151B2 (en) Method for displaying virtual object and electronic device
US10600206B2 (en) Tracking system and method thereof
WO2019104571A1 (en) Image processing method and device
CN113888639B (en) Visual odometer positioning method and system based on event camera and depth camera
CN109767470B (en) Tracking system initialization method and terminal equipment
WO2020110359A1 (en) System and method for estimating pose of robot, robot, and storage medium
WO2022231523A1 (en) Multi-camera system
WO2023093515A1 (en) Positioning system and positioning method based on sector depth camera
Calloway et al. Three tiered visual-inertial tracking and mapping for augmented reality in urban settings
TW202247099A (en) Method for performing visual inertial odometry and user equipment
CN116567537A (en) Terminal cloud co-location method, terminal equipment and edge node
EP3396594B1 (en) Tracking system and method thereof
CN116576866B (en) Navigation method and device
CN112348865B (en) Loop detection method and device, computer readable storage medium and robot

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22796285

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE