CN118038143A - Intelligent blind assisting method and system based on multi-sensor fusion

Intelligent blind assisting method and system based on multi-sensor fusion

Info

Publication number: CN118038143A
Application number: CN202410159223.6A
Authority: CN (China)
Prior art keywords: user, semantic, information, occupancy, environment
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 支瑞聪, 郑鹏飞
Current Assignee: University of Science and Technology Beijing (USTB)
Original Assignee: University of Science and Technology Beijing (USTB)
Application filed by: University of Science and Technology Beijing (USTB)
Priority to: CN202410159223.6A
Publication of: CN118038143A

Landscapes

  • Navigation (AREA)

Abstract

The invention discloses an intelligent blind assisting method and system based on multi-sensor fusion, belonging to the technical field of assistive systems for blind people. The method comprises the following steps: acquiring point cloud data of the environment where the user is located, user position data, and the pose of the user in the environment; based on the point cloud data, performing three-dimensional modeling of the environment using voxels as the scene representation, and describing voxel attributes with semantic occupancy information and an occupancy flow, where the semantic occupancy information reflects the semantic label of each voxel and the occupancy flow reflects the change trend of the environment; and fusing global navigation information with local environment information to perform trajectory planning, guiding the user in the form of trajectory points so that the user can avoid dynamic obstacles and safely reach the destination without deviating from the navigation route. The invention provides a trajectory-guided perception framework that not only perceives the 3D environment around the user but also plans a fine-grained local path for the user, and therefore has broad application prospects.

Description

Intelligent blind assisting method and system based on multi-sensor fusion
Technical Field
The invention relates to the technical field of auxiliary systems for blind persons, in particular to an intelligent blind assisting method and system based on multi-sensor fusion.
Background
Over 250 million people worldwide suffer from visual impairment, and visually impaired people often face various challenges when traveling, including obstacles of all kinds, unfriendly traffic environments, and inadequate traffic facilities, all of which limit their independent travel. To address these challenges, intelligent assistive devices, and in particular smart cane systems, have become key contributors to safer, more autonomous, and more convenient travel for visually impaired individuals. The development of autonomous driving technology offers a new approach to the design of cane systems and can effectively improve the quality of life and autonomy of users. In addition, research on cane systems promotes interdisciplinary cooperation among robotics, computer vision, human-computer interaction, and related fields, and drives the development and application of these fields.
The cane is the most common blind-assisting tool and helps visually impaired people detect obstacles. Existing canes fall into three kinds: ordinary white canes, electronic canes equipped with sensors, and smart canes equipped with perception algorithms. Traditional canes rely mainly on active probing by the user. Electronic canes use ultrasonic sensors to detect obstacles in front of the user, but their perception range is limited and they generally only detect obstacles at a fixed distance ahead. Smart canes are equipped with a variety of sensors, such as ultrasonic sensors, infrared sensors, and cameras, and incorporate artificial intelligence to sense the surrounding environment and detect obstacles on the pedestrian path. Through computer vision and machine learning techniques, the smart cane can identify obstacles and traffic signs in an image and provide feedback. Furthermore, blind-road segmentation and GNSS navigation may be used to guide the user forward, but the guidance information provided by these methods is often coarse, requiring the user to react by feel rather than receiving active guidance.
Conventional canes are used to probe paths and obstacles, but they are limited by the length of the cane and have no navigation function. The electronic cane is equipped with various sensors and controls that provide more comprehensive information, including distance, direction, and environmental awareness. These canes use distance measurement, inertial measurement, position tracking, and image processing techniques to sense the surrounding environment and convey information to the user through audio or tactile feedback. Some electronic canes also rely on ultrasonic or LiDAR sensors to detect close-range obstacles, head-height obstructions, and terrain height variations to prevent collisions. Fig. 1 shows the workflow of an electronic cane. An ultrasonic sensor, a monocular camera, and a chipKIT Max development board are built into the electronic cane. The ultrasonic sensor and the monocular camera collect obstacle signals and send them to the microcontroller, which processes the analog signals. Data from the ultrasonic sensor is processed to provide depth information of the scanned scene along a given direction. The generated signal is then segmented into multiple portions according to the number of objects in the scene. Each part is associated not only with a distance tag but also with other specific tags that describe properties of the obstacle, such as shape and position, extracted from the ultrasound signal. In parallel, the image captured by the monocular camera is also segmented, and each region of it is labeled.
The tags acquired from both sensors are collected, and by this design the system uses data from the camera and the ultrasonic sensor to determine the distance of the obstacle and some of its other characteristics. All information collected about the scene is then analyzed to make decisions and is returned to the user by voice messages that report the nature and distance of the obstacle. The message is transmitted from the SD card to the headset via the Bluetooth module.
The electronic cane can also use GNSS location information for outdoor navigation, or perform indoor path finding using pre-placed location sensors and an environment map. Meanwhile, simultaneous localization and mapping (SLAM) techniques use lidar or camera sensors to estimate the user's location and build a map of an unknown environment. For example, the smart cane from Stanford University combines SLAM and visual servoing techniques to guide the user along a navigation route. These innovative technologies are expected to provide more navigation and environmental awareness support for visually impaired people, enhancing their independence and safety.
However, the above existing cane systems suffer from two drawbacks:
1. The traditional electronic cane senses environmental information only through raw sensor signals and cannot fully express the environment. Although object detection or segmentation techniques can identify instances, they mainly operate on 2D data and cannot fully express the 3D environment in which the user is located.
2. In the existing navigation mode, a navigation route obtained from a GNSS sensor and navigation software can only provide global planning and cannot provide fine-grained guidance for the user.
Disclosure of Invention
The invention provides an intelligent blind assisting method and system based on multi-sensor fusion, which are used for solving the technical problems existing in the prior art at least to a certain extent.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the invention provides an intelligent blind assisting method based on multi-sensor fusion, which comprises the following steps:
acquiring point cloud data of an environment where a user is located, user position data and a pose of the user in the environment;
Based on the point cloud data, performing three-dimensional modeling of the environment where the user is located using voxels as the scene representation, and describing voxel attributes with semantic occupancy information and an occupancy flow; the semantic occupancy information reflects the semantic labels of the voxel space; the occupancy flow reflects the change trend of the environment where the user is located;
Based on the semantic occupancy information and occupancy flow, the user position data, and the pose of the user in the environment, fusing global navigation information with local environment information to perform trajectory planning, and guiding the user in the form of trajectory points, so that the user can avoid dynamic obstacles and safely reach the destination without deviating from the navigation route.
Further, the acquiring the point cloud data of the environment where the user is located, the user position data and the pose of the user in the environment includes:
The environment where the user is located is scanned with a LiDAR (Light Detection and Ranging) sensor to obtain point cloud data of the environment; an IMU (Inertial Measurement Unit) is used to acquire the pose of the user in the environment; and a GNSS (Global Navigation Satellite System) receiver is used to acquire the user position data.
Further, based on the point cloud data, performing three-dimensional modeling of the environment where the user is located using voxels as the scene representation, and describing voxel attributes with semantic occupancy information and an occupancy flow, includes:
for a given frame of point cloud P ∈ R^{n×ε}, where n denotes the number of points and ε denotes the point dimension: first, the point cloud is preprocessed and limited to a range of 12.8 meters to the front, back, left, and right and 2.4 meters up and down; the 3D space is divided into voxels of size 0.1 m × 0.1 m × 0.1 m; a random point p(x_r, y_r, z_r) is assigned to the voxel indexed by v(h, w, d), where x_r, y_r, z_r denote the coordinates of the point and h, w, d denote the location index of the voxel; following the VoxelNet work, a stacked voxel feature encoding module is used to extract the point cloud features within each voxel as voxel features F_v ∈ R^{C×H×W×D}, where H×W×D denotes the spatial resolution of the voxel grid and C denotes the voxel feature dimension; then, a 3D UNet with a 3D semantic segmentation head is used to extract multi-scale geometric and semantic features from the voxel features, and the 3D semantic segmentation head maps sparse voxels to dense voxels and predicts the categories of the dense voxels to obtain the semantic occupancy information; the 3D UNet consists of an encoder and a decoder, the feature map input to the 3D UNet is downsampled four times and upsampled four times, and the feature maps of different levels in the encoder are connected to the corresponding feature maps in the decoder through skip connections; the formula is as follows:
F_occ = f_s(f_u(F_v))
where f_u and f_s denote the 3D UNet and the 3D semantic segmentation head, respectively; F_occ ∈ R^{N×H×W×D} denotes the semantic occupancy information, and N denotes the number of semantic segmentation categories;
After obtaining the semantic occupancy information, multi-frame historical semantic occupancy information is stored at a fixed time interval of 0.5 seconds; for the given semantic occupancy information F_occ^{t-1} at time t-1, the pose transformation matrix from frame t-1 to frame t is used to transform F_occ^{t-1} into a new feature map F̃_occ^{t-1}, so as to eliminate the effect of self-motion; the aligned feature map F̃_occ^{t-1} and the semantic occupancy information F_occ^{t} at the current time t are concatenated along the channel dimension and then input to the temporal encoder; the formula is as follows:
F_temp = f_temp([F̃_occ^{t-1}, F_occ^{t}])
where F_temp denotes the temporal feature output by the temporal encoder; [·,·] denotes the concatenation operation; f_temp denotes the temporal encoder, which is composed of a plurality of 3D convolution blocks and 2D convolution blocks;
After the temporal feature F_temp is obtained, it is made concrete by processing F_temp with two 2D convolution heads to obtain a motion score and a motion vector; the motion score represents the probability that a voxel is moving, and the motion vector represents the motion direction of the voxel in x and y; the formula is as follows:
F_flow = M_s(F_temp) × M_v(F_temp)
where M_s denotes the 2D convolution head that outputs the motion score, and M_v denotes the 2D convolution head that outputs the motion vector; the motion score and the motion vector are multiplied to obtain the occupancy flow F_flow; during training, F_occ^{t} is used as the initial semantic occupancy, F_flow is added to it to obtain the prediction result, and the future semantic occupancy is used as supervision.
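By way of example, the following minimal Python sketch illustrates only the voxelization step described above (cropping to ±12.8 m horizontally and ±2.4 m vertically, and assigning points to 0.1 m voxels); the function and variable names are assumptions for illustration and do not reproduce the disclosed implementation or the VoxelNet feature encoder.

```python
import numpy as np

# Range limits and voxel size taken from the text; the names are illustrative.
POINT_RANGE = np.array([-12.8, -12.8, -2.4, 12.8, 12.8, 2.4])  # x_min, y_min, z_min, x_max, y_max, z_max
VOXEL_SIZE = 0.1

def voxelize(points: np.ndarray):
    """Crop a point cloud P (n x epsilon, first 3 columns = x, y, z) to the
    configured range and compute the voxel index v(h, w, d) of every point."""
    xyz = points[:, :3]
    in_range = np.all((xyz >= POINT_RANGE[:3]) & (xyz < POINT_RANGE[3:]), axis=1)
    kept = points[in_range]
    # Integer voxel indices (h, w, d) for each kept point.
    indices = ((kept[:, :3] - POINT_RANGE[:3]) / VOXEL_SIZE).astype(np.int64)
    return kept, indices

if __name__ == "__main__":
    # A random toy point cloud with 4 feature dimensions (x, y, z, intensity).
    cloud = np.random.uniform(-15, 15, size=(1000, 4)).astype(np.float32)
    kept, idx = voxelize(cloud)
    grid_shape = np.round((POINT_RANGE[3:] - POINT_RANGE[:3]) / VOXEL_SIZE).astype(int)
    print(kept.shape, idx.shape, grid_shape)  # grid is 256 x 256 x 48 voxels
```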
Further, based on the semantic occupancy information and occupancy flow, the user position data, and the pose of the user in the environment, fusing global navigation information with local environment information to perform trajectory planning and guiding the user in the form of trajectory points, so that the user can avoid dynamic obstacles and safely reach the destination without deviating from the navigation route, includes:
designing an occupancy encoder and a navigation encoder to encode the scene and the navigation information, respectively; wherein,
the occupancy encoder is used to aggregate the semantic occupancy information and the occupancy flow; since the trajectory is regressed in the x and y directions, the occupancy encoder uses 3D convolution to reduce the channel dimension, preserving the information in the x and y directions while using the z dimension as the channel dimension; in the occupancy encoder, the semantic occupancy information and the occupancy flow are concatenated along the channel dimension and then deeply fused using a ResNet to obtain a fused scene feature; in addition, the occupancy encoder includes a coordinate convolution layer, so that the fused scene feature is sensitive to position;
After the fused scene feature is obtained, the global navigation information and the local environment information are fused by means of global average pooling and a multi-layer perceptron, where F_g denotes the navigation feature obtained by encoding the navigation information G with a 1×1 convolution, F_fused denotes the coding feature that aggregates the environment information and the navigation information, GlobalAvgPool(·) denotes global average pooling, and MLP(·) denotes the multi-layer perceptron;
After F_fused is acquired, a GRU module receives the coding feature F_fused and the position data of the user and outputs a relative-position trajectory; the GRU module consists of a plurality of GRU units; a hidden state h is passed between the GRU units; each GRU unit outputs the offset of the next trajectory point and the hidden state of the next GRU unit, with the formula as follows:
(b_{i+1}, h_{i+1}) = GRU([w_i, g_t], h_i)
w_{i+1} = b_{i+1} + w_i
where g_t denotes the current position coordinates of the user; h_i denotes the hidden state of the i-th GRU unit, initialized by F_fused; h_{i+1} denotes the hidden state of the (i+1)-th GRU unit; b_i denotes the offset of the trajectory point output by the i-th GRU unit; b_{i+1} denotes the offset of the trajectory point output by the (i+1)-th GRU unit; the current trajectory point w_i is added to the offset b_{i+1} to obtain the next trajectory point w_{i+1}; in this way, a sequence of trajectory points with a time interval of 0.5 seconds is obtained, and the user is guided forward by these trajectory points.
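A minimal PyTorch sketch of the GRU roll-out in the two formulas above is given below; the hidden size, the number of predicted points, and the linear offset head are assumptions for illustration and are not taken from the disclosed implementation.

```python
import torch
import torch.nn as nn

class WaypointDecoder(nn.Module):
    """Sketch of the roll-out (b_{i+1}, h_{i+1}) = GRU([w_i, g_t], h_i), w_{i+1} = b_{i+1} + w_i.
    The offset b is read out of the hidden state with a linear head (an assumption)."""

    def __init__(self, hidden_dim: int = 256, num_points: int = 8):
        super().__init__()
        self.num_points = num_points
        self.gru_cell = nn.GRUCell(input_size=4, hidden_size=hidden_dim)  # [w_i, g_t]: 2 + 2 dims
        self.offset_head = nn.Linear(hidden_dim, 2)                       # b_{i+1} in x and y

    def forward(self, f_fused: torch.Tensor, g_t: torch.Tensor) -> torch.Tensor:
        h = f_fused                              # hidden state initialised by the fused feature
        w = torch.zeros_like(g_t)                # w_0 = (0, 0), relative coordinates
        waypoints = []
        for _ in range(self.num_points):
            h = self.gru_cell(torch.cat([w, g_t], dim=-1), h)  # GRU([w_i, g_t], h_i)
            b = self.offset_head(h)
            w = w + b                                          # w_{i+1} = b_{i+1} + w_i
            waypoints.append(w)
        return torch.stack(waypoints, dim=1)     # (batch, num_points, 2), one point every 0.5 s

# Usage with dummy tensors standing in for F_fused and the current position g_t.
decoder = WaypointDecoder()
traj = decoder(torch.randn(2, 256), torch.randn(2, 2))
print(traj.shape)  # torch.Size([2, 8, 2])
```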
In another aspect, the invention also provides an intelligent blind assisting system based on multi-sensor fusion, which comprises the following modules:
the multi-sensor input module is used for acquiring point cloud data of an environment where a user is located, user position data and the pose of the user in the environment;
the 3D scene perception module is used to perform three-dimensional modeling of the environment where the user is located, based on the point cloud data output by the multi-sensor input module, using voxels as the scene representation, and to describe voxel attributes with semantic occupancy information and an occupancy flow; the semantic occupancy information reflects the semantic tags of the voxel space; the occupancy flow reflects the change trend of the environment where the user is located;
the trajectory prediction module is used to perform trajectory planning by fusing global navigation information and local environment information, based on the semantic occupancy information and occupancy flow output by the 3D scene perception module and the user position data and pose output by the multi-sensor input module, and to guide the user in the form of trajectory points, so that the user can avoid dynamic obstacles and safely reach the destination without deviating from the navigation route.
Further, the multi-sensor input module includes: a LiDAR (Light Detection and Ranging) sensor, an IMU (Inertial Measurement Unit), and a GNSS (Global Navigation Satellite System) receiver; wherein,
The LiDAR is used for scanning the environment where the user is located and acquiring point cloud data of the environment where the user is located;
the IMU is used for acquiring the pose of the user in the environment;
GNSS is used to acquire user location data.
Further, the 3D scene perception module is specifically configured to:
for a given frame of point cloud P ∈ R^{n×ε}, where n denotes the number of points and ε denotes the point dimension: first, the point cloud is preprocessed and limited to a range of 12.8 meters to the front, back, left, and right and 2.4 meters up and down; the 3D space is divided into voxels of size 0.1 m × 0.1 m × 0.1 m; a random point p(x_r, y_r, z_r) is assigned to the voxel indexed by v(h, w, d), where x_r, y_r, z_r denote the coordinates of the point and h, w, d denote the location index of the voxel; following the VoxelNet work, a stacked voxel feature encoding module is used to extract the point cloud features within each voxel as voxel features F_v ∈ R^{C×H×W×D}, where H×W×D denotes the spatial resolution of the voxel grid and C denotes the voxel feature dimension; then, a 3D UNet with a 3D semantic segmentation head is used to extract multi-scale geometric and semantic features from the voxel features, and the 3D semantic segmentation head maps sparse voxels to dense voxels and predicts the categories of the dense voxels to obtain the semantic occupancy information; the 3D UNet consists of an encoder and a decoder, the feature map input to the 3D UNet is downsampled four times and upsampled four times, and the feature maps of different levels in the encoder are connected to the corresponding feature maps in the decoder through skip connections; the formula is as follows:
F_occ = f_s(f_u(F_v))
where f_u and f_s denote the 3D UNet and the 3D semantic segmentation head, respectively; F_occ ∈ R^{N×H×W×D} denotes the semantic occupancy information, and N denotes the number of semantic segmentation categories;
After obtaining the semantic occupancy information, multi-frame historical semantic occupancy information is stored at a fixed time interval of 0.5 seconds; for the given semantic occupancy information F_occ^{t-1} at time t-1, the pose transformation matrix from frame t-1 to frame t is used to transform F_occ^{t-1} into a new feature map F̃_occ^{t-1}, so as to eliminate the effect of self-motion; the aligned feature map F̃_occ^{t-1} and the semantic occupancy information F_occ^{t} at the current time t are concatenated along the channel dimension and then input to the temporal encoder; the formula is as follows:
F_temp = f_temp([F̃_occ^{t-1}, F_occ^{t}])
where F_temp denotes the temporal feature output by the temporal encoder; [·,·] denotes the concatenation operation; f_temp denotes the temporal encoder, which is composed of a plurality of 3D convolution blocks and 2D convolution blocks;
After the temporal feature F_temp is obtained, it is made concrete by processing F_temp with two 2D convolution heads to obtain a motion score and a motion vector; the motion score represents the probability that a voxel is moving, and the motion vector represents the motion direction of the voxel in x and y; the formula is as follows:
F_flow = M_s(F_temp) × M_v(F_temp)
where M_s denotes the 2D convolution head that outputs the motion score, and M_v denotes the 2D convolution head that outputs the motion vector; the motion score and the motion vector are multiplied to obtain the occupancy flow F_flow; during training, F_occ^{t} is used as the initial semantic occupancy, F_flow is added to it to obtain the prediction result, and the future semantic occupancy is used as supervision.
Further, the track prediction module is specifically configured to:
designing an occupancy encoder and a navigation encoder to encode the scene and the navigation information, respectively; wherein,
the occupancy encoder is used to aggregate the semantic occupancy information and the occupancy flow; since the trajectory is regressed in the x and y directions, the occupancy encoder uses 3D convolution to reduce the channel dimension, preserving the information in the x and y directions while using the z dimension as the channel dimension; in the occupancy encoder, the semantic occupancy information and the occupancy flow are concatenated along the channel dimension and then deeply fused using a ResNet to obtain a fused scene feature; in addition, the occupancy encoder includes a coordinate convolution layer, so that the fused scene feature is sensitive to position;
After the fused scene feature is obtained, the global navigation information and the local environment information are fused by means of global average pooling and a multi-layer perceptron, where F_g denotes the navigation feature obtained by encoding the navigation information G with a 1×1 convolution, F_fused denotes the coding feature that aggregates the environment information and the navigation information, GlobalAvgPool(·) denotes global average pooling, and MLP(·) denotes the multi-layer perceptron;
After F_fused is acquired, a GRU module receives the coding feature F_fused and the position data of the user and outputs a relative-position trajectory; the GRU module consists of a plurality of GRU units; a hidden state h is passed between the GRU units; each GRU unit outputs the offset of the next trajectory point and the hidden state of the next GRU unit, with the formula as follows:
(b_{i+1}, h_{i+1}) = GRU([w_i, g_t], h_i)
w_{i+1} = b_{i+1} + w_i
where g_t denotes the current position coordinates of the user; h_i denotes the hidden state of the i-th GRU unit, initialized by F_fused; h_{i+1} denotes the hidden state of the (i+1)-th GRU unit; b_i denotes the offset of the trajectory point output by the i-th GRU unit; b_{i+1} denotes the offset of the trajectory point output by the (i+1)-th GRU unit; the current trajectory point w_i is added to the offset b_{i+1} to obtain the next trajectory point w_{i+1}; in this way, a sequence of trajectory points with a time interval of 0.5 seconds is obtained, and the user is guided forward by these trajectory points.
In yet another aspect, the present invention also provides an electronic device including a processor and a memory; at least one instruction is stored in the memory, and the instruction is loaded and executed by the processor to implement the method.
In yet another aspect, the present invention further provides a computer readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the above method.
The technical scheme provided by the invention has the beneficial effects that at least:
The invention provides a new active navigation scheme for assisting visually impaired people in traveling. Unlike previous methods, the scheme uses 3D voxels as the characterization unit, which expresses the environment more fully than 2D detection and segmentation methods; in addition, dynamic and static obstacles in the scene are predicted, helping the user perceive the change trend of the obstacles. During navigation, the scheme fuses the global planning result with the perception result and performs fine-grained local planning, so that the user can travel along the navigation route while being guided around obstacles.
The scheme of the invention is consistent with the current mainstream direction of perception research. By processing LiDAR point cloud data, our model aggregates spatial features and temporal cues, obtaining a rich perception result of the user's environment. In addition, the coarse navigation information is combined with the fine-grained occupancy result to provide guidance for the user's point-to-point navigation. This represents a new attempt to apply artificial-intelligence perception schemes to the navigation of visually impaired people. Experiments show that the average displacement error and the end-point displacement error between the trajectory points predicted by the scheme and the real pedestrian trajectory points are 0.045 m and 0.092 m, respectively, and inference takes 34 ms. The experimental results verify the feasibility of our method, which completes the task well and performs inference in real time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the operation of a prior art electronic cane;
FIG. 2 is a schematic diagram of an intelligent blind assisting method based on multi-sensor fusion provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of a scene perception and trajectory prediction flow provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of an occupancy flow prediction network provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a trajectory prediction network provided by an embodiment of the present invention;
FIG. 6 is a visualization of semantic occupancy prediction results provided by an embodiment of the present invention;
FIG. 7 is a visualization of trajectory prediction results provided by an embodiment of the present invention;
fig. 8 is a system block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
First, it should be noted that, in the embodiments of the present invention, words such as "exemplary," "for example," and the like are used to indicate an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of the term "exemplary" is intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, the meaning of "and/or" may be that of both, or may be that of either, optionally one of both.
Furthermore, in the embodiments of the present invention, "image" and "picture" may sometimes be used interchangeably, and "of", "corresponding (relevant)", and "corresponding" may sometimes be used interchangeably; it should be noted that the intended meanings are consistent when the distinction is not emphasized.
Furthermore, in the embodiments of the present invention, a subscript (e.g., W_1) may sometimes be written in non-subscript form (e.g., W1); the intended meanings are consistent when the distinction is not emphasized.
First embodiment
This embodiment provides an intelligent blind assisting method based on multi-sensor fusion and proposes a trajectory-guided perception framework. The method can not only perceive the 3D environment around the user but also perform fine-grained local path planning for the user. The innovations of the solution provided by this embodiment can be summarized as follows:
1. Unlike existing cane systems, we propose an autopilot-like approach to solve the problem of assisting travel for visually impaired people. First, we perform fine-grained scene understanding of the surrounding environment in which the visually impaired user is located and dynamically monitor changes in that environment. Second, visually impaired people can be provided with clear point-level path navigation and active obstacle avoidance, and are guided to complete the route.
2. For surrounding-environment perception, we model the environment during user travel using 3D scene voxels based on laser point cloud data. Key elements such as pedestrians, vehicles, and obstacles are perceived by predicting a semantic label for each voxel in 3D space and by using the occupancy flow. Compared with processing point clouds directly, voxels are spatially denser and more computationally efficient.
3. For navigation planning of the travel path, the navigation information is encoded and fused with the environmental perception result of the multi-sensor signals, and supervision signals provided by pedestrian walking data are used to learn, by imitation, to predict future trajectory points on the user's path, so that obstacles are avoided dynamically. With this approach we consider both long-range navigation information and real-time local path planning based on environmental changes.
The method may be implemented by an electronic device, which may be a terminal or a server. The execution flow of the method is shown in fig. 2 and mainly comprises three parts: multi-sensor input, 3D scene perception, and trajectory prediction. In the multi-sensor input part, this embodiment integrates data from LiDAR, IMU, GNSS, and other sensors, providing a data basis for the subsequent modules. The lidar scans the environment around the user and generates a point cloud to perceive the environmental elements along the user's route. GNSS is used to locate the user's position on the navigation route. The IMU sensor is used to obtain the pose of the user in the environment. In the 3D scene perception part, voxels are used to represent the user's surroundings, and the point cloud data is fed into an occupancy network and an occupancy flow network to predict, respectively, the semantic labels and the motion speeds of the voxels around the user. Finally, trajectory guidance encodes the scene understanding result and the navigation path through the trajectory prediction network and generates an output in the form of trajectory points; the data flow between these parts is sketched below.
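For reference, the following minimal Python sketch shows only the data flow between the three parts; the stub networks, array shapes, and the numbers of navigation points and waypoints are placeholders, not the networks described in this embodiment.

```python
import numpy as np

def run_pipeline(point_cloud, occupancy_history, navigation_points, user_position,
                 occupancy_net, flow_net, trajectory_net):
    """Data flow of the three stages: the LiDAR point cloud is turned into semantic
    occupancy, the occupancy history into an occupancy flow, and both are fused with
    navigation information and the user position to produce guidance waypoints."""
    f_occ = occupancy_net(point_cloud)                       # 3D scene perception
    f_flow = flow_net(occupancy_history, f_occ)              # motion of occupied voxels
    return trajectory_net(f_occ, f_flow, navigation_points, user_position)

# Stub networks so the sketch runs end to end; real models replace these callables.
occ_stub = lambda pc: np.zeros((10, 256, 256, 48), dtype=np.float32)         # N classes x H x W x D
flow_stub = lambda hist, occ: np.zeros((2, 256, 256, 48), dtype=np.float32)  # x/y motion per voxel
traj_stub = lambda occ, flow, nav, pos: np.zeros((8, 2), dtype=np.float32)   # 8 waypoints (x, y)

waypoints = run_pipeline(np.zeros((1000, 4)), [], np.zeros((16, 3)), np.zeros(2),
                         occ_stub, flow_stub, traj_stub)
print(waypoints.shape)  # (8, 2): trajectory points at 0.5 s intervals
```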
The technical scheme of each part will be described in detail below.
1. Multiple sensor inputs.
LiDAR, GNSS and IMU are three sensors that are critical for outdoor navigation. The lidar sensor emits a laser beam and utilizes the reflection of the beam in the environment to acquire point cloud data of the surrounding environment. These data provide three-dimensional information of the surrounding environment, including the position, shape, and distance of objects, for modeling and understanding the scene. In an intelligent blind-assistant system, a laser radar sensor scans the surrounding environment of a user and returns point cloud data, which is a key data source for knowing the surrounding environment of the user and implementing navigation, and helps visually impaired users to safely navigate and avoid collision with obstacles.
GNSS and IMU play an important role for outdoor navigation for visually impaired people. GNSS uses satellite positioning to determine the position of a receiver so that the navigation system can accurately know the position of the user on the earth's surface. The IMU sensor measures linear acceleration, which helps determine the user's speed and acceleration, providing motion information for navigation. In addition, the IMU also measures the angle, rotation, and tilt of the device, providing direction and attitude information for navigation. Both sensors may enable global route planning and navigation through integration with map software.
2. 3D scene perception.
An effective 3D scene representation is the basis of 3D scene perception. In the 3D scene perception module, we avoid modeling the environment directly with point clouds and instead select voxels as the representation unit, which is more computationally efficient than processing point clouds directly. In addition, considering the complexity of the environment in which a visually impaired user is located, voxels are used as the scene representation to model the environment in three dimensions; this modeling describes the environment in a dense form, and any obstacle can be detected in the form of voxels. A voxel is a cube in 3D space; each voxel represents a volume region in 3D space and typically contains data about the attributes or information within that region. Voxel characterization has been widely discussed in recent years, and it can naturally represent the spatial distribution of objects compared with BEV characterization. Compared with implicit field representations, voxel characterization is easier to optimize and train and has better interpretability. Voxel characterization is also better suited to object segmentation and detection tasks than implicit field representations, which can only represent the surface of an object, because voxels can describe object boundaries and internal structure more accurately. We use the voxel representation for scene modeling and design two downstream prediction tasks: 3D semantic occupancy prediction and voxel velocity prediction. The 3D semantic occupancy prediction reflects space occupancy information as well as semantic information, which we call semantic occupancy. The voxel velocity prediction result reflects voxel motion, which we call the occupancy flow. Fig. 3 shows how we accomplish the two scene understanding tasks and use them for trajectory planning.
The following sections provide detailed description.
1) Semantic occupancy network
Given a frame of point cloud P ∈ R^{n×ε}, where n denotes the number of points and ε denotes the point dimension, the point cloud is preprocessed and limited to a range of 12.8 meters to the front, back, left, and right and 2.4 meters up and down. The 3D space is divided into voxels of size 0.1 m × 0.1 m × 0.1 m. A random point p(x, y, z) is assigned to the voxel indexed by v(h, w, d), where x, y, z denote the coordinates of the point and h, w, d denote the location index of the voxel. Following the VoxelNet work, a stacked voxel feature encoding module is used to extract the point cloud features within each voxel as voxel features F_v ∈ R^{C×H×W×D}, where H×W×D denotes the spatial resolution of the voxel grid and C denotes the feature dimension. Subsequently, we extract multi-scale geometric and semantic features using a 3D UNet, map sparse voxels to dense voxels with a 3D segmentation head, and predict their classes. The exemplary 3D UNet consists of two parts, an encoder and a decoder, in which the feature map is downsampled four times and upsampled four times, and the feature maps of different levels in the encoder are connected to the corresponding feature maps in the decoder by skip connections. The formula is as follows:
F_occ = f_s(f_u(F_v))    (1)
where f_u and f_s denote the 3D UNet and the 3D semantic segmentation head, respectively, F_occ ∈ R^{N×H×W×D} denotes the semantic occupancy feature, and N denotes the number of semantic segmentation categories.
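The compact sketch below illustrates the structure of equation (1): a 3D UNet-style encoder-decoder f_u followed by a segmentation head f_s. For brevity it uses only two resolution levels instead of the four down/up-sampling stages described above, and the channel widths and class count are assumptions, not the disclosed configuration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm3d(c_out),
        nn.ReLU(inplace=True),
    )

class TinyUNet3D(nn.Module):
    """Minimal stand-in for F_occ = f_s(f_u(F_v)): a small 3D encoder-decoder with one
    skip connection (f_u) and a 1x1x1 semantic segmentation head (f_s)."""

    def __init__(self, c_in=16, num_classes=10, widths=(32, 64)):
        super().__init__()
        self.enc1 = conv_block(c_in, widths[0])
        self.enc2 = conv_block(widths[0], widths[1])
        self.pool = nn.MaxPool3d(2)
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.dec1 = conv_block(widths[1] + widths[0], widths[0])          # skip connection from enc1
        self.seg_head = nn.Conv3d(widths[0], num_classes, kernel_size=1)  # f_s

    def forward(self, f_v):                      # f_v: (B, C, H, W, D) voxel features
        e1 = self.enc1(f_v)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.seg_head(d1)                 # F_occ: (B, num_classes, H, W, D)

# Dummy voxel feature volume: batch 1, C = 16 channels, 64 x 64 x 16 voxels (downscaled for the demo).
f_v = torch.randn(1, 16, 64, 64, 16)
f_occ = TinyUNet3D()(f_v)
print(f_occ.shape)  # torch.Size([1, 10, 64, 64, 16])
```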
2) Occupancy flow network
Semantic occupancy information alone is not sufficient to provide complete environment-aware guidance, because the user is in motion and the surrounding scene is also dynamic. Therefore, we use voxel occupancy motion information to help the model handle dynamic obstacles. We need to infer voxel motion from changes in voxel state, which requires using the historical semantic occupancy. One common approach is to concatenate the historical features and input them into an encoder to extract temporal information, thereby producing abstract temporal features. In the blind-assisting system we propose to represent voxel motion with an occupancy flow as the motion feature, as shown in fig. 4. The occupancy flow has a clear physical meaning, which makes the whole perception module interpretable. Specifically, we save the T-frame history of semantic occupancy results at a fixed time interval of 0.5 seconds. Given the semantic occupancy F_occ^{t-1} at time t-1, we use the pose transformation matrix R from frame t-1 to frame t to convert F_occ^{t-1} into a new feature map F̃_occ^{t-1}, so as to eliminate the effect of self-motion. The aligned feature map F̃_occ^{t-1} and the current feature F_occ^{t} are concatenated in the channel dimension and then input to the temporal encoder. The formula is as follows:
F_temp = f_temp([F̃_occ^{t-1}, F_occ^{t}])    (2)
where F_temp denotes the temporal feature, F̃_occ^{t-1} and F_occ^{t} denote the aligned historical semantic occupancy result and the current semantic occupancy result, respectively, [·,·] denotes the concatenation operation, and f_temp denotes the temporal encoder, which consists of 3D convolutions. We then make this concrete by processing F_temp with two 2D convolution heads, obtaining a motion score (the probability that a voxel is moving) and a motion vector (the direction of motion of the voxel in x and y). The formula is as follows:
F_flow = M_s(F_temp) × M_v(F_temp)    (3)
where M_s denotes the convolution head that outputs the motion score and M_v denotes the convolution head that outputs the motion vector. The motion score and the motion vector are multiplied to obtain the occupancy flow F_flow. During training, F_occ^{t} is used as the initial occupancy, F_flow is added to it to obtain the prediction result, and the future semantic occupancy is used as supervision.
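As an illustration of equations (2)-(3), the sketch below concatenates the aligned previous occupancy with the current occupancy, applies a small temporal encoder, and predicts a motion score and a motion vector whose product is the occupancy flow. The channel sizes, the single 3D convolution used as f_temp, the sigmoid on the score head, and the folding of the z axis into the channel dimension are assumptions for this sketch, not the disclosed design.

```python
import torch
import torch.nn as nn

class OccupancyFlowHead(nn.Module):
    """Sketch of F_temp = f_temp([aligned F_occ^{t-1}, F_occ^t]) followed by
    F_flow = M_s(F_temp) x M_v(F_temp)."""

    def __init__(self, num_classes=10, depth=16, hidden=32):
        super().__init__()
        # f_temp: one 3D convolution over the stacked occupancies (the real encoder is deeper).
        self.temporal_encoder = nn.Conv3d(2 * num_classes, hidden, kernel_size=3, padding=1)
        self.score_head = nn.Sequential(nn.Conv2d(hidden * depth, 1, 1), nn.Sigmoid())  # M_s: motion probability
        self.vector_head = nn.Conv2d(hidden * depth, 2, 1)                              # M_v: motion in x and y

    def forward(self, occ_prev_aligned, occ_curr):
        x = torch.cat([occ_prev_aligned, occ_curr], dim=1)       # (B, 2N, H, W, D)
        f_temp = self.temporal_encoder(x)                        # (B, hidden, H, W, D)
        b, c, h, w, d = f_temp.shape
        f_bev = f_temp.permute(0, 1, 4, 2, 3).reshape(b, c * d, h, w)  # fold z into the channel dim
        return self.score_head(f_bev) * self.vector_head(f_bev)  # F_flow: (B, 2, H, W)

occ_prev = torch.randn(1, 10, 64, 64, 16)   # aligned F_occ^{t-1}
occ_curr = torch.randn(1, 10, 64, 64, 16)   # F_occ^{t}
print(OccupancyFlowHead()(occ_prev, occ_curr).shape)  # torch.Size([1, 2, 64, 64])
```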
Based on the above, this embodiment describes the attributes of the voxels using semantic occupancy and occupancy flow on a per-voxel basis. The semantic occupancy is output by the semantic occupancy network, which serves as the spatial encoder, and reflects the semantic tags of the voxel space. The occupancy flow is output by the occupancy flow network and reflects the change trend of the environment around the visually impaired user.
3. Trajectory prediction.
The track prediction module is used for predicting local track points of the next few seconds in the navigation route of the user in the travelling process, and guiding the user to dynamically avoid the obstacle according to the track points. And combining global navigation planning and local track prediction to ensure that a user can safely reach a destination without deviating from a navigation route.
In the previous module we obtained the semantic occupancy feature and the occupancy flow to represent the scene around the user. In the trajectory prediction module, we encode the scene features and predict the user's trajectory in combination with the navigation information; fig. 5 shows the detailed process of the trajectory prediction network. Specifically, an occupancy encoder and a navigation encoder are designed to encode the scene and the navigation information, respectively. The occupancy encoder is used to aggregate the spatial feature F_occ and the temporal feature F_flow. Since the trajectory is regressed in the x and y directions, we use 3D convolution to reduce the channel dimension, preserving the information in the x and y directions while using the z dimension as the channel dimension, to keep the computational complexity low. F_occ and F_flow are concatenated along the channel dimension and then deeply fused using a ResNet to obtain a fused scene feature. In addition, a coordinate convolution layer is added, making the feature position-sensitive. The navigation information G ∈ R^{k×3} consists of k GNSS points sampled from the departure point to the destination. In the navigation encoder, G is encoded using a 1×1 convolution, and the navigation and scene features are fused using global average pooling and a multi-layer perceptron to obtain F_fused (formula (4)),
where F_g denotes the navigation feature encoded from G, F_fused aggregates the environment information and the navigation information and provides rich prior knowledge for trajectory prediction, GlobalAvgPool(·) denotes global average pooling, and MLP(·) denotes a multi-layer perceptron. The subsequent GRU module accepts the coding feature F_fused and the user's GNSS coordinates g and outputs a relative-position trajectory Traj ∈ R^{n×2}, where n denotes the number of trajectory points. As shown in fig. 5, the GRU module consists of n GRU units. The hidden state h is passed between the GRU units. Each GRU unit outputs the offset of the next trajectory point and the hidden state of the next GRU unit. The formula is as follows:
(b_{i+1}, h_{i+1}) = GRU([w_i, g_t], h_i)    (5)
w_{i+1} = b_{i+1} + w_i    (6)
where g_t denotes the current GNSS coordinates; h_i denotes the hidden state of the i-th GRU unit, initialized by F_fused; b_i denotes the offset of the trajectory point output by the i-th GRU unit, with the initialization w_0 = (0, 0). The current trajectory point w_i is added to the next offset b_{i+1} to obtain the next trajectory point w_{i+1}. In this way we obtain n trajectory points with a time interval of 0.5 seconds, by which the user is guided forward.
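The sketch below illustrates one plausible form of the occupancy-encoder / navigation-encoder fusion around formula (4). Since the exact formula is not reproduced in the text, the particular combination (pooling the scene feature, pooling the 1×1-convolved navigation points, concatenation, then an MLP) and all channel sizes are assumptions. The resulting F_fused would initialize the hidden state of the GRU decoder sketched earlier.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Assumed fusion of the scene feature (F_occ and F_flow stacked, z folded into channels)
    with the navigation points G, using a 1x1 convolution, global average pooling, and an MLP."""

    def __init__(self, occ_channels=12, nav_dim=3, hidden=128, out_dim=256):
        super().__init__()
        # Occupancy encoder: 2D convolutions standing in for the ResNet mentioned in the text.
        self.scene_encoder = nn.Sequential(
            nn.Conv2d(occ_channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Navigation encoder: a 1x1 convolution over the k sampled GNSS points G (k x 3).
        self.nav_encoder = nn.Conv1d(nav_dim, hidden, kernel_size=1)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, out_dim), nn.ReLU(inplace=True),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, scene_bev, nav_points):
        f_scene = self.scene_encoder(scene_bev)                          # (B, hidden, H, W)
        f_scene = f_scene.mean(dim=(2, 3))                               # GlobalAvgPool over x, y
        f_g = self.nav_encoder(nav_points.transpose(1, 2)).mean(dim=2)   # pool over the k points
        return self.mlp(torch.cat([f_scene, f_g], dim=1))                # F_fused

scene_bev = torch.randn(1, 12, 64, 64)   # F_occ and F_flow stacked along channels (z folded in)
nav = torch.randn(1, 16, 3)              # G: k = 16 GNSS points from start to destination
f_fused = FusionEncoder()(scene_bev, nav)
print(f_fused.shape)  # torch.Size([1, 256]) -> initializes the GRU hidden state
```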
Based on the above, in outdoor navigation, we consider not only the planned route of navigation but also the environment where the user is located, make real-time local path planning, and guide the user in the form of track points. The navigation information and the environment information are described through the characteristics and are fused, so that dynamic obstacle avoidance and active planning guidance are realized.
We construct travel scenes of visually impaired people in the CARLA simulation environment, collect data for various scenes, and record semantic LiDAR point cloud data, GNSS coordinates, inertial sensor data, front-view image data, the user's world coordinates, and the pose transformation matrices during the user's motion. We used the above data for network training; the test device was an RTX 3090 Ti.
Fig. 6 provides the visualization results of 3D semantic occupancy prediction. The visualization demonstrates the effectiveness of the occupancy network, which can fully model the 3D space in which the visually impaired user is located. Although we use only the LiDAR point cloud as the raw perception data, we still obtain good 3D semantic occupancy results. Road type is highly relevant to the travel of visually impaired people, and our model can accurately delineate the road, ground, sidewalk, and even the crosswalk around the user. Our model shows accurate segmentation results even when the crosswalk lies on the road and its features are not obvious.
Fig. 7 illustrates the trajectory prediction visualization results, which accurately depict the future travel route and effectively avoid the pedestrians and static obstacles around the visually impaired user. Even vehicles at intersections can be avoided by the predicted trajectories thanks to the occupancy prediction. The visualization shows that the trajectory learned by imitation can guide the user to avoid obstacles and to go straight, turn left, and turn right according to the navigation route.
In summary, this embodiment proposes a new active navigation scheme for assisting visually impaired people in traveling. Unlike previous methods, the scheme uses 3D voxels as the characterization unit, which expresses the environment more fully than 2D detection and segmentation methods; in addition, we predict the dynamic and static obstacles in the scene to help the user perceive the change trend of the obstacles. During navigation, the scheme fuses the global planning result with the perception result and performs fine-grained local planning, so that the user can travel along the navigation route while being guided around obstacles.
The scheme of the invention is consistent with the current mainstream direction of perception research. By processing LiDAR point cloud data, our model aggregates spatial features and temporal cues, obtaining a rich perception result of the user's environment. In addition, the coarse navigation information is combined with the fine-grained occupancy result to provide guidance for the user's point-to-point navigation. This represents a new attempt to apply artificial-intelligence perception schemes to the navigation of visually impaired people. Experiments show that the average displacement error and the end-point displacement error between the trajectory points predicted by the scheme and the real pedestrian trajectory points are 0.045 m and 0.092 m, respectively, and inference takes 34 ms. The experimental results verify the feasibility of our method, which completes the task well and performs inference in real time.
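The average displacement error and end-point (final) displacement error quoted above are standard trajectory metrics; a minimal sketch of how they can be computed is shown below with toy waypoints (the numbers are illustrative and are not the experimental data).

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray):
    """Average displacement error (mean point-wise Euclidean distance) and final
    displacement error (distance at the last trajectory point), both in metres."""
    dists = np.linalg.norm(pred - gt, axis=-1)   # (num_points,)
    return dists.mean(), dists[-1]

# Toy predicted vs. ground-truth waypoints (x, y) at 0.5 s intervals.
pred = np.array([[0.0, 0.5], [0.1, 1.0], [0.2, 1.6]])
gt   = np.array([[0.0, 0.5], [0.0, 1.0], [0.1, 1.5]])
ade, fde = ade_fde(pred, gt)
print(round(ade, 3), round(fde, 3))
```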
Second embodiment
The embodiment provides an intelligent blind assisting system based on multi-sensor fusion, which comprises the following modules:
the multi-sensor input module is used for acquiring point cloud data of an environment where a user is located, user position data and the pose of the user in the environment;
the 3D scene perception module is used to perform three-dimensional modeling of the environment where the user is located, based on the point cloud data output by the multi-sensor input module, using voxels as the scene representation, and to describe voxel attributes with semantic occupancy information and an occupancy flow; the semantic occupancy information reflects the semantic tags of the voxel space; the occupancy flow reflects the change trend of the environment where the user is located;
the trajectory prediction module is used to perform trajectory planning by fusing global navigation information and local environment information, based on the semantic occupancy information and occupancy flow output by the 3D scene perception module and the user position data and pose output by the multi-sensor input module, and to guide the user in the form of trajectory points, so that the user can avoid dynamic obstacles and safely reach the destination without deviating from the navigation route.
It should be noted that, the intelligent blind assisting system based on multi-sensor fusion in this embodiment corresponds to the intelligent blind assisting method based on multi-sensor fusion in the first embodiment described above; the functions realized by the functional modules in the intelligent blind assisting system based on the multi-sensor fusion correspond to the flow steps in the intelligent blind assisting method based on the multi-sensor fusion one by one; therefore, the description is omitted here.
Third embodiment
The present embodiment provides an electronic device, as shown in fig. 8, including: a processor and a memory; wherein the processor and the memory may be connected by a communication bus; the memory stores at least one instruction that is loaded and executed by the processor to implement the method of the first embodiment. The electronic device may further comprise a transceiver, the processor and the transceiver being connectable through a communication bus, the transceiver being adapted to communicate with other devices.
The following describes the respective constituent elements of the electronic device in detail with reference to fig. 8:
The processor is the control center of the electronic device. The electronic device may include a plurality of processors, and each processor may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). The processor may be a single processor or a combination of processing elements. For example, the processor may be one or more central processing units (CPUs), but may also be another general-purpose processor, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, such as one or more digital signal processors (DSPs), one or more field programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The general-purpose processor may be a microprocessor or any conventional processor. The processor may perform various functions of the electronic device by running or executing software programs stored in the memory and invoking data stored in the memory.
In a specific implementation, the processor may include one or more CPUs, such as CPU0 and CPU1 shown in fig. 8, as an example, although this is merely an illustration.
The memory is configured to store a software program for executing the solution of the present invention, and the processor is used to control the execution of the program, and the specific implementation manner may refer to the above method embodiment, which is not described herein again.
Alternatively, the memory may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may be integrated with the processor or may exist separately and be coupled to the processor through an interface circuit (not shown in fig. 8) of the electronic device, which is not specifically limited in the embodiments of the present invention.
The transceiver may include a receiver and a transmitter (not separately shown in fig. 8). The receiver is used for realizing a receiving function, and the transmitter is used for realizing a transmitting function. The transceiver may be integrated with the processor or may exist separately and be coupled to the processor through an interface circuit (not shown in fig. 8) of the electronic device, as embodiments of the invention are not specifically limited in this regard.
Furthermore, it should be noted that the structure of the electronic device shown in fig. 8 does not constitute a limitation of the device, and an actual device may include more or less components than those shown, or may combine some components, or may have a different arrangement of components. In addition, the technical effects achieved by the electronic device when executing the method of the first embodiment may refer to the technical effects described in the first embodiment, so that the description is omitted herein.
Fourth embodiment
The present embodiment provides a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of the first embodiment described above. The computer readable storage medium may be, among other things, ROM, random access memory, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc. The instructions stored therein may be loaded by a processor in the terminal and perform the methods described above.
Furthermore, it should be noted that the present invention can be provided as a method, an apparatus, or a computer program product. Accordingly, embodiments of the invention may take the form of an entirely or partially hardware embodiment, an entirely or partially software embodiment, or an embodiment combining software and hardware aspects. Furthermore, when implemented in software, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media having computer-usable program code embodied therein. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements is not limited to those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element. Furthermore, the term "and/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, A and B together, or B alone, where A and B may be singular or plural. In addition, the character "/" herein generally indicates that the associated objects are in an "or" relationship, but may also indicate an "and/or" relationship, as can be understood from the context. "At least one" means one or more, and "a plurality" means two or more. "At least one of" or similar expressions means any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or plural.
Furthermore, it should be understood that, in the various embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; e.g., the division of functional blocks/units is merely a logical functional division, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection via interfaces, devices or units, which may be electrical, mechanical or in another form. The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If implemented in the form of a software functional unit and sold or used as a stand-alone product, the method may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
Finally, it should be noted that the above describes only preferred embodiments of the invention. It will be apparent to those skilled in the art that, once the basic inventive concept is known, several modifications and adaptations can be made without departing from the principles of the invention, and such modifications and adaptations are intended to fall within the scope of the invention. It is therefore intended that the appended claims be interpreted as covering the preferred embodiments and all alterations and modifications that fall within the scope of the embodiments of the invention.

Claims (8)

1. An intelligent blind assisting method based on multi-sensor fusion is characterized by comprising the following steps:
acquiring point cloud data of an environment where a user is located, user position data and a pose of the user in the environment;
based on the point cloud data, performing three-dimensional modeling of the environment in which the user is located, using voxels as the scene representation, and describing the attributes of the voxels with semantic occupancy information and an occupancy flow; the semantic occupancy information reflects the semantic labels of the voxel space; the occupancy flow reflects the trend of change of the environment in which the user is located;
based on the semantic occupancy information and occupancy flow, the user position data, and the pose of the user in the environment, fusing global navigation information with local environment information to perform trajectory planning, and guiding the user in the form of trajectory points, so that the user can dynamically avoid obstacles and safely reach the destination without deviating from the navigation route.
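For illustration only (not part of the claims), the following minimal Python sketch shows the data flow of the claimed method: sense, perceive semantic occupancy and occupancy flow, then plan trajectory points. All identifiers (SensorFrame, assist_step, perception, planner, nav_route) are hypothetical and do not come from the patent; perception and planner stand for the stages detailed in claims 3 and 4.

```python
# Minimal sketch of the claimed data flow; `perception` and `planner` are assumed
# callables implementing the stages of claims 3 and 4 (names are hypothetical).
from dataclasses import dataclass
import numpy as np

@dataclass
class SensorFrame:
    point_cloud: np.ndarray    # (n, eps) LiDAR points of the surrounding scene
    user_position: np.ndarray  # GNSS-derived user position
    user_pose: np.ndarray      # 4x4 user pose in the environment (from the IMU)

def assist_step(frame: SensorFrame, perception, planner, nav_route):
    """One assistance cycle: perceive the 3D scene, then plan local trajectory points."""
    occupancy, flow = perception(frame.point_cloud)            # semantic occupancy + occupancy flow
    waypoints = planner(occupancy, flow,                       # fuse the local scene ...
                        frame.user_position, frame.user_pose,  # ... with the user state ...
                        nav_route)                             # ... and global navigation
    return waypoints   # trajectory points used to guide the user (e.g., one every 0.5 s)
```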
2. The intelligent blind assisting method based on multi-sensor fusion according to claim 1, wherein the acquiring the point cloud data of the environment in which the user is located, the user position data and the pose of the user in the environment comprises:
The method comprises: scanning the environment in which the user is located with a LiDAR (Light Detection and Ranging) sensor to obtain the point cloud data of the environment in which the user is located; acquiring the pose of the user in the environment with an IMU (Inertial Measurement Unit); and acquiring the user position data with a GNSS (Global Navigation Satellite System).
3. The intelligent blind assisting method based on multi-sensor fusion according to claim 1, wherein, based on the point cloud data, performing three-dimensional modeling of the environment in which the user is located using voxels as the scene representation and describing the attributes of the voxels with semantic occupancy information and an occupancy flow comprises:
for a given frame of point cloud P ∈ R^{n×ε}, where n denotes the number of points and ε denotes the point dimension: first, the point cloud is preprocessed and restricted to a range of 12.8 m to the front, back, left and right and 2.4 m up and down; the 3D space is divided into voxels of size 0.1 m × 0.1 m × 0.1 m; any point p(x_r, y_r, z_r) is assigned to the voxel indexed v(h, w, d), where x_r, y_r, z_r denote the coordinates of the point and h, w, d denote the location index of the voxel; following the VoxelNet work, the point cloud features within each voxel are extracted as voxel features F_v ∈ R^{C×H×W×D} using a stacked voxel feature encoding module, where H×W×D denotes the spatial resolution of the voxelized point cloud and C denotes the voxel feature dimension; then, multi-scale geometric and semantic features are extracted from the voxel features using a 3D UNet with a 3D semantic segmentation head, and the segmentation head maps sparse voxels to dense voxels and predicts the class of each dense voxel to obtain the semantic occupancy information; the 3D UNet consists of an encoder and a decoder, the feature map input to the 3D UNet is downsampled four times and upsampled four times, and the feature maps at different levels of the encoder are connected to the corresponding feature maps in the decoder through skip connections; the formula is as follows:
F_occ = f_s(f_u(F_v))
where f_u and f_s denote the 3D UNet and the 3D semantic segmentation head, respectively; F_occ ∈ R^{N×H×W×D} denotes the semantic occupancy information, and N denotes the number of semantic segmentation classes;
after the semantic occupancy information is obtained, multi-frame historical semantic occupancy information is stored at a fixed time interval of 0.5 seconds; for the semantic occupancy information F_occ^{t-1} at time t-1, the pose transformation matrix from frame t-1 to frame t is used to transform F_occ^{t-1} into a new feature map F'_occ^{t-1}, eliminating the effect of the user's own motion; the feature map F'_occ^{t-1} and the semantic occupancy information F_occ^{t} at the current time t are concatenated along the channel dimension and fed into the temporal encoder, as follows:
F_temp = f_temp([F'_occ^{t-1}, F_occ^{t}])
where F_temp denotes the temporal feature output by the temporal encoder; [·,·] denotes the concatenation operation; f_temp denotes the temporal encoder, which consists of several 3D convolution blocks and 2D convolution blocks;
after the temporal feature F_temp is obtained, it is processed with two 2D convolution heads to obtain a motion score and a motion vector, where the motion score represents the probability that a voxel is moving and the motion vector represents the motion direction of the voxel in x and y; the formula is as follows:
F_flow = M_s(F_temp) × M_v(F_temp)
where M_s denotes the 2D convolution head that outputs the motion score and M_v denotes the 2D convolution head that outputs the motion vector; multiplying the motion score by the motion vector yields the occupancy flow F_flow; during training, the initial semantic occupancy information F_occ^{t} is added to F_flow to obtain the prediction result, and the future semantic occupancy is used as supervision.
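As a rough, non-authoritative illustration of the perception stage in claim 3, the PyTorch sketch below wires voxel features through a segmentation head and a temporal/flow head. The stand-in layers (nn.Identity for the 3D UNet, single convolutions for the temporal encoder) and the sigmoid on the motion score are simplifying assumptions; only the overall flow — voxel features to semantic occupancy, then temporal fusion, then motion score × motion vector = occupancy flow — follows the claim.

```python
# Hedged PyTorch sketch of the perception stage (claim 3); layer choices are assumptions.
import torch
import torch.nn as nn

class SemanticOccupancyHead(nn.Module):
    """Maps voxel features F_v (B, C, H, W, D) to semantic occupancy F_occ (B, N, H, W, D)."""
    def __init__(self, voxel_dim=64, num_classes=16):
        super().__init__()
        self.unet = nn.Identity()                              # placeholder for the 3D UNet f_u
        self.seg_head = nn.Conv3d(voxel_dim, num_classes, 1)   # 3D semantic segmentation head f_s

    def forward(self, voxel_feat):
        return self.seg_head(self.unet(voxel_feat))            # F_occ = f_s(f_u(F_v))

class OccupancyFlowHead(nn.Module):
    """Temporal encoder plus two 2D heads: motion score M_s and motion vector M_v."""
    def __init__(self, in_ch, hidden=64):
        super().__init__()
        self.temporal = nn.Conv2d(in_ch, hidden, 3, padding=1)  # stand-in for the temporal encoder f_temp
        self.score = nn.Conv2d(hidden, 1, 1)                    # M_s: per-cell motion probability
        self.vector = nn.Conv2d(hidden, 2, 1)                   # M_v: motion direction in x and y

    def forward(self, occ_history_bev):
        # occ_history_bev: (B, in_ch, H, W) = aligned past + current occupancy, stacked on the channel axis
        f = self.temporal(occ_history_bev)
        return torch.sigmoid(self.score(f)) * self.vector(f)    # F_flow = M_s(F_temp) * M_v(F_temp)
```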
4. The intelligent blind assisting method based on multi-sensor fusion according to claim 1, wherein, based on the semantic occupancy information and occupancy flow, the user position data, and the pose of the user in the environment, fusing global navigation information with local environment information to perform trajectory planning and guiding the user in the form of trajectory points, so that the user can dynamically avoid obstacles and safely reach the destination without deviating from the navigation route, comprises the following steps:
designing an occupancy encoder and a navigation encoder to encode the scene and the navigation information, respectively; wherein,
the occupancy encoder is used to aggregate the semantic occupancy information and the occupancy flow; since the trajectory is regressed in the x and y directions, the occupancy encoder uses 3D convolutions to reduce the channel dimension, preserving the information in the x and y directions while treating the z dimension as the channel dimension; in the occupancy encoder, the semantic occupancy information and the occupancy flow are concatenated along the channel dimension and then deeply fused with a ResNet to obtain a fused scene feature; in addition, the occupancy encoder includes a coordinate convolution layer, which makes the fused scene feature position-sensitive;
after the fused scene feature is obtained, the global navigation information and the local environment information are fused: the navigation information G is encoded with a 1×1 convolution to obtain the navigation feature F_g, and the fused scene feature and F_g are aggregated using global average pooling GlobalAvgPool(·) and a multi-layer perceptron MLP(·) to obtain the encoding feature F_fused, which combines the environment information and the navigation information;
after F_fused is obtained, a GRU module receives the encoding feature F_fused and the user position data and outputs a trajectory of relative positions; the GRU module consists of several GRU units, and the hidden state is propagated between the GRU units; each GRU unit outputs the offset of the next trajectory point and the hidden state of the next GRU unit, as follows:
(b_{i+1}, h_{i+1}) = GRU([w_i, g_t], h_i)
w_{i+1} = b_{i+1} + w_i
where g_t denotes the current user position data; h_i denotes the hidden state of the i-th GRU unit, initialized by F_fused; h_{i+1} denotes the hidden state of the (i+1)-th GRU unit; b_i denotes the trajectory-point offset output by the i-th GRU unit; b_{i+1} denotes the trajectory-point offset output by the (i+1)-th GRU unit; the current trajectory point w_i is added to the offset b_{i+1} to obtain the next trajectory point w_{i+1}; in this way, a sequence of trajectory points with a time interval of 0.5 seconds is obtained, and the user is guided forward along these trajectory points.
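A comparable, assumption-laden sketch of the trajectory-prediction stage in claim 4 follows. The hidden size, the number of way-points, and the exact form of the fusion (global average pooling of the scene and navigation features followed by an MLP) are guesses where the published formula is not fully legible; the way-point recurrence w_{i+1} = w_i + b_{i+1} and the initialisation of the hidden state from F_fused follow the claim.

```python
# Hedged PyTorch sketch of the trajectory decoder (claim 4); sizes and fusion details are assumptions.
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    def __init__(self, scene_ch=128, nav_ch=8, hidden=128, n_points=6):
        super().__init__()
        self.nav_enc = nn.Conv2d(nav_ch, scene_ch, kernel_size=1)              # 1x1 conv -> navigation feature F_g
        self.fuse = nn.Sequential(nn.Linear(2 * scene_ch, hidden), nn.ReLU())  # MLP producing F_fused
        self.cell = nn.GRUCell(input_size=4, hidden_size=hidden)               # input: [w_i, g_t]
        self.to_offset = nn.Linear(hidden, 2)                                  # offset b_{i+1} in x, y
        self.n_points = n_points

    def forward(self, scene_feat, nav_map, g_t):
        # scene_feat: (B, scene_ch, H, W) fused occupancy feature; nav_map: (B, nav_ch, H, W); g_t: (B, 2)
        scene = scene_feat.mean(dim=(2, 3))                                # GlobalAvgPool over the scene feature
        nav = self.nav_enc(nav_map).mean(dim=(2, 3))                       # GlobalAvgPool over F_g
        h = self.fuse(torch.cat([scene, nav], dim=1))                      # F_fused initialises the hidden state
        w = torch.zeros(scene_feat.size(0), 2, device=scene_feat.device)   # start at the user's position
        waypoints = []
        for _ in range(self.n_points):                                     # one way-point per 0.5 s interval
            h = self.cell(torch.cat([w, g_t], dim=1), h)
            w = w + self.to_offset(h)                                      # w_{i+1} = w_i + b_{i+1}
            waypoints.append(w)
        return torch.stack(waypoints, dim=1)                               # (B, n_points, 2) relative trajectory
```

In this reading, g_t is taken to be the GNSS-derived current position expressed in the local planning frame, and one way-point is emitted per 0.5 s interval, matching the claim.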
5. An intelligent blind assisting system based on multi-sensor fusion, which is characterized by comprising:
the multi-sensor input module is used for acquiring point cloud data of an environment where a user is located, user position data and the pose of the user in the environment;
the 3D scene perception module is used for performing three-dimensional modeling of the environment in which the user is located, using voxels as the scene representation, based on the point cloud data output by the multi-sensor input module, and for describing the attributes of the voxels with semantic occupancy information and an occupancy flow; the semantic occupancy information reflects the semantic labels of the voxel space; the occupancy flow reflects the trend of change of the environment in which the user is located;
the trajectory prediction module is used for performing trajectory planning by fusing global navigation information with local environment information, based on the semantic occupancy information and occupancy flow output by the 3D scene perception module and the user position data and user pose output by the multi-sensor input module, guiding the user in the form of trajectory points and ensuring that the user can dynamically avoid obstacles and safely reach the destination without deviating from the navigation route.
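To show how the three claimed modules might be composed at run time, the short sketch below wires hypothetical module objects into a single assistance loop; the class and method names are illustrative and not taken from the patent.

```python
# Illustrative composition of the claimed modules; all names are hypothetical.
class BlindAssistSystem:
    def __init__(self, sensor_module, perception_module, trajectory_module):
        self.sensors = sensor_module          # multi-sensor input module (LiDAR + IMU + GNSS)
        self.perception = perception_module   # 3D scene perception module (occupancy + flow)
        self.planner = trajectory_module      # trajectory prediction module (way-point guidance)

    def step(self, nav_route):
        points, pose, position = self.sensors.read()              # one synchronized sensor frame
        occupancy, flow = self.perception.update(points, pose)    # voxel scene model of the surroundings
        return self.planner.plan(occupancy, flow, position, pose, nav_route)  # guidance way-points
```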
6. The intelligent blind assisting system based on multi-sensor fusion according to claim 5, wherein the multi-sensor input module comprises: a LiDAR (Light Detection and Ranging) sensor, an IMU (Inertial Measurement Unit), and a GNSS (Global Navigation Satellite System) receiver; wherein,
The LiDAR is used for scanning the environment where the user is located and acquiring point cloud data of the environment where the user is located;
the IMU is used for acquiring the pose of the user in the environment;
the GNSS is used to acquire the user position data.
7. The intelligent blind assisting system based on multi-sensor fusion according to claim 5, wherein the 3D scene perception module is specifically configured to:
for a given frame of point cloud P ∈ R^{n×ε}, where n denotes the number of points and ε denotes the point dimension: first, the point cloud is preprocessed and restricted to a range of 12.8 m to the front, back, left and right and 2.4 m up and down; the 3D space is divided into voxels of size 0.1 m × 0.1 m × 0.1 m; any point p(x_r, y_r, z_r) is assigned to the voxel indexed v(h, w, d), where x_r, y_r, z_r denote the coordinates of the point and h, w, d denote the location index of the voxel; following the VoxelNet work, the point cloud features within each voxel are extracted as voxel features F_v ∈ R^{C×H×W×D} using a stacked voxel feature encoding module, where H×W×D denotes the spatial resolution of the voxelized point cloud and C denotes the voxel feature dimension; then, multi-scale geometric and semantic features are extracted from the voxel features using a 3D UNet with a 3D semantic segmentation head, and the segmentation head maps sparse voxels to dense voxels and predicts the class of each dense voxel to obtain the semantic occupancy information; the 3D UNet consists of an encoder and a decoder, the feature map input to the 3D UNet is downsampled four times and upsampled four times, and the feature maps at different levels of the encoder are connected to the corresponding feature maps in the decoder through skip connections; the formula is as follows:
F_occ = f_s(f_u(F_v))
where f_u and f_s denote the 3D UNet and the 3D semantic segmentation head, respectively; F_occ ∈ R^{N×H×W×D} denotes the semantic occupancy information, and N denotes the number of semantic segmentation classes;
after the semantic occupancy information is obtained, multi-frame historical semantic occupancy information is stored at a fixed time interval of 0.5 seconds; for the semantic occupancy information F_occ^{t-1} at time t-1, the pose transformation matrix from frame t-1 to frame t is used to transform F_occ^{t-1} into a new feature map F'_occ^{t-1}, eliminating the effect of the user's own motion; the feature map F'_occ^{t-1} and the semantic occupancy information F_occ^{t} at the current time t are concatenated along the channel dimension and fed into the temporal encoder, as follows:
F_temp = f_temp([F'_occ^{t-1}, F_occ^{t}])
where F_temp denotes the temporal feature output by the temporal encoder; [·,·] denotes the concatenation operation; f_temp denotes the temporal encoder, which consists of several 3D convolution blocks and 2D convolution blocks;
after the temporal feature F_temp is obtained, it is processed with two 2D convolution heads to obtain a motion score and a motion vector, where the motion score represents the probability that a voxel is moving and the motion vector represents the motion direction of the voxel in x and y; the formula is as follows:
F_flow = M_s(F_temp) × M_v(F_temp)
where M_s denotes the 2D convolution head that outputs the motion score and M_v denotes the 2D convolution head that outputs the motion vector; multiplying the motion score by the motion vector yields the occupancy flow F_flow; during training, the initial semantic occupancy information F_occ^{t} is added to F_flow to obtain the prediction result, and the future semantic occupancy is used as supervision.
8. The intelligent blind assisting system based on multi-sensor fusion according to claim 5, wherein the trajectory prediction module is specifically configured to:
designing an occupancy encoder and a navigation encoder to encode the scene and the navigation information, respectively; wherein,
the occupancy encoder is used to aggregate the semantic occupancy information and the occupancy flow; since the trajectory is regressed in the x and y directions, the occupancy encoder uses 3D convolutions to reduce the channel dimension, preserving the information in the x and y directions while treating the z dimension as the channel dimension; in the occupancy encoder, the semantic occupancy information and the occupancy flow are concatenated along the channel dimension and then deeply fused with a ResNet to obtain a fused scene feature; in addition, the occupancy encoder includes a coordinate convolution layer, which makes the fused scene feature position-sensitive;
after the fused scene feature is obtained, the global navigation information and the local environment information are fused: the navigation information G is encoded with a 1×1 convolution to obtain the navigation feature F_g, and the fused scene feature and F_g are aggregated using global average pooling GlobalAvgPool(·) and a multi-layer perceptron MLP(·) to obtain the encoding feature F_fused, which combines the environment information and the navigation information;
after F_fused is obtained, a GRU module receives the encoding feature F_fused and the user position data and outputs a trajectory of relative positions; the GRU module consists of several GRU units, and the hidden state is propagated between the GRU units; each GRU unit outputs the offset of the next trajectory point and the hidden state of the next GRU unit, as follows:
(b_{i+1}, h_{i+1}) = GRU([w_i, g_t], h_i)
w_{i+1} = b_{i+1} + w_i
where g_t denotes the current user position data; h_i denotes the hidden state of the i-th GRU unit, initialized by F_fused; h_{i+1} denotes the hidden state of the (i+1)-th GRU unit; b_i denotes the trajectory-point offset output by the i-th GRU unit; b_{i+1} denotes the trajectory-point offset output by the (i+1)-th GRU unit; the current trajectory point w_i is added to the offset b_{i+1} to obtain the next trajectory point w_{i+1}; in this way, a sequence of trajectory points with a time interval of 0.5 seconds is obtained, and the user is guided forward along these trajectory points.

Priority Applications (1)
Application Number: CN202410159223.6A — Priority Date: 2024-02-04 — Filing Date: 2024-02-04 — Title: Intelligent blind assisting method and system based on multi-sensor fusion


Publications (1)
Publication Number: CN118038143A — Publication Date: 2024-05-14

Family
ID: 91001505
Family Applications (1)
Application Number: CN202410159223.6A — Status: Pending — Publication: CN118038143A (en) — Title: Intelligent blind assisting method and system based on multi-sensor fusion

Country Status (1)
CN — CN118038143A (en)


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination