LU505631B1 - Apparatus and method capable of detecting stairs - Google Patents

Apparatus and method capable of detecting stairs

Info

Publication number
LU505631B1
Authority
LU
Luxembourg
Prior art keywords
stairs
threshold
labeling
detecting
downstairs
Prior art date
Application number
LU505631A
Other languages
French (fr)
Inventor
Rui Gao
Original Assignee
Univ Yanan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Yanan filed Critical Univ Yanan
Priority to LU505631A priority Critical patent/LU505631B1/en
Application granted granted Critical
Publication of LU505631B1 publication Critical patent/LU505631B1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to technical field of computer vision, and disclosed are an apparatus and method capable of detecting stairs. The apparatus includes a blind cane body, and an ultrasonic sensor, an RGB-D camera, a processor and a vibration inductor that are connected to the blind cane body. The present invention provides a blind cane for detecting stairs by using a hybrid method based on a deep learning network and the ultrasonic sensor. The cane combines an image recognition function of the deep neural network and a recognition function of the ultrasonic sensor to provide effective assistance for visually impaired people.

Description

APPARATUS AND METHOD CAPABLE OF DETECTING STAIRS
TECHNICAL FIELD
[01] The present invention relates to the technical field of computer vision, and in particular to a method based on machine vision for detecting stairs.
BACKGROUND ART
[02] Over the past decade, much work has been done in this field to assist visually impaired people in different ways. To address the mobility problem of visually impaired people, a stair detection framework employing a hybrid approach is developed, in which stair detection is performed by a pre-trained model together with an ultrasonic sensor.
[03] An existing method for detecting stairs is three-dimensional reconstruction based on binocular vision. Although three-dimensional reconstruction offers high detection accuracy, it places high demands on devices and has a relatively high cost. A bionic eye can help visually impaired people recover part of their vision, but it only lets them see low-resolution gray images, from which it is difficult to distinguish stairs from other scenes; moreover, the bionic eye is only suitable for blindness caused by retinitis pigmentosa. Traditional interactive methods for visually impaired people mainly include voice prompts and tactile vibrations. Voice prompts usually broadcast short messages, which take time to play, causing delays and accident risks, and can deliver only limited information. Tactile vibrations take a vibration belt or vibration vest as hardware and convey orientation information through vibrations, which solves the delay problem but burdens the user, and different people experience the wearables differently.
SUMMARY
[04] Aiming at the above defects in the prior art, the present invention provides an apparatus and method for detecting stairs, which perform detection and recognition by combining image detection with an ultrasonic sensor.
[05] The technical problems to be solved are as follows: three-dimensional reconstruction places high demands on devices and has a relatively high cost; the bionic eye can only help visually impaired people see low-resolution gray images, from which it is difficult to distinguish stairs from other scenes; voice prompts usually broadcast short messages, which take time to play, causing delays and accident risks, and can deliver only limited information; and tactile vibrations burden the user, and different people experience the wearables differently.
[06] The present invention is implemented as follows: an apparatus capable of detecting stairs specifically includes a blind cane body, and an ultrasonic sensor, an RGB-D camera, a processor, a vibration inductor and a buzzer that are connected to the blind cane body.
[07] The RGB-D camera is used for acquiring three-dimensional environment images; the ultrasonic sensor is used for measuring the distance from the ground, with a threshold set so that going upstairs or downstairs is determined by comparing the measured distance with the threshold.
[08] An HC-SR04 module is selected as the ultrasonic sensor and is mounted at a fixed position on the cane body.
[09] The camera is arranged at a top of the cane so as to conveniently acquire scene information.
[10] The processor employs a Raspberry Pi mainboard.
[11] The vibration inductor is arranged in a handle of the cane.
[12] Furthermore, the apparatus further includes a blind cane body, and an RGB-D camera, a processor, a vibration inductor and a buzzer that are connected to the blind cane body.
[13] The camera is arranged at a top of the cane so as to conveniently acquire scene information.
[14] The processor employs a Raspberry Pi mainboard.
[15] The vibration inductor is arranged in a handle of the cane.
[16] Furthermore, the cane combines an image recognition function of a deep neural network and a recognition function of the ultrasonic sensor, so as to provide effective auxiliary information for visually impaired people and reduce missed detection. A border regression method is employed to perform boundary labeling and mark the confidence of recognition.
[17] Furthermore, the cane combines the image recognition function of the deep neural network and the recognition function of the ultrasonic sensor to provide effective assistance for the visually impaired people.
[18] Furthermore, a deep convolutional neural network serves as the deep neural network. The network is trained on a labeled dataset, and the VGG-16 network used for extracting image features in the faster region-based convolutional neural network (Faster RCNN) is replaced with an Inception network with deeper layers and stronger expressive capability to obtain a convolutional feature map. The convolutional feature map is input into a region proposal network (RPN). Moreover, a soft non-maximum suppression (soft-NMS) algorithm is used to prune redundant anchor boxes in the region-of-interest layer, and the position information of the object to be detected is adjusted to reduce missed detection. Boundary labeling is then performed by the border regression method to mark the confidence of recognition.
[19] Another objective of the present invention is to provide a method capable of detecting stairs, to which the apparatus capable of detecting stairs is applied. The method capable of detecting stairs includes the following steps:
[20] (1) collecting images of stairs (both upstairs and downstairs) of different sizes and categories to form a dataset;
[21] (2) labeling the dataset with a labeling tool: images showing going upstairs are labeled upstair, and images showing going downstairs are labeled downstair;
[22] (3) dividing the labeled dataset from step (2) into a training set and a test set;
[23] (4) normalizing the dataset;
[24] (5) training on the training set by using the faster-rcnn-inception-v2-coco model to obtain a recognition model, where, in the training process, an initial learning rate is set to 0.0002 and the maximum iteration number is set to 200000;
[25] (6) inputting the test set into the recognition model, and outputting the category of the stairs in a test image and the accuracy of the prediction result;
[26] (7) positioning the recognized stairs, and then performing boundary labeling by using the border regression method to mark the confidence of recognition, so as to achieve the accuracy of the prediction result; and
[27] (8) using the ultrasonic sensor to assist detection and recognition by measuring the distance from the ground, where a fixed measured value is set as the threshold; per the data measurement rule, 1000 samples are taken per second and averaged; the average value is then compared with the threshold, and if the measured average value is lower than the threshold, going upstairs is determined, while if it is higher than the threshold, going downstairs is determined (a sketch of this decision rule follows the list).
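As a concrete illustration of step (8), the following is a minimal Python sketch of the averaging-and-threshold decision rule, assuming the HC-SR04 is read through the RPi.GPIO library on the Raspberry Pi. The pin numbers, the threshold value and the helper names are illustrative assumptions, not taken from the patent.

```python
import time
import RPi.GPIO as GPIO

TRIG, ECHO = 23, 24       # hypothetical BCM pin assignments
THRESHOLD_CM = 80.0       # calibrated flat-ground distance; illustrative value
SAMPLES = 1000            # the patent's stated rule: 1000 measurements per second
                          # (a real HC-SR04 samples far slower; treat as illustrative)

GPIO.setmode(GPIO.BCM)
GPIO.setup(TRIG, GPIO.OUT)
GPIO.setup(ECHO, GPIO.IN)

def read_distance_cm() -> float:
    """One HC-SR04 reading: 10 us trigger pulse, then time the echo."""
    GPIO.output(TRIG, True)
    time.sleep(10e-6)
    GPIO.output(TRIG, False)
    start = stop = time.time()
    while GPIO.input(ECHO) == 0:       # wait for echo to go high
        start = time.time()
    while GPIO.input(ECHO) == 1:       # wait for echo to go low
        stop = time.time()
    return (stop - start) * 34300 / 2  # speed of sound ~343 m/s, round trip

def classify_step(threshold_cm: float = THRESHOLD_CM) -> str:
    """Average the samples, then compare the mean against the threshold."""
    mean = sum(read_distance_cm() for _ in range(SAMPLES)) / SAMPLES
    if mean < threshold_cm:
        return "upstairs"              # ground closer than baseline: rising steps
    if mean > threshold_cm:
        return "downstairs"            # ground farther than baseline: falling steps
    return "level"
```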
[28] Another objective of the present invention is to provide a storage medium. The storage medium is used for storing a program for implementing the above method.
[29] Another objective of the present invention is to provide a data processing terminal. The data processing terminal operates the above method for detection.
[30] The present invention has the following advantages and positive effects:
[31] (1) Machine vision technology is combined with the ultrasonic sensor to achieve stair positioning, making detection more accurate.
[32] (2) Detection and recognition of going upstairs or going downstairs is achieved.
[33] (3) The vibration inductor is arranged in the handle of the cane, and the cane body vibrates to remind the user, eliminating auxiliary hardware such as the existing vibration belt or vibration vest.
[34] (4) The apparatus achieves reliability, strong timeliness, low cost and high accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
[35] FIG. 1 is a schematic structural diagram of the present invention;
[36] FIG. 2 is a connection block diagram of hardware modules of the present invention;
[37] FIG. 3 is a flow chart for training of the Faster-RCNN-Inception-V2-COCO model of the present invention;
[38] FIG. 4 is a flow chart of a faster RCNN in an example of the present invention;
[39] FIG. 5 is a loss function curve of a model training stage of an example of the present invention; and
[40] FIG. 6 is a diagram for test results of an example of the present invention.
[41] In FIG. 1: 1, cane body; 2, handle; 3, RGB-D camera; 4, vibration inductor; 5, wiring duct 1; 6, ultrasonic sensor; 7, Raspberry Pi processor; 8, wiring duct 2; 9, switch; and 10, reminder module.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[42] The application principle of the present invention will be further described below in conjunction with the drawings and particular examples.
[43] As shown in FIG. 1, the apparatus is composed of a cane body and a handle. The RGB-D camera is arranged in the handle, the vibration inductor is connected in the handle, and the cane body is provided with transparent wiring ducts, the ultrasonic sensor and the Raspberry Pi processor. The ultrasonic sensor, the RGB-D camera and the vibration inductor are connected to the Raspberry Pi processor by means of the wiring ducts. Detection results are transmitted through the Raspberry Pi processor to the vibration inductor and thereby conveyed to the user.
[44] As shown in FIG. 2, in an actual environment, images around the user are acquired in real time by the RGB-D camera, and the trained model is used to detect and recognize stairs. Data fusion is then performed by the Raspberry Pi processor in combination with the measured values of the ultrasonic sensor, and the fused result is sent to the user by means of the reminder module.
[45] As shown in FIG. 3, a dataset processing process is as follows:
[46] Dataset
[47] 510 images of stairs of different sizes and categories are acquired, of which 300 are taken from stairs of different buildings in a real environment and 210 are from websites; the training set and test set are divided according to a ratio of 7:3.
[48] The data is normalized, and the resolution is uniformly adjusted to 720x960 pixels.
[49] As shown in FIG. 4, a faster region-based convolutional neural network (Faster RCNN) is employed, and the specific steps are as follows:
[50] (1) Images are input into the convolutional neural network for feature extraction to obtain a feature map, where the convolutional neural network employs an Inception V2 network architecture.
[51] InceptionV2 is a series of improvements made by the Google team to InceptionV1 in February 2015. The network consists of five convolutional layers with 3*3 convolutional kernels and two pooling layers with 3*3 and 8*8 pooling kernels respectively; finally, a linear layer and a classification layer are added to further process the features. InceptionV2 introduces batch normalization and replaces the original 5*5 convolutional layer with two layers of 3*3 convolutions, which gives better feature extraction with fewer parameters under the same receptive field. In the training stage, the Inception deep convolutional neural network processes the training portion of the dataset, converting the images into feature vectors of fixed length. The convolution calculation process is as follows:
[52] For input stair image data $I_{(w,h)}$ of size $w \times h$ with convolutional kernel $K_{(m,n)}$, the convolutional result is $S_{(w,h)}$, and the corresponding convolutional formula is

$$S_{(w,h)} = (I * K)_{(w,h)} = \sum_{m} \sum_{n} I_{(w-m,\,h-n)}\, K_{(m,n)}$$

[54] where $w$ represents the image length, $h$ represents the image width, $m$ represents the convolutional kernel length, and $n$ represents the convolutional kernel width.
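A direct NumPy transcription of the formula in paragraph [52] may make the indexing concrete. This is an illustrative sketch only; the function name and box of assumptions are not from the patent.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 2-D convolution: S = sum_m sum_n I(w-m, h-n) * K(m, n).

    The shifted index I(w-m, h-n) is what distinguishes true convolution
    from cross-correlation, hence the kernel flip below.
    """
    H, W = image.shape
    m, n = kernel.shape
    flipped = kernel[::-1, ::-1]                 # realizes the I(w-m, h-n) indexing
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + n] * flipped)
    return out
```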
[55] The convolutional feature map is input into the region proposal network (RPN), and candidate boxes are screened by the soft non-maximum suppression algorithm and a sampling technique to select high-confidence candidate anchors. The feature map is also passed to a region-of-interest (ROI) pooling layer for the pooling operation, generating a candidate region feature map of fixed size.
[56] Through convolutional feature extraction, 9 anchor boxes are generated for each pixel on the 45*60 feature map, and the generated anchor boxes are filtered and labeled. The filtering and labeling rules are as follows:
[57] For the feature information, the anchors are classified and labeled pixel by pixel, removing anchor boxes that exceed the original image boundary of 720*960.
[58] If the intersection over union (IoU) value between an anchor box and the ground truth is the largest, the box is labeled as a positive sample, where label=1.
[59] If the IoU value between the anchor box and the ground truth is >0.7, the box is labeled as a positive sample, where label=1.
[60] If the IoU value between the anchor box and the ground truth is <0.3, the box is labeled as a negative sample, where label=0.
[61] The rest are neither negative nor positive samples and are not used for final training, where label=-1.
[62] The IoU defines the degree of overlap between two bounding boxes: it is the ratio of the overlapping area of rectangular boxes A and B to the area of their union (see the sketch below).
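The overlap measure and the labeling rules of paragraphs [58] to [61] can be written out as a short Python sketch. The corner-coordinate box format and the function names are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, is_best_match=False):
    """Labeling rules from paragraphs [58]-[61]:
    1 = positive, 0 = negative, -1 = ignored during training."""
    best_iou = max(iou(anchor, gt) for gt in gt_boxes)
    if is_best_match or best_iou > 0.7:
        return 1
    if best_iou < 0.3:
        return 0
    return -1
```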
[63] Pixel-by-pixel regression correction
[64] In addition to labeling, the offset between the anchor box and ground truth is calculated.
[65] The position coordinates of the central point of the ground-truth calibration box are $(x, y)$, and its width and height are $w$ and $h$.
[66] The coordinates of the central point of the anchor box are $(x_a, y_a)$, and its width and height are $w_a$ and $h_a$.
[67] The offsets are calculated as:

$$\Delta x = (x - x_a)/w_a, \qquad \Delta y = (y - y_a)/h_a$$
$$\Delta w = \log(w/w_a), \qquad \Delta h = \log(h/h_a)$$
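The offset formulas in paragraph [67] correspond directly to the following sketch; boxes are assumed to be given as center coordinates plus width and height, and the function name is illustrative.

```python
import math

def regression_targets(gt, anchor):
    """Offsets (dx, dy, dw, dh) of paragraph [67]; boxes are (cx, cy, w, h)."""
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    dx = (x - xa) / wa
    dy = (y - ya) / ha
    dw = math.log(w / wa)
    dh = math.log(h / ha)
    return dx, dy, dw, dh
```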
[69] Learning is performed on the differences between the anchor box and the ground truth, and out-of-bounds culling and suppression are then carried out with the soft non-maximum suppression algorithm, thereby removing the overlapping boxes.
[70] Combined training of classification probability and border regression is performed on the feature map that generates the candidate regions, by means of a fully connected layer and a linear regression layer, to achieve classification of going upstairs versus going downstairs and a stair target detection box.
[71] Border regression method
[72] By means of translation and scaling, an original input window $P$ is mapped to a regression window $\hat{G}$ closer to the real window $G$.
[73] A mapping $f$ is sought such that $f(P_x, P_y, P_w, P_h) = (\hat{G}_x, \hat{G}_y, \hat{G}_w, \hat{G}_h) \approx (G_x, G_y, G_w, G_h)$.
[75] Soft non-maximum suppression algorithm (soft-NMS)
[76] In target detection, a classifier computes a class score for each bounding box (bbox), i.e., the probability that the bbox belongs to each class. NMS is performed on the basis of these scores. The main process is as follows:
[77] For each class, a decay function is first applied to the score of every bbox whose score is below the threshold, assigning it a lower score.
[78] All bboxes are sorted by score, the highest score and its corresponding bbox are selected, and the remaining bboxes are traversed: if a bbox's IoU with the highest-scoring bbox is greater than a certain threshold, that bbox is deleted. The highest-scoring bbox among those not yet processed is then selected, and the above process is repeated until all retained bboxes have been found.
[79] The final prediction result is drawn on the basis of the class scores and class colors of all the retained bboxes (a soft-NMS sketch follows below).
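A minimal sketch of soft-NMS is given below, using the common Gaussian score decay. The decay form and parameter values are assumptions, since the patent describes the procedure only qualitatively; `iou` is the helper defined in the sketch after paragraph [62].

```python
import math

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft-NMS sketch: instead of deleting every overlapping box outright,
    decay its score by a Gaussian of its overlap with the current best box,
    then drop boxes whose score falls below score_thresh."""
    scores = list(scores)
    candidates = list(range(len(scores)))
    kept = []
    while candidates:
        best = max(candidates, key=lambda i: scores[i])
        kept.append(best)
        candidates.remove(best)
        for i in candidates[:]:
            overlap = iou(boxes[best], boxes[i])       # iou() from above
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
            if scores[i] < score_thresh:
                candidates.remove(i)
    return kept                                        # indices of retained boxes
```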
[80] As shown in FIG. 5, the loss function used in training is divided into two parts: the class loss (cls loss) and the regression loss (bbox regression loss). During model training, a loss is produced at each training step; it usually starts high and decreases as training progresses. In this design, the loss falls below 0.05 after about 600 steps, at which point training stops.
[81] The loss function equation is as follows:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$
[83] where $L_{cls}$ represents the classification loss, $L_{reg}$ represents the regression loss, $i$ represents the index of the selected anchor, and $p_i$ represents the prediction probability of anchor $i$.
[84] $L_{cls}$ is the logarithmic loss computed for each anchor, and $N_{cls}$ is the total number of anchors, selected as 256 here. $p_i^*$ is the target probability for the anchor prediction, $L_{cls}(p_i, p_i^*)$ is the logarithmic loss over the two classes (stair and non-stair), $t_i$ is a vector of predicted offsets in the RPN training stage, $t_i^*$ is a vector of the same dimension representing the actual offsets of the anchor in the RPN training stage, and $\lambda = 1$ so as to make the weights for classification and regression equal.
[85] Both loss terms are normalized by $N_{cls}$ and $N_{reg}$, which are the numbers of samples used for classification and regression respectively.
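The loss above can be written out numerically as follows. The smooth-L1 form for $L_{reg}$ is the standard Faster R-CNN choice and is an assumption here, as the patent does not name the regression loss explicitly; function and argument names are illustrative.

```python
import numpy as np

def rpn_loss(p, p_star, t, t_star, lam=1.0, n_cls=256, n_reg=2400):
    """L = (1/N_cls) * sum L_cls(p_i, p_i*) + lam/N_reg * sum p_i* L_reg(t_i, t_i*).

    p, p_star: (N,) predicted probabilities and 0/1 ground-truth labels.
    t, t_star: (N, 4) predicted and actual offsets (dx, dy, dw, dh).
    n_reg ~ number of anchor locations, per the Faster R-CNN convention.
    """
    eps = 1e-7
    # logarithmic (binary cross-entropy) loss over stair / non-stair
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # smooth-L1 regression loss, counted only for positive anchors (p* = 1)
    diff = np.abs(t - t_star)
    l_reg = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum(axis=1)
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```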
[86] The working principle of the present invention is as follows: the switch is turned on, the camera starts, and images are acquired in real time. The system first detects whether stairs exist and whether the user is going upstairs or downstairs by means of the recognition model. The ultrasonic sensor is then started for a secondary determination based on the distance difference. Accurate detection is achieved by comparing the two results: the two measurements are fused and transmitted to the Raspberry Pi processor, and the processor sends a vibration signal to the user. If going upstairs is determined, the vibration inductor of the cane vibrates at intervals; if going downstairs is determined, it vibrates in an accelerated, continuous manner; and if no stairs are detected, it does not respond.
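The working principle lends itself to a short control-loop sketch. Everything here is hypothetical glue code: `camera_detect`, `ultrasonic_classify` and `vibrate` stand in for the recognition model, the sensor routine sketched after step (8), and a vibration-inductor driver, and the agreement-based fusion rule is one plausible reading of how the two results are compared.

```python
import time

def alert_user(vibrate, category):
    """Vibration patterns from paragraph [86]: intermittent pulses for
    upstairs, accelerating near-continuous pulses for downstairs."""
    if category == "upstairs":
        for _ in range(3):
            vibrate(0.2)
            time.sleep(0.4)                  # vibrate at intervals
    elif category == "downstairs":
        pause = 0.3
        for _ in range(5):
            vibrate(0.2)
            time.sleep(pause)
            pause = max(0.05, pause / 2)     # accelerated, continuous feel
    # no stairs detected: the inductor does not respond

def main_loop(camera_detect, ultrasonic_classify, vibrate):
    """Fuse the two detectors: alert only when vision and ultrasound agree."""
    while True:
        vision = camera_detect()             # "upstairs" / "downstairs" / None
        sonic = ultrasonic_classify()        # threshold rule from step (8)
        if vision is not None and vision == sonic:
            alert_user(vibrate, vision)
```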

Claims (2)

WHAT IS CLAIMED IS:
1. An apparatus capable of detecting stairs, comprising:
an RGB-D camera used for acquiring three-dimensional environment images, inputting real-time images into a trained deep learning model of the faster-rcnn-inception-v2-coco model, and performing real-time detection to determine whether to go upstairs or downstairs;
an ultrasonic sensor used for measuring a distance from the ground, wherein a threshold is preset, whether to go upstairs or downstairs is determined by comparing a measured height with the threshold, and the ultrasonic sensor is arranged at a fixed position on a cane body; and
a vibration inductor used for signaling the recognized stair category and giving a warning to a user.
2. A method for detecting stairs by using the apparatus capable of detecting stairs according to claim 1, comprising the following steps:
(1) collecting images of stairs of different sizes and categories, including upstairs and downstairs, to form a dataset;
(2) labeling the dataset by using a labeling tool, labeling the images showing going upstairs as upstair, and labeling the images showing going downstairs as downstair;
(3) dividing the labeled dataset from step (2) into a training set and a test set;
(4) normalizing the dataset;
(5) training on the training set by using the faster-rcnn-inception-v2-coco model to obtain a recognition model, wherein in the training process, an initial learning rate is set to 0.0002 and the maximum iteration number is set to 200000;
(6) inputting the test set into the recognition model, and outputting a category of the stairs in a test image and the accuracy of a prediction result;
(7) positioning the recognized stairs, and then performing boundary labeling by using a border regression method to mark the confidence of recognition so as to achieve the accuracy of the prediction result; and
(8) using the ultrasonic sensor to assist detection and recognition by measuring a distance from the ground, wherein a fixed measured value is set as a threshold; per the data measurement rule, 1000 samples are taken per second and averaged; the average value is then compared with the threshold, and if the measured average value is lower than the threshold, going upstairs is determined, while if it is higher than the threshold, going downstairs is determined.
LU505631A 2023-11-28 2023-11-28 Apparatus and method capable of detecting stairs LU505631B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
LU505631A LU505631B1 (en) 2023-11-28 2023-11-28 Apparatus and method capable of detecting stairs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
LU505631A LU505631B1 (en) 2023-11-28 2023-11-28 Apparatus and method capable of detecting stairs

Publications (1)

Publication Number Publication Date
LU505631B1 true LU505631B1 (en) 2024-05-28

Family

ID=91333893

Family Applications (1)

Application Number Title Priority Date Filing Date
LU505631A LU505631B1 (en) 2023-11-28 2023-11-28 Apparatus and method capable of detecting stairs

Country Status (1)

Country Link
LU (1) LU505631B1 (en)

Similar Documents

Publication Publication Date Title
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
CN110175504A (en) A kind of target detection and alignment schemes based on multitask concatenated convolutional network
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN110807422A (en) Natural scene text detection method based on deep learning
CN110705558B (en) Image instance segmentation method and device
CN105654067A (en) Vehicle detection method and device
CN114494192B (en) Thoracolumbar fracture identification segmentation and detection positioning method based on deep learning
CN105740910A (en) Vehicle object detection method and device
CN114359181B (en) Intelligent traffic target fusion detection method and system based on image and point cloud
CN111144207B (en) Human body detection and tracking method based on multi-mode information perception
CN115861772A (en) Multi-scale single-stage target detection method based on RetinaNet
CN114693661A (en) Rapid sorting method based on deep learning
CN114067186B (en) Pedestrian detection method and device, electronic equipment and storage medium
CN112861785B (en) Instance segmentation and image restoration-based pedestrian re-identification method with shielding function
CN113065431B (en) Human body violation prediction method based on hidden Markov model and recurrent neural network
CN114627437B (en) Traffic target identification method and system
WO2022042203A1 (en) Human body key point detection method and apparatus
CN116188879B (en) Image classification and image classification model training method, device, equipment and medium
CN116486287A (en) Target detection method and system based on environment self-adaptive robot vision system
CN117132914A (en) Method and system for identifying large model of universal power equipment
CN117726991A (en) High-altitude hanging basket safety belt detection method and terminal
LU505631B1 (en) Apparatus and method capable of detecting stairs
CN117011932A (en) Running behavior detection method, electronic device and storage medium
CN116258931A (en) Visual finger representation understanding method and system based on ViT and sliding window attention fusion
CN114067359B (en) Pedestrian detection method integrating human body key points and visible part attention characteristics

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20240528