CN112906485B - Visual impairment person auxiliary obstacle perception method based on improved YOLO model


Info

Publication number
CN112906485B
Authority
CN
China
Prior art keywords
training
network
data
image
feature
Prior art date
Legal status: Active
Application number
CN202110098983.7A
Other languages
Chinese (zh)
Other versions
CN112906485A (en)
Inventor
刘宇红
李伟斌
付建伟
张荣芬
胡国军
Current Assignee
Hangzhou Yixiangyou Intelligent Technology Co ltd
Original Assignee
Hangzhou Yixiangyou Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Yixiangyou Intelligent Technology Co ltd filed Critical Hangzhou Yixiangyou Intelligent Technology Co ltd
Priority to CN202110098983.7A priority Critical patent/CN112906485B/en
Publication of CN112906485A publication Critical patent/CN112906485A/en
Application granted granted Critical
Publication of CN112906485B publication Critical patent/CN112906485B/en

Classifications

    • G06V 20/00 Scenes; Scene-specific elements
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses an auxiliary obstacle perception method for visually impaired people based on an improved YOLO model, which adopts Darknet-YOLOv3 as the framework and Darknet-53 as the feature-extraction backbone network; the YOLOv3 algorithm performs feature fusion using the feature-map upsampling idea of the feature pyramid network FPN, improving the accuracy of small-target detection. The method can detect and identify common obstacles on sidewalks, including road cones, stone balls, bollards, barrier crossbars, railings, fire hydrants, plants, pedestrians, pits and puddles; it can recognize various signs and targets at traffic crossings, including zebra crossings, signal lights, bicycles, motorcycles, vehicles and people; and it can also judge going upstairs, going downstairs, various steps, and other obstacle objects of unknown class.

Description

Visual impairment person auxiliary obstacle perception method based on improved YOLO model
Technical Field
The invention relates to the field of blind guiding, and in particular to an auxiliary obstacle perception method for visually impaired people based on an improved YOLO model.
Background
China has the largest blind population in the world: about 12 million blind people, roughly 18% of the world's total. As a special group in society, blind people live in darkness for a lifetime and frequently encounter all kinds of difficulties. Most blind-guiding products currently on the market are simple in structure and single in function (they can only give a simple prompt that an obstacle lies ahead). Although some products are convenient to use, their assistance is limited, and blind users still run into problems such as bad road conditions, uneven pits, and suspended obstacles ahead that common blind-guiding products cannot detect accurately. The obstacle-detection function of existing blind-guiding products is limited to measuring the distance of an obstacle; it cannot accurately locate the obstacle's position and can only detect a single obstacle. For example, among multiple moving obstacles, only the one closest to the user can be detected, which greatly reduces the practical value of the product's blind-guiding function.
Intelligent blind-guiding equipment such as blind-guiding glasses has been studied by teams and companies at home and abroad, but because performance and user experience remain unsatisfactory, such devices are still at the stage of performance testing and small-batch trial production, and no large-scale market has formed. In China in particular, the research and development of auxiliary guiding equipment for the blind is in its infancy and remains far from large-scale productization and commercialization; the domestic market for guiding equipment has not yet reached the development-and-popularization stage, so solving this problem is very important.
Disclosure of Invention
In order to solve the above problems, the present invention provides an auxiliary obstacle perception method for visually impaired people based on an improved YOLO model, which can detect and identify common obstacles on sidewalks, including road cones, stone balls, bollards, barrier crossbars, railings, fire hydrants, plants, pedestrians, pits and puddles; can recognize various signs and targets at traffic crossings, including zebra crossings, signal lights, bicycles, motorcycles, vehicles and people; and can also judge going upstairs, going downstairs, various steps, and other obstacle targets of unknown class.
In order to realize this technical scheme, the invention provides an auxiliary obstacle perception method for visually impaired people based on an improved YOLO model, which comprises the following steps:
step one: Establishing a YOLOv3 algorithm framework
Darknet-YOLOv3 is adopted as the framework; the YOLOv3 algorithm uses a GoogLeNet-based convolutional neural network, with Darknet-53 as the feature-extraction backbone network. The YOLOv3 algorithm is a fully convolutional network: skip-connection residual modules are used multiple times in the Darknet-53 structure, and downsampling is realized with strided convolutions, avoiding the gradient-explosion phenomenon caused by directly using pooling operations; the YOLOv3 algorithm performs feature fusion using the feature-map upsampling idea of the feature pyramid network FPN, improving the accuracy of small-target detection;
When the YOLOv3 algorithm performs target detection, feature extraction is first carried out on the input image through the feature-extraction network Darknet-53 to obtain 3 feature layers of different scales; each cell in each feature layer corresponds to a small square of the original image, and whichever square contains the center coordinates of the detected object (ground truth) is responsible for predicting that object;
step two: data augmentation processing
A data-augmentation layer is added between the data-reading layer and the feature-extraction layer; it augments the data not only with geometric transformations (rotation and stretching) but also by fusing the MSRCR algorithm for data enhancement, so that the system adapts to detection tasks with poor illumination and the generalization of the model framework improves;
step three: pre-training
Retrain the classifier by pre-training and then fine-tuning so that the network adapts to detection tasks in different illumination environments: pre-train on a mixed VOC2007 and VOC2012 data set, fuse in a self-made data set, and fine-tune the model on obstacle data under different illumination environments;
step four: multi-scale training
Randomly adjust the size of the input data in a multi-scale training mode to strengthen the robustness of the model; the training data is input into the network and, after image preprocessing, filtered with 32 and then 64 convolution kernels of size 3×3 and downsampled to obtain a 240×240 feature map; residual blocks consisting of 1×1 and 3×3 convolution kernels are then alternately inserted into the convolution units, and 5 groups of residual blocks compute feature maps at resolutions 240×240, 120×120, 60×60, 30×30 and 15×15; all convolution units consist of convolution layers, BN layers and pooling layers, to accelerate model convergence and reduce model parameters;
step five: network structure of improved YOLOv3
Convolution layers are added to the backbone network, improving precision while maintaining efficiency, thereby increasing the practicability and accuracy of the usage scenario;
step six: reasoning acceleration based on TensorRT
Perform the relevant computation and accelerate model inference with low-precision parameters, using TensorRT to reduce the inference time of the detection model;
step seven: Adding an attention mechanism module
An attention mechanism module is added to the 26×26-scale output to refine the information, optimizing what is learned and strengthening small-target detection; the network formed by adding 4 convolution layers and integrating them into the attention mechanism module is called SE-YOLOv3;
step eight: The loss metric
GIOU Loss is used as the distance metric for regressing the target-box coordinates; GIOU is calculated as follows, where A_c is the area of the minimum enclosing region of the two target boxes and U is the area of their union:
GIOU = IOU - (A_c - U) / A_c
The GIOU loss is calculated as:
L_GIOU = 1 - GIOU
Soft-NMS treats the IOU as a weight: the IOU is passed through a Gaussian exponential and multiplied by the original score, the results are re-sorted, and the loop continues. In Darknet-YOLOv3 the backbone network has 31 convolution layers in total, and the network structure contains 6 groups of 1× and 2× residual blocks; compared with the 5 groups of 1×, 2×, 8×, 8× and 4× residual blocks in the original YOLOv3, the number of parameters is reduced by 60%, the computational complexity is reduced, and the detection speed is improved. The feature-interaction layer is divided into four scales, with feature interaction realized within each scale by upsampling; the four scales are y1: 13×13, y2: 26×26, y3: 52×52 and y4: 104×104;
Step nine: model training
First, label the images according to the required categories: preset all annotation categories in data/predefined_classes.txt, adjust the bounding box to fit the target edge, and after labeling store the xml files in data/Annotations; each xml corresponds one-to-one with an image and contains the picture name, the picture's path, the pixel positions of the bounding boxes, and the annotation categories;
Then configure the training strategy and parameters: pre-train with the recompiled Darknet framework that fuses the MSRCR algorithm to obtain a barrier.weights file storing the weights of the whole convolutional neural network sequentially; convert this file with the ./darknet partial command into a pre-training file barrier.conv.74 containing only the convolution-layer weights; fix the 53 convolution layers of the network and fine-tune the last classification layer, observing the parameter changes in the LOG and training until the model loss no longer decreases; the hyper-parameters of the fine-tuning training are configured in the cfg file of Darknet.
In a further improvement, in step one each square corresponds to 9 prediction boxes, and only the bounding box with the largest IOU with the detected object among the prediction boxes is used to predict the object.
In the second step, the data enhancement uses the MSRCR algorithm to enhance and repair noisy images: specifically, the background light-source signal in the image is analyzed and eliminated, and the image is enhanced by removing its illumination information so that colors are closer to reality, which facilitates the subsequent extraction and analysis of effective information. The MSRCR formula is:
R_i(x,y) = G [ C_i(x,y) · Σ_{n=1}^{N} w_n ( log I_i(x,y) - log( F_n(x,y) * I_i(x,y) ) ) + b ]
where I_i(x,y) is the image information of the i-th spectral band at position (x,y), * denotes the convolution operation, F_n(x,y) is the surround function (implemented as a Gaussian), w_n is the weight of the n-th scale, and G and b are the final gain and offset respectively, both empirical parameters; C_i(x,y) is the color recovery function (CRF) of the i-th channel in chromaticity space, formulated as:
C_i(x,y) = β · log[ α · I_i(x,y) / Σ_{j=1}^{S} I_j(x,y) ]
where β is the gain controlling color restoration, α is the nonlinear gain controlling color restoration, and S is the number of channels of the picture.
The further improvement is that the MSRCR algorithm comprises the following specific steps:
step one: comb through the Darknet source code and become familiar with how data is loaded and processed in the framework;
step two: modify darknet/src/image_opencv.cpp, writing an MSRCR routine with OpenCV, and call the written msrcr.MultiScaleRetinexCR function inside the load_image_cv function of the source code to perform the image-enhancement processing;
step three: convert the processed image from Mat format to Darknet's image structure with mat_to_image;
step four: add the written msrcr.h and the modified image_opencv.cpp to darknet/src/ and recompile the source code.
The further improvement is that in the third step, the pre-training comprises the following specific steps:
step one: first pre-train the improved obstacle-recognition network on the mixed VOC2007 and VOC2012 data set with an initial learning rate of 0.01 for 160,000 iterations, obtaining the barrier.conv.74 pre-trained network weights;
step two: set the number of filters of the last convolutional layer to 84 and the number of categories in the three yolo layers to 23, fix the weight parameters of the convolution layers in the pre-training model, then fine-tune on a self-made labeled obstacle data set, update the weights, and retrain a detection model suited to obstacle identification.
The invention has the beneficial effects that it can detect and identify common obstacles on sidewalks, including road cones, stone balls, bollards, barrier crossbars, railings, fire hydrants, plants, pedestrians, pits and puddles; can recognize various signs and targets at traffic crossings, including zebra crossings, signal lights, bicycles, motorcycles, vehicles and people; and can also judge going upstairs, going downstairs, various steps, and other obstacle objects of unknown class.
Drawings
Fig. 1 is a network structure diagram of YOLOV3 algorithm of the present invention.
Fig. 2 is a flow chart of the pre-processing algorithm with image enhancement of the present invention.
Fig. 3 shows detailed network parameters according to the present invention.
Fig. 4 is a diagram of the improved YOLOv3 network architecture of the present invention.
FIG. 5 is a SE-YOLOv3 network structure of the present invention.
Fig. 6 shows the parameter configuration in the training phase of the present invention.
Fig. 7 shows the configuration of parameters in the network according to the present invention.
FIG. 8 is the fine-tuning training log of the present invention.
FIG. 9 is a diagram of a network model of the improved Darknet-Yolov3 of the present invention.
Detailed Description
In order to further the understanding of the present invention, a detailed description is given below with reference to specific examples; these examples are intended only to explain the invention and are not to be construed as limiting its scope.
As shown in fig. 1 to fig. 9, this embodiment provides an auxiliary obstacle perception method for visually impaired people based on an improved YOLO model, which includes the following steps:
step one: Building the YOLOv3 algorithm framework
Darknet-YOLOv3 is used as the framework; the YOLOv3 algorithm is based on a GoogLeNet convolutional neural network, with Darknet-53 as the feature-extraction backbone network, which reduces computational complexity, speeds up inference, and allows deployment on an edge-computing system. The YOLOv3 algorithm is a fully convolutional network: skip-connection residual modules are used multiple times in the Darknet-53 structure, downsampling is realized with strided convolutions to avoid the gradient-explosion phenomenon caused by directly using pooling operations, and feature fusion uses the feature-map upsampling idea of the feature pyramid network FPN, improving the accuracy of small-target detection, as shown in Fig. 1;
When the YOLOv3 algorithm performs target detection, feature extraction is first carried out on the input image through the feature-extraction network Darknet-53 to obtain 3 feature layers of different scales; each cell in each feature layer corresponds to a small square of the original image, and whichever square contains the center coordinates of the detected object (ground truth) is responsible for predicting that object;
After the features are fused, feature layers of 3 scales are finally output, as shown in the following table:
Comparison of the 3 feature-map scales of the YOLOv3 algorithm (416×416 input)
Feature layer | Output size | Receptive field | Anchor box sizes | Predicted boxes
1 | 13×13 | largest (large objects) | 116×90, 156×198, 373×326 | 13×13×3 = 507
2 | 26×26 | moderate (medium objects) | 30×61, 62×45, 59×119 | 26×26×3 = 2028
3 | 52×52 | smallest (small objects) | 10×13, 16×30, 33×23 | 52×52×3 = 8112
From the table: feature-map layer 1 (13×13) belongs to high-level features, has the largest receptive field and suits large objects; layer 2 (26×26) has a moderate receptive field and suits objects of ordinary size; layer 3 (52×52) has the smallest receptive field and suits small objects. Hence, for a 416×416 input, the YOLOv3 algorithm generates (13×13 + 26×26 + 52×52) × 3 = 10647 predicted bounding boxes in total, each predicting the object class probability scores and the bounding-box position coordinates.
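The box counts above can be checked directly. A minimal sketch, assuming the standard YOLOv3 COCO anchors quoted in the table:

```python
# Predicted-box bookkeeping for the three YOLOv3 output scales (416x416 input).
scales = {
    13: [(116, 90), (156, 198), (373, 326)],   # large objects
    26: [(30, 61), (62, 45), (59, 119)],       # medium objects
    52: [(10, 13), (16, 30), (33, 23)],        # small objects
}

total = 0
for side, anchors in scales.items():
    n_boxes = side * side * len(anchors)       # 3 anchor boxes per cell
    total += n_boxes
    print(f"{side}x{side} layer: {n_boxes} boxes")
print("total:", total)                         # (169 + 676 + 2704) * 3 = 10647
```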
When the algorithm performs target detection, feature extraction is first carried out on the input image through the feature-extraction network Darknet-53 to obtain 3 feature layers of different scales; each cell in each feature layer corresponds to a small square of the original image, and whichever square contains the center coordinates of the detected object (ground truth) is responsible for predicting that object. Each square corresponds to 9 prediction boxes, and only the bounding box with the largest IOU with the detected object among them is used to predict the object. The transformation from the preset bounding box to the final predicted bounding box is given by the following formulas:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where σ(x) is the sigmoid function, c_x and c_y are the offsets of the grid cell on the feature map, and p_w and p_h are the preset side lengths of the anchor box; the finally obtained bounding-box coordinates are b_(x,y,w,h), and the network's learning objective is t_(x,y,w,h).
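As an illustration of these formulas, a small decoding sketch (NumPy; the example cell and anchor are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell, anchor):
    """Decode raw predictions t = (t_x, t_y, t_w, t_h) for the grid cell
    at offset (c_x, c_y) with preset anchor side lengths (p_w, p_h)."""
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = anchor
    b_x = sigmoid(t_x) + c_x        # center stays inside the owning cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * np.exp(t_w)         # width/height scale the anchor prior
    b_h = p_h * np.exp(t_h)
    return b_x, b_y, b_w, b_h

# a prediction in cell (6, 6) of the 13x13 map with the 116x90 anchor:
print(decode_box((0.2, -0.1, 0.05, 0.1), (6, 6), (116, 90)))
```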
Step two: data augmentation processing
The YOLOv3 algorithm performs well under good illumination but easily misses or falsely detects targets when illumination is insufficient; the practical application environment of this system is complex and easily influenced by many factors, so data augmentation is applied to make the YOLOv3 algorithm better suited to an obstacle detection and recognition system. Because the number of self-collected data samples is small and lacks diversity, model overfitting easily occurs; in addition, weather encountered by the blind while traveling also affects detection precision. Data augmentation is therefore applied before feature extraction to improve data diversity;
A data-augmentation layer (Data Augmentation) is added between the data-reading layer and the feature-extraction layer; it augments the data not only with geometric transformations (rotation and stretching) but also by fusing the MSRCR (Multi-Scale Retinex with Color Restoration) algorithm for data enhancement, so that the system adapts to detection tasks with poor illumination and the generalization of the model framework improves;
the main contents of the image enhancement theory Retinex are as follows: the color of the object depends on the reflection capability of long, medium and short wave light, is not influenced by illumination nonuniformity, and has consistency. The color perceived by human eyes is substantially the effect of interaction of light and an object, the Retinex theory analyzes and eliminates a background light source signal in an image by simulating a human visual system, and enhances the image by removing illumination information in the image, so that the color is more practical, and effective information can be conveniently extracted and analyzed subsequently.
The common Retinex algorithm at present is as follows: a single-scale Retinex algorithm (SSR), a multi-scale Retinex algorithm (MSR) and a multi-scale Retinex algorithm with color recovery (MSRCR). The classical algorithms have advantages and disadvantages, and the SSR algorithm cannot give consideration to the image fidelity performance and the dynamic compression.
In this embodiment, the data enhancement uses the MSRCR algorithm to enhance and repair noisy images: specifically, the background light-source signal in the image is analyzed and eliminated, and the image is enhanced by removing its illumination information so that colors are closer to reality, which facilitates the subsequent extraction and analysis of effective information. The MSRCR formula is:
R_i(x,y) = G [ C_i(x,y) · Σ_{n=1}^{N} w_n ( log I_i(x,y) - log( F_n(x,y) * I_i(x,y) ) ) + b ]
where I_i(x,y) is the image information of the i-th spectral band at position (x,y), * denotes the convolution operation, F_n(x,y) is the surround function (implemented as a Gaussian), w_n is the weight of the n-th scale, and G and b are the final gain and offset respectively, both empirical parameters; C_i(x,y) is the color recovery function (CRF) of the i-th channel in chromaticity space, formulated as:
C_i(x,y) = β · log[ α · I_i(x,y) / Σ_{j=1}^{S} I_j(x,y) ]
where β is the gain controlling color restoration, α is the nonlinear gain controlling color restoration, and S is the number of channels of the picture.
Experiments prove that the setting of the MSRCR parameters plays a key role. The larger the gain G and offset b, the stronger the blurring effect and the more serious the color cast of the processed image; conversely, the enhancement effect becomes less obvious. The larger α and β, the higher the image contrast and brightness, but the more discontinuous the image pixels and the poorer the definition; conversely, the image saturation, contrast and brightness all decrease. After repeated experiments, the empirical parameters are set as follows: S is set to 3; for weak-light images, G and b are set to 4 and 50 respectively, and α and β to 2 and 50 respectively; to balance dynamic-range compression against color fidelity, Gaussian blurring is performed at the three scales 30, 150 and 300, and the processing results are fused with weights of 1/3 each. Setting the gain, the offset and the image-restoration weights of the pixel dynamic range according to the illumination environment restores the image effectively and recovers image information to the maximum extent.
An image-enhancement preprocessing algorithm is added between the data layer and the feature-extraction network with the parameter settings described above; this layer has no influence on the original data channels and dimensions, as shown in Fig. 2.
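For illustration, a minimal NumPy/OpenCV sketch of the MSRCR transform under the parameters above (scales 30/150/300 fused with equal 1/3 weights; G = 4, b = 50, α = 2, β = 50 for low-light frames). The actual integration is done in C++ inside Darknet as described next, so this Python version only mirrors the math:

```python
import cv2
import numpy as np

def msrcr(img_bgr, sigmas=(30, 150, 300), G=4.0, b=50.0, alpha=2.0, beta=50.0):
    """MSRCR sketch: multi-scale Retinex plus color restoration."""
    img = img_bgr.astype(np.float64) + 1.0            # avoid log(0)
    log_img = np.log(img)

    # Multi-scale Retinex: average of log I - log(F_n * I) over the scales,
    # with F_n a Gaussian surround function (equal 1/3 weights).
    msr = np.zeros_like(img)
    for sigma in sigmas:
        blurred = cv2.GaussianBlur(img, (0, 0), sigma)
        msr += (log_img - np.log(blurred)) / len(sigmas)

    # Color recovery function C_i = beta * log(alpha * I_i / sum_j I_j).
    crf = beta * (np.log(alpha * img) - np.log(img.sum(axis=2, keepdims=True)))

    out = G * (crf * msr + b)                         # final gain and offset
    out = (out - out.min()) / (out.max() - out.min() + 1e-12) * 255.0
    return out.astype(np.uint8)

# enhanced = msrcr(cv2.imread("night_scene.jpg"))     # hypothetical input file
```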
The MSRCR algorithm comprises the following specific steps:
step one: comb through the Darknet source code and become familiar with how data is loaded and processed in the framework;
step two: modify darknet/src/image_opencv.cpp, writing an MSRCR routine with OpenCV, and call the written msrcr.MultiScaleRetinexCR function inside the load_image_cv function of the source code to perform the image-enhancement processing;
step three: convert the processed image from Mat format to Darknet's image structure with mat_to_image;
step four: add the written msrcr.h and the modified image_opencv.cpp to darknet/src/ and recompile the source code.
The recompiled Darknet framework can execute the MSRCR algorithm directly to augment data during training, and can call the algorithm to preprocess images during recognition.
Step three: pre-training
When training the network, the choice of parameter-tuning strategy also has an important influence on model performance and training time. The obstacle-recognition network in this embodiment is deep, and supervised learning on small-scale data samples easily causes gradient dispersion, so the classifier is retrained by pre-training and then fine-tuning, enabling the network to adapt to detection tasks in different illumination environments.
In this embodiment the classifier is retrained by pre-training and then fine-tuning so that the network adapts to detection tasks in different lighting environments: pre-training is done on a mixed VOC2007 and VOC2012 data set, a self-made data set is fused in, and the model is fine-tuned on obstacle data under different lighting environments. The specific steps are as follows:
S1, first pre-train the improved obstacle-recognition network on the mixed VOC2007 and VOC2012 data set with an initial learning rate of 0.01 for 160,000 iterations, obtaining the barrier.conv.74 pre-trained network weights.
S2, set the number of filters of the last convolution layer to 84 and the number of categories in the three yolo layers to 23, fix the weight parameters of the convolution layers in the pre-trained model, then fine-tune on the self-made labeled obstacle data set, update the weights, and retrain a detection model suited to obstacle recognition.
Step four: multi-scale training
The size of the input data is randomly adjusted in a multi-scale training mode to strengthen the robustness of the model. The YOLOv3 network contains 3 yolo layers of different sizes, outputting feature maps at resolutions 13×13, 26×26 and 52×52. To adapt to input images of different sizes, multi-scale training is used: the random parameter is set to 1, and images of different sizes are input randomly for feature extraction. To obtain a receptive field suited to the obstacle data, the input image resolution is set to 480×480; the detailed network parameters are shown in Fig. 3. The training data is input into the network and, after image preprocessing, filtered with 32 and then 64 convolution kernels of size 3×3 and downsampled, yielding a 240×240 feature map; residual blocks consisting of 1×1 and 3×3 convolution kernels are then alternately inserted into the convolution units, and the 5 groups of residual blocks compute feature maps at resolutions 240×240, 120×120, 60×60, 30×30 and 15×15. All convolution units consist of convolution layers, BN layers and pooling layers, to accelerate model convergence and reduce model parameters.
Step five: network structure of improved YOLOv3
To lighten the network, reduce computational complexity, increase detection speed and improve the real-time performance of obstacle detection and recognition, this design draws on the design idea of YOLOv3-tiny. YOLOv3-tiny has only 7 convolution layers, greatly reduced from the 75 layers of YOLOv3, so its detection precision is also reduced and its feature extraction of certain targets is limited. As the layer count deepens, a network structure extracts features better; but from the viewpoint of a deep network, learning speeds differ greatly across layers: layers near the output learn well while layers near the input learn slowly, and after long training the weights of the first several layers may be barely better than random initialization, causing problems such as gradient explosion or vanishing gradients. The feature-extraction capability therefore cannot be increased by simply piling on layers. Aiming at YOLOv3-tiny's few convolution layers and low detection precision, an improvement of adding convolution layers to the backbone network is proposed, which preserves efficiency while improving precision, increasing practicability and accuracy in the usage scenario. Detection precision gradually increases with the number of added network layers, but so does the amount of computation, and the relation between added layers and precision gain is not linear: when more than 4 layers are added, the model's inference speed drops considerably while the precision gain is not obvious. Therefore, balancing precision against efficiency, four 3×3 convolution layers are added here to deepen the network; and to improve the model's learning capability while reducing its parameters, corresponding 1×1 convolution layers are added alongside the four added 3×3 layers, achieving a good balance of precision and speed. The structure of the improved YOLOv3-tiny network is shown in Fig. 4.
Step six: reasoning acceleration based on TensorRT
Generally, to guarantee calculation accuracy, the deep-network weights of most algorithm frameworks use single-precision floating-point data, which is common in computers but occupies twice the storage of half-precision floating point and is computationally more expensive. A half-precision floating-point type is defined in the IEEE 754 standard and is called the Half type in CUDA (Compute Unified Device Architecture); two half-precision operations complete in the time of one single-precision operation, which brings a large gain in calculation speed over the single-precision data type.
The deep-learning training stage comprises forward inference that computes the loss function and backward gradient propagation that updates the parameters. After training is complete, only the forward-inference pass is needed for prediction, so the requirement on calculation precision is lower, and the relevant computation and model inference can be done with low-precision parameters without much change in inference accuracy. The Jetson Nano platform supports half-precision floating-point operations, whose theoretical speed is twice that of single precision, so this embodiment uses TensorRT to reduce the inference time of the detection model;
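As an illustration only, a sketch of building a half-precision engine with a TensorRT 7/8-era Python API; the export of the Darknet weights to ONNX (and the file name) is an assumption, not part of the method itself:

```python
import tensorrt as trt  # assumes a TensorRT 7/8-era Python API

def build_fp16_engine(onnx_path):
    """Parse an exported ONNX model and build an FP16 TensorRT engine."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("failed to parse ONNX model")
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)   # enable half-precision kernels
    return builder.build_engine(network, config)

# engine = build_fp16_engine("se_yolov3.onnx")  # hypothetical export file
```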
step seven: attention adding mechanism module
The attention mechanism helps the model obtain semantic information with stronger expressive power: it captures the regions of the image that contribute most to the task at hand and ignores regions that bring negative effects (noise elements and the like), thereby improving the fitting capability of the whole model.
To strengthen the detection of small-target information, this embodiment adds an attention mechanism module to the 26×26-scale output to refine the information, optimizing what is learned and enhancing small-target detection; the network formed by adding 4 convolution layers and integrating the attention mechanism module is called SE-YOLOv3, and its structure is shown in Fig. 5;
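A minimal NumPy sketch of a squeeze-and-excitation block on one (C, H, W) feature map; the reduction ratio and the random weights are illustrative assumptions:

```python
import numpy as np

def se_block(fm, w1, w2):
    """Squeeze-and-excitation: pool -> FC+ReLU -> FC+sigmoid -> rescale."""
    squeeze = fm.mean(axis=(1, 2))                   # global average pool, (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)           # bottleneck FC + ReLU
    excite = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # FC + sigmoid, (C,)
    return fm * excite[:, None, None]                # reweight each channel

c, r = 256, 16                                       # channels, reduction ratio
fm = np.random.randn(c, 26, 26)                      # a 26x26 feature map
w1 = np.random.randn(c // r, c) * 0.01
w2 = np.random.randn(c, c // r) * 0.01
print(se_block(fm, w1, w2).shape)                    # (256, 26, 26)
```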
step eight: The loss metric
GIOU Loss is used as the distance metric for regressing the target-box coordinates; GIOU is calculated as follows, where A_c is the area of the minimum enclosing region of the two target boxes and U is the area of their union:
GIOU = IOU - (A_c - U) / A_c
The GIOU loss is then:
L_GIOU = 1 - GIOU
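The two formulas can be sketched directly for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def giou_loss(box_a, box_b):
    """L_GIOU = 1 - GIOU for two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = ((ax2 - ax1) * (ay2 - ay1) +
             (bx2 - bx1) * (by2 - by1) - inter)       # U in the formula

    # A_c: area of the smallest box enclosing both target boxes.
    a_c = ((max(ax2, bx2) - min(ax1, bx1)) *
           (max(ay2, by2) - min(ay1, by1)))

    giou = inter / union - (a_c - union) / a_c
    return 1.0 - giou

print(giou_loss((0, 0, 2, 2), (1, 1, 3, 3)))          # ~1.079 for this overlap
```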
Soft-NMS treats the IOU as a weight: the IOU is passed through a Gaussian exponential and multiplied by the original score, the results are re-sorted, and the loop continues. In Darknet-YOLOv3 the backbone network has 31 convolution layers in total, and the network structure contains 6 groups of 1× and 2× residual blocks; compared with the 5 groups of 1×, 2×, 8×, 8× and 4× residual blocks in the original YOLOv3, the number of parameters is reduced by 60%, the computational complexity falls, and the detection speed rises. The feature-interaction layer is divided into four scales, with feature interaction realized within each scale by upsampling; the four scales are y1: 13×13, y2: 26×26, y3: 52×52 and y4: 104×104. Fig. 9 shows the improved Darknet-YOLOv3 network model of this design;
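A sketch of the Gaussian re-scoring loop just described; sigma and the score threshold are assumed typical values, not taken from the text:

```python
import math

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def soft_nms_gaussian(boxes, scores, sigma=0.5, thresh=0.001):
    """Decay overlapping scores by exp(-IoU^2 / sigma), re-sort, repeat."""
    scores = list(scores)
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])   # highest score first
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            iou = box_iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(iou ** 2) / sigma)   # Gaussian re-weighting
        remaining = [i for i in remaining if scores[i] > thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(soft_nms_gaussian(boxes, [0.9, 0.8, 0.7]))         # e.g. [0, 2, 1]
```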
step nine: model training
The concept of an obstacle is broad and covers many object types, so this embodiment builds an obstacle database for the object categories, common in daily life, that can obstruct blind people's travel and threaten their safety. The database images come mainly from a self-made obstacle data set collected on actual roads and crawled from the web, together with the VOC2007 and VOC2012 data sets; the obstacle data set built in this embodiment comprises 28800 images, including daytime and nighttime data, at a resolution of 640×480 and covering 23 categories. The labeled images are randomly split into training and test sets at a ratio of 6:4, giving 17280 training images.
Because some required categories are missing from the open-source data sets, such as zebra crossings, red lights and green lights, the self-made obstacle data set must be labeled. Manual labeling is done with the visual annotation tool LabelImg, which has Windows and Linux versions; the required environment on the Linux platform is Python with the lxml library. The images are labeled according to the required categories as follows: preset all annotation categories in data/predefined_classes.txt, adjust the bounding box to fit the target edge, and after labeling store the xml files in data/Annotations; each xml corresponds one-to-one with an image and contains the picture name, the picture's path, the pixel positions of the bounding boxes, and the annotation categories.
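A minimal reader for the resulting annotation files, assuming the standard Pascal-VOC xml layout that LabelImg writes:

```python
import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    """Return the image filename and a list of (category, xmin, ymin, xmax, ymax)."""
    root = ET.parse(xml_path).getroot()
    filename = root.findtext("filename")
    objects = []
    for obj in root.iter("object"):
        name = obj.findtext("name")                    # annotation category
        box = obj.find("bndbox")
        coords = tuple(int(float(box.findtext(tag)))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((name, *coords))
    return filename, objects

# name, boxes = read_annotation("data/Annotations/000001.xml")  # hypothetical file
```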
Then the training strategy and parameters are configured. Pre-training with the recompiled Darknet framework that fuses the MSRCR algorithm yields a barrier.weights file, which stores the weights of the whole convolutional neural network sequentially; the ./darknet partial command converts this file into a pre-training file barrier.conv.74 containing only the convolution-layer weights. The 53 convolution layers of the network are then fixed and the last classification layer is fine-tuned, observing the parameter changes in the LOG and training until the model loss no longer decreases; training takes about 7 hours and 30 minutes. The hyper-parameters of the fine-tuning training are configured in the cfg file of Darknet; the specific contents are shown in Fig. 6.
When training a model, the optimization of the hyper-parameters determines the generalization ability of the model; in this embodiment the hyper-parameter settings were determined by trial and error. In Fig. 6, batch is the number of samples traversed in each iteration of the batch gradient-descent algorithm; a smaller batch helps keep the network out of local optima. subdivisions is the number of sub-batches a batch is split into, so that a batch can be fed through the network in pieces when video memory is limited. The input image size is 480×480 with 3 channels; momentum and weight decay are set to 0.9 and 0.0005 respectively, which speeds up network convergence and suppresses overfitting. The angle, saturation, exposure and hue parameters control the corresponding augmentation transforms. Because the data set here is large, the initial learning rate is set to 0.001 with 80000 iterations, using the steps learning policy and changing the learning rate at 30000 and 70000 iterations.
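For orientation, a hypothetical [net] section of the Darknet cfg consistent with the values stated here; entries marked as assumed are not given in the text (the actual values appear in Fig. 6):

```ini
[net]
batch=64                 # stated in the text
subdivisions=16          # assumed; actual value in Fig. 6
width=480
height=480
channels=3
momentum=0.9
decay=0.0005
angle=0                  # assumed augmentation values; actual ones in Fig. 6
saturation=1.5           # assumed
exposure=1.5             # assumed
hue=.1                   # assumed
learning_rate=0.001
max_batches=80000
policy=steps
steps=30000,70000
scales=.1,.1             # assumed: cut the learning rate 10x at each step
```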
The obstacle data set of this embodiment contains 23 classes in total, so the filter number of the last convolution layer used for feature extraction is set to 3 × (1 + 4 + 23) = 84, and the classes in the three yolo layers are modified to 23, as shown in Fig. 7.
Fig. 8 shows the log output during the Darknet fine-tuning training, where Region 82, Region 94 and Region 106 are the predictions at the three different scales; Avg IOU is the average intersection-over-union between the prediction boxes and the manually annotated boxes within a batch (the larger the better, with 1 as the upper bound); Class is the classification confidence of the object; and .5R and .75R are the recalls at the IOU thresholds 0.5 and 0.75 respectively, the closer to 1 the better. At the training cutoff the learning rate had dropped to 0.0001, the total loss was 0.390281, and the average loss was 0.589084. Since the training set's batch = 64 and 80000 iterations were run on two GPUs, the total number of pictures trained on was 80000 × 2 × 64 = 10240000.
After putting on the wearable terminal that integrates this intelligent perception system, a blind person can automatically perceive obstacles within a certain distance ahead while traveling: the system detects and identifies the distance and direction of each obstacle, plans an effective walking route, and announces the next walking direction by voice. Assuming each step of the blind user covers at most 0.5 m, each announced obstacle is the target closest to the user within an effective area of about 1.5 m × 1.5 m in front, and the announced walking direction is one of 5 possible routes; which route is taken depends on the path-planning result, which in turn is determined by the distribution of obstacles in the current effective area ahead. If obstacles on the left, middle and right all block the way, the system stops the user, has them step back one step, gives an alarm prompt, and then re-plans the path.
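A toy sketch of such a decision rule (the actual planning logic is not spelled out in the text, so the rule set here is purely illustrative):

```python
def plan_step(left_blocked, mid_blocked, right_blocked):
    """Pick a voice prompt from the occupancy of the ~1.5 m x 1.5 m area ahead."""
    if not mid_blocked:
        return "go straight"
    if not left_blocked:
        return "step left"
    if not right_blocked:
        return "step right"
    return "stop, step back, alarm, and re-plan"      # all three lanes blocked

print(plan_step(left_blocked=True, mid_blocked=True, right_blocked=False))
```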
The foregoing illustrates and describes the principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above; the embodiments and the description only illustrate the principles of the invention, and various changes and improvements may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (3)

1. An auxiliary obstacle perception method for visually impaired people based on an improved YOLO model, characterized by comprising the following steps:
step one: Establishing a YOLOv3 algorithm framework
Darknet-YOLOv3 is adopted as the framework; the YOLOv3 algorithm uses a GoogLeNet-based convolutional neural network, with Darknet-53 as the feature-extraction backbone network; the YOLOv3 algorithm is a fully convolutional network that uses skip-connection residual modules multiple times, realizes downsampling with strided convolutions, and performs feature fusion using the feature-map upsampling idea of the feature pyramid network FPN;
When the YOLOv3 algorithm performs target detection, feature extraction is first carried out on the input image through the feature-extraction network Darknet-53 to obtain 3 feature layers of different scales; each cell in each feature layer corresponds to a square of the original image, and whichever square contains the center coordinates of the detected object is responsible for predicting that object;
step two: data augmentation processing
A data-augmentation layer is added between the data-reading layer and the feature-extraction layer; the data is augmented with geometric transformations (rotation and stretching), and the MSRCR algorithm is fused in for data enhancement so as to adapt to poor illumination conditions;
The data enhancement uses the MSRCR algorithm to enhance and repair noisy images: specifically, the background light-source signal in the image is analyzed and eliminated, and the image is enhanced by removing its illumination information; the MSRCR formula is:
R_i(x,y) = G [ C_i(x,y) · Σ_{n=1}^{N} w_n ( log I_i(x,y) - log( F_n(x,y) * I_i(x,y) ) ) + b ]
where I_i(x,y) is the image information of the i-th spectral band at position (x,y), * denotes the convolution operation, F_n(x,y) is the surround function (implemented as a Gaussian), w_n is the weight of the n-th scale, and G and b are the final gain and offset respectively, both empirical parameters; C_i(x,y) is the color recovery function (CRF) of the i-th channel in chromaticity space, formulated as:
C_i(x,y) = β · log[ α · I_i(x,y) / Σ_{j=1}^{S} I_j(x,y) ]
where β is the gain controlling color restoration, α is the nonlinear gain controlling color restoration, and S is the number of channels of the picture; S is set to 3, G and b for weak-light images are set to 4 and 50 respectively, α and β are set to 2 and 50 respectively, Gaussian blurring is performed at the three scales 30, 150 and 300, and the processing results are fused with weights of 1/3 each;
step three: pre-training
Retrain the classifier by pre-training and then fine-tuning so that the network adapts to detection tasks in different illumination environments: pre-train on a mixed VOC2007 and VOC2012 data set, fuse in a self-made data set, and fine-tune the model on obstacle data under different illumination environments;
step four: multi-scale training
Randomly adjust the size of the input data in a multi-scale training mode; the training data is input into the network and, after image preprocessing, filtered with 32 and then 64 convolution kernels of size 3×3 and downsampled to obtain a 240×240 feature map; residual blocks consisting of 1×1 and 3×3 convolution kernels are then alternately inserted into the convolution units, and 5 groups of residual blocks compute feature maps at resolutions 240×240, 120×120, 60×60, 30×30 and 15×15; all convolution units consist of convolution layers, BN layers and pooling layers;
step five: network structure of improved YOLOv3
Convolution layers are added to the backbone network, improving precision while maintaining efficiency, thereby increasing the practicability and accuracy of the usage scenario;
step six: reasoning acceleration based on TensorRT
Perform the relevant computation and accelerate model inference with low-precision parameters, using TensorRT to reduce the inference time of the detection model;
step seven: attention adding mechanism module
Add an attention mechanism module at the 26×26-scale output to refine the information; the network with 4 added convolution layers fused into the attention mechanism module is SE-YOLOv3;
step eight: The loss metric
GIOU Loss is used as the distance metric for regressing the target-box coordinates; GIOU is calculated as follows, where A_c is the area of the minimum enclosing region of the two target boxes and U is the area of their union:
GIOU = IOU - (A_c - U) / A_c
The GIOU loss function L_GIOU is then:
L_GIOU = 1 - GIOU
Soft-NMS treats the IOU as a weight: the IOU is passed through a Gaussian exponential and multiplied by the original score, the results are re-sorted, and the loop continues; in Darknet-YOLOv3 the backbone network has 31 convolution layers in total, and the network structure contains 6 groups of 1× and 2× residual blocks; the feature-interaction layer is divided into four scales, with feature interaction realized within each scale by upsampling; the four scales are y1: 13×13, y2: 26×26, y3: 52×52 and y4: 104×104;
Step nine: model training
First, label the images according to the required categories: preset all annotation categories in data/predefined_classes.txt, adjust the bounding box to fit the target edge, and after labeling store the xml files in data/Annotations; each xml corresponds one-to-one with an image and contains the picture name, the picture's path, the pixel positions of the bounding boxes, and the annotation categories;
Then configure the training strategy and parameters: pre-train with the recompiled Darknet framework that fuses the MSRCR algorithm to obtain a barrier.weights file storing the weights of the whole convolutional neural network sequentially; convert this file with the ./darknet partial command into a pre-training file barrier.conv.74 containing only the convolution-layer weights; fix the 53 convolution layers of the network and fine-tune the last classification layer, observing the parameter changes in the LOG and training until the model loss no longer decreases; the hyper-parameters of the fine-tuning training are configured in the cfg file of Darknet.
2. The auxiliary obstacle perception method for visually impaired people based on an improved YOLO model according to claim 1, wherein in step one each square corresponds to 9 prediction boxes, and only the bounding box with the largest IOU with the detected object among the prediction boxes is used to predict the object.
3. The improved YOLO model-based auxiliary obstacle perception method for visually impaired people according to claim 1, wherein in step three, the pre-training comprises the following specific steps:
step one: first pre-train the improved obstacle-recognition network on the mixed VOC2007 and VOC2012 data set with an initial learning rate of 0.01 for 160,000 iterations, obtaining the barrier.conv.74 pre-trained network weights;
step two: set the number of filters of the last convolution layer to 84 and the number of categories in the three YOLO layers to 23, fix the weight parameters of the convolution layers in the pre-trained model, then fine-tune on a self-made labeled obstacle data set, update the weights, and retrain a detection model suited to obstacle identification.
CN202110098983.7A 2021-01-25 2021-01-25 Visual impairment person auxiliary obstacle perception method based on improved YOLO model Active CN112906485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098983.7A CN112906485B (en) 2021-01-25 2021-01-25 Visual impairment person auxiliary obstacle perception method based on improved YOLO model

Publications (2)

Publication Number Publication Date
CN112906485A (en) 2021-06-04
CN112906485B (en) 2023-01-31

Family

ID=76120223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110098983.7A Active CN112906485B (en) 2021-01-25 2021-01-25 Visual impairment person auxiliary obstacle perception method based on improved YOLO model

Country Status (1)

Country Link
CN (1) CN112906485B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486726B (en) * 2021-06-10 2023-08-01 广西大学 Rail transit obstacle detection method based on improved convolutional neural network
CN113298029A (en) * 2021-06-15 2021-08-24 广东工业大学 Blind person walking assisting method and system based on deep learning target detection
CN113313129B (en) * 2021-06-22 2024-04-05 中国平安财产保险股份有限公司 Training method, device, equipment and storage medium for disaster damage recognition model
CN113420695A (en) * 2021-07-01 2021-09-21 河钢雄安数字科技有限公司 Rapid flame detection method based on MSRCR and YOLOv4-Tiny algorithm
CN113505771B (en) * 2021-09-13 2021-12-03 华东交通大学 Double-stage article detection method and device
CN113762226B (en) * 2021-11-09 2022-01-07 成都理工大学 Method and system for adjusting and improving tree species identification precision based on high spectral resolution
CN114246767B (en) * 2022-01-10 2023-03-21 河海大学 Blind person intelligent navigation glasses system and device based on cloud computing
CN114648513B (en) * 2022-03-29 2022-11-29 华南理工大学 Motorcycle detection method based on self-labeling data augmentation
CN114743116A (en) * 2022-04-18 2022-07-12 蜂巢航宇科技(北京)有限公司 Barracks patrol scene-based unattended special load system and method
CN115376108A (en) * 2022-09-07 2022-11-22 南京邮电大学 Obstacle detection method and device in complex weather

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815886A (en) * 2019-01-21 2019-05-28 南京邮电大学 A kind of pedestrian and vehicle checking method and system based on improvement YOLOv3
CN110443208A (en) * 2019-08-08 2019-11-12 南京工业大学 A kind of vehicle target detection method, system and equipment based on YOLOv2
CN110796168A (en) * 2019-09-26 2020-02-14 江苏大学 Improved YOLOv 3-based vehicle detection method
CN111401148A (en) * 2020-02-27 2020-07-10 江苏大学 Road multi-target detection method based on improved multilevel YOLOv3
CN111460919A (en) * 2020-03-13 2020-07-28 华南理工大学 Monocular vision road target detection and distance estimation method based on improved YOLOv3

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229455B (en) * 2017-02-23 2020-10-16 北京市商汤科技开发有限公司 Object detection method, neural network training method and device and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Autonomous-driving target detection based on YOLOv3; Yuan Zhihong et al.; Journal of Chongqing University of Technology (Natural Science); 2020-09-15 (No. 09); 64-69 *
Vehicle multi-target detection based on YOLOv3; Wang Pingping et al.; Science and Technology & Innovation; 2020-02-05 (No. 03); 74-76 *
Application of the improved YOLOv3 algorithm in target recognition and grasping; Zhang Hao et al.; Journal of Changchun University of Science and Technology (Natural Science Edition); 2020-04-15 (No. 02); 85-92 *

Also Published As

Publication number Publication date
CN112906485A (en) 2021-06-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant