CN111084711B - Terrain detection method of blind guiding stick based on active visual guidance - Google Patents
Terrain detection method of blind guiding stick based on active visual guidance
- Publication number
- CN111084711B (application CN201911355769.4A)
- Authority
- CN
- China
- Prior art keywords
- ground
- blind
- image
- audio
- discriminator
- Prior art date
- 2019-12-25
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- A — HUMAN NECESSITIES
- A61 — MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61H — PHYSICAL THERAPY APPARATUS, e.g. DEVICES FOR LOCATING OR STIMULATING REFLEX POINTS IN THE BODY; ARTIFICIAL RESPIRATION; MASSAGE; BATHING DEVICES FOR SPECIAL THERAPEUTIC OR HYGIENIC PURPOSES OR SPECIFIC PARTS OF THE BODY
- A61H3/00 — Appliances for aiding patients or disabled persons to walk about
- A61H3/06 — Walking aids for blind persons
- A61H3/061 — Walking aids for blind persons with electronic detecting or guiding means
- A61H3/068 — Sticks for blind persons
- A61H2003/063 — Walking aids for blind persons with electronic detecting or guiding means with tactile perception
Abstract
The invention relates to a terrain detection method for a blind guiding stick based on active visual guidance, and belongs to the technical fields of active guidance and deep learning. The method serves the field of assistive equipment for disabled persons: it uses an existing GAN to generate tactile signals for a blind-person assistive device, namely a blind guiding stick, and introduces active visual guidance so that blind users can better perceive ground information through vibrotactile feedback. Since the blind person can sense ground information, aimless probing of the ground is avoided; instead, the ground is probed under active visual guidance, making travel more convenient for visually impaired people. Compared with traditional blind guiding devices, the method requires no purposeless probing with the stick: ground anomaly detection is performed first, achieving targeted probing of ground areas, which makes travel safer, more reliable, and more convenient for the visually impaired.
Description
Technical Field
The invention relates to a terrain detection method of a blind guiding stick based on active visual guidance, and belongs to the technical field of active guidance and deep learning.
Background
In recent years, with the popularization of electronic products and changes in people's lifestyles, the number of people with impaired vision or blindness has kept increasing. According to a 2019 World Health Organization report, at least 2.2 billion people worldwide have vision impairment or blindness. More alarming still, the number of blind people worldwide grows every year, and the blind have become a group that cannot be neglected globally.
Touch is one of the five human sensory channels and a basic channel through which humans exchange information with the outside world. Important external information such as hardness, temperature, shape, and surface texture can be perceived through the tactile channel; for blind people in particular, it is one of the main ways to perceive external things and compensate for the loss of vision. Replacing vision with touch has therefore long been a research focus, and technologists have applied it in assistive devices for the blind, developing a series of equipment that substitutes touch for vision. The first effective tool of this kind is the walking stick: by tapping the ground with the stick, a blind person obtains information about the surrounding ground such as its material, height, and slope. Later, walking sticks were equipped with laser, sonar, and other obstacle detectors so that the blind can obtain the direction and distance of obstacles. Walking sticks are widely used because they are simple to operate, but research on converting ground information into touch, the most intuitive sensation for the blind, remains rare; in particular, no work on guiding the blind based on active visual guidance has been reported so far.
Deep learning models have a multi-level structure and can automatically extract feature information in an image from low layers to high layers. While learning from data, the model generates feature representations of the image automatically, without hand-crafted features, which is why deep learning is so widely applied. Generative adversarial networks (GANs), a class of generative models in deep learning, are widely used in computer vision for tasks such as image synthesis, text-to-image synthesis, style transfer, image super-resolution, image domain conversion, and image restoration. Despite this great success in computer vision, progress in audio modeling with GANs has been limited, and research on generating haptic signals with GANs is scarce.
Disclosure of Invention
The invention aims to provide a terrain detection method for a blind guiding stick based on active visual guidance that overcomes the defects of the prior art: it detects ground anomalies from extracted image features, generates tactile vibration with a generative adversarial network, converts visual information into tactile information, and provides reliable information for a blind person probing the ground.
The invention provides a terrain detection method of a blind guiding stick based on active visual guidance, which comprises the following steps:
(1) acquiring a ground image by using a camera on the glasses for the blind;
(2) dividing the ground image in the step (1) by adopting a uniform partitioning method to obtain a plurality of image blocks;
(3) performing feature extraction on the plurality of image blocks in step (2), wherein the feature extraction method comprises the following steps:
(3-1) the color histogram $h_{ij,c}$ of each image block is:

$$h_{ij,c} = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\delta\big(f_{mn},\, c\big)$$

wherein $M$ and $N$ respectively represent the length and width of the image block, $f_{mn}$ represents the color value at pixel point $(m, n)$, $c$ represents one color in the image block, the color set contained in each image block is $C$, and $\delta(\cdot,\cdot)$ represents an activation (indicator) function; the color histogram is taken as the image feature of each image block;
(3-2) calculating the average value $\bar{h}_c$ of the color histograms $h_{ij,c}$ of all image blocks:

$$\bar{h}_c = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} h_{ij,c}$$

wherein $I$ and $J$ respectively represent the numbers of row and column positions of the image blocks after the ground image is divided, and the color histogram of the image block at position $(i, j)$ is $h_{ij,c}$;
(3-3) setting a confidence threshold $\sigma$ and judging the state of the ground from the color histogram $h_{ij,c}$ of each image block and the mean histogram $\bar{h}_c$: if $\lVert h_{ij,c} - \bar{h}_c \rVert \le \sigma$ for every block, the ground state is judged to be free of anomalies and the blind person walks on normally; if $\lVert h_{ij,c} - \bar{h}_c \rVert > \sigma$ for some block, the ground state is judged to be abnormal, and step (4) is performed;
(4) the method for the tactile representation of the blind guide stick comprises the following steps:
(4-1) acquiring an acceleration signal of the abnormal ground in step (3-3) through an acceleration sensor of the blind guiding stick, and performing a short-time Fourier transform on the acceleration signal to obtain the spectrogram corresponding to it;
(4-2) training a generative adversarial network (MelGAN) with the LJ Speech dataset, the generative adversarial network consisting of a generator and discriminators, wherein the objective function of the generator is:

$$\min_{G}\ \mathbb{E}_{s,z}\Big[\sum_{k} -D_k\big(G(s,z)\big)\Big] + \lambda \sum_{k} \mathbb{E}_{x,s}\Big[\sum_{i=1}^{T}\frac{1}{N_i}\Big\lVert D_k^{(i)}(x) - D_k^{(i)}\big(G(s,z)\big)\Big\rVert_1\Big]$$

and the objective function of the $k$-th discriminator is:

$$\min_{D_k}\ \mathbb{E}_{x}\Big[\max\big(0,\ 1 - D_k(x)\big)\Big] + \mathbb{E}_{s,z}\Big[\max\big(0,\ 1 + D_k\big(G(s,z)\big)\big)\Big]$$

wherein $x$ represents real audio, acquired from the LJ Speech dataset; $s$ represents the generator's input spectrogram, acquired from the LJ Speech dataset; $z$ denotes the Gaussian noise vector; $k$ denotes the $k$-th discriminator in the generative adversarial network; $\lambda$ is the weight of the feature matching loss; $T$ denotes the number of layers of the discrimination network and $N_i$ the number of unit neurons in the $i$-th discrimination layer, these parameters being set according to the training precision; $D_k^{(i)}$ denotes the feature map output by the $i$-th discrimination layer of the $k$-th discriminator; $G(s,z)$ denotes the audio generated by the generator; $\mathbb{E}_{s,z}$ denotes the mathematical expectation over the generator's input spectrogram and the Gaussian noise vector; $\mathbb{E}_{x,s}$ denotes the mathematical expectation over real audio and the generator's input spectrogram; $D_k(x)$ denotes the probability that the $k$-th discriminator judges the sample to be real audio; and $\mathbb{E}_x$ denotes the mathematical expectation over real audio;
the training process comprises the following steps:
(4-2-1) sampling audio in the LJ Speech data set to obtain a sampling signal, and performing short-time Fourier transform on the sampling signal to obtain a corresponding spectrogram;
(4-2-2) inputting the spectrogram of step (4-2-1) into the generator of the generative adversarial network, and outputting audio;
(4-2-3) taking the audio output by the generator and the original audio in the LJ Speech dataset as the input of the discriminators of the generative adversarial network, which output the discrimination result, namely the probability that the input audio is real audio;
(4-2-4) training the generative adversarial network formed by (4-2-2) and (4-2-3) according to the objective functions of the generator and the discriminator, to obtain the weights of the generative adversarial network;
(4-3) inputting the spectrogram of step (4-1) into the generative adversarial network trained in step (4-2), and outputting the audio corresponding to the acceleration signal of step (4-1);
(4-4) outputting the audio obtained in step (4-3) through a power amplifier to generate tactile vibration; the differing tactile vibrations of different grounds inform the blind person about the ground, realizing the terrain detection of the blind guiding stick.
Compared with the prior art, the terrain detection method of the blind guiding stick based on active visual guidance provided by the invention has the following advantages:
the invention discloses a terrain detection method of a blind guiding stick based on active visual guidance, which utilizes the existing GAN to generate a tactile signal in the field of service equipment for disabled people, but in the prior art, GAN is utilized to simulate time sequence data distribution, a vibration tactile signal is converted into an image, and finally the vibration tactile signal is generated according to a texture image or texture characteristics. The method is indirect, end-to-end processing cannot be achieved, and vibration information is inevitably lost in the middle. The method of the invention uses MelGAN in vibrotactile signal generation, realizes the direct conversion from image to vibration, and is an end-to-end process. The method is used for blind auxiliary equipment, namely a blind guiding stick, active visual guidance is introduced, and the blind can be helped to better sense ground information through vibration and touch. Under the condition that the blind people can sense the ground information, the blind people can be prevented from detecting the ground without purpose, but the ground is detected based on active visual guidance, so that the visual handicapped people can go out conveniently. Compared with the traditional blind guiding device, the blind guiding device does not need to purposefully detect through the blind guiding stick, but carries out ground abnormity detection firstly, achieves targeted detection of ground areas, and is more beneficial to the safety, reliability and convenience of the visually impaired people in going out.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
FIG. 2 shows the terrain detection device of the active-visual-guidance blind guiding stick according to the present invention.
In FIG. 2, 1 is the camera on the glasses for the blind, 2 is an integrated chip, 3 is an earphone, 4 is a vibrating mass, 5 is a power amplifier, and 6 is an acceleration sensor.
Detailed Description
The invention provides a terrain detection method of a blind guiding stick based on active visual guidance, which has a flow chart shown in figure 1 and comprises the following steps:
(1) a camera 1 on the glasses for the blind is used for acquiring a ground image, as shown in figure 2;
(2) processing the ground image with the integrated chip 2 on the glasses for the blind, and dividing the ground image of step (1) by a uniform blocking method to obtain a plurality of image blocks;
In this step the image is segmented with a blocking method. Common blocking methods include the uniform blocking method, superpixel segmentation, and the like. The local blocks produced by superpixel segmentation differ in size, so it cannot be guaranteed that each feature point carries a consistent share of the image information. To avoid introducing additional interference parameters, the uniform blocking method is adopted here to segment the image, as sketched below.
(3) Performing feature extraction on the plurality of image blocks of step (2), wherein the feature extraction method comprises the following steps:
the image information of the area where the image block is located is represented by extracting the features of the image block, and meanwhile, noise interference caused by few feature points on the image is reduced. There are many ways to describe the characteristics of objects, and color features are most widely used in image retrieval. The main reason is that the color tends to be quite correlated with the objects or scenes contained in the image. In addition, compared with other visual features, the color features have smaller dependence on the size, direction and visual angle of the image, so that the robustness is higher. The color feature is an intuitive feature based on the pixel points, and comprises a color histogram, a color set, a color cluster, a color correlation diagram and the like. The most common color feature expression method is a color histogram method, which has the advantages that the normalized color feature expression method is not influenced by image rotation, translation and scale change, and the common color histogram feature matching method comprises a distance method, a histogram cumulative method and the like.
(3-1) the color histogram $h_{ij,c}$ of each image block is:

$$h_{ij,c} = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\delta\big(f_{mn},\, c\big)$$

wherein $M$ and $N$ respectively represent the length and width of the image block, $f_{mn}$ represents the color value at pixel point $(m, n)$, $c$ represents one color in the image block, the color set contained in each image block is $C$, and $\delta(\cdot,\cdot)$ represents an activation (indicator) function; the color histogram is taken as the image feature of each image block;
(3-2) calculating the average value $\bar{h}_c$ of the color histograms $h_{ij,c}$ of all image blocks:

$$\bar{h}_c = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} h_{ij,c}$$

wherein $I$ and $J$ respectively represent the numbers of row and column positions of the image blocks after the ground image is divided, and the color histogram of the image block at position $(i, j)$ is $h_{ij,c}$;
In one embodiment of the present invention, I = J = 20, and the average color histogram $\bar{h}_c$ is computed over the resulting 400 image blocks.
(3-3) setting a confidence threshold $\sigma$ and judging the state of the ground from the color histogram $h_{ij,c}$ of each image block and the mean histogram $\bar{h}_c$: if $\lVert h_{ij,c} - \bar{h}_c \rVert \le \sigma$ for every block, the ground state is judged to be free of anomalies and the blind person walks on normally; if $\lVert h_{ij,c} - \bar{h}_c \rVert > \sigma$ for some block, the ground state is judged to be abnormal and step (4) is performed. The blind person continues to walk normally as long as no anomaly is found; once an anomaly is detected, walking stops and the abnormal position is probed with the stick. A sketch of steps (3-1) to (3-3) follows.
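The following minimal sketch implements steps (3-1) to (3-3) under stated assumptions: the blocks come from the `uniform_blocks` helper above, colors are quantised to 16 bins per channel, and the histogram-to-mean comparison uses an L1 distance; none of these specifics are fixed by the patent.

```python
import numpy as np

def block_histogram(block: np.ndarray, bins: int = 16) -> np.ndarray:
    """Normalised color histogram h_{ij,c} of one image block (uint8 RGB)."""
    M, N = block.shape[:2]
    q = (block // (256 // bins)).reshape(-1, 3)             # quantise each channel
    idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]  # color index c
    h = np.bincount(idx, minlength=bins ** 3).astype(float)
    return h / (M * N)                                      # sums to 1 over c

def abnormal_blocks(blocks: dict, sigma: float = 0.1) -> list:
    """Return grid positions whose histogram deviates from the mean by > sigma."""
    hists = {pos: block_histogram(b) for pos, b in blocks.items()}
    h_mean = np.mean(list(hists.values()), axis=0)          # mean histogram over all (i, j)
    return [pos for pos, h in hists.items()
            if np.abs(h - h_mean).sum() > sigma]            # confidence-threshold test
```

A non-empty result corresponds to the "ground state abnormal" branch that triggers step (4).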
(4) The blind guiding stick performs the tactile rendering; its structure is shown in FIG. 2, in which 1 is the camera on the glasses for the blind, 2 is an integrated chip, 3 is an earphone, 4 is a vibrating mass, 5 is a power amplifier, and 6 is an acceleration sensor. The tactile rendering comprises the following steps:
(4-1) acquiring the acceleration signal of the abnormal ground in step (3-3) through the acceleration sensor of the blind guiding stick and performing a short-time Fourier transform (STFT) on it. The STFT is a Fourier-related transform that determines the frequency and phase of local sections of a signal as they change over time, so the transformed signal is localized in both the time and frequency domains. The acceleration signal serves as the representation of the vibrotactile stimulus; it is obtained from the acceleration sensor mounted on the blind guiding stick, and the STFT yields the spectrogram corresponding to the acceleration signal;
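A short sketch of this transform follows; SciPy's `stft` is used, and the 1 kHz sampling rate and 256-sample window are assumed values, not figures from the patent.

```python
import numpy as np
from scipy.signal import stft

def accel_spectrogram(accel: np.ndarray, fs: int = 1000) -> np.ndarray:
    """Magnitude spectrogram |STFT| of a 1-D acceleration trace.

    The result is localized in both time and frequency, as described above.
    """
    _, _, Z = stft(accel, fs=fs, nperseg=256, noverlap=192)
    return np.abs(Z)  # shape: (frequencies, time frames)
```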
(4-2) training a generative adversarial network (MelGAN) with the LJ Speech dataset, the generative adversarial network consisting of a generator and discriminators, wherein the objective function of the generator is:

$$\min_{G}\ \mathbb{E}_{s,z}\Big[\sum_{k} -D_k\big(G(s,z)\big)\Big] + \lambda \sum_{k} \mathbb{E}_{x,s}\Big[\sum_{i=1}^{T}\frac{1}{N_i}\Big\lVert D_k^{(i)}(x) - D_k^{(i)}\big(G(s,z)\big)\Big\rVert_1\Big]$$

and the objective function of the $k$-th discriminator is:

$$\min_{D_k}\ \mathbb{E}_{x}\Big[\max\big(0,\ 1 - D_k(x)\big)\Big] + \mathbb{E}_{s,z}\Big[\max\big(0,\ 1 + D_k\big(G(s,z)\big)\big)\Big]$$

wherein $x$ represents real audio, acquired from the LJ Speech dataset; $s$ represents the generator's input spectrogram, acquired from the LJ Speech dataset; $z$ denotes the Gaussian noise vector; $k$ denotes the $k$-th discriminator in the generative adversarial network; $\lambda$ is the weight of the feature matching loss; $T$ denotes the number of layers of the discrimination network and $N_i$ the number of unit neurons in the $i$-th discrimination layer, these parameters being set according to the training precision; $D_k^{(i)}$ denotes the feature map output by the $i$-th discrimination layer of the $k$-th discriminator; $G(s,z)$ denotes the audio generated by the generator; $\mathbb{E}_{s,z}$ denotes the mathematical expectation over the generator's input spectrogram and the Gaussian noise vector; $\mathbb{E}_{x,s}$ denotes the mathematical expectation over real audio and the generator's input spectrogram; $D_k(x)$ denotes the probability that the $k$-th discriminator judges the sample to be real audio; and $\mathbb{E}_x$ denotes the mathematical expectation over real audio;
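The two objectives can be written down compactly in PyTorch. The sketch below follows the published MelGAN formulation (hinge adversarial loss plus λ-weighted feature matching) as reconstructed above; `discriminators` is assumed to be a list of k modules, each returning its per-layer feature maps with the score D_k last, and λ = 10 is the value from the MelGAN paper, not a figure from the patent.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminators, x_real, x_fake):
    """Hinge loss summed over the k discriminators."""
    loss = 0.0
    for d in discriminators:
        score_real = d(x_real)[-1]
        score_fake = d(x_fake.detach())[-1]   # detach: no gradient into G here
        loss = loss + F.relu(1.0 - score_real).mean() \
                    + F.relu(1.0 + score_fake).mean()
    return loss

def generator_loss(discriminators, x_real, x_fake, lam=10.0):
    """Adversarial term plus lambda-weighted feature matching over T layers."""
    adv, fm = 0.0, 0.0
    for d in discriminators:
        feats_fake = d(x_fake)
        with torch.no_grad():
            feats_real = d(x_real)            # targets only; no gradient needed
        adv = adv - feats_fake[-1].mean()     # -D_k(G(s, z))
        for f_real, f_fake in zip(feats_real[:-1], feats_fake[:-1]):
            fm = fm + F.l1_loss(f_fake, f_real)  # (1/N_i) * L1 norm, per layer i
    return adv + lam * fm
```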
the generation of the countermeasure network (MelGAN) for training is an autoregressive forward convolution structure. MelGAN enables the generation of audio waveforms in GAN. This is the first to successfully train GANs to generate raw audio without the need for additional perceptual loss functions, while still producing high quality audio generation models. The training process comprises the following steps:
(4-2-1) The LJ Speech dataset used for training is a public-domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books, with one transcription per clip. Clip lengths vary from 1 second to 10 seconds, and the total length is about 24 hours; it is a common audio dataset for training models. Audio in the LJ Speech dataset is sampled to obtain a sampled signal, and a short-time Fourier transform of the sampled signal yields the corresponding spectrogram;
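By way of illustration, one LJ Speech clip can be loaded and turned into a generator input as below. The mel-spectrogram parameters (80 mel bands, hop 256 at 22050 Hz) follow the published MelGAN setup rather than the plain STFT wording above, and the file path assumes the dataset's standard layout.

```python
import librosa

# Load one clip at 22050 Hz and compute the spectrogram fed to the generator.
wav, fs = librosa.load("LJSpeech-1.1/wavs/LJ001-0001.wav", sr=22050)
spec = librosa.feature.melspectrogram(y=wav, sr=fs, n_fft=1024,
                                      hop_length=256, n_mels=80)
print(spec.shape)  # (80, number of frames)
```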
(4-2-2) inputting the spectrogram of step (4-2-1) into the generator of the generative adversarial network, and outputting audio;
The input spectrogram first passes through a convolutional layer and then enters the upsampling stage: two 8× upsampling steps followed by two 2× upsampling steps, each upsampling step feeding a residual module with dilated convolution, and a final convolutional layer produces the audio output. Each residual module consists mainly of 3 dilated convolution blocks, and each dilated convolution block consists of two convolutional layers with different dilation rates and an activation function. Dilated convolution is chosen to strengthen long-range correlation between time steps during audio generation: the receptive field of a stack of dilated convolution layers grows exponentially with the number of layers, which effectively enlarges the receptive field of each output time step. Receptive fields at distant time steps overlap substantially, giving better long-range correlation. A rough sketch of this generator follows.
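The sketch below mirrors that stack: an input convolution, 8×/8×/2×/2× transposed-convolution upsampling with a dilated-convolution residual module after each stage, and an output convolution. Channel widths, kernel sizes, and the dilation schedule (1, 3, 9) are assumptions consistent with the published MelGAN, not specifics from the patent.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual stack of three dilated-convolution units (dilations 1, 3, 9)."""
    def __init__(self, ch):
        super().__init__()
        self.units = nn.ModuleList([
            nn.Sequential(
                nn.LeakyReLU(0.2),
                nn.Conv1d(ch, ch, kernel_size=3, dilation=3 ** i, padding=3 ** i),
                nn.LeakyReLU(0.2),
                nn.Conv1d(ch, ch, kernel_size=1),
            )
            for i in range(3)
        ])

    def forward(self, x):
        for unit in self.units:
            x = x + unit(x)  # skip connection around each dilated unit
        return x

class Generator(nn.Module):
    """Spectrogram (B, spec_ch, frames) -> raw audio (B, 1, frames * 256)."""
    def __init__(self, spec_ch=80, ch=512):
        super().__init__()
        layers = [nn.Conv1d(spec_ch, ch, kernel_size=7, padding=3)]
        for r in (8, 8, 2, 2):  # two 8x stages, then two 2x stages
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r,
                                   stride=r, padding=r // 2),
                ResBlock(ch // 2),
            ]
            ch //= 2
        layers += [nn.LeakyReLU(0.2),
                   nn.Conv1d(ch, 1, kernel_size=7, padding=3),
                   nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, spec):
        return self.net(spec)
```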
(4-2-3) taking the audio output by the generator and the original audio in the LJ Speech dataset as the input of the discriminators of the generative adversarial network, which output the discrimination result, namely the probability that the input audio is real audio;
the discriminator adopts a multi-scale architecture, namely, the original audio is discriminated, the original audio is subjected to frequency reduction processing and then fed into the next discriminator for discrimination, the frequency reduction mode adopts an average pooling method, 2 times of frequency reduction processing are carried out totally, and the discriminator corresponds to 3 scales. The inner module design of the discriminator mainly comprises a convolution layer and a down-sampling layer.
(4-2-4) training the generative adversarial network formed by (4-2-2) and (4-2-3) according to the objective functions of the generator and the discriminator, to obtain the weights of the generative adversarial network;
(4-3) inputting the spectrogram of step (4-1) into the generative adversarial network trained in step (4-2), and outputting the audio corresponding to the acceleration signal of step (4-1);
(4-4) outputting the audio obtained in step (4-3) through the power amplifier to generate tactile vibration; the differing tactile vibrations of different grounds inform the blind person about the ground, realizing the terrain detection of the blind guiding stick.
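As a closing illustration of step (4-4), the generated waveform can be sent to the audio output that drives the power amplifier 5 and vibrating mass 4. The `sounddevice` library and the 22050 Hz rate are assumptions for this sketch; the actual embodiment routes the signal through the stick's own amplifier hardware.

```python
import numpy as np
import sounddevice as sd

def drive_actuator(audio: np.ndarray, fs: int = 22050) -> None:
    """Play the generated waveform; the amplifier turns it into vibration."""
    sd.play(audio.astype(np.float32), samplerate=fs, blocking=True)
```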
Claims (1)
1. A terrain detection method of a blind guiding stick based on active visual guidance is characterized by comprising the following steps:
(1) acquiring a ground image by using a camera on the glasses for the blind;
(2) dividing the ground image in the step (1) by adopting a uniform partitioning method to obtain a plurality of image blocks;
(3) performing feature extraction on the plurality of image blocks in step (2), wherein the feature extraction method comprises the following steps:
(3-1) the color histogram $h_{ij,c}$ of each image block is:

$$h_{ij,c} = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\delta\big(f_{mn},\, c\big)$$

wherein $M$ and $N$ respectively represent the length and width of the image block, $f_{mn}$ represents the color value at pixel point $(m, n)$, $c$ represents one color in the image block, the color set contained in each image block is $C$, and $\delta(\cdot,\cdot)$ represents an activation (indicator) function; the color histogram is taken as the image feature of each image block;
(3-2) calculating the average value $\bar{h}_c$ of the color histograms $h_{ij,c}$ of all image blocks:

$$\bar{h}_c = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} h_{ij,c}$$

wherein $I$ and $J$ respectively represent the numbers of row and column positions of the image blocks after the ground image is divided, and the color histogram of the image block at position $(i, j)$ is $h_{ij,c}$;
(3-3) setting a confidence threshold $\sigma$ and judging the state of the ground from the color histogram $h_{ij,c}$ of each image block and the mean histogram $\bar{h}_c$: if $\lVert h_{ij,c} - \bar{h}_c \rVert \le \sigma$ for every block, the ground state is judged to be free of anomalies and the blind person walks on normally; if $\lVert h_{ij,c} - \bar{h}_c \rVert > \sigma$ for some block, the ground state is judged to be abnormal, and step (4) is performed;
(4) the method for the tactile representation of the blind guide stick comprises the following steps:
(4-1) acquiring an acceleration signal of the abnormal ground in step (3-3) through an acceleration sensor of the blind guiding stick, and performing a short-time Fourier transform on the acceleration signal to obtain the spectrogram corresponding to it;
(4-2) training a generative adversarial network (MelGAN) with the LJ Speech dataset, the generative adversarial network consisting of a generator and discriminators, wherein the objective function of the generator is:

$$\min_{G}\ \mathbb{E}_{s,z}\Big[\sum_{k} -D_k\big(G(s,z)\big)\Big] + \lambda \sum_{k} \mathbb{E}_{x,s}\Big[\sum_{i=1}^{T}\frac{1}{N_i}\Big\lVert D_k^{(i)}(x) - D_k^{(i)}\big(G(s,z)\big)\Big\rVert_1\Big]$$

and the objective function of the $k$-th discriminator is:

$$\min_{D_k}\ \mathbb{E}_{x}\Big[\max\big(0,\ 1 - D_k(x)\big)\Big] + \mathbb{E}_{s,z}\Big[\max\big(0,\ 1 + D_k\big(G(s,z)\big)\big)\Big]$$

wherein $x$ represents real audio, acquired from the LJ Speech dataset; $s$ represents the generator's input spectrogram, acquired from the LJ Speech dataset; $z$ denotes the Gaussian noise vector; $k$ denotes the $k$-th discriminator in the generative adversarial network; $\lambda$ is the weight of the feature matching loss; $T$ denotes the number of layers of the discrimination network and $N_i$ the number of unit neurons in the $i$-th discrimination layer, these parameters being set according to the training precision; $D_k^{(i)}$ denotes the feature map output by the $i$-th discrimination layer of the $k$-th discriminator; $G(s,z)$ denotes the audio generated by the generator; $\mathbb{E}_{s,z}$ denotes the mathematical expectation over the generator's input spectrogram and the Gaussian noise vector; $\mathbb{E}_{x,s}$ denotes the mathematical expectation over real audio and the generator's input spectrogram; $D_k(x)$ denotes the probability that the $k$-th discriminator judges the sample to be real audio; and $\mathbb{E}_x$ denotes the mathematical expectation over real audio;
the training process comprises the following steps:
(4-2-1) sampling audio in the LJ Speech data set to obtain a sampling signal, and performing short-time Fourier transform on the sampling signal to obtain a corresponding spectrogram;
(4-2-2) inputting the spectrogram of step (4-2-1) into the generator of the generative adversarial network, and outputting audio;
(4-2-3) taking the audio output by the generator and the original audio in the LJ Speech dataset as the input of the discriminators of the generative adversarial network, which output the discrimination result, namely the probability that the input audio is real audio;
(4-2-4) training the generative adversarial network formed by (4-2-2) and (4-2-3) according to the objective functions of the generator and the discriminator, to obtain the weights of the generative adversarial network;
(4-3) inputting the spectrogram of step (4-1) into the generative adversarial network trained in step (4-2), and outputting the audio corresponding to the acceleration signal of step (4-1);
(4-4) outputting the audio obtained in step (4-3) through a power amplifier to generate tactile vibration; the differing tactile vibrations of different grounds inform the blind person about the ground, realizing the terrain detection of the blind guiding stick.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911355769.4A CN111084711B (en) | 2019-12-25 | 2019-12-25 | Terrain detection method of blind guiding stick based on active visual guidance |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911355769.4A CN111084711B (en) | 2019-12-25 | 2019-12-25 | Terrain detection method of blind guiding stick based on active visual guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111084711A CN111084711A (en) | 2020-05-01 |
CN111084711B (en) | 2020-12-11
Family
ID=70397122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911355769.4A Active CN111084711B (en) | 2019-12-25 | 2019-12-25 | Terrain detection method of blind guiding stick based on active visual guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111084711B (en) |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5973618A (en) * | 1996-09-25 | 1999-10-26 | Ellis; Christ G. | Intelligent walking stick |
CN103645480B (en) * | 2013-12-04 | 2015-11-18 | 北京理工大学 | Based on the topography and landform character construction method of laser radar and fusing image data |
CN103839238B (en) * | 2014-02-28 | 2017-02-15 | 西安电子科技大学 | SAR image super-resolution method based on marginal information and deconvolution |
GB2563198B (en) * | 2017-03-15 | 2021-05-26 | Jaguar Land Rover Ltd | A system for identifying water ahead of a vehicle |
CN107146221B (en) * | 2017-04-18 | 2020-04-21 | 重庆金山医疗器械有限公司 | Method for positioning main terrain boundary in WCE color video based on color texture descriptor of visual perception |
CN108960287B (en) * | 2018-05-29 | 2022-08-05 | 杭州视氪科技有限公司 | Blind person auxiliary glasses capable of realizing terrain and target detection |
CN110147780B (en) * | 2019-05-28 | 2021-01-01 | 山东大学 | Real-time field robot terrain identification method and system based on hierarchical terrain |
- 2019-12-25: CN application CN201911355769.4A granted as patent CN111084711B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN111084711A (en) | 2020-05-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |