CN114120960A - Auxiliary space perception system and method based on hearing - Google Patents

Auxiliary space perception system and method based on hearing

Info

Publication number
CN114120960A
CN114120960A (application CN202111373446.5A)
Authority
CN
China
Prior art keywords
audio
module
information
image
spatial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111373446.5A
Other languages
Chinese (zh)
Other versions
CN114120960B (en)
Inventor
费腾
李阳春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202111373446.5A priority Critical patent/CN114120960B/en
Publication of CN114120960A publication Critical patent/CN114120960A/en
Application granted granted Critical
Publication of CN114120960B publication Critical patent/CN114120960B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/162Interface to dedicated audio devices, e.g. audio drivers, interface to CODECs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G08SIGNALLING
    • G08BSIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B21/00Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B21/18Status alarms
    • G08B21/24Reminder alarms, e.g. anti-loss alarms
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Emergency Management (AREA)
  • Artificial Intelligence (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to an auditory-based auxiliary spatial perception system comprising a data acquisition module, a human-computer interaction module, a control module, a walking mode module and a gaze mode module. The system offers two working modes, a walking mode and a gaze mode, corresponding to two hearing-based auxiliary spatial perception methods. The walking-mode method provides a larger amount of spatial information, effectively solving the problem that existing sensory substitution devices cannot meet the needs of the visually impaired because they convey too little spatial information; the gaze-mode method provides more refined and focused spatial descriptive information, effectively solving the problem that existing sensory substitution devices are of limited practical use because they provide redundant information. Combined, the two methods meet the different usage needs of visually impaired users in different scenarios.

Description

Auxiliary space perception system and method based on hearing
Technical Field
The invention belongs to the technical field of sensory substitution, and particularly relates to an auditory-based auxiliary spatial perception system and method.
Background
Data published by the World Health Organization show that about 253 million people worldwide live with vision impairment. Lacking sight, visually impaired people face many difficulties when travelling in daily life. With the development of society, the quality of life and mobility of visually impaired people are receiving more and more attention. Helping visually impaired people perceive space, improving their ability to travel and making their daily lives easier is therefore a problem that urgently needs to be solved.
At present, the traditional aids for blind travel are mainly the white cane and the guide dog. Both have shortcomings: the detection range of a white cane is limited, while guide dogs are expensive to train and limited in availability. Moreover, these traditional aids only help visually impaired people avoid obstacles along a route; they cannot give them an understanding of the spatial structure and scene information of the surrounding environment.
With the development of computer science and sensor technology, sensory substitution devices for assisting spatial perception in visually impaired people are being studied. Because hearing is intuitive and offers many usable parameters, most sensory substitution devices attempt to use auditory signals in place of visual signals to convey scene information to the user. However, these studies tend to suffer from one of two problems: they provide too little spatial information to satisfy the needs of visually impaired people, or they provide so much redundant information that they are of limited practical use.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an auditory-based auxiliary spatial perception system and method that use hearing in place of vision to convert spatial scene information into non-speech spatial audio signals or speech description signals. This not only helps visually impaired users effectively avoid obstacles while walking and improves their spatial awareness, but also effectively conveys environmental information and improves their scene understanding.
An auditory-based auxiliary spatial perception system comprises a data acquisition module, a human-computer interaction module, a control module, a walking mode module and a gaze mode module; the human-computer interaction module is connected with the control module, and the control module and the data acquisition module are connected with the walking mode module and the gaze mode module.
The data acquisition module is used for acquiring spatial scene information, including a depth image data stream and an RGB image data stream.
The human-computer interaction module is used for transmitting user instructions, sending walking mode instructions and gaze mode instructions to the control module.
According to a walking mode instruction from the human-computer interaction module, the control module processes the depth image data acquired by the data acquisition module, maps the depth image information to spatial audio and outputs it through an earphone; according to a gaze mode instruction from the human-computer interaction module, the control module constructs semantic information from the RGB image data acquired by the data acquisition module, synthesizes the corresponding speech and outputs it through the earphone.
The walking mode module is used for detecting and outputting the direction and distance of objects in a spatial scene, and comprises: a depth image preprocessing submodule for processing the depth image data stream; a spatial audio generation submodule for mapping depth image information to audio parameters and spatializing the audio; and a spatial audio output submodule for outputting the spatial audio.
The gaze mode module is used for identifying and outputting the attribute and state information of objects in a spatial scene, and comprises: an RGB image semantic construction submodule for converting the RGB image data stream into sentences; a speech synthesis submodule for converting the sentences into speech; and a speech output submodule for outputting the speech signal.
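As an illustration only (not the patented implementation), the following minimal sketch shows how such a control module could dispatch frames from the data acquisition module to the walking mode module or the gaze mode module according to the instruction received from the human-computer interaction module; all class and method names here are hypothetical.

```python
# Illustrative sketch of mode dispatch in the control module. The class and
# method names are hypothetical placeholders, not part of the invention.
from enum import Enum, auto

class Mode(Enum):
    WALKING = auto()
    GAZE = auto()

class ControlModule:
    def __init__(self, data_acquisition, walking_module, gaze_module):
        self.data_acquisition = data_acquisition   # provides depth + RGB frames
        self.walking_module = walking_module       # depth image -> spatial audio
        self.gaze_module = gaze_module             # RGB image -> descriptive speech
        self.mode = Mode.WALKING

    def on_user_instruction(self, instruction: str) -> None:
        # The human-computer interaction module forwards "walking" or "gaze".
        self.mode = Mode.WALKING if instruction == "walking" else Mode.GAZE

    def step(self) -> None:
        if self.mode is Mode.WALKING:
            depth = self.data_acquisition.read_depth_frame()
            self.walking_module.process(depth)     # outputs spatial audio to earphone
        else:
            rgb = self.data_acquisition.read_rgb_frame()
            self.gaze_module.process(rgb)          # outputs synthesized speech to earphone
```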
A walking mode auxiliary space perception method realized by the hearing-based auxiliary space perception system comprises the following steps:
step 101, receiving the depth image data stream transmitted by the data acquisition module in real time, and filling the null values in each frame of the depth image, namely assigning to each null pixel the average of its non-null eight-neighborhood pixel values, traversing the image and repeating the operation until no null values remain in the image;
step 102, performing Gaussian low-pass filtering on the depth image processed in step 101 to suppress noise and blur fine details in the image;
step 103, down-sampling the depth image obtained in step 102, which has an original size of 225 × 315, to a 5 × 7 image;
step 104, mapping the pixel values of the depth image down-sampled in step 103 to audio parameters;
step 105, spatializing the audio information generated in step 104 using head-related transfer function techniques, namely mapping the coordinates (x, y) of each pixel in the image to a sound source position (θ, φ);
step 106, outputting the spatial audio in real time to an earphone worn by the user, the audio conveying spatial structure information as non-speech sound.
In step 103, the down-sampling rule is: the minimum pixel value in the eight-neighborhood of the pixel to be computed is assigned to that pixel.
Moreover, the rule for converting image information into audio in step 104 is as follows: the minimum pixel of each column in the image is extracted, a threshold D is set, and the minimum pixel is compared with D. When the minimum pixel value is less than or equal to D, the object at that position is close to the user, and the pixel value is mapped to the loudness and pitch of a beep; the smaller the pixel value (the closer the object), the greater the loudness and the higher the pitch of the beep, and vice versa, thereby prompting the user to avoid the obstacle. When the minimum pixel value is greater than D, the object at that position is farther from the user and poses no immediate collision threat, and the pixel information is represented by a water-drop sound of fixed loudness and pitch, which can be regarded as a "safety sound" indicating that objects are far away.
Furthermore, after the processing of step 105, the beep or water-drop sound is spatialized and the user perceives the direction from which it comes; the mapping rule is as follows:
θ = -120° + 30° × y    (1)
[Formula (2), which maps the row number x to the elevation angle φ, is shown as an image in the original document.]
where x is the row number of the pixel in the image and y is the column number of the pixel in the image; a three-dimensional coordinate system is constructed with the center of the skull as the origin O, the x-axis passing through the ears, the y-axis passing through the nose, and the z-axis perpendicular to the xOy plane; θ is the horizontal angle between the line connecting the sound source position with the origin O and the yOz plane, and φ is the elevation angle between that line and the xOy plane.
Moreover, in step 106, for each frame of the depth image, 7 audio clips with different horizontal angles are generated and played sequentially from left to right; when an audio clip is a water-drop sound, the object at that position is far from the user; when it is a beep, the object at that position is less than D meters from the user, and the higher the pitch and the greater the loudness of the beep, the closer the object; the user can judge the distance and direction of obstacles from the timbre, pitch, loudness and sound-source position of the audio, and thereby avoid obstacles while walking.
A gaze mode auxiliary space perception method realized by the hearing-based auxiliary space perception system comprises the following steps:
step 201, receiving the RGB image transmitted by the data acquisition module in real time, and generating an English descriptive text of the image by calling the Computer Vision API service provided by Microsoft;
step 202, converting the English text generated in step 201 into Chinese text through the Baidu translation API service;
step 203, converting the Chinese text generated in step 202 into speech based on the pyttsx module in Python;
step 204, outputting the speech in real time to an earphone worn by the user, the speech being a descriptive sentence about the scene currently in view.
Compared with the prior art, the invention has the following advantages:
1) The system provided by the invention has two working modes, a walking mode and a gaze mode, corresponding to two hearing-based auxiliary spatial perception methods; combined, the two methods meet the different usage needs of visually impaired users in different scenarios.
2) The walking-mode assisted spatial perception method uses sonification to convert spatial scene information into spatial audio signals in real time, so that visually impaired users can quickly acquire spatial structure information; it effectively helps them avoid obstacles while walking and improves their spatial awareness. This method provides a larger amount of spatial information, effectively solving the problem that existing sensory substitution devices cannot meet the needs of the visually impaired because they convey too little spatial information.
3) The gaze-mode assisted spatial perception method converts spatial scene information into speech and reads it out in real time, so that visually impaired users can quickly acquire descriptive information about the spatial scene; it effectively helps them obtain environmental information and improves their scene understanding. This method provides more refined and focused spatial descriptive information, effectively solving the problem that existing sensory substitution devices are of limited practical use because they provide redundant information.
Drawings
Fig. 1 is a schematic structural diagram of an auxiliary space perception system based on hearing according to an embodiment of the invention.
FIG. 2 is a flowchart illustrating a walking mode assisted spatial awareness method according to the present invention.
FIG. 3 is a schematic diagram of the horizontal angle θ and the elevation angle φ of the sound source position used in the walking mode assisted spatial perception method of the present invention.
FIG. 4 is a flow chart of a gaze-mode assisted spatial awareness method of the present invention.
Detailed Description
The invention provides an auditory-based auxiliary spatial perception system and method that use hearing in place of vision to convert spatial scene information into non-speech spatial audio signals or speech description signals, thereby assisting the user in perceiving space and understanding the scene.
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
As shown in Fig. 1, the present invention provides an auditory-based auxiliary spatial perception system comprising a data acquisition module, a human-computer interaction module, a control module, a walking mode module and a gaze mode module; the human-computer interaction module is connected with the control module, and the control module and the data acquisition module are connected with the walking mode module and the gaze mode module.
The data acquisition module is used for acquiring spatial scene information, including a depth image data stream and an RGB image data stream.
The human-computer interaction module is used for transmitting user instructions, sending walking mode instructions and gaze mode instructions to the control module.
According to a walking mode instruction from the human-computer interaction module, the control module processes the depth image data acquired by the data acquisition module, maps the depth image information to audio parameters and outputs spatial audio through an earphone; according to a gaze mode instruction from the human-computer interaction module, the control module constructs semantic information from the RGB image data acquired by the data acquisition module, synthesizes the corresponding speech and outputs it through the earphone.
The walking mode module is used for detecting and outputting the direction and distance of objects in a spatial scene, and comprises: a depth image preprocessing submodule for processing the depth image data stream; a spatial audio generation submodule for mapping depth image information to audio parameters and spatializing the audio; and a spatial audio output submodule for outputting the spatial audio.
The gaze mode module is used for identifying and outputting the attribute and state information of objects in a spatial scene, and comprises: an RGB image semantic construction submodule for converting the RGB image data stream into sentences; a speech synthesis submodule for converting the sentences into speech; and a speech output submodule for outputting the speech signal.
By providing the human-computer interaction module, the user can freely select a mode and obtain different kinds of information about the surrounding spatial scene through different kinds of sound output; by providing the walking mode module, the user can quickly perceive the spatial structure of the surrounding environment and grasp the direction and distance of obstacles, so as to avoid them effectively and stay safe while walking; by providing the gaze mode module, the user can quickly obtain descriptive information about the surrounding environment, which helps the user understand the spatial scene.
The system has two working modes, a walking mode and a gaze mode, corresponding to two hearing-based auxiliary spatial perception methods: a walking-mode assisted spatial perception method and a gaze-mode assisted spatial perception method. The walking-mode method uses sonification to convert spatial scene information into spatial audio signals in real time, so that visually impaired users can quickly acquire spatial structure information, effectively avoid obstacles while walking and improve their spatial awareness. The gaze-mode method converts spatial scene information into speech and reads it out in real time, so that visually impaired users can quickly acquire descriptive information about the spatial scene, effectively obtain environmental information and improve their scene understanding.
As shown in fig. 2, the walking mode assisted spatial perception method includes the following steps:
Step 101, receiving the depth image data stream transmitted by the data acquisition module in real time, and filling the null values in each frame of the depth image, namely assigning to each null pixel the average of its non-null eight-neighborhood pixel values, traversing the image and repeating the operation until no null values remain in the image.
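A minimal sketch of this null-filling rule is given below, assuming the depth frame is held in a NumPy array and that null depth values are encoded as 0; the array layout and the no-progress safeguard are assumptions, while the eight-neighborhood averaging follows the step as described.

```python
# Sketch of the null-filling rule in step 101, assuming null depth values are
# encoded as 0 in a NumPy array; the eight-neighborhood mean fill is repeated
# until no null pixels remain.
import numpy as np

def fill_nulls(depth: np.ndarray) -> np.ndarray:
    depth = depth.astype(float).copy()
    while np.any(depth == 0):
        filled = depth.copy()
        for i, j in zip(*np.nonzero(depth == 0)):
            # Collect the non-null pixels among the eight neighbours.
            block = depth[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
            neighbours = block[block > 0]
            if neighbours.size:
                filled[i, j] = neighbours.mean()
        if np.array_equal(filled, depth):
            break  # no progress (e.g. a fully null frame); avoid an infinite loop
        depth = filled
    return depth
```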
Step 102, performing Gaussian low-pass filtering on the depth image processed in step 101 to suppress noise and blur fine details in the image.
Step 103, down-sampling the depth image obtained in step 102, which has an original size of 225 × 315, to a 5 × 7 image; the down-sampling rule is: the minimum pixel value in the eight-neighborhood of the pixel to be computed is assigned to that pixel.
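Steps 102 and 103 could be sketched as follows, assuming the minimum-value down-sampling rule can be realized as block-wise min pooling (a 225 × 315 image divides evenly into 5 × 7 blocks of 45 × 45 pixels); the Gaussian kernel size and the use of OpenCV are illustrative choices, not specified by the invention.

```python
# Sketch of steps 102-103: Gaussian smoothing followed by block-wise min
# pooling that keeps the nearest (smallest) depth in each 45 x 45 block.
import cv2
import numpy as np

def preprocess(depth: np.ndarray) -> np.ndarray:
    # Step 102: Gaussian low-pass filtering to suppress noise (kernel size assumed).
    smoothed = cv2.GaussianBlur(depth.astype(np.float32), (5, 5), 0)
    # Step 103: down-sample 225 x 315 -> 5 x 7, keeping the minimum of each block.
    blocks = smoothed.reshape(5, 45, 7, 45)
    return blocks.min(axis=(1, 3))
```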
Step 104, mapping the pixel values of the depth image down-sampled in step 103 to audio parameters.
The rule for converting image information into audio is as follows: the minimum pixel of each column in the image is extracted, a threshold D is set (D = 3 meters in this embodiment), and the minimum pixel is compared with D. When the minimum pixel value is less than or equal to D, the object at that position is close to the user, and the pixel value is mapped to the loudness and pitch of a beep; the smaller the pixel value (the closer the object), the greater the loudness and the higher the pitch of the beep, and vice versa, thereby prompting the user to avoid the obstacle. When the minimum pixel value is greater than D, the object at that position is farther from the user and poses no immediate collision threat, and the pixel information is represented by a water-drop sound of fixed loudness and pitch, which can be regarded as a "safety sound" indicating that objects are far away.
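A sketch of this column-to-audio-parameter mapping is shown below; the concrete loudness and pitch ranges are illustrative assumptions, since the invention only states the qualitative rule (a closer object gives a louder, higher-pitched beep, while distances beyond D give a fixed water-drop "safety sound").

```python
# Sketch of the column-to-audio mapping in step 104. The numeric loudness and
# frequency ranges are assumptions used only for illustration.
import numpy as np

D = 3.0  # distance threshold in metres (value used in this embodiment)

def column_audio_params(depth_5x7: np.ndarray):
    params = []
    for col in range(depth_5x7.shape[1]):
        d = float(depth_5x7[:, col].min())      # nearest object in this column
        row = int(depth_5x7[:, col].argmin())   # its row drives the elevation angle
        if d <= D:
            closeness = 1.0 - d / D             # 0 at distance D, 1 when touching
            params.append({
                "sound": "beep",
                "loudness": 0.2 + 0.8 * closeness,     # assumed gain range
                "pitch_hz": 400.0 + 800.0 * closeness,  # assumed pitch range
                "row": row, "col": col,
            })
        else:
            # Fixed-loudness, fixed-pitch "safety sound" for distant objects.
            params.append({"sound": "water_drop", "loudness": 0.3,
                           "pitch_hz": 600.0, "row": row, "col": col})
    return params
```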
Step 105, spatializing the audio information generated in step 104 using head-related transfer function (HRTF) techniques, namely mapping the coordinates (x, y) of each pixel in the image to a sound source position (θ, φ). After spatialization, the beep or water-drop sound is perceived as spatial, and the user senses the direction from which it comes.
The specific mapping rule is shown as the following formula:
θ = -120° + 30° × y    (1)
[Formula (2), which maps the row number x to the elevation angle φ, is shown as an image in the original document.]
where x is the row number of the pixel in the image and y is the column number of the pixel in the image. As shown in Fig. 3, a three-dimensional coordinate system is constructed with the center of the skull as the origin O, the x-axis passing through the ears, the y-axis passing through the nose, and the z-axis perpendicular to the xOy plane. θ is the horizontal angle between the line connecting the sound source position with the origin O and the yOz plane, and φ is the elevation angle between that line and the xOy plane.
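The angle mapping and a simplified rendering could be sketched as follows. The horizontal angle follows formula (1) exactly; because formula (2) is reproduced only as an image in the source, the elevation mapping below is an assumed linear analogue, and the constant-power stereo pan is merely a stand-in used to illustrate direction-dependent output, not the HRTF spatialization actually used by the invention.

```python
# Sketch of step 105. theta follows formula (1); phi is an ASSUMED linear
# mapping (rows 1..5 spread over +60..-60 degrees), not formula (2). The pan
# below is a crude interaural-level-difference approximation, not an HRTF.
import numpy as np

def pixel_to_angles(x: int, y: int):
    theta = -120.0 + 30.0 * y          # formula (1): horizontal angle, degrees
    phi = 60.0 - 30.0 * (x - 1)        # assumed elevation mapping (not formula (2))
    return theta, phi

def pan_stereo(mono: np.ndarray, theta_deg: float) -> np.ndarray:
    # Constant-power pan: theta = -90 degrees -> left ear, +90 degrees -> right ear.
    pan = np.clip(theta_deg / 90.0, -1.0, 1.0)
    left = np.cos((pan + 1.0) * np.pi / 4.0)
    right = np.sin((pan + 1.0) * np.pi / 4.0)
    return np.stack([mono * left, mono * right], axis=1)
```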
Step 106, outputting the spatial audio in real time to an earphone worn by the user; the audio conveys spatial structure information as non-speech sound.
For each frame of the depth image, 7 audio clips with different horizontal angles are generated and played sequentially from left to right. When an audio clip is a water-drop sound, the object at that position is far from the user; when it is a beep, the object at that position is less than D meters from the user, and the higher the pitch and the greater the loudness of the beep, the closer the object. The user can judge the distance and direction of obstacles from the timbre, pitch, loudness and sound-source position of the audio, and thereby avoid obstacles while walking.
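Putting the pieces together, one frame of the walking mode could be rendered as in the sketch below, which reuses the hypothetical helpers from the previous sketches (column_audio_params, pixel_to_angles, pan_stereo) and plays the 7 clips from left to right; the sounddevice package and the clip length are illustrative choices, and both sound types are synthesized here as plain tones rather than the beep and water-drop samples described above.

```python
# End-to-end sketch for one depth frame in walking mode, reusing the helper
# functions sketched above. Playback via the third-party `sounddevice` package
# is purely illustrative; any audio backend could be substituted.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 44100
CLIP_SECONDS = 0.15  # assumed clip length

def play_frame(depth_5x7: np.ndarray) -> None:
    t = np.linspace(0.0, CLIP_SECONDS, int(SAMPLE_RATE * CLIP_SECONDS), endpoint=False)
    for p in column_audio_params(depth_5x7):           # columns left to right
        tone = p["loudness"] * np.sin(2.0 * np.pi * p["pitch_hz"] * t)
        theta, _phi = pixel_to_angles(p["row"] + 1, p["col"] + 1)
        stereo = pan_stereo(tone.astype(np.float32), theta)
        sd.play(stereo, SAMPLE_RATE)
        sd.wait()                                      # play the 7 clips sequentially
```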
As shown in Fig. 4, the gaze-mode assisted spatial perception method comprises the following steps:
Step 201, receiving the RGB image transmitted by the data acquisition module in real time, and generating an English descriptive text of the image by calling the Computer Vision API service provided by Microsoft.
Step 202, converting the English text generated in step 201 into Chinese text through the Baidu translation API service.
Step 203, converting the Chinese text generated in step 202 into speech based on the pyttsx module in Python.
Step 204, outputting the speech in real time to an earphone worn by the user; the speech is a descriptive sentence about the scene currently in view.
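A hedged sketch of the gaze-mode pipeline (steps 201 to 204) is given below. The endpoint path, request fields and response shapes assumed for the Microsoft Computer Vision "describe" operation and the Baidu translation API are taken from their public documentation and may differ from the versions used in this embodiment; the keys and URLs are placeholders, and pyttsx3 is used as the Python 3 successor of the pyttsx module.

```python
# Sketch of steps 201-204. API endpoints, fields and response shapes are
# assumptions based on public documentation; keys and URLs are placeholders.
import hashlib
import random
import requests
import pyttsx3

AZURE_ENDPOINT = "https://<your-region>.api.cognitive.microsoft.com"  # placeholder
AZURE_KEY = "<azure-key>"
BAIDU_APPID, BAIDU_KEY = "<baidu-appid>", "<baidu-key>"

def describe_image(image_bytes: bytes) -> str:
    # Step 201: English caption from the Computer Vision "describe" operation.
    r = requests.post(
        f"{AZURE_ENDPOINT}/vision/v3.2/describe",
        headers={"Ocp-Apim-Subscription-Key": AZURE_KEY,
                 "Content-Type": "application/octet-stream"},
        data=image_bytes)
    r.raise_for_status()
    return r.json()["description"]["captions"][0]["text"]

def translate_to_chinese(text: str) -> str:
    # Step 202: English -> Chinese via the Baidu general translation API.
    salt = str(random.randint(1, 10 ** 9))
    sign = hashlib.md5((BAIDU_APPID + text + salt + BAIDU_KEY).encode()).hexdigest()
    r = requests.get("https://fanyi-api.baidu.com/api/trans/vip/translate",
                     params={"q": text, "from": "en", "to": "zh",
                             "appid": BAIDU_APPID, "salt": salt, "sign": sign})
    r.raise_for_status()
    return r.json()["trans_result"][0]["dst"]

def speak(text: str) -> None:
    # Steps 203-204: synthesize the Chinese sentence and play it.
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```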
The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute alternatives, without departing from the spirit or scope of the invention as defined in the appended claims.

Claims (7)

1. An auxiliary space perception system based on hearing, characterized by comprising a data acquisition module, a human-computer interaction module, a control module, a walking mode module and a gaze mode module, wherein the human-computer interaction module is connected with the control module;
the data acquisition module is used for acquiring spatial scene information, including a depth image data stream and an RGB image data stream;
the human-computer interaction module is used for transmitting user instructions, sending walking mode instructions and gaze mode instructions to the control module;
the control module processes the depth image data acquired by the data acquisition module according to a walking mode instruction from the human-computer interaction module, maps the depth image information to audio parameters and outputs spatial audio through an earphone; the control module constructs semantic information from the RGB image data acquired by the data acquisition module according to a gaze mode instruction from the human-computer interaction module, and synthesizes and outputs the corresponding speech through the earphone;
the walking mode module is used for detecting and outputting object azimuth and distance information in a space scene, and comprises: the depth image preprocessing submodule is used for processing a depth image data stream; the spatial audio generation submodule is used for mapping the depth image information to spatial audio and carrying out spatialization processing on the audio; a spatial audio output sub-module for outputting spatial audio;
the gaze mode module is used for identifying and outputting the attribute and state information of objects in a spatial scene, and comprises: an RGB image semantic construction submodule for converting the RGB image data into sentences; a speech synthesis submodule for converting the sentences into speech; and a speech output submodule for outputting the speech signal.
2. A walking mode aided spatial perception method implemented using the hearing based aided spatial perception system of claim 1, comprising the steps of:
step 101, receiving the depth image data stream transmitted by the data acquisition module in real time, and filling the null values in each frame of the depth image, namely assigning to each null pixel the average of its non-null eight-neighborhood pixel values, traversing the image and repeating the operation until no null values remain in the image;
step 102, performing Gaussian low-pass filtering on the depth image processed in step 101 to suppress noise and blur fine details in the image;
step 103, down-sampling the depth image obtained in step 102, which has an original size of 225 × 315, to a 5 × 7 image;
step 104, mapping the pixel values of the depth image down-sampled in step 103 to audio parameters;
step 105, spatializing the audio information generated in step 104 using head-related transfer function techniques, namely mapping the coordinates (x, y) of each pixel in the image to a sound source position (θ, φ);
step 106, outputting the spatial audio in real time to an earphone worn by the user, the audio conveying spatial structure information as non-speech sound.
3. A walking mode aided spatial perception method as claimed in claim 2, characterized in that: the down-sampling rule in step 103 is: the minimum pixel value in the eight-neighborhood of the pixel to be computed is assigned to that pixel.
4. A walking mode aided spatial perception method as claimed in claim 2, characterized in that: the rule for converting image information into audio in step 104 is as follows: the minimum pixel of each column in the image is extracted, a threshold D is set, and the minimum pixel is compared with D; when the minimum pixel value is less than or equal to D, the object at that position is close to the user, and the pixel value is mapped to the loudness and pitch of a beep, where a smaller pixel value (a closer object) yields a greater loudness and a higher pitch, and vice versa, thereby prompting the user to avoid the obstacle; when the minimum pixel value is greater than D, the object at that position is farther from the user and poses no immediate collision threat, and the pixel information is represented by a water-drop sound of fixed loudness and pitch, which can be regarded as a "safety sound" indicating that objects are far away.
5. A walking mode aided spatial perception method as claimed in claim 2, characterized in that: after the spatialization in step 105, the beep or water-drop sound is perceived as coming from a particular direction, and the mapping rule is as follows:
θ = -120° + 30° × y    (1)
[Formula (2), which maps the row number x to the elevation angle φ, is shown as an image in the original document.]
where x is the row number of the pixel in the image and y is the column number of the pixel in the image; a three-dimensional coordinate system is constructed with the center of the skull as the origin O, the x-axis passing through the ears, the y-axis passing through the nose, and the z-axis perpendicular to the xOy plane; θ is the horizontal angle between the line connecting the sound source position with the origin and the yOz plane, and φ is the elevation angle between that line and the xOy plane.
6. A walking mode aided spatial perception method as claimed in claim 2, characterized in that: in step 106, for each frame of the depth image, 7 audio clips with different horizontal angles are generated and played sequentially from left to right; when an audio clip is a water-drop sound, the object at that position is far from the user; when it is a beep, the object at that position is less than D meters from the user, and the higher the pitch and the greater the loudness of the beep, the closer the object; the user can judge the distance and direction of obstacles from the timbre, pitch, loudness and sound-source position of the audio, and thereby avoid obstacles while walking.
7. A gaze-mode aided spatial perception method implemented using the hearing-based aided spatial perception system of claim 1, comprising the steps of:
step 201, receiving an RGB image transmitted by a data acquisition module in real time, and generating an English descriptive text of the image by calling a Computer Vision API service provided by Microsoft;
step 202, converting the English text generated in the step 201 into a Chinese text through Baidu translation API service;
step 203, based on a pyttsx module in python software, converting the Chinese text generated in the step 202 into voice;
and step 204, outputting voice to an earphone worn by the user in real time, wherein the voice is a descriptive sentence of the current view scene information.
CN202111373446.5A 2021-11-19 2021-11-19 Auxiliary space sensing system and method based on hearing Active CN114120960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111373446.5A CN114120960B (en) 2021-11-19 2021-11-19 Auxiliary space sensing system and method based on hearing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111373446.5A CN114120960B (en) 2021-11-19 2021-11-19 Auxiliary space sensing system and method based on hearing

Publications (2)

Publication Number Publication Date
CN114120960A true CN114120960A (en) 2022-03-01
CN114120960B CN114120960B (en) 2024-05-03

Family

ID=80396465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111373446.5A Active CN114120960B (en) 2021-11-19 2021-11-19 Auxiliary space sensing system and method based on hearing

Country Status (1)

Country Link
CN (1) CN114120960B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20060133140A (en) * 2005-06-20 2006-12-26 경북대학교 산학협력단 Reproduction of visual image using auditory stimulation and method controlling of the same
US20180078444A1 (en) * 2016-09-17 2018-03-22 Noah Eitan Gamerman Non-visual precision spatial awareness device.
CN109085926A (en) * 2018-08-21 2018-12-25 华东师范大学 A kind of the augmented reality system and its application of multi-modality imaging and more perception blendings
CN113038322A (en) * 2021-03-04 2021-06-25 聆感智能科技(深圳)有限公司 Method and device for enhancing environmental perception by hearing
CN113196390A (en) * 2021-03-09 2021-07-30 曹庆恒 Perception system based on hearing and use method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐洁; 方志刚; 鲍福良; 张丽红: "AudioMan: Design and Implementation of an Electronic Travel Aid System" (AudioMan:电子行走辅助系统的设计与实现), Journal of Image and Graphics (中国图象图形学报), no. 07, 15 July 2007 (2007-07-15) *

Also Published As

Publication number Publication date
CN114120960B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
WO2019196133A1 (en) Head-mounted visual aid device
CN206214373U (en) Object detection from visual information to blind person, analysis and prompt system for providing
CN111597828B (en) Translation display method, device, head-mounted display equipment and storage medium
CN106327584B (en) Image processing method and device for virtual reality equipment
JP6771548B2 (en) A portable system that allows the blind or visually impaired to interpret the surrounding environment by voice or touch.
CN108245385A (en) A kind of device for helping visually impaired people's trip
KR102441171B1 (en) Apparatus and Method for Monitoring User based on Multi-View Face Image
CN108245384A (en) Binocular vision apparatus for guiding blind based on enhancing study
Liu et al. Electronic travel aids for the blind based on sensory substitution
CN1969781A (en) Guide for blind person
Blessenohl et al. Improving indoor mobility of the visually impaired with depth-based spatial sound
CN105976675A (en) Intelligent information exchange device and method for deaf-mute and average person
CN114973412A (en) Lip language identification method and system
JP2016194612A (en) Visual recognition support device and visual recognition support program
CN116572260A (en) Emotion communication accompanying and nursing robot system based on artificial intelligence generated content
CN110717344A (en) Auxiliary communication system based on intelligent wearable equipment
KR20120091625A (en) Speech recognition device and speech recognition method using 3d real-time lip feature point based on stereo camera
Kaur et al. A scene perception system for visually impaired based on object detection and classification using multi-modal DCNN
EP3058926A1 (en) Method of transforming visual data into acoustic signals and aid device for visually impaired or blind persons
WO2022048455A1 (en) Barrier free information access system and method employing augmented reality technology
Nazim et al. Smart glasses: A visual assistant for the blind
CN114120960A (en) Auxiliary space perception system and method based on hearing
US20210097888A1 (en) Transmodal translation of feature vectors to audio for assistive devices
Scalvini et al. Visual-auditory substitution device for indoor navigation based on fast visual marker detection
CN111273598A (en) Underwater information interaction system, diver safety guarantee method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant