CN114245280A - Scene self-adaptive hearing aid audio enhancement system based on neural network - Google Patents

Scene self-adaptive hearing aid audio enhancement system based on neural network

Info

Publication number
CN114245280A
Authority
CN
China
Prior art keywords
scene
audio
audio enhancement
features
hearing aid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111565538.3A
Other languages
Chinese (zh)
Other versions
CN114245280B (en)
Inventor
吴志勇
杨玉杰
蔡新宇
陈玉鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Peng Cheng Laboratory
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University and Peng Cheng Laboratory
Priority to CN202111565538.3A
Publication of CN114245280A
Application granted
Publication of CN114245280B
Legal status: Active (current)
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/50Customised settings for obtaining desired overall acoustical characteristics
    • H04R25/505Customised settings for obtaining desired overall acoustical characteristics using digital signal processing
    • H04R25/507Customised settings for obtaining desired overall acoustical characteristics using digital signal processing implemented by neural network or fuzzy logic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating

Abstract

A scene self-adaptive hearing aid audio enhancement system and method based on a neural network comprise a multi-modal scene feature extraction module and a neural-network-based audio enhancement module. The multi-modal scene feature extraction module extracts multi-modal scene features, which comprise audio and image features of the scene; the audio enhancement module uses the encoded multi-modal scene features to enhance the original audio. The invention fuses scene information into the audio enhancement system, so that different sounds can be enhanced or suppressed in different scenes, improving the user experience. The method helps a hearing aid perform targeted audio enhancement in different scenes, reduces the model storage requirement, and improves inference speed and audio enhancement performance.

Description

Scene self-adaptive hearing aid audio enhancement system based on neural network
Technical Field
The invention relates to audio enhancement technology, and in particular to a scene self-adaptive hearing aid audio enhancement system based on a neural network.
Background
There are about 30 million hearing-impaired patients in China. Wearing a hearing aid can markedly improve a hearing-impaired patient's ability to communicate and quality of life. A conventional hearing aid merely amplifies the audio and the noise together, giving a poor user experience. An intelligent digital hearing aid that applies artificial intelligence technology can selectively enhance certain sounds under different conditions, further improving the experience of using a hearing aid. In real life, a hearing aid is expected to selectively enhance or suppress certain sounds depending on the scene. For example, a user attending a conference may want the hearing aid to enhance the speaker's voice; at home, the hearing aid should suppress some external noise; and when the user walks on the street, the hearing aid must not suppress sounds such as car horns, otherwise safety hazards may arise.
There is currently no complete solution to this problem. To achieve the above effects, several separate methods must be combined: typically a standalone scene classification module first judges the current scene, and a different audio enhancement method is then selected according to that scene. This approach has two main problems: (1) different audio enhancement models must be loaded for different scenes, which wastes storage space and reduces real-time performance; (2) the scene classification module and the audio enhancement module run in series, so targeted audio enhancement can only begin after the scene classification result is available, which further reduces real-time performance.
With the development of artificial intelligence, neural network techniques have begun to be used in the hearing assistance field. Paper [1] studies speech enhancement based on convolutional neural networks. Paper [2] proposes a speech enhancement technique based on recurrent neural networks. Paper [3] uses a generative adversarial network structure, where the generator performs speech enhancement and the discriminator distinguishes clean speech from enhanced speech. These deep-learning-based methods achieve performance far beyond traditional methods. Regarding the use of speech enhancement in the hearing assistance field, patent [4] proposes a hearing aid speech enhancement method using cloud computing, and paper [5] enables an LSTM-based speech enhancement model to run on the hearing aid's local chip with acceptable computational delay by applying multiple model compression methods.
These papers and patents focus mainly on speech enhancement and its application to hearing aids; the application of general audio enhancement in the hearing aid field has not been studied. Paper [6] proposes a framework for training an audio separation model on weakly supervised data, providing a basis for applying audio separation models.
Some solutions also consider scene noise in speech enhancement; for example, paper [10] proposes using noise classification to assist monaural speech enhancement, embedding a classification network into the speech enhancement system to guide the network toward better feature extraction.
The existing schemes mainly have the following shortcomings:
1) Existing schemes mainly target speech enhancement; they cannot specifically enhance other audio such as music or knocking sounds, so the applicable scenes are too narrow.
2) When existing schemes use scene information for audio enhancement, the scene classification module and the per-scene speech enhancement models run in series: the scene category is obtained first, and the speech enhancement model of the corresponding category is then selected. This serial execution further reduces real-time performance, and the multiple speech enhancement models occupy more storage space.
3) The scene classification module in existing schemes considers only audio information, which partially overlaps with the information used by the audio enhancement module.
Reference documents:
[1] Park S R, Lee J. A fully convolutional neural network for speech enhancement [J]. arXiv preprint arXiv:1609.07132, 2016.
[2] Sun L, Du J, Dai L R, et al. Multiple-target deep learning for LSTM-RNN based speech enhancement [C]//2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). IEEE, 2017: 136-140.
[3] Pascual S, Bonafonte A, Serra J. SEGAN: Speech enhancement generative adversarial network [J]. arXiv preprint arXiv:1703.09452, 2017.
[4] Chen Fei, Lang Gao. A speech enhancement method for hearing aids combining edge computing with cloud computing [P]. Tianjin: CN112908353A, 2021-06-04.
[5] Fedorov I, Stamenovic M, Jensen C, et al. TinyLSTMs: Efficient neural speech enhancement for hearing aids [J]. arXiv preprint arXiv:2005.11138, 2020.
[6] Kong Q, Wang Y, Song X, et al. Source separation with weakly labelled data: An approach to computational auditory scene analysis [C]//ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 101-105.
[7] Kim J, El-Khamy M, Lee J. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement [J]. 2019.
[8] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation [C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015: 234-241.
[9] Tan K, Wang D L. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement [C]//Interspeech. 2018: 3229-3233.
[10] Noise Classification Aided Attention-Based Neural Network for Monaural Speech Enhancement.
It is to be noted that the information disclosed in the above background section is only for understanding the background of the present application and thus may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
The main objective of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a scene self-adaptive hearing aid audio enhancement system based on a neural network.
To achieve this objective, the invention adopts the following technical scheme:
A scene self-adaptive hearing aid audio enhancement system based on a neural network comprises a multi-modal scene feature extraction module and a neural-network-based audio enhancement module. The multi-modal scene feature extraction module extracts multi-modal scene features of the scene, including audio and image features; the audio enhancement module performs audio enhancement conditioned on the encoded multi-modal scene features, so that enhanced audio corresponding to the scene information is generated, and the joint use of audio and image features improves perception of the scene.
Further:
the multi-modal scene feature extraction module comprises an image feature extractor based on a convolution vision Transformer, an audio feature extractor based on a convolution neural network, a feature fusion network based on a multilayer perceptron and a scene classifier, wherein the image feature extractor is used for extracting image features of a scene, the audio feature extractor is used for extracting audio features of the scene, the obtained image features and the obtained audio features are subjected to feature fusion through the feature fusion network, and the scene classifier is used for predicting the category of the scene.
The convolution vision Transformer model of the image feature extractor is divided into three stages, local features are fused and the length of an image sequence is reduced by performing convolution operation on a feature map, and finally a classification token output by the model is used as an image feature.
The audio characteristic extractor adopts a convolutional neural network CNN14 to extract audio characteristics, and realizes attention to different scales of frequency spectrums through stacking of 3 multiplied by 3 convolutional kernels so as to classify audio scenes; preferably, the output of the penultimate fully-connected layer is extracted as an audio feature.
The feature fusion network and the scene classifier are constructed by using a multilayer perceptron model, the feature fusion network and the scene classifier obtain the prediction of the scene category according to the extracted image features and audio features, and the output of the feature fusion network is used as the audio-visual features of the scene and is provided for the audio enhancement module.
The audio enhancement module converts input audio into a spectrogram through short-time Fourier transform, then decomposes the spectrogram according to amplitude and phase, wherein the amplitude spectrum and scene features obtained from the multi-modal scene feature extraction module are input into a sound separation network of the audio enhancement module together, the sound separation network generates a separated amplitude spectrum, and therefore the enhancement and suppression degrees of different parts of the original amplitude spectrum are determined, and then the audio enhancement module combines phase information to recover a one-dimensional audio signal, so that audio enhancement is completed.
The audio enhancement module adopts a sound separation network based on CNN and LSTM.
The voice separation network comprises an encoder, a decoder and a bottleneck layer, wherein the convolution operation in the encoder continuously reduces the size of an input spectrogram, the deconvolution operation in the decoder restores the spectrogram to the original size, and data direct flow exists between corresponding layers in the encoder and the decoder, so that the finally restored spectrogram integrates a plurality of characteristics with different sizes, and the LSTM of the bottleneck layer enables the network to carry out voice separation by utilizing preset information before a long time.
The audio enhancement module converts the obtained scene features into corresponding dimensionalities through linear transformation, then adds the dimensionalities with the output of each layer in the encoder and the decoder, and enables the scene features to control the output of the sound separation network through training so as to output different separation results for different scenes.
A scene self-adaptive hearing aid audio enhancement method based on a neural network enhances scene audio by using the scene self-adaptive hearing aid audio enhancement system described above.
The invention has the following beneficial effects:
The invention provides a scene self-adaptive hearing aid audio enhancement system based on a neural network. It combines a scene classification system based on audio-image multi-modality with a neural-network-based audio enhancement method, in particular one based on CNN and LSTM, and fuses the scene information into the audio enhancement system, so that different sounds can be enhanced or suppressed in different scenes, improving the user experience. The invention helps a hearing aid achieve targeted audio enhancement in different scenes. Compared with the prior art, it reduces the model storage requirement and improves inference speed and audio enhancement performance.
Compared with traditional schemes, the invention has the following advantages:
1) The invention extends speech enhancement to general audio enhancement, greatly broadening the practicality of the hearing aid.
2) The invention introduces visual information; using images together with audio allows scene features to be extracted better.
3) The audio enhancement module can adaptively change the degree of enhancement of different sounds according to the output of the scene feature extraction module, saving storage space and speeding up audio enhancement.
4) The scene feature extraction module and the audio enhancement module can work in parallel, further increasing the running speed of the whole system and reducing system latency.
The invention can be used in an intelligent hearing aid to achieve adaptive audio enhancement by automatically sensing the scene, improving the user experience.
Drawings
Fig. 1 is a schematic diagram of a scene adaptive audio enhancement system according to an embodiment of the present invention.
Fig. 2 is a scene feature classification system based on multi-modal feature fusion and a workflow diagram thereof according to an embodiment of the invention.
FIG. 3 is a diagram of a model architecture of a convolutional visual Transformer according to an embodiment of the present invention.
Fig. 4 is a CNN14 network structure according to an embodiment of the present invention.
Fig. 5 is a block diagram of an audio enhancement module according to an embodiment of the invention.
Fig. 6 is a diagram of a sound separation network according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of a feature fusion method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary in nature and is not intended to limit the scope of the invention or its application.
Abbreviations and key term definitions:
SE: Sound Enhancement, which refers to selectively enhancing certain sounds, highlighting key sound information, and suppressing interference from irrelevant noise.
ASC: Audio Scene Classification, which refers to judging the scene in which a sound clip was recorded.
CNN: Convolutional Neural Network, generally a multi-layer neural network composed of convolutional layers and fully connected layers; a common architecture in deep learning.
LSTM: Long Short-Term Memory network, an implementation of the recurrent neural network (RNN).
Transformer: generally a multi-layer neural network based on self-attention; a common architecture in deep learning.
Referring to figs. 1 to 7, in a preferred embodiment, the invention provides a scene adaptive audio enhancement system based on a neural network, which includes a multi-modal scene feature extraction module (a scene classification module based on audio-image multi-modality) and an audio enhancement module based on CNN and LSTM. The multi-modal scene features are encoded into the audio enhancement module so that corresponding enhanced audio is generated from different scene information. As shown in fig. 1, in this scene adaptive hearing aid audio enhancement scheme, the scene features extracted by the audio-image multi-modal scene classification module are fused into the audio enhancement module to generate enhanced audio appropriate to the scene.
Multi-modal scene feature extraction module
Most existing scene analysis methods rely on single-modality information, such as image scene classification or acoustic scene classification, to extract scene features. Although a single-modality scene classification system can distinguish scenes using images or sounds alone, adding multi-modal information strengthens the judgment of scene information. We propose a scene feature extraction scheme based on multi-modal feature fusion, which uses audio and image features jointly to improve perception of the scene.
Fig. 2 shows the scene classification system based on multi-modal feature fusion and its workflow: (a) image scene feature extractor: a convolutional vision Transformer model extracts visual features; (b) audio feature extraction model: a convolutional neural network extracts audio features; (c) feature fusion classification network: a multilayer perceptron performs feature fusion and classification.
The multi-modal scene feature extraction module consists of three parts: an image feature extractor based on a convolutional vision Transformer, an audio feature extractor based on a convolutional neural network, and a feature fusion network with a scene classifier based on a multilayer perceptron. The captured scene image and audio are fed to the image and audio scene feature extractors, respectively. The feature fusion network and the scene classifier predict the scene category from the obtained image and audio features. We extract scene audio-visual features that fuse image and audio information and use them as auxiliary scene information for the audio enhancement system. The details of each part are described in turn below.
Image feature extraction model
Due to the limited computing power and battery size of hearing aids, we process image information on a smartphone connected to the hearing aid, and only the image features are sent to the hearing aid side to assist sound enhancement. Image scene features can be extracted with a convolutional vision Transformer (CvT) model, which combines the advantages of convolutional neural networks and vision Transformers: it can extract local features while retaining attention to global information. The image scene feature extraction model is pre-trained on the large scene classification dataset Places365, giving it the ability to recognize complex scenes. We extract the model's classification token as the image scene feature.
FIG. 3 shows the model architecture of the convolutional vision Transformer. The model is divided into three stages; each stage applies convolution operations to the feature map, fusing local features while shortening the image token sequence. At the beginning of the third stage, a classification token that aggregates global information of the picture is inserted at the start of the sequence, and this token is finally used as the picture feature.
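As an illustration of this kind of model, the following PyTorch sketch shows a three-stage convolutional vision Transformer that returns its classification token as the image feature. All layer widths, depths, and the 112 x 112 input size are assumptions for illustration, not the configuration of the invention or of the published CvT model.

import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """One stage: convolutional token embedding followed by Transformer layers."""
    def __init__(self, in_ch, out_ch, stride, depth, heads):
        super().__init__()
        # The strided convolution fuses local features and shortens the token sequence.
        self.embed = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=out_ch, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                            # x: (B, C, H, W)
        x = self.embed(x)                            # (B, out_ch, H', W')
        b, c, h, w = x.shape
        tokens = self.encoder(x.flatten(2).transpose(1, 2))   # (B, H'*W', out_ch)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class TinyCvT(nn.Module):
    def __init__(self, num_scenes=10):
        super().__init__()
        self.stage1 = ConvStage(3, 64, stride=4, depth=1, heads=1)
        self.stage2 = ConvStage(64, 192, stride=2, depth=2, heads=3)
        # Stage 3 keeps the tokens flat so a classification token can be prepended.
        self.embed3 = nn.Conv2d(192, 384, kernel_size=3, stride=2, padding=1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, 384))
        layer = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
        self.stage3 = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(384, num_scenes)

    def forward(self, img):                          # img: (B, 3, 112, 112)
        x = self.stage2(self.stage1(img))
        x = self.embed3(x).flatten(2).transpose(1, 2)            # (B, N, 384)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.stage3(torch.cat([cls, x], dim=1))
        img_feature = x[:, 0]                        # classification token as image feature
        return img_feature, self.head(img_feature)

feature, scene_logits = TinyCvT()(torch.randn(1, 3, 112, 112))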
Audio feature extraction model
As shown in fig. 4, a convolutional neural network is used to extract audio features. CNN14 is an audio classification model; by stacking 3 x 3 convolution kernels it attends to different scales of the spectrum and classifies audio accurately. Fig. 4 shows the convolutional neural network CNN14 that we use to extract audio features. The audio scene feature extraction model is first trained on the large AudioSet dataset to learn audio representations. We extract the output of the penultimate fully connected layer as the audio feature.
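The sketch below mimics this design: stacked 3 x 3 convolution blocks over a log-mel spectrogram, with the output of the penultimate fully connected layer taken as the audio feature. The block widths, pooling scheme, and input shape are illustrative assumptions rather than the exact CNN14 configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Two stacked 3 x 3 convolutions followed by 2 x 2 average pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(out_ch), nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        return F.avg_pool2d(x, 2)

class CNN14Like(nn.Module):
    def __init__(self, n_scenes=10):
        super().__init__()
        widths = [64, 128, 256, 512, 1024, 2048]
        self.blocks = nn.ModuleList(
            ConvBlock(cin, cout) for cin, cout in zip([1] + widths[:-1], widths))
        self.fc1 = nn.Linear(2048, 2048)        # penultimate layer: its output is the audio feature
        self.fc2 = nn.Linear(2048, n_scenes)    # scene classification head

    def forward(self, logmel):                  # logmel: (B, 1, frames, mel_bins)
        x = logmel
        for blk in self.blocks:
            x = blk(x)
        x = x.mean(dim=3)                       # average over the mel axis
        x = x.max(dim=2).values + x.mean(dim=2) # pool over time
        audio_feature = F.relu(self.fc1(x))
        return audio_feature, self.fc2(audio_feature)

audio_feature, logits = CNN14Like()(torch.randn(1, 1, 128, 64))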
Feature fusion network and scene classifier
We use a multilayer perceptron model to build the feature fusion network and the scene classifier. The image features and audio features extracted by the pre-trained models are passed through the feature fusion network and the scene classifier to predict the scene category. After the network is trained, the output of the feature fusion network is extracted as the audio-visual feature of the scene and provided to the audio enhancement system.
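A minimal sketch of such a fusion network and classifier follows; the feature dimensions (384-d image feature, 2048-d audio feature, 256-d fused feature, 10 scene classes) are assumed values for illustration.

import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Fuse image and audio features with an MLP; the fused vector doubles as the scene feature."""
    def __init__(self, img_dim=384, audio_dim=2048, fused_dim=256, n_scenes=10):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + audio_dim, 512), nn.ReLU(),
            nn.Linear(512, fused_dim), nn.ReLU())
        self.classifier = nn.Linear(fused_dim, n_scenes)

    def forward(self, img_feat, audio_feat):
        fused = self.fuse(torch.cat([img_feat, audio_feat], dim=-1))
        return fused, self.classifier(fused)    # fused vector is passed to the audio enhancement module

scene_feature, scene_logits = FusionClassifier()(torch.randn(2, 384), torch.randn(2, 2048))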
Audio enhancement module
The core function of the audio enhancement module is to separate the different kinds of sounds in the input audio and, on that basis, enhance certain sound types in a targeted way. In the invention, the input audio is first converted to a spectrogram by a short-time Fourier transform, and the spectrogram is then decomposed into amplitude and phase. The audio enhancement module must decide which sound types to separate according to the input scene features: the amplitude spectrum and the scene features obtained from the scene classification module are fed together into the sound separation network, which generates the separated amplitude spectrum and thus determines how much each part of the original amplitude spectrum is enhanced or suppressed. Finally, the audio enhancement module combines the phase information and restores a one-dimensional audio signal, completing the audio enhancement. The overall structure of the audio enhancement module is shown in fig. 5.
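A sketch of this processing chain is shown below in PyTorch. Here separation_net stands for the sound separation network described in the next subsection (any callable mapping the amplitude spectrum and scene feature to a mask of the same shape), and the STFT parameters are assumed values.

import torch

def enhance(audio, scene_feat, separation_net, n_fft=512, hop=128):
    """Enhance a batch of waveforms (B, samples) conditioned on scene features."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(audio, n_fft, hop_length=hop, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()       # decompose into amplitude and phase
    # The separation network decides how much each time-frequency bin of the
    # original amplitude spectrum is enhanced or suppressed.
    mask = separation_net(mag, scene_feat)      # same shape as mag, values in [0, 1]
    enhanced = torch.polar(mag * mask, phase)   # recombine with the original phase
    return torch.istft(enhanced, n_fft, hop_length=hop, window=window,
                       length=audio.shape[-1])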
Sound separation network
Because the invention targets hearing aids, the sound separation method must meet strict real-time requirements, so a CNN- and LSTM-based sound separation network [9] with a structure similar to U-Net [8] is chosen; the structure is shown in fig. 6. The sound separation network consists of three parts: an encoder, a decoder, and a bottleneck layer. Convolution operations in the encoder progressively reduce the size of the input spectrogram, deconvolution operations in the decoder restore the spectrogram to its original size, and skip connections pass data directly between corresponding layers of the encoder and decoder. As a result, the finally recovered spectrogram fuses features of several different sizes, yielding a better reconstruction. The LSTM in the bottleneck layer helps the network use long-term past information for sound separation.
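The following is a small sketch of a network of this shape: a convolutional encoder that halves the frequency axis at each layer, an LSTM bottleneck over time, and a deconvolutional decoder with skip connections that outputs a mask. Channel counts, depth, and the 256-bin frequency axis are assumptions, and the scene-feature injection is shown separately in the next subsection.

import torch
import torch.nn as nn

class SeparationNet(nn.Module):
    """U-Net-style mask estimator with an LSTM bottleneck (channel counts assumed)."""
    def __init__(self, freq_bins=256):
        super().__init__()
        chs = [1, 16, 32, 64]
        self.enc = nn.ModuleList(
            nn.Conv2d(cin, cout, 3, stride=(1, 2), padding=1)        # halve the frequency axis
            for cin, cout in zip(chs[:-1], chs[1:]))
        self.dec = nn.ModuleList(
            nn.ConvTranspose2d(2 * cout, cin, 3, stride=(1, 2),      # skip-concatenated input
                               padding=1, output_padding=(0, 1))
            for cin, cout in zip(chs[:-1], chs[1:]))
        # freq_bins is assumed divisible by 2**len(enc), e.g. crop 257 STFT bins to 256.
        bott = chs[-1] * (freq_bins // 2 ** len(self.enc))           # 64 * 32 = 2048
        self.lstm = nn.LSTM(bott, bott, batch_first=True)

    def forward(self, mag):                      # mag: (B, 1, frames, freq_bins)
        skips, x = [], mag
        for conv in self.enc:
            x = torch.relu(conv(x))
            skips.append(x)
        b, c, t, f = x.shape
        seq, _ = self.lstm(x.permute(0, 2, 1, 3).reshape(b, t, c * f))   # temporal context
        x = seq.reshape(b, t, c, f).permute(0, 2, 1, 3)
        decs = list(self.dec)[::-1]
        for i, (deconv, skip) in enumerate(zip(decs, skips[::-1])):
            x = deconv(torch.cat([x, skip], dim=1))                  # skip connection
            if i < len(decs) - 1:
                x = torch.relu(x)
        return torch.sigmoid(x)                  # per-bin enhancement/suppression mask

mask = SeparationNet()(torch.randn(2, 1, 100, 256))                  # (2, 1, 100, 256)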
Fusing scene features
To fuse the scene features into the sound separation network, as in the feature fusion method shown in fig. 7, the obtained scene features are first converted into the corresponding dimensions by a linear transformation and then added to the output of each layer in the encoder and decoder. Trained in this way, the scene features control the output of the sound separation network, so the network produces different separation results for different scenes.
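One possible realization of this fusion step is sketched below: each encoder or decoder layer's output receives an additive per-channel bias obtained by linearly projecting the scene feature. The scene feature dimension and layer width are assumed values.

import torch
import torch.nn as nn

class SceneConditionedLayer(nn.Module):
    """Wrap a conv layer so that a projected scene feature is added to its output."""
    def __init__(self, conv, scene_dim, out_channels):
        super().__init__()
        self.conv = conv
        self.proj = nn.Linear(scene_dim, out_channels)   # scene feature -> per-channel bias

    def forward(self, x, scene_feat):
        y = self.conv(x)
        bias = self.proj(scene_feat)                      # (B, C)
        return y + bias[:, :, None, None]                 # broadcast over time and frequency

layer = SceneConditionedLayer(nn.Conv2d(1, 16, 3, padding=1), scene_dim=256, out_channels=16)
out = layer(torch.randn(2, 1, 100, 256), torch.randn(2, 256))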
Hardware environment
The hardware environment required by this scheme can be divided into three parts: audio acquisition, image acquisition, and signal processing. Audio acquisition can rely on the hearing aid's microphone, while image acquisition requires the hearing aid to be equipped with a camera module. The signal processing part can be realized in two ways: (1) process on the computing chip carried by the hearing aid and play the result directly. This is simple to implement, but because real-time neural network computation places heavy demands on hardware resources, performance, cost and power consumption must be balanced. (2) Connect the hearing aid and a smartphone wirelessly via Bluetooth, Wi-Fi or similar technologies; the hearing aid sends the captured audio and image signals to the phone, which processes them with its stronger computing power and returns the enhanced audio to the hearing aid for playback. This reduces the cost of the hearing aid chip and improves battery life, and a companion phone app can add further functions, such as manually switching the enhancement mode or adjusting the hearing aid's loudness compensation.
The invention has at least the following advantages:
1. The audio enhancement module can adaptively change the degree of enhancement of different sounds according to the output of the scene feature extraction module, saving storage space and speeding up audio enhancement.
2. Speech enhancement is extended to general audio enhancement, greatly broadening the practicality of the hearing aid.
3. The scene classification module and the audio enhancement module can work simultaneously, further increasing the running speed of the whole system.
In embodiments of the invention, the scene feature extraction module may also accept input of only single-modality data, performing the scene classification task with audio or images alone.
The audio enhancement module of embodiments of the invention may be replaced with a variety of different structures.
The invention may adopt a variety of different ways of fusing scene features.
The background of the present invention may contain background information related to the problem or environment of the present invention and does not necessarily describe the prior art. Accordingly, the inclusion in the background section is not an admission of prior art by the applicant.
The foregoing is a more detailed description of the invention in connection with specific/preferred embodiments and is not intended to limit the practice of the invention to those descriptions. It will be apparent to those skilled in the art that various substitutions and modifications can be made to the described embodiments without departing from the spirit of the invention, and these substitutions and modifications should be considered to fall within the scope of the invention. In the description herein, references to the description of the term "one embodiment," "some embodiments," "preferred embodiments," "an example," "a specific example," or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction. Although embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the claims.

Claims (10)

1. A scene self-adaptive hearing aid audio enhancement system based on a neural network, characterized by comprising a multi-modal scene feature extraction module and a neural-network-based audio enhancement module, wherein the multi-modal scene feature extraction module extracts multi-modal scene features of a scene, the multi-modal scene features comprising audio and image features of the scene; and the audio enhancement module performs audio enhancement using the encoded multi-modal scene features, so that enhanced audio corresponding to the scene information is generated, the joint use of audio and image features improving perception of the scene.
2. The scene adaptive hearing aid audio enhancement system of claim 1, wherein the multi-modal scene feature extraction module comprises a convolutional visual Transformer-based image feature extractor for extracting image features of a scene, a convolutional neural network-based audio feature extractor for extracting audio features of the scene, and a multi-layer perceptron-based feature fusion network and a scene classifier for performing feature fusion on the obtained image features and audio features, and the scene classifier performs prediction on scene categories.
3. The scene adaptive hearing aid audio enhancement system of claim 2, wherein the convolution vision Transformer model of the image feature extractor is divided into three stages, local features are fused and the length of the image sequence is reduced by performing convolution operation on a feature map, and finally the classification token output by the model is used as the image feature.
4. The scene adaptive hearing aid audio enhancement system according to claim 2 or 3, wherein the audio feature extractor adopts a convolutional neural network CNN14 to extract audio features, and the audio scene is classified by focusing on different scales of frequency spectrum through stacking of 3 x 3 convolutional kernels; preferably, the output of the penultimate fully-connected layer is extracted as an audio feature.
5. The scene adaptive hearing aid audio enhancement system according to any one of claims 2 to 4, wherein the feature fusion network and scene classifier are constructed using a multi-layered perceptron model, the feature fusion network and scene classifier deriving predictions of scene classes from extracted image features and audio features, the output of the feature fusion network being provided to the audio enhancement module as audiovisual features of a scene.
6. The scene adaptive hearing aid audio enhancement system according to any one of claims 1 to 5, wherein the audio enhancement module applies a short-time Fourier transform to the input audio to obtain a spectrogram and decomposes it into amplitude and phase, wherein the amplitude spectrum is input to the sound separation network of the audio enhancement module together with the scene features obtained from the multi-modal scene feature extraction module, and the sound separation network generates a separated amplitude spectrum, thereby determining the degree of enhancement and suppression for different parts of the original amplitude spectrum; the audio enhancement module then combines the phase information and restores a one-dimensional audio signal, completing the audio enhancement.
7. The context adaptive hearing aid audio enhancement system of any one of claims 1 to 5, wherein the audio enhancement module employs a CNN and LSTM based sound separation network.
8. The scene adaptive hearing aid audio enhancement system of claim 7, wherein the sound separation network comprises an encoder, a decoder, and a bottleneck layer, wherein a convolution operation in the encoder progressively reduces the input spectrogram size, a deconvolution operation in the decoder restores the spectrogram back to its original size, and a direct flow of data exists between corresponding layers of the encoder and the decoder, whereby the finally restored spectrogram fuses a plurality of features of different sizes, and wherein the LSTM of the bottleneck layer enables the network to perform sound separation using long-term past information.
9. The scene adaptive hearing aid audio enhancement system of claim 8, wherein the audio enhancement module first transforms the obtained scene features into corresponding dimensions through a linear transformation, then adds them to the output of each layer in the encoder and the decoder, trained such that the scene features can control the output of the sound separation network to output different separation results for different scenes.
10. A neural network based scene adaptive hearing aid audio enhancement method, characterized in that the scene audio is enhanced by using the neural network based scene adaptive hearing aid audio enhancement system according to any one of claims 1 to 9.
CN202111565538.3A 2021-12-20 2021-12-20 Scene self-adaptive hearing aid audio enhancement system based on neural network Active CN114245280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111565538.3A CN114245280B (en) 2021-12-20 2021-12-20 Scene self-adaptive hearing aid audio enhancement system based on neural network

Publications (2)

Publication Number Publication Date
CN114245280A (en) 2022-03-25
CN114245280B (en) 2023-06-23

Family

ID=80759625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111565538.3A Active CN114245280B (en) 2021-12-20 2021-12-20 Scene self-adaptive hearing aid audio enhancement system based on neural network

Country Status (1)

Country Link
CN (1) CN114245280B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116367063A (en) * 2023-04-23 2023-06-30 郑州大学 Bone conduction hearing aid equipment and system based on embedded
CN116432703A (en) * 2023-06-12 2023-07-14 成都大学 Pulse height estimation method, system and terminal based on composite neural network model

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN109859767A (en) * 2019-03-06 2019-06-07 哈尔滨工业大学(深圳) A kind of environment self-adaption neural network noise-reduction method, system and storage medium for digital deaf-aid
CN110322900A (en) * 2019-06-25 2019-10-11 深圳市壹鸽科技有限公司 A kind of method of phonic signal character fusion
CN110782878A (en) * 2019-10-10 2020-02-11 天津大学 Attention mechanism-based multi-scale audio scene recognition method
CN110827813A (en) * 2019-10-18 2020-02-21 清华大学深圳国际研究生院 Stress detection method and system based on multi-modal characteristics
CN111539449A (en) * 2020-03-23 2020-08-14 广东省智能制造研究所 Sound source separation and positioning method based on second-order fusion attention network model
CN111539445A (en) * 2020-02-26 2020-08-14 江苏警官学院 Object classification method and system based on semi-supervised feature fusion
CN112951258A (en) * 2021-04-23 2021-06-11 中国科学技术大学 Audio and video voice enhancement processing method and model
CN112967713A (en) * 2021-01-23 2021-06-15 西安交通大学 Audio-visual voice recognition method, device, equipment and storage medium based on multi-modal fusion
CN113035227A (en) * 2021-03-12 2021-06-25 山东大学 Multi-modal voice separation method and system
CN113128527A (en) * 2021-06-21 2021-07-16 中国人民解放军国防科技大学 Image scene classification method based on converter model and convolutional neural network
CN113470671A (en) * 2021-06-28 2021-10-01 安徽大学 Audio-visual voice enhancement method and system by fully utilizing visual and voice connection
CN113593601A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Audio-visual multi-modal voice separation method based on deep learning
CN113611318A (en) * 2021-06-29 2021-11-05 华为技术有限公司 Audio data enhancement method and related equipment
CN113782048A (en) * 2021-09-24 2021-12-10 科大讯飞股份有限公司 Multi-modal voice separation method, training method and related device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KE TAN, DELIANG WANG: "A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement", pages 3229-3233 *
朱宸都: "Research on ICRNN-GRU-based abnormal audio event detection and enhancement algorithms" (in Chinese), pages 136-290 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116367063A (en) * 2023-04-23 2023-06-30 郑州大学 Bone conduction hearing aid equipment and system based on embedded
CN116367063B (en) * 2023-04-23 2023-11-14 郑州大学 Bone conduction hearing aid equipment and system based on embedded
CN116432703A (en) * 2023-06-12 2023-07-14 成都大学 Pulse height estimation method, system and terminal based on composite neural network model
CN116432703B (en) * 2023-06-12 2023-08-29 成都大学 Pulse height estimation method, system and terminal based on composite neural network model

Also Published As

Publication number Publication date
CN114245280B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN110459237B (en) Voice separation method, voice recognition method and related equipment
CN111243620B (en) Voice separation model training method and device, storage medium and computer equipment
Li et al. CBLDNN-based speaker-independent speech separation via generative adversarial training
CN111128197B (en) Multi-speaker voice separation method based on voiceprint features and generation confrontation learning
CN114245280B (en) Scene self-adaptive hearing aid audio enhancement system based on neural network
WO2019222759A1 (en) Recurrent multimodal attention system based on expert gated networks
CN113158727A (en) Bimodal fusion emotion recognition method based on video and voice information
WO2022048239A1 (en) Audio processing method and device
Yu et al. A two-stage complex network using cycle-consistent generative adversarial networks for speech enhancement
CN113593601A (en) Audio-visual multi-modal voice separation method based on deep learning
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
US20230335148A1 (en) Speech Separation Method, Electronic Device, Chip, and Computer-Readable Storage Medium
KR102174189B1 (en) Acoustic information recognition method and system using semi-supervised learning based on variational auto encoder model
CN112562698B (en) Power equipment defect diagnosis method based on fusion of sound source information and thermal imaging characteristics
JP2020134657A (en) Signal processing device, learning device, signal processing method, learning method and program
CN111883105B (en) Training method and system for context information prediction model of video scene
CN113611318A (en) Audio data enhancement method and related equipment
CN110020596B (en) Video content positioning method based on feature fusion and cascade learning
CN113643688B (en) Mongolian voice feature fusion method and device
CN113327631B (en) Emotion recognition model training method, emotion recognition method and emotion recognition device
Ngo et al. Sound context classification based on joint learning model and multi-spectrogram features
Jiang et al. An integrated convolutional neural network with a fusion attention mechanism for acoustic scene classification
Shen Application of transfer learning algorithm and real time speech detection in music education platform
WO2020068401A1 (en) Audio watermark encoding/decoding
CN116705013B (en) Voice wake-up word detection method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant