CN114245280A - Scene self-adaptive hearing aid audio enhancement system based on neural network - Google Patents

Scene self-adaptive hearing aid audio enhancement system based on neural network

Info

Publication number
CN114245280A
Authority
CN
China
Prior art keywords
scene
audio
audio enhancement
features
hearing aid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111565538.3A
Other languages
Chinese (zh)
Other versions
CN114245280B (en)
Inventor
吴志勇
杨玉杰
蔡新宇
陈玉鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Peng Cheng Laboratory
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University and Peng Cheng Laboratory
Priority to CN202111565538.3A
Publication of CN114245280A
Application granted
Publication of CN114245280B
Legal status: Active (current)
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/50Customised settings for obtaining desired overall acoustical characteristics
    • H04R25/505Customised settings for obtaining desired overall acoustical characteristics using digital signal processing
    • H04R25/507Customised settings for obtaining desired overall acoustical characteristics using digital signal processing implemented by neural network or fuzzy logic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating

Abstract

A scene self-adaptive hearing aid audio enhancement system and method based on a neural network comprise a multi-modal scene feature extraction module and a neural-network-based audio enhancement module. The multi-modal scene feature extraction module extracts multi-modal scene features, which comprise audio and image features of the scene; the audio enhancement module uses the encoded multi-modal scene features to enhance the original audio. The invention fuses scene information into the audio enhancement system, so that different sounds can be enhanced or suppressed in different scenes, improving the user experience. The method helps a hearing aid perform targeted audio enhancement in different scenes, reduces the model storage requirement, and improves inference speed and audio enhancement performance.

Description

Scene self-adaptive hearing aid audio enhancement system based on neural network
Technical Field
The invention relates to audio enhancement technology, and in particular to a scene self-adaptive hearing aid audio enhancement system based on a neural network.
Background
There are about 30 million hearing-impaired patients in China. Wearing a hearing aid can markedly improve a hearing-impaired patient's ability to communicate and quality of life. A conventional hearing aid merely amplifies the audio and the noise together, giving a poor user experience. An intelligent digital hearing aid that applies artificial intelligence technology can selectively enhance certain sounds under different conditions, further improving the experience of using a hearing aid. In real life, a hearing aid is expected to selectively enhance or suppress certain sounds depending on the scene. For example, a user attending a conference may want the hearing aid to enhance the speaker's voice; at home, the hearing aid should suppress some external noise; and when the user walks on the street, the hearing aid must not suppress sounds such as car horns, otherwise safety hazards may arise.
There is currently no complete solution to this problem. To achieve the above effects, several separate methods must be combined: typically a standalone scene classification module first judges the current scene, and a different audio enhancement method is then selected according to that scene. This approach has two main problems: (1) different audio enhancement models must be loaded for different scenes, which wastes storage space and reduces real-time performance; (2) the scene classification module and the audio enhancement module run in series, so targeted audio enhancement can only begin after the scene classification result is available, which further reduces real-time performance.
With the development of artificial intelligence, neural network techniques have begun to be used in the hearing assistance field. Paper [1] studies speech enhancement based on convolutional neural networks. Paper [2] proposes a speech enhancement technique based on recurrent neural networks. Paper [3] uses a generative adversarial network structure, where the generator performs speech enhancement and the discriminator distinguishes clean speech from enhanced speech. These deep-learning-based methods achieve performance far beyond traditional methods. Regarding the use of speech enhancement in the hearing assistance field, patent [4] proposes a hearing aid speech enhancement method using cloud computing, and paper [5] enables an LSTM-based speech enhancement model to run on the hearing aid's local chip with acceptable computational delay by applying multiple model compression methods.
These papers and patents focus mainly on speech enhancement and its application to hearing aids; the application of general audio enhancement in the hearing aid field has not been studied. Paper [6] proposes a framework for training an audio separation model on weakly supervised data, providing a basis for applying audio separation models.
Some solutions also consider scene noise in speech enhancement; for example, paper [10] proposes using noise classification to assist monaural speech enhancement, embedding a classification network into the speech enhancement system to guide the network toward better feature extraction.
The existing schemes mainly have the following shortcomings:
1) Existing schemes mainly target speech enhancement; they cannot specifically enhance other audio such as music or knocking sounds, so the applicable scenes are too narrow.
2) When existing schemes use scene information for audio enhancement, the scene classification module and the per-scene speech enhancement models run in series: the scene category is obtained first, and the speech enhancement model of the corresponding category is then selected. This serial execution further reduces real-time performance, and the multiple speech enhancement models occupy more storage space.
3) The scene classification module in existing schemes considers only audio information, which partially overlaps with the information used by the audio enhancement module.
Reference documents:
[1] Park S R, Lee J. A fully convolutional neural network for speech enhancement [J]. arXiv preprint arXiv:1609.07132, 2016.
[2] Sun L, Du J, Dai L R, et al. Multiple-target deep learning for LSTM-RNN based speech enhancement [C]//2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). IEEE, 2017: 136-140.
[3] Pascual S, Bonafonte A, Serra J. SEGAN: Speech enhancement generative adversarial network [J]. arXiv preprint arXiv:1703.09452, 2017.
[4] Chen Fei, Lang Gao. A speech enhancement method for hearing aids combining edge computing with cloud computing [P]. Tianjin: CN112908353A, 2021-06-04.
[5] Fedorov I, Stamenovic M, Jensen C, et al. TinyLSTMs: Efficient neural speech enhancement for hearing aids [J]. arXiv preprint arXiv:2005.11138, 2020.
[6] Kong Q, Wang Y, Song X, et al. Source separation with weakly labelled data: An approach to computational auditory scene analysis [C]//ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 101-105.
[7] Kim J, El-Khamy M, Lee J. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement [J]. 2019.
[8] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation [C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015: 234-241.
[9] Tan K, Wang D L. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement [C]//Interspeech. 2018: 3229-3233.
[10] Noise Classification Aided Attention-Based Neural Network for Monaural Speech Enhancement.
It is to be noted that the information disclosed in the above background section is only for understanding the background of the present application and thus may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
The main objective of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a scene self-adaptive hearing aid audio enhancement system based on a neural network.
To achieve this objective, the invention adopts the following technical scheme:
A scene self-adaptive hearing aid audio enhancement system based on a neural network comprises a multi-modal scene feature extraction module and a neural-network-based audio enhancement module. The multi-modal scene feature extraction module extracts multi-modal scene features of the scene, including audio and image features; the audio enhancement module performs audio enhancement conditioned on the encoded multi-modal scene features, so that enhanced audio corresponding to the scene information is generated, and the joint use of audio and image features improves perception of the scene.
Further:
the multi-modal scene feature extraction module comprises an image feature extractor based on a convolution vision Transformer, an audio feature extractor based on a convolution neural network, a feature fusion network based on a multilayer perceptron and a scene classifier, wherein the image feature extractor is used for extracting image features of a scene, the audio feature extractor is used for extracting audio features of the scene, the obtained image features and the obtained audio features are subjected to feature fusion through the feature fusion network, and the scene classifier is used for predicting the category of the scene.
The convolution vision Transformer model of the image feature extractor is divided into three stages, local features are fused and the length of an image sequence is reduced by performing convolution operation on a feature map, and finally a classification token output by the model is used as an image feature.
The audio characteristic extractor adopts a convolutional neural network CNN14 to extract audio characteristics, and realizes attention to different scales of frequency spectrums through stacking of 3 multiplied by 3 convolutional kernels so as to classify audio scenes; preferably, the output of the penultimate fully-connected layer is extracted as an audio feature.
The feature fusion network and the scene classifier are constructed by using a multilayer perceptron model, the feature fusion network and the scene classifier obtain the prediction of the scene category according to the extracted image features and audio features, and the output of the feature fusion network is used as the audio-visual features of the scene and is provided for the audio enhancement module.
The audio enhancement module converts input audio into a spectrogram through short-time Fourier transform, then decomposes the spectrogram according to amplitude and phase, wherein the amplitude spectrum and scene features obtained from the multi-modal scene feature extraction module are input into a sound separation network of the audio enhancement module together, the sound separation network generates a separated amplitude spectrum, and therefore the enhancement and suppression degrees of different parts of the original amplitude spectrum are determined, and then the audio enhancement module combines phase information to recover a one-dimensional audio signal, so that audio enhancement is completed.
The audio enhancement module adopts a sound separation network based on CNN and LSTM.
The voice separation network comprises an encoder, a decoder and a bottleneck layer, wherein the convolution operation in the encoder continuously reduces the size of an input spectrogram, the deconvolution operation in the decoder restores the spectrogram to the original size, and data direct flow exists between corresponding layers in the encoder and the decoder, so that the finally restored spectrogram integrates a plurality of characteristics with different sizes, and the LSTM of the bottleneck layer enables the network to carry out voice separation by utilizing preset information before a long time.
The audio enhancement module converts the obtained scene features into corresponding dimensionalities through linear transformation, then adds the dimensionalities with the output of each layer in the encoder and the decoder, and enables the scene features to control the output of the sound separation network through training so as to output different separation results for different scenes.
A scene self-adaptive hearing aid audio enhancement method based on a neural network enhances scene audio by using the scene self-adaptive hearing aid audio enhancement system described above.
The invention has the following beneficial effects:
The invention provides a scene self-adaptive hearing aid audio enhancement system based on a neural network. It combines a scene classification system based on audio-image multi-modality with a neural-network-based audio enhancement method, in particular one based on CNN and LSTM, and fuses the scene information into the audio enhancement system, so that different sounds can be enhanced or suppressed in different scenes, improving the user experience. The invention helps a hearing aid achieve targeted audio enhancement in different scenes. Compared with the prior art, it reduces the model storage requirement and improves inference speed and audio enhancement performance.
Compared with traditional schemes, the invention has the following advantages:
1) The invention extends speech enhancement to general audio enhancement, greatly broadening the practicality of the hearing aid.
2) The invention introduces visual information; using images together with audio allows scene features to be extracted better.
3) The audio enhancement module can adaptively change the degree of enhancement of different sounds according to the output of the scene feature extraction module, saving storage space and speeding up audio enhancement.
4) The scene feature extraction module and the audio enhancement module can work in parallel, further increasing the running speed of the whole system and reducing system latency.
The invention can be used in an intelligent hearing aid to achieve adaptive audio enhancement by automatically sensing the scene, improving the user experience.
Drawings
Fig. 1 is a schematic diagram of a scene adaptive audio enhancement system according to an embodiment of the present invention.
Fig. 2 is a scene feature classification system based on multi-modal feature fusion and a workflow diagram thereof according to an embodiment of the invention.
FIG. 3 is a diagram of a model architecture of a convolutional visual Transformer according to an embodiment of the present invention.
Fig. 4 is a CNN14 network structure according to an embodiment of the present invention.
Fig. 5 is a block diagram of an audio enhancement module according to an embodiment of the invention.
Fig. 6 is a diagram of a sound separation network according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of a feature fusion method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary in nature and is not intended to limit the scope of the invention or its application.
Abbreviations and key term definitions:
SE: Sound Enhancement, which refers to selectively enhancing certain sounds, highlighting key sound information, and suppressing interference from irrelevant noise.
ASC: Audio Scene Classification, which refers to judging the scene in which a sound clip was recorded.
CNN: Convolutional Neural Network, generally a multi-layer neural network composed of convolutional layers and fully connected layers; a common architecture in deep learning.
LSTM: Long Short-Term Memory network, an implementation of the recurrent neural network (RNN).
Transformer: generally a multi-layer neural network based on self-attention; a common architecture in deep learning.
Referring to figs. 1 to 7, in a preferred embodiment, the invention provides a scene adaptive audio enhancement system based on a neural network, which includes a multi-modal scene feature extraction module (a scene classification module based on audio-image multi-modality) and an audio enhancement module based on CNN and LSTM. The multi-modal scene features are encoded into the audio enhancement module so that corresponding enhanced audio is generated from different scene information. As shown in fig. 1, in this scene adaptive hearing aid audio enhancement scheme, the scene features extracted by the audio-image multi-modal scene classification module are fused into the audio enhancement module to generate enhanced audio appropriate to the scene.
Multi-modal scene feature extraction module
Most existing scene analysis methods rely on single-modality information, such as image scene classification or acoustic scene classification, to extract scene features. Although a single-modality scene classification system can distinguish scenes using images or sounds alone, adding multi-modal information strengthens the judgment of scene information. We propose a scene feature extraction scheme based on multi-modal feature fusion, which uses audio and image features jointly to improve perception of the scene.
Fig. 2 shows the scene classification system based on multi-modal feature fusion and its workflow: (a) image scene feature extractor: a convolutional vision Transformer model extracts visual features; (b) audio feature extraction model: a convolutional neural network extracts audio features; (c) feature fusion classification network: a multilayer perceptron performs feature fusion and classification.
The multi-modal scene feature extraction module consists of three parts: an image feature extractor based on a convolutional vision Transformer, an audio feature extractor based on a convolutional neural network, and a feature fusion network with a scene classifier based on a multilayer perceptron. The captured scene image and audio are fed to the image and audio scene feature extractors, respectively. The feature fusion network and the scene classifier predict the scene category from the obtained image and audio features. We extract scene audio-visual features that fuse image and audio information and use them as auxiliary scene information for the audio enhancement system. The details of each part are described in turn below.
Image feature extraction model
Due to the limited computing power and battery size of hearing aids, we process image information on a smartphone connected to the hearing aid, and only the image features are sent to the hearing aid side to assist sound enhancement. Image scene features can be extracted with a convolutional vision Transformer (CvT) model, which combines the advantages of convolutional neural networks and vision Transformers: it can extract local features while retaining attention to global information. The image scene feature extraction model is pre-trained on the large scene classification dataset Places365, giving it the ability to recognize complex scenes. We extract the model's classification token as the image scene feature.
FIG. 3 shows the model architecture of the convolutional vision Transformer. The model is divided into three stages; each stage applies convolution operations to the feature map, fusing local features while shortening the image token sequence. At the beginning of the third stage, a classification token that aggregates global information of the picture is inserted at the start of the sequence, and this token is finally used as the picture feature.
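As an illustration of this kind of model, the following PyTorch sketch shows a three-stage convolutional vision Transformer that returns its classification token as the image feature. All layer widths, depths, and the 112 x 112 input size are assumptions for illustration, not the configuration of the invention or of the published CvT model.

import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """One stage: convolutional token embedding followed by Transformer layers."""
    def __init__(self, in_ch, out_ch, stride, depth, heads):
        super().__init__()
        # The strided convolution fuses local features and shortens the token sequence.
        self.embed = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=out_ch, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                            # x: (B, C, H, W)
        x = self.embed(x)                            # (B, out_ch, H', W')
        b, c, h, w = x.shape
        tokens = self.encoder(x.flatten(2).transpose(1, 2))   # (B, H'*W', out_ch)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class TinyCvT(nn.Module):
    def __init__(self, num_scenes=10):
        super().__init__()
        self.stage1 = ConvStage(3, 64, stride=4, depth=1, heads=1)
        self.stage2 = ConvStage(64, 192, stride=2, depth=2, heads=3)
        # Stage 3 keeps the tokens flat so a classification token can be prepended.
        self.embed3 = nn.Conv2d(192, 384, kernel_size=3, stride=2, padding=1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, 384))
        layer = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
        self.stage3 = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(384, num_scenes)

    def forward(self, img):                          # img: (B, 3, 112, 112)
        x = self.stage2(self.stage1(img))
        x = self.embed3(x).flatten(2).transpose(1, 2)            # (B, N, 384)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.stage3(torch.cat([cls, x], dim=1))
        img_feature = x[:, 0]                        # classification token as image feature
        return img_feature, self.head(img_feature)

feature, scene_logits = TinyCvT()(torch.randn(1, 3, 112, 112))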
Audio feature extraction model
As shown in fig. 4, a convolutional neural network is used to extract audio features. CNN14 is an audio classification model; by stacking 3 x 3 convolution kernels it attends to different scales of the spectrum and classifies audio accurately. Fig. 4 shows the convolutional neural network CNN14 that we use to extract audio features. The audio scene feature extraction model is first trained on the large AudioSet dataset to learn audio representations. We extract the output of the penultimate fully connected layer as the audio feature.
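The sketch below mimics this design: stacked 3 x 3 convolution blocks over a log-mel spectrogram, with the output of the penultimate fully connected layer taken as the audio feature. The block widths, pooling scheme, and input shape are illustrative assumptions rather than the exact CNN14 configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Two stacked 3 x 3 convolutions followed by 2 x 2 average pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(out_ch), nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        return F.avg_pool2d(x, 2)

class CNN14Like(nn.Module):
    def __init__(self, n_scenes=10):
        super().__init__()
        widths = [64, 128, 256, 512, 1024, 2048]
        self.blocks = nn.ModuleList(
            ConvBlock(cin, cout) for cin, cout in zip([1] + widths[:-1], widths))
        self.fc1 = nn.Linear(2048, 2048)        # penultimate layer: its output is the audio feature
        self.fc2 = nn.Linear(2048, n_scenes)    # scene classification head

    def forward(self, logmel):                  # logmel: (B, 1, frames, mel_bins)
        x = logmel
        for blk in self.blocks:
            x = blk(x)
        x = x.mean(dim=3)                       # average over the mel axis
        x = x.max(dim=2).values + x.mean(dim=2) # pool over time
        audio_feature = F.relu(self.fc1(x))
        return audio_feature, self.fc2(audio_feature)

audio_feature, logits = CNN14Like()(torch.randn(1, 1, 128, 64))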
Feature fusion network and scene classifier
We use a multilayer perceptron model to build the feature fusion network and the scene classifier. The image features and audio features extracted by the pre-trained models are passed through the feature fusion network and the scene classifier to predict the scene category. After the network is trained, the output of the feature fusion network is extracted as the audio-visual feature of the scene and provided to the audio enhancement system.
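A minimal sketch of such a fusion network and classifier follows; the feature dimensions (384-d image feature, 2048-d audio feature, 256-d fused feature, 10 scene classes) are assumed values for illustration.

import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Fuse image and audio features with an MLP; the fused vector doubles as the scene feature."""
    def __init__(self, img_dim=384, audio_dim=2048, fused_dim=256, n_scenes=10):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + audio_dim, 512), nn.ReLU(),
            nn.Linear(512, fused_dim), nn.ReLU())
        self.classifier = nn.Linear(fused_dim, n_scenes)

    def forward(self, img_feat, audio_feat):
        fused = self.fuse(torch.cat([img_feat, audio_feat], dim=-1))
        return fused, self.classifier(fused)    # fused vector is passed to the audio enhancement module

scene_feature, scene_logits = FusionClassifier()(torch.randn(2, 384), torch.randn(2, 2048))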
Audio enhancement module
The core function of the audio enhancement module is to separate the different kinds of sounds in the input audio and, on that basis, enhance certain sound types in a targeted way. In the invention, the input audio is first converted to a spectrogram by a short-time Fourier transform, and the spectrogram is then decomposed into amplitude and phase. The audio enhancement module must decide which sound types to separate according to the input scene features: the amplitude spectrum and the scene features obtained from the scene classification module are fed together into the sound separation network, which generates the separated amplitude spectrum and thus determines how much each part of the original amplitude spectrum is enhanced or suppressed. Finally, the audio enhancement module combines the phase information and restores a one-dimensional audio signal, completing the audio enhancement. The overall structure of the audio enhancement module is shown in fig. 5.
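A sketch of this processing chain is shown below in PyTorch. Here separation_net stands for the sound separation network described in the next subsection (any callable mapping the amplitude spectrum and scene feature to a mask of the same shape), and the STFT parameters are assumed values.

import torch

def enhance(audio, scene_feat, separation_net, n_fft=512, hop=128):
    """Enhance a batch of waveforms (B, samples) conditioned on scene features."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(audio, n_fft, hop_length=hop, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()       # decompose into amplitude and phase
    # The separation network decides how much each time-frequency bin of the
    # original amplitude spectrum is enhanced or suppressed.
    mask = separation_net(mag, scene_feat)      # same shape as mag, values in [0, 1]
    enhanced = torch.polar(mag * mask, phase)   # recombine with the original phase
    return torch.istft(enhanced, n_fft, hop_length=hop, window=window,
                       length=audio.shape[-1])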
Sound separation network
Because the invention targets hearing aids, the sound separation method must meet strict real-time requirements, so a CNN- and LSTM-based sound separation network [9] with a structure similar to U-Net [8] is chosen; the structure is shown in fig. 6. The sound separation network consists of three parts: an encoder, a decoder, and a bottleneck layer. Convolution operations in the encoder progressively reduce the size of the input spectrogram, deconvolution operations in the decoder restore the spectrogram to its original size, and skip connections pass data directly between corresponding layers of the encoder and decoder. As a result, the finally recovered spectrogram fuses features of several different sizes, yielding a better reconstruction. The LSTM in the bottleneck layer helps the network use long-term past information for sound separation.
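The following is a small sketch of a network of this shape: a convolutional encoder that halves the frequency axis at each layer, an LSTM bottleneck over time, and a deconvolutional decoder with skip connections that outputs a mask. Channel counts, depth, and the 256-bin frequency axis are assumptions, and the scene-feature injection is shown separately in the next subsection.

import torch
import torch.nn as nn

class SeparationNet(nn.Module):
    """U-Net-style mask estimator with an LSTM bottleneck (channel counts assumed)."""
    def __init__(self, freq_bins=256):
        super().__init__()
        chs = [1, 16, 32, 64]
        self.enc = nn.ModuleList(
            nn.Conv2d(cin, cout, 3, stride=(1, 2), padding=1)        # halve the frequency axis
            for cin, cout in zip(chs[:-1], chs[1:]))
        self.dec = nn.ModuleList(
            nn.ConvTranspose2d(2 * cout, cin, 3, stride=(1, 2),      # skip-concatenated input
                               padding=1, output_padding=(0, 1))
            for cin, cout in zip(chs[:-1], chs[1:]))
        # freq_bins is assumed divisible by 2**len(enc), e.g. crop 257 STFT bins to 256.
        bott = chs[-1] * (freq_bins // 2 ** len(self.enc))           # 64 * 32 = 2048
        self.lstm = nn.LSTM(bott, bott, batch_first=True)

    def forward(self, mag):                      # mag: (B, 1, frames, freq_bins)
        skips, x = [], mag
        for conv in self.enc:
            x = torch.relu(conv(x))
            skips.append(x)
        b, c, t, f = x.shape
        seq, _ = self.lstm(x.permute(0, 2, 1, 3).reshape(b, t, c * f))   # temporal context
        x = seq.reshape(b, t, c, f).permute(0, 2, 1, 3)
        decs = list(self.dec)[::-1]
        for i, (deconv, skip) in enumerate(zip(decs, skips[::-1])):
            x = deconv(torch.cat([x, skip], dim=1))                  # skip connection
            if i < len(decs) - 1:
                x = torch.relu(x)
        return torch.sigmoid(x)                  # per-bin enhancement/suppression mask

mask = SeparationNet()(torch.randn(2, 1, 100, 256))                  # (2, 1, 100, 256)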
Fusing scene features
To fuse the scene features into the sound separation network, as in the feature fusion method shown in fig. 7, the obtained scene features are first converted into the corresponding dimensions by a linear transformation and then added to the output of each layer in the encoder and decoder. Trained in this way, the scene features control the output of the sound separation network, so the network produces different separation results for different scenes.
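One possible realization of this fusion step is sketched below: each encoder or decoder layer's output receives an additive per-channel bias obtained by linearly projecting the scene feature. The scene feature dimension and layer width are assumed values.

import torch
import torch.nn as nn

class SceneConditionedLayer(nn.Module):
    """Wrap a conv layer so that a projected scene feature is added to its output."""
    def __init__(self, conv, scene_dim, out_channels):
        super().__init__()
        self.conv = conv
        self.proj = nn.Linear(scene_dim, out_channels)   # scene feature -> per-channel bias

    def forward(self, x, scene_feat):
        y = self.conv(x)
        bias = self.proj(scene_feat)                      # (B, C)
        return y + bias[:, :, None, None]                 # broadcast over time and frequency

layer = SceneConditionedLayer(nn.Conv2d(1, 16, 3, padding=1), scene_dim=256, out_channels=16)
out = layer(torch.randn(2, 1, 100, 256), torch.randn(2, 256))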
Hardware environment
The hardware environment required by this scheme can be divided into three parts: audio acquisition, image acquisition, and signal processing. Audio acquisition can rely on the hearing aid's microphone, while image acquisition requires the hearing aid to be equipped with a camera module. The signal processing part can be realized in two ways: (1) process on the computing chip carried by the hearing aid and play the result directly. This is simple to implement, but because real-time neural network computation places heavy demands on hardware resources, performance, cost and power consumption must be balanced. (2) Connect the hearing aid and a smartphone wirelessly via Bluetooth, Wi-Fi or similar technologies; the hearing aid sends the captured audio and image signals to the phone, which processes them with its stronger computing power and returns the enhanced audio to the hearing aid for playback. This reduces the cost of the hearing aid chip and improves battery life, and a companion phone app can add further functions, such as manually switching the enhancement mode or adjusting the hearing aid's loudness compensation.
The invention has at least the following advantages:
1. The audio enhancement module can adaptively change the degree of enhancement of different sounds according to the output of the scene feature extraction module, saving storage space and speeding up audio enhancement.
2. Speech enhancement is extended to general audio enhancement, greatly broadening the practicality of the hearing aid.
3. The scene classification module and the audio enhancement module can work simultaneously, further increasing the running speed of the whole system.
In embodiments of the invention, the scene feature extraction module may also accept input of only single-modality data, performing the scene classification task with audio or images alone.
The audio enhancement module of embodiments of the invention may be replaced with a variety of different structures.
The invention may adopt a variety of different ways of fusing scene features.
The background of the present invention may contain background information related to the problem or environment of the present invention and does not necessarily describe the prior art. Accordingly, the inclusion in the background section is not an admission of prior art by the applicant.
The foregoing is a more detailed description of the invention in connection with specific/preferred embodiments and is not intended to limit the practice of the invention to those descriptions. It will be apparent to those skilled in the art that various substitutions and modifications can be made to the described embodiments without departing from the spirit of the invention, and these substitutions and modifications should be considered to fall within the scope of the invention. In the description herein, references to the description of the term "one embodiment," "some embodiments," "preferred embodiments," "an example," "a specific example," or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction. Although embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the claims.

Claims (10)

1. A scene self-adaptive hearing aid audio enhancement system based on a neural network, characterized by comprising a multi-modal scene feature extraction module and a neural-network-based audio enhancement module, wherein the multi-modal scene feature extraction module extracts multi-modal scene features of a scene, the multi-modal scene features comprising audio and image features of the scene; and the audio enhancement module performs audio enhancement using the encoded multi-modal scene features, so that enhanced audio corresponding to the scene information is generated, the joint use of audio and image features improving perception of the scene.
2. The scene adaptive hearing aid audio enhancement system of claim 1, wherein the multi-modal scene feature extraction module comprises a convolutional visual Transformer-based image feature extractor for extracting image features of a scene, a convolutional neural network-based audio feature extractor for extracting audio features of the scene, and a multi-layer perceptron-based feature fusion network and a scene classifier for performing feature fusion on the obtained image features and audio features, and the scene classifier performs prediction on scene categories.
3. The scene adaptive hearing aid audio enhancement system of claim 2, wherein the convolution vision Transformer model of the image feature extractor is divided into three stages, local features are fused and the length of the image sequence is reduced by performing convolution operation on a feature map, and finally the classification token output by the model is used as the image feature.
4. The scene adaptive hearing aid audio enhancement system according to claim 2 or 3, wherein the audio feature extractor adopts a convolutional neural network CNN14 to extract audio features, and the audio scene is classified by focusing on different scales of frequency spectrum through stacking of 3 x 3 convolutional kernels; preferably, the output of the penultimate fully-connected layer is extracted as an audio feature.
5. The scene adaptive hearing aid audio enhancement system according to any one of claims 2 to 4, wherein the feature fusion network and scene classifier are constructed using a multi-layered perceptron model, the feature fusion network and scene classifier deriving predictions of scene classes from extracted image features and audio features, the output of the feature fusion network being provided to the audio enhancement module as audiovisual features of a scene.
6. The scene adaptive hearing aid audio enhancement system according to any one of claims 1 to 5, wherein the audio enhancement module applies a short-time Fourier transform to the input audio to obtain a spectrogram and decomposes it into amplitude and phase, wherein the amplitude spectrum is input to the sound separation network of the audio enhancement module together with the scene features obtained from the multi-modal scene feature extraction module, and the sound separation network generates a separated amplitude spectrum, thereby determining the degree of enhancement and suppression for different parts of the original amplitude spectrum; the audio enhancement module then combines the phase information and restores a one-dimensional audio signal, completing the audio enhancement.
7. The context adaptive hearing aid audio enhancement system of any one of claims 1 to 5, wherein the audio enhancement module employs a CNN and LSTM based sound separation network.
8. The scene adaptive hearing aid audio enhancement system of claim 7, wherein the sound separation network comprises an encoder, a decoder, and a bottleneck layer, wherein a convolution operation in the encoder progressively reduces the input spectrogram size, a deconvolution operation in the decoder restores the spectrogram back to its original size, and a direct flow of data exists between corresponding layers of the encoder and the decoder, whereby the finally restored spectrogram fuses a plurality of features of different sizes, and wherein the LSTM of the bottleneck layer enables the network to perform sound separation using long-term past information.
9. The scene adaptive hearing aid audio enhancement system of claim 8, wherein the audio enhancement module first transforms the obtained scene features into corresponding dimensions through a linear transformation, then adds them to the output of each layer in the encoder and the decoder, trained such that the scene features can control the output of the sound separation network to output different separation results for different scenes.
10. A neural network based scene adaptive hearing aid audio enhancement method, characterized in that the scene audio is enhanced by using the neural network based scene adaptive hearing aid audio enhancement system according to any one of claims 1 to 9.
CN202111565538.3A 2021-12-20 2021-12-20 Scene self-adaptive hearing aid audio enhancement system based on neural network Active CN114245280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111565538.3A CN114245280B (en) 2021-12-20 2021-12-20 Scene self-adaptive hearing aid audio enhancement system based on neural network

Publications (2)

Publication Number Publication Date
CN114245280A (en) 2022-03-25
CN114245280B (en) 2023-06-23

Family

ID=80759625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111565538.3A Active CN114245280B (en) 2021-12-20 2021-12-20 Scene self-adaptive hearing aid audio enhancement system based on neural network

Country Status (1)

Country Link
CN (1) CN114245280B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116367063A (en) * 2023-04-23 2023-06-30 郑州大学 Bone conduction hearing aid equipment and system based on embedded
CN116432703A (en) * 2023-06-12 2023-07-14 成都大学 Pulse height estimation method, system and terminal based on composite neural network model

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN109859767A (en) * 2019-03-06 2019-06-07 哈尔滨工业大学(深圳) A kind of environment self-adaption neural network noise-reduction method, system and storage medium for digital deaf-aid
CN110322900A (en) * 2019-06-25 2019-10-11 深圳市壹鸽科技有限公司 A kind of method of phonic signal character fusion
CN110782878A (en) * 2019-10-10 2020-02-11 天津大学 Attention mechanism-based multi-scale audio scene recognition method
CN110827813A (en) * 2019-10-18 2020-02-21 清华大学深圳国际研究生院 Stress detection method and system based on multi-modal characteristics
CN111539449A (en) * 2020-03-23 2020-08-14 广东省智能制造研究所 Sound source separation and positioning method based on second-order fusion attention network model
CN111539445A (en) * 2020-02-26 2020-08-14 江苏警官学院 Object classification method and system based on semi-supervised feature fusion
CN112951258A (en) * 2021-04-23 2021-06-11 中国科学技术大学 Audio and video voice enhancement processing method and model
CN112967713A (en) * 2021-01-23 2021-06-15 西安交通大学 Audio-visual voice recognition method, device, equipment and storage medium based on multi-modal fusion
CN113035227A (en) * 2021-03-12 2021-06-25 山东大学 Multi-modal voice separation method and system
CN113128527A (en) * 2021-06-21 2021-07-16 中国人民解放军国防科技大学 Image scene classification method based on converter model and convolutional neural network
CN113470671A (en) * 2021-06-28 2021-10-01 安徽大学 Audio-visual voice enhancement method and system by fully utilizing visual and voice connection
CN113593601A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Audio-visual multi-modal voice separation method based on deep learning
CN113611318A (en) * 2021-06-29 2021-11-05 华为技术有限公司 Audio data enhancement method and related equipment
CN113782048A (en) * 2021-09-24 2021-12-10 科大讯飞股份有限公司 Multi-modal voice separation method, training method and related device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KE TAN, DELIANG WANG: "A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement", pages 3229-3233 *
朱宸都: "Research on ICRNN-GRU-based abnormal audio event detection and enhancement algorithms" (in Chinese), pages 136-290 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116367063A (en) * 2023-04-23 2023-06-30 郑州大学 Bone conduction hearing aid equipment and system based on embedded
CN116367063B (en) * 2023-04-23 2023-11-14 郑州大学 Bone conduction hearing aid equipment and system based on embedded
CN116432703A (en) * 2023-06-12 2023-07-14 成都大学 Pulse height estimation method, system and terminal based on composite neural network model
CN116432703B (en) * 2023-06-12 2023-08-29 成都大学 Pulse height estimation method, system and terminal based on composite neural network model

Also Published As

Publication number Publication date
CN114245280B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN110459237B (en) Voice separation method, voice recognition method and related equipment
CN111243620B (en) Voice separation model training method and device, storage medium and computer equipment
Li et al. CBLDNN-based speaker-independent speech separation via generative adversarial training
CN111128197B (en) Multi-speaker voice separation method based on voiceprint features and generation confrontation learning
CN114245280B (en) Scene self-adaptive hearing aid audio enhancement system based on neural network
WO2019222759A1 (en) Recurrent multimodal attention system based on expert gated networks
CN113158727A (en) Bimodal fusion emotion recognition method based on video and voice information
WO2022048239A1 (en) Audio processing method and device
Yu et al. A two-stage complex network using cycle-consistent generative adversarial networks for speech enhancement
CN113593601A (en) Audio-visual multi-modal voice separation method based on deep learning
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
US20230335148A1 (en) Speech Separation Method, Electronic Device, Chip, and Computer-Readable Storage Medium
KR102174189B1 (en) Acoustic information recognition method and system using semi-supervised learning based on variational auto encoder model
CN112562698B (en) Power equipment defect diagnosis method based on fusion of sound source information and thermal imaging characteristics
JP2020134657A (en) Signal processing device, learning device, signal processing method, learning method and program
CN111883105B (en) Training method and system for context information prediction model of video scene
CN113611318A (en) Audio data enhancement method and related equipment
CN110020596B (en) Video content positioning method based on feature fusion and cascade learning
CN113643688B (en) Mongolian voice feature fusion method and device
CN113327631B (en) Emotion recognition model training method, emotion recognition method and emotion recognition device
Ngo et al. Sound context classification based on joint learning model and multi-spectrogram features
Jiang et al. An integrated convolutional neural network with a fusion attention mechanism for acoustic scene classification
Shen Application of transfer learning algorithm and real time speech detection in music education platform
WO2020068401A1 (en) Audio watermark encoding/decoding
CN116705013B (en) Voice wake-up word detection method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant