WO2021077247A1 - Cochlear implant signal processing method, apparatus, and computer-readable storage medium - Google Patents

Cochlear implant signal processing method, apparatus, and computer-readable storage medium

Info

Publication number
WO2021077247A1
WO2021077247A1 (PCT/CN2019/112174)
Authority
WO
WIPO (PCT)
Prior art keywords
deep neural
neural network
training
envelope
network
Prior art date
Application number
PCT/CN2019/112174
Other languages
English (en)
French (fr)
Inventor
郑能恒
史裕鹏
康迂勇
张伟
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Priority to PCT/CN2019/112174
Publication of WO2021077247A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 — Speaker identification or verification techniques
    • G10L 17/04 — Training, enrolment or model building
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 — Speaker identification or verification techniques
    • G10L 17/18 — Artificial neural networks; Connectionist approaches
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04R — LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 25/00 — Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception

Definitions

  • The present invention relates to the technical field of signal processing, and in particular to a cochlear implant signal processing method, apparatus, and computer-readable storage medium.
  • A cochlear implant (CI) is an auditory prosthesis mainly used to provide speech perception for deaf patients with severe peripheral hearing damage (such as inner-ear hair cell necrosis).
  • At present, the most advanced CI devices enable implantees to achieve speech perception comparable to that of normal-hearing listeners in a quiet acoustic environment. However, background noise in real life (such as environmental noise or multi-talker conversation) severely degrades the speech perception of CI users.
  • The main purpose of the embodiments of the present invention is to provide a cochlear implant signal processing method, apparatus, and computer-readable storage medium that at least solve the problems of the noise reduction algorithms used in the related art: low processing efficiency, high power consumption, limited noise reduction effect, and poor compatibility with CI processing strategies.
  • The first aspect of the embodiments of the present invention provides a deep-learning-based cochlear implant signal processing method applied to a cochlear implant device, the method including:
  • obtaining a training speech signal, preprocessing it, inputting it to an envelope extraction network, and training the envelope extraction network, where the envelope extraction network includes a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence; the first deep neural network is used to extract high-dimensional features from the input features, the second deep neural network is used to estimate the features of the enhanced training speech signal, and the third deep neural network is used to extract, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body;
  • inputting the collected real-time speech signal, after preprocessing, to the trained envelope extraction network, and extracting a number of channel envelopes corresponding to the number of implanted electrodes;
  • sequentially performing nonlinear compression, channel selection, electrode mapping, and pulse modulation on the channel envelopes extracted from the real-time speech signal, and then outputting a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
  • The second aspect of the embodiments of the present invention provides a deep-learning-based cochlear implant signal processing apparatus applied to a cochlear implant device, the apparatus including:
  • a training module, configured to obtain a training speech signal, preprocess it, input it to an envelope extraction network, and train the envelope extraction network, where the envelope extraction network includes a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence; the first deep neural network extracts high-dimensional features from the input features, the second deep neural network estimates the features of the enhanced training speech signal, and the third deep neural network extracts, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body;
  • an extraction module, configured to preprocess the collected real-time speech signal, input it to the trained envelope extraction network, and extract a number of channel envelopes corresponding to the number of implanted electrodes;
  • a processing module, configured to sequentially perform nonlinear compression, channel selection, electrode mapping, and pulse modulation on the channel envelopes extracted from the real-time speech signal, and then output a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
  • A third aspect of the embodiments of the present invention provides a cochlear implant device including a processor, a memory, and a communication bus;
  • the communication bus is used to implement connection and communication between the processor and the memory;
  • the processor is configured to execute one or more programs stored in the memory to implement the steps of any one of the aforementioned cochlear implant signal processing methods.
  • A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement the steps of any one of the aforementioned cochlear implant signal processing methods.
  • Accordingly, a training speech signal is obtained, preprocessed, and input to the envelope extraction network to train it, where the envelope extraction network includes a first, a second, and a third deep neural network connected in sequence; the collected real-time speech signal is preprocessed and input to the trained envelope extraction network, which extracts a number of channel envelopes corresponding to the number of implanted electrodes; nonlinear compression, channel selection, electrode mapping, and pulse modulation are applied in sequence to the extracted channel envelopes, and a target number of electrode stimulation signals are output to the corresponding number of implanted electrodes.
  • The lightweight, low-complexity envelope extraction network provided by the present invention effectively reduces power consumption, improves processing efficiency and noise reduction performance, and ensures seamless integration of CI signal processing and noise reduction.
  • FIG. 1 is a schematic diagram of the basic flow of the cochlear implant signal processing method provided by the first embodiment of the present invention;
  • FIG. 2 is a schematic flowchart of the network training method provided by the first embodiment of the present invention;
  • FIG. 3 is a schematic diagram of the training of the envelope extraction network provided by the first embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of the cochlear implant signal processing apparatus provided by the second embodiment of the present invention;
  • FIG. 5 is a schematic structural diagram of the cochlear implant device provided by the third embodiment of the present invention.
  • To address the above problems, this embodiment proposes a cochlear implant signal processing method applied to a cochlear implant device. FIG. 1 is a basic flow diagram of the cochlear implant signal processing method provided in this embodiment, which includes the following steps:
  • Step 101: Obtain a training speech signal, preprocess it, and input it to the envelope extraction network to train the envelope extraction network.
  • Specifically, the envelope extraction network in this embodiment includes a first deep neural network (DNN1), a second deep neural network (DNN2), and a third deep neural network (DNN3) connected in sequence.
  • The first deep neural network may preferably be a long short-term memory network (LSTM) and is used to extract high-dimensional features from the input features; the second deep neural network is used to estimate the features of the enhanced training speech signal; the third deep neural network is used to extract, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body.
  • It should be understood that, in practical applications, the features in this embodiment may be frequency-domain features (such as the log-magnitude spectrum or the magnitude spectrum) or time-domain features.
  • Mainstream CI devices on the market consist of two parts: an implanted component and an external unit.
  • The CI signal processing system of this embodiment is preferably placed in the external unit, and the number of implanted electrodes of the CI product of this embodiment may preferably be 22.
  • This embodiment uses the envelope extraction network to extract the subband signal envelopes of a number of channels equal to the number of electrodes implanted by an actual CI product, so that the envelopes contain richer detail of the original sound.
  • It should also be noted that, in practical applications, the training speech signal may be a ready-made training speech sample, e.g., obtained directly from a preset sample database, or may be recorded by the user; this embodiment imposes no unique limitation here.
  • In an optional implementation of this embodiment, obtaining the training speech signal includes: randomly selecting a target number of clean speech samples from a preset speech database, and selecting preset types of noise samples from a preset noise set; and, based on the clean speech samples and the noise samples, generating training speech signals at preset signal-to-noise ratios.
  • Specifically, this embodiment constructs the training speech samples by selecting a suitable speech database and noise database.
  • For example, 2500 sentences can be randomly selected from the training set of the Tsinghua Chinese speech database to form the clean speech sample set for the envelope extraction network of this embodiment, and two types of noise, white noise and babble, can be selected from the NOISEX-92 noise set as the noise sample set.
  • The 2500 sentences of speech are then randomly combined with the noise at signal-to-noise ratios of -5 dB, 0 dB, and 5 dB, plus a noise-free condition, to generate the noisy training speech signals used to train the envelope extraction network; a sketch of such mixing follows.
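  • The patent gives no mixing code; the following is a minimal sketch of SNR-controlled mixing for mono NumPy signals, where the function name mix_at_snr and the random choice of noise segment are illustrative assumptions, not from the source:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean utterance with a noise sample at a target SNR in dB."""
    # Tile the noise if it is shorter than the utterance, then pick a random segment.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    start = np.random.randint(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Mixtures at the three stated SNRs; the noise-free condition is the clean sample itself.
# mixtures = [mix_at_snr(clean, noise, snr) for snr in (-5.0, 0.0, 5.0)]
```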
  • FIG. 2 is a schematic flowchart of the network training method provided by this embodiment. In an optional implementation of this embodiment, preprocessing the training speech signal, inputting it to the envelope extraction network, and training the envelope extraction network specifically include the following steps:
  • Step 1011: Preprocess the training speech signal to obtain features of a preset number of consecutive frames;
  • Step 1012: Input the features of the preset number of consecutive frames to the first deep neural network containing 128 neurons for high-dimensional feature extraction;
  • Step 1013: Pass the output of the first deep neural network through a second deep neural network composed of two fully connected layers of 512 neurons each and a linear layer of 65 neurons, to estimate the features of the enhanced training speech signal;
  • Step 1014: Pass the output of the second deep neural network through a third deep neural network composed of a fully connected layer of 256 neurons and a linear layer of 22 neurons, to extract a number of channel envelopes corresponding to the number of electrodes implanted in the body;
  • Step 1015: Use the backpropagation algorithm to optimize the parameters of the envelope extraction network, iterating the training until the envelope extraction network converges, to obtain the trained envelope extraction network.
  • Specifically, in this embodiment, the noisy speech and its corresponding clean speech samples can each be preprocessed to obtain short-time Fourier transform log-power spectrum (LPS) features (8 ms per frame, with a 1 ms frame shift); a feature-extraction sketch follows.
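  • Reading these parameters concretely, a minimal LPS extraction sketch is given below; the 16 kHz sampling rate and the Hann window are assumptions (the source states only the frame length and shift), under which an 8 ms frame is 128 samples and the one-sided spectrum has 65 bins, consistent with the 65-dimensional features used next:

```python
import numpy as np

def lps_features(signal: np.ndarray, fs: int = 16000) -> np.ndarray:
    """Short-time Fourier log-power spectrum (LPS): 8 ms frames with a 1 ms shift."""
    frame_len = int(0.008 * fs)   # 128 samples at 16 kHz
    hop = int(0.001 * fs)         # 16 samples at 16 kHz
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=frame_len, axis=1)  # 65 bins for a 128-point FFT
    return np.log(np.abs(spec) ** 2 + 1e-10)         # shape (n_frames, 65)
```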
  • Considering the correlation between adjacent speech frames, the input to the envelope extraction network of this embodiment may be a block of 25 consecutive frames: the LPS features of 25 consecutive frames of noisy speech (dimension 25 × 65) are input to a single unidirectional DNN1 layer (such as an LSTM).
  • The output of DNN1 first passes through DNN2, which outputs the estimated LPS (dimension 25 × 65); the output of DNN2 is then fed to DNN3, which outputs the estimated 22 channel envelopes (dimension 25 × 22). The backpropagation algorithm is then used to optimize the network parameters and obtain the final network model; a sketch of the network follows.
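  • A minimal sketch of this three-stage network using the layer sizes given in Steps 1012–1014 above; the PyTorch framework and the ReLU activations on the hidden fully connected layers are assumptions, since the source names neither:

```python
import torch
import torch.nn as nn

class EnvelopeExtractionNet(nn.Module):
    """DNN1 (LSTM, 128 units) -> DNN2 (two 512-unit FC layers + 65-unit linear)
    -> DNN3 (256-unit FC layer + 22-unit linear)."""
    def __init__(self, n_freq: int = 65, n_channels: int = 22):
        super().__init__()
        self.dnn1 = nn.LSTM(input_size=n_freq, hidden_size=128, batch_first=True)
        self.dnn2 = nn.Sequential(
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_freq),        # estimated enhanced LPS
        )
        self.dnn3 = nn.Sequential(
            nn.Linear(n_freq, 256), nn.ReLU(),
            nn.Linear(256, n_channels),    # 22 channel envelopes
        )

    def forward(self, x):                  # x: (batch, 25, 65) block of LPS frames
        h, _ = self.dnn1(x)                # (batch, 25, 128)
        lps_hat = self.dnn2(h)             # (batch, 25, 65)
        env_hat = self.dnn3(lps_hat)       # (batch, 25, 22)
        return lps_hat, env_hat
```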
  • It should be noted that, during backpropagation, the values of the network parameters are adjusted through a loss function.
  • The loss function estimates how closely the predictions of the trained network model approximate the true values; minimizing it is a convex optimization process, and the smaller the loss function, the stronger the model's envelope extraction and processing capability.
  • After updating the network parameters according to the loss function, the iterative training process continues until the network converges, i.e., until the loss essentially stops decreasing, at which point the network model of the envelope extraction network of this embodiment is fully trained.
  • Further, in an optional implementation of this embodiment, the loss function of the envelope extraction network is expressed as:
  loss = w_stft * loss_stft + w_env * loss_env + w_waveform * loss_waveform
  • where loss_stft is the error between the features output by the second deep neural network and the features of the clean speech sample corresponding to the training speech signal; loss_env is the error between the channel envelope features extracted by the third deep neural network and the channel envelope features extracted from the clean speech sample by a traditional CI processing strategy; loss_waveform is the error between the clean speech sample and the simulated speech signal obtained from the channel envelopes extracted by the third deep neural network after electrode mapping and related processing; and w_stft, w_env, and w_waveform are the weighting factors corresponding to the respective errors. It should be understood that each of the above errors may preferably be an L1-norm error; a loss sketch follows.
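  • A minimal sketch of this three-term loss, assuming L1 errors throughout and unit weights as placeholders (the patent leaves the weighting factors adjustable); env_clean stands for the envelope a traditional strategy such as ACE extracts from the clean speech, and wav_hat for the resynthesized waveform, both computed outside this function:

```python
import torch.nn.functional as F

def envelope_loss(lps_hat, env_hat, wav_hat,
                  lps_clean, env_clean, wav_clean,
                  w_stft=1.0, w_env=1.0, w_waveform=1.0):
    """loss = w_stft*loss_stft + w_env*loss_env + w_waveform*loss_waveform."""
    loss_stft = F.l1_loss(lps_hat, lps_clean)      # DNN2 output vs. clean LPS
    loss_env = F.l1_loss(env_hat, env_clean)       # DNN3 envelopes vs. clean-speech envelopes
    loss_waveform = F.l1_loss(wav_hat, wav_clean)  # resynthesized vs. clean waveform
    return w_stft * loss_stft + w_env * loss_env + w_waveform * loss_waveform
```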
  • FIG. 3 shows a schematic diagram of the training of the envelope extraction network provided by this embodiment, where A represents the input training speech signal, B represents the clean speech sample used together with the noise to generate the training speech signal, and C represents the envelope extraction network.
  • Continuing with the network dimensions preferred above, in this embodiment the features of the noisy speech (dimension 25 × 65) are input to DNN1 for high-dimensional feature extraction, and the output of DNN1 passes through DNN2 to produce the estimated LPS features (dimension 25 × 65).
  • The loss_stft can then be computed from the 65-dimensional LPS features output by DNN2 and the 65-dimensional LPS features of the corresponding clean speech sample.
  • The computation of loss_stft can borrow the weighted perceptual method commonly used in audio coding, the purpose being to guide the model to be less sensitive to noise near formants and more sensitive to noise near non-formant spectral valleys.
  • In addition, loss_env is computed from the 22-dimensional channel envelopes output by DNN3 and the 22-dimensional channel envelopes extracted from the corresponding clean speech samples by an existing traditional CI processing strategy such as the ACE (advanced combination encoder) strategy.
  • Furthermore, the 22-dimensional channel envelopes output by DNN3 are used to construct a simulated speech signal, and the waveform error loss_waveform between this simulated speech signal and the clean speech is computed, thereby forcing the envelope extraction network of this embodiment to learn the detail of the clean speech; this effectively overcomes the shortcoming that traditional CI strategies cannot effectively extract time-domain detail.
  • Finally, the three errors are weighted by three adjustable weighting factors and summed to form the objective function for optimizing the envelope extraction network.
  • In a preferred implementation, the entire envelope extraction network is trained for 60 epochs using the Adam gradient optimizer, and the model with the smallest validation loss is saved as the final trained envelope extraction network model; a training-loop sketch follows.
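  • A minimal training loop under the stated budget; the learning rate, the train_loader/val_loader data pipeline, and the resynthesize and validate helpers are hypothetical stand-ins for parts the source does not specify:

```python
import torch

model = EnvelopeExtractionNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
best_val = float("inf")
for epoch in range(60):                   # 60 epochs, as stated above
    model.train()
    for lps_noisy, lps_clean, env_clean, wav_clean in train_loader:
        lps_hat, env_hat = model(lps_noisy)
        wav_hat = resynthesize(env_hat)   # hypothetical electrode-mapping-based resynthesis
        loss = envelope_loss(lps_hat, env_hat, wav_hat,
                             lps_clean, env_clean, wav_clean)
        opt.zero_grad()
        loss.backward()
        opt.step()
    val = validate(model, val_loader)     # hypothetical validation-loss helper
    if val < best_val:                    # keep the model with the smallest validation loss
        best_val = val
        torch.save(model.state_dict(), "envelope_net.pt")
```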
  • It should be noted that the loss function provided by this embodiment not only guides the model, to a certain extent, in learning how a traditional CI processing strategy extracts envelope energy from the Fourier energy spectrum, but also forces the network, in both the frequency domain and the time domain, to learn to approximate the data distribution of clean speech. This indirectly gives the envelope signals output by the envelope extraction network more detail, largely overcoming the inherent inability of traditional CI processing strategies to extract fine temporal detail from the speech signal.
  • Step 102: After preprocessing, the collected real-time speech signal is input to the trained envelope extraction network, and a number of channel envelopes corresponding to the number of electrodes implanted in the body is extracted.
  • Specifically, when the speech acquisition unit of the CI device (e.g., a microphone) receives an external speech signal, it preprocesses the signal and feeds it to the trained envelope extraction network.
  • Since the preferred number of implanted electrodes in this embodiment is 22, the network outputs 22 channel envelope signals.
  • It should be noted that the envelope extraction network model of this embodiment is much more compact than existing algorithm models: its size is only about 1.9 MB, the number of network parameters is 0.46M, and the system complexity is significantly reduced; decoding each frame (8 ms) takes only about 0.1–0.2 ms on average.
  • Since the total parameter count and computational complexity of the network model are greatly reduced, power consumption is reduced accordingly (both memory and CPU usage are very small), which ensures the feasibility of applying the envelope extraction network model of this embodiment in actual CI products; a rough parameter count is sketched below.
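  • As a rough sanity check on the quoted figure, a back-of-the-envelope tally of the architecture described above (input dimension 65 assumed; bias conventions vary slightly between frameworks, so the count is approximate):

```python
lstm = 4 * ((65 + 128) * 128 + 128)                 # DNN1, one LSTM layer: ~99k
dnn2 = 128*512 + 512 + 512*512 + 512 + 512*65 + 65  # two FC layers + linear: ~362k
dnn3 = 65*256 + 256 + 256*22 + 22                   # one FC layer + linear: ~23k
print(lstm + dnn2 + dnn3)                           # ~484k, the same order as the quoted 0.46M
```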
  • Step 103: Perform nonlinear compression, channel selection, electrode mapping, and pulse modulation in sequence on the channel envelopes extracted from the real-time speech signal, and then output a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
  • It should be noted that, during channel selection, this embodiment can select the N envelope signals with the largest energy and/or the highest signal-to-noise ratio for electrode mapping; for example, when the total number of implanted electrodes is 22, 8 channels can be selected, and the pulse-modulated electrical stimulation signals are output to the corresponding 8 implanted electrodes, as sketched below.
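  • A minimal sketch of such N-of-M channel selection, keeping the highest-energy envelopes per frame (energy-based selection only; SNR-based selection would need a per-channel noise estimate the source does not detail):

```python
import numpy as np

def select_channels(envelopes: np.ndarray, n_select: int = 8) -> np.ndarray:
    """Keep the n_select largest envelopes per frame (8 of 22 above); zero the rest."""
    out = np.zeros_like(envelopes)                      # envelopes: (n_frames, 22)
    idx = np.argsort(envelopes, axis=1)[:, -n_select:]  # indices of the top-N per frame
    rows = np.arange(envelopes.shape[0])[:, None]
    out[rows, idx] = envelopes[rows, idx]
    return out
```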
  • By exploiting the strong learning ability of the envelope extraction network of this embodiment and training it on suitably constructed noisy speech data, a good noise reduction effect can be achieved: without adding any additional front-end noise reduction module, the system already has an anti-noise capability that traditional CI processing strategies cannot offer.
  • In addition, the envelope extraction network of this embodiment can, through the second deep neural network, learn an adjustable parameterization similar to the triangular filter bank parameters in existing CI processing strategies, and error backpropagation based on the simulated speech versus the real speech optimizes the envelope extraction network so that the extracted envelopes carry more detail.
  • The speech processing effect achieved in a quiet environment is better than that of traditional CI processing strategies, and the noise reduction performance in a noisy environment is clearly better than that of traditional CI processing strategies that use Wiener filtering or some lightweight DNN as a front-end noise reduction module.
  • According to the cochlear implant signal processing method provided by this embodiment, a training speech signal is obtained, preprocessed, and input to the envelope extraction network to train it, where the envelope extraction network includes a first, a second, and a third deep neural network connected in sequence; the collected real-time speech signal is preprocessed and input to the trained envelope extraction network, which extracts a number of channel envelopes corresponding to the number of implanted electrodes; nonlinear compression, channel selection, electrode mapping, and pulse modulation are applied in sequence to the extracted channel envelopes, and a target number of electrode stimulation signals are output to the corresponding number of implanted electrodes.
  • The lightweight, low-complexity envelope extraction network provided by the present invention effectively reduces power consumption, improves processing efficiency and noise reduction performance, and ensures seamless integration of CI signal processing and noise reduction.
  • To solve the same technical problems, this embodiment shows a cochlear implant signal processing apparatus applied to a cochlear implant device; see FIG. 4. The cochlear implant signal processing apparatus in this embodiment includes:
  • a training module 401, configured to obtain a training speech signal, preprocess it, input it to the envelope extraction network, and train the envelope extraction network, where the envelope extraction network includes a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence; the first deep neural network extracts high-dimensional features from the input features, the second deep neural network estimates the features of the enhanced training speech signal, and the third deep neural network extracts, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body;
  • an extraction module 402, configured to preprocess the collected real-time speech signal, input it to the trained envelope extraction network, and extract a number of channel envelopes corresponding to the number of implanted electrodes;
  • a processing module 403, configured to sequentially perform nonlinear compression, channel selection, electrode mapping, and pulse modulation on the channel envelopes extracted from the real-time speech signal, and then output a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
  • In an optional implementation of this embodiment, when inputting the preprocessed training speech signal to the envelope extraction network and training it, the training module 401 is specifically configured to: preprocess the training speech signal to obtain features of a preset number of consecutive frames; input those features to the first deep neural network containing 128 neurons for high-dimensional feature extraction; pass the output of the first deep neural network through a second deep neural network composed of two fully connected layers of 512 neurons each and a linear layer of 65 neurons, to estimate the features of the enhanced training speech signal; pass the output of the second deep neural network through a third deep neural network composed of a fully connected layer of 256 neurons and a linear layer of 22 neurons, to extract a number of channel envelopes corresponding to the number of implanted electrodes; and use the backpropagation algorithm to optimize the network parameters, iterating until the envelope extraction network converges, to obtain the trained envelope extraction network.
  • In an optional implementation of this embodiment, when obtaining the training speech signal, the training module 401 is specifically configured to: randomly select a target number of clean speech samples from a preset speech database, and select preset types of noise samples from a preset noise set; and, based on the clean speech samples and the noise samples, generate training speech signals at preset signal-to-noise ratios.
  • Further, in an optional implementation of this embodiment, the loss function of the envelope extraction network is expressed as:
  loss = w_stft * loss_stft + w_env * loss_env + w_waveform * loss_waveform
  • where loss_stft is the error between the features output by the second deep neural network and the features of the clean speech sample corresponding to the training speech signal; loss_env is the error between the channel envelope features extracted by the third deep neural network and the channel envelope features extracted from the clean speech sample by a traditional CI processing strategy; loss_waveform is the error between the clean speech sample and the simulated speech signal obtained from the channel envelopes extracted by the third deep neural network; and w_stft, w_env, and w_waveform are the weighting factors corresponding to the respective errors.
  • It should be noted that the cochlear implant signal processing methods in the foregoing embodiments can all be implemented based on the cochlear implant signal processing apparatus provided in this embodiment. Those of ordinary skill in the art will clearly understand that, for convenience and brevity of description, the specific working process of the cochlear implant signal processing apparatus described in this embodiment can refer to the corresponding process in the foregoing method embodiment and is not repeated here.
  • With the cochlear implant signal processing apparatus provided by this embodiment, a training speech signal is obtained, preprocessed, and input to the envelope extraction network to train it, where the envelope extraction network includes a first, a second, and a third deep neural network connected in sequence; the collected real-time speech signal is preprocessed and input to the trained envelope extraction network, which extracts a number of channel envelopes corresponding to the number of implanted electrodes; nonlinear compression, channel selection, electrode mapping, and pulse modulation are applied in sequence to the extracted channel envelopes, and a target number of electrode stimulation signals are output to the corresponding number of implanted electrodes.
  • The lightweight, low-complexity envelope extraction network provided by the present invention effectively reduces power consumption, improves processing efficiency and noise reduction performance, and ensures seamless integration of CI signal processing and noise reduction.
  • This embodiment provides a cochlear implant device. As shown in FIG. 5, it includes a processor 501, a memory 502, and a communication bus 503.
  • the communication bus 503 is used to implement connection and communication between the processor 501 and the memory 502;
  • the processor 501 is configured to execute one or more computer programs stored in the memory 502 to implement at least one step in the cochlear implant signal processing method in the first embodiment.
  • This embodiment also provides a computer-readable storage medium, which includes volatile or non-volatile, removable or non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, computer program modules, or other data).
  • Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer.
  • the computer-readable storage medium in this embodiment may be used to store one or more computer programs, and the stored one or more computer programs may be executed by a processor to implement at least one step of the method in the first embodiment.
  • This embodiment also provides a computer program, which can be distributed on a computer-readable medium and executed by a computing device to implement at least one step of the method in the first embodiment; in some cases, at least one of the steps shown or described may be performed in an order different from that described in the foregoing embodiment.
  • This embodiment also provides a computer program product including a computer-readable device on which the computer program shown above is stored.
  • The computer-readable device in this embodiment may include the computer-readable storage medium shown above.
  • In addition, as is well known to those of ordinary skill in the art, communication media usually contain computer-readable instructions, data structures, computer program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. Therefore, the present invention is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurosurgery (AREA)
  • Otolaryngology (AREA)
  • Signal Processing (AREA)
  • Prostheses (AREA)

Abstract

According to the cochlear implant signal processing method, apparatus, and computer-readable storage medium disclosed in the embodiments of the present invention, a training speech signal is first obtained, preprocessed, and input to an envelope extraction network for network training, where the envelope extraction network includes three sequentially connected deep neural networks; a collected real-time speech signal is then preprocessed and input to the trained envelope extraction network, which extracts a number of channel envelopes corresponding to the number of electrodes implanted in the body; finally, the extracted channel envelopes undergo nonlinear compression, channel selection, electrode mapping, and pulse modulation in sequence, and a target number of electrode stimulation signals are output to the corresponding number of implanted electrodes. The lightweight, low-complexity envelope extraction network provided by the present invention effectively reduces power consumption, improves processing efficiency and noise reduction performance, and ensures seamless integration of CI signal processing and noise reduction.

Description

Cochlear implant signal processing method, apparatus, and computer-readable storage medium
Technical Field
The present invention relates to the technical field of signal processing, and in particular to a cochlear implant signal processing method, apparatus, and computer-readable storage medium.
Background
A cochlear implant (CI) is an auditory prosthesis mainly used to provide speech perception for deaf patients with severe peripheral hearing damage (such as inner-ear hair cell necrosis). At present, the most advanced CI devices enable implantees to achieve speech perception comparable to that of normal-hearing listeners in a quiet acoustic environment. However, background noise in real life (such as environmental noise or multi-talker conversation) severely degrades the speech perception experience of CI users.
In recent years, academia and industry have proposed many signal processing systems that combine noise reduction algorithms with traditional CI signal processing strategies to improve CI speech perception. However, on the one hand, current noise reduction algorithms have huge model parameter counts and high computational complexity, leading to low signal processing efficiency and high power consumption in practical applications; on the other hand, current noise reduction algorithms cannot reliably extract the fine temporal structure of sound, so their noise reduction effect is rather limited; in addition, when a speech signal processed by a current noise reduction algorithm is fed into the CI signal processing unit, there is no guarantee that the final output achieves the best speech perception, so the compatibility between the noise reduction algorithm and the CI processing strategy is poor.
Summary of the Invention
The main purpose of the embodiments of the present invention is to provide a cochlear implant signal processing method, apparatus, and computer-readable storage medium that can at least solve the problems of the noise reduction algorithms used in the related art: low processing efficiency, high power consumption, limited noise reduction effect, and poor compatibility with CI processing strategies.
To achieve the above purpose, a first aspect of the embodiments of the present invention provides a deep-learning-based cochlear implant signal processing method applied to a cochlear implant device, the method including:
obtaining a training speech signal, preprocessing the training speech signal, inputting it to an envelope extraction network, and training the envelope extraction network, where the envelope extraction network includes a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence, the first deep neural network being used to extract high-dimensional features from the input features, the second deep neural network being used to estimate the features of the enhanced training speech signal, and the third deep neural network being used to extract, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body;
inputting the collected real-time speech signal, after preprocessing, to the trained envelope extraction network, and extracting a number of channel envelopes corresponding to the number of electrodes implanted in the body;
sequentially performing nonlinear compression, channel selection, electrode mapping, and pulse modulation on the channel envelopes extracted from the real-time speech signal, and then outputting a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
To achieve the above purpose, a second aspect of the embodiments of the present invention provides a deep-learning-based cochlear implant signal processing apparatus applied to a cochlear implant device, the apparatus including:
a training module, configured to obtain a training speech signal, preprocess it, input it to an envelope extraction network, and train the envelope extraction network, where the envelope extraction network includes a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence, the first deep neural network being used to extract high-dimensional features from the input features, the second deep neural network being used to estimate the features of the enhanced training speech signal, and the third deep neural network being used to extract, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body;
an extraction module, configured to preprocess the collected real-time speech signal, input it to the trained envelope extraction network, and extract a number of channel envelopes corresponding to the number of electrodes implanted in the body;
a processing module, configured to sequentially perform nonlinear compression, channel selection, electrode mapping, and pulse modulation on the channel envelopes extracted from the real-time speech signal, and then output a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
To achieve the above purpose, a third aspect of the embodiments of the present invention provides a cochlear implant device including a processor, a memory, and a communication bus;
the communication bus is used to implement connection and communication between the processor and the memory;
the processor is used to execute one or more programs stored in the memory to implement the steps of any one of the above cochlear implant signal processing methods.
To achieve the above purpose, a fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of any one of the above cochlear implant signal processing methods.
According to the cochlear implant signal processing method, apparatus, and computer-readable storage medium provided by the embodiments of the present invention, a training speech signal is obtained, preprocessed, and input to the envelope extraction network to train the envelope extraction network, where the envelope extraction network includes a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence; the collected real-time speech signal is preprocessed and input to the trained envelope extraction network, which extracts a number of channel envelopes corresponding to the number of electrodes implanted in the body; nonlinear compression, channel selection, electrode mapping, and pulse modulation are sequentially performed on the channel envelopes extracted from the real-time speech signal, and a target number of electrode stimulation signals are output to the corresponding number of implanted electrodes. The lightweight, low-complexity envelope extraction network provided by the present invention effectively reduces power consumption, improves processing efficiency and noise reduction performance, and ensures seamless integration of CI signal processing and noise reduction.
Other features of the present invention and their corresponding effects are set forth in later parts of the specification, and it should be understood that at least some of these effects become apparent from the description herein.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of the basic flow of the cochlear implant signal processing method provided by the first embodiment of the present invention;
FIG. 2 is a schematic flowchart of the network training method provided by the first embodiment of the present invention;
FIG. 3 is a schematic diagram of the training of the envelope extraction network provided by the first embodiment of the present invention;
FIG. 4 is a schematic structural diagram of the cochlear implant signal processing apparatus provided by the second embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the cochlear implant device provided by the third embodiment of the present invention.
Detailed Description
To make the objectives, features, and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
First Embodiment:
To solve the technical problems that the noise reduction algorithms used in the related art have low processing efficiency, high power consumption, and limited noise reduction effect, and cannot adapt well to CI processing strategies, this embodiment proposes a cochlear implant signal processing method applied to a cochlear implant device. FIG. 1 is a schematic diagram of the basic flow of the cochlear implant signal processing method provided by this embodiment, which includes the following steps:
Step 101: Obtain a training speech signal, preprocess it, input it to the envelope extraction network, and train the envelope extraction network.
Specifically, the envelope extraction network in this embodiment includes a first deep neural network (DNN1), a second deep neural network (DNN2), and a third deep neural network (DNN3) connected in sequence. The first deep neural network may preferably be a long short-term memory network (LSTM) and is used to extract high-dimensional features from the input features; the second deep neural network is used to estimate the features of the enhanced training speech signal; the third deep neural network is used to extract, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body. It should be understood that, in practical applications, the features in this embodiment may be frequency-domain features (such as the log-magnitude spectrum or the magnitude spectrum) or time-domain features.
Mainstream CI devices currently on the market consist of two parts: an implanted component and an external unit. The CI signal processing system of this embodiment is preferably placed in the external unit, and the number of implanted electrodes of the CI product of this embodiment may preferably be 22. This embodiment uses the envelope extraction network to extract the subband signal envelopes of a number of channels equal to the number of electrodes implanted by an actual CI product, so that the envelopes contain richer detail of the original sound.
It should also be noted that, in practical applications, the training speech signal may be a ready-made training speech sample, e.g., obtained directly from a preset sample database, or may be recorded by the user; this embodiment imposes no unique limitation here.
In an optional implementation of this embodiment, obtaining the training speech signal includes: randomly selecting a target number of clean speech samples from a preset speech database, and selecting preset types of noise samples from a preset noise set; and, based on the clean speech samples and the noise samples, generating training speech signals at preset signal-to-noise ratios.
Specifically, this embodiment constructs the training speech samples by selecting a suitable speech database and noise database. For example, 2500 sentences can be randomly selected from the training set of the Tsinghua Chinese speech database to form the clean speech sample set for the envelope extraction network of this embodiment, and two types of noise, white noise and babble, can be selected from the NOISEX-92 noise set as the noise sample set. The 2500 sentences of speech are then randomly combined with the noise at signal-to-noise ratios of -5 dB, 0 dB, and 5 dB, plus a noise-free condition, to generate the noisy training speech signals used to train the envelope extraction network.
FIG. 2 is a schematic flowchart of a network training method provided by this embodiment. In an optional implementation of this embodiment, preprocessing the training speech signal, inputting it to the envelope extraction network, and training the envelope extraction network specifically include the following steps:
Step 1011: Preprocess the training speech signal to obtain features of a preset number of consecutive frames.
Step 1012: Input the features of the preset number of consecutive frames to the first deep neural network containing 128 neurons for high-dimensional feature extraction.
Step 1013: Pass the output of the first deep neural network through a second deep neural network composed of two fully connected layers of 512 neurons each and a linear layer of 65 neurons, to estimate the features of the enhanced training speech signal.
Step 1014: Pass the output of the second deep neural network through a third deep neural network composed of a fully connected layer of 256 neurons and a linear layer of 22 neurons, to extract a number of channel envelopes corresponding to the number of electrodes implanted in the body.
Step 1015: Use the backpropagation algorithm to optimize the parameters of the envelope extraction network, iterating the training until the envelope extraction network converges, to obtain the trained envelope extraction network.
Specifically, in this embodiment, the noisy speech and its corresponding clean speech samples can first each be preprocessed to obtain short-time Fourier transform log-power spectrum (LPS) features (8 ms per frame, with a 1 ms frame shift). Considering the correlation between adjacent speech frames, the input to the envelope extraction network of this embodiment may be a block of 25 consecutive frames. The LPS features of 25 consecutive frames of noisy speech (dimension 25 × 65) are input to a single unidirectional DNN1 layer (such as an LSTM); the output of DNN1 first passes through DNN2, which outputs the estimated LPS (dimension 25 × 65); the output of DNN2 is then fed to DNN3, which outputs the estimated 22 channel envelopes (dimension 25 × 22); the backpropagation algorithm is then used to optimize the network parameters to obtain the final network model.
It should be noted that, during backpropagation, the values of the network parameters are adjusted through a loss function. The loss function estimates how closely the predictions of the trained network model approximate the true values; minimizing it is a convex optimization process, and the smaller the loss function, the stronger the model's envelope extraction and processing capability. In this embodiment, after updating the network parameters according to the loss function, the iterative training process continues until the network converges, i.e., until the loss essentially stops decreasing, at which point the network model of the envelope extraction network of this embodiment is fully trained.
Further, in an optional implementation of this embodiment, the loss function of the envelope extraction network is expressed as:
loss = w_stft * loss_stft + w_env * loss_env + w_waveform * loss_waveform
where loss_stft is the error between the features output by the second deep neural network and the features of the clean speech sample corresponding to the training speech signal; loss_env is the error between the channel envelope features extracted by the third deep neural network and the channel envelope features extracted from the clean speech sample by a traditional CI processing strategy; loss_waveform is the error between the clean speech sample and the simulated speech signal obtained from the channel envelopes extracted by the third deep neural network after electrode mapping and related processing; and w_stft, w_env, and w_waveform are the weighting factors corresponding to the respective errors. It should be understood that each of the above errors may preferably be an L1-norm error.
FIG. 3 is a schematic diagram of the training of the envelope extraction network provided by this embodiment, where A represents the input training speech signal, B represents the clean speech sample used together with the noise to generate the training speech signal, and C represents the envelope extraction network. Continuing with the network dimensions preferred above, in this embodiment the features of the noisy speech (dimension 25 × 65) are input to DNN1 for high-dimensional feature extraction, and the output of DNN1 passes through DNN2 to produce the estimated LPS features (dimension 25 × 65). The loss_stft can be computed from the 65-dimensional LPS features output by DNN2 and the 65-dimensional LPS features of the corresponding clean speech sample; the computation of loss_stft can borrow the weighted perceptual method widely used in audio coding, the purpose being to guide the model to be less sensitive to noise near formants and more sensitive to noise near non-formant spectral valleys. In addition, loss_env is computed from the 22-dimensional channel envelopes output by DNN3 and the 22-dimensional channel envelopes extracted from the corresponding clean speech samples by an existing traditional CI processing strategy such as the ACE (advanced combination encoder) strategy. Furthermore, a simulated speech signal is constructed from the 22-dimensional channel envelopes output by DNN3, and the waveform error loss_waveform between this simulated speech signal and the clean speech is computed, thereby forcing the envelope extraction network of this embodiment to learn the detail of the clean speech, effectively overcoming the shortcoming that traditional CI strategies cannot effectively extract time-domain detail.
Finally, the three errors are weighted by three adjustable weighting factors and summed as the objective function for optimizing the envelope extraction network. In a preferred implementation, the entire envelope extraction network is trained for 60 epochs using the Adam gradient optimizer, and the model with the smallest validation loss is saved as the final trained envelope extraction network model.
It should be noted that the loss function provided by this embodiment not only guides the model, to a certain extent, in learning how a traditional CI processing strategy extracts envelope energy from the Fourier energy spectrum, but also forces the network, in both the frequency and time domains, to learn to approximate the data distribution of clean speech. This indirectly gives the envelope signals output by the envelope extraction network more detail, largely overcoming the inherent inability of traditional CI processing strategies to extract fine temporal detail from the speech signal.
Step 102: Preprocess the collected real-time speech signal, input it to the trained envelope extraction network, and extract a number of channel envelopes corresponding to the number of electrodes implanted in the body.
Specifically, when the speech acquisition unit of the CI device (e.g., a microphone) receives an external speech signal, it preprocesses the signal and feeds it to the trained envelope extraction network; since the preferred number of implanted electrodes in this embodiment is 22, the network outputs 22 channel envelope signals. It should be noted that the envelope extraction network model of this embodiment is much more compact than existing algorithm models: its size is only about 1.9 MB, the number of network parameters is 0.46M, the system complexity is significantly reduced, and decoding each frame (8 ms) takes only about 0.1–0.2 ms on average. Since the total parameter count and computational complexity of the network model are greatly reduced, power consumption is reduced accordingly (both memory and CPU usage are very small), ensuring the feasibility of applying the envelope extraction network model of this embodiment in actual CI products.
Step 103: Sequentially perform nonlinear compression, channel selection, electrode mapping, and pulse modulation on the channel envelopes extracted from the real-time speech signal, and then output a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
Specifically, the preprocessing, nonlinear compression, channel selection, electrode mapping, pulse modulation, and generation of the simulated speech signal in this embodiment are all performed in the same way as in traditional CI processing strategies and are not detailed here. It should be noted that, during channel selection, this embodiment can select the N envelope signals with the largest energy and/or the highest signal-to-noise ratio for electrode mapping; for example, when the total number of implanted electrodes is 22, 8 channels can be selected, and the pulse-modulated electrical stimulation signals are output to the corresponding 8 implanted electrodes.
By exploiting the strong learning ability of the envelope extraction network of this embodiment and training it on suitably constructed noisy speech data, a good noise reduction effect can be achieved: without adding any additional front-end noise reduction module, the system already has an anti-noise capability that traditional CI processing strategies cannot offer.
In addition, the envelope extraction network of this embodiment can, through the second deep neural network, learn an adjustable parameterization similar to the triangular filter bank parameters in existing CI processing strategies, and error backpropagation based on the simulated speech versus the real speech optimizes the envelope extraction network so that the extracted envelopes carry more detail. The speech processing effect achieved in a quiet environment is better than that of traditional CI processing strategies, and the noise reduction performance in a noisy environment is clearly better than that of traditional CI processing strategies that use Wiener filtering or some lightweight DNN as a front-end noise reduction module.
According to the cochlear implant signal processing method provided by this embodiment of the present invention, a training speech signal is obtained, preprocessed, and input to the envelope extraction network to train the envelope extraction network, where the envelope extraction network includes a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence; the collected real-time speech signal is preprocessed and input to the trained envelope extraction network, which extracts a number of channel envelopes corresponding to the number of implanted electrodes; nonlinear compression, channel selection, electrode mapping, and pulse modulation are sequentially performed on the extracted channel envelopes, and a target number of electrode stimulation signals are output to the corresponding number of implanted electrodes. The lightweight, low-complexity envelope extraction network provided by the present invention effectively reduces power consumption, improves processing efficiency and noise reduction performance, and ensures seamless integration of CI signal processing and noise reduction.
Second Embodiment:
To solve the technical problems that the noise reduction algorithms used in the related art have low processing efficiency, high power consumption, and limited noise reduction effect, and cannot adapt well to CI processing strategies, this embodiment shows a cochlear implant signal processing apparatus applied to a cochlear implant device; see FIG. 4. The cochlear implant signal processing apparatus of this embodiment includes:
a training module 401, configured to obtain a training speech signal, preprocess it, input it to the envelope extraction network, and train the envelope extraction network, where the envelope extraction network includes a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence; the first deep neural network is used to extract high-dimensional features from the input features, the second deep neural network is used to estimate the features of the enhanced training speech signal, and the third deep neural network is used to extract, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body;
an extraction module 402, configured to preprocess the collected real-time speech signal, input it to the trained envelope extraction network, and extract a number of channel envelopes corresponding to the number of electrodes implanted in the body;
a processing module 403, configured to sequentially perform nonlinear compression, channel selection, electrode mapping, and pulse modulation on the channel envelopes extracted from the real-time speech signal, and then output a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
In an optional implementation of this embodiment, when inputting the preprocessed training speech signal to the envelope extraction network and training it, the training module 401 is specifically configured to: preprocess the training speech signal to obtain features of a preset number of consecutive frames; input those features to the first deep neural network containing 128 neurons for high-dimensional feature extraction; pass the output of the first deep neural network through a second deep neural network composed of two fully connected layers of 512 neurons each and a linear layer of 65 neurons, to estimate the features of the enhanced training speech signal; pass the output of the second deep neural network through a third deep neural network composed of a fully connected layer of 256 neurons and a linear layer of 22 neurons, to extract a number of channel envelopes corresponding to the number of implanted electrodes; and use the backpropagation algorithm to optimize the parameters of the envelope extraction network, iterating until the envelope extraction network converges, to obtain the trained envelope extraction network.
In an optional implementation of this embodiment, when obtaining the training speech signal, the training module 401 is specifically configured to: randomly select a target number of clean speech samples from a preset speech database, and select preset types of noise samples from a preset noise set; and, based on the clean speech samples and the noise samples, generate training speech signals at preset signal-to-noise ratios.
Further, in an optional implementation of this embodiment, the loss function of the envelope extraction network is expressed as:
loss = w_stft * loss_stft + w_env * loss_env + w_waveform * loss_waveform
where loss_stft is the error between the features output by the second deep neural network and the features of the clean speech sample corresponding to the training speech signal; loss_env is the error between the channel envelope features extracted by the third deep neural network and the channel envelope features extracted from the clean speech sample by a traditional CI processing strategy; loss_waveform is the error between the clean speech sample and the simulated speech signal obtained based on the channel envelopes extracted by the third deep neural network; and w_stft, w_env, and w_waveform are the weighting factors corresponding to the respective errors.
It should be noted that the cochlear implant signal processing methods in the foregoing embodiments can all be implemented based on the cochlear implant signal processing apparatus provided in this embodiment. Those of ordinary skill in the art will clearly understand that, for convenience and brevity of description, the specific working process of the cochlear implant signal processing apparatus described in this embodiment can refer to the corresponding process in the foregoing method embodiment and is not repeated here.
With the cochlear implant signal processing apparatus provided by this embodiment, a training speech signal is obtained, preprocessed, and input to the envelope extraction network to train the envelope extraction network, where the envelope extraction network includes a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence; the collected real-time speech signal is preprocessed and input to the trained envelope extraction network, which extracts a number of channel envelopes corresponding to the number of implanted electrodes; nonlinear compression, channel selection, electrode mapping, and pulse modulation are sequentially performed on the extracted channel envelopes, and a target number of electrode stimulation signals are output to the corresponding number of implanted electrodes. The lightweight, low-complexity envelope extraction network provided by the present invention effectively reduces power consumption, improves processing efficiency and noise reduction performance, and ensures seamless integration of CI signal processing and noise reduction.
Third Embodiment:
This embodiment provides a cochlear implant device. As shown in FIG. 5, it includes a processor 501, a memory 502, and a communication bus 503, where the communication bus 503 is used to implement connection and communication between the processor 501 and the memory 502, and the processor 501 is used to execute one or more computer programs stored in the memory 502 to implement at least one step of the cochlear implant signal processing method in the first embodiment above.
This embodiment also provides a computer-readable storage medium, which includes volatile or non-volatile, removable or non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, computer program modules, or other data). Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer.
The computer-readable storage medium in this embodiment can be used to store one or more computer programs, and the stored one or more computer programs can be executed by a processor to implement at least one step of the method in the first embodiment above.
This embodiment also provides a computer program, which can be distributed on a computer-readable medium and executed by a computing device to implement at least one step of the method in the first embodiment above; in some cases, at least one of the steps shown or described may be performed in an order different from that described in the above embodiment.
This embodiment also provides a computer program product including a computer-readable device on which the computer program shown above is stored. The computer-readable device in this embodiment may include the computer-readable storage medium shown above.
It can be seen that those skilled in the art should understand that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and apparatuses, can be implemented as software (which can be realized with computer program code executable by a computing device), firmware, hardware, or an appropriate combination thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components. Some or all physical components can be implemented as software executed by a processor such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit.
In addition, as is well known to those of ordinary skill in the art, communication media usually contain computer-readable instructions, data structures, computer program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. Therefore, the present invention is not limited to any specific combination of hardware and software.
The above content is a further detailed description of the embodiments of the present invention in conjunction with specific implementations, and the specific implementation of the present invention should not be considered limited to these descriptions. For those of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions can be made without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. A cochlear implant signal processing method, applied to a cochlear implant device, comprising:
    obtaining a training speech signal, preprocessing the training speech signal, inputting it to an envelope extraction network, and training the envelope extraction network, wherein the envelope extraction network comprises a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence, the first deep neural network is used to extract high-dimensional features from input features, the second deep neural network is used to estimate features of the enhanced training speech signal, and the third deep neural network is used to extract, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body;
    inputting the collected real-time speech signal, after preprocessing, to the trained envelope extraction network, and extracting a number of channel envelopes corresponding to the number of electrodes implanted in the body;
    sequentially performing nonlinear compression, channel selection, electrode mapping, and pulse modulation on the channel envelopes extracted from the real-time speech signal, and then outputting a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
  2. The cochlear implant signal processing method of claim 1, wherein preprocessing the training speech signal, inputting it to the envelope extraction network, and training the envelope extraction network comprises:
    preprocessing the training speech signal to obtain features of a preset number of consecutive frames;
    inputting the features of the preset number of consecutive frames to the first deep neural network containing 128 neurons for high-dimensional feature extraction;
    passing the output of the first deep neural network through a second deep neural network composed of two fully connected layers of 512 neurons each and a linear layer of 65 neurons, to estimate the features of the enhanced training speech signal;
    passing the output of the second deep neural network through a third deep neural network composed of a fully connected layer of 256 neurons and a linear layer of 22 neurons, to extract a number of channel envelopes corresponding to the number of electrodes implanted in the body;
    using a backpropagation algorithm to optimize the parameters of the envelope extraction network, and iterating the training until the envelope extraction network converges, to obtain the trained envelope extraction network.
  3. The cochlear implant signal processing method of claim 1, wherein obtaining the training speech signal comprises:
    randomly selecting a target number of clean speech samples from a preset speech database, and selecting preset types of noise samples from a preset noise set;
    generating the training speech signal at a preset signal-to-noise ratio based on the clean speech samples and the noise samples.
  4. The cochlear implant signal processing method of any one of claims 1 to 3, wherein the loss function of the envelope extraction network is expressed as:
    loss = w_stft * loss_stft + w_env * loss_env + w_waveform * loss_waveform
    wherein loss_stft is the error between the features output by the second deep neural network and the features of the clean speech sample corresponding to the training speech signal; loss_env is the error between the channel envelope features extracted by the third deep neural network and the channel envelope features extracted from the clean speech sample by a traditional CI processing strategy; loss_waveform is the error between the clean speech sample and the simulated speech signal obtained based on the channel envelopes extracted by the third deep neural network; and w_stft, w_env, and w_waveform are the weighting factors corresponding to the respective errors.
  5. A cochlear implant signal processing apparatus, applied to a cochlear implant device, comprising:
    a training module, configured to obtain a training speech signal, preprocess it, input it to an envelope extraction network, and train the envelope extraction network, wherein the envelope extraction network comprises a first deep neural network, a second deep neural network, and a third deep neural network connected in sequence, the first deep neural network is used to extract high-dimensional features from input features, the second deep neural network is used to estimate features of the enhanced training speech signal, and the third deep neural network is used to extract, from the features estimated by the second deep neural network, a number of channel envelopes corresponding to the number of electrodes implanted in the body;
    an extraction module, configured to preprocess the collected real-time speech signal, input it to the trained envelope extraction network, and extract a number of channel envelopes corresponding to the number of electrodes implanted in the body;
    a processing module, configured to sequentially perform nonlinear compression, channel selection, electrode mapping, and pulse modulation on the channel envelopes extracted from the real-time speech signal, and then output a target number of electrode stimulation signals to the corresponding number of implanted electrodes.
  6. The cochlear implant signal processing apparatus of claim 5, wherein, when inputting the preprocessed training speech signal to the envelope extraction network and training the envelope extraction network, the training module is specifically configured to:
    preprocess the training speech signal to obtain features of a preset number of consecutive frames;
    input the features of the preset number of consecutive frames to the first deep neural network containing 128 neurons for high-dimensional feature extraction;
    pass the output of the first deep neural network through a second deep neural network composed of two fully connected layers of 512 neurons each and a linear layer of 65 neurons, to estimate the features of the enhanced training speech signal;
    pass the output of the second deep neural network through a third deep neural network composed of a fully connected layer of 256 neurons and a linear layer of 22 neurons, to extract a number of channel envelopes corresponding to the number of electrodes implanted in the body;
    use a backpropagation algorithm to optimize the parameters of the envelope extraction network, and iterate the training until the envelope extraction network converges, to obtain the trained envelope extraction network.
  7. The cochlear implant signal processing apparatus of claim 5, wherein, when obtaining the training speech signal, the training module is specifically configured to:
    randomly select a target number of clean speech samples from a preset speech database, and select preset types of noise samples from a preset noise set;
    generate the training speech signal at a preset signal-to-noise ratio based on the clean speech samples and the noise samples.
  8. The cochlear implant signal processing apparatus of any one of claims 5 to 7, wherein the loss function of the envelope extraction network is expressed as:
    loss = w_stft * loss_stft + w_env * loss_env + w_waveform * loss_waveform
    wherein loss_stft is the error between the features output by the second deep neural network and the features of the clean speech sample corresponding to the training speech signal; loss_env is the error between the channel envelope features extracted by the third deep neural network and the channel envelope features extracted from the clean speech sample by a traditional CI processing strategy; loss_waveform is the error between the clean speech sample and the simulated speech signal obtained based on the channel envelopes extracted by the third deep neural network; and w_stft, w_env, and w_waveform are the weighting factors corresponding to the respective errors.
  9. A cochlear implant device, comprising a processor, a memory, and a communication bus;
    the communication bus is used to implement connection and communication between the processor and the memory;
    the processor is used to execute one or more programs stored in the memory to implement the steps of the cochlear implant signal processing method of any one of claims 1 to 4.
  10. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the cochlear implant signal processing method of any one of claims 1 to 4.
PCT/CN2019/112174 2019-10-21 2019-10-21 Cochlear implant signal processing method, apparatus, and computer-readable storage medium WO2021077247A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/112174 WO2021077247A1 (zh) 2019-10-21 2019-10-21 Cochlear implant signal processing method, apparatus, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/112174 WO2021077247A1 (zh) 2019-10-21 2019-10-21 Cochlear implant signal processing method, apparatus, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021077247A1

Family

ID=75619685

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/112174 WO2021077247A1 (zh) 2019-10-21 2019-10-21 Cochlear implant signal processing method, apparatus, and computer-readable storage medium

Country Status (1)

Country Link
WO (1) WO2021077247A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090312820A1 (en) * 2008-06-02 2009-12-17 University Of Washington Enhanced signal processing for cochlear implants
CN102314880A * 2010-06-30 2012-01-11 上海视加信息科技有限公司 Method for coding and synthesizing speech units
CN107767859A * 2017-11-10 2018-03-06 吉林大学 Speaker intelligibility detection method for cochlear implant signals in noisy environments
CN109841220A * 2017-11-24 2019-06-04 深圳市腾讯计算机系统有限公司 Speech signal processing model training method and apparatus, electronic device, and storage medium


Similar Documents

Publication Publication Date Title
US11961533B2 (en) Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
JP7258182B2 (ja) 音声処理方法、装置、電子機器及びコンピュータプログラム
Xia et al. Speech enhancement with weighted denoising auto-encoder.
US11800301B2 (en) Neural network model for cochlear mechanics and processing
CN110111769B (zh) 一种电子耳蜗控制方法、装置、可读存储介质及电子耳蜗
Nossier et al. A comparative study of time and frequency domain approaches to deep learning based speech enhancement
CN104810024A (zh) 一种双路麦克风语音降噪处理方法及系统
CN104778948B (zh) 一种基于弯折倒谱特征的抗噪语音识别方法
Koizumi et al. WaveFit: An iterative and non-autoregressive neural vocoder based on fixed-point iteration
WO2020087716A1 (zh) 人工耳蜗听觉场景识别方法
Nossier et al. Mapping and masking targets comparison using different deep learning based speech enhancement architectures
Zai et al. Reconstruction of audio waveforms from spike trains of artificial cochlea models
Kang et al. Deep learning-based speech enhancement with a loss trading off the speech distortion and the noise residue for cochlear implants
WO2021077247A1 (zh) 一种人工耳蜗信号处理方法、装置及计算机可读存储介质
Zheng et al. A noise-robust signal processing strategy for cochlear implants using neural networks
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
CN110681051B (zh) 一种人工耳蜗信号处理方法、装置及计算机可读存储介质
Radha et al. Enhancing speech quality using artificial bandwidth expansion with deep shallow convolution neural network framework
CN115472168A (zh) 耦合bgcc和pwpe特征的短时语音声纹识别方法、系统及设备
Kumar A spectro-temporal framework for compensation of reverberation for speech recognition
Nareddula et al. Fusion-Net: Time-Frequency Information Fusion Y-Network for Speech Enhancement.
Liang et al. A non-invasive speech quality evaluation algorithm for hearing aids with multi-head self-attention and audiogram-based features
Yi-Ting et al. Fully convolutional network (FCN) model to extract clear speech signals on non-stationary noises of human conversations for cochlear implants
Parameswaran Objective assessment of machine learning algorithms for speech enhancement in hearing aids
Lai Intelligent background sound event detection and classification based on WOLA spectral analysis in hearing devices

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19949712

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.08.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19949712

Country of ref document: EP

Kind code of ref document: A1