WO2020124902A1 - Speech extraction method, system, and device based on supervised learning auditory attention - Google Patents

Speech extraction method, system, and device based on supervised learning auditory attention

Info

Publication number
WO2020124902A1
WO2020124902A1 (PCT/CN2019/083352, CN2019083352W)
Authority
WO
WIPO (PCT)
Prior art keywords
time
pulse
speech
target
frequency
Prior art date
Application number
PCT/CN2019/083352
Other languages
English (en)
French (fr)
Inventor
许家铭
黄雅婷
徐波
Original Assignee
中国科学院自动化研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院自动化研究所 filed Critical 中国科学院自动化研究所
Priority to US16/645,447 priority Critical patent/US10923136B2/en
Publication of WO2020124902A1 publication Critical patent/WO2020124902A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02087Noise filtering the noise being separate speech, e.g. cocktail party
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the invention belongs to the technical field of speech separation, and specifically relates to a method, system and device for speech extraction based on supervised learning auditory attention.
  • Speech contains a rich spatio-temporal structure, so using a pulse neural network that considers the timing information of pulse sequences to model the "cocktail party problem" is a new solution. However, such networks have used unsupervised learning algorithms and can only separate some simple aliased speech, such as two isolated vocalizations /di/ and /da/; for more complex aliased speech, their accuracy cannot meet the requirements.
  • When a pulse neural network that exploits the timing information of pulse sequences is used to model the "cocktail party problem", training the network with a supervised learning algorithm helps the network separate complex, continuous aliased speech.
  • Although the aliased-speech separation methods based on supervised learning algorithms proposed in this field have made great progress in extracting and separating aliased speech compared with traditional artificial neural networks and with pulse neural networks trained by unsupervised learning algorithms, their convergence is relatively slow and the extraction accuracy still needs to be further improved.
  • the present invention provides a speech extraction method based on supervised learning auditory attention, including:
  • Step S10: a short-time Fourier transform is used to convert the original aliased speech signal into a two-dimensional time-frequency signal representation to obtain a first aliased speech signal;
  • Step S20: the first aliased speech signal is sparsified and the intensity information of its time-frequency units is mapped to D preset intensity levels; a second sparsification based on the intensity level information yields a second aliased speech signal;
  • Step S30: the second aliased speech signal is converted into a pulse signal by means of time coding;
  • the time coding is time-frequency coding or time-population coding;
  • Encoding with time coding preserves the timing information of the speech; a pulse neural network that is good at processing timing information then learns a mapping function from noisy features to the separation target (such as an ideal mask or the magnitude spectrum of the speech of interest), which greatly improves the accuracy of speech separation.
  • Step S40: a trained target pulse extraction network is used to extract target pulses from the pulse signal; the target pulse extraction network is constructed based on a pulse neural network;
  • Step S50: the target pulses are converted into a time-frequency representation of the target speech, and the target speech is obtained by an inverse short-time Fourier transform.
  • step S10 "using short-time Fourier transform to convert the original aliased speech signal into a two-dimensional time-frequency signal representation"
  • the steps are:
  • Step S11 resampling the original aliased speech signal to reduce the sampling rate of the original aliased speech signal
  • Step S12: encode the resampled aliased speech signal with a short-time Fourier transform, producing a matrix representation with two dimensions, time and frequency, with each (time, frequency) pair treated as a time-frequency unit.
  • step S20 "sparse the first aliased speech signal and map the intensity information of the time-frequency unit therein to the preset D intensity levels, and the second sparseness based on the intensity level information
  • the steps are:
  • Step S21: based on a preset background noise threshold, select the time-frequency units of the first aliased speech signal that exceed the background noise threshold to form a first time-frequency unit set;
  • Step S22: perform K-means clustering on the time-frequency units of the first time-frequency unit set and map them to the D preset intensity levels;
  • Step S23: set the time-frequency units of the lowest intensity level as silent units to obtain a second aliased speech signal.
  • the time-frequency coding is:
  • The number of pulses in the coding window and their firing times are used to reflect the strength of a time-frequency unit. The number of intensity levels in the sparse mapping module is D, and time-frequency units of the lowest intensity level are set as silent units. After clustering, the intensity of a time-frequency unit is mapped to an intensity d with 0 < d < D, d an integer. The time-frequency unit (t_0, f_0) corresponds to the coding window of neuron i starting at time t_0, with time span Δt; within the time span starting at t_0, one pulse is emitted at each of the times t_0 + l·Δt/d, l = 0, 1, ..., d-1, for a total of d pulses.
  • the time-population coding is:
  • Multiple neuron populations are used to encode the intensity of the time-frequency unit, and the pulses carrying the intensity information of a time-frequency unit are distributed over the coding windows of the corresponding neurons in the multiple neuron populations;
  • the number of intensity levels in the sparse mapping module is D, time-frequency units of the lowest intensity level are set as silent units, and the time-population coding uses D-1 neuron populations; after clustering, the intensity of a time-frequency unit is mapped to an intensity d with 0 < d < D, d an integer; the time-frequency unit (t_0, f_0) corresponds to the coding window starting at t_0 of neuron i in each population, and the neurons i ∈ P_l, l = 1, 2, ..., d each emit one pulse at the window start time t_0, for a total of d pulses, where P_l denotes the l-th neuron population.
  • the target pulse extraction network is a two-layer fully connected pulse neural network constructed using a stochastic linear neuron model;
  • a remote supervised method is used to train the weights of the target pulse extraction network; the weight update Δw_ji(t) between output-layer neuron j and input-layer neuron i at time t is:

    Δw_ji(t) = [S_j^d(t) - S_j^a(t)] · [ a + ∫_0^∞ W(s) S_i(t - s) ds ]

  • where S_j^d(t), S_j^a(t) and S_i(t) denote the desired output pulse sequence, the actual output pulse sequence and the input pulse sequence, respectively;
  • a denotes the non-Hebbian term;
  • W(s) denotes the learning window;
  • the weights of the target pulse extraction network are obtained by integrating Δw_ji(t) over time.
  • the learning window W(s) is:

    W(s) = A · exp(-s / τ_win) for s > 0, and W(s) = 0 otherwise

  • where s is the time difference between the post-synaptic pulse firing time and the pre-synaptic pulse firing time;
  • A is the amplitude, A > 0;
  • τ_win is the time constant of the learning window.
  • the remote supervised method used is either a remote supervised method with momentum or a remote supervised method with Nesterov's accelerated gradient;
  • when the remote supervised method with momentum is used, the weight between output-layer neuron j and input-layer neuron i of the target pulse extraction network is updated as:

    v_ji^(k+1) = β · v_ji^(k) + η · Δw_ji^(k),    w_ji^(k+1) = w_ji^(k) + v_ji^(k+1)

  • when the remote supervised method with Nesterov's accelerated gradient is used, the weight between output-layer neuron j and input-layer neuron i of the target pulse extraction network is updated as:

    v_ji^(k+1) = β · v_ji^(k) + η · Δw_ji(w_ji^(k) + β · v_ji^(k)),    w_ji^(k+1) = w_ji^(k) + v_ji^(k+1)

  • where k is the iteration index, β ∈ [0, 1] is the momentum coefficient, η is the learning rate, v_ji^(k) is the velocity vector used for each iteration's update, and Δw_ji(w_ji^(k) + β · v_ji^(k)) denotes the weight update evaluated at w_ji^(k) + β · v_ji^(k).
  • the step S50 "converting the target pulse into a time-frequency representation of the target speech and converting the target short-term Fourier transform to obtain the target speech” includes the following steps:
  • Step S51 converting the target pulse into information masking of the corresponding target to obtain the corresponding masking value
  • Step S52 multiplying the masking value and the corresponding point of the first aliased speech signal and adding the phase information of the first aliased speech signal to obtain the time-frequency signal representation of the target speech;
  • step S53 the short-time inverse Fourier transform is used to convert the target speech time-frequency signal representation into speech information to obtain the target speech.
  • a speech extraction system based on supervised learning auditory attention which includes an acquisition module, a conversion module, a sparse mapping module, a pulse conversion module, a target pulse extraction module, a pulse recognition module, and an output module;
  • the acquisition module is configured to acquire and input the original aliased voice signal
  • the conversion module is configured to convert the original aliased speech signal into a two-dimensional time-frequency signal representation using short-time Fourier transform to obtain a first aliased speech signal;
  • the sparse mapping module is configured to sparsify the first aliased speech signal, map the intensity information of its time-frequency units to D preset intensity levels, and perform a second sparsification based on the intensity level information to obtain a second aliased speech signal;
  • the pulse conversion module is configured to convert the second aliased voice signal into a pulse signal by time coding
  • the target pulse extraction module is configured to use a trained target pulse extraction network to extract target pulses from the pulse signal;
  • the pulse recognition module is configured to convert the target pulse into a time-frequency representation of the target speech, and obtain the target speech by inverse short-time Fourier transform;
  • the output module is configured to output the target speech.
  • A storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned speech extraction method based on supervised learning auditory attention.
  • A processing device is provided, including a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above-mentioned speech extraction method based on supervised learning auditory attention.
  • In view of the rich spatio-temporal structure of speech signals, the method of the present invention designs a time coding scheme to encode the intensity information of the aliased speech signal and uses a pulse neural network to learn the mapping from the input pulse sequence of the aliased speech to the output pulse sequence of the target speech, which effectively improves the accuracy of speech separation.
  • The present invention designs and uses time coding to encode the aliased speech information, which retains the rich spatio-temporal information of the speech to a certain extent and effectively improves the accuracy with which the pulse neural network separates speech.
  • The present invention applies a pulse neural network, which is good at processing time-series data, to speech separation; through supervised learning, the network gains the ability to process complex aliased speech.
  • The present invention introduces momentum and the Nesterov accelerated gradient into the remote supervised method and uses the improved remote supervised method to train the pulse neural network, which greatly improves the convergence speed of the pulse neural network and makes it possible to find better solutions.
  • FIG. 1 is a schematic flowchart of a speech extraction method based on supervised learning auditory attention of the present invention
  • FIG. 2 is a schematic diagram of a speech extraction method based on supervised learning auditory attention of the present invention
  • FIG. 3 is a schematic diagram of converting time-domain speech into a time-frequency representation according to an embodiment of the speech extraction method based on supervised learning auditory attention of the present invention;
  • FIG. 4 is a schematic diagram of the sliding time window of an embodiment of the speech extraction method based on supervised learning auditory attention of the present invention;
  • FIG. 5 is a schematic diagram of time-frequency coding of an embodiment of the speech extraction method based on supervised learning auditory attention of the present invention;
  • FIG. 6 is a schematic diagram of time-population coding of an embodiment of the speech extraction method based on supervised learning auditory attention of the present invention;
  • FIG. 7 is a schematic diagram of the pulse neural network of an embodiment of the speech extraction method based on supervised learning auditory attention of the present invention;
  • FIG. 8 is a schematic diagram of the target speech output of an embodiment of the speech extraction method based on supervised learning auditory attention of the present invention;
  • the present invention provides a speech extraction method based on supervised learning auditory attention to perform auditory attention on aliased speech and extract target speech.
  • In view of the rich spatio-temporal structure of speech signals, this method designs a time coding scheme to encode the intensity information of the aliased speech signal and uses a pulse neural network to learn the mapping from the input pulse sequence of the aliased speech to the output pulse sequence of the target speech. The weights of the pulse neural network in this method are learned with a supervised learning algorithm.
  • By restricting the neuron model of the pulse neural network to a linear neuron model, this method introduces momentum and the Nesterov accelerated gradient into the remote supervised method and uses the improved remote supervised method to train the pulse neural network in a supervised manner, so as to accelerate the convergence process and further improve the accuracy of speech separation.
  • The speech extraction method based on supervised learning auditory attention of the present invention includes:
  • Step S10: a short-time Fourier transform is used to convert the original aliased speech signal into a two-dimensional time-frequency signal representation to obtain a first aliased speech signal;
  • Step S20: the first aliased speech signal is sparsified and the intensity information of its time-frequency units is mapped to D preset intensity levels; a second sparsification based on the intensity level information yields a second aliased speech signal;
  • Step S30: the second aliased speech signal is converted into a pulse signal by means of time coding;
  • the time coding is time-frequency coding or time-population coding;
  • Step S40: a trained target pulse extraction network is used to extract target pulses from the pulse signal; the target pulse extraction network is constructed based on a pulse neural network;
  • Step S50: the target pulses are converted into a time-frequency representation of the target speech, and the target speech is obtained by an inverse short-time Fourier transform.
  • a speech extraction method based on supervised learning auditory attention includes steps S10-S50, and each step is described in detail as follows:
  • step S10 the short-time Fourier transform is used to convert the original aliased speech signal into a two-dimensional time-frequency signal representation to obtain a first aliased speech signal.
  • Step S11 resampling the original aliased speech signal to reduce the sampling rate of the original aliased speech signal.
  • The resampling rate adopted in the embodiment of the present invention is 8 kHz.
  • Step S12: the resampled aliased speech signal is encoded by a short-time Fourier transform (STFT) into a matrix representation with two dimensions, time and frequency, with each (time, frequency) pair treated as a time-frequency unit.
  • STFT short-time fast Fourier transform
  • As shown in FIG. 3, the speech time-domain signal is a time-amplitude representation containing the speech information; it is encoded by a short-time Fourier transform (STFT) and converted into a time-frequency representation.
  • In this embodiment, the window length of the STFT is 32 ms, a sine window function is used, and the hop size is 16 ms.
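  • As a concrete illustration of steps S11–S12, the following is a minimal sketch of the resampling and STFT encoding described above, assuming the mixture is already loaded as a one-dimensional NumPy array; function and variable names are illustrative and not taken from the patent.

```python
import numpy as np
from scipy import signal

def aliased_speech_to_tf(mixture, orig_sr, target_sr=8000, win_ms=32, hop_ms=16):
    """Step S10 sketch: resample to 8 kHz, then STFT with a 32 ms sine window
    and a 16 ms hop, yielding magnitude and phase matrices (frequency x time)."""
    # Step S11: resample the original aliased speech to the lower rate.
    x = signal.resample(mixture, int(round(len(mixture) * target_sr / orig_sr)))

    # Step S12: short-time Fourier transform; each (time, frequency) cell is
    # one time-frequency unit.
    nperseg = int(target_sr * win_ms / 1000)          # 256 samples
    hop = int(target_sr * hop_ms / 1000)              # 128 samples
    _, _, Z = signal.stft(x, fs=target_sr,
                          window=signal.windows.cosine(nperseg),
                          nperseg=nperseg, noverlap=nperseg - hop)
    return np.abs(Z), np.angle(Z)

if __name__ == "__main__":
    mix = np.random.randn(16000)                      # 1 s of fake 16 kHz audio
    mag, phase = aliased_speech_to_tf(mix, orig_sr=16000)
    print(mag.shape)                                  # (129, n_frames)
```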
  • In step S20, the first aliased speech signal is sparsified, the intensity information of its time-frequency units is mapped to D preset intensity levels, and a second sparsification based on the intensity level information yields the second aliased speech signal.
  • Step S21: based on a preset background noise threshold, select the time-frequency units of the first aliased speech signal that exceed the background noise threshold to form a first time-frequency unit set.
  • In this embodiment, the background noise threshold is set to -40 dB.
  • Step S22: perform K-means clustering on the time-frequency units of the first time-frequency unit set and map them to the D preset intensity levels.
  • Step S23: set the time-frequency units of the lowest intensity level as silent units to obtain the second aliased speech signal.
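  • The thresholding and clustering of steps S21–S23 can be sketched as follows, with scikit-learn's KMeans standing in for the clustering step and the -40 dB threshold and D = 4 levels taken from this embodiment; the helper name is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def sparsify_and_quantize(mag, n_levels=4, noise_floor_db=-40.0):
    """Steps S21-S23 sketch: keep time-frequency units above the noise floor,
    cluster their log-magnitudes into D levels, and silence the lowest level."""
    db = 20.0 * np.log10(mag + 1e-8)
    db -= db.max()                                    # 0 dB = loudest unit

    # Step S21: first sparsification - drop units below the background threshold.
    keep = db > noise_floor_db

    # Step S22: K-means on the retained intensities, relabelled so that level
    # 0 is the weakest cluster and level D-1 the strongest.
    km = KMeans(n_clusters=n_levels, n_init=10, random_state=0)
    labels = km.fit_predict(db[keep].reshape(-1, 1))
    rank = np.empty(n_levels, dtype=int)
    rank[np.argsort(km.cluster_centers_.ravel())] = np.arange(n_levels)

    # Step S23: second sparsification - level 0 (and everything below the
    # threshold) is treated as a silent unit that will emit no pulses.
    levels = np.zeros(mag.shape, dtype=int)
    levels[keep] = rank[labels]
    return levels

if __name__ == "__main__":
    mag = np.abs(np.random.randn(129, 63))
    print(np.unique(sparsify_and_quantize(mag)))      # levels 0..3
```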
  • step S30 the second aliased speech signal is converted into a pulse signal by time coding.
  • As shown in FIG. 4, the length of the sliding coding window is twice the length of the time span; t_0, t_1, t_2 and t_3 are four time points uniformly distributed along the time dimension, t_0–t_2 and t_1–t_3 are two adjacent coding windows, and t_0–t_1, t_1–t_2 and t_2–t_3 are each one time span.
  • The time coding may use time-frequency coding or time-population coding.
  • The embodiments of the present invention show schematic diagrams of the two types of time coding.
  • Time-frequency coding uses the number of pulses in the coding window and their firing times to reflect the strength of a time-frequency unit, converting the intensity information of the sparsified aliased-speech time-frequency units into pulse signals that the pulse neural network can process.
  • The number of intensity levels in the sparse mapping module is D, and time-frequency units of the lowest intensity level are set as silent units. After clustering, the intensity of a time-frequency unit is mapped to an intensity d with 0 < d < D, d an integer. The time-frequency unit (t_0, f_0) corresponds to the coding window of neuron i starting at time t_0, with time span Δt; within the time span starting at t_0, one pulse is emitted at each of the times t_0 + l·Δt/d, l = 0, 1, ..., d-1, for a total of d pulses.
  • FIG. 5 is a schematic diagram of time-frequency coding according to an embodiment of the present invention.
  • The intensity in time-frequency coding is encoded within the time span corresponding to the first half of the neuron's coding window.
  • The dotted lines in the figure indicate the boundaries of the time spans.
  • Suppose the coding window duration is 24 ms, so the time span duration is 12 ms, and the total number of intensity levels is D = 4; since the lowest-intensity time-frequency units are set as silent units, only the three intensity levels 1, 2 and 3 remain.
  • The frequency of the time-frequency unit whose start time is t_0 corresponds to neuron i; if its intensity is 2, then within the time span starting at t_0, 2 pulses are evenly distributed at t_0 ms and (t_0 + 6) ms. If the intensity of the time-frequency unit encoded by the same neuron i in the subsequent coding window starting at t_1 is 3, then within the time span starting at t_1, 3 pulses are evenly distributed at t_1 ms, (t_1 + 4) ms and (t_1 + 8) ms.
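  • A minimal sketch of the time-frequency coding just described, assuming a 12 ms time span and the integer level map produced by the sparse mapping step; pulse times are in milliseconds and all names are illustrative.

```python
import numpy as np

def time_frequency_encode(levels, span_ms=12.0):
    """Time-frequency coding sketch: for a T-F unit of intensity d at frame
    start t0, neuron f emits d pulses evenly spaced over [t0, t0 + span),
    i.e. at t0 + l * span / d for l = 0 .. d-1 (level 0 = silent)."""
    n_freq, n_frames = levels.shape
    spikes = [[] for _ in range(n_freq)]              # one spike train per neuron
    for frame in range(n_frames):
        t0 = frame * span_ms                          # successive units are one span apart (sketch assumption)
        for f in range(n_freq):
            d = int(levels[f, frame])
            if d == 0:
                continue                              # silent unit: no pulses
            for l in range(d):
                spikes[f].append(t0 + l * span_ms / d)
    return spikes

if __name__ == "__main__":
    levels = np.array([[2, 3], [0, 1]])               # 2 neurons, 2 frames
    print(time_frequency_encode(levels))
    # neuron 0: [0.0, 6.0, 12.0, 16.0, 20.0]; neuron 1: [12.0]
```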
  • Time-population coding uses multiple neuron populations to encode the intensity of the time-frequency unit.
  • The pulses carrying the intensity information of a time-frequency unit are distributed over the coding windows of the corresponding neurons in the multiple neuron populations, which converts the intensity information of the sparsified aliased-speech time-frequency units into pulse signals that the pulse neural network can process.
  • Population coding is an important coding strategy found in neuroscience, in which stimuli are encoded mainly by multiple imprecise neurons.
  • Inspired by time coding and population coding, time-population coding uses multiple neuron populations to encode the intensity of the time-frequency unit; specifically, the pulses representing the intensity of a time-frequency unit are distributed over the coding windows of the corresponding neurons in the multiple neuron populations.
  • FIG. 6 is a schematic diagram of time-population coding according to an embodiment of the present invention.
  • The intensity in time-population coding is encoded within the time span of the first half of the corresponding neuron's coding window.
  • The dotted lines in the figure indicate the boundaries of the time spans, and P_d denotes the d-th neuron population.
  • Suppose the total number of intensity levels is D = 4; since the lowest-intensity time-frequency units are set as silent units, only the three intensity levels 1, 2 and 3 remain, so three neuron populations encode the stimulus. The frequency of the time-frequency unit whose start time is t_0 corresponds to neuron i in each neuron population; if its intensity is 2, neuron i in each of the first two neuron populations emits one pulse at t_0, while neuron i in the third neuron population stays silent at t_0. If the intensity of the time-frequency unit encoded by the same neuron i in the subsequent coding window starting at t_1 is 3, then neuron i in each of the three neuron populations emits one pulse at t_1.
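  • A corresponding sketch of time-population coding under the same assumptions (D = 4, hence D - 1 = 3 populations): a unit of intensity d triggers one pulse at the window start time in each of the first d populations; names are illustrative.

```python
import numpy as np

def time_population_encode(levels, n_levels=4, span_ms=12.0):
    """Time-population coding sketch: D-1 populations, each with one neuron
    per frequency bin.  A unit of intensity d at frame start t0 makes neuron
    f fire once at t0 in populations P_1 .. P_d; the rest stay silent."""
    n_freq, n_frames = levels.shape
    n_pops = n_levels - 1
    # spikes[p][f] is the spike train of neuron f in population P_{p+1}
    spikes = [[[] for _ in range(n_freq)] for _ in range(n_pops)]
    for frame in range(n_frames):
        t0 = frame * span_ms
        for f in range(n_freq):
            d = int(levels[f, frame])
            for p in range(d):                        # first d populations fire
                spikes[p][f].append(t0)
    return spikes

if __name__ == "__main__":
    levels = np.array([[2, 3]])                       # 1 frequency bin, 2 frames
    for p, pop in enumerate(time_population_encode(levels), start=1):
        print(f"P{p}:", pop)
    # P1: [[0.0, 12.0]]  P2: [[0.0, 12.0]]  P3: [[12.0]]
```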
  • Step S40 Use the trained target pulse extraction network to extract the target pulse from the pulse signal.
  • the target pulse extraction network is a two-layer fully connected pulse neural network constructed using a stochastic linear neuron model (Figure 7).
  • In the embodiment of the present invention, a leaky integrate-and-fire (LIF) neuron model with membrane potential V_j(t) is used; it is defined as shown in formula (1):

    V_j(t) = Σ_{i∈Γ_j} w_ji · Σ_{t_i^f} ε(t - t_i^f) + V_rest    (1)

  • where Γ_j is the set of presynaptic neurons of neuron j;
  • w_ji is the weight of the synaptic connection between neuron j and neuron i;
  • t_i^f denotes the firing times of presynaptic neuron i;
  • ε(t) is the impulse response function;
  • V_rest is the resting potential.
  • ε(t) is a simple α-function, as shown in equation (2):

    ε(t) = (t / τ) · exp(1 - t / τ) · H(t)    (2)

  • where τ is a time constant that determines how quickly the postsynaptic potential rises and falls; H(t) is the Heaviside step function; and τ_ref is the refractory period, meaning that once the accumulated membrane potential reaches the threshold potential V_thre, the neuron returns to the resting potential V_rest and remains there for a period of time.
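  • The membrane dynamics of formulas (1)–(2) can be sketched as follows, using the α-kernel written above, a simple threshold-and-reset rule for the refractory period, and the parameter values given elsewhere in the description for the time-frequency-coding embodiment (V_thre = 1.0, V_rest = 0, τ = 0.6, τ_ref = 0.8); time units are arbitrary here and the code is an illustration, not the patent's implementation.

```python
import numpy as np

def alpha_kernel(t, tau=0.6):
    """Simple alpha-function PSP kernel, peaking at 1 when t == tau; 0 for t <= 0."""
    return np.where(t > 0, (t / tau) * np.exp(1.0 - t / tau), 0.0)

def lif_response(input_spikes, weights, t_grid,
                 tau=0.6, v_rest=0.0, v_thre=1.0, t_ref=0.8):
    """Formula (1) sketch: V_j(t) = V_rest + sum_i w_ji * sum_f eps(t - t_i^f),
    with the neuron held at V_rest for t_ref after each output spike."""
    v_trace = np.full(len(t_grid), v_rest, dtype=float)
    out_spikes, last_spike = [], -np.inf
    for k, t in enumerate(t_grid):
        if t - last_spike < t_ref:                    # refractory period
            continue
        v = v_rest
        for w, train in zip(weights, input_spikes):   # presynaptic neurons i
            for tf in train:                          # their firing times t_i^f
                v += w * alpha_kernel(t - tf, tau)
        v_trace[k] = v
        if v >= v_thre:                               # threshold crossing -> spike
            out_spikes.append(t)
            last_spike = t
    return v_trace, out_spikes

if __name__ == "__main__":
    t_grid = np.arange(0.0, 10.0, 0.05)
    _, spikes = lif_response([[1.0, 1.5], [2.0]], weights=[0.6, 0.7], t_grid=t_grid)
    print("output spike times:", spikes)
```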
  • The structure of the pulse neural network is related to the time coding method.
  • When the time coding method is time-frequency coding, the number of input-layer neurons m and the number of output-layer neurons n are both F, where F is the frequency dimension of the time-frequency representation X_{t,f}; when the time coding method is time-population coding, the number of input-layer neurons m is (D-1)·F and the number of output-layer neurons n is F.
  • A remote supervised method is used to train the weights of the target pulse extraction network.
  • The weight update Δw_ji(t) between output-layer neuron j and input-layer neuron i at time t is shown in formula (3):

    Δw_ji(t) = [S_j^d(t) - S_j^a(t)] · [ a + ∫_0^∞ W(s) S_i(t - s) ds ]    (3)

  • where S_j^d(t), S_j^a(t) and S_i(t) denote the desired output pulse sequence, the actual output pulse sequence and the input pulse sequence, respectively;
  • a denotes the non-Hebbian term;
  • W(s) denotes the learning window;
  • the weights of the target pulse extraction network are obtained by integrating Δw_ji(t) over time.
  • The learning window W(s) is defined as shown in equation (4):

    W(s) = A · exp(-s / τ_win) for s > 0, and W(s) = 0 otherwise    (4)

  • where s is the time difference between the post-synaptic pulse firing time and the pre-synaptic pulse firing time;
  • A is the amplitude, A > 0;
  • τ_win is the time constant of the learning window.
  • If and only if the neuron model is restricted to the stochastic linear neuron model, the remote supervised method can be derived from another angle; this derivation is similar to stochastic gradient descent.
  • The remote supervised method adopted is either the remote supervised method with momentum (ReSuMe-M) or the remote supervised method with Nesterov's accelerated gradient (ReSuMe-NAG).
  • When ReSuMe-M is used, the weight between output-layer neuron j and input-layer neuron i of the target pulse extraction network is updated as shown in formulas (5) and (6):

    v_ji^(k+1) = β · v_ji^(k) + η · Δw_ji^(k)    (5)
    w_ji^(k+1) = w_ji^(k) + v_ji^(k+1)    (6)

  • When ReSuMe-NAG is used, the weight between output-layer neuron j and input-layer neuron i of the target pulse extraction network is updated as shown in formulas (7) and (8):

    v_ji^(k+1) = β · v_ji^(k) + η · Δw_ji(w_ji^(k) + β · v_ji^(k))    (7)
    w_ji^(k+1) = w_ji^(k) + v_ji^(k+1)    (8)

  • where k is the iteration index; β is the momentum coefficient, β ∈ [0, 1]; η is the learning rate; v_ji^(k) is the velocity vector used at each iteration; and Δw_ji(w_ji^(k) + β · v_ji^(k)) denotes the weight update evaluated at w_ji^(k) + β · v_ji^(k). In this embodiment, β = 0.9.
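  • A sketch of the update rules in formulas (5)–(8), written for a whole weight matrix; delta_w_fn stands for a routine that evaluates the ReSuMe update Δw at given weights (for example by running the network and applying formula (3)) and is an assumption of this sketch.

```python
import numpy as np

def resume_m_step(w, v, delta_w_fn, lr=0.05, beta=0.9):
    """Formulas (5)-(6): the velocity accumulates the ReSuMe update, then the
    weights move by the velocity (ReSuMe-M)."""
    v_new = beta * v + lr * delta_w_fn(w)
    return w + v_new, v_new

def resume_nag_step(w, v, delta_w_fn, lr=0.05, beta=0.9):
    """Formulas (7)-(8): the ReSuMe update is evaluated at the look-ahead
    point w + beta * v (ReSuMe-NAG)."""
    v_new = beta * v + lr * delta_w_fn(w + beta * v)
    return w + v_new, v_new

if __name__ == "__main__":
    # Toy stand-in for the ReSuMe update: pull the weights toward 1.
    delta_w_fn = lambda w: 1.0 - w
    w = np.zeros((2, 3))                     # n output x m input weights
    v = np.zeros_like(w)
    for _ in range(20):
        w, v = resume_nag_step(w, v, delta_w_fn)
    print(np.round(w, 3))                    # heads toward 1 (with some overshoot from the momentum term)
```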
  • In this embodiment, the initial learning rate of the pulse neural network is 0.05; if the distance between the desired output pulse sequence and the actual output pulse sequence keeps growing for 5 consecutive epochs, the learning rate is scaled by a factor of 0.95. An early-stopping strategy with a patience of 15 epochs is used. The artificial neural network baselines are optimized with SGD, SGDM and NAG, respectively.
  • SGD Stochastic Gradient Descent
  • SGDM Stochastic Gradient Descent with Momentum
  • NAG Nesterov’s Accelerated Gradient
  • step S50 the target pulse is converted into a time-frequency representation of the target speech, and the target speech is obtained by inverse short-time Fourier transform conversion.
  • step S51 the target pulse is converted into information masking of the corresponding target to obtain the corresponding masking value.
  • According to certain rules, the output pulse sequence predicted by the pulse neural network is converted into the information mask A_{t,f} of the corresponding target, where A_{t,f} has the same dimensions as the time-frequency representation of the first aliased speech.
  • In the embodiment of the present invention, ideal binary masking (IBM) is adopted: if a neuron fires a pulse within a given time span, the corresponding mask unit is set to 1; otherwise it is set to 0.
  • Step S52: multiply the mask values element-wise with the first aliased speech signal and add the phase information of the first aliased speech signal to obtain the time-frequency signal representation of the target speech.
  • Step S53: an inverse short-time Fourier transform (iSTFT) is used to convert the target-speech time-frequency signal representation into speech information to obtain the target speech.
  • iSTFT inverse Short-Time Fourier Transform
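  • Steps S51–S53 can be sketched as follows, assuming the output spikes have already been counted per neuron and per time span; the sketch reuses the magnitude/phase pair produced in the STFT sketch above and lets scipy's istft perform the reconstruction.

```python
import numpy as np
from scipy import signal

def pulses_to_speech(spike_counts, mix_mag, mix_phase, fs=8000,
                     win_ms=32, hop_ms=16):
    """Steps S51-S53 sketch: ideal binary mask from the output pulses, masked
    mixture spectrum with the mixture phase, then inverse STFT."""
    # Step S51: any pulse in a neuron's time span -> mask unit = 1, else 0.
    ibm = (spike_counts > 0).astype(float)            # shape: freq x frames

    # Step S52: element-wise product with the mixture magnitude, plus the
    # mixture phase, gives the target time-frequency representation.
    target_tf = ibm * mix_mag * np.exp(1j * mix_phase)

    # Step S53: inverse STFT back to a time-domain waveform.
    nperseg = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    _, target = signal.istft(target_tf, fs=fs,
                             window=signal.windows.cosine(nperseg),
                             nperseg=nperseg, noverlap=nperseg - hop)
    return target

if __name__ == "__main__":
    mag = np.abs(np.random.randn(129, 63))
    phase = np.random.uniform(-np.pi, np.pi, size=mag.shape)
    counts = np.random.randint(0, 3, size=mag.shape)  # fake per-unit pulse counts
    print(pulses_to_speech(counts, mag, phase).shape)
```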
  • FIG. 8 is a schematic diagram of the target speech output according to an embodiment of the present invention.
  • The aliased speech signal is converted into a time-frequency representation, the time-frequency representation of the target speech is extracted through learning by the pulse neural network, and finally an inverse short-time Fourier transform (iSTFT) converts the time-frequency signal representation back into a time-amplitude representation of the speech information, which is the extracted target speech.
  • To accurately evaluate the target-speech separation performance of the method of the present invention, the global signal-to-distortion ratio improvement (GNSDR) from the authoritative BSS_EVAL toolkit for speech separation is adopted as the metric for measuring the speech separation performance of the model.
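  • GNSDR can be computed, for example, with the mir_eval implementation of the BSS_EVAL metrics; the sketch below takes GNSDR as the length-weighted average, over the test clips, of the SDR improvement of the separated signal relative to the unprocessed mixture, which is a common convention and an assumption here.

```python
import numpy as np
import mir_eval

def gnsdr(references, estimates, mixtures):
    """GNSDR sketch: length-weighted average over test clips of
    SDR(estimate, reference) - SDR(mixture, reference)."""
    nsdr_sum, length_sum = 0.0, 0.0
    for ref, est, mix in zip(references, estimates, mixtures):
        sdr_est, _, _, _ = mir_eval.separation.bss_eval_sources(
            ref[np.newaxis, :], est[np.newaxis, :])
        sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(
            ref[np.newaxis, :], mix[np.newaxis, :])
        nsdr_sum += (sdr_est[0] - sdr_mix[0]) * len(ref)
        length_sum += len(ref)
    return nsdr_sum / length_sum

if __name__ == "__main__":
    fs = 8000
    ref = np.random.randn(fs // 2)                    # 0.5 s target clip
    noise = np.random.randn(fs // 2)
    mix = ref + noise
    est = ref + 0.3 * noise                           # a partially cleaned estimate
    print("GNSDR:", gnsdr([ref], [est], [mix]))       # positive for the better estimate
```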
  • The experiments of the present invention use the English speech dataset Grid corpus. Two speakers, one male and one female, were selected from the Grid dataset, and 20 utterances were randomly drawn from each speaker and divided into 3 parts: 10 were used to generate the aliased speech of the training set, 5 to generate the aliased speech of the validation set, and 5 to generate the aliased speech of the test set.
  • The final training set has 100 samples, the validation set has 25 samples, and the test set has 25 samples; each sample is clipped to 0.5 s for alignment.
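  • A small sketch of how the aliased training samples can be constructed, assuming the selected Grid utterances are already loaded as 8 kHz NumPy arrays; the 10 × 10 pairing yielding 100 training mixtures mirrors the split described above, while names and details are illustrative.

```python
import numpy as np

def make_mixtures(speaker_a, speaker_b, fs=8000, clip_s=0.5):
    """Pair every utterance of speaker A with every utterance of speaker B,
    trim both to 0.5 s and add them to form an aliased two-talker sample."""
    n = int(fs * clip_s)
    mixtures, targets = [], []
    for a in speaker_a:
        for b in speaker_b:
            if len(a) < n or len(b) < n:
                continue                        # skip utterances shorter than 0.5 s
            mixtures.append(a[:n] + b[:n])      # aliased speech
            targets.append(a[:n])               # speaker A is the attended target
    return np.stack(mixtures), np.stack(targets)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spk_a = [rng.standard_normal(6000) for _ in range(10)]  # stand-ins for 10 clips
    spk_b = [rng.standard_normal(6000) for _ in range(10)]
    mixes, refs = make_mixtures(spk_a, spk_b)
    print(mixes.shape)                          # (100, 4000): 100 samples of 0.5 s
```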
  • To demonstrate the effectiveness of the proposed time-frequency coding (TR) and time-population coding (TP), they are compared on the above dataset, under the same network-structure parameter settings, with the traditional Time-to-First-Spike (TF) coding. Time-to-First-Spike encodes the intensity information by how early or late a single pulse is fired within the coding window: the earlier the pulse, the greater the intensity.
  • To demonstrate the effectiveness of the speech extraction method based on supervised learning auditory attention of the present invention, it is compared on the above dataset, under the same network-structure parameter settings, with two-layer artificial neural networks: a multi-layer perceptron (MLP), a recurrent neural network (RNN) and a long short-term memory network (LSTM). The artificial neural networks take the time-frequency representation obtained in step S10 as input, and step S51 uses an ideal ratio mask (IRM), since the artificial neural networks perform better with IRM than with IBM.
  • MLP multi-layer perceptron
  • RNN Recurrent Neural Network
  • LSTM Long-Short-Term Memory Network
  • IRM Ideal Ratio Mask
  • The traditional Time-to-First-Spike coding is over-simplified: it uses only a single pulse to represent the strength of a time-frequency unit and is therefore easily disturbed by noise.
  • The time-frequency coding and time-population coding proposed by the present invention deliver a clear performance improvement over the traditional Time-to-First-Spike coding.
  • Comparing the plain supervised method (ReSuMe) with the momentum variant (ReSuMe-M) and the Nesterov accelerated gradient variant (ReSuMe-NAG) shows that, after momentum and the Nesterov accelerated gradient are introduced into the remote supervised method, the model escapes local extrema, finds better solutions and improves the accuracy of speech extraction.
  • Comparing the pulse neural network with the artificial neural networks, the method of the present invention outperforms the artificial neural networks with the same network parameters under most settings, which shows the potential superiority of the pulse neural network in processing time-series data.
  • the speech extraction system based on supervised learning auditory attention of the second embodiment of the present invention includes an acquisition module, a conversion module, a sparse mapping module, a pulse conversion module, a target pulse extraction module, a pulse recognition module, and an output module;
  • the acquisition module is configured to acquire and input the original aliased voice signal
  • the conversion module is configured to convert the original aliased speech signal into a two-dimensional time-frequency signal representation using short-time Fourier transform to obtain a first aliased speech signal;
  • the sparse mapping module is configured to sparsify the first aliased speech signal, map the intensity information of its time-frequency units to D preset intensity levels, and perform a second sparsification based on the intensity level information to obtain a second aliased speech signal;
  • the pulse conversion module is configured to convert the second aliased voice signal into a pulse signal by time coding
  • the target pulse extraction module is configured to use a trained target pulse extraction network to extract target pulses from the pulse signal;
  • the pulse recognition module is configured to convert the target pulse into a time-frequency representation of the target speech, and obtain the target speech by inverse short-time Fourier transform;
  • the output module is configured to output the target speech.
  • It should be noted that the speech extraction system based on supervised learning auditory attention provided by the above embodiment is illustrated only by the division into the functional modules described above.
  • In practical applications, the above functions may be allocated to different functional modules as needed; that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined.
  • For example, the modules of the above embodiment may be merged into one module, or further split into multiple sub-modules, to accomplish all or part of the functions described above.
  • the names of the modules and steps involved in the embodiments of the present invention are only for distinguishing each module or step, and are not regarded as an improper limitation of the present invention.
  • a storage device wherein a plurality of programs are stored, the programs are adapted to be loaded and executed by a processor to implement the above-mentioned speech extraction method based on supervised learning auditory attention.
  • A processing device includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above-mentioned speech extraction method based on supervised learning auditory attention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention belongs to the technical field of speech separation, and specifically relates to a speech extraction method, system and device based on supervised learning auditory attention, aiming to solve the problems that the convergence of aliased speech extraction is slow and that the extraction accuracy needs to be further improved. The method includes: converting the original aliased speech signal into a two-dimensional time-frequency signal representation; sparsifying it and mapping the intensity information of its time-frequency units to discrete intensity levels, then performing a second sparsification based on the intensity level information; converting the result into pulse signals by time coding; extracting target pulses with a trained target pulse extraction network; and converting the target pulses into the time-frequency representation of the target speech, from which the target speech is obtained. Converting the stimuli into pulse sequences with different time coding schemes effectively improves the accuracy with which the pulse neural network separates speech, and training the pulse neural network with an improved remote supervised method greatly increases its convergence speed.

Description

基于有监督学习听觉注意的语音提取方法、系统、装置 技术领域
本发明属于语音分离技术领域,具体涉及了一种基于有监督学习听觉注意的语音提取方法、系统、装置。
背景技术
“鸡尾酒会问题”计算机语音识别领域中一个十分具有挑战性的问题,当前语音识别技术已经可以以较高精度识别一个人所讲的话,但是当说话的人数为两人或者多人时,语音识别正确率就会极大的降低。许多语音分离算法均致力于解决“鸡尾酒会问题”。随着深度学习在人工智能各个领域中的成功应用,许多研究者将人工神经网络应用到对“鸡尾酒会问题”的建模中。传统的人工神经网络采用频率编码对刺激进行编码,但是近年来的研究表明,忽略了时间结构的频率编码可能过于简化,语音识别正确率不高。当编码中采用时间结构编码信息时,我们称之为时间编码。语音中蕴含丰富的时空结构,因此采用考虑脉冲序列的时序信息的脉冲神经网络对“鸡尾酒会问题”进行建模是一个新的解决方案,但是脉冲神经网络采用无监督的学习算法,只能分离一些简单的语音混叠,比如两个分离的人声/di/和/da/,对一些复杂的语音混叠,正确率也不能达到要求。
通过有监督学习,可以从训练语料中学习到可区分性的模式,并且数种针对脉冲神经网络的有监督学习算法也获得了一定的成功。因此,在脉冲序列的时序信息的脉冲神经网络对“鸡尾酒会问题”进行建 模时,采用有监督学习算法对网络进行训练,有益于脉冲神经网络分离复杂的连续语音混叠。
总的来说,该领域提出的基于有监督学习算法的混叠语音分离方法,虽然较传统的人工神经网络和无监督学习算法的脉冲神经网络,在混叠语音的提取和分离上有了很大的进步,但是收敛过程比较慢,提取的准确性也有待进一步提高。
发明内容
为了解决现有技术中的上述问题,即为了提高混叠语音分离的准确性,本发明提供了一种基于有监督学习听觉注意的语音提取方法,包括:
步骤S10,利用短时傅立叶变换将原始混叠语音信号转换为二维时间-频率信号表示,得到第一混叠语音信号;
步骤S20,对所述第一混叠语音信号稀疏化并将其中的时频单元的强度信息映射到预设的D个强度等级,基于强度等级信息二次稀疏化,得到第二混叠语音信号;
步骤S30,采用时间编码的方式将所述第二混叠语音信号转换为脉冲信号;所述时间编码为时间-频率编码或时间-群体编码;
采用用时间编码方式进行编码,保留语音的时序信息,用擅于处理时序信息的脉冲神经网络学习一个从带噪特征到分离目标(例如理想掩蔽或者感兴趣语音的幅度谱)的映射函数,大大提高了语音分离的准确性。
步骤S40,采用训练好的目标脉冲提取网络从所述脉冲信号中提取目标脉冲;所述目标脉冲提取网络基于脉冲神经网络构建;
步骤S50,将所述目标脉冲转换成目标语音的时间-频率表示,通过逆短时傅立叶变换转换得到目标语音。
在一些优选的实施例中,步骤S10中“利用短时傅立叶变换将原始混叠语音信号转换为二维时间-频率信号表示”,其步骤为:
步骤S11,对原始混叠语音信号进行重采样,降低所述原始混叠语音信号的采样率;
步骤S12,将重采样后的混叠语音信号通过短时快速傅里叶变换进行编码,将语音信号编码为具有时间、频率两个维度的矩阵表示,每一组时间、频率作为一个时频单元。
在一些优选的实施例中,步骤S20中“对所述第一混叠语音信号稀疏化并将其中的时频单元的强度信息映射到预设的D个强度等级,基于强度等级信息二次稀疏化”,其步骤为:
步骤S21,基于预设的背景噪音阈值,选取所述第一混叠语音信号的时频单元中大于所述背景噪音阈值的时频单元,构成第一时频单元集;
步骤S22,对时频单元集的时频单元进行K-means聚类,并将第一时频单元集的时频单元映射到预先设定好的D个强度等级上;
步骤S23,将强度等级最低的时频单元设置为静音单元,得到第二混叠语音信号。
在一些优选的实施例中,所述时间-频率编码为:
采用编码窗口中脉冲的数量和发放时间来反映时频单元的强度;稀疏映射模块中的强度等级为D,最低强度等级的时频单元被设为静音单元;时频单元的强度聚类后映射为强度0<d<D,d为整数,时频单元(t 0,f 0)对应神经元i的起始时间为t 0的时间窗口,时间间隔为Δt,则该编码窗口中以t 0为起始时间的时间间隔内则分别在
Figure PCTCN2019083352-appb-000001
处各发放一个脉冲,共发放d个脉冲。
在一些优选的实施例中,所述时间-群体编码为:
采用多个神经元群体对时频单元的强度进行编码,时频单元的强度信息脉冲分布在多个神经元群体中相应神经元的编码窗口中;稀疏映射模块中的强度等级为D,最低强度等级的时频单元被设为静音单元,时间-群体编码采用D-1个神经元群体来进行编码;时频单元的强度聚类后映射为强度0<d<D,d为整数,时频单元(t 0,f 0)对应各个神经元群体中神经元i的起始时间为t 0的时间窗口,神经元i∈P l,l=1,2,...,d,在该时间窗口的起始时间t 0处各发放一个脉冲,总计发放d个脉冲,其中P l表示第l个神经元群体。
在一些优选的实施例中,所述目标脉冲提取网络为采用随机线性神经元模型构建的一个两层全连接脉冲神经网络;
采用远程监督方法对所述目标脉冲提取网络的权重进行训练;所述目标脉冲提取网络输出层神经元j和输入层神经元i之间在t时刻的权重Δw ji(t)为:
Figure PCTCN2019083352-appb-000002
其中,
Figure PCTCN2019083352-appb-000003
S i(t)分别表示期望的输出脉冲序列、实际的输出脉冲序列和输入脉冲序列;a表示非赫布项;W(s)表示学习窗口;所述目标脉冲提取网络的权重通过对Δw ji在时间上积分获得。
在一些优选的实施例中,所述学习窗口W(s)为:
Figure PCTCN2019083352-appb-000004
其中,s是突触后脉冲发放时间和突触前脉冲发放时间之间相差的时间间隔;A是幅值,A>0;τ win是学习窗口的时间常数。
在一些优选的实施例中,所采用的远程监督方法,为加入冲量的远程监督方法或加入Nesterov加速梯度的远程监督方法;
采用所述加入冲量的远程监督方法时,所述目标脉冲提取网络输出层神经元j和输入层神经元i之间的权重
Figure PCTCN2019083352-appb-000005
为:
Figure PCTCN2019083352-appb-000006
Figure PCTCN2019083352-appb-000007
其中,k表示迭代次数;β是冲量系数,β∈[0,1];η是学习率;
Figure PCTCN2019083352-appb-000008
是用于每次迭代更新的速度向量;
采用所述加入Nesterov加速梯度的远程监督方法时,所述目标脉冲提取网络输出层神经元j和输入层神经元i之间的权重
Figure PCTCN2019083352-appb-000009
为:
Figure PCTCN2019083352-appb-000010
Figure PCTCN2019083352-appb-000011
其中,
Figure PCTCN2019083352-appb-000012
表示在
Figure PCTCN2019083352-appb-000013
处的权重更新。
在一些优选的实施例中,步骤S50中“将所述目标脉冲转换成目标语音的时间-频率表示,通过逆短时傅立叶变换转换得到目标语音”,包括以下步骤:
步骤S51,将所述目标脉冲转换成对应目标的信息掩蔽,得到对应的掩蔽值;
步骤S52,将掩蔽值与第一混叠语音信号对应点乘并加入第一混叠语音信号的相位信息,得到目标语音的时间-频率信号表示;
步骤S53,采用短时傅立叶逆变换将目标语音时间-频率信号表示转换为语音信息,获取目标语音。
本发明的另一方面,提出了一种基于有监督学习听觉注意的语音提取系统,包括获取模块、转换模块、稀疏映射模块、脉冲转换模块、目标脉冲提取模块、脉冲识别模块、输出模块;
所述获取模块,配置为获取原始混叠语音信号并输入;
所述转换模块,配置为利用短时傅立叶变换将原始混叠语音信号转换为二维时间-频率信号表示,得到第一混叠语音信号;
所述稀疏映射模块,配置为将第一混叠语音信号稀疏化并将其中的时频单元的强度信息映射到预设的D个强度等级,基于强度等级信息二次稀疏化,得到第二混叠语音信号;
所述脉冲转换模块,配置为采用时间编码的方式将第二混叠语音信号转换为脉冲信号;
所述目标脉冲提取模块,配置为采用训练好的目标脉冲提取网络从所述脉冲信号中提取目标脉冲;
所述脉冲识别模块,配置为将目标脉冲转换成目标语音的时间-频率表示,通过逆短时傅立叶变换转换得到目标语音;
所述输出模块,配置为将目标语音输出。
本发明的第三方面,提出了一种存储装置,其中存储有多条程序,所述程序适于由处理器加载并执行以实现上述的基于有监督学习听觉注意的语音提取方法。
本发明的第四方面,提出了一种处理装置,包括处理器、存储装置;所述处理器,适于执行各条程序;所述存储装置,适于存储多条程序;所述程序适于由处理器加载并执行以实现上述的基于有监督学习听觉注意的语音提取方法。
本发明的有益效果:
(1)本发明方法针对语音信号具有丰富的时空结构的特点,设计时间编码方式对混叠语音信号的强度信息进行编码,并采用脉冲神经网络学习从混叠语音的输入脉冲序列到目标语音的输出脉冲序列的映射,有效提高了语音分离的准确性。
(2)本发明设计并使用时间编码对混叠语音信息进行编码,一定程度上保留了语音丰富的时空信息,有效提高了脉冲神经网络分离语音的正确率。
(3)本发明将擅于处理时序数据的脉冲神经网络运用到语音分离中,通过有监督学习,使得网络具有处理复杂混叠语音的能力。
(4)本发明将冲量和Nesterov加速梯度引入到远程监督方法中,采用改进的远程监督方法对脉冲神经网络进行训练,大大提高了脉冲神经网络的收敛速度,并能寻找到更优解。
附图说明
通过阅读参照以下附图所作的对非限制性实施例所作的详细描述,本申请的其它特征、目的和优点将会变得更明显:
图1是本发明基于有监督学习听觉注意的语音提取方法的流程示意图;
图2是本发明基于有监督学习听觉注意的语音提取方法的框架示意图;
图3是本发明基于有监督学习听觉注意的语音提取方法实施例的时域语音转换成时间-频率表示示意图;
图4是本发明基于有监督学习听觉注意的语音提取方法实施例的滑动时间窗口示意图;
图5是本发明基于有监督学习听觉注意的语音提取方法实施例的时间-频率编码示意图;
图6是本发明基于有监督学习听觉注意的语音提取方法实施例的时间-群体编码示意图;
图7是本发明基于有监督学习听觉注意的语音提取方法实施例的脉冲神经网络示意图;
图8是本发明基于有监督学习听觉注意的语音提取方法实施例的语音输出单元示意图;
图9是本发明基于有监督学习听觉注意的语音提取方法实施例的在各个实验设置下的学习收敛数。
具体实施方式
下面结合附图和实施例对本申请作进一步的详细说明。可以理解的是,此处所描述的具体实施例仅用于解释相关发明,而非对该发明的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与有关发明相关的部分。
需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本申请。
本发明提供了一种基于有监督学习听觉注意的语音提取方法来对混叠语音进行听觉注意,提取目标语音。本方法针对语音信号具有丰富的时空结构的特点,设计时间编码方式对混叠语音信号的强度信息进行编码,并采用脉冲神经网络学习从混叠语音的输入脉冲序列到目标语音的输出脉冲序列的映射。本方法中的脉冲神经网络的权重采用有监督学习算法进行学习。通过将脉冲神经网络的神经元模型限定为线性神经元模型,本方法将冲量和Nesterov加速梯度引入到远程监督方法中,并用改进的远程监督方法对脉冲神经网络进行有监督学习,以加速收敛过程和进一步提高语音分离的准确性。
本发明的一种基于有监督学习听觉注意的语音提取方法的语音提取方法,包括:
步骤S10,利用短时傅立叶变换将原始混叠语音信号转换为二维时间-频率信号表示,得到第一混叠语音信号;
步骤S20,对所述第一混叠语音信号稀疏化并将其中的时频单元的强度信息映射到预设的D个强度等级,基于强度等级信息二次稀疏化,得到第二混叠语音信号;
步骤S30,采用时间编码的方式将所述第二混叠语音信号转换为脉冲信号;所述时间编码为时间-频率编码或时间-群体编码;
步骤S40,采用训练好的目标脉冲提取网络从所述脉冲信号中提取目标脉冲;所述目标脉冲提取网络基于脉冲神经网络构建;
步骤S50,将所述目标脉冲转换成目标语音的时间-频率表示,通过逆短时傅立叶变换转换得到目标语音。
为了更清晰地对本发明基于有监督学习听觉注意的语音提取方法进行说明,下面结合图1对本发明方法实施例中各步骤展开详述。
本发明一种实施例的基于有监督学习听觉注意的语音提取方法,包括步骤S10-步骤S50,各步骤详细描述如下:
步骤S10,利用短时傅立叶变换将原始混叠语音信号转换为二维时间-频率信号表示,得到第一混叠语音信号。
步骤S11,对原始混叠语音信号进行重采样,降低所述原始混叠语音信号的采样率。本发明实施例采用的重采样率为8KHz。
步骤S12,将重采样后的混叠语音信号通过短时快速傅里叶变换(STFT,Short-Time Fourier Transform)进行编码,将语音信号编码为具有时间、频率两个维度的矩阵表示,每一组时间、频率作为一个时频单元。
如图3所示,语音时域信号为时间幅值表示,包含了不同的语音信息,经过短时快速傅里叶变换(STFT,Short-Time Fourier Transform)进行编码,转换为时间频率表示。本实施例中STFT的窗口长度为32ms,采用正弦窗函数(sine window),Hop Size长度为16ms。
步骤S20,对第一混叠语音信号稀疏化并将其中的时频单元的强度信息映射到预设的D个强度等级,基于强度等级信息二次稀疏化,得到第二混叠语音信号。
步骤S21,基于预设的背景噪音阈值,选取所述第一混叠语音信号的时频单元中大于所述背景噪音阈值的时频单元,构成第一时频单元集。本实施例中,背景阈值设为-40dB。
步骤S22,对时频单元集的时频单元进行K-means聚类,并将第一时频单元集的时频单元映射到预先设定好的D个强度等级上。
步骤S23,将强度等级最低的时频单元设置为静音单元,得到第二混叠语音信号。
步骤S30,采用时间编码的方式将所述第二混叠语音信号转换为脉冲信号。
如图4所示,为本发明实施例时间编码的滑动编码窗口:滑动编码窗口长度是时间间隔长度的两倍;t 0、t 1、t 2和t 3是在时间维度上均匀分布的四个时间点,t 0-t 2和t 1-t 3是两个相邻的编码窗口(encoding window),而t 0-t 1、t 1-t 2和t 2-t 3都是时间间隔(time span)。
时间编码可采用时间-频率编码或时间-群体编码,本发明实施例展示了两种时间编码的示意图。
时间-频率编码,采用编码窗口中脉冲的数量和发放时间来反映时频单元的强度,将稀疏混叠语音时频单元的强度信息转换为脉冲神经网络可以处理的脉冲信号。
稀疏映射模块中的强度等级为D,最低强度等级的时频单元被设为静音单元;时频单元的强度聚类后映射为强度0<d<D,d为整数,时频单元(t 0,f 0)对应神经元i的起始时间为t 0的时间窗口,时间间隔为Δt,则该编码窗口中以t 0为起始时间的时间间隔内则分别在
Figure PCTCN2019083352-appb-000014
处各发放一个脉冲,共发放d个脉冲。
如图5所示,为本发明实施例的时间-频率编码示意图,时间-频率编码的强度在对应神经元编码窗口前半部分的时间间隔中进行编码。图示虚线表示时间间隔的边界。假设编码窗口时长为24ms,则时间间隔时长为12ms,总的强度等级D=4。由于最低强度的时频单元被设为静音单元,所以只有1、2、3这三种强度等级。当前起始时间为t 0的时频单元的频率对应神经元i,设其强度是2,则在以t 0为起始时间的时间间隔内,t 0ms和(t 0+6)ms的时候均匀分布2个脉冲;其后以该神经元i在随后以t 1为起始时间的编码窗口编码的时频单元的强度为3,则在以t 1为起始时间的时间间隔内,t 1ms、(t 1+4)ms和(t 1+8)ms的时候均匀分布3个脉冲。
时间-群体编码采用多个神经元群体对时频单元的强度进行编码,时频单元的强度信息脉冲分布在多个神经元群体中相应神经元的编码窗口中,将稀疏混叠语音时频单元的强度信息转换为脉冲神经网络可以处理的脉冲信号。
群体编码是在神经科学中发现的一个重要的编码策略,主要是用多个不精确的神经元对刺激进行编码。受启于时间编码和群体编码,时间-群体编码采用多个神经元群体对时频单元的强度进行编码。具体来说,表示时频单元的强度的脉冲分布在多个神经元群体中的相应神经元的编码窗口中。
稀疏映射模块中的强度等级为D,最低强度等级的时频单元被设为静音单元,时间-群体编码采用D-1个神经元群体来进行编码;时频单元的强度聚类后映射为强度0<d<D,d为整数,时频单元(t 0,f 0)对应 各个神经元群体中神经元i的起始时间为t 0的时间窗口,神经元i∈P l,l=1,2,...,d,在该时间窗口的起始时间t 0处各发放一个脉冲,总计发放d个脉冲,其中P l表示第l个神经元群体。
如图6所示,为本发明实施例的时间-群体编码示意图,时间-群体编码的强度在对应的神经元编码窗口前半部分的时间间隔中进行编码。图示虚线表示时间间隔的边界,图中P d表示第d个神经元群体。时间-群体编码采用多个神经元组对刺激进行编码。假设总的强度等级D=4,由于最低强度的时频单元被设为静音单元,所以只有1、2、3这三种强度等级,故有3个神经元群体对刺激进行编码。当前起始时间为t 0的时频单元的频率对应各个神经元群组中的神经元i,设其强度是2,前两个神经元群体中的神经元i各会在t 0处发放一个脉冲,而第三个神经元群体中的神经元i在t 0处沉默;其后以该神经元i在随后以t 1为起始时间的编码窗口编码的时频单元的强度为3,则三个神经元群体中的神经元i都会在t 1处各发放一个脉冲。
步骤S40,采用训练好的目标脉冲提取网络从所述脉冲信号中提取目标脉冲。
目标脉冲提取网络为采用随机线性神经元模型(stochastic linear neuron model)构建的一个两层全连接脉冲神经网络,如图7所示。
本发明实施例中,采用了有漏电流的漏电整合发放神经元模型(LIF,Leaky Integrate-and-Fire)V j(t),其定义如式(1)所示:
Figure PCTCN2019083352-appb-000015
其中,Γ j为神经元j的突触前神经元的集合,w ji为神经元j和神经元i之间的突触连接权重,
Figure PCTCN2019083352-appb-000016
为神经元i的脉冲发放时间,ε(t)为脉冲响应函数,V rest是静息电位。
ε(t)为简单的α-函数,如式(2)所示:
Figure PCTCN2019083352-appb-000017
其中,τ为时间常数,表示突触后电位(postsynaptic potential)决定电位上升和下降的快慢;H(t)是阶跃函数(Heaviside function);τ ref为不应期,表示膜电位累积达到阈值电位V thre时,神经元恢复到静息电位V rest并维持的一段时间。
本实施例中V thre=1.0,V rest=0,时间编码为时间-频率编码时,τ=0.6,τ ref=0.8,τ win=0.8,D=8;时间编码为时间-群体编码时,τ=0.45,τ ref=0.8,τ win=0.7,D=10。
脉冲神经网络的结构和时间编码方式有关。当时间编码方式是时间-频率编码时,输入层的神经元数m和输出层的神经元数n都是F,其中F是时间-频率表示X t,f的频率维度;当时间编码方式是时间-群体编码时,输入层的神经元数m是(D-1)F,而输出层的神经元数n是F。
采用远程监督方法对目标脉冲提取网络的权重进行训练。
目标脉冲提取网络输出层神经元j和输入层神经元i之间在t时刻的权重Δw ji(t)如式(3)所示:
Figure PCTCN2019083352-appb-000018
其中,
Figure PCTCN2019083352-appb-000019
S i(t)分别表示期望的输出脉冲序列、实际的输出脉冲序列和输入脉冲序列;a表示非赫布项;W(s)表示学习窗口;所述目标脉冲提取网络的权重通过对Δw ji在时间上积分获得。
学习窗口W(s)定义如式(4)所示:
Figure PCTCN2019083352-appb-000020
其中,s是突触后脉冲发放时间和突触前脉冲发放时间之间相差的时间间隔;A是幅值,A>0;τ win是学习窗口的时间常数。
当且仅当神经元模型限制在随机线性神经元模型的时候,远程监督方法可以从另一个角度推导出来,此推导过程类似于随机梯度下降。采用的远程监督方法,为加入冲量的远程监督方法或加入Nesterov加速梯度的远程监督方法。
采用加入冲量的远程监督方法(ReSuMe-M,Remote Supervised Method with Momentum)时,目标脉冲提取网络输出层神经元j和输入层神经元i之间的权重
Figure PCTCN2019083352-appb-000021
如式(5)和式(6)所示:
Figure PCTCN2019083352-appb-000022
Figure PCTCN2019083352-appb-000023
其中,k表示迭代次数;β是冲量系数,β∈[0,1];η是学习率;
Figure PCTCN2019083352-appb-000024
是用于每次迭代更新的速度向量。本实施例中,β=0.9。
采用加入Nesterov加速梯度的远程监督方法(ReSuMe-NAG,Remote Supervised Method with Nesterov’s Accelerated Gradient)时,目标脉冲提取网络输出层神经元j和输入层神经元i之间的权重
Figure PCTCN2019083352-appb-000025
如式(7)和式(8)所示:
Figure PCTCN2019083352-appb-000026
Figure PCTCN2019083352-appb-000027
其中,
Figure PCTCN2019083352-appb-000028
表示在
Figure PCTCN2019083352-appb-000029
处的权重更新。本实施例中,β=0.9。
本实施例中,脉冲神经网络的初始学习率为0.05,如果期望的输出脉冲序列和实际的输出脉冲的距离在5个epoch中连续增长,则以0.95的倍率调整学习率。采用耐心为15个epoch(迭代次数)的早停止策略。对于人工神经网络,则分别用SGD(Stochastic Gradient Descent)、SGDM(Stochastic Gradient Descent with Momentum,SGDM)和NAG(Nesterov’s Accelerated Gradient)进行优化。
步骤S50,将所述目标脉冲转换成目标语音的时间-频率表示,通过逆短时傅立叶变换转换得到目标语音。
步骤S51,将所述目标脉冲转换成对应目标的信息掩蔽,得到对应的掩蔽值。
根据一定规则将脉冲神经网络预测的输出脉冲序列转换成对应目标的信息掩蔽A t,f,其中A t,f和第一混叠语音的时间-频率表示维度相同。本发明实施例中采用理想二值掩蔽(IBM,Ideal Binary Mask),当某个神经元的某个时间间隔中有脉冲发放,则其对应的信息掩蔽单元置1,否则为0。
步骤S52,将掩蔽值与第一混叠语音信号对应点乘并加入第一混叠语音信号的相位信息,得到目标语音的时间-频率信号表示。
步骤S53,采用短时傅立叶逆变换(iSTFT,inverse Short-Time Fourier Transform)将目标语音时间-频率信号表示转换为语音信息,获取目标语音。
如图8所示,为本发明实施例的目标语音输出示意图,混叠语音信号转换成时间频率表示,通过脉冲神经网络的学习,提取出目标语音的时间-频率表示,最后采用短时傅立叶逆变换(iSTFT,inverse Short-Time Fourier Transform)将时间-频率信号表示转换为语音信息的时间幅值表示,为提取的目标语音。
为了准确评估本发明方法的目标语音分离的性能,本发明采用语音分离中权威的BSS_EVAL工具集中的全局信号失真改善度(GNSDR,global signal-to-distortion improvement)作为指标,衡量模型的语音分离性能。
本发明的实验采用英文语音数据集Grid语料库。从Grid数据集中选取一男一女两个说话人,各随机抽取出20条语音,分为3部分,其中10条用于生成训练集混叠语音,5条用于生成验证集混合语音,5条用于生成测试集混合语音。最终训练集共有100个样本,验证集有25个样本,测试集有25个样本。每个样本都被剪辑成0.5s以对齐。
为了说明本发明所述时间-频率编码(TR)和时间-群体编码(TP)的有效性,我们在上述数据集中在相同网络结构参数设置下和传统的Time-to-First-Spike(TF)进行对比实验。Time-to-First-Spike通过编码窗口中单个脉冲发放的早晚来编码强度信息,脉冲发放得越早,强度越大。
为了说明本发明所述加入冲量的远程监督方法(ReSuMe-M)和加入Nesterov加速梯度的远程监督方法(ReSuMe-NAG)的有效性,我们在上述数据集中在多种实验设置下和朴素的远程监督方法(ReSuMe)进行对比实验。
为了说明本发明所述基于有监督学习听觉注意的语音提取方法的有效性,我们在上述数据集中在相同网络结构参数设置下和两层人工神经网络中的多层感知机(MLP,Multi-Layer Perceptron)、递归神经网络(RNN,Recurrent Neural Network)和长短时记忆网络(LSTM,Long-Short Term Memory)进行对比实验。其中人工神经网络采用步骤S10得到的时间-频率表示作为输入,步骤S51中采用理想比率掩蔽(IRM,Ideal Ratio Mask),人工神经网络使用IRM比使用IBM的效果好。
传统的Time-to-First-Spike过度简化,只使用单个脉冲表示时频单元的强度,容易受到噪音的干扰。本发明提出的时间-频率编码和时间-群体编码比传统的Time-to-First-Spike编码有明显表现提升。
对比有监督方法(ReSuMe)、加入冲量的有监督方法(ReSuMe-M)和加入Nesterov加速梯度的有监督方法(ReSuMe-NAG),可以发现将冲量和Nesterov加速梯度引入到远程监督方法中后,本发明的模型跳出局部极值,能够寻找到更优解,提升语音提取准确性。
对比脉冲神经网络和人工神经网络的表现,本发明的方法在大多数设置下表现均优于相同网络参数下的人工神经网络,这表明脉冲神经网络处理时序数据的潜在优越性。
对比结果如表1所示:
表1
方法 SNN(TF) SNN(TR) SNN(TP)
ReSuMe 1.81±0.31 3.71±0.32 4.04±0.27
ReSuMe-M 2.16±0.21 4.03±0.29 4.41±0.29
ReSuMe-NAG 2.20±0.24 4.54±0.23 4.23±0.20
方法 MLP RNN LSTM
SGD 3.70±0.07 3.56±0.06 3.80±0.03
SGDM 3.72±0.07 3.58±0.05 3.94±0.07
NAG 3.74±0.06 3.58±0.05 3.94±0.06
如图9所示,从本发明方法在各个实验设置下的学习收敛数中可以看出,远程监督方法加入冲量和Nesterov加速梯度之后,脉冲序列学习的收敛过程明显加快,表明了本发明所述有监督学习算法ReSuMe-M和ReSuMe-NAG的有效性。
本发明第二实施例的基于有监督学习听觉注意的语音提取系统,包括获取模块、转换模块、稀疏映射模块、脉冲转换模块、目标脉冲提取模块、脉冲识别模块、输出模块;
所述获取模块,配置为获取原始混叠语音信号并输入;
所述转换模块,配置为利用短时傅立叶变换将原始混叠语音信号转换为二维时间-频率信号表示,得到第一混叠语音信号;
所述稀疏映射模块,配置为将第一混叠语音信号稀疏化并将其中的时频单元的强度信息映射到预设的D个强度等级,基于强度等级信息二次稀疏化,得到第二混叠语音信号;
所述脉冲转换模块,配置为采用时间编码的方式将第二混叠语音信号转换为脉冲信号;
所述目标脉冲提取模块,配置为采用训练好的目标脉冲提取网络从所述脉冲信号中提取目标脉冲;
所述脉冲识别模块,配置为将目标脉冲转换成目标语音的时间-频率表示,通过逆短时傅立叶变换转换得到目标语音;
所述输出模块,配置为将目标语音输出。
所属技术领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统的具体工作过程及有关说明,可以参考前述方法实施例中的对应过程,在此不再赘述。
需要说明的是,上述实施例提供的基于有监督学习听觉注意的语音提取系统,仅以上述各功能模块的划分进行举例说明,在实际应用中,可以根据需要而将上述功能分配由不同的功能模块来完成,即将本发明实施例中的模块或者步骤再分解或者组合,例如,上述实施例的模块可以合并为一个模块,也可以进一步拆分成多个子模块,以完成以上描述的全部或者部分功能。对于本发明实施例中涉及的模块、步骤的名称,仅仅是为了区分各个模块或者步骤,不视为对本发明的不当限定。
本发明第三实施例的一种存储装置,其中存储有多条程序,所述程序适于由处理器加载并执行以实现上述的基于有监督学习听觉注意的语音提取方法。
本发明第四实施例的一种处理装置,包括处理器、存储装置;处理器,适于执行各条程序;存储装置,适于存储多条程序;所述程序适于由处理器加载并执行以实现上述的基于有监督学习听觉注意的语音提取方法。
所属技术领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的存储装置、处理装置的具体工作过程及有关说明,可以参考前述方法实施例中的对应过程,在此不再赘述。
本领域技术人员应该能够意识到,结合本文中所公开的实施例描述的各示例的模块、方法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,软件模块、方法步骤对应的程序可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。为了清楚地说明电子硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以电子硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。本领域技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。
术语“上”、“下”、“前”、“后”、“左”、“右”等,仅是参考附图的方向,并非用来限制本发明的保护范围。
术语“第一”、“第二”等是用于区别类似的对象,而不是用于描述或表示特定的顺序或先后次序。
术语“包括”或者任何其它类似用语旨在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备/装置不仅包括那些要素,而且还包括没有明确列出的其它要素,或者还包括这些过程、方法、物品或者设备/装置所固有的要素。
至此,已经结合附图所示的优选实施方式描述了本发明的技术方案,但是,本领域技术人员容易理解的是,本发明的保护范围显然不局限于这些具体实施方式。在不偏离本发明的原理的前提下,本领域技术人员可以对相关技术特征作出等同的更改或替换,这些更改或替换之后的技术方案都将落入本发明的保护范围之内。

Claims (12)

  1. 一种基于有监督学习听觉注意的语音提取方法,其特征在于,包括:
    步骤S10,利用短时傅立叶变换将原始混叠语音信号转换为二维时间-频率信号表示,得到第一混叠语音信号;
    步骤S20,对所述第一混叠语音信号稀疏化并将其中的时频单元的强度信息映射到预设的D个强度等级,基于强度等级信息二次稀疏化,得到第二混叠语音信号;
    步骤S30,采用时间编码的方式将所述第二混叠语音信号转换为脉冲信号;所述时间编码为时间-频率编码或时间-群体编码;
    步骤S40,采用训练好的目标脉冲提取网络从所述脉冲信号中提取目标脉冲;所述目标脉冲提取网络基于脉冲神经网络构建;
    步骤S50,将所述目标脉冲转换成目标语音的时间-频率表示,通过逆短时傅立叶变换转换得到目标语音。
  2. 根据权利要求1所述的基于有监督学习听觉注意的语音提取方法,其特征在于,步骤S10中“利用短时傅立叶变换将原始混叠语音信号转换为二维时间-频率信号表示”,其步骤为:
    步骤S11,对原始混叠语音信号进行重采样,降低所述原始混叠语音信号的采样率;
    步骤S12,将重采样后的混叠语音信号通过短时快速傅里叶变换进行编码,将语音信号编码为具有时间、频率两个维度的矩阵表示,每一组时间、频率作为一个时频单元。
  3. 根据权利要求1所述的基于有监督学习听觉注意的语音提取方法, 其特征在于,步骤S20中“对所述第一混叠语音信号稀疏化并将其中的时频单元的强度信息映射到预设的D个强度等级,基于强度等级信息二次稀疏化”,其步骤为:
    步骤S21,基于预设的背景噪音阈值,选取所述第一混叠语音信号的时频单元中大于所述背景噪音阈值的时频单元,构成第一时频单元集;
    步骤S22,对时频单元集的时频单元进行K-means聚类,并将第一时频单元集的时频单元映射到预先设定好的D个强度等级上;
    步骤S23,将强度等级最低的时频单元设置为静音单元,得到第二混叠语音信号。
  4. 根据权利要求1所述的基于有监督学习听觉注意的语音提取方法,其特征在于,所述时间-频率编码为:
    采用编码窗口中脉冲的数量和发放时间来反映时频单元的强度;稀疏映射模块中的强度等级为D,最低强度等级的时频单元被设为静音单元;时频单元的强度聚类后映射为强度0<d<D,d为整数,时频单元(t 0,f 0)对应神经元i的起始时间为t 0的时间窗口,时间间隔为Δt,则该编码窗口中以t 0为起始时间的时间间隔内则分别会在
    Figure PCTCN2019083352-appb-100001
    l=0,1,...,d-1处各发放一个脉冲,共发放d个脉冲。
  5. 根据权利要求1所述的基于有监督学习听觉注意的语音提取方法,其特征在于,所述时间-群体编码为:
    采用多个神经元群体对时频单元的强度进行编码,时频单元的强度信息脉冲分布在多个神经元群体中相应神经元的编码窗口中;稀疏映射模块中的强度等级为D,最低强度等级的时频单元被设为静音单元,时间-群体编码采用D-1个神经元群体来进行编码;时频单元的强度聚类后映射为强度0<d<D,d为整数,时频单元(t 0,f 0)对应各个神经元群体中神经 元i的起始时间为t 0的时间窗口,神经元i∈P l,l=1,2,...,d,在该时间窗口的起始时间t 0处各发放一个脉冲,总计发放d个脉冲,其中P l表示第l个神经元群体。
  6. 根据权利要求1所述的基于有监督学习听觉注意的语音提取方法,其特征在于,所述目标脉冲提取网络为采用随机线性神经元模型构建的一个两层全连接脉冲神经网络;
    采用远程监督方法对所述目标脉冲提取网络的权重进行训练;所述目标脉冲提取网络输出层神经元j和输入层神经元i之间在t时刻的权重Δw ji(t)为:
    Figure PCTCN2019083352-appb-100002
    其中,
    Figure PCTCN2019083352-appb-100003
    S i(t)分别表示期望的输出脉冲序列、实际的输出脉冲序列和输入脉冲序列;a表示非赫布项;W(s)表示学习窗口;所述目标脉冲提取网络的权重通过对Δw ji在时间上积分获得。
  7. 根据权利要求4所述的基于有监督学习听觉注意的语音提取方法,其特征在于,所述学习窗口W(s)为:
    Figure PCTCN2019083352-appb-100004
    其中,s是突触后脉冲发放时间和突触前脉冲发放时间之间相差的时间间隔;A是幅值,A>0;τ win是学习窗口的时间常数。
  8. 根据权利要求4或5所述的基于有监督学习听觉注意的语音提取方法,其特征在于,所采用的远程监督方法,为加入冲量的远程监督方法 或加入Nesterov加速梯度的远程监督方法;
    采用所述加入冲量的远程监督方法时,所述目标脉冲提取网络输出层神经元j和输入层神经元i之间的权重
    Figure PCTCN2019083352-appb-100005
    为:
    Figure PCTCN2019083352-appb-100006
    Figure PCTCN2019083352-appb-100007
    其中,k表示迭代次数;β是冲量系数,β∈[0,1];η是学习率;
    Figure PCTCN2019083352-appb-100008
    是用于每次迭代更新的速度向量;
    采用所述加入Nesterov加速梯度的远程监督方法时,所述目标脉冲提取网络输出层神经元j和输入层神经元i之间的权重
    Figure PCTCN2019083352-appb-100009
    为:
    Figure PCTCN2019083352-appb-100010
    Figure PCTCN2019083352-appb-100011
    其中,
    Figure PCTCN2019083352-appb-100012
    表示在
    Figure PCTCN2019083352-appb-100013
    处的权重更新。
  9. 根据权利要求1所述的基于有监督学习听觉注意的语音提取方法,其特征在于,步骤S50中“将所述目标脉冲转换成目标语音的时间-频率表示,通过逆短时傅立叶变换转换得到目标语音”,包括以下步骤:
    步骤S51,将所述目标脉冲转换成对应目标的信息掩蔽,得到对应的掩蔽值;
    步骤S52,将掩蔽值与第一混叠语音信号对应点乘并加入第一混叠语音信号的相位信息,得到目标语音的时间-频率信号表示;
    步骤S53,采用短时傅立叶逆变换将目标语音时间-频率信号表示转换为语音信息,获取目标语音。
  10. 一种基于有监督学习听觉注意的语音提取系统,其特征在于,包括获取模块、转换模块、稀疏映射模块、脉冲转换模块、目标脉冲提取模块、脉冲识别模块、输出模块;
    所述获取模块,配置为获取原始混叠语音信号并输入;
    所述转换模块,配置为利用短时傅立叶变换将原始混叠语音信号转换为二维时间-频率信号表示,得到第一混叠语音信号;
    所述稀疏映射模块,配置为将第一混叠语音信号稀疏化并将其中的时频单元的强度信息映射到预设的D个强度等级,基于强度等级信息二次稀疏化,得到第二混叠语音信号;
    所述脉冲转换模块,配置为采用时间编码的方式将第二混叠语音信号转换为脉冲信号;
    所述目标脉冲提取模块,配置为采用训练好的目标脉冲提取网络从所述脉冲信号中提取目标脉冲;
    所述脉冲识别模块,配置为将目标脉冲转换成目标语音的时间-频率表示,通过逆短时傅立叶变换转换得到目标语音;
    所述输出模块,配置为将目标语音输出。
  11. 一种存储装置,其中存储有多条程序,其特征在于,所述程序适于由处理器加载并执行以实现权利要求1-9任一项所述的基于有监督学习听觉注意的语音提取方法。
  12. 一种处理装置,包括
    处理器,适于执行各条程序;以及
    存储装置,适于存储多条程序;
    其特征在于,所述程序适于由处理器加载并执行以实现:
    权利要求1-9任一项所述的基于有监督学习听觉注意的语音提取方法。
PCT/CN2019/083352 2018-12-19 2019-04-19 基于有监督学习听觉注意的语音提取方法、系统、装置 WO2020124902A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/645,447 US10923136B2 (en) 2018-12-19 2019-04-19 Speech extraction method, system, and device based on supervised learning auditory attention

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811558212.6 2018-12-19
CN201811558212.6A CN109448749B (zh) 2018-12-19 2018-12-19 基于有监督学习听觉注意的语音提取方法、系统、装置

Publications (1)

Publication Number Publication Date
WO2020124902A1 true WO2020124902A1 (zh) 2020-06-25

Family

ID=65560163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/083352 WO2020124902A1 (zh) 2018-12-19 2019-04-19 基于有监督学习听觉注意的语音提取方法、系统、装置

Country Status (3)

Country Link
US (1) US10923136B2 (zh)
CN (1) CN109448749B (zh)
WO (1) WO2020124902A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699956A (zh) * 2021-01-08 2021-04-23 西安交通大学 一种基于改进脉冲神经网络的神经形态视觉目标分类方法
CN113037781A (zh) * 2021-04-29 2021-06-25 广东工业大学 基于rnn的语音信息加密方法及装置

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448749B (zh) 2018-12-19 2022-02-15 中国科学院自动化研究所 基于有监督学习听觉注意的语音提取方法、系统、装置
CN111768761B (zh) * 2019-03-14 2024-03-01 京东科技控股股份有限公司 一种语音识别模型的训练方法和装置
CN110609986B (zh) * 2019-09-30 2022-04-05 哈尔滨工业大学 一种基于预训练的结构化数据生成文本的方法
CN111540367B (zh) * 2020-04-17 2023-03-31 合肥讯飞数码科技有限公司 语音特征提取方法、装置、电子设备和存储介质
CN111739555B (zh) * 2020-07-23 2020-11-24 深圳市友杰智新科技有限公司 基于端到端深度神经网络的音频信号处理方法及装置
CN113192526B (zh) * 2021-04-28 2023-10-31 北京达佳互联信息技术有限公司 音频处理方法和音频处理装置
CN113257282B (zh) * 2021-07-15 2021-10-08 成都时识科技有限公司 语音情感识别方法、装置、电子设备以及存储介质
CN115662409B (zh) * 2022-10-27 2023-05-05 亿铸科技(杭州)有限责任公司 一种语音识别方法、装置、设备及存储介质
CN115587321B (zh) * 2022-12-09 2023-03-28 中国科学院苏州生物医学工程技术研究所 一种脑电信号识别分类方法、系统及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105448302A (zh) * 2015-11-10 2016-03-30 厦门快商通信息技术有限公司 一种环境自适应的语音混响消除方法和系统
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN108680245A (zh) * 2018-04-27 2018-10-19 天津大学 鲸豚类Click类叫声与传统声呐信号分类方法及装置
CN108899048A (zh) * 2018-05-10 2018-11-27 广东省智能制造研究所 一种基于信号时频分解的声音数据分类方法
CN109448749A (zh) * 2018-12-19 2019-03-08 中国科学院自动化研究所 基于有监督学习听觉注意的语音提取方法、系统、装置

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3840684B2 (ja) * 1996-02-01 2006-11-01 ソニー株式会社 ピッチ抽出装置及びピッチ抽出方法
JP3006677B2 (ja) * 1996-10-28 2000-02-07 日本電気株式会社 音声認識装置
KR101400535B1 (ko) * 2008-07-11 2014-05-28 프라운호퍼 게젤샤프트 쭈르 푀르데룽 데어 안겐반텐 포르슝 에. 베. 시간 워프 활성 신호의 제공 및 이를 이용한 오디오 신호의 인코딩
TR201810466T4 (tr) * 2008-08-05 2018-08-27 Fraunhofer Ges Forschung Özellik çıkarımı kullanılarak konuşmanın iyileştirilmesi için bir ses sinyalinin işlenmesine yönelik aparat ve yöntem.
JP2014219467A (ja) * 2013-05-02 2014-11-20 ソニー株式会社 音信号処理装置、および音信号処理方法、並びにプログラム
CN105118503A (zh) * 2015-07-13 2015-12-02 中山大学 一种音频翻录检测方法
CN105957537B (zh) * 2016-06-20 2019-10-08 安徽大学 一种基于l1/2稀疏约束卷积非负矩阵分解的语音去噪方法和系统
CN108109619B (zh) * 2017-11-15 2021-07-06 中国科学院自动化研究所 基于记忆和注意力模型的听觉选择方法和装置
CN107945817B (zh) * 2017-11-15 2021-10-22 广东顺德西安交通大学研究院 心肺音信号分类方法、检测方法、装置、介质和计算机设备
CN107863111A (zh) * 2017-11-17 2018-03-30 合肥工业大学 面向交互的语音语料处理方法及装置
CN109034070B (zh) * 2018-07-27 2021-09-14 河南师范大学 一种置换混叠图像盲分离方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN105448302A (zh) * 2015-11-10 2016-03-30 厦门快商通信息技术有限公司 一种环境自适应的语音混响消除方法和系统
CN108680245A (zh) * 2018-04-27 2018-10-19 天津大学 鲸豚类Click类叫声与传统声呐信号分类方法及装置
CN108899048A (zh) * 2018-05-10 2018-11-27 广东省智能制造研究所 一种基于信号时频分解的声音数据分类方法
CN109448749A (zh) * 2018-12-19 2019-03-08 中国科学院自动化研究所 基于有监督学习听觉注意的语音提取方法、系统、装置

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699956A (zh) * 2021-01-08 2021-04-23 西安交通大学 一种基于改进脉冲神经网络的神经形态视觉目标分类方法
CN112699956B (zh) * 2021-01-08 2023-09-22 西安交通大学 一种基于改进脉冲神经网络的神经形态视觉目标分类方法
CN113037781A (zh) * 2021-04-29 2021-06-25 广东工业大学 基于rnn的语音信息加密方法及装置

Also Published As

Publication number Publication date
CN109448749B (zh) 2022-02-15
US10923136B2 (en) 2021-02-16
CN109448749A (zh) 2019-03-08
US20200402526A1 (en) 2020-12-24

Similar Documents

Publication Publication Date Title
WO2020124902A1 (zh) 基于有监督学习听觉注意的语音提取方法、系统、装置
Kong et al. Weakly labelled audioset tagging with attention neural networks
CN103531199B (zh) 基于快速稀疏分解和深度学习的生态声音识别方法
CN109616104B (zh) 基于关键点编码和多脉冲学习的环境声音识别方法
US11694696B2 (en) Method and apparatus for implementing speaker identification neural network
CN111429947B (zh) 一种基于多级残差卷积神经网络的语音情感识别方法
CN110610709A (zh) 基于声纹识别的身份辨别方法
Xiao et al. A spiking neural network model for sound recognition
Xiao et al. Spike-based encoding and learning of spectrum features for robust sound recognition
CN113111786B (zh) 基于小样本训练图卷积网络的水下目标识别方法
CN109522448B (zh) 一种基于crbm和snn进行鲁棒性语音性别分类的方法
Lu et al. Deep convolutional neural network with transfer learning for environmental sound classification
CN111091815A (zh) 基于膜电压驱动的聚合标签学习模型的语音识别方法
Khandelwal et al. FMSG-NTU submission for DCASE 2022 Task 4 on sound event detection in domestic environments
Pak et al. Convolutional neural network approach for aircraft noise detection
CN114428234A (zh) 基于gan和自注意力的雷达高分辨距离像降噪识别方法
CN113850438A (zh) 公共建筑能耗预测方法、系统、设备及介质
Li et al. Work mode identification of airborne phased array radar based on the combination of multi-level modeling and deep learning
CN113628615B (zh) 语音识别方法、装置、电子设备及存储介质
CN115602156A (zh) 一种基于多突触连接光脉冲神经网络的语音识别方法
CN116561664A (zh) 基于tcn网络的雷达辐射源脉间调制模式识别方法
CN113948067B (zh) 一种具有听觉高保真度特点的语音对抗样本修复方法
CN115032682A (zh) 一种基于图论的多站台地震震源参数估计方法
Zhipeng et al. Voiceprint recognition based on BP Neural Network and CNN
GB2614674A (en) Accuracy of streaming RNN transducer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19898572

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19898572

Country of ref document: EP

Kind code of ref document: A1