CN111816187A - Deep neural network-based voice feature mapping method in complex environment - Google Patents

Deep neural network-based voice feature mapping method in complex environment

Info

Publication number
CN111816187A
Authority
CN
China
Prior art keywords
complex environment
voice
environment
feature mapping
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010635342.6A
Other languages
Chinese (zh)
Inventor
刘剑豪
王亨佳
胡乔林
高坡
都兴霖
杨华兵
王敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Early Warning Academy
Original Assignee
Air Force Early Warning Academy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Early Warning Academy filed Critical Air Force Early Warning Academy
Priority to CN202010635342.6A priority Critical patent/CN111816187A/en
Publication of CN111816187A publication Critical patent/CN111816187A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

The invention relates to the technical field of voice signal processing, and discloses a deep neural network-based voice feature mapping method in a complex environment, which comprises the following steps: Step 1: constructing a large number of speech signal data pairs in clean and complex environments; Step 2: extracting the features of the speech signal in the clean environment; Step 3: extracting the features of the speech signal in the complex environment; Step 4: training a DNN model; Step 5: mapping the speech features in the complex environment using the trained DNN model. The method adopts a DNN as the mapping model, which can effectively fit the nonlinear relation between the speech signal feature parameters in a complex environment and those in a clean environment; the proposed feature mapping method maps the features of the speech signal in the complex environment and effectively improves the purity of the speech features in the complex environment; and the method has generalization capability over most acoustic scenes.

Description

Deep neural network-based voice feature mapping method in complex environment
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a deep neural network-based voice feature mapping method in a complex environment.
Background
With the continuous development of speech signal processing, pattern recognition and artificial intelligence technologies, speaker recognition technology has begun to move from the laboratory to practical applications, and shows great application prospects in fields such as information security, financial verification, criminal investigation, and military and national defense security.
Speech is the communication mode that humans use most frequently and efficiently, and it is also one of the human biometric characteristics. In a good acoustic environment, speech as a human-computer interaction mode not only conveys the meaning people express, but also enables accurate and fast personal identity authentication through speaker recognition technology. These broad application demands have driven the development of speaker recognition within speech technology, making it an important element of the human-computer interaction revolution, with epoch-making significance.
However, real life is full of uncertain factors, so the robustness of speaker recognition still faces great difficulties and challenges. These factors mainly include background noise, channel differences, speech variability, short utterances, time-varying speech, and emotional influence. In particular, channel differences and environmental changes are unavoidable obstacles for speaker recognition technology in application scenarios and cause speaker recognition performance to degrade sharply.
Current solutions to the influence of environmental noise and channel differences on speaker recognition fall into three main categories: the feature domain, the model domain, and the score domain. Addressing channel interference in the feature domain is effective because it does not depend on the model, is independent of the scoring algorithm applied after model matching, and relates directly to the inherent attributes of the speech signal. Feature mapping is one such feature-domain solution: by training a mapping model over most generalized scenes, the speech signal features in a clean environment are predicted from the speech signal features in a complex environment, greatly improving the channel and noise robustness of the features.
Disclosure of Invention
In view of the above problems, the present invention is directed to providing a deep neural network-based speech feature mapping method in a complex environment, which solves the problem of speech signal spectrum distortion caused by the mismatch between the training environment and the testing environment, and also eliminates the distortion of speaker feature parameters caused by channel transmission characteristics.
The invention adopts the following technical scheme for realizing the technical purpose: the deep neural network-based voice feature mapping method under the complex environment comprises the following steps:
Step 1: constructing a large number of speech signal data pairs in clean and complex environments;
Step 2: extracting the features of the speech signal in the clean environment;
Step 3: extracting the features of the speech signal in the complex environment;
Step 4: training a DNN model;
Step 5: mapping the speech features in the complex environment using the trained DNN model.
Further, step 1 is specifically realized as follows:
The DNN-based feature mapping method is a supervised training model and requires a large number of parallel corpora, i.e., data pairs of speech in a complex environment and speech in a clean environment, so that the DNN model attains good generalization capability. In general, the main factors affecting the speaker recognition rate are environmental noise and channel difference, with additive noise in the environmental noise having the largest influence on the recognition rate; therefore, a large number of parallel corpora are constructed according to the complex environment model, given by the formula:
S = f(X, w) + αN
where α is an adjustment factor used to control the signal-to-noise ratio; X represents the speech signal collected in the clean environment; N represents the noise signal; and w is a channel transmission parameter. With this model, massive complex-environment speech data with multiple signal-to-noise ratios, multiple noise types and different channel transmissions can be constructed.
Further, the specific implementation of step 2 includes the following steps:
Step 2.1: preprocessing the speech signal in the clean environment, including sampling, quantization, pre-emphasis, endpoint detection, framing and windowing;
Step 2.2: extracting Mel-frequency cepstral coefficient (MFCC) features from the signal preprocessed in step 2.1.
Further, the specific implementation of step 3 includes the following steps:
Step 3.1: preprocessing the speech signal in the complex environment, including sampling, quantization, pre-emphasis, endpoint detection, framing and windowing;
Step 3.2: extracting MFCC features from the signal preprocessed in step 3.1.
Further, the specific implementation of step 4 includes the following steps:
Step 4.1: training a restricted Boltzmann machine (RBM)-based initialization model using the speech feature parameters in the complex environment;
Step 4.2: updating the parameters of the entire DNN using a back-propagation algorithm with a minimum mean square error criterion between the features of speech in the clean environment and speech in the complex environment.
Further, step 5 comprises the following steps:
Step 5.1: the feature mapping model is:
ĉ_i = f(c_i; w_1, …, w_j) + μ_i
where c_i is the feature of the speech signal in the complex environment; w_1, …, w_j are the DNN model parameters and f(·; w_1, …, w_j) is a nonlinear function; μ_i is a perturbation term; and ĉ_i is the feature obtained after the feature mapping;
Step 5.2: substituting the parameters estimated in step 4.2 into the feature mapping parameter masking matrix yields the speech features after the complex-environment speech passes through the feature mapping model; the features at this point can be approximately regarded as speech features in a clean environment.
The invention has the following beneficial effects:
1. The method adopts a DNN as the mapping model, which can effectively fit the nonlinear relation between the speech signal feature parameters in a complex environment and those in a clean environment.
2. The feature mapping method provided by the invention maps the features of the speech signal in the complex environment and effectively improves the purity of the speech features in the complex environment.
3. The feature mapping method provided by the invention has generalization capability over most acoustic scenes.
Drawings
FIG. 1 is a flowchart of a deep neural network-based feature mapping method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by a person of ordinary skill in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, a deep neural network-based speech feature mapping method in a complex environment includes the following steps:
Step 1: constructing data pairs of speech signals in a clean environment and a complex environment;
the concrete implementation is as follows:
the DNN-based feature mapping method is a supervised training model, needs a large amount of parallel linguistic data, and also needs a large amount of data pairs of voice under a complex environment and voice under a clean environment, so that the DNN model has better generalization capability; in general, the main factors influencing the recognition rate of a speaker are environmental noise and channel difference, and additive noise in the environmental noise has the largest influence on the recognition rate, so that a large number of parallel corpora are constructed according to a complex environment model, and the complex model is shown as the formula:
S=f(X,w)+αN
where α is an adjustment factor used to control the signal-to-noise ratio; x represents a voice signal collected in a clean environment; t represents a channel transmission matrix for controlling transmission characteristics of different channels; n represents a noise signal; w is a channel parameter; the model can construct massive voice data with multiple signal-to-noise ratios, multiple noise types and complex environments transmitted by different channels.
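By way of illustration only, the complex environment model can be instantiated as in the following minimal sketch, which assumes f(X, w) is a linear channel represented by an impulse response w and derives α from a target signal-to-noise ratio; the function name mix_at_snr and all parameter values are illustrative, not part of the original disclosure.

import numpy as np

def mix_at_snr(clean, noise, channel_ir, snr_db):
    """Build one complex-environment signal from S = f(X, w) + alpha*N,
    with f modeled here as convolution by a channel impulse response."""
    channeled = np.convolve(clean, channel_ir, mode="same")  # f(X, w)
    noise = np.resize(noise, channeled.shape)                # match lengths
    speech_power = np.mean(channeled ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Choose alpha so the mixture reaches the requested signal-to-noise ratio
    alpha = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return channeled + alpha * noise

# Toy example: one clean utterance paired with mixtures at several SNRs
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)      # placeholder for a clean utterance
noise = rng.standard_normal(16000)      # placeholder for a noise recording
channel_ir = np.array([1.0, 0.3, 0.1])  # toy channel transmission parameters w
pairs = [(clean, mix_at_snr(clean, noise, channel_ir, snr)) for snr in (0, 5, 10, 15)]

Repeating this over many utterances, noise types, channel responses and signal-to-noise ratios yields the massive parallel corpus described above.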
Step 2: extracting characteristic parameters of the voice in a clean environment;
the specific implementation comprises the following substeps:
step 2.1: preprocessing a voice signal obtained in a clean environment, including sampling, quantizing, pre-emphasizing, framing and windowing;
step 2.2: the preprocessed signal in step 2.1 is extracted as MFCC.
Step 3: extracting the feature parameters of the speech in the complex environment;
Step 3.1: preprocessing the speech signal in the complex environment, including sampling, quantization, pre-emphasis, framing and windowing;
Step 3.2: extracting MFCC features from the signal preprocessed in step 3.1.
Step 4: training the DNN mapping model on large-scale data;
Step 4.1: training a restricted Boltzmann machine (RBM)-based initialization model using the speech feature parameters in the complex environment;
Step 4.2: updating the parameters of the whole DNN using a back-propagation algorithm with a minimum mean square error criterion between the features of clean-environment speech and complex-environment speech; the objective function of the network training is:
E = (1/N) Σ_{n=1}^{N} ‖ŝ(n) − s(n)‖²
where N is the number of samples, s(n) denotes the clean speech feature, and ŝ(n) denotes the feature the network predicts from the shortwave (complex-environment) speech; minimizing E corrects the network weights. The correction formulas are:
W_l ← W_l − η·∂E/∂W_l
b_l ← b_l − η·∂E/∂b_l
where η is the learning rate; W_l and b_l are the parameters of the l-th layer; L is the total number of hidden layers; and layer L+1 is the output layer of the network.
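A minimal training sketch of step 4.2 under stated assumptions: PyTorch is used, the RBM-based initialization of step 4.1 is replaced by random initialization for brevity, and the layer sizes and learning rate are illustrative.

import torch
import torch.nn as nn

class MappingDNN(nn.Module):
    """Fully connected DNN mapping complex-environment MFCCs to clean ones.
    RBM pre-training (step 4.1) is omitted; random initialization is used."""
    def __init__(self, dim=13, hidden=512, num_hidden_layers=3):
        super().__init__()
        layers, prev = [], dim
        for _ in range(num_hidden_layers):       # the L hidden layers
            layers += [nn.Linear(prev, hidden), nn.Sigmoid()]
            prev = hidden
        layers.append(nn.Linear(prev, dim))      # layer L+1: the output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MappingDNN()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # eta, the learning rate
criterion = nn.MSELoss()  # E = (1/N) * sum over n of ||s_hat(n) - s(n)||^2

def train_step(complex_mfcc, clean_mfcc):
    """One back-propagation update of W_l and b_l under the MMSE criterion."""
    optimizer.zero_grad()
    loss = criterion(model(complex_mfcc), clean_mfcc)  # s_hat(n) vs. s(n)
    loss.backward()    # gradients dE/dW_l and dE/db_l via back-propagation
    optimizer.step()   # W_l <- W_l - eta * dE/dW_l, likewise for b_l
    return loss.item()

The sigmoid hidden units mirror the RBM-era DNNs the description implies; any modern activation would also work in this sketch.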
Step 5: performing feature mapping on the feature parameters of the complex-environment speech with the trained DNN mapping model.
Step 5.1: the feature mapping model is:
ĉ_i = f(c_i; w_1, …, w_j) + μ_i
where c_i is the feature of the speech signal in the complex environment; w_1, …, w_j are the DNN model parameters and f(·; w_1, …, w_j) is a nonlinear function; μ_i is a perturbation term; and ĉ_i is the feature obtained after the feature mapping.
Step 5.2: substituting the parameters estimated in step 4.2 into the feature mapping parameter masking matrix yields the speech features after the complex-environment speech passes through the feature mapping model; the features at this point can be approximately regarded as speech features in a clean environment.
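Continuing the sketch above under the same assumptions, the trained model maps complex-environment MFCC frames to approximately clean features; map_features is a hypothetical helper name.

import torch

def map_features(model, complex_mfcc_frames):
    """Apply the trained mapping DNN to complex-environment MFCC frames;
    the output can be approximately regarded as clean-environment features."""
    model.eval()
    with torch.no_grad():
        frames = torch.as_tensor(complex_mfcc_frames, dtype=torch.float32)
        return model(frames)

# The mapped features can then be fed to a speaker recognition back-end:
# clean_like_feats = map_features(model, complex_feats)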
In conclusion, the invention adopts a DNN as the mapping model, which can effectively fit the nonlinear relation between the speech signal feature parameters in a complex environment and those in a clean environment; the proposed feature mapping method maps the features of the speech signal in the complex environment and effectively improves the purity of the speech features in the complex environment; and the method has generalization capability over most acoustic scenes.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A deep neural network-based speech feature mapping method in a complex environment, characterized by comprising the following steps:
Step 1: constructing a large number of speech signal data pairs in clean and complex environments;
Step 2: extracting the features of the speech signal in the clean environment;
Step 3: extracting the features of the speech signal in the complex environment;
Step 4: training a DNN model;
Step 5: mapping the speech features in the complex environment using the trained DNN model.
2. The deep neural network-based speech feature mapping method in a complex environment according to claim 1, characterized in that step 1 is specifically realized as follows:
the DNN-based feature mapping method is a supervised training model and requires a large number of parallel corpora, i.e., data pairs of speech in a complex environment and speech in a clean environment, so that the DNN model attains good generalization capability; in general, the main factors affecting the speaker recognition rate are environmental noise and channel difference, with additive noise in the environmental noise having the largest influence on the recognition rate; therefore, a large number of parallel corpora are constructed according to the complex environment model, given by the formula:
S = f(X, w) + αN
where α is an adjustment factor used to control the signal-to-noise ratio; X represents the speech signal collected in the clean environment; N represents the noise signal; and w is a channel transmission parameter; with this model, massive complex-environment speech data with multiple signal-to-noise ratios, multiple noise types and different channel transmissions can be constructed.
3. The deep neural network-based speech feature mapping method in a complex environment according to claim 1, characterized in that the specific implementation of step 2 comprises the following steps:
Step 2.1: preprocessing the speech signal in the clean environment, including sampling, quantization, pre-emphasis, endpoint detection, framing and windowing;
Step 2.2: extracting Mel-frequency cepstral coefficient (MFCC) features from the signal preprocessed in step 2.1.
4. The deep neural network-based speech feature mapping method in a complex environment according to claim 1, characterized in that the specific implementation of step 3 comprises the following steps:
Step 3.1: preprocessing the speech signal in the complex environment, including sampling, quantization, pre-emphasis, endpoint detection, framing and windowing;
Step 3.2: extracting MFCC features from the signal preprocessed in step 3.1.
5. The deep neural network-based speech feature mapping method in a complex environment according to claim 1, characterized in that the specific implementation of step 4 comprises the following steps:
Step 4.1: training a restricted Boltzmann machine (RBM)-based initialization model using the speech feature parameters in the complex environment;
Step 4.2: updating the parameters of the entire DNN using a back-propagation algorithm with a minimum mean square error criterion between the features of speech in the clean environment and speech in the complex environment.
6. The deep neural network-based speech feature mapping method in a complex environment according to any one of claims 1 to 5, characterized in that step 5 comprises the following steps:
Step 5.1: the feature mapping model is:
ĉ_i = f(c_i; w_1, …, w_j) + μ_i
where c_i is the feature of the speech signal in the complex environment; w_1, …, w_j are the DNN model parameters and f(·; w_1, …, w_j) is a nonlinear function; μ_i is a perturbation term; and ĉ_i is the feature obtained after the feature mapping;
Step 5.2: substituting the parameters estimated in step 4.2 into the feature mapping parameter masking matrix yields the speech features after the complex-environment speech passes through the feature mapping model; the features at this point can be approximately regarded as speech features in a clean environment.
CN202010635342.6A 2020-07-03 2020-07-03 Deep neural network-based voice feature mapping method in complex environment Pending CN111816187A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010635342.6A CN111816187A (en) 2020-07-03 2020-07-03 Deep neural network-based voice feature mapping method in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010635342.6A CN111816187A (en) 2020-07-03 2020-07-03 Deep neural network-based voice feature mapping method in complex environment

Publications (1)

Publication Number Publication Date
CN111816187A true CN111816187A (en) 2020-10-23

Family

ID=72855794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010635342.6A Pending CN111816187A (en) 2020-07-03 2020-07-03 Deep neural network-based voice feature mapping method in complex environment

Country Status (1)

Country Link
CN (1) CN111816187A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530399A (en) * 2020-11-30 2021-03-19 上海明略人工智能(集团)有限公司 Method and system for expanding voice data, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782520A (en) * 2017-03-14 2017-05-31 华中师范大学 Phonetic feature mapping method under a kind of complex environment
CN108766430A (en) * 2018-06-06 2018-11-06 华中师范大学 A kind of phonetic feature mapping method and system based on Pasteur's distance
CN109658949A (en) * 2018-12-29 2019-04-19 重庆邮电大学 A kind of sound enhancement method based on deep neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张洪冉: "Research on the robustness of speaker recognition in noisy environments", China Master's Theses Full-text Database, Information Science and Technology Series, pages 136-357 *
王子腾 et al.: "Deep mapping network spectral/feature enhancement method for speech recognition" *
高登峰 et al.: "Deep neural network-based speech enhancement method for ground-air communication", Proceedings of the First Annual Conference on Air Traffic Management System Technology


Similar Documents

Publication Publication Date Title
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
WO2021042870A1 (en) Speech processing method and apparatus, electronic device, and computer-readable storage medium
Yang et al. Characterizing speech adversarial examples using self-attention u-net enhancement
CN111261147B (en) Music embedding attack defense method for voice recognition system
CN105611477B (en) The voice enhancement algorithm that depth and range neutral net are combined in digital deaf-aid
CN110111803A (en) Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference
CN108962237A (en) Mixing voice recognition methods, device and computer readable storage medium
CN110739003B (en) Voice enhancement method based on multi-head self-attention mechanism
CN109599109A (en) For the confrontation audio generation method and system of whitepack scene
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN109887496A (en) Orientation confrontation audio generation method and system under a kind of black box scene
CN103065629A (en) Speech recognition system of humanoid robot
CN107068167A (en) Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures
CN111968666A (en) Hearing aid voice enhancement method based on depth domain self-adaptive network
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN112017682A (en) Single-channel voice simultaneous noise reduction and reverberation removal system
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
Li et al. Densely connected network with time-frequency dilated convolution for speech enhancement
CN104778948A (en) Noise-resistant voice recognition method based on warped cepstrum feature
CN112183582A (en) Multi-feature fusion underwater target identification method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination