WO2021217750A1 - 消除语音交互中信道差异的方法及系统、电子设备及介质 (Method and system for eliminating channel differences in voice interaction, electronic device and medium) - Google Patents

消除语音交互中信道差异的方法及系统、电子设备及介质 (Method and system for eliminating channel differences in voice interaction, electronic device and medium)

Info

Publication number
WO2021217750A1
WO2021217750A1 (PCT application PCT/CN2020/091030, published as WO 2021/217750 A1)
Authority
WO
WIPO (PCT)
Prior art keywords
cepstrum
speech
signal
background environment
mean value
Prior art date
Application number
PCT/CN2020/091030
Other languages
English (en)
French (fr)
Inventor
陆成
叶顺舟
Original Assignee
锐迪科微电子科技(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 锐迪科微电子科技(上海)有限公司
Publication of WO2021217750A1 publication Critical patent/WO2021217750A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the present invention relates to the field of voice processing, in particular to a method and system, electronic equipment and medium for eliminating channel differences in voice interaction.
  • recognition performance is best when the recording channel environment of the speech model's training corpus is consistent with the pickup channel environment of the speech collected at recognition time.
  • the channel environment is defined as the set of signal transformations that the speech undergoes between leaving the speaker's mouth and being stored in digital form. Referring to Fig. 1, the speech signal s(t) leaves the speaker's mouth, is converted by an ADC into the digital signal x(k), and then enters the recognizer for recognition.
  • the performance of the back-end recognition may be significantly reduced because the training corpus does not match the channel environment of the actual collected speech.
  • the technical problem to be solved by the present invention is to overcome the defect in the prior art that back-end recognition performance degrades when the channel environments of the training corpus and the actually collected speech do not match, and to provide a method, system, electronic device and medium for eliminating channel differences in voice interaction.
  • the first aspect of the present invention provides a method for eliminating channel differences in voice interaction, including the following steps:
  • in the training phase, the cepstral mean of the background environment signal is subtracted from the cepstral features of the speech signals in the training corpus to obtain a normalized cepstral sequence, and the cepstral sequence is used to train the speech model; the speech signals include background environment signals;
  • in the use phase, the cepstral mean of the background environment signal is subtracted from the cepstral features of the collected user speech to obtain a normalized cepstral sequence, and the cepstral sequence is input into the trained speech model.
  • calculating the mean value of the cepstrum of the background environment signal in the corresponding scene according to the cepstrum feature specifically includes:
  • the cepstrum mean value of the background environment signal is calculated according to the cepstrum feature of the background environment signal in the corresponding scene.
  • calculating the mean value of the cepstrum of the background environment signal in the corresponding scene according to the cepstrum feature specifically includes:
  • the speech signal in the training corpus is divided evenly into several segments, the cepstral mean of each segment is calculated from its cepstral features, and the minimum of all the cepstral means is taken as the cepstral mean of the background environment signal in the corresponding scene.
  • estimating the mean value of the cepstrum of the background environment signal in the same scene as the user's speech signal according to the cepstrum feature specifically includes:
  • the first-order recursive estimator is used to calculate the mean value of the cepstrum of the background environment signal.
  • the calculation formula is as follows: $\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$, where x(k) is the cepstral feature of the user's speech signal at time k, $\bar{h}(k-1)$ is the cepstral mean of the background environment signal at time k-1, $\bar{h}(k)$ is the cepstral mean of the background environment signal at time k, and α is the recursive coefficient
  • said using a first-order recursive estimator to calculate the cepstral mean of the background environment signal includes: detecting the speech region and the non-speech region of the user's speech signal, and setting different recursive coefficients in the speech region and the non-speech region
  • the speech region includes an initial phase of speech and a non-initial phase of speech, and the calculation of the cepstral mean of the background environment signal by the first-order recursive estimator further includes: setting different recursive coefficients in the initial phase of speech and the non-initial phase of speech
  • the second aspect of the present invention provides a system for eliminating channel differences in voice interaction, including: a first extraction module, a first calculation module and a first normalization module used in the speech model training phase, and a second extraction module, a second calculation module and a second normalization module used in the speech model use phase;
  • the first extraction module is used to extract cepstrum features for the training corpus in each scenario
  • the first calculation module is configured to calculate the mean value of the cepstrum of the background environment signal in the corresponding scene according to the cepstrum feature;
  • the first normalization module is configured to use the cepstrum feature of the speech signal in the training corpus to subtract the mean value of the cepstrum of the background environment signal to obtain a normalized cepstrum sequence, and use the cepstrum sequence to train speech Model; wherein, the voice signal includes a background environment signal;
  • the second extraction module is used to collect user voice signals and extract the cepstrum characteristics of the user voice signals; wherein, the user voice signals include background environment signals;
  • the second calculation module is configured to estimate, according to the cepstrum feature, the mean value of the cepstrum of the background environment signal in the same scene as the user voice signal;
  • the second normalization module is configured to use the cepstrum feature to subtract the mean value of the cepstrum of the background environment signal to obtain a normalized cepstrum sequence, and input the cepstrum sequence into the trained speech model.
  • the first calculation module is specifically configured to calculate the average value of the cepstrum of the background environment signal according to the cepstrum characteristics of the background environment signal in the corresponding scene when the training corpus includes a separate background environment signal.
  • the first calculation module is specifically configured to divide the speech signal in the training corpus evenly into several segments, calculate the cepstral mean of each segment of the speech signal from its cepstral features, and take the minimum of all the cepstral means as the cepstral mean of the background environment signal in the corresponding scene.
  • the second calculation module is specifically configured to use a first-order recursive estimator to calculate the cepstral mean of the background environment signal, with the calculation formula $\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$, where x(k) is the cepstral feature of the user's speech signal at time k, $\bar{h}(k-1)$ and $\bar{h}(k)$ are the cepstral means of the background environment signal at times k-1 and k, and α is the recursive coefficient
  • the second calculation module is also used to detect the voice zone and the non-speech zone of the user's voice signal, and to set different recursive coefficients in the voice zone and the non-speech zone.
  • the speech area includes an initial phase of speech and a non-initial phase of speech
  • the second calculation module is further configured to set different recursive coefficients in the initial phase of speech and the non-initial phase of speech.
  • the third aspect of the present invention provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the method for eliminating channel differences in voice interaction described in the first aspect of the present invention is implemented.
  • a fourth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the method steps for eliminating channel differences in voice interaction as described in the first aspect of the present invention are implemented .
  • the positive effect of the present invention is that, by calculating the cepstral mean of the background environment signal in both the training phase and the use phase of the speech model and subtracting it from the cepstral features of the speech signal, a normalized cepstral sequence that is not affected by the channel is obtained; the channel environments of the two phases are thus matched, the channel difference in voice interaction is eliminated, and the accuracy of back-end recognition is improved.
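To make the claimed effect concrete, the following toy numerical sketch (not part of the patent; the Gaussian class means, channel vectors, noise level and nearest-mean classifier are all invented for illustration) shows how a channel offset that differs between training and deployment hurts a simple recognizer, and how subtracting each phase's mean cepstrum restores the match:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 13                                      # cepstral dimension

# Two toy word classes with distinct "cepstral" means (stand-in for a speech model).
class_means = rng.normal(size=(2, D))

def make_utterances(n, channel):
    """Toy utterance-level cepstra: class mean + constant channel + noise."""
    labels = rng.integers(0, 2, size=n)
    feats = class_means[labels] + channel + 0.3 * rng.normal(size=(n, D))
    return feats, labels

train_channel = 2.0 * rng.normal(size=D)    # recording channel of the training corpus
test_channel = 2.0 * rng.normal(size=D)     # different pickup channel at use time
Xtr, ytr = make_utterances(500, train_channel)
Xte, yte = make_utterances(500, test_channel)

def nearest_mean_accuracy(Xtr, ytr, Xte, yte):
    mus = np.stack([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    pred = np.argmin(((Xte[:, None, :] - mus) ** 2).sum(axis=-1), axis=1)
    return (pred == yte).mean()

print("mismatched channels :", nearest_mean_accuracy(Xtr, ytr, Xte, yte))
print("after mean removal  :", nearest_mean_accuracy(Xtr - Xtr.mean(axis=0), ytr,
                                                     Xte - Xte.mean(axis=0), yte))
```

Here the mean is removed over each whole corpus for brevity; the patent instead estimates and subtracts the background-environment cepstral mean per scene and per utterance, but the matching effect is the same in spirit.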
  • Fig. 1 is a schematic diagram of an acoustic transmission channel in the prior art.
  • FIG. 2 is a flowchart of a method for eliminating channel differences in voice interaction according to Embodiment 1 of the present invention.
  • Fig. 3 is a basic flow chart for extracting MFCC features provided by Embodiment 1 of the present invention.
  • FIG. 4 is a schematic structural diagram of a system for eliminating channel differences in voice interaction according to Embodiment 2 of the present invention.
  • FIG. 5 is a schematic diagram of the structure of an electronic device according to Embodiment 3 of the present invention.
  • CMN Cepstral Mean Normalization
  • the time domain signal y'(n) obtained at this point is the cepstrum; although it differs from the original time domain signal y(n), the convolution relationship in the time domain has been converted into a linear additive relationship.
  • given the cepstral-vector time series X = {x_1, x_2, ..., x_T} of a signal x(n), its sample mean is $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$, and the normalized cepstral sequence is defined by subtracting this mean: $\hat{x}_t = x_t - \bar{x}$. If y(n) is the result of filtering x(n) through a linear channel with impulse response h(n), then y_t = x_t + h, where h is the cepstrum corresponding to the channel frequency response; the channel is assumed to be a linear time-invariant system, so h is constant. The sample mean of the new cepstral sequence is then $\bar{y} = \frac{1}{T}\sum_{t=1}^{T}(x_t + h) = \bar{x} + h$.
  • the normalized cepstral sequence is $\hat{y}_t = y_t - \bar{y} = (x_t + h) - (\bar{x} + h) = x_t - \bar{x} = \hat{x}_t$, which shows that CMN is invariant to linear filtering.
  • CMN is performed sentence by sentence on both the training corpus and the actually collected user speech.
  • the signal y(n) is the result of filtering the signal x(n) through a linear channel with an impulse response h(n)
  • the cepstral mean vector of y(n) can then be written as $\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t = \frac{T_1}{T}\,\bar{x}_{speech} + \frac{T-T_1}{T}\,\bar{x}_{env} + h$, where T is the total length of the speech data, T_1 is the effective speech length, and $\bar{x}_{speech}$ and $\bar{x}_{env}$ are the mean cepstra of the speech frames and of the background-environment frames.
  • for sufficiently long speech (T → ∞), the above formula shows that the proportion of speech in the whole utterance, T_1/T, becomes very small, approximately 0; for all speech recorded under the same environmental conditions the cepstral mean vector $\bar{y}$ should therefore be equal, and it mainly contains information about the background environment, so subtracting the cepstral mean eliminates the cepstral changes caused by the environment; conversely, for shorter speech, the same formula shows that the cepstral mean will contain more effective speech information.
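As a quick numerical check of the linear-filtering invariance derived above (a minimal sketch; the random cepstra and dimensions are invented for the demonstration and are not part of the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 200, 13                         # frames x cepstral coefficients

x = rng.normal(size=(T, D))            # "clean" cepstral sequence x_t
h = rng.normal(size=(1, D))            # constant channel cepstrum of an LTI channel
y = x + h                              # channel-filtered cepstra: y_t = x_t + h

def cmn(c):
    """Cepstral Mean Normalization: subtract the per-utterance sample mean."""
    return c - c.mean(axis=0, keepdims=True)

# The constant channel term cancels, so both normalized sequences coincide.
print("max |CMN(x) - CMN(y)| =", np.abs(cmn(x) - cmn(y)).max())
```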
  • This embodiment provides a method for eliminating channel differences in voice interaction, as shown in FIG. 2, including:
  • Step S101 Extract cepstrum features for the training corpus in each scene.
  • the training corpus can be recorded in different scenarios, for example in an office, a public square, a home, or a subway station.
  • Step S102 Calculate the mean value of the cepstrum of the background environment signal in the corresponding scene according to the cepstrum feature.
  • Step S103 Use the cepstrum feature of the speech signal in the training corpus to subtract the mean value of the cepstrum of the background environment signal to obtain a normalized cepstrum sequence, and use the cepstrum sequence to train a speech model; where The speech signal includes background environment signals.
  • the process of obtaining the normalized cepstrum sequence in step S103 is the calculation process of CMN.
  • the cepstral mean of the background environment signal is subtracted from the cepstral features of the speech signal, and the resulting normalized cepstral sequence is not affected by the background environment signal, that is, by the channel.
  • the training corpus includes speech signals in different scenarios and separate background environment signals in different scenarios
  • the cepstrum features extracted in step S101 include the cepstrum features of the speech signals and the background environment signals.
  • step S102 specifically includes: calculating the cepstrum mean value of the background environment signal according to the cepstrum feature of the background environment signal in the corresponding scene.
  • assuming there are M scenes and the cepstral features of the background environment signal in scene k are $y_n^k$, the cepstral mean y_k is $y_k = \frac{1}{T}\sum_{n=1}^{T} y_n^k$, where n = 1, ..., T and T is the length of the background environment signal
  • the normalized cepstral sequence obtained in step S103 is then $\hat{x}_n^k = x_n^k - y_k$, where $x_n^k$ are the cepstral features of the speech signal in scene k
  • in another optional embodiment, step S102 specifically includes: dividing the speech signal in the training corpus evenly into several segments and calculating the cepstral mean of each segment from the cepstral features of the speech signal;
  • assuming that a speech signal of N frames is divided evenly into M1 segments after feature extraction, each segment has length T = N/M1, and the cepstral mean of segment k is $x_k = \frac{1}{T}\sum_{n=0}^{T-1} x((k-1)T+n)$, with n = 0, ..., T-1 and k = 1, ..., M1;
  • the minimum of all the cepstral means, min(x_k), is taken as the cepstral mean of the background environment signal in the corresponding scene, and the normalized cepstral sequence obtained in step S103 is $\hat{x}_n = x_n - \min_k(x_k)$.
  • the method for calculating the cepstral mean of the background environment signal provided in this embodiment is applicable both when the training corpus includes a separate background environment signal and when it does not, and is especially suitable for the case where the training corpus does not include a separate background environment signal.
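A rough sketch of this training-phase estimate and normalization follows (illustrative only: the segment count, the feature array layout, and the choice of "minimum" as the segment mean with the smallest norm are assumptions; the patent does not prescribe a concrete implementation):

```python
import numpy as np

def background_cepstral_mean(cepstra: np.ndarray, n_segments: int = 8) -> np.ndarray:
    """Estimate the background-environment cepstral mean of one utterance.

    The utterance's cepstral frames (shape: frames x coefficients) are split
    evenly into n_segments parts, a per-segment mean is computed, and the
    segment mean with the smallest norm is returned, on the assumption that
    the quietest segment contains mostly background environment.
    """
    segments = np.array_split(cepstra, n_segments, axis=0)
    means = np.stack([seg.mean(axis=0) for seg in segments])
    quietest = int(np.argmin(np.linalg.norm(means, axis=1)))
    return means[quietest]

def normalize_training_utterance(cepstra: np.ndarray, n_segments: int = 8) -> np.ndarray:
    """Subtract the estimated background cepstral mean from every frame (CMN)."""
    return cepstra - background_cepstral_mean(cepstra, n_segments)
```

When the corpus does contain a separate background-environment recording for the scene, the same subtraction applies with the mean computed directly over that recording.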
  • Step S201 Collect a user's voice signal, and extract the cepstrum feature of the user's voice signal.
  • in step S201, a microphone is used to collect the user's speech signal, and the collected user speech signal includes the background environment signal of the scene where the user is located.
  • Step S202 Estimate the mean value of the cepstrum of the background environment signal in the same scene as the user voice signal according to the cepstrum feature.
  • in step S202, the cepstral mean of the separate background environment signal in the scene where the user is located is estimated from the cepstral features extracted in step S201.
  • a first-order recursive estimator is used to calculate the cepstral mean of the background environment signal, with the calculation formula $\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$, where x(k) is the cepstral feature of the user's speech signal at time k, $\bar{h}(k-1)$ and $\bar{h}(k)$ are the cepstral means of the background environment signal at times k-1 and k, and α is the recursive coefficient
  • using the first-order recursive estimator in step S202 to calculate the cepstral mean of the background environment signal includes: detecting the speech region and the non-speech region of the user speech signal collected in step S201, and setting different recursive coefficients α in the speech region and the non-speech region.
  • the voice area and non-voice area of the user's voice signal are detected based on VAD (Voice Activity Detection).
  • VAD Voice Activity Detection
  • the energy method or the zero-crossing rate method can be used to detect the speech area and the non-speech area.
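A toy energy-threshold VAD of the kind mentioned here might look as follows (the frame layout and the relative threshold are illustrative assumptions, not values from the patent):

```python
import numpy as np

def energy_vad(frames: np.ndarray, threshold_db: float = -35.0) -> np.ndarray:
    """Return a per-frame vad_flag (1 = speech, 0 = non-speech).

    frames: array of shape (n_frames, frame_len) with time-domain samples.
    A frame is flagged as speech when its energy is within threshold_db
    decibels of the loudest frame in the utterance.
    """
    energy = (frames ** 2).mean(axis=1)
    energy_db = 10.0 * np.log10(energy + 1e-12)
    return (energy_db > energy_db.max() + threshold_db).astype(int)
```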
  • the speech area includes an initial phase of speech and a non-initial phase of speech.
  • using a first-order recursive estimator to calculate the cepstral mean of the background environment signal includes: setting different recursive coefficients α in the initial phase of speech and the non-initial phase of speech.
  • the period of time when the speech starts is the initial stage of the speech, for example, 0 ⁇ t1 are the initial stage of the speech.
  • the time period from the end of the initial phase of the speech to the end of the speech is the non-initial phase of the speech, for example, t1 to t2 are the non-initial phase of the speech.
  • the value of the recursive coefficient α is given by the following formula:
  • $\alpha = \begin{cases} \text{a larger value (close to 1)}, & \text{vad\_flag}=1 \text{ and the speech is in its initial phase (e.g. 0-100 ms)} \\ 1, & \text{vad\_flag}=1 \text{ and the speech is in its non-initial phase} \\ 0.99, & \text{vad\_flag}=0 \end{cases}$
  • where vad_flag is the VAD detection flag. When vad_flag = 1, speech has started; to avoid errors caused by VAD falsely declaring the presence of speech, α is chosen in two situations: in the initial phase of speech (for example 0-100 ms), α takes a larger value so that the cepstral mean is updated slowly, reducing the influence of speech on the channel cepstrum; after 100 ms, in the non-initial phase of speech, α takes the value 1, which completely removes the influence of speech on the channel cepstrum, and the cepstral mean is not updated. When vad_flag = 0, speech has ended, α can be empirically set to 0.99, and the channel cepstral mean is updated normally again.
  • the method of estimating the mean value of the cepstrum of the background environment signal in step S202 is not limited to the first-order recursive estimator in the foregoing embodiment, and may also be other estimators.
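A minimal streaming sketch of the first-order recursive estimator with a VAD-dependent recursive coefficient (the concrete α values, the 100 ms initial-phase length expressed in frames, and the bookkeeping variable names are assumptions made for illustration):

```python
import numpy as np

def update_background_mean(h_prev: np.ndarray, x_k: np.ndarray, vad_flag: int,
                           frames_into_speech: int,
                           alpha_init: float = 0.999,
                           alpha_nonspeech: float = 0.99,
                           init_frames: int = 10) -> np.ndarray:
    """One step of the first-order recursive estimate of the background cepstral mean.

    h_prev             : cepstral mean of the background signal at time k-1
    x_k                : cepstral feature vector of the frame at time k
    vad_flag           : 1 if the VAD reports speech, 0 otherwise
    frames_into_speech : frames elapsed since speech onset (e.g. 10 frames ~ 100 ms)
    """
    if vad_flag == 1:
        # Initial phase of speech: update slowly; non-initial phase: freeze (alpha = 1).
        alpha = alpha_init if frames_into_speech < init_frames else 1.0
    else:
        # No speech present: update the background estimate normally.
        alpha = alpha_nonspeech
    return alpha * h_prev + (1.0 - alpha) * x_k
```

Each incoming frame would then be normalized as x_k minus the current estimate before being passed to the trained speech model.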
  • Step S203 Use the cepstrum feature to subtract the mean value of the cepstrum of the background environment signal to obtain a normalized cepstrum sequence.
  • the normalized cepstrum sequence obtained in step S203 is input into the speech model trained in step S103.
  • the process of obtaining the normalized cepstrum sequence in step S203 is the calculation process of CMN.
  • the cepstral mean of the background environment signal is subtracted from the cepstral features of the user's speech signal, and the resulting normalized cepstral sequence is not affected by the background environment signal, that is, by the channel.
  • cepstrum features are MFCC (Mel Frequency Cepstrum Coefficient) features, LPCC (Linear Predictive Cepstrum Coefficient, linear prediction cepstrum coefficients) features, or FBank (Filterbank, filter bank) features.
  • MFCC Mel Frequency Cepstrum Coefficient
  • LPCC Linear Predictive Cepstrum Coefficient, linear prediction cepstrum coefficients
  • FBank Filterbank, filter bank
  • the following takes the case where the cepstral feature is the MFCC feature as an example.
  • the basic process of extracting MFCC features is shown in Figure 3.
  • the purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter and the spectrum can be computed with the same signal-to-noise ratio over the whole band from low to high frequency.
  • it also compensates for the high-frequency components of the speech signal that are suppressed by the articulation system, counteracting the effect of the vocal cords and lips during phonation and highlighting the high-frequency formants.
  • Framing is to first gather N sampling points into an observation unit. Normally, the value of N is 256 or 512, and the time covered is about 20-30ms. In order to avoid too large changes between two adjacent frames, there will be an overlapping area between two adjacent frames. This overlapping area contains M sampling points, usually the value of M is about 1/2 or 1/3 of N .
  • Windowing usually multiplies each frame by the Hamming window to increase the continuity between the left and right ends of the frame. Since the transformation of the signal in the time domain is usually difficult to see the characteristics of the signal, it is usually converted to the energy distribution in the frequency domain for observation. Different energy distributions can represent the characteristics of different voices.
  • after multiplication by the Hamming window, each frame must also be passed through an FFT to obtain its energy distribution over the spectrum.
  • a fast Fourier transform is applied to each framed and windowed signal to obtain the spectrum of each frame, and the power spectrum of the speech signal is obtained by taking the squared magnitude of its spectrum.
  • the energy spectrum is passed through a set of Mel-scale triangular filter banks to smooth the spectrum, suppress harmonics and highlight the formants of the original speech; a logarithm is then applied to the output of each filter, and finally the log energies are passed through a discrete cosine transform (DCT) to obtain the Mel-scale cepstral parameters.
  • DCT discrete cosine transform
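The pipeline just described can be sketched with numpy and scipy as below (a minimal illustration: the sampling rate, frame length, hop, FFT size, number of Mel filters and number of coefficients are common defaults, not values prescribed by the patent, and the input is assumed to be at least one frame long):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal: np.ndarray, sr: int = 16000, frame_len: int = 400, hop: int = 160,
         n_fft: int = 512, n_mels: int = 26, n_mfcc: int = 13) -> np.ndarray:
    """Minimal MFCC extraction: pre-emphasis -> framing -> Hamming window
    -> FFT power spectrum -> Mel filterbank -> log -> DCT."""
    # Pre-emphasis boosts the high-frequency part of the signal.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing with overlap, then a Hamming window on every frame.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # Power spectrum (squared magnitude of the FFT) of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel-scale triangular filterbank.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, then DCT gives the Mel-scale cepstral coefficients.
    log_energies = np.log(power @ fbank.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```

Either these MFCCs, or LPCC or FBank features, could then feed the background-mean estimation and CMN steps described above.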
  • in this embodiment, the cepstral mean of the background environment signal is calculated in both the training phase and the use phase of the speech model and is subtracted from the cepstral features of the speech signal, giving a normalized cepstral sequence that is not affected by the channel.
  • the channel environments of the two phases are thereby matched, the channel difference in the voice interaction between the two phases is eliminated, and the accuracy of back-end recognition is improved.
  • This embodiment provides a system 400 for eliminating channel differences in voice interaction, as shown in FIG. 4, including: a first extraction module 411, a first calculation module 412, and a first normalization module 413 used in the voice model training phase , And a second extraction module 421, a second calculation module 422, and a second normalization module 423 used in the speech model use stage.
  • the first extraction module 411 is used for extracting cepstrum features for the training corpus in each scene.
  • the first calculation module 412 is configured to calculate the average value of the cepstrum of the background environment signal in the corresponding scene according to the cepstrum feature.
  • the first normalization module 413 is configured to use the cepstrum feature of the speech signal in the training corpus to subtract the mean value of the cepstrum of the background environment signal to obtain a normalized cepstrum sequence, and use the cepstrum sequence for training Voice model; wherein, the voice signal includes a background environment signal.
  • the second extraction module 421 is configured to collect user voice signals and extract cepstrum features of the user voice signals; wherein, the user voice signals include background environment signals.
  • the second calculation module 422 is configured to estimate, according to the cepstrum feature, the mean value of the cepstrum of the background environment signal in the same scene as the user voice signal.
  • the second normalization module 423 is configured to use the cepstrum feature to subtract the mean value of the cepstrum of the background environment signal to obtain a normalized cepstrum sequence, and input the cepstrum sequence into the trained speech model .
  • the first calculation module 412 is specifically configured to, when the training corpus includes a separate background environment signal, calculate the cepstral mean of the background environment signal from the cepstral features of the background environment signal in the corresponding scene.
  • the first calculation module 412 is specifically configured to divide the speech signal in the training corpus evenly into several segments, calculate the cepstral mean of each segment of the speech signal from its cepstral features, and take the minimum of all the cepstral means as the cepstral mean of the background environment signal in the corresponding scene.
  • the second calculation module 422 is specifically configured to use a first-order recursive estimator to calculate the cepstral mean of the background environment signal, with the calculation formula $\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$, where x(k) is the cepstral feature of the user's speech signal at time k, $\bar{h}(k-1)$ and $\bar{h}(k)$ are the cepstral means of the background environment signal at times k-1 and k, and α is the recursive coefficient
  • the second calculation module 422 is further configured to detect the voice area and the non-speech area of the user's voice signal, and set different recursion coefficients in the voice area and the non-speech area.
  • the speech area includes an initial phase of speech and a non-initial phase of speech
  • the second calculation module is further configured to set different recursive coefficients in the initial phase of speech and the non-initial phase of speech.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by this embodiment.
  • the electronic device includes a memory, a processor, and a computer program that is stored on the memory and can run on the processor, and the processor implements the method for eliminating channel differences in voice interaction in Embodiment 1 when the processor executes the program.
  • the electronic device 3 shown in FIG. 5 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
  • the electronic device 3 may be in the form of a general-purpose computing device, for example, it may be a server device.
  • the components of the electronic device 3 may include, but are not limited to: the above-mentioned at least one processor 4, the above-mentioned at least one memory 5, and a bus 6 connecting different system components (including the memory 5 and the processor 4).
  • the bus 6 includes a data bus, an address bus, and a control bus.
  • the memory 5 may include a volatile memory, such as a random access memory (RAM) 51 and/or a cache memory 52, and may further include a read-only memory (ROM) 53.
  • RAM random access memory
  • ROM read-only memory
  • the memory 5 may also include a program/utility tool 55 having a set of (at least one) program modules 54.
  • such program modules 54 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • the processor 4 executes various functional applications and data processing by running a computer program stored in the memory 5, such as the method for eliminating channel differences in voice interaction in Embodiment 1 of the present invention.
  • the electronic device 3 may also communicate with one or more external devices 7 (for example, keyboards, pointing devices, etc.). This communication can be performed through an input/output (I/O) interface 8.
  • the electronic device 3 can also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 9. As shown in the figure, the network adapter 9 communicates with the other modules of the electronic device 3 via the bus 6.
  • This embodiment provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method for eliminating channel differences in voice interaction of Embodiment 1 are implemented.
  • the readable storage medium may more specifically include, but is not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • in a possible implementation, the present invention can also be implemented in the form of a program product, which includes program code; when the program product runs on a terminal device, the program code is used to make the terminal device execute the method steps of eliminating channel differences in voice interaction described in Embodiment 1.
  • the program code for carrying out the present invention can be written in any combination of one or more programming languages, and the program code can be executed entirely on the user device, partly on the user device, as an independent software package, partly on the user device and partly on a remote device, or entirely on a remote device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A method and system for eliminating channel differences in voice interaction, an electronic device, and a medium. The method includes, in the training phase of a speech model: extracting cepstral features from the training corpus in each scenario (S101); calculating, from the cepstral features, the cepstral mean of the background environment signal in the corresponding scenario (S102); and subtracting the cepstral mean of the background environment signal from the cepstral features of the speech signal to obtain a normalized cepstral sequence, which is used to train the speech model (S103); and, in the use phase of the speech model: collecting a user speech signal and extracting its cepstral features (S201); estimating the cepstral mean of the background environment signal from the cepstral features (S202); and subtracting the cepstral mean of the background environment signal from the cepstral features to obtain a normalized cepstral sequence, which is input into the speech model (S203). The method eliminates the difference between the speech channels of the training phase and the use phase of the speech model, improving the accuracy of back-end recognition.

Description

消除语音交互中信道差异的方法及系统、电子设备及介质
本申请要求申请日为2020年4月30日的中国专利申请CN202010363659.9的优先权。本申请引用上述中国专利申请的全文。
技术领域
本发明涉及语音处理领域,特别涉及一种消除语音交互中信道差异的方法及系统、电子设备及介质。
背景技术
随着Amazon的Echo引爆智能音箱这个人工智能产品,各大音箱厂商和各个对人工智能领域都开始布局智能音频交互设备,Google的Google home、小米的小爱同学纷纷推出,大家的切入点都不约而同,以语音交互为载体,布局智能家居控制功能。目前的产品应用方式多种多样,有以音箱为中心,通过网络来控制家电,这种方式要求用户可以在离音箱5米甚至更远距离以内的范围进行对话,做到随时随地地语音交互。与此同时,特定产品下的语音对话,如语音交互电视,大多用遥控器上的语音键和麦克风,而现在也有在冰箱、车载上做语音交互,大多采用麦克风阵列(两颗麦克风),然后用户用唤醒词来唤醒,比如“你好小锐”,唤醒后再进行对应的指令词识别或任意词识别。
语音模型的训练语料的录制信道环境与识别时采集语音的拾取信道环境保持一致,识别效果才是最优的。信道环境定义为:语音从说话人的口腔离开直到以数字形式存储这之间的一组信号转换集合。参照图1,语音信号s(t)从说话人的口腔离开,经过ADC模数转换后得到数字信号x(k),再进入识别器进行信号识别。然而,由于成本原因和实际操作起来的困难程度,这种匹配是很难实现的。所以,当训练好的语音模型在真实条件下使用时,后端识别的性能可能会显著地下降,就是因为训练语料和实际采集语音的信道环境不匹配。
发明内容
本发明要解决的技术问题是为了克服现有技术中训练语料和实际采集语音的信道环境不匹配导致后端识别性能下降的缺陷,提供一种消除语音交互中信道差异的方法及系统、电子设备及介质。
本发明是通过下述技术方案来解决上述技术问题:
本发明的第一方面提供一种消除语音交互中信道差异的方法,包括以下步骤:
在语音模型的训练阶段:
针对每种场景下的训练语料,提取倒谱特征;
根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值;
利用所述训练语料中语音信号的倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并利用所述倒谱序列训练语音模型;其中,所述语音信号包括背景环境信号;
在语音模型的使用阶段:
采集用户语音信号,并提取所述用户语音信号的倒谱特征;其中,所述用户语音信号包括背景环境信号;
根据所述倒谱特征估计与所述用户语音信号相同场景下背景环境信号的倒谱均值;
利用所述倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并将所述倒谱序列输入至训练完成的语音模型。
较佳地,在语音模型的训练阶段,根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值,具体包括:
若训练语料中包括单独的背景环境信号,则根据相应场景下背景环境信号的倒谱特征计算所述背景环境信号的倒谱均值。
较佳地,在语音模型的训练阶段,根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值,具体包括:
将训练语料中的语音信号平均分为若干段,并根据所述语音信号的倒谱特征分别计算每段语音信号的倒谱均值;
将所有倒谱均值中的最小值作为相应场景下背景环境信号的倒谱均值。
较佳地,在语音模型的使用阶段,根据所述倒谱特征估计与所述用户语音信号相同场景下背景环境信号的倒谱均值,具体包括:
利用一阶递归估计器计算背景环境信号的倒谱均值,计算公式如下:
$\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$
其中，x(k)为k时刻用户语音信号的倒谱特征，$\bar{h}(k-1)$为k-1时刻背景环境信号的倒谱均值，$\bar{h}(k)$为k时刻背景环境信号的倒谱均值，α为递归系数。
较佳地,所述利用一阶递归估计器计算背景环境信号的倒谱均值包括:
检测所述用户语音信号的语音区和非语音区;
在语音区和非语音区设置不同的递归系数。
较佳地,所述语音区包括语音初始阶段和语音非初始阶段,所述利用一阶递归估计器计算背景环境信号的倒谱均值还包括:
在语音初始阶段和语音非初始阶段设置不同的递归系数。
本发明的第二方面提供一种消除语音交互中信道差异的系统,包括:用于语音模型训练阶段的第一提取模块、第一计算模块以及第一归一化模块,以及用于语音模型使用阶段的第二提取模块、第二计算模块以及第二归一化模块;
第一提取模块用于针对每种场景下的训练语料,提取倒谱特征;
第一计算模块用于根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值;
第一归一化模块用于利用所述训练语料中语音信号的倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并利用所述倒谱序列训练语音模型;其中,所述语音信号包括背景环境信号;
第二提取模块用于采集用户语音信号,并提取所述用户语音信号的倒谱特征;其中,所述用户语音信号包括背景环境信号;
第二计算模块用于根据所述倒谱特征估计与所述用户语音信号相同场景下背景环境信号的倒谱均值;
第二归一化模块用于利用所述倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并将所述倒谱序列输入至训练完成的语音模型。
较佳地,所述第一计算模块具体用于在训练语料中包括单独的背景环境信号的情况下,根据相应场景下背景环境信号的倒谱特征计算所述背景环境信号的倒谱均值。
较佳地,所述第一计算模块具体用于将训练语料中的语音信号平均分为若干段,并根据所述语音信号的倒谱特征分别计算每段语音信号的倒谱均值;以及将所有倒谱均值中的最小值作为相应场景下背景环境信号的倒谱均值。
较佳地,所述第二计算模块具体用于利用一阶递归估计器计算背景环境信号的倒谱均值,计算公式如下:
$\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$
其中，x(k)为k时刻用户语音信号的倒谱特征，$\bar{h}(k-1)$为k-1时刻背景环境信号的倒谱均值，$\bar{h}(k)$为k时刻背景环境信号的倒谱均值，α为递归系数。
较佳地,所述第二计算模块还用于检测所述用户语音信号的语音区和非语音区,以及在语音区和非语音区设置不同的递归系数。
较佳地,所述语音区包括语音初始阶段和语音非初始阶段,所述第二计算模块还用于在语音初始阶段和语音非初始阶段设置不同的递归系数。
本发明的第三方面提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如本发明第一方面所述的消除语音交互中信道差异的方法。
本发明的第四方面提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现如本发明第一方面所述的消除语音交互中信道差异的方法步骤。
本发明的积极进步效果在于:通过分别在语音模型的训练阶段和使用阶段计算背景环境信号的倒谱均值,并利用语音信号的倒谱特征减去背景环境信号的倒谱均值,得到不受信道影响的归一化的倒谱序列,使得两个阶段中的信道环境相匹配,成功消除了语音交互中的信道差异,进而提高了后端识别的准确率。
附图说明
图1为现有技术中声学传输信道的示意图。
图2为本发明实施例1提供的一种消除语音交互中信道差异的方法流程图。
图3为本发明实施例1提供的提取MFCC特征的基本流程图。
图4为本发明实施例2提供的一种消除语音交互中信道差异的系统的结构示意图。
图5为本发明实施例3的电子设备的结构示意图。
具体实施方式
下面通过实施例的方式进一步说明本发明,但并不因此将本发明限制在所述的实施例范围之中。
CMN(Cepstrum Mean Normalization,倒谱均值归一化)是一种简单而强大的卷积失真处理技术,提高了语音识别系统对未知线性滤波信道的鲁棒性。这里先对倒谱进行简单的分析:对时域信号做傅里叶变换,然后进行对数运算,再进行逆傅里叶变换。假设时域信号为x(n),信道信息为h(n),经过信道传输之后的输出为y(n):
y(n)=x(n)*h(n)
此时很难区分开x(n)和h(n),所以先转到频域分析:
Y(k)=X(k)H(k)
对频域两边取log:
log(Y(k))=log(X(k))+log(H(k))
然后进行逆傅里叶变换:
IDFT(log(Y(k)))=IDFT(log(X(k)))+IDFT(log(H(k)))
假设此时得到的时域信号如下:
y′(n)=x′(n)+h′(n)
此时获得的时域信号y′(n)即为倒谱,虽然已经和原始的时域信号y(n)不一样,但是可以把时域信号的卷积关系转化成线性相加关系。
接下来介绍CMN的计算。假设信号x(n)的倒谱向量的时间序列为X={x 1,x 2,…,x t,…,x T},它的样本均值计算表达式为:
$\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$
倒谱序列的归一化是通过减去样本均值来定义:
$\hat{x}_t = x_t - \bar{x}$
现在考虑信号y(n)是通过一个脉冲响应为h(n)的线性信道对信号x(n)进行滤波的结果,则y(n)的倒谱向量为:
y t=x t+h
其中,h为信道频率响应对应的倒谱,这里假设信道为线性时不变系统,故h为常量。那么,新倒谱序列的样本均值为:
$\bar{y} = \frac{1}{T}\sum_{t=1}^{T}(x_t + h) = \bar{x} + h$
其归一化倒谱序列为:
$\hat{y}_t = y_t - \bar{y} = (x_t + h) - (\bar{x} + h) = x_t - \bar{x} = \hat{x}_t$
这表明CMN对线性滤波操作具有不变性。
CMN对训练语料和实际采集的用户语音都是逐句进行的。假设信号y(n)是通过一个脉冲响应为h(n)的线性信道对信号x(n)进行滤波的结果,则y(n)的倒谱向量为:
$\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t = \frac{T_1}{T}\,\bar{x}_{speech} + \frac{T-T_1}{T}\,\bar{x}_{env} + h$
其中，T为语音数据的总长度，T_1为有效语音长度，$\bar{x}_{speech}$与$\bar{x}_{env}$分别为有效语音帧与背景环境帧的倒谱均值。对于足够长的语音（T→∞），由以上公式可知语音对于整条数据的占比将会非常小，近似为0，对于在相同环境条件下录制的所有语音，倒谱平均向量
$\bar{y}$
应该是相等的,并且它主要包含关于背景环境的信息,因此,减去倒谱均值将消除由环境引起的倒谱变化;相反地,对于较短的语音,通过以上公 式我们也可以得知倒谱均值将会包含较多的有效语音信息。
实施例1
本实施例提供一种消除语音交互中信道差异的方法,如图2所示,包括:
在语音模型的训练阶段:
步骤S101、针对每种场景下的训练语料,提取倒谱特征。
其中,可以针对不同的场所设置不同的场景,例如办公室、广场、家里、地铁站等。训练语料可以通过在不同场景下进行录制。
步骤S102、根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值。
步骤S103、利用所述训练语料中语音信号的倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并利用所述倒谱序列训练语音模型;其中,所述语音信号包括背景环境信号。
步骤S103中得到归一化的倒谱序列的过程即为CMN的计算过程,利用语音信号的倒谱特征减去背景环境信号的倒谱均值,得到的归一化倒谱序列不受背景环境信号即信道的影响。
假设有M种场景,每种场景下语音信号的倒谱特征为
$x_n^k$（k=1,…,M），
倒谱均值zk为:
$z_k = \frac{1}{T}\sum_{n=1}^{T} x_n^k$
其中,n=1,…,T,T为语音信号的长度。
在可选的一种实施方式中,训练语料中包括不同场景下的语音信号以及不同场景下单独的背景环境信号,步骤S101中提取的倒谱特征包括语音信号的倒谱特征以及背景环境信号的倒谱特征。本实施例中,步骤S102中具体包括:根据相应场景下背景环境信号的倒谱特征计算所述背景环境信号的倒谱均值。
假设有M种场景,每种场景下背景环境信号的倒谱特征为
$y_n^k$（k=1,…,M），
倒谱均值y k为:
$y_k = \frac{1}{T}\sum_{n=1}^{T} y_n^k$
其中,n=1,…,T,T为背景环境信号的长度。
本实施例中,步骤S103中得到的归一化的倒谱序列为
$\hat{x}_n^k = x_n^k - y_k$
在可选的另一种实施方式中,步骤S102具体包括:
将训练语料中的语音信号平均分为若干段,并根据所述语音信号的倒谱特征分别计算每段语音信号的倒谱均值;
将所有倒谱均值中的最小值作为相应场景下背景环境信号的倒谱均值。
假设对一条帧数为N的语音信号完成特征提取之后,平均分成M1段,则每段的长度为:T=N/M1;
分别对每一段语音信号计算倒谱均值:
$x_k = \frac{1}{T}\sum_{n=0}^{T-1} x((k-1)T + n)$
其中,n=0,…,T-1,k=1,…,M1。取所有倒谱均值中的最小值min(x k)作为相应场景下背景环境信号的倒谱均值。
本实施例中,步骤S103中得到的归一化的倒谱序列为
$\hat{x}_n = x_n - \min_k(x_k)$
需要说明的是,本实施例提供的计算背景环境信号的倒谱均值的方法适用于训练语料中包括单独的背景环境信号以及不包括单独的背景环境信号这两种情况,尤其适用于训练语料中不包括单独的背景环境信号的情况。
在语音模型的使用阶段:
步骤S201、采集用户语音信号,并提取所述用户语音信号的倒谱特征。
步骤S201中,使用麦克风采集用户语音信号,其中采集的用户语音信号中包括用户所处场景下的背景环境信号。
步骤S202、根据所述倒谱特征估计与所述用户语音信号相同场景下背景环境信号的倒谱均值。
步骤S202中,根据步骤S201中提取的倒谱特征估计用户所处场景下单独的背景环境信号的倒谱均值。
在可选的一种实施方式中,步骤S202中利用一阶递归估计器计算背景环境信号的倒谱均值,计算公式如下:
$\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$
其中，x(k)为k时刻用户语音信号的倒谱特征，$\bar{h}(k-1)$为k-1时刻背景环境信号的倒谱均值，$\bar{h}(k)$为k时刻背景环境信号的倒谱均值，α为递归系数。
在可选的一种实施方式中,步骤S202中利用一阶递归估计器计算背景环境信号的倒谱均值包括:检测步骤S201中采集的用户语音信号的语音区和非语音区,并在语音区和非语音区设置不同的递归系数α。
在可选的一种实施方式中,基于VAD(Voice Activity Detection,语音活动检测)检 测用户语音信号的语音区和非语音区。在VAD检测的具体实施中,可以使用能量法或者过零率法检测语音区和非语音区。
在可选的一种实施方式中,所述语音区包括语音初始阶段和语音非初始阶段,步骤S202中利用一阶递归估计器计算背景环境信号的倒谱均值包括:在语音初始阶段和语音非初始阶段设置不同的递归系数α。
假设0~t2为语音区,在语音开始的一段时间为语音初始阶段,例如0~t1为语音初始阶段。语音初始阶段结束直至语音结束的时间段为语音非初始阶段,例如t1~t2为语音非初始阶段。
在一个基于VAD检测的具体例子中,递归系数α的值如以下公式所示:
$\alpha = \begin{cases} \text{较大值（接近1）}, & \text{vad\_flag}=1 \text{ 且处于语音初始阶段（例如0～100ms）} \\ 1, & \text{vad\_flag}=1 \text{ 且处于语音非初始阶段} \\ 0.99, & \text{vad\_flag}=0 \end{cases}$
其中,vad_flag为VAD的检测标志位,当vad_flag=1时,说明语音已经开始,此时为了避免VAD做出语音存在的错误判断,α取值分为两种情况:在语音开始的初始阶段(例如0~100ms),α取一个较大的值,对倒谱均值进行缓慢更新,减小语音对信道倒谱的影响;100ms之后即为语音非初始阶段,α取值为1,完全去除语音对信道倒谱的影响,倒谱均值不更新。当vad_flag=0时,即语音已经结束,α取值可根据经验设置为0.99,信道倒谱均值开始正常更新。
需要说明的是,步骤S202中估计背景环境信号倒谱均值的方法不限于上述实施例中的一阶递归估计器,还可以为其他估计器。
步骤S203、利用所述倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列。其中,将步骤S203中得到的归一化的倒谱序列输入至步骤S103中训练完成的语音模型。
同样地,步骤S203中得到归一化的倒谱序列的过程即为CMN的计算过程,利用用户语音信号的倒谱特征减去背景环境信号的倒谱均值,得到的归一化倒谱序列不受背景环境信号即信道的影响。
其中,上述倒谱特征为MFCC(Mel Frequency Cepstrum Coefficient,梅尔频率倒谱系数)特征、LPCC(Linear Predictive Cepstrum Coefficient,线性预测倒谱系数)特征或FBank(Filterbank,滤波器组)特征。
以下针对倒谱特征为MFCC特征进行举例说明。其中,提取MFCC特征的基本流程如图3所示。
预加重的目的是提升高频部分,使信号的频谱变得平坦,保持在低频到高频的整个 频带中,能用同样的信噪比求频谱。同时,也是为了消除发生过程中声带和嘴唇的效应,来补偿语音信号受到发音系统所抑制的高频部分,也为了突出高频的共振峰。
分帧是先将N个采样点集合成一个观测单位,通常情况下N的值为256或512,涵盖的时间约为20~30ms左右。为了避免相邻两帧的变化过大,因此会让两相邻帧之间有一段重叠区域,此重叠区域包含了M个取样点,通常M的值约为N的1/2或1/3。
加窗通常将每一帧乘以汉明窗,以增加帧左端和右端的连续性。由于信号在时域上的变换通常很难看出信号的特性,所以通常将它转换为频域上的能量分布来观察,不同的能量分布,就能代表不同语音的特性。
在乘上汉明窗后,每帧还必须再经过FFT以得到在频谱上的能量分布。对分帧加窗后的各帧信号进行快速傅里叶变换得到各帧的频谱,并对语音信号的频谱取模平方得到语音信号的功率谱。
将能量谱通过一组Mel尺度的三角形滤波器组,对频谱进行平滑化,并消除谐波的作用,突显原先语音的共振峰。然后将每个滤波器组输出的值进行对数运算,最终将对数能量带入离散余弦变换(DCT),最终求出的Mel-scale Cepstrum参数。
本实施例通过分别在语音模型的训练阶段和使用阶段计算背景环境信号的倒谱均值,并利用语音信号的倒谱特征减去背景环境信号的倒谱均值,得到不受信道影响的归一化的倒谱序列,使得两个阶段中的信道环境相匹配,成功消除了两个阶段语音交互中的信道差异,进而提高了后端识别的准确率。
实施例2
本实施例提供一种消除语音交互中信道差异的系统400,如图4所示,包括:用于语音模型训练阶段的第一提取模块411、第一计算模块412以及第一归一化模块413,以及用于语音模型使用阶段的第二提取模块421、第二计算模块422以及第二归一化模块423。
第一提取模块411用于针对每种场景下的训练语料,提取倒谱特征。
第一计算模块412用于根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值。
第一归一化模块413用于利用所述训练语料中语音信号的倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并利用所述倒谱序列训练语音模型;其中,所述语音信号包括背景环境信号。
第二提取模块421用于采集用户语音信号,并提取所述用户语音信号的倒谱特征;其中,所述用户语音信号包括背景环境信号。
第二计算模块422用于根据所述倒谱特征估计与所述用户语音信号相同场景下背景环境信号的倒谱均值。
第二归一化模块423用于利用所述倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并将所述倒谱序列输入至训练完成的语音模型。
在可选的一种实施方式中,第一计算模块412具体用于在训练语料中包括单独的背景环境信号的情况下,根据相应场景下背景环境信号的倒谱特征计算所述背景环境信号的倒谱均值。
在可选的一种实施方式中,第一计算模块412具体用于将训练语料中的语音信号平均分为若干段,并根据所述语音信号的倒谱特征分别计算每段语音信号的倒谱均值;以及将所有倒谱均值中的最小值作为相应场景下背景环境信号的倒谱均值。
在可选的一种实施方式中,第二计算模块422具体用于利用一阶递归估计器计算背景环境信号的倒谱均值,计算公式如下:
$\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$
其中，x(k)为k时刻用户语音信号的倒谱特征，$\bar{h}(k-1)$为k-1时刻背景环境信号的倒谱均值，$\bar{h}(k)$为k时刻背景环境信号的倒谱均值，α为递归系数。
在可选的一种实施方式中,第二计算模块422还用于检测所述用户语音信号的语音区和非语音区,以及在语音区和非语音区设置不同的递归系数。
在可选的一种实施方式中,所述语音区包括语音初始阶段和语音非初始阶段,所述第二计算模块还用于在语音初始阶段和语音非初始阶段设置不同的递归系数。
实施例3
图5为本实施例提供的一种电子设备的结构示意图。所述电子设备包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现实施例1的消除语音交互中信道差异的方法。图5显示的电子设备3仅仅是一个示例,不应对本发明实施例的功能和使用范围带来任何限制。
电子设备3可以以通用计算设备的形式表现,例如其可以为服务器设备。电子设备3的组件可以包括但不限于:上述至少一个处理器4、上述至少一个存储器5、连接不同系统组件(包括存储器5和处理器4)的总线6。
总线6包括数据总线、地址总线和控制总线。
存储器5可以包括易失性存储器,例如随机存取存储器(RAM)51和/或高速缓存存储器52,还可以进一步包括只读存储器(ROM)53。
存储器5还可以包括具有一组(至少一个)程序模块54的程序/实用工具55,这样的程序模块54包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。
处理器4通过运行存储在存储器5中的计算机程序,从而执行各种功能应用以及数据处理,例如本发明实施例1的消除语音交互中信道差异的方法。
电子设备3也可以与一个或多个外部设备7(例如键盘、指向设备等)通信。这种通信可以通过输入/输出(I/O)接口8进行。并且,模型生成的设备3还可以通过网络适配器9与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器9通过总线6与模型生成的设备3的其它模块通信。应当明白,尽管图中未示出,可以结合模型生成的设备3使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理器、外部磁盘驱动阵列、RAID(磁盘阵列)系统、磁带驱动器以及数据备份存储系统等。
应当注意,尽管在上文详细描述中提及了电子设备的若干单元/模块或子单元/模块,但是这种划分仅仅是示例性的并非强制性的。实际上,根据本发明的实施方式,上文描述的两个或更多单元/模块的特征和功能可以在一个单元/模块中具体化。反之,上文描述的一个单元/模块的特征和功能可以进一步划分为由多个单元/模块来具体化。
实施例4
本实施例提供了一种计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现实施例1的消除语音交互中信道差异的方法步骤。
其中,可读存储介质可以采用的更具体可以包括但不限于:便携式盘、硬盘、随机存取存储器、只读存储器、可擦拭可编程只读存储器、光存储器件、磁存储器件或上述的任意合适的组合。
在可能的实施方式中,本发明还可以实现为一种程序产品的形式,其包括程序代码,当所述程序产品在终端设备上运行时,所述程序代码用于使所述终端设备执行实现实施例1的消除语音交互中信道差异的方法步骤。
其中,可以以一种或多种程序设计语言的任意组合来编写用于执行本发明的程序代码,所述程序代码可以完全地在用户设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户设备上部分在远程设备上执行或完全在远程设备上执行。
虽然以上描述了本发明的具体实施方式,但是本领域的技术人员应当理解,这些仅是举例说明,在不背离本发明的原理和实质的前提下,可以对这些实施方式做出多种变更或修改。因此,本发明的保护范围由所附权利要求书限定。

Claims (14)

  1. 一种消除语音交互中信道差异的方法,其特征在于,包括以下步骤:
    在语音模型的训练阶段:
    针对每种场景下的训练语料,提取倒谱特征;
    根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值;
    利用所述训练语料中语音信号的倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并利用所述倒谱序列训练语音模型;其中,所述语音信号包括背景环境信号;
    在语音模型的使用阶段:
    采集用户语音信号,并提取所述用户语音信号的倒谱特征;其中,所述用户语音信号包括背景环境信号;
    根据所述倒谱特征估计与所述用户语音信号相同场景下背景环境信号的倒谱均值;
    利用所述倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并将所述倒谱序列输入至训练完成的语音模型。
  2. 如权利要求1所述的方法,其特征在于,在语音模型的训练阶段,根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值,具体包括:
    若训练语料中包括单独的背景环境信号,则根据相应场景下背景环境信号的倒谱特征计算所述背景环境信号的倒谱均值。
  3. 如权利要求1-2中至少一项所述的方法,其特征在于,在语音模型的训练阶段,根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值,具体包括:
    将训练语料中的语音信号平均分为若干段,并根据所述语音信号的倒谱特征分别计算每段语音信号的倒谱均值;
    将所有倒谱均值中的最小值作为相应场景下背景环境信号的倒谱均值。
  4. 如权利要求1-3中至少一项所述的方法,其特征在于,在语音模型的使用阶段,根据所述倒谱特征估计与所述用户语音信号相同场景下背景环境信号的倒谱均值,具体包括:
    利用一阶递归估计器计算背景环境信号的倒谱均值,计算公式如下:
    $\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$
    其中，x(k)为k时刻用户语音信号的倒谱特征，$\bar{h}(k-1)$为k-1时刻背景环境信号的倒谱均值，$\bar{h}(k)$为k时刻背景环境信号的倒谱均值，α为递归系数。
  5. 如权利要求4所述的方法,其特征在于,所述利用一阶递归估计器计算背景环境信号的倒谱均值包括:
    检测所述用户语音信号的语音区和非语音区;
    在语音区和非语音区设置不同的递归系数。
  6. 如权利要求5所述的方法,其特征在于,所述语音区包括语音初始阶段和语音非初始阶段,所述利用一阶递归估计器计算背景环境信号的倒谱均值还包括:
    在语音初始阶段和语音非初始阶段设置不同的递归系数。
  7. 一种消除语音交互中信道差异的系统,其特征在于,包括:用于语音模型训练阶段的第一提取模块、第一计算模块以及第一归一化模块,以及用于语音模型使用阶段的第二提取模块、第二计算模块以及第二归一化模块;
    第一提取模块用于针对每种场景下的训练语料,提取倒谱特征;
    第一计算模块用于根据所述倒谱特征计算相应场景下背景环境信号的倒谱均值;
    第一归一化模块用于利用所述训练语料中语音信号的倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并利用所述倒谱序列训练语音模型;其中,所述语音信号包括背景环境信号;
    第二提取模块用于采集用户语音信号,并提取所述用户语音信号的倒谱特征;其中,所述用户语音信号包括背景环境信号;
    第二计算模块用于根据所述倒谱特征估计与所述用户语音信号相同场景下背景环境信号的倒谱均值;
    第二归一化模块用于利用所述倒谱特征减去所述背景环境信号的倒谱均值,得到归一化的倒谱序列,并将所述倒谱序列输入至训练完成的语音模型。
  8. 如权利要求7所述的系统,其特征在于,所述第一计算模块具体用于在训练语料中包括单独的背景环境信号的情况下,根据相应场景下背景环境信号的倒谱特征计算所述背景环境信号的倒谱均值。
  9. 如权利要求7-8中至少一项所述的系统,其特征在于,所述第一计算模块具体用于将训练语料中的语音信号平均分为若干段,并根据所述语音信号的倒谱特征分别计算每段语音信号的倒谱均值;以及将所有倒谱均值中的最小值作为相应场景下背景环境信号的倒谱均值。
  10. 如权利要求7-9中至少一项所述的系统,其特征在于,所述第二计算模块具体用于利用一阶递归估计器计算背景环境信号的倒谱均值,计算公式如下:
    $\bar{h}(k) = \alpha\,\bar{h}(k-1) + (1-\alpha)\,x(k)$
    其中，x(k)为k时刻用户语音信号的倒谱特征，$\bar{h}(k-1)$为k-1时刻背景环境信号的倒谱均值，$\bar{h}(k)$为k时刻背景环境信号的倒谱均值，α为递归系数。
  11. 如权利要求10所述的系统,其特征在于,所述第二计算模块还用于检测所述用户语音信号的语音区和非语音区,以及在语音区和非语音区设置不同的递归系数。
  12. 如权利要求11所述的系统,其特征在于,所述语音区包括语音初始阶段和语音非初始阶段,所述第二计算模块还用于在语音初始阶段和语音非初始阶段设置不同的递归系数。
  13. 一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现权利要求1-6中任一项所述的消除语音交互中信道差异的方法。
  14. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1-6中任一项所述的消除语音交互中信道差异的方法步骤。
PCT/CN2020/091030 2020-04-30 2020-05-19 消除语音交互中信道差异的方法及系统、电子设备及介质 WO2021217750A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010363659.9 2020-04-30
CN202010363659.9A CN111627426B (zh) 2020-04-30 2020-04-30 消除语音交互中信道差异的方法及系统、电子设备及介质

Publications (1)

Publication Number Publication Date
WO2021217750A1 true WO2021217750A1 (zh) 2021-11-04

Family

ID=72273153

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/091030 WO2021217750A1 (zh) 2020-04-30 2020-05-19 消除语音交互中信道差异的方法及系统、电子设备及介质

Country Status (2)

Country Link
CN (1) CN111627426B (zh)
WO (1) WO2021217750A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077787A (zh) * 2020-12-22 2021-07-06 珠海市杰理科技股份有限公司 语音数据的识别方法、装置、芯片及可读存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3154487B2 (ja) * 1990-02-28 2001-04-09 エス・アール・アイ・インターナシヨナル 音声認識の際の雑音のロバストネスを改善するためにスペクトル的推定を行う方法
US9263041B2 (en) * 2012-03-28 2016-02-16 Siemens Aktiengesellschaft Channel detection in noise using single channel data
CN103730112B (zh) * 2013-12-25 2016-08-31 讯飞智元信息科技有限公司 语音多信道模拟与采集方法
CN109599118A (zh) * 2019-01-24 2019-04-09 宁波大学 一种鲁棒性的回放语音检测方法

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038528A (en) * 1996-07-17 2000-03-14 T-Netix, Inc. Robust speech processing with affine transform replicated data
CN1490787A (zh) * 2003-09-12 2004-04-21 中国科学院声学研究所 基于语音增强的语音识别方法
CN101271686A (zh) * 2007-03-22 2008-09-24 三星电子株式会社 使用语音信号的谐波估计噪声的方法和设备
US20120041764A1 (en) * 2010-08-16 2012-02-16 Kabushiki Kaisha Toshiba Speech processing system and method
US20120130716A1 (en) * 2010-11-22 2012-05-24 Samsung Electronics Co., Ltd. Speech recognition method for robot
CN102945670A (zh) * 2012-11-26 2013-02-27 河海大学 一种用于语音识别系统的多环境特征补偿方法
CN104157294A (zh) * 2014-08-27 2014-11-19 中国农业科学院农业信息研究所 一种农产品市场要素信息采集的鲁棒性语音识别方法
CN107408394A (zh) * 2014-11-12 2017-11-28 美国思睿逻辑有限公司 确定在主信道与参考信道之间的噪声功率级差和声音功率级差
CN104392718A (zh) * 2014-11-26 2015-03-04 河海大学 一种基于声学模型阵列的鲁棒语音识别方法
CN105355198A (zh) * 2015-10-20 2016-02-24 河海大学 一种基于多重自适应的模型补偿语音识别方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAO YUNPENG, WEIPING YE: "Survey of Feature Normalization Techniques for Robust Speech Recognition", JOURNAL OF CHINESE INFORMATION PROCESSING, vol. 24, no. 5, 30 September 2010 (2010-09-30), pages 106 - 116, XP055861982, ISSN: 1003-0077 *

Also Published As

Publication number Publication date
CN111627426A (zh) 2020-09-04
CN111627426B (zh) 2023-11-17

Similar Documents

Publication Publication Date Title
CN109147796B (zh) 语音识别方法、装置、计算机设备及计算机可读存储介质
CN110459241B (zh) 一种用于语音特征的提取方法和系统
US20050143997A1 (en) Method and apparatus using spectral addition for speaker recognition
Yadav et al. Addressing noise and pitch sensitivity of speech recognition system through variational mode decomposition based spectral smoothing
JP2002140089A (ja) 挿入ノイズを用いた後にノイズ低減を行うパターン認識訓練方法および装置
US9520138B2 (en) Adaptive modulation filtering for spectral feature enhancement
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN108305639B (zh) 语音情感识别方法、计算机可读存储介质、终端
Park et al. Acoustic interference cancellation for a voice-driven interface in smart TVs
CN110728991B (zh) 一种改进的录音设备识别算法
CN110970036A (zh) 声纹识别方法及装置、计算机存储介质、电子设备
CN112951259A (zh) 音频降噪方法、装置、电子设备及计算机可读存储介质
JP2018506078A (ja) 発話の復元のためのシステムおよび方法
Su et al. Perceptually-motivated environment-specific speech enhancement
Shahnawazuddin et al. Enhancing noise and pitch robustness of children's ASR
CN110268471A (zh) 具有嵌入式降噪的asr的方法和设备
Shahnawazuddin et al. Pitch-normalized acoustic features for robust children's speech recognition
Lee et al. Intra‐and Inter‐frame Features for Automatic Speech Recognition
Labied et al. An overview of automatic speech recognition preprocessing techniques
WO2021217750A1 (zh) 消除语音交互中信道差异的方法及系统、电子设备及介质
Kalamani et al. Continuous Tamil Speech Recognition technique under non stationary noisy environments
JP2019035862A (ja) 入力音マスク処理学習装置、入力データ処理関数学習装置、入力音マスク処理学習方法、入力データ処理関数学習方法、プログラム
Kaur et al. Optimizing feature extraction techniques constituting phone based modelling on connected words for Punjabi automatic speech recognition
CN113782005B (zh) 语音识别方法及装置、存储介质及电子设备
Zezario et al. Specialized speech enhancement model selection based on learned non-intrusive quality assessment metric.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933395

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933395

Country of ref document: EP

Kind code of ref document: A1
