WO2020186742A1 - A voice recognition method applied to ground-to-air communication - Google Patents

A voice recognition method applied to ground-to-air communication

Info

Publication number
WO2020186742A1
WO2020186742A1 (PCT/CN2019/111789)
Authority
WO
WIPO (PCT)
Prior art keywords
ground
air communication
voice
signal
acoustic model
Prior art date
Application number
PCT/CN2019/111789
Other languages
English (en)
French (fr)
Inventor
姚元飞
王群
陈洪瑀
Original Assignee
成都天奥信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 成都天奥信息科技有限公司 filed Critical 成都天奥信息科技有限公司
Publication of WO2020186742A1 publication Critical patent/WO2020186742A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L21/0208 Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/225 Feedback of the input speech
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech

Definitions

  • the present invention relates to the field of ground-to-air communication, in particular to a voice recognition method applied to ground-to-air communication.
  • Ground-to-air communication is mainly used for communication between controllers and pilots and is a core part of ensuring the safety of aircraft flight. Air traffic controllers work under heavy load and must stay highly concentrated; in a poor call environment it is easy for them to misunderstand the voice they hear, leading to wrong control commands being issued and greatly affecting flight safety. Ground-to-air voice recognition technology can automatically recognize the conversation between controller and pilot, monitor their behavior, and warn of dangers caused by wrong instructions, greatly safeguarding flight safety.
  • Because ground-to-air communication has special characteristics in pronunciation and grammar, a dedicated ground-to-air acoustic model must be established from its dialogue characteristics, pronunciation and intonation; consequently, there is currently no voice recognition technology on the market aimed at ground-to-air communication systems.
  • Voice recognition requires training an acoustic model from recorded clean voice signals; the signal to be recognized then undergoes the same processing and is matched against the trained model to obtain the final recognition result. Because the voice signal of ground-to-air communication is constantly subject to interference from environmental noise, it is mixed with many noise components. These noisy voice signals not only cause auditory discomfort, leading to auditory fatigue and loss of attention in controllers or flight crews, but also distort the voice signal and change its characteristic parameters, so that it fails to match the acoustic model and the final recognition result is wrong.
  • The current common solution is to cascade a speech enhancement algorithm at the recognition front end to improve speech intelligibility. The specific flow is shown in Figure 1.
  • HMM Hidden Markov Model
  • A is a finite set of N states
  • B is the set of observation sequences
  • M is the state transition probability
  • O is the output observation probability matrix, π is the initial probability sequence
  • F is the terminal state sequence.
  • Hidden-Markov-based acoustic modeling first computes, through the forward-backward and recursive algorithms, the probability of the output sequence of the initial model given the known model; the model is then calibrated using the Baum-Welch algorithm and the maximum-likelihood criterion, and finally the Viterbi algorithm decodes the recognition result.
  • The hidden Markov model achieves a high recognition rate for small-vocabulary isolated-word voice recognition, but its robustness drops significantly for large-vocabulary continuous voice recognition such as ground-to-air calls.
  • Current general speech enhancement algorithms are mostly improved spectral subtraction or Wiener filtering. Although simple in structure and easy to implement, and able to raise the signal-to-noise ratio of noisy speech, they often introduce additional noise and distort the speech. While such methods effectively improve listening comfort for the human ear, they are unsuitable as a voice recognition front end.
  • The speech enhancement algorithm based on the maximum a posteriori (MAP) probability can not only remove background noise effectively but also introduces no other noise interference.
  • MAP: maximum a posteriori probability
  • FFT: fast Fourier transform
  • k is the frequency bin of the τ-th frame
  • x(n) is the clean speech signal
  • d(n) is the noise.
  • The a priori SNR of the next frame is continuously updated from the value of the previous frame.
  • The a priori SNR of the first frame can be calculated by the following formula:
  • a is a constant, taken as 0.98.
  • The MAP algorithm obtains its gain function mainly by computing the a priori and a posteriori signal-to-noise ratios.
  • Both the a priori and a posteriori SNR estimates suffer from over-estimation, which causes the amplitude of the enhanced voice signal to change.
  • the invention provides a voice recognition method applied to ground-air communication, which can recognize and compare voice commands between controllers and pilots, detect sensitive words and give warnings, and can improve the voice recognition rate.
  • this application provides a voice recognition method applied to ground-to-air communication, the method includes:
  • The ground-to-air communication voice signal to be recognized, processed by the improved maximum a posteriori speech enhancement algorithm, is input into the ground-to-air triphone GMM-HMM acoustic model for recognition, which outputs the voice command text and keyword text of the controller and the pilot; when the recognized controller and pilot voice command texts are inconsistent, an alert is issued; the keyword detection model checks the recognized keyword text, and an alert is issued when a preset vocabulary item is detected.
  • The establishment of the triphone GMM-HMM acoustic model for ground-to-air communication specifically includes:
  • The labeled audio data are trained to obtain a ground-to-air triphone GMM-HMM acoustic model.
  • the feature extraction processing on the collected dialogue data specifically includes:
  • E(k) is the power spectrum of the voice signal
  • X is the voice signal
  • k is the k-th spectral line
  • The resulting voice power spectrum is weighted and summed through the Mel filter bank:
  • S(m) is the value after weighted summation
  • L is the number of spectral lines
  • H_m(k) is the band-pass filter
  • m is the m-th Mel filter
  • M is the total number of Mel filters
  • c(n) is the value after discrete cosine transform
  • n is the nth spectral line after discrete cosine transform
  • labeling the audio data after feature extraction specifically includes:
  • The ground-to-air triphone GMM-HMM acoustic model is established as follows: according to the call characteristics of ground-to-air communication, different HMM topologies for silent and non-silent phonemes are used to randomly initialize the GMM parameters; after random adjustment and integration of the Gaussian parameters, the ground-to-air triphone GMM-HMM acoustic model is obtained by repeated iteration.
  • The triphone GMM-HMM acoustic model for ground-to-air communication includes a continuous voice acoustic model and a keyword acoustic model.
  • After the voice to be recognized is processed, it is recognized by the continuous voice acoustic model and converted into text for output, and an alarm is prompted when the controller's and pilot's text commands are inconsistent; the keyword acoustic model detects whether preset sensitive information vocabulary is included, and when a sensitive word is recognized it is converted into text, output, and an alarm is prompted.
  • An adaptive filter is added to the maximum a posteriori speech enhancement algorithm to correct the deviation of the gain function.
  • The gain function of the adaptive filter is shown in the following formula:
  • k is the frequency bin of the τ-th frame
  • x(n) is the clean speech signal
  • d(n) is the noise
  • n is the time index
  • the prior SNR of the first frame is calculated by the following formula:
  • a is a constant and γ is the a posteriori signal-to-noise ratio
  • δ_d(k) is the noise power
  • ŜNR(k,τ) is the estimated signal-to-noise ratio
  • G_w(k,τ) is the adaptive filter value at the current moment
  • G_w(k,τ-1) is the adaptive filter value at the previous moment.
  • According to the grammatical and pronunciation characteristics of ground-to-air calls and their noise environment, the invention provides a voice recognition method suited to ground-to-air communication systems.
  • This method builds an acoustic model of ground-to-air call terminology, which can recognize and compare voice commands between controllers and pilots and can also detect sensitive words and give warnings; in view of the noise environment of ground-to-air communication, it provides an adaptively filtered speech enhancement algorithm to improve the voice recognition rate.
  • This method has two main parts: (1) a triphone GMM-HMM acoustic model is established according to the characteristics of ground-to-air communication, which can recognize the speech content and detect sensitive information; (2) to cope with the noise environment of ground-to-air calls, an adaptive filter is added to the MAP algorithm, and through continuous parameter optimization the background noise is removed while the characteristic parameters of the enhanced voice signal do not change greatly.
  • the invention combines the voice characteristics and noise environment of ground-to-air communication to establish a ground-to-air communication acoustic model.
  • The recognition model can recognize and compare the voice content of the controller and the pilot, and alerts when their commands are inconsistent; through the keyword detection model, the system also alerts when a preset high-risk sensitive word is detected, ensuring flight safety; an adaptive filtering algorithm enhances the voice to be recognized, reducing its background noise and improving its intelligibility so that it achieves a higher recognition rate at the recognition end.
  • The present invention establishes a ground-to-air communication voice recognition model. This method can recognize and compare whether the voice commands of the controller and the pilot are consistent, and can also detect preset sensitive words and give warnings, thereby improving flight safety.
  • The adaptive filter works as follows: in the low-SNR range below -15 dB, intelligibility is improved by introducing a corrected gain function; in the range above 10 dB, the amplitude spectrum is limited to reduce amplification distortion. This improves the recognition rate of ground-to-air calls and keeps the voice recognition system highly robust in harsh noise environments.
  • the present invention is mainly applied in a ground-to-air communication voice recognition system. Compared with the prior art, the present invention has a better effect on improving the voice recognition rate of ground-to-air calls and ensuring flight safety.
  • Fig. 1 is a schematic flow chart of a method for improving speech intelligibility through a speech enhancement algorithm in the prior art
  • FIG. 2 is a schematic flow diagram of the voice recognition algorithm in this application.
  • FIG. 3 is a schematic diagram of the speech enhancement algorithm flow in this application.
  • the invention is divided into two parts, namely the voice recognition end and the enhancement end.
  • FIG. 2 is a flowchart of a voice recognition algorithm in an embodiment of the present invention. The specific process is as follows:
  • The data used to establish the acoustic model of the present invention all take the daily conversations of a domestic airport's ground-to-air communication as a template, recorded by airport tower controllers hired to follow the daily call rules.
  • The ratio of male to female speakers is 2:1, the audio sampling rate is 16 kHz, the sampling precision is 16 bits, and the total recorded audio is 10 GB.
  • Audio data annotation: in large-vocabulary continuous speech recognition there are words with the same pronunciation but different meanings, so the current phoneme is affected by the preceding and following phonemes and the characteristic parameters across continuous speech cannot be computed well. A context-dependent triphone model is therefore generally used, and a clustering algorithm performs context clustering to obtain cluster sets for specific states. The text dictionary is first force-aligned with the audio data, the optimal path is obtained through the Viterbi-beam algorithm, and finally the optimal frame-level annotation is obtained.
  • Acoustic model 1 in Figure 2 is a continuous voice acoustic model
  • acoustic model 2 is a keyword acoustic model.
  • After the voice to be recognized is processed, it can be recognized by acoustic model 1 and converted into text content for output.
  • An alarm is prompted when the controller's and pilot's text commands are inconsistent; acoustic model 2 can also detect whether preset sensitive information vocabulary is included.
  • When a sensitive word is recognized, it is converted into text content, output, and an alarm is prompted.
  • Fig. 3 is a flowchart of a speech enhancement algorithm in an embodiment of the present invention.
  • the present invention mainly removes background noise and improves speech intelligibility by adding an adaptive filter.
  • the gain function of the adaptive filter is determined as shown in the following formula:
  • the gain function of the adaptive filter is adjusted for three different signal-to-noise ratio intervals.
  • When the computed signal-to-noise ratio at point τ of the k-th frame is below -15 dB, the frequency bin can be considered to be mainly a noise signal, and the noise interference is removed by introducing a correction offset.
  • When the signal-to-noise ratio is above 10 dB, the threshold is set to 0.8 to ensure that the signal does not receive excessive gain compensation and its output amplitude does not change significantly.
  • When the signal-to-noise ratio is in the range of -15 dB to 10 dB, the energies of the speech and noise signals are relatively indistinct, so a lower threshold is added to the gain function in this interval (0.8 works best experimentally) below which its value may not fall.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Telephonic Communication Services (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

A voice recognition method applied to ground-to-air communication, comprising: establishing a triphone acoustic model for ground-to-air calls; performing speech enhancement and background-noise removal on a received ground-to-air communication voice signal to be recognized by means of an improved maximum a posteriori probability speech enhancement algorithm; inputting the processed signal into the triphone acoustic model for recognition, recognizing the voice command text and keyword text of the controller and the pilot, and issuing an alert when the recognized controller and pilot voice command texts are inconsistent; and detecting the recognized keyword text with a keyword detection model, issuing an alert when a preset vocabulary item is detected. The method can recognize and compare voice commands between controllers and pilots, detect sensitive words and give alerts, and can improve the voice recognition rate.

Description

A voice recognition method applied to ground-to-air communication

Technical field

The present invention relates to the field of ground-to-air communication, and in particular to a voice recognition method applied to ground-to-air communication.

Background art

Ground-to-air communication is mainly used for calls between controllers and pilots and is a core part of ensuring the safety of aircraft flight. Air traffic controllers work under heavy load and must stay highly concentrated; when the call environment is poor it is easy for them to misunderstand the voice they hear, leading to wrong control commands being issued and greatly affecting flight safety. Ground-to-air voice recognition technology can automatically recognize the calls between controller and pilot, monitor their behavior, and warn of dangers caused by wrong instructions, greatly safeguarding flight safety.

Although ground-to-air voice recognition is one effective way of ensuring flight safety, most current ground-to-air communication systems do not use it: because ground-to-air calls are special in pronunciation, intonation and other respects, today's general-purpose voice recognition technology cannot be used directly. In addition, ground-to-air communication is affected by the surrounding environment, so calls carry some noise interference, which makes ground-to-air dialogue difficult to recognize.

Existing general-purpose voice recognition technology is therefore unsuitable for ground-to-air communication systems. Since ground-to-air calls are also special in pronunciation and grammar, a dedicated ground-to-air acoustic model must be built anew from their dialogue characteristics, pronunciation and intonation, and there is currently no voice recognition technology on the market for ground-to-air communication systems.

Voice recognition requires training an acoustic model from recorded clean voice signals; the signal to be recognized then undergoes the same processing and is matched against the trained model to obtain the final recognition result. Because the ground-to-air voice signal is constantly disturbed by environmental noise, it is mixed with many noise components. These noisy signals not only cause auditory discomfort, leading to hearing fatigue and reduced attention in controllers or flight crews, but also distort the voice signal and change its characteristic parameters, so that it cannot match the acoustic model and the final recognition result is wrong. The current common solution is to cascade a speech enhancement algorithm at the recognition front end to improve intelligibility; the specific flow is shown in Figure 1.

Hidden Markov model (HMM). The hidden Markov model is widely used in speech signal processing. An HMM can be described by θ = {A, B, M, O, π, F}, where A is a finite set of N states, B is the set of observation sequences, M is the state transition probability, O is the output observation probability matrix, π is the initial probability sequence, and F is the terminal state sequence. HMM-based acoustic modeling first computes, through the forward-backward and recursive algorithms, the probability of the output sequence of the initial model given the known model; the model is then calibrated with the Baum-Welch algorithm and the maximum-likelihood criterion, and finally the Viterbi algorithm decodes the recognition result. The hidden Markov model achieves a high recognition rate for small-vocabulary isolated-word recognition, but its robustness drops markedly for large-vocabulary continuous speech such as ground-to-air calls.
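The Viterbi decoding step mentioned above can be illustrated with a toy implementation; the two-state model and every probability below are invented for illustration and are not taken from the patent's trained acoustic model:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence.

    Works in log-space to avoid underflow on long sequences.
    """
    # best[t][s] = (log-prob of best path ending in state s at time t, predecessor state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
             for s in states}]
    for t in range(1, len(obs)):
        best.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p][0] + math.log(trans_p[p][s]) + math.log(emit_p[s][obs[t]]), p)
                for p in states)
            best[t][s] = (prob, prev)
    # Backtrack from the best final state
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return list(reversed(path))

# Toy two-state model (silence vs speech) with made-up parameters
states = ["sil", "speech"]
start = {"sil": 0.6, "speech": 0.4}
trans = {"sil": {"sil": 0.7, "speech": 0.3}, "speech": {"sil": 0.2, "speech": 0.8}}
emit = {"sil": {"low": 0.9, "high": 0.1}, "speech": {"low": 0.2, "high": 0.8}}
print(viterbi(["low", "high", "high"], states, start, trans, emit))  # → ['sil', 'speech', 'speech']
```

A real recognizer decodes over thousands of triphone HMM states with GMM emission densities; the recursion, however, is exactly this one.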
Speech enhancement algorithms

Traditional methods:

Current general speech enhancement algorithms are mostly improved spectral subtraction or Wiener filtering. Although simple in structure and easy to implement, and able to raise the signal-to-noise ratio (SNR) of noisy speech, they often introduce additional noise and distort the speech. While such methods effectively improve listening comfort for the human ear, they are unsuitable as a voice recognition front end.

Minimum mean-square error algorithm:

Compared with spectral subtraction and Wiener filtering, the speech enhancement algorithm based on the maximum a posteriori (MAP) probability not only removes background noise effectively but also introduces no other noise interference. Suppose the signal is y(n) = x(n) + d(n); after framing and Hamming windowing, the Fourier transform (FFT) gives:

Y(k,τ) = X(k,τ) + D(k,τ)        (1)

where k is the frequency bin of the τ-th frame, x(n) is the clean speech signal, and d(n) is the noise.

Taking the signal of speech-free segments as noise frames gives the noise power δ_d; the a posteriori SNR is then:

γ(k,τ) = |Y(k,τ)|² / δ_d(k)        (2)

The a priori SNR of the next frame is continuously updated from the value of the previous frame. In the first frame of the signal there is no previous frame to refer to, so the a priori SNR of the first frame is computed as:

ξ(k,1) = a + (1 − a)·max(γ(k,1) − 1, 0)        (3)

where a is a constant, taken as 0.98.

From the second frame onward, the a priori SNR is computed as:

ξ(k,τ) = a·|X̂(k,τ−1)|² / δ_d(k) + (1 − a)·max(γ(k,τ) − 1, 0)        (4)

The MAP gain function is obtained from the a priori and a posteriori SNRs, finally yielding the enhanced voice signal:

X̂(k,τ) = G(k,τ)·Y(k,τ)        (5)

Although spectral subtraction and Wiener filtering are simple to implement, they introduce too much "musical noise"; the SNR improves somewhat, but the actual auditory effect is not obvious, and at low SNR speech processed by spectral subtraction or a Wiener filter actually sounds worse. The MAP algorithm obtains its gain function mainly by computing the a priori and a posteriori SNRs, and since both suffer from over-estimation, the amplitude of the enhanced voice signal changes.
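The decision-directed a priori SNR tracking described above can be sketched per frequency bin as follows. This is a minimal sketch: a Wiener-type gain stands in for the MAP gain, whose exact expression the text does not give, and the input values are invented:

```python
def enhance_frames(noisy_power, noise_power, a=0.98):
    """Decision-directed a priori SNR tracking over successive frames.

    noisy_power: list of |Y(k, tau)|^2 values for one frequency bin k.
    noise_power: noise power delta_d(k) estimated from speech-free frames.
    Returns the enhanced power |X_hat(k, tau)|^2 per frame.
    """
    enhanced = []
    prev_clean = 0.0
    for tau, y_pow in enumerate(noisy_power):
        gamma = y_pow / noise_power                       # a posteriori SNR, eq. (2)
        if tau == 0:                                      # no previous frame: eq. (3)
            xi = a + (1.0 - a) * max(gamma - 1.0, 0.0)
        else:                                             # decision-directed: eq. (4)
            xi = a * prev_clean / noise_power + (1.0 - a) * max(gamma - 1.0, 0.0)
        gain = xi / (1.0 + xi)       # Wiener-type gain, a stand-in for the MAP gain
        prev_clean = (gain ** 2) * y_pow                  # |X_hat(k, tau)|^2
        enhanced.append(prev_clean)
    return enhanced

# Invented example: one bin over three frames, unit noise power
print(enhance_frames([1.0, 4.0, 0.5], 1.0))
```

Because the gain is always below one, the enhanced power never exceeds the noisy input; the smoothing constant a = 0.98 makes the a priori SNR track slowly, which is exactly the over-/under-estimation the patent's adaptive filter sets out to correct.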
Summary of the invention

The present invention provides a voice recognition method applied to ground-to-air communication, which can recognize and compare voice commands between controllers and pilots, detect sensitive vocabulary and give alerts, and can improve the voice recognition rate.

To achieve the above object, this application provides a voice recognition method applied to ground-to-air communication, the method comprising:

establishing a triphone GMM-HMM acoustic model for ground-to-air calls;

adding an adaptive filter to the maximum a posteriori speech enhancement algorithm, and using the improved algorithm to enhance the received ground-to-air voice signal to be recognized and remove its background noise;

inputting the signal processed by the improved maximum a posteriori speech enhancement algorithm into the ground-to-air triphone GMM-HMM acoustic model for recognition, recognizing the voice command text and keyword text of the controller and the pilot, and issuing an alert when the recognized controller and pilot voice command texts are inconsistent; detecting the recognized keyword text with the keyword detection model and issuing an alert when a preset vocabulary item is detected.

Further, establishing the ground-to-air triphone GMM-HMM acoustic model specifically comprises:

collecting daily dialogue data of airport ground-to-air communication;

performing feature extraction on the collected dialogue data and removing unneeded data;

annotating the feature-extracted audio data;

training the annotated audio data to obtain the ground-to-air triphone GMM-HMM acoustic model.
Further, performing feature extraction on the collected dialogue data specifically comprises:

using Mel-frequency cepstral coefficients for feature extraction; the dialogue audio signal is Fourier-transformed and its power spectrum computed:

E(k) = [X(k)]²        (6)

where E(k) is the voice signal power spectrum, X is the voice signal, and k is the k-th spectral line;

the resulting voice power spectrum is passed through a Mel filter bank and weighted-summed:

S(m) = Σ_{k=0}^{L−1} E(k)·H_m(k)        (7)

where S(m) is the weighted-sum value, L is the number of spectral lines, H_m(k) is the band-pass filter, m indexes the m-th Mel filter, and M is the total number of Mel filters;

taking the logarithm and then the discrete cosine transform gives:

c(n) = Σ_{m=1}^{M} log S(m)·cos(πn(m − 0.5)/M)        (8)

where c(n) is the value after the discrete cosine transform and n is the n-th spectral line after the discrete cosine transform.
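The power-spectrum, filter-bank and DCT steps above can be sketched for a single windowed frame as follows. For brevity the triangular filters are spaced linearly over the bins; a real MFCC front end spaces them on the Mel scale (mel = 2595·log10(1 + f/700)) and uses an FFT rather than a plain DFT:

```python
import cmath
import math

def mfcc_frame(frame, num_filters=6, num_ceps=4):
    """One-frame MFCC sketch: power spectrum, filter-bank weighting, log, DCT."""
    n = len(frame)
    nbins = n // 2 + 1
    # E(k) = |X(k)|^2, computed with a plain DFT for clarity
    power = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                     for t in range(n))) ** 2 for k in range(nbins)]
    # S(m) = sum_k E(k) * H_m(k) with triangular band-pass filters
    edges = [round(i * (nbins - 1) / (num_filters + 1)) for i in range(num_filters + 2)]
    s = []
    for m in range(1, num_filters + 1):
        lo, c, hi = edges[m - 1], edges[m], edges[m + 1]
        acc = 0.0
        for k in range(lo, hi + 1):
            if k < c:
                w = (k - lo) / (c - lo) if c > lo else 0.0   # rising edge
            else:
                w = (hi - k) / (hi - c) if hi > c else 0.0   # falling edge
            acc += power[k] * w
        s.append(acc)
    # c(n) = DCT of the log filter-bank energies (small epsilon guards log(0))
    return [sum(math.log(s[m] + 1e-12) * math.cos(math.pi * i * (m + 0.5) / num_filters)
                for m in range(num_filters)) for i in range(num_ceps)]
```

Feeding it a 32-sample sine frame yields a short cepstral vector; in the patent's pipeline one such vector is produced per 10-25 ms frame and passed to the GMM-HMM model.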
Further, annotating the feature-extracted audio data specifically comprises:

selecting a context-dependent ground-to-air triphone GMM-HMM acoustic model and performing context clustering with a clustering algorithm to obtain cluster sets for specific states; force-aligning the text dictionary with the audio data, obtaining the optimal path with the Viterbi-beam algorithm, and obtaining the optimal frame-level annotation.

Further, establishing the ground-to-air triphone GMM-HMM acoustic model on the basis of the annotated audio data specifically comprises: according to the call characteristics of ground-to-air communication, using different HMM topologies for silent and non-silent phonemes to randomly initialize the GMM parameters; after random adjustment and integration of the Gaussian parameters, obtaining the ground-to-air triphone GMM-HMM acoustic model by repeated iteration.

Further, the ground-to-air triphone GMM-HMM acoustic model comprises a continuous voice acoustic model and a keyword acoustic model. After the voice to be recognized is processed, it is recognized by the continuous voice acoustic model and converted into text for output, and an alarm is prompted when the controller's and pilot's text commands are inconsistent; the keyword acoustic model detects whether preset sensitive vocabulary is included, and when a sensitive word is recognized it is converted into text, output, and an alarm is prompted.
Further, an adaptive filter is added to the maximum a posteriori speech enhancement algorithm to correct the deviation of the gain function.

Further, the gain function of the adaptive filter is as follows:

suppose the signal is y(n) = x(n) + d(n); after framing and Hamming windowing, the Fourier transform (FFT) gives:

Y(k,τ) = X(k,τ) + D(k,τ)        (1)

where k is the frequency bin of the τ-th frame, x(n) is the clean speech signal, d(n) is the noise, and n is the time index;

taking the signal of speech-free segments as noise frames gives the noise power δ_d, and the a posteriori SNR is:

γ(k,τ) = |Y(k,τ)|² / δ_d(k)        (2)

the a priori SNR of the first frame is calculated as:

ξ(k,1) = a + (1 − a)·max(γ(k,1) − 1, 0)        (3)

where a is a constant and γ is the a posteriori signal-to-noise ratio;

from the second frame onward, the a priori SNR is calculated as:

ξ(k,τ) = a·|X̂(k,τ−1)|² / δ_d(k) + (1 − a)·max(γ(k,τ) − 1, 0)        (4)

the gain function G_w(k,τ) of the adaptive filter is defined piecewise over three SNR intervals [formula shown only as an image in the source] (9), where X̂ is the estimated clean voice signal, δ_d(k) is the noise power, and ŜNR(k,τ) is the estimated signal-to-noise ratio;

substituting formula (9) into formulas (3) and (4) gives the improved a priori SNR [formulas shown only as images in the source], where G_w(k,τ) is the adaptive filter value at the current moment and G_w(k,τ−1) is the adaptive filter value at the previous moment.
According to the grammatical and pronunciation characteristics of ground-to-air calls and their noise environment, the present invention provides a voice recognition method suited to ground-to-air communication systems. The method builds an acoustic model of ground-to-air call terminology that can recognize and compare voice commands between controllers and pilots and can also detect sensitive vocabulary and give alerts; in view of the noise environment of ground-to-air communication, it provides an adaptively filtered speech enhancement algorithm to improve the voice recognition rate. The method has two main parts: (1) a triphone GMM-HMM acoustic model is established according to the characteristics of ground-to-air calls, which can recognize and compare voice content and detect sensitive information; (2) to cope with the noise environment of ground-to-air calls, an adaptive filter is added to the MAP algorithm, and through continuous parameter optimization the background noise is removed while the characteristic parameters of the enhanced voice signal do not change greatly.

Combining the voice characteristics and noise environment of ground-to-air communication, the present invention establishes a ground-to-air acoustic model. The recognition model can recognize and compare the voice content of the controller and the pilot, and alerts when their commands are inconsistent; through the keyword detection model, the system also alerts when a preset high-risk sensitive word is detected, ensuring flight safety; an adaptive filtering algorithm enhances the voice to be recognized, reducing its background noise and improving its intelligibility so that it achieves a higher recognition rate at the recognition end.

One or more technical solutions provided in this application have at least the following technical effects or advantages:

For ground-to-air flight safety, the present invention establishes a ground-to-air voice recognition model. The method can recognize and compare whether the voice commands of the controller and the pilot are consistent, and can also detect preset sensitive vocabulary and give alerts, thereby improving flight safety.

The existing MAP algorithm is optimized by adding an adaptive filter to further improve its enhancement effect. The adaptive filter works as follows: in the low-SNR interval below -15 dB, intelligibility is improved by introducing a corrected gain function; in the interval above 10 dB, the amplitude spectrum is limited to reduce amplification distortion. This improves the recognition rate of ground-to-air calls and keeps the voice recognition system highly robust in harsh noise environments.

The present invention is mainly applied in ground-to-air voice recognition systems; compared with the prior art, it is more effective at improving the voice recognition rate of ground-to-air calls and ensuring flight safety.
Brief description of the drawings

The drawings described here provide a further understanding of the embodiments of the invention and form part of this application; they do not limit the embodiments of the invention.

Figure 1 is a schematic flowchart of a prior-art method of improving voice intelligibility with a speech enhancement algorithm;

Figure 2 is a schematic flowchart of the voice recognition algorithm in this application;

Figure 3 is a schematic flowchart of the speech enhancement algorithm in this application.

Detailed description of the embodiments

In order to understand the above objects, features and advantages of the invention more clearly, the invention is described in further detail below with reference to the drawings and specific embodiments. It should be noted that the embodiments of this application, and the features within them, may be combined with one another provided they do not conflict.

Many specific details are set forth in the following description to facilitate a full understanding of the invention; however, the invention can also be implemented in ways other than those described here, so the scope of protection of the invention is not limited by the specific embodiments disclosed below.
The invention is divided into two parts: the voice recognition end and the enhancement end.

1 Recognition end

Figure 2 is a flowchart of the voice recognition algorithm in an embodiment of the invention. The specific process is as follows:

(1) The data used to build the acoustic model of the invention take the daily conversations of the ground-to-air communication of a domestic airport as a template, recorded by airport tower controllers hired to follow the daily call rules. The ratio of male to female speakers is 2:1, the audio sampling rate is 16 kHz, the sampling precision is 16 bits, and the total recorded audio is 10 GB.

(2) Feature extraction. Because the collected data contain much redundant information, features must be extracted from the useful information to reduce unnecessary computation; this patent uses Mel-frequency cepstral coefficients. The signal is first Fourier-transformed and its power spectrum computed:

E(k) = [X(k)]²        (6)

This is passed through the Mel filter bank and weighted-summed:

S(m) = Σ_{k=0}^{L−1} E(k)·H_m(k)        (7)

Finally, taking the logarithm and the discrete cosine transform gives:

c(n) = Σ_{m=1}^{M} log S(m)·cos(πn(m − 0.5)/M)        (8)

(3) Audio data annotation. In large-vocabulary continuous voice recognition there are words with the same pronunciation but different meanings, so the current phoneme is affected by the preceding and following phonemes and the characteristic parameters across continuous speech cannot be computed well. A context-dependent triphone model is therefore generally chosen, and a clustering algorithm performs context clustering to obtain cluster sets for specific states. The text dictionary is first force-aligned with the audio data, the optimal path is obtained with the Viterbi-beam algorithm, and finally the optimal frame-level annotation is obtained.

(4) Establishing the ground-to-air triphone GMM-HMM acoustic model. According to the call characteristics of ground-to-air communication, different HMM topologies for silent and non-silent phonemes are used to randomly initialize the GMM parameters. After random adjustment and integration of the Gaussian parameters, repeated iteration finally yields the triphone GMM-HMM acoustic model.

In Figure 2, acoustic model 1 is the continuous voice acoustic model and acoustic model 2 is the keyword acoustic model. After the voice to be recognized is processed, it can be recognized by acoustic model 1 and converted into text for output, and an alarm is prompted when the controller's and pilot's text commands are inconsistent; acoustic model 2 can also detect whether preset sensitive vocabulary is included, and when a sensitive word is recognized it is converted into text, output, and an alarm is prompted.
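The alerting logic around the two recognizers can be sketched as plain text post-processing. The transmissions, the token-level comparison, and the sensitive-word list below are invented examples; a real system would compare aligned command slots from the recognizer output:

```python
def check_transmissions(controller_text, pilot_readback, sensitive_words):
    """Compare a controller command with the pilot read-back and scan both
    for preset sensitive vocabulary; return a list of alert strings."""
    alerts = []
    # Token-level comparison stands in for proper command-slot alignment
    if controller_text.lower().split() != pilot_readback.lower().split():
        alerts.append("MISMATCH: read-back differs from controller command")
    for text in (controller_text, pilot_readback):
        for word in sensitive_words:
            if word.lower() in text.lower().split():
                alerts.append(f"SENSITIVE: '{word}' detected")
    return alerts

# Invented example transmissions
alerts = check_transmissions(
    "climb to flight level 320",
    "climb to flight level 310",
    ["mayday", "hijack"])
print(alerts)  # → ['MISMATCH: read-back differs from controller command']
```

The mismatch branch corresponds to acoustic model 1's command comparison and the sensitive-word scan to acoustic model 2's keyword detection.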
2 Enhancement end

Figure 3 is a flowchart of the speech enhancement algorithm in an embodiment of the invention. The invention removes background noise and improves voice intelligibility mainly by adding an adaptive filter.

An adaptive filter is added to correct the deviation of the gain function. From formula (4) it can be seen that the a priori SNR of the next frame is updated from the previous frame; since the currently computed a priori SNR is not entirely accurate, the estimate of the next frame's a priori SNR derived from it may be too large or too small, degrading the enhancement performance. For this reason, the invention adds an adaptive filter to formulas (3) and (4) to adjust the estimation range of the a priori SNR over different SNR intervals.

Through simulation verification and engineering debugging, the gain function of the adaptive filter was determined as formula (10) [shown only as an image in the source]. Substituting formula (10) into formulas (3) and (4) gives the improved a priori SNR [formulas shown only as images in the source].

The gain function of the adaptive filter is adjusted over three different SNR intervals. When the computed SNR at point τ of the k-th frame is below -15 dB, the frequency bin can be considered to be mainly a noise signal, and the noise interference is removed by introducing a correction offset. When the SNR is above 10 dB, the speech component of the signal far exceeds the noise, so a threshold of 0.8 is set to ensure that the signal does not receive excessive gain compensation and its output amplitude does not change significantly. When the SNR lies in the interval from -15 dB to 10 dB, the energies of the speech and noise signals are relatively indistinct and the adaptive filter must further separate the noise components; a lower threshold is therefore added to the gain function in this interval so that its value cannot fall below it. Extensive simulation shows that a value of 0.8 works best.
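The three-interval behaviour just described can be sketched as follows. Since the published gain formula appears only as an image in the source, this is an interpretation of the prose: the interval edges (-15 dB, 10 dB) and the 0.8 threshold come from the text, while the attenuation factor used below -15 dB is an invented placeholder:

```python
def adaptive_gain(base_gain, snr_db, floor=0.8, low_db=-15.0, high_db=10.0):
    """Piecewise correction of a base gain value over three SNR intervals,
    following the behaviour described in the text (exact formula not published
    in the available text; the 0.1 attenuation factor is illustrative only)."""
    if snr_db < low_db:
        # Bin dominated by noise: attenuate strongly via a correction offset
        return 0.1 * base_gain
    if snr_db > high_db:
        # Strong speech: cap the gain so the output amplitude is not inflated
        return min(base_gain, floor)
    # Ambiguous region: do not let the gain fall below the floor
    return max(base_gain, floor)

print(adaptive_gain(0.5, -20.0))  # noise-dominated bin is attenuated
print(adaptive_gain(0.95, 20.0))  # strong-speech bin is capped at the floor
```

In the enhancement flow of Figure 3, a function of this shape would post-process the gain of each frequency bin before the a priori SNR of the next frame is updated.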
Although preferred embodiments of the invention have been described, those skilled in the art may make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments together with all changes and modifications falling within the scope of the invention.

Obviously, those skilled in the art can make various changes and variations to the invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (8)

  1. A voice recognition method applied to ground-to-air communication, characterized in that the method comprises:
    establishing a triphone GMM-HMM acoustic model for ground-to-air calls;
    adding an adaptive filter to the maximum a posteriori speech enhancement algorithm, and using the improved maximum a posteriori speech enhancement algorithm to enhance the received ground-to-air communication voice signal to be recognized and remove its background noise;
    inputting the ground-to-air communication voice signal processed by the improved maximum a posteriori speech enhancement algorithm into the ground-to-air triphone GMM-HMM acoustic model for recognition, recognizing the voice command text and keyword text of the controller and the pilot, and issuing an alert when the recognized controller and pilot voice command texts are inconsistent; detecting the recognized keyword text with the keyword detection model and issuing an alert when a preset vocabulary item is detected.
  2. The voice recognition method applied to ground-to-air communication according to claim 1, characterized in that establishing the ground-to-air triphone GMM-HMM acoustic model specifically comprises:
    collecting daily dialogue data of airport ground-to-air communication;
    performing feature extraction on the collected dialogue data and removing unneeded data;
    annotating the feature-extracted audio data;
    training the annotated audio data to obtain the ground-to-air triphone GMM-HMM acoustic model.
  3. The voice recognition method applied to ground-to-air communication according to claim 2, characterized in that performing feature extraction on the collected dialogue data specifically comprises:
    using Mel-frequency cepstral coefficients for feature extraction, Fourier-transforming the dialogue audio signal and computing its power spectrum:
    E(k) = [X(k)]²  (6)
    where E(k) is the voice signal power spectrum, X is the voice signal, and k is the k-th spectral line;
    passing the resulting voice power spectrum through a Mel filter bank with weighted summation:
    S(m) = Σ_{k=0}^{L−1} E(k)·H_m(k)  (7)
    where S(m) is the weighted-sum value, L is the number of spectral lines, H_m(k) is the band-pass filter, m indexes the m-th Mel filter, and M is the total number of Mel filters;
    taking the logarithm and then the discrete cosine transform:
    c(n) = Σ_{m=1}^{M} log S(m)·cos(πn(m − 0.5)/M)  (8)
    where c(n) is the value after the discrete cosine transform and n is the n-th spectral line after the discrete cosine transform.
  4. The voice recognition method applied to ground-to-air communication according to claim 2, characterized in that annotating the feature-extracted audio data specifically comprises:
    selecting a context-dependent ground-to-air triphone GMM-HMM acoustic model and performing context clustering with a clustering algorithm to obtain cluster sets for specific states; force-aligning the text dictionary with the audio data, obtaining the optimal path with the Viterbi-beam algorithm, and obtaining the optimal frame-level annotation.
  5. The voice recognition method applied to ground-to-air communication according to claim 2, characterized in that establishing the ground-to-air triphone GMM-HMM acoustic model on the basis of the annotated audio data specifically comprises: according to the call characteristics of ground-to-air communication, using different HMM topologies for silent and non-silent phonemes to randomly initialize the GMM parameters; after random adjustment and integration of the Gaussian parameters, obtaining the ground-to-air triphone GMM-HMM acoustic model by repeated iteration.
  6. The voice recognition method applied to ground-to-air communication according to claim 1, characterized in that the ground-to-air triphone GMM-HMM acoustic model comprises a continuous voice acoustic model and a keyword acoustic model; after the voice to be recognized is processed, it is recognized by the continuous voice acoustic model and converted into text for output, and an alarm is prompted when the controller's and pilot's text commands are inconsistent; the keyword acoustic model detects whether preset sensitive vocabulary is included, and when a sensitive word is recognized it is converted into text, output, and an alarm is prompted.
  7. The voice recognition method applied to ground-to-air communication according to claim 1, characterized in that an adaptive filter is added to the maximum a posteriori speech enhancement algorithm to correct the deviation of the gain function.
  8. The voice recognition method applied to ground-to-air communication according to claim 7, characterized in that the gain function of the adaptive filter is as follows:
    suppose the signal is y(n) = x(n) + d(n); after framing and Hamming windowing, the Fourier transform (FFT) gives:
    Y(k,τ) = X(k,τ) + D(k,τ)  (1)
    where k is the frequency bin of the τ-th frame, x(n) is the clean speech signal, d(n) is the noise, and n is the time index;
    taking the signal of speech-free segments as noise frames gives the noise power δ_d, and the a posteriori SNR is:
    γ(k,τ) = |Y(k,τ)|² / δ_d(k)  (2)
    the a priori SNR of the first frame is calculated as:
    ξ(k,1) = a + (1 − a)·max(γ(k,1) − 1, 0)  (3)
    where a is a constant and γ is the a posteriori signal-to-noise ratio;
    from the second frame onward, the a priori SNR is calculated as:
    ξ(k,τ) = a·|X̂(k,τ−1)|² / δ_d(k) + (1 − a)·max(γ(k,τ) − 1, 0)  (4)
    the gain function of the adaptive filter is defined piecewise over three SNR intervals [formula shown only as an image in the source] (9), where X̂ is the estimated clean voice signal, δ_d(k) is the noise power, and ŜNR(k,τ) is the estimated signal-to-noise ratio;
    substituting formula (9) into formulas (3) and (4) gives the improved a priori SNR [formulas shown only as images in the source], where G_w(k,τ) is the adaptive filter value at the current moment and G_w(k,τ−1) is the adaptive filter value at the previous moment.
PCT/CN2019/111789 2019-03-20 2019-10-18 A voice recognition method applied to ground-to-air communication WO2020186742A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910213205.0 2019-03-20
CN201910213205.0A CN110189746B (zh) 2019-03-20 2019-03-20 A voice recognition method applied to ground-to-air communication

Publications (1)

Publication Number Publication Date
WO2020186742A1 true WO2020186742A1 (zh) 2020-09-24

Family

ID=67713727

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111789 WO2020186742A1 (zh) 2019-03-20 2019-10-18 一种应用于地空通信的话音识别方法

Country Status (2)

Country Link
CN (1) CN110189746B (zh)
WO (1) WO2020186742A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189746B (zh) * 2019-03-20 2021-06-11 成都天奥信息科技有限公司 A voice recognition method applied to ground-to-air communication
CN110689906A (zh) * 2019-11-05 2020-01-14 江苏网进科技股份有限公司 Law-enforcement detection method and system based on speech processing technology
CN112309403A (zh) * 2020-03-05 2021-02-02 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN111667830B (zh) * 2020-06-08 2022-04-29 中国民航大学 Airport control decision support system and method based on semantic recognition of controller instructions
CN113129919A (zh) * 2021-04-17 2021-07-16 上海麦图信息科技有限公司 Deep-learning-based noise reduction method for air traffic control speech

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101916565A (zh) * 2010-06-24 2010-12-15 北京华安天诚科技有限公司 Speech recognition method and speech recognition device in an air traffic control system
CN102074246A (zh) * 2011-01-05 2011-05-25 瑞声声学科技(深圳)有限公司 Dual-microphone-based speech enhancement device and method
US20150081292A1 (en) * 2013-09-18 2015-03-19 Airbus Operations S.A.S. Method and device for automatically managing audio air control messages on an aircraft
CN104700661A (zh) * 2013-12-10 2015-06-10 霍尼韦尔国际公司 System and method for presenting air traffic control voice information as text and graphics
CN106297796A (zh) * 2016-03-25 2017-01-04 李克军 Pilot read-back monitoring method and device
CN106875948A (zh) * 2017-02-22 2017-06-20 中国电子科技集团公司第二十八研究所 Conflict alerting method based on control speech
CN110189746A (zh) * 2019-03-20 2019-08-30 成都天奥信息科技有限公司 A voice recognition method applied to ground-to-air communication

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160077523A1 (en) * 2013-07-22 2016-03-17 Sikorsky Aircraft Corporation System for controlling and communicating with aircraft
CN108986791B (zh) * 2018-08-10 2021-01-05 南京航空航天大学 Chinese and English speech recognition method and system for the civil aviation air-ground communication domain
CN109119072A (zh) * 2018-09-28 2019-01-01 中国民航大学 DNN-HMM-based acoustic model construction method for civil aviation air-ground calls
CN109087657B (zh) * 2018-10-17 2021-09-14 成都天奥信息科技有限公司 Speech enhancement method applied to ultra-short-wave radios


Also Published As

Publication number Publication date
CN110189746B (zh) 2021-06-11
CN110189746A (zh) 2019-08-30

Similar Documents

Publication Publication Date Title
WO2020186742A1 (zh) A voice recognition method applied to ground-to-air communication
Mitra et al. Medium-duration modulation cepstral feature for robust speech recognition
Narayanan et al. Ideal ratio mask estimation using deep neural networks for robust speech recognition
Kingsbury et al. Robust speech recognition using the modulation spectrogram
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
US7319959B1 (en) Multi-source phoneme classification for noise-robust automatic speech recognition
US9940926B2 (en) Rapid speech recognition adaptation using acoustic input
CN104078039A (zh) 基于隐马尔科夫模型的家用服务机器人语音识别系统
Lippmann Speech perception by humans and machines
WO2014018004A1 (en) Feature normalization inputs to front end processing for automatic speech recognition
CN113192535B (zh) 一种语音关键词检索方法、系统和电子装置
CN107039035A (zh) 一种语音起始点和终止点的检测方法
CN104064196B (zh) 一种基于语音前端噪声消除的提高语音识别准确率的方法
Grozdić et al. Application of inverse filtering in enhancement of whisper recognition
CN116312561A (zh) 一种电力调度系统人员声纹识别鉴权降噪和语音增强方法、系统及装置
CN112216270B (zh) 语音音素的识别方法及系统、电子设备及存储介质
CN114664288A (zh) 一种语音识别方法、装置、设备及可存储介质
Boril et al. Front-End Compensation Methods for LVCSR Under Lombard Effect.
US8768695B2 (en) Channel normalization using recognition feedback
Thakur et al. Design of Hindi key word recognition system for home automation system using MFCC and DTW
Mehta et al. Robust front-end and back-end processing for feature extraction for Hindi speech recognition
Chandra Hindi vowel classification using QCN-PNCC features
Morales et al. Adding noise to improve noise robustness in speech recognition.
Ma et al. Context-dependent word duration modelling for robust speech recognition.
Park et al. HMM-based mask estimation for a speech recognition front-end using computational auditory scene analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19920428

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19920428

Country of ref document: EP

Kind code of ref document: A1