CN110189746A - Speech recognition method applied to air-ground communication - Google Patents
Speech recognition method applied to air-ground communication
- Publication number
- CN110189746A CN110189746A CN201910213205.0A CN201910213205A CN110189746A CN 110189746 A CN110189746 A CN 110189746A CN 201910213205 A CN201910213205 A CN 201910213205A CN 110189746 A CN110189746 A CN 110189746A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Abstract
The invention discloses a speech recognition method applied to air-ground communication, comprising: establishing a triphone acoustic model for air-ground calls; applying an improved maximum a posteriori (MAP) speech enhancement algorithm to the received air-ground speech signal to be recognized, enhancing the speech and removing background noise; feeding the processed signal into the air-ground triphone acoustic model for recognition, recognizing the command text and keyword text of the controller and the pilot, and raising an alarm when the recognized controller and pilot command texts are inconsistent; and checking the recognized keyword text with a keyword detection model, raising an alarm when a preset word is detected. The method can recognize and compare the voice commands exchanged between controllers and pilots, detect sensitive vocabulary with an alarm prompt, and improve the speech recognition rate.
Description
Technical field
The present invention relates to the field of air-ground communication, and in particular to a speech recognition method applied to air-ground communication.
Background technique
Air-ground communication mainly serves the calls between controllers and pilots and is a core element of ensuring flight safety. Because controllers work under high intensity and must keep their attention highly concentrated, they easily mishear speech when the call environment is poor, and may consequently issue wrong control orders that seriously affect flight safety. Air-ground speech recognition technology can automatically recognize the calls between controllers and pilots, monitor their behavior, and raise an alarm against the danger caused by false commands, greatly helping to guarantee flight safety.
Although air-ground speech recognition is an effective way of ensuring flight safety, most of civil aviation does not yet use speech recognition: the talking mode of air-ground calls is so particular in pronunciation and intonation that current general-purpose speech recognition cannot be applied directly. In addition, air-ground communication is affected by the surrounding environment and suffers noise interference during calls, which makes air-ground dialogue hard to recognize.
Existing general-purpose speech recognition technology is therefore not suitable for civil aviation. Because air-ground calls have their own particularities in pronunciation and grammar, a dedicated air-ground acoustic model must be built around their dialogue characteristics, pronunciation, and intonation; no speech recognition technology targeted at civil aviation is currently on the market.
Speech recognition trains an acoustic model from cleanly recorded speech, then matches the signal to be recognized, after the same processing, against the trained acoustic model to obtain the recognition result. If the air-ground speech signal is constantly interfered with by the external environment, it becomes mixed with many noise components. Such noisy speech not only impairs hearing, causing auditory fatigue and reduced attention in controllers and aircrew, but also distorts the signal: the speech feature parameters change, no longer match the acoustic model, and the final recognition result is wrong. The usual solution at present is to cascade a speech enhancement algorithm at the recognition front end to improve speech intelligibility; the flowchart is shown in Fig. 1.
Hidden Markov models (HMMs) are widely used in speech signal processing. An HMM can be described by θ = {A, B, M, O, π, F}, where A is a finite set of N states, B is the observation sequence set, M is the state transition probability, O is the output observation probability matrix, π is the initial probability sequence, and F is the final state sequence. HMM-based acoustic modeling first computes the probability that a known model outputs the observed sequence using the forward-backward and recursive algorithms, then calibrates the model with the Baum-Welch algorithm under the maximum likelihood criterion, and finally decodes with the Viterbi algorithm to obtain the recognition result. HMMs achieve a high recognition rate on small-vocabulary isolated-word recognition, but their robustness drops markedly on large-vocabulary continuous speech such as air-ground calls.
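The forward-probability computation mentioned above can be sketched with a short example. This is a generic, illustrative implementation rather than the patent's own code; the model sizes and probabilities below are made up, and log-space arithmetic is used for numerical stability.

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B_obs):
    """Forward algorithm: log P(observation sequence | HMM).

    log_pi:    (N,)   log initial-state probabilities
    log_A:     (N, N) log state-transition matrix
    log_B_obs: (T, N) log likelihood of each observed frame under each
               state (e.g. from the per-state GMMs of a GMM-HMM)
    """
    T, _ = log_B_obs.shape
    alpha = log_pi + log_B_obs[0]        # initialization with the first frame
    for t in range(1, T):                # recursion over frames
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B_obs[t]
    return np.logaddexp.reduce(alpha)    # termination: sum over final states
```

For a toy 2-state model the result can be checked against brute-force enumeration of all state paths, which is how the recursion is usually verified.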
Speech enhancement algorithms
Conventional methods:
Most general-purpose speech enhancement today uses improved spectral subtraction or Wiener filtering. Although these methods are simple in structure and easy to implement and can raise the signal-to-noise ratio of noisy speech, they often introduce other noise and distort the speech. They can effectively improve listening comfort for the human ear, but they are not suitable for a speech recognition front end.
Maximum a posteriori algorithm:
Compared with spectral subtraction and Wiener filtering, speech enhancement based on maximum a posteriori probability (Maximum a posteriori, MAP) can effectively remove background noise without introducing other noise interference. Assume the noisy signal is y(n) = x(n) + d(n), where x(n) is the clean speech signal and d(n) is the noise. After framing and applying a Hamming window, the Fourier transform (FFT) gives:
Y(k, τ) = X(k, τ) + D(k, τ)    (1)
where k is a frequency bin of frame τ.
Taking the speech-free segments of the signal as noise frames gives the noise power δ_d(k), from which the a posteriori SNR is computed:
γ(k, τ) = |Y(k, τ)|² / δ_d(k)    (2)
The a priori SNR of each frame is continually updated from the previous frame. Since the first frame has no previous frame to refer to, its a priori SNR is computed as:
ξ(k, 1) = a + (1 − a) max(γ(k, 1) − 1, 0)    (3)
where a is a constant, taken as 0.98.
From the second frame onward, the a priori SNR is computed as:
ξ(k, τ) = a |X̂(k, τ−1)|² / δ_d(k) + (1 − a) max(γ(k, τ) − 1, 0)    (4)
The MAP gain function G(k, τ) is obtained from the a priori and a posteriori SNRs, and the enhanced speech spectrum is finally:
X̂(k, τ) = G(k, τ) Y(k, τ)    (5)
Although spectral subtraction and Wiener filtering are simple to implement, they introduce excessive "musical noise": the SNR improves somewhat, but the actual auditory effect is not obvious, and at low SNR the speech processed by spectral subtraction or Wiener filtering can even sound worse. The MAP algorithm obtains its gain function mainly by computing the a priori and a posteriori SNRs, and estimation errors in these two quantities cause the amplitude of the enhanced speech signal to change.
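The decision-directed update described above can be sketched as follows. This is a generic illustration under stated assumptions: the patent's exact MAP gain expression is not reproduced in the text, so the standard Wiener-type gain ξ/(1+ξ) is used here as a stand-in, and the smoothing constant a = 0.98 follows the description.

```python
import numpy as np

def decision_directed_enhance(frames, noise_psd, a=0.98):
    """Spectral-gain enhancement with a decision-directed a priori SNR.

    frames:    (T, K) complex spectra Y(k, tau) of Hamming-windowed frames
    noise_psd: (K,)   noise power delta_d(k), estimated from speech-free frames
    The gain xi/(1+xi) is a Wiener-type stand-in for the MAP gain.
    """
    enhanced = np.empty_like(frames)
    prev_clean_pow = None
    for t in range(len(frames)):
        gamma = np.abs(frames[t]) ** 2 / noise_psd           # a posteriori SNR
        if prev_clean_pow is None:
            # first frame: no previous estimate to smooth with (common init)
            xi = a + (1.0 - a) * np.maximum(gamma - 1.0, 0.0)
        else:
            # decision-directed mix of last clean estimate and current SNR
            xi = a * prev_clean_pow / noise_psd + (1.0 - a) * np.maximum(gamma - 1.0, 0.0)
        gain = xi / (1.0 + xi)
        enhanced[t] = gain * frames[t]
        prev_clean_pow = np.abs(enhanced[t]) ** 2
    return enhanced
```

Because the gain lies in [0, 1), each frequency bin is attenuated and never amplified, which is what keeps the enhanced spectrum from drifting in amplitude.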
Summary of the invention
The present invention provides a speech recognition method applied to air-ground communication that can recognize and compare the voice commands between controllers and pilots, detect sensitive vocabulary with an alarm prompt, and improve the speech recognition rate.
To achieve the above object, this application provides a speech recognition method applied to air-ground communication, the method comprising:
establishing a triphone GMM-HMM acoustic model for air-ground calls;
adding an adaptive filter to the maximum a posteriori speech enhancement algorithm, and using the improved MAP speech enhancement algorithm to enhance the received air-ground speech signal to be recognized and remove background noise;
feeding the air-ground speech signal processed by the improved MAP speech enhancement algorithm into the triphone GMM-HMM acoustic model for recognition, recognizing the command text and keyword text of the controller and the pilot, and raising an alarm when the recognized controller and pilot command texts are inconsistent; and detecting the recognized keyword text with a keyword detection model, raising an alarm when a preset word is detected.
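The two alarm checks just described, command-text comparison and sensitive-word detection, can be sketched in a few lines. This is only a schematic illustration: the function name and the toy vocabulary are invented, and in the method these checks run on the texts produced by the two acoustic models.

```python
def check_transmissions(controller_text, pilot_text, sensitive_words):
    """Return the alarm prompts raised for one controller/pilot exchange."""
    alerts = []
    # Check 1: the pilot's read-back must match the controller's command.
    if controller_text.strip().lower() != pilot_text.strip().lower():
        alerts.append("command mismatch")
    # Check 2: scan both texts for preset sensitive vocabulary.
    combined = (controller_text + " " + pilot_text).lower()
    hits = [w for w in sensitive_words if w in combined]
    if hits:
        alerts.append("sensitive vocabulary: " + ", ".join(hits))
    return alerts
```

A real system would compare normalized command semantics rather than raw strings, but the alarm logic has this shape.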
Further, establishing the triphone GMM-HMM acoustic model for air-ground calls specifically includes:
collecting everyday air-ground call data at an airport;
performing feature extraction on the collected dialogue data and removing unneeded data;
labeling the audio data after feature extraction;
training the labeled audio data to obtain the triphone GMM-HMM acoustic model for air-ground calls.
Further, performing feature extraction on the collected dialogue data specifically includes:
extracting features with Mel-frequency cepstral coefficients. The dialogue audio signal is Fourier-transformed and its power spectrum computed:
E(k) = |X(k)|²    (6)
where E(k) is the speech power spectrum, X(k) is the speech spectrum, and k indexes the k-th spectral line;
the obtained speech power spectrum is passed through the Mel filter bank and weighted-summed:
S(m) = Σ_{k=0}^{L−1} E(k) H_m(k)    (7)
where S(m) is the value after the weighted sum, L is the number of spectral lines, H_m(k) is the band-pass filter of the m-th Mel filter, and M is the total number of Mel filters;
the logarithm is taken and a discrete cosine transform applied:
c(n) = Σ_{m=0}^{M−1} ln S(m) cos(πn(m + 0.5)/M)    (8)
where c(n) is the value after the discrete cosine transform and n indexes the n-th cepstral coefficient.
Further, labeling the audio data after feature extraction specifically includes:
selecting a context-dependent triphone GMM-HMM acoustic model for air-ground calls, and performing context clustering with a clustering algorithm to obtain clustered sets of tied states; force-aligning the text dictionary with the audio data, and obtaining the optimal path with the Viterbi beam algorithm to produce the optimal frame-level labels.
Further, establishing the triphone GMM-HMM acoustic model from the labeled audio data specifically includes: according to the call characteristics of air-ground communication, using different HMM topologies for silence and non-silence phonemes and randomly initializing the GMM parameters; after randomly adjusting and merging the Gaussian parameters, iterating to obtain the triphone GMM-HMM acoustic model for air-ground calls.
Further, the triphone GMM-HMM acoustic model for air-ground calls comprises a continuous speech acoustic model and a keyword acoustic model. The processed speech to be recognized is passed through the continuous speech acoustic model, recognized, and converted into text output, with an alarm prompted when the controller and pilot text commands are inconsistent; the keyword acoustic model detects whether preset sensitive words are present and, when a sensitive word is recognized, converts it into text output and prompts an alarm.
Further, an adaptive filter is added to the maximum a posteriori speech enhancement algorithm to correct the deviation of the gain function.
Further, the gain function of the adaptive filter is as follows.
Assume the noisy signal is y(n) = x(n) + d(n); after framing and applying a Hamming window, the Fourier transform (FFT) gives:
Y(k, τ) = X(k, τ) + D(k, τ)    (1)
where k is a frequency bin of frame τ, x(n) is the clean speech signal, d(n) is the noise, and n is the time index;
taking the speech-free segments as noise frames gives the noise power δ_d(k), from which the a posteriori SNR is computed:
γ(k, τ) = |Y(k, τ)|² / δ_d(k)    (2)
The a priori SNR of the first frame is computed as:
ξ(k, 1) = a + (1 − a) max(γ(k, 1) − 1, 0)    (3)
where a is a constant and γ is the a posteriori SNR;
from the second frame onward, the a priori SNR is computed as:
ξ(k, τ) = a |X̂(k, τ−1)|² / δ_d(k) + (1 − a) max(γ(k, τ) − 1, 0)    (4)
where X̂ is the estimated clean speech signal, δ_d(k) is the noise power, and ξ̂(k, τ) is the estimated SNR.
Substituting the adaptive filter gain of formula (9) into formulas (3) and (4) gives the improved a priori SNR, where G_w(k, τ) is the adaptive filter value at the current time and G_w(k, τ−1) is the adaptive filter value at the previous time.
According to the grammar, pronunciation characteristics, and noise environment of air-ground calls, the present invention provides a speech recognition method suited to the air-ground communication system. The method builds an acoustic model of air-ground call terminology, can recognize and compare the voice commands between controllers and pilots, and can detect sensitive vocabulary with an alarm prompt; combined with the noise environment of air-ground communication, an adaptive-filtering speech enhancement algorithm is provided to improve the recognition rate. The method is broadly divided into two parts: (1) a triphone GMM-HMM acoustic model is built according to the characteristics of air-ground calls, able to recognize and compare speech content and detect sensitive information; (2) considering the noise environment of air-ground calls, an adaptive filter is added to the MAP algorithm, and by continually optimizing its parameters the background noise is removed while the feature parameters of the enhanced speech signal are kept largely unchanged.
Combining the speech characteristics and noise environment of air-ground communication, the invention builds an air-ground call acoustic model. The recognition model can recognize and compare the speech content of the controller and the pilot and raise an alarm when the commands are inconsistent; through the keyword detection model, the system also alarms when a preset high-risk sensitive word is detected, safeguarding flight safety; and the adaptive filtering algorithm enhances the speech to be recognized, reduces the background noise it contains, and improves its intelligibility so that it achieves a higher recognition rate at the recognition end.
The one or more technical solutions provided by this application have at least the following technical effects or advantages:
The invention establishes an air-ground speech recognition model for flight safety. The method can recognize and compare whether the voice commands between controllers and pilots are consistent, and can also detect preset sensitive vocabulary with an alarm prompt, thereby improving flight safety.
The existing MAP algorithm is optimized by adding an adaptive filter to further improve its enhancement effect. The adaptive filter mainly works as follows: in the low-SNR region below −15 dB it improves intelligibility by introducing a corrected gain function, and in the region above 10 dB it limits the amplitude spectrum to reduce amplification distortion. This improves the recognition rate of air-ground calls and guarantees that the speech recognition system remains robust in adverse noise environments.
The invention is mainly applied in air-ground speech recognition systems and, compared with the prior art, is more effective at improving the recognition rate of air-ground calls and guaranteeing flight safety.
Description of the drawings
The drawings described here provide a further understanding of the embodiments of the invention and constitute a part of this application; they do not limit the embodiments of the invention.
Fig. 1 is a schematic flowchart of the prior-art method of improving speech intelligibility with a speech enhancement algorithm;
Fig. 2 is a schematic flowchart of the speech recognition algorithm in this application;
Fig. 3 is a schematic flowchart of the speech enhancement algorithm in this application.
Specific embodiments
To better understand the above objects, features, and advantages of the present invention, the invention is described in further detail below with reference to the drawings and specific embodiments. It should be noted that, where they do not conflict, the embodiments of this application and the features in the embodiments may be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the invention; however, the invention can also be implemented in other ways than those described here, and the protection scope of the invention is therefore not limited by the specific embodiments disclosed below.
The invention is divided into two parts: the recognition end and the enhancement end.
1 Recognition end
Fig. 2 is the flowchart of the speech recognizer in the embodiment of the invention; the detailed process is as follows:
(1) The data used to build the acoustic model of the invention take the everyday air-ground calls of a domestic airport as the template, and approach and tower controllers were engaged to record according to daily call rules. The male-to-female ratio is 2:1, the audio sample rate is 16 kHz, the sample precision is 16 bits, and the total recorded audio is 10 GB.
(2) Feature extraction. Since the collected data contain much redundant information, feature extraction is applied to the useful information to reduce unnecessary computation; this patent uses Mel-frequency cepstral coefficients. The signal is first Fourier-transformed and its power spectrum computed:
E(k) = |X(k)|²    (6)
which is then passed through the Mel filter bank and weighted-summed:
S(m) = Σ_{k=0}^{L−1} E(k) H_m(k)    (7)
and finally the logarithm is taken and a discrete cosine transform applied:
c(n) = Σ_{m=0}^{M−1} ln S(m) cos(πn(m + 0.5)/M)    (8)
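The three steps just listed (power spectrum, Mel filter bank, log-DCT) can be sketched for a single frame. This is a minimal, generic MFCC implementation rather than the patent's code; the filter count, cepstral order, and the simple triangular filter-bank construction are illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=13):
    # (6) power spectrum E(k) = |X(k)|^2 of one Hamming-windowed frame
    spec = np.fft.rfft(frame * np.hamming(len(frame)))
    E = np.abs(spec) ** 2
    K = len(E)
    # triangular Mel filter bank H_m(k), filters equally spaced on the Mel scale
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2))
    bins = np.floor((len(frame) + 1) * pts / sr).astype(int)
    H = np.zeros((n_filters, K))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            H[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            H[m - 1, k] = (hi - k) / max(hi - c, 1)
    # (7) filter-bank energies S(m) as a weighted sum over spectral lines
    S = H @ E
    # (8) log followed by a DCT gives the cepstral coefficients c(n)
    logS = np.log(S + 1e-10)
    n = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    return (np.cos(np.pi * n * (m + 0.5) / n_filters) * logS).sum(axis=1)
```

In practice each utterance is framed (e.g. 25 ms frames with 10 ms shift) and this computation is applied per frame.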
(3) Audio data labeling. In large-vocabulary continuous speech recognition, identical pronunciations can carry different word meanings, and the current phoneme is influenced by the phonemes before and after it, so the feature parameters across continuous speech cannot be computed well. A context-dependent phoneme model is therefore generally chosen, and context clustering is done with a clustering algorithm to obtain clustered sets of tied states. The text dictionary is first force-aligned with the audio data, the optimal path is obtained with the Viterbi beam algorithm, and finally the optimal frame-level labels are obtained.
(4) Building the triphone GMM-HMM acoustic model for air-ground calls. According to the call characteristics of air-ground communication, different HMM topologies are used for silence and non-silence phonemes, and the GMM parameters are randomly initialized. After the Gaussian parameters are randomly adjusted and merged, iteration yields the final triphone GMM-HMM acoustic model.
In Fig. 2, acoustic model 1 is the continuous speech acoustic model and acoustic model 2 is the keyword acoustic model. The processed speech to be recognized is recognized by acoustic model 1 and converted into text output, with an alarm prompted when the controller and pilot text commands are inconsistent; acoustic model 2 detects whether preset sensitive words are present and, when a sensitive word is recognized, converts it into text output and prompts an alarm.
2 Enhancement end
Fig. 3 is the flowchart of the speech enhancement algorithm in the embodiment of the invention; the invention removes background noise and improves speech intelligibility mainly by adding an adaptive filter.
An adaptive filter is added to correct the deviation of the gain function. As can be seen from formula (4), the a priori SNR of the next frame is updated from the previous frame. Since the currently computed a priori SNR is not very accurate, the estimate of the next frame's a priori SNR derived from it may be too large or too small, degrading the speech enhancement performance. To handle this, the invention adds an adaptive filter into formulas (3) and (4) to adjust the estimation range of the a priori SNR over different SNR intervals.
The gain function of the adaptive filter was determined through simulation, verification, and engineering debugging, and is as shown below.
The gain function of the adaptive filter adjusts three different SNR intervals. When the SNR computed at frequency bin k of frame τ is below −15 dB, the bin is considered to be mainly noise, and a correction deviation is introduced to remove the noise interference. When the SNR is above 10 dB, the speech component in the signal far exceeds the noise; a threshold of 0.8 is set so that no excessive gain compensation is introduced and the output amplitude of the signal does not change greatly. When the SNR lies between −15 and 10 dB, the energies of speech and noise are relatively ambiguous, and the adaptive filter is needed to further distinguish the noise component in the signal; a threshold is therefore added to the gain function in this interval to prevent its value from falling below it. Repeated simulation experiments show the effect is best when the threshold is taken as 0.8.
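The three SNR regions and the 0.8 threshold can be sketched as a correction applied to a base gain value. The exact G_w expression appears only in the patent's figures and is not reproduced in the text, so this is merely an interpretation of the described behavior; the function name and the hard 0.0/1.0 choices are assumptions.

```python
def corrected_gain(base_gain, snr_db, floor=0.8):
    """Region-dependent correction of a spectral gain value, following the
    three SNR intervals described in the text (an interpretation, not the
    patent's exact G_w formula)."""
    if snr_db < -15.0:
        return 0.0                   # noise-dominated bin: suppress it
    if snr_db > 10.0:
        return min(base_gain, 1.0)   # speech-dominated: no extra compensation
    return max(base_gain, floor)     # ambiguous region: floor the gain at 0.8
```

Applied per frequency bin, this keeps low-SNR bins from leaking noise while leaving high-SNR bins essentially untouched in amplitude.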
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to include them.
Claims (8)
1. A speech recognition method applied to ground-air communication, characterized in that the method includes:
establishing a ground-air communication triphone GMM-HMM acoustic model;
adding an adaptive filter to a maximum a posteriori probability speech enhancement algorithm, and performing speech enhancement and background-noise removal on the received ground-air communication voice signal to be recognized by means of the improved maximum a posteriori probability speech enhancement algorithm;
inputting the ground-air communication voice signal to be recognized, processed by the improved maximum a posteriori probability speech enhancement algorithm, into the ground-air communication triphone GMM-HMM acoustic model for recognition, recognizing the voice command text and keyword text of the controller and the pilot, and issuing an alarm prompt when the recognized voice command texts of the controller and the pilot are inconsistent; detecting the recognized keyword text with a keyword detection model, and issuing an alarm prompt when a preset word is detected.
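The flow of claim 1 (enhance, decode, compare commands, check keywords) can be outlined as below; `enhance` and `recognize` are hypothetical stand-ins for the patent's improved MAP enhancement and triphone GMM-HMM decoder, not real implementations:

```python
from typing import Callable, List

def monitor_exchange(controller_audio: str, pilot_audio: str,
                     enhance: Callable, recognize: Callable,
                     keywords: List[str]) -> List[str]:
    """Hypothetical sketch of the claim-1 flow: denoise each channel,
    decode it to text, then raise alarms on a controller/pilot command
    mismatch or on preset sensitive words."""
    alarms: List[str] = []
    ctrl_text = recognize(enhance(controller_audio))
    pilot_text = recognize(enhance(pilot_audio))
    if ctrl_text != pilot_text:            # read-back does not match the command
        alarms.append("command mismatch")
    for kw in keywords:                    # keyword-detection stage
        if kw in ctrl_text or kw in pilot_text:
            alarms.append("keyword: " + kw)
    return alarms
```

With identity stand-ins for both stages, "climb to 3000" against a read-back of "climb to 300" would raise only the mismatch alarm.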
2. The speech recognition method applied to ground-air communication according to claim 1, characterized in that establishing the ground-air communication triphone GMM-HMM acoustic model specifically includes:
collecting everyday dialogue data of airport ground-air communication;
performing feature extraction on the collected dialogue data and removing unwanted data;
labeling the audio data after feature extraction;
training the labeled audio data to obtain the ground-air communication triphone GMM-HMM acoustic model.
3. The speech recognition method applied to ground-air communication according to claim 2, characterized in that performing feature extraction on the collected dialogue data specifically includes:
performing feature extraction using Mel-frequency cepstral coefficients: applying a Fourier transform to the dialogue audio signal and then computing the power spectrum of the signal:
E(k) = |X(k)|² (6)
wherein E(k) is the power spectrum of the voice signal, X(k) is the transformed voice signal, and k is the k-th spectral line;
passing the obtained speech power spectrum through the Mel filter bank by weighted summation:
wherein S(m) is the value after weighted summation, L is the number of spectral lines, Hm(k) is the band-pass filter, m is the m-th Mel filter, and M is the total number of Mel filters;
taking the logarithm and then applying a discrete cosine transform:
wherein c(n) is the value after the discrete cosine transform, and n is the n-th spectral line after the discrete cosine transform.
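The chain in claim 3 (FFT → power spectrum → Mel filter-bank weighting → log → DCT) can be sketched as follows; the triangular filter-bank construction and all parameter values here are common textbook choices, not specified by the patent:

```python
import numpy as np

def mfcc_frame(frame, sample_rate=8000, n_mels=26, n_ceps=13):
    """Sketch of the claim-3 feature chain for one windowed frame.
    Filter-bank design and parameter defaults are illustrative."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2          # E(k) = |X(k)|^2
    # Triangular Mel filters Hm(k) spanning 0 .. Nyquist (textbook design).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, len(power)))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):                       # rising slope
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):                       # falling slope
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    s = fbank @ power                                # S(m): weighted sum
    logs = np.log(s + 1e-10)                         # log energies
    # DCT-II of the log filter-bank energies gives the cepstra c(n).
    n = np.arange(n_mels)
    return np.array([np.sum(logs * np.cos(np.pi * q * (2 * n + 1) / (2 * n_mels)))
                     for q in range(n_ceps)])
```

A 512-sample frame at 8 kHz yields a 13-dimensional cepstral vector.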
4. The speech recognition method applied to ground-air communication according to claim 2, characterized in that labeling the audio data after feature extraction specifically includes:
selecting a context-dependent ground-air communication triphone GMM-HMM acoustic model and performing context clustering with a clustering algorithm to obtain clustered sets of particular states; force-aligning the text dictionary with the audio data, obtaining the optimal path by the Viterbi beam algorithm, and thereby obtaining frame-level labels.
5. The speech recognition method applied to ground-air communication according to claim 2, characterized in that establishing the ground-air communication triphone GMM-HMM acoustic model from the labeled audio data specifically includes: according to the dialogue characteristics of ground-air communication, adopting different HMM topologies for silent and non-silent phonemes, and randomly initializing the parameters of the GMM; after randomly adjusting and integrating the Gaussian parameters, iterating to obtain the ground-air communication triphone GMM-HMM acoustic model.
6. The speech recognition method applied to ground-air communication according to claim 1, characterized in that the ground-air communication triphone GMM-HMM acoustic model includes a continuous-speech acoustic model and a keyword acoustic model; after processing, the speech to be recognized is recognized by the continuous-speech acoustic model and converted into text output, and a prompt alarm is issued when the controller's and pilot's text commands are inconsistent; the keyword acoustic model detects whether preset sensitive words are contained, and when a sensitive word is recognized, it is converted into text output and a prompt alarm is issued.
7. The speech recognition method applied to ground-air communication according to claim 1, characterized in that an adaptive filter is added to the maximum a posteriori probability speech enhancement algorithm to correct the bias of the gain function.
8. The speech recognition method applied to ground-air communication according to claim 7, characterized in that the gain function of the adaptive filter is given by the following formula:
Assuming the signal is y(n) = x(n) + d(n), after framing and applying a Hamming window, the Fourier transform (FFT) yields:
Y(k, τ) = X(k, τ) + D(k) (1)
wherein k is the frequency bin of frame τ, x(n) is the clean speech signal, d(n) is the noise, and n is the time index;
taking the non-speech segments of the signal as noise frames, the noise power δd is obtained, and the a posteriori SNR is then calculated:
the a priori SNR of the first frame is calculated by the following formula:
wherein a is a constant and γ is the a posteriori SNR;
when the signal proceeds to the second frame, the a priori SNR is calculated as:
wherein X̂ is the estimated clean speech signal; δd(k) is the noise power; SNR(k, τ) is the estimated SNR;
substituting formula (9) into formulas (3) and (4), the improved a priori SNR is obtained:
wherein Gw(k, τ) is the adaptive filter value at the current moment and Gw(k, τ−1) is the adaptive filter value at the previous moment.
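The first- and second-frame updates in claim 8 read like the standard decision-directed a priori SNR estimator; under that assumption, a minimal sketch (the default a = 0.98 is a typical textbook value, not taken from the patent):

```python
import numpy as np

def decision_directed_snr(noisy_power, noise_power, prev_clean_power, a=0.98):
    """Sketch of the claim-8 SNR estimates, assuming the decision-directed form:
    a posteriori SNR  gamma = |Y|^2 / delta_d
    a priori SNR      xi = a * |X_hat_prev|^2 / delta_d + (1 - a) * max(gamma - 1, 0)
    All array arguments are per-frequency-bin power values for one frame."""
    gamma = noisy_power / noise_power                     # a posteriori SNR
    xi = (a * prev_clean_power / noise_power
          + (1.0 - a) * np.maximum(gamma - 1.0, 0.0))     # a priori SNR
    return gamma, xi
```

For a bin with noisy power 4, noise power 1, previous clean estimate 2, and a = 0.5, this gives gamma = 4 and xi = 0.5·2 + 0.5·3 = 2.5.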
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910213205.0A CN110189746B (en) | 2019-03-20 | 2019-03-20 | Voice recognition method applied to ground-air communication |
PCT/CN2019/111789 WO2020186742A1 (en) | 2019-03-20 | 2019-10-18 | Voice recognition method applied to ground-air communication |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910213205.0A CN110189746B (en) | 2019-03-20 | 2019-03-20 | Voice recognition method applied to ground-air communication |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110189746A true CN110189746A (en) | 2019-08-30 |
CN110189746B CN110189746B (en) | 2021-06-11 |
Family
ID=67713727
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910213205.0A Active CN110189746B (en) | 2019-03-20 | 2019-03-20 | Voice recognition method applied to ground-air communication |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110189746B (en) |
WO (1) | WO2020186742A1 (en) |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101916565A (en) * | 2010-06-24 | 2010-12-15 | 北京华安天诚科技有限公司 | Voice recognition method and voice recognition device in air traffic control system |
CN102074246B (en) * | 2011-01-05 | 2012-12-19 | 瑞声声学科技(深圳)有限公司 | Dual-microphone based speech enhancement device and method |
US20160077523A1 (en) * | 2013-07-22 | 2016-03-17 | Sikorsky Aircraft Corporation | System for controlling and communicating with aircraft |
FR3010809B1 (en) * | 2013-09-18 | 2017-05-19 | Airbus Operations Sas | METHOD AND DEVICE FOR AUTOMATIC MANAGEMENT ON BOARD AN AIRCRAFT AUDIO MESSAGE AIRCRAFT. |
US20150162001A1 (en) * | 2013-12-10 | 2015-06-11 | Honeywell International Inc. | System and method for textually and graphically presenting air traffic control voice information |
CN106297796A (en) * | 2016-03-25 | 2017-01-04 | 李克军 | A kind of pilot rehearses monitoring method and device |
CN106875948B (en) * | 2017-02-22 | 2019-10-29 | 中国电子科技集团公司第二十八研究所 | A kind of collision alert method based on control voice |
CN108986791B (en) * | 2018-08-10 | 2021-01-05 | 南京航空航天大学 | Chinese and English language voice recognition method and system for civil aviation air-land communication field |
CN109119072A (en) * | 2018-09-28 | 2019-01-01 | 中国民航大学 | Civil aviaton's land sky call acoustic model construction method based on DNN-HMM |
CN109087657B (en) * | 2018-10-17 | 2021-09-14 | 成都天奥信息科技有限公司 | Voice enhancement method applied to ultra-short wave radio station |
CN110189746B (en) * | 2019-03-20 | 2021-06-11 | 成都天奥信息科技有限公司 | Voice recognition method applied to ground-air communication |
2019
- 2019-03-20 CN CN201910213205.0A patent/CN110189746B/en active Active
- 2019-10-18 WO PCT/CN2019/111789 patent/WO2020186742A1/en active Application Filing
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020186742A1 (en) * | 2019-03-20 | 2020-09-24 | 成都天奥信息科技有限公司 | Voice recognition method applied to ground-air communication |
CN110689906A (en) * | 2019-11-05 | 2020-01-14 | 江苏网进科技股份有限公司 | Law enforcement detection method and system based on voice processing technology |
CN112309403A (en) * | 2020-03-05 | 2021-02-02 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN111667830A (en) * | 2020-06-08 | 2020-09-15 | 中国民航大学 | Airport control decision support system and method based on controller instruction semantic recognition |
WO2021249284A1 (en) * | 2020-06-08 | 2021-12-16 | 中国民航大学 | Airport control decision support system and method based on semantic recognition of controller instruction |
CN111667830B (en) * | 2020-06-08 | 2022-04-29 | 中国民航大学 | Airport control decision support system and method based on controller instruction semantic recognition |
CN113129919A (en) * | 2021-04-17 | 2021-07-16 | 上海麦图信息科技有限公司 | Air control voice noise reduction method based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN110189746B (en) | 2021-06-11 |
WO2020186742A1 (en) | 2020-09-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110189746A (en) | A kind of method for recognizing speech applied to earth-space communication | |
Mitra et al. | Medium-duration modulation cepstral feature for robust speech recognition | |
CN108922541A (en) | Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model | |
WO1998038632A1 (en) | Method and system for establishing handset-dependent normalizing models for speaker recognition | |
CN106023986B (en) | A kind of audio recognition method based on sound effect mode detection | |
CN107039035A (en) | A kind of detection method of voice starting point and ending point | |
CN111489763B (en) | GMM model-based speaker recognition self-adaption method in complex environment | |
Venkatesan et al. | Binaural classification-based speech segregation and robust speaker recognition system | |
Maganti et al. | Unsupervised speech/non-speech detection for automatic speech recognition in meeting rooms | |
Kamble et al. | Emotion recognition for instantaneous Marathi spoken words | |
Zhu et al. | Log-energy dynamic range normalization for robust speech recognition | |
Wang et al. | Robust speech recognition from ratio masks | |
Mehta et al. | Robust front-end and back-end processing for feature extraction for Hindi speech recognition | |
Singh et al. | A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters | |
Chen et al. | Robust MFCCs derived from differentiated power spectrum | |
CN107039046B (en) | Voice sound effect mode detection method based on feature fusion | |
Singh et al. | A novel algorithm using MFCC and ERB gammatone filters in speech recognition | |
Shahrul Azmi et al. | Noise robustness of Spectrum Delta (SpD) features in Malay vowel recognition | |
CN106448680B (en) | A kind of missing data feature method for distinguishing speek person using perception auditory scene analysis | |
Morales et al. | Adding noise to improve noise robustness in speech recognition. | |
Chandra | Hindi vowel classification using QCN-PNCC features | |
Sailaja et al. | Text independent speaker identification with finite multivariate generalized gaussian mixture model and hierarchical clustering algorithm | |
Seyedin et al. | A new subband-weighted MVDR-based front-end for robust speech recognition | |
Fukuda et al. | Phone-duration-dependent long-term dynamic features for a stochastic model-based voice activity detection. | |
Das et al. | Integrating denoising autoencoder and vector Taylor series with auditory masking for speech recognition in noisy conditions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||