CN112420018A - Language identification method suitable for low signal-to-noise ratio environment


Info

Publication number
CN112420018A
CN112420018A
Authority
CN
China
Prior art keywords
signal
sampling
amplitude spectrum
voice
low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011154863.6A
Other languages
Chinese (zh)
Inventor
邵玉斌
刘晶
龙华
杜庆治
李一民
杨贵安
唐维康
陈亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202011154863.6A priority Critical patent/CN112420018A/en
Publication of CN112420018A publication Critical patent/CN112420018A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a language identification method suitable for low signal-to-noise ratio environments, addressing the low identification rates obtained at low SNR; it belongs to the field of speech recognition. The method processes noisy speech with filtering and resampling before extracting the amplitude spectrum. It mainly comprises the following parts: first, a low-pass filter removes the high-frequency part; the signal is then resampled; the resampled signal is preprocessed, its amplitude spectrum is extracted, and the amplitude spectrum is resampled again to obtain the sampled amplitude-spectrum feature for low SNR conditions. The extracted features are input to a training model to obtain the corresponding language model, and the trained model is mounted on a server; speech to be recognized is collected at a client and sent to the server, where its features are extracted and scored against the trained language models, and the recognition result is returned to the client. Experimental tests show that applying the speech features extracted by this method to language identification improves the overall identification accuracy, and the identification speed is very high.

Description

Language identification method suitable for low signal-to-noise ratio environment
Technical Field
The invention relates to a language identification method in a low signal-to-noise ratio environment, belonging to the field of speech recognition.
Background
In recent years, awareness of cross-border communication has grown steadily, and as more and more countries join Silk Road cooperation, the benefits of international cooperation attract still more countries into cross-border collaboration, which raises many problems. The most pressing one is the language barrier, which prevents effective communication and cooperation. Although machine translation already performs well, the absence of language identification at the front end forces manual switching of the translation target, so language identification is an important research problem; identification at low signal-to-noise ratio in particular has long been a difficult open topic. The introduction of machine learning has advanced the technology considerably, but recognition at low SNR remains limited and worldwide deployment is still an open question, so language recognition at low SNR needs further improvement.
Disclosure of Invention
The technical problem to be solved by the invention is the extraction of effective features in a low signal-to-noise ratio environment. The invention introduces a filter at the front end to remove the high-frequency information and obtain the low-frequency part of the signal, then samples the low-frequency signal at intervals of A points in the time domain. The sampled signal is pre-emphasized, amplitude-normalized and framed; each frame undergoes FFT (fast Fourier transform), modulus taking, smoothing, logarithm taking and IFFT (inverse fast Fourier transform), from which the vocal-tract impulse cepstrum sequence and the vocal-tract impulse response spectrum are constructed to obtain the amplitude spectrum; the amplitude spectrum is then sampled according to the Nyquist sampling theorem to obtain the sampled amplitude spectrum. Finally, the sampled amplitude-spectrum features of each language are input to the training model to train the corresponding language model, and the trained models are mounted on the server side. Speech to be recognized is collected by the client and sent to the server, where it is low-pass filtered and resampled; the sampled amplitude-spectrum features are then extracted and scored against the trained language models, and the recognition result is returned to the client web page. The algorithm has been implemented in simulation software for feature extraction and identification and achieves a good recognition effect. To solve the technical problem, the invention adopts the following technical scheme: a language identification method in a low signal-to-noise ratio environment, comprising the following steps:
S1, filtering
The principle is as follows:
Observation and statistics of spectrograms show that, for noisy audio, only the energy information of the low-frequency part remains visible: most of the high-frequency energy is covered by noise. Since most of the signal energy is concentrated in the low-frequency part while the noise energy is concentrated mainly in the high-frequency part, filtering out the high-frequency part reduces part of the noise interference and improves the signal-to-noise ratio relative to the original speech.
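As an illustration of this filtering step, here is a minimal sketch using a low-pass Butterworth filter (the filter type named in the embodiment below), assuming the embodiment's 8000 Hz sampling rate and 1000 Hz cutoff; the filter order is an assumed value, not taken from the patent.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_filter(x, fs=8000, cutoff=1000, order=6):
    # Normalized cutoff: scipy expects a fraction of the Nyquist frequency.
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    # Zero-phase filtering keeps the retained low-frequency band undistorted.
    return filtfilt(b, a, x)
```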
S2, time-domain interval sampling
The principle is as follows:
Using a resampling technique, the time-domain signal is sampled at intervals of A points, so that the high-frequency part of the filtered signal is folded onto the low-frequency part. Because the superposed spectra of the speech information are unequal while the superposed spectra of the noise are equal, the average signal-to-noise ratio formula shows that the SNR is improved over its value before sampling. Noisy speech is defined as x(n) = s(n) + w(n), and the signal-to-noise ratio is given by

SNR = 10·log10( Σ_{n=1..H} s^2(n) / Σ_{n=1..H} w^2(n) )   (1)

where Σ_{n=1..H} s^2(n) is the signal energy, Σ_{n=1..H} w^2(n) is the white-noise energy, s(n) is the original speech, w(n) is zero-mean white Gaussian noise, and H is the total number of sampling points of the whole utterance.
The interval sampling formula is defined as

x'(n) = x(An), 1 ≤ n ≤ ⌊H/A⌋   (2)

where x'(n) is the signal sampled at intervals of A points and ⌊H/A⌋, the integer part of H/A, is the total number of retained sampling points.
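A minimal sketch of the average-SNR computation of equation (1) and the A-point interval sampling of equation (2); the 0-based slicing convention is an assumption, since the patent writes its formulas 1-based.

```python
import numpy as np

def snr_db(s, w):
    # Equation (1): average SNR of x(n) = s(n) + w(n) over all H samples.
    return 10 * np.log10(np.sum(s ** 2) / np.sum(w ** 2))

def interval_sample(x, A):
    # Equation (2): keep every A-th sample; floor(H / A) points remain, and
    # the filtered signal's high-frequency band folds onto the low band.
    return x[::A]
```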
S3, extracting the sampled amplitude spectrum
The extraction comprises pre-emphasis, amplitude normalization, framing, FFT (fast Fourier transform), modulus taking, smoothing, logarithm taking, IFFT (inverse fast Fourier transform), construction of the vocal-tract impulse cepstrum sequence, the vocal-tract impulse response spectrum, and interval sampling.
S3.1, Pre-emphasis
To avoid losing signal information in the FFT, pre-emphasis is applied to boost the energy of the high-frequency part, which also facilitates transmission in a channel; the signal obtained after pre-emphasis is x''(n).
S3.2 amplitude normalization
The amplitude normalization formula is as follows:
z(n) = x''(n) / max(x''(n))   (3)

where z(n) is the normalized signal and max(x''(n)) is the maximum value of the pre-emphasized signal.
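A sketch of the pre-emphasis and amplitude-normalization steps; the first-order pre-emphasis coefficient 0.97 is a conventional choice, not a value given in the patent, and the absolute value in the normalizer is an assumption to guard against negative peaks.

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    # First-order pre-emphasis x''(n) = x'(n) - alpha * x'(n - 1),
    # boosting the high-frequency energy before the FFT.
    return np.append(x[0], x[1:] - alpha * x[:-1])

def normalize(x):
    # Equation (3): z(n) = x''(n) / max(x''(n)).
    return x / np.max(np.abs(x))
```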
S3.3, framing
Because a speech signal is short-time stationary, the normalized signal z(n) is divided into a number of frames; adjacent frames overlap so that the transition between frames is smooth and continuity is preserved. The signal of the i-th frame after framing is z^(i)(n), with frame length E, frame shift K, and F frames in total.
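A sketch of the framing step, using E = 256 and K = 128 as in the embodiment described later:

```python
import numpy as np

def frame_signal(z, E=256, K=128):
    # Split z(n) into F overlapping frames z^(i)(n): frame length E,
    # frame shift K, so adjacent frames overlap by E - K samples.
    F = 1 + (len(z) - E) // K
    return np.stack([z[i * K : i * K + E] for i in range(F)])
```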
S3.4, FFT transformation
The fast Fourier transform converts the time-domain data z^(i)(n) of each frame to frequency-domain data:

z^(i)(k) = FFT[z^(i)(n)], 1 ≤ i ≤ F, 1 ≤ n ≤ E, 1 ≤ k ≤ E   (4)

where z^(i)(k) is the signal after the fast Fourier transform.
S3.5, taking a modulus value
The modulus of each data point of z^(i)(k) is taken, giving |z^(i)(k)|.
S3.6, smoothing
|z^(i)(k)| is smoothed to reduce the noise in the speech, suppressing the noise of the target speech while preserving the speech detail as far as possible. The neighbourhood size is directly related to the smoothing effect: the larger the neighbourhood, the stronger the smoothing, but a neighbourhood that is too large loses edge information and blurs the output speech, so the neighbourhood size must be chosen reasonably. The smoothed signal is y^(i)(k).
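A sketch of steps S3.4 to S3.6 (FFT, modulus, smoothing); the Savitzky-Golay filter with a quadratic fit follows the embodiment below, while the window length is an assumed neighbourhood size to be tuned as discussed above.

```python
import numpy as np
from scipy.signal import savgol_filter

def smoothed_magnitude(frames, window=11, polyorder=2):
    # S3.4: FFT each frame, giving z^(i)(k).
    Z = np.fft.fft(frames, axis=1)
    # S3.5: take the modulus |z^(i)(k)|.
    mag = np.abs(Z)
    # S3.6: Savitzky-Golay smoothing (quadratic polynomial fit per window);
    # `window` is the neighbourhood size whose choice is discussed above.
    return savgol_filter(mag, window, polyorder, axis=1)
```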
S3.7, taking the logarithm
The logarithm of the smoothed signal y^(i)(k) is taken:

s^(i)(k) = log(y^(i)(k))   (5)

where s^(i)(k) is the logarithmic signal.
S3.8, IFFT transformation
The inverse Fourier transform of the logarithmic signal gives the cepstrum; after cepstral analysis, the glottal excitation pulses and the vocal-tract impulse response are easy to separate because they lie in different intervals of the cepstrum:

c^(i)(n) = FT^(-1)[s^(i)(k)]   (6)

where c^(i)(n) is the inverse Fourier transform of the log-magnitude spectrum s^(i)(k).
S3.9, constructing the vocal-tract impulse cepstrum sequence
The vocal-tract impulse cepstrum sequence g^(i)(n) is constructed from c^(i)(n).
S3.10, amplitude spectrum
The vocal-tract impulse cepstrum sequence g^(i)(n) is fast-Fourier-transformed and the real part is taken to obtain the amplitude spectrum:

g^(i)(k) = FFT[g^(i)(n)]   (7)

where g^(i)(k) is the Fourier-transformed spectrum; taking the real part of g^(i)(k) gives the amplitude spectrum r^(i)(k).
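A sketch of steps S3.7 to S3.10. The patent does not state how g^(i)(n) is constructed from c^(i)(n), so low-quefrency liftering with an assumed cutoff L is used here as a standard way to isolate the vocal-tract component; L is a hypothetical parameter.

```python
import numpy as np

def vocal_tract_amplitude_spectrum(y_smooth, L=30):
    # S3.7: logarithm; the small epsilon guards against log(0).
    s = np.log(y_smooth + 1e-12)
    # S3.8: inverse FFT gives the cepstrum c^(i)(n).
    c = np.fft.ifft(s, axis=1).real
    # S3.9: build the vocal-tract impulse cepstrum sequence g^(i)(n) by
    # keeping only the low-quefrency region (assumed cutoff L).
    g = np.zeros_like(c)
    g[:, :L] = c[:, :L]
    g[:, -(L - 1):] = c[:, -(L - 1):]  # symmetric high-index counterpart
    # S3.10: FFT and real part give the amplitude spectrum r^(i)(k).
    return np.fft.fft(g, axis=1).real
```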
S3.11, interval sampling
Since the amplitude spectrum is bilaterally symmetric, only the first half is sampled. The amplitude spectrum is sampled at intervals of B points in accordance with the Nyquist sampling theorem, mainly to reduce the data volume without destroying the speech information, thereby increasing the processing and recognition speed:

y^(i) = [r^(i)(1), r^(i)(B), r^(i)(2B), r^(i)(3B), ..., r^(i)(D)]^T   (8)

where y^(i) is the sampled amplitude spectrum of the i-th frame and D is the index of the last sampled point of r^(i)(k).
The sampled amplitude spectra of all frames are fused to form the fused amplitude-spectrum feature matrix:

Y = [y^(1) y^(2) y^(3) ... y^(i) ... y^(F)]   (9)

where Y is the sampled amplitude-spectrum matrix of the speech segment.
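A sketch of the interval sampling of equation (8) and the fusion into the matrix Y of equation (9); the exact sample positions are an assumption, since the patent's index sequence r^(i)(1), r^(i)(B), r^(i)(2B), ... mixes 1-based conventions.

```python
import numpy as np

def sampled_amplitude_features(r, B=6):
    # The amplitude spectrum is symmetric, so only the first half is used.
    half = r[:, : r.shape[1] // 2]
    # Equation (8): take every B-th point of each frame's half-spectrum.
    y = half[:, ::B]
    # Equation (9): columns y^(1) ... y^(F) form the feature matrix Y.
    return y.T
```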
S4, generating a training model
Referring to FIG. 1, the extracted sampled amplitude spectrum of each language is input to the training model for training to obtain the corresponding language model.
S5, language identification
The trained language models are mounted on a server. Speech to be recognized is collected at the client and input to the server, where it is filtered and resampled; its sampled amplitude-spectrum features are extracted and scored against the trained language models, and the recognition result is output and returned to the client web page.
Drawings
FIG. 1 is a diagram of language training and recognition
FIG. 2 shows local waveforms at different signal-to-noise ratios
FIG. 3 shows the waveform and spectrogram before and after filtering
FIG. 4 shows the waveform and spectrogram of the filtered signal after sampling
FIG. 5 is a flow chart of the spectral feature extraction
FIG. 6 is the amplitude spectrum of one frame
FIG. 7 is the sampled amplitude spectrum
FIG. 8 is a diagram of training the language models with the GMM-UBM model
FIG. 9 is a diagram of language identification between the server and the client
FIG. 10 shows the recognition result returned to the client
Detailed Description
The invention will be further described by means of specific embodiments in conjunction with the accompanying drawings.
S1, test audio data acquisition:
The corpus comes from China Radio International and comprises five languages: Chinese, Tibetan, Uyghur, English and Kazakh. All five languages are recorded as monaural audio files at an 8000 Hz sampling rate, each 10 seconds long.
S2, noise-containing speech generation
Referring to FIG. 2, the waveforms show local speech waveforms at different signal-to-noise ratios. As the SNR decreases, a larger area of the speech waveform is submerged in white noise; at SNR = -5 dB essentially only the locally strong waveform segments stand out, so identification at low SNR is very difficult.
S3, filtering
Referring to FIG. 3, which shows the speech waveform and spectrogram before and after filtering, a Butterworth filter removes the high-frequency part above 1000 Hz and retains the low-frequency information. This reduces part of the noise interference and improves the signal-to-noise ratio by about 7 dB over the original speech.
S4, time-domain interval sampling
Referring to FIG. 4, which shows the speech waveform and spectrum before and after sampling, the time-domain signal is resampled at intervals of 8 points so that the high-frequency part of the filtered signal is folded onto the low-frequency part. Because the superposed spectra of the speech information are unequal while the superposed spectra of the noise are equal, the average signal-to-noise ratio formula shows an improvement of about 5 dB over the SNR before sampling.
The interval sampling formula is

x'(n) = [x(1), x(8), x(16), ..., x(N)], 1 ≤ n ≤ 10000, N = 80000   (10)

where x'(n) is the signal after interval sampling.
S5, extracting the sampled amplitude spectrum
Referring to FIG. 5, the extraction steps are pre-emphasis, amplitude normalization, framing, FFT, modulus taking, smoothing, logarithm taking, IFFT, construction of the vocal-tract impulse cepstrum sequence, amplitude spectrum, and interval sampling.
S5.1, pre-emphasis
S5.2, amplitude normalization
S5.3, framing
The normalized signal z(n) is framed with frame length E = 256 and frame shift K = 128.
S5.4, FFT transformation
S5.5, taking the modulus
S5.6, smoothing
This embodiment uses a Savitzky-Golay filter, which smooths by fitting a quadratic polynomial over each window.
S5.7, taking the logarithm
S5.8, IFFT transformation
S5.9, constructing the vocal-tract impulse cepstrum sequence
S5.10, amplitude spectrum
Referring to FIG. 6, the amplitude spectrum is obtained.
S5.11, interval sampling
Referring to FIG. 7, the sampled amplitude spectrum. Since the amplitude spectrum is bilaterally symmetric, only the first half is sampled. The amplitude spectrum is sampled at intervals of 6 points in accordance with the Nyquist sampling theorem, mainly to reduce the data volume without destroying the speech information:

y^(i) = [r^(i)(1), r^(i)(6), r^(i)(12), r^(i)(18), ..., r^(i)(126)]^T, 1 ≤ i ≤ 78   (11)

where y^(i) is the sampled amplitude spectrum of the i-th frame.
The sampled amplitude spectra of all frames are fused to form the fused amplitude-spectrum feature matrix:

Y = [y^(1) y^(2) y^(3) ... y^(i) ... y^(78)]   (12)

where Y is the sampled amplitude-spectrum matrix of the speech segment.
S6, generating the training model
Referring to FIG. 8, the invention trains the language models with a GMM-UBM language identification system and then mounts the trained models on the server. The GMM-UBM is an improved GMM: its model parameters are estimated by the MAP algorithm, which avoids overfitting and does not require adjusting all parameters of the target GMM; estimating only the mean parameters of the Gaussian components already achieves the best recognition performance, which effectively compensates for scarce training data. In the experiment, 1675 utterances are used as the universal background training corpus and 300 utterances per language as training samples, of which 50 are noise-free and the rest are 50 each at SNRs of 25 dB, 20 dB, 15 dB, 10 dB and 5 dB, which better simulates a real noisy environment. The trained models are mounted on the server side.
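A sketch of GMM-UBM training with mean-only MAP adaptation, as the description prescribes. The component count and relevance factor r are assumed values, and scikit-learn's GaussianMixture stands in for whatever trainer the authors used.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features, n_components=64):
    # Universal background model fitted on the pooled multi-language corpus;
    # the component count is an assumed value.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(pooled_features)
    return ubm

def map_adapt_means(ubm, lang_features, r=16.0):
    # Mean-only MAP adaptation; r is the customary relevance factor.
    post = ubm.predict_proba(lang_features)        # frame responsibilities
    n_k = post.sum(axis=0)                         # soft counts per component
    ex = (post.T @ lang_features) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + r))[:, None]             # data-dependent weight
    return alpha * ex + (1 - alpha) * ubm.means_   # adapted means
```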
S7, example of applying the method of the invention to a single utterance
Referring to FIG. 9, a segment of monaural 10 s speech is collected at random at the client and transmitted to the server. After filtering and sampling, the amplitude-spectrum features are extracted and a scoring decision is made against the trained models; the identification result is then output and returned to the client web page. FIG. 10 shows the identification result returned to the client.
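A sketch of the scoring decision in this step: each language's mean-adapted model scores the utterance's frame features, and the highest-scoring language is returned.

```python
import copy

def identify_language(features, lang_means, ubm):
    # Score the utterance against each language's mean-adapted GMM.
    scores = {}
    for lang, means in lang_means.items():
        gmm = copy.deepcopy(ubm)
        gmm.means_ = means                  # swap in the MAP-adapted means
        scores[lang] = gmm.score(features)  # avg log-likelihood per frame
    return max(scores, key=scores.get)
```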
S8, performance test of the method of the invention on a large number of utterances
For testing, 171 utterances per language are used; corpora at SNRs of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB and 20 dB are then added in turn and identified in separate experiments. The identification results are shown in Table 1.
The experimental results demonstrate that the method is highly robust to noise: even at 0 dB the recognition rate exceeds 70%, as shown in Table 1.
TABLE 1 Identification rates of the five languages at different SNRs using the fused features (unit: %)
[Table 1 is reproduced only as an image in the original publication.]
The above description is only a preferred embodiment of the invention and is not intended to limit it; any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention are intended to fall within its scope of protection.

Claims (8)

1. A language identification method in a low signal-to-noise ratio environment, comprising the following steps:
S1, filtering the speech: filtering out the high-frequency information with a filter to obtain the low-frequency part of the speech signal.
S2, interval sampling of the filtered signal: sampling the filtered signal at intervals of A points in the time domain using a resampling technique.
S3, preprocessing: preprocessing the sampled signal, including pre-emphasis, amplitude normalization and framing.
S4, extracting the amplitude spectrum: extracting the amplitude spectrum of the preprocessed signal.
S5, sampling the amplitude spectrum: sampling the extracted amplitude spectrum at intervals of K points according to the Nyquist sampling theorem.
S6, generating the training model: inputting the extracted amplitude spectrum of each language to the training model for training to obtain the corresponding language model.
S7, language identification: mounting the trained language model on a server, collecting the speech data to be recognized at a client and inputting it to the server for scoring-decision recognition, and outputting the recognition result and returning it to the client web page.
2. The language identification method in a low signal-to-noise ratio environment according to claim 1, wherein:
the high-frequency information of the speech is filtered out with a filter to obtain the low-frequency part of the speech signal.
3. The language identification method in a low signal-to-noise ratio environment according to claim 1, wherein: the filtered signal is sampled at intervals of A points in the time domain using a resampling technique.
4. The language identification method in a low signal-to-noise ratio environment according to claim 1, wherein: the sampled signal is preprocessed, including pre-emphasis, amplitude normalization and framing.
5. The language identification method in a low signal-to-noise ratio environment according to claim 1, wherein: the preprocessed signal undergoes FFT, modulus taking, smoothing, logarithm taking and IFFT, and the vocal-tract impulse cepstrum sequence and its spectrum are constructed to obtain the amplitude spectrum.
6. The language identification method in a low signal-to-noise ratio environment according to claim 1, wherein: the extracted amplitude spectrum is sampled at intervals of K points according to the Nyquist sampling theorem.
7. The language identification method in a low signal-to-noise ratio environment according to claim 1, wherein: the extracted sampled amplitude-spectrum features of each language are input to the training model for training to obtain the corresponding language model.
8. The language identification method in a low signal-to-noise ratio environment according to claim 1, wherein: the speech is filtered and sampled, the sampled amplitude-spectrum features are extracted and scored against the trained language model, and finally the recognition result is output and returned to the client.
CN202011154863.6A 2020-10-26 2020-10-26 Language identification method suitable for low signal-to-noise ratio environment Pending CN112420018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011154863.6A CN112420018A (en) 2020-10-26 2020-10-26 Language identification method suitable for low signal-to-noise ratio environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011154863.6A CN112420018A (en) 2020-10-26 2020-10-26 Language identification method suitable for low signal-to-noise ratio environment

Publications (1)

Publication Number Publication Date
CN112420018A true CN112420018A (en) 2021-02-26

Family

ID=74841679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011154863.6A Pending CN112420018A (en) 2020-10-26 2020-10-26 Language identification method suitable for low signal-to-noise ratio environment

Country Status (1)

Country Link
CN (1) CN112420018A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160796A (en) * 2021-04-28 2021-07-23 北京中科模识科技有限公司 Language identification method, device, equipment and storage medium of broadcast audio
CN114548221A (en) * 2022-01-17 2022-05-27 苏州大学 Generation type data enhancement method and system for small sample unbalanced voice database

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120116756A1 (en) * 2010-11-10 2012-05-10 Sony Computer Entertainment Inc. Method for tone/intonation recognition using auditory attention cues
CN110111769A (en) * 2019-04-28 2019-08-09 深圳信息职业技术学院 A kind of cochlear implant control method, device, readable storage medium storing program for executing and cochlear implant
CN110223134A (en) * 2019-04-28 2019-09-10 平安科技(深圳)有限公司 Products Show method and relevant device based on speech recognition
CN110827793A (en) * 2019-10-21 2020-02-21 成都大公博创信息技术有限公司 Language identification method
CN110853632A (en) * 2018-08-21 2020-02-28 蔚来汽车有限公司 Voice recognition method based on voiceprint information and intelligent interaction equipment
CN111161713A (en) * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120116756A1 (en) * 2010-11-10 2012-05-10 Sony Computer Entertainment Inc. Method for tone/intonation recognition using auditory attention cues
CN110853632A (en) * 2018-08-21 2020-02-28 蔚来汽车有限公司 Voice recognition method based on voiceprint information and intelligent interaction equipment
CN110111769A (en) * 2019-04-28 2019-08-09 深圳信息职业技术学院 A kind of cochlear implant control method, device, readable storage medium storing program for executing and cochlear implant
CN110223134A (en) * 2019-04-28 2019-09-10 平安科技(深圳)有限公司 Products Show method and relevant device based on speech recognition
CN110827793A (en) * 2019-10-21 2020-02-21 成都大公博创信息技术有限公司 Language identification method
CN111161713A (en) * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
孙裕晶 (Sun Yujing), Design and Application of Agricultural Engineering Test Systems, 31 January 2011, pages 106-107 *
钟东 (Zhong Dong), Signals and Systems, 31 January 2018, pages 83-86 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160796A (en) * 2021-04-28 2021-07-23 北京中科模识科技有限公司 Language identification method, device, equipment and storage medium of broadcast audio
CN113160796B (en) * 2021-04-28 2023-08-08 北京中科模识科技有限公司 Language identification method, device and equipment for broadcast audio and storage medium
CN114548221A (en) * 2022-01-17 2022-05-27 苏州大学 Generation type data enhancement method and system for small sample unbalanced voice database

Similar Documents

Publication Publication Date Title
CN102054480B (en) Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)
CN108447495A (en) A kind of deep learning sound enhancement method based on comprehensive characteristics collection
Alsteris et al. Further intelligibility results from human listening tests using the short-time phase spectrum
CN109256127B (en) Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
CN113012720B (en) Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN105825852A (en) Oral English reading test scoring method
CN104183245A (en) Method and device for recommending music stars with tones similar to those of singers
CN112420018A (en) Language identification method suitable for low signal-to-noise ratio environment
CN107767859A (en) The speaker's property understood detection method of artificial cochlea's signal under noise circumstance
Zhou et al. Classification of speech under stress based on features derived from the nonlinear Teager energy operator
CN107785028B (en) Voice noise reduction method and device based on signal autocorrelation
CN108198545A (en) A kind of audio recognition method based on wavelet transformation
CN113436606B (en) Original sound speech translation method
CN107464563B (en) Voice interaction toy
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
Chennupati et al. Significance of phase in single frequency filtering outputs of speech signals
CN108281150B (en) Voice tone-changing voice-changing method based on differential glottal wave model
CN105845126A (en) Method for automatic English subtitle filling of English audio image data
CN113963713A (en) Audio noise reduction method and device
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
TW582024B (en) Method and system for determining reliable speech recognition coefficients in noisy environment
Chi et al. Multiband analysis and synthesis of spectro-temporal modulations of Fourier spectrogram
CN114401168B (en) Voice enhancement method applicable to short wave Morse signal under complex strong noise environment
CN112331178A (en) Language identification feature fusion method used in low signal-to-noise ratio environment
CN106997766B (en) Homomorphic filtering speech enhancement method based on broadband noise

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210226