CN113112990A

CN113112990A - Language identification method of variable-duration voice based on spectrum envelope diagram

Info

Publication number: CN113112990A
Application number: CN202110238968.8A
Authority: CN
Inventors: 龙华; 王瑶; 邵玉斌; 杜庆治; 王延凯; 陈亮; 唐维康
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2021-03-04
Filing date: 2021-03-04
Publication date: 2021-07-13

Abstract

The invention relates to a language identification method of variable-duration voice based on a spectrum envelope diagram, belonging to the technical field of audio signal processing. Firstly, dividing a voice signal into short-time voice with the duration of t seconds; then extracting the spectrum envelope diagrams of each short-time speech, and distributing the spectrum envelope diagrams into a training set and a test set according to the proportion of a to b; fitting the training set into a residual error network for training, and selecting a language identification model with the highest identification rate through adjusting model parameters and repeatedly testing the test set; when the duration of the voice to be detected is longer than t seconds, the voice to be detected is divided into a plurality of short-term voices with the length of t seconds, and then the recognition condition of the short voice obtained by dividing each long voice is counted to judge the language of the whole long voice. The invention can not only accelerate the speed of language identification, but also can identify the language under the condition of different durations of the voice to be detected, and can ensure higher accuracy.

Description

Language identification method of variable-duration voice based on spectrum envelope diagram

Technical Field

The invention relates to a language identification method of variable-duration voice based on a spectrum envelope diagram, belonging to the technical field of audio signal processing.

Background

According to survey, over 6900 different languages exist around the world, and it is very complicated to classify them completely by human labor. There are many language translation tools, and the front end of these translation tools refers to language identification technology. When the durations of the voices to be detected are unequal, the training set cannot contain voice signals of all durations, so the language identification effect of the voices with different durations is obviously reduced. At present, in order to solve the problem of language identification of voices with different durations, the most widely used method is to change the speed of the voices to adjust the durations, and although the frequency domain characteristics of the voices can be kept basically unchanged, the time domain characteristics are changed greatly, and the language identification effect is not ideal. There is therefore still a great deal of research space for language identification of variable duration speech.

Disclosure of Invention

The invention provides a language identification method of variable-duration speech based on a spectrum envelope diagram, which is used for solving the problem that the language identification effect is sharply reduced when the durations of the speech to be detected are unequal.

The technical scheme of the invention is as follows: a language identification method of variable-duration voice based on a spectrum envelope diagram mainly comprises two parts, wherein the first part is used for language identification of short voice (duration is 1 second), and the second part is used for regular duration of long voice to be detected, namely, the long voice is divided into a plurality of phrase voice signals with duration of 1 second.

In the language identification process of the first part of short speech, a spectrum envelope image of a speech signal is used as the characteristic input of a language identification system, the extraction process mainly comprises the steps of framing the speech signal, windowing and homomorphic processing, each frame spectrum envelope of the short speech is obtained, then the spectrum envelopes are spliced together according to lines to form a spectrum envelope image of a section of short speech (the duration is 1 second), then the spectrum envelope images are distributed into a training set and a test set according to the ratio of 4:1, the training set is used for fitting a residual error network to form a language identification model, and the test set is used for testing the generated language identification model to select the language identification model with the best identification effect.

The second part is that long speech with different time length is divided into several short speech (time length is 1 second), then these short speech are sent into language identification system to make test, and the language of long speech is discriminated by counting the identification condition of these short speech.

The method comprises the following specific steps:

step 1: and dividing long-segment voice signals of different languages into short-time voice with shorter time length, and defining the time length of the short-time voice signals as t seconds.

Step 2: and performing framing and windowing functions on the short-time voice with the time length of t seconds, and then solving the spectrum envelope of each frame of the short-time voice with the time length of t seconds through homomorphic processing.

Step 3: the spectrum envelopes of each frame signal of the same short-time speech are arranged and combined according to rows, a spectrum envelope graph corresponding to each section of speech is drawn, the horizontal axis represents a spectrum, and the vertical axis represents a time domain.

Step 4: and filtering the generated spectrum envelope diagram, removing high-frequency and low-frequency parts of the voice signal, and reserving a middle-frequency part of the voice signal to enable the frequency of the middle-frequency part to be in the range of 500HZ to 3000 HZ.

Step 5: and (3) according to the spectrum envelope diagram of each language as N: and m is distributed into a training set and a testing set, and labels of corresponding languages are marked.

Step 6: and fitting the training set to a residual error network, training by adjusting parameters to obtain different language identification models, testing the language identification models by using the test set, and selecting the language identification model with the highest language identification rate.

Step 7: when the time lengths of the voices to be detected are unequal, the voice signals are divided into a plurality of short-time voice signals, the time length is t seconds, a plurality of phrase voices obtained by dividing each long voice are fitted into a language recognition model in Step6, and the language of the long-time voice is judged by counting the recognition conditions of the short voices.

The invention has the beneficial effects that: the method can not only accelerate the speed of language identification, but also identify the languages under the condition that the time lengths of the voices to be detected are different, and can ensure higher accuracy.

Drawings

FIG. 1 is a block diagram of the overall architecture of the present invention;

FIG. 2 is a flow chart of spectral envelope extraction according to the present invention;

FIG. 3 is a graph of vocal tract impulse response (spectral envelope) and glottal excitation pulses of the present invention;

FIG. 4 is a graphical depiction of a spectral envelope of the present invention;

FIG. 5 is a graph of a filtered spectral envelope of the present invention;

FIG. 6 is a diagram illustrating long speech segmentation according to the present invention.

Detailed Description

The invention is further described with reference to the following drawings and detailed description.

As shown in fig. 1, a speech recognition method for variable duration speech based on a spectral envelope diagram mainly includes two parts, the first part is speech recognition for short speech (duration is 1 second), and the second part is a phrase tone signal with regular duration of the long speech to be measured, i.e. the phrase tone signal is divided into several phrase tone signals with duration of 1 second.

The method specifically comprises the following steps:

step 1: generating voice data;

downloading audio signals of 8 languages with equal time duration on a broadcasting station, respectively dividing the audio signals into phrase tones with time duration of t seconds through a program, transcoding a voice file into a wav file by adobe audio software, wherein the sampling rate is 16k, and the data is monaural. The 8 voices are mandarin, korean, tibetan, burmese, cambodia, laos, and vietnamese, respectively.

Step 2: extracting a spectrum envelope;

speech is a complex multi-frequency signal, the amplitudes of the signals of all frequencies are different, the signals are arranged according to the magnitude sequence, and the connecting line of the midpoints of the amplitudes is the spectrum envelope referred to in the invention. The shape of the envelope varies from utterance to utterance, and since there is a difference in the utterances of the respective languages, it is proposed herein to use the spectral envelope as a speech acoustic feature for language recognition. In the process of extracting the spectral envelope, homomorphic filtering is the most important part, and the role of the homomorphic filtering is to separate the vocal tract impulse response and the glottal excitation pulse in the speech signal, and in the invention, the obtained vocal tract impulse response is the spectral envelope required by us. The extraction process of the spectral envelope is shown in fig. 2.

Step2.1: framing;

dividing a voice signal with the duration of 1 second into short-time frame signals, and regarding each frame signal as a steady-state time-invariant signal. In the process of framing, the overlapping portion between two adjacent frames is called frame shift, and the length of the frame shift is generally one half of the frame length.

Step2.2: windowing;

adding rectangular window, Hamming window or Haining window to each frame signal, selecting different window functions will have different bandwidth and spectrum leakage, and the Hamming window is selected in the invention.

Step2.3: homomorphic processing;

after framing and windowing, each frame of speech signal can be represented as:

x(n)＝x₁(n)*x₂(n) (1)

in the formula (1), x₁(n) and x₂(n) represent the vocal tract impulse response (spectral envelope) and the glottal excitation signal, respectively.

The original convolution signal is transformed into a multiplicative signal by fast fourier transform:

DFT[x(n)]＝DFT[x₁(n)*x₂(n)]＝X₁(k)·X₂(k)＝X(k) (2)

in the formula (2), DFT [ ] is a discrete Fourier transform.

In the formula (3), x (k) is a transformed multiplicative signal.

The multiplicative signal is converted into an additive signal by taking the logarithm.

In the formula (4), the reaction mixture is,

for the logarithmic spectrum of signal x (n), the spectral envelope proposed in the present invention is referred to

The central envelope of (c).

To pair

To carry outInverse discrete Fourier transform to obtain complex cepstrum of signal x (n)

DFT^-1[]Is an inverse fast fourier transform.

Get

Real part of

In the formula (7), a

Separated by a specific spectral line m

And

then to

A discrete Fourier transform is performed, and a real part of the transform result obtains a spectrum envelope of each frame.

Y (k) is the spectral envelope of the signal. As shown in fig. 3, the vocal tract impulse response (spectral envelope) and the glottal excitation pulse of a frame of speech signal are shown.

Step 3: drawing a spectrum envelope graph;

and combining the spectral envelopes of each frame of the same short-time voice according to rows, and drawing a spectral envelope graph corresponding to each section of voice through a python program. The horizontal axis of the spectrum envelope graph is frequency characteristic, and the vertical axis is time domain characteristic, and the drawing diagram is shown in fig. 4.

Step 4: filtering the frequency spectrum;

in order to eliminate the influence on the speech recognition result due to the difference of the frequency distribution range of different languages, the high frequency part and the low frequency part of the spectrum envelope map are now removed, and only the spectrum envelopes of the speech signals 500HZ to 3000HZ are reserved. As shown in fig. 4, the picture is a filtered picture of a spectrum envelope map of a short-term speech with a duration of t seconds.

Step 5: generating a training set and a test set;

the method comprises the steps of distributing a spectrum envelope diagram of each language into a training set and a test set according to a ratio of 4:1, marking labels of the corresponding languages, then concentrating the training sets of various voices into a training set file, and concentrating the test sets of various voices into a test set file.

Step 6: generating a language identification model;

and fitting the training set to a residual error network, obtaining different language identification models by adjusting parameters, and testing the language identification models by using the test set to select the language identification model with the highest language identification rate.

Step 7: language identification of variable duration voice;

when the time lengths of the voices to be detected are unequal, the voice signals are divided into a plurality of short-time voice signals (the time length is 1 second)

A＝[a₁ a₂ a₃ ... a_N] (9)

In the formula (9), a₁,a₂,a₃.. are short-time speeches with duration of 1 secondAnd A is a voice signal with the duration of more than 1 second. When performing voice segmentation, firstly, the time length T of the voice signal is determined, and then the number N of voices with the time-sharing length of 1 second is determined by rounding the time-following length.

In the formula (10), S is the overlapping length of two adjacent short-term voices during long voice segmentation, when S is greater than 0, | S | represents the overlapping duration of two adjacent short-term voices, and when S is less than 0, | S | represents the distance between two adjacent short-term voices, which is specifically segmented as shown in fig. 5.

After a long speech with a duration T is segmented into N short phonemes, the probability statistics can be expressed by equation (11):

P＝[p₁ p₂ p₃ ... p_N] (11)

in the formula (11), p_iRepresenting the recognition probability of the ith segment of speech.

In this embodiment, eight languages are mentioned, and when a long speech is determined, the sum of probabilities of each language is obtained, and the language with the highest probability is the language of the long speech.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims

1. A language identification method of variable duration voice based on a spectrum envelope diagram is characterized in that:

step 1: dividing long-segment voice signals of different languages into short-time voice, and defining the duration of the short-time voice signals as t seconds;

step 2: performing framing and windowing functions on the short-time speech, and then solving the spectrum envelope of each frame of the short-time speech with the duration of t seconds;

step 3: combining the spectrum envelopes of each frame of the same short-time voice, and drawing a spectrum envelope graph corresponding to each section of voice;

step 4: filtering the generated spectrum envelope diagram to make the frequency of the spectrum envelope diagram within the range of 500HZ to 3000 HZ;

step 5: and (3) according to the spectrum envelope diagram of each language as N: m is distributed into a training set and a test set, and labels of corresponding languages are marked;

step 6: fitting the training set to a residual error network, obtaining different language identification models by adjusting parameters, testing the language identification models by using the test set, and selecting the language identification model with the highest language identification rate;