CN107305767B - Short-time voice duration extension method applied to language identification - Google Patents

Short-time voice duration extension method applied to language identification

Info

Publication number
CN107305767B
CN107305767B (application CN201610236672.1A)
Authority
CN
China
Prior art keywords
speech
voice
voices
speeds
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610236672.1A
Other languages
Chinese (zh)
Other versions
CN107305767A (en)
Inventor
Zhou Ruohua (周若华)
Yuan Qingsheng (袁庆升)
Zhang Jian (张健)
Yan Yonghong (颜永红)
Bao Xiuguo (包秀国)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and National Computer Network and Information Security Management Center
Priority to CN201610236672.1A
Publication of CN107305767A
Application granted
Publication of CN107305767B
Expired - Fee Related
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/263 - Language identification
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a short-time voice duration extension method applied to language identification, comprising the following steps: for a voice to be recognized with a short duration, first determining the number n of voices at different speech rates to generate according to the voice duration; then calculating the n decomposition frame shifts from the synthesis frame shift value and the n speech-rate change rates; and generating n voices at different speech rates from the decomposition and synthesis frame shifts, and splicing them with the original voice to generate a voice of longer duration. The language information carried by voices at different speech rates is complementary, so the method can significantly improve language identification performance on short-duration speech.

Description

Short-time voice duration extension method applied to language identification
Technical Field
The invention relates to the field of computer language identification, and in particular to a short-time voice duration extension method applied to language identification.
Background
Language identification refers to the technology by which a computer automatically determines the language to which a piece of speech belongs. It is an enabling technology for large-scale cross-language speech applications and can be used for spoken language translation, spoken document retrieval, and the like. It is also a research hotspot for information extraction in the intelligence and security fields.
An overly short voice to be recognized is a common problem in research fields such as speaker recognition and language identification, and in recent years there have been targeted studies on short-duration speech recognition. Reference [1] (A. K. Sarkar, D. Matrouf, P.-M. Bousquet, and J.-F. Bonastre. Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012, pages 2662-2665) studied the effect of i-vector modeling on short and mismatched utterance durations for speaker verification.
Reference [2] (M. Wang, Y. Song, B. Jiang, L. Dai, and I. V. McLoughlin. Exemplar based language recognition method for short-duration speech segments. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 7354-7358, 2013) proposes first building an exemplar space for short-duration speech, in which the exemplars are obtained by clustering i-vectors of speech of different lengths. In the recognition stage, the short-duration speech is compared with all exemplars in the space, and the comparison scores, such as cosine similarities, are fed as features to the back-end recognizer.
Reference [3] (S. Cumani, O. Plchot, and R. Fér. Exploiting i-vector posterior covariances for short-duration language recognition. In Proceedings of INTERSPEECH 2015, pages 1002-1006. International Speech Communication Association, 2015) applies the probabilistic linear discriminant analysis (PLDA) techniques commonly used in speaker recognition to improve i-vector based language identification.
Reference [4] (A. Lozano-Diez, R. Zazo-Candil, J. Gonzalez-Dominguez, D. T. Toledano, and J. Gonzalez-Rodriguez. An end-to-end approach to language identification in short utterances using convolutional neural networks. In INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015, pages 403-407, 2015) proposes modeling with convolutional neural networks (CNN).
Existing research on short-duration language identification has two problems: (1) in order to handle short voices, system complexity rises sharply and resource consumption increases; (2) the modifications are made in the model part, so long voices must also be processed with the same added complexity. In practice, systems would prefer to apply such processing only to short voices, since it is only there that recognition performance degrades.
Disclosure of Invention
The invention aims to overcome the currently poor language identification performance on short-duration speech, and provides a short-time voice duration extension method applied to language identification, which directly extends the duration of the voice to be recognized using time-domain speech scaling (time-scale modification); for each voice to be recognized, several voices at different speech rates are generated and then spliced with the original voice to form a longer voice.
In order to achieve the above object, the present invention provides a short-time voice duration extension method applied to language identification, wherein the method comprises:
for a voice to be recognized with a short duration, first determining the number n of voices at different speech rates to generate according to the voice duration; then calculating the n decomposition frame shifts from the synthesis frame shift value and the n speech-rate change rates; and generating n voices at different speech rates from the decomposition and synthesis frame shifts, and splicing them with the original voice to generate a voice of longer duration.
In the above technical solution, the method specifically includes:
step 1), for a voice x to be recognized with duration length(x), judging whether length(x) is less than a threshold T; if so, proceeding to step 2); otherwise the voice needs no processing;
step 2), determining the number n of voices at different speech rates to generate; n is determined by the duration of the input voice (the defining formula is reproduced in the original only as an image; per the description below, shorter inputs yield larger n);
step 3), with the synthesis frame shift fixed at S_s, calculating from the speech-rate change rates α the values of the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n);
step 4), processing the voice to be recognized with the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n) to generate n voices at different speech rates: x_1, x_2, …, x_n;
Step 5), splicing the voice to be recognized and the generated n voices, wherein the spliced voice y is as follows:
y = [x x_1 … x_n].
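For concreteness, steps 1)-5) can be sketched in a few lines of Python. This is a minimal illustration, not the patented reference implementation: the threshold T, the value of n (whose defining formula survives above only as an image), the evenly spaced rate schedule, and the time_stretch helper (for instance the overlap-add sketch given further below) are all assumptions.

    import numpy as np

    def extend_short_utterance(x, sr, time_stretch, T=10.0, n=4):
        """Steps 1)-5): return x unchanged if long enough, else splice n rate variants."""
        if len(x) / sr >= T:                    # step 1): only short voices are processed
            return x
        # steps 2)-3): n assumed rate-change values in [0.7, 1.3], skipping alpha = 1
        grid = np.linspace(0.7, 1.3, n + 1)
        alphas = [a for a in grid if not np.isclose(a, 1.0)][:n]
        variants = [time_stretch(x, a) for a in alphas]   # step 4): n speech-rate variants
        return np.concatenate([x] + variants)             # step 5): y = [x x_1 ... x_n]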
In the above technical solution, the n decomposition frame shifts S_a^(i) are calculated in step 3) as follows:
the speech-rate change rate α is defined as the ratio of the decomposition frame shift to the synthesis frame shift,
α = S_a / S_s;
the i-th decomposition frame shift is therefore
S_a^(i) = α_i · S_s, i = 1, 2, …, n,
where α_1, …, α_n are n distinct speech-rate change rates taken from the preferred range 0.7-1.3 with α_i ≠ 1 (the specific α_i values are reproduced in the original only as formula images).
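As an illustration (the frame-shift value here is an assumption, not taken from the patent): with a synthesis frame shift S_s of 160 samples, i.e. 10 ms at a 16 kHz sampling rate, α_i = 0.8 gives S_a^(i) = 0.8 × 160 = 128 samples, so analysis frames are taken every 128 samples but laid down every 160, and the generated voice is slower, about 1/0.8 = 1.25 times the original duration.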
In the above technical solution, the process in step 4) of generating from the voice to be recognized a voice at a different speech rate specifically includes:
decomposing the voice to be recognized by windowing and framing with frame length L and decomposition frame shift S_a; transforming each frame to the frequency domain with a short-time Fourier transform; and then, with frame length L and synthesis frame shift S_s, transforming the time-frequency signal back to the time domain by overlap-add (splice-and-add), obtaining a voice at a different speech rate.
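The windowing/STFT/overlap-add loop just described can be sketched as follows; this is a plain overlap-add illustration under assumed parameters (Hann window, frame length L = 400 samples, i.e. 25 ms at 16 kHz, input at least L samples long). The patent fixes neither the window nor L, and a production implementation would normally also adjust phases between frames (a phase vocoder) to avoid artifacts.

    import numpy as np

    def ola_time_stretch(x, Sa, Ss, L=400):
        """Analyze frames every Sa samples, overlap-add them every Ss samples."""
        win = np.hanning(L)
        n_frames = (len(x) - L) // Sa + 1
        y = np.zeros(Ss * (n_frames - 1) + L)
        norm = np.zeros_like(y)                   # accumulated window energy
        for i in range(n_frames):
            frame = x[i * Sa : i * Sa + L] * win  # windowed analysis frame
            spec = np.fft.rfft(frame)             # each frame to the frequency domain
            out = np.fft.irfft(spec, n=L) * win   # inverse transform back to time domain
            y[i * Ss : i * Ss + L] += out         # splice-and-add at the synthesis shift
            norm[i * Ss : i * Ss + L] += win ** 2
        return y / np.maximum(norm, 1e-8)         # normalize the window overlap

With S_a = α · S_s, the output has roughly length(x)/α samples: S_a < S_s stretches the voice (slower speech), S_a > S_s compresses it (faster speech).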
The invention has the following advantages:
1. The method converts the voice into voices at different speech rates; these differ from the original voice in rate but belong to the same language, so the language information they contain is complementary. With a suitable rate change the voice still sounds natural, which means the training set also contains speech at such rates, so no mismatch between the test set and the training set is introduced.
2. Splicing voices at different speech rates can reduce the influence of the speaker. An ideal language identification feature would remove the interference of speaker information, channel information and background noise and extract only the differences between languages, but this is currently unachievable. Since different people speak at different rates, splicing voices at different rates approximates information from different speakers, and combining this information weakens speaker interference to a certain degree.
3. The method processes only the voice to be recognized and does not modify the training set, so the model need not be changed. Moreover, the method is applied only when the voice duration is too short, for example less than 10 seconds, so it adds almost no extra burden to the system, which is very important for a practical system.
Drawings
FIG. 1 is a flow chart of the short-time voice duration extension method applied to language identification according to the present invention;
FIG. 2 is a schematic diagram of generating voices at different speech rates according to the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
As shown in FIG. 1, the short-time voice duration extension method applied to language identification includes:
Step 1), for a voice x to be recognized with duration length(x), judge whether length(x) is less than a threshold T; if so, go to step 2); otherwise the voice needs no processing;
Step 2), determine the number n of voices at different speech rates to generate; n is determined by the duration of the input voice (the defining formula is reproduced in the original only as an image).
It can be seen from the calculation formula of n that the shorter the input voice, the more voices need to be generated.
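Since the defining formula for n survives only as an image, the following step function is purely illustrative: only its shape is taken from the text (fewer generated voices as the input grows, none at or above the threshold T), while the breakpoints and counts are invented for the example.

    def num_variants(duration_s, T=10.0):
        # Illustrative only: the breakpoints and counts below are assumptions.
        if duration_s >= T:
            return 0      # at or above the threshold no extension is needed
        if duration_s >= 5.0:
            return 2
        if duration_s >= 3.0:
            return 4
        return 6          # the shorter the input, the more voices are generated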
Step 3), with the synthesis frame shift fixed at S_s, select the values of the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n) according to the speech-rate change rate.
The speech-rate change rate α is defined as the ratio of the decomposition frame shift to the synthesis frame shift:
α = S_a / S_s.
Experimental verification shows that α preferably takes values in the range 0.7-1.3. The i-th decomposition frame shift is then calculated as
S_a^(i) = α_i · S_s, i = 1, 2, …, n,
where α_1, …, α_n are n distinct rate values in this range (the specific α_i schedule is reproduced in the original only as formula images).
In particular, if α = 1, the generated voice would have the same speech rate as the original voice, so this value is skipped.
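One possible schedule for the n rate values, consistent with the preferred range 0.7-1.3 and the exclusion of α = 1, is an even spacing; the patent's actual per-index formula survives only as images, so this spacing is an assumption:

    import numpy as np

    def rate_schedule(n):
        """n rate-change values spread evenly over [0.7, 1.3], excluding 1.0."""
        grid = np.linspace(0.7, 1.3, n + 1)   # one spare point so 1.0 can be dropped
        return [a for a in grid if not np.isclose(a, 1.0)][:n]

    # e.g. rate_schedule(4) -> [0.7, 0.85, 1.15, 1.3]; with Ss = 160 samples
    # (10 ms at 16 kHz, assumed), the decomposition shifts are Sa_i = alpha_i * Ss.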
Step 4), process the voice to be recognized with the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n) to generate n voices at different speech rates: x_1, x_2, …, x_n.
As shown in FIG. 2, the process of generating from the voice to be recognized a voice at a different speech rate specifically includes: decomposing the voice to be recognized by windowing and framing with frame length L and decomposition frame shift S_a; transforming each frame to the frequency domain with a short-time Fourier transform; and then, with frame length L and synthesis frame shift S_s, transforming the time-frequency signal back to the time domain by overlap-add, obtaining a voice at a different speech rate.
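In practice an off-the-shelf time-scale modifier can play the role of this analysis/synthesis loop, for instance librosa's phase-vocoder based stretcher, where rate > 1 speeds the voice up and rate < 1 slows it down; the file name and the four rate values here are only illustrative:

    import numpy as np
    import librosa

    x, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
    alphas = [0.7, 0.85, 1.15, 1.3]                  # assumed rate schedule
    variants = [librosa.effects.time_stretch(x, rate=a) for a in alphas]
    y = np.concatenate([x] + variants)               # spliced voice y = [x x_1 ... x_n]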
The frame shifts at decomposition and synthesis are unequal, the synthesis frame shift S_s being fixed. If the decomposition frame shift S_a is smaller than the synthesis frame shift S_s, the synthesized voice is slower than the original and its duration is longer; if S_a is larger than S_s, the synthesized voice is faster and its duration is shorter. The duration of the voice x_i after time-domain scaling is related to the duration of the original voice x by
length(x_i) = (S_s / S_a^(i)) · length(x) = length(x) / α_i.
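For example, a 5-second voice scaled with α = 0.7 yields about 5/0.7 ≈ 7.1 s of slower speech, while α = 1.3 yields about 5/1.3 ≈ 3.8 s of faster speech; splicing the original with just these two variants already gives roughly 5 + 7.1 + 3.8 ≈ 16 s.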
Step 5), splice the voice to be recognized with the n generated voices; the spliced voice y is:
y = [x x_1 … x_n].
When α takes values in the range 0.7-1.3, the recognition effect of the spliced voice y is best.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (1)

1. A short-time voice duration extension method applied to language identification comprises the following steps:
for a voice to be recognized with a short duration, first determining the number n of voices at different speech rates to generate according to the voice duration; then calculating the n decomposition frame shifts of the generated voices from the synthesis frame shift value and the n speech-rate change rates; generating n voices at different speech rates from the decomposition and synthesis frame shifts, and splicing the n voices at different speech rates with the original voice to generate a voice of extended duration;
the method specifically comprises the following steps:
step 1), for a voice x to be recognized with duration length(x), judging whether length(x) is less than a threshold T; if so, proceeding to step 2); otherwise the voice needs no processing;
step 2), determining the number n of voices at different speech rates to generate, n being determined by the duration of the input voice (formula reproduced in the original only as an image);
step 3), with the synthesis frame shift fixed at S_s, calculating from the speech-rate change rates α the values of the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n);
step 4), processing the voice to be recognized with the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n) to generate n voices at different speech rates: x_1, x_2, …, x_n;
step 5), splicing the voice to be recognized with the n generated voices, the spliced voice y being:
y = [x x_1 … x_n];
calculating the values of the n decomposition frame shifts S_a^(i) in said step 3) as follows:
the speech-rate change rate α is defined as the ratio of the decomposition frame shift to the synthesis frame shift,
α = S_a / S_s;
the i-th decomposition frame shift is
S_a^(i) = α_i · S_s, i = 1, 2, …, n,
where α_1, …, α_n are n distinct speech-rate change rates with α_i ≠ 1 (the specific values are reproduced in the original only as formula images);
the process in said step 4) of generating from the voice to be recognized a voice at a different speech rate specifically comprising: decomposing the voice to be recognized by windowing and framing with frame length L and decomposition frame shift S_a; transforming each frame to the frequency domain with a short-time Fourier transform; and transforming the time-frequency signal back to the time domain by overlap-add with frame length L and synthesis frame shift S_s, to obtain a voice at a different speech rate.
CN201610236672.1A 2016-04-15 2016-04-15 Short-time voice duration extension method applied to language identification Expired - Fee Related CN107305767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610236672.1A CN107305767B (en) 2016-04-15 2016-04-15 Short-time voice duration extension method applied to language identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610236672.1A CN107305767B (en) 2016-04-15 2016-04-15 Short-time voice duration extension method applied to language identification

Publications (2)

Publication Number Publication Date
CN107305767A CN107305767A (en) 2017-10-31
CN107305767B (en) 2020-03-17

Family

ID=60151327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610236672.1A Expired - Fee Related CN107305767B (en) 2016-04-15 2016-04-15 Short-time voice duration extension method applied to language identification

Country Status (1)

Country Link
CN (1) CN107305767B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109975762B (en) * 2017-12-28 2021-05-18 中国科学院声学研究所 Underwater sound source positioning method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1512485A (en) * 2002-12-31 2004-07-14 北京天朗语音科技有限公司 Voice identification system of voice speed adaption
JP3563772B2 (en) * 1994-06-16 2004-09-08 キヤノン株式会社 Speech synthesis method and apparatus, and speech synthesis control method and apparatus
CN1750122A (en) * 2005-11-07 2006-03-22 章森 Telescopic voice compression recovery technology based on extreme point
CN101645269A (en) * 2008-12-30 2010-02-10 中国科学院声学研究所 Language recognition system and method
CN101740034A (en) * 2008-11-04 2010-06-16 刘盛举 Method for realizing sound speed-variation without tone variation and system for realizing speed variation and tone variation

Also Published As

Publication number Publication date
CN107305767A (en) 2017-10-31

Similar Documents

Publication Publication Date Title
CN105788603B (en) A kind of audio identification methods and system based on empirical mode decomposition
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN110797002B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN103164403B (en) The generation method and system of video index data
Du et al. Speaker augmentation for low resource speech recognition
CN105118501A (en) Speech recognition method and system
Todkar et al. Speaker recognition techniques: A review
Delcroix et al. Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation
CN101887722A (en) Rapid voiceprint authentication method
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
Dua et al. Discriminative training using heterogeneous feature vector for Hindi automatic speech recognition system
CN111968622A (en) Attention mechanism-based voice recognition method, system and device
Zheng et al. Acoustic texttiling for story segmentation of spoken documents
CN101178895A (en) Model self-adapting method based on generating parameter listen-feel error minimize
CN107305767B (en) Short-time voice duration extension method applied to language identification
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
CN110197657A (en) A kind of dynamic speech feature extracting method based on cosine similarity
Koolagudi et al. Speaker recognition in the case of emotional environment using transformation of speech features
Zhao et al. Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding.
Tolba et al. A novel method for Arabic consonant/vowel segmentation using wavelet transform
Shahnawazuddin et al. Enhancing robustness of zero resource children's speech recognition system through bispectrum based front-end acoustic features
KR101361034B1 (en) Robust speech recognition method based on independent vector analysis using harmonic frequency dependency and system using the method
Miguel et al. Augmented state space acoustic decoding for modeling local variability in speech.
Du et al. Pan: Phoneme-aware network for monaural speech enhancement
CN112908340A (en) Global-local windowing-based sound feature rapid extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200317