CN107305767B - Short-time voice duration extension method applied to language identification - Google Patents

Short-time voice duration extension method applied to language identification

Info

Publication number
CN107305767B
CN107305767B (application CN201610236672.1A)
Authority
CN
China
Prior art keywords
speech
voice
voices
speeds
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610236672.1A
Other languages
Chinese (zh)
Other versions
CN107305767A (en)
Inventor
Zhou Ruohua (周若华)
Yuan Qingsheng (袁庆升)
Zhang Jian (张健)
Yan Yonghong (颜永红)
Bao Xiuguo (包秀国)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and National Computer Network and Information Security Management Center
Priority to CN201610236672.1A
Publication of CN107305767A
Application granted
Publication of CN107305767B
Expired - Fee Related
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/263 - Language identification
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a short-time voice duration extension method applied to language identification, comprising the following steps: for a voice to be recognized with a short duration, first determining the number n of voices at different speech rates to generate according to the voice duration; then calculating the n decomposition frame shifts from the synthesis frame shift value and the n speech-rate change rates; and generating n voices at different speech rates from the decomposition and synthesis frame shifts, and splicing them with the original voice to generate a voice of longer duration. The language information carried by voices at different speech rates is complementary, so the method can significantly improve language identification performance on short-duration speech.

Description

Short-time voice duration extension method applied to language identification
Technical Field
The invention relates to the field of computer language identification, and in particular to a short-time voice duration extension method applied to language identification.
Background
Language identification refers to the technology by which a computer automatically determines the language to which a piece of speech belongs. It is an enabling technology for large-scale cross-language speech applications and can be used for spoken language translation, spoken document retrieval, and the like. It is also a research hotspot for information extraction in the intelligence and security fields.
An overly short voice to be recognized is a common problem in research fields such as speaker recognition and language identification, and in recent years there have been targeted studies on short-duration speech recognition. Reference [1] (A. K. Sarkar, D. Matrouf, P.-M. Bousquet, and J.-F. Bonastre. Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012, pages 2662-2665) studied the effect of i-vector modeling on short and mismatched utterance durations for speaker verification.
Reference [2] (M. Wang, Y. Song, B. Jiang, L. Dai, and I. V. McLoughlin. Exemplar based language recognition method for short-duration speech segments. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 7354-7358, 2013) proposes first building an exemplar space for short-duration speech, in which the exemplars are obtained by clustering i-vectors of speech of different lengths. In the recognition stage, the short-duration speech is compared with all exemplars in the space, and the comparison scores, such as cosine similarities, are fed as features to the back-end recognizer.
Reference [3] (S. Cumani, O. Plchot, and R. Fér. Exploiting i-vector posterior covariances for short-duration language recognition. In Proceedings of INTERSPEECH 2015, pages 1002-1006. International Speech Communication Association, 2015) applies the probabilistic linear discriminant analysis (PLDA) techniques commonly used in speaker recognition to improve i-vector based language identification.
Reference [4] (A. Lozano-Diez, R. Zazo-Candil, J. Gonzalez-Dominguez, D. T. Toledano, and J. Gonzalez-Rodriguez. An end-to-end approach to language identification in short utterances using convolutional neural networks. In INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015, pages 403-407, 2015) proposes modeling with convolutional neural networks (CNN).
Existing research on short-duration language identification has two problems: (1) in order to handle short voices, system complexity rises sharply and resource consumption increases; (2) the modifications are made in the model part, so long voices must also be processed with the same added complexity. In practice, systems would prefer to apply such processing only to short voices, since it is only there that recognition performance degrades.
Disclosure of Invention
The invention aims to overcome the currently poor language identification performance on short-duration speech, and provides a short-time voice duration extension method applied to language identification, which directly extends the duration of the voice to be recognized using time-domain speech scaling (time-scale modification); for each voice to be recognized, several voices at different speech rates are generated and then spliced with the original voice to form a longer voice.
In order to achieve the above object, the present invention provides a short-time voice duration extension method applied to language identification, wherein the method comprises:
for a voice to be recognized with a short duration, first determining the number n of voices at different speech rates to generate according to the voice duration; then calculating the n decomposition frame shifts from the synthesis frame shift value and the n speech-rate change rates; and generating n voices at different speech rates from the decomposition and synthesis frame shifts, and splicing them with the original voice to generate a voice of longer duration.
In the above technical solution, the method specifically includes:
step 1), for a voice x to be recognized with duration length(x), judging whether length(x) is less than a threshold T; if so, proceeding to step 2); otherwise the voice needs no processing;
step 2), determining the number n of voices at different speech rates to generate; n is determined by the duration of the input voice (the defining formula is reproduced in the original only as an image; per the description below, shorter inputs yield larger n);
step 3), with the synthesis frame shift fixed at S_s, calculating from the speech-rate change rates α the values of the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n);
step 4), processing the voice to be recognized with the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n) to generate n voices at different speech rates: x_1, x_2, …, x_n;
Step 5), splicing the voice to be recognized and the generated n voices, wherein the spliced voice y is as follows:
y = [x x_1 … x_n].
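For concreteness, steps 1)-5) can be sketched in a few lines of Python. This is a minimal illustration, not the patented reference implementation: the threshold T, the value of n (whose defining formula survives above only as an image), the evenly spaced rate schedule, and the time_stretch helper (for instance the overlap-add sketch given further below) are all assumptions.

    import numpy as np

    def extend_short_utterance(x, sr, time_stretch, T=10.0, n=4):
        """Steps 1)-5): return x unchanged if long enough, else splice n rate variants."""
        if len(x) / sr >= T:                    # step 1): only short voices are processed
            return x
        # steps 2)-3): n assumed rate-change values in [0.7, 1.3], skipping alpha = 1
        grid = np.linspace(0.7, 1.3, n + 1)
        alphas = [a for a in grid if not np.isclose(a, 1.0)][:n]
        variants = [time_stretch(x, a) for a in alphas]   # step 4): n speech-rate variants
        return np.concatenate([x] + variants)             # step 5): y = [x x_1 ... x_n]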
In the above technical solution, the n decomposition frame shifts S_a^(i) are calculated in step 3) as follows:
the speech-rate change rate α is defined as the ratio of the decomposition frame shift to the synthesis frame shift,
α = S_a / S_s;
the i-th decomposition frame shift is therefore
S_a^(i) = α_i · S_s, i = 1, 2, …, n,
where α_1, …, α_n are n distinct speech-rate change rates taken from the preferred range 0.7-1.3 with α_i ≠ 1 (the specific α_i values are reproduced in the original only as formula images).
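As an illustration (the frame-shift value here is an assumption, not taken from the patent): with a synthesis frame shift S_s of 160 samples, i.e. 10 ms at a 16 kHz sampling rate, α_i = 0.8 gives S_a^(i) = 0.8 × 160 = 128 samples, so analysis frames are taken every 128 samples but laid down every 160, and the generated voice is slower, about 1/0.8 = 1.25 times the original duration.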
In the above technical solution, the process in step 4) of generating from the voice to be recognized a voice at a different speech rate specifically includes:
decomposing the voice to be recognized by windowing and framing with frame length L and decomposition frame shift S_a; transforming each frame to the frequency domain with a short-time Fourier transform; and then, with frame length L and synthesis frame shift S_s, transforming the time-frequency signal back to the time domain by overlap-add (splice-and-add), obtaining a voice at a different speech rate.
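The windowing/STFT/overlap-add loop just described can be sketched as follows; this is a plain overlap-add illustration under assumed parameters (Hann window, frame length L = 400 samples, i.e. 25 ms at 16 kHz, input at least L samples long). The patent fixes neither the window nor L, and a production implementation would normally also adjust phases between frames (a phase vocoder) to avoid artifacts.

    import numpy as np

    def ola_time_stretch(x, Sa, Ss, L=400):
        """Analyze frames every Sa samples, overlap-add them every Ss samples."""
        win = np.hanning(L)
        n_frames = (len(x) - L) // Sa + 1
        y = np.zeros(Ss * (n_frames - 1) + L)
        norm = np.zeros_like(y)                   # accumulated window energy
        for i in range(n_frames):
            frame = x[i * Sa : i * Sa + L] * win  # windowed analysis frame
            spec = np.fft.rfft(frame)             # each frame to the frequency domain
            out = np.fft.irfft(spec, n=L) * win   # inverse transform back to time domain
            y[i * Ss : i * Ss + L] += out         # splice-and-add at the synthesis shift
            norm[i * Ss : i * Ss + L] += win ** 2
        return y / np.maximum(norm, 1e-8)         # normalize the window overlap

With S_a = α · S_s, the output has roughly length(x)/α samples: S_a < S_s stretches the voice (slower speech), S_a > S_s compresses it (faster speech).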
The invention has the following advantages:
1. The method converts the voice into voices at different speech rates; these differ from the original voice in rate but belong to the same language, so the language information they contain is complementary. With a suitable rate change the voice still sounds natural, which means the training set also contains speech at such rates, so no mismatch between the test set and the training set is introduced.
2. Splicing voices at different speech rates can reduce the influence of the speaker. An ideal language identification feature would remove the interference of speaker information, channel information and background noise and extract only the differences between languages, but this is currently unachievable. Since different people speak at different rates, splicing voices at different rates approximates information from different speakers, and combining this information weakens speaker interference to a certain degree.
3. The method processes only the voice to be recognized and does not modify the training set, so the model need not be changed. Moreover, the method is applied only when the voice duration is too short, for example less than 10 seconds, so it adds almost no extra burden to the system, which is very important for a practical system.
Drawings
FIG. 1 is a flow chart of the short-time voice duration extension method applied to language identification according to the present invention;
FIG. 2 is a schematic diagram of generating voices at different speech rates according to the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
As shown in FIG. 1, the short-time voice duration extension method applied to language identification includes:
Step 1), for a voice x to be recognized with duration length(x), judge whether length(x) is less than a threshold T; if so, go to step 2); otherwise the voice needs no processing;
Step 2), determine the number n of voices at different speech rates to generate; n is determined by the duration of the input voice (the defining formula is reproduced in the original only as an image).
It can be seen from the calculation formula of n that the shorter the input voice, the more voices need to be generated.
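Since the defining formula for n survives only as an image, the following step function is purely illustrative: only its shape is taken from the text (fewer generated voices as the input grows, none at or above the threshold T), while the breakpoints and counts are invented for the example.

    def num_variants(duration_s, T=10.0):
        # Illustrative only: the breakpoints and counts below are assumptions.
        if duration_s >= T:
            return 0      # at or above the threshold no extension is needed
        if duration_s >= 5.0:
            return 2
        if duration_s >= 3.0:
            return 4
        return 6          # the shorter the input, the more voices are generated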
Step 3), with the synthesis frame shift fixed at S_s, select the values of the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n) according to the speech-rate change rate.
The speech-rate change rate α is defined as the ratio of the decomposition frame shift to the synthesis frame shift:
α = S_a / S_s.
Experimental verification shows that α preferably takes values in the range 0.7-1.3. The i-th decomposition frame shift is then calculated as
S_a^(i) = α_i · S_s, i = 1, 2, …, n,
where α_1, …, α_n are n distinct rate values in this range (the specific α_i schedule is reproduced in the original only as formula images).
In particular, if α = 1, the generated voice would have the same speech rate as the original voice, so this value is skipped.
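One possible schedule for the n rate values, consistent with the preferred range 0.7-1.3 and the exclusion of α = 1, is an even spacing; the patent's actual per-index formula survives only as images, so this spacing is an assumption:

    import numpy as np

    def rate_schedule(n):
        """n rate-change values spread evenly over [0.7, 1.3], excluding 1.0."""
        grid = np.linspace(0.7, 1.3, n + 1)   # one spare point so 1.0 can be dropped
        return [a for a in grid if not np.isclose(a, 1.0)][:n]

    # e.g. rate_schedule(4) -> [0.7, 0.85, 1.15, 1.3]; with Ss = 160 samples
    # (10 ms at 16 kHz, assumed), the decomposition shifts are Sa_i = alpha_i * Ss.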
Step 4), process the voice to be recognized with the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n) to generate n voices at different speech rates: x_1, x_2, …, x_n.
As shown in FIG. 2, the process of generating from the voice to be recognized a voice at a different speech rate specifically includes: decomposing the voice to be recognized by windowing and framing with frame length L and decomposition frame shift S_a; transforming each frame to the frequency domain with a short-time Fourier transform; and then, with frame length L and synthesis frame shift S_s, transforming the time-frequency signal back to the time domain by overlap-add, obtaining a voice at a different speech rate.
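In practice an off-the-shelf time-scale modifier can play the role of this analysis/synthesis loop, for instance librosa's phase-vocoder based stretcher, where rate > 1 speeds the voice up and rate < 1 slows it down; the file name and the four rate values here are only illustrative:

    import numpy as np
    import librosa

    x, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
    alphas = [0.7, 0.85, 1.15, 1.3]                  # assumed rate schedule
    variants = [librosa.effects.time_stretch(x, rate=a) for a in alphas]
    y = np.concatenate([x] + variants)               # spliced voice y = [x x_1 ... x_n]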
The frame shifts at decomposition and synthesis are unequal, the synthesis frame shift S_s being fixed. If the decomposition frame shift S_a is smaller than the synthesis frame shift S_s, the synthesized voice is slower than the original and its duration is longer; if S_a is larger than S_s, the synthesized voice is faster and its duration is shorter. The duration of the voice x_i after time-domain scaling is related to the duration of the original voice x by
length(x_i) = (S_s / S_a^(i)) · length(x) = length(x) / α_i.
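For example, a 5-second voice scaled with α = 0.7 yields about 5/0.7 ≈ 7.1 s of slower speech, while α = 1.3 yields about 5/1.3 ≈ 3.8 s of faster speech; splicing the original with just these two variants already gives roughly 5 + 7.1 + 3.8 ≈ 16 s.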
Step 5), splice the voice to be recognized with the n generated voices; the spliced voice y is:
y = [x x_1 … x_n].
When α takes values in the range 0.7-1.3, the recognition effect of the spliced voice y is best.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (1)

1. A short-time voice duration extension method applied to language identification comprises the following steps:
for a voice to be recognized with a short duration, first determining the number n of voices at different speech rates to generate according to the voice duration; then calculating the n decomposition frame shifts of the generated voices from the synthesis frame shift value and the n speech-rate change rates; generating n voices at different speech rates from the decomposition and synthesis frame shifts, and splicing the n voices at different speech rates with the original voice to generate a voice of extended duration;
the method specifically comprises the following steps:
step 1), for a voice x to be recognized with duration length(x), judging whether length(x) is less than a threshold T; if so, proceeding to step 2); otherwise the voice needs no processing;
step 2), determining the number n of voices at different speech rates to generate, n being determined by the duration of the input voice (formula reproduced in the original only as an image);
step 3), with the synthesis frame shift fixed at S_s, calculating from the speech-rate change rates α the values of the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n);
step 4), processing the voice to be recognized with the n decomposition frame shifts S_a^(1), S_a^(2), …, S_a^(n) to generate n voices at different speech rates: x_1, x_2, …, x_n;
step 5), splicing the voice to be recognized with the n generated voices, the spliced voice y being:
y = [x x_1 … x_n];
calculating the values of the n decomposition frame shifts S_a^(i) in said step 3) as follows:
the speech-rate change rate α is defined as the ratio of the decomposition frame shift to the synthesis frame shift,
α = S_a / S_s;
the i-th decomposition frame shift is
S_a^(i) = α_i · S_s, i = 1, 2, …, n,
where α_1, …, α_n are n distinct speech-rate change rates with α_i ≠ 1 (the specific values are reproduced in the original only as formula images);
the process in said step 4) of generating from the voice to be recognized a voice at a different speech rate specifically comprising: decomposing the voice to be recognized by windowing and framing with frame length L and decomposition frame shift S_a; transforming each frame to the frequency domain with a short-time Fourier transform; and transforming the time-frequency signal back to the time domain by overlap-add with frame length L and synthesis frame shift S_s, to obtain a voice at a different speech rate.
CN201610236672.1A 2016-04-15 2016-04-15 Short-time voice duration extension method applied to language identification Expired - Fee Related CN107305767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610236672.1A CN107305767B (en) 2016-04-15 2016-04-15 Short-time voice duration extension method applied to language identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610236672.1A CN107305767B (en) 2016-04-15 2016-04-15 Short-time voice duration extension method applied to language identification

Publications (2)

Publication Number Publication Date
CN107305767A CN107305767A (en) 2017-10-31
CN107305767B (en) 2020-03-17

Family

ID=60151327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610236672.1A Expired - Fee Related CN107305767B (en) 2016-04-15 2016-04-15 Short-time voice duration extension method applied to language identification

Country Status (1)

Country Link
CN (1) CN107305767B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109975762B (en) * 2017-12-28 2021-05-18 中国科学院声学研究所 Underwater sound source positioning method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1512485A (en) * 2002-12-31 2004-07-14 北京天朗语音科技有限公司 Voice identification system of voice speed adaption
JP3563772B2 (en) * 1994-06-16 2004-09-08 キヤノン株式会社 Speech synthesis method and apparatus, and speech synthesis control method and apparatus
CN1750122A (en) * 2005-11-07 2006-03-22 章森 Telescopic voice compression recovery technology based on extreme point
CN101645269A (en) * 2008-12-30 2010-02-10 中国科学院声学研究所 Language recognition system and method
CN101740034A (en) * 2008-11-04 2010-06-16 刘盛举 Method for realizing sound speed-variation without tone variation and system for realizing speed variation and tone variation

Also Published As

Publication number Publication date
CN107305767A (en) 2017-10-31

Similar Documents

Publication Publication Date Title
CN105788603B (en) A kind of audio identification methods and system based on empirical mode decomposition
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN110797002B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN103164403B (en) The generation method and system of video index data
Du et al. Speaker augmentation for low resource speech recognition
CN105118501A (en) Speech recognition method and system
Todkar et al. Speaker recognition techniques: A review
Delcroix et al. Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation
CN101887722A (en) Rapid voiceprint authentication method
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
Dua et al. Discriminative training using heterogeneous feature vector for Hindi automatic speech recognition system
CN111968622A (en) Attention mechanism-based voice recognition method, system and device
Zheng et al. Acoustic texttiling for story segmentation of spoken documents
CN101178895A (en) Model self-adapting method based on generating parameter listen-feel error minimize
CN107305767B (en) Short-time voice duration extension method applied to language identification
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
CN110197657A (en) A kind of dynamic speech feature extracting method based on cosine similarity
Koolagudi et al. Speaker recognition in the case of emotional environment using transformation of speech features
Zhao et al. Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding.
Tolba et al. A novel method for Arabic consonant/vowel segmentation using wavelet transform
Shahnawazuddin et al. Enhancing robustness of zero resource children's speech recognition system through bispectrum based front-end acoustic features
KR101361034B1 (en) Robust speech recognition method based on independent vector analysis using harmonic frequency dependency and system using the method
Miguel et al. Augmented state space acoustic decoding for modeling local variability in speech.
Du et al. Pan: Phoneme-aware network for monaural speech enhancement
CN112908340A (en) Global-local windowing-based sound feature rapid extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200317