CN107305767B - Short-time voice duration extension method applied to language identification - Google Patents
- Publication number
- CN107305767B CN107305767B CN201610236672.1A CN201610236672A CN107305767B CN 107305767 B CN107305767 B CN 107305767B CN 201610236672 A CN201610236672 A CN 201610236672A CN 107305767 B CN107305767 B CN 107305767B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G10L15/005 — Speech recognition; Language recognition
- G06F40/205 — Handling natural language data; Natural language analysis; Parsing
- G06F40/263 — Handling natural language data; Natural language analysis; Language identification
- G10L15/04 — Speech recognition; Segmentation; Word boundary detection
- G10L15/10 — Speech recognition; Speech classification or search using distance or distortion measures between unknown speech and reference templates
Abstract
The invention provides a short-time voice duration extension method applied to language identification, comprising the following steps: for a speech to be recognized whose duration is short, first determine the number n of speeches at different speech rates to be generated according to the speech duration; then calculate the n decomposition frame shifts from the fixed synthesis frame shift and the n speech-rate change rates; generate n speeches at different speech rates using the decomposition and synthesis frame shifts, and concatenate these n speeches with the original speech to produce a speech of longer duration. Because the language information carried by speeches at different rates is complementary, the method can markedly improve language identification performance on short-duration speech.
Description
Technical Field
The invention relates to the field of computer language identification, in particular to a short-time voice duration extension method applied to language identification.
Background
Language identification refers to the technique by which a computer automatically determines the language of a piece of speech. It is a key enabling technology for large-scale cross-language speech recognition applications such as spoken language translation and spoken document retrieval, and it is also a research hotspot for information extraction in the intelligence and security fields.
The speech to be recognized being too short is a common problem in research fields such as speaker recognition and language identification. In recent years there have been some targeted studies on recognizing short-duration speech. Reference [1] (A. K. Sarkar, D. Matrouf, P. M. Bousquet, and J.-F. Bonastre. Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012) studies how i-vector modeling behaves on short and duration-mismatched utterances for speaker verification.
Reference [2] (M. Wang, Y. Song, B. Jiang, L. Dai, and I. V. McLoughlin. Exemplar based language recognition method for short-duration speech segments. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 7354-7358) proposes first building a sample space for short-duration speech, in which the samples are obtained by clustering vectors of speech of different lengths. In the recognition stage, the short-duration speech is compared against all samples in the sample space, and the comparison information, such as cosine similarity, is used as a feature and fed to the back-end recognizer.
Reference [3] (S. Cumani, O. Plchot, and R. Fér. Exploiting i-vector posterior covariances for short-duration language recognition. In Proceedings of INTERSPEECH 2015, pages 1002-1006. International Speech Communication Association, 2015) applies the Probabilistic Linear Discriminant Analysis (PLDA) technique, commonly used in speaker recognition, to i-vectors for language identification.
Reference [4] (A. Lozano-Diez, R. Zazo-Candil, J. Gonzalez-Dominguez, D. T. Toledano, and J. Gonzalez-Rodriguez. An end-to-end approach to language identification in short utterances using convolutional neural networks. In INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015, pages 403-407) proposes using a convolutional neural network (CNN) for modeling.
Existing research on short-duration language identification has two problems: (1) to handle short speech, the complexity of the system is greatly increased, raising resource consumption; (2) the modification is made in the model part, so long-duration speech must also be processed with the same added complexity. In practice, a system often wants to apply special processing only to short-duration speech, without degrading recognition performance on long-duration speech.
Disclosure of Invention
The invention aims to overcome the poor language identification performance of current systems on short-duration speech, and provides a short-time voice duration extension method applied to language identification that directly extends the duration of the speech to be recognized using speech time-domain scaling (time-scale modification): for each speech to be recognized, several speeches at different speech rates are generated and then concatenated with the original speech to form a longer speech.
In order to achieve the above object, the present invention provides a short-time speech duration extension method applied to language identification, wherein the method comprises:
for a speech to be recognized whose duration is short, first determine the number n of speeches at different speech rates to be generated according to the speech duration; then calculate the n decomposition frame shifts from the fixed synthesis frame shift and the n speech-rate change rates; generate n speeches at different speech rates using the decomposition and synthesis frame shifts, and concatenate these n speeches with the original speech to produce a speech of longer duration.
In the above technical solution, the method specifically includes:
Step 1): for a speech x to be recognized with duration length(x), judge whether length(x) is less than a threshold T; if so, go to step 2); otherwise the speech does not need to be processed.
Step 2): determine the number n of speeches at different speech rates to be generated; n is determined from the duration of the input speech.
Step 3): fix the synthesis frame shift at S_s and calculate the n decomposition frame shifts S_a^(i) from the speech-rate change rates α_i.
Step 4): generate n speeches at different speech rates from the speech to be recognized using the n decomposition frame shifts S_a^(i): x_1, x_2, …, x_n.
Step 5): concatenate the speech to be recognized with the n generated speeches; the concatenated speech y is:
y = [x x_1 … x_n].
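The defining formulas for n, α, and the decomposition frame shifts appear only as images in the original patent and did not survive extraction. The qualitative relation given later in the description — a decomposition frame shift smaller than the synthesis frame shift yields slower, longer speech — is consistent with S_a^(i) = α_i · S_s, so steps 2)-3) can be sketched under that assumption (the formula and the example α values are assumptions, not the patent's exact definitions):

```python
# Sketch of step 3): derive the decomposition (analysis) frame shifts from a
# fixed synthesis frame shift and a set of speech-rate change rates.
# S_a = alpha * S_s is an assumed reconstruction, not a quoted formula.

def decomposition_frame_shifts(synthesis_shift, alphas):
    """Return one decomposition frame shift per speech-rate change rate alpha.

    alpha < 1  -> S_a < S_s -> synthesized speech is slower and longer;
    alpha > 1  -> S_a > S_s -> synthesized speech is faster and shorter.
    """
    return [round(alpha * synthesis_shift) for alpha in alphas]

# Hypothetical example: synthesis shift of 160 samples (10 ms at 16 kHz) and
# four rate factors spread over the preferred 0.7-1.3 range, skipping
# alpha = 1, which would merely reproduce the original speech.
alphas = [0.7, 0.85, 1.15, 1.3]
shifts = decomposition_frame_shifts(160, alphas)
print(shifts)  # -> [112, 136, 184, 208]
```

With α < 1 the analysis hop is shorter than the synthesis hop, so frames are laid down farther apart than they were sampled and the output is stretched; α > 1 compresses it.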
In the above technical solution, calculating the n decomposition frame shifts S_a^(i) in step 3) comprises: defining the speech-rate change rate α_i, and then calculating the i-th decomposition frame shift S_a^(i) from α_i and the synthesis frame shift S_s.
In the above technical solution, generating a speech at a different speech rate from the speech to be recognized in step 4) specifically comprises: decomposing the speech to be recognized by windowing and framing with frame length L and decomposition frame shift S_a^(i); transforming each frame signal to the frequency domain with the short-time Fourier transform; and then inverse-transforming the time-frequency signal back to the time domain by overlap-add with frame length L and synthesis frame shift S_s, obtaining a speech at a different speech rate.
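The generation step above analyzes frames at hop S_a, transforms them with the short-time Fourier transform, and resynthesizes by overlap-add at hop S_s. The sketch below illustrates only the role of the two unequal frame shifts with a plain time-domain overlap-add — the STFT stage and its phase handling are omitted, so this shows the frame bookkeeping, not the patent's exact synthesis:

```python
import numpy as np

# Simplified overlap-add time-scale modification: frames are taken every
# analysis_shift samples but laid back down every synthesis_shift samples,
# so the output duration scales by roughly S_s / S_a.

def time_scale(x, frame_len, analysis_shift, synthesis_shift):
    window = np.hanning(frame_len)
    n_frames = max(1, (len(x) - frame_len) // analysis_shift + 1)
    out_len = (n_frames - 1) * synthesis_shift + frame_len
    out = np.zeros(out_len)
    norm = np.zeros(out_len)  # accumulated window energy, for normalization
    for i in range(n_frames):
        frame = x[i * analysis_shift : i * analysis_shift + frame_len] * window
        out[i * synthesis_shift : i * synthesis_shift + frame_len] += frame
        norm[i * synthesis_shift : i * synthesis_shift + frame_len] += window
    return out / np.maximum(norm, 1e-8)

x = np.random.randn(16000)           # 1 s of audio at 16 kHz (illustrative)
slow = time_scale(x, 400, 112, 160)  # S_a < S_s: longer output, slower speech
fast = time_scale(x, 400, 208, 160)  # S_a > S_s: shorter output, faster speech
print(len(slow) > len(x) > len(fast))  # -> True
```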
The invention has the following advantages:
1. The method converts the speech into speeches at different speech rates; these differ from the original speech in rate but belong to the same language, so the language information they contain is complementary. With an appropriate rate change the speech still sounds natural, which means the training set also contains speech at comparable rates, so no mismatch between the test set and the training set is introduced.
2. Concatenating speeches at different rates can reduce the influence of the speaker. An ideal language identification feature would remove interference from speaker information, channel-related information, and background noise, extracting only the differences between languages, but this is currently unachievable. Because different people speak at different rates, concatenating speeches at different rates resembles combining information from different speakers, which weakens speaker interference to some degree.
3. The method only processes the speech to be recognized and does not modify the training-set speech, so the model need not be changed. Moreover, the method is applied only when the speech duration is too short, for example less than 10 seconds, so the system incurs almost no extra burden, which is very important for a practical system.
Drawings
FIG. 1 is a flow chart of the short-time voice duration extension method applied to language identification according to the present invention;
FIG. 2 is a diagram illustrating the generation of speech at different speech rates according to the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
As shown in fig. 1, the short-time voice duration extension method applied to language identification comprises:
Step 1): for a speech x to be recognized with duration length(x), judge whether length(x) is less than a threshold T; if so, go to step 2); otherwise the speech does not need to be processed.
Step 2): determine the number n of speeches at different speech rates to be generated; n is determined from the duration of the input speech.
It can be seen from the formula for n that the shorter the input speech, the more speeches need to be generated.
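The formula for n is an image in the original patent; the text states only that shorter input requires generating more speeches. A purely hypothetical rule with that monotone property, using the 10-second threshold mentioned in the description:

```python
# Hypothetical choice of n as a function of input duration. The real formula
# is not given in the extracted text; this only reproduces the stated
# property that shorter speech yields more generated variants.

def num_generated(duration_s, threshold_s=10):
    if duration_s >= threshold_s:
        return 0  # long enough: no extension needed
    # one extra variant per missing ~2 s (hypothetical rate)
    return int((threshold_s - duration_s + 1.999) // 2)

print([num_generated(d) for d in (1, 3, 5, 9, 10)])  # -> [5, 4, 3, 1, 0]
```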
Step 3), the composite frame shift is fixed as SsSelecting n decomposition frame shifts S according to the rate of change of speech rateaThe value of (c):
the rate of change of speech α is defined as:
through experimental verification, preferably, α value range is 0.7-1.3, the ith decomposition frame is shifted by SaIs calculated as follows:
specifically, if α is 1, the speech rate of the generated speech is the same as the original speech, and the speech does not need to be generated.
Step 4): generate n speeches at different speech rates from the speech to be recognized using the n decomposition frame shifts S_a^(i): x_1, x_2, …, x_n.
As shown in fig. 2, generating a speech at a different speech rate from the speech to be recognized specifically comprises: decomposing the speech to be recognized by windowing and framing with frame length L and decomposition frame shift S_a^(i); transforming each frame signal to the frequency domain with the short-time Fourier transform; and then inverse-transforming the time-frequency signal back to the time domain by overlap-add with frame length L and synthesis frame shift S_s, obtaining a speech at a different speech rate.
The frame shifts used for decomposition and for synthesis are unequal, and the synthesis frame shift S_s is fixed. If the decomposition frame shift S_a is smaller than the synthesis frame shift S_s, the synthesized speech is slower than the original speech and its duration is longer; if S_a is larger than S_s, the synthesized speech is faster than the original speech and its duration is shorter. The duration of a speech x_i after time-scale modification is thus determined by the duration of the original speech x and the rate change.
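The duration relation stated above can be made concrete: overlap-add analysis at hop S_a and synthesis at hop S_s scales the duration by approximately S_s / S_a, i.e. by 1/α under the assumed S_a = α · S_s (this ratio is reconstructed from the qualitative statement, not quoted from the patent):

```python
# Approximate duration of a rate-modified speech, assuming the duration
# scales by the ratio of synthesis to analysis frame shift.

def stretched_duration(orig_seconds, analysis_shift, synthesis_shift):
    return orig_seconds * synthesis_shift / analysis_shift

print(stretched_duration(5.0, 112, 160))  # alpha = 0.7 -> about 7.14 s (slower)
print(stretched_duration(5.0, 208, 160))  # alpha = 1.3 -> about 3.85 s (faster)
```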
Step 5): concatenate the speech to be recognized with the n generated speeches; the concatenated speech y is:
y = [x x_1 … x_n].
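Step 5) itself is plain waveform concatenation: the original speech followed by the n rate-modified versions, giving one longer utterance for the language identification front end. A tiny sketch with illustrative (assumed) durations:

```python
import numpy as np

# Concatenate the original speech with its rate-modified variants.
# All durations below are illustrative placeholders, not values from the patent.
x = np.zeros(16000)                            # original: 1 s at 16 kHz
variants = [np.zeros(22640), np.zeros(12400)]  # e.g. a slower and a faster copy
y = np.concatenate([x] + variants)
print(len(y) / 16000)  # total duration in seconds -> 3.19
```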
When α ranges over 0.7 to 1.3, the recognition effect of the concatenated speech y is best.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (1)
1. A short-time voice duration extension method applied to language identification, comprising the following steps:
for a speech to be recognized whose duration is short, first determining the number n of speeches at different speech rates to be generated according to the speech duration; then calculating the n decomposition frame shifts from the fixed synthesis frame shift and the n speech-rate change rates; generating n speeches at different speech rates using the decomposition and synthesis frame shifts, and concatenating these n speeches with the original speech to produce a speech of lengthened duration;
the method specifically comprising:
step 1): for a speech x to be recognized with duration length(x), judging whether length(x) is less than a threshold T; if so, going to step 2); otherwise the speech does not need to be processed;
step 2): determining the number n of speeches at different speech rates to be generated, n being determined from the duration of the input speech;
step 3): fixing the synthesis frame shift at S_s and calculating the n decomposition frame shifts S_a^(i) from the speech-rate change rates α_i;
step 4): generating n speeches at different speech rates from the speech to be recognized using the n decomposition frame shifts S_a^(i): x_1, x_2, …, x_n;
step 5): concatenating the speech to be recognized with the n generated speeches, the concatenated speech y being:
y = [x x_1 … x_n];
wherein calculating the n decomposition frame shifts S_a^(i) in step 3) comprises: defining the speech-rate change rate α_i and calculating the i-th decomposition frame shift S_a^(i) from α_i and the synthesis frame shift S_s;
and wherein generating a speech at a different speech rate from the speech to be recognized in step 4) specifically comprises: decomposing the speech to be recognized by windowing and framing with frame length L and decomposition frame shift S_a^(i); transforming each frame signal to the frequency domain with the short-time Fourier transform; and then inverse-transforming the time-frequency signal back to the time domain by overlap-add with frame length L and synthesis frame shift S_s, obtaining a speech at a different speech rate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610236672.1A CN107305767B (en) | 2016-04-15 | 2016-04-15 | Short-time voice duration extension method applied to language identification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107305767A CN107305767A (en) | 2017-10-31 |
CN107305767B true CN107305767B (en) | 2020-03-17 |
Family
ID=60151327
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610236672.1A Expired - Fee Related CN107305767B (en) | 2016-04-15 | 2016-04-15 | Short-time voice duration extension method applied to language identification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107305767B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109975762B (en) * | 2017-12-28 | 2021-05-18 | 中国科学院声学研究所 | Underwater sound source positioning method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1512485A (en) * | 2002-12-31 | 2004-07-14 | 北京天朗语音科技有限公司 | Voice identification system of voice speed adaption |
JP3563772B2 (en) * | 1994-06-16 | 2004-09-08 | キヤノン株式会社 | Speech synthesis method and apparatus, and speech synthesis control method and apparatus |
CN1750122A (en) * | 2005-11-07 | 2006-03-22 | 章森 | Telescopic voice compression recovery technology based on extreme point |
CN101645269A (en) * | 2008-12-30 | 2010-02-10 | 中国科学院声学研究所 | Language recognition system and method |
CN101740034A (en) * | 2008-11-04 | 2010-06-16 | 刘盛举 | Method for realizing sound speed-variation without tone variation and system for realizing speed variation and tone variation |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200317 |