CN113112990A - Language identification method of variable-duration voice based on spectrum envelope diagram - Google Patents
- Publication number
- CN113112990A (application CN202110238968.8A)
- Authority
- CN
- China
- Prior art keywords
- voice
- short
- spectrum envelope
- language
- language identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
Abstract
The invention relates to a language identification method for variable-duration speech based on spectral envelope images, and belongs to the technical field of audio signal processing. A speech signal is first divided into short-time segments of t seconds each. The spectral envelope image of each short-time segment is then extracted, and the images are divided into a training set and a test set at a ratio of a:b. The training set is fed into a residual network for training, and the language identification model with the highest recognition rate is selected by adjusting model parameters and repeatedly evaluating on the test set. When the speech to be identified is longer than t seconds, it is divided into several t-second short-time segments, and the language of the whole utterance is determined by aggregating the recognition results of those segments. The invention not only speeds up language identification but also identifies the languages of speech of differing durations while maintaining high accuracy.
Description
Technical Field
The invention relates to a language identification method for variable-duration speech based on spectral envelope images, and belongs to the technical field of audio signal processing.
Background
Surveys indicate that more than 6,900 distinct languages exist around the world, and classifying them entirely by hand would be prohibitively laborious. Many language translation tools exist, and their front ends rely on language identification technology. When the utterances to be identified vary in duration, the training set cannot contain speech signals of every duration, so identification performance on speech of differing durations drops markedly. At present, the most widely used remedy is to change the playback speed of the speech so as to normalize its duration; although this keeps the frequency-domain characteristics roughly unchanged, it distorts the time-domain characteristics considerably, and the resulting identification performance is unsatisfactory. Language identification of variable-duration speech therefore remains an open research problem.
Disclosure of Invention
The invention provides a language identification method for variable-duration speech based on spectral envelope images, aimed at the sharp drop in identification performance that occurs when the utterances to be identified differ in duration.
The technical scheme of the invention is as follows: the language identification method for variable-duration speech based on spectral envelope images comprises two parts. The first part performs language identification on short speech (1 second in duration); the second part regularizes the duration of the long speech to be identified, i.e., divides it into several short speech segments of 1 second each.
In the language identification of the first part, the spectral envelope image of the speech signal serves as the feature input to the language identification system. The extraction process consists of framing the speech signal, windowing, and homomorphic processing to obtain the spectral envelope of each frame; the per-frame envelopes are then stacked row by row to form the spectral envelope image of one short speech segment (1 second in duration). The envelope images are divided into a training set and a test set at a ratio of 4:1; the training set is used to fit a residual network into a language identification model, and the test set is used to evaluate the generated models so that the one with the best identification performance can be selected.
In the second part, long speech of arbitrary duration is divided into several 1-second short segments, each segment is fed into the language identification system for testing, and the language of the long utterance is determined by aggregating the identification results of its segments.
The method comprises the following specific steps:
step 1: and dividing long-segment voice signals of different languages into short-time voice with shorter time length, and defining the time length of the short-time voice signals as t seconds.
Step 2: and performing framing and windowing functions on the short-time voice with the time length of t seconds, and then solving the spectrum envelope of each frame of the short-time voice with the time length of t seconds through homomorphic processing.
Step 3: the spectrum envelopes of each frame signal of the same short-time speech are arranged and combined according to rows, a spectrum envelope graph corresponding to each section of speech is drawn, the horizontal axis represents a spectrum, and the vertical axis represents a time domain.
Step 4: and filtering the generated spectrum envelope diagram, removing high-frequency and low-frequency parts of the voice signal, and reserving a middle-frequency part of the voice signal to enable the frequency of the middle-frequency part to be in the range of 500HZ to 3000 HZ.
Step 5: and (3) according to the spectrum envelope diagram of each language as N: and m is distributed into a training set and a testing set, and labels of corresponding languages are marked.
Step 6: and fitting the training set to a residual error network, training by adjusting parameters to obtain different language identification models, testing the language identification models by using the test set, and selecting the language identification model with the highest language identification rate.
Step 7: when the time lengths of the voices to be detected are unequal, the voice signals are divided into a plurality of short-time voice signals, the time length is t seconds, a plurality of phrase voices obtained by dividing each long voice are fitted into a language recognition model in Step6, and the language of the long-time voice is judged by counting the recognition conditions of the short voices.
The invention has the following beneficial effects: the method not only speeds up language identification but also identifies the languages of speech of differing durations while maintaining high accuracy.
Drawings
FIG. 1 is a block diagram of the overall architecture of the present invention;
FIG. 2 is a flow chart of spectral envelope extraction according to the present invention;
FIG. 3 is a graph of vocal tract impulse response (spectral envelope) and glottal excitation pulses of the present invention;
FIG. 4 is a graphical depiction of a spectral envelope of the present invention;
FIG. 5 is a graph of a filtered spectral envelope of the present invention;
FIG. 6 is a diagram illustrating long speech segmentation according to the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
As shown in fig. 1, the language identification method for variable-duration speech based on spectral envelope images comprises two parts: the first part performs language identification on short speech (1 second in duration), and the second part regularizes the duration of the long speech to be identified, i.e., divides it into several short speech segments of 1 second each.
In the language identification of the first part, the spectral envelope image of the speech signal serves as the feature input to the language identification system. The extraction process consists of framing the speech signal, windowing, and homomorphic processing to obtain the spectral envelope of each frame; the per-frame envelopes are then stacked row by row to form the spectral envelope image of one short speech segment (1 second in duration). The envelope images are divided into a training set and a test set at a ratio of 4:1; the training set is used to fit a residual network into a language identification model, and the test set is used to evaluate the generated models so that the one with the best identification performance can be selected.
In the second part, long speech of arbitrary duration is divided into several 1-second short segments, each segment is fed into the language identification system for testing, and the language of the long utterance is determined by aggregating the identification results of its segments.
The method specifically comprises the following steps:
step 1: generating voice data;
Audio recordings of eight languages of equal duration were downloaded from broadcasting stations and split by a program into short clips of t seconds each. The speech files were transcoded to WAV with Adobe Audition, at a 16 kHz sampling rate, single channel (mono). The languages are Mandarin, Korean, Tibetan, Burmese, Khmer (Cambodian), Lao, and Vietnamese.
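The division of a long recording into t-second clips can be sketched as follows (a minimal Python sketch; the function name and the synthetic signal are illustrative, not part of the invention):

```python
import numpy as np

def split_into_segments(samples, sample_rate, t=1.0):
    """Split a mono speech signal into consecutive segments of t seconds.

    Trailing samples that do not fill a whole segment are dropped,
    mirroring the division of long recordings into t-second clips
    (16 kHz, mono, WAV) described above.
    """
    seg_len = int(round(t * sample_rate))
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

# 3.5 seconds of synthetic 16 kHz audio -> three full 1-second segments
sr = 16000
audio = np.random.randn(int(3.5 * sr)).astype(np.float32)
segments = split_into_segments(audio, sr, t=1.0)
print(len(segments), len(segments[0]))
```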
Step 2: extracting a spectrum envelope;
Speech is a complex multi-frequency signal whose frequency components have different amplitudes. When the components are arranged in order of frequency, the curve connecting their amplitude peaks is the spectral envelope referred to in this invention. The shape of the envelope differs between utterances, and because pronunciation differs across languages, the spectral envelope is proposed here as an acoustic feature for language identification. In the extraction of the spectral envelope, homomorphic filtering is the most important step: it separates the vocal tract impulse response from the glottal excitation pulses in the speech signal, and in this invention the vocal tract impulse response so obtained is the spectral envelope we require. The extraction process is shown in fig. 2.
Step2.1: framing;
The speech signal of 1-second duration is divided into short-time frames, each of which can be regarded as a stationary, time-invariant signal. During framing, the overlapping portion between two adjacent frames is called the frame shift, and its length is generally half the frame length.
Step2.2: windowing;
A rectangular, Hamming, or Hanning window is applied to each frame; different window functions yield different bandwidths and amounts of spectral leakage. The Hamming window is chosen in this invention.
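The framing and Hamming-windowing steps above can be sketched as follows (the frame length and hop are illustrative choices, not values fixed by the patent):

```python
import numpy as np

def frame_and_window(signal, frame_len, hop=None):
    """Cut a clip into overlapping frames and apply a Hamming window.

    The frame shift (hop) defaults to half the frame length, as
    described in the framing step above.
    """
    if hop is None:
        hop = frame_len // 2
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

sr = 16000
clip = np.random.randn(sr)                      # 1 second of audio
frames = frame_and_window(clip, frame_len=512)  # 512 samples = 32 ms frames
print(frames.shape)
```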
Step2.3: homomorphic processing;
After framing and windowing, each frame of the speech signal can be expressed as a convolution:

x(n) = x1(n) * x2(n)    (1)

In formula (1), x1(n) and x2(n) denote the vocal tract impulse response (spectral envelope) and the glottal excitation signal, respectively.

The convolution is transformed into a multiplicative signal by the discrete Fourier transform:

DFT[x(n)] = DFT[x1(n) * x2(n)] = X1(k) · X2(k) = X(k)    (2)

In formula (2), DFT[·] is the discrete Fourier transform and X(k) is the transformed multiplicative signal.

Taking the logarithm converts the multiplicative signal into an additive one:

X̂(k) = log X(k) = log X1(k) + log X2(k)    (3)

In formula (3), X̂(k) is the logarithmic spectrum of the signal x(n); the spectral envelope proposed in this invention refers to its smooth component.

An inverse discrete Fourier transform DFT⁻¹[·] of the log spectrum yields the cepstrum:

x̂(n) = DFT⁻¹[X̂(k)]    (4)

In the cepstral domain, the vocal-tract component x̂1(n) and the excitation component x̂2(n) are separated at a specific quefrency m, the part below m being kept as x̂1(n). A discrete Fourier transform is then applied to x̂1(n), and the real part of the result gives the spectral envelope of the frame:

Y(k) = Re{DFT[x̂1(n)]}    (5)

Y(k) is the spectral envelope of the signal. Fig. 3 shows the vocal tract impulse response (spectral envelope) and the glottal excitation pulses of one frame of a speech signal.
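The homomorphic processing chain above (DFT, logarithm, inverse DFT, separation at quefrency m, DFT of the low-quefrency part) can be sketched in Python; the FFT size and the cut-off quefrency below are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def spectral_envelope(frame, lifter_cutoff=30, n_fft=512):
    """Homomorphic estimate of one frame's spectral envelope.

    DFT -> log magnitude -> inverse DFT (cepstrum) -> keep only the
    quefrencies below `lifter_cutoff` -> DFT back; the real part of the
    result is the smooth spectral envelope of the frame.
    """
    spectrum = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-10)    # guard against log(0)
    cepstrum = np.fft.irfft(log_mag, n_fft)
    lifter = np.zeros(n_fft)
    lifter[:lifter_cutoff] = 1.0
    lifter[-lifter_cutoff + 1:] = 1.0             # keep symmetric low quefrencies
    envelope = np.fft.rfft(cepstrum * lifter, n_fft).real
    return envelope                               # length n_fft // 2 + 1

frame = np.hamming(512) * np.random.randn(512)
env = spectral_envelope(frame)
print(env.shape)
```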
Step 3: drawing a spectrum envelope graph;
The spectral envelopes of all frames of the same short-time speech segment are stacked row by row, and a spectral envelope image is drawn for each segment with a Python program. The horizontal axis of the image represents frequency and the vertical axis represents time; an example is shown in fig. 4.
Step 4: filtering the frequency spectrum;
To eliminate the influence on the identification result of differences in frequency range between languages, the high- and low-frequency parts of the spectral envelope image are removed, and only the envelope of the speech signal between 500 Hz and 3000 Hz is retained. Fig. 5 shows the filtered spectral envelope image of a t-second short-time speech segment.
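Band-limiting the envelope to 500–3000 Hz amounts to keeping only the frequency bins in that range; a sketch, assuming a 16 kHz sampling rate and uniformly spaced bins from 0 Hz to the Nyquist frequency:

```python
import numpy as np

def band_limit(envelope, sample_rate=16000, f_lo=500.0, f_hi=3000.0):
    """Keep only the 500-3000 Hz portion of a spectral envelope.

    `envelope` is one frame's envelope over n_fft // 2 + 1 uniformly
    spaced frequency bins from 0 Hz to sample_rate / 2.
    """
    freqs = np.linspace(0.0, sample_rate / 2, num=len(envelope))
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return envelope[mask], freqs[mask]

env = np.random.randn(257)          # envelope for n_fft = 512 at 16 kHz
band, freqs = band_limit(env)
print(freqs[0], freqs[-1])
```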
Step 5: generating a training set and a test set;
The spectral envelope images of each language are divided into a training set and a test set at a ratio of 4:1 and labeled with the corresponding language; the training images of all languages are then gathered into one training-set file, and the test images into one test-set file.
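The 4:1 split can be sketched as follows (the file names and the shuffling seed are purely illustrative):

```python
import random

def split_4_to_1(image_paths, seed=0):
    """Shuffle one language's envelope images and split them 4:1.

    Returns (train, test) lists of paths; shuffling before the split
    avoids ordering bias in the recordings.
    """
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = len(paths) * 4 // 5
    return paths[:cut], paths[cut:]

images = [f"mandarin_{i:04d}.png" for i in range(100)]
train, test = split_4_to_1(images)
print(len(train), len(test))
```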
Step 6: generating a language identification model;
The training set is fitted to a residual network; different language identification models are obtained by adjusting parameters, and the test set is used to evaluate them and select the model with the highest language identification rate.
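The defining ingredient of a residual network is the skip connection. A minimal fully-connected residual block illustrates the idea; the patent itself trains a convolutional residual network on envelope images, so the dense form, the shapes, and the weights here are illustrative assumptions only:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Forward pass of a simple fully-connected residual block:
    y = relu(x + W2 · relu(W1 · x)).

    The skip connection adds the input back to the transformed signal,
    so the layers learn a residual on top of the identity mapping,
    which is what makes deep residual networks trainable.
    """
    return relu(x + w2 @ relu(w1 @ x))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
w1 = rng.standard_normal((64, 64)) * 0.01
w2 = rng.standard_normal((64, 64)) * 0.01
y = residual_block(x, w1, w2)
print(y.shape)
```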
Step 7: language identification of variable duration voice;
When the utterances to be identified differ in duration, each speech signal is divided into several 1-second short-time segments:

A = [a1 a2 a3 ... aN]    (9)

In formula (9), a1, a2, a3, ... are the 1-second short-time segments and A is a speech signal longer than 1 second. During segmentation, the duration T of the speech signal is measured first, and the number N of 1-second segments is determined by rounding T. The overlap between adjacent segments is then

S = (N − T) / (N − 1)    (10)

In formula (10), S is the overlap length of two adjacent short segments in the segmentation of the long speech: when S > 0, |S| is the overlap duration of two adjacent segments; when S < 0, |S| is the gap between two adjacent segments. The segmentation is illustrated in fig. 6.
After a long speech of duration T has been divided into N short segments, the probability statistics can be expressed by formula (11):

P = [p1 p2 p3 ... pN]    (11)

In formula (11), pi denotes the recognition probabilities of the i-th segment.
Eight languages are considered in this embodiment. When a long utterance is judged, the probabilities of each language are summed over all segments, and the language with the highest total probability is taken as the language of the long utterance.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.
Claims (1)
1. A language identification method of variable duration voice based on a spectrum envelope diagram is characterized in that:
Step 1: dividing long speech signals of different languages into short-time segments, and defining the duration of a short-time segment as t seconds;
Step 2: framing and windowing the short-time speech, and then obtaining the spectral envelope of each frame of the t-second short-time speech;
Step 3: combining the spectral envelopes of all frames of the same short-time speech and drawing a spectral envelope image corresponding to each speech segment;
Step 4: filtering the generated spectral envelope image so that its frequency range lies within 500 Hz to 3000 Hz;
Step 5: dividing the spectral envelope images of each language into a training set and a test set at a ratio of n:m, and labeling them with the corresponding languages;
Step 6: fitting the training set to a residual network, obtaining different language identification models by adjusting parameters, testing them with the test set, and selecting the model with the highest language identification rate;
Step 7: when the utterances to be identified differ in duration, dividing each into several t-second short-time segments, feeding the segments of each long utterance into the model selected in Step 6, and determining the language of the long utterance by aggregating the identification results of its segments.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110238968.8A CN113112990A (en) | 2021-03-04 | 2021-03-04 | Language identification method of variable-duration voice based on spectrum envelope diagram |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110238968.8A CN113112990A (en) | 2021-03-04 | 2021-03-04 | Language identification method of variable-duration voice based on spectrum envelope diagram |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113112990A true CN113112990A (en) | 2021-07-13 |
Family
ID=76710142
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110238968.8A Pending CN113112990A (en) | 2021-03-04 | 2021-03-04 | Language identification method of variable-duration voice based on spectrum envelope diagram |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113112990A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1971711A (en) * | 2005-06-28 | 2007-05-30 | 哈曼贝克自动系统-威美科公司 | System for adaptive enhancement of speech signals |
CN101494049A (en) * | 2009-03-11 | 2009-07-29 | 北京邮电大学 | Method for extracting audio characteristic parameter of audio monitoring system |
CN106653055A (en) * | 2016-10-20 | 2017-05-10 | 北京创新伙伴教育科技有限公司 | On-line oral English evaluating system |
CN108091330A (en) * | 2017-12-13 | 2018-05-29 | 北京小米移动软件有限公司 | Output sound intensity adjusting method, device, electronic equipment and storage medium |
CN109910739A (en) * | 2018-12-21 | 2019-06-21 | 嘉兴智驾科技有限公司 | A method of judging that intelligent driving is alarmed by voice recognition |
CN110827793A (en) * | 2019-10-21 | 2020-02-21 | 成都大公博创信息技术有限公司 | Language identification method |
Non-Patent Citations (5)
Title |
---|
Liu Ruishan, "Research on Text-Independent Language Identification", China Master's Theses Full-text Database (Information Science and Technology) *
Sun Le, "Research on Language Distance Relationships Based on a Language Identification System", China Master's Theses Full-text Database (Information Science and Technology) *
Xu Kehu, "Intelligent Computing Methods and Their Applications", 30 July 2019 *
Du Xin, "Research on Language Identification Algorithms for Telephone Speech", China Master's Theses Full-text Database (Information Science and Technology) *
Wang Chengyong et al., "Multimedia Technology and Applications", 30 April 2005 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210713 |