CN113112990A - Language identification method of variable-duration voice based on spectrum envelope diagram - Google Patents

Language identification method of variable-duration voice based on spectrum envelope diagram

Info

Publication number
CN113112990A
CN113112990A (application CN202110238968.8A)
Authority
CN
China
Prior art keywords
voice
short
spectrum envelope
language
language identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110238968.8A
Other languages
Chinese (zh)
Inventor
龙华
王瑶
邵玉斌
杜庆治
王延凯
陈亮
唐维康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202110238968.8A priority Critical patent/CN113112990A/en
Publication of CN113112990A publication Critical patent/CN113112990A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling

Abstract

The invention relates to a language identification method for variable-duration speech based on spectrum envelope diagrams, belonging to the technical field of audio signal processing. First, the speech signal is divided into short-time segments of t seconds each; the spectrum envelope diagram of each short-time segment is then extracted, and the diagrams are distributed into a training set and a test set in the ratio a:b. The training set is fed to a residual network for training, and the language identification model with the highest recognition rate is selected by adjusting model parameters and repeatedly evaluating on the test set. When the speech to be identified is longer than t seconds, it is divided into several t-second short-time segments, and the language of the whole long utterance is judged by aggregating the recognition results of its short segments. The invention not only accelerates language identification but also identifies the language of speech of differing durations while maintaining high accuracy.

Description

Language identification method of variable-duration voice based on spectrum envelope diagram
Technical Field
The invention relates to a language identification method of variable-duration voice based on a spectrum envelope diagram, belonging to the technical field of audio signal processing.
Background
According to surveys, more than 6,900 different languages exist around the world, and classifying them entirely by human labor would be prohibitively complex. Many language translation tools exist, and their front ends rely on language identification technology. When the utterances to be identified have unequal durations, the training set cannot contain speech signals of every duration, so identification accuracy on speech of unseen durations drops markedly. At present, the most widely used remedy is to change the playback speed of the speech to adjust its duration; although this keeps the frequency-domain characteristics largely unchanged, it alters the time-domain characteristics considerably, and the identification results remain unsatisfactory. There is therefore still considerable room for research on language identification of variable-duration speech.
Disclosure of Invention
The invention provides a language identification method of variable-duration speech based on a spectrum envelope diagram, which is used for solving the problem that the language identification effect is sharply reduced when the durations of the speech to be detected are unequal.
The technical scheme of the invention is as follows: a language identification method of variable-duration voice based on a spectrum envelope diagram mainly comprises two parts, wherein the first part is used for language identification of short voice (duration is 1 second), and the second part is used for regular duration of long voice to be detected, namely, the long voice is divided into a plurality of phrase voice signals with duration of 1 second.
In the first part, the language identification of short speech, the spectrum envelope image of the speech signal serves as the feature input of the language identification system. The extraction process mainly comprises framing the speech signal, windowing, and homomorphic processing to obtain the spectral envelope of each frame of the short speech; the envelopes are then stitched together row by row to form the spectrum envelope image of one short speech segment (duration 1 second). The spectrum envelope images are distributed into a training set and a test set in the ratio 4:1; the training set is fed to a residual network to form a language identification model, and the test set is used to evaluate the generated models and select the one with the best recognition performance.
In the second part, long speech of varying duration is divided into several short speech segments (duration 1 second); these segments are fed to the language identification system for testing, and the language of the long speech is decided by aggregating the recognition results of the short segments.
The method comprises the following specific steps:
step 1: and dividing long-segment voice signals of different languages into short-time voice with shorter time length, and defining the time length of the short-time voice signals as t seconds.
Step 2: and performing framing and windowing functions on the short-time voice with the time length of t seconds, and then solving the spectrum envelope of each frame of the short-time voice with the time length of t seconds through homomorphic processing.
Step 3: the spectrum envelopes of each frame signal of the same short-time speech are arranged and combined according to rows, a spectrum envelope graph corresponding to each section of speech is drawn, the horizontal axis represents a spectrum, and the vertical axis represents a time domain.
Step 4: and filtering the generated spectrum envelope diagram, removing high-frequency and low-frequency parts of the voice signal, and reserving a middle-frequency part of the voice signal to enable the frequency of the middle-frequency part to be in the range of 500HZ to 3000 HZ.
Step 5: and (3) according to the spectrum envelope diagram of each language as N: and m is distributed into a training set and a testing set, and labels of corresponding languages are marked.
Step 6: and fitting the training set to a residual error network, training by adjusting parameters to obtain different language identification models, testing the language identification models by using the test set, and selecting the language identification model with the highest language identification rate.
Step 7: when the time lengths of the voices to be detected are unequal, the voice signals are divided into a plurality of short-time voice signals, the time length is t seconds, a plurality of phrase voices obtained by dividing each long voice are fitted into a language recognition model in Step6, and the language of the long-time voice is judged by counting the recognition conditions of the short voices.
The invention has the beneficial effects that: the method can not only accelerate the speed of language identification, but also identify the languages under the condition that the time lengths of the voices to be detected are different, and can ensure higher accuracy.
Drawings
FIG. 1 is a block diagram of the overall architecture of the present invention;
FIG. 2 is a flow chart of spectral envelope extraction according to the present invention;
FIG. 3 is a graph of vocal tract impulse response (spectral envelope) and glottal excitation pulses of the present invention;
FIG. 4 is a graphical depiction of a spectral envelope of the present invention;
FIG. 5 is a graph of a filtered spectral envelope of the present invention;
FIG. 6 is a diagram illustrating long speech segmentation according to the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
As shown in fig. 1, the language identification method of variable-duration speech based on a spectrum envelope diagram mainly comprises two parts: the first part is language identification of short speech (duration 1 second), and the second part is duration regularization of the long speech to be identified, i.e., dividing it into several short speech signals of 1 second each.
In the first part, the language identification of short speech, the spectrum envelope image of the speech signal serves as the feature input of the language identification system. The extraction process mainly comprises framing the speech signal, windowing, and homomorphic processing to obtain the spectral envelope of each frame of the short speech; the envelopes are then stitched together row by row to form the spectrum envelope image of one short speech segment (duration 1 second). The spectrum envelope images are distributed into a training set and a test set in the ratio 4:1; the training set is fed to a residual network to form a language identification model, and the test set is used to evaluate the generated models and select the one with the best recognition performance.
In the second part, long speech of varying duration is divided into several short speech segments (duration 1 second); these segments are fed to the language identification system for testing, and the language of the long speech is decided by aggregating the recognition results of the short segments.
The method specifically comprises the following steps:
step 1: generating voice data;
Audio signals of 8 languages with equal total duration were downloaded from broadcasting stations and divided by a program into short utterances of t seconds each. The speech files were transcoded to wav format with Adobe Audition software, with a sampling rate of 16 kHz and monaural data. The languages include Mandarin, Korean, Tibetan, Burmese, Cambodian (Khmer), Lao, and Vietnamese.
Step 2: extracting a spectrum envelope;
Speech is a complex multi-frequency signal in which the components at different frequencies have different amplitudes; when these components are arranged in frequency order, the curve connecting the midpoints of their amplitudes is the spectrum envelope referred to in this invention. The shape of the envelope varies with pronunciation, and since the pronunciations of different languages differ, the spectral envelope is proposed here as the acoustic feature for language identification. In extracting the spectral envelope, homomorphic filtering is the most important step: it separates the vocal tract impulse response from the glottal excitation pulse in the speech signal, and in this invention the vocal tract impulse response obtained is the required spectral envelope. The extraction process of the spectral envelope is shown in fig. 2.
Step2.1: framing;
The speech signal of 1 second duration is divided into short-time frames, each of which can be regarded as a stationary, time-invariant signal. During framing, adjacent frames overlap; the offset between successive frames is called the frame shift, and its length is generally half the frame length.
Step2.2: windowing;
Each frame is multiplied by a window function such as a rectangular, Hamming, or Hanning window; different window functions yield different bandwidths and amounts of spectral leakage. The Hamming window is selected in this invention.
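The framing and windowing of Step2.1 and Step2.2 can be sketched as follows; the frame length of 400 samples (25 ms at the 16 kHz sampling rate of Step 1) is an illustrative assumption, since the patent specifies only that the frame shift is half the frame length.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, frame_shift=200):
    """Split a 1-second speech signal into overlapping short-time frames and
    apply a Hamming window to each frame (Step2.1 and Step2.2).

    frame_len=400 and frame_shift=200 are assumed values: 25 ms frames at
    16 kHz, with the frame shift equal to half the frame length.
    """
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)  # Hamming window, as selected in the text
    frames = np.empty((num_frames, frame_len))
    for i in range(num_frames):
        start = i * frame_shift
        frames[i] = signal[start:start + frame_len] * window
    return frames
```

Each row of the returned array is one windowed frame, ready for the homomorphic processing of Step2.3.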
Step2.3: homomorphic processing;
after framing and windowing, each frame of the speech signal can be represented as the convolution:
x(n) = x1(n) * x2(n)  (1)
In formula (1), x1(n) and x2(n) represent the vocal tract impulse response (spectral envelope) and the glottal excitation signal, respectively.
The convolved signal is transformed into a multiplicative signal by the discrete Fourier transform:
DFT[x(n)] = DFT[x1(n) * x2(n)] = X1(k)·X2(k) = X(k)  (2)
In formula (2), DFT[·] denotes the discrete Fourier transform.
X(k) = X1(k)·X2(k)  (3)
In formula (3), X(k) is the transformed multiplicative signal.
The multiplicative signal is converted into an additive signal by taking the logarithm:
X̂(k) = log X(k) = log X1(k) + log X2(k)  (4)
In formula (4), X̂(k) is the logarithmic spectrum of the signal x(n); the spectral envelope proposed in this invention refers to the central, slowly varying envelope of X̂(k).
Applying the inverse discrete Fourier transform to X̂(k) yields the complex cepstrum x̂(n) of the signal x(n):
x̂(n) = DFT⁻¹[X̂(k)]  (5)
In formula (5), DFT⁻¹[·] denotes the inverse discrete Fourier transform.
Taking the real part of x̂(n) gives the real cepstrum c(n):
c(n) = Re[x̂(n)]  (6)
In formula (7), the cepstrum c(n) is separated at a specific spectral line m into a low-quefrency part ĥ(n), corresponding to the vocal tract impulse response, and a high-quefrency part ê(n), corresponding to the glottal excitation:
ĥ(n) = c(n) for |n| ≤ m, ĥ(n) = 0 otherwise  (7)
Then the discrete Fourier transform is applied to ĥ(n), and the real part of the result gives the spectral envelope of each frame:
Y(k) = Re{DFT[ĥ(n)]}  (8)
Y(k) is the spectral envelope of the signal. Fig. 3 shows the vocal tract impulse response (spectral envelope) and the glottal excitation pulse of one frame of a speech signal.
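The homomorphic processing chain of formulas (2) through (8) can be sketched as follows; the liftering cutoff m=30 is a hypothetical value, since the patent refers only to "a specific spectral line m".

```python
import numpy as np

def spectral_envelope(frame, m=30):
    """Estimate the spectral envelope of one windowed frame by homomorphic
    processing, following formulas (2)-(8): DFT -> logarithm -> inverse DFT
    (real cepstrum) -> low-quefrency liftering at line m -> DFT -> real part.

    m=30 is an assumed cutoff separating the vocal tract (low quefrency)
    from the glottal excitation (high quefrency).
    """
    eps = 1e-10                          # guard against log(0)
    X = np.fft.fft(frame)                # (2)/(3): multiplicative signal X(k)
    log_X = np.log(np.abs(X) + eps)      # (4): logarithmic spectrum
    c = np.real(np.fft.ifft(log_X))      # (5)/(6): real cepstrum c(n)
    lifter = np.zeros_like(c)
    lifter[:m] = 1.0                     # (7): keep low-quefrency part
    lifter[-(m - 1):] = 1.0              #      (negative quefrencies)
    h_hat = c * lifter
    return np.real(np.fft.fft(h_hat))    # (8): spectral envelope Y(k)
```

The returned vector Y(k) has the same length as the input frame; its smooth shape traces the vocal tract response of fig. 3.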
Step 3: drawing a spectrum envelope graph;
The spectral envelopes of the frames of one short-time speech segment are combined by rows, and the spectrum envelope graph of each speech segment is drawn with a python program. The horizontal axis of the spectrum envelope graph represents frequency and the vertical axis represents time; the drawn graph is shown in fig. 4.
Step 4: filtering the frequency spectrum;
To eliminate the influence of the differing frequency-distribution ranges of different languages on the identification result, the high- and low-frequency parts of the spectrum envelope graph are removed, and only the spectral envelope of the speech signal from 500 Hz to 3000 Hz is retained. The filtered spectrum envelope graph of a t-second short-time speech segment is shown in fig. 5.
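The band limiting of Step 4 can be sketched per frame as follows; the mapping of DFT bins to frequencies assumes the 16 kHz sampling rate given in Step 1.

```python
import numpy as np

def band_limit_envelope(envelope, fs=16000, f_lo=500.0, f_hi=3000.0):
    """Keep only the 500-3000 Hz portion of one frame's spectral envelope.

    Bin k of a length-N DFT corresponds to frequency k * fs / N; the band
    edges follow Step 4 of the description.
    """
    envelope = np.asarray(envelope)
    n = len(envelope)
    freqs = np.arange(n) * fs / n             # frequency of each DFT bin
    mask = (freqs >= f_lo) & (freqs <= f_hi)  # middle-frequency band only
    return envelope[mask]
```

Stacking the band-limited envelopes of all frames row by row gives the filtered spectrum envelope graph of fig. 5.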
Step 5: generating a training set and a test set;
The spectrum envelope graphs of each language are distributed into a training set and a test set in the ratio 4:1 and labeled with the corresponding language; the training sets of all languages are then merged into one training-set file, and the test sets into one test-set file.
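A minimal sketch of the per-language 4:1 split; the helper and its argument names are hypothetical, since the patent specifies only the proportion and the language labels.

```python
import random

def split_train_test(image_paths, ratio=(4, 1), seed=0):
    """Shuffle one language's spectrum envelope images and split them into
    a training list and a test list in the given ratio (4:1 by default).

    Hypothetical helper: image_paths is assumed to be a list of file paths
    to that language's spectrum envelope graphs.
    """
    rng = random.Random(seed)   # fixed seed for a reproducible split
    paths = list(image_paths)
    rng.shuffle(paths)
    n_train = len(paths) * ratio[0] // (ratio[0] + ratio[1])
    return paths[:n_train], paths[n_train:]
```

Running this once per language and concatenating the results yields the merged training-set and test-set files described above.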
Step 6: generating a language identification model;
and fitting the training set to a residual error network, obtaining different language identification models by adjusting parameters, and testing the language identification models by using the test set to select the language identification model with the highest language identification rate.
Step 7: language identification of variable duration voice;
When the speech signals to be identified have unequal durations, each signal is divided into several short-time speech signals (duration 1 second):
A = [a1 a2 a3 ... aN]  (9)
In formula (9), a1, a2, a3, ... are the 1-second short-time speech segments and A is a speech signal longer than 1 second. When segmenting, the duration T of the speech signal is first determined, and the number N of 1-second segments is then obtained by rounding:
N = round((T − S) / (1 − S))  (10)
In formula (10), S is the overlap length of two adjacent short-time segments during long speech segmentation: when S > 0, |S| represents the overlap duration of two adjacent segments, and when S < 0, |S| represents the gap between two adjacent segments. The segmentation is illustrated in fig. 6.
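The segmentation with overlap (or gap) S can be sketched as follows; this sketch truncates rather than rounds, so a trailing partial segment shorter than t seconds is dropped.

```python
def segment_long_speech(samples, fs=16000, t=1.0, S=0.0):
    """Divide a long utterance into t-second segments with overlap S seconds.

    S > 0 makes adjacent segments overlap by S seconds; S < 0 leaves a gap
    of |S| seconds between them, matching the description around formula (10).
    """
    seg_len = int(t * fs)          # samples per 1-second segment
    stride = int((t - S) * fs)     # hop between segment starts
    segments = []
    start = 0
    while start + seg_len <= len(samples):
        segments.append(samples[start:start + seg_len])
        start += stride
    return segments
```

For a 5-second utterance with S = 0 this yields 5 non-overlapping segments; with S = 0.5 the segments overlap by half a second and their count grows accordingly.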
After a long speech of duration T is segmented into N short segments, the probability statistics can be expressed by formula (11):
P = [p1 p2 p3 ... pN]  (11)
In formula (11), pi represents the recognition probability of the i-th segment.
In this embodiment, eight languages are involved. When judging a long speech, the probabilities assigned to each language over all segments are summed, and the language with the highest total probability is taken as the language of the long speech.
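The probability aggregation of formula (11) and the final decision can be sketched as:

```python
import numpy as np

def decide_language(segment_probs, languages):
    """Sum the per-segment probabilities of formula (11) over all N segments
    and return the language with the largest total.

    segment_probs is an (N, L) array-like: one row of L language
    probabilities per 1-second segment, as output by the recognition model.
    """
    totals = np.asarray(segment_probs).sum(axis=0)  # per-language sums
    return languages[int(np.argmax(totals))]
```

For example, three segments scored [0.6, 0.4], [0.3, 0.7], and [0.8, 0.2] over two languages give totals [1.7, 1.3], so the first language is chosen.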
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (1)

1. A language identification method of variable duration voice based on a spectrum envelope diagram is characterized in that:
step 1: dividing long speech signals of different languages into short-time speech segments, the duration of a short-time segment being defined as t seconds;
step 2: framing and windowing the short-time speech, then obtaining the spectrum envelope of each frame of the t-second short-time speech;
step 3: combining the spectral envelopes of the frames of one short-time speech segment and drawing the spectrum envelope graph of that speech segment;
step 4: filtering the generated spectrum envelope graph so that its frequency range lies within 500 Hz to 3000 Hz;
step 5: distributing the spectrum envelope graphs of each language into a training set and a test set in the ratio N:m and attaching the labels of the corresponding languages;
step 6: feeding the training set to a residual network, obtaining different language identification models by adjusting parameters, testing the models with the test set, and selecting the model with the highest recognition rate;
step 7: when the speech to be identified has a duration other than t seconds, dividing the speech signal into several t-second short-time segments, feeding the segments obtained from each long utterance to the language identification model of step 6, and judging the language of the long speech by aggregating the recognition results of the short segments.
CN202110238968.8A 2021-03-04 2021-03-04 Language identification method of variable-duration voice based on spectrum envelope diagram Pending CN113112990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110238968.8A CN113112990A (en) 2021-03-04 2021-03-04 Language identification method of variable-duration voice based on spectrum envelope diagram

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110238968.8A CN113112990A (en) 2021-03-04 2021-03-04 Language identification method of variable-duration voice based on spectrum envelope diagram

Publications (1)

Publication Number Publication Date
CN113112990A true CN113112990A (en) 2021-07-13

Family

ID=76710142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110238968.8A Pending CN113112990A (en) 2021-03-04 2021-03-04 Language identification method of variable-duration voice based on spectrum envelope diagram

Country Status (1)

Country Link
CN (1) CN113112990A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1971711A (en) * 2005-06-28 2007-05-30 哈曼贝克自动系统-威美科公司 System for adaptive enhancement of speech signals
CN101494049A (en) * 2009-03-11 2009-07-29 北京邮电大学 Method for extracting audio characteristic parameter of audio monitoring system
CN106653055A (en) * 2016-10-20 2017-05-10 北京创新伙伴教育科技有限公司 On-line oral English evaluating system
CN108091330A (en) * 2017-12-13 2018-05-29 北京小米移动软件有限公司 Output sound intensity adjusting method, device, electronic equipment and storage medium
CN109910739A (en) * 2018-12-21 2019-06-21 嘉兴智驾科技有限公司 A method of judging that intelligent driving is alarmed by voice recognition
CN110827793A (en) * 2019-10-21 2020-02-21 成都大公博创信息技术有限公司 Language identification method


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
刘芮衫: "Research on Text-Independent Language Identification Technology", China Master's Theses Full-text Database (Information Science and Technology) *
孙乐: "Research on Language Distance Relations Based on a Language Identification System", China Master's Theses Full-text Database (Information Science and Technology) *
徐克虎: "Intelligent Computing Methods and Their Applications", 30 July 2019 *
杜鑫: "Research on Language Identification Algorithms for Telephone Speech", China Master's Theses Full-text Database (Information Science and Technology) *
王成勇 et al.: "Multimedia Technology and Applications", 30 April 2005 *

Similar Documents

Publication Publication Date Title
Kolbæk et al. On loss functions for supervised monaural time-domain speech enhancement
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
JP4943335B2 (en) Robust speech recognition system independent of speakers
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN106611604B (en) Automatic voice superposition detection method based on deep neural network
Moritz et al. An auditory inspired amplitude modulation filter bank for robust feature extraction in automatic speech recognition
Venter et al. Automatic detection of African elephant (Loxodonta africana) infrasonic vocalisations from recordings
Zhang et al. Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison–female voices
Kinoshita et al. Text-informed speech enhancement with deep neural networks.
CN103440869A (en) Audio-reverberation inhibiting device and inhibiting method thereof
CN111128211B (en) Voice separation method and device
CN106157974A (en) Text recites quality assessment device and method
Sandhu et al. A comparative study of mel cepstra and EIH for phone classification under adverse conditions
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN112133289B (en) Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium
CN112116909A (en) Voice recognition method, device and system
Moritz et al. Integration of optimized modulation filter sets into deep neural networks for automatic speech recognition
CN112802456A (en) Voice evaluation scoring method and device, electronic equipment and storage medium
Zhang et al. Toward universal speech enhancement for diverse input conditions
CN113112990A (en) Language identification method of variable-duration voice based on spectrum envelope diagram
Ditter et al. Influence of Speaker-Specific Parameters on Speech Separation Systems.
CN116312561A (en) Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system
Vlaj et al. Voice activity detection algorithm using nonlinear spectral weights, hangover and hangbefore criteria
CN115713945A (en) Audio data processing method and prediction method
CN111091816B (en) Data processing system and method based on voice evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210713