CN104681036A - System and method for detecting language voice frequency - Google Patents
- Publication number
- CN104681036A CN104681036A CN201510091609.9A CN201510091609A CN104681036A CN 104681036 A CN104681036 A CN 104681036A CN 201510091609 A CN201510091609 A CN 201510091609A CN 104681036 A CN104681036 A CN 104681036A
- Authority
- CN
- China
- Prior art keywords
- language
- confidence
- phoneme sequence
- acoustic
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a system and method for detecting language audio, belonging to the technical field of speech signal processing. The system comprises an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a classification module. By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the detection performance of the system is markedly improved. The system is suitable for detecting audio of different lengths, exhibits good detection stability, and can handle a variety of non-target-language audio and noise audio, giving it good practicality. It can also be extended quickly to new types of non-target language: it suffices to provide the acoustic model and language model of the new language and retrain the classifier model, so the system architecture offers good flexibility and extensibility.
Description
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a system and method for detecting language audio.
Background technology
The real-world operating environment of speech technology is usually very complex: the audio a system receives may contain many non-target-language sounds, such as speech in other languages, music, natural noise, and man-made noise. The presence of such audio can seriously degrade the usability of speech technology and the user experience. It is therefore necessary to detect and filter this audio efficiently by technical means.
Among such techniques, the most typical are language identification and noise detection. Language identification uses the phonetic information contained in speech (for example, language-specific pronunciation units, or differing distributions and combinations of pronunciation units) to determine the language category.
In the prior art, the most mature language identification technique is the approach of multiple phonemic language models based on phoneme recognition. It rests on the observation that the phoneme sequences produced by recognizers of different languages exhibit distinct distributional and combinatorial regularities, and therefore identifies the language from the probabilities assigned to the phoneme sequences output by the different recognizers under the phonemic language models of the various languages. This technique has good accuracy and generality, but its performance degrades sharply on short utterances, which is a significant limitation.
Summary of the invention
To address the problems of the prior art, embodiments of the present invention provide a system and method for detecting language audio. The technical scheme is as follows:
In one aspect, a system for detecting language audio is provided, the system comprising: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a classification module;
Wherein,
The acoustic feature extraction module is configured to extract acoustic features from the input speech signal, the acoustic features including at least the fundamental-frequency (F0) feature of the input audio;
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language, each recognizer corresponding to a different language, and is configured to perform parallel speech-recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least the best phoneme sequence and corresponding time boundaries of the target language;
The acoustic confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a DNN model as its acoustic confidence, thereby obtaining the acoustic confidence of the phoneme sequence of each language;
The language confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the generating probability of that phoneme sequence under a higher-order language model of the corresponding language as its language confidence, thereby obtaining the language confidence of the phoneme sequence of each language;
The prosodic feature extraction module is configured to compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental-frequency feature of the input audio;
The classification module is configured to use a pre-trained classifier to classify, as target language or non-target language, the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
Optionally, each recognizer in the group employs the acoustic model and language model of its corresponding language; the acoustic model must be trained in advance on speech data of that language, and the language model must be trained in advance on text data of that language.
Optionally, the prosodic features of the audio comprise: the sentence-level maximum, minimum, and variance of the fundamental frequency; the mean and variance of the phoneme-level fundamental-frequency variance; the difference between the maximum and minimum phoneme-level fundamental-frequency variance; the proportion of voiced segments in the sentence; the proportion of unvoiced phonemes in the sentence; and the maximum, minimum, mean, and variance of phoneme durations in the sentence.
Optionally, the classification module is further configured to form a supervector from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, feed it into the pre-trained classifier for prediction, and compute the score of the supervector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, and otherwise non-target-language audio.
In another aspect, a method for detecting language audio is provided, the method comprising:
Extracting acoustic features from the input speech signal, the acoustic features including at least the fundamental-frequency feature of the input audio;
Performing parallel speech-recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least those of the target language;
Computing, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a DNN model as its acoustic confidence, thereby obtaining the acoustic confidence of the phoneme sequence of each language;
Computing, from the best phoneme sequence and corresponding time boundaries of each language, the generating probability of that phoneme sequence under a higher-order language model of the corresponding language as its language confidence, thereby obtaining the language confidence of the phoneme sequence of each language;
Computing the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental-frequency feature of the input audio;
Using a pre-trained classifier to classify, as target language or non-target language, the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
Optionally, the prosodic features of the audio comprise: the sentence-level maximum, minimum, and variance of the fundamental frequency; the mean and variance of the phoneme-level fundamental-frequency variance; the difference between the maximum and minimum phoneme-level fundamental-frequency variance; the proportion of voiced segments in the sentence; the proportion of unvoiced phonemes in the sentence; and the maximum, minimum, mean, and variance of phoneme durations in the sentence.
Optionally, using the pre-trained classifier to classify the feature vector formed from the acoustic confidences, the language confidences, and the prosodic features of the input audio comprises:
Forming a supervector from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, feeding it into the pre-trained classifier for prediction, and computing the score of the supervector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, and otherwise non-target-language audio.
The technical scheme provided by the embodiments of the present invention brings the following beneficial effects:
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the method provided by the present invention markedly improves detection performance. It is suitable for detecting audio of different lengths, exhibits good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be extended quickly to new types of non-target language: it suffices to provide the acoustic model and language model of the new language and retrain the classifier model, so the system architecture offers good flexibility and extensibility.
Brief description of the drawings
To describe the technical scheme of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of the language audio detection system provided by an embodiment of the present invention;
Fig. 2 is a flowchart of the language audio detection method provided by an embodiment of the present invention.
Detailed description of embodiments
To make the objectives, technical scheme, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Fig. 1 is a schematic structural diagram of the language audio detection system provided by an embodiment of the present invention. Referring to Fig. 1, the system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a classification module. Wherein,
The acoustic feature extraction module is configured to extract acoustic features from the input speech signal, the acoustic features including at least the fundamental-frequency feature of the input audio.
The acoustic features may include PLP (Perceptual Linear Prediction) features, MFCC (Mel Frequency Cepstral Coefficient) features, filter-bank (fbank) features, and the like.
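Since the fundamental-frequency feature underpins both the prosodic features and the voiced/unvoiced distinctions used later, a minimal per-frame F0 estimation sketch may be illustrative. Autocorrelation peak picking is only one of many possible methods; nothing below (function name, search range, sample rate) is prescribed by the patent.

```python
import math

def f0_autocorrelation(frame, sample_rate, fmin=80, fmax=400):
    """Toy per-frame F0 estimate: pick the lag with the largest
    autocorrelation within the [fmin, fmax] pitch search range."""
    lo = int(sample_rate / fmax)            # shortest candidate period
    hi = int(sample_rate / fmin)            # longest candidate period
    best_lag, best_r = 0, 0.0
    for lag in range(lo, min(hi, len(frame) - 1)):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    # Unvoiced/silent frames yield no positive peak; report F0 = 0 for them,
    # matching the convention used by the prosodic features below.
    return sample_rate / best_lag if best_lag else 0.0

# A 200 Hz sine sampled at 8 kHz should be estimated near 200 Hz.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * t / sr) for t in range(400)]
```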
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language, each recognizer corresponding to a different language, and is configured to perform parallel speech-recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least those of the target language.
In embodiments of the present invention, the phoneme recognition module is composed of a group of phoneme recognizers, each corresponding to a different language. The group must contain a recognizer for the target language. It may contain only the target-language phoneme recognizer, which reduces the computational load of the system at the cost of a limited drop in detection performance; alternatively, in addition to the target-language recognizer, it may contain phoneme recognizers for other, non-target languages, corresponding to the languages likely to be encountered in the actual operating environment. Each recognizer employs the acoustic model and phonemic language model of its corresponding language. The module outputs a group of phoneme sequences together with their corresponding time boundaries and internal state sequences. Optionally, each recognizer in the group employs the acoustic model and language model of its corresponding language; the acoustic model must be trained in advance on speech data of that language, and the language model must be trained in advance on text data of that language.
Optionally, the recognizers in the group uniformly adopt acoustic models and language models of identical structure. Typically, the acoustic model is a DNN (Deep Neural Network)/HMM (Hidden Markov Model) hybrid with phonemes uniformly used as the acoustic modeling units, and the language model is an n-gram statistical language model over phonemes. In a preferred embodiment of the invention, the n-gram language model used for decoding is a 3-gram phonemic language model.
The acoustic confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a DNN model as its acoustic confidence, thereby obtaining the acoustic confidence of the phoneme sequence of each language.
Many confidence computation methods are available, including feature-based confidence techniques and techniques based on N-best lists or lattices. The scheme adopted in the embodiments of the present invention is the phoneme-level average of acoustic posteriors under the DNN model.
Optionally, the acoustic confidence is computed as:
C_as(s) = (1/n) Σ_{i=1..n} C_a(p_i), with C_a(p_i) = (1/m) Σ_{j=1..m} P(s_j | o_j)
where C_as(s) is the acoustic confidence of sentence s; C_a(p_i) is the acoustic confidence of the i-th phoneme p_i in the sentence; n is the number of phonemes in sentence s; m is the number of feature frames contained in phoneme p_i; and P(s_j | o_j) is the posterior probability of state s_j given the j-th acoustic observation o_j within phoneme p_i.
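As an illustrative sketch of this phoneme-level posterior averaging: the input format below (a list of per-frame DNN state posteriors for each phoneme, as delimited by the recognizer's time boundaries) is an assumption for demonstration, not a format specified by the patent.

```python
def phoneme_confidence(frame_posteriors):
    """C_a(p_i): average state posterior P(s_j | o_j) over the m frames
    aligned to one phoneme."""
    return sum(frame_posteriors) / len(frame_posteriors)

def sentence_acoustic_confidence(phonemes):
    """C_as(s): average of the per-phoneme acoustic confidences C_a(p_i)
    over the n phonemes of the sentence."""
    if not phonemes:
        return 0.0
    return sum(phoneme_confidence(p) for p in phonemes) / len(phonemes)

# Example: a 2-phoneme utterance with per-frame DNN state posteriors.
posteriors = [[0.9, 0.8, 0.7], [0.6, 0.6]]
score = sentence_acoustic_confidence(posteriors)  # mean of 0.8 and 0.6
```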
The language confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the generating probability of that phoneme sequence under a higher-order language model of the corresponding language as its language confidence, thereby obtaining the language confidence of the phoneme sequence of each language.
In embodiments of the present invention, this confidence is computed as follows: given the phoneme sequence output by the recognizer of language A, compute the generating probability of the sequence under a reference phonemic language model. This reference phonemic language model is distinct from the language model used during phoneme recognition and is usually of higher order. Unless otherwise stated, every language model herein refers to a statistics-based n-gram language model.
Optionally, the language confidence is computed as:
C_l(s) = P(p_1, p_2, …, p_n) = P(p_1) P(p_2 | p_1) P(p_3 | p_1, p_2) … P(p_n | p_{n-k+1} … p_{n-1})
where P(p_n | p_{n-k+1} … p_{n-1}) is a probability given by the k-gram phonemic language model, which can be estimated from statistics over a large amount of text data.
In a preferred embodiment of the invention, the language model used to compute the language confidence is a 4-gram phonemic language model.
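The product of k-gram probabilities above can be sketched as follows. The count-based model with add-one smoothing and all function names are illustrative stand-ins for a real 4-gram phonemic language model trained on large text corpora; a log-probability is returned to avoid underflow on long sequences.

```python
import math
from collections import defaultdict

def train_phoneme_lm(sequences, k=4):
    """Toy count-based k-gram phoneme LM (illustrative, add-one smoothed)."""
    counts = defaultdict(int)
    context_counts = defaultdict(int)
    vocab = set()
    for seq in sequences:
        padded = ["<s>"] * (k - 1) + list(seq)
        vocab.update(seq)
        for i in range(k - 1, len(padded)):
            ctx = tuple(padded[i - k + 1:i])
            counts[ctx + (padded[i],)] += 1
            context_counts[ctx] += 1
    return counts, context_counts, len(vocab)

def language_confidence(seq, model, k=4):
    """log C_l(s) = sum of log P(p_i | p_{i-k+1} ... p_{i-1})."""
    counts, context_counts, v = model
    padded = ["<s>"] * (k - 1) + list(seq)
    logp = 0.0
    for i in range(k - 1, len(padded)):
        ctx = tuple(padded[i - k + 1:i])
        # add-one smoothing keeps unseen k-grams from zeroing the product
        logp += math.log((counts[ctx + (padded[i],)] + 1)
                         / (context_counts[ctx] + v))
    return logp

# A sequence matching the training data scores higher than a reshuffled one,
# which is exactly the regularity the language confidence exploits.
lm = train_phoneme_lm([["a", "b", "c"], ["a", "b", "d"]], k=2)
```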
The prosodic feature extraction module is configured to compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental-frequency feature of the input audio.
In embodiments of the present invention, the prosodic features of the audio comprise: the sentence-level maximum and minimum of the fundamental frequency; the sentence-level variance of the fundamental frequency; the mean and variance, within the sentence, of the phoneme-level fundamental-frequency variance; the difference between the maximum and minimum phoneme-level fundamental-frequency variance within the sentence; the proportion of voiced segments in the sentence (segments whose fundamental frequency is non-zero); the proportion of unvoiced phonemes in the sentence (phonemes whose internal fundamental-frequency values are all zero); the maximum and minimum phoneme durations; and the mean and variance of phoneme durations.
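A hedged sketch of a subset of these features follows. The frame-indexed F0 array (with 0 marking unvoiced frames) and the (start, end) phoneme-boundary format are assumptions for illustration, as are all names.

```python
import statistics

def prosodic_features(f0, phoneme_bounds):
    """Illustrative subset of the sentence- and phoneme-level prosodic
    features: f0 is a per-frame F0 list (0.0 for unvoiced frames), and
    phoneme_bounds is a list of (start, end) frame indices per phoneme."""
    voiced = [v for v in f0 if v > 0]
    durations = [end - start for start, end in phoneme_bounds]
    # A phoneme is "unvoiced" if every F0 value inside it is zero.
    silent = sum(1 for start, end in phoneme_bounds
                 if all(v == 0 for v in f0[start:end]))
    return {
        "f0_max": max(voiced),
        "f0_min": min(voiced),
        "f0_var": statistics.pvariance(voiced),
        "voiced_ratio": len(voiced) / len(f0),
        "silent_phoneme_ratio": silent / len(phoneme_bounds),
        "dur_max": max(durations),
        "dur_min": min(durations),
        "dur_mean": statistics.mean(durations),
        "dur_var": statistics.pvariance(durations),
    }

# Two phonemes over six frames, half of which are voiced.
feats = prosodic_features([0.0, 100.0, 120.0, 0.0, 110.0, 0.0],
                          [(0, 3), (3, 6)])
```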
The classification module is configured to use a pre-trained classifier to classify, as target language or non-target language, the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
The pre-trained classifier must be trained in advance on a large amount of collected and labeled data. Commonly used classifiers include Bayesian classifiers, k-nearest neighbors, support vector machines, decision trees, maximum entropy models, conditional random fields, and neural networks. The present invention adopts a support vector machine classifier.
In embodiments of the present invention, the classification module is further configured to form a supervector from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, feed it into the pre-trained classifier for prediction, and compute the score of the supervector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, and otherwise non-target-language audio. The score output by the classifier is the posterior probability that the given audio belongs to the target language: if this posterior exceeds the given threshold, the input audio is judged to be the target language, and otherwise non-target.
In a preferred embodiment of the invention, the classifier that performs the target/non-target language decision is a support vector machine with a radial basis function (RBF) kernel.
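The supervector classification step can be sketched with an RBF-kernel SVM. The synthetic six-dimensional data, the 0.5 threshold, and scikit-learn's `SVC` are illustrative stand-ins; the patent does not specify the implementation, feature dimensionality, or threshold value.

```python
import numpy as np
from sklearn.svm import SVC

# Each row stands in for one supervector: acoustic confidences and language
# confidences per language plus prosodic features (toy synthetic data).
rng = np.random.default_rng(0)
X_target = rng.normal(1.0, 0.3, size=(50, 6))    # target-language examples
X_other = rng.normal(-1.0, 0.3, size=(50, 6))    # non-target examples
X = np.vstack([X_target, X_other])
y = np.array([1] * 50 + [0] * 50)                # 1 = target, 0 = non-target

# RBF-kernel SVM; probability=True enables posterior estimates (Platt scaling).
clf = SVC(kernel="rbf", probability=True).fit(X, y)

def is_target_language(supervector, threshold=0.5):
    """Compare the posterior P(target | supervector) against a threshold."""
    posterior = clf.predict_proba([supervector])[0][1]
    return bool(posterior > threshold)
```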
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the system provided by the embodiments of the present invention markedly improves detection performance. It is suitable for detecting audio of different lengths, exhibits good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be extended quickly to new types of non-target language: it suffices to provide the acoustic model and language model of the new language and retrain the classifier model, so the system architecture offers good flexibility and extensibility.
Fig. 2 is a flowchart of the language audio detection method provided by an embodiment of the present invention. Referring to Fig. 2, the method comprises:
201: Extract acoustic features from the input speech signal, the acoustic features including at least the fundamental-frequency feature of the input audio;
202: Perform parallel speech-recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least those of the target language;
203: From the best phoneme sequence and corresponding time boundaries of each language, compute the posterior probability of that phoneme sequence under a DNN model as its acoustic confidence, obtaining the acoustic confidence of the phoneme sequence of each language;
204: From the best phoneme sequence and corresponding time boundaries of each language, compute the generating probability of that phoneme sequence under a higher-order language model of the corresponding language as its language confidence, obtaining the language confidence of the phoneme sequence of each language;
205: Compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental-frequency feature of the input audio;
206: Use a pre-trained classifier to classify, as target language or non-target language, the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
Optionally, the pre-trained classifier must be trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio comprise: the sentence-level maximum, minimum, and variance of the fundamental frequency; the mean and variance of the phoneme-level fundamental-frequency variance; the difference between the maximum and minimum phoneme-level fundamental-frequency variance; the proportion of voiced segments in the sentence; the proportion of unvoiced phonemes in the sentence; and the maximum, minimum, mean, and variance of phoneme durations in the sentence.
Optionally, using the pre-trained classifier to classify the feature vector formed from the acoustic confidences, the language confidences, and the prosodic features of the input audio comprises:
Forming a supervector from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, feeding it into the pre-trained classifier for prediction, and computing the score of the supervector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, and otherwise non-target-language audio.
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the method provided by the embodiments of the present invention markedly improves detection performance. It is suitable for detecting audio of different lengths, exhibits good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be extended quickly to new types of non-target language: it suffices to provide the acoustic model and language model of the new language and retrain the classifier model, so the system architecture offers good flexibility and extensibility.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (7)
1. A system for detecting language audio, characterized in that the system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a classification module;
Wherein,
The acoustic feature extraction module is configured to extract acoustic features from the input speech signal, the acoustic features including at least the fundamental-frequency feature of the input audio;
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language, each recognizer corresponding to a different language, and is configured to perform parallel speech-recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least those of the target language;
The acoustic confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a deep neural network (DNN) model as its acoustic confidence, thereby obtaining the acoustic confidence of the phoneme sequence of each language;
The language confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the generating probability of that phoneme sequence under a higher-order language model of the corresponding language as its language confidence, thereby obtaining the language confidence of the phoneme sequence of each language;
The prosodic feature extraction module is configured to compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental-frequency feature of the input audio;
The classification module is configured to use a pre-trained classifier to classify, as target language or non-target language, the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
2. system according to claim 1, it is characterized in that, in described one group of recognizer, each recognizer adopts acoustic model and the language model of its corresponding language, described acoustic model needs to adopt the speech data of corresponding language to train in advance, and described speech model needs to adopt the text data of corresponding language to train in advance.
3. system according to claim 1, it is characterized in that, the prosodic features of described audio frequency comprises Sentence-level fundamental frequency maximal value, Sentence-level fundamental frequency minimum value, the variance of Sentence-level fundamental frequency, the average of phoneme level fundamental frequency variance, the variance of phoneme level fundamental frequency variance, the maximal value of phoneme level fundamental frequency variance and the difference of minimum value, the ratio in sentence shared by sound section, the ratio of noiseless phoneme in sentence, maximum phoneme duration in sentence, minimum phoneme duration in sentence, the average of phoneme duration in sentence, the variance of phoneme duration in sentence.
4. The system according to claim 1, characterized in that the classification module is further configured to concatenate the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feed the super vector into the pre-trained classifier for prediction, and calculate the score of the super vector; if the score is greater than a given threshold, the input audio is determined to be target-language audio, and otherwise it is determined to be non-target-language audio.
5. A method for detecting language audio, characterized in that the method comprises:
extracting acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency features of the input audio;
performing parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequences and corresponding time boundaries of the different languages, comprising at least the best phoneme sequence and corresponding time boundaries of the target language;
calculating, from the best phoneme sequences of the different languages and their corresponding time boundaries, the posterior probability of each phoneme sequence on a DNN model, taking this probability as the acoustic confidence of that phoneme sequence, so as to obtain the acoustic confidences of the phoneme sequences of the different languages;
calculating, from the best phoneme sequences of the different languages and their corresponding time boundaries, the generation probability of each phoneme sequence on the multi-order (n-gram) language model of the corresponding language, taking this probability as the language confidence of that phoneme sequence, so as to obtain the language confidences of the phoneme sequences of the different languages;
calculating the prosodic features of the input audio from the best phoneme sequence of the target language, its corresponding time boundaries, and the fundamental frequency features of the input audio;
using a pre-trained classifier to classify the feature vector composed of the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into target language or non-target language.
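The two confidence scores in the method above can be sketched under simplifying assumptions: the acoustic confidence is taken as the average per-frame log posterior that the DNN assigns to the aligned phoneme, and the language confidence as the average log probability of the phoneme sequence under a bigram phoneme language model. The DNN is stood in for by a per-frame posterior lookup, and all names are hypothetical; the patent does not fix the averaging or the model order:

```python
import math

def acoustic_confidence(alignment, frame_posteriors):
    """Average log DNN posterior of each aligned phoneme over its frames.

    alignment        -- list of (phoneme, start_frame, end_frame)
    frame_posteriors -- per-frame dict mapping phoneme -> posterior probability
    """
    logs = []
    for ph, start, end in alignment:
        for t in range(start, end):
            logs.append(math.log(frame_posteriors[t][ph]))
    return sum(logs) / len(logs)

def language_confidence(phonemes, bigram):
    """Average log generation probability under a bigram phoneme LM.

    bigram -- dict mapping (previous_phoneme, phoneme) -> probability;
              "<s>" marks the sentence start
    """
    logs = []
    prev = "<s>"
    for ph in phonemes:
        logs.append(math.log(bigram[(prev, ph)]))
        prev = ph
    return sum(logs) / len(logs)
```

Length normalization (dividing by the number of frames or phonemes) is what lets audio of different durations be compared on one scale, which matches the abstract's claim of stable detection across audio lengths.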
6. The method according to claim 5, characterized in that the prosodic features of the audio comprise: the sentence-level fundamental frequency maximum; the sentence-level fundamental frequency minimum; the variance of the sentence-level fundamental frequency; the mean of the phoneme-level fundamental frequency variances; the variance of the phoneme-level fundamental frequency variances; the difference between the maximum and minimum of the phoneme-level fundamental frequency variances; the proportion of voiced segments in the sentence; the proportion of silent phonemes in the sentence; the maximum phoneme duration in the sentence; the minimum phoneme duration in the sentence; the mean phoneme duration in the sentence; and the variance of the phoneme durations in the sentence.
7. The method according to claim 5, characterized in that using the pre-trained classifier to classify the feature vector composed of the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio comprises:
concatenating the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector; feeding the super vector into the pre-trained classifier for prediction; calculating the score of the super vector; and, if the score is greater than a given threshold, determining the input audio to be target-language audio, otherwise determining it to be non-target-language audio.
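The decision rule of claims 4 and 7 amounts to concatenating the per-language confidences with the prosodic features into one super vector and thresholding a classifier score. The sketch below uses a linear scoring function as a stand-in; the patent does not fix the classifier type, and the weights, bias, and threshold here are placeholder parameters that would come from pre-training:

```python
def classify(acoustic_confs, language_confs, prosodic_feats, weights, bias, threshold):
    """Score a super vector with a pre-trained linear classifier.

    Returns True for target-language audio, False for non-target-language audio.
    weights/bias are assumed to come from classifier training (placeholders here).
    """
    # Claim 4/7: concatenate all three feature groups into one super vector
    super_vector = list(acoustic_confs) + list(language_confs) + list(prosodic_feats)
    score = sum(w * x for w, x in zip(weights, super_vector)) + bias
    return score > threshold
```

This structure is what gives the system the extensibility the abstract describes: adding a new non-target language only requires plugging in that language's acoustic and language models and retraining the classifier over the enlarged super vector.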
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510091609.9A CN104681036B (en) | 2014-11-20 | 2015-02-28 | A kind of detecting system and method for language audio |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410668235 | 2014-11-20 | ||
CN2014106682358 | 2014-11-20 | ||
CN201510091609.9A CN104681036B (en) | 2014-11-20 | 2015-02-28 | A kind of detecting system and method for language audio |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104681036A true CN104681036A (en) | 2015-06-03 |
CN104681036B CN104681036B (en) | 2018-09-25 |
Family
ID=53315987
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510091609.9A Active CN104681036B (en) | 2014-11-20 | 2015-02-28 | A kind of detecting system and method for language audio |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104681036B (en) |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105427858A (en) * | 2015-11-06 | 2016-03-23 | 科大讯飞股份有限公司 | Method and system for achieving automatic voice classification |
CN105810191A (en) * | 2016-03-08 | 2016-07-27 | 江苏信息职业技术学院 | Prosodic information-combined Chinese dialect identification method |
CN106297828A (en) * | 2016-08-12 | 2017-01-04 | 苏州驰声信息科技有限公司 | The detection method of a kind of mistake utterance detection based on degree of depth study and device |
CN106373561A (en) * | 2015-07-24 | 2017-02-01 | 三星电子株式会社 | Apparatus and method of acoustic score calculation and speech recognition |
CN106847273A (en) * | 2016-12-23 | 2017-06-13 | 北京云知声信息技术有限公司 | The wake-up selected ci poem selection method and device of speech recognition |
WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
CN107045875A (en) * | 2016-02-03 | 2017-08-15 | 重庆工商职业学院 | Fundamental frequency detection method based on genetic algorithm |
CN108389573A (en) * | 2018-02-09 | 2018-08-10 | 北京易真学思教育科技有限公司 | Language Identification and device, training method and device, medium, terminal |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | A kind of sound end detecting method and audio recognition method |
CN109493846A (en) * | 2018-11-18 | 2019-03-19 | 深圳市声希科技有限公司 | A kind of English accent identifying system |
CN109613526A (en) * | 2018-12-10 | 2019-04-12 | 航天南湖电子信息技术股份有限公司 | A kind of point mark filter method based on support vector machines |
CN109754789A (en) * | 2017-11-07 | 2019-05-14 | 北京国双科技有限公司 | The recognition methods of phoneme of speech sound and device |
CN110085216A (en) * | 2018-01-23 | 2019-08-02 | 中国科学院声学研究所 | A kind of vagitus detection method and device |
CN110176251A (en) * | 2019-04-03 | 2019-08-27 | 苏州驰声信息科技有限公司 | A kind of acoustic data automatic marking method and device |
CN110491382A (en) * | 2019-03-11 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Audio recognition method, device and interactive voice equipment based on artificial intelligence |
CN111078937A (en) * | 2019-12-27 | 2020-04-28 | 北京世纪好未来教育科技有限公司 | Voice information retrieval method, device, equipment and computer readable storage medium |
CN111369978A (en) * | 2018-12-26 | 2020-07-03 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN111402861A (en) * | 2020-03-25 | 2020-07-10 | 苏州思必驰信息科技有限公司 | Voice recognition method, device, equipment and storage medium |
CN111583906A (en) * | 2019-02-18 | 2020-08-25 | 中国移动通信有限公司研究院 | Role recognition method, device and terminal for voice conversation |
CN111862939A (en) * | 2020-05-25 | 2020-10-30 | 北京捷通华声科技股份有限公司 | Prosodic phrase marking method and device |
CN112562649A (en) * | 2020-12-07 | 2021-03-26 | 北京大米科技有限公司 | Audio processing method and device, readable storage medium and electronic equipment |
CN112634874A (en) * | 2020-12-24 | 2021-04-09 | 江西台德智慧科技有限公司 | Automatic tuning terminal equipment based on artificial intelligence |
CN113327579A (en) * | 2021-08-03 | 2021-08-31 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN113571045A (en) * | 2021-06-02 | 2021-10-29 | 北京它思智能科技有限公司 | Minnan language voice recognition method, system, equipment and medium |
WO2022100692A1 (en) * | 2020-11-12 | 2022-05-19 | 北京猿力未来科技有限公司 | Human voice audio recording method and apparatus |
CN115938351A (en) * | 2021-09-13 | 2023-04-07 | 北京数美时代科技有限公司 | ASR language model construction method, system, storage medium and electronic device |
WO2023103693A1 (en) * | 2021-12-07 | 2023-06-15 | 阿里巴巴(中国)有限公司 | Audio signal processing method and apparatus, device, and storage medium |
CN111369978B (en) * | 2018-12-26 | 2024-05-17 | 北京搜狗科技发展有限公司 | Data processing method and device for data processing |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW200421263A (en) * | 2003-04-10 | 2004-10-16 | Delta Electronics Inc | Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme |
US20050033575A1 (en) * | 2002-01-17 | 2005-02-10 | Tobias Schneider | Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer |
US20120232901A1 (en) * | 2009-08-04 | 2012-09-13 | Autonomy Corporation Ltd. | Automatic spoken language identification based on phoneme sequence patterns |
CN103559879A (en) * | 2013-11-08 | 2014-02-05 | 安徽科大讯飞信息科技股份有限公司 | Method and device for extracting acoustic features in language identification system |
2015
- 2015-02-28: CN201510091609.9A filed; granted as CN104681036B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050033575A1 (en) * | 2002-01-17 | 2005-02-10 | Tobias Schneider | Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer |
TW200421263A (en) * | 2003-04-10 | 2004-10-16 | Delta Electronics Inc | Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme |
US20120232901A1 (en) * | 2009-08-04 | 2012-09-13 | Autonomy Corporation Ltd. | Automatic spoken language identification based on phoneme sequence patterns |
CN103559879A (en) * | 2013-11-08 | 2014-02-05 | 安徽科大讯飞信息科技股份有限公司 | Method and device for extracting acoustic features in language identification system |
Non-Patent Citations (2)
Title |
---|
YAN SONG ET AL: "i-vector representation based on bottleneck features for language identification", Electronics Letters *
仲海兵 (Zhong Haibing) et al.: "Factor analysis in phoneme-recognition-based language identification", Pattern Recognition and Artificial Intelligence *
Cited By (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106373561A (en) * | 2015-07-24 | 2017-02-01 | 三星电子株式会社 | Apparatus and method of acoustic score calculation and speech recognition |
CN106373561B (en) * | 2015-07-24 | 2021-11-30 | 三星电子株式会社 | Apparatus and method for acoustic score calculation and speech recognition |
CN105427858A (en) * | 2015-11-06 | 2016-03-23 | 科大讯飞股份有限公司 | Method and system for achieving automatic voice classification |
WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
CN106940998A (en) * | 2015-12-31 | 2017-07-11 | 阿里巴巴集团控股有限公司 | A kind of execution method and device of setting operation |
CN107045875A (en) * | 2016-02-03 | 2017-08-15 | 重庆工商职业学院 | Fundamental frequency detection method based on genetic algorithm |
CN107045875B (en) * | 2016-02-03 | 2019-12-06 | 重庆工商职业学院 | fundamental tone frequency detection method based on genetic algorithm |
CN105810191A (en) * | 2016-03-08 | 2016-07-27 | 江苏信息职业技术学院 | Prosodic information-combined Chinese dialect identification method |
CN105810191B (en) * | 2016-03-08 | 2019-11-29 | 江苏信息职业技术学院 | Merge the Chinese dialects identification method of prosodic information |
CN106297828B (en) * | 2016-08-12 | 2020-03-24 | 苏州驰声信息科技有限公司 | Detection method and device for false sounding detection based on deep learning |
CN106297828A (en) * | 2016-08-12 | 2017-01-04 | 苏州驰声信息科技有限公司 | The detection method of a kind of mistake utterance detection based on degree of depth study and device |
CN106847273B (en) * | 2016-12-23 | 2020-05-05 | 北京云知声信息技术有限公司 | Awakening word selection method and device for voice recognition |
CN106847273A (en) * | 2016-12-23 | 2017-06-13 | 北京云知声信息技术有限公司 | The wake-up selected ci poem selection method and device of speech recognition |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | A kind of sound end detecting method and audio recognition method |
CN109754789A (en) * | 2017-11-07 | 2019-05-14 | 北京国双科技有限公司 | The recognition methods of phoneme of speech sound and device |
CN109754789B (en) * | 2017-11-07 | 2021-06-08 | 北京国双科技有限公司 | Method and device for recognizing voice phonemes |
CN110085216A (en) * | 2018-01-23 | 2019-08-02 | 中国科学院声学研究所 | A kind of vagitus detection method and device |
CN108389573B (en) * | 2018-02-09 | 2022-03-08 | 北京世纪好未来教育科技有限公司 | Language identification method and device, training method and device, medium and terminal |
CN108389573A (en) * | 2018-02-09 | 2018-08-10 | 北京易真学思教育科技有限公司 | Language Identification and device, training method and device, medium, terminal |
CN109493846A (en) * | 2018-11-18 | 2019-03-19 | 深圳市声希科技有限公司 | A kind of English accent identifying system |
CN109493846B (en) * | 2018-11-18 | 2021-06-08 | 深圳市声希科技有限公司 | English accent recognition system |
CN109613526A (en) * | 2018-12-10 | 2019-04-12 | 航天南湖电子信息技术股份有限公司 | A kind of point mark filter method based on support vector machines |
CN111369978A (en) * | 2018-12-26 | 2020-07-03 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN111369978B (en) * | 2018-12-26 | 2024-05-17 | 北京搜狗科技发展有限公司 | Data processing method and device for data processing |
CN111583906A (en) * | 2019-02-18 | 2020-08-25 | 中国移动通信有限公司研究院 | Role recognition method, device and terminal for voice conversation |
CN111583906B (en) * | 2019-02-18 | 2023-08-15 | 中国移动通信有限公司研究院 | Role recognition method, device and terminal for voice session |
CN110491382B (en) * | 2019-03-11 | 2020-12-04 | 腾讯科技(深圳)有限公司 | Speech recognition method and device based on artificial intelligence and speech interaction equipment |
CN110491382A (en) * | 2019-03-11 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Audio recognition method, device and interactive voice equipment based on artificial intelligence |
CN110176251B (en) * | 2019-04-03 | 2021-12-21 | 苏州驰声信息科技有限公司 | Automatic acoustic data labeling method and device |
CN110176251A (en) * | 2019-04-03 | 2019-08-27 | 苏州驰声信息科技有限公司 | A kind of acoustic data automatic marking method and device |
CN111078937B (en) * | 2019-12-27 | 2021-08-10 | 北京世纪好未来教育科技有限公司 | Voice information retrieval method, device, equipment and computer readable storage medium |
CN111078937A (en) * | 2019-12-27 | 2020-04-28 | 北京世纪好未来教育科技有限公司 | Voice information retrieval method, device, equipment and computer readable storage medium |
CN111402861B (en) * | 2020-03-25 | 2022-11-15 | 思必驰科技股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN111402861A (en) * | 2020-03-25 | 2020-07-10 | 苏州思必驰信息科技有限公司 | Voice recognition method, device, equipment and storage medium |
CN111862939A (en) * | 2020-05-25 | 2020-10-30 | 北京捷通华声科技股份有限公司 | Prosodic phrase marking method and device |
WO2022100692A1 (en) * | 2020-11-12 | 2022-05-19 | 北京猿力未来科技有限公司 | Human voice audio recording method and apparatus |
CN112562649B (en) * | 2020-12-07 | 2024-01-30 | 北京大米科技有限公司 | Audio processing method and device, readable storage medium and electronic equipment |
CN112562649A (en) * | 2020-12-07 | 2021-03-26 | 北京大米科技有限公司 | Audio processing method and device, readable storage medium and electronic equipment |
CN112634874B (en) * | 2020-12-24 | 2022-09-23 | 江西台德智慧科技有限公司 | Automatic tuning terminal equipment based on artificial intelligence |
CN112634874A (en) * | 2020-12-24 | 2021-04-09 | 江西台德智慧科技有限公司 | Automatic tuning terminal equipment based on artificial intelligence |
CN113571045A (en) * | 2021-06-02 | 2021-10-29 | 北京它思智能科技有限公司 | Minnan language voice recognition method, system, equipment and medium |
CN113571045B (en) * | 2021-06-02 | 2024-03-12 | 北京它思智能科技有限公司 | Method, system, equipment and medium for identifying Minnan language voice |
CN113327579A (en) * | 2021-08-03 | 2021-08-31 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN115938351A (en) * | 2021-09-13 | 2023-04-07 | 北京数美时代科技有限公司 | ASR language model construction method, system, storage medium and electronic device |
CN115938351B (en) * | 2021-09-13 | 2023-08-15 | 北京数美时代科技有限公司 | ASR language model construction method, system, storage medium and electronic equipment |
WO2023103693A1 (en) * | 2021-12-07 | 2023-06-15 | 阿里巴巴(中国)有限公司 | Audio signal processing method and apparatus, device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN104681036B (en) | 2018-09-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104681036A (en) | System and method for detecting language voice frequency | |
US8301450B2 (en) | Apparatus, method, and medium for dialogue speech recognition using topic domain detection | |
CN104575490A (en) | Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm | |
JP5752060B2 (en) | Information processing apparatus, large vocabulary continuous speech recognition method and program | |
Ryant et al. | Highly accurate mandarin tone classification in the absence of pitch information | |
WO2010100853A1 (en) | Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium | |
Kumar et al. | A comprehensive view of automatic speech recognition system-a systematic literature review | |
KR20180038707A (en) | Method for recogniting speech using dynamic weight and topic information | |
Agrawal et al. | Analysis and modeling of acoustic information for automatic dialect classification | |
Prabhavalkar et al. | Discriminative articulatory models for spoken term detection in low-resource conversational settings | |
Savargiv et al. | Persian speech emotion recognition | |
Gholamdokht Firooz et al. | Spoken language recognition using a new conditional cascade method to combine acoustic and phonetic results | |
Baljekar et al. | Using articulatory features and inferred phonological segments in zero resource speech processing | |
JP3660512B2 (en) | Voice recognition method, apparatus and program recording medium | |
Sahu et al. | A study on automatic speech recognition toolkits | |
Rabiee et al. | Persian accents identification using an adaptive neural network | |
Sharma et al. | Automatic speech recognition systems: challenges and recent implementation trends | |
Cui et al. | Improving deep neural network acoustic modeling for audio corpus indexing under the iarpa babel program | |
Kolesau et al. | Voice activation systems for embedded devices: Systematic literature review | |
Rasipuram et al. | Grapheme and multilingual posterior features for under-resourced speech recognition: a study on scottish gaelic | |
Sharma et al. | Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art | |
Schuller et al. | Late fusion of individual engines for improved recognition of negative emotion in speech-learning vs. democratic vote | |
KR20230156125A (en) | Lookup table recursive language model | |
Chiang et al. | A study on cross-language knowledge integration in Mandarin LVCSR | |
Tabibian | A survey on structured discriminative spoken keyword spotting |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |