CN104681036A - System and method for detecting language voice frequency - Google Patents


Info

Publication number
CN104681036A
Authority
CN
China
Prior art keywords
language
confidence
phoneme sequence
acoustic
sentence
Prior art date
Legal status
Granted
Application number
CN201510091609.9A
Other languages
Chinese (zh)
Other versions
CN104681036B (en)
Inventor
王欢良
杨嵩
代大明
袁军峰
惠寅华
林远东
Current Assignee
Suzhou Chisheng Information Technology Co Ltd
Original Assignee
Suzhou Chisheng Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Chisheng Information Technology Co Ltd
Priority to CN201510091609.9A
Publication of CN104681036A
Application granted; publication of CN104681036B
Legal status: Active


Abstract

The invention discloses a system and method for detecting language audio, and belongs to the technical field of speech signal processing. The system comprises an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module and a classification module. By jointly exploiting acoustic confidence, language confidence and prosodic feature information, the detection performance of the system is markedly improved. The system is suitable for detecting audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be quickly extended according to the types of non-target language encountered, simply by providing an acoustic model and a language model for the new language and retraining the classifier model, so the system architecture has good flexibility and extensibility.

Description

Detection system and method for language audio
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a detection system and method for language audio.
Background technology
The real-world operating environment of speech technology is usually very complex: the audio a system receives may contain many non-target-language sounds, such as speech in other languages, music, natural noise and man-made noise. Such audio can severely degrade the usability and user experience of speech technology. It is therefore necessary to detect and filter this audio efficiently by technical means.
Among such techniques, the most typical are language identification and noise detection. Language identification exploits the phonetic information contained in speech (for example language-specific pronunciation units, or differences in the distribution or combination of pronunciation units) to determine the language category.
In the prior art, the most mature language identification technique is the multi-phoneme-language-model approach based on phoneme recognition. Its premise is that the phoneme sequences produced by recognizers of different languages follow language-specific distribution and combination patterns, so the language can be identified from the probabilities that the recognized phoneme sequences receive under the phoneme language models of the various languages. This technique has good accuracy and generality, but its performance drops sharply on short utterances, which is a notable limitation.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide a detection system and method for language audio. The technical solution is as follows:
In one aspect, a detection system for language audio is provided, the system comprising: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module and a classification module;
Wherein,
The acoustic feature extraction module is configured to extract acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio;
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language, each recognizer in the group corresponding to a different language, and is configured to perform parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least the best phoneme sequence and corresponding time boundaries of the target language;
The acoustic confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a DNN model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language;
The language confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the generation probability of that phoneme sequence under a higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language;
The prosodic feature extraction module is configured to compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio;
The classification module is configured to use a pre-trained classifier on the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, to classify the input as target language or non-target language.
Optionally, each recognizer in the group adopts the acoustic model and language model of its corresponding language; the acoustic model is trained in advance on speech data of that language, and the language model is trained in advance on text data of that language.
Optionally, the prosodic features of the audio comprise: the sentence-level maximum F0, the sentence-level minimum F0, the variance of sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum phoneme-level F0 variance, the proportion of voiced segments in the sentence, the proportion of voiceless phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme duration in the sentence.
Optionally, the classification module is further configured to concatenate the acoustic confidences, language confidences and prosodic features of the input audio into a super-vector, feed it to the pre-trained classifier for prediction, and compute the score of the super-vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
In another aspect, a detection method for language audio is provided, the method comprising:
extracting acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio;
performing parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least the best phoneme sequence and corresponding time boundaries of the target language;
computing, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a DNN model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language;
computing, from the best phoneme sequence and corresponding time boundaries of each language, the generation probability of that phoneme sequence under a higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language;
computing the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio;
using a pre-trained classifier on the feature vector formed from the acoustic confidences, language confidences and prosodic features of the input audio, to classify the input as target language or non-target language.
Optionally, the prosodic features of the audio comprise: the sentence-level maximum F0, the sentence-level minimum F0, the variance of sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum phoneme-level F0 variance, the proportion of voiced segments in the sentence, the proportion of voiceless phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme duration in the sentence.
Optionally, using the pre-trained classifier on the feature vector formed from the acoustic confidences, language confidences and prosodic features of the input audio to classify the input as target language or non-target language comprises:
concatenating the acoustic confidences, language confidences and prosodic features of the input audio into a super-vector, feeding it to the pre-trained classifier for prediction, and computing the score of the super-vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
The technical solutions provided by the embodiments of the present invention bring the following beneficial effects:
By jointly exploiting acoustic confidence, language confidence and prosodic feature information, the provided method markedly improves the detection performance of the system. It is suitable for detecting audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be quickly extended according to the types of non-target language encountered: it suffices to provide an acoustic model and a language model for the new language and then retrain the classifier model, so the system architecture has good flexibility and extensibility.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of the language audio detection system provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the language audio detection method provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the drawings.
Fig. 1 is a schematic structural diagram of the language audio detection system provided by an embodiment of the present invention. Referring to Fig. 1, the system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module and a classification module. Wherein,
The acoustic feature extraction module is configured to extract acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio.
The acoustic features may include PLP (Perceptual Linear Prediction) features, MFCC (Mel-Frequency Cepstral Coefficient) features, filterbank (fbank) features, and so on.
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language, each recognizer corresponding to a different language, and is configured to perform parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least those of the target language.
In the embodiments of the present invention, the phoneme recognition module is composed of a group of phoneme recognizers, each corresponding to a different language. The group must include a recognizer for the target language. The group may contain only the target-language phoneme recognizer, which reduces the computational load of the system at the cost of a limited drop in detection performance; alternatively, in addition to the target-language recognizer, it may contain phoneme recognizers for other, non-target languages, corresponding to the languages likely to be encountered in the actual application environment. Each recognizer adopts the acoustic model and phoneme language model of its corresponding language. The output of this module is a group of phoneme sequences together with their corresponding time boundaries and internal state sequences. Optionally, each recognizer adopts the acoustic model and language model of its corresponding language; the acoustic model is trained in advance on speech data of that language, and the language model is trained in advance on text data of that language.
Optionally, all recognizers in the group adopt acoustic models and language models of the same structure. Typically, the acoustic model is a DNN (Deep Neural Network)/HMM (Hidden Markov Model) hybrid with phonemes as the uniform acoustic modeling unit, and the language model is an n-gram statistical language model over phonemes. In a preferred embodiment of the invention, the n-gram language model used for decoding is a 3-gram phoneme language model.
The acoustic confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under the DNN model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language.
There are many common confidence computation methods, including feature-based confidence techniques and confidence techniques based on N-best lists or lattices. The scheme adopted in the embodiments of the present invention is the average of phoneme-level acoustic posteriors under the DNN model.
Alternatively, the computing method of acoustic confidence are:
C a ( s ) = 1 n Σ i = 1 i = n e C a ( p i ) ,
C a ( p i ) = 1 m Σ j = 1 j = m ln P ( s j | o j ) .
Wherein, C as () is the acoustic confidence of sentence s, C a(p i) be the p of i-th phoneme in sentence iacoustic confidence, n is the phoneme number in sentence s, and m is phoneme p iin the feature frame number that comprises, P (s j| o j) be phoneme p iin given jth acoustics observe o jat state s jon posterior probability.
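The phoneme-averaged posterior scheme above can be sketched in a few lines; the function names and the toy frame posteriors below are illustrative only (a real system would take ln P(s_j | o_j) from the DNN's output layer):

```python
import math

def phoneme_confidence(frame_log_posteriors):
    # C_a(p_i): mean of ln P(s_j | o_j) over the m frames of one phoneme
    return sum(frame_log_posteriors) / len(frame_log_posteriors)

def sentence_confidence(phoneme_frames):
    # C_a(s): mean of exp(C_a(p_i)) over the n phonemes of the sentence
    return sum(math.exp(phoneme_confidence(p)) for p in phoneme_frames) / len(phoneme_frames)

# Toy example: a two-phoneme utterance with hypothetical state posteriors per frame.
phonemes = [
    [math.log(0.9), math.log(0.8)],                 # phoneme 1: 2 frames
    [math.log(0.6), math.log(0.7), math.log(0.5)],  # phoneme 2: 3 frames
]
score = sentence_confidence(phonemes)
```

Note that exp of a mean of log posteriors is the geometric mean of the frame posteriors, so each phoneme contributes a value in (0, 1] regardless of its length.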
The language confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the generation probability of that phoneme sequence under a higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language.
In the embodiments of the present invention, this confidence is computed as follows: given the phoneme sequence output by the recognizer of a language A, the generation probability of the sequence is computed under a reference phoneme language model. This reference phoneme language model is different from the language model used for phoneme recognition, and is usually of higher order. Unless otherwise stated, the language models herein are statistical n-gram language models.
Alternatively, the computing method of language degree of confidence are:
C 1(s)=P(p 1,p 2…p n)
=P(p 1)P(p 2|P 1)P(p 3|p 1|p 2)…P(p n|p n-k+1…p n-1)。
Wherein, P (p n| p n-k+1p n-1) be the probability of the phonemic language model of k-gram, can add up on a large amount of text data and obtain.
In the preferred embodiment of the invention, the language model for computational language degree of confidence is the phonemic language model of 4-gram.
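The chain-rule product above can be sketched as follows; the `ngram_prob` lookup and the toy probability table are hypothetical stand-ins for a real trained k-gram phoneme language model with back-off:

```python
def language_confidence(phonemes, ngram_prob, k=4):
    # C_l(s): product of k-gram probabilities P(p_i | p_{i-k+1} ... p_{i-1}),
    # using a truncated history for the first k-1 phonemes.
    prob = 1.0
    for i, p in enumerate(phonemes):
        history = tuple(phonemes[max(0, i - k + 1):i])
        prob *= ngram_prob(history, p)
    return prob

# Toy model: a fixed table with a uniform fallback, standing in for a trained LM.
table = {((), "a"): 0.4, (("a",), "b"): 0.5, (("a", "b"), "a"): 0.25}
toy_ngram = lambda hist, p: table.get((hist, p), 0.1)
conf = language_confidence(["a", "b", "a"], toy_ngram)  # 0.4 * 0.5 * 0.25
```

In practice the product would be accumulated in the log domain to avoid underflow on long phoneme sequences.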
The prosodic feature extraction module is configured to compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio.
In the embodiments of the present invention, the prosodic features of the audio comprise: the sentence-level maximum and minimum F0; the variance of sentence-level F0; the mean and variance of the phoneme-level F0 variances within the sentence; the difference between the maximum and minimum phoneme-level F0 variance within the sentence; the proportion of voiced segments (segments with non-zero F0) in the sentence; the proportion of voiceless phonemes (phonemes whose internal F0 values are all zero) in the sentence; the maximum and minimum phoneme durations; and the mean and variance of phoneme duration.
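A minimal sketch of this feature computation is given below, assuming a frame-level F0 contour in which unvoiced frames are zero and phoneme time boundaries expressed as frame indices. Only a subset of the listed features is shown, and the function and field names are invented for illustration:

```python
import statistics

def prosodic_features(f0, phone_bounds):
    # f0: per-frame fundamental frequency, 0.0 on unvoiced frames
    # phone_bounds: (start_frame, end_frame) per phoneme, from the decoder alignment
    voiced = [v for v in f0 if v > 0]
    durations = [end - start for start, end in phone_bounds]
    silent = sum(1 for start, end in phone_bounds
                 if all(v == 0 for v in f0[start:end]))
    return {
        "f0_max": max(voiced),
        "f0_min": min(voiced),
        "f0_var": statistics.pvariance(voiced),
        "voiced_ratio": len(voiced) / len(f0),          # voiced-segment proportion
        "silent_phone_ratio": silent / len(phone_bounds),
        "dur_max": max(durations),
        "dur_min": min(durations),
        "dur_mean": statistics.fmean(durations),
        "dur_var": statistics.pvariance(durations),
    }

# Toy contour: 6 frames, two phonemes of 3 frames each.
feats = prosodic_features([0.0, 100.0, 120.0, 0.0, 110.0, 0.0], [(0, 3), (3, 6)])
```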
The classification module is configured to use a pre-trained classifier on the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, to classify the input as target language or non-target language.
The pre-trained classifier must be trained in advance on a large amount of collected and labelled data. Common classifiers include Bayesian classifiers, k-nearest neighbours, support vector machines, decision trees, maximum entropy models, conditional random fields and neural networks. The present invention adopts a support vector machine classifier.
In the embodiments of the present invention, the classification module is further configured to concatenate the acoustic confidences, language confidences and prosodic features of the input audio into a super-vector, feed it to the pre-trained classifier for prediction, and compute the score of the super-vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio. The score output by the classifier is the posterior probability that the given audio belongs to the target language: if this posterior probability exceeds the given threshold, the input audio is judged to be target language, otherwise non-target language.
In a preferred embodiment of the invention, the classifier performing the target/non-target language decision adopts a support vector machine model with a radial basis function (RBF) kernel.
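The super-vector assembly and threshold decision can be sketched as below. The scoring function is a stub standing in for the RBF-kernel support vector machine described above (for example an `sklearn.svm.SVC(kernel="rbf", probability=True)` trained on labelled data); only the glue logic is shown, and all names are illustrative:

```python
def make_supervector(acoustic_confs, language_confs, prosodic_feats):
    # Concatenate per-language acoustic and language confidences with the
    # prosodic feature vector of the input audio into one feature vector.
    return list(acoustic_confs) + list(language_confs) + list(prosodic_feats)

def decide(posterior_fn, supervector, threshold=0.5):
    # Target language iff the classifier's posterior score exceeds the threshold.
    return posterior_fn(supervector) > threshold

# Stub scorer: pretends the classifier posterior is the mean feature value.
stub_posterior = lambda vec: sum(vec) / len(vec)
sv = make_supervector([0.8, 0.3], [0.7, 0.2], [0.9])
is_target = decide(stub_posterior, sv)
```

The threshold is the system's operating point: raising it trades more missed target-language audio for fewer false acceptances of non-target audio.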
The system provided by the embodiments of the present invention, by jointly exploiting acoustic confidence, language confidence and prosodic feature information, markedly improves detection performance. It is suitable for detecting audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be quickly extended according to the types of non-target language: it suffices to provide an acoustic model and a language model for the new language and then retrain the classifier model, so the system architecture has good flexibility and extensibility.
Fig. 2 is a flow chart of the language audio detection method provided by an embodiment of the present invention. Referring to Fig. 2, the method comprises:
201. Extract acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio.
202. Perform parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least those of the target language.
203. From the best phoneme sequence and corresponding time boundaries of each language, compute the posterior probability of that phoneme sequence under the DNN model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language.
204. From the best phoneme sequence and corresponding time boundaries of each language, compute the generation probability of that phoneme sequence under the higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language.
205. From the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio, compute the prosodic features of the input audio.
206. Use a pre-trained classifier on the feature vector formed from the acoustic confidences, language confidences and prosodic features of the input audio, to classify the input as target language or non-target language.
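Steps 201-206 can be strung together as in the sketch below; every component is passed in as a pluggable callable, and the stubs in the example are invented for illustration and do no real decoding:

```python
def detect_language_audio(audio, extract_features, recognizers,
                          acoustic_conf, language_conf, prosody,
                          classify, threshold=0.5):
    feats = extract_features(audio)                                    # step 201
    decoded = {lang: rec(feats) for lang, rec in recognizers.items()}  # step 202
    a = [acoustic_conf(lang, seq) for lang, seq in decoded.items()]    # step 203
    l = [language_conf(lang, seq) for lang, seq in decoded.items()]    # step 204
    pros = prosody(decoded["target"], feats)                           # step 205
    supervector = a + l + pros                                         # step 206
    return classify(supervector) > threshold

# Trivial stubs so the flow can be exercised end to end.
result = detect_language_audio(
    audio=[0.0],
    extract_features=lambda audio: audio,
    recognizers={"target": lambda f: ["a", "b"], "other": lambda f: ["c"]},
    acoustic_conf=lambda lang, seq: 0.9 if lang == "target" else 0.2,
    language_conf=lambda lang, seq: 0.8 if lang == "target" else 0.1,
    prosody=lambda seq, feats: [0.6],
    classify=lambda vec: sum(vec) / len(vec),
)
```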
Optionally, the pre-trained classifier must be trained in advance on a large amount of collected and labelled data.
Optionally, the prosodic features of the audio comprise: the sentence-level maximum F0, the sentence-level minimum F0, the variance of sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum phoneme-level F0 variance, the proportion of voiced segments in the sentence, the proportion of voiceless phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme duration in the sentence.
Optionally, using the pre-trained classifier on the feature vector formed from the acoustic confidences, language confidences and prosodic features of the input audio to classify the input as target language or non-target language comprises:
concatenating the acoustic confidences, language confidences and prosodic features of the input audio into a super-vector, feeding it to the pre-trained classifier for prediction, and computing the score of the super-vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
The method provided by the embodiments of the present invention, by jointly exploiting acoustic confidence, language confidence and prosodic feature information, markedly improves detection performance. It is suitable for detecting audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be quickly extended according to the types of non-target language: it suffices to provide an acoustic model and a language model for the new language and then retrain the classifier model, so the system architecture has good flexibility and extensibility.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disc.
The foregoing are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (7)

1. A detection system for language audio, characterized in that the system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module and a classification module;
Wherein,
the acoustic feature extraction module is configured to extract acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio;
the phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language, each recognizer corresponding to a different language, and is configured to perform parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least the best phoneme sequence and corresponding time boundaries of the target language;
the acoustic confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a deep neural network (DNN) model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language;
the language confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the generation probability of that phoneme sequence under a higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language;
the prosodic feature extraction module is configured to compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio;
the classification module is configured to use a pre-trained classifier on the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, to classify the input as target language or non-target language.
2. The system according to claim 1, wherein each recognizer in the group of recognizers uses the acoustic model and the language model of its corresponding language, the acoustic model being trained in advance on speech data of that language and the language model being trained in advance on text data of that language.
3. The system according to claim 1, wherein the prosodic features of the audio comprise: the sentence-level F0 maximum; the sentence-level F0 minimum; the variance of the sentence-level F0; the mean of the phoneme-level F0 variances; the variance of the phoneme-level F0 variances; the difference between the maximum and the minimum of the phoneme-level F0 variances; the proportion of voiced segments in the sentence; the proportion of silent phonemes in the sentence; the maximum phoneme duration in the sentence; the minimum phoneme duration in the sentence; the mean phoneme duration in the sentence; and the variance of the phoneme durations in the sentence.
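The twelve prosodic features of claim 3 can be computed from a phoneme-level alignment carrying F0 values. A sketch, assuming a hypothetical input layout of per-phoneme F0 frames, durations, and voicing flags (the patent does not specify any data structures):

```python
import statistics as st

def prosodic_features(phonemes):
    # `phonemes`: list of dicts with 'f0' (per-frame F0 values, empty for
    # silence), 'duration' (seconds), 'voiced' (bool) -- an assumed layout.
    sent_f0 = [v for p in phonemes for v in p["f0"]]
    f0_vars = [st.pvariance(p["f0"]) for p in phonemes if len(p["f0"]) > 1]
    durs = [p["duration"] for p in phonemes]
    total = sum(durs)
    voiced = sum(p["duration"] for p in phonemes if p["voiced"])
    silent = sum(1 for p in phonemes if not p["f0"])
    return [
        max(sent_f0), min(sent_f0), st.pvariance(sent_f0),  # sentence-level F0
        st.mean(f0_vars), st.pvariance(f0_vars),            # phoneme-level F0 variance stats
        max(f0_vars) - min(f0_vars),
        voiced / total, silent / len(phonemes),             # voiced ratio, silent-phoneme ratio
        max(durs), min(durs), st.mean(durs), st.pvariance(durs),  # duration stats
    ]
```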
4. The system according to claim 1, wherein the classification module is further configured to combine the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a supervector, feed the supervector into the pre-trained classifier for prediction, and compute a score for the supervector; if the score is greater than a given threshold, the input audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
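The threshold rule of claim 4 reduces to scoring the supervector and comparing the score against a cutoff. A sketch with a linear decision function standing in for the pre-trained classifier (the patent does not name a classifier type; an SVM or similar would fit here):

```python
def classify(supervector, weights, bias=0.0, threshold=0.0):
    # Linear score w.x + b; label 'target' if the score exceeds the threshold.
    score = sum(w * x for w, x in zip(weights, supervector)) + bias
    label = "target" if score > threshold else "non-target"
    return label, score

label, score = classify([0.9, -1.2, 0.3], [1.0, 0.5, 2.0])
print(label)  # target
```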
5. A method for detecting language audio, wherein the method comprises:
extracting acoustic features from an input speech signal, the acoustic features including at least the fundamental frequency (F0) features of the input audio;
performing parallel speech recognition decoding on the acoustic features to obtain a best phoneme sequence and corresponding time boundaries for each language, the best phoneme sequences and corresponding time boundaries of the different languages including at least the best phoneme sequence and corresponding time boundaries of the target language;
computing, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a DNN model, taking this posterior probability as the acoustic confidence of the phoneme sequence, and thereby obtaining the acoustic confidence of the phoneme sequence of each language;
computing, from the best phoneme sequence and corresponding time boundaries of each language, the generating probability of that phoneme sequence under a multi-order (n-gram) language model of the corresponding language, taking this generating probability as the language confidence of the phoneme sequence, and thereby obtaining the language confidence of the phoneme sequence of each language;
computing the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the F0 features of the input audio;
applying a pre-trained classifier to a feature vector composed of the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, so as to classify the input as target language or non-target language.
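The acoustic-confidence step above can be sketched as averaging DNN log posteriors over the aligned frames of each phoneme, then over the whole sequence. The exact averaging scheme and the posterior-matrix layout are assumptions; the claim only states that the posterior probability on the DNN model is used:

```python
import math

def phone_confidence(posteriors, phone_id, start, end):
    # Mean log posterior of class `phone_id` over frames [start, end);
    # posteriors[t][k] is the DNN output for frame t, phoneme class k.
    frames = posteriors[start:end]
    return sum(math.log(f[phone_id]) for f in frames) / len(frames)

def sequence_confidence(posteriors, alignment):
    # `alignment`: (phone_id, start_frame, end_frame) triples taken from
    # the best phoneme sequence and its time boundaries.
    confs = [phone_confidence(posteriors, p, s, e) for p, s, e in alignment]
    return sum(confs) / len(confs)
```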
6. The method according to claim 5, wherein the prosodic features of the audio comprise: the sentence-level F0 maximum; the sentence-level F0 minimum; the variance of the sentence-level F0; the mean of the phoneme-level F0 variances; the variance of the phoneme-level F0 variances; the difference between the maximum and the minimum of the phoneme-level F0 variances; the proportion of voiced segments in the sentence; the proportion of silent phonemes in the sentence; the maximum phoneme duration in the sentence; the minimum phoneme duration in the sentence; the mean phoneme duration in the sentence; and the variance of the phoneme durations in the sentence.
7. The method according to claim 5, wherein applying a pre-trained classifier to a feature vector composed of the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio to perform target-language/non-target-language classification comprises:
combining the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a supervector, feeding the supervector into the pre-trained classifier for prediction, and computing a score for the supervector; if the score is greater than a given threshold, the input audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
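The language-confidence term entering the supervector can be illustrated with a bigram phoneme language model. The model order, the length normalization, and the unseen-bigram floor are all assumptions; the claims only require a generating probability under a multi-order language model:

```python
import math

def language_confidence(phones, bigram_logprob, unigram_logprob):
    # Length-normalized generating log-probability of the phoneme sequence;
    # backs off to a unigram (or a small floor) when a bigram is unseen.
    logp, prev = 0.0, "<s>"
    for ph in phones:
        logp += bigram_logprob.get((prev, ph), unigram_logprob.get(ph, math.log(1e-6)))
        prev = ph
    return logp / len(phones)
```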
CN201510091609.9A 2014-11-20 2015-02-28 A kind of detecting system and method for language audio Active CN104681036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510091609.9A CN104681036B (en) 2014-11-20 2015-02-28 A kind of detecting system and method for language audio

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201410668235 2014-11-20
CN2014106682358 2014-11-20
CN201510091609.9A CN104681036B (en) 2014-11-20 2015-02-28 A kind of detecting system and method for language audio

Publications (2)

Publication Number Publication Date
CN104681036A true CN104681036A (en) 2015-06-03
CN104681036B CN104681036B (en) 2018-09-25

Family

ID=53315987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510091609.9A Active CN104681036B (en) 2014-11-20 2015-02-28 A kind of detecting system and method for language audio

Country Status (1)

Country Link
CN (1) CN104681036B (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105427858A (en) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for achieving automatic voice classification
CN105810191A (en) * 2016-03-08 2016-07-27 江苏信息职业技术学院 Prosodic information-combined Chinese dialect identification method
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN106373561A (en) * 2015-07-24 2017-02-01 三星电子株式会社 Apparatus and method of acoustic score calculation and speech recognition
CN106847273A (en) * 2016-12-23 2017-06-13 北京云知声信息技术有限公司 The wake-up selected ci poem selection method and device of speech recognition
WO2017114201A1 (en) * 2015-12-31 2017-07-06 阿里巴巴集团控股有限公司 Method and device for executing setting operation
CN107045875A (en) * 2016-02-03 2017-08-15 重庆工商职业学院 Fundamental frequency detection method based on genetic algorithm
CN108389573A (en) * 2018-02-09 2018-08-10 北京易真学思教育科技有限公司 Language Identification and device, training method and device, medium, terminal
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A kind of sound end detecting method and audio recognition method
CN109493846A (en) * 2018-11-18 2019-03-19 深圳市声希科技有限公司 A kind of English accent identifying system
CN109613526A (en) * 2018-12-10 2019-04-12 航天南湖电子信息技术股份有限公司 A kind of point mark filter method based on support vector machines
CN109754789A (en) * 2017-11-07 2019-05-14 北京国双科技有限公司 The recognition methods of phoneme of speech sound and device
CN110085216A (en) * 2018-01-23 2019-08-02 中国科学院声学研究所 A kind of vagitus detection method and device
CN110176251A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 A kind of acoustic data automatic marking method and device
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN111078937A (en) * 2019-12-27 2020-04-28 北京世纪好未来教育科技有限公司 Voice information retrieval method, device, equipment and computer readable storage medium
CN111369978A (en) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111402861A (en) * 2020-03-25 2020-07-10 苏州思必驰信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN111583906A (en) * 2019-02-18 2020-08-25 中国移动通信有限公司研究院 Role recognition method, device and terminal for voice conversation
CN111862939A (en) * 2020-05-25 2020-10-30 北京捷通华声科技股份有限公司 Prosodic phrase marking method and device
CN112562649A (en) * 2020-12-07 2021-03-26 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112634874A (en) * 2020-12-24 2021-04-09 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN113327579A (en) * 2021-08-03 2021-08-31 北京世纪好未来教育科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113571045A (en) * 2021-06-02 2021-10-29 北京它思智能科技有限公司 Minnan language voice recognition method, system, equipment and medium
WO2022100692A1 (en) * 2020-11-12 2022-05-19 北京猿力未来科技有限公司 Human voice audio recording method and apparatus
CN115938351A (en) * 2021-09-13 2023-04-07 北京数美时代科技有限公司 ASR language model construction method, system, storage medium and electronic device
WO2023103693A1 (en) * 2021-12-07 2023-06-15 阿里巴巴(中国)有限公司 Audio signal processing method and apparatus, device, and storage medium
CN111369978B (en) * 2018-12-26 2024-05-17 北京搜狗科技发展有限公司 Data processing method and device for data processing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200421263A (en) * 2003-04-10 2004-10-16 Delta Electronics Inc Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme
US20050033575A1 (en) * 2002-01-17 2005-02-10 Tobias Schneider Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer
US20120232901A1 (en) * 2009-08-04 2012-09-13 Autonomy Corporation Ltd. Automatic spoken language identification based on phoneme sequence patterns
CN103559879A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and device for extracting acoustic features in language identification system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAN SONG ET AL: "i-vector representation based on bottleneck features for language identification", Electronics Letters *
ZHONG Haibing et al.: "Factor analysis in phoneme-recognition-based language identification", Pattern Recognition and Artificial Intelligence *

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106373561A (en) * 2015-07-24 2017-02-01 三星电子株式会社 Apparatus and method of acoustic score calculation and speech recognition
CN106373561B (en) * 2015-07-24 2021-11-30 三星电子株式会社 Apparatus and method for acoustic score calculation and speech recognition
CN105427858A (en) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for achieving automatic voice classification
WO2017114201A1 (en) * 2015-12-31 2017-07-06 阿里巴巴集团控股有限公司 Method and device for executing setting operation
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
CN107045875A (en) * 2016-02-03 2017-08-15 重庆工商职业学院 Fundamental frequency detection method based on genetic algorithm
CN107045875B (en) * 2016-02-03 2019-12-06 重庆工商职业学院 fundamental tone frequency detection method based on genetic algorithm
CN105810191A (en) * 2016-03-08 2016-07-27 江苏信息职业技术学院 Prosodic information-combined Chinese dialect identification method
CN105810191B (en) * 2016-03-08 2019-11-29 江苏信息职业技术学院 Merge the Chinese dialects identification method of prosodic information
CN106297828B (en) * 2016-08-12 2020-03-24 苏州驰声信息科技有限公司 Detection method and device for false sounding detection based on deep learning
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN106847273B (en) * 2016-12-23 2020-05-05 北京云知声信息技术有限公司 Awakening word selection method and device for voice recognition
CN106847273A (en) * 2016-12-23 2017-06-13 北京云知声信息技术有限公司 The wake-up selected ci poem selection method and device of speech recognition
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A kind of sound end detecting method and audio recognition method
CN109754789A (en) * 2017-11-07 2019-05-14 北京国双科技有限公司 The recognition methods of phoneme of speech sound and device
CN109754789B (en) * 2017-11-07 2021-06-08 北京国双科技有限公司 Method and device for recognizing voice phonemes
CN110085216A (en) * 2018-01-23 2019-08-02 中国科学院声学研究所 A kind of vagitus detection method and device
CN108389573B (en) * 2018-02-09 2022-03-08 北京世纪好未来教育科技有限公司 Language identification method and device, training method and device, medium and terminal
CN108389573A (en) * 2018-02-09 2018-08-10 北京易真学思教育科技有限公司 Language Identification and device, training method and device, medium, terminal
CN109493846A (en) * 2018-11-18 2019-03-19 深圳市声希科技有限公司 A kind of English accent identifying system
CN109493846B (en) * 2018-11-18 2021-06-08 深圳市声希科技有限公司 English accent recognition system
CN109613526A (en) * 2018-12-10 2019-04-12 航天南湖电子信息技术股份有限公司 A kind of point mark filter method based on support vector machines
CN111369978A (en) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111369978B (en) * 2018-12-26 2024-05-17 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN111583906A (en) * 2019-02-18 2020-08-25 中国移动通信有限公司研究院 Role recognition method, device and terminal for voice conversation
CN111583906B (en) * 2019-02-18 2023-08-15 中国移动通信有限公司研究院 Role recognition method, device and terminal for voice session
CN110491382B (en) * 2019-03-11 2020-12-04 腾讯科技(深圳)有限公司 Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN110176251B (en) * 2019-04-03 2021-12-21 苏州驰声信息科技有限公司 Automatic acoustic data labeling method and device
CN110176251A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 A kind of acoustic data automatic marking method and device
CN111078937B (en) * 2019-12-27 2021-08-10 北京世纪好未来教育科技有限公司 Voice information retrieval method, device, equipment and computer readable storage medium
CN111078937A (en) * 2019-12-27 2020-04-28 北京世纪好未来教育科技有限公司 Voice information retrieval method, device, equipment and computer readable storage medium
CN111402861B (en) * 2020-03-25 2022-11-15 思必驰科技股份有限公司 Voice recognition method, device, equipment and storage medium
CN111402861A (en) * 2020-03-25 2020-07-10 苏州思必驰信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN111862939A (en) * 2020-05-25 2020-10-30 北京捷通华声科技股份有限公司 Prosodic phrase marking method and device
WO2022100692A1 (en) * 2020-11-12 2022-05-19 北京猿力未来科技有限公司 Human voice audio recording method and apparatus
CN112562649B (en) * 2020-12-07 2024-01-30 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112562649A (en) * 2020-12-07 2021-03-26 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112634874B (en) * 2020-12-24 2022-09-23 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN112634874A (en) * 2020-12-24 2021-04-09 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN113571045A (en) * 2021-06-02 2021-10-29 北京它思智能科技有限公司 Minnan language voice recognition method, system, equipment and medium
CN113571045B (en) * 2021-06-02 2024-03-12 北京它思智能科技有限公司 Method, system, equipment and medium for identifying Minnan language voice
CN113327579A (en) * 2021-08-03 2021-08-31 北京世纪好未来教育科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN115938351A (en) * 2021-09-13 2023-04-07 北京数美时代科技有限公司 ASR language model construction method, system, storage medium and electronic device
CN115938351B (en) * 2021-09-13 2023-08-15 北京数美时代科技有限公司 ASR language model construction method, system, storage medium and electronic equipment
WO2023103693A1 (en) * 2021-12-07 2023-06-15 阿里巴巴(中国)有限公司 Audio signal processing method and apparatus, device, and storage medium

Also Published As

Publication number Publication date
CN104681036B (en) 2018-09-25

Similar Documents

Publication Publication Date Title
CN104681036A (en) System and method for detecting language voice frequency
US8301450B2 (en) Apparatus, method, and medium for dialogue speech recognition using topic domain detection
CN104575490A (en) Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
JP5752060B2 (en) Information processing apparatus, large vocabulary continuous speech recognition method and program
Ryant et al. Highly accurate mandarin tone classification in the absence of pitch information
WO2010100853A1 (en) Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium
Kumar et al. A comprehensive view of automatic speech recognition system-a systematic literature review
KR20180038707A (en) Method for recogniting speech using dynamic weight and topic information
Agrawal et al. Analysis and modeling of acoustic information for automatic dialect classification
Prabhavalkar et al. Discriminative articulatory models for spoken term detection in low-resource conversational settings
Savargiv et al. Persian speech emotion recognition
Gholamdokht Firooz et al. Spoken language recognition using a new conditional cascade method to combine acoustic and phonetic results
Baljekar et al. Using articulatory features and inferred phonological segments in zero resource speech processing
JP3660512B2 (en) Voice recognition method, apparatus and program recording medium
Sahu et al. A study on automatic speech recognition toolkits
Rabiee et al. Persian accents identification using an adaptive neural network
Sharma et al. Automatic speech recognition systems: challenges and recent implementation trends
Cui et al. Improving deep neural network acoustic modeling for audio corpus indexing under the iarpa babel program
Kolesau et al. Voice activation systems for embedded devices: Systematic literature review
Rasipuram et al. Grapheme and multilingual posterior features for under-resourced speech recognition: a study on scottish gaelic
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
Schuller et al. Late fusion of individual engines for improved recognition of negative emotion in speech-learning vs. democratic vote
KR20230156125A (en) Lookup table recursive language model
Chiang et al. A study on cross-language knowledge integration in Mandarin LVCSR
Tabibian A survey on structured discriminative spoken keyword spotting

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant