CN104681036A - System and method for detecting language voice frequency - Google Patents


Info

Publication number
CN104681036A
Authority
CN
China
Prior art keywords
language
confidence
phoneme sequence
acoustic
sentence
Prior art date
Legal status
Granted
Application number
CN201510091609.9A
Other languages
Chinese (zh)
Other versions
CN104681036B (en)
Inventor
王欢良
杨嵩
代大明
袁军峰
惠寅华
林远东
Current Assignee
Suzhou Chisheng Information Technology Co Ltd
Original Assignee
Suzhou Chisheng Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Chisheng Information Technology Co Ltd
Priority to CN201510091609.9A
Publication of CN104681036A
Application granted; publication of CN104681036B
Legal status: Active


Abstract

The invention discloses a system and method for detecting language audio, and belongs to the technical field of speech signal processing. The system comprises an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module and a classification module. By jointly exploiting acoustic confidence, language confidence and prosodic feature information, the detection performance of the system is markedly improved. The system is suitable for detecting audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be quickly extended according to the types of non-target language encountered, simply by providing an acoustic model and a language model for the new language and retraining the classifier model, so the system architecture has good flexibility and extensibility.

Description

Detection system and method for language audio
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a detection system and method for language audio.
Background technology
The real-world operating environment of speech technology is usually very complex: the audio a system receives may contain many non-target-language sounds, such as speech in other languages, music, natural noise and man-made noise. Such audio can severely degrade the usability and user experience of speech technology. It is therefore necessary to detect and filter this audio efficiently by technical means.
Among such techniques, the most typical are language identification and noise detection. Language identification exploits the phonetic information contained in speech (for example language-specific pronunciation units, or differences in the distribution or combination of pronunciation units) to determine the language category.
In the prior art, the most mature language identification technique is the multi-phoneme-language-model approach based on phoneme recognition. Its premise is that the phoneme sequences produced by recognizers of different languages follow language-specific distribution and combination patterns, so the language can be identified from the probabilities that the recognized phoneme sequences receive under the phoneme language models of the various languages. This technique has good accuracy and generality, but its performance drops sharply on short utterances, which is a notable limitation.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide a detection system and method for language audio. The technical solution is as follows:
In one aspect, a detection system for language audio is provided, the system comprising: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module and a classification module;
Wherein,
The acoustic feature extraction module is configured to extract acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio;
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language, each recognizer in the group corresponding to a different language, and is configured to perform parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least the best phoneme sequence and corresponding time boundaries of the target language;
The acoustic confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a DNN model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language;
The language confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the generation probability of that phoneme sequence under a higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language;
The prosodic feature extraction module is configured to compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio;
The classification module is configured to use a pre-trained classifier on the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, to classify the input as target language or non-target language.
Optionally, each recognizer in the group adopts the acoustic model and language model of its corresponding language; the acoustic model is trained in advance on speech data of that language, and the language model is trained in advance on text data of that language.
Optionally, the prosodic features of the audio comprise: the sentence-level maximum F0, the sentence-level minimum F0, the variance of sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum phoneme-level F0 variance, the proportion of voiced segments in the sentence, the proportion of voiceless phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme duration in the sentence.
Optionally, the classification module is further configured to concatenate the acoustic confidences, language confidences and prosodic features of the input audio into a super-vector, feed it to the pre-trained classifier for prediction, and compute the score of the super-vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
In another aspect, a detection method for language audio is provided, the method comprising:
extracting acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio;
performing parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least the best phoneme sequence and corresponding time boundaries of the target language;
computing, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a DNN model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language;
computing, from the best phoneme sequence and corresponding time boundaries of each language, the generation probability of that phoneme sequence under a higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language;
computing the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio;
using a pre-trained classifier on the feature vector formed from the acoustic confidences, language confidences and prosodic features of the input audio, to classify the input as target language or non-target language.
Optionally, the prosodic features of the audio comprise: the sentence-level maximum F0, the sentence-level minimum F0, the variance of sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum phoneme-level F0 variance, the proportion of voiced segments in the sentence, the proportion of voiceless phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme duration in the sentence.
Optionally, using the pre-trained classifier on the feature vector formed from the acoustic confidences, language confidences and prosodic features of the input audio to classify the input as target language or non-target language comprises:
concatenating the acoustic confidences, language confidences and prosodic features of the input audio into a super-vector, feeding it to the pre-trained classifier for prediction, and computing the score of the super-vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
The technical solutions provided by the embodiments of the present invention bring the following beneficial effects:
By jointly exploiting acoustic confidence, language confidence and prosodic feature information, the provided method markedly improves the detection performance of the system. It is suitable for detecting audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be quickly extended according to the types of non-target language encountered: it suffices to provide an acoustic model and a language model for the new language and then retrain the classifier model, so the system architecture has good flexibility and extensibility.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of the language audio detection system provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the language audio detection method provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the drawings.
Fig. 1 is a schematic structural diagram of the language audio detection system provided by an embodiment of the present invention. Referring to Fig. 1, the system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module and a classification module. Wherein,
The acoustic feature extraction module is configured to extract acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio.
The acoustic features may include PLP (Perceptual Linear Prediction) features, MFCC (Mel-Frequency Cepstral Coefficient) features, filterbank (fbank) features, and so on.
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language, each recognizer corresponding to a different language, and is configured to perform parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least those of the target language.
In the embodiments of the present invention, the phoneme recognition module is composed of a group of phoneme recognizers, each corresponding to a different language. The group must include a recognizer for the target language. The group may contain only the target-language phoneme recognizer, which reduces the computational load of the system at the cost of a limited drop in detection performance; alternatively, in addition to the target-language recognizer, it may contain phoneme recognizers for other, non-target languages, corresponding to the languages likely to be encountered in the actual application environment. Each recognizer adopts the acoustic model and phoneme language model of its corresponding language. The output of this module is a group of phoneme sequences together with their corresponding time boundaries and internal state sequences. Optionally, each recognizer adopts the acoustic model and language model of its corresponding language; the acoustic model is trained in advance on speech data of that language, and the language model is trained in advance on text data of that language.
Optionally, all recognizers in the group adopt acoustic models and language models of the same structure. Typically, the acoustic model is a DNN (Deep Neural Network)/HMM (Hidden Markov Model) hybrid with phonemes as the uniform acoustic modeling unit, and the language model is an n-gram statistical language model over phonemes. In a preferred embodiment of the invention, the n-gram language model used for decoding is a 3-gram phoneme language model.
The acoustic confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under the DNN model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language.
There are many common confidence computation methods, including feature-based confidence techniques and confidence techniques based on N-best lists or lattices. The scheme adopted in the embodiments of the present invention is the average of phoneme-level acoustic posteriors under the DNN model.
Alternatively, the computing method of acoustic confidence are:
C a ( s ) = 1 n Σ i = 1 i = n e C a ( p i ) ,
C a ( p i ) = 1 m Σ j = 1 j = m ln P ( s j | o j ) .
Wherein, C as () is the acoustic confidence of sentence s, C a(p i) be the p of i-th phoneme in sentence iacoustic confidence, n is the phoneme number in sentence s, and m is phoneme p iin the feature frame number that comprises, P (s j| o j) be phoneme p iin given jth acoustics observe o jat state s jon posterior probability.
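The phoneme-averaged posterior scheme above can be sketched in a few lines; the function names and the toy frame posteriors below are illustrative only (a real system would take ln P(s_j | o_j) from the DNN's output layer):

```python
import math

def phoneme_confidence(frame_log_posteriors):
    # C_a(p_i): mean of ln P(s_j | o_j) over the m frames of one phoneme
    return sum(frame_log_posteriors) / len(frame_log_posteriors)

def sentence_confidence(phoneme_frames):
    # C_a(s): mean of exp(C_a(p_i)) over the n phonemes of the sentence
    return sum(math.exp(phoneme_confidence(p)) for p in phoneme_frames) / len(phoneme_frames)

# Toy example: a two-phoneme utterance with hypothetical state posteriors per frame.
phonemes = [
    [math.log(0.9), math.log(0.8)],                 # phoneme 1: 2 frames
    [math.log(0.6), math.log(0.7), math.log(0.5)],  # phoneme 2: 3 frames
]
score = sentence_confidence(phonemes)
```

Note that exp of a mean of log posteriors is the geometric mean of the frame posteriors, so each phoneme contributes a value in (0, 1] regardless of its length.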
The language confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the generation probability of that phoneme sequence under a higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language.
In the embodiments of the present invention, this confidence is computed as follows: given the phoneme sequence output by the recognizer of a language A, the generation probability of the sequence is computed under a reference phoneme language model. This reference phoneme language model is different from the language model used for phoneme recognition, and is usually of higher order. Unless otherwise stated, the language models herein are statistical n-gram language models.
Alternatively, the computing method of language degree of confidence are:
C 1(s)=P(p 1,p 2…p n)
=P(p 1)P(p 2|P 1)P(p 3|p 1|p 2)…P(p n|p n-k+1…p n-1)。
Wherein, P (p n| p n-k+1p n-1) be the probability of the phonemic language model of k-gram, can add up on a large amount of text data and obtain.
In the preferred embodiment of the invention, the language model for computational language degree of confidence is the phonemic language model of 4-gram.
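The chain-rule product above can be sketched as follows; the `ngram_prob` lookup and the toy probability table are hypothetical stand-ins for a real trained k-gram phoneme language model with back-off:

```python
def language_confidence(phonemes, ngram_prob, k=4):
    # C_l(s): product of k-gram probabilities P(p_i | p_{i-k+1} ... p_{i-1}),
    # using a truncated history for the first k-1 phonemes.
    prob = 1.0
    for i, p in enumerate(phonemes):
        history = tuple(phonemes[max(0, i - k + 1):i])
        prob *= ngram_prob(history, p)
    return prob

# Toy model: a fixed table with a uniform fallback, standing in for a trained LM.
table = {((), "a"): 0.4, (("a",), "b"): 0.5, (("a", "b"), "a"): 0.25}
toy_ngram = lambda hist, p: table.get((hist, p), 0.1)
conf = language_confidence(["a", "b", "a"], toy_ngram)  # 0.4 * 0.5 * 0.25
```

In practice the product would be accumulated in the log domain to avoid underflow on long phoneme sequences.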
The prosodic feature extraction module is configured to compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio.
In the embodiments of the present invention, the prosodic features of the audio comprise: the sentence-level maximum and minimum F0; the variance of sentence-level F0; the mean and variance of the phoneme-level F0 variances within the sentence; the difference between the maximum and minimum phoneme-level F0 variance within the sentence; the proportion of voiced segments (segments with non-zero F0) in the sentence; the proportion of voiceless phonemes (phonemes whose internal F0 values are all zero) in the sentence; the maximum and minimum phoneme durations; and the mean and variance of phoneme duration.
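A minimal sketch of this feature computation is given below, assuming a frame-level F0 contour in which unvoiced frames are zero and phoneme time boundaries expressed as frame indices. Only a subset of the listed features is shown, and the function and field names are invented for illustration:

```python
import statistics

def prosodic_features(f0, phone_bounds):
    # f0: per-frame fundamental frequency, 0.0 on unvoiced frames
    # phone_bounds: (start_frame, end_frame) per phoneme, from the decoder alignment
    voiced = [v for v in f0 if v > 0]
    durations = [end - start for start, end in phone_bounds]
    silent = sum(1 for start, end in phone_bounds
                 if all(v == 0 for v in f0[start:end]))
    return {
        "f0_max": max(voiced),
        "f0_min": min(voiced),
        "f0_var": statistics.pvariance(voiced),
        "voiced_ratio": len(voiced) / len(f0),          # voiced-segment proportion
        "silent_phone_ratio": silent / len(phone_bounds),
        "dur_max": max(durations),
        "dur_min": min(durations),
        "dur_mean": statistics.fmean(durations),
        "dur_var": statistics.pvariance(durations),
    }

# Toy contour: 6 frames, two phonemes of 3 frames each.
feats = prosodic_features([0.0, 100.0, 120.0, 0.0, 110.0, 0.0], [(0, 3), (3, 6)])
```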
The classification module is configured to use a pre-trained classifier on the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, to classify the input as target language or non-target language.
The pre-trained classifier must be trained in advance on a large amount of collected and labelled data. Common classifiers include Bayesian classifiers, k-nearest neighbours, support vector machines, decision trees, maximum entropy models, conditional random fields and neural networks. The present invention adopts a support vector machine classifier.
In the embodiments of the present invention, the classification module is further configured to concatenate the acoustic confidences, language confidences and prosodic features of the input audio into a super-vector, feed it to the pre-trained classifier for prediction, and compute the score of the super-vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio. The score output by the classifier is the posterior probability that the given audio belongs to the target language: if this posterior probability exceeds the given threshold, the input audio is judged to be target language, otherwise non-target language.
In a preferred embodiment of the invention, the classifier performing the target/non-target language decision adopts a support vector machine model with a radial basis function (RBF) kernel.
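The super-vector assembly and threshold decision can be sketched as below. The scoring function is a stub standing in for the RBF-kernel support vector machine described above (for example an `sklearn.svm.SVC(kernel="rbf", probability=True)` trained on labelled data); only the glue logic is shown, and all names are illustrative:

```python
def make_supervector(acoustic_confs, language_confs, prosodic_feats):
    # Concatenate per-language acoustic and language confidences with the
    # prosodic feature vector of the input audio into one feature vector.
    return list(acoustic_confs) + list(language_confs) + list(prosodic_feats)

def decide(posterior_fn, supervector, threshold=0.5):
    # Target language iff the classifier's posterior score exceeds the threshold.
    return posterior_fn(supervector) > threshold

# Stub scorer: pretends the classifier posterior is the mean feature value.
stub_posterior = lambda vec: sum(vec) / len(vec)
sv = make_supervector([0.8, 0.3], [0.7, 0.2], [0.9])
is_target = decide(stub_posterior, sv)
```

The threshold is the system's operating point: raising it trades more missed target-language audio for fewer false acceptances of non-target audio.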
The system provided by the embodiments of the present invention, by jointly exploiting acoustic confidence, language confidence and prosodic feature information, markedly improves detection performance. It is suitable for detecting audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be quickly extended according to the types of non-target language: it suffices to provide an acoustic model and a language model for the new language and then retrain the classifier model, so the system architecture has good flexibility and extensibility.
Fig. 2 is a flow chart of the language audio detection method provided by an embodiment of the present invention. Referring to Fig. 2, the method comprises:
201. Extract acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio.
202. Perform parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least those of the target language.
203. From the best phoneme sequence and corresponding time boundaries of each language, compute the posterior probability of that phoneme sequence under the DNN model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language.
204. From the best phoneme sequence and corresponding time boundaries of each language, compute the generation probability of that phoneme sequence under the higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language.
205. From the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio, compute the prosodic features of the input audio.
206. Use a pre-trained classifier on the feature vector formed from the acoustic confidences, language confidences and prosodic features of the input audio, to classify the input as target language or non-target language.
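Steps 201-206 can be strung together as in the sketch below; every component is passed in as a pluggable callable, and the stubs in the example are invented for illustration and do no real decoding:

```python
def detect_language_audio(audio, extract_features, recognizers,
                          acoustic_conf, language_conf, prosody,
                          classify, threshold=0.5):
    feats = extract_features(audio)                                    # step 201
    decoded = {lang: rec(feats) for lang, rec in recognizers.items()}  # step 202
    a = [acoustic_conf(lang, seq) for lang, seq in decoded.items()]    # step 203
    l = [language_conf(lang, seq) for lang, seq in decoded.items()]    # step 204
    pros = prosody(decoded["target"], feats)                           # step 205
    supervector = a + l + pros                                         # step 206
    return classify(supervector) > threshold

# Trivial stubs so the flow can be exercised end to end.
result = detect_language_audio(
    audio=[0.0],
    extract_features=lambda audio: audio,
    recognizers={"target": lambda f: ["a", "b"], "other": lambda f: ["c"]},
    acoustic_conf=lambda lang, seq: 0.9 if lang == "target" else 0.2,
    language_conf=lambda lang, seq: 0.8 if lang == "target" else 0.1,
    prosody=lambda seq, feats: [0.6],
    classify=lambda vec: sum(vec) / len(vec),
)
```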
Optionally, the pre-trained classifier must be trained in advance on a large amount of collected and labelled data.
Optionally, the prosodic features of the audio comprise: the sentence-level maximum F0, the sentence-level minimum F0, the variance of sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum phoneme-level F0 variance, the proportion of voiced segments in the sentence, the proportion of voiceless phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme duration in the sentence.
Optionally, using the pre-trained classifier on the feature vector formed from the acoustic confidences, language confidences and prosodic features of the input audio to classify the input as target language or non-target language comprises:
concatenating the acoustic confidences, language confidences and prosodic features of the input audio into a super-vector, feeding it to the pre-trained classifier for prediction, and computing the score of the super-vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
The method provided by the embodiments of the present invention, by jointly exploiting acoustic confidence, language confidence and prosodic feature information, markedly improves detection performance. It is suitable for detecting audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noise audio, and has good practicality. It can be quickly extended according to the types of non-target language: it suffices to provide an acoustic model and a language model for the new language and then retrain the classifier model, so the system architecture has good flexibility and extensibility.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disc.
The foregoing are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (7)

1. A detection system for language audio, characterized in that the system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module and a classification module;
Wherein,
the acoustic feature extraction module is configured to extract acoustic features from the input speech signal, the acoustic features comprising at least the fundamental frequency (F0) feature of the input audio;
the phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language, each recognizer corresponding to a different language, and is configured to perform parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each language, including at least the best phoneme sequence and corresponding time boundaries of the target language;
the acoustic confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a deep neural network (DNN) model, taken as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of the phoneme sequence of each language;
the language confidence computation module is configured to compute, from the best phoneme sequence and corresponding time boundaries of each language, the generation probability of that phoneme sequence under a higher-order phoneme language model of the corresponding language, taken as the language confidence of the phoneme sequence, thereby obtaining the language confidence of the phoneme sequence of each language;
the prosodic feature extraction module is configured to compute the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language and the F0 feature of the input audio;
the classification module is configured to use a pre-trained classifier on the feature vector formed from the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, to classify the input as target language or non-target language.
2. The system according to claim 1, wherein each recognizer in the group of recognizers uses the acoustic model and the language model of its corresponding language, the acoustic model being trained in advance on speech data of that language and the language model being trained in advance on text data of that language.
3. The system according to claim 1, wherein the prosodic features of the audio comprise: the sentence-level F0 maximum; the sentence-level F0 minimum; the variance of the sentence-level F0; the mean of the phoneme-level F0 variances; the variance of the phoneme-level F0 variances; the difference between the maximum and the minimum of the phoneme-level F0 variances; the proportion of voiced segments in the sentence; the proportion of silent phonemes in the sentence; the maximum phoneme duration in the sentence; the minimum phoneme duration in the sentence; the mean phoneme duration in the sentence; and the variance of the phoneme durations in the sentence.
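The twelve prosodic features of claim 3 can be computed from a phoneme-level alignment carrying F0 values. A sketch, assuming a hypothetical input layout of per-phoneme F0 frames, durations, and voicing flags (the patent does not specify any data structures):

```python
import statistics as st

def prosodic_features(phonemes):
    # `phonemes`: list of dicts with 'f0' (per-frame F0 values, empty for
    # silence), 'duration' (seconds), 'voiced' (bool) -- an assumed layout.
    sent_f0 = [v for p in phonemes for v in p["f0"]]
    f0_vars = [st.pvariance(p["f0"]) for p in phonemes if len(p["f0"]) > 1]
    durs = [p["duration"] for p in phonemes]
    total = sum(durs)
    voiced = sum(p["duration"] for p in phonemes if p["voiced"])
    silent = sum(1 for p in phonemes if not p["f0"])
    return [
        max(sent_f0), min(sent_f0), st.pvariance(sent_f0),  # sentence-level F0
        st.mean(f0_vars), st.pvariance(f0_vars),            # phoneme-level F0 variance stats
        max(f0_vars) - min(f0_vars),
        voiced / total, silent / len(phonemes),             # voiced ratio, silent-phoneme ratio
        max(durs), min(durs), st.mean(durs), st.pvariance(durs),  # duration stats
    ]
```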
4. The system according to claim 1, wherein the classification module is further configured to combine the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a supervector, feed the supervector into the pre-trained classifier for prediction, and compute a score for the supervector; if the score is greater than a given threshold, the input audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
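The threshold rule of claim 4 reduces to scoring the supervector and comparing the score against a cutoff. A sketch with a linear decision function standing in for the pre-trained classifier (the patent does not name a classifier type; an SVM or similar would fit here):

```python
def classify(supervector, weights, bias=0.0, threshold=0.0):
    # Linear score w.x + b; label 'target' if the score exceeds the threshold.
    score = sum(w * x for w, x in zip(weights, supervector)) + bias
    label = "target" if score > threshold else "non-target"
    return label, score

label, score = classify([0.9, -1.2, 0.3], [1.0, 0.5, 2.0])
print(label)  # target
```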
5. A method for detecting language audio, wherein the method comprises:
extracting acoustic features from an input speech signal, the acoustic features including at least the fundamental frequency (F0) features of the input audio;
performing parallel speech recognition decoding on the acoustic features to obtain a best phoneme sequence and corresponding time boundaries for each language, the best phoneme sequences and corresponding time boundaries of the different languages including at least the best phoneme sequence and corresponding time boundaries of the target language;
computing, from the best phoneme sequence and corresponding time boundaries of each language, the posterior probability of that phoneme sequence under a DNN model, taking this posterior probability as the acoustic confidence of the phoneme sequence, and thereby obtaining the acoustic confidence of the phoneme sequence of each language;
computing, from the best phoneme sequence and corresponding time boundaries of each language, the generating probability of that phoneme sequence under a multi-order (n-gram) language model of the corresponding language, taking this generating probability as the language confidence of the phoneme sequence, and thereby obtaining the language confidence of the phoneme sequence of each language;
computing the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the F0 features of the input audio;
applying a pre-trained classifier to a feature vector composed of the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio, so as to classify the input as target language or non-target language.
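The acoustic-confidence step above can be sketched as averaging DNN log posteriors over the aligned frames of each phoneme, then over the whole sequence. The exact averaging scheme and the posterior-matrix layout are assumptions; the claim only states that the posterior probability on the DNN model is used:

```python
import math

def phone_confidence(posteriors, phone_id, start, end):
    # Mean log posterior of class `phone_id` over frames [start, end);
    # posteriors[t][k] is the DNN output for frame t, phoneme class k.
    frames = posteriors[start:end]
    return sum(math.log(f[phone_id]) for f in frames) / len(frames)

def sequence_confidence(posteriors, alignment):
    # `alignment`: (phone_id, start_frame, end_frame) triples taken from
    # the best phoneme sequence and its time boundaries.
    confs = [phone_confidence(posteriors, p, s, e) for p, s, e in alignment]
    return sum(confs) / len(confs)
```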
6. The method according to claim 5, wherein the prosodic features of the audio comprise: the sentence-level F0 maximum; the sentence-level F0 minimum; the variance of the sentence-level F0; the mean of the phoneme-level F0 variances; the variance of the phoneme-level F0 variances; the difference between the maximum and the minimum of the phoneme-level F0 variances; the proportion of voiced segments in the sentence; the proportion of silent phonemes in the sentence; the maximum phoneme duration in the sentence; the minimum phoneme duration in the sentence; the mean phoneme duration in the sentence; and the variance of the phoneme durations in the sentence.
7. The method according to claim 5, wherein applying a pre-trained classifier to a feature vector composed of the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio to perform target-language/non-target-language classification comprises:
combining the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a supervector, feeding the supervector into the pre-trained classifier for prediction, and computing a score for the supervector; if the score is greater than a given threshold, the input audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
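The language-confidence term entering the supervector can be illustrated with a bigram phoneme language model. The model order, the length normalization, and the unseen-bigram floor are all assumptions; the claims only require a generating probability under a multi-order language model:

```python
import math

def language_confidence(phones, bigram_logprob, unigram_logprob):
    # Length-normalized generating log-probability of the phoneme sequence;
    # backs off to a unigram (or a small floor) when a bigram is unseen.
    logp, prev = 0.0, "<s>"
    for ph in phones:
        logp += bigram_logprob.get((prev, ph), unigram_logprob.get(ph, math.log(1e-6)))
        prev = ph
    return logp / len(phones)
```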
CN201510091609.9A 2014-11-20 2015-02-28 A kind of detecting system and method for language audio Active CN104681036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510091609.9A CN104681036B (en) 2014-11-20 2015-02-28 A kind of detecting system and method for language audio

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201410668235 2014-11-20
CN2014106682358 2014-11-20
CN201510091609.9A CN104681036B (en) 2014-11-20 2015-02-28 A kind of detecting system and method for language audio

Publications (2)

Publication Number Publication Date
CN104681036A true CN104681036A (en) 2015-06-03
CN104681036B CN104681036B (en) 2018-09-25

Family

ID=53315987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510091609.9A Active CN104681036B (en) 2014-11-20 2015-02-28 A kind of detecting system and method for language audio

Country Status (1)

Country Link
CN (1) CN104681036B (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105427858A (en) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for achieving automatic voice classification
CN105810191A (en) * 2016-03-08 2016-07-27 江苏信息职业技术学院 Prosodic information-combined Chinese dialect identification method
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN106373561A (en) * 2015-07-24 2017-02-01 三星电子株式会社 Apparatus and method of acoustic score calculation and speech recognition
CN106847273A (en) * 2016-12-23 2017-06-13 北京云知声信息技术有限公司 The wake-up selected ci poem selection method and device of speech recognition
WO2017114201A1 (en) * 2015-12-31 2017-07-06 阿里巴巴集团控股有限公司 Method and device for executing setting operation
CN107045875A (en) * 2016-02-03 2017-08-15 重庆工商职业学院 Fundamental frequency detection method based on genetic algorithm
CN108389573A (en) * 2018-02-09 2018-08-10 北京易真学思教育科技有限公司 Language Identification and device, training method and device, medium, terminal
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A kind of sound end detecting method and audio recognition method
CN109493846A (en) * 2018-11-18 2019-03-19 深圳市声希科技有限公司 A kind of English accent identifying system
CN109613526A (en) * 2018-12-10 2019-04-12 航天南湖电子信息技术股份有限公司 A kind of point mark filter method based on support vector machines
CN109754789A (en) * 2017-11-07 2019-05-14 北京国双科技有限公司 The recognition methods of phoneme of speech sound and device
CN110085216A (en) * 2018-01-23 2019-08-02 中国科学院声学研究所 A kind of vagitus detection method and device
CN110176251A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 A kind of acoustic data automatic marking method and device
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN111078937A (en) * 2019-12-27 2020-04-28 北京世纪好未来教育科技有限公司 Voice information retrieval method, device, equipment and computer readable storage medium
CN111369978A (en) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111402861A (en) * 2020-03-25 2020-07-10 苏州思必驰信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN111583906A (en) * 2019-02-18 2020-08-25 中国移动通信有限公司研究院 Role recognition method, device and terminal for voice conversation
CN111862939A (en) * 2020-05-25 2020-10-30 北京捷通华声科技股份有限公司 Prosodic phrase marking method and device
CN112562649A (en) * 2020-12-07 2021-03-26 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112634874A (en) * 2020-12-24 2021-04-09 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN113327579A (en) * 2021-08-03 2021-08-31 北京世纪好未来教育科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113571045A (en) * 2021-06-02 2021-10-29 北京它思智能科技有限公司 Minnan language voice recognition method, system, equipment and medium
WO2022100692A1 (en) * 2020-11-12 2022-05-19 北京猿力未来科技有限公司 Human voice audio recording method and apparatus
CN115938351A (en) * 2021-09-13 2023-04-07 北京数美时代科技有限公司 ASR language model construction method, system, storage medium and electronic device
WO2023103693A1 (en) * 2021-12-07 2023-06-15 阿里巴巴(中国)有限公司 Audio signal processing method and apparatus, device, and storage medium
CN111369978B (en) * 2018-12-26 2024-05-17 北京搜狗科技发展有限公司 Data processing method and device for data processing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200421263A (en) * 2003-04-10 2004-10-16 Delta Electronics Inc Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme
US20050033575A1 (en) * 2002-01-17 2005-02-10 Tobias Schneider Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer
US20120232901A1 (en) * 2009-08-04 2012-09-13 Autonomy Corporation Ltd. Automatic spoken language identification based on phoneme sequence patterns
CN103559879A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and device for extracting acoustic features in language identification system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAN SONG ET AL: "i-vector representation based on bottleneck features for language identification", Electronics Letters *
ZHONG Haibing et al.: "Factor analysis in phoneme-recognition-based language identification", Pattern Recognition and Artificial Intelligence *

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106373561A (en) * 2015-07-24 2017-02-01 三星电子株式会社 Apparatus and method of acoustic score calculation and speech recognition
CN106373561B (en) * 2015-07-24 2021-11-30 三星电子株式会社 Apparatus and method for acoustic score calculation and speech recognition
CN105427858A (en) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for achieving automatic voice classification
WO2017114201A1 (en) * 2015-12-31 2017-07-06 阿里巴巴集团控股有限公司 Method and device for executing setting operation
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
CN107045875A (en) * 2016-02-03 2017-08-15 重庆工商职业学院 Fundamental frequency detection method based on genetic algorithm
CN107045875B (en) * 2016-02-03 2019-12-06 重庆工商职业学院 fundamental tone frequency detection method based on genetic algorithm
CN105810191A (en) * 2016-03-08 2016-07-27 江苏信息职业技术学院 Prosodic information-combined Chinese dialect identification method
CN105810191B (en) * 2016-03-08 2019-11-29 江苏信息职业技术学院 Merge the Chinese dialects identification method of prosodic information
CN106297828B (en) * 2016-08-12 2020-03-24 苏州驰声信息科技有限公司 Detection method and device for false sounding detection based on deep learning
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN106847273B (en) * 2016-12-23 2020-05-05 北京云知声信息技术有限公司 Awakening word selection method and device for voice recognition
CN106847273A (en) * 2016-12-23 2017-06-13 北京云知声信息技术有限公司 The wake-up selected ci poem selection method and device of speech recognition
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A kind of sound end detecting method and audio recognition method
CN109754789A (en) * 2017-11-07 2019-05-14 北京国双科技有限公司 The recognition methods of phoneme of speech sound and device
CN109754789B (en) * 2017-11-07 2021-06-08 北京国双科技有限公司 Method and device for recognizing voice phonemes
CN110085216A (en) * 2018-01-23 2019-08-02 中国科学院声学研究所 A kind of vagitus detection method and device
CN108389573B (en) * 2018-02-09 2022-03-08 北京世纪好未来教育科技有限公司 Language identification method and device, training method and device, medium and terminal
CN108389573A (en) * 2018-02-09 2018-08-10 北京易真学思教育科技有限公司 Language Identification and device, training method and device, medium, terminal
CN109493846A (en) * 2018-11-18 2019-03-19 深圳市声希科技有限公司 A kind of English accent identifying system
CN109493846B (en) * 2018-11-18 2021-06-08 深圳市声希科技有限公司 English accent recognition system
CN109613526A (en) * 2018-12-10 2019-04-12 航天南湖电子信息技术股份有限公司 A kind of point mark filter method based on support vector machines
CN111369978A (en) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111369978B (en) * 2018-12-26 2024-05-17 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN111583906A (en) * 2019-02-18 2020-08-25 中国移动通信有限公司研究院 Role recognition method, device and terminal for voice conversation
CN111583906B (en) * 2019-02-18 2023-08-15 中国移动通信有限公司研究院 Role recognition method, device and terminal for voice session
CN110491382B (en) * 2019-03-11 2020-12-04 腾讯科技(深圳)有限公司 Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN110176251B (en) * 2019-04-03 2021-12-21 苏州驰声信息科技有限公司 Automatic acoustic data labeling method and device
CN110176251A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 A kind of acoustic data automatic marking method and device
CN111078937B (en) * 2019-12-27 2021-08-10 北京世纪好未来教育科技有限公司 Voice information retrieval method, device, equipment and computer readable storage medium
CN111078937A (en) * 2019-12-27 2020-04-28 北京世纪好未来教育科技有限公司 Voice information retrieval method, device, equipment and computer readable storage medium
CN111402861B (en) * 2020-03-25 2022-11-15 思必驰科技股份有限公司 Voice recognition method, device, equipment and storage medium
CN111402861A (en) * 2020-03-25 2020-07-10 苏州思必驰信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN111862939A (en) * 2020-05-25 2020-10-30 北京捷通华声科技股份有限公司 Prosodic phrase marking method and device
WO2022100692A1 (en) * 2020-11-12 2022-05-19 北京猿力未来科技有限公司 Human voice audio recording method and apparatus
CN112562649B (en) * 2020-12-07 2024-01-30 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112562649A (en) * 2020-12-07 2021-03-26 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112634874B (en) * 2020-12-24 2022-09-23 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN112634874A (en) * 2020-12-24 2021-04-09 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN113571045A (en) * 2021-06-02 2021-10-29 北京它思智能科技有限公司 Minnan language voice recognition method, system, equipment and medium
CN113571045B (en) * 2021-06-02 2024-03-12 北京它思智能科技有限公司 Method, system, equipment and medium for identifying Minnan language voice
CN113327579A (en) * 2021-08-03 2021-08-31 北京世纪好未来教育科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN115938351A (en) * 2021-09-13 2023-04-07 北京数美时代科技有限公司 ASR language model construction method, system, storage medium and electronic device
CN115938351B (en) * 2021-09-13 2023-08-15 北京数美时代科技有限公司 ASR language model construction method, system, storage medium and electronic equipment
WO2023103693A1 (en) * 2021-12-07 2023-06-15 阿里巴巴(中国)有限公司 Audio signal processing method and apparatus, device, and storage medium

Also Published As

Publication number Publication date
CN104681036B (en) 2018-09-25

Similar Documents

Publication Publication Date Title
CN104681036A (en) System and method for detecting language voice frequency
US8301450B2 (en) Apparatus, method, and medium for dialogue speech recognition using topic domain detection
CN104575490A (en) Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
JP5752060B2 (en) Information processing apparatus, large vocabulary continuous speech recognition method and program
Ryant et al. Highly accurate mandarin tone classification in the absence of pitch information
WO2010100853A1 (en) Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium
Kumar et al. A comprehensive view of automatic speech recognition system-a systematic literature review
KR20180038707A (en) Method for recogniting speech using dynamic weight and topic information
Agrawal et al. Analysis and modeling of acoustic information for automatic dialect classification
Prabhavalkar et al. Discriminative articulatory models for spoken term detection in low-resource conversational settings
Savargiv et al. Persian speech emotion recognition
Gholamdokht Firooz et al. Spoken language recognition using a new conditional cascade method to combine acoustic and phonetic results
Baljekar et al. Using articulatory features and inferred phonological segments in zero resource speech processing
JP3660512B2 (en) Voice recognition method, apparatus and program recording medium
Sahu et al. A study on automatic speech recognition toolkits
Rabiee et al. Persian accents identification using an adaptive neural network
Sharma et al. Automatic speech recognition systems: challenges and recent implementation trends
Cui et al. Improving deep neural network acoustic modeling for audio corpus indexing under the iarpa babel program
Kolesau et al. Voice activation systems for embedded devices: Systematic literature review
Rasipuram et al. Grapheme and multilingual posterior features for under-resourced speech recognition: a study on scottish gaelic
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
Schuller et al. Late fusion of individual engines for improved recognition of negative emotion in speech-learning vs. democratic vote
KR20230156125A (en) Lookup table recursive language model
Chiang et al. A study on cross-language knowledge integration in Mandarin LVCSR
Tabibian A survey on structured discriminative spoken keyword spotting

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant