CN104681036B - A kind of detecting system and method for language audio - Google Patents
Publication number: CN104681036B (application CN201510091609.9A, China)
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a detection system and method for language audio, belonging to the technical field of speech signal processing. The system comprises an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module. By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the invention significantly improves the detection performance of the system. It is applicable to audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be extended quickly according to the types of non-target languages: only an acoustic model and a language model for the new language need to be provided, after which the classifier model is retrained. The system structure therefore has good flexibility and scalability.
Description
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a detection system and method for language audio.
Background technology
The real application environments of speech technology are usually very complex: the audio received by a system may contain many non-target sounds, such as speech in other languages, music, natural noise, and man-made noise. The presence of such audio can severely degrade the usability and user experience of speech technology. Detecting and filtering these audios efficiently by technical means is therefore highly desirable.
Among such techniques, the most typical are language identification and noise detection. Language identification exploits the phonetic information contained in speech (such as language-specific pronunciation units, or different distributions and combinations of pronunciation units) to determine the language category.
In the prior art, the most mature language identification technique is the phoneme-recognition-plus-phonemic-language-model approach. It assumes that the phoneme sequences produced by recognizers of different languages have distinct distributional and combinatorial regularities, and therefore performs language identification using the probability that the phoneme sequence output by each language's recognizer is generated under the phonemic language model of each language. This technique has good accuracy and generality, but its performance degrades sharply on short utterances, which limits its applicability.
Summary of the invention
To solve the problems in the prior art, embodiments of the present invention provide a detection system and method for language audio. The technical solution is as follows:
In one aspect, a detection system for language audio is provided. The system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module;
Wherein,
The acoustic feature extraction module extracts the acoustic features of the input speech signal; the acoustic features include at least the fundamental frequency (F0) feature of the input audio.
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language. The recognizers correspond to different languages and decode the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language; these include at least the best phoneme sequence and time boundaries of the target language.
The acoustic confidence computation module, given each language's best phoneme sequence and time boundaries, computes the posterior probability of each language's phoneme sequence on a DNN model as the acoustic confidence of that phoneme sequence, yielding the acoustic confidence of each language's phoneme sequence.
The language confidence computation module, given each language's best phoneme sequence and time boundaries, computes the generation probability of each language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of that phoneme sequence, yielding the language confidence of each language's phoneme sequence.
The prosodic feature extraction module computes the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language together with the F0 feature of the input audio.
The discriminative classification module uses a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
Optionally, each recognizer in the group uses the acoustic model and language model of its own language; the acoustic model must be trained in advance on speech data of the corresponding language, and the language model must be trained in advance on text data of the corresponding language.
Optionally, the pre-trained classifier must be trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio include: the sentence-level F0 maximum, the sentence-level F0 minimum, the variance of the sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum of the phoneme-level F0 variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme durations in the sentence.
Optionally, the discriminative classification module is further configured to assemble the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feed it to the pre-trained classifier for prediction, and compute the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
In another aspect, a detection method for language audio is provided. The method comprises:
extracting the acoustic features of the input speech signal, the acoustic features including at least the fundamental frequency (F0) feature of the input audio;
decoding the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language, these including at least the best phoneme sequence and time boundaries of the target language;
from each language's best phoneme sequence and time boundaries, computing the posterior probability of that language's phoneme sequence on a DNN model as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of each language's phoneme sequence;
from each language's best phoneme sequence and time boundaries, computing the generation probability of that language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of the phoneme sequence, thereby obtaining the language confidence of each language's phoneme sequence;
computing the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language and the F0 feature of the input audio;
using a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
Optionally, the pre-trained classifier must be trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio include: the sentence-level F0 maximum, the sentence-level F0 minimum, the variance of the sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum of the phoneme-level F0 variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme durations in the sentence.
Optionally, using the pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences, language confidences, and prosodic features comprises:
assembling the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feeding it to the pre-trained classifier for prediction, and computing the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
The technical solutions provided by the embodiments of the present invention bring the following advantageous effects:
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the provided method significantly improves the detection performance of the system. It is applicable to audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be extended quickly according to the types of non-target languages: only an acoustic model and a language model for the new language need to be provided, after which the classifier model is retrained. The system structure therefore has good flexibility and scalability.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description are merely some embodiments of the present invention; a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a structural diagram of the language audio detection system provided in an embodiment of the present invention;
Fig. 2 is a flowchart of the language audio detection method provided in an embodiment of the present invention.
Specific implementation mode
To make the objectives, technical solutions, and advantages of the present invention clearer, the implementation modes of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 shows the structure of the language audio detection system provided in an embodiment of the present invention. Referring to Fig. 1, the system comprises an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module. Wherein,
the acoustic feature extraction module extracts the acoustic features of the input speech signal; these include at least the F0 feature of the input audio.
The acoustic features may include PLP (perceptual linear prediction) features, MFCC (mel-frequency cepstral coefficient) features, filterbank (fbank) features, and the like.
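The embodiment does not fix a particular F0 extractor. As a minimal illustration of extracting the required fundamental frequency feature, a per-frame autocorrelation estimate could look like the sketch below; the function name, voicing threshold, and search range are assumptions of this sketch, not specified by the patent:

```python
import numpy as np

def f0_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate F0 of one frame by autocorrelation; returns 0.0 for unvoiced frames."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    if ac[0] <= 0:                       # silent frame: no energy
        return 0.0
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    # Crude voicing decision: the peak must carry a sizeable share of the energy.
    return sr / lag if ac[lag] / ac[0] > 0.3 else 0.0
```

Applied frame by frame, this yields the F0 track (with 0.0 on unvoiced frames) that the prosodic feature extraction below consumes.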
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language. The recognizers correspond to different languages and decode the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language, including at least the best phoneme sequence and time boundaries of the target language.
In the embodiments of the present invention, the phoneme recognition module is composed of a group of phoneme recognizers, each corresponding to a different language. This group must include a speech recognizer for the target language. Each recognizer uses the acoustic model and phonemic language model of its own language. The module outputs a group of phoneme sequences together with their corresponding time boundaries and internal state sequences. Optionally, the module may contain only one phoneme recognizer, corresponding to the target language; this reduces the computational load of the system at the cost of a limited drop in detection performance. Optionally, the module may contain several groups of phoneme recognizers for non-target languages, one for each language likely to be encountered in the actual application environment, or recognizers may be built only for selected representative languages.
Optionally, each recognizer in the group uses the acoustic model and language model of its own language; the acoustic model must be trained in advance on speech data of the corresponding language, and the language model on text data of the corresponding language.
Optionally, the system uses acoustic models and language models of identical structure throughout. Typically, the acoustic models are DNN/HMM models with phonemes as the uniform acoustic modeling unit, and the language models are n-gram statistical language models over phonemes. In a preferred embodiment of the invention, decoding uses a 3-gram phonemic language model.
The acoustic confidence computation module, given each language's best phoneme sequence and time boundaries, computes the posterior probability of each language's phoneme sequence on a DNN (Deep Neural Network) model as the acoustic confidence of that phoneme sequence, yielding the acoustic confidence of each language's phoneme sequence.
There are many common confidence computation methods, including feature-based confidence measures and confidence measures based on N-best lists or lattices. The scheme used in the embodiments of the present invention is the phoneme-level mean of acoustic posteriors from a DNN model.
Optionally, the acoustic confidence is computed as:

C_a(s) = (1/n) * Σ_{i=1..n} C_a(p_i),  with  C_a(p_i) = (1/m) * Σ_{j=1..m} P(s_j | o_j)

where C_a(s) is the acoustic confidence of sentence s, C_a(p_i) is the acoustic confidence of the i-th phoneme p_i in the sentence, n is the number of phonemes in sentence s, m is the number of feature frames contained in phoneme p_i, and P(s_j | o_j) is the posterior probability of phoneme p_i being in state s_j given the j-th acoustic observation o_j.
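The phoneme-level posterior mean described here reduces to two nested averages. The sketch below illustrates the computation; the function names are my own, and the frame-level DNN posteriors are taken as given inputs:

```python
import numpy as np

def phoneme_confidence(frame_posteriors):
    """C_a(p_i): mean of the frame-level state posteriors P(s_j | o_j)
    over the m frames aligned to one phoneme."""
    return float(np.mean(frame_posteriors))

def sentence_acoustic_confidence(phoneme_frames):
    """C_a(s): mean of the n phoneme-level confidences in the sentence."""
    return float(np.mean([phoneme_confidence(f) for f in phoneme_frames]))
```

Here `phoneme_frames` would be grouped using the time boundaries produced by the phoneme recognition module, one list of posteriors per aligned phoneme.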
The language confidence computation module, given each language's best phoneme sequence and time boundaries, computes the generation probability of each language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of that phoneme sequence, yielding the language confidence of each language's phoneme sequence.
In the embodiments of the present invention, this confidence is computed as follows: given the phoneme sequence output by the recognizer of language A, compute the generation probability of that sequence under a standard phonemic language model. This standard phonemic language model is different from the language model used in phoneme recognition, and is usually of higher order. Unless otherwise stated, "language model" herein refers to a statistical n-gram language model.
Optionally, the language confidence is computed as:

C_l(s) = P(p_1 p_2 ... p_n) = P(p_1) P(p_2 | p_1) P(p_3 | p_1 p_2) ... P(p_n | p_{n-k+1} ... p_{n-1})

where P(p_n | p_{n-k+1} ... p_{n-1}) is a k-gram phonemic language model probability, which can be estimated from a large amount of text data.
In a preferred embodiment of the invention, the language model used to compute the language confidence is a 4-gram phonemic language model.
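Under the chain rule above, a k-gram evaluation is a product of conditional probabilities, usually accumulated in the log domain to avoid underflow on long sequences. The sketch below assumes a simple LM lookup interface (`ngram_prob`), which is my own stand-in and is not specified by the patent:

```python
import math

def language_confidence(phonemes, ngram_prob, k=4):
    """Log of C_l(s) = P(p1) P(p2|p1) ... P(pn | p_{n-k+1}...p_{n-1}).
    `ngram_prob(context, phone)` is an assumed LM interface returning a
    conditional probability given up to k-1 previous phonemes."""
    log_prob = 0.0
    for i, phone in enumerate(phonemes):
        context = tuple(phonemes[max(0, i - k + 1):i])  # up to k-1 previous phones
        log_prob += math.log(ngram_prob(context, phone))
    return log_prob
```

With k=4 this matches the preferred 4-gram phonemic language model; a real deployment would back this interface with an n-gram model estimated from text data.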
The prosodic feature extraction module computes the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language together with the F0 feature of the input audio.
In the embodiments of the present invention, the prosodic features of the audio include: the sentence-level F0 maximum and minimum; the variance of the sentence-level F0; the mean and variance of the phoneme-level F0 variances in the sentence; the difference between the maximum and minimum of the phoneme-level F0 variances in the sentence; the proportion of voiced segments (segments with non-zero F0) in the sentence; the proportion of unvoiced phonemes (phonemes whose internal F0 values are all zero) in the sentence; the maximum and minimum phoneme durations; and the mean and variance of the phoneme durations.
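The twelve prosodic features listed above can be computed directly from the per-phoneme F0 tracks implied by the target-language time boundaries. The sketch below assumes F0 is given per frame with 0.0 marking unvoiced frames; the function name and data representation are assumptions of this sketch:

```python
import numpy as np

def prosodic_features(phoneme_f0):
    """12 sentence-level prosodic features from per-phoneme F0 tracks
    (a list of arrays, one per phoneme; 0.0 frames are unvoiced)."""
    all_f0 = np.concatenate(phoneme_f0)
    voiced = all_f0[all_f0 > 0]
    var_per_phone = np.array([np.var(f) for f in phoneme_f0])
    durs = np.array([len(f) for f in phoneme_f0], dtype=float)
    return np.array([
        voiced.max(), voiced.min(), voiced.var(),          # sentence-level F0 stats
        var_per_phone.mean(), var_per_phone.var(),         # phoneme-level F0-variance stats
        var_per_phone.max() - var_per_phone.min(),
        (all_f0 > 0).mean(),                               # voiced-segment proportion
        np.mean([np.all(f == 0) for f in phoneme_f0]),     # unvoiced-phoneme proportion
        durs.max(), durs.min(), durs.mean(), durs.var(),   # duration stats (in frames)
    ])
```

Durations are counted here in frames; converting to seconds via the frame shift would not change the classifier's view, since the scale is consistent across utterances.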
The discriminative classification module uses a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
The pre-trained classifier must be trained in advance on a large amount of collected and labeled data. Common classifiers include Bayes classifiers, k-nearest neighbors, support vector machines, decision trees, maximum entropy and conditional random field models, and neural networks. The present invention uses a support vector machine classifier.
In the embodiments of the present invention, the discriminative classification module is further configured to assemble the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feed it to the pre-trained classifier for prediction, and compute the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio. Specifically, for a given audio the classifier outputs a score corresponding to the posterior probability that the audio belongs to the target language. If this posterior probability exceeds the given threshold, the input audio is judged to be in the target language; otherwise it is determined to be non-target-language.
In a preferred embodiment of the invention, the classifier used for the target-language/non-target-language decision is a support vector machine with a radial basis function (RBF) kernel.
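The decision function of an RBF-kernel support vector machine over the super vector can be written down directly. The sketch below scores a super vector against trained parameters; the support vectors, dual coefficients, bias, gamma value, and threshold are placeholders, not values from the patent:

```python
import numpy as np

def rbf_svm_score(x, support_vecs, dual_coefs, bias, gamma=0.5):
    """SVM decision score: sum_i a_i * exp(-gamma * ||x - sv_i||^2) + b."""
    kernel = np.exp(-gamma * np.sum((support_vecs - x) ** 2, axis=1))
    return float(dual_coefs @ kernel + bias)

def is_target_language(super_vector, model, threshold=0.5):
    """Target language iff the classifier score exceeds the given threshold."""
    return rbf_svm_score(super_vector, *model) > threshold
```

In practice the parameters would come from training an SVM (e.g. with an off-the-shelf library) on the collected and labeled super vectors; the threshold trades off false alarms against misses.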
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the system provided by the embodiments of the present invention significantly improves detection performance. It is applicable to audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be extended quickly according to the types of non-target languages: only an acoustic model and a language model for the new language need to be provided, after which the classifier model is retrained. The system structure therefore has good flexibility and scalability.
Fig. 2 is a flowchart of the language audio detection method provided in an embodiment of the present invention. Referring to Fig. 2, the method includes:
201. Extract the acoustic features of the input speech signal; the acoustic features include at least the F0 feature of the input audio.
202. Decode the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language, including at least the best phoneme sequence and time boundaries of the target language.
203. From each language's best phoneme sequence and time boundaries, compute the posterior probability of that language's phoneme sequence on a DNN model as the acoustic confidence of the phoneme sequence, obtaining the acoustic confidence of each language's phoneme sequence.
204. From each language's best phoneme sequence and time boundaries, compute the generation probability of that language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of the phoneme sequence, obtaining the language confidence of each language's phoneme sequence.
205. Compute the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language and the F0 feature of the input audio.
206. Use a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
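Steps 201-206 can be tied together as a thin driver over the component modules. In the sketch below, every callable is an assumed interface standing in for a module of Fig. 1, not an API defined by the patent:

```python
def detect(signal, extract, recognizers, acoustic_conf, language_conf,
           prosody, classify):
    """End-to-end sketch of steps 201-206.
    `recognizers` maps language -> decoder and must contain key "target"."""
    feats, f0 = extract(signal)                                        # step 201
    decoded = {lang: rec(feats) for lang, rec in recognizers.items()}  # step 202
    ca = [acoustic_conf(*d) for d in decoded.values()]                 # step 203
    cl = [language_conf(*d) for d in decoded.values()]                 # step 204
    pros = prosody(*decoded["target"], f0)                             # step 205
    super_vector = ca + cl + list(pros)                                # step 206
    return classify(super_vector)
```

This also reflects the extensibility claim: adding a non-target language only adds an entry to `recognizers` (its acoustic and language models), after which the classifier is retrained on the longer super vector.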
Optionally, the pre-trained classifier must be trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio include: the sentence-level F0 maximum, the sentence-level F0 minimum, the variance of the sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum of the phoneme-level F0 variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme durations in the sentence.
Optionally, using the pre-trained classifier to classify the feature vector formed by the acoustic confidences, language confidences, and prosodic features of the input audio comprises:
assembling the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feeding it to the pre-trained classifier for prediction, and computing the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the method provided by the embodiments of the present invention significantly improves detection performance. It is applicable to audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be extended quickly according to the types of non-target languages: only an acoustic model and a language model for the new language need to be provided, after which the classifier model is retrained. The system structure therefore has good flexibility and scalability.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (7)
1. A detection system for language audio, characterized in that the system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module;
wherein,
the acoustic feature extraction module extracts the acoustic features of the input speech signal, the acoustic features including at least the fundamental frequency (F0) feature of the input audio;
the phoneme recognition module consists of a group of recognizers including at least a recognizer for the target language, the recognizers corresponding to different languages and performing parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language, including at least the best phoneme sequence and time boundaries of the target language;
the acoustic confidence computation module, given each language's best phoneme sequence and time boundaries, computes the posterior probability of each language's phoneme sequence on a deep neural network (DNN) model as the acoustic confidence of that phoneme sequence, yielding the acoustic confidence of each language's phoneme sequence;
the language confidence computation module, given each language's best phoneme sequence and time boundaries, computes the generation probability of each language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of that phoneme sequence, yielding the language confidence of each language's phoneme sequence;
the prosodic feature extraction module computes the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language and the F0 feature of the input audio;
the discriminative classification module uses a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
2. The system according to claim 1, characterized in that each recognizer in the group uses the acoustic model and language model of its own language; the acoustic model must be trained in advance on speech data of the corresponding language, and the language model must be trained in advance on text data of the corresponding language.
3. The system according to claim 1, characterized in that the prosodic features of the audio include: the sentence-level F0 maximum, the sentence-level F0 minimum, the variance of the sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum of the phoneme-level F0 variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme durations in the sentence.
4. The system according to claim 1, characterized in that the discriminative classification module is further configured to assemble the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feed it to the pre-trained classifier for prediction, and compute the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
5. A detection method for language audio, characterized in that the method comprises:
extracting acoustic features from an input speech signal, the acoustic features including at least the fundamental frequency feature of the input audio;
performing parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each of the different languages, including at least the best phoneme sequence and corresponding time boundaries of the target language;
according to the best phoneme sequences of the different languages and their corresponding time boundaries, computing for each language the posterior probability of its phoneme sequence on the DNN model as the acoustic confidence of that phoneme sequence, thereby obtaining the acoustic confidences of the phoneme sequences of the different languages;
according to the best phoneme sequences of the different languages and their corresponding time boundaries, computing for each language the generation probability of its phoneme sequence on the higher-order language model of the corresponding language as the language confidence of that phoneme sequence, thereby obtaining the language confidences of the phoneme sequences of the different languages;
computing the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental frequency feature of the input audio;
using a pre-trained classifier to perform target-language/non-target-language classification on the feature vector composed of the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
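The two confidence scores in the method of claim 5 are typically length-normalised log probabilities: the acoustic confidence averages the DNN's frame-level posteriors over the decoded phonemes, and the language confidence averages the phoneme language model's log probability over the sequence. The sketch below is one common formulation, not the patent's exact computation; a bigram table stands in for the higher-order language model, and all names and data layouts are illustrative assumptions.

```python
import math

def acoustic_confidence(posteriors, phoneme_ids):
    """Average log posterior of the decoded phonemes on the DNN model:
    a per-frame normalised form of the posterior-based acoustic
    confidence. posteriors[t][p] is the DNN output probability for
    phoneme p at aligned frame t (illustrative layout)."""
    logs = [math.log(posteriors[t][p]) for t, p in enumerate(phoneme_ids)]
    return sum(logs) / len(logs)

def language_confidence(bigram_logprob, phoneme_seq):
    """Length-normalised generation log probability of the phoneme
    sequence under a phoneme language model of the corresponding
    language (a bigram table here; the patent uses a higher-order
    model)."""
    lp = sum(bigram_logprob[(a, b)] for a, b in zip(phoneme_seq, phoneme_seq[1:]))
    return lp / max(len(phoneme_seq) - 1, 1)
```

Because each language's decoder uses its own phoneme inventory and language model, target-language speech tends to score well under the target models and poorly under the others, which is the signal the final classifier exploits.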
6. The method according to claim 5, characterized in that the prosodic features of the audio include: the sentence-level fundamental frequency maximum, the sentence-level fundamental frequency minimum, the variance of the sentence-level fundamental frequency, the mean of the phoneme-level fundamental frequency variances, the variance of the phoneme-level fundamental frequency variances, the difference between the maximum and the minimum of the phoneme-level fundamental frequency variances, the proportion of voiced segments in the sentence, the proportion of silent phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean of the phoneme durations in the sentence, and the variance of the phoneme durations in the sentence.
7. The method according to claim 5, characterized in that using the pre-trained classifier to perform target-language/non-target-language classification on the feature vector composed of the acoustic confidences of the phoneme sequences of the different languages, the language confidences, and the prosodic features of the input audio comprises:
combining the acoustic confidences of the phoneme sequences of the different languages, the language confidences, and the prosodic features of the input audio into a single super vector, feeding it into the pre-trained classifier for prediction, and computing a score for the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, and otherwise it is determined to be non-target-language audio.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510091609.9A CN104681036B (en) | 2014-11-20 | 2015-02-28 | A kind of detecting system and method for language audio |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2014106682358 | 2014-11-20 | ||
CN201410668235 | 2014-11-20 | ||
CN201510091609.9A CN104681036B (en) | 2014-11-20 | 2015-02-28 | A kind of detecting system and method for language audio |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104681036A CN104681036A (en) | 2015-06-03 |
CN104681036B true CN104681036B (en) | 2018-09-25 |
Family
ID=53315987
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510091609.9A Active CN104681036B (en) | 2014-11-20 | 2015-02-28 | A kind of detecting system and method for language audio |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104681036B (en) |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102413692B1 (en) * | 2015-07-24 | 2022-06-27 | Samsung Electronics Co., Ltd. | Apparatus and method for caculating acoustic score for speech recognition, speech recognition apparatus and method, and electronic device |
CN105427858B (en) * | 2015-11-06 | 2019-09-03 | 科大讯飞股份有限公司 | Realize the method and system that voice is classified automatically |
CN106940998B (en) * | 2015-12-31 | 2021-04-16 | 阿里巴巴集团控股有限公司 | Execution method and device for setting operation |
CN107045875B (en) * | 2016-02-03 | 2019-12-06 | 重庆工商职业学院 | fundamental tone frequency detection method based on genetic algorithm |
CN105810191B (en) * | 2016-03-08 | 2019-11-29 | 江苏信息职业技术学院 | Merge the Chinese dialects identification method of prosodic information |
CN106297828B (en) * | 2016-08-12 | 2020-03-24 | 苏州驰声信息科技有限公司 | Detection method and device for false sounding detection based on deep learning |
CN106847273B (en) * | 2016-12-23 | 2020-05-05 | 北京云知声信息技术有限公司 | Awakening word selection method and device for voice recognition |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | A kind of sound end detecting method and audio recognition method |
CN109754789B (en) * | 2017-11-07 | 2021-06-08 | 北京国双科技有限公司 | Method and device for recognizing voice phonemes |
CN110085216A (en) * | 2018-01-23 | 2019-08-02 | 中国科学院声学研究所 | A kind of vagitus detection method and device |
CN108389573B (en) * | 2018-02-09 | 2022-03-08 | 北京世纪好未来教育科技有限公司 | Language identification method and device, training method and device, medium and terminal |
CN109493846B (en) * | 2018-11-18 | 2021-06-08 | 深圳市声希科技有限公司 | English accent recognition system |
CN109613526A (en) * | 2018-12-10 | 2019-04-12 | 航天南湖电子信息技术股份有限公司 | A kind of point mark filter method based on support vector machines |
CN111369978B (en) * | 2018-12-26 | 2024-05-17 | 北京搜狗科技发展有限公司 | Data processing method and device for data processing |
CN111583906B (en) * | 2019-02-18 | 2023-08-15 | 中国移动通信有限公司研究院 | Role recognition method, device and terminal for voice session |
CN109817213B (en) * | 2019-03-11 | 2024-01-23 | 腾讯科技(深圳)有限公司 | Method, device and equipment for performing voice recognition on self-adaptive language |
CN110176251B (en) * | 2019-04-03 | 2021-12-21 | 苏州驰声信息科技有限公司 | Automatic acoustic data labeling method and device |
CN111078937B (en) * | 2019-12-27 | 2021-08-10 | 北京世纪好未来教育科技有限公司 | Voice information retrieval method, device, equipment and computer readable storage medium |
CN111079446A (en) * | 2019-12-30 | 2020-04-28 | 北京讯鸟软件有限公司 | Voice data reconstruction method and device and electronic equipment |
CN111402861B (en) * | 2020-03-25 | 2022-11-15 | 思必驰科技股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN111862939B (en) * | 2020-05-25 | 2024-06-14 | 北京捷通华声科技股份有限公司 | Rhythm phrase labeling method and device |
CN112382310B (en) * | 2020-11-12 | 2022-09-27 | 北京猿力未来科技有限公司 | Human voice audio recording method and device |
CN112562649B (en) * | 2020-12-07 | 2024-01-30 | 北京大米科技有限公司 | Audio processing method and device, readable storage medium and electronic equipment |
CN112634874B (en) * | 2020-12-24 | 2022-09-23 | 江西台德智慧科技有限公司 | Automatic tuning terminal equipment based on artificial intelligence |
CN113571045B (en) * | 2021-06-02 | 2024-03-12 | 北京它思智能科技有限公司 | Method, system, equipment and medium for identifying Minnan language voice |
CN113327579A (en) * | 2021-08-03 | 2021-08-31 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN115938351B (en) * | 2021-09-13 | 2023-08-15 | 北京数美时代科技有限公司 | ASR language model construction method, system, storage medium and electronic equipment |
CN114299978A (en) * | 2021-12-07 | 2022-04-08 | 阿里巴巴(中国)有限公司 | Audio signal processing method, device, equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW200421263A (en) * | 2003-04-10 | 2004-10-16 | Delta Electronics Inc | Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme |
CN103559879A (en) * | 2013-11-08 | 2014-02-05 | Anhui USTC iFlytek Co., Ltd. | Method and device for extracting acoustic features in language identification system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003060877A1 (en) * | 2002-01-17 | 2003-07-24 | Siemens Aktiengesellschaft | Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer |
US8190420B2 (en) * | 2009-08-04 | 2012-05-29 | Autonomy Corporation Ltd. | Automatic spoken language identification based on phoneme sequence patterns |
2015-02-28 | CN | CN201510091609.9A | granted as CN104681036B | Active |
Non-Patent Citations (2)
Title |
---|
i-vector representation based on bottleneck features for language identification; Yan Song et al.; Electronics Letters; 2013-11-21; Vol. 49, No. 24; full text *
Factor analysis in phoneme-recognition-based language identification methods; Zhong Haibing et al.; Pattern Recognition and Artificial Intelligence; 2012-02-29; Vol. 25, No. 1; full text *
Also Published As
Publication number | Publication date |
---|---|
CN104681036A (en) | 2015-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104681036B (en) | A kind of detecting system and method for language audio | |
AU2019395322B2 (en) | Reconciliation between simulated data and speech recognition output using sequence-to-sequence mapping | |
CN105336322B (en) | Polyphone model training method, and speech synthesis method and device | |
JP6189970B2 (en) | Combination of auditory attention cue and phoneme posterior probability score for sound / vowel / syllable boundary detection | |
CN104575490B (en) | Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm | |
CN104036774B (en) | Tibetan dialect recognition methods and system | |
CN106531157B (en) | Regularization accent adaptive approach in speech recognition | |
CN105632501A (en) | Deep-learning-technology-based automatic accent classification method and apparatus | |
JP5752060B2 (en) | Information processing apparatus, large vocabulary continuous speech recognition method and program | |
Ryant et al. | Highly accurate mandarin tone classification in the absence of pitch information | |
CN112420026A (en) | Optimized keyword retrieval system | |
US11935523B2 (en) | Detection of correctness of pronunciation | |
CN106653002A (en) | Literal live broadcasting method and platform | |
Hu et al. | A DNN-based acoustic modeling of tonal language and its application to Mandarin pronunciation training | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
Baljekar et al. | Using articulatory features and inferred phonological segments in zero resource speech processing. | |
Rabiee et al. | Persian accents identification using an adaptive neural network | |
US20140142925A1 (en) | Self-organizing unit recognition for speech and other data series | |
Joshi et al. | Vowel mispronunciation detection using DNN acoustic models with cross-lingual training. | |
Rasipuram et al. | Grapheme and multilingual posterior features for under-resourced speech recognition: a study on scottish gaelic | |
Cui et al. | Improving deep neural network acoustic modeling for audio corpus indexing under the iarpa babel program | |
Huang et al. | Multi-task learning deep neural networks for speech feature denoising. | |
Chen et al. | Multi-task learning in deep neural networks for Mandarin-English code-mixing speech recognition | |
Minh et al. | The system for detecting Vietnamese mispronunciation | |
Karanasou et al. | I-vector estimation using informative priors for adaptation of deep neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |