CN104681036B - Language audio detection system and method - Google Patents

Language audio detection system and method

Info

Publication number
CN104681036B
CN104681036B · CN201510091609.9A
Authority
CN
China
Prior art keywords
language
phoneme sequence
acoustic
sentence
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510091609.9A
Other languages
Chinese (zh)
Other versions
CN104681036A (en)
Inventor
王欢良
杨嵩
代大明
袁军峰
惠寅华
林远东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Chisheng Information Technology Co Ltd
Original Assignee
Suzhou Chisheng Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Chisheng Information Technology Co Ltd
Priority to CN201510091609.9A
Publication of CN104681036A
Application granted
Publication of CN104681036B
Legal status: Active

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a language audio detection system and method, belonging to the technical field of speech signal processing. The system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module. By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the invention significantly improves the detection performance of the system. It is applicable to audio detection of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be rapidly extended according to the type of non-target language: only the acoustic model and language model of the new language need to be provided, after which the classifier model is retrained. The system structure therefore has good flexibility and scalability.

Description

Language audio detection system and method
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a language audio detection system and method.
Background technology
The practical application environment of speech technology is usually extremely complex. The audio received by a system may contain many non-target-language sounds, such as speech in other languages, music, natural noise, and man-made noise. The presence of such audio can severely affect the availability and user experience of speech technology. It is therefore highly desirable to detect and filter these audios efficiently by technical means.
Among such technologies, the most typical are language identification and noise detection. Language identification judges the language category by exploiting the phonetic information contained in speech, such as language-specific pronunciation units and the distinct distributions or combinations of pronunciation units.
In the prior art, the most mature language identification technique is the phoneme language model approach based on phoneme recognition. It assumes that the phoneme sequences produced by recognizers of different languages follow distinct distribution and combination regularities, and therefore performs language identification using the distribution probabilities of the phoneme sequences output by the different languages' recognizers under the phoneme language models of the different languages. This technique has good accuracy and generality, but its performance drops sharply on short utterances, which is a notable limitation.
Summary of the invention
In order to solve the problems in the prior art, embodiments of the present invention provide a language audio detection system and method. The technical solutions are as follows:
In one aspect, a language audio detection system is provided. The system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module;
Wherein,
the acoustic feature extraction module is configured to extract acoustic features of the input speech signal, the acoustic features including at least the fundamental frequency feature of the input audio;
the phoneme recognition module consists of a group of recognizers including at least a recognizer of the target language, the recognizers corresponding to different languages respectively, and is configured to decode the acoustic features to obtain the best phoneme sequences and corresponding time boundaries of the different languages, which include at least the best phoneme sequence and corresponding time boundaries of the target language;
the acoustic confidence computation module is configured to compute, from the best phoneme sequences and corresponding time boundaries of the different languages, the posterior probability of each language's phoneme sequence on a DNN model as the acoustic confidence of that phoneme sequence, thereby obtaining the acoustic confidences of the different languages' phoneme sequences;
the language confidence computation module is configured to compute, from the best phoneme sequences and corresponding time boundaries of the different languages, the generation probability of each language's phoneme sequence on a higher-order language model of the corresponding language as the language confidence of that phoneme sequence, thereby obtaining the language confidences of the different languages' phoneme sequences;
the prosodic feature extraction module is configured to compute prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental frequency feature of the input audio;
the discriminative classification module is configured to use a pre-trained classifier to perform target-language/non-target-language classification on the feature vector composed of the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio.
Optionally, each recognizer in the group uses the acoustic model and language model of its corresponding language; the acoustic model is trained in advance on speech data of the corresponding language, and the language model is trained in advance on text data of the corresponding language.
Optionally, the pre-trained classifier is trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio include: the sentence-level fundamental frequency maximum, the sentence-level fundamental frequency minimum, the variance of the sentence-level fundamental frequency, the mean of the phoneme-level fundamental frequency variances, the variance of the phoneme-level fundamental frequency variances, the difference between the maximum and minimum of the phoneme-level fundamental frequency variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean of the phoneme durations in the sentence, and the variance of the phoneme durations in the sentence.
Optionally, the discriminative classification module is further configured to combine the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio into a super-vector, feed it into the pre-trained classifier for predictive classification, and compute the score of the super-vector; if the score exceeds a given threshold, the input language audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
In another aspect, a language audio detection method is provided. The method includes:
extracting acoustic features of the input speech signal, the acoustic features including at least the fundamental frequency feature of the input audio;
decoding the acoustic features to obtain the best phoneme sequences and corresponding time boundaries of the different languages, which include at least the best phoneme sequence and corresponding time boundaries of the target language;
computing, from the best phoneme sequences and corresponding time boundaries of the different languages, the posterior probability of each language's phoneme sequence on a DNN model as the acoustic confidence of that phoneme sequence, thereby obtaining the acoustic confidences of the different languages' phoneme sequences;
computing, from the best phoneme sequences and corresponding time boundaries of the different languages, the generation probability of each language's phoneme sequence on the higher-order language model of the corresponding language as the language confidence of that phoneme sequence, thereby obtaining the language confidences of the different languages' phoneme sequences;
computing prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental frequency feature of the input audio;
performing target-language/non-target-language classification, using a pre-trained classifier, on the feature vector composed of the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio.
Optionally, the pre-trained classifier is trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio include: the sentence-level fundamental frequency maximum, the sentence-level fundamental frequency minimum, the variance of the sentence-level fundamental frequency, the mean of the phoneme-level fundamental frequency variances, the variance of the phoneme-level fundamental frequency variances, the difference between the maximum and minimum of the phoneme-level fundamental frequency variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean of the phoneme durations in the sentence, and the variance of the phoneme durations in the sentence.
Optionally, performing target-language/non-target-language classification, using the pre-trained classifier, on the feature vector composed of the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio includes:
combining the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio into a super-vector, feeding it into the pre-trained classifier for predictive classification, and computing the score of the super-vector; if the score exceeds a given threshold, determining the input language audio to be target-language audio, otherwise determining it to be non-target-language audio.
The advantageous effects brought by the technical solutions provided in the embodiments of the present invention are as follows:
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the method provided by the present invention significantly improves the detection performance of the system. It is applicable to audio detection of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be rapidly extended according to the type of non-target language: only the acoustic model and language model of the new language need to be provided, after which the classifier model is retrained. The method therefore has good structural flexibility and scalability.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description are only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a structural diagram of the language audio detection system provided in an embodiment of the present invention;
Fig. 2 is a flowchart of the language audio detection method provided in an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 shows the structure of the language audio detection system provided in an embodiment of the present invention. Referring to Fig. 1, the system includes: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module. Wherein,
the acoustic feature extraction module is configured to extract acoustic features of the input speech signal, the acoustic features including at least the fundamental frequency feature of the input audio;
The acoustic features may include PLP (perceptual linear prediction) features, MFCC (mel-frequency cepstral coefficient) features, filter-bank (fbank) features, and the like.
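As an illustration of this step, the sketch below extracts MFCCs and the frame-level fundamental frequency; librosa, the 16 kHz sampling rate, and the window settings are assumptions for the example, since the patent names neither a toolkit nor parameters.

```python
# A minimal feature-extraction sketch, assuming librosa (not named in the patent).
import librosa
import numpy as np

def extract_acoustic_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # 13-dimensional MFCCs over 25 ms windows with a 10 ms shift (a common ASR setup).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    # Frame-level fundamental frequency (F0) via pYIN; unvoiced frames come back as NaN.
    f0, voiced_flag, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)
    f0 = np.nan_to_num(f0)  # zero on unvoiced frames, as the prosody module below expects
    return mfcc.T, f0
```

PLP or fbank features could be substituted for the MFCCs without changing the rest of the pipeline.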
The phoneme recognition module consists of a group of recognizers including at least a recognizer of the target language, the recognizers corresponding to different languages respectively, and is configured to decode the acoustic features to obtain the best phoneme sequences and corresponding time boundaries of the different languages, which include at least the best phoneme sequence and corresponding time boundaries of the target language;
In the embodiments of the present invention, the phoneme recognition module consists of a group of phoneme recognizers, each corresponding to a different language. This group must include a speech recognizer for the target language. Each recognizer uses the acoustic model and phoneme language model of its corresponding language. The output of this module is a group of phoneme sequences together with their corresponding time boundaries and internal state sequences. Optionally, the module may contain only one phoneme recognizer, corresponding to the target language; this reduces the computational load of the system with only a limited loss of detection performance. Optionally, the module may include groups of phoneme recognizers for multiple non-target languages, corresponding to the languages likely to be encountered in the actual application environment, or speech recognizers may be built only for representative languages selected among them.
Optionally, each recognizer in the group uses the acoustic model and language model of its corresponding language; the acoustic model is trained in advance on speech data of the corresponding language, and the language model is trained in advance on text data of the corresponding language.
Optionally, the group of recognizers uniformly uses acoustic models and language models of the same structure. Typically, the acoustic models are DNN/HMM models with phonemes as the uniform acoustic modeling unit, and the language models are n-gram statistical language models over phonemes. In a preferred embodiment of the invention, the n-gram language model used for decoding is a 3-gram phoneme language model.
The acoustic confidence computation module is configured to compute, from the best phoneme sequences and corresponding time boundaries of the different languages, the posterior probability of each language's phoneme sequence on a DNN (deep neural network) model as the acoustic confidence of that phoneme sequence, thereby obtaining the acoustic confidences of the different languages' phoneme sequences;
There are many common confidence computation methods, including feature-based confidence techniques and confidence techniques based on N-best lists or lattices. The confidence computation scheme used in the embodiments of the present invention is the mean of phoneme-level acoustic posteriors based on a DNN model.
Optionally, the acoustic confidence is computed as:

C_a(s) = (1/n) Σ_{i=1}^{n} C_a(p_i), with C_a(p_i) = (1/m) Σ_{j=1}^{m} P(s_j | o_j)

where C_a(s) is the acoustic confidence of sentence s, C_a(p_i) is the acoustic confidence of the i-th phoneme p_i in the sentence, n is the number of phonemes in sentence s, m is the number of feature frames contained in phoneme p_i, and P(s_j | o_j) is the posterior probability of phoneme p_i being in state s_j given the j-th acoustic observation o_j.
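For illustration, a minimal sketch of this posterior averaging follows; the frame posteriors P(s_j | o_j) are assumed to come from forced alignment of the decoded phoneme sequence against the DNN model, and the array representation is hypothetical.

```python
import numpy as np

def acoustic_confidence(frame_post, phone_bounds):
    # frame_post[j] is assumed to hold P(s_j | o_j), the DNN posterior of the
    # aligned state at frame j; phone_bounds lists each phoneme's (start, end) frames.
    # C_a(p_i): mean frame posterior within phoneme i's span (m frames).
    phone_conf = [float(np.mean(frame_post[b:e])) for b, e in phone_bounds]
    # C_a(s): mean over the n phonemes of the sentence.
    return float(np.mean(phone_conf))
```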
The language confidence computation module is configured to compute, from the best phoneme sequences and corresponding time boundaries of the different languages, the generation probability of each language's phoneme sequence on a higher-order language model of the corresponding language as the language confidence of that phoneme sequence, thereby obtaining the language confidences of the different languages' phoneme sequences;
In the embodiments of the present invention, this confidence is computed as follows: given the phoneme sequence output by the recognizer of language A, the generation probability of that sequence under a standard phoneme language model is computed. The standard phoneme language model is different from the language model used for phoneme recognition, and is usually of higher order. Unless otherwise stated herein, "language model" refers to a statistics-based n-gram language model.
Optionally, the language confidence is computed as:

C_l(s) = P(p_1 p_2 ... p_n) = P(p_1) P(p_2 | p_1) P(p_3 | p_1 p_2) ... P(p_n | p_{n-k+1} ... p_{n-1})

where P(p_n | p_{n-k+1} ... p_{n-1}) is a k-gram phoneme language model probability, which can be estimated from a large amount of text data.
In a preferred embodiment of the invention, the language model used for computing the language confidence is a 4-gram phoneme language model.
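A sketch of the chain-rule product above is given below; storing the k-gram model as a plain dictionary and flooring unseen k-grams are simplifications for the example (a real phoneme language model would use smoothing and back-off).

```python
import math

def language_confidence(phones, lm, k=4):
    # Log generation probability of a phoneme sequence under a k-gram phoneme LM.
    # lm is assumed to map (history_tuple, phone) -> probability.
    logp = 0.0
    for i, p in enumerate(phones):
        history = tuple(phones[max(0, i - (k - 1)):i])  # up to k-1 phones of context
        logp += math.log(lm.get((history, p), 1e-10))   # floor unseen k-grams
    return logp
```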
The prosodic feature extraction module is configured to compute prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental frequency feature of the input audio;
In the embodiments of the present invention, the prosodic features of the audio include: the sentence-level fundamental frequency maximum and minimum; the variance of the sentence-level fundamental frequency; the mean and variance of the phoneme-level fundamental frequency variances within the sentence; the difference between the maximum and minimum of the phoneme-level fundamental frequency variances within the sentence; the proportion of voiced segments (segments with non-zero fundamental frequency) in the sentence; the proportion of unvoiced phonemes (phonemes whose internal fundamental frequency values are all zero) in the sentence; the maximum and minimum phoneme durations; and the mean and variance of the phoneme durations.
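The sketch below assembles the twelve features just enumerated; it assumes a frame-level F0 array that is zero on unvoiced frames (as defined above), phoneme boundaries expressed in frames, and at least one voiced frame in the sentence.

```python
import numpy as np

def prosodic_features(f0, phone_bounds):
    # f0: frame-level fundamental frequency, zero on unvoiced frames.
    # phone_bounds: (start_frame, end_frame) of each target-language phoneme.
    voiced = f0[f0 > 0]
    phone_f0_var = np.array([np.var(f0[b:e]) for b, e in phone_bounds])
    durations = np.array([e - b for b, e in phone_bounds], dtype=float)
    n_unvoiced = sum(1 for b, e in phone_bounds if not np.any(f0[b:e] > 0))
    return np.array([
        voiced.max(), voiced.min(), voiced.var(),   # sentence-level F0 max/min/variance
        phone_f0_var.mean(), phone_f0_var.var(),    # mean/variance of phoneme-level F0 variances
        phone_f0_var.max() - phone_f0_var.min(),    # range of phoneme-level F0 variances
        len(voiced) / len(f0),                      # proportion of voiced frames in the sentence
        n_unvoiced / len(phone_bounds),             # proportion of unvoiced phonemes
        durations.max(), durations.min(),           # max/min phoneme duration
        durations.mean(), durations.var(),          # mean/variance of phoneme durations
    ])
```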
The discriminative classification module is configured to use a pre-trained classifier to perform target-language/non-target-language classification on the feature vector composed of the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio.
The pre-trained classifier needs to be trained in advance on a large amount of collected and labeled data. Common classifiers include Bayesian classifiers, k-nearest neighbors, support vector machines, decision trees, maximum entropy models, conditional random fields, and neural networks. The present invention uses a support vector machine classifier.
In the embodiments of the present invention, the discriminative classification module is further configured to combine the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio into a super-vector, feed it into the pre-trained classifier for predictive classification, and compute the score of the super-vector; if the score exceeds a given threshold, the input language audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio. Here, the classifier is required to output as the score, for a given audio input, the posterior probability that the audio belongs to the target language. If this posterior probability exceeds the given threshold, the input audio is judged to be the target language; otherwise it is judged to be a non-target language.
In a preferred embodiment of the invention, the classifier performing the target-language/non-target-language decision is a support vector machine with a radial basis function kernel.
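As a sketch of this final stage, the example below uses scikit-learn's RBF-kernel support vector machine; the library choice and the 0.5 default threshold are assumptions, as the patent specifies only a radial-basis-kernel SVM and a given threshold.

```python
import numpy as np
from sklearn.svm import SVC

def train_classifier(X, y):
    # X: super-vectors (per-language acoustic and language confidences plus
    # the 12 prosodic features); y: 1 = target language, 0 = non-target.
    # probability=True enables the posterior scores the decision rule needs.
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X, y)
    return clf

def is_target_language(clf, super_vector, threshold=0.5):
    # Posterior probability that the audio belongs to the target language.
    posterior = clf.predict_proba(super_vector.reshape(1, -1))[0, 1]
    return posterior > threshold
```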
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the system provided in the embodiments of the present invention significantly improves detection performance. It is applicable to audio detection of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be rapidly extended according to the type of non-target language: only the acoustic model and language model of the new language need to be provided, after which the classifier model is retrained. The system therefore has good structural flexibility and scalability.
Fig. 2 is a flowchart of the language audio detection method provided in an embodiment of the present invention. Referring to Fig. 2, the method includes:
201: extracting acoustic features of the input speech signal, the acoustic features including at least the fundamental frequency feature of the input audio;
202: decoding the acoustic features to obtain the best phoneme sequences and corresponding time boundaries of the different languages, which include at least the best phoneme sequence and corresponding time boundaries of the target language;
203: computing, from the best phoneme sequences and corresponding time boundaries of the different languages, the posterior probability of each language's phoneme sequence on a DNN model as the acoustic confidence of that phoneme sequence, thereby obtaining the acoustic confidences of the different languages' phoneme sequences;
204: computing, from the best phoneme sequences and corresponding time boundaries of the different languages, the generation probability of each language's phoneme sequence on the higher-order language model of the corresponding language as the language confidence of that phoneme sequence, thereby obtaining the language confidences of the different languages' phoneme sequences;
205: computing prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental frequency feature of the input audio;
206: performing target-language/non-target-language classification, using a pre-trained classifier, on the feature vector composed of the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio.
Optionally, the pre-trained classifier is trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio include: the sentence-level fundamental frequency maximum, the sentence-level fundamental frequency minimum, the variance of the sentence-level fundamental frequency, the mean of the phoneme-level fundamental frequency variances, the variance of the phoneme-level fundamental frequency variances, the difference between the maximum and minimum of the phoneme-level fundamental frequency variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean of the phoneme durations in the sentence, and the variance of the phoneme durations in the sentence.
Optionally, performing target-language/non-target-language classification, using the pre-trained classifier, on the feature vector composed of the acoustic confidences, language confidences, and prosodic features of the input audio includes:
combining the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio into a super-vector, feeding it into the pre-trained classifier for predictive classification, and computing the score of the super-vector; if the score exceeds a given threshold, determining the input language audio to be target-language audio, otherwise determining it to be non-target-language audio.
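Chaining steps 201 to 206 together, a hypothetical end-to-end driver built from the sketches above might look as follows; the per-language decode call is a stub standing in for the pre-trained phoneme recognizers, and the "target" key is an assumed naming convention.

```python
import numpy as np

def detect_language_audio(wav_path, recognizers, lms, clf, threshold=0.5):
    feats, f0 = extract_acoustic_features(wav_path)                # step 201
    vec, target_bounds = [], None
    for lang, rec in recognizers.items():                          # step 202: per-language decoding
        phones, bounds, frame_post = rec.decode(feats)             # assumed recognizer API
        vec.append(acoustic_confidence(frame_post, bounds))        # step 203
        vec.append(language_confidence(phones, lms[lang]))         # step 204
        if lang == "target":
            target_bounds = bounds                                 # keep target-language alignment
    vec = np.concatenate([vec, prosodic_features(f0, target_bounds)])  # step 205
    return is_target_language(clf, vec, threshold)                 # step 206
```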
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the method provided in the embodiments of the present invention significantly improves the detection performance of the system. It is applicable to audio detection of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be rapidly extended according to the type of non-target language: only the acoustic model and language model of the new language need to be provided, after which the classifier model is retrained. The method therefore has good structural flexibility and scalability.
A person of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or by instructing the relevant hardware through a program; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (7)

1. A language audio detection system, characterized in that the system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module; wherein,
the acoustic feature extraction module is configured to extract acoustic features of an input speech signal, the acoustic features including at least a fundamental frequency feature of the input audio;
the phoneme recognition module consists of a group of recognizers including at least a recognizer of the target language, the recognizers in the group corresponding to different languages respectively and configured to perform parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequences and corresponding time boundaries of the different languages, which include at least the best phoneme sequence and corresponding time boundaries of the target language;
the acoustic confidence computation module is configured to compute, from the best phoneme sequences and corresponding time boundaries of the different languages, the posterior probability of each language's phoneme sequence on a deep neural network (DNN) model as the acoustic confidence of that phoneme sequence, thereby obtaining the acoustic confidences of the different languages' phoneme sequences;
the language confidence computation module is configured to compute, from the best phoneme sequences and corresponding time boundaries of the different languages, the generation probability of each language's phoneme sequence on a higher-order language model of the corresponding language as the language confidence of that phoneme sequence, thereby obtaining the language confidences of the different languages' phoneme sequences;
the prosodic feature extraction module is configured to compute prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental frequency feature of the input audio;
the discriminative classification module is configured to use a pre-trained classifier to perform target-language/non-target-language classification on a feature vector composed of the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio.
2. The system according to claim 1, characterized in that each recognizer in the group uses the acoustic model and language model of its corresponding language, the acoustic model being trained in advance on speech data of the corresponding language, and the language model being trained in advance on text data of the corresponding language.
3. The system according to claim 1, characterized in that the prosodic features of the audio include: the sentence-level fundamental frequency maximum, the sentence-level fundamental frequency minimum, the variance of the sentence-level fundamental frequency, the mean of the phoneme-level fundamental frequency variances, the variance of the phoneme-level fundamental frequency variances, the difference between the maximum and minimum of the phoneme-level fundamental frequency variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean of the phoneme durations in the sentence, and the variance of the phoneme durations in the sentence.
4. The system according to claim 1, characterized in that the discriminative classification module is further configured to combine the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio into a super-vector, feed it into the pre-trained classifier for predictive classification, and compute the score of the super-vector; if the score exceeds a given threshold, the input language audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
5. A language audio detection method, characterized in that the method comprises:
extracting acoustic features of an input speech signal, the acoustic features including at least a fundamental frequency feature of the input audio;
performing parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequences and corresponding time boundaries of the different languages, which include at least the best phoneme sequence and corresponding time boundaries of the target language;
computing, from the best phoneme sequences and corresponding time boundaries of the different languages, the posterior probability of each language's phoneme sequence on a DNN model as the acoustic confidence of that phoneme sequence, thereby obtaining the acoustic confidences of the different languages' phoneme sequences;
computing, from the best phoneme sequences and corresponding time boundaries of the different languages, the generation probability of each language's phoneme sequence on the higher-order language model of the corresponding language as the language confidence of that phoneme sequence, thereby obtaining the language confidences of the different languages' phoneme sequences;
computing prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental frequency feature of the input audio;
performing target-language/non-target-language classification, using a pre-trained classifier, on a feature vector composed of the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio.
6. The method according to claim 5, characterized in that the prosodic features of the audio include: the sentence-level fundamental frequency maximum, the sentence-level fundamental frequency minimum, the variance of the sentence-level fundamental frequency, the mean of the phoneme-level fundamental frequency variances, the variance of the phoneme-level fundamental frequency variances, the difference between the maximum and minimum of the phoneme-level fundamental frequency variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean of the phoneme durations in the sentence, and the variance of the phoneme durations in the sentence.
7. The method according to claim 5, characterized in that performing target-language/non-target-language classification, using the pre-trained classifier, on the feature vector composed of the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio comprises:
combining the acoustic confidences and language confidences of the different languages' phoneme sequences and the prosodic features of the input audio into a super-vector, feeding it into the pre-trained classifier for predictive classification, and computing the score of the super-vector; if the score exceeds a given threshold, determining the input language audio to be target-language audio, otherwise determining it to be non-target-language audio.
CN201510091609.9A 2014-11-20 2015-02-28 Language audio detection system and method Active CN104681036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510091609.9A CN104681036B (en) 2014-11-20 2015-02-28 Language audio detection system and method

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN2014106682358 2014-11-20
CN201410668235 2014-11-20
CN201510091609.9A CN104681036B (en) 2014-11-20 2015-02-28 Language audio detection system and method

Publications (2)

Publication Number Publication Date
CN104681036A CN104681036A (en) 2015-06-03
CN104681036B true CN104681036B (en) 2018-09-25

Family

ID=53315987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510091609.9A Active CN104681036B (en) Language audio detection system and method

Country Status (1)

Country Link
CN (1) CN104681036B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102413692B1 (en) * 2015-07-24 2022-06-27 삼성전자주식회사 Apparatus and method for caculating acoustic score for speech recognition, speech recognition apparatus and method, and electronic device
CN105427858B * 2015-11-06 2019-09-03 科大讯飞股份有限公司 Method and system for automatic classification of speech
CN106940998B (en) * 2015-12-31 2021-04-16 阿里巴巴集团控股有限公司 Execution method and device for setting operation
CN107045875B (en) * 2016-02-03 2019-12-06 重庆工商职业学院 fundamental tone frequency detection method based on genetic algorithm
CN105810191B (en) * 2016-03-08 2019-11-29 江苏信息职业技术学院 Merge the Chinese dialects identification method of prosodic information
CN106297828B (en) * 2016-08-12 2020-03-24 苏州驰声信息科技有限公司 Detection method and device for false sounding detection based on deep learning
CN106847273B (en) * 2016-12-23 2020-05-05 北京云知声信息技术有限公司 Awakening word selection method and device for voice recognition
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 Voice endpoint detection method and speech recognition method
CN109754789B (en) * 2017-11-07 2021-06-08 北京国双科技有限公司 Method and device for recognizing voice phonemes
CN110085216A (en) * 2018-01-23 2019-08-02 中国科学院声学研究所 Infant cry detection method and device
CN108389573B (en) * 2018-02-09 2022-03-08 北京世纪好未来教育科技有限公司 Language identification method and device, training method and device, medium and terminal
CN109493846B (en) * 2018-11-18 2021-06-08 深圳市声希科技有限公司 English accent recognition system
CN109613526A (en) * 2018-12-10 2019-04-12 航天南湖电子信息技术股份有限公司 A kind of point mark filter method based on support vector machines
CN111369978B (en) * 2018-12-26 2024-05-17 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN111583906B (en) * 2019-02-18 2023-08-15 中国移动通信有限公司研究院 Role recognition method, device and terminal for voice session
CN109817213B (en) * 2019-03-11 2024-01-23 腾讯科技(深圳)有限公司 Method, device and equipment for performing voice recognition on self-adaptive language
CN110176251B (en) * 2019-04-03 2021-12-21 苏州驰声信息科技有限公司 Automatic acoustic data labeling method and device
CN111078937B (en) * 2019-12-27 2021-08-10 北京世纪好未来教育科技有限公司 Voice information retrieval method, device, equipment and computer readable storage medium
CN111079446A (en) * 2019-12-30 2020-04-28 北京讯鸟软件有限公司 Voice data reconstruction method and device and electronic equipment
CN111402861B (en) * 2020-03-25 2022-11-15 思必驰科技股份有限公司 Voice recognition method, device, equipment and storage medium
CN111862939B (en) * 2020-05-25 2024-06-14 北京捷通华声科技股份有限公司 Rhythm phrase labeling method and device
CN112382310B (en) * 2020-11-12 2022-09-27 北京猿力未来科技有限公司 Human voice audio recording method and device
CN112562649B (en) * 2020-12-07 2024-01-30 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112634874B (en) * 2020-12-24 2022-09-23 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN113571045B (en) * 2021-06-02 2024-03-12 北京它思智能科技有限公司 Method, system, equipment and medium for identifying Minnan language voice
CN113327579A (en) * 2021-08-03 2021-08-31 北京世纪好未来教育科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN115938351B (en) * 2021-09-13 2023-08-15 北京数美时代科技有限公司 ASR language model construction method, system, storage medium and electronic equipment
CN114299978A (en) * 2021-12-07 2022-04-08 阿里巴巴(中国)有限公司 Audio signal processing method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200421263A (en) * 2003-04-10 2004-10-16 Delta Electronics Inc Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme
CN103559879A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and device for extracting acoustic features in language identification system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003060877A1 (en) * 2002-01-17 2003-07-24 Siemens Aktiengesellschaft Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer
US8190420B2 (en) * 2009-08-04 2012-05-29 Autonomy Corporation Ltd. Automatic spoken language identification based on phoneme sequence patterns

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200421263A (en) * 2003-04-10 2004-10-16 Delta Electronics Inc Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme
CN103559879A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and device for extracting acoustic features in language identification system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
i-vector representation based on bottleneck features for language identification; Yan Song et al.; Electronics Letters; 2013-11-21; Vol. 49, No. 24; full text *
Factor analysis in phoneme-recognition-based language identification methods; Zhong Haibing et al.; Pattern Recognition and Artificial Intelligence; 2012-02-29; Vol. 25, No. 1; full text *

Also Published As

Publication number Publication date
CN104681036A (en) 2015-06-03

Similar Documents

Publication Publication Date Title
CN104681036B (en) Language audio detection system and method
AU2019395322B2 (en) Reconciliation between simulated data and speech recognition output using sequence-to-sequence mapping
CN105336322B (en) Polyphone model training method, and speech synthesis method and device
JP6189970B2 (en) Combination of auditory attention cue and phoneme posterior probability score for sound / vowel / syllable boundary detection
CN104575490B (en) Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
CN104036774B (en) Tibetan dialect recognition methods and system
CN106531157B (en) Regularization accent adaptive approach in speech recognition
CN105632501A (en) Deep-learning-technology-based automatic accent classification method and apparatus
JP5752060B2 (en) Information processing apparatus, large vocabulary continuous speech recognition method and program
Ryant et al. Highly accurate mandarin tone classification in the absence of pitch information
CN112420026A (en) Optimized keyword retrieval system
US11935523B2 (en) Detection of correctness of pronunciation
CN106653002A (en) Literal live broadcasting method and platform
Hu et al. A DNN-based acoustic modeling of tonal language and its application to Mandarin pronunciation training
CN106297769B (en) Distinctive feature extraction method applied to language identification
Baljekar et al. Using articulatory features and inferred phonological segments in zero resource speech processing.
Rabiee et al. Persian accents identification using an adaptive neural network
US20140142925A1 (en) Self-organizing unit recognition for speech and other data series
Joshi et al. Vowel mispronunciation detection using DNN acoustic models with cross-lingual training.
Rasipuram et al. Grapheme and multilingual posterior features for under-resourced speech recognition: a study on scottish gaelic
Cui et al. Improving deep neural network acoustic modeling for audio corpus indexing under the iarpa babel program
Huang et al. Multi-task learning deep neural networks for speech feature denoising.
Chen et al. Multi-task learning in deep neural networks for Mandarin-English code-mixing speech recognition
Minh et al. The system for detecting Vietnamese mispronunciation
Karanasou et al. I-vector estimation using informative priors for adaptation of deep neural networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant