CN104681036B - A kind of detecting system and method for language audio - Google Patents
Publication number: CN104681036B (application CN201510091609.9A, China)
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a detection system and method for language audio, belonging to the technical field of speech signal processing. The system comprises an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module. By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the invention significantly improves the detection performance of the system. It is applicable to audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be extended quickly according to the types of non-target languages: only an acoustic model and a language model for the new language need to be provided, after which the classifier model is retrained. The system structure therefore has good flexibility and scalability.
Description
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a detection system and method for language audio.
Background technology
The real application environments of speech technology are usually very complex: the audio received by a system may contain many non-target sounds, such as speech in other languages, music, natural noise, and man-made noise. The presence of such audio can severely degrade the usability and user experience of speech technology. Detecting and filtering these audios efficiently by technical means is therefore highly desirable.
Among such techniques, the most typical are language identification and noise detection. Language identification exploits the phonetic information contained in speech (such as language-specific pronunciation units, or different distributions and combinations of pronunciation units) to determine the language category.
In the prior art, the most mature language identification technique is the phoneme-recognition-plus-phonemic-language-model approach. It assumes that the phoneme sequences produced by recognizers of different languages have distinct distributional and combinatorial regularities, and therefore performs language identification using the probability that the phoneme sequence output by each language's recognizer is generated under the phonemic language model of each language. This technique has good accuracy and generality, but its performance degrades sharply on short utterances, which limits its applicability.
Summary of the invention
To solve the problems in the prior art, embodiments of the present invention provide a detection system and method for language audio. The technical solution is as follows:
In one aspect, a detection system for language audio is provided. The system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module;
Wherein,
The acoustic feature extraction module extracts the acoustic features of the input speech signal; the acoustic features include at least the fundamental frequency (F0) feature of the input audio.
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language. The recognizers correspond to different languages and decode the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language; these include at least the best phoneme sequence and time boundaries of the target language.
The acoustic confidence computation module, given each language's best phoneme sequence and time boundaries, computes the posterior probability of each language's phoneme sequence on a DNN model as the acoustic confidence of that phoneme sequence, yielding the acoustic confidence of each language's phoneme sequence.
The language confidence computation module, given each language's best phoneme sequence and time boundaries, computes the generation probability of each language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of that phoneme sequence, yielding the language confidence of each language's phoneme sequence.
The prosodic feature extraction module computes the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language together with the F0 feature of the input audio.
The discriminative classification module uses a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
Optionally, each recognizer in the group uses the acoustic model and language model of its own language; the acoustic model must be trained in advance on speech data of the corresponding language, and the language model must be trained in advance on text data of the corresponding language.
Optionally, the pre-trained classifier must be trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio include: the sentence-level F0 maximum, the sentence-level F0 minimum, the variance of the sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum of the phoneme-level F0 variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme durations in the sentence.
Optionally, the discriminative classification module is further configured to assemble the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feed it to the pre-trained classifier for prediction, and compute the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
In another aspect, a detection method for language audio is provided. The method comprises:
extracting the acoustic features of the input speech signal, the acoustic features including at least the fundamental frequency (F0) feature of the input audio;
decoding the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language, these including at least the best phoneme sequence and time boundaries of the target language;
from each language's best phoneme sequence and time boundaries, computing the posterior probability of that language's phoneme sequence on a DNN model as the acoustic confidence of the phoneme sequence, thereby obtaining the acoustic confidence of each language's phoneme sequence;
from each language's best phoneme sequence and time boundaries, computing the generation probability of that language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of the phoneme sequence, thereby obtaining the language confidence of each language's phoneme sequence;
computing the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language and the F0 feature of the input audio;
using a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
Optionally, the pre-trained classifier must be trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio include: the sentence-level F0 maximum, the sentence-level F0 minimum, the variance of the sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum of the phoneme-level F0 variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme durations in the sentence.
Optionally, using the pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences, language confidences, and prosodic features comprises:
assembling the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feeding it to the pre-trained classifier for prediction, and computing the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise it is determined to be non-target-language audio.
The technical solutions provided by the embodiments of the present invention bring the following advantageous effects:
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the provided method significantly improves the detection performance of the system. It is applicable to audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be extended quickly according to the types of non-target languages: only an acoustic model and a language model for the new language need to be provided, after which the classifier model is retrained. The system structure therefore has good flexibility and scalability.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description are merely some embodiments of the present invention; a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a structural diagram of the language audio detection system provided in an embodiment of the present invention;
Fig. 2 is a flowchart of the language audio detection method provided in an embodiment of the present invention.
Specific implementation mode
To make the objectives, technical solutions, and advantages of the present invention clearer, the implementation modes of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 shows the structure of the language audio detection system provided in an embodiment of the present invention. Referring to Fig. 1, the system comprises an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module. Wherein,
the acoustic feature extraction module extracts the acoustic features of the input speech signal; these include at least the F0 feature of the input audio.
The acoustic features may include PLP (perceptual linear prediction) features, MFCC (mel-frequency cepstral coefficient) features, filterbank (fbank) features, and the like.
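The embodiment does not fix a particular F0 extractor. As a minimal illustration of extracting the required fundamental frequency feature, a per-frame autocorrelation estimate could look like the sketch below; the function name, voicing threshold, and search range are assumptions of this sketch, not specified by the patent:

```python
import numpy as np

def f0_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate F0 of one frame by autocorrelation; returns 0.0 for unvoiced frames."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    if ac[0] <= 0:                       # silent frame: no energy
        return 0.0
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    # Crude voicing decision: the peak must carry a sizeable share of the energy.
    return sr / lag if ac[lag] / ac[0] > 0.3 else 0.0
```

Applied frame by frame, this yields the F0 track (with 0.0 on unvoiced frames) that the prosodic feature extraction below consumes.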
The phoneme recognition module consists of a group of recognizers that includes at least a recognizer for the target language. The recognizers correspond to different languages and decode the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language, including at least the best phoneme sequence and time boundaries of the target language.
In the embodiments of the present invention, the phoneme recognition module is composed of a group of phoneme recognizers, each corresponding to a different language. This group must include a speech recognizer for the target language. Each recognizer uses the acoustic model and phonemic language model of its own language. The module outputs a group of phoneme sequences together with their corresponding time boundaries and internal state sequences. Optionally, the module may contain only one phoneme recognizer, corresponding to the target language; this reduces the computational load of the system at the cost of a limited drop in detection performance. Optionally, the module may contain several groups of phoneme recognizers for non-target languages, one for each language likely to be encountered in the actual application environment, or recognizers may be built only for selected representative languages.
Optionally, each recognizer in the group uses the acoustic model and language model of its own language; the acoustic model must be trained in advance on speech data of the corresponding language, and the language model on text data of the corresponding language.
Optionally, the system uses acoustic models and language models of identical structure throughout. Typically, the acoustic models are DNN/HMM models with phonemes as the uniform acoustic modeling unit, and the language models are n-gram statistical language models over phonemes. In a preferred embodiment of the invention, decoding uses a 3-gram phonemic language model.
The acoustic confidence computation module, given each language's best phoneme sequence and time boundaries, computes the posterior probability of each language's phoneme sequence on a DNN (Deep Neural Network) model as the acoustic confidence of that phoneme sequence, yielding the acoustic confidence of each language's phoneme sequence.
There are many common confidence computation methods, including feature-based confidence measures and confidence measures based on N-best lists or lattices. The scheme used in the embodiments of the present invention is the phoneme-level mean of acoustic posteriors from a DNN model.
Optionally, the acoustic confidence is computed as:

C_a(s) = (1/n) * Σ_{i=1..n} C_a(p_i),  with  C_a(p_i) = (1/m) * Σ_{j=1..m} P(s_j | o_j)

where C_a(s) is the acoustic confidence of sentence s, C_a(p_i) is the acoustic confidence of the i-th phoneme p_i in the sentence, n is the number of phonemes in sentence s, m is the number of feature frames contained in phoneme p_i, and P(s_j | o_j) is the posterior probability of phoneme p_i being in state s_j given the j-th acoustic observation o_j.
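The phoneme-level posterior mean described here reduces to two nested averages. The sketch below illustrates the computation; the function names are my own, and the frame-level DNN posteriors are taken as given inputs:

```python
import numpy as np

def phoneme_confidence(frame_posteriors):
    """C_a(p_i): mean of the frame-level state posteriors P(s_j | o_j)
    over the m frames aligned to one phoneme."""
    return float(np.mean(frame_posteriors))

def sentence_acoustic_confidence(phoneme_frames):
    """C_a(s): mean of the n phoneme-level confidences in the sentence."""
    return float(np.mean([phoneme_confidence(f) for f in phoneme_frames]))
```

Here `phoneme_frames` would be grouped using the time boundaries produced by the phoneme recognition module, one list of posteriors per aligned phoneme.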
The language confidence computation module, given each language's best phoneme sequence and time boundaries, computes the generation probability of each language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of that phoneme sequence, yielding the language confidence of each language's phoneme sequence.
In the embodiments of the present invention, this confidence is computed as follows: given the phoneme sequence output by the recognizer of language A, compute the generation probability of that sequence under a standard phonemic language model. This standard phonemic language model is different from the language model used in phoneme recognition, and is usually of higher order. Unless otherwise stated, "language model" herein refers to a statistical n-gram language model.
Optionally, the language confidence is computed as:

C_l(s) = P(p_1 p_2 ... p_n) = P(p_1) P(p_2 | p_1) P(p_3 | p_1 p_2) ... P(p_n | p_{n-k+1} ... p_{n-1})

where P(p_n | p_{n-k+1} ... p_{n-1}) is a k-gram phonemic language model probability, which can be estimated from a large amount of text data.
In a preferred embodiment of the invention, the language model used to compute the language confidence is a 4-gram phonemic language model.
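Under the chain rule above, a k-gram evaluation is a product of conditional probabilities, usually accumulated in the log domain to avoid underflow on long sequences. The sketch below assumes a simple LM lookup interface (`ngram_prob`), which is my own stand-in and is not specified by the patent:

```python
import math

def language_confidence(phonemes, ngram_prob, k=4):
    """Log of C_l(s) = P(p1) P(p2|p1) ... P(pn | p_{n-k+1}...p_{n-1}).
    `ngram_prob(context, phone)` is an assumed LM interface returning a
    conditional probability given up to k-1 previous phonemes."""
    log_prob = 0.0
    for i, phone in enumerate(phonemes):
        context = tuple(phonemes[max(0, i - k + 1):i])  # up to k-1 previous phones
        log_prob += math.log(ngram_prob(context, phone))
    return log_prob
```

With k=4 this matches the preferred 4-gram phonemic language model; a real deployment would back this interface with an n-gram model estimated from text data.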
The prosodic feature extraction module computes the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language together with the F0 feature of the input audio.
In the embodiments of the present invention, the prosodic features of the audio include: the sentence-level F0 maximum and minimum; the variance of the sentence-level F0; the mean and variance of the phoneme-level F0 variances in the sentence; the difference between the maximum and minimum of the phoneme-level F0 variances in the sentence; the proportion of voiced segments (segments with non-zero F0) in the sentence; the proportion of unvoiced phonemes (phonemes whose internal F0 values are all zero) in the sentence; the maximum and minimum phoneme durations; and the mean and variance of the phoneme durations.
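The twelve prosodic features listed above can be computed directly from the per-phoneme F0 tracks implied by the target-language time boundaries. The sketch below assumes F0 is given per frame with 0.0 marking unvoiced frames; the function name and data representation are assumptions of this sketch:

```python
import numpy as np

def prosodic_features(phoneme_f0):
    """12 sentence-level prosodic features from per-phoneme F0 tracks
    (a list of arrays, one per phoneme; 0.0 frames are unvoiced)."""
    all_f0 = np.concatenate(phoneme_f0)
    voiced = all_f0[all_f0 > 0]
    var_per_phone = np.array([np.var(f) for f in phoneme_f0])
    durs = np.array([len(f) for f in phoneme_f0], dtype=float)
    return np.array([
        voiced.max(), voiced.min(), voiced.var(),          # sentence-level F0 stats
        var_per_phone.mean(), var_per_phone.var(),         # phoneme-level F0-variance stats
        var_per_phone.max() - var_per_phone.min(),
        (all_f0 > 0).mean(),                               # voiced-segment proportion
        np.mean([np.all(f == 0) for f in phoneme_f0]),     # unvoiced-phoneme proportion
        durs.max(), durs.min(), durs.mean(), durs.var(),   # duration stats (in frames)
    ])
```

Durations are counted here in frames; converting to seconds via the frame shift would not change the classifier's view, since the scale is consistent across utterances.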
The discriminative classification module uses a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
The pre-trained classifier must be trained in advance on a large amount of collected and labeled data. Common classifiers include Bayes classifiers, k-nearest neighbors, support vector machines, decision trees, maximum entropy and conditional random field models, and neural networks. The present invention uses a support vector machine classifier.
In the embodiments of the present invention, the discriminative classification module is further configured to assemble the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feed it to the pre-trained classifier for prediction, and compute the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio. Specifically, for a given audio the classifier outputs a score corresponding to the posterior probability that the audio belongs to the target language. If this posterior probability exceeds the given threshold, the input audio is judged to be in the target language; otherwise it is determined to be non-target-language.
In a preferred embodiment of the invention, the classifier used for the target-language/non-target-language decision is a support vector machine with a radial basis function (RBF) kernel.
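The decision function of an RBF-kernel support vector machine over the super vector can be written down directly. The sketch below scores a super vector against trained parameters; the support vectors, dual coefficients, bias, gamma value, and threshold are placeholders, not values from the patent:

```python
import numpy as np

def rbf_svm_score(x, support_vecs, dual_coefs, bias, gamma=0.5):
    """SVM decision score: sum_i a_i * exp(-gamma * ||x - sv_i||^2) + b."""
    kernel = np.exp(-gamma * np.sum((support_vecs - x) ** 2, axis=1))
    return float(dual_coefs @ kernel + bias)

def is_target_language(super_vector, model, threshold=0.5):
    """Target language iff the classifier score exceeds the given threshold."""
    return rbf_svm_score(super_vector, *model) > threshold
```

In practice the parameters would come from training an SVM (e.g. with an off-the-shelf library) on the collected and labeled super vectors; the threshold trades off false alarms against misses.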
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the system provided by the embodiments of the present invention significantly improves detection performance. It is applicable to audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be extended quickly according to the types of non-target languages: only an acoustic model and a language model for the new language need to be provided, after which the classifier model is retrained. The system structure therefore has good flexibility and scalability.
Fig. 2 is a flowchart of the language audio detection method provided in an embodiment of the present invention. Referring to Fig. 2, the method includes:
201. Extract the acoustic features of the input speech signal; the acoustic features include at least the F0 feature of the input audio.
202. Decode the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language, including at least the best phoneme sequence and time boundaries of the target language.
203. From each language's best phoneme sequence and time boundaries, compute the posterior probability of that language's phoneme sequence on a DNN model as the acoustic confidence of the phoneme sequence, obtaining the acoustic confidence of each language's phoneme sequence.
204. From each language's best phoneme sequence and time boundaries, compute the generation probability of that language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of the phoneme sequence, obtaining the language confidence of each language's phoneme sequence.
205. Compute the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language and the F0 feature of the input audio.
206. Use a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
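Steps 201-206 can be tied together as a thin driver over the component modules. In the sketch below, every callable is an assumed interface standing in for a module of Fig. 1, not an API defined by the patent:

```python
def detect(signal, extract, recognizers, acoustic_conf, language_conf,
           prosody, classify):
    """End-to-end sketch of steps 201-206.
    `recognizers` maps language -> decoder and must contain key "target"."""
    feats, f0 = extract(signal)                                        # step 201
    decoded = {lang: rec(feats) for lang, rec in recognizers.items()}  # step 202
    ca = [acoustic_conf(*d) for d in decoded.values()]                 # step 203
    cl = [language_conf(*d) for d in decoded.values()]                 # step 204
    pros = prosody(*decoded["target"], f0)                             # step 205
    super_vector = ca + cl + list(pros)                                # step 206
    return classify(super_vector)
```

This also reflects the extensibility claim: adding a non-target language only adds an entry to `recognizers` (its acoustic and language models), after which the classifier is retrained on the longer super vector.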
Optionally, the pre-trained classifier must be trained in advance on a large amount of collected and labeled data.
Optionally, the prosodic features of the audio include: the sentence-level F0 maximum, the sentence-level F0 minimum, the variance of the sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum of the phoneme-level F0 variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme durations in the sentence.
Optionally, using the pre-trained classifier to classify the feature vector formed by the acoustic confidences, language confidences, and prosodic features of the input audio comprises:
assembling the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feeding it to the pre-trained classifier for prediction, and computing the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
By jointly exploiting acoustic confidence, language confidence, and prosodic feature information, the method provided by the embodiments of the present invention significantly improves detection performance. It is applicable to audio of different lengths with good detection stability, can handle a variety of non-target-language audio and noisy audio with good practicality, and can be extended quickly according to the types of non-target languages: only an acoustic model and a language model for the new language need to be provided, after which the classifier model is retrained. The system structure therefore has good flexibility and scalability.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (7)
1. A detection system for language audio, characterized in that the system comprises: an acoustic feature extraction module, a phoneme recognition module, an acoustic confidence computation module, a language confidence computation module, a prosodic feature extraction module, and a discriminative classification module;
wherein,
the acoustic feature extraction module extracts the acoustic features of the input speech signal, the acoustic features including at least the fundamental frequency (F0) feature of the input audio;
the phoneme recognition module consists of a group of recognizers including at least a recognizer for the target language, the recognizers corresponding to different languages and performing parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries of each language, including at least the best phoneme sequence and time boundaries of the target language;
the acoustic confidence computation module, given each language's best phoneme sequence and time boundaries, computes the posterior probability of each language's phoneme sequence on a deep neural network (DNN) model as the acoustic confidence of that phoneme sequence, yielding the acoustic confidence of each language's phoneme sequence;
the language confidence computation module, given each language's best phoneme sequence and time boundaries, computes the generation probability of each language's phoneme sequence under a higher-order language model of the corresponding language as the language confidence of that phoneme sequence, yielding the language confidence of each language's phoneme sequence;
the prosodic feature extraction module computes the prosodic features of the input audio from the best phoneme sequence and time boundaries of the target language and the F0 feature of the input audio;
the discriminative classification module uses a pre-trained classifier to perform target-language/non-target-language classification on the feature vector formed by the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
2. The system according to claim 1, characterized in that each recognizer in the group uses the acoustic model and language model of its own language; the acoustic model must be trained in advance on speech data of the corresponding language, and the language model must be trained in advance on text data of the corresponding language.
3. The system according to claim 1, characterized in that the prosodic features of the audio include: the sentence-level F0 maximum, the sentence-level F0 minimum, the variance of the sentence-level F0, the mean of the phoneme-level F0 variances, the variance of the phoneme-level F0 variances, the difference between the maximum and minimum of the phoneme-level F0 variances, the proportion of voiced segments in the sentence, the proportion of unvoiced phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean phoneme duration in the sentence, and the variance of phoneme durations in the sentence.
4. The system according to claim 1, characterized in that the discriminative classification module is further configured to assemble the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio into a super vector, feed it to the pre-trained classifier for prediction, and compute the score of the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, otherwise non-target-language audio.
5. A detection method for language audio, characterized in that the method comprises:
extracting acoustic features from an input speech signal, the acoustic features including at least the fundamental frequency feature of the input audio;
performing parallel speech recognition decoding on the acoustic features to obtain the best phoneme sequence and corresponding time boundaries for each of the different languages, including at least the best phoneme sequence and corresponding time boundaries of the target language;
according to the best phoneme sequences of the different languages and their corresponding time boundaries, computing for each language the posterior probability of its phoneme sequence on the DNN model as the acoustic confidence of that phoneme sequence, thereby obtaining the acoustic confidences of the phoneme sequences of the different languages;
according to the best phoneme sequences of the different languages and their corresponding time boundaries, computing for each language the generation probability of its phoneme sequence on the higher-order language model of the corresponding language as the language confidence of that phoneme sequence, thereby obtaining the language confidences of the phoneme sequences of the different languages;
computing the prosodic features of the input audio from the best phoneme sequence and corresponding time boundaries of the target language together with the fundamental frequency feature of the input audio;
using a pre-trained classifier to perform target-language/non-target-language classification on the feature vector composed of the acoustic confidences and language confidences of the phoneme sequences of the different languages and the prosodic features of the input audio.
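The two confidence scores in the method of claim 5 are typically length-normalised log probabilities: the acoustic confidence averages the DNN's frame-level posteriors over the decoded phonemes, and the language confidence averages the phoneme language model's log probability over the sequence. The sketch below is one common formulation, not the patent's exact computation; a bigram table stands in for the higher-order language model, and all names and data layouts are illustrative assumptions.

```python
import math

def acoustic_confidence(posteriors, phoneme_ids):
    """Average log posterior of the decoded phonemes on the DNN model:
    a per-frame normalised form of the posterior-based acoustic
    confidence. posteriors[t][p] is the DNN output probability for
    phoneme p at aligned frame t (illustrative layout)."""
    logs = [math.log(posteriors[t][p]) for t, p in enumerate(phoneme_ids)]
    return sum(logs) / len(logs)

def language_confidence(bigram_logprob, phoneme_seq):
    """Length-normalised generation log probability of the phoneme
    sequence under a phoneme language model of the corresponding
    language (a bigram table here; the patent uses a higher-order
    model)."""
    lp = sum(bigram_logprob[(a, b)] for a, b in zip(phoneme_seq, phoneme_seq[1:]))
    return lp / max(len(phoneme_seq) - 1, 1)
```

Because each language's decoder uses its own phoneme inventory and language model, target-language speech tends to score well under the target models and poorly under the others, which is the signal the final classifier exploits.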
6. The method according to claim 5, characterized in that the prosodic features of the audio include: the sentence-level fundamental frequency maximum, the sentence-level fundamental frequency minimum, the variance of the sentence-level fundamental frequency, the mean of the phoneme-level fundamental frequency variances, the variance of the phoneme-level fundamental frequency variances, the difference between the maximum and the minimum of the phoneme-level fundamental frequency variances, the proportion of voiced segments in the sentence, the proportion of silent phonemes in the sentence, the maximum phoneme duration in the sentence, the minimum phoneme duration in the sentence, the mean of the phoneme durations in the sentence, and the variance of the phoneme durations in the sentence.
7. The method according to claim 5, characterized in that using the pre-trained classifier to perform target-language/non-target-language classification on the feature vector composed of the acoustic confidences of the phoneme sequences of the different languages, the language confidences, and the prosodic features of the input audio comprises:
combining the acoustic confidences of the phoneme sequences of the different languages, the language confidences, and the prosodic features of the input audio into a single super vector, feeding it into the pre-trained classifier for prediction, and computing a score for the super vector; if the score exceeds a given threshold, the input audio is determined to be target-language audio, and otherwise it is determined to be non-target-language audio.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510091609.9A CN104681036B (en) | 2014-11-20 | 2015-02-28 | A kind of detecting system and method for language audio |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2014106682358 | 2014-11-20 | ||
CN201410668235 | 2014-11-20 | ||
CN201510091609.9A CN104681036B (en) | 2014-11-20 | 2015-02-28 | A kind of detecting system and method for language audio |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104681036A CN104681036A (en) | 2015-06-03 |
CN104681036B true CN104681036B (en) | 2018-09-25 |
Family
ID=53315987
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510091609.9A Active CN104681036B (en) | 2014-11-20 | 2015-02-28 | A kind of detecting system and method for language audio |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104681036B (en) |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102413692B1 (en) * | 2015-07-24 | 2022-06-27 | Samsung Electronics Co., Ltd. | Apparatus and method for caculating acoustic score for speech recognition, speech recognition apparatus and method, and electronic device |
CN105427858B (en) * | 2015-11-06 | 2019-09-03 | 科大讯飞股份有限公司 | Realize the method and system that voice is classified automatically |
CN106940998B (en) * | 2015-12-31 | 2021-04-16 | 阿里巴巴集团控股有限公司 | Execution method and device for setting operation |
CN107045875B (en) * | 2016-02-03 | 2019-12-06 | 重庆工商职业学院 | fundamental tone frequency detection method based on genetic algorithm |
CN105810191B (en) * | 2016-03-08 | 2019-11-29 | 江苏信息职业技术学院 | Merge the Chinese dialects identification method of prosodic information |
CN106297828B (en) * | 2016-08-12 | 2020-03-24 | 苏州驰声信息科技有限公司 | Detection method and device for false sounding detection based on deep learning |
CN106847273B (en) * | 2016-12-23 | 2020-05-05 | 北京云知声信息技术有限公司 | Awakening word selection method and device for voice recognition |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | A kind of sound end detecting method and audio recognition method |
CN109754789B (en) * | 2017-11-07 | 2021-06-08 | 北京国双科技有限公司 | Method and device for recognizing voice phonemes |
CN110085216A (en) * | 2018-01-23 | 2019-08-02 | 中国科学院声学研究所 | A kind of vagitus detection method and device |
CN108389573B (en) * | 2018-02-09 | 2022-03-08 | 北京世纪好未来教育科技有限公司 | Language identification method and device, training method and device, medium and terminal |
CN109493846B (en) * | 2018-11-18 | 2021-06-08 | 深圳市声希科技有限公司 | English accent recognition system |
CN109613526A (en) * | 2018-12-10 | 2019-04-12 | 航天南湖电子信息技术股份有限公司 | A kind of point mark filter method based on support vector machines |
CN111369978B (en) * | 2018-12-26 | 2024-05-17 | 北京搜狗科技发展有限公司 | Data processing method and device for data processing |
CN111583906B (en) * | 2019-02-18 | 2023-08-15 | 中国移动通信有限公司研究院 | Role recognition method, device and terminal for voice session |
CN109817213B (en) * | 2019-03-11 | 2024-01-23 | 腾讯科技(深圳)有限公司 | Method, device and equipment for performing voice recognition on self-adaptive language |
CN110176251B (en) * | 2019-04-03 | 2021-12-21 | 苏州驰声信息科技有限公司 | Automatic acoustic data labeling method and device |
CN111078937B (en) * | 2019-12-27 | 2021-08-10 | 北京世纪好未来教育科技有限公司 | Voice information retrieval method, device, equipment and computer readable storage medium |
CN111079446A (en) * | 2019-12-30 | 2020-04-28 | 北京讯鸟软件有限公司 | Voice data reconstruction method and device and electronic equipment |
CN111402861B (en) * | 2020-03-25 | 2022-11-15 | 思必驰科技股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN111862939B (en) * | 2020-05-25 | 2024-06-14 | 北京捷通华声科技股份有限公司 | Rhythm phrase labeling method and device |
CN112382310B (en) * | 2020-11-12 | 2022-09-27 | 北京猿力未来科技有限公司 | Human voice audio recording method and device |
CN112562649B (en) * | 2020-12-07 | 2024-01-30 | 北京大米科技有限公司 | Audio processing method and device, readable storage medium and electronic equipment |
CN112634874B (en) * | 2020-12-24 | 2022-09-23 | 江西台德智慧科技有限公司 | Automatic tuning terminal equipment based on artificial intelligence |
CN113571045B (en) * | 2021-06-02 | 2024-03-12 | 北京它思智能科技有限公司 | Method, system, equipment and medium for identifying Minnan language voice |
CN113327579A (en) * | 2021-08-03 | 2021-08-31 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN115938351B (en) * | 2021-09-13 | 2023-08-15 | 北京数美时代科技有限公司 | ASR language model construction method, system, storage medium and electronic equipment |
CN114299978A (en) * | 2021-12-07 | 2022-04-08 | 阿里巴巴(中国)有限公司 | Audio signal processing method, device, equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW200421263A (en) * | 2003-04-10 | 2004-10-16 | Delta Electronics Inc | Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme |
CN103559879A (en) * | 2013-11-08 | 2014-02-05 | Anhui USTC iFlytek Co., Ltd. | Method and device for extracting acoustic features in language identification system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003060877A1 (en) * | 2002-01-17 | 2003-07-24 | Siemens Aktiengesellschaft | Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer |
US8190420B2 (en) * | 2009-08-04 | 2012-05-29 | Autonomy Corporation Ltd. | Automatic spoken language identification based on phoneme sequence patterns |
2015-02-28 | CN | CN201510091609.9A | granted as CN104681036B | Active |
Non-Patent Citations (2)
Title |
---|
i-vector representation based on bottleneck features for language identification; Yan Song et al.; Electronics Letters; 2013-11-21; Vol. 49, No. 24; full text *
Factor analysis in phoneme-recognition-based language identification methods; Zhong Haibing et al.; Pattern Recognition and Artificial Intelligence; 2012-02-29; Vol. 25, No. 1; full text *
Also Published As
Publication number | Publication date |
---|---|
CN104681036A (en) | 2015-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104681036B (en) | A kind of detecting system and method for language audio | |
AU2019395322B2 (en) | Reconciliation between simulated data and speech recognition output using sequence-to-sequence mapping | |
CN105336322B (en) | Polyphone model training method, and speech synthesis method and device | |
JP6189970B2 (en) | Combination of auditory attention cue and phoneme posterior probability score for sound / vowel / syllable boundary detection | |
CN104575490B (en) | Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm | |
CN104036774B (en) | Tibetan dialect recognition methods and system | |
CN106531157B (en) | Regularization accent adaptive approach in speech recognition | |
CN105632501A (en) | Deep-learning-technology-based automatic accent classification method and apparatus | |
JP5752060B2 (en) | Information processing apparatus, large vocabulary continuous speech recognition method and program | |
Ryant et al. | Highly accurate mandarin tone classification in the absence of pitch information | |
CN112420026A (en) | Optimized keyword retrieval system | |
US11935523B2 (en) | Detection of correctness of pronunciation | |
CN106653002A (en) | Literal live broadcasting method and platform | |
Hu et al. | A DNN-based acoustic modeling of tonal language and its application to Mandarin pronunciation training | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
Baljekar et al. | Using articulatory features and inferred phonological segments in zero resource speech processing. | |
Rabiee et al. | Persian accents identification using an adaptive neural network | |
US20140142925A1 (en) | Self-organizing unit recognition for speech and other data series | |
Joshi et al. | Vowel mispronunciation detection using DNN acoustic models with cross-lingual training. | |
Rasipuram et al. | Grapheme and multilingual posterior features for under-resourced speech recognition: a study on scottish gaelic | |
Cui et al. | Improving deep neural network acoustic modeling for audio corpus indexing under the iarpa babel program | |
Huang et al. | Multi-task learning deep neural networks for speech feature denoising. | |
Chen et al. | Multi-task learning in deep neural networks for Mandarin-English code-mixing speech recognition | |
Minh et al. | The system for detecting Vietnamese mispronunciation | |
Karanasou et al. | I-vector estimation using informative priors for adaptation of deep neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |